A Rotating Roll-Call-based Adaptive Failure Detection and Recovery Protocol for Smart Home Environments

Ya-Wen Jong1, Chun-Feng Liao1, Li-Chen Fu1, and Ching-Yao Wang2

1 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
2 Information and Communication Laboratory, Industrial Technology Research Institute, HsinChu, Taiwan

{r96922022, d93922006, lichen}@ntu.edu.tw,
[email protected]
Abstract. Smart homes differ from other pervasive environments such as offices: homes lack system administrators who can fix faulty services on the spot. Nevertheless, services in smart homes can be critical, especially when they involve health and wellness, since faulty services can lead to unexpected or undesirable consequences. Therefore, robustness and availability are two fundamental requirements for service management protocols or middleware at home. In this paper, we propose an efficient and adaptive failure detection and recovery mechanism for home environments, namely the Rotating Roll-Call-based Protocol (RRCP). Failures of software components are detected efficiently with a roll-call-based algorithm in which the roll-caller is elected periodically. Adaptive techniques and reliable UDP are adopted to maintain home network stability. Experimental results show that the proposed protocol is robust even under elevated failure rates.

Keywords: failure detection, failure recovery, smart home.
1. Introduction
The rapid emergence of intelligent devices and sensors signals the advent of the next generation of computing environments. We believe that these technologies will soon be brought into our homes as well. However, technologies are treated differently at home. Home is a pleasant place where people stay and relax. Activities in offices tend to be formal, task-oriented, and geared to optimize productivity, whereas home activities are more informal and entertainment-oriented. In addition, the infrastructure that supports social interaction within a home is often reduced to a small LAN (Local Area Network) [1]. Another important feature is that homes lack a professional system administrator who is available at any time to fix software or hardware failures [2]. This can lead to unexpected or undesirable consequences, especially when it comes to health and wellness services. Suppose a fall detection service for the elderly is deployed at home, and the service fails without anybody noticing. With the service down, family members would not learn of a fall when it occurred, and emergency assistance could be seriously delayed. Therefore, service availability and robustness are two fundamental requirements for middleware or service management protocols in smart homes.

In [3], we proposed a Pervasive Service Management Protocol (PSMP) devised to fulfill the requirements that arise when intelligent technologies are brought into the home. The functionalities of PSMP include autonomous service composition, activation, failure detection, and failure recovery. In addition, we designed a Pervasive Node Application Model (PAM) and dealt with the failure detection and recovery of Service Nodes, but the failure detection and recovery of Kernel Nodes remained unsolved.

In this paper, we propose a novel approach, called the Rotating Roll-Call-based Protocol (RRCP), to detect failures efficiently in the home environment based on PAM. This approach is inspired by the roll-call procedure in our daily lives, in which the names on a list are called one after another to determine who is present and who is absent. In our protocol, every Kernel Node takes a turn at initiating a roll-call, so a failed node can be detected rapidly and marked as suspected. To avoid false detections, we adopt a semi-consensus policy: a node is considered failed only when more than half of the nodes in the environment suspect it. To achieve better efficiency, we adopt reliable User Datagram Protocol (UDP) as the transport protocol. Finally, the time period for calling the roll is adaptive; i.e., it varies depending on the context of the network and the status of the nodes. This time period is crucial to the failure detection time.

The RRCP is designed based on the Message-based Pervasive middleware and the Pervasive Node Application Model (PAM) presented in [3]. The Message-based Pervasive middleware is supported by message-oriented middleware (MOM), a “software bus” for integrating heterogeneous service components.
The logical pathways between service components are called “topics”, which reside in MOM. Service components in MOM exchange messages via several topics. The advantage of the MOM is that the dependencies between nodes are removed: they depend on topics instead of depending on each other. Hence, the failures of nodes in Message-based systems are always isolated. Moreover, when recovering from failure, the order of execution or re-initialization of nodes does not matter. In the next section we will begin by giving an overview of failure detection protocols. In Section 3, we introduce the Pervasive Node Application Model, which is the base of our protocol. The Rotating Roll-Call-based failure detection and recovery protocol is described in Section 4. Section 5 discusses the implementation details and the performance of the RRCP protocol. Finally, conclusions are presented and suggestions are made for further research.
2. Related Works
Many advanced failure detection techniques for distributed environments have been proposed. Traditionally, a system monitor detects failures by using periodic-broadcast-based protocols such as Heartbeat or Polling [4]. If the broadcasting period is configured properly, Heartbeat or Polling protocols significantly reduce the failure detection time. However, these protocols usually consume considerable network bandwidth and can cause instability in the environment. Recently, Gossip techniques [5] have also become a prominent method for detecting failures in distributed environments. Gossip techniques are designed for large-scale systems because they are highly scalable, but they also introduce considerable detection latency [5].

The aim of this paper is to devise a failure detection protocol that is well suited to home environments. Faulty services can be detected efficiently using a roll-call-based algorithm and a semi-consensus technique. Adaptive techniques and reliable UDP are adopted to maintain home network stability.
3. The Pervasive Node Application Model
Before discussing the proposed mechanisms in detail, we present the Pervasive Node Application Model for a message-oriented pervasive middleware (see Fig. 1). In this model, we use the term “Pervasive Node” to refer to an atomic software entity in a message-oriented pervasive middleware.
Fig. 1. The Pervasive Node Application Model.
As illustrated in Fig. 1, there are two subtypes of Pervasive Node: Kernel Node and Service Node. Service Nodes are the basic components of a pervasive service. We can further classify Service Nodes into three sub-categories according to their behaviors: Sensor Nodes, such as temperature sensors, acquire context information from the environment; Actuator Nodes, such as appliance controllers, perform actions; and Process Nodes perform logic operations according to the context gathered by Sensor Nodes and instruct Actuator Nodes to execute the corresponding action.

Kernel Nodes perform administrative tasks. A PSM (Pervasive Service Manager) manages the group of Service Nodes of a certain Pervasive Service, such as an adaptive air conditioning service that turns on the fan when the temperature increases. The PSM activates and manages its group members according to a PSDL (Pervasive Service Description Language) description. The PSDL keeps track of the Service Nodes needed by a Pervasive Service, including their criteria, the number of instances, restrictions, and other information. A PHM (Pervasive Host Manager) manages the Service Nodes located on the same machine, together with their lifecycle. Each Service Node, after installation on the machine, registers its metadata with the PHM’s PSMR (Pervasive Service Metadata Registry) so that it can be queried. Note that the members of PSMs and PHMs can overlap.

The Service Nodes are monitored by PSMs; the failure detection and recovery of Service Nodes have been discussed in [3]. That work assumed that Kernel Nodes do not fail, which is unrealistic. Hence, in the next section we propose a method for Kernel Nodes to monitor one another and detect Kernel Node failures efficiently.
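To make the model concrete, the node taxonomy above can be sketched as plain data types. This is a minimal illustration only: the class names, fields, host names, and the in-memory dictionary standing in for the PSMR are our own assumptions and are not taken from the PSMP implementation in [3].

```python
from dataclasses import dataclass, field
from enum import Enum


class ServiceRole(Enum):
    SENSOR = "sensor"      # acquires context, e.g. a temperature reading
    PROCESS = "process"    # applies logic to the gathered context
    ACTUATOR = "actuator"  # performs an action, e.g. switching on a fan


@dataclass(frozen=True)
class ServiceNode:
    name: str
    role: ServiceRole
    host: str  # machine on which the node is installed (hypothetical field)


@dataclass
class PervasiveHostManager:
    """Kernel Node that tracks the Service Nodes installed on one machine."""
    host: str
    registry: dict = field(default_factory=dict)  # stands in for the PSMR

    def register(self, node: ServiceNode) -> None:
        if node.host != self.host:
            raise ValueError(f"{node.name} is not installed on {self.host}")
        self.registry[node.name] = node


@dataclass
class PervasiveServiceManager:
    """Kernel Node that tracks the Service Nodes composing one service."""
    service: str
    members: list = field(default_factory=list)

    def add_member(self, node: ServiceNode) -> None:
        self.members.append(node)


# Hypothetical deployment of the adaptive air conditioning example.
thermometer = ServiceNode("thermometer", ServiceRole.SENSOR, host="pc-1")
controller = ServiceNode("cooling-logic", ServiceRole.PROCESS, host="pc-1")
fan = ServiceNode("fan", ServiceRole.ACTUATOR, host="pc-2")

phm = PervasiveHostManager(host="pc-1")
phm.register(thermometer)
phm.register(controller)

psm = PervasiveServiceManager(service="adaptive-air-conditioning")
for node in (thermometer, controller, fan):
    psm.add_member(node)

# The same Service Node can belong to both a PHM and a PSM.
shared = [n.name for n in psm.members if n.name in phm.registry]
print(shared)  # ['thermometer', 'cooling-logic']
```

The sketch shows why PHM and PSM memberships can overlap: the PHM groups nodes by machine, whereas the PSM groups them by service.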
4. Rotating Roll-Call-based Failure Detection and Recovery
In this section, we describe the Rotating Roll-Call-based Protocol for smart home environments. Although the protocol is based on the Pervasive Node Application Model discussed in the previous section, it is also suitable for any small-scale distributed environment, such as a home network, where the failure detection latency is critical.

4.1 Failure Detection and Recovery
In our Pervasive Node Application Model, Kernel Nodes are the Pervasive Nodes that manage the Service Nodes. There are two kinds of Kernel Nodes: PHM and PSM. For simplicity, we refer to “Kernel Nodes” simply as “nodes” in the following sections. Each node can run on a different machine, but two or more nodes can also be deployed on the same machine. All of the nodes together constitute a distributed system environment. Therefore, the RRCP can be applied to any distributed environment.

The reference implementation of RRCP is based on UPnP’s [6] SSDP (Simple Service Discovery Protocol), which is widely adopted in smart home environments. However, the core concepts of the RRCP can easily be applied to any other service discovery protocol. We use SSDP to receive and send “presence” announcements. Every node keeps a list of the nodes present in the environment. We assume that the list maintained in every node contains the same nodes, although the order of the nodes in the list can differ. The problem is that the default period of SSDP’s announcements is 1800 seconds. This period can be shortened, but too much bandwidth is wasted if it is set too short, as SSDP sends 3+2d+k messages for each device. For this reason, we propose a method that detects failures reliably and efficiently.

At the beginning, the nodes perform a Leader Election procedure to decide who will be the first leader. For Leader Election, we adopt the Invitation Algorithm [7]. The key idea of the Invitation Algorithm is that a node wishing to become the leader “invites” other nodes to join it in forming a group. At the beginning, each node is the leader of its own group, which contains only itself. Each leader periodically “invites” other groups to join it. To avoid livelock, a priority mechanism is used so that lower-priority nodes send out invitations after a longer period of time. For details of the algorithm, see [7].
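The bandwidth concern about shortening the SSDP announcement period can be illustrated with some simple arithmetic. In the sketch below we read d as the number of embedded devices and k as the number of service types, following the UPnP Device Architecture [6]; the values d = 0, k = 1, ten nodes, and the 5-second period are hypothetical, and repeated transmissions of the same announcement are ignored.

```python
def ssdp_announcements(embedded_devices: int, service_types: int) -> int:
    """Messages in one SSDP announcement set for a device: 3 + 2d + k."""
    return 3 + 2 * embedded_devices + service_types


def announcements_per_hour(period_s: float, d: int, k: int, nodes: int) -> float:
    """Announcement messages per hour when every node re-announces each period."""
    cycles = 3600 / period_s
    return cycles * nodes * ssdp_announcements(d, k)


# Default SSDP period from the text versus a hypothetical 5-second period.
default_rate = announcements_per_hour(1800, d=0, k=1, nodes=10)
fast_rate = announcements_per_hour(5, d=0, k=1, nodes=10)

print(default_rate)  # 80.0
print(fast_rate)     # 28800.0
```

Under these assumptions, shortening the period from 1800 seconds to 5 seconds multiplies the announcement traffic by 360, which is why relying on SSDP announcements alone for fast failure detection is unattractive.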
After the leader is elected, the checking process begins. The message exchange sequences are shown in Fig. 2 and Fig. 3. The leader sends Check (CHK) messages to every node in its list via UDP, just like calling a roll. A node that receives a CHK message must reply with an acknowledgement (ACK) to tell the leader that it is alive, as in Fig. 2. The leader waits for the acknowledgement for a certain time period and, if a node does not reply, marks that node as suspected.
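A single roll-call round as described above can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: the function names are ours, and the UDP exchange is replaced by a callback `send_chk` that returns whether an ACK arrived within the waiting period.

```python
from typing import Callable, Iterable, Set


def call_roll(
    nodes: Iterable[str],
    leader: str,
    send_chk: Callable[[str, float], bool],
    wait_s: float,
) -> Set[str]:
    """Send CHK to every other node and return the nodes that did not ACK."""
    suspected = set()
    for node in nodes:
        if node == leader:
            continue  # the leader does not check itself
        if not send_chk(node, wait_s):
            suspected.add(node)  # no ACK within wait_s: mark as suspected
    return suspected


# Simulated network mirroring Fig. 2: node D has crashed and never replies.
alive = {"A", "B", "C"}


def fake_send_chk(node: str, wait_s: float) -> bool:
    return node in alive


suspected = call_roll(["A", "B", "C", "D"], "A", fake_send_chk, wait_s=0.5)
print(suspected)  # {'D'}
```

Note that a missing ACK only raises a suspicion here; the decision that a node has failed is taken separately, once enough nodes share the suspicion.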
Fig. 2. Node A is the leader and calls the roll. As node D does not reply, it is marked as suspected. A while later, the leader authority is passed to node B.
Fig. 3. Node B is now the current leader. After calling the roll, it notices that half of the nodes suspect node D, and therefore it announces node D’s failure.
Every node maintains a list of suspected nodes, and this suspected list is passed along with the leader authority. After calling the roll, the leader must check whether more than half of the nodes in the environment suspect the same node. If there is such a node, the leader announces that this node has failed with a FAIL packet, as in Fig. 3. On the other hand, if the leader receives an ACK from a node that is in the suspected list, this may indicate that the network or the nodes are heavily loaded, and the waiting time period for acknowledgements is therefore extended. This time period is crucial, as it affects the interval for calling the roll and consequently the latency for detecting a failure. The waiting time period is adjusted again after a certain number of rounds of normal operation.

Finally, the current leader passes the leader authority to a random node with an LDR packet, along with the list of suspected nodes and the number of nodes that suspect each node on the list. The node that receives the LDR packet must reply with a LACK. This is necessary because the leader authority must not be lost. The purpose of passing around the leader authority is load balancing: in our Pervasive Node Application Model, every Kernel Node has its own responsibilities and therefore cannot perform the leader's work constantly. Table 1 summarizes the packets used in the RRCP and their functions.

The process described above is repeated until none of the nodes receives the leader authority for a certain period of time. This may imply that the node that received the leader authority has failed, and hence the leader election algorithm is carried out again.

The leader that detects the failure of a node is in charge of recovering it. Because Kernel Nodes are irreplaceable, the leader tries to re-execute the software application directly. When a PHM recovers from failure, it must check which nodes are executing on the host it manages. When a PSM recovers from failure, it must check the nodes that compose the service it manages.
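The bookkeeping that travels with the leader authority can be sketched as a small ledger. The text does not specify how far the waiting time is extended or after how many rounds it is adjusted again, so the doubling factor, the three-round threshold, and the reset to the base value are assumptions made for illustration; the class and method names are likewise hypothetical.

```python
class SuspicionLedger:
    """State passed along with the LDR packet (illustrative sketch)."""

    def __init__(self, nodes, base_wait_s=0.5, backoff=2.0, calm_rounds=3):
        self.nodes = set(nodes)
        self.suspecters = {}            # suspected node -> leaders that saw no ACK
        self.base_wait_s = base_wait_s  # assumed initial waiting time
        self.wait_s = base_wait_s
        self.backoff = backoff          # assumed extension factor
        self.calm_rounds = calm_rounds  # assumed rounds before re-adjusting
        self._calm = 0

    def record_round(self, leader, acked):
        """Update the ledger after `leader` called the roll; return failed nodes."""
        acked = set(acked)
        # A suspected node that answers again clears its suspicion.
        recovered = {n for n in acked if n in self.suspecters}
        for node in recovered:
            del self.suspecters[node]
        for node in self.nodes - acked - {leader}:
            self.suspecters.setdefault(node, set()).add(leader)
        if recovered:
            # A late ACK hints at a loaded network: extend the waiting time.
            self.wait_s *= self.backoff
            self._calm = 0
        else:
            self._calm += 1
            if self._calm >= self.calm_rounds:
                self.wait_s = self.base_wait_s
                self._calm = 0
        # Semi-consensus: failed only if more than half of all nodes suspect it.
        majority = len(self.nodes) / 2
        return sorted(n for n, s in self.suspecters.items() if len(s) > majority)


# Five nodes, E has crashed; A, B, and C take turns as leader.
ledger = SuspicionLedger("ABCDE")
results = [ledger.record_round(ldr, acked=set("ABCD") - {ldr}) for ldr in "ABC"]
print(results)  # [[], [], ['E']]
```

Storing the set of suspecting leaders, rather than a bare counter, keeps a node that happens to become leader twice from being counted twice; this is a design choice of the sketch, since the text only states that the number of suspecting nodes is passed along.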
Table 1. Types of packets used in the RRCP.

Packet | Function                | Description                                                         | Notes
CHK    | Check                   | Sent by the leader when checking every node’s status                | No retransmission needed. A node is marked as suspected by the leader if no ACK is received.
ACK    | Check Acknowledgement   | Sent to the leader to indicate that the node is operating properly  |
LDR    | Leader                  | Used to pass the leader authority between the nodes                 | Packet is retransmitted if no LACK is received.
LACK   | Leader Acknowledgement  | Indicates that the leader authority is received                     |
FAIL   | Failure                 | Sent by the leader to announce the failure of a node                | Packet is retransmitted if no FACK is received.
FACK   | Failure Acknowledgement | Indicates that the announcement of failure is received              |

4.2 Analysis
We consider a fail-stop model, where nodes stop executing and stop sending messages when they crash. For link abstractions, we assume a Fair-Loss Link. The fair-loss property guarantees that a link does not systematically drop any given message. Therefore, if neither the sender node nor the recipient node crashes, and if a message keeps being retransmitted, the message is eventually delivered [8].

Correctness. A healthy node replies with an ACK when it receives a CHK from the leader. If one of these packets is lost, the next leader will check again, and thus the suspicion is corrected. To avoid false detections, we adopt a semi-consensus technique: a suspected node is considered failed only once more than half of the nodes suspect it. This prevents each node from making its own decision in isolation; the decision instead reflects the agreement of the majority of the nodes.

Performance. The proposed algorithm can reduce failure detection time, as a faulty node is detected as soon as more than half of the nodes suspect it. The suspicion does not need to be gossiped among random nodes, and there is no need to wait until all of the nodes come to an agreement. The algorithm does not consume too much bandwidth, as it uses reliable UDP instead of broadcast mechanisms. This transforms many-to-many communication into one-to-many communication.
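The fair-loss argument can be illustrated with a retransmit-until-acknowledged loop of the kind used for LDR/LACK and FAIL/FACK. The sketch below simulates the link with a random number generator instead of real UDP sockets; the 30% loss probability, the attempt limit, and the function names are assumptions chosen for illustration.

```python
import random


def send_reliably(transmit, max_attempts=10):
    """Retransmit until acknowledged; return attempts used, or None on give-up."""
    for attempt in range(1, max_attempts + 1):
        if transmit():
            return attempt
    return None


def make_fair_loss_link(loss_probability, rng):
    """Return a transmit() that succeeds only if the packet and its ack survive."""
    def transmit():
        packet_arrives = rng.random() >= loss_probability
        ack_arrives = rng.random() >= loss_probability
        return packet_arrives and ack_arrives
    return transmit


rng = random.Random(42)  # fixed seed so the simulation is repeatable
link = make_fair_loss_link(0.3, rng)
attempts = send_reliably(link, max_attempts=50)
print(f"LDR acknowledged after {attempts} attempt(s)")
```

Because each attempt on a fair-loss link succeeds with non-zero probability, the chance that every one of n attempts fails shrinks geometrically with n. When the attempt limit is reached without a LACK, a real implementation would have to treat the recipient as possibly failed.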
5. Implementation and Evaluation
We have implemented a prototype of our Adaptive Failure Detection and Recovery Protocol. All nodes are randomly distributed over three P4/1GHz mini-PCs with 1 GB of memory in the same LAN. We designed experiments to investigate the availability and efficiency of our protocol. At the beginning of each experiment, an initial configuration is established: all the Pervasive Nodes, including Service Nodes and Kernel Nodes, discover each other and operate properly. During the experiment, a number of Kernel Nodes fail at the same time. The node failure rate varies from 10% to 80% in 10% increments. Our experiment uses a prototype that includes 10 Kernel Nodes, composed of 5 PHMs and 5 PSMs.

5.1 Metrics
We investigate the availability and efficiency of our RRCP. According to [9], availability can be defined as follows:

Availability = MTTF / MTBF = MTTF / (MTTF + MTTR).    (1)
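As a quick numeric illustration of Eq. (1), the snippet below evaluates the ratio for two hypothetical runs with the same mean time to repair (MTTR) but different mean times to failure (MTTF). The values are invented for the example and are not measurements from our experiments.

```python
def availability(mttf_s: float, mttr_s: float) -> float:
    """Eq. (1): MTTF / (MTTF + MTTR), where MTBF = MTTF + MTTR."""
    return mttf_s / (mttf_s + mttr_s)


# Same 10-second repair time, different times to failure (hypothetical).
short_run = availability(mttf_s=990, mttr_s=10)
long_run = availability(mttf_s=9990, mttr_s=10)

print(f"{short_run:.3f}")  # 0.990
print(f"{long_run:.3f}")   # 0.999
```

With the repair time held fixed, the ratio rises simply because failures are spaced further apart, so the availability figure alone says little about how quickly failures are detected and repaired.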
MTTF is the Mean Time To Failure, MTBF is the Mean Time Between Failures, and MTTR is the Mean Time To Repair. Therefore, to achieve high availability, we must minimize the MTTR. The value of Availability depends greatly on MTTF: the longer we set the experiment duration, the higher the availability. Hence, the value of MTTR is more meaningful here. MTTR measures the latency from the failure of the first node until all the nodes are recovered and the initial configuration is restored. We decompose MTTR into failure detection latency and failure recovery latency. Efficiency is measured as the total number of messages required to detect and recover from a failure.

5.2 Experiment Results
We executed 80 runs, 10 for each failure rate. The results of the experiments are shown in Fig. 4 and Fig. 5. The message count for failure detection decreases slightly as the failure rate increases. This is because, with more failed nodes, fewer nodes reply to the CHK packets and semi-consensus is reached faster. However, the message count for failure recovery increases with the failure rate, as nodes exchange SSDP presence announcements during the recovery procedure. Compared with the experiments in [10], which report the message counts of UPnP, SLP, and Jini, message counts are considerably reduced in our protocol. Availability, however, cannot be compared because the experimental settings differ.
[Plot: message count (0 to 700) versus failure rate (10% to 80%); series: Failure Detection, Failure Recovery.]
Fig. 4. Message Counts for Failure Detection and Recovery.
[Plot: latency in seconds (0 to 12) versus failure rate (10% to 80%); series: Failure Detection, Failure Recovery.]
Fig. 5. Latency in seconds for Failure Detection and Recovery.
Fig. 5 shows that the failure detection latency decreases as the failure rate increases. This is because, with fewer nodes, it is faster to reach an agreement among half of the nodes. The failure recovery latency grows as the failure rate increases, because more nodes need to discover each other during the recovery procedure, which takes longer.
6. Conclusion
Availability is one of the most important aspects of a middleware or service management protocol. System availability is usually defined as the proportion of system uptime over the sum of uptime and downtime [11]. Therefore, a service management protocol for smart home environments should be designed to minimize the downtime of the system. In this paper, we propose an adaptive failure detection and recovery protocol that is suitable for any small-scale distributed environment where failure detection and recovery time are critical. Experimental results show that the proposed RRCP is robust even under elevated failure rates. Currently, we use SSDP for service discovery, but the experiments showed that the message counts of SSDP increase considerably with the number of nodes. We think this is due to the multicast mechanism used by SSDP. We therefore plan to design a more efficient service discovery protocol for our Service Management Protocol for Smart Homes in the near future.
References

1. Meyer, S., Rakotonirainy, A.: A survey of research on context-aware homes. In: Proceedings of the Australasian information security workshop conference on ACSW frontiers 2003, vol. 21. Adelaide, Australia, Australian Computer Society, Inc. (2003)
2. Edwards, W. K., Grinter, R. E.: At Home with Ubiquitous Computing: Seven Challenges. In: Proceedings of the 3rd international conference on Ubiquitous Computing. Atlanta, Georgia, USA, Springer-Verlag. (2001)
3. Liao, C.-F., Jong, Y.-W., Fu, L.-C.: PSMP: A Fast Self-Healing and Self-Organizing Pervasive Service Management Protocol for Smart Home Environments. In: Asia-Pacific Conference on Services Computing, IEEE. (2006)
4. Hanmer, R. S.: Patterns for Fault Tolerant Software, John Wiley & Sons Ltd. (2007)
5. Ranganathan, S., George, A. D., et al.: Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters. In: Cluster Computing, vol. 4, issue 3, pp. 197--209. (2001)
6. UPnP Device Architecture 1.0, UPnP Forum. (2003)
7. Garcia-Molina, H.: Elections in a Distributed Computing System. In: Computers, IEEE Transactions on, vol. 31, issue 1, pp. 48--59. (1982)
8. Guerraoui, R., Rodrigues, L.: Introduction to Reliable Distributed Programming, Springer-Verlag New York, Inc. (2006)
9. Koren, I., Krishna, C. M.: Fault-Tolerant Systems, Morgan Kaufmann, San Francisco, CA. (2007)
10. Dabrowski, C., Mills, K., et al.: Understanding failure response in service discovery systems. In: J. Syst. Softw, vol. 80, issue 6, pp. 896--917. (2007)
11. Shooman, M. L.: Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, Wiley-Interscience. (2001)