
An Efficient Autonomous Failure Recovery Mechanism for UPnP-based Message-Oriented Pervasive Services

Ya-Wen Jong1, Chun-Feng Liao1, and Li-Chen Fu1,2
1 Department of Computer Science and Information Engineering
2 Department of Electrical Engineering
National Taiwan University, Taipei, Taiwan
{r96922022, d93922006, lichen}@ntu.edu.tw

Abstract: Service Discovery is an interesting challenge in highly dynamic pervasive environments such as smart living spaces. In addition to adapting to varied environments, autonomous failure recovery and service activation are two fundamental issues that a Service Discovery protocol must address in order to achieve high service availability. In [1], we introduced an enhanced version of UPnP’s Simple Service Discovery Protocol (SSDP) that addresses these two issues. However, our experience with that work shows that the failure recovery performance is unsatisfactory because of the multicast mechanism used by SSDP. In this paper, we propose a supporting data structure, the Mapped Eviction SND Tree, composed of a set of specialized mapped trees, to speed up the failure recovery process. A system equipped with this structure is aware of existing service nodes and need not perform the discovery procedure again when a failure is detected. Hence, the time for restoring the failed service is minimized. Experimental results show that the proposed approach reduces the failure recovery time by up to 90 percent on average. Issues concerning the maintenance of, and collaboration among, the internal data structures are also discussed in this paper.

Keywords: UPnP, SSDP, service discovery, message-oriented middleware, failure recovery

I. INTRODUCTION

Service Discovery plays an essential role in a pervasive environment. In such environments, a user, or his/her computing device, makes use of specialized information appliances in the network neighborhood, wired or wireless. These information appliances are designed primarily to support mobility and therefore rely on cooperation among themselves, since cooperation compensates for capabilities that mobile devices lack compared with conventional, fully powered computing devices [2]. Many design issues of Service Discovery protocols, such as service advertisement, service sub-typing, and security, have been discussed in recent years [3] [4]. Nevertheless, few of these works address an essential yet challenging issue: service availability in a pervasive environment. High service availability is a fundamental requirement when deploying pervasive services in a smart living space. Clearly, a middleware that supports high system availability is required.

First, we propose a message-oriented application model [5] for pervasive environments, which supports dynamic composition of services and is capable of isolating failed components. The message-oriented application model is supported by the Message-Oriented Middleware (MOM) [5]. The MOM creates a “software bus” for integrating heterogeneous applications. The logical pathways between publishers and subscribers are called “topics”, which reside in the MOM. Software entities (called “nodes”) in the MOM exchange messages via these logical pathways (see Fig. 1). Based on this “Node-Bus-Node” message-oriented architecture, the system can provide pervasive services by chaining nodes and topics together. For instance, nodes A, B, C, D, and F in Fig. 1 collectively provide an “adaptive air conditioner” service. In this service, nodes A and B are software adapters to wireless temperature sensors, node C is the context interpreter, which transforms raw sensor data into high-level semantic context data, node D decides which commands to issue by performing logical reasoning on the acquired context data, and node F controls fans or air pumps based on the commands issued by D.

This research is supported by the National Science Council of Taiwan, under Grant NSC96-2752-E-002-007-PAE, and by the Excellent Research Projects of National Taiwan University, 95R0062-AE00-05.

Figure 1. The Message-Oriented Application Model
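The node and topic chaining of Fig. 1 can be illustrated with a minimal in-memory sketch. This is not the middleware described in this paper: the class `MiniBus`, its methods, the topic names, and the 28-degree threshold are all hypothetical and chosen only to show how sensor, interpreter, reasoner, and actuator nodes are decoupled by topics.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy in-memory "software bus"; a stand-in for a real MOM, for illustration only. */
final class MiniBus {
    // topic name -> nodes subscribed to that topic
    private final Map<String, List<Consumer<String>>> topics = new HashMap<>();

    void subscribe(String topic, Consumer<String> node) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(node);
    }

    void publish(String topic, String message) {
        for (Consumer<String> node : topics.getOrDefault(topic, new ArrayList<>())) {
            node.accept(message);
        }
    }

    /** Chains sensor -> interpreter -> reasoner -> actuator; returns the actuator's commands. */
    static List<String> runAirConditionerChain(double... readings) {
        MiniBus bus = new MiniBus();
        List<String> actuatorLog = new ArrayList<>();
        // Node C (context interpreter): raw temperature -> high-level context
        bus.subscribe("raw.temperature", raw ->
            bus.publish("context.temperature",
                Double.parseDouble(raw) > 28.0 ? "HOT" : "NORMAL"));
        // Node D (reasoner): context -> command
        bus.subscribe("context.temperature", ctx ->
            bus.publish("command.fan", ctx.equals("HOT") ? "FAN_ON" : "FAN_OFF"));
        // Node F (actuator): records the command it would execute
        bus.subscribe("command.fan", actuatorLog::add);
        // Nodes A and B (sensor adapters) publish raw readings
        for (double reading : readings) {
            bus.publish("raw.temperature", Double.toString(reading));
        }
        return actuatorLog;
    }

    public static void main(String[] args) {
        System.out.println(runAirConditionerChain(30.5, 24.0));
    }
}
```

Because each node only knows topic names, a failed node can in principle be replaced by another node subscribing to the same topics, which is the property the recovery mechanism relies on.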

© 2008 IEEE 1-4244-2384-2/08/$20.00

In order to achieve autonomous service activation and failure recovery, we further define a Pervasive Node object model for message-oriented pervasive systems, and then propose a community-based service activation and failure recovery mechanism on top of the object model [1]. We have also introduced a community-based object model. The fundamental components of this community are the Pervasive Service Manager (PSM) and the Pervasive Host Manager (PHM). In simple terms, the PSM manages the collection of nodes needed to provide a pervasive service to the user, and the PHM manages and controls the nodes that are co-located on the same computing device. The PSM keeps track of the liveness of the nodes needed by its service, so when a failure is detected the PSM can reactivate the nodes through the PHM or use other service nodes that are already active in the environment.

In order not to reinvent the wheel, we realize these functionalities by extending portions of the SSDP (Simple Service Discovery Protocol) protocol stack, which is part of UPnP (Universal Plug and Play) [6], an ISO/IEC home networking protocol standard (ISO/IEC 29341) [7]. The reasons for choosing SSDP/UPnP are threefold:
(a) Service Discovery in pervasive computing environments requires a decentralized architecture in which a node does not depend on other nodes to advertise or register services [8]. SSDP is one of the few dynamic service discovery protocols that do not need a dedicated, centralized service registry [9], which makes it more feasible in a pervasive environment.
(b) SSDP/UPnP is platform and language independent, as it is based on the SOAP/HTTP protocol.
(c) SSDP/UPnP is an ISO standard.

The major weakness of SSDP is that its unreliable multicast mechanism makes the performance of failure recovery unsatisfactory. In [1], we defined two kinds of failure recovery: Hot Failure Recovery and Cold Failure Recovery. Hot Failure Recovery recovers a pervasive service using nodes that are already loaded in the environment, while Cold Failure Recovery is performed when none of the needed nodes is loaded in the environment. However, the average Hot Recovery Time is longer than expected.
The objective of this work is to reduce the long hot failure recovery time, since service availability is extremely important in a pervasive computing environment. The rest of this paper is organized as follows. In Section 2, we present the community-based object model and how failure recovery is performed. In Section 3, we describe how to shorten the failure recovery time using the Mapped Eviction SND Tree. Section 4 presents the implementation and evaluation results of this work and compares its performance with that of our previous work. Section 5 compares the proposed approach with related work. Finally, conclusions are presented and suggestions are made for further research.

II. COMMUNITY-BASED FAILURE RECOVERY

Before entering into detailed discussions of the proposed mechanisms, in this section we introduce the background of the community-based failure recovery mechanism.

A. Pervasive Nodes

A Pervasive Node is an atomic software entity in a message-oriented pervasive middleware. There are two subtypes of Pervasive Node: Kernel Node and Service Node (see Fig. 2). Kernel Nodes are designed for administrative purposes. For example, the PSM (Pervasive Service Manager) and PHM (Pervasive Host Manager) perform node administration tasks instead of providing services to the user. We discuss the PSM and PHM in more detail in the next subsection. A Service Node, on the other hand, is the basic component of a pervasive service. We further classify Service Nodes into three sub-categories according to their behavior:
1. A Sensor Node is a Service Node that is capable of sending messages.
2. An Actuator Node is a Service Node that is capable of receiving messages.
3. A Process Node is a Service Node that is capable of both sending and receiving messages.

Figure 2. The Pervasive Node object model

B. Pervasive Communities

Figure 3. The structure of a Pervasive Service Community.

Figure 4. The structure of a Pervasive Host Community

2008 IEEE International Conference on Systems, Man and Cybernetics (SMC 2008)

There are two types of pervasive communities that co-exist in a message-oriented pervasive system: the “Pervasive Service Community (PSC)” and the “Pervasive Host Community (PHC)”. A PSC is a group of Service Nodes that collectively provide a pervasive service to the user. Fig. 3 shows the structure of a PSC. Each PSC is composed of one or more Service Nodes and is managed by a Pervasive Service Manager (PSM). The PSM activates and manages its community members according to a specification written in the PSDL (Pervasive Service Description Language). The PSDL specification records the service nodes needed by a pervasive service, including their criteria, the number of instances, restrictions, and other information. As sketched in Fig. 4, a PHC contains the Service Nodes located on the same machine, together with a PHM that manages the lifecycle of these Service Nodes. After installation on a machine, each node registers its metadata with the PSMR so that it can be queried.

C. Autonomous Failure Recovery

When a service is operating, the PSM is responsible for ensuring that all members of its community are in the active state. If one of them fails, the service is “in-operative”. In this case, the PSM performs a failure recovery operation. Fig. 5 shows the interactions between the PSM, the PHM, and the nodes during a failure recovery operation. In Fig. 5, the PSM notices that node A, one of its service community members, has failed. The PSM first tries to use nodes that are already loaded, since loading a service from disk into memory takes longer than activating a node that is already in memory. In this example, the PSM finds another node, G, that matches its criteria and activates it. Assume that node G is unstable and also fails after a short period of time, and that the PSM cannot find any other loaded node; in that case the PSM tries to re-perform the service activation procedure (see Fig. 5, steps 3-7). If the PSM still cannot find the needed nodes, the service cannot be recovered and the PSM has to report an error.
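The order of attempts described above (hot recovery first, then cold recovery, then an error report) can be sketched as follows. This is a simplified illustration rather than the PSM implementation: `RecoverySketch`, `HostManager`, and `loadAndActivate` are hypothetical names, and the whole service activation procedure is reduced to a single boolean call.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class RecoverySketch {
    enum Outcome { HOT_RECOVERY, COLD_RECOVERY, ERROR_REPORTED }

    /** Hypothetical stand-in for asking a PHM to load and activate a node from disk. */
    interface HostManager {
        boolean loadAndActivate(String serviceType);
    }

    /**
     * Prefers a node that is already loaded (hot recovery); otherwise asks the
     * PHM to load one (cold recovery); if that also fails, an error is reported.
     */
    static Outcome recover(String serviceType, Deque<String> loadedCandidates, HostManager phm) {
        String candidate = loadedCandidates.pollFirst();
        if (candidate != null) {
            return Outcome.HOT_RECOVERY;      // e.g. node G replaces node A
        }
        if (phm.loadAndActivate(serviceType)) {
            return Outcome.COLD_RECOVERY;     // service activation procedure succeeded
        }
        return Outcome.ERROR_REPORTED;        // the service cannot be recovered
    }

    public static void main(String[] args) {
        Deque<String> loaded = new ArrayDeque<>();
        loaded.add("node-G");
        HostManager nothingInstalled = type -> false;
        System.out.println(recover("S1", loaded, nothingInstalled));
        System.out.println(recover("S1", loaded, nothingInstalled));
    }
}
```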

Figure 5. Interactions between PSM, PHM and Service Nodes when performing the failure recovery procedure

III. ENHANCING AUTONOMOUS FAILURE RECOVERY VIA MAPPED EVICTION SND TREES

A consequence of using SSDP is that devices and service nodes are discovered through a multicast mechanism, which causes a significant delay when performing failure recovery. Service availability is one of the most important requirements in a pervasive computing environment. Therefore, we have been considering a solution to overcome this problem.

The former procedure for performing failure recovery is as follows: when the PSM detects that a node has failed, it tries to find candidate nodes in the environment that are already loaded. The process of discovering loaded nodes involves three steps: 1) the PSM sends out a multicast message, 2) all the nodes receive the message and the one that matches the criteria replies to the PSM, 3) the PSM receives the response and adds the node to its table. This round trip takes approximately 5-6 seconds per node, which is a long time both in networking terms and for a pervasive computing environment.

Based on the above considerations, we designed a set of data structures, the Mapped Eviction SND Trees, the Service Type Map (STM), and the Expiration Time Map (ETM), to keep track of the advertisements of nodes of interest. With these structures, the PSM no longer has to go through the lengthy discovery procedure. Hence, using the relations between these data structures, we can restore services in minimal time. Some issues need to be explored when designing these structures, including the eviction strategy and state consistency. These issues are discussed in sub-section B. Before that, we give a brief overview of the SSDP service discovery mechanism.

A. SSDP Preliminaries

SSDP operates using HTTPMU, HTTP Multicast over UDP. Multicast is a method of forwarding IP datagrams to a group of interested receivers via a set of predefined multicast addresses. By default, SSDP uses the address 239.255.255.250:1900. SSDP introduces two concepts related to service identification: the service type and the Unique Service Name (USN). A service type is a URI that identifies the type, or function, of a particular resource. A Unique Service Name is a URI that uniquely identifies an instance of a particular service, allowing SSDP clients to differentiate between two services with the same service type [10]. SSDP uses HTTPMU to send messages to every SSDP peer on the network. It extends HTTP with two message types: NOTIFY and M-SEARCH. There are three kinds of SSDP primitive actions:

• ssdp:alive: announces the presence of a UPnP device using an HTTP NOTIFY message.

• ssdp:byebye: announces that a device has left the network using an HTTP NOTIFY message.

• ssdp:discover: finds a UPnP device that matches the service type specified in the ST (Search Target) header of an M-SEARCH HTTP message. The matching device then replies by sending back a standard HTTP response message.
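As an illustration of these primitives, the sketch below builds the text of an M-SEARCH request and of NOTIFY announcements, and reads the validity period back from the `CACHE-CONTROL: max-age` header. The header names follow the UPnP Device Architecture 1.0 [6]; the class `SsdpMessages`, the example USN, and the simplified parsing (a single `max-age` directive) are assumptions made for this example, and headers such as LOCATION and SERVER are omitted for brevity. No network traffic is sent.

```java
final class SsdpMessages {
    static final String MULTICAST_HOST = "239.255.255.250:1900";

    /** Builds an M-SEARCH request (ssdp:discover) for the given search target. */
    static String mSearch(String searchTarget, int mxSeconds) {
        return "M-SEARCH * HTTP/1.1\r\n"
             + "HOST: " + MULTICAST_HOST + "\r\n"
             + "MAN: \"ssdp:discover\"\r\n"
             + "MX: " + mxSeconds + "\r\n"
             + "ST: " + searchTarget + "\r\n"
             + "\r\n";
    }

    /** Builds a NOTIFY announcement; nts is "ssdp:alive" or "ssdp:byebye". */
    static String notify(String nts, String serviceType, String usn, int maxAgeSeconds) {
        StringBuilder sb = new StringBuilder();
        sb.append("NOTIFY * HTTP/1.1\r\n");
        sb.append("HOST: ").append(MULTICAST_HOST).append("\r\n");
        if (nts.equals("ssdp:alive")) {
            // Validity period of the announcement, in seconds
            sb.append("CACHE-CONTROL: max-age=").append(maxAgeSeconds).append("\r\n");
        }
        sb.append("NT: ").append(serviceType).append("\r\n");
        sb.append("NTS: ").append(nts).append("\r\n");
        sb.append("USN: ").append(usn).append("\r\n");
        sb.append("\r\n");
        return sb.toString();
    }

    /** Returns the max-age value in seconds, or -1 if the header is absent. */
    static int parseMaxAge(String message) {
        for (String line : message.split("\r\n")) {
            if (line.toUpperCase().startsWith("CACHE-CONTROL:")) {
                String value = line.substring(line.indexOf(':') + 1).trim();
                int idx = value.toLowerCase().indexOf("max-age=");
                if (idx >= 0) {
                    return Integer.parseInt(value.substring(idx + "max-age=".length()).trim());
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String type = "urn:schemas-upnp-org:service:SwitchPower:1";
        String usn = "uuid:00000000-0000-0000-0000-000000000001::" + type; // made-up USN
        System.out.print(mSearch(type, 3));
        System.out.print(notify("ssdp:alive", type, usn, 1800));
        System.out.println(parseMaxAge(notify("ssdp:alive", type, usn, 1800)));
    }
}
```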

Note that because all participants in a UPnP network communicate via a multicast address, no central registry is required. However, a device may fail without sending an “ssdp:byebye” message. Therefore, when issuing an “ssdp:alive” or a response message, the device attaches the validity period of the announcement using an HTTP “Cache-Control” header. After this period, the presence announcement becomes invalid.

B. Mapped Eviction SND Trees, STM, and ETM

Every service node has a corresponding Service Node Descriptor (SND), which contains its friendly name, its USN, its service type, and, most importantly, its expiration time. These pieces of information are carried in the “ssdp:alive” notification message. To design this cache-like technique, we first need a data structure to store the SNDs of the nodes of interest. For these nodes to be found quickly during failure recovery, they have to be classified by service type. Therefore, we use a map with the service type as the key and the list of SNDs of that service type as the value. We call this data structure the Service Type Map (STM, see Fig. 5). Note that the SND list can be empty if no extra service node of that type exists. As an optimization, we keep the list of SNDs in a tree sorted by expiration time, so that we always take the SND with the longest expiration time as the candidate node. To keep the tree balanced, we store each SND list in a Red-Black Tree, which guarantees search, insertion, and deletion in O(log n) time in the worst case. Note that these data structures belong to the PSM. At initialization, the STM is empty. When a service node notification message (“ssdp:alive”) arrives, its service type matches the specification in the PSM’s PSDL, and the set of nodes needed to provide the service to the user is already complete, the service node is put into the table for backup use.

Figure 5. The Service Type Map (STM).

When a failure is detected, the PSM will first look for a candidate node in the STM. The searching criterion is by service type, and the PSM will load the first node in the list of that service type, which is the one with the longest expiration time. In that way, the service is restored. If the failed node is re-activated afterwards, and the PSM receives its advertisement, the service node will be added to the STM.
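A minimal sketch of the STM and of the candidate lookup just described is given below. It uses `java.util.TreeSet`, which is backed by a Red-Black tree, as the per-type SND tree. The names `ServiceTypeMapSketch`, `Snd`, `onAlive`, and `takeCandidate` are hypothetical, and breaking ties by USN is an assumption added here so that two nodes with the same expiration time can coexist in the set.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

final class ServiceTypeMapSketch {
    /** Minimal Service Node Descriptor; the field names are illustrative. */
    static final class Snd {
        final String usn;
        final String serviceType;
        final long expiresAtMillis;

        Snd(String usn, String serviceType, long expiresAtMillis) {
            this.usn = usn;
            this.serviceType = serviceType;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    // Latest expiration time first; the USN breaks ties between distinct nodes.
    private static final Comparator<Snd> LONGEST_LIVED_FIRST = new Comparator<Snd>() {
        @Override
        public int compare(Snd a, Snd b) {
            int byExpiry = Long.compare(b.expiresAtMillis, a.expiresAtMillis);
            return byExpiry != 0 ? byExpiry : a.usn.compareTo(b.usn);
        }
    };

    // service type -> backup SNDs of that type, kept sorted by the comparator above
    private final Map<String, TreeSet<Snd>> byType = new HashMap<>();

    /** Stores a backup node announced via "ssdp:alive". */
    void onAlive(Snd snd) {
        byType.computeIfAbsent(snd.serviceType, t -> new TreeSet<>(LONGEST_LIVED_FIRST)).add(snd);
    }

    /** Removes and returns the backup with the latest expiration time, or null if none. */
    Snd takeCandidate(String serviceType) {
        TreeSet<Snd> backups = byType.get(serviceType);
        return (backups == null) ? null : backups.pollFirst();
    }

    public static void main(String[] args) {
        ServiceTypeMapSketch stm = new ServiceTypeMapSketch();
        stm.onAlive(new Snd("uuid:a", "S1", 1_000L));
        stm.onAlive(new Snd("uuid:b", "S1", 5_000L));
        System.out.println(stm.takeCandidate("S1").usn);
    }
}
```

Looking up the map entry is a constant-time hash lookup on average, and taking the first element of the tree is logarithmic in the number of backups of that type, so no network round trip is involved in choosing a candidate.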

There are some issues to be aware of when designing the Mapped Eviction SND Trees, STM, and ETM:

1) Eviction Strategy: SNDs stored in the maps consume storage space, so we cannot keep them permanently. Moreover, a service node may have “died” without having had time to send an “ssdp:byebye”. Fortunately, every “ssdp:alive” notification message sent by a Pervasive Node carries its own expiration time. If the Pervasive Node does not send an “ssdp:byebye” before it leaves, we can remove it from the STM once its expiration time is reached. To keep track of each service node’s expiration time, we have designed an Expiration Time Map (ETM, see Fig. 6). This map has the expiration time as the key, and the value is the list of SNDs that expire at that time. The ETM is checked every second to see whether any expiration times are due. If some SNDs have expired, they are taken from the ETM and removed from the STM using their service type and USN.

Figure 6. The Expiration Time Map (ETM).

2) State Consistency: The state of the SNDs stored in the tables must match that of the original service nodes; that is, the cached information must correspond to the actual information in the service nodes. If the original service node is no longer available, the locally stored SND must also be removed. This means that when an “ssdp:byebye” is received, the corresponding SND must be removed from the maps. To do this in minimal time, we can find the service node in the STM by its service type and then remove it from the list of SNDs. We also have to remove it from the ETM, but because the key of the ETM is the expiration time, we would have to search the entries one by one. Therefore another map, the USN Map, is needed. This map has the USN as the key, and the value points to the SND stored in the ETM (see Fig. 7). In this way, we can quickly find the SND by its USN and remove it from both maps.

Figure 7. The Expiration Time Map and the USN Map.
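The collaboration between the STM, the ETM, and the USN Map can be sketched as below. This is an illustrative reading of the design, not the authors' code: all class and method names are hypothetical, expiration times are plain second counts, and the ETM is a `TreeMap` rather than a hash map so that a single periodic check can also collect entries whose expiration second was missed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class SndCacheSketch {
    static final class Snd {
        final String usn;
        final String serviceType;
        final long expiresAtSec;

        Snd(String usn, String serviceType, long expiresAtSec) {
            this.usn = usn;
            this.serviceType = serviceType;
            this.expiresAtSec = expiresAtSec;
        }
    }

    // Latest expiration first; USN breaks ties (an assumption of this sketch).
    private static final Comparator<Snd> ORDER = (a, b) ->
        a.expiresAtSec != b.expiresAtSec
            ? Long.compare(b.expiresAtSec, a.expiresAtSec)
            : a.usn.compareTo(b.usn);

    private final Map<String, TreeSet<Snd>> stm = new HashMap<>(); // service type -> backups
    private final TreeMap<Long, List<Snd>> etm = new TreeMap<>();  // expiration time -> SNDs
    private final Map<String, Snd> usnMap = new HashMap<>();       // USN -> SND

    /** "ssdp:alive": register the SND in all three maps (refreshing any older entry). */
    void onAlive(Snd snd) {
        remove(snd.usn);
        stm.computeIfAbsent(snd.serviceType, t -> new TreeSet<>(ORDER)).add(snd);
        etm.computeIfAbsent(snd.expiresAtSec, t -> new ArrayList<>()).add(snd);
        usnMap.put(snd.usn, snd);
    }

    /** "ssdp:byebye": find the SND by USN and drop it from all three maps. */
    boolean remove(String usn) {
        Snd snd = usnMap.remove(usn);
        if (snd == null) {
            return false;
        }
        TreeSet<Snd> backups = stm.get(snd.serviceType);
        if (backups != null) {
            backups.remove(snd);
        }
        List<Snd> sameExpiry = etm.get(snd.expiresAtSec);
        if (sameExpiry != null) {
            sameExpiry.remove(snd);
            if (sameExpiry.isEmpty()) {
                etm.remove(snd.expiresAtSec);
            }
        }
        return true;
    }

    /** Periodic check: evict every SND whose expiration time is at or before nowSec. */
    int evictExpired(long nowSec) {
        List<Snd> due = new ArrayList<>();
        for (List<Snd> bucket : etm.headMap(nowSec, true).values()) {
            due.addAll(bucket); // copy first to avoid modifying the view while iterating
        }
        for (Snd snd : due) {
            remove(snd.usn);
        }
        return due.size();
    }

    /** Failure recovery: take the longest-lived backup of the given type, or null. */
    Snd takeCandidate(String serviceType) {
        TreeSet<Snd> backups = stm.get(serviceType);
        if (backups == null || backups.isEmpty()) {
            return null;
        }
        Snd best = backups.first();
        remove(best.usn);
        return best;
    }
}
```

In a running system, `evictExpired` would be driven by a timer that fires once per second, as described for the ETM above.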



As a result, through the collaboration between these three maps we can recover the service in minimal time while maintaining the consistency of the stored data. To sum up, the enhanced failure recovery procedure is:

• When receiving an “ssdp:alive” notify message from a service node, the PSM stores the SND of this node in the STM if it is an extra node of interest. The SND is also registered in the ETM and the USN Map with its expiration time and USN, respectively.

• When the PSM detects a failure, it looks for a candidate node in the STM. It takes the first candidate node of that service type from the STM and removes it from the ETM and the USN Map using the USN as the key.

• When an “ssdp:byebye” is received, the PSM checks whether the leaving service node is in its tables. If it is, the PSM removes it from the maps using its service type and USN as the keys.

• Additionally, SNDs are checked every second using the ETM to see whether they have expired.

IV. IMPLEMENTATION AND EVALUATION

We implemented the prototype platform mainly on JDK 6.0. Some nodes are implemented in C# and others in C++. We use ActiveMQ 4.1.1 [11], an open source MOM, as the message exchange platform. ActiveMQ uses a cross-platform messaging protocol and supports several programming languages such as C, C++, C#, and Java. Therefore a PSC can be composed of Service Nodes implemented with heterogeneous technologies. This ability is important in developing pervasive applications. For instance, real-time image-processing components are better implemented in C or C++, while server-side components are usually implemented in Java. Cross-platform interoperability is inherent in message-oriented applications and is very hard to achieve in other application models. To support UPnP, we use the Intel UPnP SDK [12] for C# and C++ based Pervasive Nodes, while Java-based Pervasive Nodes are developed with Cyberlink UPnP for Java [13]. The middleware and all Pervasive Nodes are distributed over three P4/1GHz mini-PCs with 1GB of memory.

For the STM, we use a HashMap with the service type as the key and a TreeSet of SNDs as the value. A TreeSet keeps its elements ordered according to a user-defined ordering; here, we store SNDs sorted by their expiration time. For the ETM, we use a HashMap with the expiration time as the key and an ArrayList of SNDs as the value. The ETM is used to remove SNDs that are out of date. The USN Map is also a HashMap, with the USN as the key and an SND as the value. Note that each USN corresponds to only one SND. The USN Map is used to remove an SND by its USN when an “ssdp:byebye” is received.

We verify the feasibility of the proposed mechanisms by developing four pervasive services (see Table I and Table II). Table I shows all Service Nodes deployed in the experiment environment. These nodes are located on three different hosts (H1, H2 and H3), and each host has a PHM. Table II lists the service types and criteria required by the pervasive services. Notice that there are four PSCs and three PHCs in this experiment. Some service types have several node instances. Therefore, the PSM can choose one of them and store the other one locally. For instance, the “adaptive air conditioner” service requires service types S1, P2 and A2. There are two nodes with service type S1 (PL-2303 and Taroko), and hence the PSM can use one of them and store the other one locally.

TABLE I. IMPLEMENTED SERVICE NODES

| Node Name | Type ID | Type/Criteria Name | Host |
|---|---|---|---|
| PL-2303 Sensor Adapter | S1 | Wireless Sensor Node | H1 |
| Taroko Sensor Adapter | S1 | Wireless Sensor Node | H1 |
| Ekahau Position Engine Adapter | S2 | Location Detector | H2 |
| Smart Floor Adapter | S2 | Location Detector | H1 |
| Control and Monitoring Web Application | A1 | Web Application Server | H2 |
| Home Appliance Controller | A2 | Home Appliance Controller | H2 |
| Smart Display A | A31 | Smart Display, location=livingroom | H1 |
| Smart Display B | A32 | Smart Display, location=studyroom | H2 |
| Smart Display C | A33 | Smart Display, location=kitchen | H3 |
| Short Message System Gateway | A4 | SMS | H2 |
| Media Follow Me Logic | P1 | Logic, name=Media Follow Me | H3 |
| Air Conditioner Logic | P2 | Logic, name=Air Conditioner | H3 |
| Burglar Detection Logic | P3 | Logic, name=Burglar | H3 |

TABLE II. PERVASIVE SERVICES

| Service ID | Service Community Name | Community Members Type ID |
|---|---|---|
| PS1 | Web-based Control and Monitoring | S1, A1, A2 |
| PS2 | Media Follow Me | S2, P1, A31, A32, A33 |
| PS3 | Adaptive Air Conditioner | S1, P2, A2 |
| PS4 | Burglar Detection Alert | S1, P3, A2, A4 |

TABLE III. EXPERIMENT RESULTS

| Service ID | Average Hot Recovery Time (ms) | Enhanced Average Hot Recovery Time (ms) |
|---|---|---|
| PS1 | 5016 | 212 |
| PS2 | 6016 | 500 |
| PS3 | 5032 | 206 |
| PS4 | 5016 | 447 |
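The relative improvement implied by Table III can be checked with a few lines of arithmetic. The sketch below computes, for each service, the percentage by which the enhanced recovery time is lower than the original one, using only the values in the table; the class and method names are hypothetical. For all four services the computed reduction comes out above 90 percent.

```java
final class RecoveryReduction {
    /** Percentage by which enhancedMs is lower than baselineMs. */
    static double reductionPercent(double baselineMs, double enhancedMs) {
        return 100.0 * (baselineMs - enhancedMs) / baselineMs;
    }

    public static void main(String[] args) {
        String[] ids = {"PS1", "PS2", "PS3", "PS4"};
        double[] baseline = {5016, 6016, 5032, 5016}; // Average Hot Recovery Time (ms)
        double[] enhanced = {212, 500, 206, 447};     // Enhanced Average Hot Recovery Time (ms)
        double sum = 0.0;
        for (int i = 0; i < ids.length; i++) {
            double r = reductionPercent(baseline[i], enhanced[i]);
            sum += r;
            System.out.printf("%s: %.1f%%%n", ids[i], r);
        }
        System.out.printf("mean: %.1f%%%n", sum / ids.length);
    }
}
```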

Table III shows the experimental results. In this experiment, each test is performed 10 times. The Average Hot Recovery Time is the time for recovering a pervasive service using UPnP’s multicast messages to discover the candidate service nodes. The Enhanced Average Hot Recovery Time is the time for recovering a pervasive service using the approach proposed in this paper. We can see that the Enhanced Average Hot Recovery Time is much shorter than the time of the original Hot Failure Recovery. The main reason is that the information needed for recovery is already stored locally, so the discovery process need not be performed again. In effect, we trade additional storage space for recovery time.

V. RELATED WORK

Many service discovery systems have been designed to support interaction between heterogeneous devices in a pervasive environment. Among them, Jini [14], UPnP, SLP [15] and Salutation [16] are the most discussed. The SSDP of UPnP differs from other service discovery protocols in that it uses a decentralized approach, which is preferred in a pervasive environment. The above-mentioned protocols allow failure detection either by polling or by monitoring periodic announcements, but none of them supports autonomous recovery after failure detection, since discovery systems generally expect application software to initiate recovery, guided by an application-level persistence policy [17]. The protocol described in this work carries out the recovery part itself, relieving the designers of pervasive services of this work. Moreover, we have greatly reduced the failure recovery time using a specialized set of data structures.

VI. CONCLUSION

In this paper, we have presented a set of specialized data structures: Mapped Eviction SND Trees, the STM, and the ETM. A mechanism is also designed for them to cooperate in order to speed up the failure recovery of a pervasive service. First, we presented a community-based object model for performing autonomous failure recovery. Next, we defined the problem of performing failure recovery using the SSDP discovery mechanism. Then, we proposed a set of specialized data structures to enhance the efficiency of failure recovery. Finally, we evaluated the feasibility of our work by measuring the failure recovery time with and without our approach. There are still some issues to be addressed in this approach; for example, the storage space for the service nodes is assumed to be unlimited. Although storage space is not a problem in desktop environments, this is not the case for resource-limited pervasive devices. Therefore, the entity replacement problem deserves further consideration.

REFERENCES

[1] Chun-Feng Liao, Ya-Wen Jong, and Li-Chen Fu, “Community-based autonomous service activation and failure recovery in a message-oriented pervasive middleware”, submitted to Workshop on Context-Aware Pervasive Communities: Infrastructures, Services and Applications (CAPC 2008), Sydney, Australia, 2008.
[2] Choonhwa Lee and Sumi Helal, “Protocols for service discovery in dynamic and mobile networks,” International Journal of Computer Research: Special Issue on Wireless Systems and Mobile Computing, vol. 11, no. 1, Nova Science Publishers, 2002.
[3] Frank Adelstein, Sandeep KS Gupta, Golden Richard III, and Loren Schwiebert, Fundamentals of Mobile and Pervasive Computing. United States of America: McGraw-Hill Professional, 2004.
[4] Steven E. Czerwinski, Ben Y. Zhao, Todd D. Hodes, Anthony D. Joseph, and Randy H. Katz, “An architecture for a secure service discovery service”, Proceedings of the 5th annual ACM/IEEE international conference on Mobile computing and networking, Seattle, Washington, United States, pp. 24-35, 1999.
[5] Chun-Feng Liao, Ya-Wen Jong, and Li-Chen Fu, “Toward a message-oriented application model and its middleware support in ubiquitous environments,” Proceedings of 2008 International Conference on Multimedia and Ubiquitous Engineering (MUE 2008), Busan, Korea, 2008, in press.
[6] UPnP Device Architecture 1.0, UPnP Forum, Dec. 2003.
[7] UPnP Device Architecture 1.0, ISO/IEC DIS 29341.
[8] Dipanjan Chakraborty, Anupam Joshi, Yelena Yesha, and Tim Finin, “Toward distributed service discovery in pervasive computing environments,” IEEE Transactions on Mobile Computing, vol. 5, no. 2, pp. 97-112, February 2006.
[9] F. Zhu, M. W. Mutka, and L. M. Ni, “Service Discovery in Pervasive Computing Environments,” IEEE Pervasive Computing, vol. 4, issue 4, pp. 81-90, 2005.
[10] Michael Jeronimo and Jack Weast, UPnP Design by Example. United States of America: Intel Press, 2003.
[11] ActiveMQ, URL:
[12] Intel Tools for UPnP Technologies, URL:
[13] CyberLink for Java: A Development Package for UPnP Devices, URL:
[14] Jini Technology Core Platform Specification, v. 2.0, Sun Microsystems, June 2003.
[15] Service Location Protocol, v. 2, IETF RFC 2608, June 1999.
[16] Salutation Architecture Specification, Salutation Consortium, 1999.
[17] C. Dabrowski and K. Mills, “Understanding Self-Healing in Service Discovery Systems”, Proceedings of the first workshop on Self-healing systems, Charleston, SC, USA, 2002.

