Open and Reliable Group Communication Processing: The FITOS-RPC Approach

Luciano Porto Barreto
Caxias do Sul University, Computer Science Department
Caixa Postal 1352, CEP 95001-970 - Caxias do Sul, RS, Brazil
([email protected])

Ingrid Jansch-Pôrto
Federal University of Rio Grande do Sul, Institute of Computer Science
Caixa Postal 15064, CEP 91501-970 - Porto Alegre, RS, Brazil
([email protected])

Abstract
This paper describes the design of a fault-tolerant group remote procedure call system based on the ISO RPC international standard ISO/IEC 11578. The system extends the original model by providing detection and handling of orphan computations caused by process crashes or network partitioning. A group communication protocol supports replicated procedure execution while preserving total message ordering among nodes. An X.500 directory service is also used to provide transparent service location.
Keywords: Remote Procedure Call, Software Fault Tolerance, Group Communication, Open Systems.

0-8186-8332-5/98 $10.00 © 1998 IEEE

1 Introduction

Remote Procedure Call (RPC) [1,2] has been widely adopted as a model for implementing distributed applications and operating systems based on the client-server paradigm. Clients invoke remote services by sending request messages to servers, which perform the requested operation and send a reply message back to the client. With RPC, this may be done similarly to conventional (local) procedure calls. The runtime environment hides the details of network communication such as service location, binding, data type marshaling and conversion. Typically RPC primitives are blocking (synchronous), but many systems allow asynchronous calls or both. Several commercial RPC systems are available, such as Sun RPC [3], DCE RPC [4], Xerox Courier [5] and Microsoft RPC [6]. Other, less known systems have been designed to achieve specific goals such as group communication (GRPC [7], MultiRPC [8]), fault tolerance (Rajdoot [9], Circus [10], Yap & Jalote [11], Atomic RPC [12], Hiltunen & Schlichting [13], Alphorn [14], Issarny et al. [15]) and performance optimization (Firefly RPC [16], Lightweight RPC [17], Real-time Mach RPC [18]).

The Open Systems Interconnection Reference Model (OSI-RM) is a seven-layer architecture in which the highest layer is intended for the construction and execution of distributed applications, providing communication between environments that are heterogeneous in operating systems, data formats, hardware architecture and so on. In 1990, ECMA (European Computer Manufacturers Association) specified a document [19] which describes the relationship between an RPC service and OSI primitives. This RPC model requires two application service elements (ASEs): ACSE (Association Control Service Element) [20] and ROSE (Remote Operations Service Element) [21]. The first is responsible for connection establishment and management between OSI application entities, while the second provides a basic interface for the execution of remote operations.

In 1996, ISO published an international standard (ISO/IEC 11578) [22] which basically follows the DCE RPC protocol specification. The document approved by ISO does not cover some important aspects such as orphan detection and group communication. However, the protocol does not impose restrictions that could prevent the implementation of mechanisms addressing these points. These omissions in the basic specification have motivated the present research, which adds mechanisms for group communication and fault treatment.

This paper addresses the implementation of FITOS-RPC (Fault Tolerant OSI RPC), a reliable group remote procedure call mechanism based on the ISO RPC protocol [22]. The original model has been extended to include failure detection mechanisms, orphan treatment and group communication properties to provide fault-tolerant behavior.
3 Fault Tolerant RPC Systems
In a distributed system, failures may occur and produce undesirable consequences in the RPC system. Besides abnormally terminated calls, these failures can generate orphan computations, which continue to run although their results are no longer needed and are often harmful. Most authors hold that orphan computations must be eliminated since they waste CPU time, cause dangerous interference and hold system resources (locks and message buffers); how much attention they require depends on the reliability demanded by applications. Failure detection has been provided since Nelson's Ph.D. thesis [1], but efficient solutions only appeared in the mid-1980s. The main mechanisms are explained below.

Using probe messages, an RPC client may periodically verify the health of a node or process. These special messages are sent when a crash of the called node is suspected. If the crash is confirmed, the call is terminated abnormally and the application is notified. The scheme is simple to implement, but probe messages may be lost in the network, leading to misjudged failures. Examples of this mechanism are DCE RPC and Yap & Jalote.

In the extermination approach, every RPC request that has not yet terminated is recorded as a pending execution. When the node recovers from a failure, the servers whose executions are pending are notified to abort them. Unfortunately, the list of pending executions must be kept in stable storage and updated before each RPC request, which introduces some overhead. A good example of this technique is given by Issarny et al.

In the reincarnation scheme, each node keeps a crashcount which is incremented every time it recovers from a failure. Moreover, each node holds the last crashcount values of the other nodes. A node therefore identifies orphan computations of a remote node when it receives a message from that node with an unexpected (greater) crashcount value.

The expiration approach assigns a deadline to each RPC request. When the deadline expires, the execution is aborted. However, this technique has two drawbacks: first, it forces a delay before a new RPC request is sent, slowing down the response time of an application; second, deadlines are difficult to estimate since RPC requests can be nested and differ in the nature of their execution. An inadequate deadline may accidentally abort slow but legitimate executions (false orphans).

Reincarnation, extermination and probe messages depend on the recovery of the failed node to be able to detect orphans. If a network partition occurs, a server is
2 Group RPC
Initially, RPC systems provided only peer-to-peer communication, in which each client interacts with a single server. To diffuse messages among process groups, they later included broadcast facilities. However, this choice implies an overhead, since a broadcast message reaches all nodes and generates undesirable network traffic. In this scenario, the concept of process groups was considered from the point of view of RPC systems. In group RPC, the client requests services from a set of servers whose behavior is similar to that of a single server. A key approach to improving service availability and masking component failures is redundancy: multiple replicas, each with independent failure modes, are employed. Dynamic inclusion of new (or recovered) members may be considered to restore the initial group reliability. According to Hiltunen & Schlichting [13], group RPC systems have three basic properties: collation, ordering and acceptance. Collation defines how the multiple responses collected from the servers are combined into a final result before it is returned to the client. The functions that do this, known as collators, can be used to implement voting mechanisms that detect erroneous or malicious behavior of a server in the group. Ordering defines the order in which concurrent calls are executed by the different members of the server group (FIFO, causal, total). Acceptance defines the quorum of responses necessary for the group RPC request to be considered successful. Another important property is membership management of dynamic groups, which updates the set of active members of a group (the group view) as members join and leave. Group communication facilities are used in systems such as GRPC, Circus, MultiRPC, Yap & Jalote, Hiltunen & Schlichting and Electra [23].
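To make the collation and acceptance properties concrete, the following Python sketch applies a quorum check to the replies collected from a server group and then collates them by majority vote. The function names (majority_collator, group_call) and the choice of a strict-majority voter are illustrative assumptions; they are not taken from FITOS-RPC or from Hiltunen & Schlichting.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_collator(replies: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replies, or None."""
    if not replies:
        return None
    value, votes = Counter(replies).most_common(1)[0]
    return value if votes > len(replies) // 2 else None


def group_call(replies: Sequence[Hashable], quorum: int,
               collator: Callable = majority_collator):
    """Apply the acceptance rule (quorum) first, then collate the replies.

    `replies` stands for the responses already gathered from the group;
    how they are gathered is outside the scope of this sketch.
    """
    if len(replies) < quorum:
        raise TimeoutError(f"only {len(replies)} of {quorum} required replies")
    return collator(replies)
```

With three replicas of which one returns a deviating value, group_call([4, 4, 5], quorum=2) yields 4, so the faulty reply is outvoted; a different collator (for example, "first reply wins") can be passed in without changing the acceptance rule.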
possibly harmful procedure execution by setting a deadline for RPC requests.
unable to recognize the existence of orphan processes. However, orphans caused by network partitions can be detected by means of the expiration scheme, as this approach does not depend on messages received from other nodes.
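The reincarnation and expiration schemes can be illustrated with a small Python sketch. The class and function names (CrashcountTable, observe, is_expired) are hypothetical, and the sketch only shows the detection decision; aborting the orphaned executions is left out.

```python
class CrashcountTable:
    """Last crashcount seen from each remote node (reincarnation scheme)."""

    def __init__(self):
        self._last_seen = {}

    def observe(self, node: str, crashcount: int) -> bool:
        """Record a crashcount; return True if it reveals a crash and recovery.

        A True result means that executions started on behalf of `node`
        under the older crashcount are orphans and should be aborted.
        """
        previous = self._last_seen.get(node)
        if previous is None or crashcount > previous:
            self._last_seen[node] = crashcount
        return previous is not None and crashcount > previous


def is_expired(started_at: float, deadline: float, now: float) -> bool:
    """Expiration scheme: a call is aborted once it outlives its deadline."""
    return now - started_at > deadline
```

Note that observe relies on a message arriving from the recovered node, whereas is_expired uses only local time, which is why only the expiration check still works when the node is cut off by a partition.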
5 ISO RPC Protocol
Prior to invoking a remote procedure in a particular interface, an RPC binding (bind) is necessary between the client and the server: this action enables the client to access the server by establishing a connection with the service provider. Binding and unbinding to a server are confirmed services, and each binding is uniquely identified by a binding-handle. If the binding procedure is successful, a bind-ack PDU (Protocol Data Unit) is returned to the client, which is then able to perform remote executions. If the server rejects an association request (e.g., because of a protocol version mismatch or because the server is too busy), a bind-nak is returned to the client. To call a remote procedure, a request is sent, and it is received at the provider as a request.indication. A client receives the results of a successful remote execution in a result message. If the remote procedure has failed, the client must receive a fault message with information about the reason for the failure. To cancel a remote operation, the client calls a cancel primitive specifying which procedure must be aborted. After the server has sent all desired results to the client, it requests the client to close the connection by sending a shutdown PDU.
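The PDU exchange described above can be read as a small client-side state machine. The Python sketch below is a simplified interpretation of that description, not the state machine of ISO/IEC 11578: the state names, the transition table and the treatment of cancel (returning directly to the bound state) are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    UNBOUND = "unbound"
    BINDING = "binding"
    BOUND = "bound"
    CALLING = "calling"
    CLOSED = "closed"


# (current state, PDU sent or received) -> next state
TRANSITIONS = {
    (State.UNBOUND, "bind"): State.BINDING,
    (State.BINDING, "bind-ack"): State.BOUND,     # binding accepted
    (State.BINDING, "bind-nak"): State.UNBOUND,   # binding rejected
    (State.BOUND, "request"): State.CALLING,
    (State.CALLING, "response"): State.BOUND,     # successful execution
    (State.CALLING, "fault"): State.BOUND,        # remote procedure failed
    (State.CALLING, "cancel"): State.BOUND,       # simplification, see lead-in
    (State.BOUND, "shutdown"): State.CLOSED,      # server asks client to close
}


def step(state: State, pdu: str) -> State:
    """Return the state reached after `pdu`, or reject an illegal PDU."""
    try:
        return TRANSITIONS[(state, pdu)]
    except KeyError:
        raise ValueError(f"PDU {pdu!r} not allowed in state {state.name}") from None


def run(pdus) -> State:
    """Feed a sequence of PDUs to a client that starts unbound."""
    state = State.UNBOUND
    for pdu in pdus:
        state = step(state, pdu)
    return state
```

For example, run(["bind", "bind-ack", "request", "response", "shutdown"]) ends in State.CLOSED, while sending a request before binding raises ValueError.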
4 Failure Assumptions
The behavioral analysis of some RPC systems in the presence of failures, and their solutions, can be found in [24,25]. During normal operation, the RPC protocol guarantees exactly-once execution semantics on the server side. When failures occur, the protocol must guarantee at-most-once execution, which does not rule out no execution or a partial execution. Multiple failures are not considered in our model, but transient or permanent failures may occur in both clients and servers. Therefore, the protocol must prevent a client from waiting indefinitely for a reply. Nodes are assumed to be fail-stop (they either work correctly or stop executing, without performing any malicious actions, also known as Byzantine behavior), and network partitions may occur due to physical (cabling or component crashes) or logical (misjudged time-outs) communication failures. Several problems may arise during the execution of an RPC request. Request messages may get lost (a) due to network congestion, message discards by gateways or routers, or the use of unreliable transport protocols such as UDP (User Datagram Protocol). This failure scenario also extends to reply messages (b). The client may fail after submitting the request (c) and then recover. The server can also crash (d) in several ways. In the latter case, three mutually exclusive failure situations may happen: a crash (d1) before, (d2) during or (d3) after procedure execution. Moreover, a client crash followed by recovery and request resubmission may leave undesirable computations, commonly known as orphans. One way to solve these problems is to have clients and servers keep some data structures related to procedure execution. The client holds a list of pending executions (the requests already submitted) so that it can resubmit them after a time-out expires or after it perceives a server recovery. This solves problem (b).
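A minimal Python sketch of the client's pending list follows. It simulates lost replies (problem (b)) with a counter instead of a real network and time-out, and it includes a stub server that remembers finished executions only so that the example runs end to end without re-executing a resubmitted request. All names (Client, Server, lose_replies, max_attempts) and the doubling "procedure" are hypothetical.

```python
class Server:
    """Stub server: executes each request id at most once, replays the reply."""

    def __init__(self):
        self.finished = {}      # request id -> stored reply
        self.executions = 0     # how many times the procedure really ran

    def handle(self, request_id, value):
        if request_id not in self.finished:
            self.executions += 1
            self.finished[request_id] = value * 2   # stand-in procedure
        return self.finished[request_id]


class Client:
    def __init__(self, server, lose_replies=0, max_attempts=5):
        self.server = server
        self.pending = {}                 # request id -> arguments
        self.lose_replies = lose_replies  # number of replies to drop (simulated)
        self.max_attempts = max_attempts
        self.next_id = 0

    def call(self, value):
        self.next_id += 1
        request_id = self.next_id
        self.pending[request_id] = value          # recorded before sending
        for _ in range(self.max_attempts):
            reply = self.server.handle(request_id, self.pending[request_id])
            if self.lose_replies > 0:             # reply lost: treated as time-out
                self.lose_replies -= 1
                continue                          # resubmit the pending request
            del self.pending[request_id]          # reply received: no longer pending
            return reply
        raise TimeoutError(f"no reply for request {request_id}")
```

In this sketch a call whose first two replies are lost still returns the correct result after the third submission, and the server's execution counter stays at one because the resubmissions carry the same request identifier.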
The server also maintains a list of received executions, which solves (d1), and a list of finished executions, which solves problems (a) and (d3). Failure (d2) is more complex and requires transactional techniques that provide atomic actions and recovery, such as checkpoints and rollbacks. Orphan detection is achieved with the reincarnation scheme, which checks the node crashcount carried in the RPC request messages. When network partitions are present, the expiration approach avoids a
Table 1: ISO RPC Primitives

PDU Type    Description
Request     Issues an RPC call
Response    Returns the results of an RPC request
Fault       Reports an operation error on the server
Bind        Binds a client to a server
Cancel      Cancels a previous RPC request
Shutdown    Server requests the client to close the connection

6 Design Approach and System Model
All communication between clients and servers is carried out by a process called the OSI RPC daemon (osirpcd). This process must be started on every node before registering services, binding or sending requests to servers. It also acts as a failure suspector, keeps membership information about groups, guarantees total message ordering and provides transparent service location to the clients. The sequence of messages exchanged in the interaction between client and server applications in FITOS-RPC is shown in Figure 1.
group processes. Symmetric and asymmetric ordering protocols are supported, permitting a process to use the symmetric version in one group and the asymmetric version in another. The asymmetric version of Newtop uses one of the members of a group as a sequencer for ordering messages. To multicast a message in a group, a process unicasts it to the sequencer, which forwards it to all processes in its current group view in the order received. A member process delivers messages (including its own) in the order in which they are received from the sequencer process. Each process maintains a receive vector which records the counter value of the latest message received from each of the other group members. Processes also have a time-silence mechanism which sends null messages to other processes to provide liveness and failure detection. The choice of the asymmetric version of Newtop ensures total message ordering between possibly overlapping groups by using a sequencer for each group. The system execution model, shown in Figure 2, is described in the following.
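Before turning to the execution model, the sequencer-based (asymmetric) ordering can be illustrated with a short Python sketch. It is not the Newtop algorithm: receive vectors, time-silence and group views are omitted, and the hold-back queue for messages that arrive early is an added assumption (over a FIFO transport from a single sequencer it would stay empty). The class names are hypothetical.

```python
import heapq


class Sequencer:
    """Assigns consecutive sequence numbers in the order messages arrive."""

    def __init__(self):
        self.counter = 0

    def order(self, sender, payload):
        self.counter += 1
        return (self.counter, sender, payload)


class Member:
    """Delivers sequenced messages strictly in sequence-number order."""

    def __init__(self):
        self.next_expected = 1
        self.held_back = []     # min-heap of messages that arrived early
        self.delivered = []

    def receive(self, message):
        if message[0] < self.next_expected:
            return              # duplicate of an already delivered message
        heapq.heappush(self.held_back, message)
        # Deliver every message that is now next in sequence.
        while self.held_back and self.held_back[0][0] == self.next_expected:
            _, sender, payload = heapq.heappop(self.held_back)
            self.delivered.append((sender, payload))
            self.next_expected += 1
```

Two members that receive the same sequenced messages in different arrival orders end up with identical delivered lists, which is the total-order property the sequencer is meant to provide.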
Figure 1: Client-server interaction in FITOS-RPC

Binding to a server starts with a query to the name server (NS), which is responsible for providing the necessary identification (address or name) of the desired server. Clients do not know a priori which server implements the service to be executed. To be available to network users, every service must be registered in the name server. The group abstraction is obtained through the bind primitive, which avoids several sequential bindings to different servers. Binding to a group of servers is similar to binding to a single server, which is treated as a single-member group. Our prototype implementation considers static process groups in which faulty members are detected and excluded by a failure suspector module. Groups are formed only once, and explicit actions of leaving and joining a group are not allowed. This approach avoids additional algorithms to control membership changes and message storage for further recovery. Its major disadvantage is its inflexibility in group management, since groups are rather dynamic in most real applications. However, the extension of this model to support dynamic groups is straightforward. Using a reliable transport protocol such as TCP/IP (without connection failures) guarantees FIFO (First In First Out) order between each client-server pair. However, some applications require stronger ordering guarantees, including causal and total order [26]. To achieve better performance and simplicity we have chosen an asymmetric group communication protocol. This protocol uses a special member called the sequencer to order messages. The algorithm is based on Lamport's logical clocks and follows the approach of Newtop (NEWcastle Total Order Protocol) [27]. Newtop is a general-purpose fault-tolerant group communication protocol which permits multiple, dynamic and overlapping groups.
It provides causality preserving total order delivery to group members, ensuring that total order delivery is preserved for multi-
Figure 2: FITOS-RPC execution model

First, each server must register itself with the name server (register). The client queries the NS (lookup) to find out which server or server group implements the desired service. The NS returns the process identifier or the group sequencer identifier in a responder message. The client issues a bind request to connect to a specific server or group and receives the confirmation in a bind-ack. After that, the client is able to make requests to the group by sending request messages. When the sequencer receives such a message, it checks the client's crashcount number in order to detect a previous crash followed by recovery. If the client has not failed, in other words, if the crashcount value received is equal to the previous one, the sequencer stores the request as pending in the Req-Pending list, forwards the message to the other group members with a
new sequence number (timestamp) and places a time-to-live in each request to prevent orphans caused by network partitioning. If the crashcount value differs from the expected one, the sequencer sends abort messages to all processes which have pending executions from the previous incarnation of the failed node. When receiving a request, the server stores it as received (in the Req-Received list). A counter is initialized to check whether the execution time (time-to-live) has expired. Before the result is sent back to the sequencer, the reply is stored as done (in the Req-Done list). The sequencer then forwards it to the original client. If the maximum execution time for an RPC has expired, the server probes the sequencer to see whether it is still alive. After a few unsuccessful tries, the server stops the procedure execution and assumes a sequencer failure. If the sequencer responds in reasonable time, the server restores the procedure execution time, allowing the procedure to continue. We used the ELROS programming language (Embedded Language for Remote Operations Service) [28] to implement daemons, clients and servers. ELROS is basically an extension of the C programming language, including some specific runtime functions for networking and data marshaling. It also allows the programmer to specify the interface of a remote operation using either conventional C structures or ASN.1 types directly. Basically, the client must encode its arguments in ASN.1 using the special type Any, and the server must accept requests defined with this type and perform some decoding to recover the original arguments sent by the client. All the fault tolerance mechanisms, such as failure detection and total message ordering, are transparent to users. Therefore, ordinary client and server applications should not need substantial modifications.
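The server-side time-to-live handling described in this section can be sketched in Python as a tick-based simulation. The function name supervise, the tick model and the default of three probes are assumptions; the text only says that the server gives up "after a few unsuccessful tries".

```python
def supervise(ttl_ticks, work_ticks, sequencer_alive, max_probes=3):
    """Simulate one procedure execution guarded by a time-to-live.

    ttl_ticks: execution budget granted each time the time-to-live is (re)set.
    work_ticks: ticks the procedure actually needs to finish.
    sequencer_alive: callable returning True if a probe is answered.
    Returns "done" if the procedure finishes, "aborted" otherwise.
    """
    remaining_work = work_ticks
    budget = ttl_ticks
    while remaining_work > 0:
        if budget == 0:                     # time-to-live expired
            for _ in range(max_probes):
                if sequencer_alive():
                    budget = ttl_ticks      # sequencer answered: restore budget
                    break
            else:
                return "aborted"            # no answer: assume sequencer failure
        remaining_work -= 1
        budget -= 1
    return "done"
```

A procedure that needs five ticks under a two-tick time-to-live completes as long as the sequencer keeps answering probes, and is stopped as a potential orphan once three consecutive probes go unanswered.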
7 Conclusions

Group process (or object) abstractions are intended to improve availability and to simplify the development of groupware applications such as teleconferencing, distributed database servers and cooperative CAD projects. Module replication can improve application performance by exploiting parallel procedure calls, using the networked workstations as a multiprocessor system. Most of the available mechanisms support either peer-to-peer or broadcast communication facilities [24,25]. These applications could be easily implemented if group communication facilities were available at the RPC level.

The design of a complete RPC system should still consider different hypotheses not completely covered in this paper. Here we briefly examine some future research topics and desirable extensions. The Newtop group communication protocol should be extended to handle dynamic group membership and partitions, considering its symmetric and fully distributed version. Transaction techniques may be integrated to achieve failure treatment during server crashes. One approach is to use the ISO CCR Protocol [29]. Currently we are considering the inclusion of a library for checkpointing management to achieve safe process re-execution. We are also implementing an X.500 directory service to register RPC servers and to provide transparent service location. Security services such as authentication and cryptography techniques could also be implemented. Detection of performance bottlenecks and optimization also require closer attention.

The system presented here differs in some respects from the original specification, the international standard ISO/IEC 11578. We used ASN.1 as an Interface Definition Language, while ISO has chosen NDR as the transfer syntax for ISO RPC. There are also other kinds of messages (PDUs) in the communication between servers and clients which have not been considered in our implementation: one is the orphaned PDU, which is used by a client to notify a server that it is aborting a request or a response in progress. The protocol specified by ISO also defines some context negotiation functions among applications which were not taken into account by our system. These are handled by an alter-context PDU, which is used to request additional presentation negotiation for another interface and/or version. The alter-context-response PDU is used to indicate the server's response. Previous OSI RPC implementations carried out by Gertosio & Maruzzi [30] and Yusheng & Hoang [31] were concerned with aspects other than the group communication and reliability covered by our approach.

Acknowledgements

Acknowledgements are due to the Computer Science Department of the Caxias do Sul University (DEW/UCS), to the Institute of Computer Science of the Federal University of Rio Grande do Sul (II/UFRGS) and to the Brazilian National Council for Scientific and Technological Development (CNPq) for their support.
References

[1] B. J. Nelson. Remote Procedure Call. Ph.D. Dissertation. CMU-CS-81-119. Carnegie-Mellon University, Pittsburgh, PA, 1981.
[2] A. D. Birrell, B. J. Nelson. Implementing Remote Procedure Calls. ACM Transactions on Computer Systems, v.2:39-59, February 1984.
[3] Sun Microsystems. RPC: Remote Procedure Call Protocol Specification - Version 2. Internet Request For Comments RFC-1057. June 1988.
[4] Open Software Foundation. DCE RPC. Cambridge, MA, December 1991.
[5] Xerox. Courier: The Remote Procedure Call Protocol. Technical Report XSIS 038112, Xerox System Integration Standard, Stamford, CT, December 1981.
[6] J. Shirley, W. Rosenberry. Microsoft RPC: Programming Guide. O'Reilly & Assoc. Inc., 1995.
[7] X. Wang, W. Zhao, J. Zhu. GRPC: A Communication Cooperation Mechanism in Distributed Systems. ACM Operating Systems Review, 27(3):75-86, July 1993.
[8] M. Satyanarayanan, E. H. Siegel. Parallel Communication in a Large Distributed Environment. IEEE Trans. Computers, 39(3):328-348, March 1990.
[9] F. Panzieri, S. Shrivastava. Rajdoot: a Remote Procedure Call Mechanism Supporting Orphan Detection and Killing. IEEE Trans. Software Eng., 14(1):30-37, January 1988.
[10] E. Cooper. Replicated Procedure Call. ACM Oper. Syst. Review, 20(1):44-56, Jan. 1986.
[11] K. S. Yap, P. Jalote, S. K. Tripathi. Fault Tolerant Remote Procedure Call. VIII Int'l Conf. on Distributed Computing Systems, June 1988, p. 48-54.
[12] K. Lin, J. Gannon. Atomic Remote Procedure Call. IEEE Trans. Software Engineering, SE-11:1121-35, Oct. 1985.
[13] M. A. Hiltunen, R. D. Schlichting. Constructing a Configurable Group RPC Service. Technical Report TR 94-28, University of Arizona, USA, October 1994.
[14] H. R. Aschmann. Alphorn: A Remote Procedure Call Environment for Fault-Tolerant, Heterogeneous, Distributed Systems. IEEE Micro, 11(5):16-19, 60-67, Oct. 1991.
[15] V. Issarny et al. Efficient Treatment of Failures in RPC Systems. XIII Symp. on Reliable Distributed Systems, Dana Point, California, Oct. 1994.
[16] M. Schroeder, M. Burrows. Performance of Firefly RPC. ACM Transactions on Computer Systems, 8(1):1-17, February 1990.
[17] B. Bershad et al. Lightweight Remote Procedure Call. ACM Transactions on Computer Systems, 8(1):37-55, February 1990.
[18] E. Burke et al. RPC Design for Real-Time Mach. OSF Research Institute, Cambridge, MA. Draft Version. April 1994.
[19] European Computer Manufacturers Association (ECMA). Remote Procedure Call using OSI (RPC). Standard ECMA-127. Second Edition, June 1990.
[20] International Organization for Standardization / International Electrotechnical Commission. Association Control Service Element. ISO/IEC 8649, 1988.
[21] International Organization for Standardization / International Electrotechnical Commission. Remote Operations. ISO/IEC 9072-1, 1988.
[22] International Organization for Standardization / International Electrotechnical Commission. Remote Procedure Call. ISO/IEC 11578, 1996.
[23] S. Maffeis. Run-Time Support for Object Oriented Programming. PhD thesis. University of Zurich, Department of Computer Science, 1995.
[24] L. P. Barreto. Fault Tolerance in Remote Procedure Call (RPC) Systems. Porto Alegre, Brazil. CPGCC-UFRGS. January 1996. TI n.501, 79 p. (in Portuguese)
[25] L. P. Barreto, I. E. S. Jansch-Pôrto. Reliable RPC Models. In: VI Seminfo, Bahia, Brazil. May 1996. p. 175-190. Proceedings. (in Portuguese)
[26] P. Jalote. Fault Tolerance in Distributed Systems. Prentice Hall, 1994.
[27] P. Ezhilchelvan, R. Macêdo, S. Shrivastava. Newtop: a Fault-tolerant Group Communication Protocol. XVZZI IEEE Int'l Conf. Distr. Comput. Syst., June 1995, p. 296-306.
[28] M. R. Boolootian et al. Using ELROS to Implement ISO Application Protocols. University of California. 1992. 175 p.
[29] International Organization for Standardization / International Electrotechnical Commission. Commitment, Concurrency and Recovery. ISO/IEC 9804, February 1990.
[30] C. Gertosio, R. Maruzzi. Conception, Realisation et Evaluation d'un RPC OSI. Electronic Journal on Networks and Distributed Processing, n. 3, v. 5. 1996. p. 1-19.
[31] L. Yusheng, D. B. Hoang. Design and Implementation of an OSI RPC System. Singapore ICCS'94, Nov 1994. p. 1195-1199.