Clique: A Toolkit for Group Communication using IP Multicast

Rajendra Yavatkar
Department of Computer Science, University of Kentucky, Lexington, KY 40506
[email protected]

James Griffioen
Department of Computer Science, University of Kentucky, Lexington, KY 40506
[email protected]

Abstract

Widespread availability of IP multicast has renewed interest in structuring distributed applications around a group communication paradigm that exploits network-layer support for multicast applications. In the past, distributed systems that provided group communication supported a restricted group communication model. Such systems are either designed to provide reliable delivery with support for atomicity and causality or to provide simple unreliable, unordered multicast delivery. We believe that the group communication abstraction is useful to many application domains. However, the group communication requirements of an application vary widely from domain to domain. This paper describes a group communication toolkit called Clique that contains the basic building blocks required to provide a flexible group communication paradigm. Clique achieves support for a wide variety of applications by tailoring the underlying multicast mechanism to meet the application’s group communication requirements with the least amount of unnecessary overhead.

1 Introduction

The group communication paradigm serves as a powerful abstraction on which to build distributed applications. Most distributed applications require the exchange and/or sharing of information among many of the participating nodes. Unfortunately, group communication has not achieved widespread use, in part due to the lack of adequate or appropriate multicast support at the (network) hardware or operating system level. Many network systems, in particular point-to-point networks, fail to provide link-level multicast support altogether, and other network architectures, such as LANs, often do not provide the type of multicast support desired by distributed applications. However, the increasingly widespread support for IP multicast combined with the IP Mbone infrastructure [8, 3] has renewed interest in structuring distributed applications around a group communication paradigm. The network-layer multicast support provided by IP multicast substantially increases the portability and geographic span of distributed applications based on group communication. That is, applications based on group communication are no longer confined to physical network architectures with multicast capabilities. Instead, any host capable of IP communication can now communicate via multicast messages with any other IP host.

Unfortunately, although most hosts now have the ability to send multicast messages, IP multicast only provides best-effort multicast service, which is inadequate for a wide variety of distributed applications. Consequently, such applications require additional support to be built on or around the simple IP multicast support. The goal of our research is to provide a flexible group communication paradigm based on IP multicast, thereby providing support for a wide range of distributed applications requiring group communication.

Group communication using multicasting has received considerable attention in the design of distributed systems. Several systems, such as the ISIS system [1], the V kernel [7], Amoeba [13], the Psync protocol [11], and various others, have proposed group communication primitives for constructing distributed applications. However, all of these systems support a restricted group communication model, designed either to provide reliable delivery with support for atomicity and causality or to simply support an unreliable, unordered multicast delivery. In reality, group communication is useful across a wide range of application domains where each domain has its own specific communication requirements. Example application domains that can make effective use of group communication include:

Dissemination Applications: Dissemination applications typically take the form of a single producer and multiple consumers. Such applications usually have a single information source which transmits the information to many locations throughout the system. Examples include distributed weather or traffic information systems, video broadcasts such as TV news, shows, or movie rentals, and whiteboard-style presentations. Such applications may require reliable and/or real-time in-order delivery of data, but do not typically require causal ordering or atomic delivery.

Collaborative Applications: Example collaborative applications include group editors and group design tools (e.g., a distributed CAD/CAM application). Such applications rely on many-to-many communication and have widely varying reliability and causality requirements [14].

Conversational Applications: These types of applications range from group email communication and network news groups to audio and video conferencing. In both cases communication is many-to-many with a wide variety of requirements related to the desired reliability, causality, atomicity, bandwidth, and delay [12].

Distributed Operating Systems: Distributed operating systems can also benefit from group communication. Maintaining and ensuring consistency of cached information typically requires invalidating or updating the set of hosts currently caching the information [6]. In this case, communication can be characterized as many-to-many: any host may send to any group. Moreover, such messages may or may not require reliability, ordering, or atomicity because applications can detect and recover from any observed inconsistencies in many cases. Thus, the underlying multicast transport need not concentrate on providing totally ordered reliable delivery.

Fault Tolerant Applications: Applications that require fault tolerance, such as fault tolerant databases, achieve reliability through distribution and redundancy. In such systems, each message must usually be transmitted to multiple recipients to ensure backup servers remain up-to-date in the event they become the primary server. Moreover, the level of reliability desired may dictate the reliability required from the group communication facility. For example, it may be sufficient to transmit to K out of N recipients.

Report In/Back Applications: Distributed monitoring applications gather information at points throughout the system and periodically report the gathered information to a single site for processing and decision making. Such “report-in” type applications can easily congest the network or the single receiver with messages. Similarly, “report-back” style applications are characterized by a single site issuing a request to multiple receivers who process the request and report the answer back to the requesting site. Like the report-in style applications, the reply messages of report-back style applications can overload the network or the receiver. Note that reliable dissemination applications are actually a form of report-back since all Acks function as replies.

Of course, several other distributed application domains could also be listed. The main point of this discussion is that group communication is useful to a wide variety of applications, but the style of group communication required varies substantially from application to application. We are building a group communication toolkit called Clique that contains the basic building blocks required to achieve a wide range of group communication semantics useful for constructing distributed applications.

2 Clique

The objective of the Clique system is to provide a flexible infrastructure that allows applications to achieve appropriate group communication semantics without sacrificing efficiency. We have identified the following issues that must be addressed by such a facility. Note that these issues are not necessarily orthogonal and thus must be resolved collectively to achieve any sort of meaningful communication semantics.

Sending Groups vs. Receiving Groups: A process group consists of a set of processes cooperating on a particular task. We can classify the interacting processes in a process group into two subgroups, namely the sending group and the receiving group. Processes in the sending group concurrently send messages to the members of the receiving group. The characteristics and size of the sending and receiving groups define a specific style of group communication selected from a wide range of possible communication styles. For example, consider a dissemination-oriented application. Communication is 1xN, with a single process disseminating information to multiple processes comprising a process group. Alternatively, some applications involve logical querying or distributed election algorithms in which many processes send replies to a single process, resulting in a concast-style (Nx1) group communication. In the more general case, both sending and receiving groups may contain several processes (MxN communication). In such a model, the issues of sequenced delivery, atomicity, and causality may (or may not) arise depending on the definition and requirements of the sending and receiving groups. For instance, if the sending and receiving groups are disjoint, as in a multiple-producer/multiple-consumer scenario, then the underlying communication provider need not worry about causal delivery.

Reliability Semantics: Some applications may tolerate unreliable delivery while others require reliable delivery. For example, distributed directory services or distributed log services often involve periodic updates and do not require reliable delivery of group messages because lost updates can be ignored provided future updates arrive within a certain time interval [2]. A different application may require reliable delivery where reliability simply means that messages from the same sender are delivered in sequence to each receiver. Still other applications may have even stricter ordering or atomicity requirements.

Atomicity: Some applications require that message delivery be atomic, where each message must be delivered to all non-failed group members or to none. Because implementation of atomicity is expensive, the group communication facility should not automatically implement it, even if reliable delivery is required.

Causality: Some applications require that message delivery be causal, where causal delivery is based on a partial order determined by the causal dependencies among the messages sent within the process group. If sending and receiving groups do not overlap, then there are no causal dependencies and the underlying group communication facility need not expend any effort to ensure causal delivery [5]. However, when sending and receiving groups overlap, messages sent by a sender may be causally dependent on the messages it previously received, and the underlying communication provider must maintain such ordering in their delivery.

Efficiency Considerations: Ordinarily, group communication involves prompt delivery of group messages to all the group members. This is the preferred mode of interaction when process group members want timely updates to some shared data or state. However, in some applications, group members may not wish to consume resources unnecessarily by eagerly receiving messages (updates). Instead, they may be content if messages are delivered only on demand in response to an explicit request from a process. This is useful when messages that update shared data or state are not immediately needed. Such an optimization also saves network resources.

To allow scalable implementations involving group communication spanning hundreds of hosts, the group communication provider must take advantage of locality of communication within a local area network and the possible hierarchical structure among communicating processes. For instance, group communication within a LAN can take advantage of the bandwidth and error characteristics of the underlying network. Also, reliable, sequenced delivery of messages to members of a process group within a LAN can be optimized locally to avoid retransmissions from a remote sender. Thus, the group communication provider should allow applications to specify expected locality and communication structure within a process group.
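The relationship between group shape and ordering requirements described above can be illustrated with a small sketch. It is not part of Clique; the function names and the use of plain string process identifiers are assumptions made for illustration. It classifies a group as 1xN, Nx1, or MxN from the sizes of its sending and receiving subgroups, and reports whether causal ordering can matter at all, which is only the case when the two subgroups overlap.

```python
from typing import Set


def communication_style(senders: Set[str], receivers: Set[str]) -> str:
    """Classify a process group by the sizes of its two subgroups."""
    if not senders or not receivers:
        raise ValueError("both subgroups must be non-empty")
    if len(senders) == 1:
        return "1xN"   # single sender disseminating to the group
    if len(receivers) == 1:
        return "Nx1"   # concast: many senders, one receiver
    return "MxN"       # general case


def needs_causal_ordering(senders: Set[str], receivers: Set[str]) -> bool:
    """Causal dependencies can arise only if some process both sends and receives."""
    return bool(senders & receivers)
```

A provider could use a check of this kind to skip causal-ordering machinery entirely for disjoint producer and consumer sets.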

2.1 Clique Facilities

To allow a wide range of semantics and scalable implementations of group communication, Clique includes the following facilities:

Naming and Addressing: A process group represents a set of processes that reside on hosts throughout a distributed system. Clique allows creation, naming, and registration of process groups. Each process group is identified by a unique group id.

Sending vs. Receiving Groups: Clique explicitly allows three styles of communication, namely dissemination, concast, and conversation, corresponding to 1xN, Nx1, and MxN communication respectively. Under the dissemination model, a single sender multicasts information to all the members of a particular group. Concast communication allows many processes to send messages to a single receiver. Messages are addressed to the group address representing the single receiver, but Clique optimizes delivery of messages from different senders by appropriate batching and flow control. Conversation-style communication represents the most general case of MxN. By defining the members of the sending and receiving groups, the system is also able to identify any possible causal dependencies.

Participation Style: Clique allows a process group member to indicate its interest in group activities by identifying its participation as an Active Listener or a Passive Listener. Typically processes will be passive listeners, which means that all messages sent to the group should be delivered to the process immediately. This is necessary when timely updates to shared data or state are needed. However, many applications do not require such eager delivery. Instead, processes may identify themselves as active listeners, indicating that they will explicitly request messages (updates). Such interaction is appropriate in a style of computing where new update messages supersede past updates (e.g., distributed directory servers or weakly consistent distributed shared memory). Thus, the group communication provider need not expend any effort to promptly deliver group messages to such a participant.

Delivery Semantics: Clique supports both unreliable and reliable delivery semantics. Reliable delivery only provides sequenced delivery of consecutive messages from the same sender to all the participants of a group. In addition, an application may separately request atomic and/or causal delivery of messages within a process group.
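As an illustration of the reliable delivery semantics just described (sequenced delivery of consecutive messages from the same sender), the following is a minimal receiver-side sketch. It is not Clique's implementation: the class name, the per-sender sequence numbers starting at 0, and the in-memory buffers are all assumptions. Out-of-order packets are buffered until the gap before them is filled, and the caller is told which sequence numbers are missing so that it could request retransmission.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SequencedReceiver:
    """Delivers each sender's messages in sequence-number order, buffering gaps."""

    def __init__(self) -> None:
        # Next sequence number expected from each sender (assumed to start at 0).
        self.next_seq: Dict[str, int] = defaultdict(int)
        # Packets that arrived ahead of a gap, keyed by sender then sequence number.
        self.pending: Dict[str, Dict[int, bytes]] = defaultdict(dict)
        # Messages handed to the application, in delivery order.
        self.delivered: List[Tuple[str, int, bytes]] = []

    def receive(self, sender: str, seq: int, payload: bytes) -> List[int]:
        """Accept one packet and return the sequence numbers still missing."""
        if seq >= self.next_seq[sender]:      # ignore duplicates of delivered data
            self.pending[sender][seq] = payload
        # Deliver every packet that is now contiguous with what was delivered.
        while self.next_seq[sender] in self.pending[sender]:
            n = self.next_seq[sender]
            self.delivered.append((sender, n, self.pending[sender].pop(n)))
            self.next_seq[sender] = n + 1
        buffered = self.pending[sender]
        if not buffered:
            return []
        return [n for n in range(self.next_seq[sender], max(buffered))
                if n not in buffered]
```

Ordering is enforced per sender only; messages from different senders are delivered in arrival order, which matches the weaker guarantee the text describes and leaves atomic or causal delivery as separate, optional layers.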

2.2 Group Management

The Clique system’s group management services are built around a general-purpose directory service similar in spirit to the sd session directory service distributed by Van Jacobson. Clique’s directory service has two primary functions: group address assignment and group address discovery. When clients request new host or process group addresses, the directory service locates and returns an unused, unique group address. Clients trying to access existing groups may also query the server to obtain a list of the hosts or processes contained in a particular group. Clique’s general group registration primitives are similar to those of other group communication systems [7, 1] and use a directory server to implement primitives that allow processes to create, join, or leave a group.
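The two directory functions named above (address assignment and address discovery) together with the create, join, and leave primitives can be sketched as a small in-memory class. This is an illustrative assumption, not Clique's interface: the class and method names are hypothetical, and group addresses are modeled as plain integers rather than real IP multicast addresses.

```python
from typing import Dict, Set


class GroupDirectory:
    """Toy directory: assigns unused group addresses and tracks membership."""

    def __init__(self, first_address: int = 1) -> None:
        self._next_address = first_address          # next unused address
        self._address_of: Dict[str, int] = {}       # group name -> address
        self._members: Dict[int, Set[str]] = {}     # address -> member ids

    def create(self, name: str) -> int:
        """Register a new group and return a previously unused address."""
        if name in self._address_of:
            raise ValueError(f"group {name!r} already exists")
        address = self._next_address
        self._next_address += 1
        self._address_of[name] = address
        self._members[address] = set()
        return address

    def lookup(self, name: str) -> int:
        """Address discovery: map a group name to its address."""
        return self._address_of[name]

    def join(self, name: str, process: str) -> None:
        self._members[self.lookup(name)].add(process)

    def leave(self, name: str, process: str) -> None:
        self._members[self.lookup(name)].discard(process)

    def members(self, name: str) -> Set[str]:
        """Return a copy of the current membership list."""
        return set(self._members[self.lookup(name)])
```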

2.3 Supporting Wide Area Group Communication

Building group communication on network-level multicast facilities allows group communication applications to span large geographical networks such as the Internet. Unfortunately, the ability to send multicast messages between geographically distant hosts does not imply that group-based applications will scale well in such an environment. Consequently, to provide improved scalability, we have identified three characteristics that typify many networks and group communication applications. These characteristics can be exploited to enhance the scalability of the system and prevent bottlenecks that are likely to arise in a wide area system such as the Internet.

First, many LANs support link-level multicast, resulting in highly efficient and well-defined delivery of multicast messages to members of the local area network. IP multicast already exploits link-level multicast when it is available, and thus some performance enhancement is realized for group communication. However, the well-defined nature of many link-level multicast mechanisms allows the group communication layer to apply further optimizations. For example, many link-level multicast mechanisms ensure sequenced, largely error-free delivery of messages (e.g., Ethernet). In such cases, the group communication layer may optimize its implementation for the expected case of loss-free delivery and only trigger error recovery when necessary.

Second, many applications exhibit locality of communication. That is, the majority of group communication occurs among members belonging to the same local area network. Even in cases involving communication across a wide area, the communication is often restricted to a few geographically dispersed locations (local area networks) across the Internet. For example, a shared whiteboard application [10] used by a group of designers may only involve members from a few branch offices across the country.

Third, reliable sequenced message delivery typically involves timeouts and retransmission requests after a missed message. However, if more than one group member resides within a local area, it is usually cheaper to request a copy of a missing message from a local participant than to incur the overhead of requesting the missed message from the original (distant) sender.

Based on these observations, Clique includes a notion of communication domains, which represent a hierarchical organization of group communication providers. A domain is a logical entity which groups together a set of hosts. A domain is typically defined by an administrative or geographical region or a subset of such a region. Each domain is managed by a domain manager which is responsible for the active groups within its domain. A domain manager optimizes group communication within a domain and also cooperates with managers in other domains to apply optimizations in cases such as those mentioned above. Moreover, domain managers play a key role in supporting active listeners in a group by caching and holding updates for active listeners.
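The third observation, recovering a missed message locally instead of from the distant sender, can be sketched as a cache held by a domain manager. The sketch below is an assumption-laden illustration, not Clique's design: the class name, the `fetch_remote` callback standing in for a wide-area retransmission request, and the unbounded dictionary cache are all hypothetical.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, int]  # (original sender, sequence number)


class DomainManager:
    """Caches group messages so that losses inside the domain are repaired locally."""

    def __init__(self, fetch_remote: Callable[[str, int], bytes]) -> None:
        self._cache: Dict[Key, bytes] = {}
        self._fetch_remote = fetch_remote   # hypothetical wide-area request
        self.remote_requests = 0            # how often the distant sender was asked

    def on_multicast(self, sender: str, seq: int, payload: bytes) -> None:
        """Record every message seen, so it can be replayed to local members."""
        self._cache[(sender, seq)] = payload

    def recover(self, sender: str, seq: int) -> bytes:
        """Serve a local member's retransmission request, going remote only on a miss."""
        key = (sender, seq)
        if key not in self._cache:
            self.remote_requests += 1
            self._cache[key] = self._fetch_remote(sender, seq)
        return self._cache[key]
```

The same cache is a natural place to hold updates for active listeners until they ask for them; a real implementation would also need an eviction policy, which is omitted here.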

2.4 Current Status

Currently, we have a prototype implementation of Clique that includes the session directory, domain managers, and process groups. Applications can request creation of a process group, register the group with the session directory, and send/receive messages directed at the group. So far, Clique supports both the dissemination and concast paradigms in both reliable and unreliable modes, and an implementation of the conversation paradigm is in progress. We plan to use the Clique building blocks in the implementation of a distributed shared memory system for a wide area network [9].

3 Experimental Results

Using the prototype system described in the previous section, we have implemented a few sample applications and performed various experiments to evaluate both the usefulness of the programming model and the overall performance compared against existing approaches. The following sections describe our test environment and the experiments performed.

[Figure 1 diagram: three sites (Kentucky, Purdue, Wash U.) connected across the Internet by Mbone tunnels, with cross-traffic on the Internet. Key: W = Workstation, G = Group Manager, M = Mbone Router, R = Internet Router/Gateway.]

Figure 1: The experimental environment. Two Mbone tunnels connect the distribution domain (University of Kentucky) to the dissemination domains (Washington University and Purdue University). Artificial cross-traffic was used to simulate a congested Internet.

                  Direct                                Clique
Message Size      Time for Delivery   Retrans. Rate     Time for Delivery   Retrans. Rate
(bytes)           (seconds)           (percent)         (seconds)           (percent)
300               1.6                 0                 0.5                 0
1K                2.1                 0                 0.9                 0
10K               29.2                36                28.0                7

Table 1: Light network load dissemination times (measured between 12am and 2am). Results in the left-hand columns were obtained using IP multicast and those in the right-hand columns were obtained using the Clique facilities.

                  Direct                                Clique
Message Size      Time for Delivery   Retrans. Rate     Time for Delivery   Retrans. Rate
(bytes)           (seconds)           (percent)         (seconds)           (percent)
300               4.6                 0                 0.5                 0
1K                5.6                 50                5.5                 50
10K               94.0                74                82.0                72

Table 2: Normal network load dissemination times (measured between 2pm and 5pm). Results in the left-hand columns were obtained using IP multicast and those in the right-hand columns were obtained using the Clique facilities.
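The delivery times in Tables 1 and 2 can be compared directly by taking the ratio of the direct time to the Clique time for each message size. The short sketch below does only that arithmetic; it assumes that the first block of figures belongs to Table 1 (light load) and the second to Table 2 (normal load), and the variable and function names are illustrative.

```python
from typing import Dict, Tuple

# (direct seconds, Clique seconds) per message size, copied from the tables.
LIGHT_LOAD: Dict[str, Tuple[float, float]] = {
    "300": (1.6, 0.5), "1K": (2.1, 0.9), "10K": (29.2, 28.0),
}
NORMAL_LOAD: Dict[str, Tuple[float, float]] = {
    "300": (4.6, 0.5), "1K": (5.6, 5.5), "10K": (94.0, 82.0),
}


def speedups(table: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Ratio of direct delivery time to Clique delivery time, per message size."""
    return {size: round(direct / clique, 2)
            for size, (direct, clique) in table.items()}


if __name__ == "__main__":
    print("light load :", speedups(LIGHT_LOAD))
    print("normal load:", speedups(NORMAL_LOAD))
```

Under these figures the ratio is largest for the smallest messages and approaches 1 for 10K messages, where transfer time and retransmissions dominate in both configurations.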

3.1 The Test Environment

Figure 1 illustrates the environment in which we ran our experiments. For the purposes of these experiments, we established three geographically distinct communication domains. Machines located at the University of Kentucky, Purdue University, and Washington University served as the three distinct communication domains. The Internet served as the communication link between the three communication domains.

Because IP multicast is not currently supported across the Internet (nor is it supported at all three test sites), we achieved an underlying multicast facility by installing an Mbone [4] router at each of the three sites. We then set up two Mbone tunnels, one between Kentucky and Purdue and one between Kentucky and Washington University (see Figure 1). Because our tests required no multicast traffic between Purdue and Washington University, a tunnel between them was not necessary. A group manager daemon was then started at each location to maintain Clique groups and manage communication within the communication domain and among the three communication domains. All multicast traffic between Kentucky and Purdue and between Kentucky and Washington University flowed through the pre-defined Mbone tunnels and experienced typical Internet delays and congestion caused by normal (real) Internet traffic.

We analyzed the performance of the system under both light and normal Internet loads (late evening and middle of the day respectively). We executed each experiment two different ways: once using Clique’s group communication facility, and once using standard Mbone multicast facilities. Each experiment was run several times to suppress any unusual network load changes. Results from the Mbone tests serve both as an indication of the performance of current wide area multicast facilities and as a useful measure against which we can compare the performance of Clique primitives.

3.2 A Sample Experiment

Dissemination Test: This test consists of communication across a dissemination group D with group members scattered across three domains. In this test, a process P on a workstation in the Kentucky domain wishes to reliably disseminate a message to all the members of the group. The process P uses the User Datagram Protocol (UDP) [?] and IP multicast [?, ?] to communicate with other members of D.

Without Clique support, P must first discover the identities of all current group members using the session directory. P must then reliably deliver the message to each group member using a combination of positive acknowledgements, retransmissions, and timeouts. If there are several members residing in the same geographical region or domain, P may send multiple copies of the message to the same domain.

However, if P uses Clique group communication support, P need not be aware of current group membership and does not need to worry about reliable delivery to the hosts in the group. Instead, P simply uses the local (reliable) multicast mechanisms to transmit the message to the domain manager in P’s domain. The domain manager then assumes the responsibility of reliably delivering a copy of the message to its peers (domain managers) in other domains. The peer domain managers, in turn, reliably deliver the message to group members within their domain. Note that P need not be explicitly aware of the identity of current members in D, and only a single copy of the message travels across the Internet between a sending and a receiving domain.

To evaluate the performance of Clique, we ran the dissemination test both with and without Clique support. In the test, a process P at Kentucky disseminated a message to three processes located in the Purdue domain, three processes in the Washington domain, and one process in the Kentucky domain. Twelve different tests were done in total: six with Clique support and six without Clique support. The tests differed in the size of the message sent (302 byte, 1K byte, or 10K byte messages) and in the time of day (late evening or middle of the day).

Tables 1 and 2 show the results of the tests. The end-to-end time measures the wall clock time required to reliably distribute the message to all the processes of the group and receive an acknowledgement back. Under Clique, the domain manager rather than the process P reliably distributes the message. Consequently, the end-to-end times shown in the third column are the wall clock times required for the local domain manager to distribute the message to all remote domain managers. The additional transfer time from the process P to the domain manager (or from a remote domain manager to a receiving process) is negligible (approximately 1ms to 15ms depending on message size).
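The difference in wide-area traffic between the two approaches can be made concrete with a small count. The sketch below is illustrative only: the function names are hypothetical, and the direct case assumes the worst case in which P ends up sending one copy per remote member (the text says only that P may send multiple copies to the same domain). With Clique, one copy crosses the Internet per remote domain that has members.

```python
from typing import Dict


def wide_area_copies_direct(members: Dict[str, int], sender_domain: str) -> int:
    """Worst case without Clique: one copy per member outside the sender's domain."""
    return sum(count for domain, count in members.items()
               if domain != sender_domain)


def wide_area_copies_clique(members: Dict[str, int], sender_domain: str) -> int:
    """With domain managers: one copy per remote domain that has any members."""
    return sum(1 for domain, count in members.items()
               if domain != sender_domain and count > 0)


# Group membership used in the dissemination test described above.
TEST_GROUP = {"Kentucky": 1, "Purdue": 3, "Washington": 3}
```

For the test group, with the sender in Kentucky, this gives 6 wide-area copies in the worst direct case against 2 with Clique.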

Acknowledgements

We would like to thank Madhu Sudan for implementing the Clique toolkit and performing various tests across the Internet Mbone. We would also like to thank John Lin and Chuck Cranor for their help setting up the Mbone tunnels. Finally, we would like to thank the reviewers for their helpful comments and suggestions.

References

[1] Ken Birman and Thomas Joseph. Reliable communication in the presence of failures. ACM Transactions on Computer Systems, 5(1):47–76, Feb 1987.
[2] Andrew D. Birrell, Roy Levin, Roger M. Needham, and Michael D. Schroeder. Grapevine: an Exercise in Distributed Computing. Communications of the ACM, 25(4):260–274, April 1982.
[3] S. Casner and S. Deering. First IETF Internet Audiocast. ACM Computer Communication Review, 22(3):92–97, July 1992.
[4] S. Casner and S. Deering. First IETF Internet Audiocast. ACM Computer Communication Review, 22(3):92–97, July 1992.
[5] D. Cheriton and D. Skeen. Understanding the Limitations of Causally and Totally Ordered Communication. In Proceedings of the Fourteenth ACM Symposium on Operating System Principles, pages 44–57, December 1993.
[6] D. R. Cheriton. Problem-oriented Shared Memory: A Decentralized Approach to Distributed System Design. In Proceedings of the 6th International Conference on Distributed Computing Systems, pages 190–197, May 1986.
[7] David R. Cheriton and W. Zwaenepoel. Distributed process groups in the V kernel. ACM Transactions on Computer Systems, 3(2):77–107, May 1985.
[8] Stephen E. Deering and David R. Cheriton. Multicast routing in datagram internetworks and extended LANs. ACM Transactions on Computer Systems, 8(2):85–110, May 1990.
[9] J. Griffioen, R. Yavatkar, and R. Finkel. Unify: A scalable, loosely-coupled, distributed shared memory multicomputer. Technical Report 226-93, University of Kentucky, January 1993.
[10] Steven McCanne. A Distributed Whiteboard for Network Conferencing. Technical report, Real Time Systems Group, Lawrence Berkeley Laboratory, September 1992. Unpublished report.
[11] L. Peterson, N. Buchholz, and R. D. Schlichting. Preserving and using context information in interprocess communication. ACM Transactions on Computer Systems, 7(3):217–246, August 1989.
[12] Clemens Szyperski and Giorgio Ventre. A characterization of multi-party interactive multimedia applications. Technical report, The Tenet Group, International Computer Science Institute and Computer Science Division, UC Berkeley, February 1993.
[13] Robbert van Renesse, Hans van Staveren, and Andrew S. Tanenbaum. The performance of the Amoeba distributed operating system. Software – Practice and Experience, 1989.
[14] Raj Yavatkar. MCP: A Protocol for Coordination and Temporal Synchronization in Multi-media Collaborative Applications. In Proceedings of the 12th International Conference on Distributed Computing Systems. IEEE, June 1992.