Designing Application Software in Wide Area Network Settings*

Mesaac Makpangou    Ken Birman

TR 90-1165
October 17, 1990

Department of Computer Science
Cornell University
Ithaca, NY 14853-7501

*This research was funded in part under DARPA/NASA subcontract NAG 2-593, and in part under DARPA contract MDA-972-88-C-0024.
Abstract

Progress in methodologies for developing robust local area network software has not been matched by similar results for wide-area settings. In this paper, we consider the design of application software spanning multiple local area environments. For important classes of applications, simple design techniques are presented that yield fault-tolerant wide area programs. An implementation of these techniques as a set of tools for use within the Isis system is described.

Keywords and phrases: ISIS, fault-tolerance, process-groups, multicast protocols, network partitions.
1
is growing
this
Process-groups,
phrases:
recognition
approach,
implement
group
protocols.
multicast systems
for use within
the
Isis system
ISIS, fault-tolerance,
[2], port
groups
plemented although
wide-area
have
different
restricted tency, not
high occur.
is described. protocols,
ronment.
cated
paper
in different will typically
of group multicast,
local
networks
and
that
work
area on
process
For example, that
although
hold
net-
and
perform
examines
wide
LAN
systems.
area
by binding
poorly
the
user
may to a local
not give
present
area
to The
to include
[5] and have
etc).
ISIS
been
im-
Unfortunately, networks
(WAN) has
been
communication
network
typical
la-
partitions
of the
acceptable
do
WAN
envi-
performance
in
environment.
interconnecting an integrated
representative
RPC
V system
low
be lost,
in a WAN by
and
communication
assume
that
constructed
applications
wide
are
cooperate so forth.
protocols
group
may
protocols
(or incorrectly)
and
and
but
that
multiRPC,
systems
messages
environment
multicast
applications
Such
such
in the
computing.
and
IPC
multicast
(LAN)
groups
most
individual
in a LAN
mechanisms
operate
groups
reliable
most
might
(process
a variety
assumptions
environment,
This
facilities
processes another,
conventional
multicast,
environments.
Consequently,
a LAN
group
extend
in distributed
of one
Likewise,
accepted
bandwidth, These
typically
groups
monitor
causal
characteristics,
to LAN
into data,
paradigm
[11], etc).
multicast,
it is widely
replicated
systems
implemented
in Chorus
(atomic
of the process-group is structured
share
for such
have
utility
software
services,
facilities
Many
of the
distributed
distributed
communication
but
as a set of tools
Introduction
There In
techniques
process interface
of the WAN
groups
lo-
abstraction, service,
which
responds to requests using local data whenever possible. Our goal is to identify and implement a suitable collection of WAN tools to assist in this process. These consist of mechanisms and protocols that assume that applications will be long-running and will experience such problems as partitions, network crashes, and long haul connection failures.

Because few WAN applications have been developed, we lack a good model for applications of this sort. To overcome this, we begin by examining problems that arise in a WAN application for the capture and analysis of seismic signals. We then turn to the mechanisms needed to solve this problem. Finally we discuss a general framework for such applications, presenting this in the context of the Isis environment.

The rest of this paper is organized as follows. Section 2 discusses our assumptions about the wide area computing environment. Section 3 discusses the applications we have selected and examines their support requirements. Sections 4, 5 and 6 discuss the mechanisms and long haul protocols that emerge from these case studies and provide performance figures for our initial implementation.

2 Background

2.1 System model

Figure 1 illustrates the overall architecture of a wide area environment. The system is composed of a set of local area networks, interconnected by point-to-point long haul links that comprise the wide area network. The term cluster denotes the set of sites belonging to a single local area network. More than one link may connect two clusters.

Computing within a cluster takes place in processes that communicate via messages. A process group is a set of processes that are cooperating for some purpose. Our work was done in the context of Isis, a system that provides extensive support for process groups and reliable communication. We assume that process groups do not span multiple clusters, and we say that groups located in different clusters are related if they communicate with one another.

A partitioned wide area application is one composed of related groups. Figure 1 depicts a situation where we have two partitioned wide area applications, represented on each cluster by the process groups named G1 and G2 respectively. A local multicast protocol designates a protocol used to multicast a message to the members of some process group. A long haul multicast protocol designates a protocol used to multicast a message to the members of a set of related groups.

2.1.1 Failure assumptions

We assume that each LAN system "isolates" the effect of a host crash, local connection failure, and LAN partition. This means that only application components located within the affected cluster are involved in the detection and handling of these events. These assumptions hold for our Isis-based implementation, but might limit the applicability of our work to other LAN-based systems.

With regard to wide-area communication, we assume that long haul connection failures, cluster crash, and WAN partition can all occur. Because clusters may be redundantly connected, we will say that a long haul connection failure occurs when a link connecting two clusters fails, and that a WAN partition occurs when all such links fail. It will be useful to distinguish two subcases of WAN partitioning:
[Figure 1: Overall architecture of a wide area system. Clusters A, B, C and D are interconnected by long haul channels.]
Controlled WAN partitioning

WAN communication lines may be costly or subject to physical constraints that cannot always be satisfied (e.g., a satellite link will need a line-of-sight path to a satellite). For these reasons, many applications use a periodic communication model. As needed (or whenever possible), clusters open communication links. Data is shipped across the links, which are then closed. We will refer to this kind of partitioning as controlled partitioning.

Unplanned partitions

A WAN partition is unplanned if it results from an unpredictable event such as the failure of the only communication line linking two clusters or the failure of a machine managing an endpoint of such a line. Such a partition is indistinguishable from the simultaneous failure of all the machines in one of the clusters. Our work assumes that no failure lasts indefinitely and hence that communication will eventually be reestablished. Accordingly, we focus on wide area applications explicitly designed to tolerate the delay introduced by unplanned partitions.

The following additional terminology is used throughout the rest of this paper. A partition is a WAN partition. An application is a wide-area application, formed of a set of related groups in separate partitions. And a connection is a single long haul communication channel.
2.1.2 An impossibility result

There exists a substantial body of work on protocols for environments subject to unplanned partitions. The work most relevant to systems like Isis is by Skeen, who proves that protocols having the characteristics of a two- or three-phase commit cannot be terminated safely in the presence of possible partitioning failures [10,6].¹ The LAN implementation of Isis uses multi-phase commit protocols at its lowest levels, to maintain information about the status (operational/failed) of process-group members. This information drives the higher levels of the system. An implication is that little of the software commonly used by Isis in LAN settings can be modified to work correctly in a WAN environment. In particular, the form of consistency that Isis supports cannot be made tolerant of network partitions without risk of "blocking" when partitions occur. The current version of Isis finesses this issue by shutting down the sites in a "minority" (smaller) partition. Were Isis to be used in a WAN setting, one would sacrifice either consistency (correct, predictable behavior) or availability.

Notice that although Skeen's results preclude any transparent scaling of the existing Isis systems - or any similar system - into a WAN environment, it is possible to make LAN systems highly resilient to failure, and the existing Isis toolkit is quite effective at using state replication for this purpose. This justifies our assumption that LAN services will be highly available (recovering rapidly from crashes) and will not lose "committed" state - the property we referred to as failure isolation, above.

¹ Readers familiar with the database literature will be aware of several approaches that yield transactional serializability in the presence of partition failures. Unfortunately, these protocols cannot be extended into protocols for consistent group management and atomic communication, which are the cornerstones of the Isis approach.
2.1.3 Long-haul channels

We initially assume that inter-cluster communication is by a communication-failure free fifo channel. Such a channel has the following properties:

• All messages sent from one cluster to another are received in the order sent.

• Inter-cluster communication is not subject to message duplication or packet loss, even in the presence of connection failures.

These characteristics are stronger than what a general purpose transport protocol like TCP or any of the five ISO transport classes provides, because we require these properties even when multiple physical communication links exist between a pair of clusters and even when links fail or are restarted during the course of execution. In Sec. 5 the implementation of a communication channel with these properties is shown to be feasible using existing Isis facilities.
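The two channel properties above can be illustrated with a small sketch. It is not the Isis implementation described in Sec. 5; it only assumes that every message carries a per-channel sequence number, that the receiver may see packets from several physical links (possibly duplicated or reordered after a link restart), and that the sender retains messages until they are acknowledged. The names FifoSender, FifoReceiver, on_packet and on_ack are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FifoSender:
    """Sending side of a hypothetical inter-cluster channel."""
    next_seq: int = 0
    unacked: dict = field(default_factory=dict)  # seq -> payload kept for resend

    def send(self, payload):
        """Number the message and remember it until it is acknowledged."""
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        return seq, payload

    def on_ack(self, delivered_up_to):
        """Cumulative ack: every seq below delivered_up_to was delivered."""
        for seq in [s for s in self.unacked if s < delivered_up_to]:
            del self.unacked[seq]

    def retransmit(self):
        """Packets to resend, e.g. over another link after a link failure."""
        return sorted(self.unacked.items())


@dataclass
class FifoReceiver:
    """Receiving side: delivers each message once, in the order sent."""
    next_seq: int = 0
    pending: dict = field(default_factory=dict)  # early arrivals by seq

    def on_packet(self, seq, payload):
        """Accept a packet from any link; return payloads now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate, e.g. a retransmission on a second link
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered
```

In this sketch loss is masked only as long as the sender survives; tolerating a sender crash would additionally require the unacknowledged messages to be kept in replicated or stable state, which is the kind of support the LAN failure-isolation assumption is meant to provide.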
2.2 Impact of WAN characteristics on protocol design

For purposes of protocol design, a wide area network (e.g. ARPANET) differs from a local area network (e.g. ETHERNET) primarily in four respects: higher latency, lower bandwidth, point to point connections, and a higher probability of partition. These differences, together with the assumption that the application components located in different LAN systems are loosely coupled (that is, they interact relatively infrequently and most interaction is asynchronous), have a substantial impact on the implementation of long haul protocols, particularly those involving more than a pair of participants (such as multi-phase commit or reliable multicast):

1. Network partition must receive more attention.

In a LAN environment, the low probability of partition makes it feasible to either ignore these events, or to implement a harsh solution such as the Isis approach cited above. Such a treatment can be justified, at least in moderately small LAN systems, because partitions will be so infrequent and because when LAN failures actually occur, they provoke large numbers of machine failures by separating application programs from resources on which they depend. If large numbers of machines are crippled by a partition failure, simply assuming that these machines have actually failed may not be unreasonable. Isis users have reported little trouble with this restriction.

In a WAN environment, partition will often be the usual state, with clusters contacting each other periodically so as to minimize the cost of maintaining open connections for long periods of time and to maximize the use of connections when they are opened. Moreover, because applications will be loosely coupled, a WAN partition will generally not trigger large numbers of machine failures. These considerations make it important to limit the impact of a partition and to provide mechanisms by which applications can offer some restricted (or autonomous) level of service in partitioned settings.

2. Multicasting only when it is really necessary.

Systems like ISIS often structure applications and services using a collection of small process groups with perhaps 3 or 4 members each. A request on such a group may be implemented as an IPC or RPC to a favored member, or as a multicast² to the full set. In this case, either all members perform the request in parallel, or one member performs the request while the others back it up for fault-tolerance. The primary/backup approach is encouraged in Isis because different group members can respond as the primary server for different requests, providing a form of load sharing. This approach is inexpensive because it benefits from the comparatively high speed of communication and because the backup processes for one request will be working actively on other requests. Moreover, the multicast itself may make use of special LAN hardware facilities.

In a WAN environment, casual use of a "large-scale multicast" could lead to poor performance due to the long latency of WAN communication, lower WAN bandwidths, and possible restrictions on establishing and using WAN communication links. Consequently, the Isis style of programming will not map transparently to WAN applications. Instead, such applications will normally communicate with the WAN application through the group representing that application on the local cluster. As much as possible, this group will respond to requests using local information. If information from a remote server is needed, it will most often request it using some form of point-to-point long haul communication. On the other hand, a WAN multicast might remain useful for asynchronous purposes, such as the diffusion of information to the groups in a partitioned wide-area application.

3 Case studies

This section discusses a series of problems
motivated by a set of wide-area seismic monitoring applications collectively called the Nuclear Monitoring Research and Development System, or NMRD, being developed by Science Applications International Corporation under contract to DARPA.³ NMRD includes several knowledge-based applications which collect, analyze and archive seismic data from a geographically dispersed network of seismic sensors, and a rich set of tools for selecting and analyzing data in the archive to address seismological issues. The system is extensively automated with rule-based AI techniques.

The largest and most complex element of NMRD is the Intelligent Monitoring System or IMS, which detects, locates, and identifies seismic events using data from a network of stations in Eurasia. IMS is structured as a collection of LAN clusters, initially placed in Washington, Norway, and San Diego. As the system is developed, there are potential requirements for expansion to include several more LAN clusters.

Our group became involved in developing LAN and WAN software for NMRD and IMS in 1989. The LAN aspects of NMRD are concerned with system fault-tolerance and configuration management, communication, LAN resource scheduling, and related issues. All of these aspects are beyond the scope of the present paper. Below, we focus on WAN use of Isis in the current IRIS prototype.

Currently, IMS is structured like a wheel, with a central "hub" in Washington, DC, that performs most of the automated data interpretation functions. A set of "spokes" connect this hub to free-standing LANs which acquire the data and do extensive signal processing to select and characterize data segments which may have signals of interest. The central interpretation done at the "hub" plays a crucial role in this selection. The spokes comprise the WAN communication network, and consist of long-distance TCP channels. Most of the WAN communication consists of automatically initiated data selection and transfer operations, with the hub software issuing requests to the remote subsystems. Because the system is automated, the fault-tolerance of these operations is critical to correct function.

In the future, IMS and other NMRD subsystems may grow to include multiple hubs, supporting seismic researchers as well as automated analysis, and this will make it important to support a number of additional WAN services. The discussion that follows examines some of these hypothetical issues after briefly commenting on the file transfer problem.

² We are using multicast in the sense of a software protocol for communicating with the full membership of a dynamically changing group - not in reference to a hardware feature.
³ DARPA Contract No. MDA972-88-C-0024

3.1 File transfer and remote notification

The most common of the WAN applications arising in IMS concern inter-LAN event notification and file transfer. The initial signal processing is done close to the data acquisition systems to avoid the requirement that all data be transferred to the hub. All acquired data are processed to detect signals and characterize them in terms of a standard set of parameters which are archived in a local commercial relational database management system (RDBMS). On a regular schedule (e.g., every 15 minutes), the hub initiates a request to transfer data from the remote RDBMS to the central RDBMS at the hub. The automated knowledge-based system (KBS) at the hub analyzes the data from all stations to locate and identify all detected events. Depending on the location and character of the events formed by the KBS, a request is formed for relevant segments of the raw data.

The sequence of steps involved in such a raw data transfer is as follows. First, the ISIS long-haul utility is invoked by an IMS program running on the hub with a message describing the data to be retrieved (station and time interval). The remote portion of IMS receives this message, retrieves the requested data and initiates the file transfer to the hub. When the file transfer takes place, a suitable spooling area is found for the incoming data, and the hub process that initiated the retrieval is notified. Finally, after the transfer has completed successfully, the remote file is deleted. This procedure is generalized by replication for additional remote sites. Fault-tolerance is key here: errors such as failure to transfer files, lost or duplicate notification messages, and so forth cause problems requiring later human intervention.⁴

3.2 Resource location

Resource location is the problem of mapping resource names into information about the location and contents of the named data objects. This is the problem solved by so-called "white pages" services, and represents an active research topic. Because the current IMS system is centralized, the problem does not yet arise. However, WAN solutions to the resource naming problem would become important if the system expands to include multiple hubs.
Imagine an IMS-like system running with many integrated computational hubs. Each of these hubs would have the ability to request information (new_data) from outer clusters (data that was not provided as part of routine processing). Obtaining and analyzing new_data may involve expensive (in terms of resources) data retrieval and processing operations. For example, it might require that a complex data adaptive beamforming operation be performed; such computations may require hours of CPU time. Clearly, one would not want to perform this sort of operation on hub A when hub B has already performed one. It follows that when a new_data request is made, a service will be needed to determine if the computation has already been performed (or is underway), and if so, whether it would be cheaper to transfer the computational results or to transfer the raw data and repeat the analysis locally.

⁴ IMS almost never "crashes" due to software failures - the system tries to handle errors gracefully. However, errors may cause the system to lose things - events, data for the analyst to review, etc. In cases where the lost data may be important, a fairly tedious manual corrective action will eventually be needed.

It is natural to think of such a version of IMS as generating and manipulating a large event-file or database. This file would identify both raw events and processed data and the location (and size, and computational cost) of the corresponding file, or the location of any hub currently engaged in such a computation. The problem can thus be reduced to one of locating resources in a WAN.

A number of difficult problems now arise. First, observe that the naming space is a dynamically changing one with several natural forms of hierarchy: physical hierarchy in space (i.e., the set of events known only within some local cluster), logical hierarchy (i.e., the set of raw-data objects associated with some new_data event), and global hierarchy (i.e., a set of events currently under consideration as evidence that a nuclear test has been detected). Operations on the naming space will be search requests, read requests, and update requests. For simplicity of design, one would want this namespace to present a seamless global abstraction. At the same time, information should be maintained close to where it will be generated or manipulated, to avoid excess WAN communication.

Consistency or coherency of such a WAN naming structure will correspond to the property that any update eventually reaches all clusters with a copy of an event descriptor, that read operations preserve the abstraction of a single global namespace, and in particular, that updates appear to be serialized. To see this, consider a computation that reads a descriptor (say, a correlation descriptor). The computation should subsequently see "current" copies of any other event descriptors on which this descriptor depends; otherwise, it would appear that the namespace has somehow become corrupted. Such a relationship is causal, and we will have more to say about mechanisms for enforcing causal orderings shortly.

For brevity, we will not develop a complete solution to this problem here. We observe, however, that the core mechanisms needed here will be ways to form WAN groups and to multicast updates to the group members. Given such tools, the resource management service would be structured into a collection of information domains within which updates would be multicast to all members. To ensure that the namespace presents a causally consistent abstraction, we will need to know that any multicast sent to such a WAN group (eventually) reaches all its members, and that if an update is dependent upon some prior update, then all WAN group members see the two updates in the order they were issued.

Notice also that once a WAN group is formed in this application, its membership remains fairly stable. Only the creation of new hubs or the addition of new sensor clusters would require changes in this part of the system configuration. Both operations will obviously be infrequent. The physical scale of WAN systems suggests that this form of stability should be fairly common. On the other hand, within such a WAN multicast group, one can easily imagine needing to send messages to a subset of the total membership.
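The causal requirement stated in this section (an update that depends on a prior update must be seen after it at every member) can be sketched as follows. This is only an illustration under simplifying assumptions, not the mechanism used by Isis: each update is assumed to carry an explicit list of the update identifiers it depends on, and a replica holds back an update until all of those have been applied. The class name CausalNamespaceReplica and the fields update_id and depends_on are hypothetical.

```python
class CausalNamespaceReplica:
    """One cluster's copy of the event-descriptor namespace (illustrative)."""

    def __init__(self):
        self.descriptors = {}  # descriptor name -> current value
        self.applied = {}      # update id -> (name, value), in application order
        self.waiting = []      # updates whose dependencies are not yet applied

    def receive(self, update_id, name, value, depends_on=()):
        """Called when a WAN multicast delivers an update, in any order."""
        self.waiting.append((update_id, name, value, tuple(depends_on)))
        self._drain()

    def _drain(self):
        """Apply every waiting update whose dependencies are all applied."""
        progress = True
        while progress:
            progress = False
            for update in list(self.waiting):
                update_id, name, value, deps = update
                if all(dep in self.applied for dep in deps):
                    self.waiting.remove(update)
                    if update_id not in self.applied:  # ignore duplicates
                        self.applied[update_id] = (name, value)
                        self.descriptors[name] = value
                    progress = True
```

A reader of such a replica never observes a correlation descriptor without the event descriptors it was derived from, even if the long haul links deliver the corresponding updates out of order. Eventual delivery of every update is still assumed to be provided by the underlying multicast.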
3.3 Resource scheduling

The above examples show how IMS uses WAN file transfer and WAN multicast. They also hint at the need to support WAN resource allocation and scheduling policies in an extended system. Notice that the existing IMS permits an analysis program or researcher working in Washington to initiate data retrieval requests and computation in Norway. This is not a major issue if there is only one analysis hub. However, with multiple hubs, it would become important to partition computational cycles among the various hub systems contending for database access and signal processing facilities. Otherwise, it would be easy for an IMS component at one location to overload a cluster located halfway around the world, preventing it from accomplishing locally critical tasks such as data compression and event detection, and denying local analysis systems a fair share of the computational resources.

We can abstract this problem as one of selling tickets for a periodic event. Only a process holding the appropriate tickets will be granted access to the processor pool on a given LAN. An "event" in this formulation might correspond to one specific hour of activity on the Norway cluster, and a ticket to a permission to perform five minutes of computation during that hour. The ticket sales problem has substantially more structure than the basic file transfer and remote notification problems seen in our first example.

A solution to this problem should address two goals. The first arises from the need to design a loosely coupled scheduling service. It should be possible to sell tickets for a future event on a remote cluster even if communication with that cluster is presently impossible, if a connection fails during the interaction, or even if a partitioning or cluster failure occurs. A second goal is that the system should satisfy the maximum number of demands possible (presumably using an application-specific cost function) while also guaranteeing fairness (also an application-specific notion).

Let us ask what can be said about this problem without speculating on the application-specific aspects. Clearly, if the distribution of tickets is static and fixed, a cluster that receives a large number of demands may not be able to satisfy all of them, while some other cluster may fail to sell some of the tickets it holds. This will compromise the second goal, and suggests that the distribution algorithm will either need a central decision making mechanism or a way to dynamically repartition the collection of tickets. A centralized policy would violate our first goal. Thus, we need a dynamic distributed allocation policy. Such an approach might pre-allocate tickets to clusters, but include a mechanism for reallocating unsold tickets as the "event period" approaches. Ideally, we would want this mechanism to make progress even if a communication failure or partition occurs.
3.3.1 Structure of the application

Assume that we have N clusters and that a group of ticket vending processes is active in each cluster. We will partition the pool of tickets into N subsets and pre-allocate each to a specific cluster. Each vending group uses its partition to serve demands from its local workers. Next, we divide the selling period into subperiods. At the end of each subperiod, each server multicasts a state message to its peers. This message reflects recent sales as well as the anticipated needs of the sender. Finally, on the basis of the state messages it receives, each server computes a new partitioning of unsold tickets using some deterministic, well known algorithm.
3.3.2 Classes of ticket repartitioning algorithms

Repartitioning algorithms can be characterized by their sensitivity to the delivery order of state messages, and by the degree to which actions by servers in different partitions are synchronized. We distinguish three classes of such algorithms:

1. Class 1 consists of algorithms that operate asynchronously and are insensitive to the order in which state messages are received from different servers. These are all fixed, well known, and deterministic repartitioning algorithms. For example, suppose that we have five servers. An algorithm in class 1 might assign 1/5 of each lot of unsold tickets carried by each state message to each server. Notice that even if different servers see state messages in different orders, the number of tickets available to a given server in a given round will be the same. Class 1 algorithms are simple and stateless: they require only that the system provide eventual delivery of each state message to its destinations, and that the set of participants be fixed before execution starts. We refer to WAN multicasts satisfying this eventual delivery property as fault-tolerant WAN multicasts.

2. Class 2 algorithms operate by having each server wait for all the round-k state messages before carrying out the repartitioning for round k+1. Such an algorithm has more flexibility than the class 1 algorithms because it operates with full knowledge of ticket sales, availability of unsold tickets, and anticipated demand. Again, the algorithm must be deterministic and well known, so that all servers can execute it in parallel. Class 2 algorithms are thus insensitive to the order in which messages are received but synchronous. Like their counterparts in class 1, these algorithms require that the system provide information about the set of participants and support for fault-tolerant multicasts.

3. Class 3 algorithms are sensitive to the delivery order of state messages and asynchronous. For example, consider a system in which a server needing tickets broadcasts its need, and servers with a surplus broadcast the existence of the surplus. One might imagine a rule under which all servers, in parallel, reallocate tickets as each such message is received. Such a scheme has the advantage of making progress as rapidly as possible, as in the class 1 algorithms, but without requiring the rigid determinism of the class 1 algorithms. However, the order in which messages containing ticket requests are received may affect the way that tickets are repartitioned in this case. In general, servers implementing class 3 algorithms may need to see all state messages in the same order, or at least in a predictable order. We will refer to such multicasts as ordered WAN multicasts.
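The order-insensitivity claimed for class 1 can be checked with a short sketch of the "1/5 of each lot" rule from item 1. The rule for splitting a lot that is not evenly divisible (leftover tickets go to the first servers in name order) is an assumption added here so that integer ticket counts work out; the function names are hypothetical.

```python
from itertools import permutations


def share_of_lot(lot_size, servers, me):
    """Tickets from one released lot that a fixed, well known rule gives to `me`.

    Assumed rule: split evenly, and hand any leftover tickets to the
    first servers in sorted name order.
    """
    ordered = sorted(servers)
    base, extra = divmod(lot_size, len(ordered))
    return base + (1 if ordered.index(me) < extra else 0)


def tickets_available(kept, lots_received, servers, me):
    """Tickets `me` holds after processing state messages in arrival order."""
    total = kept
    for lot in lots_received:  # each state message carries one released lot
        total += share_of_lot(lot, servers, me)
    return total


servers = ["A", "B", "C", "D", "E"]
lots = [10, 7, 0, 5, 3]  # unsold tickets released by A..E in this round

# Every arrival order of the five state messages yields the same total for A.
totals_for_a = {tickets_available(2, order, servers, "A")
                for order in permutations(lots)}
```

Because each lot is divided independently of the others, the result is a sum of per-message shares, so only eventual delivery matters and not delivery order. A class 3 rule, in which the outcome of one message changes how the next is handled, would not have this property.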
General remarks

1. Class 1 algorithms will perform poorly if demands are not uniformly distributed within the WAN system as a whole. Typically, for these algorithms to maximize the number of requests satisfied, the selling period will need to be divided into small subperiods. Such division will increase the wide-area network traffic, making the application components more tightly coupled.

2. Class 2 algorithms might reduce availability at certain locations. Suppose that some server has no more tickets to sell. Even if it has already received a state message indicating that unsold tickets exist on some other server, and even if the repartitioning algorithm is such that it will be allocated some of these at the repartition time, it has to wait until it receives all state messages before granting any further requests.

3. Because class 3 algorithms allow servers to operate asynchronously, these are more likely to yield a loosely coupled solution. However, class 3 algorithms need a multicast primitive with known delivery ordering properties, and this may be a more costly primitive than the one used in a class 1 asynchronous algorithm. We return to this issue below.

4. Communication failures will affect all these algorithms by delaying the delivery of state messages.

• For class 1 algorithms, delays impact ticket availability at certain locations. For example, suppose the two subsets of servers {A, B} and {C, D, E} are isolated from one another. Naturally, messages about unsold tickets released by each subset will not reach the other during the partition. Therefore any tickets released by A or B that the algorithm will assign to C, D or E will remain unused during the partition.

• For class 2 algorithms, the delay might completely inhibit ticket repartitioning for the duration of the partition.

• Finally, for class 3 algorithms, delays impact the availability of unsold tickets in certain partitions. Moreover, communication partitions might prevent the algorithm implementing atomic WAN multicast from making progress in certain partitions. For example, if WAN multicast is done using a multi-phase protocol, a partition during the first round could completely inhibit the delivery of WAN messages for the duration of the partition. This suggests that one-phase protocols are strongly preferable to multi-phase protocols in WAN settings.
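The blocking behavior of class 2 algorithms under a partition can be made concrete with a sketch of the round barrier. It assumes a fixed participant set and a caller-supplied deterministic repartitioning function; the class Class2Server and the example rule even_split are hypothetical and stand in for whatever application-specific policy is chosen.

```python
def even_split(states):
    """Assumed repartition rule: pool all unsold tickets and split them
    evenly, giving leftovers to the first servers in name order."""
    names = sorted(states)
    base, extra = divmod(sum(states.values()), len(names))
    return {n: base + (1 if i < extra else 0) for i, n in enumerate(names)}


class Class2Server:
    """Repartitions for round k+1 only after all round-k states arrive."""

    def __init__(self, name, participants, repartition=even_split):
        self.name = name
        self.participants = frozenset(participants)
        self.repartition = repartition
        self.round = 0
        self.inbox = {}         # round -> {sender: unsold tickets}
        self.allocation = None  # tickets this server may sell, once known

    def on_state(self, rnd, sender, unsold):
        """Record a state message; return True if a round was completed."""
        self.inbox.setdefault(rnd, {})[sender] = unsold
        advanced = False
        while set(self.inbox.get(self.round, {})) == self.participants:
            states = self.inbox.pop(self.round)
            self.allocation = self.repartition(states)[self.name]
            self.round += 1
            advanced = True
        return advanced
```

If server C is on the far side of a WAN partition, a server that has heard from A and B stays in round 0 with no new allocation, however many unsold tickets A and B have reported. This is the availability cost described in remark 2 and in the class 2 bullet above.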
3.4 Summary of WAN communication requirements

The examples discussed above seem representative of a reasonably large class of wide area applications. In this section, we summarize the essential WAN communication requirements that emerge.

An abstraction super-imposed upon the concept of group

WAN applications will typically need communication between a set of related groups located in different clusters. This wide area set of groups (wSet) constitutes a new WAN abstraction super-imposed upon the existing Isis LAN process group mechanisms. In such a set, each element is a group and there is at most one element on each cluster. It must be possible to transmit messages to individual members of this set of groups as well as to the set as a whole. Unlike groups in LAN settings, it seems reasonable to assume that wSets change infrequently after creation.

Fault-tolerant multicasts

Certain applications need a multicast protocol tolerant of failures. Such a protocol will eventually deliver messages to all its destinations even in the presence of partitions, network crashes or connection failures. If a server issues a fault-tolerant multicast and then fails, and the system has "accepted" the message in a sense discussed below, this fault-tolerant multicast must be delivered sooner or later to all its destinations. Conversely, when a server recovers from a crash, it should be able to recover pending fault-tolerant multicasts destined