© ACM, 1994. This is the authors' version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version is available at http://doi.acm.org/10.1145/195792.195823.
A Security Architecture for Fault-Tolerant Systems
MICHAEL K. REITER
AT&T Bell Laboratories
and
KENNETH P. BIRMAN and ROBBERT VAN RENESSE
Cornell University
Process groups are a common abstraction for fault-tolerant computing in distributed systems. We present a security architecture that extends the process group into a security abstraction. Integral parts of this architecture are services that securely and fault tolerantly support cryptographic key distribution. Using replication only when necessary, and introducing novel replication techniques when it was necessary, we have constructed these services both to be easily defensible against attack and to permit key distribution despite the transient unavailability of a substantial number of servers. We detail the design and implementation of these services and the secure process group abstraction they support. We also give preliminary performance figures for some common group operations.
Categories and Subject Descriptors: C.2.0 [Computer-Communication Networks]: General—security and protection; C.2.4 [Computer-Communication Networks]: Distributed Systems; D.4.5 [Operating Systems]: Reliability—fault tolerance; D.4.6 [Operating Systems]: Security and Protection—authentication; cryptographic controls; K.6.5 [Management of Computing and Information Systems]: Security and Protection—authentication
General Terms: Reliability, Security
Additional Key Words and Phrases: Key distribution, multicast, process groups
Because the Editor-in-Chief of ACM Transactions on Computer Systems is a coauthor of this paper, he played no role in the review process or acceptance decision for the manuscript. This work was supported under DARPA/ONR grant N00014-92-J-1866, and by grants from GTE, IBM, and Siemens, Inc. This work was performed while the first author was at Cornell University. Any opinions or conclusions expressed in this document are those of the authors and do not necessarily reflect those of the ONR.
Authors' addresses: M. K. Reiter, AT&T Bell Laboratories, Holmdel, NJ 07733; email: [email protected]; K. P. Birman and R. van Renesse, Department of Computer Science, Cornell University, Ithaca, NY 14853; email: {ken; rvr}@cs.cornell.edu.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1994 ACM 0734-2071/94/1100-0340$03.50
ACM Transactions on Computer Systems, Vol. 12, No. 4, November 1994, Pages 340-371.

1. INTRODUCTION

There exists much experience with addressing the needs for security and fault tolerance individually in distributed systems. However, less is understood about how to address these needs simultaneously in a single, integrated solution. Indeed, the goals of security (or more precisely, the goals of secrecy and integrity) have traditionally been viewed as being in conflict with goals of availability, because the only generally feasible technique for making data and services highly available, namely replicating them, also makes them harder to protect [Herlihy and Tygar 1988; Lampson et al. 1992; Turn and Habibi 1986]. This conflict is particularly pertinent to services that are involved in enforcing security policy, or in other words, that are part of the trusted computing base (TCB) [Department of Defense 1985] of a system. Prudence dictates that the TCB should be kept as small and localized as possible, in order to facilitate its protection. Distribution of TCB components makes its protection more difficult.

We have designed a security architecture for fault-tolerant systems that illustrates that this conflict need not result in an unreliable or insecure system. The architecture supports process groups, a common paradigm of fault-tolerant computing [Amir et al. 1992; Birman and Joseph 1987b; Cheriton and Zwaenepoel 1985; Kaashoek 1992; Peterson et al. 1989], as its primary security abstraction, and provides tools for the construction of applications that can tolerate both benign component failures and advanced malicious attacks. We have implemented this architecture as part of a new version of the Isis distributed programming toolkit [Birman and Joseph 1987b; Birman et al. 1991] called Horus, thereby securing Horus' virtually synchronous process group abstraction. An earlier paper [Reiter et al. 1992] presents the design rationale and an overview of the architecture. Here we emphasize
how the security mechanisms have been built to be fault tolerant and efficient.

The tradeoff between security and availability is addressed in two ways in our architecture. At the level of user applications, the secure group abstraction supported by the architecture provides a framework within which the user can balance this tradeoff for each application individually. In a secure process group, applications can be efficiently replicated in a protected fashion: authentication and access control mechanisms enable the group members to prevent untrusted processes from joining, and if the members admit only processes on trustworthy sites, the members will enjoy secure and correct group communication semantics among themselves. These protection mechanisms limit how attackers can interfere with applications and, in particular, enable the user to control exactly where and how widely each application is replicated.

The second and more critical level at which this conflict is addressed is within the core security services in the TCB that underlie these mechanisms, and indeed the security of all process groups. As do other security architectures, ours uses cryptography to protect communication, and this in turn requires that a secure means of key distribution exist. Most key distribution mechanisms employ trusted services whose corruption or failure could result in security breaches or prevent principals from establishing secure communication; it is in these services that the conflict between security and availability is most apparent. We have developed an approach to reconciling this conflict that exploits the semantics of these services and novel replication techniques to achieve secure, fault-tolerant key distribution.

The implementation of our security architecture as part of Horus has
brought performance and user interface issues to the forefront of our work, as well. By using caching extensively and moving costly operations off critical paths
whenever possible, we have achieved reasonable performance in the secure version of Horus without resorting to network-wide cryptographic hardware. This is particularly true for group communication, which we expect to account for the vast majority of group operations in most applications. Moreover, the implementation of the security mechanisms in Horus resulted in minimal changes to the Horus process group interfaces. So, tools
and applications designed for the Horus interfaces should port easily to secure groups.

We present here the security architecture as realized in Horus and elaborate on the contributions just described. In Section 2 we present the programming model of secure process groups, with an emphasis on the security features that augment the Horus process group abstraction. Here we also discuss some implementation challenges posed by the programming model. In Section 3 we present our method of fault-tolerant key distribution, which we use to support secure process groups. In Section 4, we detail the implementation of secure process groups and give performance numbers for common group operations. We conclude and discuss related work in Section 5.

2. SECURE PROCESS GROUPS

The basic abstraction provided by Horus is the process group, which is a collection of processes with an associated group address. Groups may overlap arbitrarily, and processes may create, join, and leave groups at any time. Processes communicate both by point-to-point methods (e.g., RPC) and by group multicast, i.e., by multicasting a message to the entire membership of a group of which it is a member. Further, Horus supports a virtually synchronous [Birman and Joseph 1987a] execution model, so that message deliveries and changes to the group view (i.e., the membership of the group) appear atomically and in a consistent order at all group members, even when failures occur.

Our security architecture makes the Horus programming model robust against malicious attack, while leaving the model itself unchanged. First, during group joins, the group and the joining process are mutually authenticated to one another. More precisely, the group members are informed of the site from which the process is attempting to join, as well as the owner of the process according to that site. Any effort by an intruder to replay a previous join sequence or to forge the apparent site from which a request is sent will be detected. And, the joining process knows that responses apparently from the group are actually from the group that it is trying to join. Second, a group member must explicitly grant each group join before the join is allowed to proceed. If the group members choose not to admit the joiner, they can deny the request, in which case the joiner will not be admitted.
A Security Architecture
for Fault-Tolerant
(’”J=
Systems
highly secure sit, es
I authentication
(34 ion
a
(a)
A
~
process
to the
either
Inside
grant
the
or deny
group,
join
the
tive
if
requested,
from
that
site
authenticated
passive
the that
ment
integrity
that
a group
admit
a corrupt
which
case their the
point-to-point
and
site
corruptions.
sites
secured
only
as secure
process
to limit A
from
secu-
damage
group
to different as the
many
internal
from
can
span
levels,
least
but
secure
is site
groups,
members
trusted
are trustworthy.
by the
of the process
group
members
requesting
to join,
not
mechanisms
of group
has admitted
processes
prevent
Secure
of these
has tampered
member
policies
to enforc,s
In particular, to have
if
properly
the members
may
the request.
as the authenticity
group
1.
all group
is not
the owner
choose to deny
intruder
ac-
can be built
groups
admitted.
provided
requesting
Third,
rity
attacks.
admitted
well
secure is
from
(
(b) Applications
request.
Fig.
the
is
members,
communication
cryptographically
network
e$
to
group
protected and,
G
““’’’C”’
requesting
authenticated who
sites
moderately secure sites
6’----7 protected
:343
secure group) insecure
cmnnmnic.t
.
no processes with admit
site
disclosure
group.)
will
communication
sites,
on corrupt since
messages
is also supported
abstractions,
i.e., sites
to
(The
also
process
request sent,
a network
an
require-
does not imply
being
within
each
at which
system.
sites
can before
as
within
an untrustworthy
Members
be encrypted
of those
Eiorus
or operating
processes
to the
the
are guaranteed
on corrupt
the hardware
need not be trusted, messages
and
communication,
tlhat could
secrecy,
in
in an effort
to
intruder.
and outside
Secure of groups,
and in particular between group members and clients of groups [Birman et al. 1991], although in this article we focus on secure group communication.

The programming model thus presented to the programmer is one in which each process group can be viewed as a "fortress," where admission is regulated by the group members themselves (see Figure 1(a)). A setting to which this style of secure group is particularly well suited is one in which a fault-tolerant service must be provided to a larger, untrustworthy system against which the service must protect itself [Reiter et al. 1992]. Such an application could be composed of a single secure group located on a small "island" of trustworthy sites. Alternatively, a larger application in which greater internal control is required could be implemented using many secure groups, arranged to enforce security policies within the application and to limit the damage to the overall application from the corruption of a site (see Figure 1(b)). While the groups could span sites with different levels of trustworthiness, each group is only as secure as the least secure site or process it contains.

When implementing secure process groups in Horus, we were faced with many issues in fault tolerance, performance, and integration, including the
following:

- Because process groups are a fault tolerance tool, it was important that the integration of our security mechanisms not increase the sensitivity of the process group abstraction to failures. This was most difficult to achieve in authenticating the origin of join requests, since all known techniques for authenticating principals in open networks rely on trusted services whose unavailability could inhibit authentication but whose replication can make them more difficult to protect. Thus, we were forced to devise new techniques for achieving fault-tolerant authentication and key distribution to support authenticated group joins.

- In Horus, a process seeking to contact a group to obtain a service, or to join, will generally not know the current membership of the group and does not need to. Moreover, requiring this knowledge would involve substantial changes to Horus and significant overheads in the system. So, it was important that an outsider's ability to authenticate group members not rely on accurate knowledge of the group membership.

- Group communication can offer substantial performance benefits if the underlying network supports broadcast or multicast [Kaashoek 1992]. We felt that it was necessary to retain these potential benefits as much as possible, without requiring that special-purpose cryptographic hardware be deployed on all sites. This goal was particularly crucial to Horus, since experience with Isis suggests that group communication will be very common in Horus.

- Horus offers a variety of ordering guarantees on the delivery of group multicasts. One of these, the causal ordering property, raises security issues when causal relationships exist between multicasts in different overlapping groups [Reiter and Gong 1993; Reiter et al. 1992]. It was important to identify potential security threats to applications that employ this ordering property and to provide defenses against these threats whenever possible.

The following sections detail how we addressed these and other issues in
the implementation of our security architecture. Section 3 presents techniques to achieve fault-tolerant key distribution, which we use in our architecture to support authenticated group joins fault tolerantly. However, these techniques are also of interest outside of the context of our security architecture and could be useful in a wide range of systems, and so in Section 3 we present them in a general light. A discussion of their use in our security architecture is deferred until Section 4, where we focus on the implementation of secure process groups.

3. FAULT-TOLERANT KEY DISTRIBUTION

In open networks, an intruder can attempt to initiate spurious communication in two ways [Voydock and Kent 1983]: it can try to initiate communication under a false identity, or it can replay a recording of a previous initiation sequence. Many authentication or key distribution protocols have been proposed to protect against these attacks (see Denning and Sacco [1981], CCITT [1988], Kent [1993], Needham and Schroeder [1978], Steiner et al. [1988]). These protocols allow principals (e.g., computers, users) initiating communication to verify each others' identities and the timeliness of the interaction. Most also arrange for the involved principals to share a secret cryptographic key by which subsequent communication can be protected, or to possess each others' public keys, by which either communication can be protected or a shared key can be negotiated.

Authentication protocols typically employ a trusted service, commonly called an authentication service [Needham and Schroeder 1978], to counter the first type of attack. In shared-key protocols, the authentication service normally shares a key with each principal and uses these keys to distribute other shared keys by which principals communicate. In public-key protocols, the authentication service usually has a well-known public key and uses the corresponding private key to certify the public keys of principals.

A predominant technique to detect replay attacks in authentication protocols is to incorporate into each protocol message the time at which the message was generated; the message is then valid for a certain lifetime, beyond which it is considered a replay if received [Denning and Sacco 1981]. Timestamp-based replay detection has been used in several systems (e.g., Steiner et al. [1988], Tardo and Alagappan [1991], Wobber et al. [1993]) and is often preferable to challenge-response techniques [Needham and Schroeder 1978], because it results in fewer protocol messages and less protocol state. However, using timestamps requires that all participants maintain securely synchronized clocks. In practice, clock synchronization is usually achieved via a time service, as in Gusella and Zatti [1984] and Mills [1989].

The dependence of authentication protocols on authentication and time services raises troubling security and availability issues. First, the assurances provided by authentication protocols rely directly on the security of these services, and thus these services must be protected from corruption by an intruder. Second, the unavailability of these services may prevent principals from establishing secure communication, or even open security "holes" that could be exploited by an intruder. For instance, the unavailability of a time service could result in clocks drifting far apart, thereby exposing principals to replay attacks. To increase the likelihood of these services being available, they could be replicated. However, as already noted in Section 1, this is dangerous in some environments, because replicating data or services makes them inherently harder to protect.
We have developed techniques to reconcile the conflict between security and availability in these services. By using replication only when necessary, and introducing novel replication techniques when it was necessary, we have constructed these services to be easily defensible against attack. And, the transient unavailability of even a substantial number of servers does not hinder key distribution between principals or expose protocols to intruder attacks. Client interactions with the services are simple and efficient, and the services can be used with many different authentication protocols.
3.1 The Time Service

The security risks of clock synchronization failures in authentication protocols are well known [Denning and Sacco 1981; Gong 1992], and the need for a secure time service that cannot be tampered with or impersonated has been recognized in several systems (see Bellovin and Merritt [1990] and Mills [1989]). We claim, however, that the case for a highly available time service is not as clear. It is true that an extended period of unavailability might cause principals to have increasingly disparate views of real time. But, this in itself need not result in weaknesses in security or inhibit communication too quickly. In evidence of this, we propose an algorithm by which clients securely estimate real time and that allows key distribution to proceed even during a lengthy unavailability of the time service. This has allowed us to explicitly not replicate our time service, so that it will be easier to protect, and to achieve resilience to a time service unavailability through the client algorithm for estimating time. We describe this algorithm in Section 3.1.1 and discuss alternatives to our approach in Section 3.1.2.

As will be discussed in Section 3.1.2, the algorithm of Section 3.1.1 is heavily influenced by previous work in clock synchronization. As such, its contribution lies mainly in how clock synchronization techniques can be adapted for use in our setting to achieve simple, fail-safe time estimation in key distribution protocols with an easily defensible, centralized time service.
3.1.1 The Algorithm. Clients interact with our time service by the simple RPC-style protocol shown in Figure 2. We assume that the time server possesses a private key $K^{-1}$ whose corresponding public key $K$ is well known. (There is a similar shared-key protocol.) At regular intervals, a client queries the time service with a nonce identifier $N$ [Needham and Schroeder 1978], a new, unpredictable value. When the time server receives this request, it immediately generates a timestamp $T$ equal to its current local clock value and replies with $\{N, T\}_{K^{-1}}$, i.e., the nonce and the timestamp, both signed with $K^{-1}$. The client considers the response valid if it contains $N$ and can be verified with the public key of the time service. The method by which a client uses this response rests on the following additional assumptions:

(1) The client has access to a local hardware clock $H$ whose measurement $H(t) - H(t')$ of a real time interval $[t', t]$ satisfies $(1 - \rho)(t - t') \le H(t) - H(t') \le (1 + \rho)(t - t')$, where $\rho$ is a known constant.
0, we have that O < Qz s * Keith (personal
Marzullo
has suggested
communication,
Feb. ACM
the
possibility
1993).
However,
Transactions
of dynamically
measuring
we do not pursue on Computer
Systems,
this Vol.
the server the client al + a!l is R
– mini
1~ on a per-client
basis
here. 12, No.
4, November
1994.
348
Michael K. Reiter et al.
.
A
and so after + ag satisfies
the client
— minz,
verifies
the response,
real
time
t = T + minz
~~[T+minz,T+R–mini].
By (1), it follows that at any time $t$ after $\bar{t}$, real time is bounded by the interval

$$[L(t),\ U(t)], \qquad (4)$$

where

$$L(t) = \frac{H(t) - H(\bar{t})}{1 + \rho} + T + \mathit{min}_2$$

and

$$U(t) = \frac{H(t) - H(\bar{t})}{1 - \rho} + T + \frac{r}{1 - \rho} - \mathit{min}_1.$$

To estimate the time, the client uses either $L(t)$ or $U(t)$, depending on which is more conservative. In particular, principals use the following rules for estimating time to detect replays of authentication protocol messages:

(1) When timestamping an authentication protocol message to allow others to detect a later replay of that message, the sender sets the message timestamp to $T = L(t)$.

(2) A recipient accepts an authentication protocol message with timestamp $T$ as valid at time $t$ only if $T + \Delta > U(t)$, where $\Delta$ is the predetermined lifetime of the message.
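The two rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the Horus implementation: the names `SyncState`, `lower_bound`, `upper_bound`, `make_timestamp`, and `accepts` are hypothetical, and the sketch assumes the client has already verified a time service response and recorded the values that appear in (4).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncState:
    """Values kept from the last successful resynchronization (hypothetical container)."""
    h_sync: float      # hardware clock reading when the response was verified
    T: float           # timestamp returned by the time server
    r: float           # round-trip time measured on the local hardware clock
    rho: float         # known bound on the hardware clock drift rate
    min1: float = 0.0  # the constant min_1 of (4)
    min2: float = 0.0  # the constant min_2 of (4)


def lower_bound(s: SyncState, h_now: float) -> float:
    """L(t): the earliest real time consistent with the current clock reading."""
    return (h_now - s.h_sync) / (1 + s.rho) + s.T + s.min2


def upper_bound(s: SyncState, h_now: float) -> float:
    """U(t): the latest real time consistent with the current clock reading."""
    return (h_now - s.h_sync) / (1 - s.rho) + s.T + s.r / (1 - s.rho) - s.min1


def make_timestamp(s: SyncState, h_now: float) -> float:
    """Rule (1): a sender stamps a message with the lower bound L(t)."""
    return lower_bound(s, h_now)


def accepts(s: SyncState, h_now: float, timestamp: float, lifetime: float) -> bool:
    """Rule (2): a recipient accepts only if timestamp + lifetime > U(t)."""
    return timestamp + lifetime > upper_bound(s, h_now)
```

Using the lower bound when sending and the upper bound when receiving means that any estimation error shortens a message's effective lifetime and never extends it.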
The benefit of this scheme is that it is fail-safe, in the following sense:

THEOREM 3.1.1.2. An authentication protocol message with lifetime $\Delta$ sent by a (correct) client at time $t$ will never be accepted by another (correct) client after time $t + \Delta$.

PROOF. Suppose a client sends an authentication protocol message at time $t$. The timestamp $T = L(t)$ for the message satisfies $T \le t$. Now consider a recipient at time $t + \Delta$, where $\Delta$ is the lifetime of the message. Since at the recipient, $t + \Delta \le U(t + \Delta)$, it follows that $T + \Delta \le U(t + \Delta)$. Thus, the message will be rejected as invalid. ❑
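The argument can also be written as one chain of inequalities. This is only a restatement of the proof, assuming (as the interval (4) provides for a correct client) that real time always lies between that client's lower and upper bounds. The subscripts $s$ and $r$, for the sender and the recipient, are notation introduced here for illustration.

```latex
\begin{align*}
T &= L_s(t) \le t
  && \text{(sender: real time } t \in [L_s(t),\, U_s(t)]\text{)} \\
T + \Delta &\le t + \Delta \le t' \le U_r(t')
  && \text{(recipient: any real time } t' \ge t + \Delta\text{)}
\end{align*}
```

So the acceptance condition $T + \Delta > U_r(t')$ fails at every real time $t' \ge t + \Delta$, however far the two hardware clocks have drifted within assumption (1).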
Because the interval (4) grows wider with time, periodically each client resynchronizes with the time service in order to narrow its interval. A successful resynchronization results in new values of $H(\bar{t})$, $r$, and $T$ for the calculation of $U(t)$ and $L(t)$. Resynchronization attempts can fail, however, when the round-trip time $r$ for the attempt exceeds some timeout value. When this happens, the client continues to attempt to resynchronize with the service at regular intervals, while retaining the values of $T$, $r$, and $H(\bar{t})$ obtained in the last successful resynchronization to calculate $L(t)$ and $U(t)$. So, if the service becomes unavailable, clients' intervals will continue to widen. If the service is unavailable for too long, eventually the principals' values of $U(t)$ will exceed their values of $L(t)$ by the protocol message lifetimes, and all messages will be perceived as expired immediately upon creation.
While this bounds the amount of time that the system can operate without the time service, calculations in our system indicate that this bound is not very tight. For example, consider two principals $P_1$ and $P_2$, each of whose clocks is characterized by $\rho = 10^{-5}$, and suppose for simplicity that the values of $\bar{t}$ and $T$ corresponding to the last resynchronization for each prior to a time service crash are the same. Moreover, suppose that $\mathit{min}_1 = \mathit{min}_2 = 0$ and that the value of $r$ obtained by $P_2$ in its last resynchronization is 0.5 seconds. Then, even if the clocks of $P_1$ and $P_2$ drift apart at the maximum possible rate (i.e., the clocks of $P_1$ and $P_2$ are as slow and as fast as possible, respectively, while still satisfying (1)), it will be over 20.4 hours from $\bar{t}$ before the value of $U(t)$ at $P_2$ exceeds the value of $L(t)$ at $P_1$ by 30 seconds, a relatively short message lifetime in comparison to that suggested by Denning and Sacco [1981]. Additionally, the parameters used in the above calculation are very conservative for most settings, and tests in our system show that a time service unavailability can typically be tolerated for much longer. These results lead us to believe that the system, if tuned correctly, should be able to operate without the time service for sufficiently long to restart the time service, even if the restart requires operator intervention.

3.1.2 Comparison to Alternative Designs. We derived our algorithm from that presented by Cristian [1989] for implementing a time service. The primary difference between ours and Cristian's lies in how clients use the interval (2). In the latter, the client uses the midpoint of (2) as its estimate of the time at time $\bar{t}$, since this choice minimizes the maximum possible error, and the client estimates future times as an offset, equal to the measured time since the last resynchronization, from this midpoint.2 However, like any other clock synchronization algorithm in which each client maintains a single clock value, this algorithm is not fail-safe: e.g., if the midpoint of (2) were too low, then the client's future estimates of the time would tend to be low, and thus expired messages may be incorrectly accepted. We feel that our approach, which is fail-safe, is better for our purposes.

2 This is a simplification of the client algorithm by Cristian [1989]; the actual algorithm also takes measures to ensure that client clocks are continuous and monotonic. These features, however, are unimportant for our purposes.

A reasonable alternative to not replicating our time service is to replicate it for high availability and to compensate for the increased difficulty of protecting the service by making it tolerant of the corruption of some servers. For instance, a client could use the robust averaging algorithm of Marzullo [1990] to obtain an interval of bounded inaccuracy containing real time from a set of $n$ time servers, if fewer than $\lceil n/3 \rceil$ servers are faulty or corrupt. This approach might be attractive if clients are highly transient, and thus a time service unavailability will prevent large numbers of clients from synchronizing initially with the service at client startup. However, this is unlikely to be the case in most systems, where time service clients are sites that do not tend to reboot frequently. Moreover, a replicated time service places a larger burden on the administrator of the service than does ours, since the administrator must protect multiple servers, instead of only one, to ensure the
integrity of the service. For these reasons and the additional costs of replication (e.g., authenticating and maintaining multiple time servers), we feel that a replicated time service is difficult to justify for our purposes.

Also discussed by Marzullo [1990] are approaches to evaluating a predicate on a physical state variable despite the impossibility of accurately measuring that variable. It is observed that given a range of values that is known to contain the actual physical value, safe evaluation of the predicate may require that all values in the range satisfy the predicate, or that only some value in the range does. Our approach of estimating time conservatively with the endpoints of (4) can be viewed as an instance of the former approach, where the physical state variable being measured is time; the range containing time is (4); and the predicates of interest relate time to timestamps in authentication protocol messages.

Numerous other approaches to clock synchronization have been proposed (see, e.g., Simons et al. [1990]), but for brevity, we do not discuss them all here. Unlike ours, however, most assume upper bounds on message transmission times or employ greater distribution, thereby increasing the number of components that must be protected in the system. Moreover, to our knowledge none provide a fail-safe algorithm for estimating time in authentication protocols. We thus feel that our approach is unique in providing this property with relatively few requirements.
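The distinction drawn in this section between requiring a predicate to hold for all values in a range and for only some value can be illustrated with a short Python sketch. It is not taken from Marzullo [1990] or from Horus: the function names are hypothetical, and the shortcut of testing only the two endpoints is valid only under the stated assumption that the predicate is monotone, i.e., a threshold test such as "the time is before the expiry instant."

```python
from typing import Callable


def holds_for_all(lo: float, hi: float, pred: Callable[[float], bool]) -> bool:
    """Conservative evaluation: the predicate holds at every value in [lo, hi].

    Assumes pred is monotone (true exactly on a half-line), so checking
    both endpoints is enough.
    """
    return pred(lo) and pred(hi)


def holds_for_some(lo: float, hi: float, pred: Callable[[float], bool]) -> bool:
    """Permissive evaluation: the predicate holds at some value in [lo, hi].

    Under the same monotonicity assumption, one endpoint satisfying pred suffices.
    """
    return pred(lo) or pred(hi)


def not_expired(expiry: float) -> Callable[[float], bool]:
    """Build the predicate 'real time is still before the expiry instant'."""
    return lambda now: now < expiry
```

With the range set to the client's interval (4), accepting a message only when `holds_for_all` reports that it is unexpired is the conservative choice, since for this predicate the test reduces to comparing the expiry instant with the upper endpoint.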
3.2 The Authentication Service

Our authentication service is of the public-key variety, that produces public-key certificates for principals. Each certificate $\{P, T, K_P\}_{K^{-1}}$ contains the identifier $P$ of the principal, the public key $K_P$ of the principal, and the expiration time $T$ of the certificate, all signed by the private key $K^{-1}$ of the authentication service. A principal uses these certificates to map principal identifiers to public keys, by which those principals (who presumably possess the corresponding private keys) can be authenticated; the details are discussed in Lampson et al. [1992]. In general, a principal can request a certificate for any principal from the authentication service.

The need for security in such an authentication service is obvious: as the undisputed authority on what public key belongs to what principal, the authentication service, if corrupted, could create public-key certificates arbitrarily and thus render secure communication impossible. It would also appear that, unlike the time service, the authentication service must be highly available, since its unavailability could prevent certificates from being obtained or refreshed when they expire. Other researchers have also noted that both security and availability, and thus the conflict between them, must be dealt with in the construction of authentication services [Gong 1993; Lampson et al. 1992].

The most common approach to address this conflict in public-key authentication services is to implement the service using two services: a highly secure certification authority that creates certificates, and a highly available certificate database that distributes them (see CCITT [1988], Kent [1993], Lampson et al. [1992], Tardo and Alagappan [1991]). Our approach differs in that it performs both of these functions in a single
replicated service, but does so in such a way that the service remains correct and available despite even the malicious corruption of a minority of servers. So, the conflict between security and availability is addressed by replicating
the service for availability, but compensating for the increased difficulty of protecting the service by making the service tolerant of successful attacks on servers. We first describe our approach, and then compare it in detail to other alternatives.
3.2.1 The Algorithm. Reiter and Birman [1994] describe a technique for securely replicating any service that can be modeled as a state machine. The technique is similar to state machine replication [Schneider 1990], in which a client sends its request to all servers and accepts the response that it receives from a majority of them. In this way, if a majority of the servers is correct, then the response obtained by the client is correct. The approach of Reiter and Birman provides similar guarantees but differs by freeing the client from authenticating the responses of all servers. Instead, the client is required to possess only one public key for the service and to authenticate only one (valid) response, just as if the service were not replicated. We have constructed our authentication service using this technique. In its full generality, the system administrator can choose any threshold value k and create
any number $n \ge k$ of authentication servers such that the service has the following properties:
Integrity. If fewer than k servers are corrupt, the contents of any properly signed certificate produced by the service were endorsed by some correct server.
Availability. If at least k servers are correct, the service produces properly signed certificates.
As indicated above, a natural choice for the threshold value is $k = \lfloor n/2 + 1 \rfloor$, so that a majority of correct servers ensures both the availability and the integrity of the service.
Our technique employs a threshold signature scheme. Informally, a (k, n)-threshold signature scheme is a method of generating a public key and n shares of the corresponding private key in such a way that for any message m, each share can be used to produce a partial result from m, where any k of these partial results can be combined into the private-key signature for m. Moreover, knowledge of k shares should be necessary to sign m, in the sense that without the private key it should be computationally infeasible to
(1) create the signature for m without k partial results for m,
(2) compute a partial result for m without the corresponding share, or
(3) compute a share or the private key without k other shares.
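The Integrity and Availability conditions, together with the majority threshold $k = \lfloor n/2 + 1 \rfloor$, can be checked with simple arithmetic. The following is a minimal sketch only; the function names are hypothetical and not part of the system described here, and it assumes every server is either correct or corrupt.

```python
def majority_threshold(n: int) -> int:
    """Natural threshold choice k = floor(n/2 + 1) for n servers."""
    return n // 2 + 1


def integrity_holds(k: int, corrupt: int) -> bool:
    """Integrity: fewer than k servers are corrupt."""
    return corrupt < k


def availability_holds(n: int, k: int, corrupt: int) -> bool:
    """Availability: at least k servers are correct.

    Assumption: every server that is not corrupt counts as correct.
    """
    return n - corrupt >= k


def tolerated_corruptions(n: int, k: int) -> int:
    """Largest number of corrupt servers for which both properties hold."""
    return min(k - 1, n - k)


if __name__ == "__main__":
    for n in (1, 3, 4, 5, 7):
        k = majority_threshold(n)
        print(n, k, tolerated_corruptions(n, k))
```

With the majority threshold, both properties hold exactly when a majority of the servers is correct; choosing a larger k trades availability for integrity, and a smaller k does the opposite.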
Our replication technique does not rely on any particular threshold signature scheme. For our authentication service, we have implemented the one of Desmedt and Frankel [1992], which is based on RSA [Rivest et al. 1978].
Given a (k, n)-threshold signature scheme, we build our authentication service as follows. Let $\mathcal{A} = \{AS_1, \ldots, AS_n\}$ be the set of authentication servers. These servers should satisfy the same specification, although they need not be identical; in fact, it may be preferable that they be developed independently, to prevent a (possibly deliberate) design flaw from affecting all of them [Joseph 1987]. We first choose a threshold value k and create n shares from the private key $K_A^{-1}$ of the authentication service. Each authentication server $AS_i$, when started, is given the ith share of $K_A^{-1}$, its own private key $K_{AS_i}^{-1}$, the public key $K_{AS_j}$ of each server $AS_j$, and the public keys for all principals. It is also given the public key of the time service to synchronize its clock as in Section 3.1.1.
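The ideas of creating shares from a private key, computing partial results, and combining them can be illustrated with a toy example. The sketch below is not the Desmedt and Frankel scheme and is not a general (k, n) scheme: it shows only the degenerate (n, n) case, using additive shares of an RSA private exponent, with deliberately tiny and insecure parameters. All names and values in it are assumptions made for illustration.

```python
import secrets
from functools import reduce

# Toy RSA parameters (assumed for illustration; far too small for real use).
# N = 61 * 53, PHI = 60 * 52, and E * D = 1 (mod PHI).
N, PHI, E, D = 3233, 3120, 17, 2753


def make_shares(d: int, n: int, phi: int) -> list[int]:
    """Split d into n additive shares whose sum is congruent to d mod phi."""
    shares = [secrets.randbelow(phi) for _ in range(n - 1)]
    shares.append((d - sum(shares)) % phi)
    return shares


def partial_result(m: int, share: int, modulus: int) -> int:
    """Partial result of one server: m raised to its share."""
    return pow(m, share, modulus)


def combine(partials: list[int], modulus: int) -> int:
    """Multiply all n partial results to obtain the signature m^d mod N."""
    return reduce(lambda a, b: (a * b) % modulus, partials, 1)


def verify(m: int, signature: int, e: int, modulus: int) -> bool:
    """Ordinary RSA verification with the single public key (e, N)."""
    return pow(signature, e, modulus) == m % modulus


if __name__ == "__main__":
    shares = make_shares(D, 5, PHI)
    message = 1234  # stands in for an encoded certificate body
    partials = [partial_result(message, s, N) for s in shares]
    signature = combine(partials, N)
    print(verify(message, signature, E, N))
```

A verifier needs only the one public key (E, N), which mirrors the property that a client authenticates a single response as if the service were not replicated. A real (k, n) scheme additionally lets any k of the n shares suffice, which this sketch does not attempt.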
The protocol by which clients obtain certificates from the authentication service is shown in Figure 3. A client C requests a certificate for a principal P by sending the identifier for P and a timestamp T to the servers. The purpose of T is to give the servers a common base time from which to compute the expiration time of the certificate;³ we discuss later how C chooses T. When each server $AS_i$ receives the request, it extracts T and tests if T is no more than its current value of L(t). If this is the case, it produces its partial result $pr_i(P, T + \Delta, K_P)$ for the contents $(P, T + \Delta, K_P)$ of P's certificate, where $\Delta$ is the predetermined lifetime of the certificate. $AS_i$ then sends $pr_i(P, T + \Delta, K_P)$ to the other servers, signed under its own private key. (Alternatively, partial results can be sent over point-to-point authenticated channels, rather than being authenticated by digital signatures.) When $AS_i$ has authenticated $k - 1$ other partial results from which it can create the certificate $\{P, T + \Delta, K_P\}_{K_A^{-1}}$, it sends the certificate to C. C accepts the first properly signed certificate for P with an expiration time sufficiently far in the future, and ignores any other replies.
It is easy to see why this protocol provides the Integrity and Availability guarantees just stated. Informally, Integrity holds because if fewer than k servers are corrupted by an intruder, then the corrupt servers do not possess enough shares to sign a certificate; i.e., they need the help of a correct server. Availability holds because if at least k servers are correct, then the correct servers possess enough shares to sign a certificate and can do so using this protocol.
Because each correct server produces a partial result only if T is no more than its value of L(t), where t is the time at which it receives the request, any certificate produced from its partial result has an expiration timestamp of at most $t + \Delta$. A principal accepts a certificate as valid at some time t only if the certificate expiration time is greater than the principal's value of U(t),
which ensures that the certificate expiration time has not been reached. So, like authentication protocol messages (see Section 3.1.1), a certificate will never be considered valid for longer than its intended lifetime.
$C \to \mathcal{A}: \; P, T$
$(\forall i)\; AS_i \to \mathcal{A}: \; \{P, T + \Delta, pr_i(P, T + \Delta, K_P)\}_{K_{AS_i}^{-1}}$
$(\forall i)\; AS_i \to C: \; \{P, T + \Delta, K_P\}_{K_A^{-1}}$
Fig. 3. Protocol by which client C obtains a certificate for principal P.
³In a prior version of this protocol, each server used its value of L(t) when the request was received as the base to compute the expiration time. This version was more sensitive to clock drifts and variances in request delivery times.
A client's choice for T is constrained by two factors. On the one hand, for a certificate to be produced, each of k different servers must find T to be at most L(t), where t is the time at which the server receives the request; so,
choosing T too high prevents a certificate from being produced. On the other hand, since the certificate's expiration time is $T + \Delta$, the client shortens the effective lifetime of the certificate by choosing T too low. So, a client should choose T to be close to, but less than, what it anticipates will be the correct servers' values of L(t) when they receive the request. In practice, it works well to have a client, when sending a request at time t, to set T to its own value of L(t) minus a small offset $\delta > 0$, and to increase $\delta$ on subsequent requests if prior attempts to obtain a certificate failed. Because an unavailability of the time service will generally cause clients' values of L(t) to drift from those of the servers, during a lengthy unavailability a client may need to set $\delta$ to several seconds to obtain a certificate, at the cost of reducing the effective lifetime of the certificate by that amount. However, since certificate lifetimes are typically at least several minutes, this would normally
reduce the effective lifetime by only a small fraction.
3.2.2 Comparison to Alternative Designs. As previously mentioned, we are not the first to notice the conflict between security and availability in the construction of authentication services. Gong [1993] proposed a method for dealing with this tradeoff in shared-key authentication services such as
Kerberos [Steiner et al. 1988]. Lampson et al. [1992] also discussed this tradeoff and described a different solution that is appropriate for a public-key authentication service similar to ours.
In the latter solution, which is also implemented in SPX [Tardo and Alagappan 1991], certificates are created by a highly secure certification authority. The certification authority is not replicated and can even be taken offline, to make it easier to protect (Figure 4(a)). To reduce the impact of its limited availability, it produces long-lived certificates that are stored in and distributed from an online certificate distribution center (CDC), which can be replicated for high availability [Tardo and Alagappan 1991]. Because certificates are long lived, however, there must be some way to revoke them securely. For this reason, certificates are obtained only from CDC replicas, so if necessary, a certificate can be revoked by deleting it from all replicas. That is, a client accepts a certificate only if both the highly secure certification authority and a CDC replica endorse it.
A disadvantage of this scheme, noted by Lampson et al. [1992], is that the corruption of a CDC replica could delay the revocation of a certificate. This problem could be addressed by using the technique described in Reiter and Birman [1994] to replicate the CDC securely. However, our approach presented in Section 3.2.1 of securely replicating the authentication service itself (Figure 4(c)) addresses this problem more directly. Since the authentication service is online and highly available, it can refresh certificates frequently and create them with short lifetimes. Thus, the window of vulnerabil-
[Fig. 4 diagram residue; recoverable labels: "(usually offline)", CDC_1, ..., CDC_n, AS_1.]