A Security Architecture for Fault-Tolerant Systems

MICHAEL K. REITER
AT&T Bell Laboratories
and
KENNETH P. BIRMAN and ROBBERT VAN RENESSE
Cornell University

© ACM, 1994. This is the authors' version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version is available at http://doi.acm.org/10.1145/195792.195823.

Process groups are a common abstraction for fault-tolerant computing in distributed systems. We present a security architecture that extends the process group into a security abstraction. Integral parts of this architecture are services that securely and fault tolerantly support cryptographic key distribution. Using replication only when necessary, and introducing novel replication techniques when it was necessary, we have constructed these services both to be easily defensible against attack and to permit key distribution despite the transient unavailability of a substantial number of servers. We detail the design and implementation of these services and the secure process group abstraction they support. We also give preliminary performance figures for some common group operations.

Categories and Subject Descriptors: C.2.0 [Computer-Communication Networks]: General - security and protection; C.2.4 [Computer-Communication Networks]: Distributed Systems; D.4.5 [Operating Systems]: Reliability - fault tolerance; D.4.6 [Operating Systems]: Security and Protection - authentication; cryptographic controls; K.6.5 [Management of Computing and Information Systems]: Security and Protection - authentication

General Terms: Reliability, Security

Additional Key Words and Phrases: Key distribution, multicast, process groups

Because the Editor-in-Chief of ACM Transactions on Computer Systems is a coauthor of this paper, he played no role in the review process or acceptance decision for the manuscript.
This work was performed while the first author was at Cornell University. This work was supported under DARPA/ONR grant N00014-92-J-1866, and by grants from GTE, IBM, and Siemens, Inc. Any opinions or conclusions expressed in this document do not necessarily reflect those of the ONR.
Authors' addresses: M. K. Reiter, AT&T Bell Laboratories, Holmdel, NJ 07733; email: [email protected]; K. P. Birman and R. van Renesse, Department of Computer Science, Cornell University, Ithaca, NY 14853; email: {ken; rvr}@cs.cornell.edu.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1994 ACM 0734-2071/94/1100-0340$03.50
ACM Transactions on Computer Systems, Vol. 12, No. 4, November 1994, Pages 340-371.

1. INTRODUCTION

There exists much experience with addressing the needs for security and fault tolerance individually in distributed systems. However, less is understood about how to address these needs simultaneously in a single, integrated solution. Indeed, the goals of security (or more precisely, the goals of secrecy and integrity) have traditionally been viewed as being in conflict with goals of availability, because the only generally feasible technique for making data and services highly available, namely replicating them, also makes them harder to protect [Herlihy and Tygar 1988; Lampson et al. 1992; Turn and Habibi 1986]. This conflict is particularly pertinent to services that are involved in enforcing security policy, or in other words, that are part of the trusted computing base (TCB) [Department of Defense 1985] of a system. Prudence dictates that the TCB should be kept as small and localized as possible, in order to facilitate its protection. Distribution of TCB components makes its protection more difficult.

We have designed a security architecture for fault-tolerant systems that illustrates that this conflict need not result in an unreliable or insecure system. The architecture supports process groups, a common paradigm of fault-tolerant computing [Amir et al. 1992; Birman and Joseph 1987b; Cheriton and Zwaenepoel 1985; Kaashoek 1992; Peterson et al. 1989], as its primary security abstraction, and provides tools for the construction of applications that can tolerate both benign component failures and advanced malicious attacks. We have implemented this architecture as part of a new version of the Isis distributed programming toolkit [Birman and Joseph 1987b; Birman et al. 1991] called Horus, thereby securing Horus' virtually synchronous process group abstraction. An earlier paper [Reiter et al. 1992] presents the design rationale and an overview of the architecture. Here we emphasize how the security mechanisms have been built to be fault tolerant and efficient.

The tradeoff between security and availability is addressed in two ways in our architecture. At the level of user applications, the secure process group abstraction supported by the architecture provides a framework within which the user can balance this tradeoff for each application individually. In a secure group, applications can be efficiently replicated in a protected fashion: authentication and access control mechanisms enable the group members to prevent untrusted processes from joining, and if the members admit only processes on trustworthy sites, the members will enjoy secure and correct group communication semantics among themselves. These protection mechanisms limit how attackers can interfere with applications and, in particular, enable the user to control exactly where and how widely each application is replicated.

The second and more critical level at which this conflict is addressed is within the core security services in the TCB that underlie these mechanisms, and indeed the security of all process groups. As do other security architectures, ours uses cryptography to protect communication, and this in turn requires that a secure means of key distribution exist. Most key distribution mechanisms employ trusted services whose corruption or failure could result in security breaches or prevent principals from establishing secure communication; it is in these services that the conflict between security and availability is most apparent. We have developed an approach to reconciling this conflict that exploits the semantics of these services and novel replication techniques to achieve secure, fault-tolerant key distribution.

The implementation of our security architecture as part of Horus has brought performance and user interface issues to the forefront of our work, as well. By using caching extensively and moving costly operations off critical paths whenever possible, we have achieved reasonable performance in the secure version of Horus without resorting to cryptographic hardware. This is particularly true for group communication, which we expect to account for the vast majority of the network-wide operations in most applications. Moreover, the implementation of group security mechanisms in Horus resulted in minimal changes to the Horus process group interfaces.

So, tools and applications designed for the Horus interfaces should port easily to secure groups.

We present here the security architecture as realized in Horus and elaborate on the contributions just described. In Section 2 we present the programming model of secure process groups, with an emphasis on the security features that augment the Horus process group abstraction. Here we also discuss some implementation challenges posed by the programming model. In Section 3 we present and discuss our method of fault-tolerant key distribution, which we use to support secure process groups. In Section 4, we detail the implementation of secure process groups and give performance numbers for common group operations. We conclude and discuss related work in Section 5.

2. SECURE PROCESS GROUPS

The basic abstraction provided by Horus is the process group, which is a collection of processes with an associated group address. Groups may overlap arbitrarily, and processes may create, join, and leave groups at any time. Processes communicate both by point-to-point methods (e.g., RPC) and by group multicast, i.e., by multicasting a message to the entire membership of a group of which it is a member. Further, Horus supports a virtually synchronous [Birman and Joseph 1987a] execution model, so that message deliveries and changes to the group view (i.e., the membership of the group) appear atomically and in a consistent order at all group members, even when failures occur.

Our security architecture makes the Horus programming model robust against malicious attack, while leaving the model itself unchanged. First, during group joins, the group and the joining process are mutually authenticated to one another. More precisely, the group members are informed of the site from which the process is attempting to join, as well as the owner of the process according to that site. Any effort by an intruder to replay a previous join sequence or to forge the apparent site from which a request is sent will be detected. And, the joining process knows that responses apparently from the group are actually from the group that it is trying to join.

Second, a group member must explicitly grant each group join before the join is allowed to proceed. If the group members choose not to admit the joiner, they can deny the request, in which case the joiner will not be admitted.

In particular, the members may choose to deny the request if the owner of the requesting process is not trusted by the group members, or if the site from which the request was sent is not trusted to have properly authenticated the owner of the process requesting to join.

Third, provided that the group has admitted no processes on corrupt sites, i.e., sites at which an intruder has tampered with the hardware or operating system, the integrity as well as the authenticity of communication within the group are guaranteed. (This requirement does not imply that every site in the system need be trusted: a group can be secured within an untrustworthy system, provided that all group members are trustworthy.) Members can also request secrecy, in which case their messages will be encrypted before being sent, in an effort to prevent the disclosure of those messages to a network intruder. Secure point-to-point communication is also supported within and outside of groups, and in particular between group members and clients of groups [Birman et al. 1991], although in this article we focus on secure group communication.

The programming model thus presented to the programmer is one in which each process group can be viewed as a "fortress," where admission is regulated by the group members themselves (see Figure 1(a)). A setting to which this style of secure group is particularly well suited is one in which a fault-tolerant service must be provided to a larger, untrustworthy system against which the service must protect itself [Reiter et al. 1992]. Such an application could be composed of a single secure group located on a small "island" of trustworthy sites. Alternatively, a larger application in which greater internal control is required could be implemented using many secure groups, arranged to enforce security policies within the application and to limit the damage to the overall application from the corruption of a site (see Figure 1(b)). While the groups could span sites with different levels of trustworthiness, each group is only as secure as the least secure site or process it contains.

Fig. 1. Secure groups. (a) A process requesting to join is authenticated to the group members, who either grant or deny the request. Inside the group, communication is authenticated and, if requested, cryptographically protected from active and passive network attacks. (b) Applications can be built from many secure groups, to enforce internal security policies and to limit damage from site corruptions. A group can span sites secured to different levels (highly secure, moderately secure, and insecure sites), but is only as secure as the least secure site.

When implementing secure process groups in Horus, we were faced with many issues in fault tolerance, performance, and integration, including the following:

- Because process groups are a fault tolerance tool, it was important that the integration of our security mechanisms not increase the sensitivity of the process group abstraction to failures. This was most difficult to achieve in authenticating the origin of join requests, since all known techniques for authenticating principals in open networks rely on trusted services whose unavailability could inhibit authentication but whose replication can make them more difficult to protect. Thus, we were forced to devise new techniques for achieving fault-tolerant authentication and key distribution to support authenticated group joins.

- In Horus, a process seeking to contact a group to obtain a service, or to join, will generally not know the current membership of the group and does not need to. Moreover, requiring this knowledge would involve substantial changes to Horus and significant overheads in the system. So, it was important that an outsider's ability to authenticate group members not rely on accurate knowledge of the group membership.

- Group communication can offer substantial performance benefits if the underlying network supports broadcast or multicast [Kaashoek 1992]. We felt that it was necessary to retain these potential benefits as much as possible, without requiring that special-purpose cryptographic hardware be deployed on all sites. This goal was particularly crucial to Horus, since experience with Isis suggests that group communication will be very common in Horus.

- Horus offers a variety of ordering guarantees on the delivery of group multicasts. One of these, the causal ordering property, raises security issues when causal relationships exist between multicasts in different overlapping groups [Reiter and Gong 1993; Reiter et al. 1992]. It was important to identify potential security threats to applications that employ this ordering property and to provide defenses against these threats whenever possible.

The following sections detail how we addressed these and other issues in the implementation of our security architecture. Section 3 presents techniques to achieve fault-tolerant key distribution, which we use in our architecture to support authenticated group joins fault tolerantly. However, these techniques are also of interest outside of the context of our security architecture and could be useful in a wide range of systems, and so in Section 3 we present them in a general light. A discussion of their use in our security architecture is deferred until Section 4, where we focus on the implementation of secure process groups.
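The admission model of this section can be illustrated with a minimal sketch. It is not the Horus interface: the names `JoinRequest`, `SecureGroup`, `handle_join`, and the numeric site security levels are hypothetical, and the sketch assumes the site and owner in a request have already been authenticated by the mechanisms of Section 3. It shows only the two policy points made above: members may deny a join based on the requesting site or owner, and a group is only as secure as the least secure site it has admitted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class JoinRequest:
    """An already-authenticated join request (hypothetical structure)."""
    site: str   # site from which the process is attempting to join
    owner: str  # owner of the process, according to that site


@dataclass
class SecureGroup:
    # Hypothetical policy inputs: a security level per trusted site
    # (higher is more secure) and the set of owners the members trust.
    site_levels: dict
    trusted_owners: set
    members: list = field(default_factory=list)

    def handle_join(self, request: JoinRequest) -> bool:
        """Grant the join only if both the site and the owner are trusted."""
        if request.site not in self.site_levels:
            return False  # site not trusted to authenticate owners properly
        if request.owner not in self.trusted_owners:
            return False  # owner not trusted by the members
        self.members.append(request)
        return True

    def security_level(self):
        """The group is only as secure as its least secure member site."""
        return min((self.site_levels[m.site] for m in self.members), default=None)


group = SecureGroup(site_levels={"vault": 3, "office": 2}, trusted_owners={"ann", "bob"})
print(group.handle_join(JoinRequest("vault", "ann")))    # granted
print(group.handle_join(JoinRequest("kiosk", "ann")))    # denied: untrusted site
print(group.handle_join(JoinRequest("office", "bob")))   # granted
print(group.security_level())                            # level of the weakest site
```

Keeping the policy in the members' hands, rather than in a central service, matches the "fortress" view described above; how a real application encodes trust in sites and owners is a design decision the sketch does not address.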

3. FAULT-TOLERANT KEY DISTRIBUTION

In open networks, an intruder can attempt to initiate spurious communication in two ways [Voydock and Kent 1983]: it can try to initiate communication under a false identity, or it can replay a recording of a previous initiation sequence. Many authentication or key distribution protocols have been proposed to protect against these attacks (see Denning and Sacco [1981], CCITT [1988], Kent [1993], Needham and Schroeder [1978], Steiner et al. [1988]). These protocols allow principals (e.g., computers, users) initiating communication to verify each others' identities and the timeliness of the interaction. Most also arrange for the involved principals to share a secret cryptographic key by which subsequent communication can be protected, or to possess each others' public keys, by which either communication can be protected or a shared key can be negotiated.

Authentication protocols typically employ a trusted service, commonly called an authentication service [Needham and Schroeder 1978], to counter the first type of attack. In shared-key protocols, the authentication service normally shares a key with each principal and uses these keys to distribute other shared keys by which principals communicate. In public-key protocols, the authentication service usually has a well-known public key and uses the corresponding private key to certify the public keys of principals.

A predominant technique to detect replay attacks in authentication protocols is to incorporate into each protocol message the time at which the message was generated; the message is then valid for a certain lifetime, beyond which it is considered a replay if received [Denning and Sacco 1981]. Timestamp-based replay detection has been used in several systems (e.g., Steiner et al. [1988], Tardo and Alagappan [1991], Wobber et al. [1993]) and is often preferable to challenge-response techniques [Needham and Schroeder 1978], because it results in fewer protocol messages and less protocol state. However, using timestamps requires that all participants maintain securely synchronized clocks. In practice, clock synchronization is usually achieved via a time service, as in Gusella and Zatti [1984] and Mills [1989].

The dependence of authentication protocols on authentication and time services raises troubling security and availability issues. First, the assurances provided by authentication protocols rely directly on the security of these services, and thus these services must be protected from corruption by an intruder. Second, the unavailability of these services may prevent principals from establishing secure communication, or even open security "holes" that could be exploited by an intruder. For instance, the unavailability of a time service could result in clocks drifting far apart, thereby exposing principals to replay attacks. To increase the likelihood of these services being available, they could be replicated. However, as already noted in Section 1, this is dangerous in some environments, because replicating data or services makes them inherently harder to protect.

We have developed techniques to reconcile the conflict between security and availability in these services. By using replication only when necessary, and introducing novel replication techniques when it was necessary, we have constructed these services to be easily defensible against attack. And, the transient unavailability of even a substantial number of servers does not hinder key distribution between principals or expose protocols to intruder attacks. Client interactions with the services are simple and efficient, and the services can be used with many different authentication protocols.

3.1 The Time Service

The security risks of clock synchronization failures in authentication protocols are well known [Denning and Sacco 1981; Gong 1992], and the need for a secure time service that cannot be tampered with or impersonated has been recognized in several systems (see Bellovin and Merritt [1990] and Mills [1989]). We claim, however, that the case for a highly available time service is not as clear. It is true that an extended period of unavailability might cause principals to have increasingly disparate views of real time. But, this in itself need not result in security weaknesses or inhibit communication too quickly. As evidence of this, we propose an algorithm by which clients securely estimate real time, and which allows key distribution to proceed even during a lengthy unavailability of the time service. This has allowed us to explicitly not replicate our time service so that it will be easier to protect, and to achieve resilience to a time service unavailability through the client algorithm for estimating time.

We describe this algorithm in Section 3.1.1 and discuss alternatives to our approach in Section 3.1.2. As will be discussed in Section 3.1.2, the algorithm of Section 3.1.1 is heavily influenced by previous work in clock synchronization. As such, its contribution lies mainly in how clock synchronization techniques can be adapted for use in key distribution protocols to achieve fail-safe time estimation with a simple, easily defensible, centralized time service.

3.1.1 The Algorithm. Clients interact with our time service by the simple RPC-style protocol shown in Figure 2. We assume that the time server possesses a private key $K_{ts}^{-1}$ whose corresponding public key $K_{ts}$ is well known. (There is a similar shared-key protocol.) At regular intervals, a client queries the time service with a nonce identifier $N$ [Needham and Schroeder 1978], a new, unpredictable value. When the time server receives this request, it immediately generates a timestamp $T$ equal to its current local clock value and replies with $\{N, T\}_{K_{ts}^{-1}}$, i.e., the nonce and the timestamp, both signed with $K_{ts}^{-1}$. The client considers the response valid if it contains $N$ and can be verified with the public key of the time service. The method by which a client uses this response rests on the following additional assumptions:

The client has access to a local hardware clock $H$ whose measurement $H(t) - H(t')$ of a real time interval $[t', t]$ satisfies

$$(1 - \rho)(t - t') \le H(t) - H(t') \le (1 + \rho)(t - t'), \qquad (1)$$

where $\rho$ is a known constant.

The minimum message delays from the client to the server and from the server to the client, $min_1$ and $min_2$ respectively, are known.*

* Keith Marzullo has suggested the possibility of dynamically measuring $min_i$ on a per-client basis (personal communication, Feb. 1993). However, we do not pursue this here.

Let $\hat{t}$ denote the real time at which the client receives the response, let $r$ be the round-trip time of the exchange as measured by the client's hardware clock, and let $R$ be the real round-trip time. Writing the actual delays of the request and of the response as $min_1 + \alpha_1$ and $min_2 + \alpha_2$, where $\alpha_1, \alpha_2 \ge 0$, we have that $0 \le \alpha_2 \le R - min_1 - min_2$, since the total delay from the client to the server and back to the client gives $\alpha_1 + \alpha_2 = R - min_1 - min_2$. And so after the client verifies the response, real time $\hat{t} = T + min_2 + \alpha_2$ satisfies

$$\hat{t} \in [T + min_2,\; T + R - min_1]. \qquad (2)$$

By (1), it follows that $R \le r/(1 - \rho)$, and thus

$$\hat{t} \in [T + min_2,\; T + r/(1 - \rho) - min_1]. \qquad (3)$$

At any later time $t$, the client obtains the desired interval containing real time by:

$$[L(t),\; U(t)], \qquad (4)$$

where

$$L(t) = (H(t) - H(\hat{t}))/(1 + \rho) + T + min_2$$

and

$$U(t) = (H(t) - H(\hat{t}))/(1 - \rho) + T + r/(1 - \rho) - min_1.$$

To estimate the time, the client uses either $L(t)$ or $U(t)$, depending on which is more conservative. In particular, principals use the following rules for estimating time to detect replays of authentication protocol messages:

(1) When timestamping an authentication protocol message to allow others to detect a later replay of that message, the sender sets the message timestamp to $T = L(t)$.

(2) A recipient accepts an authentication protocol message with timestamp $T$ as valid at time $t$ only if $T + \Lambda > U(t)$, where $\Lambda$ is the predetermined lifetime of the message.
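The exchange and the interval (4) can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the paper: it follows the shared-key variant that the text mentions only in passing, using an HMAC from the Python standard library in place of a public-key signature, and the names `SHARED_KEY`, `server_reply`, `reply_is_valid`, and `SyncState` are hypothetical.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass

# Hypothetical key shared by the client and the time server (shared-key variant).
SHARED_KEY = b"demo-key-shared-with-time-server"


def _tag(nonce: bytes, timestamp: float, key: bytes) -> bytes:
    """Authenticate the pair (N, T); stands in for signing with the service key."""
    return hmac.new(key, nonce + repr(timestamp).encode(), hashlib.sha256).digest()


def server_reply(nonce: bytes, server_clock: float, key: bytes = SHARED_KEY):
    """Time server: timestamp the request immediately and authenticate (N, T)."""
    return nonce, server_clock, _tag(nonce, server_clock, key)


def reply_is_valid(sent_nonce: bytes, reply, key: bytes = SHARED_KEY) -> bool:
    """Client: accept only a reply that contains our nonce and verifies."""
    nonce, timestamp, tag = reply
    return nonce == sent_nonce and hmac.compare_digest(tag, _tag(nonce, timestamp, key))


@dataclass
class SyncState:
    """Values retained from the last successful resynchronization."""
    T: float       # server timestamp in the reply
    r: float       # round-trip time measured on the client's hardware clock
    H_hat: float   # hardware clock reading when the reply arrived
    rho: float     # known drift bound of the hardware clock
    min1: float = 0.0  # minimum client-to-server delay
    min2: float = 0.0  # minimum server-to-client delay

    def lower(self, H_now: float) -> float:
        """L(t): earliest real time consistent with the last exchange."""
        return (H_now - self.H_hat) / (1 + self.rho) + self.T + self.min2

    def upper(self, H_now: float) -> float:
        """U(t): latest real time consistent with the last exchange."""
        return ((H_now - self.H_hat) / (1 - self.rho)
                + self.T + self.r / (1 - self.rho) - self.min1)


nonce = secrets.token_bytes(16)          # new, unpredictable value
reply = server_reply(nonce, 1000.0)      # server's clock reads 1000.0
if reply_is_valid(nonce, reply):
    state = SyncState(T=reply[1], r=0.5, H_hat=50.0, rho=1e-5)
    print(state.lower(60.0), state.upper(60.0))  # interval 10 clock-seconds later
```

The interval's width starts at roughly the measured round-trip time and grows with the elapsed hardware-clock time, which is why a client benefits from resynchronizing periodically.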

The benefit of this scheme is that it is fail-safe, in the following sense:

THEOREM 3.1.1.2. An authentication protocol message with lifetime $\Lambda$ sent by a (correct) client at time $t$ will never be accepted by another (correct) client after time $t + \Lambda$.

PROOF. Suppose a client sends an authentication protocol message at time $t$. The timestamp $T = L(t)$ for the message satisfies $T \le t$. Now consider a recipient at time $t + \Lambda$, where $\Lambda$ is the lifetime of the message. Since at the recipient, $t + \Lambda \le U(t + \Lambda)$, it follows that $T + \Lambda \le U(t + \Lambda)$. Thus, the message will be rejected as invalid.
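A small simulation can make the fail-safe property concrete. The sketch below is illustrative only: it assumes $min_1 = min_2 = 0$, treats the server's clock as real time, and models each hardware clock as running at a constant rate within the drift bound. The class `Client` and its method names are hypothetical.

```python
from dataclasses import dataclass

RHO = 1e-5  # assumed drift bound for every hardware clock


@dataclass
class Client:
    rate: float        # true clock rate, assumed within [1 - RHO, 1 + RHO]
    T: float = 0.0     # server timestamp from the last resynchronization
    r: float = 0.0     # measured round-trip time of that exchange
    H_hat: float = 0.0 # hardware clock reading when the reply arrived

    def H(self, t: float) -> float:
        """Hardware clock reading at real time t."""
        return self.rate * t

    def resync(self, t_send: float, d1: float, d2: float) -> None:
        """Simulate one exchange with request delay d1 and reply delay d2."""
        self.T = t_send + d1                      # server timestamps on receipt
        self.H_hat = self.H(t_send + d1 + d2)     # reply arrives at the client
        self.r = self.H_hat - self.H(t_send)      # round trip on the local clock

    def L(self, t: float) -> float:
        return (self.H(t) - self.H_hat) / (1 + RHO) + self.T

    def U(self, t: float) -> float:
        return (self.H(t) - self.H_hat) / (1 - RHO) + self.T + self.r / (1 - RHO)

    def timestamp(self, t: float) -> float:
        """Rule (1): a sender stamps messages with its lower bound."""
        return self.L(t)

    def accepts(self, stamp: float, lifetime: float, t: float) -> bool:
        """Rule (2): a recipient compares against its upper bound."""
        return stamp + lifetime > self.U(t)


sender = Client(rate=1 + RHO)    # fastest permitted clock
receiver = Client(rate=1 - RHO)  # slowest permitted clock
sender.resync(0.0, 0.05, 0.05)
receiver.resync(0.0, 0.05, 0.05)

stamp = sender.timestamp(100.0)             # message created at real time 100
print(receiver.accepts(stamp, 30.0, 101.0)) # shortly after sending
print(receiver.accepts(stamp, 30.0, 130.0)) # exactly one lifetime later
```

Because the sender never stamps later than real time and the recipient never believes it is earlier than real time, any estimation error can only shorten a message's effective lifetime, never extend it.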



Because the interval (4) grows wider with time, periodically each client resynchronizes with the time service in order to narrow its interval. A successful resynchronization results in new values of $H(\hat{t})$, $r$, and $T$ for the calculation of $U(t)$ and $L(t)$. Resynchronization attempts can fail, however, when the round-trip time $r$ for the attempt exceeds some timeout value. When this happens, the client continues to attempt to resynchronize with the service at regular intervals, while retaining the values of $T$, $r$, and $H(\hat{t})$ obtained in the last successful resynchronization to calculate $L(t)$ and $U(t)$. So, if the service becomes unavailable, clients' intervals will continue to widen. If the service is unavailable for too long, eventually the principals' values of $U(t)$ will exceed their values of $L(t)$ by the protocol message lifetimes, and all messages will be perceived as expired immediately upon creation.

While this bounds the amount of time that the system can operate without the time service, calculations indicate that this bound is not very tight. For example, consider two principals $P_1$ and $P_2$ in our system, each of whose clocks is characterized by $\rho = 10^{-5}$, and suppose for simplicity that the values of $\hat{t}$ and $T$ corresponding to the last resynchronization for each prior to a time service crash are the same. Moreover, suppose that $min_1 = min_2 = 0$ and that the value of $r$ obtained by $P_2$ in its last resynchronization is 0.5 seconds. Then, even if the clocks of $P_1$ and $P_2$ drift apart at the maximum possible rate (i.e., the clocks of $P_1$ and $P_2$ are as slow and as fast as possible, respectively, while still satisfying (1)), it will be over 20.4 hours from $\hat{t}$ before the value of $U(t)$ at $P_2$ exceeds the value of $L(t)$ at $P_1$ by 30 seconds, a relatively short message lifetime in comparison to that suggested by Denning and Sacco [1981]. Additionally, the parameters used in the above calculation are very conservative for most settings, and tests in our system show that a time service unavailability can typically be tolerated for much longer. These results lead us to believe that the system, if tuned correctly, should be able to operate without the time service for sufficiently long to restart the time service, even if the restart requires operator intervention.
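The kind of calculation behind this example can be sketched from the definitions of $L(t)$ and $U(t)$. The function below is one reading of the worst case, not the authors' own computation: it assumes $P_1$'s clock runs at rate $1 - \rho$ and $P_2$'s at rate $1 + \rho$ for the whole outage, so that after $d$ seconds of real time the gap is $U_{P_2} - L_{P_1} = d \cdot 4\rho/(1 - \rho^2) + r/(1 - \rho) - min_1 - min_2$. The function name is hypothetical.

```python
def seconds_until_expiry(rho: float, r: float, lifetime: float,
                         min1: float = 0.0, min2: float = 0.0) -> float:
    """Real time after the last resynchronization at which U(t) at the
    fast-clock principal first exceeds L(t) at the slow-clock principal
    by `lifetime`, under the worst-case drift assumption stated above."""
    # (1 + rho)/(1 - rho) - (1 - rho)/(1 + rho), simplified algebraically
    growth_per_second = 4 * rho / (1 - rho ** 2)
    initial_gap = r / (1 - rho) - min1 - min2
    return (lifetime - initial_gap) / growth_per_second


for rho in (1e-4, 1e-5):
    hours = seconds_until_expiry(rho, r=0.5, lifetime=30.0) / 3600
    print(f"rho = {rho:g}: about {hours:.1f} hours")
```

Under this reading the tolerable outage is inversely proportional to the drift bound: the sketch gives about 20.5 hours for $\rho = 10^{-4}$ and about 205 hours for $\rho = 10^{-5}$, so it is best treated as an order-of-magnitude check on the example rather than a reproduction of its exact figure.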

3.1.2 Comparison to Alternative Designs. We derived our algorithm from that presented by Cristian [1989] for implementing a time service. The primary difference between ours and Cristian's lies in how clients use the interval (2). In the latter, the client uses the midpoint of (2) as its estimate of the time at time $\hat{t}$, since this choice minimizes the maximum possible error, and the client estimates future times as an offset, equal to the measured time since the last resynchronization, from this midpoint.2 However, like any other clock synchronization algorithm in which each client maintains a single clock value, this algorithm is not fail-safe: e.g., if the midpoint of (2) were too low, then the client's future estimates of the time would tend to be low, and thus expired messages may be incorrectly accepted. We feel that our approach, which is fail-safe, is better for our purposes.

2 This is a simplification of the client algorithm by Cristian [1989]; the actual algorithm also takes measures to ensure that clocks are continuous and monotonic. These features, however, are unimportant for our purposes.

A reasonable alternative to not replicating our time service is to replicate it for high availability and to compensate for the increased difficulty of protecting the service by making it tolerant of the corruption of some servers. For instance, a client could use the robust averaging algorithm of Marzullo [1990] to obtain an interval of bounded inaccuracy containing real time from a set of $n$ time servers, if fewer than $\lceil n/3 \rceil$ servers are faulty or corrupt. This approach might be attractive if clients are highly transient, and thus a time service unavailability will prevent large numbers of clients from synchronizing initially with the service at client startup. However, this is unlikely to be the case in most systems, where time service clients are sites that do not tend to reboot frequently. Moreover, a replicated time service places a larger burden on the administrator of the service than does ours, since the administrator must protect multiple servers, instead of only one, to ensure the integrity of the service. For these reasons and the additional costs of replication (e.g., authenticating and maintaining multiple time servers), we feel that a replicated time service is difficult to justify for our purposes.

Also discussed by Marzullo [1990] are approaches to evaluating a predicate on a physical state variable despite the impossibility of accurately measuring that variable. It is observed that given a range of values that is known to contain the actual physical value, safe evaluation of the predicate may require that all values in the range satisfy the predicate, or that only some value in the range does. Our approach of estimating time conservatively with the endpoints of (4) can be viewed as an instance of the former approach, where the physical state variable being measured is time; the range containing time is (4); and the predicates of interest relate time to timestamps in authentication protocol messages.

Numerous other approaches to clock synchronization have been proposed (see, e.g., Simons et al. [1990]), but for brevity, we do not discuss them all here. Unlike ours, however, most assume upper bounds on message transmission times or employ greater distribution, thereby increasing the number of components that must be protected in the system. Moreover, to our knowledge none provide a fail-safe algorithm for estimating time in authentication protocols. We thus feel that our approach is unique in providing this property with relatively few requirements.
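For comparison, the replicated alternative discussed in this section can be illustrated with a fault-tolerant interval intersection. The sketch below is a generic version of that idea and is not claimed to be the exact averaging function of the cited work: given one time interval per server and an assumed bound `f` on the number of faulty or corrupt servers, it returns the span of all points that lie in at least `n - f` of the intervals. The function name is hypothetical.

```python
def tolerant_intersection(intervals, f):
    """Span of all points contained in at least len(intervals) - f of the
    given closed intervals, or None if no such point exists."""
    need = len(intervals) - f
    events = []
    for lo, hi in intervals:
        events.append((lo, 0))  # 0 = interval opens (sorts before a close at the same point)
        events.append((hi, 1))  # 1 = interval closes
    events.sort()

    # Sweep left to right: the first point covered by `need` intervals.
    low, open_count = None, 0
    for point, kind in events:
        open_count += 1 if kind == 0 else -1
        if kind == 0 and open_count >= need:
            low = point
            break

    # Sweep right to left: the last point covered by `need` intervals.
    high, open_count = None, 0
    for point, kind in reversed(events):
        open_count += 1 if kind == 1 else -1
        if kind == 1 and open_count >= need:
            high = point
            break

    if low is None or high is None or low > high:
        return None
    return low, high


# Three servers, one of which reports a wildly wrong interval.
readings = [(99.90, 100.10), (99.95, 100.20), (500.0, 501.0)]
print(tolerant_intersection(readings, f=1))
```

A client could then use the endpoints of the returned interval in place of $T + min_2$ and $T + r/(1-\rho) - min_1$; whether that is worthwhile depends on the administrative cost of protecting several servers, as argued above.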

3.2 The Authentication Service

Our authentication service is of the public-key variety, that produces public-key certificates for principals. Each certificate $\{P, T, K_P\}_{K_{as}^{-1}}$ contains the identifier $P$ of the principal, the public key $K_P$ of the principal, and the expiration time $T$ of the certificate, all signed by the private key $K_{as}^{-1}$ of the authentication service. A principal uses these certificates to map principal identifiers to public keys, by which those principals (who presumably possess the corresponding private keys) can be authenticated; the details are discussed in Lampson et al. [1992]. In general, a principal can request a certificate for any principal from the authentication service.

certificate for any principal from the authentication service. The need for security in such an authentication service is obvious: as the undisputed authority on what public key belongs to what principal, the authentication service, if corrupted, could create public-key certificates arbitrarily and thus render secure communication impossible. It would also the authentication service must be appear that, unlike the time service, highly available, since its unavailability could prevent certificates from being obtained or refreshed when they expire. Other researchers have also noted that both security and availability, and thus the conflict between them, must be dealt with in the construction of authentication services [Gong 1993; Lampson et al. 1992]. The most common approach to address this conflict in public-key authentication services is to implement the service using two services: a highly secure certification authority that creates certificates, and a highly available certificate database that distributes them (see CCITT [1988], Kent [1993], Lampson approach differs in that it ACM

Transactions

on Computer

et al. [19921, Tardo and Alagappan performs both of these functions

Systems,

Vol. 12, No. 4, November

1994.

[ 1991]). Our in a single

A Security Architecture replicated and

service,

available

So, the

but

does so in such a way

despite

conflict

for Fault-Tolerant

even the malicious

between

security

and

that

the service

corruption

availability

Systems

.

remains

of a minority is addressed

351 correct

of servers.

by replicating

the service for availability, but compensating for the increased difficultly of protecting the service by making the service tolerant of successful attacks on servers. We first describe our approach, and then compare it in detail to other alternatives. 3.2.1 The Algorithm. Reiter securely replicating any service technique client

is similar

sends

receives correct, Reiter

its

from then and

13irman [1994] can be modeled

to state machine

request

to

a majority the

and that

replication

servers

of them.

response

Birman

all

In this

obtained

provides

[Schneider

1990],

and

accepts

way,

if a majority

by the

similar

describe a technique as a state machine.

client

guarantees

the

but

in which

response of the

is correct.

The

diffkrs

for The a

that

it

servers

is

approach

by freeing

of the

client from authenticating the responses of all servers. Instead, the client is required to possess only one public key for the service and to authenticate only one (valid) response, just as if the service was not replicated. We have constructed our authentication service using thk technique. In its full generality, the system administrator can choose any threshold value k and create

any number

has the following Integrity. signed

n > k of authentication

servers

such that

the service

properties:

If fewer

certificate

than

k servers

produced

by

are corrupt,

the

service

the contents

were

of any properly

endorsed

by

some

correct

server. Availability. erly

signed

As indicated so that integrity Our

If at least

k servers

are correct,

the

service

produces

prop-

certificates, above,

a natural

a majority of correct of the service. technique

employs

choice

for the threshold

servers

a threshold

ensures signature

both

value

is k = [n/2

-t 1],

the

availability

and

scheme.

Informally,

a (k, n)-

threshold signature scheme is a method of generating shares of the corresponding private key in such a way

the

a public key and n that for any message

m, each share can be used to produce a partial result from m, where any k of these partial results can be combined into the private-key signature for m. Moreover, knowledge of k shares should be necessary to sign m, in the sense that

without

(1) create

the private the signature

(2) compute

a partial

(3) compute

a share

key it should for

result

m without for

be computationally k partial

m without

or the private

infeasible

results

fc~r m,

the corresponding

key without

k other

to

share,

or

shares.
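To make the definition concrete, the following is a minimal sketch of the special case k = n, using additive sharing of a toy RSA private exponent. It is not the Desmedt and Frankel scheme, which supports general k. The tiny primes, the message value, and all function names are assumptions chosen only for illustration, and the parameters are far too small to be secure.

```python
import random

# Toy RSA key (illustrative only).
PRIME_P, PRIME_Q = 61, 53
N = PRIME_P * PRIME_Q                  # public modulus
PHI = (PRIME_P - 1) * (PRIME_Q - 1)    # Euler totient of N
E = 17                                 # public exponent
D = pow(E, -1, PHI)                    # private exponent (Python 3.8+)


def make_shares(d, n, phi, rng):
    """Split d into n additive shares whose sum is congruent to d modulo phi."""
    shares = [rng.randrange(phi) for _ in range(n - 1)]
    shares.append((d - sum(shares)) % phi)
    return shares


def partial_result(m, share, modulus):
    """Partial result a single server computes from message m and its share."""
    return pow(m, share, modulus)


def combine(partials, modulus):
    """Multiply all n partial results to obtain the signature m^d mod N."""
    signature = 1
    for p in partials:
        signature = (signature * p) % modulus
    return signature


def verify(m, signature, e, modulus):
    """Ordinary RSA verification with the single public key (e, N)."""
    return pow(signature, e, modulus) == m % modulus


message = 65
shares = make_shares(D, 3, PHI, random.Random(0))
partials = [partial_result(message, s, N) for s in shares]
signature = combine(partials, N)
print(verify(message, signature, E, N))
```

The verifier needs only the one public key (E, N) and never sees the individual shares, which mirrors the property that a client authenticates a single response as if the service were not replicated.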

Our replication technique does not rely on any particular threshold signature scheme. For our authentication service, we have implemented the one of Desmedt and Frankel [1992], which is based on RSA [Rivest et al. 1978]. Given a (k, n)-threshold signature scheme, we build our authentication service as follows. Let $\mathcal{A} = \{AS_1, \ldots, AS_n\}$ be the set of authentication servers. These servers should satisfy the same specification, although they need not be identical; in fact, it may be preferable that they be developed independently, to prevent a (possibly deliberate) design flaw from affecting all of them [Joseph 1987]. We first choose a threshold value k and create n shares from the private key $K_{\mathcal{A}}^{-1}$ of the authentication service. Each authentication server $AS_i$, when started, is given the ith share of $K_{\mathcal{A}}^{-1}$, its own private key $K_{AS_i}^{-1}$, the public key $K_{AS_j}$ of each server $AS_j$, and the public keys for all principals. It is also given the public key of the time service to synchronize its clock as in Section 3.1.1.
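The timestamp handling in the certificate protocol of this section (Figure 3) can be sketched as follows. This is a simplified model under stated assumptions: each party's clock is a fixed pair of integer bounds standing in for L(t) and U(t), all signatures and partial results are omitted, and doubling the offset on each retry is an assumed policy (the text says only that the offset is increased). All names are hypothetical.

```python
from dataclasses import dataclass


def majority_threshold(n: int) -> int:
    """Threshold k = floor(n/2 + 1), i.e. a strict majority of n servers."""
    return n // 2 + 1


@dataclass(frozen=True)
class Clock:
    lower: int  # stands in for L(t), a lower bound on the current time
    upper: int  # stands in for U(t), an upper bound on the current time


def server_accepts(timestamp: int, server: Clock) -> bool:
    """A server contributes a partial result only if T is no more than its L(t)."""
    return timestamp <= server.lower


def certificate_valid(expiration: int, verifier: Clock) -> bool:
    """A principal accepts a certificate only if its expiration exceeds U(t)."""
    return expiration > verifier.upper


def request_certificate(client, servers, k, lifetime, offset=1, max_tries=5):
    """Pick T = L(t) - offset and retry with a larger offset until k servers accept.

    Returns (T, expiration) with expiration = T + lifetime, or None on failure.
    """
    for _ in range(max_tries):
        timestamp = client.lower - offset
        endorsers = sum(1 for s in servers if server_accepts(timestamp, s))
        if endorsers >= k:
            return timestamp, timestamp + lifetime
        offset *= 2  # assumed back-off policy
    return None


servers = [Clock(995, 1005), Clock(998, 1008), Clock(1002, 1012)]
k = majority_threshold(len(servers))
result = request_certificate(Clock(1000, 1010), servers, k, lifetime=600)
print(k, result)
```

In this run the first attempt (T = 999) is accepted by only one server, so the client retries with a larger offset and succeeds with T = 998, at the cost of a slightly shorter effective lifetime.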

The protocol by which clients obtain certificates from the authentication service is shown in Figure 3. A client C requests a certificate for a principal P by sending the identifier for P and a timestamp T to the servers. The purpose of T is to give the servers a common base time from which to compute the expiration time of the certificate;³ we discuss later how C chooses T. When each server $AS_i$ receives the request, it extracts T and tests if T is no more than its current value of L(t). If this is the case, it produces its partial result $pr_i(P, T + \Delta, K_P)$ for the contents $(P, T + \Delta, K_P)$ of P's certificate, where $\Delta$ is the predetermined lifetime of the certificate. $AS_i$ then sends $pr_i(P, T + \Delta, K_P)$ to the other servers, signed under its own private key. (Alternatively, partial results can be sent over point-to-point authenticated channels, rather than being authenticated by digital signatures.) When $AS_i$ has authenticated k − 1 other partial results from which it can create the certificate $\{P, T + \Delta, K_P\}_{K_{\mathcal{A}}^{-1}}$, it sends the certificate to C. C accepts the first properly signed certificate for P with an expiration time sufficiently far in the future, and ignores any other replies.

³In a prior version of this protocol, each server used its value of L(t) when the request was received as the base to compute the expiration time. This version was more sensitive to clock drifts and variances in request delivery times.

$C \to \mathcal{A}$: $P, T$

$(\forall i)\ AS_i \to \mathcal{A}$: $\{P, T + \Delta, pr_i(P, T + \Delta, K_P)\}_{K_{AS_i}^{-1}}$

$(\forall i)\ AS_i \to C$: $\{P, T + \Delta, K_P\}_{K_{\mathcal{A}}^{-1}}$

Fig. 3. Protocol by which client C obtains a certificate for principal P.

It is easy to see why this protocol provides the Integrity and Availability guarantees just stated. Informally, Integrity holds because if only fewer than k servers are corrupted by an intruder, then the corrupt servers do not possess enough shares to sign a certificate; i.e., they need the help of a correct server. Availability holds because if at least k servers are correct, then the correct servers possess enough shares to sign a certificate and can do so using this protocol.

Because each correct server produces a partial result only if T is no more than its value of L(t), where t is the time at which it receives the request, any certificate produced from its partial result has an expiration timestamp of at most $t + \Delta$. A principal accepts a certificate as valid at some time t only if the certificate expiration time is greater than the principal's value of U(t), which ensures that the certificate expiration time has not been reached. So, like authentication protocol messages (see Section 3.1.1), a certificate will never be considered valid for longer than its intended lifetime.

A client's choice for T is constrained by two factors. On the one hand, for a certificate to be produced, each of k different servers must find T to be at most L(t), where t is the time at which the server receives the request; so, choosing T too high prevents a certificate from being produced. On the other hand, since the certificate's expiration time is $T + \Delta$, the client shortens the effective lifetime of the certificate by choosing T too low. So, a client should choose T to be close to, but less than, what it anticipates will be the correct servers' values of L(t) when they receive the request. In practice, it works well to have a client, when sending a request at time t, to set T to its own value of L(t) minus a small offset $\delta > 0$, and to increase $\delta$ on subsequent requests if prior attempts to obtain a certificate failed. Because an unavailability of the time service will generally cause clients' values of L(t) to drift from those of the servers, during a lengthy unavailability a client may need to set $\delta$ to several seconds to obtain a certificate, at the cost of reducing the effective lifetime of the certificate by that amount. However, since certificate lifetimes are typically at least several minutes, this would normally reduce the effective lifetime by only a small fraction.

3.2.2 Comparison to Alternative Designs. As previously mentioned, we are not the first to notice the conflict between security and availability in the construction of authentication services. Gong [1993] proposed a method for dealing with this tradeoff in shared-key authentication services such as Kerberos [Steiner et al. 1988]. Lampson et al. [1992] also discussed this tradeoff and described a different solution that is appropriate for a public-key authentication service similar to ours. In the latter solution, which is also implemented in SPX [Tardo and Alagappan 1991], certificates are created by a highly secure certification authority. The certification authority is not replicated and can even be taken offline, to make it easier to protect (Figure 4(a)). To reduce the impact of its limited availability, it produces long-lived certificates that are stored in and distributed from an online certificate distribution center (CDC), which can be replicated for high availability [Tardo and Alagappan 1991]. Because certificates are long lived, however, there must be some way to revoke them securely. For this reason, certificates are obtained only from CDC replicas, so if necessary, a certificate can be revoked by deleting it from all replicas. That is, a client accepts a certificate only if both the highly secure certification authority and a CDC replica endorse it. A disadvantage of this scheme, noted by Lampson et al. [1992], is that the corruption of a CDC replica could delay the revocation of a certificate. This problem could be addressed by using the technique described in Reiter and Birman [1994] to replicate the CDC securely. However, our approach presented in Section 3.2.1 of securely replicating the authentication service itself (Figure 4(c)) addresses this problem more directly. Since the authentication service is online and highly available, it can refresh certificates frequently and create them with short lifetimes. Thus, the window of vulnerabil-

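[Figure 4: diagram not recoverable from the extracted text; surviving labels include "(usually offline)", CDC replicas CDC_1 through CDC_n, and authentication server AS_1.]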
