A Modeling Framework for Capturing Parallel Flows and ... - CMU (ECE)

Nested Objects in a Byzantine Quorum-Replicated System

Charles P. Fry 2004 Advisor: Prof. Reiter

Nested Objects in a Byzantine Quorum-ReplicatedSystem Charles R Fry

December 17, 2004

Abstract Moderndistributed, object-basedsystemssupport nested methodinvocations, wherebyone object can invokemethodson another. In this thesis wepresent a frameworkthat supports nested method invocationsamong Byzantinefault-tolerant, replicated objects that are accessedvia quorum systems.A challengein this contextis that client objectreplicas caninduceunwanted methodinvocationson server objectreplicas, dueeither to redundantinvocationsby client replicas or Byzantinefailures withinthe client replicas. At the core of our framework are a newquorum-based authorization techniqueand a novelmethodinvocationprotocolthat ensurethe linearizability andfailure atomicityof nestedmethod invocationsdespite Byzantineclient andserver replica failures. Wedetail the implementation of these techniquesin a systemcalled Fleet, andgive preliminaryperformance results for them. Keywords:Distributed systems, Quorumsystems, Byzantine failures, Replication, Fault tolerance

1

Introduction

In this thesis wepresent the design and implementationof a framework to support Byzantinefault-tolerance [14] in a distributed, object-based system. In modem object-based systems, it is commonplace that objects are passed as arguments to and can invoke methodson other objects. A goal of our frameworkis to support these natural modelsof object interaction seamlessly fromthe programmer’s perspective, while utilizing object replication and Byzantinefault-tolerant methodinvocation protocols to maskthe Byzantine(arbitrary) failure of a limited numberof replicas of each object. The modelof object interaction that our frameworksupports is motivated by that of Java remote objects. A Java remote object is one that can be invoked from outside the Java Virtual Machine(JVM)in which 1

client

client

~

~

0 handles [] replicas

servers

O handles [] replicas

servers

(a) Distributed object components

(b) Nesteddistributed objects

Figure 1: Distributed objects resides, via a protocol called RemoteMethodInvocation (RMI). The client JVMof a remote object holds a proxy for the remote object, called a stub, that implementsthe sameinterface as the remote object. The client programcan invoke methodson the remote object by invokingthemon the stub, and can pass the stub as a parameterto other, possibly remote, objects. Thoseobjects that then hold the stub can invokemethods on the remote object, as well. This mechanism thus provides location transparency for calls to the remote object. In this work, weconsider a system we are implementingwherebya serializable Java object can be dynamically exported outside the JVMin whichit wascreated, and replicated to a numberof other server JVMs, yielding a distributed object. After this operation, the client JVMis left with a handle(conceptuallysimilar to a stub, but functionally different), again that implementsthe sameinterface as the original object; see Figure l(a). Methodinvocations on the handle are translated to methodinvocations on a set (quorum[15]) of replicas for the distributed object. Like RMIstubs, handles can be passed as parametersto methodinvocations, potentially on other objects that havebeendistributed int this way,resulting in object nesting; see Figure l(b). ThoseJVMsthat hold a handle for the distributed object are called clients; however,clients of one distributed object can be servers for other distributed objects, so whenwerefer to a client or a server, it indicates the role in whichit is participating at that time. Withnested objects, it is no longer desirable to allow unfettered access to a distributed object by anyclient that possesses a handle for that object. In particular, wecannot allow a single client replica to perform arbitrary operations on another distributed object: doing so conld result in duplicate methodinvocations 2

whenother client replicas perform the samemethodinvocation, and Byzantine-faulty client replicas could corrupt the embedded object, from the application perspective, by revokingincorrect methodson it. Instead, the central goals of our frameworkare to ensure that (i) only those methodinvocations endorsedby correct replicas of the calling object are performed,and that (ii) the methodinvocationprotocol itself is robust to limited numberof Byzantinefaulty client replicas and server replicas. For (i), wepropose an authorization frameworkfor nested methodinvocations. This frameworkauthorizes methodinvocations on distributed objects fromsingle trusted clients as well as fromquorumsof individually untrusted clients. Whenobject ol is passed to 02, authorization for quorumsof o2’s replicas to invokeol is transparently delegated witla the use of delegation kelys and certificates. For (ii), wedevelopa newquorum based methodinvocation protocol that ensures lineadzable [10] object invocations and failure atomicity of methodinvocations at arbitrary depths, again despite Byzantine failures of a limited numberof replicas of each distributed object. This protocol makesno assumptionson messagetransmission times, i.e., it is designed to function safely in an asynchronousenvironment. Wehave implementedour frameworkwithin a significant restructuring of the Fleet system[16]. Theinitial Fleet system from whichwe beganthis implementationdid not support nested objects seamlessly, provided no tolerance for Byzantineclient invocations, and did not implementan authorization scheme.As such, our frameworkis a significant advancein this context.

2

Related Work

To our knowledge,the only prior system to support linearizable access to nested objects in the face of Byzantinefaults is Immune [21]. Immune takes a different appro.ach in that it implementsevery distributed object using state machinereplication [23] with majority voting, in whichevery methodinvocation is executed by every object replica. Nested methodinvocations are performedwith relative simplicity in the Immune system, as its use of Byzantinefault-tolerant atomic broadcast (specifically [11]) ensures that all client and server replicas receive every communication in the sameorder. In contrast, our approachimplements quorum-basedaccess: each methodinvocation involves accessing only a randomlyselected quorum of replicas, whichcan be relatively small, e.g., O(x/~)replicas for a distributed object with n replicas and tolerating b Byzantinereplica failures [17]. Whileoffering improvedscalability and load dispersion, the

quorumapproach introduces challenges not present in the state machineapproach, notably the absence of atomic multicast amongall client and server replicas to coordinate invocations. Finally, our authorization frameworkhas no analog in Immune,which permits any replicated object to invoke arbitrary methodson anotherreplicated object. Benignfault-tolerant nesting of transactions has previously been a topic of study in the context of database systems[18]. Nestedtransactions achieve concurrencyatomicity in the form of serializability [19]. While necessary for transactional systems that must apply multiple operations on potentially manyobjects, serializability is both non-local and blocking, thus requiring global coordination and locking to enforce it. Serializability wasrecently studied in the context of JavaSpacetransactions [20]; however,while there can be nested transactions in JavaSpaces, there is no analog to our nested methodinvocations as JavaSpaces contain passive objects that can only be read and written, rather than remote objects on whichmethodscan be invoked. Linearizability, the form of concurrencyatomicity that wepursue in this work, provides concurrencyatomicity of single operations performedon a single object only [10]. Thoughweakerthan serializability,

we

haveopted for linearizability for two reasons. First, returning to the object sharing modelthat motivatesour work, linearizability is the concurrencyatomicity property that is achieved by Java RMI(though not in fault-tolerant way),providedthat the remoteobject processesrequests sequentially. As such, our systemwill be suited to applications already utilizing that model.Second,linearizability can be enforced locally, and avoids locking, whichis problematic whena node responsible for unlockingan object fails. Nevertheless, weintend to explore serializability in future work. Failure atomicity is a secondproperty that is often considered in. conjunction with concurrencyatomicity, especially in transactional systemssuch as databases, althoughit is normallylimited to benignfailures. Our approachachievesfailure atomicity in the face of Byzantinefailures, as detailed in Section 4.

3

Authorization Framework

As discussed in Section 1, object nesting in combinationwith Byzantine failures requires that wedepart froma modelin whichsimplypossessinga handle for a distributed object is sufficient to invokemethodson it. Otherwise,a faulty object replica whichwasgiven a handle for another object throughservicing a met]hod 4

invocation wouldthen be able to invokearbitrary methodson that object. Our goal is to allow the creator of a distributed object to invoke methodson that object, and to delegate that authority to another client object, whichitself maybe replicated in order to withstand the Byzantinefailure of someof its replicas. Weanticipate that this delegation will most commonly occur automatically whena handle to one distributed object is passed as a parameterto a methodinvocation on a seconddistributed object.

3.1

Assumptions

As discussed previously, the environmentthat we consider executes methodson a distributed object at a quorumof its replicas, and the set of allowable quorumsconstitutes a quorumsystem for the distributed object. For the purposes of the present section, we are not concernedwith the structure of the quorum system, except for one assumption: the quorumsystem is formed based on an assumedmaximum number b of its replicas that will suffer Byzantinefailures. For example,it is necessarythat each quorumbe larger than b, lest operations be performedat a quorumof only faulty servers, and it is also necessary that any b failures leaves a quorumavailable, so that operations can be completed. Such quorumconstructions can be found in, e.g., [15, 17]. Weassumethat communications to and from servers are protected using standard cryptographic techniques, At a high level, our strategy will be to permit only a numberof replicas of size bl + 1 or greater of a distributed object ol to invoke methodson another ,distributed object 02. Here, bl denotes the numberof replica failures that the quorumsystem for ol was designedto survive. In this way, any invocation by any replica of o~ that the correct replicas of o2 accept is corroboratedby a correct replica of ol. For the purposes of this section, wetreat trusted individual clients that are permittedto invokemethodson a distributed object, e.g., the creator of the object or anotherclient to whichit explicitly passes the handlethroughan out-of-band mechanism, as a special case of a distributed object ol with n~ = 1 replicas and bl = 0 faults.

3.2

Delegation

Thestarting point for methodinvocationauthorizatiion is the principal that originally creates a distributed object, i.e., the client that exports the object from its JVMto makea newdistributed object. Whenthe distributed object is created, its creator generates a newprivate digital signing key S and corresponding

public verification key V (e.g., [22, 12, 1]) and deploys V with each replica, as the "root" key for the distributed object. Thecreator can optionally deploy additional root public keys with each replica, though here werestrict our attention to a single root key. Thecorrect replicas of this distributed object will henceforthonly permitmethodinvocationsbearing digital signatures that can be verified with V, or for whichthe root key has delegated authority (perhaps transitively). This delegation can occur in two ways. The most straightforward is explicit delegation by the application, in whichthe object creator certifies a public key providedby another potential client as being authorized to access the distributed object. This formof delegationclosely follows that of, e.g., Gasserand McDermott [9], and will not be detailed here. The second and more complexform of delegation occurs implicitly, whenobjects are nested. To support this form of delegation, each handle contains a private signature key called the handlekey--the handle key for the initial handleis the private root key of the handle--plusa set of statements(certificates) regarding the keys for whichthat handle key bears authority. To fully describe howdelegation works, we need to consider two cases: one in whicha handle for one distributed object is passed as a parameterinto a method call on another distributed object, and one in whicha methodcall on one distributed object returns a handle of anotherdistributed object. Consider an object o0 with n0 replicas, each replica r~ of whichholds handles hi and h~ for distributed objects Ol and 02, respectively. Figure 2(a) showsan instance in whichn0 = 1, as wouldbe the case if werea non-replicated client. Let S~ denote the private handle key for h~. If r~ passes h~ in a methodcall parameterto h~, then h~ makesa copyh~j of itself for replica r{ (i.e., the j-th replica of o~), except h/2j is equippedwith a newlygenerated public/private key pair (V£~, SI~) before being sent to r~. Theresults of this delegationin the case of Figure 2(a) are shownin Figure2(b). After generating a newpublic/private key pair for each replica of o~, a newstatement must be constructed ~, ..., V~ TM } authorizing a non-faulty set of o~’s replicas to invokemethodson 02. Let F~ denote the set {V~ of public keys created for o~’s replicas, wheren~ is the numberof replicas of o~, and let b~ be the number of faults the quorumconstruction of Ol is designed to mask.Thenfor each replica r~ of o~, the statement set of h~j is augmentedwith a statement of the form

(~) 6

C) handles [] replicas

replicas

(a) Beforenesting

(b) After nesting and delegation Figure 2: Nested objects

i.e., a certificate signedby S~stating that anysubset of bl+ 1 keysin V~is authorizedwith the sameprivileges as V~. ("Says" and "speaksfor" (3) are common formalismsfor expressingcredentials, e.g., [13, 8, 2, 4, 3].) Verifying a "signature" cr on a messagern with (b~ + 1 of V~)meansverifying that at least bl -k 1 of the public keys in {V~,..., V2/nl } can be used to verify signatures in cr (a set) for m.. Finally, once at least b0 + 1 of handles h~j, h~j, " " ¯, hn°J2havebeendeployedat r{, r{ can coalesce these handlesinto a single handle~z~by generatinga newpublic/private, ~ key pair (1)~, ~J) and the certificate

(2) °j. In this way, requests issued by ~ to o2 replicas will need only sign with ~, versus each of S~J,..., S~ ^" Of course, the statement set of h~ contains both (2) and the union of the statement sets of h~nOj ...j, ,h 2 ,

whichincludes (1).

Safety Considera methodinvocation m executed by only faulty handles {~{}, i.e., by at most bl replicas {r{}. Eachsuch invocation is signed by ~, and so each replica r~ that sees this invocation can determine that ~ says m and thus that

~Thehandles ~,~ h2~.., t~:~°~ need not all be deployedto. r{, and somemaynot be due to failures. Thus, at any point in time this certificate will concernonly the keys S~~ that r~ has received so far, and can be updatedwhenanother handleis received.

7

by (2). However,since there are at most b~ j’s for whichthis holds true, it is not possible to infer that (b~ + 1 of V~)says m, or thus that V~says rn for any correct r~. Consequently,if any invocation rn’ to by Oorequires bo ÷ 1 replicas of oo to submitrn~, then any invocationrn to o2 by ol will not succeedif only b~ replicas of Ol submitm, and safety follows by induction.

Cost Thelatency of the abovedelegationis dominatedby the costs of (i) generatingthe r~l private keypairs ¯ TM) " at r~; (ii) generatingthe private keypair (~J2, j) at r~; and(iii) generatingthe (S~1, V~I),..., (S~TM, V~ digital signatures for (1) and (2) at r~ and r~, respectively. For digital signature schemesfor which generation is costly, notably RSA[22], it wouldbe necessary to l?re-generate these keys in the background and store themfor use whenneeded.For this reason, it wouldbe preferable to use a digital signature scheme, ~ such as DSA[12] or ECDSA [1], for whichkey generation is very efficient [24]. A methodinvocation following this delegation, i.e., in whichobject ol invokes a methodon object o2, requires each invoking replica r~ to digitally sign its request with ~{. Replica r k2, uponreceiving such a request, mustverify not only its signature but also the signatures of the statement sets forwardedwith the request. This cost is particularly importantsince as n~estingdepth increases, the size of generatedstatement sets grows. Whiler~ will incur this cost on the first methodinvocation, cachingthe verification status of statements (certificates) should significantly decrease the cost of subsequentmethodinvocations. Nonetheless, nested methodinvocations performedafter one object has been nested in another will be dominated by the cost of signature verification, and wouldthus benefit from a digital signature scheme,such as RSA, whereverification wasvery efficient [24]. As keys can be pre-generated, but not pre-verified, the cost of signature verification is the determiningfactor in selecting a signature algorithmfor use in a practical system.

Returning Handles Our framework accommodatesreturning a handle from a method invocation on a distributed object in a very similar way. For this, wecontinue from the aboveexample,and consider that after the abovehas transpired, replicas of oo invoke a methodon their correspondinghandles for ol, and in doing so should obtain handles for 02 that should permit the oo replicas to invokemethodson 02 in the future. (00 need not be the sameobject as in the precedingexample,thoughto simplify notation weconsider it to be.) Notethat for r~ to performthe invocation, it mustbe invokedby bo + 1 replicas of o0, and it will 2Key generation in DSA is efficientoncecertainglobalparameters are fixed,i.e., primestypicallydenoted byp andq, anda generator 9 of a subgroup of orderq in the integersmodulo p. Similarly,ECDSA is typicallydefinedovera fixedcurve,which then allowsefficientkeygeneration. 8

return a copy ~J of its handle h~ to the handle hi in r~. Again, however,r~ will replace 5)~ in h~j with a newlygenerated handle key S2 , and add

says (b0+1of

(3)

to the statement set of h2, where])~ = { Q213,., V~°j } and b0 is the numberof failures the quorumsystem for o0 is designedto mask.Just as r~ did in the previous case, whenhandle hi (in replica r~) receives least b~ + 1 handles hg~, it coalesces these handles ~:o build a handle h~ with a newhandle key S~ and a statement set including (~

~J)

says

(~

~

j ) (4)

hi then returns h) from the methodinvocation, to r~.. As before, if only b0 replicas of o0 nowattempt to peffo~ a methodinvocation m on o~, then no co~ect replica of o~ will infer (b0 + 1 of ~) says m, and methodmwill not be invoked.Thecost analysis of t~his case is silnil~ to that above.

RevocationDistributed objects are typically intended to be long-lasting, and in particular, to outlive the clients that create them.As a result, it is not practical for the certificates created duringdelegation(i.e., (1)-(4)) to expire; doing so wouldleave distributed objects stranded with no clients able to access As a result, weopt for a different form of revocation, conceptually similar to the approachin Gasser and McDermott[9]: whena JVMno longer requires a reference to a handle and so the handle is garbage collected, its private handlekey is deleted. This implies that oncethe correct replicas of a distributed object o~ have deleted their handles for another distributed object o~, the delegation that permits o~ to invoke methodson o~ can no longer be exercised.

4

Failure Atomicity

Becausethey span multiple operations, nested methodinvocations introduce the possibility that one operation succeedswhile the operationsit inducesfail. Failureatomicity in this context involvesensuringthat all nested methodinvocations are successfully performed. To ensure that each nested operation is consistently performedby a full quorumof correct server replicas, it is necessarythat each server replica in such a quo-

rumreceives a copy of the methodinvocation request fromat least bc ÷ 1 client replicas, wherebc denotes the numberof faults that the client object is configuredto tolerate.. Afirst step towardattaining this goal is for every client replica to send its methodinvocation request to the samequorumof server replicas. This is done by using a unique methodidentifier 3 as input to a public uniform hash function that mapsmethod identifiers to quorums. Unfortunately, selection of the samequorumby every client replica is not sufficient to guarantee failure atomicity, as maliciousserver replicas could respondselectively to requesting client replicas, allowingsome of themto succeed with their selected quorum,whille forcing others to select a newquorum.This could prevent any quorumof correct server replicas fromreceiving the samerequest from be ÷ 1 client replicas. In addition to selecting the sameinitial quorumto contact, each calling client replica also needsto ensure that evenafter any failures that mayoccur on any of the initial server replicas contacted, including failure to respondto requests, a fifll quorumof correct server replicas will receive at least be + 1 requests. We accomplishthis using non-blockingByzantine quorumsystems [5], whichguarantee that even whenbs server replicas fail, a responsewill still be receivedfroma set of server replicas constituting a standard Byzantine quorum[15]. Becauseof this strong availability guarantee, noneof the calling client replicas will ever need to select another quorumto contact due to failures in the first selected quorum.This in turn ensures that a standard quorumof server replicas will all receive requests from and send replies to at least be ÷ 1 client replicas. Byzantine quorumsystems and non-blocking Byzantine quorumsystems are similar in that both require a response froma standard Byzantinequorum.Theydiffer in that the former seeks this response by accessing a standard Byzantinequorum,and will thus be prone to the failure of any selected server, while the latter accesses a non-blockingByzantine quorum,ensuring that even after the failure of someof those servers, the required response will still be received. Wedistinguish betweenthe two types of quorumsinvolved in non-blocking Byzantine quorumsystems as access quorums, to whomrequests are sent, and underlying quorums,from whichresponses are received. Previous uses of non-blocldng quorumsystems allow access quorummembersto be contacted incrementally [5]. Oncea response is received froman underlying quorum,the client need not contact any remaining 3This method identifier is constructed not onlybyencoding the method nameandargument values,but also by appending a countervalueto the method identifierof the method call that "contains" it.

10

membersof the access quorum. However,whennon-blocking quorumsystems are used for nested operations, a client replica cannot stop contacting server replicas merelybecauseit has received a response from an underlying quorum.This is becauseall client replicas need to worktogether, ensuring that each client replica receives a response from an underlying quorum."Whenany given client replica receives such a response, it is possible that someof the responses comefrom malicious server replicas whoonly respond to a subset of the client replicas. Thuseach client replica must continue to contact a full access quorumeven after it has received a respense froman underlying quorumin order to ensure that the other client replicas will e. also be able to receive the necessaryrespons

5

Operation Ordering

As discussed in Section 1, the modelof object interaction that weattempt to mimicin our approach is that offered by Java RMI.The concurrency semantics that most naturally charact6rize Java RMI,presuming that remoteJava objects process each methodinvocation in isolation, is linearizability [10]. For the distributed objects weconsider, owingto their replication and quorumbased access, werequire a method invocation protocol that implementsthis property. Prior protocol’s to achieve this property in systemssupporting quorum-basedaccess, notably [7], assumea correct client. This assumptionis violated in nested object systems, whereclients can be replicas of other distributed objects, someof whichmaybe Byzantine faulty. Theapproachwetake to methodinvocations here retains the basic structure of prior quorum-based protocols in permitting a single client to drive the ordering protocol, but does so without trusting that client to make anyprotocol decisions.Rather, this client is usedonly as a point of centralization to distribute a set of server messagesfrom whichthe servers work to makeprotocol decisions. Formsof misbehavior, notably sending different messagesto different servers, cannot cause correct servers to take permanentand conflicting actions. In addition, a faulty client cannotpreventprogress by falling silent as another client can unilaterally "take over" driving the protocol. In order to ensure linearizability, all correct servers mustapply operations in the sameorder. Theydo this by indirectly communicatingstate with other server replicas in a quorum,and then makingdeterministic decisions based upon the view they have of the quorum.Whena non-faulty client drives the protocol, it 11

1. submit(op) 2. II waiting~-- true 3. /~i~-~ 4.

11. 12. 13. 14. 15.

Q1 *--D Q+

5. 6.

p~ ~- t~.submit@q(op))

7. 8. 9.

R~ ~ p~ U ~,tn (~: I{~ wai~i~g~ false

lO.

~t~,~ P:

II ~ ~ 0

16.

repeat 0r ~-- chooseRank Ra ~- 0 Q2 ~--D Q÷

17.

state~ ~ ~.process(~, r)

18. 19. 20.

if (valid(s~ate~, R~,r, V~) Qa ~ {~} U Q~ ~a ~ state~ u ~a

~. 22. 23.

nnen(?Q~ ~: Q R~ ~ ~a until (waiting = false)

Figure 3: Client side of ordering protocol echos the state of each server replica in a quorumto the other server replicas in somequorum. Server state consists of the set of pending operations on that replica, and the last distributed object state committed(Gc), proposed (Gpc), and suggested (Gs) on that replica.

5.1

Client Protocol

The client side of the method invocation protocol is shownin Figure 3. This protocol consists of two main parts, shownin lines 2-10 and 11-23 of Figure 3, run concurrently as indicated by the "[[" in lines 2 and 11. In the first part, the client request ("op") is forwarded to a quorumQ drawn deterministically ("~--D", line 4) from a nonblocking quorumsystem Q+, as described in Section 4. The client signs its request ("S(op)", line 6) using its private key (the handle key of Section 3) and sends the request to each memberof Q. collects responses p into a set R1 until it has copies of the same reply from at least bs ÷ 1 server replicas (line 8), where bs is the numberof replica failures that the quorumsystem Q+is designed to withstand. In parallel,

the client drives the ordering protocol as shown in lines 11-23 of Figure 3. The client runs

this ordering protocol with a particular rank r, whic]h is an integer value assigned anew by the chooseRank method each time the client begins the ordering protocol (line 13). In order to make progress, the same rank must be kept during each iteration,

so the standard behavior of chooseRank is to leave the current

rank unchanged. However,at any point in time each server will only interact

with the highest ranked client

that has contacted it so far. As such, when a client receives a RankException from a server (not shownin Figure 3), chooseRankselects a new rank, larger than the one that preempted it, for use in the next round of

12

the ordering protocol. A backoff strategy could be used to avoid clients repeatedly interrupting each other; as this doesnot effect the underlyingprotocol, wediscuss it in Section5.4. Thecore of this part of the client protocol is that it simplyqueries each server ~ in a non-blockingquorum Q2 E Q+(lines 15-16) by querying ~.process(/~2, r) (line 17) where R2 is the set of valid responses gathered fromservers in the previous iteration, until it receives valid responses froman underlying quorum Q E Q (line 21) whereQ is the underlying quorumsystem induced by Q+(see Section 4). Here, a response state is valid if the state is consistent with the correct executionof ~.process(R2,r), whichincludes bearing the current rank r, being properly signed by S~, and bearing values reflecting correct protocol logic when applied to R2. Validity is tested in line 18, thoughthis is not shownin the figure to avoid duplicating the server-side logic. Althoughvalidity testing by the client cannot be relied on to ensure correctness (as the client could be faulty), it is importantin allowing correct clients to determinewhenvalid responses have been received from a full quorum,without whichthe next round maybe futile. A maliciousclient could fail to send messagesto someor all of the server replicas, or at the very worst could send different messagesto different sets of server rep][icas. As wewill demonstrateshortly, the only possible effect of such misbehavioris that it mighttemporarilyslow progress, but this cannot result in an incorrect linearization of methodinvocations.

5.2

Server Protocol

Theserver side of the protocol, as outlined in Figure 4, is necessarily morecomplicatedas it handles all of the protocol’s logic. Theprocess method(on the left of Figure 4) is invokedby client replicas’ ordering threads, while the remainingmethods(on the right of Figure 4) are intended strictly for internal use by server replica. For simplicity we do not showthe submit methodinvoked by clients’ submit threads; this simplyadds the submittedoperation to the pendingset onceit has been requestedby be -4- 1 client replicas, and returns a response to the waiting client replicas whenthe submitted operation has been ordered and executed. The bulk of the server-side logic is contained in the process method.This methodfirst confirms that the rank of the client is at least as large as any rank seen so far, and throws a RankException and terminates this run of process if not (lines 2-6). After checkingthe rank and verifying the signatures of the supplied

13

:2. 3. 4. 5. 6.

8. 9. 10. 11. 12. 13. 14.

if (~ > maxR~nk) maxRank ,- r else if (~ < m~R~) throw RankException(m~R~) foreach u E Q (content~, r~) +- state~ if (r~ = r A V~says state~) ¯ c pc , scry, pending~) (~r~, ~r~ ~ content~,

else Q if (3Q’ E Q: Q’ C_ Q)

¢g~+-¢:?Q’eQ:Q’c{u:~=¢~}_ completed +- max{v:l{u: cry.version > v}i >_ b~ + 1}

1. 2. 3. 4. 5. 6. 7. 8. 9.

addOps(cr’, pending) repeat op +-- D pending pending ~-- pending \ {op} if (cr’.reflects(op) = false) cr’.addOp(op) until (pending = ~) ~r’.version*-- ~r’.version4- 1 return ~’

1. 2. 3. 4. 5.

suggest(e) if (or~ = ±) ~rS+-- G else throw Ran k 8xception (maxRank)

15. 16. 1. propose(or, r) 2. or.proposer*-- v

17. 18. 19.

20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40.

1. 2. 3.

if (cr~ ¢ I A c@version> completed) commit(~) else if (cr~ ~ # ±/\ cr~PCversion> completed) if (3Q’ ~ Q: Q’ c_ Q~) p°) commit(~ else if (~r~c = propose(a~°, r) else ~) suggest(cr~ else if (~ ¢ ± A ~r~.version > completed) propose(a~, else

commit(or) if (or.version > crCversion) cr.applyOps

4.

~c~_~

5.

pending ,- pending \ {op : cr.reflects(op) = true} response *-- [email protected]

6.

pendi% if (pendingg cr ~-- addOps(cr~,pendingg) suggest@r) return S((cr~, crp~, cr~’ pending,),

41.

Figure4: Serverside of orderingprotocol

14

server replica states (line 9), each server replica takers steps to determinewhichnewobject state should suggested, proposedor committednext. It does this on the basis of the following values, whichit derives fromthe server replica states {stateu}uEc2.(Notethat the server-side logic is only concernedwith underlying quorums,thus all references to quorumsin this section refer to underlying quorums;non-blocking access quorumsare only required by the client for use whencontacting servers.)

(line 13) is the state last suggestedby somequorumof servers, if any; c~9~c(line 17) is the state with the highest version numberproposedto somecorrect server or, if there are multiple such states, the one proposedby the highest-rankedproposer; (line 21) is the state with the highest version numbercommittedto somecorrect server; and completedis the highest version numberfor whichsomequorumhas committeda state with that version number(line 24), or for whichsomecorrect server has suggested(line 14) or proposed(line a state with a strictly higher version number(implying that completedwas previously committedat a quorum).Wesay that a state has completedif its version numberis less than or equal to completedat somecorrect server. In summary,the server takes the following actions, in order. It commits c if it cannotdeterminethat it has beencommittedat a full quorum(lines 25-26). Otherwise,it looks at o-~c if it exists and is not already completed(line 27): if o-~c has beenproposedto a full quorumby this client (lines 18, 28), it commits (line 29). If cr~Pehas not beenproposedto a full quorum by this client, the server suggestso-gpcif it has not already beensuggestedat a full quorum(line 33) and proposeso~c if it has (line 31). If crgPcdoes not exist or is already completed,then the server similarly examinescry, proposingit if it exists and is not already completed(lines 34-35). If o-~ does not need to be acted upon, the server deterministically orders method invocations that are pendingon bs ÷ 1 servers--appendingthemto the current state, but not yet executing them; see below--through the addOpsoperation, and suggests this state (lines 37-40). Wenote that addOps,the predicate cr.reflects(op) indicates whetheror not the operation op is incorporatedinto the state or; as such, it prevents duplicate invocations. Finally, the server signs its current local states that it sees as committed,proposed, and suggested, as well as all pendinginvocations and the rank r, and returns this (line 41). 15

It is important to note that the addOpsoperation (line 39) does not actually invokemethodinvocations the object o.~; rather, they are just appendedfor later application by applyOps(line 3 of commit).With the introduction of nested operations, the scope of impact of a methodinvocation on an object is no longer limited to that object. Eachinvocation can result in a nested methodinvocation on another object. As such, it is importantto withholdperformingoperations until they are actually committed,as a suggested or even a proposedstate maynever ultimately be committed.

5.3

Correctness

Asall of the protocol logic is performedon the server side, all of the followinglemmasrefer to Figure 4, and all quorumreferences are to underlying quorums. Lemma 5.1. If o- is proposedat somecorrect server replica by a client with rank r, then @,r) wassuggested at a full quorum. Proof There are two places in the protocol wherestates are proposed: line 31 and line 35. In both cases (o., r) is proposed,with o. being assignedin line 13 to the state currently suggestedat a full quorum.Asthe suggestedstate at correct server replicas is cleared each time a higher rank is encountered(line 4), o- must havebeen suggestedat a full quorumat rank r.

[]

Lemma 5.2. Supposethat Io-l,rl) is proposedat a correct server replica, and (o-2, r2) is proposedat correct server replica. If r2 = rl, then o-2 = o.1. Proof Supposethat r2 = ~1 but o-2 ¢ o-1. Accordingto Lemma 5.1, prior to being proposedto any correct server replica, o-1 and o-2 wereeach suggestedat a full quorumall rank rl. Eachcorrect server replica will suggest at most one state at rank rz (suggest method). As any two quorumsare guaranteed to intersect at least b + 1 server replicas, any two quorumswhosememberssuggested o-1 and o-2 respectively at rank rl wouldoverlapin at least oue correct server replica. This wouldimplythat at least one correct server replica suggestedboth o-~ ando-2 with rank r~, a contradiction.

[]

Lemma 5.3. Supposethat (o-~, rl) is proposedat a full quorum,and (o-9~, r~) is proposedat a correct server replica. If o-~.version: O-l.versionandr2 >_rl, then or2 = o1.

16

Proof Weprove the result by induction on r2. The base case, where r2 = r~, follows from Lemma 5.2. Accordingto the induction hypothesis, for all proposals /o- ~, r~), wherer2 > r >rl ando-~.version = o-l-version,it is the casethat o-t = o-1. When(o-2, r2) was proposedon a correct server replica with o-2.version = o-l.version, no state o-t with o-~.version > o-l.version wasseen as proposedon bs + 1 server replicas in the quorumaccessed with rank r2, as that wouldhavecaused/or~, r2) to be proposedinstead of (o-~, r2) (lines 15-17, As o-~ wasproposedat a full quorumwith rank rl, and as every proposalwith rank r ~ and version o-1 .version, wherer2 > r t > rl, was also for state ~r~ by the induction hypothesis, every quorumaccessed with rank r~ will see o-1 proposedon at least bs + 1 server replicas. It is thus the case that at ra~akr2, o-1 is the state proposedwith the highest version andrank on at least one correct server replica, and for any (o-~, r2) proposedat a correct server replica, if cr2.version = O-l.Version, then o-2 = o-1.

[]

Lemma 5.4. If o- is committedat some correct server replica by a client with rank r~, then (o-, rl) for rl < r~ was proposedat a full quorum. Proof There are two places in the protocol where states are committed: Online 29, o-2 is committedif it is seen to have been proposed at a full quorumduring the same round, as determined on lines 15-18 and 28. In this case r~ = rl. Online 26, o-2 is committedif it has already been committedon a correct server, as determinedon lines 20-21, implyingthat it waspreviously proposedat a full quorum.In this case r~ > rl.

[]

Lemma 5.5. Supposethat (o-1, rl) is committedat a correct server replica, and (o-2, r~) is committed correct server replica. If o-~.version= o-~.versionandr2 >_rl, then o-2 = Proof According to Lemma5.4, both o-~ and o-z were proposed at full quorums. By Lemma5.3, as o-l.version ---- o-2.version, o-~_~- o-2.

[]

Lemma 5.6. If o-2 is suggestedat somecorrect server replica, then it extends the state cr; that is committed at a full quorumwith o-2.version = o-l.version + 1, by applying operations in {op : o-2.reflects(op) -~o-~. reflects (op) } to ~ in somesequentialorder.

Proof There are two places in the protocol wherestates are suggested: Onlines 39-40, a newo-2 is created and suggested, o-2 is created by adding operations whichare pending on at least onecorrect server replica (line 37) to ~1 in a deterministic order, whereo-1 is the highest ranking state that is committedat a full quorum(lines 20-25), and setting cr2.version to crl.version + 1 (addOps method). Online 33, o-2 is suggestedif it has already beenproposedon a correct server (lines 15-17), implyingthat it waspreviously suggested at a full quorumper Lemma 5.1.

Theorem5.7 (Order). Submitted operations are applied in a well-defined sequencethat is reflected in the results of all operationsthat return. Proof Accordingto Lemma 5.4, a committo a correct server replica must be preceded by a proposal at a quorum,and by Lemma 5.5 once a state is committedat a correct server, no different state can be committed at a correct server for that version number.Lemma 5.1 indicates that a proposeat a correct server must be precededby suggestion at a quorumwith the same rank. ByLemma 5,3 once a state is proposedat a quorum, no different state can be proposedat a correct server for that version number.Further, according to Lemma 5.6, once cr is committedat a full quorum,any state cr/ suggested to a correct server replica mustextend cr by applying a block of pendingoperations in a deterministic, serial order. There is thus a well-defined sequencein whichoperations are applied°

[]

5.4 Liveness Our discussion so far has focused on safety. Liveness is also an important consideration, especially as a Byzantineclient could endup driving the protocol. It is thus desirable to tolerate faulty clients while still achieving high throughput with manyconcurrent clients. A first step towardsthis is to require clients to sign their ordering requests with the private key of the associated handle. In conjunctionwith the delegation statements deft.ned in Section 3, this wouldensure fhat the ordering protocol for a given distributed object could only be driven by handles whichwere authorized to invoke methodson that object. In addition to ensuringthat only authorized handles can drive the ordering protocol, it is also necessaryto 18

prevent Byzantineclients from doing all the driving and thus preventing forward progress. Wedo this with a form of backoff similar to [7], thoughenforcedby server replicas to prevent faulty clients from repeatedly interrupting the protocol for correct clients. Whilea faulty client can delay the ordering protocol during a single round, enforced backoff ensures that correct clients (whoare the majority due to quorumoverlap requirements) will regularly be allowed to drive the protocol. Becauseoperations are applied en masse, high average throughput is maintained by the system despite small delays during rounds whichare driven by Byzantineclients.

5.5 Performance Wehave implementedthe protocol discussed in Sections 5.1 and 5.2 within the context of a substantial revision of the Fleet system.Fleet is a distributed object store built in Java that implementsdistributed objects such as those shownin Figure l(a). However,Fleet previously did not provide support for transparent nesting; in fact, nesting could result in duplicate nested methodinvocations, evenin the absenceof failures [16]. As such, our frameworkprovides a useful extension to the Fleet system. In our experiments, each server and client was equipped with dual Pentium-III 1GHzprocessors, running Linux 2.4.24 SMPand Java HotSpotTM Server VM1.4.2. The servers and client were connected by 100MbpsEthel-net and utilized TCPfor all communication;multicast was not used. Key generation, signing, and signature verification were all performednatively using the Crypto++Library. All tests were performedwith three signature algorithms: RSA[22] with 1024-bit keys, DSA[12] with 1024-bit keys and ECDSA [1] with 160-bit keys. By way of comparison, we also adapted our protocol to use HMACs [6], using Diffie-Hellmankey agreementto establish a shared secret key betweeneach pair of servers. The non-blocking quorumsystem in use was the extension of the Paths system [17] proposed by Bazzi [5], and all methodinvocations were performed by unreplicated cliients. 4 In order to measure the overhead introduced by the ordering protocol, all methodinvocations performedinexpensive tasks, such as get and set. Dueto the limited set of homogeneous computersat our disposal, weonly tested distributed objects having betweenn = 1 and n = 36 replicas and with b = 0, yielding quorumsof sizes from q = i to q = 20. Preliminaryresponse times as seen by a client for a relatively unoptimizedimplementationof this protocol 4While replicatedclientsare; animportant component of ourwork,the orderingprotocol is drivenbya singleclientreplicaat anygiventime. 19

Figure 5: Meanresponse time are shownin Figure 5. Among the three digital signature algorithms tested here, the use of RSAsignatures, whoseverification times are notably faster than the other signature algorithms[24], yielded the lowest mean response time. This result reflects the fact that each server replica in a quorummust sign its ownstate, but verify signatures on states from every server in a quorum.Note that the response time growthrate is sublinear while b stays constant, due to the fact that quorumsin the chosenquorumsystemare asymptotically of size O(x/-~). Prior to analyzing the performance of HMACs, it is necessary to understand howtheir implementation differs from that of signatures. As part of the ordering protocol, each server replica signs its local state and sends it to the driving client replica (Figure 4, line 41), whothen distributes it to a quorumof server replicas (Figure 3, lines 16-17). Whenusing HMACs instead of signatures, it is necessary for each server replica to include an HMAC for every other server replica, and for the driving client replica to include the appropriate set of HMACs for each server replica that it contacts. HMACs thus consumemore network bandwidthbetweenserver replicas and the driving client replica, as the numberof server replicas grows. Consideringthis, it is unsurprising that the responsetime using H.MACs does not scale as well as signatures as r~ increases. Anotherdifference resulting from HMACs is that the client is unable to verify them(unlike signatures). As such, a faulty server t]hat returns invalid HMACs will not be readily detectable to a client, and may impingeon the client’s ability to completethe protocol. In such c, ircumstances,the client can "fall back"to a signature-basedprotocol to better enforce progress.

20

6

Conclusions

and future

work

The delegation architecture whichwe have presented can be used to arbitrarily nest distributed objects, while ensuring correct systembehaviordespite a limited numberof Byzantinefailures in the object replicas. This is accomplishedthrough the creation of delegation keys and delegation certificates that worktogether to authorize the invocation of methodson nested objects. Methodinvocation capabilities on a specific distributed object can be delegatedto other clients explicitly, and are implicitly delegatedto permit nested methodinvocations by sufficiently large groupsof replicas to succeedtransparently. Throughthe use of non-blocking Byzantine quorumsystems, failure atomicity can be ensured for nested object interactions. Specifically, at every level within a chain of nested methodinvocations, enoughserver replicas will be activated to succeedat performingthe next invocation in the chain. Further, as long as nonblockingquorumsare used at each level in such a chain, the success of a methodinvocation at one level will not be followedby the failure of the parent operation. Finally, wepresented a novel protocol to achieve linearizability of nested methodinvocations. Our protocol tolerates Byzantinefaulty clients, as is necessary whenmethodsare invokedfromother distributed objects that themselvesmust withstand Byzantine faults. Wedetailed this protocol, argued its correctness, and described its implementationand performancein a distributed object system. Anextension that we are presently studying is the nested creation of distributed objects, i.e., whereone distributed object creates another. Ourframework providesa basis for supportingdistributed object creation, but requires us to addressadditional issues. Wewill report on this progress in future work.

References [1] ANSIX9.62, Public Key CryptographyFor The Financial Services Indusirry: The Elliptic Curve Digital Signature Algorithm (ECDSA),AmericanNational Standards Institute,

1999.

[2] A.W. Appeland E. W. Felten. Proof-Carrying Authentication. In Proceedings of the 6th ACMConference on Computerand CommunicationsSecurity, November1999. [3] D. Balfanz, D. Dean, and M. Spreitzer. A security infrastructure for distributed Java applications. In Proceedingsof 2000 IEEESymposiumon Security and Privacy, May2000.

21

[4] L. Bauer, M. A. Schneider, and E. W.Felten. A General and Flexible Access-ControlSystemfor the Web.In Proceedingsof the l lth USENIX Security Symposium,August 2002. [5] R. A. Bazzi. Access cost for asynchronousByzantine quorumsysteras. Distributed Computing14(1):41-48, 2001. [6] M. Bellare, R. Canetti, and H. Krawczyk.Keyinghash functions for messageauthentication. In Proceedings of the 16th AnnualInternational Cryptology Conferenceon Advancesin Cryptology, pages 1-15, 1996. [7] G. Chockler, D. MalkbS,and M. K. Reiter. Backoffprotocols for distributed mutual exclusion and ordering. In Proceedings of the 21st International Conferenceon Distributed ComputingSystems, pages 11-20, April 2001. [8] C.M. Ellison, B. Frantz, B. Lampson,R. Rivest, B. M. Thomas,and T. Ylonen. SPKICertificate Theory, September1999. RFC2693. [9] M. Gasser and E. McDermott.Anarchitecture for practical delegation in a distributed system. In Proceedingsof the 1990 IEEESymposiumon Research in Security and Privacy, pages 20-30, May1990. [10] M. R Herlihy and J. M. Wing.Linearizability:

a correcmess condition for concurrent objects. ACMTransactions on Pro-

grammingLanguages and Systems 12(3):463-492, 1990. [11] K. R Kihlstrom, L. E. MosermadR M. Melliar-Smith. The SecureRinggroup corm~aunicationsystem. ACMTransactions on Information and System Security 4(4), November2001. [12] D.W.Kravitz. Digital signature algorithm. U.S. Patent 5,231,668, 27 July 1993. [13] B. Lampson,M. Abadi, M. Burrows and E. Wobber. Authentication in distributed

systems: Theory and practice.

ACM

Transactions on ComputerSystems 10(4):265-310, November1992. [14] L. Lamport, R. E. Shostak and M. C. Pease. The Byzantine generals problem. ACMTransactions on ProgrammingLanguages and Systems 4(3):382-401, July 1982. [15] D. Malkhi and M. Reiter. !Byzantine quorumsystems. Distributed Computing11(4):203-213, 1998. [16] D. Malkhi, M.K. Reiter, D. Tulone,and E. Ziskind. Persistent objects in l:he Fleet system.. In Proceedingsof the 2nd DARPA InformationSurvivability Conferenceand Exposition, Vol. II, pages 126-.136, June 2001. [17] D. Malkhi, M. K. Reiter, and A. Wool.The load and availability of Byzantine quorumsystems. SIAMJournal of Computing 29(6): 188%1906,2000. [18] J. E. B. Moss.Nestedtransactions: Anapproachto reliable distributed computing.Ph.D. Thesis, MassachusettsInstitute of Technology, May1981. [19] J. E. B. Moss. An Introduction to Nested Transactions. COINSTR86-41, University of Massachusetts, Department of ComputerScience, September 1986.

22

[20] N. Busi and G. Zavattaro. On the Serializability

of Transactions in JavaSpaces. In Proc. of International Workshopon

Concurrencyand Coordination,Electronic Notes in Theoretical Cornpute:r Science, Vol. 54, July 2001. applications [21] E Narasimhan,K. E Kihlstrom, L. E. Moserand E M. Me][liar-Smith. Providing suigport for survivable CORBA with the Immunesystem. In Proceedings of the 1999 IEE~EInternational Conference on Distributed ComputingSystems, pages 507-516, May1999. [22] R.L. Rivest, A. Shamir, and L. Adleman.A methodfor obtaining digital signatures and public-key cryptosystems, Communications of the ACM21(2):120-126, Feb. 1978. [23] E B. Sctmeider. Implementingfault-tolerant services using the state machineapproach: A tutorial. ACMComputingSurveys 22(4):299-319, December1990. [24] M. J. Wiener. PerformanceComparisonof Public-KeyCryptosystems. CryptoBytes 4(1): 1-5, 1998.

23

A Modeling Framework for Capturing Parallel Flows and ... - CMU (ECE)

A Modeling Framework for Capturing Parallel Flows and ... - CMU (ECE)

Suggest Documents

A Multi-core High Performance Computing Framework for ... - CMU ECE

A Framework for Assessing the Dependability of ... - CMU ECE

OmniSense: A Collaborative Sensing Framework for ... - CMU (ECE)

Virtual Probe: A Statistically Optimal Framework for ... - CMU ECE

An Experimental Framework for Implementing and ... - CMU (ECE)

COMPRESSED SENSING - CMU (ECE)

Ballista - CMU (ECE)

s1 m'l - CMU (ECE)

Dynamic Frequency and Voltage Control for a Multiple ... - CMU ECE

The Amaranth Framework: Probabilistic, Utility-Based ... - CMU ECE

TOPOLOGY FOR GLOBAL AVERAGE CONSENSUS ... - CMU (ECE)

SensOrchestra: Collaborative Sensing for Symbolic ... - CMU-ECE

FFT Compiler Techniques - CMU (ECE)

recursive implementation and performance analysis - CMU ECE

Rethinking Architectural Research and Education - CMU (ECE)

Fault Injection Techniques and Tools - CMU/ECE

Rethinking Architectural Research and Education - CMU (ECE)

A methodological framework for capturing relative ...

A Framework for capturing the Relationship

Stochastic Communication: A New Paradigm for Fault ... - CMU (ECE)

Development of a MEMS Device for Acoustic Emission ... - CMU (ECE)

A frequency separation macromodel for system-level ... - CMU (ECE)

A Two-Axis Electrothermal Micromirror for Endoscopic ... - CMU/ECE

A MEMS transducer for detection of acoustic emission ... - CMU (ECE)