Fault-Tolerant Data Structures∗

Yonatan Aumann
Department of Computer Science
Bar-Ilan University
Ramat-Gan 52900, Israel
[email protected]

Michael A. Bender
Department of Computer Science
State University of New York
Stony Brook, NY 11794-4400, USA
[email protected]
Abstract

We study the tolerance of data structures to memory faults. We observe that many pointer-based data structures (e.g., linked lists, trees, etc.) are highly nonresilient to faults. A single fault in a pointer in a linked list or a tree may result in the loss of a disproportionately large amount of data. In this paper we present a formal framework for studying the fault-tolerance properties of pointer-based data structures, and provide fault-tolerant versions of the stack, the linked list, and the binary search tree.
1 Introduction
Motivation. Many commonly used pointer-based data structures are highly nonresilient to memory failures (e.g., disk sector failures, main memory erasures, accidental overwrites, etc.). Consider, for example, a linked list. Losing even a single pointer makes the entire tail of the list unreachable. Thus, one fault may result in the loss of an unbounded amount of data. Trees, stacks, and other common pointer-based data structures exhibit similar fragility. The objective is to make data structures more fault tolerant. Clearly, it is best to avoid faults in the first place. However, faults do occur, and their destructive effect should be minimized.

An Example. Consider the linked list. How can we make the linked list more fault tolerant, so that a single fault does not cause so much havoc? A naïve solution is replication. If each data item is replicated d + 1 times, the resulting data structure becomes resilient to d faults. This solution, however, entails a high price both in space and in time: the data structure occupies a factor of d more memory space and requires a factor of d more work for insertions and deletions.

A more efficient solution is the following. At each node, we maintain two pointers pointing out of the node. One pointer points to the successor node in the list, as usual.
∗This work appeared in preliminary form in the Proceedings of the 37th Annual Symposium on the Foundations of Computer Science (FOCS), pages 580–589, October 1996 [2].
The second pointer points to the node that is d + 1 positions along the list. For this new structure, it can be proved that with f faults at most O(f^2) nodes are lost, as long as f ≤ d. Specifically, the amount of lost data is bounded and is a function of the number of faults. Thus, by adding only one extra pointer to each node of the linked list, the data structure becomes resilient to up to d faults (in the sense that the bulk of the data remains accessible). Unlike replication, the space overhead in this structure is constant. Insertions and deletions, however, still take O(d) operations. In Section 4 we provide a more efficient version of the linked list, where f faults result in only O(f log f log d) lost nodes, and both insertions and deletions take constant time. (A sketch of this two-pointer list appears at the end of this introduction.)

Our Results. In this paper we address the problem of fault tolerance of data structures, providing two main contributions:

1. We present a formal framework for studying the fault-tolerance properties of data structures. We define a parameterized notion of fault tolerance, which measures the amount of lost data as a function of the number of faults.

2. We introduce efficient fault-tolerant versions of three common data structures: the stack, the linked list, and the binary search tree.

A full description of our results appears in Section 2 after we lay down the formal framework.

Applications. One of the major goals in file-system design is quick recovery from an inconsistent state in the system's metadata (the data structures of the file system). The metadata may reach an inconsistent state, for example, as a result of power failures, undetected disk errors, or internal bugs in the file system. Inconsistent metadata often causes the system to crash. In modern systems, the computer may be unusable for well over an hour during reconstruction (e.g., accomplished using the function fsck in UNIX). We suggest that faster recovery may ultimately be achieved by replacing the current data structures with more fault-tolerant ones, in the spirit of this paper. Note that because we are concerned with fast recovery, the lost data does not have to be inaccessible forever, but only until a more thorough recovery is completed. Meanwhile, the system is operational.

In many applications, continuous functionality of the system is a prime concern (e.g., airline reservation systems). Such applications have hardware solutions that provide full fault tolerance (e.g., lots of redundant hardware). Unfortunately, these solutions are often expensive and restricted to mission-critical applications. Software-based fault tolerance in the spirit of this paper provides an alternative, cost-effective solution for less demanding applications. It allows for quick recovery of the system as a whole, while the limited amount of lost data can be recovered in the background, using lengthier procedures.

The prominence of the Web during the last decade has generated new applications in which memory failures occur regularly. For example, search engines, such as Google [15], sift through large quantities of data gleaned from the Web. In order to manipulate this data cost-effectively, many search engines, including Google, use inexpensive hardware, in which memory faults occur regularly. Small numbers of memory failures will not cripple the application, but their damage should be limited.
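To make the introductory example concrete, the following minimal Python sketch (all names ours) implements the two-pointer list and a traversal that survives faults by jumping over faulty runs. It is an illustration under our own assumptions, not the reconstruction procedure analyzed later in the paper.

```python
class Node:
    """Node of the two-pointer list from the example (field names ours):
    'nxt' is the usual successor pointer; 'skip' points to the node
    d + 1 positions later in the list, or None near the end."""
    def __init__(self, value):
        self.value = value
        self.nxt = None
        self.skip = None

def salvage(head, d, faulty):
    """Traverse the list despite up to d faults: follow 'nxt' when the
    successor is intact; when it is faulty, jump past the faulty run
    via a 'skip' pointer of one of the last d + 1 salvaged nodes.
    Nodes jumped over are the (bounded) loss.  Assumes the head itself
    is intact; 'faulty' is the set of faulty nodes."""
    out, v = [], head
    while v is not None and v not in faulty:
        out.append(v)
        nxt = v.nxt
        if nxt is not None and nxt in faulty:
            nxt = None
            # Most recent salvaged nodes first; their skip pointers
            # land strictly ahead of the current position.
            for u in reversed(out[-(d + 1):]):
                if u.skip is not None and u.skip not in faulty:
                    nxt = u.skip
                    break
        v = nxt
    return [u.value for u in out]
```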
Related Work. Computing in the presence of memory faults is studied in many contexts. There is a large body of literature on error-correcting codes, useful for memory or transmission fault tolerance [6, 29]. For example, certain Reed-Solomon codes are currently used in compact discs. Rabin [33] introduces the Information Dispersal Algorithm, which has applications to efficient fault-tolerant disk storage and fault-tolerant routing. This algorithm breaks a large file into redundant small pieces in an asymptotically space-efficient manner. Computing systems use redundant arrays of inexpensive disks (RAIDs) to protect against failures in storage [31]. Extra "check" disks store redundant information so that when a disk fails, its data can be reconstructed. Rabin [32] presents a fingerprinting scheme via random polynomials for recognizing errors in memory.

Some data structures are already built robustly. Munro and Poblete [30] discuss a method of representing search trees in an environment where pointers may be lost or maliciously altered. Their representation permits any two field changes to be detected and any one to be corrected. Lock-free data structures are concurrently accessed by multiple processes; they perform correctly even though the data structure might be changed by one process while another process is accessing it [4, 20, 38]. There is also work on checking or certifying the performance of data structures [7, 1, 37].

For the data structures of file systems, replication is a central tool to obtain fault tolerance. For example, in the Cedar File System, described by Hagmann in [17], all metadata appears twice on the disk. The file system is constructed under the assumption that at most one disk sector or two contiguous sectors fail simultaneously. Once a crash occurs in the file system, it is important to recover as quickly as possible. To aid the recovery, some updates are stored sequentially in a buffer called a log. The idea of logging, originally from database systems [16], is currently used in some file systems for fast recovery [17, 23, 35, 36].

Fault tolerance with respect to processor failures is widely studied, but is out of the scope of this paper (see [22, 24, 5, 13, 8, 14]). The fault tolerance of networks, including many specific architectures (e.g., the mesh, the hypercube), has been studied with respect to both routing and parallel computing (see [18, 19, 21, 34, 11, 12]). A network architecture especially designed for fault tolerance is described in [28]. Kutten and Peleg [26, 25] consider the problem of correcting an illegal state of a distributed network, and define the notion of fault-local mending. A distributed correction algorithm is fault-local mending if its time complexity depends only on the number of failed nodes, not on the size of the entire network. In the discussion in [25] the authors raise the possibility of extending their notion to sequential data structures, and introduce the concept of fault-locally mendable data structures. This notion is related to, but distinct from, the one introduced in this paper.

Outline. The rest of this paper is organized as follows. In Section 2 we introduce the definitions and the formal framework for studying fault tolerance of data structures. In Sections 3–5 we provide fault-tolerant versions of the stack, the linked list, and the binary tree.
Section 6 shows how the results for all three data structures can be improved using expanders. Finally, we conclude with a discussion in Section 7.
2 Terminology: Faults, Reconstruction, and Emulation
2.1 Definitions
In this section we present the formal framework for studying the fault-tolerance properties of data structures. To this end, we provide the following elements:

• A formal definition of data structures and faults.
• A formal definition of the notion of a reconstruction of a data structure after faults have been detected.
• A quantitative measure of the fault tolerance of the data structure. The measure is based on the amount of lost data as a function of the number of faults.

Data Structures and Data-Structure Schemes. A data structure is characterized by the operations it supports and by their implementations. We focus on pointer-based data structures. For these data structures, we view instances as directed graphs, with the data residing in the nodes and the edges representing pointers. Accordingly, we define a data structure scheme, S, to be a pair S = (H, P), where

• H is a family of graphs that are valid instances of the data structure scheme,
• P is a set of procedures {P1, . . . , Pn} that create and manipulate graphs of H.
An instance of the data structure scheme is any valid graph of the family H. For an instance graph H ∈ H, we distinguish between two types of nodes:

• Information nodes: nodes that contain information inserted into the data structure by the user.
• Auxiliary nodes: nodes that contain auxiliary or structural data, for internal use by the data structure and procedures (e.g., internal nodes in a tree where all data resides in the leaves).

This distinction is important when some of the data is lost and the data structure needs to be reconstructed. In this case, we seek to restore as many information nodes as possible, whereas auxiliary nodes need to be restored only insomuch as they are necessary to maintain the correctness of the structure.
Faults and Reconstruction. We assume that when a node becomes faulty all the data contained in the node and all the outgoing pointers are lost. We also assume that faults are detectable upon access, i.e., trying to access a faulty node results in an error message. We must reconstruct the data structure after faults are detected. However, full reconstruction may be impossible. For example, the graph may become disconnected and some nodes inaccessible. Thus, we want to reconstruct the data structure on the subset of the remaining accessible nodes. Since some information is necessarily lost, we must define what we mean by reconstruction. Intuitively, we require that:

1. The reconstruction includes as many (information) nodes of the original data structure as possible, and
2. The reconstruction maintains the essential topological order among the reconstructed information nodes.

For example, in the linked list we require that if node v appeared before node w in the original list then node v appears before node w in the reconstruction. Thus, we maintain the "before-after" relation. In other structures, we may maintain other relations, e.g., "above-under" in the tree. Note that we cannot expect to maintain all topological relations of the original graph. For example, in the linked list we cannot maintain the relation "two nodes after", as the intermediate nodes may be lost. Accordingly, we define reconstruction with respect to a given set of relations R, induced by the topology of the graph.

Definition 1 Let S = (H, P) be a data structure scheme, and let R = {R1, . . . , Rt} be a set of relations on information nodes induced by the graphs of H (instance graphs of S). Let S = (V, E) ∈ H be an instance of S. We say that the graph S′ = (V′, E′) is a reconstruction of S with respect to R if the following conditions hold:

• The graph S′ is a valid instance of S, i.e., S′ ∈ H.
• The information nodes in S′ form a subset of the information nodes of S.
• For each Ri ∈ R, if Ri is a k-relation, then for each k-tuple v ∈ (V′)^k of information nodes, v satisfies the relation Ri in S′ iff it satisfies the relation Ri in S.

We call information nodes that appear in S but not in S′ lost nodes. Note that a node may be lost even if it is not faulty. Specifically, a node may be inaccessible, in case all paths to the node are blocked by faults. We seek data structures for which all instances can be efficiently reconstructed with a minimal loss of nonfaulty nodes.

Definition 2 Let S = (H, P) be a data structure scheme, and let R = {R1, . . . , Rt} be a set of relations on information nodes induced by the graphs of H. Let d be a constant and g : N → N be a function. We say that S is (d, g)-fault tolerant with respect to R if there exists a reconstruction algorithm A satisfying the following. For any instance S of S, if there are f ≤ d faults in S, then algorithm A, on input of the faulty S, outputs a reconstruction S′ of S with respect to R, such
that the number of lost information nodes is bounded by g(f). The running time of reconstruction algorithm A must be polynomial in f and d.

We seek data structures for which the function g(f) and the running time of A are slowly growing functions, independent of the size of the data structure S.

The Handle. In pointer-based data structures the nodes are accessed via pointers. Most of these pointers are themselves located in other nodes of the graph. However, the data structure must also allow access to the structure from the "outside world." In the linked list, for example, the pointer to the head of the list is in a fixed location and the entire structure is accessed through this pointer. The queue has two such pointers, one pointer to the front of the queue and one to the end. These pointers are in fixed locations, and there is a fixed number of such pointers. If all these pointers are lost the entire data structure becomes unreachable. We call the set of fixed pointers to a structure the handle of the structure. Formally, the handle H(S) of data structure S is a set of labelled nodes {h1, . . . , hk}. Each handle node hi stores a pointer, which can point to any node of the data structure. In addition, the handle node may store auxiliary information regarding the pointer or the data structure.

For any data structure, the number of nodes in the handle is an upper bound on the number of faults that the data structure can sustain. This explains why Definition 2 has an upper bound, d, on the number of faults that the data structure is required to withstand, even in the worst case.

Emulation. Most common data structures do not lend themselves to efficient reconstruction. Thus, we introduce fault-tolerant versions of the data structures. The new versions emulate the behavior of the original ones while supplying a higher degree of fault tolerance. The following definition formally defines what it means for the fault-tolerant version to emulate the regular one.

Definition 3 Let S = (H, P) be a data structure scheme, with P = {P1, . . . , Pk} (the Pi's are the procedures), and let S̄ = (H̄, P̄) be another data structure scheme, with P̄ = {P̄1, . . . , P̄k}. We say that S̄ is an emulation of S if the following conditions are satisfied:

• For each i, procedure P̄i of S̄ has the same interface as procedure Pi of S (i.e., it expects the same input pattern and outputs the same output pattern).
• For any sequence of invocations of procedures from S, invoking the corresponding procedures from S̄ results in the same output to the user.

Two measures characterize the quality of the emulation, time and space, as described in the following definition:

Definition 4 Let S̄ be an emulation of S. We say that it is an (α, β)-emulation if the following criteria are satisfied:

• Time: for each sequence of invocations of procedures of S and corresponding invocations of the procedures of S̄, the (amortized) execution time of the S̄ procedures is at most α times the (amortized) execution time of the S procedures.
• Space: for any instance S of the data structure scheme S, the corresponding instance S̄ of S̄ occupies at most a factor β more space.

We say that S̄ is a constant emulation if it is an (O(1), O(1))-emulation.
2.2 Our Results
We are now ready to present a formal description of the results of this paper. We provide the following fault-tolerant data structures:

• A family of fault-tolerant stacks, such that for each d, there is a (d, O(f log f))-fault-tolerant stack. The fault-tolerant stack is a constant emulation of the regular stack.
• A family of fault-tolerant linked lists, such that for each d, there is a (d, O(f log f log d))-fault-tolerant linked list. The fault-tolerant linked list is a constant emulation of the regular linked list.
• A family of fault-tolerant binary trees, such that for each d, there is a (d, O(f log f log d))-fault-tolerant tree. The fault-tolerant binary tree is a constant emulation of the regular binary tree.
• Expander-based versions of the fault-tolerant stacks, lists, and trees. The number of lost nodes is reduced by an O(log f) factor. This gives O(f) lost nodes for the stack, and O(f log d) lost nodes for the linked list and the binary search tree. The reconstruction time is a slowly growing function of f and d.
3 Fault-Tolerant Stacks
The stack data structure supports two operations: Push(x) and Pop. An instance graph takes the form of a directed path, with the handle Top pointing to the first node in the path and the last node pointing to Null. Each node x has one data field x.value and one outgoing pointer x.next. Conceptually, we view pointers as oriented down; thus we push and pop nodes from the top of the stack. The essential topological order of the stack is the "above-under" relation among nodes. The stack data structure is highly non-fault-tolerant, because one memory fault can generate O(n) lost nodes.

We describe a new family of stack-like data structures, the d-FTstack, for d ∈ {2^i : i ≥ 0} (for other d's simply round up to the nearest power of 2). For any such d, the d-FTstack is (d, O(f log f))-fault tolerant with respect to the "above-under" relation, and is a constant emulation of the stack.

The graph structure of (instances of) the d-FTstack is composed of a sequence of layers. Each layer Li (except possibly the top layer) consists of 2d nodes, Li = {xi,0, . . . , xi,2d−1}. We index the layers L⌈n/2d⌉−1, . . . , L0, so that layer L0 is at the bottom of the stack and L⌈n/2d⌉−1 is on the top. Every log d + 1 layers constitute a butterfly structure, as follows (see [27] for a more complete description of the butterfly structure). A sample graph appears in Figure 1. Each node xi,j in the butterfly has two outgoing edges, a straight edge and a diagonal edge. The straight edge points
Figure 1: A d-FTstack, where d = 4 and n = 45.

from node xi,j to node xi−1,j (if i ≠ 0, and to Null otherwise). The diagonal edge is defined as follows. Let j^(i) be the integer that shares the same binary representation as j, except for the (i mod 2d)-th bit, which is flipped (for example, for d = 2, 3^(2) = 1). The diagonal edge points from xi,j to xi−1,j^(i) (if i ≠ 0, and to Null otherwise). The lexicographic order of nodes corresponds to the order in the stack. The top layer may be incomplete.

The handle to the d-FTstack consists of 2d + 1 nodes. One node, Top, stores a pointer to the current top of the stack. The additional 2d handle nodes hold pointers to the top 2d nodes of the stack. The FTstack procedures Push() and Pop add and remove nodes in lexicographic order and maintain the pointer structure.

Claim 1 For any d, the d-FTstack is a constant emulation of the stack.

Proof: The d-FTstack and the stack have the same user interface and provide the user with the same behavior. Thus, the FTstack is an emulation of the stack. The performance ratios between the stack and the FTstack are as follows:

• Time — Each of the FTstack operations takes a constant number of steps. Thus, the ratio is constant.
• Space — The nodes of the FTstack are in one-to-one correspondence with those of the stack. Each FTstack node requires a constant amount of space. In addition, the FTstack has 2d + 1 handle nodes. Thus, in total, as n grows, the ratio between the space requirements of the stack and the FTstack is O(1).
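To make the layout concrete, here is a minimal Python sketch (ours) of the d-FTstack wiring and the Push/Pop bookkeeping. Two caveats: which bit the diagonal edge flips is a stand-in choice, since the example in our copy is garbled, and the fixed 2d + 1 handle nodes are represented by the prev/cur arrays for brevity.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)            # identity-hashable, so nodes can go in sets
class Cell:
    """One d-FTstack node: a value plus the two downward edges."""
    value: object
    straight: Optional["Cell"] = None   # to the node directly below
    diagonal: Optional["Cell"] = None   # below, with one index bit flipped

class FTStack:
    def __init__(self, d):
        self.width = 2 * d                        # nodes per layer
        self.bits = self.width.bit_length() - 1   # log(2d) index bits
        self.n = 0
        self.top = None
        self.prev = [None] * self.width           # last complete layer
        self.cur = []                             # layer being filled

    def _partner(self, i, j):
        # Diagonal target: flip one bit of j, cycling with the layer
        # number i (a stand-in rule; see the caveat above).
        return j ^ (1 << (i % self.bits))

    def push(self, value):
        i, j = divmod(self.n, self.width)         # layer and index
        node = Cell(value)
        if i > 0:                                 # wire the two edges down
            node.straight = self.prev[j]
            node.diagonal = self.prev[self._partner(i, j)]
        self.cur.append(node)
        if len(self.cur) == self.width:           # layer complete: rotate
            self.prev, self.cur = self.cur, []
        self.top = node
        self.n += 1

    def pop(self):
        if self.top is None:
            return None
        node = self.top
        if self.cur:
            self.cur.pop()
        else:
            # Crossing a layer boundary: the complete top layer becomes
            # the partial one, and the layer below it is recovered from
            # the straight edges (all None at layer 0).
            old = self.prev
            self.cur = old[:-1]
            self.prev = [v.straight for v in old]
        self.n -= 1
        self.top = (self.cur[-1] if self.cur
                    else (self.prev[-1] if self.n else None))
        return node.value
```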
3.1 Faults and Reconstruction
When a fault is detected, reconstruction begins. The reconstruction procedure operates in two phases, described in the following paragraphs.

• Pop phase — In the Pop phase, we remove nodes from the FTstack and place them in auxiliary storage, such as another FTstack. We try to reach each node via both of its incoming pointers. If both pointers are unavailable or if the node is faulty, we discard the node. The Pop phase
ends when 2d consecutive reachable nodes are encountered or when the bottom of the FTstack is reached. At this point the remaining FTstack is functional and has no apparent faults. (Additional faults may still exist further down the FTstack. In this case we will run the reconstruction procedure whenever these nodes are encountered.)

• Reinsert phase — In the Reinsert phase, we reinsert the nodes using the FTstack Push() procedure. Because we reinsert nodes in the reverse order from which they were popped, the reconstruction maintains the order of the nodes of the original FTstack.

We now prove a bound on the number of lost nodes as a function of the number of faults f:

Claim 2 If there are f faults then at most O(f log f) nodes are lost.

Proof: Let F be the set of faulty nodes. Without loss of generality, assume that the number of faults f = |F| is a power of 2. Recall that a node is inaccessible if it is lost but not faulty. We begin our analysis by considering a single inaccessible node x0, which resides in a level Li. There are two cases. First, suppose that level L_{i+log f+1} (the layer log f + 1 above x0) exists. Let A(x0) denote the set of nodes belonging to L_{i+log f+1} that are ancestors of x0. One characteristic of the butterfly (since the number of faults f < 2d) is that if we trace the edges back log f + 1 levels from node x0 in level Li to level L_{i+log f+1}, we obtain a binary tree with 2f leaves. Thus, we have |A(x0)| = 2f.

In order for x0 to be inaccessible, one of the following must hold for all nodes y ∈ A(x0): (1) either y is lost (faulty or inaccessible), or (2) y is accessible but all paths from y to x0 are blocked by faulty nodes.

At most f nodes of A(x0) are lost. This is because the straight edges of the butterfly structure provide node-disjoint paths from the handle to each of the 2f nodes of A(x0). Since there are only f faults, these can block only f of these paths. Let B(x0) denote the set of accessible nodes of A(x0). From the previous paragraph we have |B(x0)| ≥ f. Let T(x0) be the binary tree of nodes linking B(x0) and x0 (including faulty nodes). For node z ∈ T(x0), let d(z, x0) be the distance from z to x0.

Consider a faulty node z in T(x0). We count the number of nodes y ∈ B(x0) for which z can block the path from y to x0. The distance from z to B(x0) is log f + 1 − d(z, x0). Thus, the number of nodes of B(x0) under z is at most 2^{log f + 1 − d(z,x0)} = f · 2^{−d(z,x0)+1}. This is the number of y's the faulty z can block. For nodes z and x0, let

    w(z, x0) = 2^{−d(z,x0)+1} if z is in T(x0), and w(z, x0) = 0 otherwise.

By the above analysis, a faulty node z can block at most f · w(z, x0) of the paths from nodes of B(x0) to x0. Since at least f paths from B(x0) to x0 must be blocked for x0 to be inaccessible, it must be that ∑_{z∈F} w(z, x0) · f ≥ f. Thus, we obtain

    ∑_{z∈F} w(z, x0) ≥ 1.    (1)
For the case where level L_{i+log f+1} does not exist (that is, node x0 is close to the top), a similar argument also yields Equation 1.

So far we have considered a single inaccessible node x0. Now we consider the set I of all inaccessible nodes. Consider a faulty node z. Node z is at distance 1 from two descendant nodes, distance 2 from four descendant nodes, and so on. Thus, summing over all trees T(x) that contain z, we obtain

    ∑_{x∈I} w(z, x) ≤ ∑_{i=1}^{log f} 2^i · 2^{−i+1} = 2 log f.    (2)
Thus, from Equation 1 we obtain

    |I| = ∑_{x∈I} 1 ≤ ∑_{x∈I} ( ∑_{z∈F} w(z, x) ).    (3)
Exchanging the order of summation and applying Equation 2, we obtain

    |I| ≤ ∑_{z∈F} ( ∑_{x∈I} w(z, x) ) ≤ 2f log f.    (4)
Thus, the total number of lost nodes is |I| + |F| ≤ 2f log f + f.

Next we bound the complexity of the reconstruction procedure.

Claim 3 For any f and d, the d-FTstack reconstruction procedure completes in O(df log f) steps.

Proof: Popping and reinserting a node takes a constant number of steps. The Pop phase completes once 2d consecutive accessible nodes are encountered. With f faults, at most f log f consecutive layers have inaccessible nodes. Since each layer consists of 2d nodes, the total work is O(df log f).

From Claims 1–3 we obtain the following performance bounds for the d-FTstack:

Theorem 1 For any d, the d-FTstack is (d, O(f log f))-fault tolerant and is a constant emulation of the stack.
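As an illustration of the salvage step (our formulation, reusing the Cell fields from the sketch above), reachability from the handle can be computed by a breadth-first search that simply skips faulty nodes. The real Pop phase additionally pops in lexicographic order and stops after 2d consecutive reachable nodes, which we elide here.

```python
from collections import deque

def salvage(handle, faulty):
    """Return every node reachable from the handle pointers through
    fault-free straight/diagonal edges.  'handle' is an iterable of
    Cell (the top-of-stack pointers); 'faulty' is the set of faulty
    nodes.  By Claim 2, the nodes cut off number at most O(f log f)."""
    seen = set()
    queue = deque(h for h in handle if h is not None and h not in faulty)
    seen.update(queue)
    while queue:
        v = queue.popleft()
        for w in (v.straight, v.diagonal):
            if w is not None and w not in faulty and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen
```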
4 Fault-Tolerant Linked Lists

4.1 The Structure
The linked list supports the following operations:

• Insert(v, p): inserts a node with value v before node p.
• Delete(p): removes node p.
• Value(p): returns the value stored in node p.
• Next(p): returns the node following node p.
• Head: returns the first node in the list.
Figure 2: A d-FTlist, where d = 4 and thus M = 32.

The essential relation between nodes is the 2-relation "before-after". The linked list is highly non-fault-tolerant, as explained in the Introduction. We introduce a new family of list-like data structures, the d-FTlist, for d a power of 2. For each such d, the d-FTlist is (d, O(f log f log d))-fault tolerant with respect to the "before-after" relation, and is a constant emulation of the linked list.

Instance graphs of the linked list are directed paths. Thus, the stack and the linked list have the same graph structure. In the previous section we used a layered graph structure to obtain a fault-tolerant emulation of the stack; we use a similar structure for the fault-tolerant linked list. However, the essential difference between the stack and the linked list is that in the linked list nodes may be inserted and deleted anywhere in the graph, whereas nodes in a stack are pushed and popped only at the top. Thus, the main difficulty in constructing a fault-tolerant list is maintaining the structure throughout the dynamic changes. We now describe the structure of the FTlist in detail. A sample graph of an FTlist appears in Figure 2.

The Skeleton. The graph of the d-FTlist is composed of a sequence of blocks B^0, . . . , B^{i_head}. Each block B^i consists of 2d(log(2d) + 1) vertices arranged in a butterfly structure. In addition, B^i has one special header node, header(B^i). Vertices of the last level of block B^i point to those in the first level of B^{i−1}. The handle of the d-FTlist consists of 2d + 1 nodes. One handle node, head, points to the head of the list. The remaining handle nodes, h[k], k = 0, . . . , 2d − 1, point to the nodes in the first level of the first block. We call this block structure the skeleton of the graph.

Folding. The original linked list is folded onto the skeleton as follows. Each node x of the linked list is mapped to a vertex s of the skeleton (we use the term node for the nodes of the linked list and vertex for those of the skeleton). Vertex s stores the entire information of x, including the data field and the next pointer. At most one list node is mapped to any skeleton vertex. The mapping maintains the order of nodes across blocks. Specifically, if x and y are nodes of the original linked list and x is before y, then either x and y are in the same block, or x is in a block before that of y. Within each block, the nodes are mapped arbitrarily. The empty vertices of each block B^i are
chained in a list free_i. A pointer to the head of free_i is stored in header(B^i). In addition, header(B^i) maintains the variable Load_i, which records the total number of nodes mapped onto vertices of B^i. Let M = 2d log(4d) be the number of vertices in a block. We maintain the following invariant.

Invariant 1 At all times at least M/4 nodes are mapped to each block B^i (i ≥ 1), and at most M are mapped to each block B^i (i ≥ 0). (Only the last block B^0 may contain fewer than M/4 nodes.)
4.2 Operations
Head, Next(), and Value(). A full copy of the linked list is embedded in the d-FTlist. Thus, implementing the procedures Head, Next(), and Value() is easy, with the following small addition. In the regular linked list the user holds a pointer p, which points to the current location in the list. In the FTlist one pointer is never enough, because any single memory location may become faulty. Thus, in addition to the pointer p, the user holds a set of 2d pointers pointing to the nodes of the first level of the current block. The pointers are initialized to the handle (Head and the first level of the first block), and updated as the current location changes from one block to the next. The amortized cost per execution of Next() is O(1).

Insert(v, x). Insert involves the following steps:
1. Create a new node y. Let it store the value v. Insert y before x in the embedded copy of the linked list.
2. Suppose x is in block B^i. Map the new node y onto a free vertex s in B^i.
3. Update the list free_i, and increase Load_i by 1.
4. If Load_i = M then split B^i into two blocks. To split a block, first create two skeleton blocks. Then, insert the first half of the nodes in one block and the second half in the other. Splitting completes in O(M) operations (see the bookkeeping sketch after the Delete steps below).

Delete(x). Delete is the reverse procedure to Insert:
1. Remove x from the embedded copy of the linked list.
2. Suppose x is mapped to vertex s ∈ B^i. Delete x from s, and add s to free_i. Decrease Load_i by 1.
3. If Load_i < M/4 (and B^{i−1} exists) then execute one of the following. If Load_{i−1} ≤ 3M/4 then join B^i with B^{i−1}. Otherwise, divide the nodes evenly between B^{i−1} and B^i. Joining/dividing is completed in O(M) operations.
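The following sketch (all names ours) isolates the load bookkeeping behind step 4 of Insert and step 3 of Delete: a block that fills to M is split in half, and a block that drops below M/4 joins with, or borrows from, a neighboring block. The skeleton, free lists, and embedded list pointers are elided, and we use the next block in the list as "the neighbor", a simplification of the B^{i−1} direction above.

```python
class Block:
    """Stand-in for one FTlist block: only the load counter matters here."""
    def __init__(self, M, load=0):
        self.M = M          # block capacity, M = 2d log(4d)
        self.load = load    # number of list nodes mapped to this block

def block_capacity(d):
    """M = 2d * log(4d) vertices per block (d a power of 2)."""
    return 2 * d * ((4 * d).bit_length() - 1)

def after_insert(blocks, i):
    """Called after a node is mapped into blocks[i]; enforces load <= M
    by splitting an exactly-full block into two halves (O(M) work)."""
    b = blocks[i]
    b.load += 1
    if b.load == b.M:
        half = b.load // 2
        b.load -= half
        blocks.insert(i + 1, Block(b.M, half))

def after_delete(blocks, i):
    """Called after a node is removed from blocks[i]; restores the
    at-least-M/4 invariant by joining with, or borrowing from, the
    neighboring block."""
    b = blocks[i]
    b.load -= 1
    if b.load < b.M // 4 and i + 1 < len(blocks):
        nb = blocks[i + 1]
        if nb.load <= 3 * b.M // 4:        # join the two blocks outright
            nb.load += b.load
            del blocks[i]
        else:                              # divide the nodes evenly
            total = b.load + nb.load
            b.load, nb.load = total // 2, total - total // 2

blocks = [Block(block_capacity(4))]        # d = 4 gives M = 32, as in Figure 2
```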
4.3 Faults and Reconstruction
When a fault is detected, reconstruction begins. There are three phases to the reconstruction:

1. Salvage remaining nodes. In this phase the objective is to find as many of the remaining accessible nodes as possible.
2. Determine the correct order among the salvaged nodes.
3. Reconstruct the data structure.

4.3.1 Salvaging Remaining Nodes
In order to find the remaining nodes we use the underlying butterfly structure of the skeleton. Recall that at all times the user holds 2d pointers to the first layer of the current block. Thus, in order to find the remaining nodes, start from the first layer of the block, advance one level at a time, and gather list nodes en route. Continue until a block with no faults is encountered. By analogy to Claim 2, we have:

Claim 4 With f faulty nodes, at most O(f log f) nodes are inaccessible.

4.3.2 Tags: Determining Order Between Nodes
Once the remaining nodes have been salvaged, it is necessary to determine the correct order among them. The problem is that the nodes of the list can be mapped onto the skeleton in an arbitrary order. Thus, if some of the nodes are lost, we may lose the information on the correct order between the surviving nodes. To overcome this problem we provide an additional facility that allows us to determine the correct order between nodes. We do so by using tags. Specifically, for each node x, we store in the node an integer tag(x), such that if x and y are nodes mapped to the same block, then tag(x) < tag(y) iff x is in front of y in the list. Given such tags, we can reconstruct the order between nodes of a given block by comparing their tags.

In order to maintain the tags we use the Dietz and Sleator [10] construction, which maintains such tags in a dynamically changing linked list, with tags of size O(log M) and O(log M) amortized reassignment cost per insert or delete, for a list of size M. Accordingly, we maintain a separate instance of the [10] data structure for each block. With this construction we obtain an O(log M) = O(log d) cost per insert or delete.

In order to decrease the insertion cost to O(1) amortized time, we use indirection, as described in [10]. Roughly, the indirection construction works as follows. We divide the O(d log d) nodes of the block's linked list into Θ(d) groups, each containing Θ(log d) contiguous elements. We split the tag of each node into two: a high-order tag, which holds the high-order bits of the tag, and a low-order tag, which holds the low-order bits of the tag. Nodes of the same group share a high-order tag. Hence, this tag is stored only once, in a representative node of the group. Low-order tags are stored in each node. Dietz and Sleator [10] show that using this construction, tags can be maintained with O(1) amortized cost per insert/delete. See [10] for details.

In order to compare two nodes, we compare their high-order tags, stored at their respective representative nodes, and their low-order tags, stored at the nodes themselves. This indirection means that if the representative node is lost, then we lose the order information for all nodes of the group. Hence, each lost node can result in the loss of an additional O(log d) nodes, and the total number of lost nodes is O(f log f log d).
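A sketch of the resulting comparison (field names ours): each node carries only its low-order tag, and the shared high-order tag is read through the group's representative. If the representative is lost, the order of the entire Θ(log d) group is lost with it, which is exactly the source of the extra O(log d) factor.

```python
def before(x, y):
    """True iff x precedes y within a block, under split tags:
    high-order bits live once at the group representative, low-order
    bits live at the node itself."""
    return (x.group_rep.high_tag, x.low_tag) < (y.group_rep.high_tag, y.low_tag)

# Tiny demonstration with hypothetical nodes:
from types import SimpleNamespace as NS
g1, g2 = NS(high_tag=3), NS(high_tag=7)
a = NS(group_rep=g1, low_tag=9)
b = NS(group_rep=g2, low_tag=0)
assert before(a, b)   # high-order tags dominate the comparison
```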
4.4 The Theorem
Putting it all together, we obtain:

Theorem 2 For any d, the d-FTlist is a constant emulation of the linked list, and is (d, O(f log f log d))-fault tolerant with respect to the "before-after" relation.

Proof: At least a quarter of the vertices are not empty. Thus, the d-FTlist takes an O(1) factor more space than the linked list. Functions Value(), Next(), and Head run in a constant number of steps. Regular insertions and deletions require O(1) operations. After every Ω(d log d) insertions or deletions in a block (which require O(1) amortized time each), the block must be split or joined. These tasks consist of O(d log d) operations on the skeleton followed by O(d log d) insertions. Thus, the amortized work per insertion and deletion is O(1).

By Claim 4, with f < d faults, at most O(f log f) vertices of the skeleton are lost. Each vertex stores at most one list node. Thus, at most O(f log f) list nodes are lost. Because of indirection, O(f log f) lost nodes may result in losing the order information for O(f log f log d) nodes, effectively making them unusable in the reconstruction. At most f + 1 consecutive blocks have inaccessible nodes. Thus, the reconstruction completes in O(f d log^2 d) operations.
5 Fault-Tolerant Trees

5.1 General
We consider a binary search tree that supports the following procedures:

• Insert(v, x): insert a node with value v as a child of node x. A node is inserted as a new leaf or between a parent and child.
• Delete(x): remove node x from the tree. Only nodes having one or zero children can be deleted.
• Find(v): search for key v starting from the root.

We construct a family of data structures, the d-FTtree, for d a power of 2, each of which is a constant emulation of the binary search tree. For each d, the d-FTtree is (d, O(f log f log d))-fault tolerant with respect to the "above-under" and "left-right" relations. We ensure that the depth of a leaf in the reconstructed structure is no more than in the original tree. Minor modifications of this presentation allow making fault-tolerant search trees of any bounded degree.
5.2 The Block Tree
Consider a tree T. As in the FTlist, we use a block structure as a skeleton. Each block of the d-FTtree consists of M = 2d(log(2d) + 1) vertices interconnected in a butterfly structure. Each block has a special header vertex, storing the free list and the load of the block.
Figure 3: A d-FTtree, with d = 4 and M = 32. Left: the mapping of nodes to blocks. Right: the block tree (logical links dashed, wide links in grey). Blocks B1, B3, and B5 have children; they are uni-component and contain more than M/6 nodes. Blocks B4, B6, and B7 are multi-component; they have no children. Block B6 contains fewer than M/6 nodes, but has an immediate sibling, B5, which is uni-component and contains more than M/6 nodes.
The nodes of T are mapped to vertices of the blocks, so that each vertex holds at most one node. We use the term vertex to refer to the vertices of the blocks, and the term node to refer to the nodes of the original tree T. We say that a block contains the nodes mapped to it. The blocks are logically arranged in a tree structure, which we call the block tree and denote by BT. A sample FTtree is depicted in Figure 3. The mapping of the nodes of T to the blocks maintains the following conditions:

• Let Child(B) be a child of B in BT. Let T1, . . . , Tk be the forest of subtrees of T contained in Child(B). (That is, each Ti is a maximal subtree of T that is entirely contained in Child(B).) Then, for each i, the root of Ti is a direct child of some node in B.
• If R-Sib(B) is the block to the right of B in BT, then, when viewed within T, the nodes contained in R-Sib(B) are to the right of those contained in B.

For a block B, if all the nodes contained in B are in one connected component (in T), we say that B is a uni-component block. Otherwise B is a multi-component block.

Blocks are interconnected using wide links. To construct a wide link between blocks B and B′, we maintain pointers between the corresponding vertices in B and B′. (Recall that all blocks have the same skeleton structure: a butterfly.) Wide links are maintained between the following blocks:

• Between B and its leftmost child (in BT).
• Between B and its immediate sibling to the right (in BT).

Thus, all block children of a given block are connected in a wide-link linked list, rooted at the parent. The handle to the FTtree consists of 2d pointers to the first layer of the root block.
We will ensure that the FTtree maintains the following invariant:

Invariant 2 At all times the FTtree has the following structure:

• At most M nodes are mapped to any block.
• Only uni-component blocks have (block) children.
• If a block B has a child then B contains at least M/6 nodes.
• If block B contains fewer than M/6 nodes, then either B is the only block in the tree, or its immediate sibling (to the right or the left) is uni-component and contains at least M/6 nodes.

From Invariant 2 we obtain the following claims, which bound the size of the FTtree and the out-degree of blocks in the block tree.

Claim 5 The size of the FTtree is linear in the size of T.

Proof: The proof is by accounting. We assign each block B containing fewer than M/6 nodes to a block containing at least M/6 nodes as follows. If B is an only child, we assign it to its parent, which by the invariant contains at least M/6 nodes. Otherwise, we assign it to its immediate sibling, which by the invariant contains at least M/6 nodes. Thus, at most three extra blocks are assigned to each block with at least M/6 nodes.

Claim 6 Each block has at most 2M child blocks.

Proof: A block contains at most M nodes. Each node has at most two children in T. In the worst case, each child is in a separate block.
5.3 Operations
Insertions. When a new node is inserted, it is first mapped to the block containing the node's parent. If this block then contains more than M nodes, the block is split into two blocks. There are two types of splits: horizontal and vertical. A vertical split yields a parent and a child; a horizontal split yields two siblings. A uni-component block only undergoes a vertical split, and a multi-component block only undergoes a horizontal split. The split procedure takes O(M) operations. The main concern is to amortize the cost over O(M) inserts/deletes. Below, we show how this is obtained.

Deletions. If the number of nodes in a block falls below M/6 and it does not have a uni-component immediate sibling with more than M/6 nodes, it is merged with a sibling, a parent, or a child. After merging, the resulting block may be too big. In this case, it immediately undergoes a split. The merge procedure takes O(M) operations. Again, the main concern is to amortize the cost over O(M) inserts/deletes.
Splits. A block is split as a result of an insertion or a merge. Therefore, before splitting, the block may contain between M and 7M/6 nodes. A vertical split is performed on uni-component blocks. It is accomplished by breaking the single tree of the overcrowded block into two separate trees. Note that it is always possible to split an n-node binary tree into two subtrees (in linear time) so that each resulting tree has between n/3 and 2n/3 nodes (a construction is sketched at the end of this subsection). Thus, since we are splitting a tree with between M and 7M/6 nodes, the resulting blocks have size between M/3 and 7M/9. Hence, in all cases Invariant 2 is maintained.

A horizontal split divides a multi-component block into two or more new sibling blocks. Consider the nodes mapped to B. Let T1, . . . , Tℓ be the forest of trees on these nodes (as determined by the original tree T). There are two cases.

• Each Ti has at most M/3 nodes. In this case we split B into two blocks, each of which has between M/3 and 2M/3 + M/6 = 5M/6 nodes, as follows. Let v1, v2, . . . , vℓ (M ≤ ℓ ≤ 7M/6) be the nodes mapped to B, enumerated from left to right. Let T1, . . . , Tk be the trees contained in B, enumerated from left to right. Let Ti denote the subtree containing node v_{M/3}. We put the nodes of the subtrees T1, . . . , Ti in the first block, and those of subtrees Ti+1, . . . , Tk in the second.
• Otherwise, some subtrees have more than M/3 nodes. There are at most two such trees. Denote these large subtrees by L1 and L2 (if it exists). We put each of L1 and L2 in a separate block. The remaining nodes can be split into at most three groups: those to the left of L1; those between L1 and L2 (if it exists); and those to the right of both L1 and L2. We map each of these groups of nodes to a separate block. Note that some of these groups may contain very few nodes, but they have uni-component neighbors with at least M/3 nodes, thus satisfying Invariant 2.

Claim 7 The amortized cost of splitting and merging is O(1).

Proof: Splits and merges both cost Θ(M). We have engineered the splits so that after a horizontal split of a block B:

• multi-component blocks have at most 7M/9 nodes;
• blocks without an M/3-full neighbor have at least M/3 nodes;
• blocks with an M/3-full uni-component neighbor may have an arbitrarily small number of nodes.

After a vertical split of a block B, both resulting blocks have between M/3 and 5M/6 nodes. Thus, after a split, all resulting blocks that contain between M/3 and 5M/6 nodes will support at least Θ(M) insertions or deletions before they are merged or split. All blocks with fewer than M/3 nodes have uni-component neighbors. Thus, they need not be merged regardless of the number of deletions. As for splits, such blocks can accommodate Θ(M) insertions before they need to be split. After a merge, we may need to perform a split (immediately or sometime later). If so, the merge will "pay for" the cost of the split. Regardless of whether a split ensues, all resulting blocks have at
least M/3 nodes. Therefore, after a merge into a block B, there will be at least M/3 − M/6 = M/6 additional deletes before the block needs to be merged again.
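The vertical split relies on the classical fact quoted in the Splits paragraph: an n-node binary tree can be split into two pieces, each with roughly between n/3 and 2n/3 nodes, in linear time. The paper only asserts this; the following walk toward the larger child is our standard construction of such a split point.

```python
def split_subtree(root):
    """Return a node whose subtree holds roughly between n/3 and 2n/3
    of the n nodes; detaching it realizes the vertical split.  Nodes
    are assumed to have .left and .right fields (names ours)."""
    size = {}
    def count(v):                          # postorder subtree sizes, O(n)
        if v is None:
            return 0
        size[v] = 1 + count(v.left) + count(v.right)
        return size[v]
    n = count(root)
    if n < 3:
        return root                        # too small to split sensibly
    v = root
    while size[v] > 2 * n / 3:             # descend toward the larger child
        kids = [c for c in (v.left, v.right) if c is not None]
        v = max(kids, key=size.get)        # larger child has >= (size-1)/2 nodes
    return v                               # root of the subtree to detach
```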
5.4 Faults and Reconstruction
When a fault is detected, reconstruction begins. Reconstruction follows the tree structure of BT (the block tree). To reconstruct a block, first reconstruct all of its children recursively. The wide links, which link the list of child blocks, guarantee that all child blocks are reachable. This is because the wide links provide 2d node-disjoint paths from the block to each of its child blocks, and at most d of these can contain a fault. After all child blocks have been reconstructed, the block itself is reconstructed.

For each block, reconstruction is performed in three phases:

1. Salvage the remaining accessible nodes.
2. Determine the correct topological order among the remaining nodes of the block, and between the nodes of the block and those of the child blocks (if they exist).
3. Reconstruct the block.

Phase (1) is performed using the butterfly structure of the skeleton of the block. By analogy to Claim 2, if t nodes of the block are faulty, then at most O(t log t) nodes of the block are lost.

The next step is to determine the correct order among the salvaged nodes. The difficulty is that since some nodes are lost, we may lose information on the relative order among the remaining nodes. In order to recreate the topological order among the remaining nodes, we again use tags. Note that when nodes are lost, the reconstructed nodes may not form a binary tree. For example, the root of the tree may be lost; in this case the remaining nodes form a forest rather than a tree. Another example is when a node v having two children is lost, and its parent w also has two children. In this case, attaching the children of v as children of w would maintain the above-under and left-right relations, but the tree would no longer be a binary tree. In such cases we introduce dummy nodes in the reconstruction process. Dummy nodes contain no data. They are used in order to maintain the structure of the tree as a binary tree. The tags enable us to identify when dummy nodes are necessary.

We now proceed to describe the tagging system in detail. We first show a solution with an O(log n) overhead per insert/delete. Then we present an improvement which results in an O(log d) overhead per insert/delete. Finally, we use indirection to reduce the overhead to O(1), but at the cost of an O(log d) factor increase in the number of lost nodes.

An O(log n) Tag Solution. We maintain the following tagging system. For each node v we maintain two tags, tag_pre(v) and tag_rev(v), as follows:

• The tags tag_pre(v) preserve the pre-order traversal of the tree; i.e., tag_pre(v) < tag_pre(w) iff v is before w in the pre-order traversal of T. (In the pre-order traversal, first the root is visited, then the left sub-tree recursively, and then the right sub-tree recursively.)
• The tags tag_rev(v) preserve the right-to-left pre-order traversal of the tree; i.e., tag_rev(v) < tag_rev(w) iff v is before w in the right-to-left pre-order traversal of the tree. (In the right-to-left pre-order traversal, first the root is visited, then the right sub-tree recursively, and then the left sub-tree recursively.)

The following claim is easy to validate:

Claim 8 Let v and w be nodes of the tree. Then:

• tag_pre(v) < tag_pre(w) and tag_rev(v) < tag_rev(w) iff v is an ancestor of w.
• tag_pre(v) < tag_pre(w) and tag_rev(v) > tag_rev(w) iff v is to the left of w.

Thus, together, the two tags tag_pre(v) and tag_rev(v) allow us to fully identify the topological order among the remaining nodes, and to reconstruct the original tree. We call such a system of tags a topological tagging system. An ordered forest is a forest for which there is a full left-right order on the trees of the forest.

Claim 9 Let T be a binary tree reinforced with a topological tagging system. Let V′ be a subset of the nodes of T. Then there is a unique ordered forest F′ (not necessarily binary) on V′ that maintains all the left-right and above-under relations of T. This ordered forest is fully determined by the tagging system. Specifically, for any two nodes v, w ∈ V′:

• v is an ancestor of w in F′ iff v is an ancestor of w in T,
• v is to the left of w in F′ iff v is to the left of w in T.

Proof: By Claim 8, for any two nodes v and w one can determine whether one node is an ancestor of the other, and their respective order. This fully determines the ordered forest F′.

We now describe the reconstruction procedure. As mentioned above, reconstruction is applied recursively, based on the structure of the block tree. For each block B, the following is performed:

1. Using the skeleton structure of the block, salvage as many as possible of the nodes that are contained in B. Denote these salvaged nodes by N(B).
2. Based on the tags, reconstruct the ordered forest on N(B) as in Claim 9. Denote this forest by F(B).
3. If block B is a leaf in the block tree, but not the only block in the tree, then delete block B from the block tree. Keep the nodes of N(B) in an auxiliary structure. These nodes shall be reinserted at B's parent.
4. Otherwise (B is not a leaf in the block tree, or B is the only block in the tree):
   (a) If F(B) is not a tree then add a dummy node as the root of F(B), and denote the resulting tree by T(B). (Recall that if B is not a leaf in the block tree then it must be uni-component, i.e., the nodes of B originally form a tree; F(B) may nevertheless be a forest because some nodes were lost.)
   (b) For each child block B′ of B do the following:
       i. Choose any node v ∈ N(B′).
       ii. For all nodes w ∈ N(B) check whether w is an ancestor of v (using the tags). Let w0 be the closest ancestor of v in N(B). If no such ancestor is found, then w0 is set to be the root of T(B).
       iii. Connect the root of T(B′) (the tree mapped to B′) as a direct child of w0.
   (c) If T(B) is not a binary tree then convert it into a binary tree by adding dummy nodes.
   (d) If the number of nodes in the tree is less than M/6, add dummy nodes (in a separate subtree rooted at the root, while maintaining the binary tree structure).
   (e) Create a new skeleton structure for B and map the nodes of the resulting tree to the skeleton, arbitrarily. Insert the new block in the proper location in the block tree.
   (f) If B has child blocks that have been deleted in the reconstruction (Step 3, above), then reinsert the nodes of these blocks, using the tags to identify the proper locations.

To maintain the tags we use the Dietz and Sleator [10] construction for each of the two tags, tag_pre(v) and tag_rev(v). In this solution tags are of size O(log n) and each update (insert or delete) requires O(log n) work (n is the number of nodes in the entire tree).

An O(log d) Tag Solution. Note that the reconstruction procedure uses only the order within a single block, and between nodes of a parent block and its immediate child blocks. Accordingly, in order to reduce the insert/delete overhead from O(log n) to O(log d), we exchange the global tagging system described above, which provides a topological tagging system on the entire tree, for many local systems, each of which provides the order only among nodes of a single block and between neighboring blocks. Specifically, for each block B we maintain a topological tagging system covering the nodes of B and its immediate children. Thus, each node v ∈ B takes part in (at most) two topological tagging systems:

1. The system covering B and its children.
2. The system covering B's immediate parent and its children, i.e., B, B's parent, and B's siblings.

By Claim 6 the number of child blocks of any given block is O(M) = O(d log d). Each block has O(d log d) nodes. Thus, the total number of nodes in each topological tagging system is O(d^2 log^2 d), and the Dietz and Sleator [10] construction provides insert and delete in O(log d) steps, with tags of size O(log d).

An O(1) Tag Solution. In order to convert the O(log d) solution into an O(1) solution, we use indirection, as described in Section 4.3. As with the linked list, indirection may result in a situation where a node is salvaged but its location in the tree is lost, since its high-order tag is stored in another node, which has been lost. Each node stores the high-order tags for at most O(log d) other nodes. Thus, the number of lost nodes increases by a factor of at most O(log d).
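To make the tagging machinery concrete, here is the test of Claim 8 written out (attribute names ours). It is the primitive used in step (b)ii above to locate the closest surviving ancestor of a child block's node.

```python
def relation(v, w):
    """Decide the topological relation between two distinct salvaged
    nodes from their two tags, per Claim 8.  Distinct nodes carry
    distinct tags, so the four cases are exhaustive."""
    if v.tag_pre < w.tag_pre:
        return "ancestor" if v.tag_rev < w.tag_rev else "left-of"
    return "descendant" if v.tag_rev > w.tag_rev else "right-of"
```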
Summarizing the tagging discussion, we obtain:

Claim 10 Using the tags with indirection, insert and delete take amortized O(1) steps, and with f faults, at most O(f log f log d) nodes are lost.

We now justify the time complexity of the reconstruction algorithm.

Claim 11 Reconstruction takes O(poly(f d)) steps.

Proof: The recursive reconstruction procedure stops when reaching a block with no inaccessible nodes. There are at most O(f log f) inaccessible nodes. Hence, at most O(f log f) blocks undergo reconstruction. The work of reconstructing a block is polynomial in the number of nodes in the block, which is O(d log d).

We obtain:

Theorem 3 For any d, the d-FTtree is a constant emulation of the binary search tree, and is (d, O(f log f log d))-fault tolerant with respect to the "above-under" and the "left-to-right" relations.
6 Fault Tolerance with Expanders
In the fault-tolerant data structures presented so far, f faults result in O(f log f) inaccessible nodes. Using expanders, the number of inaccessible nodes can be reduced to O(f).

We first describe the EFTstack (Expander Fault-Tolerant Stack). As in the FTstack, the nodes of the EFTstack are grouped in layers. Here, however, the layers are interconnected using a bounded-degree expander (instead of a butterfly structure). Specifically, let Gd = (A, B, E), |A| = |B| = 2d, be a fixed, bounded-degree bipartite expander graph, with expansion rate α > 1. The graph interconnecting every two consecutive layers of the d-EFTstack is isomorphic to Gd. In addition, each node points to the corresponding node in the next layer. Note that d is fixed for any given d-EFTstack, and thus Gd can be hardwired into the code. Implementations of Pop, Push(), and reconstruction are similar in the EFTstack and the FTstack. The details are omitted. We obtain:

Theorem 4 For any d, the d-EFTstack is a constant emulation of the stack and is (d, O(f))-fault tolerant with respect to the "above-under" relation.

Proof: We prove that with f faults, there are at most O(f) inaccessible nodes. Let Ui and Fi be the sets of unreachable and faulty nodes in level i, respectively. By the expansion property, if |Ui| ≤ d then |Ui| ≤ |U_{i+1}|/α + |Fi|. Since α > 1 and ∑_i |Fi| = f ≤ d, by induction |Ui| ≤ d for all i. Thus, the total number of inaccessible nodes is

    ∑_i |Ui| ≤ ( ∑_{j=0}^{∞} 1/α^j ) ∑_i |Fi| = O(f),

where the geometric series sums to α/(α − 1) = O(1).
Similarly, the d-FTlist and d-FTtree are converted into the d-EFTlist and d-EFTtree. The blocks of the d-EFTlist and the d-EFTtree contain 2d vertices each (instead of roughly 2d log(2d), since there is no need for levels within blocks). Two blocks are interconnected by a fixed expander graph. Mapping within each block is in an arbitrary order. The tagging scheme is unchanged. We obtain the following performance guarantees for the d-EFTlist and the d-EFTtree:

Theorem 5 For any d, the d-EFTlist is a constant emulation of the linked list and is (d, O(f log d))-fault tolerant with respect to the "before-after" relation.

Theorem 6 For any d, the d-EFTtree is a constant emulation of the binary search tree. It is (d, O(f log d))-fault tolerant with respect to the "above-under" and "left-to-right" relations.
7 Discussion
In this paper we presented a framework for studying the fault tolerance of pointer-based data structures, and provided fault-tolerant versions of several common data structures. Throughout, we considered a worst-case fault model. Other fault models should also be studied; for example, a probabilistic fault model that takes locality of faults into account may lead to practical fault-tolerant data structures. Fault-tolerant data structures should also be considered in a hierarchical memory setting, in which there is locality among memory faults, but where data locality is important for efficiency.

Acknowledgments. We are grateful to Martín Farach-Colton and Pino Italiano for several important discussions.
References

[1] N. M. Amato and M. C. Loui. Checking linked data structures. In FTCS-24: 24th International Symposium on Fault Tolerant Computing, pages 164–175, Austin, Texas, 1994.

[2] Y. Aumann and M. A. Bender. Fault tolerant data structures. In 37th Annual Symposium on Foundations of Computer Science (FOCS), pages 580–589, October 1996.

[3] Y. Aumann, M. A. Bender, and L. Zhang. Efficient execution of nondeterministic parallel programs on asynchronous systems. Information and Computation, 139(1):1–16, 25 Nov. 1997. An earlier version of this paper appeared in the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), June 1996.

[4] G. Barnes. A method for implementing lock-free data structures. In Proceedings of the Fifth ACM Symposium on Parallel Algorithms and Architectures, pages 261–270, 1993.

[5] M. Ben-Or, S. Goldwasser, and A. Wigderson. Completeness theorems for non-cryptographic fault-tolerant distributed computation. In Proceedings of the 20th ACM Symposium on Theory of Computing, pages 1–10, 1988.

[6] E. R. Berlekamp. Algebraic Coding Theory. McGraw-Hill, New York, 1968.
[7] J. D. Bright, G. F. Sullivan, and G. M. Masson. Checking the integrity of trees. In FTCS-25: 25th International Symposium on Fault Tolerant Computing Digest of Papers, pages 402–413, Pasadena, California, 1995.

[8] B. Chor, M. Merritt, and D. Shmoys. Simple constant time consensus protocols in realistic failure models. In Proceedings of the 4th Annual ACM Symposium on the Principles of Distributed Computing, pages 152–162, 1985.

[9] P. Dietz, J. I. Seiferas, and J. Zhang. A tight lower bound for on-line monotonic list labeling. In Algorithm Theory—SWAT '94: 4th Scandinavian Workshop on Algorithm Theory, volume 824 of Lecture Notes in Computer Science, pages 131–142. Springer-Verlag, 6–8 July 1994.

[10] P. Dietz and D. Sleator. Two algorithms for maintaining order in a list. In Proceedings of the 19th ACM Symposium on Theory of Computing, pages 365–372, 1987.

[11] D. Dolev, J. Halpern, B. Simons, and H. Strong. A new look at fault-tolerant network routing. Information and Computation, 72(3):180–196, March 1987.

[12] C. Dwork, D. Peleg, N. Pippenger, and E. Upfal. Fault tolerance in networks of bounded degree. SIAM Journal on Computing, 1989.

[13] P. Feldman and S. Micali. Optimal algorithms for Byzantine agreement. In Proceedings of the 20th ACM Symposium on Theory of Computing, pages 148–161, 1988.

[14] M. Fischer, N. Lynch, and M. Paterson. Impossibility of distributed commit with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.

[15] Google. http://www.google.com/.

[16] J. Gray. Notes on Data Base Operating Systems, pages 393–481. Springer-Verlag, Berlin, 1979.

[17] R. Hagmann. Reimplementing the Cedar File System using logging and group commit. In 11th SOSP, pages 155–162, December 1987.

[18] J. Hastad, T. Leighton, and M. Newman. Reconfiguring a hypercube in the presence of faults. In Proceedings of the 28th Annual Symposium on the Foundations of Computer Science, pages 274–284. IEEE, 1987.

[19] J. Hastad, T. Leighton, and M. Newman. Fast computation using faulty hypercubes. In Proceedings of the 30th Annual Symposium on the Foundations of Computer Science, pages 251–263. IEEE, 1989.

[20] M. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the Twentieth Annual International Symposium on Computer Architecture, 1993.

[21] C. Kaklamanis, A. Karlin, F. Leighton, V. Milenkovic, P. Raghavan, S. Rao, C. Thomborson, and A. Tsantilas. Asymptotically tight bounds for computing with faulty arrays of processors. In Proceedings of the 31st Annual Symposium on the Foundations of Computer Science, pages 285–296, 1990.

[22] P. Kanellakis and A. Shvartsman. Efficient parallel algorithms can be made robust. In Proceedings of the 8th Annual ACM Symposium on the Principles of Distributed Computing, pages 211–221, 1989.

[23] M. L. Kazar, B. L. Leverett, O. T. Anderson, V. Apostolides, B. A. Bottos, S. Chutani, C. F. Everhart, W. A. Mason, S. T. Tu, and E. R. Zayas. Decorum file system architectural overview. In USENIX, pages 151–164, Summer 1990.

[24] Z. Kedem, K. Palem, A. Raghunathan, and P. Spirakis. Combining tentative and definite executions for very fast dependable parallel computing. In Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, pages 381–390, May 1991.

[25] S. Kutten and D. Peleg. Fault-local mending. In Proceedings of the 14th Annual ACM Symposium on the Principles of Distributed Computing, pages 20–27, 1995.

[26] S. Kutten and D. Peleg. Tight fault-locality. In Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science, 1995.
[27] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays · Trees · Hypercubes. Morgan Kaufmann Publishers, San Mateo, California, 1992.

[28] T. Leighton and B. Maggs. Expanders might be practical: Fast algorithms for routing around faults in the multibutterflies. In Proceedings of the 30th Annual Symposium on the Foundations of Computer Science, pages 384–389. IEEE, October 1989.

[29] F. J. MacWilliams and N. J. A. Sloane. The Theory of Error-Correcting Codes. Elsevier Science Publishers, Amsterdam, The Netherlands, 1977.

[30] J. I. Munro and P. V. Poblete. Fault tolerance and storage reduction in binary search trees. Information and Control, 62(2-3):210–218, August 1984.

[31] D. Patterson, G. Gibson, and R. Katz. A case for redundant arrays of inexpensive disks (RAID). In 11th SOSP, pages 386–393, 1988.

[32] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR–15–81, Center for Research in Computing Technology, Harvard University, 1981.

[33] M. O. Rabin. Efficient dispersal of information for security, load balancing, and fault tolerance. Journal of the Association for Computing Machinery, 36(2):335–348, April 1989.

[34] P. Raghavan. Robust algorithms for packet routing in the mesh. In Proceedings of the 1st ACM Symposium on Parallel Algorithms and Architectures, June 1989.

[35] M. Rosenblum and J. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26–52, February 1992.

[36] M. Seltzer, K. Bostic, M. K. McKusick, and C. Staelin. An implementation of a log-structured file system for UNIX. In USENIX, Winter 1993.

[37] G. F. Sullivan and G. M. Masson. Certification trails for data structures. In 21st International Symposium on Fault-Tolerant Computing (FTCS-21), pages 240–247, 1991.

[38] J. D. Valois. Implementing lock-free queues. In Proceedings of the Seventh International Conference on Parallel and Distributed Computing Systems, pages 64–69, Las Vegas, NV, 1994.