A FRAMEWORK ALGORITHM FOR DYNAMIC, CENTRALIZED DIMENSION-BOUNDED TIMESTAMPS

Paul A.S. Ward
Shoshin Distributed Systems Group
University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
http://www.shoshin.uwaterloo.ca/

ABSTRACT

Vector timestamps can be used to characterize causality in a distributed computation. This is essential in an observation context where we wish to reason about the partial order of execution. Unfortunately, all current dynamic vector-timestamp algorithms require a vector of size equal to the number of processes in the computation. This fundamentally limits the scalability of such observation systems. In this paper we present a framework algorithm for dynamic vector timestamps whose size can be as small as the dimension of the partial order of execution. While the dimension can be as large as the number of processes, in general it is much smaller. The algorithm consists of three interleaved phases: computing the critical pairs, creating extensions that reverse those critical pairs, and assigning vectors to each event based on the extensions created. We present complete solutions for the first two phases and a partial solution for the third phase.

Keywords: distributed-system observation, distributed computation, partial order, dimension, scalability, vector timestamp, Ore timestamp

1 MOTIVATION

An important problem in distributed systems is observing distributed computations. One of the things that makes this problem hard is that the events forming the distributed computation are partially ordered. The ability to display, manipulate, reason about, and query this partial order is the essence of distributed-system observation. An example of a tool that provides some of these facilities is POET [6, 15]. POET, and all similar systems, have inherent scalability limitations. One key limitation is on the data structure used for storing and querying the partial order. There are several ways in which the partial order may be represented, with different tradeoffs according to the queries that are asked of it. One of the primary queries, and the one on which this paper will concentrate, is that of event precedence. Specifically, we frequently wish to know if an event preceded, succeeded, or was concurrent with another event. Various partial-order data structures can then be evaluated based on their time/space tradeoff in answering event-precedence questions. The simplest method would be to store the partial order as a directed acyclic graph. In such a case precedence determination is a constant-time operation because the partial order is transitively closed and so there is an edge between any two events that are ordered. However, the space consumption for this method is unacceptably high. All systems we are aware of store the transitive reduction of the partial order. This is space-efficient but requires a (potentially quite slow) search operation on the graph to determine precedence. To compensate for this deficiency some additional information

is added to allow for a more efficient precedence test. The most common form this additional information takes is a logical timestamp, which receives its name because it determines the logical time ordering of events. Such timestamps amount to the addition of some edges, beyond the transitive reduction of the partial order, to reduce the search time. The two primary logical timestamps are the Fidge/Mattern [3, 8] and Ore [9, 14] vector timestamps. The Fidge/Mattern algorithm (described in Section 2.1) is a dynamic, distributed timestamp. That is, it can timestamp events as they occur within the distributed computation. Unfortunately, the vector size required for this timestamp is equal to the number of processes involved in the distributed computation. We are interested in being able to reason about computations involving thousands of processes, and so this does not scale for our purpose. The Ore algorithm (described in Section 2.2) is a centralized, static timestamp. That is, it requires the complete partial order before it can be computed and it cannot be computed within the distributed computation. The centralized nature is not a problem as, in general, the observation system is separate from the computation. However, the static nature is a problem. We require the ability to reason about the execution as it is on-going. The advantage of the Ore timestamp is that it requires a vector of size as small as the dimension of the partial order. A secondary problem with the Ore timestamp is that, in theory, the dimension can be as large as the number of processes [1]. In such a case the Ore timestamp would require the same vector size as the online Fidge/Mattern timestamp, negating its sole advantage. However, we have already demonstrated [18] that, in a significant number of distributed computations, with up to 300 processes, the dimension is typically much smaller than the number of processes. In the cases we examined it was 10 or less. 
Our motivation, therefore, is the creation of a dynamic dimension-bounded vector timestamp. That is, we want the dynamic capability of the Fidge/Mattern algorithm and the vector size of the Ore algorithm. The distributed capability of Fidge/Mattern timestamps is not necessary, and indeed the framework algorithm that we present in this paper does not achieve it.

In the remainder of this paper we will first briefly detail the formal model of distributed computation, the terminology necessary to deal with dynamic partial orders, and the Fidge/Mattern and Ore timestamps. In Section 3 we will describe the framework algorithm. It is composed of three interleaved phases, whose required properties we will specify. We will discuss the problems involved in solving each of these three phases and indicate to what degree we have solutions for them. Insofar as we have solutions, we will give an analysis of the costs associated with the algorithms. Finally we indicate what work remains to be completed to integrate this dynamic dimension-bounded vector timestamp with the POET system.

2 BACKGROUND

We use the standard model of a distributed system, initially defined by Lamport [7]: a distributed system is a system comprising multiple sequential processes communicating via message passing. Each sequential process consists of three types of events, send, receive and unary, totally ordered within the process. A distributed computation is the partial order formed by the “happened before” relation over the union of all of the events across all of the processes. We will denote the set of all events by E, the set of events within a given process by E_p (where p uniquely identifies the process) and an individual event by e_{p,i} (where p identifies the process and i identifies the event’s position within the process). Then the Lamport “happened before” relation (→) is defined as the smallest transitive relation satisfying

1. e_{p,i} → e_{p,j} if i < j
2. e_{p,i} → e_{q,j} if e_{p,i} is a send event and e_{q,j} is the corresponding receive event

Events are concurrent if they are not in the “happened before” relation:

    e ∥ f  ⟺  ¬(e → f) ∧ ¬(f → e)    (1)
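The two rules and the concurrency test of Equation 1 can be illustrated with a small sketch. This is not from the paper: the event encoding, the function names `happened_before` and `concurrent`, and the brute-force transitive closure are our own illustrative choices, suitable only for toy computations.

```python
from itertools import product

def happened_before(process_events, sends):
    """Smallest transitive relation satisfying the two rules:
    (1) order within each sequential process, (2) send -> receive."""
    edges = set()
    for seq in process_events.values():
        edges.update(zip(seq, seq[1:]))      # rule 1: process order
    edges.update(sends.items())              # rule 2: message edges
    events = [e for seq in process_events.values() for e in seq]
    hb = set(edges)
    for k, i, j in product(events, repeat=3):   # transitive closure
        if (i, k) in hb and (k, j) in hb:
            hb.add((i, j))
    return hb

def concurrent(hb, e, f):
    # Equation (1): neither event happened before the other
    return (e, f) not in hb and (f, e) not in hb

# Two processes p and q; p's first event is a send received by q's second.
hb = happened_before({"p": ["p1", "p2"], "q": ["q1", "q2"]}, {"p1": "q2"})
```

For example, `p2` and `q1` are concurrent here, while `p1` happened before `q2` through the message edge.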

These definitions presume that the partial order is a static entity. That is, implicit in the “happened before” and “concurrency” relations is a fixed computation to which they refer. In a dynamic observation context, the computation, and hence the partial order, is not fixed. To deal with this we will use the same notation and restriction from our previous paper [19]. That is, we will subscript relations with

the base set of the partial order to which the relation refers. If two partial orders are defined on the same base set, we will use the partial-order relation as the subscript. In addition, we require events to be processed in some linear extension of the partial order. We now briefly review some basic partial-order terminology (largely from Trotter [16]) necessary  for this paper. A partial order is a pair  where is a finite set1 and is an irreflexive, anti symmetric and transitive binary relation on . An    extension, , of a partial order is any partial order that satisfies





 

(2)



If is a total order, then the extension is called a linear extension. A realizer of a partial order is any set of linear extensions whose intersection forms the partial order. The dimension of a partial order is the cardinality of the smallest  possible realizer.  is any pair A critical pair of a partial order   that satisfies:





 

 

 









 











 



#"    

 

  $&%  (  '  '



)

(4)

We will now describe the Fidge/Mattern and the Ore timestamp algorithms. 2.1

F IDGE /M ATTERN T IMESTAMPS

The Fidge/Mattern timestamp [2, 3, 8, 12] is designed as follows. Each process * is assigned a  .-0/ and / unique identifier * where + ,* is the number of processes in the computation. De fine 1  as the function that returns this identifier for event  . Each event  is then assigned a vector 1 Since

243

5

!87

we are modeling distributed computations, all of the sets will be finite.

/




if @:A1 otherwise

)

9

(5)

Then the timestamp of  is:



243



:CBEDGFIH J;K L 243,5

)M

(6)

where BNDOF is the element-wise maximum of a set of vectors. The first event in a process is timestamped by creating a virtual null event in the process that is covered by the first event. This null event is assigned a timestamp with all elements set to 0. Equation 6 is then used to determine the timestamp of the first event in the process. Note that this algorithm effectively provides a set of edges in the directed graph to every event from its greatest predecessor in each process. As such, the elements of the Fidge/Mattern timestamp can be considered to be events. Also note that if we number the events in each process, starting at 1, according to  the order of their occurrence, and we define P  to be the function that returns this number, then it can be shown that: P

Any set of linear extensions of a partial order that reverses all of the critical pairs of that partial order is a realizer of that partial order. A critical pair   is said to be reversed by a set of extensions  if one of the extensions in the set contains   . is covered by ” or “ Finally, we say that “  covers ” if there is no intermediate element in the  partial order between and . 



(3)

! 

 of size !in terms of the events timestamp 243 it covers. First, define 24365 as follows:







:C2Q3

R7 1



 9

 Precedence testing between two events, 

(7) and

  , can be determined by the equivalence:       243   R7 1    9 S  243   R7 1    9

(8) Note that the vector size is equal to the number of processes, while the test is constant time. 2.2
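Equations 5, 6 and 8 can be sketched directly in code. This is an illustration under our own assumptions, not the paper's implementation: events arrive in some linearization, each event's covered-event list is known, and the dictionaries `FM`, `proc` and `covers`, plus the explicit `null` events, are naming conventions of ours.

```python
def fm_prime(FM, proc, f):
    # Equation (5): FM'(f) advances f's own process component by one
    v = list(FM[f])
    v[proc[f]] += 1
    return v

def timestamp(FM, proc, covers, e):
    # Equation (6): element-wise max of FM'(f) over the events f that e
    # covers; a process's first event covers a virtual null event whose
    # timestamp is all zeros.
    vecs = [fm_prime(FM, proc, f) for f in covers[e]]
    FM[e] = [max(col) for col in zip(*vecs)]

def precedes(FM, proc, e, f):
    # Equation (8): constant-time precedence test
    return FM[e][proc[e]] < FM[f][proc[e]]

# Processes 0 and 1, two events each; event a is a send received by d.
proc = {"null0": 0, "null1": 1, "a": 0, "b": 0, "c": 1, "d": 1}
covers = {"a": ["null0"], "b": ["a"], "c": ["null1"], "d": ["c", "a"]}
FM = {"null0": [0, 0], "null1": [0, 0]}
for e in ["a", "c", "b", "d"]:       # one linearization of the order
    timestamp(FM, proc, covers, e)
```

After this run, `a` is found to precede `d` through the message, while `b` and `d` test as concurrent in both directions.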

2.2 ORE TIMESTAMPS

The Ore timestamp [9, 14] is based on having a realizer for the partial order. The linear extensions are arbitrarily ordered from 1 to k, where k is the number of extensions. Each event e in each extension j is assigned a natural number between 1 and n, where n is the number of events in the computation, that identifies its position within that extension. Define pos_j(e) as the function that maps event e to this identifier in extension j. The following equivalence must hold for this assignment:

    e < f  ⟺  ∀j ∈ {1, …, k} : pos_j(e) < pos_j(f)    (9)

The vector timestamp Ore(e) for event e is defined as:

    Ore(e) = [ pos_1(e), pos_2(e), …, pos_k(e) ]    (10)

Precedence testing between two events, e and f, can be determined by the equivalence:

    e < f  ⟺  ∀j ∈ {1, …, k} : Ore(e)[j] < Ore(f)[j]    (11)

Note that the vector size is equal to k, while the test is O(k). We know of no dynamic algorithm to build a realizer. The essence of our solution presented in this paper is to incrementally build a pseudo-realizer. We will explain this in Section 3. We will first briefly describe other work in reducing vector-timestamp size.
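A sketch of Equations 10 and 11, under the assumption that a realizer is already available as a list of linear extensions (the function names and the two-extension example are ours, not the paper's):

```python
def ore_timestamps(realizer):
    # Equation (10): entry j of an event's vector is its 1-based
    # position in linear extension j of the realizer.
    k = len(realizer)
    ts = {}
    for j, ext in enumerate(realizer):
        for pos, e in enumerate(ext, start=1):
            ts.setdefault(e, [0] * k)[j] = pos
    return ts

def precedes(ts, e, f):
    # Equation (11): e precedes f iff e is earlier in every extension
    return all(pe < pf for pe, pf in zip(ts[e], ts[f]))

# Partial order a < b, c < d, a < d has a realizer of two extensions:
ts = ore_timestamps([["a", "b", "c", "d"], ["c", "a", "d", "b"]])
```

Here the vectors have length 2 regardless of the number of processes; `b` and `c` are ordered oppositely in the two extensions and so test as concurrent.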

2.3 RELATED WORK

Various approaches have been taken to the problem of reducing the size of vector timestamps. Fowler and Zwaenepoel [4] create direct-dependency vectors. That is, rather than keeping track of all causal events, these timestamps only record the greatest predecessors that transmitted directly to the process in which the event occurs. This is effectively adding a set of edges connecting each event e to the greatest send event in each process that communicated directly with the process that e is in. Typically most processes are unlikely to receive communication from most other processes, and so these vectors are substantially smaller than Fidge/Mattern timestamps. However, precedence testing requires a recursive search through the events referenced in the timestamp. This is, in the worst case, linear in the number of messages in the computation. While this worst case is unlikely to occur in practice, it is probable that the computation of greatest predecessors in each process will require non-trivial searching. As such, it is probably not acceptable for monitoring or debugging applications.²

² In fairness to Fowler and Zwaenepoel, their objective was computation replay, and for this task their algorithm is entirely suitable.

Jard and Jourdan [5] generalize the Fowler and Zwaenepoel method by creating the notion of pseudo-direct dependency. This extends dependency beyond directly communicating processes, but bounds the amount of extension. This allows a tradeoff between precedence-testing search length and vector size. However, the gain is based on the notion that not all events are worth observing. Within the distributed computation, this is plausible. In implementing a partial-order data structure, all elements are observable, and so there is no benefit over the Fowler/Zwaenepoel technique.

Other work has been in the context of reducing the vector size that must be transmitted between processes for distributed vector timestamps [11]. As we have already noted, centralized timestamp algorithms are sufficient for distributed-system observation, and so these techniques are not directly applicable. Singhal and Kshemkalyani [13] take the approach of transmitting just the information that has changed in the vector between successive communications. This requires each process to maintain an N × N matrix, which amounts to an O(N²) set of data per event for our application. However, it is probably feasible to use a differential technique between events within the partial-order data structure.

All of these techniques are based on the idea of a vector of processes. Our work is the only one we are aware of that takes a linear-extension view of vector timestamps.
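The core of the differential idea just mentioned can be sketched as follows. This is only the encoding step, under our own naming; Singhal and Kshemkalyani's actual protocol additionally tracks, per destination, which entries each process has already seen.

```python
def vector_diff(old, new):
    # Ship only the entries that changed since the last transmission,
    # as (index, value) pairs, instead of the whole vector.
    return [(i, v) for i, (u, v) in enumerate(zip(old, new)) if u != v]

def apply_diff(vec, diff):
    # Reconstruct the full vector from the previous one plus the diff.
    out = list(vec)
    for i, v in diff:
        out[i] = v
    return out
```

Between two events that differ in only a few entries, the diff is much smaller than the full vector, which is what makes a differential technique between events in the partial-order data structure plausible.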





3 FRAMEWORK ALGORITHM

Our framework algorithm is based on incrementally building a pseudo-realizer and then creating timestamp vectors based on that pseudo-realizer. The creation of the pseudo-realizer is a two-step process: determining the critical pairs of the partial order and building the pseudo-realizer that reverses these pairs. It is a pseudo-realizer because the extensions that form it are not linear. Thus the algorithm is as follows:

1: // For each new event e
2:   L ← criticalPairs(e);
3:   insert(e, L, R);
4:   assignVector(e, R);

where L is the set of critical pairs associated with e and R is the pseudo-realizer built so far. We will now describe the detailed requirements of the three phases of our algorithm. Insofar as we have specific algorithms that satisfy, or partially satisfy, those requirements, we will describe those algorithms.

3.1 COMPUTING CRITICAL PAIRS

The requirements for incremental computation of critical pairs are similar to those required by our online dimension-bound analysis algorithm [19]. In both cases we need a time-efficient algorithm. Specifically, we must have an algorithm that is O(1) with respect to the number of events in the computation, or the algorithm could not be considered to be dynamic. Also, we prefer that it is no more than O(N) with respect to the number of processes, as this is the cost of the Fidge/Mattern algorithm. In addition to these time requirements, we now also constrain the space consumption per event to be no more than O(k), where k is the number of extensions in the pseudo-realizer. A complete incremental critical-pair algorithm that satisfies these requirements is shown in Figure 1.

 1: criticalPairs(e)
 2:   L ← ∅;
 3:   FM(e) ← fidgemattern(e);
 4:   for each event f covered by e
 5:     if (e is the last covering event of f)
 6:       delete FM(f);
 7:
 8:   for each greatest predecessor g of e (taken from FM(e))
 9:     f ← successor of g within g’s process;
10:     if (every event covered by f precedes e)
11:       L ← L ∪ {⟨e, f⟩};
12:
13:   for each maximal event m concurrent with e
14:     if (some event covered by e is a greatest predecessor of m, per FM(m))
15:       L ← L ∪ {⟨m, e⟩};
16:   return(L);

Figure 1: Dynamic Critical-Pair Computation

Our prior algorithm [19] made extensive use of Fidge/Mattern timestamps and so had a space consumption of O(N) per event. We have modified it so that the Fidge/Mattern timestamps it uses are deleted sufficiently soon that the only ones outstanding at any given time are those for events that are not yet completely covered. This necessitated two changes in the algorithm.

First we had to bound the computation of Fidge/Mattern timestamps for the maximal events. The value of this timestamp for an event e depends on the timestamps of the events that e covers. Therefore, we must maintain a copy of the Fidge/Mattern timestamp for any event for which we have not seen all the events that cover it. After that time we may delete it, and this is the function of lines 4 to 7. Under our general restriction of processing events in some linearization of the partial order, the number of events that are not completely covered at any given time can be arbitrarily large. While it is not possible to further restrict this linear ordering to guarantee a bounded number of events that are not completely covered, in practice a (small) bound does exist. There are three reasons for this. In the case of systems with only synchronous communication, there can never be more than N events that have not been covered. For systems with asynchronous communication, the number of events not yet covered is bounded by the resources of the distributed system. Since these resources are finite, the number of such events is bounded. Finally, in practice we do not observe systems that have significant numbers of uncovered events outstanding. This is probably because the correctness of such systems would be subject to system resources, which is not a robust basis for guaranteeing system behaviour.

Second, we needed to remove the precedence tests, as these depended on Fidge/Mattern timestamps. The algorithm from [19] was a straightforward implementation of the theorem:

    ⟨x, y⟩ ∈ Crit(P)  ⟺  y ∈ leastConc(x) ∧ x ∈ greatestConc(y)

where Crit(P) is the set of critical pairs in the partial order P, leastConc(x) is the set of events concurrent with x with no predecessor events concurrent with x, and greatestConc(y) is the set of events concurrent with y with no successor events concurrent with y. The Fidge/Mattern timestamp of the new event, computed in line 3, is equivalent to the set of greatest predecessor events by process. Since any event e being processed is, by definition, maximal, the successor events³ to e’s greatest predecessors, lines 8 and 9, must be concurrent with e (excluding e itself). They form the set of potentially least concurrent events. It is then necessary to remove from this set events that have predecessors within the set. This required O(N) precedence tests. The following observation allowed us to reduce this to two precedence tests:

    x ∥ y  ⟹  ( y ∈ leastConc(x)  ⟺  every event covered by y precedes x )

Since in line 10 we know that the event f is concurrent with the event e being processed, this theorem applies. Specifically, it allows us to look at just the covered events of f. To perform the precedence tests themselves, we apply Equation 7, which allows us to replace the relevant Fidge/Mattern value with the event position within the process. In addition to these precedence tests, we also had to determine if the event e was least concurrent to any of the maximal events in the partial order. As with the above optimization, that can be reduced to determining if the events covered by event e are greatest predecessors of any maximal events. This is determined with the Fidge/Mattern timestamp of the maximal event in line 14.

In addition to requiring less space than our previous algorithm, this algorithm requires O(N) time, where the prior algorithm was O(N²). This is based on the presumption that the number of events covered by an event is 2 or less. In terms of a distributed computation, this allows unicast, multicast, or broadcast communication, but does not allow multi-receive operations.⁴ If multi-receive is permitted, without multicast, then the cost remains amortized O(N), since although a multi-receive covering c events would cost O(cN), there would have to be c corresponding unicast sends. If both multicast and multi-receive were permitted, then the cost could be O(N²) amortized. This would amount to most or all communication events being multicast or multi-receive, which only occurs in highly fault-tolerant systems.

³ By “successor event” we mean the next event within the sequential process. Such events may not exist, in two senses of the word “exist.” They may exist but have not yet been processed, and so from the perspective of our algorithms they do not exist. Alternately, the process may have ended and the last event within the process may be a predecessor event to the event currently being processed.
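The deletion rule of lines 4 to 7 can be sketched as a small bookkeeping structure. This is our own illustration, not the paper's code: we assume the total number of events that will eventually cover each event (`expected`) is known from the traced computation, and the class and field names are hypothetical.

```python
class FMStore:
    """Retain a Fidge/Mattern timestamp only until its event is
    completely covered, so that only timestamps of not-yet-covered
    events are outstanding at any time."""
    def __init__(self, expected):
        self.expected = dict(expected)   # event -> total covering events
        self.seen = {}                   # event -> covering events so far
        self.fm = {}                     # event -> outstanding FM vector

    def process(self, e, vector, covers):
        self.fm[e] = vector
        for f in covers:
            self.seen[f] = self.seen.get(f, 0) + 1
            if self.seen[f] == self.expected.get(f, 0):
                del self.fm[f]           # e was the last event covering f

# Event a will be covered twice (by b and c); b and c are maximal so far.
store = FMStore({"a": 2, "b": 1, "c": 1})
store.process("a", [1, 0], [])
store.process("b", [1, 1], ["a"])        # a still has a cover outstanding
store.process("c", [2, 0], ["a"])        # a now fully covered: retired
```

After the last call, only the timestamps of the maximal events `b` and `c` remain stored.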







3.2 BUILDING PSEUDO-REALIZERS

The requirements for the incremental building of a pseudo-realizer are as follows. First, it must reverse the critical pairs discovered for the new event. This is the sense in which it is a realizer for the partial order. It need not be a true realizer, in that we do not require the extensions we form to be linear. Second, it must be possible to determine efficiently the precedence relation between any two events stored in the pseudo-realizer. That is, we must be able to build vectors using the extensions. The Ore timestamp requires linear extensions to build vectors. We have discovered, and will describe in Section 3.3, that this requirement can be relaxed somewhat. Third, the insertion of an event into the pseudo-realizer must have an amortized cost of O(1) with respect to the number of events, and no more than O(N) with respect to the number of processes. The first of these constraints is necessary for it to be considered a dynamic algorithm; the second requires that we are no less efficient in building the timestamps than is the Fidge/Mattern algorithm. The fourth requirement is that the space consumption per event must be no more than O(k). Reduced space consumption is a primary benefit that we are seeking. Finally, k must be reasonably small, or at least a reasonably close approximation to the dimension of the partial order of the computation.

This final requirement is based on the fact that building the minimum number of extensions that reverses the critical pairs of a partial order is NP-hard for dimension greater than two [20]. Any dynamic algorithm we develop will be non-optimal. However, it must be sufficiently good in practice to produce pseudo-realizers with only a small number of extensions where the dimension is low. There would be no value in an algorithm that produced a number of extensions similar to the number of processes in the computation. Given these requirements, we now define the high-level event-insertion algorithm.

⁴ A multi-receive is the converse of a multicast. The reader can imagine a TCP stream being written into several times and being read once.

  

1: insert(e, L, R)
2:   for each extension X in R
3:     place(e, X, L);
4:   if (¬ reversed(L, R))
5:     X ← createExtension();
6:     R ← R ∪ {X};
7:     place(e, X, L);
8:   return;

In brief, we must place the new event in every extension such that the set of extensions reverses all critical pairs associated with that event. Note that although lines 2 and 3 imply that event insertion is performed sequentially over all extensions, in practice the implementor is not so limited. The requirement of this code is that the event must be inserted in every extension, and a constraint of the insertion is the need to reverse the critical pairs. If it is not possible to reverse all critical pairs in the existing extensions, we must create new extensions that will allow us to reverse the critical pairs that have not yet been reversed. This is the function of lines 4 to 8. The nature of our system is such that we can guarantee that the creation of a single new extension will be sufficient for this purpose.

Event insertion and new-extension creation are two distinct and difficult problems to overcome. Our prior algorithms do not adequately address either of these problems, for several reasons. First, they processed critical pairs, rather than events. Second, they made extensive use of Fidge/Mattern timestamps, which violates the fourth requirement.

First we will deal with the problem of placing an event e into a set of extensions such that it reverses some or all of the critical pairs associated with it. We deal with this by first dividing the critical pairs into those of the form ⟨f, e⟩ and those of the form ⟨e, f⟩. The reversal of those in the latter category is trivial to satisfy since e is maximal: we can place e at the end of some extension. Note that this neither violates the partial order nor alters existing critical-pair reversals. In this regard, it is not strictly necessary to calculate critical pairs of this form. On the other hand, if we wish to minimally commit our extensions (that is, we do not wish to establish orderings that are implied neither by the partial order nor by the requirement to reverse critical pairs), then we may choose to be more selective.

Critical pairs of the form ⟨f, e⟩ are more difficult to reverse. If a new extension is required, the reversal of these pairs is what will force it, as they require that e be placed earlier in some extension than f. We can easily determine if an extension will allow the reversal of an ⟨f, e⟩ critical pair by determining whether the events that e covers occur after f in the extension. These events and f will have been processed, and thus we have vector timestamps for them. As such we can check the relevant timestamp entry. We can thus place a range on the location of e in the extension: after the events it covers and before the events f with which it forms ⟨f, e⟩ critical pairs. We have not yet investigated this notion of giving a range to a new event, rather than a specific location. A heuristic that satisfies the above requirement is to place e at the earliest location possible in each extension, other than the one in which the ⟨e, f⟩ class of critical pairs was reversed. This location is immediately after the last event in the extension that e covers. This is computable in O(c) steps per extension, where c is the number of events that e covers. We can then check, in O(k) steps per critical pair, whether or not the critical pair has been reversed.



If it has not, it cannot be reversed. Since the reversal of ⟨f, e⟩ critical pairs is more constrained, in practice we first reverse these pairs (if they can be) and then attempt to shift e toward the end of some extension without violating any ⟨f, e⟩ reversals.

The second problem is that of creating new extensions. The main issue here is that, if we were building a realizer rather than a pseudo-realizer, we would have to make this new extension linear. This would force premature commitment to a specific ordering that might well be poor. Prior experience shows this to be the case [17]. However, at the point where we need to create a new extension, precedence can be determined between all prior events without this extension. In other words, the only precedence-testing ability that this new extension will give us is with the current and subsequent events. We can take advantage of this as follows. Rather than force an ordering on existing events, for each event we keep track of how many extensions were in use at the time it was inserted into the pseudo-realizer. This allows us to test for precedence using only a subset of the extensions. This in turn means we merely have to order existing events with respect to the new event that required the new extension. Since the new event is maximal, and we still have its Fidge/Mattern timestamp at this point, we can readily determine which events precede the new event and which are concurrent. Thus, we create a new extension that satisfies the precedence requirements without overcommitting.

3.3 VECTOR ASSIGNMENT

The vector-assignment problem is composed of three separate issues. First, we must deal with usage, because it affects the solutions to the remaining two issues. Specifically, we must address the question of whether or not these vectors are tied to the partial-order data structure or if they can be distributed for use in other processes. The relevance of this is for systems such as POET, which is a client/server architecture and, as such, passes event information, including timestamp information, between processes. The Fidge/Mattern and Ore timestamps are usable in other address spaces because, once assigned, the vectors do not change. Thus, some remote process, on receiving two events with associated vectors, can determine precedence between those events. In our framework algorithm this is not the case because we do not have a fixed set of extensions. However, the degree to which the vectors change depends a great deal on the specific algorithms used for the pseudo-realizer. The algorithm that we have proposed in the previous section need not change vector assignments for existing extensions, but does result in changes if new extensions need to be created. A possible solution to this problem is to distribute events with the vectors that they possess at the time of distribution. If two events then need to be compared and they have vectors of differing lengths, a vector update may be required.

The second problem is how to assign a value to an event to indicate its position within the extension. In the case of the Ore timestamp an integer was used. This is possible because the realizer is computed in its entirety before positional integers are assigned. In our case, we must assign values as new events arrive, and those new events are not, in general, located at the end of the extension. We cannot simply adjust the position information for all events after the new event in the extension: an unbounded number of events would need to have their relevant vector entry updated. Rather than do this, we first solve the problem in theory. Instead of using an integer to represent a position, we use a real number. Since the real numbers are infinitely divisible, we can always assign a new real number to a new event to correctly identify its place within an extension. Since real numbers do not exist in computers, we emulate their effect by leaving substantial gaps between the position integers. Whenever we can no longer subdivide the integer space to insert new events, we perform a vector update on all events in the extension after the new event. This is no more expensive than if we did it with each new event, and is amortized over a large number of events. When new extensions are created, the values assigned to existing events may be arbitrarily selected such that they satisfy the requirement of determining precedence between the existing events and the new event. They are further adjusted as subsequent events are processed.

Finally, we need to specify the precedence test. While this is based on the Ore test, it is complicated because the vectors may not be the same length. If the vectors are of the same length, then the standard Ore test applies. If one event has a longer vector than the other, then it must be either concurrent with or a successor of the other. In some cases it is possible to determine concurrency using just the two vectors and the standard Ore test, up to the length of the shorter vector. In general, however, that is not the case, and for such instances we have to get a vector update from the partial-order data structure. For a partial order with a pseudo-realizer containing k extensions, an event needs at most k − 1 such updates, so the amortized cost remains O(k), which is constant with respect to the number of events.
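The gap technique for emulating real-valued positions can be sketched for a single extension. This is an illustration under our own assumptions: the gap constant, the class name, and the linear scan in `index()` are ours (a real implementation would also propagate renumbered positions into the stored vectors), and it shows only position assignment, not critical-pair reversal.

```python
GAP = 1 << 16

class Extension:
    """One extension of the pseudo-realizer.  Integer positions are
    spaced GAP apart so a new event can usually be slotted between two
    neighbours; when a gap is exhausted, the tail of the extension is
    renumbered, which is amortized over many insertions."""
    def __init__(self):
        self.order = []    # events in extension order
        self.pos = {}      # event -> integer position

    def insert_after(self, prev, e):
        i = 0 if prev is None else self.order.index(prev) + 1
        lo = self.pos[self.order[i - 1]] if i > 0 else 0
        hi = self.pos[self.order[i]] if i < len(self.order) else lo + 2 * GAP
        if hi - lo < 2:                      # no room left: renumber tail
            for j in range(i, len(self.order)):
                self.pos[self.order[j]] = lo + (j - i + 2) * GAP
            hi = self.pos[self.order[i]]
        self.order.insert(i, e)
        self.pos[e] = (lo + hi) // 2         # midpoint of the gap

ext = Extension()
ext.insert_after(None, "a")
ext.insert_after("a", "b")
ext.insert_after("a", "c")     # squeeze c between a and b, no renumbering
```

Precedence within the extension is then decided by comparing the stored integer positions, exactly as the Ore test compares per-extension positions.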

4 CONCLUSION AND FUTURE WORK

We have presented a novel framework algorithm for dynamic, centralized, dimension-bounded timestamps. We have completely solved the problem of incremental computation of critical pairs. We have presented one solution to the problem of building pseudo-realizers, and it has thus far produced promising results; we plan to investigate other possible solutions to this problem. We have provided a partial solution to the problem of assigning vectors to events, and have presented the necessary precedence-test algorithm. We plan to investigate more thoroughly the problem of timestamp distribution.

This work is significant because, insofar as we are able to build efficient pseudo-realizers, we can create significantly smaller vector timestamps with little cost increment in the precedence test. The ability to build efficient pseudo-realizers is a function of both the algorithm used and the event interactions of the distributed computation. We are currently investigating instances in which the pseudo-realizer produced is not of satisfactory size, to determine whether this is a function of the computation's event set or of the algorithm.

Finally, we note that a compact, dynamically computable, and efficiently decidable representation for partial orders has wider applicability than distributed-system observation alone. For example, Raymond [10] proposed partial-order databases as a better way of structuring data. That said, the reader should recall that we limit additions to being in a linearization of the partial order. In addition, we would need to investigate the removal of elements from the partial order. We suspect that a similar restriction, permitting the removal of minimal elements only, would be required.

ACKNOWLEDGEMENTS

The author would like to thank IBM for supporting this work and David Taylor for many useful discussions regarding it. Many thanks also to the referees for their feedback.

ABOUT THE AUTHOR

Paul Ward is a Ph.D. candidate in the Department of Computer Science at the University of Waterloo. His research focus is the scalability of distributed-system observation tools. He is supported in this by an IBM CAS Fellowship, for which he is very grateful. He expects to finish his Ph.D. within the next six months. You can reach him at [email protected]. If you would like to write him a letter instead, send it to the Department of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1. You can find out more about him by surfing over to http://www.shoshin.uwaterloo.ca/~pasward/.

REFERENCES

[1] B. Charron-Bost. Concerning the size of logical clocks in distributed systems. Information Processing Letters, 39:11-16, July 1991.

[2] Colin Fidge. Logical time in distributed computing systems. IEEE Computer, 24(8):28-33, 1991.

[3] Colin Fidge. Fundamentals of distributed systems observation. Technical Report 93-15, Software Verification Research Centre, Department of Computer Science, The University of Queensland, St. Lucia, QLD 4072, Australia, November 1993.

[4] Jerry Fowler and Willy Zwaenepoel. Causal distributed breakpoints. In Proceedings of the 10th IEEE International Conference on Distributed Computing Systems, pages 134-141. IEEE Computer Society Press, 1990.

[5] Claude Jard and Guy-Vincent Jourdan. Dependency tracking and filtering in distributed computations. Technical Report 851, IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France, August 1994.

[6] Thomas Kunz, James P. Black, David J. Taylor, and Twan Basten. POET: Target-system independent visualisations of complex distributed-application executions. The Computer Journal, 40(8):499-512, 1997.

[7] Leslie Lamport. Time, clocks and the ordering of events in distributed systems. Communications of the ACM, 21(7):558-565, 1978.

[8] Friedemann Mattern. Virtual time and global states of distributed systems. In M. Cosnard et al., editors, Proceedings of the International Workshop on Parallel and Distributed Algorithms, pages 215-226, Chateau de Bonas, France, December 1988. Elsevier Science Publishers B.V. (North-Holland).

[9] Oystein Ore. Theory of Graphs, volume 38 of American Mathematical Society Colloquium Publications. American Mathematical Society, Providence, R.I., 1962.

[10] Darrell Raymond. Partial Order Databases. PhD thesis, University of Waterloo, Waterloo, Ontario, 1996.

[11] Michel Raynal and Mukesh Singhal. Capturing causality in distributed systems. IEEE Computer, 29(2):49-56, 1996.

[12] Reinhard Schwarz and Friedemann Mattern. Detecting causal relationships in distributed computations: In search of the holy grail. Distributed Computing, 7(3):149-174, 1994.

[13] M. Singhal and A. Kshemkalyani. An efficient implementation of vector clocks. Information Processing Letters, 43:47-52, August 1992.

[14] James Alexander Summers. Precedence-Preserving Abstraction for Distributed Debugging. Master's thesis, University of Waterloo, Waterloo, Ontario, 1992.

[15] David J. Taylor. Event displays for debugging and managing distributed systems. In Proceedings of the International Workshop on Network and Systems Management, pages 112-124, August 1995.

[16] William T. Trotter. Combinatorics and Partially Ordered Sets: Dimension Theory. Johns Hopkins University Press, Baltimore, MD, 1992.

[17] Paul Ward. On the scalability of distributed debugging: Vector clock size. Technical Report CS-98-29, Shoshin Distributed Systems Group, Department of Computer Science, The University of Waterloo, Waterloo, Ontario, Canada N2L 3G1, December 1998. Available at ftp://cs-archive.uwaterloo.ca/cs-archive/CS-98-29/CS-98-29.ps.Z.

[18] Paul A.S. Ward. An offline algorithm for dimension-bound analysis. In Dhabaleswar Panda and Norio Shiratori, editors, Proceedings of the 1999 International Conference on Parallel Processing, pages 128-136. IEEE Computer Society, 1999.

[19] Paul A.S. Ward. An online algorithm for dimension-bound analysis. In P. Amestoy, P. Berger, M. Daydé, I. Duff, V. Frayssé, L. Giraud, and D. Ruiz, editors, Euro-Par'99 Parallel Processing, Lecture Notes in Computer Science, No. 1685, pages 144-153. Springer-Verlag, 1999.

[20] Mihalis Yannakakis. The complexity of the partial order dimension problem. SIAM Journal on Algebraic and Discrete Methods, 3(3):351-358, September 1982.
