JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 19, 415-432 (2003)
Group Mutual Exclusion in Tree Networks*

JOFFROY BEAUQUIER, SÉBASTIEN CANTARELL, AJOY K. DATTA** AND FRANCK PETIT+
LRI/CNRS, Université de Paris-Sud, France
** School of Computer Science, University of Nevada Las Vegas, U.S.A.
+ LaRIA, Université de Picardie Jules Verne, France

The group mutual exclusion (GME) problem deals with sharing a set of m mutually exclusive resources among all n processes of a network. Processes are allowed to be in a critical section simultaneously provided they request the same resource. We present three group mutual exclusion solutions for tree networks. None of the three solutions uses process identifiers, and all use bounded-size messages. They achieve the best possible context-switch complexity, which is O(min(n, m)). The first solution uses a fixed root of the tree and 0 to O(n) messages per critical section entry. This solution supports an unbounded degree of concurrency, thus providing the maximum resource utilization. The second solution also uses a fixed root, but uses a reduced number of messages per critical section entry: it generates an average of O(log n) messages per entry and also allows an unbounded degree of concurrency. The third solution removes the restriction of a fixed root while maintaining all the other desirable properties of the second solution.

Keywords: group mutual exclusion, mutual exclusion, priority-based, resource allocation, readers/writers problem
1. INTRODUCTION

The group mutual exclusion (GME) problem, introduced by Joung [1], deals with mutual exclusion and concurrency issues. A GME protocol allows all the processes of a distributed system to share a set of mutually exclusive "sessions." Processes can share access to a particular session s. However, if a process p (other than the active processes using a session) requests access to a session t ≠ s, then p cannot access t immediately. Process p will have to wait until s is freed by all the active processes. On the other hand, if p requests access to s, then p can start using s immediately; i.e., p can share the use of Session s with other processes. In other words, there is no limit on the number of processes which can use a session simultaneously. An interesting example of group mutual exclusion was presented in [1]. Consider large data sets stored in a secondary memory. A set of processes accesses the data sets through a server. The server can be a CD jukebox. Using a classical mutual exclusion protocol, the server needs to repeatedly load and unload the data sets (e.g., the CDs) from the secondary memory to process the requests. An efficient GME protocol would
Received May 15, 2002; accepted July 25, 2002. Communicated by Biing-Feng Wang, Stephan Olariu and Gen-Huey Chen. * A preliminary version of the paper was presented at the 2002 International Conference on Parallel and Distributed Systems, Chungli, Taiwan.
allow multiple processes to read the currently loaded data set (a CD) simultaneously, while forcing the processes requesting a different data set (another CD) to wait. An efficient GME solution could also help improve the quality of service (QoS) of an Internet server. The GME protocol could be used to group different requests for the same service, thereby reducing memory swapping.

The GME problem has some similarities with the mutual exclusion problem, but its performance is measured using some unique metrics, which we will informally discuss next. The degree of concurrency [1] measures the number of processes that can concurrently access an open session while a process is executing a session and another process is waiting for a different session. A higher degree of concurrency implies better resource utilization. The context-switch complexity (also called the number of "rounds of passages") [1] indicates the maximum number of sessions which can be opened after a process requests access to a session. Multiple (concurrent) accesses to the same session are counted as one open session in the calculation of context-switch complexity. The notion of context-switch complexity is analogous to the notion of waiting time in mutual exclusion. Lower context-switch complexity implies a shorter waiting time.

Related Work. The GME problem [1] is a generalization of mutual exclusion (ME) [2, 3] and the readers/writers [4] problem. The GME problem is also related to several well-studied synchronization problems, such as dining philosophers [5], drinking philosophers [6], and k-exclusion [7]. The reader may refer to [8] for a discussion of the use of GME to solve these problems. The GME algorithms proposed in [1] work with the shared-memory model. Solutions for shared-memory systems were also proposed in [8, 9]. The GME problem was studied in the message-passing model in [10-12].
The two algorithms proposed in [10] work for fully connected networks and generate Θ(n) messages per entry in a critical section. The message size is unbounded. The context-switch complexity of both algorithms is O(n) rounds of passages. One of them offers an unbounded degree of concurrency. Two algorithms for unidirectional rings were presented in [11]. The number of messages for a critical section entry in those algorithms is Θ(n), and the messages are of unbounded size. The algorithms achieve the best possible context-switch complexity of O(min(n, m)) and a degree of concurrency of O(n²). An open problem was suggested in [10]: to obtain a GME algorithm which uses bounded-size messages. In [12], three algorithms were presented. An entry to the critical section generates between 0 and 2 × n messages. The message size used is O(log min(n, m)) bits, thus solving the open problem proposed in [10]. The context-switch complexity of the solutions in [12] is O(min(n, m)) rounds of passages, and the degree of concurrency is unbounded.

Contributions. We present three group mutual exclusion algorithms for tree networks. All the proposed solutions have the desirable property of using messages of bounded size and not using process identifiers. Moreover, all of them achieve the best possible context-switch complexity of O(min(n, m)) rounds of passages. The first solution, Algorithm GMεα, uses a fixed root of the tree. It supports an unbounded degree of concurrency, hence achieving the best possible resource utilization. An entry to the critical section costs between 0 and 3 × (n − 1) + h messages, where h is the height of the spanning tree.
Algorithms GMεβ and GMεγ use between 0 and 4 × h messages per critical section entry. This means that the average number of messages exchanged for an entry to the critical section is typically O(log n) [13]. Both algorithms preserve the unbounded degree of concurrency of Algorithm GMεα. However, concurrency may be limited in some parts of the network. Like Algorithm GMεα, Algorithm GMεβ uses a fixed root. Therefore, a particular process must be privileged. Moreover, in both Algorithms GMεα and GMεβ, processes nearer the root can access the critical section more often than others. The third solution, Algorithm GMεγ, deals with this problem and uses the ideas of Algorithm GMεβ and the one proposed in [13] to make the root of the network mobile without losing any desirable properties of Algorithm GMεβ. Although some processes near the root can still use a session more often than others, because the root varies, the same processes will not always be disadvantaged.

Outline of the paper. The rest of the paper is organized as follows. In section 2, we describe the model and specify the GME problem. Three solutions, along with proofs, complexity results, and extensions, are presented in section 3. Some concluding remarks are made in section 4.
2. PRELIMINARIES

A distributed system is an undirected connected graph, S = (V, E), where V is a set of nodes (|V| = n) and E is a set of edges. Nodes represent processes, and edges represent bidirectional communication links. We denote a variable V (a constant C) of a process p by Vp (resp. Cp). A communication link (p, q) exists iff p and q are neighbors. We consider asynchronous tree networks, where processes communicate with their neighbors by message passing. The message delivery time is arbitrary (i.e., finite but unbounded). We assume FIFO communication links. To simplify the presentation, we refer to a link (p, q) of process p simply by the label q. Each process p maintains its set of link labels in the constant set Np. The state of a process is defined by the values of its variables. The state of a system is a vector of n + 1 components, where the first n components represent the states of the n processes and the last component refers to the set of messages (denoted by a multi-set M) in transit in the links. In the following, we refer to the state of a process and of the system as a (local) state and a configuration, respectively. Let a distributed protocol P be a collection of binary transition relations, denoted by ↦, on C, the set of all possible configurations of the system. An execution of a protocol P is a sequence of configurations e = γ0, γ1, …, γt, γt+1, …, such that for t ≥ 0, γt ↦ γt+1 (a single execution step) if γt+1 exists, or γt is a terminal configuration. During an execution step, one of the following actions (local steps) occurs in at least one process p: (1) p receives a message; (2) p executes some internal actions; (3) p sends at least one message.

GME Problem. We assume that processes cycle through a non-critical section, an entry section, a critical section, and an exit section. A process can access a "session" only within a critical section. Processes execute their critical sections for a finite but unknown amount of time.
Every time a process p moves from its non-critical section to the entry section, p non-deterministically chooses a session sp from {1, …, m}. The GME problem is to design a protocol (for the entry and exit sections) so that the following properties are true in every execution:

Mutual Exclusion (Safety): If two neighboring processes p and q are executing their critical sections simultaneously, then sp = sq.

No Lockout (Liveness): If a process p requests access to a session, then p eventually enters its critical section.

Concurrent Entering (Concurrency): If some processes are requesting access to a particular session and no process is requesting a different session, then all the requesting processes can enter their critical sections concurrently.

Note that the "no lockout" property defined above is similar to the "bounded delay" property as defined in [1, 10, 11]. Neither definition imposes any bound on the delay to access the session; the only requirement is that the delay be finite. Therefore, to maintain the "eventuality" property, we chose to refer to the property as "no lockout" (also called "no starvation") in the above specification.

Complexity Metrics. In order to evaluate the performance of our algorithms, we will measure the message complexity, the context-switch complexity, and the degree of concurrency. The message complexity is a measure of the number of messages generated per entry to the critical section. We use the term "passage" from [11] to define both the context-switch complexity and the degree of concurrency. A passage by a process p through a Session X (denoted by 〈p, X〉) is an interval [t1, t2] (of time) during which Process p executes its critical section. A passage is initiated at t1 and completed at t2. Let q be a process requesting access to Session X. Let T be the set of passages initiated by some processes (≠ q) after q has made its request (for Session X), and completed before q executes the corresponding passage 〈q, X〉. A round (of passages) RY of T is a maximal set of consecutive passages of T which are passages through Session Y.
The context-switch complexity is the number of rounds of T such that for each round RY, either Y ≠ X, or Y = X but 〈q, X〉 ∉ RY. The degree of concurrency is defined as the maximum number of passages that can be initiated during a round RY of T.
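As an illustration of these definitions, rounds can be computed from a chronological sequence of completed passages by grouping maximal runs through the same session. The following minimal Python sketch encodes passages simply as session labels (this encoding is our assumption, not the paper's):

```python
from itertools import groupby

def rounds_of_passages(passages):
    """Group a chronological sequence of completed passages (session labels)
    into rounds: maximal runs of consecutive passages through the same session."""
    return [(session, len(list(run))) for session, run in groupby(passages)]

# Passages completed between q's request for Session X and q's own passage:
T = ['Y', 'Y', 'X', 'X', 'Z', 'Y']
print(rounds_of_passages(T))   # [('Y', 2), ('X', 2), ('Z', 1), ('Y', 1)]
```

Here T decomposes into four rounds; since q's own passage belongs to none of them, all four count toward the context-switch complexity of q's request.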
3. GME ALGORITHMS

In this section, we first present the overall system architecture and data structures used in the solutions to the GME problem. Three solutions will be presented thereafter. A priority-based GME algorithm will also be proposed.

3.1 Overall System Architecture and Data Structures

Layer Architecture. We assume that there exist two layers in the system: the application layer (the higher layer) and the GME layer (the lower). The interface between the two layers is implemented by using three types of messages: Request_Session(X), Grant_Session, and Exit_Session. When a process p running in the application layer needs to access a session, say Session X, p sends a Request_Session(X) message to the
GME layer. Eventually, the GME layer grants p access to Session X by sending a Grant_Session message. On completion of its work using Session X, p sends a message Exit_Session to the GME layer.

Tree Maintenance. All processes maintain two variables. A pointer variable Parp contains a value in Np ∪ {nil}. If Parp ∈ Np, then Parp contains the link label of a particular neighbor of p, called the parent of p. Another variable, Dp, is a subset of Np \ {Parp}, called the set of descendants of p. That is, Dp contains the neighbors of p such that p is their parent. The parent pointers of the processes form an oriented spanning tree rooted at a particular process, referred to as the root of the spanning tree, such that Parroot = nil. Each process p such that Dp = ∅ is called a leaf process. We denote the set of processes in the tree rooted at process p by τp (called the tree τp). h denotes the height of the tree τroot.

Data Structure of Processes. The algorithms presented in this paper use indexed FIFO queues. The items in the indexed queues are two-tuples 〈ind, obj〉, where obj is an object indexed by ind. The index (object) of an item I is denoted by I.ind (resp. I.obj). Fig. 1 shows the primitives used in this paper. An empty queue is denoted by ⊙. All the items of an indexed queue Q can be accessed in loops of the form: for all I ∈ Q do … done. We assume that the items are accessed in FIFO order.

Item: 〈ind, obj〉

Procedure Enqueue(Q, I)            /* Add item I to Queue Q */
Function  Head(Q): Item            /* Return (but do not remove) the oldest item of Q */
Procedure Remove(Q, I)             /* Remove the item I from Q */
Function  Get(Q, ind): Item        /* Return the item indexed by ind */
Function  Exists(Q, ind): Boolean  /* Return true if Q contains an item indexed by ind, false otherwise */
Fig. 1. Queue primitives.
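The primitives of Fig. 1 can be sketched with an ordered dictionary. This is only an illustration of the interface; it assumes, as the algorithms do, that a queue holds at most one item per index (e.g., one entry per session):

```python
from collections import OrderedDict

class IndexedQueue:
    """Sketch of the indexed FIFO queue of Fig. 1: items <ind, obj>,
    kept in insertion (FIFO) order and addressable by their index."""

    def __init__(self):
        self._items = OrderedDict()          # preserves FIFO insertion order

    def enqueue(self, ind, obj):             # Enqueue(Q, I)
        self._items[ind] = obj

    def head(self):                          # Head(Q): oldest item, not removed
        ind = next(iter(self._items))
        return (ind, self._items[ind])

    def remove(self, ind):                   # Remove(Q, I)
        del self._items[ind]

    def get(self, ind):                      # Get(Q, ind): Item
        return (ind, self._items[ind])

    def exists(self, ind):                   # Exists(Q, ind): Boolean
        return ind in self._items

    def __iter__(self):                      # "for all I in Q" in FIFO order
        return iter(self._items.items())

Q = IndexedQueue()
Q.enqueue('X', set())                        # first request for Session X
Q.enqueue('Y', {'q1'})                       # Session Y, requested via child q1
print(Q.head())                              # ('X', set())
print(Q.exists('Z'))                         # False
```

Head raises an exception on an empty queue (⊙); the algorithms only call it after checking the queue is non-empty.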
3.2 The First Algorithm − Algorithm GMεα

We present this first solution to the GME problem to help the reader understand the idea behind the next two solutions. The solution works in phases. Phases are initiated by root in response to requests made by processes to use sessions. We use the well-known Propagation of Information (PI) and Propagation of Information with Feedback (PIF) schemes to implement the phases [14, 15]. Assume that process p makes the first request for a session X. p records Session X as the requested session and sends a request message for X to its parent. The message is then forwarded to root along the path from p to root. Upon receipt of the request message, root initiates the first phase, called the opening phase (also referred to as the opening wave) of Session X. root initiates a PI wave in the tree. Each process receiving this wave records the information that Session X is currently open. As long as no process wants to access a session different from X, all the processes wanting to use Session X can concurrently enter and exit the critical section of Session X any number of times. Moreover, these additional
critical section accesses do not cause any additional message exchanges because every process knows that Session X is the currently open session. Now, assume that a process q wants to access a session Y ≠ X. q sends a request message for Y towards root. Upon receipt of this message, root starts the second phase, called the closing phase (or closing wave) of Session X. root initiates a PIF wave. Each process reached by the propagation of the closing phase records that the closing phase has been initiated. Eventually, this wave reaches the leaves of the tree. The leaves then initiate the feedback messages. Upon receipt of the feedback wave, processes not using any session can immediately forward the feedback message to their parent. However, if a process p is inside a critical section when p receives the feedback message, it will defer relaying the feedback message to its parent. As per the specification of the GME problem (see section 2), processes spend a finite amount of time inside their critical sections. Therefore, eventually, p exits the critical section and forwards the feedback message to its parent. When a process forwards the feedback message to its parent, it records the information that access to any critical section is now forbidden. Therefore, when root receives the feedback messages, root knows that no process is in (or can enter) the critical section (of Session X). Next, root initiates the opening of Session Y. The above process is repeated to open and close sessions.

Complexity Analysis. We first compute the degree of concurrency. Assume that a session X is open, that a process is executing its critical section (accessing X), and that another process is requesting access to a different session Y (Y ≠ X). Since root initiates the opening of a session X by initiating a PI, all the processes in the network are aware of the fact that Session X is the current open session.
Therefore, every process wanting access to Session X can locally decide if it can execute its critical section. During the closing phase of X (which is eventually initiated because at least one process requests Y), processes can, as discussed above, delay sending the feedback message to their parents if necessary (i.e., if they are currently in their critical sections). Therefore, a process can execute its critical section an unlimited number of times until it receives the feedback messages from all its descendants. This implies that the degree of concurrency is unbounded. Therefore, our algorithm provides the best possible resource utilization. However, it is easy to observe that processes nearer root can take better advantage of the open session than can processes further away from root because they receive the feedback message of the closing wave later than the others do. We will consider two cases to compute the message complexity. Assume that a process p requests access to the current open session X. In this case, p can access Session X without sending any message. Now, assume that p requests access to a session X while a different session, say Y, is currently open. In this case, p will be able to access Session X only after the following events occur: p's request reaches root (generating d messages, where d is the distance from p to root), Session Y is closed (causing 2 × (n − 1) messages), and finally, p receives the opening message for Session X (another n − 1 messages). Therefore, at most 3 × (n − 1) + d messages may be necessary for a process p to access its critical section. In the worst case, p is a leaf and d = h, the height of the tree. Hence, an entry in the critical section may require between 0 and 3 × (n − 1) + h messages.
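The worst-case count above can be checked mechanically: d messages for the request, 2 × (n − 1) for the closing PIF, and n − 1 for the opening PI, with d maximised when the requester is a deepest leaf (d = h). A small sketch (the parent-pointer encoding of the tree is ours):

```python
def gme_alpha_worst_case(parent):
    """Worst-case messages per critical-section entry in Algorithm GME-alpha
    when the requested session is not the open one: 3*(n-1) + h.
    `parent` maps each non-root process to its parent; root is omitted."""
    n = len(parent) + 1                      # processes = non-root nodes + root

    def depth(p):
        return 0 if p not in parent else 1 + depth(parent[p])

    h = max(depth(p) for p in parent) if parent else 0
    return 3 * (n - 1) + h

# A small tree: root r with children a, b; a has children c, d.
parent = {'a': 'r', 'b': 'r', 'c': 'a', 'd': 'a'}
print(gme_alpha_worst_case(parent))   # n = 5, h = 2, so 3*4 + 2 = 14
```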
3.3 Second Algorithm − Fewer Messages

In this section, we present an algorithm which reduces the number of messages from 3 × (n − 1) + h (in Algorithm GMεα) to 4 × h. The main idea used to reduce the number of messages is as follows: instead of sending both the opening and closing waves over the whole network (as is done in Algorithm GMεα), both waves are sent only towards the processes which made a request. We now explain Algorithm GMεβ (Algorithm 3.1) in detail.
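To see the scale of this saving, one can compare the two upper bounds on a complete binary tree of height h, where n = 2^(h+1) − 1. This back-of-the-envelope sketch is ours, not from the paper:

```python
def bounds(h):
    """Per-entry message upper bounds of GME-alpha (3*(n-1) + h) and
    GME-beta (4*h) on a complete binary tree of height h."""
    n = 2 ** (h + 1) - 1
    return n, 3 * (n - 1) + h, 4 * h

for h in (3, 5, 10):
    n, alpha, beta = bounds(h)
    print(f"h={h:2d}  n={n:5d}  GME-alpha <= {alpha:6d}  GME-beta <= {beta:3d}")
```

For h = 10 (n = 2047), the bounds are 6148 messages versus 40, which is the linear-versus-logarithmic gap claimed in the introduction.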
Assume that the first request for a session (X) is made by a process p. The request is received from the application layer of p (see Lines 1.01-1.04). Then, p calls Procedure ReqFlow. As this is the first time p has received this request, it adds the tuple 〈X, ∅〉 to its request queue called ReqQ (which is an indexed queue). Later, if p receives another request for the same session X (Message ASK(X)) from one of its descendants (say q), p
will add q to the tuple 〈X, ∅〉, changing the tuple to 〈X, {q}〉 (see Line 11.08). The indexed queue ReqQ is a very important data structure in our algorithm. It is used to reduce the number of requests to open and close sessions. Every process p saves the link labels of the descendants which have made requests for a session (say X) in a set (called NS); this set, NS, is then indexed by the corresponding session number X. We will see later that the opening and closing messages are sent "only" to the members of NS, rather than broadcast to the whole tree. We will now describe the process of opening Session X requested by Process p (as mentioned in the preceding paragraph). p sends Message ASK(X) to its parent (see Line 11.07). The message is then forwarded to root via the other processes on the path from p to root. (Note that every process on the path updates ReqQ (as described in the preceding paragraph) as it receives an ASK(X) message.) When root receives ASK(X), it initiates a PI wave to open Session X (Line 11.06, Procedures NewSession and OpenSession): root sets its Open variable to true (to record that a session is open) and then sends the "open session" message OS(X) to the descendants stored in X.NS. Other processes receiving OS(X) messages from their parents (Parp) take steps similar to those taken by root and forward the messages to their own sets of descendants (stored in X.NS). Note that root and the other processes do not broadcast OS(X) to the whole tree (or subtree), but send it only to the requesting descendants, thus significantly reducing the number of messages. Our next topic is the closing of sessions. Before we describe this in detail, we need to explain another important data structure, OpenSet. (Refer to Procedure OpenSession for this discussion.) Recall that for a process p, X.NS contains the link labels of the descendants which have requested Session X. NS is part of the request queue ReqQ.
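The ReqFlow behaviour just described can be sketched as follows. The class and field names are ours (the paper's Algorithm 3.1 is line-numbered pseudocode, not reproduced here); only the non-root forwarding path is shown:

```python
class Proc:
    """Minimal stand-in for a process: a request queue mapping each pending
    session to its NS set, a parent link, and an outbox for illustration."""
    def __init__(self, parent=None):
        self.req_q = {}          # session -> NS; dict keeps FIFO insertion order
        self.parent = parent
        self.is_root = parent is None
        self.sent = []           # messages "sent", recorded for inspection
    def send(self, to, msg):
        self.sent.append((to, msg))

def req_flow(p, session, requester=None):
    """Sketch of Procedure ReqFlow: record the request, remember which
    descendant (if any) it came from, and forward ASK upward only the
    first time this session is seen at p."""
    first_time = session not in p.req_q
    if first_time:
        p.req_q[session] = set()            # enqueue <session, empty NS>
    if requester is not None:
        p.req_q[session].add(requester)     # Line 11.08: add q to NS
    if first_time and not p.is_root:
        p.send(p.parent, ('ASK', session))  # Line 11.07: forward the request
    # At root, a first-time request would instead trigger the closing of the
    # current session and, later, NewSession/OpenSession (not sketched here).

p = Proc(parent='r')
req_flow(p, 'X')                  # application-layer request: ASK(X) forwarded
req_flow(p, 'X', requester='q')   # same session from child q: only NS grows
print(p.req_q)                    # {'X': {'q'}}
print(p.sent)                     # [('r', ('ASK', 'X'))]
```

Note that the second call sends nothing: a session already in ReqQ has already been asked for upward, which is exactly how the algorithm avoids duplicate ASK traffic.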
Therefore, after sending the OS(X) messages and the Grant_Session message to the application layer, p removes X.NS (see Lines 7.04-7.09). However, this causes another problem. Later, when p needs to close Session X, p does not know to which descendants it needs to send the close session message, CS (for Session X). As discussed above, p could broadcast the message to everybody, but we want to avoid that to reduce the number of messages. To solve this problem, every process maintains another set, OpenSet. Before p removes NS from ReqQ, it copies NS into OpenSet, which is later used to send the CS messages (see Lines 7.04, 10.02, and 10.03). We will now explain the session closing process. Assume that a set of processes (denoted by P) in the tree is currently using Session X. We call those processes in P that have no descendants using X "leaves." The tree covering these leaves and rooted at root is denoted by TX. Note that not every process in TX may be using X. However, the leaves of TX are using X. Assume that a process p wants to access a session Y ≠ X. p sends ASK(Y) towards root. Upon receipt of Message ASK(Y), root initiates a PIF wave to close Session X (Line 11.05 and Procedure CloseSession). root sends the "close session" message CS only to the members of OpenSet (similar to the case of sending OS messages). Other processes forward Message CS to their subset of descendants (stored in OpenSet). Effectively, the CS message is propagated in the tree TX. Eventually, the CS message reaches the leaves. Let us consider one of the leaf processes, q. Upon receipt of CS, q knows that it has to close Session X. However, it must wait for the application layer to exit the critical section (of Session X) (Lines 2.01 and 2.02). Then, q closes the current session by changing CurrentSq to ⊥, indicating that access to the critical section is forbidden. Next, q informs its parent of the closing of Session X by sending it Message DONE (see Procedure TestDone). When an internal process q′ in the tree TX receives DONE from one of its descendants in OpenSet, q′ updates its OpenSet. If OpenSet becomes empty, i.e., all the descendants which were using X have closed Session X, and if q′ has finished its critical section, then q′ forwards the DONE message towards root via its parent process. Finally, root receives all the expected DONE messages and initiates the opening of a new session, Y (see Line 8.04). Thus, the closing process is like a partial PIF because only a portion of the tree (or subtree) is involved.

Correctness Proof. We define the system as quiescent in a configuration γ when all the following conditions are true in γ:

1. No process is executing a critical section.
2. For every process p, CurrentSp = ⊥.
3. If some messages are in transit, then they are ASK() messages (moving towards the root).

The opening and closing waves are initiated only by the root. Since the links are FIFO, the following result is obvious:

Observation 3.1 If process p ≠ root receives a message CS from Parp, then CurrentSp = X, X ≠ ⊥, and the preceding message received by p from Parp was OS(X).

Lemma 3.2 If root initiates the closing wave of a session, then the closing wave eventually terminates, and the system becomes quiescent.

Proof: Based on the properties of the well-known PIF scheme [15], a closing wave terminates. The termination of the wave guarantees that neither CS nor DONE messages exist in the system. Moreover, by Observation 3.1, the opening wave for Session X also terminates when the closing wave of X terminates. Therefore, no OS(X) messages exist in the system, either. From Algorithm GMεβ, when the closing wave terminates, ∀p, CurrentSp = ⊥ (see Line 8.02 of Algorithm 3.1).

Lemma 3.3 If root initiates the opening wave of a session X in γ, then the system is quiescent in γ.

Proof: Consider the following two cases:

1. Session X is the first open session.
Then, root initiates the opening of X upon receipt of a message ASK(X); hence, the system is trivially quiescent.

2. Session X is not the first open session. Let Y be the session opened just before X. Since X is not the first open session, root initiates the opening wave of X by executing Line 8.04. This implies that the predicate FREE is true at root (Line 8.01). Therefore, root must have received a DONE message from all its neighbors (Line 6.01). A process p sends a DONE message only after p and all its descendants close their current session (see Procedures CloseSession and TestDone). Therefore, FREE being true at root implies that all the processes in τY have closed Session Y. This is possible only if root initiated the closing wave of Session Y. Then, by Lemma 3.2, the system eventually becomes quiescent.
The following result follows directly from Lemmas 3.2 and 3.3:

Theorem 3.4 (Mutual Exclusion) Algorithm GMεβ satisfies the mutual exclusion property.

We will now prove that Algorithm GMεβ satisfies the no lockout property.

Lemma 3.5 If a session X never belongs to ReqQroot, then X is never requested by any process.

Proof: Assume by contradiction that there exists a session X which never belongs to ReqQroot, and that X is requested by some processes. Note that if a process p executes Procedure ReqFlow(X), then X belongs to ReqQp when Procedure ReqFlow(X) is done. Since Session X never belongs to ReqQroot, root never executes ReqFlow(X) and, hence, never receives a message ASK(X). Moreover, since X never belongs to ReqQroot, Session X can never be at the head of ReqQroot when root executes Procedure NewSession. Therefore, Session X is never opened, and for every process p, CurrentSp will never be equal to X. Let p be the process that makes the first request for Session X. Upon receipt of Request_Session(X) from the application layer, p executes ReqFlow. Therefore, p ≠ root. Since p made the first request, X cannot be in ReqQp. Therefore, after receiving Request_Session(X) from the application layer, p adds X to its queue (Line 11.02) and sends ASK(X) to its parent (Line 11.07) (say q). Upon receipt of ASK(X), q executes Procedure ReqFlow. Therefore, q ≠ root. If X ∈ ReqQq when q executes ReqFlow, then q executed ReqFlow before receiving the message ASK(X) from p and already sent ASK(X) to its own parent. Otherwise (X ∉ ReqQq), like p, q adds X to its queue and sends ASK(X) to its parent. Therefore, in both cases, the parent of q receives a message ASK(X). Applying the same reasoning to every process between p and root, root eventually receives a message ASK(X), which contradicts the assumption that root never receives a message ASK(X).

Lemma 3.6 For every X ∈ ReqQroot, root can initiate the opening of at most min(n, m) − 1 sessions before opening Session X.
Proof: Assume that X is the last entry (session) in the queue ReqQroot. There are two cases:

1. m ≤ n. Then, there are at most m items in ReqQroot (X is the m-th item) because root enqueues an item I only if I ∉ ReqQroot (see Lines 11.01 and 11.02). Therefore, root can initiate at most (m − 1) openings before it opens Session X.

2. m > n. Then, in the worst case, ReqQroot contains an item for each process of the network. ReqQroot then contains at most n items. Therefore, root can initiate at most (n − 1) openings before it opens Session X.

We get the following result from Lemmas 3.5 and 3.6:

Theorem 3.7 (No Lockout) If a process p requests access to a session, then p eventually executes its critical section.

Theorem 3.8 (Concurrent Entering) If some processes are requesting access to a particular session while no other process is requesting access to a different session, then all the requesting processes can execute their critical sections concurrently.

Proof: Assume that a process p requests entry to Session X while no process is requesting entry to a session Y ≠ X. We need to consider two cases:

1. Session X is not open when p requests Session X. From Theorem 3.7, root eventually initiates the opening of Session X. Then, all the processes (including p) requesting Session X can enter the critical section concurrently as soon as they receive a message OS(X) (see Line 7.08).

2. Session X is open when p requests Session X. Then, either CurrentSp = ⊥ (p has not yet received a message OS(X)) or CurrentSp = X. In the first case, p can enter the critical section (concurrently with other processes) as soon as p receives the message OS(X). In the second case, p can enter X immediately (see Line 1.03).

We can claim the final result from Theorems 3.4, 3.7, and 3.8:

Theorem 3.9 Algorithm GMεβ satisfies the GME specification (as specified in section 2).

Complexity Analysis. From Lemma 3.6, at most min(n, m) − 1 sessions can be opened while a process p is waiting to access a session. Therefore, the context-switch complexity is O(min(n, m)) rounds (of passages). Note that this result is the same as that of Algorithm GMεα. A request for a session generates between 0 and 4 × h messages because the opening and closing messages are sent only in the subtrees where processes requested the session. As in Algorithm GMεα, the degree of concurrency of Algorithm GMεβ cannot be bounded: all processes which have not yet received DONE messages from all their descendants can execute their critical sections any number of times. However, there are some situations where Algorithms GMεα and GMεβ may behave differently. Consider the following example: root has several descendants, one of which is p. All processes in the network except p request use of Session X.
p requests access to Session Y (Y ≠ X). Obviously, this configuration satisfies the assumption made in order to compute the degree of concurrency. Based on the requests received from its descendants, root initiates the opening of Session X. We now consider Algorithms GMεα and GMεβ separately.

Consider the above scenario in Algorithm GMεα. The opening message from root is broadcast to the whole network. Therefore, all processes, including p and the processes in its subtree τp, receive the OS(X) message. Moreover, since the communication links are FIFO, every process receives the opening message for Session X first, then the closing message for Session X, and finally the opening message for Session Y. Therefore, every process requesting Session X, including the processes in τp, will access Session X (at least once) before p eventually accesses Session Y. Thus, all the processes requesting X will be able to use X concurrently, without waiting for p to use and then close Y.

We consider Algorithm GMεβ next. Since the communication links are FIFO, the ASK(Y) message sent by p precedes the ASK(X) message from p (p sends ASK(X) after receiving the first ASK(X) message from one of its descendants). The opening messages for Session X (initiated by root) are sent to all processes except those in the subtree τp, because p requested another session, Y. Therefore, the processes in τp will be able to access Session X only after Session X is closed and Session Y is opened and then closed. In other words, access to Session X by the processes in τp is delayed by the opening of Y. Thus, in Algorithm GMεβ, concurrency may be limited in some parts of the network, e.g., in τp in our example. The processes near root may have their requests granted earlier than other processes.

Based on the above discussion, there is a trade-off between the two solutions in terms of message complexity and degree of concurrency: Algorithm GMεα optimizes the degree of concurrency but has a high message cost, while Algorithm GMεβ optimizes the message cost but may limit the degree of concurrency in some situations.

3.4 Final Algorithm − No Fixed Root

In this section, we present our final and best solution (Algorithm GMεγ, shown as Algorithm 3.2), which does not use any fixed root: any process can now become root. We start with Algorithm 3.1 and then add or replace some variables, predicates, procedures, and code in the message section. The parts of Algorithm 3.1 which are not replaced by new versions in Algorithm 3.2 are therefore included in Algorithm GMεγ.

The main difference between Algorithms GMεβ and GMεγ is as follows: in Algorithm GMεβ, root has to manage the opening and closing of a session even when it does not request the session; in Algorithm GMεγ, a process p can be root only if p is one of the processes requesting the current session.

Assume that Process r1 is the current root and that a new session X needs to be opened. r1 executes Procedure ReqFlow: it closes the current session (with a partial PIF, as in Algorithm GMεβ) (see Line 11.05) and then executes Procedure NewSession (Line 11.06 and Lines 9.01 to 9.18 in Algorithm 3.2).
If r1 is also a process requesting Session X (see Line 9.01), then r1 initiates the opening wave for Session X and remains root (see Lines 9.02 to 9.05). Otherwise, r1 chooses one of its requesting descendants to be its future parent and the next root (see Line 9.06); let us call this process r2. Next, r1 removes r2 from its set of descendants, Dr1 (Line 9.07), and executes Lines 9.08 to 9.17 to pass the root privilege to r2. We discuss these steps in detail below:

1. r1 removes all references to the link to r2 from the sets NS in ReqQr1, because no process can receive a request from its parent (Line 9.10).

2. r1 builds a new queue, InvQ, which it sends to r2. InvQ is similar to ReqQr1 except that the object associated with each session is a boolean variable, Involved, instead of a set of descendants. This variable is used later by r2 to rebuild the queue of requested sessions. For each session S in the queue, Involved is true if some requests were made in τr1, and false otherwise (see Line 9.12).

3. r1 opens Session X in its subtree τr1 before sending the MVROOT() message to r2 (Lines 9.16 and 9.17). r1 uses Message MVROOT() to send InvQ to the new root r2.
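Step 2 above amounts to a fold over the request queue after the link to r2 has been removed. A sketch in Python (the helper name build_inv_queue and the list-of-pairs representation of the queue are ours, not the paper's):

```python
def build_inv_queue(req_queue, new_root_link):
    """Sketch of r1 building InvQ before passing the root role to r2.

    req_queue: list of (session, NS) pairs in FIFO order, where NS is
    the set of links that requested the session.  new_root_link is the
    link leading to r2; references to it are removed first (Lines
    9.07/9.10), and the Involved flag records whether any request for
    the session remains in r1's subtree.
    """
    inv_queue = []
    for session, ns in req_queue:
        remaining = ns - {new_root_link}
        inv_queue.append((session, len(remaining) > 0))
    return inv_queue


# r2 is reached via link 0; session "X" was requested only through r2,
# so r1's subtree is no longer involved in it.
req_q = [("X", {0}), ("Y", {0, 2}), ("Z", {1})]
assert build_inv_queue(req_q, 0) == [("X", False), ("Y", True), ("Z", True)]
```

Note that the session order of ReqQr1 is preserved in InvQ; only the per-session payload changes from a set of links to a boolean.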
Upon receipt of MVROOT(RootQ) from r1 (its parent), r2 sets Parr2 to nil, saves the link label leading to its former parent in OldP (initially empty), and updates its set of descendants Dr2 (Lines 12.01 to 12.03). Process r2 is now the new root of the network. r2 now needs to deal with the following problems:

1. The current queue of r2 does not contain the requests made by processes which were not in τr2 before r2 received the MVROOT message.

2. The order of the sessions in ReqQr2 may not be the same as it was in ReqQr1. This could cause starvation: some sessions requested in τr1 might never be opened.

3. Some requests may have been received by r2 and enqueued in ReqQr2 while the MVROOT message was in transit from r1 to r2. These requests were ignored by r1 because, when it received them, r1 was no longer root (see Line 3.01). However, these requests must not be lost.

The code in Lines 12.04 to 12.14 solves the above three problems. r2 builds a new queue, NewQ, in the following manner:

1. r2 initializes NewQ as an empty queue (Line 12.04).

2. For each item I = 〈S, Involved〉 in RootQ, r2 enqueues 〈S, r1〉 to NewQ if some processes of τr1 requested Session S (i.e., I.Involved is true). Otherwise, r2 enqueues 〈S, ∅〉 to NewQ (Lines 12.05 to 12.08). This ensures that NewQ contains all the sessions of ReqQr1 and preserves the ordering of ReqQr1 in NewQ.

3. For each item I in its own request queue ReqQr2, r2 adds the link labels to the corresponding set NS in NewQ if an item indexed by I.S already exists in NewQ. Otherwise, r2 enqueues I in NewQ (Lines 12.09 to 12.13). Now, NewQ includes a copy of ReqQr1 as it was before r1 sent the MVROOT message, as well as the sessions enqueued in ReqQr2 while the MVROOT message was in the link from r1 to r2.

4. r2 copies NewQ into ReqQ (Line 12.14).

Finally, r2 executes NewSession (Line 12.15) to open the first session after becoming root. Note that OldP is used only while a process is processing an MVROOT message.
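The queue rebuilding by r2 (Lines 12.04 to 12.14) can be sketched as a two-phase merge. The function name rebuild_queue and the list representations are ours, not the paper's:

```python
def rebuild_queue(root_q, own_q, old_parent_link):
    """Sketch of r2 merging RootQ (received in MVROOT) with its own
    ReqQ into NewQ.

    root_q: list of (session, involved) pairs in the FIFO order of the
    old root r1.  own_q: list of (session, NS) pairs enqueued at r2,
    possibly while MVROOT was in transit.  old_parent_link: the link
    back towards r1 (OldP).
    """
    new_q = []  # list of [session, NS]

    # Phase 1: keep every session of ReqQ_r1, in r1's order, so no
    # session requested in r1's subtree is lost or reordered.  Involved
    # sessions point back through OldP; the others start empty.
    for session, involved in root_q:
        ns = {old_parent_link} if involved else set()
        new_q.append([session, ns])

    # Phase 2: fold r2's own requests in, merging link labels into an
    # existing entry or appending a fresh one at the tail.
    for session, ns in own_q:
        for entry in new_q:
            if entry[0] == session:
                entry[1] |= ns
                break
        else:
            new_q.append([session, set(ns)])
    return new_q


# r1 is reached via link 0; "Y" was requested both in r1's subtree and
# at r2, "Z" only outside r1's subtree, "W" only at r2.
root_q = [("Y", True), ("Z", False)]
own_q = [("Y", {2}), ("W", {3})]
assert rebuild_queue(root_q, own_q, 0) == [["Y", {0, 2}], ["Z", set()], ["W", {3}]]
```

Phase 1 addresses problems 1 and 2 (no lost requests from outside τr2, no reordering of r1's queue), and phase 2 addresses problem 3 (requests that arrived while MVROOT was in transit).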
The opening wave in Procedure OpenSession is initiated by r2 over the links in Get(ReqQ,S).NS \ OldP (see Line 7.04). This avoids opening Session X again in τr1 (note that r1 had already opened Session X before sending the MVROOT message). Once the opening wave is initiated, the set of links on which Session X is opened (including OldP if it belongs to Get(ReqQ,S).NS) is stored in Openset (Line 7.06), and OldP is reset to a null pointer.

Correctness Proof. We define a process p as a root if and only if Parp = nil or an MVROOT message is in the link from Parp to p. If there exists a process p such that Parp = nil, then root is said to be fixed; otherwise, root is said to be moving. Clearly, if root never moves, Algorithm GMεγ behaves exactly like Algorithm GMεβ. Therefore, we need only show that, even if root moves, Algorithm GMεγ satisfies the GME specification.

Like the previous algorithms presented in this paper, Algorithm GMεγ uses a (rooted) spanning tree. Therefore, by assumption, the first configuration contains only one (fixed) root. Since only a fixed root can initiate the moving of root (to execute Procedure NewSession, Predicate ROOT must be true), the system always contains exactly one process p such that either p is the fixed root or an MVROOT message is in the link from p to a process q ∈ Dp. This leads to the following observation:

Observation 3.10 In any configuration, there exists exactly one root in the tree.

Lemma 3.11 If root initiates the opening wave of a session X in γ, then in γ, the system is either quiescent or, for every process p, CurrentSp ≠ ⊥ ⇒ CurrentSp = X.

Proof: The opening wave for a session X is initiated by root by executing Procedure NewSession. In Procedure NewSession, Procedure OpenSession() is always executed with Head(ReqQ).S as a parameter (see Lines 9.02 and 9.16). Consider the following two cases:

1.
Procedure NewSession is executed in γ for the first time (no session has been opened yet) or following the closing of a session. The proof follows from Lemma 3.3.

2. Procedure NewSession is executed in γ upon receipt of an MVROOT message. The queue ReqQroot is built from RootQ, which is the queue received in the MVROOT message from Process q (the previous root). RootQ is itself a copy of InvQq, which was built from ReqQq. Therefore, the head of ReqQroot is the same as that of ReqQq when q built InvQ. Hence, the opening waves initiated by the successive receipts of Message MVROOT were all for the same session, Session X.

From Observation 3.10, Lemma 3.11, and Theorem 3.4, we get the following result:

Theorem 3.12 (Mutual Exclusion) Algorithm GMεγ satisfies the mutual exclusion property.

Lemma 3.13 If a process p sends an MVROOT(InvQ) message, then an MVROOT(InvQ) message is eventually received by a process q such that ReqSq = Head(InvQ).S.

Proof: Assume that p sends the MVROOT message to a neighbor q1. p chooses q1 in Head(ReqQp).NS, which is the subset of Dp from which p received a request for the session Head(ReqQp).S, say Session X. If q1 requested X (ReqSq1 = X), then q = q1 and the lemma is proven. If q1 did not request Session X, q1 sends an MVROOT message to some process q2, and so on. Since, for all i ≥ 1, qi+1 is chosen from Head(ReqQqi).NS, the MVROOT message is always sent towards a process which requested Session X. Since the network is finite, the MVROOT message is eventually received by a process such that ReqS = X.

From Algorithm 3.2 and the queue properties, the order of sessions in the queue of root is preserved; i.e., no session can be opened before another which was ahead of it in the queue of root. Therefore, we claim the following from Lemmas 3.5, 3.6, and 3.13:

Theorem 3.14 (No Lockout) If a process p requests access to a session, then p eventually executes its critical section.

Theorem 3.8 also holds trivially for Algorithm GMεγ. Therefore, the final result follows from Theorems 3.8, 3.12, and 3.14:

Theorem 3.15 Algorithm GMεγ satisfies the GME specification (as described in section 2).

Complexity Analysis. It is obvious that the complexity results of Algorithms GMεβ and GMεγ are the same: the context-switch complexity is O(min(n, m)) rounds, the degree of concurrency remains unbounded, and each entry into a critical section requires between 0 and 4 × h messages.
Recall that in Algorithm GMεβ, concurrency may be limited in some parts of the network, and processes near root have an edge over other processes. Algorithm GMεγ improves this situation: although concurrency can still be limited in some parts of the network, since any requesting process can become root, the same set of processes will not be disadvantaged forever.

3.5 Priority-Based GME

So far, we have assumed that all sessions have equal priority: the sessions are opened in the order in which the requests are received at root. Allowing sessions of different priorities is a natural and useful extension of the specification of the GME problem. In the CD jukebox example (discussed above), some CDs may have higher priority than others. Likewise, some services on the Internet may require higher priority, and in the readers/writers problem [4], writers are sometimes given higher priority than readers.
How can we implement a priority-based GME solution? The first task is to preserve the liveness property. Consider the following example. Assume that a process requests a session X which has a priority of 1, while the currently open session is Y with priority 0 (a lower priority value is assumed to imply lower priority). After receiving the request for X, root initiates the closing phase of Y. In the meantime, another process requests Session Z, which has a priority of 2, and this request reaches root before the end of the closing phase. Then root opens Session Z instead of Session X. Therefore, if higher priority requests keep arriving and reach root earlier than the lower priority requests, the lower priority requests will starve.

Fortunately, our solutions can easily be adapted to solve this starvation problem. All we need to do is modify the (indexed) waiting queue implementations: we now assume that the operations Enqueue and Remove implement priority scheduling on the queues used in our solutions. Thus, we can solve a priority-based GME problem.
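One possible shape for such a priority-aware Enqueue is sketched below in Python. The representation and names are ours, not the paper's; the key point is that the insertion is stable (FIFO among equal priorities) and that a repeated request for an already-queued session only merges its link label, so it cannot be pushed back by newer arrivals:

```python
def enqueue(queue, session, priority, link):
    """Insert a request into an indexed, priority-ordered waiting queue.

    queue is a list of [session, priority, NS] entries, ordered by
    decreasing priority and, within one priority level, FIFO.  A request
    for an already-queued session only merges its link label into the
    existing entry, keeping that entry's position in the queue.
    """
    for entry in queue:
        if entry[0] == session:
            entry[2].add(link)   # session already queued: keep its slot
            return
    # Stable insert: place the new session after all entries whose
    # priority is greater than or equal to its own.
    pos = len(queue)
    for i, entry in enumerate(queue):
        if entry[1] < priority:
            pos = i
            break
    queue.insert(pos, [session, priority, {link}])


q = []
enqueue(q, "Y", 0, 1)   # low-priority session arrives first
enqueue(q, "X", 1, 2)   # higher priority jumps ahead of Y
enqueue(q, "Z", 1, 3)   # equal priority stays behind X (FIFO)
enqueue(q, "Y", 0, 4)   # repeat request merges; Y is not pushed back
assert [s for s, _, _ in q] == ["X", "Z", "Y"]
assert q[2][2] == {1, 4}
```

Since the number of sessions m is finite and each session occupies at most one entry, a queued low-priority session is preceded by a bounded number of entries, which is what preserves liveness under this scheme.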
4. CONCLUSIONS

We have presented three group mutual exclusion solutions for tree networks. All the proposed solutions have the desirable property of using messages of bounded size, and they achieve the best context-switch complexity, O(min(n, m)) rounds of passages. The first solution (Algorithm GMεα) is very efficient in terms of concurrency since the degree of concurrency cannot be bounded; each entry into the critical section generates between 0 and O(n) messages. The second solution (Algorithm GMεβ) achieves an average of O(log n) messages per critical section entry and also provides an unbounded degree of concurrency. Both GMεα and GMεβ use a fixed root. The third solution (Algorithm GMεγ) uses a mobile root: a process becomes root only if it is involved in the currently open session. We have also discussed how our solutions can be used to incorporate priority into the session scheduling scheme.

There seems to be a trade-off between message complexity and the degree of concurrency. To achieve a "true" unbounded degree of concurrency, each session must be opened to the whole network. However, this requires broadcasting messages to the whole network. This implies that Ω(n) messages are necessary to achieve an unbounded degree of concurrency that is not limited anywhere in the network.
REFERENCES

1. Y.-J. Joung, "Asynchronous group mutual exclusion," Distributed Computing, Vol. 13, 2000, pp. 189-206.
2. E. W. Dijkstra, "Solution of a problem in concurrent programming control," Communications of the ACM, Vol. 8, 1965, p. 569.
3. L. Lamport, "The mutual exclusion problem: Part II − statement and solutions," Journal of the ACM, Vol. 33, 1986, pp. 327-348.
4. P. J. Courtois, F. Heymans, and D. L. Parnas, "Concurrent control with readers and writers," Communications of the ACM, Vol. 14, 1971, pp. 667-668.
5. E. W. Dijkstra, "Hierarchical ordering of sequential processes," Acta Informatica, Vol. 1, 1971, pp. 115-138.
6. K. M. Chandy and J. Misra, "The drinking philosophers problem," ACM Transactions on Programming Languages and Systems, Vol. 6, 1984, pp. 632-646.
7. M. Fischer, N. Lynch, J. Burns, and A. Borodin, "Distributed FIFO allocation of identical resources using small shared space," ACM Transactions on Programming Languages and Systems, Vol. 11, 1989, pp. 91-104.
8. P. Keane and M. Moir, "A simple local-spin group mutual exclusion algorithm," in Proceedings of the 18th Annual ACM Symposium on Principles of Distributed Computing (PODC '99), 1999, pp. 23-32.
9. V. Hadzilacos, "A note on group mutual exclusion," in Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing (PODC '01), 2001, pp. 100-106.
10. Y.-J. Joung, "The congenial talking philosophers problem in computer networks," Distributed Computing, Vol. 15, 2002, pp. 155-175.
11. K.-P. Wu and Y.-J. Joung, "Asynchronous group mutual exclusion in ring networks," in Proceedings of the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing (IPPS/SPDP '99), 1999, pp. 539-543.
12. S. Cantarell, A. K. Datta, F. Petit, and V. Villain, "Group mutual exclusion in token rings," in Proceedings of the 8th International Colloquium on Structural Information and Communication Complexity, 2001, pp. 61-76.
13. K. Raymond, "A tree-based algorithm for distributed mutual exclusion," ACM Transactions on Computer Systems, Vol. 7, 1989, pp. 61-77.
14. E. J. H. Chang, "Echo algorithms: depth parallel operations on general graphs," IEEE Transactions on Software Engineering, Vol. SE-8, 1982, pp. 391-401.
15. A. Segall, "Distributed network protocols," IEEE Transactions on Information Theory, Vol. IT-29, 1983, pp. 23-35.
Joffroy Beauquier is a former PhD student of Maurice Nivat and is presently a full professor at Université Paris-Sud, located at Orsay, very near Paris. After having worked in formal language theory, he has for several years been interested in distributed computing, especially fault-tolerance and stabilization. He has studied self-stabilizing distributed algorithms using a bounded amount of memory, as well as time adaptivity, that is, the property of recovering from transient failures in a time proportional to the exact number of failures.
Sébastien Cantarell is a PhD student at the University of Paris XI, France under the co-direction of Joffroy Beauquier and Franck Petit. His research interests include resource allocation problems and self-stabilization.
Ajoy K. Datta is a professor of computer science at the University of Nevada Las Vegas. His primary area of research interest is distributed computing. He works on the fault-tolerance and self-stabilization properties of distributed systems.
Franck Petit is an associate professor of computer science at the University of Picardie Jules Verne, Amiens, France. His primary research areas are distributed computing and self-stabilization.