Heuristic Diff Acquiring in Lazy Release Consistency Model

Zhiyi Huang, Wan-Ju Lei, Chengzheng Sun, and Abdul Sattar
Knowledge Representation and Reasoning Unit
School of Computing & Information Technology
Griffith University, Nathan, Qld 4111, Australia
Email: {hzy,wlei,scz,[email protected]}
Abstract. This paper presents a Heuristic Diff Acquiring (HDA) protocol for Lazy Release Consistency (LRC) based distributed shared memory (DSM) systems. Based on run-time detection of associations between locks and data, HDA can selectively piggy-back useful page diffs in a lock grant message. By adopting the novel HDA protocol, an improved LRC model has been implemented, and experimental results have been collected and analyzed. First, we introduce the Lazy Diff Acquiring (LDA) and Eager Diff Acquiring (EDA) protocols in LRC-based DSM systems. Second, we discuss the impact of LDA and EDA on the performance of LRC-based DSM systems. Third, we propose the idea and implementation of the HDA protocol. Finally, we present and analyze the experimental results, from which we conclude that the HDA protocol can significantly improve the performance of the LRC model.

Key Words: Distributed Shared Memory, Lazy Release Consistency, Eager Release Consistency
1 Introduction

Distributed Shared Memory (DSM) systems have developed rapidly over the last decade, with various memory consistency models proposed and implemented, such as Sequential Consistency (SC) [9, 10], Eager Release Consistency (ERC) [5], Entry Consistency (EC) [4], and Lazy Release Consistency (LRC) [7]. These systems provide a virtual shared memory for users on networks of workstations, allowing shared memory parallel programs to execute on such networks. Through improved protocols [2, 8], some optimized models, e.g., the LRC model, can support parallel numerical computing applications efficiently [7, 11].

Even though Lazy Release Consistency (LRC) is currently the state-of-the-art consistency model, its performance is still not satisfactory for some Artificial Intelligence applications [6]. The reason is that, compared with an equivalent message passing program, the overhead of maintaining shared memory consistency for a shared memory parallel program is still very high because of the numerous messages. More improvements should and can be made in order to further decrease the number of messages. For example, the Lazy Diff Acquiring (LDA) protocol in LRC can sometimes lead to more messages than Eager Diff Acquiring (EDA), as will be explained in Section 3. If we can accurately predict which updates will definitely be acquired by some process and thus should be eagerly sent to that process, we can pack those updates into one message package, instead of several packages as in LDA. In this way, we can decrease the number of update transfers. Based on this idea, a novel Heuristic Diff Acquiring (HDA) protocol is proposed in this paper, which can decide which updates should be eagerly acquired and which should not. The experimental results show that HDA can significantly decrease the number of messages and increase the performance of LRC-based DSM systems.

The rest of this paper is organized as follows. Section 2 gives an introduction to the LRC model, the Lazy Diff Acquiring (LDA) protocol, and the Eager Diff Acquiring (EDA) protocol. Section 3 discusses the impact of LDA and EDA on the LRC model. Section 4 presents the heuristic protocol HDA. Section 5 presents and analyzes the experimental results. Finally, we conclude in Section 6.
2 Lazy Release Consistency model

The LRC model was proposed in [7]. Based on the LRC model, a DSM system, TreadMarks [1], has been implemented on standard Unix systems such as SunOS and Ultrix.
2.1 Lazy Release Consistency
LRC is an improvement on Eager Release Consistency (ERC) [5]. Both of them require mutual exclusion of accesses to a variable if there is at least one write operation among the accesses [5]. They permit a process to delay making its updates to shared data visible to other processes until certain synchronization accesses occur. Therefore, the time for making the shared memory consistent is effectively the time of mutual exclusion. The following are the conditions of the ERC model.

Definition 1. Conditions for Eager Release Consistency
- Before an ordinary read or write access is allowed to perform with respect to any other process, all previous acquire accesses must be performed, and
- before a release access is allowed to perform with respect to any other process, all previous ordinary read and write accesses must be performed, and
- synchronization accesses are sequentially consistent with respect to one another.

The difference between ERC and LRC is that ERC requires ordinary read or write accesses to be performed globally at the next release of a lock, whereas LRC requires only that ordinary read or write accesses be performed with respect to the processes which acquire the released lock. The following are the conditions of the LRC model.
Definition 2. Conditions for Lazy Release Consistency
- Before an ordinary read or write access is allowed to perform with respect to another process, all previous acquire accesses must be performed with respect to that other process, and
- before a release access is allowed to perform with respect to any other process, all previous ordinary read and write accesses must be performed with respect to that other process, and
- synchronization accesses are sequentially consistent with respect to one another.

Fig. 1. Eager Release Consistency

Fig. 2. Lazy Release Consistency

Fig. 1 and Fig. 2 illustrate the difference between ERC and LRC. In Fig. 1, when P1 calls Rel(L), it sends the invalidation of x immediately to the other processes, P2 and P3. So does P2 when it calls Rel(L). In Fig. 2, when P1 calls Rel(L), there is no message passing from P1 to the other processes. Only when P2 calls Acq(L) is the invalidation of x sent from P1 to P2. Likewise, the invalidations of x and y are sent only when P3 calls Acq(L). If a process does not acquire the lock L, it will not get the invalidations of x and y at all; therefore, the expense of passing those invalidations is saved. Compared with ERC, the delay of invalidation can greatly reduce the number of data transfer messages, and better performance is achieved in LRC.
2.2 Lazy Diff Acquiring (LDA) protocol
The LRC implementation in TreadMarks [1] adopts an LDA protocol for page fault processing, in which a page diff, instead of the whole page, is used to renew a dirty copy. When a write access to a valid page is first performed, a twin of the page is created and stored in system space. A comparison of the twin and a later version of the page is used to create a diff, which is a run-length encoding of the differences between the two versions. The diff can then be used to update other processes' copies of the page.

Example 1. Shared memory access pattern (1)

P1: ... Acquire(1); write(x); write(y); ... Release(1); ...
P2: ... Acquire(1); read(x); read(y); ... Release(1); ...
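The twin/diff mechanism described above can be sketched in plain C. This is a minimal illustration, not TreadMarks code: the page size, the `run`/`diff` structures and the function names are our own assumptions, but the idea is the same — a diff is a run-length encoding of the bytes that differ between the saved twin and the modified page, and applying it brings a stale copy up to date.

```c
/* A minimal sketch (not TreadMarks internals) of twin-based diff creation. */
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 64          /* tiny page for illustration */
#define MAX_RUNS  16

struct run  { int offset; int length; unsigned char data[PAGE_SIZE]; };
struct diff { int nruns; struct run runs[MAX_RUNS]; };

/* Encode the differences between the twin and the modified page as runs. */
static void make_diff(const unsigned char *twin, const unsigned char *page,
                      struct diff *d) {
    d->nruns = 0;
    for (int i = 0; i < PAGE_SIZE; ) {
        if (twin[i] == page[i]) { i++; continue; }
        struct run *r = &d->runs[d->nruns++];
        r->offset = i;
        r->length = 0;
        while (i < PAGE_SIZE && twin[i] != page[i])
            r->data[r->length++] = page[i++];
    }
}

/* Perform the diff on another process's stale copy of the page. */
static void apply_diff(unsigned char *copy, const struct diff *d) {
    for (int k = 0; k < d->nruns; k++)
        memcpy(copy + d->runs[k].offset, d->runs[k].data, d->runs[k].length);
}
```

The run-length encoding is what makes diffs cheap: a write that touches only a few bytes of a page produces a diff far smaller than the page itself.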
Fig. 3. The message passing in Example 1 under LDA

To understand how the LDA protocol passes messages, we consider the program in Example 1. Fig. 3 illustrates the message passing between the two processes P1 and P2. In the example both P1 and P2 will access pages x and y. Suppose P1 gets lock 1 first. When P1 releases lock 1, there is no message passing between the two processes. Then, when P2 acquires lock 1, the invalidations of x and y are sent from P1 to P2. In LDA, only when P2 accesses x or y and a page fault occurs is the diff of x or y acquired from P1 by P2. Since the diff of a page is acquired only when the page is accessed, we call this protocol Lazy Diff Acquiring (LDA). From the above description, the LDA protocol is divided into four stages:

1. lock releasing of P1: prepare the write notices (invalidation messages).
2. lock acquiring of P2: acquire the write notices and invalidate the updated pages.
3. diff acquiring of P2: when an invalidated page is accessed for the first time, the diff is acquired.
4. diff performing of P2: the diff is received and performed on the page.

Example 2. Shared memory access pattern (2)
P1: ... Acquire(0); write(y); Release(0); ... Acquire(1); write(x); Release(1); ...
P2: ... Acquire(1); read(x); Release(1); ...

Fig. 4. The message passing in Example 2 under LDA

Through the separation of page invalidation and diff acquiring, unnecessary diff transfer can be avoided in LRC. For example, in Example 2, even though P1 has updated pages x and y, and their invalidation messages have been sent to P2, P2 only accesses page x, and thus only the diff of x is acquired by P2. As a result, only the diff of x is transferred from P1 to P2, even though both x and y are invalid in P2. The message passing between P1 and P2 is illustrated in Fig. 4.
2.3 Eager Diff Acquiring (EDA) protocol
In LRC, we can also adopt an EDA protocol. To understand how the EDA protocol passes messages, we again consider Example 1 from the previous section. Fig. 5 illustrates the message passing between the two processes. Suppose P1 gets lock 1 first. When P1 releases lock 1, there is no message passing between the two processes. Then, when P2 acquires lock 1, the invalidations of x and y are sent from P1 to P2, and at the same time the diffs of x and y are appended to the invalidation message and sent eagerly from P1 to P2. Therefore, when P2 accesses x or y and a page fault occurs, P2 need not acquire the diffs again from P1. Since the diffs are acquired eagerly when the invalidation of a page is sent, we call this protocol Eager Diff Acquiring (EDA).
Fig. 5. The message passing in Example 1 under EDA

From the above description, the EDA protocol is divided into three stages:

1. lock releasing of P1: prepare the write notices.
2. lock acquiring of P2: acquire the write notices and diffs and invalidate the updated pages.
3. diff performing of P2: the diffs are performed on the invalidated pages.

Through the combination of invalidation and diff messages, the number of diff acquirings can be decreased in EDA in some situations. For example, in Fig. 5, by integrating the diffs of x and y into the invalidation message, the number of messages between P1 and P2 is only 2, instead of 6 in Fig. 3. However, EDA may cause unnecessary diff transfer between processes in other situations. In Example 2, if we adopt EDA, the diffs of x and y will both be sent eagerly from P1 to P2, as illustrated in Fig. 6. But obviously the transfer of y's diff is not necessary, since P2 will not access page y. Therefore, even though the number of messages may be decreased, unnecessary data transfer is increased in EDA.

Fig. 6. The message passing in Example 2 under EDA

Then, which one is better, LDA or EDA? And in which situations?
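The message totals in Figs. 3-6 follow from the protocol descriptions: the lock handoff always costs one request/reply pair, and under LDA each invalid page actually accessed costs a further diff request/reply pair, while EDA folds every diff into the grant pair. A small counting sketch under exactly those assumptions (an illustrative model, not measured data):

```c
#include <assert.h>

/* Message counts implied by the protocol descriptions: the lock handoff
 * costs one request/reply pair; under LDA each accessed invalid page costs
 * a further diff request/reply pair, while EDA piggy-backs every diff on
 * the lock grant. */
static int lda_messages(int pages_accessed) {
    return 2 + 2 * pages_accessed;   /* lock pair + one pair per diff miss */
}

static int eda_messages(void) {
    return 2;                        /* diffs travel with the lock grant */
}
```

For Example 1 (P2 reads both x and y) this model gives 6 messages under LDA and 2 under EDA, matching Figs. 3 and 5; for Example 2 (P2 reads only x) LDA needs only 4 messages.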
3 LDA vs. EDA

In this section we analyze theoretically in which situations LDA (or EDA) is better than EDA (or LDA).
Let us consider two processes P1 and P2. Suppose P1 gets a lock first; P2 waits and then gets the same lock after P1 releases it. According to the LDA protocol, the behavior of the two processes is as follows:

1. When P2 acquires the lock, it sends a message to P1 to ask for the lock as well as the write notices for the updated pages.
2. After P1 releases the lock and finds the request from P2, it sends the lock and write notices to P2.
3. When P2 receives the lock and the write notices, it invalidates its copies of the updated pages according to the write notices.
4. When P2 accesses an invalid page, it sends a message to ask P1 for the diff of the page.
5. When P1 receives the message, it produces the diff of the page and sends the diff back to P2.
6. When P2 receives the diff, it performs the diff on the page and continues execution.

From the above description, we should note that the invalid page copies in P2 are not made consistent with their up-to-date counterparts in P1 unless they are accessed by P2. Suppose the number of invalid pages is N, the probability for P2 to access an invalid page is p, the average time for making the diff of a page is T_md, the average time for performing a diff on a page is T_pd, the message header has H bits, the diff of a page has M_diff bits, the bandwidth of the network is B, the system time to send a message is T_ss, and the system time to receive a message is T_sr. To simplify the formulas, let

    T_d = T_md + T_pd,    T_s = T_ss + T_sr.
Then the time for making the memory consistent between the two processes is:

    T_lazy = pN T_d + 2 T_s + 2pN T_s + (2H + pN (M_diff + 2H)) / B

Since the diff of a page is sent from P1 to P2 only if it is accessed by P2, diff transfer is reduced in LDA when the access probability p is small. However, if the probability is very large, the additional work for acquiring a diff, e.g., T_s, T_d, H/B, may overshadow the benefits of LDA.

Now let us consider the time cost for EDA. According to the EDA protocol, the behavior of the two processes is as follows:

1. When P2 acquires the lock, it sends a message to P1 to ask for the lock as well as the write notices and diffs for the updated pages.
2. After P1 releases the lock and finds the request from P2, it produces the diffs of the updated pages, then sends the lock, write notices and diffs to P2.
3. When P2 receives the lock, the write notices and the diffs, it performs the diffs on its invalid copies according to the write notices.

Therefore, when P2 acquires a lock from P1 it gets the invalidations as well as the diffs immediately from P1. Fig. 5 illustrates the message passing in EDA. The time for making the memory consistent between the two processes in this eager protocol is:

    T_eager = N T_d + 2 T_s + (H + N (M_diff + H)) / B
In the following we discuss which protocol is better in terms of time when certain conditions apply. We have the following deduction:

    T_eager - T_lazy = (1 - p)N T_d - 2pN T_s + ((1 - p)N M_diff - 2pN H) / B

If p is very small, there is not much difference between the two protocols, so we mainly discuss the situation where p is large. If p is very large, say equal to 1, the above formula becomes:

    T_eager - T_lazy = -2N (T_s + H/B) < 0

so the eager protocol is better. For a concrete computer network, T_s, T_d, H and B are constants, so the variables which can influence the sign of T_eager - T_lazy are p and M_diff. If p is small, T_eager - T_lazy > 0 is easily satisfied, so the lazy protocol is better; otherwise, the eager one may be better. When p is small, M_diff may still play an important role: if M_diff is large, T_eager - T_lazy > 0 is easily satisfied and the lazy one is better; otherwise, if M_diff is small, the eager one may be better. In a nutshell, if the access probability p is large, EDA is usually better; if p is small but M_diff is also small, EDA may still be the better one; otherwise, LDA is better. Following these principles, the HDA protocol is proposed to improve the performance of the LRC model.
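The cost model above can be evaluated numerically. The sketch below encodes the two formulas as written; the parameter values in the accompanying check are hypothetical (chosen for illustration, not measured in the paper), but they exhibit the predicted crossover between the two protocols as p varies.

```c
#include <assert.h>

/* T_lazy and T_eager exactly as in the formulas above.
 * Units: times in seconds, H and Mdiff in bits, B in bits/second. */
static double t_lazy(double p, double N, double Td, double Ts,
                     double H, double Mdiff, double B) {
    return p * N * Td + 2 * Ts + 2 * p * N * Ts
         + (2 * H + p * N * (Mdiff + 2 * H)) / B;
}

static double t_eager(double N, double Td, double Ts,
                      double H, double Mdiff, double B) {
    return N * Td + 2 * Ts + (H + N * (Mdiff + H)) / B;
}
```

With p = 1 every round trip saved by EDA counts, so EDA wins; with small p most eagerly sent diffs are wasted, so LDA wins, as the analysis concludes.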
4 Heuristic Diff Acquiring

From the above analysis we know that there are situations which favor either LDA or EDA. The most important factor is the access probability. If we knew this probability, we could use either LDA or EDA to achieve the best performance. The idea of HDA in this paper is that we can get hints about this probability from the user's programs. Normally a programmer uses a lock to guard shared data, and the same lock guards the same block of data. For example, a programmer uses lock L_i to guard the data in page_i. When processes P1 and P2 access page_i, both of them will first acquire lock L_i and then access page_i. Therefore, we conclude that when two or more processes use the same lock, it gives us a hint that they will very likely access the pages previously guarded by that lock. The probability of accessing these pages is normally very high. According to the above analysis, if we use EDA for the pages previously guarded by the lock, we can save more time than using LDA. In Example 1, lock 1 is used to guard pages x and y. So when a process acquires lock 1, it will very likely access both pages x and y. In HDA, the diffs of x and y will be sent eagerly after lock 1 is acquired. The diffs of other pages will still be sent lazily as in LDA.

Fig. 7. The message passing in Example 2 under HDA

Based on the above assumption, we propose the HDA protocol. In HDA, we maintain a page identifier list for each lock. The pages in the list are those previously guarded by the lock. Initially the list is empty. After a process acquires a lock, we begin to record the updated page identifiers in the list until the process releases the lock. That is, during the period between the acquiring and the release of the lock, if the process updates a page, the page's identifier is recorded in the page identifier list of the lock. When the lock is later acquired by other processes, the diffs of the pages in the page identifier list of the lock are created and sent eagerly to those processes. For example, in Example 1, P1 will record the identifiers of pages x and y in the page identifier list of lock 1, since it updates x and y between the acquiring and the release of lock 1. Then, when P2 acquires lock 1, the diffs of x and y are sent from P1 to P2 eagerly. The message passing between them is the same as in Fig. 5. However, in Example 2, P1 will record the identifier of x in the page identifier list of lock 1 and the identifier of y in the page identifier list of lock 0. When P2 acquires lock 1, only the diff of x is sent eagerly from P1 to P2 (the invalidation of y is also sent to ensure the correctness of HDA). The message passing between P1 and P2 is illustrated in Fig. 7. From the figure we find that HDA decreases the number of messages compared with LDA, and avoids useless diff transfer compared with EDA.

From the above examples, we can see that HDA takes advantage of both LDA and EDA. In addition, HDA requires neither programmer annotation nor compiler support. The hints are acquired automatically from the program by the system, without much extra overhead (only keeping and recording a page identifier list for each lock). If the lock hints about the page accesses are accurate, HDA can greatly decrease the number of messages and avoid unnecessary diff transfer in LRC. The experimental results show that our assumption about lock hints is correct in most situations.
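HDA's run-time bookkeeping can be sketched as follows. This is an illustrative data structure only, with assumed names and fixed-size arrays (TreadMarks' internals differ): one page identifier list per lock is filled with the pages written inside the critical section, and on the next grant of that lock it names the diffs to piggy-back. The sketch resets the list on each acquire, which is one possible reading of "previously guarded".

```c
#include <assert.h>

#define MAX_LOCKS 4
#define MAX_PAGES 8

static int guard_count[MAX_LOCKS];            /* list length per lock */
static int guard_list[MAX_LOCKS][MAX_PAGES];  /* page ids per lock    */
static int held_lock = -1;                    /* lock currently held  */

static void hda_acquire(int lock) {
    held_lock = lock;
    guard_count[lock] = 0;        /* start recording for this interval */
}

static void hda_write(int page) { /* called on a write inside the CS */
    if (held_lock >= 0)
        guard_list[held_lock][guard_count[held_lock]++] = page;
}

static void hda_release(void) { held_lock = -1; }

/* On a later grant of `lock`, the diffs of these pages are sent eagerly. */
static int eager_pages(int lock, const int **pages) {
    *pages = guard_list[lock];
    return guard_count[lock];
}
```

Replaying Example 2 through this sketch (page ids: x = 0, y = 1) associates y with lock 0 and x with lock 1, so a grant of lock 1 eagerly carries only x's diff, as in Fig. 7.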
5 Experimental results

This section presents an evaluation of the HDA, LDA and EDA protocols. The protocols are implemented in the TreadMarks distributed shared memory system [1]. The experimental environment consists of 8 SGI workstations running IRIX Release 5.3, connected by Ethernet. We used 4 applications in this experiment: TSP, QS, Water and BT, of which TSP, QS and Water are provided by the TreadMarks research group. All the programs are written in C and linked with the TreadMarks library. TSP is the Traveling Salesman Problem, which finds the minimum cost path that starts at a designated city, passes through every other city exactly once, and returns to the original city. QS is a recursive sorting algorithm that operates by repeatedly partitioning an unsorted input list into a pair of unsorted sublists, such that all of the elements in one of the sublists are strictly greater than the elements of the other, and then recursively invoking itself on the two unsorted sublists. BT is an algorithm that creates a binary tree. Water is a molecular dynamics simulation; in each time-step, the intra- and inter-molecular forces incident on a molecule are computed. These applications are carefully selected and representative of either numerical computing, e.g., Water and QS, or AI computing, e.g., TSP and BT.

In the following table, Time is the total running time of an application program; Total Data is the sum of all message data; Diff Data is the sum of all diff data; Diff Miss is the number of diffs acquired when a page fault occurs; Mesgs is the number of messages.

From the results in Table 1, we conclude that:

- HDA outperforms LDA for every application. The maximum decrease of running time is 45.1% (TSP), and the minimum decrease is 3.3% (Water).
- The number of messages in HDA is significantly decreased compared with LDA. The maximum decrease is 47.5% (TSP), and the minimum decrease is 3.6% (Water).
- The Diff Miss count in HDA is significantly reduced compared with LDA.
- The total data transferred in HDA is normally smaller than that in LDA, because the extra messages in LDA cause extra data transfer.
- There is no significant change of total diff data between LDA and HDA. This suggests that HDA does not send useless diff data eagerly; HDA can accurately send the useful diffs by using the detected lock-data associations.
- The performance of EDA is normally the worst, except for TSP (the reason for this exception is that in TSP almost every process accesses every page of shared memory, and therefore TSP favours the eager protocol). EDA normally causes a large amount of data to be transferred among processes, even though the number of messages and diff misses is reduced.

Table 1. Performance Statistics for the applications

application  protocol  Time (secs)  Total Data (bytes)  Diff Data (bytes)  Diff Miss  Mesgs
TSP          LDA       15.86         1267683              448958             1029      2846
TSP          EDA        6.39         1250486              456163                7       734
TSP          HDA        8.70         1292737              463896              384      1494
QS           LDA       20.09        10153006             6100023             3152     10432
QS           EDA       42.08        30329884            26549451              165      6707
QS           HDA       13.36         9332272             5577086             1053      5936
BT           LDA       82.92        39511375             8921228            27587     96979
BT           EDA       83.34        44205921            13536620              451     43505
BT           HDA       69.71        39148835             8761972             6681     53925
Water        LDA       32.59        11717602             9980061             4508     24495
Water        EDA       40.49        17806563            16013484             2402     20461
Water        HDA       31.53        11879024             9980913             4033     23606
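The percentage improvements quoted in the text follow directly from Table 1; a quick arithmetic check of the quoted figures:

```c
#include <assert.h>

/* Percentage decrease of an HDA measurement relative to LDA. */
static double decrease(double lda, double hda) {
    return 100.0 * (lda - hda) / lda;
}
```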
6 Conclusions

This paper discussed the advantages and disadvantages of the LDA and EDA protocols in the LRC model. Based on the discussion, we proposed an HDA protocol which can take advantage of both LDA and EDA. HDA is based on the assumption that when a process acquires a lock it will access the pages previously guarded by that lock. The associations between locks and data are built up at run-time, and the diffs of the pages previously guarded by the lock are acquired eagerly in HDA. In this way the number of messages is decreased significantly. The experimental results show that the HDA protocol can significantly improve the performance of the LRC model. In the future, some compile-time analysis will be adopted to further optimize the detection of associations between locks and data. Based on the HDA protocol, further research will be done on consistency models to facilitate AI applications [6] as well as numerical applications in DSM.
Acknowledgments

The authors would like to thank the members of the Knowledge Representation and Reasoning Unit, especially Krishna Rao, for their constructive suggestions on this work. We are grateful to Prof. Willy Zwaenepoel and the TreadMarks research group for their valuable support. We also thank the anonymous referees for their comments. This research is supported by an ARC (Australian Research Council) large grant (A49601731), an ARC small grant, and an NCGSS grant from Griffith University.
References

1. C. Amza, et al.: "TreadMarks: Shared memory computing on networks of workstations," IEEE Computer, 29(2):18-28, February 1996.
2. C. Amza, A.L. Cox, S. Dwarkadas, and W. Zwaenepoel: "Software DSM Protocols that Adapt between Single Writer and Multiple Writer," In Proc. of the Third High Performance Computer Architecture Conference, pp. 261-271, Feb. 1997.
3. J.K. Bennett, et al.: "Munin: Distributed shared memory based on type-specific memory coherence," In Proc. of the Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 168-176, March 1990.
4. B.N. Bershad, et al.: "The Midway Distributed Shared Memory System," In Proc. of the IEEE COMPCON Conference, pp. 528-537, 1993.
5. K. Gharachorloo, et al.: "Memory consistency and event ordering in scalable shared memory multiprocessors," In Proc. of the 17th Annual International Symposium on Computer Architecture, pp. 15-26, May 1990.
6. Zhiyi Huang, Chengzheng Sun, Abdul Sattar, and Wanzu Lei: "Parallel Logic Programming on Distributed Shared Memory System," In Proc. of the IEEE International Conference on Intelligent Processing Systems, Oct. 1997.
7. P. Keleher: "Lazy Release Consistency for Distributed Shared Memory," Ph.D. Thesis, Rice University, 1995.
8. P. Keleher, A.L. Cox, S. Dwarkadas, and W. Zwaenepoel: "An Evaluation of Software-Based Release Consistent Protocols," Journal of Parallel and Distributed Computing, Special Issue on Distributed Shared Memory, Vol. 29, pp. 126-141, Oct. 1995.
9. L. Lamport: "How to make a multiprocessor computer that correctly executes multiprocess programs," IEEE Transactions on Computers, 28(9):690-691, September 1979.
10. K. Li and P. Hudak: "Memory Coherence in Shared Virtual Memory Systems," ACM Transactions on Computer Systems, Vol. 7, pp. 321-359, Nov. 1989.
11. H. Lu, S. Dwarkadas, A.L. Cox, and W. Zwaenepoel: "Message Passing Versus Distributed Shared Memory on Networks of Workstations," In Proc. of Supercomputing '95, Dec. 1995.