Spinning-on-Coherency: A New VSM Optimisation for Write-invalidate
A.P. Nisbet and R.W. Ford
Centre for Novel Computing, Dept. of Computer Science, University of Manchester, UK.
Abstract. This paper introduces spinning-on-coherency (SOC), a technique for virtual shared memory (VSM) which enables latency-hiding of remote reads and the removal of related synchronisation points. Coherence-bits are hardware-tags associated with addresses which record local access permissions (such as read, write, invalid). In SOC a user-thread spins on the particular coherence-bits associated with an address until the new data value is asynchronously propagated and the address becomes valid. Data-propagation occurs when another node issues an update after having written the new value. Performance improvements are demonstrated for two codes, representing the core communication found in Shallow (a well known numerical weather prediction benchmark) and CG (from the NAS Parallel Benchmarks). These are run on a 30-node prototype distributed memory architecture (EDS), with invalidation-based sequentially consistent VSM. SOC is also applicable to other consistency models and directory schemes, whether in hardware or software, and complements other VSM optimisations. Currently such optimisation is performed by the programmer, but there is much scope for automating this process within a compiler.
1 Introduction
Virtual shared memory (VSM) systems provide a shared memory programming model on distributed memory architectures. This model is attractive because it is simple to program, thus speeding the implementation and porting of parallel programs, and enabling the parallelisation of complex adaptive programs which may be difficult to implement in message-passing. However, for problems with a static communication pattern, VSM can fare significantly worse than message-passing. One of the main reasons for this performance difference is that a program's communication structure is explicitly coded in message-passing, whereas VSM implementations seldom provide facilities which allow this application-specific knowledge to be exploited.
Scalable VSM systems use a directory structure (which typically records the location and the read/write access permissions of copies of data) and invalidation/update based protocols to maintain consistency. VSM implementations such as the KSR-1 [7] and DASH [11] use hardware-tags associated with a coherence unit to form a distributed directory structure, whereas software implementations normally maintain additional data structures that contain the directory information. We refer to the hardware-tags used to store local access permissions as coherence-bits.
This work was funded by the U.K. Meteorological Office and the ESPRIT SODA project.
In invalidation-based sequential consistency an attempt to read a value which has been newly written by another node can only be race-condition free if a synchronisation point (such as a barrier) separates the write and the read. Performance is degraded by this coarse-grain synchronisation overhead and by the delay incurred in fetching a copy of the value.
We introduce spinning-on-coherency (SOC), an optimisation for remote reads in invalidation-based VSM. In SOC a node wanting to read an invalid value spins on the associated coherence-bits until they become valid. The writing node initiates an update that asynchronously propagates the new data and sets the coherence-bits to valid. This fine-grain synchronisation on coherence-bits allows the removal of the coarse-grain synchronisation point. SOC is very efficient as any potential latency hiding can be fully utilised. Data can be sent as soon as it is written, the writing node does not need to stall, and the receiving node does not have to stall until it actually needs to read that data. In our implementation the nodes requiring updates are identified from the VSM directory structure.
The main contributions of this paper are: i) applying coherence-bits as user synchronisation variables, in conjunction with, ii) the removal of coarse-grain synchronisation points, and, iii) the latency-hiding update-mechanism. SOC is applicable to any invalidation-based consistency model and complements other VSM optimisations such as distributed invalidation [6, 10], migratory data optimisations [4, 21] and pre-fetching [23]. We show how, in particular, distributed invalidation and SOC combine to improve performance.
In Section 2 we review work related to our research. Section 3 describes our implementation of SOC and the applications used in our experiments. Section 4 shows our experimental method and evaluates the performance improvements. Finally Section 5 gives our conclusions.
2 Related Work
Traditional sequentially consistent VSM systems have either implemented invalidation or update based protocols. The invalidation based protocol invalidates any remote copies before writing a new value. This protocol therefore has no latency hiding facilities; a remote read will always miss on first access. The update based protocol always propagates the newly written value to all copies before continuing. We therefore have potential latency hiding but will incur excess network traffic when updates are not required.
Both of these protocols can be overly restrictive. The invalidation protocol forces the writing node to stall until all outstanding copies have been invalidated. In this case weaker coherence schemes [14, 2, 9] and distributed invalidation methods [10, 6] have been proposed to allow the writing node to continue without stalling. The update protocol also forces the writing node to stall until all remote updates have completed. In both cases, race conditions can cause incorrect results and to avoid this, programs will already have some high level synchronisation scheme (note that some asynchronous or chaotic algorithms do not synchronise). Weaker coherence models force the user to adhere to their synchronisation scheme and utilise any potential latency by only ensuring updates have completed by, or take place at (depending on whether the model is eager or lazy), the appropriate synchronisation point.
Each protocol has its advantages and choosing the most efficient depends on the reference patterns of the application being run [15]. Some hybrid techniques have been proposed to dynamically switch between update and invalidation based protocols [15]. As in this work, [18] propose a facility to query the state of a coherence unit but they do not give its use. In another paper [16] they show how the directory can be used to decide where to send updates. Their example explains how, for a static run-time dependent communication structure, the first pass can be used to set up the directory information which can be used to send updates on subsequent passes. This is equivalent to the inspector/executor work [19] used in message passing. [13, 5] remove the coarse-grain barrier between the write and read by synchronising using message passing at the lowest level. [8] suggests a scheme similar to ours where they replace a coarse-grain barrier with fine-grain synchronisation at the word level. They suggest a software implementation where a single word is marked invalid by a special value (e.g. NaN) and a hardware version which needs an extra hardware bit per word. However, they do not propose to use a VSM system's coherence-bits and only use their method for a purely update based protocol, as their invalidation protocol does not allow for an invalid state. Most work in this area has been based on simulation; however, [5] show that an application-specific protocol can give significant performance benefit on a real machine.
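The trade-off between the two protocols reviewed in this section can be illustrated with a small counting model. The following is only a sketch under simplifying assumptions that are not taken from the paper: a single coherence unit, one message per remote copy on every write, and no cost for the writer obtaining write access. The function name simulate and both traces are hypothetical.

```python
def simulate(trace, protocol):
    """Return (read_misses, messages) for one coherence unit.

    trace    -- list of (node, op) pairs, where op is "r" or "w"
    protocol -- "invalidate" or "update"
    Assumption: every write sends one message to each other node that
    currently holds a valid copy (an invalidation or an update).
    """
    if protocol not in ("invalidate", "update"):
        raise ValueError(protocol)
    holders = set()   # nodes that currently hold a valid copy
    misses = 0        # reads that found no valid local copy
    messages = 0      # invalidations or updates sent by writers
    for node, op in trace:
        if op == "r":
            if node not in holders:
                misses += 1          # the copy must be fetched remotely
                holders.add(node)
        else:
            messages += len(holders - {node})
            if protocol == "invalidate":
                holders = {node}     # remote copies are destroyed
            else:
                holders.add(node)    # remote copies stay valid
    return misses, messages


# Hypothetical traces: a producer/consumer pattern, and repeated writes
# that a remote reader never looks at again.
producer_consumer = [(0, "w"), (1, "r"), (2, "r")] * 3
repeated_writes = [(1, "r"), (0, "w"), (0, "w"), (0, "w")]

for name, trace in [("producer_consumer", producer_consumer),
                    ("repeated_writes", repeated_writes)]:
    for protocol in ("invalidate", "update"):
        print(name, protocol, simulate(trace, protocol))
```

In this model the producer/consumer trace misses on every read under invalidation but only on the first pass under update, whereas the repeated-writes trace sends more messages under update than under invalidation, which is the excess traffic referred to above.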
3 SOC implementation and applications
3.1 Target machine
The target architecture for the experiments was a 30-node EDS prototype [20]. A supervisor-level process known as the mapper provides a fixed-distributed scheme [12] for the management of VSM consistency on a 4 KByte page basis through the manipulation of local access permissions in memory management unit page table entries (PTEs). PTEs form the coherence-bits plus a virtual-to-physical address mapping. In this scheme, write ownership may move but there is one fixed mapper which is responsible for dispatching ownership of particular virtually shared pages. The assignment of pages to mappers is based on the locus of creation of VSM. In our experiments, VSM is created so as to evenly distribute the management of virtual pages between nodes.
3.2 Spinning-on-Coherency
Semantics: Figure 1 a) depicts the behaviour of a program under a write-invalidate protocol whilst 1 b) depicts a program having the same communication structure with SOC optimisation. In 1 a) a barrier ensures that no race-condition exists, so that Node 2 reads the correct value written by Node 1. A read fault on Node 2 occurs after the barrier and Node 1 has its local access permissions reduced to read before dispatching a copy of the data-value to Node 2. The read fault is a synchronous operation, thus Nodes 1 and 2 have to synchronise twice: firstly in a barrier to remove a race-condition and secondly in the VSM system to ensure sequential consistency. In 1 b) the barrier is removed and replaced by SOC on Node 2 and by an update on Node 1. The SOC operation continually reads the local access permissions for an address until the desired access permission (read in this case) becomes valid. When Node 1 issues an update, an asynchronous data-propagation occurs and Node 1 has its access permissions reduced to read to enforce sequential consistency. The update utilises information stored in the address's VSM directory to determine the Nodes requiring updates. In order for SOC to function correctly the directory information needs to reflect multiple copies of data with read access when the true status in the parallel machine is that a single node has write access. In our work, we achieve this by locally invalidating copies and setting the local access permissions of the writer to write.
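The bookkeeping described above (a directory that still lists the read copies while the local permissions say otherwise) can be sketched as a small single-threaded model. This is an illustration only, not the EDS mapper: the class Page and its methods local_invalidate, local_enable_write, update and soc_ready are hypothetical names, and one page is modelled with one data value per node.

```python
INVALID, READ, WRITE = "I", "R", "W"


class Page:
    """Hypothetical model of one virtually shared page."""

    def __init__(self, nodes):
        # Directory: nodes recorded as holding a read copy.
        self.copyset = set(nodes)
        # Local access permissions (the coherence-bits) on each node.
        self.perm = {n: READ for n in nodes}
        # Local copy of the data on each node.
        self.data = {n: None for n in nodes}

    def local_invalidate(self, node):
        # Only the local permission changes; the directory is untouched.
        self.perm[node] = INVALID

    def local_enable_write(self, node):
        # The writer raises its own permission, again without
        # changing the directory.
        self.perm[node] = WRITE

    def write(self, node, value):
        if self.perm[node] != WRITE:
            raise PermissionError("write fault on node %r" % node)
        self.data[node] = value

    def update(self, writer):
        # The writer drops to read access and the directory's copyset
        # says which nodes must receive the new value.
        self.perm[writer] = READ
        for node in self.copyset - {writer}:
            self.data[node] = self.data[writer]
            self.perm[node] = READ

    def soc_ready(self, node):
        # One poll of the SOC loop: has read access become valid?
        return self.perm[node] == READ


page = Page(nodes=[1, 2])
page.local_invalidate(2)        # Node 2 gives up its copy locally
page.local_enable_write(1)      # Node 1 becomes the single writer
page.write(1, 3.14)
ready_before = page.soc_ready(2)
page.update(1)                  # propagate to every node in the copyset
ready_after = page.soc_ready(2)
```

Because the copyset is never shrunk by the local operations, update still knows that Node 2 needs the new value even though Node 2 held no valid copy when the write took place.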
[Figure 1: a) Node 1: W, Barrier, R; Node 2: I, Barrier, Read Fault, R. b) Node 1: W, Update, R; Node 2: I, SOC, R. Key: W: Write and Read; R: Read; I: Invalid; SOC: Spin-on-Coherency.]
Fig. 1. Write-invalidate and Spinning-on-Coherency.
EDS Target Implementation: The SOC operation is implemented as busy-waiting at user-level on the return value of an operating system call which reads the local access permission in the appropriate PTE for an address. The update operation is an asynchronous mapper call which reads the coherence directory for an address and then supplies nodes in that list with read access to a copy of the data. Note that in this implementation protection is preserved, as users cannot read/write any addresses outside of their own address space.
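The busy-waiting described above can be mimicked with two threads. The sketch below is a simulation under stated assumptions rather than the EDS code: the class Pte, its method read_permission (standing in for the operating system call) and install_update (standing in for the effect of the mapper's update on the reading node) are hypothetical names.

```python
import threading
import time


class Pte:
    """Hypothetical page table entry: a permission plus one data value."""

    def __init__(self):
        self._lock = threading.Lock()
        self._perm = "I"      # starts invalid on the reading node
        self._value = None

    def read_permission(self):
        # Stands in for the system call that returns the local
        # access permission for an address.
        with self._lock:
            return self._perm

    def install_update(self, value):
        # Stands in for the arrival of an update: the data is stored
        # and read access is granted in one step.
        with self._lock:
            self._value = value
            self._perm = "R"

    def load(self):
        with self._lock:
            if self._perm == "I":
                raise RuntimeError("read fault")
            return self._value


def spin_on_coherency(pte, wanted="R"):
    """Busy-wait until the permission equals `wanted`; return spin count."""
    spins = 0
    while pte.read_permission() != wanted:
        spins += 1
        time.sleep(0)         # give other threads a chance to run
    return spins


def run_demo(value=42):
    pte = Pte()
    result = {}

    def reader():
        spin_on_coherency(pte)        # replaces the barrier
        result["value"] = pte.load()  # cannot fault: access is valid

    thread = threading.Thread(target=reader)
    thread.start()
    pte.install_update(value)         # the writer's update arrives
    thread.join(timeout=5)
    return result.get("value")
```

Calling run_demo() returns the value passed to install_update, whichever thread happens to run first: the reader never loads until the permission has become valid, which is the property the barrier provided in the unoptimised program.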
3.3 Applications
We apply the optimisations described in this paper to two applications representing the core communication found in Shallow and CG. Shallow [22] is a well known numerical weather forecasting benchmark which solves the shallow water equations using an explicit finite difference method. This is the solution method used by the U.K. Meteorological Office's operational weather forecasting and climate prediction model. CG, from the NAS Parallel Benchmarks [1], uses a conjugate gradient technique to solve an unstructured sparse linear system.
4 Method and Results
In VSM systems the unit of consistency between nodes varies in size from a few words with hardware support, to whole pages in software-only implementations. As the consistency unit is greater than a single word such systems can suffer from the problem of false sharing, where two or more nodes attempt to write to distinct elements which happen to reside in the same coherency unit. This leads to a "ping-pong" effect. In order to avoid these problems we chose our problem sizes and the number of nodes so that false sharing is eliminated.
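Whether a given problem size and node count avoid false sharing can be checked mechanically. The sketch below assumes a one-dimensional array that starts on a page boundary, 8-byte elements and a contiguous block partition across nodes; only the 4 KByte page size comes from the paper, and the function names are hypothetical.

```python
PAGE_BYTES = 4096   # page size of the target machine
ELEM_BYTES = 8      # assumption: double-precision elements


def writers_per_page(n_elems, n_nodes,
                     elem_bytes=ELEM_BYTES, page_bytes=PAGE_BYTES):
    """Map each page number to the set of nodes that write into it."""
    if n_nodes < 1 or n_elems < n_nodes:
        raise ValueError("need at least one element per node")
    per_node = n_elems // n_nodes
    pages = {}
    for i in range(n_elems):
        # Block partition; any remainder goes to the last node.
        node = min(i // per_node, n_nodes - 1)
        page = (i * elem_bytes) // page_bytes
        pages.setdefault(page, set()).add(node)
    return pages


def has_false_sharing(n_elems, n_nodes, **kwargs):
    """True if some page is written by more than one node."""
    pages = writers_per_page(n_elems, n_nodes, **kwargs)
    return any(len(nodes) > 1 for nodes in pages.values())


# 16384 elements over 16 nodes gives each node exactly two pages.
aligned = has_false_sharing(16384, 16)
# 1000 elements over 4 nodes puts several nodes' blocks in page 0.
misaligned = has_false_sharing(1000, 4)
```

Under these assumptions a partition is free of false sharing when each node's block is a whole number of pages, which is the kind of constraint that choosing problem sizes and node counts together can satisfy.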
The experiments recorded the time in seconds (with micro-second resolution) taken to execute iterations 2 to inclusive on 2, 4, 8 and 16 nodes of EDS. The first iteration was not timed, in order to exclude any variable one-off overheads caused by initial misses on code/data segments of memory. Figure 2 presents the temporal performance of the applications on EDS. The LI-LE line represents the temporal performance of the applications when distributed invalidation [6] is used to eliminate write misses. The SOC line represents the temporal performance of the applications when spinning on coherency is used to additionally eliminate read misses.
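Temporal performance as plotted in Figure 2 is the reciprocal of execution time, so a relative improvement follows directly from two timings. The sketch below shows the arithmetic; the function names are hypothetical and the timings in the example are invented for illustration, not measurements from EDS.

```python
def temporal_performance(seconds):
    """Temporal performance: 1/(execution time), in 1/seconds."""
    if seconds <= 0:
        raise ValueError("execution time must be positive")
    return 1.0 / seconds


def improvement_percent(t_standard, t_optimised):
    """Percentage increase in temporal performance over the standard run."""
    ratio = temporal_performance(t_optimised) / temporal_performance(t_standard)
    return 100.0 * (ratio - 1.0)


# Invented timings: halving the execution time doubles the temporal
# performance, which is a 100% increase.
example = improvement_percent(t_standard=20.0, t_optimised=10.0)
```

Note that an increase in temporal performance of p% corresponds to the execution time being divided by (1 + p/100), not reduced by p%.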
[Figure 2: two panels plotting 1/(Execution Time) against Nodes (0 to 16) for the Naive Ideal, Standard, LI-LE and SOC versions. Surviving labels: N = 51; 1024 1024; 1024 512.]
Fig. 2. Temporal Performance of the core communication of (a) Shallow and (b) CG.
5 Conclusions
SOC eliminates read misses in addition to write misses previously eliminated by distributed invalidation. The removal of synchronisation points and the asynchronous propagation of data values through SOC enables significant increases in temporal performance of 22% and 187% over standard versions of the core communication of Shallow and CG respectively on 16 nodes. We are currently identifying new coherency optimisations and their applicability. We are also examining the viability of automating SOC in a parallel research compiler, MARS [3], into which we have already incorporated distributed invalidation [17].
References
1. D. Bailey, J. Barton, T. Lasinski, and H. Simon. The NAS parallel benchmarks. NASA Technical Memorandum 103863, 1993.
2. B.N. Bershad and M.J. Zekauskas. Midway: Shared memory parallel programming with entry consistency for distributed memory multiprocessors. Technical Report CMU-CS-91-170, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, 1991.
3. F. Bodin and M.F.P. O'Boyle. A compiler strategy for SVM. In 3rd Workshop on Languages, Compilers and Runtime Systems for Scalable Computing. Kluwer Press, May 1995.
4. A.L. Cox and R.J. Fowler. Adaptive cache coherence for detecting migratory shared data. In Proc. of the 20th International Symposium on Computer Architecture, pp 98–108, 1993.
5. B. Falsafi et al. Application-specific protocols for user-level shared memory. In Supercomputing 94. IEEE Press, 1994.
6. R.W. Ford, A.P. Nisbet, and J.M. Bull. User level VSM optimisation and its application. In Lecture Notes in Computer Science 1041, pp 223–232, Springer-Verlag, 1996.
7. H. Burkhardt III, S. Frank, and J. Rothnie. The KSR1: Bridging the gap between shared memory and MPPs. In Proceedings of Compcon 93, pages 285–294, San Francisco, 1993.
8. D.B. Glasco, A. Delagi, and M.J. Flynn. The impact of cache coherence protocols on systems using fine-grain data synchronisation. In IFIP Transactions, Parallel Architectures and Compilation Techniques, PACT94. North Holland, 1994.
9. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared memory multiprocessors. In Proceedings of the 17th International Symposium on Computer Architecture, pages 15–26, 1990.
10. A.R. Lebeck and D.A. Wood. Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors. In ISCA95, pages 48–59, 1995.
11. D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The directory-based cache coherence protocol for the DASH multiprocessor. In IEEE 17th Annual International Symposium on Computer Architecture. IEEE Press, 1990.
12. K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, 7(4):321–359, 1989.
13. R. Mirchandaney, S. Hiranandani, and A. Sethi. Improving the performance of DSM systems via compiler involvement. In Proceedings of Supercomputing, 1994.
14. D. Mosberger. Memory consistency models. ACM SIGOPS Review, 27(1), 1993.
15. F. Mounes-Toussi and D.J. Lilja. The potential of compile-time analysis to adapt the cache coherence enforcement strategy to the data sharing characteristics. IEEE Transactions on Parallel and Distributed Systems, 6(5), May 1995.
16. S.S. Mukherjee, S.D. Sharma, M.D. Hill, J.R. Larus, A. Rodgers, and J. Saltz. Efficient support for irregular applications on distributed-memory machines. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995.
17. M.F.P. O'Boyle, R.W. Ford, and A.P. Nisbet. Compiler reduction of invalidation traffic in shared virtual memory systems. In preparation, 1995.
18. S.K. Reinhardt, J.R. Larus, and D.A. Wood. Tempest and Typhoon: User-level shared memory. In Proc. of the 21st Annual International Symposium on Computer Architecture, 1994.
19. J.H. Saltz, R. Mirchandaney, and K. Crowley. Run-time parallelisation and scheduling of loops. IEEE Transactions on Computers, 40(5), May 1991.
20. C.J. Skelton et al. EDS: A parallel computer system for advanced information processing. In Parallel Architectures and Languages Europe, PARLE92, pages 3–18, 1992.
21. P. Stenstrom, M. Brorsson, and L. Sandberg. Adaptive cache coherence protocol optimized for migratory sharing. In Proc. 20th Intl. Symp. on Computer Architecture, pp 109–118, 1993.
22. P.N. Swarztrauber. The shallow benchmark weather prediction program. Technical report, National Center for Atmospheric Research, Boulder, Colorado, 1984.
23. T. Mowry and A. Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 12(2), June 1991.