
Weak Atomicity Under the x86 Memory Consistency Model

Amitabha Roy, University of Cambridge, [email protected]
Steven Hand, University of Cambridge, [email protected]
Tim Harris, Microsoft Research, Cambridge, [email protected]

Abstract

We consider the problem of building a weakly atomic Software Transactional Memory (STM) that provides Single (Global) Lock Atomicity (SLA) while adhering to the x86 memory consistency model (x86-MM).

Categories and Subject Descriptors: D.1.3 [Software]: Programming Techniques, Concurrent Programming

General Terms: Algorithms, Theory, Performance

Keywords: Software Transactional Memory, x86 Memory Model

1. Introduction

Single Lock Atomicity semantics require that “atomic” blocks behave as if they all acquire a single process-wide lock at the beginning and release it at the end. There has been considerable work on providing SLA for the Java memory model [3] and some on C/C++ [6]. In contrast to this work on language-level memory models, there has been no work on providing SLA at the level of a processor memory consistency model. The simple example below illustrates how greatly the memory models differ.

    // Thread 1        // Thread 2
    atomic {
        X = 10;        t1 = Y;
        Y = 10;
    }                  t2 = X;

    C++:  Catch fire due to data race, any result allowed
    Java: Intra-thread reordering allowed
    x86:  No intra-thread reordering

In the example, the result t1 == 10 and t2 == 0 is allowed by the C++ and Java memory models but forbidden by the x86 one. An STM used to implement SLA and the x86 memory model must ensure that the forbidden result does not occur. A practical application of such an STM would be as the last stage in compilation (after machine code is generated) or through a dynamic binary rewriting engine [4]. The natural consequence of the lack of such an STM is that work needing it suffers from caveats in general applicability, requiring the user to be aware of the internals of the underlying STM implementation.

We have designed an STM algorithm that provides SLA and x86-MM to the most general class of programs possible. We exclude only programs with Transactional Reads Unprotected Writes (TRUW) races, where a TRUW race is a race between a read in a transaction and a write outside any transaction. Crucially, however, we use weak atomicity and ensure that ad-hoc synchronisation [2] done by the program is preserved.

Figure 1. TSO state machine with per processor private memory

    int X = 0, Y = 0, Z = 0;
    bool done1 = false, done2 = false;

    // Thread 1         // Thread 2           // Thread 3    // Thread 4
    atomic {            atomic {              Y = 300;       t1 = done1;
        X = 100;            if (Y == 0)       X = 300;       t2 = done2;
        done1 = true;           Z = 1;
    }                       done2 = true;
                        }

    // Not allowed by x86-MM and SLA:
    // X == 100 and t1 == true and t2 == false and Z == 1

Figure 2. Loads must be ordered across atomic blocks
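The claim about the two-thread example in the introduction can be checked mechanically. The following is a minimal sketch, not part of the original work: it assumes X and Y start at 0, treats Thread 1's stores as entries in a write buffer that drain to memory one at a time, and does not model the lock of the atomic block because Thread 2 takes no lock. The names `explore` and `intro_outcomes` and the `fifo` flag are hypothetical. With `fifo == true` the buffer drains in program order (X, then Y), as on x86; with `fifo == false` the drain order is swapped, a crude stand-in for a model that permits intra-thread reordering.

```cpp
#include <set>
#include <utility>

using Outcome = std::pair<int, int>;  // (t1, t2)

// i = number of Thread 1 stores already drained to memory,
// j = number of Thread 2 loads already executed.
static void explore(bool fifo, int i, int j, int X, int Y, int t1, int t2,
                    std::set<Outcome>& out) {
    if (i == 2 && j == 2) {
        out.insert(Outcome{t1, t2});
        return;
    }
    if (i < 2) {
        // FIFO drain order is X then Y; the non-FIFO variant drains Y first.
        bool drain_x = fifo ? (i == 0) : (i == 1);
        if (drain_x) explore(fifo, i + 1, j, 10, Y, t1, t2, out);
        else         explore(fifo, i + 1, j, X, 10, t1, t2, out);
    }
    if (j < 2) {
        // Thread 2 loads Y first (into t1), then X (into t2), in program order.
        if (j == 0) explore(fifo, i, 1, X, Y, Y, t2, out);
        else        explore(fifo, i, 2, X, Y, t1, X, out);
    }
}

// Collects every (t1, t2) pair reachable over all interleavings.
std::set<Outcome> intro_outcomes(bool fifo) {
    std::set<Outcome> out;
    explore(fifo, 0, 0, 0, 0, 0, 0, out);
    return out;
}
```

Under these assumptions, `intro_outcomes(true)` never contains the pair (10, 0), while `intro_outcomes(false)` does, which matches the distinction the example draws between x86 and the language-level models.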

2. Memory Consistency

We use the memory consistency model of Owens et al. [5] as our reference (Figure 1). This casts the x86 as a sequential machine except for a write buffer that delays the visibility of stores to other processors. Each processor can acquire a lock to gain exclusive access to memory (modelling locked instructions). Interestingly, under the write-back memory type in x86, load fences and store fences become no-ops (since loads and stores are already ordered). On the other hand, locked instructions and memory fences (mfence) need to flush the write buffer on the executing processor.
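To make the write-buffer behaviour concrete, here is a small illustrative model of the machine described above. It is a sketch under stated assumptions rather than the formal model of [5]: the class name `TsoMachine` and its methods are hypothetical, locations are named by strings, unwritten locations read as 0, and the global lock used for locked instructions is left out (only the buffer flush that accompanies them is modelled, via `mfence`).

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <utility>
#include <vector>

class TsoMachine {
public:
    explicit TsoMachine(std::size_t processors) : buffers_(processors) {}

    // A store is appended to the issuing processor's FIFO write buffer
    // and is not yet visible to other processors.
    void store(std::size_t p, const std::string& addr, int value) {
        buffers_[p].emplace_back(addr, value);
    }

    // A load returns the newest buffered store to addr by the same
    // processor if there is one, otherwise the value in shared memory.
    int load(std::size_t p, const std::string& addr) const {
        const auto& buf = buffers_[p];
        for (auto it = buf.rbegin(); it != buf.rend(); ++it) {
            if (it->first == addr) return it->second;
        }
        auto m = memory_.find(addr);
        return m == memory_.end() ? 0 : m->second;
    }

    // Moves the oldest buffered store of processor p into shared memory.
    // Returns false when the buffer is already empty.
    bool drain_one(std::size_t p) {
        auto& buf = buffers_[p];
        if (buf.empty()) return false;
        memory_[buf.front().first] = buf.front().second;
        buf.pop_front();
        return true;
    }

    // mfence (and, in this sketch, a locked instruction) flushes the
    // whole write buffer of the executing processor.
    void mfence(std::size_t p) {
        while (drain_one(p)) {}
    }

private:
    std::map<std::string, int> memory_;
    std::vector<std::deque<std::pair<std::string, int>>> buffers_;
};
```

For instance, if processor 0 stores to X and processor 1 stores to Y, each can then load the other's location and still see 0 until the corresponding buffer is drained, while each sees its own store immediately.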

3. x86-MM + SLA = Impossible

Copyright is held by the author/owner(s). PPoPP’11, February 12–16, 2011, San Antonio, Texas, USA. ACM 978-1-4503-0119-0/11/02.

We prove by example that any STM that aims to provide x86-MM and SLA for transactions in any program must execute transactions serially (and hence can provide no scalability). Consider the program fragments in Figure 2. When executing on the state machine of Figure 1, we have the following reasoning. Thread 4 is the ‘witness’ to Thread 1 acquiring the global lock before Thread 2, since it observes done1 set while done2 is still unset. Since the final value of X is 100, Thread 3 must have flushed its write to X from its write buffer to main memory before Thread 1. This also means that it must have flushed its write to Y before Thread 1 flushed its write to X. Thread 2 cannot acquire the lock until Thread 1 has flushed its write buffer (in order for the unlock to be visible to it), and hence when it acquires the lock the writes by Thread 3 are already in memory. Hence it cannot read Y == 0 and thus cannot set Z = 1.

Finally, note that the program fragment is also an example of a TRUW race, between the store to Y on Thread 3 and the load from Y on Thread 2. A weakly atomic STM will not see any conflicts from Thread 3 or Thread 4, which means that it cannot detect the departure from x86-MM by depending solely on conflict detection. Instead it must always serialise the loads in a transaction after all stores in a previous transaction.

    int X = 0, Y = 0, Z = 0, W = 0;

    // Thread 1      // Thread 2      // Thread 3
    atomic {         atomic {         t1 = X;
        X = 100;         Z = 100;     t2 = Z;
        Y = 100;         W = 100;     t3 = W;
    }                }                t4 = Y;

    // Not allowed by x86-MM and SLA:
    // t1 == 100 and t2 == 0 and t3 == 100 and t4 == 0

Figure 3. Stores must be ordered across atomic blocks

In Figure 3, on the other hand, stores from the transaction in Thread 2 must be ordered after all stores in the transaction on Thread 1. Interleaving them leads to the disallowed result. Interleaving, however, does not cause any conflicts, and hence a weakly atomic STM is forced to serialise all stores in a transaction after all stores in a previous transaction. One also needs to consider the fact that buffering of writes (lazy update) is essential for the STM not to expose speculative writes to non-transactional reads. This means that all reads must complete before the transaction's linearisation point, which is then followed by write-back.
Coupled with the intra-thread ordering requirements illustrated by the two examples above, one concludes that the only way to preserve SLA under x86-MM for all programs is to ensure that transactions execute all their operations serially with no overlap. It is thus not possible to build a weakly atomic STM with any kind of scalability optimisations while still providing SLA on x86-MM for all programs. This is in contrast to work that strives to provide SLA in Java [3], because the language-level memory models are more permissive. For example, the language-level memory models for C++ and Java allow the results in the examples by virtue of allowing reordering of operations on the same thread (within the transaction).
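The store-ordering argument for Figure 3 can be illustrated with a small search. This is a sketch under stated assumptions, not the authors' proof: it takes the order in which the four transactional stores reach memory as a string over the location names, assumes Thread 3 loads X, Z, W and Y in that program order, and asks whether any interleaving of those loads with the stores produces the disallowed result. The names `search` and `forbidden_reachable` are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Thread 3 of Figure 3 loads X, Z, W, Y in program order into t1..t4.
static const std::string kLoads = "XZWY";

// s = stores already written back, l = loads already executed.
// 'seen' records '1' for a load that observed 100 and '0' otherwise.
static bool search(const std::string& stores, std::size_t s, std::size_t l,
                   std::map<char, int> mem, std::string seen) {
    if (l == kLoads.size()) {
        // Disallowed: t1 == 100, t2 == 0, t3 == 100, t4 == 0.
        return seen == "1010";
    }
    // Option 1: the next store in write-back order reaches memory.
    if (s < stores.size()) {
        std::map<char, int> next = mem;
        next[stores[s]] = 100;
        if (search(stores, s + 1, l, next, seen)) return true;
    }
    // Option 2: Thread 3 executes its next load against current memory.
    seen.push_back(mem[kLoads[l]] == 100 ? '1' : '0');
    return search(stores, s, l + 1, mem, seen);
}

// True if the disallowed result of Figure 3 is reachable when the stores
// reach memory in the given order, e.g. "XYZW" for Thread 1 then Thread 2.
bool forbidden_reachable(const std::string& write_back_order) {
    return search(write_back_order, 0, 0, {}, "");
}
```

In this model the two serial write-back orders, "XYZW" and "ZWXY", cannot produce the disallowed result, whereas the interleaved order "XZWY" can, even though the two transactions touch disjoint locations and so never conflict.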

4. x86-MM + SLA − TRUW Races = Possible

We have designed and implemented an STM (without complete serialisation) that provides SLA + x86-MM to all programs excluding those with TRUW races. The salient points of the algorithm are listed below.

4.1 Speculation Phase

During the speculation phase we depend on STM read and write barriers to log all loads and stores. We use as a basis a lazy weakly atomic STM similar to TL2 [1]. However, we log every read and write into software read and write buffers with no merging of adjacent reads and writes (unlike STMs that use larger granularities such as a cache line). This is critical to preserving the x86-MM. Further, we handle accesses of any size, including overlapping accesses. We fall back to irrevocability on encountering any memory access originating from a locked instruction or on encountering an mfence. Both of these instructions require flushing the write buffer and hence would lead to a departure from the x86-MM when executing with the STM.

4.2 Commit Phase

We use a two-phase commit. The first phase of the commit acquires metadata locks on modified locations. The second phase verifies every read from the read buffer in addition to checking STM metadata. At this point the transaction succeeds and the write logs are played back into shared memory.

We introduce additional synchronisation between threads in the commit phase. Each commit acquires a unique commit ticket from a global counter. Threads with a later commit ticket must execute their write-back phase after threads with an earlier commit ticket have finished their write-back phase. The read checks are performed in parallel; this departure from total serialisation costs us the ability to handle TRUW races. The commit phase includes a dynamic race detector (which can produce false negatives but no false positives) for the TRUW races we cannot handle, at no additional performance cost.

4.3 Speculation Safety

Our focus for the STM has been safety. In addition to preserving the x86 memory consistency model for any program not including a TRUW race, we also provide the same guarantee to speculating threads. A speculating thread is not allowed to execute with a read set that it could not have seen when executing with SLA. Further, since we buffer writes, no uncommitted values are allowed to leak out of speculating transactions.
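The commit-ticket ordering of Section 4.2 might be sketched as follows. This is an assumed illustration and not the authors' implementation: the class `CommitTickets`, its method names and the choice of a second "now serving" counter are all hypothetical, and the sketch further assumes that a transaction that draws a ticket and then fails validation still calls `finish` so that later tickets are not blocked.

```cpp
#include <atomic>
#include <cstdint>

class CommitTickets {
public:
    // Every committing transaction draws a unique ticket from a global counter.
    std::uint64_t acquire() {
        return next_.fetch_add(1, std::memory_order_relaxed);
    }

    // True once every earlier ticket has finished its write-back phase.
    bool may_write_back(std::uint64_t ticket) const {
        return serving_.load(std::memory_order_acquire) == ticket;
    }

    // Spins until it is this ticket's turn to write back.
    void wait_turn(std::uint64_t ticket) const {
        while (!may_write_back(ticket)) {
            // busy-wait; a real system would likely pause or yield here
        }
    }

    // Called after the write log has been played back into shared memory
    // (or, by assumption, after an abort), handing the turn to the next ticket.
    void finish(std::uint64_t ticket) {
        serving_.store(ticket + 1, std::memory_order_release);
    }

private:
    std::atomic<std::uint64_t> next_{0};     // next ticket to hand out
    std::atomic<std::uint64_t> serving_{0};  // ticket allowed to write back
};
```

A committing thread would call `acquire()`, run its read checks in parallel with other threads, then call `wait_turn(ticket)` before write-back and `finish(ticket)` afterwards. The release store in `finish` paired with the acquire load in `may_write_back` is intended to make an earlier transaction's write-back visible before a later one begins its own.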

5. Conclusion

We have designed and implemented an STM algorithm to provide SLA and the x86 memory consistency model to transactions. This is not merely an academic exercise: the STM we have designed is integrated with an efficient instrumentation system for x86 binaries that we have built. The two together provide SLA for atomic blocks delimited by a single global lock in the binary, provided the program is free of TRUW races.

References

[1] D. Dice, O. Shalev, and N. Shavit. Transactional locking II. In DISC ’06: Proc. 20th International Symposium on Distributed Computing, pages 194–208, Sept. 2006.
[2] A. Jannesari and W. Tichy. Identifying ad-hoc synchronization for enhanced race detection. In IPDPS ’10: Proc. 25th IEEE International Symposium on Parallel Distributed Processing (IPDPS), pages 1–10, April 2010.
[3] V. Menon, S. Balensiefer, T. Shpeisman, A.-R. Adl-Tabatabai, R. Hudson, B. Saha, and A. Welc. Single global lock semantics in a weakly atomic STM. In TRANSACT ’08: 3rd ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing, Feb. 2008.
[4] M. Olszewski, J. Cutler, and J. G. Steffan. JudoSTM: A dynamic binary-rewriting approach to software transactional memory. In PACT ’07: Proc. 16th International Conference on Parallel Architecture and Compilation Techniques, pages 365–375, Sept. 2007.
[5] S. Owens, S. Sarkar, and P. Sewell. A better x86 memory model: x86-TSO. In TPHOLs: 22nd Annual Conference on Theorem Proving in Higher Order Logics, 2009.
[6] C. Wang, W.-Y. Chen, Y. Wu, B. Saha, and A.-R. Adl-Tabatabai. Code generation and optimization for transactional memory constructs in an unmanaged language. In CGO ’07: Proc. 2007 International Symposium on Code Generation and Optimization, pages 34–48, Mar. 2007.
