Li Y, Zhang Y, Chen YY et al. Formal reasoning about lazy-STM programs. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 25(4): 841–852 July 2010. DOI 10.1007/s11390-010-1065-8

Formal Reasoning About Lazy-STM Programs



Yong Li, Yu Zhang*, Member, CCF, Yi-Yun Chen, Member, CCF, and Ming Fu

School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
Software Security Laboratory, Suzhou Institute for Advanced Study, University of Science and Technology of China, Suzhou 215123, China

E-mail: {liyong, fuming}@mail.ustc.edu.cn; {yuzhang, yiyun}@ustc.edu.cn

Received October 30, 2008; revised March 10, 2010.

Abstract Transactional memory (TM) is an easy-to-use parallel programming model that avoids common problems associated with conventional locking techniques. Researchers have proposed a large number of alternative hardware and software TM implementations. However, few of them address formal reasoning about TM programs. In this paper, we propose a framework at assembly level for reasoning about lazy software transactional memory (STM) programs. First, we give a software TM implementation based on lightweight locks. These locks are themselves part of the shared memory. Then we define the semantics of the model operationally; the lightweight locks used in transactions are non-blocking, which avoids deadlocks among transactions. Finally, we design a logic that combines permission accounting in separation logic with concurrent separation logic to verify various properties of concurrent programs based on this machine model. The whole framework is formalized using a proof-carrying-code (PCC) framework.

Keywords program verification, transactional memory (TM), proof-carrying-code, permission accounting in separation logic

1 Introduction

The advent of multi-core processors has brought concurrency into mainstream applications. However, it also confronts programmers with the challenge of managing concurrency. Traditionally, programmers have used locks to synchronize concurrent accesses. However, locks are a well-known source of software engineering problems: they complicate parallel programming and may lead to deadlock, priority inversion, or convoying. Transactional memory (TM) provides an alternative concurrency management model that avoids these pitfalls and significantly eases parallel programming. A TM system simplifies concurrency management by making parallel tasks appear to execute atomically and in isolation. There have been several proposals for hardware-based (HTM)[1-4], software-based (STM)[5-10] and hybrid[11] TM implementations. But these TM systems mainly focus on performance and offer different tradeoffs in the mechanisms used to track transactional state, the buffer size, and the overheads associated with basic operations. There is also a large body of work on the formal semantics of TM systems[2,12-14], but little of it addresses formal reasoning about the correctness of programs built on these systems or the soundness of such reasoning. Prior software TM systems mostly implement weak atomicity[13], which allows a transaction's isolation to be violated if there is a data race between transactional and non-transactional code. Programs based on these systems may therefore exhibit unexpected behaviors.

Our previous work[15] presented a framework for certifying concurrent programs that use transactional memory. That model treats the whole commit operation of a transaction as a single instruction primitive. In this paper, we extend the previous framework by splitting that large commit operation into several small fundamental instructions. We have studied a wide range of existing STM systems and designed an abstract machine model based on locks. It also supports irreversible actions in transactions through privatization. In this framework we focus on the correctness of programs using TM systems and formally prove properties of such programs. Once verified, these programs are partially correct TM programs with respect to the given specifications. This paper makes the following contributions:

• We model a software TM system based on lightweight locks. It is at assembly level, with lazy version management and lazy conflict detection[9-10]. We use locks to privatize the shared memory for conflict detection and data commit. Furthermore, we separate the progress of a transaction into three phases, so that irreversible actions such as system calls and I/O can be delayed to the commit phase by the compiler.

• We introduce a program logic that combines concurrent separation logic (CSL)[16] with permission accounting in separation logic (PASL)[17] to deal with shared memory accesses in transactions. It can split each cell in the shared memory into two read-only parts, one of which is transferred to the thread's logical private memory while the other is left in the logical shared memory for other transactions' speculative reads. Our specifications assert the machine-level behavior of the program and are general enough for various safety requirements.

• The whole framework has been implemented in the Coq proof assistant[18], and we have proved several examples. Moreover, this formal reasoning, described at assembly level, can be lifted to higher levels.

The rest of the paper is organized as follows. Section 2 summarizes TM systems and discusses the major design issues in our framework. Section 3 presents our framework, which contains an abstract machine model and the program logic for reasoning. Section 4 shows an example that demonstrates how to reason about programs with the program logic. Finally, Section 5 discusses related work and concludes the paper.

Regular Paper. Supported by the National Natural Science Foundation of China under Grant Nos. 60928004 and 90718026; and Intel China Research Center. Any opinions, findings, and conclusions contained in this document are those of the authors and do not reflect the views of these agencies. 2010 Springer Science + Business Media, LLC & Science Press, China

J. Comput. Sci. & Technol., July 2010, Vol.25, No.4

2 Transactional Memory

2.1 Transactional Memory Overview

A TM system provides a simple concurrency control mechanism for parallel programming. It avoids many pitfalls associated with traditional locks. With TM, programmers only need to define an atomic code block; the underlying system then guarantees that the code block executes atomically and in isolation. In a TM system, the behavior of transactions must satisfy the following properties: a) atomicity: either the whole transaction executes or none of it does; b) isolation: partial memory updates are not visible to other transactions; and c) consistency: there is a single order of completion for transactions across the whole system[1]. Providing these properties requires data version management and conflict detection, whose implementations distinguish the different TM proposals.

Data version management stores both new data and old data simultaneously. The new data becomes visible when the transaction commits, and the old data is restored when the transaction aborts. At most one of them can be stored in place (in the target memory) while the other is buffered speculatively. TM systems use either lazy version management[6,9] or eager version management[4,7]. Lazy version management retains old data in place for faster abort and buffers new data speculatively in a redo-log. The redo-log defers all memory updates until the transaction commits. Eager version management stores the new value in place for faster commit and logs the old value in an undo-log. A shared memory read in a transaction is usually speculative (it escapes synchronization), but the data read must be logged in a read set for conflict detection before commit.

Conflict detection signals overlapping accesses to the same address by multiple transactions when at least one of them attempts to write a new value. It requires tracking the addresses and values read by each transaction, and usually records them in a read set. Lazy conflict detection[6,9] delays conflict detection until the transaction commits, while eager conflict detection[7] checks for conflicts immediately at reads and writes. When a conflict happens, the transaction needs to be rolled back. Prior STM systems[6,9] provide lazy conflict detection. For conflict detection, an STM system can acquire locks to privatize shared memory and so avoid interleaving with other transactions. The granularity of conflict detection may be word-level, cache-line-level or object-level.

2.2 Design Points

In TM systems, transactional code executes atomically and in isolation. Furthermore, isolation must hold not only among transactions but also between transactional and non-transactional code. Existing software systems mostly implement weak atomicity semantics, which allows the isolation to be violated when there is a data race between transactional and non-transactional code. Besides, the commit operation of a transaction is assumed to be atomic, but in practice some STM implementations relax this and allow a non-atomic commit operation to improve performance.

Our previous work[15] avoids these problems by using a large built-in primitive to represent the whole commit operation, which detects conflicts and then commits new values in a single instruction cycle. This is at a rather high level and hides the implementation details. In this paper we model a relaxed abstract TM machine by splitting the large commit operation of our previous work[15] into several low-level and fundamental instructions. Besides, lightweight locks are introduced for shared memory privatization to solve the problems mentioned above. A lightweight lock is implemented as a mutex that reserves a word in memory to flag whether the mutex is locked or not. Acquiring the mutex is non-blocking and may abort the transaction in order to avoid deadlocks. Following the implementations[6,9-10] of most STM systems, the model adopts lazy version management and lazy conflict detection at word-level granularity. In this framework we do not take nested transactions into account but leave them for future work.

3 Framework

In this section, we present an abstract machine that supports TM and give its operational semantics. After that, a program logic that combines CSL with PASL is presented for the verification of assembly programs running on this model. It is similar in structure to other CAP systems[19-20].

3.1 Abstract Machine

The abstract machine is a straightforward extension of CAP with several transactional instructions (see Fig.1 for the syntax). The machine configuration W contains a code heap C, a global shared memory H and a number of threads. The global shared memory H contains both lock and non-lock cells, with no distinction between them. Each thread T is made up of a register file R, a program counter pc and a transactional mechanism X. A thread state S is a view of the machine configuration from a single thread's angle; it contains the global shared memory H and the corresponding thread T.

X is a special data structure for transactions. It is ε outside transactions, and consists of a read set Hr, a write set Hw, a backup file B and a transactional status A inside transactions. The read set Hr records address and value pairs during speculative reads for later conflict detection; the write set Hw is a redo-log that buffers write attempts in transactions. The backup file B records the register file R and the program counter pc for transactional aborts.

In this model we introduce three statuses, act, cmt and abt, denoting the phases in the progress of a transaction. In status act, the transaction reads and writes shared memory speculatively, then acquires locks and detects conflicts. If there are no conflicts, the status turns to cmt. Then the transaction commits the data buffered in the write set, releases locks and completes. Irreversible actions such as I/O can be placed here by the compiler or runtime system, because transactions will not be rolled back in status cmt. If a conflict happens or locking fails, the status turns to abt, and the transaction can only release locks and roll back.

\[
\begin{array}{llcl}
(\mathit{World}) & W & ::= & (C, H, [T_1, \ldots, T_n]) \\
(\mathit{ThreadState}) & S & ::= & (H, T) \\
(\mathit{CodeHeap}) & C & \in & \mathit{Address} \rightharpoonup \mathit{Instr} \\
(\mathit{Memory}) & H, H_r, H_w & \in & \mathit{Address} \rightharpoonup \mathit{Word} \\
(\mathit{Thread}) & T_i & ::= & (R, pc, X) \\
(\mathit{RegFile}) & R & \in & \mathit{Register} \rightarrow \mathit{Word} \\
(\mathit{Register}) & r & ::= & r_0 \mid \ldots \mid r_{31} \\
(\mathit{Address}) & f, l, pc & ::= & i \ (\text{nat nums}) \\
(\mathit{Word}) & w & ::= & i \ (\text{nat nums}) \\
(\mathit{Xstate}) & X & ::= & \varepsilon \mid (H_r, H_w, B, A) \\
(\mathit{Bstate}) & B & ::= & (R, pc) \\
(\mathit{Status}) & A & ::= & \mathsf{act} \mid \mathsf{cmt} \mid \mathsf{abt} \\
(\mathit{Instr}) & \iota & ::= & \mathtt{addu}\ r_d, r_s, r_t \mid \mathtt{addiu}\ r_d, r_s, w \\
 & & \mid & \mathtt{lw}\ r_d, w(r_s) \mid \mathtt{sw}\ r_d, w(r_s) \\
 & & \mid & \mathtt{cast}\ r_d, r_t, w(r_s) \mid \mathtt{out}\ r_d \\
 & & \mid & \mathtt{beq}\ r_s, r_t, f \mid \mathtt{bne}\ r_s, r_t, f \\
 & & \mid & \mathtt{lwt}\ r_d, w(r_s) \mid \mathtt{swt}\ r_d, w(r_s) \\
 & & \mid & \mathtt{begin} \mid \mathtt{validate}\ r_d \mid \mathtt{commit} \\
 & & \mid & \mathtt{j}\ f \mid \mathtt{jr}\ r_s \mid \mathtt{rollback} \\
(\mathit{InstrSeq}) & I & ::= & \iota \mid \iota; I
\end{array}
\]

Fig.1. Syntax of machine.

The instruction set of this model is based on a subset of MIPS, with several additional instructions for the implementation of TM. Instruction begin marks the beginning of a transaction while commit marks its end. Instruction validate represents conflict detection. The transactional instruction lwt denotes a speculative read while swt denotes a speculative write. Instruction out stands for an irreversible I/O action. Finally, we use a compare-and-swap (CAS) instruction, cast, to implement lock acquisition, and the ordinary sw instruction for lock release.

The step function (⟼) of a world W is defined in Fig.2. The auxiliary function Next(pc,ι)(H, R, pc, X) defines the effect of the execution of a thread Tk. We follow the preemptive thread model, where the execution of threads can be preempted at any program point but the execution of individual instructions is atomic. This thread model is enforced by allowing any thread to execute at any point in the definition of the step function. Thus we define an interleaving model for multithreaded machines in which the instructions of an individual thread execute with sequential consistency.

\[
(C, H, [T_1, \ldots, T_{k-1}, T_k, T_{k+1}, \ldots, T_n]) \longmapsto (C, H', [T_1, \ldots, T_{k-1}, T_k', T_{k+1}, \ldots, T_n])
\]
\[
\text{if } C(T_k.pc) = \iota \text{ and } \mathrm{Next}_{(pc,\iota)}(H, T_k) = (H', T_k') \text{ for any } k
\]

\[
\begin{array}{lll}
\text{if } \iota = & \text{then } \mathrm{Next}_{(pc,\iota)}(H, R, pc, X) = & \\
\mathtt{addu}\ r_d, r_s, r_t & (H, R\{r_d \leadsto R(r_s) + R(r_t)\}, pc+1, X) & \\
\mathtt{addiu}\ r_d, r_s, w & (H, R\{r_d \leadsto R(r_s) + w\}, pc+1, X) & \\
\mathtt{lw}\ r_d, w(r_s) & (H, R\{r_d \leadsto H(R(r_s)+w)\}, pc+1, X) & \text{where } R(r_s)+w \in \mathrm{dom}(H) \\
\mathtt{sw}\ r_d, w(r_s) & (H\{R(r_s)+w \leadsto R(r_d)\}, R, pc+1, X) & \text{where } R(r_s)+w \in \mathrm{dom}(H) \wedge X.A \neq \mathsf{act} \\
\mathtt{beq}\ r_s, r_t, f & (H, R, f, X) & \text{if } R(r_s) = R(r_t) \\
 & (H, R, pc+1, X) & \text{if } R(r_s) \neq R(r_t) \\
\mathtt{bne}\ r_s, r_t, f & (H, R, pc+1, X) & \text{if } R(r_s) = R(r_t) \\
 & (H, R, f, X) & \text{if } R(r_s) \neq R(r_t) \\
\mathtt{j}\ f & (H, R, f, X) & \\
\mathtt{jr}\ r_s & (H, R, R(r_s), X) & \\
\mathtt{out}\ r_d & (H, R, pc+1, X) & \text{where } X.A = \mathsf{cmt} \\
\mathtt{begin} & (H, R, pc+1, (\emptyset, \emptyset, (R, pc), \mathsf{act})) & \text{where } X = \varepsilon \\
\mathtt{validate}\ r_d & (H, R\{r_d \leadsto 1\}, pc+1, (X.H_r, X.H_w, X.B, \mathsf{cmt})) & \text{if } X.H_r \subseteq H \wedge X.A = \mathsf{act} \\
 & (H, R\{r_d \leadsto 0\}, pc+1, (X.H_r, X.H_w, X.B, \mathsf{abt})) & \text{if } X.H_r \not\subseteq H \wedge X.A = \mathsf{act} \\
\mathtt{commit} & (H, R, pc+1, \varepsilon) & \text{where } X.A = \mathsf{cmt} \\
\mathtt{rollback} & (H, R', pc', \varepsilon) & \text{where } X.A = \mathsf{abt} \wedge X.B = (R', pc') \\
\mathtt{lwt}\ r_d, w(r_s) & (H, R\{r_d \leadsto X.H_w(R(r_s)+w)\}, pc+1, X) & \text{if } R(r_s)+w \in \mathrm{dom}(X.H_w); \\
 & (H, R\{r_d \leadsto X.H_r(R(r_s)+w)\}, pc+1, X) & \text{if } R(r_s)+w \in \mathrm{dom}(X.H_r); \\
 & (H, R\{r_d \leadsto H(R(r_s)+w)\}, pc+1, (X.H_r\{R(r_s)+w \leadsto H(R(r_s)+w)\}, X.H_w, X.B, X.A)) & \text{if } R(r_s)+w \in \mathrm{dom}(H), \text{ where } X \neq \varepsilon \\
\mathtt{swt}\ r_d, w(r_s) & (H, R, pc+1, (X.H_r, X.H_w\{R(r_s)+w \leadsto R(r_d)\}, X.B, X.A)) & \text{where } X \neq \varepsilon \\
\mathtt{cast}\ r_d, r_t, w(r_s) & (H\{R(r_s)+w \leadsto R(r_t)\}, R\{r_t \leadsto R(r_d)\}, pc+1, X) & \text{if } R(r_d) = H(R(r_s)+w) \wedge X.A = \mathsf{act}; \\
 & (H, R\{r_t \leadsto H(R(r_s)+w)\}, pc+1, (X.H_r, X.H_w, X.B, \mathsf{abt})) & \text{if } R(r_d) \neq H(R(r_s)+w) \wedge X.A = \mathsf{act}; \\
 & & \text{where } R(r_s)+w \in \mathrm{dom}(H)
\end{array}
\]

Fig.2. Operational semantics of the machine.

Most of the instructions are standard and straightforward, so we only discuss those related to transactions. Instruction begin initializes X with an empty read set and an empty write set, records the current register file and program counter in the backup file, and finally sets the status to act. Since this model does not support nested transactions, X must be ε when begin executes. Instruction validate checks the read set to detect conflicts. It compares each value stored in the read set X.Hr with the value at the same location in shared memory. If there is a cell whose values in the read set and in shared memory do not match, a conflict has happened. After the execution of validate, the status turns to abt if a conflict exists and to cmt otherwise. Instruction commit ends the transaction by resetting X to ε; it requires the current transactional status to be cmt. Instruction rollback restarts the transaction by restoring the register file and the program counter from the backup file; it must be executed in status abt. Instruction cast is a non-blocking synchronization primitive here. It must be executed in status act and changes the status to abt if the expected value does not match the one in shared memory.

There are two kinds of instructions for accessing the shared memory: the ordinary ones (lw and sw) can be used both inside and outside transactions, while the special ones (lwt and swt) can only be used inside transactions. In a transaction, values must be consistent across multiple reads and writes. So instruction lwt first checks the write set Hw and then the read set Hr; if the location is in neither set (this is the first access to the location in the transaction), it speculatively reads the shared memory directly and then logs the value in the read set Hr for later conflict detection. With lazy version management, the speculative write swt buffers the new value in the write set Hw and leaves the old value in place; these new values are committed to the shared memory in status cmt. Instruction sw has two uses in our model (data writing and lock release), and it must not be executed in status act because of the lazy version management.

It is now clear that the big commit operation of our previous work[15] is a macro defined with these low-level instructions. It proceeds in this order: first it acquires locks using cast to privatize shared memory and exclude interleaving with others, then it validates the speculative reads recorded in the read set using validate. If validation passes, which means that atomicity is maintained, the transaction sets the status to cmt and commits the buffered new values. Finally, all owned locks are released by sw and the transaction ends. Otherwise, if the validation fails, the transaction does nothing but release the owned locks and roll back.

The atomicity and isolation properties of transactional code cannot be obtained simply from the restrictions of the syntax and semantics of this model. Transactions require that all effects of their instructions be discarded before rolling back and that intermediate values be invisible to others. All of this needs formal verification, which is the subject of another paper of ours on the verification of TM programs[21]. This paper focuses on the invariant proof of shared memory.
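The transactional instructions and the commit macro described above can be sketched executably. The following Python model is only an illustration under simplifying assumptions that are not part of the paper: addresses are passed directly instead of being computed as R(rs) + w, registers are named by strings, and the names Tx, Thread and finish (as well as the scratch registers r28 to r31) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Memory = Dict[int, int]

@dataclass
class Tx:
    """Transactional state X = (Hr, Hw, B, A)."""
    read_set: Memory
    write_set: Memory
    backup: Tuple[Dict[str, int], int]   # B = (R, pc) saved by begin
    status: str                          # A: "act", "cmt" or "abt"

@dataclass
class Thread:
    regs: Dict[str, int] = field(default_factory=dict)
    pc: int = 0
    x: Optional[Tx] = None               # None stands for epsilon

def begin(t: Thread) -> None:
    assert t.x is None, "nested transactions are not supported"
    t.x = Tx({}, {}, (dict(t.regs), t.pc), "act")
    t.pc += 1

def lwt(H: Memory, t: Thread, rd: str, addr: int) -> None:
    x = t.x
    assert x is not None
    if addr in x.write_set:              # own buffered write wins
        t.regs[rd] = x.write_set[addr]
    elif addr in x.read_set:             # repeated read stays consistent
        t.regs[rd] = x.read_set[addr]
    else:                                # first access: speculative read, logged
        x.read_set[addr] = H[addr]
        t.regs[rd] = H[addr]
    t.pc += 1

def swt(t: Thread, rd: str, addr: int) -> None:
    assert t.x is not None
    t.x.write_set[addr] = t.regs[rd]     # redo-log only, memory untouched
    t.pc += 1

def cast(H: Memory, t: Thread, rd: str, rt: str, addr: int) -> None:
    assert t.x is not None and t.x.status == "act"
    if t.regs[rd] == H[addr]:            # expected value matches: swap
        H[addr], t.regs[rt] = t.regs[rt], t.regs[rd]
    else:                                # non-blocking failure: abort
        t.regs[rt] = H[addr]
        t.x.status = "abt"
    t.pc += 1

def validate(H: Memory, t: Thread, rd: str) -> None:
    x = t.x
    assert x is not None and x.status == "act"
    ok = all(H.get(l) == v for l, v in x.read_set.items())
    x.status = "cmt" if ok else "abt"
    t.regs[rd] = 1 if ok else 0
    t.pc += 1

def sw(H: Memory, t: Thread, rd: str, addr: int) -> None:
    assert t.x is None or t.x.status != "act"
    assert addr in H
    H[addr] = t.regs[rd]
    t.pc += 1

def commit(t: Thread) -> None:
    assert t.x is not None and t.x.status == "cmt"
    t.x = None
    t.pc += 1

def rollback(t: Thread) -> None:
    assert t.x is not None and t.x.status == "abt"
    regs, pc = t.x.backup
    t.regs, t.pc, t.x = dict(regs), pc, None

def finish(H: Memory, t: Thread, lock: int) -> bool:
    """Commit macro: lock, validate, write back, unlock; otherwise roll back."""
    t.regs.update(r30=0, r31=1)          # expect "free" (0), install "held" (1)
    cast(H, t, "r30", "r31", lock)
    if t.x.status == "act":              # lock acquired
        validate(H, t, "r29")
        if t.x.status == "cmt":
            for addr, value in t.x.write_set.items():
                t.regs["r28"] = value
                sw(H, t, "r28", addr)    # write back the redo-log
        t.regs["r28"] = 0
        sw(H, t, "r28", lock)            # release the lock
    if t.x.status == "cmt":
        commit(t)
        return True
    rollback(t)
    return False
```

In this sketch a transaction that increments a cell runs begin, lwt, an ordinary register increment, swt and then finish; the cell keeps its old value until finish succeeds, which mirrors the lazy version management of the model. A single lock is used here for brevity, whereas the model allows one lock per protected part of memory.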

Fig.3. Auxiliary Npc macro.

The macro Npc(pc,ι) shown in Fig.3 is a total function; it computes the program counter of the next instruction to be executed after the current instruction completes. It is defined as pc + 1 for arithmetic and data transfer instructions, and as the addresses f, R(rs) and X.B.pc for control transfer instructions.

3.2 Concurrent Separation Logic

CSL is an extension of separation logic[22] for reasoning about shared memory concurrent programs. Separation logic is an extension of Hoare logic and is used to specify memories. Fig.4 shows the assertion language for memory in separation logic, where m denotes a predicate on memory.

\[
\begin{array}{rcl}
m & ::= & l \mapsto v \mid \mathsf{emp} \mid m_1 * m_2 \mid m_1 \wedge m_2 \mid m_1 \vee m_2 \mid \exists x.\, m \mid \forall x.\, m \\
m & \in & \mathit{Memory} \rightarrow \mathit{Prop} \\
l \mapsto v & \stackrel{\mathrm{def}}{=} & \lambda H.\ H = \{l \leadsto v\} \\
\mathsf{emp} & \stackrel{\mathrm{def}}{=} & \lambda H.\ H = \emptyset \\
m_1 \wedge m_2 & \stackrel{\mathrm{def}}{=} & \lambda H.\ m_1\ H \wedge m_2\ H \\
m_1 \vee m_2 & \stackrel{\mathrm{def}}{=} & \lambda H.\ m_1\ H \vee m_2\ H \\
\exists x.\, m & \stackrel{\mathrm{def}}{=} & \lambda H.\ \exists x.\ m\ H \\
\forall x.\, m & \stackrel{\mathrm{def}}{=} & \lambda H.\ \forall x.\ m\ H \\
m_1 * m_2 & \stackrel{\mathrm{def}}{=} & \lambda H.\ \exists H_1, H_2.\ H = H_1 \uplus H_2 \wedge m_1\ H_1 \wedge m_2\ H_2
\end{array}
\]

Fig.4. Assertion language for memory in separation logic.

We now interpret the semantics of these assertions. l ↦ v holds when the memory has only a single location l with the value v. emp holds on an empty memory. m1 ∗ m2 holds if the memory can be split into two disjoint parts such that m1 holds on one part and m2 holds on the other. m1 ∧ m2 holds if both m1 and m2 hold on the entire memory. m1 ∨ m2 holds if either m1 or m2 holds on the entire memory. ∃x.m holds if there exists a y such that [y/x]m holds on the memory. ∀x.m holds if m holds on the memory for all x. As in separation logic, the precision of memory assertions is defined below. We require m to be precise to enforce a unique boundary between the logical shared and private memories.
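The assertion semantics just described can be exercised on small finite memories. The sketch below is an illustration only: it represents a memory as a Python dict, an assertion as a function from memories to booleans, and decides m1 ∗ m2 by enumerating every split of the memory. The helper is_precise_on is a hypothetical name; it checks the precision requirement for one given memory only (at most one sub-memory satisfies the assertion), not for all memories.

```python
from itertools import combinations
from typing import Callable, Dict, Iterator, Tuple

Heap = Dict[int, int]
Pred = Callable[[Heap], bool]

def splits(h: Heap) -> Iterator[Tuple[Heap, Heap]]:
    """Yield every way to cut h into two disjoint sub-heaps (h1, h2)."""
    keys = list(h)
    for n in range(len(keys) + 1):
        for chosen in combinations(keys, n):
            h1 = {k: h[k] for k in chosen}
            h2 = {k: v for k, v in h.items() if k not in h1}
            yield h1, h2

def points_to(l: int, v: int) -> Pred:
    """l |-> v: the heap is exactly the single cell l with value v."""
    return lambda h: h == {l: v}

emp: Pred = lambda h: h == {}

def star(m1: Pred, m2: Pred) -> Pred:
    """m1 * m2: some split satisfies m1 on one part and m2 on the other."""
    return lambda h: any(m1(a) and m2(b) for a, b in splits(h))

def conj(m1: Pred, m2: Pred) -> Pred:
    return lambda h: m1(h) and m2(h)

def disj(m1: Pred, m2: Pred) -> Pred:
    return lambda h: m1(h) or m2(h)

def is_precise_on(m: Pred, h: Heap) -> bool:
    """True if at most one sub-heap of h satisfies m (checked for this h only)."""
    return sum(1 for h1, _ in splits(h) if m(h1)) <= 1
```

For the two-cell memory {1: 10, 2: 20}, the assertion 1 ↦ 10 ∗ 2 ↦ 20 holds while 1 ↦ 10 ∧ 2 ↦ 20 does not, because each points-to assertion describes the whole memory. The assertion emp ∨ 1 ↦ 10 is not precise on that memory, since both the empty sub-memory and {1: 10} satisfy it.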

Definition 3.1 (Precise Memory Assertions). A memory assertion m is precise, i.e., Precise(m) holds, if and only if for all H, H1, H2, if H1 ⊆ H, H2 ⊆ H, and both m H1 and m H2 hold, then H1 = H2.

In CSL, shared memory is partitioned and each part is protected by a unique lock. For each part, an invariant is assigned to specify its well-formedness. The global invariant of the whole shared memory is the union of the invariants of all parts. Once a thread has acquired a lock, it relies on the mutual exclusion provided by the lock and treats the protected part of memory as logically private. Before releasing the lock, the thread must make that part of memory well-formed again with regard to the corresponding invariant.

CSL alone is insufficient for reasoning about transactions because of speculative reads. In CSL, shared memory accesses must be placed in a conditional critical region, where the accessed part of memory is treated as logically private. However, the shared memory read lwt in a transaction is speculative: it does not privatize memory first, and it can even read data from other threads' logical private memory. This breaks modular certification, since the reasoning would need to know other threads' logical private states. We therefore need an additional technique, PASL, to solve this problem.

3.3 Permission Accounting in Separation Logic

In PASL the shared memory model defined in Fig.1 is modified to the partial function

\[
(\mathit{Memory})\quad M \in \mathit{Address} \rightharpoonup (\mathit{Word} \times \mathit{Permission}).
\]

Each cell of the original model is associated with a permission mark in the new model (we simplify the original version of PASL and only allow the two permission marks “t” and “r” in our framework). We write M for the new model to distinguish it from H, and the other definitions in Fig.1, such as World and ThreadState, are adjusted correspondingly. Permission “t” is the total permission, which allows both read and write operations. Permission “r” is a partial permission, which only allows read operations. The total permission “t” is equal to two read-only permissions “r” and can be split where needed.

Fig.5. Assertion language for memory with permission accounting.

With the new logical memory model, the new assertion language is presented in Fig.5. Here v ranges over Word and u over Permission. Compared with Fig.4, we only change $l \mapsto v$ into $l \overset{u}{\mapsto} v$, which means that location l has the value v with permission u. We also introduce some notation: $l \mapsto v$ is a short form of $l \overset{t}{\mapsto} v$, and it can be split into two disjoint parts that both carry the read-only permission ($l \overset{r}{\mapsto} v$). Conversely, two single read-only memory units with the same location and value can be composed into one memory unit with the total permission.
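The splitting and recombination of permissions can be illustrated with a small model. The sketch below assumes a memory is a Python dict from addresses to (value, permission) pairs with permissions "t" and "r"; the function names split_cell, join, can_read and can_write are hypothetical and only mirror the rules stated above.

```python
from typing import Dict, Tuple

# address -> (value, permission), permission is "t" (total) or "r" (read-only)
PHeap = Dict[int, Tuple[int, str]]

def split_cell(m: PHeap, l: int) -> Tuple[PHeap, PHeap]:
    """Split the total permission on l into two read-only halves.

    Returns (m with a read-only view of l, a one-cell memory holding the other half).
    """
    v, u = m[l]
    if u != "t":
        raise ValueError("only a total permission can be split")
    kept = dict(m)
    kept[l] = (v, "r")
    return kept, {l: (v, "r")}

def join(m1: PHeap, m2: PHeap) -> PHeap:
    """Combine two logical memories.

    Two read-only views of the same cell with the same value yield the total
    permission; any other overlap is rejected.
    """
    out = dict(m1)
    for l, (v, u) in m2.items():
        if l not in out:
            out[l] = (v, u)
        elif out[l] == (v, "r") and u == "r":
            out[l] = (v, "t")
        else:
            raise ValueError("incompatible views of cell %d" % l)
    return out

def can_read(m: PHeap, l: int) -> bool:
    return l in m

def can_write(m: PHeap, l: int) -> bool:
    return l in m and m[l][1] == "t"
```

Starting from {7: (5, "t")}, split_cell produces two memories that each hold (5, "r") at address 7: both can read the cell, neither can write it, and join restores the original total permission.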

3.4 Logical Operational Semantics

The operational semantics based on the new logical memory model is shown in Fig.6. In this logical operational semantics, the whole shared memory M is divided into n+1 parts, where Ms is the logical shared memory and M1, ..., Mn are the threads' logical private memories. Each thread is only allowed to access the logical shared memory and its own logical private memory. The instructions are divided into two sets: a) strong memory instructions, such as sw, lwt and cast, which access both the logical shared and private memory; b) weak memory instructions, i.e., all the others, which only access the thread's logical private memory. In Fig.6 we omit the semantics of instructions that do not depend on the shared memory M; their semantics can be obtained from Fig.2 by replacing H with M. We also have to redefine the macro Npc(pc,ι) of Fig.3 for computing the pc of the next instruction. Since this computation does not depend on the shared memory, the redefined macro is obtained from Npc(pc,ι) just by replacing H with M.

There is a strong simulation between this logical operational semantics and the previous physical operational semantics. The permissions of the global shared memory cells in world W are always total, but those in the logical shared memory Ms and the threads' logical private memories M1, ..., Mn may not be. Each step of the physical operational semantics of Fig.2 maps to a step of the logical operational semantics, and vice versa.

\[
(C, M_s \uplus M_1 \uplus \cdots \uplus M_k \uplus \cdots \uplus M_n, [T_1, \ldots, T_k, \ldots, T_n]) \longmapsto (C, M_s' \uplus M_1 \uplus \cdots \uplus M_k' \uplus \cdots \uplus M_n, [T_1, \ldots, T_k', \ldots, T_n])
\]
\[
\begin{array}{l}
\text{if } C(T_k.pc) = \mathtt{sw}/\mathtt{lwt}/\mathtt{cast} \text{ and } \mathrm{Next}_{(pc,\iota)}(M_s \uplus M_k, T_k) = (M_s' \uplus M_k', T_k') \text{ for any } k \\
\text{if } C(T_k.pc) = \text{others and } \mathrm{Next}_{(pc,\iota)}(M_k, T_k) = (M_k', T_k') \text{ and } M_s = M_s' \text{ for any } k
\end{array}
\]

\[
\begin{array}{lll}
\text{if } \iota = & \text{then } \mathrm{Next}_{(pc,\iota)}(M, R, pc, X) = & \\
\mathtt{lw}\ r_d, w(r_s) & (M, R\{r_d \leadsto n\}, pc+1, X) & \text{where } M(R(r_s)+w) = (n, \_) \\
\mathtt{sw}\ r_d, w(r_s) & (M\{R(r_s)+w \leadsto (R(r_d), t)\}, R, pc+1, X) & \text{where } M(R(r_s)+w) = (\_, t) \wedge X.A \neq \mathsf{act} \\
\mathtt{lwt}\ r_d, w(r_s) & (M, R\{r_d \leadsto X.H_w(R(r_s)+w)\}, pc+1, X) & \text{if } R(r_s)+w \in \mathrm{dom}(X.H_w); \\
 & (M, R\{r_d \leadsto X.H_r(R(r_s)+w)\}, pc+1, X) & \text{if } R(r_s)+w \in \mathrm{dom}(X.H_r); \\
 & (M, R\{r_d \leadsto n\}, pc+1, (X.H_r\{R(r_s)+w \leadsto n\}, X.H_w, X.B, X.A)) & \text{if } M(R(r_s)+w) = (n, \_), \text{ where } X \neq \varepsilon \\
\mathtt{cast}\ r_d, r_t, w(r_s) & (M\{R(r_s)+w \leadsto (R(r_t), t)\}, R\{r_t \leadsto R(r_d)\}, pc+1, X) & \text{if } R(r_d) = n \wedge M(R(r_s)+w) = (n, t) \wedge X.A = \mathsf{act}; \\
 & (M, R\{r_t \leadsto n'\}, pc+1, (X.H_r, X.H_w, X.B, \mathsf{abt})) & \text{if } R(r_d) = n \wedge M(R(r_s)+w) = (n', \_) \wedge n \neq n' \wedge X.A = \mathsf{act} \\
\mathtt{validate}\ r_d & (M, R\{r_d \leadsto 1\}, pc+1, (X.H_r, X.H_w, X.B, \mathsf{cmt})) & \text{if } \forall l \in \mathrm{dom}(X.H_r).\ X.H_r(l) = n \wedge M(l) = (n, \_) \\
 & (M, R\{r_d \leadsto 0\}, pc+1, (X.H_r, X.H_w, X.B, \mathsf{abt})) & \text{if } \exists l \in \mathrm{dom}(X.H_r).\ X.H_r(l) = n \wedge M(l) = (n', \_) \wedge n \neq n' \\
 & & \text{where } X.A = \mathsf{act} \\
\cdots & \cdots &
\end{array}
\]

Fig.6. Operational semantics with logic memory model.

O'Hearn has shown[16] that separation logic can describe ownership transfer, where concurrent programs move ownership of memory cells into and out of the shared memory. In PASL, however, it is the permission rather than the cell that is transferred between memories. Fig.7 gives an example that explains the process of permission transfer.

\[
\begin{array}{ccc}
\text{shared memory} & & \text{private memory} \\
\mathit{mutex} \mapsto 0 * l \overset{t}{\mapsto} v & & \mathsf{emp} \\
\mathit{mutex} \mapsto 0 * l \overset{r}{\mapsto} v * l \overset{r}{\mapsto} v & & \mathsf{emp} \\
 & \Downarrow \mathrm{lock}(\mathtt{cast}) & \\
\mathit{mutex} \mapsto 1 * l \overset{r}{\mapsto} v & & l \overset{r}{\mapsto} v \\
 & \Downarrow \mathrm{unlock}(\mathtt{sw}) & \\
\mathit{mutex} \mapsto 0 * l \overset{r}{\mapsto} v * l \overset{r}{\mapsto} v & & \mathsf{emp} \\
\mathit{mutex} \mapsto 0 * l \overset{t}{\mapsto} v & & \mathsf{emp}
\end{array}
\]

Fig.7. Permission transfer on locking and unlocking.

The shared memory has two cells, mutex and l, where mutex is a lock cell used for mutually exclusive access to cell l. Initially the mutex is free, cell l has the total permission in the logical shared memory, and the thread's logical private memory is empty. After the thread has successfully acquired the lock by cast, cell l with the total permission is split into two read-only parts, and one part is moved to the thread's logical private memory. When the lock is released by restoring the lock cell to zero, the read-only part in the thread's logical private memory returns to the logical shared memory. If the thread attempts to update cell l after the privatization, it needs to combine the logical private memory with the logical shared memory so that the two read-only permissions form a total permission, and then performs the update.

3.5 Program Specifications

Having defined the memory assertion language, we now present the assertion language for a thread state S = (M, T) in Fig.8. Here m is a predicate on memory as defined in Fig.5. Most of the definitions are simple and straightforward, so we only explain some special ones. ε states that X in the thread state is ε. [m], [m]r and [m]w state that the memory, the read set and the write set of the thread state satisfy m, respectively. [r] = v and [r]b = v describe register r in the register file of the thread state and of the backup state, respectively. [pc] = v and [pc]b = v describe the current program counter and the program counter recorded at the beginning of the transaction, respectively. The last one, A = act/cmt/abt, describes the current transaction status in the thread state. The verification constructs of our program logic are presented in Fig.9.

\[
\begin{array}{rcl}
a & ::= & \varepsilon \mid [m] \mid [m]_r \mid [m]_w \mid [r] = v \mid [r]_b = v \mid [pc] = v \mid [pc]_b = v \\
 & \mid & A = \mathsf{act}/\mathsf{cmt}/\mathsf{abt} \mid a_1 \wedge a_2 \mid a_1 \vee a_2 \mid \exists x.\, a \mid \forall x.\, a \\
a & \in & \mathit{ThreadState} \rightarrow \mathit{Prop} \\
\varepsilon & \stackrel{\mathrm{def}}{=} & \lambda S.\ S.X = \varepsilon \\
{[m]} & \stackrel{\mathrm{def}}{=} & \lambda S.\ m\ S.M \\
{[m]_r} & \stackrel{\mathrm{def}}{=} & \lambda S.\ m\ S.X.H_r \\
{[m]_w} & \stackrel{\mathrm{def}}{=} & \lambda S.\ m\ S.X.H_w \\
{[r] = v} & \stackrel{\mathrm{def}}{=} & \lambda S.\ S.R(r) = v \\
{[r]_b = v} & \stackrel{\mathrm{def}}{=} & \lambda S.\ S.X.B.R(r) = v \\
{[pc]_b = v} & \stackrel{\mathrm{def}}{=} & \lambda S.\ S.X.B.pc = v \\
{[pc] = v} & \stackrel{\mathrm{def}}{=} & \lambda S.\ S.pc = v \\
\exists x.\, a & \stackrel{\mathrm{def}}{=} & \lambda S.\ \exists x.\ a\ S \\
\forall x.\, a & \stackrel{\mathrm{def}}{=} & \lambda S.\ \forall x.\ a\ S \\
a_1 \wedge a_2 & \stackrel{\mathrm{def}}{=} & \lambda S.\ a_1\ S \wedge a_2\ S \\
a_1 \vee a_2 & \stackrel{\mathrm{def}}{=} & \lambda S.\ a_1\ S \vee a_2\ S \\
A = \mathsf{act}/\mathsf{cmt}/\mathsf{abt} & \stackrel{\mathrm{def}}{=} & \lambda S.\ S.X.A = \mathsf{act}/\mathsf{cmt}/\mathsf{abt}
\end{array}
\]

Fig.8. Assertion language for state.
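The thread-state assertions of Fig.8 can be read as predicates over a concrete state. The following sketch illustrates that reading and is not the Coq formalization: the class names XState and ThreadState, the function names and the use of plain dicts for memories are all assumptions made for the example, and assertions about X are taken to be false when X is ε.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Memory = Dict[int, int]
MemPred = Callable[[Memory], bool]

@dataclass
class XState:
    hr: Memory                          # read set Hr
    hw: Memory                          # write set Hw
    backup: Tuple[Dict[str, int], int]  # B = (R, pc)
    status: str                         # "act", "cmt" or "abt"

@dataclass
class ThreadState:
    mem: Memory
    regs: Dict[str, int]
    pc: int
    x: Optional[XState]                 # None stands for epsilon

Assertion = Callable[[ThreadState], bool]

def no_tx(s: ThreadState) -> bool:          # the assertion "epsilon"
    return s.x is None

def mem(m: MemPred) -> Assertion:           # [m]
    return lambda s: m(s.mem)

def rset(m: MemPred) -> Assertion:          # [m]_r
    return lambda s: s.x is not None and m(s.x.hr)

def wset(m: MemPred) -> Assertion:          # [m]_w
    return lambda s: s.x is not None and m(s.x.hw)

def reg(r: str, v: int) -> Assertion:       # [r] = v
    return lambda s: s.regs.get(r) == v

def reg_b(r: str, v: int) -> Assertion:     # [r]_b = v
    return lambda s: s.x is not None and s.x.backup[0].get(r) == v

def pc_is(v: int) -> Assertion:             # [pc] = v
    return lambda s: s.pc == v

def pc_b(v: int) -> Assertion:              # [pc]_b = v
    return lambda s: s.x is not None and s.x.backup[1] == v

def status(a: str) -> Assertion:            # A = act/cmt/abt
    return lambda s: s.x is not None and s.x.status == a

def conj(*assertions: Assertion) -> Assertion:
    return lambda s: all(a(s) for a in assertions)
```

A precondition such as "the transaction is active, has read 41 from address 1, has buffered 42 for it, and started at pc 2" is then conj(status("act"), rset(...), wset(...), pc_b(2)) applied to a state.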

Fig.9. Verification constructs for program logic.

The world specification φ contains a global invariant m and code heap specifications ψ1, ..., ψn, one for each thread. The global invariant m is a programmer-specified predicate on the logical shared memory Ms; it must hold throughout the execution of the world. A code heap specification ψ assigns to each instruction sequence a thread state assertion a that expresses the precondition for its execution. The last three constructs are the judgments for well-formed worlds, well-formed code heaps and well-formed instruction sequences, respectively. The rules for these judgments are presented in the following subsection. In addition, Fig.10 gives some auxiliary definitions used by the program logic. The first one is syntactic sugar for propositions. The last two are syntactic sugar for the assertion language.

\[
\begin{array}{rcl}
a \Rightarrow a' & \stackrel{\mathrm{def}}{=} & \forall S.\ a\ S \rightarrow a'\ S \\
a \upharpoonright (R, pc, X) & \stackrel{\mathrm{def}}{=} & \lambda M.\ a\ (M, R, pc, X) \\
a \circledast m & \stackrel{\mathrm{def}}{=} & \lambda (M, R, pc, X).\ (a \upharpoonright (R, pc, X) * m)\ M
\end{array}
\]

Fig.10. Auxiliary definition for program logic.

3.6 Inference Rules

The inference rules of our program logic are presented in Fig.12. A world is well-formed with regard to a world specification φ and the thread state predicates a1, ..., an for each thread when the following conditions hold:

• There exists a code heap specification ψk for each thread and a precise global invariant m in the program specification φ. For each thread, the code heap is well-formed regarding ψk and m. Moreover, the thread state predicate ak is satisfied at the point of pck.

• There is a partition of the shared memory M in which the logical shared memory Ms satisfies the global invariant m and M1, ..., Mn satisfy the thread state predicates ak (1 ≤ k ≤ n), respectively.

A code heap is well-formed only if each instruction sequence in the code heap is well-formed. An instruction sequence is well-formed if it is composed of a single instruction ι and another instruction sequence I, and both of them are well-formed (rule INSQ). A well-formed instruction requires the corresponding transactional status defined in Fig.11 and falls into one of the following cases, in the order of Fig.12:

• Weak memory instruction: instruction ι can execute in all thread states specified by the current thread state predicate a, and the resulting thread state must satisfy the thread state predicate that ψ gives for the target address of ι.

• Strong memory instruction: instruction ι can execute in all thread states specified by the current thread state predicate a and the global invariant m. Furthermore, the resulting thread state must satisfy the thread state predicate that ψ gives for the target address of ι and must re-establish the global invariant m. Note that the domain of the shared memory may change after execution.

\[
\begin{array}{ll}
\text{if } \iota = & \text{then } \mathrm{En}(\iota) = \\
\mathtt{sw}\ r_d, w(r_s) & \lambda S.\ S.X.A \neq \mathsf{act} \\
\mathtt{lwt}\ r_d, w(r_s) & \lambda S.\ S.X \neq \varepsilon \\
\mathtt{swt}\ r_d, w(r_s) & \lambda S.\ S.X \neq \varepsilon \\
\mathtt{cast}\ r_d, r_t, w(r_s) & \lambda S.\ S.X.A = \mathsf{act} \\
\mathtt{out}\ r_d & \lambda S.\ S.X.A = \mathsf{cmt} \\
\mathtt{begin} & \lambda S.\ S.X = \varepsilon \\
\mathtt{validate}\ r_d & \lambda S.\ S.X.A = \mathsf{act} \\
\mathtt{commit} & \lambda S.\ S.X.A = \mathsf{cmt} \\
\mathtt{rollback} & \lambda S.\ S.X.A = \mathsf{abt} \\
\text{Others} & \mathit{True}
\end{array}
\]

Fig.11. Enable for instructions.

Fig.12. Inference rules.

Our program logic is based on CSL, and it is more powerful than CSL because it supports speculative reads. It also fits other STM models well (such as eager version management). Furthermore, it can be applied to TM implementations that use read-write locks just by extending the cases of permission.

The atomicity and isolation properties of transactions can also be formally verified in our framework. We introduce a local guarantee g for each thread, as in SCAP[23], describing valid memory updates: it is safe for the current transaction to roll back only after making memory updates allowed by g. So when a transaction rolls back, it guarantees that the memory in the current thread state is consistent with the memory at the beginning of the transaction, just as if nothing had been done. Isolation is enforced by CSL: in status act the shared memory is required to be unchanged because the transaction may roll back, so memory updates can become visible to other threads only in status cmt. The whole process is presented in detail in another paper of ours on TM[21]. However, this framework cannot prove other correctness criteria such as linearizability and opacity. They need more concrete details of the TM implementation, such as the pre- and post-check of the speculative read, which have been abstracted away from our model.

3.7 Soundness

The soundness of the inference rules of our framework with

Yong Li et al.: Formal Reasoning About Lazy-STM Programs

respect to the logical operational semantics for the machine is established following the syntactic approach of proving type soundness[24]. From the "progress" and "preservation" lemmas, we can guarantee that, given a well-formed world under compatible assumptions, the current instruction sequence will be able to execute without getting "stuck". Furthermore, any safety property derivable from the global invariant will hold throughout the execution. We define W ⟶^m W′ as the relation of m-step (m ≥ 0) world transitions. The soundness of the framework is formally stated as Theorem 3.4.

Lemma 3.2 (Progress). For any W = (C, M, [T1, ..., Tn]), if φ, [a1, ..., an] ⊢ W, then for any thread Ti, there exist M′ and Ti′ such that Next(pc,ι)(M, Ti) = (M′, Ti′).

Lemma 3.3 (Preservation). If φ, [a1, ..., an] ⊢ W and W ⟶ W′, then there exist a1′, ..., an′ such that φ, [a1′, ..., an′] ⊢ W′.

Theorem 3.4 (Soundness). If φ, [a1, ..., an] ⊢ W, then for any m ≥ 0, there exist a world W′ and a1′, ..., an′ such that W ⟶^m W′ and φ, [a1′, ..., an′] ⊢ W′.

Although the soundness proof is based on the logical operational semantics, the whole framework is still sound with respect to the physical operational semantics because of the strong simulation between these two transition systems. We have implemented the complete framework[25], including the proofs of these two lemmas and the soundness theorem, in the Coq proof assistant, so we are confident that the framework is indeed sound.

4 Example

Our framework is a realization of established verification techniques[16-17] at the assembly level for concurrent programs. In this section, we give an example to demonstrate the mechanized verification of safety properties (usually the shared memory invariant in a parallel program) for concurrent assembly code.

Atomic block:

    int fib( ) {
      atomic {
        val = curr;
        curr += prev;
        prev = val;
        printf(curr);
      }
    }

STM implementation:

    int fib( ) {
      stmStart( );
      val1 = stmRead(curr);
      val2 = stmRead(prev);
      stmWrite(val1 + val2, &curr);
      stmWrite(val1, &prev);
      printf(curr);
      stmCommit( );
    }

Fig.13. Fibonacci program.
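To make the STM version of Fig.13 concrete, the following is a minimal Python sketch of a lazy STM with a write buffer, a read set that is validated at commit time, and a single non-blocking lock. The names LazySTM, Abort, start, read, write and commit are hypothetical and only mirror the stmStart, stmRead, stmWrite and stmCommit calls of the figure; the sketch is single-threaded and is not the machine model of this paper.

```python
# Hypothetical sketch of a lazy STM in the style of Fig.13.
# All class and method names are illustrative assumptions.

class Abort(Exception):
    """Raised when a transaction must roll back."""


class LazySTM:
    def __init__(self, memory):
        self.memory = dict(memory)  # shared memory: location -> value
        self.lock = 0               # 0 = free, 1 = held

    def start(self):
        # A transaction is a read set plus a write buffer.
        return {"reads": {}, "writes": {}}

    def read(self, tx, loc):
        # Speculative read: prefer the transaction's own buffered write.
        if loc in tx["writes"]:
            return tx["writes"][loc]
        value = self.memory[loc]
        tx["reads"].setdefault(loc, value)
        return value

    def write(self, tx, loc, value):
        # Lazy version management: buffer the update until commit.
        tx["writes"][loc] = value

    def commit(self, tx):
        # Non-blocking lock acquisition: fail instead of waiting.
        if self.lock != 0:
            raise Abort("lock is held")
        self.lock = 1
        try:
            # Lazy conflict detection: validate the read set at commit time.
            for loc, seen in tx["reads"].items():
                if self.memory[loc] != seen:
                    raise Abort("read set is stale")
            self.memory.update(tx["writes"])
        finally:
            self.lock = 0  # release the lock on both commit and abort


def fib(stm):
    """One Fibonacci step, retried until its transaction commits."""
    while True:
        tx = stm.start()
        val1 = stm.read(tx, "curr")
        val2 = stm.read(tx, "prev")
        stm.write(tx, "curr", val1 + val2)
        stm.write(tx, "prev", val1)
        try:
            stm.commit(tx)
        except Abort:
            continue
        return val1 + val2  # the result is produced only after the commit


stm = LazySTM({"prev": 0, "curr": 1})
results = [fib(stm) for _ in range(5)]
```

In this sketch the shared memory is touched only inside commit, which is the property that lets an aborted transaction leave no visible trace.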


A simple Fibonacci program is presented in Fig.13; it is concurrent code that computes the next element of a Fibonacci sequence. The routine computes the Fibonacci number by storing the last two numbers of the sequence in the internal variables prev and curr. The variables prev and curr are shared between threads, so access to them needs synchronization. In high-level programming, these shared memory accesses are put in an atomic block, which hides the additional operations of the transaction. In STM, these operations are packed into APIs, as shown in Fig.13.

The assembly code of the Fibonacci transaction is presented in Fig.14. The code defers I/O until the data is committed, and assigns a lock for synchronization that corresponds to the inlined synchronization routines in the API stmCommit. Together with the assembly code, Fig.14 also presents the set of private assertions and the shared memory invariant of the program for verification. The shared memory invariant m states that when the lock is free (we use a protocol in which zero denotes free and one denotes not free), the values in locations prev and curr are contiguous Fibonacci numbers with full permission; otherwise they have only the read permission, with no relation between them (this part of memory has been privatized by a thread, which may temporarily destroy the relation between prev and curr).

By the rule S-INSN in Fig.12, strong memory instructions add the invariant to the private assertion and re-establish the invariant after execution. At first sight, we may think that the shared memory is unchanged. However, the invariant m describes several well-formed statuses, and the shared memory may change from one case to another. Here we take the first execution of cast in Fig.14 as an example. If the lock l is not free, the shared memory satisfies ∃a, b. l −→ 1 ∗ prev −r→ a ∗ curr −r→ b; lock acquisition then fails and the shared memory is unchanged. Otherwise the shared memory satisfies ∃n. l −→ 0 ∗ prev −→ fib(n) ∗ curr −→ fib(n + 1): the lock is free and the acquisition succeeds. After execution, one part of the shared memory is transferred to the thread's private memory, and the modified shared memory satisfies ∃n. l −→ 1 ∗ prev −r→ fib(n) ∗ curr −r→ fib(n + 1), which is another case of our invariant m.

All the work above is fully mechanized in the Coq proof assistant[18], including the machine model, the soundness proof and the examples. Interested readers can refer to the Coq implementation[25] for details.

5 Related Work and Conclusions

There has been a lot of work on the verification of concurrent programs, such as CCAP[19] and CMAP[20]. They extend the CAP framework, which was proposed for assembly code, and use the rely-guarantee method[19,26] for compositional verification of concurrent programs. However, this method needs to check the non-interference properties between every two threads. O'Hearn[16] proposed CSL for a high-level parallel language based on separation logic. It explicitly separates the private and shared memories and uses conditional critical regions (CCR) to permit ownership transfer. CSL uses invariants to preserve the well-formedness of shared memory outside CCRs. A CCR is usually implemented with lock/unlock primitives, and each lock corresponds to an invariant. Recently, Brookes[27] provided a grainless semantics for CSL for parallel programs that share mutable state; Bornat et al.[17] proposed a refinement of CSL with fine-grained resource accounting. We are inspired by these works and specify a global invariant on the shared memory to ensure the correctness of the interaction between threads.

Transactional memory, as applied to programming languages, was first studied by Herlihy and Moss[1]. The primary goal is to make it easier to perform general atomic updates of multiple independent memory words, avoiding the problems of locks. It is a hardware implementation and relies on the assumption that transactions have short durations and small datasets. Shavit and Touitou[5] proposed the first software implementation, which handles transactions with statically known read and write sets. Next, Herlihy et al.[8] built a non-blocking STM that runs on common hardware and handles transactions with dynamically known read and write sets. It is designed with preemption safety as a major concern. All of the works above explore various implementation strategies of TM systems to achieve better performance at lower cost. Few of them formally reason about the correctness of these implementations and the properties of programs.

    m = ∃a, b, n. (l −→ 1 ∗ prev −r→ a ∗ curr −r→ b) ∨ (l −→ 0 ∗ prev −→ fib(n) ∗ curr −→ fib(n + 1))

    fib:
      −{[emp] ∧ ε}
      begin
      −{[emp] ∧ [emp]r ∧ [emp]w ∧ [pc]b = fib ∧ A = act}
      lwt t1, curr(r0)
      lwt t2, prev(r0)
      addu t2, t1, t2
      swt t2, curr(r0)
      swt t1, prev(r0)
      addiu t1, r0, 1
      −{[emp] ∧ [t1] = 1 ∧ ∃v, v′. [curr → v ∗ prev → v′]r ∧ [curr → v + v′ ∗ prev → v]w ∧ [pc]b = fib ∧ A = act}
      // now acquire locks together
      cast r0, t1, l(r0)
      bne r0, t1, rb
      −{[∃n. curr −r→ fib(n + 1) ∗ prev −r→ fib(n)] ∧ ∃v, v′. [curr → v ∗ prev → v′]r ∧ [curr → v + v′ ∗ prev → v]w ∧ [pc]b = fib ∧ A = act}
      validate t2
      beq r0, t2, ulk
      −{∃n. [curr −r→ fib(n + 1) ∗ prev −r→ fib(n)] ∧ [emp]r ∧ [curr → fib(n + 2) ∗ prev → fib(n + 1)]w ∧ [pc]b = fib ∧ A = cmt}
      lwt t1, curr(r0)
      sw t1, curr(r0)
      out t1
      lwt t2, prev(r0)
      sw t2, prev(r0)
      −{∃n. [curr −r→ fib(n + 2) ∗ prev −r→ fib(n + 1)] ∧ [curr → fib(n + 2) ∗ prev → fib(n + 1)]w ∧ [emp]r ∧ [pc]b = fib ∧ A = cmt}
      // now release locks
      sw r0, l(r0)
      commit
      −{[emp] ∧ ε}
      j fib
    ulk:
      −{[∃n. curr −r→ fib(n + 1) ∗ prev −r→ fib(n)] ∧ [emp]r ∧ [∃v, v′. curr → v + v′ ∗ prev → v]w ∧ [pc]b = fib ∧ A = abt}
      sw r0, l(r0)
    rb:
      −{[emp] ∧ [pc]b = fib ∧ A = abt ∧ ([emp]r ∧ [emp]w ∨ ∃v, v′. [curr → v ∗ prev → v′]r ∧ [curr → v + v′ ∗ prev → v]w)}
      rollback

Fig.14. Assembly code with assertions of Fibonacci.
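The two cases of the invariant m in Fig.14, and the way a lock acquisition moves the shared memory from one case to the other, can be illustrated with a small executable model. This is only a sketch under stated assumptions: the dictionary encoding of the shared memory, the permission strings "full" and "read", and the helper names is_fib_pair, invariant and try_acquire are all hypothetical, and try_acquire models only the effect on shared memory described in Section 4, not the register-level semantics of the cast instruction.

```python
# Illustrative model of the invariant m of Fig.14.
# Encoding and names are assumptions made for this sketch.

def is_fib_pair(prev, curr):
    """True if (prev, curr) == (fib(n), fib(n + 1)) for some n >= 0, fib(0) = 0."""
    a, b = 0, 1
    while a <= prev:
        if (a, b) == (prev, curr):
            return True
        a, b = b, a + b
    return False


def invariant(shared):
    """Check the two cases of m on a shared-memory snapshot."""
    lock = shared["l"]
    prev, prev_perm = shared["prev"]
    curr, curr_perm = shared["curr"]
    if lock == 1:
        # Lock held: only read permissions remain, values are unconstrained.
        return prev_perm == "read" and curr_perm == "read"
    if lock == 0:
        # Lock free: full permissions and contiguous Fibonacci numbers.
        return (prev_perm == "full" and curr_perm == "full"
                and is_fib_pair(prev, curr))
    return False


def try_acquire(shared):
    """Return (acquired, new_shared); never blocks."""
    if shared["l"] != 0:
        return False, shared  # acquisition fails, shared memory unchanged
    new_shared = {
        "l": 1,
        # Part of the permission moves to the acquiring thread's private
        # memory; the shared memory keeps read permission only.
        "prev": (shared["prev"][0], "read"),
        "curr": (shared["curr"][0], "read"),
    }
    return True, new_shared


free = {"l": 0, "prev": (2, "full"), "curr": (3, "full")}
ok, held = try_acquire(free)
ok_again, still_held = try_acquire(held)
```

Both outcomes of try_acquire leave a state that satisfies invariant, which is the sense in which a strong memory instruction re-establishes m while moving between its cases.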
Moore and Grossman[14] present a type system for spawning new threads in transactional programs and prove the widely held belief that if each mutable memory location is used either outside transactions or inside transactions (but not both), then strong and weak atomicity are indistinguishable. The semantics of transactions is high-level and small-step, in a λ-calculus. But the TM system allows at most one thread to execute a transaction at a time. Our previous work[15] presents a framework for verifying concurrent programs using transactional memory. It focuses on the verification of invariants over the shared memory. However, its machine model abstracts too much: it concentrates all the operations of a transaction commit into a single instruction, losing sight of the various actual implementations and their related properties.

In this paper we extend our previous framework by refining the big commit operation. Like the previous one, it is a lazy software TM system with lazy conflict detection and lazy version management. The difference is that our framework is based on fundamental instructions and uses lightweight locks for memory privatization. In the framework we focus on the verification of shared memory properties. We specify a global invariant on shared memory for each program using this model, and use a combination of CSL and PASL to verify that the global invariant always holds. However, some important correctness criteria, such as opacity, cannot be verified in our framework because of the limits of the model. In the future, we plan to extend our TM system with nested transactions (open-nested transactions in particular) for certifying concurrent programs, and even with weak atomicity semantics.

Acknowledgment We would like to thank Prof. Zhong Shao (Yale University) and the anonymous reviewers for their inspiring discussions and suggestions on this paper.

References

[1] Herlihy M, Moss J E B. Transactional memory: Architectural support for lock-free data structures. In Proc. the 20th Annual International Symposium on Computer Architecture (ISCA 1993), San Diego, US, May 1993, pp.289-300.
[2] Hammond L, Wong V, Chen M et al. Transactional memory coherence and consistency. In Proc. the 31st Annual International Symposium on Computer Architecture (ISCA 2004), München, Germany, Jun. 19-23, 2004, p.102.
[3] Ananian C S, Asanovic K, Kuszmaul B C et al. Unbounded transactional memory. In Proc. the 11th International Symposium on High-Performance Computer Architecture (HPCA 2005), San Francisco, US, Feb. 12-16, 2005, pp.316-327.
[4] Moore K E, Grossman D. Log-based transactional memory. In Proc. the Twelfth International Symposium on High-Performance Computer Architecture, Austin, USA, Feb. 11-15, 2006, pp.254-265.
[5] Shavit N, Touitou D. Software transactional memory. In Proc. the 14th Annual ACM Symposium on Principles of Distributed Computing (PODC 1995), Ottawa, Canada, Aug. 20-23, 1995, pp.204-213.
[6] Harris T, Fraser K. Language support for lightweight transactions. In Proc. the 18th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2003), Anaheim, USA, Oct. 26-30, 2003, pp.388-402.
[7] Saha B, Adl-Tabatabai A R, Hudson R L, Minh C C, Hertzberg B. McRT-STM: A high performance software transactional memory system for a multi-core runtime. In Proc. the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2006), New York, USA, Mar. 29-31, 2006, pp.187-197.
[8] Herlihy M, Luchangco V, Moir M, Scherer W N III. Software transactional memory for dynamic-sized data structures. In Proc. the 22nd Annual Symposium on Principles of Distributed Computing (PODC 2003), Boston, USA, July 13-16, 2003, pp.92-101.
[9] Dice D, Shalev O, Shavit N. Transactional locking II. In Proc. International Symposium on Distributed Computing, Stockholm, Sweden, Sept. 18-20, 2006, pp.194-208.
[10] Felber P, Fetzer C, Riegel T. Dynamic performance tuning of word-based software transactional memory. In Proc. the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2008), Salt Lake City, USA, Feb. 20-23, 2008, pp.237-246.
[11] Kumar S, Chu M, Hughes C J, Kundu P, Nguyen A. Hybrid transactional memory. In Proc. the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2006), New York, USA, Mar. 29-31, 2006, pp.209-220.
[12] Liblit B. An operational semantics for LogTM. Technical Report 1571, University of Wisconsin-Madison, August 2006.
[13] Martin M, Blundell C, Lewis E. Subtleties of transactional memory atomicity semantics. IEEE Computer Architecture Letters, 2006, 5(2): 17.
[14] Moore K F, Grossman D. High-level small-step operational semantics for transactions. In Proc. the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2008), Salt Lake City, USA, Feb. 20-22, 2008, pp.51-62.
[15] Li L, Zhang Y, Chen Y, Li Y. Certifying concurrent programs using transactional memory. Journal of Computer Science and Technology, Jan. 2009, 24(1): 110-121.
[16] O'Hearn P W. Resources, concurrency, and local reasoning. Theor. Comput. Sci., 2007, 375(1-3): 271-307.
[17] Bornat R, Calcagno C, O'Hearn P, Parkinson M. Permission accounting in separation logic. In Proc. the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 2005), Long Beach, USA, Jan. 12-14, 2005, pp.259-270.
[18] The Coq proof assistant reference manual. Coq release v8.1, Coq Development Team, October 2006.
[19] Yu D, Shao Z. Verification of safety properties for concurrent assembly code. In Proc. the 2004 ACM SIGPLAN International Conference on Functional Programming (ICFP 2004), Snow Bird, USA, Sept. 19-21, 2004, pp.175-188.
[20] Feng X, Shao Z. Modular verification of concurrent assembly code with dynamic thread creation and termination. In Proc. the 2005 ACM SIGPLAN International Conference on Functional Programming (ICFP 2005), Tallinn, Estonia, Sept. 26-28, 2005, pp.254-267.
[21] Li Y, Zhang Y, Chen Y, Fu M. On the verification of strong atomicity of programs using STM. In Proc. the 3rd IEEE Int. Conf. Secure Software Integration and Reliability Improvement (SSIRI 2009), Shanghai, China, July 8-10, 2009, pp.123-131.
[22] Reynolds J C. Separation logic: A logic for shared mutable data structures. In Proc. the 17th Annual IEEE Symposium on Logic in Computer Science (LICS 2002), Copenhagen, Denmark, July 22-25, 2002, pp.55-74.
[23] Feng X, Shao Z, Vaynberg A, Xiang S, Ni Z. Modular verification of assembly code with stack-based control abstractions. In Proc. the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2006), Ottawa, Canada, Jun. 10-16, 2006, pp.401-414.
[24] Wright A K, Felleisen M. A syntactic approach to type soundness. Information and Computation, 1994, 115(1): 38-94.
[25] Li Y. Coq implementation for formal reasoning about concurrent programs using a lazy-STM system. http://ssg.ustcsz.edu.cn/content/formal-reasoning-about-lazy-stm-programs.
[26] Jones C B. Tentative steps toward a development method for interfering programs. Transactions on Programming Languages and Systems, 1983, 5(4): 596-619.
[27] Brookes S. A grainless semantics for parallel programs with shared mutable data. Electron. Notes Theor. Comput. Sci., 2006, 155: 277-307.


Yong Li is currently a Ph.D. candidate in the Department of Computer Science & Technology, University of Science & Technology of China (USTC). He received his B.E. degree in computer science from USTC in 2005. His research interests involve language-based software security, concurrent program verification and transactional memory.

Yu Zhang received the M.E. degree in computer science from Hefei University of Technology in 1996, and the Ph.D. degree in computer science from USTC in 2004. She is currently an associate professor in the Department of Computer Science & Technology at USTC. Her research interests include the theory and implementation of programming languages, especially techniques for designing and implementing parallel programming languages, concurrent program analysis and verification, and Just-in-Time compiler assisted garbage collection. She is a member of China Computer Federation.

Yi-Yun Chen is a professor in the Department of Computer Science & Technology, University of Science & Technology of China. He received his M.S. degree from East-China Institute of Computer Technology in 1982. His research interests include applications of logic (including formal semantics and type theory), techniques for designing and implementing programming languages, and software safety and security. He is a member of China Computer Federation.

Ming Fu is currently a Ph.D. candidate in the Department of Computer Science & Technology, USTC. He received his B.E. degree in computer science from USTC in 2006. His research interests involve language-based software security, concurrent program verification and transactional memory.

Appendix

Here is the proof sketch of soundness.

Lemma A.1 (Progress). For any W = (C, M, [T1, ..., Tn]), if φ, [a1, ..., an] ⊢ W, then for any thread Ti, there exist M′ and Ti′ such that Next(pc,ι)(M, Ti) = (M′, Ti′).

Proof Sketch. For any thread Ti = (R, pc, X), the global shared memory M can be divided into three parts Ms, Mi and Mī, where Ms is the logical shared memory and Mi is Ti's logical private memory. The proof is by induction over the structure of C(pc). In the cases where the instruction is a weak memory instruction, thread i can always make a step by the definition of the operational semantics (Next(pc,ι)(Mi, Ti) = (Mi′, Ti′)); the side conditions for making a step are established by the rule W-INSN. So with the extended memory Ms ⊎ Mī, thread i can also make a step Next(pc,ι)(M, Ti) = (M′, Ti′), where M′ = Ms ⊎ Mi′ ⊎ Mī. In the cases where the instruction is a strong memory instruction, it needs to access the logical shared memory. Thread i can always make a step by the definition of the operational semantics (Next(pc,ι)(Ms ⊎ Mi, Ti) = (Ms′ ⊎ Mi′, Ti′)); the side conditions for making a step are established by the rule S-INSN. So with the extended memory Mī, thread i can also make a step Next(pc,ι)(M, Ti) = (M′, Ti′), where M′ = Ms′ ⊎ Mi′ ⊎ Mī. □

Lemma A.2 (Preservation). If φ, [a1, ..., an] ⊢ W and W ⟶ W′, then there exist a1′, ..., an′ such that φ, [a1′, ..., an′] ⊢ W′.

Proof Sketch. By the assumption φ, [a1, ..., an] ⊢ W and the inversion of the rule WORLD, we know that: 1) W = (C, M, [T1, ..., Tn]), φ = (m, [ψ1, ..., ψn]), Ti = (Ri, pci, Xi) and M = Ms ⊎ M1 ⊎ ··· ⊎ Mn; 2) m Ms; 3) Precise(m); 4) ak (Mk, Rk, pck, Xk) for any k; 5) ψk, m ⊢ C : ψk for any k; 6) ψk, m ⊢ {ak} pck : C[pck] for any k. Suppose the step W ⟶ W′ is caused by thread i; then W′ = (C, M′, [T1, ..., Ti′, ..., Tn]), where M′ = Ms′ ⊎ M1 ⊎ ··· ⊎ Mi′ ⊎ ··· ⊎ Mn. We prove the lemma by induction over the structure of C(pci). Here we give the detailed proof only for the cases where the instruction is a strong memory instruction; the other cases are similar. According to the rule WORLD, we choose ai′ = ψi(Npc(pc,ι)(Ms ⊎ Mi, Ti)) and aī′ = aī. Then, to prove φ, [a1, ..., ai′, ..., an] ⊢ W′, we need to prove the following: 1) m Ms′; 2) ai′ (Mi′, Ri′, pci′, Xi′); 3) ψi, m ⊢ {ai′} pci′ : C[pci′], where pci′ = Npc(pc,ι)(Ms ⊎ Mi, Ti). According to the rule S-INSN, we can obtain (m ∗ ai′)(Ms′ ⊎ Mi′, Ri′, pci′, Xi′) from hypothesis 4), and then we can prove subgoals 1) and 2). According to hypothesis 5), we get ψi, m ⊢ C : ψi, and then, using the inversion of rule CDHP, we know that ∀(pc, a) ∈ ψi. ψi, m ⊢ {a} pc : C[pc]. Next we can prove subgoal 3) because (pci′, ai′) ∈ ψi. With these subgoals finished, we can finally prove that φ, [a1, ..., ai′, ..., an] ⊢ W′. □

Theorem A.3 (Soundness). If φ, [a1, ..., an] ⊢ W, then for any m ≥ 0, there exist a world W′ and a1′, ..., an′ such that W ⟶^m W′ and φ, [a1′, ..., an′] ⊢ W′.

Proof Sketch. Given the progress and the preservation lemmas, this theorem can be easily proved by induction over m.
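The induction over m mentioned in the last proof sketch can be spelled out in an abstract setting. The following Lean 4 sketch is an illustration only and is not the Coq development of [25]: it assumes an arbitrary type W of worlds, a step relation step, and a single well-formedness predicate wf that stands in for the judgment φ, [a1, ..., an] ⊢ W with the thread predicates hidden. The names StepN and soundness are hypothetical.

```lean
-- Abstract m-step relation built from a one-step relation.
inductive StepN {W : Type} (step : W → W → Prop) : Nat → W → W → Prop
  | zero (w : W) : StepN step 0 w w
  | succ {m : Nat} {w w' w'' : W} :
      step w w' → StepN step m w' w'' → StepN step (m + 1) w w''

-- Progress and preservation together give m-step soundness.
theorem soundness {W : Type} (step : W → W → Prop) (wf : W → Prop)
    (progress : ∀ w, wf w → ∃ w', step w w')
    (preservation : ∀ w w', wf w → step w w' → wf w') :
    ∀ (m : Nat) (w : W), wf w → ∃ w', StepN step m w w' ∧ wf w' := by
  intro m
  induction m with
  | zero =>
    intro w h
    exact ⟨w, StepN.zero w, h⟩
  | succ k ih =>
    intro w h
    -- Progress yields one step; preservation keeps the result well-formed.
    have ⟨w1, hs⟩ := progress w h
    have ⟨w2, hn, hw⟩ := ih w1 (preservation w w1 h hs)
    exact ⟨w2, StepN.succ hs hn, hw⟩
```

The base case uses zero steps, and the inductive case combines one step from progress with the induction hypothesis applied to the successor world, whose well-formedness comes from preservation.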