A Language-Theoretic Approach to Specifying and Verifying Multiprocessor Cache Coherence Protocols

Dennis Abts          David J. Lilja†          Steve Scott
[email protected]         [email protected]          [email protected]

Cray Inc., Chippewa Falls, Wisconsin 54729
† University of Minnesota, Electrical and Computer Engineering, Minnesota Supercomputing Institute, Minneapolis, Minnesota 55455
Abstract
It is common practice in computer architecture to construct an abstract model of the system (for example a cycle-accurate C++ model such as SimpleScalar) to explore performance trade-offs of proposed architectural features. Likewise, an abstract model can be used to verify the correctness of an architectural feature. However, while the accuracy of the performance model is important, it is not a necessity for its usefulness. For example, if a performance simulator accurately represents the machine within 3% of the actual hardware, it is still a very useful tool for the design process and the resulting system is likely to be commercially viable. On the other hand, if the verification model does not depict the system accurately, it is possible (if not likely) that the abstract model will deem the architecture as sound while the implementation could be incorrect. Such functional inconsistencies can hide serious hardware errors, possibly rendering the system commercially impotent. Such dire consequences motivate us to develop a more rigorous method for ensuring the correctness of an architectural feature at the abstract level as well as the implementation level.
Computer architects have often used trace-driven simulations to evaluate the performance of new architectural features. However, the verification of these features often proves to be more difficult than their specification and implementation. This paper proposes a novel approach for extending the basic idea of trace-driven simulation to automatically verify a complex cache coherence protocol for large-scale multiprocessor systems. A formal model of the coherence protocol is specified as several human-readable text files, which are automatically verified using the Murφ formal verification tool. A formal execution trace is extracted during the verification process and re-encoded to provide the input stimulus for a simulation of the actual implementation model, which is typically specified in a standard hardware description language. This approach provides a rigorous connection between the architectural specification and the verification of the low-level implementation. The feasibility of this approach is demonstrated by using it to verify the coherence protocol of the Cray SV2.
1 Introduction
Figure 1: An abstraction of the hardware in a shared-memory multiprocessor. Control of the memory system is carried out by finite-state machines (FSM) at each level of the memory hierarchy that exchange messages using micropackets over a reliable communication medium.
Distributed Shared Memory (DSM) architectures [1, 2] provide a flexible programming model, allowing the programmer to treat the memory system as a single large, logically shared memory, while still being capable of scaling to large processor counts. This programming abstraction, however, comes at the expense of additional hardware complexity to handle the implicit transfer of data as it migrates through the extended memory hierarchy that spans from the load-store unit of a given processor through multiple levels of cache, and possibly across multiple nodes which communicate over an interconnection network (Figure 1). This extended memory hierarchy must be kept consistent by ensuring that writes (stores) are propagated through the memory hierarchy. This cache coherence problem has motivated software solutions [3, 4, 5] as well as hardware solutions [6, 7, 8, 9, 10]. This paper extends our prior work [11] focusing on the specification and verification of directory-based cache coherence hardware for scalable multiprocessors. The design of an efficient coherence protocol is extremely challenging and, as Lenoski and Weber [12] point out, “. . . unfortunately, the verification of a highly parallel coherence protocol is even more challenging than its specification.” In this work, we treat the specification and verification of the coherence protocol as two sides of the same coin.
1.1 Motivation
To verify the coherence protocol at an abstract level, we used the Murφ formal verification environment [13]. The coherence protocol is specified as several human-readable text files. These files are then read by a protocol compiler to automatically generate the finite-state machine descriptions in the Murφ description language. The Murφ compiler is then used to create the intermediate C++ description, which is compiled and linked into the protocol verifier. There are several ways to attack the subsequent verification of the actual implementation. One possible approach is to construct a testbench around the actual hardware and write some pseudorandom and directed diagnostics to attempt to expose implementation flaws. This is a useful exercise and will undoubtedly uncover some implementation errors. However, enumerating all the possible event orderings and “interesting cases” is extremely difficult and time consuming. We feel a better approach is to refine the abstracted formal verification model so that the results from the abstracted verification can be used to verify the implementation. The Murφ formal verification system used to verify the architectural soundness of the coherence protocol will enumerate all reachable states and will search the state space to establish the correctness of the properties specified, or it will show a counter-example. This search process can be conducted in three possible ways: 1) breadth-first, 2) depth-first, or 3) random. We discovered that if we made trivial modifications to the Murφ source code, we could observe the nondeterministic firing of rules as the finite-state machines at each pseudo-node interact. We developed a method for recording these events during a depth-first search of the state space in Murφ, making it possible to reproduce these interactions on the actual hardware. We refer to the set of events from the start state, S0, to a leaf node as a witness string since it “witnesses” the execution of the formal verification model. In a formal sense, we can say that the set of all witness strings accepted by the formal model defines the language, L, accepted by the coherence protocol. If the language is also accepted by the implementation verification, then we have a rigorous connection between the verified architectural model and its implementation.
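To make the recording step concrete, the following minimal sketch shows an explicit-state depth-first search that pushes each rule firing onto the current path and emits the path as a witness string when it reaches a leaf of the computation tree. It is an illustration only, not Murφ's actual internals: State, Rule, enabled_rules, and apply_rule are hypothetical stand-ins for the model that the protocol compiler generates.

    // Minimal sketch (not Murphi internals): an explicit-state depth-first search
    // that records the sequence of rule firings from the start state to each leaf.
    #include <cstdio>
    #include <set>
    #include <string>
    #include <vector>

    struct Rule  { std::string name; };                 // a guarded command in the model
    struct State {
        int value = 0;                                  // toy global state
        bool operator<(const State& o) const { return value < o.value; }
    };

    // Enabled rules and their effect; in the real tool these come from the
    // compiled protocol description.
    std::vector<Rule> enabled_rules(const State& s) {
        if (s.value < 3) return {{"fire_a"}, {"fire_b"}};
        return {};                                       // leaf: no eligible rules
    }
    State apply_rule(const State& s, const Rule& r) {
        return {s.value + (r.name == "fire_a" ? 1 : 2)};
    }

    std::set<State> visited;

    // DFS that prints one witness string per leaf of the computation tree.
    void dfs(const State& s, std::vector<std::string>& witness) {
        if (!visited.insert(s).second) return;           // previously seen state: not a new node
        std::vector<Rule> rules = enabled_rules(s);
        if (rules.empty()) {                             // leaf: emit the witness string
            for (const auto& ev : witness) std::printf("%s ", ev.c_str());
            std::printf("\n");
            return;
        }
        for (const auto& r : rules) {                    // nondeterministic choices
            witness.push_back(r.name);                   // record the rule firing
            dfs(apply_rule(s, r), witness);
            witness.pop_back();
        }
    }

    int main() {
        std::vector<std::string> witness;
        dfs(State{}, witness);                           // start state S0
        return 0;
    }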
The design process can be viewed as a multi-stage process where each successive stage has increasing levels of detail. In general, the design process has the following stages:

1. functional specification,
2. design specification,
3. implementation in register-transfer language (RTL),
4. structural implementation (synthesis to logic gates), and
5. physical layout and fabrication.
The verification process can then be viewed as a comparison of successive stages to ensure that each stage is an accurate refinement of its predecessor; that is, the successor stage encapsulates the same functionality as its predecessor stage. We can dichotomize this process into logical design (stages 1 through 3) and physical design (stages 4 and 5), with the design synthesis step that maps RTL constructs into logic gates occurring between stages 3 and 4. We concern ourselves only with the logical design stages of this process, stages 1 through 3. The functional specification is a high-level description of the system that describes architectural features such as the instruction set architecture, the memory consistency model, and the cache coherence protocol. For example, the functional specification of the cache coherence protocol consists of a text description of the state transition tables. The design specification provides more detail about how the design will be implemented; for instance, cache organization, FIFO sizes, and directory pointer structure, to name just a few. Finally, the implementation is given by an RTL description in a hardware description language (HDL) such as Verilog or VHDL. Verification occurs in two stages:

1. architectural verification to ensure that the functional specification is “correct” and the design specification accurately reflects the functional specification, and

2. implementation verification to ensure that the RTL implementation correctly encapsulates the design specification.
The architectural verification compares stages 1 and 2 of the design process, and is carried out early in the design cycle using an abstracted model of the system. The implementation verification, on the other hand, compares stages 2 and 3 and occurs much later in the design process using conventional discrete-event logic simulation tools. It is no secret that implementation verification is error prone and very time-consuming, usually constituting the longest item in the design schedule. Furthermore, the growing complexity and sheer enormity of integrated circuits is exacerbating the problem.

1.2 Overview

The input to the coherence protocol verification process is a formal specification of the cache coherence protocol for each level of the memory hierarchy, as shown by the example in Figure 2. The objective is to show that the coherence protocol is architecturally sound by satisfying the correctness properties outlined in Section 2.2. Once we have established the correctness of the protocol at an abstract level, we then would like to show that its implementation is also correct.

1.3 Paper Organization

In Section 2 we describe the theoretical underpinnings of our approach in terms of a formal specification and formal model. Then, in Section 3 we describe the methodology we used to specify the Cray SV2 cache coherence protocol, and present our early experiences and results for this system in Section 4. Finally, we give a summary of related work and conclusions.

2 Formal Language Framework

We propose a framework for verifying memory coherence at the architectural level and a method for refining the formal verification model to automatically verify the RTL implementation. We begin by defining several correctness properties of the system being developed using a formal logic. We then describe a formal model of computation and show how to use the Murφ verification system to construct this formal model. Then, as the formal model executes, we record the “witness strings” that define a language accepted by the formal model. Finally, we describe how to encode the witness strings into a practical representation that can be executed using conventional logic simulators.
2.1 Coherence Protocol Specification
Figure 2: An example coherence protocol specification for an L1 cache.
    Q  = {Invalid, Exclusive, Shared, Pending}
    F  = {Invalid, Exclusive, Shared}
    q0 = Invalid
    Σ  = {PrRead, PrWrite, Inval, ReadResp, GrantExcl}

    q ∈ Q       x ∈ Σ       δ(q, x)     z (action)
    Invalid     PrRead      Pending     L2(L1ReadReq)
    Invalid     PrWrite     Pending     L2(GetExclusive)
    Invalid     Inval       Invalid     L2(InvalAck)
    Invalid     ReadResp    —           Error(UnexpectedMsg)
    Invalid     GrantExcl   —           Error(UnexpectedMsg)
    Exclusive   PrRead      Exclusive   P(ReadResp)
    Exclusive   PrWrite     Exclusive   P(WriteComplete)
    Exclusive   Inval       Invalid     L2(InvalAck)
    Exclusive   ReadResp    —           Error(UnexpectedMsg)
    Exclusive   GrantExcl   —           Error(UnexpectedMsg)
    Shared      PrRead      Shared      P(ReadResp)
    Shared      PrWrite     Pending     L2(GetExclusive)
    Shared      Inval       Invalid     L2(InvalAck)
    Shared      ReadResp    —           Error(UnexpectedMsg)
    Shared      GrantExcl   —           Error(UnexpectedMsg)
    Pending     PrRead      Pending     Block(PrRead)
    Pending     PrWrite     Pending     Block(PrWrite)
    Pending     Inval       Pending     Block(Inval)
    Pending     ReadResp    Shared      P(ReadResp)
    Pending     GrantExcl   Exclusive   P(WriteComplete)

    Message     Virtual Channel
    PrRead      vc0
    PrWrite     vc0
    Inval       vc1
    ReadResp    vc1
    GrantExcl   vc1

At each level of the memory hierarchy the hardware is controlled by a finite-state machine (FSM) that governs the state transitions specified by the coherence protocol. The different levels of the memory hierarchy exchange data using micropackets, which provide efficient and reliable point-to-point communication between chips (Figure 1). We begin by formally defining an FSM by the 5-tuple:

    FSM = (Q, Σ, δ, q0, F)                                    (1)

where Q is a finite set of states, Σ is a finite input alphabet, q0 ∈ Q is the initial state, F ⊆ Q is the set of final states, and δ is the transition function mapping Q × Σ to Q. The transition function δ(q, x) takes as its arguments the current state, q ∈ Q, and an input symbol, x ∈ Σ, to produce a new state in Q. We extend the definition of δ slightly, so that the transition function may produce zero or more actions, z ∈ δ(q, x), based on the current state q and input message x.

The coherence protocol is specified as a set of tables describing the operation of each FSM in the memory hierarchy. These cooperating FSMs encapsulate the rules that govern the protocol. Each of the caches has some state (Q) that describes its access permission at any point in time. To see how this works, we give a partial specification for the L1 data cache FSM of a hypothetical directory-based protocol (Figure 2). The set of states, Q, is {Invalid, Exclusive, Shared, Pending}, of which only {Invalid, Exclusive, Shared} are final states F; the Pending state is a transient state and therefore not a member of F. The initial state of the cache, q0, is Invalid. The input alphabet Σ is given by the set {PrRead, PrWrite, Inval, ReadResp, GrantExcl}. The L1 controller receives read and write requests (i.e., PrRead and PrWrite, respectively) from the processor. In addition, it receives Inval requests from the L2 controller as well as ReadResp and GrantExcl responses. The ReadResp response returns the data from the requested address, whereas the GrantExcl response returns the cache line indicating that write permission was granted by the directory for the requested cache block. The action, z, is described in terms of the messages (i.e., micropackets) that flow between the components in the memory hierarchy. For instance, the action L2(L1ReadReq) means: send an L1ReadReq request to the L2 controller. Similarly, the response P(ReadResp) is passed back to the processor after the read request has been satisfied. The actions Block(msg) and Error(msg) are special in that they handle the blocking of subsequent requests when a cache line is currently in a Pending state, and error handling if an unexpected message is received, respectively. The cache coherence protocol is specified as a set of these transition tables, one for each level in the memory hierarchy. Each table is decomposed into a set of FSMs according to which virtual communication channels are being used. For instance, in Figure 2, incoming processor requests (PrRead and PrWrite messages) flow on virtual channel 0 (vc0) and L2 responses (ReadResp and GrantExcl) flow on virtual channel 1 (vc1). So, two concurrent FSMs are constructed: one to handle incoming processor requests on vc0 and another to handle L2 responses on vc1. This decomposition step is applied to the protocol specification at each level of the memory hierarchy, creating a set of FSMs that interact in a producer-consumer fashion using the virtual network as the communication medium.
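As a concrete illustration of how such a table can be encoded, the following C++ sketch (hypothetical; it is neither the SV2 protocol nor the code generated by the protocol compiler) represents a few rows of the Figure 2 table as a lookup from (state, input message) to a next state and a list of actions. Entries missing from the table fall through to Error(UnexpectedMsg), anticipating the “Default” case discussed in Section 2.2.

    // Hypothetical sketch: part of the Figure 2 L1 transition table encoded as a
    // lookup from (state, input message) to (next state, actions).
    #include <cstdio>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    enum State { Invalid, Exclusive, Shared, Pending };
    enum Msg   { PrRead, PrWrite, Inval, ReadResp, GrantExcl };

    struct Transition {
        State next;                          // delta(q, x)
        std::vector<std::string> actions;    // z, e.g. "L2(L1ReadReq)"
    };

    // Selected Invalid and Pending rows of the table in Figure 2.
    static const std::map<std::pair<State, Msg>, Transition> l1_table = {
        {{Invalid, PrRead},    {Pending,   {"L2(L1ReadReq)"}}},
        {{Invalid, PrWrite},   {Pending,   {"L2(GetExclusive)"}}},
        {{Invalid, Inval},     {Invalid,   {"L2(InvalAck)"}}},
        {{Pending, PrRead},    {Pending,   {"Block(PrRead)"}}},
        {{Pending, ReadResp},  {Shared,    {"P(ReadResp)"}}},
        {{Pending, GrantExcl}, {Exclusive, {"P(WriteComplete)"}}},
    };

    // Apply one input symbol; entries absent from the table are treated as
    // Error(UnexpectedMsg), mirroring a "Default" case in the FSM encoding.
    State step(State q, Msg x) {
        auto it = l1_table.find({q, x});
        if (it == l1_table.end()) {
            std::puts("Error(UnexpectedMsg)");
            return q;
        }
        for (const auto& a : it->second.actions) std::puts(a.c_str());
        return it->second.next;
    }

    int main() {
        State q = Invalid;              // q0 = Invalid
        q = step(q, PrRead);            // -> Pending, emits L2(L1ReadReq)
        q = step(q, ReadResp);          // -> Shared,  emits P(ReadResp)
        return q == Shared ? 0 : 1;
    }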
2.2 Formal Correctness Properties

Showing that a cache coherence protocol is correct is nontrivial, as there are many aspects to “correctness,” and the protocol state space is very large. Our approach is to formally model the protocol and prove that a collection of well-defined, fundamental properties holds over the state space. We expect that most coherence protocols would require similar properties. The properties make no assumptions about the detailed implementation of the protocol. Rather, we use generic predicates and functions¹ to describe the state of the caches, directory, and interconnection network.

Data Coherence

A memory system is coherent if the value returned by a load is always the value from the “latest” store to the same memory location [14, 15]. On the surface this notion of data coherence appears vague, so the point bears some elaboration. As a memory request propagates through the memory hierarchy, hardware components, such as arbiters and buffers, impose a serial ordering on all the memory operations to the same address. This linearization of memory events provides context for the word “latest” in the definition of memory coherence.

¹ Predicates are designated by bold typeface and evaluate to a logical true or false. Functions return a value and are set in sans serif typeface.
We indirectly capture the notion of data coherence by making some assertions about the state of the memory directory and caches.

Property 1 If an address, a, is in the “noncached” state at the directory and there are no messages, m, in flight from processor p to Dir(Home(a)), then processor p must have address a in an invalid state.

    ∀a ∀m ∀p   Noncached(Dir(Home(a))) ∧ ¬InFlight(m, p, Home(a)) ⇒ Invalid(a, p)

where a is an address, p is a processor cache, and m is a message. The function Home(a) returns the identity of the memory directory responsible for managing the address a. Likewise, the function Dir(d) returns the state of the memory directory (access permission and sharing set) for a given memory directory d.

Property 2 If an address, a, is present in cache, p, then it must be included in the sharing set by the directory.

    ∀a ∀p   Present(a, p) ⇒ SharingSet(Dir(Home(a)), p)

The SharingSet predicate returns true if the memory directory knows that address, a, is present in cache, p. Put another way, the set of caches with address a present is a subset (⊆) of the sharing set at the directory. While these two properties do not explicitly address the read-the-latest-write aspect of memory coherence, they do ensure that the memory directory is properly maintaining the sharing set, an essential ingredient for memory coherence. Property 2 allows a cache line to be tracked by the memory directory even if it is no longer present in the cache. For instance, if a cache line is evicted there will be some transient time between the eviction notice being sent and the memory directory removing the cache from the sharing set. As such, the cache could receive a “phantom” invalidate from the directory for a cache line that is no longer present.

Forward Progress

Ensuring forward progress requires every memory request to eventually receive a matching response. Since all coherent memory transactions occur using request-response message pairs, we can exploit this fact by formally stating:

Property 3 Each request must have a satisfying response.

    ∀x   Request(x) ⇒ ∃y   Response(y) ∧ Satisfies(y, x)

Moreover, the forward progress property (Property 3) encapsulates the notion of deadlock and live-lock avoidance by requiring each request to eventually receive a matching response. Deadlock is the undesirable condition where it is impossible to transition out of the current global state. Live-lock, on the other hand, is a cycle of states that prevents forward progress. The predicates Request(x) and Response(y) evaluate to a logical true if x is a request and y is a response, respectively. Similarly, the predicate Satisfies(y, x) evaluates to a logical true if y satisfies x. For example, the predicate Satisfies(y, x) would consult the transition relation for the coherence protocol to determine if y was an expected response to request x. Clearly, this property ensures forward progress by ensuring that a request is never starved or indefinitely postponed.

However, the forward progress property makes no claims about fairness. For instance, it says nothing about the distribution of service times or whether requests are serviced in an equitable manner. These are more implementation-specific properties that deal with the performance of the memory system and not its correctness.

Exclusivity

The coherence protocol enforces some access permissions over the shared memory to ensure that there are never two or more processors with “exclusive” (write) access to the same memory block. This single-writer property can be stated as:

Property 4 Two different caches, p and q, should never have write access to the same address, a, at the same time.

    ∀a ∀p   IsDirty(a, p) ⇒ ∀q≠p   ¬IsDirty(a, q)

This property ensures that no two processors p and q are able to have memory block a in their local memory hierarchy in the “dirty” state² at the same time.

Unexpected Messages

If the coherence protocol is not fully specified, it is possible for an FSM to receive an input message for which there is no corresponding entry in the specification table. For example, consider the following example encoding of an FSM:

    ...
    Case State = Dirty
        Case InMsg = PrRead
            send(P, ReadResp)
        Case InMsg = PrWrite
            UpdateCache()
            send(P, WriteComplete)
        ...
        Default
            Error(UnexpectedMsg)
    Case State = Shared
    ...

The “Default” case is used to trap unexpected messages, which are probably the result of an oversight in the protocol specification or some corner case that the protocol designer overlooked.

² Some protocols use the terms “dirty” or “exclusive” state. We assume that the predicate will return true if the cache line is either dirty or exclusive.
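Taken together, these properties can be evaluated mechanically at every reachable configuration. The following is a small, self-contained sketch of how the sharing-set inclusion (Property 2) and single-writer (Property 4) checks might be written over a snapshot of the global state; the GlobalState layout is invented for the example and is not the state encoding used by the formal model.

    // Hypothetical global-state snapshot and invariant checks for Properties 2 and 4.
    #include <cassert>
    #include <set>

    constexpr int kProcs = 4;

    struct CacheLine { bool present = false; bool dirty = false; };

    struct GlobalState {
        // One line per processor for a single address a (toy model).
        CacheLine cache[kProcs];
        std::set<int> sharing_set;     // directory's view: which caches hold a
    };

    // Property 4: at most one cache may hold address a dirty.
    bool single_writer(const GlobalState& s) {
        int writers = 0;
        for (int p = 0; p < kProcs; ++p) writers += s.cache[p].dirty ? 1 : 0;
        return writers <= 1;
    }

    // Property 2: every cache with a present must appear in the directory's sharing set.
    bool sharing_set_inclusion(const GlobalState& s) {
        for (int p = 0; p < kProcs; ++p)
            if (s.cache[p].present && s.sharing_set.count(p) == 0) return false;
        return true;
    }

    int main() {
        GlobalState s;
        s.cache[1] = {true, true};     // processor 1 holds the line dirty
        s.sharing_set = {1};
        assert(single_writer(s) && sharing_set_inclusion(s));

        s.cache[2] = {true, true};     // a second writer violates Property 4
        s.sharing_set.insert(2);
        assert(!single_writer(s));
        return 0;
    }

In the formal model such checks are applied as each new configuration is produced, so a violation is reported together with the configuration that caused it.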
2.3 Formal Model of Computation
After having defined the necessary properties for the correctness of a cache coherence protocol, we next develop a methodology for automatically verifying that these properties hold true for a given implementation of a protocol specification. Our approach has its genesis in the theory of NP-completeness taken from computational complexity theory. Briefly, all problems that belong to the class NP possess the attribute of polynomial-time verifiability. That is, if we magically knew the solution to an NP-hard problem we could then verify (i.e., check the correctness of) the solution in polynomial time. This notion of a witness string or verification certificate
is the basis of our approach. The idea behind a “verification certificate” or a “witness string” (they are the same idea) is that we can take this solution and verify it in polynomial time (although it may require exponential time to generate the witness string). If the witness string is generated by an abstract formal model, we have demonstrated that it can be generated in a “reasonable” (several days of compute time) amount of time. Then, the logic implementation running in a discrete-event simulator would execute the witness string in polynomial time. Without the witness string, exploring the state space of the implementation would be intractable. To see how this approach works, we first describe the formal model of computation based on a k-tape nondeterministic Turing machine (NTM) and how the coherence protocol is executed under this model. Then, we describe the formal language that this model accepts.

The Turing machine model [16, 17] is commonly used in complexity theory to characterize the run-time complexity of an algorithm. We describe a variant of this model of computation called a k-tape nondeterministic Turing machine (NTM). To be specific, the NTM is an acceptor machine which decides (accepts or rejects) whether the input is an acceptable solution. The NTM consists of a finite control unit and k tapes that are used for storage (Figure 3). The NTM is defined as the 6-tuple

    NTM = (Q, Σ, Γ, δ, q0, F)                                    (2)

where Q is a finite set of states, Σ is the input alphabet, and Γ is the tape alphabet, which includes a special blank (□) symbol, with □ ∈ Γ and Σ ⊆ Γ. The transition function, δ, is augmented with L and R operators to control the movement of the tape head to the left or right, respectively. The configuration, Ci, of the NTM describes the current state, tape contents, and head location. The initial configuration, C0, is given by the initial state q0, with a blank (□) symbol in each tape cell, and the tape head positioned at the leftmost cell of the tape. The NTM operates by applying rules that consist of a sequence of atomic actions described by the transition function, where each configuration Ci yields a new configuration Ci+1. Intuitively, each configuration Ci represents the global state of the system being modeled.

The formal model of the system being verified, M, is an NTM consisting of all the cooperating FSMs that describe the coherence protocol (Figure 4). Let n be the number of nodes being analyzed. Each node has 1 bit of local memory. Each cache has a single line with only a single-bit data value, requiring lg(n) tag bits to represent the cache tag. The FSMs are processes that execute when their rule condition is satisfied. For example, the rule condition for the FSM_VC0 state machine is “wait for an incoming message on the vc0 tape head.” When the condition is satisfied, the rule fires, causing the incoming message to be consumed and a blank (□) symbol to be written on the tape. The tape head is then moved one position to the right (using the R tape head operator). Any messages that are produced from this rule firing are written to the appropriate virtual channel tapes.
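The decomposition sketched in Figure 4 can be approximated in ordinary code by modeling each virtual channel tape as a FIFO queue and each FSM as a process that consumes from one queue and produces into another. The sketch below is illustrative only; the message names follow Figure 2, but the assignment of L1-to-L2 requests to vc2 and the two-level hierarchy are assumptions made for the example.

    // Illustrative producer-consumer model of the FSM / virtual-channel decomposition.
    // Each "tape" is modeled as a FIFO; each FSM consumes from one channel and may
    // emit on another.
    #include <cstdio>
    #include <queue>
    #include <string>

    struct Channel { std::queue<std::string> fifo; };    // stands in for a virtual channel tape

    // L1 FSM for vc0: consumes processor requests, forwards requests to the L2 on vc2.
    void l1_fsm_vc0(Channel& vc0, Channel& vc2) {
        while (!vc0.fifo.empty()) {                       // rule condition: message waiting on vc0
            std::string msg = vc0.fifo.front();           // consume the symbol ("blank" the cell)
            vc0.fifo.pop();
            if (msg == "PrRead")  vc2.fifo.push("L1ReadReq");
            if (msg == "PrWrite") vc2.fifo.push("GetExclusive");
        }
    }

    // L2 FSM for vc2: consumes L1 requests, produces responses for the L1 on vc1.
    void l2_fsm_vc2(Channel& vc2, Channel& vc1) {
        while (!vc2.fifo.empty()) {
            std::string msg = vc2.fifo.front();
            vc2.fifo.pop();
            if (msg == "L1ReadReq")    vc1.fifo.push("ReadResp");
            if (msg == "GetExclusive") vc1.fifo.push("GrantExcl");
        }
    }

    int main() {
        Channel vc0, vc1, vc2;
        vc0.fifo.push("PrRead");                          // the load/store unit issues a read
        l1_fsm_vc0(vc0, vc2);                             // rule firings drain vc0 and fill vc2
        l2_fsm_vc2(vc2, vc1);                             // ...then drain vc2, filling vc1
        std::printf("response on vc1: %s\n", vc1.fifo.front().c_str());
        return 0;
    }

Replacing the sequential while loops with a nondeterministic choice among all FSMs whose channels are non-empty yields the rule-firing behavior described above.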
Figure 3: A formal model of computation based on a k-tape nondeterministic Turing machine (NTM). (a) A general model of a k-tape NTM. (b) A k-tape NTM used to model the cache coherence protocol.

Figure 4: The formal model is decomposed into a set of concurrent finite-state machines (FSMs) that implement the coherence protocol. The FSMs interact by reading and writing to the virtual channel tapes (vc0, vc1, and vc2) and save their state to the state storage tape.
Figure 5: The computation tree that results from the execution of the formal model.

Figure 6: A collection of symbols forms a word, w, in the language L(M).

    σ0 = (N1, P0, L1, PrRead, X)
    σ1 = (N1, L1, L2, L1ReadReq, X)
    σ2 = (N1, L2, MD, L2ReadReq, X)
    σ3 = (N1, MD, L2, ReadExclResp, X)
    σ4 = (N1, L2, L1, L2ReadResp, X)
    σ5 = (N1, L1, P0, ReadResp, X)

Each node has a load/store unit (LSU) rule that operates as follows:

    rule LSU
    condition
    begin
        1. nondeterministically choose a load/store command to issue
        2. write the command on the vc0 virtual channel tape
    end

There will be times when multiple rule conditions are satisfied simultaneously, making multiple rules eligible for execution. Our model will nondeterministically choose an eligible rule and execute it. So, the NTM operates as follows:

    NTM(cache coherence protocol)
    do
        1. r

the NTM will have unserviced symbols (messages) on the virtual channel tapes. It is useful to note that the NTM is essentially enumerating all the reachable states of the coherence protocol. As such, this enumeration could be carried out in a depth-first or a breadth-first manner. We choose to conduct the search in a depth-first fashion. As each new configuration is produced, we can verify that the correctness properties defined in the previous section hold. If any of these correctness properties are violated, the NTM will halt in a rejecting configuration with the tape contents providing evidence of the failure. As the formal model executes, it produces a computation tree (Figure 5) where each node in the tree is a new configuration of the NTM and each arc represents a rule firing. The symbols σi ∈ {σ1, σ2, ..., σn} represent rule firings, where Ci yields Ci+1. The set of symbols σ0, σ1, ..., σi that traces a path from the starting configuration C0 to a leaf node is called a witness string for the execution of the formal model. In Figure 5, the white nodes in the tree show an example witness string. The dotted arcs represent a rule firing that does not produce a new (unique) state and is therefore not considered part of the computation tree. It is these symbols that are recorded on the witness string tape of the NTM as each rule is fired. Each symbol is really a 5-tuple describing the rule firing as
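Once recorded, a witness string like the one in Figure 6 can be checked, or re-encoded as simulator stimulus, by a simple linear replay: each 5-tuple is applied in order and validated against the protocol's transition tables. The sketch below is a hypothetical illustration of that replay loop; the tuple layout (node, source, destination, message, data) follows Figure 6, and the legal-transition set is a stand-in for a lookup into the specification tables of Section 2.1.

    // Hypothetical replay of a recorded witness string: each event is checked in
    // order (linear time in the length of the string) against a set of legal
    // transitions, standing in for the protocol specification tables.
    #include <cstdio>
    #include <set>
    #include <string>
    #include <tuple>
    #include <vector>

    // (node, source, destination, message, data) -- the 5-tuple of Figure 6.
    using Event = std::tuple<std::string, std::string, std::string, std::string, std::string>;

    // Stand-in for the specification: which (source, destination, message) hops are legal.
    static const std::set<std::tuple<std::string, std::string, std::string>> legal = {
        {"P0", "L1", "PrRead"},      {"L1", "L2", "L1ReadReq"},
        {"L2", "MD", "L2ReadReq"},   {"MD", "L2", "ReadExclResp"},
        {"L2", "L1", "L2ReadResp"},  {"L1", "P0", "ReadResp"},
    };

    bool replay(const std::vector<Event>& witness) {
        for (const Event& e : witness) {
            auto hop = std::make_tuple(std::get<1>(e), std::get<2>(e), std::get<3>(e));
            if (legal.count(hop) == 0) {
                std::printf("illegal event: %s -> %s (%s)\n", std::get<1>(e).c_str(),
                            std::get<2>(e).c_str(), std::get<3>(e).c_str());
                return false;            // reject: the event sequence diverged from the model
            }
            // In the real flow, the event would also be re-encoded here as stimulus
            // (or an expected observation) for the discrete-event simulation of the RTL.
        }
        return true;                     // accept: the witness string is in the language
    }

    int main() {
        std::vector<Event> w = {
            {"N1", "P0", "L1", "PrRead",       "X"},
            {"N1", "L1", "L2", "L1ReadReq",    "X"},
            {"N1", "L2", "MD", "L2ReadReq",    "X"},
            {"N1", "MD", "L2", "ReadExclResp", "X"},
            {"N1", "L2", "L1", "L2ReadResp",   "X"},
            {"N1", "L1", "P0", "ReadResp",     "X"},
        };
        return replay(w) ? 0 : 1;
    }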