Teapot: Language Support for Writing Memory Coherence Protocols. Satish Chandra, Brad ... Custom protocols have been used to achieve message-pass-.
Teapot: Language Support for Writing Satish Chandra,
Memory
Brad Richards,
Computer University
Sciences
and James R.Lams Department
of Wisconsin-Madison
1210 West Dayton Madison, {chandra,
exciting possibility
parallel
lams}
COISto lit an application’s
systems offer the A shared-memory
memory coherence pmto-
ming model of a global protocols dlficult
semantics and sharing patterns.
written
address space-and
constructs.
sistent view of memory.
in a conventional
Teapot,
a small,
coherence protocols.
help reduce the complexity
or modlfi.
This paper
protocols.
nearly as fast as hand-written
an application’s
to
nication.
Simple
poorly
C code. A Teapot specifica-
For
between a protocol
and
example,
invalidation
protocols
sharing,
perform
since invalidating
copies forces the consumers to re-request data,
transfer [7]. Recent shared-memory
CO1and to input for a model checking system, which permits
in a higher-level
can deter-
which requires up to four protocol messages for a small data
degrees, the flexibility
the specljication to be venjied. We report our experiences coding and venfiing several protocols written in Teapot, ing a protocol
protocol
sharing pattern leads to excessive commu-
for producer-consumer
outstanding
that run
tion can be compiled both to an executable coherence pmto-
along with measurements of the overhead incurred
sees a con-
a coherence
copies with the new value. A protocol
stores to shared data. A mismatch
much of the
and results in protocols
In general,
mine, to a large extent,”the performance of a shared-memory program, since communication occurs through loads and
language for
Teapot uses continuations of writing
static analysis in the Teapot compiler eliminates overhead of continuations
updating
coherence
language such as C are
domain-specij$c
manages the repliprogram
ensures consistency when a processor writes to a shared location either by invalidating outstanding copies or by
to implement
Unfortunately,
to write, debug, understand,
describes writing
language
coherence protocol
cation of data, to ensure that a parallel
Custom protocols have been used to achieve message-passing performance-while retaining the convenient programhigh-level
USA @cs,wisc.edu
1 Introduction
computer
of customizing
St.
WI 53706
richards,
Abstract Recent shared-memory
Coherence Protocols
architectures
of tailoring
provide,
to varying
a coherence protocol
to
better fit an application’s needs. An example of such an architecture, the one on which this work is based, is the Tempest interface, a specification of the mechanisms neces-
by writ-
language.
sary to implement
shared-memory
coherence protocols
at
user-level, i.e., as unpriviledged code running in an application program [14, 23]. With Tempest, an application programmer can either use a ready-made shared-memory
This work is supported in part by Wright Laboratory Avionics Directorate, Air Force Materiat Command, USAF, under grant #F33615-94-l-1525 and ARPA order no. B550, an NSF NYI Award CCR-9357779, NSF Grant MIP-9225097, DOE Grant DE-FG02-93ER25176, and donations from Digitat Equipment Corporation and Sun Microsystems. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon, The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Wright Laboratory Avionics Directorate or the U.S. Government.
protocol
from a library
memory protocol achieving
or construct
a customized
that suits an application’s
higher performance
shared-
needs, thereby
[11 ]. Compiler
writers
can
exploit custom coherence policies as well. For example, a recent paper showed how a custom protocol (LCM) enables a compiler implement
to use a fine-grain, copy-on-write mechanism to a language’s copy-in, copy-out semantics, with-
Permissionto make digitelhrci copy of part or atl of this work for persondl or classroom use is ranted without fee provided Umtcopies are not mada or distributed for prof!t or commercial advanta e, the copyright notice, fhe title of the ublication and its date appear, an8 notice ia given that copying is ! y parmisdon of ACM, ino. To copy otherwise, to republish, to post on asrwrs, or to redistribuk to lists, raquires prior spedfio perrrrisskm andbr a fee.
out being limited
PLDi ’965’96 PA, USA CI 1996 ACM 0-89791-795-290/0005... $3.50
willing
237
by conservative pointer analysis [18].
Unfortunately, Tempest mechanisms are low level and protocols can be difficult to write, debug, and modify. Writing coherence protocols grams)
artificially
in an ad-hoc manner
limits
or able to develop
the
number
of
custom protocols.
(as C pro-
programmers
This paper
addresses the problem
with anew
that supports both writing
language, called Teapot,
and verifying
coherence proto-
cols. By reducing the complexity of developing new protocols, Teapot increases the attractiveness of systems, such as Stanford Flash [17], Wisconsin [24], that implement protocols
The main difficulty
in writing
that all protocol
response to a protocol incoming
in
protocol
before relinquishing code for other
blocks. Action code, however, frequently blocking
(code executed
event, such as a read miss or an
to execute
communication,
cache
needs to engage in
e.g., sending a request message to
another node and awaiting a reply. To circumvent this problem, protocol programmers introduce intermediate states in their finite state machines. A protocol awaits a reply message by shifting into an intermediate state and relinquishing the processor. The intermediate
state handles the message
on its arrival, and then goes to the destination
state (one that
reflects the overall effect of the protocol
event). The result-
ing proliferation
protocols,
non-atomic
of states complicates
transitions
force a programmer
because
to reason about
the interaction
between all messages and each state. In prac-
tice, protocols
written
in this style are difficult
modify. Detailed discussion of coherence their complications appears in Section 2.
to debug or
protocols
and
Teapot allows a programmer to suspend an action handler and await an asynchronous event by calling a waiting function with the current continuation (Section 3 and Section 4). Teapot also permits the programmer
to concisely
specify how to handIe other messages that arrive while waiting. The Teapot compiler transforms the handler with suspending calls into atomically without
causing deadlock,
Teapot compiler head incurred
executable pieces that can run
Analysis
and optimization
(Section 5) eliminates by writing
protocols
in the
most of the over-
at a higher
level of
abstraction. Section 6 presents our experience write several protocols
with using Teapot to
for the Blizzard
system [24]. With
optimization disabled, our compiler generates protocol code in C that runs application programs within 1370 of the speed of a hand-written performance
protocol.
is (except
With
optimization
for one program)
corresponding hand-written protocol, tive given the vastly superior protocol Even with
this system, as it has reduced the time to develop a complex protocol by an order of magnitude.
2 Why are coherence protocols
a coherence protocol is the actions
message) run to completion
the processor
better programming
enabled,
within
A shared-memory mechanisms.
the system to control
abstractions,
exploration
has emerged
as a viable
debugging
aid
can be built
writing location
access to memory
an invalid
location
or writing
must cause an access fault
ence protocol.
The
second
com-
allows
by permitting
read
a valid but read-only and invoke the coher-
mechanism,
communication,
information
and data
among processors. A protocol comes into play at an access fault. It must satisfy the memory access by bringing the data to the faulting processor’s
memory.
In many
protocols,
each block
shared data has a home node that coordinates
of
accesses to the
block. The accessing processor sends a request to the referenced block’s
home processor, which perfomns bookkeep-
ing and returns the data. Once a processor obtains the data, it caches a copy, which can be subsequently is invalidated.
Many protocols,
accessed until it
for example our Stache pro-
tocol [23], enforce coherence by permitting only a single writer (or multiple readers) to a block. When a home node receives a request for a writable copy of a block, it invalidates the outstanding copies before returning the block. The memory reference and the request and invalidation messages are protocol events, which cause transitions tocol state machine.
in a pro-
Both the home node and caching processors record state for a block. At a protocol event for block B, the protocol consults B’s state to determine
an action, which may send
messages and update both the state and the contents of the block in memory. For example, Figure 1 depicts states in a simple protocol for a block at a non-home node (Figure 2 shows the corresponding home node state machine.) Consider a block that is initially
Invalid.
When the processo~
Readable invalidatemsg (acknowledge)o $A yg:p *Q‘$%= &J +
protocols
for
on two basic
access control,
enables a system to transfer control
the
[8, 9, 22, 28]. In addition to executable C code, the back-end of the Teapot compiler can generate
hard?
and write accesses only for valid, cached data. Reading or
10% of the
which is very attracwriting environment.
system
The first mechanism,
can be difficult to write correctly. Field testing cannot ensure correctness since protocols contain complex timingdependent paths. Although completely verifying a nontrivmodel checking through state space ial protocol is difficult, plex
system (Section 7). Pro-
has been one of the greatest benefits of
Typhoon [23], and Blizzard in software. Protocols writ-
ten in Teapot could also be used for distributed object systems, such as Multipol [6], CRL [16], and Orca [3]. requirement
input for the Mur@ [9] verification tocol verification
Writable Figure
protocols
~
1: Idealized
protocol
state
machine
(non-home
‘7
ReadShared waits for a PutResponse message. Before that message arrives, the home node may also attempt to write the same block and send a WriteFault col programmer
message. The proto-
must consider these possibilities
vide the Excl-To-ReadShared One unattractive
and pro-
state with a suitable action.
approach is to receive these messages
and encode a block’s history, including pending actions, in its state. For example, a block in state Excl-To-ReadShared receives a WriteFault
message and moves to another state
Excl-To-ReadShared-PendingShared
2: Idealized
~Figure
~Transitions / action.
protocol state machine (home side). are labeled with cause and, in parenthesis,
reads the block, the protocol
obtains a read-only
an auxiliary
copy from
the copy with an Invalidate
mes-
sage, which causes a transition to state Invalid. Simultaneously, the block’s access permissions are changed so subsequent reads or writes cause an access fault. In general, actions execute on transitions between states, which are caused by protocol events. Actions may send messages to other processors, protocol-specific ions. the
await their replies,
information, algorithm
and
memory
processes
later processing. An
negatively
acknowledge
messages, or queue unexpected
auxiliary
the
(nack)
messages for
All three approaches have disadvantages,
data
structure
because a programmer
complicates
must remain
programming,
aware that a block’s
state is split between two representations. Nacks can lead to deadlock, so they must be employed carefully. And, queuing requires additional memory. Teapot offers all three options, but advocates queuing unexpected messages. Message reordering complexity,
in a network
further
adds to the
because messages may arrive in an unexpected
order. For example,
consistency
a ReadRequest
from
a processor
that
already has a readable copy cannot be ignored or treated as
model. To illustrate
the complications
in writing
a protocol,
an error. The processor may have returned
con-
PutNoData
sider the home-side state machine of the Stache protocol (Figure 2). Conceptually, the protocol requires only three
message and subsequently
states: Idle, ReadShared, cessor currently
and Exclusive
(a non-home
has the single valid copy.) Although
-—..—=
.
to ReadShared,
{
previous
the
message and
;State
3
a simplification
.
—.--———...”,.. “’?
(Node
R)
Put Request)
; \
(R) ;
: = ReadShared;
State
. . . . .. ... . . . . ...
...”..”.
..- ..,. ... . ”.. .,—
——..
Exclu-sive
Message
Send State
1
ReadRequest (Node PutRequest)
R)
:.
I ;
(Owner,
R;
: = Excl-To-ReadShared;
~
exit;
(b) State
the PutRe-
I
Excl-To-ReadShared Message
PutResponse
(Node
SendSharedCopy
sponse subsequently arrives, the transition completes by changing to state ReadShared (Figure 3b). Other states also require intermediate states for their transitions. Figure 4 shows the new, more complex state machine—which
.
PutResponse;
sharer
ton can wait for a protocol event to cause a transition. Hence, after sending the invalidate message, the ReadRequest handler changes to an intermediate state When
.
exit;
I
completion and terminate. Handlers cannot wait on an asynchronous event, such as a message arrival. Only the automa-
and terminates.
.
SendSharedCopy
~ I 2 .. ,”.,..-
it. Conceptually,
sends an invalidation
.
ReadRequest (Owner,
AWAIT
(a) ~
awaits the block’s new value (Figure 3a). However, to avoid deadlock in a real system, protocol handlers must run to
(Excl-To-ReadShared)
Send
I
can complete only when the
owner relinquishes
action for this transition
Message
which responds to a read request
by a processor. This transition
.
(3’GE’& Exciiis’ive
protransi-
tions appear as atomic, state changes in response to protocol events cannot be atomic. Consider the transition from Exclusive
its copy with a
requested a readable
copy with a ReadRequest. If messages can pass each other,
,~-.
“—.
block’s
When the Read-
the protocol
update
and actions depend on
the
data structure,
unexpected
and change access permiss-
The exact states, transitions, coherence
WriteFault.
completes,
pending WriteFault. Because this approach greatly expands the state space, protocols either encode this information in
the home node and changes its state to Readable. Later, the home node may invalidate
transition
~,
is still
.. . . .
State
:.
exit; .
. ...”.-
(sharer)
R) ;
I
ReadShared; ..
. . .. ... .
. .
Figure 3: The action in box(a) executes the exclusive to synchronous (blocking) using shared transition communication, which could deadlock on a real system. The action in box (b) shows an approach to avoiding it introduces an communication: synchronous intermediate state, which waits for a reply message.
of the actual protocol.
Intermediate states complicate programming because they make transitions non-atomic. While in an intermediate state, the protocol may receive messages other than the expected reply message. For example, the state Excl-To-
239
tation. The second computation tured computation,
can then resume the cap-
often passing along a value.
Consider the example from the previous section. Figure 5 shows how to write the protocol transition with continuations. The Teapot Suspend ReadRequest handler is similar to a cal current continuation the function Resume
statement in the lCC. It passes the
(L, the first argument to Suspend)
that
is its
statement
second
is similar
argument.
to a throw,
does not return a value.3 The Suspend its environment
(the program position,
The
to
Teapot
except that it
statement captures as well as local vari-
ables) at the call point in its first argument, switches to the subroutine state specified by the second argument, and passes the continuation. At this point, the handler yields control (to a dispatch loop), until an action in response to a message to the subroutine state causes the suspended han-
machine (home side) with intermediate states necessary to avoid synchronous communication. Figure
4: State
~eseemingly
gratuitous
ReadRequest must be retained an{
processed after the PutNoData
message. Teapot, by default,
queues such messages for processing at a Iater time, Another important difficult oping
to modify, a
custom
consideration
Consider
adding
a
Com-
pare&Swap primitive to the basic protocol, This primitive is a minor variation of a WriteRequest that also executes the compare and swap operation
at a block’s
states are parameterized
(e.g., the Cent
L parameter
which is part of the environment
is that a protocol can be
although this is a desirable way of develprotocol.
dler to resume. Subroutine continuation
home node once
A subroutine
of all handlers in the state.
state differs from a normal state in two ways:
it is entered synchronously
by a Suspend
Using these mechanisms,
a programmer
cussed in Section 2, many protocol ply a matter of awaiting action.
This section describes our approach to writing handlers. In general, any mechanism that permits multiple execution
an explicit
of synchronous
An attractive benefit of continuations tions bv. .movidin~
and need
intermediate
Using threads, a protocol event handler can be launched in a new thread, which yields control at a synchronous communication point. When the subsequent response message
,@%K”e”ExcY”il-