Reducing Controller Contention in Shared-Memory Multiprocessors Using Combining and Two-Phase Routing

Sarah A. M. Talbot
Paul H. J. Kelly
Department of Computing, Imperial College of Science, Technology and Medicine, 180 Queen's Gate, London SW7 2BZ, United Kingdom
phone: +44 171 594 8322   fax: +44 171 581 8024
e-mail: [email protected] [email protected]

Abstract

In simple cache coherency protocols, serialisation can occur when many simultaneous accesses are made to data held in a single node, and when many accesses involve a common "home" node controller. This is ameliorated in various designs with a hierarchical or clustered structure. In this paper we investigate the idea of routing requests via an intermediate "proxy" node, where combining is used to reduce contention. We present a hashing-based proxy placement scheme, and evaluate a "reactive" approach which invokes proxying only when contention occurs. Simulation results using various benchmarks show that the hotspot contention which occurs in pathological examples can be dramatically reduced, while performance on well-behaved applications is essentially unaffected.
Keywords: cache protocols, occupancy, shared-memory, combining, producer/consumer sharing
1 Introduction

This paper investigates a modified cache coherency protocol for large-scale shared-memory multiprocessors. The aim of this work is to improve performance in certain pathological cases, without reducing performance on well-behaved applications. It is important that such anomalies be minimised, so that the performance potential of large configurations is easy to achieve. This paper is a refinement of our initial work on "proxy" protocols [3], aiming to eliminate the overheads so that the programmer need not manually identify which memory regions will benefit.

In a coherent-cache shared-memory multiprocessor (cc-numa), each processor's memory and cache
is managed by a "node controller". In addition to local memory references, the controller must handle requests arriving via the network from other processors. These requests concern cache lines owned by this cache (reads, ownership requests), lines of which a copy is held in this cache (invalidations and replacements), and lines whose "home" is this node (i.e. this node holds directory information about the line). Note that a page is allocated to a home controller, by the operating system, when it is first accessed. Meanwhile, ownership of the line depends on which processor last had write access.

In large configurations, unfortunate ownership migration or home allocations can lead to concentrations of requests at particular nodes. This leads to performance being limited by the service rate or "occupancy" of an individual node controller, as is demonstrated by Holt, Singh and Hennessy in [11]. In particularly severe cases, such as the Gaussian Elimination benchmark discussed below, contention for the node controller to which ownership has migrated dominates execution time (Bianchini and LeBlanc [4]).

A number of measures are available to avoid this. Michael et al. [16] evaluate the effect of improving the node controller service rate. Combining in the interconnection network has been used to avoid contention for reads, writes and fetch-and-update operations, for example in the NYU Ultracomputer [7] and the Saarbrucken SB-PRAM prototype [1]. Attempts have been made to use combining in a cache coherence protocol, notably Johnson's STEM [12] and Kaxiras and Goodman's GLOW extensions to the SCI protocol [14]. Both have overheads which make them suitable only for data structures where the benefits outweigh the costs. Architectures based on clusters of bus-based multiprocessor nodes provide an element of read combining since caches in the same cluster snoop their shared bus. Finally, Bianchini and LeBlanc propose "eager combining" [4], where the programmer identifies specific memory regions for which a small set of "proxy" caches are pre-emptively updated. It is similar to our approach, but suffers from poor behaviour if deployed inappropriately.

This paper concerns a technique for alleviating read contention. With each location, we associate a proxy set: a small set of caches which will act as intermediaries for reads to the location. In our "basic" scheme, when a cpu suffers a read miss, instead of directing its read request directly to the location's home node, it sends it to one of the location's proxies. If the proxy has the value, it replies. If not, it forwards the request to the home; when the reply arrives it can be
forwarded to all the pending proxy readers and can be retained in the proxy's cache.

The main contribution of this paper is to present a "reactive" refinement of this scheme. By using proxies only when contention occurs, the benefits accrued in pathological examples incur little, if any, cost in performance for well-behaved applications. We also investigate the number of proxy caches associated with each location (the proxy set size).

The rest of the paper is structured as follows: our simulated architecture and experimental design is outlined in Section 2. The proxy concept is introduced in Section 3 and refined in Section 4. Results of simulations of a set of standard benchmark programs demonstrating the contention effect and the benefit of proxies are presented in Section 5. The relationship of this paper to previous work is discussed in Section 6. Finally, in Section 7, we summarise our conclusions and give pointers to further work.
2 Simulated architecture and experimental design

In this section we give details of the system modelled by our execution-driven simulator [2]. This paper is intended to provide general insights into coherence protocol design, so we have tried to design simulation experiments which avoid arbitrary design details, such as network topology, and make reasonable assumptions about cache sizes, relative timings etc. Details are summarised in Table 1. We discuss the impact of these assumptions in our concluding section.

The basic architecture used in our simulation studies is shown in Figure 1. Each node consists of a processor, with an integral first-level cache (flc), a large second-level cache (slc), some dram and a node controller. The slc, dram and node controller are interconnected by two decoupled buses. The node controller sends messages to, and receives messages from, the network and the processor. The processor and cache are attached to the slc bus, whereas dram and the node controller are attached to the mem bus. This decoupled bus arrangement allows the processor to access the slc at the same time as the controller accesses dram. This can be expected to occur frequently because the flc is small and write-through: many memory references which cannot be satisfied by the flc will be satisfied by the slc. The decoupled bus arrangement therefore reduces contention between the processor and the controller.

Figure 1: The architecture of a node

Table 1: Details of the simulated architecture

CPU                        CPI 1.0; cycle time 10 ns; instruction set based on DEC Alpha
Instruction cache          all instruction accesses assumed to be primary cache hits
First-level data cache     capacity 4 KByte; line size 64 bytes; direct mapped, write-through
Second-level cache         capacity 1 MByte; line size 64 bytes; direct mapped, write-back
Node controller            non-pipelined; service time and occupancy: see Table 2
Interconnection network    topology: full crossbar; incoming message queues: 8 read requests
Cache coherence protocol   invalidation-based, sequentially-consistent cc-numa; home nodes assigned to the first CPU to reference each page (i.e. "first touch"); distributed directory, using a singly-linked sharing list; based on the Stanford Distributed-Directory Protocol, described by Thapar and Delagi [21]

The interconnection network model is highly simplified, in the spirit of the LogP abstraction [5]. The model describes an abstract machine configuration in terms of four performance parameters: L, the latency experienced in each communication event; o, the overhead experienced by the sending and receiving processors; g, the gap between successive sends or successive receives by a processor; and P, the number of processor modules. The LogP model was adapted by Holt et al. [10] to recognise the occupancy of a node controller as being critically important, and to use this instead of the overhead (see Figure 2). The occupancy is the time during which a node controller is handling a particular request and cannot service any other requests that are waiting.

Figure 2: LogP-based abstract machine used in [10] (diagram adapted from [15])

In the simulated system, we have parameterised the network and node controller models as follows:

L: 10 cycles for long messages (which include a 64-byte cache line), and 5 cycles for all other messages.
o: occupancy, simulated in more detail (see Table 2).
g: 5 cycles.
P: varied with different simulation runs.
Lack of time has prevented us from studying the influence of each of these parameters on the results presented here.
Table 2: Latencies of the most important node actions

Operation                  Time (cycles)
Acquire slc bus                 2
Release slc bus                 1
slc lookup                      6
slc line access                18
Acquire mem bus                 3
Release mem bus                 2
dram lookup                    20
dram line access               24
Initiate message send           5
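For concreteness, the timing parameters above can be gathered into a single configuration record. The sketch below is illustrative only: the struct and field names are our own, not part of the alite simulator, and the values are simply those quoted in the text and in Table 2.

```cpp
// Illustrative only: a configuration record collecting the timing
// parameters described above (all values in 10 ns processor cycles).
// Names are hypothetical, not taken from the alite simulator.
struct SimConfig {
    // LogP-style network parameters
    int latency_long  = 10;  // L: messages carrying a 64-byte cache line
    int latency_short = 5;   // L: all other messages
    int gap           = 5;   // g: gap between successive sends/receives
    int processors    = 64;  // P: varied between simulation runs

    // Node controller action latencies (Table 2); these determine occupancy
    int acquire_slc_bus = 2, release_slc_bus = 1;
    int slc_lookup = 6,      slc_line_access = 18;
    int acquire_mem_bus = 3, release_mem_bus = 2;
    int dram_lookup = 20,    dram_line_access = 24;
    int initiate_send = 5;

    int read_queue_limit = 8;  // finite input buffering for read requests
};
```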
Figure 3: Sharing lists (a distributed singly-linked sharing chain through nodes i, j and k, terminated by NULL)

For the reactive protocol variant studied below, the incoming message queue is limited to 8 for read requests. This is done to study the effect of finite buffering on read requests (rather than all messages). The limit of 8 may seem low, but it has been deliberately chosen to reflect the limitations one would expect in a large configuration, i.e. queue length is √N, where N is the number of processing nodes.
The identities of nodes which have cached a particular line are maintained using distributed singly-linked lists; the sharing chain is illustrated in Figure 3. Alternatives such as doubly-linked or tree-structured lists may be preferable, especially in larger configurations; we discuss the significance of this choice in our concluding section. Adding proxying to the protocol requires the addition of two new message types and relatively minor changes to the protocol state machines. The addition of finite buffering required one extra message and further minor changes to the protocol state machines. Each cache line has a "home" node associated with it (at the granularity of a page) which:
Either holds a valid copy of the line (in slc, dram, or both), or knows the identity of a node which does (the "owner"),
Has pre-allocated space in dram to which the final replacement of the line from cache can take place, and
Holds directory information for the line (head and state of the sharing list) in dram.
The home maintains the state of every line of data for which it acts as home. There are four possible dram states (using 2 bits to hold the 4 states), as shown in Table 3. A seeming disadvantage with this state set is that in the state Home-Exclusive it is not possible to tell whether there is a more
up-to-date copy held in the home node's slc. This was a deliberate choice, because it saves the local processor (when it has Home-Exclusive ownership of a line) from having to update the dram until the data block is evicted from the slc. This, given data locality, should be the common case, and the choice is deliberately made at the expense of remote read requests. A similar state set is used for the same reason in some prototype machines, e.g. s3.mp [17].
Table 3: dram states

Home-Invalid     the data is not held at the home node, but the home knows the identity of the owner of the data.
Home-Exclusive   the data is held only at the home node, in the slc and/or dram.
Home-Shared      the data is held at the home node and at one or more clients, all copies being consistent with the copy in dram.
Home-Locked      the sharing list is being updated.
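To make the directory organisation concrete, the following sketch shows one plausible encoding of the four dram states and of a per-line directory entry (state plus head of the singly-linked sharing list). It is our illustration, not the authors' implementation; the field names and widths are assumptions.

```cpp
#include <cstdint>

// Illustrative encoding of the four home-node dram states (2 bits suffice).
enum class HomeState : uint8_t {
    HomeInvalid   = 0,  // data not held at home; home knows the owner
    HomeExclusive = 1,  // data held only at the home node (slc and/or dram)
    HomeShared    = 2,  // held at home and >= 1 clients, consistent with dram
    HomeLocked    = 3   // the sharing list is being updated
};

// Hypothetical per-line directory entry kept in dram at the home node.
// The sharing list itself is distributed: each sharer's cache holds a
// pointer to the next sharer, so the home records only the list head
// (or the owner's identity when the state is Home-Invalid).
struct DirectoryEntry {
    HomeState state;        // 2 bits are enough for the 4 states
    uint16_t  headOrOwner;  // head of sharing list, or owner if Home-Invalid
};
```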
3 Basic Proxies

The severity of node controller contention is both application and architecture dependent. Controllers can be designed so that there is multi-threading of requests (e.g. s3.mp [17] is able to handle two simultaneous transactions), which slightly alleviates the occupancy problem but does not eliminate it. Some contention is inevitable and will result in the latency of transactions being elongated. The key problem is that queue lengths at controllers, and hence contention, are non-uniformly distributed around the machine. Some controllers may have short queues, whereas others have long, or full, queues. The result is that many requests are waiting in buffers for long periods of time, and processors are stalled unnecessarily.

To reduce overall message latency, it is important that node controllers can handle requests at a high rate. However, in large configurations, unfortunate home allocations or application program behaviour can lead to concentrations of requests at particular nodes, causing queues to build up. One way of reducing the queues is to distribute the workload to other node controllers, essentially using them to act as "proxies" for read requests, as indicated in Figure 4. When a processor makes a read request, instead of going directly to the cache line's home, it is routed first to another node.
Figure 4: Contention is reduced by routing loads via a proxy. (a) Without proxies. (b) With two proxies.

If the proxy node has the line, it replies directly. If not, it requests the value from the home itself, allocates it in its own cache, and replies. This basic form of proxies was first described in [3]. The choice of proxy node can be at random, or (as shown in Figure 4) on the basis of locality.

Proxy nodes require a small amount of extra store to be added to the node controller. Specifically, we need to be able to identify which data lines have outstanding transactions (and the tags they refer to), and be able to record the identity of the head of the pending chain. Entries in this separate "pending proxy cache" form the pending chains required by combining (see the example in Figure 5). This could be implemented using a small, fast associative store on the mem bus.
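The sketch below illustrates one way the per-node "pending proxy cache" could behave, following the combining example in Figure 5: the proxy records only the head of the pending chain, each later reader is handed a pointer to the previous head (so the clients themselves form the distributed chain), and when the data arrives it is supplied to the chain head and forwarded onwards. The names and structure are our own, a sketch of the combining idea rather than the authors' controller design.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

using LineAddr = uint64_t;
using NodeId   = int;

// Hypothetical per-node store for proxy reads that are still in flight.
// In hardware this would be a small associative store on the mem bus;
// a map stands in for it here.
class PendingProxyCache {
    std::unordered_map<LineAddr, NodeId> head_;  // line -> head of pending chain
public:
    // Register a reader. Returns the previous chain head if one exists
    // (the new reader chains behind it, as in Figure 5(b)); returns
    // nothing if this is the first reader, in which case the proxy
    // forwards one read request to the home node (Figure 5(a)).
    std::optional<NodeId> addReader(LineAddr line, NodeId client) {
        auto it = head_.find(line);
        if (it == head_.end()) { head_[line] = client; return std::nullopt; }
        NodeId prev = it->second;
        it->second = client;                     // newest reader becomes head
        return prev;
    }

    // The data has arrived from the home/owner: retire the entry and return
    // the chain head, from which the line is forwarded along the pending
    // chain (step 8 in Figure 5(c)).
    NodeId complete(LineAddr line) {
        NodeId head = head_.at(line);
        head_.erase(line);
        return head;
    }
};
```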
Figure 5: Combining of proxy requests. (a) The first request to a proxy requires forwarding to the home: client 1 makes a read request to the proxy, which sends a read request on to the home. (b) A second request, arriving before the data is returned, forms a pending chain: client 2 makes a read request to the proxy and receives a pointer to client 1. (c) The data line is passed to each client on the request chain: the home forwards the request to the owner, the owner supplies the data to the proxy, the proxy supplies it to client 2, and the data is forwarded along the pending chain to client 1.

3.1 Choosing the Proxy

We now describe how a cpu decides which node to use as a proxy, i.e. to which node it should direct a given read request. We begin with some definitions:

H(l): the home node of location l. This is determined by the operating system's memory management policy. As mentioned in Table 1, our simulations employ the "first-touch" policy: the first cpu to touch a given page becomes home to all cache lines on the page.

NP: the number of proxies.

PS(l): the set of NP nodes which are acting as proxies for address l. In this paper, PS(l) simply identifies one of NP disjoint clusters (there is no reason why proxy sets should not overlap, and it may be desirable to allocate proxy sets so that they are randomly distributed).

PC(l, CPU): the actual node chosen by a given cpu when reading location l. We use a simple hash function to choose the actual proxy from the proxy set PS(l). This choice depends on the id of the requesting cpu, and is bypassed if the requesting cpu is the home node of l.
This two-stage mapping, illustrated in Figure 6 for NP = 2, ensures that requests for a given location are routed via the proxy set (so that combining occurs), that successive cache lines are served by different proxy sets, and that the proxy copy of a given cache line is placed pseudorandomly within each proxy set. This should reduce network contention [22] and balance load more evenly across node controllers. In some architectures, proxy placement could be designed to exploit communications locality.
Figure 6: The proxy is different for successive cache lines. (a) Load of line K. (b) Load of line K+1.

It should be noted that if a client node is in the proxy set for a particular line l, then that client will make a direct read request to the home node and not pick one of the other members to act as its proxy. This is essential to avoid deadlock.
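As an illustration of the two-stage mapping, the sketch below picks a proxy set from the cache-line address and then a particular member of that set from the requesting cpu's identity, falling back to a direct request when the requester is the home node or is itself in the proxy set. The hash functions and the cluster layout are our own assumptions; the paper requires only that successive lines map to different proxy sets and that the choice within a set depends on the requesting cpu.

```cpp
#include <cstdint>

using LineAddr = uint64_t;
using NodeId   = int;

// Sketch of proxy selection under assumed parameters: the machine's nodes
// are split into disjoint clusters of NP nodes, each cluster acting as one
// proxy set. Not the authors' exact hash functions.
struct ProxyMap {
    int nodes;  // P: number of processing nodes
    int np;     // NP: proxy set size (assume it divides the node count)

    int clusters() const { return nodes / np; }

    // PS(l): which cluster of NP nodes serves as the proxy set for this
    // line. Hashing on the cache-line index makes successive lines rotate
    // through different proxy sets.
    int proxySet(LineAddr line) const {
        return static_cast<int>((line / 64) % clusters());  // 64-byte lines
    }

    // PC(l, cpu): the member of PS(l) this cpu should ask, chosen by a
    // simple hash of the cpu id so proxy copies spread pseudo-randomly.
    NodeId proxyFor(LineAddr line, NodeId cpu, NodeId home) const {
        int set      = proxySet(line);
        NodeId proxy = set * np + (cpu % np);
        bool requesterInSet = (cpu / np) == set;
        // A cpu that is the home, or is itself a member of the line's
        // proxy set, reads directly from the home (avoids deadlock).
        if (cpu == home || requesterInSet) return home;
        return proxy;
    }
};
```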
4 Reactive Proxies

If the application programmer makes an injudicious choice of data structures to be marked for proxying, the basic proxy scheme runs the risk of swamping the proxy node caches, adding indirection and lengthening the sharing lists. A more generally applicable variant of proxies, which takes advantage of the finite buffering of real machines and does not require the application programmer to add proxy directives, is "reactive proxies".

In this work, the bounding of the input message queue only affects incoming read messages. When a read message is received, the length of the input queue is checked to see if it is above the limit. If the limit has been reached, the read message is bounced back to the originator. All other incoming messages, and read messages which are accepted, are then processed as normal (i.e. as in the original protocol). The bounced read messages are immediately sent back across the network. When the originator receives the bounced message, and the reactive proxies protocol is not in effect, the message will be re-sent. This leads to the possibility of an infinite loop of messages, and we plan to enhance the protocol to add a "priority read" to be sent after a given number of bounced read requests.

When the reactive proxies protocol is in operation, the arrival of a buffer-bounced read request will trigger a proxy read. This is quite different to the basic proxies protocol, where the user has to decide whether all or selected parts of the shared data are proxied, and proxy reads are always used for data marked for proxying. Instead, the proxying only takes effect when it is known that there is a problem with congestion. A proxy read is only done in direct response to the arrival of a buffer-bounced read request, so as soon as the queue length at the destination node has reduced to below the limit, read requests will no longer be bounced and no proxying will be employed. This removes the obvious overhead of the basic protocol, where proxying adds at least one message to every remote read on shared data marked for proxying.
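The reactive scheme amounts to a small amount of control logic at each end of a read transaction: the home bounces reads that arrive at a full queue, and the requester retries a bounced read through a proxy. The sketch below is our paraphrase of that behaviour under assumed message and queue types; it is not the protocol state machine itself.

```cpp
#include <cstddef>
#include <deque>

// Assumed message/queue types for illustration only (not the real protocol).
struct ReadRequest { long line; int requester; bool bounced = false; };
constexpr std::size_t kReadQueueLimit = 8;  // finite buffering for read requests

// Home-node side: accept a read only while the input queue is below the
// limit; otherwise mark it bounced so it returns to the originator. All
// other message types, and accepted reads, are handled as in the base
// protocol.
bool acceptOrBounce(std::deque<ReadRequest>& readQueue, ReadRequest& req) {
    if (readQueue.size() >= kReadQueueLimit) {
        req.bounced = true;          // goes straight back across the network
        return false;
    }
    readQueue.push_back(req);
    return true;
}

// Requester side: decide where to resend a bounced read. With reactive
// proxies the bounce itself triggers a proxy read; without them the read
// is simply retried at the home (the paper notes a planned "priority read"
// to break the potential resend loop).
int retryDestination(const ReadRequest& req, int home, int proxy,
                     bool reactiveProxies) {
    (void)req;
    return reactiveProxies ? proxy : home;
}
```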
Figure 7: Bounced read requests are retried via proxies. (a) Input buffer full, some read requests bounce. (b) Reactive proxy reads.

5 Experimental Results

In this section, we first elaborate on our choice of benchmark applications, and explain our terminology. We then present the results obtained for the basic proxies scheme (using infinite and finite input message buffers), and for the reactive proxies scheme. We analyse these results, and discuss the benefits and potential drawbacks of using proxies.
5.1 Choice of Benchmarks

The benchmarks and their parameters are summarised in Table 4. In general, applications exhibiting single producer / multiple consumer data sharing tend to have contention problems [4], particularly because a barrier is commonly used for synchronising between the producer update and the consumer reads. In this work we use a Gaussian Elimination application to be representative of such algorithms. In addition, from the splash-2 suite, fmm has an instance of producer / consumer data sharing, and the non-contiguous partition implementation of ocean (ocean-non-contig) has longer sharing lists than the contiguous partition implementation.
Table 4: Benchmark applications

Application         Parameters
Barnes              16K particles
FFT                 64K points
FMM                 8K particles
GE                  512 x 512 matrix
Ocean-contig        258 x 258 ocean
Ocean-non-contig    258 x 258 ocean
Water-nsq           512 molecules
A problem we encountered with selecting benchmarks was that many programs have been specifically written to avoid producer / consumer data sharing precisely because of performance problems. This runs counter to the spirit of the shared-memory programming paradigm which aims to insulate the application programmer from having to consider the underlying architecture. It also results in applications that are less portable, and that no longer match the original algorithm. ge is a simple Gaussian elimination program, similar to that used by Bianchini and LeBlanc in
their study of proxies [4]. At the end of each iteration a single processor updates a row of the matrix which is designated as the pivot row. Following a barrier, all processors read this row and use it to update a set of rows which they maintain. It is immediately clear that this will cause contention since the cache lines holding the pivot row will all reside in a single cache. The entire 512x512 matrix is marked for proxying. The following benchmark programs were taken from the splash-2 suite [23]:
FMM: fmm was run for a two-cluster Plummer distribution with cost zones partitioning, and the precision set at 1e-6. Trace results of a 32 node system showed that queues of length 31 were occurring for access to elements of the f array which forms part of the G Memory data structure.
Ocean-non-contig: The original splash suite version of ocean, which this is similar to, was used by Ross Johnson [12] as an example of long sharing lists, so it was specifically selected to see if it exhibits read contention.
Barnes, FFT, Ocean-contig, Water-nsq: in the absence of any obvious data contention within these applications, all the significant shared data structures were marked for proxying to study the effect.
5.2 Terminology

The performance speedup results are presented in terms of "relative speedup", i.e. the ratio of the execution time of the fastest algorithm running on 1 processor to the execution time on N processors. The problem size is kept constant.
The "relative changes" results are for 64 processing node simulations, and we show three different metrics:

messages: the ratio of the total number of messages (for a given number of proxies) to the number of messages when proxies are not used.

queueing delay: the ratio of the total time that messages spend waiting for service in input buffers to the same total with no proxies.

execution time: the ratio of the execution time (excluding startup) to the execution time (also excluding startup) with no proxies.
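Written out explicitly (our notation, summarising the definitions above), with T1 the time of the fastest single-processor run, TN the N-processor execution time, and the subscript 0 denoting the corresponding run with no proxies:

$$
\mathrm{relative\ speedup}(N) = \frac{T_1}{T_N}, \qquad
\mathrm{messages} = \frac{M_{\mathrm{proxy}}}{M_0}, \qquad
\mathrm{queueing\ delay} = \frac{Q_{\mathrm{proxy}}}{Q_0}, \qquad
\mathrm{execution\ time} = \frac{T_{\mathrm{proxy}}}{T_0}
$$

where M is the total message count, Q the total input-buffer waiting time, and T the execution time excluding startup.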
5.3 Results

The performance results for the 7 benchmark applications are summarised in Table 5, and Figure 8 shows the performance graphs for some of the benchmarks (the others are, like water-nsq, close to a "straight line"). More detailed results for 64 processing nodes are shown in Figures 9, 10 and 11. The main points to note are:
Infinite buffers: The ge benchmark shows a significant performance improvement using basic proxies. This was to be expected because the application was specifically chosen as an example of read contention. The application runs over 37% faster on 64 processors with proxies. 5 of the other 6 benchmarks show performance improvements, from the noticeable 10% improvement for fft, to the very small improvements for the other applications. water-nsq showed a slight drop in performance of about 0.7%. This serves as a warning that basic proxies can have a detrimental effect on performance if the wrong data structures are marked for proxying.

Finite input buffers, basic proxies: ge and fft both show significant performance improvements, and there are also slight performance improvements for fmm and barnes. However, both the ocean applications and water-nsq show reduced performance. The problem is that the additional messages associated with basic proxies are causing more congestion (with finite buffers) than they alleviate.

Finite input buffers, reactive proxies: 5 out of 7 benchmarks show a performance improvement over no proxies. barnes has a slightly worse performance than with basic or no proxies. This is because the application does benefit marginally from proxying, but only if it is immediate. The benefit is lost with reactive proxies because in the time it takes for the client to "react" and send the request via a proxy, the congestion has eased at the home node. ge has nearly 30% performance improvement over no proxies, but it is not as good as the 40% improvement achieved by basic proxies (all for finite input buffers). This is because the best results are obtained by immediately using proxies for the pivot row data sharing. The delay in waiting for initial read requests to bounce from the congested home node reduces the benefit. However, the performance improvement is still significant. The best performance results for fft, fmm, ocean-non-contig and water-nsq are obtained with reactive proxies. For these applications, there are benefits to be had from proxies, but they are best obtained by letting the system react to contention rather than getting the application programmer to mark data structures as "hot". ocean-contig shows a worse performance than with no proxies, but it is better than using basic proxies.

An interesting result from the simulations was that most, if not all, of the benefits were achieved with just one proxy (i.e. NP = 1). We believe that this is because we were modelling a very fast full-crossbar network with no contention, and are currently doing further work to confirm this.
Table 5: Benchmark relative speedup for 64 processing nodes

                       infinite buffers              finite input buffers
Application         no proxies   basic proxies   no proxies   basic proxies   reactive proxies
Barnes                43.1413      43.1618         43.2408      43.2970         43.1448
FFT                   47.3143      52.0787         47.4157      52.1181         53.2886
FMM                   36.1158      36.2174         36.1373      36.2099         36.2619
GE                    22.3240      30.7212         21.9748      30.8052         28.5506
Ocean-contig          51.5513      52.0433         51.3692      50.5736         50.8899
Ocean-non-contig      49.0287      49.9339         50.4755      50.1291         51.3051
Water-nsq             55.4568      55.0740         55.4495      55.0777         55.5556

Figure 8: Performance speedup (relative speedup against number of processors for fft, ge, ocean-non-contig and water-nsq; curves for no proxies, basic proxies, and reactive proxies with finite buffers)

To demonstrate how proxies affect the input queue delays at each processing node, Figure 12 shows the individual queueing time mean and standard deviation for each processor on 64 node simulations of ge and fmm. In (a), with no proxies, ge has some significant hot-spots, whilst fmm also has hot-spots, but they are not as pronounced. In (b), with basic proxies, the selective use of proxies has dramatically reduced all but one of the hot-spots for ge. For fmm, proxying one data structure has caused a small reduction in queueing. Finite buffer queueing times for basic and reactive proxies are shown in Figure 12(c) and Figure 12(d) respectively. Input queue lengths have been limited by bouncing read messages when there are 8 or more messages already in an input buffer, so the mean and standard deviation are much lower than with infinite buffers. It is interesting to note that reactive proxies produce a more "smooth" distribution for ge than basic proxies.
5.4 Summary

Proxies reduce queueing for obtaining both copies of data and directory information held by a single node controller. They also reduce blocking in the interconnection network, by avoiding some unfortunate communications patterns: although the overall traffic level is somewhat higher, the distribution is more uniform. In addition, messages not involved in proxies (e.g. writes) spend less time in queues.

For the reactive proxies scheme, there is no marking of data structures (by the application programmer or, possibly, an optimizing compiler). For some applications (such as ge and barnes) this results in marginally worse performance, but most other applications benefit from only invoking proxies when contention is detected, rather than relying on judicious marking of data structures. water-nsq illustrated this, where basic proxies reduced performance, whereas reactive proxies gave improved performance.

There are some overheads associated with proxies. Every load (for addresses subject to proxying) goes via a proxy, whereas with the unmodified protocol no indirection would be involved. For basic proxies, this can be detrimental to performance where inappropriate data structures are marked as "hot". For reactive proxies, this delay is only introduced once a direct read request to the home is bounced. In addition, there will be cache pollution, i.e. allocating space in the cache for proxying may displace another line and lead to a later cache miss, with invalidation overhead for the displaced line. In our simulations, these delays did not seem to be significant, but we plan to study this further by simulating "no allocate" proxies.
6 Related Work

Contention in shared-memory multiprocessors is receiving more attention now that larger scale machines are being designed and built. "End-point contention" was identified in [11] as the most worrying potential architectural bottleneck in large-scale distributed shared-memory (DSM) machines. That paper was aimed at how applications can be tailored to scale well, whereas the emphasis of our work on proxies is to avoid queueing at node controllers without the application programmer having to get involved. It is better to have an upper-bounded (hopefully) architectural solution than to have to hack DSM application code (which, as pointed out in [11], can sometimes end up being more complex than a message-passing solution). In an earlier technical report [10], the same team introduced the concept of "occupancy" and focused on understanding the performance requirements for node controllers. A separate study looking at similar issues [15] uses hardware rather than simulation to study the bottlenecks in a workstation cluster architecture (rather than shared memory).

Scalable distributed shared-memory architectures rely on the node controllers to synthesise cache-coherent shared memory across the entire machine. In [16] an investigation is made into the performance tradeoffs between using customised hardware or a programmable protocol processor to implement the coherence protocol.

Proxies allow read requests for data to be combined in controllers away from the home node which is suffering from contention. This is a restricted instance of the combining of atomic read-modify-write operations proposed for the NYU Ultracomputer [7] and the RP3 [18]. In addition, the choice of proxy has connections with two-phase random routing [22].

Eager combining, suggested by Bianchini and LeBlanc [4], uses intermediate nodes which act like proxies for "hot" pages, i.e. the programmer is expected to mark the "hot" data structures. The choice of server node is based on the page address rather than the data line address, so their scheme does not spread the load of messages around the system in the way that proxies do. Bianchini and LeBlanc's scheme eagerly updates all proxies whenever a newly-updated value is read, unlike our protocol, where data is allocated in proxies on demand. Our less aggressive scheme incurs lower overheads because it only provides data when it is needed, and so there will be less cache pollution (and hence unhooking) at the proxies. We are able to show performance results for proxies both when read contention occurs, where a large improvement is evident, and when it does not, where there is no degradation for reactive proxies.

The glow [14] protocol extensions are, like proxies, designed to build on protocols such as the IEEE Scalable Coherent Interface (SCI) [19] which are optimised for data that is not widely shared. The work is related to the STEM extensions to SCI proposed in [12]. glow intercepts requests for widely-shared data by providing agents at selected network switch nodes. Sharing trees are built up recursively, bottom-up. Request combining is achieved for reads, thus leading to the elimination of "hot spots". In addition, the protocol provides scalable writes by using the tree structure to invalidate or update sharing nodes in parallel. glow allows for locality when constructing the sharing tree, i.e. it exploits the network topology to select the location of the agents to reduce the latency of protocol messages. At present, the glow protocol relies on the programmer identifying the widely-shared structures in the source code and marking them. This is similar to the marking of data structures provided in our "basic proxies" scheme, i.e. unlike reactive proxies, which respond to execution-time congestion at nodes.

The idea of caching extra copies of data to speed up retrieval time for remote reads has been explored for hierarchical architectures in [20] and in the DDM [8]. The proxies approach is different because it does not use a fixed hierarchy, but instead allows copies of successive data lines to be serviced by different proxies. The Wisconsin Multicube [6] also provides extra copies of data in snooping caches at each intersection on the grid of buses. This is a more flexible approach, but it suffers from the overhead of the large snooping caches.
7 Conclusions

This paper has presented the reactive proxy technique, discussed the design and implementation of proxying cache coherence protocols, and examined the results of simulating seven benchmark applications. We have shown that proxies benefit some applications immensely, as expected, while other benchmarks with no obvious read contention still showed performance gains under the reactive proxies protocol. There is a tradeoff between the flexibility of reactive proxies and the precision (when used correctly) of basic proxies. Despite increasing the overall number of messages, proxies seldom increase execution time as they give a more even distribution of network traffic. Performance is influenced more by anomalous contention effects than by the overall average network traffic level. In our simulated system, much if not all of the benefit of proxying is realised with just one proxy per location, although there is evidence that more proxies are justified for some applications. This result depends partly on the network topology, and our work to date has only considered a full crossbar interconnect. We also need to do further work to quantify the adverse effects of cache pollution.
In summary, we have shown that reactive proxy extensions can be added to a cc-numa cache coherence protocol fairly easily, can dramatically improve the performance of some applications (e.g. ge), have no detrimental effect on applications that do not exhibit node controller contention, and that one proxy per location is sufficient for a full crossbar network. Furthermore, by exploiting the finite nature of real buffering (in the network or at the destination node), reactive proxies are only used when they are needed at execution time, so the application programmer does not have to mark specific data structures for special treatment.

Among our plans for future work are to study what influences the optimal number of proxies, and how proxies can be used to improve communications locality. Our results show that proxies lead to an increase in average sharing list length and an increase in unhooking, which would be faster with a doubly-linked protocol or a separate proxy cache (c.f. victim caching [13]). We also plan to investigate not allocating the line in the proxy's slc: combining could be reduced, but this may be more than compensated for by avoiding the cache pollution effect. Our goal is to design a scalable shared-memory system which exploits caching effectively, but also has reasonable worst-case performance compared to a crcw pram [9]. To do this we need to extend the proxy idea to form hierarchies, and redesign the sharing list representation to allow parallel unhooking and invalidation.
Acknowledgements

This work was funded by the U.K. Engineering and Physical Sciences Research Council through project GR/J 99117 (Combining Randomisation and Mixed-Policy Caching for Bounded-Contention Shared Memory), and a Research Studentship. We would also like to thank Andrew Bennett and Ashley Saulsbury for their work on the alite simulator and for porting some of the splash-2 programs.
References

[1] Peter Bach, Michael Braun, Arno Formella, Jorg Friedrich, Thomas Grun, and Cedric Lichtenau. Building the 4 processor SB-PRAM prototype. In Proceedings of the Hawaii 30th International Symposium on System Science HICSS-30, volume 5, pages 14-23, January 1997.

[2] Andrew J. Bennett, Anthony J. Field, and Peter G. Harrison. Development and validation of an analytical model of a distributed cache coherency protocol. In Performance Evaluation, volume 27-28, pages 541-563, Oct 1996.
[3] Andrew J. Bennett, Paul H. J. Kelly, Jacob G. Refstrup, and Sarah A. M. Talbot. Using proxies to reduce cache controller contention in large shared-memory multiprocessors. In Luc Bouge et al., editors, Euro-Par 96 European Conference On Parallel Architectures, Lyon, August 1996, volume 1124 of LNCS, pages 445-452. Springer-Verlag, 1996.

[4] Ricardo Bianchini and Thomas J. LeBlanc. Eager combining: a coherency protocol for increasing effective network and memory bandwidth in shared-memory multiprocessors. In 6th IEEE Symposium on Parallel and Distributed Processing, Dallas, October, pages 204-213, 1994.

[5] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. 4th Symposium on Principles and Practice of Parallel Programming, in Sigplan Notices, 28(7):1-12, July 1993.

[6] James R. Goodman and Philip J. Woest. The Wisconsin Multicube: A new large-scale cache-coherent multiprocessor. 15th Annual International Symposium on Computer Architecture, in Computer Architecture News, 16(2):422-431, 1988.

[7] Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. The NYU Ultracomputer - designing an MIMD shared memory parallel computer. IEEE Transactions on Computers, C-32(2):175-189, February 1983.

[8] Seif Haridi and Erik Hagersten. The cache coherence protocol of the Data Diffusion Machine. In E. Odijk, M. Rem, and J.-C. Syre, editors, PARLE 89 Parallel Architectures and Languages Europe, Eindhoven, June 1989, volume 365 of Lecture Notes in Computer Science, pages 1-18, Berlin, 1989. Springer-Verlag.

[9] Tim J. Harris. A survey of PRAM simulation techniques. Computing Surveys, 26(2):187-206, June 1994.

[10] Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John Hennessy. The effects of latency, occupancy and bandwidth in distributed shared memory multiprocessors. Technical Report CSL-TR-660, Computer Systems Laboratory, Stanford University, January 1995.

[11] Chris Holt, Jaswinder Pal Singh, and John Hennessy. Application and architectural bottlenecks in large scale distributed shared memory machines. 23rd Annual International Symposium on Computer Architecture (ISCA), Philadelphia, in Computer Architecture News, 24(2), May 1996.

[12] Ross E. Johnson. Extending the Scalable Coherent Interface for Large-Scale Shared-Memory Multiprocessors. PhD thesis, Computer Science Department, University of Wisconsin-Madison, Feb 1993.

[13] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. 17th Annual International Symposium on Computer Architecture (ISCA), Seattle, in Computer Architecture News, 18(2), May 1990.

[14] Stefanos Kaxiras and James R. Goodman. The glow cache coherence protocol extensions for widely shared data. In 10th ACM International Conference on Supercomputing, May 25-28 1996, Philadelphia, PA, pages 35-43, May 1996.

[15] Richard P. Martin, Amin M. Vahdat, David E. Culler, and Thomas E. Anderson. Effects of communication latency, overhead and bandwidth in a cluster architecture. In 24th International Symposium on Computer Architecture (ISCA), Denver, CO, June 1997.

[16] Maged M. Michael, Ashwini K. Nanda, Beng-Hong Lim, and Michael L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. In 24th International Symposium on Computer Architecture (ISCA), Denver, CO, June 1997.
[17] Andreas Nowatzyk, Gunes Aybay, Michael Browne, Edmund Kelly, Michael Parkin, Bill Radke, and Sanjay Vishin. The s3.mp scalable shared memory multiprocessor. In Proceedings of the International Conference on Parallel Processing Vol. 1, August 14-18, pages 1-10. CRC Press, 1995.

[18] G. F. Pfister. An introduction to the IBM Research Parallel Processor Prototype (RP3). In J. Dongarra, editor, Experimental Parallel Computing Architectures, pages 123-140. North Holland, 1987.

[19] SCI. Scalable Coherent Interface. ANSI/IEEE Standard 1596-1992, August 1993.

[20] Steven L. Scott. A cache coherence mechanism for scalable, shared-memory multiprocessors. Technical Report 1002, University of Wisconsin-Madison, Department of Computer Science, 1210 West Dayton Street, Madison, WI 53706, February 1991.

[21] Manu Thapar and Bruce Delagi. Stanford distributed-directory protocol. IEEE Computer, 23(6):78-80, June 1990.

[22] Leslie G. Valiant. Optimality of a two-phase strategy for routing in interconnection networks. IEEE Transactions on Computers, C-32(8):861-863, August 1983.

[23] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In 22nd Annual International Symposium on Computer Architecture, in Computer Architecture News, pages 24-36, June 1995.
Figure 9: Relative changes for 64 processing nodes, basic proxies (messages, queueing delay and execution time relative to no proxies, plotted against the number of proxies for barnes, fft, fmm, ge, ocean-contig, ocean-non-contig and water-nsq)
Figure 10: Relative changes for 64 processing nodes, basic proxies with finite buffer (messages, queueing delay and execution time relative to no proxies, plotted against the number of proxies for barnes, fft, fmm, ge, ocean-contig, ocean-non-contig and water-nsq)
Figure 11: Relative changes for 64 processing nodes, reactive proxies (messages, queueing delay and execution time relative to no proxies, plotted against the number of proxies for barnes, fft, fmm, ge, ocean-contig, ocean-non-contig and water-nsq)
Figure 12: Queueing cycles per processor for 64 processing nodes, ge and fmm (mean and standard deviation of per-processor input queueing time: (a) no proxies, (b) basic proxies, (c) basic proxies with finite buffers, (d) reactive proxies)