Performance Evaluation of Cache Depot on CC-NUMA Multiprocessors

Hung-Chang Hsiao and Chung-Ta King
Department of Computer Science, National Tsing Hua University
Hsinchu, Taiwan 300, R.O.C.
E-mail: [email protected]

Abstract

Cache depot is a performance enhancement technique on cache-coherent non-uniform memory access (CC-NUMA) multiprocessors, in which nodes in the system store extra memory blocks on behalf of other nodes. In this way, memory requests from a node can be satisfied by nearby depot nodes without going all the way to the home node. This not only reduces memory access latency and network traffic, but also spreads the network load more evenly. In this paper, we study a design strategy for cache depot that (1) enhances the network interface of each node to include a depot cache, which stores those extra memory blocks for other nodes, and (2) employs a new multicast routing scheme, called the multi-hop worm, that works cooperatively with the depot caches to transmit coherence messages. By considering message routing and depot caches together, the design concept can be applied even to CC-NUMA systems that have a non-hierarchical, scalable interconnection network. We have developed an execution-driven simulator to evaluate the effectiveness of the design strategy. Performance results using four SPLASH-2 benchmarks show that the design strategy improves the performance of the CC-NUMA multiprocessor by 11% to 21%. We have also studied in depth various factors that affect the performance of cache depot.

1. Introduction

A cache-coherent non-uniform memory access (CC-NUMA) multiprocessor supports a shared address space with physically distributed main memory. Nonlocal data may be replicated in the caches associated with each node, and cache coherence is often maintained via a directory using cache blocks as the basic coherence unit. Each memory block usually has a fixed home node, which stores the block in its local memory. Operations on the block, e.g., caching, updating, or invalidating, often involve several point-to-point network transactions among the requesting node, the home node, and other sharing nodes. The communication datapath of these coherence messages is therefore critical to the performance of CC-NUMA multiprocessors.

From a communication perspective, coherence messages in CC-NUMA multiprocessors have several important characteristics. First, since a cache line may be shared by several nodes at the same time, a coherence message often needs to be sent from one source to many destinations, e.g., during cache line invalidation or updating. Thus, efficient support for multicasting coherence messages is desirable [5, 11]. Second, due to program locality, coherence messages tend to be bursty and concentrated. As a result, certain areas in the network may become congested at times, which reduces communication efficiency; it is therefore necessary to spread the coherence traffic as evenly as possible. Third, coherence messages are usually short [6]. Thus, the length of the paths traversed by coherence messages has a direct bearing on their transmission latency, even when cut-through switching is used [12].

We have conducted an experiment to study the effect of the distance traversed by coherence messages. In the experiment, a CC-NUMA multiprocessor using an 8 × 8 2D mesh interconnection network was simulated. Messages were routed using dimension-ordered routing [12] and cut-through switching. Five applications were executed on the simulated multiprocessor: FFT, LU, Radix, Ocean, and Water-Nsquared, all from SPLASH-2 [4]. Detailed experimental settings are described in Section 3. Figure 1 shows the average distance a coherence message has to travel from a requesting node to the node where the message must make a turn in the 2D mesh on its way to the home node. That distance is normalized with respect to the average distance from the requesting node to the home node. The figure shows that the distance to the turning nodes is only about half of that to the home nodes. This suggests that if the nodes at the turns "cache" some blocks on behalf of other nodes, then coherence messages do not have to go all the way to the home node and may save half the trip. If the block is shared by many nodes on the same row or column, the saving in traffic is very significant.
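The following small sketch, which is our own illustration rather than part of the paper's simulator, makes the distance metric concrete: assuming dimension-ordered XY routing (X first, then Y) on a 2D mesh, it computes the turning node for a hypothetical requester/home pair and the hop counts to the turn and to the home node.

```c
#include <stdio.h>
#include <stdlib.h>

/* Node coordinates on a 2D mesh. */
typedef struct { int x, y; } node_t;

/* Under dimension-ordered (XY) routing, a message first travels along the
 * X dimension to the home node's column, then turns and travels along Y.
 * The turning node therefore has the home's x-coordinate and the
 * requester's y-coordinate. */
static node_t turn_node(node_t req, node_t home)
{
    node_t t = { home.x, req.y };
    return t;
}

/* Manhattan hop count between two mesh nodes. */
static int hops(node_t a, node_t b)
{
    return abs(a.x - b.x) + abs(a.y - b.y);
}

int main(void)
{
    node_t req  = { 1, 6 };   /* hypothetical requesting node */
    node_t home = { 5, 2 };   /* hypothetical home node */
    node_t turn = turn_node(req, home);

    printf("hops to turn: %d, hops to home: %d\n",
           hops(req, turn), hops(req, home));   /* 4 vs. 8 here */
    return 0;
}
```

A node part way along either dimension of the path sees roughly half the distance, which is the effect Figure 1 measures across the benchmarks.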

Figure 1. Average distance a coherence message travels: for each benchmark (FFT, LU, Radix, Ocean, Water-Nsq), the average distance from a requesting node to the turning node ("To turn") and to the home node ("To home"), normalized to the latter

There are several ways to implement this idea, e.g., charging the local caches with holding and managing those extra blocks, attaching an extra storage engine to serve other nodes, etc. However, we believe that the key issue in putting the above idea to work on CC-NUMA multiprocessors is to reduce the extra latencies introduced by going through the additional "depots" and finding the requested block. Thus, in this paper we evaluate the effectiveness of implementing this cache depot concept using an enhanced network interface coupled with an appropriate routing scheme. In the following discussion, we will call a node that caches a block on behalf of another node a depot of the latter. The depot node may or may not itself reference the cached block. The storage that holds those cached blocks will be called the depot cache. Our performance evaluation is based on the following design strategies:

1. The depot cache is made part of the network interface of each node. All operations on the blocks in a depot cache are handled by the network interface.

2. Coherence messages are transmitted by a mechanism generalized from the concept of multidestination worms [13]. A multidestination worm carries many destination addresses in its header, so a single such worm can be used to route a message through all the destinations instead of sending several separate messages. However, the worm for cache depot differs from a multidestination worm for multicasting in that it has to be received into the network interface for depot cache operations and re-injected into the network after being processed, whereas a multidestination worm can be duplicated and forwarded completely inside the router. For this reason we will refer to the worm for cache depot as the multi-hop worm. Note that a multi-hop worm is only received into the network interface; it is not moved into the node memory for processing, which would induce software startup delays.

The above design strategies of course raise an array of other design issues, e.g., choosing depot nodes and obtaining their addresses, a new cache coherence protocol to work with depot caches, depot cache management and cost estimation, etc. In this paper we concentrate on evaluating the performance of the design concepts. Some of the above-mentioned issues will be touched upon during our evaluations, but specific designs are left for future investigation.

Based on the above design concepts, we have developed an execution-driven simulator for a CC-NUMA multiprocessor enhanced with depot caches. Simulation results show that cache depots based on the above design strategies can alleviate the hot-spot problem at the home nodes, reduce the length of the paths traversed by coherence messages, and exploit network locality to lower network contention. Overall, the total execution time of the tested benchmarks is improved by 11% to 21%.

The remainder of the paper is organized as follows. Section 2 gives an overview of a CC-NUMA system that supports cache depots with the features described above; this model system serves as the basis for our evaluation. In Section 3 the settings of the experiments for evaluating the cache depot concept are discussed. Section 4 presents and discusses the simulation results. Section 5 briefly compares our proposed ideas with related work. Finally, conclusions and possible future work are given in Section 6.

2. Overview of Cache Depot

Based on the design concepts outlined in the previous section, cache depots can be incorporated into a traditional CC-NUMA multiprocessor as shown in Figure 2. The local cache holds those blocks which the local node needs to reference during a computation. The depot cache, which stores memory blocks on behalf of other nodes, is associated with the network interface. For simplicity, we assume that the depot nodes of a given requesting/home node pair are fixed and that their addresses can be obtained easily, for example from the network topology. In the remainder of the section we give a high-level description of how the system works.

2.1. Coherence Operations

Figure 2. A model CC-NUMA multiprocessor with depot caches. Each node contains a processor (P), a local cache, memory (Mem), a directory (Dir), and a network interface (NI) with an attached depot cache and injection/ejection channels; the nodes are connected by the interconnection network.

Let us first consider what happens when a read miss occurs in a local cache. On a read miss, the missing node injects a multi-hop worm carrying the read request into the network. The worm visits each depot node specified in its header in turn. When the worm reaches a depot Nj, it is taken off the network and delivered to the network interface. The network interface of Nj looks through its depot cache to determine whether the requested block is cached and valid there. If so, the incoming worm is discarded and a new message is generated to send the requested block back to the requesting node; in other words, the requested block is supplied by the depot rather than by the home node. On the other hand, if the requested block cannot be found in the depot cache, then Nj forwards the worm to the next depot specified in the worm header. If none of the depots caches the requested block, the worm eventually reaches the home node.

When the home node receives a read request, it first checks whether the requested block is dirty in another node, i.e., owned by that node. If so, the home node informs the owner to forward that block to the requesting node directly. However, if the block is clean shared or unowned, the home node goes ahead and sends the block to the requesting node, again in a multi-hop worm that routes through the designated depots. In the meantime, the home node should mark in its directory not only the requesting node but also all the depot nodes. When such a worm arrives at a depot node, the node stores the block in its depot cache and forwards the worm on to the next depot. By keeping only clean shared blocks in the depot caches, it is not necessary to maintain the multilevel inclusion property [2] in the depot caches, nor to write back a block when it is replaced.

Now let us consider the operations involved in a write miss. The missing node first sends a read-exclusive message directly to the home node, without specifying any depots in the message. After the home node receives the write request, it looks up its directory. If the block is owned by another node, the home node again informs the owner to forward the block directly to the requesting node. On the other hand, if the block is clean shared by several nodes (some are true sharers and some are depot nodes), then the home node needs to send an invalidate message to each of them. Note that this could also be done with one or more multi-hop worms, but in the current simulator this option is not included. When the sharing nodes complete their invalidation, they send an acknowledgment message directly to the requesting node to guarantee cache coherence.

There are other complications. For example, when a read request is satisfied by a depot node, the home node needs to be notified so that later invalidations can be done properly. Solution details of this and other issues can be found in [8]. For completeness of presentation, a more detailed description of the new coherence protocol is given in the Appendix.
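As a concrete summary of the read-miss path just described, the following self-contained sketch is our own simplified model; the node numbers, the depot_has_block stub, and the worm layout are hypothetical and not taken from the paper's simulator. It captures only the routing decision made at each node the worm visits: supply the block on a depot hit, otherwise forward the worm to the next depot or, when the list is exhausted, to the home node.

```c
#include <stdio.h>
#include <stdbool.h>

#define MAX_DEPOTS 4

/* Hypothetical read-request worm: the depot list is carried with the worm. */
typedef struct {
    unsigned long block_addr;   /* requested memory block */
    int requester;              /* node that took the read miss */
    int home;                   /* home node of the block */
    int depots[MAX_DEPOTS];     /* depot nodes to visit, in order */
    int num_depots;
    int next_depot;             /* index of the next depot to try */
} worm_t;

/* Stub for a depot-cache lookup at node `node`; a real network interface
 * would probe its depot cache here. */
static bool depot_has_block(int node, unsigned long addr)
{
    (void)addr;
    return node == 10;          /* pretend only node 10 holds a valid copy */
}

/* Called at each node the worm visits (including the requester, which just
 * injects it). Returns the next hop, or -1 when this depot can satisfy the
 * request and the worm terminates here. */
static int route_read_worm(int this_node, worm_t *w)
{
    if (this_node != w->requester && depot_has_block(this_node, w->block_addr))
        return -1;                              /* depot hit: reply from here */
    if (w->next_depot < w->num_depots)
        return w->depots[w->next_depot++];      /* try the next depot */
    return w->home;                             /* all depots missed: go home */
}

int main(void)
{
    worm_t w = { 0x1000, /*requester*/ 3, /*home*/ 27, { 10, 18 }, 2, 0 };
    int at = w.requester, next;

    while ((next = route_read_worm(at, &w)) != -1 && at != w.home) {
        printf("worm forwarded from node %d to node %d\n", at, next);
        at = next;
    }
    if (at == w.home)
        printf("home node %d services the request for node %d\n", at, w.requester);
    else
        printf("depot node %d supplies the block to node %d\n", at, w.requester);
    return 0;
}
```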

2.2. Support for Multi-hop Worms

A multi-hop worm is similar to a multidestination worm for multicasting in that both carry multiple destination addresses in their header and are routed through those destinations in turn. One difference between the two is that a multi-hop worm may be terminated at an intermediate destination (a depot) if the requested memory block is found valid in that depot. In other words, the worm may not reach every destination and can be intercepted at an intermediate destination. Another difference, as mentioned earlier, is that a multi-hop worm is received into the network interface when it arrives at an intermediate node, instead of being processed by the router only. Thus a multi-hop worm induces a larger latency than a multidestination worm. However, this also relaxes routing restrictions. Unlike a multidestination worm, which has to follow a base routing conformed path [13] all the way, a multi-hop worm only needs to ensure that the path between any two adjacent depot nodes conforms to the base routing; no restriction needs to be imposed on the routing across different segments of the transmission.

To support multi-hop worms on the model CC-NUMA system shown in Figure 2, we propose the following designs in the communication subsystem. Since a multi-hop message is processed by the network interface, the router need not be aware of it, so there is no need to change the router organization or the message format. The destination field of the message contains the address of the first depot node to be reached. The remaining depot addresses can be treated as part of the data field and are left for the network interface to process. We can set an upper bound on the number of depot addresses allowable in a multi-hop worm. This limits the maximum number of depots a worm can visit at one time but also sets an upper bound on the message length; as a result, the network interface only needs a fixed buffer space for each multi-hop worm, and for cache coherence operations the required buffer space is quite small. When the requested block is not found in the local depot cache, the network interface gets the next depot address from the list of depot addresses in the data field, puts it into the message header, and re-injects the multi-hop worm into the network. Here we have a design choice of not throwing away the destination address of the local node, which was in the worm header on arrival. Instead, that address can be appended to the rear of the address list in the data field. In this way, the full set of depot addresses is carried along with the worm all the way through the transmission. If in the end the home node receives the worm (because no intermediate depot has cached the requested block), it knows immediately who the depots are without further calculation or table lookup. With this information, the home node can mark its directory and reply to the requesting node with the requested block in another multi-hop worm, if the block is clean shared or unowned. Note that the reply worm need not be routed through the same path as the requesting worm, as long as it is routed through a base routing conformed path between any two adjacent depots.
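The address manipulation just described can be sketched as follows. This is our own illustration with hypothetical field names (dest, addrs, pending, visited), not the actual message format: on a depot miss, the network interface pops the next pending depot address from the data field into the header and appends its own address to the rear of the list, so that if the worm finally reaches the home node the full depot set is available there without a table lookup.

```c
#include <stdio.h>

#define MAX_ADDRS 4      /* upper bound on depot addresses per worm */

/* Hypothetical multi-hop worm layout: the router only looks at `dest`;
 * the address list lives in the data field and is handled entirely by
 * the network interface. */
typedef struct {
    int dest;                /* header: next node to route to */
    int addrs[MAX_ADDRS];    /* data field: pending depots, then visited ones */
    int pending;             /* depots not yet visited */
    int visited;             /* depots already visited, kept at the rear */
    int home;                /* final destination if every depot misses */
} mh_worm_t;

/* Invoked by the network interface of `local` when its depot cache misses. */
static void rotate_and_reinject(mh_worm_t *w, int local)
{
    int i, used = w->pending + w->visited;

    if (w->pending > 0) {
        w->dest = w->addrs[0];               /* next depot becomes the header dest */
        for (i = 1; i < used; i++)
            w->addrs[i - 1] = w->addrs[i];   /* shift the list up one slot */
        w->pending--;
        used--;
    } else {
        w->dest = w->home;                   /* no depots left: go to the home node */
    }
    if (used < MAX_ADDRS) {                  /* keep this depot's address at the rear */
        w->addrs[used] = local;
        w->visited++;
    }
    /* ...the worm would now be re-injected into the network... */
}

int main(void)
{
    /* The requester injects the worm: the header points at the first depot
     * (node 10); the data field lists the remaining depot (node 18). */
    mh_worm_t w = { 10, { 18, 0, 0, 0 }, 1, 0, /*home*/ 27 };

    rotate_and_reinject(&w, 10);             /* depot 10 missed */
    printf("next dest %d, list: %d\n", w.dest, w.addrs[0]);          /* 18, 10 */
    rotate_and_reinject(&w, 18);             /* depot 18 missed as well */
    printf("next dest %d (home), list: %d %d\n",
           w.dest, w.addrs[0], w.addrs[1]);                          /* 27, 10 18 */
    return 0;
}
```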

3. Evaluation Setup

To evaluate the effectiveness of the design strategies for cache depot, as described in Section 1, we have conducted experiments through simulation. The simulator was built using Limes [10], an execution-driven multiprocessor simulation tool for 80x86 machines. In this section, we describe how the experiments were set up and conducted. Performance results are given in the next section.

3.1. The Model Systems

Two model systems were simulated and compared: a baseline model and a reference model.

Baseline model: The baseline model is a CC-NUMA multiprocessor without depot caches or multi-hop routing. It is modeled after Alewife [1] but with a directory-based cache coherence protocol. The protocol is hardware-based and derived from the Berkeley protocol. The directory uses full-bit-vector entries: each memory block is associated with a bit vector, in which each bit indicates whether the corresponding node caches a copy of the block, plus a few state bits. The model system is interconnected with a 2D mesh network.

Reference model: The reference model models the CC-NUMA system outlined in the previous section and serves as a reference implementation of the conceptual ideas discussed in this paper. In this model, each node has a depot cache in the network interface, and multi-hop worms are used to transmit clean shared blocks. The directory structure and cache coherence protocol are similar to those in the baseline model, except that the coherence protocol is extended to work with depot caches [8]. There are many ways to choose the depot nodes; in this model we use the following simple heuristic.

For a given pair of requesting node $N_r(r_x, r_y)$ and home node $N_h(h_x, h_y)$, two depot nodes are selected and used:

$$N_1(d_{1x}, d_{1y}), \quad \text{where } (d_{1x}, d_{1y}) = \left(r_x,\ \left\lfloor \tfrac{r_y + h_y}{2} \right\rfloor\right),$$

and

$$N_2(d_{2x}, d_{2y}), \quad \text{where } (d_{2x}, d_{2y}) = \left(\left\lfloor \tfrac{r_x + h_x}{2} \right\rfloor,\ \left\lfloor \tfrac{r_y + h_y}{2} \right\rfloor\right).$$
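A direct transcription of this heuristic (our own sketch; the coordinate layout and the example node pair are hypothetical) is:

```c
#include <stdio.h>

typedef struct { int x, y; } node_t;

/* Depot selection heuristic from Section 3.1: N1 sits in the requester's
 * column, halfway (in y) toward the home node, and N2 sits halfway in both
 * dimensions. Integer division gives the floor for non-negative coordinates. */
static void select_depots(node_t req, node_t home, node_t *n1, node_t *n2)
{
    int mid_y = (req.y + home.y) / 2;
    n1->x = req.x;                 n1->y = mid_y;
    n2->x = (req.x + home.x) / 2;  n2->y = mid_y;
}

int main(void)
{
    node_t req = { 1, 6 }, home = { 5, 2 }, n1, n2;   /* hypothetical pair */
    select_depots(req, home, &n1, &n2);
    printf("N1 = (%d, %d), N2 = (%d, %d)\n", n1.x, n1.y, n2.x, n2.y);
    /* N1 = (1, 4), N2 = (3, 4) */
    return 0;
}
```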

3.2. System Parameters

Table 1 lists the simulated system parameters. The 2D mesh interconnection network is configured with 8 × 4 = 32 nodes. Each node has a local cache of 8 Kbytes and a depot cache of 16 Kbytes. Some coherence messages are transmitted as unicast messages, which take four flits; others, such as read requests and shared replies, are transmitted with multi-hop worms, which take 16 flits.

Table 1. Simulated system parameters
  Number of nodes: 32 nodes
  Network: 8 × 4 mesh
  Cache size: 8 Kbytes
  Cache block size: 32 bytes
  Set associativity: 2-way
  Cache access time: 1 cycle
  Depot cache size: 16 Kbytes
  Depot cache block size: 32 bytes
  Depot set associativity: 4-way
  Depot cache access time: 1 cycle
  Switching delay: 1 cycle
  Wire delay: 1 cycle
  Link width: 16-bit
  Flit length: 16-bit
  Message length: 4 flits, 16 flits
  Virtual channels per link: 4
  Injection/ejection channels per link: 4
  Message startup: 4 cycles
  Message pack/unpack: 3 cycles
  Directory lookup: 5 cycles
  Memory access: 20 cycles
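As a rough illustration of how these parameters combine, and assuming one flit crosses a 16-bit link per cycle with no contention (our own back-of-envelope estimate, not a formula from the paper), the no-load latency of an $L$-flit message over $D$ hops under cut-through switching is approximately

$$T \approx T_{startup} + T_{pack} + D\,(T_{switch} + T_{wire}) + L + T_{unpack}.$$

For a 4-flit unicast over 4 hops this gives roughly $4 + 3 + 4(1+1) + 4 + 3 = 22$ cycles, before any directory lookup (5 cycles) or memory access (20 cycles) at the destination; a 16-flit multi-hop worm adds 12 more cycles of serialization per segment.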

3.3. Benchmarks

Four applications from the SPLASH-2 benchmark suite were chosen to evaluate the model systems (see Table 2). Program data were distributed to the nodes in a round-robin fashion. Only shared data were simulated. The execution time was measured from the time when the simulation started. All benchmark programs were compiled with the GNU C compiler version 2.7 on the Linux operating system with -O2 optimization.

Because only shared data were modeled, the size of the simulated caches needs to be chosen very carefully. For the reference model, the local cache was set to 8 Kbytes so that it is large enough to hold the level-1 working set [4] of all the benchmarks. The size of the depot caches was chosen so that the level-2 working set can be accommodated. An exception is Ocean, which has a very large level-1 working set, so the caches used in the simulator can capture only that set. Due to space limitations, we present only four benchmark programs from SPLASH-2; a richer set of experiments can be found in [8].

Table 2. Benchmarks used in the evaluation and their problem sizes
  Water-Nsq: Molecular dynamics N-body problem; 512 molecules
  Ocean: Simulation of large-scale ocean movements based on eddy and boundary currents; 130 × 130 ocean
  Cholesky: Factorization of a sparse matrix into the product of a lower triangular matrix and its transpose; tk29.O
  FMM: Solving the N-body problem using a parallel adaptive Fast Multipole method; 8192 particles

4. Evaluation Results and Discussions

The normalized execution times of the benchmark programs obtained from our simulation are shown in Figure 3, using the baseline model as the reference. The figure shows that the execution time on the reference model improves by 11% to 21% over the baseline model. In the remainder of this section, we investigate the sources of these improvements and the factors affecting performance.

4.1. Factors Affecting Performance

Let us first look at what happens in the communication subsystem and the network. In this set of experiments, the number of messages transmitted between each pair of nodes was recorded. Each segment of a multi-hop message, i.e., from one intermediate depot node to the next, was counted as one message between those two depot nodes. The resulting percentages are shown in Figure 4. From the figure we can see that the message distribution is smoothed out somewhat in the reference model. The percentage of transactions between a given pair of nodes increases in general but decreases at the hot spots. The increased communication is due to accesses to and replies from depot caches, which indicates that nodes do play an active role as depots and supply their cached blocks to others, a point elaborated shortly. A smoother message distribution also results in fewer hot spots in the system, and a more uniform message load generally improves network bandwidth utilization and reduces the stress on network resources. This is also evident from Figure 4: Water-Nsq, Ocean, Cholesky and FMM, which have a more uniform message distribution, show overall performance improvements of 19%, 17%, 11% and 21%, respectively. With this general picture of message transmission in the systems, we can now focus on the constituents of the messages and their effects.

Hit ratio of depot caches

In the baseline model all read requests are sent to the home node. In the reference model some portion of those requests are serviced by the depot nodes and the data are returned to the requesting nodes directly. Figure 5 shows that ratio. The figure indicates that for programs such as FMM, Ocean and Water-Nsq, between 35% and 50% of the requests can be satisfied from the depot caches. This implies that depot caches are effective for these programs.

Figure 3. Normalized execution time of the benchmark programs (Water-Nsq, Cholesky, Ocean, FMM) on the baseline and reference models

Amount of messages transmitted

In the reference model, the amount of network transactions is expected to increase. This is a natural result of multi-hop routing, because what was previously a single communication between a requesting node and its home node now becomes several transmissions involving the depot nodes; recall that each segment was counted as one message in our experiments. Furthermore, additional messages need to be exchanged to maintain the depot caches, e.g., during invalidation and the reply of shared data. Figure 6 shows that for most programs the amount of network transactions increases by 21% to 31%, and for Cholesky the increase is 41%.

Figure 5. Hit ratio of depot caches

Figure 6. The total amount of messages transmitted

Figure 4(a). Percentage of messages between node pairs: Cholesky on the reference model
Figure 4(b). Percentage of messages between node pairs: Cholesky on the baseline model

For Cholesky, despite the increased traffic, the total execution time on the reference model still improves by 11%. This can be explained by the message breakdown discussed next.

Figure 4(c). Percentage of messages between node pairs: FMM on the reference model
Figure 4(d). Percentage of messages between node pairs: FMM on the baseline model
Figure 4. The percentage of messages transmitted between a pair of nodes

Processor stall time due to memory operations

Figure 7. The normalized stall time in the node processor due to memory operations: (a) stall time due to reads and synchronizations; (b) stall time due to writes

In this set of experiments we study the mix of the messages and the resultant processor stalls due to memory operations. Note that the model systems use sequential memory consistency, so every memory operation stalls the node processor until the operation is completed. We classify memory operations into reads, writes, and synchronizations. Figure 7 shows the processor stall time, using the baseline model as the reference. In the reference model the node processors spend less time waiting for read and synchronization operations; for the tested benchmarks the improvement over the baseline model ranges from 12% to 35%. The reduction in read stall time can be attributed mainly to the use of depot caches, which reduce the average distance a read request has to travel to fetch the requested memory block.

Unfortunately, processor stall time due to write operations increases in the reference model, by 7% to 21% for most programs. This is because the home node now has to send extra invalidate messages to notify the depot nodes on a write operation. Also, those nodes that obtain their data from depot nodes instead of the home node require extra messages during invalidation. Although stalls due to write operations increase, the net performance of the reference model is still better than that of the baseline model (see Figure 3). From Figure 8 we can see that this is because read and synchronization operations far outnumber writes, accounting for about 71% to 81% of the coherence operations in the benchmark programs. For example, Water-Nsq incurs a much higher processor stall time on write operations in the reference model, yet it executes about 19% faster on the reference model, because writes constitute only about 20% of its total coherence operations, as shown in Figure 8.
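To see why the net effect is positive, a purely illustrative weighting helps; the numbers below are our own and treat each class's stall time as proportional to its operation count, which is a simplification. If reads and synchronizations account for about 80% of the operations and their stall time falls by 20%, while writes account for the remaining 20% and their stall time grows by 20%, the aggregate stall time changes by roughly

$$0.8 \times (-0.20) + 0.2 \times (+0.20) = -0.12,$$

i.e., a net reduction of about 12%, consistent in direction with the measured speedups.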

Figure 8. Breakdown of coherence operations into reads/synchronizations and writes (Water-Nsq, Cholesky, Ocean, FMM)

5. Related Work

The concept of a depot cache is closely related to that of a hierarchical directory. In a hierarchical directory, the directory information of a memory block is organized logically as a tree. The directory hierarchy may be reflected in the topology of the interconnection network [7], or it may be embedded in a non-hierarchical, scalable interconnection network [9]. The latter approach has several advantages, including allowing more general applications to be executed efficiently, requiring only constant network bandwidth, and avoiding hot spots and bottlenecks. The model CC-NUMA multiprocessor adopted in this paper follows that approach, and our performance evaluations were done on top of a mesh network. However, depot caches differ from hierarchical directories in that depot caches store the actual memory blocks, and different depot caches for a given memory block have no hierarchical relationship among them, nor do they maintain multilevel inclusion. A miss in a local cache requires the requesting node to send a multi-hop message, which passes through a number of depot caches on its way to the home node. In general the depot nodes are chosen from those on the path from the requesting node to the home node. Note that there is no restriction that the depot nodes be visited in a particular order, except as disallowed by the base routing algorithm, nor is there any requirement that every one of them be visited during each request.

Network cache [3] is similar to cache depot in adding a cache to the network interface, but on a hierarchy of networked workstations. Each cluster in that system is a bus-based multiprocessor with one to four processors, and there is a dedicated network cache for each cluster. Clusters are connected with a global ring to form the next level of the hierarchy. As mentioned earlier, cache depot does not have or rely on a hierarchical structure; the choice of the depot nodes depends on the routing path of the multi-hop worms.

As mentioned earlier, multi-hop routing was derived from the multidestination worm concept [5], which was proposed originally for supporting multicast. Several works have applied multidestination routing to reduce the latency of invalidations in CC-NUMA multiprocessors [5, 11]. In this paper, we attempt to combine message routing with the coherence protocol and the cache organization. A multi-hop message needs to be received into the intermediate destination node's network interface, processed, and re-injected into the network if necessary, or terminated without going through all the destinations. Unlike a multidestination worm, which has to follow a base routing conformed path all the way, a multi-hop message only needs to follow such a path between adjacent destinations.

The way a multi-hop message visits several nodes during its transmission is also similar to random routing [15] for reducing the contention between each pair of source and destination. That routing scheme takes several phases to deliver a message: during each phase, an intermediate node is chosen randomly and the message is forwarded to that node; when the intermediate node receives the message, it re-injects it into the network toward another intermediate node or the destination. Duato et al. proposed a multi-phase routing scheme for fault-tolerant routing [14] that is similar in concept: certain intermediate nodes are chosen, to which the message is forwarded in order to avoid faults or deadlocks in the network. In this paper, we apply this idea to cache coherence on CC-NUMA multiprocessors.

6. Conclusions

The basic idea of cache depot is to store extra memory blocks in the nodes of a CC-NUMA multiprocessor on behalf of other nodes. In this way, memory requests from a node can be satisfied by nearby depot nodes without going all the way to the home node. In this paper we evaluated the design strategy of incorporating depot caches in the network interface and transmitting coherence messages with multi-hop worms. Four programs from the SPLASH-2 benchmark suite were used in our execution-driven simulators, which model two different architectures. Experiments were conducted to study the impact on overall performance of such factors as message distribution, memory stall time, and message mix. Our simulations show that for Water-Nsq, Cholesky, Ocean and FMM, the total execution time can be reduced by 11% to 21%, because depot caches can supply needed data and the message distribution becomes more even. Even for programs with quite a few synchronization operations, such as Cholesky, the execution time can still be improved by 11%. We found that with the test-and-test-and-set primitive, nodes can acquire their synchronization variables from nearby depot caches.

The work presented in this paper is only the beginning of a series of future research. We need to refine the coherence protocol so as to reduce the amount of messages transmitted, especially for writes. We also have to address the cost/performance issue and study more closely how much of the performance improvement comes from the increased cache space, especially when the local cache is large. Due to limited computing resources and time, we could not investigate in detail the effects of network topology, routing algorithm, problem size, and other system parameters; these are also left for future work. Finally, the use of a more relaxed memory consistency model is worth further study.

References

[1] A. Agarwal et al. The MIT Alewife machine: Architecture and performance. In Proc. of the International Symposium on Computer Architecture, pages 2–13, June 1995.
[2] J.-L. Baer and W.-H. Wang. On the inclusion properties for multi-level cache hierarchies. In Proc. of the 15th International Symposium on Computer Architecture, pages 73–80, May 1988.
[3] J. K. Bennett, K. E. Fletcher, and W. E. Speight. The performance value of shared network caches in clustered multiprocessor workstations. In Proc. of the 16th International Conference on Distributed Computing Systems, pages 64–74, May 1996.
[4] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proc. of the International Symposium on Computer Architecture, pages 24–36, 1995.
[5] D. Dai and D. K. Panda. Reducing cache invalidation overheads in wormhole-routed DSMs using multidestination message passing. In Proc. of the 1996 International Conference on Parallel Processing, 1996.
[6] J. Duato and M. P. Malumbres. Optimal topology for distributed shared memory multiprocessors: Hypercubes again? In Proc. of Euro-Par'96, 1996.
[7] S. Frank, H. Burkhardt III, and J. Rothnie. The KSR1: Bridging the gap between shared memory and MPPs. In Proc. of COMPCON, pages 285–294, 1993.
[8] H.-C. Hsiao and C.-T. King. Designing depot caches on CC-NUMA multiprocessors. Technical Report 98-01, National Tsing Hua University, Taiwan, 1998.
[9] S. K. Kaxiras and J. R. Goodman. The GLOW cache coherence extensions for widely shared data. In Proc. of the 10th ACM International Conference on Supercomputing, May 1996.
[10] D. Magdic. Limes: A multiprocessor simulation environment. IEEE Computer Technical Committee on Computer Architecture Newsletter, pages 68–71, March 1997.
[11] M. P. Malumbres, J. Duato, and J. Torrellas. An efficient implementation of tree-based multicast routing for distributed shared-memory multiprocessors. In Proc. of the 8th IEEE Symposium on Parallel and Distributed Processing, October 1996.
[12] L. M. Ni and P. K. McKinley. A survey of wormhole routing techniques in direct networks. IEEE Computer, pages 62–76, February 1993.
[13] D. K. Panda, S. Singal, and P. Prabhakaran. Multidestination message passing mechanism conforming to base wormhole routing scheme. In Proc. of the Parallel Computer Routing and Communication Workshop, pages 131–145, 1994.
[14] Y.-J. Suh, B. V. Dao, J. Duato, and S. Yalamanchili. Software-based fault-tolerant oblivious routing in pipelined networks. In Proc. of the 1995 International Conference on Parallel Processing, August 1995.
[15] L. G. Valiant. A scheme for fast parallel communication. SIAM Journal on Computing, pages 350–361, 1982.
