J Supercomput (2013) 66:249–261 DOI 10.1007/s11227-013-0900-4
DP&TB: a coherence filtering protocol for many-core chip multiprocessors
Fengkai Yuan · Zhenzhou Ji
Published online: 7 March 2013 © Springer Science+Business Media New York 2013
Abstract Future many-core chip multiprocessors (CMPs) will integrate hundreds of processor cores on chip. Two cache coherence protocols are the mainstream ones applied to current CMPs. The token-based protocol (Token) provides high performance, but it generates a prohibitive amount of network traffic, which translates into excessive power consumption. The directory-based protocol (Directory) reduces network traffic, yet trades it off against the storage overhead of the directory and suffers comparatively low performance caused by indirection, limiting its applicability to many-core CMPs. In this work, we present DP&TB, a novel cache coherence protocol particularly suited to future many-core CMPs. In DP&TB, cache coherence is maintained at the granularity of a page, making it possible to filter out unnecessary coherence inspections for blocks inside private pages and network traffic for blocks inside shared pages. We employ Directory to detect private and shared pages and Token to maintain the coherence of the blocks inside shared pages. DP&TB inherits the merits of Directory and Token and overcomes their problems. Experimental results show that DP&TB comprehensively outperforms Directory and Token, with an improvement of 9.1 % in performance over Token and of 13.8 % in network traffic over Directory. In addition, the storage overhead of DP&TB is less than half of that of Directory. Our proposal can fulfill the requirements of many-core CMPs for high performance, power, and area efficiency.

Keywords Chip multiprocessors · Cache coherence protocol · Coherence filtering · Page granularity · On-chip network traffic · Indirection problem

F. Yuan · Z. Ji
School of Computer Science and Technology, Harbin Institute of Technology, 92 Xidazhi Street, Harbin, Heilongjiang 150001, China
1 Introduction and motivation

The era of many-core CMPs is coming, and some manufacturers already produce many-core chips today, such as Tilera's 100-core TILE-GX [17] and Kalray's MPPA with 256 integrated processors [8]. For many-core CMPs, tile-based architectures are the most practical; they are organized as arrays of identical tiles connected over an unordered point-to-point on-chip network [16]. CMPs must sustain cache coherence to provide programmers with the intuition of the shared-memory model. Although a great deal of effort has been dedicated to cache coherence protocols for CMPs, many-core CMPs require new mechanisms to fulfill increasingly demanding performance requirements, area efficiency, and power constraints.

Directory [2] has classically been employed in systems such as tiled CMPs. However, the indirection caused by obtaining coherence information from directories increases cache miss latencies and thus harms overall performance. Moreover, the heavy storage overhead of Directory negatively impacts area requirements: the protocol requires directory caches large enough to hold coherence information for all cached memory blocks on chip, which increases storage overhead significantly.

Token [12] is an alternative protocol that avoids indirection. It broadcasts requests to all last-level private caches, and the cache keeping the requested data can provide it directly upon receiving a request. Nevertheless, the broadcast Token relies on critically increases network traffic and, consequently, power consumption in the interconnection network, which has been reported to constitute a significant fraction of the overall chip power [10]. In this respect, Directory has the undoubted advantage of less intensive network traffic.

In addition, Directory and Token share a common problem: they maintain cache coherence uniformly at the granularity of a block, even though private blocks do not require coherence maintenance. It has been widely recognized [6, 7, 18] that a significant fraction of the memory blocks used by parallel applications are private. Cuesta et al. [6] report the fraction of private and shared blocks for a wide range of parallel applications: on average, 75 % of the blocks are private. Therefore, unconditionally performing a block coherence inspection for every last-level private cache miss brings unnecessary extra cache miss latency and network traffic for both Directory and Token. Worse still, for Directory, tracking coherence information for private blocks wastes scarce on-chip storage on redundant directory cache entries.

Essentially, operating systems allocate memory to processes and threads at the granularity of a page, a much coarser grain than a block. Processes or threads running on different processor cores cannot share blocks if the page holding those blocks is private; only blocks inside a shared page can possibly be shared. Thus, blocks can be classified as private or possibly shared by knowing whether the page holding them is private. With a page coherence inspection on chip, all blocks inside a private page can be exempted from onerous coherence inspections on last-level private cache misses.

In this work, we propose DP&TB, a novel protocol that maintains cache coherence differentially at the granularity of page and block. We use Directory to maintain the
coherence of Pages and Token to maintain the coherence of possibly shared Blocks inside shared pages. The directory information for pages is distributed among the tiles on chip through a physical address mapping. The requestor tile sends a message to the home tile of an unacquainted page (one whose status is not yet known locally) holding the requested block, to determine whether the page is private. For a private page, all blocks inside the page are treated as private, without any coherence inspection of those blocks. For a shared page, the requestor tile broadcasts requests only to the sharer tiles of the page to maintain the coherence of the requested block. Directory provides light network traffic for page coherence maintenance, while Token contributes directness (no indirection) when maintaining the coherence of blocks inside shared pages. At the same time, DP&TB eases the indirection of Directory by filtering out a huge number of unnecessary block coherence inspections, and considerably reduces the network traffic of Token by broadcasting selectively to page sharer tiles only.

Detailed simulations show that DP&TB achieves the best performance and network traffic compared with Directory and Token, improving execution time by 9.1 % on average over Token and network traffic by 13.8 % on average over Directory. Besides, DP&TB has a storage overhead of less than half of that of Directory.
2 Related works

Ros et al. [15] proposed the Direct Coherence protocol, which improves Directory by avoiding its indirection: coherence messages are sent directly from the requesting caches to the owner tile, which stores up-to-date sharing information and provides the desired block. TOKENM [11] and Token Tenure [14] are hybrid protocols that use predictive multicast to capture the latency benefits of Token while capturing the bandwidth benefits of Directory; they extend Directory with token counting and destination-set prediction. However, these protocols cannot solve the storage overhead problem of Directory and even aggravate it by adding coherence caches [15] and destination-set predictors [11, 14], which makes it difficult for them to fulfill both the area and power efficiency demands of future many-core CMPs.

Some researchers employ the OS to detect private and shared pages. Hardavellas et al. [7] used such detection to propose an efficient data placement policy for NUCA caches, but they do not consider coherence aspects. Kim et al. [9] employ OS detection to reduce broadcasts in Token. They propose a sophisticated mechanism to detect shared blocks and their sharing degree so that broadcast can be substituted with multicast. Unfortunately, this proposal needs extra hardware (much larger TLBs) and adds considerable OS overhead. Cuesta et al. [6] focus on how the detection of shared and private blocks can be used to increase directory effectiveness. Their mechanism does not require complex hardware/OS modifications and can reduce the size of the directory cache. However, it does not address the indirection problem of Directory when handling the coherence of possibly shared blocks. In contrast, our proposal exploits the cache coherence protocol itself to detect private and shared pages, and therefore requires no processor core or OS modification. Notably, we achieve significant system improvement without radical hardware or software changes.
Fig. 1 The basic idea of DP&TB. (a) The process of grabbing the directory information of an unacquainted page, which is private in this case. (b) The coherence maintenance of blocks inside an acquainted shared page
Some authors eliminate the unnecessary traffic of broadcast-based protocols by performing coarse-grain tracking of blocks. Moshovos [13] and Cantin et al. [5] proposed RegionScout filters and Region Coherence Arrays, respectively, which trade accuracy against implementation cost. Zebchuk et al. [19] proposed RegionTracker, which reduces the storage overhead and removes the imprecision of the two previous proposals. However, these proposals need substantial modifications to the cache design to realize region-level lookups. In contrast, our proposal simply adds two conventional caches per tile, storing page directory and block attendance information, to implement page-grain coherence maintenance.
3 DP&TB protocol

Figure 1 illustrates the basic idea of our proposal. Using Directory to classify pages into private and shared, DP&TB avoids useless coherence inspections for all blocks inside private pages and rapidly handles the coherence of potentially shared blocks inside shared pages with a refined Token.

Figure 1(a) shows the Directory part of the protocol. To learn whether the page holding a block that missed in the last-level private cache is private, the requestor tile (R) sends a GetP message to the home tile (H) of the page. H receives the message, updates the related page directory information, and sends PInfo to R and, if the page has already been accessed by other tiles, UpInfo message(s) to the former sharer tile(s) (S) to update their page sharer information. In Fig. 1(a), the page is private, so H only sends PInfo to R. After the delivery of PInfo, R directly accesses the home memory controllers (M) of the blocks inside the page when processing last-level private cache misses, without any coherence inspection.

Figure 1(b) shows the Token part of the protocol. Knowing that the page holding the requested block is shared, and knowing the identity of its sharers, R sends GetX messages only to the S tiles and the M tile, attempting to obtain data and tokens on chip, or from memory in the case of private blocks inside a shared page, for the coherence maintenance of blocks inside the page. In Fig. 1(b), R avoids accessing memory for the missed block because it receives a response with data and tokens from one S tile on chip.

In this way, the advantages of Directory and Token, light network traffic and directness, are exploited to implement page-grain coherence, while the payoff, coherence filtering, solves their problems of indirection and heavy network traffic through much less frequent page coherence inspections and multicast restricted to sharer tiles.
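To make the message flow of Fig. 1 concrete, the sketch below outlines how a requestor tile could choose between the page (Directory) path and the block (Token) path on a last-level private cache miss. It is only an illustrative model of the behavior described above; types and functions such as PageState, sendGetP, and multicastGetX are our own naming, not the authors' implementation.

```cpp
#include <cstdint>
#include <bitset>
#include <optional>

constexpr int NUM_TILES = 16;                // 4 x 4 tiled CMP assumed in this work
using SharerMask = std::bitset<NUM_TILES>;   // one bit per potential sharer tile

// Locally cached knowledge about a page (conceptually, a PRC$ entry).
struct PageState {
    SharerMask sharers;                                    // sharer tracker from PInfo/UpInfo
    bool isPrivate() const { return sharers.count() == 1; }
};

// Hypothetical messaging hooks; in hardware these are on-chip network messages.
void sendGetP(uint64_t pageAddr, int homeTile) { /* page-level lookup (Directory part) */ }
void multicastGetX(uint64_t blockAddr, const SharerMask& targets) { /* block request (Token part) */ }
void accessMemoryController(uint64_t blockAddr) { /* fetch the block from off-chip memory */ }

// Requestor-side handling of a last-level private cache miss, roughly following Fig. 1.
void onLastLevelMiss(uint64_t blockAddr, uint64_t pageAddr, int homeTile, int myTile,
                     const std::optional<PageState>& prcEntry) {
    if (!prcEntry) {
        // Unacquainted page: ask the home tile first (Fig. 1(a)). The PInfo reply
        // installs a PageState entry locally; the miss is then handled normally.
        sendGetP(pageAddr, homeTile);
    } else if (prcEntry->isPrivate()) {
        // Private page: no block coherence inspection at all, go straight to memory.
        accessMemoryController(blockAddr);
    } else {
        // Shared page: refined Token, multicast only to the known sharers of the page
        // (the home memory controller is also addressed in the real protocol).
        SharerMask targets = prcEntry->sharers;
        targets.reset(static_cast<size_t>(myTile));   // no request to ourselves
        multicastGetX(blockAddr, targets);
    }
}
```

The key point the sketch captures is that once PInfo has classified a page as private, later misses to that page skip coherence entirely and go straight to memory.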
Fig. 2 The hardware organization of DP&TB assumed in this work
We extend the tag part of all cache levels to keep a token count for every block stored in them. In addition, DP&TB requires two hardware structures that keep page coherence information for home and local tiles:

• Page directory cache (PDC$): This structure stores one vector, the sharer tracker, used by home tiles to track the identity of the tiles sharing specific pages. A sharer tracker with only one bit set indicates a private page.

• Page record cache (PRC$): This structure stores two vectors. One is the sharer tracker, identical to the one in the PDC$, which is delivered in PInfo and UpInfo messages from home tiles and lets local tiles infer the private ownership or the sharers of specific pages. The other is the block tracker, which records the presence of all blocks of a specific page in the local tile's caches. This vector plays a significant role in the replacement maintenance of both structures (see Sect. 3.2.3).

3.1 Architecture of DP&TB-CMP

Figure 2 shows the tile organization (left) of the 16-tile CMP (right) assumed in this work. Each tile contains a processing core with both instruction and data L1 caches, a slice of the L2 cache, and a connection, the router, to the on-chip network. Each L2 slice is treated as a private L2 cache for the local processing core in this design. Moreover, the L1 and L2 caches are exclusive, to exploit the total available cache capacity on chip. In addition, each tile adds the two structures introduced in the previous section: the page directory and page record caches (see Fig. 2, left). A comparison of the extra storage and structures required by Directory, Token, and DP&TB in this work can be found in Sect. 3.3 and demonstrates the area efficiency of our proposal.

3.2 Description of the cache coherence protocol

3.2.1 Page address

DP&TB uses the page address to maintain page-grain coherence. The page address is derived from the physical address according to the system parameters. Figure 3 shows the partitioning of the physical address for the system configuration of this work (64-byte blocks, 4-KB pages, and a 16-tile CMP). Page directory information is address-interleaved among tiles by the last log2(n) bits of the page address, where n is the number of tiles on chip. In Fig. 3, the 4-bit page pin identifies the home tile of a page. The page address is typically extracted from the address of a block that missed in the L2 cache, in order to refer to pages in the PDC$, the PRC$, PInfo messages, etc.
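As a concrete illustration of the partitioning in Fig. 3, the short sketch below derives the page address, the block index within a page, and the home tile (page pin) from a physical address under the parameters assumed in this work (64-byte blocks, 4-KB pages, 16 tiles). The helper names are ours.

```cpp
#include <cstdint>
#include <cstdio>

constexpr unsigned BLOCK_OFFSET_BITS = 6;   // 64-byte blocks
constexpr unsigned PAGE_OFFSET_BITS  = 12;  // 4-KB pages
constexpr unsigned HOME_PIN_BITS     = 4;   // log2(16 tiles)

// Page address: the physical address with the page offset stripped.
uint64_t pageAddress(uint64_t paddr) { return paddr >> PAGE_OFFSET_BITS; }

// Block index within the page: the bit to set or clear in the PRC$ block tracker.
unsigned blockIndexInPage(uint64_t paddr) {
    return static_cast<unsigned>((paddr >> BLOCK_OFFSET_BITS) &
                                 ((1u << (PAGE_OFFSET_BITS - BLOCK_OFFSET_BITS)) - 1));
}

// Home tile ("page pin"): the last log2(n) bits of the page address interleave
// the page directory information across the n tiles.
unsigned homeTile(uint64_t paddr) {
    return static_cast<unsigned>(pageAddress(paddr) & ((1u << HOME_PIN_BITS) - 1));
}

int main() {
    const uint64_t paddr = 0x123ABCu;   // an arbitrary example address
    std::printf("page 0x%llx, block %u within the page, home tile %u\n",
                static_cast<unsigned long long>(pageAddress(paddr)),
                blockIndexInPage(paddr), homeTile(paddr));
    return 0;
}
```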
Fig. 3 The physical address partitioning of DP&TB
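Before describing the maintenance flow, the following sketch summarizes what a PDC$ entry and a PRC$ entry might contain under this configuration (16 tiles, 64 blocks per page). The field names are ours and the layout is a simplification, not the authors' hardware design.

```cpp
#include <cstdint>
#include <bitset>

constexpr int NUM_TILES       = 16;   // sharer tracker width
constexpr int BLOCKS_PER_PAGE = 64;   // 4-KB page / 64-byte blocks = block tracker width

// One entry of the page directory cache (PDC$), held at the page's home tile.
struct PdcEntry {
    uint64_t pageTag = 0;                    // identifies the page
    std::bitset<NUM_TILES> sharerTracker;    // which tiles use the page; one bit set => private
    bool locked = false;                     // entry is blocked while UpInfo ACKs are pending
};

// One entry of the page record cache (PRC$), held at every tile that uses the page.
struct PrcEntry {
    uint64_t pageTag = 0;
    std::bitset<NUM_TILES>       sharerTracker;  // copy received via PInfo/UpInfo messages
    std::bitset<BLOCKS_PER_PAGE> blockTracker;   // which blocks of the page are cached locally

    bool isPrivate()   const { return sharerTracker.count() == 1; }
    bool hasNoBlocks() const { return blockTracker.none(); }   // early-eviction candidate (Sect. 3.2.3)
};
```

Note that the vector widths match the entry sizes used later in Table 1: a 16-bit sharer tracker gives a 2-byte PDC$ entry, and 16 + 64 bits give a 10-byte PRC$ entry (ignoring tags and state bits).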
3.2.2 Page coherence maintenance

In this work, L2 misses and coherence upgrades require an access to the local PRC$ for page coherence information. The sharer tracker in the PRC$ tells whether a page is private and, if not, which tiles are the sharers of the page. If a PRC$ miss happens, however, the PRC$ has to access the PDC$ of the home tile, identified by the page pin of the page address. In addition to providing page coherence information, PDC$s also work as the main ordering points. When a hit happens in a PDC$, it locks the inquired entry, updates the sharer tracker, sends UpInfo messages (with the new sharer tracker) to the PRC$s of the former sharer tiles, and waits for their acknowledgement messages (ACKs). Once all ACKs are received, the PDC$ sends PInfo to the requesting PRC$ and unblocks the entry. If a PDC$ miss happens, it simply builds a new entry for the (private) page and sends PInfo directly. Upon receiving a PInfo, a PRC$ builds a new entry and begins to serve the coherence of the blocks inside the page. The processing of an L2 miss eventually reports the arrival of the new block to the local PRC$. Conversely, all that L2 replacements and coherence downgrades need to do in DP&TB is to report the new absence of the block to the local PRC$. Figure 4 shows the general work scheme of DP&TB.
Fig. 4 The general work scheme of DP&TB
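The sketch below mirrors the home-tile side of this scheme: how a PDC$ might serve a GetP request, acting as the ordering point described above. It is a behavioral sketch under our own naming (onGetP, sendUpInfo, sendPInfo); in particular, the wait for ACKs from former sharers is only indicated by a comment.

```cpp
#include <cstdint>
#include <bitset>
#include <unordered_map>

constexpr int NUM_TILES = 16;
using SharerMask = std::bitset<NUM_TILES>;

struct PdcEntry { SharerMask sharers; bool locked = false; };

// Hypothetical network hooks (stubs standing in for on-chip messages).
void sendUpInfo(uint64_t page, int tile, const SharerMask& s) { /* update a former sharer's PRC$ */ }
void sendPInfo (uint64_t page, int tile, const SharerMask& s) { /* deliver page info to the requestor */ }

// Home-tile handling of a GetP request, roughly following Sect. 3.2.2:
// lock the entry, update the sharer tracker, notify former sharers, and
// (conceptually after their ACKs arrive) reply with PInfo and unlock.
void onGetP(std::unordered_map<uint64_t, PdcEntry>& pdc, uint64_t page, int requestor) {
    auto it = pdc.find(page);
    if (it == pdc.end()) {
        // PDC$ miss: the page has never been accessed; build a private entry and answer directly.
        PdcEntry e;
        e.sharers.set(static_cast<size_t>(requestor));
        pdc.emplace(page, e);
        sendPInfo(page, requestor, e.sharers);
        return;
    }
    PdcEntry& e = it->second;
    e.locked = true;                               // serve as the ordering point for this page
    SharerMask former = e.sharers;
    e.sharers.set(static_cast<size_t>(requestor));
    for (int t = 0; t < NUM_TILES; ++t)
        if (former.test(static_cast<size_t>(t)) && t != requestor)
            sendUpInfo(page, t, e.sharers);        // former sharers must ACK before we proceed
    // In hardware the PInfo reply and the unlock wait for all ACKs; the timing is elided here.
    sendPInfo(page, requestor, e.sharers);
    e.locked = false;
}
```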
3.2.3 Replacement

As the system runs, processor cores inevitably access more and more pages, so both PDC$s and PRC$s must evict relatively inactive (LRU) entries to make room for new active arrivals. We use a flushing strategy for replacement maintenance. A PRC$ replacement flushes all blocks of the specific page out of the local tile, invalidates the entry, and sends a PutP message to the PDC$ of the home tile to relinquish the sharing right to the page. The PDC$ then updates its sharer tracker and immediately sends UpInfo to the other sharer tiles, avoiding unnecessary network traffic in later coherence maintenance, or simply invalidates its entry if the page was private. A PDC$ replacement sends FlushP messages to the PRC$s of the sharer tiles (or to just one tile for a private page) to invalidate their corresponding page entries, which effectively flushes the whole page of blocks out of the entire on-chip cache system, and then invalidates the page entry. The block tracker of a PRC$ contributes greatly to replacement. The vector records the detailed, up-to-date presence of the blocks of a page, so a PRC$ only needs to flush the specific blocks indicated by the set bits of the vector, without searching all page blocks in the local caches, which saves a great deal of processing time and energy. Moreover, the block tracker helps detect pages with no resident blocks as soon as possible and evict their entries in time, avoiding a waste of either PRC$ or PDC$ entries.

3.3 Storage overhead consideration

Directory employs L2 tags to store the on-chip directory information when the L2 cache holds a block. Moreover, a distributed directory cache is required to keep the information when the block is stored in any of the L1 caches but not in the L2 cache. In our implementation, the number of entries of a distributed directory cache equals that of an L1 cache, since this size allows on-chip cache misses to always find the directory information without incurring directory misses. Token extends the L1 and L2 tags with a field that keeps the token count for any block stored in them. This field only has a size of 1 + log2(n) bits (one owner bit plus the non-owner token count bits), where n is the number of processor cores. DP&TB stores distributed page directory information in the PDC$. A PDC$ entry has the same size as a directory cache entry of Directory: the length of the sharer tracker it contains is equal to the number of processor cores. A PRC$ entry holds a sharer tracker and a block tracker; the length of a block tracker is the number of blocks a page contains. For the particular configuration of this work (a 4 × 4 tiled CMP with 128 KB L1 caches, 1 MB L2 cache slices, 64-byte blocks, and 4-KB pages), a sharer tracker requires 16 bits and a block tracker 64 bits, whereas every L1 and L2 cache entry needs an extra 1 + log2(16) = 5 bits to keep the token count. Table 1 summarizes the structures and their sizes required by Token, Directory, and DP&TB.
Table 1 Comparison among the extra structures and storage overhead
                   Structure   Entry size   Entries      Total size
Data               L1 cache    64 bytes     2K           128 KB
                   L2 cache    64 bytes     16K          1024 KB
Token              L1$ tags    5 bits       2K           1.25 KB
                   L2$ tags    5 bits       16K          10 KB
Directory          L2$ tags    2 bytes      16K          32 KB
                   Dir Cache   2 bytes      2K           4 KB
DP&TB              PDC$        2 bytes      512 (288)    1 KB (0.56 KB)
                   PRC$        10 bytes     512 (288)    5 KB (2.81 KB)
                   L1$ tags    5 bits       2K           1.25 KB
                   L2$ tags    5 bits       16K          10 KB
Overhead (vs. data)   Token: +0.98 %   Directory: +3.13 %   DP&TB: +1.50 % (+1.27 % with 288 entries)
Since PDC$ and PRC$ entries keep coherence information for whole pages of blocks, far fewer entries are required to serve the coherence of the 2K + 16K = 18K blocks stored in the exclusive L1 and L2 caches of a local tile. Although 288 pages are enough to hold 18K blocks, processor cores usually do not touch all the blocks of a page, so more PDC$ and PRC$ entries are required to hold enough page coherence information. To avoid excessive capacity replacements, the PDC$ and PRC$ are given 512 entries, which, with the help of the block tracker, gives them more opportunities to evict page entries emptied by the capacity replacement of the 18K cache blocks before the entry limit is reached. Fortunately, the surplus entries only increase the storage overhead by 0.23 % (from 1.27 % to 1.50 %). DP&TB thus has a storage overhead of less than half that of Directory, which fully demonstrates its area efficiency.
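For reference, the small program below recomputes the per-tile storage overheads of Table 1 from the configuration parameters; it reflects our reading of the table rather than any artifact of the paper.

```cpp
#include <cstdio>

int main() {
    // Per-tile data arrays (the baseline the overheads are measured against).
    const double dataKB = 128.0 + 1024.0;                            // 128 KB L1 + 1 MB L2 slice

    // Token: 5 extra tag bits (1 owner bit + log2(16) token-count bits) per block.
    const double tokenKB = (2048 * 5 + 16384 * 5) / 8.0 / 1024.0;    // ~11.25 KB

    // Directory: 2-byte sharer vectors in the L2 tags plus a 2K-entry directory cache.
    const double dirKB = (16384 * 2 + 2048 * 2) / 1024.0;            // 36 KB

    // DP&TB: PDC$ (2-byte entries) + PRC$ (10-byte entries) + the same token tag bits.
    auto dptbKB = [&](int entries) { return (entries * 2 + entries * 10) / 1024.0 + tokenKB; };

    std::printf("Token      : %6.2f KB (%.2f %%)\n", tokenKB, 100 * tokenKB / dataKB);
    std::printf("Directory  : %6.2f KB (%.2f %%)\n", dirKB,   100 * dirKB   / dataKB);
    std::printf("DP&TB  288 : %6.2f KB (%.2f %%)\n", dptbKB(288),  100 * dptbKB(288)  / dataKB);
    std::printf("DP&TB  512 : %6.2f KB (%.2f %%)\n", dptbKB(512),  100 * dptbKB(512)  / dataKB);
    std::printf("DP&TB 2048 : %6.2f KB (%.2f %%)\n", dptbKB(2048), 100 * dptbKB(2048) / dataKB);
    return 0;
}
```

Its output reproduces the reported 0.98 %, 3.13 %, 1.27 %, and 1.50 % figures, and also shows that even with 2048 entries per structure the overhead (about 3.06 %) would remain below Directory's, which is relevant to the discussion in Sect. 5.3.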
4 Simulation environment

We evaluate our proposal with full-system simulation using GEM5 [4], which provides a detailed memory system timing model that accounts for all protocol messages and state transitions. For modeling the interconnection network, we use GARNET [1], a detailed network simulator included in GEM5. We evaluate the performance of Directory, Token, and DP&TB; the configuration parameters and benchmarks are given in Table 2. Token and DP&TB use private L2 caches, and their L1 and L2 caches are exclusive. The L2 cache of Directory is shared among the processor cores on chip, and the private L1 and shared L2 slices are non-inclusive, because an L2 slice only stores the cache blocks whose addresses carry the corresponding home pin bits. We use 10 workloads from the PARSEC 2.1 benchmark suite [3] to evaluate our design, as shown in Table 2. The experimental results reported in this paper correspond to the parallel phase of each program.
5 Performance evaluation

In this section, we show how our proposal improves system performance and network traffic at the same time.
Table 2 System parameters and benchmarks
4 × 4 tiled CMP
  Processor core             TimingSimple (ALPHA ISA)
  Processor speed            2 GHz
Memory parameters
  Cache block size           64 bytes
  Page size                  4K bytes
  Split L1 I&D caches        128 KB, 4-way
  L1 cache hit time          4 cycles
  Sliced unified L2 cache    16 MB (1 MB/tile), 8-way
  L2 cache hit time          6 + 9 cycles (tag + data)
  Page directory caches      1 KB, 16-way, 2 hit cycles
  Page record caches         5 KB, 16-way, 2 hit cycles
  Replacement policy         Pseudo-LRU
  Memory access time         160 cycles
Network parameters
  Topology                   4 × 4 Mesh
  Link latency (one hop)     4 cycles
  Routing time               2 cycles
  Flit size                  4 bytes
  Link bandwidth             1 flit/cycle
Benchmarks
  blackscholes, bodytrack, canneal, ferret, fluidanimate, freqmine,
  streamcluster, swaptions, vips, x264 — simmedium inputs (simsmall for x264)
We also study the potential of DP&TB to provide even better performance by moderately increasing the number of PDC$ and PRC$ entries, taking into account the tradeoff between performance and storage overhead.

5.1 Execution time

Figure 5 shows the execution time of Directory and DP&TB normalized to Token. Directory requires an access to the home tile for directory information on every cache miss, which makes most cache misses take more steps (indirection) to obtain the desired data. Token avoids this indirection by broadcasting requests to all tiles and grabbing the data promptly. Therefore, Token's execution time is 4.5 % lower than Directory's on average, as shown in Fig. 5. DP&TB has the best performance, an average of 9.1 % better than Token, because it avoids an enormous number of unnecessary private block coherence inspections and retains direct (indirection-free) coherence handling, via the refined Token, for possibly shared blocks. Note that although DP&TB has to access the PRC$ on every L2 cache miss or upgrade (as shown in Fig. 4), this local-tile access takes only two cycles, far less than the remote tile accesses of Directory and Token.
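A very rough back-of-envelope, using only the network parameters of Table 2 and an assumed average distance of 3 hops between tiles, illustrates why indirection is costly and why a 2-cycle local PRC$ lookup is comparatively cheap; contention, message serialization, and cache access overlap are all ignored, so the numbers are illustrative only.

```cpp
#include <cstdio>

int main() {
    // Per-hop cost from Table 2: 4 cycles of link latency + 2 cycles of routing.
    const int perHop = 4 + 2;
    const int hops   = 3;                  // assumed average tile-to-tile distance in a 4 x 4 mesh
    const int traversal = perHop * hops;   // one tile-to-tile message, ~18 cycles

    const int direct   = 2 * traversal;    // request to the data holder + response (Token-like)
    const int indirect = 3 * traversal;    // request to home + forward to holder + response (Directory-like)
    const int prcCheck = 2;                // DP&TB's local PRC$ lookup (Table 2)

    std::printf("direct miss   : ~%d cycles of network latency\n", direct);
    std::printf("indirect miss : ~%d cycles of network latency\n", indirect);
    std::printf("DP&TB adds %d cycles when the page check hits the local PRC$\n", prcCheck);
    return 0;
}
```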
Fig. 5 Normalized execution time to Token system
Fig. 6 Normalized network traffic to Directory system (in flits)
5.2 Network traffic

Figure 6 depicts the on-chip network traffic of Token and DP&TB normalized to Directory. Token broadcasts requests to all tiles on every L2 cache miss or coherence upgrade, which generates a great amount of network traffic. Not only does this traffic increase energy consumption, it also congests the on-chip network, which in turn increases miss latency and limits the performance improvement; this is why Token improves performance over Directory by only 4.5 % in Fig. 5. Directory generates much less network traffic than Token, 35.4 % less on average, as depicted in Fig. 6. DP&TB has the least network traffic, an average of 13.8 % less than Directory, because most block coherence maintenance only needs a local PRC$ access and generates no network traffic at all. Moreover, for the possibly shared blocks, DP&TB only sends requests to the page sharer tiles that must observe them, which filters out an enormous amount of network traffic compared with Token and brings its traffic close to Directory's.
Fig. 7 Normalized execution time of DP&TB systems with different number of PDC$ and PRC$ entries
5.3 Area efficiency

In Directory, the eviction of a directory cache entry involves the invalidation of all cached copies of the associated block. Similarly, in DP&TB, the eviction of a PRC$ or PDC$ entry requires flushing all cached blocks of the page out of the local tile (for a PRC$ replacement) or out of all on-chip tiles (for a PDC$ replacement). DP&TB clearly entails more work per eviction, but these evictions are much less frequent than in Directory, since each entry stores information for a whole page. Furthermore, the area efficiency advantage allows DP&TB to adopt larger PDC$ and PRC$ entry counts and reduce the impact of replacements on performance, at the cost of some additional storage overhead.

Figure 7 shows the execution time of DP&TB with different numbers of PRC$ and PDC$ entries, normalized to the 288-entry design. The three benchmarks we chose have the most shared and exchanged data in the benchmark set used in this work, and they show the smallest performance improvement with DP&TB in Fig. 5. As Fig. 7 shows, with the 512 entries adopted by our design, the performance of the three applications improves by 10 % on average over the 288-entry design. With 2048 entries, which gives DP&TB a storage overhead similar to, but still lower than, Directory's, the performance of the three applications improves by 13 % on average over the 288-entry design.
6 Conclusion

In this paper, we have proposed a novel cache coherence protocol that maintains cache coherence at the granularity of a page. We exploit Directory to detect private and shared pages and add two hardware structures, PDC$ and PRC$, to maintain page coherence. The blocks inside a private page are exempted from coherence inspections, and the possibly shared blocks inside a shared page are handled rapidly
with refined Token on last-level private cache misses or coherence upgrades. Our simulation results show that DP&TB achieves improvements in both performance (by 9.1 % over Token) and network traffic (by 13.8 % over Directory). In our implementation, the 512-entry PDC$ and PRC$ give our proposal less than half the storage overhead of Directory, and experimental results show that DP&TB has further room for performance improvement if the PDC$ and PRC$ are enlarged at the cost of additional storage overhead. We conclude that DP&TB can fulfill the requirements of many-core CMPs for performance, power consumption, and area efficiency.
References

1. Agarwal N, Krishna T, Peh L-S, Jha NK (2009) GARNET: a detailed on-chip network model inside a full-system simulator. In: IEEE intl symp on performance analysis of systems and software (ISPASS), pp 33–42
2. Barroso LA, Gharachorloo K, McNamara R, Nowatzyk A, Qadeer S, Sano B, Smith S, Stets R, Verghese B (2000) Piranha: a scalable architecture based on single-chip multiprocessing. In: 27th intl symp on computer architecture (ISCA), pp 12–14
3. Bienia C, Kumar S, Singh JP, Li K (2008) The PARSEC benchmark suite: characterization and architectural implications. In: 17th intl conference on parallel architectures and compilation techniques (PACT), pp 72–81
4. Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The gem5 simulator. ACM SIGARCH Comput Archit News 39(2):1–7
5. Cantin JF, Lipasti MH, Smith JE (2005) Improving multiprocessor performance with coarse-grain coherence tracking. In: 32nd intl symp on computer architecture (ISCA), pp 246–257
6. Cuesta B, Ros A, Gómez EM, Robles A, Duato J (2011) Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In: 38th intl symp on computer architecture (ISCA), pp 93–104
7. Hardavellas N, Ferdman M, Falsafi B, Ailamaki A (2009) Reactive NUCA: near-optimal block placement and replication in distributed caches. In: 36th intl symp on computer architecture (ISCA), pp 184–195
8. Kalray (2012) First MPPA MANYCORE chip (MPPA256) integrates 256 cores. http://www.kalray.eu/products/mppa-manycore. Accessed 22 May 2012
9. Kim D, Ahn J, Kim J, Huh J (2010) Subspace snooping: filtering snoops with operating system support. In: 19th intl conference on parallel architectures and compilation techniques (PACT), pp 111–122
10. Magen N, Kolodny A, Weiser U, Shamir N (2004) Interconnect power dissipation in a microprocessor. In: Intl workshop on system level interconnect prediction (SLIP), pp 7–13
11. Martin MMK (2003) Token coherence. PhD dissertation, University of Wisconsin
12. Marty MR, Bingham J, Hill MD, Hu A, Martin MM, Wood DA (2005) Improving multiple CMP systems using token coherence. In: 11th intl symp on high-performance computer architecture (HPCA), pp 328–339
13. Moshovos A (2005) RegionScout: exploiting coarse grain sharing in snoop-based coherence. In: 32nd intl symp on computer architecture (ISCA), pp 234–245
14. Raghavan A, Blundell C, Martin MMK (2008) Token tenure: PATCHing token counting using directory-based cache coherence. In: 41st IEEE/ACM intl symp on microarchitecture (MICRO), pp 47–58
15. Ros A, Acacio ME, García JM (2010) A direct coherence protocol for many-core chip multiprocessors. IEEE Trans Parallel Distrib Syst 21(12):1779–1792
16. Taylor MB, Kim J, Miller J, Wentzlaff D, Ghodrat F, Greenwald B, Hoffman H, Lee JW, Johnson P, Lee W, Ma A, Saraf A, Seneski M, Shnidman N, Strumpen V, Frank M, Amarasinghe S, Agarwal A (2002) The raw microprocessor: a computational fabric for software circuits and general purpose programs. IEEE Micro 22(2):25–35
17. Tilera (2012) Tilera announces latest tile-gx family processors with up to 100 cores. http://www.tilera.com/products/processors/TILEGx_Family. Accessed 20 May 2012
18. Wang J, Wang D, Wang H, Xue Y (2012) Dynamic reusability-based replication with network address mapping in CMPs. In: 17th Asia and South Pacific design automation conference (ASP-DAC), pp 487–492
19. Zebchuk J, Safi E, Moshovos A (2007) A framework for coarse-grain optimizations in the on-chip memory hierarchy. In: 40th IEEE/ACM intl symp on microarchitecture (MICRO), pp 314–327