Latency-aware Utility-based NUCA Cache Partitioning in 3D-Stacked Multi-Processor Systems

Jongpil Jung, Seonpil Kim, and Chong-Min Kyung
Department of Electrical Engineering & Computer Science, Division of Electrical Engineering
Korea Advanced Institute of Science and Technology
[email protected], [email protected], [email protected]

Abstract— The increasing number of processor cores on a chip is a driving force behind the move to three-dimensional integration. At the same time, as the number of processor cores increases, non-uniform cache architecture (NUCA) receives growing attention. Reducing the effective memory access time, including cache hit time and miss penalty, is crucial in such multi-processor systems. In this paper, we propose a Latency-aware Utility-based Cache Partitioning (LUCP) method which reduces the memory access time in a 3D-stacked NUCA. To this end, the proposed method partitions the shared NUCA cache among the processor cores according to the latency variation (which depends on the physical distance from processor core to cache bank) and the cache access characteristics of the application programs. Experimental results show that the proposed method reduces the memory access time by up to 32.6%, with an average of 14.9%, compared to the conventional method [1].

I. INTRODUCTION

Research on increasing the number of processor cores per chip has become the mainstream of multi-processor architecture design. A processor with many simple cores provides better performance than a single complex superscalar core [2]. Recently, a processor integrating 80 cores was announced. To enjoy the peak performance of such systems, designers need to deploy complex and large caches, even larger than 32MB [3]. As the physical dimension of the cache increases with the number of cores, the wire delay of a shared cache can no longer be modeled with a single access latency: wire delay, and therefore access latency, becomes a strong function of the physical location of the data block in the shared cache. Data blocks near the processor core can be accessed faster, without waiting for the delay of a distant block. Kim et al. observed these properties and introduced the Non-uniform Cache Architecture (NUCA) [4].

Least recently used (LRU) is the most widely adopted cache management policy for shared cache architectures [5-7]. LRU discards the least recently used cache data block, so that more data blocks are allocated to programs with high memory demand. However, the LRU policy does not always provide the optimal result, because not all programs benefit from additional cache resources. Cache partitioning is a viable solution to this problem.

Figure 1. Utility of Cache Resources

In general, the number of cache misses can be reduced by assigning a larger amount of cache. However, the effectiveness of assigning additional cache varies across programs [1]. Fig. 1 shows the reduction in cache miss rate, i.e., the number of cache misses per kilo-instructions (MPKI), of various benchmark programs as the amount of cache, i.e., the number of allocated cache banks, assigned to each program increases. For example, for mcf (a well-known memory-bound benchmark), the cache miss rate drops drastically as more cache banks are allocated, while only an infinitesimal reduction is obtained for mpeg2dec and mpeg2enc (well-known computation-intensive benchmarks).

There has been extensive work on improving performance through cache partitioning [1, 8-11]. [11] proposed a concept and application of cache partitioning in a private NUCA cache. In [1], M. Qureshi et al. proposed a method which exploits utility variation across programs, called utility-based cache partitioning (UCP). Utility is defined as the reduction in cache miss rate obtained from an additional number of cache banks allocated to an application. For example, in Fig. 1, mcf is said to have high utility, while mpeg2dec and mpeg2enc are said to have low utility. UCP partitions the shared L2 cache using the utility information collected during program runs by a monitoring unit called UMON. To reduce the hardware overhead, UMON monitors only a few sampled sets instead of monitoring all possible numbers of ways. In [8], Nikas et al. proposed Adaptive Bloom Filter Cache Partitioning (ABFCP), which performs low-cost cache partitioning.

To integrate more cores on a chip, we obviously need higher transistor density. 3D die-stacking technology offers a great opportunity to increase transistor density. Additionally, three-dimensional integration leads to a drastic decrease in the length of the longest interconnects across a chip [12]. B. Black et al. showed that stacking cache memory on top of processor cores significantly reduces the cycles per memory access with only a slight increase in peak temperature [13]. In systems with many cores, packet-switched on-chip networks are replacing buses and crossbars due to their scalability and lower wiring overhead [14].

In previous works on cache partitioning, the variation of latency from the processor cores to the cache banks is not considered at runtime. In this paper, we propose a method which allocates limited cache resources to the processor cores while judiciously exploiting the variations of access latency and program utility. We target a multi-processor system with a multi-banked 3D-stacked cache. Experimental results show that the average memory access time is reduced by up to 32.6%.

This paper is organized as follows. We introduce the target architecture in section II. Then we define the problem in section III as graph coloring. Section IV describes a heuristic solution to the problem defined in section III. Experimental results are compared to conventional methods in section V. Finally, the conclusion is presented in section VI.
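As a concrete illustration of this utility metric, the short sketch below (in Python, with hypothetical MPKI curves rather than the measured curves of Fig. 1) computes the marginal miss-rate reduction of each additional cache bank, which is the quantity a UCP-style scheme compares across programs.

# Illustrative sketch of per-bank utility; the MPKI curves are hypothetical
# placeholders, not the measurements shown in Fig. 1.
mpki = {
    "mcf":      [95.0, 70.0, 52.0, 40.0, 31.0, 25.0, 21.0, 18.0],  # high utility
    "mpeg2dec": [6.0, 5.8, 5.7, 5.7, 5.6, 5.6, 5.6, 5.6],          # low utility
}

def utility(curve):
    # Miss-rate reduction gained by each additional cache bank.
    return [curve[n] - curve[n + 1] for n in range(len(curve) - 1)]

for name, curve in mpki.items():
    print(name, [round(u, 2) for u in utility(curve)])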

II. TARGET ARCHITECTURE

A. Processor with 3D-Stacked Cache

We target a multi-processor system with a 3D-stacked cache, as shown in Fig. 2. Processor cores are placed on the first tier with their own private L1 caches. The L2 cache, divided into small banks, is placed on the second tier, forming a mesh network. All processor cores share the L2 cache as the last-level cache, adopting NUCA. In this work, we assume 8 processor cores and a shared L2 cache divided into 64 banks, all connected by an on-chip mesh network. Since there are so many processing and memory elements, it is prohibitive to connect them all with a conventional bus architecture: the bus would be easily saturated and become a bottleneck of memory access, and directly connecting all elements would incur severe wiring overhead. With the on-chip network architecture shown in Fig. 2, every processor core can access any cache bank. Since it is a packet-switched on-chip network, multiple processor cores can access multiple cache banks in parallel without collision, unless their paths overlap. The latency for a processor core to access a given cache bank is assumed to be constant; therefore, each processor core is assumed to know the latency (in number of cycles) taken to access each bank.

Figure 2. Multi-processor with 3D-stacked Cache

B. Dynamic Cache Partitioning

Fig. 3 depicts the logical architecture for dynamic cache partitioning. It is an extension of the framework proposed in [1] to a many-core system. Each processor core has its own private level 1 instruction cache and level 1 data cache. The monitoring units observe the L2 cache accesses from each processor core and report the cache resource utility information to the partitioning algorithm, which then decides how the processor cores use the shared L2 cache.

Figure 3. Framework for Dynamic Cache Partitioning

The procedure of dynamic cache partitioning over time is as follows. First, the monitoring unit monitors the L2 cache accesses for a certain period. Then, using the information from monitoring, the partitioning algorithm computes the partitioning for the next period. After the computation, the shared L2 cache is actually partitioned according to the result, and processor cores keep issuing requests to the newly reconfigured L2 cache. This sequence is repeated periodically. Note that the processor cores do not need to wait for the partitioning algorithm; even during the computation, cores can continue to access the L2 cache. Thus, the partitioning algorithm does not lie on the critical path of cache accesses and does not degrade the performance of the processor cores accessing the cache.
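The following minimal Python sketch mirrors this periodic monitor-then-repartition cycle. It is only an outline under stated assumptions: the per-period miss counts and the proportional-share policy are hypothetical placeholders, not the UMON hardware or the LUCP algorithm itself.

# Sketch of the periodic monitor -> compute -> apply cycle described above.
def compute_partition(period_stats, n_banks):
    # Placeholder policy: share banks roughly in proportion to observed misses.
    total = sum(period_stats.values()) or 1
    return {core: max(1, round(n_banks * misses / total))
            for core, misses in period_stats.items()}

# Hypothetical per-period monitoring results (misses observed per core).
periods = [
    {"p0": 900, "p1": 120, "p2": 100, "p3": 80},
    {"p0": 850, "p1": 140, "p2": 110, "p3": 90},
]

partition = {core: 16 for core in periods[0]}    # initial uniform split of 64 banks
for stats in periods:                            # cores keep running meanwhile
    print(partition)                             # partition in effect during this period
    partition = compute_partition(stats, 64)     # applied at the next period boundary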

Figure 4. Utility Curve Variation


The shared L2 cache is partitioned based on the information gathered during the previous period, not the current one. This scheme works because the cache resource utility of a program exhibits temporal locality. Fig. 4 shows an example of utility curve variation over time; a time axis is added to the utility curve of sphinx3 from Fig. 1. In this example, one period is 10^7 cycles. While sphinx3 was running, we monitored the cache resource requests for 50 periods, i.e., 5 × 10^8 cycles. Note that the utility curve keeps a similar shape throughout the run and does not change abruptly.
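As a toy check of this temporal-locality argument (the per-period MPKI curves below are hypothetical, not the sphinx3 measurements of Fig. 4), the utility curve of the previous period can be compared against the current one:

# The previous period's curve is a good predictor of the current one.
prev_period = [40, 28, 20, 15, 12, 10, 9, 8]   # hypothetical MPKI vs. number of banks
curr_period = [41, 29, 20, 16, 12, 11, 9, 8]

max_diff = max(abs(a - b) for a, b in zip(prev_period, curr_period))
print(max_diff)   # small difference -> partitioning on last period's curve is reasonable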

III. PROBLEM DEFINITION

A. Input, Output and Objective

The inputs of this problem are as follows: i) the number of processor cores, ii) the number of L2 cache banks, iii) the connectivity (including latency) between processor cores and L2 cache banks, and iv) the task sets assigned to each processor core¹. Given these inputs, the objective is to partition the shared L2 cache so as to minimize the average memory access time, expressed as follows.

memory access time = h · t_h + m · (L + t_m)

where h and m are the cache hit and miss ratios, respectively, t_h and t_m are the hit latency and miss resolution time, respectively, and L is the latency of accessing external memory. Previous research on cache partitioning focuses on reducing the memory access time by minimizing the cache miss ratio m. In our work, however, the proposed algorithm not only reduces the miss ratio m but also reduces the hit latency t_h.
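As a worked sketch of this access-time model (the ratios and latencies below are illustrative numbers, not measurements from the paper), lowering the hit latency t_h reduces the overall access time even when the miss ratio m is unchanged:

# memory access time = h*t_h + m*(L + t_m); all numbers are illustrative.
def memory_access_time(h, t_h, m, L, t_m):
    assert abs(h + m - 1.0) < 1e-9     # hit and miss ratios are complementary
    return h * t_h + m * (L + t_m)

print(memory_access_time(h=0.95, t_h=12, m=0.05, L=200, t_m=20))   # 22.4 cycles
print(memory_access_time(h=0.95, t_h=8,  m=0.05, L=200, t_m=20))   # 18.6 cycles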

B. Graph Coloring

The problem can be transformed into a graph coloring problem. The network structure of the cache banks is represented as a graph; Fig. 5 shows an example of this transformation. In Fig. 5, the square box represents a processor core node (p-node), and the circles forming a network are cache bank nodes (c-nodes). An edge represents a network connection with a certain delay. When a cache bank is assigned to a certain task, we say that the cache bank is colored with the color of the corresponding processor core. The objective of the problem can then be stated as coloring all the c-nodes so as to minimize the average memory access time of all processor cores.

Figure 5. Graph Expression of Cache Network

¹ In this work, we assume that tasks are statically allocated to processor cores, i.e., static task allocation. In future work, we will extend our method to dynamic task scheduling.

IV. PROPOSED METHOD: LUCP

A. Problem Formulation

Note that the objective of our method is to minimize the average memory access time. To achieve this goal, we use the amount of performance improvement as the metric for determining the next bank to be allocated, expressed as the following equation:

R_ij = Δr_i(n_i) · (L − l_ij)

where n_i is the number of c-nodes (cache banks) already allocated to processor core p_i, Δr_i(n_i) is the reduction in miss rate when one more bank is allocated to p_i given that n_i banks are already allocated, L is the external memory access latency (assumed to be the same for all processor cores), and l_ij is the access latency from processor core p_i to cache bank c_j. R_ij denotes the reduction in memory access time when c-node c_j is allocated to p_i. For example, in Fig. 6, two c-nodes are already allocated to processor core p_i, and c-node c_j is about to be allocated to p_i. Since the additional cache bank c_j is assigned, p_i can reduce the number of external memory accesses by the amount of cache hits on c_j; thus, the memory access time of p_i is reduced by Δr_i(n_i) · L. However, since p_i has to access c_j, the memory access time is increased by Δr_i(n_i) · l_ij.

Figure 6. Amount of Performance Improvement

We assume a network with N c-nodes and P p-nodes. Let S be the set of all c-nodes, and let S_i be the subset of S assigned to p_i. The solution of the problem is then the set of c-nodes for each p-node that maximizes the overall performance improvement over all c-nodes. It can be formulated as follows:

Maximize  Σ_{i=0}^{P−1} Σ_{c_j ∈ S_i} R_ij

with constraints

S_0 ∪ S_1 ∪ ··· ∪ S_{P−1} = S  and  ∀ i, j (i ≠ j), S_i ∩ S_j = ∅

The union of all subsets S_i must be S, since we assume that all cache banks are allocated to processor cores. The second constraint ensures that no c-node is allocated to more than one processor core.
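The metric can be sketched directly in Python; the Δr and latency values below are hypothetical placeholders chosen only to show that, for the same marginal utility, a nearer bank yields a larger R_ij:

# R_ij = delta_r_i(n_i) * (L - l_ij); values are illustrative.
L_EXT = 200                     # external memory latency in cycles (TABLE I)

def R(delta_r, l_ij, L=L_EXT):
    # Memory-access-time reduction from allocating bank c_j to core p_i.
    return delta_r * (L - l_ij)

print(R(delta_r=0.004, l_ij=6))    # near bank: 0.776
print(R(delta_r=0.004, l_ij=30))   # far bank:  0.68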

Intuitively, it is possible to find a solution to this problem by comparing the memory access time of all possible cases. Even though this exhaustive method gives an optimal solution, it is far too complex to perform cache partitioning at runtime: since there are P^N possible cache bank allocations in total, it has complexity O(P^N), i.e., the problem is NP-hard. A significant acceleration is needed to apply cache partitioning at runtime. In the following, we propose a heuristic algorithm, latency-aware utility-based cache partitioning (LUCP).

B. Heuristic Method

Initially, we start from the condition that the cache bank(s) nearest to each processor core are already colored with the color of that core. In the example of Fig. 10(b), the c-node closest to p0 is colored with the color of p0, the c-node closest to p1 is colored with the color of p1, and so on. This initial condition guarantees that at least one c-node is allocated to each p-node. This initialization step corresponds to lines 1 and 2 of the pseudo code shown in Fig. 7.

The following procedure is iterated until all nodes are colored. First, we find the uncolored c-nodes that neighbor at least one colored c-node and update their R_ij values. At this step, we need to remember the value R_ij, the color of the neighbor (i.e., the processor core the c-node would be allocated to), and whether this node currently holds the maximum R_ij. If a node has more than one colored neighbor, we compare the R_ij values obtained from all neighbors and pick the one with the maximum performance improvement. These steps correspond to lines 7 to 20 of the pseudo code. Finally, the node with the maximum R_ij value is colored with the color of the selected neighbor, which corresponds to line 23 of the pseudo code.

Proposed Algorithm

 1  /* initialize */
 2  Color the c-nodes nearest to each processor();
 3
 4  while(uncolored c-nodes exist){
 5      MaxR = 0
 6
 7      /* Update the MaxR values for all c-nodes */
 8      /* and pick the node with maximum R value */
 9      for(each uncolored c-node){
10          if(colored neighbor exists){
11              for(each colored neighbor){
12                  if(R value of the neighbor > MaxR){
13                      NodeWithMaxR = this c-node;
14                      ColorOfMaxR = color of the neighbor;
15                      MaxR = R value of the neighbor;
16                      update the R value of this node();
17                  }
18              }
19          }
20      }
21
22      /* Color the picked node */
23      color the NodeWithMaxR with the ColorOfMaxR();
24  }

Figure 7. Pseudo Code
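For concreteness, the following self-contained Python sketch implements the same greedy region-growing idea on an 8 by 8 bank mesh. It is not the authors' implementation: the utility curves (Δr), the core positions, and the per-hop latency model are hypothetical placeholders standing in for the UMON-provided utilities and the delay model of Section V.B.

# Greedy LUCP-style coloring sketch on an 8x8 bank mesh; all parameters are illustrative.
L_EXT = 200          # external memory latency (cycles)
GRID = 8             # 8x8 = 64 cache banks

def delta_r(core, n_alloc):
    # Hypothetical diminishing marginal miss-rate reduction per extra bank.
    base = {0: 0.008, 1: 0.006, 2: 0.001, 3: 0.0005}[core % 4]
    return base / (1 + n_alloc)

core_pos = {0: (0, 1), 1: (0, 5), 2: (7, 1), 3: (7, 5)}   # hypothetical core locations

def latency(core, bank):
    (cx, cy), (bx, by) = core_pos[core], bank
    hops = abs(cx - bx) + abs(cy - by)
    return 1 + hops * (1 + 3)   # 1-cycle bank access; 1-cycle wire + 3-cycle switch per hop

banks = [(x, y) for x in range(GRID) for y in range(GRID)]
color = {}                      # bank -> owning core

# Initialization: the bank nearest to each core gets that core's color.
for c in core_pos:
    nearest = min((b for b in banks if b not in color), key=lambda b: latency(c, b))
    color[nearest] = c
alloc = {c: 1 for c in core_pos}    # banks already allocated to each core

def neighbors(b):
    x, y = b
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < GRID and 0 <= y + dy < GRID]

# Greedy coloring: repeatedly color the frontier bank with the largest R_ij.
while len(color) < len(banks):
    best = None                     # (R, bank, core)
    for b in banks:
        if b in color:
            continue
        for nb in neighbors(b):
            if nb in color:
                c = color[nb]
                r = delta_r(c, alloc[c]) * (L_EXT - latency(c, b))
                if best is None or r > best[0]:
                    best = (r, b, c)
    _, b, c = best
    color[b] = c
    alloc[c] += 1

print(alloc)                        # number of banks finally allocated to each core

In each iteration the frontier grows by exactly one bank, so a core with a flat utility curve (small Δr) stops expanding once the remaining banks are far away and other cores still offer a larger R_ij.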

C. Complexity of the Algorithm

When the algorithm updates the R_ij values, (N − P) nodes are visited, where N is the number of c-nodes and P is the number of p-nodes (P nodes are colored as the initial condition). This updating step is iterated (N − P) times, coloring one c-node at a time. The complexity of the proposed LUCP algorithm is therefore O(N²).

V. EXPERIMENT

A. System Configuration

The experiment is conducted with the general execution-driven multi-processor simulator (GEMS) [15]. GEMS enables detailed simulation of multi-processor systems, allowing us to examine the cache behavior of the given workload applications. The system configuration is shown in TABLE I.

TABLE I. SYSTEM CONFIGURATION

Processor         8 SPARC processors
L1 Cache          Private; 64KB instruction cache, 64KB data cache
L2 Cache          Shared; 16MB, divided into 64 banks (8 by 8)
External Memory   200-cycle access latency

B. Cache Delay Model

The latency from processor cores to cache banks is calculated based on the architecture shown in Fig. 8. A c-node consists of a cache bank connected to a switch, and the switches are connected to each other, forming a network as shown in Fig. 8. The total 16MB of L2 cache is divided into 64 cache banks, resulting in 256KB per bank. With this configuration, the physical dimensions of a cache bank are extracted from CACTI 6 [16]. In 32nm technology, the height of a bank is 1.13mm and the width is 0.66mm. Since the delay per unit length is 0.85ns/mm, the vertical and horizontal wire delays are 0.13ns and 0.07ns, respectively. Assuming an operating frequency of 1GHz, the wire delay of one vertical or horizontal hop is one clock cycle. The access time of a bank is 0.32ns, which is also one clock cycle. The delay of a network switch depends on the pipeline architecture of the switch and on network congestion; the number of pipeline stages varies from 2 to 5 depending on the pipeline optimization scheme [14]. In this work, for simplicity, a switch is assumed to take three clock cycles.

Figure 8. Cache Bank Network Model
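Under one plausible reading of this delay model (one wire cycle plus three switch cycles per hop, and one cycle for the bank access; whether the first and last switch are both counted is an assumption of this sketch, not something stated in the paper), the core-to-bank latency used for l_ij can be estimated as:

# Per-access latency estimate for the mesh of Fig. 8; hop accounting is an assumption.
WIRE_CYCLES = 1
SWITCH_CYCLES = 3
BANK_CYCLES = 1

def bank_access_latency(core_xy, bank_xy):
    hops = abs(core_xy[0] - bank_xy[0]) + abs(core_xy[1] - bank_xy[1])
    return hops * (WIRE_CYCLES + SWITCH_CYCLES) + BANK_CYCLES

print(bank_access_latency((0, 0), (0, 0)))   # local bank: 1 cycle
print(bank_access_latency((0, 0), (7, 7)))   # far corner: 14 hops -> 57 cycles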


C. Workload Profile

For the workloads, we used benchmark programs chosen from SPEC CPU 2000 [17] and ALPBench [18]. Fig. 1 shows how the cache miss rate varies with the number of banks allocated to each workload. mcf, sphinx3, and art are benchmarks with high utility: while benchmarks such as mpeg2dec and mpeg2enc show only a small reduction of memory access time as the number of allocated cache banks increases, mcf shows a significant reduction until 14 banks are allocated.

The experiment assumes that eight of the benchmark programs (ammp, art, gzip, mcf, mesa, mpeg2dec, mpeg2enc, sphinx3, equake, gcc, parser, and twolf) are assigned to the eight processor cores (p0, p1, ..., p7). We developed five types of input combination to examine the performance of the proposed algorithm, as shown in TABLE II. Type 1 is a combination of low-utility and high-utility applications; as shown in Fig. 1, art, mcf, and sphinx3 are the high-utility applications, and the remaining benchmarks are low-utility. Type 2 is a combination of low-utility applications only; with this combination, we can observe the behavior of the proposed method when a large cache is not needed. In type 3, two high-utility programs run on processor cores 0 and 1, and the remaining processor cores are idle. Type 4 is similar to type 3; the only difference is which processor cores are active: in type 3 the two benchmarks run on adjacent processor cores, whereas in type 4 the two high-utility programs run on processor cores 0 and 4, which are located on opposite sides of the chip. Finally, type 5 is a combination of high-utility programs running on all processor cores: four mcf instances run on processor cores 0, 2, 4, and 6, and four sphinx3 instances run on processor cores 1, 3, 5, and 7. In this situation, since all running programs are high-utility, they compete intensely for the cache resources.

TABLE II. TYPES OF INPUT COMBINATION

Type   Selected Benchmarks
1      ammp, art, gzip, mcf, mesa, mpeg2dec, mpeg2enc, sphinx3
2      equake, gcc, parser, twolf, gzip, mesa, mpeg2dec, mpeg2enc, sphinx3
3      mcf, sphinx3 on p0, p1
4      mcf, sphinx3 on p0, p4
5      mcf on p0, p2, p4, p6; sphinx3 on p1, p3, p5, p7

D. Uniform Partitioning & UCP

To evaluate the contribution of the proposed algorithm, we compared its results to uniform partitioning and to utility-based cache partitioning (UCP). Uniform partitioning, shown in Fig. 10(a), assigns the same amount of cache resources to each processor core, with cache banks allocated so that the access latency is as uniform as possible; in this work, eight cache banks are allocated to each processor core and the shape of the partition is symmetric. In UCP, the cache resource utility of the workload is considered, but not the latency differences; the number of cache banks allocated to each processor core is decided by the algorithm proposed in [1]. Since [1] assumes that the cache access latency from the processor cores is constant, we used the average cache access latency to evaluate the memory access time. Fig. 10(b) shows the result of the proposed LUCP algorithm with the system configuration and workloads described above. Note that more cache banks are allocated to the applications with high utility, such as mcf and sphinx3.

Figure 10. Result of Cache Partitioning for Type 1 Input: (a) Uniform Partitioning, (b) Latency-aware and Utility-based Cache Partitioning
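For reference, a uniform partition of the 8 by 8 bank array can be sketched as below; the exact 2 by 4 block shape is an assumption for illustration, since the paper only states that each core receives eight banks in a symmetric layout.

# Uniform-partitioning baseline: each of the 8 cores gets a fixed 2x4 block of banks.
def uniform_partition(grid=8, cores=8):
    blocks = {}
    for c in range(cores):
        col = (c % 4) * 2                  # four block-columns of width 2
        row = 0 if c < 4 else 4            # top half or bottom half, height 4
        blocks[c] = [(x, y) for x in range(col, col + 2)
                            for y in range(row, row + 4)]
    return blocks

parts = uniform_partition()
print({c: len(b) for c, b in parts.items()})   # 8 banks per core, 64 in total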

E. Result Analysis

Fig. 9 compares the different types of input combinations. The y-axis shows the average memory access time normalized to that of uniform partitioning for each type. The absolute value of the memory access time, which is not shown in the figure, differs significantly depending on the input benchmark programs, so the values are normalized for easier comparison. When all programs running on the processor cores are high-utility programs, the absolute average memory access time is high; in contrast, when all programs are low-utility, it is lower. Since our goal is to reduce the average memory access time, a lower bar is more desirable. Overall, the average memory access time is reduced by up to 32.6% and by 14.9% on average. For the type 1 benchmarks, the proposed method shows the best performance.

Figure 9. Experimental Result

TABLE III shows the absolute values of cache misses per thousand instructions and average memory stall cycles per thousand instructions for the type 1 benchmarks, comparing the three methods. Note that the cache miss rate of the proposed method is significantly lower than that of uniform partitioning, which comes from considering the utility of the workloads; UCP shows a similar miss rate since it also considers utility. The memory access time, however, tells a different story: compared to UCP, the memory access time of the proposed method is reduced by 27.4%. This is because the L2 cache hit time is reduced, since LUCP allocates cache resources so as to minimize the access latency. LUCP not only reduces the cache misses but also reduces the hit latency. Notice that the cache miss rate of LUCP is slightly higher than that of UCP: the solution that minimizes cache misses is not necessarily the solution that minimizes memory access time.

TABLE III. EXPERIMENTAL RESULT FOR TYPE 1 BENCHMARKS

Algorithm                      Uniform    UCP    Proposed
Cache Miss (MPKI)                 4.36    2.26       2.29
Memory access time (cycles)       1207     876        637

In the case of the type 2 benchmark programs, LUCP shows performance similar to uniform partitioning, while UCP performs worse. All the benchmarks of type 2 are low-utility programs, so they do not need a large amount of cache; as shown in Fig. 1, eight cache banks are large enough for all of these programs, and the cache misses per thousand instructions are almost the same for all three methods. For uniform partitioning and UCP, the cache hit time is also about the same, because in both methods the cache resources close to a processor core are assigned to that core.

For type 3 and type 4, two high-utility benchmark programs run on two processor cores while the other six processor cores remain idle. For both types, since the two programs are high-utility, they need a large number of cache banks. Despite this demand, only eight cache banks can be allocated to each program under uniform partitioning, whereas UCP and LUCP utilize almost the entire cache. This makes a significant difference in the number of cache misses: as shown in Fig. 9, UCP and LUCP perform much better than uniform partitioning for both type 3 and type 4. The difference between type 3 and type 4 comes from the physical positions of the processor cores. In type 3, the two active processor cores are located right next to each other, so, because of the characteristics of the proposed algorithm, they compete for cache resources in the early stage of the algorithm, which affects its performance. LUCP shows better performance for type 4, where the two active processor cores are located opposite to each other; with type 4 inputs, the average memory access time of LUCP is reduced by 32.6% compared to UCP.

Finally, the type 5 benchmark programs are all high-utility programs running on all eight processor cores, so they compete intensely for cache resources. The result is similar to that of type 1: LUCP and UCP show a lower miss rate than uniform partitioning, and LUCP shows a lower hit time than UCP, resulting in better performance than UCP.

VI. CONCLUSION

In this paper, we proposed the latency-aware utility-based cache partitioning (LUCP) method for reducing the effective memory access time of a multi-processor system with a 3D-stacked cache. LUCP considers the variation of cache resource utility across programs as well as the variation of latency depending on the physical distance between cache bank and processor core. To reduce the time complexity of obtaining the optimal solution, we proposed a heuristic method. Experimental results show that, at a slight increase in L2 misses per instruction, the average memory access time is reduced by up to 32.6% and by 14.9% on average compared to the conventional method [1].

ACKNOWLEDGMENT

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2010-0000823) and by the IT R&D program of MKE/KEIT [KI002134, Wafer Level 3D IC Design and Integration].

REFERENCES

[1] M. K. Qureshi and Y. N. Patt, "Utility-based cache partitioning: a low-overhead, high-performance, runtime mechanism to partition shared caches," in MICRO, 2006, pp. 423-432.
[2] M. Annavaram, E. Grochowski, and J. Shen, "Mitigating Amdahl's law through EPI throttling," in ISCA, 2005, pp. 298-309.
[3] S. R. Vangal, J. Howard, G. Ruhl, et al., "An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, pp. 29-41, 2008.
[4] C. Kim, D. Burger, and S. W. Keckler, "An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches," in ASPLOS, San Jose, California, 2002.
[5] H. Q. Le, W. J. Starke, J. S. Fields, et al., "IBM POWER6 microarchitecture," IBM J. Res. Dev., vol. 51, pp. 639-662, 2007.
[6] H. Sharangpani and K. Arora, "Itanium processor microarchitecture," IEEE Micro, vol. 20, pp. 24-43, 2000.
[7] C. N. Keltcher, K. J. McGrath, A. Ahmed, et al., "The AMD Opteron processor for multiprocessor servers," IEEE Micro, vol. 23, pp. 66-76, 2003.
[8] K. Nikas, M. Horsnell, and J. Garside, "An adaptive bloom filter cache partitioning scheme for multicore architectures," in SAMOS, 2008, pp. 25-32.
[9] D. B. Kirk, "Process dependent static cache partitioning for real-time systems," in RTSS, 1988, pp. 181-190.
[10] G. Suh, L. Rudolph, and S. Devadas, "Dynamic partitioning of shared cache memory," The Journal of Supercomputing, vol. 28, pp. 7-26, 2004.
[11] J. Chang and G. S. Sohi, "Cooperative caching for chip multiprocessors," in ISCA, 2006, pp. 264-276.
[12] V. Pavlidis and E. Friedman, Three-Dimensional Integrated Circuit Design. Morgan Kaufmann, 2009.
[13] B. Black, M. Annavaram, N. Brekelbaum, et al., "Die stacking (3D) microarchitecture," in MICRO, 2006, pp. 469-479.
[14] L.-S. Peh and N. E. Jerger, On-Chip Networks. Morgan & Claypool Publishers, 2009.
[15] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, et al., "Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset," SIGARCH Comput. Archit. News, vol. 33, pp. 92-99, 2005.
[16] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0," in MICRO, 2007.
[17] J. L. Henning, "SPEC CPU2000: measuring CPU performance in the new millennium," Computer, vol. 33, pp. 28-35, 2000.
[18] M.-L. Li, R. Sasanka, S. V. Adve, et al., "The ALPBench benchmark suite for complex multimedia applications," in IISWC, 2005, pp. 34-45.