Squeezing Maximum Performance out of 3D Cache-Stacked Multicore Architectures

Asim Khan, Kyungsu Kang, Chong-Min Kyung
Department of Electrical Engineering
Korea Advanced Institute of Science and Technology
Daejeon, South Korea
{asim,kskang}@vslab.kaist.ac.kr,
[email protected] Abstract— 3D integration is one of the most promising options to fulfill the demands of high performance and large cache by integrating multiple processor cores and 3D stacked cache. There are however temperature problems in 3D integration. This paper presents a method for performance maximization of a 3D cache-stacked multicore system keeping the temperature under a given limit while by assigning the clock frequencies and number of cache banks to each core according to the requirement. We have done experiments on multiple benchmark programs and have found a peak 32% and an average 29.8% improvement in performance as compared to the base case which assigns the same frequency and the same number of banks to each core. Index Terms— Instructions per second, 3D Integrated Circuits, Non uniform Cache, Temperature management. I. INTRODUCTION Increasing number of cores on a chip has been a widely used technique to improve the performance since it is well known that we cannot increase the clock frequency indefinitely in single core processors due to low power budgets [1]. Memory bandwidth needs to be increased according to the number of cores; this can be done by having multiple layers of stacked memory in a single chip, i.e., 3D integration. As the size of cache increases the latencies to access the cache banks also changes. Non-Uniform Cache Architecture (NUCA) has been proposed to reduce the access time [2]. Another important issue is the interconnection among these processor cores and memory banks, since we have a large number of cores and banks; Network on Chip (NoC) is one of the most promising options. Implementing a large number of processors and a huge stacked cache memory looks a big incentive to get much better performance. However, as the number of processors and number of stacked cache layers increase the power density (power dissipation per unit ________________________________________________ This work was supported by the National Research Foundation of Korea(NRF) grant funded by the Korea government(MEST) (No.2011-0000307).
volume) also increases, which raises the temperature and thereby hurts reliability and cost [3]. This implies that in order to improve performance we must take care of temperature. Several techniques exist to manage temperature, such as running at lower frequencies and power gating; however, they all degrade performance to some extent. Performance improvement can be defined as improvement in Instructions per Second (IPS), which depends on the clock frequencies at which the cores operate and on the memory stall time. Much work has been done on assigning frequencies to cores so that power consumption is lowered, as well as on reducing the stall time, but these works treat the two separately, especially for NUCA architectures; no prior work considers the frequencies and the stall time jointly. More specifically, the memory stall time depends on the number of banks allocated to each core and on the placement of those banks. In short, to improve IPS we have to jointly adjust the clock frequencies at which the cores run, the number of banks allocated to each core, and the placement of these banks. In application-specific design, each application has different requirements; we can therefore use a different configuration of frequencies and bank assignments for each application to maximize performance. High-performance server applications contain tasks that run for long periods of time, and a configuration must be assigned every time such tasks are assigned. The algorithm proposed here maximizes the performance of such applications under a given temperature constraint.

II. RELATED WORKS
This section presents prior works that consider frequency assignment, bank assignment, or bank placement for Uniform Cache Architecture (UCA) or Non-Uniform Cache Architecture (NUCA).
For the 2D case, the authors of [4] present dynamic and static algorithms for cache partitioning among threads. Thermal modeling that accounts for thread scheduling is proposed in [6][7]. Temperature-aware thread scheduling to improve performance is presented in [10]. Chen et al. [11] compared two thread scheduling techniques for real-time systems targeted at Chip Multiprocessors (CMPs). In [12] a runtime solution is presented that minimizes the distance between the data and the accessing cores, but it does not consider temperature. For the 3D case, thermal management solutions based on Dynamic Voltage and Frequency Scaling (DVFS) are presented in [13]-[14]. Sun et al. [15] presented a design-time solution for maximizing performance while meeting temperature and power constraints. In [16] a design-time solution is presented for both 2D and 3D architectures that minimizes the distance between cores and banks under a temperature constraint. In [17] temperature-aware memory mapping is presented. In [18] a heuristic approach is presented that allocates cache banks according to workload characteristics.

III. PROBLEM DEFINITION
The problem definition is as follows.
Given: the number of cores $M$, the number of stacked cache layers $L$, the total number of banks $B$, the thread set $\{\tau_1, \tau_2, \ldots, \tau_M\}$ and their assignment to cores.
Find: the frequency of each core ($f_i^{cpu}$), the number of banks allocated to each core ($b_i$), and the placement of those banks ($P_{i,b}$), for $i = 1, 2, \ldots, M$.
Such that: the total IPS is maximized while the temperature stays under a given limit.
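To make the decision variables concrete, the following is a minimal sketch (in Python) of how a problem instance and a candidate solution could be represented. All type and field names here are our own illustrative choices, not notation from the paper.

from dataclasses import dataclass
from typing import List

@dataclass
class ProblemInstance:
    M: int              # number of cores
    L: int              # number of stacked cache layers
    B: int              # total number of cache banks
    threads: List[str]  # thread set {tau_1, ..., tau_M}; thread i runs on core i

@dataclass
class Solution:
    f_cpu: List[float]          # f_i^cpu: clock frequency of core i (Hz)
    banks: List[int]            # b_i: number of banks allocated to core i
    placement: List[List[int]]  # P_{i,b}: IDs of the banks placed for core i

A solver then searches over Solution objects for the one that maximizes total IPS while every cache zone stays below the temperature limit.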
IV. PROBLEM FORMULATION
The goal in this section is to develop a formulation that maximizes performance under a temperature constraint. Figure 1 shows the 3D architecture considered in this work: a single layer of multiple cores and multiple layers of stacked NUCA cache. The cores are connected to the cache banks via a Through-Silicon Bus (TSB). Each core has a private L1 cache, and the multiple stacked layers of L2 cache are allocated among the cores. The threads running on the cores are independent of each other. The banks assigned to a core have equal probability of being accessed; this means that the total number of L2 accesses is divided equally among all the banks being accessed.

[Figure 1: Baseline architecture]

Moreover, we assume that there is no network congestion, and we use deterministic routing to route data to and from the caches/cores. The objective function is defined as follows:

Maximize
$$ IPS_{total} = \sum_{i=1}^{M} N_i^{inst} / t_{ex}(f_i^{cpu}, w_i^{L2}) \quad (1) $$
The objective is to maximize the total Instructions per Second ($IPS_{total}$), where each core's IPS depends on its total number of instructions ($N_i^{inst}$) and its execution time ($t_{ex}$). The execution time, in turn, depends on the core frequency and on the number of cycles consumed in accessing the L2 cache, which we must formulate as a function of the number of banks:

$$ IPS_i = IPS(f_i^{cpu}, w_i^{L2}) = N_i^{inst} / t_{ex}(f_i^{cpu}, w_i^{L2}) $$
$$ t_{ex}(f_i^{cpu}, w_i^{L2}) = \frac{w_i^{cpu}}{f_i^{cpu}} + \frac{w_i^{L2}}{f_{mem}} \quad (2) $$

We can formulate this delay ($w_i^{L2}$) as follows:

$$ w_i^{L2} = \frac{\text{No. of L2 Accesses}}{\text{Task}} \times \big( \text{Miss Rate} \times \text{Miss Penalty} + \text{Hit Rate} \times \text{Hit Processing Time} \big) \quad (3) $$

where Hit Rate = 1 − Miss Rate and Miss Penalty = Hit Processing Time + Off-Chip Penalty.
Substituting Hit Rate = 1 − Miss Rate and Miss Penalty = Hit Processing Time + Off-Chip Penalty into (3), the hit-processing terms combine, leaving only the off-chip penalty weighted by the miss rate:

$$ w_i^{L2} = \frac{\text{No. of L2 Accesses}}{\text{Task}} \left[ \text{Miss Rate} \times \text{Off-Chip Penalty} + \text{Hit Processing Time} \right] $$

The miss rate is modeled as a power-law function of the cache capacity allocated to core $i$, and the off-chip penalty is denoted $D^{off\text{-}chip}$:

$$ \text{Miss Rate: } R_i^{miss} = R_i^{init} \cdot (b_i \cdot C_{bank})^{-\mu_i} $$

which gives

$$ w_i^{L2} = \frac{\text{No. of L2 Accesses}}{\text{Task}} \left[ R_i^{miss} \times D^{off\text{-}chip} + \text{Hit Processing Time} \right] \quad (4) $$
where $R_i^{init}$ and $\mu_i$ are workload-dependent variables, $b_i$ is the number of banks allocated to core $i$, and $C_{bank}$ is the capacity of each bank. We assume that the whole cache is divided into equal-capacity banks. The off-chip penalty is assumed to be a constant value for each core, while the hit processing time is a function of the placement of banks.
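As an illustration of the power-law miss-rate model (the numbers here are our own, not from the paper): with $C_{bank} = 256$ KB $= 2^{18}$ bytes, $R_i^{init} = 50$ and $\mu_i = 0.5$, doubling the allocation from $b_i = 4$ to $b_i = 8$ banks reduces the miss rate by a factor of $2^{0.5} \approx 1.41$:

$$ R_i^{miss}(4) = 50 \cdot (4 \cdot 2^{18})^{-0.5} \approx 0.049, \qquad R_i^{miss}(8) = 50 \cdot (8 \cdot 2^{18})^{-0.5} \approx 0.035. $$

Because the exponent $-\mu_i$ is workload-dependent, cache-hungry threads benefit more from extra banks than streaming ones, which is the asymmetry the optimization exploits.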
The constraints are given as follows:

$$ T_{i,L}^{zone} \le T_{max} \quad \forall i = 1, 2, \ldots, M; \ \forall L \quad (5) $$

$T_{i,L}^{zone}$ is the temperature of the zone (the set of banks above each core) in cache layer $L$ above core $i$. This constraint requires the temperature [13] of each zone to stay below the maximum temperature limit. The other constraints are:

$$ 1 \le b_i \le B \quad (6) $$

$$ 2\,\text{GHz} \le f_i^{cpu} \le 3\,\text{GHz} \quad (7) $$

$$ \sum_{i=1}^{M} b_i = B \quad (8) $$
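To illustrate how Eqs. (1)-(4) and constraints (6)-(8) fit together, the following Python sketch evaluates the total IPS of a candidate frequency/bank assignment. The latency constants and per-workload parameters (w_cpu, n_acc, r_init, mu) are illustrative placeholders rather than values from the paper, and the thermal constraint (5) is omitted because the thermal model of [13] is not reproduced here.

F_MEM = 1.0e9          # memory clock frequency f_mem (Hz), assumed
C_BANK = 256 * 1024    # bank capacity C_bank (bytes), 256 KB as in Sec. V
D_OFF_CHIP = 200.0     # off-chip penalty D^off-chip (memory cycles), assumed
T_HIT = 10.0           # hit processing time (memory cycles); in the paper
                       # this also depends on bank placement

def miss_rate(r_init, mu, b_i):
    # Eq. (4): R_i^miss = R_init * (b_i * C_bank)^(-mu_i)
    return r_init * (b_i * C_BANK) ** (-mu)

def l2_wait(n_acc, r_init, mu, b_i):
    # Eq. (4): w_i^L2 = n_acc * (R_i^miss * D^off-chip + hit processing time)
    return n_acc * (miss_rate(r_init, mu, b_i) * D_OFF_CHIP + T_HIT)

def core_ips(n_inst, w_cpu, n_acc, r_init, mu, f_cpu, b_i):
    # Eqs. (1)-(3): t_ex = w_cpu / f_cpu + w_L2 / f_mem, IPS_i = N_inst / t_ex
    t_ex = w_cpu / f_cpu + l2_wait(n_acc, r_init, mu, b_i) / F_MEM
    return n_inst / t_ex

def total_ips(workloads, f, b, B):
    # workloads[i] = (n_inst, w_cpu, n_acc, r_init, mu) for core i
    assert all(1 <= bi <= B for bi in b) and sum(b) == B   # constraints (6), (8)
    assert all(2e9 <= fi <= 3e9 for fi in f)               # constraint (7)
    return sum(core_ips(*w, fi, bi) for w, fi, bi in zip(workloads, f, b))

For instance, total_ips(workloads, [3e9]*8, [16]*8, 128) evaluates the base case of Sec. VI, with equal frequencies and equal bank counts. In the paper this maximization is performed by the LINGO solver (Sec. V); the sketch above only evaluates one candidate assignment.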
The steps to perform bank placement are shown in Figure 2 (a code sketch of the procedure follows the figure). Step 1 checks whether all cores are assigned the same cache capacity; if so, each core's banks are placed directly above it. Steps 2-3 handle the cores whose allocated bank count is smaller than the number of banks in the zones on top of them; this ensures that these cores get the banks at minimum distance (to maximize performance). We are then left with the cores that need more banks than the zones on top of them contain. Step 4 first assigns them all the banks in the zones directly above them. The remaining banks are then assigned starting from the core with the lowest remaining bank count, again so that the distance between banks and cores is minimized.

Figure 2: Bank Placement Algorithm
1. Check whether all cores have the same number of banks allocated; if so, place each core's banks above it.
2. Otherwise, start from the core with the least number of banks allocated; place its banks above it and mark the remaining banks as 'not assigned'.
3. Repeat step 2 for all cores whose allocated bank count is less than the number of banks in the cache zones directly on top of them.
4. If the banks allocated to a core exceed the banks on top of it, first assign the banks in the zones on top of that core and record the number of banks still to be assigned.
5. Repeat step 4 for the rest of such cores.
6. Now start from the core with the least number of banks remaining and assign banks from the nearest of the unallocated zones.
7. Repeat step 6 for the other cores with banks left unassigned, until all the banks are assigned.
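Below is a compact Python sketch of the placement procedure of Figure 2, assuming a simplified distance model in which the cores form a row and the distance between core i and the zone above core j is |i − j| (the paper minimizes the actual core-to-bank distance over the NoC/TSB). All function and variable names are our own.

def place_banks(b, cap):
    # b[i] = number of banks allocated to core i (the b_i of Sec. IV)
    # cap  = bank slots in the zones directly above each core
    M = len(b)
    free = [cap] * M                   # unassigned slots per zone
    placement = [[] for _ in range(M)]
    remaining = [0] * M

    # Steps 1-4: every core first fills the slots directly above itself;
    # cores needing more than 'cap' banks record the shortfall.
    for i in range(M):
        take = min(b[i], cap)
        placement[i] += [(i, s) for s in range(take)]
        free[i] -= take
        remaining[i] = b[i] - take

    # Steps 5-7: cores with leftover demand, smallest remainder first,
    # take slots from the nearest zones that still have free slots.
    for i in sorted(range(M), key=lambda k: remaining[k]):
        while remaining[i] > 0:
            z = min((z for z in range(M) if free[z] > 0),
                    key=lambda z: abs(z - i))   # nearest zone with space
            take = min(remaining[i], free[z])
            start = cap - free[z]
            placement[i] += [(z, s) for s in range(start, start + take)]
            free[z] -= take
            remaining[i] -= take
    return placement                   # placement[i] = (zone, slot) pairs

With b = [2, 2, 6, 2] and cap = 3, for example, cores 0, 1 and 3 keep all their banks directly above themselves, while core 2 overflows into the nearest zones with free slots. Constraint (8) guarantees that total demand never exceeds the total number of slots.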
V. EXPERIMENTAL SETUP
We have considered 8 cores and 4 cache layers in our experimental setup. Each core is an Intel Core2Duo Merom processor [20] with private L1 instruction and data caches. Four banks are stacked directly on top of each core per layer, so we have a total of 32 banks in each layer and 128 banks over the four layers. We are considering 65 nm technology. The capacity of each cache bank is 256 KB, computed using CACTI [5]. We have used the commercial solver LINGO to maximize the objective function. Experiments were performed with SPEC2000/2006 [8], ALPBench [19] and a computational fluid dynamics (CFD) benchmark [9]. These benchmarks include mcf, equake, gzip, gcc, vortex, twolf, ammp, free_cfd, mcf06, sjeng, sphinx, and face_rec; eight benchmarks were chosen randomly from this list for each simulation run. The maximum temperature ($T_{max}$) is taken as 90 °C.

VI. EXPERIMENTAL RESULTS
First we have run all the cores at the maximum frequency, i.e., 3 GHz. Figure 3 shows the resultant temperatures of each cache zone.
[Figure 3: Temperatures (°C) of cache zones 1-8 with all cores running at 3 GHz]

We can see the temperatures going beyond the maximum temperature limit. Figures 4 and 5 show the temperatures of each cache zone and the normalized IPS, respectively. The IPS results of our method, which we call OFB (optimum frequencies and banks), are normalized with respect to the base case (BA), in which every core is given the same frequency and the same number of banks and all banks are placed on top of their core. We have obtained an average improvement of 29.8% in IPS.

[Figure 4: Temperatures (°C) of cache zones 1-8 for OFB, Cases 1-6]

[Figure 5: IPS of OFB and BA, normalized w.r.t. the baseline architecture (BA)]

VII. CONCLUSION
We conclude that much better performance can be extracted from 3D multicore architectures simply by
jointly considering the assignment of frequencies and cache capacity. We have presented a very simple formulation and achieved a considerable improvement in the results.

REFERENCES
[1] Intel products. [Online]. Available: http://www.intel.com/products/processor/index.htm
[2] C. Kim et al., "Non-uniform cache architectures for wire-delay dominated on-chip caches," IEEE Micro, vol. 23, no. 6, pp. 99-107, 2003.
[3] K. Kang et al., "Temperature-aware integrated DVFS and power gating for executing tasks with runtime distribution," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 29, no. 9, pp. 1381-1394, Sep. 2010.
[4] S. Kim et al., "Fair cache sharing and partitioning in a chip multiprocessor architecture," in Proc. PACT, 2004.
[5] D. Tarjan et al., "CACTI 4.0," HP Laboratories, Palo Alto, CA, Tech. Rep. HPL-2006-86, Jun. 2006.
[6] K. Skadron et al., "Temperature-aware microarchitecture: modeling and implementation," ACM Trans. Architecture and Code Optimization, vol. 1, pp. 94-125, Mar. 2004.
[7] Y. Han et al., "Temptor: A lightweight runtime temperature monitoring tool using performance counters," in Proc. 3rd Workshop on TCAS, held in conjunction with ISCA-33, 2006.
[8] Standard Performance Evaluation Corporation. [Online]. Available: http://www.specbench.org
[9] Free Computational Fluid Dynamics. [Online]. Available: http://www.freecfd.com
[10] J. Choi et al., "Thermal-aware task scheduling at the system software level," in Proc. ISLPED, Aug. 2007, pp. 213-218.
[11] S. Chen et al., "Scheduling threads for constructive cache sharing on CMPs," in Proc. SPAA, 2007.
[12] Mahmut et al., "Dynamic thread and data mapping for NoC based CMPs."
[13] C. Zhu et al., "Three-dimensional chip-multiprocessor runtime thermal management," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 8, pp. 1479-1492, Aug. 2008.
[14] X. Zhou et al., "Thermal-aware task scheduling for 3D multi-core processors," IEEE Trans. Parallel Distrib. Syst., vol. 21, no. 1, pp. 60-71, Jan. 2010.
[15] C. Sun et al., "Three-dimensional multiprocessor system-on-chip thermal optimization," in Proc. Int. Conf. Hardware/Software Codesign Syst. Synth., Oct. 2007, pp. 117-122.
[16] O. Ozturk et al., "Optimal topology exploration for application-specific 3D architectures."
[17] A.-C. Hsieh et al., "Thermal-aware memory mapping in 3D design," in Proc. DATE, 2009, pp. 1361-1366.
[18] G. Sun et al., "Exploration of 3D stacked L2 cache design for high performance and efficient thermal control," in Proc. Int. Symp. Low Power Electronics and Design, 2009, pp. 295-298.
[19] M.-L. Li et al., "The ALPBench benchmark suite for complex multimedia applications," in Proc. Int. Symp. Workload Characterization, Oct. 2005, pp. 34-35.
[20] N. Sakran et al., "The implementation of the 65nm dual-core 64b Merom processor," in Proc. ISSCC, 2007, pp. 106-590.