Exploring Small-Scale and Large-Scale CMP Architectures for Commercial Java Servers

R. Iyer, M. Bhat*, L. Zhao, R. Illikkal, S. Makineni, M. Jones*, K. Shiv*, D. Newell
Systems Technology Laboratory, Intel Corporation
*Software Solutions Group, Intel Corporation

Abstract

As we enter the era of chip multiprocessor (CMP) architectures, it is important that we explore the scaling characteristics of mainstream server workloads on these platforms. In this paper, we analyze the performance of an Enterprise Java workload (SPECjbb2005) on two important classes of CMP architectures. One class of CMP platforms comprises "small-scale" CMP (SCMP) processors with a few large out-of-order cores on the die. The other class comprises "large-scale" CMP (LCMP) processor(s) with several small in-order cores on the die. For these classes of CMP architectures to succeed, it is important that there are sufficient resources (cache, memory and interconnect) to allow for a balanced, scalable platform. In this paper, we focus on evaluating the resource scaling characteristics (cores, caches and memory) of SPECjbb2005 on these two architectures and on understanding the architectural trade-offs that may be required in future CMP offerings. The overall evaluation is uniquely conducted using four different methodologies (measurements on the latest platforms, trace-based cache simulation, trace-based platform simulation and execution-driven emulation). Based on our findings, we summarize architectural recommendations for future CMP server platforms (e.g. the need for large DRAM caches).

1. INTRODUCTION

CMP architectures [25] have become the norm for client and server platforms as these platforms [1, 10, 11, 17, 18] enter the marketplace and gain widespread adoption. The existing CMP offerings differ in the number of cores, the cache/memory architecture and the interconnect choices. As a result, it is very difficult to understand the architectural trade-offs currently being made. Without a detailed evaluation of client and server workload characteristics, it is also difficult to grasp how CMP platforms will evolve in the future and what trade-offs will need to be considered as the CMP architecture space matures. In this paper, our goal is to address this issue by evaluating the performance of a commercial server workload on current and future CMP architectures. By doing so, we hope to identify critical challenges in future CMP architectures as well as point to potential emerging solutions. In the server space, there are two major classes of CMP architectures: small-scale (SCMP) and large-scale

(LCMP) architectures. SCMP platforms comprise "small-scale" CMP processors with a few large out-of-order cores on the die. Recent offerings of SCMP platforms are described in [1] and [10]. LCMP platforms comprise "large-scale" CMP processor(s) with several small in-order cores on the die. These platforms target the throughput computing segment, where recent offerings such as Niagara [17] and Azul [2] are prime examples. The successful evolution of these CMP architectures depends not only on the ability to integrate more cores, but also heavily on the available platform resources (cache, memory and interconnect). For example, current SCMP server platforms have large caches since there are only two cores on the die (e.g. the Intel® Core™ 2 Duo processor [10]), whereas the existing LCMP platform has a small shared cache since the die space is occupied by eight cores (e.g. Sun's Niagara [17]). As a result, the LCMP platform was required to support higher memory bandwidth and a lower number of sockets. In this paper, we will study the resource scaling characteristics of both SCMP and LCMP architectures using a commercial server workload. Our analysis of SCMP and LCMP architectures is based on the SPECjbb2005 benchmark [27]. We chose SPECjbb2005 since managed runtime applications based on Java are increasingly used in the server domain and since it is heavily used by CPU and platform developers to understand the performance of commercial servers. Some of the recent studies using SPECjbb [26, 27] include [6, 15, 21, 22, 23, 24]. While several studies have used SPECjbb2000 (and a handful SPECjbb2005) for architectural evaluation, this paper is the first to study the performance of SPECjbb2005 on both small-scale and large-scale CMP architectures. In order to accomplish this, we took advantage of four different methodologies: (1) measurements on a state-of-the-art Intel server platform, (2) trace-based functional cache simulations, (3) trace-based CPU and platform simulations and (4) execution-driven emulation using an FPGA prototype. In this paper, we will describe how these methodologies allowed us to cover the SCMP/LCMP design space effectively. Overall, the primary contribution of this paper is the detailed resource scaling study of SPECjbb2005 on SCMP and LCMP architectures (including core scaling effects, cache scaling effects, memory scaling effects and the potential benefits of emerging bandwidth/latency

mitigation techniques such as DRAM cache). A secondary contribution of the paper is the application of four different methodologies on the same workload and the comparison of their effectiveness in covering the intended design space.

The rest of this paper is organized as follows. Section 2 provides background on current and future CMP architectures, along with a description of the resource scaling dimensions that are investigated in this paper. Section 3 describes the Enterprise Java workload (SPECjbb2005). Section 4 outlines the methodologies used to evaluate SPECjbb2005 on the SCMP and LCMP architectures. Section 5 presents the results and analysis along with our observations and recommendations for future CMP architectures. Section 6 concludes the paper with directions for future work in this area.

2. CMP OVERVIEW: PAST, PRESENT & FUTURE

In this section, we provide background on CMP platforms, discuss the issues around platform resource scalability for the two classes of CMP architectures, and motivate the resource scaling vectors that we chose for analysis in this paper.

[Figure 1 illustration: an SCMP processor with two large cores (C0, C1), each running two threads (T0, T1 and T2, T3), sharing a last-level cache, contrasted with an LCMP processor containing many small cores (each with threads T0-3) connected through an on-die interconnect to a distributed last-level cache (LLC).]

Figure 1. Small-Scale vs. Large-Scale CMP Architecture

2.1. CORE AND CMP SCALABILITY

Motivated by the power wall (which restricts core frequency improvements) and the potential of thread-level parallelism, most CPU manufacturers have introduced CMP processors into the marketplace [1, 10, 11, 17, 18] or have announced plans to do so. Most of the CMP processors [1, 10, 11, 18] today are small-scale in nature, with a few large cores on the die. In such small-scale CMP (SCMP) architectures, each core is fairly large since it provides significant hardware features (out-of-order execution, complex branch predictors, etc.) to ensure that single-thread performance is high. While today's SCMP offerings have only two cores, it is expected that, by the end of this decade, these platforms may have as many as 8 cores on the die. In contrast, large-scale CMP architectures (LCMP) with many small cores are also being investigated by architects [12, 17]. Each individual core in the LCMP architecture is not as high-performing as a large core since it strips out some of the area- and power-intensive features (e.g. no out-of-order execution). Sun's Niagara and Azul's architectures [2, 17] are prime examples of this architecture class, with 8 in-order cores (each with four simultaneous hardware threads). Figure 1 (though not to scale) provides an illustration of the relative difference between SCMP and LCMP architectures.

The trade-off between small-scale and large-scale is that of scalar versus throughput performance. In this paper, we do not focus on this trade-off. Instead, our focus is on understanding the impact of such architecture classes on the scalability that the rest of the platform resources need to provide. For example, SCMP architectures have higher individual core/thread performance and are hence likely to be more memory-latency sensitive, whereas LCMP architectures have lower individual core/thread performance but their many simultaneously executing threads place significant pressure on the cache and memory bandwidth resources available.

2.2. CACHING TRENDS & SCALING

The contrast between SCMP and LCMP architectures can first be noticed in the cache hierarchy. Given the same die area constraints, the amount of cache space available on the die could be much smaller in the LCMP architecture than in the SCMP architecture. In fact, even if the same amount of cache space is available, the cache size per thread will still be very different since LCMP architectures have many more cores/threads (see the illustration in Figure 1). For example, most of today's SCMP architectures (e.g. the Intel® Core™ 2 Duo processor [10]) allow for 1MB or more per hardware thread, whereas LCMP architectures (e.g. Sun's Niagara [17]) are likely to have 10x less (~100KB or so) per hardware thread. Both architectures tend to use shared caches at least at the last level of the hierarchy, with the hope that the shared cache space provides a higher equivalent cache per thread. However, there are several usage models and workloads (including the one considered in this paper) that do not exhibit a high degree of sharing. Given the disparity in cache size per thread, it is important to analyze the cache behavior of such architectures and to compare and contrast them. The main goal here is to provide insights into the scalability and evolution of the SCMP and LCMP architectures. There are several orthogonal aspects that can be considered as well. For example, the cache hierarchy (number of levels, size at each level, etc.) and policies [13, 16, 19, 21, 30, 35] could be different for SCMP and LCMP architectures. Such topics are left as future work.

2.3. MEMORY BANDWIDTH & TECHNOLOGY

Given the disparity in cache size per thread as well as the number of simultaneously executing threads in the SCMP versus the LCMP architectures, it is obvious that the memory pressure exerted by the LCMP architecture is significantly higher than that of SCMP

architectures. Memory technology trends [7] show that DDR bandwidth and latency do not scale at the same pace as the bandwidth demands of even SCMP architectures. Even with more channels and new technologies like FBD [8], it is not clear whether sufficient bandwidth will be available to sustain the utilization of future SCMP and LCMP architectures.
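To make the bandwidth-pressure argument concrete, the sketch below estimates off-chip bandwidth demand as threads x (instructions retired per second) x MPI x line size. The thread counts, per-thread CPIs and MPI values are illustrative placeholders rather than measured data, so the results only indicate the rough order of magnitude of the gap between SCMP and LCMP demand.

    // Back-of-the-envelope DRAM bandwidth demand for SCMP vs. LCMP.
    // All inputs are illustrative assumptions, not measured values.
    public class BandwidthDemand {
        // bytes/s = threads * (freqHz / cpiPerThread) * missesPerInstr * lineBytes
        static double demandGBps(int threads, double freqGHz, double cpiPerThread,
                                 double mpi, int lineBytes) {
            double instrPerSec = threads * (freqGHz * 1e9) / cpiPerThread;
            return instrPerSec * mpi * lineBytes / 1e9;
        }

        public static void main(String[] args) {
            // SCMP-like: few threads, low CPI, modest MPI (large cache per thread)
            System.out.printf("SCMP ~%.1f GB/s%n", demandGBps(4, 4.0, 1.0, 0.004, 64));
            // LCMP-like: many threads, higher CPI per thread, higher MPI (small cache per thread)
            System.out.printf("LCMP ~%.1f GB/s%n", demandGBps(32, 4.0, 3.0, 0.008, 64));
        }
    }

Under these assumed inputs the SCMP demand is a few GB/s while the LCMP demand is tens of GB/s, which is the regime the memory configurations later in this paper (16 to 64 GB/s) are sized for.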

Dimensions           Range
Cores / Threads      2 to 64 threads; 2 to 32 cores
Micro-architecture   In-order vs. OOO; varied compute capability
Cache Scaling        2MB to 32MB
Memory Scaling       16 to 64 GB/s
DRAM cache           64MB to 256MB; 64 GB/s to 128 GB/s

Table 1: CMP Platform Resources under Evaluation

Figure 2. Large DRAM Caches in a Multi-Chip Package: (a) typical CMP architecture with integrated memory; (b) MCP architecture with a DRAM cache (tags kept on the CPU die).

In this paper, we will characterize the bandwidth requirements of a commercial Java workload running on these architectures. Since memory bandwidth is expected to be a significant bottleneck for these future architectures, we also evaluate the potential of large DRAM caches [33, 34] in future processors. DRAM caches can be enabled either through 3D stacking or as a multi-chip package (as illustrated in Figure 2). The benefits of DRAM caches are two-fold: (a) they reduce the bandwidth requirement on external DRAM and hence allow a potential reduction in the pin count out of the package, and (b) the latency to and from the DRAM cache is only a fraction of that of external memory. As a result, a DRAM cache is suitable for latency-bound workloads as well as bandwidth-bound workloads. Of course, the benefits of the DRAM cache are largely a function of the miss rate that it provides. In this paper, we evaluate the DRAM cache miss characteristics (from as low as 32MB to as high as 256MB) of a commercial Java workload running on SCMP and LCMP architectures. We then estimate the potential of a DRAM cache for future SCMP and LCMP architectures.

2.4. CMP DIMENSIONS SUMMARY

Here, we summarize the platform resource scaling dimensions that are explored in this paper for both SCMP and LCMP architectures. The summary, as illustrated in Table 1, includes:
• Core Scaling (SCMP vs LCMP)
• Core Capability (Thread-IPC sensitivity)
• Cache Resourcing (primarily last-level cache)
• Memory Scaling (technology capability)
• DRAM cache potential (innovative techniques)
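As a rough illustration of the DRAM-cache benefit discussed in Section 2.3, the sketch below computes the average memory access time and the residual off-package bandwidth for a given DRAM-cache hit rate. The latencies, hit rate and demand used here are illustrative assumptions, not the paper's measured DRAM-cache data.

    // Effective memory latency and off-package traffic with a DRAM cache.
    // All parameter values below are illustrative assumptions.
    public class DramCacheModel {
        static double avgLatencyNs(double hitRate, double dramCacheNs, double externalMemNs) {
            return hitRate * dramCacheNs + (1.0 - hitRate) * externalMemNs;
        }
        static double externalBwGBps(double demandGBps, double hitRate) {
            // Only DRAM-cache misses go off package.
            return demandGBps * (1.0 - hitRate);
        }
        public static void main(String[] args) {
            double hit = 0.8;  // assumed DRAM-cache hit rate
            System.out.printf("AMAT ~%.0f ns, off-package ~%.1f GB/s%n",
                    avgLatencyNs(hit, 40.0, 100.0),   // assumed 40ns DRAM cache vs. 100ns memory
                    externalBwGBps(20.0, hit));       // assumed 20 GB/s raw demand
        }
    }

This captures why the benefit is "largely a function of the miss rate": both the latency and the pin-bandwidth savings scale directly with the fraction of references the DRAM cache absorbs.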

3. ENTERPRISE JAVA WORKLOAD OVERVIEW

In this section, we introduce Enterprise Java workloads and then describe the Java benchmark (SPECjbb2005) that we chose for this study.

3.1. ENTERPRISE JAVA WORKLOADS

Over the years, Java has become a language of choice for developing server-side applications. The Java 2 Platform, Enterprise Edition (J2EE) framework is a programming platform for developing and running distributed, multi-tier Java applications. It is based largely on modular software components running on an application server. Most Enterprise Java workloads share some common traits. They create plenty of short-lived objects, thus stressing the garbage collection mechanisms of Java Virtual Machines (JVMs). They access databases in some form or the other. These applications normally have multiple users concurrently accessing the business logic. The Standard Performance Evaluation Corporation (SPEC) has developed several benchmarks that focus on different facets of Java applications. The SPECjvm98 [28] benchmark focuses on JVM performance. SPECjbb2005 [27] models a 3-tier system (client, business logic and database) on a single platform. SPECjAppServer2004 [29] concentrates on the J2EE framework and the performance of application servers. For this paper, we have studied SPECjbb2005.

3.2. SPECJBB2005 OVERVIEW

SPECjbb2005 [27] is an upgraded version of the earlier Java benchmark (SPECjbb2000) and was inspired by the TPC-C benchmark [32]. Much like TPC-C, a warehouse is a unit of stored data and there are five types of transactions. However, unlike TPC-C, SPECjbb2005 replaces database tables with Java classes and replaces data records with Java objects. The objects are held in memory as Java Collection instances or other Java data objects. Each warehouse contains roughly 25 MB of data stored in Java Collection objects. Users are mapped directly to Java threads. Each thread executes operations in sequence, with each operation selected from the operation mix using a probability distribution. As the number of warehouses increases during the full benchmark run, so does the number of threads.

SPECjbb2005 is implemented as a Java 5.0 application emulating a 3-tier system with emphasis on the middle tier. All three tiers are implemented within the same JVM. SPECjbb2005 is totally self-contained and self-driving (it generates its own data and its own multi-threaded operations, and does not depend on any package beyond the JRE). SPECjbb2005 is memory resident, performs no disk I/O, has only local network I/O, and has no think times. It reflects actual applications better than its predecessor SPECjbb2000 by being more object-oriented, and it introduces BigDecimal and XML. It is meant to exercise implementations of the JVM, JIT compiler, garbage collector and threads. SPECjbb2005 is intended to stress processors, caches, the memory hierarchy and the scalability of shared-memory processors, as is the intention here.
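The warehouse and transaction-mix structure described above can be pictured with the following minimal sketch. It is not SPECjbb2005 source code; the class names, sizing and operation weights are placeholders that only mirror the structure described in the text (per-thread warehouses held in in-memory Java collections, with each thread repeatedly picking one of five operations from a probability distribution).

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    // Minimal structural sketch of a SPECjbb2005-style warehouse workload
    // (illustrative only; names and probabilities are assumptions).
    class Warehouse {
        // Warehouse data lives in in-memory Java collections (~25 MB in the real benchmark).
        final Map<Integer, byte[]> records = new HashMap<>();
    }

    class WarehouseThread implements Runnable {
        // Five transaction types, chosen from a probability distribution (weights assumed).
        static final String[] OPS = {"NewOrder", "Payment", "OrderStatus", "Delivery", "StockLevel"};
        static final double[] WEIGHTS = {0.43, 0.43, 0.04, 0.05, 0.05};

        final Warehouse warehouse = new Warehouse();   // one warehouse per user/thread
        final Random rng = new Random();

        public void run() {
            for (int i = 0; i < 100_000; i++) {
                execute(pickOperation());
            }
        }

        String pickOperation() {
            double r = rng.nextDouble(), cum = 0.0;
            for (int i = 0; i < OPS.length; i++) {
                cum += WEIGHTS[i];
                if (r < cum) return OPS[i];
            }
            return OPS[OPS.length - 1];
        }

        void execute(String op) {
            // Transactions create many short-lived objects, stressing the garbage collector.
            warehouse.records.put(rng.nextInt(1 << 20), new byte[64]);
        }
    }

Because each thread works almost entirely on its own warehouse, this structure also explains the very low inter-thread sharing observed in the measurements of Section 5.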

4. EVALUATION METHODOLOGY

A unique aspect of this work is that four different evaluation methodologies have been adopted to address the goals of the SCMP/LCMP analysis. These include: (a) measurement-based characterization of SCMP platforms, (b) trace-based cache simulation for SCMP/LCMP characterization, (c) trace-based platform timing simulation for evaluating performance and bandwidth implications and (d) execution-driven emulation for analyzing the potential of large DRAM caches in future architectures. Figure 3 summarizes these evaluation methodologies and the subsections below expand on each of them.

4.1. MEASUREMENT-BASED METHODOLOGY

In order to analyze the performance of SPECjbb2005 on SCMP platforms, the best place to start is to measure and characterize it on the latest dual-core platforms. This also gives us a good baseline for validation and comparison of the subsequent simulation- and emulation-based approaches. We measured SPECjbb2005 on Intel's latest dual-core, dual-socket server platform [10]. This platform consists of two Intel® Core™ 2 Duo (Woodcrest) processors, each with two cores running at 3GHz. Each processor also contains a 4MB L2 cache shared between the two cores. Note that the cores are not multi-threaded.

The platform consists of 4 FBD [8] channels running at 533 MT/s to provide a total peak bandwidth of 25.6 GB/s. We studied the performance scaling characteristics of SPECjbb2005 as a function of the number of cores, the number of sockets and the cache size. To understand the performance scaling characteristics and execution behavior, we used EMON (Intel's performance monitoring tool) to collect architectural performance counters and other tools to collect system-level information. Based on these counters, we analyze cycles per instruction (CPI), misses per instruction (MPI), path length (the number of instructions per operation) and CPU utilization (the fraction of time spent executing on the CPU). In addition, we collected profiling statistics from the JVM to understand the time spent in garbage collection as well as in other functions.
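The derived metrics above are simple ratios of raw counter values. The sketch below shows the arithmetic, with made-up counter values standing in for EMON output; the names and numbers are placeholders (chosen to land near the 2S4C point reported in Section 5.1), not actual EMON event names or readings.

    // Derived metrics from raw performance counters (illustrative values only).
    public class DerivedMetrics {
        public static void main(String[] args) {
            long cycles = 3_000_000_000L;        // assumed: unhalted core cycles
            long instructions = 2_200_000_000L;  // assumed: instructions retired
            long llcMisses = 8_500_000L;         // assumed: last-level cache misses
            long operations = 25_000L;           // assumed: SPECjbb2005 operations completed

            double cpi = (double) cycles / instructions;             // cycles per instruction
            double mpi = (double) llcMisses / instructions;          // misses per instruction
            double pathLength = (double) instructions / operations;  // instructions per operation

            System.out.printf("CPI=%.2f MPI=%.4f PathLength=%.0f%n", cpi, mpi, pathLength);
        }
    }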

4.2. SIMULATION-BASED METHODOLOGY

While measurements allow us to understand the performance behavior of today's platforms, simulation techniques are required for understanding SCMP and LCMP architecture evolution and the performance implications of future implementations. In this paper, we employ two different types of simulation techniques: component simulation, which focuses on cache characterization, and platform simulation, which focuses on platform performance and bandwidth analysis. We employed a typical trace-driven cache simulator (CASPER) [14] to analyze the impact of scaling threads (and cores) and cache size on cache performance. We validated the cache data by reproducing the measurement configuration and comparing the miss rates. For SCMP platforms, we studied 4 and 8 threads per processor socket, whereas for LCMP we studied 32 and 64 threads per processor socket. While we collected data for all levels of cache, we focus on last-level cache performance in this paper. We use misses per instruction as the primary metric and show a breakdown of the misses to understand the contributions of instructions, loads and stores as well as their individual scaling characteristics.
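To illustrate what a trace-driven functional cache simulation of this kind computes, the sketch below models a single shared, set-associative, LRU last-level cache and reports misses per instruction for an address trace. It is a generic sketch under standard assumptions, not CASPER itself; the configuration values a caller would pass are placeholders.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Generic trace-driven set-associative LRU cache model (not CASPER itself).
    public class CacheSim {
        final int ways, sets, lineShift;
        final Deque<Long>[] lru;               // per-set LRU stack of line tags
        long accesses, misses;

        @SuppressWarnings("unchecked")
        CacheSim(long sizeBytes, int ways, int lineBytes) {
            this.ways = ways;
            this.sets = (int) (sizeBytes / (ways * (long) lineBytes));
            this.lineShift = Integer.numberOfTrailingZeros(lineBytes);
            this.lru = new Deque[sets];
            for (int i = 0; i < sets; i++) lru[i] = new ArrayDeque<>();
        }

        void access(long address) {
            accesses++;
            long line = address >>> lineShift;
            int set = (int) (line % sets);
            Deque<Long> stack = lru[set];
            if (stack.remove(line)) {          // hit: move line to MRU position
                stack.addFirst(line);
            } else {                           // miss: insert line, evict LRU victim if set is full
                misses++;
                if (stack.size() == ways) stack.removeLast();
                stack.addFirst(line);
            }
        }

        double mpi(long instructions) { return (double) misses / instructions; }
    }

Replaying the load/store addresses of an instruction trace through such a model and dividing the miss count by the instruction count yields MPI curves of the kind shown later in Figure 7(a).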

Figure 3. Evaluation Methodologies for SCMP and LCMP Evaluation: SPECjbb2005 execution on a real platform (real platform measurement: e.g. CPI, MPI, latencies), SPECjbb2005 traces driving cache simulation (trace-driven cache simulation: e.g. misses) and platform simulation (trace-driven platform simulation: e.g. CPI, latencies), and SPECjbb2005 execution observed by a passive emulator (hardware emulation: e.g. misses). The four methodologies offer different levels of scope, accuracy and speed.

Figure 4. ManySim Simulation Framework. Workloads are converted into instruction traces; core simulation is accomplished using either a cycle-accurate CPU simulator or a fixed-CPI and L1 cache simulator to generate annotated CPU traces; the cache simulation model simulates the cache hierarchy (private L2 caches and a shared L3 cache) and the coherence protocol; the interconnect module simulates the bandwidth and latency of topologies (bus, ring, etc.); and the memory module simulates the bandwidth and latency of several memory channels.

We then used an in-house platform simulator (called ManySim) to study the bandwidth and performance implications of future SCMP and LCMP platforms. ManySim simulates the platform resources with great accuracy and abstracts the core by representing it as a sequence of compute events (collected from a real core simulator) separated by memory accesses that are injected into the platform model. We ran future core microarchitecture simulations for both out-of-order cores (for SCMP) and in-order cores (for LCMP) to get a range of suitable compute event values for the abstract core model. The rest of the platform model contains a detailed cache hierarchy model, a detailed coherence protocol implementation, an on-die interconnect model (simulating a bi-directional ring) and a memory model that simulates the maximum sustainable bandwidth specified in the configuration. The ManySim model is illustrated in Figure 4. The model exports several parameters for flexible configuration.

We used ManySim to represent both SCMP and LCMP architectures, differentiated by parameters such as core compute capability, number of outstanding memory requests, number of threads, cache size and memory bandwidth. Table 2 summarizes the SCMP and LCMP simulation configurations.

Core: SCMP = 4GHz, out-of-order, 1 or 2 threads per core, core CPI varied; LCMP = 4GHz, in-order, 4 threads per core, core CPI varied, 4 cores/node
L1 I/D cache: SCMP = 32KB, 4-way, 64B lines; LCMP = 32KB, 4-way, 64B lines
L2 cache: SCMP = 256KB, 8-way, 64B lines; LCMP = 512KB/node, 8-way, 64B lines
L2 cache hit time: SCMP = 10 cycles; LCMP = 10 cycles
L3 cache: SCMP = 4MB, 16-way, 64B lines; LCMP = 16MB, 16-way, 64B lines, distributed banked organization
L3 cache hit time: SCMP = 50 cycles; LCMP = 50 cycles
Interconnect bandwidth: SCMP = 64GB/s; LCMP = 128GB/s
Memory access time: SCMP = 400 cycles; LCMP = 400 cycles
Memory bandwidth: SCMP = 16 to 32GB/s; LCMP = 32 to 64GB/s
Queues & other structures: SCMP = Memory Queue (16), L2/L3 MSHR (16), Coherence Controller (16), Interconnect I/F (8); LCMP = Memory Queue (64), L2/L3 MSHR (32), Coherence Controller (32), Interconnect I/F (8)

Table 2. SCMP & LCMP Simulation Parameters
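To make the abstract core model concrete, the sketch below replays a per-thread trace of alternating compute events and memory references, charging the compute cycles and then whatever latency the (not shown) platform model reports for each reference. This is a heavily simplified, single-threaded sketch of the idea with assumed class and interface names; it is not the ManySim implementation.

    import java.util.List;

    // Simplified abstract-core replay loop in the spirit of the ManySim description
    // (names, structure and the blocking-memory simplification are ours, not ManySim's).
    public class AbstractCore {
        // One trace entry: cycles of computation followed by a single memory reference.
        static class TraceEvent {
            final long computeCycles;
            final long address;
            final boolean isLoad;
            TraceEvent(long computeCycles, long address, boolean isLoad) {
                this.computeCycles = computeCycles;
                this.address = address;
                this.isLoad = isLoad;
            }
        }

        // Stand-in for the detailed cache hierarchy, interconnect and memory models.
        interface PlatformModel {
            long serviceLatency(long address, boolean isLoad, long issueCycle);
        }

        long replay(List<TraceEvent> trace, PlatformModel platform) {
            long cycle = 0;
            for (TraceEvent e : trace) {
                cycle += e.computeCycles;                                      // compute event
                cycle += platform.serviceLatency(e.address, e.isLoad, cycle);  // stall on the reference
            }
            return cycle;  // total simulated cycles for this thread
        }
    }

In the real simulator a core does not stall on every reference; the outstanding-request structures in Table 2 (MSHRs and memory queues) and multiple hardware threads per core determine how much of this latency can be overlapped.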

The traces used to feed these simulations consisted of about a billion instructions (spread over 10 trace samples). We distributed these traces over the simulated threads while ensuring that the code space is shared but the data space is distributed (using address offsets). We validated that this accurately reflects the behavior of the workload by comparing the cache miss rates as well as the sharing behavior with the measurement platforms.

4.3. EMULATION-BASED METHODOLOGY

An emulation-based approach has significant advantages for this evaluation: (a) Speed and accuracy: emulation can be orders of magnitude faster than simulation, where the speed goes down significantly as the accuracy of the models increases. For example, ManySim runs at a speed of roughly 25 KIPS (kilo-instructions per second) even though it employs an abstract core model. (b) Workload coverage: for the purposes of this evaluation, the motivation for using emulation is to evaluate large DRAM caches (where the size of the cache goes as high as 256MB). Trace-based simulation becomes problematic at these sizes since trace length becomes an issue (insufficient warm-up). Another benefit of an emulation-based approach is that it enables additional accuracy, since the evaluation can be run for minutes and all the phases that the workload goes through get captured.

Figure 5. Dragonhead FPGA Cache Emulator. The system under test is connected through a Tektronix logic analyzer interface (LAI) and an address filter to the emulator's cache controllers and tag memories, with a host interface.

As hardware technology has advanced over the years, it has become increasingly feasible to use FPGA-based emulation for architecture evaluation [36]. FPGA emulators are now capable of execution speeds measured in MIPS. In this paper, we use an FPGA-based cache emulator (called Dragonhead), developed internally for large DRAM cache evaluation for future CMP architectures. The Dragonhead emulator (as illustrated in Figure 5) is connected to the system under test through a logic analyzer interface (LAI). The workload is run unmodified on the system. The LAI connected to the FSB detects the memory accesses on the bus and sends them to Dragonhead, where caches of various sizes and organizations are emulated. In this paper, we emulate caches with sizes ranging from 1MB to 256MB, with line sizes varying from 64B to 4KB.
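A cache emulator that only measures hits and misses needs to store tags rather than data, so the tag memory bounds the configurations that fit on the FPGA board. The sketch below shows the simple tag-count arithmetic for the cache and line sizes quoted above; the bits-per-tag value is an assumption for illustration, since the paper does not describe Dragonhead's internal organization.

    // Tag-array sizing for an emulated cache (bits-per-tag is an assumed value).
    public class TagStorage {
        static long tagBytes(long cacheBytes, int lineBytes, int bitsPerTag) {
            long lines = cacheBytes / lineBytes;   // number of cache lines to track
            return lines * bitsPerTag / 8;         // total tag storage in bytes
        }
        public static void main(String[] args) {
            long cache = 256L << 20;               // 256MB emulated cache
            System.out.printf("64B lines: %d MB of tags%n", tagBytes(cache, 64, 32) >> 20);
            System.out.printf("4KB lines: %d KB of tags%n", tagBytes(cache, 4096, 32) >> 10);
        }
    }

Under this assumption, a 256MB cache with 64B lines needs on the order of 16MB of tag storage, while 4KB lines shrink that to a few hundred KB, which is one reason large line sizes are interesting to emulate.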

5. RESULTS AND ANALYSIS

In this section, we present the results gathered from each of the methodologies and analyze SCMP/LCMP architectural characteristics based on SPECjbb2005.

5.1. TODAY'S SCMP PLATFORM PERFORMANCE

We start our analysis with SPECjbb2005 measurements on Intel's latest dual-core, dual-socket platform based on the Intel® Core™ 2 Duo processor microarchitecture. The BEA JRockit JVM was used for the characterization*. First, we varied the number of cores in the platform and studied how the performance of SPECjbb2005 scales. Figure 6 shows the data from this core scaling experiment. The labels on the x-axis can be deciphered as follows: 1Socket-1Core (1S1C) has only one core in the system with a 4MB LLC. 1Socket-2Core (1S2C) has 2 cores sharing the 4MB LLC on a single socket. 2Socket-2Core (2S2C) has 2 cores, each on a different socket with its own 4MB LLC. 2Socket-4Core (2S4C) has 4 cores, with 2 cores on each socket sharing a 4MB last-level cache.

Figure 6: SCMP Processor Scaling of SPECjbb2005 (normalized performance for 1S1C, 1S2C, 2S2C and 2S4C), with the underlying measurements tabulated below:

Processor Scaling     1S1C     1S2C     2S2C     2S4C
Throughput-Scaling    1.00     1.56     1.81     2.89
CPI                   1.03     1.32     1.14     1.38
Pathlength            85,942   86,163   86,104   88,368
MPI                   0.0031   0.0039   0.0032   0.0039
Time Spent in GC      1.83%    2.38%    2.86%    4.28%

As we can see from the table, going from 1S1C to 1S2C we get a 1.56x speedup, whereas going from 1S1C to 2S2C we get a 1.8x speedup. The 1S2C configuration does not achieve as much of a performance increase because of LLC sharing (MPI increases by 23%, from 0.0031 to 0.0039) between the two cores on the same socket. This causes a CPI increase (from 1.03 to 1.32) and hence limits the speedup. The speedups are closer to perfect when going from 1S to 2S while keeping the number of cores per socket constant (2S2C/1S1C = 1.81x and 2S4C/1S2C = 1.86x). We also observed that the percentage of time spent in garbage collection (GC) increases as we increase the number of cores. This happens due to an increased rate of object creation and increased resident memory.

* The results shown here do not reflect the current performance of SPECjbb2005 on IA processors or platforms. They are meant as characterization data only.

Cache Scaling        1M       2M       4M
Throughput-Scaling   1.00     1.30     1.80
Pathlength           87,296   87,927   88,368
CPI                  2.52     1.93     1.38
MPI                  0.0087   0.0063   0.0039
HIT%                 6%       4%       3%
HITM%                0%       0%       0%

Table 3. SCMP Cache Scaling of SPECjbb2005

We next studied how the performance of SPECjbb2005 changes with cache size by keeping the number of sockets and cores constant (2S4C) and varying the last-level cache (LLC) size. Table 3 shows the data obtained from these measurements. As seen from the table, we get about a 30% to 40% performance benefit for every doubling in cache size. The increase in performance is directly due to the decrease in MPI and the resultant decrease in CPI. We also analyzed the coherence/sharing characteristics by studying the HIT% (the percentage of misses finding the line in another cache in shared or exclusive state) and the HITM% (the percentage of misses finding the line in another cache in Modified state). We found that the HITM% is extremely low (essentially zero) and the HIT% is in the low single digits (decreasing as the cache size increases). The decreasing HIT% shows that larger cache sizes (and higher performance) do not lead to more data sharing. In summary, it is clear that there is hardly any sharing between the SPECjbb2005 threads and they can be treated as independent threads of execution.
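A quick consistency check on Table 3: if the CPI difference between the 2MB and 4MB configurations is attributed mostly to the extra LLC misses, the implied average miss penalty is the CPI delta divided by the MPI delta. The sketch below does this arithmetic; it is only a back-of-the-envelope estimate under that simplifying assumption, not a measured latency.

    // Implied average miss penalty from Table 3 (simplifying assumption:
    // the CPI delta is entirely due to the MPI delta).
    public class MissPenaltyEstimate {
        public static void main(String[] args) {
            double cpi2M = 1.93, cpi4M = 1.38;     // CPI at 2MB and 4MB LLC (Table 3)
            double mpi2M = 0.0063, mpi4M = 0.0039; // MPI at 2MB and 4MB LLC (Table 3)
            double penaltyCycles = (cpi2M - cpi4M) / (mpi2M - mpi4M);
            System.out.printf("Implied average miss penalty: ~%.0f cycles%n", penaltyCycles);
        }
    }

The result (a couple of hundred core cycles per miss) is consistent with the observation that memory stall time dominates the CPI of this workload.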

Frequency-GHz         2        2.67     3
Throughput-Scaling    1.00     1.19     1.25
Pathlength            88,126   88,061   88,368
CPI                   1.16     1.29     1.38
MPI                   0.0039   0.0039   0.0039

Table 4. SCMP SPECjbb2005 Frequency Scaling

Keeping the number of processors and the cache size constant (2S4C, 4MB LLC), we then changed the processor frequency to find out how SPECjbb2005 scales with frequency. Table 4 shows the data from these measurements. We find that we get a 19% performance benefit for a 30% increase in frequency (an efficiency of ~60%). The increase in CPI limits the scaling of SPECjbb2005 with frequency. The dependence on memory stall time, as well as some increase in memory latency, is the cause of the CPI increase.
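The efficiency figure follows directly from Table 4; the sketch below reproduces the arithmetic (speedup gained divided by frequency increase) for the 2GHz-to-2.67GHz step.

    // Frequency-scaling efficiency from Table 4: (speedup - 1) / (frequency ratio - 1).
    public class FrequencyScalingEfficiency {
        static double efficiency(double baseGHz, double newGHz, double speedup) {
            return (speedup - 1.0) / (newGHz / baseGHz - 1.0);
        }
        public static void main(String[] args) {
            System.out.printf("2 -> 2.67 GHz: %.0f%%%n", 100 * efficiency(2.0, 2.67, 1.19));
        }
    }

This prints roughly 57%, in line with the ~60% quoted above (the exact value depends on whether the frequency step is counted as 30% or 33%).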

5.2. SCMP AND LCMP SCALING BEHAVIOR

As SCMP/LCMP platforms evolve, the number of cores/threads will increase and the remainder of the die will be used by cache and other components [19]. Since cache performance has a major impact on SPECjbb2005 performance (as shown by the measurements above), we first focus on cache scaling in SCMP/LCMP platforms and then analyze the overall platform scaling behavior.

Figure 7: SPECjbb2005 Cache Scaling on SCMP/LCMP. (a) MPI characteristics as a function of cache size for 4, 8, 16, 32 and 64 threads; (b) MPI breakdown into load, store and code misses (y-axis: misses per instruction; x-axis: number of threads and cache size).

The cache scaling behavior of SCMP/LCMP platforms was studied using trace-driven simulation based on CASPER (as described in the previous section). Figure 7(a) shows the MPI scaling as a function of cache size and number of threads. It should be noted that we consider the 16-thread case to be the common boundary between SCMP and LCMP platforms. For SCMP architectures we studied the performance of shared caches ranging from 1MB to 16MB, whereas for LCMP architectures we studied shared caches ranging from 4MB to 32MB. We validated the cache data by comparing it to the measurement data shown in the previous section. For instance, at 1MB per thread, Table 3 shows that the MPI is roughly 0.0063 (the 2-cores-sharing-2MB data point). We compared this to a trace-driven simulation and found that the data was within 2% error. In Figure 7(a), the corresponding data point shows roughly 0.0065 (with 4 threads sharing 4MB). From the figure, one can notice that the MPI decreases linearly as a function of log(cache size) up to a cache size of 1MB per thread. When the cache size increases from 1MB to 2MB, the MPI decrease is more significant (this occurs because many of the database structures start to fit into the cache). It should also be noted that as the number of threads and the cache size are scaled proportionately, the MPI stays relatively constant. Finally, we find that the MPI seems to flatten out after 2MB per thread (e.g. 4 threads with a 16MB cache). Since the length of the trace starts to be an issue after 4MB of cache, we did not extend the study to 8MB and beyond.

Figure 7(b) shows the MPI breakdown for most of the cache configurations. We observe that the code miss rate is negligible (we later found the code working set size to be less than 200KB), the write miss rate is relatively constant at about 0.002 (due to object creation/allocation that results in compulsory misses) and the read MPI varies significantly as a function of the cache size.

We next delve into the platform performance behavior based on detailed ManySim simulations. Here, we chose two SCMP configurations (4-thread and 8-thread) and two LCMP configurations (32-thread and 64-thread) for evaluation. We kept the cache size at 4MB for SCMP and 16MB for LCMP. We performed several detailed microarchitecture simulations to determine the core performance (with a perfect cache). We found the CPI of out-of-order SCMP cores to be between 0.85 and 1.5 per thread (the lower value for single-threaded cores and the higher value for dual-threaded cores). We performed similar in-order core simulations for LCMP cores and found the CPI to range from 2 to 4.5 per thread (the lower end for single-threaded cores and the higher end for quad-threaded cores). Since the core performance varies significantly depending on the features enabled within the core as well as the number of threads, we decided to vary the core performance in our simulations (0.5 to 2 per thread for SCMP and 1 to 5 per thread for LCMP).

Figure 8. Future SCMP Platform Performance. (a) 4-thread and (b) 8-thread SCMP performance: normalized CPI broken down into core, MLC, LLC and memory components (bars, primary y-axis) and memory utilization (dots, secondary y-axis), for core capabilities of 2, 1.5, 1 and 0.5 CPI per thread at peak memory bandwidths of 16GB/s and 32GB/s.

Figures 8 and 9 show the data collected from ManySim simulations for SCMP and LCMP platform performance respectively. Figure 8 shows the 4-thread and 8-thread normalized CPI breakdown (as bars on the primary y-axis) and memory utilization (as dots on the secondary y-axis) for SCMP platforms. From the SCMP figures, the following observations were made: (a) memory stall time is the dominant portion of the CPI, ranging from 45% to 70%, (b) memory utilization is moderate to low, varying from 18% to 85% (of the maximum sustainable bandwidth) depending on the number of threads, the memory technology and the core capability, and (c) the benefit of 2x memory bandwidth is very low (< 5%) since the SCMP memory stall time is largely latency-dependent. Figures 9a and 9b show the data collected from ManySim simulations for LCMP platform performance with 32 and 64 threads respectively. The 32-thread case could be considered representative of Sun's Niagara architecture, except that we have scaled the cache size (to 16MB) and the frequency (to 4GHz) based on future process technology improvements (45nm). From this LCMP data, the following observations can be made: (a) memory stall time is a dominant portion of the CPI, ranging from 40% to 90%, (b) memory utilization is very high, varying from 40% to 100% depending on the number of threads, the memory technology and the core capability, and (c) the benefit of 2x memory bandwidth can be quite significant (from