A Simple Model to Quantify the Impact of Memory Latency and Bandwidth on Performance

Russell Clapp, Intel Corporation, Hillsboro, Oregon
Martin Dimitrov, Intel Corporation, Chandler, Arizona
Karthik Kumar, Intel Corporation, Chandler, Arizona
Vish Viswanathan, Intel Corporation, Hillsboro, Oregon
Thomas Willhalm, Intel GmbH, Walldorf, Germany

{Russell.M.Clapp, Martin.P.Dimitrov, Karthik.Kumar, Vish.Viswanathan, Thomas.Willhalm}@intel.com
ABSTRACT
In recent years, DRAM technology improvements have scaled at a much slower pace than processors. While server processor core counts grow by 33% to 50% on a yearly cadence, DDR4 memory channel bandwidth has grown at a slower rate, and memory latency has remained relatively flat for some time. Meanwhile, new computing paradigms have emerged that involve analyzing massive volumes of data in real time and place pressure on the memory subsystem. The combination of these trends makes it important for computer architects to understand the sensitivity of workload performance to memory bandwidth and latency. In this paper, we outline and validate a methodology for quick and quantitative performance estimation using a real-world workload.

Categories and Subject Descriptors
C.4 [Performance of Systems]: Modeling techniques, Design studies; D.4.8 [Performance]: Measurements

Keywords
Performance Modeling, Workload Characterization

1. INTRODUCTION
As data volumes grow and compute capabilities continue to increase with Moore's Law, memory (DRAM) density scaling has begun to lag far behind. Emerging memory technologies [1] may address the capacity issue but have different latency and bandwidth characteristics from conventional DRAM. The implication is clear: as computer architects look toward alternative memory technologies that provide more capacity at the expense of latency and bandwidth, it is paramount that performance models exist to quantify those tradeoffs. Higher-level, quick-turn estimation models that enable faster evaluation of architectural alternatives are required to set direction before developing new, more detailed simulation models of new technologies. Further, as the adoption of these new technologies increases, it is natural that there will be hybrid memory systems, containing more than one type of memory technology, in tiered or non-tiered configurations. The key factor in determining how an application may use or benefit from such systems is performance. This paper demonstrates how we can vary the latency and bandwidth characteristics of the memory subsystem and determine the impact of that subsystem on performance. To do so, we outline and validate a methodology that uses classic analytic equations for quick, quantitative performance estimation.

2. RELATED WORK
Previous researchers have extensively addressed performance modeling [2-6]. Sorin et al. [2] present an analytical model for evaluating architectural alternatives for shared-memory systems built from processors that exploit instruction-level parallelism. Emma [4] provides an analytical framework for estimating performance using CPI, breaking it down into infinite-cache and memory subsystem effects, which is consistent with our model. In particular, [5,6] describe the impact of memory-level parallelism (MLP) in modeling. Our paper builds on these approaches with equations that provide quick and quantitative analytical performance estimation. We demonstrate and validate the modeling methodology using a real-world workload.

3. METHODOLOGY
We use the number of clock cycles per instruction (CPI) as the measure of processor performance, as recommended in [4]. CPI is the inverse of instruction throughput, and a lower CPI indicates better performance. We define the pathlength of a workload as the number of instructions required to complete a unit of work; the pathlength therefore depends on the software code path executed to complete that unit of work. The combination of pathlength and CPI then determines the performance of the workload, whether it is quantified as run time or throughput. Thus, if the pathlength is fixed, the CPI per processor and the number of processors can be converted directly into a workload measure of throughput. To simplify our analysis, we consider the case of a fixed pathlength below.
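To make the conversion from pathlength and CPI to throughput concrete, here is a minimal Python sketch; the core count, core frequency, CPI, and pathlength values are hypothetical and chosen only for illustration.

def throughput(num_cores, core_hz, cpi, pathlength):
    """Units of work completed per second for a fixed pathlength.

    Each core retires core_hz / cpi instructions per second, and every
    pathlength instructions complete one unit of work.
    """
    return num_cores * core_hz / cpi / pathlength

# Hypothetical example: 16 cores at 2.4 GHz, a CPI of 1.4, and a pathlength
# of 2 million instructions per unit of work.
print(throughput(num_cores=16, core_hz=2.4e9, cpi=1.4, pathlength=2.0e6))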
Let us assume we have a processor with a two-level memory hierarchy: one level of cache followed by main memory. CPIeff stands for effective cycles per instruction and denotes the inverse of the effective rate of instruction execution. Thus, for a single thread of execution in its simplest form:

CPIeff = CPIcache + MPI * MP * BF    [1]
Here CPIcache is the CPI if all memory references were satisfied by the processor cache, MPI is the number of misses per instruction at the processor cache (either demand or prefetch), MP is the cache miss penalty, and BF is the "blocking factor", the percentage of the miss penalty that contributes to CPIeff. Thus, the equation takes as input the effective CPI with an infinite cache and adds incremental CPI based on the number of cache misses, the latency to satisfy those misses, and a factor that reflects the impact of that latency. If an application is truly core bound, it has no sensitivity to memory latency, and the blocking factor in our equation is zero. This is possible if prefetching ensures that data is readily available in the processor cache once the core requires it. On the other hand, if the workload is bandwidth bound, the equation still holds, as queuing theory tells us that latency increases dramatically as available bandwidth becomes saturated.
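As a minimal illustration, the Python sketch below evaluates Eq. 1; the MPI and MP values are taken from the first column of Table 1 below, while the CPIcache and BF values are hypothetical placeholders rather than the fitted values from Fig. 1.

def cpi_eff(cpi_cache, mpi, mp, bf):
    """Eq. 1: effective CPI = infinite-cache CPI + MPI * miss penalty * blocking factor."""
    return cpi_cache + mpi * mp * bf

# MPI and MP from the first column of Table 1; CPIcache and BF are
# hypothetical placeholders for illustration only.
print(cpi_eff(cpi_cache=0.9, mpi=0.0056, mp=402, bf=0.2))   # about 1.35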
Table 1: Computed versus measured CPI for In-memory Analytics.

Core Speed (GHz)     2.1     2.1     2.4     2.4     2.7     2.7     3.1     3.1
MPI                  0.0056  0.0056  0.0056  0.0056  0.0059  0.0055  0.0057  0.0055
MP (core cycles)     402     383     462     448     543     502     631     598
CPIeff (computed)    1.33    1.31    1.39    1.38    1.52    1.43    1.60    1.53
CPIeff (measured)    1.32    1.32    1.38    1.39    1.47    1.44    1.60    1.57
Error                1.0%    -1.1%   0.9%    -0.6%   3.0%    -0.6%   -0.1%   -2.5%
Our Eq. 1 is consistent with the observations made in [5,6], which state that the core stall time resulting from multiple simultaneous outstanding long-latency cache misses (call this number of outstanding misses MLP) is derived from the miss penalty MP divided by the MLP. Eq. 2 is taken from [5] and shows this explicitly:

CPIeff = MPI * (MP/MLP - OL) + CPIcache    [2]
Our BF, then, is proportional to 1/MLP, but offset by OL, which represents the overlap of core execution with cache misses. We choose to use Eq. 1, since OL and MLP are not directly measurable from hardware performance counters. Once we have computed the effective CPI for a given program phase using Eq. 1, we can use Eq. 3 to compute the bandwidth demand in bytes per second on the memory subsystem. In Eq. 3, WBR is the percentage of cache misses that require writeback of a dirty victim, LS is the line size in bytes, and CPS is the core speed in cycles per second.

BW = (MPI * (1 + WBR) * LS) * CPS / CPIeff    [3]

This equation can be easily extended to account for IO events that also consume memory bandwidth. By scaling Eq. 3 with the total core count, we can compute the total system-wide bandwidth required. Using this value, we can estimate the average MP using queuing theory or characterization data that provides a relationship between queuing delay and bandwidth utilization. Further, by changing the denominator in Eq. 3 to the available memory bandwidth, we can compute a bandwidth-limited CPI.
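The following Python sketch illustrates Eq. 3 and the bandwidth-limited CPI described above; the write-back ratio, line size, core count, and available-bandwidth figures are hypothetical values chosen only for illustration.

def bandwidth_demand(mpi, wbr, ls, cps, cpi_eff):
    """Eq. 3: per-core memory bandwidth demand in bytes per second."""
    return (mpi * (1 + wbr) * ls) * cps / cpi_eff

def bandwidth_limited_cpi(mpi, wbr, ls, cps, available_bw):
    """Eq. 3 with the available bandwidth in the denominator: the CPI floor
    imposed when the memory subsystem limits performance."""
    return (mpi * (1 + wbr) * ls) * cps / available_bw

# Hypothetical inputs for illustration: 0.0056 misses/instruction, 30% of
# misses writing back a dirty victim, 64-byte lines, and 2.4 GHz cores.
per_core_bw = bandwidth_demand(mpi=0.0056, wbr=0.3, ls=64, cps=2.4e9, cpi_eff=1.38)
system_bw = 16 * per_core_bw   # scale by an assumed 16-core count
cpi_floor = bandwidth_limited_cpi(mpi=0.0056, wbr=0.3, ls=64, cps=2.4e9,
                                  available_bw=60e9)   # assume 60 GB/s available
print(per_core_bw, system_bw, cpi_floor)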
Figure 1: Example fit for CPIcache and BF.

4. VALIDATING THE MODEL
To demonstrate and validate the model, we need to determine CPIcache and BF in Eq. 1 for workloads of interest. We can estimate these parameters by measuring a workload's CPIeff at different miss penalties, which are achieved by varying the core speed and memory speed of the system under test. Thus we obtain data points with different values of (MPI * MP), and we measure CPIeff for each of these data points using hardware performance counters. We estimate CPIcache and BF in Eq. 1 by obtaining a fit for these data points, as shown in Fig. 1 for an in-memory analytics workload. Table 1 shows the detailed measurements we made to construct Fig. 1, taking multiple measurements at each core speed to account for any run-to-run variation. Using the values for CPIcache and BF determined in Fig. 1, we compute CPIeff and compare it to the measured value. The error is shown in the last row of the table and is no more than 3%. We have found similar error rates across many workloads that we have measured.
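The fit described above is a straight line in (MPI * MP): Eq. 1 gives CPIeff = CPIcache + BF * (MPI * MP), so BF is the slope and CPIcache the intercept. A minimal Python sketch of such a fit over the measured points in Table 1 is shown below; the paper does not state which fitting routine was used, so the ordinary least-squares fit via numpy.polyfit here is only one reasonable choice.

import numpy as np

# Measured data points from Table 1 (one entry per measurement).
mpi = np.array([0.0056, 0.0056, 0.0056, 0.0056, 0.0059, 0.0055, 0.0057, 0.0055])
mp = np.array([402, 383, 462, 448, 543, 502, 631, 598])          # core cycles
cpi_eff_measured = np.array([1.32, 1.32, 1.38, 1.39, 1.47, 1.44, 1.60, 1.57])

# Eq. 1 is linear in MPI * MP, so a degree-1 least-squares fit yields
# BF (slope) and CPIcache (intercept).
x = mpi * mp
bf, cpi_cache = np.polyfit(x, cpi_eff_measured, 1)

# Recompute CPIeff from the fitted parameters and compare to the measurements.
cpi_eff_computed = cpi_cache + bf * x
error = (cpi_eff_computed - cpi_eff_measured) / cpi_eff_measured
print(f"CPIcache={cpi_cache:.3f}  BF={bf:.3f}  max error={abs(error).max():.1%}")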
5. OBSERVATIONS
We have applied this methodology to establish CPIcache and BF for several Big Data workloads, and compared them to well-known HPC and Enterprise workloads. We see that there is a similarity within each workload class between the values for BF and (MPI*(1+WBR))/CPIcache, the second term representing the injection rate of memory requests coming from the processor. Based on this observation, we created synthetic parameter sets for each workload class and computed CPI sensitivity to changes in memory latency and bandwidth. We have observed that the HPC workloads we selected are more sensitive to changes in memory bandwidth, Enterprise-class workloads are more sensitive to changes in memory latency, and Big Data workloads fall in between, with sensitivity to latency only in configurations where sufficient bandwidth is provided. We plan to publish these results in the near future. The ability to derive these model parameters from measured data and apply them to hypothetical systems is exactly what is required to evaluate the insertion of new memory technologies into system designs.

6. REFERENCES
[1] A. Makarov et al. "Emerging memory technologies: Trends, challenges, and modeling methods," Microelectronics Reliability, vol. 52, no. 4, pp. 628-634, 2012.
[2] D. Sorin et al. "Analytic evaluation of shared-memory systems with ILP processors," 25th ISCA, pp. 380-391, 1998.
[3] B. Fields et al. "Using interaction costs for microarchitectural bottleneck analysis," 36th MICRO, pp. 228-239, 2003.
[4] P. Emma. "Understanding some simple processor-performance limits," IBM J. Res. Dev., vol. 41, no. 3, pp. 215-232, 1997.
[5] Y. Chou et al. "Microarchitecture optimizations for exploiting memory-level parallelism," 31st ISCA, pp. 76-87, 2004.
[6] T. Karkhanis et al. "A first-order superscalar processor model," 31st ISCA, pp. 338-349, 2004.