Predictable Fine-Grained Cache Behavior for Enhanced Simultaneous Multithreading (SMT) Scheduling
Joshua L. KIHM, Andrew W. JANISZEWSKI, and Daniel A. CONNORS
Department of Electrical and Computer Engineering
University of Colorado at Boulder
UCB 425, Boulder, CO 80309, USA
{kihm,janiszew,dconnors}@colorado.edu
ABSTRACT
By converting thread-level parallelism to instruction-level parallelism, simultaneous multithreaded (SMT) processors are emerging as effective ways to utilize the resources of modern superscalar architectures. However, the full potential of SMT has not yet been reached, as most modern operating systems use existing single-thread or multiprocessor algorithms to schedule threads, neglecting contention for resources between threads. To date, even the best SMT scheduling algorithms simply try to group threads for co-residency based on each thread's expected resource utilization, but do not take into account variance in thread behavior over time. We therefore introduce architectural support that enables new thread scheduling algorithms to group threads for co-residency based on fine-grain memory system activity information. The proposed memory monitoring framework centers on the concept of an activity vector, which allows an operating system scheduler to estimate and dynamically predict cache utilization and inter-thread contention. This technique accurately and dynamically tracks program phase behavior, and the activity it records can be predicted with very little information and a simple autoregressive model, making it ideal for scheduling decisions.
Keywords: Simultaneous Multithreading, Program Phase Behavior, Autoregressive Modeling, Linear Multi-Step Modeling
1. INTRODUCTION
Multithreading is emerging as the leading cost-effective architectural model for achieving greater performance on scientific and commercial workloads. By improving the utilization of architectural resources, multithreaded architectures maximize on-chip hardware parallelism, exploiting independent thread-level parallelism rather than the instruction-level parallelism of a single thread. During periods when individual threads would otherwise stall and waste the capacity of a large, dedicated single-threaded processor, multithreaded architectures improve execution throughput by devoting resources to other active threads. The full potential of multithreaded architectures has not yet been reached, since most modern operating systems use existing single-thread or multiprocessor algorithms to schedule threads, neglecting contention for resources shared between
simultaneously executing threads. Even the best current multithreaded scheduling algorithms simply try to group threads for co-residency based on each thread's expected resource utilization, but do not take into account variance in thread behavior. As such, we propose architectural support that enables new operating system thread scheduling algorithms to group threads for co-residency based on predictions of fine-grain memory system activity. The proposed framework centers on allowing the operating system scheduler to estimate and dynamically predict cache utilization and inter-thread contention.

To make effective multithreaded scheduling decisions, a thread scheduler must be able to predict the behavior of a thread in the upcoming scheduling period in order to characterize how it will interact with the other resident threads. Since programs execute in phases, program behavior can change drastically over the course of execution. One obvious solution is to characterize each phase and predict which phase or phases will occur in the next interval. The primary obstacle to predicting phase behavior is that the hardware and software mechanisms needed to fully characterize and predict phase behavior in dynamic execution are fairly complex and time consuming. Nevertheless, Sherwood et al. [4] recently proposed accurate, efficient, and cost-effective mechanisms for characterizing and predicting phase behavior by tracking basic block behavior. As more phase-tracking technologies emerge, it will be critical to exploit run-time execution characteristics in modern operating systems. More importantly, it will be important to leverage such phase characterization and prediction techniques to detect phase-based thread interaction behavior and to apply the results directly to adaptive thread scheduling. As such, we investigate phase-tracking methods that characterize and predict phases to assist thread scheduling, specifically techniques that anticipate contention in cache systems. We present data illustrating that phases of execution are the primary characteristic to exploit in scheduling threads on a multithreaded architecture.

Experimental studies [5] have recognized that in a multithreaded architecture, the choice of which threads to schedule together has a major effect on overall system performance. Because threads compete for finite processor resources, inter-thread conflicts occur. This conflict for resources can weigh heavily on the advantages of multithreading and reduce its effectiveness. The key to effective scheduling in a multithreaded system is determining which threads can be scheduled together so as to minimize inter-thread conflict and maximize overall performance. Early techniques [7] [5] for thread scheduling that take thread interference into account tried to find static combinations of threads with the best overall symbiosis. Although this work was monumental in recognizing inter-thread interference, it leaves several areas for improvement. First, the number of possible combinations grows with the factorial of the number of available threads and processor contexts, making it difficult to evaluate a meaningful subset of combinations. Second, these techniques treat program behavior as monolithic, even though past program behavior is often a poor indicator of future performance. Van Biesbrouck et al. [1] demonstrated the importance of these observations by showing that performance is dictated not only by which threads are scheduled together, but also by which phase each of those threads is in. Thus it is critical to develop dynamic thread scheduling techniques that can adapt to program phase changes. Ideally, the thread scheduler should predict future program phases as well as monitor current ones, since its goal is to maximize program performance during the next scheduling period, not simply to determine what would have worked best in the past.

Although informed thread scheduling can eliminate significant amounts of inter-thread conflict, there are additional opportunities that require capturing periodic and predictable elements of program behavior. Virtually all programs go through a series of stages during execution. A stage is characterized by changes in the execution properties of the code, the data, or both. The stages of a particular program depend on the problem domain and the implementation. Similarly, programs such as compilers, interpreters, and graphics engines exhibit phase behavior, having different modes of operation for different inputs. The use of profile information by compiler-directed mechanisms can hide opportunities, since profile-guided decisions may not be representative of all program workloads that an operating system observes. For these reasons, it is imperative for future operating systems to adapt scheduling techniques to variations in program behavior.

To increase cache system effectiveness for multithreaded architectures, we investigate run-time guided thread management, in which thread symbiosis is accurately estimated by hardware support. Our scheme seeks to aid thread scheduling by gathering the cache activity patterns of each active thread. We introduce the concept of an activity vector, a collection of event counters for a set of contiguous cache blocks. We present an analysis of these vectors and find substantial execution regularity and predictability that can be exploited by thread schedulers to accurately predict cache usage and make intelligent scheduling decisions using simple, low-order predictive models.
2. MOTIVATION
In SMT systems, contention for cache resources is the primary impediment to the delivery of higher performance, since thread competition for shared cache resources results in severe memory access penalties. Hily and Seznec [2] evaluated the performance impact of second-level caches and studied contention in the second-level cache of SMT architectures. Their work and similar studies have revealed that running with more than four active hardware contexts leads to a tapering off of SMT performance. Unlike intra-thread cache contention, eliminating inter-thread conflict penalties is difficult because the potential sequence of events is non-repeatable and hard to predict. Nevertheless, there exist methods of run-time guided thread management in which thread symbiosis is artificially generated by strategic thread pairing at run-time in simultaneously multithreaded architectures. The performance improvement comes from reducing the number of cache lines evicted from each thread by the others.
In examining the run-time behavior of simultaneous threads in a multithreaded architecture, the inefficiencies of modern cache hierarchies are clearly visible in thread accesses contending for neighboring cache resources. By grouping cache access information by cache region, more detailed information about resource requirements can be gained. The tracking of cache region activity is illustrated in Figure 1 through a set of heat maps showing the temporal and spatial cache use behavior of a set of SPEC 2000 benchmarks. The x-axis of each graph is time, measured in samples of five million clock cycles. Along the y-axis is cache position (a grouping of cache sets into 32 super sets). Each cache super set has a corresponding activity register which counts the activity for that set of cache lines; the data address bits that correspond to the cache set index are used to look up the activity counter associated with a cache super set. In analyzing the run-time memory access patterns of threads, experimental observations indicate that spatial behavior often occurs in distinct phases. Repetition in the usage of the cache super sets over an interval of time represents an opportunity to eliminate inter-thread conflict. An example of reuse interval behavior can be seen in the 188.ammp application, which shows distinct intervals of high, low, and intermediate usage across all of the cache sets. Clearly, these figures show exploitable cache behaviors.
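To make the bookkeeping concrete, the following sketch shows how a data address could be folded into a super-set activity counter. The cache geometry (4096 sets, 64-byte lines) and all identifiers are illustrative assumptions, not the simulated configuration:

# Sketch of the activity-vector lookup described above. The geometry
# below (4096 sets, 64-byte lines) is assumed for illustration only.
NUM_SETS = 4096                         # total cache sets (assumed)
NUM_SUPER_SETS = 32                     # super sets, as in Figure 1
SETS_PER_SUPER = NUM_SETS // NUM_SUPER_SETS
LINE_BYTES = 64                         # cache line size (assumed)

activity_vector = [0] * NUM_SUPER_SETS  # one activity register per super set

def record_access(addr: int) -> None:
    """Count an access against the super set this address maps to."""
    set_index = (addr // LINE_BYTES) % NUM_SETS   # cache set index bits
    activity_vector[set_index // SETS_PER_SUPER] += 1

In hardware, the division and modulus reduce to selecting the set-index bits of the address, so the counter update is a simple indexed increment.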
Figure 1: Map of level 3 cache usage over time for 188.ammp and 253.perlbmk (x-axis: sample number in five-million-cycle intervals; y-axis: cache super set number; shading: activity count).
3. RESULTS

Frequency Analysis
Simple observation of cache behavior maps, as in Figure 1, reveals a level of periodic behavior. To investigate this behavior further, a two-dimensional discrete Fourier transform was performed on the data from each benchmark. Example results for two benchmarks' level 3 cache usage are shown in Figure 2. In the figure, the x-axis represents normalized frequency in time. The normalization means that the sample time in longer-running benchmarks was made longer so that the total number of points was kept constant: for 181.mcf the sample size is 32 million operations, and for 183.equake it is four million operations. The y-axis represents spatial frequency across the cache, that is, per super set. For the benchmark 181.mcf, there is a strong zero-frequency component along both the x- and y-axes. This indicates a large constant component of the usage, both in time and location. The other interesting aspect of this graph is the vertical lines at frequencies of approximately +0.333 and -0.333, which indicate a strong periodicity of every three samples, or 96 million operations. Like 181.mcf, 183.equake shows a strong zero-frequency component in space, meaning use is spread fairly evenly across the cache. In contrast to 181.mcf, however, 183.equake has only a weak constant temporal component, with much of the power spread out in the high-frequency regions. This indicates a great deal of variation between consecutive samples. The vertical lines across the graph indicate strong periodicity with a period of approximately 16 samples, or 64 million operations. The other vertical lines in this graph are most likely harmonics of this main tone, as they are equally spaced along the spectrum. The longer-period components are of more interest for the purpose of OS-level thread scheduling, as they are produced by large-scale program phases. In the absence of events such as interrupts and page faults, a modern OS like Linux makes a scheduling decision approximately every 50 ms. On a 3 GHz processor, this is equivalent to 150 million clock cycles, so even at very low IPC, only the longest-period components are of interest for scheduling. Although not all of these components are visible in Figure 2, at longer sampling periods of once or a few times per scheduling period these components are more apparent. Since we are not sampling a continuous signal but rather measuring total use over the sampling period, using a longer sampling period averages out higher-frequency components. Additionally, a longer sampling period increases the precision of the frequency estimate, as the two are inversely proportional for a constant number of samples.
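The analysis just described amounts to the following numpy computation; the function name and the activity-matrix layout (rows are super sets, columns are time samples) are our assumptions:

import numpy as np

def activity_spectrum(activity: np.ndarray) -> np.ndarray:
    """Two-dimensional DFT magnitude of an activity map.

    activity[s, t] holds the counter value of super set s in sample t.
    fftshift centers zero frequency, matching the layout of Figure 2
    (temporal frequency on the x-axis, spatial frequency on the y-axis).
    """
    return np.abs(np.fft.fftshift(np.fft.fft2(activity)))

# A strong vertical line at temporal frequency of about +/-0.333, as
# seen for 181.mcf, corresponds to a periodicity of three samples.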
Figure 2: Frequency-domain usage of the level 3 cache for 181.mcf and 183.equake (x-axis: temporal frequency; y-axis: spatial frequency).
Autoregressive (AR) Prediction
Aside from being interesting in their own right, the existence of these periodicities, especially in time, indicates that cache usage should be predictable from previous behavior if enough information about that behavior is known. One common method used in signal processing for predicting values of noisy signals is autoregressive (AR) modeling. A thorough discussion of AR models can be found in [3], which complements the brief discussion that follows. Essentially, the predicted next value in a series is a weighted sum of some number of the previous, known entries of the series. The number of previous entries used to make the prediction is the order of the predictor. The weights, or coefficients, of the sum are determined by solving the Yule-Walker equations with the Levinson-Durbin algorithm. The advantage of the Levinson-Durbin algorithm for this investigation is that in the process of determining the coefficients of a higher-order predictor, the coefficients of all lower-order predictors are also produced, which greatly facilitates comparison of predictors of different orders. The major computation which must be undertaken before using the AR model is the computation of the covariance of the series. The major assumption made in this investigation is that the cache usage is wide-sense stationary, that is, that the covariance is constant over time. Additionally, the covariance was computed using all available data, not just the data before a given prediction, which would not be possible in a real system. This is justified because the length of use history available in an actual system would be much greater than that available in simulation; the caveat is that a certain warm-up time would be needed in a real system for the estimated covariance to approach its actual value.
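For concreteness, the following sketch estimates the autocovariance and runs the Levinson-Durbin recursion over it, yielding the coefficients of every predictor order up to the maximum in a single pass, which is the property exploited above. The function names are ours; the wide-sense-stationarity assumption is as stated in the text:

import numpy as np

def autocovariance(x: np.ndarray, max_lag: int) -> np.ndarray:
    """Biased autocovariance estimate of a mean-removed series."""
    x = x - x.mean()
    n = len(x)
    return np.array([np.dot(x[:n - lag], x[lag:]) / n
                     for lag in range(max_lag + 1)])

def levinson_durbin(r: np.ndarray, max_order: int) -> list:
    """Solve the Yule-Walker equations by the Levinson-Durbin recursion.

    Returns a list whose (p-1)-th entry holds the order-p coefficients,
    so every lower-order predictor falls out of the same computation.
    """
    a = np.zeros(max_order + 1)   # a[1..p] are the current coefficients
    err = r[0]                    # prediction error power
    coeffs_by_order = []
    for p in range(1, max_order + 1):
        k = (r[p] - np.dot(a[1:p], r[p - 1:0:-1])) / err  # reflection coeff.
        a_prev = a.copy()
        a[p] = k
        a[1:p] = a_prev[1:p] - k * a_prev[p - 1:0:-1]
        err *= 1.0 - k * k
        coeffs_by_order.append(a[1:p + 1].copy())
    return coeffs_by_order

def predict_next(history: np.ndarray, coeffs: np.ndarray) -> float:
    """One-step AR prediction: a weighted sum of the previous entries."""
    order = len(coeffs)
    return float(np.dot(coeffs, history[-1:-order - 1:-1]))

An order-5 predictor, for example, uses only the last five counter values of a super set, which is consistent with the small amount of history such a scheme would need in practice.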
Figure 3 illustrates the percent error versus predictor order in both absolute and root-mean-square (RMS) terms. The Yule-Walker equations minimize RMS error, which is why that error is so low. The values presented are the average across nine SPEC2000 benchmarks. The predictions are made for each super set in the cache individually, between samples of one million operations, and then averaged together. The most interesting facet of this graph is that the error drops significantly up to an order of about 5 but then generally levels off all the way out to order 250, the highest order tested. Similar results occurred in all caches and for both cache accesses and misses. This illustrates that even with a very low-order predictor it is possible to make very accurate predictions.
Figure 4 takes the concept one step further. Instead of looking at the actual counts, each value is assigned one of a small number of levels. In the two-level case, if the use was above a threshold the value was given a 1, and if not it was given a 0. The prediction is then made based solely on that quantized data. What was measured is the percentage of the time in which the predicted quantized value differed from the actual value that occurred. Again averaging over all benchmarks, the values were very promising: accuracies of nearly 98% were achieved in four-level tests and over 99% in two-level tests. As in the percent-error case, a sharp drop in wrong predictions occurs at a prediction order of about 5, with a general leveling after that. This means that only a very small amount of data would be needed to reliably predict cache usage with a very high degree of accuracy.
Figure 4: Percent of incorrect predictions of use levels of the level 3 cache.
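A sketch of the quantized-prediction measurement, under our reading that the AR predictor is applied to the quantized series and its output rounded to the nearest level (the rounding rule is our assumption):

import numpy as np

def quantize(values: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    # One threshold yields the two-level case; three thresholds yield
    # the four-level case described above.
    return np.searchsorted(thresholds, values, side='right')

def wrong_prediction_rate(counts: np.ndarray, coeffs: np.ndarray,
                          thresholds: np.ndarray) -> float:
    # Fraction of samples whose predicted use level differs from the
    # level that actually occurred (the metric plotted in Figure 4).
    order = len(coeffs)
    levels = quantize(counts, thresholds).astype(float)
    wrong = 0
    for t in range(order, len(levels)):
        history = levels[t - order:t][::-1]        # most recent sample first
        predicted = float(np.dot(coeffs, history))
        level = int(np.clip(round(predicted), 0, len(thresholds)))
        wrong += int(level != int(levels[t]))
    return wrong / (len(levels) - order)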
Linear Multi-Step (LMS) Prediction
Linear multi-step (LMS) methods seem well suited to the prediction of performance within the theoretical and practical constraints of a real system. Using only a small amount of past data, along with past (and potentially current) derivative values, inferences can be made which are as accurate as the data set permits. The general form of an LMS method is

\sum_{i=0}^{k} \alpha_i y_{n-i} + h \sum_{i=0}^{k} \beta_i \dot{y}_{n-i} = 0        (1)
By solving for y_n, a predicting formula based only on past values and the current derivative is obtained. In this work we consider only what are known as explicit methods, with β_0 = 0, to avoid any dependence on current or future data. Predicting ẏ_n with any accuracy is computationally prohibitive, so implicit methods are not considered.
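Concretely, normalizing α_0 = 1 (a convention we assume; the paper does not state its normalization) and setting β_0 = 0, Eq. (1) yields the explicit predictor

y_n = -\sum_{i=1}^{k} \alpha_i y_{n-i} - h \sum_{i=1}^{k} \beta_i \dot{y}_{n-i}

so the next value is computed entirely from already-observed samples and their approximated derivatives.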
Figure 3: Percent error in predicted usage of the level 2 cache (absolute and RMS error versus AR predictor order).
Explicit LMS methods are defined by varying α_i and β_i for 1 ≤ i ≤ k. In theory, LMS predictors will improve as the order increases and, to a lesser extent, as k increases. A full discussion of LMS methods, including the formula for computing the order of an LMS method, can be found elsewhere [6].
Methodology
Two series of LMS methods were tested, observing mean absolute error, RMS error, and performance on 2- and 4-level counters. The cutoff values for these counters were determined by the median over the nine SPEC2000 benchmarks observed. In this simulation, the time step h is set to 1. Derivatives are approximated as the average of the differences between the current point and the previous three points.
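Putting the pieces together, the following is a sketch of the explicit predictor under the methodology above (h = 1, derivatives from the previous three differences). The coefficient sign convention follows the solved form of Eq. (1), and the exact derivative formula is one plausible reading of the text:

import numpy as np

def lms_predict(y: np.ndarray, alpha: np.ndarray, beta: np.ndarray,
                h: float = 1.0) -> float:
    """Explicit LMS prediction of the next sample from history y.

    alpha[i-1] and beta[i-1] weight y_{n-i} and its derivative for
    i = 1..k (alpha_0 is normalized to 1 and beta_0 = 0, so neither
    appears). Requires len(y) >= k + 3 for the derivative estimates.
    """
    k = len(alpha)

    def deriv(m: int) -> float:
        # Average of the differences between point m and the previous
        # three points, each scaled to a per-step slope (assumed form).
        return sum((y[m] - y[m - j]) / j for j in range(1, 4)) / 3.0

    acc = 0.0
    for i in range(1, k + 1):
        acc -= alpha[i - 1] * y[-i]            # -alpha_i * y_{n-i}
        acc -= h * beta[i - 1] * deriv(-i)     # -h * beta_i * dy_{n-i}
    return acc

# The zeroth-order predictor y_n = y_{n-1} is alpha=[-1.0], beta=[0.0];
# Forward Euler is alpha=[-1.0], beta=[-1.0].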
Varying Order
Initially, the methods tested were a simple zeroth-order predictor (y_n = y_{n-1}), a first-order method (known as Forward Euler), a second-order predictor (an Adams-Bashforth method), and a third-order method. Results, seen in Figure 5, show that the higher-order methods perform worse in terms of overall prediction accuracy. Analysis of these results suggests that the reason higher-order methods fail is tied to their increasing dependence on derivatives, which leads to severe over-corrections in what is relatively invariant data.
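For reference, the textbook forms of these methods, written in the backward-indexed notation of Eq. (1); the paper does not spell them out, so we assume the standard Adams-Bashforth family [6] for the second- and third-order entries:

y_n = y_{n-1}                                                          (zeroth order)
y_n = y_{n-1} + h\,\dot{y}_{n-1}                                       (Forward Euler)
y_n = y_{n-1} + h\left(\tfrac{3}{2}\dot{y}_{n-1} - \tfrac{1}{2}\dot{y}_{n-2}\right)    (second-order Adams-Bashforth)
y_n = y_{n-1} + \tfrac{h}{12}\left(23\,\dot{y}_{n-1} - 16\,\dot{y}_{n-2} + 5\,\dot{y}_{n-3}\right)    (third order)

The growing derivative weights in the higher-order formulas make the over-correction behavior noted above plausible: each added term amplifies noise in the approximated derivatives.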
Figure 5: Percent predictability for higher-order LMS models.
Zero Order, Varying k
A second array of tests was run on a number of zeroth-order methods which look progressively further back in their calculation of a predicted y_n. The chosen methods simply take the average of the visible historical values, with windows varying from one to five samples wide. In addition, two extra tests with k = 5 were run which also include a weakly weighted derivative of the nearest points, to allow observed trends to influence predicted behavior. Results show that prediction accuracy increases with k. Additionally, the k = 5 methods show deterioration in performance as the derivative weight given by β_i increases. More extensive testing is warranted, but results strongly suggest that any non-zero β_i is detrimental to performance. Figure 6 shows the performance for both 2- and 4-level counters over these methods.

Figure 6: Percent predictability for simple LMS models.
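A sketch of these zeroth-order predictors follows; the exact form of the derivative term is our assumption, with the 0.10 and 0.15 weights taken from the legend of Figure 6:

import numpy as np

def window_average_predict(y: np.ndarray, window: int,
                           deriv_weight: float = 0.0) -> float:
    # Zeroth-order prediction: average of the last `window` samples,
    # optionally nudged by a weakly weighted derivative of the nearest
    # points so that observed trends influence the prediction.
    prediction = float(np.mean(y[-window:]))
    if deriv_weight:
        prediction += deriv_weight * float(y[-1] - y[-2])
    return prediction

# The best-performing configuration reported above: a five-sample
# window with no derivative term.
# next_value = window_average_predict(trace, window=5)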
4. CONCLUSIONS
Simultaneous multithreading offers great promise for improved processor throughput, but only if effective scheduling methods can be developed. In any modern system, minimizing cache contention is one necessary step in that process, and activity vectors offer an effective method to achieve that goal. It is vitally important that any SMT scheduling routine be able to accurately predict the behavior of threads before they are co-scheduled. We have shown that activity vectors can easily be predicted using simple, low-order methods, and as such they offer a powerful tool for SMT scheduling.
5. REFERENCES
[1] M. Van Biesbrouck, T. Sherwood, and B. Calder. A co-phase matrix to guide simultaneous multithreading simulation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2004.
[2] S. Hily and A. Seznec. Contention on 2nd level cache may limit the effectiveness of simultaneous multithreading. Technical Report PI-1086, IRISA, 1997.
[3] B. Porat. A Course in Digital Signal Processing. John Wiley and Sons, Inc., New York, 1997.
[4] T. Sherwood, S. Sair, and B. Calder. Phase tracking and prediction. In Proceedings of the 17th Annual International Conference on Supercomputing. ACM Press, 2003.
[5] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreading processor. In Architectural Support for Programming Languages and Operating Systems, pages 234-244, 2000.
[6] J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer-Verlag, New York and Berlin, 1980.
[7] D. M. Tullsen, J. L. Lo, S. J. Eggers, and H. M. Levy. Supporting fine-grained synchronization on a simultaneous multithreading processor. In International Symposium on Architectural Support for Programming Languages and Operating Systems, pages 54-58, 2000.