Identifying and Exploiting Memory Access Characteristics for Prefetching Linked Data Structures

by

Hassan Fakhri Al-Sukhni

B.S., Jordan University of Science and Technology, 1989
M.S., King Fahd University of Petroleum and Minerals, 1993
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Department of Electrical and Computer Engineering

2006
This thesis entitled:
Identifying and Exploiting Memory Access Characteristics for Prefetching Linked Data Structures
written by Hassan Fakhri Al-Sukhni
has been approved for the Department of Electrical and Computer Engineering
Daniel Alexander Connors
Andrew Pleszkun
James C. Holt
Date
The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.
Al-Sukhni, Hassan Fakhri (Ph.D., Computer Engineering)
Identifying and Exploiting Memory Access Characteristics for Prefetching Linked Data Structures
Thesis directed by Assistant Professor Daniel Alexander Connors
Data prefetching is one paradigm to hide the memory latency in modern computer systems. Prefetch completeness requires that a prefetching mechanism achieves high coverage of the program's would-be misses with high prefetching accuracy and timeliness. Achieving prefetch completeness has been successful for memory accesses associated with regular data structures like arrays because of the spatial regularity of these memory accesses. Modern applications that use Linked Data Structures (LDS) exhibit a lesser degree of spatial regularity in their memory accesses. Thus, other characteristics of the memory accesses associated with LDS have been exploited by previous prefetching mechanisms. Unfortunately, limited success has been reported for prefetching LDS; therefore, a significant opportunity remains to improve the performance of modern computer systems by improving prefetch completeness for LDS memory accesses.

This dissertation proposes a coordinated approach consisting of three components to improve prefetch completeness for LDS. These components are: 1) a rigorous approach that offers metrics to quantify the exploitable characteristics of the memory accesses, 2) a coordinated software and hardware approach that benefits from the static characteristics facilitated by the global view of the compiler combined with the dynamic characteristics accessible via profiling and runtime monitoring, and 3) simultaneous coordination of several mechanisms that exploit different characteristics of the LDS memory accesses. The proposed coordinated approach is illustrated in this dissertation by
extending the understanding of three exploitable characteristics of LDS memory accesses: spatial regularity, temporal regularity, and topology. Metrics are offered to enable the identification of these characteristics, and prefetching mechanisms to exploit them are designed in a coordinated fashion to benefit from the offered metrics. Simulation results indicate that the proposed approach can improve prefetch completeness by improving prefetch coverage for would-be misses of the program memory accesses using accurate and timely prefetches.
Dedication
To all of the fluffy kitties.
Acknowledgements
Here’s where you acknowledge folks who helped.
Contents

Chapter

1 Introduction   1
  1.1 Contributions   5
  1.2 Organization   8

2 Exploiting Spatial Regularity Using Extrinsic Stream Metrics   10
  2.1 Introduction   10
  2.2 Characterizing Regular Streams   12
      2.2.1 Intrinsic Stream Characteristics   13
      2.2.2 Extrinsic Stream Characteristics   14
      2.2.3 Stream Affinity and Short Streams Exploitation   15
      2.2.4 Stream Density and Prefetch Coverage   16
      2.2.5 Measuring Stream Metrics   18
  2.3 Run-time Exploitation of Stream Metrics   21
      2.3.1 Stream Detection and Allocation Filtering   23
      2.3.2 Stream Prioritization and Thrashing Control Using Stream Density   23
      2.3.3 Exploiting Short Streams Using Stream Affinity   25
      2.3.4 Using Intrinsic and Extrinsic Metrics Together: PAD   26
      2.3.5 Controlling Accuracy Using PAD and Stream Length   27
      2.3.6 Controlling Timeliness Using PAD and Stream Density   28
  2.4 Experimental Evaluations   29
      2.4.1 Methodology   29
      2.4.2 Results   30
  2.5 Conclusions   37

3 Stream De-Aliasing: The Design of Cost-Effective Stride-Prefetching for Modern Processors   39
  3.1 Introduction   39
  3.2 Approach   42
      3.2.1 Stream Identification   42
      3.2.2 Feedback Management   43
  3.3 Experimental Evaluation   46
      3.3.1 Methodology   46
      3.3.2 Results   47
  3.4 Conclusions   53

4 A Stitch In Time: Identifying and Exploiting Temporal Regularity   54
  4.1 Introduction   54
  4.2 LDS Prefetching Limitations   58
      4.2.1 Context-Based Prefetching Limitations   58
      4.2.2 Content Directed Prefetching Limitations   60
  4.3 Characterizing Temporal Regularity in the Memory Accesses   63
  4.4 Stitch Cache Prefetching   67
      4.4.1 Identifying Recurrent Loads Using Temporal Regularity Metrics   67
      4.4.2 Stitch Pointers Installation   69
      4.4.3 Using Stitch Pointers   71
      4.4.4 Controlling Prefetch Timeliness   71
      4.4.5 Exploiting Content Based Prefetching   73
      4.4.6 Overcoming the Limitations of Context-Based Prefetching   73
      4.4.7 Overcoming the Limitations of Content-Based Prefetching   74
  4.5 Experimental Evaluation   76
      4.5.1 Methodology   76
      4.5.2 Results   77
  4.6 Conclusions   85

5 Exploiting the Topology of Linked Data Structures for Prefetching   87
  5.1 Introduction   87
  5.2 Motivation and Background   89
      5.2.1 Dynamic Data Structure Analysis   90
      5.2.2 Memory System Evaluation   95
  5.3 Compiler-Directed Content-Aware Prefetching   98
      5.3.1 Compiler-Directed Content-Aware Prefetching Directives   100
      5.3.2 Linked Data Structure Example   104
  5.4 Experimental Evaluation   107
      5.4.1 Methodology   107
      5.4.2 Results and Analysis   108
  5.5 Conclusions   113

6 Conclusions   116
  6.1 Summary   116
  6.2 Future Work   120

Bibliography   122
Tables

Table
3.1 Estimated structural requirements to use full program counter for prefetching   40
4.1 Simulated Benchmarks   76
5.1 Percentage of dynamic execution of linked data structure access types   92
Figures

Figure
2.1  Prefetch opportunity of spatially regular accesses in SPEC-INT   11
2.2  Stream Affinity Illustration   15
2.3  Stream Density Illustration   18
2.4  Stream Density Histogram   19
2.5  Percentage of streams with α > 0.5, w = 200   21
2.6  A Finite State Machine That Controls The PEs   22
2.7  IPC Speedup of PAT over Stream Buffers   32
2.8  Prefetch Coverage   33
2.9  Prefetch Accuracy   34
2.10 Prefetch Timeliness   36
2.11 L1 Cast-outs   37
3.1  Forming Stream ID in Load Attribute Prefetching (LAP)   43
3.2  Prefetch moving window illustration   45
3.3  Performance of PC bits   47
3.4  Accuracy of PC bits   48
3.5  Performance gain   49
3.6  Prefetching Accuracy   50
3.7  Prefetch timeliness   51
3.8  Prefetching into the L1 cache   52
4.1  Content-Directed Prefetching   61
4.2  A sequence with temporal regularity   63
4.3  Markov model conditional entropy for the reference sequence of Figure 4.2   65
4.4  Stitch Cache Prefetching   70
4.5  Controlling prefetch timeliness   72
4.6  Average conditional entropy and variability for the top ten executed load instructions   78
4.7  Percentage of stitch pointers updates to load accesses   79
4.8  Utilization of Stitch Pointers   79
4.9  Prefetch coverage   80
4.10 Percentage of used prefetches from each mechanism in SCP   81
4.11 Prefetch Overall Accuracy   81
4.12 Prefetch accuracy of each prefetch type in SCP   82
4.13 Prefetch Timeliness of CDP and SCP   83
4.14 Percentage improvement in IPC   84
5.1  Basic example of four linked data structure types: traversal, pointer, direct, and indirect   91
5.2  Dynamic distribution of linked data structure access types   94
5.3  Loads miss rates at L1 and L2 caches   97
5.4  Content-Aware Prefetching Memory System   101
5.5  Compiler-directed prefetch harbinger instruction   101
5.6  Source and low-level code example of linked data structure in Olden's health   104
5.7  Memory accesses and prefetching commands for CDCAP on Olden's health   105
5.8  Prefetching accuracy   109
5.9  Timeliness of prefetching   110
5.10 Normalized bus blocking with prefetching   112
5.11 Normalized loads miss rates for L2 cache   113
5.12 Normalized cycles of processor waiting for load values   114
5.13 Normalized execution times   115
Chapter 1 Introduction
Over the last decade, significant advancements in the semiconductor industry have allowed for an unprecedented increase in microprocessor performance. Continued exploitation of instruction level parallelism, longer pipelines, and faster clocks are just a few techniques that have led to increased performance. Unfortunately, innovations in memory system design and technology have been unable to achieve the same rate of improvement in memory speeds. As a result, an increasing percentage of the total execution time of modern computer systems is spent waiting on the memory system. It is estimated that nearly 40% of execution time is spent stalled waiting for both instruction and data cache misses [5].

Cache organizations have been used successfully to reduce the perceived latency of the memory subsystem in modern computer systems [46]. Caches exploit the locality of reference concepts known as spatial and temporal locality [23]. Unfortunately, the larger the cache is, the slower it becomes, which limits the benefits of cache-based solutions. This situation is exacerbated by the increased size of both programs and their data sets.

Prefetching is another appealing technique for overcoming the memory latency problem in modern processors [41]. Numerous prefetching approaches have been proposed, including hardware-based approaches [38, 26, 11, 27], as well as software-based approaches [30, 31, 7]. Hardware-based prefetching approaches
employ specialized hardware to monitor memory accesses at run-time and to predict future memory accesses such that they can be prefetched well in advance of when they are needed by software executing on the processor. Software-based approaches use tools such as compilers and code profilers to instrument code so that data for future memory accesses can be prefetched using software mechanisms at run-time.

A complete prefetching mechanism needs to meet three strict requirements: coverage, timeliness, and accuracy [25]. Coverage measures the ability to correctly predict and prefetch future memory accesses. It is the ratio of misses hidden by prefetching to the overall number of misses without prefetching. Timeliness requires that prefetches be launched with sufficient lead time to hide the memory latency. Timeliness for a prefetch is the ratio of the memory latency hidden by the prefetch to the overall memory latency. Accuracy is the ratio of prefetches that were used by the program to the total prefetches generated. Timeliness and coverage are correlated, since prefetches that are late do not contribute toward coverage. Similarly, coverage and accuracy are correlated, because inaccurate prefetches will not cover program misses.

Prefetching spatially regular data structures (e.g., data structures that exhibit simple mathematical relationships in their load addresses, like arrays) has been particularly successful using stride-based techniques [4, 20, 38, 27, 24, 11, 26]. These techniques satisfy the prefetching completeness requirements by effectively exploiting spatial regularity to accurately predict future accesses well ahead of the program. While many programs make use of such spatially regular data structures, the use of Linked Data Structures (LDS) is pervasive in modern software (e.g., linked lists, trees, hashes, etc.). Integer applications that use LDS demonstrate fragmented short spatially regular patterns in their memory accesses, and previous stride-prefetching mechanisms are not designed to account for such patterns.
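To make the three requirements concrete, the following C sketch computes coverage, accuracy, and per-prefetch timeliness from simple event counters; the counter names and sample values are illustrative and are not taken from any particular simulator.

    #include <stdio.h>

    /* Illustrative counters gathered during a simulation run. */
    struct prefetch_stats {
        unsigned long misses_without_prefetching; /* baseline would-be misses        */
        unsigned long misses_hidden;              /* would-be misses covered         */
        unsigned long prefetches_issued;          /* total prefetches generated      */
        unsigned long prefetches_used;            /* prefetches consumed by program  */
        double latency_hidden;                    /* cycles of latency hidden by one prefetch */
        double full_latency;                      /* full memory latency in cycles   */
    };

    int main(void) {
        struct prefetch_stats s = { 1000, 600, 900, 720, 150.0, 200.0 };

        double coverage   = (double)s.misses_hidden   / s.misses_without_prefetching;
        double accuracy   = (double)s.prefetches_used / s.prefetches_issued;
        double timeliness = s.latency_hidden / s.full_latency;   /* per-prefetch ratio */

        printf("coverage=%.2f accuracy=%.2f timeliness=%.2f\n",
               coverage, accuracy, timeliness);
        return 0;
    }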
Prefetching techniques for LDS have been proposed [18, 25, 17, 42, 35], but the results have only been marginally successful [15]. Therefore, significant opportunities remain to hide the memory latency associated with LDS.

Prefetching techniques for LDS can be broadly classified into three categories: context-based techniques, content-based techniques, and precomputation techniques. Context-based techniques use correlations amongst the memory accesses to predict future memory references. Content-based techniques use the content of the accessed data to make their predictions. Precomputation techniques run a slice of the program to generate memory addresses that are used for prefetching.

Context-based prefetching techniques exploit temporal regularity in LDS accesses [15] by finding correlations amongst repeatedly accessed addresses, and then using these correlations to initiate prefetches (temporal regularity is discussed in Chapter 4). These correlations enable context-based techniques to launch timely prefetches ahead of the program. Unfortunately, context-based prefetching has several limitations that reduce its coverage, including limited capacity, learning time, and excessive overhead. A detailed discussion of these limitations follows in Chapter 4.

Content-based prefetching uses stateless systems that prefetch the connected data objects of LDS by discovering pointers in the data contained within cache lines as they are filled. Content-based prefetching can achieve good coverage by overcoming the limitations of context-based prefetching; however, its timeliness is limited, since the pointers contained in the accessed data usually refer to data that will soon be needed by the program. Other limitations of content-based prefetching include the inability to recognize traversal orders of multiple potential paths, and the inability to connect and prefetch isolated LDS. Chapter 4 discusses these limitations further.
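As a rough illustration of the content-based idea, the C sketch below scans a newly filled cache line for values that fall inside the heap and hands them to a prefetch engine; the heap-range test and the issue_prefetch() hook are hypothetical simplifications, not the mechanism of any specific proposal.

    #include <stdint.h>
    #include <stddef.h>

    #define LINE_BYTES 64
    #define WORDS_PER_LINE (LINE_BYTES / sizeof(uintptr_t))

    /* Hypothetical hook that hands a candidate address to the prefetch engine. */
    void issue_prefetch(uintptr_t addr);

    /* Crude "looks like a pointer" test: the value falls inside the heap segment. */
    static int looks_like_pointer(uintptr_t v, uintptr_t heap_lo, uintptr_t heap_hi) {
        return v >= heap_lo && v < heap_hi;
    }

    /* Scan a filled cache line and prefetch every word that resembles a pointer. */
    void scan_line_for_pointers(const uintptr_t *line,
                                uintptr_t heap_lo, uintptr_t heap_hi) {
        for (size_t i = 0; i < WORDS_PER_LINE; i++) {
            if (looks_like_pointer(line[i], heap_lo, heap_hi))
                issue_prefetch(line[i]);   /* follow the link one hop ahead */
        }
    }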
Precomputation prefetching runs a slice of the program instructions to compute future references. The slice of instructions is run as a thread context in multithreaded environments [17], or in a dedicated prefetching engine [2, 55, 47, 29]. These techniques trigger their precomputation with specially marked instructions of the program. Although these techniques can predict accesses that exhibit neither spatial nor temporal regularity, the time between the trigger instructions and the instructions that require the prefetched data is usually too short to hide the memory latency, which limits their timeliness.

Coordinated prefetching approaches that employ two or more of the above prefetching mechanisms have been proposed. For example, Roth [43] combined a context-based mechanism called jump pointers with a content-based mechanism called dependence-based prefetching [42]. This combined approach, called cooperative prefetching, used jump pointers to launch timely prefetches for correlated accesses and subsequently triggered the dependence-based prefetcher as a result of jump pointer prefetches. Guided-region prefetching [50] is another coordinated approach that used software analysis to tag load instructions that access LDS fields and used these tags (as a context) to improve the accuracy of content-based prefetching. Recently, multi-chain prefetching [29] was proposed as a coordinated approach that uses compiler-analysis techniques to identify static traversal paths in a program. Precomputation schedules for the identified paths were statically generated and shipped to a hardware prefetching engine that executed them based on the memory content. While these coordinated prefetching solutions proved the viability of a coordinated prefetching approach to improve prefetch completeness, they still suffered from the limitations of the underlying mechanisms they coordinated.
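The jump-pointer idea mentioned above can be sketched in a few lines of C; the node layout, the choice of prefetch primitive (__builtin_prefetch), and the assumption that jump pointers are maintained at insertion time are illustrative only.

    #include <stddef.h>

    /* List node augmented with a jump pointer several hops ahead. */
    struct node {
        int          key;
        struct node *next;
        struct node *jump;   /* points a fixed number of nodes ahead, kept up to date at insert time */
    };

    long sum_list(const struct node *head) {
        long sum = 0;
        for (const struct node *n = head; n != NULL; n = n->next) {
            if (n->jump)
                __builtin_prefetch(n->jump, 0, 1);  /* launch a timely prefetch far down the list */
            sum += n->key;
        }
        return sum;
    }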
The principal impediment to prefetching LDS is that the memory addresses for the connected data objects are dynamically generated based on specific programming traversal patterns, phases, and application behavior. The connectivity among the nodes of the LDS changes during the program lifetime, and the program potentially changes its traversal paths along these nodes. Therefore, in order to achieve completeness, a technique for LDS prefetching must be able to dynamically track these changes over the program lifetime and adjust its prefetch targets for improved coverage and its prefetch initiation time for improved timeliness.

This dissertation hypothesizes that a coordinated software and hardware prefetching system, which builds on the strengths of both software and hardware prefetching paradigms and utilizes well-established compile and profile technologies to guide a run-time prefetching environment, can improve prefetching completeness for LDS. Such a coordinated system can be enabled by a rigorous approach that extends the understanding of the memory access characteristics associated with LDS and offers metrics to quantify such characteristics. The global view and understanding of the memory access characteristics available to the compiler cannot be matched by a hardware-only system. Such information, combined with the offered metrics and the dynamic run-time knowledge of the hardware component, creates a promising and flexible system for exploiting the memory access characteristics associated with LDS.
1.1 Contributions
The methodology of this dissertation consists of: (1) surveying and
understanding the memory access characteristics exploited by previous prefetching mechanisms, (2) extending the understanding of these characteristics and offering metrics to quantify them, (3) proposing prefetching mechanisms that employ the offered metrics using a coordinated approach that combines information from compile, profile, and run-time technologies, and (4) validating that the proposed mechanisms improve prefetch completeness using cycle-accurate simulation.
Through this methodology, the memory access characteristics associated with LDS are studied, and mechanisms to identify and exploit them are proposed. Thus, this dissertation makes the following contributions:

• Exploiting spatial regularity: To exploit spatially regular streams of the memory accesses, previously identified regular stream metrics are extended to quantify extrinsic characteristics of regular streams. These new metrics are employed to improve the efficiency of stride prefetching for LDS accesses. The extrinsic metrics introduced are stream affinity and stream density. Stream affinity enables prefetching for short streams that result from LDS manipulations; these streams were previously ignored by stride prefetching mechanisms. Stream density enables a prioritization mechanism that dynamically selects amongst available streams in favor of those that promise more miss coverage, and provides thrashing control amongst several coexisting streams. Using intrinsic and extrinsic stream metrics in combination allows a novel hardware technique for controlling the Prefetch Ahead Distance (PAD), which dynamically adjusts the prefetch launch time to better enable timely prefetches while minimizing cache pollution.

• De-aliasing regular streams: Several prefetching mechanisms utilize the Program Counter (PC) to de-alias co-existing streams of regular memory accesses (spatial and temporal). Transmitting the full PC across modern deep and wide pipelines for the sole purpose of prefetching is not practical (as illustrated in Chapter 3). To overcome the issues related to using the entire PC for effective stream de-aliasing and prefetching, this dissertation combines other instruction attributes with a small subset of the PC to help detect regularity in program data accesses. This de-aliasing
scheme is illustrated by implementing a cost-effective stride prefetching mechanism called Load-Attributes Prefetching (LAP).

• Exploiting temporal regularity: Metrics are proposed to quantify temporal regularity in the memory accesses generated by load instructions that traverse LDS. These metrics are applied to profile information to identify recurrent load instructions (instructions that generate temporally regular accesses). Recurrent load instructions are then targeted at run-time by a coordinated context-based and content-based mechanism that exploits temporal regularity in the memory accesses to prefetch LDS accesses. The approach is illustrated by proposing an LDS prefetching mechanism called Stitch Cache Prefetching (SCP), which makes the following contributions: (1) definition of temporal regularity metrics (based on concepts of information theory) that allow a profiler to identify loads that traverse any type of LDS path (static or dynamic), without requiring source code, (2) a context-based prefetching mechanism that avoids problems of capacity by using a logical stitch space, and of maintenance overhead by physically implementing the stitch space using a hierarchical organization, and (3) improved prefetch timeliness achieved by dynamically adjusting the prefetch launching time based on the observed memory latency using a continuously tuned timeliness stitch queue.

• Exploiting LDS topology: A classification of the load instructions used to traverse LDS is proposed. Based on this classification, the compiler is enabled to extract LDS topology and traversal information. This information is used by a hardware component and combined with the content of accessed cache lines to dynamically construct LDS traversal schedules. The constructed schedules are used by the hardware component to
generate timely and accurate prefetches in a coordinated prefetching approach. The coordinated prefetching approach is illustrated via the design of a novel prefetching mechanism called Compiler-Directed Content-Aware Prefetching (CDCAP). The mechanism utilizes compiler-inserted prefetch instructions to guide hardware prefetching engines (HPEs) to prefetch traversed paths of LDS. The inserted prefetch instructions contain information about the static attributes that describe the topology of the data structure. As the program runs, compiler-inserted hints invoke the HPE to employ the topology information for generating prefetches based on the current program state and the contents of the accessed cache lines. The technique addresses the shortcomings of software-only techniques by eliminating both the need to transform the data structure and the use of excessive prefetch instructions, and it does not require prior knowledge of the traversed data structure paths. At the same time, the approach eliminates the need for large correlation-based hardware structures and reduces the number of unnecessary prefetches caused by hardware-only implementations.
1.2 Organization
This dissertation is composed of six chapters. Chapter 2 studies spatial
regularity in the memory accesses, and proposes metrics to quantify extrinsic regular stream characteristics. The proposed metrics are utilized by a run-time prefetching system to improve the prefetch completeness of stride-based prefetching systems. Chapter 3 describes Load-Attribute Prefetching (LAP) as a cost-effective solution to de-alias regular streams in modern deep and wide pipelines. Chapter 4 studies temporal regularity in the memory accesses associated with LDS. Metrics to quantify temporal regularity are offered and used in the design of a coordinated
prefetching approach called Stitch Cache Prefetching (SCP) to improve prefetch completeness. Chapter 5 identifies patterns for LDS traversal instructions and employs these patterns to design a Compiler-Directed Content-Aware Prefetching (CDCAP) approach that generates dynamic precomputation prefetching schedules, based on compiler-communicated information, for prefetching engines located at different cache levels in the memory hierarchy. Finally, in Chapter 6, conclusions and suggestions for future research are given.
Chapter 2 Exploiting Spatial Regularity Using Extrinsic Stream Metrics
2.1 Introduction
Hardware-based prefetching approaches are particularly interesting due to
(1) the potential to dynamically adapt to run-time characteristics, and (2) the ability to improve software performance of unmodified program binaries (e.g., there is no need to recompile programs using special compilers or other software tools, and legacy program binaries can also benefit from these techniques).

Stride-based prefetching is an effective hardware technique that exploits regular streams of memory accesses [26]. A regular stream is an ordered sequence of addresses that exhibits a non-zero constant address difference (a stride) between its consecutive addresses. Existing stride-based techniques achieve efficiency primarily through accuracy and timeliness.

Figure 2.1 depicts the prefetch opportunity of spatially regular accesses in the SPEC-INT benchmarks. This figure indicates that there exist significant opportunities to hide the memory latency in these benchmarks by exploiting spatial regularity. Unfortunately, previous stride-prefetching mechanisms were designed to target spatial regularity in scientific numerical applications. The spatial regularity in such applications demonstrates different characteristics than that of integer applications that use LDS (as will be illustrated in the following sections of this chapter). Thus, improving the efficiency of stride-based prefetching for
[Figure 2.1 is a bar chart plotting the fraction of spatially regular accesses (0.0 to 0.7) for each SPEC-INT benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf) and their average.]
Figure 2.1: Prefetch opportunity of spatially regular accesses in SPEC-INT.
applications that use LDS requires a better understanding of their spatial regularity characteristics. This understanding enables the improvement of prefetch completeness through improved coverage, without sacrificing accuracy or timeliness (as demonstrated by the simulation results in Section 2.4).

Mohan defines regularity metrics for measuring the characteristics of streams [34]. This work illustrated that well-defined metrics could be effectively employed to guide software optimizations associated with regular streams. The fidelity enabled by these metrics allowed a compiler to select the most effective code optimizations amongst techniques such as tiling, software prefetching, and loop transformations. While Mohan's work was not focused on hardware prefetching, this dissertation hypothesizes that stride-based hardware prefetching completeness could be improved by similar exploitation of regular stream metrics. Furthermore, while Mohan focused on the important characteristics of individual streams in isolation, this research recognizes that at run-time there are interactions between streams and other entities (for example, other streams or the memory sub-system). These interactions can be accounted for and can be used to
further improve hardware prefetch efficiency. Thus, the major contributions of this chapter are the extension of Mohan's metrics to enable measurement of certain additional characteristics of regular streams, and the application of these extended metrics to the optimization of stride-based hardware prefetching.

The rest of this chapter is organized as follows. The next section identifies characteristics of regular streams and defines metrics to quantify these characteristics. Section 2.3 illustrates how these metrics can be dynamically used to improve the efficiency of a stride-based hardware prefetching system. Section 2.4 illustrates (using simulations) that the proposed hardware outperforms one that does not account for the metrics offered in this research. Section 2.5 summarizes and draws conclusions.
2.2 Characterizing Regular Streams
Metrics of regular streams are classified into two classes: intrinsic stream
metrics and extrinsic stream metrics. The intrinsic class describes a stream's inherent characteristics, such as its stride or length (i.e., the number of accesses that belong to the stream). The extrinsic class describes a stream's characteristics relative to the program, to other streams, and to the memory system (which includes the hardware prefetchers). While intrinsic stream metrics have been identified [34] and used to optimize stride prefetching mechanisms [26, 27, 20, 24], the same cannot be said about extrinsic stream metrics.

In this section, intrinsic stream metrics are briefly discussed, and extrinsic stream metrics are introduced. These metrics are then used to make three main contributions that improve the efficiency of hardware stride prefetching:

• Exploiting short streams to improve coverage,
• Dynamic selection amongst streams to prevent thrashing and improve coverage, and

• Dynamically changing the prefetch-ahead distance to improve timeliness.
2.2.1 Intrinsic Stream Characteristics
Existing hardware stride prefetching mechanisms recognize and exploit the intrinsic characteristics of regular streams. These characteristics can be measured by two metrics: stride and length. The stride measures how far apart the consecutive accesses of the stream are in terms of memory addresses, while the length measures the number of accesses of the stream.

Streams having strides within one cache line, called unit-stride streams, have been exploited by stream buffer prefetching mechanisms [26, 20]. Unit-stride streams are easy to detect and exploit; however, in practice there remain a significant number of streams that exhibit strides larger than a single cache line [34]. Detection of streams with arbitrary strides can provide increased coverage. To exploit streams having strides longer than one cache line, Farkas [27] uses per-PC stream detection, a technique that detects streams with arbitrary strides by comparing consecutive accesses of specific load instructions. However, as with all stream buffer based techniques, there is no mechanism to adjust the prefetch launch time to improve prefetch timeliness.

Stream length has been used to optimize stride prefetching techniques through exploitation of long streams [38, 26, 27, 4]. Longer streams are preferred because they offer better prefetch efficiency over time. However, the efficiency of prefetching for long streams can be negatively impacted by the presence of short streams unless there is a way to distinguish between them. Therefore, several mechanisms have been proposed to filter out streams of short length, such as allocation filters
[27, 20]. Unfortunately, while these techniques can filter out irregular accesses, they fail to exploit a short stream, because most of the stream is consumed to establish its stride, leaving very little of the stream remaining to be prefetched. This chapter shows how extrinsic stream metrics can enable efficient prefetching of some classes of short regular streams without negatively impacting the efficiency of prefetching for long streams.
2.2.2 Extrinsic Stream Characteristics
Extrinsic metrics measure a stream's characteristics relative to other streams and to the memory system (including misses not associated with any stream, as well as interactions with prefetchers). Stride prefetching mechanisms can utilize these metrics to evaluate tradeoffs whose implications cannot be measured using only intrinsic metrics, thus allowing for improved efficiency. The following sections discuss this in detail.

To facilitate the extrinsic metrics discussion, the notion of regular streams is augmented by associating a timestamp with each memory access to provide a temporal ordering of stream accesses. The timestamp is a monotonically increasing integer that starts at 0 and is incremented with each memory access. Using the temporal ordering provided by the stream access timestamp, the following stream attributes can be defined:

• Stream birth, b, is the timestamp of the first address of the stream,

• Stream death, d, is the timestamp of the last address of the stream, and

• Stream age, a, is the difference between a stream's death and its birth (d − b + 1).
With these attributes, the following metrics are defined and discussed: stream affinity and stream density.
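These attributes can be tracked with a small per-stream record, sketched below in C; the field and function names are illustrative rather than part of any particular hardware design.

    /* Per-stream record tracking the attributes defined above.        */
    /* Timestamps count memory accesses, starting at 0.                */
    struct stream {
        unsigned long birth;   /* b: timestamp of the stream's first access */
        unsigned long death;   /* d: timestamp of the stream's last access  */
        unsigned long length;  /* l: number of accesses in the stream       */
        long          stride;  /* constant address difference               */
    };

    /* Record one more access belonging to the stream at timestamp ts. */
    static void stream_record_access(struct stream *s, unsigned long ts) {
        if (s->length == 0)
            s->birth = ts;
        s->death = ts;
        s->length++;
    }

    /* a = d - b + 1 */
    static unsigned long stream_age(const struct stream *s) {
        return s->death - s->birth + 1;
    }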
2.2.3 Stream Affinity and Short Streams Exploitation
Allocating hardware resources to detect and prefetch regular streams is usually done based on demand misses [26, 4]. Allocation filters are used to filter out misses that do not belong to regular streams to prevent them from disturbing ongoing prefetching of regular streams [38]. This filtering consumes part of the stream to establish its intrinsic characteristics before prefetching can commence. Such consumption can prevent prefetching of short streams.

Stream affinity is introduced as an extrinsic metric to measure how similar a stream is to the most recent non-interleaved stream of equivalent stride. Stride prefetching can benefit from recognizing affine streams (streams with high affinity) by spending less time identifying intrinsic stream characteristics, and instead using that time to continue prefetching. This is especially important for short streams, where individual streams cannot be exploited due to the time wasted in determining their intrinsic characteristics.
[Figure 2.2(a) shows the source code
for (i=0; i < 50; i++) { t = A[i]; for (j=0; j < 3; j++) { sum += B[t+j*16]; } }
Figure 2.2(b) shows the resulting memory-access timeline: stream x (the inner loop in one outer iteration) and stream y (the inner loop in the next outer iteration), with the death dx, the birth by, and the window w.]
Figure 2.2: Stream Affinity Illustration.
Figure 2.2(a) shows an example of affine streams where two nested loops occur in a program such that the purpose of the outer loop is to choose different
starting points for the inner loop. The short streams produced by consecutive iterations of the inner loop will be affine streams. A prefetching mechanism that detects stream affinity can use the first of a set of high affinity streams to identify the stride, and then reuse this characterization to efficiently prefetch subsequent members of the set. The author is not aware of any prior technique that recognizes or exploits affine streams.

Given two streams, x and y, that are generated by the same load instruction, let the stream births and stream deaths of these streams be denoted as bx, by, dx, and dy, respectively. Further, let w be a timestamp window during which stream affinity can be exploited efficiently by prefetching hardware. The affinity, αy, for stream y is defined as:
\[
\alpha_y =
\begin{cases}
1 - \dfrac{b_y - d_x}{w} & \text{if } (b_y - d_x) \le w \text{ and } \mathrm{stride}(x) = \mathrm{stride}(y) \\[4pt]
0 & \text{otherwise}
\end{cases}
\tag{2.1}
\]
The fractional portion of the metric measures how far apart the two streams are. The metric approaches zero as the distance between the streams grows larger. Therefore, the highest affinity of 1 occurs for two streams x and y when y is a continuation of x (making the fractional term 0).
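The sketch below is a direct transcription of Equation 2.1 into C, reusing the per-stream record sketched earlier; w is the exploitation window.

    /* Affinity of stream y with respect to the most recent stream x of equal
     * stride (Equation 2.1). Returns 0 when the gap exceeds the window w or
     * the strides differ. */
    static double stream_affinity(const struct stream *x, const struct stream *y,
                                  unsigned long w) {
        if (x->stride != y->stride)
            return 0.0;
        unsigned long gap = y->birth - x->death;   /* by - dx */
        if (gap > w)
            return 0.0;
        return 1.0 - (double)gap / (double)w;
    }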
2.2.4 Stream Density and Prefetch Coverage
Regular streams compete for limited hardware prefetching resources, which are usually fewer than the number of regular streams [34]. Sherwood [45] used priority counters to select one of several potential streams based on how predictable each one is (i.e., selection amongst streams based on prefetch accuracy). This approach prevents irregular misses from deterring the prefetching of
regular streams. However, Sherwood's approach does not distinguish between regular streams to select one that potentially has better coverage than the others, nor does it prevent predictable streams from thrashing each other.

Stream density is introduced as an extrinsic metric that indicates the expected coverage of a stream (how many program misses are potentially hidden in a given period of time by prefetching the stream). Given the density metric, a prefetching mechanism can favor a stream with higher density over one with lower density. This enables maximizing prefetch coverage and therefore efficiency. With the previously introduced stream attributes, stream density, δ, is defined as the ratio of a stream's length to its age:
\[
\delta = \frac{l}{a} = \frac{l}{d - b + 1}
\tag{2.2}
\]
Intuitively, a stream that has low density (a sparse stream) is one whose accesses are separated by many interleaved memory accesses that do not belong to the stream. Conversely, the accesses of a high density stream are separated by few memory accesses not belonging to the stream. Dense streams appear, for example, in memory copy operations and in search algorithms that use tight loops containing load and compare operations.

To illustrate dense and sparse streams, consider the two nested loops of Figure 2.3(a). The outer loop generates a sparse stream (stream z) of the accesses of array A, whereas the inner loop generates streams x and y, which are dense streams. Figure 2.3(b) illustrates the calculation of the density metric for stream z. In this example, stream z has a length of 2 (accesses A[0] and A[100] in the code segment) and an age of 5 (its death is 4 and its birth is 0). Conversely, the densities of streams x and y are equal to 1. Dense streams present more prefetching opportunities than sparse streams.
[Figure 2.3(a) shows the source code
A[0] = 0; A[100] = 100; for (j=0; j < 2; j++) { t = A[j*100]; for (i=0; i < 3; i++) { sum += B[t+i*16]; } }
Figure 2.3(b) shows the access timeline over timestamps 0 through 7, with the sparse stream z interleaved with the dense streams x and y, and the density calculation δz = lz/(dz − bz + 1) = 2/(4 − 0 + 1) = 0.4.]
Figure 2.3: Stream Density Illustration.
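Equation 2.2 in code form, reproducing the Figure 2.3 calculation for stream z (length 2, birth 0, death 4); this is only an illustrative sketch.

    #include <stdio.h>

    /* delta = l / (d - b + 1), Equation 2.2 */
    static double stream_density(unsigned long length,
                                 unsigned long birth, unsigned long death) {
        return (double)length / (double)(death - birth + 1);
    }

    int main(void) {
        /* Stream z from Figure 2.3: two accesses (A[0], A[100]) at timestamps 0 and 4. */
        printf("density of z = %.1f\n", stream_density(2, 0, 4));   /* prints 0.4 */
        return 0;
    }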
A prefetching mechanism that does not account for stream density can be diverted from prefetching dense streams in the presence of sparse streams. This problem is known as the stream thrashing problem [38]. Although allocation filtering [38, 27] prevents non-stream misses or short streams from interrupting prefetching for long streams, previous hardware stride prefetching mechanisms do not resolve the stream thrashing problem. Section 2.3 introduces a mechanism that uses the density metric to select a stream that promises more coverage than other streams.
2.2.5 Measuring Stream Metrics
This chapter uses the SPEC2K suite of benchmarks for illustration. Several traces statistically represent each benchmark, such that each trace consists of several million instructions. Each trace has a weight representing its contribution to the overall benchmark. This is similar in concept to the benchmark representation used in SimPoint [39]. Trace representation was verified against actual hardware (a previous processor) using IPC as a verification metric. The simulated IPC, obtained by simulating the traces on the matching cycle-accurate simulator and weighting the IPCs of each sub-trace appropriately, was compared against the IPC from execution on hardware. All benchmarks showed good correlation between actual and simulated IPC. In collecting stream metrics, streams were not weighted by the trace weights.
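A minimal sketch of the weighting step, assuming per-trace IPC values and weights that sum to one; the numbers are invented for illustration.

    #include <stdio.h>

    /* Each benchmark is represented by several traces; the benchmark-level IPC
     * is the weighted average of its traces' IPCs (weights sum to 1). */
    int main(void) {
        double ipc[]    = { 0.92, 1.10, 0.75 };
        double weight[] = { 0.50, 0.30, 0.20 };
        double benchmark_ipc = 0.0;

        for (int i = 0; i < 3; i++)
            benchmark_ipc += weight[i] * ipc[i];

        printf("weighted IPC = %.3f\n", benchmark_ipc);  /* compared against hardware IPC */
        return 0;
    }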
[Figure 2.4 shows stacked histograms of the density distribution (ranges < 0.0001, < 0.001, < 0.01, < 0.1, < 0.2, and >= 0.2) for each SPEC2K-INT benchmark (gzip through twolf) and each SPEC2K-FP benchmark (wupwise through apsi).]
Figure 2.4: Stream Density Histogram.
Stream Density

Figure 2.4 depicts a histogram of stream densities. Each bar represents the percentage of streams that had densities within the ranges identified in the legend. One observation is that the SPEC2K-INT benchmarks generally exhibit higher density streams than the FP benchmarks. This is because interleaved streams appear more frequently in the FP benchmarks as a result of aggressive compiler loop unrolling. In contrast, the INT benchmarks offer little or no loop unrolling opportunities to the compiler.
Figure 2.4 illustrates that the majority of the regular streams detected in the benchmarks are sparse. This does not mean that these streams represent the majority of the memory accesses, since sparse streams are often significantly shorter than dense streams. However, the high percentage of sparse streams means that these streams will frequently interrupt the denser streams. Improved stride prefetching mechanisms should account for this observation and should be able to dynamically react as conditions change. Section 2.3 presents one solution to this problem based on stream density.

Stream Affinity

Figure 2.5 shows the percentage of streams in the benchmarks that have an affinity greater than 0.5 (α > 0.5) for a w of 200. A value of 200 was chosen for w experimentally, based on a study of how well the available streams could be exploited by the intended hardware system. This figure suggests that affine streams constitute a significant portion of the total number of streams in several workloads. Overall, streams with high affinity are less frequent in the SPEC2K-FP benchmarks than in the SPEC2K-INT ones. This is due to the nature of the two suites: the INT benchmarks tend to have more loop nesting constructs and more dynamic data structures. Dynamic data structures are usually allocated in chunks that are consecutive in the memory space. As these data structures get rearranged by the program due to deletion and insertion of nodes, their subsequent traversal results in fragments of short affine streams. In contrast, the FP benchmarks generally do not use dynamic data structures; they tend to scan large arrays, causing long streams that are well separated by non-stream accesses. Therefore, these benchmarks demonstrate low affinity in general. Investigating the FP benchmarks that exhibit high affinity (for example, galgel and art) reveals that these benchmarks have outer loops that set up an address for inner loops, which is consistent with the affine streams present in the INT benchmarks. The only difference between the two code segments is that galgel used aggressive loop unrolling of the inner loop, while art did not.
[Figure 2.5 shows bar charts of the percentage of affine streams (0% to 12% for SPEC2K-INT, 0% to 14% for SPEC2K-FP) for each benchmark in the two suites.]
Figure 2.5: Percentage of streams with α > 0.5, w = 200.
2.3 Run-time Exploitation of Stream Metrics
The effective use of both intrinsic and extrinsic regular stream metrics is
demonstrated with a design for a hardware stride prefetching system. The design includes a number of prefetching engines (PEs). Each PE is controlled by a finite state machine similar to the one shown in Figure 2.6. These PEs get allocated to load instructions that miss in the L1 cache in order to prefetch their regular streams.
Therefore, this design employs per-PC stride detection similar to that presented by Farkas [27]. The system is designed to prefetch load accesses and not store accesses; stores are handled by other micro-architecture solutions such as store buffers [20].

The states of this state machine are divided into two sets: inactive states and active states. The active states dynamically control prefetching accuracy and timeliness, while the inactive states are employed to improve coverage by detecting streams, managing priorities amongst streams, and identifying affine streams. The details of how stream metrics govern state transitions are discussed in the following sections.
[Figure 2.6 diagrams the finite state machine that controls each PE. From OFF, an allocation moves the PE into the inactive states (SD, TC, AFD); detected streams climb through the active states (LC1, HC1, HC2, HC4, HC6). The states and the metrics they use are:

State   Description                    Metric Used
SD      Stream Detection               Density
TC      Thrashing Control              Density
AFD     Affinity Detection             Affinity/Density
LC1     Low Confidence with 1 PAD      PAD/Affinity/Density
HCx     High Confidence with x PAD     PAD/Affinity/Density

Legend: LM = allocated load miss, PU = prefetch used, PV = prefetch evicted, MU = load miss unpredicted, MP = load miss predicted, PEM = PE miss (load miss not allocated an engine).]
Figure 2.6: A Finite State Machine That Controls The PEs.
2.3.1 Stream Detection and Allocation Filtering
When a load instruction misses in the L1 cache, a free PE gets allocated to the instruction, and the state machine transitions to the Stream Detection (SD) state. During the SD state, the PE makes an initial stride guess equal to the size of a cache line. Using the subsequent accesses of the load instruction, a stream is identified by comparing the guessed stride with the measured stride of the load accesses. If a stream is identified, the state machine transitions to state LC1 (low confidence, with the ability to launch one prefetch). Otherwise, the PE state machine stays in SD and re-computes the stride based on the addresses of the load instruction's consecutive accesses. This process repeats until either a stream is detected or the PE is reallocated to another missing load instruction (via repeated transitions through state TC and finally OFF due to other load instruction misses). This stream detection is analogous to the allocation filters proposed by Palacharla [38], in the sense that misses not comprising a stream do not affect prefetching, because there will be no resulting transition to the LC1 state.
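A simplified C sketch of the per-PC detection step performed in the SD state; a real PE tracks more state (confidence, PAD), and the structure and update rule here are illustrative only.

    #include <stdint.h>

    /* One prefetching engine (PE), allocated to a single load PC. */
    struct pe {
        uintptr_t pc;          /* load instruction this PE is allocated to      */
        uintptr_t last_addr;   /* previous data address seen for this PC        */
        long      stride;      /* current stride guess                          */
        int       confirmed;   /* nonzero once two accesses agree on the stride */
    };

    /* Called on each access by the allocated load while the PE is in SD.
     * Returns 1 when a stream has been detected (transition to LC1). */
    int sd_observe(struct pe *e, uintptr_t addr) {
        long measured = (long)(addr - e->last_addr);
        if (measured != 0 && measured == e->stride) {
            e->confirmed = 1;            /* guessed and measured strides agree */
        } else {
            e->stride = measured;        /* re-compute the stride and keep watching */
            e->confirmed = 0;
        }
        e->last_addr = addr;
        return e->confirmed;
    }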
2.3.2 Stream Prioritization and Thrashing Control Using Stream Density
When several streams are interleaved, they will generate misses that are also interleaved. The proposed prefetching system is designed with a number of PEs equal to N. If the number of interleaved streams is less than or equal to N, each one will be allocated a PE. However, when the number of interleaved streams is greater than N, a prioritization mechanism is needed. The prioritization is based on measuring the density of each of the interleaved streams and favoring the denser of them. This task is carried out by the states SD, Thrashing Control
(TC), and LC1. In these states the interleaved streams compete for the PE, such that allocated streams climb to higher confidence states, while unallocated streams can decrease the confidence of allocated streams. The denser stream will eventually win the PE.

In previously proposed mechanisms, newly identified streams can replace existing streams without consideration of their respective densities. This approach results in stream thrashing that reduces prefetching efficiency. Unfortunately, this situation is common in scientific code that has been subject to aggressive loop unrolling. The above prioritization mechanism implements thrashing control using stream density, such that a denser stream retains the PE. As with the prioritization above, this is done by dynamically allowing streams to compete for the PEs. The details of this competition are explained next.

Recall that the states of the state machine are divided into active states (LC1 and HC1-HC6) and inactive states (SD, TC, and AFD). Once a stream has been detected (while the PE is in the inactive states), the PE's state machine transitions to the active states and the PE launches a number of prefetches based on the measured stride. The number of prefetches launched in each state is shown as a number in the state name (for example, in state LC1 one prefetch is launched). Prefetch requests are queued to an architected miss buffer, which manages requesting and collecting data from lower memory levels. Once a request has been fully serviced, its data is evicted to the L1 data cache (referred to as a PV event). If the prefetched data is needed by any program instruction while in the miss buffer, a prefetch use event (PU) is declared. In this case, the state machine transitions to the next higher confidence state (state High Confidence with the ability to launch 1 prefetch, HC1, in this example). However, if the prefetched cache line is not needed until after it has been evicted, then the state machine transitions to a lower confidence state (e.g., state Affinity Detect, AFD).
Streams that are not allocated any PE will generate PE miss (PEM) events that reduce the confidence of allocated PEs, while allocated streams increase their PEs' confidence via prefetch use (PU) events. This competition will resolve in one of two ways: (1) if the unallocated stream is denser than the allocated one, then its PEM events will outnumber the PU events of the allocated stream and eventually the denser stream will take over the PE, or (2) if the allocated stream is denser, then its PU events will outnumber the PEM events of the unallocated stream, leaving the PE at high confidence. Either case results in the denser stream having control of the PE, thereby mitigating the thrashing problem and improving prefetch coverage.
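One possible encoding of this competition is a saturating confidence counter per PE, raised on PU events and lowered on PEM events; the threshold and step sizes below are arbitrary.

    /* Saturating confidence counter for one PE; the denser stream wins over time. */
    struct pe_confidence {
        int level;              /* 0 = OFF/inactive ... MAX_LEVEL = HC6 */
    };

    #define MAX_LEVEL 7

    void on_prefetch_used(struct pe_confidence *c) {    /* PU event */
        if (c->level < MAX_LEVEL) c->level++;
    }

    void on_pe_miss(struct pe_confidence *c) {          /* PEM event from an unallocated stream */
        if (c->level > 0) c->level--;
        /* when level reaches 0 the PE is free to be re-allocated to the other stream */
    }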
2.3.3 Exploiting Short Streams Using Stream Affinity
In all previously reported hardware stride-based mechanisms, if a missing address has not already been predicted by the prefetching mechanism, then it becomes a candidate for starting a new stream. This treatment of misses does not account for stream affinity. In the approach proposed in this chapter, stream affinity is exploited by lowering the confidence of the state machine controlling the PE allocated to the missing load instruction (referred to as LM events). This lowering of confidence repeats until the state machine reaches the Affinity Detect state (AFD). In AFD, if the conditions of Equation 2.1 indicate an affine stream, then the PE changes its prefetching region to match the most recently missed address. This change of the prefetching region allows the PE to begin prefetching affine streams without the need to go through the detection process. When affinity is detected, the PE state transitions directly to the HC1 state, bypassing the LC1 state. Allowing this transition results in faster climbing of the confidence states and enables exploiting short affine streams. Note also that not changing the prefetch region until the state machine confidence drops to one of the inactive states
allows the PE to maintain its prefetching path when certain prefetches cannot be launched due to other system limitations (for example, an arbitration mechanism that gives priority to other program instructions in the memory subsystem).
2.3.4 Using Intrinsic and Extrinsic Metrics Together: PAD
Existing stride-based prefetchers run ahead of the program, prefetching a number of cache lines less than or equal to the size of a stream buffer [26]. As prefetches get consumed from the stream buffer, they are replaced by additional prefetches. The distance that prefetching can run ahead of the program is bounded by the buffer size. Although reaching this bound is essential for launching early prefetches, it may eventually result in prefetches that are too early. Too-early prefetches can cause cache pollution (replacing data that is yet to be needed). This problem is exacerbated when prefetches are stored into the L1 data cache instead of a stream buffer. While cache pollution is a significant issue, there are also situations in which the bound imposed by the stream buffer size can actually prevent timely prefetches (e.g., when data structures larger than the cache are traversed). To address these issues, a mechanism that can launch timely prefetches (neither late nor early) is necessary.

To facilitate such a mechanism, the distance between the most recent prefetch and the stream's corresponding access (the one that requires the prefetched data) is measured. This distance is referred to as the Prefetch Ahead Distance (PAD). PAD is defined as follows:
\[
\mathrm{PAD} = \frac{p_t - x_t}{s}
\tag{2.3}
\]
where pt is the address of the most recent prefetch launched before timestamp t for some stream x, xt is the address accessed by stream x at timestamp t, and s is the stride of stream x.

PAD can be used to control how early a prefetch should be launched in advance of its corresponding stream access. Increasing the PAD gives the prefetched data more time to be brought from lower memory levels (which prevents prefetches from being too late), while decreasing the PAD delays prefetch launching (which prevents prefetches from being too early). The process of increasing or decreasing the PAD is referred to as padding. Associating the padding process with a measure of prefetch timeliness enables adaptive control of the prefetch launch time. Hence, if prefetches are being performed too late, more padding is needed (the PAD needs to be increased). On the other hand, if prefetches are too early, then less padding is needed (the PAD needs to be decreased). This dynamic control of timeliness using PAD is a novel aspect of the approach proposed in this work. The following sections discuss how PAD can be used in conjunction with both intrinsic and extrinsic stream metrics to improve prefetching efficiency.
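Equation 2.3 and the padding rule expressed in C; max_pad would be supplied by the current confidence state, and the helper names are illustrative.

    #include <stdint.h>

    /* PAD = (p_t - x_t) / s, Equation 2.3 */
    static long pad(uintptr_t last_prefetch_addr, uintptr_t last_access_addr, long stride) {
        return (long)(last_prefetch_addr - last_access_addr) / stride;
    }

    /* Launch prefetches only while the engine is within the PAD allowed by its
     * current confidence state: a higher state permits running further ahead. */
    void maybe_prefetch(uintptr_t *next_prefetch_addr, uintptr_t last_access_addr,
                        long stride, long max_pad) {
        while (pad(*next_prefetch_addr, last_access_addr, stride) < max_pad) {
            __builtin_prefetch((const void *)*next_prefetch_addr, 0, 1);
            *next_prefetch_addr += stride;
        }
    }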
2.3.5 Controlling Accuracy Using PAD and Stream Length
Cache pollution is more likely to occur when a stream is short. By the time prefetching hardware has detected a short stream and started prefetching its predicted next accesses, the stream is likely to have died. Unfortunately, its prefetches will pollute the cache with unnecessary data. However, cache pollution is less likely when streams are long, due to the tendency of these streams to ultimately require cache replacements. Thus, less padding for short streams can help prevent pollution, while more padding is appropriate for longer streams where more prefetches should enhance coverage even though the data arrives very early. Farkas [27] introduced incremental prefetching by doubling the number of
prefetches every time a prefetch is used, up to the size of the prefetch buffer (3-entry prefetch buffers were reported in [27, 21].) This incremental prefetching enhances prefetching accuracy by delaying aggressive prefetching until confidence has been gained. The design proposed in this dissertation makes two contributions to Farkas' mechanism. The first is to make the aggressiveness of incremental prefetching more dynamic, initially by adding state HC1 to the state machine (which improves accuracy by preventing further prefetches that may not be accurate when the stream is short.) Then, as the confidence states are climbed towards state HC6, the streams are assumed to be long, and prefetches can therefore be launched aggressively and very early, since they are not expected to cause cache pollution. This differentiation between short and long streams is unique to the design proposed in this work. Coverage data that support these claims are provided in the experimental section. The second contribution is to prefetch into the L1 cache, which removes the buffer size limit. Tracking the use of a prefetch while it is in the L1 cache can be done using several approaches. One approach is to add a prefetched bit to each cache line and to communicate hits to the PEs. Another approach is to communicate all load hits to their respective PEs. When the respective PE receives the hit information, it compares the hit address to a window of addresses that it has prefetched. If the hit address is within that window, a prefetch use event (PU) is declared and confidence is reinforced, thus effectively increasing the PAD. Otherwise the hit does not increase the PE confidence, which maintains the PAD at a lower level. This solution is discussed further in Chapter 4.
2.3.6 Controlling Timeliness Using PAD and Stream Density
Similar to improving accuracy by padding for stream length, padding for stream density can improve prefetch timeliness. This is supported by two ob-
servations: (1) accesses of a dense stream have little time between them, requiring the prefetching algorithm to use a larger PAD to enable early launching of prefetches, and (2) sparse streams have ample time between their accesses, making a smaller PAD sufficient for timely prefetches. Using density and padding to maintain prefetch timeliness is an important contribution of this design. This is done by changing the PAD based on whether prefetches are early or late. Late prefetches indicate a dense stream and cause transitions in the state machine to higher confidence states (which causes more padding,) while early prefetches indicate a sparse stream and cause transitions in the state machine to lower confidence states (which causes less padding.) This process is enabled by the state machine of Figure 2.6 as follows: a late prefetch will cause a prefetch use event (PU) while it is in the miss buffer. This raises the confidence, which in turn allows prefetches to be launched earlier (since the PAD increases.) On the other hand, an early prefetch will stay in the miss buffer until its data returns from lower memory levels, which causes an evict event (PV.) Evict events cause the state machine to transition to a lower confidence state. This lowering of confidence prevents the PE from issuing any more prefetches until the number of previously launched prefetches becomes less than the PAD allowed by the new state, thus delaying the prefetches to maintain timeliness.
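A compact sketch of this confidence feedback follows, assuming a simplified linear ordering of the states of Figure 2.6; the real state machine also handles allocation, affinity-detect, and deallocation transitions that are not shown here.

```python
# Illustrative PU/PV feedback driving the PE confidence (and hence its PAD).

STATES = ["AFD", "LC1", "HC1", "HC2", "HC3", "HC4", "HC5", "HC6"]

class PrefetchEngine:
    def __init__(self):
        self.level = STATES.index("HC1")  # start with minimal padding

    def on_prefetch_used(self):
        # PU: a prefetch was consumed while still in the miss buffer (late),
        # so climb to a higher confidence state and allow more padding.
        self.level = min(self.level + 1, len(STATES) - 1)

    def on_prefetch_evicted(self):
        # PV: a prefetch sat unused until its data returned (early),
        # so drop to a lower confidence state and allow less padding.
        self.level = max(self.level - 1, 0)

    def state(self):
        return STATES[self.level]
```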
2.4 Experimental Evaluations

2.4.1 Methodology
The results shown in this work were obtained from a cycle-accurate simulator being developed in concert with an experimental high-performance microarchitecture from Freescale Semiconductor, Inc. The simulator models a super-
scalar processor pipeline that can decode and complete up to three instructions per cycle. Instructions complete in program order, but can be issued and executed out of order. The pipeline includes multiple stages each of fetch, decode/rename, and issue. Integer operations take one cycle to execute. The branch predictor uses a 512-entry 4-way branch target buffer, a 2K-entry branch history table, and a 2K-entry pattern history table. The memory hierarchy includes separate 32KB 8-way I and D caches, a unified 1MB 8-way L2 cache, an integrated peripheral bus, and dual integrated DDR controllers. Loads and stores are pipelined. For the experiments in this chapter, the L1 hit latency was set to 3 cycles, the average total latency for an L2 hit is 25 cycles, and the average total latency for a DRAM read is 150 cycles. All implemented prefetching techniques used 8 PEs.
2.4.2 Results
In order to evaluate the proposed approach of dynamically exploiting the regular streams metrics, it is compared to a stream buffer mechanism as reported by Farkas in [27]. The approximations made to the reported approach are: (1) allowing prefetches to be loaded into the L1 data cache instead of prefetching into dedicated buffers, and (2) making the prefetch buffers of size 4 instead of 3. The prefetch buffer size is simulated by the number of maximum allowed prefetches that can be maintained at the highest confidence state. Both of these changes are optimistic in the sense that they enhance the expected performance of stream buffers. This is done by eliminating the time needed to access the prefetch buffers for the former, while the latter allows launching more prefetches when the confidence is high. The stream buffer implementation reported by Farkas was simulated using an 8-entry table for pc-based stream detection and allocation filtering. This
implementation is referred to as S in the experimental figures. The rest of the configurations used in the experiments illustrate the exploitation of padding for stream affinity and stream density. The second configuration extends the above stream buffer implementation by allowing LM events to control the padding process and to identify and exploit streams that are in affinity with each PE's allocated stream. This configuration is referred to as PA (for padding and affinity) in the experiments. The last configuration replaces the allocation filters by the prioritization and thrashing control mechanism based on stream density (by handling PEM events.) This configuration is referred to as PAT (for padding, affinity, and thrashing) in the experiments.

Speedup

Figure 2.7 shows the speedup in IPC that each enhancement provides over no prefetching. Several observations can be made about this figure:

• In general, the effect of prioritization and thrashing control is much higher than that of stream affinity. This is expected since the affinity measurements did not show a high percentage of affine streams (see Figure 2.3.) One benchmark with affine streams that shows little IPC improvement is gap. The reason for this discrepancy is that most of the load instructions of this benchmark fold on previously missing stores, which masks their streams from the prefetching hardware. One possible solution to this problem is to communicate loads that fold on previously missing stores to the PEs and attempt prefetching them.

• The effect of prioritization and thrashing control is higher in the FP than in the INT benchmarks because the FP benchmarks exhibit many more concurrent streams due to loop unrolling. One exception is the case of art,
which takes little benefit from prioritization using density due to the lack of loop unrolling in the representative traces used in this study.

• On average, exploiting affinity has more impact on the INT than on the FP benchmarks. The reason is that the INT benchmarks usually have more short streams, while the FP benchmarks usually have more long streams [10, 34]. Two of the exceptions are the benchmarks mcf and art. The former is an INT benchmark that allocates memory for very large linked lists and initially traverses the lists in the order of allocation. This traversal generates long streams in the initial phases of the program, while these streams get shorter with high affinity as the linked lists get rearranged during the program lifetime. The other exception is the benchmark art, which shows high affinity exploitation. This agrees with its affinity measure that was shown in Figure 2.3 and with the little amount of loop unrolling that the representative traces of this benchmark
exhibit.

[Figure 2.7: IPC Speedup of PAT over Stream Buffers. IPC speedup of the S, PA, and PAT configurations over no prefetching for each SPEC2K INT and FP benchmark.]

[Figure 2.8: Prefetch Coverage. Fraction of misses hidden by prefetches for the S, PA, and PAT configurations for each SPEC2K INT and FP benchmark.]
Prefetch Coverage

Figure 2.8 depicts the prefetch coverage for the three configurations. These coverage numbers agree with the IPC speedup shown in Figure 2.7. The analysis of the coverage numbers is that the FP speedup can be attributed to improved coverage resulting from better prioritization and thrashing control, while the INT benchmarks' speedup is due to better control of timeliness (given that the coverage
measures are similar between the three techniques.)

[Figure 2.9: Prefetch Accuracy. Fraction of prefetches that are accurate for the S, PA, and PAT configurations for each SPEC2K INT and FP benchmark.]
Prefetch Accuracy

Figure 2.9 depicts the accuracy of the three prefetching configurations. The first observation is that affinity-based padding results in the best accuracy for the INT benchmarks. This is due to better control of prefetch aggressiveness via the padding process, such that the shorter streams do not generate as many prefetches as stream buffers generally do. This supports the earlier claim that padding can
be used to improve accuracy. On the other hand, when density is used the accuracy is reduced. This is attributed to the larger number of prefetches issued when denser streams are tracked. Note that the most accurate prefetching technique is not necessarily the most efficient one. For example, a very conservative technique can elect to prefetch only when its confidence is very high. In such a case, the coverage will be very low, resulting in lower efficiency as well. Achieving the right balance between accuracy and coverage is an important design consideration for all prefetching techniques.

Prefetch Timeliness

Figure 2.10 displays histograms of prefetch timeliness for the three configurations studied. The upper panel displays the SPEC2K-INT benchmarks while the lower one displays the SPEC2K-FP benchmarks. These figures illustrate the proposed approach's ability to adapt prefetch timeliness to the density of the detected streams. Two examples are surrounded by ellipses in these figures. The first is for the benchmark twolf from the INT suite. The density histogram for this benchmark (Figure 2.4) indicates that about 65% of its streams are sparse, which in turn requires more timely prefetches. The padding process illustrated in Figure 2.10 shows that about 85% of its prefetches were timely. Contrary to this, the streams of the benchmark mgrid were shown to have a large percentage of dense streams (Figure 2.4), which requires aggressive early prefetching. The padding process for this benchmark shows about 95% of prefetches being Very Early in Figure 2.10, compared to about 55% for the stream buffer implementation.

Memory Traffic

Finally, Figure 2.11 depicts the number of L1 cast-outs of each prefetching configuration relative to the number of cast-outs without prefetching. On average the
[Figure 2.10: Prefetch Timeliness. Stacked histograms of the fraction of prefetches that are Very Late, Late, Timely, Early, and Very Early for the S, PA, and PAT configurations; the upper panel (Low Density) covers the SPEC2K INT benchmarks and the lower panel (High Density) covers the SPEC2K FP benchmarks.]
total number of cast-outs increased by about 3% over no prefetching. This small increase indicates that prefetching did not cause cache pollution on average. However, a few benchmarks (eon, galgel, and sixtrack) do have a significant increase in cast-outs. These benchmarks also showed low accuracy, which explains the high cast-outs. An interesting case is the benchmark ammp, where the L1 cast-outs drop by about 15% with prefetching. This is attributed to changes in LRU replacement due to prefetching, which improved the selection of which cache lines to replace from the L1 cache. Although these few benchmarks exhibited low accuracy (and resulting cache pollution,) it is important to note that there was a neutral to positive effect on IPC for them (Figure 2.7,) which remains an important measure of success for the
[Figure 2.11: L1 Cast-outs. Percentage increase in L1 cast-outs for the S, PA, and PAT configurations relative to no prefetching, for each SPEC2K INT and FP benchmark.]
proposed technique in this work.

2.5 Conclusions

This dissertation extends regular stream characterization to include extrin-
sic as well as intrinsic characteristics. The identified extrinsic characteristics are stream density and stream affinity. Metrics to quantify these extrinsic characteristics are defined and demonstrated in the context of a prefetch design which has been modeled within a cycle-accurate simulator of an advanced research microprocessor. The prefetch design employs stream affinity to enable prefetching of short streams that were previously ignored by prefetching mechanisms. Stream density was exploited with a novel mechanism called padding for (1) dynamically
adjusting the prefetch launching time to control prefetch timeliness, (2) stream prioritization, and (3) controlling stream thrashing. The results indicate that improving prefetching efficiency requires timely prefetching for sparse streams, as well as early prefetching for dense streams. Overall, stream prioritization and thrashing control based on density improved prefetch coverage by 100% on average for the FP representative traces over the implementation of a stream-buffer-based mechanism that does not use density. The prefetching system designed in this chapter identifies and prefetches streams based on load misses only. One direction for future work is to incorporate load misses that are masked by previous store misses into the prefetch triggering mechanism. Further, an inverse correlation was observed between the prefetching efficiency and the significant increase in L1 cast-outs due to prefetching. This observation requires detailed investigation to identify how it could be used to augment the padding decisions.
Chapter 3 Stream De-Aliasing: The Design of Cost-Effective Stride-Prefetching for Modern Processors
3.1 Introduction

Stream buffers prefetching is a particularly interesting hardware prefetching
technique because of its ability to detect and accurately prefetch memory accesses exhibiting unit-strides [26]. Unit-strides appear in memory accesses when data is located in sequential memory cache lines. However, stream buffers do not detect strides larger than a cache line, referred to as arbitrary-strides. An effective way of detecting arbitrary-strides is to correlate the addresses of memory accesses with the program counters (PCs) of the load instructions generating the accesses [27]. In practice, transmitting the full PC across the pipeline stages (as required by most proposed prefetching schemes) is expensive. This expense will increase as modern processor architectures transition from 32-bit to 64-bit PCs, doubling the size in both pipeline registers and data paths. The transmission cost is exacerbated by modern design trends of micro-processors, where performance improvement is done by increasing the processor clock frequency, in turn, requiring deeper pipelines [48] (more stages). In such pipelines, data access operations are handled by specialized hardware, referred to as the load-store unit (LSU). The LSU is typically located at the back-end of the processor pipeline, where the PC is typically not needed.
Stage        Issue   Slot   Reservation   LSU(AQ)   Exe pipe   PEs (8)   Total bits
Latch bits   176     44     44            308       308        352       1232
Logic bits   88      44     44            44        44         704       968

Table 3.1: Estimated structural requirements to use the full program counter for prefetching.
Table 3.1 shows estimated structural requirements for extending a modern high-performance microarchitecture design to transmit the full PC throughout the pipeline. In this example, the PC needs to be transmitted through the pipeline stages: issue, issue slots, reservation stations, execution pipeline stages, LSU, and finally to the prefetching hardware components referred to in the table as Prefetch Engines (PEs). The data shown assume that the implemented microarchitecture supports a 44-bit PC after the necessary virtual-to-physical address translations. The table shows the required additional latching and logic bits for each stage. The overall number of bits required to transmit the PC to the PE processing it is 1,232 latch bits and 968 logic bits. Note that these bits imply added requirements in terms of timing, floor planning, power consumption, verification, etc. Therefore, this chapter presents an alternative approach to using the full PC in the detection of arbitrary strides in the memory accesses. The approach, called Load Attributes Prefetching (LAP), uses a subset of the PC combined with other attributes of the instruction generating the access stream to identify such a stream within the memory accesses. The main trade-off for consideration in the design of effective prefetching for a microprocessor is performance, which is governed by prefetch accuracy, coverage,
and timeliness. In the case of stream buffer prefetching [26], accuracy is the biggest limitation. Poor accuracy results not only in missed opportunities, but in bad prefetches that actually hurt the overall system performance in two distinct ways. The first is that prefetched data can replace useful data already in the various levels of the cache system. This problem is often referred to as cache pollution. Cache pollution effects can be reduced by prefetching into dedicated buffers [26, 38] or to lower-level caches, e.g., the L2 data cache [20]. However, such a solution results in additional delays that limit the benefits of accurate prefetches. The second way that bad prefetches degrade performance is by incorrectly using resources like queues, buffers and bandwidth that otherwise could be used by program instructions. Arbitration techniques that give prefetches lower priority to access the system resources can help reduce this effect [1, 18], but they cannot eliminate it. Therefore, it is important to devise efficient mechanisms to minimize the number of bad prefetches, and to enable prefetching directly to the caches. The stride prefetching mechanism presented in the previous chapter addressed the prefetching accuracy concern. However, prefetching directly to the caches results in loss of information about prefetched cache lines. This information is needed to increase/decrease the confidence of the state machine (that controls the prefetch engines). This loss is due to the fact that prefetches are not distinguishable from other cache content. This problem is addressed in this chapter, and a cost-effective solution is evaluated. The remainder of the chapter is organized as follows: Section 3.2 presents the proposed prefetching approach, Load Attribute Prefetching (LAP). Section 3.3.1 presents the evaluation methodology used in this chapter. Experimental results are presented in Section 3.3, and finally, conclusions are drawn in Section 3.4.
3.2 Approach

Because the PC of missing load instructions is used to uniquely identify
streams, a prefetch system that uses only a small subset of the PC to enable arbitrary stride detection may result in making several missing load instructions appear to be the same instruction. In order to minimize such aliasing of load instructions, other attributes of the instruction encoding can be used to identify streams with strided patterns. Such attributes are usually needed for other reasons within the pipeline and therefore do not require major structural additions or impact the amount of information that must be propagated between stages. The LAP approach capitalizes on these observations. LAP is demonstrated using the PowerPC instruction set, and can be extended to other architectures. Load instructions in the PowerPC instruction set architecture are of the form: ld rD, δ(rA), where rD is the destination register, rA is the base register of the load instruction, and δ is a displacement that is either specified as an immediate offset within the instruction or in another register rB. Both rA and rB are five-bit fields in the load instruction. These fields (including the offset) are usually transmitted across the pipeline (at least up to the register rename stage) in modern architectures. LAP prefetching addresses the implementation efficiency concerns of stride prefetching by extending the stride prefetching approach presented in Chapter 2 in two ways: (1) efficient stream identification, and (2) efficient feedback management.
3.2.1 Stream Identification
In order to effectively associate address requests and prefetches with a particular instruction, a stream identifier (Stream ID) is designated. The Stream ID
[Figure 3.1: Forming Stream ID in Load Attribute Prefetching (LAP). A selected portion of the PC is concatenated with the low-order bits of the load's base register field rA (from ld rD, δ(rA)) to form the Stream ID.]
is used as a reference index into different prefetch structures, which should not have a large index if they are to be accessed quickly. The LAP approach uses, in addition to the partial PC, the least significant 4 bits of the instruction's base index register rA. The base index register is selected with the intuition that critical loads with high-percentage miss rates access addresses not common to other loads. As such, the base register value as well as its index should generally be unique among other loads in a dynamic code sequence. The Stream ID is composed by concatenating the 4 lower bits of the source register identifier rA with the selected portion of the PC.
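The following sketch illustrates the Stream ID composition; the 4-bit field widths and the word-aligned PC shift reflect the configuration evaluated later in this chapter and are parameters of the sketch rather than fixed properties of the mechanism.

```python
# Illustrative Stream ID formation for LAP.

PC_BITS = 4   # bits taken from the PC (after dropping the two always-zero bits)
RA_BITS = 4   # low-order bits of the base register field rA

def stream_id(pc: int, ra: int) -> int:
    """Concatenate the low bits of rA with a small slice of the PC."""
    pc_part = (pc >> 2) & ((1 << PC_BITS) - 1)   # instructions are 4 bytes long
    ra_part = ra & ((1 << RA_BITS) - 1)
    return (ra_part << PC_BITS) | pc_part

# Example: two loads at the same partial PC but with different base registers
# map to different Stream IDs, which reduces aliasing.
assert stream_id(0x10000040, ra=3) != stream_id(0x10000040, ra=7)
```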
3.2.2 Feedback Management
Each of the identified streams is allocated to one of a number of prefetching hardware components called prefetch engines (PEs). The operation of these PEs is controlled by a finite state machine as was described in Chapter 2. All the operations use the formed Stream ID instead of the PC.

Dynamically Controlled Run-Ahead Prefetching

The major advantage of stream buffers prefetching is its ability to provide prefetched data in time for the program to consume, potentially hiding all the memory latency. This is facilitated by prefetching data that is several strides away from
44 the current program access. This is referred to as running ahead. A problem associated with running ahead of the program arises when the prefetched data is not consumed by the program. This has two negative impacts on the overall performance gains of the prefetching technique: resource wasting and cache pollution. Resource wasting occurs because unused prefetches consume memory bandwidth and erroneously utilize other system resources that could have been otherwise used by the program. Cache pollution occurs when prefetched data replaces data from the memory caches that would have been accessed by the program. The state machine that controls the PEs (see Figure 2.6) is designed to attain the benefits of run-ahead prefetching and minimize its disadvantages by minimizing the useless prefetches. This is done by allowing the PE to maintain a sufficient distance ahead of the program, so long as its prefetches are consumed by the program. The run-ahead distance is dynamically changed based on the confidence level of the PE. The confidence level is increased if a prefetch is used. A prefetch is considered to be used if a load folds on it in the LSU, or if a load hits on a cache line that has been prefetched (as discussed later). On the other hand, prefetches that do not get used result in lowering the confidence level of the PE that issues them, until it is deallocated. A major difference between this mechanism and other reported mechanisms is that the PE is allowed to gain, lose and regain confidence. This is important especially in cases where the PE misses occasional loaded addresses in the stream. Such a miss may be the result of structural, bandwidth or other limitations. These limitations do not necessarily mean that the PE is failing to effectively prefetch the required data for the program. Therefore, the PE is allowed to regain high confidence levels, without killing the whole stream as was done in prior techniques [26].
[Figure 3.2: Prefetch moving window illustration. The PMW spans the cache lines between the program's last known accessed address and the most recent prefetched address.]
Feedback and Prefetched Moving Window

To overcome the cache pollution problem, stream buffers prefetching usually stores its prefetched data into dedicated prefetch buffers (hence the name). However, this solution means that accessing the prefetched data requires an additional number of cycles beyond accessing the L1 cache, which will reduce the benefits of prefetching. With LAP, prefetches are allowed to go directly to the L1 cache. Although this may result in cache pollution, the PE's ability to minimize useless prefetches is helpful in reducing harmful effects, as will be illustrated in the experimental results of Section 3.3. The challenge associated with prefetching to the L1 cache is recognizing cache hits to prefetched cache lines and feeding this information back to the PEs to adjust their confidence levels accordingly. Previously proposed approaches tag each L1 data cache line as being prefetched or not [24] to identify hits to prefetched data. This tagging increases the size of the L1 cache. In addition, the information is lost once the prefetched data is moved to lower memory levels. Instead of tagging the L1 cache lines, a heuristic called the Prefetched Moving Window (PMW) is used in LAP. The PMW (shown in Figure 3.2) is a range of addresses between the program's last known accessed address and the most recent prefetched address. A cache hit that falls within the PMW is assumed to be to a prefetched cache line, otherwise the hit is ignored by the PE. The last known accessed address is up-
dated with every hit or miss of the instruction assigned to the PE, which results in sliding the PMW as the program executes. Using the PMW enables tracking prefetched cache lines across the different memory levels without needing to tag every cache line in the memory hierarchy.
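A minimal sketch of the PMW bookkeeping follows; the field names are illustrative, but the hardware tracks the same two addresses per PE.

```python
# Illustrative Prefetched Moving Window (PMW) heuristic.

class PrefetchedMovingWindow:
    def __init__(self):
        self.last_access = None    # program's last known accessed address for this stream
        self.last_prefetch = None  # most recent address prefetched by this PE

    def on_access(self, addr):
        """Every hit or miss of the load assigned to the PE slides the window."""
        self.last_access = addr

    def on_prefetch(self, addr):
        self.last_prefetch = addr

    def hit_is_prefetched(self, hit_addr):
        """A cache hit inside the window is assumed to be to a prefetched line,
        so it is reported to the PE as a prefetch-use (PU) event."""
        if self.last_access is None or self.last_prefetch is None:
            return False
        lo, hi = sorted((self.last_access, self.last_prefetch))
        return lo < hit_addr <= hi
```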
3.3 Experimental Evaluation

3.3.1 Methodology
The results shown in this work were obtained from a cycle-accurate simulator being developed in concert with a new high-performance micro-architecture from Freescale Semiconductor, Inc. The simulator models separate 32KB 8-way instruction (I) and data (D) caches, a unified 1MB 8-way L2 cache, an integrated peripheral bus, and dual integrated DDR controllers. The hit latency for the L1 cache is 3 cycles. The average total latency for an L2 cache hit is 25 cycles, and the average total latency for a DRAM read is 150 cycles. This work uses the SPEC2000 suite of benchmarks. Several traces statistically represent each benchmark, such that each trace consists of several million instructions. Each trace has a weight representing its contribution to the overall benchmark. The reported numbers in this chapter are those of the weighted sum of each trace metric. This is similar in concept to benchmark representation used in SimPoint [44]. Trace representativeness was verified against actual hardware (a previous processor) by using IPC as a verification metric. The simulated IPC, obtained by simulating the traces on the matching cycle-accurate simulator and weighting the IPCs of each sub-trace appropriately, was compared against the IPC from execution on hardware. All benchmarks had less than a 15% delta between actual and simulated IPC.
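As a concrete illustration of the trace-weighting scheme, the following sketch combines per-trace metrics into a benchmark-level number; the weights and IPC values are purely hypothetical.

```python
# Hypothetical example of weighting per-trace metrics into a benchmark metric.

traces = [
    {"weight": 0.5, "ipc": 1.20},
    {"weight": 0.3, "ipc": 0.95},
    {"weight": 0.2, "ipc": 1.40},
]

benchmark_ipc = sum(t["weight"] * t["ipc"] for t in traces)  # weighted sum
print(round(benchmark_ipc, 3))  # 1.165
```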
3.3.2 Results
In order to find the optimal subset of the PC to use in LAP, a number of bits (2, 3, 4, 5, and 6) of the PC were used as a Stream ID instead of the full PC. Because the instructions used in the simulation are all four bytes long, the least significant two bits were always zero. These two bits were first shifted out of the PC, and then the required number of bits (e.g., 2) was taken from the least significant bits remaining of the PC (e.g., bits 40 and 41 of the 44-bit PC, where bit 43 is the least significant).
[Figure 3.3: Performance of PC bits. Average IPC percentage gain (FP+INT) versus the number of PC bits used (PC2 through PC6) for the Partial PC, Full PC, and LAP configurations.]
Figure 3.3 shows the average percentage gain in IPC over no prefetching for all the benchmarks in both the SPEC2K suites (INT and FP). The gain is shown for three mechanisms: Partial, Full, and LAP. Partial represents using a number of PC bits (2,3,4,5 and 6) to compose the Stream ID. Full uses the full PC, and LAP uses 4 bits of the PC and 4 bits of rA. Figure 3.3 indicates that
Partial using 4 bits of the PC results in the best performance gain. It is interesting to see that using this subset of the PC results in an overall gain of about 24.8%, while using the full PC results in a gain of about 24.2%. This added performance is a result of folding several streams into the same Stream ID, which results in the engine reaching the active states faster than when only one load is being tracked. LAP is shown to have less performance advantage than Full with 4 bits. However, this loss is accompanied by a large gain in accuracy, as demonstrated in the next experiment.
[Figure 3.4: Accuracy of PC bits. Fraction of used prefetches (FP+INT) versus the number of PC bits used (PC2 through PC6) for the Partial PC, Full PC, and LAP configurations.]
Figure 3.4 shows the average prefetching accuracy of the above configurations. Accuracy is the percentage of used prefetches among all prefetches issued. The figure indicates that the accuracy has a strong correlation with the number of PC bits used. It also indicates that combining rA with 4 bits of the PC in LAP comes within 4% of the accuracy achieved when the full PC is used, and is better by about 25% than using only 4 bits of the PC. The reasonable tradeoff between
performance and accuracy of LAP makes it a good choice.

[Figure 3.5: Performance gain. IPC gain of LAP and Full PC for each SPEC2K INT and FP benchmark.]
Figure 3.5 depicts the percentage gain in IPC for each benchmark of the SPEC2K suite for (1) LAP using 4 bits of the PC and rA and (2) using the full PC. The figure indicates that the SPEC2K INT suite of benchmarks contains a significant amount of regularity in its accesses. For example, the benchmark 181.mcf gains about 200% in IPC with LAP. The reason for this gain is that 181.mcf allocates its data structures at once, which results in all of its data nodes being spatially consecutive, and it walks that data structure in its allocation
order in much of its processing. The figure also indicates that the proposed LAP performs very similarly to using the full PC, except for the case of 188.ammp,
which suffers from an aliasing effect that is the result of aggressive loop unrolling.

[Figure 3.6: Prefetching Accuracy. Fraction of used prefetches for LAP and Full PC for each SPEC2K INT and FP benchmark.]
A similar trend in the accuracy for (1) LAP using 4 bits of the PC with rA and (2) using the full PC is presented in Figure 3.6. This figure shows the percentage of used prefetches among all prefetches issued to the lower memory levels, with an average of 40% for LAP compared to an average of 44% for the full PC. Figure 3.7 depicts prefetch timeliness: the percentage of prefetches that were
[Figure 3.7: Prefetch timeliness. Percentage of timely prefetches for LAP and Full PC for each SPEC2K INT and FP benchmark.]
used from the L1 cache before they were replaced out of all used prefetches. This figure indicates that, on average, about 85% of the used prefetches were timely and completely masked the memory latency. It also demonstrates that the timeliness of LAP prefetching is similar to that of using the full PC. Previous approaches to stream buffers prefetching stored prefetched data in buffers until they were needed by missing loads. This was done to avoid cache pollution by prefetches that are not accurate. In contrast, the LAP approach allows cache pollution and presents a comprehensive mechanism to minimize bad prefetches. Figure 3.8 illustrates the benefits of prefetching directly into the L1
[Figure 3.8: Prefetching into the L1 cache. IPC gain of the Buffer and L1 Cache schemes for each SPEC2K INT and FP benchmark.]
cache, by presenting the IPC gain for two schemes. The first scheme is referred to as Buffer, which works by storing prefetches in a buffer and then accessing that buffer to acquire prefetched data for missing loads when they get to the LSU. The second scheme, referred to as L1 Cache, writes prefetched data to the L1 cache as soon as it is serviced from lower memory levels. The reason for the better gains of the polluting scheme is that accessing the separate prefetch buffers takes 4 cycles longer than accessing the L1 cache. The extra cycles occur because the buffers need to be accessed serially after determining the L1 cache miss. Otherwise, accessing the buffers in parallel with the L1 cache results in needing to drive several circuits
(the cache and the buffer) at the same time, which would slow access to the critical L1 cache. By polluting the L1 cache, these extra cycles are saved when prefetches are timely and accurate.
3.4 Conclusions

This chapter presented LAP prefetching, a cost-effective prefetching system that extends the traditional idea of stride prefetching by introducing a set of hardware components that accurately and dynamically detect exploitable address regularity. Potential prefetch addresses are generated by coordinating many pieces of strategically tagged in-flight information through a dedicated state machine. A confidence mechanism is proposed that enhances the accuracy of stride prefetching and allows the LAP approach to prefetch directly into the L1 cache, overcoming the traditional limitations of cache pollution. A prefetched moving window heuristic is presented to enable prefetching into the L1 cache without the need to mark each cache line. The experiments show that LAP prefetching, using as few as 4 bits of the program counter along with other attributes of the instruction, can come within 1% of the performance achieved using the full PC.
Chapter 4 A Stitch In Time: Identifying and Exploiting Temporal Regularity

4.1 Introduction

Prefetching spatially regular data structures (e.g., data structures that
exhibit simple mathematical relationships in their load addresses) has been particularly successful using techniques such as stride [36] and stream [26] prefetching. These techniques satisfy the prefetching completeness requirements by effectively exploiting spatial regularity to accurately predict future accesses well ahead of the program. While many programs make use of such spatially regular data structures, the use of linked data structures (LDS) is pervasive in modern software (e.g. linked lists, trees, hashes, etc.) Although prefetch techniques that work well for spatially regular data can predict some of the misses associated with LDS (about 14% on average for the SPEC-INT, as was illustrated in the previous two chapters,) the remaining misses need to be prefetched by exploiting characteristics other than spatial regularity. This chapter investigates coordinating the exploitation of two such characteristics: temporal regularity and content of accesses associated with LDS. Context-based prefetching techniques exploit temporal regularity in LDS accesses [15] by finding correlations amongst repeatedly accessed addresses, and then using these correlations to initiate prefetches (temporal regularity is discussed in Section 4.2). These correlations enable context-based techniques to launch
55 timely prefetches ahead of the program. Unfortunately, context-based prefetching has several limitations that affect its coverage, including limited capacity, learning time, and excessive overhead. A detailed discussion of these limitations follows in Section 4.2. Content-based prefetching involves using stateless systems that prefetch the connected data objects of LDS by discovering pointers in the data contained within cache lines as they are filled. Content-based prefetching can achieve good coverage by overcoming the limitations of context-based prefetching; however, timeliness is limited since the pointers contained in the accessed data usually refer to data that will soon be needed by the program. Other limitations of content-based prefetching include the inability to recognize traversal orders of multiple potential paths, and the inability to connect and prefetch isolated LDS. Section 4.2 discusses these limitations further. To improve the coverage of context-based prefetching (while benefiting from its timeliness,) Roth [43] combined a context-based approach called jump pointers with a content-based mechanism called dependence-based prefetching [42]. This combined technique, called cooperative prefetching, used jump pointers to launch timely prefetches for correlated accesses and subsequently triggered the dependence-based prefetcher as a result of jump pointer prefetches. Cooperative prefetching indicated that a combined context-and-content prefetching approach has promising prospects for achieving prefetch completeness for LDS. However, cooperative prefetching had several shortcomings: (1) the approach was limited to a certain type of LDS that needed to be statically identified by the compiler, and (2) jump pointers competed with the program data for the data path, including critical resources like the data caches. This chapter proposes an improved multi-approach LDS prefetching solution, called Stitch Cache Prefetching (SCP), which consists of a context-based
56 mechanism and a content-based mechanism. SCP exploits the synergy between context and content prefetching techniques, while overcoming the above mentioned limitations of cooperative prefetching by improving its two components. In the context-based mechanism of SCP, a software or hardware profiler (explained in Section 4.4) identifies load instructions that exhibit strong correlations amongst their generated accesses (referred to as recurrent loads) using temporal regularity metrics that are introduced in this work. The identified recurrent loads are marked with hints that are used by a hardware component. Unlike cooperative prefetching, recurrent loads are not limited to static paths or to loads that access specific LDS layouts. Additionally, source code is not needed for profiling purposes. The hardware component records the correlations among the addresses generated by the recurrent loads and other missing addresses in the form of stitch pointers. The stitch pointers are stored in a dedicated logical memory region called the stitch space. The stitch space is implemented physically as a multi-level stitch memory hierarchy including a stitch cache to hold the working set of stitch pointers, a stitch physical memory, and potentially stitch secondary storage. To minimize conflicts with the CPU data path, the stitch cache has a dedicated data path to the stitch physical memory. During subsequent traversal of the data structures, accessed addresses are used to index to the stitch cache to read the previously stored stitch pointers. These stitch pointers are used to generate timely context-based prefetches. The content-based prefetcher is triggered by two mechanisms. The first is the L2 cache misses as proposed initially in [18], which alleviates the learning time problem of the context-based component. The second triggering mechanism is to scan the cache lines associated with context-based prefetches for potential pointers. When pointers are identified, their target addresses are pushed as prefetches in order to improve SCP prefetching coverage. The use of the context-
57 based prefetches to initiate content-based prefetching improves the timeliness and traversal path ordering problems of the content-based prefetching technique. Generating timely prefetches is a major challenge for prefetching mechanisms. This challenge is due to the dynamic changes in the memory subsystem resource utilization at the time of prefetch launching. A fully utilized memory subsystem requires longer time to fulfill a prefetch request than an idle one. Previous solutions to this challenge included using a statically determined prefetch average latency [43, 31]. Using an average latency is a simple solution that does not require hardware support, yet it presents a limitation to achieve prefetch completeness. In order to overcome this limitation, a dynamic timeliness mechanism is proposed. The mechanism dynamically measures the memory subsystem response time for each prefetch request and continuously adjusts the targets of the stitch pointers to account for the memory subsystem response time. Stitch cache prefetching makes the following contributions to LDS prefetching: (1) definition of temporal regularity metrics (based on concepts of information theory) that allow a profiler to identify recurrent loads associated with static or dynamic traversal paths, without the need for source code, (2) proposing a context-based prefetcher that avoids the capacity problem using the logical stitch space and the maintenance overhead problem by physically implementing the stitch space using a hierarchical organization, and (3) proposing a mechanism to dynamically adjust the prefetch launching time based on the observed memory latency to improve prefetch timeliness using a dynamically tuned timeliness stitch queue. The next section discusses the limitations of context-based and contentbased prefetching. Temporal regularity metrics are proposed in Section 4.3 to enable identifying recurrent loads that exhibit such regularity in their memory accesses. These instructions are good candidates for exploitation by context-based
prefetching. Section 4.4 describes the details of stitch cache prefetching. Results are presented and analyzed in Section 4.5, and finally conclusions are drawn in Section 4.6.

4.2 LDS Prefetching Limitations

Several context-based and content-based prefetching mechanisms have been
proposed to hide the memory access latency that results from traversing LDS. Both classes of prefetching mechanisms suffer fundamental limitations that prevent them from being complete prefetching approaches. The following sections outline the salient limitations of both classes in detail, and how SCP addresses the identified limitations.
4.2.1 Context-Based Prefetching Limitations
Context-based prefetching mechanisms identify correlations amongst repeating addresses in the memory accesses and use the correlations to prefetch subsequent traversals of these addresses. Joseph [25] used a hardware Markov prediction table to record repeating addresses in the miss reference stream. Upon a subsequent miss, the prediction table was used to provide the next set of possible cache addresses that previously followed the miss address. Luk [31] stored the repeating addresses across the program data structures using additional storage fields, called jump pointers, and instrumented the program code to use them to generate software-based prefetches. Roth [43] used jump pointers to prefetch for load instructions that traverse a certain type of LDS referred to as the backbone nodes (nodes of the same type that point to LDS nodes of other type, called rib nodes.) These nodes were marked by a specialized compiler and their memory references were recorded into the jump pointers by a hardware component. The jump point-
ers were used in subsequent traversals of the backbone nodes to generate timely prefetches. The prefetches were then utilized by a content-based mechanism called dependence-based prefetching to improve coverage by generating prefetches for the rib nodes. An important contribution of Roth's work is the combination of a context-based mechanism and a content-based mechanism in a unified prefetching approach called cooperative prefetching. Unfortunately, since prefetching is triggered by the jump pointers, the limitations of context-based mechanisms discussed below play an important role in limiting the efficiency of the approach. Limitations of prior approaches to context-based prefetching can be summarized as follows:

• Capacity problem. Context-based techniques record the correlations amongst the accessed references in either fixed size hardware tables or in jump pointer fields. The hardware tables fail to record further relations once they saturate, while the jump pointer fields are designated to specific types of data structures (e.g. the backbone nodes.) Both approaches result in limiting the prefetching coverage.

• Learning Time. Context-based techniques need to learn the correlation among accessed data before they can launch prefetching. This learning time can consume a significant portion of the application run-time if the frequency of the regular subsequences is low, thus limiting the prefetching opportunities.

• Excessive Overhead. In Markov prefetching, a large hardware table is needed to record the access correlations (as large as an L2 cache in [25].) The time needed to access such a table can be a limiting factor in launching timely prefetches, especially at the high operating frequencies of modern cores where tens of processor cycles are needed to read from such
a table. On the other hand, jump pointers share the data path resources with the program data, which adds pressure on critical resources like the L1 data cache. This situation is exacerbated when the LDS nodes are small, which can occur when the backbone nodes are mainly used to connect the ribs, which usually have more data in their nodes. In addition, jump pointer maintenance at runtime is not a trivial task and the code needed to maintain the jump pointers presents a challenge that is yet to be quantified.
4.2.2 Content Directed Prefetching Limitations
Content-Directed Prefetching (CDP) [18] is a promising stateless prefetching mechanism that exploits data structure links present in the content of accessed cache lines. CDP provides good prefetch coverage due to its ability to identify future accesses. However, this technique has several limitations that result in poor accuracy and timeliness. To illustrate these limitations, Figure 4.1(a) shows an example of a linked data structure traversal and how the content-based prefetching mechanism [18] exploits the cache content (shown in Figure 4.1(b).) Assume that each data node occupies a cache line, nodes a − d have already hit in the cache, and the first miss address is for node e. Once the cache line for e is filled, the content-based hardware scans that line for potential addresses and prefetches those addresses (in this example, f , x1 and y1 are initiated as prefetches.) Upon the return of each of these prefetches, the associated cache lines are scanned for additional addresses, and subsequent prefetches are initiated. Figure 4.1 illustrates how an unaided hardware-only technique can prefetch addresses based on chained prefetching, and can accurately predict many future accesses (e.g. x3, x4, x5, y1, y2, y3, y4, f, and g.) However, this technique only partially covers the miss penalty of nodes
near the original miss request (x1 and x2.) Depending on the data structure and its traversal patterns, many near nodes may be accessed relative to a given node, leading to problems in even the best linked data structures.

[Figure 4.1: Content-Directed Prefetching. (a) A linked data structure and its traversal paths T1 and T2; (b) the corresponding memory content. Traversal order: a, b, c, d, e, x1, x2, x4, 1, 2, 3, 4. Prefetching order: f, x1, y1, g, x2, x3, y2, x4, x5, y3, y4. The legend distinguishes LDS nodes, potentially missing nodes, fully prefetched nodes, partially prefetched nodes, traversal paths, and stitched links.]

Overall, content-based prefetching techniques share several limiting factors that prevent them from achieving higher levels of prefetching efficiency. These factors include:

• Prefetch initiation point and timeliness effects. Content prefetching is initiated by a demand miss. Hence the approach cannot avoid the demand miss itself, as with node e in Figure 4.1. Furthermore, when the identified prefetch addresses are close to the current node in terms of the
near the original miss request (x1 and x2.) Depending on the data structure and its traversal patterns, many near nodes may be accessed relative to a given node, leading to problems in even the best linked data structures. Overall, content-based prefetching techniques share several limiting factors that prevent them from achieving higher levels of prefetching efficiency. These factors include: • Prefetch initiation point and timeliness effects. Content prefetching is initiated by a demand miss. Hence the approach cannot avoid the demand miss itself, as with node e in Figure 4.1. Furthermore, when the identified prefetch addresses are close to the current node in terms of the
logical structure of the LDS (e.g. nodes x1 and y1 are logically close to node e in Figure 4.1) there are few processor cycles available between the time that the address is identified and when the data will be required. This results in poor timeliness for these initial prefetches. Prefetch chaining was proposed in [18] to overcome this problem for subsequent accesses. In prefetch chaining, the prefetched cache lines are further scanned for potential pointers to prefetch. These chained prefetches have better chances of hiding the memory latency than the initial ones, as illustrated in the above figure; however, prefetch chaining suffers from low accuracy, therefore it was limited to short depth (a depth of six was proposed.)

• Traversal order and accuracy. In linked data structures that have tree structures of degrees greater than one, hardware-based content-prefetching cannot differentiate amongst pointers in the data structure to determine which represents the most important traversal path to begin prefetching. Hence content-prefetching cannot selectively prefetch only the traversal path the program will immediately follow within the data structure, but instead must speculatively prefetch multiple paths, wasting time and resources prefetching paths that will not be used (resulting in poor accuracy.) For example, prefetching nodes f and y1 when node e misses is wasted since the program traversal order followed the x1−x4 path.

• Isolated LDS trees and coverage. In programs that use multiple isolated trees of linked data structures (example: hash tables), content-based prefetching approaches cannot find relations between separated tree structures. For example, a content-based prefetcher cannot begin prefetching nodes 1 and 2 from the traversal path labeled T2 in Figure 4.1 based on having previously prefetched for path T1 because they are not logically connected, thus limiting the coverage of the approach.
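For concreteness, the following sketch shows the kind of content scan and chained prefetching described in this section; the pointer-candidate heuristic, line size, and depth limit are simplified assumptions rather than the exact filter used in [18].

```python
# Illustrative content scan for content-directed prefetching.

LINE_BYTES = 64
PTR_BYTES = 8
MAX_CHAIN_DEPTH = 6  # short chaining depth, as proposed in [18]

def looks_like_pointer(value, heap_lo, heap_hi):
    # Crude filter: a candidate must fall inside the program's heap range.
    return heap_lo <= value < heap_hi

def scan_and_prefetch(line_addr, read_word, issue_prefetch, heap_lo, heap_hi, depth=0):
    if depth >= MAX_CHAIN_DEPTH:
        return
    for offset in range(0, LINE_BYTES, PTR_BYTES):
        candidate = read_word(line_addr + offset)
        if looks_like_pointer(candidate, heap_lo, heap_hi):
            issue_prefetch(candidate)
            # Chained prefetching: the prefetched line is scanned in turn when
            # it returns (shown here as a recursive call for clarity).
            scan_and_prefetch(candidate & ~(LINE_BYTES - 1), read_word,
                              issue_prefetch, heap_lo, heap_hi, depth + 1)
```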
4.3 Characterizing Temporal Regularity in the Memory Accesses

Formally, a sequence is a set of objects (or events) arranged in a linear
fashion, such that the order of the members is well defined and significant [52]. A load instruction generates a sequence of memory accesses (to associated memory addresses) as a program runs. The ordering criterion of the sequence is time of the memory access. This sequence is referred to as the reference sequence for the load instruction. A reference sequence that contains frequently repeated contiguous subsequences of addresses (members of the sequence that are adjacent in time, and in the same order) is said to be temporally regular. A load instruction whose reference sequence is temporally regular is a recurrent load. Chilimbi [14] demonstrates that temporal profiles of reference sequences expressed in terms of hot data streams, which are the frequently repeating contiguous subsequences, are quite stable across multiple program runs. This important finding suggests that recurrent loads identified using profiles would still be recurrent during program execution. This motivates the use of profiles to identify recurrent loads that can then be targets for context-based prefetching. This section develops temporal regularity metrics that can be used during profiling runs to identify recurrent loads. The profiling runs can be performed either prior to running the program or as part of a run-time feedback optimization system. S = {s1 , s2 , s3 , s4 , s5 , s6 , s7 , s8 , s9 , s10 , s11 , s12 } = {a, b, c, a, c, a, b, c, a, b, c, d} Figure 4.2: A sequence with temporal regularity.
Figure 4.2 shows a reference sequence, S. Each element of the sequence, s_i, represents an address (a, b, c, or d,) and the subscript, i, represents the ordering criterion, which is temporal (e.g. s_1 = s_4 = s_6 = s_9 = a). This reference sequence exhibits temporal regularity due to the repeating contiguous subsequence {a, b, c} shown in boldface. A reference sequence can be modeled as a random event generated by a labeled deterministic finite state automaton (FSA) with probabilities on the transitions (i.e. a Markov source) [49]. Joseph [25] indicates that a first-order Markov source model (one whose next event depends only on the preceding event) performs comparably to higher-order models for context-based prefetching. Therefore, in this work a reference sequence is modeled as a first-order Markov source. Using this model, two metrics are defined to quantify temporal regularity in a reference sequence: conditional entropy and variability.

Conditional Entropy

The concept of entropy is used in information theory to measure how much randomness (conversely, information) there is in a random variable. The conditional entropy measures how much entropy a random variable Y has remaining if we have already learned completely the value of a second random variable X. It is referred to as the entropy of Y conditional on X, and is written H(Y|X) [51]. Let S = {s_1, s_2, s_3, ..., s_n} be a reference sequence of some load instruction observed during some profile run of a program under consideration. Further, let X denote the set of all unique addresses in S and assume that S conforms to a first-order Markov source. As S becomes more temporally regular (having more frequently repeating subsequences), its randomness gets lower and consequently its conditional entropy gets lower. Conversely, if S were a temporally irregular reference sequence that contained few repeating subsequences, then its conditional entropy would be high. Therefore, the conditional entropy of a reference sequence can
be used as a metric of its temporal regularity. The conditional entropy of S is defined as follows:

H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x)\, \log_2\bigl(p(y \mid x)\bigr) \qquad (4.1)
where p(x) is the probability of reference x ∈ X, p(y|x) is the conditional probability of referencing y given that x is referenced, and Y is the set of all references accessed after the reference x ∈ X.
a) Markov model: the states are the unique references of S; the transition probabilities are a → b: 0.75, a → c: 0.25, b → c: 1.0, c → a: 0.75, c → d: 0.25.

b) Probabilities:

    x    p(x)    y    p(y|x)    term
    a    4/12    b    3/4       -0.09
    a    4/12    c    1/4       -0.17
    b    3/12    c    1.0        0.00
    c    4/12    a    3/4       -0.10
    c    4/12    d    1/4       -0.17
                 Conditional entropy:  0.53

Figure 4.3: Markov model conditional entropy for the reference sequence of Figure 4.2.
Figure 4.3(a) illustrates the corresponding Markov model for the reference sequence, S, of Figure 4.2. The states of the FSA represent the set of unique references of S, referred to as X (known also as the alphabet of the Markov source). The probability on each transition between two states (the conditional probability of the destination given the source) is computed as the fraction of times the reference represented by the destination state appears after the source state reference, out of all instances of the source state reference. For example, the conditional probability of referencing address a after address c is 3 out of 4 (0.75). The conditional entropy for the sequence shown in Figure 4.3 is 0.53, as computed in Figure 4.3(b).
Variability

Low conditional entropy in a reference sequence is a necessary condition for temporal regularity. However, it is not a sufficient condition. To explain this, assume a reference sequence that consists of accesses to an array of values exactly one time during a program lifetime. This sequence will have conditional entropy of 0 because all elements of the sequence are specified given the first element. However, since there are no subsequences that repeat among these references there would be no temporal regularity in this sequence. Therefore, another metric is needed to prevent such a sequence from being erroneously identified as temporally regular. This metric is called the variability of the sequence. Given a reference sequence S, let |S| denote the number of elements in S (known as the sequence length) and |X| denote the number of unique elements in S. The variability, v, of the sequence S is then defined as:
v = \frac{|X|}{|S|} \qquad (4.2)
By definition v is a positive number that is less than or equal to 1. The variability of a sequence of references indicates how many of the sequence references are repeated. Therefore, for a sequence to exhibit temporal regularity, its variability must be less than 1. Furthermore, the lower the variability is, the more temporal regularity is expected to exist in the sequence. A sequence of references that has low conditional entropy and low variability is one that exhibits temporal regularity, while one that has either of these metrics high is expected to exhibit little or no temporal regularity. For example, the variability of the above mentioned array referencing sequence is 1 and hence such a sequence does not have temporal regularity although it has conditional entropy of 0.
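To make these two metrics concrete, the following is a minimal C sketch of how a profiler might compute them for a single load's reference sequence, modeling the sequence as a first-order Markov source as described above. The function and array names are illustrative, the bound on the number of unique addresses is arbitrary, and overflow checks are omitted; this is a sketch of the computation, not the profiler used in this work.

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    #define MAX_UNIQUE 512              /* illustrative bound on unique addresses */

    /* Return the index of addr in uniq[0..*n), adding it if not present. */
    static int find_or_add(uint64_t uniq[], int *n, uint64_t addr)
    {
        for (int i = 0; i < *n; i++)
            if (uniq[i] == addr)
                return i;
        uniq[*n] = addr;
        return (*n)++;
    }

    /* Compute conditional entropy H (Equation 4.1) and variability v
     * (Equation 4.2) for one load's reference sequence S of length len. */
    void temporal_regularity(const uint64_t *S, int len, double *H, double *v)
    {
        static uint64_t uniq[MAX_UNIQUE];
        static int trans[MAX_UNIQUE][MAX_UNIQUE]; /* times x is followed by y          */
        static int occ[MAX_UNIQUE];               /* occurrences of x in S             */
        static int out[MAX_UNIQUE];               /* occurrences of x with a successor */
        int n = 0;

        *H = 0.0;
        *v = 0.0;
        if (len <= 0)
            return;

        memset(trans, 0, sizeof trans);
        memset(occ, 0, sizeof occ);
        memset(out, 0, sizeof out);

        for (int i = 0; i < len; i++) {
            int x = find_or_add(uniq, &n, S[i]);
            occ[x]++;
            if (i + 1 < len) {
                int y = find_or_add(uniq, &n, S[i + 1]);
                trans[x][y]++;
                out[x]++;
            }
        }

        for (int x = 0; x < n; x++) {
            if (out[x] == 0)
                continue;                         /* last element only, no successor */
            double px = (double)occ[x] / len;     /* p(x), as in Figure 4.3(b)       */
            for (int y = 0; y < n; y++) {
                if (trans[x][y] == 0)
                    continue;
                double pyx = (double)trans[x][y] / out[x];   /* p(y|x) */
                *H -= px * pyx * log2(pyx);
            }
        }
        *v = (double)n / len;                     /* |X| / |S| */
    }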
Using these two metrics, a profiler can collect a segment of a load’s reference sequence and measure its conditional entropy and variability. If both metrics fall below predetermined thresholds, then the load can be identified as recurrent, and optimizations targeting recurrent loads, like context-based prefetching, can be applied to its reference sequence.

4.4
Stitch Cache Prefetching

SCP exploits the synergy between context and content prefetching techniques while improving its two components by addressing their aforementioned limitations. The context-based mechanism consists of three stages. The first stage is to identify recurrent loads that exhibit temporal regularity in their memory accesses. The second stage is to record the correlations between the recurrent load addresses and the program accesses using stitch pointers. The third stage is to use the stitch pointers to generate prefetches and to trigger the content-based prefetcher. The content-based prefetcher is used to prefetch identified pointers in the content of accessed cache lines while the context-based prefetcher builds its correlations. The following sections explain how both prefetchers work and present the required architectural support needed to implement SCP.
4.4.1
Identifying Recurrent Loads Using Temporal Regularity Metrics
In order to maximize the utilization of the prefetching resources and improve the accuracy of the prefetching mechanism, LDS prefetching mechanisms attempt to limit their scope to accesses generated by recurrent loads. Thus identifying the recurrent loads amongst the program load instructions is an important task in LDS prefetching. Several heuristics have been proposed to identify recurrent loads. Luk [31]
proposed that the compiler uses both type declaration information to recognize data objects belonging to LDS, and control structure information (e.g., loops and recursive function calls) to recognize when these objects are being traversed. Roth manually identified recurrent loads by profiling the application to identify load instructions that contribute the highest number of misses and then traced these loads back to the source code by inspection. Both of these approaches require access to source code. Cooksey [18] and Collins [17] identified a recurrent load as a load that accesses a pointer field by matching the upper N bits of its effective address to the upper N bits of the value being loaded. Although this approach does not require access to source code, it results in false positives. None of these techniques checks for temporal regularity, and therefore they may target loads that do not benefit from context-based prefetching. SCP overcomes these limitations (the need for source code and the false positives) by profiling the application and measuring the temporal regularity metrics. Load instructions whose accesses exhibit conditional entropy and variability below predetermined thresholds are marked as recurrent loads. The profiling can be done off-line before the application is run or online while the application is running. If run-time profiling is chosen, then the profiling task can be done periodically and may be triggered by other performance metrics. Such metrics may include prefetch accuracy, or the number of recurrent load accesses falling below some threshold during some window of memory accesses. In this work, the input of the simulated workloads consisted of traces, hence an off-line approach was used to profile each trace for a fraction of its instructions (results are reported for profiling 25 million instructions). During the profiling phase, the accessed memory addresses are stored along with the PC of the load instruction that generated them. At the end of the profiling session, conditional entropy and variability in the accesses of each load instruction are computed, and
load instructions that exhibit values below predetermined thresholds for these metrics are marked as recurrent loads. Determining the conditional entropy threshold is implementation-dependent. Conditional entropy of 1.0 indicates that each unique element of the reference sequence under consideration is on average followed by two unique elements. Thus, for SCP, recurrent load instructions need to have reference sequences with conditional entropy of less than 1.0. This choice is governed by the fact that SCP correlates one stitch pointer at a time for each unique address in a given reference sequence. Experiments in this research indicate that this conditional entropy threshold works well for SCP, and hence it is used for the results reported in this chapter. Another approach that may elect to correlate two stitch pointers with each address of a load’s reference sequence may use a conditional entropy of 1.0 as a threshold. As explained earlier, variability measures the portion of unique addresses in the reference sequence of a load instruction. Experiments indicated that a variability threshold of 0.7 is not too restrictive: it allows the inclusion of reference sequences containing long traversal paths that partially repeat during the profiling run, while eliminating load instructions that tend to build a graph and traverse it one time before destroying it (e.g., the benchmark gcc).
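As a concrete illustration of this selection step, the following sketch marks loads as recurrent using the thresholds discussed above (conditional entropy below 1.0 and variability below 0.7); the record layout and function name are hypothetical stand-ins for whatever per-PC state the profiler actually maintains.

    #include <stdint.h>

    /* Hypothetical per-load profile record. */
    struct load_profile {
        uint64_t pc;         /* PC of the load instruction                    */
        double   H;          /* conditional entropy of its reference sequence */
        double   v;          /* variability of its reference sequence         */
        int      recurrent;  /* set if the load is marked recurrent           */
    };

    #define H_THRESHOLD 1.0  /* one stitch pointer per unique address         */
    #define V_THRESHOLD 0.7  /* admits partially repeating traversal paths    */

    /* Mark loads whose reference sequences fall below both thresholds. */
    void mark_recurrent_loads(struct load_profile *loads, int nloads)
    {
        for (int i = 0; i < nloads; i++)
            loads[i].recurrent = (loads[i].H < H_THRESHOLD) &&
                                 (loads[i].v < V_THRESHOLD);
    }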
4.4.2
Stitch Pointers Installation
A major problem with hardware context-based prefetching is the saturation of the structures used to record the correlations amongst the addresses accessed by the recurrent loads. To overcome this problem, SCP records these correlations in an infinite logical space called the stitch space. The stitch space is parallel to the program physical data space. As recurrent loads execute, their accessed addresses (called recurrent addresses) are mapped to their cache line addresses in the stitch space. This is done by shifting each recurrent address to the right by
Figure 4.4: Stitch Cache Prefetching (TSQ: Timeliness Stitch Queue; SC: Stitch Cache; PRT: Prefetch Round-trip Timer; PRQ: Prefetch Queue; PB: Prefetch Buffer; SM: Stitch Memory; CDP: Content-Directed Prefetcher; SP: Stride Prefetcher).
a number of bits equal to the base-2 logarithm of the L1 data cache line size. For example, if the L1 data cache lines are 64 bytes, then each recurrent address is shifted to the right by 6 bits. This is done because prefetching works on cache lines. The mapped addresses in the stitch space are called stitch addresses. Figure 4.4 depicts an implementation of SCP. The gray components represent the additional resources needed to support SCP. The stitch space is implemented physically as a stitch cache (SC) that holds the working set of the stitch addresses and is extended to a physical stitch memory (SM) that occupies a portion of the system main memory. The stitch cache is used to record the correlations amongst the stitch addresses in the form of pointers from one stitch address, called the home address, to another stitch address, called the target address. These pointers are referred to as stitch pointers. As loads execute they are checked for the recurrence hint. If an executing load is identified as recurrent, its effective address (the address it is loading from) is mapped to its equivalent stitch address. The stitch address is entered at the tail of a queue called the timeliness stitch queue (TSQ) if it does not exist in any of
the TSQ entries. If this stitch address satisfies the timeliness criteria (explained next), then it is used as a target address for a stitch pointer that gets recorded in the stitch cache. The home address for this stitch pointer is the stitch address of the entry at the head of the TSQ. For example, the stitch address associated with node e of Figure 4.1 is recorded at the stitch address associated with node a in the stitch cache, which results in installing a stitch pointer from node a to node e. Note that stitch pointers correlate addresses of arbitrary recurrent loads. The recurrent addresses of these loads can reference any node type of any LDS of the program data structures, and these LDS can be of any topology. This is a major difference from the jump pointers proposed by Luk [31] and used by Roth [43] and others [32].
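A minimal sketch of the recurrent-address-to-stitch-address mapping described above, assuming 64-byte L1 data cache lines; the constant and function names are illustrative.

    #include <stdint.h>

    #define L1_LINE_BYTES 64
    #define L1_LINE_SHIFT 6                 /* log2(L1_LINE_BYTES) */

    /* Map a recurrent address to its stitch address by dropping the
     * cache-line offset bits, since prefetching operates on whole lines. */
    static inline uint64_t stitch_address(uint64_t recurrent_addr)
    {
        return recurrent_addr >> L1_LINE_SHIFT;
    }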
4.4.3
Using Stitch Pointers
The SCP logic identifies a recurrent load as it executes. The stitch address associated with an executing recurrent load is used to index into the SC. If a valid address exists in the SC, it is used to generate a prefetch that is queued to the prefetch queue (PRQ). Otherwise, the SM is checked to see if it contains the stitch address. If so, then a stitch cache line is brought from the SM to the SC, and the stitch pointer at that address is used to generate the prefetch. If no valid address is found in either the SC or the SM, then no further action is taken.
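The lookup path just described can be restated as the following C sketch, in which sc_lookup, sm_lookup, sm_fill_sc, and queue_prefetch are hypothetical helpers standing in for the hardware structures; it is a behavioral outline, not an implementation of the SCP logic.

    #include <stdint.h>

    /* Hypothetical helpers standing in for the SC, SM, and prefetch queue. */
    extern int  sc_lookup(uint64_t home, uint64_t *target);  /* 1 if a stitch pointer is found */
    extern int  sm_lookup(uint64_t home, uint64_t *target);
    extern void sm_fill_sc(uint64_t home);                    /* bring the stitch line into the SC */
    extern void queue_prefetch(uint64_t line_addr);

    #define L1_LINE_SHIFT 6    /* as in the stitch-address mapping above */

    /* Prefetch-generation step for one executing recurrent load. */
    void scp_on_recurrent_load(uint64_t effective_addr)
    {
        uint64_t home = effective_addr >> L1_LINE_SHIFT;   /* stitch address */
        uint64_t target;

        if (sc_lookup(home, &target)) {
            queue_prefetch(target << L1_LINE_SHIFT);        /* hit in the stitch cache    */
        } else if (sm_lookup(home, &target)) {
            sm_fill_sc(home);                               /* fall back to stitch memory */
            queue_prefetch(target << L1_LINE_SHIFT);
        }
        /* otherwise: no valid stitch pointer, take no further action */
    }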
4.4.4
Controlling Prefetch Timeliness
To achieve timely prefetches, the launching time for prefetched data needs to be dynamically changed according to the expected response time of the memory subsystem at the time of prefetch initiation. In context-based prefetching, correlated addresses (the home and target) need to be separated by sufficient time
Figure 4.5: Controlling prefetch timeliness (TSQ: Timeliness Stitch Queue; PRT: Prefetch Roundtrip Time; ta: time of access a; a': stitch address of a).
that allows the memory subsystem to prefetch the target as the home is accessed. To illustrate this, Figure 4.5 shows a timeline view of the accesses generated by the traversal of path T1 of the LDS of Figure 4.1. Assume that a stitch pointer was installed in previous traversals of this path between node a and node e. As the program accesses node a, the SCP logic will look up the SC and find the stitch pointer to the target address of node e, which will result in a prefetch for node e at time ta. Further assume that the memory subsystem response time for this prefetch is measured using a special timer called the prefetch roundtrip timer (PRT). For this prefetch to be timely, the time difference between ta and te should be greater than the value of the PRT. In SCP, the time available for the memory subsystem to fill a prefetched cache line (when the prefetch is launched) is approximated by the PRT value at the time of installing the corresponding stitch pointer. The stitch pointer is installed between two addresses only if the time difference between their accesses is greater than the PRT at the time of installing the stitch. The PRT is continuously updated with the time needed to fill the most recent prefetch in the system. The mechanism used to implement this dynamic tuning of the stitch target employs
the timeliness stitch queue (TSQ) in addition to the PRT. As recurrent addresses are queued into the TSQ, the access time is also recorded for each entry. To install a stitch pointer between the head and the tail entries of the TSQ, the difference between their corresponding access times is compared against the PRT value. A stitch is installed only if the difference is greater than the PRT value. Thus, the length of the TSQ is continuously changed to control the future prefetch launching time, which results in timely prefetches. This is in contrast to previously reported techniques that used a fixed queue size estimated by a compiler based on average memory latency [43, 31].
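A small sketch of this installation check follows; the TSQ entry layout and the helper names are hypothetical, and in hardware the comparison is performed by the SCP logic rather than by software.

    #include <stdint.h>

    /* Hypothetical TSQ entry and helpers. */
    struct tsq_entry {
        uint64_t stitch_addr;   /* stitch address of the recurrent access */
        uint64_t access_cycle;  /* cycle at which it was accessed         */
    };

    extern struct tsq_entry tsq_head(void);                  /* oldest entry              */
    extern struct tsq_entry tsq_tail(void);                  /* newest entry              */
    extern uint64_t prt;                                     /* prefetch round-trip timer */
    extern void sc_install(uint64_t home, uint64_t target);  /* record a stitch pointer   */

    /* Install a stitch pointer from the TSQ head (home) to the tail (target)
     * only if their accesses are separated by more than the measured
     * round-trip time, so the future prefetch can complete in time. */
    void maybe_install_stitch(void)
    {
        struct tsq_entry home   = tsq_head();
        struct tsq_entry target = tsq_tail();

        if (target.access_cycle - home.access_cycle > prt)
            sc_install(home.stitch_addr, target.stitch_addr);
    }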
4.4.5
Exploiting Content-Based Prefetching
The content-based prefetcher is triggered by two mechanisms. The first is L2 cache misses, as proposed initially in [18], which alleviates the learning time problem of the context-based component. The second triggering mechanism is to scan the cache lines associated with context-based prefetches for potential pointers. As pointers are identified, their target addresses are pushed as prefetches in order to improve SCP prefetching coverage.
4.4.6
Overcoming the Limitations of Context-Based Prefetching
SCP avoids the following limitations of context-based prefetching:

• Capacity problem. The capacity problem is resolved by the use of a logical stitch space. The physical implementation of the stitch space opens the lower levels of the memory hierarchy for storing the correlation information.

• Learning time. The CDP content prefetching component of the system exploits the content of the missing cache lines to launch prefetches for references that are not yet learned by the context prefetcher. This resolves the learning time problem associated with such mechanisms.

• Excessive overhead. The excessive overhead of accessing large correlation tables is resolved by sizing the SC to have an access time comparable to the L1 caches. This choice exploits the concept of spatial locality while improving the SC utilization by recording correlations at cache line granularity. On the other hand, recording stitch pointers in a dedicated memory space having a separate data path avoids pressuring critical resources like the L1 data cache or the buses between the L1 and the L2, as done by jump pointers. Finally, SCP, being a hardware approach, does not require software-based maintenance or changing the layout of the program data structures.
4.4.7
Overcoming the Limitations of Content-Based Prefetching
SCP overcomes the previously discussed limitations of the underlying content-directed prefetcher as follows:

• Prefetch initiation point and timeliness: Once stitch pointers have been recorded, they are used to prefetch target nodes before the program reaches them, causing the chain prefetching of the contents to start before a demand miss is incurred. This results in early initiation of the content prefetches and therefore improves their timeliness. This is illustrated in Figure 4.5, as node e would be prefetched when node a is accessed, since node e is linked to node a during previous traversals of the LDS. As node e is filled, its content is scanned and identified pointers are prefetched ahead of the program access to node e.

• Traversal order and accuracy: The content prefetcher scans cache lines and pushes addresses as prefetches in the order of their appearance in the cache line. In stitch cache prefetching, stitch pointers trigger the content prefetcher before any other pointers in the accessed cache line. This makes the content prefetcher give priority to the previously seen traversal path. Assuming that the same path is walked repeatedly, this makes its prefetches more timely. The other potential paths would still be prefetched by the content prefetcher as explained earlier, avoiding the traversal path problems associated with jump pointer prefetching techniques.

• Isolated LDS trees and coverage: The intrinsic problem of content-directed prefetching is that it cannot prefetch isolated LDS. This is overcome by opening the stitch pointers to addresses of any recurrent load. In fact, since stitch pointers are semantically irrelevant to the application, they can be filled with any value (e.g., any missing address). This is a major difference between stitch pointers and jump pointers as proposed in [43], where jump pointers are used to connect backbone nodes only (nodes of the same traversal type). In addition, jump pointers require the compiler to generate code to place the prefetch link at run-time, which affects the code size. An example of such utilization of stitch pointers is shown in Figure 4.5, where node x1 is connected artificially to node 1 of the other LDS.
    Benchmark   Input                          Skipped Instr.
    em3d        4096 2                         120M
    mst         1024 1                         350M
    llu         -i 500 -g .333 -d -t -n 341    100M
    gcc         cp-decl.i                      1B
    mcf         reference inp.in               1B
    parser      2.1.dict -batch < train.in     1B

Table 4.1: Simulated Benchmarks
4.5
Experimental Evaluation
4.5.1
Methodology
This chapter uses the benchmarks shown in Table 4.1. The benchmarks em3d and mst are from the Olden suite [37], llu [54] is a linked list traversal microbenchmark that demonstrates the behavior of the benchmark health from the Olden suite, and the benchmarks gcc, mcf, and parser are from the SPEC CPU2K INT suite. The SPEC benchmarks were built using the standard SPEC build, while the rest of the benchmarks were compiled using gcc version 2.95.4 20011002 with the -O2 flag. The input for each benchmark is shown in Table 4.1 under the Input column. Each benchmark was traced on silicon, and a number of instructions were skipped before tracing started. The number of skipped instructions for each benchmark is shown in the last column of Table 4.1. For the purpose of simulating CDP, each stored value was recorded while skipping instructions to generate a memory image, and each loaded and stored value was collected while tracing. The memory image was used during subsequent simulation to source the content of accessed cache lines, and was updated as each store instruction was simulated. The results shown in this work were obtained from a cycle-accurate microarchitecture simulator used for research within Freescale Semiconductor, Inc. The
simulator models a superscalar processor pipeline that can decode and complete up to three instructions per cycle. Instructions complete in program order, but can be issued and executed out of order. The pipeline includes multiple stages each of fetch, decode/rename, and issue. Integer operations take one cycle in execute. The branch predictor uses a 512-entry 4-way branch target buffer, a 2K-entry branch history table and a 2K-entry pattern history table. The memory hierarchy includes separate 32K 8-way I and D caches, a unified 1MB 8-way L2 cache, an integrated peripheral bus, and dual integrated DDR controllers. Loads and stores are pipelined. For the experiments of this chapter the hit latency of the L1 was set to 3 cycles, the average total latency for an L2 hit is 25 cycles, and the average total latency for a DRAM read is 150 cycles. The stride prefetcher of the base configuration used 8 per-PC prefetch engines. The prefetch queue had 4 entries and the prefetch buffer had 24 entries. The SC was 64K and 8-way set-associative.
4.5.2
Results
Measuring Temporal Regularity

Profiling the benchmarks was accomplished by collecting addresses generated by all load instructions during the first 25 million instructions of each trace. For each load instruction the conditional entropy (H) and the variability (v) were computed using the collected load addresses. Figure 4.6 shows the average of these metrics for the ten most frequently executed load instructions per benchmark during the profiling period. The third column in the figure, labeled Cont (their contribution to the memory accesses), shows the ratio of the number of accesses generated by these top ten load instructions to the overall number of accesses generated by all load instructions during the profiling period.
Figure 4.6: Average conditional entropy (H) and variability (v) for the top ten executed load instructions, and their contribution (Cont) to memory accesses, per benchmark.
Figure 4.6 indicates that the benchmarks em3d and llu have high temporal regularity because they both have low H and v values, and the ten most frequently executed load instructions, which demonstrate this temporal regularity, account for almost all of the program accesses. Therefore, these two benchmarks are expected to gain the most from a context-based approach. The benchmarks mcf and mst demonstrate similar characteristics but with a lesser degree of temporal regularity. Conversely, gcc and parser demonstrate high variability, indicating low temporal regularity.

Stitching Overhead

The percentage of stitch pointer updates to the number of load instructions executed during each benchmark simulation is shown in Figure 4.7. As expected, for gcc and parser this percentage is low since there are few loads that are recurrent, as indicated in Figure 4.6. For the rest of the benchmarks this percentage ranges from 6% for mst to 18% for em3d. This indicates the amount of overhead needed to maintain stitch pointers (as well as jump pointers). This significant overhead is a major problem with context-based prefetching mechanisms. SCP overcomes the negative impact of this overhead on the application by shifting it to the SCP logic and its associated stitch space.
Figure 4.7: Percentage of stitch pointer updates to load accesses.

Figure 4.8: Utilization of stitch pointers (prefetches issued per stitch pointer).
Stitch Utilization

Figure 4.8 shows, for each benchmark, the utilization of stitch pointers as the ratio of the number of prefetches issued by the stitch mechanism of SCP to the number of stitch pointers created. This figure indicates comparable utilization of stitch pointers over all the benchmarks, with an average of 0.45 prefetches issued per stitch pointer. This means that nearly one of every two stitch pointers was used to generate a prefetch. Note that these prefetches do not include content-based prefetches that were triggered by these stitch-based prefetches. These results suggest that there is room to improve the utilization of the stitch space in SCP. A possible follow-up study involves evaluating better mapping functions from the
program address space to the stitch space.

Figure 4.9: Prefetch coverage of CDP and SCP.
Prefetch Coverage

Figure 4.9 depicts the prefetching coverage of the two simulated prefetching mechanisms, CDP and SCP. The benchmark llu is a memory-intensive benchmark with almost all of its loads accessing LDS fields, and the LDS is fairly stable during the lifetime of the simulated trace. The high coverage of llu agrees with its very high temporal regularity for almost all of its memory accesses (Figure 4.6). With the exception of mst, SCP outperforms CDP in terms of prefetching coverage for all the benchmarks. This is an interesting exceptional case that calls for further investigation on how to better employ CDP in the SCP approach.

Coverage Contribution of Each Component in SCP

Figure 4.10 shows the percentage of used prefetches from each mechanism in SCP. The stride mechanism has the least contribution since the used workloads do not use regular data structures. The small contribution of stride prefetches is due to occasional fragments of data that exhibit spatial regularity [34]. Overall, the majority of used prefetches are stitch-initiated prefetches. The benchmarks mst
Figure 4.10: Percentage of used prefetches from each mechanism (stitch, content, stride) in SCP.
and parser demonstrate a higher contribution of content. This is attributed to higher usage of pointers contained within the LDS nodes of these benchmarks, as opposed to llu or mcf, where many of these contained pointers do not get referenced by the applications.
Figure 4.11: Overall prefetch accuracy of STR, CDP, and SCP.
Prefetch Accuracy

Figure 4.11 depicts the prefetching accuracy of the three simulated approaches: STR refers to the base configuration that includes a stride prefetcher, CDP refers to a stride and a content prefetcher, and SCP refers to the proposed combined approach.
As expected, the stride prefetching mechanism demonstrates the highest accuracy in almost all the benchmarks. However, a prefetch approach that issues very few prefetches may be accurate, but it will have limited coverage. Given the higher coverage of SCP (Figure 4.9), combined with an accuracy that almost matches a stride prefetcher, the approach improves prefetch completeness and should result in significantly better performance gains, as illustrated next.
Figure 4.12: Prefetch accuracy of each prefetch type (stride, content, stitch) in SCP.
Prefetch Accuracy of Each Component in SCP

Figure 4.12 depicts the prefetch accuracy of each type of prefetch issued by the SCP approach. Again, stride in the figure refers to the accuracy of stride-issued prefetches, content refers to the accuracy of prefetches issued by the content component, and stitch refers to the accuracy of prefetches issued due to stitch pointers. An important observation in this figure is that the stitch pointer prefetches are very accurate, with an average of about 0.7. This high accuracy is attributed to the use of the proposed temporal regularity metrics, which guide the stitch prefetcher in selecting load instructions that demonstrate repeatedly accessed subsequences in their reference sequences. Another observation is that both the stride and the content prefetches lose accuracy when they are combined
with stitch pointers in SCP, as opposed to their individual accuracy (Figure 4.11). This is because some of the accurate prefetches are now covered by the stitch-based approach, since it starts earlier than the other two components.
Figure 4.13: Prefetch timeliness of CDP and SCP (number of used prefetches, in units of 100K, binned as V. Early, Early, Timely, Late, and V. Late).
Prefetch Timeliness

Figure 4.13 shows the average number of used prefetches per benchmark for the two approaches, CDP and SCP. The used prefetches were divided into five bins based on the difference between when they were filled by the memory system and when they were needed by the program. V. Early refers to the prefetches that were earlier than their use by more than the memory latency in cycles. Early refers to the prefetches that were earlier than their use by more than half the memory latency in cycles. Timely refers to the prefetches that were filled within half the memory latency earlier or later than their use. Late represents the prefetches that were filled after their use by a number of cycles greater than half the memory latency and less than the full memory latency. Finally, V. Late represents the used prefetches that were filled after their need by a number of cycles greater than the memory latency. Figure 4.13 indicates that SCP is better than CDP at generating prefetches well ahead of their use. Combining this observation with the greater coverage and
accuracy as indicated in the previous figures illustrates that SCP is capable of improving the prefetch completeness.

Figure 4.14: Percentage improvement in IPC for CDP and SCP.
Performance Improvement

Figure 4.14 demonstrates the performance improvement in terms of IPC gain percentage for the CDP and SCP prefetching mechanisms. The benchmarks that are not expected to gain from SCP (gcc and parser) were not degraded by it. This is because no prefetching would be attempted unless recurrent loads are detected. Contrary to that, CDP had a negative impact on half of the benchmarks considered. The best performance gain was seen in llu, which agrees with the coverage of its misses. The benchmarks em3d and mcf demonstrate comparable gains in IPC since SCP demonstrated comparable coverage and accuracy for both of them. The benchmark mst, on the other hand, gained a lower IPC percentage, although SCP showed coverage and accuracy levels similar to mcf and em3d. This lesser IPC benefit is due to the fact that mst is less memory-bound than the other two (L1 miss rates of 7% vs. around 30%). Another interesting observation about mst is that the gain in IPC using CDP is negligible, although its coverage was better than that of SCP. This is due to the low accuracy of CDP, where the
bad prefetches were occupying system resources, thus negating the benefits of the good prefetches.

4.6
Conclusions

Prefetch completeness requires that a prefetching mechanism achieves high
coverage of the program would-be misses with high prefetching accuracy and timeliness. Achieving prefetch completeness is a delicate task that requires the simultaneous fine tuning of several prefetch approaches, as was demonstrated by cooperative prefetching [43] (combining a context-based and a content-based prefetcher). However, cooperative prefetching suffered the limitations of its underlying mechanisms. This dissertation presents stitch cache prefetching (SCP) to improve prefetch completeness by combining and improving context-based and content-based prefetching mechanisms. Context-based prefetching is improved by (1) better exploiting temporal regularity in the memory accesses via the introduction of temporal regularity metrics, and (2) the use of a logical stitch space with a practical physical implementation of this space. The resulting context-based approach does not suffer the capacity problem of Markov tables, nor does it require complex manipulation of source code. Content-based prefetching is improved by associating it with the temporally regular accesses, which improves upon its timeliness and accuracy. The experiments indicate that SCP improves prefetch completeness by achieving coverage of about 40% over a system that includes a stride prefetcher, while hitting an average accuracy of about 60% using timely prefetches. This improvement in prefetch completeness results in improving the IPC by up to 40%, and by 14% on average, for a mixed set of temporally regular and temporally irregular benchmarks that use LDS. Simultaneously, SCP does not significantly degrade performance for benchmarks that do not exhibit temporal regularity.
Future work includes better utilization of the content-based component, as one of the benchmarks demonstrated that a more aggressive content-based approach can result in improving the prefetch completeness.
Chapter 5 Exploiting the Topology of Linked Data Structures for Prefetching
5.1
Introduction

Traditional prefetch mechanisms rely on regularity in the memory reference
stream to accurately predict future memory references. Array memory accesses can be accurately predicted using hardware stride-based (linear) prefetching techniques. Applications that use dynamic data structures may only exhibit some regularity in periodic fragments of their data accesses. This situation is exacerbated by the increasing popularity of object-oriented programming languages such as C++ and Java, which make significant use of Linked Data Structures (LDS). Exploiting temporal regularity in the memory accesses associated with LDS was shown in the previous chapter to aid prefetching for LDS accesses that do not exhibit spatial regularity. Combining the exploitation of spatial and temporal regularity with the exploitation of the content of accessed cache lines was demonstrated to hide up to 95% of the misses in some workloads (see Section 4.5). Unfortunately, this combination was able to hide only an average of 40% of the misses in a mixed set of applications, as was illustrated in Chapter 4. Thus, there remain misses associated with LDS that do not exhibit exploitable regularity. For these misses, other characteristics need to be considered and studied. In this chapter, the topology of the LDS is investigated to aid prefetching. The topology in this context refers to the way the LDS is connected and the order
in which the LDS fields are traversed by the application. In order to exploit the LDS topology, a novel mechanism called Compiler-Directed Content-Aware Prefetching (CDCAP) is proposed to illustrate the integration of relevant information about LDS topology, available through profile and compile analysis, with a hardware prefetching system. CDCAP builds on the strengths of both software and hardware prefetching paradigms, and utilizes well-established compiler techniques as well as profile information to guide the prefetching effort. The global view and understanding of LDS by the compiler, combined with the dynamic run-time knowledge of the hardware part of the prefetching system, creates a promising and flexible framework for prefetching. The technique borrows from thread-based prefetching the pre-computation aspect of prefetching, but uses only limited hardware resources in the memory system to speculate ahead of the program memory requests. Simulation results show that the proposed CDCAP approach improves the overall performance of integer applications with significant amounts of accesses to LDS. The improvements are due to greater coverage of the original cache misses with new prefetch requests. Overall, by building on the strengths of both software and hardware prefetch approaches, CDCAP results in a more timely prefetching scheme when compared to previous hardware-based content-aware schemes and dependence-based correlation prefetching approaches. The work illustrates that future techniques to improve memory system performance will require the compiler to play a more active role in analyzing and communicating a program’s data flow and information to the architecture system. The remainder of this chapter is organized as follows. Section 5.2 provides a brief overview of works related to the concept of linked data structure prefetching and the intuitive motivation behind the proposed scheme. Next, Section 5.3 presents an overview of how the compiler-directed approach facilitates guiding a content-aware prefetching engine. The effectiveness of the proposed approach in improving performance and exploiting instruction repetition is presented in Section 5.4. Finally, the chapter is summarized in Section 5.5.

5.2
Motivation and Background

Linked data structures (LDS) consist of dynamically connected and allocated
nodes, generally taking the form of trees, graphs, and lists. In order to prefetch a specific node of the LDS, the address of that node must be obtained from an adjacent node. This sequential access of LDS nodes is referred to as the pointer traversal or pointer chasing problem. Programs tend to traverse LDS using loops or recursive procedure calls. Future references in this chapter to loop traversal apply equally to recursive procedure calls. While accessing each node of the linked data structure, the program typically performs some computation on fields of the node. This computation time can be exploited by the prefetching algorithm to make the next node available before the program actually requests it (by overlapping the computation time with the next node's loading time). Unfortunately, in general the computation time within traversal loops is much shorter than the access latency to lower levels of the cache and memory system. As such, the prefetcher needs to make its requests as soon as possible, even before the traversal loop is entered. Since linked data structures imply serial processing of the current node to obtain the next node's address, and since next node addresses are not predictable, there is little chance for better prefetching to take place in a timely fashion. The inherent problem of prefetching dynamically linked data structures requires compiler analysis to synthesize relevant program information that gets used prior to the program-ordered access. By conveying this information to a hardware prefetcher, the system can iterate ahead on the nodes of the LDS on its own rather than
waiting for the progression of the program in the main processor core. At the same time, the extent and depth of prefetching have to be properly controlled. Otherwise, prefetching can exhaust the memory bandwidth with prefetches that may not be useful and can replace previously prefetched cache lines that are yet to be used. Software-only prefetching techniques cannot achieve this effect, and hence fall behind when the computation time is short. Hardware techniques are good at proceeding ahead of the actual loads by speculating loads using correlation tables, or by scanning cache lines for potential addresses, but these techniques require substantial hardware state.
5.2.1
Dynamic Data Structure Analysis
Dynamic data structures can be classified using static and dynamic characteristics. Static characteristics include the topology of the data structure and the way the data structure is connected to other structures. Topology includes the size of the data structure and the relative distance between fields of data encapsulated in it. Such parameters are significant for a prefetching technique. Several kinds of data load references are identified depending on the nature of the data they access: linked, pointer, global, and array. Furthermore, linked loads that access dynamic LDS have several different types: traversal, pointer, direct, and indirect.

Traversal: Traversal loads access parts of the data structures for advancing a cyclic region of code to the next data structure node. These loads represent the basic concept of finding the next iteration of traversal loops and are a fundamental barrier to prefetching.

Direct: A recursive direct access occurs when a load is executed that references a data field within a recursively fetched data structure. The data is used directly within computation operations but does not form the base address for load and store instructions. Such loads generally occur immediately following the access of the traversal loads.
Pointer: Accesses that occur to data items that are based on a data field of the current data structure are linked data structure pointers. These accesses are not recursive, although other accesses can be based on the pointer data access.

Indirect: A load which is based on the address of a linked pointer access is classified as indirect. Indirect loads reference other data items from pointers other than the primary linked traversal pointer.
    while (node != NULL) {
        if (node->value >= 0)          /* (B) direct load    */
            count++;
        if (node->ptr != NULL)         /* (C) pointer load   */
            a = node->ptr->weight;     /* (D) indirect load  */
        node = node->next;             /* (A) traversal load */
    }

Figure 5.1: Basic example of the four linked data structure load types: (A) traversal, (B) direct, (C) pointer, and (D) indirect.
A basic example of the four linked load classes is illustrated in Figure 5.1. Figure 5.1(A) shows the base linked data structure being accessed for the purpose of traversal. An important note is that many data structures have multiple traversal points for reaching the next data item. Figure 5.1(B) describes the
access of a direct linked data field from the base traversal pointer. Likewise, Figure 5.1(C) demonstrates a pointer reference from the linked traversal data, which loads from a location potentially unrelated to the current data structure or the current data structure type. An indirect load is illustrated in Figure 5.1(D), which generates a memory request after having accessed both the traversal load and the pointer load. The indirect load represents a load that is very difficult to prefetch in a timely fashion since its address depends on chained pointers, all having the potential of missing in the cache system.

    Benchmark    Traversal   Direct   Pointer   Indirect
    bh               32        68        0         0
    bisort           47        53        0         0
    em3d             19        81        0         0
    health           23         0       65        12
    mst              24        76        0         0
    perimeter       100         0        0         0
    power             0         0        0         0
    treeadd           0         0        0         0
    tsp              32        68        0         0
    176.gcc           0         0        0         0
    181.mcf           4        32       13        51
    300.twolf        17        71        9         3

Table 5.1: Percentage of dynamic execution of linked data structure access types.
Table 5.1 shows the breakdown of dynamic linked data structure load categories. These load classes are similar to the categories explored by Roth and Sohi in their evaluation of hardware jump pointers [43]. On average, most of the data accesses through linked data structures consist of direct fields and the traversal pointer to the next data item. However, the benchmarks 181.mcf and health indicate the use of complex data structures, in that pointer and indirect accesses starting from the main traversal data structure occur very frequently.
Although several methods attempt to overlap memory accesses with other computation in the processor in order to hide the memory latency, few of the hardware systems distinguish between the static access types of loads. Prefetching attempts to fetch data from main memory to the cache before it is needed, to overlap the load miss latency with other computation. Both hardware [46, 26, 4, 11, 33, 22, 19] and software [40, 28, 36, 12, 13] prefetching methods for uniprocessor machines have been proposed. However, most of these methods focus on prefetching regular array accesses within well-structured loops, where arrays are indexed by the loop iteration variable or some other induction variable. These loads have access patterns primarily found in numeric applications. Other methods geared toward integer codes [31, 30] focus on compiler-inserted prefetching of pointer targets, and are orthogonal to the methods discussed in this dissertation. Figure 5.2 illustrates the distribution of dynamic load classes in the most relevant pointer-based benchmarks in the Olden and SPEC suites. These benchmarks were selected for their high percentage of linked data structure accesses. Other benchmarks that do not exhibit large percentages of dynamically linked data structure accesses have been removed from this chapter to bring a more detailed focus to the proposed approach. The dynamic numbers, when compared to previous generations of SPEC benchmarks, illustrate greater use of linked data structures. Figure 5.2 also illustrates the degree to which the linked data structure accesses deter program performance by causing a large percentage of memory system cache misses. According to the cache simulations, which were run using the IMPACT compiler infrastructure, linked data structure load operations account for more than 30% of some programs' execution. Such loads contributed 20% of the total data cache misses. It is important to note that although the benchmarks achieve nearly an average of 90% hit rate, 40% of EPIC execution is dominated by memory system stalls. Although out-of-order processor systems
Figure 5.2: Dynamic distribution of linked data structure access types, shown as the distribution of missing loads per benchmark (categories: LDS, pointer-parameter, stack-local, array, global, and other).
can adapt to cache misses by executing other instructions surrounding the load, this is not the case for EPIC machines. For an EPIC in-order design, the execution pipeline stalls when another instruction attempts to use the result of a load that has missed in the cache. As this execution semantic is a substantial performance bottleneck, it is critical to improve memory efficiency by initiating prefetch requests in the memory system. Although the distribution of the linked data structure load operations varies over the benchmarks, the presence of such high percentages calls for the use of a special prefetching technique for recursive loads. Since the occurrence of linked data structure loads is relatively high, the proposed approach is to use specific compiler prefetch instructions to eliminate the need for the processor to process the current linked data structure load in order to find the next data structure to prefetch. Although the technique still uses explicit compiler-inserted prefetches, the data does not need to be brought
back to the processor to generate a linked-structure prefetch. Instead, the cache system can be guided by the compiler to an efficient way of scanning a cache line retrieved on a miss for future virtual addresses (representing traversal, pointer, and direct loads). Indirect loads can then be viewed as direct loads being sourced by linked pointer loads.
5.2.2
Memory System Evaluation
Several measures have been used to evaluate prefetching algorithms. Three measures are used here to evaluate the different techniques addressed in this work: accuracy, coverage, and timeliness. Accuracy of the prefetching technique refers to the number of usable cache lines among the total number of lines prefetched. Prefetching is a speculative effort aiming at loading values that are expected to be used by the processor. If the accuracy of the prefetcher is poor, then precious bandwidth will be wasted. Coverage refers to the number of program loads that are serviced by a prefetched cache line. An accurate prefetcher may cover few load misses, which results in poor enhancements to the overall system performance. The more loads that a prefetching technique can service, the higher the performance gains it is capable of achieving. Software prefetching techniques are very accurate because the prefetch instructions are inserted in the proper control path and are issued only where they will be used. These techniques tend to cover good ratios of loads as well. Nevertheless, these techniques tend to achieve little overall performance gain when prefetching for LDS. This is due to the fact that little time is usually available between the issue of the prefetch instruction and the actual load; hence, only a small part of the memory latency is hidden. This aspect of prefetching is measured by timeliness. Prefetching techniques trade off one of these attributes against another. Several heuristics have been used in an attempt to achieve a working balance. Software
techniques achieve higher coverage and accuracy but have poor timeliness. This is due to the fact that there is usually little work in the traversal loop of the data structure. Context-based techniques build correlation tables among different loads in hardware [10, 25]. These tables are used to speculatively issue related addresses that have been associated with a certain address. These techniques trade off correctness for timeliness. They, however, require large tables and need training time to be able to achieve acceptable levels of correctness. Content-directed prefetching [18] aggressively prefetches, losing accuracy for coverage. Reported work clearly shows that achieving a good balance across a number of programs is a difficult task. The prefetching algorithm must adapt to the running program in order to achieve the best balance. In the evaluation and comparison of this work with existing techniques, the above-mentioned measures were used. Two reported techniques were selected to evaluate and compare to this work. The first is the content-directed prefetching technique [18], which examines each address-sized piece of data as it is moved from memory to the L2 cache. Data values that are likely to be addresses are then translated and pushed to a prefetch buffer for prefetching in anticipation of their future use. As prefetch requests return data from lower memory levels, their content is also examined to retrieve subsequent candidates. This technique achieves an effect similar to thread-based prefetching by running ahead of the application. The second technique that is examined is dependence-based prefetching (DBP) [42], which works by establishing producer-consumer relationships among load instructions and using this relationship to issue a prefetch on behalf of the consumer load whenever the producer load is accessed. The technique is representative of context-based prefetching techniques. DBP achieves an effect similar to software prefetching once the relationships are established. The same technique is used in a later study by the authors in [43] to enhance its timeliness, by trading
off its accuracy.
Figure 5.3: Load miss rates at the L1 and L2 caches for the simulated benchmarks.
Figure 5.3 shows L1 and L2 miss rates for a selected number of benchmarks. The selected benchmarks are from the Olden benchmark suite and were used in [42, 43, 9] to evaluate their prefetching techniques. The use of the Olden suite is augmented by a few SPEC2K benchmarks that were found to make use of LDS. The rest of the SPECint2K suite offered little chance for improvement, either because the benchmarks start with very small miss rates or because they do not use dynamic LDS. The interested reader is referred to [8] for a complete characterization of the memory performance of the SPEC2K suite. To the best of the author’s knowledge, all existing prefetching techniques are designed to work at a specific level in the memory hierarchy. It is clear from the figure that the benchmark em3d is best suited for prefetching into the L1 cache. On the other hand, perimeter and treeadd are expected to benefit from L2 prefetching techniques. Health and
mst may benefit from both. The rest of the benchmarks have moderate chances for improvement from both classes of prefetching techniques. DBP is designed to issue prefetches at the L1 level, because it issues the prefetches within one iteration of the traversal loop. The technique cannot iterate speculatively ahead of the main program, because it uses the main program loads to trigger further prefetching. This makes applying DBP at either level of the cache have almost the same effect on the second cache level. DBP's ability to produce timely prefetches depends on the computation time that is available within the traversal loop body, or between the producer and consumer loads in general. This makes the technique work better at the first-level cache for programs that have high miss counts at the L1 cache. At the same time, it limits its ability to hide the much higher L2 latency. If this technique is applied at the L2 level, performance degrades because there is only a very small window between issuing the prefetch and needing its value in the program. This is due to the nature of integer programs, where the amount of work within a traversal loop is very small, and to the fact that processing speeds are becoming far faster than the memory.

5.3
Compiler-Directed Content-Aware Prefetching

Prescient instruction prefetch involves the use of helper threads to perform
instruction prefetch to reduce I-cache misses for single-threaded applications. Here the term prescient carries two connotations: One, that the helpers are initiated in a timely and judicious manner based upon a global analysis of program behavior, and two, that the instructions prefetched are useful as the helper thread follows the same path through the program that the main thread will follow when it reaches the code the helper thread prefetches. The compiler-directed content-aware prefetching (CDCAP) technique evolves
the idea of hardware-directed content-directed prefetching (CDP) [18] by introducing a set of hardware components, collectively called the Hardware Prefetching Engine (HPE), and software instructions to eliminate the need to dynamically detect dynamic data structures in prefetched cache lines. Also related to the proposed CDCAP approach is a recent prefetching technique [47] that designates a more sophisticated memory system controller outside the scope of the processor for analyzing memory access requests and making prefetches. In contrast to the hardware CDP prefetching scheme and the memory controller prefetcher, the compiler-directed approach statically designates data structure locations specifically consumed by the HPE to carry out a simple task list. The task list is carried out by a simple controller called the Linked Prefetching Control Engine. Overall, the proposed prefetching architecture is depicted in Figure 5.4 and consists of the following components:

Content-Aware Prefetching Buffer (CAPB): The CAPB is a dedicated cache that stores prefetched data. This cache is accessed simultaneously with the main cache that the HPE is attached to; if requested data is available in the CAPB and misses in the main cache, the request is serviced from the CAPB and the main cache line is filled accordingly. The CAPB shares the same bus into the next memory level with the main cache. The HPE is allowed to issue prefetches only during idle bus cycles.

Linked Prefetching Control Engine (LPCE): The LPCE is the prefetching hardware component that validates the compiler directives and the contents buffer contents to direct future prefetches based on returned cache contents.

Static Data Structure Prefetching Instruction, the Harbinger: The harbinger is used for conveying static data structure layout information to the LPCE. Each instruction has a unique ID. Harbinger instructions are stored in a dedicated table in the HPE, shown in the figure as HIS, the Harbinger Instructions Store.

Dynamic Invocation Hints: EPIC-style architectures allow the compiler to insert hints within instructions to convey information to the memory system. Hints are inserted into
load instructions that bring the LDS base address from memory. These hints are consumed by the LPCE. The LPCE, upon receiving the ID of a previously received prefetch instruction, will start a recursive prefetch effort. The effort is based upon the information contained in the prefetch instruction and the loaded address of the hint-specifying load.
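To make the interaction between the harbinger and the invocation hint concrete, the following is a hypothetical C-level view of the traversal loop of Figure 5.1 annotated with the compiler-inserted directive and the hinted base-address load. The intrinsic name, the harbinger ID, and the depth value are invented for illustration; the real directive is an ISA instruction encoded as shown in Figure 5.5.

    #include <stddef.h>

    struct node { int value; struct node *ptr; struct node *next; };

    /* Stand-in for the compiler-emitted harbinger directive; in a real system
     * this would be a single ISA instruction, not a function call. */
    static void harbinger_traversal(int id, size_t trav_off, size_t ptr_off, int depth)
    {
        (void)id; (void)trav_off; (void)ptr_off; (void)depth;
    }

    int count_positive(struct node **list_head_ptr)
    {
        /* Harbinger (traversal format): the link field lives at offsetof(next),
         * an associated pointer field at offsetof(ptr); the expected traversal
         * depth would come from profiling. The ID (3) is arbitrary here. */
        harbinger_traversal(3, offsetof(struct node, next),
                               offsetof(struct node, ptr), 8);

        /* This load of the LDS base address would carry the hint naming
         * harbinger 3, triggering the HPE to run ahead of the loop below. */
        struct node *n = *list_head_ptr;

        int count = 0;
        while (n != NULL) {
            if (n->value >= 0)
                count++;
            n = n->next;
        }
        return count;
    }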
The organization of the memory system with content-based prefetching [18] is illustrated in Figure 5.4. This overview depicts the content prefetcher and stride prefetcher operating between memory and the level-2 cache. The prefetching queue prioritizes demand requests over prefetch requests, and content-based prefetches over stride-based prefetches. Similar to hardware-based prefetching, prior to issuing a prefetch request, the main cache and system bus arbiters are checked for any potential outstanding block requests. The HPE keeps a counter for each prefetch instruction initiated. In essence, a small state machine is implemented to repeat prefetch requests to the memory system up to the estimated depth of the linked data structure iteration. Likewise, the engine uses the prefetch encoding to maintain a request mapping for the incoming contents brought from the memory system. This encoding invokes further prefetches based on the retrieved contents, achieving an effect similar to what thread-based prefetching techniques try to achieve. The advantage over thread-based techniques here is that the next node prefetching starts as soon as the prefetched node arrives from lower memory levels, whereas in thread-based prefetching the next node prefetch has to wait until the current node address reaches the processor and gets processed there.
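The run-ahead behavior of the engine can be sketched in software as follows, purely as a conceptual model: it assumes each prefetched line can be examined for the next link as it returns, and the structure fields and helper names are invented. In hardware this loop is realized by the small state machine described above, operating asynchronously as lines are filled.

    #include <stdint.h>

    struct traversal_harbinger {
        uint8_t id;         /* harbinger ID                             */
        uint8_t trav_off;   /* offset of the link pointer within a node */
        uint8_t depth;      /* expected number of nodes to run ahead    */
    };

    /* Hypothetical helpers: issue_prefetch returns the filled line (in
     * hardware the continuation happens as each line comes back). */
    extern const void *issue_prefetch(uint64_t addr);
    extern uint64_t read_pointer(const void *line, unsigned byte_offset);

    /* Conceptual view of the LPCE recursion for a traversal harbinger:
     * starting from the base address delivered by the hinted load, chase the
     * link pointer found at trav_off in each returned line, up to depth. */
    void lpce_recursive_prefetch(const struct traversal_harbinger *h, uint64_t base)
    {
        uint64_t addr = base;

        for (int i = 0; i < h->depth && addr != 0; i++) {
            const void *line = issue_prefetch(addr);   /* prefetch node i          */
            addr = read_pointer(line, h->trav_off);    /* address of the next node */
        }
    }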
5.3.1
Compiler-Directed Content-Aware Prefetching Directives
The CDCAP prefetching approach involves the introduction of a new instruction, the prefetch harbinger. Figure 5.5 illustrates the harbinger instruction used to communicate the exact nature of the linked data structure to the HPE.
Figure 5.4: Content-Aware Prefetching Memory System.
Harbinger Instruction Format (three 32-bit encodings):
Traversal harbinger (ATT = 00): Opcode (5 bits) | ID (5 bits) | Trav_Off (5 bits) | Offset 2 (5 bits) | Offset 3 (5 bits) | Depth (4 bits) | L (1 bit) | ATT (2 bits)
AOP harbinger (ATT = 10): Opcode (5 bits) | ID (5 bits) | Offset 1 (5 bits) | Offset 2 (5 bits) | Offset 3 (5 bits) | Depth (4 bits) | L (1 bit) | ATT (2 bits)
Indirect harbinger (ATT = 01): Opcode (5 bits) | ID (5 bits) | Trav_ID (5 bits) | Ptr_Off (5 bits) | Ind_Off (5 bits) | unused (4 bits) | L (1 bit) | ATT (2 bits)
Figure 5.5: Compiler-directed prefetch harbinger instruction.
The new extensions designate different aspects of how the data structure is used in an upcoming LDS traversal loop and the degree to which the prefetching engine should initiate recursive prefetches. The proposed prefetching instruction consists of four fields:
Harbinger ID: A unique ID that is associated with each harbinger instruction.
Access Type Template: Indicates the type of the harbinger instruction. Three types are recognized:
Traversal Harbinger: Describes the offset of the linking address within the data structure, along with additional pointer loads that will be accessed within the traversal loop. These offsets are included in the Offset Vector field. This harbinger serves for traversing the nodes of the data structure along with the associated pointer loads.
Array Of Pointers (AOP) Harbinger: Designed to assist prefetching of arrays of pointers, which may contain pointers as well as other data types. The first offset in the offset vector is analogous to the increasing offset of arrays, and the location of the pointer within the array element is encoded in the second offset of the harbinger.
Indirect Harbinger: Describes an indirect load that needs to be prefetched with each node. The load referred to as Access 3 in Figure 5.7 is of this type. The indirect harbinger has a field for the associated traversal harbinger ID (Trav_ID) and for the relevant pointer offset within that traversal harbinger (the Ptr_Off field). The LPCE controls the invocation of related harbinger instructions at every iteration of the recursive traversal.
Offset Vector: The contents of data retrieved from memory, whether due to a request initiated by the compiler-inserted hint or by subsequent recursive prefetches, are analyzed with respect to the offset vector. The vector, in conjunction with the access type template, is used to determine whether the linked data structure requires additional spatial prefetches (next line or previous line) or whether a pointer/traversal prefetch is necessary.
Recursion Depth: The depth field informs the HPE of the expected number of traversals across different iterations of loops. The depth is a weighted encoding that allows a large range of prefetch iterations to be expressed; it is determined from profile runs and is used by the HPE to guide the recursion. CDP, in contrast, uses a path reinforcement measure to control its recursion; Cooksey reports in [18] that path reinforcement had only a minor effect on his prefetcher's performance.
Level (L): Indicates the level in the cache hierarchy at which to apply the recursive prefetch.
The harbinger instruction overcomes a limitation of the base hardware approach to content-aware prefetching, which indiscriminately speculates that every potential data item that appears to be a virtual address will be relevant to future execution. There are a number of scenarios in which the hardware technique will detect values that do not correspond to virtual addresses of pointers. Likewise, the hardware may prefetch virtual addresses that do correspond to the linked data structure but are not needed by the code region in the vicinity of the cache miss. This high degree of speculation forces the use of the CDP technique at the second or even the third level of the memory hierarchy, because otherwise the amount of traffic it generates negatively impacts performance. This was confirmed by the study and evaluation reported in this chapter. On the other hand, by aggressively speculating on each potential address in the cache
line, early prefetching effects start appearing for small to medium-sized prefetch buffers (less than 8 KB). Early prefetching refers to the case where new prefetches start evicting previous prefetches from the prefetch buffer before they have been used. This makes the CDP technique work well only with large prefetch buffers, with sizes of 128 KB and larger, again limiting its application to level 2 and beyond in the hierarchy.
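As an illustration only, the three 32-bit encodings of Figure 5.5 could be written down as C bit-fields roughly as follows. The type and field names are hypothetical, and bit-field ordering in C is implementation-defined, so this is a sketch of the information content rather than the ISA-level bit layout.

#include <stdint.h>

/* Sketch of the traversal-harbinger fields (ATT = 00) in Figure 5.5. */
typedef struct {
    uint32_t opcode   : 5;
    uint32_t id       : 5;  /* unique harbinger ID                        */
    uint32_t trav_off : 5;  /* offset of the next-node link within a node */
    uint32_t offset2  : 5;  /* additional pointer-load offsets            */
    uint32_t offset3  : 5;
    uint32_t depth    : 4;  /* expected recursion depth (profile-derived) */
    uint32_t level    : 1;  /* cache level to prefetch into (L)           */
    uint32_t att      : 2;  /* access type template: 00 = traversal       */
} traversal_harbinger_t;

/* Sketch of the indirect-harbinger fields (ATT = 01). */
typedef struct {
    uint32_t opcode  : 5;
    uint32_t id      : 5;
    uint32_t trav_id : 5;   /* ID of the associated traversal harbinger */
    uint32_t ptr_off : 5;   /* pointer offset within that traversal     */
    uint32_t ind_off : 5;   /* offset of the indirectly loaded field    */
    uint32_t unused  : 4;
    uint32_t level   : 1;
    uint32_t att     : 2;   /* access type template: 01 = indirect      */
} indirect_harbinger_t;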
5.3.2
Linked Data Structure Example
LDS Traversal Example - Health

(a) C-code segment (function entry and bookkeeping code precede the loop):

check_patients_waiting(struct Village *village, struct List *list) {
    int i;
    struct Patient *p;

    while (list != NULL) {
        i = village->hosp.free_personnel;
        p = list->patient;            /* op 10 */
        if (i > 0) {
            /* Manipulate LDS */
        } else {
            p->time++;                /* op 12, op 13, op 14 */
        }
        list = list->forward;
    }
}

(b) L-code segment:

Cb 3:  op 10:  load r3, r2(0)      # load a pointer node field
       op 11:  bgt r4, 0, cb4      # jump to a different code region if condition +ve
Cb 5:  op 12:  load r1, r3(4)      # indirect load of a value from the above pointer
       op 13:  add r1, r1(1)       # increment the value
       op 14:  store r3(4), r1     # store it back
Cb 6:  op 15:  load r2, r2(8)      # traverse the LDS
       op 16:  bne r2, 0, cb3      # until NULL is found

(c) Profile information:
    For op 10: L1 miss rate 25%, L2 miss rate 33%, contribution to total miss rate 54%.
    Loop execution: 59 iterations on average.

(d) Prefetch instructions:
    op_id prefetch [Ptr_Off] [(TO)(DIR)(DPTH)(POC)]
    op_1  prefetch [0]       [(8)(0x222)(59)(1)]
    op_id prefetch []        [(tr_op_id)(DIR)(ptr_off)(ind_off)]
    op_2  pref_ld  []        [(1)(0x422)(0)(4)]

Legend: Ptr_Off: pointer offset; TO: traversal offset; DIR: directive; DPTH: depth; POC: pointer offsets count; Tr_op_id: ID of the traversing prefetch.
Figure 5.6: Source and low-level code example of linked data structure in Olden’s health.
Figure 5.6 contains code generated from the function check_patients_waiting of the benchmark health. Within check_patients_waiting, a while loop iterates through a linked list until reaching the end of the list. Each node in the list contains a pointer to another data structure as well as a pointer to the next node.
Simulation results show that miss penalties caused by the four highlighted loads account for approximately 80% of the total time spent stalled on data memory operations. Upon entry to this section of code, r2 contains the head pointer of the list. Consequently, the CDCAP prefetching algorithm schedules the harbinger instruction directly following CB2. The while loop begins by loading a pointer from the linked list (op 11) that is later used as an address by op 18 and op 31. Op 11 is therefore classified as an LDS pointer load, while ops 18 and 31 are classified as LDS indirect. Op 14 traverses the linked list and is classified as the traversal instruction.
Figure 5.7: Memory accesses and prefetching commands for CDCAP on Olden’s health.
Figure 5.7 illustrates three iterations of the address request pattern for the previous example from the health benchmark. Each iteration has three accesses: a traversal, a pointer, and an indirect access. CDCAP issues prefetches starting from
the first traversal access during the first iteration; subsequently, the pointer is recursed upon in the second iteration, followed by an access to an indirect data item. These prefetches are informed by the prefetch instructions. It is instructive to compare this example to the other techniques. First, the hardware content prefetcher, CDP, overlaps with this behavior in that it also issues prefetches starting from the first access. In CDP, the contents of the cache line returned by the first access will initiate accesses (2) and (4). In addition, depending on the rest of the cache line, CDP may also issue prefetches to unnecessary addresses found in that line. A similar event happens at access (2), at which point more unnecessary prefetches may be issued. In the CDCAP approach, these erroneous prefetches are eliminated because the compiler informs the hardware of the layout of the data structure. In comparison to the dependence-based approach (DBP), there are several disadvantages for DBP. First, DBP must have previously traversed the data structure nodes and built a full dependence table between all of the pointer connections in the above example. The dependence table for this example would contain dependences (1 to 2), (1 to 4), (2 to 3), and so on, requiring all iterations of the loop to be represented in the dependence structure. In contrast, CDCAP is a comparatively stateless approach to maintaining the connection between data structure accesses, since it does not maintain information about past traversals and pointers.
5.4
Experimental Evaluation
5.4.1
Methodology
The IMPACT compiler and emulation-driven simulator were enhanced to support a model of an EPIC architecture [3]. The base level of code consists of the best code generated by the IMPACT compiler, employing function inlining,
superblock formation, and loop unrolling. The base processor modeled can issue six operations in order, up to the limit of the available functional units: four integer ALUs, two memory ports, two floating-point ALUs, and three branch units. The instruction latencies used match the Itanium microprocessor (integer operations have 1-cycle latency, and load operations have 2-cycle latency). The execution time for each benchmark was obtained using detailed cycle-level simulation. For branch prediction, a 4K-entry BTB with two-level correlation prediction and a branch misprediction penalty of eight cycles is modeled. The processor parameters include separate 32 KB direct-mapped instruction and data caches with 32-byte cache lines and a miss penalty of 10 cycles. A 1 MB, 8-way set-associative second-level unified cache with 128-byte cache lines (and a memory latency of 130 cycles) is also simulated. Each cache has its own hardware prefetching engine. The base prefetch engine runs stride prefetching; the stride prefetcher uses a 16-entry PC-indexed table with a threshold parameter that determines when stride prefetching starts. All prefetch implementations use a dedicated prefetch buffer. The dependence-based prefetching technique is implemented according to [42]. The simulations indicate that this technique was not able to issue many timely prefetches, although its coverage was excellent. In addition, DBP was augmented with a variant implementation of jump-pointers [43], and this combined implementation is referred to as dependence-based prefetching (DBP) in the following experiments. DBP prefetching is representative of context-based prefetching techniques. DBP was simulated with an 8 KB L1 prefetch buffer with 32-byte cache lines and 4-way set associativity. Content-Directed Prefetching (CDP), reported by Cooksey et al. [18], was implemented using a 256 KB L2 prefetch buffer with 128-byte cache lines and 4-way set associativity. The path reinforcement mechanism of the CDP technique was not simulated; this reinforcement
contributes less than 1.3% of the performance. The CDCAP approach was implemented with the above configurations for the L1 and L2 prefetch buffers, attached to both the L1 data cache and the unified L2 cache. All evaluated techniques queued prefetch requests into the same prefetch queue at each cache level, and the bus arbitration policies were the same across the different simulations. Benchmark simulations ran up to a maximum of 100 million cycles or to completion, whichever happened first. Initial warm-up statistics were not discounted.
5.4.2
Results and Analysis
The proposed Compiler-Directed Content-Aware Prefetching technique promises to achieve a working balance between the three measures used for evaluating prefetching techniques, namely accuracy, timeliness, and coverage, for dynamically allocated LDS. The ability of the architecture to prefetch into different levels of the memory hierarchy is a key advantage. To evaluate the techniques under study, several memory system measures are reported. Prefetching accuracy is defined as the ratio of used blocks to the total number of cache blocks prefetched. Figure 5.8 shows the prefetching accuracy for the three studied techniques. The graph supports the earlier discussion about the added accuracy of software-oriented prefetching techniques. Such techniques can use program control flow and execution path information to reduce unnecessary prefetches, and hence have higher accuracy on average. This knowledge is not as accessible to hardware prefetching techniques like CDP: the more aggressive the prefetching technique, the less accurate it becomes. The proposed CDCAP scores close to DBP in accuracy. Prefetching techniques intended for use at higher levels in the memory hierarchy need to be more accurate; otherwise they will be wasting precious bandwidth.
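Stated as a formula, the accuracy plotted in Figure 5.8 for each technique is simply the definition given above:

\[
\text{accuracy} = \frac{\text{prefetched cache blocks that are used}}{\text{total cache blocks prefetched}}
\]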
Figure 5.8: Prefetching accuracy.
Figure 5.9 shows a measure of the timeliness of each of the prefetching techniques. Each load that arrives at the L1 data cache will eventually be serviced and sent back to the processor. The time, in cycles, that the memory takes to service each load was measured. Each bar of the graph shows four slices: the bottom slice is the percentage of loads that were serviced within one quarter of the memory latency and is labeled Timely. Loads that were serviced within half the memory latency are labeled Good. Loads serviced within the third quarter are labeled Acceptable, and those that took longer than that are collectively labeled Poor and occupy the upper part of each bar. This histogram indicates that the timeliness of the implemented DBP technique is superior to that of the existing techniques. It also shows that DBP was not timely enough, because it did not enhance the response time of the memory system. CDP has different effects on different benchmarks.
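A small sketch of the binning behind Figure 5.9 is shown below; the bin boundaries follow the quartile definition above, the 130-cycle latency comes from the methodology in Section 5.4.1, and the helper function and example program are only a toy illustration, not the simulator's code.

#include <stdio.h>

/* Classify one load's service time into the bins of Figure 5.9, given the
 * modeled memory latency (130 cycles in these experiments).               */
typedef enum { TIMELY, GOOD, ACCEPTABLE, POOR } timeliness_t;

static timeliness_t classify(unsigned service_cycles, unsigned mem_latency)
{
    if (4 * service_cycles <= mem_latency)     return TIMELY;     /* within 1/4 of latency */
    if (2 * service_cycles <= mem_latency)     return GOOD;       /* within 1/2 of latency */
    if (4 * service_cycles <= 3 * mem_latency) return ACCEPTABLE; /* within 3/4 of latency */
    return POOR;                                                  /* anything slower       */
}

int main(void)
{
    /* Hypothetical example: a load serviced in 40 cycles with a 130-cycle memory. */
    printf("%d\n", (int)classify(40, 130));   /* prints 1, i.e. GOOD */
    return 0;
}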
Figure 5.9: Timeliness of prefetching.
For health, perimeter, and treeadd, CDP seems to do well. These benchmarks tend to have multiple addresses in one node and tend to access them within short periods of time, which makes aggressive prefetching pay off. For the rest of the benchmarks, however, CDP hurts timeliness, which is reflected in the extended Poor portion of the bars in the figure. Cooksey's CDP is designed to work from the L2 cache; it kicks in upon a miss in L2. If such misses do not occur, or if most of the miss traffic is between L1 and L2, then CDP has no chance to enhance performance. Even if the technique is applied at the L1 level, it is expected to hurt performance, because it issues prefetches aggressively, consuming bandwidth that should serve program loads. Furthermore, the studies in this research indicate that this is a major weakness of the approach for some programs even at the second cache level. From this point of view, CDP is expected to perform well on perimeter and treeadd. Figure 5.10
shows the normalized time, in simulated cycles, that the lower memory level bus was busy when needed, for the different evaluated prefetching techniques. The high numbers that CDP incurs illustrate the previous argument: the algorithm attempts to issue prefetches too aggressively, even with the several heuristics that Cooksey uses to limit the number of issued prefetches. DBP, on the other hand, is the least intrusive in issuing prefetches that affect other loads. Although CDCAP appears to be very intrusive in some benchmarks, like health, this is not all bad, because the accuracy of the technique implies that this occupation of the bus is not wasted. Figure 5.10 suggests that prefetching usually uses the bus at the same times that program loads need it; that is, the idle times of the bus are not being used for prefetching, but rather prefetching is intrusive and steals cycles. For an effective prefetcher, those cycles had better be used to serve the program. On average, the proposed CDCAP technique achieves a working balance here as well. Figure 5.11 shows the reduction in the miss rate for the second-level cache, normalized to the base miss rate. Since DBP is applied at the L1 cache, it can either improve or degrade the miss rate at the second-level cache and still improve overall performance. This is because the prefetch requests issued from the L1 cache are not discriminated from program loads at the L2 level, which explains why some of the benchmarks experience higher miss rates: L2 receives more prefetch requests that are treated as loads. The proposed CDCAP technique was applied to L1 for bh and em3d, to both L1 and L2 for health, and to L2 for the rest of the benchmarks. Performance, measured as the normalized cycles of processor stalls waiting for memory, is reported next. Figure 5.12 shows clearly that balancing the prefetching technique has an important role to play. Ranges of performance improvement relative to the base architecture model without prefetching indicate
Figure 5.10: Normalized bus blocking with prefetching.
that, on average, CDCAP reduces stall time by 10%. This represents a strong improvement in cache prefetching efficiency, since the base configuration already includes stride prefetching. The largest improvements are for health and treeadd. Improvements are made in most benchmarks, and CDCAP consistently out-performs the hardware content-enabled prefetching. Lastly, the normalized execution times of the simulated benchmarks are shown in Figure 5.13. Execution time speedup is directly related to the number of cycles a processor needs to stall waiting for data from load instructions. The figure shows speedups of up to 24% and an average of 9%. This again stresses the fact that CDCAP consistently out-performs the evaluated prefetching techniques.
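As a simple way to read Figures 5.12 and 5.13 together, execution time can be decomposed into compute cycles and memory stall cycles, so the speedup follows directly from the stall reduction (a standard decomposition, not a result produced by the simulator):

\[
T_{\mathrm{exec}} = C_{\mathrm{compute}} + C_{\mathrm{stall}}, \qquad
\text{speedup} = \frac{T_{\mathrm{exec}}^{\mathrm{base}}}{T_{\mathrm{exec}}^{\mathrm{prefetch}}}
\]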
Figure 5.11: Normalized loads miss rates for L2 cache.
5.5
Conclusions
As cache penalties continue to have a significant impact on memory system performance, traditional hardware-based prefetch techniques will not evolve sufficiently to help with accesses to linked data structures. By integrating compiler and hardware techniques in a coordinated approach, the cache penalty effects of accessing linked data structures can be reduced. With the compiler-directed content-aware prefetching approach presented in this chapter, memory system performance can be improved by an average of 10%, while reducing the bandwidth requirements of content-aware prefetching by a factor of 6. Although these results have only been highlighted for applications that make heavy use of linked data structures, the potential of the approach is far reaching. Likewise, the system illustrates that certain prefetching methods are not suited for higher levels of the cache hierarchy due to
Figure 5.12: Normalized cycles of processor waiting for load values.
bandwidth limitations. An area of future work is to continue developing other intelligent prefetch mechanisms guided by compiler designation of the connections in linked data structures. Similarly, the approach will be evaluated for C++ and Java programs to understand how greater utility of content-based prefetching can be achieved. Lastly, by applying the technique to instruction memory requests, compiler-directed content-aware prefetching can aid prefetching of upcoming control paths. A likely next step is to further evaluate the hardware complexity of the content-aware prefetcher and to conduct further studies to tune and validate designs. Finally, these mechanisms will need to be examined in the context of multiprocessor and multithreading systems.
Figure 5.13: Normalized execution times.
Chapter 6 Conclusions
6.1
Summary
The memory latency in modern computer systems is a major impediment
to achieving better performance. Technology trends in CPU and memory designs indicate that the gap between the speed of the two is only expected to grow. Thus, techniques to hide the memory latency are expected to play an increasingly larger role in improving the performance of future computer systems. Prefetching is one such technique that has been extensively investigated. Prefetching for memory accesses that exhibit spatial regularity has been successful due to the ability to abstract the memory access patterns using mathematical models. This abstraction enabled prefetch completeness for stride-based prefetching mechanisms by achieving high coverage, accuracy, and timeliness. Applications that use Linked Data Structures (LDS) do not exhibit the same degree of spatial regularity; therefore, other characteristics of their memory accesses have been the target of several prefetching mechanisms. While these prefetching mechanisms have shown promising results, there remains a significant opportunity for improving the performance of modern computer systems by hiding the memory latency associated with LDS memory accesses. Prefetch completeness requires that a prefetching mechanism achieves high coverage of the program would-be misses with high prefetching accuracy and timeliness.
Achieving prefetch completeness is a delicate task that requires coordinating three components that complement each other. These components are: 1) a rigorous approach that offers metrics to quantify the exploitable characteristics of the memory accesses, 2) a coordinated software and hardware approach that benefits from the static characteristics facilitated by the global view of the compiler combined with the dynamic characteristics accessible via profiling and runtime monitoring, and 3) simultaneous coordination of several mechanisms that exploit different characteristics of the LDS memory accesses. This multi-dimensional approach was illustrated by extending the understanding of three exploitable characteristics of LDS memory accesses: spatial regularity, temporal regularity, and LDS topology. Metrics for these characteristics were offered and employed in designing coordinated prefetching approaches. Spatial regularity associated with LDS memory accesses exhibits short and interleaved regular streams that could not be exploited by previously proposed stride-prefetching mechanisms, which were designed for the long and stable streams that result from accessing regular data structures. This dissertation extended regular stream characterization to include extrinsic as well as intrinsic characteristics. The identified extrinsic characteristics are stream affinity and stream density. Metrics to quantify these extrinsic characteristics were defined and demonstrated in the context of a prefetch design which has been modeled within a cycle-accurate simulator of an advanced research microprocessor. The prefetch design employed stream affinity to enable prefetching of short streams. Stream density was exploited with a "simultaneous streams management" mechanism for (1) dynamically adjusting the prefetch launching time to control prefetch timeliness using a novel mechanism called padding, (2) stream prioritization, and (3) controlling stream thrashing. The results indicate that improving the prefetching efficiency requires timely prefetching for sparse streams, as well as early prefetching for
dense streams. Overall, stream prioritization and thrashing control based on density improved prefetch coverage by 40% on average for the SPEC2K-INT representative traces over an implementation of a stream-buffer mechanism that does not use the proposed metrics and techniques. Identifying spatially regular streams in the memory accesses requires a de-aliasing process to associate each memory access with the regular stream to which it belongs. Previous de-aliasing mechanisms used to identify regular streams with arbitrary strides used the full PC to couple the memory accesses with the load instructions that generate them. This dissertation presented LAP prefetching, a cost-effective prefetching system that extends the traditional idea of stride prefetching by introducing a set of hardware components that accurately and dynamically detect exploitable address regularity. Potential prefetch addresses are generated by coordinating many pieces of strategically tagged in-flight information through a dedicated state machine. A confidence mechanism was proposed that enhances the accuracy of stride prefetching and allows the LAP approach to prefetch directly into the L1 cache, overcoming the traditional limitation of cache pollution. A prefetch moving window heuristic was used to enable prefetching into the L1 cache without the need to mark each cache line. The experiments show that LAP prefetching, using as few as 4 bits of the program counter along with other attributes of the instruction, can achieve within 1% of the performance achieved using the full PC. Although exploiting spatial regularity successfully prefetched about 14% of the misses in the SPEC-INT suite of benchmarks, the remaining misses needed a different approach that exploits other characteristics of the memory accesses associated with LDS. One such characteristic investigated in this dissertation is temporal regularity. Coordinating the exploitation of temporal regularity and the content of the accessed cache lines was proposed and illustrated through
a prefetching mechanism called Stitch Cache Prefetching (SCP). SCP improved prefetch completeness by combining and improving context-based and content-based prefetching mechanisms. Context-based prefetching was improved by (1) better exploiting temporal regularity in the memory accesses via the introduction of temporal regularity metrics, and (2) the use of a logical stitch space with a practical physical implementation of this space. The resulting context-based approach does not suffer the capacity problem of Markov tables, nor does it require complex manipulation of source code. Content-based prefetching is improved by associating it with the temporally regular accesses, which improves its timeliness and accuracy. The experiments indicate that SCP improves prefetch completeness by achieving coverage of about 40% over a system that includes a stride prefetcher, while achieving an average accuracy of about 60% using timely prefetches. This improvement in prefetch completeness improves the IPC by up to 40%, and by 14% on average, for a mixed set of temporally regular and temporally irregular benchmarks that use LDS. At the same time, SCP does not degrade performance for benchmarks that do not exhibit temporal regularity. Combining the exploitation of spatial and temporal regularity with the exploitation of the content of accessed cache lines was demonstrated to hide up to 95% of the misses in workloads that exhibit high regularity (see Section 4.5). Unfortunately, regular accesses in the studied benchmarks enabled prefetching of only 40% of the misses on average, as was illustrated in Chapter 4. To cover the remaining misses, the topology of the LDS was investigated as an exploitable characteristic. Compiler-Directed Content-Aware Prefetching (CDCAP) illustrated that a coordinated approach which exploits the topology of the LDS, encompassing pre-computation and content-based prefetching mechanisms, can improve prefetch completeness. The improvement was a result of dynamically generated prefetching schedules that were both timely and accurate. A characterization of the load
instructions used in traversing LDS led to the introduction of pattern metrics. These pattern metrics were used by a compiler to statically identify traversal algorithms. The identified traversal algorithms were concisely communicated to a run-time system that used this information, along with the memory content, to generate timely and accurate prefetches. Profile information was incorporated in the system to guide the run-time environment to prefetch into different levels of the memory hierarchy based on their expected use. With the CDCAP prefetching approach, miss coverage was improved by an average of 10%, while reducing the bandwidth requirements of content-based prefetching by a factor of 6 for the selected set of benchmarks.
6.2
Future Work
This dissertation illustrated that hiding the memory latency in modern computer systems requires a coordinated and rigorous prefetching approach that exploits different characteristics of the memory accesses associated with LDS. The approach was illustrated by exploiting the spatial regularity, temporal regularity, and topology of the memory accesses via coordinated compile-time, profile, and run-time mechanisms. However, simulation results showed that exploiting these characteristics did not completely solve the memory latency problem, because there remain memory accesses associated with LDS that do not exhibit any of the investigated characteristics. Thus, future research needs to build on the findings of this dissertation; further studies of the memory accesses should enable the identification of additional characteristics that can be quantified using metrics. Prefetching mechanisms that employ the identified metrics can then be coordinated with the mechanisms proposed in this dissertation to improve prefetch completeness. The prefetching system designed in this dissertation to exploit spatial regularity identifies and prefetches streams based on load misses only. One future
work investigation is to incorporate load misses that are masked by previous store misses into the prefetch triggering mechanism. Further, an inverse correlation was observed between the prefetching efficiency and a significant increase in L1 cast-outs due to prefetching. This observation requires detailed investigation to identify how it could be used to augment the padding decisions. Future work on exploiting temporal regularity in a coordinated context-and-content approach includes better utilization of the content-based component, as one of the benchmarks, mst, demonstrated that a more aggressive content-based approach can improve prefetch completeness. All the evaluated approaches were limited to C-based applications. A future evaluation for C++ and Java programs is needed to understand how greater utility of the proposed prefetching mechanisms can be achieved. By applying a coordinated prefetching approach that employs metrics to quantify the memory access characteristics of instruction memory requests, it is hoped that the memory content can aid prefetching of upcoming control paths. Finally, the proposed mechanisms and approach will need to be examined in the context of multiprocessor and multithreading systems.
Bibliography
[1] Hassan Al-Sukhni, Ian Bratt, and Daniel A. Connors. Compiler directed content aware prefetching for dynamic data structures. In Proceedings of the 2003 PACT International Conference, pages 91–100, 2003.
[2] M. Annavaram, J. Patel, and E. Davidson. Data prefetching by dependence graph precomputation. In 28th Annual International Symposium on Computer Architecture, June 2001.
[3] D. I. August, D. A. Connors, S. A. Mahlke, J. W. Sias, K. M. Crozier, B. Cheng, P. R. Eaton, Q. B. Olaniran, and W. W. Hwu. Integrated predication and speculative execution in the IMPACT EPIC architecture. In Proceedings of the 25th International Symposium on Computer Architecture, pages 227–237, June 1998.
[4] J.-L. Baer and T.-F. Chen. An effective on-chip preloading scheme to reduce data access penalty. In Proceedings of Supercomputing ’91, pages 176–186, November 1991.
[5] I. Bratt, A. Settle, and D. A. Connors. Predicate-based transformations to eliminate control and data-irrelevant cache misses. In Proceedings of the First Workshop on Explicitly Parallel Instruction Computing Architectures and Compiler Techniques, pages 11–22, December 2001.
[6] Brendon Cahoon and Kathryn S. McKinley. Data flow analysis for software prefetching linked data structures in Java. In Proceedings of the 2001 International Conference on PACT, pages 52–63, 2001.
[7] D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proc. of the 4th Int’l Conf. on Architectural Support for Prog. Lang. and Operating Systems, pages 40–52, April 1991.
[8] Jason F. Cantin. Cache performance for SPEC CPU2000 benchmarks. Computer Architecture News, 29(4), September 2001.
[9] A. Rogers, M. Carlisle, J. Reppy, and L. Hendren. Supporting dynamic data structures on distributed memory machines. ACM Transactions on Programming Languages and Systems, March 1995.
[10] M. J. Charney and A. P. Reeves. Generalized correlation based hardware prefetching. Technical Report EE-CEG-95-1, Cornell University, February 1995.
[11] T. F. Chen and J. L. Baer. Reducing memory latency via non-blocking and prefetching caches. In Proceedings of the ASPLOS Symposium, pages 51–61, October 1992.
[12] W. Y. Chen, S. A. Mahlke, P. P. Chang, and W. W. Hwu. Data access microarchitectures for superscalar processors with compiler-assisted data prefetching. In Proceedings of the 24th Annual International Symposium on Microarchitecture, pages 69–73, November 1991.
[13] W. Y. Chen, S. A. Mahlke, W. W. Hwu, T. Kiyohara, and P. P. Chang. Tolerating data access latency with register preloading. In Proceedings of the 6th International Conference on Supercomputing, July 1992.
[14] T. Chilimbi. On the stability of temporal data reference profiles.
[15] Trishul M. Chilimbi. Efficient representations and abstractions for quantifying and exploiting data reference locality. In Proceedings of the ACM SIGPLAN ’01 Conference on Programming Language Design and Implementation, 2001.
[16] Trishul M. Chilimbi, Mark D. Hill, and James R. Larus. Cache-conscious structure layout. In Proceedings of the ACM SIGPLAN ’99 Conference on Programming Language Design and Implementation, pages 1–12, 1999.
[17] J. Collins, S. Sair, B. Calder, and D. Tullsen. Pointer cache assisted prefetching. In International Symposium on Microarchitecture, November 2002.
[18] R. Cooksey, S. Jourdan, and D. Grunwald. A stateless, content-directed data prefetching mechanism. In Proceedings of the 10th International Conference on ASPLOS, pages 201–213, October 2002.
[19] F. Dahlgren, M. Dubois, and P. Stenstrom. Fixed and adaptive sequential prefetching in shared memory multiprocessors. In Proceedings of the 1993 International Conference on Parallel Processing, pages 56–63, August 1993.
[20] R. J. Eickemeyer and S. Vassiliadis. A load-instruction unit for pipelined processors. IBM Journal of Research and Development, 27(4):547–564, July 1993.
[21] Keith I. Farkas, Norman P. Jouppi, and Paul Chow. How useful are non-blocking loads, stream buffers, and speculative execution in multiple issue processors? Technical Report WRL Research Report 94/8, Western Research Laboratory, Palo Alto, California, 1994.
[22] J. W. C. Fu, J. H. Patel, and B. L. Janssens. Stride directed prefetching in scalar processors. In Proc. 25th Ann. Conference on Microprogramming and Microarchitectures, Portland, Oregon, December 1992.
[23] John L. Hennessy, David A. Patterson, and David Goldberg. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Francisco, California, 2002.
[24] Sorin Iacobovici, Lawrence Spracklen, Sudarshan Kadambi, Yuan Chou, and Santosh G. Abraham. Effective stream-based and execution-based data prefetching. In Proceedings of the International Conference on Supercomputing, July 2004.
[25] D. Joseph and D. Grunwald. Prefetching using Markov predictors. IEEE Transactions on Computers, 48(2):121–133, 1999.
[26] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th ISCA, pages 364–373, May 1990.
[27] K. Farkas, P. Chow, and Z. Vranesic. Memory system design considerations for dynamically scheduled processors. In 24th Annual International Symposium on Computer Architecture, June 1997.
[28] A. C. Klaiber and H. M. Levy. An architecture for software-controlled data prefetching. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 43–53, Toronto, Canada, May 1991.
[29] Nicholas Kohout, Seungryul Choi, Dongkeun Kim, and Donald Yeung. Multi-chain prefetching: Effective exploitation of inter-chain memory parallelism for pointer-chasing codes. In Proceedings of the 2001 International Conference on Parallel Architectures and Compiler Technology, 2001.
[30] M. H. Lipasti, W. J. Schmidt, and R. R. Roediger. SPAID: Software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 231–236, December 1995.
[31] Chi-Keung Luk and Todd C. Mowry. Compiler-based prefetching for recursive data structures. In Proceedings of the Seventh International Conference on ASPLOS, pages 222–233, 1996.
[32] M. Karlsson, F. Dahlgren, and P. Stenstrom. A prefetching technique for irregular accesses to linked data structures. In The 6th International Symposium on High-Performance Computer Architecture, January 2000.
[33] S. Mehrotra and L. Harrison. Quantifying the performance potential of a data prefetch mechanism for pointer-intensive and numeric programs. Technical Report 1458, Center for Supercomputing Research and Development, University of Illinois, November 1995.
[34] T. Mohan, B. Supinski, S. McKee, F. Mueller, A. Yoo, and M. Schulz. Identifying and exploiting spatial regularity in data memory references. In Proceedings of the ACM/IEEE SC2003 Conference (SC’03), 2003.
[35] T. C. Mowry and A. Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 12:87–106, June 1991.
[36] T. C. Mowry, M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the Fifth International Conference on ASPLOS, pages 62–73, October 1992.
[37] Olden. Olden suite of benchmarks. 2005.
[38] S. Palacharla and R. Kessler. Evaluating stream buffers as a secondary cache replacement. In Proceedings of the 21st International Symposium on Computer Architecture, April 1994.
[39] Erez Perelman, Greg Hamerly, Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder. Using SimPoint for accurate and efficient simulation. In Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2003.
[40] A. K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Department of Computer Science, Rice University, Houston, TX, 1989.
[41] Daniel M. Pressel. Fundamental limitations on the use of prefetching and stream buffers for scientific applications. In Proceedings of the 10th ACM Symposium on Applied Computing, pages 554–560, March 2001.
[42] Amir Roth, Andreas Moshovos, and Gurindar S. Sohi. Dependence based prefetching for linked data structures. In Proceedings of the Eighth International Conference on ASPLOS, pages 115–126, 1998.
[43] Amir Roth and Gurindar S. Sohi. Effective jump-pointer prefetching for linked data structures. In Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 111–121, 1999.
[44] Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. Automatically characterizing large scale program behavior. In Proceedings of the 10th International Conference on ASPLOS, October 2002.
[45] Timothy Sherwood, Suleyman Sair, and Brad Calder. Predictor-directed stream buffers. In Proceedings of the 33rd International Symposium on Microarchitecture, December 2000.
[46] A. J. Smith. Cache memories. Computing Surveys, 14(3):473–530, September 1982.
[47] Y. Solihin, Jaejin Lee, and Josep Torrellas. Using a user-level memory thread for correlation prefetching. In Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 171–182, May 2002.
[48] Eric Sprangle and Doug Carmean. Increasing processor performance by implementing deeper pipelines. In Proceedings of the 29th ISCA, pages 25–34, Washington, DC, USA, 2002. IEEE Computer Society.
[49] Jeffrey Scott Vitter and P. Krishnan. Optimal prefetching via data compression. Journal of the ACM, 43(5):771–793, 1996.
[50] Z. Wang, D. Burger, K. McKinley, S. Reinhardt, and C. Weems. Guided region prefetching: A cooperative hardware/software approach. In Proceedings of the 30th ISCA, June 2003.
[51] Wikipedia. Conditional entropy. Technical report, Wikipedia, the free encyclopedia.
[52] Wikipedia. Sequence. Technical report, Wikipedia, the free encyclopedia.
[53] Chia-Lin Yang and Alvin R. Lebeck. Push vs. pull: Data movement for linked data structures. In Proceedings of the 14th International Conference on Supercomputing, pages 176–186, 2000.
[54] C. Zilles. Linked list traversal micro-benchmark. 2005.
[55] C. Zilles and G. Sohi. Execution-based prediction using speculative slices. In 28th Annual International Symposium on Computer Architecture, June 2001.