OPTIMIZING SINGLE-PASS SIMULATION TECHNIQUES FOR FAST CACHE MEMORY DESIGN SPACE EXPLORATION IN EMBEDDED PROCESSORS

by MOHAMMAD SHIHABUL HAQUE

A THESIS SUBMITTED IN ACCORDANCE WITH THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

SCHOOL OF COMPUTER SCIENCE AND ENGINEERING
THE UNIVERSITY OF NEW SOUTH WALES

AUGUST 2011

© Copyright by Mohammad Shihabul Haque 2012. All Rights Reserved.
Statement of Originality

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged’.
Mohammad Shihabul Haque August 2011
Copyright Statement

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation’.
Mohammad Shihabul Haque August 2011
Authenticity Statement

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format’.
Mohammad Shihabul Haque August 2011
List of Publications

• Mohammad Shihabul Haque, Jorgen Peddersen and S. Parameswaran. CIPARSim: Cache Intersection Property Assisted Rapid Single-pass FIFO Cache Simulation Technique. In the proceedings of the International Conference on Computer-Aided Design (ICCAD) 2011, November 2011.

• Mohammad Shihabul Haque, Jorgen Peddersen, Andhi Janapsatya and S. Parameswaran. SCUD: A fast single-pass L1 cache simulation approach for embedded processors with round-robin replacement policy. In the proceedings of the 47th Design Automation Conference, June 2010.

• Mohammad Shihabul Haque, Jorgen Peddersen, Andhi Janapsatya and S. Parameswaran. DEW: A Fast Level 1 Cache Simulation Approach for Embedded Processors with FIFO Replacement Policy. In the proceedings of Design, Automation and Test in Europe 2010, March 2010.

• Mohammad Shihabul Haque, Andhi Janapsatya and S. Parameswaran. SuSeSim: A fast simulation strategy to find optimal L1 cache configuration for embedded systems. In the proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, October 2009.

• Haris Javaid, A. Janapsatya, Mohammad Shihabul Haque, and S. Parameswaran. Rapid Runtime Estimation Methods for Pipelined MPSoCs. In the proceedings of Design, Automation and Test in Europe 2010, March 2010.

Journal articles waiting to be published:

• Mohammad Shihabul Haque, Jorgen Peddersen, Andhi Janapsatya and S. Parameswaran. Use of Double Wave Pointer in Single-pass FIFO Cache Simulation.
• Mohammad Shihabul Haque, Jorgen Peddersen, Andhi Janapsatya and S. Parameswaran. Fast Trace-based Cache Simulation Tool By Exploiting The Cache Inclusion Property.
Contributions of this Thesis

• A new set of inclusion properties to accelerate cache hit/miss detection for LRU single-pass cache simulation.

• A new LRU single-pass cache simulator, “SuSeSim”, that achieves the fastest performance among the available LRU single-pass simulators by deploying the cache inclusion properties.

• Two special data structures, “Wave pointers” and “Central Look-up Table”, to assist in the acceleration of single-pass simulation of caches with any type of replacement policy.

• Two fast, single-pass, trace driven FIFO cache simulators, “DEW” and “SCUD”, that deploy the “Wave pointers” and “Central Look-up Table” to accelerate the cache hit/miss extraction process.

• A new type of cache property called the “Intersection Property” to accelerate the single-pass simulation of stack or non-stack algorithm based caches without using any extra computing resources.

• Three FIFO intersection properties.

• A new trace driven, space efficient, single-pass FIFO cache simulator, “CIPARSim”, that deploys the FIFO intersection properties. CIPARSim shows the fastest performance among the available FIFO cache simulators.

• Two PLRU intersection properties.

• A new trace driven, space efficient, single-pass PLRU cache simulator, “PSAICO”, that deploys the PLRU intersection properties. PSAICO achieves a huge speedup over traditional non-optimized PLRU cache simulators.
Abstract

An application’s cache miss rate is used in timing analysis, system performance prediction and in deciding the best cache memory for an embedded system to meet tighter energy, performance and cost constraints. Single-pass simulation allows a designer to find the number of cache misses quickly and accurately on various cache memories. Such single-pass simulation systems have previously relied heavily on cache inclusion properties, which allowed rapid simulation of cache configurations for different applications without using excessive computing resources. Thus far, the only inclusion properties discovered have been applicable to caches based on the Least Recently Used (LRU) replacement policy. However, LRU based caches are rarely implemented in real life due to their circuit complexity at larger cache associativities. Embedded processors typically use a FIFO or Pseudo LRU (PLRU) replacement policy in their caches instead, for which there are no full inclusion properties to exploit. No other mechanism has been discovered to speed up the single-pass simulation of these cache replacement policies either. As a result, the replacement policies that do not show inclusion properties did not attract cache simulator designers and researchers, even though there was a great need to simulate these policies.

This thesis describes a complete solution to accelerate the trace driven single-pass simulation of cache replacement policies regardless of their capability to show inclusion properties. The solution aims to maintain low space consumption and minimal use of computing resources. For these purposes, two potential speedup mechanisms have been analyzed: (i) different methods of representing caches during simulation (or data structure based speedup); and, (ii) replacement policy based speedup (such as cache inclusion properties). To achieve a significant acceleration during a single-pass simulation, two smart data structures, “Wave pointers” and “Central Look-up Table”, are presented in this dissertation. These two cache configuration representation techniques can dramatically reduce the time needed to search for an address content in a cache memory and to reflect the changes made after each cache access. Utilizing the wave pointers and the central look-up table,
simulation can be performed up to 40 times and 57 times faster respectively, compared to the single cache memory simulator DineroIV. To speed up the single-pass simulation process while keeping space consumption low, a new cache property, the “Intersection property”, has also been introduced. The intersection properties help to predict the cache status during single-pass simulation, which can reduce the simulation time enormously. Using the intersection properties, up to 97% of cache hits can be predicted. Using the speedup mechanisms discovered, a group of efficient and fast single-pass cache simulators, namely “SuSeSim”, “DEW”, “SCUD”, “CIPARSim” and “PSAICO”, are presented for LRU, FIFO and PLRU caches. These simulators outperform the available conventional simulators significantly.
Acknowledgements

Firstly I would like to thank my supervisor, Professor Sri Parameswaran, for the guidance and encouragement he has given me throughout the PhD degree. The motivation and direction he has supplied made completion of the research work possible. His ability to always know the right thing to say to pick you up when you’re down always amazes me.

The other research students in the group have also been a great help. In particular, Andhi Janapsatya, whose criticism would often lead to inspiration. His friendship has also been a great boon when a peer to talk to is needed. I’d also like to thank Jorgen Peddersen for his jovial nature and the patience to analyze my crazy ideas. The other members of the group have also been great friends. I am very pleased to have studied alongside Krutartha Patel, Angelo Ambrose, Xin He, Michael Chong, Haris Javaid, Tuo Li, Mei Hong, Liang Tang and Sumyatmin Su. They have all been helpful towards my work as well as great gossip and Wii partners.

I also thank my thesis examiners for taking the time to read my thesis, making suggestions for improvements and finding a few typos.

Last, but certainly not least, I thank my parents for their support and love that has given me the courage to finish the project. I will be eternally grateful for the kind upbringing they have given me and the unending love they continue to supply.
Contents

Statement of Originality
Copyright Statement
Authenticity Statement
List of Publications
Contributions of this Thesis
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
  1.1 Processor-Memory Performance Gap
  1.2 Introduction to Cache Memory
  1.3 The Necessity to Decide the Best Cache Memory Configuration
    1.3.1 Challenges in The Best Cache Configuration Selection
    1.3.2 Deciding The Best Cache Memory Configuration
  1.4 Trace Driven Cache Simulation
  1.5 Motivation
  1.6 Scope
  1.7 Contribution and Organisation

2 Literature Survey
  2.1 Categories of Cache Simulation Techniques
  2.2 Simulation Speedup Mechanisms
    2.2.1 Speedup Mechanisms for Online Simulation Techniques
    2.2.2 Speedup Mechanisms for Off Line Simulation Techniques
      2.2.2.1 Estimation Methods
      2.2.2.2 Exact Simulation Methods

3 Overview of The Thesis
  3.1 Project SuSeSim
  3.2 Project DEW
  3.3 Project SCUD
  3.4 Project CIPARSim
  3.5 Project PSAICO

4 SuSeSim: An Analysis of LRU Policy
  4.1 Introduction
  4.2 Contributions and Limitations
  4.3 Background
    4.3.1 Janapsatya’s Methodology With Proposed Enhancements in The CRCB Algorithms
      4.3.1.1 Tree Formation
      4.3.1.2 Tag Searching
      4.3.1.3 Cache Set Update
      4.3.1.4 The CRCB Algorithm
    4.3.2 Limitations of Available LRU Single-pass Cache Simulators
    4.3.3 Cache Inclusion Properties for Quick Cache Miss Detection
  4.4 SuSeSim Algorithm
    4.4.1 Tree Formation
    4.4.2 Tag Searching
    4.4.3 Cache Set Update
  4.5 Experimental Procedure and Results
  4.6 Conclusion

5 DEW: Data Structure Based Speedup
  5.1 Introduction
    5.1.1 Contributions and Limitations
  5.2 Background
    5.2.1 Data Structure Used in DEW
    5.2.2 Properties of The Data Structure in DEW
  5.3 DEW Simulation Approach
  5.4 Experimental Procedure and Results
  5.5 Conclusion

6 SCUD: Advancement Over DEW
  6.1 Introduction
    6.1.1 Contributions
  6.2 Cache Parameters Exploration Methodology
    6.2.1 Data Structure Used in SCUD
      6.2.1.1 Central Look-up Table (CLT)
      6.2.1.2 Simulation Tree
      6.2.1.3 Miss Counter Table
    6.2.2 Properties of the SCUD Data Structure
      6.2.2.1 Property 1: Reduction of Simulation Time Through CLT
      6.2.2.2 Property 2: Simulating Associativity of One
      6.2.2.3 Property 3: Simplified Cache Set Searching Using Simulation Tree
  6.3 SCUD Simulation Approach
  6.4 Experimental Procedure and Results
  6.5 Conclusion

7 CIPARSim: Use of Intersection
  7.1 Introduction
    7.1.1 Contributions
  7.2 Inclusion Properties vs. Intersection Properties
    7.2.1 FIFO Intersection Properties
  7.3 CIPARSim
    7.3.1 Data Structure
    7.3.2 CIPARSim Simulation Approach
  7.4 Experimental Procedure and Results
  7.5 Conclusion

8 PSAICO: Speedup of PLRU Simulation
  8.1 Introduction
    8.1.1 Contributions
  8.2 Pseudo LRU Replacement Policy
    8.2.0.1 Tree Based Pseudo LRU (PLRU-t)
    8.2.0.2 MRU Bits Based Pseudo LRU (PLRU-m)
    8.2.1 Intersection Properties Shown by PLRU
    8.2.2 Advantages of PLRU Intersection Properties in A Single-pass Simulation
  8.3 Experimental Procedure and Results
  8.4 Conclusion

9 Conclusion and Future Work
  9.1 Conclusion
  9.2 Future Works

A Data Structures
  A.1 SuSeSim
  A.2 DEW
  A.3 SCUD
  A.4 CIPARSim
  A.5 PSAICO

B Sample Code
  B.1 SuSeSim
  B.2 DEW
  B.3 SCUD
  B.4 CIPARSim
  B.5 PSAICO

Bibliography
List of Tables

1.1 Storage space comparisons [1]
4.1 Trace of requested addresses
4.2 Cache configuration parameters
4.3 Trace files used for simulation
5.1 Cache configuration parameters
5.2 Comparison between Dinero IV and DEW showing simulation time and total number of tag comparisons
5.3 Effectiveness of properties used in DEW (all results in millions)
6.1 Cache configuration parameters
6.2 Total number of address requests in each application of SPEC CPU2000 (in millions)
6.3 Simulation time comparison for Mediabench applications
7.1 Cache configuration parameters
7.2 Simulation Time and Performance Analysis Part 1
7.3 Simulation Time and Performance Analysis Part 2
8.1 Cache configuration parameters
8.2 Simulation time comparison for Mediabench applications
8.3 Improvements gained using the intersection properties in single-pass simulation for Mediabench applications (all results in millions)
List of Figures

1.1 Typical processor-based computer system architecture
1.2 Processor-based computer system architecture with cache memory
1.3 A set associative cache
1.4 A byte addressable memory address
1.5 Cache memory design space exploration
4.1 Formation of simulation tree
4.2 Singly linked list to represent associativity
4.3 Values for tag, index and offset for a requested address in different cache configurations
4.4 Address tags in different cache configurations
4.5 Miss counter array of Janapsatya’s method
4.6 Direction of simulation and tag searching
4.7 Singly linked list with requested address tags
4.8 Janapsatya’s method needs five steps to update the address tag inside the associated singly linked list
4.9 Index 1 of a two-set cache with some tags
4.10 Index 1 of the two-set cache of Figure 4.9 after the missed tag is updated
4.11 Index 1 of the two-set cache of Figure 4.9 after the hit tag is updated
4.12 Simulation tree for two cache configurations
4.13 Cache sets S, S1 and S2 with associativity four
4.14 Searching non frequent address tags
4.15 Doubly linked list and search paths of the SuSeSim algorithm
4.16 An example simulation tree
4.17 Cache set 0 of Figure 4.16 after the update of the tag position
4.18 Search tree of Figure 4.16 after the placement of new tag
4.19 Total number of linked list nodes searched during simulation
4.20 Total time elapsed for searching tags during simulation
4.21 Total number of address tags found in the head of the linked list
4.22 Total simulation time
5.1 A simulation tree for DEW
5.2 An address request simulation flow diagram for DEW
5.3 Simulation tree of DEW after new tag insertion
5.4 Speedup of DEW over Dinero IV
5.5 Reduction of tag comparison in DEW
6.1 A Central Look-up Table in SCUD
6.2 Formation of simulation tree
6.3 A tree node with associativity lists
6.4 Requested memory address evaluation flow in SCUD
6.5 Experimental Setup
6.6 Speed up of SCUD over Dinero IV for SPEC CPU2000 applications
7.1 Example Case 1
7.2 Example Case 2
7.3 A Multi Set Look-up Table in CIPARSim
7.4 A Simulation Tree
7.5 CIPARSim Associativity Lists
7.6 Speedup in CIPARSim
8.1 A PLRU-t fully associative cache
8.2 Memory block insertion and access in a PLRU-t fully associative cache
8.3 A PLRU-m fully associative cache
8.4 Memory block insertion and access in a PLRU-m fully associative cache
8.5 Experimental setup
8.6 Formation of single-pass simulation tree
8.7 Speedup of Single-pass optimized simulator
Chapter 1

Introduction

1.1 Processor-Memory Performance Gap

Gordon Moore predicted in 1965 [2] that the quantity of transistors that can be placed inexpensively on an integrated circuit would double approximately every two years. Proving Moore’s prediction accurate, the number of transistors in an integrated circuit has kept growing at an exponential rate over the last four decades [3, 4]. The increased quantity of transistors on an integrated circuit helped to achieve faster processing speeds in microprocessors and increased the physical memory storage capacity of computer systems.

In Dynamic Random Access Memory (DRAM), which is commonly used as the main memory in a processor-based computer system, each bit of data is stored in a separate capacitor within an integrated circuit [5, 6]. The capacitors can be charged or discharged to represent the two different values (0 or 1) of a bit. Only one transistor and a capacitor are required per bit representation in DRAM. This structural simplicity is the main advantage of using DRAM as an inexpensive and large main memory in a computer system. However, due to this simplicity in structure, data access becomes slower when DRAM’s storage capacity is increased. As an obvious consequence, the performance of memory has lagged behind the performance of the processor, from the early age of computers, such as the IBM System/360 Model 50 [7], to the ultra-modern mainframe systems, such
as the IBM zEnterprise 196 [8] with multiple processing cores each running at 5.2GHz.
Figure 1.1: Typical processor-based computer system architecture (the processor connects to main memory over a data bus with a fast transfer rate, and main memory to permanent storage/hard disk with a slow transfer rate)
1.2 Introduction to Cache Memory

In a typical processor-based computer system architecture (also known as the Von Neumann architecture [1]), illustrated in Figure 1.1, the processor cannot execute an application if the memory is unable to provide the necessary data. Therefore, the widening processor-memory performance gap limits the overall system performance. To overcome the problems raised by the processor-memory performance gap, the use of cache memory is an effective and widely used solution [9, 10].

Figure 1.2: Processor-based computer system architecture with cache memory (the cache sits between the processor and main memory on the data bus)

A cache memory is a
small, fast memory based on Static Random Access Memory (SRAM), which is placed between the main memory and the processor (illustrated in Figure 1.2). By temporarily storing the frequently used data and instructions, cache memory helps to avoid slow data transfers to and from the main memory to ensure smooth processing [9]. Cache memories can be either on the same chip as the microprocessor, or off the chip. Modern desktop and server CPUs have at least three independent caches: (a) an instruction cache to speed up executable instruction fetches; (b) a data cache to speed up data reads and writes; and, (c) a translation look-aside buffer (TLB) used to speed up virtual-to-physical address translation for both executable instructions and data.

Cache memory configurations are parameterized using cache set size (S), associativity (A), cache block size/line size (B) and replacement policy. Cache size (T) is the total number of bits that can be stored in the cache. The minimum amount of data (bits) that can be placed in a cache is called the cache line size or block size (B). For better accessibility, cache lines are arranged into sets. Cache set size (S) is the total number of sets in a cache. The number of cache lines inside each set of a cache is called associativity (A). Therefore, T = S × B × A. The replacement policy decides which cache line to replace when it is necessary to load a data block into the cache memory. Depending on the arrangement of cache sets and cache lines, caches can be divided into three categories: (i) fully associative cache - a cache with one set only; (ii) set associative cache - a cache that arranges the cache lines into sets; and, (iii) direct mapped cache - a cache with only one cache line per set. Therefore, fully associative and direct mapped caches are also set associative caches. It can be noticed that any type of cache can be represented using the above mentioned four cache parameters.

An example of a set associative cache is presented in Figure 1.3. The cache has eight data storage locations or cache lines. The amount of data that can be stored in each line is called the block size (B). Every two data storage locations form a set in Figure 1.3. Therefore, we have four sets (S = 4), and each set has two different blocks/lines/ways to store data (A = 2). Therefore, the cache of Figure 1.3 is a two-way set associative cache. Each cache set is identified by an index number. In Figure 1.3, T = 8 bytes for B = 1 byte.
Figure 1.3: A set associative cache (four sets, indexed 00-11, each with two ways holding a tag and its data)

Figure 1.4: A byte addressable memory address (the address 111001 is split into tag 111, index 00 and byte offset 1 when the cache block size is 2 bytes)
In Figure 1.4, we show how the content of a byte addressable memory address is searched for in the cache of Figure 1.3. Let us consider that the cache has a block size of two bytes (B = 2). To search for the content of the byte addressable memory address shown in Figure 1.4, the last bit is used to select the byte inside the two byte cache block. Therefore, the last bit of the address is called the byte offset. As S = 4 in the cache of Figure 1.3, the penultimate two bits of the address of Figure 1.4 are used to select the cache set. The rest of the address is used as the tag. If the tag is found inside any line of index 00 in the cache of Figure 1.3, it will be a cache hit; otherwise, it will be a miss. On a miss, the content of the memory address of Figure 1.4 will be placed inside cache set 00 of Figure 1.3.
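The tag/index/offset split illustrated in Figure 1.4 is simple bit arithmetic on the address. The following is a minimal sketch of that arithmetic for a byte addressable address, assuming the number of sets and the block size are powers of two; the function and type names are illustrative and are not taken from any simulator described in this thesis.

```cpp
#include <cstdint>
#include <iostream>

// Split a byte address into tag, set index and byte offset for a cache
// with 'sets' sets and 'block_size'-byte blocks (both powers of two),
// as in the example of Figure 1.4.
struct AddressParts {
    uint64_t tag;
    uint64_t index;
    uint64_t offset;
};

static uint32_t log2_of(uint64_t x) {   // x is assumed to be a power of two
    uint32_t bits = 0;
    while (x > 1) { x >>= 1; ++bits; }
    return bits;
}

AddressParts split_address(uint64_t address, uint64_t sets, uint64_t block_size) {
    uint32_t offset_bits = log2_of(block_size);
    uint32_t index_bits  = log2_of(sets);
    AddressParts p;
    p.offset = address & (block_size - 1);            // lowest bits select the byte
    p.index  = (address >> offset_bits) & (sets - 1); // next bits select the set
    p.tag    = address >> (offset_bits + index_bits); // remaining bits are the tag
    return p;
}

int main() {
    // The example of Figure 1.4: address 111001 (binary), S = 4 sets, B = 2 bytes.
    AddressParts p = split_address(0b111001, 4, 2);
    std::cout << "tag=" << p.tag                 // 7  (binary 111)
              << " index=" << p.index            // 0  (binary 00)
              << " offset=" << p.offset << "\n"; // 1
}
```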
Unlike DRAM, SRAM does not need to periodically recharge storage capacitors; instead, each SRAM cell uses more complex circuitry, which makes it expensive but significantly faster than DRAM. Column 4 in Table 1.1 (collected from [1]) shows the price per Megabyte of storage for the different cache memories, DRAM (main memory) and disk. In this table, columns 3 and 5 show the latency and the energy consumption per data access in the different caches along with DRAM and disk. From Table 1.1, it can be seen that on-chip cache memories have a faster data transfer rate to the processing cores and lower energy consumption compared to off-chip caches. However, on-chip caches are more expensive than off-chip caches.

Table 1.1: Storage space comparisons [1]

Technology                   Bytes fetched per access (typically)   Latency per access     Cost per Megabyte     Energy consumed per access
On-chip Cache                10                                     100s of picoseconds    $1-100                1nJ
Off-chip Cache               100                                    Nanoseconds            $1-10                 10-100nJ
DRAM (internally fetched)    1000                                   10-100 nanoseconds     $0.10 (per device)    1-100nJ
Disk                         1000                                   Milliseconds           $0.001                100-1000mJ
1.3 The Necessity to Decide the Best Cache Memory Configuration

To achieve a cost-effective solution that minimizes the processor-memory performance gap, most modern computers use a hierarchical memory system that includes on-chip caches, multiple levels of off-chip caches, DRAM-based memory and disks. By selecting appropriate cache configurations, hierarchical memory systems can achieve a performance close to that of an on-chip cache memory [1]. However, in hierarchical memory systems, inappropriate cache configurations can also degrade the system’s performance through excessive energy consumption and slow operation [11–13]. Therefore, to achieve the minimum possible processor-memory performance gap, deciding the best cache configurations for a processor-based computer system is a very important decision the system designer has to make. Deciding the best cache configuration is even more frequent in modern extensible and reconfigurable processors, such as Tensilica’s Xtensa [14, 15], NIOS [16],
ARC 600 and 700 core families [17], the ARCtangent processor [18], CoWare/LisaTek [19] and so forth, as the cache configuration can be modified to meet tighter energy, performance and cost constraints for a changing workload.
1.3.1 Challenges in The Best Cache Configuration Selection

In processor-based embedded systems, performance is a very important criterion in deciding the best cache memory, but it is not the only issue that needs to be considered. In this dissertation, an embedded system is defined as a device designed to execute a specific task or a specific collection of tasks. Examples of embedded systems can be portable devices, like smart phones, hearing aids or digital cameras, or non-portable devices, such as routers or printers. Most modern embedded systems are energy or power critical. Experimental results show that in the Intel Pentium Pro processor [11], which is a general purpose processor, the most optimized use of the instruction cache memory consumes 22.2% of the power consumed by the processor during the execution of an application. On the other hand, in a StrongARM 110 processor [20], which is an embedded processor, 27% of the total power can be consumed by the instruction cache and 16% of the power can be consumed by the data cache. Therefore, power consumption by the cache memory is an important concern in deciding the appropriate cache memory for embedded systems. In addition, due to space constraints, an embedded system may not be able to deploy a cache memory of arbitrary size. Moreover, for real-time embedded systems, especially hard real-time systems, strict performance criteria must be fulfilled by the cache configurations. Examples of hard real-time systems are the sensor-based intelligent traffic signaling systems deployed on suburban roads in the state of New South Wales in Australia. The signaling system ensures a higher priority for emergency vehicles and the lowest possible waiting time for ordinary vehicles without causing accidents. In brief, the best cache memory must satisfy the space, energy/power and performance constraints.
1.3.2 Deciding The Best Cache Memory Configuration

Previously, it was commonly believed that the largest possible cache memory was preferable for a computer system, as large cache memories typically provide lower cache miss rates than small caches. However, for embedded systems and for systems with reconfigurable and extensible processors, the largest cache is certainly not the best cache configuration. This is because an increase in cache size increases the cache access time/latency and energy consumption along with space consumption. Figure 1.5 is presented to support this fact. Figure 1.5 compares the cache miss rates (x-axis) against the application’s total memory access time (y-axis) for different cache configurations for the g721encode application from Mediabench [21]. Cache access times were obtained from CACTI [22]. Figure 1.5 shows that the larger cache memories have fewer cache misses compared to the smaller caches, but large cache memories do not necessarily provide the best performance (the shortest total memory access time).

Figure 1.5: Cache memory design space exploration

To decide the optimal cache configuration for an embedded system, several methods have been proposed so far. Among these methods, trace driven cache simulation is a very popular and cost effective solution. In the following section, the trace driven cache simulation approach is discussed in detail.
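To make the trade-off shown in Figure 1.5 concrete, the sketch below ranks candidate configurations by a simple total-memory-access-time estimate computed from simulated hit/miss counts and per-access latencies. It is only an illustration: the cost model, configuration names and numbers are assumptions, not results from this thesis.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical summary of one simulated cache configuration.
struct ConfigResult {
    std::string name;        // e.g. "4KB, 2-way, 32B lines"
    uint64_t hits;           // cache hits reported by the simulator
    uint64_t misses;         // cache misses reported by the simulator
    double hit_time_ns;      // cache access latency (e.g. from CACTI)
    double miss_penalty_ns;  // time to fetch the block from main memory
};

// Total memory access time: every access pays the cache latency,
// and each miss additionally pays the miss penalty.
double total_access_time_ns(const ConfigResult& c) {
    uint64_t accesses = c.hits + c.misses;
    return accesses * c.hit_time_ns + c.misses * c.miss_penalty_ns;
}

int main() {
    std::vector<ConfigResult> results = {
        {"2KB, 1-way, 16B",  900000, 100000, 1.0, 60.0},
        {"8KB, 2-way, 32B",  980000,  20000, 1.4, 60.0},
        {"32KB, 4-way, 32B", 995000,   5000, 2.2, 60.0},
    };
    // The "best" cache is the one with the lowest estimated access time,
    // which is not necessarily the one with the fewest misses.
    auto best = std::min_element(results.begin(), results.end(),
        [](const ConfigResult& a, const ConfigResult& b) {
            return total_access_time_ns(a) < total_access_time_ns(b);
        });
    std::cout << "Best configuration: " << best->name << "\n";
}
```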
1.4 Trace Driven Cache Simulation

During the execution of an application, the total number of cache hits and misses varies depending on the cache parameters. When the number of cache misses increases, the cache memory needs to transfer more data to and from the slow but large main memory. The increase in data transfer between the main memory and the cache memory not only increases the execution time of the application but also the overall energy/power consumption of the system. Several previous studies [11, 23, 24] presented analytical models that correlate the execution time and power consumption with the number of cache misses. These methods can estimate power consumption and execution time with close
accuracy. Therefore, when the number of cache misses of an application is known, the energy consumption and execution time of that application on a given embedded system can be found quickly by utilizing those analytical models. In addition, the number of cache misses can also be used to identify memory bottlenecks, investigate memory behavior, identify a program’s hot spots, and optimize cache configurations for adequate pre-fetching and replacement strategies [25–28]. For this reason, trace driven cache simulation approaches have been proposed to find the number of cache misses quickly and accurately for a given cache configuration. In a trace driven cache simulator, the trace of memory block addresses accessed by one application is simulated on the given cache configurations to find the number of cache misses. The simulator reads one memory access at a time and decides whether it is a cache hit or a miss in the cache configuration provided by the user. Because fewer hardware details are simulated, these methods are usually faster than the other available methods of cache hit/miss detection for an application.
1.5 Motivation

If no optimization policy is utilized, trace driven simulators can be space consuming and may take years to finish a simulation. At the same time, the simulator must generate the number of cache misses accurately. To do that, analyzing each and every memory access request is necessary. However, the evaluation of each and every memory block access is time consuming if no speedup mechanism is used. Therefore, to shorten the time to market of the embedded system, these simulation processes must be optimized so that they can feasibly be deployed during system design. Because less hardware behavior is simulated, trace driven cache simulation approaches provide a better opportunity to utilize speedup mechanisms. Several proposals have been made to accelerate trace driven cache simulation approaches. To accelerate simulation, some of these proposals compromise the accuracy of the simulation results, while others maintain the accuracy needed for an accurate estimation of system behavior.

The trace driven simulation methods that produce an accurate cache hit/miss number
rely mainly on cache inclusion properties for faster operation. An inclusion property holds between two caches when the set of elements in one cache always remains a subset of the set of elements in the other cache, provided that both caches start running the same application at the same time. Therefore, by using inclusion properties, some of the simulation steps can be avoided by making an accurate prediction about one cache while simulating another. In this way, simulation time can be reduced enormously, without using extra space or hardware, when a large group of cache configurations is simulated together for the same application. One example of an inclusion property is: the set of elements in an LRU cache is a subset of the set of elements in an LRU cache with a larger set size. Inclusion properties are cache replacement policy dependent. To utilize the cache inclusion properties efficiently, supporting customized data structures are also used. However, not all cache replacement policies show inclusion properties. A replacement policy that does not show an inclusion property is called a non-stack algorithm; a replacement policy that does show an inclusion property is called a stack algorithm.

In embedded systems, stack algorithms are rarely utilized as cache replacement policies due to circuit complexity. The circuit complexity in caches increases both the energy consumption and the execution time of applications, in addition to the high cost of implementation. Embedded processors typically use a First-In-First-Out (FIFO) replacement policy in their caches instead. Some examples of embedded processors using FIFO as their cache replacement policy are: Tensilica Xtensa LX2 processors [29], Intel XScale [30], ARM9 [31] and ARM11 processors [32]. Their simple design makes FIFO caches inexpensive to implement. However, the LRU cache replacement policy shows fewer cache misses than FIFO [33]. To achieve fewer cache misses than FIFO without using the complex LRU replacement policy, a Pseudo LRU (PLRU) replacement policy is also used in embedded systems. PLRU is an approximation of LRU; therefore, the number of cache misses under PLRU is usually between that of LRU and FIFO. However, FIFO and PLRU do not have the features to show inclusion properties. To simulate caches that do not show inclusion properties, reducing the application trace size is another popular method
for speedup. These methods are known as trace compression methods. However, compressing the trace and generating the accurate number of cache hits and misses from the compressed trace can take a long time, which may make the simulation process infeasible to deploy. In addition to trace compression, another, resource hungry, speedup mechanism is parallel simulation. In parallel simulation, a large number of computing resources are used to accelerate the simulation process. Trace compression and parallel simulation techniques are described in detail in the Literature Survey chapter.

This dissertation intends to study new optimization/speedup techniques, frugal in both space and computing resources, that can be deployed in the trace driven simulation of caches using non-stack algorithms as their replacement policy. The speedup techniques should be able to quicken the cache hit/miss detection process during simulation. This dissertation also intends to present fast cache simulators utilizing the proposed speedup mechanisms. The rest of the chapter describes the scope of the dissertation, followed by a description of its organization.
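The LRU inclusion property mentioned above is what makes classical stack-based single-pass simulation possible: one pass over the trace yields the hit/miss counts for every fully associative LRU capacity at once, because the contents of a smaller LRU cache are always a subset of the contents of a larger one. The sketch below is a toy illustration of that idea (fully associative caches, a fixed block size, a hard-coded trace); it is not SuSeSim or any other simulator presented later in this thesis.

```cpp
#include <cstdint>
#include <iostream>
#include <list>
#include <map>
#include <vector>

int main() {
    // Hypothetical block-address trace; a real tool would read a trace file.
    std::vector<uint64_t> trace = {1, 2, 3, 1, 2, 4, 1, 5, 2, 1};

    std::list<uint64_t> stack;                      // most recently used block at the front
    std::map<uint64_t, uint64_t> hits_at_distance;  // stack distance -> number of accesses

    for (uint64_t block : trace) {
        uint64_t depth = 1;
        auto it = stack.begin();
        for (; it != stack.end(); ++it, ++depth) {
            if (*it == block) break;
        }
        if (it != stack.end()) {         // reuse at stack distance 'depth'
            hits_at_distance[depth]++;
            stack.erase(it);
        }                                // otherwise: a cold miss for every capacity
        stack.push_front(block);         // the block becomes the most recently used
    }

    // A capacity of C blocks hits exactly those accesses with stack distance <= C.
    uint64_t cumulative = 0;
    for (uint64_t capacity = 1; capacity <= 5; ++capacity) {
        cumulative += hits_at_distance[capacity];
        std::cout << "capacity " << capacity << " blocks: "
                  << cumulative << " hits, "
                  << trace.size() - cumulative << " misses\n";
    }
}
```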
1.6 Scope

The goal of this dissertation is to find the number of cache hits/misses of an application quickly and accurately for caches that do not use stack algorithms as their replacement policy, so that the information can be used to find the best cache memory configuration for an embedded system. Therefore, investigations have been performed to develop methods that can be utilized in trace driven cache simulation to accelerate the cache hit/miss detection process.

To understand the role of inclusion properties in speeding up cache hit/miss extraction from an application trace without using excessive space or computing resources, LRU cache inclusion properties have been analyzed along with the simulators that deploy these inclusion properties. Based on the experience gained from this analysis, two different speedup techniques have been studied for non-stack algorithm based cache simulation: (i) custom tailored data structure based speedup; and, (ii) replacement policy
based speedup. In the first technique, methods to customize the data structures used to represent cache memories have been studied for quick data retrieval, with the assistance of the temporal and spatial locality of caches. Due to temporal locality, recently used data have a higher possibility of being used in the near future. Due to spatial locality, data around a recently used data block have a higher possibility of being used in the near future. In the second technique, non-stack replacement policies have been analyzed to figure out their predictable features. When the cache status can be predicted, no simulation operation is necessary to determine a cache hit or miss in that particular cache configuration; therefore, simulation time can be saved. Predictable properties can also help to determine the status of a group of caches just by simulating one or a few cache configurations. Two widely used non-stack replacement policies, (i) FIFO and (ii) PLRU, have been analyzed in this dissertation.

This dissertation also presents five single-pass cache simulation techniques to determine the cache hit/miss rate quickly from an application trace for LRU, FIFO and PLRU caches. A single-pass simulation technique reads the application trace once and, for each memory access request in the trace, determines a hit or miss in different cache configurations. To accelerate the simulation process, the single-pass simulation approaches presented in this dissertation deploy the optimization techniques discovered during the investigation.

The work presented in this dissertation targets single-pass cache simulation for embedded systems only and may not be applicable to general purpose processor cache design, especially when applications are unknown. The predictable features of the FIFO and PLRU replacement policies presented in this dissertation can be used in cache memory predictability analysis. However, such analysis is out of the scope of this thesis.
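As a point of reference for the term "single-pass", the following sketch shows the unoptimized structure such a simulator has: the trace is read once, and each address is evaluated against several configurations (direct mapped here, for brevity) in the same pass. The class, file name and configurations are illustrative assumptions; the simulators presented in this thesis replace this brute-force inner loop with smarter data structures and cache properties.

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

// One direct-mapped configuration: 'sets' lines of 'block_size' bytes.
struct DirectMappedCache {
    uint64_t sets, block_size;
    std::vector<uint64_t> tags;   // tag currently stored in each set
    std::vector<bool> valid;
    uint64_t hits = 0, misses = 0;

    DirectMappedCache(uint64_t s, uint64_t b)
        : sets(s), block_size(b), tags(s, 0), valid(s, false) {}

    void access(uint64_t address) {
        uint64_t block = address / block_size;
        uint64_t index = block % sets;
        uint64_t tag   = block / sets;
        if (valid[index] && tags[index] == tag) {
            ++hits;
        } else {                  // miss: fetch the block into this set
            ++misses;
            valid[index] = true;
            tags[index]  = tag;
        }
    }
};

int main() {
    // A handful of configurations explored in the same pass.
    std::vector<DirectMappedCache> configs = {
        {64, 16}, {128, 16}, {256, 32}, {512, 32}
    };
    std::ifstream trace("trace.txt");        // hypothetical trace: one hex address per line
    uint64_t address;
    while (trace >> std::hex >> address)     // single pass over the trace ...
        for (auto& c : configs)              // ... evaluated for every configuration
            c.access(address);
    for (const auto& c : configs)
        std::cout << c.sets << " sets x " << c.block_size << "B: "
                  << c.misses << " misses\n";
}
```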
1.7 Contribution and Organisation

This thesis discusses the mechanisms of designing a fast, space efficient and accurate single-pass cache simulator for non-stack algorithm based caches. The research work that led to these methodologies, algorithms and implementations was performed during the
period July 2008 to March 2011.
In particular, the goals that were achieved throughout the work discussed in this dissertation are as follows:

• A new set of inclusion properties to accelerate cache hit/miss detection for LRU single-pass cache simulation.

• A new LRU single-pass cache simulator, “SuSeSim”, that achieves the fastest performance among the available LRU single-pass simulators by deploying the cache inclusion properties.

• Two special data structures, “Wave pointers” and “Central Look-up Table”, to assist in the acceleration of single-pass simulation of caches with any type of replacement policy.

• Two fast, single-pass, trace driven FIFO cache simulators, “DEW” and “SCUD”, that deploy the “Wave pointers” and “Central Look-up Table” to accelerate the cache hit/miss extraction process.

• A new type of cache property called the “Intersection Property” to accelerate the single-pass simulation of stack or non-stack algorithm based caches without using any extra computing resources.

• Three FIFO intersection properties.

• A new trace driven, space efficient, single-pass FIFO cache simulator, “CIPARSim”, that deploys the FIFO intersection properties. CIPARSim shows the fastest performance among the available FIFO cache simulators.

• Two PLRU intersection properties.

• A new trace driven, space efficient, single-pass PLRU cache simulator, “PSAICO”, that deploys the PLRU intersection properties. PSAICO achieves a huge speedup over traditional non-optimized PLRU cache simulators.
The leading chapter of this dissertation provides a background for the technical content. The remainder of the dissertation is organized as follows:

Chapter 2 provides a literature survey of publications related to cache simulation techniques. This chapter also discusses several speedup mechanisms for cache hit/miss detection during a cache simulation process.

Chapter 3 provides an overview of the thesis and discusses the flow of the research.

Chapter 4 presents the first of the series of studies performed in this thesis, undertaken to gain a better understanding of the use of cache inclusion properties in single-pass simulation. The analysis of the available stack algorithm based single-pass simulators revealed that inclusion properties help to predict the cache status without using any extra computing resources. Therefore, single-pass simulators become not only fast but also space efficient when inclusion properties are utilized. The data structures used to represent cache configurations in single-pass simulation are an important factor in the successful use of the cache inclusion properties. In addition, the cache inclusion properties must be selected depending on the goal of the single-pass simulator to achieve a faster performance. A single-pass LRU cache simulation methodology, SuSeSim, is presented in Chapter 4 to verify the findings and identify the flaws of the knowledge gained through the analysis of the cache inclusion properties and single-pass simulators. From the experimental results, SuSeSim was found to be the fastest of the available LRU simulators.

Chapter 5 presents the second study of the series, which utilizes a data structure based speedup mechanism for the non-stack replacement policy FIFO. Utilizing a novel data structure, the “Wave pointer”, a new single-pass, trace driven FIFO cache simulator, “DEW”, is proposed in this chapter. To keep space consumption minimal, DEW can simulate caches with varying set sizes only. DEW significantly outperformed the state-of-the-art FIFO cache simulator of the time, DineroIV.

Inspired by the potential of data structure based speedup mechanisms, a new methodology, SCUD, was next analyzed to overcome the limitations of the wave pointers.
Chapter 6 describes this methodology. SCUD outperformed DEW by utilizing a central look-up table based data structure that allows quick simulation of caches with varying set sizes, associativities and cache line sizes. However, the enormous size of the central look-up table required a large amount of space in memory. Both the DEW and SCUD methodologies revealed that data structure based acceleration always increases space consumption in exchange for flexibility.

To overcome this problem, an alternative to inclusion properties, “Intersection properties”, was introduced in the study discussed in Chapter 7. Intersection properties predict the status of caches accurately without using any extra space in memory or excessive computing resources. Both stack and non-stack algorithms show intersection properties. Therefore, unlike inclusion properties, intersection properties can be used in single-pass simulation regardless of the cache replacement policy. Utilizing the intersection properties and space efficient supporting data structures, a new single-pass FIFO cache simulation methodology, CIPARSim, is discussed in Chapter 7. CIPARSim outperformed both DEW and SCUD immensely, with almost half the space consumption of SCUD.

The work presented in Chapter 8 studies the effectiveness of the intersection properties and the supporting data structure based acceleration in the single-pass simulation of the second widely used non-stack algorithm, PLRU. To utilize the speedup mechanism, some of the intersection properties for PLRU caches are shown in this chapter. Exploiting the intersection properties, a single-pass methodology, PSAICO, is also presented. PSAICO achieves a very large speedup compared to unoptimized PLRU simulators.

Chapter 9 concludes the thesis, summarizing the key points contained throughout the dissertation, and takes a look at some of the future work that can extend upon the work performed here. Following this, several appendices provide code samples used in various parts of the project.
Chapter 2

Literature Survey

The number of cache hits and misses of one application in different cache configurations is highly unpredictable. That is why cache memory simulation techniques came about to assist in finding the cache miss rate of an application in a given cache configuration. However, simulating each and every processor request is a time consuming task which is sometimes infeasible to deploy during system design due to the short time available to market the embedded system. At the same time, the actual number of cache hits and misses is impossible to determine without considering each and every memory access request made by the program. Therefore, finding the number of cache misses of an application in a given cache configuration quickly and accurately remains a key point of interest for researchers. In this chapter, we are going to discuss some of the popular simulation techniques proposed to discover the number of cache misses of an embedded system application. We are also going to discuss the proposed speedup mechanisms for these techniques.
2.1 Categories of Cache Simulation Techniques

Cache simulation techniques to find the number of cache misses of an application can be broadly categorized into two groups: (i) online simulation techniques; and, (ii) off line simulation techniques.
An online cache simulator uses trapping, or executes the application on a functional processor simulator, and traces the instruction and data memory accesses. Once the trace is generated, depending on the given cache configuration, the cache simulator accurately decides how many of the memory accesses will be cache hits and how many of them will be misses. Some examples of online simulators are:

Pixie: Pixie [34] is a utility to trace, profile or generate dynamic statistics, including cache misses, for any program that runs on a MIPS processor. It works by annotating an executable object code with additional instructions that collect the dynamic information at run time.

ATOM based simulators: ATOM [35] provides a framework for building customized program analysis tools, supplying the common infrastructure needed by code instrumenting tools. It has been used to build a diverse set of tools for basic block counting, profiling, dynamic memory recording, instruction and data cache simulation, pipeline simulation, branch prediction evaluation and instruction scheduling. Some examples of cache and memory performance evaluation tools based on ATOM are SIGMA [36], CPROF [37] and MemSpy [38, 39].

SimOS: SimOS [40] is a machine simulation environment designed to study large complex computer systems. SimOS simulates computer hardware, including the cache memory, in sufficient detail to run existing system software and application programs.

QPT: QPT [41, 42] is mainly an application profiler and tracing system. It rewrites an executable file of a program by inserting code to record the execution frequency or sequence of every basic block or control-flow edge. From this information, another program, QPT STATS, can calculate the execution time cost of procedures in the program.

Using only commonly available performance counters on existing processors, Richard et al. presented an online cache simulator in [43] to estimate the cache occupancies of software threads. An analytical model has also been proposed in this article that allows the consideration of the impact of set associativity, the line replacement policy, and memory
locality effects. Tao et al. in [44] propose an online cache simulation engine that uses instrumentation integrated at runtime by rewriting the code on the fly. This technique is designed to study the cache behavior of OpenMP applications on SMP machines. Ravindran et al. proposed an online cache simulator that allows cache simulation and evaluation well before the actual processor design [45–47]. The simulator is built upon the functional simulator Fsim [48]. In addition, most of the modern instruction set simulators (such as SimICS [49, 50], ARMISS [51], Shade [52], gem5 [53], SMART [26]) and cycle accurate simulators (such as [54–56]) have the functionality to be used as an online cache simulator. That is why modern extensible and reconfigurable processors are provided with instruction set simulators and cycle accurate simulators (e.g., the MULTI Integrated Development Environment for ARC processors [57] and the Xtensa LX2 Development Environment [58]).

Online cache simulators, especially instruction set simulators and cycle accurate simulators, do not just find the number of cache misses of an application; they also provide several other details, as they mimic the behavior of an embedded processor during memory access trace generation and cache hit/miss extraction. Due to the detailed hardware behavior simulation, online cache simulators can analyze only a single cache memory configuration at a time. To find the cache miss rate of a large number of cache memories, the designer has to use these simulators on the different cache configurations separately and analyze the results manually. As the number of available cache configurations is typically very large and each simulation takes a long time, this process would take years to find the number of cache misses for all the considered cache configurations. In addition, in an online simulator, only one application is executed. Therefore, such simulators are unable to simulate interference between different applications in a shared memory environment.

In an off line cache simulator, a memory access trace is not generated. Instead, an off line cache simulator finds the number of cache misses for a given cache configuration when the memory access trace of an application is provided. As hardware details are not mimicked in the memory access trace generation and simulation, off line cache
simulators can be designed to find the cache miss rate of an application on multiple cache configurations by reading the application trace only once. Besides that, the application trace can be preprocessed to get rid of some of the unnecessary memory accesses. For example, in an application's memory access trace, a request to access the most recently accessed (MRA) memory block is always going to be a cache hit regardless of the cache parameters chosen. Therefore, simulation steps for accesses to the MRA memory block can be omitted when generating the number of cache misses. This type of preprocessing helps to reduce the simulation time of an off line simulator and the space consumption by the trace file. Due to these reasons, cache hit/miss extraction in an off line cache simulator is usually much faster than in an online cache simulator. Some examples of off line simulators are:
DineroIV [59]: This popular simulator is designed to extract cache hit/miss information from a given application trace. However, DineroIV simulates one cache configuration at a time for the provided memory access trace file.
Cheetah [60, 61]: Cheetah is a cache simulation package which is able to simulate various cache configurations in a single pass over the memory access trace of the application. Cheetah can simulate ranges of set-associative, fully-associative or direct mapped caches.
In the following subsection, we are going to discuss more off line simulators as examples of the single-pass simulators, compressed trace simulators and parallel simulators.
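To make the MRA preprocessing idea above concrete, the following minimal sketch (our own illustration, not code from any of the simulators cited in this chapter) removes guaranteed hits from a raw address trace; the line size LINE_SIZE and the one-address-per-line trace format are assumptions of the sketch.

#include <cstdint>
#include <fstream>
#include <iostream>

// Minimal sketch: strip accesses to the most recently accessed (MRA) memory
// block from a raw address trace. Such accesses are cache hits for every
// cache configuration whose line size equals LINE_SIZE, so they can be
// counted once and removed before off line simulation.
int main(int argc, char* argv[]) {
    if (argc < 3) return 1;
    const uint64_t LINE_SIZE = 32;           // assumed line size in bytes
    std::ifstream in(argv[1]);               // one hexadecimal address per line
    std::ofstream out(argv[2]);              // reduced trace
    uint64_t addr, lastBlock = UINT64_MAX, guaranteedHits = 0;
    while (in >> std::hex >> addr) {
        uint64_t block = addr / LINE_SIZE;   // memory block address
        if (block == lastBlock) {
            ++guaranteedHits;                // MRA access: always a hit
        } else {
            out << std::hex << addr << '\n'; // keep for detailed simulation
            lastBlock = block;
        }
    }
    std::cout << "guaranteed hits removed: " << guaranteedHits << '\n';
    return 0;
}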
2.2 Simulation Speedup Mechanisms
The two main time consuming tasks in an online cache simulator are (i) application trace generation and (ii) cache hit/miss evaluation on the generated trace. In an off line cache simulator, however, reading each memory access request and analyzing it for each cache configuration's cache hit/miss evaluation is the main time consuming phase. The time taken in these phases can make a simulator infeasible to deploy when the time to market of the embedded system is very short. That is why researchers have proposed several techniques to reduce the time consumption in these phases. In the following subsections, we are going to discuss several key speedup mechanisms hitherto proposed for both online and
off line cache simulators.
2.2.1 Speedup Mechanisms for Online Simulation Techniques
A traditional online cache simulator can either generate the trace of memory accesses by running an application to the end on a functional processor simulator and performing cache hit/miss evaluation on the generated trace file, or it can force the program to trap after the execution of each instruction and perform the cache hit/miss evaluation on the current memory access. Each of these approaches can easily slow down the execution of the application by a factor of 1000 or more [62–68]. To generate the trace of memory accesses with less overhead than trapping or simulation, a technique called "Inline tracing" can be used. In this method, measurement instructions are inserted in the program to record the memory block addresses accessed by the processor in a separate buffer. To reduce the storage requirement in the trace buffer, cache hit/miss evaluation can be performed on the buffered trace concurrently. Borg et al. [69] used inline tracing and modified the programs at compile time to write addresses to a trace buffer. The addresses in the trace buffer were analyzed by a separate process for cache hit/miss evaluation. For trace generation, this method was 8 to 12 times slower than the normal execution of the applications selected for the experiment. Cache hit/miss evaluation from the buffered trace was about 100 times slower than the normal execution time of the tested applications. Eggers et al. [70] also used inline tracing to generate an application trace; however, the trace was copied from the buffer to the disk by a separate process to avoid interference with the trace generator. That is why their trace generator could work faster than the method of Borg et al. [69]. Without any optimization, a traditional online cache simulation technique can consume a large amount of time. For a 1GHz processor, a one second snapshot stored as an application trace might translate to a 10 gigabyte file [71]. A large trace not only consumes a huge space, but it also slows down the trace generation process due to the
excessive write operations. Eggers et al. also addressed this problem in their article [70]. To minimize the overhead of generating the application trace, they first produced a subset of the memory block addresses in the trace from which the other addresses could be inferred during a postprocessing pass over the trace file. For instance, they only stored the first memory block address in a sequence of contiguous basic blocks with a single entry point and multiple exit points. Rather than reserving a set of registers to be used for the trace generation code, they identified which registers were available and thus avoided executing many save and restore instructions. Due to these initiatives, trace generation was accomplished in less than 3 times the normal execution time of an application. In addition, writing the buffers to disk required a factor of 10 times the normal execution time. The postprocessing pass, which generates the complete trace from the subset of memory block addresses stored, was much slower and produced about 3000 addresses per second. No information was given on the overhead required to actually analyze the cache performance. Ball and Larus [72–74] also reduced the overhead of the trace generation by storing a portion of the trace from which the complete trace can be generated. They optimized the placement of the instrumentation code (measurement instructions) to produce the reduced-size trace with respect to a weighting of the control flow graph. They showed that the placements are optimal for a large class of graphs. In this method, the overhead for the trace generation was less than a factor of 5. However, the postprocessing pass to regenerate the full trace required 19 to 60 times the normal execution time of the application. Whalley tried to reduce the cache hit/miss evaluation time by proposing some on-the-fly simulation techniques in [75, 76]. With the trace generator, Whalley linked a cache simulator which was instrumented with measurement code to evaluate the instruction cache performance during the application execution. The techniques he evaluated avoid making calls to the separate cache simulator when the cache hit can be determined in a less expensive manner. Therefore, the overhead time for these techniques was highly dependent upon the hit ratio of the applications. Whalley reported an overhead of 15 times
compared to the normal execution time for average hit ratios of 96% and 2 times the normal execution time for hit ratios exceeding 99%. These faster techniques also required the recompilation of the program when the cache configuration was altered. The online simulator Shade [52, 77] uses a clever mechanism to reduce the communication overhead between the trace generator and the cache hit/miss analyzer program. To reduce the communication overhead, Shade's instruction set simulator, the trace generator and the cache hit/miss analyzer program are executed in the same address space. To accelerate the trace generation process, the level of detail in the trace file can be customized. To further improve performance, code which simulates and traces the application is dynamically generated and cached for reuse. Shade provides fine control over tracing, so users pay a collection overhead only for data they actually need. In Shade, tracing the SPEC 89 benchmark applications [78] runs about 2.3 times slower for floating-point programs and 6.2 times slower for integer programs compared to the normal execution.
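As a rough illustration of the inline tracing idea used by Borg et al. and Eggers et al., the sketch below shows instrumentation-style recording of accessed addresses into a trace buffer that is flushed when full. The function record_access(), the buffer size and the toy access pattern are hypothetical; a real instrumenter inserts the equivalent of such a call around every load, store and instruction fetch.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical trace buffer filled by inserted measurement instructions.
// When the buffer fills, it is handed to a consumer (here: written to disk);
// a concurrent analyzer could instead evaluate cache hits/misses directly.
namespace tracing {
    constexpr size_t BUFFER_ENTRIES = 1 << 20;
    std::vector<uint64_t> buffer;
    FILE* traceFile = nullptr;

    void flush() {
        if (!buffer.empty()) {
            fwrite(buffer.data(), sizeof(uint64_t), buffer.size(), traceFile);
            buffer.clear();
        }
    }
    // The instrumentation inserts a call like this at every memory access.
    void record_access(uint64_t address) {
        buffer.push_back(address);
        if (buffer.size() >= BUFFER_ENTRIES) flush();
    }
}

int main() {
    tracing::traceFile = fopen("trace.bin", "wb");
    tracing::buffer.reserve(tracing::BUFFER_ENTRIES);
    // Stand-in for an instrumented application: a toy access pattern.
    for (uint64_t i = 0; i < 100; ++i) tracing::record_access(0x1000 + 4 * i);
    tracing::flush();
    fclose(tracing::traceFile);
    return 0;
}

Writing the buffer out in large chunks, rather than one address at a time, is what keeps the per-access overhead of such schemes low.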
2.2.2 Speedup Mechanisms for Off Line Simulation Techniques
Off line cache simulation speedup mechanisms can be divided into two categories: (i) Estimation methods that find the approximate cache miss rate or the total number of cache misses using analytical models or heuristics; and, (ii) Exact simulation methods that analyze each and every memory access request by the processor, but make a smart utilization of cache behavior to quickly calculate the exact number of cache hits and misses. Both these methods have their pros and cons. In the following subsections, we are going to discuss these methods in detail.
2.2.2.1 Estimation Methods
Reading each and every memory access request from the trace file and analyzing it for a cache hit or miss is a time consuming task. To reduce execution time, several off line cache simulation methods predict a large number of cache misses/hits by utilizing heuristics. Several other off line cache simulators use analytical techniques and find the cache
miss rate (the fraction of memory accesses in an application trace that miss) instead of the actual number of cache misses. These analytical and/or heuristic dependent methods are called estimation methods. Inaccurate results are the major drawback of these methods; however, the enormous speedup during cache simulation attracts a large group of users. These methods are attractive when close-to-accurate results are enough to make a fairly close estimation of the execution time of the application, the energy consumption of the cache memory and several other memory behaviors. A large number of studies have been conducted to find efficient estimation methods. Some of the memory hierarchy studies used empirical cache models to find the approximate cache miss rate. Chow [79] assumed a power function of the form m = A C^B for the miss rate, where C is the size of that level in the memory hierarchy and A and B are constants. He neither gave a basis for this model nor validated it against experimental results. Smith showed in [9] that the above function approximates the miss ratio for a given set of results within certain ranges for appropriate choices of the constants. However, no claims were made for the validity of the power function for other workloads, architectures, or cache sizes. Rao et al. [80] used the Independent Reference Model [81] to analyze cache performance. This model was chosen primarily because it was analytically tractable. Using the arithmetic and geometric distributions for page reference probabilities, cache miss rate estimates were provided only for direct-mapped, fully-associative and set-associative caches. The main drawbacks of this method were that it assumed fixed page sizes in memory and that the number of parameters needed to describe the program was very large. Moreover, validations against real program traces were not provided. Smith [82] studied the effect of page mapping algorithms and set-associativity using two models: (i) a mixed exponential model; and, (ii) the inverse of Saltzer's linear paging model [83], for the miss-ratio curve of a fully-associative cache. The miss rate formulas described in this article compared well with the exact simulation results. However, a separate characterization was necessary for each cache line size. Besides that, time-dependent effects and multiprogramming issues were not addressed in this proposal.
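For concreteness, the constants of such a power-law model can be fitted from two measured miss ratios; the short derivation below is our own illustration and is not taken from [79] or [9].

Given measured miss ratios $m_1$ and $m_2$ at cache sizes $C_1$ and $C_2$, the model $m = A\,C^{B}$ yields
\[
  \frac{m_1}{m_2} = \left(\frac{C_1}{C_2}\right)^{B}
  \;\Longrightarrow\;
  B = \frac{\ln(m_1/m_2)}{\ln(C_1/C_2)}, \qquad
  A = \frac{m_1}{C_1^{\,B}} .
\]
For example, $m_1 = 0.10$ at $C_1 = 1$ KB and $m_2 = 0.05$ at $C_2 = 4$ KB give $B = \ln(2)/\ln(1/4) = -0.5$ and $A = 0.10$, i.e. under this fit the miss ratio falls with the square root of the cache size.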
Haikala [84,85] assessed the impact of the task switch interval on cache performance. He used a simple Markov chain model to estimate the effect of cache flushing on the number of cache misses. The LRU stack model of program behavior [86] and geometrically distributed lengths of task switch intervals were assumed. The model reasonably estimates the cache miss rate for small caches where task switching flushes the cache completely and is pessimistic for large caches where significant data retention occurs across task switches [87, 88]. Easton and Fagin [89] focussed on the need to accurately determine the cold-start effect of the finite trace lengths, particularly for large caches. Cold-start is a special scenario in a multi-level memory hierarchy where the first level (closest to the processor) memory/cache memory is initially empty. Easton and Fagin used cold-start miss ratios to analyze the impact of task switching on cache performance and proposed models to accurately obtain cold-start miss rates from steady-state miss rates for fully-associative caches. In a later article [90], Easton showed how the cold-start miss rates of caches of different sizes for several task-switching intervals could be efficiently computed. Strecker [91] addressed the fact that the miss rate is sometimes measured by running a single program to completion; however, in real systems rarely does such uninterrupted execution occur. Strecker analyzed the transient behavior of cache memories for programs in an interrupted execution environment using the linear paging model for the miss ratio function [91]. Strecker’s analysis accounts for data retention in the cache across interruptions. The form of the miss ratio function used is (a + bn)/(a + n), where n is the number of cache lines filled; and a and b are constants obtained by measuring the program miss rates at two cache sizes. The predicted miss rates of several real programs run individually and interleaved for various execution intervals compare reasonably well with exact simulation results. Like Strecker, Stone et al. [92] studied the transient behavior of caches. They calculate the minimum number of transient cache refills necessary at process switch points as a function of the number of distinct cache entries used by the program (called the program footprint), for two processes executing in a round-robin fashion. They have also shown
that the predictions of the reload transients agree well with exact simulation results. However, they do not give a method of obtaining the footprint and limit their discussion to a multitasking level of only two processes. Kumar [93] investigated the effect of spatial locality in programs and proposed an empirical technique to estimate the working set of a program for different cache line sizes. The miss rate calculated as a function of cache line size was shown to correlate well with the results of the exact simulation. Besides being specific to cache line size effects, the study has certain other drawbacks. Validation is carried out only for very large caches to exclude the effect of program blocks colliding with each other in the cache. Hence, only start-up effects can be considered to be adequately modeled in this study. By representing the factors that affect cache performance, an analytical model has been proposed by Agarwal et al. [94] that gives the cache miss rates for a given trace as a function of the cache size, degree of associativity, cache line size, sub-block size, multiprogramming level, task switch interval, and observation interval. The values, which are predicted by combining actual measurements and analytical techniques, closely approximate the results of trace-driven exact simulations, while requiring only a small fraction of the computation cost. Fornaciari et al. [95] proposed a heuristic method to find a suboptimal configuration of the cache architecture without exhaustive analysis of the space of the cache parameters. Their analysis looked at the sensitivity of the energy-delay product to individual cache parameters. The maximum error in this method was guaranteed to be less than 10%. Pieper et al. [96] proposed an estimation method dependent on a metric that represents cache behavior independently of the cache structure. The cache miss rate generated in this method was within 20% accuracy of an exact simulation approach that simulates a uniprocessor system. However, this method was 20 to 50 times faster than an exact simulation method. Ghosh et al. described a compiler based estimation method in [97] for cache miss calculation. They introduced the Cache Miss Equations (CMEs) as a mathematical framework that precisely represents cache misses in a loop nest. They estimate the number of
cache misses in a code segment by counting the number of solutions of a system of linear Diophantine equations extracted from the loop nest, where each solution corresponds to a potential cache miss. Two kinds of equations are generated in this method: compulsory equations, which represent compulsory misses; and replacement equations, which represent interference with other memory block references. The number of cache misses is computed by traversing the iteration space and solving the system of equations at each iteration point. Although solving these linear systems is an NP-hard problem, the authors claim that the mathematical techniques for manipulating the equations allow them to compute the number of possible solutions relatively easily without solving the equations. The method described in this article to extract cache miss information from CMEs is, however, limited in accuracy. For each cache configuration, the entire process had to be repeated in this method to find the number of cache misses. The method is limited to nested loop analysis only. No practical workload has been analyzed to show the effectiveness of this method, and solving the CMEs was difficult and slow. Vera et al. [98] proposed a fast and accurate method to solve the cache miss equations (CMEs). Using sampling techniques, their proposed method could find the approximate cache miss ratio by analyzing a small subset of the iteration space. Results are given with a confidence interval, parameterizable by the user. The authors expanded their work in [99] to suit entire programs. However, they still perform the analysis at the loop level, and rely on the same technique used in [100] to transform the entire program into one loop. In [101], an analytical model is presented for modeling the behavior of loop nests executing in a memory hierarchy. In this article, Presburger formulae [102] have been used to express various kinds of cache misses as well as the state of the cache at the end of the loop nest. Validation against exact cache simulation results confirms that the formulation of cache misses is close to correct. The model is powerful enough to handle imperfect loop nests and various flavors of non-linear array layouts based on bit interleaving of array indices. However, the major drawback of this model was its inability to handle caches with various replacement policies and associativities
besides being specific to nested loop analysis. Cascaval et al. [100] proposed a compiler based method for cache miss estimation utilizing the "Stack Distance". In this method, one memory block address is read from the trace file at a time and pushed onto a stack if it has never been accessed before. If a memory block address is re-accessed, it is pulled out of the stack and pushed back on top of the stack. The depth from which a certain memory block address has to be extracted from the stack is called the stack distance of that reference or memory block address. Using the trace of stack distances for each memory block address, the total number of cache misses for a given application could be estimated. This method was capable of reasonably approximating cache misses for almost 80% of the loops in the SPECfp95 benchmark applications [103]. Ding et al. [104] proposed a faster method for generating the stack distance under the LRU replacement policy. They called the stack distance for the LRU replacement policy the "Reuse Distance". Ding et al. showed that the reuse distance can be calculated much more quickly when a tree is used instead of the stack. On average, the accuracy of this method was 94% for the cache miss calculation, and 99% of the loops from the SPEC95 benchmarks [105] could be handled by this method. In their article [106], Fang et al. showed that the reuse distance is predictable on a per instruction basis for floating-point program traces. On average, over 90% of all memory operations executed in a program are predictable with almost 97% accuracy. In addition, the predictable reuse distances translate to predictable miss rates for the instructions. On average, this method predicts the critical instructions with an 86% accuracy. Several estimation techniques utilize the fact that a smaller trace file needs less simulation time to find the total number of cache misses compared to a complete application trace of memory accesses. To utilize this fact, a sample trace file is generated from the actual trace file using several sampling methods. Methods proposed in [107, 108] generate a sample trace by recording a memory access reference after a certain time interval. The method proposed in [109] generates a sample trace file for certain cache sets only. In [110], a group of contiguous memory access references from the trace file were selected to form a cluster in the sample trace file. This method reports a maximum speedup of 73 times over an exact simulation method.
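A minimal sketch of stack distance computation is given below. It is our own illustration of the general idea rather than code from [100] or [104], and it uses a simple linear search over the stack; the tree of Ding et al. replaces exactly this search to obtain the reuse distance much more quickly.

#include <cstdint>
#include <cstdio>
#include <vector>

// Compute the stack distance of each block reference in a trace: the depth at
// which the block is found in an LRU-ordered stack (0 = most recent), or -1
// for a cold (never seen) reference. After each reference the block is moved
// to the top of the stack.
std::vector<long> stack_distances(const std::vector<uint64_t>& blocks) {
    std::vector<uint64_t> stack;            // front = most recently used
    std::vector<long> dist;
    dist.reserve(blocks.size());
    for (uint64_t b : blocks) {
        long d = -1;
        for (size_t i = 0; i < stack.size(); ++i) {
            if (stack[i] == b) { d = static_cast<long>(i); break; }
        }
        dist.push_back(d);
        if (d >= 0) stack.erase(stack.begin() + d);   // pull out of the stack
        stack.insert(stack.begin(), b);               // push back on top
    }
    return dist;
}

int main() {
    // A fully-associative LRU cache with W lines hits exactly when the stack
    // distance is >= 0 and < W, so one pass yields misses for every W.
    std::vector<uint64_t> trace = {1, 2, 3, 1, 2, 4, 1};
    for (long d : stack_distances(trace)) std::printf("%ld ", d);
    std::printf("\n");   // prints: -1 -1 -1 2 2 -1 2
    return 0;
}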
In [111], a random sampling based method, "StatCache", was presented to find the average cache miss rate from an application trace of memory accesses. Like the methods in [107–110], StatCache also cannot find the accurate number of cache misses for an application due to sampling; however, the average miss ratio estimated by it is fairly accurate with a sampling rate as low as 10^-4.
2.2.2.2 Exact Simulation Methods
Calculating the exact number of cache misses is essential to calculate energy consumption and to accurately predict the behavior of the memory hierarchy. Especially for large computer systems and cache memories, a small deviation in the number of cache misses can make a huge impact on energy consumption and memory subsystem behavior. In addition, finding the number of cache misses accurately is extremely necessary in the predictability and worst-case execution time analysis of hard real-time systems and ordinary embedded system applications. For these reasons, exact simulation methods take each and every memory block address access into consideration to calculate the number of cache misses from an application trace. Therefore, exact simulation can take years when optimization methods are not utilized. In this section we are going to discuss some of the optimization methods utilized in exact simulation methods to reduce simulation time.
Trace Compression: The simulation time of an exact simulation method is approximately proportional to the size of the trace file, or the number of memory accesses in the application trace [112]. Because of this, several proposals have been made to compress the application trace without affecting the accuracy of the results produced. Smith [112] pioneered this work by proposing a trace compression technique for memory paging studies. He used an LRU model of memory references and produced a reduced trace by deleting memory address references that accessed the top 'D' levels of the LRU stack of data. The resulting trace, if used for simulating memory larger than 'D'
pages under the LRU replacement algorithm, would produce the same number of cache misses as the original long trace, provided that the page size is kept constant. Puzak extended Smith's technique in his article [113]. He called his new technique "Trace Stripping". His approach focused on reducing traces for simulating set-associative caches. It works as follows: a direct-mapped cache, serving as a filter, is simulated and the cache miss references are recorded in a separate trace. The reduced trace, if used to simulate caches with a larger number of sets, would result in the same number of misses as if the original trace was used, provided that the cache line size is kept the same. Unfortunately, both of the above mentioned proposals have three major limitations: (i) the reduced trace method can only produce a count of cache misses but not the number of write-backs for write-back caches; (ii) these methods apply only to uniprocessor caches and are inadequate for multiprocessor caches; and, (iii) the reduced trace cannot be used to simulate caches whose block/cache line sizes differ from that of the filter cache used for the trace compression. Wang and Baer tried to overcome these problems in their article [114]. They used separate filter caches for different block sizes. To simulate the multiprocessor environment, a special cache coherence protocol was attached to the filter caches. A universal reduced trace was generated by collecting the superset of misses from the trace files generated by the filter caches with different block sizes. The universal reduced trace helped to simulate caches of varying cache line sizes by reading the reduced trace only once. In their article [115], Wu and Wolf also proposed a method for trace compression to simulate caches with varying cache line sizes in a multiprocessor environment. However, their main focus was generating a wide range of performance metrics while keeping the simulation time as small as possible. Their target performance metrics included the cache miss ratio, write-back counts and bus traffic. Their trace compression was based on simulating cache configurations in a particular order, so that some redundant information could be stripped off from the trace file. In this way, simulation of a large number of cache configurations could be skipped for certain memory block references without affecting the accuracy of the results.
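The filter-cache idea behind trace stripping can be sketched as follows. This is our own minimal rendering of the principle in [113], not Puzak's implementation; it assumes block addresses have already been extracted from the raw trace.

#include <cstdint>
#include <cstdio>
#include <vector>

// Simulate a direct-mapped filter cache with FILTER_SETS sets and keep only
// the references that miss in it. For caches with the same line size and at
// least FILTER_SETS sets, simulating the reduced trace yields the same miss
// count as simulating the full trace.
std::vector<uint64_t> strip_trace(const std::vector<uint64_t>& blockAddrs,
                                  size_t FILTER_SETS) {
    std::vector<uint64_t> tag(FILTER_SETS, UINT64_MAX);  // one block per set
    std::vector<uint64_t> reduced;
    for (uint64_t block : blockAddrs) {
        size_t set = block % FILTER_SETS;
        if (tag[set] != block) {          // miss in the filter cache
            tag[set] = block;
            reduced.push_back(block);     // keep this reference
        }                                 // hits are dropped from the trace
    }
    return reduced;
}

int main() {
    std::vector<uint64_t> blocks = {1, 2, 1, 3, 1, 2};
    std::printf("%zu references kept\n", strip_trace(blocks, 2).size()); // 4
    return 0;
}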
In [116], a trace specific compression technique was proposed. Two different methods were used in this proposal to compress instruction accesses and data references. In this method, a sequential run of instructions from the target of a taken branch to the first taken branch in the sequence is called a "stream". A stream table, created during compression, encompasses all relevant information about streams, such as the starting address, stream length, and instruction types. An instruction stream is replaced by its index from the stream table. Data addresses are linked to a corresponding instruction stream and compressed using an efficient online algorithm that recognizes regular strides. This technique achieves a fair balance between the most important compression requirements: a high compression ratio and a fast decompression/analysis time. The authors reported that their proposed methods reduced the size of SPEC CPU2000 Dinero-compatible instruction and data address traces by 18 to 309 times. All the above mentioned techniques can find the total number of cache misses in an application trace very quickly when the trace is compressed. However, a significant amount of time is spent on trace compression and the decompression/analysis of the reduced trace to produce the correct results. To reduce time in the compression and decompression phase in instruction cache simulation, Janapsatya et al. proposed a technique in [71]. Their compression method is fast, but its compression ratio is 2 to 10 times worse than that of gzip. This is because the compression methodology is not designed to achieve a maximum compression rate; instead, it is designed for minimal processing time. In addition, their cache simulation algorithm does not need to decompress the trace to extract the cache hit/miss information accurately. Therefore, their cache simulation algorithm achieved an average speedup of 9.67 times when compared to DineroIV [59].
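The stream-table idea can be pictured with a small sketch; it is our own simplification of [116], assumes a fixed 4-byte instruction size, and identifies a stream purely from sequential address runs in the dynamic trace.

#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

// Split a dynamic instruction address trace into "streams" (sequential runs
// ending at a taken branch), number each distinct stream in a stream table,
// and emit the trace as a sequence of stream indices.
int main() {
    std::vector<uint64_t> trace = {0x100, 0x104, 0x108, 0x200, 0x204,
                                   0x100, 0x104, 0x108};
    std::map<std::pair<uint64_t, uint64_t>, int> streamTable; // (start,len)->id
    std::vector<int> compressed;
    size_t i = 0;
    while (i < trace.size()) {
        uint64_t start = trace[i];
        uint64_t len = 1;
        while (i + len < trace.size() && trace[i + len] == trace[i + len - 1] + 4)
            ++len;                                   // extend the sequential run
        auto key = std::make_pair(start, len);
        int id;
        auto it = streamTable.find(key);
        if (it == streamTable.end()) {
            id = static_cast<int>(streamTable.size());   // new table entry
            streamTable.emplace(key, id);
        } else {
            id = it->second;                             // reuse existing stream
        }
        compressed.push_back(id);
        i += len;
    }
    for (int id : compressed) std::printf("%d ", id);    // prints: 0 1 0
    std::printf("\n");
    return 0;
}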
Parallel Simulation: To reduce the overall simulation time, several proposals were made to perform the simulation of a group of cache configurations in parallel on multiple processors. These methods are called parallel cache simulation approaches. Parallel simulation approaches are able to reduce the overall simulation time by overlapping the simulation times of
individual cache configurations. However, to parallelize the simulation process, an extensive amount of computing resources is necessary. This resource hungry behavior also makes these approaches costly to implement. Therefore, in a low budget, short time-to-market project, these approaches may not be practical to deploy. Depending on the source of parallelism, parallel simulation techniques can be categorized into several subcategories. The proposal in [117] simulates each cache set of a cache configuration on a different processor. This type of parallelism is called set-parallelism. This method simulates only one cache configuration at a time. To reduce the memory space requirement, the method in [117] reads the application trace in parts. From the read part of the original trace, a list of memory block addresses is generated for each cache set, maintaining the addresses' relative order in the original trace. When all the memory block addresses of the read trace part are recorded in the lists, parallel processing of the lists starts on the dedicated processors to extract cache hit/miss information. The authors reported that, due to parallelization, only 30% of the simulation time is spent on the cache hit/miss extraction phase. The remaining 70% of the simulation time is spent on reading the trace file and generating the lists for each cache set. In this article, the authors analyzed only two cache configurations and three trace files. They reported an average speedup of approximately 8 times over an unoptimized exact simulation tool, DineroIII [118], which also simulates a single cache configuration at a time. Heidelberger et al. showed in their article [119] that set-parallelism is not efficient in a statistical sense because of the high correlation of activity between different sets. They argued that only a small fraction of the sets should actually be simulated. Based on these observations, Heidelberger et al. introduced time-parallelism in [119], which divides the application trace into multiple equal-length sub-traces, each of which is used to simulate the same cache configuration in parallel. No synchronization is used to maintain the order of access for memory block addresses that are simulated on different processors. Because the contents of the caches are unknown at the beginning of the parallel simulation, this parallel simulation does not produce the correct count of cache hits and
misses. To correct the count of cache hits and misses, post-processing is necessary at the end of parallel simulations. The post-processing effort required is proportional to the size of the cache simulated and not to the size of the application trace. The authors claimed that their proposed method can achieve a speedup that is nearly proportional to the number of processors with the overhead limited to the initial partitioning of the application trace and to the post-processing phase; however, they did not provide any simulation time comparison or cache hit/miss results for real application traces. They also did not discuss any implementation method for their proposed algorithm. In addition, their method is only applicable to caches with the LRU replacement policy. Nicol et al. [120] proposed an improvement to the proposal of Heidelberger et al. Their method specifically calculated the number of cache misses of a single cache set. Their parallel cache simulation algorithms calculated cache misses based on stack distances and could handle multiple cache replacement policies. These algorithms were simple and space efficient. Han et al. [121] proposed a method that not only exploits the set-parallelism but also parallelizes searches for the requested memory block address in a particular cache set. They called the parallelized cache set searching mechanism the “Search Parallelism”. In addition, their method is designed to utilize the modern GPU-CPU platform using Compute Unified Device Architecture (CUDA). In their method set-parallelism was implemented by distributing the simulation process of each cache set to a separate thread block in the GPU. To assist the set-parallelism, a bucket sort was used to generate the list of memory block addresses for each processor. The search-parallelism was exploited by distributing the search operations to several intercommunicating threads in a thread block. As the application trace is huge, the proposed method utilizes the large global memory in the GPU and reads the entire application trace into the global memory. Due to the combined efficient utilization of search parallelism, set parallelism and the GPU-CPU platform, this method was able to achieve 2.5x speedup over the unoptimized exact simulator, DineroIV [59].
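The trace-partitioning step behind set-parallelism can be sketched as follows. The per-set worker shown here is our own simplified LRU stand-in for the parallel hit/miss extraction of [117], not the authors' code.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

// A trivial per-set LRU simulation used as the worker body.
static void simulate_set(const std::vector<uint64_t>& refs,
                         size_t associativity, uint64_t* misses) {
    std::vector<uint64_t> lines;                        // front = most recent
    for (uint64_t b : refs) {
        auto it = std::find(lines.begin(), lines.end(), b);
        if (it == lines.end()) {                        // miss
            ++*misses;
            if (lines.size() == associativity) lines.pop_back();
        } else {                                        // hit
            lines.erase(it);
        }
        lines.insert(lines.begin(), b);                 // now most recent
    }
}

// Partition a block-address trace into one list per cache set (preserving the
// addresses' relative order) and process the lists on concurrent threads.
uint64_t set_parallel_misses(const std::vector<uint64_t>& blockAddrs,
                             size_t numSets, size_t associativity) {
    std::vector<std::vector<uint64_t>> perSet(numSets);
    for (uint64_t b : blockAddrs)
        perSet[b % numSets].push_back(b);               // bucket by set index

    std::vector<uint64_t> misses(numSets, 0);
    std::vector<std::thread> workers;
    for (size_t s = 0; s < numSets; ++s)
        workers.emplace_back(simulate_set, std::cref(perSet[s]),
                             associativity, &misses[s]);
    for (auto& t : workers) t.join();
    return std::accumulate(misses.begin(), misses.end(), uint64_t(0));
}

int main() {
    std::vector<uint64_t> trace = {1, 2, 3, 1, 2, 4, 1, 5, 2};
    std::printf("misses: %llu\n",
                (unsigned long long)set_parallel_misses(trace, 2, 2));
    return 0;
}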
Single-pass Simulation: In a single-pass simulation approach, multiple cache configurations are simulated together by reading one application's trace of memory accesses only once. In this approach, one memory address is read from the application trace and the cache hit/miss evaluation is performed for that address on all the cache configurations considered. As a result, simulation time reduces enormously in a single-pass simulation when a large number of cache configurations are simulated. In contrast to parallel simulation, in a single-pass simulation approach one processing unit is used as optimally as possible. Therefore, single-pass simulation can be combined with parallel or compressed trace simulation for further acceleration. In addition to reducing trace reading time, single-pass simulation approaches usually exploit inclusion properties and custom-tailored data structures to reduce processing time without the help of extra hardware. An inclusion property indicates when all the elements within one cache configuration are known to be present in other configurations. In their article [122], Mattson et al. defined inclusion properties as a condition where, at every point in time, the contents of a larger cache are a superset of the contents of the smaller caches. This holds between alternative caches that have the same cache line size, do not pre-fetch and have the same number of sets. Importantly, the replacement policy must induce a total priority ordering on all previously referenced memory blocks (that map to the same cache set) before each reference and use only this priority ordering to decide the next replacement cache block. Therefore, the inclusion properties allow some of the simulation steps to be avoided, saving simulation time enormously when a large group of cache configurations are simulated together. Mattson et al. [122] also introduced a single-pass simulation technique called "Stack Processing" that can be used in the cost-performance evaluation of a large class of memory hierarchies. Their technique was dependent on those cache replacement policies which they classified as "Stack Algorithms". To be a stack algorithm, a replacement policy must show an inclusion property. Mattson et al. derived several inclusion properties for the stack algorithms and used them for quick single-pass simulation. However,
they did not analyze any real workload to test the effectiveness of their method. In an article [123], published in 1989, Hill et al. studied the effect of varying associativity in caches in search of a rapid single-pass cache simulation approach. Relying on their findings, Hill et al. proposed a forest simulation technique to quickly simulate alternate direct-mapped caches. They also used another technique, called the "All-associativity methodology", based on the stack algorithm described by Mattson et al. [122], to simulate alternate direct-mapped caches, fully associative caches and set associative caches. In the all-associativity methodology, they utilized the fact that the memory block addresses that map to the same set in the larger set sized caches also map to the same cache set in the smaller set sized caches. They call this property "Set-refinement". Hill et al. showed that inclusion properties are not useful for their all-associativity methodology, but set-refinement is. They also demonstrated, with empirical evidence, that the forest simulation strategy is faster than the all-associativity methodology for alternate direct-mapped caches. In 1995, Sugumar et al. [124] made an effort to exploit the cache inclusion properties further when they introduced the use of binomial trees to represent LRU caches. Their method improved the method of Hill et al. Binomial trees speed up the searches for the cache sets enormously. Utilizing binomial trees and some cache inclusion properties based on the LRU replacement policy, Sugumar's method was able to simulate multiple LRU cache configurations very quickly in a single pass over an application trace. Their experimental results showed that the proposed algorithm gained a 1.2 to 3.8x speedup over the all-associativity methodology and forest simulation algorithms. In 2004, Li et al. [125] proposed an advancement to Sugumar's proposal through a trace compression method to reduce simulation time. Their method gained an average 3.8x speedup over Sugumar's method. In their article [23], Janapsatya et al. proposed a binary tree based data structure to utilize the LRU cache inclusion properties and temporal locality features of cache memories more efficiently in single-pass simulation. Their method represented cache configurations using binary trees. The binary tree based representation reduced the cache set selection time. In the method proposed by Janapsatya et al. [23], cache lines in a cache set were
searched according to their last access time, starting from the most recently accessed cache line. The reason for this was that, in an LRU cache set, the recently accessed cache lines have a higher probability of being accessed again in the near future due to temporal locality. Therefore, performing the search from the most recently accessed cache line towards the least recently accessed line is more likely to quickly find the requested memory block address tag in the cache. The method proposed by Janapsatya et al. evaluates the cache configurations in the trees in a top-down fashion. To speed up simulation, their method tried to utilize two LRU based cache inclusion properties: (i) a cache hit in an LRU cache guarantees a hit in the larger set sized LRU caches; and, (ii) a hit in an LRU cache guarantees a hit in all the larger caches that differ by associativity only. However, their simulation method only makes effective use of the second inclusion property. This is because when the tree is traversed in the top-down fashion, simulation can be avoided in the larger caches which differ by associativity only when a cache hit occurs in an LRU cache. However, when the caches differ by both set size and associativity, a hit in one level in the binary tree does not reveal the minimum associativity needed to cause a cache hit in the larger caches. Therefore, simulation cannot be stopped in the larger set sized caches on a hit in one level in the binary tree. Using special data structures and cache inclusion properties, the method proposed by Janapsatya et al. achieved an average speedup of almost 45 times over DineroIV [59]. To efficiently use the first inclusion property of Janapsatya et al.'s method, Tojo et al. [126] applied two more LRU inclusion properties to Janapsatya et al.'s method: (i) the Most Recently Accessed (MRA) tag of an LRU cache set must be available as the MRA tag in all the larger set sized LRU caches; and, (ii) the MRA tag of an LRU cache set must be available as the MRA tag in all the larger associativity LRU caches. Therefore, whenever the requested tag is found as the MRA tag in an LRU cache, a cache hit on the MRA tag can be guaranteed in all the larger LRU caches that differ by set size, associativity or both. As a result, simulation can be avoided in all the larger caches when a cache hit occurs on the MRA tag in an LRU cache. Tojo et al. called their simulation method the "CRCB" algorithm. Utilizing these two inclusion properties, the CRCB algorithm
managed an average speedup of 1.8 times over Janapsatya et al.'s method. To speed up single-pass cache simulation, Grund et al. tried to find predictable features of cache replacement policies applicable to single cache configurations. Their goal was to use these predictable features to predict the availability of memory block addresses in a cache configuration without performing any simulation. They derived several predictable features of the FIFO, LRU and Pseudo LRU (PLRU) replacement policies [127–131]. Articles [132–134] also analyzed the LRU replacement policy to derive predictable features. However, no known single-pass cache simulator has used these predictable features to accelerate simulation so far. To the best of our knowledge, before us, no attempt has ever been made to accelerate the single-pass simulation of caches using non-stack algorithms as their replacement policy.
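To make the overall structure of a single-pass simulator concrete, the skeleton below (our own illustration, deliberately without the tree-based data structures or inclusion-property optimizations of [23, 122, 124, 126]) evaluates every configured LRU cache on each address as it is read, so the trace is traversed only once; the line size and the design space bounds are assumptions of the sketch.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// One LRU cache configuration; the line size is fixed for all configurations.
struct CacheConfig {
    size_t numSets, associativity;
    std::vector<std::vector<uint64_t>> sets;  // per set, front = most recent
    uint64_t misses = 0;
    CacheConfig(size_t s, size_t a) : numSets(s), associativity(a), sets(s) {}

    void access(uint64_t block) {
        auto& lines = sets[block % numSets];
        auto it = std::find(lines.begin(), lines.end(), block);
        if (it == lines.end()) {                          // miss
            ++misses;
            if (lines.size() == associativity) lines.pop_back();
        } else {
            lines.erase(it);                              // hit
        }
        lines.insert(lines.begin(), block);               // now most recent
    }
};

int main() {
    const uint64_t LINE_SIZE = 32;                        // assumed
    std::vector<CacheConfig> configs;
    for (size_t sets = 1; sets <= 256; sets *= 2)         // assumed design space
        for (size_t ways = 1; ways <= 8; ways *= 2)
            configs.emplace_back(sets, ways);

    std::vector<uint64_t> addresses = {0x100, 0x124, 0x100, 0x2000, 0x104};
    for (uint64_t addr : addresses) {                     // single pass
        uint64_t block = addr / LINE_SIZE;
        for (auto& c : configs)                           // every configuration
            c.access(block);
        // Inclusion properties would let a hit in a small cache skip the
        // evaluation of many of the larger configurations here.
    }
    for (auto& c : configs)
        std::printf("%zu sets, %zu ways: %llu misses\n", c.numSets,
                    c.associativity, (unsigned long long)c.misses);
    return 0;
}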
Chapter 3
Overview of The Thesis
The main objective of this thesis project is to accelerate cache hit/miss detection in single-pass simulation of caches that do not use stack algorithms as their replacement policy. Utilizing the discovered speedup mechanisms, this thesis also intends to propose cache simulators which are able to quickly extract cache hit/miss information from an application trace of memory accesses. This study aims to maintain the following three features in the speedup mechanisms:
1. Accurate simulation results.
2. Low space consumption during simulation.
3. The minimal possible use of computing resources.
To achieve our goals, we have conducted our research by dividing it into five different projects. The projects are as follows:
1. SuSeSim
2. DEW
3. SCUD
4. CIPARSim
5. PSAICO
In the following sections, we are going to present an overview of these sub-projects.
3.1 Project SuSeSim
The use of cache inclusion properties has attracted researchers for decades because of their ability to accelerate the accurate detection of cache hits and misses in a cache simulation without using excessive space or computing resources. To gain a better understanding of how inclusion properties benefit single-pass cache simulation, we have analyzed the LRU replacement policy in our project “SuSeSim”, as part of the thesis. We have also studied how the binary tree based cache configuration representation helps to use the cache inclusion properties efficiently during simulation. From our analysis, we understood that when simulation is performed in sequence from the smallest to the largest cache configuration in the single-pass simulation (top-down simulation in the binary tree), the utilizable inclusion properties detect cache hits quickly in a large group of cache configurations. However, cache simulation is performed mainly to detect the number of cache misses, as it is misses that increase execution time and power consumption enormously, not the cache hits. To detect the cache misses quickly, we have analyzed the effect of performing a bottom-up simulation in the binary trees representing cache configurations. We found that performing the cache simulation in the bottom-up fashion can exploit a different set of cache inclusion properties to detect cache misses quickly in a large group of cache configurations. We have proposed two useful inclusion properties to use in the bottom-up single-pass simulation for the LRU caches in the SuSeSim project. We have made a small change in the tree based representation that allows a faster update on a cache access and a quick search to decide a cache hit or miss using temporal locality. To ensure the effective use of these proposals, we have also proposed a simulation algorithm in this project. The proposed simulation algorithm shows the fastest performance compared to the available LRU single-pass simulators. The SuSeSim project has been published in the proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis [135].
3.2 Project DEW
Project "DEW", the short form of "Direct Explorer Wave", is our first initiative to explore some speedup mechanisms for the single-pass simulation of caches that do not use
stack algorithms as their replacement policy. In this project, we chose the widely used replacement policy "FIFO" for analysis. As FIFO does not show any inclusion property, we tried to focus on data structure based speedup mechanisms in this project. We have tried to utilize the knowledge gained from the SuSeSim project. Therefore, in DEW also, we have used a binary tree based representation of caches due to its ability to explore cache sets quickly. However, from our experiments and analysis, we realized that, unlike LRU, neither the bottom-up nor the top-down simulation in the binary tree helps in the quick detection of misses in the FIFO caches. Therefore, we concentrated on reducing the time in FIFO cache set searching only. By analyzing the cache inclusion properties in SuSeSim, we found that inclusion properties reduce search time because a cache hit/miss decision can be made for larger set sized caches by simulating only a smaller set sized cache. To reduce the tag searching in the FIFO cache sets in a similar manner, we have introduced a new feature, the "Wave pointer", in the associativity lists of the binary trees. If a tag is available in the associativity list in a tree node, wave pointers keep track of the location of the same tag in the immediate larger set sized cache in the binary trees. Therefore, by searching a smaller set sized FIFO cache and using wave pointers in the associativity list, the requested memory reference tag's position can be determined in the larger set sized caches without performing any simulation. As a result, a large portion of the simulation time can easily be saved. To judge the effectiveness of the wave pointers, we present a single-pass FIFO cache simulator in our project DEW. The simulator is able to achieve an enormous speedup over the single cache configuration simulator DineroIV [59]. However, the wave pointers introduced new problems. To avoid excessive space consumption, wave pointers can only be used to simulate cache configurations which differ by set size only. On a cache miss, the proposed simulator has to use a complex procedure to update the wave pointers and the binary trees. In addition, to use the wave pointers efficiently, a top-down simulation must be performed in the binary tree. The DEW project has been published in the proceedings of design automation and testing in Europe 2010 [136]. In the same conference, a second article [137] was published that used DEW as its main supporting tool.
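The wave pointer idea can be pictured with a small data-structure sketch. The field names below are hypothetical and the update machinery is omitted; the sketch only conveys how an entry at one tree level points at the matching tag one level down, it is not DEW's actual implementation.

#include <cstdint>

// Illustrative only: an associativity-list entry in a DEW-style simulation
// tree. 'wave' points at the node holding the same tag in the corresponding
// set of the cache with twice as many sets (the next tree level), so a hit
// found here locates the tag in the larger cache without searching it.
struct TagEntry {
    uint64_t tag;        // memory block address tag stored in this cache line
    TagEntry* next;      // next entry in this set's FIFO associativity list
    TagEntry* wave;      // same tag, one tree level down (nullptr if absent)
};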
3.3 Project SCUD
To overcome the problems of wave pointers in DEW, especially to perform the simulation of FIFO cache configurations which differ by both set size and associativity, we conducted an extensive analysis in our project "SCUD". We did not worry about simulating cache configurations with varying cache line sizes, as the cache line size is most of the time restricted by the bus size of the selected system. Building on the experience with wave pointers, we introduced a new data structure, a central look-up table, in SCUD. At every point in time, the central look-up table keeps, for each memory block address accessed by the processor, a detailed map indicating which cache configurations currently store the content of that memory block. To keep space consumption to a minimum, whenever a previously accessed memory block is evicted from all the cache configurations, the entry for that memory block address is deleted from the central look-up table. The central look-up table eliminated the need for wave pointers. In addition to cache configurations with varying set size and associativity, the central look-up table also allowed the simulation of configurations which differ by cache line size. To retrieve the entry for a memory block address quickly, a binary search is used in the central look-up table. Even though space consumption was considerable, the speedup achieved by this method was impressive compared with DEW and DineroIV [59]. The SCUD project has been published in the proceedings of the 47th design automation conference [138].
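A simplified view of such a central look-up table is sketched below. This is our own rendering of the general idea rather than SCUD's actual implementation; it tracks presence with one flag per simulated configuration and deletes an entry once every flag is cleared.

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Simplified central look-up table: for every live memory block address, one
// presence flag per simulated cache configuration. An entry whose flags are
// all false (the block has been evicted everywhere) is deleted to bound
// space consumption.
class CentralLookupTable {
public:
    explicit CentralLookupTable(size_t numConfigs) : numConfigs_(numConfigs) {}

    bool isPresent(uint64_t block, size_t config) const {
        auto it = table_.find(block);
        return it != table_.end() && it->second[config];
    }
    void setPresent(uint64_t block, size_t config, bool present) {
        auto& flags = table_.try_emplace(block,
                          std::vector<bool>(numConfigs_, false)).first->second;
        flags[config] = present;
        if (!present && std::none_of(flags.begin(), flags.end(),
                                     [](bool f) { return f; }))
            table_.erase(block);      // evicted from every configuration
    }
private:
    size_t numConfigs_;
    std::map<uint64_t, std::vector<bool>> table_;  // ordered: logarithmic lookup
};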
3.4 Project CIPARSim
From our experience in DEW and SCUD, we realized that data structure based speedup mechanisms could not achieve speedup without using excessive memory space. Excessive space consumption always poses the threat of slow simulation due to memory swapping when a sufficient amount of space is not available. Therefore, to establish some speedup mechanisms without any extra space consumption, we had to design something similar to cache inclusion properties. From the knowledge we gathered on cache behavior during the SuSeSim, DEW and SCUD projects, we realized that several elements can be common in two different cache
configurations during a single-pass simulation. In our CIPARSim project, we investigated whether we could determine when, and which, elements are common between two different cache configurations in a single-pass simulation. We used the FIFO replacement policy for our analysis again. From our experiments and analysis, we found that, in certain situations, two caches do have common elements, and those common elements can be identified accurately when the conditions for the situation to occur are known. We called this feature of caches the "Intersection property". In the CIPARSim project, we propose three FIFO cache intersection properties to be utilized in single-pass simulation. Using these intersection properties, we also proposed a cache simulator in the CIPARSim project. The cache simulator proposed in CIPARSim also uses a look-up table like SCUD. However, for faster searching, the entries are arranged into smaller sets. An entry for a particular memory block address can only go to a specific set in the look-up table. Therefore, memory block addresses are searched in a specific set only, not in the entire look-up table. To reduce space consumption in the look-up table, the entries are replaced by binary bit sets. Bit operations make the cache hit/miss evaluation faster. All these features helped the CIPARSim cache simulator to outperform the simulators proposed in DEW and SCUD in both space and time consumption. The CIPARSim project has been published in the proceedings of International Conference on Computer Aided Design (ICCAD) 2011 [139].
3.5 Project PSAICO
In PSAICO, we attempted to discover some intersection properties for the PLRU caches. We analyzed both the decision tree based PLRU and the decision bit based PLRU. Utilizing the intersection properties and a modified data structure, we proposed the PSAICO simulator for performing a quick single-pass simulation of both versions of PLRU. PSAICO concludes our study by showing that, alongside customized data structures, intersection properties can be used effectively to accelerate the single-pass simulation of the widely used FIFO and PLRU replacement policies. The combined use of a customized data structure and
intersection properties can achieve all the features we want in our fast single-pass simulators.
Chapter 4
SuSeSim: An Analysis of LRU Policy
4.1 Introduction
Due to the non-linear nature of caches, determining cache hits and misses for a particular application can be difficult. In particular, without actually performing the time consuming execution of the application, there is no known way of accurately determining the number of cache hits and misses for a cache configuration other than simulating the application's trace of memory accesses. To simulate the trace on hundreds of differing cache configurations can take several months and is simply not feasible during a compact system design phase. To accelerate the cache simulation process without using excessive space or computing resources, cache inclusion properties have been studied and applied in single-pass simulation for a long time. Among the stack algorithms (cache replacement policies that show cache inclusion properties), the LRU has been analyzed the most for this purpose. According to Al-Zoubi [33], the LRU shows the best performance, in terms of the lowest number of cache misses, among all the widely used cache replacement policies. Due to its optimal use of the temporal locality of cache accesses, LRU demonstrates the lowest number of cache misses compared to its competitors FIFO and PLRU. Even though LRU is rarely used in embedded systems, to understand how the cache inclusion properties can benefit a single-pass simulation, an
analysis of the LRU replacement policy can be a good starting point. The large number of articles on LRU single-pass simulation can also help us to understand what kind of data structures are necessary for efficient use of the inclusion properties. In our research, we have analyzed the exact simulation methods, especially Janapsatya's method [23] with the proposed enhancements of the CRCB algorithm [126], to study the use of LRU cache inclusion properties and supporting data structures to speed up a single-pass simulation. During our analysis, we have found that these methods can be improved by better search methods that exploit temporal locality and by optimized data structures that allow a quick update to reflect a cache access. From our findings, we propose a new exact cache simulation algorithm, "Super Set Simulator" (SuSeSim), for LRU single-pass simulation. SuSeSim overcomes most of the problems we have identified in the previously proposed simulation methods for LRU caches. Layout: The rest of the chapter is structured as follows: Section 4.2 gives an overview of our contributions in the project SuSeSim, Section 4.3 presents the background of our research, Section 4.4 describes our SuSeSim algorithm, Section 4.5 describes the experimental setup and discusses the results found for Mediabench applications [21]; and Section 4.6 concludes the chapter.
4.2 Contributions and Limitations
The main contributions are described in the points given below.
1. Two new cache inclusion properties are shown in this chapter.
2. For the first time, a novel simulation strategy (a bottom-up strategy) has been proposed that can quickly detect the absent memory block address tags in multiple cache configurations without searching all the cache configurations individually. This ability helps to minimize simulation time by reducing the total number of searches.
3. A novel tag searching strategy has also been proposed that can search from the most recently accessed tag to the least recently accessed tag and vice versa. This ability
reduces the total number of cache ways that need to be searched in a set associative cache during a single-pass simulation.
4. We have proposed a novel data structure to represent associativity. This data structure makes the updating process both easier and less time consuming when a cache access occurs. It helps to deploy our proposed searching strategies to reduce the number of searches for less frequently used address tags.
The limitation of this work is that it is restricted to caches using the LRU replacement policy with the same cache line size.
4.3 Background
It has been found by previous researchers [23, 124, 126] that the LRU replacement policy enables the single-pass simulator to use the following two inclusion properties to speed up simulation by reducing the total number of cache sets to be searched:
1. Given caches with the same associativity and block size, and using the Least Recently Used (LRU) replacement policy, whenever a cache hit occurs, all caches that have larger set sizes will also guarantee a cache hit; and,
2. A hit on a set associative cache means a cache hit is guaranteed on all the caches with larger associativity and the same block size.
The above observations are described in detail in [23]. To utilize the above mentioned cache inclusion properties, special data structures are created. A set of binary trees, called "Simulation trees", is created such that each node reflects a set in a cache and each level of a tree represents a cache configuration. Inside a simulation tree, all the cache configurations must have the same cache block size. An example is presented in Figure 4.1. In Figure 4.1, the two nodes at the top of the two trees point to a cache with S = 2. The first node on the left, marked '0', refers to the cache set with the index 0, and the second node, marked '1', refers to the cache set 1. At the second level of the two trees there are a total of four nodes marked '00', '10', '01' and
4.3. BACKGROUND
47
‘11’. Thus the second level will represent a cache with S = 4, and the numbering within the nodes will represent the respective cache sets, as shown in Figure 4.1. Similarly, the third level (depicted as bottom level in Figure 4.1), will represent a cache with eight sets. For caches with larger set sizes, the tree is further expanded with a greater number of levels. Index 0 Valid bit
Figure 4.1: Formation of simulation tree
The advantage of such a structure in single-pass cache simulation can be explained with the aid of a simple example. Let us suppose that the memory address '10001010' has to be simulated. Such an address will store its content in index 0 of the cache with two sets and in index 10 of the cache with four sets (assuming byte addressable memory and B = 1 byte). Thus, with the aid of the tree structure, by first reading the last bit of the memory address, we can store the address in index 0 of the two-set cache. Then, the link from that node can be followed by reading the penultimate bit of the address. Since the penultimate bit is a 1 in the example, the node 10, to which there is a link from node 0, can be simulated quickly without searching the trees for the appropriate cache set. Moreover, only one tree is needed during the simulation of an address. Due to this strategic tree structure, a large number of caches with differing set sizes can be simulated simultaneously, with a minimum number of searches for appropriate cache sets. Note that the tag field in the cache with S = 2 will be filled with the number '1000101' and in the cache with S = 4 the tag field will be filled with the number '100010'.

Figure 4.2: Singly linked list to represent associativity

To store accessed memory block address tags, the authors of [23, 126] associated a singly linked list with each simulation tree node. Each linked list node corresponds to a cache line. Figure 4.2 presents an example of such a linked list. The four nodes inside the example linked list in Figure 4.2 indicate that a four-way set associative cache is under simulation. Only the most recently used memory address tags are stored in these linked list nodes. So, in the example list, the tag inside the first cache line (also called the head node) is the most recently accessed tag, the tag inside the second cache line is the second most recently accessed, and so on.

Let us describe the advantage of this type of linked list with the aid of an example. Let us consider that, at a certain point in time, the processor requests the last four memory addresses presented in Table 4.1. If the cache has a one byte cache block and the memory is byte addressable, all except the first of the requested addresses shown in Table 4.1 will go to index 0 of the two-set cache of Figure 4.1. At the end of the last request, index 0 of the two-set cache will look like the cache set shown in Figure 4.7. It can be seen that in Figure 4.7, the most recently accessed tag "11000" is directly accessible from the tree node; however, to access tag "11100", the node with tag "11000" must be visited first. Similarly, to access tag "00110", the nodes with tags "11000" and "11100" must be visited first. Therefore, the node arrangement policy in the linked list helps to find recently accessed tags quickly. As temporal locality increases the chance of reusing recently accessed memory block address tags in the near future, the node arrangement policy in the linked list of [23, 126] helps to reduce simulation time by reducing the number of cache lines searched.
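To make this data structure concrete, the following is a minimal C++ sketch of a simulation tree node with its associated singly linked list. The names (CacheWay, TreeNode) are ours and only illustrate the layout described above; they are not taken from the simulators discussed in this chapter.

    // A simulation tree node represents one cache set; its singly linked list of
    // ways is ordered from most recently used (head) to least recently used (tail).
    struct CacheWay {
        unsigned int tag;     // address tag stored in this cache line
        CacheWay*    next;    // next less recently used way (nullptr at the tail)
    };

    struct TreeNode {
        CacheWay* head;       // most recently accessed way of this set
        TreeNode* left;       // child sets in the cache with twice as many sets
        TreeNode* right;
    };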
Figure 4.3: Values for tag, index and offset for a requested address in different cache configurations
Figure 4.3 gives an example of how memory addresses are divided into different parts and used to select cache sets when we have the simulation tree of Figure 4.1. In this example, our memory is byte addressable, our cache block size is 2 bytes, and the requested binary memory address is 111101. From Figure 4.3, the values of the tag, cache set index and byte offset for the example memory block address can be seen for the three different cache configurations under simulation. Figure 4.4 shows the tags of the leftmost tree (starting node '0') of Figure 4.1 in the associated linked lists when the processor requests the binary addresses from Table 4.1 in sequence. In the following subsection, we are going to show how these simulation trees are used in Janapsatya's algorithm with the CRCB enhancements [23, 126] to perform a fast single-pass simulation.
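As a concrete illustration of this bit slicing, the following small C++ program is a sketch of our own (not code from the simulators described here) that extracts the byte offset, set index and tag of an address for a given block size and set count. With address 111101 (decimal 61), a 2-byte block and four sets, it yields offset 1, index 10 and tag 111, matching Figure 4.3.

    #include <cstdio>

    // Return log2(x) for a power-of-two x.
    static unsigned log2u(unsigned x) { unsigned n = 0; while (x > 1) { x >>= 1; ++n; } return n; }

    // Split a byte address into offset, set index and tag for one cache
    // configuration; blockSize and numSets are assumed to be powers of two.
    static void splitAddress(unsigned addr, unsigned blockSize, unsigned numSets,
                             unsigned &offset, unsigned &index, unsigned &tag)
    {
        unsigned offsetBits = log2u(blockSize);
        unsigned indexBits  = log2u(numSets);
        offset = addr & (blockSize - 1);
        index  = (addr >> offsetBits) & (numSets - 1);
        tag    = addr >> (offsetBits + indexBits);
    }

    int main()
    {
        unsigned offset, index, tag;
        splitAddress(61u /* binary 111101 */, 2, 4, offset, index, tag);
        std::printf("offset=%u index=%u (binary 10) tag=%u (binary 111)\n", offset, index, tag);
        return 0;
    }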
Table 4.1: Trace of requested addresses
    111101
    101100
    011000
    001100
    111000
    110000
Figure 4.4: Address tags in different cache configurations
4.3.1 Janapsatya's Methodology with Proposed Enhancements in the CRCB Algorithms
To record the total number of cache misses for the different cache configurations, Janapsatya's approach [23] keeps a table indexed by the cache configuration parameters: set size, associativity and cache line size. The size of the table depends on the total number of cache configurations considered for simulation. An example miss counter table is presented in Figure 4.5. In this example, the total number of misses of the first entry will be increased only if a tag is missed in the cache with set size 1, cache line size 1 byte and associativity 1. Similarly, the second entry's total number of misses will be increased when we want to record a miss for a cache with set size 1, cache line size 1 byte and associativity 2. This example miss counter table can hold the total number of misses for caches with a set size of 1 to 1024, an associativity of 1 to 512 and a cache line size of 1 to 512 bytes.
Figure 4.5: Miss counter array of Janapsatya’s method
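The miss counter table can be pictured with a small C++ sketch of our own; the names and layout are illustrative only and not the thesis implementation, but the index ranges match the example above (set sizes 2^0 to 2^10, associativities 2^0 to 2^9, line sizes 2^0 to 2^9 bytes).

    #include <cstddef>
    #include <vector>

    // A miss-counter table indexed by (set size, associativity, block size),
    // all powers of two; one long counter per cache configuration.
    struct MissCounterTable {
        int setSizeOptions, assocOptions, blockOptions;
        std::vector<long> misses;

        MissCounterTable(int s, int a, int b)
            : setSizeOptions(s), assocOptions(a), blockOptions(b),
              misses(static_cast<std::size_t>(s) * a * b, 0) {}

        // i, j, k are exponents: set size = 2^i, associativity = 2^j, block = 2^k bytes.
        long &at(int i, int j, int k) {
            return misses[(static_cast<std::size_t>(i) * assocOptions + j) * blockOptions + k];
        }
    };

    // Example: record a miss for a cache with 2 sets, associativity 1 and 1-byte lines.
    // MissCounterTable table(11, 10, 10);
    // table.at(1, 0, 0)++;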
Janapsatya’s simulation method has three phases: Tree formation, Tag searching and Cache set update. In the following subsections the phases are described.
4.3.1.1 Tree Formation

At the beginning of simulation in Janapsatya's approach, a forest of simulation trees (described above) is created for each cache line size. Processor-requested addresses are read from the selected application's trace file, one at a time, and sent to each forest. Inside a forest, a tree is selected using the procedure described at the beginning of Section 4.3.
Figure 4.6: Direction of simulation and tag searching

4.3.1.2 Tag Searching
To simulate each address, Janapsatya’s approach begins from the top level of a simulation tree or smallest cache configuration and continues toward the biggest configuration. Each tree level is simulated in sequence. Inside each level of the tree, the tag is searched in the associated singly linked list of the selected tree node (cache set). Tag searching starts from the head node inside the cache set and continues toward the node with the least recently used tag or tail. The search continues until a tag is found or there are no more tags left to check in the linked list. In Figure 4.6, the direction of the simulation and the direction of the tag searching are presented.
Figure 4.7: Singly linked list with requested address tags
4.3.1.3 Cache Set Update

Depending on the outcome of the tag search inside a cache set, one of the following actions is taken:

1. If the tag is missed, the miss counters for all the cache configurations with a set size equal to the current configuration's set size and associativity less than or equal to the current configuration's associativity will be increased. This is because, according to the second cache inclusion property of Section 4.3, when all the caches use the LRU replacement policy, a miss on a set associative cache guarantees a miss on the caches with smaller associativity and equal set size. On a miss, Janapsatya's simulator goes through five different steps to place the missed tag in the linked list; all of these steps are shown in Figure 4.8. In steps 1 and 2, the entire linked list must be searched to find the least recently used node, or tail, and the second least recently used node, or (tail − 1). Next, the missing tag is placed in the tail of the linked list, and in the following step the tail is moved to the head of the linked list. In step 5, (tail − 1) becomes the new tail. Depending on the implementation, some steps of Figure 4.8 may be combined; however, the tasks inside these steps must be performed to reflect the cache miss. We show five steps for the cache miss to increase readability. The missed tag update process can be explained with the help of an example. Let us suppose that, at a certain point in time, index 1 of a two-set cache looks like Figure 4.9 (assuming byte addressable memory and a one byte cache line). The processor requests the address 011111, which is not available in index 1. To place the missed address tag, the least recently accessed tag 11111 will be replaced by 01111. Tag 01111 will become the most recently accessed address tag, and the second least recently accessed address tag 11110 will become the new least recently accessed address tag, or tail. After all these modifications, cache index 1 of Figure 4.9 will look like Figure 4.10.

Figure 4.8: Janapsatya's method needs five steps to update the address tag inside the associated singly linked list

Figure 4.9: Index 1 of a two-set cache with some tags

2. If the searched address tag is found in the linked list node representing the V-th most recently accessed cache line, the node will be moved to the head of the linked list. To do that, the simulator goes through three different stages. First, the node before the V-th most recently accessed node (the (V − 1)-th most recently accessed node) must be found. The next pointer of the (V − 1)-th most recently accessed node is then set to point to the next element of the V-th most recently accessed node. After that, the V-th most recently accessed node becomes the new head. All these changes make the V-th most recently accessed node the most recently accessed (MRA) node. The miss counter for all the cache configurations with a set size equal to the current set size and associativity less than V will be increased. We can use the example of Figure 4.9 to explain the scenario. Let us suppose that address 11110 is requested. It is available in the third most recently accessed cache line of cache set/index 1. So, tag 11110 will be moved to the head, and tag 11111 will become the next tag after tag 11100. After updating the found tag position, cache set 1 of Figure 4.9 will look like Figure 4.11.
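The following C++ fragment is a minimal sketch of the miss update just described, again using the hypothetical CacheWay/TreeNode types from the earlier sketches and combining the five steps of Figure 4.8 into one pass; it assumes the set is full and has at least two ways.

    // Walk to the tail and the node before it, overwrite the tail with the
    // missed tag, and move that node to the head (steps 1-5 of Figure 4.8).
    void missUpdate(TreeNode &set, unsigned int missedTag)
    {
        CacheWay *beforeTail = nullptr;           // will become (tail - 1)
        CacheWay *tail = set.head;
        while (tail->next != nullptr) {           // steps 1 and 2: find tail and (tail - 1)
            beforeTail = tail;
            tail = tail->next;
        }
        tail->tag = missedTag;                    // step 3: place the missed tag in the tail
        tail->next = set.head;                    // step 4: make the old tail the new head
        set.head = tail;
        beforeTail->next = nullptr;               // step 5: (tail - 1) becomes the new tail
    }

The hit case described in action 2 above follows the same pattern, except that the list walk stops at the (V − 1)-th node and no tag is overwritten.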
Figure 4.10: Index 1 of the two-set cache of Figure 4.9 after the missed tag is updated
Figure 4.11: Index 1 of the two-set cache of Figure 4.9 after the hit tag is updated

4.3.1.4 The CRCB Algorithm

The method proposed by Janapsatya uses the second inclusion property of Section 4.3 efficiently. By using the access-order based singly linked list as the associativity list and searching for the requested tag from the most recently accessed to the least recently accessed tag in a cache set, Janapsatya's method can quickly predict the availability of a memory block's content in a large group of caches just by simulating the smallest-associativity cache that has stored the memory block's content. The inclusion property can detect when the same memory block content was last accessed in the larger caches which differ by associativity only. However, Janapsatya's method cannot utilize the first cache inclusion property of Section 4.3. This is because, when a cache hit occurs in an LRU cache, it guarantees a cache hit in all larger LRU caches which differ by a larger associativity, a larger set size or both; however, it does not reveal when the memory block content was last accessed in the larger caches which differ by set size, associativity or both. Without this information, the larger cache configurations which differ by set size, associativity or both cannot be updated to reflect the cache access. At the same time, caches with smaller associativity and larger set sizes also cannot be updated to reflect the cache miss. Therefore, simulation cannot be avoided in the simulation tree to save time, even when it is already known that the processor request is a hit in all the lower levels of a simulation tree.

To use the first cache inclusion property of Section 4.3, Tojo et al. proposed the following two enhancements to Janapsatya's method [23] in their CRCB algorithm [126]:

1. The most recently accessed memory block content in a fully associative cache will be available as the most recently accessed content of some cache set in all the larger set associative LRU caches that differ by set size, associativity or both. Therefore, if the most recently accessed content of a set associative cache is requested by the processor, a cache hit can be declared for the request in all the larger LRU caches that differ by set size, associativity or both. No update is necessary in the larger caches.

2. If the same memory address content is accessed consecutively, there is no need to simulate the later requests, as the duplicated requests are always going to generate hits (as the most recently accessed memory block) in all the cache configurations in a forest. No update is needed in the simulation tree for the later accesses either (see the sketch after this list).

The cache inclusion properties used in the CRCB algorithms are applicable to any cache replacement policy.
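A minimal sketch of the second CRCB enhancement as described above: consecutive repetitions of the same memory block are dropped before they reach a forest. The function name and the vector-based stand-in for the trace file are our own assumptions.

    #include <cstdint>
    #include <vector>

    // Drop consecutive duplicate block addresses before simulation; blockBits is
    // log2 of the block size used by this forest.
    std::vector<uint32_t> filterConsecutiveDuplicates(const std::vector<uint32_t> &trace,
                                                      unsigned blockBits)
    {
        std::vector<uint32_t> filtered;
        bool havePrev = false;
        uint32_t prevBlock = 0;
        for (uint32_t addr : trace) {
            uint32_t block = addr >> blockBits;    // same block => guaranteed hit everywhere
            if (!havePrev || block != prevBlock)
                filtered.push_back(addr);          // only the first request of a run is simulated
            havePrev = true;
            prevBlock = block;
        }
        return filtered;
    }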
4.3.2 Limitations of Available LRU Single-pass Cache Simulators
Currently available LRU single-pass cache simulators have some serious problems in the following areas:

1. Cache inclusion properties: To the best of our knowledge, the inclusion properties used in Janapsatya's method and the CRCB algorithms are the only LRU cache inclusion properties used so far in single-pass simulation. These inclusion properties have helped to reduce simulation time enormously; however, they have some fundamental drawbacks. Single-pass simulation aims to find the number of cache misses for an application trace, as misses are responsible for increasing power consumption and execution time. However, all the hitherto proposed cache inclusion properties try to find the cache hits quickly. Finding cache hits quickly is not as helpful in a single-pass simulation as finding cache misses quickly. When a cache miss occurs, updating the cache is not as complicated as when a hit occurs (see Section 4.3). This is because, to update the cache on a hit, it is necessary to determine the last access time of the currently accessed memory block. The CRCB algorithms help to predict the last access time of a memory block, but only in a small number of situations. Therefore, if cache misses could be predicted for a large group of cache configurations using cache inclusion properties, a huge amount of time spent on cache updates could be eliminated.

2. Use of supporting data structures: Both Janapsatya's method and the CRCB algorithms use customized binary trees to exploit the cache inclusion properties and reduce simulation time. The binary trees help to find the appropriate cache set quickly during simulation. However, the linked lists used as the associativity lists, and the operations performed on them, have much scope for improvement. First, the associativity list search that determines a cache hit or miss can be improved in the available single-pass simulators. When a memory block's content has not been accessed for a long time, the search should be continued from the least recently used cache line towards the most recently used cache line in an LRU associativity list. However, both Janapsatya's method and the CRCB algorithms only allow searching from the most recently accessed cache line towards the least recently accessed cache line. The use of a singly linked list as the associativity list has blocked the possibility of a reverse search. In both Janapsatya's method and the CRCB algorithms, only the most recently accessed cache line can be directly accessed from a tree node, and the most recently accessed cache line must be accessed before any other cache line in a cache set. Therefore, on a cache miss, the simulator must visit all the cache lines in a cache set before it can remove the least recently accessed cache line. This consumes a huge amount of time on cache misses. If the least recently accessed cache line were directly accessible from the tree node, the cache miss update would be much faster.
4.3.3 Cache Inclusion Properties for Quick Cache Miss Detection

To detect cache misses quickly during a single-pass simulation, we describe two LRU cache inclusion properties in this section. These are as follows:

1. According to the first and second cache inclusion properties described in Section 4.3, a memory block content available in an LRU cache during a single-pass simulation guarantees a cache hit in all the larger LRU caches which differ by set size, associativity or both. That means the bigger cache is a superset of the smaller caches (see [23]). According to elementary set theory, content that is not available in the superset cannot be available in the subset. Therefore, a cache miss in an LRU cache guarantees a cache miss in all the smaller LRU caches that differ by set size, associativity or both.

2. Provided that all the caches use the LRU replacement policy and all of them have the same cache line size and associativity, a tag found in a cache line of a cache set cannot be available inside a more recently used cache line of a cache with a smaller set size during a single-pass simulation.

To prove these inclusion properties, let us assume that caches C1 and C2 have the same associativity A and block size B, but C2 is double the size of C1. If we use the same application trace to simulate both C1 and C2 at the same time, then at any given point in time the same address is being requested from both caches. Inside each set of C1 and C2, tags are arranged according to their last access time due to the LRU replacement policy; therefore, each cache set inside C1 and C2 is an ordered set. Due to the simulation tree structure, at a particular point in time, the tags available in a set of C1 will be distributed over two different sets of C2. An example of a simulation tree is presented in Figure 4.12. In this figure, two caches are presented. The cache at level 1 of the tree has only one set and its associativity is four. The other cache, presented at level 2, also has an associativity of four; however, it has two cache sets, set 0 and set 1. Both of the caches have the same block size. If these two caches are simulated at the same time with the same application trace, the memory block address tags that go to the set of the cache in tree level 1 must go to either set 0 or set 1 of the cache presented in tree level 2.
Figure 4.12: Simulation tree for two cache configurations

Let us suppose that the tags of cache set S of C1 are available in two different cache sets, S1 and S2, of C2. As S ⊂ (S1 ∪ S2) and both of the caches use the LRU replacement policy, the tags of S are the most recently accessed A tags of (S1 ∪ S2). For S1 and S2, one of the following cases is possible:

1. Either S1 or S2 is an exact copy of S. Therefore, the M-th most recently accessed tag of S1 or S2 will be the M-th most recently accessed tag of S.

2. In either S1 or S2 (for this case let us assume S1), the most recently accessed M tags, where M < A, are identical to the most recently accessed M tags of S; the least recently accessed (A − M) tags of S are available in the other set (in this case S2), and those (A − M) tags will be in the most recently used positions of S2.

These two cases show that, due to the LRU replacement policy, whatever is found in S1 or S2 at the X-th (X = 1, 2, 3, ..., A) most recently accessed position, if available in S, must be available at one of the positions from the X-th to the A-th most recently accessed position (i.e., the least recently accessed position) in S. The following example illustrates this point. Let us suppose that, at a particular point in time, S, S1 and S2 hold the tags for the binary memory addresses presented in the circles in Figure 4.13. In S2, the tag for binary memory address 1101 is the most recently accessed tag; however, in S, it is the second most recently accessed tag. Again, in S1, the tag of 1010 is the second most recently accessed tag; however, in S, it is the third most recently accessed tag.
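To summarize the argument in one relation (our restatement, using the notation of this section): if a tag t sits at the X-th most recently accessed position of S1 or S2, then

    t ∈ S  ⇒  X ≤ (recency position of t in S) ≤ A,

so a tail-to-head search of S needs to inspect at most (A − X + 1) cache ways.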
Figure 4.13: Cache sets S, S1 and S2 with associativity four

If we know that the requested tag is the X-th most recently accessed tag in S1 or S2, where X > (A/2), then according to our second inclusion property the tag cannot be one of the most recently accessed (X − 1) tags in S. If we search from the most recently accessed tag in S for the X-th most recently accessed tag of S1 or S2, and it is available in S at the X-th most recently accessed cache way, X tags will be searched inside S. However, if we search from the least recently accessed tag, only (A − X + 1) tags need to be searched inside S. As X > (A/2), (A − X + 1) < X. An example is given in Figure 4.14 using the caches presented in Figure 4.13. Note that in Figure 4.13 binary memory addresses are shown in the circles, whereas in Figure 4.14 the address tags of the binary addresses of Figure 4.13 are shown in the circles. Figure 4.14 shows that after finding the tag 110 in the bottom tree level, if the tag is available in the top tree level, it must be available within the last two nodes of the cache set. 110 is the address tag for the binary memory address 1100 when the cache block size is one byte. If searched from the most recently used tag in the top tree level, three nodes need to be examined to find the tag 1100. However, if the tag is searched from tail to head, only two nodes need checking. If the top tree level does not have the tag, the head to tail search examines the entire cache set. On the other hand, just by performing the tail to head search among the last two nodes in the smaller cache, the simulator can determine whether the tag is there or not. Therefore, the addition of a new search function that searches from tail to head can reduce the number of searches. We can conclude that when the tag is at the X-th most recently accessed line of a cache set in the bigger cache, where X > (A/2), tail to head searching reduces the number of cache lines that must be searched inside a set of the smaller cache to between one and (A/2), depending on the value of X. In other words, tail to head searching can reduce searching by up to 99% compared to the head to tail search.

To benefit from our observations of Section 4.3.3, we decided to start searching from the bottom level of the tree, or the biggest cache configuration, and continue towards the smallest cache configuration. In other words, we needed bottom-up simulation inside the tree. In addition, bottom-up simulation combines the first observation of Section 4.3 and the first observation of the CRCB enhancements by using our first observation: if the tag is found in the head node of a cache set during bottom-up simulation, it is possible to simply skip to the next smaller set sized cache configuration (without any update). On a miss, there is no need to search for the tag in the upper levels of the simulation tree when bottom-up simulation is utilized. The deployment of these two additional cache inclusion properties requires the following two modifications to the data structure that represents associativity inside a simulation tree, for faster performance:

1. To have both a reverse (tail to head) tag search function and a head to tail tag search
Figure 4.14: Searching non frequent address tags
function, a doubly linked list, instead of a singly linked list, must be associated with each simulation tree node. Therefore, each linked list node points not only to its next element but also to its previous one. An example of such a doubly linked list is presented in Figure 4.15. In addition, a doubly linked list allows the simple updating of a tag position inside a cache set without searching for, or remembering, the previous node inside the linked list.
2. On a miss, there is no need to search for the tag in the smaller cache configurations; however, the tag must be updated in all of those caches. To make the update process easy and fast, each tree node needs to be connected to the tail node as well as the head node of the associated linked list. On a miss, the update process goes to the tail of the cache set, places the missed tag there, and makes that tail node the new head of the list. (A sketch of this node layout is given below.)
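The following C++ sketch (ours, with hypothetical names consistent with the earlier sketches) shows the modified node layout: doubly linked ways and a tree node that keeps both head and tail pointers.

    // Doubly linked cache way: each node knows its neighbours in both directions,
    // enabling both head-to-tail and tail-to-head traversal.
    struct DLCacheWay {
        unsigned int tag;
        DLCacheWay*  next;       // towards the least recently used way
        DLCacheWay*  previous;   // towards the most recently used way
    };

    // Tree node for one cache set: both ends of the associativity list are
    // reachable in O(1), so a miss can overwrite the tail directly.
    struct DLTreeNode {
        DLCacheWay* head;        // most recently accessed way
        DLCacheWay* tail;        // least recently accessed way
        DLTreeNode* left;
        DLTreeNode* right;
    };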
Based upon our observations and the new data structure, we propose our new single-pass LRU cache simulation algorithm, "SuSeSim", which is discussed in the following section.
4.4 SuSeSim Algorithm

SuSeSim simulates caches with the LRU replacement policy to accurately find the total number of cache misses for each and every cache configuration. Similar to Janapsatya's method, SuSeSim has three phases: Tree formation, Tag searching and Cache set update. In the following subsections, we describe these three phases in detail.
4.4.1 Tree Formation

A cache miss counter table similar to the one presented in Figure 4.5 and a simulation tree like the one presented in Figure 4.1 are used in SuSeSim. However, the simulation tree nodes have doubly linked lists, instead of singly linked lists, to simulate set associative caches. Nodes in the doubly linked lists correspond to cache lines, as usual. In addition, each tree node not only points to the most recently accessed cache line (the head of the associated doubly linked list) but also points to the least recently accessed cache line (the tail of that list). At the beginning of simulation, a forest of simulation trees (described in Section 4.3) is created for each cache block size. Processor-requested addresses are read from the selected application's trace file one at a time and sent to each forest for simulation. A duplicated address is not sent to a forest when it is requested consecutively. Inside a forest, a tree is selected using the procedure we have described in Section 4.3.
4.4.2 Tag Searching

To simulate a memory block access, SuSeSim starts searching for the address tag in the biggest cache configuration and continues toward the smallest cache configuration. Each tree level is simulated in sequence. Inside each level of the tree, a memory block address tag is searched for in the doubly linked list associated with the selected tree node (cache set). In SuSeSim, there are two different search functions for searching a tag inside the doubly linked list associated with a simulation tree node.
1. A search function that searches from the head to the tail of the cache set. This is the default search function.

2. The second search function starts searching for the address tag from the least recently accessed tag, or tail, and continues backwards through at most N nodes, where N is a given number of nodes. This search is performed instead of the default search in the parent tree level if an address tag is found as the X-th most recently accessed tag in a node of the current simulation tree level, where X > (A/2) and A is the associativity. The value of N will be (A − X + 1) during the simulation of the parent tree level. We call this search the reverse search.

Figure 4.15 shows the search paths of these two search functions; a code sketch of the reverse search is given after the figure.
Figure 4.15: Doubly linked list and search paths of the SuSeSim algorithm
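A minimal sketch of the reverse (tail to head) search, using the hypothetical DLTreeNode/DLCacheWay layout sketched in Section 4.3.3; it checks at most N nodes starting from the tail, as described above.

    // Reverse search: start at the least recently used way and walk towards the
    // head, examining no more than maxNodes ways (N = A - X + 1 in the text).
    // Returns the 1-based recency position of the hit, or 0 if the tag was not
    // found within the examined nodes.
    int reverseSearch(const DLTreeNode &set, unsigned int tag, int associativity, int maxNodes)
    {
        int position = associativity;                 // the tail is the A-th most recently used way
        const DLCacheWay *way = set.tail;
        for (int examined = 0; way != nullptr && examined < maxNodes;
             way = way->previous, --position, ++examined) {
            if (way->tag == tag)
                return position;                      // hit at this recency position
        }
        return 0;                                     // not found within the last maxNodes ways
    }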
4.4.3 Cache Set Update
Depending on the outcome of the search, one of the following actions is taken:

1. On a miss, there is no need to search for the address tag in the smaller cache configurations which differ by set size, associativity or both. The missing address tag will be placed in the tail node of the appropriate cache sets of the current and the other smaller cache configurations in the simulation tree. After that, the tail will be moved to the head node of the linked list. The node with the least recently accessed address tag will become the new tail. The cache miss counters for all of the configurations with any associativity and a set size equal to these updated cache configurations will be increased.

2. If the tag is found in the head of the associated doubly linked list, the simulation skips from the current configuration to the next smaller cache configuration in the simulation tree.

3. If the tag is found in the V-th most recently accessed cache line of a cache set, where that cache line is neither the head nor the tail, the miss counter for all cache configurations with a set size equal to the current set size and associativity less than V will be increased. The node containing the tag will be brought to the head position of the linked list, and the node with the least recently accessed address tag will become the new tail. After that, the address tag will be searched for in the next smaller cache configuration; however, if V > (A/2), the reverse search function will be used in the smaller (parent level) cache. The reverse search function will search up to the V-th most recently accessed cache line in the appropriate cache set of the parent level cache configuration if the tag is not found.

Three examples are given to illustrate these three different actions of the simulation process. Let us suppose that we have the simulation tree shown in Figure 4.16. Here, we have two set associative caches to simulate, both with an associativity of four. We want to simulate the following three cases:

Figure 4.16: An example simulation tree

1. We want to simulate the binary byte addressable memory address 1100, where the cache line size is one byte. SuSeSim will start the simulation from the bottom level of the tree. The default search function will be used at the beginning. We can see from Figure 4.16 that the tag for memory address 1100 is in set 0. This tag is the third most recently accessed tag of the cache set. After finding the tag, SuSeSim will increase the cache miss counter for each of those configurations that has set size two, associativity less than three and a cache block size of one byte. After that, set 0 will look like Figure 4.17, as the tag for 1100 will become the new head due to the LRU replacement policy. After finishing the simulation in the bottom level, SuSeSim will search for the tag in the top level; however, the reverse search function will be used, as in the bottom level the tag was found as the third most recently accessed tag. The reverse search will look for the tag in the two least recently accessed tags of the top level. This time the tag will be found as the third most recently accessed tag again; therefore, the miss counter for all the cache configurations with a set size of one, associativity less than or equal to three and a cache line size of one byte will increase. After that, the tag for address 1100 will again become the head of this cache set.
Figure 4.17: Cache set 0 of Figure 4.16 after the update of the tag position

2. If we want to search for the tag corresponding to address 0010, it will be found at the head of the cache set in the bottom tree level. Therefore, no miss counter update and no linked list update will be performed. The default search will be performed again in the top tree level. As the tag is again the most recently accessed tag in the top level, no change will occur.

3. If we want to simulate address 1000, a default search in cache set 0 of the bottom tree level will fail to find the tag, which makes SuSeSim aware that the tag is not available in the search space. Therefore, the miss counter of all the cache configurations with a set size equal to or less than two, associativity equal to or less than four and a cache line size of one byte will be increased. The tag for 1000 will be added as the head node both in set 0 of the bottom tree level and in the cache set of the top tree level. Figure 4.18 shows the tree after all these modifications.
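The miss update in the third case is where the direct tail pointer pays off. The following C++ fragment is a sketch of our own (using the hypothetical DLTreeNode/DLCacheWay types from Section 4.3.3) of how the least recently used way can be overwritten and moved to the head without walking the list, in contrast with the five steps of Figure 4.8.

    void suSeSimMissUpdate(DLTreeNode &set, unsigned int missedTag)
    {
        DLCacheWay *victim = set.tail;            // O(1): the tail is reachable from the tree node
        if (set.head == set.tail) {               // associativity 1: just overwrite the only way
            victim->tag = missedTag;
            return;
        }
        set.tail = victim->previous;              // the next least recently used way becomes the tail
        set.tail->next = nullptr;

        victim->tag = missedTag;                  // place the missed tag
        victim->previous = nullptr;               // move the node to the head of the list
        victim->next = set.head;
        set.head->previous = victim;
        set.head = victim;
    }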
Figure 4.18: Search tree of Figure 4.16 after the placement of new tag
It should be mentioned that, like Janapsatya's approach with the CRCB algorithms, in SuSeSim each level of a simulation tree, except the top level, must represent a cache twice as big as its parent level's cache. The SuSeSim algorithm is presented in Algorithm 1.
1:  while Trace is not finished do
2:    Read an address request Addr;
3:    offset = maximum cache set size;
4:    found = true;
5:    previous_result = 0;
6:    while offset is not empty do
7:      tag_s = Tag from Addr for offset;
8:      index_s = cache index from Addr for offset;
9:      Select tree with cache set index_s and go to the level for cache set size equal to offset;
10:     if found is false then
11:       Place tag in the tail of the associated doubly linked list;
12:       Move tail to be the head of the linked list;
13:       Increase cache miss counter for all cache configurations with set size equal to offset and any associativity;
14:     else
15:       Go to the doubly linked list associated with the tree node with index index_s;
16:       if head of tree node with index index_s does not contain tag_s then
17:         if previous_result > (maximum associativity / 2) then
18:           Search the doubly linked list to find a tag entry equal to tag_s, within tail to the previous_result-th node;
19:         else
20:           Search the linked list to find a tag entry equal to tag_s within head to tail;
21:         if a cache hit occurs in the S-th element of the linked list then
22:           Increase cache miss counter for all caches with set size equal to offset and associativity less than S;
23:           Make the (S − 1)-th node the new tail of the doubly linked list when the S-th node is the tail node;
24:           Move the S-th node to be the head of the doubly linked list;
25:           found = true;
26:           previous_result = S;
27:         else
28:           found = false;
29:           Place tag in the tail of the linked list;
30:           Move tail to be the head of the linked list and make the (tail − 1) node the new tail of the doubly linked list;
31:           Increase cache miss counter for all cache configurations with set size equal to offset and any associativity;
32:     offset = offset / 2;
Algorithm 1: SuSeSim Algorithm
4.5 Experimental Procedure and Results
With the implementation described above, SuSeSim can reduce the total simulation time. Each SuSeSim doubly linked list entry holds a tag (32 bits) and pointers to the next and previous elements (32 bits each); in total, each doubly linked list entry needs to store 96 bits. In the simulation tree, each node keeps a pointer to the head element (32 bits) and the tail element (32 bits) of the linked list, giving a total of 64 bits. Therefore, (64 + (96 × A)) bits are consumed per tree node or cache set, where A is the maximum associativity. Janapsatya's method (and CRCB) needs (96 + (65 × A)) bits per tree node or cache set. We have re-implemented Janapsatya's method with and without the CRCB enhancements. All of these simulators are written in C++. We have compiled and simulated programs from Mediabench [21] with SimpleScalar/PISA 3.0d [140]. Program traces were generated by SimpleScalar and fed into all three methods. We have verified the hit and miss numbers of each method using DineroIV [59] and found them all consistent with each other. Simulations were performed on a machine with a dual core Opteron64 2 GHz processor and 8 GBytes of main memory. We have simulated 1300 different cache configurations (for both data and instruction caches) for each simulator. The cache configurations are all possible combinations of the cache parameters presented in Table 4.2: cache set size = 2^i, cache line size = 2^i bytes and associativity = 2^i,
where 0 ≤ i.

    tag = tag >> offset;
    rep_count++;
    replacement(&node[index], node[index].tail, 0);
    // Update miss counter
    for (int j = 0; j ...

    ... tag != (tag >> offset)) {
        // Search the tag in the associativity list in the selected tree node
        total_sim++;
        if (result > (max_assoc / 2))
            result = tail_tag_search(&node[index], node[index].tail, tag >> offset, result, 0, 0);
        else
            result = head_tag_search(&node[index], node[index].head, tag >> offset, 1, 0);
        for (int j = 0; j ...
        ... max_assoc)
            found2 = false;
        else
            found2 = true;
        }
    }
    else {
        total_list++;
        one_count++;
        result = 1;
        found2 = true;
    }

    // Go to next level
    offset--;

    if (offset ...

    ... tags[i] == tag) {
        // Tag found
        node->head = tag;
        // Update the parent level wave pointer
        if (!first_level)
            parent_node->wave[tag_location] = i;
        tag_location = i;

        // Search location update for child level
        search_location = node->wave[i];
        return true;
        }
    }
}
else {
    // Tag not found
    i = max_assoc;
}

// Cache miss
// Update head node
node->head = tag;

// Update evicted node
node->evicted = node->tags[node->update_index];
node->evicted_location = node->wave[node->update_index];
node->tags[node->update_index] = tag;

// Wave update
node->wave[node->update_index] = (max_assoc + 1);
if (!first_level)
    parent_node->wave[tag_location] = node->update_index;
tag_location = node->update_index;

// Child level search location update
search_location = (max_assoc + 1);

// Deciding tag placement location
if (node->update_index ...
    node->update_index = node->update_index + 1;
else
    node->update_index = 0;

return false;
}
// Function to start a cache hit/miss evaluation
int search(set node[32767], unsigned int tag, long misses[total_levels][1], int offset, int index)
{
    bool result;
    int temp;
    first_level = true;
    tag_location = (max_assoc + 1);
    search_location = (max_assoc + 1);

    while (offset ...
    ... >> offset) {
        // Observation 1: Searching in the same cache way but in the bigger caches
        if (search_location != (max_assoc + 1)) {
            // The wave pointer is not invalid
            if (node[index].tags[search_location] == tag >> offset) {
                // Found the tag in the selected cache line
                result = true;
                node[index].head = tag >> offset;   // *** Care point

                // *** Wave update
                if (!first_level)
                    parent_node->wave[tag_location] = search_location;
                tag_location = search_location;

                search_location = node[index].wave[search_location];
            } // if (node[index].tags[search_location] == tag >> offset)
            else {
                // Selected cache line does not have the tag
                if (node[index].evicted == tag >> offset) {
                    // Requested tag is not the MRE
                    // Perform a miss operation
                    misses[offset][1] = misses[offset][1] + 1;

                    node[index].head = tag >> offset;   // *** Care point

                    search_location = node[index].evicted_location;
                    node[index].wave[node[index].update_index] = (max_assoc + 1);
                    node[index].evicted = node[index].tags[node[index].update_index];
                    node[index].evicted_location = node[index].wave[node[index].update_index];
                    node[index].tags[node[index].update_index] = tag >> offset;

                    // Wave update
                    if (!first_level)
                        parent_node->wave[tag_location] = node[index].update_index;
                    tag_location = node[index].update_index;

                    // Deciding tag placement location
                    if (node[index].update_index ...
                } // ... >> offset)
                else {
                    // Requested tag is not even in the MRE
                    // Perform a miss operation
                    misses[offset][1] = misses[offset][1] + 1;

                    node[index].head = tag >> offset;   // *** Care point

                    search_location = (max_assoc + 1);
                    node[index].wave[node[index].update_index] = (max_assoc + 1);

                    node[index].evicted = node[index].tags[node[index].update_index];
                    node[index].evicted_location = node[index].wave[node[index].update_index];
                    node[index].tags[node[index].update_index] = tag >> offset;

                    // *** Wave update
                    if (!first_level)
                        parent_node->wave[tag_location] = node[index].update_index;
                    tag_location = node[index].update_index;

                    // Deciding tag placement location
                    if (node[index].update_index ...
        ... >> offset) {
            // Requested tag is not the MRE
            // Perform a miss operation
            misses[offset][1] = misses[offset][1] + 1;

            node[index].head = tag >> offset;   // *** Care point

            search_location = node[index].evicted_location;
            node[index].wave[node[index].update_index] = (max_assoc + 1);

            node[index].evicted = node[index].tags[node[index].update_index];
            node[index].evicted_location = node[index].wave[node[index].update_index];
            node[index].tags[node[index].update_index] = tag >> offset;

            // *** Wave update
            if (!first_level)
                parent_node->wave[tag_location] = node[index].update_index;
            tag_location = node[index].update_index;

            // Deciding tag placement location
            if (node[index].update_index ...
        ... >> offset);
        if (!result) {
            misses[offset][1] = misses[offset][1] + 1;
        }
        }

        // Record the current level
        parent_node = &node[index];   // *** Parent node update

        // Go to the next level
        offset++;

        if (offset >= total_levels)
            return 0;
        else {
            temp = (pow(2.00, offset));
            index = (tag % temp) + temp - 1;   // *** Care point
        }
        } // if (node[index].head != tag >> offset)
        else {
            // The requested tag is the MRA
            return 0;
        } // else if (node[index].head != tag >> offset)
        first_level = false;
    }
    return 0;
}
B.3 SCUD
// Function to handle a cache miss
int handle_one_level_miss(set *cache, map<...>::iterator pos, unsigned long misses[total_levels][5],
                          int level, int index, int row, int col)
{
    // 1. Remove the old tag position and update the CLT
    if (cache[index].assoc[row][col].position_in_the_tag_list != central_tag_list.end())
    {
        (*(cache[index].assoc[row][col].position_in_the_tag_list)).second.level_availability_count[level]--;
        (*(cache[index].assoc[row][col].position_in_the_tag_list)).second.level_availability[level][row] = false;
        (*(cache[index].assoc[row][col].position_in_the_tag_list)).second.total_count--;
        if ((*(cache[index].assoc[row][col].position_in_the_tag_list)).second.total_count == 0)
            central_tag_list.erase(cache[index].assoc[row][col].position_in_the_tag_list);
    }
    // 2. Place the tag in the associativity list
    cache[index].assoc[row][col].position_in_the_tag_list = pos;

    // 3. Update the CLT for the MRA
    (*pos).second.level_availability_count[level]++;
    (*pos).second.level_availability[level][row] = true;
    (*pos).second.total_count++;

    // 4. Update miss counter
    misses[level][row] = misses[level][row] + 1;
    return 0;
}
// Function to search a tag
int search(set *cache, unsigned int tag, unsigned long misses[total_levels][5], int offset, int index)
{
    int temp;
    assoc_list temp_list;
    Tag_list temp_tag_list;

    // Search the tag in the CLT
    map<...>::iterator pos = central_tag_list.find(tag);

    // #H
    if (pos != central_tag_list.end()) {
        // ************** Cache Hit in the CLT ***********************//
        while (offset
/ / #H i f ( p o s ! = c e n t r a l t a g l i s t . end ( ) ) { / / ∗∗∗∗∗∗∗∗∗∗∗∗∗∗ Cache H i t i n t h e CLT ∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗// while ( o f f s e t