UNIVERSITY of CALIFORNIA, SAN DIEGO
Guiding Program Transformations with Modal Performance Models
A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science
by
Nicholas Matthew Mitchell
Committee in charge:
Professor Larry Carter, Co-Chair
Professor Jeanne Ferrante, Co-Chair
Professor Bradley Calder
Professor Frederic T. Chong
Professor Sutanu Sarkar

2000
The dissertation of Nicholas Matthew Mitchell is approved, and it is acceptable in quality and form for publication on microfilm:
Co-Chair
Co-Chair
University of California, San Diego
2000
Wendy, whence Wally?
TABLE OF CONTENTS

Signature Page
Table Of Contents
List Of Figures
List Of Tables
Acknowledgements
Vita, Publications, and Fields of Study
Abstract

1 Introduction
  1.1 Compilers Need Models
  1.2 Modeling Methodologies Yield Models
  1.3 Modeling Methodology Design Difficulties
  1.4 The Domain of this Thesis
    1.4.1 Computational Corpus
    1.4.2 Architectural Corpus
  1.5 Modeling Strategies
    1.5.1 Purely Static Approaches
    1.5.2 Purely Experimental Approaches
    1.5.3 Limited modeling, bounded experimentation
  1.6 Structure and Contributions of this Thesis

2 Related Work

3 A Static-Counting Model for TLB and Cache
  3.1 Terminology and Assumptions
  3.2 Tall, Thin Modules
  3.3 Intersection Lemma
  3.4 Applying the Lemma to Tile Size Selection
  3.5 Consequences of the Lemma and Results
    3.5.1 TLB decisions first
    3.5.2 Cache first
    3.5.3 In-concert
    3.5.4 Which of the three approaches is best?

4 Bucket Tiling Mechanisms
  4.1 Motivation for Our Approach
  4.2 The Reference Pattern
    4.2.1 Reference Pattern as Box
    4.2.2 The Use of Reference Patterns
  4.3 The Three Tasks of Bucket Tiling
  4.4 Some Observations on Bucket Tiling
    4.4.1 Bucketizing reduces spatial footprint
    4.4.2 Net effect of bucket tiling
  4.5 Permutation Generation
    4.5.1 Spotting A
    4.5.2 Choosing f
    4.5.3 Generating π
  4.6 Loop Regeneration
    4.6.1 Three Loop Classes Describe General Structure
    4.6.2 Two Useful Relations
    4.6.3 An Initial Implementation via Coalescing
    4.6.4 Reducing Overhead via Hoisting
  4.7 Data Variants, Data Remapping
  4.8 Optimization Examples
    4.8.1 The Heart Kernel
    4.8.2 Integer Sort
    4.8.3 Sparse Matrix-Vector Product
  4.9 Multi-level Bucket Tiling

5 Modal Phenomenon
  5.1 Program Behavior is Modal
    5.1.1 Syntax-induced Modality
    5.1.2 Hierarchy-induced (intra-planar) Modality
    5.1.3 Interaction-induced (inter-planar) Modality
    5.1.4 Epistasis
  5.2 Unknown Determinants of Modal Behavior
    5.2.1 Analytic Unknowns
    5.2.2 Modeling Unknowns

6 Modal Models
  6.1 Modalities
  6.2 A Representation of Modal Behavior
    6.2.1 Behavioral Modes and Mode Instances
    6.2.2 Completeness
    6.2.3 Virtual Equality
    6.2.4 Mode Trees
  6.3 A Process for Instantiating Modes
  6.4 A Process for Instantiating Mode Trees from Complex Expressions
  6.5 The Relationship of Modes and Performance
    6.5.1 Exploring Possibility Two
    6.5.2 Codifying Context via Isolation Parameters
    6.5.3 Extended Parameter Set
  6.6 Mode Scoping
    6.6.1 Parameter Specifications, Parameter Space
    6.6.2 Parameter Sweeping Strategies
  6.7 Template Formula Generation
    6.7.1 Two Types of Formulas
    6.7.2 Formula Generation Strategies
  6.8 Summary

7 A Modal Model for Locality
  7.1 Locality Modes
    7.1.1 Completeness
    7.1.2 Composition Semantics
    7.1.3 Mode Specifications
    7.1.4 Modality Specification
  7.2 Parameter Properties
  7.3 Simplification Rules
  7.4 Template Formulas

8 Optimization Guidance with a Modal Model
  8.1 Transforming Between Trees
  8.2 Evaluating a Tree
    8.2.1 Tree Evaluation versus Classification
  8.3 Generating a Cost Function
  8.4 Examples
    8.4.1 Accommodating cache capacity and spatial reuse
    8.4.2 Parameterization and Preconditions
  8.5 Runtime Resolution
    8.5.1 Reference Pattern Height

9 Implementation Issues
  9.1 TAPS
    9.1.1 Some notes on SUIF 2.0
    9.1.2 TAPS Capabilities
    9.1.3 Bucket Tiling
  9.2 Modal Model of Memory
    9.2.1 Some notes on Scheme
    9.2.2 MMM Capabilities

10 Future Work
  10.1 Integration
  10.2 Refinement of Modal Infrastructure
  10.3 Runtime Resolution
  10.4 Further Automation
  10.5 Classification Issues
  10.6 Model Discovery Techniques
  10.7 Scoping Strategies
  10.8 Staging and Interactions
  10.9 Beyond Locality

11 Conclusion
  11.1 Summary of Contributions
  11.2 Analogy, Symbolism, and Robustness

Bibliography
LIST OF FIGURES

1.1 Mock-up model of execution time.
1.2 An example of optimization preconditions.
1.3 An example of optimization parameterizations.
1.4 Visualization of modeling methodology.
1.5 Peeking inside a modeling methodology.
1.6 An example hierarchical system.
1.7 Comparing an affine loop nest and a non-affine loop nest.
1.8 The implementation of Figure 1.4 using the modal models.
1.9 Two example modal representations of loop nests.
3.1 An architectural scenario with deleterious tiling interactions.
3.2 A tiled loop nest for matrix multiply.
3.3 We assume the scheduled tiling for matrix multiply given in this figure.
3.4 Simulated data access times for the three tiling strategies.
4.1 The application of bucket tiling to a computation involves three main processes.
4.2 Four possible reference patterns.
4.3 Example array references which lead to the four example reference patterns in Figure 4.2.
4.4 Using reference patterns to visualize the effect of (a) inspector-executor applied to locality and (b) bucket tiling.
4.5 Bucket tiling localizes reference patterns by permuting the computation's iteration order.
4.6 A demonstration loop (a), the corresponding shackled version [66], and the version after shackling and implementing the foreach loop with for loops, using a loop coalescing strategy.
4.7 Memory reference patterns of three variants of a loop containing D[i] = C[i] + A[B[i]].
4.8 A process for permutation generation.
4.9 A template for loop regeneration.
4.10 Two possible implementations of the bucket loop.
4.11 Loop coalescing is a well-known program transformation.
4.12 An example showing what a coalescing-based implementation looks like.
4.13 The result of applying hoisting to the example in Figure 4.12(b).
4.14 The heart kernel, which simulates the beating of a heart.
4.15 Heart kernel permutation generation pseudo-code. This example uses 1000 buckets.
4.16 The result of applying bucket tiling to the heart kernel.
4.17 An implementation of integer sort.
4.18 The result of applying bucket tiling to integer sort.
4.19 Code for sparse matrix-vector product.
4.20 The result of bucket tiling conjugate gradient.
4.21 The benefit of bucket tiling applied to integer sort.
4.22 An illustration of two-level bucket tiling.
5.1 An intuitive correspondence between two classes of syntax and the respective parameterized reference pattern classes.
5.2 Visualizing hierarchical (intra-planar) performance modality.
5.3 Visualizing interaction-induced (inter-planar) performance modality.
5.4 Each level curve varies one of two properties of the values in the B array.
5.5 Visualizing inter-planar performance modality.
5.6 Using a polyhedral algebra tool on an affine computation.
5.7 Using a polyhedral algebra tool on a non-affine computation.
6.1 A peek inside Figure 1.5.
6.2 A mode tree is a tree of mode instances.
6.3 Mode tree instantiation maps an input program to a mode tree.
6.4 Examples of when our mode tree instantiation applies.
6.5 An example instantiation of a mode tree from an expression, i*X + k.
6.6 The experiments we run to explore one possibility for composing the performance of mode instances: treating them as independent of each other.
6.7 Experiments with composition on the mode tree κκ, for three composition functions.
6.8 Multiplication-based composition.
6.9 A figure similar to Figure 6.7, except for the mode tree σκ.
6.10 Mode specifications, isolation rules, extended parameters.
6.11 Mode scoping discovers the relationship between mode parameters and performance for a particular system.
6.12 Our mode scoping strategy first chooses a number of planar slices of the parameter space, and then sweeps through each slice.
6.13 An algorithm for choosing planes to sweep.
6.14 A pruning algorithm.
6.15 Our parameter sweeping strategy runs a divide-and-conquer sweep on a number of planes.
6.16 Whether we use a purely arithmetic division or a combination of the two determines how many experiments we run.
6.17 A divide-and-conquer plane sweeping algorithm.
6.18 A template formula is essentially a generator of formulas.
6.19 Classification example for two of the modes introduced in Chapter 7.
6.20 An overview of the modal model infrastructure.
7.1 Visualizing the locality modes.
7.2 Relating several example expressions with their mode instances under the locality modal model.
7.3 Three example mode trees for locality.
7.4 Our locality modal model must define the semantics of mode tree composition.
7.5 Specification of the κ locality mode.
7.6 Specification of the σ locality mode.
7.7 Specification of the ρ locality mode.
7.8 The locality modality specification.
7.9 The parameter specification of the κ mode.
7.10 The parameter specification of the σ mode.
7.11 The parameter specification of the ρ mode.
7.12 The full detail of a simplification rule.
7.13 The full detail of another simplification rule.
8.1 Visualizing modal guidance, the contribution of Chapter 8.
8.2 Comparing predicted and actual performance for κρ.
8.3 Predicted tiling speedup for two problem sizes.
8.4 Similar to Figure 8.3, but for the A matrix.
8.5 Similar to Figure 8.3, but for the C matrix.
8.6 Comparing predicted and actual performance for tile size selection.
8.7 Experiments with random range sampling for runtime resolution.
9.1 How to convert between the various formats within SUIF.
9.2 TAPS provides an expression classifier which, given an expression, provides a vector which classifies that expression along seven axes.
9.3 How our bucket tiling implementation uses the abstractions summarized in Table 9.5.
9.4 How to convert between the various representations in MMM.
LIST OF TABLES

3.1 TLB and L1 data cache characteristics for various processors.
3.2 Tile height and width picked by three strategies.
3.3 Simulated and predicted misses for the three tiling strategies.
4.1 Quantifying the occurrences of non-affine references in three ways.
4.2 Summary of recent locality-improving work.
4.3 The two relations, belongsto and alignedto, characterize a loop regeneration strategy.
4.4 This table compares two possible choices of bucket-indexing function, f, for the heart kernel.
5.1 Comparing cache capacities in a Compaq Alpha 21164a to the points in Figure 5.2 where the slope changes.
6.1 The (a) stages of our modal guidance system, and (b) key to the important sets.
6.2 How various modeling strategies satisfy three criteria.
7.1 The mode tree simplification rules.
7.2 Some example template formulas generated by our system.
8.1 Transformation rules for some locality optimizations.
9.1 A summary of the technologies used in the implementation of this thesis.
9.2 Which software the various parts of this thesis rely on.
9.3 Summary of SUIF front-end usage.
9.4 The interface file for the features of TAPS.
9.5 The abstractions of our bucket tiling implementation.
9.6 The per-file breakdown of our modal model implementation.
9.7 The commands for constructing abstract syntax trees (ASTs) in MMM.
9.8 Some commonly used functions provided by our modal model implementation.
ACKNOWLEDGEMENTS

I have found nothing quite so beautiful as a cat delicately, quietly, ground-huggingly, inexorably approaching the window, behind which perches a fluttering bird. As inexorable, I suppose, as my clambering, noisy, "whatcha looking at, Simon?", clodhopping entrance. I try, Simon! And I apologize for not learning patience more quickly.

Nor have I found anything quite so useful as working with, and being advised by, imaginative folks. But I imagine you all know who you are.
VITA

July 15, 1972    Born, San Francisco, California
1994             B.A., University of California, Berkeley
1994–1995        Powell Fellowship
Winter 1996      Teaching Assistant, Department of Computer Science, University of California, San Diego
1996             M.S., University of California, San Diego
1996–1999        Research Assistant, University of California, San Diego
Summer 1997      Intern, Digital Equipment Corp., Western Research Lab
Summer 1999      Intern, IBM T.J. Watson Research Center
1999–2000        Intel Fellowship
2000             Doctor of Philosophy, University of California, San Diego
PUBLICATIONS

"ILP versus TLP on SMT", Nick Mitchell, Larry Carter, Jeanne Ferrante, and Dean Tullsen, Supercomputing, November 1999.

"Localizing Non-affine Array References", Nick Mitchell, Larry Carter, and Jeanne Ferrante, Parallel Architecture and Compilation Techniques (PACT), October 1999.

"Explorations in Symbiosis on Two Multithreaded Architectures", Allan Snavely, Nick Mitchell, Larry Carter, Jeanne Ferrante, and Dean Tullsen, Workshop on Multithreaded Execution And Compilation (MTEAC), January 1999.

"Multi-processor Performance on the Tera MTA", Allan Snavely, Larry Carter, Jay Boisseau, Kang Su Gatlin, Amit Majumdar, Nick Mitchell, John Feo, and Brian Koblenz, Supercomputing, November 1998.

"Quantifying The Multi-Level Nature of Tiling Interactions", Nicholas Mitchell, Karin Högstedt, Larry Carter, and Jeanne Ferrante, International Journal of Parallel Programming, Volume 26, Number 6, pp. 641–670, June 1998.

"Quantifying The Multi-Level Nature of Tiling Interactions", Nicholas Mitchell, Larry Carter, Jeanne Ferrante, and Karin Högstedt, Languages and Compilers for Parallel Computing (LCPC), August 1997.

"A Compiler Perspective on Architectural Evolutions", Nicholas Mitchell, Larry Carter, and Jeanne Ferrante, IEEE Technical Committee on Computer Architecture (TCCA) Newsletter, June 1997.

"Optical Character Recognition and Parsing of Typeset Mathematics", Richard Fateman, Taku Tokuyasu, Benjamin Berman, and Nicholas Mitchell, Journal of Visual Communications and Image Representations, Volume 7, Number 1, 1996.
FIELDS OF STUDY

Major Field: Computer Science
Studies in Compilers
Professors Larry Carter and Jeanne Ferrante, University of California, San Diego

Major Field: Computer Science
Studies in Symbolic Computation
Professor Richard J. Fateman, University of California, Berkeley
ABSTRACT OF THE DISSERTATION
Guiding Program Transformations with Modal Performance Models

by

Nicholas Matthew Mitchell

Doctor of Philosophy in Computer Science

University of California, San Diego, 2000

Professor Larry Carter and Professor Jeanne Ferrante, Co-Chairs

Successful program optimization requires analysis of profitability. From this analysis, a compiler or runtime system can decide where and how to apply an assortment of program transformations. This two-part problem is called transformation guidance [110, 100, 11, 63, 108]. We consider the desired goal of robust guidance of performance optimizations for hierarchical systems. A guidance system is robust if it unifies disparate sources of knowledge, and makes reasonable decisions that hold up despite a lack of definitive information. In particular, we seek to address concerns presented by aspects of syntax, architecture, and data set. Syntax may not be statically analyzable; for example, the data dependences due to A(B(i)) (an indirect memory reference) cannot be determined until runtime. Architecture poses a problem in the complexity of the relationship between its properties and performance. Data set shares both problems: on the one hand, we cannot analyze properties of unavailable data; and yet, once available, we cannot easily predict how its properties, combined with architectural properties, influence execution time.

This thesis solves aspects of this robust guidance problem. First, we present bucket tiling, a program transformation for locality which handles non-affine array references (such as the indirect reference mentioned above). Bucket tiling improves the performance of codes such as conjugate gradient and integer sort by 1.5 to 2.8 times. We have developed a tool which automatically applies bucket tiling to C or Fortran codes.
To guide locality optimizations such as bucket tiling in a robust manner requires a new modeling strategy. We present the abstraction of modal models. A modal model recognizes, and leverages, the following observation: many aspects of a program's behavior can be assigned to a small, finite number of distinguishable categories. We develop a modal model for guiding locality transformations which uses three parameterized modes to represent three different access patterns. We show how to experimentally determine parameterized formulas for the execution time of these modes on any given target platform. Further, we use these modes as the basis for a calculus of performance modeling for our guidance system. Given any program, represented as a tree of modes, we show how to determine an execution time formula for the program. For bucket tiling, we determine execution time formulas for the original and transformed programs, and use these to guide the decision on performing the transformation.

We also contrast a modal modeling approach with a static-combinatoric approach, which models by counting some observable property of behavior, such as cache misses. This contrast highlights the principal advantage of modal modeling: robustness to syntax, architecture, and data set properties.
Chapter 1

Introduction

A compiler transforms an input program to one of a collection of possible output versions. How does it choose between these possibilities? To distinguish between versions requires some ranking of desirability. The essential problem of this thesis is to reduce the execution time of programs by performing and guiding program transformations in a robust manner.
1.1 Compilers Need Models
Imagine a program which, iterating once, finishes in one unit of time; iterating twice, finishes in two units; and so on. To visualize the performance of this program, we could use a simple chart, such as the one shown in Figure 1.1; this chart is a model of the program's execution time. Observe that, analogous to the model as chart, is a model as formula: E(n) = n. From such a model, a compiler can derive execution time estimates for a particular situation by, in this case, reading the chart appropriately. A compiler, of course, has no such a priori knowledge from which to draw when generating its own models (nor is it likely to have visual processing capabilities for reading charts). And, unfortunately, life is a little more complicated than this chart makes it out to be. Despite these two stumbling blocks, a compiler needs models to guide program transformation.
Figure 1.1: This chart is a model of the execution time of a mock-up code whose execution time is linearly proportional to its number of iterations.

To select transformations, a compiler must establish the preconditions of its known transformations. To apply one, it must determine the parameterizations of that transformation. (There are certainly other aspects of optimization, such as the placement of overhead code; Chapter 8 addresses these issues. We concentrate on these two tasks in this thesis.) Consider, for example, the application of the loop tiling transformation [117, 83]:
    do j = 1, L
      do i = 1, M
        do k = 1, N
          C(i, j) += A(i, k) * B(k, j)
        end
      end
    end
    do kk = 1, N, Th
      do ii = 1, M, Tw
        do j = 1, L
          do i = ii, min(ii + Tw - 1, M)
            do k = kk, min(kk + Th - 1, N)
              C(i, j) += A(i, k) * B(k, j)
            end
          end
        end
      end
    end
The first loop nest forms the input to the compiler. The second nest is a possible output of the compiler, after strip-mining the i and k loops, and then interchanging the strip-loops (kk and ii) to push them towards the top of the loop nest. The above code identifies the two choices an optimizer must make. First, it must decide the preconditions for this mapping: whether or not this mapping, from the first to the second nest, is legal and more desirable than the many possible alternatives (e.g., doing nothing, or perhaps applying tiling to some other combination of loops). Second, if the compiler decides to apply the mapping, it then must parameterize the mapping: choose values for the tile height, Th, and the tile width, Tw.
Figure 1.2: Preconditions: To guide the application of loop tiling to matrix multiply, we must choose whether to apply it. The benefit of tiling depends on the problem size. We ran these experiments on a 500MHz Compaq Alpha 21164a.

To make this example more concrete, we can explore these two choices experimentally. Figure 1.2 shows that the question of whether or not we should apply loop tiling depends on problem size. Below a certain threshold, around 250 elements, tiling hurts performance, typically by around 30%. However, above this threshold, tiling is a win. Figure 1.3 shows that a large range of square cache tile sizes, from 20 to 70 (measured as the length of a tile's side), yield performance within 9% of each other.

If the compiler had a model, such as the chart in Figure 1.1, for the original implementation, and a similar chart for the transformed implementation, it could read the execution times off the two charts, compare them, and make a decision: should I apply this transformation, and if so, how?

What makes for a good model? We say a model is good if it permits a compiler to make reasonable precondition and parameterization decisions. But where does a compiler get good models, without a priori knowledge about the domain, and which handle situations more realistic than that in Figure 1.1?
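To make this comparison concrete, the following C sketch shows how a compiler might use two such models to make both decisions. It is illustrative only: e_orig and e_tiled are hypothetical stand-ins for whatever models a methodology produces (their bodies are placeholders loosely mimicking Figure 1.2), not part of any actual compiler interface.

    #include <stdio.h>

    /* Placeholder models, standing in for charts like Figure 1.1. */
    double e_orig(double n) { return n * n * n; }
    double e_tiled(double n, int Th, int Tw) {
        /* mimic Figure 1.2: tiling hurts below n = 250, helps above */
        return (n < 250) ? 1.3 * n * n * n
                         : 0.6 * n * n * n + 100.0 * (Th + Tw);
    }

    void guide(double n) {
        double best = e_orig(n);                /* baseline: do nothing */
        int bestTh = 0, bestTw = 0;
        for (int Th = 10; Th <= 100; Th += 10)  /* parameterization */
            for (int Tw = 10; Tw <= 100; Tw += 10)
                if (e_tiled(n, Th, Tw) < best) {
                    best = e_tiled(n, Th, Tw);
                    bestTh = Th;
                    bestTw = Tw;
                }
        if (bestTh != 0)                        /* precondition */
            printf("apply tiling with Th=%d, Tw=%d\n", bestTh, bestTw);
        else
            printf("leave the loop nest untouched\n");
    }

    int main(void) {
        guide(100);   /* small problem: tiling predicted to lose */
        guide(500);   /* large problem: tiling predicted to win */
        return 0;
    }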
Figure 1.3: Parameterizations: To guide the application of loop tiling on matrix multiply, we must choose how to apply it. A range of cache tile sizes yields nearly equal performance; outside this range, performance quickly degrades. We ran these experiments on a 300MHz Intel Pentium II.
1.2 Modeling Methodologies Yield Models
To automate this process of generating good models for realistic scenarios, we, as compiler designers, must design a modeling methodology, or strategy. A modeling strategy operates as shown in Figure 1.4. As we just learned, a model derives execution time. Similarly, a modeling strategy generates models. In it are the "smarts" about how factors, such as the syntax of a particular implementation and the properties of the architecture and data set, all combine, as visualized in Figure 1.5, to yield a chart/model such as Figure 1.1. These smarts encode what types of information must be gathered, and how to synthesize this information. (Henceforward, we use the terms modeling methodology, modeling strategy, and guidance system as synonyms. Thus, when we say a "methodology guides optimizations", we mean that the compiler in which the methodology is embedded does the guiding.)

Thesis Statement: A modeling methodology which exploits the modal behavior of performance can guide optimizations in a manner which is robust to syntax, architecture, and data set.
Figure 1.4: A modeling methodology takes properties of a program and its environment to produce a model of the performance of that program under those conditions. The model may itself directly represent execution time, or may be a more abstract model which in turn generates execution time.
Figure 1.5: Think of a modeling strategy as this triangle. It consists of: a decision as to what environmental properties are important (syntax, architecture, and data set), an abstract representation of performance (what lies within the triangle), and a mapping from the selected environmental properties to the performance abstraction.
What makes a good modeling strategy? Our thesis statement equates goodness with robustness. We say that a strategy is robust if the universe of situations that it guides well (in the sense of generating good models) is large. But whether a strategy is good or bad depends partly on the universe of situations we care about. We discuss our universe presently.
1.3 Modeling Methodology Design Difficulties
A modeling strategy specifies what information is necessary to make decisions, and how the information combines to form a decision. Two features of knowledge acquisition and synthesis complicate the decision-making: interactions and contingency.

Interaction: Information might come from many sources, as shown in Figure 1.5. We concentrate on three important determinants of performance. In part, your computation's performance depends on certain properties of the way it is written, its syntax. It will run faster or slower depending on how many times it reuses each data item, and whether it accesses memory directly or indirectly. In part, your computation will run faster or slower depending on properties of the architecture it runs on. As the size and latency of caches vary, and as the number of execution units and the processor clock speed vary, so might the computation's performance. And finally, your computation's performance depends on properties of the data set it receives as input. If I run it on a small problem size versus a large problem size, and if the data is dense versus sparse, your computation will run faster or slower. The optimization process becomes more difficult when, due to disparity of information, the multiple sources of information combine in some non-obvious, or essentially irreproducible, ways. For example, in Chapter 3, we provide a modeling strategy for managing interactions between cache and TLB.

Contingency: Each of these disparate concerns (syntax, architecture, and data set) poses its own contingency. Syntax poses a problem of non-analyzability. For example, the data dependences due to A[B[i]] (an indirect memory reference) cannot be determined by static analysis alone. Architecture poses a problem of a seemingly arbitrary relation between its properties and performance. Data set shares both problems: on the one hand, we cannot analyze properties of data which does not yet exist; and yet, once it does exist, we cannot easily predict how its properties, combined with architectural properties, yield an execution time.
1.4 The Domain of this Thesis
The usefulness of one modeling strategy versus another depends primarily on what the generated models will be used for: which computations they are intended to optimize, and on which architectures those computations will run.
1.4.1 Computational Corpus
This thesis is primarily targeted at loop-bound computations on which conventional modeling strategies stumble due to interactions or contingencies (stumble due to "messiness", if you will). As a specific example, we were motivated by computations involving non-affine array references. Consider the following simple loop nest:

    do i = 1, N
      A(B(i))++
    end

The array reference in this loop nest, A(B(i)), has an index expression which is not an affine function of the nest's induction variable. Other codes which fall into this class are sparse matrix computations, including conjugate gradient, and, more generally, any computation which involves indirect memory references, such as an integer sort.
Non-affine index expressions pose a problem for existing transformation and modeling strategies, due to syntax contingencies: transformations such as loop tiling, and modeling strategies such as the one provided in Chapter 3, simply do not apply in this case. In Chapter 4, we provide the mechanisms for a new transformation which localizes these non-affine array references. In Chapter 6, we provide a modeling strategy which applies to non-affine array references. As we will discuss in Chapter 10, syntactic "messiness" is not a requirement for the techniques described in this thesis to prove useful. Because our strategy handles architectural and data set contingencies, it can also potentially benefit non-messy syntactic situations.
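To give a flavor of the localizing transformation before Chapter 4 develops it properly, the following C sketch reorders the iterations of the loop above so that iterations touching nearby regions of A run together. This is a schematic of the idea only, not the mechanisms of Chapter 4; the bucket count is an illustrative choice, and for this particular loop the reordering is legal because the increments commute.

    #include <stdlib.h>

    enum { NBUCKETS = 64 };                 /* illustrative choice */

    /* Original loop: for (i = 0; i < n; i++) A[B[i]]++;
       Here, iteration i is assigned to a bucket by which slice of A it
       touches; the permuted order then visits A slice by slice.
       Assumes sizeA is a multiple of NBUCKETS. */
    void bucketTiled(int *A, const int *B, int n, int sizeA) {
        int slice = sizeA / NBUCKETS;
        int start[NBUCKETS + 1] = {0};
        int *perm = malloc(n * sizeof *perm);

        for (int i = 0; i < n; i++)         /* histogram the iterations */
            start[B[i] / slice + 1]++;
        for (int b = 0; b < NBUCKETS; b++)  /* prefix sums: bucket starts */
            start[b + 1] += start[b];
        for (int i = 0; i < n; i++)         /* scatter i into its bucket */
            perm[start[B[i] / slice]++] = i;

        for (int j = 0; j < n; j++)         /* execute, bucket by bucket */
            A[B[perm[j]]]++;

        free(perm);
    }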
1.4.2 Architectural Corpus
We study hierarchical systems. A hierarchical system recursively synthesizes components into tree-like structures. For example, a contemporary processor consists of a cache, layered on top of store buffers, which are again layered on top of registers and execution units. A node typically composes one or more such processors sharing a larger cache. A parallel machine composes many such nodes, possibly with caches intervening at each level of composition. Figure 1.6 shows one possible hierarchical organization.

Due to constraints of economics, technology, and politics, system designers often design hierarchical systems. For, in the absence of any such constraints, we would design flat systems: those potentially with branching, but without nesting. It is precisely when introducing a constraint that hierarchical systems become advantageous. Each level of nesting in Figure 1.6 matches up with some constraint. Processor architects incorporate on-chip, level 1 (L1), caches because of limited die space, and because register file latency and expense grow with increased register file size. They add level two caches due to similar constraints on the first-level cache. Sometimes a program runs on multiple nodes. The need for parallelization may result from time constraints (wanting an answer fast), or space constraints (requiring too much storage to fit into a single node's main memory). In either case, a parallel program must fetch data from other nodes, across a network.
Figure 1.6: A hierarchical system consisting of three levels of parallelism (branching) and five levels of locality (nesting). The example system shown in this figure has two nodes, each node with two processors, each processor with two execution units (E1 and E2) and a first-level cache (L1). Two processors share a second-level cache (L2) and main memory. The nodes are connected via some interconnection network.
1.5 Modeling Strategies
Consider three modeling strategies: purely static, purely experimental, and one which combines aspects of both.
1.5.1 Purely Static Approaches
First, the compiler writer could identify the important metrics of performance, and derive a purely static model to determine performance based on the values of these metrics. For example, consider the following simple computation, which loads the fifth entry of an array named B, and stores the result in a scalar named x:

    x = B[5]
Say our metric is exactly how many machine cycles we must wait until this computation completes. Further, assume we know that the programming language stores arrays in linear order; that the computation will run on a single-threaded processor, which has a unified instruction-data cache; that the particular address at which B[5] is stored does indeed reside in cache, and that this cache can service an element in 5 cycles; that the processor's instruction set architecture can implement this assignment statement in a single instruction; that the compiler chooses to store x in a register; that said instruction itself resides in cache, but was prefetched into the processor many cycles ago; that the arithmetic logic unit is pipelined with a front-to-end latency of 3 cycles; and finally that pipeline output can be latched to registers in the following cycle. Assuming we have provided a complete catalog of environmental information, we could express the performance of this computation with the number 9 (the 5-cycle cache access, plus the 3-cycle pipeline latency, plus 1 cycle to latch the result: 5 + 3 + 1 = 9).

This example suggests one possible model of performance: perform a similar analysis to count the number of cycles at each step during the execution of the computation. Counting cycles expresses the performance of each candidate environment in the language of natural numbers. However, it is difficult to envision this process of cycle-by-cycle analysis generalizing to even slightly more complicated examples.
Observation 1 A low-level modeling strategy may fail due to infeasibility of analysis.
And yet, the language of natural numbers seems reasonable enough. Its syntax and semantics are straightforward: express performance as natural numbers, and use the less-than relation to pick the best. Observation 1 posits infeasibility of analysis in this case, not the language of expression. Rather than counting cycles, we can count cache misses [39, 103, 98, 81, 41, 48] or predict execution time [55]. Purely static counting models typically limit the guidance system in the syntax it can handle (e.g., only affine index expressions, only when array dataflow analysis is sufficiently precise). For example, they can handle the kernel in Figure 1.7(a) (which contains only affine array references, and whose loop bounds are affine functions of the enclosing loops' induction variables), but not the one in Figure 1.7(b) (which has non-affine index expressions and loop bounds).
(a) affine only:

    do j = 1, M
      do k = 1, N
        Y(j) += A(j, k) * X(k)
      end
    end

(b) non-affine:

    do j = 1, M
      do k = R(j), R(j + 1)
        Y(j) += A(k) * X(C(k))
      end
    end
Figure 1.7: The loop nest in (a) contains only affine functions of induction variables, whereas the nest in (b) has non-affine loop bounds and array index expressions. Modeling methodologies which rely on counting the number of cache misses typically apply only to the nest in (a). The strategy developed in Chapter 5 and onwards applies to both.
Observation 2 A high-level performance model may fail due to lack of syntax robustness.
1.5.2 Purely Experimental Approaches
As a second way to synthesize information, the guidance system itself could profile the candidate implementations. This second solution has many desirable properties over the first, particularly in that it is the guidance system itself doing the work. However, while a profile-based approach works well when the codes and the data sets are known [15, 13, 109], it is not a reasonable approach for a robust, general-purpose guidance system. First, as with any profiling system, it specializes the generated code to the profiled data set. Second, this approach requires executing the entire program, which is not a viable strategy when considering a large number of implementations.
Figure 1.8: Here we show our implementation of Figure 1.4 using the modal model infrastructure designed in Chapter 6. See Figure 9.4 in Chapter 9 for a more detailed version of this figure.
1.5.3 Limited modeling, bounded experimentation
A third possibility, the strategy we use, combines limited static modeling with bounded profiling [43, 47]. Figure 1.8 visualizes the relationship between sources of knowledge and our guidance system. To guide optimizations with only limited static knowledge, and a bounded set of experiments, we introduce the concept of modal models in Chapter 6. Such a model specifies the important parameterized modes of memory reference patterns, and a modal representation of a code is a tree of instances of these modes. Figure 1.9 shows two examples, for a model with three modes, κ, σ, and ρ. κ denotes a repetition of the same addressing pattern, σ denotes shifting by a fixed stride, and ρ is used for non-affine or "random" behavior. In Chapter 7, we define the locality modes in greater detail.
(a)

    do i = 1, L
      do j = 1, M
        do k = 1, N
          A(k * T + j)
        end
      end
    end

mode tree: κ_L σ_(1,M) σ_(T,N)

(b)

    do i = 1, L
      do j = 1, M
        do k = 1, N
          A(B(i) + j)
        end
      end
    end

mode tree: ρ_(?,?,L) σ_(1,M) κ_N
Figure 1.9: Two example modal representations of loop nests. A modal representation is a tree of parameterized modes. In this case, the three modes in the model are κ, σ, and ρ. Each instance of a mode assigns values to the modal parameters; a question mark denotes a statically unknown value.

In Chapter 8, we show that this modal model of memory locality allows a guidance system to reason about performance, despite (currently) imperfect information about the indicators of performance. Reasoning without perfect information is a step towards the robustness we seek to achieve.

Limiting Analysis: In that chapter we also learn that a modal modeling strategy bounds syntactic analysis only when we can identify a reasonably sized set of distinguishing parameters, each of which fairly directly mirrors underlying syntax. In other words, on the one hand, the parameters cannot require complicated analysis to determine from a particular piece of code. But on the other, the parameters must meaningfully model the families of behavior we care about. Chapter 7 describes, for locality, the modes and each mode's parameters.

Bounding Experiments: We will learn, in Chapter 8, that a modal modeling strategy bounds experiments only when mode trees have a compositional property, which states: the performance of a tree in its entirety can be determined by composing the performance of individual elements.
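As a concrete, and purely illustrative, picture of this compositional property, a mode tree can be represented as a chain of parameterized mode instances, with the time of the whole tree computed by composing per-mode terms. The thesis's actual infrastructure is written in Scheme (Chapter 9); the multiplicative composition below is just one candidate composition function, and Chapter 6 examines experimentally when such compositions hold. The trip counts and per-iteration costs in main are placeholders; in the real system the costs would come from the mode-scoping experiments.

    #include <stdio.h>

    typedef enum { KAPPA, SIGMA, RHO } ModeKind;

    typedef struct Mode {
        ModeKind kind;        /* which behavioral mode */
        double trips;         /* trip count at this loop level */
        double perIter;       /* cost per iteration of this level */
        struct Mode *child;   /* the enclosed loop level, or NULL */
    } Mode;

    /* Compositional evaluation: the time of a tree is the cost of the
       root level multiplied by the time of the subtree it encloses. */
    double eval(const Mode *m) {
        return m ? m->trips * m->perIter * eval(m->child) : 1.0;
    }

    int main(void) {
        /* The tree of Figure 1.9(a): kappa_L sigma_(1,M) sigma_(T,N). */
        Mode sigmaTN = { SIGMA, 100.0, 1.7, NULL };
        Mode sigma1M = { SIGMA, 100.0, 1.0, &sigmaTN };
        Mode kappaL  = { KAPPA,  10.0, 0.9, &sigma1M };
        printf("predicted time: %g\n", eval(&kappaL));
        return 0;
    }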
1.6 Structure and Contributions of this Thesis
We have organized the body of this thesis into seven chapters. The seven chapters contribute to our argument as follows.

Model Cache-TLB Interactions: In Chapter 3, we describe a static, counting model for managing interactions due to cache modules which don't satisfy the nesting property, such as a translation lookaside buffer (TLB).

A Locality Optimization for Non-Affine Array References: Next, Chapter 4 presents the design of a new locality optimization, called bucket tiling. Bucket tiling targets non-affine array references. Most existing static-counting models, such as the one presented in Chapter 3, do not account for this scenario. Therefore, we consider bucket tiling a principal example of the need for modal models.

Observe Modal Phenomenon: Next, in Chapter 5, we introduce modal aspects of program behavior.

Introduce Modal Models: To account for modal behavior, Chapter 6 introduces the modal model abstraction.

Apply the Modal Model Abstraction to Locality: In Chapter 7, we develop a modal model for memory reference locality.

Design Modal Guidance: Our guidance system separates the concerns of mode specification from guidance driven by that specification. From these mode specifications, our system first transforms a program into a mode tree representation. Next, it applies program transformations, such as tiling (which are specified as a mapping from one mode tree to another). To determine the interaction between implementation choices and the target platform, the system runs, once for each platform, a small set of runtime experiments for each mode. It then uses the output of these experiments to evaluate the performance of a mode tree. Finally, it generates a cost function based on the relative goodness of mode trees.

Implement Bucket Tiling, Modal Models, and Modal Guidance: Finally, in Chapter 9, we describe the implementations underlying this thesis.

Following the main body of our argument, we discuss future directions for this work in Chapter 10. Finally, Chapter 11 presents our conclusions.
Chapter 2

Related Work

Throughout this thesis, we discuss related work in the appropriate section. In this chapter, we additionally present a more structured view of related work in the general area of performance modeling, and of handling interactions and contingencies. We summarize these works into a number of categories below: analyzing static cost, combining static and dynamic information, adapting optimizations to dynamic information, generating new implementations based on dynamic information, experimentally discovering system characteristics, aspects of automation, related work using modal properties, and finally somewhat more tangential work.

Static cost: Many prior models for predicting memory hierarchy performance are static and quantitative. For example, Ghosh [48] presents a cost model to quantify the number of cache misses in loop nests. On kernels such as matrix multiply, their algorithm predicts nearly exactly how many cache misses the kernel endures. Others also provide quantitative models for predicting the number of cache misses [39, 26, 98, 81, 22, 41] or execution time [104, 103, 55, 17]. Ladner et al. present a probabilistic quantitative model [67].

Combined static-dynamic approaches: Based on user-specified performance templates, Brewer derives cost models based on profile feedback [15]. He uses these platform-specific cost models to guide program variant choice. The FFTW (Fastest Fourier Transform in the West) project optimizes FFTs with a combination of static
modeling (via dynamic programming) and experimentation to choose the FFT algorithm best suited for an architecture [44]. Gatlin and Carter introduce architecture cognizance, a technique which accounts for hard-to-model aspects of the architecture [47]. Lubeck et al. develop a hierarchical model which predicts the contribution of each level of the memory hierarchy to performance. Their technique relies on experiments to determine architectural features such as the ability to overlap cache misses with useful work [74].

Adaptive optimizations: Saavedra and Park [94] and Diniz and Rinard [34] study adapting the behavior of programs to knowledge discovered while the program is running.

Program specialization: Another way to adapt program behavior to dynamic information is to generate new implementations on the fly. Program specialization ranges from its most abstract beginnings with Ershov in 1977 [37] to a version of C, called 'C, developed recently by Engler et al. [36], and annotations to C by Consel et al. [28], Leone et al. [71], and Grant et al. [50]. Voss and Eigenmann describe ADAPT [107], a system which provides an interface and supporting mechanisms for dynamic adaptation. Based on an optimization specification, their system finalizes the optimization at runtime; it alternatively compiles new variants or dispatches among existing variants. ADAPT chooses the fastest variant experimentally, trying each in successive executions of the associated code. The interface consists of five criteria associated with each optimization; these criteria direct the adaptation of the runtime system. ADAPT provides only very coarse-grained guidance: either do or do not apply an optimization, and do or do not attempt to adapt. This implementation decision forces ADAPT to run the entirety of every variant. Furthermore, ADAPT does not currently adapt to data-set properties other than size.

System scoping: McCalpin introduces the STREAM benchmark, which discovers the machine balance of an architecture via experimentation [77]. In addition
to bandwidth, McVoy and Staelin's lmbench determines a set of system characteristics, such as process creation costs and context switching overhead [78]. Saavedra and Smith use microbenchmarks to experimentally determine aspects of the system [95].

Automation: Collberg develops a strategy for automatically generating a compiler back-end [27]. His system discovers many aspects of the underlying system, such as instruction set syntax and semantics, and instruction timings, via experimentation. With this knowledge, his system generates a code generator specialized to that system. Hoover and Zadeck design a system which, based on specifications of an architecture, automatically generates a compiler back-end tuned for that architecture [57]. Tjiang and Hennessy's tool, called Sharlit, automatically generates dataflow optimizers based on specifications [105].

Modal Work: Steffen uses modal specifications to drive dataflow analysis [102]. Davis, Pfenning, Wickline, and Lee develop an extension to ML, called Modal-ML [31, 111]. Modal-ML permits the compiler to reason about staged computations.

Specifications: Several researchers have studied the use of specifications to guide program optimization. In his thesis, Vandevoorde studies using specifications to guide, in part, program specialization [106]. His system uses an automatic theorem prover which, given the specifications, can reason about what specializations are legal. He concentrates on such precondition issues. Guyer and Lin use specifications of the semantics of separately compiled library routines to enhance program optimization across library boundaries [52]; they allow for annotations ranging from dataflow analysis (e.g., which global variables each routine uses) to performance annotations (e.g., how a routine traverses its arrays).

Interactions and Staging: We discuss the implications of our work on optimization staging in future work, Chapter 10. Whitfield and Soffa [110], Pollock and Soffa [86], Berson, Gupta, and Soffa [11, 12], and Cho et al. [23] have also studied this problem: how do the stages of a compiler (such as high-level optimizations versus low-level optimizations, or register allocation versus instruction scheduling) interact, and how can a compiler manage those interactions to derive better global
decisions?

Coordination models: Arbab and Papadopoulos nicely summarize work on coordination models in [85, 8]. A coordination model provides a language, such as Arbab's MANIFOLD, for expressing how a diverse set of system components interact to form the whole.
Chapter 3

A Static-Counting Model for TLB and Cache

If computers had only a single level of memory or parallelism, relatively simple cost functions could successfully guide optimization decisions. Such one-level cost functions commonly increase locality [112, 18, 20, 19] and exploit parallelism [5, 113, 38, 63, 70]. For instance, a one-level cost function for a tiling might reflect only whether the tile fits in cache (perhaps by considering cache size, line size, and cache associativity [26]), but not the effect of the tiling on instruction-level parallelism. However, recent trends towards greater architectural complexity have increased the amount of information available to an optimizing compiler. Many machines now have multiple levels of memory and of parallelism, arranged hierarchically. Memory appears as registers, several levels of cache, and a translation look-aside buffer. Parallelism may occur as multiple functional units, multiple processors in a node sharing memory, multiple nodes sharing distributed memory, and so forth. The multi-level aspect of memory and parallelism complicates optimizations. While a one-level cost function suffices to yield good performance at a single level of the memory hierarchy, it may not be globally optimal. In this chapter, we strive to show that, as the amount of available information increases, the cost functions which guide tiling optimization choices must be similarly expanded.
Researchers have developed a number of solutions to this information-expansion problem. Many have simply ignored the multi-level information, relying instead on one-level cost functions [112, 26, 99]. Others rephrase program optimization as a search problem and invent heuristics to prune the search [46, 114]. We explore a different solution: first formulate the system to be optimized by quantifying both the effects of tiling choices and the interactions between such choices in a single formula, and then minimize this formula. If the minimization is closed-form, this technique uses all the information necessary to achieve good performance, without the non-optimality of one-level cost functions or the expense of searching. Whether or not the minimization is closed-form, when tiling for a hierarchy of memory and parallelism this technique derives a multi-level cost function.
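As a toy illustration of the closed-form case (ours, not one of this chapter's formulas), suppose a one-parameter tiling incurred misses that shrink with tile size H and an overhead that grows with it, with hypothetical constants a and b:

\[
E(H) = \frac{a}{H} + bH,
\qquad
\frac{dE}{dH} = -\frac{a}{H^2} + b = 0
\quad\Longrightarrow\quad
H^{*} = \sqrt{a/b}.
\]

When no such closed form exists, the same formula can still be minimized numerically over the feasible tile sizes.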
3.1 Terminology and Assumptions
Tiling [45, 58, 115, 116, 9, 69, 90, 112, 113, 64] can improve both data locality and parallel execution time. A tiling redefines the order of execution of the points within an ISG by specifying four pieces of information: the atomic units of execution, how to span each unit (in other words, a schedule for the points that comprise the unit), how these units are scheduled to span the iteration space, and the mapping of units to processing elements. Each unit, or tile, is a subset of the iteration space, typically of one size and shape (except for tiles that intersect ISG boundaries). We refer to the method of spanning a tile as the internal schedule and the method of spanning the iteration space as the external schedule. Increasing the depth of a program's loop nest, with proper indices and bounds, implements a tiling. In this chapter, we consider only internal and external schedules that are given by a total order of the iteration-space axes. We represent each schedule by an ordered list of the index variables indicating loop order, from outermost to innermost. For example, Figure 3.3 shows a scheduled tiling for matrix multiplication with the (i, k) internal schedule and the (k, i, j) external schedule.
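As a concrete (and purely illustrative) rendering of these four pieces of information, the sketch below tiles a simple two-dimensional loop nest in C; the bounds, the tile dimensions H and W, and the loop body are placeholder choices, and the mapping of units to processing elements is trivially sequential here.

/* Original nest:  for i in 0..N-1, for j in 0..N-1: body(i, j) */
#define N 1024
#define H 16   /* tile height: placeholder */
#define W 64   /* tile width:  placeholder */

void tiled_nest(double a[N][N])
{
    /* External schedule: the order (ii, jj) in which tiles span
     * the iteration space. */
    for (int ii = 0; ii < N; ii += H)
        for (int jj = 0; jj < N; jj += W)
            /* Atomic unit: one H x W tile.
             * Internal schedule: the order (i, j) spanning the tile. */
            for (int i = ii; i < ii + H && i < N; i++)
                for (int j = jj; j < jj + W && j < N; j++)
                    a[i][j] += 1.0;   /* placeholder body */
}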
Figure 3.1: An architectural scenario with deleterious tiling interactions: one memory module doesn't "fit in" the other. A cache has short, but numerous, lines, whereas a TLB has long, but scarce, lines.

A module is an architectural component, such as a cache, registers, a disk, or a network interconnect. The memory in a module is organized in blocks, the unit of transfer between a module and the next larger module. We denote modules by their first letter ($r$ for register, $c$ for cache, $t$ for TLB, $m$ for main memory, etc.). $S_k$ denotes the block size of module $k$. For example, the block size $S_c$ is the size of a cache line. We express block size in units of problem elements (for example, elements of the matrices in matrix multiply), rather than bytes. The block count, $C_k$, is the number of blocks contained in module $k$. We assume that both the TLB and the cache use a least-recently-used (LRU) replacement policy.
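As an illustration of this terminology (ours, not code from the dissertation), one might represent a module and its derived quantities as follows. The example parameters are hypothetical, chosen only to echo the TLB/cache contrast of Figure 3.1, and 8-byte elements are assumed when converting bytes to problem elements.

#include <stdio.h>

/* A memory module k, described by its block size S_k and block
 * count C_k. Block size is measured in problem elements, not bytes. */
struct module {
    const char *name;
    long S;   /* block size  (elements per block) */
    long C;   /* block count (blocks in the module) */
};

/* Capacity of a module: C_k * S_k elements. */
static long capacity(const struct module *m) { return m->C * m->S; }

/* "Tall, thin" in the sense of the next section: few blocks, each long. */
static int tall_thin(const struct module *m) { return m->C < m->S; }

int main(void)
{
    /* Hypothetical machine: a TLB with 64 entries mapping 4 kB pages,
     * and a cache with 512 lines of 32 bytes (8-byte elements assumed). */
    struct module tlb   = { "TLB",   4096 / 8, 64  };
    struct module cache = { "cache", 32 / 8,   512 };

    printf("%s: capacity %ld elements, tall-thin=%d\n",
           tlb.name, capacity(&tlb), tall_thin(&tlb));
    printf("%s: capacity %ld elements, tall-thin=%d\n",
           cache.name, capacity(&cache), tall_thin(&cache));
    return 0;
}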
3.2 Tall, Thin Modules
Consider the architectural scenario which consists of two modules, $a$ and $b$, where $a$ has fewer blocks, $C_a < C_b$, but greater capacity, $C_a S_a > C_b S_b$. We are particularly interested in the case that $a$ is a tall, thin module, i.e., when $C_a \ll S_a$, as in Figure 3.1. (We visualize modules as rectangles, with height corresponding to the block size and width corresponding to the block count. In this visualization, the scenario of this chapter is a large module that is too narrow to contain a smaller module.)
                             --------- TLB ---------   -------- L1 Cache --------
    Processor                entries    kB per line    entries    bytes per line
    IBM Power2                 512          4            1024          256
    Hewlett-Packard PA-8000     96          4+          16384+          16+
    MIPS R10000                 64          4+           1024           32
    Sun UltraSPARC              64          8+            512           32
    Intel Pentium II            64          4             512           32
    Compaq 21064a               64          8             512           32
    Compaq 21164                64          8             256           32
    Intel Pentium Pro           64          4             256           32
Table 3.1: TLB and L1 data cache characteristics for various workstation processors. Some processors can, at system boot-up time, choose to use one of a number of configurations. For example, the MIPS R10000 processor can use any power-of-two page size from 4 kilobytes up to 64 megabytes.

Tall, thin modules occur in virtually every contemporary workstation in the form of a translation look-aside buffer, as shown in Table 3.1. For the remainder of this chapter, we will use the example of TLB and cache. (We consider a data item to be "in the TLB" if the virtual-to-physical address translation for the item's virtual page is cached in the TLB.)

In this chapter, we demonstrate the improvement gained by optimizing for TLB and cache simultaneously. We first quantitatively derive the optimal tiling for the TLB, using a TLB-specific cost function and ignoring the cache completely. This tiling will lead to cache thrashing, so we then perform an additional level of tiling for the cache. Next we pursue the reverse strategy: first tile for the cache, show that there is TLB thrashing, and then tile the cache-tiled code for the TLB. Finally, we show that the best strategy balances the TLB miss rate and miss cost against those of the cache.

We explore the benefit of multi-level optimization using matrix multiply. We chose to look at matrix multiply for two reasons. First, much work has been put into its optimization [13], but none has explored optimizing for both cache and TLB simultaneously.
Second, the relatively simple data access patterns of matrix multiply allow straightforward locality analysis. However, this latter aspect implies that matrix multiply is a compute-bound application. Being compute-bound, all tiling strategies which remove TLB and cache thrashing will have similar execution times. Our goal, however, is to provide an analytical framework for simultaneous optimization, rather than to optimize any one application. Therefore, we nonetheless continue to analyze matrix multiply (in this chapter) due to its simplicity.

To focus on the interaction of TLB and cache tiling choices (as opposed to all the other choices affecting performance), the optimized versions use the external schedule (k, i, j) and the internal schedule (i, k). Figure 3.2 gives the code and defines our naming convention: tiles stacked along the j dimension are sticks, and sticks stacked along the i dimension are slabs; Figure 3.3 illustrates the tiling. The only difference between optimized versions will be the choice of the tile size, given by H and W.

    for kk = 1 to min(M,N) by W
      for ii = 1 to N by H
        for j = 1 to M
          for i = ii to min(ii+H-1, N)
            for k = kk to min(kk+W-1, M, N)
              C(i,j) += A(i,k) * B(k,j)

Figure 3.2: A tiled loop nest for matrix multiply. We have highlighted and named certain portions of the loop computation. For example, the most darkly shaded region, the innermost two loops, we name a tile. A collection of tiles is a stick, and a collection of sticks is a slab. Figure 3.3 visualizes the behavior of this code and substantiates this choice of names.
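For readers who wish to run the tiling of Figure 3.2, the following is a sketch in C that transliterates the figure's 1-based pseudocode into 0-based loops, specialized to square N x N matrices as in Figure 3.3. The matrix size and the tile dimensions H and W are placeholders; choosing them is the subject of the rest of this chapter.

#define N 512   /* matrices are N x N, as in Figure 3.3 */
#define H 16    /* tile height: placeholder */
#define W 64    /* tile width:  placeholder */

static double A[N][N], B[N][N], C[N][N];

/* Tiled matrix multiply after Figure 3.2: kk walks slabs, ii walks
 * sticks within a slab, j walks tiles within a stick, and the inner
 * (i, k) pair spans one H x W tile. */
void matmul_tiled(void)
{
    for (int kk = 0; kk < N; kk += W)
        for (int ii = 0; ii < N; ii += H)
            for (int j = 0; j < N; j++)
                for (int i = ii; i < ii + H && i < N; i++)
                    for (int k = kk; k < kk + W && k < N; k++)
                        C[i][j] += A[i][k] * B[k][j];
}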
3.3 Intersection Lemma
Figure 3.3: We assume the scheduled tiling for matrix multiply given in this figure. This tiling uses the (i, k) internal schedule and the (k, i, j) external schedule. The intensity of the shading measures nesting depth; regions with the darkest shading correspond to inner loops. The picture remains the same whether we are tiling for TLB or cache, but the height and width (H and W) change.

To determine the optimal tile height and width, we use a somewhat simplistic execution-time cost function of miss cost times miss count. Let $i_k$ be the idle time due to a miss at level $k$ and $M_k$ be the miss count at level $k$. Using these, we define two cost functions: a single-level cost function $E_k^{\mathrm{single}}$ and a multi-level cost function $E^{\mathrm{multi}}$:

$$E_k^{\mathrm{single}} = i_k \, M_k(H, W)$$
$$E^{\mathrm{multi}} = i_t \, M_t(H, W) + i_c \, M_c(H, W)$$
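To make the two cost functions concrete, here is a small sketch (ours, not the dissertation's implementation) that minimizes $E^{\mathrm{multi}}$ by brute force over candidate tile sizes. The miss penalties and the miss-count models Mt and Mc are placeholders; the real formulas follow from the Intersection Lemma derived below, and a real search would also enforce the capacity constraints discussed next.

#include <math.h>
#include <stdio.h>

/* Placeholder miss-count models; the real M_t and M_c follow from
 * the Intersection Lemma derived below. */
static double Mt(int H, int W) { return 1.0e6 / (H * (double)W) + 40.0 * W; }
static double Mc(int H, int W) { return 1.0e6 / (H * (double)W) + 15.0 * H; }

int main(void)
{
    const double it = 50.0, ic = 10.0;  /* hypothetical miss penalties (cycles) */
    double best = HUGE_VAL;
    int bestH = 0, bestW = 0;

    /* Brute-force minimization of E_multi over candidate tile sizes. */
    for (int H = 1; H <= 256; H++)
        for (int W = 1; W <= 256; W++) {
            double e = it * Mt(H, W) + ic * Mc(H, W);
            if (e < best) { best = e; bestH = H; bestW = W; }
        }
    printf("H=%d W=%d E_multi=%.1f\n", bestH, bestW, best);
    return 0;
}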
Notice that minimizing the single-level cost function is equivalent to minimizing the miss count. In contrast, minimizing the multi-level cost function balances the relative costs of TLB and cache misses. Thus, to determine the optimal tile height and width, we need two formulas. The first, $M_k$, predicts the number of mandatory misses on A, B, and C in a module $k$. The second predicts the memory requirements of the tile; the optimal tiling must satisfy the capacity requirements of both TLB and cache. As we assume a fully associative cache, we do not factor in self- and cross-interference.

To derive these two formulas, we prove a lemma which closely bounds the expected number $B_k(H, W)$ of blocks that must be brought into module $k$ to hold an $H \times W$ submatrix. We use this lemma to generate both the miss-count formula and the capacity requirements of a tile. First, a formal definition:

Definition 1. Suppose A is an $N \times N$ matrix (treating non-square matrices requires nominal modifications to the formulas in this chapter) partitioned into $H \times W$ submatrices, and $k$ is a module. Let $B_k(H, W)$ be the expected number of distinct (full or partial) blocks of module $k$ included in a submatrix, where the average is over all submatrices in the partitioning.

The exact value of $B_k(H, W)$ depends on the alignment of the matrix columns in module $k$; for instance, even if $H = 2$, the columns of A might be allocated in a way that each column spans two blocks of $k$. Rather than making assumptions about the alignment, we derive an upper bound on $B_k(H, W)$.
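Before stating the bound formally, the alignment dependence of $B_k(H, W)$ can be checked empirically. The sketch below (ours, for illustration only) averages, over submatrices and over a range of base alignments, the number of distinct size-S blocks intersected by the $H \times W$ submatrices of a column-major $N \times N$ matrix; the parameters in main are hypothetical.

#include <stdio.h>

/* Element (r, c) of a column-major N x N matrix lives at address
 * base + c*N + r, so a column segment of rows r0..r0+H-1 touches the
 * blocks lo..hi below. Summing per-column counts is exact provided
 * S <= N - H + 1, so that distinct columns never share a block. */
static double avg_blocks(int N, int H, int W, int S)
{
    long total = 0, count = 0;
    for (int base = 0; base < S; base++)              /* try S alignments */
        for (int r0 = 0; r0 + H <= N; r0 += H)        /* submatrix rows   */
            for (int c0 = 0; c0 + W <= N; c0 += W) {  /* submatrix cols   */
                long blocks = 0;
                for (int c = c0; c < c0 + W; c++) {
                    long lo = (base + (long)c * N + r0) / S;
                    long hi = (base + (long)c * N + r0 + H - 1) / S;
                    blocks += hi - lo + 1;   /* distinct blocks in this column */
                }
                total += blocks;
                count++;
            }
    return (double)total / count;
}

int main(void)
{
    /* Hypothetical parameters: 512x512 matrix, 16x64 tiles, S = 32. */
    printf("B ~ %.2f blocks per submatrix\n", avg_blocks(512, 16, 64, 32));
    return 0;
}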
Lemma 1. Let $k$ be a module and A be an $N \times N$ matrix stored, without loss of generality, in column-major order; that is, each column of A is allocated in contiguous storage. Suppose that A is partitioned into submatrices of size $H \times W$. Then, $B_k(H, W)$