FINDING REPRESENTATIVE WORKLOADS FOR COMPUTER SYSTEM DESIGN

DISSERTATION submitted for the degree of doctor at the Technische Universiteit Delft, on the authority of the Rector Magnificus, Prof. dr. ir. J.T. Fokkema, chairman of the Board for Doctorates, to be defended in public on Tuesday, 18 December 2007 at 17:30

by Jan Lodewijk Bonebakker, astronomer, born in Bergisch Gladbach (Germany).

This dissertation has been approved by the promotors: Prof. dr. H.G. Sol and Prof. dr. ir. A. Verbraeck.

Composition of the doctoral committee:
Rector Magnificus, chairman
Prof. dr. H.G. Sol, Technische Universiteit Delft, promotor
Prof. dr. ir. A. Verbraeck, Technische Universiteit Delft, promotor
Prof. dr. D.J. Lilja, University of Minnesota, USA
Prof. dr. ir. K. De Bosschere, Universiteit Gent, Belgium
Prof. dr. P.T. de Zeeuw, Universiteit Leiden
Prof. dr. ir. H.J. Sips, Technische Universiteit Delft
Prof. dr. ir. W.G. Vree, Technische Universiteit Delft

Copyright © 2007 by Lodewijk Bonebakker. All rights reserved worldwide. No part of this thesis may be copied or sold without written permission of the author. Trademark notice: product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe. The data and the data collection tools presented in this thesis are the intellectual property of Sun Microsystems, Inc. The data are presented solely for the purpose of this thesis and may not be used without written permission from both the author and Sun Microsystems, Inc. ISBN/EAN: 978-90-5638-187-5. Cover: Lucas Bonebakker. Printing: Grafisch Bedrijf Ponsen & Looijen b.v., Wageningen, The Netherlands, http://www.p-l.nl

To ir. Jan Lodewijk Bonebakker (∗1907 – †2000)

CONTENTS

1. The importance of workloads in computer system design
   1.1 Introduction
   1.2 Designing computer systems and processors
       1.2.1 The iterative design process
       1.2.2 Workload characterization
       1.2.3 Performance evaluation
       1.2.4 Benchmarks
   1.3 Overview of processor and computer system design
   1.4 Processor design cases
       1.4.1 Intel Pentium 4
       1.4.2 Sun Microsystems UltraSPARC III
       1.4.3 Intel Itanium I
   1.5 Computer system design considerations
   1.6 Workload and benchmark considerations
   1.7 Increasing complexity and increasing diversity
   1.8 Representing real workload characteristics in the design process
   1.9 Research questions
   1.10 Research approach
       1.10.1 Philosophy
       1.10.2 Strategy
       1.10.3 Instruments
   1.11 Research outline

Part I: Context of representative workload selection

2. Current approaches for workload selection in processor and computer system design
   2.1 Selecting workloads for commercial computer system design
   2.2 Standardized performance evaluation
       2.2.1 SPEC CPU 2000
       2.2.2 TPC-C
       2.2.3 Benchmark context
   2.3 Selecting the correct benchmarks for the design problem
   2.4 Requirements for optimal benchmark sets
   2.5 Finding benchmarks in practice
       2.5.1 Predicting application performance
       2.5.2 Representing application characteristics
   2.6 Reducing redundancy in benchmark sets
       2.6.1 Reducing simulation time as motivation
       2.6.2 Approaches for removing benchmark set redundancy
       2.6.3 Evaluating benchmark quality
       2.6.4 Summarizing sub-setting techniques
   2.7 Summary of benchmark selection

3. Towards an unbiased approach for selecting representative workloads
   3.1 Approach blueprint
   3.2 Sources of representative metrics
   3.3 Requirements
   3.4 Reflecting upon the research hypotheses

4. Constructing the approach
   4.1 Formulating our implementation approach
       4.1.1 Evaluating the quality of the representative set
       4.1.2 Evaluating workload clustering
       4.1.3 Workload clustering
       4.1.4 Dimensionality reduction
       4.1.5 Normalization of metric data
       4.1.6 Workload characterization
   4.2 Collecting workload characterization data
   4.3 Reducing workload characterization data
   4.4 Selecting representative metrics for spanning the workload space
   4.5 Reducing workload space dimensionality
       4.5.1 Normalization
       4.5.2 Removing correlated metrics
       4.5.3 Principal Component Analysis
       4.5.4 Independent Component Analysis
       4.5.5 Generalized additive models
       4.5.6 Other dimensionality reduction techniques
   4.6 Partitioning the workload space with clustering algorithms
       4.6.1 K-means clustering
       4.6.2 K-means clustering and the Bayesian Information Criterion
       4.6.3 Model based clustering
       4.6.4 MCLUST for model based cluster analysis
   4.7 Comparing clustering results
   4.8 Selecting the representative workloads
   4.9 Quantifying metric sampling error on computer system workloads

5. Testing the methodology on a benchmark set
   5.1 SPEC CPU 2000 similarity in simulation
   5.2 Characterizing SPEC CPU 2000 using processor hardware counters
       5.2.1 Collecting component benchmark hardware counter data
       5.2.2 Reduction, PCA and clustering
       5.2.3 Clustering results
   5.3 Comparing clustering
       5.3.1 Similarity score
       5.3.2 Monte-Carlo simulation
       5.3.3 Similarity score and probability results
   5.4 On the µA-dependent and µA-independent debate
   5.5 Reflecting on the differences in similarity

Part II: Selecting a representative set from collected workloads

6. Collecting and analyzing workload characterization data
   6.1 Workload characterization
       6.1.1 WCSTAT
       6.1.2 Measurement impact
       6.1.3 Origins of the workload data
   6.2 Data cleaning
   6.3 Workload validation
       6.3.1 Data errors
       6.3.2 Workload error checking
       6.3.3 Workload stability analysis
       6.3.4 Accepted workload list
       6.3.5 Reflecting on workload validation
   6.4 Data reduction
   6.5 Metric selection
   6.6 System standardization
       6.6.1 The impact of system utilization
       6.6.2 Compensating for different system configurations: system normalization
   6.7 Workload categorization
   6.8 Reflection on the methodology
       6.8.1 Bias in system metric selection
       6.8.2 Data collection and reduction efficiency
       6.8.3 Describing the final workload characterization data set

7. Grouping together similar workloads
   7.1 From data to clusters
   7.2 Metric normalization
   7.3 Dimensionality reduction
   7.4 Clustering
   7.5 Cluster comparison
       7.5.1 Measuring cluster locality
       7.5.2 Comparing clusterings
   7.6 Describing workload similarity and clustering in the dataset
       7.6.1 Locality in the workload set
       7.6.2 Locality in the clustering solution
       7.6.3 Differences between the clusters
       7.6.4 Reflecting on workload similarity and clustering
   7.7 Does a different dataset change the workload
   7.8 A workload is the combination of application and dataset
   7.9 Non-determinism as source of variation
   7.10 Method stability and practical considerations

8. Finding representative workloads in the measured dataset
   8.1 Requirements for representative workloads
   8.2 Guiding workload selection
   8.3 Selecting the candidate workloads
   8.4 Representative workload selection
   8.5 Validating the representative set
       8.5.1 Insufficient size
       8.5.2 Excessive redundancy
       8.5.3 Workload outliers
       8.5.4 Non-uniform distribution
       8.5.5 Quantitative validation of representativeness
   8.6 Looking back at representative workload selection
   8.7 Addressing common knowledge with our dataset
       8.7.1 Representativeness of SPEC CPU 2000
       8.7.2 Representativeness of all benchmarks
       8.7.3 Diversity in database workloads
   8.8 Summary

Part III: Evaluating representative workload selection

9. Evaluating workload similarity
   9.1 Evaluation against requirements
       9.1.1 Requirement 1: Ease of measurement
       9.1.2 Requirement 2: Efficient processing
       9.1.3 Requirement 3: Cheap and efficient data collection
       9.1.4 Requirement 4: Standardized data collection
       9.1.5 Requirement 5: Non-interfering data collection
       9.1.6 Requirement 6: Unbiased workload metric selection
       9.1.7 Requirement 7: Quantitatively unbiased workload similarity
       9.1.8 Requirement 8: Expedient determination of workload similarity
       9.1.9 Summary of evaluation against the requirements
   9.2 Evaluating the prescriptive model consequences
   9.3 Evaluation against the hypotheses
   9.4 Evaluation against the research questions
   9.5 Reflecting on our approach
       9.5.1 Using computer system metrics and hardware counters
       9.5.2 Using workload characterization data
       9.5.3 Grouping together similar workloads
       9.5.4 Representative workload selection
   9.6 Future work and recommendations

Appendix

A. Tables
   A.1 Opteron performance counters
   A.2 UltraSPARC IIIi performance counters
   A.3 Workflow legend
   A.4 Workload clustering result
   A.5 Measured metrics

B. Additive Model for the Instruction Count
   B.1 Constructing the model
       B.1.1 Construction of B-splines
       B.1.2 Definition of the Distance Measure

1. THE IMPORTANCE OF WORKLOADS IN COMPUTER SYSTEM DESIGN

ABSTRACT

We place this thesis within the context of computer system and processor design at a commercial company. We present several design cases to highlight the relevance and urgency of improved workload characterization. Building on the context of these examples, we formulate our research questions and their philosophical context. We end with an outline of the thesis.


[Figure: The deductive-hypothetic research strategy mapped onto the thesis chapters - define the question (Chapter 1), gather information and resources (Chapter 2), form hypothesis (Chapter 3), prescriptive model (Chapter 4), early test (Chapter 5), collect data (Chapter 6), analyze data (Chapter 7), interpret data and formulate conclusions (Chapter 8), present conclusions (Chapter 9).]

1.1 Introduction

A computer system is the end result of a long, complex, multi-year, multi-dimensional development process. The resulting computer system represents the best possible solution to a design challenge given restrictions in time, technology and resources. Kunkel et al. (2000) explain that many design decisions for a computer system, and specifically for its processor, are made years before the computer system or the processor even exists. This thesis looks at the problem of providing accurate information to computer system and processor designers during the design and implementation stages. It develops and evaluates a methodology that enables processor and computer system designers to take information from current computer system usage into account when considering and evaluating new computer system and processor designs.

Incorrect information can lead to incorrect design trade-offs. By supplying and updating relevant information during the design and implementation stages, designers can keep their design current with the marketplace. The technological challenges involved in computer system design will likely increase the time between initial design specification and market release. This necessitates flexible approaches to computer system and processor design that can adapt to market changes. Computer system designers have no influence on the diversity of applications and usage of computer systems; they can only react to trends in the marketplace.

The contribution of this work is a clear methodology by which processor and computer system designers can use information on actual computer system usage during the design process. The expected benefit is that many of the trade-offs can be judged against usage characteristics present in the marketplace. This methodology helps prevent poor design choices that might not be discovered using the limited set of benchmarks common to most current design and evaluation studies.

This first chapter describes the design process and defines the terms workload and benchmark. It establishes why and how workloads and benchmarks are used in the computer system design process. We discuss three processor design cases that illustrate the need for better workload characterization information. The limitations of the design process are discussed, leading to the research questions. After the research questions, the underlying research philosophy, research approach and research instruments are introduced.

1.2 Designing computer systems and processors

The successful design of a complex system depends greatly upon the research and design process used. Systems engineering is concerned with the engineering of large-scale and complex systems (Sage, 1995). Computer systems belong to the most complex systems built by humankind (Heuring and Jordan, 1996), and systems engineering principles have long been associated with their architecture and design (Simpson, 1994; Rowe et al., 1996). The systems engineering discipline has an associated language, i.e., a collection of terms with defined meaning, used to express the nuances of system design and its approaches. Sage (1995) defines systems engineering as the art and science of producing a product, based on phased efforts, that satisfies user needs; the system is functional, reliable, trustworthy, of high quality, and has been developed within cost and time constraints through the use of an appropriate set of methods and tools. In this thesis we follow the terminology of systems engineering to describe the development process.

Modeling and simulation are the primary tools for performance analysis and prediction of proposed computer system designs (Austin et al., 2002; Yi et al., 2005). Simulators are the dominant tool for evaluating computer architecture, offering a balance of cost, timeliness and flexibility (Yi et al., 2006). Model complexity and slow execution speed necessitate that system models are developed hierarchically (Yi and Lilja, 2006). Representations of computer system behavior are extensively used to provide input to the models. The system design process uses iterative refinement of the model (Schaffer, 1996), at each step evaluating its principal parameters, e.g., cost, performance, functionality and physical constraints (Kunkel et al., 2000). In short, computer system design is the application of an iterative design process in a hierarchical modeling environment (Schaffer, 1996; Coe et al., 1998). Next, we expand on this iterative design process.

1.2.1 The iterative design process

The iterative design process uses the predictions and results from computer system models to determine the design changes and optimizations for the next model iteration. The modeling environment consists of performance and component models of the computer system at varying levels of detail. These models vary from very simple and rough performance estimates to highly complex and detailed simulations (Kunkel et al., 2000). The modeling hierarchy abstracts details away to component models lower in the hierarchy. Conversely, abstractions based on predictions from component models are used at the higher levels in the modeling hierarchy (Zurcher and Randell, 1968; Kumar and Davidson, 1980; Mudge et al., 1991; Stanley and Mudge, 1995; Kunkel et al., 2000; Hennessy and Patterson, 2003; Larman and Basili, 2003). Processor, memory system, or whole computer system designs are iteratively refined until the performance criteria have been optimized for the given constraints (Flynn, 1995; Schaffer, 1996; Coe et al., 1998). Although the actual process varies between companies and between the types of computer systems or components under design, there is a set of common steps (Zurcher and Randell, 1968; Kunkel et al., 2000; Larman and Basili, 2003), illustrated in Figure 1.1:

1. Project, evaluate and categorize the impact of technological developments: which techniques will likely be available when the computer system is ready for production? Each technology should be evaluated for risk and possible alternatives investigated.

2. Project dominant workloads: which workloads or applications are most likely to benefit from the intended architecture, or, which workloads are the target of the intended architecture? These workloads need to be thoroughly characterized and performance data collected.

3. Create a hierarchical, multi-level simulation environment: this environment will be the principal means of investigating design trade-offs and performance optimization.

4. Run an iterative incremental development cycle using the simulation environment to gradually increase the level of detail and complexity in the models to the point that the models present a fair representation of the desired computer system. The steps in the development cycle echo those of Boehm (1986): planning, analysis & design, deployment, testing and evaluation.

5. Review relative to goals: after several development cycles, the overall design effort is reviewed relative to the goals; if necessary, either the goals or the development effort is adapted.

6. In parallel with the iterative incremental development cycle: verify the proposed computer system logic.

[Fig. 1.1: Iterative design strategy - the iterative incremental development cycle (planning, requirements, analysis & design, implementation, testing/verification, evaluation) operates on models and prototypes within a hierarchical, multi-level simulation environment; it starts from initial planning, strategic workloads, technological developments and design parameters (functionality, performance, cost, physical requirements) and ends in the final design.]

A hierarchical modeling environment allows experts to work on optimizing those components they know best, without losing the ability to evaluate component contribution to the overall performance of the computer system. The projection of dominant workloads as listed in item 2 requires the ability to quantitatively evaluate many workloads and place them in the context of the marketplace (Kunkel et al., 2000). During the iterative design cycle, simulation is the instrument of choice. The time and cost involved in building computer system prototypes makes even the most complex simulation cost competitive (Kunkel et al., 2000; Eeckhout et al., 2002; Alameldeen et al., 2003). While the modeling environment with its simulators facilitates performance evaluation of future computer systems, it is essential to understand what that performance is relative to. As mentioned in item 2, workloads are the essential ingredient providing the performance context.

1.2.2 Workload characterization

Ferrari (1978) defines a workload as the work performed by a computer system over a period of time. All inputs into the computer system, e.g., all demands that require execution time, are part of the workload definition. During execution a workload utilizes the resources of the computer system: processor, processor cache, system bus, memory, disks, etc. The resource utilization of a workload depends on the design of the underlying computer system and the implementation of the workload. Different workloads can require different computer system resources (Ferrari, 1972; Agrawala and Mohr, 1975; Agrawala et al., 1976; Menascé, 2003). Workload characterization describes the resource utilization of a workload on an existing computer system. Workload-specific characteristics determine the resources required of a computer system to achieve a desired level of performance.

Most workloads can be described independent of the computer system. Specifically, workloads that are used for comparing different computer systems should be defined independent of the computer system implementation (Gustafson and Snell, 1995). However, describing a workload independent of the computer system with sufficient detail for computer system comparison can require a significant investment in time and resources. Workloads can be described by their intended work (e.g., the number of served web-page requests), or they can be described by their impact on the computer system (e.g., the utilization of the processor).

The goal of workload characterization is to create a workload description that can be used for selection, improvement and design studies. These areas share common techniques of workload characterization; the distinction is based more on the purpose of the workload characterization than on its execution (Ferrari, 1978):

Selection studies use workload characterization to assist in the selection of computer systems or components.

Improvement studies use workload characterization to determine the changes in workload performance due to modifications of the computer system or workload implementation. The workload on an unmodified computer system is characterized and compared with the workload from a modified computer system. If the modified computer system does not yet exist, then a model of the workload, verified on the unmodified system, can be used to evaluate the performance of the modified system. The requirement for the workload model is that it is modification independent; thus the workload model must be transportable.

Design studies require workload characterization to do performance analysis. Without workload characterization, computer systems and components cannot be quantitatively improved. The characterization and analysis of workloads is essential to provide the designers of computer systems a goal to work towards. For design studies, the specification and selection of workloads is probably the most difficult and significant of the three study types.

Workloads provide the information on resource utilization required in the hierarchical modeling environment. Different workloads must be considered when designing a computer system because computer system performance depends both on the workload characteristics and on the design (Kunkel et al., 2000). Workload characterizations from real computer systems provide quantitative information on such resource utilization. Design changes can lead to changes in the utilization of computer system components. To evaluate these changes an implementation-independent characterization of the workload is required (Ferrari, 1978).

For design studies, computer system designers want to select relevant workloads. Relevant workloads are assumed to contain important workload characteristics. The relevance of these workloads can be determined by their frequency among customers, the value of the computer systems required, or their importance for marketing reasons (Kunkel et al., 2000). The primary goal for computer system architects is to increase the performance of a computer system for relevant workloads. Computer system architects must include information from many different workloads to make appropriate design trade-offs in general-purpose computer systems. A good computer system design will not let a single resource limit performance for relevant real workloads. Resources are balanced to achieve the best possible performance on relevant workloads (Kunkel et al., 2000).
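To make the notion of a workload characterization concrete, the sketch below records a workload as a vector of system-level metrics averaged over measurement intervals. It is only an illustration: the metric names, the values and the representation are assumptions for this example, not the measurement set or the tooling used later in this thesis.

from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List

@dataclass
class WorkloadSample:
    """One observation interval of a running workload (illustrative metrics only)."""
    interval_s: float                 # length of the measurement interval in seconds
    metrics: Dict[str, float] = field(default_factory=dict)

# Two hypothetical intervals of a database-style workload; values are averages over
# each interval (utilization fractions, rates per second, misses per instruction).
samples: List[WorkloadSample] = [
    WorkloadSample(60.0, {"cpu_utilization": 0.72, "l2_misses_per_instr": 0.004,
                          "disk_reads_per_s": 850.0, "network_kbytes_per_s": 1200.0}),
    WorkloadSample(60.0, {"cpu_utilization": 0.68, "l2_misses_per_instr": 0.005,
                          "disk_reads_per_s": 910.0, "network_kbytes_per_s": 1150.0}),
]

# A characterization summarizes many such samples into one metric vector per
# workload, here simply by taking the mean of each metric across the intervals.
characterization = {name: mean(s.metrics[name] for s in samples)
                    for name in samples[0].metrics}
print(characterization)

A vector of this kind, one per workload, is the raw material that selection, improvement and design studies have in common; only the purpose to which the vector is put differs.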

1.2.3 Performance evaluation

Accurate computer system performance evaluation depends on accurate workload characterization (Bose and Conte, 1998; Eeckhout et al., 2002). Computer system design decisions are made based on the projected benefits of architectural changes. These projections are made using data from characterized workloads and evaluated using simulation. Simulation is the most cost-effective method of verifying, evaluating and comparing new computer system designs using existing and/or anticipated workloads (Kunkel et al., 2000; Eeckhout et al., 2002; Alameldeen et al., 2003). Evaluating new designs is limited by the simulation's cost in time and resources, and by its accuracy. The main reason that simulation is the most cost-effective method is that the alternatives are prohibitively expensive. The alternatives to simulation are, for example, building prototypes or implementing logic in field programmable gate arrays (FPGAs). Neither alternative is realistic due to the cost of implementation and the cost of validation: validating simulation results is straightforward compared to debugging an actual prototype or FPGA implementation.

Simulation itself is not cheap. Modern computer systems can execute billions of instructions per second; therefore cycle-accurate simulators need to efficiently simulate these billions of events. Cycle-accurate simulators may take days to simulate a billion instructions. The best simulators introduce only several orders of magnitude slowdown when simulating detailed execution (Eeckhout et al., 2002; Alameldeen et al., 2003). Hierarchical simulation improves the performance of the higher-level simulation by relying on the results of more detailed and slower simulation in the lower levels of the simulation hierarchy. While simulation is the most cost-effective way of verifying computer system designs on certain workloads, it is still expensive in time and resources. This high cost greatly limits the applicability of simulation in the design cycle. It is impossible to do detailed simulations of all workloads of interest due to the cost and time associated with the simulation process. Characterizing workloads at the required level of detail and then running the simulations for all those workloads cannot be completed within a reasonable time or cost (Darringer et al., 2000; Magnusson et al., 2002; Kim et al., 2007). As a result, the designers of computer systems have to choose which relevant workloads to use during the design process. Naturally, designers will want to use convenient workloads - workloads that are less expensive to use in time and resources but still provide the required level of detail.
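To give a sense of the orders of magnitude involved, the short calculation below estimates detailed-simulation time from an assumed simulator slowdown. The native execution rate and the slowdown factor are illustrative assumptions, not measurements from this thesis.

# Back-of-the-envelope estimate of cycle-accurate simulation time (assumed numbers).
native_rate = 1e9        # instructions per second executed by the real machine (assumed)
slowdown = 100_000       # assumed simulator slowdown relative to native execution
sim_rate = native_rate / slowdown          # simulated instructions per second

instructions = 1e9       # one billion instructions, a modest slice of a real workload
seconds = instructions / sim_rate
print(f"Simulating {instructions:.0e} instructions takes roughly "
      f"{seconds / 3600:.0f} hours ({seconds / 86400:.1f} days)")
# With these assumptions a single billion-instruction slice already costs about a day
# of simulation time, which is why detailed simulation of full real workloads is rare.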

1.2.4 Benchmarks

A benchmark is an artificial workload that captures the most important workload characteristics of relevant real workloads. A benchmark is usually expressed as a program or set of programs that are small, efficient, and controllable (Ferrari, 1978). These artifacts are a requirement when considering workloads for the design and evaluation of computer systems. Both workload characterization and workload simulation benefit from benchmarks that represent relevant workload characteristics efficiently. In the case of workload characterization, the benchmark improves on the real workload by its controllable nature and smaller package. In simulation, the benchmark can provide all relevant workload characteristics in a workload representation that does not require extensive simulation time. For example, if the real workload requires 15 billion instructions to fully represent all important workload characteristics and the benchmark only 1 billion instructions, then the benchmark reduces the simulation time by a factor of 15. When simulation time is measured in days and weeks, this is an appreciable gain. Using benchmarks in detailed simulations reduces cost with little risk if the benchmarks accurately represent important workload characteristics (Connelly, 1995; Bose et al., 1999; Diefendorff and Dubey, 1997; Kunkel et al., 2000; Rosti et al., 2002; Menascé, 2003; Tsuei and Yamamoto, 2003; Spracklen and Abraham, 2005). The efficiency gain of benchmarks over real workloads is such that using real workloads in the design cycle is no longer considered.

As noted before, with sufficiently accurate simulators, computer system architects reduce risk by evaluating their designs using benchmarks that represent the most important characteristics of real and emerging workloads. Choosing representative benchmarks is therefore of the utmost importance to the designers. Over time, real workloads evolve and new workloads emerge, so relevant workload characteristics can change. This evolution creates the need to detect important emerging workload characteristics and represent them in benchmarks (John et al., 1998; Skadron et al., 2003). Without new benchmarks that represent changes in computer usage, the designers of computer systems can only rely on their limited set of standardized benchmarks.

Certain common benchmarks, e.g., those from SPEC (www.spec.org, 2007) and TPC (www.tpc.org, 2007), are used by most computer system architects during the design of commercial servers and processors. These benchmarks provide valuable metrics for comparing the performance of existing computer systems. Achieving good performance on these benchmarks is important because of the value attributed to them by consumers. However, as we shall investigate later, these standardized benchmarks may not represent actual usage of computer systems. Most real workloads and computer system configurations differ from the standardized benchmarks; thus computer system designers might require information from real workloads to determine the configuration attributes of computer systems, for example the number of supported processors, the memory size, and the I/O capacity. Designing and configuring computer systems solely on the basis of information from a limited set of benchmarks ignores the diversity of workloads in the world.

1.3 Overview of processor and computer system design

Computer system and processor designers generally aim for the best achievable performance on relevant workloads (Kunkel et al., 2000; Hennessy and Patterson, 2003). While all processors do more or less the same thing, not all are created equal. Much depends on a processor's internal workings, called the micro-architecture, abbreviated to µ-architecture. The µ-architecture performs five basic functions: data access, arithmetic operation, instruction access, instruction decode and write-back of results. Computer architects have reworked these stages for different processor families to come up with their own unique µ-architectures. Modern processors are usually designed to optimally execute a stream of instructions at a specific clock-speed. A processor's clock-speed is measured in cycles per second, or Hertz. Most high-end chips currently run at clock-speeds of at least 1.25 GHz. However, faster clock-speeds do not directly translate into more performance when comparing processors of different µ-architectures.

Microprocessors use pipelines, electronic paths that bits of data are pushed through as they are processed by the processor. A processor pipeline has multiple steps or stages, each stage performing a specific function. Example stages are fetch, decode, execute, memory access and writeback (Hennessy and Patterson, 2003). The processor's pipeline decodes instructions and fetches required data from the caches in time for execution. The number of stages in a pipeline is important because longer pipelines have more disruptive pipeline hazards. A hazard is a conflict in the pipeline that may lead to stalls and thus lower performance. Current processors attempt to predict application execution paths, or branches, and are very capable in doing so. However, mispredicting the next branch still happens roughly 10 percent of the time. Branch misprediction is expensive enough that further reducing the misprediction penalty remains an active area of research (Sprangle and Carmean, 2002). Processors typically realize their mistakes during the last quarter of the pipeline, so the longer the pipeline, the longer it takes to flush the mistake from the pipeline and fix the problem. As a result, performance suffers. This explains why a lower clock-speed, shorter-pipeline processor can outperform a higher clock-speed, longer-pipeline processor (Allbritton, 2002). It also demonstrates how workload characteristics impact performance.

The processor's clock-speed determines the rate at which the pipeline supplies and the processor executes instructions. Processing interruptions occur when required data are not available. Data that are not available are first sought in the processor's caches and then in main memory. In the worst case the data have to be retrieved from disk. The processor-memory-disk hierarchy is illustrated in Figure 1.2, which includes the typical time scale for the processor to access each level in the hierarchy. If data must come from main memory, the data are said to have missed in the caches. Cache misses can delay execution by causing the pipeline to stall until the missing data arrive. The time taken for data to arrive after a cache miss is called the cache latency and is measured in clock cycles. The cache miss rate is the percentage of instructions that result in cache misses. Typical values for the miss rate are 1-10% for the first-level (L1) cache and 0.1-1.0% for the second-level (L2) cache (Hennessy and Patterson, 2003).

[Fig. 1.2: Processor-memory-disk hierarchy - typical access latencies: processor caches 2 ns (L1) and 14 ns (L2), main memory 200 ns over the system bus, hard disk 2,000,000 ns.]

To help understand the process and times involved, we present an analogy (Pronk van Hoogeveen, 2007):

Imagine sitting behind your laptop on a desk, somewhere in Amsterdam, the Netherlands. You are writing a document. While writing, most of the information you need is in your head; this is equivalent to execution on the processor only. In some cases you will need to consult your notes, which are next to you on your desk. The process of moving your attention away from the laptop, to your notes and back to the laptop (circa 10 seconds) is akin to accessing the L1 cache - it is fast and inexpensive. Unfortunately, your notes are limited. Sometimes a piece of information you need is not in the notes and you have to get up, turn around and consult a book on your bookshelf (one minute). The time needed represents an L2 access. If you discover that the required book is not on the shelf, you have to go to the local library. You get up again and drive to the local library, several kilometers away. The time needed to go to the local library and return home (about 15 minutes) represents accessing main memory. If at the library you discover that the required information is not there, you have the equivalent of retrieving data from disk. In this case you get up from your desk, walk to the library in Moscow (circa 2150 km), retrieve the information and walk back (circa 100 days!).

As we can see from the above analogy, efficient cache-miss handling is a significant contributor to processor performance for workloads with a high miss rate. Main memory latency is determined by the speed of DRAM (the actual memory), the system bus and the memory controllers. Similarly, disk latency is determined by the speed of the hard disk, the I/O interface and the system bus. Computer systems are designed to support the best achievable transfer of data between a processor, its caches, main memory, network interfaces, and disk. Large computer systems can support multiple processors, large memories, many network interfaces and disks. Component interactions in large computer systems increase design complexity (Hennessy and Patterson, 2003).
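The pipeline and cache effects described above can be made quantitative with a simple stall-cycle model. The sketch below combines an assumed base CPI with branch-misprediction and cache-miss penalties, using parameter values loosely inspired by the numbers quoted in this section (roughly 10% mispredicted branches, miss rates of a few percent, the latencies of Figure 1.2). The branch fraction, the flush cost of three quarters of the pipeline depth, and the two design points are illustrative assumptions, not designs or measurements from this thesis.

# A simple stall-cycle model of processor performance (all parameters are assumptions).
def time_per_instruction_ns(clock_ghz, pipeline_depth, l1_miss, l2_miss,
                            branch_fraction=0.2, mispredict_rate=0.10,
                            l2_latency_ns=14.0, memory_latency_ns=200.0):
    """Estimate the average execution time per instruction in nanoseconds."""
    base_cpi = 1.0                                   # ideal: one instruction per cycle
    # A mispredicted branch flushes most of the pipeline; assume ~3/4 of its depth.
    branch_stalls = branch_fraction * mispredict_rate * 0.75 * pipeline_depth
    # Convert the fixed memory latencies (ns) into cycles at this clock-speed.
    memory_stalls = (l1_miss * l2_latency_ns * clock_ghz +
                     l2_miss * memory_latency_ns * clock_ghz)
    cpi = base_cpi + branch_stalls + memory_stalls   # cycles per instruction
    return cpi / clock_ghz                           # nanoseconds per instruction

# A short pipeline at a modest clock versus a deep pipeline at a much higher clock,
# evaluated on two kinds of workloads (miss rates per instruction are assumed).
for name, l1, l2 in [("low-miss workload", 0.01, 0.001),
                     ("high-miss commercial workload", 0.05, 0.005)]:
    short = time_per_instruction_ns(1.25, 14, l1, l2)
    deep = time_per_instruction_ns(2.00, 28, l1, l2)
    print(f"{name}: speedup of the deeper, faster-clocked design = "
          f"{short / deep:.2f} (clock ratio 1.60)")

Under these assumptions the design with a 60% higher clock gains only about 25% on the low-miss workload and about 11% on the high-miss one, which illustrates why the cache behavior of the target workloads matters as much as raw clock-speed.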

Efficient branch prediction on a processor is a significant contributor to good workload performance. Correctly predicting the execution path of a program allows efficient pre-fetching of required data, thus reducing the miss rate on the correctly predicted branch in the program (Hennessy and Patterson, 2003). Another significant factor influencing miss rates is the organization of the cache. There are several organizations, ranging from fully associative, via n-way associative, to direct-mapped. Associativity can be explained along the lines of the above analogy. We explain associativity for the L2 cache using the bookshelf:

In order to quickly find books on our bookshelf, we maintain a certain organization. The simplest organization is to use the first character of the title of the book as the location on the bookshelf. Thus, there are 36 possible locations. Having two books starting with the letter A is not possible since they compete for a single slot. This is equivalent to a direct-mapped cache: specific items must go to a specific location. Therefore, if we need another book starting with A, we remove the current book, return it to the library and fetch the next book. Obviously this is unattractive, and we would like to have the option of storing multiple A's. This is called an n-way associative cache, the n representing how many possible locations we can use for different A's. Naturally the size of the cache is still limited; being able to store multiple A's will still lead to the removal of other books. Fully associative means that any book can go in any location of the bookshelf.
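In cache terms, the bookshelf rule is simply an index function applied to a memory address. The sketch below contrasts a direct-mapped placement rule with an n-way set-associative one; the cache geometry and the addresses are arbitrary illustrative choices, not parameters of any processor discussed in this chapter.

# Where may a memory block be placed in the cache? (illustrative geometry only)
NUM_SETS = 256           # number of sets (= number of lines for a direct-mapped cache)
BLOCK_SIZE = 64          # bytes per cache line

def direct_mapped_slot(address):
    """Each block maps to exactly one line; two blocks with the same index evict
    each other even when the rest of the cache is empty (a conflict miss)."""
    block_number = address // BLOCK_SIZE
    return block_number % NUM_SETS

def set_associative_slots(address, ways=4):
    """Each block maps to one set of `ways` lines; blocks that share an index can
    coexist, up to `ways` of them at a time."""
    block_number = address // BLOCK_SIZE
    set_index = block_number % NUM_SETS
    return [(set_index, way) for way in range(ways)]

# Two addresses whose block numbers differ by a multiple of NUM_SETS map to the
# same index and therefore conflict in the direct-mapped organization.
addr_a = 0x10040
addr_b = addr_a + NUM_SETS * BLOCK_SIZE
print(direct_mapped_slot(addr_a), direct_mapped_slot(addr_b))   # same single line
print(set_associative_slots(addr_a))                            # four candidate lines

A fully associative cache corresponds to the degenerate case of a single set that contains every line, so any block may be placed anywhere.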

The analogy above explains how the cache structure contributes to the miss rate. In direct-mapped caches the miss rate can be negatively influenced by evictions, i.e., two books vying for the same place on the shelf. The example also highlights the diversity of choices facing designers: not only must they decide on the size of the cache, they also have to design its organization. Direct-mapped caches are easy and relatively inexpensive to implement, while associative caches are more involved to implement and can be more costly.

Processor and system design oversights can result from the design complexity of processors and computer systems (Hennessy and Patterson, 2003). Design oversights can also result from the number of trade-off decisions, the impact of external constraints (like cost and time) and the inability to fully evaluate a design for all relevant workloads (Borkenhagen et al., 2000; Hartstein and Puzak, 2002). Design oversights usually have a negative performance impact and can be difficult to rectify in subsequent revisions of the processor or computer system designs.

1.4 Processor design cases

To establish the relevance of this work for processor design, we present three cases where mainstream processors did not arrive at the optimal design point for their target workloads. Both the Pentium™ 4 and UltraSPARC™ III processors suffered from design oversights when they were first released, while the Itanium™ processor was less suitable for common commercial workloads than intended.

1.4.1 Intel Pentium 4

Cataldo (2000) describes how the design of the Pentium 4 was intended to extend the Pentium III design into a higher clock-speed domain. This decision was made to leverage the marketing value of high clock-speeds. The main rationale behind pushing the clock-speed was the success of increasing performance through clock-speed increases in previous generations of the Pentium processor family. The design goal of the Pentium 4 was to scale to fast clock-speeds, because consumers were beginning to purchase computers based on higher megahertz ratings. The Pentium 4 is a classic example of marketing concerns driving technological development. Intel used a deep instruction pipeline to implement this goal, which reduced the amount of real work that the Pentium 4 could do per clock cycle, compared to other CPUs like the Pentium III and Athlon, but allowed it to scale to higher clock-speeds (Allbritton, 2002). This soon prompted AMD's "Megahertz myth" campaign.

The first version of the Pentium 4 was generally considered a poor performer relative to other processors in the marketplace. On a per-cycle basis the Pentium 4 performed less work than other processors, requiring higher clock-speeds to make up for the difference. The higher clock-speeds translated into higher power requirements. Comparative performance on different workloads showed that the performance of the Pentium 4 was highly uneven: multi-media workloads generally performed very well, while graphics and floating-point intensive workloads showed poor performance (Mihocka, 2000).

In Colwell (2005) the design of the Pentium 4 is evaluated and considered to be too complex. Complexity leads to errors in the design (bugs) that can be expensive to fix; it leads to suboptimal trade-offs between multiple goals, since full evaluation is too expensive given the complexity; complex designs make follow-on designs very difficult; and finally, complexity is cumulative in the sense that new designs inherit the complexity of older designs. This complexity inheritance is partly caused by the requirement from the marketplace that a new generation of processor and computer system should support most, if not all, of the features of the previous generation of processors. Breaking compatibility with previous generations forces users through potentially expensive and disruptive upgrade cycles. The end-users of computer systems prefer to have a faster version of the same chip.

One of the root causes of the poor performance of the Pentium 4 was the very deep pipeline. The initial design had a 20-stage pipeline, primarily to allow for higher clock-speeds. However, the price of the longer pipeline is the increased complexity and the increased cost of flushing the pipeline when execution errors (like mispredicted branches) occur. In retrospect, the Pentium 4 pushed the limits of Moore's Law to a point where the power consumption, performance and cost were no longer attractive. In the end, Intel reverted to the Pentium III design for their later products (Colwell, 2005).

We argue that the example of the Pentium 4 demonstrates how insufficient understanding of workload behavior can lead to unrealistic goals. We ask ourselves the question: if the designers had been given good workload characterization data, would they have been able to curb marketing's drive towards higher clock-speeds? We argue that good workload characterization data could have illustrated many of the processor's problems early in the design process.

1.4.2 Sun Microsystems UltraSPARC III

The design of the UltraSPARC III began with four high-level goals decided by marketing, engineering, management and operations in Sun's processor and system groups. The four high-level goals were compatibility, performance, scalability and reliability. The compatibility goal was to provide a 90% increase in application program performance without requiring a recompile of the application. This performance and compatibility goal demanded a sizable micro-architecture performance increase while maintaining the programmer-visible characteristics of previous generations (Horel and Lauterbach, 1999).

To reach the performance objectives for the processor, the designers evaluated increasing performance through aggressive instruction-level parallelism (ILP). However, the performance increase obtainable with ILP varies greatly across a set of programs. Instead, the processor designers opted to scale up the bandwidths of the processor while reducing the latencies. This decision was in part based on results from the SPEC CPU 95 integer suite. In order to support the high clock-rate and performance goal, the UltraSPARC III (USIII) was given a 14-stage pipeline. The pipeline depth was identified early in the design process by analyzing several basic paths (Horel and Lauterbach, 1999). However, long pipelines carry additional burdens when the pipeline stalls due to an unexpected event, i.e., a data cache miss. The USIII handles such an event by draining the pipeline and re-fetching the instructions - too many stalls obviously lower processor performance.

The designers chose a direct-mapped cache, i.e., a cache where each memory location maps to a single cache location. The advantage of direct-mapped caches is that they are simple and fast. Direct-mapped caches are considered inefficient because they are susceptible to mapping conflicts, i.e., multiple memory addresses are mapped to the same cache-line (Hennessy and Patterson, 2003). Furthermore, the original design for the L2 cache was optimized for a fast hit time at the expense of a higher miss rate. In addition, the memory management unit (MMU) was also optimized for fast lookup at the expense of a higher miss rate. Competing processors of that time all used associative caches, a cache structure that maps memory addresses to multiple cache locations - if one location is in use, another may be used. In commercial workloads, with many cache misses, these conflict misses are common. Thus, the USIII combined expensive data cache-miss handling with a cache architecture susceptible to misses.

The USIII attempted to improve cycle time and execution speed for low miss-ratio code like SPEC CPU 95. The lack of efficient cache-miss handling resulted in a performance deficit on important real workloads. Real workloads exhibit significant cache misses. These cache misses lead to pipeline stalls. The frequent misses common to commercial workloads lowered the performance of the UltraSPARC III processor on these workloads. In fact, the performance increase of the USIII relative to its predecessor was only slight, despite an almost 2× clock-speed increase (Koster, 2004). The lower performance on relevant real workloads placed the USIII at a disadvantage compared to other processors.

When the USIII finally taped out, i.e., the first processor prototype was made, the extent of these oversights was discovered. The designers performed some hasty patch work in order to at least approach the performance targets. It seems likely that if the design team had taken real workload characteristics into account throughout the design process, it would have been clear that the SPEC CPU 95 benchmark was inadequate in the light of processor and workload developments (Koster, 2004). Competing processors, like the IBM Power 4, correctly targeted the relevant workload characteristics.

1.4.3 Intel Itanium I

The Itanium processor, with its EPIC instruction set, was designed to take maximum advantage of instruction-level parallelism in the execution path of applications. EPIC is an implementation of a Very Long Instruction Word (VLIW) architecture. The Itanium was designed as a general-purpose processor. As is the case for all VLIW architectures, the Itanium designers relied on advances in compiler technology for compile-time optimization (Schlansker, 1999; Sharangpani and Arora, 2000; Gray et al., 2005). This is necessary since the width of the instruction stream makes it impossible for the hardware to optimize; this is in contrast to RISC instruction sets, where the hardware does attempt to optimize execution scheduling. The quality of the compiler optimizations depends on how well the compiler can predict the most probable execution patterns at compile time (Gray et al., 2005).

However, tests on real commercial applications have demonstrated that the Itanium is not ideally suited for the ad-hoc nature of commercial applications. Commercial applications can have numerous data-dependent execution paths for which compile-time optimization is difficult. The performance impact of numerous data-dependent execution paths is further exacerbated by current compiler limitations (Hennessy and Patterson, 2003). The Itanium design does excel on some applications that can be iteratively tuned for maximum performance using repeated execute-profile-optimize cycles (Shankland, 2005). While not a design oversight per se, this tuning requirement highlights the importance of using representative benchmarks in the design process. Representative benchmarks could have demonstrated the extreme degree of compiler support required to optimize data-dependent execution paths. While it is believed that these execution-time issues could be solved by better compilers and iterative execute-profile-optimize cycles, such advances have not come to fruition (O'Krafka, 2007).

As a result, the performance of the Itanium fell far below expectations in its first implementation (Shankland, 2005). Not only was the first version two years behind schedule, but several compromises made in attempts to meet the schedule reduced overall performance. When the Itanium was first released it did not compare favorably with competing processors except for scientific computing workloads. The impact of the trade-offs could have been reduced if the designers had concentrated on specific workloads. Instead, the design trade-offs were made on most processor components and thus impacted performance over the whole spectrum of workloads. By making more targeted trade-offs, the designers could have achieved the same schedule and cost reductions yet maintained performance for a significant portion of the workload space. In the end the Itanium only performed reasonably well for scientific computing, based on the number of resources on the processor - notably the large number of floating-point units (Shankland, 2005).

This raises the question whether the use of a broader set of representative workloads would have improved the design decisions. It seems unlikely that trade-off decisions made to meet schedule goals or price-points improve processor design when they are not evaluated against a representative set of workloads.

1.5 Computer system design considerations

The first computer systems that supported multiple Itanium processors were designed with several processors on the same memory bus. Unfortunately, the memory bus did not provide the bandwidth required by the Itanium processors. The lack of memory bus bandwidth severely limited memory throughput and increased memory latency for each processor, thus lowering computer system performance (Zeichick, 2004; Shankland, 2005). Processor performance depends on the ability of the system enclosure to provide the data and instructions needed to sustain execution. If computer system bandwidth is a bottleneck, the processors have to wait until the data arrives. As noted in Kunkel et al. (2000), for commercial computer systems a proper balance between the performance requirements of the processors and the capabilities of the computer systems is essential. Evaluating benchmarks representative of the bandwidth requirements of commercial applications could have provided quantitative data on bandwidth requirements; such data could have prevented the shortfall in memory bus bandwidth. Mistakes like these are expensive to rectify and are best detected during the design phase. The hierarchical model paradigm was specifically designed to prevent such issues from arising during the design cycle. We can only speculate at the underlying cause for this design oversight.

The interaction of processors and computer system is of fundamental importance to application performance. While having sufficient processors is necessary to reach higher levels of performance, having more processors does not necessarily increase performance. Many workloads suffer from internal inefficiencies that are exacerbated by bottlenecks in the computer system. In these cases, solving for computer system bottlenecks is more important than increasing processor performance. Gunther (1998) presents an example where application performance is negatively impacted by increasing the number of processors. The example explains that a database server can have a bottleneck in its I/O channel when reading or writing data to the disks. By adding more processors, the number of outstanding transactions increases, further stressing the I/O channel and thus exacerbating the performance problem. In this example the correct action to improve performance would have been to increase the capacity of the I/O channel. A commercial server should therefore have sufficient I/O capacity for the most demanding workloads at a reasonable cost. Insufficient bandwidth capacity will lead to execution bottlenecks and hence to poor performance.

Interactions between workloads and computer systems are many and varied. It is unrealistic to expect that a few benchmarks adequately characterize the performance requirements of many applications. At the same time, the processor and computer system designers need only concentrate on the extreme cases. Solving for all extreme cases, however, can lead to an undesirable, expensive solution. The computer system designers need to weigh the requirements of the workloads carefully to make optimal choices regarding the desired capacities for the computer system. This is the case for nearly all resources in the computer system: number of processors, memory capacity, internal bandwidth and I/O capacity. This leads to the question - do real workloads provide a richer picture of computer system requirements than currently presented by standard benchmarks?
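The bottleneck argument in Gunther's example can be made concrete with a simple utilization-law calculation. The Python sketch below is purely illustrative and is not taken from the cited work; the per-transaction service demands are hypothetical. It shows that once the I/O channel saturates, adding processors no longer raises throughput, whereas adding I/O capacity does.

    # A minimal sketch of the bottleneck reasoning in the database-server
    # example: transaction throughput is capped by the most heavily used
    # resource, so adding processors past the point where the I/O channel
    # saturates buys nothing.

    def max_throughput(n_cpus, cpu_demand_s, io_demand_s, io_channels=1):
        """Upper bound on transactions/second from the utilization law.

        cpu_demand_s -- CPU seconds needed per transaction (spread over n_cpus)
        io_demand_s  -- I/O-channel seconds needed per transaction
        """
        cpu_cap = n_cpus / cpu_demand_s     # transactions/s the CPUs could sustain
        io_cap = io_channels / io_demand_s  # transactions/s the I/O channel could sustain
        return min(cpu_cap, io_cap)

    if __name__ == "__main__":
        # Hypothetical demands: 40 ms of CPU and 10 ms of I/O per transaction.
        for n in (1, 2, 4, 8, 16):
            x = max_throughput(n, cpu_demand_s=0.040, io_demand_s=0.010)
            print(f"{n:2d} CPUs, 1 I/O channel : <= {x:6.1f} tx/s")
        # Throughput stops improving at 4 CPUs (the I/O channel saturates);
        # doubling the I/O capacity instead raises the ceiling.
        print("16 CPUs, 2 I/O channels:",
              f"<= {max_throughput(16, 0.040, 0.010, io_channels=2):6.1f} tx/s")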

1.6 Workload and benchmark considerations

For commercial applications from, for example, SAP, PeopleSoft (acquired by Oracle in June 2005, although the PeopleSoft applications still exist) and Oracle, extensive sizing and capacity planning is required to predict the optimal machine configuration for a given customer workload (Cravotta, 2003). These sizing and capacity requirements are based on the computer system performance on standardized, application-specific benchmarks. Yet, even after sizing and capacity planning, customers regularly require a machine that has more processors, more memory and more I/O capacity than initially predicted by these benchmarks (Golfarelli and Saltarelli, 2003). This increase in requirements is often caused by differences in workload characteristics between the application as benchmarked and the application in real use. Real workload requirements frequently exceed the workload requirements of standardized application benchmarks. The differences between the requirements for application-specific benchmarks and real customer workloads illustrate the value of information on real workload characteristics throughout the processor and computer system design process. Not taking into account the increased requirements of real applications can lead to computer systems that are designed with insufficient capacity (Kunkel et al., 2000). As noted in the previous section, insufficient capacity leads to poor performance.

In addition to real workload requirements exceeding the benchmark predictions, there is a difference in perspective. The application vendor is interested in achieving a high level of performance on computer systems that are relatively inexpensive. The application vendor therefore has an incentive to make the common case fast and efficient. The end-user's interest is to support their business by using the application, not necessarily to run the common case fast. In the end-user (customer) environment, decisions on application characteristics are not based on what will perform well but rather on what supports the business. The business requirements guide the use and modification of an application. As a result, the same application deployed at different businesses may have very different workload characteristics. This property of real workloads is hard to include in a design process that relies exclusively on standardized benchmarks for guiding design decisions. Supplying additional information on real workloads helps designers understand the breadth of application-specific workload characteristics.

A side-effect of real workloads is described as “software rot” (Brooks, 1995). Application software undergoes a steady stream of changes early in the design cycle. As
the software ages the development pace slows down until it stops prior to the next release. Software “rot” comes from the compound effects of these changes on the application. Users adapt to most application bugs by developing a workaround. The combined effect of software changes and user workarounds is to push the workload characteristics away from what was benchmarked for the initial sizing. While some application changes improve performance, others will not. This combination of application adaptation and evolution creates a much greater diversity than can be covered by standardized benchmarks. Designers require quantitative feedback regarding these extended usage parameters when evaluating trade-off decisions. In many cases of software “rot”, the application is too old to warrant any additional development. It is in the customers best interest to keep the application running as fast as possible. The guidance to computer system and processor designers from these applications is to make sure that bad/old software executes well on new computer systems. Improving performance for poor or old software is more beneficial to users than requiring application recompilation.

1.7 Increasing complexity and increasing diversity

There are no indications that processors and computers systems are getting less complex (Colwell, 2005; McNairy and Bhatia, 2005; Ranganathan and Jouppi, 2005; Kongetira et al., 2005; Spracklen and Abraham, 2005). Single thread performance on a processor has reached the point of diminishing returns. Further increasing single thread performance requires a much greater investment of time and resources than is warranted by the expected gains (Colwell, 2005). The transition from single thread, single core processors to multi-threaded, multi-core processes, reflects the difficulty of further increasing single thread performance. Including more execution cores on a processor, each with several concurrent hardware threads, allows processor designers to achieve higher instruction rates by performing more work in parallel (Nayfeh et al., 1996; Olukotun et al., 1996; Kongetira et al., 2005). This change, from single core/single thread to multi core/multi thread processor designs, further increases the simulation burden in time and effort. The simulation time is strongly related to the total number of instructions that have to be simulated. Parallel to the increase of complexity, the adoption rate of computer systems shows no indication of slowing down. In the domain of commercial computer systems, the range and diversity of applications continues to grow (Ranganathan and Jouppi, 2005). The adoption of new programming techniques continues to abstract the programmer away from the actual hardware, making performance optimization and application tuning more difficult and complex. Processor and computer system designers require feedback on the effects of new programming languages and paradigms. Shifts in the way applications are developed and used can significantly change the workload characteristics of future workloads. According to Ranganathan and Jouppi (2005), computer systems deployment shows shifting trends as well. There is a trend towards dedicating


cheap computer resources to specific applications like web-servers. At the same time computer system users attempt to improve their efficiency by using virtualization and software partitioning to host multiple applications on a single computer system. The combined load of the shared applications will push computer system utilization higher than would be the case if each application was on dedicated hardware. Virtualization software hides each application from the other and distributes available computer resources between them. For computer system and processor designers virtualization introduces additional complexities since traditional workloads no longer exist. Designers require feedback on the workload characteristics of these combined, virtualized, workloads even though each combined workload may be unique. In an attempt to reduce simulation cost, work is done to better understand the composition of standardized benchmark suites like SPEC and TPC. The goal is to remove redundant characteristics from the simulation set, thus optimizing the information gain per simulation (Vandierendonck and De Bosschere, 2004b; Phansalkar et al., 2005b; Eeckhout et al., 2005b). It increases the efficiency of simulating these standardized benchmark suites, but does not assist in identifying changes in real workload behavior and related trends. The ability to reduce standardized benchmarks into a subset that not only represents all important workload characteristics but is efficient to simulate, is valuable in the design process. Of equal value is quantitative feedback on the diversity of real workloads and the coverage of relevant real workload characteristics in that reduced set of benchmarks. Over-simplification is the risk introduced by the standardized benchmarks sets, further compounded by the effort to reduce them. One single benchmark is unable to capture the richness of real workloads. Reducing the diversity of benchmarks in the design process can increase the risk of unwanted surprises.
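The redundancy-removal idea referred to above can be illustrated with a small sketch. It is not the method of the cited studies (which typically combine principal component analysis with clustering); it merely shows the underlying principle of dropping benchmarks whose normalized characterization vectors are nearly identical. The metric names and values are hypothetical.

    # Illustrative redundancy removal: keep a benchmark only if it is not a
    # near-duplicate (in the normalized workload space) of one already kept.
    from math import sqrt

    def zscore_columns(vectors):
        """Normalize each metric (column) to zero mean and unit variance."""
        cols = list(zip(*vectors))
        means = [sum(c) / len(c) for c in cols]
        stds = [sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
                for c, m in zip(cols, means)]
        return [[(x - m) / s for x, m, s in zip(v, means, stds)] for v in vectors]

    def distance(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def drop_redundant(names, vectors, threshold=1.0):
        """Greedily keep a benchmark only if it lies farther than `threshold`
        from every benchmark kept so far."""
        norm = zscore_columns(vectors)
        kept = []
        for name, vec in zip(names, norm):
            if all(distance(vec, kv) > threshold for _, kv in kept):
                kept.append((name, vec))
        return [name for name, _ in kept]

    if __name__ == "__main__":
        # Hypothetical characterization: (IPC, L2 miss rate, branch mispredict rate)
        names = ["gzip", "bzip2", "mcf", "art", "equake"]
        vecs = [[1.1, 0.02, 0.06], [1.0, 0.02, 0.06],   # two similar compressors
                [0.3, 0.30, 0.02], [0.4, 0.28, 0.01], [0.9, 0.10, 0.03]]
        print(drop_redundant(names, vecs, threshold=1.0))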

1.8 Representing real workload characteristics in the design process

Computer system performance is the most important distinguishing attribute designers attempt to optimize. Computer system and processor designers therefore require accurate workload characterization data to help them make trade-off decisions that improve computer system performance on relevant real workloads. Benchmarks commonly provide the required workload characterization data, since benchmarks are generally efficient to use and easy to control. Currently, only a limited set of computer system independent benchmarks are used to compare computer system performance and to design new computer systems. However, many important characteristics of real workloads required for future computer systems design might not be adequately represented in this small set of commonly used benchmarks. Identifying important workload characteristics as well as quantifying the representativeness of benchmarks therefore remains an issue. Quoting Skadron et al. (2003) “..the R&D community does not have a system for identifying important benchmark characteristics and how benchmark applications embody them...[and to de-


termine] what portion of the total behavior space each benchmark really represents”. Without information on the representativeness of benchmarks and the relevance of the workload characteristics they represent, designers have no insight into the needs and requirements of real workloads. Benchmarks used for processor and computer system design should thus be chosen based on how well they represent relevant workloads. If those benchmarks do not exist, candidate benchmarks should be found among these relevant workloads. Finding and defining these benchmarks requires that we can quantitatively determine representativeness. How are benchmarks or workloads representative of other workloads? They can be representative in two ways: (i) they accurately predict application performance, (ii) they accurately mimic application characteristics (Ferrari, 1978; KleinOsowski et al., 2001; Eeckhout et al., 2002, 2005b). Representativeness of benchmarks relative to each other and to workloads is therefore the key issue. Designers need to keep track of real world characteristics and evaluate their impact on computer system design, yet they are limited by the impossibility of performing detailed, cycle-accurate simulations for more than a handful of benchmarks. The main problem area of this work can be summarized as: How can we identify important workload characteristics and find or define the benchmarks that represent them ? In Section 1.4 the requirement of backwards compatibility for processors was mentioned. Consequently, the design of a new processor caries the burden of compatibility with the previous generation. This means that the selection of representative workloads for the design process can be made specific to a processor micro-architecture. Workload characterization data collected on current processors provide the information required to improve the next processor generation. Representing real workload data in the design process is essential since it provides insight into workload characteristics not provided by benchmarks. However, real workloads are not practical for the detailed analysis required in the processor and computer system design process. Real workloads lack the compactness and controllability associated with benchmarks (Kunkel et al., 2000). Therefore designers are faced with a dilemma: there is clear value in understanding the workload characteristics of a broad set of real workloads, yet the cost in time and resources required to characterize all these workloads sufficiently, makes their inclusion economically unattractive. We propose that the best way of including these relevant workload characteristics is to find or define a minimal set of workloads which represent all characteristics relevant for the specific processor micro-architecture and computer system design. We identify a hierarchy of steps between the user workload space and the optimal reduced benchmark space, illustrated in Figure 1.3. The real workload space is spanned by all workloads in existence. The optimal reduced benchmark space represents the workload space obtained after removing redundancies from the benchmarks identified

in the standard benchmark space. The representative workload space captures all important workload characteristics of the real workload space and provides the candidates for the standard benchmark space. Currently the steps from the standard benchmark space to the processor and system design space are reasonably well understood. Most of the simulation-based results available in the literature target these steps, e.g., Eeckhout et al. (2003b) and Phansalkar et al. (2005b). Much less studied and understood, and the focus of this work, is the selection of representative workloads from the real workload space to span the representative workload space. The representative workload space must reflect all significant workload characteristics identifiable in the real workload space.

Fig. 1.3: Overview of benchmark set creation. Candidate workload selection takes the real workload space to a representative workload space; standard benchmark selection yields the standard benchmark space; benchmark similarity analysis yields the optimal reduced benchmark space. The figure also distinguishes simulation metrics from computer system metrics.

1.9 Research questions

Computer system and processor designers need to quantify two issues related to relevant workloads and the benchmarks used to represent them. The first issue is to determine - what are the important workload characteristics and requirements? Workload characteristics are the required size of memory, recommended cache size, etc. Characteristics can be viewed as common between classes of workloads, i.e., they can be defined in advance. Workload requirements are particular to a workload; they depend on the specific processor dependent implementation of the workload as well as the nature of its work. These workload requirements must be addressed during computer system design. This requires workload characterization to evaluate the impact of these requirements on computer system and processor design. For example, if studies find that a majority of the workload would benefit from larger caches, then the designers could increase the caches at the expense of other optimizations. These design trade-offs can only be evaluated in simulation, using workloads that are representative of these requirements. Information from real workloads is needed to determine the relative importance of these requirements and select the representative workloads used for simulation. The second issue is finding the workloads which represent the requirements of real workloads such that their impact on computer system design can be evaluated. Simultaneously these representative workloads should have the usability of benchmarks.


We believe that real workloads are the key to representative workload selection. We must find an approach that allows us to characterize real workloads and use that characterization to guide the selection of representative workloads. Since real workloads are tied to specific processor micro-architectures, we limit our approach to a specific micro-architecture. Choosing a single micro-architecture is reasonable since it reflects the reality faced by IBM (PowerPC), Intel (X86, X64) and Sun Microsystems (SPARC), each with their own specific micro-architecture implementation. The approach thus requires the ability to select representative workloads from a set of measured workloads, which in turn requires a measure of representativeness for workloads on computer systems of the same micro-architecture. The overarching goal is to select a set of workloads that represents the workload space for that specific processor architecture. Yet given the constraints in time and resources, we would like this set to be as small as possible. This leads us to the primary research question:

Research Question 1: How do we find the smallest set of workloads representative of a specific micro-architecture workload space?

While the ability to select such a workload set is valuable by itself, the selection process must also be practical, and therefore efficient, in use. We therefore introduce a second research question to address this practicality requirement:

Research Question 2: How can we efficiently find a smallest workload set representative of hundreds or thousands of collected workloads?

Efficiency is necessary since broad characterization of the workload space, combined with the market value of portions thereof, should solidify the approach as an important tool for the design phase of computer systems and processors.
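To make the two research questions concrete, the following sketch shows one standard greedy heuristic for this kind of covering problem; it only illustrates the problem being posed and is not the approach developed in this thesis. The workload vectors and coverage radius are hypothetical, and the workload space is assumed to be normalized so that Euclidean distance is meaningful.

    # Illustration of the selection problem: choose the fewest workloads such
    # that every measured workload lies within a chosen radius of some selected
    # representative. The greedy loop scales to thousands of workloads.
    from math import dist  # Python 3.8+: Euclidean distance

    def greedy_representatives(vectors, radius):
        """Return indices of a small set of vectors covering all others."""
        uncovered = set(range(len(vectors)))
        chosen = []
        while uncovered:
            # Pick the workload that covers the most still-uncovered workloads.
            best = max(uncovered,
                       key=lambda i: sum(dist(vectors[i], vectors[j]) <= radius
                                         for j in uncovered))
            chosen.append(best)
            uncovered -= {j for j in uncovered
                          if dist(vectors[best], vectors[j]) <= radius}
        return chosen

    if __name__ == "__main__":
        # Hypothetical workloads already normalized to comparable scales.
        workloads = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.75), (0.5, 0.1)]
        reps = greedy_representatives(workloads, radius=0.2)
        print("representatives:", reps)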

1.10 Research approach

Scientific inquiry can be thought of as a particular process or strategy in which a set of research instruments are employed, guided by researchers using an underlying research philosophy. The following sections discuss the philosophy, strategy and instruments applied in the pursuit of the research objectives.

1.10.1 Philosophy

Scientific inquiry follows a process or strategy, uses research instruments and is guided by an underlying research philosophy (Galliers, 1994). A research philosophy is a belief about the way in which data about a phenomenon should be gathered, analyzed and used. The choice of research philosophy guides the selection of research instruments and strategy.


There are two major philosophies, or “schools of thought”, in the Western tradition of science. These philosophies are the “hard” positivist research tradition and the “soft” interpretivist research tradition (Hirschheim, 1992). Positivists believe that reality is stable and can be observed and described from an objective viewpoint, i.e., without interfering with the phenomena being studied. Positivists contend that phenomena should be isolated and that observations should be repeatable. Thus predictions can be made on the basis of previous observations and their explanations. Positivism has a particularly strong and successful association with the physical and natural sciences and concentrates on laboratory experiments, field experiments and surveys as its primary research instruments (Hirschheim, 1992; Galliers, 1994).

In contrast, interpretivists contend that only through the subjective interpretation of and intervention in reality can that reality be fully understood. The study of phenomena in their natural environment is key to the interpretivist philosophy, together with the acknowledgement that scientists cannot avoid affecting those phenomena under study. Interpretivists understand that there may be many interpretations of reality, but consider these interpretations a part of the scientific knowledge they are pursuing. Interpretivist researchers use “soft” research instruments such as reviews, action research and forecasting (Hirschheim, 1992; Galliers, 1994).

It has often been observed that no single research methodology is intrinsically better than any other methodology, though some institutions seem to favor certain methodologies above others (Galliers and Land, 1987). Such a favored approach conflicts with the fact that a research philosophy should be based on the research objective rather than the research topic (March and Smith, 1995). This thesis reflects a pragmatic approach to an engineering question: how to include empirical evidence in the design process to improve computer systems. We believe that we can always make a better computer even though the best computer might never exist. Similarly we believe that the inclusion of measurements from different workloads will bring us closer to what is happening in reality. These beliefs are post-positivistic. Post-positivism is a branch of positivism; its most common form is critical realism, i.e., the belief in a reality independent of our thinking that science can study. Post-positivism uses experimental methods and quantitative measures to test hypothetical generalizations.

1.10.2 Strategy

Consistent with our post-positivistic research philosophy is the deductive-hypothetic research strategy or scientific method, illustrated in Figure 1.4. We define the question, verify that this question is relevant and form a hypothesis for a solution. Based on this hypothesis we formulate a solution: a prescriptive model that allows us to test the hypothesis. We then perform an experiment and collect data on this solution. This data is analyzed and the interpretation serves as the starting point for new hypotheses.


Fig. 1.4: The deductive-hypothetic research strategy: define the question; gather information and resources; form hypothesis; construct the prescriptive model; perform experiment and collect data; analyze data; interpret data and draw conclusions; publish results.

1.10.3 Instruments

According to Galliers (1994), research instruments in quantitative research are laboratory experiments, field experiments, surveys, case studies, theorem proof, forecasting and simulation. To establish the problem statement and derive the solution requirements, we survey the literature as well as personal experience. After formulation of the research hypotheses, we postulate a solution for each hypothesis in the form of a prescriptive model. This prescriptive model makes falsifiable predictions. To verify these predictions we use field data, laboratory experiments or simulation. Data analysis allows conclusions that validate or reject the hypotheses.

1.11 Research outline

The outline of this thesis, illustrated in Figure 1.5, is as follows. We closely follow the deductive-hypothetic research strategy. In Chapter 2 we review the available literature and distill a descriptive model. Based on this descriptive model and our research question we formulate a number of hypotheses and their requirements in Chapter 3. These hypotheses must be testable, i.e., the research must lead to acceptance or rejection of the hypotheses. The prescriptive model, the framework used to test these hypotheses, is discussed in Chapter 4. After defining the prescriptive model we perform an early test in Chapter 5. This early test investigates the extent to which simulation and real system measurements agree on workload similarity properties. Its purpose is to determine whether we should commit the considerable effort and resources required for large-scale collection of workload characterization data.

Fig. 1.5: Thesis outline, mapping the chapters onto the deductive-hypothetic research strategy: Chapter 1 defines the question; Chapter 2 gathers information and resources; Chapter 3 forms the hypotheses; Chapter 4 constructs the prescriptive model (the approach); Chapter 5 performs an early test of the methodology on a benchmark set; Chapter 6 collects and analyzes workload characterization data; Chapter 7 analyzes the data by grouping together similar workloads; Chapter 8 interprets the data and formulates conclusions by finding representative workloads in the measured dataset; Chapter 9 presents the conclusions by evaluating workload similarity.

The second part of this thesis details our data-collection and subsequent analysis of a collected workload set. Chapter 6 covers the observability problem (how to collect and reduce data) as well as the dimensionality-reduction problem (how to make a large multi-dimensional space more understandable). In Chapter 7 we analyze the data and compare different strategies for grouping workloads in meaningful ways. Chapter 8 is the last methodological chapter, where we draw our conclusions and contrast those with the results from Chapter 2. Finally, in Chapter 9, we depart from the deductive-hypothetic research strategy and evaluate the work.

Part I CONTEXT OF REPRESENTATIVE WORKLOAD SELECTION

2. CURRENT APPROACHES FOR WORKLOAD SELECTION IN PROCESSOR AND COMPUTER SYSTEM DESIGN

ABSTRACT

Workload selection for computer system design requires understanding technological developments, marketplace requirements and customer workloads. Ideally, computer system designers have a benchmark set that is representative of their customers’ behavior. The value of standardized computer system performance evaluation, provided by, e.g., SPEC CPU 2000 and TPC-C, is limited by their purpose as platform-independent performance metrics. Good benchmark sets provide concise and balanced coverage of the workload space. The impossibility of evaluating large benchmark sets in simulation spawned the development of redundancy removal techniques. Redundancy removal retains the predictive value of the benchmarks but greatly reduces the amount of simulation time needed for design evaluation. Benchmark sets must be complete, i.e., all relevant workloads are represented, and they must be compact, i.e., the smallest available set without compromising representativeness.


In this chapter we review the literature on benchmark selection for processor and computer system design. Where the previous chapter established the relevance of this work, this chapter presents a framework of already existing approaches and techniques related to the topic. We begin with the goals of benchmark selection, and a description of the selection process. Next the context of benchmark selection is discussed. Benchmark selection is then placed in the context of the overall design process, taking into account constraints on cost and time. The constraints require that benchmark sets be made as small as possible, without losing relevant information. Benchmark similarity and its application in reducing redundancy and sub-setting of benchmark sets is discussed. The chapter ends with a list of requirements for selection and benchmark set reduction and how these requirements are necessary for finding the most relevant workloads.

2.1 Selecting workloads for commercial computer system design

Section 1.2 presented an overview of the design process within companies that design and manufacture processors and computer systems. An analogous view is presented by Bose and Conte (1998), where they emphasize that marketplace competition has forced companies to use a targeted and highly systematic process that focuses new designs on specific workloads. They further emphasize that different companies will follow different strategies, with the following common elements:

1. An application analysis team, often working with customers, usually determines market requirements and reduces them to key workloads.
2. Lead architects work with interdisciplinary experts in examining practical concerns to bound the space of potential designs.
3. The performance team, working with the lead architect(s), writes a model that measures performance in terms of throughput or total turn-around time.
4. Performance architects reduce workloads to representative benchmarks and test cases (micro-benchmarks).
5. Using a compiler and these benchmarks, performance architects test the model, usually employing trace-driven simulation.
6. On the basis of these results, the design team chooses (and later refines) a specific micro-architecture.
7. The team implements the micro-architecture by translating the high-level design to lower-level details (register-transfer-, gate-, and/or circuit-level models, followed by physical and chip lay-out, and so on) and tuning.
8. The design team applies verification methods at each level.

The main advantage of a systematic process as identified by Bose and Conte (1998) is: “... it produces a finely tuned design targeted at a particular market. At its core are


models of the processor’s performance and its workloads. Developing and verifying these models is the domain now called performance analysis.”. This perspective and the perspective of Section 1.2 emphasize the need to identify the dominant or key workloads. Dominant workloads generate significant revenue due to their volume in the marketplace and are of primary concern to computer system manufacturers. Key workloads are workloads that customers are most likely to care about when the system is ready. Customers consider these workloads important either due to their perceived relevance to the marketplace or due to their similarity with their own typical workloads. Selecting the best subset of dominant and key workloads and making them suitable for the design process are the topics of this chapter. Based on the two descriptions of the computer system design process, determining the design parameters of a computer system is a multi-actor decision-making problem. In most companies that design and build computer systems or processors the required knowledge and the decision-making capability is distributed among different people. Making the correct design goal choices involves decision-making regarding technical feasibility, technical merit and expected impact on the marketplace. Technical merit requires that the computer system design addresses the performance challenges of workloads that are relevant. Most times relevance can be equated to value in the market place. The decision-making process should decide which workloads are relevant, based on understanding of the marketplace. These relevant workloads guide the selection of benchmarks. For purposes of marketing, i.e., the perception of computer system performance, standardized benchmarks are considered relevant workloads. These standardized benchmarks provide performance comparisons between different computer systems. The dominant workloads in the marketplace have several defining attributes: they are common in the marketplace, the workload market-value is significant and good performance on these workloads provides a positive marketing message. A workload’s market-value is determined by the volume of deployments, multiplied with their average computer system investment. For example, web servers are generally considered inexpensive computer systems, and with their large volume represent a significant market. Database servers are not as frequent as web-servers, but significantly more expensive, creating a high value market. An extreme market are supercomputers; very few exist, but they are all very expensive.
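The market-value weighting described above (deployment volume multiplied by average computer system investment) can be illustrated with a toy calculation; the segment volumes and prices below are invented solely for illustration.

    # A toy illustration of market value = deployment volume x average system price.
    segments = {
        # segment: (deployments, average system price in USD) -- invented numbers
        "web servers":      (500_000,      5_000),
        "database servers": ( 50_000,    100_000),
        "supercomputers":   (     50, 20_000_000),
    }

    for name, (volume, avg_price) in segments.items():
        print(f"{name:17s} market value ~ ${volume * avg_price / 1e9:5.1f} B")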

2.2 Standardized performance evaluation

Benchmarking is an important aspect of computer system performance comparison. Over the past several decades the importance of objectively comparing computer system performance has paralleled the global expansion and adoption of computer systems. Ferrari (1978) already defined benchmarking as an essential step when purchasing a computer system - will the offered computer system meet the demands? In Jain (1991) the techniques of “benchmarketing” are discussed - the manipulation


of performance measurement results to support a predetermined conclusion. The two most prominent standard benchmark organizations; SPEC (www.spec.org, 2007) and TPC (www.tpc.org, 2007), came into being to provide a standardized set of benchmarks by which the performance of computer systems can be measured objectively. The two most prominent benchmarks used for performance comparison are the SPEC CPU suite and TPC -C. Both will be discussed in more detail below. As Ferrari (1978) already outlined, benchmarking for making purchasing decisions requires benchmark tests that address the requirements of the intended workload. The TPC benchmarks are targeted at transactional workloads, most commonly found in a business environment. The most common TPC benchmarks are TPC -A/B/C for database systems, TPC -H for data warehouse, and TPC -W for web-server environments. TPC -A and B have been retired in favor of TPC -C, while in turn TPC -C will be retired soon in favor of TPC -E (www.tpc.org, 2007). These transactional benchmarks are specifically designed to test the performance of computer systems; they test the performance of the processors and the surrounding system (disks, network, memory) simultaneously. In contrast, the SPEC CPU 2000 benchmark suite primarily targets the performance of the processor and memory hierarchy. For the personal computer space the SPEC CPU 2000 has long been the driving benchmark. However, reviews of modern personal computers in trade-magazines and websites, e.g., PC-magazine (www.pcmag.com, 2007), AnandTech (www.anandtech.com, 2007), contain a gaggle of different benchmarks aimed at specific user classes. For example, there are spreadsheet and word-processing benchmarks aimed at the business user which measure the performance of the computer system on a standardized set of files. Specific gaming benchmarks measure the response time and frame-rate of popular computer games and are targeted at the serious gamer. Overall there are too many domain specific benchmarks to be taken into consideration during the design of computer system. Since the two benchmarks most commonly found in the processor and computer system design process are SPEC CPU 2000 and TPC -C, we now discuss them in more detail. 2.2.1

SPEC CPU 2000

Strictly speaking, SPEC CPU 2000 is not a benchmark, it is a collection of benchmarks. SPEC CPU 2000 provides a single performance metric for the integer and floating point performance of a processor. SPEC CPU 2000 was retired in February 2007 in favor of SPEC CPU 2006 (www.spec.org, 2007). In this work we refer to SPEC CPU 2000 since not many results are available for SPEC CPU 2006. As mentioned SPEC CPU is a collection of benchmarks. The composition of the SPEC CPU benchmark is decided by the SPEC committee and uses snippets of codes submitted by interested parties. These interested parties are for example hardware vendors, academic institutions or individuals. Each component benchmark is selected after a lengthy and sometimes political selection process by the SPEC committee, composed of people representing industry interests. The current committee has people from the

CINT 2000
  164.gzip      Data compression utility
  175.vpr       FPGA circuit placement and routing
  176.gcc       C compiler
  181.mcf       Minimum cost network flow solver
  186.crafty    Chess program
  197.parser    Natural language processing
  252.eon       Ray tracing
  253.perlbmk   Perl
  254.gap       Computational group theory
  255.vortex    Object Oriented Database
  256.bzip2     Data compression utility
  300.twolf     Place and route simulator

CFP 2000
  168.wupwise   Quantum chromodynamics
  171.swim      Shallow water modeling
  172.mgrid     Multi-grid solver in 3D potential field
  173.applu     Parabolic/elliptic partial differential equations
  177.mesa      3D Graphics library
  178.galgel    Fluid dynamics: analysis of oscillatory instability
  179.art       Neural network simulation; adaptive resonance theory
  183.equake    Finite element simulation; earthquake modeling
  187.facerec   Computer vision: recognizes faces
  188.ammp      Computational chemistry
  189.lucas     Number theory: primality testing
  191.fma3d     Finite element crash simulation
  200.sixtrack  Particle accelerator model
  301.apsi      Solves problems regarding temperature, wind, velocity and distribution of pollutants

Tab. 2.1: SPEC CPU 2000 composition (www.spec.org, 2007)

hardware industry, e.g., AMD, HP, IBM, Intel, Sun, software e.g., Red Hat, Microsoft, ORACLE, and academia (www.spec.org, 2007). SPEC CPU 2000 is comprised of 26 component benchmarks, each with its own behavioral characteristics and utilizing the integer and/or floating point units of a processor. SPEC CPU 2000 is divided into two parts, CINT 2000 for primarily integer workloads and CFP 2000 for floating point workloads. CINT 2000 and CFP 2000 are based on computeintensive applications provided as source code. CINT 2000 contains eleven applications written in C and one in C++ (252.eon) that are used as benchmarks: CFP 2000 contains 14 applications (six Fortran-77, four Fortran-90 and four C) that are used as benchmarks. The composition of CINT 2000 and CFP 2000 are listed in Table 2.1. The SPEC CPU 2000 benchmark was for the period 2000-2006 the de facto benchmark used for processor performance comparison. The SPEC CPU 2000 is also very popular in academia, supporting a wide variety of research interests (Citron, 2003). The popularity of SPEC CPU 2000 comes with downsides. As explained by Jain (1991), standard performance benchmarks make attractive targets for specialized features, i.e., features that benefit benchmark performance but have no real value in real workloads. Frequently updating the benchmark specification and retiring workloads no longer current for the hardware, e.g., SPEC CPU 95, SPEC CPU 92 and recently SPEC

CPU 2000, are attempts to keep the benchmark relevant.
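Roughly speaking, a SPEC CPU-style figure of merit normalizes each component benchmark's run time against a fixed reference machine and aggregates the resulting ratios with a geometric mean, so that no single component dominates the suite score. The sketch below illustrates only this aggregation; it does not reproduce the official run and reporting rules, and the run times shown are invented.

    # Sketch of a SPEC CPU-style suite score: geometric mean of per-benchmark
    # ratios of a reference machine's run time to the measured run time.
    from math import prod

    def spec_style_score(ref_times, measured_times, scale=100.0):
        ratios = [scale * r / m for r, m in zip(ref_times, measured_times)]
        return prod(ratios) ** (1.0 / len(ratios))   # geometric mean

    if __name__ == "__main__":
        # Hypothetical reference and measured run times (seconds) for three benchmarks.
        ref = [1400.0, 1800.0, 1100.0]
        measured = [350.0, 500.0, 400.0]
        print(f"suite score: {spec_style_score(ref, measured):.1f}")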

2.2.2 TPC-C

As previously introduced, TPC -C is a transactional benchmark. The following is a description from www.tpc.org (2007): As an OLTP system benchmark, TPC -C simulates a complete environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses. TPC -C does not specify how to best implement an Order-Entry system. [...] While the benchmark portrays the activity of a wholesale supplier, TPC -C is not limited to the activity of any particular business segment, but, rather, represents any industry that must manage, sell, or distribute a product or service.[...] Like the transactions themselves, the frequency of the individual transactions are modeled after realistic scenarios. [...] The performance metric reported by TPC -C measures the number of orders that can be fully processed per minute and is expressed in tpm-C. [...] As of March 2007 the TPC -C world record exceeds 4 million tpmC (www.tpc.org, 2007). TPC -C as a benchmark relies heavily on the I/O capacity of the computer system for maximum performance. Computer systems claiming the TPC -C record are invariably large and expensive. The most significant trends visible in the market for commercial computers over the past ten years are larger processor caches (L2 and L3) as well as increased I/O bandwidth, the two properties crucial to good TPC -C performance. 2.2.3

Benchmark context

When TPC -C was defined in 1992, computer system performance barely reached 10,000 tpm-C. Current records exceed 4M tpm-C, and the benchmark will soon be retired in favor of TPC -E. The frequent updates to SPEC CPU highlight the same problem - benchmarks age. When benchmarks are first released they create a flurry of activity as every computer system vendor attempts to achieve maximum performance. Soon after, the first mistakes in the benchmark definition are found and taken advantage of. In turn the benchmark definition is updated to fix these mistakes. Over the life of a benchmark the definition is likely to be updated multiple times. Benchmarks from SPEC and TPC carry significant relevance in the marketplace. It is a near requirement for computer system manufacturers to demonstrate they can achieve good performance. As a result, these benchmarks have achieved a strong status within


the computer systems and processor design process - companies believe that they cannot afford to ignore them. At the same time, both SPEC and TPC benchmarks are not free from detractors. These detractors point to the effort made to achieve good results on these benchmarks compared to the effort afforded commercial workloads (Morgan, 2005). A common analogy is that of comparing performance between a Formula 1 race-car and a truck. The truck has to work economically while the Formula 1 car is afforded all performance enhancements possible.

2.3 Selecting the correct benchmarks for the design problem

Benchmark selection should depend on design goals. If the design goals are computer system related, the benchmarks should provide computer system related information. When design goals are processor related, the benchmarks should provided detailed information regarding processor performance. When selecting benchmarks, the scope of the design problem should match the scope of the benchmark, i.e., a processor oriented benchmark should not be expected to yield representative information of computer system requirements. In terms of the workload space, the workload space must be selected to fit the design goals, i.e., the workload space should be limited to the volume of interest (Dujmovi´c, 2001). Workloads and benchmarks that do not match these design goals should not be included on the program space. Any workload that does not match the design goals has the potential of stretching the workload space, increasing its size and thus the number of benchmarks required to provide sufficient coverage, without contributing to the quality of the final design (Dujmovi´c, 2001). The applicability of information obtained from benchmarks is related to the scope of that benchmark. Benchmarks evaluate performance since their execution on computer systems is ultimately limited by the capacity of critical resources. The scope of a benchmark is therefore defined by the resources required to achieve good performance. Good benchmarks target multiple resources in the processor or computer system; improving performance for these benchmarks thus requires optimization of these multiple resources (Vandierendonck and De Bosschere, 2004a). It is the nature of computer system optimization that execution bottlenecks move around. After each bottleneck is resolved another bottleneck crops up somewhere else. This process creates a vicious spiral towards faster and more capable processors and computer systems (Jain, 1991). The main issue with bottleneck resolution and the chosen solutions is that optimizing a processor or computer system for a single workload will make a system exclusively suited for that single workload. Over-optimizing a computer system or processor for a single workload can negatively impact the performance on other workloads. The scope of a benchmark determines the outer limit of its usefulness for optimization. Only bottlenecks within the scope of the benchmark will be found and resolved. Apart from the computer system hardware, the operating system also contributes to

Fig. 2.1: Workloads (+) and benchmarks (*) spanning a workload space.

the characterization of the workload. In an ideal computer this would not be the case, the only workload of interest would be the executing workload. The current trend of sharing computer systems between different applications and therefore between different workloads each with their own characteristics is made possible by advanced management features provided by the operating system (Ahmad et al., 2003). These features allow different workloads to sit side by side on the computer system without the workloads being aware of the arrangement. Such advanced features require that the operating system takes a more active role in the management of the computer system in, for example, limiting the resource utilization of one workload to make sure that such a resource is shared fairly between the different workloads. The impact of these arrangements is to create a composite workload, made out of the co-hosted workloads plus the overhead of the operating system needed to manage them. The increasing performance of computer systems makes the possibility of co-hosting more attractive, since with multiple applications on a single computer system better utilization and therefore efficiency is achieved. Traditionally the scope of software is to regard the operating system as immaterial to the workload and regard the workload as dominated by the single application. For the future we must likely take into account that composite workloads will become more common and therefore the scope of the software will change to include the operating system and its management features (Rosenblum, 2004).

2.4 Requirements for optimal benchmark sets

Fig. 2.2: Overview of benchmark set creation and relevant citations. Candidate workload selection (Ferrari, 1972; Skadron et al., 2003) takes the real workload space to the representative workload space; standard benchmark selection (Ferrari, 1972; Saavedra et al., 1989; Dujmović, 1991; Saavedra and Smith, 1996; Henning, 2000; Sherwood et al., 2002; Eeckhout et al., 2003a) yields the standard benchmark space; benchmark similarity analysis (Conte et al., 1996; Yi et al., 2003, 2005; Hoste et al., 2006) yields the optimal reduced benchmark space (Ferrari, 1978; Dujmović, 1998; Dujmović and Dujmović, 1998; KleinOsowski et al., 2001; Eeckhout et al., 2002, 2003b, 2005b; Vandierendonck and De Bosschere, 2004a, 2004b; Phansalkar et al., 2005; Joshi et al., 2006), which feeds processor and system design evaluation and the processor and system design space. The figure distinguishes simulation metrics from computer system metrics.

Choosing the best design benchmarks is equivalent to designing a benchmark set. A benchmark set is a collection of benchmark programs, each with a specific set of properties. By definition, each benchmark program is considered a workload. A common shorthand is to refer to benchmark programs as benchmarks. Workload characterization describes workload properties in metrics. As detailed in Dujmović (1998) a workload space is a space where points represent workloads. Each workload characterization can consequently be represented as a vector in the workload space spanned by the workload

characterization metrics. The concept of a workload space is common to all benchmark selection methods. While the metrics that span the workload space can differ between methods, there are common properties in the workload space that provide the foundation of benchmark selection. Figure 2.1 provides a two dimensional illustration of a workload space and its distribution of workloads and benchmarks. The creation of optimal benchmark sets can be viewed as a process detailed in Figure 2.2 (a legend explaining each shape can be found in Table A.3 on page 236). The real workload space at the left represents the actual usage of a computer system and its processors. The first step in the process is finding the candidate workloads and benchmarks for further reduction into standard benchmarks. Improving this step is the topic of this dissertation. The standard benchmark set is created from the representative workload space and summarized in the standard benchmark space. The standard benchmark space is spanned by the collection of all industry standard benchmarks, e.g., SPEC and TPC. Figure 2.2 also contains citations relevant to the steps performed in the selection process. In the following sections we review this literature. Our goal is to review the available techniques for selecting the representative workloads, selecting the standard benchmarks and choosing the optimal set of benchmarks. We also review how workloads are used for evaluation of design alternatives. We now return to the design of an optimal benchmark set. The most common property, fundamental to the selection process, is related to distance in the workload space. Differences between computer workloads can be expressed as distances between points in the workload space. Proximity of two points in the workload space is assumed to indicate similarity of corresponding computer workloads. Under that assumption, similar workloads are represented as clusters of points in the programs space. The process of


benchmark selection can then be seen as selecting a representative set from this workload space, aimed at simultaneous satisfaction of two goals (Dujmovi´c, 1998): 1. benchmark workloads should be a good functional representative of a given universe of real workloads, 2. benchmark workloads should yield the same distribution of the utilization of system resources as real workloads. There are a number of requirements an optimal set of benchmarks should satisfy; Benchmark sets are based on a specific set of performance criteria. Two main groups of criteria can be identified: qualitative, and quantitative criteria. Qualitative criteria specify global features of benchmark sets and include the following five compliance requirements (compiled from Dujmovi´c (1998)): compliance with the goal of benchmarking - typical goals are performance evaluation of a given system, performance comparison of several competitive systems, standardized comparison and ranking of all commercially available systems in a given class, selection of the best system according to specific user requirements, resource consumption measurement and analysis, and performance tuning. In short, benchmark sets must alway support a clearly defined goal (also Ferrari (1978)). application area compliance - the benchmark sets are assumed to be good representatives of a desired application area, or combination thereof. Application areas can be related to the activity of typical users or they can be defined according to computer workload characteristics. Each benchmarking effort should be related to a given application area, and so should members of the benchmark set. workload model compliance - the design of benchmark sets includes the selection of the most appropriate workload model. The workload models can be physical, virtual and functional (Ferrari et al., 1983). The selection of workload models must be justified by specific requirements of a given benchmarking problem. A desired workload model can also be specified during the evaluation of benchmark sets. Once the desired workload model is selected the benchmark suite must satisfy the workload model compliance criterion. Functional and virtual workload models are simpler and more frequent than physical models. Different functional or virtual workloads can yield very similar usage of hardware and software resources and in such cases their functional/virtual difference becomes insignificant. In such cases we need physical workload models, because they are quantitative and provide measures of redundancy between component benchmarks of a benchmark set. Generally, there is no justification for using redundant workloads because the cost of benchmarking increases without a corresponding increase in benefits. hardware platform compliance - specifies a target hardware category and a computer architecture for which a benchmark set is designed. The most frequent hardware


software environment compliance - analogous to the hardware platform compliance criterion. It specifies the desired operating system, user interface, programming languages, database systems, communication software and program development tools that benchmark workloads must use. This criterion identifies all relevant software resources and takes care that they are properly used. Qualitative criteria help to specify a general framework and guidelines for benchmarks. However, specific requirements for a set of benchmarks must be defined using quantitative criteria. The main advantage of quantitative criteria is that they facilitate proving that a benchmark suite satisfies some specific requirements. Quantitative criteria are related to the distribution of component benchmark programs in the workload space, and include the following five characteristics desired of optimal benchmark sets: size - the fundamental quantitative criterion. It is defined as the smallest circumscribed hypersphere containing all benchmarks. Each benchmark set has a central benchmark and a most peripheral benchmark. The central benchmark has maximum similarity to all workloads in the group, the peripheral the lowest. Ideally the benchmark hypersphere has approximately the same size as the hypersphere of relevant workloads. completeness - can be evaluated using a coverage indicator. The smaller the coverage the more incomplete the benchmark set. Coverage can be understood as how well the benchmark set matches the set of relevant real workloads. density - the density of the benchmark set is the number of benchmarks per unit of the benchmark space. granularity - is the ratio between the number of benchmarks in the set and the minimum necessary number of benchmarks, based on the number of different computer resources used in the suite. redundancy - workloads are considered redundant if their differences are small. In such cases it is difficult to justify using similar workloads because their contribution to the cost is higher than their contribution to the comparison of competitive systems. Removing redundant benchmarks greatly increases the efficiency of the design process, either shortening the design time or allowing a more thorough exploration of design alternatives. These criteria are interdependent. If a benchmark set fails one of the requirements, say completeness, this can be remedied by adding another benchmark. Adding a benchmark increases its size and density, and may even increase its redundancy. An optimal



Fig. 2.3: Benchmark distribution errors in a workload space. (Dujmovi´c, 2001)

An optimal benchmark set is defined against criteria which define the context of optimal. We define a representative benchmark set as an optimal benchmark set given a specific set of criteria. Benchmark set representativeness therefore depends on the granularity required; if a high degree of completeness is required, then a small size should not be expected. The workload space is the essential area of comparison. In order for the workload space to be representative, it must be constructed using metrics relevant to performance (Dujmović, 2001). The five characteristics of benchmark sets help in selecting the required benchmarks. A representative benchmark set can thus be found by applying the five characteristics to a workload space containing both workloads and benchmarks. There are a number of preventable mistakes in benchmark distribution over that workload space (Dujmović, 2001), illustrated in Figure 2.3:

Insufficient size - benchmark sets that contain many similar benchmarks often fail to provide coverage for outliers in the workload space.

Excessive redundancy - benchmark sets have concentrated benchmarks in some parts of the workload space, but other parts of the workload space are not covered at all.

Outliers - a special case of an irregular distribution. Most benchmarks form a dense cluster in the workload space, while a single outlying benchmark provides coverage for workloads in a different part. Either the benchmark set must be changed to include more workloads and provide better coverage of the whole workload space, or, in the case of an outlying workload, the workload set should be re-evaluated to make sure that all workloads are indeed relevant and required.

Non-uniform distribution - benchmark sets with non-uniform distributions arise for various reasons, some of which can be legitimate. A non-uniform distribution is a problem if, and only if, it does not match the underlying distribution of workloads. In other words, a non-uniform distribution is only a problem if it has low coverage.

The benchmark set used for the design, analysis and evaluation of computer systems and processors should demonstrate proper coverage of the workload space. Benchmarks outside the areas populated with real workloads run the risk of only adding to the analysis cost without providing valuable information. In the worst case, these workloads might lead to the wrong conclusions. Insufficient coverage of certain workloads in turn runs the risk of missing significant workload characteristics.

2.5 Finding benchmarks in practice

Benchmarks selected for the quantitative evaluation of computer system components should use the same resources as the applications they represent. The benchmarks should also provide inputs that are representative of the relevant workloads. As mentioned earlier, benchmarks can be representative in two ways: (i) they accurately predict application performance, (ii) they accurately represent application characteristics. The workload space for these two ways is different. In the first case the workload space is spanned by the relative performance indexes, while in the second case the workload space is spanned by the measurable application characteristics. In both cases, representative workloads will be in close proximity to each other in the workload space, since the workload and the benchmark should have comparable measurable outcomes and these outcomes are measured in the metrics that span the workload space.

2.5.1 Predicting application performance

Predicting application performance requires that a benchmark accurately reflects the dominant resource utilization in an application. It does not matter if the prediction is made based on independent measurements for each computer system resource. Benchmarks that accurately predict application performance are therefore not necessarily similar to those applications. Saavedra-Barrera et al. (1989) introduce a uniform model for machine comparison using high level parameters representing an abstract Fortran machine. The abstract Fortran machine executes basic blocks which are often repeated snippets of Fortran code. The method allows comparison of different machine architectures and performance projections for applications with a known composition of Fortran basic blocks. Similar


applications have a similar composition of basic block and basic block frequency. The basic blocks are the minimum components needed to provide a performance comparison of a computer system. Saavedra and Smith (1996) further investigate the use of the abstract Fortran machine and evaluate which operations dominate benchmark results. The authors specifically address the question of similarity. First they evaluate similarity based on dynamic statistics, i.e., procedure calls, if statements, branches and iterations. These are grouped into 13 reduced parameters that represent specific operations and sensitivities of the Fortran programming language. Based on the implementation of the computer system, some of these reduced parameters can be identical. To compare benchmarks using these reduced parameters, the authors introduce the squared Euclidean distance, where every dimension is weighted according to the average run-time contributed by that parameter, averaged over the set of all programs. While Saavedra and Smith (1996) perform a comparative study of computer system performance using the execution times of standardized software blocks, their main goal is the ability to correctly predict program execution time between different computer systems. Their workload space is constructed using the execution rates of the different software components as the principal axes. They find that predictions made by the Euclidean distance in the workload space compare well with those found using execution time similarity. The section detailing the limitations of the model notes that the absence of implementation specific information like cache and TLB misses is a source of errors. Neither can they correctly attribute performance differences based on fundamental differences between the architectures, i.e., comparing scalar and vector processors. Many of the finer details of the processor are hidden in the execution time. While the execution time is important, the proposed method cannot assist in further improving the execution time since it does not provide designers with processor specific feedback. Another identified source of errors corresponds to limitations in the measuring tools and environmental influences e.g., resolution and intrusiveness of the clock, random noise and external events like page faults and interrupts. These random events add noise to the measurements. The authors also note that their methodology of using basic blocks is limited to the execution of un-optimized code, but they claim that in general their predictions hold true for optimized code. The difference between un-optimized and optimized code is likely to increase as the quality of compilers and optimizers further improves. Current optimizers can already move memory references around in the execution path of the software; such moves can span hundreds of instructions and thereby overlap several basic blocks. This fundamentally changes the performance of optimized code relative to un-optimized code in unpredictable ways (Saavedra and Smith, 1996). Dujmovi´c and Dujmovi´c (1998) evaluate the performance predicting properties of the SPEC CPU benchmark suite for different computer systems. The authors introduce a method that evaluates the difference between benchmarks based on resource utilization (“white box”) or execution time (“black box”). The terms white and black box reflect

the amount of information available from the computer systems used during evaluation. The white box model requires details of workload execution measured on the particular system and therefore reveals details about the computer system. The black box model requires only the results, i.e., the timing information, of the different benchmarks and performs the comparative analysis without any insight into the actual characteristics of the hardware. They analyze the evolution of the SPEC CPU benchmarks from 1989, 1992 and 1995. The research illustrates that the SPEC CPU suite has improved in uncovering differences between computer system architectures and that the SPEC CPU suite as a whole has improved over time for performance comparisons of systems. The authors note that many of the results in the SPEC CPU benchmark suite are the result of very precise performance tuning and as such are not representative of a realistic production environment. In Mirghafori et al. (1995) the optimizations used for the SPEC CPU benchmarks are reviewed. The results show that a typical user will only get between 65 and 86% of the peak performance of the SPECpeak ratings. While Mirghafori et al. (1995) is now more than ten years old, the significance of SPEC CPU has not diminished and neither has the desire of computer system manufacturers to achieve impressive SPEC CPU results. Therefore, the tension between benchmark optimization and real user experience remains present. Dujmović and Dujmović (1998) recommend that the next generation of SPEC CPU benchmarks be designed in a way that enables the computation of workload-specific SPECmark indicators. The current implementations of the SPEC benchmarks (2000)¹ do not contain such workload specific indicators, nor has the difference between real user experience and SPEC CPU specific optimization and tuning been addressed, other than Skadron et al. (2003) noting the lack thereof.

¹ SPEC CPU 2006 was released in August 2006. No major analysis thereof had been published at the time of writing.
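The white box idea of predicting performance from resource usage, in the spirit of the abstract Fortran machine of Saavedra-Barrera et al. discussed above, can be sketched as follows: a program is summarized by counts of basic operations, a machine by the measured cost of each operation, and the predicted run time is their dot product. All operation counts, per-operation costs and machine names below are invented for illustration only.

```python
# Minimal "white box" prediction sketch: predicted run time is the dot product
# of per-program operation counts and per-machine operation costs.
import numpy as np

# per-program operation counts (rows: programs; columns: operation types)
op_counts = np.array([
    [5e8, 2e8, 1e7],   # program A: arithmetic-heavy mix
    [1e8, 6e8, 5e7],   # program B: memory-heavy mix
])

# per-operation cost in seconds, measured on two hypothetical machines
machine_costs = {
    "machine1": np.array([1.0e-9, 3.0e-9, 2.0e-9]),
    "machine2": np.array([0.8e-9, 4.0e-9, 1.5e-9]),
}

for name, costs in machine_costs.items():
    predicted = op_counts @ costs          # predicted seconds per program
    print(name, [f"{t:.3f}s" for t in predicted])
```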

2.5.2 Representing application characteristics

Accurately representing application characteristics in a benchmark is much harder. Application performance does not only depend on computer system resource bottlenecks but also on the interaction between these bottlenecks. For optimal results, benchmarks should therefore represent the behavioral characteristics of the application such that the effect of the benchmark on the computer system and/or processor is similar to the effect of the real application (Saavedra and Smith, 1996). Representing application characteristics in a benchmark requires a good understanding of these characteristics and a thorough understanding of how best to characterize them. There are numerous interdependencies in application behavior that can be hard to represent in a benchmark (Saavedra and Smith, 1996). As introduced in Section 2.1, the design goals of the computer system and processor guide the selection of both relevant real workloads and benchmarks. Let us assume that the relevant real workloads have been translated into usable and representative bench-
marks. A major influence on computer system design is the desire to achieve good performance on a variety of industry standard benchmarks. These industry standard benchmarks are independent of relevant real workloads. The superset of design benchmarks is the combination of application and industry standard benchmarks. The combined scope of the benchmark superset must cover the design goals. Ferrari (1972) laid the foundation for representing workload characteristics in benchmark suites. The majority of workload characterization papers since all emphasize the necessity of representing workload characteristics accurately, e.g., Agrawala et al. (1976); Calzarossa and Serazzi (1993); Arlitt and Williamson (1996); Burton and Kelly (1998); John et al. (1998); Calzarossa et al. (2000); Menascé (2003). The majority of papers that deal with distilling representative workload characteristics for creating benchmarks, concentrate on only a specific area. This usually coincides with the authors’ area of interest. For example Hodges and Stewart (1982); Alexander et al. (1987); Khalil et al. (1990); Pasquale et al. (1991); Berry (1992); Yu et al. (1992); Kotz and Nieuwejaar (1994); Arlitt and Williamson (1995, 1997); Arlitt and Jin (2000); Chow et al. (2001); Luo and John (2001); Wang et al. (2003) and Maxiaguine et al. (2004), represent a cross-section of workload characterization papers that drive benchmark creation. Each paper attacks either a specific workload (high performance computing, web-server, file-servers, e-commerce) or a specific architecture (super-computer, multi-processor, web-server). All represent the dominant characteristics of the applications and ignore the lesser. While reducing the number of characteristics under consideration is certainly appropriate for these specific papers, it does not lead to a coherent representation of the whole workload space. As such, no published work was found on approaches that attack the workload characterization problem for the whole workload space. We wish to understand the workload characteristics across the whole workload space, not in the dominant performance inhibitors for sections of the workload space.

2.6 Reducing redundancy in benchmark sets

One of the dominant external constraints is the cost of benchmark evaluation: the cost of fully characterizing system behavior under a specific benchmark. Removing redundant benchmarks from the benchmark set, or removing redundant parts from a benchmark, reduces the total size of the benchmark set and thus reduces the evaluation time and cost. Any evaluation time reduction allows more design iterations to be made, or reduces the total design and evaluation time and cost.

2.6.1 Reducing simulation time as motivation

Execution driven simulators of modern computer systems are faced with two problems: (i) the performance of modern processors is high, easily processing one billion instructions per second, (ii) the complexity of modern processors is high, slowing down de-


tailed execution driven simulation. The net effect of these problems is that detailed execution driven simulation is possible for only a few benchmarks (KleinOsowski et al., 2001). Clearly it is desirable to obtain as much relevant performance information from benchmarks that are simulated in detail. In order to improve the signal-to-noise ratio of benchmark simulation, it is important to remove those benchmarks, or those parts of benchmarks, that do not add information. Benchmarks in simulation do not add information if either their behavior is similar to another, already simulated benchmark, or the benchmark is in some intermediary phase and not exercising the computer system to full capacity. Sub-setting a benchmark set is the process whereby the benchmark set is reduced into a smaller set, such that the smaller set is still representative of the majority of the benchmark set features. Every subset has the potential of reducing the total available amount of information from the benchmark. At the same time, it is quite possible that the benchmark set can be reduced in size, without losing any information. Removing redundancy and sub-setting are the preferred methods for reducing a benchmark set to a size that can be simulated within reasonable time and effort. KleinOsowski et al. (2001) analyze the SPEC CPU 2000 benchmark for redundancy. They observe that the SPEC organization, in an effort to keep up with the rapid progress of computer systems, chose to dramatically increase the run-times of the new SPEC CPU 2000 benchmark programs, compared to the run-times of SPEC CPU 1995. These long run-times are beneficial when testing performance on actual computer systems. However, when evaluating new computer architectures using detailed execution-driven simulators, the long run-times of the SPEC CPU 2000 benchmarks result in unreasonably long simulation times. Quoting KleinOsowski et al. (2001), reasonable execution times for simulation-based computer architecture research come in a few flavors: a) We want a short simulation time (on the order of minutes) to help debug the simulator and do quick tests. b) We want intermediate length simulation time (of a few hours) for more detailed testing of the simulator and to obtain preliminary performance results. c) We want a complete simulation (of no more than a few days) using a large, realistic input set to obtain true performance statistics for the architecture design under test. Items (a) and (b) do not have to match the execution profile of the original full input set that closely, although we would prefer if (b) was reasonably close. For accurate architectural research simulations, however, we need (c) to match the profile of the original full input set to within an acceptable level as measured using an appropriate statistical test. KleinOsowski et al. (2001) identify a clear need to reduce the simulation time, without compromising the value of the simulations. Simplifying the simulator is not an option since it compromises the value of the results. The best option is therefore to find a quantitatively defensible way to reduce the input datasets, and, consequently, the runtimes, of the SPEC CPU 2000 benchmarks. The outlined approach looks at the fraction

of total execution time spent in functions as measured by software profiling, the instruction mix as measured using a simulator, and the cache miss rates as measured with a cache simulator. KleinOsowski et al. (2001) find that they can obtain small datasets that reasonably mimic the behavior of the real dataset used for the SPEC CPU 2000 benchmarks. These smaller datasets require reasonable simulation times. They conclude, however, that reducing the datasets alone is not sufficient to perfectly match the execution profiles of the original datasets and the associated program behavior.

2.6.2 Approaches for removing benchmark set redundancy

Ferrari (1978) first discussed reducing the benchmark set to the smallest collection possible without losing generality. Central to the reduction of the benchmark set is the concept of similarity. For the goal of benchmark set reduction, similarity can be understood as the measure used to determine which benchmarks or parts of a benchmark can be removed from the set without losing information. The central premise of benchmark reduction is that benchmarks with similar characteristics are indeed similar, i.e., benchmarks with similar characteristics can be used interchangeably within computer system and processor design and each will yield the same results (Agrawala and Mohr, 1975; Calzarossa and Ferrari, 1985; Saavedra-Barrera et al., 1989; Saavedra and Smith, 1996; Eeckhout et al., 2002, 2003b; Phansalkar et al., 2005b). The level of interchangeability is determined by the extent of the similarities between workloads (Vandierendonck and De Bosschere, 2004b). Similarity plays an important role in reducing the size of benchmarks sets by allowing the removal of the most similar benchmarks (Saavedra-Barrera et al., 1989; Eeckhout et al., 2003b; Vandierendonck and De Bosschere, 2004b; Phansalkar et al., 2005b). Similarity can be viewed as a small distance between two workloads in the workload space W . Following the examples of Section 2.5, two benchmarks can be similar if they: (i) provide the same ranking of computer systems (Dujmovi´c and Dujmovi´c, 1998); (ii) have a similar composition of software components (Saavedra-Barrera et al., 1989; Saavedra and Smith, 1996); or (iii) create the same level of utilization on relevant computer system resources (KleinOsowski et al., 2001; Eeckhout et al., 2002, 2005b). Providing the same ranking of computer systems (i) is relevant for comparing computer systems, but does not provide any insight into the dominant characteristics of the workloads. As such, similarity based on performance prediction is a poor guide for architecture design choices. The composition of software components (ii) can be used to determine the similarity of workloads, but only to determine the relative speed-up of computer systems using those software components. For computer system and processor design, the execution properties of benchmarks are equally dependent on both the benchmark and the dataset, rendering software similarity insufficient (Eeckhout et al., 2002). Thus (iii) remains as the most likely candidate for determining similarity. The metrics used to span the workload space W must be relevant. Clearly metrics that have nothing to do with performance should not be used to span a workload space that


compares performance characteristics. If choices are made between available metrics these choices should be elaborated, since choosing metrics can introduce unintended bias. Many papers on workload or benchmark similarity offer no other substantiation for metric choice other than intuitive reasoning or reasonable choice (Eeckhout et al., 2005a; Phansalkar et al., 2005b; Hoste et al., 2006; Joshi et al., 2006). We believe that good reasons for selecting metrics for workload similarity are availability and a demonstrable relevance to the design question. Yi et al. (2003) and Yi et al. (2005) present a strong case for the use of statistical methods when choosing parameters for computer system simulation. The two papers investigated improving simulation results by using a Plackett and Burman test design to determine the most important parameters. In addition they use the same Plackett and Burman design to find the set of benchmarks with the largest effect on the processor under evaluation. They indicate the importance of parameter selection. Eeckhout et al. (2002) address the issue of selecting representative program-input pairs. In their view, the composition of a workload involves two issues: (i) which benchmarks to select and (ii) which input datasets to select per benchmark. They too observe that it is impossible to select a large number of benchmarks and respective input sets due to the large number of instructions per benchmark and the limitations on available simulation time. Eeckhout et al. (2002) span the workload space using a p-dimensional vector with p the number of important program characteristics that affect performance, e.g., branch prediction accuracy, cache missrates, instruction-level parallelism. Obviously, the number of metrics, p, is too large to display the workload design space Wp understandably. The authors note that correlation exists between these variables which reduces the ability to understand what program characteristics are fundamental to diversity in the workload space. They use the statistical analysis technique Principal Component Analysis (PCA) combined with cluster analysis to efficiently explore the workload space. PCA reduces the dimensionality of the workload space while preserving its principal characteristics. Clustering is an algorithm-based technique of partitioning the workload space into meaningful clusters of benchmarks. The authors note that selecting program characteristics that do not affect performance, might discriminate benchmark-input pairs in ways that provides no information about workload execution. They emphasize that they want closely clustered benchmark-input pairs to behave similarly so that a single benchmark-input pair can be chosen to represent the cluster, thus fulfilling the objective of reducing the number of workloads under consideration. Within the program space the distance in the workload space between program-input pairs can be used to determine their behavioral differences. Representative datasets can then be selected for a given benchmark by choosing the set with the least redundancy. Their choice of metrics is not substantiated other than noting that program characteristics that affect performance are important and selecting characteristics that do not affect performance yield no additional information. Unfortunately, Eeckhout et al. (2002) provide no guidance on how these metrics should be chosen. The


metrics chosen are obvious candidates, based on the literature of processor design and evaluation. Eeckhout et al. (2002) evaluate the reductions obtained by comparing the quality of the predictions made between a full set of simulated benchmark-input pairs and the reduced set, over a range of different processor variations for the gcc portion of SPECint95. They observed a good match. In a follow-on paper, Eeckhout et al. (2005b) tweak their methodology by changing to Independent Component Analysis (ICA) from Principal Component Analysis. They argue that ICA is a better alternative to PCA as it does not assume that the original dataset has a Gaussian distribution, which allows ICA to better find the important axes in the workload space. In an erratum, Eeckhout et al. (2005c) note that incorrect application of ICA exaggerated the differences between ICA and PCA. For their dataset, they found only a marginal difference between the ICA and PCA results. Phansalkar et al. (2005b) further extend the PCA methodology by applying it rigorously to all SPEC CPU benchmark suites, and comparing the differences over time. They focus on a set of micro-architecture independent metrics, using instruction mix, dynamic basic block size, branch direction, taken branches, forward taken branches, dependency distance, data temporal locality, data spatial locality, instruction temporal locality and instruction spatial locality. Their metric selection is based only on intuitive reasoning on how the metrics can affect performance; no quantitative analysis or literature reference is provided that supports their choice. They motivate their choice for micro-architecture independent metrics as follows: micro-architecture independent metrics allow for a comparison between programs by understanding their inherent characteristics independent from features of the underlying implementation of a processor. In contrast, ranking programs based on micro-architecture-dependent metrics can be misleading for future designs because a benchmark might have looked redundant in analysis merely because the existing architectures did equally well (or worse) on them, and not because that benchmark was not unique. To obtain the metrics, all benchmarks were executed in a simulator instrumented to provide the required statistics. A parallel application of PCA and clustering for sub-setting a benchmark set is found in Vandierendonck and De Bosschere (2004b); Phansalkar et al. (2005b) and Joshi et al. (2006). Each evaluate methods to subset the SPEC CPU benchmark set. The first evaluation is based on the SPECmark of the different benchmarks on different computer system. If the algorithmic construction of benchmark subsets works as desired, the created subsets should provide about the same level of information as the original set. Here the workload space Wp is based solely on the SPECmark. Their chosen subset quality criterion is the ability of the subset to correctly rank the computer systems of the original set. The authors observe that in many cases, as the size of the subset decreases, the quality of the subset decreases as well. Even more notable is the discovery that the worst performing subsets can be significantly larger than the best performing small subsets. To further investigate this effect, they evaluate the quality of subsets when based on workload parameters described in Eeckhout et al. (2002, 2003c) and Vandierendonck

and De Bosschere (2004a). Again they confirm that the reduction of the size of the subset can lead to a reduction in the quality, but that this is by no means a given. With the metrics of the latter papers, smaller subsets can still be quite accurate, while larger subsets fare poorly. The main conclusion of their paper is that sub-setting benchmark suites based on their SPECmarks is hard, even when the statistical methods are automated. They further note that the representativeness of a benchmark subset is mostly determined by the procedure used to cluster the benchmarks in the workload space and that the characteristics used to place the benchmarks in the workload space have a much smaller impact.

2.6.3 Evaluating benchmark quality

Vandierendonck and De Bosschere (2004c) and Vandierendonck and De Bosschere (2004a) evaluate a number of benchmarks. The authors note that a significant number of benchmarks are overly dependent on a single resource in the computer system and therefore any optimization based on these benchmarks can be artificial. In Vandierendonck and De Bosschere (2004a) the authors coin the terms eccentric and fragile as labels for benchmarks that are sensitive to simple improvements or tweaks in processor design (eccentric) or that are susceptible to targeted work-arounds in dedicated tuning efforts (fragile). For proper representation of the workload space, both eccentric and fragile benchmarks should be removed, for it is unlikely that they are truly representative of real workloads.

2.6.4 Summarizing sub-setting techniques

From the above we can conclude that most techniques for sub-setting benchmark sets follow these steps (illustrated in Figure 2.4):

1. somehow select relevant metrics. These metrics preferably are related to the research goal.

2. reduce the dimensionality of the workload space spanned by the metrics without losing relevant information. Common techniques are Principal component analysis and Independent component analysis.

3. partition the reduced workload space using a clustering algorithm into groups of benchmarks. The clusters group together benchmarks considered similar by the clustering algorithm.

4. find the representative benchmarks by taking the center most benchmark of each cluster.

Each step of this technique requires numerous choices. Notwithstanding the need to substantiate the choices, we feel that the outlined steps are appropriate for the problem statement of this thesis. A minimal sketch of these four steps is given below.
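A minimal sketch of these steps is shown below, assuming scikit-learn and a matrix with one row per benchmark and one column per measured metric. The synthetic data, the number of retained principal components and the number of clusters are illustrative assumptions, not recommended settings.

```python
# Sketch of the four sub-setting steps: normalize, reduce, cluster, select.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
metrics = rng.normal(size=(30, 12))          # 30 benchmarks, 12 raw metrics

# steps 1-2: normalize the metrics, then reduce dimensionality with PCA
reduced = PCA(n_components=4).fit_transform(StandardScaler().fit_transform(metrics))

# step 3: partition the reduced workload space into clusters
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(reduced)

# step 4: pick the benchmark closest to each cluster centre as representative
representatives = []
for c in range(km.n_clusters):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(reduced[members] - km.cluster_centers_[c], axis=1)
    representatives.append(int(members[np.argmin(dists)]))
print("representative benchmark indices:", sorted(representatives))
```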

[Figure 2.4 shows the workflow: simulator and workload → metric selection and data collection → workload characterization data → metric normalization → dimension reduction (PCA or ICA) → reduced workload space → clustering (k-means) → cluster solution → centroid selection → representative benchmark set.]

Fig. 2.4: Summary workflow for sub-setting, showing dimensionality reduction and clustering steps. Based on Eeckhout et al. (2003c) and Eeckhout et al. (2005b). A legend explaining each shape can be found in Table A.3 on page 236.


2.7 Summary of benchmark selection

In this chapter we reviewed the literature on approaches and techniques for benchmark selection in processor and computer system design. We find that benchmark selection is a complex process with conflicting requirements. The underlying conflict is the contrast between completeness, i.e., making sure that all relevant workloads are represented, and compactness, i.e., using the smallest available set of benchmarks without too much compromise on representativeness. The steps outlined in Section 2.1 make two important transitions. The first transition is from the general workload space to a set of key workloads, based on relevance and importance of the workload. The second transition is from those key workloads into representative benchmarks that capture the essence of the workload performance characteristics. In Section 2.4 we discussed the requirements for optimal benchmark sets. Optimal benchmark sets provide full representativeness with a minimum set of benchmarks. Common problems with benchmarks sets - insufficient size, excessive redundancy, outliers and non-uniform distributions - were discussed. In Section 2.5 we reflected upon the purpose of the benchmarks - they either predict performance, or they mimic behavioral characteristics. We found that for the purpose of processor and computer system design, performance prediction is much less relevant than the behavioral characteristics. Getting to an optimal set of representative benchmarks with the correct behavioral characteristics is discussed in Section 2.6. We discuss sub-setting of benchmark suites where redundant benchmarks are removed without losing generality. Sub-setting of benchmark suites is important because it can reduce the cost associated with performance analysis in computer system and processor design. The most important aspect of sub-setting is workload similarity - it is the basis upon which classification and subsetting takes place. Overall we find that the literature is uneven on the process of getting from real workload characteristics (the Real workload space of Figure 2.2 on page 38) to an optimal benchmark set (the Optimal reduced benchmark space). Skadron et al. (2003) already identified this and the need for a systematic approach. We consider workload similarity the means by which we can provide such a systematic approach. The cases discussed in Section 1.4 illustrated the need for such a systematic approach as a way to guard against human bias. In these cases, human bias regarding the importance of certain workloads and their characteristics was not sufficiently balanced by quantitative workload data, facilitating decisions detrimental to processor performance. In the next chapter we will take the first steps to defining a systematic approach for representative workload selection.

3. TOWARDS AN UNBIASED APPROACH FOR SELECTING REPRESENTATIVE WORKLOADS

ABSTRACT

Obtaining representative workload data is first of all an observational problem: how do we collect and process meaningful workload characterization data? Using the collected data we span a workload space and reduce its dimensionality. Within the reduced workload space the workloads are partitioned into similar groups. Representative workload selection then picks the most appropriate candidates from these groups.

[Research outline: following the deductive-hypothetic research strategy, the question is defined in Chapter 1 (the importance of workloads in computer system design), information is gathered in Chapter 2 (current approaches for workload selection), the hypothesis is formed in Chapter 3 (towards an unbiased approach for selecting representative workloads), the prescriptive model is constructed in Chapter 4, an early test on a benchmark set follows in Chapter 5, data are collected in Chapter 6 and analyzed in Chapter 7, conclusions are formulated in Chapter 8 (finding representative workloads in the measured dataset) and presented in Chapter 9 (evaluating workload similarity).]


The previous chapter presented an overview of existing approaches to representative benchmark selection. Central to representativeness is the concept of similarity. Similarity can be used not only for determining representativeness but also for classification of benchmarks into similar groups (Phansalkar et al., 2005b; Eeckhout et al., 2005b; Joshi et al., 2006). Skadron et al. (2003) identified the need for a systematic selection approach and the cases from Section 1.4 indicated the importance of removing bias. In this chapter we formulate an approach for the unbiased selection of representative workloads. This approach consists of a number of subsequent steps. Each step is formulated as a hypothesis, and the collection of hypotheses defines our approach. We build our case for using workload characterization data and develop the components needed in the following sections.

3.1 Approach blueprint

The goal of selecting representative workloads from real workloads parallels sub-setting of benchmark sets. The general approach summarized in Section 2.6.4 provides a blueprint for our approach. Our approach starts with an observational problem: how do we get our data? Next we solve a data reduction and verification problem: how do we know our data is correct and meaningful, i.e., does it really characterize workloads? Using the obtained data and workload characterization we then get to the problem we set out to solve: how do we find the representative workloads? We thus identify the following areas, illustrated in Figure 3.1:

collecting workload characterization data - All methods discussed in the previous chapter relied on a quantitative representation using several different performance related metrics. Collecting quantitative workload characterization data is the first step.

reducing workload characterization data - Collected data should be reduced to a standardized form. While simulation data is assumed to be a perfect representation of simulator state, observed data is not. Verification of data quality is therefore a necessary step to prevent erroneous observations from contributing to the workload space.

spanning the workload space - Choosing which metrics should be used to span the workload space defines much of the method. Bias in metric selection or adding unrelated metrics could impact the quality of the method results.

reducing workload space dimensionality - As with benchmark sub-setting we expect a certain amount of redundancy between the chosen workload metrics. Dimensionality reduction is an important step since we want the lowest number of dimensions without removing essential distinctive information.

workload partitioning - Partitioning the workload space into different clusters by applying a clustering algorithm is the main step in grouping workloads with similar characteristics. Each clustering algorithm has specific distinctive properties. The partitioning results therefore likely depend on the chosen algorithm.

representative workload selection - Commonly the centermost element of a cluster is chosen as the representative workload.

Figure 3.1 uses a grammar consistent throughout this thesis. Hexagons represent vector spaces, usually spanned by some form of workload characterization data where each workload is a vector of different metrics. Rounded rectangles represent actions or operations. Orange circles represent resources, from which workload characterization data is extracted. Blue circles represent algorithms or programs; the color blue represents existing artifacts or knowledge. Diamonds represent lists. Arrows naturally represent transitions and depict the flow through the graphs. A summary overview of the different components in a graph can be found in Table A.3 on page 236.

Our approach for representative workload selection requires the determination of workload similarity. Our approach for determining similarity between real workloads differs from the benchmark sub-setting approach primarily in the source of workload characterization data. Phansalkar et al. (2005b); Eeckhout et al. (2005b) and Joshi et al. (2006) all use simulation data as the basis for their sub-setting approach, but in Section 1.2 we described the infeasibility of capturing real workloads in simulation. Thus we need to consider how we get workload characterization data suitable for the proposed process. We discuss this in more detail in Section 3.2. Once we have a solid representation of the dataset we can follow the steps of dimensionality reduction and clustering, discussed in Section 2.6. An area of concern is the selection of metrics to use for spanning the workload space. Any bias inadvertently introduced into metric selection could bias the overall method in undesirable ways. We discuss ways to prevent this bias in Section 4.2.

3.2 Sources of representative metrics

The key enabler of representative workload selection is the concept of similarity or representativeness. Phansalkar et al. (2005b); Eeckhout et al. (2005b) and Joshi et al. (2006) express similarity as a function of the Euclidean distance in the reduced workload space. They consider representativeness to be functionally equivalent to similarity, i.e., if two workloads are in proximity to each other and grouped in the same cluster, then these workloads are interchangeable. In other words, the strength of their representativeness is the inverse of their separation in the workload space. Common to spanning the workload space is that the metrics are measured; they are not derived from an a priori analysis of the benchmark code or the executable. The described troubles of the Itanium processor and its reliance on compile time optimization (Section 1.4) indicate that a priori methods for determining workload similarity cannot be used since workload characteristics depend on the datasets used when executing the workload (Eeckhout et al., 2003c).

[Figure 3.1 shows the approach blueprint as a workflow: computer system and workload → raw data → collecting workload characterization data → reducing workload characterization data → workload characterization reduced data → spanning the workload space → workload characterization space → reducing workload space dimensionality → reduced workload space → workload partitioning → cluster solution → representative workload selection → representative workload set.]

Fig. 3.1: Approach blueprint

From the perspective of computer system and processor designers, the similarity of execution time changes between benchmarks on different computer systems is informative only in a comparative sense, since the changes cannot be quantitatively attributed to design differences. This is substantiated by the difficulties experienced by Vandierendonck and De Bosschere (2004b) as they attempted to subset the SPEC CPU 2000 benchmark suite based on the execution times of the member benchmarks on different computer systems. The most recent papers on similarity analysis of benchmarks, Eeckhout et al. (2005a); Phansalkar et al. (2005b); Hoste et al. (2006) and Joshi et al. (2006), all obtain their metrics from simulation. Workload characterization and performance evaluation, however, can use more than just simulation. Dujmović and Dujmović (1998) use execution times and details on resource utilization for their analysis. Keeton et al. (1998) and Duesterwald et al. (2003) characterize workload behavior using processor hardware counters. Hardware counters are a limited set of special-purpose registers built into most modern microprocessors. Hardware counters store the counts of hardware-related activities, like cache misses, memory stall cycles and retired instructions per processor. Ahn and Vetter (2002) evaluate multi-variate statistical techniques for extracting valuable information from the measured hardware counters.

Processor design relies heavily on the use of simulation. As a result an extensive debate has raged between the proponents of micro-architecture dependent and independent metrics. The proponents of micro-architecture dependent metrics point to the relative ease and the incremental nature of processor design to support their conviction that micro-architecture dependent metrics are good enough for processor design. The proponents of micro-architecture independent metrics in turn argue that the dependent metrics lack distinction, i.e., certain workload characteristics could be the result of different micro-architecture independent characteristics. Micro-architecture dependent metrics would therefore not be able to distinguish between them and choose those workloads to guide processor design choices (Phansalkar et al., 2005b).

The literature shows that there is value in both µ-architecture dependent metrics, e.g., Keeton et al. (1998); Ahn and Vetter (2002) and Duesterwald et al. (2003), and µ-architecture independent metrics, e.g., Eeckhout et al. (2005a); Phansalkar et al. (2005b); Hoste et al. (2006) and Joshi et al. (2006). We are of the opinion that they should be used to complement each other. We believe that the proponents of architecture independent metrics have a valid reason to point out the potential lack of distinction of micro-architecture dependent metrics, certainly in cases where the design deviates considerably from the original processor. We also believe that the current model of micro-architecture independent characterization using system simulation is limited by the system simulator: the workload space as a whole is too large to efficiently explore using simulation. However, such an exploration could well be within reach when using hardware counters and therefore µ-architecture dependent metrics. We will revisit this issue in Chapter 5.

In the context of our research we are working with established processor architectures.

We believe that companies like IBM, Intel and Sun Microsystems, with mature and established processor and system architectures, cannot easily change direction. These companies are limited by their customers' vested interest in long-term compatibility. Innovation on these mature processor architectures will be incremental rather than radical. As a result, we believe that in this context computer system based metrics, i.e., hardware dependent characterization, are a viable means of exploring the workload space for representative benchmarks. We also believe that the increasing number of hardware counters available on modern processors greatly reduces the number of workload characteristics where processor counters do not provide distinction. We also believe that it is foolish to suppose that the modern hardware counters can replace the simulation based architecture independent evaluation, just as we believe that it would be foolish of the proponents of architecture independent metrics to dismiss the value of using the hardware counters for vetting the workload space.

As a result, we present the case that representative workload selection should be made using quantitative data collected from the execution of real workloads. These measurements provide the best representation of workload behavior on computer systems. The remaining issue is the distinction between micro-architecture dependent and independent metrics. Phansalkar et al. (2005b) clearly favor micro-architecture independent metrics, for they believe that they better reflect the differences between the different benchmarks and are much less sensitive to the specifics of the hardware implementation found by Vandierendonck and De Bosschere (2004b). However, we believe that micro-architecture independent metrics are too hard to obtain for real workloads. In the literature the micro-architecture independent metrics are nearly all obtained using simulators, not an environment where real workloads live. Without the use of simulators, the only available alternative is micro-architecture dependent metrics. This however is not a limitation. A common constraint for the designers of computer systems and processors is backward compatibility, i.e., the new system must be able to execute workloads from the previous generation(s). As such, the required analysis of workloads can take place on the previous generation of systems, reducing the need for micro-architecture independent metrics. The use of µ-architecture dependent metrics comes with an additional advantage: all metrics are readily accessible through the operating system and we can therefore characterize workloads in their real environment. This leads to our research hypothesis regarding the source of metrics for determining similarity between real workloads:

Research Hypothesis 1: Processor hardware counters and operating system performance metrics provide sufficient distinctive ability for useful workload similarity analysis.

Missing in Research Hypothesis 1 is the determination of which metrics to use. We have to assume that hardware counters and system metrics paint a complete picture of the state of the computer system. In other words, we assume that there is no inherent bias in the complete set of hardware counters or system metrics.

This leads to the collection of a large number of computer system metrics. Since most of these computer system metrics will wax and wane as a function of computer system utilization, we expect a considerable amount of redundancy. Spanning a workload space with all metrics will therefore over-represent computer system utilization as categorizing factor. Before we can span the workload space W for similarity analysis we have to verify that the metrics used have merit. Thus:

Research Hypothesis 2: The dimension of the workload space W spanned by collected computer system metrics should be reduced to remove redundant or uninformative metrics.

Once W is spanned, the collected workloads and benchmarks can populate the workload space. Like Eeckhout et al. (2005a); Phansalkar et al. (2005b); Hoste et al. (2006) and Joshi et al. (2006) we propose the use of a direct correlation between the distance between workloads and their reciprocal similarity, thus:

Research Hypothesis 3: The Euclidean distance between workloads in the normalized workload space W is a quantitative measure of their similarity and representativeness.

Esposito et al. (1981); Calzarossa and Ferrari (1985); Ahn and Vetter (2002); Eeckhout et al. (2003c, 2002), amongst others, use statistical clustering as the tool of choice for finding representative workloads from identified groups in the workload space. Specifically they use the centroids of workload clusters to select their representative workloads. It is tempting to follow a similar approach; however, we must keep in mind that we intend to use our approach for hundreds, maybe thousands of workloads. It stands to reason that with many collected workloads we may identify an abundance of clusters. We should have neither too few nor too many workloads. All we need is sufficient coverage of each dimension in the reduced workload space. Simultaneously, requiring cluster membership prevents the selection of singleton workloads. Preferably we want representative workloads to be members of confined or high density clusters. Confined clusters have multiple members within a small volume of the workload space. Combining these aspects leads to the next research hypothesis:

Research Hypothesis 4: Representative workloads provide full coverage along each axis of the reduced workload space and are members of confined clusters.

The above definition implicitly guards against the selection of too few or too many workloads by requiring full coverage of all axes. If a workload provides no additional coverage, it should not be included in the representative set. As noted on the introduction of the original research questions, the value of the quantitative measure is not only our ability to define it, but also to use it on many hundreds or thousands of workloads and benchmarks.
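Research Hypothesis 2 can be illustrated with a small sketch that prunes obviously redundant metrics before the workload space is spanned: metrics with (near-)zero variance are dropped, and only one metric is kept from any pair whose absolute correlation exceeds a threshold. The threshold, the synthetic data and the helper function are illustrative assumptions and do not describe the reduction procedure developed later in this thesis.

```python
# Sketch: drop constant metrics and keep one metric per highly correlated pair.
import numpy as np

def prune_metrics(data, corr_threshold=0.95):
    """data: rows = workloads, columns = metrics. Returns kept column indices."""
    keep = [i for i in range(data.shape[1]) if data[:, i].std() > 1e-12]
    corr = np.corrcoef(data[:, keep], rowvar=False)
    kept = []
    for idx, col in enumerate(keep):
        # keep a metric only if it is not highly correlated with an already kept one
        if all(abs(corr[idx, keep.index(k)]) < corr_threshold for k in kept):
            kept.append(col)
    return kept

rng = np.random.default_rng(2)
base = rng.normal(size=(100, 3))
# last column is a near-copy of column 0 and should be pruned
data = np.column_stack([base, base[:, 0] * 2.0 + 0.01 * rng.normal(size=100)])
print("kept metric columns:", prune_metrics(data))
```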

Since our measure targets real workloads on real computer systems, it has to take into account the constraints of real environments. A useful representative measure requires that we have a data collection method that does not limit our ability to collect data. In other words, data collection has to meet the requirements and constraints of real workload environments. At a high level the questions are:

• how do we acquire our data?

• how do we process the data?

• how do we define similarity?

• how do we find representative workloads?

We observe that only data collection on workloads ties this approach to computer systems. In fact, the methodology of measuring data, reducing its dimensions, and finding representative items through similarity analysis, is a generic approach. We are now faced with defining what testable consequences validate the chosen approach. Since all hypotheses and requirements work towards answering the two research questions, the best testable consequences should be related to the research questions. Failure of one of the hypotheses will likely mean failure to answer the research question. We propose that workload characterization data are collected for a large number of different workloads and benchmarks. The summarized values of these different metrics populate a large workload space of raw metrics. Redundancy within the dataset must be investigated by evaluating the cross-correlation properties of the metrics. After spanning the workload space using the reduced metrics, similarity can be investigated within that space using the Euclidean distance. We can now define necessary consequences of this collection of hypotheses. One of the most obvious and easy to test is:

Consequence 1: Repeated characterizations of the same workload are proximate.

If this were not true, then the whole concept of workload similarity based on computer system metrics would be instantly invalidated. If repeated measurements of the same workload (for example a standardized benchmark) are scattered over the workload space W then Research Hypotheses 1 and 3 must be rejected. Continuing on with similarity is the following consequence:

Consequence 2: Similar workloads are proximate.

This consequence is a generalization of Consequence 1, since it defines similarity in a broader context, between different workloads. The process of reducing the workload space (with for example Principal component analysis or Independent component analysis) can lose information. Clustering algorithms applied to this reduced space can therefore lead to different partitions. However, these differences should be small.
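Consequence 1 can be checked directly once a normalized workload space is available; the sketch below contrasts the spread of five repeated characterizations of one (synthetic) workload with the typical pairwise distance in the space. The data, the normalization and the comparison are illustrative assumptions only.

```python
# Sketch: repeated runs of the same workload should sit much closer together
# than typical workload pairs in the normalized workload space.
import numpy as np

def zscore(data):
    # normalize each metric to zero mean and unit variance
    return (data - data.mean(axis=0)) / data.std(axis=0)

rng = np.random.default_rng(3)
others = rng.normal(size=(50, 6))                 # 50 unrelated workloads
repeats = 1.5 + 0.05 * rng.normal(size=(5, 6))    # 5 runs of the same workload
space = zscore(np.vstack([others, repeats]))      # normalized workload space

dists = np.linalg.norm(space[:, None, :] - space[None, :, :], axis=2)
all_pairs = np.triu_indices(len(space), k=1)
repeat_spread = dists[-5:, -5:][np.triu_indices(5, k=1)].max()
print(f"max distance between repeated runs: {repeat_spread:.2f}")
print(f"median distance between all pairs : {np.median(dists[all_pairs]):.2f}")
```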

Consequence 3: Repeated determination of representative workloads using the same dataset should yield similar results.

Some variation can be expected, for example with workloads that are very close together. If the method does not yield consistent results for the majority of the workloads it is obviously not stable enough to be useful. This would reject Research Hypothesis 4. Some consequences are related to the use of the word "useful" in Research Hypothesis 1. In the context of this thesis we strive to select representative workloads for computer system design and analysis using system metrics. "Useful" in this context means that similarity found using the computer system metrics leads to a selection of workloads that spans the workload space. One consequence refers to the usability of these representative workloads:

Consequence 4: The set of selected representative workloads uniformly covers the workload space of collected workloads and benchmarks.

This can be verified once the procedure has completed. The other meaning of "useful" is that the selected set of representative workloads leads to better system design. For this to be true the set of selected workloads would each have to represent unique behavioral characteristics. In other words:

Consequence 5: The set of representative workloads selected using computer system metrics shows only minimal similarity when evaluated using micro-architecture independent metrics.

Evaluating this last consequence is beyond the scope of this thesis since it requires the results of this thesis plus a simulation framework for detailed workload evaluation. This is best left for future research.
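Consequence 4 and Research Hypothesis 4 suggest a simple sanity check: along every axis of the reduced workload space, the selected representatives should span most of the range covered by the full set of workloads. The sketch below, with synthetic data and an arbitrary 80% threshold, is one hedged way such a check could look.

```python
# Sketch: per-axis coverage of the full workload range by the selected set.
import numpy as np

def axis_coverage(all_points, selected_points):
    """Per-axis fraction of the full range covered by the selected set."""
    full = all_points.max(axis=0) - all_points.min(axis=0)
    sel = selected_points.max(axis=0) - selected_points.min(axis=0)
    return sel / np.where(full == 0, 1, full)

rng = np.random.default_rng(4)
reduced_space = rng.normal(size=(300, 4))        # all workloads, 4 reduced axes
representatives = reduced_space[rng.choice(300, size=15, replace=False)]

cov = axis_coverage(reduced_space, representatives)
print("per-axis coverage:", np.round(cov, 2))
print("adequate" if cov.min() > 0.8 else "coverage gap on at least one axis")
```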

3.3 Requirements

This chapter outlines the process underlying the selection of benchmarks and workloads for computer system and processor design. From the reviewed literature we extracted requirements regarding the selection of metrics, the acquisition of data and their external constraints. The following requirements provide practical constraints on the research hypotheses and consequences of Chapter 3. Research Hypothesis 1 proposes the use of hardware counters and operating system performance metrics as the basis for our approach. The presence of these metrics and counters is therefore the observational tool by which we collect data (Kuhn, 1962). As with all observational tools and their application in measurement, there are practical requirements we have to meet. We have argued in Chapters 1 and 2 that one of the main differentiating factors of computer system based metrics was their low cost and broad applicability when compared to simulation based metrics.

Therefore, the cost associated with obtaining workload characterization metrics for any given workload should be low. Our first requirement thus becomes:

Requirement 1: selected metrics must be efficiently measurable.

Collecting data is only part of the process. As with any scientific instrument there is a part where the collected data are processed to form an observation. In other words, there is a post-processing step that extracts the observation from the measurement. This too requires investment in time and cost which needs to be kept low:

Requirement 2: the collected metric data must be efficiently processable to obtain an observation.

Obtaining an observation is an important step; however, our second research question mentions the ability to look at hundreds or thousands of workloads. Therefore, we not only need the ability to efficiently observe on a single system, we need the ability to efficiently observe on many systems. Since we desire a large number of workloads, the cost per workload collection must be low. The low cost requirement therefore excludes software instrumentation of the workload (effort), and simulation (time) as viable options.

Requirement 3: the data collection process must be cheap and efficient to facilitate collecting data from many workloads.

Continuing our scientific instrument analogy, we note that the output of an observational instrument must be consistent - it should report on the same attributes of the object under study every time it is used. Without observational consistency, comparison of observations is impossible. Quantitative analysis within a dataset relies on the completeness of that dataset. Therefore if data collection is not standardized and gaps are present in the dataset, rigorous analysis is unattainable. This condenses to:

Requirement 4: data collection must be standardized for all workloads.

As is the case with many scientific instruments that are part of the environment under study, we have to be aware of our observational actions. Does the action of observation create a significant reaction in the object under study? In other words, are our observations perturbing the state of the computer system and are we interfering with our own experiment? Good experiment design always strives to minimize methodological error. When measuring a real workload the perturbation impact of the measurement must be small enough to not impact the real workload. Trace collection and software instrumentation are invasive and resource consuming measurement methods that significantly impact the characteristics of the workload.

Requirement 5: data collection must not perturb workload execution.

A minimal illustration of low-overhead, counter-based collection consistent with these requirements is sketched below.
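As an illustration of what cheap, low-perturbation collection can look like, the sketch below wraps a command in counting-mode hardware-counter measurement using the Linux perf tool. The thesis's own collection targets Solaris-era tooling, so this is only an illustrative stand-in; the chosen events, the CSV parsing and the derived IPC metric are assumptions, and the exact output layout of perf can differ between versions.

```python
# Sketch: counting-mode counter collection around a workload command.
# Counting mode aggregates counters in hardware with negligible overhead,
# in contrast to tracing or instrumentation (cf. Requirements 1-5).
import subprocess

def count_events(cmd, events=("instructions", "cycles", "cache-misses")):
    perf = ["perf", "stat", "-x", ",", "-e", ",".join(events), "--"] + cmd
    result = subprocess.run(perf, capture_output=True, text=True)
    counts = {}
    for line in result.stderr.splitlines():        # perf stat writes to stderr
        fields = line.split(",")
        if len(fields) >= 3 and fields[2] in events:
            try:
                counts[fields[2]] = int(fields[0])
            except ValueError:
                pass                               # counter not supported/counted
    return counts

if __name__ == "__main__":
    c = count_events(["sleep", "1"])
    if c.get("cycles"):
        print("IPC:", round(c.get("instructions", 0) / c["cycles"], 2))
    print(c)
```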


In our discussion on current benchmark suite sub-setting techniques (Chapter 2) we noted the apparent lack of motivation regarding the choice of simulation metrics used as input to the sub-setting process. Eeckhout et al. (2002) sidestepped the issue by mentioning that metric selection is an important topic of study, without offering guidance. We believe that approaches as suggested by Yi et al. (2003) and Yi et al. (2005), i.e., using a Plackett and Burman experiment design, can greatly improve the relevant metric selection. However, from an observational point of view, selection of relevant metrics can only be done when the relevance of these metrics can be ascertained. Thus, in order to determine metric relevance, these metrics first have to be collected! Collecting these metrics requires the existence of a measurement environment. Given the rich variety of different computer systems and processor micro-architectures, it is obvious that the choice of measurement environment introduces a contextual bias, since we limit ourselves to a specific micro-architecture implementation. However, it is within this specific micro-architecture context that we believe relevant metric selection should be done as part of the analysis methodology, and not as part of the observations. Thus:

Requirement 6: bias in workload metric selection should be avoided.

Since workload similarity is the key to our selection process, we believe that the determination of workload similarity should take place in a workload space W spanned by an unbiased selection of metrics relative to a specific micro-architecture context. The previous requirement guarantees that our inputs are unbiased, i.e., no human preference has influenced their composition. Apart from human interference, we must prevent bias introduced by our data processing and analysis. We must avoid unintended transformation effects that can skew and thus bias the data. Based on Chapter 2 we expect to have some redundancy in our measured metric set. We intend to span the workload space W using metrics representative of the features of the measured data. We need to make sure that our selection of relevant metrics itself does not introduce bias. Research Hypothesis 3 proposes the Euclidean distance as a quantitative measure of representativeness. The Euclidean distance is sensitive to scale differences in its component dimensions, e.g., if one dimension is ten times greater than the other dimensions it will dominate the Euclidean distance. Consequently, these metrics should support a quantitative definition of inter-workload distance and representativeness, without introducing any scaling bias. We summarize this as follows:

Requirement 7: workload similarity should be quantitatively unbiased.

If determining workload similarity requires a process that takes a very long time, e.g., on the order of weeks, its value is greatly diminished. The value of quantitative workload similarity of real workloads is the ability to quickly analyze and report on new trends in the incoming data. This means that, depending on the data volume, the reaction time of the whole process should be measured in hours, not days. The final requirement is:


Requirement 8: determining workload similarity using measured data should be a process of hours, not days.

This set of requirements guides the practical implementation of our approach for selecting representative workloads.
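As a small, purely illustrative check of the scale sensitivity behind Requirement 7, the R fragment below uses two made-up workload vectors to show how a single unscaled metric can dominate the Euclidean distance until the metrics are brought onto a comparable scale; the metric names and reference scales are assumptions, not measured values.

# Two hypothetical workloads: a small CPI difference, a large (unscaled) bandwidth difference
a <- c(cpi = 1.2, bandwidth_mb = 52000)
b <- c(cpi = 2.4, bandwidth_mb = 51000)
sqrt(sum((a - b)^2))                              # ~1000: the bandwidth dimension dominates
scale_ref <- c(cpi = 2.0, bandwidth_mb = 50000)   # assumed per-metric scale
sqrt(sum(((a - b) / scale_ref)^2))                # ~0.6: both dimensions now contribute comparably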

3.4 Reflecting upon the research hypotheses

In this chapter we took an approach used for sub-setting benchmark suites as the blueprint for our approach to select representative real workloads. We formulated a number of research hypotheses that together form the frame of the desired approach. This completes the step in the deductive-hypothetic research strategy where we define our hypotheses. Hypotheses should be falsifiable, i.e., collected evidence can lead to their rejection. We defined a number of consequences that, if violated, would reject a hypothesis outright. The complexity of workload similarity analysis makes testing these hypotheses possible only within its overall scope. We expanded on the research hypotheses and their consequences by defining a number of requirements. These requirements set the context of our work and define the scope of our approach. With these requirements we work our way towards constructing a framework within which we can evaluate our hypotheses. This is the topic of the next chapter.


4. CONSTRUCTING THE APPROACH

ABSTRACT
The representative workload selection approach is formulated as a number of fixed steps: first, workload characterization, followed by dimensionality reduction of the spanned workload space; next, partitioning of the workload space using clustering algorithms; and finally, representative workload selection using the clusters and the dimensions.


Overview figure: the deductive-hypothetic research strategy mapped onto the thesis chapters. Define the question (Chapter 1: The importance of workloads in computer system design); gather information and resources (Chapter 2: Current approaches for workload selection in processor and computer system design); form hypothesis (Chapter 3: Towards an unbiased approach for selecting representative workloads); prescriptive model (Chapter 4: Constructing the approach); early test (Chapter 5: Testing the methodology on a benchmark set); collect data (Chapter 6: Collecting and analyzing workload characterization data); analyze data (Chapter 7: Grouping together similar workloads); interpret data and formulate conclusions (Chapter 8: Finding representative workloads in the measured dataset); present conclusions (Chapter 9: Evaluating workload similarity).


The previous chapter described a deductive model based on available knowledge and proposed a number of hypotheses, their consequences and requirements. In this chapter we develop the necessary framework to evaluate these hypotheses.

4.1 Formulating our implementation approach

We now turn our attention to practice. We believe that the approach for benchmark suite sub-setting provides a good blueprint. Modeling our approach for selecting representative workloads accordingly, we have a number of steps to take. These steps are (see Figure 4.1):

Workload characterization: collecting workload characterization data from many different workloads.
Workload data processing: reducing the collected workload data into a concise and standardized representation suitable for further analysis.
Metric normalization: different workload metrics have different scales. Normalization rescales metrics to remove magnitude effects.
Dimension reduction: from the collected set of metrics we transition to a smaller set of metrics. This new selection of metrics should reflect the properties of the workloads in the original set and should also be unbiased.
Clustering: using a clustering approach we classify the selected workloads into groups based on their distribution in the reduced dimensionality workload space.
Representative workload selection: from the groupings identified by the clustering algorithm we select a set of workloads optimally covering all workload space dimensions, thus representing all behavioral properties of the workloads.

Each of these steps requires the application of an algorithm or method of some kind. This means that each step presents a number of choices. Each choice can impact the composition of the final output, the representative set.
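To make these steps concrete, the sketch below strings them together in R on synthetic data. The specific functions (prcomp for dimension reduction, Mclust for clustering) anticipate the techniques compared later in this chapter; the data, the number of retained components and the center-most selection rule are illustrative assumptions rather than the implementation used in this thesis.

# Illustrative end-to-end sketch of the selection steps on synthetic data
library(mclust)
set.seed(1)
raw  <- matrix(rexp(200 * 20, rate = 0.1), nrow = 200)  # 200 workloads x 20 metrics
norm <- scale(log10(1 + raw))                           # metric normalization (logarithmic)
pcs  <- prcomp(norm)                                    # dimension reduction via PCA
red  <- pcs$x[, 1:4]                                    # reduced workload space
fit  <- Mclust(red)                                     # clustering (model-based, BIC-selected)
# representative workload selection: the member closest to each cluster mean
reps <- sapply(seq_len(fit$G), function(g) {
  idx <- which(fit$classification == g)
  ctr <- colMeans(red[idx, , drop = FALSE])
  idx[which.min(colSums((t(red[idx, , drop = FALSE]) - ctr)^2))]
})
reps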

4.1.1 Evaluating the quality of the representative set

Before we can claim that we have found the (or an) optimal representative set, we need a way to substantiate that our set selection is (close to) optimal. The ultimate test of our optimal set would be to design a computer system or processor using this set and see how it performs. This is of course fraught with risk, and therefore inadvisable: a company could very easily go bankrupt when it cannot recoup the investment made. Another way to test the representative set is to compare processor and system designs made using the representative set with designs made using the currently popular set of standard benchmarks. In this scenario the evaluation of which design is better suited for the marketplace would remain a matter of debate.


Fig. 4.1: Summary workflow for the evaluation of hardware counter workload data, showing method steps: the computer system and workload are measured by workload characterization (data collection), producing raw data; workload data processing yields the workload characterization final data; metric normalization yields the normalized workload data; dimension reduction yields the reduced workload space; clustering yields the optimal cluster solution; and representative workload selection yields the representative workload set.


Another way to test the representative set would be to collect many workloads over a period of several years and follow changes to the representative set, investigating the question: how representative is the original representative set relative to the workload space after one, two, four or more years? The above-mentioned tests are impractical within the confines of a thesis. Yet we desire a way by which we can determine which cluster solution, as a function of our method choices, delivers the most distinction. In other words, if we base our representative set on the reduced workload space and clustering solution that is best able to distinguish workloads, we likely have, or are close to, an optimal representative set. Thus we consider the method that provides the best grouping of similar workloads as optimal. There is no theory that we know of on how to best select a set of representative workloads from a given workload space and clustering solution. We postpone further discussion until Chapter 8, where we will both develop and apply a selection algorithm.

4.1.2 Evaluating workload clustering

Evaluating the partitioning of the collected dataset requires a criterion by which we can judge the quality of that grouping. We propose using categorical information on the workloads as a means of evaluating the quality of the cluster result. Recall that the clustering algorithm groups together workloads it considers similar. If we observe that many workloads of the same type or origin are partitioned together, then we consider this as indicative of the method's discerning properties. This is related to consequences 1 and 2, which require that repeated characterizations of the same workload are proximate and that similar workloads are proximate. Thus we have defined the criterion by which we evaluate different clustering solutions. Since the categorical data can themselves be considered a cluster solution (each category is its own cluster), we need a quantitative method to evaluate similarity or distinction between two cluster solutions on the same dataset. To that end we employ the variation of information criterion (VI) proposed by Meilă (2003). The VI provides an evaluation of clustering similarity based on the amount of information shared between both solutions. More common approaches can be summarized as either counting pairs or performing set matching. Since we should not make assumptions about the existence of pairs or sets in the workload data, we consider the latter approaches less desirable. In Section 4.7 we provide a more detailed exposé on the VI.
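A minimal sketch of the VI computation, written directly from the shared-information definition above, is given below; it is not the implementation used in this thesis, and the example labels are made up.

# Variation of information between two clusterings (label vectors a and b)
vi_score <- function(a, b) {
  p   <- table(a, b) / length(a)                     # joint distribution p(i, j)
  pa  <- rowSums(p); pb <- colSums(p)
  Ha  <- -sum(pa * log(pa)); Hb <- -sum(pb * log(pb))
  Iab <- sum(ifelse(p > 0, p * log(p / outer(pa, pb)), 0))
  Ha + Hb - 2 * Iab                                  # VI = H(A) + H(B) - 2 I(A; B)
}
# Comparing a cluster solution against categorical workload labels
vi_score(c(1, 1, 2, 2, 3, 3), c("db", "db", "web", "web", "hpc", "hpc"))  # 0 for identical partitions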

4.1.3 Workload clustering

Working back from the representative set selection, we arrive at the clustering process. The literature on the application of clustering algorithms for sub-setting, primarily Phansalkar et al. (2005b); Eeckhout et al. (2005b); Hoste et al. (2006); Joshi et al. (2006) (amongst others), exclusively uses K-means clustering combined with the Bayesian Information Criterion (BIC). The BIC is used to determine the appropriate number of clusters supported by the dataset. K-means clustering is a partitioning algorithm: at every iteration it finds the most likely sub-partition in the provided set. A criticism of K-means clustering we are eager to avoid is its dependence on the choice of initial conditions (Bradley and Fayyad, 1998; Fraley and Raftery, 1998; Pelleg and Moore, 2000; Ishioka, 2005). If these initial conditions are poorly chosen, the results will differ significantly from the "true" clustering.

There are alternatives to partitioning algorithms and we propose the use of model-based clustering approaches. We propose using a clustering methodology based on multivariate mixture models in which the BIC is used for direct comparison of models that may differ not only in the number of components in the mixture, but also in the underlying densities of the various components. Clusters are determined by a combination of hierarchical clustering and the expectation-maximization (EM) algorithm for maximum likelihood (Fraley and Raftery, 1998). We chose the MCLUST implementation of model-based clustering (Fraley and Raftery, 1998, 1999, 2002a,b, 2003) available in R (R Development Core Team, 2006). The model-based clustering approach is combined with the BIC in such a way that it allows the simultaneous evaluation of different Gaussian models. MCLUST chooses the most appropriate model as decided by the BIC. The clustering process is initialized using the results of a Gaussian model-based hierarchical clustering step. This not only greatly improves the quality of the clustering, it also improves the consistency of the clustering (Banfield and Raftery, 1993; Fraley and Raftery, 1998, 2002b, 2003). This consistency improvement is important since K-means clustering generally does not provide the same result upon repeated executions. We choose an additional clustering approach because we believe that the structure of the computer system metric workload space is different from the simulation metric workload space. We know that K-means clustering is a classification EM algorithm. In fact, it is a special case of EM clustering algorithms that assumes (Celeux and Govaert, 1992; Bradley and Fayyad, 1998):

• Each cluster is modeled by a spherical Gaussian distribution;
• Each data item is assigned to a single cluster;
• All clusters have equal weight.

We believe that clusters in the hardware metric workload space are unlikely to be spherical. Above all, we believe that different metrics, even normalized, do not contain equivalent amounts of workload information. For example, halving L1 misses will have a substantially different effect on performance than halving L2 misses; thus the two metrics convey different information over their range. We believe there are many non-linear relationships between hardware metrics hidden in the measured data. These hidden relationships are likely part of the reason µA-dependent metrics are considered difficult or inaccurate in workload characterization (note that they are not hidden in simulation). However, by allowing the clusters to change shape accordingly, we can accommodate the differences in meaning and importance of the metrics. This is comparable to, but not the same as, weighting metric contribution and using weighted K-means clustering. We discuss both K-means and model-based clustering in more detail in Section 4.6.
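The sketch below contrasts the two approaches on synthetic data, using R's kmeans and the Mclust function of the MCLUST package referenced above; the data, the fixed k for K-means and the range of G offered to the BIC are illustrative assumptions.

library(mclust)
set.seed(1)
norm <- scale(matrix(rexp(200 * 5), nrow = 200))   # synthetic normalized metrics
km   <- kmeans(norm, centers = 4, nstart = 1)      # single random start: results vary between runs
fit  <- Mclust(norm, G = 2:9)                      # BIC selects both G and the covariance model
summary(fit)                                       # reports the chosen model (e.g. ellipsoidal, varying shape)
adjustedRandIndex(km$cluster, fit$classification)  # agreement between the two partitions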

4.1.4 Dimensionality reduction

The clustering algorithms of the previous subsection perform their work in the reduced workload space. As discussed in Section 2.6, dimensionality reduction is a way to represent a many-dimensional workload space more understandably. Given that we are proposing to collect all available system metrics and hardware counters to avoid bias (refer to Requirement 6), we can expect our workload metric space to be very large. As a result dimensionality reduction is a requirement to maintain efficiency. Not only do many dimensions present a computational burden, they can also be severely redundant, making the analysis process less efficient. The most straightforward approach to dimensionality reduction is to remove one from each pair of metrics with an absolute correlation coefficient greater than a chosen cutoff r, where 0.90 ≤ |r| < 1. There are more advanced methods that better preserve the properties of the underlying space during dimension reduction. Eeckhout et al. (2005a) use Principal Component Analysis, while Eeckhout et al. (2005b) use Independent Component Analysis. We therefore include both PCA and ICA in our evaluation process. We discuss PCA in more detail in Section 4.5.3 and ICA in Section 4.5.4. PCA and ICA create a new representation of the dataset along axes chosen to be more representative of important features of the dataset (hence feature extraction). This projection loses information by discarding the less significant features in favor of the more significant. Conversely, feature selection attempts to select only those metrics that carry the most information; lesser metrics are discarded. We also intend to include other techniques, since there are cases where PCA has been known to obscure rather than reveal groupings of interest (Chang, 1983).

Dimensionality reduction is part of a larger set of statistical techniques called machine learning. For example, supervised learning, Bayesian decision theory, parametric methods, non-parametric methods, multivariate methods, dimensionality reduction, clustering, decision trees, linear discrimination, etc. are all classes of machine learning (Alpaydin, 2004). An important area of research is supervised learning. The goal of supervised learning is to use a set of measured input variables, which have some influence on one or more outputs, to predict the values of those outputs (Hastie and Tibshirani, 1990). Applying supervised learning to the measured dataset requires that we specify both inputs and outputs. Invariably associated with computer systems is the notion of performance. We can use processor instruction count as indicative of computer system performance and construct a supervised learning model that predicts instruction count as a function of the other metrics. We assume that in general there are monotone relations between different metrics and the instruction count. Using Generalized Additive Models (Hastie and Tibshirani, 1990; Hastie et al., 1990) we can build such a supervised model. After model construction we can evaluate the contribution of each metric to the model prediction. By discarding all metrics that contribute nothing, or very little, to predicting the instruction count we can reduce the dimensionality of the dataset. This approach has an additional advantage: since we are selecting the most predictive metrics from the original dataset, we could update our workload characterization to only collect these metrics, thus likely lowering the measurement impact on the workload (per Requirements 3 and 5) and lowering the computational burden of the approach (per Requirement 8). A more elaborate discussion on model construction is provided in Section 4.5.5.

There are alternative machine learning techniques like Linear Discriminant Analysis. However, many of these techniques require the existence of training data. The existence of training data assumes that there is a notion of truth, i.e., a classification that is known to be correct (Alpaydin, 2004). We do not believe that such a notion is supported by the properties of computer system workloads. We believe that constructing a training dataset would require all sorts of assumptions regarding the importance of workloads, workload characteristics and system metrics. We believe that the selection of representative workloads is not the type of question best addressed using methods that require training data external to the dataset. We note that given a large enough dataset we could split the set into a training and a validation part, but we would still remain with the question of how relevant the remaining metrics are to our classification problem. In our approach we concentrate on the distinguishing capabilities of different clustering algorithms against dimensionality reduction based on PCA, ICA, GAM and correlation.

4.1.5 Normalization of metric data

Many of the methods for dimensionality reduction require that the data are normalized. In addition, the Euclidean distance proposed as our measure of workload similarity is very sensitive to scale differences. This means that data are re-scaled to be of similar magnitude and variance. Normalization removes all effects based on scale differences. There are two major approaches to normalization: linear and logarithmic. Linear normalization is appropriate for metrics where absolute differences are of equal importance over their complete range. Logarithmic normalization is appropriate for metrics where there are great differences in scale and the ratio of the values is considered of importance. Normalization, however, is not without risk: by normalizing the metrics, the metrics are assumed to be of equal weight. Minor variations in a metric that is otherwise constant can increase in significance, without there being any physical rationale for doing so. We maintain both linear and logarithmic normalization in our comparative approach. Normalization is discussed in detail in Section 4.5.1.

4.1.6 Workload characterization

The last step in our discussion and the first step of our approach is workload characterization. Requirements 1, 2, 3, 4 and 5 all deal with properties of workload characterization. Based on these requirements we concluded that a standardized set of measurement tools was needed to perform workload characterization. This set of tools was found in WCSTAT (Sun Microsystems Inc., 2004). WCSTAT meets all the above requirements and was the primary tool for data collection in this thesis. A more in-depth discussion of workload characterization is provided in Section 4.2, while a more in-depth discussion of WCSTAT can be found in Section 6.1.1. Finally we present an overview of the proposed approach and the different techniques in Figure 4.2. In the remainder of this chapter we will revisit all the mentioned techniques in more detail.

4.2 Collecting workload characterization data

We have several times encountered requirements for data collection; here we expand on those requirements and define the type of collected data. The goal is to select workloads or benchmarks representative of real workloads, based on their similarity to other real workloads. This requires that we have a quantitative understanding of these real workloads. Workload characterization was introduced as the way by which we acquire quantitative data from workloads. Workload characterization is a broad term that covers many different approaches. We have also taken note of the constraints on data collection and data processing, and we have determined that instrumenting workloads is not feasible due to both the effort required and the perturbation of the workload. We have also reviewed the use of simulators and seen that while simulators are the preferred tool for computer system design evaluation, they are very impractical for data collection on a large scale. Many papers have remarked on the difficulty of obtaining quality simulation data for any version of the SPEC benchmark suite (Eeckhout et al., 2002; Gómez et al., 2002; Kandiraju and Sivasubramaniam, 2002; Citron, 2003; Lee, 2006). Performing simulation on the massive scale needed to collect and simulate a large volume of real workloads is clearly impractical and inefficient. To our advantage, modern computer systems and their operating systems are equipped with a large number of hardware and software measurement points. We surmise that the amount of information available through those measurement points is sufficient to perform a meaningful selection of workloads, without requiring any simulation or workload instrumentation. The measurement points on these computer systems are accessible through operating system utilities and can be accessed while the workload is executing. It is clear that accessing the measuring points themselves will have some effect on the real workload, and the collection mechanism should be constructed such that this effect is minimal. The operating system makes available metrics that report not only on the state of the hardware, but also on the activity of the operating system itself.


Fig. 4.2: Summary workflow for the evaluation of hardware counter workload data, showing dimensionality reduction and clustering steps: workload characterization (data collection) with WCSTAT, workload data processing, metric normalization, dimension reduction (PCA, ICA, GAM or correlation), clustering (model-based or K-means), cluster comparison of cluster solutions against workload categorization information via the VI-score, and representative workload selection yielding the representative workload set.


While simulators and traces provide excellent data on the state of the hardware, they provide no information on the execution of the actual workload. The simulator and the traces can be used to determine with great accuracy the hardware requirements of the few billion instructions traced or simulated. These few billion instructions reflect only a tiny part of the execution of any real workload. In contrast, metrics collected through the operating system cover a much larger part of the execution at much less detail.

As mentioned before, the literature is incomplete on the subject of metric selection. There is universal agreement that important metrics must be used, yet little guidance is provided on how they are selected. We propose the use of high-level or available metrics. When comparing our results with those obtained in the literature, we have to demonstrate that these available metrics indeed provide the required level of information and perform equally well as, or better than, identified important metrics. Let us address the issue of important metrics first: these metrics are considered important because there exists a body of work in the literature that very successfully answers a number of questions with these metrics. This body of work can be based on theory, simulators or practice. Since we believe that the available high-level metrics represent all relevant performance aspects, we now have the burden of proof. We believe that without substantiating our metric selection, we cannot prevent the suspicion of bias. If we were to present a set of metrics and demonstrate that they work, the question lingers: how did we decide which metric to use? A perfectly viable approach is to interview a few knowledgeable people in the domain of workload characterization and use their recommendation as a representative metric set. Such an approach would, however, not be free of bias, since selection by domain experts leverages exactly their experience-based bias. Metric selection becomes even more relevant when there is disagreement on the value or relevance of a specific metric. Probably the most straightforward method of avoiding bias is to not make any sub-selection until strictly necessary. In other words, measuring the largest set of metrics without violating Requirements 1 and 5.

Within the context of workload similarity, it is possible to use workload similarity as the basic test of the suitability of the high-level metrics for constructing workload distances. By collecting data of the same and similar workloads and observing the differences in the inter-workload distance, the ability of the high-level metrics to make distinctions is tested. Different measurements of identical workloads must, on average, be in closer proximity to each other than to other workloads. The choice of "on average" conveys our understanding that all measurements are subject to random noise. In an ideal environment, without the presence of noise, we would expect repeated measurements of the same workload to return zero workload distance. On real computer systems with non-deterministic behavior, random server processes starting and stopping, and uncontrollable processor interrupts, it is unlikely that a zero workload distance will ever be measured. Rather, we expect each workload to have some distance, where different measurements from identical workloads are close together.
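As an illustration of this proximity test, the sketch below generates synthetic repeated measurements of a handful of workloads and compares within-workload to between-workload Euclidean distances; all numbers are made up.

set.seed(1)
base     <- matrix(rnorm(10 * 6), nrow = 10)                        # 10 workloads, 6 metrics
measured <- base[rep(1:10, each = 3), ] + rnorm(30 * 6, sd = 0.05)  # 3 noisy measurements per workload
label    <- rep(1:10, each = 3)
d        <- as.matrix(dist(measured))                               # Euclidean inter-workload distances
same     <- d[outer(label, label, "==") & upper.tri(d)]
other    <- d[outer(label, label, "!=") & upper.tri(d)]
c(mean_same = mean(same), mean_other = mean(other))                 # expect mean_same << mean_other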


A significant question is whether the available high-level metrics provide an improvement beyond the important metrics used in the literature. While they are incomparable in a strict sense, our high-level metrics live in real computer systems, whereas the other metrics live in simulators. This touches on the issue of result comparability: are results obtained with high-level metrics comparable to those obtained with simulators? To answer that, we would have to perform the whole end-to-end analysis, including data reduction, dimensionality reduction and clustering. This is an interesting side-step, facilitating an early test of our proposed framework. This side-step is the topic of Chapter 5.

4.3 Reducing workload characterization data

Metrics collected from computer systems are interval based. Each metric represents either the average or the cumulative value of a metric over the measurement interval. If a metric is measured multiple times, then the succession of different measurements provides a temporal trace of that metric. The interval-based nature of the measured metrics limits the observability of rapid events. The interval nature of the measurements acts as a smoother: peaks and troughs are averaged out over the interval. In the overall time series the measured values can still vary considerably, but they do not convey the actual dynamism of the underlying metric's behavior. While the ability to correctly capture the dynamic behavior of metrics is limited to the use of simulators, for the purpose of workload characterization it is of limited value. Workload characterization aims at a representation of workload behavior on the computer system. Therefore, the metric time series are averaged and the resulting mean value is taken as representative for the whole measurement interval.

The mean value can only be a proper representation of workload behavior if the workload is stable over the measurement interval. Workload stability implies that the mean varies little over the measurement interval. If, for example, there is a significant difference between the mean in the first quarter of the measurement interval and the mean in the last quarter, then the workload obviously was not stable and the mean should not be used. Workload data collected when the workload was not stable is generally not usable for the purposes of similarity analysis. This is because it is unclear if the measured variability is a workload artifact, or if it is the result of the workload transitioning between distinct workload phases. Workload stability is an important criterion when accepting workloads for similarity analysis, and detecting workload stability is an important capability. While strict checking of workload stability can lead to a high rejection rate, our primary goal is to determine the feasibility of performing workload similarity analysis using computer system based metrics. Future research can then improve the data collection and analysis methodology to improve the handling of unstable workloads.

In Section 4.9 we quantify the expected quality of sample-based metrics collected from processor hardware counters.
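A minimal sketch of such a first-quarter versus last-quarter stability check is shown below; the 10% threshold is an assumption chosen for illustration, not the criterion used in this thesis.

# Flag a metric time series as unstable when the first- and last-quarter means diverge
is_stable <- function(x, threshold = 0.10) {
  q     <- floor(length(x) / 4)
  first <- mean(head(x, q))
  last  <- mean(tail(x, q))
  abs(last - first) / max(abs(first), .Machine$double.eps) <= threshold
}
set.seed(1)
is_stable(rnorm(120, mean = 50, sd = 2))                # stationary series: TRUE
is_stable(seq(10, 100, length.out = 120) + rnorm(120))  # drifting series: FALSE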

4.4 Selecting representative metrics for spanning the workload space

Sub-setting the workload space for the purpose of selecting representative workloads requires that the workload space itself is representative. In Eeckhout et al. (2005a, 2002) and Phansalkar et al. (2005b) the selection of representative metrics is performed ad hoc. The metrics used to span the workload space for similarity analysis were selected based on reasoning or on previously achieved promising results. When dealing with collected workload characterization data, there is no guidance on which metrics are most appropriate. Any bias present in the superset of computer system metrics reflects a bias inherent in the overall design. We consider the risk of bias in the computer system metrics to be small. The computer system metrics currently in use are the result of many years of performance and workload analysis of computer systems. Missing features would likely have been found. While this is a broad statement, there is no way to prove that computer system metrics are sufficiently complete, other than to demonstrate that the obtained results are comparable to those based on simulation metrics. This is a self-referential problem: if we demonstrate suitability of the method, then obviously we have made the correct choices. For collection of real workload characterization data this means that all metrics that can be easily collected with minimum perturbation should be collected.

4.5 Reducing workload space dimensionality

The large number of metrics in the collected dataset is hard to visualize, and the metrics are expected to contain redundant information. These metrics are redundant in the sense that they report on different events that are strongly correlated. Finding the real information embedded in those correlated metrics becomes the first priority. We want to remove the impact of the correlated metrics, understand the underlying influences and determine the effect of noise in the data. As mentioned in the literature, the presence of strongly correlated metrics can introduce a bias where correlated metrics become the dominant signal (Eeckhout et al., 2002, 2003b, 2005b; Vandierendonck and De Bosschere, 2004b). By applying statistical techniques we aim to reduce the number of metrics under consideration, without losing relevant information. Eeckhout et al. (2002) propose the use of Principal Component Analysis (PCA) when determining similarity within the SPEC benchmark set. PCA was used to reduce the number of observed metrics from 20 to three or four. Similarity was then determined within the reduced workload space spanned by the three or four principal metrics. Independent Component Analysis (ICA) was used by Eeckhout et al. (2005b) as an alternative to PCA. The presented reason for choosing ICA instead of PCA was that PCA strives to find uncorrelated axes whereas ICA attempts to find independent axes in a multi-dimensional space. Independence is a stronger requirement than uncorrelatedness. ICA is better suited for data typically acquired from computer systems since it works on non-Gaussian distributions. Both PCA and ICA are based on the principle of transformation: the workload space is transformed along the principal axes of the information in the dataset. Since PCA and ICA also provide information on the properties of the dataset, independent of the transformation, we will provide an introduction of both methods, based on Hastie et al. (1990) and Jolliffe (2002). After the determination of the principal components, these components can be used to span the workload space W_reduced; the reduction is made by only using principal components that explain a large part of the variance. Principal components with only a small variance will be left out since they do not contribute much to our understanding. The next few subsections discuss the required steps to prepare measured data for use with PCA and ICA.

4.5.1 Normalization

Before dimensionality reduction techniques can be applied, the metrics must be normalized to remove all effects based on scale differences. This is particularly important since the Euclidean distance was proposed as our quantitative similarity metric: scale differences in the Euclidean distance will dominate its behavior. Normalization is therefore an essential step (Eeckhout et al., 2002, 2003b,c). Normalization, however, is not without risk: by normalizing the metrics, the metrics are assumed to be of equal weight. Minor variations in a metric that is otherwise constant can increase in significance, without there being any physical rationale for doing so. Normalizing is commonly used to compare metrics with different values. Classical standardization is the most common normalization, where each original variable x is scaled independently of the other variables to a new variable y using:

y = (x − x̄) / σ_x ,    (4.1)

where x̄ is the mean value and σ_x the standard deviation. Classical standardization is not without issues. First, classical standardization is sensitive to outliers. It is quite common that measurement data of computer resource usage contain a few observations that are significantly larger (or smaller), by say an order of magnitude, than all other observations. The standard deviation in turn is dominated by these few outliers. To overcome the problem with outliers, the mean and the standard deviation in equation (4.1) should be replaced by their robust equivalents. The robust equivalent of the mean or standard deviation is defined as an estimation of the true value when it can be observed for a very long time. The longer the observation time, the smaller the impact of the outliers. Well-known robust estimators are the median, or the trimmed mean. The trimmed mean is constructed by removing 1-5% of the observations from the right-hand tail (for variables limited by zero).


In Raatikainen (1993) the risks of automatic standardization are discussed. The main issue with automatic standardization is that it does not take into account real-world, common-sense knowledge about the phenomenon under study. With standardization the magnitudes of the original variables are no longer taken into account, and the standard deviation, real or robust, dominates the relative importance of the variable. This in turn allows variables of lesser significance with large deviation to dominate the analysis. In Raatikainen (1993) the use of a logarithmic transform is recommended. The logarithmic transform is defined as:

y = w_x log10(1 + c_x x) ,    (4.2)

where w_x and c_x are subjectively determined coefficients. The relative importance of the variables is controlled by the weights w_x, while the constant 1 overcomes the problems with x = 0. The coefficient c_x affects the resolution of the transform for small values of x, x < 10. Without the coefficient c_x the small values would be invisible next to the constant 1. Since we will attribute weights later, based on a metric's contribution to performance, we assume w_x = 1. To prevent information loss in variables with small values, c_x is chosen such that the difference between the minimum and the mean is at least 3 orders of magnitude. While this increases the magnitude of the variable and therefore its relative importance, the increase is much less significant than under classical standardization. An alternative formulation is:

y = w_x log10(c_x + x) ,    (4.3)

where c_x is equal to min(x)|x > 0, for metrics where x < 10 and σ ≫ min(x)|x > 0. After normalization or log-normalization the metrics are on a comparable scale. We now want to find those variables that are most relevant to performance and also remove variables that provide no additional information.
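The sketch below applies classical standardization (4.1) and the logarithmic transform (4.2) with w_x = 1 to a synthetic heavy-tailed metric; the particular rule used for choosing c_x is an illustrative assumption.

set.seed(1)
x  <- rexp(1000, rate = 0.01)        # heavy-tailed, zero-bounded metric
z  <- (x - mean(x)) / sd(x)          # classical standardization, Eq. (4.1)
cx <- 1000 / min(x[x > 0])           # one assumed way to spread small values over ~3 decades
y  <- log10(1 + cx * x)              # logarithmic transform, Eq. (4.2) with w_x = 1
par(mfrow = c(1, 3)); hist(x); hist(z); hist(y)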

4.5.2 Removing correlated metrics

Perhaps the simplest technique for reducing the number of metrics under consideration is removing one from each pair of correlated metrics. A cutoff correlation coefficient is chosen, typically between 0.80 and 0.95, and the correlations between all metrics are evaluated. Depending on the flavor of the algorithm, one of the correlated metrics is removed. The resulting dataset contains only metrics correlated less than the chosen cutoff. In addition, correlation between linear combinations of metrics can be evaluated. By combining the least correlated metrics and evaluating the correlation with the remainder, additional metrics can be removed. Removing correlated metrics is a quick and simple way to reduce the dimensionality of a dataset. In practice, however, more advanced methods like Principal Component Analysis and Independent Component Analysis better preserve important characteristics of the dataset during dimensionality reduction.
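A minimal sketch of such a pairwise correlation filter is given below; the 0.90 cutoff and the greedy drop order are illustrative choices.

# Repeatedly drop one metric from the most correlated pair until no pair exceeds the cutoff
drop_correlated <- function(X, cutoff = 0.90) {
  repeat {
    C <- abs(cor(X)); diag(C) <- 0
    if (max(C) < cutoff) return(X)
    worst <- which(C == max(C), arr.ind = TRUE)[1, ]
    X <- X[, -worst[1], drop = FALSE]
  }
}
set.seed(1)
m1 <- rnorm(100); m2 <- m1 + rnorm(100, sd = 0.05); m3 <- rnorm(100)
ncol(drop_correlated(cbind(m1, m2, m3)))   # 2: one member of the correlated pair is removed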


Fig. 4.3: Scree plot with shoulder (variance plotted against dimension).

4.5.3 Principal Component Analysis

Our treatment of Principal Component Analysis is based on Jolliffe (2002), Hastie et al. (1990) and Eeckhout et al. (2002). Here we intend to present the basics of PCA and supply relevant details where necessary for further discourse. Assume that x is a vector of p random variables for which we have m observations X = (x_1, x_2, ..., x_m), and we are interested in the variances of the p random variables and the structure of the covariances or correlations between the p variables. Only when p is small (p ≤ 4) can we easily visualize the structure of the data. We are interested in a few (≪ p) derived variables that preserve most of the information. The principal components of a set of data in R^p provide a sequence of best linear approximations to the data, of all ranks q ≤ p. PCA transforms the p variables x = (x_1, x_2, ..., x_p) into p principal components y = (y_1, y_2, ..., y_p) such that y_i = Σ_{j=1}^{p} α_ij x_j, where Var(y_1) > Var(y_2) > ... > Var(y_p). The vector y_1 describes the most information, while y_p describes the least. In addition, Cov(y_i, y_j) = 0 for all i ≠ j. The principal components are therefore linear combinations of the original variables, such that all principal components are uncorrelated. Principal component analysis is a useful tool for dimension reduction and compression. Both work by leaving out those principal components y_i where the variance is small. The smaller the variance Var(y_i), the less information is represented by y_i. PCA therefore allows the construction of a space R^q with a predetermined loss of information. Commonly the variances of the constituent principal components are plotted along their rank to visualize the distribution of information. This graph is commonly called a scree graph and it is used to determine the elbow in the variance distribution, illustrated in Figure 4.3.
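The sketch below runs PCA on synthetic normalized metrics with R's prcomp and inspects the scree information used to locate the shoulder; the number of retained components is an illustrative choice.

set.seed(1)
norm <- scale(matrix(rnorm(200 * 10), nrow = 200))   # synthetic normalized metrics
pca  <- prcomp(norm)
screeplot(pca, type = "lines")                       # scree graph: variance per principal component
cumsum(pca$sdev^2) / sum(pca$sdev^2)                 # cumulative fraction of variance explained
reduced <- pca$x[, 1:4]                              # keep the leading components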

4.5.4 Independent Component Analysis

Independent component analysis (ICA) is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals (Hyvärinen et al., 2001). The origins of ICA are in signal processing. The cocktail party problem is commonly used to explain ICA: for a human it is no problem to follow a discussion with your neighbors, even if there are lots of other sound sources in the room: other discussions, music, etc. ICA is able to extract these sound sources as long as there are as many microphones as there are different sound sources (Hyvärinen and Oja, 2000). Our summary of ICA is based on Jolliffe (2002), Hastie et al. (1990) and Eeckhout et al. (2005b). Independent Component Analysis relaxes the assumption that the underlying data is Gaussian distributed around the mean. Rather, ICA tries to produce components that are statistically independent and requires that the underlying distributions are non-Gaussian. Statistical independence is stronger than uncorrelatedness. Two variables x1 and x2 can be defined to be statistically independent when their joint probability density p(x1, x2) can be expressed as: p(x1, x2) = p(x1)p(x2). If two variables are independent, they are also uncorrelated. The converse, however, is not true: two uncorrelated variables can still depend on each other. Independent Component Analysis can be viewed as a statistical method for separating mixed signals. ICA assumes that x = Λ(f), where Λ is some, not necessarily linear, function and the elements of f are independent. The components (factors) f are estimated by f̂, which is a function of x. The family of functions from which Λ can be chosen must be defined. Within the chosen family, functions are found that minimize an objective cost function, based on information or entropy, which measures how far the elements of f̂ are from independence. Typically, an iterative method is used to find the optimal f̂, and it is computationally expensive. It is clear that ICA requires choices made by the practitioner regarding the form of the functions used for f̂. We complete our presentation of ICA by following Eeckhout et al. (2005b) since they are the main proponents. This treatment is based on Hyvärinen (1999); Hyvärinen and Oja (2000), but we limit ourselves to the relevant sections used in Eeckhout et al. (2005b). The ICA estimation procedures consist of estimating a mixing matrix A, and its inverse W, such that the measured data consisting of column vectors x_i are related to the independent components s_i as follows:

x = As    and    s = Wx

There are several methods of estimating the matrix A and then calculating its inverse W. The representation is similar to PCA since the matrices A and W are linear representations of the data x. Since ICA is interested in extracting dimensions that highlight the non-Gaussianity of the data, a measure of non-Gaussianity is required. Such a measure is provided by negentropy. Negentropy describes the distribution of a random, non-Gaussian variable relative to a Gaussian distribution, where the Gaussian distribution has the same covariance matrix as x:

J(x) = H(x_gauss) − H(x)    (4.4)


H(x) represents the entropy of a random variable, where

H(x) = − ∫ f(x) log f(x) dx

The measure of entropy is an important concept in information theory as it relates to the randomness of a variable. The greater the entropy, the more random the variable. When x is a random variable with a Gaussian distribution, it is proven that its entropy is high (Cover and Thomas, 2006). The formulation of Equation 4.4 is hard to obtain in practice because one would need an estimate of the probability density function (PDF) of x. An analytical representation of the PDF of x might not be easy to obtain for arbitrary distributions. In practice, negentropy is estimated through (Hyvärinen and Oja, 2000):

J(x) ≈ Σ_{i=1}^{p} k_i [E{G(x_i)} − E{G(v)}]²    (4.5)

where k_i are some positive constants, v is a Gaussian random variable with mean 0 and variance 1, and G_i are some non-quadratic functions. Examples of G_i are:

G_1(u) = (1/a_1) log cosh(a_1 u) ,    G_2(u) = − exp(−u²/2)

where 1 ≤ a_1 ≤ 2. The function G should be chosen such that it does not grow too fast; this allows for a more robust estimator. Eeckhout et al. (2005b) refer to the FastICA algorithm by Hyvärinen and Oja (2000). FastICA is based on a fixed point algorithm for finding a maximum of the non-Gaussianity of W^T x as estimated by Equation 4.5. The basic form of the algorithm chooses initial vectors w and iterates until Equation 4.5 has converged for all vectors w. The use of ICA is greatly enhanced when the data has been centered and whitened. Centering linearly transforms each vector x to a new vector x̃ = x − E(x), such that the new vector has zero mean. Whitening linearly transforms x to x̃, such that its components are uncorrelated and their variances equal unity. The whitening transformation is always possible; one method of whitening is applying PCA first, and then applying ICA on the resultant transformation. When using ICA we limit ourselves to the available implementation of FastICA in Matlab (Hyvärinen and Oja, 2000; Mathworks Inc., 1984) or R. To develop our intuition on the differences between ICA and PCA, we provide an example decomposition in Figure 4.4. The left figure shows our raw example data, created using two intersecting distributions. The middle figure is the PCA decomposition, illustrating how the distribution is rotated, thus aligning the dominant variance with the first principal component. The ICA decomposition shows how ICA distinguishes between the two component distributions and reorients the space to emphasize their distinction. Note that the example was constructed using two dimensions to highlight the difference between PCA and ICA problem decomposition. The difference in distinctive capability between ICA and PCA can be generalized to M distributions in N dimensions, as required for our purposes.
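The sketch below shows the corresponding use of the fastICA package in R on synthetic data; the number of components and the logcosh contrast (the G_1 form above, with a_1 = 1) are illustrative choices.

library(fastICA)
set.seed(1)
X   <- matrix(rexp(200 * 10), nrow = 200)                  # synthetic metric matrix
ica <- fastICA(X, n.comp = 4, fun = "logcosh", alpha = 1)  # centering and whitening are done internally
S   <- ica$S   # estimated independent components (one column per component)
A   <- ica$A   # estimated mixing matrix, as in x = A s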

Fig. 4.4: Example decomposition for Principal Component analysis (PCA) and Independent Component analysis (ICA). Three scatter panels show the original data (axes data-1, data-2), the PCA decomposition (axes PC-1, PC-2) and the ICA decomposition (axes IC-1, IC-2).

4.5.5 Generalized additive models

We can also reduce the dimension of the measured dataset by applying a Generalized Additive Model (GAM). The purpose of the GAM is to determine how well the measured data predicts a metric of our choosing. All metrics that do not contribute to this prediction are considered uninformative and can be removed. We are therefore interested in statistical models that can explain computer system performance with a minimum of assumptions regarding the underlying distributions. Regression is a natural approach to determining the contribution of different metrics to performance. However, a traditional linear model might not be applicable since most of the effects in the dataset are not well understood. There is a class of automatic flexible statistical methods that may be used to characterize nonlinear regression effects. Following the overview provided in Hastie et al. (1990), these methods are called generalized additive models. In the regression setting, a generalized additive model has the form

E(Y | X1, X2, . . . , Xp) = α + f1(X1) + f2(X2) + · · · + fp(Xp)    (4.6)

The X1, X2, . . . , Xp represent predictors and Y is the outcome; the fj's are unspecified smooth ("nonparametric") functions. The functions fj are estimated in a flexible manner, using an algorithm whose basic building block is a scatterplot smoother. The estimated function f̂j can then reveal possible nonlinearities in the effect of Xj. Not all of the functions fj need to be non-linear. We can easily mix in linear and other parametric forms with the nonlinear terms (Hastie and Tibshirani, 1990; Hastie et al., 1990). There are different types of scatterplot smoothers, like the LOWESS function (Cleveland, 1979), a running mean or kernel, and cubic spline smoothers. The role of the smoother is to show the appropriate functional form of the data. The smoother tries to expose the functional dependence without imposing a rigid parametric assumption about that dependence (Hastie and Tibshirani, 1990). More often than not there are multiple predictor variables Xi available, and the dependence of Y on both X1 and X2 is modeled through the multiple linear regression model

E(Y | X1, X2) = α + X1 β1 + X2 β2    (4.7)

The three parameters α, β1 and β2 are usually estimated using least squares. The multiple linear regression model will be inappropriate if the regression surface E(Y | X1, X2) is not well approximated by a plane (Hastie and Tibshirani, 1990). There is a heuristic way to apply a scatterplot smoother to the multiple predictor case. The model that we envision is

E(Y | X1, X2) = f1(X1) + f2(X2).    (4.8)

Given an estimate f̂1(X1), an intuitive way to estimate f2(X2) is to smooth the residual Y − f̂1(X1) on X2. With this estimate f̂2(X2) we can get an improved estimate of f̂1(X1) by smoothing Y − f̂2(X2) on X1. This process is continued until our estimates f̂1(X1) and f̂2(X2) are such that the smooth of Y − f̂1(X1) on X2 is f̂2(X2) and the smooth of Y − f̂2(X2) on X1 is f̂1(X1). The model (4.8) is an example of an additive model. The iterative smoothing process is an example of backfitting, the main tool used for estimating additive models.

The advantage of a GAM is that it determines the importance of each metric. This then allows us to evaluate which metrics are of interest, which in turn can be used to reduce the workload characterization burden, e.g., only metrics of demonstrated value need to be collected. We build a generalized additive regression model (GAM) for explaining the instruction count via the remaining metrics. This model is used to determine what metrics affect performance and to what extent. Dimensionality reduction is achieved by only retaining those metrics identified by the GAM as important. A complete exposition of model construction is presented in Appendix B. The idea behind the GAM is simple: the best metrics for clustering our dataset are the metrics that best predict computer system performance. The main issue is how to define computer system performance for the large variety of different workloads in our dataset. Different benchmarks have different performance metrics, e.g., it is unclear how we should relate, say, SPEC CPU 2000 performance to TPC-C performance. The TPC-C performance metric reflects system performance, while SPEC CPU 2000 primarily reflects processor performance. We chose the instruction count as measured on the processor as our proxy for performance. The idea is that the number of instructions executed is a fair, if not perfect, indicator of performance. Of course, it is well understood that certain instructions do not "propel" execution forward, but only add to the instruction pathlength. For example, if the only difference between two benchmarks is in the frequency of spins on mutexes, one is unlikely to interpret the difference in the instruction count as higher performance of the benchmark with more spins.

We represent the metrics excluding the instruction count by x, a P-dimensional vector. P is the number of metrics under consideration. The instruction count is represented by the scalar y. By definition, the regression model relates the response y to the predictor vector x. We assume that the following additive regression relationship holds:

y = f(x) + ε = Σ_{p=1}^{P} f_p(x_p) + ε    (4.9)

where f is the regression function, the f_p are smooth univariate functions of individual metrics that add up to f, and ε is noise, the part of the instruction count that is not explained by the f_p. The model is fitted using data pairs (x_i, y_i), 1 ≤ i ≤ N, where index i runs over the N workloads for which the metric vectors are available. We assume that the noise terms ε_i, the noise components at the data points, are independent and are centered at zero. Since the f_p(x_p) are smooth univariate functions of individual metrics, we recognize that strongly correlated metrics x_p will have similar f_p. Strongly correlated metrics therefore are redundant. During initial creation of the model we jumpstart dimensionality reduction by removing all except one of the metrics with an absolute correlation coefficient greater than 0.95. In practice this means that instead of 73 metrics the GAM model starts with 48 metrics. At a high level our modeling approach is as follows. First we perform regression on all the metrics using B-splines. Next we obtain the fit using a penalized least squares (PLS) criterion. Here we decide what metrics make a significant contribution to the model and are retained, and also what roughness penalty parameters λ_p to use for each retained metric. These roughness penalties can be interpreted as metric weights. Third, the parameters of the PLS are chosen using cross-validation for optimal model selection, making the desired trade-off between matching the data and keeping the model simple. Last, we note that most of the additive effects are expected to be monotone (e.g. higher miss rates result in fewer executed instructions), and we extend our approach using I-splines to satisfy these monotonicity constraints. We label the B-spline approach uGAM since it is unconstrained. The constrained I-spline approach we label iGAM. Workload similarity is expressed as their distance in the workload space. The fitted regression function is

f̂(x) = Σ_{p∈P} f̂_p(x_p),    (4.10)

where P is the set of chosen metrics. Next we define a distance measure ρ between any two workloads based on their metric vectors. We would like ρ to be small whenever two workloads have similar performance-relevant metrics that result in similar performance. Similar performance alone does not make two workloads similar. But a difference between two benchmarks in metrics that do not explain performance is not of interest.


On the contrary, we would like the individual metrics to contribute to the distance definition according to their effect on performance. We interpret f_p as the effect of metric p. Therefore, we propose the following definition:

\rho(u, v) = \sum_{p \in P} \left| \hat{f}_p(u_p) - \hat{f}_p(v_p) \right|.        (4.11)

Note that the difference between two metric values enters (4.11) as the difference it makes to the expected performance according to (4.10). The greater its impact on the instruction count, the greater the corresponding summand in (4.11). For example, if f̂_p(s) = â_p s were linear, then |f̂_p(u_p) − f̂_p(v_p)| = |â_p| |u_p − v_p|, so |â_p| can be interpreted as the weight for metric p. In particular, metrics not used in (4.10) do not contribute to the distance definition.
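To make the backfitting idea and the distance (4.11) concrete, the following is a minimal Python sketch. It assumes numpy is available and substitutes a crude running-mean smoother for the penalized B-spline smoothers described above; the function names (smooth, backfit, gam_distance) are illustrative only and are not part of any package used in this thesis.

    import numpy as np

    def smooth(x, r, frac=0.3):
        # Crude scatterplot smoother: running mean of the residual r over a window in x.
        order = np.argsort(x)
        half = max(1, int(frac * len(x)) // 2)
        out = np.empty_like(r, dtype=float)
        for rank, i in enumerate(order):
            lo, hi = max(0, rank - half), min(len(x), rank + half + 1)
            out[i] = r[order[lo:hi]].mean()
        return out

    def backfit(X, y, iters=20):
        # Backfitting: cycle over the metrics, smoothing the partial residuals
        # until the additive effects stabilize.
        n, P = X.shape
        f = np.zeros((n, P))            # fitted effect of metric p at each workload
        alpha = y.mean()
        for _ in range(iters):
            for p in range(P):
                partial = y - alpha - f.sum(axis=1) + f[:, p]
                f[:, p] = smooth(X[:, p], partial)
                f[:, p] -= f[:, p].mean()   # keep each effect centered
        return alpha, f

    def gam_distance(f_u, f_v):
        # Distance (4.11): sum over retained metrics of |f_p(u_p) - f_p(v_p)|,
        # applied here to the rows of f for two observed workloads.
        return np.abs(f_u - f_v).sum()

With the fitted effects, gam_distance(f[i], f[j]) gives the distance between workloads i and j; metrics whose fitted effect is nearly flat contribute little, mirroring the weighting argument above.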

4.5.6 Other dimensionality reduction techniques

There are many other dimensionality reduction or feature extraction techniques available, such as genetic algorithms and support vector machines. Another class of techniques is based on supervised learning, but these require feedback on the quality of their solution. It is not immediately obvious how that feedback should be provided when dimensionality reduction is a precursor to partitioning using clustering algorithms, hence our limited interest in these alternative techniques.

4.6 Partitioning the workload space with clustering algorithms

Kaufman and Rousseeuw (1990) present an overview of some clustering methods, distinguishing between partitioning and hierarchical approaches. Hierarchical approaches do not work with a single value for k, but rather cover a whole range of values; all values k = 2, 3, . . . , n − 1 are covered in a gradual fashion. The only difference between k = r and k = r + 1 is that one of the r clusters is split to create r + 1 clusters. Hierarchical approaches have two main subcategories, agglomerative and divisive; these two methods construct their hierarchy in opposite directions, possibly yielding different results. Partitioning approaches divide a dataset into clusters based on some dissimilarity criterion. The number of clusters k is either provided by the user, or calculated based on a provided numerical measure. In the literature, partitioning algorithms are more commonly used for large datasets, certainly when the goal is to derive meaningful groupings. Examples of partitioning approaches are K-means and MCLUST, discussed next.

4.6.1 K-means clustering

K-means (MacQueen, 1967; Hartigan, 1975; Hartigan and Wong, 1979) is one of the simplest unsupervised learning algorithms for clustering. K-means follows a few simple steps to classify a given dataset into a certain number of clusters (assume k clusters) chosen a priori. The main idea is to define k centroids, one for each cluster. Different initial locations of the centroids will lead to different results, making centroid placement an important artifact of the method. The next step is to take each point in the dataset and associate it with the nearest centroid. When all points have been assigned to clusters, the first step is completed and an initial grouping is in place. At this point we re-calculate k new centroids as the centers of the clusters resulting from the previous step. With these k new centroids, a new binding is made between the dataset points and the nearest new centroid. These iterative steps define a loop, which we repeat until the k centroids no longer change their location; in other words, the centroids do not move any more. The algorithm aims at minimizing an objective function, in this case a squared error function. With N data points and k disjoint subsets S_j containing N_j data points, the objective function is

J = \sum_{j=1}^{k} \sum_{n \in S_j} \| x_n - \mu_j \|,        (4.12)

where \|x_n − \mu_j\| is a chosen distance measure between a vector x_n representing the n-th data point and the geometric centroid \mu_j of the data points in S_j. The objective function is an indicator of the distance of the data points from their respective cluster centers. Although it can be shown that the procedure always terminates, the K-means algorithm does not necessarily find the optimal configuration corresponding to the global minimum of the objective function. In general, the algorithm does not achieve a global minimum of J over the assignments. In fact, since the algorithm uses discrete assignment rather than a set of continuous parameters, the "minimum" it reaches cannot even be properly called a local minimum. Despite these limitations, the algorithm is used fairly frequently because of its ease of implementation. The algorithm is also highly sensitive to the initially selected cluster centers; it can be run multiple times to reduce this effect. This is a simple version of the K-means algorithm. It can be viewed as a greedy algorithm for partitioning the n samples into k clusters so as to minimize the sum of the squared distances to the cluster centers. It does, however, have some weaknesses:
• The way to initialize the means is not specified. One popular way to start is to randomly choose k of the samples. Alternatively, one places the centroids at maximum distances from each other within the space spanned by the data, or uses the result of a hierarchical clustering method to determine the initial centroids.

• The results produced depend on the initial values for the means, and it frequently happens that suboptimal partitions are found. The standard solution is to try a number of different starting points.
• It can happen that the set of samples closest to a chosen centroid μ_i is empty, so that μ_i cannot be updated. This is an annoyance that must be handled in each implementation.
• The results depend on the metric used to measure \|x_n − \mu_j\|. A popular solution is to normalize each variable by its standard deviation, though this is not always desirable.
• The results depend on the value of k.
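The basic iteration described above can be sketched as follows. This is a minimal Python sketch, assuming numpy; initialization simply picks k random samples, which is one of the options from the list above.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        # Basic K-means: assign each point to the nearest centroid, then recompute centroids.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), k, replace=False)]   # init: k random samples
        for _ in range(iters):
            # assignment step: nearest centroid for every point
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # update step: each centroid becomes the mean of the points assigned to it
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):                    # centroids stopped moving
                break
            centroids = new
        return labels, centroids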

The last of these weaknesses, the dependence on k, is particularly troublesome, since we often have no way of knowing how many clusters exist. Unfortunately there is no general theoretical solution for finding the optimal number of clusters for any given dataset. A generally accepted approach is to compare the results of multiple runs with different values of k and choose the best one according to a given criterion (usually the Bayesian Information Criterion), but we need to be careful: increasing k results in smaller error function values by definition, while increasing the risk of over-fitting.

4.6.2 K-means clustering and the Bayesian Information Criterion

K-means clustering has been the staple approach for many clustering problems, including a significant part of the literature on similarity analysis of workloads. Current applications of K-means clustering use the Bayesian Information Criterion (BIC) to find the optimal clustering solution. The BIC combines the number of observations, the number of free parameters to be estimated, the residual sum of squares of the model and the maximized value of the likelihood function of the estimated model into a single number. Optimizing a model according to the BIC then entails finding the best BIC value. Schwarz (1978) provided a Bayesian argument for adopting this information criterion. Combining K-means and the BIC typically requires that, for each k in a range, the clustering solution and its associated BIC are calculated. While this is computationally expensive, it is not a limiting factor. What does limit the effectiveness, however, is that K-means is sensitive to the starting conditions: if cluster centers are chosen randomly, some randomization of the answers will occur. In addition, K-means is also prone to finding local minima in the similarity measure, i.e., the produced answer may or may not be the best clustering result possible given the data. Note that K-means is fully deterministic given the starting centers (Pelleg and Moore, 2000); a bad choice of initial centers can have a devastating impact on both performance and distortion. Repeated iterations of sub-sampling and smoothing have been proposed to refine the selection of starting centers and improve the consistency of the results (Bradley and Fayyad, 1998).
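A minimal sketch of this selection loop, assuming numpy and scikit-learn are available; the BIC here uses the common spherical-Gaussian approximation (as in the X-means literature), which need not match the exact formulation used elsewhere in this thesis, and repeated restarts (n_init) soften the sensitivity to starting centers.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_bic(X, labels, centers):
        # Approximate BIC for a hard-assigned spherical Gaussian model: lower is better.
        n, d = X.shape
        k = centers.shape[0]
        rss = ((X - centers[labels]) ** 2).sum()
        sigma2 = rss / (n - k) if n > k else rss / n      # pooled spherical variance
        loglik = -0.5 * n * d * np.log(2 * np.pi * sigma2) - 0.5 * rss / sigma2
        n_params = k * d + (k - 1) + 1                    # centers, mixing weights, variance
        return -2 * loglik + n_params * np.log(n)

    def best_k(X, k_range=range(2, 11), restarts=10):
        # Run K-means for each candidate k and keep the k with the lowest BIC.
        scores = {}
        for k in k_range:
            km = KMeans(n_clusters=k, n_init=restarts).fit(X)
            scores[k] = kmeans_bic(X, km.labels_, km.cluster_centers_)
        return min(scores, key=scores.get), scores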


In Pelleg and Moore (2000) X-means is introduced as an extension to K-means. X-means improves the computational scalability, determines the number of clusters automatically and partially solves the problem of local minima in the clustering solution, the main shortcomings of K-means. In Ishioka (2000) and Ishioka (2005) X-means is further expanded, both with a formalization to general p-dimensional datasets and with algorithmic improvements. The result is an X-means algorithm that automatically and robustly finds the best clustering division using K-means combined with the BIC. The properties of consistency and automatic determination of the optimal clustering make X-means an attractive addition to our tool-set. Ishioka (2005) provides an implementation of X-means in R; we use this implementation in this thesis and refer to it as XMEANS.

4.6.3 Model-based clustering

Another approach to clustering problems is model-based. This approach consists of using certain models for clusters and attempting to optimize the fit between the data and the model. In practice this means that every cluster is mathematically represented by a parametric distribution, e.g., Gaussian (continuous) or Poisson (discrete). The entire dataset is therefore modeled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution. A mixture model with high likelihood tends to have the following traits: • component distributions have high “peaks” (data in one cluster are tight);

• the mixture model "covers" the data well (dominant patterns in the data are captured by component distributions).
The main advantages of model-based clustering are:
• well-studied statistical inference techniques are available;
• flexibility in choosing the component distribution;
• a density estimate is obtained for each cluster;
• a "soft" classification is available.

The most widely used clustering method of this kind is based on a mixture of Gaussians: we can consider clusters as Gaussian distributions around their centroid. The method extends the basic K-means approach in two important ways:
• Instead of assigning cases or observations to clusters to maximize the differences in means for continuous variables, the EM clustering algorithm computes probabilities of cluster memberships based on one or more probability distributions. The goal of the clustering algorithm then is to maximize the overall probability or likelihood of the data, given the (final) clusters.

• Unlike the classic implementation of K-means clustering, the general EM algorithm can be applied to both continuous and categorical variables (note that the classic K-means algorithm can also be modified to accommodate categorical variables).

The E-step of the EM algorithm assigns "responsibilities" to each data point based on its relative density under each mixture component, while the M-step recomputes the component density parameters based on the current responsibilities. The relative density under each mixture component is a monotone function of the Euclidean distance between the data point and the mixture center. Hence, in this setup, EM is a "soft" version of K-means clustering, making probabilistic (rather than deterministic) assignments of points to cluster centers (Hastie et al., 1990).
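As a minimal sketch of these two steps, assuming numpy, spherical Gaussian components and continuous metrics only, the E-step computes responsibilities and the M-step re-estimates the component parameters; the small constants guard against degenerate variances and empty components in this simplified setting.

    import numpy as np

    def em_gmm(X, k, iters=100, seed=0):
        # EM for a mixture of spherical Gaussians: a "soft" version of K-means.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        mu = X[rng.choice(n, k, replace=False)]          # initial component centers
        var = np.full(k, X.var() + 1e-9)                 # per-component spherical variance
        pi = np.full(k, 1.0 / k)                         # mixing weights
        for _ in range(iters):
            # E-step: responsibility of each component for each point, a monotone
            # function of the distance between the point and the component center.
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            logp = np.log(pi) - 0.5 * d * np.log(2 * np.pi * var) - d2 / (2 * var)
            logp -= logp.max(axis=1, keepdims=True)      # stabilize before exponentiating
            resp = np.exp(logp)
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means and variances from the responsibilities.
            nk = resp.sum(axis=0) + 1e-9
            pi = nk / n
            mu = (resp.T @ X) / nk[:, None]
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            var = (resp * d2).sum(axis=0) / (d * nk) + 1e-9
        return pi, mu, var, resp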

4.6.4 MCLUST for model-based cluster analysis

MCLUST is a software package for model-based clustering, density estimation and discriminant analysis interfaced to the S-PLUS commercial software and its open-source counterpart R. MCLUST combines parameterized Gaussian hierarchical clustering algorithms and the EM algorithm for parameterized Gaussian mixture models. Also included are functions that combine hierarchical clustering, EM and the Bayesian Information Criterion (BIC) (Fraley and Raftery, 2002a). The BIC is used to guide selection of the cluster count as well as the optimal EM model, the model that best explains the clustering properties of the data. In MCLUST, each cluster is represented by a Gaussian model

\phi(x_i \mid \mu_k, \Sigma_k) = (2\pi)^{-p/2} |\Sigma_k|^{-1/2} \exp\{ -\tfrac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \},        (4.13)

where x represents the data, and k is an integer subscript specifying a particular cluster. Clusters are ellipsoidal, centered at means \mu_k. The covariances \Sigma_k determine their other geometric features. Each covariance matrix is parameterized by eigenvalue decomposition in the form

\Sigma_k = \lambda_k D_k A_k D_k^T,        (4.14)

where D_k is the orthogonal matrix of eigenvectors, A_k is a diagonal matrix whose elements are proportional to the eigenvalues of \Sigma_k, and \lambda_k is a scalar. The orientation of the principal components of \Sigma_k is determined by D_k, while A_k determines the shape of the density contours; \lambda_k specifies the volume of the corresponding ellipsoid, which is proportional to \lambda_k^d |A_k|, where d is the data dimension. Characteristics (orientation, volume and shape) of the distributions are usually estimated from the data, and can be allowed to vary between clusters, or constrained to be the same for all clusters (Fraley and Raftery, 2002a). In more than one dimension the characteristics can be used to describe the models evaluated.


For example, EVI denotes a model where the cluster volumes are equal (E), the shapes of the clusters may vary (V) and the orientation is the identity (I). This is equivalent to writing eq. (4.14) as \Sigma_k = \lambda A_k. Celeux and Govaert (1992) show that the standard K-means algorithm is a version of a classification EM (CEM) algorithm corresponding to the uniform spherical Gaussian model \Sigma_k = \lambda I. The CEM algorithm converts the clustering attribution in each E-step to a discrete classification before performing the M-step.
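A small numpy sketch of the decomposition (4.14); the normalization chosen here fixes |A_k| = 1 so that λ_k carries the volume, which is one common convention and may differ in detail from the MCLUST implementation.

    import numpy as np

    def decompose_covariance(Sigma):
        # Split Sigma into volume (lam), orientation (D) and shape (A) as in eq. (4.14),
        # using the convention |A| = 1; Sigma is recovered as lam * D @ A @ D.T.
        eigvals, D = np.linalg.eigh(Sigma)        # columns of D are eigenvectors
        order = np.argsort(eigvals)[::-1]         # principal axis first
        eigvals, D = eigvals[order], D[:, order]
        d = len(eigvals)
        lam = np.prod(eigvals) ** (1.0 / d)       # geometric mean of the eigenvalues
        A = np.diag(eigvals / lam)                # diagonal shape matrix with |A| = 1
        return lam, D, A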

4.7 Comparing clustering results

With several clustering algorithms available, it is convenient to have a method of quantifying the difference between their clustering results. Meilă (2002, 2006) introduces the variation of information (VI) criterion. This criterion measures the amount of information lost or gained in changing from clustering C to clustering C'. Other approaches for determining the degree of similarity between two clusterings can be summarized as either counting pairs or performing set matching; we refer the interested reader to Meilă (2002, 2006) for an overview. We chose VI as our clustering criterion mainly for two reasons. First, VI is based on information-theoretical concepts. Second, VI has desirable geometric properties, i.e., its behavior matches our intuition. VI is not directly concerned with the relationships between pairs of points; rather, it is based on the relationship between a point and its cluster in each of the two clusterings compared. This is advantageous since we need to correctly attribute the information contained in the singletons in our clustering results. A complete discussion of VI is beyond the scope of this work; we suffice by listing its most important properties.
• VI is a metric.

• VI is bounded.

• VI depends only on the relative sizes of the clusters, and is independent of the size of the dataset.
The VI expresses the similarity between two clusterings as a single number. For the normalized VI, the metric will be zero when the clusterings are identical, and one when they are completely dissimilar. We will use this normalized version because all our comparisons are made between clustering solutions using the same dataset. Another criterion we can use to compare clustering results is the quality of their clustering. By quality of clustering we mean the properties of the defined clusters. Under the assumption that similar workloads are proximate in the workload space, the following prediction should hold true: the average inter-object distance within groups of the same category, type or cluster should be smaller than the average inter-object distance for the whole dataset. It is easier to express this inequality as a ratio: for the proximity assumption to be true, the ratio of the average within-group and whole-dataset inter-object distances should be less than one.


We can clarify our position with a thought experiment. Imagine a sphere uniformly filled with n workloads. The average distance between all workloads is

R_W = \frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d_{ij}}{n(n-1)/2},        (4.15)

where d_{ij} is the distance between two workloads. We randomly select n/8 workloads. Since this random selection is a fair estimator of the inter-workload distances, the average inter-workload distance R_R will be close to R_W. We next partition the sphere into eight identical parts. It is clear that the average distance within a partition, R_w, is considerably smaller than R_W and also R_R. Thus, from our thought experiment we conclude that for workload proximity to hold, the ratio R_w/R_W < 1. For brevity, we denote this fraction R_wW. We observe that R_wW is independent of the dimensionality of the dataset; it depends primarily on the number of workloads.
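A minimal sketch of this ratio, assuming numpy; R_w is taken here as the within-cluster mean distance averaged over clusters with more than one member, which is one straightforward reading of the definition above.

    import numpy as np

    def mean_pairwise_distance(X):
        # Average of all pairwise Euclidean distances, as in eq. (4.15).
        n = len(X)
        total = sum(np.linalg.norm(X[i] - X[j])
                    for i in range(n - 1) for j in range(i + 1, n))
        return total / (n * (n - 1) / 2)

    def rww(X, labels):
        # Ratio R_wW: mean within-cluster distance over mean whole-dataset distance.
        R_W = mean_pairwise_distance(X)
        within = [mean_pairwise_distance(X[labels == c])
                  for c in np.unique(labels) if np.sum(labels == c) > 1]
        return np.mean(within) / R_W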

4.8 Selecting the representative workloads

The majority of papers using clustering to partition sets of benchmarks select the representative benchmarks by taking the centroids of the clusters. Centroids are the benchmarks closest to the center of the cluster. Depending on the clustering algorithm and the researcher, the centroid can be calculated as the center of mass or as the geometric center. The center of mass is calculated taking into account all represented workloads. The geometric center is the center of the smallest hypersphere around the cluster that encompasses all identified members. The distinction between these centroid definitions is seldom made, with most papers defaulting to the K-means definition of cluster center, which is the center of mass. While the center of mass certainly is an intuitive choice, one could argue that when representing clusters based on behavioral characteristics, the geometric centroid provides more uniform coverage of the workload space. In Chapter 8 we revisit the question of representative workload selection in detail. Until now we have implicitly assumed that we can accurately measure the computer system metrics and hardware counters. In the next section we explore the error bounds on hardware counter observations.

4.9 Quantifying metric sampling error on computer system workloads

Central to our thesis is the ability to efficiently collect data from computer systems without perturbing the workload. Necessarily we use the hardware and software metrics available to us through the operating system. The software metrics provide information on system state and resource consumption, while the hardware counters provide practical observability of program execution.


Fig. 4.5: Density histograms of the relative error after 500, 1000 and 2000 seconds (250 bins). The lines show Normal distributions with a standard deviation based on the underlying data. Data from Bonebakker (2007)

Hardware counters can sample processor resource utilization during actual workload execution. In contrast, simulation metrics provide visibility into all aspects of computer system performance for the duration of the simulation; simulation provides not only metric counts but also their distribution in time. Hardware counter registers are limited in their capability: a register can only count events over a specified time interval. As a result, hardware counter observations are staggered in time and thus provide an incomplete picture of actual execution. Yet the ubiquitous presence of hardware counters, combined with their ease of use, makes them compelling for workload characterization. We investigate the properties of hardware counter observations and provide bounds on workload observability. This section is based on a technical report (Bonebakker, 2007) and its underlying data; the report provides a very detailed investigation of hardware counter sampling accuracy. Under reasonable assumptions the following conclusions are supported:
1. There is a trade-off between sample duration and sample count. A long sample duration requires fewer samples, and vice versa.
2. Given sufficient samples relative to the sample time, 95% of all samples are within 5% of their mean, independent of sampling strategy.
3. Measurements on multiple workloads show that 99.98% of the sampled means are within 25% of their true value.
Using several workload models as input and sampling them repeatedly, Bonebakker (2007) was able to extract a distribution of sample means relative to a known mean. In Figure 4.5 we illustrate this distribution. These simulation results show that we can reasonably expect the maximum error of a sampled metric to be within 25% of its true value, for the most part independent of workload behavior. Some dependency on workload behavior remains in cases where cyclical workload behavior resonates with the sampling time. Bonebakker (2007) also measured the variability in hardware counter metrics using WCSTAT against 100 instances each of the workloads 176.gcc and 189.lucas, each with two optimizations: base and peak.


Fig. 4.6: Density histograms of the relative error in instruction count for 176.gcc and 189.lucas, base and peak. Based on 100 measurements of 1000 seconds. The lines illustrate the density of a Normal distribution for which 99.98% of values are within [-0.25,0.25]. Data from Bonebakker (2007)

These two SPEC CPU 2000 component benchmarks have their own specific periodic behavior. The data collected from measured workloads mostly agree with the simulation prediction. The notable difference is their broader distribution, as illustrated in Figure 4.6. Part of this broadness is caused by non-deterministic effects during measurement, which introduce additional variability; another part is the difference in sample count. The graphs in Figure 4.5 are based on forty thousand workload averages, while the graphs in Figure 4.6 are based on just 100 measured workloads. The results show that sampling computer system metrics is equivalent to data collection in a noisy environment: the variance of the data goes up. Yet all values in the measured dataset are within 25% of their mean. In Figure 4.6 we added the Normal density distribution for a curve where 99.98% of the values are within [-0.25,0.25] of the mean. The graphs in Figure 4.6 help illustrate the locality properties of hardware counter based sampling. Bonebakker (2007) demonstrates that processor hardware counter metrics can be efficiently determined with reasonable accuracy. What remains is the question whether the measured distributions are sufficient evidence of the efficacy of hardware counters. Important in that regard is to understand how the hardware counter results will be used and what the error progression characteristics will be. We must put the hardware metrics in the context of workload characterization, since we intend to use them for comparison purposes. An important question in that regard is: what is the risk of misclassification, given that 99.98% of the means are expected to be within 25% of the true value? To that end, let us model the relative error with a Gaussian distribution. We set the properties of the Gaussian such that 99.98% of all values fall within the range [-0.25,0.25] around the mean (0). This translates to a standard deviation σ = 0.0716.


Now let us assume that our observational errors are equal and independent of each other for each metric. In the multi-dimensional workload space, each workload is represented by a point. Given the above parameters, we can generalize that the "correct" location for that data point is within a hypercube of [0.75x, 1.25x] for each workload x. This formulation makes the error box dependent on each x. If we are to compare two workloads, the ratio of their metric values is of interest; implicit in the ratio is the dynamic range of the workloads. For example, for a single metric i, where i ∈ [1, . . . , d] and d is the dimensionality of x, if x_i = 1 and y_i = 2 there will be no risk of misclassification. Using the x_i and y_i error margins, there is a possibility of overlap if their ratio is within

\frac{0.75}{1.25} \le \frac{x_i}{y_i} \le \frac{1.25}{0.75}.        (4.16)

If we compare two workloads with very similar values for x, it would be difficult to tell them apart, but according to the similarity assumption we would not have to. If just one of the dimensions is significantly different, the error cubes no longer intersect. If we consider the dimensionality of the workload space, we can see that more dimensions are favorable for making distinctions: each added dimension provides more opportunity for separation. The mean instruction counts collected for the workloads in Figure 4.6 have varying ratios; the instruction count ratios between the four workloads are listed in Table 4.1. We apply inequality 4.16, with 0.75/1.25 = 0.600 and 1.25/0.75 = 1.667, to Table 4.1. Based on the ratios, we can conclude that there is no risk of misclassifying 189.lucas-base with 176.gcc-base or 176.gcc-peak. Neither will 189.lucas-peak overlap with 176.gcc-peak. Interestingly, 189.lucas-peak and 176.gcc-base show apparent overlap, as do all base-peak pairs of the same workload. We illustrate the actual overlap between workloads in Figure 4.7 with the frequency histograms of the average instruction count for all 100 runs per workload. It is notable from contrasting Figure 4.7 with Table 4.1 that a visual inspection of the distributions would not confuse 189.lucas-peak with 176.gcc-base, even though inequality 4.16 predicts the slight possibility. This is expected behavior and suggests that workloads be characterized multiple times. If four SPEC CPU component benchmarks demonstrate such spread on the instruction count, then it is likely other workloads will too. Thus, generalizing the behavior demonstrated by the instruction count to all hardware counters, we expect to have sufficient distinctive capability. Until now we specifically addressed the hardware counters; we argue the results hold true for the computer system metrics as well. Recall that the computer system metrics report aggregate results collected over set time intervals. Those metrics do not have the measurement issues associated with the hardware counters, nor do they have the dynamic range. If the hardware counters are viable, with acceptable error characteristics, then the computer system metrics are viable too. Thus we find insufficient evidence to reject Research Hypothesis 1 on page 59:


Fig. 4.7: Composite histogram of mean instruction count for 176.gcc and 189.lucas, base and peak. It illustrates the measurement overlap between workloads. Based on 100 measurements of 1000 seconds. Data from Bonebakker (2007)

x_i / y_i          189.lucas-base   189.lucas-peak   176.gcc-base   176.gcc-peak
189.lucas-base     -                1.522            2.156          2.616
189.lucas-peak     0.657            -                1.416          1.719
176.gcc-base       0.464            0.706            -              1.213
176.gcc-peak       0.382            0.582            0.824          -

Tab. 4.1: Ratio of instruction count means for SPEC component benchmarks 189.lucas-base, 189.lucas-peak, 176.gcc-base and 176.gcc-peak. Values greater than 1.667 or smaller than 0.600 carry no risk of misclassification; the risk of misclassification increases as the value approaches 1.000. Data from Bonebakker (2007).
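As a minimal illustration of inequality (4.16), the following sketch flags ratios that fall inside the overlap band [0.600, 1.667] implied by a ±25% error bound; the function name is illustrative only, and the example values are taken from Table 4.1.

    def may_overlap(ratio, err=0.25):
        # True when two metric values with relative error +/- err could be confused (eq. 4.16).
        lo, hi = (1 - err) / (1 + err), (1 + err) / (1 - err)   # 0.600 and 1.667 for err = 0.25
        return lo <= ratio <= hi

    # 189.lucas-peak vs 176.gcc-base (ratio 1.416) cannot be ruled out,
    # while 189.lucas-base vs 176.gcc-base (ratio 2.156) can.
    print(may_overlap(1.416), may_overlap(2.156))   # True False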


Processor hardware counters and operating system performance metrics provide sufficient distinctive ability for useful workload similarity analysis.

In fact, the ability to distinguish between workloads has been demonstrated in Figure 4.7 and reiterated in Table 4.1. The inverse question is whether repeated observations of the same workload are proximate to each other; in other words, if we repeatedly observe the same workload, will the observations end up in the same place in the workload space? We will address that question in Section 7.6. In this chapter we have defined an approach by which we intend to select representative workloads. These representative workloads are chosen from a set of workloads characterized using computer system metrics. In this last section we have evaluated the error characteristics of the hardware counters and shown that they provide good distinction. Next we need to put our approach into practice and show that it has merit.


5. TESTING THE METHODOLOGY ON A BENCHMARK SET

ABSTRACT
The developed approach is put to the test by comparing hardware counter data for SPEC CPU 2000 from both UltraSPARC III+ and Opteron with data from the literature. We find significant agreement between the literature and the hardware counter data, strengthening the case for the hardware counters.


[Research outline diagram (deductive-hypothetic research strategy): Define the question (Ch. 1, The importance of workloads in computer system design); Gather information and resources (Ch. 2, Current approaches for workload selection in processor and computer system design); Form hypothesis (Ch. 3, Towards an unbiased approach for selecting representative workloads); Prescriptive model (Ch. 4, Constructing the approach); Early test (Ch. 5, Testing the methodology on a benchmark set); Collect data (Ch. 6, Collecting and analyzing workload characterization data); Analyze data (Ch. 7, Grouping together similar workloads); Interpret data and formulate conclusions (Ch. 8, Finding representative workloads in the measured dataset); Present conclusions (Ch. 9, Evaluating workload similarity).]


In this chapter we perform an early test of our proposed approach. Prior to large-scale application, we want to understand its feasibility: does the approach work as intended? With a limited-scope test we want to gain confidence in our approach prior to investing the time and resources for large-scale data collection. In Section 2.6 we presented the combination of Principal Component Analysis (PCA) and clustering when discussing component benchmark similarity in benchmark suites. The presence of existing results on similarity within benchmark suites provides a good testbed for such an early test. We follow the approach outlined in Section 4.1 using SPEC CPU 2000 component benchmarks as workloads. Specifically, we investigate whether SPEC CPU 2000 component benchmark similarity based on hardware counter measurements is consistent with similarity determined from simulation results. This is significant since the usability of micro-architecture (µA) dependent metrics for processor and computer system design is debated. If hardware counter similarity is consistent with simulation-based similarity, then we can sidestep the µA-independent versus µA-dependent discussion and concentrate on the selection of relevant workloads. The outline of this chapter is as follows. In the first section we review the workload similarity literature regarding SPEC CPU 2000. In the second section we present our approach to data collection and clustering. In section three we use the variation of information criterion to quantify the agreement between simulation-based clustering and hardware counter clustering. In the last section we discuss our findings and their consequences.

5.1 SPEC CPU 2000 similarity in simulation

There is a considerable body of recent work (Eeckhout et al., 2005a, 2003a, 2005b, 2002; Joshi et al., 2006; Phansalkar et al., 2004, 2005b,a) related to understanding workload similarity within standard benchmark suites like SPEC CPU 2000. That body of work demonstrates considerable redundancy within SPEC CPU 2000 and suggests that successful analysis of processor architectures can be done using only eight of the 22 benchmarks evaluated in simulation (Phansalkar et al., 2005b; Joshi et al., 2006). These eight component benchmarks are selected based on their representativeness within clusters of component benchmarks. To find these clusters, a simulation framework is used to measure a set of more than 29 µA-independent metrics. An outline of the methodology used to determine similarity was presented in Section 2.6.4. The large investment in time and resources required to collect data from benchmarks in simulation, in this case many months of system time, limits the suitability of simulation based benchmark similarity. By comparing these simulation results with the hardware counter results, we hope to determine if hardware counters can be used for similarity analysis.


5.2 Characterizing SPEC CPU 2000 using processor hardware counters

To assess the general suitability of hardware counters for determining benchmark similarity, we set out to determine how well hardware counter similarity predictions match simulation results. In our approach we collect hardware counter data and process these data similarly to the simulation data. We use the utility WCSTAT as proposed in Chapter 4, but use only the processor hardware counter data. This is appropriate since the simulation results also concentrate only on the processor. We consider the achieved component benchmark clustering as the end product of our measurement and analysis process, and compare this clustering result against the simulation clustering result. We use the variation of information to quantify the match between simulation and measurements. To substantiate confidence in the observed similarity score, we perform a Monte-Carlo simulation and derive the cumulative frequency distribution of the similarity score. We illustrate the steps in our approach in Figure 5.1.

5.2.1 Collecting component benchmark hardware counter data

We installed the SPEC CPU 2000 benchmark suite on two systems running the Solaris 10 operating system¹, with the most recent SunStudio compiler and all recent patches. One system has a 1600 MHz UltraSPARC IIIi processor, the other a 2393 MHz AMD Opteron 250. Each system was configured with 4 GB of RAM and a single (similar) SCSI hard disk. We installed the SPEC CPU 2000 benchmarks in identical fashion on both systems and compiled them using the appropriate configuration files obtained from SPEC (www.spec.org, 2007). We compiled, trained and executed each component benchmark in full compliance with the SPEC run rules. During hardware counter measurement, each component benchmark was repeatedly executed for at least 15 consecutive minutes. During these 15 minutes, hardware counter data collection took place using the standard Solaris utility cpustat. For each processor all available hardware counters were sampled during repeated ten-second intervals. We measured 155 distinct Opteron hardware counters and 62 on the UltraSPARC IIIi. Each hardware counter sample is based on a 0.02 second measurement time. The hardware counters were collected in system mode, collecting combined user and supervisor events. At data collection completion, each hardware counter had been sampled at least 90 times. The results for each counter are averaged and stored for our subsequent analysis. To test if any unexpected events impacted benchmark execution or data collection, we performed additional run and counter validation and detected no anomalies (these tests are discussed in more detail in Chapter 6). We also compared the measured SPEC CPU 2000 performance while sampling with the reported SPECint_base and SPECfp_base results for each system configuration (www.spec.org, 2007).

¹ Solaris is a trademark of Sun Microsystems, Inc.


Fig. 5.1: Summary workflow for the evaluation of hardware counter workload data against SPEC CPU 2000, showing dimensionality reduction, clustering steps and comparison with the results from the literature.


The measured SPECint_base and SPECfp_base results while sampling were each within the noise threshold (1%). This indicates that our sampling approach does not significantly impact benchmark performance. It took about 28 hours of system time to collect both base and peak data for the full SPEC CPU 2000 benchmark set on both systems, excluding compilation and training of the benchmarks. The terms "base" and "peak" refer to the optimization options used during compilation and training. For our similarity analysis we use only the base measurements for the 22 benchmarks used in Phansalkar et al. (2005b).

5.2.2 Reduction, PCA and clustering

The set of metric means is the basis of our analysis. For determining the clustering properties of the collected benchmark data, we deviate from the process outlined in Phansalkar et al. (2005b). We feel this is necessary since we are dealing with hardware metric data. Simulation data is from a single origin, namely the simulator, and all values taken from the simulator reflect the exact same simulation interval. In our case, the hardware counter data do not share that property: the measured data are samples that reflect snapshots of execution and are staggered in time, since we cannot measure all hardware counters simultaneously. We believe these differences warrant a broader approach rather than just following a method that works well for simulated data. We deviate from the method presented in Phansalkar et al. (2005b) on two accounts. First, we do not limit ourselves to a single instance of Principal Component Analysis. Instead we evaluate four different approaches:
1. Clustering based on the full, normalized dataset, before PCA.
2. Clustering based on the most important principal components, i.e., those components before the knee in the scree-plot (Jolliffe, 2002).
3. Clustering based on the principal components that explain at least 85% of the variability in the original dataset, per Phansalkar et al. (2005b).
4. Clustering based on a representative set of chosen metrics, without dimensionality reduction. The chosen metrics are used within the performance evaluation community at Sun Microsystems, i.e., these counters are most commonly used during performance evaluation of benchmarks and applications. This performance set of metrics is listed in Appendix A in Tables A.1 and A.2.
The motivation for these four approaches is straightforward. The first approach clusters the data without any dimensionality reduction, faithfully representing component benchmark similarity in the workload space spanned by the hardware metrics. The second and third approaches use PCA to reduce the dimensionality of the dataset. The second approach follows the recommendation of taking only the most important principal components reflecting the greatest variance, i.e., those components before the knee of the curve in the scree-plot (Jolliffe, 2002), illustrated in Figure 4.3 on page 82. In practice this translates to about 50% of the variability.

Cluster   Overall characteristics                          Data locality characteristics
1         applu, mgrid                                     gzip
2         gzip, bzip2                                      mcf
3         equake, crafty                                   ammp, applu, crafty, art, eon, mgrid, parser, twolf, vortex, vpr
4         fma3d, ammp, apsi, galgel, swim, vpr, wupwise    equake
5         mcf                                              bzip2
6         twolf, lucas, parser, vortex                     mesa, gcc
7         mesa, art, eon                                   fma3d, swim, apsi
8         gcc                                              galgel, lucas
9         ∅                                                wupwise

Tab. 5.1: Clustering results for SPEC CPU 2000 from Phansalkar et al. (2005b)

The third approach follows the common practice of using PCA to explain a desired degree of variability (here 85%, to match Eeckhout et al. (2002), Joshi et al. (2006) and Phansalkar et al. (2005b)). The fourth approach evaluates the performance of chosen metrics; with this approach we evaluate the predictive value of metrics whose de facto usefulness in performance evaluation has been established. The second deviation concerns our clustering methodology. Rather than just using K-means clustering combined with the Bayesian Information Criterion (BIC), we use a model-based approach to clustering. We use MCLUST to perform model-based clustering. One of the main reasons to avoid K-means clustering is its dependence on initial conditions. MCLUST uses a Gaussian model-based hierarchical clustering step to initialize the model. This not only greatly improves the quality of the clustering, it also improves the consistency of the clustering (Banfield and Raftery, 1993; Fraley and Raftery, 1998, 2002b, 2003). This consistency improvement is important since K-means clustering generally does not provide the same result upon repeated executions. See also the discussion in Sections 4.1.3 and 4.6.
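A minimal sketch of the PCA-based reduction used in the second and third approaches, assuming numpy and that no metric is constant; setting target to 0.85 reproduces the 85%-variance criterion, while a smaller target approximates the scree-plot knee.

    import numpy as np

    def pca_reduce(X, target=0.85):
        # Project the standardized metrics onto the principal components that
        # together explain at least the target fraction of the variance.
        Z = (X - X.mean(axis=0)) / X.std(axis=0)        # normalize each metric
        eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
        order = np.argsort(eigvals)[::-1]               # largest variance first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        explained = np.cumsum(eigvals) / eigvals.sum()
        n_comp = int(np.searchsorted(explained, target)) + 1
        return Z @ eigvecs[:, :n_comp]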

5.2.3 Clustering results

Pertinent to this chapter are the results from Phansalkar et al. (2005b) listed in Table 5.1. These results are from data collected using an instruction tracing simulation framework. The instruction tracing framework simulates the DEC Alpha processor and instruction set; the executables are compiled and optimized for the Alpha, even though the metrics extracted from the simulation are considered micro-architecture independent. The data are normalized and processed with PCA to reduce dimensionality and then clustered using K-means clustering. The data-locality dependent part of Table 5.1 illustrates that clustering differences will occur when the underlying data differ. The simulation clustering results presented in the literature (Table 5.1) partition the 22 SPEC CPU 2000 component benchmarks into eight distinct clusters. We obtained the underlying data for the results of Phansalkar et al. (2005b) from Phansalkar et al. (2005a). This allows a multi-step approach to similarity evaluation.


Cluster 1 2

Full dataset mcf ammp, art, eon, equake, fma3d, mesa, vpr, wupwise

50% PC gzip, mcf ammp, art, eon, mesa, vpr, wupwise

85% PC mcf ammp, art, eon, equake, fma3d, mesa, vpr

3

gcc

gcc

gcc

4

bzip2, crafty, parser, twolf apsi

6

crafty, parser, twolf, vortex applu, apsi, mgrid, swim galgel

crafty, parser, twolf, vortex apsi, mgrid, swim, wupwise galgel, lucas

7 8

bzip2, gzip lucas

5

equake, fma3d, galgel, lucas, mgrid, swim vortex applu

bzip2, gzip applu

Performance set mcf ammp, applu, art, crafty, eon, mgrid, parser, twolf, vortex, vpr fma3d, gcc, mesa, swim bzip2 apsi, wupwise galgel, lucas gzip equake

Tab. 5.2: Cluster membership of the SPEC CPU 2000 benchmark set for the data from Phansalkar et al. (2005b). MCLUST forced to make eight partitions.

Cluster 1 2

4 5

Full dataset mcf bzip2, gzip, parser, twolf, vpr ammp, applu, apsi, art, equake, fma3d, galgel, mgrid, swim, wupwise lucas crafty

6

eon, mesa

applu, apsi, art, equake, fma3d, galgel, mgrid, swim, wupwise lucas crafty, gcc, gzip, parser, vortex eon

7 8

gcc vortex

mesa ammp

3

50% PC mcf bzip2, twolf, vpr

85% PC mcf bzip2, mesa, twolf, vpr art, equake, galgel, swim

Performance set mcf bzip2, gzip, parser, twolf equake, galgel, mgrid, wupwise

lucas crafty

art, lucas, swim crafty, vortex

ammp, applu, apsi, eon, fma3d, wupwise gcc, mgrid, vortex gzip, parser

ammp, eon, mesa, vpr gcc applu, apsi, fma3d

Tab. 5.3: Cluster membership of the SPEC CPU 2000 benchmark set for the UltraSPARC IIIi, based on measurements with benchmark set composition based on Phansalkar et al. (2005b). MCLUST forced to make eight partitions.

Cluster 1 2 3

50% PC art mcf ammp, applu, apsi, equake, fma3d, galgel, lucas, mgrid, swim, wupwise

85% PC art mcf applu, apsi, equake, fma3d, mgrid, swim, wupwise

Performance set art mcf applu, galgel, lucas, mgrid, swim, wupwise

4 5

Full dataset art mcf ammp, applu, apsi, bzip2, crafty, equake, fma3d, galgel, lucas, mesa, mgrid, parser, swim, vpr, wupwise gcc eon

gcc crafty, eon

gcc, mesa, vortex eon

6

twolf

7 8

gzip vortex

gzip, parser, twolf, vortex, vpr mesa bzip2

gcc crafty, eon, mesa, vortex ammp, bzip2, parser, twolf, vpr gzip galgel, lucas

bzip2, gzip, parser, twolf, vpr crafty ammp, apsi, equake, fma3d

Tab. 5.4: Cluster membership of the SPEC CPU 2000 benchmark set for the Opteron, based on measurements with benchmark set composition based on Phansalkar et al. (2005b). MCLUST forced to make eight partitions.

Using the original data we can attempt to replicate the original result, and in addition we can perform PCA and clustering on our own terms, using K-means clustering and MCLUST, combined with different dimensionality reduction approaches. We concentrate our comparison effort on the Overall characteristics of Table 5.1. According to Phansalkar et al. (2005b) these were obtained by performing PCA on the whole dataset and retaining between 75% and 90% of the variance in the data. We approach the agreement between our hardware results and the simulation results in a two-step process. First, we force K-means clustering and MCLUST to determine eight distinct clusters based on the four approaches listed in Section 5.2.2: the full dataset (Full), the most important principal components (50PC), 85% variance explained (85PC) and the performance-based metric set (Performance). Second, we let MCLUST decide the optimal cluster count using the BIC. We attempted to use K-means clustering combined with the BIC, but were unable to replicate the Phansalkar et al. (2005b) results, we believe primarily due to the previously discussed initial-conditions problem. To prevent unfair comparison, we only use the forced partitioning results since they put MCLUST and K-means clustering on equal footing. We list the forced partitionings in Tables 5.2, 5.3 and 5.4. For the literature data, we have made the composition of the performance set equal to the data-locality set. For additional clarity we plot the component benchmark distributions using the two principal components in Figure 5.2. These figures illustrate the possible cause of the frequent cluster membership changes between subsequent iterations of the K-means clustering algorithm; visual inspection prefers fewer than eight clusters. Figure 5.2 is informative in other ways. Even a casual observer will see that the structures of the hardware counter based plots are very similar to each other, and much less similar to the literature plot. A closer look indicates that the literature plot is rotated relative to the hardware plots: the principal components are interchanged!


Cluster 1

Full dataset bzip2, crafty, gcc, gzip, mcf, parser, twolf, vortex

50% PC gcc

2

ammp, applu, apsi, art, eon, equake, fma3d, galgel, lucas, mesa, mgrid, swim, vpr, wupwise

3



ammp, applu, apsi, art, bzip2, crafty, eon, equake, fma3d, galgel, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise ∅

4





85% PC applu, apsi, bzip2, equake, galgel, gcc, gzip, lucas, mcf, mgrid, swim, wupwise crafty, parser, twolf, vortex

Performance set bzip2, gzip, mcf

ammp, art, eon, fma3d, mesa, vpr ∅

apsi, gcc, wupwise

ammp, applu, art, crafty, mgrid, parser, twolf, vortex, vpr

eon, equake, fma3d, galgel, lucas, mesa, swim

Tab. 5.5: MCLUST results for the dataset from Phansalkar et al. (2005a), BIC used to determine optimal cluster count.

We switched the principal components for the top-right graph in Figure 5.2 to illustrate the increased similarity. This is a tremendously significant result: it clearly shows that hardware counter based measurements span a space similar to the space spanned by metrics obtained from simulation! In the second comparison step we let MCLUST determine the optimal number of clusters for each approach based on the BIC criterion. These partitions are obtained without prescribing the cluster count. From Tables 5.5, 5.6 and 5.7 we see that the partitions change greatly when the BIC decides the appropriate number of clusters. Striking in that regard is the disparity between all results, both in partitioning and in cluster count. We note that MCLUST strongly disagrees with Phansalkar et al. (2005b) on the appropriate number of clusters. The results in Tables 5.1-5.7 paint a confusing picture of cluster composition and similarity prediction. The effect on cluster composition of changing the dimensionality reduction technique is pronounced. There seems to be little agreement within the same dataset, let alone between the datasets! Instead of relying on human pattern matching, we would like to quantify how much agreement exists between the different solutions.

5.3 Comparing clustering

In this section we compare the Phansalkar et al. (2005b) simulation clustering results with our measurement based clustering result. We use the variation of information criterion as similarity score for clustering results on the same dataset. Since we lack background information on the nature of cluster similarity, we perform a Monte-Carlo simulation of cluster similarity to determine if our results are coincidental.

[Figure 5.2: four scatter plots of the component benchmarks in the plane of the first two principal components (PC1 versus PC2): the literature data, the literature data with the principal components interchanged ("Literature (rotated)"), the UltraSPARC IIIi measurements and the Opteron measurements.]

Fig. 5.2: Principal component distributions for the different datasets. The numbers match the cluster assignment from Phansalkar et al. (2005b).


Cluster 1

Full dataset ammp, applu, apsi, art, equake, fma3d, galgel, swim, wupwise

50% PC ammp, applu, apsi, art, equake, fma3d, galgel, lucas, mgrid, swim, wupwise bzip2, crafty, eon, gcc, gzip, mesa, parser, twolf, vortex, vpr mcf

2

mgrid

3 4

bzip2, gzip, twolf, vpr lucas

5

gcc



6 7 8 9

mcf crafty eon, mesa vortex

∅ ∅ ∅ ∅

parser,



85% PC art, equake, galgel, swim, wupwise

Performance set equake, galgel, mgrid, wupwise

crafty, gcc, lucas, mcf, mgrid, vortex

art, lucas, swim

applu, apsi, eon, fma3d ammp, bzip2, mesa, twolf, vpr gzip, parser

applu, apsi, fma3d

∅ ∅ ∅ ∅

ammp, eon, mesa, vpr bzip2, gzip, twolf gcc mcf crafty, vortex ∅

parser,

Tab. 5.6: MCLUST results for the UltraSPARC IIIi dataset, BIC used to determine optimal cluster count.

Cluster 1

50% PC ammp, applu, apsi, equake, fma3d, galgel, lucas, mgrid, swim, wupwise

85% PC applu, apsi, fma3d, mgrid, swim, wupwise

Performance set ammp, apsi, equake, fma3d, wupwise

2

Full dataset ammp, applu, apsi, bzip2, crafty, eon, equake, fma3d, galgel, lucas, mesa, mgrid, parser, swim, twolf, vortex, vpr, wupwise art

bzip2, mesa

3

gzip

art

crafty, eon, mesa, vortex galgel, lucas

4

gcc

art

5

mcf

crafty, eon, gzip, parser, twolf, vortex, vpr gcc

applu, galgel, lucas, mgrid, swim crafty, gcc, gzip, mesa, vortex art

6 7 8 9 10

∅ ∅ ∅ ∅ ∅

mcf ∅ ∅ ∅ ∅

gzip vpr gcc mcf bzip2, parser, twolf

ammp, equake

bzip2, parser, twolf, vpr mcf eon ∅ ∅ ∅

Tab. 5.7: MCLUST results for the Opteron dataset, BIC used to determine optimal cluster count.

5.3.1 Similarity score

In order to quantify the degree of similarity between our results and those in Phansalkar et al. (2005b), we need a quantitative measure of similarity between clustering results of the same dataset. We use the variation of information (VI) criterion (Meilă, 2002, 2006), discussed in Section 4.7 on page 93. We use the normalized version of the VI, bounded on the interval [0, 1]. We can use this normalized version because all our comparisons are made between clusterings of the same dataset. The VI expresses the similarity between two clusterings as a single number; since we use the normalized version, the metric will be zero when the clusterings are identical, and one when they are completely dissimilar. This alone does not provide enough information to develop our intuition; we also want to understand the effect on the VI of minor changes in cluster composition.

5.3.2 Monte-Carlo simulation

To help develop our intuition of the VI we want to understand the relationship between the VI and random clusterings. Knowing the probability that a VI result is a random event (a fluke) can be inversely interpreted as a measure of confidence in the result: the higher the probability of a chance match, the less confidence we have. We can analytically explore the probability space by calculating the VI score for all possible cluster assignments (8! · 8^14 ≈ 2.2 · 10^16 combinations), but it is more convenient to use a Monte-Carlo method, in which we generate random assignments to explore the combinatorial space. We perform two such Monte-Carlo simulations. First, we calculate the VI distribution of the literature results against one million random assignments of the 22 component workloads over eight clusters. The results are summarized in Figure 5.3(a) as a plot of the cumulative distribution function (CDF) of the VI. The second Monte-Carlo simulation bounds the VI as a function of the number of membership changes: starting from the given clustering in Table 5.1 we randomly change cluster memberships for 1 to N elements and calculate the VI. The resulting VI range is illustrated in Figure 5.3(b); the graph clearly illustrates how the magnitude of the VI change depends on the cluster membership changes. The VI and the CDF combined allow us to evaluate how much relevant information is retained between different clusterings. In this respect the VI score is counter-intuitive: we might expect a random result to average VI = 0.500, but in fact the midpoint (P = 0.500) of the CDF is at VI = 0.618. This is because not all membership changes are equal. Some membership changes have little effect, others greatly affect the VI. For example, the exchange of two singletons has no effect on the VI, while the disappearance of a singleton has a minor effect; spreading half of a cluster over different clusters has a considerable effect. From Figures 5.3(a) and 5.3(b) we can derive that at least eight membership changes are needed before there is an appreciable likelihood of a chance match (i.e., P greater than 0.01 on the CDF).
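A minimal sketch of the normalized VI and of the first Monte-Carlo experiment, assuming numpy; clusterings are represented as integer label vectors and the score is normalized by log(n), one common choice that may differ in detail from the normalization used here.

    import numpy as np
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * np.log(c / n) for c in Counter(labels).values())

    def mutual_information(a, b):
        n = len(a)
        pa, pb, joint = Counter(a), Counter(b), Counter(zip(a, b))
        return sum((c / n) * np.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
                   for (x, y), c in joint.items())

    def vi(a, b, normalize=True):
        # Variation of information: H(A) + H(B) - 2 I(A, B).
        score = entropy(a) + entropy(b) - 2 * mutual_information(a, b)
        return score / np.log(len(a)) if normalize else score

    def vi_cdf(reference, k=8, trials=100_000, seed=0):
        # Monte-Carlo: VI of the reference clustering against random assignments of the
        # same workloads over k clusters; the sorted scores give the empirical CDF.
        # (Increase trials to 10**6 to match the experiment described above.)
        rng = np.random.default_rng(seed)
        n = len(reference)
        scores = [vi(reference, tuple(rng.integers(0, k, n))) for _ in range(trials)]
        return np.sort(scores)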



Fig. 5.3: Cumulative distribution function (CDF) for the Monte-Carlo simulation of the variation of information (VI) scores against the Phansalkar et al. (2005b) clustering results (a). VI score as a function of membership changes (b). The P = [0.001, 0.010, 0.050, 0.500, 0.950] lines represent probability values on the CDF.


Comparison                                                  VI-score  Probability  Clustering
overall characteristics ↔ datalocality                      0.503     0.0317       K-means
overall characteristics ↔ Best UltraSPARC (USIII.FD)        0.390     0.000287     MCLUST
overall characteristics ↔ Worst UltraSPARC (USIII.85PC)     0.578     0.238        MCLUST
overall characteristics ↔ Best Opteron (Opteron.50PC)       0.415     0.000910     MCLUST
overall characteristics ↔ Chosen UltraSPARC (USIII.PS)      0.509     0.0372       MCLUST
overall characteristics ↔ Chosen Opteron (Opteron.PS)       0.482     0.0149       MCLUST
Best UltraSPARC (USIII.FD) ↔ Opteron (Opteron.50PC)         0.210     0.000001     MCLUST
Worst UltraSPARC (USIII.85PC) ↔ Opteron (Opteron.FD)        0.557     0.146        MCLUST
Literature.PS ↔ datalocality                                0.089     0.000000     MCLUST

Tab. 5.8: Comparison results of interest

We can also argue the other way round, starting from a completely randomly chosen set of eight clusters. Assume that this random clustering hovers around the P = 0.50 mark. From the Max VI change line in Figure 5.3(b) we can then deduce that four very specific membership changes are needed to bring the probability to P ≤ 0.01. In other words, we have to use specific knowledge about the membership of four elements relative to the existing distribution. If we were to use random changes, then the Mean VI line in Figure 5.3(b) indicates that at least seven changes relative to the given distribution are needed. Multiple transitions from one cluster to another cluster preserve a lot of similarity information, hence the reduced impact on the VI. VI scores with a correspondingly small probability on the CDF therefore indicate that significant information about the original clustering is preserved in the new clustering result.
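A sketch of the first Monte-Carlo simulation follows, under the assumption that "random assignment" means drawing each workload's cluster label uniformly from the eight clusters; it reuses the normalized_vi helper sketched earlier, and the number of trials is a parameter (one million in the experiment above).

    import random

    def vi_null_cdf(reference_labels, n_clusters=8, trials=100000, seed=1):
        """Empirical null distribution of the normalized VI between a reference
        clustering and uniformly random cluster assignments of the same items."""
        rng = random.Random(seed)
        n = len(reference_labels)                # e.g. the 22 component workloads
        scores = []
        for _ in range(trials):
            random_labels = [rng.randrange(n_clusters) for _ in range(n)]
            scores.append(normalized_vi(reference_labels, random_labels))
        scores.sort()                            # sorted scores define the empirical CDF
        return scores

Reading the sorted scores as a cumulative distribution function reproduces the kind of curve shown in Figure 5.3(a).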

5.3.3 Similarity score and probability results

We have calculated the similarity scores and their location on the CDF between all the clustering solutions found. The scores are listed in Table 5.9 for MCLUST and Table 5.10 for K-means clustering. The row Literature lists the VI results, while the column Literature shows the probability that each result is obtained by chance. We expect some changes in cluster membership based on the differences in data origin: Phansalkar et al. (2005b) is based on DEC Alpha executables, whereas the measured data are based on SPARC and Opteron x86-64 architectures. However, if µA-independent metrics are truly representative of processor behavior, we expect considerable agreement between the measured results and the simulated results. Considerable agreement would be expressed in low VI scores and their matching low probability (P ≤ 0.01). In this regard it is illustrative to investigate the degree of similarity between the overall characteristics and the data locality characteristics. Their VI score is 0.503, with P = 0.0317. While this VI score is reasonably low, it lies on the fast-rising part of the CDF of Figure 5.3(a). As illustrated in Figure 5.3(b), one or two random clustering assignments can be the difference between being strongly similar and nearly random.


In this case we could have artificially improved our results by using similarity scores based only on pair-matching. The unique distribution in the simulation results greatly favors pair matching due to the quadratic (n^2) increase of the number of pairs in large clusters. However, pair matching does not account for singleton clusters, making the VI criterion the more appropriate choice. We compare the measured cluster results with those from the literature, using both the actual literature data and the clusterings listed in Table 5.1. We observe that the results are quite mixed for different methods. For example, the best matching result for the UltraSPARC IIIi (P = 0.000287) is found in Table 5.9 using MCLUST against the full dataset (USIII.FD, without dimensionality reduction). All pertinent results are summarized in Table 5.8. In contrast, the best matching Opteron result (P = 0.000910) is based on MCLUST against the most important principal components (Opteron.50PC). The worst clustering match (P = 0.238) is obtained for MCLUST against the 85% variability PCA for the UltraSPARC IIIi (USIII.85PC). The results using K-means clustering are more consistent, but overall MCLUST performs better, delivering lower VI scores and probabilities. A related point of interest is the performance of the Performance set, the set of chosen metrics. For both UltraSPARC and Opteron it performs well, in both cases beating the 85PC approach; choosing a set of metrics therefore cannot be disqualified as a strategy. We note that MCLUST does particularly well on matching Literature.PS against Datalocality, with a VI score of only 0.0892. This is somewhat expected, since they are based on the same data. Our success with the Datalocality metrics contrasts starkly with our inability to replicate the main clustering result. Another important facet of these results is the internal consistency between the Full data, 50PC, 85PC and Performance set columns for each processor. The VI scores and probabilities are low. The good internal consistency between different methods for the same processor contrasts with the mixed results against the literature. We believe that good internal consistency is indicative of the soundness of our approach; after all, dimensionality reduction is the removal of lesser artifacts. The differences between the clustering results can be partly explained by our decision to force the clustering algorithm to partition into a specific number of clusters. As indicated in Tables 5.6 and 5.7, the algorithm prefers different numbers of clusters depending on the presented data. A final and important observation is the generally strong agreement between the processor results, for example the low probabilities of US3.FD against Opteron.FD through Opteron.PS. This is important since it indicates that the measured data, even though from different processors, are more similar to each other than to the simulation-based metrics!
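The probabilities below the diagonal of Tables 5.9 and 5.10 can be read as empirical p-values on the Monte-Carlo null distribution. A minimal sketch of that lookup, assuming the sorted scores from the simulation sketch above:

    from bisect import bisect_right

    def chance_probability(observed_vi, sorted_null_scores):
        """Fraction of random clusterings that match at least as well, i.e. have a
        VI no larger than the observed score - the chance-match probability."""
        return bisect_right(sorted_null_scores, observed_vi) / len(sorted_null_scores)

For example, the overall-characteristics versus data-locality score of VI = 0.503 corresponds to roughly P = 0.03 on the CDF of Figure 5.3(a), the value reported in Table 5.8.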

5.4 On the µA-dependent and µA-independent debate

In Section 3.2 we already mentioned the ongoing debate between the proponents of micro-architecture (µA) dependent and µA-independent metrics.


            Lit.     lit.FD   lit.50PC lit.85PC lit.PS   US3.FD   US3.50PC US3.85PC US3.PS   Opt.FD   Opt.50PC Opt.85PC Opt.PS   Dataloc
Literature  -        0.340309 0.478012 0.367136 0.551598 0.389529 0.530856 0.577642 0.508790 0.465885 0.414912 0.485870 0.481728 0.503132
lit.FD      0.000023 -        0.351555 0.130864 0.481301 0.383351 0.462595 0.532729 0.423105 0.505403 0.482319 0.425567 0.559129 0.448224
lit.50PC    0.012901 0.000044 -        0.366353 0.537462 0.424123 0.516720 0.532729 0.382333 0.439249 0.497708 0.456345 0.523753 0.504385
lit.85PC    0.000095 0.000001 0.000090 -        0.416592 0.432999 0.504811 0.574945 0.439938 0.520200 0.483764 0.417018 0.494419 0.424286
lit.PS      0.132854 0.014739 0.091188 0.000977 -        0.596676 0.602335 0.631697 0.644388 0.563599 0.515133 0.458381 0.648103 0.089238
US3.FD      0.000287 0.000196 0.001366 0.002093 0.355312 -        0.215624 0.329888 0.334622 0.396135 0.209966 0.293424 0.247327 0.548210
US3.50PC    0.076016 0.007098 0.047434 0.033312 0.385669 0.000001 -        0.342978 0.396441 0.484134 0.307958 0.339854 0.351954 0.553869
US3.85PC    0.238426 0.078283 0.078283 0.225523 0.602547 0.000012 0.000026 -        0.410414 0.556830 0.452203 0.509808 0.439512 0.598619
US3.PS      0.037178 0.001297 0.000188 0.002779 0.686782 0.000016 0.000408 0.000745 -        0.520792 0.441548 0.369407 0.426821 0.570539
Opt.FD      0.008242 0.034248 0.002634 0.052023 0.174182 0.000396 0.016200 0.146241 0.052219 -        0.356688 0.372668 0.427575 0.571294
Opt.50PC    0.000910 0.015305 0.027131 0.016093 0.045615 0.000001 0.000006 0.004657 0.003055 0.000055 -        0.318806 0.241405 0.466667
Opt.85PC    0.017692 0.001507 0.005592 0.001020 0.006046 0.000004 0.000022 0.039269 0.000104 0.000122 0.000008 -        0.322357 0.409915
Opt.PS      0.014948 0.158683 0.058749 0.023690 0.713990 0.000001 0.000045 0.002638 0.001541 0.001661 0.000001 0.000009 -        0.574255
Dataloc     0.031744 0.004028 0.032520 0.001379 0.000000 0.121262 0.139239 0.365502 0.202032 0.210466 0.008307 0.000736 0.221311 -

Tab. 5.9: Similarity scores between all MCLUST results. Above the diagonal are the Variation of Information scores (0 is identical), below the diagonal are the Monte-Carlo simulation probabilities that this is a chance observation. FD: full dataset, 50PC/85PC: 50/85 percent of variability explained, PS: performance set. Literature: full dataset reported in Phansalkar et al. (2005b). Datalocality: clustering based on data locality metrics reported in Phansalkar et al. (2005b).


            Lit.     lit.FD   lit.50PC lit.85PC lit.PS   US3.FD   US3.50PC US3.85PC US3.PS   Opt.FD   Opt.50PC Opt.85PC Opt.PS   Dataloc
Literature  -        0.316370 0.452630 0.320941 0.565100 0.511488 0.559292 0.511488 0.474033 0.498988 0.493138 0.505075 0.468019 0.503132
lit.FD      0.000008 -        0.327617 0.079009 0.430094 0.442636 0.531212 0.483408 0.379799 0.455519 0.500434 0.453648 0.562253 0.424286
lit.50PC    0.004880 0.000011 -        0.401701 0.525582 0.441192 0.448224 0.481963 0.378355 0.347148 0.432835 0.442210 0.535426 0.463613
lit.85PC    0.000009 0.000000 0.000510 -        0.424671 0.447206 0.510400 0.487978 0.358987 0.440102 0.510400 0.430800 0.546836 0.418863
lit.PS      0.181561 0.001850 0.063675 0.001476 -        0.594434 0.601466 0.635205 0.613140 0.498355 0.550701 0.483131 0.647897 0.118129
US3.FD      0.040977 0.003162 0.003029 0.003758 0.343187 -        0.435533 0.040772 0.243348 0.258836 0.410150 0.282347 0.398384 0.557848
US3.50PC    0.159115 0.076281 0.004028 0.040027 0.384398 0.002303 -        0.435533 0.388085 0.448416 0.473344 0.421162 0.422842 0.595658
US3.85PC    0.040977 0.015770 0.015196 0.018287 0.629908 0.000000 0.002303 -        0.284120 0.299607 0.450922 0.323119 0.398384 0.598619
US3.PS      0.011057 0.000173 0.000157 0.000061 0.455148 0.000001 0.000272 0.000003 -        0.376510 0.428856 0.359250 0.442473 0.551171
Opt.FD      0.027939 0.005406 0.000033 0.002842 0.027619 0.000001 0.004071 0.000004 0.000149 -        0.283293 0.162068 0.265605 0.461769
Opt.50PC    0.022168 0.028695 0.002077 0.040027 0.131002 0.000740 0.010798 0.004513 0.001754 0.000003 -        0.266033 0.380034 0.488733
Opt.85PC    0.034109 0.005040 0.003093 0.001875 0.015455 0.000002 0.001233 0.000010 0.000062 0.000001 0.000001 -        0.260845 0.431156
Opt.PS      0.008706 0.167542 0.084777 0.120395 0.713045 0.000433 0.001281 0.000433 0.003123 0.000001 0.000176 0.000001 -        0.585928
Dataloc     0.031744 0.001379 0.007273 0.001071 0.000000 0.150209 0.353934 0.365502 0.132345 0.007003 0.019140 0.001928 0.291766 -

Tab. 5.10: Similarity scores between all K-means clustering results. Above the diagonal are the Variation of Information scores (0 is identical), below the diagonal are the Monte-Carlo simulation probabilities that this is a chance observation. FD: full dataset, 50PC/85PC: 50/85 percent of variability explained, PS: performance set. Literature: full dataset reported in Phansalkar et al. (2005b). Datalocality: clustering based on data locality metrics reported in Phansalkar et al. (2005b).


[Figure 5.4: Real workload space → (candidate workload selection) → Representative workload space → (standard benchmark selection) → Standard benchmark space → (benchmark similarity analysis) → Optimal reduced benchmark space; the figure marks which parts of this flow are covered by computer system metrics and which by simulation metrics.]

Fig. 5.4: Overview of benchmark set creation (repeated from Figure 1.3 on page 21)

As mentioned in Chapter 1, simulators are the dominant tool for evaluating computer architecture, offering a balance of cost, timeliness and flexibility. The biggest limitation of simulators is their low execution speed, which is inversely proportional to the level of detail they support. Simulators are the only source of µA-independent metrics. There are several techniques that increase simulation speed by limiting the level of detail to the subset of interest (Yi et al., 2006). Simulation performance limits its application in broader workload characterization - some form of workload selection must have taken place prior to simulation. We can understand this from the necessity of running the workload within a simulator: some workloads cannot be accurately represented in a simulator, e.g., a large, high performance OLTP database server. In Figure 5.4, repeated from Figure 1.3 on page 21, we outline the applicability of the types of metrics. Simulation-based metrics are untenable in the real workload and representative workload spaces. Simulation simply cannot accurately capture the circumstances of a workload in its natural environment - simulation speed is too slow and the simulator environment too restrictive. On the other hand, spanning an optimal reduced benchmark space from a standard benchmark space can be and has been achieved using simulators, with a significant investment of time and resources (Phansalkar et al., 2005b; Hoste and Eeckhout, 2006; Joshi et al., 2006). Eeckhout et al. (2005b) and Hoste and Eeckhout (2006) mention that hardware counters carry the risk of hiding significant workload behavioral differences. Based on the results of this chapter, we cannot recommend using hardware counters to reduce the standard benchmark space into the optimal reduced benchmark space: the differences highlighted between the processors and the literature in the PCA space in Figure 5.2 are too significant. The risk of hiding significant workload behavioral differences when creating the smallest subset of workloads for computer system performance evaluation weighs heavily on our mind. In our opinion this reduction is currently best left to more detailed simulation as explained in Eeckhout et al. (2005a); Phansalkar et al. (2005b); Hoste et al. (2006); Joshi et al. (2006) and Hoste and Eeckhout (2006). At the same time, this chapter demonstrates that hardware counters have tremendous potential, as indicated by the strong similarity between the UltraSPARC IIIi and Opteron results. This chapter demonstrates the value of using all hardware counters on modern processors, since their combination provides strong distinction in the workload space.


[Figure 5.5: the workload space W maps via the transitions T, T' and T'' onto the simulator-metric space W' and the computer-system-metric space W''.]

Fig. 5.5: Overview of workload spaces and their transitions.

Hoste and Eeckhout (2006) support their claim that µA-dependent metrics are at risk of missing important workload distinctions by showing that 47 µA-independent metrics from their simulator provide much better distinction than the 8 hardware counters of the Alpha 21264A processor. We consider the lack of sufficient hardware counters as fundamental to their result. We believe that if they were to repeat their research on a more advanced processor (with more hardware counters) their results would be much less pronounced. The UltraSPARC IIIi has 62 distinct hardware counters, while the Opteron supports 155, and both processors are of more current design. In the terms of Kuhn (1962), there has been a significant improvement in our observational ability. Thus, while we understand the risk of hiding workload behavioral characteristics, the level of agreement demonstrated in this chapter supports our intended use of all hardware counters for reducing the real workload space into the representative workload and standard benchmark spaces. Hoste and Eeckhout (2006) also confirm the superior applicability of hardware counters by noting the significant time and effort required for measuring µA-independent metrics. Simulation requires that the workload be removed from its natural environment and executed in a simulator, or instruction tracer, in order to obtain results. This process is laborious and slow. We have already demonstrated that hardware counters can be measured efficiently and effectively on a workload in its natural environment. Thus hardware counters provide a critical workload characterization advantage - they are the only means by which we can capture good characterizations of real workloads. We can approach the differences between µA-independent and µA-dependent metrics from a more formal perspective. We define three workload spaces:

W - an implementation-free workload space; it is concerned only with the factual differences in the expression of the workload.


W' - the workload space expressed in simulator metrics; the transition T from W to W' is the mapping of the workload onto that simulator.

W'' - a workload space expressed in computer system metrics, tied to a specific computer system and processor implementation.

The three workload spaces and their transitions are shown in Figure 5.5. In workload space W, each workload is expressed in terms independent of an underlying architecture. The implementation of the workload, i.e., its expression in software, is considered fixed in W; changing the software expression of a workload will change its position in W. The simulator metric workload space W' expresses any workload within the context of the simulator. Even though µA-independent metrics can be extracted from the simulator space, a simulator environment is not free of architectural choices, and each of these architectural choices impacts the µA-independent metrics. As an illustration, consider the memory access pattern obtained from a vector processor compared to a scalar processor. This implies that µA-independent metrics are independent for a subset of architectures only; this subset is limited to architectures sharing the basic design principles of the simulated system. In the computer system metric space W'', any workload is expressed within the constraints of the specific architecture. We have postulated the existence of W to provide context for W' and W''. The transitions T and T' are interesting since they concern the behavior of workload properties under the influence of implementation details. Within the context of this thesis we are much less concerned with W and its transitions. We are interested in effectively selecting representative workloads for use in computer system design; we therefore focus on T'', the mapping between W' and W''. The most important conclusion regarding the differences between µA-independent and µA-dependent metrics is that both are tied to computer system architecture choices. Truly µA-independent metrics cannot be extracted from a simulator that executes a specific computer system architecture implementation - there are too many implicit design assumptions. The value of the µA-independent metrics is their abstraction over types of architectures. There will likely be more discussion in the µA-dependent versus µA-independent metric debate. We feel that science is best served when both parties work together to close the visibility gap between simulator metrics and hardware counters. By working together, the quality of simulators and hardware counters can be improved. This would allow for much stronger validation of simulators and much more accurate and faster selection of optimal benchmark sets.

5.5 Reflecting on the differences in similarity

The intention of this chapter was to test the applicability of our approach. We found that with collected hardware counter data we could span and reduce a workload space for clustering.


Our subsequent investigation of the clustering qualities of these spaces demonstrated considerable agreement between the spaces spanned by different processor architectures as well as between similarity-based results. On first impression any claim of agreement between hardware-based and simulation-based similarity seems tenuous. We believe that the differences between the underlying architectures and collected metrics play a significant role. Another influence is method sensitivity: the analysis results on our measured data showed a strong sensitivity to the particular combination of dimensionality reduction technique and clustering algorithm. Our analysis of the VI scores and their location on the cumulative distribution function for random clustering distributions showed that the differences between the datasets, while considerable, were unlikely to be due to chance. We assert that the uncovered similarity is real and not a chance artifact. This is supported by the Monte-Carlo simulation results: most of the similarities are too improbable to be chance configurations. Therefore, similarity based on measurements from hardware counters agrees with similarity based on simulation results using µA-independent metrics. We believe that the similarity scores can be improved. Improving the similarity scores between hardware counter measurement and simulation would require a calibration effort. The set of metrics collected, their reduction process and the clustering algorithm each contribute to the final clustering result. Instead of taking an already found clustering result and evaluating how closely a hardware-based similarity result matches it, the process should be performed for the different datasets concurrently. Concurrent evaluation of similarity reduces the impact of methodological error, allowing for a better assessment of actual similarity. Optimally, the calibration process would identify the “mapping” between the data collection mechanisms. This works towards our goal of using easy-to-collect metrics for fast and efficient similarity analysis and representative workload selection. Another methodological addition would be the development of a heuristic to determine which combination of dimensionality reduction and clustering techniques is closest to the “truth”. Overall we believe that this chapter provides a bridge between µA-independent simulation environments and µA-dependent workload characterization. As such we will continue to work towards the need identified in Skadron et al. (2003) for a quantitative comparison method. We do not believe that we can avoid future µA-independent versus µA-dependent debate. We do believe that this chapter highlights the potential contribution modern hardware counters can make to some of the validation and selection problems with computer system workloads and benchmarks. The observed agreement between the measured hardware counters and the simulated results allows for a more efficient approach to the exploration of the workload space than using simulators exclusively. We suggest that hardware counter designers and the simulation community work closely together to improve the hardware counters. Improving the hardware counters provides more validation for simulation and, as we have shown here, improved approaches to selecting relevant workloads. The initial goal of this chapter - validating our approach - is convincingly


demonstrated by the previous discussion. We are not blind to the remaining work, but the degree of agreement between the hardware counters of different processors bodes well for the generality of the approach. The degree of agreement between the simulation metrics and the hardware counters is promising - it is far beyond what chance would produce. In short, this chapter greatly boosts our confidence, both in our approach and in working towards our research goal. We now shift our focus back to the research question. In Chapter 4 we proposed a larger scale measurement methodology, aimed at including more metrics and more workloads. We need to focus on the collection and processing of workload data and the subsequent extraction of workload characterization information. This chapter followed a single dimensionality reduction approach to enable comparison with existing results from the literature. For our proposed method the dimensionality reduction methods are already prescribed, as are the clustering algorithms. We will need to develop ways to evaluate method and solution quality in the absence of literature results.


Part II SELECTING A REPRESENTATIVE SET FROM COLLECTED WORKLOADS

6. COLLECTING AND ANALYZING WORKLOAD CHARACTERIZATION DATA

ABSTRACT

We collect data on 960 workloads on UltraSPARC III+ based computer systems using WCSTAT, a standardized workload characterization utility. We validate each workload and remove workloads that are not in steady state or suffer from measurement errors. The remaining 650 workloads are standardized to remove differences in computer system configuration. This resultant dataset spans the workload space for similarity analysis.


[Figure: the deductive-hypothetic research strategy mapped onto the thesis chapters - define the question (Chapter 1), gather information and resources (Chapter 2), form the hypothesis (Chapter 3), construct the prescriptive model (Chapter 4), perform an early test (Chapter 5), collect data (Chapter 6), analyze data (Chapter 7), interpret data and formulate conclusions (Chapter 8), and present conclusions (Chapter 9).]


The coming three chapters form the second part of this thesis: selecting a representative set from collected workloads. It is the large scale application of our approach. In this part we investigate large scale data collection and data reduction, and apply our approach. This chapter covers data collection and reduction. The next chapter covers the selection of an optimal clustering strategy, while the last chapter deals with the selection of representative workloads. We begin our discussion at the top-left of Figure 6.1 - workload characterization. The process of workload characterization requires a computer system running a workload and a set of collection tools to perform the workload characterization (data collection). The output of workload characterization is our raw data - a collection of data files, one collection per collected workload. Unsurprisingly, raw data by itself is useless. A number of steps are followed to transform the raw data into our desired workload characterization data and categorization information. Data cleaning is the process by which the raw data are transformed to a consistent form, stored in a database. Workload validation is the process of determining if the collected workload data are useful. During data reduction the workload data are reduced to single value representations for all metrics. In metric selection the constructed set of workload metrics is cleansed of aberrant metrics. In the system standardization step the data are projected onto a pseudo system representation to remove system configuration effects from the dataset. The end product of system standardization is the final workload characterization data, a concise representation of all accepted workloads and metrics. In parallel we also perform workload categorization, a process by which we determine the type (what is it doing) and origin (on what system was it measured) of each workload. We now discuss the subsequent steps in more detail, and end the chapter with a discussion of the properties of the workload dataset.

6.1 Workload characterization

The requirements prescribe that metric collection should be straightforward and proceed with minimal perturbation of the actual workload. The methodology in this thesis is intended for use on a large scale, which in turn prescribes minimizing human involvement. Throughout the data collection and analysis phase, we will strive for automation when collecting and evaluating the data. The multiple computer systems and workload characterization objects in Figure 6.1 indicate parallelism. Parallelism is important: we expect workload characterization to be a continuous process on many computer systems simultaneously. This parallelism dictates straightforward metric collection as a requirement. The requirement of straightforward metric collection is taken to mean that no special tools should be attached to the computer system and that the system should not require operational interference, i.e., a reboot. Any software required for data collection should therefore be installable without interrupting the system, causing only minimal interference on the machine. Most modern operating systems provide standardized utility programs that provide access to operating system and hardware metrics.


[Figure 6.1: from a computer system running a workload, workload characterization (data collection) produces raw data; data cleaning loads these into the workload dataset (standardized per time interval metric data); workload validation yields the accepted workload list; data reduction produces the workload database (standardized per metric summary data); metric selection and system standardization yield the workload characterization final data (standardized composite summary data); in parallel, workload categorization (meta data) yields the workload categorization information.]

Fig. 6.1: Summary of workload data collection and analysis (based on Figure 3.1 on page 57).


[Figure 6.2: the 1500-second collection window is split at the 600-second mark between the system-metric utilities (mpstat, netstat, iostat, vmstat) and the hardware-counter utility (cpustat).]

Fig. 6.2: WCSTAT data collection sequence

Here we limited data collection to computer systems running the Solaris operating system. The utilities provided by the Solaris operating system are sufficient to collect all the necessary data. One additional tool was used to better collect data related to the network interfaces. This tool, netsum, is internal to Sun Microsystems and has not yet been made available for public use. It is a rewrite of the netstat utility and provides more detailed information on the network interfaces in the computer system. To simplify data collection, the utility programs are controlled by a script, WCSTAT.

6.1.1 WCSTAT

The WCSTAT data collection tool provides a simple-to-use and integrated approach to data collection. This script was developed within the Performance Availability Engineering group at Sun Microsystems, Inc. WCSTAT was developed with the specific purpose of concise and consistent data collection on workloads (Sun Microsystems Inc., 2004). In collaboration with several customer-facing groups at Sun Microsystems, Inc., data were collected from both customer workloads and internal benchmarks. We also used WCSTAT to collect the workload characterization data on SPEC2000. Standard Solaris system tools like cpustat, iostat, etc., are started by WCSTAT with the appropriate parameters to both facilitate and standardize data collection. WCSTAT makes some assumptions regarding workload characterization. Most importantly, WCSTAT assumes that the workload is in steady state and does not significantly change during measurement. The minimum measurement duration is 1500 seconds, split between two different sets of utility tools. The sequence is illustrated in Figure 6.2. This split was designed into WCSTAT to prevent cross-contamination of the tools. Further on we will see why this split measurement scheme requires steady state validation. The list of collected metrics, their origin and the utility used for their measurement are presented in Table A.5 on page 241.

6.1.2 Measurement impact

The requirement for minimal workload perturbation means that we cannot accept more than one percent deviation in the performance of the workload while data are collected. This one percent is chosen based on limits in observability and common practice.


Many workloads demonstrate natural variations in performance, i.e., their performance metrics are inherently noisy. Measurement tool impact should therefore not cause more than a one percent reduction of the average performance. Measuring tool impact on workloads is difficult: workloads often do not have smooth performance profiles, which complicates impact detection. With t-tests and high-pass filters some of the problems associated with the noise and variability can be resolved, but the end result cannot definitively distinguish between workload variability and measurement impact. In this light, our tools were tested on extreme workloads sensitized to interference from the measurement tools. These validation tests demonstrated that a standardized set of collection scripts, like WCSTAT, indeed captures all required metrics without exceeding the one percent perturbation requirement. Testing and validating the performance impact of WCSTAT was a laborious process performed by PAE, its developers.

6.1.3 Origins of the workload data

The data collection script WCSTAT is part of a larger effort within Sun Microsystems to acquire more insight into customer workloads. For the purpose of comparison we are interested in both customer workload data and benchmark data. The customer workloads run the gamut of known commercial workloads: database servers, application servers, web servers and Java applications. A number of benchmarks are also included, with SPEC CPU 2000 contributing the most; others include SPECweb, TPC-C and TPC-W. Customer data were collected on a variety of workloads available at the different customer-facing centers inside Sun. These centers include the Benchmark center, the iForce Ready Centers and different application engineering groups. The workloads in these centers reflect the business activity of Sun Microsystems, and we expect the collected workloads to reflect this too. There is a possibility that the dataset is biased towards certain types of workloads and computer system configurations, relative to the global distribution of workloads. In Section 6.8.3 we discuss the composition of the workload dataset in more detail.

6.2 Data cleaning

The raw data collected during workload characterization are in a cumbersome format: they are stored in disparate files, each with its own specific structure. In the data cleaning step we extract the time-sequence workload characterization data from these files and upload them into the workload dataset. The workload dataset is a central repository that stores all pertinent data in a consistent format; in our implementation an ORACLE database hosts the workload dataset. The process of workload characterization data extraction and upload is performed by a specific utility - the WCSTAT-analyzer.


All practical knowledge specific to the extraction of metrics from the data files is maintained in WCSTAT-analyzer. The tool has been programmed with all known variations of measurement utility output and the processor types supported by WCSTAT. During upload, WCSTAT-analyzer standardizes all measurements to per-second values and stores the type, value and measurement interval in the database. Performing this standardization function in the tool is advantageous since several instances can be active concurrently, streamlining the cleaning and upload process. Another feature of WCSTAT-analyzer is to determine whether the data collection process completed successfully. In cases where WCSTAT-analyzer cannot find all required files an error is reported. Workloads with errors are not added to the repository.
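As an illustration of the standardization step, the sketch below converts raw cumulative counter samples into per-second rates before storage. The actual WCSTAT-analyzer is a Sun-internal tool; the function is therefore only a hypothetical rendering of that step and assumes cumulative counter output (interval-based utilities would instead be divided by their own sampling interval).

    def to_per_second(samples):
        """Convert a list of (timestamp_seconds, cumulative_value) samples into
        (interval_start, rate_per_second) records, as stored in the repository."""
        rates = []
        for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
            interval = t1 - t0
            if interval <= 0:
                continue                     # skip malformed or duplicated samples
            rates.append((t0, (v1 - v0) / interval))
        return rates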

6.3 Workload validation

Successfully uploading the collected workload data into the workload data repository is no guarantee of usability. There are a number of errors that can render workload data useless. We first discuss the possible errors and their impact; we then discuss the rejection criteria used.

6.3.1 Data errors

Working with data collected from diverse workloads with unknown stability properties requires well-defined accept and reject criteria. The data in our dataset were collected by a large number of different people and we cannot assume that all data were collected in strict adherence to the prescribed methodology. The rejection criteria should be formulated such that the best estimates of the appropriate value are carried forward, or, if these values cannot be reliably determined, such that the workload is rejected. There are a number of scenarios that lead to straightforward rejection; these scenarios are:

Configuration error: specific configuration options were used on the computer system that are known to impact data collection, for example the use of processor sets. The purpose of processor sets is to partition the processors in a computer system; the effect is that the processors are unequally utilized. Since it can be quite acceptable for a workload to have different loadings on the processors during execution, we want to distinguish between workload behavior and configuration. Another configuration error can occur when the data collection script does not get access to the hardware counters. This can happen due to customer restrictions (i.e., the customer feared too much interference from the hardware counter measurement), or because the collection script was run in parallel with other collection tools that took precedence when accessing the hardware counters. In these cases, no hardware counter data are available. In most cases these workloads will be rejected during upload.


In some cases the utility collects only a few samples, thus bypassing the rejection criteria of WCSTAT-analyzer.

Operational measurement error: the data collection interval and the workload runtime do not match, i.e., data collection continued beyond the execution time of the workload, or started before the workload started. This effect is visible as head or tail effects on the data (see Figure 6.3(c) on page 136). Another type occurs when the workload was not in steady state during collection, e.g., the workload was ramping up or down when data collection took place. This effect is visible as significant trends in system activity over time. The significance of these trends must be viewed relative to the variance and absolute magnitude of the data.

Workload insignificant: this error occurs when a workload was measured on a substantially over-provisioned system, e.g., running a single-processor workload while there are many more processors available in the system. Insignificant workloads suffer from strong noise components; in other words, the workload data cannot easily be distinguished from random fluctuations.

Workload error: the workload suffers an anomalous event, e.g., the workload stops execution during measurement (see Figure 6.3(b) on page 136), or experiences a sudden peak in activity (see Figure 6.3(d)). Possible causes of these failures are network timeouts, disk contention, mutex locks, etc. We can only speculate about the root cause since we do not have sufficient information regarding the workload. These anomalies make the data suspect since they leave gaps in the data with distinctly different characteristics. These gaps can greatly impact the determination of the mean, and as such workloads with gaps are rejected.

Process error: an error was made during processing that leads to loss of essential data or to duplication of data or workloads. A large part of the collection of workloads is done by others; we therefore do not have full control over the chain from data collection to entry in the repository. However, these problems are of a technical nature and of limited interest. Our main concern here is that we must make sure we are using the best quality data available.

Each of these errors has distinct characteristics. In workload error checking we filter for the most obvious errors.

6.3.2 Workload error checking

The simplest errors to discover are configuration errors and the insignificant workload. As mentioned, most configuration errors will be detected during upload. The few workloads that pass through upload with configuration errors usually upload only a few (≤ 10) data points. In the case of an insignificant workload, system utilization is ≈ 0. Both are trivial to detect and lead to immediate rejection. The most complex errors to detect and distinguish are operational measurement errors, workload errors and workload anomalies.
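The trivial checks can be expressed directly; a sketch follows, in which the ≤ 10 data-point rule comes from the text while the near-idle cutoff is an illustrative assumption.

    def trivially_rejected(metric_samples, utilization_samples, idle_cutoff=1.0):
        """Immediate rejection tests: too few uploaded data points (configuration
        error) or a near-idle system (insignificant workload)."""
        too_few_points = len(metric_samples) <= 10
        mean_util = (sum(utilization_samples) / len(utilization_samples)
                     if utilization_samples else 0.0)
        near_idle = mean_util < idle_cutoff      # utilization in percent; cutoff is illustrative
        return too_few_points or near_idle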


Since measurement has been split over two intervals, we must use a representative metric for each measurement interval to determine if the workload was indeed error-free. By using a stringent stability criterion we can detect most, if not all, of the complex errors. The next step is to filter the workload set for missing data. We construct a model dataset by taking the mean of every observed metric for each workload stored in the database. Next we verify this set to determine the impact of missing data. As mentioned before, we have missing data due to differences in the version of WCSTAT, differences in processor revision and measurement features. The two main causes of missing data are computer system differences and different versions of WCSTAT. While the basic set of hardware counters is identical for the UltraSPARC III processor family, certain hardware counters have been implemented only in later revisions of the processor. In our data matrix, each missing data point must be considered in the context of the workload and the metric. Obviously we want to remove metrics that miss a substantial percentage of data points, just as we want to remove workloads that miss a substantial number of metrics. We filter the dataset in a step-wise manner, removing either the metric or the workload with the most missing data. If there are an equal number of workloads and metrics with missing data, we remove the workloads. Out of a total of 1183 collected workloads, 960 workloads survived the upload and error checking process. These measurement and missing data issues are best fixed at the data collection level - we cannot compensate for changes in computer systems.
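A sketch of the step-wise missing-data filter described above, operating on a workload-by-metric table in which missing entries are None. The tie-break below (dropping the workload when the worst workload and worst metric miss equally many entries) is a simplification of the rule in the text, and the data layout is illustrative.

    def filter_missing(data):
        """data: {workload: {metric: value or None}}. Repeatedly drop the workload
        or metric with the most missing entries until nothing is missing;
        returns the surviving workloads and metrics."""
        workloads = set(data)
        metrics = {m for row in data.values() for m in row}
        while workloads and metrics:
            w_miss = {w: sum(1 for m in metrics if data[w].get(m) is None) for w in workloads}
            m_miss = {m: sum(1 for w in workloads if data[w].get(m) is None) for m in metrics}
            worst_w = max(w_miss, key=w_miss.get)
            worst_m = max(m_miss, key=m_miss.get)
            if w_miss[worst_w] == 0 and m_miss[worst_m] == 0:
                break                            # no missing entries remain
            if w_miss[worst_w] >= m_miss[worst_m]:
                workloads.discard(worst_w)       # tie resolved by dropping the workload
            else:
                metrics.discard(worst_m)
        return workloads, metrics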

6.3.3 Workload stability analysis

We require good quality data for our subsequent analysis, but we have used a measurement methodology that collects different data in two distinct phases. We must qualify and reject data for any of the errors identified in Section 6.3.1. Significant changes in the measured data are indicative of errors, while variability might be an inherent feature of the workload. From the perspective of data collection, highly variable workloads require more data collection in order to distinguish between undesirable trends and workload related variability. If no such distinction is possible, the workload must be rejected. To test if a workload is in steady state, we take the per-processor idle time and the per-processor instruction count. Under the assumption that workload collection took place during steady state, system idle time and instruction count should not significantly vary over the measurement time interval. Since these metrics are measured per processor we assume that they accurately reflect activity on the whole system. We do not merge data collected in the same time interval over multiple processors to create an average value for that time, since we believe that this reduces the quality of the comparison by reducing the number of data points, thus obscuring the distribution. Under the strict assumption of steady state, there should be no significant variation in the measured metrics. We introduce a Kolmogorov-Smirnov measure based on the two-sided Kolmogorov-Smirnov test to determine if that is the case.


[Figure 6.3 panels: (a) typical instruction count data for a stable workload; (b) example of outliers in system utilization data; (c) example of rejected data; (d) example of rejected system utilization data; (e) example of incorrectly rejected instruction count data; (f) example of incorrectly rejected system utilization data.]

Fig. 6.3: Example data distributions


The two-sided Kolmogorov-Smirnov test is a non-parametric test to determine the likelihood that two observed distributions are the same (Sheskin, 2004). We accept two measured distributions as similar if the reported value p is greater than a predefined value ε. A value of one means the two distributions are identical. The value ε represents the sensitivity of the test, ε ∈ (0, 1]; the closer ε is chosen to one, the more sensitive the test will be and the higher the rejection ratio. The distributions we wish to compare are created by subdividing the whole dataset into four sub-sets, each composed of one quarter of the measurements in the order they were measured. This gives us four empirical distribution functions for the measurements in that workload. We use the two-sided Kolmogorov-Smirnov test to determine the distance between the sub-distributions. For each pair of sub-sets (six in total), we calculate the p-statistic of the two-sided KS-test: the likelihood that both sub-sets have the same distribution. As acceptance criterion we take p ≥ ε with ε = 0.05, where p is the likelihood that they have the same distribution. The value of ε was chosen to accommodate at least some of the expected variability in the workload data. Since we perform the test six times between all subset pairs, we only reject the workload if three or more subset pairs are rejected. The rejection of three or more pairs matches our intuitive understanding of how the test should perform if a subset differs. If fewer than three pairs are rejected we can attribute that to larger scale variations in the workload and not reject the workload outright. For brevity we will refer to this approach as determining the KS-measure of the distribution. An example of typical instruction count data is presented in Figure 6.3(a). We combine the KS-measure based rejection criterion with a number of data quality tests. These data quality tests are included to remove datasets that have missing data or contain no information. All workloads that have fewer than ten data points for any metric are rejected outright. If a metric has zero variance and a mean of zero, the workload is rejected too. For the instruction count the reason is clear, since a zero value indicates that no instructions executed. For the idle time, or its inverse, utilization, we reject the workload if the average utilization is near zero with small variance. This situation would arise if the measurement is performed on an idle system. The application of the KS-measure in the described form leads to the rejection of circa half of the collected workloads. By analyzing the origin of the collected workloads, we find that a mixture of workloads is rejected. More than 20% of the collected SPEC CPU 2000 component benchmarks are rejected. Since the SPEC CPU 2000 component benchmarks were collected under ideal circumstances, with full control over the computer system, their rejection is an indication that the KS-measure is overly harsh in its verdict. Visual inspection of the rejected workloads identified cases of KS-measure rejection that we should accept. This rejection behavior has two root causes: data collection was performed poorly (Figure 6.3(c)), or the steady state of a real workload differs from the interpretation of the two-sided KS-test. The two-sided KS-test purely tests for the similarity of the original distributions; it cannot take into account acceptable behavior based on random fluctuations in the workload. These cases are still rejected.
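A sketch of the KS-measure, written in Python with scipy's two-sample KS test as a stand-in for the tooling actually used; the time-ordered per-processor samples (instruction count or idle time) are passed in as a single sequence.

    import numpy as np
    from itertools import combinations
    from scipy.stats import ks_2samp

    def ks_measure_accepts(values, eps=0.05):
        """Split the time-ordered measurements into four quarters, run the
        two-sided KS test on all six quarter pairs, and accept the workload
        unless three or more pairs look drawn from different distributions."""
        values = np.asarray(values, dtype=float)
        if len(values) < 10 or (values.var() == 0 and values.mean() == 0):
            return False                          # data quality tests from the text
        quarters = np.array_split(values, 4)      # four consecutive sub-sets
        rejected_pairs = sum(1 for a, b in combinations(quarters, 2)
                             if ks_2samp(a, b).pvalue < eps)
        return rejected_pairs < 3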


Based on a visual inspection of the instruction count data for the SPEC CPU 2000 component benchmarks, it seems that an important cause of KS-test rejection is related to low variance of the distribution. We illustrate this in Figures 6.3(e) and 6.3(f). Low variance makes the KS-measure very sensitive to minor fluctuations in the workload, since these minor variations are significant relative to the variance. The two-sided KS-test compares the cumulative distribution functions of the measured data partitions and finds the maximum value D, the vertical distance between the two distributions. If D is larger than a predefined cutoff D*, the measured data are not from the same distribution. p and D* are related through a scaling based on the sample size. A workload with a small variance will have a steep CDF; comparing two partitions with steep CDFs will therefore lead to cases where even minor differences between the two distributions produce large values for D and thus rejection. This is an unavoidable trait of the KS-measure. We maintain the KS-measure as our primary measure of stability since it performs rather well for datasets where the variance is large. We therefore require an additional method to test if low variance is a contributing factor in workload rejection. We assume that if the KS-measure accepts a dataset, the workload is in steady state. The goal of workload data processing is to assign a single representative value to each measured metric and use that single value in the determination of workload similarity and distance. Simply testing the ratio of variance to mean for workloads that are rejected by the KS-measure does not significantly reduce the rejection rate of SPEC CPU 2000 component benchmark data. Further study of the variance to mean ratio, as well as visual inspection of the instruction count data for rejected SPEC CPU 2000 component benchmark data, indicates that the main cause of the rejection rate is large outliers in otherwise stable data, as indicated in Figure 6.3(b). These large outliers significantly increase the variance; removing them allows the KS-measure to complete successfully. Unfortunately, detecting outliers is not a simple endeavor and is further complicated by the acceptable distribution types for the data. Again our main guidance is the assumption of workload stability and the related stability of the measured data. Instead of detecting and removing the outliers, we use the LOWESS function (Cleveland, 1979, 1981) to apply smoothing to the collected data. The name LOWESS is derived from the term locally weighted scatter plot smooth, since the method uses locally weighted linear regression to smooth data. The smoothing process is considered local because, like the moving average method, each smoothed value is determined by neighboring data points defined within the span. The process is weighted because a regression weight function is defined for the data points contained within the span. The method uses a linear polynomial in the regression function. The smoothing window of the function determines how many data points are taken into consideration by the smoother; the size of the smoothing window is expressed as the fraction f of the dataset used. For our smoother we use the R default fraction of f = 2/3 (R Development Core Team, 2006). Larger f leads to smoother results. Validating the quality of the measured data using a smoother is performed by determining the ratio r of the standard deviation of the smoothed data to the mean.
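A sketch of the LOWESS-ratio check, using the LOWESS implementation from statsmodels as a stand-in for the R lowess function referenced above; the denominator is taken as the mean of the raw data, one plausible reading of the ratio defined in the text.

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    def lowess_ratio_accepts(values, frac=2.0 / 3.0, eps=0.05):
        """Smooth the time-ordered measurements with LOWESS and accept the workload
        if the ratio r of the smoothed values' standard deviation to the mean
        stays below eps."""
        y = np.asarray(values, dtype=float)
        x = np.arange(len(y), dtype=float)
        smoothed = lowess(y, x, frac=frac, return_sorted=False)   # smoothed y at each x
        mean = y.mean()
        if mean == 0:
            return False                      # an idle system carries no information
        r = smoothed.std() / mean
        return r < eps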

[Figure 6.4: stacked breakdown of the 960 collected workloads: 650 are accepted, 146 are rejected on both criteria, and the remaining workloads (74 and 90) are rejected on only one of the two criteria (instruction count or utilization).]

Fig. 6.4: Breakdown of workload rejection by criterion

This ratio scales with the mean, thus penalizing measured data with a small mean and high variance. For both instruction count and system utilization, the quality of the data improves with increased mean. While it is unlikely that the instruction count will be zero, low instruction counts (i.e., less than 10^6) are unlikely to occur for workloads executing on gigahertz processors. If low instruction counts do occur, this is most likely indicative of an idle system and therefore of operational error. We analyze the rejection rate using the LOWESS smoother by rejecting all workloads with r ≥ ε. The value ε = 0.05 was retained after considering both the anticipated noisiness of the data and our desire to rule out strong linear trends in the data. We name this rejection criterion the LOWESS-ratio. Similar to our evaluation of the KS-measure, we visually inspected the workloads rejected by the LOWESS-ratio. We observe that the algorithm performs poorly in cases where the data have high variability and performs well when low variability is present. This is not surprising, as it reflects one of the decision criteria. We conclude that the weaknesses of the KS-measure are compensated for by the strengths of the smoothing approach and vice versa. This approach is also computationally very efficient, evaluating workloads at a rate of seven workloads per second. We present the rejection rate per criterion in Figure 6.4. The percentage of rejected workloads is considerable (33%). After inspecting a sample of the rejected workloads, it became clear that data collection errors are an important factor in rejection, hence the high number of rejections on both criteria. As such, more attention should be spent on optimizing data collection methodology and execution. Additionally, the disjoint nature of data collection used in WCSTAT introduces overly rigid restrictions.


The two distinct measurement intervals effectively require the data to be in steady state for 1500 seconds. There are not many workloads, other than benchmarks and high performance computing applications, that provide such stable workload characteristics. Most workloads fluctuate, and can fluctuate wildly over the course of minutes as they respond to differences in demand. Collecting all metric data simultaneously, as well as extending the measurement interval, may alleviate these problems. Simultaneously collecting all workload relevant data introduces additional complexity given that some of the measurement tools may interact with other tools. We anticipate that it should be possible to correct the data for measurement tool impact, for example by allowing for some baseline measurements. Overall we expect the benefits (lower workload rejection) to be greater than the cost (more involved data reduction). Since there is nothing else we can do to improve the quality of our collected dataset, we proceed with our dataset. The output of workload stability analysis is a list of stable workloads.

6.3.4 Accepted workload list

Since the inception of this research, the workload analysis project at Sun Microsystems, Inc. has collected data from a large number of different computer systems spanning a large range of processor architectures and operating systems. To reduce some of the uncertainties related to system configuration and processor architecture effects, we decided to limit ourselves to a single processor family. We chose the UltraSPARC III processor family for our dataset. The UltraSPARC III processor family includes the UltraSPARC III, UltraSPARC III+ and UltraSPARC IIIi models, and varies between 750 MHz and 1600 MHz in clock-speed. The family also has varying L2-cache sizes, ranging from 1 MB to 8 MB. We further limited the impact of processor design variances by selecting only workloads measured on UltraSPARC III+ processors with clock-speeds ranging from 900 MHz to 1280 MHz and an 8 MB second level cache. There were a total of 960 such UltraSPARC III+ based workloads in our workload dataset. After workload validation a total of 650 workloads were accepted.

6.3.5 Reflecting on workload validation

In this section we demonstrated how workload characterization data are filtered and processed into representative values. The next step in the approach is to use these reduced data to construct a workload space. An important factor in this method is the time required between workload characterization and analysis. The demonstrated ability to efficiently collect and process data greatly enhances the value of the approach. One could envision its use in targeted research settings where specific classes of workloads are collected and quickly analyzed. In this thesis we use the complete set of accepted workloads; however, it is essential to realize that this is not a requirement. Instead of continuing with the whole dataset,


we could select only a subset of the workload data. This subset could be based on workload characteristics, represented in the reduced data, or it could be based on the workload description provided. For example, we could decide to limit ourselves to database workloads, or to high performance computing workloads. Alternatively, continuing the example from the previous paragraph, we could concentrate on a set of recently collected workloads. The ability to select subsets of interest is thus of great value. Next we move to data reduction and metric selection on our way to spanning the representative workload space.

6.4 Data reduction

In preparation for further analysis with statistical models, the dataset must be reduced to a workable, single-value representation. This representation consists of a robust estimate of each metric's mean over its measurement period. These summary results together form the workload database, i.e., the first concise representation of our dataset. In the previous section we determined that stable workload data are not necessarily free from anomalies or outliers. We intend to represent the collected data with a measure of central tendency, and we want that measure to be resilient against both outliers and anomalies. Anomalies are changes in workload behavior that are not significant enough to reject the workload outright, while outliers are data points that clearly do not fit the majority of the data. We know, from reviewing the KS-measure and LOWESS-measure results, that some datasets containing outliers or anomalies were accepted. We consider remaining dataset anomalies an artifact of our data-rejection methodology. Any measure of central tendency we choose needs to be stable in the presence of anomalies and outliers and perform as expected. Since it is impossible to manually review the collected data for all 251 metrics and all workloads (251 × 960 = 240960 combinations), we pay special attention to the selection of our central tendency measure. The mean as a metric value estimator is very sensitive to outliers and anomalies; we therefore reject the mean as a measure of central tendency for the WCSTAT data. An alternative is a trimmed (robust) estimate of the mean, computed after discarding the bottom and top 5% of the measured metric values. While this would most likely work in the presence of outliers, the volume of data points in workload anomalies may exceed the chosen 5%. Although the 5% cutoff is a choice, and we can easily increase the amount of data discarded, we desire a more reasoned approach. Another popular measure of central tendency is the median. The median is not sensitive to outliers, but lacks distinction in cases where a workload has a bimodal or higher-order distribution. We accept that in the case of multi-modal distributions single-value representation can lead to loss of distinction.
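The toy R example below, on synthetic data, illustrates the sensitivity argument made above: a handful of outliers pulls the plain mean away from the bulk of the data, while the 5%-trimmed mean and the median stay put. The numbers are illustrative only.

# Synthetic metric samples: a stable level around 100 plus a few extreme outliers.
set.seed(1)
x <- c(rnorm(200, mean = 100, sd = 5), rep(1000, 5))

mean(x)               # pulled upwards by the outliers
mean(x, trim = 0.05)  # robust estimate: bottom and top 5% discarded
median(x)             # insensitive to the outliers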


In the set of accepted workloads, anomalies have not been rejected by the LOWESS-measure, even though the KS-measure might have initially rejected the workload. We therefore consider the mean of the LOWESS smoother as our measure of central tendency. The advantage of using the LOWESS smoother is that it is known to work well with anomalies present in the dataset. For convenience we call our LOWESS-based measure of central tendency the LOWESS smoothed data mean, or LSD-mean. While for stability calculations the processor hardware counter metrics were considered on a log-scale, we must not use the log-scale when determining the LSD-mean. Using the LSD-mean on logarithmically scaled data would emphasize the lower value measurements and lead to an underestimation of the desired LSD-mean.

The use of the LSD-mean significantly increases the computational burden for constructing the workload dataset. In practice our implementation in R needs about 0.4 seconds per workload metric. Since each workload is separate, we can run multiple R processes in parallel. The total processing time for all thesis data was about twelve hours. If this methodology were adopted and deployed to support the business process of computer system design, the calculation of the data validity checks and LSD-means should take place at data upload time. Performing all calculations at data upload spreads the computational burden in time, since the upload times of workloads are independent. This keeps the methodology feasible on a large scale, per requirement 8.

Using the LOWESS smoother, we calculate the LSD-mean for all 251 metrics and 650 accepted workloads. The LOWESS smoother is applied to the data of each workload metric, and the mean of the smoothed data is the LSD-mean. The obtained value is stored, together with the mean, median and variance, in the database for future reference. We illustrate this workflow in Figure 6.5, where multiple workload summarizers illustrate its inherent parallelism. Retaining results in the database reduces work since only added workload values need to be calculated. Recalculating the stored values is needed only if we make changes to the LSD-mean parameters or choose a different measure of central tendency. The output of data reduction is stored in the workload database. The workload database gives us a first glimpse into the behavior of all the collected metrics over our selected sample of 650 workloads. The next step evaluates this collection of metrics.
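A minimal R sketch of this summarization step is shown below. The data layout (a list of per-workload metric data frames with columns interval and value) is a hypothetical stand-in for the actual workload database, and the LOWESS parameters are the base-R defaults rather than the exact settings used in the thesis.

library(parallel)

# LSD-mean: mean of the LOWESS-smoothed per-interval metric values.
lsd_mean <- function(wl) {
  sm <- lowess(wl$interval, wl$value)
  mean(sm$y)
}

# Synthetic stand-in for the per-workload metric data.
workload_metrics <- replicate(8,
                              data.frame(interval = 1:150,
                                         value = rnorm(150, mean = 1e6, sd = 1e4)),
                              simplify = FALSE)

# Workloads are independent, so the summaries parallelize trivially.
lsd_means <- mclapply(workload_metrics, lsd_mean, mc.cores = 2)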

6.5 Metric selection

One of the main difficulties of measuring the processor hardware counters of the UltraSPARC III family is the presence of the kernel idle loop. The kernel issues idle instructions when there is no work present; measurement on the UltraSPARC III family must therefore take place separately for supervisor (or kernel) and user-land events. This is reflected in the dataset, where we report both the user instruction count and the kernel instruction count. If we are to fairly characterize the processor performance for a workload, we should combine the user instruction count and the kernel instruction count and use the combination as a measurement of the total number of instructions executed on the processor. The same combination is made for all hardware counters for which we have measured both user and kernel mode events.


Fig. 6.5: Updated workflow with algorithms and parallelism added. [Workflow diagram: data collection (WCSTAT), data cleaning and upload, workload error checking, workload stability analysis (KS measure), data reduction (LOWESS smoother), metric selection, system standardization and workload categorization, leading to the accepted workload list, the workload database and the final workload characterization data.]


A number of hardware counters have also been measured in system mode, the combination of user and kernel. These latter hardware counters are not impacted by the kernel idle loop and can therefore safely be measured as such. When applying the metric filter (Figure 6.5) we update the workload dataset, creating a new metric by combining the user and kernel metrics for the pertinent hardware counters. If we also have a system level measurement available, we give preference to the user and kernel measurements. Since the metrics in the model data have already been adjusted to reflect events per second, we may take their sum as representative of the workload on the computer system.
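A small R sketch of this combination step is given below; the naming convention (a _user and _kernel suffix per counter) and the example values are assumptions made purely for illustration.

# Combine every user/kernel counter pair that shares a base name.
# All rates are events per second, so a plain sum is valid.
combine_user_kernel <- function(df) {
  user_cols <- grep("_user$", names(df), value = TRUE)
  for (uc in user_cols) {
    base <- sub("_user$", "", uc)
    kc   <- paste0(base, "_kernel")
    if (kc %in% names(df))
      df[[base]] <- df[[uc]] + df[[kc]]
  }
  df
}

# Hypothetical per-workload counter rates.
rates <- data.frame(Instr_cnt_user = 2.1e9, Instr_cnt_kernel = 1.5e8,
                    DC_miss_user   = 3.2e6, DC_miss_kernel   = 4.1e5)
combine_user_kernel(rates)   # adds combined Instr_cnt and DC_miss columns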

6.6 System standardization

The data stored in the workload database come from a diverse background of system configurations and workload types. The two most important issues here are understanding the effect of utilization and the effect of system configuration on the workload data.

6.6.1 The impact of system utilization

Apart from correcting for the system configuration, an additional issue is presented by the system utilization. Ideally all comparison workloads fully utilize their system. When system utilization is at 100%, we can assume that all resources are maximally used within the context of the workload. While reaching 100% utilization is consistently achievable for benchmarks and high performance computing workloads, it is rarely reached in reality. Moreover, high utilization is considered undesirable for many workloads: many commercial workloads do not operate at 100% system utilization because the companies want to retain some headroom for workload fluctuations. The utilization differences lead to an interesting effect in our dataset. More than half of our dataset is based on SPEC CPU 2000 measurements on various configurations. Most of these benchmark measurements have near 100% system utilization over their measurement interval. The other workloads are collected from diverse sources and reflect the real world of workloads, with greatly varying system utilization. We must investigate how system utilization impacts the workload characteristics and derive methodologies that assist us in also normalizing for system utilization. This is relevant since system utilization is reflected in the instruction count. If we standardize the measured workload data relative to the instruction count, the effects of system utilization are removed. However, since we would like to use the prediction of instruction count as a measure of relevance for a metric, as opposed to using all metrics indiscriminately, standardization relative to instruction count is unattractive. Standardization relative to instruction count can be used once metric relevance has been established. An alternative approach is to standardize on the cycle count. The cycle count represents the number of clock cycles the processor is not idle. By standardizing the cycle count to a fixed value common to all workloads, we compare workloads based on the work performed per unit of clock cycles.

Fig. 6.6: Sample density histograms for instruction count and system idle time, complete dataset. [(a) Idle time sample density: density versus Processor Idle (0–100%); (b) Processor instruction count sample density: density versus log10(Instruction count) (0–10).]

Given the potential value of a General Additive Model predicting instruction count, we standardize all workloads to a unit of 1·10⁹ clock cycles. In Figure 6.6 we illustrate the distribution of instruction count and idle time in our dataset. This density represents all collected samples prior to data reduction (see Figure 6.1 on page 130). The system idle data represent all utilization data collected for every processor in our dataset. The high density peak near 0% idle reflects a large number of observations of fully utilized processors. Most workloads achieve low idle times during their execution. The increase in density for processors that are near 100% idle can be a reflection of unbalanced resource use within the measured computer systems, or it can reflect issues with the measurements. The density histogram for the instruction count illustrates that the distribution is very broad, covering a dynamic range between 10² and ∼4.8·10⁹. Note that while the maximum clock speed is 1200 MHz, the UltraSPARC III+ processors are capable of retiring four instructions per clock cycle, hence instruction counts that exceed the clock frequency (up to 4 × 1.2·10⁹ = 4.8·10⁹ instructions per second). The maximum measured instruction count (just under 4.8·10⁹ per second) was measured on Linpeak, a high performance computing application.
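The sketch below illustrates this cycle-count standardization in R; the metric names and values are synthetic, and the per-second rates are assumed to already be the reduced single-value summaries.

# Scale per-second event rates so that each workload is expressed per 1e9
# non-idle processor cycles.
standardize_cycles <- function(rates, cycles_per_sec, target = 1e9) {
  rates * (target / cycles_per_sec)
}

# Synthetic example: events per second and non-idle cycles per second.
wl_rates <- c(instructions = 2.4e9, dcache_misses = 3.1e6)
standardize_cycles(wl_rates, cycles_per_sec = 1.1e9)   # events per 1e9 cycles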

6.6.2 Compensating for different system configurations: system normalization

Since the workloads have been collected from different systems with different configurations, i.e., different numbers of processors, disks and network interfaces, we have to compensate for these differences prior to workload similarity analysis. Not compensating for these


differences would leave system configuration as a first-order effect. We call the process of correcting for configuration differences system normalization. From a computer system modeling perspective, we understand a computer system to be a combination of finite resources, fit into an enclosure. For example, a computer system can have two processors, four GB of memory, two internal disk drives and a single network card. Another example of a system is 64 processors, 64 GB of memory, 150 disk-drives and 24 network interfaces. Conceptually these systems use the same resources, albeit in different composition. Practically, however, there are significant differences between these example systems, yet we want to compare workloads captured on them. From a data collection perspective, the systems are not significantly different: the larger system naturally generates more data since each resource is measured and reported. Recall that our measurement strategy uses a sample-based approach. This allows us to average the reported resource metrics over a ten-second interval and calculate the sum of all similar metrics for that interval. This sum is then divided by the number of resources, mapping the workload to a pseudo system of one processor, one disk and one network interface.

How does a pseudo system compare to reality? By doing the summation, we believe that we introduce neither a real smoothing effect nor any artifacts into the data. Our reasoning behind this assertion is the following: workload similarity compares workloads based on their workload properties. The modeling approach that makes this possible is the pseudo system model, which standardizes all resource properties to a single resource. Thus we consider each workload to be a set of data, for which we only need the summed event counts per metric, per interval, and the processor count. The standardization of all the processor hardware counters is done by dividing the metrics by the processor count. The other resources, i.e., the disk and network interface, are virtualized to a single device with infinite capacity. We are not as much interested in the capacity of the disk and network resources as in their utilization and throughput. In other words, we believe that if the disks and networks show substantial activity, this activity must also be reflected in the processor hardware counters. Then, if there are a great many devices attached to a computer system, the relative processor activity due to those devices will increase as well. Of interest to us is only the device load per processor; hence our pseudo system with single devices. Another argument that follows this reasoning is based on processor scheduling on actual systems. Unless systems are configured to use processors for specific tasks, the operating system scheduler will move work around over the processors as it deems fit. While this might look chaotic on short time scales, i.e., less than a minute, the individual chaos is no longer visible when observed over a longer time. Since the scheduler treats all processors as equal, we can do so too. Together this motivates why we can move to a system-summed view without caring for the details of the individual processors; part of their individual differences are effects introduced by the operating system scheduler, and in effect they are exactly the artifacts we want to remove.
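The R sketch below gives one possible shape for this pseudo-system mapping; the long-format layout (one row per resource instance, per metric, per ten-second interval) and the cpu_ prefix used to identify processor counters are assumptions made for the example, not the actual WCSTAT format.

# Map a multi-resource system onto the pseudo system of one processor,
# one disk and one network interface.
pseudo_system <- function(samples, n_cpus) {
  # Sum the per-resource samples of each metric within each interval.
  summed <- aggregate(value ~ interval + metric, data = samples, FUN = sum)
  # Processor hardware counters are divided by the processor count;
  # disk and network metrics stay summed (virtual single device).
  cpu <- grepl("^cpu_", summed$metric)
  summed$value[cpu] <- summed$value[cpu] / n_cpus
  summed
}

# Tiny synthetic example: two processors, one interval, two metrics.
samples <- data.frame(interval = 1,
                      metric   = c("cpu_instr", "cpu_instr", "disk_kb_read"),
                      value    = c(1.2e9, 0.8e9, 5.0e4))
pseudo_system(samples, n_cpus = 2)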


These final steps lead to the final workload characterization dataset. Erroneous workloads and metrics have been removed, and the data have been summarized and represented by single values. The final workload characterization set contains 650 workloads and 73 metrics.

6.7 Workload categorization

A separate activity, illustrated in Figures 6.1 and 6.5, is workload categorization. Where workload characterization collects a quantitative description of the workload, workload categorization provides a qualitative description. In other words, the former collects data from the system, the latter about the system. Categorical data are important for computer system design: they are the data by which we determine relevance in a volume and monetary sense. As mentioned in the first chapter, workloads can be relevant by their volume, or by their generated revenue. Here the categorical data support another function: they form a repository of relevant workload attributes. These attributes help us determine whether the workload is unique, or part of a set of similar workloads. The attributes also tell us the kind of workload, e.g., database, web-server, etc. The categorical data provide a partitioning of the workload set based on the application type and the workload type. The relevance of this knowledge will become clear in Section 7.5 when we select the best approach for representing and partitioning the dataset.
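As a small illustration, the R sketch below partitions a hypothetical set of categorized workloads by application type; the attribute names and values are invented for the example.

# Hypothetical categorization metadata for six accepted workloads.
categories <- data.frame(workload = paste0("wl", 1:6),
                         app_type = c("database", "web-server", "database",
                                      "hpc", "database", "web-server"))

table(categories$app_type)                       # volume per application type
split(categories$workload, categories$app_type)  # the partition itself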

6.8 Reflection on the methodology

One of the recurring themes in this work is bias. Since we use a data-driven approach to workload similarity, we must investigate whether the selection of operating system and processor introduces bias into our dataset.

6.8.1 Bias in system metric selection

Within the general context of computer system workload characterization, it is clear that we have made several bias-introducing choices. We have limited ourselves to UltraSPARC III+ based computer systems running the Solaris operating system. However, the context of the approach is not computer systems in general, but rather a specific micro-architecture implementation. We realize that results found on one micro-architecture may not necessarily translate to another, even though the results of Chapter 5 indicate that some commonality between different micro-architectures can exist. Before we can investigate generalizing this approach to multiple micro-architectures, we first need to demonstrate it on a single micro-architecture, which is the goal of this thesis. We need to evaluate bias within the context of a single micro-architecture and operating system choice since this defines the composition of our initial dataset. We make a distinction between the approach as methodology and its implementation. If we validate the


approach on a single micro-architecture and operating system combination, it will likely work on other combinations as well. As noted previously, however, the results may not be comparable. We assumed that the context of our approach is commercial computer system and processor design. Consequently, the value of the approach is not diminished by our choices, only the generality of our results. Next we specifically evaluate bias related to our choice of operating system and hardware metrics.

Does limiting data collection to the Solaris operating system introduce bias? Differences in operating system implementation will lead to interaction differences with the underlying hardware. However, the role of the operating system is to provide access to computer system resources for the executing workloads. Therefore, the task of the operating system is to facilitate workload execution. As a result, workload characteristics are primarily determined by the workload, not by the underlying operating system. Solaris does not limit access to the hardware counters or to its operating system statistics. As such we are confident that Solaris does not intentionally introduce bias. The only bias introduced is bias by design. The hardware counters are biased by design since the processor designers defined the available events. The operating system metrics reported by their respective utilities reflect design choices influenced by many years of performance evaluation; commonly the metrics presented are used for resolving performance-related issues (McDougall and Mauro, 2006).

Does limiting data collection to the UltraSPARC III+ processors introduce bias? For system and processor designers, the choice of the instruction set architecture is usually a given, determined by the legacy of the company they work for. A significant burden to the processor and system designer is legacy support, i.e., the capability to correctly execute applications written for earlier versions of the processor. As a result, a bias towards a certain instruction set architecture follows from external constraints on the design parameters. For processor design, understanding the interaction of the instruction set architecture with the workloads is essential. The bias introduced by limiting data collection to a specific instruction set architecture is therefore acceptable since the method results will fall within the design constraints.

In short, Solaris itself does not introduce bias by limiting access to necessary metrics. The limitation of using only UltraSPARC III+ processors might introduce a bias if the method's results are used outside of the design context. Thus we see no reason to suspect any fundamental bias within the context of a computer systems company working to design the next-generation processor and computer system. As mentioned in Section 5.4, the differences between results based on computer system metrics and simulation can be used to motivate more research into defining better operating system and hardware metrics.

6.8.2 Data collection and reduction efficiency

Requirement 2 (the collected metric data must be efficiently processable to obtain an observation) is tied to Research Question 2: How can we efficiently find a smallest


