Experiments with Subsetting Benchmark Suites

H. Vandierendonck and K. De Bosschere
Department of Electronics and Information Systems (ELIS)
E-mail: {hvdieren,kdb}@elis.UGent.be

Abstract
Benchmarks are one of the most popular tools to compare the performance of computing systems. Benchmark suites typically contain multiple benchmark programs with more or less the same properties. Hence the suite contains redundancy, which increases the cost of executing or simulating the benchmark suite without adding value. To limit simulation time, researchers frequently subset benchmark suites. However, correctly identifying a representative subset is of paramount importance to perform a trustworthy evaluation. This paper shows that subsetting a benchmark suite in such a way that the representativeness of the suite is maintained is non-trivial. We show that a small randomly selected subset is not representative of the full benchmark suite. We discuss algorithms to subset the SPEC CPU 2000 benchmark suite and show that they provide more representative subsets than randomly selected subsets. However, the algorithms evaluated in this paper do not always compute representative subsets: the algorithms produce bad results for some subset sizes. In this sense, these algorithms are unreliable, as it remains necessary to validate the benchmark suite subset. We find one subsetting algorithm that is reliable. It is, however, uncertain whether this algorithm is also reliable under other circumstances.
1 Introduction The evaluation and development of computer systems depends strongly on benchmarks: standardized computer programs are run on the system and their execution time is measured. Lower execution times indicate higher performance. Although benchmarks are so vital for the evaluation of computer systems, relatively little effort has been targeted at understanding the fundamental properties of benchmark suites. For one, the execution time of benchmarks is frequently increased in order to make the time measurement easily repeatable. However, a long-running benchmark does not always provide more information than a short-running
benchmark [12]. This is especially a concern when performing simulation studies, as simulation is already very slow. Furthermore, it is not clear how many benchmarks are necessary to accurately characterize a computer’s performance [8]. Running more benchmarks does not necessarily provide additional information about the computer’s performance. Both of these issues have led researchers to use only a small number of benchmarks in a benchmark suite. This activity is called subsetting the benchmark suite. Great care must be taken when subsetting a benchmark suite [2]. Clearly, the goal should be to select a subset of the benchmarks that, when simulated or executed, provides (nearly) the same information as the full benchmark suite but in significantly less time. Such a subset is representative of the full benchmark suite. Although there are a few known methods to determine a representative subset of a benchmark suite, researchers frequently make ad hoc or undocumented decisions about the benchmark suite subset that they use [2]. The situation definitely calls for rigorous and accepted methods to compute representative subsets of benchmark suites. We call such methods subsetting algorithms. Subsetting algorithms must compute subsets that are representative of the full benchmark suite. In order to be widely accepted, these algorithms must compute representative benchmark subsets in all situations. We call such a subsetting algorithm reliable. We show in this paper that the investigated subsetting algorithms can compute highly representative subsets in some cases, but less representative subsets in other cases. Thus, the algorithms are essentially unreliable and should not be used blindly to subset benchmark suites. The crucial idea behind subsetting a benchmark suite is that several benchmarks have the same characteristics. Executing or simulating only one of those benchmarks suffices to understand the properties of the other benchmarks. Thus, to subset a benchmark suite one has to know (i) what benchmarks have the same characteristics and (ii) how to properly select a minimal set of benchmarks while retaining all important characteristics present in the full benchmark suite. All previous research in this area has used the same ideas, namely displaying the benchmarks as points in an imaginary
multi-dimensional space, called the workload space or program space, and applying cluster analysis techniques to compute the important subsets [5, 7, 13, 14]. Subsetting algorithms have two major parameters: (i) the information used to determine the position of the benchmarks in the program space and (ii) the cluster analysis technique used to compute the clusters. The coordinates of the benchmarks could be determined by means of workload characteristics such as instruction mix [5, 7, 10], cache miss rates [9, 13], benchmark running times [3, 14], basic block counts [12], etc. Frequently used cluster analysis techniques include principal components analysis [5, 7, 1, 13, 14], K-means clustering [9, 12] and hierarchical cluster analysis [5, 7]. In this paper, we evaluate several choices for the characteristics to position the benchmarks in the program space as well as the clustering algorithm. We show that both parameters have an important impact on the representativeness of the computed subset. Furthermore, we show that the evaluated subsetting algorithms are unreliable: sometimes they compute representative subsets, but sometimes they do not. The remainder of this paper is organized as follows. Section 2 discusses characteristics to place benchmarks in the program space and reviews cluster analysis techniques. Section 3 presents the evaluation environment and Section 4 evaluates the subsetting algorithms. Section 5 summarizes related work and Section 6 concludes this paper.
2 Algorithms to Subset a Benchmark Suite
The approach followed in this paper is similar to that of other work on this topic. The benchmarks are displayed as points in an imaginary space, called the workload or program space [5, 7]. The coordinates of these points are determined by characteristics of the benchmarks. In this paper, we assume as a baseline that the coordinates are determined by benchmark execution times. Once the benchmarks are placed in this multidimensional space, determining when benchmarks are similar is trivial: one can simply measure the Euclidean distance between two benchmarks in this space. (Note that when two workload characteristics are strongly correlated, the distance metric becomes distorted, as those two workload characteristics are slightly different ways to measure the same underlying characteristic. This underlying characteristic receives a higher weight than other characteristics, as it is present in two of the measured characteristics. For this reason, the use of the Mahalanobis distance metric, which compensates for the correlation, has been advocated [5].) Next, one has to select a subset of the benchmarks that approximates the full benchmark suite as closely as possible. When displaying the benchmarks in the program space, one typically finds that the benchmarks are strongly clustered, i.e., there are several tightly connected clusters or groups of benchmarks. Benchmarks that belong to the same
cluster have similar properties as the distance between these benchmarks is small. Therefore, it suffices to select just one representative benchmark for each cluster [5, 7]. In this paper, we evaluate several algorithms to compute benchmark clusters and compare their relative merits. These algorithms are two common cluster analysis techniques (hierarchical cluster analysis and K-means cluster analysis) as well as two methods based on principal components analysis.
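To make the two distance options mentioned above concrete, the following is a minimal R sketch (R is the statistical package used later in this paper). The `bench` matrix is hypothetical random data standing in for a benchmarks-by-characteristics program space, not the measurements used in this study.

```r
# Minimal sketch (hypothetical data): Euclidean vs. Mahalanobis distance
# between two benchmarks placed in a benchmarks-by-characteristics matrix.
set.seed(1)
bench <- matrix(rnorm(26 * 10), nrow = 26,
                dimnames = list(paste0("bench", 1:26), NULL))

d_euclid <- dist(bench[c("bench1", "bench2"), ])      # plain Euclidean distance

# The Mahalanobis distance compensates for correlated characteristics [5].
d_maha <- sqrt(mahalanobis(bench["bench1", , drop = FALSE],
                           center = bench["bench2", ],
                           cov    = cov(bench)))

c(euclidean = as.numeric(d_euclid), mahalanobis = as.numeric(d_maha))
```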
2.1 K-means Clustering K-means clustering is a statistical data analysis technique to cluster n points, in our case benchmarks, such that similar points are clustered together. The number of clusters to compute is provided as an input to the algorithm. The algorithm first computes K random cluster centers. During each iteration of the algorithm, each point is assigned to the cluster with the nearest cluster center. The cluster centers are then recomputed and set equal to the geometrical center of the cluster. These steps are iterated until no more changes occur.
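As a concrete illustration, the R sketch below applies K-means to the same kind of hypothetical benchmarks-by-characteristics matrix and keeps one representative per cluster, namely the benchmark closest to its cluster center. This is our reading of the workflow, not the actual script used for the evaluation.

```r
# Sketch of K-means-based subsetting on hypothetical data (not the paper's script).
set.seed(1)
bench <- matrix(rnorm(26 * 10), nrow = 26,
                dimnames = list(paste0("bench", 1:26), NULL))
X <- scale(bench)                 # put characteristics on comparable scales
K <- 5
km <- kmeans(X, centers = K)      # random centers, then assign/update iterations

# One representative per cluster: the benchmark nearest to its cluster center.
representatives <- sapply(seq_len(K), function(c) {
  members <- which(km$cluster == c)
  d2 <- apply(X[members, , drop = FALSE], 1,
              function(row) sum((row - km$centers[c, ])^2))
  names(which.min(d2))
})
representatives
```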
2.2 Hierarchical Cluster Analysis Hierarchical cluster analysis, also called linkage clustering, is another statistical data analysis technique. Hierarchical cluster analysis starts by placing each point in a separate cluster. Then the algorithm iteratively merges the two most closely located clusters until only one cluster remains. The clusters are joined based on the distance between the clusters, also called linkage distance. In this paper, we use the complete linkage strategy, in which the distance between clusters is determined by the distance between the furthest neighbours. Another linkage strategy, weighted pair-group average linkage, has been shown to give similar results [7]. The linkage distance computed in every iteration of the algorithm is produced as an output of the algorithm. It can be used by the user of hierarchical clustering to determine the number of clusters to retain.
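A corresponding R sketch for the complete-linkage variant, again on a hypothetical matrix: `cutree()` converts the dendrogram into a chosen number of clusters, and `hc$height` exposes the linkage distances mentioned above.

```r
# Sketch of complete-linkage hierarchical clustering on hypothetical data.
set.seed(1)
bench <- matrix(rnorm(26 * 10), nrow = 26,
                dimnames = list(paste0("bench", 1:26), NULL))
hc <- hclust(dist(scale(bench)), method = "complete")  # furthest-neighbour linkage
clusters <- cutree(hc, k = 5)                          # retain 5 clusters
split(names(clusters), clusters)                       # inspect cluster membership
# hc$height holds the linkage distance of each merge; a large jump between
# successive heights hints at a natural number of clusters to retain.
```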
2.3 Methods Based on Principal Components Analysis With these methods, we turn everything around: instead of interpreting the benchmarks as points in a multidimensional space and workload characteristics as the dimensions of that space, we pretend that the workload characteristics are points and place those in a multi-dimensional space. The coordinates of these points are determined by the benchmarks. Using this setup, and assuming that we have more workload characteristics than there are benchmarks, it is possible to apply principal components analysis
(PCA) to rank the benchmarks from the most indicative benchmark to the least indicative benchmark. Before explaining how to accomplish this, let us first explain how principal components analysis works. PCA reduces the dimensionality of a data set [4]. It computes a new set of dimensions, called the principal components, that describe all aspects of the original data set. The $p$ original dimensions $X_i$, $i = 1, \ldots, p$ are linearly transformed into $p$ principal components $Z_i = \sum_{j=1}^{p} a_{ij} X_j$, $i = 1, \ldots, p$. The principal components are constructed such that they are sorted in order of decreasing variance ($\mathrm{Var}[Z_i] \geq \mathrm{Var}[Z_j]$ if $i < j$) and are uncorrelated ($\mathrm{Cov}[Z_i, Z_j] = 0$ if $i \neq j$). One typically finds that a small number of principal components has high variance, i.e., they explain a large fraction of the information in the data set, while the others have almost no variance. The principal components computed by PCA can now be used to rank the benchmarks in order of decreasing interest. Note that the first principal component provides the most information about the data set. Furthermore, it is defined as a linear combination of the original dimensions (i.e., benchmarks). Thus, we can find the most important benchmark as the benchmark $j$ that has the highest weight $a_{1j}$ (in absolute value) in the first principal component, as this benchmark has the strongest influence on the first principal component [4]. The second benchmark can be obtained in a similar manner by looking for the weight $a_{2k}$ with the highest absolute value in the second principal component. If this benchmark was not selected before, i.e., $k \neq j$, then we have found the second most important benchmark. If $k = j$, we look for the second largest weight, and so on. This procedure can be repeated until all benchmarks are placed in order of decreasing interest. We call this the forward procedure. There is a related procedure that starts with the last principal component, which we call the backward procedure. The idea is to identify the least interesting benchmark first. This benchmark has the highest weight (in absolute value) in the last principal component, i.e., it is most similar to the principal component that carries the least information. The second least interesting benchmark is read from the second-to-last principal component, etc. It is possible to compute a representative subset of the benchmarks using either ranking procedure. When the subset contains n benchmarks, we simply take the n most important benchmarks.
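The ranking procedures translate into a few lines of R. The sketch below reflects our reading of this section and uses a hypothetical characteristics-by-benchmarks matrix `wchar` (more characteristics than benchmarks, as assumed above); `rank_benchmarks()` is our own helper, not part of R.

```r
# Sketch of the forward/backward PCA ranking procedures (our reading of the text).
set.seed(1)
wchar <- matrix(rnorm(40 * 26), nrow = 40,                 # 40 hypothetical characteristics
                dimnames = list(NULL, paste0("bench", 1:26)))
pca <- prcomp(wchar, scale. = TRUE)                        # loadings in pca$rotation

rank_benchmarks <- function(loadings, backward = FALSE) {
  pcs <- seq_len(ncol(loadings))
  if (backward) pcs <- rev(pcs)                            # walk from the last PC
  ranked <- character(0)
  for (pc in pcs) {
    w <- sort(abs(loadings[, pc]), decreasing = TRUE)      # absolute weights on this PC
    ranked <- c(ranked, setdiff(names(w), ranked)[1])      # heaviest benchmark not yet chosen
  }
  ranked
}

forward  <- rank_benchmarks(pca$rotation)                        # most indicative first
backward <- rev(rank_benchmarks(pca$rotation, backward = TRUE))  # reversed: most indicative first
head(forward, 5)                                                 # a 5-benchmark subset
```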
3 Evaluation Methodology
The algorithms discussed in the previous section are evaluated in their ability to compute representative subsets of a benchmark suite. We compute subsets of the SPEC CPU 2000 benchmarks and vary the subset size from 1 to 26 benchmarks, the latter equaling the full benchmark suite.
Table 1. The architectures (ISA), the generations of the processors and the number of processors per generation on which the evaluation is performed.

Architecture      Generation                  No.
Alpha AXP         21264                        19
                  21364                         3
80x86             Pentium III                  32
                  Pentium 4                    86
                  Pentium M                     1
                  Xeon (Pentium III and 4)     37
                  Athlon                       27
                  Opteron                       6
Itanium           Itanium                       1
                  Itanium 2                     5
PA-RISC           8600                          3
                  8700                          8
POWER/PowerPC     PowerPC 604e                  1
                  RS64-III, RS64-IV            13
                  POWER3                       17
                  POWER4                       18
MIPS              R12000                        4
                  R14000                        5
SPARC             UltraSPARC II/IIi             6
                  UltraSPARC III               10
                  UltraSPARC III Cu            14
                  SPARC64GP, SPARC64V          24
The representativeness of the subsets is evaluated by ranking the performance of a number of computers when executing these benchmarks. A representative subset ranks all computers in the same order as the full benchmark suite.
3.1 Benchmark Execution Times
We analyze the CPU2000 running times published on the SPEC website (http://www.spec.org/cpu2000/results/) before August 2003 [14]. As we are interested in both SPECint and SPECfp results for all computers, the analysis is limited to those computers for which both SPECint and SPECfp peak results have been submitted. SPEC publishes these results in separate files, so we have to match the SPECint and SPECfp results files that describe the same computer. The details of this matching process are described in [14]. We find 340 computers, representing 7 different architectures, for which both SPECint and SPECfp peak performance results are submitted (Table 1).
3.2 Performance Metric
The subsetting algorithms are evaluated in their ability to produce a subset of the SPEC benchmarks that ranks the computers in the same way as the full benchmark suite. The similarity of ranking is evaluated by Kendall's rank correlation coefficient. This rank correlation coefficient counts the number of inversions of rank (the subsetted suite states that one computer is faster than another, but the full suite states the opposite) and is defined as
$$\tau = \frac{\binom{N}{2} - 2 \cdot (\text{number of inversions})}{\binom{N}{2}}$$
where N equals the number of computers (N = 340). The τ coefficient is limited to the range [−1, 1]. Higher values of τ appear when there are fewer inversions. In practice, a τ value of 0.9 or 0.95 should be aimed for. These values correspond to incorrectly ranking 5% or 2.5%, respectively, of all pairs of computers. Note that many of the computers are very similar (in an extreme case, two computers differ only in the version of the compiler or in the optimization flags), so it is easy to confuse some computers, but their performance is almost the same anyway. For this reason, we find τ values of 0.9 or 0.95 appropriate.
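As an illustration of the metric, the R sketch below computes τ for a candidate subset. The `times` matrix is hypothetical (computers by benchmarks), and scoring computers by the geometric mean of their execution times is an assumption of ours; the paper does not restate its exact aggregation rule here.

```r
# Sketch: representativeness of a subset as Kendall's tau between the computer
# ranking induced by the subset and by the full suite. `times` is hypothetical
# (rows = computers, columns = benchmarks); computers are scored here by the
# geometric mean of their execution times (an assumption; lower is better).
geo_mean <- function(x) exp(mean(log(x)))

representativeness <- function(times, subset_cols) {
  full_score   <- apply(times, 1, geo_mean)
  subset_score <- apply(times[, subset_cols, drop = FALSE], 1, geo_mean)
  # cor(..., method = "kendall") counts rank inversions between the two orderings;
  # it computes the tie-corrected tau-b, which equals the formula above when
  # there are no ties.
  cor(full_score, subset_score, method = "kendall")
}

set.seed(1)
times <- matrix(rexp(340 * 26, rate = 1 / 100), nrow = 340)   # 340 computers, 26 benchmarks
representativeness(times, subset_cols = c(1, 5, 12, 20))
```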
3.3 Software
The evaluations in this paper are performed using the R package for statistical computations (http://www.r-project.org). This software implements the K-means clustering algorithm, hierarchical clustering and principal components analysis. The forward and backward algorithms were implemented on top of R. R also provides means to compute the τ coefficient.
4 Evaluation

4.1 Subsetting Algorithms

4.1.1 Optimal and Random Subsets
The algorithms studied in this paper are heuristic in nature: it is not guaranteed that they find optimal subsets, i.e., the subsets that are most representative. In a number of cases, where there are few possibilities, it is possible to compute optimal subsets. The optimal subset of size K is computed by evaluating all subsets containing K benchmarks; the subset with the largest value of Kendall's rank correlation coefficient is the most representative. It is not always possible to compute the optimal subset, due to the large number of possible subsets. When the optimal subset can be computed, it is highly representative, even for small subset sizes (Figure 1, label max). A 1-benchmark optimal subset has a representativeness of 0.88 and a subset of 4 benchmarks already has a representativeness of 0.95.
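A brute-force search of this kind can be sketched with `combn()`. The snippet below reuses the hypothetical `times` matrix and `representativeness()` helper from the earlier sketch and is only feasible for small K.

```r
# Sketch: exhaustive search for the optimal K-benchmark subset (small K only).
# Reuses the hypothetical `times` matrix and representativeness() helper above.
optimal_subset <- function(times, k) {
  candidates <- combn(ncol(times), k)       # every subset of k benchmark columns
  taus <- apply(candidates, 2, function(cols) representativeness(times, cols))
  list(benchmarks = candidates[, which.max(taus)], tau = max(taus))
}
optimal_subset(times, k = 2)                # 325 candidate subsets for k = 2
```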
Figure 1. The representativeness of optimal subsets, random subsets and the worst subsets for varying subset sizes.

When one has no information available, or one does not have the appropriate tools, all one can do is select a subset at random, perhaps based on intuition and experience. The representativeness of a randomly selected subset is significantly lower than that of the optimal subset (Figure 1, label random). A subset of 5 benchmarks does not suffice to mimic the full benchmark suite, i.e., it does not reach the 0.9 level of representativeness. The error bars have the size of one standard deviation and show the region containing 68% of all subsets. The representativeness of a randomly selected subset can vary quite strongly. An 11-benchmark subset achieves the 0.9 level of representativeness with a probability of 68% (the lower error bar remains above 0.9 for 11 benchmarks). Fourteen benchmarks of the SPEC CPU 2000 suite are necessary to reach the 0.9 level with a probability of 95%.

Finally, the representativeness of the worst possible subset of each size was determined (Figure 1, label worst). The representativeness can be as low as 0.3. A τ coefficient of 0.3 implies that about 35% (1 − (1 + 0.3)/2) of all pairs of machines are ranked in the wrong order; such a subset is clearly not representative. Note that making the subset large is no solution either, since relatively bad subsets can already be obtained by removing only a few benchmarks from the suite. E.g., the representativeness of a 22-benchmark subset may be as low as 0.92, which is less than the representativeness of the optimal 2-benchmark subset.

4.1.2 K-means and Hierarchical Clustering
The representativeness obtained with K-means clustering is relatively high (Figure 2). It approaches the optimal clustering and, as far as data is available, it is better than random subsetting. There is, however, a problem with K-means clustering, which also occurs with other clustering techniques: when increasing the subset size (i.e., the number of clusters), the representativeness of the subset does not always increase. This artifact is fundamental to clustering techniques. Imagine the case where the data can be naturally described using n clusters. In this case, a subset of size n is constructed where each benchmark in the subset is representative for one of the natural clusters. When asking for a subset of n + 1 benchmarks, the clustering algorithm is no longer able to find these n natural clusters. Instead, it searches for a solution with n + 1 clusters in which the clusters do not align with the natural clusters present in the data. E.g., the clustering algorithm may have to select two benchmarks from the same natural cluster, which assigns too much importance to this cluster. Consequently, the subset of size n + 1 is less representative than the subset of size n.

Figure 2. The representativeness of subsets obtained with K-means clustering and hierarchical clustering.

Similar remarks hold for hierarchical clustering, although hierarchical clustering produces worse subsets when the subset size is less than 7 benchmarks. Note that SimPoint [12] also makes use of K-means clustering to determine a representative part of a trace for simulation. SimPoint estimates the representativeness of subsets of varying sizes and selects a subset size for which the representativeness is high. Although this approach is workable, the results in this paper show that it is possible to compute highly representative subsets (the optimal subsets) that are much smaller than the representative subsets computed using heuristic methods such as K-means clustering.

4.1.3 Subsetting Algorithms Based on PCA
The subsetting algorithms based on principal components analysis perform worse than the K-means and hierarchical clustering algorithms (Figure 3). Furthermore, they exhibit the same artifact: subset representativeness can drop with increasing subset size. The forward algorithm tends to give worse results than the backward algorithm. On the positive side, the backward algorithm performs very well for subset sizes in the range of 8 to 14 benchmarks, yielding a subset representativeness of 0.95. However, these results are accidental and cannot be extrapolated to other benchmark suites.

Figure 3. The representativeness of subsets obtained with the forward (fwd) and backward (bwd) algorithms to rank and subset the benchmarks.
Overall, we conclude that the cluster analysis techniques can compute subsets of the benchmarks that are fairly representative to very representative of the full benchmark suite. However, the performance of these algorithms is unreliable: the subset may be very inaccurate. In order to have faith in the subset, it is therefore always necessary to validate that the subset is representative. Ideally, one would have subsetting algorithms that are always reliable, i.e., that provide reasonably representative subsets for all subset sizes. Identifying such algorithms is a topic for further research.
4.2 Workload Characteristics
In the previous sections, we evaluated the subsetting algorithms when they cluster the benchmarks based on the execution times, i.e., the same information as is used to evaluate the representativeness of the subsets. In practice, however, this information is frequently not available, nor is it necessary. It is quite possible to cluster the benchmarks based on other information. E.g., Eeckhout,
Vandierendonck and De Bosschere [5] have used common workload characteristics such as instruction mix, branch misprediction rates and cache miss rates to cluster benchmarks. Lafage and Seznec [9] have used cache miss rates to determine representative program phases to reduce simulation time. In this section, we evaluate the representativeness of a benchmark subset when it is computed based on a set of workload characteristics. The workload characteristics are similar to the characteristics used in [5] and consist of
• instruction mix: integer arithmetic operations, logical operations, shift and byte manipulation, load/store operations and control operations (5 characteristics)
• branch prediction accuracies, measured for a 16Kibit bimodal branch predictor, a 16Kibit gshare predictor and a 24Kibit hybrid branch predictor combining the bimodal and gshare components with an 8Kibit meta predictor (3 characteristics)
• the number of instructions between two control flow breaks (1 characteristic)
• instruction and data cache miss rates. The caches are 1 and 2-way set-associative 8KiB and 16KiB caches, 2 and 4-way set-associative 32KiB and 64KiB caches and a 4-way set-associative 128KiB cache. The caches are level-1, split instruction/data caches (9 characteristics for i-caches, 9 characteristics for d-caches).

This totals 27 workload characteristics. These characteristics are measured for all SPEC benchmarks running all reference input sets (except for the perfect input of perlbmk). The benchmarks are compiled on an Alpha system using the Compaq compilers with optimization flags set to -arch ev6 -fast -O4 for C programs, to -arch ev6 -O2 for C++ programs and to -arch ev6 -O5 for Fortran programs.

The K-means algorithm computes subsets from the workload characteristics that are approximately as representative as when the execution times are used to position the benchmarks in the program space (Figure 4). The “optimal” and “worst” subsets are still computed using the time measurements published by SPEC. In a few cases, such as 6 benchmarks in the subset or a subset size larger than 21, the subsets are clearly less representative when computed from the workload characteristics.

Figure 4. The representativeness of subsets obtained with the K-means algorithm operating on workload characteristics.

The backward algorithm gives better results when operating on the workload characteristics than when clustering the benchmarks based on execution times, provided the subset size is 4 or larger (Figure 5). In fact, the subsets are highly representative. Furthermore, the backward PCA algorithm is reliable when the benchmarks are positioned using workload characteristics: larger subsets almost always imply higher representativeness, except in two cases where there is an important drop in representativeness (11 and 17 benchmarks in the subset). The other algorithms do not perform better than the backward PCA algorithm; e.g., the hierarchical clustering algorithm achieves almost the same subset representativeness as the backward PCA algorithm when using the workload characteristics as input.

Figure 5. The representativeness of subsets obtained with the backward PCA algorithm operating on workload characteristics.
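For completeness, a sketch of this variant: the backward ranking from Section 2.3 driven by a hypothetical 27-characteristic matrix instead of execution times. It reuses the `rank_benchmarks()` helper sketched earlier; the data is random filler, not the measured characteristics.

```r
# Sketch: backward PCA subsetting driven by workload characteristics
# (hypothetical 27 x 26 matrix; reuses rank_benchmarks() from the earlier sketch).
set.seed(1)
wchar27 <- matrix(rnorm(27 * 26), nrow = 27,
                  dimnames = list(NULL, paste0("bench", 1:26)))
pca_wc <- prcomp(wchar27, scale. = TRUE)
ranking <- rev(rank_benchmarks(pca_wc$rotation, backward = TRUE))
head(ranking, 10)     # a 10-benchmark subset, in the size range the text discusses
```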
4.3 Predicting Future Trends
Another approach to measure program similarity is to measure benchmark running times for only a small number
of computers. The question to answer is whether the execution times for this small number of computers are sufficient to compute representative subsets. We selected two subsets of the computers: the set of all computers that were available before the end of September 2000 and those available before the end of October 2000. These sets contain 31 and 53 submissions, respectively. We show results for the K-means (Figure 6) and backward PCA algorithms (Figure 7). In both cases, using only early computers to subset the benchmarks results in subsets with a representativeness comparable to that obtained using all computers. Thus, a scenario arises that could be used by the constructors of benchmark suites: a large number of benchmarks can be developed and their performance measured on the computers available at that time. Based on this information, a subset can be computed that is also valid when comparing the performance of a larger number of computers.
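A sketch of this scenario, reusing the hypothetical `times` matrix and `representativeness()` helper from Section 3.2; the `early` filter below is only a placeholder for the real submission-date cut-off.

```r
# Sketch: subset benchmarks using only early submissions, then validate the
# subset against all computers (hypothetical data; `early` stands in for the
# real date filter).
early <- seq_len(nrow(times)) <= 53                 # placeholder for the 53 early systems
km_early <- kmeans(scale(t(times[early, ])), centers = 8)
# one benchmark per cluster (first member, for brevity; nearest-to-center as in
# the Section 2.1 sketch would be the better choice)
subset_cols <- sapply(1:8, function(c) which(km_early$cluster == c)[1])
representativeness(times, subset_cols)              # evaluated on ALL computers
```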
Figure 6. The representativeness of subsets obtained with the K-means algorithm when using only early computers.

Figure 7. The representativeness of subsets obtained with the backward PCA algorithm when using only early computers.
5 Related Work
Cluster analysis techniques have been applied to analyse benchmark suites by several authors. Saavedra and Smith [10] apply cluster analysis to a set of Fortran programs to measure program similarity. In a first experiment, they measure program similarity by measuring workload characteristics and interpreting these values as the coordinates of the benchmarks in the program space. In a second experiment, they compute program similarity based on the execution times on a number of machines. To this end, they define execution time similarity between benchmarks as the coefficient of variation of the stochastic variable $z_{A,B,i} = t_{A,M_i} / t_{B,M_i}$, where $t_{A,M_i}$ is the execution time of benchmark A on machine $M_i$ and $t_{B,M_i}$ is the execution time of benchmark B on machine $M_i$. They show that both metrics of program similarity are statistically equivalent.
Chow, Wright and Lai [1] analyze how different Java workloads are from common benchmark suites such as SPEC using principal components analysis. Principal components analysis was also used to measure program similarity by Eeckhout, Vandierendonck and De Bosschere [5, 7]. In this work, the analysis is applied to select input data sets for benchmarks in order to limit simulation time without changing the characteristics of the simulated program. These techniques are used to analyse the representativeness of the MinneSPEC reduced inputs [6]. Additional tricks to analyse the result of a principal components analysis and a discussion of eccentric and fragile benchmarks are presented in [13]. Yi and Lilja [15] present a statistically rigorous technique to determine the relative importance of processor parameters such as issue width, reorder buffer size, cache parameters, etc. The importance of these parameters varies over the benchmarks and defines a fingerprint of the benchmarks, making them useful to determine program similarity. These metrics have the benefit that they take the value of other processor parameters into account, while pure workload characteristics may give a wrong picture (e.g., reorder buffer size loses importance when the branch misprediction rate is high). Program similarity metrics are also important when selecting a representative part of a program for simulation. These ideas were investigated by Lafage and Seznec [9] and in SimPoint [11, 12].
6 Conclusion
This paper provides an experimental evaluation of several algorithms to subset a benchmark suite. We define two criteria to evaluate these algorithms. First, the computed benchmark subsets must be representative of the full benchmark suite, i.e., an evaluation of a computing system using the subset must yield the same or similar results as an evaluation using the full benchmark suite. Second, the subsetting algorithms must be reliable, i.e., they must always compute representative subsets. Subsetting algorithms based on K-means cluster analysis, hierarchical clustering and principal components analysis are applied to subset the SPEC CPU 2000 benchmark suite. The evaluation shows that none of these algorithms is reliable, although most of them compute representative subsets in some situations. The lack of reliability implies that benchmark subsets computed using these procedures should always be verified to show that they are representative. The most reliable algorithm in this evaluation is the backward algorithm based on principal components analysis when using workload characteristics to position the benchmarks in the program space. It would be interesting to see whether this result can be extrapolated to other benchmark suites or whether this procedure scores well by coincidence in the presented analysis. The main conclusion of this paper is that subsetting benchmark suites is difficult, even when automated procedures are available. It was shown that a single procedure can compute either a highly representative subset or a less representative subset, depending on the desired subset size. Based on the experiences reported in this work, the representativeness of a benchmark subset is mostly determined by the procedure used to cluster the benchmarks in the benchmark space. The characteristics used to place the benchmarks in the benchmark space have a much smaller impact on representativeness.
Acknowledgements This research is sponsored by the Flemish Institute for the Promotion of Scientific-Technological Research in the Industry (IWT) and by the National Fund for Scientific Research - Flanders.
References [1] K. Chow, A. Wright, and K. Lai. Characterization of java workloads by principal components analysis and indirect branches. In Proceedings of the Workshop on Workload Characterization, pages 11–19, Nov. 1998.
[2] D. Citron. The use and abuse of SPEC: An ISCA panel. IEEE Micro, 23(4):73–77, July 2003. [3] J. J. Dujmovic and I. Dujmovic. Evolution and evaluation of SPEC benchmarks. ACM SIGMETRICS Performance Evaluation Review, 26(3):2–9, Dec. 1998. [4] G. H. Dunteman. Principal Components Analysis. SAGE Publications, 1989. [5] L. Eeckhout, H. Vandierendonck, and K. De Bosschere. Workload design: Selecting representative program-input pairs. In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, pages 83–94, Sept. 2002. [6] L. Eeckhout, H. Vandierendonck, and K. De Bosschere. Designing computer architecture research workloads. IEEE Computer, 36(2):65–71, Feb. 2003. [7] L. Eeckhout, H. Vandierendonck, and K. De Bosschere. Quantifying the impact of input data sets on program behavior and its applications. Journal of Instruction-Level Parallelism, 5:1–33, 2 2003. [8] L. John. Workload characterization: Can it save computer architecture and performance evaluation? Keynote presented at the 7-th Workshop on Computer Architecture Evaluation using Commercial Workloads, Feb. 2004. [9] T. Lafage and A. Seznec. Choosing representative slices of program execution for microarchitecture simulations: A preliminary application to the data stream. In Workshop on Workload Characterisation (WWC-2000), Sept. 2000. [10] R. H. Saavedra and A. J. Smith. Analysis of benchmark characteristics and benchmark performance prediction. IEEE Transactions on Computers, 14(4):344–384, Nov. 1996. [11] T. Sherwood, E. Perelman, and B. Calder. Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, pages 3–14, Sept. 2002. [12] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 2002. [13] H. Vandierendonck and K. De Bosschere. Eccentric and fragile benchmarks. In 2004 IEEE International Symposium on Performance Analysis of Systems and Software, pages 2– 11, Mar. 2004. [14] H. Vandierendonck and K. De Bosschere. Many benchmarks stress the same bottlenecks. In Workshop on Computer Architecture Evaluation Using Commercial Workloads (CAECW-7). Held in conjunction with HPCA-10., pages 57– 64, Feb. 2004. [15] J. J. Yi, D. J. Lilja, and D. M. Hawkins. A statistically rigorous approach for improving simulation methodology. In Proceedings of the 9th International Symposium on High Performance Computer Architecture, pages 281–291, Feb. 2003.