Performance Data Extrapolation in Parallel Codes

Juan Gonzalez, Judit Gimenez and Jesus Labarta

Barcelona Supercomputing Center - Universitat Politècnica de Catalunya, Barcelona, Spain. Email: {juan.gonzalez, judit.gimenez, jesus.labarta}@bsc.es

Abstract

Measuring the performance of parallel codes is a compromise between many factors, the most important being which data has to be analyzed. Current supercomputers run applications on large numbers of processors, and the analysis data that can be extracted is equally large and varied. This implies a hard trade-off between the potential problems one wants to analyze and the information one is able to capture during the application execution. In this paper we present an extrapolation methodology that maximizes the information extracted from a single application execution. It is based on a structural characterization of the application, performed using clustering techniques, the ability to multiplex the reading of performance hardware counters, and a projection process. As a result, we obtain approximated values of a large set of metrics for each phase of the application, with minimum error.

I. Introduction

Performance evaluation of parallel applications relies on the analysis of performance data produced during application execution. In the first steps of an analysis, some hard questions arise: where are the application bottlenecks? Depending on the problem we want to tackle, we will need different data to analyze it, and more issues can be detected as more information is made available. For example, if we want to know the memory behavior of our application, we will require information regarding memory accesses or cache hits and misses, whereas if we target how the computational part of the application behaves, data regarding fixed-point and floating-point operations will be essential. Hence performance hardware counters have become an invaluable aid to performance analysis. However, making use of them is not simple, and there are always limitations on the total number of counters that can be read at the same time and on the combinations of counters available.

Building CPU breakdown models is a good example of this situation. These models are an interesting way to understand the behavior of the computation parts of an application, as described in [6]. Unfortunately, they require a high number of counters to be collected, and many of them cannot be read at the same time. One widely used approach consists of multiplexing the hardware counter reads during the application execution. There are a number of works where this idea has been exploited, including the IBM hpmcount tool [10] and the multiplexing features of the PAPI library [4], where basic statistical sampling is applied to compute the total value of multiple groups of hardware counters for the whole application execution. Further works, [3], [11], [12], [14], improved this basic method so as to better understand the time-varying behavior of each counter.

In this paper we present a method focused on the extrapolation of performance counter data. It is also applicable to sequential codes, but it targets parallel applications. The method is based on an automatic characterization of the application structure using clustering analysis, described in [7]. This characterization groups those computation bursts, the regions between two MPI calls, which present similar characteristics in terms of a few metrics, such as Instructions Completed and IPC. Using this structure characterization and the ability to multiplex hardware counter reads, our tool is able to project the average values of a large set of performance counters for each region detected. The key point is that we do not need to re-run the application many times, as required in previous work. With just a single run, we are able to extrapolate the different values with minimum error. This mechanism results in a big saving in terms of time and resources.

This paper is organized as follows. In Section II we present the method to automatically detect the parallel application structure, based on clustering analysis. In Section III the extrapolation process, which represents the core of this work, is described. The experimental validation of the process itself is detailed in Section IV. Section V reviews the previous work that we used to develop the current research. Finally, Section VI presents a brief discussion about our methodology and the results obtained, and also unveils some points about our in-progress research.

Figure 1: Example of a clustering of the GAPgeofem application using DBSCAN with Instructions Completed and IPC. (a) Instructions Completed vs. IPC; (b) Load vs. Store Operations; (c) Data Lines read from L2 vs. Data Lines from Main Memory; (d) Floating Point vs. Integer Operations. The upper left plot depicts the metrics used by the clustering algorithm; the rest of the plots show the clusters found in terms of other pairs of metrics not used during the cluster analysis.

II. Structure Detection

As previously introduced, this work is based on the computational structure detection presented by Gonzalez et al. [7]. In this section we provide a brief introduction to that work, highlighting the most important facts needed to correctly understand the work presented in this paper.

A. Computation Regions

The clustering analysis we use tries to detect similarities between the computation regions, also called CPU bursts. A CPU burst is the region between two MPI calls. Basically, each of these CPU bursts is characterized using its duration and the values of a set of hardware counters. In our experiments, this data is extracted from Paraver traces [1], but any other monitoring system able to extract the described data might be used.

Presented with a large set of metrics that describe each burst, its duration plus up to eight counters, we have to choose those that we expect will produce the best characterization. Statistical methods, such as Principal Component Analysis (PCA), can be used to select the metrics, the work by Ahn and Vetter [2] serving as an example. Our proposal, on the other hand, is similar to the one presented in [9], and consists of reducing the dimensionality by selecting counters or derived metrics with a "physical" meaning to the analyst. As shown in [7], using Completed Instructions and IPC, which focus on a general performance view of the application, obtains good results in the majority of cases. This combination is able to detect regions with different computational complexity (Instructions Completed) and, at the same time, to differentiate between regions with the same complexity but different performance. All experiments presented in this paper were done using this pair of metrics.

B. Clustering Analysis

The clustering algorithm used in this study is DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [5]. The main point of this algorithm is that it does not assume any distribution of the data to cluster. That is especially important because, as we observed, performance hardware counter data does not have a consistent underlying distribution. Once the clustering is applied, the clustering tool produces a plot showing the resulting groups of CPU bursts in terms of the metrics used to compute the clusters. We can also depict other plots showing how the rest of the metrics are distributed into these clusters. In addition, we are able to annotate the original Paraver trace, adding the cluster information, to analyze the temporal behavior of the regions detected.

Figure 1 contains four plots of an example cluster analysis of the GAPgeofem application. Plot 1a shows the metrics fed into the clustering algorithm, Completed Instructions and IPC. As one can see, the resulting groups are compact and isolated from each other. Furthermore, the groups found are also differentiated in terms of other metrics, as can be seen in the rest of the plots. In this example we found an exception in L2 accesses versus Main Memory accesses, plot 1c. In this case, the points belonging to Cluster 3 were divided into two sub-groups. After an analysis of the data, we found that the structure detection is correctly done, following the quality criteria based on the SPMD structure of parallel applications described in [8]; however, one of the tasks had a small distortion, reflected in the two trends in L2 cache accesses. These plots reinforce the statement we made before regarding the ability to distinguish clearly different regions of code. This fact is the essential characteristic of the clustering that permits us to develop the extrapolation method presented in the next section.
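As an illustration of this step, the following sketch clusters CPU bursts by Completed Instructions and IPC with an off-the-shelf DBSCAN implementation. It is a minimal example and not the authors' tool: the burst list, the normalization, and the eps/min_samples values are hypothetical choices.

```python
# Minimal sketch of the structure-detection step: cluster CPU bursts by
# Completed Instructions and IPC using DBSCAN. The burst data, the scaling
# and the DBSCAN parameters below are illustrative assumptions, not the
# values used in the paper.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Each burst: (duration_us, completed_instructions, cycles); hypothetical data.
bursts = np.array([
    [1200.0, 3.2e6, 4.1e6],
    [1180.0, 3.1e6, 4.0e6],
    [ 310.0, 0.4e6, 1.6e6],
    [ 305.0, 0.4e6, 1.5e6],
])

instructions = bursts[:, 1]
ipc = bursts[:, 1] / bursts[:, 2]          # IPC = instructions / cycles

# Cluster in the (Instructions Completed, IPC) space, as in Section II.A.
features = StandardScaler().fit_transform(np.column_stack([instructions, ipc]))
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(features)  # -1 marks noise

for burst, label in zip(bursts, labels):
    print(f"burst duration={burst[0]:8.1f}us  cluster={label}")
```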

III. Performance Data Extrapolation

The performance data extrapolation process consists of determining the average value of a wide range of hardware counter groups for each structural region of the application, detected using the clustering analysis described in the previous section. It is an extrapolation because the number of groups computed is larger than the number of registers available in the processor that can be read simultaneously. In addition, the resulting set can contain two mutually exclusive counters, coming from combinations fixed by the processor design. The basic idea is that, having a subgroup of counters common to all possible groups, we multiplex these groups during the application execution. Then we apply the clustering algorithm to the common subset, and finally we extrapolate the values of the rest of the counters for all clusters detected.

A. Extrapolation Methodology

The extrapolation methodology can be divided into three main steps:

1. Data Extraction. The first decision, before executing the application and extracting the data, is which groups of counters we need. Even more important than this selection is to ensure that all groups have a common subset of counters, so as to guarantee that the clustering analysis can be computed for all points extracted. This depends on the processor design, but all major processor vendors today include Completed Instructions and Cycles as fixed values in all possible combinations of counters in the available sets. In order to have measures of all groups for all application phases or regions, we multiplex the reading of all groups during the application execution.

2. Clustering Analysis. Having read the needed data, we apply the clustering algorithm to the common metrics. As shown in the previous section, we can trust that Completed Instructions and IPC produce a good characterization in terms of all of the counters extracted. After applying the clustering, we obtain a set of clusters whose points include information from different counter groups.

3. Data Projection. Finally, once we have the points divided into clusters, the last step consists of computing the average of all counters. That is just a weighted average of each counter for each cluster. In this way, each point contributes its non-common counters to the characterization of its cluster.
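The projection step reduces to a weighted average per cluster. The sketch below illustrates it under simple assumptions: each burst carries its duration (used here as the weight, one plausible choice not specified in the text), its cluster label from the previous step, and whatever non-common counters its group happened to read.

```python
# Minimal sketch of the data-projection step (step 3): per-cluster weighted
# averages of the counters each burst happened to read. Weighting by burst
# duration is an assumption made for illustration.
from collections import defaultdict

# (cluster_id, duration, {counter_name: value}) -- hypothetical multiplexed reads.
bursts = [
    (1, 1200.0, {"PM_CYC": 4.1e6, "PM_LD_REF_L1": 9.0e5}),
    (1, 1180.0, {"PM_CYC": 4.0e6, "PM_FPU_FIN":   7.5e5}),
    (2,  310.0, {"PM_CYC": 1.6e6, "PM_LD_REF_L1": 2.1e5}),
]

sums = defaultdict(lambda: defaultdict(float))     # cluster -> counter -> weighted sum
weights = defaultdict(lambda: defaultdict(float))  # cluster -> counter -> total weight

for cluster, duration, counters in bursts:
    for name, value in counters.items():
        sums[cluster][name] += duration * value
        weights[cluster][name] += duration

for cluster in sorted(sums):
    projected = {name: sums[cluster][name] / weights[cluster][name]
                 for name in sums[cluster]}
    print(cluster, projected)
```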

B. Considerations

The first consideration to take into account is that we can take advantage of the fact that we analyze parallel applications in order to tune how we multiplex the data acquisition. We propose three different strategies to multiplex the hardware counter groups (sketched below): first, space multiplexing, which consists of assigning a different hardware counter group to each task of the parallel application; second, time multiplexing, where all tasks read the same group of counters at the same time, but the group changes with a user-defined frequency; and third, time-space multiplexing, a combination of both.

Another consideration regards the well-known issues when correlating multiplexed reads of counter values, for example when they are obtained using sampling approaches or multiple runs. This correlation must guarantee that different reads refer to the same regions of the application. In our case, this is solved by applying the clustering analysis to the common metrics present in all groups read. This analysis ensures that each cluster represents a region of the application with high internal similarity, in terms of the metrics used to compute the cluster itself.
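As a rough illustration of the three strategies, the following sketch assigns one of several hypothetical counter groups to each (task, time interval) pair; the group contents and the rotation scheme are invented for the example, with only the common subset (Completed Instructions and Cycles) shared by every group, as the methodology requires.

```python
# Minimal sketch of the three multiplexing strategies described above.
# The counter groups and the rotation scheme are hypothetical; note that
# every group shares the common subset PM_INST_CMPL and PM_CYC.
GROUPS = [
    ["PM_INST_CMPL", "PM_CYC", "PM_CMPLU_STALL_LSU"],
    ["PM_INST_CMPL", "PM_CYC", "PM_CMPLU_STALL_FPU"],
    ["PM_INST_CMPL", "PM_CYC", "PM_DATA_FROM_MEM"],
]

def group_for(task_id, interval, strategy):
    """Return the counter group a task reads during a given time interval."""
    if strategy == "space":        # each task keeps its own group
        return GROUPS[task_id % len(GROUPS)]
    if strategy == "time":         # all tasks rotate groups together
        return GROUPS[interval % len(GROUPS)]
    if strategy == "time-space":   # rotate over time, with a per-task offset
        return GROUPS[(task_id + interval) % len(GROUPS)]
    raise ValueError(strategy)

# Example: which group does task 5 read during its 3rd acquisition interval?
print(group_for(5, 3, "time-space"))
```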

Table I: List of all hardware counters used in the experiments to verify the extrapolation technique

 #  Counter Name                  Description
 1  PM_CYC                        Processor cycles
 2  PM_GRP_CMPL                   A group completed. Microcoded instructions that span multiple groups will generate this event once per group
 3  PM_GCT_EMPTY_CYC              The Global Completion Table is completely empty
 4  PM_GCT_EMPTY_IC_MISS          GCT empty due to I cache miss
 5  PM_GCT_EMPTY_BR_MPRED         GCT empty due to branch mispredict
 6  PM_CMPLU_STALL_LSU            Completion stall caused by LSU instruction
 7  PM_CMPLU_STALL_REJECT         Completion stall caused by reject
 8  PM_CMPLU_STALL_ERAT_MISS      Completion stall caused by ERAT miss
 9  PM_CMPLU_STALL_DCACHE_MISS    Completion stall caused by D cache miss
10  PM_CMPLU_STALL_FXU            Completion stall caused by FXU instruction
11  PM_CMPLU_STALL_DIV            Completion stall caused by DIV instruction
12  PM_CMPLU_STALL_FPU            Completion stall caused by FPU instruction
13  PM_CMPLU_STALL_FDIV           Completion stall caused by FDIV or FSQRT instruction
14  PM_CMPLU_STALL_OTHER          Completion stall caused by other reasons
15  PM_INST_CMPL                  Number of eligible instructions that completed

In addition, the data acquisition does not need to be synchronized across processes. The only requirement is that the run is long enough that, when using time multiplexing, several acquisitions with different hardware counter sets are made for the relevant computation bursts. In SPMD codes the space multiplexing approach can avoid running the application for a long time, but it is most advisable to use the combination of both: space multiplexing is able to capture the intra-node variance, and time multiplexing focuses on the variance across nodes, hence the combination of both better guarantees that these variances are considered when extrapolating the counter values.

IV. Experimental Validation

The methodology we chose to validate the technique presented in this paper consists of comparing the numbers obtained using it against the values of all counter groups obtained from different runs of the same application without multiplexing them. To verify that all clusters represent the same regions across all runs, we manually compared the plots and application time-lines resulting from the clustering analysis.


A. Experiments Design

For the experiments, we decided to extrapolate the counters needed to build a CPI breakdown model for the POWER5 processor, described by IBM in [16], [17]. This model is useful to the analyst because it easily shows the weak points of the code in terms of memory access, code sequentiality and instruction mix, with a very low level of granularity. In our experiments, we ran the applications on an IBM JS21 cluster, whose processors are PowerPC 970MP, so we adapted the model, choosing the most suitable counters by comparing both processors. Table I contains the list of all counters we used, together with the descriptions provided by the PAPI command papi_native_avail. As a result, we obtained a set of 6 different counter groups, to which we added 4 more groups with other counters of interest to compute an instruction mix of the application regions. Reading these ten groups in separate, non-multiplexed executions would require ten runs, so using our extrapolation method saves nine extra runs of the application.

The applications chosen were a mix of applications from the SPEC MPI2007 v1.0 [13] benchmark set, plus the NAS Parallel Benchmark kernels and some other interesting parallel codes. Our choices reflect our desire to cover a wide range of application types, and the benchmark sets we selected from are commonly used to evaluate parallel performance. Due to space considerations, in this paper we include the results for the following four applications:

• TERA TF, a 3D Eulerian hydrodynamics application. This code is part of SPEC MPI2007. The input data used was the 'mref' SPEC input.
• GAPgeofem, a finite-element code for transient thermal conduction. This code is part of SPEC MPI2007. The input data used was the 'mref' SPEC input.
• ZEUS-MP/2, a computational fluid dynamics code for the simulation of astrophysical phenomena. This code is part of SPEC MPI2007.
• PEPC, a parallel tree-code for rapid computation of long-range Coulomb forces in N-body particle systems, developed at Jülich Supercomputing Center. In this case, we used the PEPC benchmark input data, provided by the authors.

Some numbers regarding the clustering analysis and the extrapolation of these applications are summarized in Table II. We want to note that the column titled "Representative Clusters" shows the number of clusters whose aggregate duration accounts for more than 90% of the application computing time.

B. Extrapolation Errors

The results obtained in the different experiments regarding the extrapolation errors are presented in Figure 2 and Figure 3. Figure 2 contains seven different plots obtained from the PEPC application. In these plots, the counters needed to build IBM's breakdown model (see Table I) are depicted on the X axis of each plot. The three plots in the left column, sub-figures 2a, 2b and 2c, show the relative errors when comparing the extrapolated measures with the values obtained in the non-multiplexed runs. Each sub-figure corresponds to one of the multiplexing methods described in Section III. In these direct comparisons we can observe two interesting facts. First, the extrapolation error decreases as we change the multiplexing method, and time-space multiplexing is the method that produces the least error; this error is less likely to be correlated with variations of the structure across processors or with temporal periodicities in the application behavior. The second observation is that, even though the majority of counters are correctly extrapolated, some of them present errors of up to 45%. That could be seen as a drawback of the method itself, but, considering that this work focuses on application analysis, we must interpret the numbers in a different way. For example, if the prediction says that 15 square root operations have been performed, but the actual read was just 10, it represents a 50% error. However, the most valuable information lies in what these numbers represent in terms of the total instructions completed or the total number of cycles. In other words, what an analyst will take into account is the relevance of each counter for each application phase. This is exactly what we did to create sub-figures 2d, 2e and 2f.

Table II: Description of the clustering and extrapolation analysis presented in the paper

Application   Task Count   Data Points   Representative Clusters   Analysis Run-Time
PEPC          32           14,228        4                         3m 44.656s
GAPgeofem     16           44,676        4                         18m 46.343s
ZEUS-MP/2     16           18,845        10                        1m 31.316s
TERA TF       16           65,508        2                         4m 0.782s

In the plots of the right column, sub-figures 2d, 2e and 2f, the error computed for the first three figures is weighted by the relevance of each counter. This relevance is the relative value the counter has with respect to cycles in the non-extrapolated runs, and is depicted in sub-figure 2g. What we obtain is that, even when the relative error is high, the weighted error shows that the biggest errors appear in counters with small values, which are irrelevant to understanding the application behavior. This is exactly what we can see when comparing the left and right sub-figures. For example, in sub-figure 2a, the counters PM_CMPLU_STALL_FXU (10), PM_CMPLU_STALL_DIV (11) and PM_CMPLU_STALL_FPU (12) were extrapolated with big errors: 18%, 50% and -20% respectively. Then, in sub-figure 2d, all the weighted errors of these counters are bounded between 1% and -2%, because their values are insignificant with respect to the total cycles.

The plots in Figure 3 show the weighted errors obtained for the three other applications, TERA TF, ZEUS-MP/2 and GAPgeofem, just for time-space multiplexing, as we consider this multiplexing strategy the one that produces the best results. In addition, all counters present in the groups read are shown. As one can see, the errors are bounded in all cases between ±5%. The numbers presented in these plots confirm that our technique performs well, even with a large number of counters to extrapolate. Furthermore, it works well in a wide variety of scenarios, for example when just a few clusters are detected, as in TERA TF in sub-figure 3a, or the opposite, when the application has many clusters, as in ZEUS-MP/2 in sub-figure 3b.
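A small sketch of how such a weighted error can be computed follows; the counter values are invented, and the weighting follows the description above (relative error scaled by the counter's value relative to total cycles).

```python
# Minimal sketch of the weighted-error computation described above.
# The extrapolated/measured values are invented for illustration.
def weighted_error(extrapolated, measured, measured_cycles):
    """Relative error scaled by the counter's relevance w.r.t. total cycles."""
    relative_error = (extrapolated - measured) / measured
    relevance = measured / measured_cycles      # fraction of total cycles
    return relative_error * relevance

cycles = 4.0e6
# counter: (extrapolated value, measured value from a non-multiplexed run)
counters = {
    "PM_CMPLU_STALL_LSU": (1.10e6, 1.00e6),    # 10% relative error, high relevance
    "PM_CMPLU_STALL_DIV": (1.5e3, 1.0e3),      # 50% relative error, tiny relevance
}

for name, (ext, meas) in counters.items():
    print(f"{name}: relative={100 * (ext - meas) / meas:+.1f}%  "
          f"weighted={100 * weighted_error(ext, meas, cycles):+.3f}%")
```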

Figure 2: PEPC extrapolation errors for the CPI breakdown model counters using different multiplexing strategies (application: PEPC, 32 tasks run on a PowerPC 970MP). Sub-figures: (a) space multiplexing relative errors; (b) time multiplexing relative errors; (c) time-space multiplexing relative errors; (d) space multiplexing weighted errors; (e) time multiplexing weighted errors; (f) time-space multiplexing weighted errors; (g) counter relevance with respect to the Total Cycles counter. The left column shows the relative error comparing the extrapolation and the actual value from the non-multiplexed execution. The right column shows these errors weighted in terms of the relevance of each counter with respect to the total cycles of the non-multiplexed run, depicted in the plot at the bottom. The counter names are listed in Table I.

Figure 3: Extrapolation errors for all counters read, using time-space multiplexing and weighted in terms of total cycles. Sub-figures: (a) TERA TF weighted errors; (b) ZEUS-MP/2 weighted errors; (c) GAPgeofem weighted errors (16 tasks each, run on a PowerPC 970MP). All plots share the same counter list, only depicted in sub-figure (c).

Figure 4: CPI breakdown models and general computation statistics obtained from a single run, computed with the time-space multiplexing extrapolation method. Sub-figures: (a) TERA TF CPI breakdown model; (b) TERA TF general computation statistics; (c) GAPgeofem CPI breakdown model; (d) GAPgeofem general computation statistics (16 tasks each, run on a PowerPC 970MP). The CPI breakdown categories are Group Complete Cycles, GCT Empty Cycles, LSU Stall Cycles, FXU Stall Cycles, FPU Stall Cycles and Other Stalls. The statistics plotted are: A. % peak IPC; B. % Instr. Completed; C. Memory Mix; D. FPU Mix; E. FXU Mix; F. % Total Duration; G. % Inst. per Burst; H. % Total Bursts. These models are a general view of the major categories, not using all 15 extrapolated counters, to improve legibility.

C. Application Analysis

In Figure 4 we show possible analysis models and statistics that can be obtained from a single run. In this case, we show the previously mentioned CPI breakdown in the left plots. It is a simplification, using just five of the 15 counters, to improve legibility; it shows the major stall causes for each cluster. In brief, 'Group Complete Cycles' refers to those cycles when the processor actually finishes instructions; 'GCT Empty Cycles' covers the cycles stalled due to re-order buffer issues, for example branch mispredictions; 'LSU (Load-Store Unit) Stall Cycles' are those cycles stalled by memory issues; 'FXU Stall Cycles' and 'FPU Stall Cycles' are the cycles stalled waiting for integer and floating-point computations respectively; and, finally, 'Other Stalls' refers to stalls caused by situations not captured by the hardware counters. The plots in the right column are a brief statistical description of some interesting computational characteristics: the percentage of the peak IPC (considering 4 as the maximum IPC possible on a PowerPC 970), different instruction mixes (the percentage of total instructions that corresponds to memory operations, 'Memory Mix', floating-point operations, 'FPU Mix', and integer instructions, 'FXU Mix'), and also some numbers related to total duration, instructions per burst, and number of bursts per cluster.

A good approach to understanding the figure is to compare both plots of each application. For example, in TERA TF, both regions detected perform in the same way, having an IPC around 22% of the peak (metric A on sub-figure 4b), which is actually a good value for this processor. In both cases, most of the stalls are caused by floating-point operations (CPI breakdown 4a), but the FPU Mix (metric D on sub-figure 4b) is just around 16%. What is more, we can see that the Memory Mix (metric C on sub-figure 4b) presents the highest values, around 20%. These facts suggest that both clusters could be improved by trying to solve the dependencies of floating-point data computations, and the access to the data structures that contain them. Finally, we can also see that the only difference between both regions is that Cluster 1 has fewer instructions per burst, but it appears four times more often than Cluster 2.

Regarding GAPgeofem, we observe that Cluster 1 dominates the computing part, up to 50% of total time (metric F on sub-figure 4d), having an IPC around 30% of the theoretical peak. It is interesting to note that the frequency of this cluster is nearly 100 times higher than the rest of the clusters (metric H on sub-figure 4d). In any case, this is a good situation, because this cluster groups big computation bursts, which obtained the best performance among the applications tested. So, in this case, the developer/analyst should tackle the problems of Cluster 2, because the rest of the clusters represent a small amount of total time (metric F on sub-figure 4d). Cluster 2 is mainly dominated by memory stalls (LSU stalls on sub-figure 4c). The main recommendation here would be to analyze the memory access patterns of this cluster.
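As an illustration of how such per-cluster statistics can be derived from the extrapolated counters, the sketch below computes the percentage of peak IPC and a simple instruction mix. The counter names come from Table I and the Figure 3 counter set, but the formulas (peak IPC of 4, loads plus stores as the memory mix) are assumptions made for the example, not the exact definitions used for Figure 4.

```python
# Minimal sketch: derive per-cluster statistics (as plotted in Figure 4) from
# extrapolated average counter values. The formulas are illustrative assumptions.
PEAK_IPC = 4.0  # maximum IPC assumed for the PowerPC 970

def cluster_statistics(c):
    inst = c["PM_INST_CMPL"]
    return {
        "pct_peak_ipc": 100.0 * (inst / c["PM_CYC"]) / PEAK_IPC,
        "memory_mix":   100.0 * (c["PM_LD_REF_L1"] + c["PM_ST_REF_L1"]) / inst,
        "fpu_mix":      100.0 * c["PM_FPU_FIN"] / inst,
        "fxu_mix":      100.0 * c["PM_FXU_FIN"] / inst,
    }

# Hypothetical extrapolated averages for one cluster.
cluster_1 = {
    "PM_CYC": 4.0e6, "PM_INST_CMPL": 3.5e6,
    "PM_LD_REF_L1": 5.0e5, "PM_ST_REF_L1": 2.0e5,
    "PM_FPU_FIN": 5.6e5, "PM_FXU_FIN": 9.0e5,
}
print({name: round(value, 1) for name, value in cluster_statistics(cluster_1).items()})
```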

V. Related Work

We can find several works regarding hardware counter data multiplexing and extrapolation. The MPX software by John M. May can be considered the seminal research [12], using basic statistical sampling plus a linear extrapolation of the multiplexed values for the whole application execution. This approach is exploited both in IBM's hpmcount tool [10] and in the PAPI library [4]. Further studies have focused on understanding and predicting the time-varying behavior of counter values [3], [11], as well as the existing correlations between different counters [14]. In any case, we cannot compare our approach directly to these works: firstly, because our work is intended to summarize the information, accounting the data for each phase/cluster; secondly, because the previous works are based on sampling techniques. Focusing on application analysis, we found that some of the major analysis tool packages, such as TAU [15] or Kojak [18], rely on a multi-experiment infrastructure so as to increase the number of counters available for a given function when analyzing application profiles.

VI. Conclusions

This paper presents a methodology to extrapolate the value of a large number of performance counters in parallel applications. The strongest point of this methodology is that we do not need to rerun the applications several times to obtain a wide range of metrics for different application regions. The method presented relies on structure detection, based on the DBSCAN clustering algorithm, which is able to clearly detect the different computation phases of the applications. Using this phase detection together with the ability to multiplex the reading of different hardware counter sets, we are able to extrapolate the values with minimum error. Even though we focused on performance hardware counter data, the method is applicable to any other continuous metric. We experimentally validated the correctness of the technique using four different parallel applications: PEPC, TERA TF, ZEUS-MP/2 and GAPgeofem. These applications are good representatives of physics codes widely used in supercomputing centers. The experiments showed that in all cases our methodology was able to predict the values with a weighted error within ±5%, comparing the extrapolated values with the values obtained from different runs reading separate counter groups. In addition, we showed two experiments where this methodology is applied to compute a CPI breakdown model, as well as a set of statistics, showing its usefulness in the analysis work.

Acknowledgments

We would like to acknowledge the BSC Tools team for their support with the tools used for the development of the current paper. This work is granted by the IBM/BSC MareIncognito project and by the Comisión Interministerial de Ciencia y Tecnología (CICYT), contract TIN2007-60625.

References

[1] CEPBA-Tools Team@BSC Home. http://www.bsc.es/plantillaF.php?cat_id=52.
[2] D. H. Ahn and J. S. Vetter. Scalable analysis techniques for microprocessor performance counter metrics. In SC '02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pages 1-16, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press.
[3] R. Azimi, M. Stumm, and R. W. Wisniewski. Online Performance Analysis by Statistical Sampling of Microprocessor Performance Counters. In ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing, pages 101-110, New York, NY, USA, 2005. ACM.
[4] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. A Portable Programming Interface for Performance Evaluation on Modern Processors. International Journal of High Performance Computing Applications, 14(3):189-204, 2000.


[5] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In E. Simoudis, J. Han, and U. Fayyad, editors, Second International Conference on Knowledge Discovery and Data Mining, pages 226-231, Portland, Oregon, 1996. AAAI Press.
[6] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith. A Top-Down Approach to Architecting CPI Component Performance Counters. IEEE Micro, 27(1):84-93, 2007.
[7] J. Gonzalez, J. Gimenez, and J. Labarta. Automatic Detection of Parallel Applications Computation Phases. In IPDPS '09: Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, 2009.
[8] J. Gonzalez, J. Gimenez, and J. Labarta. Automatic Evaluation of the Computation Structure of Parallel Applications. In PDCAT '09: Proceedings of the 10th International Conference on Parallel and Distributed Computing, Applications and Technologies, 2009.
[9] A. Joshi, A. Phansalkar, L. Eeckhout, and L. K. John. Measuring Benchmark Similarity Using Inherent Program Characteristics. IEEE Transactions on Computers, 55(6):769-782, 2006.
[10] Q. Liang. Performance Monitor Counter data analysis using Counter Analyzer. http://www.ibm.com/developerworks/aix/library/au-counteranalyzer/index.html.
[11] W. Mathur and J. Cook. Improved Estimation for Software Multiplexing of Performance Counters. In Proceedings of the 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 23-34, Washington, DC, USA, 2005. IEEE Computer Society.
[12] J. M. May. MPX: Software for Multiplexing Hardware Performance Counters in Multithreaded Programs. In IPDPS '01: Proceedings of the 15th International Parallel & Distributed Processing Symposium, page 22, Washington, DC, USA, 2001. IEEE Computer Society.
[13] M. S. Müller, M. van Waveren, R. Lieberman, B. Whitney, H. Saito, K. Kumaran, J. Baron, W. C. Brantley, C. Parrott, T. Elken, H. Feng, and C. Ponder. SPEC MPI2007 - an application benchmark suite for parallel systems using MPI. Concurrency and Computation: Practice and Experience, 22(2):191-205, 2010.
[14] T. Mytkowicz, P. F. Sweeney, M. Hauswirth, and A. Diwan. Time Interpolation: So Many Metrics, So Few Registers. In MICRO '07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 286-300, Washington, DC, USA, 2007. IEEE Computer Society.
[15] S. S. Shende and A. D. Malony. The Tau Parallel Performance System. International Journal of High Performance Computing Applications, 20(2):287-311, 2006.
[16] D. Vianney, A. Mericas, B. Maron, T. Chen, S. Kunkel, and B. Olszewski. CPI analysis on POWER5, Part 1: Tools for measuring performance. http://www-128.ibm.com/developerworks/library/pa-cpipower1.
[17] D. Vianney, A. Mericas, B. Maron, T. Chen, S. Kunkel, and B. Olszewski. CPI analysis on POWER5, Part 2: Introducing the CPI breakdown model. http://www-128.ibm.com/developerworks/library/pa-cpipower2.
[18] B. J. N. Wylie, B. Mohr, and F. Wolf. Holistic Hardware Counter Performance Analysis of Parallel Programs. In ParCo '05: Proceedings of the Parallel Computing Symposium 2005, Malaga, Spain, September 2005.