Performance Prediction and Tuning on a Multiprocessor

R. K. Iyer
R. T. Dimpsey

Center for Reliable and High-Performance Computing
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
1101 W. Springfield Ave., Urbana, IL 61801
Abstract

This paper presents a methodology for modeling the behavior of a given class of applications executing in real workloads on a particular machine. The methodology is illustrated by modeling the execution of computationally bound, parallel applications running in real workloads on an Alliant FX/80. The model is constructed from real measured data obtained during normal machine operation and is capable of capturing intricate multiple-job interactions, such as contention for shared resources. The model is a finite-state, discrete-time Markov model with rewards and costs associated with each state. The model is capable of predicting the distribution of completion times in real workloads for a given application. The predictions are useful in gauging how quickly an application will execute, or in predicting the performance impact of a system change. The model constructed in this study is validated with three separate sets of empirical data. In one validation, the model successfully predicts the effects of operating the machine with one less processor.
1. Introduction

In practice, the performance evaluation of supercomputers is still substantially driven by single-point estimates (e.g., MFLOPS) obtained by running characteristic benchmarks or workloads. With the rapid increase in the use of time-shared multiprogramming in these systems, such measurements are clearly inadequate. This is because multiprogramming and system overhead, as well as other degradations in performance due to time-varying characteristics of workloads, are not taken into account. Finally, benchmarks provide minimal insight into the reason for a system's achieved performance and hence are of little value from system tuning or design improvement perspectives.

This paper presents a methodology which uses system measurements to identify and build a Markov reward model that predicts completion times for a given class of applications. The use of such models to predict the effect of design changes is also demonstrated.

To build the Markov reward model, applications representing the given class are executed numerous times during the normal operation of the machine. System parameters such as job queue lengths and multiprogramming overhead are monitored during the executions of the applications. Statistical clustering is then used on the collected data to identify a finite-state, discrete-time Markov model. The final step involves assigning a reward to each state to quantify the actual system resource available to an application in that state.

Monte Carlo simulation is used to solve the model and predict the completion time distribution of a specific application under the measured workload. Further, the model can be used to predict the effect of different system changes. For instance, the model can predict performance effects of the addition/subtraction of processors, changes in scheduling, and reduction of overheads.
In the next two sections, related and past work are surveyed. The Alliant FX/80 architecture is introduced in Section 4. Following this, the model-building methodology is described in a machine-independent way, and then illustrated with the Alliant FX/80 example. Section 6 explains the Monte Carlo simulation procedure. The illustrative model is then validated with three sets of empirical data. Following this, in Section 8, some useful predictive properties of the model are illustrated.
2. Related Work

Many studies have employed simulation or analytical techniques [1,2] to model the behavior of multiprocessors. However, if real system measurements are not collected, simplifying and restrictive assumptions must be made to solve these models. Those studies that measure actual systems are usually designed to derive measurements such as MFLOPS [3,4] and do not account for degradation encountered in real workloads. To accurately evaluate a machine's performance, measurements from real workloads must be obtained.
To collect these measurements, high-resolution performance monitors are needed. This study uses performance monitoring tools developed at the Center for Supercomputing Research and Development (CSRD) at the University of Illinois [5]. The tools were developed for Cedar [6], but can be used to monitor individual Alliants.
Once the measurements of the machine have been collected, modeling techniques must be employed to analyze the data. Markov modeling is one such technique often used [7-9]. For instance, Kulkarni, Nicola, Smith, and Trivedi used Markov reward analysis to analytically evaluate the completion time of a job in a repairable fault-tolerant system [8].

Statistical clustering has also been widely used in computer performance evaluations [7,10-12]. Devarakonda used clustering to build a model and predict resource requirements of an application [10]. We have used clustering to determine the workload states in which an Alliant FX/8 operated [11]. A similar usage of statistical clustering is presented in the current study.

None of the above address the question of determining the response time distribution of a given application under a real workload. Here we develop a measurement-based approach that provides a model of application execution in real workloads. In addition, the methodology is illustrated on an Alliant FX/80 and validated with empirical data.
3. Multiprogramming Overhead

A major component of the model constructed in this paper is multiprogramming (MP) overhead. MP overhead is usually defined as the system work created to maintain the time-shared, multiprogrammed environment. Its components include tasks such as context switching, kernel lock spinning, and scheduling. In addition, MP overhead accounts for multiple-job resource contention. The measurement and evaluation of MP overhead is presented in [13,14]. These studies describe methodologies that quantify both the lower bound on MP overhead and the MP overhead found in real workloads. The methodologies were illustrated on Alliant machines, and distributions of MP overhead in real workloads are presented. Briefly, a parallel program is monitored as it executes in the workload under investigation. Measurements collected during the execution, together with scheduling information, are then used to estimate the expected completion time of the program assuming no MP overhead. This value and the true completion time are used to determine the MP overhead for the given workload.

In the above studies, MP overhead measurements were conducted on nearly 300 independent workloads of an Alliant FX/80. The results showed that, on average, 16% of the parallel processing environment was consumed by MP overhead, with the 25th and 75th percentiles being 10% and 23%, respectively.
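To make the calculation concrete, the sketch below formalizes it in one plausible way; the function name, its inputs, and the exact normalization are illustrative assumptions, not the instrumentation of [13,14].

```python
def mp_overhead_fraction(measured_time: float, expected_time_no_mp: float) -> float:
    """Fraction of the parallel-processing environment consumed by MP overhead.

    measured_time        -- true wall-clock completion time in the real workload (s)
    expected_time_no_mp  -- completion time estimated from the measurements and
                            scheduling information, assuming zero MP overhead (s)
    """
    return (measured_time - expected_time_no_mp) / measured_time

# A run estimated at 84 s with no MP overhead that actually took 100 s implies
# that 16% of the environment went to MP overhead, matching the reported mean.
print(mp_overhead_fraction(100.0, 84.0))  # 0.16
```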
4. The Alliant FX/80

The Alliant FX/80 is a shared memory, multiprocessor mini-supercomputer [15]. It can be best understood as two complexes of processors, Computational Elements (CEs) and Interactive Processors (IPs), which communicate through a large shared memory. The Alliant measured for this study has eight CEs, six IPs, and a shared memory of 96 Mbytes. It is used by CSRD for algorithm development and general scientific computing. The CE complex executes all parallel applications and most serial user programs. It is the focus of this study. The IPs handle interactive jobs, I/O, and a large portion of the operating system. The operating system on the measured Alliant, called Xylem, was designed at CSRD for the Cedar supercomputer. It is an extension of Concentrix, Alliant's Unix-based operating system.

The eight CEs dynamically switch between two configurations: detached and clustered. In the detached configuration the eight CEs are used as independent processors and serial jobs are multiprocessed on the individual CEs. In the clustered configuration, all eight CEs are gang scheduled to execute a single parallel application. Concurrency is exploited within the application by scheduling different iterations of loops across the CEs.

Scheduling is based on six classes of jobs. The class of a job specifies the resource needed to process the job (i.e., single CE, clustered CEs, or IP) and its priority. There are two classes of parallel jobs; these are referred to as type A cluster jobs and type C cluster jobs. A parallel job requires all eight CEs to be in the clustered configuration to execute. There are two classes of serial jobs; these need a single CE to execute and are referred to as CE jobs, type A and type C. The remaining two job types are referred to as IP and IP/CE jobs. IP jobs require an IP to execute, while an IP/CE job can be processed by either a single IP or a single CE.

Table 1 summarizes the scheduling algorithm for the CEs. The scheduler steps down the levels of the table, granting the specified time quantum to a job of the choice 1 job type. If there is no choice 1 job in the system, a choice 2 job is scheduled; if a choice 2 job is unavailable, a choice 3 job is scheduled, and so on. All jobs within a class are scheduled in a fair round-robin fashion. When it is time for a cluster job to execute (levels 1 and 2), the CEs become physically clustered. The CEs are in the detached configuration at levels 3, 4, and 5 (if IP/CE or CE jobs are available). A sketch of this selection follows the table.

Level  Quantum  Choice 1     Choice 2     Choice 3  Choice 4     Choice 5
1      300 ms   cluster (A)  cluster (C)  IP/CE     CE (A)       CE (C)
2      400 ms   cluster (C)  cluster (A)  IP/CE     CE (C)       CE (A)
3      200 ms   CE (C)       CE (A)       IP/CE     cluster (C)  cluster (A)
4      200 ms   CE (A)       CE (C)       IP/CE     cluster (A)  cluster (C)
5      200 ms   IP/CE        CE (C)       CE (A)    cluster (C)  cluster (A)

Table 1: FX/80 Scheduling Algorithm
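The level-stepping selection can be sketched as follows; the queue representation and class labels are hypothetical simplifications of the Xylem scheduler.

```python
# Preference table transcribed from Table 1: for each level, the time quantum
# (ms) and the job classes in choice order.
SCHEDULE = [
    (300, ["cluster(A)", "cluster(C)", "IP/CE", "CE(A)", "CE(C)"]),
    (400, ["cluster(C)", "cluster(A)", "IP/CE", "CE(C)", "CE(A)"]),
    (200, ["CE(C)", "CE(A)", "IP/CE", "cluster(C)", "cluster(A)"]),
    (200, ["CE(A)", "CE(C)", "IP/CE", "cluster(A)", "cluster(C)"]),
    (200, ["IP/CE", "CE(C)", "CE(A)", "cluster(C)", "cluster(A)"]),
]

def schedule_one_pass(queues):
    """Step down the levels of the table, granting each level's quantum to a
    job of the highest-ranked choice that is present in the system.

    queues -- dict mapping job class to a list of job ids in round-robin order
    Yields (job_id, quantum_ms) grants.
    """
    for quantum, choices in SCHEDULE:
        for cls in choices:
            if queues.get(cls):
                job = queues[cls].pop(0)  # take the head of the class queue ...
                queues[cls].append(job)   # ... and rotate it to the tail (round robin)
                yield job, quantum
                break                     # one grant per level, then step down

# With one parallel job and two serial type A jobs present, the parallel job
# receives levels 1 and 2, i.e., 700 ms of every 1300 ms pass (the 7/13
# fraction assumed by Equation 3 later in the paper).
queues = {"cluster(A)": ["P1"], "CE(A)": ["S1", "S2"]}
for job, quantum in schedule_one_pass(queues):
    print(job, quantum)
```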
5. Model Construction

In the following subsections the steps of the model-building methodology are first explained from a high-level, machine-independent perspective, and then illustrated with the details of a model constructed for computationally bound, parallel jobs executing on an Alliant FX/80. The objective is to build a model of the system and workload as it would be seen by an application of the targeted class. The four steps of model construction are:

1) Monitor system/workload parameters during normal machine operation.

2) Statistically cluster the measured data and identify the key states of system/workload operation.

3) Convert the identified cluster model into a Markov model.

4) Define reward and cost functions for each state of the Markov model.
The above steps result in a finite-state, discrete-time Markov model. Each state of the Markov model summarizes an observed system/workload state, and the transitions in the model describe the observed transitions between workload states. Each state has associated with it a cost and a reward function. The reward function quantifies the amount of a resource that an application of the targeted type would receive if it were submitted to the machine while in that state. The cost is the wall clock time needed to obtain the reward.
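To make these semantics concrete, here is a minimal sketch of how such a model could be represented and solved by Monte Carlo simulation (the approach Section 6 describes); the three states, rewards, costs, and transition probabilities below are invented for illustration, not measured values.

```python
import random

# A hypothetical three-state model, for illustration only.  Per state:
# reward = resource amount (here, clustered CE seconds) delivered to a target
# application during one state visit; cost = wall-clock seconds spent in the
# state to obtain that reward.
reward = [6.0, 3.5, 1.0]
cost = [6.1, 6.1, 6.1]
trans = [[0.80, 0.15, 0.05],   # trans[i][j] = P(next state is j | state is i)
         [0.20, 0.70, 0.10],
         [0.10, 0.30, 0.60]]

def simulate_completion(demand: float, start: int = 0) -> float:
    """One Monte Carlo run: walk the chain accumulating reward until the
    application's dedicated-machine processing demand (s) is met; the
    predicted completion time is the wall-clock cost accumulated on the way."""
    state, earned, elapsed = start, 0.0, 0.0
    while earned < demand:
        earned += reward[state]
        elapsed += cost[state]
        state = random.choices(range(len(trans)), weights=trans[state])[0]
    return elapsed

# Repeating the walk yields the predicted completion time *distribution*,
# e.g., for an application needing 200 s of dedicated machine time.
times = sorted(simulate_completion(200.0) for _ in range(1000))
print(times[len(times) // 2])  # median predicted completion time
```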
It should be emphasized that some of the steps in the methodology, most notably determining the reward (step 4), are highly machine dependent and require detailed knowledge of the system under investigation. It should also be noted that the model allows only applications of the chosen class to be analyzed (e.g., computationally bound, parallel applications). This is not overly restrictive, though, because a class of applications is generally quite broad. A new model needs to be built if a different class of applications (for instance, I/O bound jobs) is to be investigated.

5.1. Monitoring and Measuring the System

The data necessary to build the model are obtained by monitoring the system while an application of the targeted class (referred to as a target application) executes in a normal workload. The periods when a target application is executed are chosen randomly over an extended period of time. The model then accurately represents the system/workload that would be seen by an application (similar to the target application) if it were run.

The system parameters monitored are those needed to calculate the reward and cost functions. Because reward is calculated with respect to a specified resource, the measured system parameters reflect the amount of the specified system resource (CPU, memory, etc.) that an application would receive if it were submitted to the system. Therefore, choosing the parameters to be monitored is a direct consequence of what resource or resources are used as the reward. This is discussed further in Section 5.4.

The execution time of the target application is split into intervals of nearly equal length called observations. Each observation describes a system/workload state with the parameters measured during that time period. The observation's length is chosen by trial and error to reflect the granularity at which system/workload states are defined.

The result of the monitoring is a large number of observations, each defined by the parameters used to determine reward and cost. The following details of monitoring the Alliant will clarify the above discussion. The model based on the Alliant workloads is designed to predict the performance of computationally bound, parallel applications. Three applications were chosen from the Perfect Club benchmark suite [4] to be used as target applications. The three applications (Dyfesm, Flo52, and Track) are listed along with their base processing requirements (obtained by executing on a dedicated machine) in Table 2. Dyfesm is a computationally intensive program that performs two-dimensional, dynamic, finite-element structural analysis. Flo52 analyzes transonic air flow past an air foil, while Track involves signal processing. All three applications are type A cluster jobs and execute on all eight CEs in the clustered configuration.

The data for the model were collected by monitoring the Alliant during 100 separate target application executions distributed over a 3-month period. The 100 executions included 35 executions of Dyfesm, 35 of Flo52, and 30 of Track.

         Dedicated FX/80          % of clustered time
Appl.    Completion Time (s)      executing system code
Dyfesm   200                      1.2
Flo52    88                       3.4
Track    94                       3.3

Table 2: Target Applications

For this model, reward was defined as the amount of clustered CE time given to a type A cluster job. Hence, system parameters that allow for the calculation of deliverable clustered CE time were monitored. From the scheduling algorithm (Table 1) and past work [14], it was determined that the following four parameters were necessary:

1) Number of type A cluster jobs.
2) % of time there is at least one type C cluster job.
3) % of total time the CEs execute cluster jobs.
4) % of clustered time the CEs execute MP overhead.

These parameters were obtained from measurements taken using two software monitoring facilities. The first, called Q, was used to monitor the utilization of each processor. Q recorded the time each processor was idle, was executing user code, and was executing system code. These measurements were needed to calculate MP overhead. Q was also used to periodically determine the number and types of jobs in the system.

The second facility, called HRTIME, was used to determine the completion time of target applications. Target application completion time is needed for the computation of MP overhead and is also used in validating the model. The monitoring procedure is illustrated by Figure 1. The target application is run under a normal workload. Both HRTIME and Q are invoked once at the inception and once at the completion of execution. In addition, the Q facility is invoked approximately once every 1.2 seconds to measure the job queue lengths (the short arrows in Figure 1 represent Q invocations). Because the Q facility must be submitted as a software job, the time between measurements varied (standard deviation = 0.28 seconds). Most of the work required to execute Q was done by the IPs, so the clustered CEs were perturbed very little.

[Figure 1: Monitoring. The timeline from target application start to finish is divided into observations of five consecutive samples each; S_{i,j} denotes sample j of observation i. Each sample records the job queue contents, such as the number of type A cluster jobs (CLA_{i,j}) and type C cluster jobs (CLC_{i,j}).]

Figure 1 also illustrates the division of target application execution into observations. Each observation is made up of five consecutive samples, resulting in an average observation length of 6.094 seconds (standard deviation = 1.37 seconds). A variety of observation sizes were tested; it was determined that five samples was a reasonable granularity to reflect a single system/workload state.

For each observation, the four parameters listed above, along with the actual length of the observation (parameter five), were estimated. Equations 1-5 detail how this was done, using the nomenclature of Figure 1.

The average number of type A cluster jobs in the system during observation i, CLA_i, was estimated by averaging the five sampled measurements of the cluster A job queue length within the observation (Equation 1). Parameter two was estimated by the fraction of the five samples in which there was at least one type C cluster job (Equation 2).¹ The third parameter, the percentage of the observation time in which cluster jobs were executing (CLUSP_i), was estimated using the sampled queue lengths and the scheduling information of Table 1 (Equation 3). The equation assumes that if only cluster jobs were present in a sample then the CEs executed cluster jobs the entire time, while if there were CE or IP/CE jobs present, the cluster jobs were given the CEs 7/13 of the time. The percentage of time the clustered CEs were executing MP overhead (MPO_i) was determined for the entire run of the application [14], and this number was assigned to each observation in that execution (Equation 4). The length of each observation period (TIME_i) was estimated by dividing the completion time of the target application by the number of observations, i.e., the number of samples taken divided by five (Equation 5).

$$CLA_i = \frac{1}{5}\sum_{j=1}^{5} CLA_{i,j} \qquad (1)$$

$$CLC_i = \frac{1}{5}\sum_{j=1}^{5} IND[CLC_{i,j} > 0], \qquad IND[arg] = \begin{cases} 1 & \text{if } arg \text{ is True} \\ 0 & \text{if } arg \text{ is False} \end{cases} \qquad (2)$$

$$CLUSP_i = \frac{1}{5}\sum_{j=1}^{5} IP_{i,j}, \qquad IP_{i,j} = \begin{cases} 1 & \text{if } IPCE_{i,j} = CEA_{i,j} = CEC_{i,j} = 0 \\ 7/13 & \text{otherwise} \end{cases} \qquad (3)$$

$$MPO_i = MPO \text{ for execution of entire program} \qquad (4)$$

$$TIME_i = \frac{\text{Completion Time of Target Application}}{(\#\text{ of Samples Taken})/5} \qquad (5)$$

¹ The equation introduces the indicator function IND, which is used throughout this paper.
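Restated in code, the estimation of Equations 1-5 for a single observation looks as follows; the sample-record layout is a hypothetical stand-in for the Q measurements.

```python
def observation_parameters(samples, mpo_run, completion_time, n_samples):
    """Estimate the five parameters of one observation (Equations 1-5).

    samples         -- the observation's five Q sample records; each is a dict
                       with queue lengths CLA, CLC (cluster jobs of type A/C)
                       and IPCE, CEA, CEC (IP/CE and serial type A/C jobs)
    mpo_run         -- MP overhead fraction of the whole execution (Eq. 4)
    completion_time -- measured completion time of the execution (s)
    n_samples       -- total number of samples taken during the execution
    """
    cla = sum(s["CLA"] for s in samples) / 5                       # Eq. 1
    clc = sum(1 for s in samples if s["CLC"] > 0) / 5              # Eq. 2 (IND)
    clusp = sum(1.0 if s["IPCE"] == s["CEA"] == s["CEC"] == 0
                else 7 / 13 for s in samples) / 5                  # Eq. 3
    mpo = mpo_run                                                  # Eq. 4
    time_i = completion_time / (n_samples / 5)                     # Eq. 5
    return cla, clc, clusp, mpo, time_i

# Hypothetical observation: two type A cluster jobs and one IP/CE job present
# in every sample of a 600 s run during which 500 samples were taken.
samples = [{"CLA": 2, "CLC": 0, "IPCE": 1, "CEA": 0, "CEC": 0}] * 5
print(observation_parameters(samples, 0.16, 600.0, 500))
```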
5.2. Clustering

The next step in model construction is determining from the measured data the distinct system/workload states in which the machine has operated. Statistical clustering is first used to group similar observations. The centroids of these cluster groups are then used to summarize the observed system/workload states.

Before clustering, the data were standardized so that the parameters with the largest range of values did not dominate the clustering procedure. Clustering of the standardized data was accomplished with the FASTCLUS procedure of the SAS software package. FASTCLUS is based on the K-Means algorithm [16], which groups observations so as to minimize the Euclidean distances of the observations from the centroids of their clusters while maximizing the Euclidean distances between clusters.

Choosing the correct number of clusters to accurately capture real workload behavior has been previously discussed [11]. For the 21,046 observations collected in this study, it was found that 20 clusters were adequate (r² = 0.84) to include all the data points (no outliers). This means that, with respect to the five parameters clustered upon, the real machine operation could be described by 20 system/workload states.

After clustering, the data points were returned to their original values and the cluster centroids were calculated. Cluster centroids are the geometric centers of the clusters (i.e., the average of all the observations in the cluster). The superscript C will indicate a centroid value. For example, CLA_i^C refers to the centroid value of cluster i corresponding to the average number of cluster A jobs present.
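FASTCLUS itself is a SAS procedure, but its K-Means core and the preceding standardization can be sketched as follows; this is a plain K-Means, without FASTCLUS's seeding and refinements.

```python
import random

def standardize(data):
    """Scale each parameter to zero mean and unit variance so that the
    parameter with the largest range does not dominate the clustering."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    stds = [max((sum((x - m) ** 2 for x in c) / len(c)) ** 0.5, 1e-12)
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in data]

def kmeans(data, k, iters=100):
    """Plain K-Means: assign each observation to its nearest centroid (squared
    Euclidean distance), recompute centroids, and repeat until stable."""
    cents = random.sample(data, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for row in data:
            nearest = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(row, cents[j])))
            groups[nearest].append(row)
        new = [[sum(col) / len(col) for col in zip(*g)] if g else cents[j]
               for j, g in enumerate(groups)]
        if new == cents:
            break
        cents = new
    return cents

# observations: the five-parameter vectors [CLA, CLC, CLUSP, MPO, TIME] of
# Section 5.1; the study found k = 20 adequate for its 21,046 observations.
# centroids = kmeans(standardize(observations), k=20)
```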
5.3. Discrete-Time Markov Model

The clusters identified in the previous step define the states of a discrete-time Markov model. The transition probabilities between states are estimated from the transitions observed between consecutive observations:

$$P_{ij} = \frac{\sum_{k=1}^{100}\sum_{l=1}^{OBS(k)-1} IND[(O_{k,l} = C_i) \wedge (O_{k,l+1} = C_j)]}{\sum_{k=1}^{100}\sum_{l=1}^{OBS(k)-1} IND[O_{k,l} = C_i]} \qquad (6)$$

where C_i denotes cluster i, O_{k,l} denotes observation l of target application execution k, and OBS(k) is the number of observations in execution k.
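In code, the estimator of Equation 6 reduces to counting consecutive-observation label pairs within each execution; the label sequences below are hypothetical.

```python
def transition_matrix(label_runs, k):
    """Estimate P[i][j] from the cluster labels of consecutive observations,
    as in Equation 6: count i -> j transitions within each execution and
    normalize by the number of transitions out of state i.

    label_runs -- one list of cluster labels (0..k-1) per monitored execution,
                  e.g., one list per each of the 100 target application runs
    """
    counts = [[0] * k for _ in range(k)]
    for run in label_runs:
        for a, b in zip(run, run[1:]):   # consecutive pairs (O_{k,l}, O_{k,l+1})
            counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Two hypothetical executions with k = 3 workload states:
print(transition_matrix([[0, 0, 1, 2], [1, 1, 0]], k=3))
```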