TECHNICAL REPORT

A Resource Occupancy Model for Evaluating Instrumentation System Overheads

Abdul Waheed, Herman D. Hughes, and Diane T. Rover
Departments of Electrical Engineering and Computer Science
E-mail: {waheed,rover}@egr.msu.edu, [email protected]

Date: May 1995
TR-MSU-EE-SCSL-023-95

Scalable Computing Systems Laboratory
Department of Electrical Engineering
Michigan State University
260 Engineering Building
East Lansing, Michigan 48824-1226
A Resource Occupancy Model for Evaluating Instrumentation System Overheads

Abdul Waheed, Herman D. Hughes, and Diane T. Rover
Departments of Electrical Engineering and Computer Science
Michigan State University, East Lansing, MI 48824
E-mail: {waheed,rover}@egr.msu.edu, [email protected]
Abstract
Software instrumentation is a widely used technique for measurement-based parallel program performance evaluation, prediction, debugging, steering, and visualization. With the increasing sophistication of parallel tool development technologies and the broadening of the application areas where these tools are used, runtime data collection and management activities are growing in importance. We use the term instrumentation system (IS) to refer to the components that support these activities in state-of-the-art parallel tool environments. The overheads and perturbation effects attributed to an IS must be accounted for to ensure a correct and efficient representation of program behavior, especially for on-line and real-time environments. The nondeterministic nature of the events in a concurrent computer system that trigger IS functions necessitates the development of appropriate stochastic models for evaluating their overheads. In this paper, we present a Resource OCCupancy (ROCC) model for evaluating the overheads that an IS imposes on an application program due to the sharing of system resources. The ROCC model is based on a queuing network that captures the dynamics of the IS. We have applied the ROCC model to the IS of the Paradyn tool environment, as implemented for a cluster of workstations. Although the ROCC model can accommodate a varying number of system resources, this study uses it for two: the local CPU and the network. We present approximate analytical and detailed simulation results for this case study. The goal of the modeling effort is to provide tool developers with valuable feedback on the performance effects of IS parameters and policies, assisting them in making design decisions early in the software development cycle.
1 Introduction and Motivation
Software instrumentation is a widely used technique for parallel program performance evaluation, debugging, and visualization. Parallel tools rely on execution information regarding the states and behavior of application programs to provide useful feedback to the user. With the increasing sophistication of parallel tool development technologies, runtime data collection and management activities are receiving more attention from tool developers [26]. Parallel tool developers are focusing on integrated parallel tool environments [37] and frameworks [32], performance evaluation of real-time systems [2], and program steering [7]. The overheads and perturbation effects associated with data collection and management are of critical importance in these emerging technologies and therefore deserve special attention. We use the term instrumentation system (henceforth, IS) to refer specifically to the data collection and management components in state-of-the-art parallel tool environments [35]. We favor a structured approach to plan, design, model, evaluate, and implement an IS so that it addresses the specific requirements imposed by the parallel tool environment it supports [36]. This paper presents a model to account for the overheads of an IS in a distributed computing environment. The case study presented here focuses on the Paradyn environment [24] and elaborates on an IS design evaluation approach applied early in the tool development cycle.

In order to put our IS development and evaluation approach in the proper perspective, we have examined a number of parallel tool development efforts, some of which are reviewed in Section 5. A majority of the ISs in current tool environments have been developed in a manner that can best be described as ad hoc, with insufficient or no evaluation of their overheads. Typical activities of an instrumentation system are nondeterministic, including the rate of data arrivals (collection), competition and contention between application and IS processes for shared system resources, and message passing among various IS modules. Any system supporting such activities cannot be evaluated reliably unless its modules and activities are appropriately specified. We have specified and modeled some well-known and widely used ISs, including those of PICL [5] and Paradyn [24], to evaluate their overheads to application programs and systems with respect to the specific requirements of their environments. This paper focuses on the evaluation of the Paradyn IS to demonstrate the promise of this evaluation methodology as part of the development cycle of ISs for the next generation of parallel tool environments.

We present a Resource OCCupancy (ROCC) model to evaluate the overheads due to contention for shared resources between IS processes and application processes in a cluster-of-workstations environment. We use queuing models for this evaluation, and apply analytical as well as simulation techniques to answer various "what-if" questions regarding the overheads under various workload conditions. The insight thus obtained is applicable to other, similar ISs developed for distributed computing platforms.

The evaluation approach presented in this paper is useful for developing an IS in a structured manner. We propose a rapid-prototyping, two-level tool development approach, as depicted in Figure 1. At the higher level, the requirements of the IS are either determined by the developer or specified by the tool users.
These requirements are transformed into detailed lower-level system specifications, which are subsequently mapped to a model representing the structure and dynamics of the IS. This model is parameterized and evaluated with respect to chosen performance metrics that reflect the critical IS overheads to the application program as well as to the target system. The evaluation results are then translated back to the higher level, so that tool developers and users can draw conclusions regarding IS performance and overheads. Feedback from the IS prototyping process is used to modify either the requirements or the system specifications to obtain the desired performance. Finally, the model becomes the blueprint for the actual synthesis of the IS.
[Figure 1 diagram: at the higher level (qualitative considerations), IS Requirements feed IS Evaluation; at the lower level (quantitative considerations), System Specifications lead to an IS Model, Parameterization, Model Calculations, and finally IS Synthesis, with feedback from the evaluation process flowing back to the requirements.]
Figure 1. Two levels of a structured IS development approach.
Realization of a tool in general, and of an IS in particular, is a non-trivial process requiring many person-hours of programming effort. Moreover, evaluation of a tool by users upon its release typically leads to requests for corrections, changes, or enhancements to its functionality. In contrast, rapid prototyping and preliminary evaluation of an IS using the approach presented in this paper can ensure that the specific requirements of a tool environment are met prior to the investment of programming effort. This process is likely to deliver better performance, be less costly, and yield greater user satisfaction.

Section 2 specifies the goal of this study, i.e., the evaluation of Paradyn IS overheads. We present the ROCC model and related analytical results in Section 3. The Paradyn IS evaluation case study is discussed at length in Section 4, following the approach depicted in Figure 1. Section 5 discusses related work that provides an appropriate context for appreciating the relevance of this work. We conclude with a discussion of the significance of this research.
2 Modeling Objectives
The IS modeling approach presented in this paper is applied specifically to the Paradyn parallel performance analysis tool environment; this does not preclude applying the approach to the ISs of other tools. Due to the application-specific nature of computer system performance studies, applying the modeling approach to a particular system is a valuable proof of concept. We selected the Paradyn IS for this purpose because of features that reflect the state of the art in IS development, including on-line data collection and forwarding, adaptive management of the IS, and application-specific configuration of various IS modules. This section introduces the Paradyn IS and the objectives of its evaluation.
2.1 Paradyn IS
The Paradyn environment, developed at the University of Wisconsin, has been implemented on a CM-5 and on a cluster of Unix workstations; we have modeled the Paradyn IS for the workstation cluster. The IS provides data collection support for Paradyn's W3 search model [12], which analyzes program performance bottlenecks by measuring system resource utilization with appropriate metrics. When the search algorithm needs to analyze a particular metric, instrumentation is inserted dynamically in the program at runtime to generate samples of that metric's value. The W3 search methodology therefore uses a minimal amount of instrumentation while providing a structured and automated way for a programmer to isolate performance bottlenecks.
The Paradyn IS supports an on-the-fly bottleneck search process by continuously providing instrumentation data to the main Paradyn process (mPp). The required instrumentation data samples are collected from the application processes executing on each node of the system and forwarded to the local Paradyn daemon (Pd), which in turn forwards them to the mPp. The sampling rate progressively decreases over time during an interval when instrumentation is present in the program. In order to model and evaluate the Paradyn IS, we need to focus on those aspects that are relevant to the study of IS overheads. We follow the approach of Figure 1 and present these specifications in the following subsection.
2.2 IS Specifications
Specifications necessary for creating a model of the Paradyn IS are summarized in Table 1. Figure 2 represents the overall structure of the Paradyn IS. The figure denotes the application processes that are instrumented by the local Paradyn daemon at node i as p_j^i for j = 0, 1, ..., n-1, where the number of application processes n at a given node may differ from node to node.

Table 1. Specifications characterizing the Paradyn instrumentation system.
  Analysis Requirements: On-line
  Platform: Cluster of workstations
  Pd: Local daemon process for each node that collects samples from application processes and forwards data
  mPp: Main Paradyn process that accepts data from daemons and uses data for analysis
  Data Transfer: Unix-based interprocess communication
  Management Policy: Adaptive management policy implemented by the tool developers
[Figure 2 diagram: the main Paradyn process (mPp) on a host workstation receives data from the Paradyn daemons (Pd), one per node; each Pd collects from its local application processes p_0^i, ..., p_(n-1)^i on nodes 0 through P-1.]
Figure 2. An overview of the Paradyn IS [24].
The level of detail of the IS shown in Figure 2 is sufficient neither to evaluate alternative data collection schemes nor to select the one that incurs minimum overhead to the application processes. To accomplish this objective, the Pd and local application processes must be considered in greater detail to account for resource sharing and contention between the application processes and the Pd functions. Therefore, we focus on one particular node of the distributed system to model the Paradyn IS. This model is presented in Section 3.
2.3 Questions of Interest
At an early development stage, tool developers can benefit from the evaluation of various possible implementations of a module or function. The questions of interest depend heavily on the specifications that the tool has to meet and may vary from one tool to another. In the case of the Paradyn IS, the specifications require the dynamic implementation of various IS functions to support on-line analysis of performance bottlenecks in long-running application programs. Performance data collection and forwarding activities may adversely affect application program performance, especially if they introduce additional performance bottlenecks due to competition for, and sharing of, system resources with the application processes. Therefore, our study focuses on the evaluation of the overheads due to resource sharing. The sensitivity of various IS performance metrics (specified in Section 3) to changes in system parameters provides valuable information regarding the desirable operating conditions for the IS. If there are several possible data collection and forwarding options, the study should also provide guidance to the developer in choosing one over another.

The questions of interest indicated above are fairly general. We considered these questions with the developers of Paradyn at the start of this study; they were later translated into performance metrics that could be calculated from the IS model. Such metrics are necessary for a quantitative evaluation of the IS and are presented in Section 3.
3 The Resource Occupancy Model
This section presents the Resource OCCupancy (ROCC) model for isolating the overheads due to nondeterministic sharing of resources between IS and application processes. This model is used for evaluating the Paradyn IS. We also present alternative approaches for calculating performance metrics of interest through this model.
3.1 The Model
The ROCC model is a queuing network model that accounts for the contention and sharing of system resources. It consists of three components:
1. System Resources. These resources are shared among (instrumented) application processes, other user and system processes, and IS processes. They include CPUs, the network, and I/O devices;
2. Requests. These are demands from application processes, other users' processes, and IS processes to occupy the system resources during the execution of an instrumented application program. A request to occupy a resource specifies the amount of time needed to complete a particular computation, communication, or I/O step of a process; and
3. Management Policies. IS management policies determine the nature of data capturing and forwarding operations.
The entire Paradyn IS can be represented by a queuing network model, as shown in Figure 3. It consists of several sets of identical subnetworks, each representing a local Pd and its application processes. This queuing network includes many more details than necessary for evaluating the overheads of resource sharing. We can assume that the subnetworks at every node of the distributed system behave identically and independently of one another. Therefore, we consider only one subnetwork at a given node and apply the ROCC model for a detailed evaluation. Figure 4 depicts the resource occupancy model for the Paradyn IS with two types of resources, CPU and network, being shared by three types of processes: application, IS, and other user processes. These processes generate requests to occupy the resources for certain periods of time, which are determined from workload studies on the target system. Multiple processes can generate requests concurrently. If a
[Figure 3 diagram: on each node i, the local application processes p_0^i, ..., p_(n-1)^i (n per node) feed instrumentation data buffers provided by the kernel (Unix pipes) to the local Paradyn daemon Pd_i; the P daemons Pd_0, ..., Pd_(P-1), one per node, forward data to the mPp, with network delays represented by arrivals to a single-server buffer to allow a random sequence of arrivals from different Pds.]
Figure 3. A model for the Paradyn instrumentation system.
resource is busy, the request waits in the queue of that particular resource. To ensure fair scheduling of processes, the operating system (Unix) can preempt a process that needs to occupy a CPU for longer than the specified quantum. When a request is fully serviced, it signals the process that generated it, which then issues the next request to occupy another resource. This activity continues until the application program terminates.
[Figure 4 diagram: instrumented application processes, instrumentation system processes, and other user processes running at a particular system node generate requests to occupy the CPU and network resources; completion of a request triggers the subsequent request from the corresponding process, and a CPU request that times out rejoins the CPU queue.]
Figure 4. The resource occupancy model for the Paradyn IS.
Process scheduling policies are determined by the target operating system. For instance, an operating system such as Unix can use preemption with first-come, first-served or priority scheduling policies to manage user processes. This is the only part of the ROCC model that depends on the scheduling policies of a particular operating system. The number of CPUs and the number of concurrent processes are parameters of this model that can be varied. The execution of the ROCC model for the Paradyn IS relies on workload characterization for the target system. A working definition of the performance metrics is another important aspect to be settled before proceeding to calculations with the model. We present the workload characterization and performance metrics in the following subsections.
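To make the three ROCC components concrete, the following sketch renders them as simple Python data structures; the class and field names are our own illustration of the model, not part of any Paradyn implementation.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    """A demand to occupy one resource for a given amount of time (msec)."""
    source: str       # generating process class: "application", "IS", or "other"
    resource: str     # resource to occupy: "CPU" or "network"
    remaining: float  # service time still needed to fully service the request

@dataclass
class Resource:
    """A shared system resource with a FIFO queue of pending requests."""
    name: str
    quantum: float = float("inf")  # preemption quantum; finite for the CPU
    queue: deque = field(default_factory=deque)

    def submit(self, req: Request) -> None:
        self.queue.append(req)  # a request waits here while the resource is busy

# One node of the distributed system: a CPU with a round-robin preemption
# quantum (e.g., 100 msec in the simulation of Section 3.2.3) and a network
# resource that services requests to completion.
cpu = Resource("CPU", quantum=100.0)
net = Resource("network")
cpu.submit(Request("IS", "CPU", remaining=20.0))  # e.g., a Pd sample-forwarding step
```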
3.1.1 Workload Characterization
In order to evaluate the impact of the interaction between the two types of processes (application and IS) on a distributed system node, the workload for the system needs to be characterized. We consider the various states of an instrumented process running on a node, as illustrated by Figure 5. After the process has been admitted, it can be in one of the following states: Ready, Running, Communication, or Blocked (for I/O). The process can be preempted by the operating system to ensure fair scheduling of multiple processes sharing the CPU. After specified intervals of time (in the case of sampling) or after the occurrence of an event of interest (in the case of tracing), such as the spawning of a new process, instrumentation data are collected from the process and communicated to the mPp over the network.
[Figure 5 diagram: state transitions among Admit, Ready, Running, Communication, and Blocked states, with dispatch and time-out edges between Ready and Running; data collection and network access edges to the Communication state, which forwards data to the mPp after each sampling interval or when a Fork/Spawn event logs a new process; wait and release edges to the Blocked state; and an Exit transition.]
Figure 5. Detailed process behavior model in an environment using an instrumentation system.
The process behavior model depicted in Figure 5 is fairly detailed but requires low-level monitoring information to analyze the characteristics of the workload driving the system, and such low-level data are not trivially obtained. Additionally, the above process behavior depends on the scheduling policies used by the operating system, so the model may need to be modified for every operating system. In order to reduce the number of states in the process behavior model, and hence the level of detail, we group some states into a representative state. This simplification facilitates obtaining measurements from actual programs without any special operating system privileges. The simplified model is shown in Figure 6. It provides sufficient information to characterize the workload when applied in conjunction with the resource occupancy model described in the next section. This model considers only three states of process activity: Computation, Communication, and I/O, which require the use of the CPU, network, and I/O resources, respectively. The Computation state groups the Ready and Running states of the detailed model of Figure 5. Similarly, the Communication state represents the data collection as well as the communication activities. Measurements regarding the three states of the simplified model are conveniently obtained from the application programs.
[Figure 6 diagram: a three-state cycle in which the Computation state leads to the Communication state on a network access and to the I/O state on an I/O access, returning to Computation when done.]
Figure 6. Simplified behavior model for an application/IS process for the purpose of workload characterization.
The above discussion has presented the behavior of a generic application or IS process. We specifically consider application and IS process behavior with respect to the Paradyn IS in the following subsections.

Behavior of Application Processes
A Paradyn daemon dynamically inserts instrumentation code in the binary image of an executing process as needed by the W3 search algorithm executed by the mPp [11]. The instrumentation code is removed when the algorithm no longer needs to collect instrumentation data from that application process. Therefore, an application process alternates between periods of instrumentation and no instrumentation during its execution. An alternating renewal process is perhaps the most appropriate choice to model this type of behavior [30]. This process repeats in cycles of an instrumentation period followed by a no-instrumentation period, as shown in Figure 7. The j-th cycle starts at time τ_j and has length τ_(j+1) - τ_j. The process probabilistically restarts after each cycle. If t_i is the length of an instrumentation interval and t_n is the length of a no-instrumentation interval, the long-term proportion of time that the process spends in the instrumentation state is given by:

$\lim_{t \to \infty} P[\mathrm{Instrumentation}] = E[t_i] / (E[t_i] + E[t_n])$    (1)

[Figure 7 diagram: alternating instrumentation intervals of mean length E[t_i] and no-instrumentation intervals of mean length E[t_n] along the time axis, with cycles beginning at τ_0, τ_1, τ_2, ...]
Figure 7. Alternating cycles of instrumentation and no instrumentation of an application process during program execution.
by Smith's theorem [28]. Using the result shown by Kleinrock for a similar type of renewal process [15], the mean accumulated length of the instrumentation period up to time t can be obtained simply by multiplying (1) by t. Using an average value of the instrumentation cost associated with each renewal, the total instrumentation cost for one application process can be calculated; this cost can be aggregated over all the application processes to obtain the instrumentation cost for the application program as a whole. This type of calculation estimates the instrumentation cost to the application program at a gross level. However, the objective of our study is to focus at the level of a local distributed system node to obtain better insight into the overheads to application processes. Therefore, we use the above renewal process only for its modulating effect on the behavior of the application process. Unless otherwise specified, the results presented in this paper correspond to application process behavior during an instrumentation interval.
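As a worked instance of (1) and the accumulated-cost argument above, suppose for illustration (these values are ours, not measurements from the Paradyn study) that instrumentation intervals average E[t_i] = 2 s and no-instrumentation intervals average E[t_n] = 8 s. Then

$\lim_{t \to \infty} P[\mathrm{Instrumentation}] = 2 / (2 + 8) = 0.2$,

so over a run of t = 500 s the mean accumulated instrumentation time is 0.2 × 500 = 100 s; and since there are about t / (E[t_i] + E[t_n]) = 50 renewal cycles, an average per-renewal cost of c yields a total instrumentation cost of roughly 50c for that process.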
For the purpose of modeling the Paradyn IS, we further simplify the application process behavior from that shown in Figure 6. The I/O state is merged with the Computation state, so that the process alternates between the computation and communication states only, as depicted by Figure 8. The length of each computation state determines the amount of time for which the process needs to occupy a CPU. For simplicity, we assume that the amount of CPU time demanded by a process is exponentially distributed. (Similar workload characterizations have been developed, for instance, for a shared workstation by Kleinrock et al. [15].) More accurate behavior could be captured through a formal workload characterization for the Paradyn IS, which is beyond the scope of this study. Similarly, the amount of time that network resources are occupied is assumed to be exponentially distributed. The mean inter-arrival times of successive CPU and network requests are 1/λc1 and 1/λn1, respectively. It is an approximation of the analytical model to assume that the CPU and network occupancy requests are generated independently of each other, since in reality one has to be satisfied before the next can be generated.
[Figure 8 diagram: an application process's execution time alternates between computation intervals x_1, x_2, ..., x_(M-1) and communication intervals y_1, y_2, ..., y_(M-2); x_i is the CPU time required by the i-th request to occupy the CPU (rate λc1) and y_i is the communication time required by the i-th request to occupy the network resources (rate λn1).]
Figure 8. Alternating computation and communication states of an instrumented application process during its execution.
Behavior of the IS Process
The behavior of the IS process, i.e., the Paradyn daemon, is similar to that of an application process. It waits for the arrival of an instrumentation data sample, which arrives through a Unix pipe from an application process. As soon as the sample is available, the Pd forwards it over the network. We denote the mean inter-arrival times of CPU and network requests from the daemon by 1/λc2 and 1/λn2, respectively. These times are determined by the sampling period used by the IS, in addition to any delays due to contention for CPU time among multiple application processes forwarding their instrumentation data samples to the daemon.

3.1.2 Metrics for IS Evaluation
The ROCC model is developed with the specific goal of evaluating the IS overheads to the application program due to the competition and contention for shared system resources. Two applicable metrics, their calculation methods, and their interpretations are summarized in Table 2. Here, we consider the metrics with respect to the CPU. The relative amount of CPU time required to execute the Pd process, i.e., its CPU utilization, represents direct overhead to the application processes; lower is better. The throughput of Pd process requests has a more complicated interpretation, in which a nominal value is best under high application process loads. That is, both high and low values are undesirable if there is contention from application processes. Relatively high throughput for the daemon reflects low CPU availability, and thus low throughput (compared to capacity), for application processes. Conversely, relatively low throughput correlates with high latency in servicing Pd requests (monitoring latency [6]) if the system is saturated.

Table 2. Metrics for evaluating the Paradyn IS using the ROCC model and their interpretations.
  CPU utilization by Pd requests — Calculation: queuing analysis, mean value analysis, and simulation — Interpretation: corresponds to direct perturbation of the program; lower is better
  CPU throughput for Pd requests — Calculation: queuing analysis, mean value analysis, and simulation — Interpretation: nominal is best
Three approaches are mentioned in Table 2 for calculating the performance metrics of interest in the context of the ROCC model. The following subsection elaborates on these approaches.
3.2 Model Calculation Approaches
We adopt three approaches to calculate the metrics of interest from the IS model: queuing analysis, mean value analysis (MVA), and simulation experiments. Multiple approaches are used primarily to establish the validity of the conclusions drawn from the performance study. Actual measurements are desirable for validation purposes but are difficult to obtain, particularly when the IS is at an early prototyping stage.

3.2.1 Queuing Theoretic Approach
Since the ROCC model is a queuing network, queuing analysis is a natural choice. However, queuing analysis depends on the stochastic nature of the various arrival and departure processes, which is impractical to characterize at early tool development stages. Queuing analysis is therefore not tractable unless several assumptions are made regarding the stochastic nature of the processes involved. Consequently, this approach provides only back-of-the-envelope calculations of the metrics of interest, which may not be very accurate. Such calculations are nevertheless useful for validating the results of more thorough approaches, such as simulation models.

We have considered two system resources, CPU and network, that are shared between the application processes and the Paradyn daemon. To simplify the queuing analysis, we assume that CPU and network occupancy requests are serviced independently of each other. This approximation is justified, particularly since the queuing analysis is intended to yield only an approximate analytical solution for the metrics of interest. The two resources are therefore decomposed and modeled as two single-server queues.

The CPU(s) and the requests to occupy them from the application and IS processes can be modeled as an M/G/1 (or M/G/m) queuing system with two arrival classes: arrivals from P application processes, each with an exponential inter-arrival rate of λc1, and arrivals from the Paradyn daemon with an exponential inter-arrival rate of λc2. Both classes of requests have identical priorities and are serviced according to a first-come, first-served discipline. This single-server system is represented in Figure 9.
[Figure 9 diagram: application requests (rate Pλc1, service rate 1/E[s1]) and daemon requests (rate λc2, service rate 1/E[s2]) join a single CPU queue; a request that times out (exhausts its quantum) rejoins the queue.]
Figure 9. Approximate computation resource management model for the Paradyn IS.
The overall arrival rate to the CPU(s) in the case of Poisson arrivals is given by:

$\lambda_c = P\lambda_{c1} + \lambda_{c2}$.    (2)
Allen [1] tabulates the analytical results for this type of system that are useful for calculating the metrics of interest for the Paradyn IS. Unix uses a round-robin CPU scheduling policy, preempting a request after it receives a specified quantum of CPU time. Kleinrock has analyzed this type of M/G/1 system (with all arrivals belonging to the same class) in [16]. We have combined Kleinrock's results with Allen's to determine closed-form expressions for the metrics of interest for the Paradyn IS; these results are presented in Table 5 in the Appendix. The service times of the application and Paradyn daemon requests are denoted by s1 and s2, respectively, and the overall service time by s. It should be noted that the results for average queue length and throughput are derived from the steady-state behavior of
the system using Little's law and the utilization law, respectively [13]. These results are valid regardless of the characteristics of the underlying stochastic process.

Network resources, such as the buffer in the network interface, the communication bandwidth, and the operating system protocols that handle message traffic, are modeled collectively as an M/M/1 queue. As in the case of the CPU(s), the queuing system has two arrival classes: arrivals from P application processes, each with an exponential inter-arrival rate of λn1, and arrivals from the Paradyn daemon with an exponential inter-arrival rate of λn2. Both classes of requests have identical priorities and are serviced according to a first-come, first-served discipline. The exponentially distributed service rates are denoted by µn1 and µn2 for application and daemon requests, respectively. This single-server system is represented in Figure 10. The overall arrival rate to this system is given by:

$\lambda_n = P\lambda_{n1} + \lambda_{n2}$.    (3)
[Figure 10 diagram: application requests (rate Pλn1, service rate µn1) and daemon requests (rate λn2, service rate µn2) join a single network queue.]
Figure 10. Approximate network resource management model for the Paradyn IS.
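For the M/M/1 network model of Figure 10, the tabulated closed-form results reduce to textbook formulas (the utilization law and Little's law). The sketch below evaluates them, assuming, as in Table 3, the same 20 msec mean demand for both classes (µn1 = µn2); the numeric values — an 800 msec application inter-arrival time and a 200 msec sampling period — are illustrative assumptions rather than measurements.

```python
# Illustrative evaluation of the decomposed single-server model of Figure 10
# using standard M/M/1 results; all numeric values are assumptions.

P      = 8          # application processes on the node
lam_n1 = 1 / 0.800  # per-process application request rate (1/800 msec), req/sec
lam_n2 = 1 / 0.200  # Pd request rate for an assumed 200 msec sampling period
mu_n   = 1 / 0.020  # service rate for a 20 msec mean occupancy demand

lam_n = P * lam_n1 + lam_n2   # overall arrival rate, equation (3)
rho   = lam_n / mu_n          # traffic intensity (server utilization)
assert rho < 1, "the queuing model is only stable for rho < 1"

util_pd    = lam_n2 / mu_n    # utilization law: fraction of capacity used by Pd
throughput = lam_n            # in steady state, completion rate equals arrival rate
n_mean     = rho / (1 - rho)  # mean number in an M/M/1 system
t_mean     = n_mean / lam_n   # mean time in system, by Little's law (L = lambda * W)

print(f"rho = {rho:.2f}, Pd utilization = {util_pd:.1%}, "
      f"throughput = {throughput:.1f} req/sec, E[N] = {n_mean:.2f}, "
      f"E[T] = {1000 * t_mean:.1f} msec")
```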
We again use the results tabulated by Allen [1] to calculate closed-form expressions for the metrics of interest. The analytical results are presented in Table 4 in the Appendix.

3.2.2 Mean Value Analysis Approach
The mean value analysis (MVA) technique is used for solving closed queuing networks [18]. As the name implies, MVA provides useful results regarding the long-term average behavior of the system. The main advantage of this technique is that only the average values of various measurable system parameters need to be known. These parameters include the demands for occupying various system resources, the number of visits to a resource, the service time received per visit to a resource, etc. Various operational laws can be applied to calculate performance metrics such as utilization, throughput, and response time [13]. This technique is therefore attractive for obtaining average results regarding IS overheads at various design stages; its drawback is that it is not appropriate for detailed analysis.

For analyzing the Paradyn IS using MVA techniques, we directly consider the demands for occupying the various system resources by the two classes of local processes. The demand for a resource k is denoted by Dk, which is expressed in terms of the aggregate time needed from resource k to service this demand (request). For this analysis, we focus our attention on the CPU only, as its aggregate demand from both application and IS processes is larger than the demand for network resources. Let Xc1 and Xc2 denote the throughput (as requests/sec) of application and IS process requests to the CPU, respectively. Also let Vc1 and Vc2 denote the number of visits to the CPU by an application and IS process, respectively, to fully service a CPU request, and let Nc1 and Nc2 represent the number of application and IS processes, respectively. Table 6 (in the Appendix) summarizes the approximate results for the metrics of interest.

MVA techniques are also useful for calculating bounds on the performance metrics of interest [18]. Analysis of performance bounds is particularly useful to locate bottlenecks and to answer fairly high-level "what-if" questions. We calculate the asymptotic bounds on the throughput of Paradyn daemon process requests to the CPU. If D is the total demand for the CPU and network and Dmax is the maximum of the demands on the two resources, the asymptotic bounds on CPU throughput calculated by Lazowska [18] are
given by:

$1/D \le X(N) \le \min(1/D_{max}, N/D)$.    (4)
These bounds on CPU throughput are used to determine the minimum number of daemon processes needed to obtain maximum throughput.

3.2.3 Simulation Analysis Approach
Simulation is the third approach we applied to calculate the performance metrics from the ROCC model. Simulation models allow the incorporation of fine details of system behavior that are impractical to study through analytic models [17]. A detailed system model allows "what-if" questions regarding system behavior to be answered more accurately, provided that the simulation model and its execution are valid. The analytical approaches of Sections 3.2.1 and 3.2.2 are used to establish the validity of the simulation model of the Paradyn IS. Simulation is particularly useful for analyzing an IS that has not yet been realized, both to evaluate its overheads and to verify system behavior; it is widely used in a number of diverse application areas for rapid prototyping of system designs.

A simulation model is developed to account for the concurrent activities of system resources, application and IS processes, and their interactions. The numbers of application processes, daemon processes, and CPUs are parameters of this model, in addition to the lengths of the sampling intervals and the demands for system resources. These parameters can be varied from one execution of the model to another to study their effects on the performance metrics of interest.

The simulation model for the Paradyn IS accounts for the dependence between application process requests and Pd requests. An application process needs to occupy the CPU to determine whether the sampling interval has expired. When the sampling interval expires, the process invokes a system call to forward the sample to the local Pd over a pipe. It thereby releases the CPU, and a CPU occupancy request is generated by the Pd process to collect that sample. Multiple application processes can cause the generation of multiple Pd requests. These requests are generated concurrently if there are multiple local Pd processes; in the case of a single local Pd, they are generated serially, each after the previous one has been serviced.

The simulation model assumes that requests from application and Pd processes have equal priorities. We ignore context-switching overhead and assume that a CPU preempts a request after it has received a quantum of 100 milliseconds. Until a resource occupancy request is fully serviced, the corresponding application or Pd process does not generate a subsequent request. We assume initially that there are no pending requests at any of the system resources. Using a resolution of 1 millisecond, we allow the simulation to run for 100 seconds.

Simulation experiments are set up to analyze the effects of two parameters (factors), the sampling rate and the number of local application processes, on the two performance metrics of interest. We use a 2^k r factorial design for these experiments, where k is the number of factors of interest and r is the number of repetitions of each experiment [13]. For these experiments, k = 2 factors and r = 50 repetitions, and the mean values of the two metrics are estimated within 90% confidence intervals from a sample of 50 values. We complement this 2^k r factorial design with principal component analysis to assess the sensitivity of the performance metrics to various model parameters. A minimal sketch of the simulation core follows.
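As a concrete illustration, the following time-stepped sketch simulates just the round-robin CPU with the two request classes; the exponential demands, the timer-driven sample arrivals, and all numeric values are simplifying assumptions on our part, and the full model additionally covers the network resource and multiple Pds.

```python
# Minimal time-stepped sketch (1 msec resolution) of the round-robin CPU
# portion of the ROCC simulation; parameter values are illustrative only.
import random
from collections import deque

random.seed(1)
SIM_MS, QUANTUM = 100_000, 100        # 100 s run; 100 msec preemption quantum
P, SAMPLE_MS = 8, 200                 # application processes; sampling period
APP_MS, PD_MS, COMM_MS = 800, 20, 20  # mean CPU, Pd, and communication demands

ready = deque()                 # round-robin ready queue of [owner, remaining]
wake = [0.0] * P                # when each blocked app process issues its next request
next_sample = float(SAMPLE_MS)  # sample arrival modeled as a timer (a simplification)
run, used = None, 0             # request occupying the CPU; msec used this quantum
pd_busy = pd_done = 0           # Pd CPU msec consumed; Pd requests completed

for now in range(SIM_MS):
    for i in range(P):                    # blocked app processes become ready
        if wake[i] is not None and now >= wake[i]:
            ready.append([i, random.expovariate(1 / APP_MS)])
            wake[i] = None                # blocked until this request is serviced
    if now >= next_sample:                # a sample is forwarded: the Pd needs the CPU
        ready.append(["pd", random.expovariate(1 / PD_MS)])
        next_sample += SAMPLE_MS
    if run is None and ready:
        run, used = ready.popleft(), 0
    if run is not None:
        run[1] -= 1; used += 1            # occupy the CPU for this 1 msec tick
        if run[0] == "pd":
            pd_busy += 1
        if run[1] <= 0:                   # request fully serviced
            if run[0] == "pd":
                pd_done += 1
            else:                         # app process communicates, then recomputes
                wake[run[0]] = now + random.expovariate(1 / COMM_MS)
            run = None
        elif used >= QUANTUM:             # quantum expired: preempt and requeue
            ready.append(run); run = None

print(f"Pd CPU utilization = {pd_busy / SIM_MS:.2%}, "
      f"Pd throughput = {pd_done / (SIM_MS / 1000):.2f} requests/sec")
```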
4 Paradyn IS Evaluation
In this section, we use the ROCC model and the three solution approaches outlined in Section 3 for a quantitative evaluation of the Paradyn IS design. Once again we refer to the overall evaluation approach depicted in Figure 1, and present the range of each model parameter of interest before proceeding to a presentation of the results.
4.1 Parameterization
Model parameters that are used throughout the calculation of results are presented in Table 3. Some of these parameters, such as the resource occupancy demands of application processes, require workload characterization on the target system. However, as discussed in Section 3, such a study was deemed beyond the scope of this initial evaluation of the Paradyn IS. Therefore, some of the parameters were assigned values somewhat arbitrarily; we used principal component analysis (for the simulation results) to ensure that the performance metrics are not significantly sensitive to those parameters.

Table 3. Summary of parameters used in the calculation approaches for the ROCC model.
  Inter-arrival time of application requests: 1/λn1 = 1/λc1 = 800 msec
  Inter-arrival time of Pd requests: 1/λn2 = 1/λc2 = 20 msec
  Number of application processes: P = Nc1 = Nn1 = 1-32
  Number of Pd processes: Nc2 = Nn2 = 1-4
  Mean CPU demand from an application process: Dc1 = 800 msec
  Mean CPU demand from a Pd process: Dc2 = 20 msec
  Mean network demand from an application or Pd process: Dn1 = Dn2 = 20 msec
  CPU service quantum: E[s1] = 100 msec
Applying principal component analysis to determine the relative significance of the above parameters, we found that the number of processes is the most significant factor, followed by the sampling rate. The model calculations presented in the rest of this section are therefore based primarily on variation of these two factors.
4.2 Model Calculations Using Queuing Analysis
Figure 11 shows the behavior of the metrics of interest with respect to the sampling rate and the number of application processes, to delineate the effects of sharing the CPU between the application processes and the Paradyn daemon. The rate of arrival of samples (i.e., the sampling period) and the number of application processes are the factors that can change during the instrumented execution of a program. It should be noted that the complete range of sampling periods and numbers of application processes cannot be used to calculate the performance metrics, due to the restriction that the traffic intensity (ρ) be less than unity. As shown in the figure, the effects of changing these two factors are summarized in the following:
1. An increasing sampling period results in decreased CPU utilization by the Paradyn daemon process. The average number of pending daemon requests decreases sharply as the sampling period increases. The throughput of IS requests (samples processed and forwarded per unit time) also decreases with increasing sampling period. This implies that the direct overhead of the IS process to the application processes decreases with decreasing sampling rate;
2. An increasing number of application processes has an effect similar to that of decreasing the sampling period, because a larger number of application process requests enter the queue. The CPU utilization and throughput for the IS process decrease, and the average number of pending IS process requests increases, with an increasing number of application processes; and
3. The throughput of Pd requests is a multiple of their utilization of system resources, in accordance with the utilization law.
[Figure 11 plots: CPU utilization (%), average queue length, and throughput (requests/sec) for application processes and the IS process, (a) versus sampling period (100-500 msec) with P = 8 application processes, and (b) versus the number of application processes (1-8) with a sampling period of 110 msec (to maintain ρ < 1).]
Figure 11. Analytical results of the ROCC model regarding CPU sharing by application processes and the Paradyn daemon with respect to (a) sampling period and (b) number of application processes.
Sharing of network and communication resources is also of interest in a distributed computing setup. Figure 12 shows the behavior of the metrics of interest with respect to the sampling rate and the number of application processes, to delineate the effects of sharing these resources between the application processes and the Paradyn daemon. The results shown here are obtained under the assumption that the application and IS processes have the same communication resource occupancy demands. We again vary the two factors, sampling period and number of application processes. The effects on the performance metrics are summarized in the following:
1. An increasing sampling period results in decreasing network resource utilization, throughput, and average number of waiting requests from the IS process. Therefore, the contention between the IS process and the application processes is reduced as the sampling period increases (i.e., as the sampling rate decreases); and
2. An increasing number of application processes does not significantly change the utilization or throughput of the IS process requests to occupy network resources.

The analytical results presented in this subsection are approximate because they do not consider the dependence between the arrival and the service of requests from an application or IS process: neither type of process can issue a subsequent request before its previous request is completely serviced. The MVA presented in the next subsection is comparatively more accurate because it accounts for overall system behavior without assuming the nature of the arrival and service processes.
[Figure 12 plots: network utilization (%), average queue length, and throughput (requests/sec) for application processes and the IS process, (a) versus sampling period (100-500 msec) with P = 8 application processes, and (b) versus the number of application processes (1-8) with a sampling period of 110 msec.]
Figure 12. Analytical results of the ROCC model regarding network sharing by application processes and the Paradyn daemon with respect to (a) sampling period and (b) number of application processes.
4.3 Model Calculations Using MVA
The MVA approach is typically used to calculate overall system performance metrics, such as utilization, throughput, and response time. The analysis here focuses on the CPU, the bottleneck device, as it has the maximum demand in the system. We calculate the performance metrics over the ranges of the two factors of interest according to the MVA expressions listed in Table 6. The results are presented graphically in Figure 13 for comparison with those in Figure 11 and with the simulation results (presented in Section 4.4). The effects of approximate analysis are clearly apparent in Figure 13: the rate of arrival of Pd requests does not vary with the number of application processes, so CPU utilization and throughput are not significantly affected as the number of application processes varies.

We can use equation (4) to determine asymptotic bounds on Pd throughput. These bounds are calculated for the parameter values shown in Table 3 and are reflected by the shaded area in Figure 14. The figure indicates that increasing the number of IS processes up to two increases the throughput, but any further increase yields no performance improvement. This change-over point is indicated by N* in the figure.
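The change-over point can be reproduced directly from equation (4) and the demand values annotated in Figure 14 (Dmax = 20 msec for the CPU; D = 20 + 20 = 40 msec total); a small sketch:

```python
# Asymptotic bounds of equation (4) for the Figure 14 demand values.
D_MAX, D = 0.020, 0.040  # seconds: bottleneck (CPU) demand and total demand

def x_upper(n: int) -> float:
    """Optimistic bound on CPU throughput for n IS processes (requests/sec)."""
    return min(1 / D_MAX, n / D)

for n in range(1, 5):
    print(f"N = {n}: X(N) <= {x_upper(n):.0f} requests/sec")
# The bound stops improving once N/D reaches 1/D_MAX, i.e. at
# N* = D / D_MAX = 2 daemon processes, matching Figure 14.
```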
4.4 Model Calculations Using Simulation
The results of the simulation experiments are plotted in Figure 15. As expected, the direct overhead to local application processes (i.e., CPU utilization by the Pd) decreases as the sampling period increases; the decrease is significant initially, but levels off. CPU throughput for the daemon requests (i.e., throughputPd) decreases as the number of application processes becomes large.
[Figure 13 plots: CPU utilization (%), average queue length, and throughput (requests/sec) for application processes and the IS process, (a) versus sampling period (100-500 msec) with P = 8 application processes, and (b) versus the number of application processes (1-8) with a sampling period of 110 msec.]
Figure 13. Mean value analysis results of the ROCC model regarding CPU sharing by application processes and the Paradyn daemon.
[Figure 14 plot: the asymptotic bounds of equation (4) on CPU throughput X(N) versus the number of IS processes N, with Dmax = 20 msec (CPU demand), D = CPU demand + network demand = 20 + 20 = 40 msec, 1/Dmax = 50 requests/sec, and 1/D = 25 requests/sec; N* = 2 is the optimal number of IS processes for maximum CPU throughput for IS requests.]
Figure 14. Asymptotic bounds on CPU throughput for IS requests.
Note that the throughputPd plotted in Figure 15(a) shows large variance at higher sampling periods. However, the plotted values are still within the 90% confidence interval of the mean, so the throughput can effectively be considered unaffected at larger sampling periods. The reduction in CPU utilization by the daemon is mainly due to the round-robin CPU scheduling used by the Unix operating system: if more processes are waiting for CPU time in the queue, then within a given period of time the daemon process receives relatively less CPU time. This means that the daemon becomes a bottleneck as the number of application processes grows. If it cannot collect and forward the instrumentation data samples at a sufficiently high rate, the pipes become full and the application processes become blocked. A similar result is reported by Gu et al., whose measurements of their Falcon IS show that multiple monitoring processes reduce the monitoring latency when the number of application processes is above a threshold [7].
[Figure 15 plots: CPU utilization (%) and throughput (requests/sec) for application processes and the IS process, (a) versus sampling period (50-500 msec) with P = 8 application processes, and (b) versus the number of application processes (0-32) with a sampling period of 100 msec.]
Figure 15. CPU utilization and throughput metrics calculated with the ROCC simulation model.
This is particularly true when local nodes have more computation than communication capacity, as in the case of high-performance workstations.

In order to gain some insight into the performance of the IS with multiple Pds, the results shown in Figure 15 were reproduced with one to four daemon processes. These results are shown in Figure 16. In all cases, an increase in the number of daemon processes increases both the CPU utilization and the throughput of CPU requests from the Pds. However, this does not imply that a larger number of Pds is desirable. Recall that a nominal value is desirable for the throughput of Pd requests to the CPU: higher throughput is desirable from the perspective of data forwarding efficiency, but it also means higher CPU utilization by the Pd processes, which is a direct overhead to the application processes. Therefore, as the analysis of the bounds on CPU throughput for Pd requests in Figure 14 indicates, there is an optimal choice for the number of daemon processes. Selecting a larger number incurs undesirable overheads to the application processes, whereas a smaller number decreases the throughput of the IS and thus its efficiency.
4.5 Measurement Results
Measurements are important to validate the results of the performance study of the Paradyn IS. However, at the time of this evaluation, the Paradyn IS was not available for this type of measurement. In order to validate the analytical, MVA, and simulation results regarding CPU utilization (i.e., direct overhead), we therefore replicated the operation of the Paradyn daemon process on a SPARC 10 workstation running the Solaris operating system, using the client/server setup depicted in Figure 17. The server replicates the behavior of a Paradyn daemon collecting and forwarding a data sample. To replicate data collection after a sampling period, the client waits for the sampling time and then forwards a 64-byte sample to the server. We collected ten measurements for each sampling period to obtain a reasonable confidence interval for the mean.
[Figure 16 plots: CPU utilization (%) and throughput (requests/sec) of Pd requests for one to four Pds, (a) versus sampling period (50-500 msec) with P = 8 application processes, and (b) versus the number of application processes (0-32) with a sampling period of 100 msec.]
Figure 16. CPU utilization and throughput metrics calculated with the ROCC simulation model using multiple Pds.
[Figure 17 diagram: a client (application process) connected to a server (Pd) through a Unix pipe.]
Figure 17. Arrangement for measurement-based validation.
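A minimal re-creation of this arrangement is sketched below for a Unix system: a forked "client" writes a 64-byte sample into a pipe once per sampling period while the parent "server" drains the pipe, reporting its own CPU utilization from os.times(). This is our illustrative approximation of the setup; the original measurement code is not available to us.

```python
# Client/server replication of a Pd over a Unix pipe (Unix-only sketch).
import os, time

SAMPLE_SEC, RUN_SEC = 0.2, 10.0       # sampling period and measurement length
r, w = os.pipe()

if os.fork() == 0:                    # child: the instrumented "client"
    os.close(r)
    end = time.time() + RUN_SEC
    while time.time() < end:
        time.sleep(SAMPLE_SEC)        # wait one sampling period
        os.write(w, b"x" * 64)        # forward a 64-byte data sample
    os.close(w)
    os._exit(0)

os.close(w)                           # parent: the "server" (Pd stand-in)
cpu0, wall0 = sum(os.times()[:2]), time.time()
while os.read(r, 64):                 # collect samples until the pipe closes
    pass                              # a real Pd would forward each one to the mPp
cpu = sum(os.times()[:2]) - cpu0
os.wait()
print(f"server CPU utilization ~ {cpu / (time.time() - wall0):.2%}")
```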
The measurement results using only one client process are depicted in Figure 18, which shows the CPU utilization for the Pd (server). The direct overhead due to the daemon is large for shorter sampling intervals, but decreases as the sampling interval increases; the same behavior was shown by the analytical, MVA, and simulation results. Evaluation of the Paradyn IS continues in collaboration with the Paradyn developers, including the modeling of new management policies. We expect that feedback to the developers early in the development process will lead to better design decisions.
5 Related Work
Many parallel programming tools use an IS. Cheng [4] has surveyed most of the well-known parallel program performance analysis and debugging tools.
[Figure 18 plot: measured CPU utilization of the Pd (%) versus sampling period (50-500 msec).]
Figure 18. Measurement results.
In this section, we introduce the IS development and evaluation approaches of various representative parallel tools. Additional details can be found in [36].

PICL and ParaGraph [9] have been used with several environments. PICL [5] is a portable library of efficient communication functions that also supports instrumentation. We have modeled and evaluated two performance data management policies supported by PICL in [35]. AIMS (Automated Instrumentation and Monitoring System [39]) is a toolkit consisting of an instrumentation library and a set of off-line performance analysis and visualization tools. Its IS support is almost identical to that of PICL. A user can specify different buffer sizes or the usage of flushing functions in a configuration file as part of a static management policy.

Pablo [27] is an integrated tool environment that offers three types of performance data capturing functions: (1) event tracing; (2) event counting; and (3) code profiling. If a local buffer is full, all buffers can be flushed synchronously to a file or to an Internet domain socket. Unlike the PICL and AIMS ISs, Pablo's IS supports adaptive levels of tracing to dynamically alter the volume, frequency, and types of event data recorded. Adaptive management policies ensure that the IS overheads remain low, particularly for long-running instrumented programs.

Paradyn [24] is an on-line performance evaluation environment that is based on dynamically updating the cumulative-time statistics of various performance variables. In addition to implementing a dynamic management policy, its IS is equipped with the capability to estimate its cost to the application program [12]. This cost model is continuously updated in response to actual measurements as an instrumented program executes, and the model attempts to regulate the amount of IS overhead to the application program.

Falcon [7] is an application-specific, on-line monitoring and steering system for parallel programs. The Falcon IS supports dynamic control of monitoring overhead to reduce the latency between the time an event is generated and the time it is acted upon for the purpose of steering. The various modules and functions of the IS are specified by a low-level sensor specification language and a higher-level view specification language.

ParAide [29] is the integrated performance monitoring environment for the Intel Paragon. Commands are sent to the distributed monitoring system, called the Tools Application Monitor (TAM). TAM consists of a network of TAM processes arranged as a broadcast spanning tree with one TAM process at each node. This configuration allows monitoring requests to be broadcast to all nodes. Instrumentation library calls
generate data that are sent to the event trace servers, which perform post-processing tasks and write the data to a file or send them directly to an analysis tool. To minimize perturbation, trace records are stored locally in a trace buffer that is periodically flushed to the local trace server.

Scalable Parallel Instrumentation (SPI [2]) is Honeywell's real-time instrumentation system for heterogeneous computer systems. SPI supports an application-specific instrumentation development environment, which is based on an event-action model and an event specification language. Hewlett-Packard's VIZIR [8] is another integrated tool environment, used for debugging and visualization on a workstation cluster. It utilizes commercially available debuggers and visualization tools and is an example in which IS support has been used to integrate heterogeneous tools.

Typically, parallel tools, including ISs, are developed in response to users' needs for addressing the performance problems of their application programs. The performance of the tool itself is usually of secondary importance to tool developers. Often, it is the user who discovers a poorly performing (i.e., performance-handicapped) tool, for instance, when invoking some feature that unexpectedly and inexplicably causes severe performance degradation. There are very few examples where tool developers either perform, provide, or document any evaluation of their IS overheads through testing with application programs. In particular, we are not aware of this type of evaluation being performed concurrently with the tool design and implementation processes.

Paradyn is a notable example in which the tool developers provide an adaptive cost model to predict the overhead to an application program due to the IS [12]. This cost model is continually updated in response to actual measurements during instrumented program execution. SPI [2] ensures that the invasiveness of its IS is accountable: it measures the instrumentation load on nodes and links in each specified window of time to evaluate the degree of invasiveness relative to an application program.

Falcon [26] is perhaps the only tool whose developers support a thorough evaluation of its instrumentation system. Perturbation of programs is measured under different conditions of tracing rates, event record lengths, and event buffer sizes. On-the-fly ordering of event records, which is needed for meaningful visualization, is evaluated as the ratio of out-of-order events that need to be "held back." This hold-back ratio is found to be sensitive to the size of the data collection buffers. Additionally, Falcon's IS performance is compared with that of other standard instrumentation tools, such as Gprof, using the same overhead metrics. Such meticulous and practical evaluation of IS performance by the developers provides essential information to the users, especially when an IS is used under real-time constraints. IPS-2 [23] also reports overhead measurements for application programs in comparison with Gprof.

Work has also been done on compensating for the effects of program perturbation due to instrumentation [38]. The goal of perturbation compensation is to reconstruct the actual program behavior from the perturbed behavior recorded by the IS. Malony et al. [20] describe a model for removing the effects of perturbation from the traces of parallel program executions.

Presently, it is not standard practice to formally evaluate the performance and functionality of a tool early in its development.
Usability and efficiency studies of prototypical tools are emerging to alleviate this situation. However, the underlying IS is removed from the end-user and is part of system infrastructure, thus necessitating more rigorous evaluation. Moreover, contemporary approaches to evaluate IS overheads and perturbation do not adequately consider the nondeterministic nature of these effects. The approach introduced in this paper has addressed these issues.
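To make the flavor of such adaptive overhead regulation concrete, the sketch below illustrates the general idea behind an adaptive cost model of the kind Paradyn uses [12]: a per-interval overhead measurement is folded into a smoothed estimate, and the instrumentation sampling rate is throttled whenever the estimate exceeds an overhead budget. This is a minimal sketch under assumed mechanics (exponential smoothing, a multiplicative rate adjustment, and the parameter values shown); it is not Paradyn's actual implementation.

```python
# Hypothetical sketch of adaptive IS overhead regulation in the spirit of
# Paradyn's cost model [12]; class, names, and parameters are illustrative.

class AdaptiveCostModel:
    def __init__(self, budget=0.05, alpha=0.3, sampling_rate=100.0):
        self.budget = budget                 # tolerated overhead fraction (5%)
        self.alpha = alpha                   # exponential-smoothing factor
        self.sampling_rate = sampling_rate   # current samples per second
        self.predicted_overhead = 0.0        # smoothed overhead estimate

    def update(self, measured_overhead):
        """Fold a new per-interval overhead measurement into the estimate
        and adjust the sampling rate toward the overhead budget."""
        self.predicted_overhead = (self.alpha * measured_overhead +
                                   (1.0 - self.alpha) * self.predicted_overhead)
        if self.predicted_overhead > 0.0:
            # Scale the rate; cap growth at 2x per interval to damp oscillation.
            self.sampling_rate *= min(self.budget / self.predicted_overhead, 2.0)
        return self.sampling_rate

model = AdaptiveCostModel()
for measured in [0.02, 0.08, 0.12, 0.06, 0.04]:   # illustrative measurements
    rate = model.update(measured)
    print(f"estimated overhead {model.predicted_overhead:.3f}, "
          f"new sampling rate {rate:.1f}/s")
```

The multiplicative update deliberately caps the sampling rate so that it cannot more than double in a single interval, which keeps the regulation stable when measurements are noisy.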
6 Discussion and Conclusions
This paper has presented a resource occupancy (ROCC) model to evaluate the overheads of an IS configuration and its management policies, providing developers with valuable feedback regarding the performance of their designs. Due to the diversity and complexity of computer systems in general, and of concurrent computer systems in particular, many performance analysts agree that there is as yet no "theory of computer system performance evaluation." Performance evaluation of a given computer system is justifiably referred to as an art [13]. The evaluation process is carried out with respect to specific goals and the subtle behavioral aspects of the system under study, applying appropriate results from multiple related disciplines, such as statistics, probability theory, queuing theory, operations research, and simulation.

In this paper, we have presented modeling and evaluation approaches suitable for the Paradyn IS. We have also modeled and evaluated the PICL and Vista ISs, which serve fundamentally different requirements [36]. While a "universal" model or evaluation technique that applies to all ISs is not practical, our present suite of three case studies reflects the following commonalities:
• Queuing models are intuitively appropriate for capturing the dynamics of an IS, just as they are for many other computer system components and policies, including processor architectures, networks, I/O subsystems, memory hierarchies, memory management schemes, caching policies, processor scheduling policies, and communication protocols.
• A model is established according to the high-level requirements of an IS; these requirements aid in specifying the factors and metrics that are important for evaluating performance.
• The primary goal of a model is to support "what-if" analyses regarding the selection of various parameters and policies of an IS. If the IS is in production, modeling results can be used to analyze the system, and measurements can be obtained to test their validity. If an IS is being designed or prototyped, simulation experiments can be used to investigate various design choices; this is a standard practice in system design [17], and a minimal simulation sketch appears at the end of this section.

These commonalities point toward the need and the opportunity for applying a structured approach to IS development. Although specifics differ among ISs, the overall approach is represented by Figure 1. This approach provides a basis for developers to make design decisions that better serve the requirements. Most of the extant ISs that represent the state of the art in IS development address a subset of the performance issues raised in this paper in order to meet their domain-specific requirements.

As concurrent computing becomes popular in a growing number of application areas, IS developers face new challenges. One such challenge is the development of ISs for distributed or embedded real-time systems [2]. Such systems must meet stringent timing and performability constraints to be operational, and their ISs need to incorporate adaptive management and usage of system resources, customizability, and flexibility. The demands on next-generation ISs reinforce the need for a structured approach.

The operation of an IS in a real system is nondeterministic; hence, collecting measurements alone is not sufficient to evaluate it. The nondeterministic nature of arrivals, resource usage and contention, and computational load on the system may render measurements of limited use. Our use of queuing models for ISs does not overlook the random nature of various IS activities.

Several important areas are being addressed by our on-going efforts in IS development: (1) benchmarking of ISs to validate that requirements are met; (2) applying structured software engineering methods to map abstract instrumentation system models to implementations; (3) appropriately characterizing IS workload to enhance the power and accuracy of the models; and (4) modeling other ISs that are at various stages of development to augment our suite of case studies using the structured approach.
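As an illustration of the kind of "what-if" simulation experiment advocated above, the following minimal sketch simulates a single CPU shared in first-come, first-served (FCFS) order by application requests and IS (daemon) requests, and reports how mean response times change as IS activity grows. The arrival rates and service times are hypothetical values chosen only to demonstrate the method; this sketch is not the ROCC simulator used in this paper.

```python
# Minimal "what-if" sketch: one CPU shared (FCFS) by application requests
# (class 1) and IS daemon requests (class 2). All parameters are hypothetical.
import random

def simulate(lam_app, lam_is, es_app, es_is, horizon=10_000.0, seed=1):
    random.seed(seed)
    arrivals = []
    # Generate Poisson arrival streams for both classes over a fixed horizon,
    # then merge them in time order.
    for cls, lam in ((1, lam_app), (2, lam_is)):
        t = random.expovariate(lam)
        while t < horizon:
            arrivals.append((t, cls))
            t += random.expovariate(lam)
    arrivals.sort()
    free_at = 0.0                        # time at which the CPU becomes free
    resp = {1: [], 2: []}
    for arrival, cls in arrivals:
        start = max(free_at, arrival)    # FCFS: wait until the CPU is free
        mean_s = es_app if cls == 1 else es_is
        free_at = start + random.expovariate(1.0 / mean_s)
        resp[cls].append(free_at - arrival)   # response time of this request
    return {cls: sum(r) / len(r) for cls, r in resp.items()}

# What-if experiment: increase the IS daemon's request rate and observe the
# slowdown of application requests on the shared CPU.
for lam_is in (0.5, 2.0, 8.0):
    r = simulate(lam_app=10.0, lam_is=lam_is, es_app=0.05, es_is=0.02)
    print(f"IS rate {lam_is:4.1f}/s: app response {r[1] * 1000:7.2f} ms, "
          f"IS response {r[2] * 1000:7.2f} ms")
```

Against such a simulation, the closed-form results of Table 5 in the Appendix can serve as a sanity check for configurations where the analytical assumptions hold.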
References

[1] Allen, Arnold O., Probability, Statistics, and Queuing Theory with Computer Science Applications, Second Edition, Academic Press, 1990.
[2] Bhatt, Devesh, Rakesh Jha, Todd Steeves, Rashmi Bhatt, and David Wills, "SPI: An Instrumentation Development Environment for Parallel/Distributed Systems," Proc. of Int. Parallel Processing Symposium, April 1995.
[3] Brown, D., S. Hackstadt, A. Malony, and B. Mohr, "Program Analysis Environments for Parallel Language Systems: The TAU Environment," Proc. of the Second Workshop on Environments and Tools For Parallel Scientific Computing, Townsend, Tennessee, May 1994, pp. 162–171.
[4] Cheng, Doreen Y., "A Survey of Parallel Programming Languages and Tools," Report RND-93-005, NASA Ames Research Center, March 1993.
[5] Geist, G., M. Heath, B. Peyton, and P. Worley, "A User's Guide to PICL," Technical Report ORNL/TM-11616, Oak Ridge National Laboratory, March 1991.
[6] Gelenbe, E., G. Pujolle, and J. C. C. Nelson, Introduction to Queuing Networks, John Wiley, 1987.
[7] Gu, Weiming, Greg Eisenhauer, Eileen Kramer, Karsten Schwan, John Stasko, and Jeffrey Vetter, "Falcon: On-line Monitoring and Steering of Large-Scale Parallel Programs," Technical Report GIT-CC-94-21, 1994.
[8] Hao, Ming C., Alan H. Karp, Abdul Waheed, and Mehdi Jazayeri, "VIZIR: An Integrated Environment for Distributed Program Visualization," Proc. of Int. Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS '95) Tools Fair, Durham, North Carolina, January 1995.
[9] Heath, Michael T. and Jennifer A. Etheridge, "Visualizing the Performance of Parallel Programs," IEEE Software, 8(5), September 1991, pp. 29–39.
[10] Helm, B. R. and A. D. Malony, "Automating Performance Diagnosis: A Theory and Architecture," Technical Report CIS-TR-95-09, Department of Computer and Information Science, University of Oregon, March 1995.
[11] Hollingsworth, J. K. and B. P. Miller, "Dynamic Control of Performance Monitoring on Large Scale Parallel Systems," Proc. of Int. Conf. on Supercomputing, Tokyo, Japan, July 19–23, 1993.
[12] Hollingsworth, J. K. and B. P. Miller, "An Adaptive Cost Model for Parallel Program Instrumentation," Technical Report, October 1994.
[13] Jain, Raj, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, John Wiley & Sons, Inc., 1991.
[14] Kilpatrick, Carol and Karsten Schwan, "ChaosMON: Application-Specific Monitoring and Display of Performance Information for Parallel and Distributed Systems," Proceedings of the ACM/ONR Workshop on Parallel and Distributed Debugging, Santa Cruz, California, May 20–21, 1991.
[15] Kleinrock, Leonard and Willard Korfhage, "Collecting Unused Processing Capacity: An Analysis of Transient Distributed Systems," IEEE Transactions on Parallel and Distributed Systems, 4(4), May 1993, pp. 535–546.
[16] Kleinrock, Leonard, Queuing Systems, Volume II: Computer Applications, John Wiley & Sons, 1976.
[17] Law, Averill M. and W. D. Kelton, Simulation Modeling and Analysis, McGraw-Hill, Inc., 1991.
[18] Lazowska, Edward D., John Zahorjan, G. Scott Graham, and Kenneth C. Sevcik, Quantitative System Performance: Computer System Analysis Using Queuing Network Models, Prentice-Hall, 1984.
[19] Lieu, Eric, personal communication, Hewlett-Packard Labs, Palo Alto, California, June 1994.
[20] Malony, A. D., D. A. Reed, and H. A. G. Wijshoff, "Performance Measurement Intrusion and Perturbation Analysis," IEEE Transactions on Parallel and Distributed Systems, 3(4), July 1992.
[21] Malony, A., B. Mohr, P. Beckman, D. Gannon, S. Yang, F. Bodin, and S. Kesavan, "Implementing a Parallel C++ Runtime System for Scalable Parallel Systems," Proceedings of Supercomputing '93, Portland, Oregon, November 15–19, 1993.
[22] Malony, A. D., "Measurement and Monitoring of Parallel Programs," Tutorial, SIGMETRICS '94, Nashville, Tennessee, May 16–20, 1994.
[23] Miller, B. P. et al., "IPS-2: The Second Generation of a Parallel Program Measurement System," IEEE Transactions on Parallel and Distributed Systems, 1(2), April 1990, pp. 206–217.
[24] Miller, Barton P., Jonathan M. Cargille, R. Bruce Irvin, Krishna Kunchithapadam, Mark D. Callaghan, Jeffrey K. Hollingsworth, Karen L. Karavanic, and Tia Newhall, "The Paradyn Parallel Performance Measurement Tools," Technical Report, 1994.
[25] Nutt, Gary J. and Adam J. Griff, "Extensible Parallel Program Performance Visualization," Proc. of Int. Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS '95), Durham, North Carolina, January 1995.
[26] Ogle, David M., Karsten Schwan, and Richard Snodgrass, "Application-Dependent Dynamic Monitoring of Distributed and Parallel Systems," IEEE Transactions on Parallel and Distributed Systems, 4(7), July 1993, pp. 762–778.
[27] Reed, Daniel A., Ruth A. Aydt, Tara M. Madhyastha, Roger J. Noe, Keith A. Shields, and Bradley W. Schwartz, "The Pablo Performance Analysis Environment," Dept. of Computer Science, University of Illinois, 1992.
[28] Resnick, Sidney I., Adventures in Stochastic Processes, Birkhauser, 1992.
[29] Ries, Bernhard, R. Anderson, D. Breazeal, K. Callaghan, E. Richards, and W. Smith, "The Paragon Performance Monitoring Environment," Proceedings of Supercomputing '93, Portland, Oregon, November 15–19, 1993.
[30] Ross, Sheldon M., Introduction to Probability Models, Academic Press, 1989.
[31] Rover, Diane T., "Vista: Visualization and Instrumentation of Scalable Multicomputer Applications," Project Summary, IEEE Parallel and Distributed Technology, 1(3), August 1993, p. 83.
[32] Rover, Diane T., "Performance Evaluation: Integrating Techniques and Tools into Environments and Frameworks," Roundtable, Supercomputing '94, Washington, DC, November 14–18, 1994.
[33] Simmons, M. and R. Koskela, editors, Performance Instrumentation and Visualization, ACM & Addison-Wesley, 1990.
[34] Waheed, A., B. Kronmuller, Roomi Sinha, and D. T. Rover, "A Toolkit for Advanced Performance Analysis," Proc. of Int. Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS '94) Tools Fair, Durham, North Carolina, January 31–February 2, 1994.
[35] Waheed, A., Vincent Melfi, and Diane T. Rover, "A Model for Instrumentation System Management in Concurrent Systems," Proceedings of the Twenty-Eighth Hawaii International Conference on System Sciences, Maui, Hawaii, January 3–6, 1995.
[36] Waheed, A. and Diane T. Rover, "A Structured Approach to Instrumentation System Development and Evaluation," Technical Report, Department of Electrical Engineering, Michigan State University, April 1995.
[37] Workshop on Debugging and Performance Tuning of Parallel Computing Systems, Chatham, Massachusetts, October 3–5, 1994.
[38] Yan, Jerry C. and S. Listgarten, "Intrusion Compensation for Performance Evaluation of Parallel Programs on a Multicomputer," Proceedings of the Sixth International Conference on Parallel and Distributed Systems, Louisville, Kentucky, October 14–16, 1993.
[39] Yan, Jerry, "Performance Tuning with AIMS: An Automated Instrumentation and Monitoring System for Multicomputers," Proc. of the Twenty-Seventh Hawaii Int. Conf. on System Sciences, Hawaii, January 1994.

Appendix

Table 5. Summary of analytical results for the ROCC model to characterize CPU sharing between application processes and the Paradyn IS (daemon). Here P is the number of application processes; subscript c1 refers to application requests and c2 to IS (daemon) requests.

Arrival rate:
    λ_c = P·λ_c1 + λ_c2
Overall CPU utilization:
    ρ_c = P·ρ_c1 + ρ_c2 = P·λ_c1·E[s_1] + λ_c2·E[s_2] = λ_c·x_c
Mean CPU time received by each application process request:
    x_c1 = (λ_c1 / λ_c)·E[s_1]
Mean CPU time received by the IS process (Paradyn daemon) request:
    x_c2 = (λ_c2 / λ_c)·E[s_2]
Mean CPU time received by both types of requests:
    x_c = P·x_c1 + x_c2
Utilization of the Paradyn daemon (Utilization_Pd):
    U_2 = x_c2 / x_c = ρ_c2 / ρ_c
Mean delay in the queue:
    W_cq = λ_c·E[s²] / (2·(1 − ρ_c))
Mean response time for both types of requests (i = 1, 2):
    W_ci = W_cq + E[s_i]
Mean length of the CPU queue:
    E[N_c] = E[N_c1] + E[N_c2] = P·λ_c1·W_c1 + λ_c2·W_c2
Throughput of the Paradyn daemon (Throughput_Pd):
    X_2 = λ_c2 / (λ_c·x_c) = U_2 / E[s_2]

The above results hold only when ρ_c < 1, i.e., when the CPU queue is stable.
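As a usage illustration, the short sketch below evaluates the closed-form results of Table 5 at one hypothetical operating point. All numerical values (P, the arrival rates, and the service-time moments, including the assumed second moment E[s²]) are illustrative assumptions, not measurements from the Paradyn case study.

```python
# Evaluate the ROCC closed-form results of Table 5 at a hypothetical
# operating point. All parameter values below are illustrative assumptions.

P        = 4        # number of application processes
lam_c1   = 2.0      # per-process application request rate (requests/s)
lam_c2   = 1.0      # IS daemon request rate (requests/s)
Es1, Es2 = 0.020, 0.010          # mean CPU service times (s)
Es_2nd   = 2 * (0.018 ** 2)      # assumed second moment E[s^2] of service time

lam_c = P * lam_c1 + lam_c2                   # total arrival rate
rho_c = P * lam_c1 * Es1 + lam_c2 * Es2       # overall CPU utilization
assert rho_c < 1, "results hold only for a stable queue (rho_c < 1)"

x_c1 = (lam_c1 / lam_c) * Es1                 # mean CPU time per app request
x_c2 = (lam_c2 / lam_c) * Es2                 # mean CPU time per IS request
x_c  = P * x_c1 + x_c2                        # combined mean CPU time
U2   = x_c2 / x_c                             # daemon share of utilization
W_cq = lam_c * Es_2nd / (2 * (1 - rho_c))     # mean queueing delay (P-K formula)
W_c1, W_c2 = W_cq + Es1, W_cq + Es2           # mean response times, both classes
E_Nc = P * lam_c1 * W_c1 + lam_c2 * W_c2      # mean CPU queue length
X2   = U2 / Es2                               # daemon throughput

print(f"rho_c = {rho_c:.3f}, W_cq = {W_cq * 1000:.2f} ms, "
      f"E[N_c] = {E_Nc:.3f}, daemon throughput = {X2:.3f}/s")
```

Such a direct evaluation is useful as a cross-check on simulation output for configurations where the analytical assumptions (Poisson arrivals and a work-conserving CPU) hold.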