Lightweight Online Performance Monitoring and Tuning with Embedded Gossip Wenbin Zhu, Patrick G. Bridges, Member, IEEE, and Arthur B. Maccabe
Abstract— Understanding and tuning the performance of large-scale, long-running applications is difficult, with both standard trace-based and statistical methods having substantial shortcomings that limit their usefulness. This paper describes a new performance monitoring approach called Embedded Gossip (EG) designed to enable lightweight online performance monitoring and tuning. Embedded Gossip works by piggybacking performance information on existing messages and performing information correlation online, giving each process in a parallel application a weakly consistent global view of the behavior of the entire application. To demonstrate the viability of EG, this paper presents the design and experimental evaluation of two different online monitoring systems and an online global adaptation system driven by the use of Embedded Gossiping. In addition, we present a metrics system for evaluating the suitability of an application to EG-based monitoring and adaptation, a general architecture for implementing EG-based monitoring systems, and a modified global commit algorithm appropriate for use in EG-based global adaptation systems. Together, these results demonstrate that EG is an efficient, low-overhead approach for addressing a wide range of parallel performance monitoring tasks, and that results from these systems can effectively drive online global adaptation.

Index Terms— Lightweight performance monitoring, dynamic performance tuning, support for adaptation, parallel systems
I. INTRODUCTION

Online performance monitoring for large-scale, long-running applications is important for performance debugging, performance tuning, and adaptation [4]. However, online measurement of a large-scale application is difficult because of the interactions among the large number of processes and the large number of activities in each process. Many research projects have attempted to address this situation. Trace-based monitoring tools [1], [2], [8], [9], [15], [24], [28], [33], [37], [38], for example, generate large amounts of data for long-running, large-scale applications, but must globally merge this tracing data to understand interprocessor performance relationships. This makes performance information available only post-mortem, so trace-based monitoring is not appropriate for online adaptation. The overhead of trace-based monitoring systems is also generally high, which can perturb running applications and skew monitoring results [35].

Department of Computer Science, MSC01-1130, 1 University of New Mexico, Albuquerque, NM 87131-0001, email: {wenbin,bridges,maccabe}@cs.unm.edu. This work was supported in part by Sandia National Laboratories, a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000, and by subcontract R7A824-79200004 from the Los Alamos Computer Science Institute and Rice University.
Statistical monitoring techniques, in contrast, drastically reduce monitoring overhead and can provide online performance information about an application [7], [26], [35]. However, correlating performance data from different processes in these systems is difficult because the data generally includes no cross-process causality information that could be used for correlation or merging.

To address these problems, we have designed and implemented Embedded Gossip (EG), a system that uses the natural communications inside parallel applications to gather, analyze, and propagate performance information. It does this by piggybacking performance information on existing messages and merging performance data locally at each process as the application runs. The resulting performance information at each process potentially encompasses information from all processes in an application. Using natural communication for performance monitoring not only reduces monitoring overhead, but also implicitly captures the relations among processes, making monitoring more efficient.

There are a number of challenges to using Embedded Gossip for performance monitoring and tuning, however. For example, since EG relies on natural communications inside an application to do performance monitoring, it is important to understand for what classes of applications and monitoring tasks EG is appropriate. In addition, each process in an EG-based monitoring system potentially has a different view of system performance; the resulting inconsistency could cause problems when EG is used for global performance tuning. In this paper, we demonstrate that Embedded Gossiping can be used to conduct a wide range of monitoring tasks in a variety of different parallel applications by implementing and studying two different EG-based monitoring systems and one EG-based global adaptation system.
We also present a general architecture for implementing EG-based monitoring and adaptation systems, a simplified version of the standard two-phase commit protocol for use in EG-based adaptation systems, and a set of metrics that can be used to understand EG's suitability for a given application and monitoring task.

The remainder of this paper is organized as follows. Section II gives an overview of Embedded Gossip. Section III introduces a metrics system to examine the suitability of an EG-based monitoring or adaptation system for an application. Section IV describes three EG-based monitoring systems, and Section V presents evaluation results using these three systems. Section VI gives an overview of related research, and Section VII presents conclusions and future work.
Fig. 1. Online propagation of performance information from Process X to Process Y using Embedded Gossiping. Each message from X to Y carries EG data alongside the application payload.
II. EMBEDDED GOSSIP
A. Overview

Processes in parallel and distributed applications frequently exchange messages, for example to share data or to synchronize computation. These communications frequently create performance dependencies between processes, where one portion of a parallel application depends strongly on the performance of another part of the application. In large-scale applications (e.g., parallel scientific applications with hundreds or thousands of processes), these inter-processor performance dependencies frequently dominate the runtime of the program. As a result, measuring, monitoring, optimizing, and adapting to these performance dependencies is essential to enabling scientific applications that can scale on modern large-scale supercomputers.

Because of the importance of communication in large-scale application performance, we have taken a monitoring approach that uses these communications to understand system performance, through a technique we term Embedded Gossip (EG). EG works by piggybacking performance information about an application's status onto each outgoing application message, extracting this information from incoming messages, and merging this gossiped data together at each process to provide that process a global view of application behavior. That view may include information on, for example, the application's rate of progress, its load relative to other processes, or other information that needs to take other processes' state into account. Note, however, that different processes' global views are not necessarily globally consistent, because they may have been generated using information from different received messages.

Figure 1 illustrates how performance data is transferred between two processes, X and Y, using Embedded Gossip. When a message is about to be sent from process X, a version of its global view is attached to the outgoing message.
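In code, the send-side attach just described, together with the receive-side extract and merge that follows, might look like the sketch below. The class and method names, the dictionary-based view, and the max-based merging policy are illustrative assumptions, not the paper's actual implementation, which lives inside an MPI library.

```python
# Hypothetical sketch of Embedded Gossip piggybacking. The Process class,
# its methods, and the max-based merging policy are illustrative only.

class Process:
    def __init__(self, rank):
        self.rank = rank
        self.local_state = {rank: 0.0}   # exact measurement of this process
        self.global_view = {rank: 0.0}   # approximate view of all processes

    def record(self, value):
        # Local information gathering: update local state and global view.
        self.local_state[self.rank] = value
        self.global_view[self.rank] = value

    def send(self, payload):
        # Piggyback a copy of the current global view on the outgoing message.
        return (payload, dict(self.global_view))

    def receive(self, message):
        # Extract the sender's global view and merge it into our own.
        payload, gossiped = message
        for rank, value in gossiped.items():
            # Example merging policy: keep the largest value seen per rank.
            self.global_view[rank] = max(self.global_view.get(rank, 0.0), value)
        return payload
```

After one such exchange, the receiver's global view already contains the sender's most recent measurement, without any extra messages having been sent.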
Process Y, on receiving the message, extracts process X's global view from the message and updates its own global view, which will in turn be sent out with process Y's later messages. When communications among processes are frequent enough, information about the performance status of concern reaches all processes in a timely fashion. In addition to the global view, each process also keeps a local state. While it contains types of information similar to those contained in the global view, the local state is not updated upon message receipt. In other words, the global view is an approximation of the global state of the entire application, while the local state is an exact measurement of the state
Fig. 2. Architecture of Embedded Gossiping
of a single process within the application. By comparing the local state with the global view, each process can compare its behavior with the global behavior of the application for performance tuning purposes.

Because Embedded Gossip uses message headers to propagate performance information, it is most suitable for monitoring tasks that require only a small amount of information to understand the targeted property. For these kinds of tasks, EG has the following advantages:
• Global information is available to all processes locally. Natural communication in many applications enables timely dissemination of global information. This makes every process aware of global changes in the application in a timely fashion.
• Monitoring is scalable. Since processing for monitoring is carried out locally, changes in application and system scale do not have a direct impact on Embedded Gossip. Monitoring techniques that use global and centralized operations, on the other hand, may not scale well as system size increases [22].
• Monitoring overhead is low. Since no extra communications are introduced into the system and the amount of data added to outgoing messages is small, the overhead of Embedded Gossip is low.

B. An Architecture for Embedded Gossiping

To facilitate the development of different EG-based monitoring systems, we have developed a general architecture for implementing such techniques, shown in Figure 2. The figure also illustrates how the three principal components of EG (local information gathering, global information merging, and global status reporting) interact with each other to perform EG-based monitoring. The three principal components of the architecture share a small set of state information, namely the local state and the global view, but otherwise function independently.
• Local information gathering gathers a process's local status during the process's activities. The gathered information is used to update the local state as well as the global view.
• Global information merging updates the global view when a message arrives and attaches a compressed global view to outgoing messages. This component has a built-in merging policy that determines how information from incoming messages is used.
• Global status reporting answers local queries based on information contained in the local state and the global view at each process. The 3-wait algorithm, a modified version of the standard two-phase commit (2PC) protocol outlined briefly in Section II-C, can be used in this component to coordinate global action based on local information when needed.

Embedded Gossip, as a distributed monitoring system, resides at each process. The key component of its architecture is the global information merging component, which determines how information from other processes changes the global view of an individual process. Only minimal information is kept in the global view, to reduce the disturbance to the message passing system, the overhead of injecting and extracting information, and the cost of merging. The merging component performs simple online analysis to update the global view, which reflects an abstract view of the status of the whole application. Different monitoring and tuning tasks keep different performance information at each process, which requires a different merging policy as well as a different query interface. For certain tasks, one or more components may be unnecessary.

C. Addressing Monitoring Inconsistencies

The decentralized nature of Embedded Gossip provides low-overhead monitoring, but introduces possible inconsistencies among processes because different processes have different global views. This inconsistency can cause problems such as deadlock if not all processes agree to participate in a global synchronous action, for example load rebalancing. In this case, processes that are aware of imbalance will begin load balancing, while processes that are not aware of imbalance will continue computation and not join the global load balancing action, resulting in deadlock. To avoid such problems, we have developed a simplified version of the two-phase commit (2PC) protocol [19], which we term 3-wait, shown in Figure 3 and described more fully elsewhere [39]. The 3-wait algorithm, like the standard 2PC protocol, uses a coordinator process and ensures that either all processes commit to perform an action or none do. Unlike 2PC, however, 3-wait assumes that there are no crashes in the application and that vote requests happen implicitly through the gossiping of performance information. This prevents applications that do not need to perform any commits (for example, load balancing applications running a data set that never goes out of balance) from paying unnecessary vote-request and commit costs.

Fig. 3. Time flow diagram for the 3-wait algorithm. Three timed waits structure the protocol: the coordinator waits for a first message from all processes and later for acknowledgments from all, while each other process waits for a first reply; a timeout at any wait causes an abort, with cancel messages sent to processes that have already replied, and the final answer (commit or cancel) determines whether each process commits or aborts.

III. METRICS FOR EVALUATING EG SUITABILITY

Because it relies on existing application messages to perform monitoring, Embedded Gossip is only appropriate for certain applications and monitoring tasks. To be able to
Fig. 4. Time-interval metrics: wait time, propagation time, and resolution time are defined in terms of when the first and last remote nodes receive data measured at time t = 0. The curve shows the percentage of processes that have seen the information as a function of time.
quantify the suitability of EG for different applications and monitoring tasks, we developed a metrics system to measure how quickly and in what way gossiped information propagates inside an application.

A. Metrics

We have developed five different metrics for evaluating the suitability of a given application to EG-based distributed monitoring and tuning techniques; we define these metrics in two different scenarios: one-to-all monitoring and all-to-all monitoring. These metrics include three time-interval metrics that measure the largest amount of time it could take to perform a portion of EG-oriented communication in an application, summarized in Figure 4, as well as two others. Specifically, the five metrics we have chosen are:
• Wait time, the amount of time between taking a measurement and the first remote node receiving complete monitoring information. Wait time represents the minimum amount of time between a measurement being taken and some remote node being able to react to it. In the one-to-all case, for example, this could be the time between a node noticing that its CPU temperature is too high and a neighbor that it wants to migrate to becoming aware of this fact. In the all-to-all case, this could be the time for some node to receive global load information from the entire system prior to initiating a rebalance.
• Propagation time, the amount of time between the first and the last remote nodes receiving complete monitoring information. Propagation time represents the time interval in which nodes necessarily have inconsistent monitoring information because of the communication patterns of the application. Monitoring inconsistency potentially limits the ability of applications to globally adapt to changing application and system characteristics, and propagation time quantifies this inconsistency. In addition, propagation time is potentially useful in determining timeouts for asynchronous distributed adaptation systems.
• Resolution time, the amount of time between taking a measurement and the last remote node receiving complete monitoring information (i.e., the sum of wait time and propagation time). This time interval quantifies how quickly global state can be propagated for global decision making, and how fast a phenomenon can change in the system and still be accurately sampled by EG-based techniques.
• Effectiveness, the number of complete measurements that can be done by Embedded Gossip during an application's execution. This metric, a relativized version of resolution time, is computed as the application's execution time divided by the resolution time. It represents how many sequential measurements an EG-based monitoring system can perform over the course of an application run.
• Monitoring overhead, the cost of doing Embedded Gossiping in an application. Monitoring overhead limits the benefit of using EG-based monitoring. For EG-based monitoring systems that also require the use of the 3-wait algorithm to manage global consistency, the overhead also includes the overhead of using the algorithm.

B. Metrics Measurement

To measure the metrics described in Section III-A, we developed a simple framework using the architecture described in Section II-B, implemented in the MPICH 1.2 reference version [12] of the MPI Message Passing Interface library [31]. In this framework, each process gossips either one bit, for one-to-all metrics, or a bit vector with one bit per process, for all-to-all metrics measurement. In the case of a gossiped bit vector, merging is performed through a bit-wise OR of the process's current bit vector and the received bit vector. At runtime, metric measurement proceeds as follows: at an a priori determined random point in the application, all processes start a timer after a global synchronization point (e.g., MPI_Barrier). In the one-to-all case, a designated process then sets its bit value to one and continues computation and communication; each process stops its timer when it receives a gossiped bit with the value one. In the all-to-all case, each process sets its own bit in the bit vector at the synchronization time and stops its timer when the full bit vector has been received. For a single run, the shortest time interval among all processes from synchronization to timer stop is the wait time measured from that synchronization point, the longest time interval is the resolution time, and the difference between the wait time and the resolution time is the propagation time for that synchronization point. To determine the overall wait, propagation, and resolution times for an application, we run each application with a wide range of different predetermined synchronization points and take the longest measured wait, propagation, and resolution times for the application.
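The bit-vector merging and the derivation of the three time-interval metrics from per-process timers can be sketched as follows. This is a plain-Python simulation of the measurement logic, not the actual MPICH instrumentation; the function names are ours, and bit vectors are represented as integers.

```python
# Sketch of the all-to-all metric measurement: bit vectors are integers
# merged by bitwise OR; per-process timers are represented by stop times
# measured from the global synchronization point. Names are illustrative.

def merge_vectors(mine, received):
    # Merging policy for metric measurement: bit-wise OR of bit vectors.
    return mine | received

def time_interval_metrics(stop_times):
    # Given the time at which each process first held the full bit vector,
    # derive the three time-interval metrics for one synchronization point.
    wait_time = min(stop_times)         # first process with complete info
    resolution_time = max(stop_times)   # last process with complete info
    propagation_time = resolution_time - wait_time
    return wait_time, propagation_time, resolution_time
```

For example, if three processes stop their timers at 2.0, 5.0, and 3.5 seconds after synchronization, the wait time is 2.0 s, the resolution time is 5.0 s, and the propagation time is 3.0 s.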
C. Metrics Results

Fig. 5. Effectiveness results, one-to-all, for the benchmarks BT, CG, EP, FT, IS, LU, MG, SP, and SMG, and the application CCELL.
Figure 5 shows the effectiveness results for nine benchmarks and one application, described later in Section V-B, in the one-to-all case. The effectiveness results for the all-to-all case are similar. We can see that some benchmarks, BT, CG, LU, SP, SMG, and CCELL, can do more than 500 one-to-all notifications during their execution time. Based on this, the types of applications represented by each of these benchmarks, all common application types, appear to be amenable to EG-based monitoring. In contrast, Embedded Gossip performs poorly for EP and IS because of the embarrassingly parallel nature of these applications.
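The effectiveness computation behind Figure 5 reduces to a single division; the sketch below shows the idea, with illustrative function names and the 500-notification threshold taken from the discussion above.

```python
# Sketch of the effectiveness metric: execution time divided by resolution
# time gives the number of sequential complete measurements EG can perform.

def effectiveness(execution_time, resolution_time):
    return execution_time / resolution_time

def amenable_to_eg(execution_time, resolution_time, threshold=500):
    # Benchmarks that can do more than roughly 500 one-to-all notifications
    # per run appeared amenable to EG-based monitoring in our experiments.
    return effectiveness(execution_time, resolution_time) >= threshold
```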
Fig. 6. Time interval analysis for four benchmarks and one application: the breakdown of resolution time (%) into wait time and propagation time for CG, LU, MG, SMG2000, and ChemCell.
To further understand the results for the different time intervals, we chose four benchmarks and one application that all have high effectiveness metrics, CG, LU, MG, SMG2000, and ChemCell, and plot the resolution time breakdown in terms of propagation time and wait time, as shown in Figure 6. Despite having roughly similar effectiveness, the breakdown between wait time and propagation time in these applications varies dramatically. CG, for example, spends the majority of its resolution time waiting for a measurement to begin to propagate, and then spreads the resulting data quickly throughout the application, resulting in a relatively small amount of time during which different processes' global views are inconsistent. Programs like LU or MG, on the other hand, gradually propagate results throughout the application, resulting in relatively longer propagation times than wait times. Because of this, EG is unlikely to be appropriate for driving global adaptation in programs structured like LU or MG, despite being an effective measurement tool in these applications. Programs like SMG2000 and ChemCell represent a middle ground between these two extremes, with resolution time split relatively equally between wait and propagation time.

IV. EXAMPLE EG-BASED MONITORING SYSTEMS

In this section, we present three monitoring systems that use Embedded Gossip to gather and propagate performance information: a critical path profiling system, an application progress monitoring system, and a system for scheduling load balancing actions in parallel applications. For each monitoring system, we present the performance problem of concern, define its monitoring goal, and describe how EG is used to carry out this monitoring task. For reference, Table I summarizes the design of the three monitoring systems in terms of the EG architecture described in Section II-B.

A. Critical Path Profiling

The critical path of an application is the longest execution path in a parallel application over its whole execution [36]. The critical path is important because an application's critical path determines the execution time for that application. Analyzing the critical path of an application can yield good insights for performance optimizations.
Traditionally, the critical path is obtained using the call graph of the application and the timings of all procedures at each process, by following the call graph and finding the longest call path. For large-scale applications, however, the call graph is large, complex, and distributed, so calculating the critical path is expensive, subject to large monitoring perturbation, and only available post-mortem. As one application of Embedded Gossip, we briefly describe an online critical path profiling system built using the architecture described in Section II-B. A full description of this monitoring system can be found elsewhere [40]. In this system, the runtime library monitors 64 different performance counters in both the local state and the global view. When a message is sent, the 8 largest counters are selected from the global view and sent with the message, with the remaining 56 counters summed into a single gossiped "other" counter. When a message is received, the merging portion of the EG architecture must determine whether the incoming message is on the application's critical path; it does this by determining whether the application is waiting on the message, for example by blocking in MPI_Wait or MPI_Recv. If the message has been waited for, the received performance counters overwrite the current global view because they represent the application's critical path. If the message has not been waited for, the received performance counters are discarded, as the contained data is not on the application's critical path.

B. Progress Counting

External effects, such as network failures, processor anomalies, or algorithmic bugs, can result in application stalls. If undetected, such stalls waste valuable computing resources; unfortunately, these stalls cannot be easily detected by the application itself or by current cluster management systems, such as the PBS queue manager [34]. To address this problem, we implemented an EG-based monitoring system that measures the speed of application progress in parallel applications for a simple definition of application progress [41]. Specifically, this system defines one step of application progress as a period in which every process communicates directly or indirectly with every other process. While this definition is not appropriate for all applications, for example embarrassingly parallel applications that perform little to no communication, it is general enough to encompass a wide range of applications and to provide application scientists information on how fast the application is running. To implement this progress counting monitoring system, each process maintains a sequence number and a bit vector and gossips this data to other processes on all outgoing messages. The sequence number denotes the progress of the application and is incremented by at least one process whenever it has communicated directly or indirectly with every other process in the system. Each bit in the bit vector is assigned to one process in the system, and this bit is set to one on a given process when that process has been communicated with directly or indirectly.
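A sketch of this sequence-number-and-bit-vector scheme, including the merge rule detailed in the next paragraph, follows; the class name is ours, and bit vectors are represented as integers with one bit per process.

```python
# Sketch of the progress counting merge rule. Bit vectors are integers
# with one bit per process; names are illustrative.

class ProgressCounter:
    def __init__(self, rank, nprocs):
        self.full = (1 << nprocs) - 1   # target value: all bits set
        self.own_bit = 1 << rank        # this process's own bit
        self.seq = 0                    # progress sequence counter
        self.vector = self.own_bit

    def merge(self, recv_seq, recv_vector):
        if recv_seq == self.seq:
            self.vector |= recv_vector
            if self.vector == self.full:
                # Every process has communicated, directly or indirectly:
                # one step of application progress.
                self.seq += 1
                self.vector = self.own_bit
        elif recv_seq > self.seq:
            # The sender is ahead; adopt its view plus our own bit.
            self.seq = recv_seq
            self.vector = recv_vector | self.own_bit
        # recv_seq < self.seq: stale information, discarded.
```

In a two-process run, for example, receiving the peer's bit at the current sequence number completes the vector, bumps the sequence counter, and resets the vector to the process's own bit.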
This is similar to the metric measurement system described in Section III-B, but runs continually using sequence numbers instead of once at a predetermined application point. When a process receives a message, the gossiped information is extracted from the message. If the received sequence number is the same as the sequence number in its global view, the process ORs the received bits with the bit vector in its global view. If all of the bits in the resulting bit vector are 1's, it also increases the sequence number by one and resets the bits of every other process in its global view to 0. If the received sequence number is greater than the current local value, on the other hand, the process discards the old bit vector, replaces it with the newly received vector, and updates its own sequence number to the received value. In this way, the sequence number increases when every process has, directly or indirectly, received data from every process in the system, a strong indication of application progress. If the progress counter increases extremely slowly or is not increasing at all, this may indicate problems with the application or the system.

C. Load Balancing Scheduling

Load balancing is used by applications that expect load imbalance to occur during execution. Traditionally, these
TABLE I
OVERVIEW OF THE THREE EG-BASED MONITORING SYSTEMS

critical path profiling
  values gossiped: 8 most significant counters; total count; bit values for counter types
  merging policy: overwrite the 8 most significant counters if the incoming message has been waited for
  localEG: whether the application is waiting; 64 local counters; total count
  globalEG: 64 gossiped counters

progress counting
  values gossiped: progress sequence counter; bit vector, one bit per process
  merging policy: if the incoming sequence number equals mine: my vector |= incoming vector; if my vector is full, increase my sequence and reset my vector. If greater than mine: my vector = incoming vector | initial vector; update my sequence. If less than mine: return.
  localEG: target value for the vector; initial value for the vector
  globalEG: progress sequence counter; bit vector, one bit per process

load balancing scheduling
  values gossiped: maximum workload; minimum workload; reset sequence; imbalance flag
  merging policy: if my imbalance flag is set: return. If the incoming flag is set: set my flag. If the reset sequences are equal: if incoming max > my max, my max = incoming max; if incoming min < my min, my min = incoming min. If the incoming reset sequence is less than mine: return. If greater: set all my workloads to the incoming values.
  localEG: local workload; reset sequence
  globalEG: maximum workload; minimum workload; reset sequence
applications perform a periodic global load check to determine whether or not imbalance exists. During this check, all processes exchange their local workloads, in terms of data points, using, for example, an MPI_Allreduce call, and compute the ratio of the maximum workload to the average workload in the system. If this ratio is above a pre-defined threshold, the application is declared to be out of balance and a load rebalancing step is initiated. If an application is rarely out of balance, however, most or all of the load checks will be pure overhead. Because it is not necessarily possible to know ahead of time when or how often a given data set will go out of balance, the overhead and extra global synchronizations introduced by load balance checks are potentially troublesome. To address this problem, we have built an EG-based system for scheduling load balancing actions in large-scale applications. In this system, monitoring is done by having each process keep track of its own workload as the local state and gossip the system-wide maximum and minimum of these local workloads as the global view; in this case, we define workload as the amount of wall-clock time spent computing (as opposed to waiting for messages) in a fixed time interval. This time interval is chosen to be relatively long compared to program time-steps, to adequately average program workload while still being responsive to changes in program behavior. By comparing the gossiped maximum workload with the local workload using the ratio maximum / local, a process can determine if another process is more highly loaded than the local process. Similarly, by comparing the gossiped minimum workload with the local workload using the ratio local / minimum, a process can determine whether this
process is more highly loaded than other processes. If either of these ratios is high, the process can decide that load imbalance exists and a load balancing action is needed. Because of the decentralized nature of Embedded Gossip, however, the workload estimates gathered in this system may have inconsistencies between processes. For example, one process may have noticed imbalance and decided that a load balancing action is needed, while others may have seen no imbalance at all. If not dealt with properly, localized load checks may return different results at different processes, which can cause a deadlock in which some processes wait indefinitely for a global synchronous load balancing action while others continue with their computation. We address this problem using the 3-wait algorithm described in Section II-C. A comparison of the pseudo-code for load balancing using a conventional load check and using EG with 3-wait is shown in Figure 7. The left side shows the conventional load check using MPI_Allreduce(): if the load check shows imbalance, a load balancing action is taken. The right side shows load balancing using EG with the 3-wait algorithm, where the load check is localized when possible: if the local load check shows imbalance, the 3-wait algorithm is executed, and if it returns true, a load balancing action is taken. Otherwise, processes continue without load balancing. Note that if no process shows local imbalance, no global load check is performed.

V. EVALUATION

To evaluate Embedded Gossip as a viable monitoring approach, we implemented and evaluated the three EG-based monitoring systems described in Section IV.
Conventional load checking:

    main loop:
        compute
        communicate
        if time to check load:
            rc = global load check
            if rc == imbalance:
                call load balancing
    end loop

Localized load checking with 3-wait:

    main loop:
        compute
        communicate
        if time to check load:
            rc = local load check
            if rc == imbalance:
                rc2 = 3-wait
                if rc2 == 1:
                    call load balancing
    end loop

Fig. 7. Comparison of load check using conventional load balancing scheduling (left) and EG-based load balancing scheduling with 3-wait (right).
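The localized check in the right-hand listing can be sketched in Python as follows. This is an illustrative sketch, not the paper's implementation: the function names, the dictionary-based gossip view, and the threshold value are assumptions, and the 3-wait algorithm itself (Section II-C) is stubbed out as a callback.

```python
def merge_workload_views(local_view, received_view):
    # EG merge on message receive: keep the system-wide workload
    # extremes observed so far in either view.
    return {
        "max_load": max(local_view["max_load"], received_view["max_load"]),
        "min_load": min(local_view["min_load"], received_view["min_load"]),
    }

def local_load_check(my_load, view, threshold):
    # Imbalance if another process appears much more loaded than this
    # one (gossiped max / local) or this one appears much more loaded
    # than the least-loaded process (local / gossiped min).
    return (view["max_load"] / my_load > threshold or
            my_load / view["min_load"] > threshold)

def maybe_rebalance(my_load, view, threshold, three_wait, rebalance):
    # Control flow of the right-hand listing in Fig. 7: the global
    # commit (3-wait) runs only when the local check sees imbalance,
    # so balanced runs perform no global load check at all.
    if local_load_check(my_load, view, threshold):
        if three_wait():
            rebalance()
```

Because `merge_workload_views` is both commutative and associative, every process converges toward the same global extremes as messages flow, which is what makes the purely local check meaningful.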
A. Experimental Setup

We have implemented Embedded Gossip in two MPI (Message Passing Interface) [31] libraries, LA-MPI [20] and MPICH [12]. The necessary changes included adding the EG framework described in Section II-B to the library, changing the message header to accommodate the performance information for EG, and adding extraction and merging interface calls to the sending and receiving routines. Full test runs were carried out on the LosLobos cluster at the UNM Center of High Performance Computing. LosLobos has 256 compute nodes, each with dual 733 MHz processors and 1 GB of memory. The cluster runs the Linux 2.4.18 kernel and includes both Myrinet and 100 Mb Ethernet adapters.

B. Benchmarks and Applications

To test Embedded Gossip, we primarily used the NAS parallel benchmarks [5] and the SMG2000 benchmark from the ASCI Purple Benchmarks [21]. For studying EG-based load balancing scheduling, we used the ChemCell simulation code and a locally developed benchmark, imb-please, which allows the user to change parameters in a simulated parallel load balancing application.

1) NAS Parallel Benchmarks and ASCI Purple SMG2000: Each benchmark in the NAS Parallel Benchmarks is considered representative of a class of parallel applications, and each has input sizes ranging from Class A (smallest) to Class D (largest). We used the NAS Class C data sets for our tests. A full description of the characteristics of each benchmark and the associated data sets is given by Bailey [5]. The SMG2000 benchmark is a semi-coarsening multi-grid benchmark. It uses a parallel semi-coarsening multi-grid solver for the linear systems arising from finite difference, finite volume, or finite element discretizations of a diffusion equation on logically rectangular grids. Unlike the NAS parallel benchmarks, SMG2000 is a weak-scaling benchmark: the amount of work for each process is approximately fixed regardless of how many processors it runs on.
We used the 3-D problem version for our tests. The arguments for our SMG2000 benchmark runs were “-n 90 90 90 -c 2.0 3.0 40”.

2) The ChemCell Simulation Code: ChemCell is a particle-based reaction/diffusion simulator, developed at Sandia National Laboratories [25], designed for modeling protein networks in biological cells. ChemCell uses the Zoltan load balancing framework [10] to check for load imbalance and perform rebalances. We modified Zoltan to optionally use
our EG-based load balancing scheduling system in addition to the MPI-based global reduction system that it normally uses. For data sets, we used ChemCell to simulate photosynthesis inside a cell, based on data sets we generated using the “Pizza.py” package. Two data files are used in these tests: one initially evenly distributed data set, and one initially out-of-balance data set containing a small bubble of high-density carbon dioxide molecules and photons as the source of imbalance. This small bubble resides inside the simulation space without boundaries, and its contents spread out once the simulation starts.

3) The imb-please Benchmark: Because we had access to only one full load balancing application, ChemCell, we developed the imb-please benchmark to let us experiment with a wide range of synthetic load balancing workloads. The benchmark allows the user to change several parameters important to load balancing and EG, for example, the frequency of overloading, the amount of overloading, the global communication frequency, and the ratio of computation to communication. Whether there is a global communication in a timestep is determined by the global communication scale, abbreviated comm scale. A comm scale of 1, for example, indicates one global communication at every timestep, a comm scale of 2 means one global communication every 2 timesteps, and so forth. Finally, a comm scale of 0 indicates no global communication at all. At re-balancing time, data migration is used to move data around to even out the workload among processes. To mimic the overhead of migrating objects among processes during re-balancing, an artificial delay is added to the re-balancing routine. The migration cost is considered proportional to the number of processes, as more boundary adjustments are needed for larger numbers of processes. The overhead of migration is chosen as 1 second for 64 nodes, based on data from a previously published paper [17].
C. Critical Path Profiling Results

EG-based critical path profiling identifies the causes of an application's slowdown online through each process's critical path. During an execution, each process obtains a critical path for itself, which represents the longest computational path and encompasses the reasons for its slowdown. To study the viability of EG-based critical path profiling, we examine both its ability to globally detect local overhead and its monitoring overhead. We used the NAS CG class C benchmark for this experiment. In this experiment, one process, process 1, injects overhead during the execution, and we monitor the critical path on both process 1 and remote processes to determine the ability of EG-based critical path profiling to globally measure the performance impact of locally-injected overhead. Figure 8 shows the critical path for processes 1 and 6. For each process, the left vertical bar is the global view, and the right vertical bar is the local state. This result shows that even though overhead is injected only at process 1, the global views of both processes 1 and 6 show similar amounts of overhead, shown in the striped portion. However, only process
1's local state shows the overhead. By comparing the global view and the local state, a process can infer the reasons for slowdown. Process 1 sees a larger amount of overhead in its local state than in its global view, indicating that it is itself the generator of the overhead. Similarly, process 6 can conclude that another process is generating overhead and has slowed down the application.

Fig. 8. EG-based critical path profiling of CG on 16 nodes (time breakdown, in percent, of the global view and local state at processes 1 and 6).

We further conducted an experiment with two processes, process 1 and process 6, injecting overhead. Results are presented in Figure 9. In this figure, the amount of overhead is plotted for all 16 processes, with the left, light-gray vertical bar at each process showing the overhead in the global view, and the right, dark-gray vertical bar showing the overhead in the local state. Only processes 1 and 6 show a significant amount of overhead in their local state, while every process shows a noticeable amount of overhead in its global view. Again, processes 1 and 6 can conclude that they are the causes of the overhead problems, while the other processes are aware of the overhead problem and can conclude that they are not its cause.

Fig. 9. EG-based critical path profiling of CG on 16 nodes, with artificial overhead injected at processes 1 and 6.

Overhead results for three benchmarks, LU, CG, and SMG2000, are presented in Figure 10. Most of the overhead comes from the comparison of the 64 performance counters carried out on each send. The average overhead of our EG-based critical path profiling system is less than 2%. The standard deviations of the execution times of the three benchmarks, LU, CG, and SMG2000, without critical path profiling at 64 nodes are 3%, 8%, and 4%, respectively, making the monitoring overhead of our approach statistically insignificant; in particular, this relatively large variance in program runtime explains how the average runtime of CG with embedded gossiping can be less than the average runtime of CG without embedded gossiping.

Fig. 10. Overhead of EG-based critical path profiling (measurement overhead, in percent, for CG, LU, and SMG2000 on 2 to 64 nodes).

D. Progress Counting Results

EG-based progress counting seeks to enable each process to be aware of the progress of the whole application locally. We validated this monitoring system by checking how a change in an application's progress speed is reflected in the rate of sequence number changes. Results for three benchmarks, LU, CG, and SMG2000, are presented in Figure 11. Experiments were conducted on 32 nodes.

Fig. 11. EG-based progress monitoring of LU, CG, and SMG2000 (progress counter value over time on 32 nodes).

From this figure, we can see a tight correlation between the metrics results on effectiveness and the progress counter update speed. The metrics results on effectiveness of SMG2000 show that Embedded Gossip can perform over 800 all-to-all propagations during its execution time of 130 seconds, while LU can only do about 1000 during its execution time of 800 seconds. The effectiveness result for CG is similar to SMG2000; it can do about 2000 all-to-all propagations during its execution time of 300 seconds. The progress counting results match the metrics measurements: SMG2000 has the highest progress counter update speed, while LU has the lowest. This experiment verifies that EG-based progress counting can effectively reflect the different progress speeds of different applications.

To examine how well EG-based progress monitoring reflects changes in an application's progress speed, we artificially slowed down one process, process 3, in the CG class C benchmark. Figure 12 shows how artificial slowdowns of 1, 2, 5, and 10 seconds at process 3 change the progress speed of process 0. The figure shows that a local problem on process 3 becomes known to other processes, and the entire system can observe the slowdown in progress counter update speed caused by another process.

Fig. 12. Report of progress counter update speed slowdown at process 0 due to slowdown from process 3, on 32 nodes using CG.

The overhead measurements for progress monitoring were conducted on 4 to 64 nodes, using MPICH with and without our EG-based progress monitoring system. The results show that the overhead of this system is negligible: less than 1%.

E. Load Balancing Scheduling Results

Using Embedded Gossip to schedule load balancing actions is more challenging than the previous two monitoring systems because it requires accurate global monitoring results to avoid false positives, and monitoring reports must also be consistent among all processes to avoid application deadlock.

1) ChemCell Results: Using the ChemCell simulation code, we compared EG-based load balancing scheduling and conventional load balancing using different detection thresholds and checking intervals. The re-balancing action uses the Zoltan toolkit [11]. Figure 13 shows the comparison results using four different detection thresholds (1.1, 1.3, 1.5, and 2.0) with a checking interval of 1. From the figure, we can see that conventional load balancing is more sensitive to the value of the detection threshold than EG-based load balancing scheduling and is not as effective as our EG-based approach. Generally, EG-based load balancing scheduling outperforms conventional load balancing in this experiment, but closer examination of the fine-grained behavior of the application shows that this is in fact due to unusual application and data set issues. The speedup for the EG-based system is modest because our test data set specifies a diffusion problem that naturally balances itself. In particular, the imbalanced test data set specifies an initially very localized out-of-balance problem that the underlying load balancing algorithm is unable to resolve early in the application run. The conventional load balancing system, as a result, detects the imbalance immediately and repeatedly makes failed attempts to rebalance the system, resulting in unnecessary overhead. The EG-based approach introduces a delay in imbalance detection (i.e., the resolution time for EG in ChemCell), which causes it to issue fewer failed load balancing attempts.

Fig. 13. Speedup comparison for ChemCell on 64 nodes with a checking interval of 1.

2) Imb-please Results: To further investigate load balancing scheduling given the limitations of ChemCell, we conducted additional experiments using the imb-please benchmark over a wide range of synthetic load balancing parameters. In particular, we compare speedups between our EG-based load balancing scheduling system and a conventional load balancing system, as well as their overheads.

3) Speedup Comparison: Figure 14 shows the speedup results for a single application overloading with an overload factor of 3 on 64 nodes at different global communication scales using imb-please. EG-based load balancing scheduling can effectively detect and initiate load balancing actions, resulting in substantial speedups, but our EG-based system is not as effective as conventional load balancing at this scale. When global communication is frequent, for example when it happens at every timestep (comm scale 1), the application's speedup using load balancing scheduling is comparable to conventional load balancing. However, the speedup with load balancing scheduling degrades as the frequency of global communication decreases. As described later in this section, this degradation is due to increased inconsistency in measured load imbalance between processes, resulting in aborts of the 3-wait algorithm and missed load balancing opportunities.

Fig. 14. Comparison of different global communication scales, using imb-please, with single overloading, overload factor 3.

4) Overhead Comparison: The overhead of doing load balancing when the application is never out of balance is one of the concerns in using load balancing. In this experiment, we measured the overhead that load balancing adds to an application's execution using both EG-based load balancing scheduling and conventional load balancing. Experiments were conducted on 16, 25, 36, 49, and 64 nodes.

Fig. 15. Overhead comparison of load balancing using Load Balancing Scheduling versus conventional load balancing.
From Figure 15, we can see that when there is no imbalance in the application, our EG-based approach has negligible overhead, while conventional load balancing has measurable but small overhead that generally increases with the scale of the application. Varying the global communication scale did not significantly influence the overhead of load balancing for either the EG-based approach or conventional load balancing up to 64 nodes.

5) Analysis of EG-based Load Balancing Scheduling: Compared with conventional load balancing, the reduced effectiveness of our EG-based approach comes from three sources:
1) Inaccuracy in load imbalance measurement in our EG-based approach delays the detection of imbalance, resulting in delayed load balancing actions. Conventional load balancing, in contrast, always detects load imbalance immediately.
2) Inconsistency among processes when the global communication frequency is low results in 3-wait algorithm timeouts and missed load rebalancing opportunities, both of which increase program execution time.
3) Delays from load imbalance increase the asynchrony among processes, again resulting in additional 3-wait algorithm timeouts and missed load rebalancing opportunities.
To further quantify the cost of timeout intervals and missed load balancing attempts in failed 3-wait executions, we explicitly set one process out of 64 to never participate in the 3-wait algorithm and every other process to invoke the algorithm at every timestep. The cost of the 3-wait timeouts themselves was small, only 9.8 seconds for the whole application run. The resulting missed load rebalancing opportunities, on the other hand, caused as much as an 8800-second slowdown (multiple overloadings with an overloading factor of 3) compared with conventional load balancing techniques.

VI. RELATED WORK

Performance monitoring, analysis, and tuning are important research areas, and much research has been conducted on various aspects of this field. Trace-based performance monitoring [1], [2], [8], [9], [15], [24], [28], [33], [37], [38] records information about events in the system or application so that visualization tools can display the traced data or later analysis can be applied to it for debugging or other purposes. Trace data from each process must be merged into a central file, which is then analyzed to infer performance information. At large scale, merging data from all processes and analyzing the resulting voluminous file is too expensive for online monitoring. For example, Paradyn [24] focuses on measuring long-running programs on large parallel machines with large datasets. While TAU [28] can provide detailed information about a C++ application, it requires source code instrumentation and off-line analysis of its traced performance data.
Pinpoint [9] determines which components in the system are most likely to be at fault by using data-mining techniques to correlate the believed failures and successes of requests with the components those requests used. It uses off-line analysis of the traced data. By contrast, EG-based monitoring systems need no knowledge about, or modifications to, the monitored program, and conduct performance analysis online. Statistical performance monitoring systems [3], [7], [26], [30], [35] either gather statistics about a program's execution through statistical sampling or perform statistical analysis over performance data. Performance monitoring systems using a statistical approach have lower overhead, and the performance information can be obtained online. For example, PAPI [7] supports statistical profiling; at program completion, the histogram of execution events can be analyzed to get a line-by-line account of where counter overflow occurred in the program. PHOTON [35] uses message sampling to enable dynamic statistical profiling of an application's communication, and obtains statistics about a program's execution with low overhead. In contrast, EG-based monitoring enables
online performance analysis and online performance problem detection. The online monitoring system Ganglia [23] is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It can monitor an application's progress online with low overhead. However, Ganglia does this through special multicast messages and centralized aggregation, whereas in EG-based monitoring systems there are no extra messages for monitoring purposes and performance information is analyzed at every process. Dynamic performance adaptation and tuning techniques [6], [11], [16], [18], [27], [32] aim to change application and system settings to improve an application's performance and/or system utilization. For example, Autopilot [27] is an integrated toolkit for performance monitoring, online performance analysis, and dynamic performance tuning. It uses distributed sensors to extract qualitative and quantitative performance data from executing applications. Autopilot works well in parallel file systems, where a clear policy can be determined for different file access patterns. Embedded Gossip performs more fine-grained analysis of application performance and tuning than Autopilot by using the simple performance view available at each process. Hollingsworth's critical path analysis measurement system [14] uses online computation to measure the complete critical path for the whole application in a manner somewhat similar to the system described in Section IV-A. Critical path information is attached to each outgoing message, and each incoming message is checked, with a longer received critical path overwriting a shorter one. By doing this at every process, a process is very likely to have the critical path for the application.
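The piggyback-and-overwrite rule just described, which this style of system shares with EG-based critical path profiling, can be sketched as follows. This is a minimal illustration, not either system's implementation: the class, function names, and the scalar length comparison are assumptions for exposition.

```python
class CriticalPathProfile:
    """One process's current estimate of the application critical path:
    a total length plus a breakdown attributing time to activities."""
    def __init__(self, length=0.0, breakdown=None):
        self.length = length
        self.breakdown = breakdown or {}  # e.g. {"compute": ..., "overhead": ...}

def attach_to_message(payload, profile):
    # Piggyback the sender's current critical-path estimate on an
    # existing application message; no extra messages are sent.
    return {"payload": payload, "cp": profile}

def merge_on_receive(local, received):
    # A longer received critical path overwrites a shorter local one,
    # so every process converges toward the application critical path.
    return received if received.length > local.length else local
```

Because the merge simply keeps the longer path, repeated exchanges along existing message traffic are enough to spread the application-wide critical path to every process.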
The most important difference between Hollingsworth's system and the EG-based critical path profiling described in this paper is that Hollingsworth's requires that the procedures selected for monitoring be the most computationally intensive ones; otherwise, the computed critical path could be wrong. In EG-based critical path profiling, however, we can choose any procedure and event to be monitored, and use the causality merging in EG to capture the events that are causing application slowdown.

VII. CONCLUSIONS AND FUTURE WORK

A. Conclusions

In this paper, we presented the design, implementation, and evaluation of a new online monitoring technique, Embedded Gossip (EG). EG uses the natural communication inside an application to do performance monitoring. It has the advantage of low overhead (< 2%) and uses implicit filtering through an application's interactions among processes to achieve effective performance monitoring. Global performance status is available at each process, and locally detected performance information propagates to all processes. To demonstrate the effectiveness of EG, we implemented three EG-based monitoring systems: one for critical path profiling, one for application progress monitoring, and one for scheduling load balancing actions. Our critical path profiling experiments show that EG can be used to obtain the critical path for each process cheaply, and that reasons for application slowdown can be found in each process's critical path. Similarly, our progress monitoring experiments show that EG can be used to make each process aware of the progress speed of the whole application. Finally, our load balancing scheduling results show that EG can be used to drive global adaptation in parallel applications, though not necessarily as effectively as conventional techniques at the scales at which we were able to test. Overall, these results show that Embedded Gossip is a viable technique for online monitoring of large-scale, long-running applications, where traditional monitoring techniques are expensive.

B. Future Work

This is a rich research field, and much further research remains on EG. Given our mixed results in scheduling load balancing, exploring new and larger applications and data sets, as well as new monitoring mechanisms that would improve responsiveness to load imbalance, is a promising area of future work. Our experience shows that real parallel applications have many more global communications than the imb-please benchmark; more global communication may change the effectiveness of our system, particularly on larger-scale systems. Another area of future work is to use EG to drive distributed adaptations, as opposed to the centralized load-balancing adaptation we have explored to this point. In systems that can adapt without the need for a global commit, EG may offer substantial benefits over existing monitoring systems. One such potential adaptation is power scaling [16], which scales down processor speed on less-loaded processors. Finally, a descriptive language for expressing and reasoning about EG-based monitoring techniques, similar to IDL [13], [29], would be desirable. Such a system would allow a user to specify what to monitor, how to merge, and how to report monitoring results, and have an appropriate message-passing library configured automatically, without source-code-level modifications by application programmers. It could also potentially enable in-depth analysis of metrics results to determine an application's suitability to different proposed EG-based monitoring techniques.
REFERENCES

[1] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP-19), Bolton Landing, NY, 2003.
[2] J. M. Anderson, L. M. Berc, J. Dean, S. Ghemawat, M. R. Henzinger, S. A. Leung, R. L. Sites, M. T. Vandevoorde, C. A. Waldspurger, and W. E. Weihl. Continuous profiling: where have all the cycles gone? ACM Transactions on Computer Systems, 15(4):357–390, 1997.
[3] T. E. Anderson and E. D. Lazowska. Quartz: A tool for tuning parallel program performance. In Proceedings of the 1990 SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 115–125, 1990.
[4] D. A. Bader. Petascale Computing: Algorithms and Applications. Chapman and Hall/CRC Computational Science Series, 2007.
[5] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. The NAS parallel benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, Moffett Field, CA, 1994.
[6] E. Brewer. High-level optimization via automated statistical modeling. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 80–91, Santa Barbara, CA, 1995.
[7] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. A portable programming interface for performance evaluation on modern processors. The International Journal of High Performance Computing Applications, 14(3):189–204, Fall 2000.
[8] J. Caubet, J. Gimenez, et al. A dynamic tracing mechanism for performance analysis of OpenMP applications. In Proceedings of the Workshop on OpenMP Applications and Tools (WOMPAT), 2001.
[9] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint: Problem determination in large, dynamic Internet services. In Proceedings of the 2002 International Conference on Dependable Systems and Networks, pages 595–604, June 2002.
[10] K. Devine, E. Boman, R. Heaphy, B. Hendrickson, and C. Vaughan. Zoltan data management services for parallel dynamic applications. Computing in Science and Engineering, 4(2), 2002.
[11] K. Devine, B. Hendrickson, E. Boman, M. S. John, and C. Vaughan. Design of dynamic load-balancing tools for parallel applications. In Proceedings of the International Conference on Supercomputing, Santa Fe, NM, May 2000.
[12] W. D. Gropp and E. Lusk. User's Guide for mpich, a Portable Implementation of MPI. Mathematics and Computer Science Division, Argonne National Laboratory, 1996. ANL-96/6.
[13] M. Gudgin. Essential IDL: Interface Design for COM. Addison-Wesley, 2001.
[14] J. Hollingsworth. Critical path profiling of message passing and shared-memory programs. IEEE Transactions on Parallel and Distributed Systems, pages 1029–1040, 1998.
[15] J. Joyce, G. Lomow, K. Slind, and B. Unger. Monitoring distributed systems. ACM Transactions on Computer Systems (TOCS), 5(2):121–150, May 1987.
[16] N. Kappiah, V. W. Freeh, and D. K. Lowenthal.
Just in time dynamic voltage scaling: Exploiting inter-node slack to save energy in MPI programs. In Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 35, Seattle, WA, 2005.
[17] D. J. Kerbyson, H. J. Alme, A. Hoisie, F. Petrini, H. J. Wasserman, and M. Gittings. Predictive performance and scalability modeling of a large-scale application. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, pages 37–48, Denver, CO, 2001. ACM Press.
[18] O. Kremien, J. Kramer, and J. Magee. Scalable, adaptive load sharing for distributed systems. IEEE Parallel and Distributed Technology: Systems and Technology, 1(3):62–70, 1993.
[19] B. W. Lampson and H. Sturgis. Crash recovery in a distributed data storage system. Technical report, Computer Science Laboratory, Xerox Palo Alto Research Center, Palo Alto, CA, 1976.
[20] Los Alamos National Laboratories. The Los Alamos Message Passing Interface.
[21] Los Alamos National Laboratories, Sandia National Laboratories, and Lawrence Livermore National Laboratories. The ASCI Purple Benchmarks, 2001. http://www.llnl.gov/asci/purple/benchmarks/.
[22] X. Martorell, N. Smeds, R. Walkup, J. R. Brunheroto, G. Almasi, J. A. Gunnels, L. DeRose, J. Labarta, F. Escale, H. S. J. Gimenez, and J. E. Moreira. Blue Gene/L performance tools. IBM Journal of Research and Development, 49(2/3):407–424, 2005.
[23] M. L. Massie, B. N. Chun, and D. E. Culler. The Ganglia distributed monitoring system: Design, implementation, and experience. Parallel Computing, May 2004.
[24] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. L. Karavanic, K. Kunchithapadam, and T. Newhall. The Paradyn parallel performance measurement tool. IEEE Computer, 28(11):37–46, 1995.
[25] S. J. Plimpton and A. Slepoy. ChemCell: A particle-based model of protein chemistry and diffusion in microbial cells. Technical Report 2003-4509, Sandia National Laboratories, Albuquerque, NM, 2003.
[26] D. Reed, P. Roth, et al.
Scalable performance analysis: the Pablo performance analysis environment. In Proceedings of the Scalable Parallel Libraries Conference, pages 104–113, 1994.
[27] R. L. Ribler, J. S. Vetter, H. Simitci, and D. A. Reed. Autopilot: Adaptive control of distributed applications. In Proceedings of the 7th IEEE Symposium on High-Performance Distributed Computing, Chicago, IL, 1998.
[28] S. Shende, A. Malony, J. Cuny, K. Lindlan, P. Beckman, and S. Karmesin. Portable profiling and tracing for parallel scientific applications using C++. In Proceedings of SPDT'98: ACM SIGMETRICS Symposium on Parallel and Distributed Tools, 1998.
[29] R. Snodgrass and K. P. Shannon. The Interface Description Language: Definition and Use. Computer Science Press, 1989.
[30] M. J. Sottile and R. G. Minnich. Supermon: A high-speed cluster monitoring system. In IEEE Conference on Cluster Computing, September 2002.
[31] The MPI Forum. MPI: A message-passing interface standard. International Journal of Supercomputer Applications, 8(3/4):165–416, 1994.
[32] M. M. Theimer and K. A. Lantz. Finding idle machines in a workstation-based distributed system. IEEE Transactions on Software Engineering, 15(11):1444–1458, 1989.
[33] B. Tierney, W. E. Johnston, B. Crowley, G. Hoo, C. Brooks, and D. Gunter. The NetLogger methodology for high performance distributed systems performance analysis. In HPDC, pages 260–267, 1998.
[34] Veridian Systems Inc. Portable batch system administrator guide. http://mordred.bioc.cam.ac.uk/ rapper/downloads/pbs-user-guide.pdf.
[35] J. S. Vetter. Dynamic statistical profiling of communication activity in distributed applications. In Proceedings of SIGMETRICS 2002, 2002.
[36] C.-Q. Yang and B. Miller. Critical path analysis for the execution of parallel and distributed programs. In Proceedings of the 8th International Conference on Distributed Computing Systems, San Jose, CA, 1988.
[37] M. Zagha, B. Larson, S. Turner, and M. Itzkowitz. Performance analysis using the MIPS R10000 performance counters. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, Pittsburgh, PA, 1996.
[38] X. Zhang, Z. Wang, N. Gloy, J. B. Chen, and M. D. Smith. System support for automated profiling and optimization. In Proceedings of the Sixteenth Symposium on Operating Systems Principles, 1997.
[39] W. Zhu. Lightweight Online Performance Monitoring and Tuning with Embedded Gossip. PhD thesis, The University of New Mexico, Computer Science Department, Albuquerque, NM, 2007.
[40] W. Zhu, P. G. Bridges, and A. B. Maccabe. Online critical path profiling for parallel applications. In Proceedings of the 2005 IEEE International Conference on Cluster Computing (Cluster 2005), Boston, MA, September 2005.
[41] W. Zhu, P. G. Bridges, and A. B. Maccabe. Embedded gossiping: Lightweight online measurement for large-scale applications. In Proceedings of the 2007 IEEE International Conference on Distributed Computing Systems (ICDCS), June 2007.