Journal of Grid Computing 1: 273–289, 2003. © 2004 Kluwer Academic Publishers. Printed in the Netherlands.


System-Level Resource Monitoring in High-Performance Computing Environments

Sandip Agarwala, Christian Poellabauer, Jiantao Kong, Karsten Schwan and Matthew Wolf
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
E-mail: {sandip, chris, jiantao, schwan, mwolf}@cc.gatech.edu

Key words: cluster computing, customizability, distributed systems, dynamic adaptation, high performance computing, kernel-level resource management, monitoring

Abstract

Low-overhead resource monitoring is key to the successful management of distributed high-performance computing environments, particularly when applications have well-defined quality of service (QoS) requirements. The dproc system-level monitoring mechanisms provide tools both for efficiently monitoring system-level events and for notifying remote hosts of events relevant to their operation. Implemented as an extension to the Linux kernel, dproc provides several key functions. First, utilizing the familiar /proc virtual filesystem, dproc extends this interface with resource information collected from both local and remote hosts. Second, to predictably capture and distribute monitoring information, dproc uses a kernel-level group communication facility, termed KECho, which implements events and event channels. Third, and the focus of this paper, is dproc's run-time customizability for resource monitoring, which includes the generation and deployment of monitoring functionality within remote operating system kernels. Using dproc, we show that (a) data streams can be customized according to a client's resource availability (dynamic stream management), (b) by dynamically varying distributed monitoring (dynamic filtering of monitoring information), an appropriate balance can be maintained between monitoring overheads and application quality, and (c) by performing monitoring at kernel-level, the information captured enables decision making that takes into account the multiple resources used by applications.

1. Introduction

1.1. Motivation

The need for flexible and extensible monitoring tools for high-performance computing environments, particularly for the management and development of dynamic resource-responsive applications, has been documented many times [22, 27, 20, 13, 2]. For such applications, accurate and up-to-date monitoring of the underlying system's relevant observable quantities and events enables runtime management activities that include the continuous optimization of application behavior, the dynamic distribution or balancing of application tasks across hosts, and/or the allocation of system resources like disks to applications. Observable quantities include available processor, network, or disk bandwidths, and observable events include system

failures or the violation of resource utilization thresholds. As distributed memory and cluster computers scale to hundreds or thousands of nodes, the perturbation implied by the frequent exchange of monitoring data can be substantial. Run-time monitoring of system resources is important, however, because it enables application designers to deal with dynamic behaviors in these complex platforms. Moreover, dynamics are becoming increasingly prevalent, as traditional high performance computing (HPC) applications using dedicated resources are being brought into dynamic, shared environments like computational grids [9, 10, 23]. Run-time monitoring enables application-level management activities in these environments that include the dynamic creation of subtasks to make use of newly available resources, the reallocation of computational threads from one parallel task

component to another to achieve better load balance, application-driven check-pointing and migration of tasks, and dynamic optimization of network communications to improve communication/computation overlap. In addition, run-time monitoring plays an important role in guiding system-level resource scheduling and allocation within computational grids. Previous research has addressed the timeliness of monitoring and the perturbation imposed by it. Recent systems like Supermon [30] and MAGNeT [8] have exploited the enhanced performance and low-latency data paths associated with kernel-level monitoring. The d(istributed)proc system described in this paper is similar in its use of kernel-level data collection, but differs in its support of full peer-to-peer communications at kernel-level and across all of the systems that participate in dproc-based monitoring [1]. The intent is to improve performance (1) by avoiding monitoring-induced user/kernel domain crossings, and (2) by avoiding centralized or ‘master’ collection points, thereby also improving scalability and fault tolerance. Another advantage of kernel- vs. user-level communication described in [17] is improved predictability, since the variation in round-trip times is much larger for user- vs. kernel-level communications. The main characteristic of dproc explored in this paper is its support for application-specific, customizable remote monitoring. Customization includes well-understood functionality like the parameterization of monitoring frequencies or thresholds, but dproc also implements a more flexible approach in its support for dynamically deployed monitoring filters. These filters are functions, specified by an application at runtime, that can be distributed via dproc to other hosts. They are compiled dynamically, allowing for the on-line deployment of new monitoring functionality into local or remote OS kernels. Since compilation takes place at the host that will execute and utilize this function, heterogeneity of hardware and software is supported while maintaining the performance advantages of using binary codes. Dproc filters can customize monitoring behavior as follows:
– by capturing complex relationships between monitoring results (e.g., ‘monitor the available memory only if disk access times exceed some critical threshold’);
– by integrating application-level with system-level information (such as the number of open connections in server applications correlated with the round-trip times on the corresponding sockets); and

– by dynamically deploying specialized monitoring functionality into remote kernels (e.g., to monitor current energy usage on a server system or track remaining battery power in a mobile device).

The result is an application-aware monitoring system that captures and propagates only the monitoring data of current interest across the underlying distributed system. This fact, coupled with dproc's use of peer-to-peer communications and of kernel-level data capture and analysis, results in an efficient, scalable monitoring platform. Experimental evaluations of our current dproc prototype demonstrate the benefits attained from its use with an application that implements real-time remote visualizations of scientific data, termed SmartPointer [32]. By monitoring system-level resources like CPU, network, and disk at visualization clients, data servers gain the ability to customize the data streams sent to clients, thereby decreasing buffering and transfer delays and increasing stream transfer rates. An additional benefit of kernel-based monitoring demonstrated in this paper is the kernel's ability to simultaneously monitor multiple system resources; this can prevent run-time adaptations that have conflicting resource demands, thereby improving system-wide metrics like total throughput. Finally, microbenchmarks running on a small Linux cluster show that dproc-based monitoring has small overheads and low response times.

1.2. Related Work

Monitoring tools tend to operate at different levels of granularity, with consequent tradeoffs between the quality of the information monitored and the associated overheads. Cluster performance monitoring tools have been developed to allow system administrators to monitor cluster state. A typical tool consists of two major entities: a server that collects state information of a cluster and a GUI-based front-end, which provides a visualization of system activity. Parmon [3], Ganglia [12], Smile [31], and many others belong in this category. These tools cannot deliver very frequent monitoring updates. Paradyn [22] provides extensive support for dynamically instrumenting distributed programs. The Pablo [27] toolkit provides efficient support for collecting distributed monitoring information, coupled with rich tools for analyzing and displaying captured data. In comparison, Falcon [14] focused on application-specific, dynamic monitoring for on-line program performance tuning. Dproc shares with Falcon the ability to support dynamic program tuning, but

differs in that it is implemented at kernel-level. Both Paradyn and Pablo present interesting opportunities for hybridized monitoring infrastructures if coupled with dproc; it would be interesting, for example, to add Paradyn's method of dynamic program instrumentation to dproc. Some of Pablo's rich analysis methods could be implemented using dproc's dynamic code generation support. The Supermon [30] cluster monitoring system uses a modified version of the SunRPC remote-status rstat protocol [15] to collect data from remote cluster nodes. This modified protocol is based on symbolic expressions, which allows it to operate in a heterogeneous environment. The Supermon kernel patch exports various kernel monitoring information via a sysctl call. Supermon does not address scalability issues, however, because it uses a centralized concentrator that collects monitoring data from all participating cluster nodes. The Remos [20] network monitoring system provides a scalable solution for collecting network resource information, by implementing models of such information in terms of abstractions required by applications. It also has the ability to predict future behavior based on past history. A compromise between detailed kernel-level vs. application-level monitoring for capturing resource information is explored by the SHRIMP performance monitor [19]. User- to kernel-level interfaces for efficient information transfer are used in the MAGNeT [8] system, which uses an instrumented kernel to export kernel events to user space. It maintains a circular buffer in the kernel where all events are recorded. Other nodes can obtain these records by contacting a daemon, called magnetd. The kernel must be configured at compile-time to enable monitoring. In comparison, dproc provides a low-overhead, fine-grained, kernel-level monitoring facility, with inter-node communications that are based on kernel-kernel messaging. Dproc is extensible in that new monitoring functionality can be added dynamically, e.g., through loadable kernel modules. Dproc is customizable in that applications can fine-tune distributed resource monitoring via parameters and by run-time deployment of dynamically generated code, as described in detail in the following sections.

2. Architecture

Dproc is a kernel-level monitoring toolkit for Linux-based distributed systems, such as cluster servers. The

toolkit provides a single uniform user interface available through /proc, which is a standard feature of the Linux operating system. The /proc mount point provides a reasonably complete set of local monitoring data, including system load, memory utilization, and per-process system resource utilization. The dproc project extends the local /proc entries of each of the cluster machines with relevant information from all other participating nodes within the cluster. The toolkit adds a /proc/cluster entry into /proc, whose sub-directories contain the monitoring information for each of the registered cluster nodes. For instance, on node1, the loadavg information (average CPU run-queue length) from node2 would be located in /proc/cluster/node2/loadavg. For each node entry in /proc/cluster, there is also an associated control file, which a user-space application can modify to (a) specify monitoring parameters (e.g., thresholds or update periods) and (b) deploy dynamically generated filters for each individual resource or for all resources together. The associated control file for the previously mentioned node (node2) would be /proc/cluster/node2/control. Typical hierarchies of pseudo-file entries in /proc are shown in Figure 1. Alan, maui, and etna are the names of the three nodes in the cluster. Memory, network traffic, CPU load, and disk usage are monitored on alan, network and CPU load on maui, and network, CPU load and disk usage on etna. Dproc makes this information available at each of these three nodes in the hierarchy shown in Figure 1. The communication infrastructure used by dproc is implemented with a kernel implementation of the ECho event channel infrastructure [6], called KECho [26]. KECho provides a publish-subscribe mechanism for direct kernel-kernel communications. Each kernel connects to a communication channel (called a monitoring channel) both as a publisher and a subscriber, and uses it to exchange control and monitoring information with other kernels. To improve scalability, the exchange of data is triggered only when an application registers interest in receiving some particular monitoring data. Dproc offers the following functionality:
– Selective monitoring via kernel-level publish/subscribe channels. The basic operating system construct offered by dproc is a pair of kernel-level channels, one for monitoring and one for control messages.
– Standard API. Applications need not explicitly handle monitoring channels. An application accesses dproc entries through an extension to the standard /proc interface.


Figure 2. Dproc architecture.

Figure 1. Dproc file hierarchy.

– Flexible analysis and filtering. Dproc offers simple ways of dynamically varying certain parameters of monitoring actions, such as monitoring rates or desired statistics. Through simple reads and writes to control files within the pseudo-file system, an application can specify the relevant controls. We discuss this further in the following sections.
– Run-time configuration. Monitoring channels, handlers, and control attributes may be created, changed, and deleted at any time during the operation of dproc.
– Plug-n-Play monitoring services. New monitoring modules can be installed and old ones removed without bringing down or reloading the system.

A simplified diagram of the dproc architecture appears in Figure 2. The d-mon (distributed monitor) module is the main module, coordinating the activities in dproc. It is built on top of KECho, which provides the kernel-level communication infrastructure. Different monitoring modules (also called service modules) can register with d-mon to provide monitoring data. Periodically, d-mon invokes these services to retrieve monitoring information. D-mon modules use a channel registry, which is a user-level channel directory server, to register new channels and to find existing

channels. The first d-mon module to contact the registry creates the kernel-level channels. All other d-mon modules in the cluster retrieve the channel identifiers from the registry and subscribe to these channels. Subsequently, applications can communicate with their local d-mon modules via the /proc filesystem (procfs) and customize the behavior of remote d-mon modules using the dynamic filter approach discussed in this paper.
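To illustrate the user-visible side of this architecture, the following minimal user-space sketch reads a remote node's load average through the extended /proc interface and adjusts that node's monitoring parameters via its control file. The paths follow the hierarchy described above; the "period=2" control syntax is an assumption, since the paper does not spell out the control-file format.

#include <stdio.h>

int main(void)
{
    char buf[128];

    /* Read node2's average run-queue length as exported by dproc. */
    FILE *f = fopen("/proc/cluster/node2/loadavg", "r");
    if (f != NULL) {
        if (fgets(buf, sizeof(buf), f) != NULL)
            printf("node2 loadavg: %s", buf);
        fclose(f);
    }

    /* Request a 2-second update period for node2 (illustrative syntax only). */
    FILE *c = fopen("/proc/cluster/node2/control", "w");
    if (c != NULL) {
        fputs("period=2\n", c);
        fclose(c);
    }
    return 0;
}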

3. Implementation

The dproc implementation consists of four major components: (a) d-mon, (b) monitoring modules, (c) procfs, and (d) KECho. All of these are implemented as dynamically loadable kernel modules. We discuss the first three in detail in this section. The details of KECho have been discussed elsewhere [26] and are beyond the scope of this paper.

3.1. Distributed Monitor d-mon

The d-mon (distributed monitor) module is the most important component of dproc, controlling all of its activities. On startup, each host's d-mon connects to the monitoring channel, as discussed earlier, and it also registers an event handler with it. This handler is called each time an event arrives on the channel. Events are of two types: control events and data events. Control events are used to specify subscription criteria in the form of filters and also to advertise the different monitoring services available on a particular system. Data events, as the name indicates, carry monitoring data.
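The following is a rough sketch of the control/data dispatch just described; the event structure and the handler registration step are assumptions made for illustration, since KECho's interface is only summarized in this paper.

/* Hypothetical event layout: the text only distinguishes control events
   (filters, service advertisements) from data events (monitoring data). */
enum dmon_event_type { DMON_CONTROL_EVENT, DMON_DATA_EVENT };

struct dmon_event {
    enum dmon_event_type type;
    void *payload;     /* control message or PBIO-encoded monitoring record */
    int payload_len;
};

/* Illustrative placeholders for the two event-handling paths. */
static void dmon_handle_control(void *msg, int len) { /* filters, advertisements */ }
static void dmon_handle_data(void *msg, int len) { /* update /proc/cluster entries */ }

/* Handler registered with the monitoring channel; invoked for every
   arriving event (the registration call itself is not shown). */
static void dmon_event_handler(struct dmon_event *ev)
{
    if (ev->type == DMON_CONTROL_EVENT)
        dmon_handle_control(ev->payload, ev->payload_len);
    else
        dmon_handle_data(ev->payload, ev->payload_len);
}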


Figure 3. Monitoring service API.

D-mon exports a set of interfaces (Figure 3), with which different monitoring modules can register or deregister to provide monitoring services. Registration is done with register_service calls. Registered monitoring services provide callback functions with which d-mon can fetch fresh data from them. D-mon is not required to possess prior information about the formats of such data; instead, services provide descriptions of the names, types, sizes, and positions of the fields in their monitoring data records in the form of PBIO (Portable Binary Input/Output) [4] field lists and format lists. PBIO is a record-oriented binary communication mechanism that supports out-of-band meta-data. A service can disassociate itself from d-mon at any time by simply calling deregister_service, using the service id that was obtained during registration as a parameter. Whenever a new service registers with d-mon, the latter sends out an advertisement control message on the channel, declaring the list of monitoring services currently available. Every time a polling period (default = 1 second) expires, the d-mon kernel thread wakes up to check whether an event has arrived on the channel or whether there is data to be sent. On event arrival, an appropriate event handler is invoked. D-mon defines four different kinds of events:
– Mon Data. This contains the actual monitoring data encoded in PBIO format. If its format information is not available in the cache, d-mon will request it from an application-level format server and then cache it. The data is then decoded to its native type using this format information and the corresponding entry in /proc is updated using procfs. For example, if memory information is received from node etna.cc.gatech.edu, the buffer corresponding to entry /proc/cluster/etna.cc.gatech.edu/meminfo is updated.

– Available Services. This event is received when a remote node advertises the different monitoring services available on it. D-mon creates the necessary entries in /proc/cluster if required.
– Modify Subscription Ecode. A remote node may specify its subscription criteria in the form of a filter written in ‘Ecode’ [5]. D-mon maintains a list of registered filters together with a default filter. When a modify subscription ecode message is received, d-mon compiles it dynamically to generate binary code and adds this code to its list of registered filters. All those nodes which don't specify their own customized filters are assigned the default filter.
– Modify Subscription Params. A remote node can also specify its subscription criteria in the form of parameters (discussed later). D-mon converts parameters to Ecode and then handles them in the same way as in the previous case.

If there is monitoring data that needs to be sent to other nodes, d-mon iterates through all of the registered services, invoking their callback functions (which they provided at registration time) and prepares a structured data vector, MonVector:

typedef struct _Mon_vector {
    int  count;
    int *mon_data_ptr;
} Mon_vector, *PMon_vector;

Each MonVector entry specifies the base address of the monitoring data. Count can be either 1 or 0 and indicates whether the base address specified by mon_data_ptr is valid or not. This vector is passed through each registered filter, including the default filter. D-mon maintains some amount of state, called lastMon, for each filter, which

represents what other nodes know about this node. A prototype filter looks like the following:

int filter_function(void *in_mon_data_vec,
                    void *out_mon_data_vec,
                    void *last_mon_data_vec,
                    unsigned long time);

The parameter in_mon_data_vec is the same as the vector prepared with the data collected from the monitoring modules. The parameter out_mon_data_vec is the vector where the filter puts the filtered data, and last_mon_data_vec stores the data image that was last sent to the node that registered this filter. Finally, time is the current system time in jiffies (e.g., each jiffy corresponds to 10 ms in x86 Linux kernels). On successful completion, a filter returns a non-negative integer, which can be one of the following:
– SEND_ORIGINAL_DATA. The filter did not make any changes to the input vector and the data needs to be sent unmodified.
– SEND_FILTERED_DATA. The filter made some modifications to the input data and stored the modified data in out_mon_data_vec. The modified data needs to be sent to the node that registered that filter.
– DONT_SEND. No data needs to be sent.

Once filtering has been done, d-mon encodes the appropriate data using PBIO and performs a vector submission (similar to the writev system call) to the event channel. This action is repeated for each filter.

3.2. Monitoring Modules

Monitoring modules (or service modules) keep track of system resources and also monitor various system-level activities. Dproc's extension interface allows applications to dynamically deploy monitoring modules without the need to recompile or restart the running dproc mechanisms.

CPU_MON. In a standard Linux system, /proc/loadavg contains the average run queue lengths over 1, 5, and 15 minutes. This pre-specified time period average may not be useful in a system with constantly varying CPU load. Therefore, dproc's CPU monitoring module creates a kernel thread that wakes up frequently to examine the task_list in the kernel and computes the average of the run-queue lengths over an application-specified period. The default period is 1 minute.
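As an illustration, a minimal filter written against this prototype might suppress updates whose content has not changed since the last transmission. The MonVector layout and the return codes are taken from the text above; interpreting the first integer of the record as the value to compare is an assumption made purely for the example.

/* Illustrative filter: send monitoring data only when it differs from the
   image last sent to the subscribing node. out_mon_data_vec and time are
   not needed for this simple check. */
int example_filter(void *in_mon_data_vec, void *out_mon_data_vec,
                   void *last_mon_data_vec, unsigned long time)
{
    PMon_vector in = (PMon_vector)in_mon_data_vec;
    PMon_vector last = (PMon_vector)last_mon_data_vec;

    if (in->count == 0)                      /* no valid data this round */
        return DONT_SEND;

    if (last->count != 0 &&
        *in->mon_data_ptr == *last->mon_data_ptr)
        return DONT_SEND;                    /* unchanged since last update */

    return SEND_ORIGINAL_DATA;               /* forward the data unmodified */
}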

MEM_MON. This provides memory status (free, used, etc.).

DISK_MON. This measures the average number of disk writes and reads, as well as the average number of sectors written and read, over a certain period of time. The default period is 1 s, but, as with CPU_MON, d-mon can change this value to any desired number.

NET_MON. This module monitors the round-trip times of established network connections, the network traffic on all open connections (per node and per connection), the number of re-transmissions (for TCP), the number of lost messages (for UDP), and the end-to-end delay for both TCP and UDP connections.

Performance Monitoring Counters (PMC). Most modern processors offer performance counters, accessible from kernel-level. These counters can be configured to track certain low-level processor events (depending on the particular processor model), such as cache misses, number of operations, and other potentially interesting chip-level statistics. Many projects and products have attempted to extend this low-level monitoring to the application level [11]. The dproc approach not only implements a method for exposing this functionality locally (through the pseudo-files and associated controls of /proc), but also exposes it to the entire dproc-enabled cluster. An application-level process on one machine can then remotely extend the kernel of a participating cluster machine, in order to generate very fine-grained monitoring data of relevance for a particular application. For instance, monitoring the number of cache lines loaded (through cache misses) may give a remote master process a way to track the amount of data a worker process has consumed, thereby allowing it to better tune its data distribution policies.

3.3. ProcFS

The ProcFS component of dproc is its general user interface. Figure 1 shows a sample interface created by procfs. A user-space application accesses dproc monitoring information via /proc/cluster and can also specify Ecode filters and parameters to control monitoring traffic. ProcFS exports various functions used by d-mon to create entries in /proc, as shown in Figure 4. To create an entry for a particular node, d-mon invokes the kdproc_create_node call. For example, kdproc_create_node(‘usa’, NULL) creates a directory entry /proc/cluster/usa/. This directory contains two default entries, called in_filter and out_filter.
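The procfs entries above are populated with data that service modules such as those of Section 3.2 deliver to d-mon through the registration interface of Figure 3. A hedged sketch of that registration step follows; register_service and deregister_service are named in Section 3.1, but their signatures are not reproduced in the text, so the arguments shown here (a service name, a data callback, and a PBIO-style field list) are assumptions.

#include <linux/init.h>
#include <linux/module.h>

/* Assumed prototypes for d-mon's registration interface (Figure 3). */
extern int register_service(const char *name, void *(*callback)(void),
                            void *pbio_field_list);
extern void deregister_service(int service_id);

/* Record produced by this hypothetical disk monitoring service; its layout
   would be described to d-mon by a PBIO field list (contents omitted). */
struct disk_stats {
    long reads;
    long writes;
    long sectors_read;
    long sectors_written;
};

static struct disk_stats current_stats;
static void *disk_stats_field_list;   /* placeholder for the PBIO field list */
static int disk_service_id;

/* Callback with which d-mon fetches fresh data from this service. */
static void *disk_mon_callback(void)
{
    /* ...refresh current_stats from block-layer counters (omitted)... */
    return &current_stats;
}

static int __init disk_mon_init(void)
{
    disk_service_id = register_service("disk_mon", disk_mon_callback,
                                       disk_stats_field_list);
    return 0;
}

static void __exit disk_mon_exit(void)
{
    deregister_service(disk_service_id);
}

module_init(disk_mon_init);
module_exit(disk_mon_exit);
MODULE_LICENSE("GPL");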


Figure 4. ProcFS API.

Out_filter indicates what filter code will be applied to the monitoring data sent to the node ‘usa’. If ‘usa’ does not specify any filter code, the default filter is shown. The in_filter is used by the application to specify filter code to be installed at node ‘usa’, to control data traffic being sent to this local node. Suppose node ‘usa’ sends its memory information. In that case, an entry called meminfo is created in directory /proc/cluster/usa/. This is achieved by calling kdproc_create_placeholder. Figure 5 shows the result of reading the meminfo entry. Note that the procfs output is in standard XML, although d-mon internally uses efficient, PBIO-based binary messaging formats.
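As a concrete, necessarily hypothetical illustration of this interface, the user-space sketch below installs a filter at node ‘usa’ by writing Ecode source into the in_filter entry. The body of the filter string is schematic: the variable names it uses depend on the field lists exported by the monitoring modules and are assumptions here.

#include <stdio.h>

int main(void)
{
    /* Ecode source, sent as a string; d-mon distributes it to node 'usa',
       where it is compiled and applied to the monitoring data destined for
       this node. The field name cpu_load is illustrative only. */
    const char *filter_src =
        "{ if (input.cpu_load > 80) return SEND_ORIGINAL_DATA; "
        "  return DONT_SEND; }";

    FILE *f = fopen("/proc/cluster/usa/in_filter", "w");
    if (f == NULL)
        return 1;
    fputs(filter_src, f);
    fclose(f);
    return 0;
}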

4. Extensibility in dproc

This section discusses in more detail the extensibility and filtering capabilities of dproc. With dproc,

we offer parameters and dynamic filters as means for customizing the communication and processing needs of the monitoring activities to the specific needs of applications or to the capabilities of participating hosts.

4.1. Parameters

We distinguish two types of parameters: (1) those that modify the update period of information and (2) those that modify thresholds. The update period indicates how often an application would like to be informed about the current utilization or availability of a resource. Another option is to specify thresholds, where an application indicates upper or lower bounds on resource utilization. Once d-mon recognizes that a threshold has been exceeded, the new value is distributed to the remote /proc entries. Combinations of these two are also possible, e.g., an application can specify to ‘update the CPU information once every 2 seconds


Figure 5. Sample monitoring data.

IF the CPU utilization is above 80%.’ Threshold comparisons can be expressed as percentage limits (e.g., if x varies by 10% from the last measurement), or as fixed relative values (e.g., if x < y · 1.1 or x > y · 0.9), or as minimum and maximum values (e.g., if x is in the range [y, z]). These parameters allow us to dramatically reduce the amount of monitoring traffic, and hence increase scalability. For example, for a batch-queue scheduler, we might need load average updates only if they are less than the number of local CPUs. Note that although parameters are easier to specify, for efficiency reasons, dproc internally converts them to dynamic filters.

4.2. Dynamic Filters

At any time it desires, an application can deploy filters by writing the filter code as a string to the ‘in_filter’ file in /proc/cluster. It is d-mon's responsibility to distribute the string to the corresponding hosts via KECho's communication channel. Incoming filter strings are received by d-mon, which then dynamically generates binary code. The resulting binary filters are executed by d-mon before any information is submitted to the channel, thereby allowing filters to customize (or block) monitoring information. Dynamic filters can provide the same functionality as described in the

‘parameter’ section above. However, they can also provide more powerful data transformations, such as those that combine multiple decision-making strategies. To extend the previous example, imagine that the batch-queue scheduler is not interested in loadavg, but instead in the amount of free memory. However, it wants memory information to be updated only if there is a free CPU on which to run its process. A solution is to tie the update period of memory information to the load average value dropping below the number of CPUs. Dproc filters encoding such combined decisions can be formulated and deployed easily. The run-time generation of filter code in dproc is based on a kernel port of a binary code generator, called Ecode [5]. Ecode implements a small, safe subset of the C programming language, supporting the C operators, for loops, if statements, and return statements. Code written in Ecode can be passed as strings between hosts via dproc's control channel and is executed at the publishing host. Unlike specialized monitoring languages like GEM [21], the general nature of Ecode allows the application programmer to create arbitrarily complex subscription criteria. Figure 6 depicts sample filter code that manipulates the information sent out by a dproc node in the cluster. This manipulation reduces the amount of data


Figure 6. Complex filter: Ecode example.

sent to different nodes in the cluster. Such data reduction decreases both network and CPU perturbation, thereby increasing the scalability and performance of dproc-based performance monitoring.

5. Experimental Evaluation

All experiments described in this section are performed on a cluster of 8 Pentium II 500 MHz machines, each with 512 KB cache and 512 MB RAM. The cluster is connected via a switched 100 Mbps Fast Ethernet, and the cluster nodes run RedHat Linux 9 (kernel version 2.4.19). The communication infrastructure (KECho) as well as the dproc monitoring system are implemented as loadable kernel modules. The goal of this section's experiments is to show the following:
– application-specific filtering of monitoring information can reduce the overhead and perturbation caused by the monitoring mechanisms,
– run-time monitoring information can be used to make intelligent decisions about how to manipulate and customize data streams in order to reduce resource requirements and to adapt streams to clients' capabilities,
– and it is important that information be made available about multiple resources in a system, thereby

Figure 7. Default filter.

enabling an application to properly identify and remove resource bottlenecks.

5.1. Microbenchmarks

We first describe the results of simple scalability and performance experiments performed with the dproc monitoring toolkit. For discussion purposes, we focus on the following three cases:
(i) Dproc with an update period of 1 second. In this case, we use the ‘default filter’ (Figure 7), which broadcasts monitoring events every second to all subscribers without performing any modifications.
(ii) Dproc with an update period of 2 seconds. This is the same as above, except that the rate of sending events is halved.
(iii) Dproc with a ‘complex filter.’ Here, we use the filter shown in Figure 6. The rate of event

transmission ‘out’ of this filter depends on the rate of change in system state. For this set of experiments, dproc monitors CPU load, disk usage, memory usage, and network traffic, resulting in monitoring events of about 500 bytes of information. Each reading is recorded after averaging over 10 minutes of the experiment.

5.1.1. CPU Perturbation

The first microbenchmark measures the CPU overheads of dproc. These overheads are evaluated in terms of linpack (http://www.netlib.org/linpack/) performance. Linpack is a CPU-intensive benchmark commonly used to measure the floating point computational power of CPUs in Mflops. We measure the change in linpack performance by running dproc on 0–8 nodes in the cluster and running linpack on one of them. Figure 8 shows that in all three cases the number of Mflops, as measured by linpack, decreases only slightly as the number of participating nodes in the cluster increases. The decrease in measured Mflops is less accentuated in the case of the complex filter. However, note that while dproc spends more cycles executing the complex filter in case (iii), the overall cost is much lower than in the other two cases because less time is spent on encoding and decoding potentially unnecessary monitoring data. Even the cost of transmission is lower because of smaller data sizes and lower data rates.

5.1.2. Network Perturbation

This test measures the amount of traffic generated by the dproc monitoring system. Tcpdump [16] is used

Figure 8. CPU perturbation analysis.

to count the number of bytes of packet data going in and out of a particular node in the cluster. This value includes not only monitoring payloads but also the headers attached to monitoring packets (as they pass through various protocol layers) as well as the acknowledgment packets received. When the number of nodes in the cluster is varied from 1 to 8, the results plotted in Figure 9 demonstrate several facts. First, when monitoring data is small (e.g., 500 bytes per packet), Figure 9(a) shows that filtering reduces total monitoring traffic to values that are quite small compared with the 100 Mbps capacity of the network links present in this cluster. Second, with statically set filter parameters and for larger packet sizes (5 Kbytes


Figure 9. Network perturbation analysis: (a) monitoring data size = 500 bytes; (b) monitoring data size = 5 KB.

in Figure 9(b)), monitoring traffic markedly increases with an increasing number of nodes, thereby demonstrating the importance of complex filters that implement customized monitoring actions able to eliminate any unwanted data (see (iii)). Note that total traffic in this case is not exactly 10 times that of the previous case, as one might expect from the tenfold increase in packet size. This is because the percentage of traffic due to headers, acks, etc. (i.e., overhead) is much lower here.

5.1.3. Event Submission Overhead

Every second, d-mon polls each of the registered monitoring modules and sends the collected information to interested clients. Figure 10(a) shows the event


Figure 10. Event submission overhead: (a) monitoring data size = 500 bytes; (b) monitoring data size = 5 KB.

submission overheads during one polling iteration of d-mon. Broadly speaking, this overhead has four main components:
– the time required to fetch data from all monitoring modules,
– the time required to execute the filter (if any) on this monitoring data,
– the time required to encode data for transmission (i.e., PBIO encoding), and
– the time required to publish the data.

Since these overheads are small, they are measured by counting the number of CPU cycles using the rdtsc (Pentium time-stamp counter) instruction. Overhead is calculated by computing the average of 600 polling iterations. The plot shows that if the complex filter is used, overheads remain within 1 millisecond, even for 8 nodes. Figure 10(b) repeats the previous experiment, but with monitoring events of average size 5 KB. It is quite surprising to see that the resulting overhead is only slightly larger than in the previous case. This is because of the relatively ‘flat’ encoding costs of PBIO due to its avoidance of unnecessary data copying. Section 4.2 of [7] discusses this property of PBIO in more detail.

5.1.4. Event Receiving Overhead

Every second, d-mon polls the listening sockets to check whether or not an event has arrived. If there is an event in the receive queue, d-mon invokes the appropriate handler to consume the event. Figure 11 shows the overhead caused by handling these incoming events in one polling iteration of d-mon. Again, overhead is calculated as the average of 600 polling iterations. The plot shows that receiving overhead is significantly larger than sending overhead. This is because PBIO incurs decoding costs only at the receiver side (using a ‘receiver makes right’ approach to data encoding and decoding). It is worth mentioning here that PBIO has two forms of decoding mechanisms: interpreted and DCG. The numbers shown in Figure 11 use interpreted PBIO, in which a generalized routine walks a data structure to convert data from the incoming to the native format. In contrast, the DCG-based approach generates customized conversion subroutines for every incoming record type and is far cheaper than the former [7]. Use of DCG can substantially reduce receiving-side overhead.
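The distinction can be pictured as follows. This is a conceptual sketch only: PBIO's actual data structures and its handling of byte order and type conversion are not shown, and the names are invented.

#include <stddef.h>
#include <string.h>

/* One entry of a run-time format description, as used by an 'interpreted'
   converter that walks the description for every incoming record. */
struct field_desc {
    size_t src_offset;   /* offset of the field in the incoming record  */
    size_t dst_offset;   /* offset of the field in the native structure */
    size_t size;         /* field size in bytes                         */
};

static void convert_interpreted(const struct field_desc *fields, int nfields,
                                const char *src, char *dst)
{
    int i;
    for (i = 0; i < nfields; i++)
        memcpy(dst + fields[i].dst_offset,
               src + fields[i].src_offset,
               fields[i].size);
}

/* A DCG-based converter instead emits, once per record type, a straight-line
   sequence of such copies specialized for that type, removing the per-field
   table walk from the receive path. */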



Figure 11. Overhead in receiving incoming events: (a) monitoring data size = 500 bytes; (b) monitoring data size = 5 KB.

5.2. Scientific Collaboration

5.2.1. Application Description

The need to support scientists as they collaborate on a local, national, and even international scale is a research challenge. Work pursued by our group [32], termed the SmartPointer project, is designed to leverage existing collaboration toolkits, such as the Access Grid (http://www.accessgrid.org/) or the AGAVE project [18], and extend them with tools that are suitable for handling the very high data flows which the next generation of scientific computing will demand. Current scientific codes, such as the Terascale Supernova Initiative (http://www.phy.ornl.gov/tsi/) funded

under the DOE SciDAC program, can already generate relevant data at nearly a gigabit per second. Many other grid-computing projects exist with similar network and CPU requirements. Within the next 5 years, these sizes are expected to grow by at least a factor of 10. In order to provide a remote collaborator with an adequate and timely visualization, the data must be carefully staged and transformed to manage bandwidth, latency, and content representation. This becomes particularly relevant for real-time collaborations, where there are multiple such streams (e.g., data, video, audio) and where streams must be synchronized in order to provide appropriate experiences to end users. The sample application used to demonstrate the utility of the dproc infrastructure is a client-server model that complies with this vision. Using a subset of the SmartPointer code, a single server delivers molecular dynamics data, similar to what might be generated by large scientific codes in physics, chemistry, or mechanical engineering (to name a few), to different clients. Clients range from high-end displays like an ImmersaDesk (http://www.evl.uic.edu/research/res_project.php3?indi=163/) to smaller displays like a handheld, a laptop, or a fast desktop machine. Clients can subscribe to any of a number of different views of the data produced by the ongoing simulation (i.e., the data produced by the server), ranging from a straight data feed, to down-sampled data (for example, removing velocity data), to a stream of images representing the full visualization. Communication is based on an event service, where a server establishes an event channel and interested clients can subscribe to this channel in order to receive the data stream. The different views desired by clients are created by customizing the data stream with data filters, similar to dproc's concept of filters described earlier. For example, resource-constrained devices like wireless handhelds may down-sample a data stream using a filter, while other, resource-rich devices receive the full-quality data stream. Without a monitoring system, a client must specify its requirements a priori depending upon expected resource availability. If the availability of a particular resource changes beyond that scope, the client can adapt by specifying and deploying a new filter. With dproc, we automate this process. The server can be made ‘aware’ of resource availability at different clients. It uses this online monitoring data to customize the data streams being sent to clients. Although such optimizations are useful on a single stream, as our

measurements below will show, such optimizations become even more relevant in large-scale scientific collaborations, where multiple endpoints, distributed over a wide area network, each require their own set of data personalizations. In cases like these, it cannot be left to individual clients to determine appropriate levels of data reduction. Instead, resource-aware filtering must take into account multiple clients' needs and resources. Consider, for example, collaborators engaged in a video-conference while examining data. In this case, the personalized data streams sent to be played out at clients should be synchronized in order to avoid unnecessary delays or large overheads due to client-side data buffering. Otherwise, confusion may result as one collaborator examines data that is logically many time steps ahead of what others are seeing. With dproc, such server-side data stream coordination is easily implemented, based on the server's knowledge about current client-side resource availability and based on its knowledge of network statistics between the data source and the multiple endpoints.

5.2.2. Single Stream Measurements

We have already begun to work on resource-aware data streaming to multiple remote clients. This paper reports initial results from that work, in which we optimize the behavior of a single large data stream. In the experiments conducted for that purpose, dproc provides the following monitoring information from each client to the SmartPointer server: (a) CPU load, (b) available network bandwidth, and (c) disk utilization. To measure the utility of dproc, we compare the performance of the original SmartPointer application with a modified application that uses dproc-provided monitoring information. The data streamed to clients can be modified with a tunable data filter. The data filter reduces the information content of the data, and therefore the amount of data being streamed. The following three cases are compared:
– SmartPointer application with no filter: the SmartPointer server sends the original data stream to all clients without any customization.
– SmartPointer application with static filters: the SmartPointer server does client-specified customization, but does not use run-time resource information about the clients. The customization criteria remain the same throughout the experiment.
– SmartPointer application with dynamic filters using monitoring information: the customization

filters at the server use the clients' resource information provided by dproc to customize data streams.

In our experiments, we use three different kinds of clients: (a) a CPU-loaded client, (b) a client connected to a link with highly varying network traffic, and (c) a hybrid client.

5.2.3. CPU-Loaded Client

The client is artificially loaded by running a CPU-intensive application, i.e., linpack, also used in our microbenchmark experiments. The load in the system is varied by running different numbers of linpack processes. Our experiments demonstrate that dynamic server-based data filtering outperforms static approaches both in terms of scalability and responsiveness. Figure 12(a) shows the amount of time required for a data packet to be submitted by the server and processed by the client in a CPU-loaded system, where more than 99% of the total time is spent processing the event stream at the client. Every time a new linpack thread is started, latency increases for the cases without a filter or with a static filter. A dynamic filter, however, can be used to achieve almost constant latencies (less than 4 ms; this small delay is due to event queuing under high loads). With dynamic filters, the dproc toolkit informs the server about a change in load at the client, and the dynamic filter uses this information to modify the amount of data sent to the client. Static filtering helps somewhat compared to no filtering, but the delay is large in comparison to dynamic filtering. Figure 12(b) shows the change in the event rate at a CPU-loaded client. The figure shows that in the dynamic filter case, the client is able to receive and process events at the same rate at which the server sends them. Therefore, the inter-event arrival delay remains almost constant. The static filter cannot adapt to increased system load and hence, the queuing delays increase and the intervals between event arrivals get larger, despite the fact that the server sends these events at a constant rate. The case without filters has the worst performance.

5.2.4. A Network-Perturbed Client

This experiment emulates large-data visualizations on remote clients, where the server sends events of size 3 MBytes each. To focus on the effects of network perturbation, clients do not process incoming events in any meaningful way. The link between the client and the server is artificially perturbed by running Iperf



Figure 12. SmartPointer performance in a CPU-loaded system: (a) latency variations with increasing CPU load; (b) event rate variations with increasing CPU load.

Figure 13. SmartPointer performance in a network-perturbed system: (a) change in latency with varying network traffic; (b) change in event rate with varying network traffic.

on two different nodes that share a link with the former two nodes. Iperf generates continuous streams of UDP packets. Figure 13(a) shows how latency is affected by the decrease in available bandwidth. As the network perturbation is increased, available bandwidth decreases, and latency increases. The capacity of the link is 100 Mbps. When there is no perturbation, the server sends data to the client at a rate of about 30 Mbps. Hence, the plot remains horizontal up to 70 Mbps of network perturbation. As perturbation increases beyond 70 Mbps, latency drastically increases for the first two types of filters, because the server is unaware of the state of the network linking it to clients. Dynamic filtering again outperforms the other cases

because the data sizes sent are adjusted to available network capacity. Figure 13(b) shows the effect of network perturbation on the rate of events received by the client in a network-perturbed system. Again, we see that the filter using dproc monitoring information is better able to customize the data stream and hence, is better able to maintain high data rates despite reductions in network bandwidth.

5.2.5. A Hybrid Client

In this case, both CPU and network perturbation are imposed, using linpack and Iperf. We increase the number of linpack threads, thereby increasing CPU

load and network perturbation by about 10 Mbps. We compare three different types of dynamic filters in this case: (a) the first one uses CPU load information, (b) the second one uses network information, and (c) the third one uses CPU, network, and disk information to customize data streams. Figures 14(a) and (b) show improved performance when a dynamic filter uses information about multiple client-side resources to customize its data stream. When the CPU load at the client increases, the filter pre-processes data in order to reduce client-side processing needs. An example is the server-side computation of graphical views, which are then simply


Figure 14. SmartPointer performance in a perturbed system: (a) change in latency with varying perturbation; (b) change in event rate with varying perturbation.

displayed by the client. Such pre-processing, however, can increase the size of the data stream, thereby also increasing network requirements. In addition, disk activity increases because of the increased data rate. Conversely, data reduction can increase client-side processing requirements (e.g., for decompression and rendering), but is useful when dealing with congested network links. These examples demonstrate the need to take into account information about multiple resources when adapting data streams [29, 28]: adaptations based on only one resource can have negative effects on other resource requirements. In summary, multiple resources must be monitored in order to make intelligent stream management decisions.
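To make the kind of trade-off described here concrete, the sketch below shows one way a server-side adaptation policy could combine the three monitored quantities when choosing how to prepare a client's stream. The thresholds, field names, and the three stream modes are illustrative assumptions, not values taken from the SmartPointer implementation (whose filters are written in Ecode).

/* Illustrative server-side policy combining multiple client-side resources. */
enum stream_mode { FULL_STREAM, DOWNSAMPLED, PRERENDERED_IMAGES };

struct client_state {
    double cpu_load;        /* e.g., run-queue length from .../loadavg      */
    double net_bandwidth;   /* available bandwidth in Mbps, from NET_MON    */
    double disk_util;       /* fraction of time the disk is busy, DISK_MON  */
};

static enum stream_mode choose_mode(const struct client_state *c)
{
    if (c->net_bandwidth < 10.0)
        return DOWNSAMPLED;          /* congested link: shrink the stream */
    if (c->cpu_load > 2.0 && c->disk_util < 0.5)
        return PRERENDERED_IMAGES;   /* offload rendering only if the client's
                                        disk can absorb the larger stream */
    return FULL_STREAM;
}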

6. Conclusions and Future Work

This paper introduces the extensibility features of a scalable, kernel-level resource monitoring toolkit, called dproc. To our knowledge, dproc is the first cluster performance monitoring system designed to exploit kernel-level data capture as well as kernel-level peer-to-peer communications. Kernel-level data capture permits dproc to capture the joint behavior of multiple system resources. Kernel-level communications enable the exchange of monitoring information across participating nodes without user involvement and with predictable delays. By using a binary messaging system with out-of-band meta-information and very low marshalling overhead, dproc communications are suitable for use in high performance computing environments. Despite its kernel-level operation, dproc can be dynamically extended with new monitoring functionality. Plug-and-play monitoring modules can be added at run-time to permit dproc to deal with new devices or resources and/or to offer new performance models of resources to applications. Monitoring events and control information are sent to other nodes via publish/subscribe channels that permit both the system and applications to dynamically customize data capture, analysis, and exchanges to match current needs. Specifically, applications local or remote to a dproc kernel monitor may dynamically deploy new monitoring functionality by specifying complex subscription criteria in the form of Ecode filters associated with their subscriptions to monitoring data. These filters are compiled and deployed at run-time. Dproc uses standard interfaces to applications, by exporting monitoring data and control interfaces

via the well-known /proc interface. While kernel-level data transport and manipulation uses binary data representations, data is exported to applications in standard XML. Microbenchmarks show that the basic CPU and network overheads of dproc are quite small. More importantly, the utility of customizable, multi-resource and multi-machine monitoring via dproc is demonstrated with a distributed, scientific visualization application, termed SmartPointer. Specifically, experiments demonstrate improvements in application scalability and performance by using run-time monitoring information provided by dproc in order to adjust application behavior to available resources. In particular, the SmartPointer application is enhanced with dynamic data stream filters that use information about multiple system resources to improve their data management decisions. Our future work will use dproc in both wide-area grids and in embedded systems, both of which present challenging, dynamic execution environments. The distributed scientific visualization described above is one example of the applications being addressed. Dproc has also been used in our Service Morphing [25] work, the intent of which is to continuously meet application and end-user needs in pervasive systems in the presence of run-time resource variations. The approach is to use dproc-provided resource information to dynamically adapt application-level services as well as to replace old services with ‘morphed’ ones (e.g., using different compiler-generated codes). Additionally, in wireless and mobile systems, power has to be considered a first-class resource, and power models based on the run-time inspection of low-level performance counters are currently being added to dproc. Finally, dproc is part of the Q-Fabric project [24], which attempts to provide large-scale distributed systems with a customizable resource management framework. The monitoring results delivered by dproc can be used by QoS management mechanisms to optimally allocate resources to applications and to integrate application adaptation with resource management.

References

1. S. Agarwala, C. Poellabauer, J. Kong, K. Schwan and M. Wolf, "Resource-Aware Stream Management with the Customizable dproc Distributed Monitoring Mechanisms", in Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC-12), Seattle, Washington, 2003, pp. 250–259.

2. E. Al-Shaer, H. Abdel-Wahab and K. Maly, "HiFi: A New Monitoring Architecture for Distributed Systems Management", in Proceedings of the 19th IEEE International Conference on Distributed Computing Systems, Austin, TX, 1999, pp. 171–178.
3. R. Buyya, "PARMON: A Portable and Scalable Monitoring System for Clusters", Software Practice and Experience Journal, Vol. 30, No. 7, pp. 723–739, 2000.
4. G. Eisenhauer, "Portable Self-Describing Binary Data Streams", Technical Report GIT-CC-94-45, College of Computing, Georgia Institute of Technology, 1994. http://www.cc.gatech.edu/tech_reports
5. G. Eisenhauer, "Dynamic Code Generation with the E-Code Language", Technical Report GIT-CC-02-42, College of Computing, Georgia Institute of Technology, 2002.
6. G. Eisenhauer, F. Bustamante and K. Schwan, "Event Services for High Performance Computing", in Proceedings of High Performance Distributed Computing (HPDC), 2000.
7. G. Eisenhauer, F. Bustamante and K. Schwan, "Native Data Representations: An Efficient Wire Format for High Performance Computing", IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 12, pp. 1234–1246, 2002.
8. W. Feng, M. Broxton, A. Engelhart and G. Hurwitz, "MAGNeT: A Tool for Debugging, Analysis and Reflection in Computing Systems", in Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003.
9. I. Foster and C. Kesselman, "Computational Grids", in The Grid: Blueprint for a New Computing Infrastructure, Chapter 2, Morgan Kaufmann Publishers, 1999.
10. I. Foster, C. Kesselman and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", International Journal of Supercomputer Applications, Vol. 15, No. 3, 2001.
11. R. Fowler, A. Cox, S. Elnikety and W. Zwaenepoel, "Using Performance Reflection in Systems Software", in HotOS IX: Ninth Workshop on Hot Topics in Operating Systems, Lihue, Hawaii, USA, 2003.
12. GANGLIA, "Ganglia Toolkit: A Distributed Monitoring and Execution System". http://ganglia.sourceforge.net/
13. C. Glasner, R. Huegl, B. Reitinger, D. Kranzmueller and J. Volkert, "The Monitoring and Steering Environment", in Proceedings of the International Conference on Computational Science (ICCS), San Francisco, CA, 2001, pp. 781–790.
14. W. Gu, G. Eisenhauer, K. Schwan and J. Vetter, "Falcon: On-line Monitoring and Steering of Large-Scale Parallel Programs", Concurrency: Practice and Experience, Vol. 6, No. 2, 1998.
15. Sun Microsystems, Inc., "RPC: Remote Procedure Call Protocol Specification Version 2", 1988. http://www.ietf.org/rfc/rfc1057.txt
16. V. Jacobson, C. Leres and S. McCanne, "Tcpdump", Lawrence Berkeley Laboratory (LBL). Available from ftp://ee.lbl.gov/tcpdump.tar.Z.
17. J. Jancic, C. Poellabauer, K. Schwan, M. Wolf and N. Bright, "dproc – Extensible Run-Time Resource Monitoring for Cluster Applications", in Proceedings of the International Conference on Computational Science, 2002.
18. J. Leigh, G. Dawe, J. Talandis, E. He, S. Venkataraman, J. Ge, D. Sandin and T. DeFanti, "AGAVE: Access Grid Augmented Virtual Environment", in Proceedings of the AccessGrid Retreat, Argonne, Illinois, 2001.
19. C. Liao, M. Martonosi and D.W. Clark, "Performance Monitoring in a Myrinet-connected Shrimp Cluster", in Proceedings of the 2nd SIGMETRICS Symposium on Parallel and Distributed Tools, 1998, pp. 21–29.

20. B. Lowekamp, N. Miller, R. Karrer, T. Gross and P. Steenkiste, "Design, Implementation, and Evaluation of the Remos Network Monitoring System", Journal of Grid Computing, Vol. 1, No. 1, pp. 75–93, 2003.
21. M. Mansouri-Samani and M. Sloman, "A Generalised Event Monitoring Language for Distributed Systems", IEE/IOP/BCS Distributed Systems Engineering Journal, Vol. 4, No. 2, pp. 96–108, 1997.
22. B.P. Miller, M.D. Callaghan, J.M. Cargille, J.K. Hollingsworth, R.B. Irvin, K.L. Karavanic, K. Kunchithapadam and T. Newhall, "The Paradyn Parallel Performance Measurement Tool", IEEE Computer, Vol. 28, No. 11, pp. 37–46, 1995.
23. Z. Nemeth and V. Sunderam, "Characterizing Grids: Attributes, Definitions, and Formalisms", Journal of Grid Computing, Vol. 1, No. 1, pp. 9–23, 2003.
24. C. Poellabauer, H. Abbasi and K. Schwan, "Cooperative Run-time Management of Adaptive Applications and Distributed Resources", in Proceedings of the 10th ACM Multimedia Conference, Juan-les-Pins, France, 2002, pp. 402–411.
25. C. Poellabauer, K. Schwan, S. Agarwala, A. Gavrilovska, G. Eisenhauer, S. Pande, C. Pu and M. Wolf, "Service Morphing: Integrated System- and Application-Level Service Adaptation in Autonomic Systems", in Proceedings of the 5th Annual International Workshop on Active Middleware Services (AMS), 2003.
26. C. Poellabauer, K. Schwan, G. Eisenhauer and J. Kong, "KECho – Event Communication for Distributed Kernel Services", in Proceedings of the International Conference on Architecture of Computing Systems (ARCS'02), Karlsruhe, Germany, 2002.

27. D.A. Reed, R.A. Aydt, R.J. Noe, P.C. Roth, K.A. Shields, B.W. Schwartz and L.F. Tavera, "Scalable Performance Analysis: The Pablo Performance Analysis Environment", in Proceedings of the Scalable Parallel Libraries Conference, 1993, pp. 104–113.
28. D. Rosu, K. Schwan and S. Yalamanchili, "FARA – A Framework for Adaptive Resource Allocation in Complex Real-Time Systems", in Proceedings of the 4th IEEE Real-Time Technology and Applications Symposium (RTAS), Denver, USA, 1998, pp. 79–84.
29. D. Rosu, K. Schwan, S. Yalamanchili and R. Jha, "On Adaptive Resource Allocation for Complex Real-Time Applications", in Proceedings of the 18th IEEE Real-Time Systems Symposium (RTSS), San Francisco, USA, 1997, pp. 320–329.
30. M. Sottile and R. Minnich, "Supermon: A High-Speed Cluster Monitoring System", in Proceedings of the IEEE International Conference on Cluster Computing, 2002.
31. P. Uthayopas, S. Phaisithbenchapol and K. Chongbarirux, "Building a Resources Monitoring System for SMILE Beowulf Cluster", in Proceedings of the Third International Conference/Exhibition on High Performance Computing in Asia-Pacific Region (HPC ASIA'99), Singapore, 1998.
32. M. Wolf, Z. Cai, W. Huang and K. Schwan, "SmartPointers: Personalized Scientific Data Portals in your Hand", in Proceedings of ACM Supercomputing, 2002.
