GRID NETWORK MONITORING IN THE EUROPEAN DATAGRID PROJECT
Pascale Vicat-Blanc Primet 1, Robert Harakaly 2, Franck Bonnassieux 2
Abstract A Grid network monitoring and performance measurement infrastructure is a key component of a large-scale Grid. Requirements for network monitoring with respect to Grid operations, as well as performance characteristics and their relevance to the Grid context, have to be carefully identified before realizing such a system. In this paper we analyze the main issues raised by the design, development, integration into a Grid middleware, and deployment of a network monitoring infrastructure in a real Grid environment. We then present the details of the architecture we proposed in the context of the European DataGRID project, and the MapCenter tool we developed to visualize the Grid status in real time and to give access to our measurement infrastructure.

Key words: high performance grid networking, end-to-end performance, grid network monitoring, grid status visualization
1 Introduction
One feature that distinguishes Grids from conventional distributed computing environments is the complex network interconnection, which mixes the Internet with private Gigabit or Fast Ethernet local area networks and high performance system area networks (such as Myrinet). The private and public domains are delimited by firewalls and edge routers. The load, capacity and availability of the network links used during data transfers may heavily affect Grid application performance. Consequently, a fully functional Grid is critically dependent on the nature and quality of the underlying network: connectivity, performance and security are key factors. One critical challenge facing the Grid community is therefore to reconcile two opposing realities: IP networks exhibit extreme heterogeneity of performance and reliability, while data movement across the network can be a critical determinant of application performance. In a Grid environment, monitoring the network is vital for determining the source of performance problems and for tuning the system for better performance.

The European DataGRID project (EDG) aims at developing, implementing and exploiting a large-scale data- and CPU-oriented computational Grid. Distributed data- and CPU-intensive scientific computing models, drawn from three scientific disciplines (physics, biology and earth sciences), are executed on a geographically distributed testbed. The resources of the EDG are interconnected by the European (GEANT) and national research network production infrastructures. The networking group of the EDG project has built a prototype network monitoring toolkit within the framework of a simple and extensible architecture that permits basic network metrics to be measured, gathered and published to the Grid middleware, and also made available, via visualization, to the human observer.

The structure of the paper is the following. Section 2 is dedicated to Grid network performance measurement issues, with emphasis on metrics, measurement methodologies, measurement coordination, data publication and network performance forecasting. In Section 3 we describe the network monitoring architecture that has been developed in the European DataGRID software and that is being deployed on the EDG testbed. Section 4 presents experiments and results. Related work is analyzed in Section 5, and conclusions and perspectives comprise Section 6.
1 INRIA RESO Project, LIP Laboratory, 46 Allée d'Italie, 69007 Lyon, France
2 UREC CNRS, ENS Lyon, France

The International Journal of High Performance Computing Applications, Volume 18, No. 3, Fall 2004, pp. 293–304. DOI: 10.1177/1094342004046044. © 2004 Sage Publications
2 Goals and Methods
To provide the middleware and the users with an abstract and homogeneous view of the complex interconnection, the links of the global network cloud have to be characterized by simple and relevant metrics, and their basic properties have to be measured. Appropriate methods for short-term and long-term storage of network monitoring information, enabling both archiving and near real-time analysis, are required, along with effective means of visual presentation of the multivariate data.

2.1 SPECIFICITY OF GRID NETWORK MONITORING
First and foremost, network monitoring and forecasting are required by Grid applications to optimize their usage of the networks that comprise the Grid. In this way, network-aware Grid applications, or the middleware components dedicated to resource usage optimization in the Grid (such as the resource broker, job scheduler and replica manager), can adjust their behavior to make the best use of this resource. For this type of usage, of primary importance is the publication to the Grid middleware of the relevant metrics that describe the current behavior of the network and permit estimation of its future behavior. Secondly, the results will be used to provide network status information and background measurements of network performance, which will be of value to those charged with the supervision and provision of network services for Grid applications. Moreover, as the large size and distributed structure of the Grid increase the likelihood of failures, mechanisms to detect and recover from network failures are also required.

The framework for measuring and monitoring the network resource as a Grid service has to be simple, scalable, resilient, modular, secure and easy to use. Simplicity and scalability are of primary importance, as the number of entities may be very large. Resilience is important, as links and nodes are very dynamic and may change over time. To allow the addition of new sensors or components, the framework has to be modular and extensible. Several levels of security are required, with authentication, authorization and encryption facilities.

A Grid network monitoring architecture is a subset of the Grid Monitoring Architecture. In order to provide the Grid community with an open framework and encourage developers to build more interoperable tools, the Grid Performance working group of the Global Grid Forum has proposed the Grid Monitoring Architecture (GMA; http://www-didc.lbl.gov/GGF-PERF/GMA-WG). The Grid network monitoring architecture has to provide the network-specific components of a monitoring architecture:
• network sensors (as GMA producers) that perform the measurements;
• network data storage (as GMA store) that stores the historical data and computed metrics;
• network data processing and presentation (as GMA consumers) that perform the computing and visualization functions.

2.2 EDG NETWORK MONITORING ARCHITECTURE
To illustrate the issues related to Grid network monitoring, we present the architecture and the prototype that have been designed and deployed within the European IST DataGRID project (see the EU DataGrid home page, http://www.eu-datagrid.org) and the results gathered on the experimental testbed. The European DataGRID project (EDG) develops the necessary middleware software in collaboration with the Globus team (http://www.globus.org) and the Network Measurement Working Group of the GGF, leveraging practice and experience from previous and current Grid initiatives in Europe and elsewhere. The first objective of the networking group of the EDG project was to provide a prototype set of tools for network monitoring within the framework of a simple and extensible architecture that permitted the basic network metrics to be published to the Grid middleware, and also made available, via visualization, to the human observer. This accords with the broad requirements of the other EDG Work Packages, currently through the use of LDAP information services but also through R-GMA, the monitoring architecture proposed by the monitoring team of the project. The architecture allows additional monitoring tools to be easily added, with the only requirements being the provision of the means for analysis and visualization of the data, and of either a push mechanism to update a local LDAP server or a back-end script to give the LDAP server access to the specific metrics. The architectural design of the monitoring system comprises four functional units:
• monitoring tools or sensors;
• a repository for the collected data;
• the means of analyzing that data to generate network metrics;
• the means to access and use the derived metrics.
In the next section, the options that have been adopted within the EDG project for solving the issues raised at the different levels of the architecture are presented.
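To fix ideas before turning to those options, the following toy sketch (ours, not part of the EDG software; all names are invented) shows how the GMA producer, store and consumer roles listed in Section 2.1 fit together.

```python
import time
from collections import defaultdict

class MetricStore:
    """GMA 'store': archives time-stamped measurements."""
    def __init__(self):
        self._series = defaultdict(list)

    def publish(self, metric, value):
        # Called by producers (network sensors).
        self._series[metric].append((time.time(), value))

    def latest(self, metric):
        # Called by consumers (processing/visualization components).
        series = self._series[metric]
        return series[-1] if series else None

store = MetricStore()
store.publish("rtt_ms", 42.0)   # a sensor acting as GMA producer
print(store.latest("rtt_ms"))   # a consumer reading the last sample
```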
3 EDG Network Monitoring System
At the lowest level, the EDG Network Monitoring System is composed of sensors that measure different network metrics. When defining such a measurement system, many decisions have to be made:
• choosing the appropriate characteristics;
• choosing the methodology and tools to perform the measurement (sensors);
• deciding where, when and how to deploy the sensors;
• deciding how the measurements will be scheduled.
The subject of network monitoring is well documented and many communities have active programs in place. Of particular relevance is the work undertaken by the IP Performance Metrics (IPPM) working group of the IETF, the CAIDA organization (http://www.caida.org), the HEP community and Internet2 (see http://www.slac.stanford.edu/comp/net/wan-mon.html and http://www.slac.stanford.edu/comp/net/wan-mon/iepm-cf.html). RFC 2330 (Framework for IP Performance Metrics) describes a framework for IP performance metrics and provides a valuable discussion of the issues relating to network monitoring. For the Grid environment, the most relevant metrics have recently been analyzed by the Network Measurement Working Group of the GGF (R. Hughes-Jones et al., A Hierarchy of Network Performance Characteristics for Grid Applications and Services, http://www-didc.lbl.gov/NMWG/docs/draft-ggf-nmwg-hierarchy-00.doc).

Fig. 1 Physical/real view of the grid network. Each logical connection consists of multiple physical network connections.

3.1 MEASUREMENT STRATEGIES AND PROBLEMS
In order to obtain performance information about the network, measurements need to be performed. These can be broadly classified into two categories. First, active network measurement: these measurements generate test data which are sent through the network to discover the properties of the end-to-end connection. Such traffic is in addition to the usual traffic load on the network. There are two drawbacks to active methodologies: they add potentially burdensome load to the network, and the additional traffic perturbs the network and biases the resulting analysis. The active approach makes use of a variety of network monitoring tools and has to be appropriately scheduled to minimize the impact on the users of the networks whilst still providing an accurate measurement of a particular network metric. All active measurements require at least some form of participation by multiple network components. For example, to evaluate the available TCP or UDP throughput, tools such as Iperf (http://dast.nlanr.net/Projects/Iperf) or Netperf (http://www.netperf.org) send probe packets for a given time (10 s by default). The amount of probe data ranges from 12.5 MB on a 10 Mb/s link to 1.25 GB on a 1 Gb/s link.
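The probe volumes quoted above follow directly from the probe rate and duration; a back-of-envelope check (our illustration):

```python
def probe_volume_bytes(rate_bps, duration_s):
    """Data injected by a throughput probe that saturates the link."""
    return rate_bps * duration_s / 8  # bits -> bytes

print(probe_volume_bytes(10e6, 10) / 1e6)  # 10 Mb/s link, 10 s -> 12.5 MB
print(probe_volume_bytes(1e9, 10) / 1e9)   # 1 Gb/s link, 10 s  -> 1.25 GB
```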
The other drawback is that many active network monitoring tools make use of ICMP, a layer-three protocol, which may be subject to different traffic management strategies from TCP- or UDP-based traffic (layer-four protocols). For example, under congestion ICMP traffic may be preferentially dropped or dealt with at a reduced priority.

Passive network monitoring makes use of real applications and their traffic to record the application's experience of using the network. For example, GridFTP can be used to record the throughput of real Grid traffic across the network, and similarly with other applications. This is beneficial in that no additional traffic is introduced across the network, but since it reflects the user's experience in performing some task, it may not accurately record the capability of the network. Active and passive monitoring each have their own advantages and drawbacks for Grids; a careful and appropriate mix of both can be very useful. Within the EDG project, it was decided to collect, by active probes, only the few most relevant end-to-end network characteristics:
• Instantaneous connectivity of an Internet path. This characteristic provides a simple metric on the reachability of a particular end system. Its answer (units are Boolean) is whether at a particular instant host A is able to send IP datagrams to host B.
• Packet loss. This provides a good measure of the quality of the route between endpoints. Loss in IP networks is caused by congestion, routing instability, link failure and by unreliable links such as telephone or wireless links.
• Two-way delay (round trip time). This represents the time taken to traverse the path from the source to the destination and back. One-way delay measures the time taken to traverse the path from source to destination. Formally, given a packet p, the time t_s at which the last packet byte departs from the source, and the time t_d at which the last packet byte arrives at the destination, rtt = 2(t_d - t_s) (assuming a symmetric path) and owd = t_d - t_s. The measurement of one-way delay (owd) and the derived IPDV provides a means whereby a more rigorous characterization of the Internet can be developed, but these metrics are not fundamental for end-to-end Grid applications; rtt is the most relevant temporal metric in the Grid context.
• TCP throughput. By definition, this metric gives the number of bits per second that can be transmitted from a given TCP endpoint A to a given TCP exit point B. As discussed in Paxson (1996), this metric cannot be expressed as an analytical metric, as the throughput the user can expect to see across the IP cloud is heavily shaped by the use of TCP: how the flow and congestion control algorithms adapt to the network conditions, the TCP stack used, and the current network conditions in the cloud. Predicting how the cloud's network conditions interact with a TCP implementation is an open research area. This empirical metric corresponds to the benchmark approach found in other measurement communities. However, it is very relevant for Grid applications, which are mainly based on TCP.
To perform these measurements, four types of tools have been developed or adapted:
• PingER for two-way delay and loss measurement;
• UDPmon for one-way delay and one-way loss measurement;
• IperfER for TCP throughput measurement;
• MapCenter for instantaneous connectivity checking.
3.2 NETWORK SENSORS
• PingER (http://www-iepm.slac.stanford.edu/pinger) is used to measure the two-way delay and loss performance of a link. It reports the round trip time in milliseconds (ms), the packet loss percentage, the variability of the response time both short-term (time-scale of seconds) and longer-term, and the lack of reachability, i.e. no response to a succession of pings. PingER data are stored locally on the machine that runs the tool. Web-based access provides an analysis of packet loss, RTT and the frequency distribution of RTT measurements in both graphical and tabular format. The data are also collected centrally to allow site-by-month or site-by-day history tables for all (or selected) sites as seen from the local monitor point.

Fig. 2 Logical view of the grid network. Each site has a set of logical connections with the others. Monitoring tools need to report network metrics following these logical connections.

• UDPmon, a tool developed within the EDG project, measures the packet loss and the packet jitter, i.e. the variation in the arrival times between consecutive packets. This packet jitter is an estimator of the variations in the one-way latencies of the packets traversing the network, as defined in RFC 2330, Framework for IP Performance Metrics. UDPmon also provides an estimate of the maximum usable UDP throughput between two end nodes.
• Iperf (http://dast.nlanr.net/Projects/Iperf) is used to measure the maximum TCP bandwidth, allowing the tuning of various parameters and TCP characteristics. It reports bandwidth, delay jitter and datagram loss. IperfER has been developed within the EDG project based upon the PingER software, replacing the ping-based RTT and packet loss measurements with TCP throughput measurements using the iperf tool. The graphical output from IperfER is very similar to PingER's, and the throughput metrics are made available to the middleware, via LDAP, in a manner consistent with PingER.
• The NetLogger application performance monitoring facility (Tierney et al., 1998) gives the possibility of recording the system time spent executing a given set of instructions, and in this way of locating performance bottlenecks. Distributed application components are modified to produce time-stamped logs of interesting events at all the critical points. Instrumented grid applications are expected to record, through this type of instrumented library, the throughput they experience in their normal operation and to make these
available along with the data collected by other network monitoring tools. Already, GridFTP has the necessary capability to record such information on each transfer. This information will form a major component of the network monitoring effort recorded by the Grid. However, the information it provides will have to be carefully interpreted and compared with the results from active monitoring of the Grid. A schema has been proposed (Vazhkudai et al., 2001) that makes use of a patched version of GridFTP in which transfers are recorded and summary data stored. Debate is continuing within the GGF as to the exact format; however, it is hoped that a standard network monitoring schema will unify the active and passive measurements.
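The following minimal sketch illustrates the kind of time-stamped event logging that instrumented applications produce; it is our simplification, not the actual NetLogger API or log format.

```python
import json
import socket
import time

def log_event(event, **fields):
    """Emit one time-stamped event record at a critical point.
    Field names here are invented for illustration."""
    record = {"ts": time.time(), "host": socket.gethostname(),
              "event": event, **fields}
    print(json.dumps(record))

log_event("transfer.start", path="/data/file.bin", size_bytes=10**9)
# ... perform the transfer ...
log_event("transfer.end", path="/data/file.bin", elapsed_s=95.2)
```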
3.3 NETWORK SENSOR DEPLOYMENT
In a Grid environment, the network monitoring strategy is an important issue. While the network metrics should be available for any Grid element (computing or storage element), full-mesh measurement between all of these elements is infeasible simply from the scale perspective. It is therefore necessary to build the network monitoring infrastructure on top of the Grid sites: the infrastructure is built from machines dedicated to network monitoring at each site, and these machines then represent all the Grid elements located at that site. As a consequence, not every query about network performance or capability can be answered directly. This is not a limitation of the network monitoring architecture, but a recognition of the scale of the Grid and of the virtual organizations that exist within it. Passive monitoring is likely to occur on end systems such as Computing Elements and Storage Elements using instrumented applications such as GridFTP to transfer data. Even if dedicated machines are used at each Grid site, all-to-all network sensor communication would consume a considerable amount of resources (both on the individual host machines and on the interconnection network) as the Grid expands to thousands of sites, as quantified below. There is no simple solution to this problem; metric composition, modeling, forecasting and extrapolation can be useful, but this is still an open research area.
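The scaling argument is easy to quantify: the number of source-destination paths in a full mesh grows quadratically with the number of monitored endpoints, which is why one monitoring host per site, rather than per Grid element, is used (our illustration):

```python
def full_mesh_paths(n_sites):
    """Ordered source->destination pairs in an all-to-all measurement mesh."""
    return n_sites * (n_sites - 1)

print(full_mesh_paths(20))    # 20 monitored sites -> 380 paths
print(full_mesh_paths(1000))  # 1000 sites -> 999,000 paths: untenable
```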
3.4 MEASUREMENT SCHEDULING
After defining the sensor deployment strategy, one has to decide how to schedule the measurements. As the traffic generated by active testing is in addition to the usual traffic load on the network, these tools must be appropriately scheduled to minimize the impact on the users of the networks whilst still providing an accurate measurement of a particular network metric. Grid network monitoring raises problems which are not so critical in classical Internet performance measurement. As the number of sites, and hence of logical links used by a community of users in a Grid environment, is limited, the probability of concurrent measurements is high. The possibility that sensor probes collide, and thereby measure the effect of sensor traffic, increases quadratically with the number of sensors (Wolski et al., 1999). This can become very critical in hierarchical Grids such as the HEP physics DataGrid, organized following a multitiered architecture, as illustrated in Figure 3. For example, the probes from tier 1 sites to tier 0 (CERN) may collide very often, making the tier 0 site a real bottleneck and leading to unreliable results.

Fig. 3 Multitiered architecture of the European DataGrid project.

For coordinating active probes, different approaches, from very optimistic to very pessimistic, are possible. The optimistic strategy considers that the probability of measurement collisions is relatively low. At the other end, the pessimistic strategy aims at avoiding any measurement collision. Four main approaches are possible:
• random scheduling;
• cron-based distributed scheduling;
• centralized scheduling;
• token passing.
The most optimistic and simple method is random activation of the sensor, which we call random scheduling. This strategy assumes that the load generated by active measurements is negligible and treats all traffic as production traffic. This method is valid for non-systematic monitoring. The most popular scheduling method uses a cron daemon scheduler on each measurement host. Good time synchronization between hosts is required, and the smallest time slot enabled by the cron mechanism is 1 min; any time shift or unexpectedly long measurement duration can cause a collision. This strategy is viable only if the number of sensors is small or if the measurements are non-intrusive. At the mid-point between the optimistic and pessimistic scheduling strategies lies the "measurement on demand" strategy, where a central server coordinates the experiments.
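A crude model (our back-of-envelope, not from the paper) illustrates the quadratic growth of collision risk cited above: assume each of n sensors launches one probe of duration d per period T at an independent, uniformly random offset; a union bound over sensor pairs then gives an upper estimate of the collision probability.

```python
def collision_probability(n_sensors, probe_s, period_s):
    """Union-bound estimate that at least two probes overlap in time."""
    pairs = n_sensors * (n_sensors - 1) / 2    # grows quadratically in n
    p_pair = min(1.0, 2 * probe_s / period_s)  # two random intervals overlap
    return min(1.0, pairs * p_pair)

# 10 s probes fired once per 10 min by each sensor:
for n in (3, 5, 10):
    print(n, round(collision_probability(n, 10, 600), 2))  # 0.1, 0.33, 1.0
```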
Considering the high collision probability in a Grid context, one has to adopt a pessimistic strategy to avoid contention and to provide a scalable way to generate pertinent network performance measurements. To realize such a pessimistic strategy, different solutions are applicable in a distributed context: those with a central manager and those based on a token-passing protocol. In EDG, a standalone protocol dedicated to network monitoring, the Probes Coordination Protocol (PCP), has been developed. It is based on a token-passing protocol and on a clique concept inspired by the NWS approach (Wolski et al., 1999). PCP is open in the sense that it can support any type of sensor. It implements several original features, such as a distributed registry, inter-clique synchronization and security. A clique, a logical group of sensors, is defined by an ordered list of participating nodes, to which the sensor type and additional required or optional information, such as period, delay and timeout, are attached.

3.5 NETWORK INFORMATION PROCESSING
A Network Cost Estimation Function that is able to compare two destinations has been developed as a Grid network service exposed to the middleware. In EDG, the replica management optimizer assesses the proximity of two sites by estimating the time needed to transfer a file of known size between two different Storage Elements SEid1 and SEid2. Computation of the transfer time is based on the estimated throughput between the two endpoints, if a point-to-point measurement is available, or between the monitoring domains the two nodes belong to. If a population of measurement samples is available, the estimation is based on the median of the throughput samples and the estimation error is expressed as a 68% confidence interval; if only the most recent sample can be provided by the information system, the error is based on the estimated accuracy of the measurement tool. Let s be a source site, d1 and d2 two destination sites, v an amount of data to be transferred, and r1 and r2 the estimated throughputs of link1 and link2. The network cost estimation function f can then be expressed as a transfer delay: f(d1) = v/r1 and f(d2) = v/r2, and the optimizer compares f(d1) and f(d2). The Application Programming Interface used by the optimization service of the data management component has been developed and is integrated into the EDG middleware.
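The cost function and the median-based estimator translate directly into code. The sketch below is our reading of the description above; approximating the 68% confidence interval by the 16th-84th percentile range is our assumption, and the throughput values are hypothetical.

```python
import statistics

def estimate_throughput(samples_bps):
    """Median of throughput samples, with a 68%-style error interval
    approximated by the 16th-84th percentile range (our assumption)."""
    med = statistics.median(samples_bps)
    q = statistics.quantiles(samples_bps, n=100)  # 99 percentile cut points
    return med, (q[15], q[83])                    # ~16th and ~84th percentiles

def f(volume_bytes, throughput_Bps):
    """Network cost f(d) = v / r, i.e. the estimated transfer delay."""
    return volume_bytes / throughput_Bps

v = 2e9                               # a 2 GB replica to move
f_d1 = f(v, 40e6)                     # hypothetical r1: ~40 MB/s to d1
f_d2 = f(v, 25e6)                     # hypothetical r2: ~25 MB/s to d2
best = "d1" if f_d1 < f_d2 else "d2"  # the optimizer picks the cheaper site
```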
3.6 NETWORK INFORMATION STORING
In essence the results of network monitoring activities may be separated on the basis of time. The immediate output from monitoring provides a snapshot of existing conditions within the network, whilst the historic data
can be used to look back over days, weeks and months at the behavior of the network. The measured data are stored in the network monitoring data store, which is modeled as a single entity. Globus (Foster and Kesselman, 1997; Czajkowski et al., 2001) implements its own schema for describing computing resources; however, there has been little work to describe the volume of data transferred across a network and network monitoring metrics. As with LDAP servers and their associated schemas, the Globus schema can easily be modified; however, in order to maintain consistency amongst the various GIIS infrastructures, the schema structure requires a well-known framework. Within the EDG project, specific back-end scripts have been provided to support the particular network monitoring tools chosen. Each measure of network performance is affected by a great number of parameters, some of which are under the control of the tools used. The information each tool produces can be divided according to these parameters, which gives a consumer more detailed information on how to optimize a data transfer. It can also be aggregated across all the different values of the parameters, which gives an overview of the performance that can be expected for a data transfer. These parameters variously include packet size, buffer sizes and the number of streams used, and not all measures of network performance are affected by all of them. A performance-measuring tool might make many measurements of the same variable, and all this information can be collated into something more manageable. A set of scripts associated with each monitoring tool provides web-based access for viewing and analyzing the related network metrics. This architecture allows additional monitoring tools to be easily added, with the only requirement being the provision of the means for analysis and visualization of the data and either a push mechanism to update a local LDAP server or a back-end script to give the LDAP server access to specific metrics. The EDG software only proposes to publish the most recent data for each metric; this may also contain basic statistical information such as minimum, maximum and average, calculated from data stored elsewhere. It is recognized that historical information is important for network monitoring and forecasting. For historical information it is more efficient to publish data using a separate archiving system, which might extract and store data from the local GIIS or have the data installed directly by the monitoring tools themselves. In order to extract performance data from a server, it must be queried using the LDAP protocol. Figure 4 shows the network monitoring extension of the MDS, as visualized by the EDG MapCenter, and how the measured values are stored in the LDAP tree.
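For illustration, a consumer might query a GIIS over LDAP as sketched below, using the third-party ldap3 library; the host, port, base DN, object class and attribute names are all hypothetical, since the actual EDG/MDS schema is not reproduced here.

```python
from ldap3 import Server, Connection  # third-party: pip install ldap3

# All DNs, object classes and attribute names below are placeholders,
# not the real EDG/MDS schema.
server = Server("giis.example.org", port=2135)  # GIIS host/port assumed
conn = Connection(server, auto_bind=True)       # anonymous bind
conn.search(
    search_base="Mds-Vo-name=datagrid, o=grid",
    search_filter="(objectClass=networkMetric)",
    attributes=["rtt", "packetLoss", "tcpThroughput"],
)
for entry in conn.entries:
    print(entry.entry_dn, entry.entry_attributes_as_dict)
```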
Fig. 4 LDAP store of network monitoring data.
3.7 GRID STATUS AND NETWORK INFORMATION VISUALIZATION
Distributed applications and services over large-scale and heterogeneous networks imply new paradigms that make traditional monitoring representation models obsolete. Grid environments introduce new functions and constraints that must be addressed by a monitoring tool:
• Widespread applications: a grid application can run over a huge number of sites, spread over several countries and institutes.
• Virtual organizations: communities of users (biologists, physicists, etc.) need common views of, and access rules to, the grid, and they are grouped into virtual organizations.
• Dynamic nature: fast insertion and removal of computing resources, fast data and file movement, dynamic replication systems, etc.
• Delocalization: the localization of running processes and of end users cannot be foreseen.
The Network Work Package of the European DataGRID has proposed a presentation model suitable for grid environments and a tool, MapCenter, that implements this model. The main purpose of this model is to fulfill the needs of grid users and grid administrators: end users require the real-time status of service availability, while Grid and site administrators need interfaces that help them react quickly and tackle problems occurring in the Grid. MapCenter (http://ccwp7.in2p3.fr/mapcenter) has been designed to provide an extensible presentation layer for the services and applications available on a Grid. Figure 5 shows the status of the EDG testbed as displayed by the MapCenter tool. The proposed tool offers a single, flexible and simple means to graphically represent Grid communities, organizations and applications. Figure 7 gives an architectural view of the presentation layer within the Grid components. The tool also allows simple access to all the different metrics stored within the Information System of the Grid. Figure 6 gives a view of the Grid network services provided on the EDG testbed and accessible through MapCenter. MapCenter currently monitors the European DataGRID project, the European DataTAG project and e-Toile, the French Grid project.
Fig. 5 Status of the European DataGrid project as displayed by MapCenter.

4 Experiments and Results
A network monitoring testbed has been used to demonstrate the operation of a variety of network monitoring tools and their ability to collect data and to make the network monitoring metrics available via a Web interface. The testbed sites provided an environment where monitoring tools could be deployed to understand their capabilities, to gain experience in their use and to compare them. Contributing sites allowed appropriate access to the monitoring machines for the installation and management of the various monitoring tools. At all times the
purpose was to install the monitoring tools at as many of the sites as possible, so that a coherent view of the output from a specific monitoring tool could be collected. Each site also provided a web interface to the monitoring tools it hosted, as one means of publishing the particular network metric being monitored. Figures 8, 9 and 10 show, respectively, the TCP throughput, the RTT distribution, and the UDP packet loss rate and throughput of the same network link (CERN to CNAF in Italy) over the same period. Despite an RTT that looks very stable, the TCP throughput is about one third of the full capacity, while the wire throughput measured by UDPmon is close to 80 Mb/s. These results show that the end-to-end loss rate is not negligible and affects the TCP flows; the throughput obtained by a standard TCP flow is far from optimal. The time and frequency plots may help us analyze and explain the performance a user can observe. Work is ongoing to design tools able to analyze these correlated data and provide aggregated results.

Fig. 6 Condensed view of European DataGrid services.
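One standard way to see why a non-negligible loss rate caps TCP well below the wire rate even when the RTT is stable is the Mathis et al. approximation, rate ≈ MSS/(RTT·√p). This model is not used in the paper; the sketch below only illustrates the qualitative effect, with made-up numbers.

```python
import math

def mathis_limit_mbps(mss_bytes, rtt_s, loss_rate):
    """Approximate TCP throughput ceiling: MSS / (RTT * sqrt(p))."""
    return mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate)) / 1e6

# Illustrative values only: 1460-byte MSS, 20 ms RTT, 0.1% loss.
print(round(mathis_limit_mbps(1460, 0.020, 0.001), 1))  # ~18.5 Mb/s
```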
Fig. 7 MapCenter overview.
Fig. 8 IperfER network available throughput time plot.
Fig. 9 PingER frequency distribution of RTT plot.
Fig. 10 UDPmon packet loss and wire throughput time plot.

5 Related Work

Many monitoring systems and performance measurement tools for parallel, distributed or network environments have been proposed in the literature and can be used and adapted for a Grid context. In Waheed et al. (2001) and Fisher (2002), a number of existing monitoring systems are examined and described. The majority of classical systems are closed in the sense that they do not support external tools. For example, NetLogger (Tierney et al., 1998), Paradyn (Miller et al., 1995), AIMS, Gloperf (http://www-fp.globus.org/details/gloperf.html) and SPI can collect data from distributed systems for analysis through their specific tools, but cannot serve as data collectors for other tools and applications that may wish to use this information. Systems such as Autopilot (Ribler et al., 1998), the Information Power Grid (IPG) monitoring infrastructure (http://www.ipg.nasa.gov/), NWS (Wolski et al., 1999) and R-GMA (Fisher, 2002) follow the GMA architectural standard.

The most advanced performance measurement architecture dedicated to Grid environments is the Network Weather Service, which dynamically forecasts network performance (Wolski, 1998). The performance prediction service uses monitoring data as inputs for a prediction model, which is in turn used by a scheduler to determine which resource to use (Faerman et al., 1999). Primet et al. (2002) have studied the accuracy of the NWS network measurements. NWS underestimates the real performance because the active method it adopts to perform the measurement is not well adapted; a formula to scale the measurements, so that the predictive engine of NWS can still be used, has been proposed. NWS is also not well adapted to measuring high performance links, but its forecasting approach is very interesting. In Den Burger et al. (2002), TopoMon, a grid network monitoring tool based on NWS, is described. It focuses on two characteristics of the network: latency and bandwidth between the cooperating computers in a Grid environment. The network topology between the computers is provided to applications, so that shared links can be taken into account and the performance figures of hierarchical measurement groups can be aggregated. Autopilot integrates dynamic performance instrumentation and on-the-fly performance data reduction with configurable resource management and an adaptive control algorithm. The infrastructure for monitoring and management of the IPG at NASA (http://www.ipg.nasa.gov/) is based on three basic components: sensors, actuators, and an event service. The event service provides mechanisms for forwarding sensor-collected information to other processes that are interested in that information. The EDG system does not offer such a feature at the moment. NetSaint (http://www.netsaint.org) is a tool to monitor network and computing resources at both LAN and WAN level, but is not dedicated to the Grid context. In contrast to many monitoring tools, NetSaint provides complete decoupling between the core logic and the monitoring engine.

On the Internet side, much work has been done on Internet performance measurement (Adams et al., 1998). The measurement efforts of CAIDA, the Cooperative Association for Internet Data Analysis, are intended to help users, providers and researchers understand the complexities of the current and future Internet. Skitter (http://www.caida.org) is used to measure forward IP paths from a source to many destinations. This research provides the community with insight into the complexity of a large, heterogeneous and dynamic worldwide topology. These Internet tools offer promise for insights into the infrastructure as a whole. One can expect that the effort made in building new end-to-end measurement and data analysis tools in this network community will also benefit the Grid community. The EDG networking group follows this work closely in order to build more accurate and efficient open Grid network services following the philosophy of the Open Grid Services Architecture (Foster et al., 2002).

6 Conclusion and Perspectives
In this paper, we have described the prototype network monitoring architecture designed and developed in the European DataGrid project. A coherent set of network monitoring tools has been integrated and populates the Information Service database. This infrastructure has been running for several months and is very useful for grid users. However, further work is required in its development. We have outlined that scalability and intrusiveness are major issues for such tools and architectures, and we have proposed several methods to address them. During the coming year we will focus our network monitoring activity on these problems and possible improvements.

ACKNOWLEDGMENTS
This work is supported by the European IST DataGRID project IST2000 25-182 and the French INRIA RESO project. The authors want to thank all the participants of the European DataGrid Project WP7, Network, for their valuable contribution to the development and deployment of this infrastructure.

BIOGRAPHIES
Pascale Vicat-Blanc Primet graduated in Computer Science in 1984 and obtained a PhD in Computer Science from INSA de Lyon in 1988. From 1989 to 2001 she worked as a researcher and teacher at the Ecole Centrale de Lyon in France, teaching Computer Science, Operating Systems, Internet protocols and Computer Networks. She joined the Institut National de Recherche en Informatique et en Automatique (INRIA) in 2001 as a researcher. Her research interests include distributed and real-time systems, collaborative work, high performance grid and cluster networking, active networks, the Internet (TCP/IP) and quality of service. A member of the Laboratoire de l'Informatique du Parallélisme (LIP) of the Ecole Normale Supérieure de Lyon, she currently leads the RESO team of INRIA, which specializes in communication protocol and software optimization for high performance and high-speed networks. In 2001-2002 she managed the Network Work Package of the European IST DataGRID project. She is the scientific coordinator of the French RNTL Grid platform project E-Toile, funded by the French Research Ministry, and a member of the RNRT VTHD++ project. She is the INRIA correspondent of the IST DataTAG project and co-chair of the Data Transport Research Group of the Global Grid Forum.

Franck Bonnassieux has been a research engineer at CNRS since 2001. He was involved in the national French grid project E-Toile and managed the network work package of the European DataGrid project. He previously worked at the Proval Technology company as a developer, and then at the STERIA company as a project leader for more than seven years.
Robert Harakaly is a research engineer in the UREC laboratory of CNRS. His work focuses on grid networking and, more specifically, on high performance network monitoring, tuning and data replication optimization. He is co-author of several scientific publications on the subject of grid networking.

REFERENCES
Adams, A., Mahdavi, J., Mathis, M., and Paxson, V. 1998. Creating a scalable architecture for Internet measurement. IEEE Network.
Czajkowski, K., Fitzgerald, S., Foster, I., and Kesselman, C. 2001. Grid information services for distributed resource sharing. In IEEE International Symposium on High-Performance Distributed Computing (HPDC-10).
Den Burger, M., Kielmann, T., and Bal, H. 2002. TopoMon: a monitoring tool for Grid network topology. In International Conference on Computational Science (ICCS 2002).
Faerman, M., Su, A., Wolski, R., and Berman, F. 1999. Adaptive performance prediction for distributed data-intensive applications. In Proceedings of Supercomputing, IEEE Computer Society Press.
Fisher, S. 2002. DataGrid: Information and Monitoring WP3 Architecture Report: Design, Requirements and Evaluation Criteria. PPARC, UK.
Foster, I. and Kesselman, C. 1997. Globus: a metacomputing infrastructure toolkit. International Journal of Supercomputer Applications and High Performance Computing, 11(2):115–128.
Foster, I., Kesselman, C., Nick, J., and Tuecke, S. 2002. The physiology of the Grid: an Open Grid Services Architecture for distributed systems integration. In Open Grid Service Infrastructure WG, Global Grid Forum, June 22.
Miller, B. et al. 1995. The Paradyn parallel performance measurement tool. IEEE Computer, 28(11):37–46.
Paxson, V. 1996. Towards a framework for defining Internet performance metrics. In Proceedings of INET '96, Montreal, Canada.
Primet, P., Harakaly, R., and Bonnassieux, F. 2002. Experiments of network throughput measurement and forecasting using the Network Weather Service. In Proceedings of CCGRID 2002, Berlin, Germany.
Ribler, R., et al. 1998. Autopilot: adaptive control of distributed applications. In IEEE International Symposium on High-Performance Distributed Computing (HPDC-7).
Tierney, B., et al. 1998. The NetLogger methodology for high performance distributed systems performance analysis. In IEEE International Symposium on High Performance Distributed Computing (HPDC-7).
Vazhkudai, S., Tuecke, S., and Foster, I. 2001. Replica selection in the Globus Data Grid. In Proceedings of the 1st IEEE/ACM International Conference on Cluster Computing and the Grid (CCGRID 2001), IEEE Computer Society Press, pp. 106–113.
Waheed, A., Smith, W., George, J., and Yan, J. 2001. An infrastructure for monitoring and management in computational Grids. In IEEE International Symposium on High Performance Distributed Computing.
Wolski, R. 1998. Dynamically forecasting network performance using the Network Weather Service. Cluster Computing, 1(1):119–132.
Wolski, R., Spring, N., and Hayes, J. 1999. The Network Weather Service: a distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems, Metacomputing Issue, 15(5–6):757–768.