Leveraging Complex Event Processing for Grid Monitoring*

Bartosz Baliś(1,2), Bartosz Kowalewski(2), and Marian Bubak(1,2)

1 Institute of Computer Science, AGH, Poland
2 Academic Computer Centre – CYFRONET, Poland
{balis,bubak}@agh.edu.pl, [email protected]

Abstract. Currently existing monitoring services for Grid infrastructures typically collect information from local agents and store it as data sets in global repositories. However, for some scenarios querying real-time streams of monitoring information would be extremely useful. In this paper, we evaluate Complex Event Processing technologies applied to real-time Grid monitoring. We present a monitoring system which uses CEP technologies to expose monitoring information as queryable data streams. We study an example use case – monitoring for job rescheduling. We also employ CEP technologies for data reduction, measure the overhead of monitoring, and conclude that real-time Grid monitoring is possible without excessive intrusiveness for resources and network.

Key words: Grid computing, real-time monitoring, event-driven architecture, complex event processing, event correlation.

1 Introduction & Motivation

Monitoring services are an integral part of large-scale Grid infrastructures. Typically, monitoring activities focus on reporting the current status and utilization of resources, and on gathering historical data to enable retrospective analysis. Monitoring information is usually collected locally and disseminated to a site-level or central server, where it is stored, refreshed periodically, and exposed for querying by consumers. However, in certain cases more real-time access to monitoring information streams would be desirable. Examples include SLA contract monitoring, real-time system misuse detection, failure detection, and real-time monitoring of resource utilization for the purpose of steering and adaptive algorithms, such as job rescheduling [4]. Given the dynamic nature of the Grid, characterized by variable resource demands and dynamic application behavior, this type of monitoring is particularly important.

Complex Event Processing (CEP) [11] is a general term for approaches that take streams of atomic events, enable querying over those streams, and produce derived complex events. Advanced event processing mechanisms [7] are now widely available, and CEP engines are capable of discovering extremely sophisticated patterns in an event stream. Surprisingly, although monitoring information can naturally be viewed as streams of data reflecting the current status and happenings within the Grid infrastructure, CEP technologies have not been employed to build Grid monitoring services.

* This work is supported by the European Union through the IST-027446 project ViroLab, AGH grant 11.11.120.777, and ACC CYFRONET AGH grant 500-08.

The goal of this paper is to evaluate Complex Event Processing technologies as a basis for Grid monitoring services. We have built a Grid monitoring infrastructure – GEMINI2 – which uses a CEP engine – Esper – to provide monitoring information as real-time streams [13]. We show that CEP technologies not only enable powerful processing of streams of monitoring information, but can also be used for data reduction, which prevents excessive resource usage and network flooding due to monitoring. We study a use case – monitoring used for job rescheduling in the Grid – and evaluate the overhead of monitoring services on Grid resources.

We argue that Complex Event Processing technologies can benefit Grid monitoring services as follows: (1) they enable exposing monitoring information as queryable data streams and accessing it in real time; (2) they provide highly expressive querying constructs and high-performance engines that support filtering, aggregation, sliding-window calculations, and correlation of events; (3) they enable data reduction based on buffering, filtering, and aggregation.

This paper is organized as follows. Section 2 presents related work. The GEMINI2 infrastructure is described in Section 3. Section 4 briefly describes the current implementation of GEMINI2. Section 5 presents the evaluation of GEMINI2, including the case study scenario and the monitoring overhead evaluation. Section 6 concludes the paper.
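As an illustration of capability (2), a consumer might request a sliding-window aggregation with an Esper-style EPL query such as the one below. The CpuInfoMsg event type and its idletime field are taken from the data reduction example later in the paper; the hostname field and the 30-second window length are assumed here purely for illustration:

```sql
-- average CPU idle time per host over a 30-second sliding window
-- (hostname field and window size are illustrative assumptions)
select hostname, avg(idletime)
from CpuInfoMsg.win:time(30 sec)
group by hostname
```

Such a query is registered as a subscription; the engine then pushes derived events to the consumer whenever the window contents change, instead of the consumer polling a repository.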

2 Related Work

Existing Grid infrastructure monitoring systems focus on the current status and utilization of Grid resources, and on retrospective analysis based on historical data. Monitoring information is typically collected by local sensors and disseminated to site-level or global persistent storage, which is subsequently used by consumers.

A representative Grid monitoring service which adopts those design principles is GridICE [1], a system used within the EGEE project. In GridICE, sensors collect monitoring information about local resources and disseminate it to a site collector, where it is converted and stored according to a data model, in this case the Glue schema [2]. The monitoring information can be collected from individual sites through the Grid information system interface and aggregated at the global level by a GridICE server. A similar architecture is featured by Inca 2 [12], a monitoring tool used in the TeraGrid project. Inca 2 focuses on the detection of problems in the Grid infrastructure. To this end, testing of Grid software and services is performed periodically by local agents, called reporters, which are managed by reporter managers. The results of the tests are stored in a depot, which can be queried by consumers.

R-GMA [6] is to some extent similar to our approach in that it views monitoring information as streams published by producers and requested by consumers via distributed queries. As R-GMA adopts a relational model for monitoring information, the streams are tuples (table rows), requested by consumers via SQL queries. However, SQL and the relational model impose several restrictions on processing and querying over data streams. Complex Event Processing technologies, on the other hand, have the advantage of being specifically designed for this purpose. In CEP, unlike in the SQL/relational model, features such as aggregation, sliding-window calculations, or correlation of events are naturally available.
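As an example of a query that is awkward to express over relational tables but natural in a CEP language, the following Esper-style EPL pattern correlates two event streams to detect jobs that started but produced no heartbeat within one minute. The event type names JobStartMsg and JobHeartbeatMsg and the jobId field are hypothetical, chosen only to sketch the idea:

```sql
-- detect jobs that emit no heartbeat within 60 s of starting
-- (JobStartMsg, JobHeartbeatMsg and jobId are hypothetical names)
select a.jobId
from pattern [ every a=JobStartMsg ->
               (timer:interval(60 sec)
                and not JobHeartbeatMsg(jobId = a.jobId)) ]
```

The temporal constraint and the absence-of-event condition are first-class constructs here, whereas an equivalent SQL formulation over stored tuples would require repeated polling and explicit timestamp bookkeeping.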

Several other existing Grid monitoring systems, which cannot be described here owing to space limitations, adopt similar design principles. In summary, existing Grid monitoring services are not oriented towards enabling real-time subscriptions for monitoring information. Most solutions do not expose monitoring information as queryable data streams, but convert it to data sets stored in a permanent repository. While this is useful for certain scenarios (e.g., those where historical data is needed), it is not well suited for real-time querying. For example, the temporal aspects of data streams are lost during conversion, along with the querying capabilities that rely on them (such as sliding-window calculations or correlations). The work described here is a result of our previous experience with event-driven systems [9] and with Grid monitoring based on an earlier, non-CEP-based monitoring system, GEMINI [3].

3 Concept of GEMINI2

The Generic Monitoring Infrastructure (GEMINI2) is a lightweight framework designed to provide event-based mechanisms for distributed environments. Though the basic framework of GEMINI2 is prepared to provide event-based mechanisms for any distributed system, we currently focus on monitoring capabilities for Grid infrastructures.

3.1 Requirements

A few important assumptions made in the initial stages of the project had a significant impact on the final design of the framework. These prerequisites were meant to clearly distinguish GEMINI2 from the currently available solutions. The main functional and non-functional requirements identified at that stage are:

– usability – the solution has to be easy to employ in order to produce a setup suitable for any distributed environment,
– performance – because of the potential load of event messages, complexity needs to be reduced wherever possible and technologies need to be chosen carefully,
– scalability – the infrastructure should be capable of dynamic allocation and environment reconfiguration,
– configurability – the constituent parts of the solution should be configurable and exchangeable,
– standards-based approach – standard technologies foster adoption and reduce the effort involved in building applications founded on this infrastructure,
– well-defined event messages – all event objects passed through the system need to be standardized, making it easy to employ the solution in any use case and simplifying migration paths,
– well-defined management contracts – control interfaces should be well-defined and easily accessible from any programming language,
– simplified CEP – creation of CEP expressions should be made as simple as possible in a distributed environment founded on multiple types of events.

As a generic monitoring infrastructure, GEMINI2 also needs to define a taxonomy for events passed through the system and provide a standard way of representing them as event objects. The hierarchy of monitoring events will be based on already available documentation that attempts to summarize the set of monitoring events currently used in Grid environments. Several papers and memos cover this subject; nevertheless, the event types used in monitoring measurements have not been standardized yet.

3.2 Leveraging Complex Event Processing

CEP technologies play an important role in supporting contemporary event-driven systems, which have to withstand tremendous, constantly increasing loads [5]. High-performance CEP engines have emerged which are capable of discovering sophisticated patterns in an event stream. Applied to monitoring, CEP enables real-time processing of monitoring information streams, including, among others: (1) aggregation of smaller events in order to provide a high-level view of a process – statistics, summaries, etc.; (2) correlation of events generated by different event sources; (3) long-term metrics and measurements. A CEP engine incorporated into a monitoring infrastructure not only provides powerful stream querying and pattern discovery capabilities, but also enables data reduction, which is essential to avoid network flooding. Section 3.4 describes this aspect in more detail.

GEMINI2 aims at providing interchangeable building blocks that significantly simplify the deployment of a CEP-based monitoring (or, more generally, eventing) infrastructure in a distributed environment. GEMINI2 exposes event dissemination and management interfaces using standard communication technologies, while delegating the responsibility of identifying complex situations to an exchangeable CEP engine. The initial versions of the infrastructure are built upon Codehaus Esper, a popular and powerful open-source CEP solution. Consequently, the interface used to request monitoring information is based on the Event Processing Language (EPL) used by Esper. EPL is a declarative language for expressing queries over event streams, with an intuitive SQL-like syntax. The drawback of EPL is that it is not an industry standard; supporting the process of CEP standardization is one of the future goals of GEMINI2.

3.3 Design

As already mentioned, configurability and usability are the two requirements that had the biggest influence on the design of the framework. The whole infrastructure was divided into a group of interchangeable components that can be used to easily assemble a solution suitable for any particular distributed environment. Fig. 1 presents a simplified view of the GEMINI2 architecture; at the same time, the diagram depicts a standard deployment configuration for a distributed environment instrumented with GEMINI2. There are three main logical entities in every distributed monitoring environment: Sensors, Monitors and Clients. Sensors are responsible for sampling the

Fig. 1. High-level view of the GEMINI2 architecture and a sample deployment of its components.

environment (nodes, network, etc.) and generating simple events. They can also incorporate an Event Dispatch (CEP) engine in order to apply Complex Event Processing mechanisms directly to the stream of events generated during the sampling process. This way the event objects disseminated by a Sensor are already preprocessed, leading to an obvious decrease in data volume.

Monitors are the heart of the whole infrastructure. Their duty is to handle subscription-related control messages coming from Clients and to process the high volume of event objects pushed into the server by multiple Sensors. The incoming events are passed to an Event Dispatch (CEP) engine which contains the definitions of the complex queries associated with particular client subscriptions. New event streams produced by the Dispatch Engine are then disseminated to the proper Clients. Monitors can also be used to build complex topologies of cooperating server nodes, which makes the deployment architecture even more flexible and scalable.

Clients subscribe to Monitors for particular complex events. Control messages such as subscribe, renew subscription and unsubscribe are passed to the Monitoring Service running inside a Monitor. Clients then accept event objects disseminated by the Monitor through a separate communication channel.

Each of the three constituent parts of the infrastructure is assembled from exactly the same set of reusable components. GEMINI2 employs an Inversion of Control (IoC) container in order to let users easily create their own infrastructure setup. The common set of components includes event dissemination engines, monitoring service subparts and stubs, the Event Dispatch Engine, Web Services transport endpoints for control messages, JMS transport endpoints for event channels, and many more.

3.4 Data Reduction

Data reduction is important in the Grid in order to avoid excessive network overhead due to monitoring. Complex Event Processing constructs naturally enable one to achieve data reduction by buffering, filtering or aggregation, for example:
– select * from CpuInfoMsg(idletime
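Along the same lines, aggregation combined with output-rate limiting can replace a stream of raw samples with periodic summaries. A sketch in Esper-style EPL, reusing the CpuInfoMsg event type and idletime field from the filtering example above, with an illustrative 60-second reporting interval:

```sql
-- emit one aggregate event per minute instead of every raw sample
-- (window length and reporting interval are illustrative assumptions)
select avg(idletime), min(idletime), max(idletime)
from CpuInfoMsg.win:time(60 sec)
output last every 60 seconds
```

When such a statement runs in the Event Dispatch engine embedded in a Sensor, only the derived summary events cross the network, which is the source of the data-volume reduction discussed above.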