Towards ‘Integrated’ Monitoring and Management of Data Centers using Complex Event Processing Techniques

Krishnaprasad Narayanan
Unisys Corporation, Global Technology Center, Bangalore, India
[email protected]

Sumit Kumar Bose
Unisys Corporation, Global Technology Center, Bangalore, India
[email protected]

Shrisha Rao
IIIT Bangalore, Electronics City, Bangalore, India
[email protected]

ABSTRACT

Diagnosing the cause of system failures in data centers that house large interconnected complex computer systems is a herculean task. This is because the different monitoring tools for network, storage, server, facilities and application provide useful information regarding the health of the communication systems, the storage arrays, the physical machines, the environmental factors and the applications within a data center only in a piecemeal manner. The existing tools fail to provide a comprehensive view of the complete set of operations within a data center. In the absence of integrated monitoring and management tools, a data center administrator has to manually shuffle through and analyze data from the various logs generated by the disparate monitoring tools when a fault occurs in order to identify its root cause. In this paper we propose an integrated data center health monitoring and management framework built on top of the existing monitoring tools. The integrated framework leverages complex event processing techniques to process massive streams of events from these tools in (near) real time and enables automatic reuse of the existing monitoring tools in a non-intrusive manner.

Categories and Subject Descriptors
K.6.4 [System Management]: Centralization / Decentralization

General Terms
Algorithms, Management, Performance.

Keywords
Complex event processing, Data center, Integrated monitoring, Fault diagnosis.

1. INTRODUCTION

Present-day server rooms, called data centers, contain several hundreds to thousands of servers and storage elements interconnected using complex networking technologies. These centers host several business-critical applications, are required to be available on a 24 x 7 basis, and must remain fault-tolerant so that applications can continually operate at optimal performance levels without any service disruption. In view of this, it is extremely important to monitor and maintain the overall well-being of the data center, which in turn requires monitoring the health of each of its individual components. Traditionally, various IT teams have relied upon different monitoring tools for tracking the operational behavior of each of these individual components. These tools detect the occurrence of unusual patterns that impact the operational viability of the components that they individually manage. For example, network monitoring tools analyze and correlate network-specific metrics only. Similarly, storage monitoring tools generate performance alerts for faulty operations of storage area networks. Thus, the existing performance monitoring tools generate alerts only for faults occurring within their local area of operation. In the absence of an integrated monitoring and management tool, it is a challenge to correlate the different metrics monitored by these disparate tools and to relate these metrics, as a whole, to an application’s performance and to the data center’s overall operations. A faulty switch within a data center, for example, could trigger a chain of alerts from the network monitoring tool, the storage monitoring tool and the application performance monitoring tools, as shown in Figure 1 and Figure 2. Manually troubleshooting the alerts generated by each of these tools makes the task of diagnosing and identifying the root cause responsible for the alerts difficult and time consuming.

As shown in Figure 1 and Figure 2, a storage array is shared by multiple application servers. The server volumes that these application servers access in a storage array are based on logical entities, typically known as logical unit numbers (LUNs). The application servers are connected to the storage array using fibre-channel switches (also called SAN switches). The health of the application servers is monitored by application performance monitoring (APM) tools such as Netuitive, AppDynamics and Integrien. These APM tools measure various application-specific and system-specific metrics on the machines where the applications reside and generate alarms when they detect undesirable behavior. In both scenarios the storage box provides logical volumes of storage to the application servers through a SAN switch, and the APM tools generate HDD I/O alarms when the applications fail to access their respective server volumes in the storage array. In addition to the APM tools, which are largely responsible for monitoring the health of the applications, network monitoring tools such as CISCO Data-Center Network Manager (CISCO DCNM) [1] monitor the health of the inter-machine communication systems, and storage monitoring tools such as Opstor Manage Engine [2] monitor the health of the storage area networks (SAN). Yet none of these tools is capable of diagnosing the root cause of the anomalies occurring in the two scenarios; the cause needs to be investigated manually by piecing together information from the various application monitoring logs, network monitoring logs and storage monitoring logs generated by the respective tools. In the case of Figure 1, the real cause of the alarm is a faulty SAN switch. In contrast, in Figure 2 the real cause of the alarm is a disconnected cable. In order to automate the understanding of the true reasons that underlie application misbehavior, identify the root cause responsible for the alerts, and be able to distinguish amongst instances such as the ones discussed here, it is important to combine information, called event patterns, from the various monitoring logs (application, network and storage) in intelligent and meaningful ways.

In this paper, we use complex event processing (CEP) techniques for correlating seemingly unrelated events generated by disparate monitoring tools in order to establish the root cause. We believe this approach to be unique, as none of the existing frameworks gathers events from the disparate monitoring tools and performs correlation on them. In complex event processing, users are interested in finding matches to event patterns, which are usually sequences of correlated events. Our choice of CEP as an integration tool is dictated by two factors:

• Integrated data center monitoring applications need to process massive streams of events in near real time. The monitoring system should be able to support a large number of concurrent scenarios and should scale well with the number and the variety of scenarios to be monitored. Millisecond latency requirements for many of the monitoring applications make it infeasible to persist data in a relational database for processing.

• It should be possible for the event processing application to identify complex sequences of events generated by disparate monitoring tools. The event processing application should be able to reuse the existing monitoring tools and the events generated by them. This integration of the existing tools into the new application has to be performed in a non-intrusive manner (without any changes to the existing monitoring set-up) and with minimal effort.

The structure of the paper is as follows: Section 2 provides a brief background on complex event processing systems and the theory underneath these systems. Section 3 presents the architecture of our system and discusses the expressiveness of complex event processing queries in addressing complex data center monitoring needs. The section also discusses sample attributes and their values obtained from different monitoring tools, along with an explanation of the CEP queries. In Section 4, we discuss the related work before providing concluding remarks in Section 5.

Figure 1. Scenario describing the generation of an HDD I/O error by the application server monitoring agent when one of the SAN switches fails.

2. COMPLEX EVENT PROCESSING

Complex event processing is the online detection of complex patterns in event streams. We use the data model proposed in Cayuga [3] and treat data as relational tuples, referred to as events. The data model consists of temporally ordered sequences of tuples, called event streams. The event streams have fixed relational schemas. Event patterns are expressed using an SQL-like query language, called the event language, which is based on a query algebra and has well-defined semantics. Event queries have the simple form:

SELECT [attributes]
FROM [stream_expression]
PUBLISH [output_stream]
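As a simple instance of this general form (a sketch only; the stream name APPSERVERSTREAM and the attribute names are hypothetical and merely anticipate the schema introduced in Table 1), a query that republishes selected attributes of every application server event would read:

SELECT NAME, ALARMTYPE, EVENTGENERATEDTIME
FROM APPSERVERSTREAM
PUBLISH APPSERVERALERTS

The PUBLISH clause makes APPSERVERALERTS available as a named stream that later queries can reference in their own FROM clauses, which is how complex patterns are built from simpler sub-patterns.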

Figure 2. Scenario describing the generation of an HDD I/O error by the application server monitoring agent when the cable connecting the storage array and the SAN switch snaps.

The stream_expression is constructed using (1) one unary construct, called FILTER, and (2) two binary constructs, called NEXT and FOLD. These constructs produce an output stream from one or two input streams. The FILTER construct selects those events from the event stream that satisfy the expression specified in the predicate of the FILTER expression. The binary constructs, NEXT and FOLD, allow us to correlate events across time. The general syntax for applying the NEXT operation to two streams, say S1 and S2, is S1 NEXT (predicate_expression) S2, where predicate_expression is the logical expression that combines an event e1 in stream S1 with the next event e2 in stream S2 occurring after the detection of e1 and satisfying the predicate expression. The FOLD operator allows iteration over an a-priori unknown number of events until a stopping criterion is satisfied, and has the general format FOLD (predicate_expression, stopping_criteria, aggregate_expression). The first parameter, predicate_expression, contains a logical condition for selecting the input events in the next iteration. The second parameter, stopping_criteria, provides the stopping condition. The third and last parameter, aggregate_expression, aggregates the computation across the different iterations.

To address reference ambiguity in event languages, special constructs called decorators, denoted using ‘$’, are used to identify the streams from which attributes are referenced. $1 is used to reference attributes in the first input stream of a binary construct, and $2 is used to reference attributes in the second input stream of a binary construct. Additionally, we use the $ symbol in the aggregate expression of the FOLD operator to refer to the attributes of the current iteration. The CONTAINS operator checks whether an input string is contained in a particular row value.
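Putting these constructs together, the following sketch follows the general forms just described (the stream names NETWORKSTREAM and APPSERVERSTREAM are hypothetical, the attribute names follow those used in Query 1 of Section 3.2, and the exact predicate syntax is an assumption rather than verified Cayuga syntax). It pairs each critical network alert with the next critical application server alert that follows it:

SELECT $1.NAME, $2.NAME, $2.EVENTGENERATEDTIME
FROM FILTER {ALARMTYPE = 'CRITICAL'} (NETWORKSTREAM)
     NEXT ($2.ALARMTYPE = 'CRITICAL' AND $1.EVENTGENERATEDTIME <= $2.EVENTGENERATEDTIME) APPSERVERSTREAM
PUBLISH CORRELATEDALERTS

The $1 decorator refers to the filtered network event and $2 to the subsequent application server event, so the published CORRELATEDALERTS stream carries attributes drawn from both inputs and can itself be consumed by further queries.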

3. INTEGRATED MONITORING AND MANAGEMENT

In this section, we introduce the complex event processing system as a powerful construct that integrates disparate monitoring tools and is capable of providing a holistic view of data center operations and performing root cause analysis. First, we describe the architecture of our integrated diagnostic framework. Next, we construct CEP queries for the two examples discussed earlier and show the expressiveness of CEP queries, the Cayuga system in particular, in addressing the nuances of data center operations.

3.1 Architecture

The solution architecture is shown in Figure 3. As can be seen from the diagram, it has three important components:

1. Data center monitoring tools for monitoring the storage, servers, network and applications.

2. An Event Collector engine that collects all the external events from the various monitoring tools and sends them to the CEP engine.

3. A complex event processing engine for correlating the seemingly disparate events.

All the events from the external monitoring tools, irrespective of their event status (success / failure), are sent to the Event Collector engine. The role of the Event Collector engine is to collect all the external events and convert them into a format (in terms of the syntax and the semantics) that the CEP engine can process further. These sets of events are branched under a tag called “Stream”, and the CEP engine receives these Streams for performing correlation. The CEP engine takes a CEP query as input, which contains patterns to match against the streams received from the Event Collector engine; the result of the correlation is displayed to the administrator, which helps them take the necessary actions for the root cause.

Figure 3. Architecture for integrated monitoring and management in data centers.
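To give a feel for how the CEP engine consumes this unified stream (again only a sketch; MONITORINGSTREAM is a hypothetical name for the stream published by the Event Collector engine, the attribute names follow those used in the queries of Section 3.2, and the infix placement of CONTAINS is an assumption rather than verified Cayuga syntax), a first filtering step that retains only critical alerts whose message text mentions an error could be written as:

SELECT NAME, MESSAGEDESCRIPTION, ALARMTYPE, EVENTGENERATEDTIME
FROM FILTER {ALARMTYPE = 'CRITICAL' AND MESSAGEDESCRIPTION CONTAINS 'ERROR'} (MONITORINGSTREAM)
PUBLISH CRITICALEVENTS

This is essentially the role that the filtered stream plays in the queries of Section 3.2, where the correlation queries operate only on the events that survive such a filter.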

3.2 CEP Query Formulation

The schema for the CEP query is based on the attributes listed in Table 1 below. These attributes are assumed to be obtained from the respective monitoring tools and passed on to the Event Collector engine. The attributes numbered from 6 – 10 are considered to be quantifiable attributes. The quantifiable attributes help to loop through the set of streams (the FOLD construct in both queries) and diagnose the root cause.

Table 1. CEP schema for various data center scenarios

S.No | Storage monitoring attributes | Network monitoring attributes | Application server monitoring attributes
1 | Source of Input | Source of Input | Source of Input
2 | Device Name – Storage Box name | Type of Network Device | Application Name
3 | Status of the Storage Box | Status of Network | Status of the server
4 | Message Description | Message Description | Message Description
5 | Alarm Type | Alarm Type | Alarm Type
6 | Total Storage Capacity (MB) | No of Packets / sec | Total CPU Utilization (Used / Available)
7 | Quota Used (MB) | Average Packet size | Total Memory Utilization (Used / Available)
8 | % of Storage Used | – | –
9 | Event Generated Time | Event Generated Time | Event Generated Time
10 | – | – | No of Servers serving the application
11 | Array name | Switch name | Server Names
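The quantifiable attributes are what the FOLD construct loops over in the queries that follow. As a sketch written against the general format FOLD (predicate_expression, stopping_criteria, aggregate_expression) from Section 2 (the stream name NETWORKSTREAM, the attribute spelling NOOFPACKETSPERSEC, and the specific predicate, stopping and aggregate expressions are all illustrative assumptions, not the authors' queries), one could track how long a critical network alert is followed by zero network traffic:

SELECT $1.NAME, $2.EVENTGENERATEDTIME AS LASTIDLETIME
FROM FILTER {ALARMTYPE = 'CRITICAL'} (NETWORKSTREAM)
     FOLD ($.NOOFPACKETSPERSEC = 0, $.NOOFPACKETSPERSEC > 0, $.EVENTGENERATEDTIME) NETWORKSTREAM
PUBLISH SUSTAINEDNETWORKFAULTS

The FOLD keeps iterating while the per-second packet count stays at zero, stops as soon as traffic resumes, and carries forward the event time of the current iteration, so the published stream reports the time of the last zero-traffic event observed after the initial alert.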

Table 2 provides example values for the different types of attributes discussed in Table 1. These values form the inputs to the Event Collector engine shown in Figure 3.

Table 2. Sample values from the monitoring tools when the SAN switch is down

S. No. | Values for the Storage Attributes | Values for the Network Attributes | Values for the App server Attributes
1 | Storage | Network | App Server
2 | SEAGATE HDD Jumbo | CISCO Switch WLAN | JBoss App server
3 | Active | Switch down – Not working | Inactive – HDD I/O Error
4 | Request timed out – Unable to connect to the network | Switch Error – port link status down | I/O Error – Unable to get response from switch
5 | Critical | Critical | Critical
6 | 500 TB | 0 KB/sec | 45/100
7 | 250 TB | 0 KB | 30/100
8 | 50% | NA | NA
9 | 2010-10-31 1:15:13 A.M | 2010-10-31 1:15:11 A.M | 2010-10-31 1:15:15 A.M
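Reading Table 2 against the NEXT construct from Section 2 helps to see why the switch, rather than the storage box or the application server, emerges as the root cause: the critical network alert (1:15:11 A.M) precedes both the storage timeout (1:15:13 A.M) and the HDD I/O error on the application server (1:15:15 A.M). A simplified pattern in this spirit (a sketch only, not the authors' Query 1 of Figure 4; the stream names and the predicate are illustrative) would be:

SELECT $1.NAME, $1.MESSAGEDESCRIPTION AS ROOTCAUSE
FROM FILTER {ALARMTYPE = 'CRITICAL'} (NETWORKSTREAM)
     NEXT ($2.ALARMTYPE = 'CRITICAL' AND $1.EVENTGENERATEDTIME <= $2.EVENTGENERATEDTIME) APPSERVERSTREAM
PUBLISH ROOTCAUSEALERTS

Any critical application server alert that arrives after a critical network alert is then reported together with the network device that raised the earlier alarm.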

For the examples discussed in Section 1, we now discuss the construction of SQL-like queries, called event queries, for locating matches to the complex event patterns that are the likely causes of these faults. Towards this purpose, we use an open-source general-purpose system called Cayuga [3] to design CEP queries and to process complex events on a large scale. The major advantage of using Cayuga over other CEP systems is that it supports online detection of a large number of complex patterns in event streams, apart from offering a unique combination of expressiveness and speed. It also helps to build complex patterns from simpler sub-patterns. The system also implements novel techniques for query processing, indexing and garbage collection, resulting in an efficient execution engine that processes data streams at very high rates. Additionally, Cayuga scales well with respect to the number of events monitored and the number of queries registered, which implicitly addresses a plethora of performance issues.

The basic premise underlying the design of the CEP queries proposed in this paper is that (1) it is possible to conjecture different failure scenarios a priori and identify the patterns/sequences of important events that lead to these failures, and (2) the stream processing language is expressive enough to construct event queries for each of these event patterns. Events generated by the different monitoring tools are processed by the stream processor and are matched against the event queries in order to determine the root cause of a fault.

Example 1: As shown in Figure 1, the anomalous behavior of the applications is a result of the faulty SAN switch. The CEP query exploits this fact, in combination with an in-depth analysis of the network monitoring alert, to identify the SAN switch as the root cause of the problem. MONITORINGSTREAM2 contains all the streams that are received from the Event Collector engine during the monitoring cycle interval. MONITORINGSTREAM1 contains the streams that are produced by the FILTER construct (that is, streams where the Alarm Type is CRITICAL and the Message Description contains ERROR). Figure 4 shows the CEP query (Query 1) for the anomaly:

SELECT NAME, DEVICESTATUS, MESSAGEDESCRIPTION, ALARMTYPE, EVENTGENERATEDTIME as Root-cause
FROM FILTER {DUR