An Alert Prediction Model for Cloud Infrastructure Monitoring

Sayanta MALLICK

Gaétan HAINS

Cheikh Sadibou DEME

January 2013

TR–LACL–2013–1

Laboratoire d'Algorithmique, Complexité et Logique (LACL), Université de Paris-Est Créteil (UPEC), Faculté des Sciences et Technologie, 61, Avenue du Général de Gaulle, 94010 Créteil cedex, France. Tel.: (+33)(1) 45 17 16 47, Fax: (+33)(1) 45 17 66 01

Laboratory for Algorithmics, Complexity and Logic (LACL)
University of Paris-Est Créteil (UPEC)
Technical Report TR–LACL–2013–1

S. MALLICK, G. HAINS, C. S. DEME
An Alert Prediction Model for Cloud Infrastructure Monitoring

© S. Mallick, G. Hains, SOMONE, January 2013

An Alert Prediction Model for Cloud Infrastructure Monitoring

Sayanta Mallick (1,2)   Gaétan Hains (1)   Cheikh Sadibou Deme (2)

(1) LACL, Université Paris-Est
61, Avenue du Général de Gaulle, 94010 Créteil, France

(2) SOMONE
Cité Descartes, 9, rue Albert Einstein
77420 Champs sur Marne, France

[email protected]
[email protected]
[email protected]

Abstract

A monitoring agent is built to capture monitoring parameters from a cloud IT infrastructure. Some of these parameters provide useful information for preventing downtime of a system. Predicting monitoring parameters is thus an important task for preventing system downtime, and it is very useful for planning system efficiency. In this paper we propose a predictive model that predicts monitoring parameters from cloud infrastructure logs. In our modelling approach we assume that the user is interested in specific monitoring parameters. Experiments have been conducted to evaluate the accuracy of our prediction model. Our results show the accuracy of the short-term prediction.


Contents

1 Introduction
2 Events
  2.1 Monitoring Events
3 Technical and Scientific Background
  3.1 Definition of monitoring terms
    3.1.1 Event
    3.1.2 Faults
    3.1.3 Alerts
  3.2 Published Research in IT Infrastructure Monitoring
    3.2.1 Markov Chain Based Model
    3.2.2 Service Level Objectives and Monitoring Data Analysis
    3.2.3 Anomaly Detection and Performance Monitoring
4 Monitoring Dataset and Parameter Selection
5 Modelling of Monitoring Parameters
  5.1 Modelling CPU Usage with State Distribution Model
    5.1.1 Defining the States of State Distribution Model
    5.1.2 Computing State Distribution
  5.2 Modelling CPU Usage with Maximum Frequent State
  5.3 Modelling the CPU Usage with Markov Chain
    5.3.1 Defining the States of the System
    5.3.2 Computing the Transition Probability
    5.3.3 Building the Transition Probability Matrix
    5.3.4 Markov Chain Stochastic Matrix in Virtualized CPU Usage
    5.3.5 State Transition Diagram of CPU Resource Utilization
    5.3.6 Computing the Distribution Vector of the System
  5.4 Calculating Prediction Error of Probability Distribution
  5.5 Linear Disk Model
  5.6 Cyclic Linear Disk Model
6 Alert Prediction
7 Evaluating Prediction Accuracy
  7.1 Relative Error
    7.1.1 Relative error of CPUprcrInterruptsPerSec Prediction
    7.1.2 Relative error of MemoryUsage Prediction
    7.1.3 Relative error of DiskUsage Prediction
    7.1.4 Safe Look-Ahead Time Window
  7.2 Result of Linear Disk Prediction
  7.3 Result of Cyclic Linear Disk Prediction
8 Alert Prediction Results
  8.1 Quality of Alert Predictions
9 Conclusion and Future work
10 Acknowledgement

1 Introduction

A monitoring application is built to capture monitoring parameters from a cloud IT infrastructure. These parameters fall into three categories: infrastructure, system and application. They are very important for monitoring the health of the system. A single parameter can correspond to a host in an infrastructure, e.g., CPU utilization above 80% or memory usage over 80%. A monitoring probe can capture thousands of monitoring parameters in a short time period; applying a predictive model to those parameters could forecast a problem in advance. Predictive models have proved useful in the area of system fault management. In this article we consider the following scenario. An infrastructure is under continuous monitoring; a system administrator is interested in identifying monitoring parameters and observing their predictable behaviour. In particular he is interested in forecasting alerts and threshold levels. In this article we describe a predictive modelling technique that takes monitoring parameters as input and forecasts parameter trends as output, for example, forecasts of the parameters responsible for monitoring system health. Specifically, we can apply our modelling techniques to predict parameter alerts. In our study the data is generated by monitoring probes which are active during one consecutive month on an infrastructure. The infrastructure consists of 5 virtual machines. Our focus is on parameters selected as critical to monitor by domain experts. One week of continuous monitoring generated over 120 monitoring parameters, with different types of values. Such a high volume of information is difficult to forecast manually, making the use of predictive algorithms for forecasting necessary.

2 Events

An event is simply anything which happened at some moment in time. This could be the ringing of a phone, the arrival of a train, or anything else that happened. In the context of computing, the term event is also used for the message which conveys what has happened and when it happened. Examples here could be a message indicating that a web page was requested from a server, which is sent to the system log; a message that a network link is down; or a message that the user pressed a mouse button, sent to the User Interface (UI). Event correlation establishes relations between different events; it is usually done to gain higher-level knowledge from the information in the events. Event management is managing the information in the events, e.g.:
• to identify extraordinary situations,
• to identify the root cause of a problem,
• or to make predictions about the future and to discover trends.

Various approaches to event correlation exist, and applications can be found in various areas, such as:
• market and business data analysis (e.g. detecting market trends),
• algorithmic trading (e.g. making predictions about the development of a stock price),
• fraud detection (e.g. detecting uncommon use patterns of a credit card),
• system log analysis (e.g. grouping similar messages, and escalating important events),
• or network management and fault analysis (e.g. detecting the root cause of a network problem).
Events are processed and managed by event management software.

2.1 Monitoring Events

An event is a change of a resource or monitored item, which can occur anywhere on IT resources and reflects the state of the monitored items. An event can be generated by software, hardware or peripherals. Event management or event correlation is a rather young research area; the terms are sometimes used ambiguously in research papers, and even more so in product marketing.

An event source generates events. For this report, the monitoring probe is often the event source; but in a more general view, an event source can be anything from a hardware device (e.g. a sensor) to a software process, service process or even a user. An event sink receives events. In practice, an event sink could be an event manager which processes the events, a ticketing system, or some process which takes further action, such as sending an incident email.

A raw event (primitive event) is an event generated by an event source, from outside the event correlation engine, for instance an event generated directly from a log message. An input event is an event at the input of a specific correlation operation or of the correlation engine itself. Conversely, an output event is generated at the output of a correlation operation or of the whole correlation engine. A compressed event is an event representing multiple identical events, but it does not contain the individual events it represents. A derived event (synthetic event) is an event which was generated by the correlation engine, e.g. because an event was generated during a given state of the correlation engine, because a given set of events occurred, or because a given event was not generated for a specified time (timeout event).


A composite event is an event generated by the correlation engine which contains multiple other events, e.g. raw, derived or composite events. A composite event usually represents some information on a higher level.

Content of an Event
An event message consists of the following items:
1. A message/notification
2. An event type
3. A host name or IP address
4. A time stamp
5. A severity
6. A status
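The six items above can be represented as a simple record type. The following is an illustrative sketch; the field names and example values are ours and not tied to any particular monitoring product:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical event record holding the six items listed above.
@dataclass
class MonitoringEvent:
    message: str         # message/notification text
    event_type: str      # e.g. "threshold", "availability"
    host: str            # host name or IP address
    timestamp: datetime  # time stamp of the event
    severity: str        # e.g. "warning", "critical"
    status: str          # e.g. "open", "acknowledged", "closed"

event = MonitoringEvent(
    message="CPU utilization above 80%",
    event_type="threshold",
    host="HOST 12",
    timestamp=datetime(2012, 8, 1, 13, 32, 53),
    severity="warning",
    status="open",
)
print(event.severity)  # warning
```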

3 Technical and Scientific Background

In this section we explore some of the scientific approaches to prediction, monitoring and analysis so far applied to monitoring parameter data. We provide a snapshot of the scientific methodologies used in prediction and forecasting from usage log data. First we introduce technical terms of monitoring, and then we discuss research methodologies in monitoring systems, server monitoring, virtual infrastructure monitoring, prediction and performance modelling.

3.1 Definition of monitoring terms

3.1.1 Event

In a monitoring system an event is an exceptional condition occurring in the operation of the infrastructure, hardware or software of the monitored systems [14]. This definition is extended to comprise exceptional conditions in applications, infrastructure and end-user systems. Please note the different use of the term in the context of event correlation.

3.1.2 Faults

Faults can be defined as the identification of main problems in a monitored system. They are a class of events that can be handled directly by the monitoring system. Faults can be grouped by duration as: (1) permanent faults, (2) intermittent faults, and (3) transient faults. A permanent fault persists in a service operation infrastructure until a service repair action takes place. Intermittent faults occur on a discontinuous and periodic basis, causing service downtime for short periods. However, frequently re-occurring intermittent faults significantly jeopardize service performance. Transient faults cause a temporary and minor degradation of service. They are usually repaired automatically [3].

3.1.3 Alerts

Alerts are external manifestations of failure symptoms. These symptoms are observed as alarms which notify of potential faults in the monitored system. These notifications can originate from monitoring agents via management protocol messages (e.g., SNMP protocol traps) and from monitoring systems, e.g., using commands such as ping, system log files, or character streams sent by external equipment [11].

3.2 Published Research in IT Infrastructure Monitoring

System-level virtualization is increasingly popular in current data center technologies. It enables dynamic resource sharing while providing an almost strict partitioning of physical resources across several applications running on a single physical host. However, running a virtualized data center at high efficiency and performance is not an easy task. Inside the physical machine, the virtual machine monitor needs to control the allocation of physical resources to individual VMs. Different published papers address the issues of modelling virtual capacity and resources. In early works, Markov chains were used in predicting software performance and in testing [13]. In statistical software testing the black box model has two important extensions. In a first step, sequences of defects in performance are stochastically generated based on a probability distribution; the probability distribution represents a profile of actual or anticipated use of the software. Second, a statistical analysis is performed on the test history. This enables the measurement of various probabilistic aspects of the testing process.

3.2.1 Markov Chain Based Model

Researchers have worked on determining a capacity supply model [8] for virtualized servers. In this model a server is first modelled as a queue based on a Markov chain; the effect of server virtualization is then analysed based on capacity supply with the distribution function of the server load. Gu et al. presented a new stream-based mining algorithm to achieve online anomaly prediction. Unlike anomaly detection, anomaly prediction requires performing continuous classification on future data based on a stream mining algorithm [2]. The stream mining algorithm presented in that paper shows how to apply the naive Bayesian classification method and Markov models to achieve the anomaly prediction goal. The proposed system continuously collects a set of runtime system-level and application-level measurements. Markov chains have also been used for prediction of wireless network performance. Katsaros et al. presented information-theoretic techniques for discrete sequence prediction [5]. The article addresses the issues of location and request prediction in wireless networks in a homogeneous fashion, and characterizes them as discrete sequence prediction problems. Pre-recorded system log data provides monitoring information about system behaviour. System log data analysis and symptom matching over large volumes of data is a challenging task on a virtual infrastructure. Collecting, analysing and monitoring large volumes of monitoring log data is important for monitoring and predicting system behaviour. A virtual IT infrastructure differs from a physical IT infrastructure: virtual infrastructure resource usage is dynamic in nature, and it is also difficult to monitor and predict its capacity resources. In our proposed system we use a Markov chain based prediction model to predict resource usage.

3.2.2 Service Level Objectives and Monitoring Data Analysis

Holub et al. present an approach to implement run-time correlation of large volumes of log data. The system is also designed to do symptom matching of known issues in the context of large enterprise applications [4]. The proposed solution provides automatic data collection and data normalisation into a common format. The system then performs runtime correlation and analysis of the data to give a coherent view of system behaviour in real time. This model also provides a symptom matching mechanism that can identify known errors in the correlated data on the fly.

Predicting system anomalies [10] in the real world is a challenging task. We need to predict virtual resource anomalies in order to know about virtual capacity problems. Tan et al. presented an approach that aims to achieve advance anomaly prediction with a certain lead time. The intuition behind this approach is that system anomalies often manifest as gradual deviations in some system metrics before the system escalates into the anomaly state. The proposed system monitors various system metrics called features (e.g., CPU load, free memory, disk usage, network traffic) to build a feature value prediction model. The model uses discrete-time Markov chain schemes. In a further step, the system also includes statistical anomaly classifiers using the tree augmented naive Bayesian (TAN) learning method. The proposed system is only used for monitoring physical infrastructure rather than a virtual or cloud infrastructure.

In order to use virtualization techniques properly, virtual machines need to be deployed on physical machines in a fast and flexible way. These virtual machines need to be monitored, automatically or manually, to meet service level objectives (SLOs). An important monitoring parameter in SLOs is high availability of virtualized systems. A high availability model for virtualized systems has been proposed [6]. This model is constructed on non-virtualized and virtualized two-host system models using a two-level hierarchical approach. In this model, fault trees are used at the upper level and homogeneous continuous time Markov chains (CTMC) represent the sub-models at the lower level.

In cloud computing, virtualized systems are distributed over different locations. Most distributed virtualized systems provide minimal guidance for tuning, problem diagnosis, and decision making to monitor performance. In this respect a monitoring system has been proposed: Stardust [12] replaces traditional performance counters with end-to-end traces of requests and allows efficient querying of performance metrics. These traces can inform key administrative performance challenges by enabling performance metrics such as the extraction of per-workload, per-resource demand information and per-workload latency graphs. The paper also proposes the idea of experience building. In this model, a distributed storage system is used for end-to-end online monitoring. Distributed systems have diverse system workloads and scenarios. This model showed that such fine-grained tracing can be made efficient, and that it can be useful for online and offline analysis of system behaviour. These methodologies could be incorporated into other distributed systems such as virtualization-based cloud computing frameworks.

A cloud computing infrastructure contains a large number of virtualized systems, applications and services with constantly changing demands for computing resources. Today's virtualization management tools allow administrators to monitor the current resource utilization of virtual machines. However, it is quite challenging to manually translate user-oriented service level objectives (SLOs), such as response time or throughput, and current virtualization management tools are unable to predict the demand for capacity resources. Prediction of capacity resources is a very important task for the availability of a cloud computing infrastructure. An adaptive control system has been presented that automates the task of tuning resource allocations and maintains service level objectives [9]. It focuses on maintaining the expected response time for multi-tier web applications. The control system is capable of adjusting the resource allocation for each VM so that the application's response time matches the SLOs. The approach uses each individual tier's response time to model end-to-end performance, which helps stabilize the application's response time. The paper suggests that it is possible to use intuitive models based on the observable response time incurred by multiple application tiers, in conjunction with a control system, to determine the optimal share allocation for the controlled VMs. The proposed system helps maintain the expected service response time while adjusting the allocation demand for different workload levels. Such behaviour allows an administrator to simply specify the required end-to-end service-level response time for each application, and helps manage the performance of many VMs.

A model named PRESS [1] has been presented for predicting service level objectives in cloud systems. There are several differences with our model. The main difference is that PRESS uses the Chapman-Kolmogorov equations for predicting future steps, whereas in our model we calculate the initial distribution vector based on the last state of our monitoring data and then multiply it with the transition matrix to get the future steps of our prediction. Furthermore, PRESS only considers near-future prediction on an hourly time scale, whereas we use our model for long-term forecasting in days as well as short-term forecasting in minutes. Moreover, PRESS is used for predicting non-repeating patterns, whereas we use our Markov model for predicting repeating states from the historical data; we have found that our model works well under repeating states. Lastly, the main objective of our prediction model is to forecast monitoring parameter alerts in advance, whereas PRESS is oriented more towards resource forecasting for SLO violations. So the application focus of our model is different from PRESS.
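The prediction step mentioned here, multiplying a distribution vector by the transition matrix to obtain future steps, can be sketched as follows. The 2-state matrix and the numbers are purely illustrative, not taken from our monitoring data:

```python
def next_distribution(pi, M):
    """One prediction step: multiply the distribution vector pi by matrix M."""
    n = len(pi)
    return [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]

# Illustrative 2-state example: currently certain to be in state 1.
pi0 = [1.0, 0.0]
M = [[0.9, 0.1],
     [0.3, 0.7]]
pi1 = next_distribution(pi0, M)
print(pi1)  # [0.9, 0.1]
pi2 = next_distribution(pi1, M)
print(pi2)  # approximately [0.84, 0.16]
```

Iterating this step yields the distribution over states k steps ahead.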


3.2.3 Anomaly Detection and Performance Monitoring

A cloud computing system is a service-driven architecture, and cloud services are based on the internet and the web. In the modern era, web-based vulnerabilities represent a substantial portion of the security loopholes of computer networks. This is also true for cloud computing infrastructure. In order to detect known web-based attacks, misuse detection systems are equipped with a large number of signatures. Unfortunately, it is difficult to keep up to date with the daily disclosure of web-related vulnerabilities, and, in addition, vulnerabilities may be introduced by installation-specific web-based applications. Therefore, misuse detection systems should be complemented with anomaly detection systems [7]. This paper presents an intrusion detection system that uses a number of different anomaly detection techniques to detect attacks against web servers and web-based applications. The system correlates the server-side programs referenced by client queries with the parameters contained in these queries. The application-specific characteristics of parameters allow the system to perform detailed analysis and produce a reduced number of false positive alerts. The system automatically generates parameter profiles associated with web application data (e.g., the length and structure of parameters). The paper introduces a novel approach to performing anomaly detection, using as input HTTP queries containing parameters. The work presented is novel in several ways. First of all, to the best of the authors' knowledge, it is the first anomaly detection system specifically tailored to the detection of web-based attacks. Second, the system takes advantage of the application-specific correlation between server-side programs and the parameters used in their invocation. Third, the parameter characteristics (e.g., length and structure) are learned from input data. Ideally, the system will not require any installation-specific configuration, even though the level of sensitivity to anomalous data can be configured via thresholds to suit different site policies. The proposed system has been tested on data gathered at Google Inc. and two universities in the United States and Europe. Future work will focus on further decreasing the number of false positives by refining the algorithms developed so far, and by looking at additional features. The ultimate goal is to be able to perform anomaly detection in real time for web sites that process millions of queries per day with virtually no false alarms. This model can be useful for reducing the number of false monitoring alerts in a virtualization environment.

4 Monitoring Dataset and Parameter Selection

A monitoring agent records the values of each parameter and stores them in a history database. This database remains open while the monitoring agent is running. The monitoring application provides a utility that enables the user to extract information from the database about a parameter based on its class, instance, and time period.


Information type   Description
Port number        port number on which the Agent listens
Host name          name of the host on which the Agent is running
Class              application class to which the data point belongs
Instance           application instance to which the data point belongs
Parameter          parameter to which the data point belongs
Value              value of the data point
Time and date      day, time, and date on which the data point was collected

Table 1: information stored in history database

The dump hist utility extracts the following information by default:
• hostname
• application class
• instance
• parameter
• time and date
• value

The following example is a portion of a historical dump output of CPUprcrInterruptsPerSec:

HOST 12/NT CPU.CPU 0/CPUprcrInterruptsPerSec
Wed Aug 1 13:32:53 2012  231.833
Wed Aug 1 13:33:53 2012  208.867
Wed Aug 1 13:34:54 2012  82.2459
Wed Aug 1 13:35:54 2012  156.317
Wed Aug 1 13:36:54 2012  189.683
Wed Aug 1 13:37:55 2012  199.129
Wed Aug 1 13:38:55 2012  201.167
Wed Aug 1 13:39:55 2012  160.75
Wed Aug 1 13:45:43 2012  140.589

The following example is a portion of a historical dump output of MemoryUsage:

HOST 12/NT HEALTH.NT HEALTH/MemoryUsage
Wed Aug 1 13:33:53 2012  48.0766
Wed Aug 1 13:35:53 2012  48.3334
Wed Aug 1 13:37:55 2012  51.0971
Wed Aug 1 13:39:55 2012  51.1478
Wed Aug 1 13:45:43 2012  61.985
Wed Aug 1 13:47:43 2012  58.8581
Wed Aug 1 13:49:43 2012  59.177
Wed Aug 1 13:54:48 2012  68.8183
Wed Aug 1 13:56:48 2012  76.8219
Wed Aug 1 13:58:48 2012  66.6149
Wed Aug 1 14:00:48 2012  66.0309
Wed Aug 1 14:02:48 2012  65.8916

The following example is a portion of a historical dump output of DiskUsage:

HOST 12/NT HEALTH.NT HEALTH/DiskUsage
Wed Aug 1 13:33:53 2012  44
Wed Aug 1 13:35:53 2012  44
Wed Aug 1 13:37:55 2012  44
Wed Aug 1 13:39:55 2012  44
Wed Aug 1 13:45:43 2012  44
Wed Aug 1 13:47:43 2012  44
Wed Aug 1 13:49:43 2012  44
Wed Aug 1 13:54:48 2012  44
Wed Aug 1 13:56:48 2012  44
Wed Aug 1 13:58:48 2012  44
Wed Aug 1 14:00:48 2012  44
Wed Aug 1 14:02:48 2012  44

The following example is a portion of a historical dump output of disk usage in MB. In the data below the first column is the time-stamp, the second column is the date, the third column is the hour, the fourth column is the host, the fifth column is the monitoring parameter name and the sixth column is the value.

Timestamp   Date        Hour   Host   Para  Val
1349864582  10-10-2012  11.23  Host1  Disk  8019
1349864642  10-10-2012  11.24  Host1  Disk  8019
1349864702  10-10-2012  11.25  Host1  Disk  8019
1349864762  10-10-2012  11.26  Host1  Disk  8020
1349864822  10-10-2012  11.27  Host1  Disk  8020

In our experiments we test the quality of the prediction induced from a particular domain of monitoring parameters. The monitoring application under our test comprises 5 virtual machines on a physical server. The physical server is hosted on the SOMONE cloud infrastructure. Monitoring agents active during 3 days generated over 150 parameters with a dump history of 200 MB containing different types of values. In our predictive model we target 4 parameters labelled as critical by domain experts. The target monitoring parameters are listed as follows:
1. CPUprcrInterruptsPerSec: its values indicate the CPU percentage of interrupts per second of a host; a high value triggers a CPU usage alert.


2. DiskUsage: its value indicates the saturation of a disk.
3. MemoryUsage: indicates the memory usage of a host.

5 Modelling of Monitoring Parameters

5.1 Modelling CPU Usage with State Distribution Model

In this section we describe how to create a simple state-based probability model for predicting resource utilization. The model can be used for predicting a set of resource utilization metrics, i.e. CPU, network, memory and disk usage. The model uses data gathered from the SOMONE cloud infrastructure, which runs a native VMware virtualization solution on top of real hardware. For now we show our simple state-based probability model for predicting CPU usage. The CPU usage measurement unit is "CPU usage percentage", whose level varies from 0% to 100%. We partition the measurement level into intervals and define each interval as a state of our model. Let s1, s2, ..., sn be the states of our system. Our model uses experimental data from the SOMONE virtual infrastructure. These experimental data have four collection intervals, i.e. day, week, month, and year. Each collection interval has its own collection frequency. For example, the monthly collection interval has a collection frequency of one day: the data are rolled up to create one data point for every day, giving 30 data points each month. Similarly we can define the collection frequency for the hour, day, week and month collection intervals.

5.1.1 Defining the States of State Distribution Model

We described above that the CPU usage measurement unit is CPU usage percentage. CPU usage varies from 0 to 30% in our collected data, so we partition the CPU usage level from 0 to 30%; our collected data contain no CPU usage level above 30%. Let x1, ..., xn be the different levels of CPU usage:

state x1 = CPU usage ∈ [0, 5)%
state x2 = CPU usage ∈ [5, 10)%
state x3 = CPU usage ∈ [10, 15)%
state x4 = CPU usage ∈ [15, 20)%
state x5 = CPU usage ∈ [20, 25)%
state x6 = CPU usage ∈ [25, 30)%

We observed the states x1, x2, x3, ..., xn of percentage CPU usage at discrete times t1, t2, ..., tn, where t is each day of the month.

5.1.2 Computing State Distribution

We compute the state-based probability vector distribution for each state. Our data contain no observations above 30%. To compute the state-based probability Pr(x1) of state x1, we count the total number of events going from x1 to x1..xn and divide by the total number of possible events, as described in the following equation:

Pr(x1) = n(x1) / n(S)

where n(x1) = total number of events going from x1 to x1..xn, and n(S) = total number of possible events. Similarly we can compute the state-based probability vector distribution of each state.
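This computation amounts to counting the relative frequency of each state in the observed sequence. A minimal sketch in Python follows; the function names and the sample readings are ours, not from our dataset:

```python
from collections import Counter

def usage_to_state(cpu_pct, width=5.0):
    """Map a CPU-usage percentage in [0, 30) to a 1-based state x1..x6."""
    return int(cpu_pct // width) + 1

def state_distribution(states, n_states=6):
    """Relative frequency Pr(x_k) = n(x_k) / n(S) over the observed states."""
    counts = Counter(states)
    return [counts.get(k, 0) / len(states) for k in range(1, n_states + 1)]

# Illustrative daily CPU-usage readings (percent)
readings = [2.1, 3.7, 6.4, 8.9, 7.0, 1.2, 4.4, 12.5, 6.1, 0.9]
states = [usage_to_state(r) for r in readings]
print(states)                      # [1, 1, 2, 2, 2, 1, 1, 3, 2, 1]
print(state_distribution(states))  # [0.5, 0.4, 0.1, 0.0, 0.0, 0.0]
```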

5.2 Modelling CPU Usage with Maximum Frequent State

We discussed in section §5.1.2 how to compute the state-based probability vector distribution for each state. In this section we discuss how we model CPU usage with the maximum frequent state. First we compute the state-based probability vector distribution of CPU usage Pr(x1), ..., Pr(xn), as discussed in section §5.1.2. In order to calculate the maximum frequent state of the probability vector distribution, we compute S = max{Pr(x1), ..., Pr(xn)}, where S is the predicted maximum frequent state of CPU usage and Pr(x1), ..., Pr(xn) is the state-based probability vector distribution of the CPU usage states x1, x2, x3, ..., xn. Similarly we can compute the maximum frequent state of memory, network and disk usage.
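The maximum frequent state is simply the argmax of the distribution. A minimal self-contained sketch, with illustrative probability values of our own:

```python
def max_frequent_state(distribution):
    """Return the 1-based state k whose probability Pr(x_k) is largest."""
    probs = list(distribution)
    return probs.index(max(probs)) + 1

# Illustrative distribution Pr(x1)..Pr(x6) of CPU-usage states
pr = [0.5, 0.4, 0.1, 0.0, 0.0, 0.0]
print(max_frequent_state(pr))  # 1, i.e. CPU usage most often in [0, 5)%
```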

5.3 Modelling the CPU Usage with Markov Chain

A virtual infrastructure has different components such as CPU, memory, IO, disk and network. Each component has different metrics to monitor resource utilization; for example, CPU has the CPU usage (average) and usage (MHz) metrics. Each metric has a unit of measurement. For example, the CPU usage metric's measurement is "CPU usage percentage", whose level varies from 0% to 100%. We partition the measurement level into intervals and define each interval as a state of our system. Let s1, s2, ..., sn be the states of our system. We have collected experimental data from the SOMONE virtual infrastructure. These experimental data have four collection intervals, i.e. day, week, month, and year. Each collection interval has a collection frequency. For example, the yearly collection interval has a collection frequency of one day: the data are rolled up to create one data point every day, giving 365 data points each year. Similarly we can define the collection frequency for the day, week and month collection intervals. For now we show our Markov chain model for CPU usage prediction.

5.3.1 Defining the States of the System

The CPU usage metric's measurement unit is the CPU usage percentage, which varies from 0% to 30% in our collected data. We therefore partition the CPU usage level from 0% to 30% in our experimental data. Let s1, . . . , sn be the different levels of CPU usage:


state s1 ∈ [0, 5)% CPU usage
state s2 ∈ [5, 10)% CPU usage
state s3 ∈ [10, 15)% CPU usage
state s4 ∈ [15, 20)% CPU usage
state s5 ∈ [20, 25)% CPU usage
state s6 ∈ [25, 30)% CPU usage

We observed the states s1, s2, s3, . . . , sn of percentage CPU usage at discrete times t1, t2, . . . , tn, where t is each day of the year.

5.3.2 Computing the Transition Probability

We compute the transition probability for each pair of states; our data contain no observation levels higher than 30%. For example, to compute the transition probability from s1 −→ s2 we compute the total number of successive events going from s1 to s2 divided by the total number of s1 events, which is described by the following equation: transition probability s1 −→ s2 = Pr[Xt+1 = s2 | Xt = s1]. Similarly we can compute the transition probability for every other pair of states.

5.3.3 Building the Transition Probability Matrix

We build an m × m transition probability matrix M = M(si)(sj), where M(si)(sj) = Pr[transition from si −→ sj]. The transition matrix estimated from our data is:

state      s1      s2      s3     s4      s5   s6
s1       0.8564  0.1436  0      0       0    0
s2       0.1628  0.8072  0.012  0.0060  0    0.0120
s3       0       1       0      0       0    0
s4       0       1       0      0       0    0
s5       0       0       0      0       0    0
s6       0.25    0       0      0.25    0    0.5

State s5 has zero transition probability to every other state, and it is not reachable from any other state. We therefore remove s5 from our system, so our new transition matrix becomes:

state      s1      s2      s3     s4      s5
s1       0.8564  0.1436  0      0       0
s2       0.1628  0.8072  0.012  0.0060  0.0120
s3       0       1       0      0       0
s4       0       1       0      0       0
s5       0.25    0       0      0.25    0.5

where the new state s5 is the old state s6, i.e. the 25 to 30% CPU usage level.
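The construction above (discretize CPU usage into 5% states, then count successive transitions and normalize each row) can be sketched as follows; `cpu_state` and `transition_matrix` are hypothetical helper names, and the sample series is illustrative, not the experimental trace:

```python
import numpy as np

def cpu_state(usage_pct):
    # Map a CPU usage percentage onto the 5%-wide states s1..s6
    # defined above (usage stays in 0-30% in the collected data);
    # returns a 0-based state index.
    return min(int(usage_pct // 5), 5)

def transition_matrix(usage_series, n_states=6):
    # Estimate Pr[X_{t+1} = sj | X_t = si]: count successive events
    # si -> sj and divide by the total number of si events.
    counts = np.zeros((n_states, n_states))
    states = [cpu_state(u) for u in usage_series]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for states that never occur are left as all zeros.
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

M = transition_matrix([2.0, 3.0, 7.0, 6.0, 4.0, 3.0])
```

Each non-empty row of the resulting matrix sums to 1, which is the stochastic property checked in the next subsection.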

5.3.4 Markov Chain Stochastic Matrix of Virtualized CPU Usage

M is a stochastic matrix:

state      s1      s2      s3     s4      s5
s1       0.8564  0.1436  0      0       0
s2       0.1628  0.8072  0.012  0.0060  0.0120
s3       0       1       0      0       0
s4       0       1       0      0       0
s5       0.25    0       0      0.25    0.5

A right stochastic matrix is a square matrix each of whose rows consists of non-negative real numbers, with each row summing to 1. For example, the first row sums to 0.8564 + 0.1436 = 1, and similarly every other row sums to 1. Therefore M satisfies the stochastic property of a Markov chain.
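The row-sum check described above can be automated; a small sketch, reproducing the estimated 5-state matrix (with the last entry of row s2 read as 0.0120 so that the row sums to 1, as the text states):

```python
import numpy as np

# The 5-state transition matrix estimated above, after removing the
# unreachable state.
M = np.array([
    [0.8564, 0.1436, 0.0,   0.0,    0.0],
    [0.1628, 0.8072, 0.012, 0.0060, 0.0120],
    [0.0,    1.0,    0.0,   0.0,    0.0],
    [0.0,    1.0,    0.0,   0.0,    0.0],
    [0.25,   0.0,    0.0,   0.25,   0.5],
])

def is_right_stochastic(M, tol=1e-6):
    # Every entry non-negative and every row summing to 1.
    return bool(np.all(M >= 0) and np.allclose(M.sum(axis=1), 1.0, atol=tol))
```

Such a check is a useful guard before iterating the chain, since rounding in reported probabilities can silently break the stochastic property.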

Figure 1: CPU Resource Utilization State Transition Diagram

5.3.5 State Transition Diagram of CPU Resource Utilization

A Markov chain system can be illustrated by means of a state transition diagram, a diagram showing all the states and the transition probabilities between them. The information from the stochastic matrix is shown in the form of a transition diagram: Figure (1) is the transition diagram of CPU resource utilization, which shows the five states of CPU resource utilization and the probabilities of going from one state to another. The state transition diagram is needed to check whether all states of the system are strongly connected. A strongly connected state transition diagram is important for the stability of the Markov chain prediction system, and it helps us understand the behaviour of the prediction system in the long run.


5.3.6 Computing the Distribution Vector of the System

A distribution vector is a row vector with one non-negative entry for each state in the system; the entries can represent the number of individuals in each state of the system. We compute the distribution vector as follows. Let M be a Markov chain over s states and let V1 be an initial distribution vector (v1, . . . , vs), which defines the distribution of V1 (p(V1 = si) = vi). After one move, the distribution changes to V2 = V1 M, where M is the stochastic matrix. We compute the initial distribution vector over the s states of CPU usage level based on the last entry of our historical data. Suppose we have 5 states of CPU usage s1, s2, s3, . . . , s5 and the last entry of our historical data falls in state s2; then the initial distribution vector is V = [0, 1, 0, 0, 0]. M is the stochastic matrix of virtualized CPU usage from §5.3.4. We multiply the initial distribution vector by the stochastic matrix M: the distribution vector after one step is the matrix product V M. The distribution one step later, obtained by again multiplying by M, is given by (V M)M = V M². Similarly, the distribution after n steps can be obtained by multiplying V by Mⁿ. In our monthly prediction system of CPU resource usage we compute up to 30 time steps, denoted by n; each time step is the average prediction of CPU usage for one day. Our prediction of CPU usage is thus a distribution vector over an n-dimensional space.
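The n-step computation Vn = V Mⁿ can be sketched directly; the matrix values below reuse the estimated 5-state matrix (with row s2 read so that it sums to 1), and `predict` is a hypothetical helper name:

```python
import numpy as np

M = np.array([
    [0.8564, 0.1436, 0.0,   0.0,    0.0],
    [0.1628, 0.8072, 0.012, 0.0060, 0.0120],
    [0.0,    1.0,    0.0,   0.0,    0.0],
    [0.0,    1.0,    0.0,   0.0,    0.0],
    [0.25,   0.0,    0.0,   0.25,   0.5],
])
# The last historical observation fell in state s2.
V0 = np.array([0.0, 1.0, 0.0, 0.0, 0.0])

def predict(V0, M, steps):
    # One multiplication by M per time step; in the monthly
    # prediction system one step corresponds to one day.
    return V0 @ np.linalg.matrix_power(M, steps)

V30 = predict(V0, M, 30)   # distribution after 30 days
```

After any number of steps the result remains a probability distribution over the five states, so its entries still sum to 1.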

5.4 Calculating Prediction Error of Probability Distribution

In the section above we discussed that the predicted CPU usage is a distribution over an n-dimensional space. To calculate the prediction error of the probability vector distribution, we use the Euclidean distance in n-dimensional space, computed between predicted and observed data: ⃗x is the predicted probability vector distribution of CPU usage over the n-dimensional space and ⃗y is the observed one, where x = (x1, x2, . . . , xn) and y = (y1, y2, . . . , yn) are points in Euclidean n-dimensional space. We compute ⃗y from our observed data as follows. ⃗y is the CPU probability vector distribution over the n-dimensional space of our observed data. The CPU usage metric's measurement unit is the CPU usage percentage, which varies from 0% to 30% in our observed data; we partition the CPU usage level from 0% to 30%, since we have no CPU usage level above 30% in our observed data. Let y1, . . . , yn be the different levels of CPU usage:

state y1 ∈ [0, 5)% CPU usage
state y2 ∈ [5, 10)% CPU usage
state y3 ∈ [10, 15)% CPU usage
state y4 ∈ [15, 20)% CPU usage
state y5 ∈ [20, 25)% CPU usage
state y6 ∈ [25, 30)% CPU usage

We observed the states y1, y2, y3, . . . , yn of percentage CPU usage at discrete times t1, t2, . . . , tn, where t is each day of the month.

The Euclidean distance between x and y, where x and y are positions of points in Euclidean n-space (i.e. Euclidean vectors), is given by

d(⃗x, ⃗y) = √((x1 − y1)² + (x2 − y2)² + · · · + (xn − yn)²) = √(Σⁿᵢ₌₁ (xi − yi)²)

where x = (x1, x2, . . . , xn) and y = (y1, y2, . . . , yn) are points in Euclidean n-space. In our error-comparison method we want to compare the prediction result with our observation. Our prediction results are generated as a probability vector distribution over the states, so the probability vector distribution lives in an n-dimensional state space. To compute the distance, or error, between the predicted and observed states we use ⃗x = predicted, ⃗y = observed and

d1 = √(Σⁿᵢ₌₁ (xi − yi)²)
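The error measure above is a plain Euclidean distance between the two probability vectors; a minimal sketch, with illustrative predicted and observed distributions:

```python
import math

def prediction_error(predicted, observed):
    # Euclidean distance between the predicted and observed
    # probability vector distributions over the n-dimensional
    # state space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(predicted, observed)))

# Illustrative 3-state distributions.
err = prediction_error([0.6, 0.3, 0.1], [0.5, 0.4, 0.1])
```

A distance of 0 means the predicted distribution matches the observed one exactly; larger values indicate larger prediction error.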

5.5 Linear Disk Model

This section describes the modelling of the disk monitoring parameter. We collect the monitoring data from the company's monitoring software. After analysing the monitoring data we found that disk usage increases steadily, and moreover linearly, over time. Therefore our model uses a linear regression method for disk usage prediction, based on the relationship between disk usage and time gathered from the monitoring logs. The disk usage values are collected at a frequency of one minute. To find the relationship between disk usage and time we use a historical dump. Using values from the time stamps, we form a linear equation which calculates disk usage as a linear function of time:

yi = p1 · xi + p2

where xi is the time of sample i. We use the Matlab polynomial curve fitting function to obtain the coefficients p1, p2. The polynomial curve fitting function calculates the coefficients of a polynomial p(x) of degree n that fits the data p(x(i)) to y(i) in a least-squares sense. Here p is a row vector of length n + 1 which contains the polynomial coefficients in descending powers:

p(x) = p1 xⁿ + p2 xⁿ⁻¹ + · · · + pn x + pn+1

In our first experiment we found the values p1 = 0.0017717 and p2 = −2.3835; yi is the predicted disk usage at time interval i.
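The same degree-1 least-squares fit can be sketched with NumPy's `polyfit`, which like the Matlab function returns coefficients in descending powers; the sample data below is illustrative, not the experimental trace:

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])          # sample times
disk = np.array([10.0, 12.1, 13.9, 16.0, 18.1])  # disk usage (MB)

# Degree-1 polynomial fit: returns [slope p1, intercept p2].
p1, p2 = np.polyfit(t, disk, 1)

def predict_disk(t_future):
    # Linear disk model: y = p1 * x + p2.
    return p1 * t_future + p2
```

Once p1 and p2 are fitted from the historical dump, `predict_disk` extrapolates disk usage for any future time stamp.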

5.6 Cyclic Linear Disk Model

In this sub-section we briefly describe some disk-usage scenarios and explain our new Cyclic Linear model. In this model we adapt to the disk-usage scenarios of disk de-fragmentation, disk clean-up, manual removal of disk data and manual addition of disk hardware.

• Disk de-fragmentation is the process of consolidating fragmented files on a computer's hard disk. It rearranges the data on the hard disk and reunites fragmented files so the system can run more efficiently. During the experiment we observed a sudden decrease of disk usage. • Disk clean-up is a computer maintenance process that frees up space on a computer's hard drive. The utility first searches and analyses the hard drive for files that are no longer of any use, and then removes the unnecessary files. A decrease of disk usage is seen during that process. • Manual removal of disk data happens under user control. It is obviously similar to disk clean-up, and a decrease of disk usage is seen during that process. 1

We collected data during the four scenarios. After analysing the data, we found that disk usage increases steadily over time, but as soon as we applied one of the three disk-usage scenarios above there was a sudden decrease of disk usage. Our Cyclic Linear model exploits these sudden decreases: the algorithm analyses the disk-usage data pattern to detect whether the data has decreased over time, and if so, it applies the linear prediction model from that point. Our algorithm is shown below:

if disk usage until tnow is non-decreasing then
  apply linear prediction (from initial time t0)
else
  apply linear prediction from tdrop = max{t | tbegin ≤ t < tnow and DiskUsage(t) − DiskUsage(t − 1) < 0}
end if
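The algorithm above can be sketched as follows; `cyclic_linear_fit` is a hypothetical helper that finds the most recent drop point and refits the line from there, under the assumption (as in the Linear Disk Model) that a degree-1 least-squares fit is used:

```python
import numpy as np

def cyclic_linear_fit(usage):
    """Cyclic linear model sketch: if disk usage ever decreases
    (clean-up, de-fragmentation, manual removal), fit the line only
    from the most recent drop point t_drop; otherwise fit from t0."""
    drops = [t for t in range(1, len(usage)) if usage[t] < usage[t - 1]]
    start = max(drops) if drops else 0        # t_drop, or t0 if monotone
    t = np.arange(start, len(usage))
    p1, p2 = np.polyfit(t, usage[start:], 1)  # slope, intercept
    return p1, p2, start

# Illustrative series: a clean-up causes the drop at t = 4.
usage = [10, 12, 14, 16, 5, 7, 9]
p1, p2, start = cyclic_linear_fit(usage)
```

Refitting only from the drop point keeps the model from averaging over the pre-clean-up trend, which is what lets the cyclic model adapt to maintenance cycles.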

6 Alert Prediction

Alert prediction is used in network management for predicting the health of managed networks. The main principle remains similar to that of alarms in security. Alerts are external manifestations of faults, where a fault is a disorder occurring with respect to the threshold limit of a managed monitoring parameter. Alert prediction is used for network fault isolation and diagnosis, selective corrective actions, proactive maintenance and trend analysis. In this section we propose an alert prediction model. Our model generates an alert when a violation of a threshold limit is predicted. 1 We observed during our experiment that manually adding new disk hardware and disk space creates a sudden decrease of relative disk usage. Our company gave us the opportunity to experience system maintenance in a day-to-day environment. We saw from this experience that a system user generally cleans up disk data manually after a back-up, or when the data is no longer used in a production environment, and we observed a decrease of disk usage during that process.


PA = probability of alert. PA(t) = probability of alert over time t, where t ≤ tnow + δtsafe, tnow is the time when we start our prediction and δtsafe is the safe look-ahead time. param ∈ {CPUprcrInterruptsPerSec, DiskUsage, MemoryUsage, . . .} are the monitoring parameters that we want to predict. A threshold is a value range above which a parameter value is an alert. In our experiment the threshold of the DiskUsage parameter is between 60% and 100%, as shown in the equation below:

ThreshDiskUsage ∈ [60%, 100%)

We compute the probability of alert in the following way:

PAparam(t) = Σ_{v ≥ Threshparam} Pr(param = v)

An alert at time t is raised when the parameter value overshoots 60% for disk usage:

Alert(t) = (PAparam(t) ≥ 60%)
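The alert probability is simply the total predicted probability mass in the states at or above the threshold; a sketch under the assumption that the 60-100% range is covered by two illustrative states (the state labels and values below are hypothetical):

```python
def alert_probability(state_probs, threshold_states):
    # PA_param(t) = sum of Pr(param = v) over all values v at or
    # above the threshold, i.e. over the threshold states.
    return sum(p for state, p in state_probs.items()
               if state in threshold_states)

# Illustrative predicted distribution; s5 and s6 are assumed to
# cover the [60%, 100%) DiskUsage threshold range.
probs = {"s1": 0.1, "s2": 0.2, "s3": 0.1, "s4": 0.1, "s5": 0.3, "s6": 0.2}
pa = alert_probability(probs, {"s5", "s6"})
alert = pa >= 0.6   # Alert(t) raised when PA reaches the 60% level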

7 Evaluating Prediction Accuracy

7.1 Relative Error

We ran several prediction tests with our three different prediction models: 1. the State Based Probability model, 2. the Maximum Frequent State model and 3. the Markov Chain model. We show the results of monitoring-parameter prediction, i.e. CPUprcrInterruptsPerSec, memory and disk usage, as probability distributions. The results of our predictions are reported in terms of relative error. Each relative-error graph shows the relative error of a monitoring parameter on a minute time scale; relative errors are measured on a 0 to 100% scale. In our prediction models we used monitoring traces from the SOMONE cloud infrastructure as historic data, which is used to predict a monitoring parameter. We use back-testing techniques to check the accuracy of our predictions, and the resulting accuracy is shown in terms of relative error.

7.1.1 Relative error of CPUprcrInterruptsPerSec Prediction

In figure (2) the X axis is time in minutes and the Y axis is the relative error of CPUprcrInterruptsPerSec. The relative errors of the State Based Probability model, the Maximum Frequent State model and the Markov Chain model are shown as red, sky blue and green dots in the graph respectively.

7.1.2 Relative error of MemoryUsage Prediction

In figure (3) the X axis is time in minutes and the Y axis is the relative error of MemoryUsage. The relative errors of the State Based Probability model, the Maximum Frequent State model and the Markov Chain model are shown as red, sky blue and green dots in the graph respectively.

Figure 2: Relative Error CPUprcrInterruptsPerSec

Figure 3: Relative Error MemoryUsage

Figure 4: Relative Error DiskUsage

7.1.3 Relative error of DiskUsage Prediction

In figure (4) the X axis is time in minutes and the Y axis is the relative error of DiskUsage. The relative errors of the State Based Probability model, the Maximum Frequent State model and the Markov Chain model are shown as red, sky blue and green dots in the graph respectively.

Figure 5: Safe Look-Ahead Window CPUprcrInterruptsPerSec

7.1.4 Safe Look-Ahead Time Window

After analysing our prediction results we suggest considering a "safe look-ahead window" for each monitoring parameter. The safe look-ahead window is a forward look-ahead time during which the prediction error is less than 20% for a monitoring parameter. The safe look-ahead period depends on the parameter but is relatively stable across our experiments. We set it to a pessimistic value of 5 minutes, and within this forward time window our prediction models have been shown to be reliable. We plot the safe look-ahead period in the graphs below; in each safe look-ahead graph, the x axis is the safe look-ahead time in minutes and the y axis is the relative error of prediction.
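One way to derive the safe look-ahead window from an error curve is to take the longest forward horizon over which every prediction stays under the 20% bound; a sketch with a hypothetical error series (one value per minute of look-ahead):

```python
def safe_look_ahead(relative_errors, max_error=20.0):
    """Return the number of leading time steps whose relative error
    stays strictly under max_error (the 20% bound used above)."""
    for step, err in enumerate(relative_errors):
        if err >= max_error:
            return step
    return len(relative_errors)

# Illustrative per-minute relative errors for one parameter.
window = safe_look_ahead([2.0, 5.0, 8.0, 12.0, 19.0, 31.0, 40.0])
```

Here the error first reaches 20% at minute 5, so the safe look-ahead window for this hypothetical parameter is 5 minutes.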


Figure 6: Safe Look-Ahead Window MemoryUsage

Figure 7: Safe Look-Ahead Window DiskUsage

The maxFreq state model predicts the CPUprcrInterruptsPerSec parameter with a safe look-ahead period of 8 minutes, see figure (5). It predicts the MemoryUsage parameter with a safe look-ahead period of 20 minutes, see figure (6). Our experiments show that the maxFreq state model predicts the DiskUsage parameter with a safe look-ahead period of 20 minutes, see figure (7).

7.2 Result of Linear Disk Prediction

Figure 8: Linear Disk Model Prediction

We plot the prediction result of our linear disk model in figure (8). In this graph the x axis is time in minutes and the y axis is disk usage in megabytes. From our experiment we conclude that our disk usage prediction is nearly perfect. This result depends on a steadily increasing dataset, which is what happens in day-to-day system usage and maintenance. However, during system maintenance users also clean up the disk, and this cycle repeats over time. As a result our linear model needs improvement; in the next sub-section we discuss a new model that adapts to this situation.

7.3 Result of Cyclic Linear Disk Prediction

We plot the prediction result of our cyclic linear disk model in figure (9). In this graph the x axis is time in minutes and the y axis is disk usage in megabytes. In this test we modelled a disk clean-up scenario in day-to-day system usage. The test concludes that our disk usage prediction has nearly perfect accuracy. This result depends on a dataset that increases steadily during day-to-day disk usage and decreases suddenly due to disk clean-up during regular system maintenance. We observed that this cycle repeats over time; as a result we created a new cyclic model to address this scenario.

Figure 9: Cyclic Linear Disk Model Prediction

8 Alert Prediction Results

Figure 10: Predicted Alert DiskUsage

In this section we present the results of alert prediction. The results of alert prediction depend on the monitoring parameters and on the usage of the system environment. This set of experiments focuses on predicting alerts (e.g., prolonged CPU interrupts, high memory and disk usage) caused by various faults such as insufficient resources and bottlenecks in the system. We show the results of alert prediction as a probability on a 0 to 1 scale; on that scale any predicted parameter value between 0.5 and 1 is considered an alert.

Figure (10) shows the alert prediction of disk usage, where the x axis is time in minutes and the y axis is the probability of an alert. In the graph the red line is the predicted alert and the green line is the observed alert. In the disk alert prediction we observe one disk alert within the safe look-ahead window of 20 minutes, and our Markov model predicts a 90% chance of a disk usage alert, so the observed and predicted alerts agree almost perfectly within the safe look-ahead window. We conclude that our Markov disk alert prediction model could serve as a predictive alert system.

Figure 11: Predicted Alert MemoryUsage

Figure 12: Predicted Alert CPUprcrInterruptsPerSec

Figure (11) shows the alert prediction of memory usage, where the x axis is time in minutes and the y axis is the probability of an alert. In the graph the red line is the predicted alert and the green line is the observed alert. In the memory alert prediction we observe no alert within the safe look-ahead window of 20 minutes, and our Markov model predicts a 20% chance of an alert, so the observed and predicted alerts again agree almost perfectly within the safe look-ahead window. We conclude that our Markov memory alert prediction model could serve as a predictive alert system.

Figure (12) shows the alert prediction of CPUprcrInterruptsPerSec, where the x axis is time in minutes and the y axis is the probability of an alert. In the graph the red line is the predicted alert and the green line is the observed alert. In the CPUprcrInterruptsPerSec alert prediction we observe three alerts within the safe look-ahead window of 20 minutes, and our Max Frequent model predicts a 100% chance of an alert. Although we cannot show that our alert prediction model yields the best results, we can show that its results improve as we increase the size of the historic data used for prediction. The Max Frequent model therefore needs improvement in our future research.

Figure 13: Predicted Disk Alert with Linear Disk Model

Figure 14: Predicted Disk Alert with Cyclic Linear Disk Model

Predicted Disk Alert with Linear Disk Model. We plot the prediction result of our linear disk model in figure (13). In the graph the x axis is time in minutes and the y axis is disk usage in megabytes. The disk usage data has been collected from the SOMONE cloud infrastructure. In this test our model predicts disk usage alerts using the Linear Disk Model. The model predicts an alert when the disk usage value overshoots 8030 MB; alerts are shown in red. In the graph the observed disk alerts are shown as red stars and the predicted disk alerts as a red line, whereas the predicted normal disk usage is shown as green stars and the observed normal disk usage as a green line.

Predicted Disk Alert with Cyclic Linear Disk Model. We plot the prediction result of our cyclic linear disk model in figure (14). In the graph the x axis is time in minutes and the y axis is disk usage in megabytes. In our test we modelled scenarios such as 1. disk de-fragmentation, 2. disk clean-up and 3. manual clean-up of disk data; a synthetic dataset has been developed for these scenarios. In this test our model predicts disk usage alerts using the Cyclic Linear Disk Model. The model predicts an alert when the disk usage value overshoots 8030 MB; alerts are shown in red. In the graph the observed disk alerts are shown as red stars and the predicted disk alerts as a red line; furthermore the predicted normal disk usage is shown as green stars and the observed normal disk usage as a green line.

8.1 Quality of Alert Predictions

We tested our proposed prediction models using CPU, memory and disk data. We summarize the quality of alert prediction in the table below.


Parameter                             Time scale  Prediction model     Safe look-ahead time  Relative error  Predicted alerts
Memory usage percentage               minutes     Markov Model         20 mins               2%              0 out of 0
Disk usage                            minutes     Markov Model         20 mins               15%             1 out of 1
CPU percentage interrupts per second  minutes     maxFreqState         12 mins               0%              3 out of 3
Disk usage in MB                      minutes     Linear Model         250 mins              NA              14 out of 14
Disk usage under system maintenance   minutes     Cyclic Linear Model  500 mins              NA              28 out of 28

9 Conclusion and Future work

Cloud and critical IT systems demand 24 by 7 system availability. One of the most important benefits of predictive alert monitoring is the ability to forecast system alerts in advance, which has great implications for preventive measures in cloud and critical IT systems. The main contribution of this paper is an experimental implementation of predictive modelling for forecasting system alerts. Our model has been adapted to each monitoring parameter. Our tests have shown that each model can give good results for predicting alerts when applied to the immediate future; this makes the evaluation of the models context dependent. On the other hand, our models treat monitoring data from black-box systems without knowing the details of their underlying hardware configuration, which makes them independent of any hardware and vendor specification. There are clear differences in alert prediction. The alert prediction models of disk and memory usage show regular behaviour, whereas the CPUprcrInterruptsPerSec alert prediction model is less stable and exhibits rather irregular behaviour. As a consequence we consider the modelling of CPUprcrInterruptsPerSec a hard problem in need of improvement. The memory-usage alert prediction, on the other hand, behaves nicely; this is most likely because of the method by which memory utilization data is gathered. Many OSs use up almost all of the memory in the system for things like caches, so memory utilization is reported at around 85%, but our alert prediction model still works well during the safe look-ahead window. We observe from our experiments that disk usage increases steadily over time and depicts a linear behaviour, but as soon as a disk clean-up is applied the disk usage decreases drastically, and during disk maintenance this behaviour repeats periodically.

The experimental results show that our Cyclic Linear model is able to adapt to these scenarios. Our experiments also show that our fairly simple models work well for alert prediction within the safe look-ahead time window. We envision that alert prediction models of this type could be used to warn system administrators and system managers of upcoming system disruptions and to trigger advance alerts. The current prediction tests should also be confirmed on more extensive datasets and in production scenarios. In the future we will build predictive alert models based on application and service monitoring metrics; moreover, we will correlate alerts across different prediction models.

10 Acknowledgement

The first author thanks SOMONE S.A.S for an industrial doctoral scholarship under the French government’s CIFRE scheme. The first and the second authors thank SOMONE for access to the cloud infrastructure on which we have conducted the experiments.

