Distributed Network Monitoring and Anomaly Detection as a Grid Application

V. Chatzigiannakis, A. Lenis, C. Siaterlis, M. Grammatikou, D. Kalogeras, S. Papavassiliou & V. Maglaris
{vhatzi, anglen, csiat, mary, dkalo, papavassiliou, maglaris}@netmode.ntua.gr
Network Management & Optimal Design Laboratory (NETMODE), School of Electrical & Computer Engineering
National Technical University of Athens (NTUA), 9 Iroon Polytechniou str., Zografou, Athens, Greece

Abstract

In this paper an anomaly detection system based on Grid middleware that supports the efficient, scalable and secure monitoring of multiple instruments and sensors is proposed and investigated. The Grid provides the means to control the sensors and gather information with security and reliability. The system includes a Decision Support Service that fuses multi-metric data from heterogeneous sensors to produce a view of the network state. The proposed fusion algorithm is based on the application of Principal Component Analysis on multi-metric data, and provides an efficient way of taking into account the combined effect of the correlated observed data for anomaly detection purposes. The performance and operational effectiveness of our proposed anomaly detection approach is evaluated via modeling and simulation, and is compared against corresponding techniques that are based on single-metric analysis.

Keywords: Grids, Security Management, Network monitoring, Anomaly detection, Principal Component Analysis

1. Introduction

One of the main challenges in security management of large high-speed networks is the detection of suspicious anomalies in network traffic patterns due to Distributed Denial of Service (DDoS) attacks or worm propagation. Network managers would ideally protect their resources with a simple command that could install filters on the network perimeter and collect information about the malicious source(s) of such attacks. In this context, we envisage a variety of distributed sensors that exchange information through a common automated platform and assist the anomaly detection algorithms by sharing their views. By correlating data from sensors assigned to different network elements, network administrators of a wide area network (WAN) could identify an attack path (or tree) and apply countermeasures near the anomaly sources by tuning the appropriate instruments, i.e. firewalls. An even more sophisticated scenario involves cooperation and information sharing among multiple administrative domains through an automated common platform. Currently, the alternative, which is based on off-line interaction of network operators (a human web of trust, e.g. the Computer Security Incident Response Teams - CSIRTs), is non-automated and prohibits prompt reactions.

However, as networks and their performance/utilization dynamics become exponentially more complex and the sources and scope of network anomalies multiply, human network operators find it increasingly difficult to analyze and recognize network anomalies and/or performance degradations in real time, accurately and reliably. In the near future, given the stringent network reliability requirements, it may well be impossible, even in theory, to rely heavily on human operators for detection and diagnosis.

At the same time, next generation Grids aim to extend traditional Grid capabilities that provide batch access to distributed computational and storage resources. They are intended to become self-managing, self-repairing, fault-tolerant and scalable, based on service-oriented computing principles. In this context, one of the most challenging goals is the development of real-time monitoring and control infrastructures on top of Grids [5]. This is practically an extension of Decision Support Systems (DSS) into a "Gridified" mission-critical environment, where Web Services based middleware will support the efficient, scalable and secure monitoring of multiple instruments and sensors. Beyond plain monitoring, the middleware will enable users to analyze the gathered information as well as tune and control the instruments/sensors in real time.

In this work, we view these "Remote Instrumentation Grids" (RIGs) as platforms for developing distributed network monitoring and anomaly detection systems. Specifically, we propose the use of Grid infrastructure for an anomaly detection system that correlates data from heterogeneous sensors spread throughout the network. The Grid provides the means to control the sensors and gather information with security and reliability. At the same time, the system possesses a Decision Support Service that applies Principal Component Analysis (PCA) to multiple metrics received from the network elements. We demonstrate via modeling and simulation that, by applying PCA on multiple metrics simultaneously, our proposed methodology significantly improves anomaly detection capabilities compared against anomaly detection approaches that are based on single-metric analysis.

The remainder of this paper is organized as follows. In Section 2 the proposed system architecture and its corresponding building blocks are presented, while in Section 3 a data fusion technique based on the application of PCA in a multi-link, multi-metric environment is introduced and described. The performance and operational effectiveness of our proposed anomaly detection approach is evaluated via modeling and simulation in Section 4, while Section 5 concludes the paper.

2. Architecture

Building a distributed monitoring and anomaly detection application that covers a large network and provides different views of the monitored infrastructure to users based on their rights (trusted Network Operation Centres - NOCs) involves many challenges. First, the various sensors must share a common communication protocol. Second, their intercommunication has to be secure and efficient; privacy issues may require the transmitted data to be encrypted, and the overhead of management traffic should be kept minimal. Even with an efficient communication protocol this would be hard to achieve, as a large number of sensors reporting their data to a central node leads to heavy communication overhead. To meet these challenges we propose an architecture, as shown in Figure 1, that consists of the following building blocks:

• Anomaly Sensors, measuring various network elements and linked to Grid-controlled Instrument Managers within a per-domain real-time Service for Monitoring & Control. We refer to this service as the Virtual Instrument Grid Service (VIGS), defined in the EU 6th Framework IST project GRIDCC [7].

• The communication network, playing the role of an overlay Grid service between different domains, thus allowing access to interconnected VIGSs.

• The Decision Support Web Service (DSWS), providing algorithms aimed at fusing the collected knowledge without shrinking the inference mechanism to a simplistic processing of independent sensor views. This web service analyses individual domain state reports, possibly originating from heterogeneous sensors, to deduce a global view of security incidents. The DSWS is also invoked to publish findings on security threats to human operators at NOCs and CSIRTs, who may undertake specific countermeasures against the detected anomaly.

A critical constraint of such a system is the need for near real-time detection. Therefore it is imperative that the Grid middleware provides a reliable mechanism that can deliver messages swiftly. This mechanism should ensure that the reports from the VIGSs are delivered in time rather than reliably, and in that sense it differs from the mechanisms used in traditional Grid middleware. Control and reaction messages, however, should be delivered with high reliability.

Another function of the VIGS is the control and monitoring of the sensors. In a large-scale network, the great number of sensors and network elements leads to inefficient management. The Grid middleware could assist in the configuration of the network elements through the transmission of control messages. In the control room, it is possible to monitor, control and fine-tune the VIGS and the network elements centrally. Especially during an attack, when time is critical, the administrator should be able to easily adjust configuration parameters in the network elements. Depending on the situation, there is even the possibility for automated reaction. Moreover, with the use of the Data Mining tool, the messages can be gathered and stored for off-line analysis as well.

Following this paradigm we can expand the use of a network monitoring and anomaly detection application into multi-domain environments. A NOC acting as a Virtual Organization within the Grid may have access to neighboring domain sensors. Secure deployment of countermeasures is assisted by the existing Grid security Authorization & Authentication Infrastructure (AAI). Encryption and trust issues are also resolved by the Public Key Infrastructure of the Grid environment. The challenge of communication overhead is addressed by the use of data reduction techniques. In contrast to a centralized architecture where sensors would constantly report all measurements to a single collector centre, we propose the use of data fusion algorithms as a method to compact collected knowledge. Every VIGS processes data received from the sensors locally and transmits reports to the appropriate DSWS, along with any additional incident information that might be required.
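As an illustration of the kind of compact, locally processed report a VIGS could forward to the DSWS, the following Python sketch builds a small XML state report; the element names, metric set and attribute layout are our own assumptions, since the paper does not specify a message schema.

# Hypothetical VIGS-to-DSWS state report; the schema (tag names, attributes)
# is illustrative only -- the paper does not fix a message format.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def build_state_report(domain, link_id, metrics):
    """Build a compact per-link state report as an XML string."""
    report = ET.Element("vigsReport", domain=domain,
                        timestamp=datetime.now(timezone.utc).isoformat())
    link = ET.SubElement(report, "link", id=link_id)
    for name, (value, indication) in metrics.items():
        ET.SubElement(link, "metric", name=name,
                      value=str(value), indication=f"{indication:.2f}")
    return ET.tostring(report, encoding="unicode")

# Example: two locally processed metrics with fuzzy anomaly indications
xml_report = build_state_report(
    domain="ntua.gr", link_id="core-1",
    metrics={"syn_fin_ratio": (3.4, 0.82), "flows_per_sec": (1250, 0.15)},
)
print(xml_report)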


Figure 1: Network monitoring and anomaly detection on a real-time Grid


3. Anomaly Detection

3.1 Data Gathering

The first step towards the realization of the architecture described in the previous section is the adjustment of several sensors [4],[5] to report XML data that can be handled by the Web-Services based middleware. The sensors utilized in our environment are based on different network monitoring technologies, like packet capturing, Netflow exports and SNMP MIBs, and measure different metrics (e.g. ICMP traffic ratios, number of short flows, SYN/FIN ratio, flow generation rate etc.) that characterize the state of a network element. They are tailored to identify anomalies for specific traffic components and have fuzzy outputs (instead of binary thresholds) that can be interpreted as anomaly indications. Temporal correlation of the measured values with the use of statistical methods, e.g. adaptive thresholds or an exponential weighted moving average, is also supported.
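As a minimal sketch of the temporal correlation step mentioned above, the following Python snippet applies an exponentially weighted moving average with an adaptive threshold to one metric stream and emits a fuzzy anomaly indication in [0, 1]; the smoothing factor, threshold width and warm-up length are illustrative assumptions rather than values used by the sensors in the paper.

# Sketch only: EWMA-based adaptive threshold producing a fuzzy anomaly
# indication for one sensor metric (e.g. a SYN/FIN ratio series). The
# smoothing factor alpha, threshold width k and warm-up length are assumptions.
import math

def ewma_indication(samples, alpha=0.1, k=3.0, warmup=5):
    """Yield a fuzzy anomaly score in [0, 1] for each sample."""
    mean, var, n = 0.0, 0.0, 0
    for x in samples:
        if n < warmup:                          # learn a baseline before scoring
            score = 0.0
        else:
            sigma = math.sqrt(var) or 1e-9
            excess = (x - mean) / (k * sigma)   # distance relative to k sigma
            score = max(0.0, min(1.0, excess))  # clip to a fuzzy indication
        yield score
        diff = x - mean                         # EWMA updates of mean and variance
        mean += alpha * diff if n else diff     # the first sample seeds the mean
        var = (1 - alpha) * (var + alpha * diff * diff) if n else 0.0
        n += 1

# Example: a flat series with a burst at the end
scores = list(ewma_indication([10, 11, 9, 10, 12, 10, 11, 60]))
print(round(scores[-1], 2))   # near 1.0 for the final burst sample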

3.2 Data Fusion

The second step focuses on experimentation with various data fusion techniques that allow us to perform data reduction and enhance local detection capabilities. One of the most interesting options is the use of the independent monitoring data streams from decoupled sensors as inputs to a decision maker, e.g. one based on the Dempster-Shafer (D-S) theory of evidence (metrics correlation) [6]. By using D-S as the low-level modeling framework one gains the advantage that the data reported by the individual sensors are plain belief metrics [4], thus eliminating the need to communicate a large amount of data; a minimal sketch of this combination rule is given at the end of this subsection. However, in the case of combining data from more than one network element, metrics should be assigned to all possible "anomaly" states, in effect to all subsets of the state space, leading to exponential complexity.

On the other hand, the DSWS could use the reports of every sensor and the knowledge of the network topology to improve the detection results via spatial correlation methods, as reported in [2]. In that paper the authors proposed a method to separate the high-dimensional space defined by a set of spatially distributed traffic measurements into disjoint subspaces corresponding to normal and anomalous network conditions. The Subspace procedure, which divides the measurement space into two orthogonal subspaces, the PC subspace and the residual, was originally presented in [3]. This method relies on Principal Component Analysis (PCA) [1] deployed on a network using a specific metric. Our goal is to extend this method to support multi-link and multi-metric analysis, so as to achieve a more complete analysis of network traffic behavior.
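For completeness, here is a minimal sketch of Dempster's rule of combination over a two-hypothesis frame {normal, anomaly}, the kind of low-level belief fusion referred to above; the belief masses are purely illustrative, and this is not the fusion method ultimately adopted in this paper.

# Sketch of Dempster's rule of combination for two sensors over the simple
# frame {normal, anomaly}; the masses below are illustrative, not from the paper.
def combine(m1, m2):
    """m1, m2: dicts mapping frozensets of hypotheses to masses that sum to 1."""
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb              # mass assigned to the empty set
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

theta = frozenset({"normal", "anomaly"})                          # full frame of discernment
m_sensor1 = {frozenset({"anomaly"}): 0.6, theta: 0.4}             # belief report of sensor 1
m_sensor2 = {frozenset({"anomaly"}): 0.5, frozenset({"normal"}): 0.2, theta: 0.3}
print(combine(m_sensor1, m_sensor2))   # combined belief, anomaly mass ~0.77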

3.2.1 Principal Component Analysis

The goal of Principal Component Analysis is to reduce the dimensionality of a data set in which there are a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. The extracted non-correlated components are called Principal Components (PCs) and are estimated from the eigenvectors of the covariance matrix of the original variables. Let the original data x be an n x p data matrix of n observations on each of p variables (x1, x2, ..., xp) and let S be the p x p sample covariance matrix of x1, x2, ..., xp. If (λ1, e1), (λ2, e2), ..., (λp, ep) are the p (eigenvalue, eigenvector) pairs of the matrix S, with λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0, then the ith PC is

zi = ei'(x - xm)    (1)

where ei is the ith eigenvector, x is any observation vector on the variables x1, x2, ..., xp, and xm is the mean of x.
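A minimal numerical sketch of equation (1), assuming NumPy is available: the principal components are obtained from the eigenvectors of the sample covariance matrix, and the scores are projections of the centered observations; the variable names are ours.

# Sketch of eq. (1): project centered observations onto the eigenvectors of
# the sample covariance matrix, ordered by decreasing eigenvalue.
import numpy as np

def principal_components(X):
    """X: n x p matrix of n observations on p variables; returns (scores, eigvals, eigvecs)."""
    x_mean = X.mean(axis=0)                  # xm, the mean observation
    S = np.cov(X, rowvar=False)              # p x p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh: symmetric matrix decomposition
    order = np.argsort(eigvals)[::-1]        # sort so that lambda_1 >= ... >= lambda_p
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Z = (X - x_mean) @ eigvecs               # zi = ei'(x - xm), all observations at once
    return Z, eigvals, eigvecs

# Example with synthetic correlated data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[1, 0.8, 0.2], [0, 1, 0.5], [0, 0, 1]])
Z, lam, E = principal_components(X)
print(lam)   # variance captured by each principal component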

3.2.2 Subspace-based Anomaly Detection

PCA decomposes a normalized sample vector y into two portions:

y = ynorm + yres    (2)

such that ynorm corresponds to modeled and yres to residual traffic. We form ynorm by projecting y onto the normal subspace S, and we form yres by projecting y onto the abnormal subspace ˜S. To accomplish this, we arrange the set of principal components corresponding to the normal subspace (v1, v2, ..., vr) as columns of a matrix P of size m x r, where r denotes the number of normal axes. We can then write ynorm and yres as follows:

ynorm = PP^T y = Cy  and  yres = (I - PP^T)y = ˜Cy    (3)


where the matrix C = PP^T represents the linear operator that performs projection onto the normal subspace S, and ˜C likewise projects onto the anomaly subspace ˜S. Thus, ynorm contains the modeled traffic and yres the residual traffic. In general, the occurrence of an anomaly will tend to result in a large change to yres; a change in variable correlation will increase the projection of y onto the subspace ˜S. A typical statistic for detecting abnormal conditions is the squared prediction error (SPE):

SPE ≡ ||yres||^2 = ||˜Cy||^2    (4)
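Equations (2)-(4) can be sketched as follows, again assuming NumPy; the choice of the number of normal axes r and of the detection threshold is left open here, since the paper does not fix them in this section.

# Sketch of eqs. (2)-(4): split each sample into normal and residual parts and
# flag samples whose squared prediction error (SPE) exceeds a threshold.
import numpy as np

def subspace_detector(Y, r, threshold):
    """Y: N x p matrix of link-state samples; r: number of normal axes (assumed given)."""
    S = np.cov(Y, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)
    P = eigvecs[:, np.argsort(eigvals)[::-1][:r]]   # top-r PCs span the normal subspace
    C = P @ P.T                                     # projector onto the normal subspace
    C_res = np.eye(Y.shape[1]) - C                  # projector onto the residual subspace
    Y_res = (Y - Y.mean(axis=0)) @ C_res            # yres = ~C y, centering as a simple normalization
    spe = np.sum(Y_res ** 2, axis=1)                # SPE = ||yres||^2 per time bin
    return spe, spe > threshold                     # scores and boolean alarms

# Example usage on synthetic data (threshold chosen arbitrarily here)
rng = np.random.default_rng(1)
Y = rng.normal(size=(300, 6))
spe, alarms = subspace_detector(Y, r=2, threshold=20.0)
print(spe.max(), alarms.sum())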

3.3 Metric Correlation

Among our main objectives in this paper is to investigate and demonstrate the benefits and performance improvements that can be achieved by combining different metrics in the same PC computation, instead of running PCA for each metric separately. To achieve this, we create a set of virtual links for every real link, with each virtual link corresponding to a different metric. During an attack/anomaly we may observe deviations in metrics that are not normal. One of the key driving factors and principles of the proposed approach is the observation that, although the increase in a metric may not exceed its normal thresholds, combining it with the state of another metric may prove it to be anomalous. For example, an increase in the number of flows combined with a decrease in the number of packets/bytes per flow may indicate a Denial of Service attack.

Anomaly detection is based on the detection of outliers in the data set. Outliers are observations that are inconsistent with the rest of the data set. An outlier may be easily detected when one or more of its variables are significantly out of the range of the other data. However, a multivariate outlier may not be extreme in any of the original variables but may still not conform to the correlation patterns of the remainder of the data. That last type of outlier may be detected only by checking the directions defined by the last few PCs. For this reason, the Subspace method, when applied to multi-metric variables, may detect not only volume deviations in the network traffic but also alterations in the composition of traffic.

It should be noted that it is very important to combine metrics that are correlated, because the use of metrics without any common correlation structure in the same PCA decreases the accuracy of the modeled data. For example, it is rather inadvisable to combine UDP and TCP traffic on a link that is not fully utilized. The test for deciding whether or not two series are correlated is to calculate their correlation coefficient. Variables with a correlation coefficient near 1 vary together in the same direction, whereas variables with a correlation coefficient near -1 vary together in opposite directions.

Figure 2: Multi-metric analysis

A further departure of our method is the use of the correlation matrix instead of the covariance matrix. The problem with PCA based on covariance matrices is that the PCs are sensitive to large differences between the variances of the elements of x. For example, when combining the number of packets and the number of bytes of a link, or when combining the same metric across links with great variation in network traffic, the problem of different scales becomes obvious. The elements with greater variance tend to dominate the first PCs, and an anomaly in the other elements may be concealed because of their small weight. Therefore, in order to alleviate the scale dependence of PCA, we standardized the virtual link variables. Principal components can also be defined as z = A'x*, where A has columns consisting of the eigenvectors of the correlation matrix, and x* consists of standardized variables. The goal in adopting such an approach is to find the principal components of a standardized version x* of x, where x* has jth element xj/σjj^1/2, j = 1, 2, ..., p, xj is the jth element of x, and σjj is the variance of xj. Then the covariance matrix of x* is the correlation matrix of x, and the PCs of x* are given by the last equation.
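A brief sketch of the standardization step, assuming NumPy: scaling every virtual-link variable to unit variance makes PCA operate on the correlation matrix, and the correlation coefficient can be used to check which metrics are worth combining; all data values below are synthetic.

# Sketch: standardize each virtual-link variable (x_j / sigma_j) so that PCA
# on the standardized data effectively uses the correlation matrix of x.
import numpy as np

def standardize(X):
    """X: n x p matrix; return x* with zero mean and unit variance per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def correlation_pca(X):
    Xs = standardize(X)
    R = np.corrcoef(X, rowvar=False)          # correlation matrix of x = covariance of x*
    eigvals, A = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    return Xs @ A[:, order], eigvals[order]   # z = A'x*, components of the standardized data

# Checking which metrics to combine: correlation coefficient near +/-1
packets = np.array([100, 120, 90, 110, 300], dtype=float)
bytes_ = packets * 1500 + np.random.default_rng(2).normal(0, 5000, 5)
print(np.corrcoef(packets, bytes_)[0, 1])     # close to 1 -> good candidates for joint PCA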

4. Experiments and Numerical Results

In this section we evaluate the performance and operational effectiveness of our proposed anomaly detection approach via modeling and simulation, using the Network Simulator (NS-2) [8]. In subsection 4.1 we present the network topology and traffic patterns (normal and anomalous) that were utilized throughout our study, while subsection 4.2 contains the corresponding numerical results and relevant discussion.


4.1 Network Simulation Model

For demonstration purposes in this paper a tree-based network topology, as shown in Figure 3, is considered. In order to generate web traffic in NS-2 the PackMime-HTTP model was utilized. The traffic intensity generated by PackMime-HTTP is controlled by the rate parameter, which is the average number of new connections started each second. The PackMime-HTTP implementation in NS-2, developed at UNC-Chapel Hill, is capable of generating HTTP 1.0 and HTTP 1.1 connections. This model of aggregate HTTP traffic has been implemented in NS-2 and is available for generating realistic synthetic web traffic in network simulations. The model and its implementation have been validated through both empirical and analytical analyses presented in [9].

The dumps created by NS-2 were processed by a script to produce two or more metric time series for every monitored link. If M is the number of metrics, L is the number of links and N the number of time bins produced by the experiment, the data matrix processed with PCA is P x N, where P = M x L. Every row of the matrix is the time series of a specific metric on a monitored link, and every column is an instance of the link-state vector. To simulate a network anomaly we injected TCP or UDP traffic using the SimpleTCP, FullTCP and UDP agents of NS-2. When the injected traffic resulted in a deviation of the correlation structure of the multi-metric variables, the system reported an anomaly.
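The following sketch illustrates how the per-link, per-metric time series could be stacked into the P x N data matrix described above (P = M x L); the dictionary layout and values are illustrative assumptions, not output of the actual processing script.

# Sketch: stack M metrics for each of L links into a P x N matrix (P = M * L);
# every row is the time series of one "virtual link" (metric on a link) and
# every column is the network state vector for one time bin.
import numpy as np

def build_data_matrix(series):
    """series: dict keyed by (link, metric) -> list of N values, one per time bin."""
    keys = sorted(series)                        # fixed row order: one virtual link per row
    X = np.array([series[k] for k in keys], dtype=float)
    return X, keys                               # X has shape (M*L, N)

# Example: 2 links x 2 metrics over 5 time bins (synthetic values)
series = {
    ("link1", "pkts"):  [100, 110, 95, 400, 105],
    ("link1", "bytes"): [1.5e5, 1.6e5, 1.4e5, 2.0e5, 1.5e5],
    ("link2", "pkts"):  [80, 85, 82, 81, 84],
    ("link2", "bytes"): [1.2e5, 1.3e5, 1.2e5, 1.2e5, 1.3e5],
}
X, rows = build_data_matrix(series)
print(X.shape)   # (4, 5): P = 4 virtual links, N = 5 time bins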

4.2 Numerical Results

With reference to the specific experiment reported in this paper, a SYN/ACK Denial of Service attack is considered. Specifically, normal traffic consists of three web client clouds that communicate with the web server cloud at the root of the tree, thus creating three independent traffic flows. In this scenario we also injected anomalous traffic consisting only of sequences of SYN/ACK packets, which traversed one of the three paths used by the normal traffic. The anomaly is injected during time bins 110 to 120.


Figure 3: Network Topology

The corresponding numerical results are presented in Figure 4. These diagrams display the norm of the link-state vector y of the network and the norm of the residual yres. The first two diagrams display the results of running PCA for packets/sec and bytes/sec separately, whereas the last one displays the outcome of combining the two metrics based on our proposed methodology. Based on these results it is evident that the anomaly is apparent and captured only in the third diagram, which takes into account the correlation between the different metrics (packets and bytes), while the same attack would have been missed if a conventional method based on separate analysis of the individual metrics were used. Furthermore, a false positive that appears when PCA is applied to packets alone (first diagram) is not present in the multi-metric analysis.


Figure 4: Multi-metric against single-metric PCA

5. Conclusions

In this paper, we studied the problem of discovering anomalies in a large-scale network. We proposed the use of Grid infrastructure for an anomaly detection system that correlates data from heterogeneous sensors spread throughout the network. The Grid provides the means to control the sensors and gather information with security and reliability. At the same time, the system possesses a Decision Support Service that applies Principal Component Analysis to the multiple metrics received from the network elements. The proposed fusion algorithm is based on the application of Principal Component Analysis on multi-metric data, and provides an efficient way of taking into account the combined effect of the correlated observed data for anomaly detection purposes. Finally, it was demonstrated via modeling and simulation that, by applying PCA on multiple metrics simultaneously, our proposed methodology significantly improves anomaly detection capabilities compared against anomaly detection approaches that are based on single-metric analysis.

ACKNOWLEDGMENT


This work was partially supported by the European Commission under the GRIDCC 6th Framework Programme, Information Society Technologies.

REFERENCES

[1] I.T. Jolliffe, "Principal Component Analysis", Second Edition, Springer.
[2] A. Lakhina, M. Crovella, C. Diot, "Diagnosing network-wide traffic anomalies", ACM SIGCOMM Computer Communication Review, Volume 34, Issue 4, 2004.
[3] R. Dunia and S.J. Qin, "A Subspace Approach to Multidimensional Fault Identification and Reconstruction", American Institute of Chemical Engineers (AIChE) Journal, 1998, pp. 1813-1831.
[4] C. Siaterlis, B. Maglaris, "Detecting DDoS attacks with passive measurement based heuristics", in ISCC'2004, Egypt, 2004.
[5] V. Chatzigiannakis, G. Androulidakis, M. Grammatikou, B. Maglaris, "A Distributed Intrusion Detection Prototype using Security Agents", in HP-OVUA 2004, Paris.
[6] G. Shafer, "A Mathematical Theory of Evidence", Princeton University Press, Princeton, 1976.
[7] GRIDCC: Grid enabled remote instrumentation with distributed control and computation, http://www.Gridcc.org
[8] The Network Simulator (ns-2), http://www.isi.edu/nsnam/ns/
[9] J. Cao, W.S. Cleveland, Y. Gao, K. Jeffay, F.D. Smith, and M.C. Weigle, "Stochastic Models for Generating Synthetic HTTP Source Traffic", in IEEE INFOCOM, Hong Kong, 2004.

