Large-scale IP network behavior anomaly detection and identification ...

Telecommun Syst (2012) 50:1–13 DOI 10.1007/s11235-010-9384-1

Large-scale IP network behavior anomaly detection and identification using substructure-based approach and multivariate time series mining

Weisong He · Guangmin Hu · Yingjie Zhou

Published online: 4 August 2010 © Springer Science+Business Media, LLC 2010

Abstract In this paper, a substructure-based network behavior anomaly detection approach, called WFS (Weighted Frequent Subgraphs), is proposed to detect anomalies in large-scale IP networks. WFS examines an entire graph and reports its unusual substructures. Owing to the additional information provided by the graph, anomalies can be detected more accurately. With multivariate time series motif association rules mining (MTSMARM), the patterns of abnormal traffic behavior can be obtained. To verify these proposals, experiments are conducted on Netflow data from a backbone network (Internet2), and they show some positive results.

Keywords Anomaly detection and identification · Weighted frequent subgraphs · Multivariate time series motif association rules mining

W. He (✉) · G. Hu (✉) · Y. Zhou
Key Lab of Optical Fiber Sensing and Communication, University of Electronic Science and Technology of China (UESTC), 611731 Chengdu, P.R. China

1 Introduction

Large-scale IP network behavior analysis is important for network monitoring and network management. Network traffic, however, is highly dynamic and exhibits multi-timescale properties in the temporal domain. Large-scale IP network traffic anomalies, which erupt suddenly and without warning signs, often cause great damage to network equipment and networked computers in a short time. Therefore, one of the prerequisites for trustworthy networks is to detect and locate network traffic anomalies quickly and accurately, determine their causes, and respond to them appropriately and in time.

Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior. It can also be defined as follows: “Given a set of n data points or objects and the number p of expected outliers, find the top p objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data” [1]. Network behavior anomaly detection refers to the problem of finding network behavior patterns in network data that do not conform to expected behavior. An anomalous traffic pattern in a computer network could mean that a hacked computer is sending out sensitive data to an unauthorized destination [4].

Based on the extent to which labels are available, anomaly detection techniques can operate in one of three modes: supervised, semi-supervised, and unsupervised anomaly detection. Anomaly detection techniques used in network intrusion detection systems include: Statistical Profiling using Histograms [5–19], Parametric Statistical Modeling [20–23], Non-parametric Statistical Modeling [24], Bayesian Networks [25–28], Neural Networks [13–15, 29–34], Support Vector Machines [35], Rule-Based Systems [36–43], Clustering Based Method [35, 43, 45, 46], Nearest Neighbor based Method [35, 44, 47], Spectral and Graph Method [48, 49, 51, 53], and Information Theoretic Method [53, 54].

Building on entropy-based analysis of traffic feature distributions, this paper puts forward the following new ideas. First, two kinds of traffic feature distributions are selected: address and port distributions, and protocol and octets distributions. The two kinds of distributions complement each other and provide different insights into the underlying structure. Second, to detect large-scale IP network behavior anomalies more accurately, multivariate time series are represented by graphs produced from a given pair of graphs using the Cartesian product, and a substructure-based network behavior anomaly detection algorithm, named WFS, is proposed to detect anomalies in large-scale IP network flow data.

The paper is organized as follows. We review related work in Sect. 2. In Sect. 3, we describe how to represent real large-scale IP network flow data as graphs and propose the WFS algorithm. In Sect. 4 we identify abnormal patterns using MTSMARM. In Sect. 5 experiments are conducted to evaluate the proposed method. Finally, we conclude in Sect. 6.

2 Related work

There is considerable interest in using entropy-based analysis of traffic feature distributions for anomaly detection [49, 50]. Entropy-based metrics are attractive because they provide more fine-grained insight into traffic structure than traditional traffic volume analysis. While previous work has shown the benefits of using the entropy of different traffic distributions in isolation to detect anomalies, there has been little effort to understand comprehensively the detection power of entropy-based analysis of multiple traffic distributions used in conjunction with each other. Few researchers have examined how the choice of features affects detection power. Choosing traffic feature distributions that complement one another greatly benefits anomaly detection capabilities.

The detection of anomalies in multivariate time series is more challenging because it is difficult to establish a clear definition of an anomaly. As in univariate time series, some anomalies may correspond to abnormally high (or low) values or to unusual subsequences (discords [59]) in one or more time series. Others may correspond to unexpected changes in the relationships among a set of variables.

Recently there has been growing interest in analyzing data using graph-theoretic methods. Not to be confused with mechanisms for analyzing spatial data, graph-based data mining methods analyze data that can be represented as a graph (i.e., vertices and edges). Yet, while much has been written about graph-based data mining [60], very little research has addressed graph-based anomaly detection.

In 2003, the authors of [53] used the SUBDUE application to consider the problem of anomaly detection from both the anomalous substructure and the anomalous subgraph perspective. They provided measurements of anomalous behavior as applied to graphs from these two views. Anomalous substructure detection dealt with the unusual substructures discovered in an entire graph. To discriminate an anomalous substructure from the other substructures, they proposed a simple measurement whereby the value associated with a substructure indicated its degree of anomaly. They also introduced the idea of anomalous subgraph detection, which dealt with how anomalous a subgraph (i.e., a substructure that is part of a larger graph) was relative to other subgraphs. The idea was that subgraphs containing many ordinary substructures were generally less anomalous than subgraphs containing few ordinary substructures. In addition, they investigated the idea of conditional entropy and data regularity using network intrusion data as well as some artificially created data.

The work in [64] applied rarity measurements to the discovery of uncommon links within a graph. Using various metrics to define the commonality of paths between nodes, the user was able to determine whether a path between two nodes was interesting or not, without having any preconceived notions of significant patterns. One disadvantage of this approach was that, while it was domain independent, it assumed that the user was querying the system to discover interesting patterns regarding certain nodes. In other words, the uncommon patterns had to originate or terminate at a user-defined node.

The AutoPart system presented a non-parametric approach to finding outliers in graph-based data [63]. Part of Chakrabarti’s approach was to detect outliers by analyzing how edges that were removed from the whole structure affected the minimum description length (MDL) of the graph. Representing the graph as an adjacency matrix, and using a compression technique to encode node groupings of the graph, he searched for the groups that reduced the compression cost as much as possible. Nodes were put in groups based on their entropy.

In 2005, the concept of entropy was also used in [61] in the analysis of a real-world data set: the e-mail data of the famous Enron scandal. The authors used “event based graph entropy” to identify the most interesting people in an Enron e-mail data set. Using a measure analogous to the one proposed in [53], they supposed that the important nodes (or people) were the ones whose removal had the biggest effect on the entropy of the graph. Thus, the most interesting node was the one that brought about the maximum change in the graph’s entropy. However, they neglected the relations between nodes, which provide more information about frequent subgraphs.

Also in 2005, using only bipartite graphs, the authors of [52] presented a model for scoring the normality of nodes as they connect to other nodes. Using an adjacency matrix, they assigned a “relevance score” such that every node Vi had a relevance score to every node Vj, whereby the higher the score, the more related the two nodes. The idea was that the nodes with the lower normality scores to node Vi were the more anomalous ones to that node.

The two drawbacks of this approach were that it dealt only with bipartite graphs and that it found only anomalous nodes, rather than anomalous substructures. The authors of [62] also looked for anomalous links, this time via a statistical approach. Using a Katz measure, they used the link structure to statistically predict the likelihood of a link. While it worked on a small dataset of author-paper pairs, their single measurement analyzed only the links in a graph.

3 Anomaly detection using substructure-based approach

The graph representation of large-scale IP network flow data can be divided into three steps. First, the large-scale IP network flow data are compressed using Shannon entropy. Second, the entropy time series (a time series composed of entropy values) is normalized and converted into a discrete symbolic sequence. Finally, the multidimensional data at each time point are represented by a graph.

3.1 Netflow, entropy and anomaly detection metrics

3.1.1 Netflow

With Netflow, applications on the network can be monitored and malicious traffic can be identified. Netflow is a passive monitoring tool that keeps statistics about groups of related packets (e.g., packets with the same TCP/IP headers that are close in time) and records header information, counts, and times. As can be seen from Table 1, the flow record contents can be divided into two parts: basic information about the flow and information related to routing. The basic information about the flow comprises (1) source and destination IP addresses and ports, (2) packet and byte counts, (3) start and end times, and (4) ToS and TCP flags. The information related to routing comprises the next-hop IP address, the source and destination AS, and input and output.

Table 1 The flow record contents

Basic information                      Related to routing
srcaddr, dstaddr, srcport, dstport     nexthop
dpkts, doctets                         src_as, dst_as
first, last                            src_mask, dst_mask
tos, tcp_flags

3.1.2 Shannon entropy and anomaly detection metrics

Entropy, denoted by H, is a metric that captures the degree of dispersal or concentration of a distribution and is a measure of the uncertainty of a random variable [2, 3, 49]. A wide variety of anomalies affect the distribution of at least one of the IP features discussed here. Let X be a discrete random variable with alphabet $\mathcal{X}$ representing the values that a particular network traffic feature can take; its probability mass function is $p(x) = \Pr\{X = x\}$, $x \in \mathcal{X}$. The entropy H(X) of the discrete random variable X is defined by

\[ H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \tag{1} \]

where p(x) is the probability of event $x \in \mathcal{X}$ occurring. For example, the probability of seeing IP 129.173.192.0 is defined as the number of packets using IP 129.173.192.0 divided by the total number of packets in the given time interval. In this paper, entropy is used to obtain a compressed representation of the large-scale IP network flow data that is much smaller in volume yet closely maintains the integrity of the original data; this is called lossless distribution information reduction.

Anomaly detection metrics are the metrics used for anomaly detection. Network anomaly detection can operate at three different granularities: packet level, flow level, and network-wide (OD traffic) level. Since flow-level analysis is a good compromise between network-wide and packet-level analysis in terms of performance and accuracy, the analysis in this paper is based on the feature distributions of Netflow data. The empirical evaluation in [50] showed that port and address distributions are highly correlated. Therefore, besides port and address, other features are chosen in this paper in order to explore the underlying traffic structure. The six flow-level metrics used in this paper are based on the source address (sometimes called source IP and denoted srcIP), destination address (or destination IP, denoted dstIP), source port (srcPort), destination port (dstPort), octets, and protocol type, as shown in Table 2.

Table 2 Description of six anomaly detection metrics at flow level

Metrics type    Description
H(srcPort)      Entropy of source port distribution
H(dstPort)      Entropy of destination port distribution
H(srcIP)        Entropy of source IP address distribution
H(dstIP)        Entropy of destination IP address distribution
H(octets)       Entropy of octets distribution
H(prot)         Entropy of protocol type distribution

3.2 Symbolic representation of multivariate time series

We use min-max normalization to map the entropy time series to the range [0, 1] by computing

\[ h_i' = \frac{h_i - h_{\min}}{h_{\max} - h_{\min}} \tag{2} \]

where $h_i'$ is the normalized value of $h_i$, and $h_{\min}$ and $h_{\max}$ are the minimum and maximum values of the entropy time series. Then we divide the normalized entropy series equally into 2n (n ≥ 2) parts, with 2n letters denoting the values of the 2n parts respectively, and a symbolic sequence is obtained. For example, using the four letters {A, B, C, D} to denote the values of four parts, the symbolic sequence $s_i$ is obtained as

\[ s_i = \begin{cases} A, & h_i' \le 0.25 \\ B, & 0.25 < h_i' \le 0.5 \\ C, & 0.5 < h_i' \le 0.75 \\ D, & \text{otherwise} \end{cases} \tag{3} \]

Fig. 1 The representation of graphs time series
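The first two steps of Sect. 3 (entropy compression, then normalization and symbolization as in Eqs. (1) to (3)) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy (dstPort, packets) records, the use of the natural logarithm, and the handling of a constant series are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(records):
    """Entropy of one feature distribution, Eq. (1), using the natural log.

    records: iterable of (feature_value, packet_count) pairs for one time bin.
    p(x) is the packet share of value x, as in the IP address example.
    """
    packets = Counter()
    for value, count in records:
        packets[value] += count
    total = sum(packets.values())
    if total == 0:  # assumption: an empty bin has zero entropy
        return 0.0
    return -sum((c / total) * math.log(c / total)
                for c in packets.values() if c > 0)


def min_max_normalize(series):
    """Map an entropy time series to [0, 1], Eq. (2)."""
    h_min, h_max = min(series), max(series)
    if h_max == h_min:  # assumption: a constant series maps to all zeros
        return [0.0] * len(series)
    return [(h - h_min) / (h_max - h_min) for h in series]


def symbolize(h):
    """Four-letter discretization of a normalized value, Eq. (3)."""
    if h <= 0.25:
        return "A"
    if h <= 0.5:
        return "B"
    if h <= 0.75:
        return "C"
    return "D"


# Hypothetical toy data: (dstPort, packets) records for four time bins.
time_bins = [
    [(80, 100)],
    [(80, 90), (443, 10)],
    [(80, 70), (443, 30)],
    [(80, 25), (443, 25), (22, 25), (53, 25)],
]
entropies = [shannon_entropy(b) for b in time_bins]
symbols = [symbolize(h) for h in min_max_normalize(entropies)]
print(symbols)
```

Running the same three functions once per metric of Table 2 would give six symbolic sequences, one label per metric and time point, which is the input assumed by the graph representation of Sect. 3.3.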

3.3 Graph representation of multivariate time series

Consider a time series $H = h_1 h_2 \ldots h_T$, which is an ordered sequence of measurements taken for a real-valued variable H at timestamps 1, 2, ..., T. A multivariate time series $D = \{H_i\}_{i=1}^{6}$ is a collection of time series that corresponds to measurements of six real-valued variables spanning the same time interval. To represent the multivariate time series D, a new undirected graph $G = (V, E)$, with $V = \{v_1, v_2, \ldots, v_n\}$ and $E = \{e_1, e_2, \ldots, e_m\} \subseteq V \times V$, is constructed from a given pair of graphs $T_N$ and $K_6$ using the Cartesian product $T_N \times K_6$. The representation of the graph time series is shown in Fig. 1. Each vertex in the graph denotes a data point of a Netflow feature and is assigned a value in {A, B, C, D}. Each edge denotes the relation between two network anomaly detection metrics. Because the distributions of the network anomaly detection metrics may affect each other, the metric distribution time series are represented with a complete graph, which describes the relations among all the metrics.

$V(g)$ denotes the vertex set of a graph g and $E(g)$ denotes its edge set. Given a sequence $G_{T_s}$ of n graphs $\{G_{T_1}, G_{T_2}, \ldots, G_{T_n}\}$ with $G_{T_n} = (V_{T_n a}, E_{T_n b})$, $1 \le a \le 6$, $1 \le b \le 15$, $G_{T_n}$ denotes the graph at time $T_n$, $h_{tc}$ denotes the value of the cth vertex $V_c$ in $G_t$, $1 \le t \le n$, $1 \le c \le 6$, and $E_{T_n p}$ denotes the pth edge of $G_{T_n}$ at time $T_n$. We define $G_{T_s}$ to be a graph time series for all $1 \le T_s \le n$, and define $Sup_t(i, j, \ldots)$ as the number of graphs in $G_{T_s}$ in which g (consisting of k vertices) is a subgraph. For example, $Sup_t(i, j)$ (where i and j denote two different vertices and t denotes the time point of the graph) represents the support of 2-itemsets. A frequent subgraph is a graph whose support is no less than a minimum support threshold, min_sup. The support of the graph pattern at each time point is counted according to the values of its nodes. The lower the support, the more anomalous the subgraph.

Covariance is a measure of how much two traffic features change together. Each edge weight is assigned according to the corresponding covariance value. Let w be a discrete random variable with alphabet W representing the values that the weight coefficient of a subgraph can take, and let $W(i, j)$ denote the edge weight of the two vertices (i, j). Consider the $m \times 1$ random vector $x(\zeta) = [x_1(\zeta), x_2(\zeta), \ldots, x_m(\zeta)]^T$; its mathematical expectation $\mu_x$ is

\[ \mu_x = [\mu_1, \mu_2, \ldots, \mu_m]^T \tag{4} \]

Given n samples of the m random variables, the sample matrix is

\[ M = [\alpha_1, \alpha_2, \ldots, \alpha_n]^T = [\beta_1, \beta_2, \ldots, \beta_m] \tag{5} \]

where $\beta_j$ (j = 1, 2, ..., m) is the vector composed of all the sample values of the jth random variable, and $\alpha_i$ (i = 1, 2, ..., n) is the ith sample vector of the m random variables. The covariance matrix of the random vectors $x(\zeta)$ and $y(\zeta)$ is defined as

\[ C_{xy} = E\{[x(\zeta) - \mu_x][y(\zeta) - \mu_y]^H\} \tag{6} \]

The estimated covariance $\hat{C}_{xy}$ can be obtained from the sample values:

\[ \hat{c}_{x_i, y_j} = \frac{1}{n-1} \sum_{k=1}^{n} \Bigl( M_{ki} - \frac{1}{n} \sum_{p=1}^{n} x_{pi} \Bigr) \Bigl( y_{kj} - \frac{1}{n} \sum_{p=1}^{n} y_{pj} \Bigr) \tag{7} \]

The weight coefficient of a subgraph can be computed as follows:

\[
\begin{aligned}
W(x_1, x_2) &= \exp\bigl(-\hat{c}^{1/2}_{x_1,x_2}\bigr) \\
W(x_1, x_2, x_3) &= \exp\bigl(-\hat{c}^{1/2}_{x_1,x_2} - \hat{c}^{1/2}_{x_2,x_3} - \hat{c}^{1/2}_{x_3,x_1}\bigr) \\
W(x_1, x_2, x_3, x_4) &= \exp\bigl(-\hat{c}^{1/2}_{x_1,x_2} - \hat{c}^{1/2}_{x_2,x_3} - \hat{c}^{1/2}_{x_3,x_4} - \hat{c}^{1/2}_{x_4,x_1}\bigr) \\
W(x_1, x_2, x_3, x_4, x_5) &= \exp\bigl(-\hat{c}^{1/2}_{x_1,x_2} - \hat{c}^{1/2}_{x_2,x_3} - \hat{c}^{1/2}_{x_3,x_4} - \hat{c}^{1/2}_{x_4,x_5} - \hat{c}^{1/2}_{x_5,x_1}\bigr)
\end{aligned}
\tag{8}
\]

where $1 \le x_1, x_2, x_3, x_4, x_5 \le 4$. $\theta_\tau$ denotes the contribution factor (the degree of impact on the anomaly) of a subgraph $G(i, j, \ldots)$ with $\tau$ vertices. For example, $\theta_2$ denotes the contribution factor of the subgraph $G(i, j)$. According to the measure F2(S, G) = Size(S) * Instances(S, G) described in [53], where Size(S) is the number of vertices in S and Instances(S, G) is the number of times S appears in the graph G, small substructures appear frequently and large substructures appear infrequently. In general, $1 = \theta_2 > \theta_3 > \theta_4 > \theta_5 > 0$. The abnormal index $P_t$ is represented as in (9).
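The support count $Sup_t(i, j)$ of Sect. 3.3 and the subgraph weights of Eq. (8) can be sketched in Python as below. This is an illustrative sketch, not the WFS algorithm itself: the function names and toy data are hypothetical, only 2-vertex supports are counted, and the square root is taken of the absolute covariance, which is an assumption because the source does not say how negative covariances are handled.

```python
import math
from collections import Counter
from itertools import combinations


def pair_supports(symbols):
    """Support of every labelled 2-vertex subgraph at every time point.

    symbols: one row per time point, one label (A-D) per metric.
    Returns, for each time point t, a dict {(i, j): Sup_t(i, j)}, i.e. the
    number of graphs in the series that contain the same labelled pair.
    """
    pairs = list(combinations(range(len(symbols[0])), 2))
    counts = Counter((i, j, row[i], row[j]) for row in symbols for i, j in pairs)
    return [{(i, j): counts[(i, j, row[i], row[j])] for i, j in pairs}
            for row in symbols]


def sample_covariance(xs, ys):
    """Sample covariance with the 1/(n-1) factor, as in Eq. (7)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def subgraph_weight(series, vertices):
    """Weight of a subgraph on the given vertices, following Eq. (8).

    series: one list of values per metric. Two vertices use a single edge;
    three or more use the closed cycle x1-x2-...-xk-x1.
    Assumption: sqrt(|c|) is used so that negative covariances are defined.
    """
    k = len(vertices)
    if k == 2:
        edges = [(vertices[0], vertices[1])]
    else:
        edges = [(vertices[a], vertices[(a + 1) % k]) for a in range(k)]
    total = sum(math.sqrt(abs(sample_covariance(series[i], series[j])))
                for i, j in edges)
    return math.exp(-total)


# Hypothetical symbolic series: six metrics, five time points.
rows = ["AAAAAA", "AAAAAA", "AAAAAA", "AAAAAA", "DDAAAA"]
supports = pair_supports(rows)
print(supports[4][(0, 1)], supports[4][(2, 3)])  # rare pair vs. common pair

# Hypothetical normalized entropy series for three metrics.
series = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
print(subgraph_weight(series, (0, 1)), subgraph_weight(series, (0, 1, 2)))
```

In the toy data the last time point carries labelled pairs that occur only once in the series, so their support is low, which is the condition the text associates with a more anomalous subgraph.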
