Security Related Data Mining


Mehrnoosh Monshizadeh Nokia Technology and Innovation Center, Finland Department of Comnet, Aalto University, Espoo, Finland [email protected]; [email protected]

Abstract—Data mining is the process of extracting, classifying and analyzing valid and useful information from large volumes of data provided by multiple sources. It has been widely applied in various areas, one of which is the investigation of potential security threats. In the literature, various data mining techniques such as classification and clustering have been proposed to detect intrusions, DoS attacks, and malware. This paper surveys the data mining techniques applied to detect security threats and analyzes their advantages and disadvantages. Through comparison, we discuss open research issues in security-related data mining and propose directions for future research.

Keywords: Data mining, classification, clustering, intrusion, DoS, Botnet

I. INTRODUCTION

Data mining (DM) is the process of extracting valid and useful information from large quantities of data, analyzing that information and discovering useful patterns with different techniques. It has been applied in many domains, such as medicine, health care, marketing, finance, privacy and security. Security applications include national security, to counter terrorist attacks, and cyber security, to protect computers and networks against corruption (worms and viruses), intrusion, botnet attacks, malware and denial of service (DoS). Non-real-time techniques such as classification, prediction and link analysis are applied to identify groups of similar threats and, by tracing viruses, to anticipate possible future attacks, while real-time techniques are more suitable for intrusion detection [1].

The wide use of the Internet for communicating and sharing data over networks increases the risk of cyber-attacks such as data corruption, network degradation and unauthorized access to confidential information. Due to the open nature of IP (open protocols) in 3G/4G technologies, these networks are potential targets for cyber-attackers who intrude on services and cause problems for end users and mobile operators. Cyber-attackers could steal user data such as the IMSI number, billing information and contact details, degrade networks through DoS, or interrupt or suspend the services of a host connected to the Internet, thus making network resources unavailable to end users [2].

Zheng Yan The State Key Lab of ISN, Xidian University, Xi’an, China Department of Comnet, Aalto University, Espoo, Finland [email protected]; [email protected]

Data mining has evolved into a useful technology for identifying the above-mentioned security threats. Many studies in the literature aim to detect security problems, holes, intrusions, malware, etc. In this paper, we give a thorough review of the existing techniques for security-related data mining, i.e., using data mining methods to identify security issues. We focus on data mining techniques in general and on data mining for security applications in particular.

The rest of the paper is organized as follows. Section 2 briefly reviews different data mining techniques and comments on their advantages and disadvantages. In Section 3 we survey the current literature on applying data mining techniques to detect various types of security threats in cyber space (i.e., cyber-attacks) by comparing and analyzing their detection performance. We further discuss current open issues, along with suggestions for future research, in Section 4. Finally, the conclusion is presented in the last section.

II. DATA MINING TECHNIQUES

Several techniques are used in data mining, such as classification, clustering, link analysis and association rules. Generally, data mining techniques fall into two categories: descriptive and predictive. The descriptive category summarizes properties found in the data itself (e.g., clustering), while the predictive category infers information about new data from previously observed data (e.g., classification). In what follows, we mainly review the methods and algorithms of classification and clustering, and briefly introduce the procedure of link analysis and association rules for cyber-attack detection.

A. Classification

Classification is the process of assigning data to different classes. It is a supervised process in which the classes are predetermined, which means the set of possible classes is known in advance. Classification techniques comprise a variety of algorithms [3].
- Bayesian networks (BN)
A Bayesian network is a probabilistic graphical model whose nodes represent variables and whose edges represent the dependencies between them. An edge indicates a causal or influential relationship between the two nodes it connects, and each node has a probability table. Figure 1 shows an example of direct dependency between variables. In this case, a train delay affects Norman, while Martin depends on both the oversleeping and the train delay variables. The probabilities that Martin and Norman are delayed are calculated based on Bayes' theorem [4, 5].
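As a rough illustration of how such probability tables combine, the following Python sketch evaluates a network with the structure described above. Every numerical probability and every name in it is a hypothetical value chosen for illustration; none of them is taken from [5].

```python
from itertools import product

# Hypothetical prior probabilities of the two root nodes (assumed values).
P_TRAIN_DELAY = 0.1
P_OVERSLEEP = 0.4

# Hypothetical conditional probability tables (assumed values).
# P(Norman late | train delay)
P_NORMAN_LATE = {True: 0.8, False: 0.1}
# P(Martin late | train delay, Martin oversleeps)
P_MARTIN_LATE = {
    (True, True): 0.8,
    (True, False): 0.6,
    (False, True): 0.6,
    (False, False): 0.3,
}


def prob(is_true, p):
    """Probability of a binary event being true (p) or false (1 - p)."""
    return p if is_true else 1.0 - p


def p_norman_late():
    """Marginalize over the train delay variable."""
    return sum(prob(t, P_TRAIN_DELAY) * P_NORMAN_LATE[t] for t in (True, False))


def p_martin_late():
    """Marginalize over both parents of the Martin node."""
    return sum(
        prob(t, P_TRAIN_DELAY) * prob(o, P_OVERSLEEP) * P_MARTIN_LATE[(t, o)]
        for t, o in product((True, False), repeat=2)
    )


def p_delay_given_norman_late():
    """Bayes' theorem: P(delay | Norman late) = P(Norman late | delay) P(delay) / P(Norman late)."""
    return P_NORMAN_LATE[True] * P_TRAIN_DELAY / p_norman_late()


print(p_norman_late(), p_martin_late(), p_delay_given_norman_late())
```

With these assumed tables, observing that Norman is late raises the belief in a train delay well above its prior, which is the kind of inference a Bayesian network supports.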

network. The ANN then adjusts the weights to reduce the error. This procedure repeats until the difference between the desired and the actual output is minimized. Multilayer Perceptron (MLP), Self-Organizing Feature Map (SOFM) and Radial Basis Function (RBF) networks are examples of ANN algorithms [8-10].
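The error-driven weight adjustment can be sketched with a deliberately simplified model: a single sigmoid neuron rather than a full multilayer network. The training data (the logical OR function), the learning rate and the fixed number of epochs are assumptions made for this example only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Output of one neuron: weighted sum of the inputs passed through a sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, learning_rate=0.5, epochs=1000):
    """Repeatedly compare actual and expected output and adjust the weights."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):  # fixed epoch count instead of a convergence test
        for inputs, target in samples:
            output = predict(weights, bias, inputs)
            error = target - output  # error signal for this pattern
            # Move each weight in the direction that reduces the error.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical training patterns: the logical OR of two binary inputs.
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train_neuron(OR_SAMPLES)
predictions = [round(predict(weights, bias, x)) for x, _ in OR_SAMPLES]
print(predictions)
```

In a multilayer network the same idea applies layer by layer, with the error signal propagated backwards through the hidden layers.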

Fig. 1. An example of Bayesian network [5]

- Decision tree (DT)
A decision tree has a tree structure with nodes. The node that has no incoming edges is called the root, nodes with outgoing edges are internal (test) nodes, and all other nodes are called leaves. Each internal node is a test on an attribute that determines which branch is followed, and each leaf is labeled with one class depending on the target value. Moreover, a leaf may hold a probability vector indicating the probability of the target [5, 6]. For example, suppose we want to decide whether the weather is suitable for playing tennis. Two weeks of data (data set S), shown in Table I, are used to build the decision tree in Figure 2, using entropy to define information gain [7].

TABLE I. DATA SET S FOR 14 DAYS

Day     Outlook    Temperature  Humidity  Wind    Play
Day1    sunny      hot          high      weak    no
Day2    sunny      hot          high      strong  no
Day3    overcast   hot          high      weak    yes
Day4    rain       mild         high      weak    yes
Day5    rain       cool         normal    weak    yes
Day6    rain       cool         normal    strong  no
Day7    overcast   cool         normal    strong  yes
Day8    sunny      mild         high      weak    no
Day9    sunny      cool         normal    weak    yes
Day10   rain       mild         normal    weak    yes
Day11   sunny      mild         normal    strong  yes
Day12   overcast   mild         high      strong  yes
Day13   overcast   hot          normal    weak    yes
Day14   rain       mild         high      strong  no

Fig. 2. Decision tree for playing tennis based on weather [7]
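To make the role of entropy and information gain concrete, the sketch below computes both for the 14 days of Table I. It assumes an ID3-style construction, in which the attribute with the highest gain is chosen as the split at the root; the source does not state which algorithm produced Fig. 2, and the function names are illustrative.

```python
import math
from collections import Counter

ATTRIBUTES = ("Outlook", "Temperature", "Humidity", "Wind")

# Rows of Table I: (Outlook, Temperature, Humidity, Wind, Play).
DATA = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the whole set minus the weighted entropy after splitting."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row[-1] for row in rows if row[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(DATA, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("attribute with the highest gain:", best_attribute)
```

Applying the same computation recursively to each resulting subset yields the lower levels of the tree.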

- Artificial neural network (ANN)
An ANN is made of a large number of interconnected processing elements (PE) called nodes (neurons) that work together to solve specific problems. Each node is connected to other nodes via links (axons), and every link has a weight (synapse). The output of each node is fed as input to all of the nodes in the next layer. During the learning (optimization) process, for a given input (pattern), the ANN compares the generated output with the expected output; if they are not equal, the ANN generates an error signal and propagates it back to the

Fig. 3. Neural system and artificial neural network [11]

Due to the demand for pattern recognition, function estimation and adaptability, neural networks have become a good tool for data mining. Figure 3 shows a neural system and an artificial neural network called a multilayer perceptron (MLP).

- Support vector machines (SVM)
In this method, data samples are classified into two groups, positive and negative. A support vector machine finds a hyperplane in the feature space that separates the two classes with maximum margin, and each new input is assigned to one of these two classes.

Fig. 4. SVM [12]
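The following sketch shows how a separating hyperplane is used once it has been found. It does not train an SVM. The weight vector, the bias and the identity feature mapping are hypothetical choices for a linear two-dimensional case, and the margin formula 2/||w|| assumes the usual scaling in which the closest samples satisfy |f(x)| = 1.

```python
import math

# Hypothetical hyperplane parameters (assumed, not learned from data).
W = (1.0, 1.0)
B = -3.0


def phi(x):
    """Feature mapping; the identity here, which gives the linear case."""
    return x


def discriminant(x):
    """f(x) = w . phi(x) + b; the hyperplane is the set where f(x) = 0."""
    return sum(w_i * z_i for w_i, z_i in zip(W, phi(x))) + B


def classify(x):
    """Assign the positive class (+1) or the negative class (-1) by the sign of f(x)."""
    return 1 if discriminant(x) >= 0 else -1


def margin_width():
    """Distance between the two margin boundaries f(x) = +1 and f(x) = -1."""
    return 2.0 / math.sqrt(sum(w_i * w_i for w_i in W))


print(classify((2.0, 2.0)), classify((1.0, 1.0)), margin_width())
```

Training an SVM amounts to choosing the weight vector and bias so that this margin is as wide as possible while the training samples stay on the correct side.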

If $(x_i, y_i)$ are training data from two classes with $y_i = \pm 1$, then SVM finds the hyperplane that best separates the two classes:

$f(x) = W^{T}\phi(x) + b = 0$

where $f(x)$ represents the discriminant function associated with the hyperplane and $\phi(x)$ is a nonlinear mapping, induced by a kernel function, that maps the input $x_i$ into a higher-dimensional space [12].

- K-nearest neighbor (KNN)
This classifier is based on the similarity of a new entry to the sample data. Query examples are classified into different groups in a multidimensional feature space, and similarity is based on the distance between two samples (closest neighbor). Figure 5 shows an error case in which a black sample is assigned to the positive class because its nearest neighbor is misclassified [13].

Fig. 5. Error case in k-nearest [13]
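An error case of this kind can be reproduced with a few lines of Python. The sample coordinates, labels and query point below are invented for the illustration; one training sample is deliberately mislabeled, so the single nearest neighbor gives the wrong answer while a vote among three neighbors does not.

```python
import math
from collections import Counter

# Hypothetical labeled samples in a two-dimensional feature space.
TRAINING = [
    ((1.0, 1.0), "negative"),
    ((1.2, 0.8), "negative"),
    ((0.8, 1.3), "negative"),
    ((1.1, 1.2), "positive"),  # mislabeled sample inside the negative group
    ((4.0, 4.0), "positive"),
    ((4.2, 3.9), "positive"),
]


def knn_classify(query, k):
    """Label the query by majority vote among its k closest training samples."""
    by_distance = sorted(TRAINING, key=lambda sample: math.dist(query, sample[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


QUERY = (1.1, 1.25)  # lies in the middle of the negative group

print("k=1:", knn_classify(QUERY, 1))
print("k=3:", knn_classify(QUERY, 3))
```

Using an odd k greater than one is a common way to reduce the influence of a single wrongly labeled neighbor, at the cost of more distance comparisons per decision.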

B. Clustering

Clustering distributes data into a number of groups such that objects with similar profiles or properties fall into the same group. It helps data mining to analyze the correlations between attributes. Objects belonging to one cluster are similar to each other and dissimilar to the objects that belong to other clusters. Clustering is an unsupervised process, which means the characteristics of the clusters are not known beforehand and have to be discovered during the clustering process. Clustering techniques are divided into five main groups: partitioning, hierarchical, density-based, grid-based and model-based methods [14]. In what follows we briefly introduce some of the main clustering techniques.

- Fuzzy clustering (grid based)
This algorithm belongs to soft clustering, which means each object belongs to a cluster to a certain degree, expressed by a likelihood factor. This is the opposite of hard clustering, in which each object belongs to only one cluster. A fuzzy clustering algorithm calculates these likelihood factors and uses them to assign each object to one or more clusters [15].

If (data >a) && (data =b and data c and data