Data Mining - Network Intrusion Detection Classifier Selection Model

Abstract. Information is vulnerable while it travels from one location to another in a network environment. Attacks are unknown in advance, and at any moment the information may be targeted by an intruder. Because these attacks are dynamic in nature, detecting and defending against them is not easy, and network intrusion detection is increasingly becoming a critical component of securing a network; an intelligent network intrusion detection system is therefore in high demand. Due to the large volumes of security audit data, as well as the complex and dynamic properties of intrusion behaviour, optimizing the performance of an IDS is an important open problem that is receiving more and more attention from the research community. The research described here was carried out to develop such a network intrusion detector (NID): it followed data preparation, training with a set of classifiers, and evaluation of the resulting models on test data.

1 Introduction

In the digital economy, the era of the information age, the usage of digital information is very high. Consequently, the software, hardware and other media used for exchanging information have become more and more sophisticated and complex. Information providers and facilitators operate all over the world and are ever increasing, and people of every age need some form of media to obtain and handle information. Information travelling over the Internet, or over any network, is open to attack until it reaches its destination; it is vulnerable unless the packets are strongly protected and encrypted. Intruders watch other people's information as never before, using it to make money, to harass people through cyber bullying, or to spread viruses, Trojans and other malware. Attacks on networks are dynamic and complex in nature and cannot be judged by common sense alone. The need for intelligent security systems is therefore more important than ever, and to defend against these attacks computer security techniques have been studied intensively over the last decade, including cryptography, firewalls, anomaly and
intrusion detection. Among these, network intrusion detection (NID) is considered one of the most promising approaches for defending against complex and dynamic intrusion behaviour. The concept of a network intrusion detection system was first introduced by James P. Anderson in 1980 [4] and later formalized by Dr. Dorothy Denning in 1986 [5]. Since then much software and many research efforts have aimed at building intelligent systems that automate the detection of attacks arriving in various forms, and the most appealing systems have been developed using data mining techniques and principles. The most common dataset is KDD99 [2], which was collected in a network environment in order to build a network intrusion detector model that can recognize 'bad' and 'good' connections; detection is predicted by a model built from a classification algorithm and trained on the KDD99 dataset. Many classification algorithms have been introduced over the years. A literature survey reveals that most intrusion detection work has been based on a single algorithm, and using a single algorithm to detect multiple attack categories compromises performance in some cases; reported results suggest that considerable improvement in detection performance is possible. Combining two or more algorithms into an ensemble therefore promises to increase accuracy [7], and it remains to be investigated how such a model performs under stressful conditions; a minimal sketch of such an ensemble is given at the end of this introduction. Classification algorithms can be separated into four categories: density-based classifiers (Naïve Bayes, TAN, etc.), distance-based classifiers (instance-based learners, KNN, etc.), information-based classifiers (decision trees) and neural-based classifiers (ANN, SVM) [7]. The main objective of this paper is to introduce a new classification model using an ensemble approach. The methodology is simple and straightforward: existing classifiers are combined to classify the given dataset T1. The software is organized into stages: data preparation, algorithm development, model training, evaluation on the test data, and finally output of the performance metrics, including the false positive rate (FP), to indicate which model would be suitable for a real application environment. The final outcome is presented graphically with a Receiver Operating Characteristic (ROC) [7] curve. The given dataset contains four attack categories, and the new model will try to detect attacks in each of them: Probe (information gathering), DoS (denial of service), U2R (user to root) and R2L (remote to local).
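To make the ensemble idea concrete, the following is a minimal sketch, not the T1 algorithm itself, of how two base classifiers could be combined with Weka's Vote meta-classifier; the arff file name and the choice of J48 plus Naïve Bayes as base learners are illustrative assumptions only.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class EnsembleSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative file name; any KDD99-style arff with the class as last attribute will do.
        Instances data = DataSource.read("train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Combine two base learners; Vote's default rule averages their probability estimates.
        Vote ensemble = new Vote();
        ensemble.setClassifiers(new Classifier[] { new J48(), new NaiveBayes() });

        // 10-fold cross-validation to estimate the accuracy of the combined model.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ensemble, data, 10, new Random(1));
        System.out.println(eval.toSummaryString("Ensemble results\n", false));
    }
}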
2 Intrusion Detection: Previous Work

In 1980 the concept of intrusion detection began with Anderson's research paper [9]; he introduced a threat classification model that develops a security monitoring surveillance system based on detecting anomalies in user behaviour. Intrusion detection systems (IDS) were first introduced to protect single computers using host-based protection, but when networked systems appeared the mechanisms changed in order to detect attacks arriving from the network. One such concept, Context Sensitive String Evaluation (CSSE), was implemented in Java and PHP environments with access to the necessary protocols [3]. IDS are normally categorized into misuse detection and anomaly detection. Misuse attacks can be identified easily because their signatures are stored in a database; the bigger problem is detecting, through anomaly detection, attacks that arrive disguised as friendly connections. Anomaly detection systems have a high false positive/negative alarm rate compared to misuse detection systems [10]. According to the literature, the KDD99 dataset is the most famous and comprehensive dataset, and it has been used widely with many classifiers such as Naïve Bayes, fuzzy logic algorithms, J48, SVM and so on. Venkata Suneetha Takkellapati and G.V.S.N.R.V. Prasad [8] proposed an Information Gain (IG) and triangle-area-based KNN algorithm to detect and classify attacks on the KDD99 dataset and reported a low false rate and high accuracy [8]. Dewan et al. proposed an algorithm combining ID3 and Naïve Bayes that identifies effective attributes from the training dataset, calculates the conditional probabilities for the best attribute values, and then classifies the examples of the training and testing datasets; the reported false positive rates are 0.03% for DoS, 0.28% for Probe, 0.12% for U2R and 6.24% for R2L attacks, all below 10%.
3 Algorithm T1

1. Random sampling process (X)
2. Create an array of classifiers
3. Build the model using (X)
4. Test
5. Evaluate the model using (Y, X)
6. Calculate the accuracy
Decision Making and Model Building Process Diagram

The proposed solution builds an array of classifiers, trains each one, and evaluates the resulting models; based on the evaluation the accuracy is calculated, and the best-performing model can then be used as the trained model, following the algorithm above. The given dataset T1 is bagged using the Random Forest algorithm, so the sub-datasets generated by Random Forest are subsets extracted from T1. A minimal sketch of this build-and-evaluate loop is given below.
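The following sketch illustrates the build-and-evaluate loop described above, under the assumption that a Weka Instances object holding the training sample X is already loaded; the specific list of classifiers and the 10-fold cross-validation are illustrative choices, not the exact configuration of the T1 software.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;

import java.util.Random;

public class ClassifierArraySketch {

    // Build and evaluate each classifier in the array; return the accuracy of each model.
    public static double[] evaluateAll(Instances train, Classifier[] models) throws Exception {
        double[] accuracy = new double[models.length];
        for (int i = 0; i < models.length; i++) {
            Evaluation eval = new Evaluation(train);
            // 10-fold cross-validation on the training sample X.
            eval.crossValidateModel(models[i], train, 10, new Random(1));
            accuracy[i] = eval.pctCorrect();   // correctly classified / total instances, in percent
            System.out.println(models[i].getClass().getSimpleName()
                    + ": " + String.format("%.2f%%", accuracy[i]));
        }
        return accuracy;
    }

    public static void usage(Instances train) throws Exception {
        Classifier[] models = { new RandomForest(), new J48(), new NaiveBayes() };
        evaluateAll(train, models);  // the model with the highest accuracy is kept for decision making
    }
}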
4 Empirical Study

In order to detect on-coming connections effectively, an array of classifiers has to be trained as models using the given dataset. The dataset T1 consists of 247,010 instances, each with 41 attributes, as described in Table 1. First, T1 was converted to suit the data-processing environment, i.e. the Weka arff file format, and the experimental environment was built in three major steps: environment setup, data preprocessing, and implementation of the data mining process in the Java programming environment. Second, the data were processed with the developed algorithm and the training data were fed into the software. The classifiers were hard-coded for ease of evaluation; a set of the most popular classifier algorithms was selected for execution, namely Bayesian approaches, decision trees, Random Forest and Bagging. The path of the test data was also placed in the directory where the program executes. A brief sketch of loading such an arff file is shown below.
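As an illustration of the preprocessing step, the snippet below loads an arff file with Weka and marks the last attribute as the class label; the file names are placeholders, not the actual paths used in the experiments.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadData {
    public static void main(String[] args) throws Exception {
        // Placeholder paths: the real training and test files sit in the program's working directory.
        Instances train = DataSource.read("traindata.arff");
        Instances test  = DataSource.read("testdata.arff");

        // The class label (normal, dos, probe, r2l, u2r) is the last attribute in the KDD99-style arff.
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        System.out.println("Training instances: " + train.numInstances()
                + ", attributes: " + train.numAttributes());
    }
}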
4.1 Evaluation Setup

The development and training were done on a personal computer with an Intel(R) Core(TM)2 CPU at 2.13 GHz and 2 GB RAM under Microsoft Windows 7. The Java environment was built with the Eclipse Kepler IDE and, most importantly, the Weka 3.6 packages were used to build the software. Weka provides a collection of machine learning algorithms for data mining tasks, with tools for data preprocessing, classification, regression, clustering, association rules and visualization; here only the classifier algorithms were used, for classifying network connections. Because of the huge number of data records it is very difficult to process the whole dataset on an ordinary PC, and it takes a lot of time. Therefore a random record set was selected that proportionately represents the main categories of the file; the selection was done by manual random sampling from the original T1 dataset. For the first experiment we extracted 49,596 instances as the training set, including 4864 normal instances, 19573 DoS instances, 205 Probe instances, 21 U2R instances and 554 R2L instances. Second, we extracted 15,000 instances as an independent testing set. With these two datasets we can effectively evaluate the models. A sketch of obtaining such a class-proportional sample with Weka is given after the table below.
Class     Number of records     10% of occurrence
Normal    48648                 4864
DoS       195734                19573
Probe     2052                  205
U2R       21                    2
R2L       554                   55
Total     247010                24699

10% of the records selected from each class
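The manual sampling described above could also be reproduced with Weka's supervised Resample filter, which draws a class-proportional subsample; the 10% sample size is the only figure taken from the table, everything else in this sketch (file name, seed) is an illustrative assumption.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class SampleData {
    public static void main(String[] args) throws Exception {
        Instances full = DataSource.read("T1.arff");          // placeholder file name
        full.setClassIndex(full.numAttributes() - 1);

        // Draw a 10% sample that keeps the class proportions of T1.
        Resample resample = new Resample();
        resample.setSampleSizePercent(10.0);
        resample.setBiasToUniformClass(0.0);  // 0 = preserve the original class distribution
        resample.setInputFormat(full);

        Instances sample = Filter.useFilter(full, resample);
        System.out.println("Sampled " + sample.numInstances()
                + " of " + full.numInstances() + " instances");
    }
}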
4.2 Naïve Bayes
The NaiveBayes classifier in Weka implements the probabilistic Naïve Bayes classifier. It can use kernel density estimators when the normality assumption is incorrect, and it can also handle numeric attributes using supervised discretization.
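A minimal sketch of configuring Weka's NaiveBayes with the options mentioned above; the training set variable is assumed to be loaded already.

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

public class NaiveBayesSketch {
    public static NaiveBayes build(Instances train) throws Exception {
        NaiveBayes nb = new NaiveBayes();
        // Use a kernel density estimator instead of a single Gaussian for numeric attributes.
        nb.setUseKernelEstimator(true);
        // Alternatively: nb.setUseSupervisedDiscretization(true);  (cannot be combined with the kernel estimator)
        nb.buildClassifier(train);
        return nb;
    }
}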
4.3 J48 (C4.5 Decision Tree, Revision 8)
The C4.5 algorithm, developed by Quinlan [12], is the most popular decision tree classifier. The Weka classifier package refers to C4.5 as J48, which is an optimized implementation of C4.5 revision 8. The parameters used were: confidenceFactor = 0.25; numFolds = 3; seed = 1; unpruned = false.
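The parameters above map directly onto Weka's J48 options, as in the following sketch (they appear to match J48's default values in Weka 3.6); the training set is assumed to be loaded already.

import weka.classifiers.trees.J48;
import weka.core.Instances;

public class J48Sketch {
    public static J48 build(Instances train) throws Exception {
        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f);  // pruning confidence
        tree.setNumFolds(3);              // folds used for reduced-error pruning, if enabled
        tree.setSeed(1);                  // random seed for reduced-error pruning
        tree.setUnpruned(false);          // keep pruning enabled
        tree.buildClassifier(train);
        return tree;
    }
}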
4.4 Random Forest
A random forest is a collection of decision trees combined as an ensemble learning algorithm for classification: a multitude of decision trees is grown at training time, and the output class is the mode of the classes output by the individual trees. Random forests can also be used to rank the importance of variables in a regression or classification problem in a natural way.
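A minimal sketch of building Weka's RandomForest; the number of trees and the features-per-split setting are illustrative values, not necessarily those used in the experiments.

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;

public class RandomForestSketch {
    public static RandomForest build(Instances train) throws Exception {
        RandomForest forest = new RandomForest();
        forest.setNumTrees(100);     // number of trees grown (illustrative)
        forest.setNumFeatures(0);    // 0 = let Weka pick log2(#attributes) + 1 features per split
        forest.setSeed(1);
        forest.buildClassifier(train);
        return forest;
    }
}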
4.5 Comparison of the Classifiers
The classifier with the best accuracy for the most relevant label is used in the testing environment to detect the actual connections. Simulation results are given in the screenshots below. The software was run with the Random Forest, Naive Bayes and J48 algorithms, and the accuracy of each model was recorded. The true positive (TP) and false positive (FP) rates have to be calculated in order to obtain correct results for each algorithm; these parameters are the most important criteria for selecting a classifier. The average accuracy was computed in the software as the number of correctly classified instances divided by the total number of instances, and this accuracy is displayed in each of the results obtained from the algorithms.
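The accuracy, TP rate and FP rate referred to above can all be read from a Weka Evaluation object once a model has been evaluated on the test set, as in this sketch; the classifier and dataset variables are assumed to exist already.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

public class MetricsSketch {
    public static void report(Classifier model, Instances train, Instances test) throws Exception {
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);

        // Accuracy = correctly classified instances / total instances.
        System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());

        // Per-class TP and FP rates (class index 0 shown; repeat for each attack class).
        System.out.printf("TP rate (class 0): %.3f%n", eval.truePositiveRate(0));
        System.out.printf("FP rate (class 0): %.3f%n", eval.falsePositiveRate(0));

        // Area under the ROC curve for the same class.
        System.out.printf("ROC area (class 0): %.3f%n", eval.areaUnderROC(0));
    }
}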
Conclusion

Information is vulnerable while being transmitted from one location to another in a network environment; attacks are unknown in advance, dynamic in nature, and can strike at any moment, which makes detecting and defending against them difficult, so an intelligent network intrusion detection system is increasingly a critical component of securing a network. Due to the large volumes of security audit data and the complex, dynamic properties of intrusion behaviour, optimizing IDS performance remains an important open problem that is receiving more and more attention from the research community. This research developed such a NID by preparing the data, training a set of classifiers, and evaluating the resulting models on test data. The software was written in Java using the Weka packages; the results were produced in a command-line environment, and most of the classifiers gave a high accuracy rate. The main problem encountered while developing and testing was the time the evaluation takes, which made achieving the results more difficult.
Table 1. T1: attribute description

Number  Attribute Name            Data Type
1       duration                  continuous
2       protocol                  symbolic
3       service                   symbolic
4       flag                      symbolic
5       src_b                     continuous
6       dst_b                     continuous
7       land                      symbolic
8       wrong_frag                continuous
9       urgent                    continuous
10      hot                       continuous
11      failed_logins             continuous
12      logged_in                 symbolic
13      compromised               continuous
14      root_shell                continuous
15      su_attempted              continuous
16      root                      continuous
17      file_creations            continuous
18      shells                    continuous
19      access_files              continuous
20      outbound_cmds             continuous
21      host_login                symbolic
22      guest_login               symbolic
23      count                     continuous
24      srv_count                 continuous
25      serror_rate               continuous
26      srv_serror_rate           continuous
27      rerror_rate               continuous
28      srv_rerror_rate           continuous
29      same_srv_rate             continuous
30      diff_srv_rate             continuous
31      srv_diff_host_rate        continuous
32      d_h_count                 continuous
33      d_h_srv_count             continuous
34      d_h_same_srv_rate         continuous
35      d_h_diff_srv_rate         continuous
36      d_h_same_src_port_rate    continuous
37      d_h_srv_diff_host_rate    continuous
38      d_h_serror_rate           continuous
39      d_h_srv_serror_rate       continuous
40      d_h_rerror_rate           continuous
41      d_h_srv_rerror_rate       continuous
Results of the Training Model: J48 Classifier
Accuracy of the J48 Model
Testing with Naive Bayes
Accuracy Calculation Using the Naive Bayes Algorithm
Analysis of the Random Forest Algorithm
Results of the Test Run
Sample Arff File

@relation traindata_dos
@attribute duration numeric
@attribute protocol {tcp,icmp,udp}
@attribute service {telnet,ecr_i,private,http,http_443,exec,login,printer,time,uucp,klogin,kshell,whois,echo,systat,daytime,domain,ftp,ftp_data,ssh,smtp,mtp,gopher,remote_job,link,hostnames,csnet_ns,pop_3,pop_2,sunrpc,uucp_path,nntp,rje,netbios_ssn,netbios_dgm,imap4,finger,sql_net,ctf,bgp,supdup,iso_tsap,ldap,nnsp,shell,efs,netbios_ns,courier,discard,netstat,name,vmnet,auth,Z39_50,other,tim_i,domain_u,eco_i,ntp_u,IRC,urh_i,urp_i,X11,pm_dump,tftp_u,red_i}
@attribute flag {S0,SF,RSTR,S2,S1,REJ,RSTO,S3,OTH,RSTOS0,SH,IRC}
@attribute src_b numeric
@attribute dst_b numeric
@attribute land numeric
@attribute wrong_frag numeric
@attribute urgent numeric
@attribute hot numeric
@attribute failed_logins numeric
@attribute logged_in numeric
@attribute compromised numeric
@attribute root_shell numeric
@attribute su_attempted numeric
@attribute root numeric
@attribute file_creations numeric
@attribute shells numeric
@attribute access_files numeric
@attribute outbound_cmds numeric
@attribute host_login numeric
@attribute guest_login numeric
@attribute count numeric
@attribute srv_count numeric
@attribute serror_rate numeric
@attribute srv_serror numeric
@attribute rerror_rate numeric
@attribute srv_rerror numeric
@attribute same_srv_rate numeric
@attribute diff_srv_rate numeric
@attribute srv_diff_host_rate numeric
@attribute d_h_count numeric
@attribute d_h_srv_count numeric
@attribute d_h_same_srv_rate numeric
@attribute d_h_diff_srv_rate numeric
@attribute d_h_same_src_port_rate numeric
@attribute d_h_srv_diff_host_rate numeric
@attribute d_h_serror_rate numeric
@attribute d_h_srv_serror_rate numeric
@attribute d_h_rerror_rate numeric
@attribute d_h_srv_rerror_rate numeric
@attribute class {normal,dos,probe,r2l,u2r}
@data
60,tcp,telnet,S3,125,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,1,0,0,1,1,1,0,1,0,1,1,0,0,r2l
0,tcp,telnet,RSTO,125,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,1,1,1,0,0,3,3,1,0,0.33,0,0.33,0.33,0.67,0.67,r2l
0,tcp,telnet,RSTO,125,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,1,1,1,0,0,5,5,1,0,0.2,0,0.2,0.2,0.8,0.8,r2l
0,tcp,telnet,RSTO,125,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,1,1,1,0,0,7,7,1,0,0.14,0,0.14,0.14,0.86,0.86,r2l
0,tcp,telnet,RSTO,125,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,1,1,1,0,0,9,9,1,0,0.11,0,0.11,0.11,0.89,0.89,r2l
0,tcp,telnet,RSTO,126,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,1,1,1,0,0,11,11,1,0,0.09,0,0.09,0.09,0.91,0.91,r2l
0,tcp,telnet,RSTO,126,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,1,1,1,0,0,13,13,1,0,0.08,0,0.08,0.08,0.92,0.92,r2l
0,tcp,telnet,RSTO,126,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,0,0,15,15,1,0,0.07,0,0.07,0.07,0.93,0.93,r2l
0,tcp,telnet,RSTO,126,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,1,1,1,0,0,17,17,1,0,0.06,0,0.06,0.06,0.94,0.94,r2l
0,tcp,telnet,RSTO,126,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,1,1,1,0,0,19,19,1,0,0.05,0,0.05,0.05,0.95,0.95,r2l
0,tcp,telnet,RSTO,126,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,1,1,1,0,0,21,21,1,0,0.05,0,0.05,0.05,0.95,0.95,r2l
60,tcp,telnet,S3,126,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,2,2,0.5,0.5,0.5,0.5,1,0,0,23,23,1,0,0.04,0,0.09,0.09,0.91,0.91,r2l
0,tcp,telnet,RSTO,126,179,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,0,0,25,25,1,0,0.04,0,0.08,0.08,0.92,0.92,r2l
The Java Software File

import java.awt.EventQueue;
import java.awt.Frame;
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.*;

import javax.swing.JButton;
import javax.swing.JFileChooser;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.JPanel;
import javax.swing.JTextArea;
import javax.swing.JTextField;
import javax.swing.SwingConstants;
import javax.swing.border.EmptyBorder;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.rules.DecisionTable;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.DecisionStump;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.FastVector;
import weka.core.Instances;

public class NIDClassifier extends JFrame {

    private JPanel contentPane;
    private JTextField txtSize;
    private JFileChooser fc;
    Instances train;
    Instances fTrain;
    Instances temp;
    Instances test;
    private JTextField txtCount;

    /** Launch the application. */
    public static void main(String[] args) {
        EventQueue.invokeLater(new Runnable() {
            public void run() {
                try {
                    NIDClassifier frame = new NIDClassifier();
                    frame.setVisible(true);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }

    /** Create the frame. */
    public NIDClassifier() {
        setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        setBounds(100, 100, 621, 695);
        contentPane = new JPanel();
        contentPane.setBorder(new EmptyBorder(5, 5, 5, 5));
        setContentPane(contentPane);
        contentPane.setLayout(null);

        JPanel panel_1 = new JPanel();
        panel_1.setBounds(10, 11, 585, 142);
        contentPane.add(panel_1);
        panel_1.setLayout(null);

        JButton btnNewButton = new JButton("Evaluate");
        btnNewButton.setBounds(309, 119, 102, 23);
        panel_1.add(btnNewButton);

        JLabel lblSizeOfThe = new JLabel("Size of the Training Data Set");
        lblSizeOfThe.setBounds(10, 29, 137, 14);
        panel_1.add(lblSizeOfThe);
        lblSizeOfThe.setHorizontalAlignment(SwingConstants.LEFT);

        txtSize = new JTextField();
        txtSize.setBounds(157, 26, 182, 20);
        panel_1.add(txtSize);
        txtSize.setColumns(10);

        JLabel lblDataCount = new JLabel("Data Copied");
        lblDataCount.setBounds(70, 94, 77, 14);
        panel_1.add(lblDataCount);

        txtCount = new JTextField();
        txtCount.setBounds(157, 91, 254, 20);
        panel_1.add(txtCount);
        txtCount.setColumns(10);

        final JTextArea txtArea = new JTextArea();
        txtArea.setBounds(24, 183, 1000, 1000);
        contentPane.add(txtArea);

        btnNewButton.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent arg0) {
                StringBuilder res = new StringBuilder();
                double mSize = txtSize.getText().length() > 0 ? Double.parseDouble(txtSize.getText()) : 0;
                try {
                    // Choose a set of classifiers
                    Classifier[] models = { new RandomForest(), new J48(), new NaiveBayes(),
                            new DecisionTable(), new OneR(), new DecisionStump() };
                    // One evaluation slot per model (the original listing used a fixed size of 4,
                    // which would overflow with six models)
                    Evaluation[] lstEtest = new Evaluation[models.length];

                    // Run for each classifier model
                    for (int j = 0; j < models.length; j++) {
                        // Collect every group of predictions for the current model in a FastVector
                        FastVector predictions = new FastVector();
                        // For each training/testing split pair, train and test the classifier
                        Evaluation validation = BuildRandomClassifier(train, models[j]);
                        predictions.appendElements(validation.predictions());
                        lstEtest[j] = validation;
                        // Uncomment to see the summary for each training/testing pair
                        // System.out.println(models[j].toString());

                        // Calculate the overall accuracy of the current classifier on all splits
                        double accuracy = calculateAccuracy(predictions);

                        // Print the current classifier's name and accuracy
                        System.out.println(models[j].getClass().getSimpleName() + ": "
                                + String.format("%.2f%%", accuracy) + "\n=====================");
                        res.append(models[j].getClass().getSimpleName() + ": "
                                + String.format("%.2f%%", accuracy) + "\n=====================");
                        res.append(validation.toSummaryString());
                        txtArea.setText(res.toString());
                        DrawROC(validation);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });

        JButton btnOpenArff = new JButton("Open Arff");
        btnOpenArff.setBounds(157, 57, 107, 23);
        btnOpenArff.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent arg0) {
                UseFileDialog ufd = new UseFileDialog();
                System.out.println("Loading : ");
                /*
                System.out.println("Saving : " + ufd.saveFile(new Frame(), "Save...",
                        "c:\\temp\\", "*.arff"));
                */
                File[] arffFiles = ufd.loadFile(new Frame(), "Open...", ".\\", "*.arff");
                try {
                    BufferedReader reader = new BufferedReader(
                            new FileReader("c:\\temp\\traindata_dummy.arff"));
                    train = new Instances(reader);
                } catch (FileNotFoundException e1) {
                    txtArea.setText(e1.getMessage());
                    e1.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                int num = 0;
                // for (int i = 0; i ...  (the remainder of this listener is truncated in the
                // source listing)
            }
        });
    }

    // The helper methods BuildRandomClassifier, calculateAccuracy and DrawROC, the sampling
    // method that ends with "return sampleTrainSet;", and BuildRandomInstances(Instances data)
    // are truncated in the source listing.
}