Cumhuriyet University Faculty of Science
Science Journal (CSJ), Vol. 36, No: 6 Special Issue (2015)
ISSN: 1300-1949

A New Approach For Feature Selection In Intrusion Detection System

Rahim TAHERI*, Marzieh AHMADZADEH, Mohammad Rafi KHARAZMI

Dept. of Information Technology and Computer Engineering, Shiraz University of Technology, Shiraz, Iran

Received: 24.04.2015; Accepted: 09.07.2015

Abstract. Many intrusion detection systems exist, each using a different method to detect attacks, and their main challenge is improving efficiency and accuracy. Methods that enhance their efficiency are therefore needed. In this paper, we look for a way to reduce the number of features. The cuttlefish algorithm was used to search over all of the features, and subsets of 3, 5, 10, and 13 features were analyzed. An artificial neural network was used as the evaluation function, and the results were compared with those of other papers. Among the selected subsets, the thirteen-feature set had the highest efficiency and could detect nearly all of the attacks.

Keywords: Data mining, feature selection, intrusion detection, cuttlefish, multilayer neural network

1. INTRODUCTION

The growing development of information technology has increased the dimensionality and complexity of data. Although high-dimensional data sets offer many opportunities, they also pose computational challenges. One problem with high-dimensional data is that not all features are important for finding the knowledge hidden in the data. For this reason, dimensionality reduction has remained a significant topic in many fields, including intrusion detection. Intrusion detection is an effective way to achieve higher security in computer networks [1]. An intrusion detection system is responsible for detecting and reporting both people who enter and use a system without permission and people who are legally permitted to access the system but abuse their rights. Many methods are used for intrusion detection, among which data mining techniques are popular because of their speed in recognizing complex patterns. The efficiency of a pattern recognition system depends strongly on the selected features [2]. A learning algorithm for intrusion detection or attack classification is usually given many features. In most cases, however, many of them are irrelevant or redundant for learning, and such features can significantly worsen learning accuracy and training speed [3, 4]. Selecting the relevant features and designing the system with the smallest feasible amount of data is therefore an important topic. The main objective of this paper is to develop a method for feature selection in intrusion detection systems. The evaluation is based on the KDD Cup 99 data set. The proposed method, called FS_CF_IDS, combines the cuttlefish algorithm with a multilayer neural network.
The cuttlefish algorithm is inspired by the way cuttlefish camouflage themselves in the sea, and it models both of the animal's mechanisms: the change of color and the change of pattern. In the proposed model, the cuttlefish algorithm serves as the feature selection function and the multilayer neural network serves as the evaluation function. The contribution of this study is a new hybrid method for feature selection in intrusion detection systems. The paper is organized as follows: in section 2, materials and methods are introduced. In section 3, the standard data set used to run the algorithm is introduced. In sections 4 and 5, the proposed feature selection algorithm is explained. In section 6, the results are presented and discussed. In section 7, the evaluation function is selected and introduced. In section 8, the variables related to the analysis are introduced. In section 9, the selected features are evaluated. The final section of the paper is the conclusion.

* Corresponding author. E-mail: [email protected]
Special Issue: International Conference on Non-Linear System & Optimization in Computer & Electrical Engineering
http://dergi.cumhuriyet.edu.tr/ojs/index.php/fenbilimleri ©2015 Faculty of Science, Cumhuriyet University

MATERIALS AND METHODS

In this section, the proposed method, called FS_CF_IDS, is discussed. The data are first preprocessed and normalized. The cuttlefish algorithm then generates feature subsets, and the multilayer neural network (MLP) evaluates them. In running order, the cuttlefish algorithm comes first and the multilayer neural network second: the feature subsets produced by the cuttlefish algorithm are passed to the MLP evaluation function, their efficiency is analyzed, and the results are compared with hybrid methods presented in other studies. The cuttlefish algorithm was written in C#, and the optimum features are selected by this algorithm. The analysis and evaluation of the selected features were programmed in the MATLAB 2014a environment for all simulations. The obtained results are compared with results published in reputable journals. The evaluation criterion for the proposed method is classification accuracy.

Standard data set for the algorithm running

There are numerous data sets for the evaluation of intrusion detection systems. KDD Cup 1999 [5] is one of the most widely used data sets for evaluating systems and presenting models for intrusion detection. The KDD Cup 1999 training data include nearly 5 million records. Each record has 41 features, which are classified in four groups: basic features, content features, time traffic features, and host traffic features. The connections are classified into two types: attack and normal. The test data include 311029 records. The training data include 23 types of attack and the test data include 37 types of attack.
Note that the probability distribution of the test data differs from that of the training data. The test data include specific attack types that do not exist in the training data, which makes the data set more realistic. The data set was prepared for the evaluation of intrusion detection systems. It contains 9 months of attack traffic from the US Air Force and is provided in two parts: KDD Cup 99 Full Data and KDD Cup 99 10% Data (containing 494022 records). The data set has five output classes: Probe, R2L, U2R, DOS, and Normal.

Cuttlefish algorithm

This algorithm is based on the color-changing behavior of cuttlefish. The patterns and colors seen on the animal's skin result from light reflected by different layers of specialized skin cells: chromatophores, leucophores, and iridophores. These animals are also called cuttle and compound fishes. When the skin is exposed to light, the three specialized layers act separately, and together they change the color and appearance of the skin. The chromatophore layer contains pigments that let light pass through and, combined with the light returned by the other layers, produce a color that matches the environment. Beneath it lies the iridophore layer, which contains colorless crystals. It contains a type of protein that drives the change of the skin's pigments and their combination with other light. The membrane of these cells has plate-like structures that contain a reflective protein. When this protein is altered by phosphorylation, the size and structure of these small plates change, which in turn changes the way light is reflected. In addition, the next layer contains leucophores, also called white pigment. This layer acts as a mirror and reflects all of the received light [8]. The cuttlefish algorithm (CFA) consists of two main processes: (1) reflection and (2) visibility (sighting) [7].

Algorithm implementation in the data set

To apply this algorithm to the data set, which has 41 features, the feature indices are first stored in the variable rankedArray={1,2,3,…,41}. The steps of the algorithm are shown in the flowchart of Figure 1 and are discussed in detail below.

The first step: the algorithm randomly generates N groups from combinations of all of the features, which are saved as {p1, p2, p3, …, pN}. Each p consists of two sets, selectedFeatures and unselectedFeatures. Each of them is a subset of rankedArray, and their intersection is the empty set. As mentioned, rankedArray is the set of all of the features of the data set used in this research. To illustrate the algorithm, it is assumed that selectedFeatures={1,2,3,4,5} and unselectedFeatures={6,7,…,41}.

Start

Initial stage
- Creation of the sets of features
- Calculating the average for each set
- The best set

First and second states
- Sorting the sets based on the average
- Random number between zero and half of the number of features
- Using formulas 1, 2, and 3 and creating a new set
- If the new set is better than AVBEST, change it

Third and fourth states
- Using formulas 7 and 8, exchange a feature between the two sets SelectedFeature and UnSelectedFeature
- If the newly produced set is better than the optimal set, change it

Fifth state
- Using formulas 10, 11, and 12, remove a feature from the optimized set
- Re-calculate the average
- If any sets are better than the optimal set, change them

Stopping condition

Return of the optimal set

Figure 1. Flowchart of the proposed method based on the cuttlefish algorithm.

In addition, the constants are fixed in this step. The two sets above mean that features 1, 2, 3, 4, and 5 have been selected as the answer to the problem, while features 6 to 41 have not been selected.

The second step: using the optimizer function, the best group generated in the first step is selected and stored in two variables, AVBestsubset and bestSubset. Both variables consist of two sets: selectedFeatures and unselectedFeatures. The number of features in selectedFeatures must always be at least one less than the number in unselectedFeatures. Ignoring this constraint causes difficulties in the next steps of the algorithm.
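The first two steps can be sketched in Python as follows. This is a minimal illustration, not the authors' C# implementation: the function names, the dictionary layout, the population size of 20, and especially `toy_fitness` (a stand-in for the multilayer neural network evaluation) are assumptions made for the example.

```python
import random

ALL_FEATURES = list(range(1, 42))  # rankedArray = {1, 2, ..., 41}


def random_group(n_selected, rng):
    """Split rankedArray into disjoint selectedFeatures / unselectedFeatures."""
    selected = rng.sample(ALL_FEATURES, n_selected)
    unselected = [f for f in ALL_FEATURES if f not in selected]
    return {"selected": selected, "unselected": unselected}


def toy_fitness(group):
    """Hypothetical stand-in for the evaluation function (not the paper's MLP)."""
    return sum(1 for f in group["selected"] if f in (15, 16))


def init_population(n_groups, n_selected, fitness, seed=0):
    """First step: N random groups. Second step: keep the fittest one."""
    rng = random.Random(seed)
    population = [random_group(n_selected, rng) for _ in range(n_groups)]
    best = max(population, key=fitness)
    # bestSubset and AVBestsubset both start from the same best group.
    best_subset = {k: list(v) for k, v in best.items()}
    av_best_subset = {k: list(v) for k, v in best.items()}
    return population, best_subset, av_best_subset


population, best_subset, av_best_subset = init_population(20, 5, toy_fitness)
```

Every group keeps the two sets disjoint and jointly equal to rankedArray, which is the invariant the later steps rely on.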



The third step: in this step, the algorithm is run with the selected feature sets and the related formulas are applied to them.

A: Implementation of first and second states

In this step, the groups in P are first sorted in descending order of their value (the fitness variable). New groups are then generated from the better half of P and saved in the newSubset variable. In other words, pi is selected such that i=1,2,…,k, where k is between 0 and N/2. In the main algorithm, there are two variables, R and V. R is the contraction or expansion of the skin's muscle and V is the compatibility of the skin with the external environment. In this algorithm, formula 1 is obtained from the following variables.

Formula 1: $\mathit{newSubset}_i = \mathit{Reflection}_i \cup \mathit{Visibility}_i$

The union of the two sets is placed in newSubset.

Formula 2: $\mathit{Reflection}_i = \mathit{randomsubset}[R] \subset p_i.\mathit{selectedFeatures}$

Formula 3: $\mathit{Visibility}_i = \mathit{randomsubset}[V] \subset p_i.\mathit{unselectedFeatures}$

In these formulas, Reflection_i and Visibility_i are two sets of sizes R and V, which are calculated by formulas 4 and 5.

Formula 4: $R = \mathit{random}(0, \mathit{selectedFeatures.Size})$

Formula 5: $V = \mathit{selectedFeatures.Size} - R$

As an example of applying the above formulas, consider the following sets, in which f1 is the first feature and f41 is the last feature.

$p_i.\mathit{selectedFeatures} = \{f_3, f_1, f_5, f_2, f_4\}$

$p_i.\mathit{unselectedFeatures} = \{f_6, f_7, f_8, \ldots, f_{41}\}$

In these two sets, features 1, 2, 3, 4, and 5 have been selected and features 6, 7, 8, …, 41 have not been selected.

$R = \mathit{random}(0, 5) = 2$

$V = 5 - 2 = 3$

$\mathit{Reflection}_1 = \{f_1, f_4\}$

$\mathit{Visibility}_1 = \{f_{15}, f_{30}, f_{22}\}$

$\mathit{newSubset}_1 = \{f_1, f_4\} \cup \{f_{15}, f_{30}, f_{22}\} = \{f_1, f_4, f_{15}, f_{30}, f_{22}\}$

In the expression above, the union of the two sets forms the new set. Note that if unselectedFeatures contains fewer features than selectedFeatures, the algorithm can fail, because there may not be enough features left to transfer. To solve this problem, the set of all the features can be selected. In general, this step finds the best set among half of the groups; it contains already selected features together with features that will probably belong to the optimum set.

B: Implementation of third and fourth states

In this step, the best set is placed in bestSubset and the final set is announced as a new set. The iridophore layer reflects the incoming light by changing color. In this implementation, the color change is modeled as the deletion of one feature from the current feature set (the contents of selectedFeatures). The feature to delete is chosen randomly. This operation is shown in formulas 6 and 7.

Formula 6: $\mathit{Reflection} = \mathit{bestSubset.selectedFeatures} - \mathit{bestSubset.selectedFeatures}[R]$

Formula 7: $\mathit{Visibility} = \mathit{bestSubset.unselectedFeatures}[V]$

In the above formulas, R is the feature selected for deletion and V is the feature selected for addition. They are obtained from formulas 8 and 9.

Formula 8: $V = \mathit{random}(0, \mathit{bestSubset.unselectedFeatures.Size})$

Formula 9: $R = \mathit{random}(0, \mathit{bestSubset.selectedFeatures.Size})$

The new set is then built via formula 1. In other words, the new set is created by exchanging the features R and V. For example, assume that selectedFeatures={f3,f1,f2,f4} and unselectedFeatures={f5,f6,f7,…,f41}. The following values are then obtained.

$R = \mathit{random}(0, 4) = 2$

$\mathit{Reflection} = \{f_3, f_1, f_2, f_4\} - f_2$

$V = \mathit{random}(5, 41) = 7$

$\mathit{Visibility} = f_7$
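The two kinds of move described in parts A and B can be sketched as follows. This is an illustrative Python sketch rather than the authors' C# code: the function names and the dictionary layout are assumptions, `random(0, n)` is read as inclusive of both bounds in formula 4, and R and V are treated as zero-based positions in formulas 6 to 9.

```python
import random


def reflect_and_view(group, rng):
    """States 1-2 (formulas 1-5): keep R selected features, draw V unselected ones."""
    size = len(group["selected"])
    r = rng.randint(0, size)  # formula 4 (inclusive bounds are an assumption)
    v = size - r              # formula 5
    if v > len(group["unselected"]):
        raise ValueError("not enough unselected features to draw from")
    reflection = rng.sample(group["selected"], r)    # formula 2
    visibility = rng.sample(group["unselected"], v)  # formula 3
    return reflection + visibility                   # formula 1 (the parts are disjoint)


def swap_at(best, r, v):
    """States 3-4 (formulas 6-7): drop the feature at position r, add the one at position v."""
    reflection = [f for i, f in enumerate(best["selected"]) if i != r]  # formula 6
    visibility = best["unselected"][v]                                   # formula 7
    return reflection + [visibility]


def swap_random(best, rng):
    """Pick the two positions at random (formulas 8 and 9), then swap."""
    r = rng.randrange(len(best["selected"]))
    v = rng.randrange(len(best["unselected"]))
    return swap_at(best, r, v)


rng = random.Random(1)
group = {"selected": [3, 1, 5, 2, 4], "unselected": list(range(6, 42))}
new_subset = reflect_and_view(group, rng)

# Worked example from part B: position 2 holds f2, and f7 is brought in.
best = {"selected": [3, 1, 2, 4], "unselected": list(range(5, 42))}
swapped = swap_at(best, r=2, v=best["unselected"].index(7))
```

With the values of the worked example, `swapped` contains features 3, 1, 4, and 7, so the size of the subset is preserved while one feature is exchanged.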

C: Implementation of fifth state

To calculate the similarity between the input and output light, it is assumed that the input light is AVBestsubset.selectedFeatures and the output light is AVBestsubset.selectedFeatures with one of its features deleted. Visibility represents the feature to delete. The resulting set is very similar to the final pattern and is calculated according to formulas 10, 11, and 12.

Formula 10: $\mathit{newSubset}_i = \mathit{Reflection} - \mathit{Visibility}_i$

Formula 11: $\mathit{Reflection} = \mathit{AVBestsubset.selectedFeatures}$

Formula 12: $\mathit{Visibility}_i = \mathit{AVBestsubset.selectedFeatures}[i]$

Here i is the index of the feature to delete and R is the size of selectedFeatures. To illustrate this case, assume:

$\mathit{AVBestsubset.selectedFeatures} = \{f_2, f_6, f_1, f_9, f_{20}\}$

$R = 5$

$\mathit{newSubset}[5] = \{\{f_2, f_6, f_1, f_9\}, \{f_2, f_6, f_1, f_{20}\}, \{f_2, f_6, f_9, f_{20}\}, \{f_2, f_1, f_9, f_{20}\}, \{f_6, f_1, f_9, f_{20}\}\}$

$\mathit{newSubset}_1 = \mathit{newSubset}[1] = \{f_2, f_6, f_1, f_9\}$

In general, this case finds the best set by deleting each feature in turn. Note that in this step the number of features in AVBestsubset.selectedFeatures must also be one less than that of bestSubset. In this case, it is assumed that each light entering from the environment is matched by a random set. The number of sets produced is m=N-k, where N is the number of sets produced in the first step and k is a random number between 0 and N/2. After the answer sets are sorted in descending order, the answers from position k onward are selected for analysis. If the current set is better than the best set, it replaces the best set. This operation is repeated starting from the first step.
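The deletion move of the fifth state (formulas 10 to 12) amounts to generating every subset that omits exactly one feature. The sketch below reproduces the worked example; the function names are illustrative, the fitness function is left to the caller, and removing the last feature first is a choice made only to match the order of the example.

```python
def leave_one_out(selected):
    """Formulas 10-12: one candidate per feature, each omitting exactly one feature.

    The last feature is removed first so the order matches the worked example.
    """
    return [[f for f in selected if f != removed] for removed in reversed(selected)]


def best_candidate(selected, fitness):
    """Return the leave-one-out candidate with the highest fitness."""
    return max(leave_one_out(selected), key=fitness)


candidates = leave_one_out([2, 6, 1, 9, 20])
```

For a set of R features this produces R candidates of size R - 1, each of which is then scored by the evaluation function.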

RESULTS AND DISCUSSION

Selection of the optimum features set

Subsets of 13, 10, 5, and 3 features were used for the analysis. These sizes were chosen because the papers cited and used for comparison in this study used the same numbers and reported that they give the best efficiency [5, 6, 7].

Number of dimensions of three: With three features requested, the cuttlefish algorithm selected features 15, 16, and 34, as listed in Table 1.

Table 1. The effective set of 3 features out of 41 features.

| Feature number | Feature name | Type of data | Comments |
|---|---|---|---|
| 15 | su_attempted | Nominal | Value one means the command su root was used and zero means other commands were used. Using this command means entering the system root with full access to the system. |
| 16 | num_root | Continuous number | The number of accesses to the root. |
| 34 | dst_host_same_srv_rate | Continuous number | Percent of connections that in the last two seconds accessed the same service with the same destination. |

Number of dimensions of five: With five features requested, the cuttlefish algorithm selected features 15, 16, 27, 35, and 38, as listed in Table 2. Features 15 and 16 are shared with the three-feature set. To make these features easier to understand, their explanations are given in Table 2.

Table 2. The effective set of 5 features out of 41 features.



| Feature number | Feature name | Type of data | Comments |
|---|---|---|---|
| 15 | su_attempted | Nominal | According to Table 1 |
| 16 | num_root | Continuous number | According to Table 1 |
| 27 | rerror_rate | Continuous number | Percent of connections that had REJ error; connections that asked for the same host in the last two seconds |
| 35 | dst_host_diff_srv_rate | Continuous number | Percent of use of different services in the current host |
| 38 | dst_host_serror_rate | Continuous number | Percent of connections that had S0 error in the current host; connections that were requested but received no response |

Number of dimensions of ten: Running the program with ten features requested produced the features presented in Table 3, among which feature 15 is again shared with the other sets. The technical explanations of these features are given in Table 3.

Table 3. The effective features in the set of 10 features out of 41 features.

| Feature number | Feature name | Type of data | Comments |
|---|---|---|---|
| 1 | duration | Continuous | Connection time |
| 3 | service | Nominal | Service requested |
| 9 | urgent | Continuous | The number of packets that are marked to be delivered fast |
| 10 | hot | Continuous | The number of indicators that have been identified as essential |
| 14 | root_shell | Nominal | Value one means access to the root core and zero means other conditions |
| 15 | su_attempted | Nominal | According to Table 1 |
| 24 | srv_count | Continuous | The number of connections that have used the same services in the last two seconds |
| 31 | srv_diff_host_rate | Continuous | The number of connections that have used the same services in the last two seconds |
| 33 | dst_host_srv_count | Continuous | Percent of connections that have had the same hosts with the same services in the last two seconds |
| 36 | dst_host_same_src_port_rate | Continuous | Percent of connections that have used a constant port in the last two seconds in the current host |

Number of dimensions of thirteen: Running the program with 13 features requested produced the results presented in Table 4. Some of the features also appear in the previous sets, which shows the importance of these features and their relationship with the other features. The technical explanations of the selected features are given in Table 4.

Table 4. The effective features in the set of 13 features out of 41 features.

| Feature number | Feature name | Type of data | Comments |
|---|---|---|---|
| 1 | duration | Continuous | According to Table 3 |
| 6 | dst_bytes | Continuous | Bytes sent from destination to source |
| 12 | logged_in | Nominal | One if the connection was successful, otherwise zero |
| 15 | su_attempted | Nominal | According to Table 1 |
| 16 | num_root | Continuous | According to Table 1 |
| 24 | srv_count | Continuous | According to Table 3 |
| 25 | serror_rate | Continuous | Percent of connections with SYN error; connections that requested the same host in the last two seconds |
| 29 | same_srv_rate | Continuous | Percent of connections that have used the same services in the last two seconds |
| 32 | dst_host_count | Continuous | Percent of the number of connections that had the same host |
| 33 | dst_host_srv_count | Continuous | According to Table 3 |
| 34 | dst_host_same_srv_rate | Continuous | According to Table 1 |
| 38 | dst_host_serror_rate | Continuous | According to Table 2 |
| 39 | dst_host_srv_serror_rate | Continuous | Percent of connections that had the same service in the current host and in which an S0 error occurred |
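The overlap between the four selected sets can be checked directly from the feature numbers listed in Tables 1 to 4. The short Python sketch below counts in how many sets each feature occurs; the variable names are illustrative.

```python
from collections import Counter

# Feature numbers as listed in Tables 1 to 4, keyed by the requested set size.
SELECTED = {
    3: {15, 16, 34},
    5: {15, 16, 27, 35, 38},
    10: {1, 3, 9, 10, 14, 15, 24, 31, 33, 36},
    13: {1, 6, 12, 15, 16, 24, 25, 29, 32, 33, 34, 38, 39},
}

# Number of sets in which each feature appears.
counts = Counter(f for features in SELECTED.values() for f in features)
in_all_sets = sorted(f for f, c in counts.items() if c == len(SELECTED))
in_three_sets = sorted(f for f, c in counts.items() if c == 3)
```

According to the tables, feature 15 (su_attempted) is the only feature present in all four sets, and feature 16 (num_root) is present in three of them.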

As can be seen, feature 15 appears in all four sets and feature 16 in three of them (all except the ten-feature set). These features describe access to the root core and the number of accesses to it. The probability of attack is very high in this case because the root core gives complete access to the system. These features have also been observed in other researchers' studies [7].

Evaluation of the selected features in data set

The data set selected for evaluation is the 10 % subset of the standard KDD Cup 99 set, which has also been used in other related papers. In the papers used for comparison, three data sets of 5000, 10000, and 100000 records were used in addition to the 10 % data set. To compare the results of this study with those of the other papers, data sets of the same sizes were also used in this study and called KDD5000, KDD10000, and KDD100000; the 10 % data set appears as KDD500000. An overview of these data sets is given in Table 5. The KDD Cup 99 data set includes five general types of attack, all of which also exist in the selected sets.

Table 5. Selected Data Sets.

| Data set | Number of samples | The number of features | The overall number of attacks | The number of attacks |
|---|---|---|---|---|
| KDD5000 | 5000 | 41 | 5 | 67% |
| KDD10000 | 10000 | 41 | 5 | 60% |
| KDD100000 | 100000 | 41 | 5 | 70% |
| KDD500000 | 500000 | 41 | 5 | 73% |

Evaluation function selection

An evaluation function is needed to assess the features selected by the cuttlefish algorithm. After reviewing the existing evaluation functions, the multilayer neural network was selected. Support vector machines were not selected because of the volume of the data and because the data are not linearly separable. Decision trees have already been used in numerous papers. To differ from the papers used for comparison and to allow a better analysis of the proposed solution, the multilayer neural network was therefore used as the evaluation function. The classification part of this function, composed of the input and output layers, performs the classification. The evaluation accuracy can be improved by adjusting the number of neurons in the output layer. In this study, 10 neurons were used. In addition, to control the function, a limit of six errors was set for the test part: if the number of errors in the validation part reaches six, the function stops. For training, the limit was set to 1000. If this limit is reached, the system stops without continuing the process, and in that case it is unable to estimate the result. The evaluation proceeds as follows: the data are first evaluated with all of the features by the evaluation function, and the evaluation is then repeated with 3, 5, 10, and 13 features while the analysis variables are recorded, in order to compare the numbers of optimum features with each other and with the other papers.

Variables related to the analysis

Common variables must be defined to analyze and evaluate the feature sets obtained by the cuttlefish algorithm and to compare the results with those of the other papers.
After a review of the papers, several variables were selected, and their values were obtained for the selected optimum sets.

Variable 1 is the total number of attack records in the test data set.

Variable 1: $\mathit{TNAttack} = \mathit{TotalNumberOfAttackInTestData}$

Variable 2 is the number of attack records in the test data set that were correctly recognized as attacks after classification.

Variable 2: $\mathit{NAttackP} = \mathit{NumberOfAttackInTestWithPositivePredict}$

Variable 3 is the number of normal records that were wrongly recognized as attacks after classification.

Variable 3: $\mathit{NNormalF} = \mathit{NumberOfNormalWithNegativePredict}$

Variable 4 shows the classification quality, which is an output of the multilayer artificial neural network.

Variable 4: $\mathit{CE} = \mathit{CrossEntropy}$

Variable 5 is the ratio of the number of correctly recognized attacks to the total number of attacks.

Variable 5: $\mathit{DR} = \mathit{NAttackP} / \mathit{TNAttack}$

The computer used for the analysis had the following properties: CPU: Core i5, RAM: 6 GB, and Cache: 2 GB. It was unable to complete some of the analyses because of the high volume of data; these cases are noted in the related tables.

Evaluation of the selected features

Before the selected feature sets are analyzed one by one, the values of some of the required variables must be given. Following the standard rule of the multilayer neural network, 70 %, 15 %, and 15 % of each data set were used for training, validation, and test, respectively. The resulting values are given in Table 6. To analyze and evaluate each feature set on these data sets, the feature set and the data set were loaded into the evaluation function and the results were obtained as figures and diagrams.

Table 6. The set of utilized data.

| Data set | Number of records | Number of Normal state | Number of attack state | Number of records for Training | Number of records for Validation | Number of records for Test |
|---|---|---|---|---|---|---|
| KDD5000 | 5000 | 2000 | 3000 | 3500 | 750 | 750 |
| KDD10000 | 10000 | 4000 | 6000 | 7000 | 1500 | 1500 |
| KDD100000 | 100000 | 29180 | 70820 | 70000 | 15000 | 15000 |
| KDD500000 | 500000 | 97143 | 396742 | 345719 | 74083 | 74083 |
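The 70 % / 15 % / 15 % split and the stopping rule described under "Evaluation function selection" can be sketched in Python as follows. This is an interpretation, not the authors' MATLAB code: the function names are hypothetical, rounding the validation and test parts and giving the remainder to training is an assumption that happens to reproduce Table 6, and reading the "six errors" as six consecutive epochs without validation improvement (with a cap of 1000 training epochs) is also an assumption.

```python
def split_sizes(n_records, val_frac=0.15, test_frac=0.15):
    """Return (training, validation, test) sizes; training takes the remainder."""
    n_val = round(n_records * val_frac)
    n_test = round(n_records * test_frac)
    return n_records - n_val - n_test, n_val, n_test


def stopping_epoch(val_losses, max_fail=6, max_epochs=1000):
    """Return the 1-based epoch at which training stops.

    Training stops after max_fail consecutive epochs without a new best
    validation loss, or when max_epochs is reached, whichever comes first.
    """
    best = float("inf")
    fails = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, fails = loss, 0
        else:
            fails += 1
            if fails >= max_fail:
                return epoch
    return min(len(val_losses), max_epochs)


# KDD500000 holds 97143 normal + 396742 attack = 493885 records in Table 6.
sizes = split_sizes(97143 + 396742)
stop = stopping_epoch([5.0, 4.0, 3.0] + [3.5] * 10)
```

With these assumptions, the split of 493885 records gives 345719, 74083, and 74083 records, matching the last row of Table 6, and the example loss curve stops at epoch 9, six epochs after its best value.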

Evaluation of all of the features (41 features)

The results of evaluating all of the features on the four data sets are given in Table 7. Each row of the table shows one of the data sets, and each column represents one of the variables defined in the previous section.

Table 7. A general evaluation of all of the features by the proposed solution.

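| Data set (41 features) | TNAttack | NAttackP | NNormalF | CE | DR |
|---|---|---|---|---|---|
| 500000 | 396742 | 0 | 0 | 0 | 0 |
| 100000 | 70820 | 100% | 0% | 2.8 | 1 |
| 10000 | 6000 | 61% | 39% | 2.02 | 0.61 |
| 5000 | 3000 | 40% | 60% | 2.01 | 0.4 |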

With 41 features, the available hardware was unable to analyze the KDD500000 data set: after a long running time, the evaluation function could neither learn nor classify it. In the papers used for comparison, this data set was likewise not evaluated with all of the features because the hardware could not cope [6]. This is the main reason for selecting a subset of the features in an intrusion detection system: the hardware cannot handle this volume of data, which underlines the value of this research. When all of the features were evaluated on the KDD5000 and KDD100000 sets, the evaluation function could not find the relationships between the features. The first conclusion is that, in an intrusion detection system, the number of records matters in addition to the number of features.
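The variables reported in Table 7 and in the tables that follow (TNAttack, NAttackP, NNormalF, and DR) can be computed from the true and predicted labels of the test records. The Python sketch below is illustrative only: the function name and the string labels "attack" and "normal" are assumptions, and the toy labels are not taken from the KDD data. It returns counts and a ratio, whereas the tables report some of these values as percentages.

```python
def detection_metrics(y_true, y_pred):
    """Compute the analysis variables from true and predicted test labels."""
    pairs = list(zip(y_true, y_pred))
    tn_attack = sum(1 for t, _ in pairs if t == "attack")                        # Variable 1
    n_attack_p = sum(1 for t, p in pairs if t == "attack" and p == "attack")     # Variable 2
    n_normal_f = sum(1 for t, p in pairs if t == "normal" and p == "attack")     # Variable 3
    dr = n_attack_p / tn_attack if tn_attack else 0.0                            # Variable 5
    return {"TNAttack": tn_attack, "NAttackP": n_attack_p,
            "NNormalF": n_normal_f, "DR": dr}


# Toy example: 6 attack records and 4 normal records.
y_true = ["attack"] * 6 + ["normal"] * 4
y_pred = ["attack"] * 5 + ["normal"] + ["attack"] + ["normal"] * 3
metrics = detection_metrics(y_true, y_pred)
```

In the toy example, five of the six attacks are recognized and one normal record is flagged as an attack, so DR is 5/6.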

Evaluation of feature number thirteen

In this section, the set of thirteen features selected by the cuttlefish algorithm is analyzed. The general results of the analysis and the values of the variables defined above are given in Table 8.

Table 8. Evaluating 13 features with the proposed method.

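| Data set (13 features) | TNAttack | NAttackP | NNormalF | CE | DR |
|---|---|---|---|---|---|
| 500000 | 396742 | 99.4 | 0.6 | 1.8 | 99.7 |
| 100000 | 70820 | 99.1 | 0.9 | 1.72 | 99.1 |
| 10000 | 6000 | 100 | 0 | 5.2 | 100 |
| 5000 | 3000 | 100 | 0 | 5.18 | 100 |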

With 13 features, the hardware could analyze all four data sets. The values obtained for the variables are shown in Table 8.

Evaluation of ten features number

In this section, the set of ten features selected by the cuttlefish algorithm is analyzed and evaluated on the four data sets. The general results of the analysis and the values of the variables defined above are given in Table 9.

Table 9. Evaluating 10 features with the proposed method.

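| Data set (10 features) | TNAttack | NAttackP | NNormalF | CE | DR |
|---|---|---|---|---|---|
| 500000 | 396742 | 99.6 | 0.4 | 0 | 99.6 |
| 100000 | 70820 | 99.1 | 0.1 | 2.88 | 99.1 |
| 10000 | 6000 | 0 | 100 | 1.01 | 0 |
| 5000 | 3000 | 60 | 40 | 0.254 | 60 |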

With 10 features, the hardware could analyze all four data sets. With this number of features and the KDD5000 data set, the evaluation function could not find the relationships between the features.

Evaluation of five features number

In this section, the set of five features selected by the cuttlefish algorithm is analyzed and evaluated on the four data sets. The general results of the analysis and the values of the variables defined above are given in Table 10.

Table 10. Evaluating 5 features with the proposed method.

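| Data set (5 features) | TNAttack | NAttackP | NNormalF | CE | DR |
|---|---|---|---|---|---|
| 500000 | 396742 | 85.5 | 14.5 | 0 | 85.5 |
| 100000 | 70820 | 95.1 | 4.9 | 2.44 | 95.1 |
| 10000 | 6000 | 65.5 | 34.5 | 6.61 | 65.5 |
| 5000 | 3000 | 100 | 0 | 4.97 | 100 |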

With 5 features, the hardware could again analyze all four data sets. The time required for the analysis was significantly shorter than with more than 5 features.

Evaluation of three features number

In this section, the set of three features selected by the cuttlefish algorithm is analyzed and evaluated on the four data sets. The general results of the analysis and the values of the variables defined above are given in Table 11.

Table 11. Evaluating 3 features with the proposed method.

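| Data set (3 features) | TNAttack | NAttackP | NNormalF | CE | DR |
|---|---|---|---|---|---|
| 500000 | 396742 | 84 | 15.1 | 0.87 | 84 |
| 100000 | 70820 | 82 | 8 | 0.849 | 82 |
| 10000 | 6000 | 65 | 35 | 9.2 | 65 |
| 5000 | 3000 | 0.99 | 0.1 | 6.22 | 0.99 |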
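The NAttackP values of Tables 8 to 11 can be put side by side to see how each number of features behaves across the four data sets. The Python sketch below copies the values from those tables; using the worst case over the data sets as a summary is a choice made for this illustration, not a measure used in the paper.

```python
# NAttackP values copied from Tables 8 to 11, keyed by number of features.
N_ATTACK_P = {
    13: {"KDD500000": 99.4, "KDD100000": 99.1, "KDD10000": 100, "KDD5000": 100},
    10: {"KDD500000": 99.6, "KDD100000": 99.1, "KDD10000": 0, "KDD5000": 60},
    5: {"KDD500000": 85.5, "KDD100000": 95.1, "KDD10000": 65.5, "KDD5000": 100},
    3: {"KDD500000": 84, "KDD100000": 82, "KDD10000": 65, "KDD5000": 0.99},
}

# Worst NAttackP over the four data sets for each number of features.
worst_case = {k: min(values.values()) for k, values in N_ATTACK_P.items()}
most_robust = max(worst_case, key=worst_case.get)
```

By this summary, the thirteen-feature set is the only one whose recognized-attack value stays above 99 on every data set.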

Comparison of the features among the data sets

First, the diagram of the overall results for all of the data sets is plotted.

Figure 2. The comparison of features number in the recognized attack percents with the data set.

In Figure 2, the vertical axis is variable 2, the percentage of correctly recognized attacks, and the horizontal axis is the number of selected features. The number of correctly recognized attacks is a very important variable in an intrusion detection system and has received much attention in the other studies reviewed. The figure shows the percentage of recognized attacks for each data set and each number of selected features; all of the values come from the NAttackP columns of Tables 8 to 11. As shown in the figure, the best numbers of features, which led to the recognition of nearly all of the attacks, are 13 for all of the data sets and 10 for the 100000 and 500000 data sets. For 10 features, this value is 92.051 in [6] and 98 in [7], so the value of 99.6 obtained with ten features here is one of the strengths of the proposed solution, and 13 features perform better than 10 features. In Figure 3, the classification quality is presented for each number of features.

Figure 3. The diagram of the classification quality of the selected features in the data sets.

In Figure 3, the horizontal axis shows the number of selected features and the vertical axis shows variable 6, the classification quality. For this variable, low values indicate high classification quality. The variable has its lowest value with 41 features. However, because of the other problems with 41 features, the best number by this criterion is 10, followed by 13. In comparison with the other papers, this variable is equal to 1.27 in [7], which differs negligibly from the value obtained with 13 features on the 100000 data set.

It should be noted that, owing to the mechanism of the artificial neural network evaluation function, the number of training and validation repetitions causes no significant change in the values of the investigated variables. Repeating each step 10 times changed the variables by 0.1 unit. Therefore, the average is nearly equal to each individual value and the scattering of the variables is close to zero. Thus, the values can be compared meaningfully with those of the other papers only in terms of equality or inequality.

Conclusions

The main goal of this study is to enhance accuracy in intrusion detection systems. A method for feature selection and use in intrusion detection systems was proposed and applied to 10% of the KDD data set. It was shown that the algorithm proposed in this paper is more efficient than the algorithms proposed in previous papers. To allow comparison with the results of the other papers, different feature numbers, namely 3, 5, 10, 13, and 41, were chosen to be similar to those used in the other papers. For this purpose, four data sets of sizes 5000, 10000, 100000, and 500000 were selected. The feature sets were selected by implementing the cuttlefish algorithm in C# using the optimum-shape-finder function.

In the next step, the mentioned data sets were classified using the evaluation function of the artificial neural network. After completion of the steps of the proposed solution, the following results were obtained:

With 41 features, the 500000 data set could not be evaluated. This is one of the reasons supporting the selection of a limited number of features.

With 41 features, the method was unable to find the relationship between the features and had high error percentages on the 5000 and 10000 data sets.

With 13 features, the method had the best efficiency in the criteria evaluated in this research.

In comparison with the other papers, 13 features gave the best results in two criteria and 10 features in one criterion.
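
The two-stage procedure summarized in these conclusions (a search proposes feature subsets of a fixed size, and an evaluation function scores each one) can be sketched in Python as follows. This is only an illustration under stated assumptions: plain random search stands in for the cuttlefish algorithm, the hypothetical `toy_evaluate` function stands in for the neural-network evaluation function, and all names and parameter values are invented for the example rather than taken from the C# or MATLAB code used in this study.

```python
import random


def select_features(n_features, subset_size, evaluate, n_candidates=200, seed=0):
    """Return the best-scoring subset of `subset_size` feature indices.

    Random search is a simple stand-in for the cuttlefish algorithm;
    `evaluate` stands in for the neural-network evaluation function and
    must return a score where higher is better.
    """
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    for _ in range(n_candidates):
        # Draw a candidate subset of distinct feature indices.
        candidate = tuple(sorted(rng.sample(range(n_features), subset_size)))
        score = evaluate(candidate)
        if score > best_score:
            best_subset, best_score = candidate, score
    return best_subset, best_score


def toy_evaluate(subset, informative=frozenset({0, 1, 2, 3, 4})):
    """Hypothetical scorer: fraction of the informative features that were selected."""
    return len(informative.intersection(subset)) / len(informative)


# Subset sizes follow the feature numbers studied in the paper; 41 is the
# total number of features evaluated by the proposed solution.
for size in (3, 5, 10, 13):
    subset, score = select_features(41, size, toy_evaluate)
    print(size, subset, round(score, 2))
```

Keeping the search and the evaluation function separate means that either part can be replaced, for example by a real classifier trained on the selected columns, without changing the other.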

References

[1] S. Wu and Y. Ester, "Data mining-based intrusion detectors", Expert Systems with Applications, Vol. 36, No. 3, pp. 5605-5612, 2009.
[2] H. Liu et al., "Boosting feature selection using information metric for classification", Neurocomputing, Vol. 73, No. 1, pp. 295-303, 2009.
[3] L. Yu, H. Liu, "Efficient feature selection via analysis of relevance and redundancy", Journal of Machine Learning Research, Vol. 5, No. 2, pp. 1205-1224, 2004.
[4] H. Liu, L. Yu, "Toward integrating feature selection algorithms for classification and clustering", IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 4, pp. 491-502, 2005.
[5] Amrita and P. Ahmed, "A Hybrid-Based Feature Selection Approach for IDS", Springer International Publishing Switzerland, 2014.
[6] A. S. Eesa, Z. Orman, A. M. A. Brifcani, "A novel feature-selection approach based on the cuttlefish optimization algorithm for intrusion detection systems", Expert Systems with Applications, Vol. 42, pp. 2670-2679, 2015.
[7] H. A. Le Thi et al., "A Filter Based Feature Selection Approach in MSVM Using DCA and Its Application in Network Intrusion Detection", Intelligent Information and Database Systems, Springer International Publishing, pp. 403-413, 2014.
[8] http://fa.swewe.net/word_show.htm
