Science and Information Conference 2014 August 27-29, 2014 | London, UK

CAC-UA: a Communicating Ant for Clustering to detect Unknown Attacks

Mokrane Kemiche

Rachid Beghdad

Faculty of Sciences, Abderrahmane Mira University, Béjaïa 06000, Algeria [email protected]

Faculty of Sciences, Abderrahmane Mira University, Béjaïa 06000, Algeria [email protected]

Abstract—We introduce a novel algorithm to detect unknown attacks, based on the Communicating Ant for Clustering (CAC) algorithm [1], which, unlike other ant-based algorithms, leads to a better detection rate (DR). Second, having noted the low DR for R2L attacks, we improve this approach by hybridizing it with an association rules approach. In addition to the similarity measure computed from the continuous attributes of the KDD (Knowledge Discovery in Databases) dataset [2], we also apply association rules to the discrete attributes. These rules, generated with the Apriori algorithm [3], are used by the ants to reach a better DR than some known intrusion detection methods. Our solution is implemented and evaluated using the KDD dataset. Simulations confirm the robustness of our approach in terms of the DR of both known and unknown attacks.

Keywords- Intrusion detection, Unknown attacks, Ant, CAC (communicating ant clustering), Association rules, KDD dataset.



The advent of computer systems, networks, and the Internet has revolutionized the lives of individuals and businesses. Information technology owes its popularity to its ease of use. Nowadays, working, browsing the web, checking a bank account, and communicating over networks and the Internet from a personal computer are daily activities, and a world without communication networks and the Internet is unimaginable. This popularity does not bring only advantages: it often undermines reliability. A reliable computer system must ensure the availability of services and the integrity and confidentiality of data. If any one of these three conditions is not satisfied, the system faces a blockage due to lack of service, false results, or a disclosure of secrets. In each of these three cases, losses are considerable.

This phenomenal growth is naturally accompanied by an increase in the number of users. These users, known and unknown, do not necessarily have good intentions towards these networks. Since these networks have become potential targets for attack, securing them is unavoidable. To address this problem, companies and governments are investing heavily in security, and the protection of computer systems has become a major concern of all IT departments. It is rarely possible to tell, with a simple glance at computer equipment, that it has been used in an undesirable way. To alleviate this arduous task, software tools called Intrusion Detection Systems (IDSs) have been developed to automate part of the work.

Many methods and frameworks have been developed to detect intrusions. Various techniques, such as decision trees, artificial neural networks, association rules, clustering, support vector machines, ant colonies, and others, have been applied to detect old and new attacks, but they struggle to detect new attacks with a high DR and a low false alarm rate (FAR), because attackers are constantly finding new forms of attack. As a result, the IDSs or firewalls installed to protect a computer system cannot detect these new attacks.

In this research, we propose a method, CAC-UA, to detect old and new attacks with high DRs, in two steps. First, we let ants evolve on a grid that contains connections of the KDD dataset (attacks and normal behaviors) placed randomly. At each iteration, the ants are on a cell that contains a connection. The first ant communicates with all the others lying on a cell that also contains a connection, in order to gather similar connections in the same heap. We obtain two heaps at the end: the first contains the attacks and the other the normal behavior. We note that the low DR of R2L (Remote To Local) attacks is due to the similarity of some attacks (snmpgetattack and snmpguess) to normal behavior, especially when using a set of selected features. This is why, second, to improve the low DR of R2L attacks, we hybridize CAC-UA with association rules using the discrete attributes of the KDD dataset.

The rest of this paper is organized as follows. Section 2 presents related work on intrusion detection. The CAC algorithm is detailed in Section 3. Section 4 describes the first approach using CAC-UA and its experimental results. Section 5 describes the hybrid approach using CAC-UA and association rules, its experimental results, and a comparison with some known approaches. Section 6 concludes the paper.

II.


In the intrusion detection field, two classes of methods have been defined: anomaly detection and misuse detection. Anomaly detection consists of establishing a normal behavior profile for user and system activity and observing significant deviations of actual user activity from the established habitual pattern. Misuse detection targets intrusions that follow well-defined attack patterns exploiting weaknesses in system and application software.


Among IDSs, there is PHAD (Packet Header Anomaly Detection) [4], which establishes the normal profile of packets from the data link, network, and transport layers based on the probability of an event occurring and the rate of anomalies during the training period. This approach was improved in [5], which proposed an IDS called NETAD (Network Traffic Anomaly Detection) that, unlike PHAD, does not use all the network traffic but filters out uninteresting traffic and uses a new anomaly score function to detect deviations. The approach proposed in [7] is also based on network packets but targets the R2L class only; it monitors application layer protocols by assigning a score to packets and detecting deviations from the normal profile. The IDS called Payl (Payload) [8] also works at the application and network layers, on the packet payload. It computes a frequency distribution of bytes using n-grams, and an anomaly is signaled after computing the Mahalanobis distance. Another approach, proposed in [9], improves the ADAM (Audit Data Analysis and Mining) system, which is based on association rules to detect attacks; the improvement consists of adding a technique called pseudo-Bayes estimators to enhance the system's ability to detect new attacks. Several other recent methods apply different classification and machine learning techniques to the KDD database. Among them are the deterministic approach of [10], which focuses on detecting new U2R (User To Root) attacks by modeling the IDS as a binary linear program, leading to a 100% DR for U2R, and the approach of [11], where principal component analysis is used in an iterative 3-tier architecture called RePIDS (Real-time Payload-based Intrusion Detection System) that detects attacks in network payloads in real time. In [12], [13], the authors used neural networks to test the ability of an MLP (Multi-Layer Perceptron) to detect new attacks.
Other approaches are based on decision trees, using either the ID3 (Iterative Dichotomiser 3) algorithm [14] or the C4.5 algorithm [15], where the authors developed an improved version that classifies new instances of attacks into a new-attack class instead of the normal class. In [16], the authors proposed a hybrid model combining misuse and anomaly detection. Misuse detection is performed by a sequential hierarchical model using a binary tree that separates one attack at each level. For anomaly detection, they used association rules. In [17], the authors applied support vector machines in their IDS but obtained a low rate for new attacks.

Finally, we present some approaches based on ants, such as [18], which is inspired by the ant clustering algorithm that simulates the behavior of ants picking up or depositing objects, and in which each ant is totally independent and does not communicate. In [19], the authors combine the SVM (Support Vector Machine) method with CSOACN (Clustering based on Self-Organized Ant Colony Network) to take advantage of both while avoiding their weaknesses. The authors of [20] propose an Ant-Miner based classification system. Its main improvement is the introduction of multiple ant colony optimizations (MACO) instead of a single one. Each ant belonging to a colony deposits a distinct type of pheromone, which affects only the ants belonging to the same colony. The colonies search in parallel to finally discover one rule per colony. The rule with the best quality is selected and added to the rule set. In [21], the authors proposed MACO-I, whose modifications to the algorithm aim to improve accuracy and learning time. The algorithm stores all the high-quality rules generated by the entire ant colony, instead of simply saving the best rule produced by each ant. The rules are sorted with respect to their predictive accuracy. In [22], the authors proposed a hybrid multilevel IDS that combines a decision tree classifier with an ant colony clustering algorithm; the resulting IDS achieves competitive detection rates. In the first level, the enhanced C4.5 algorithm is used to classify connection records into DoS, Probe, and "Others". The "Others" class contains U2R, R2L, and normal connections. In the second level, the ant colony algorithm splits the data into two clusters, normal and abnormal traffic. The cluster with abnormal connections is easily distinguished because it is smaller in size. Finally, in the third level, the C4.5 algorithm classifies the abnormal traffic. The disadvantage of the ant-based approach is its long execution time when clustering large data.

A. Discussions

Some methods are efficient [6, 10], with a DR of 100%, but detect only a given set of attacks. All the other methods presented above that used the KDD dataset suffer in terms of the DR of new attacks, since they did not take into account the problems of the KDD dataset itself. The most interesting ones are those based on ant colonies, which have reached a high DR. This is why we introduce here a new approach, based on CAC and association rules, to better detect both known and unknown attacks; we call it CAC-UA.

III.


A. CAC Description

Many algorithms based on artificial ants are inspired by the behavior of real ants searching for food (the deposit of pheromone trails). These algorithms have been successfully applied to various combinatorial optimization problems. Another phenomenon observed by biologists, the collective sorting of the brood (larvae storage) and the creation of cemeteries (storage of corpses) in some species of ants, motivates solving the classification problem with artificial ants. The advantage of the CAC algorithm [1] used in this work is that it has few parameters and is not centralized: each ant is autonomous but communicates with its congeners through signals. The details of the algorithm are given in the following subsection.

B. Details of the algorithm

The ants evolve on a 2D grid made up of cells that contain objects and ants. The objects are distributed randomly on the grid, with the constraint that a cell initially contains only a single object; after the ants have moved objects, several similar objects can be placed on one cell, forming a heap. The grid is toroidal, meaning that ants move from one side of the grid to the opposite side in a single step. The size of the grid is determined automatically from the number of objects to be classified. If N is the number of objects, the grid has L cells per side, where L is the ceiling of the square root of N:

$L = \lceil \sqrt{N} \rceil$
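The grid sizing and the initial random placement of objects can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the sizing rule reads as L = ceil(sqrt(N)), and the names `grid_side` and `place_objects` are hypothetical.

```python
import math
import random


def grid_side(n_objects: int) -> int:
    """Smallest L with L * L >= n_objects, i.e. ceil(sqrt(N)) (assumed reading)."""
    side = math.isqrt(n_objects)
    return side if side * side == n_objects else side + 1


def place_objects(n_objects: int, rng: random.Random) -> dict[tuple[int, int], list[int]]:
    """Scatter object indices over the grid, at most one object per cell."""
    side = grid_side(n_objects)
    cells = [(row, col) for row in range(side) for col in range(side)]
    chosen = rng.sample(cells, n_objects)  # distinct cells, so no collisions
    # Each cell maps to a list so that ants can later stack objects into heaps.
    return {cell: [obj] for obj, cell in enumerate(chosen)}


grid = place_objects(10, random.Random(0))
print(grid_side(10), len(grid))  # 4 10
```

Storing a list per cell is a design choice made here so that a heap is simply a cell whose list has grown beyond one object.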


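[Figure: flowchart residue; recoverable box labels: KDD Training Set; Feature selection for normal behavior; Compute the minimum distance (dmin) and the average distance (dmoy); Test Set (new attacks); Inputs of the ant colony algorithm.]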


Like the objects, the ants are positioned randomly on the grid. At each iteration of the algorithm, each ant moves randomly on the grid, taking into account that the grid is toroidal: it can move to one of the eight cells in its neighborhood.
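One way to realize a random move on a toroidal grid is modular arithmetic on the coordinates. The sketch below assumes that all eight neighbours are equally likely, which the text does not state; `NEIGHBOUR_OFFSETS` and `random_step` are hypothetical names.

```python
import random

# The eight neighbourhood offsets (row delta, column delta), excluding "stay put".
NEIGHBOUR_OFFSETS = [
    (dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)
]


def random_step(position, side, rng):
    """Move an ant to one of its eight neighbours, wrapping around the edges."""
    row, col = position
    d_row, d_col = rng.choice(NEIGHBOUR_OFFSETS)
    # The modulo makes the grid toroidal: leaving one side re-enters on the opposite one.
    return ((row + d_row) % side, (col + d_col) % side)


print(random_step((0, 0), 5, random.Random(0)))
```

From the corner cell (0, 0) of a 5 x 5 grid, for instance, the offset (-1, -1) leads to cell (4, 4).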


The number of ants is fixed automatically. If L is the size of one side of the grid, then the number of ants A is:

$A = (L \times L)/9$
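As a worked example of this rule, the sketch below computes the ant count for a grid of side 32 and drops the ants on random cells. Rounding down, keeping at least one ant, and allowing ants to share a cell are all assumptions made for illustration; `ant_count` and `place_ants` are hypothetical names.

```python
import random


def ant_count(side: int) -> int:
    """A = (L * L) / 9, rounded down, with at least one ant (both assumptions)."""
    return max(1, (side * side) // 9)


def place_ants(side: int, rng: random.Random) -> list[tuple[int, int]]:
    """Drop each ant on a uniformly random cell (cells may be shared: an assumption)."""
    return [(rng.randrange(side), rng.randrange(side)) for _ in range(ant_count(side))]


ants = place_ants(32, random.Random(0))
print(ant_count(32), len(ants))  # 113 113, since 1024 // 9 = 113
```

The ratio amounts to roughly one ant per 3 x 3 block of cells.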




The distance measure used in this algorithm is the Euclidean distance. To compute the similarity between two heaps, the maximum distance is used:

$d(T_i, T_j) = \max[d(O_i, O_j)]$
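The heap-to-heap distance can be illustrated with a short sketch, assuming each object is a tuple of numeric (continuous) attributes and each heap is a non-empty list of such objects. The function name `heap_distance` is hypothetical.

```python
import math
from itertools import product


def heap_distance(heap_i, heap_j):
    """Maximum Euclidean distance over all pairs (one object from each heap)."""
    return max(math.dist(o_i, o_j) for o_i, o_j in product(heap_i, heap_j))


t_i = [(0.0, 0.0), (1.0, 0.0)]
t_j = [(4.0, 0.0), (0.0, 3.0)]
print(heap_distance(t_i, t_j))  # 4.0, reached by the pair (0, 0) and (4, 0)
```

Taking the maximum over all pairs is the same criterion as complete linkage in hierarchical clustering: two heaps count as close only if even their most dissimilar members are close.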



Here, $O_i$ and $O_j$ denote objects in the heaps $T_i$ and $T_j$, respectively. I