Because of increased network connectivity, computer systems are ... The degree of protection from such malicious actions depends on ..... California, Davis.
©2010 International Journal of Computer Applications (0975 – 8887) Volume 1 – No. 2
An Effective Approach to Network Intrusion Detection System using Genetic Algorithm Ms.Nivedita Naidu
Dr.R.V.Dharaskar
Student,MTech,IIISem, Department of Computer Science & Engineering,
Professor & Head, Department of Computer Science & Engineering,
G.H.Raisoni College of Engineering,Nagpur,
G.H.Raisoni College of Engineering,Nagpur,
Maharashtra,India.
Maharashtra,India.
ABSTRACT This paper presents a general overview of Intrusion Detection Systems and the methods used in these systems, giving brief points of the design principles and the major trends. Artificial intelligence techniques are widely used in this area such as fuzzy logic and Genetic algorithms. In this paper, we will focus on the Genetic algorithm technique and how it could be used in Intrusion Detection Systems giving some examples of systems and experiments proposed in this field. The purpose of this paper is to give a clear understanding of the use of Genetic Algorithm in IDS.
Categories and Subject Descriptors D.3.3 [C++]: Language Contructs and Features – abstract data types, polymorphism, control structures. Yisual basic 6 and windows API function and winsock control.
General Terms Algorithms, Performance, Security, Theory.
Keywords Intrusion Detection System, Genetic Algorithm.
1. INTRODUCTION Because of increased network connectivity, computer systems are becoming increasingly vulnerable to attack. These attacks often exploit laws in either the operating system or application programs. The general goal of such attacks is to subvert the traditional security mechanisms on the systems and execute operations in excess of the intruder's authorization. These operations could include reading protected or private data or simply doing malicious damage to the system or user files. The degree of protection from such malicious actions depends on the amount of time and effort spent building and maintaining the system's security defenses. By building complex tools, which continually monitor and report activities, a system security operator can catch potentially malicious activities as they occur.
However, this involves a large expense in terms of time and money in both building and maintaining such a monitoring system. The monitoring will also impose a performance penalty on the system being protected - something which the users may object to. This paper proposes a mechanism for building such a monitoring system which does not involve a significant process operates independently of the other agents, but they all cooperate in monitoring the system.This approach has significant advantages in terms of overhead, scalability and flexibility.
2. INTRUSIONS AND INTRUSION DETECTION: An intrusion can be defined as [1]: Any set of actions that attempt to compromise the integrity, confidentiality or availability of a resource and they can be categorized into two main classes: Misuse intrusions : They are well defined attacks on known weak points of a system. They can be spotted by watching for certain actions being performed on certain objects. Anomaly intrusions: These are based on observations of deviations from normal system usage patterns. They are caught by examining log messages resulting from system calls. This can be done using a pattern matching approach such as in [3]. However, anomaly intrusions are harder to detect. There are no fixed patterns that can be monitored for and so a more ―fuzzy" approach must be taken. Ideally we would like a system that combined human-like pattern matching capabilities with the vigilance of a computer program. Thus it would always be monitoring the system for potential intrusions, but would be able to ignore spurious false intrusions if they resulted from legitimate user actions. It would rely on heuristics to decide this - they could either be pre-specified (by a human operator) or learned by the system over time. However heuristics will not always guarantee perfect accuracy, so another goal is to minimize the probability of incorrect classification. 26
©2010 International Journal of Computer Applications (0975 – 8887) Volume 1 – No. 2 Intrusion Detection: An intrusion detection system (IDS) must [2] identify, preferably in real time, unauthorized use, misuse, and abuse of computer systems. An intrusion detection system does not attempt to stop an intrusion as it occurs. Its role is to alert a system security officer that a potential security violation is occurring. As such it is a reactive, rather than proactive, form of system defense. An intrusion detection system can either be hostbased or network-based. Often a system is a hybrid of the two approaches. A host based system will monitor all the activity on a single host computer. It will ensure that no user operations are violating the site security policy. A network based system monitors on a net-wide basis -it will consider actions occurring on the network and analyze them as to whether they constitute potential security violations. In either case, the resulting system is often a large monolithic module. This module performs all of the monitoring, data gathering, data manipulation and decision making for the whole system. It either sits in, or on top of, the system kernel. Depending on the data being gathered, it can monitor system audit logs, user activities and system state. From these observations, it will deduce some metrics about the system's overall security state, and decide whether an intrusion is currently occurring. The most apparent problem with this approach to building an IDS is the overhead it imposes on the system being protected. Often, a single system is analyzed by another system due to the added overhead of the IDS. If audit logs are being analyzed, the kernel must generate audit information for all the actions it performs. This results in a large amount of information, which must be stored at least semi-permanently on some storage medium. Generating these detailed logs consumes both disk-space and CPU time. Once these logs are generated, the IDS must read and interpret them, attempting to correlate activities with other system information. An IDS may also perform its own system monitoring - it may keep aggregate statistics which give a system usage profile. These statistics can be derived from a variety of sources - CPU usage, disk I/O, memory usage, activities by users, number of attempted logins etc. These statistics must be continually updated, and correlated with an internal model. This model may describe a set of intrusion scenarios or perhaps it encodes the profile of a ―clean" system. Either way, significant processing must occur in matching the observed profile to an internal model.
2.1 Intrusion Detection Systems: Implementation and Integration 2.1.1 Host-based Zirkle [207] described host-based IDS as ―loading a piece of software on the system to be monitored‖. This software, which is generally defined as either host wrappers/personal firewalls or agent-based software, performs the following: • Uses log files and/or the system's auditing agents as sources of data. • Looks at the communications traffic in and out of a single computer; • Checks the integrity of system files, and watches for suspicious processes, including changes to system files and user privileges. Host-based detection software is particularly effective in detecting
trusted-insider attacks (―anomalous activity‖). One drawback for host based intrusion detection is that the software must be installed on each computer on the network to be protected. 2.1.2 Network-based Northcutt [138-139] described network based intrusion detection system (NIDS) as an ID system that monitors the traffic on its network segment as a data source. Implementation requires: • The network interface card is placed in promiscuous mode to capture all network traffic that crosses its network segment; and • A sensor, which monitors packets traveling on that network segment. The objective is to determine if packet flow matches a known signature. There are three signatures that are particularly important: 1. String signatures that look for a text string that indicates a possible attack, 2. Port signatures simply watch for connection attempts to well known, frequently attacked ports, and 3. Header signatures that watch for dangerous or illogical combinations in packet headers.
3.GENETIC ALGORITHMS ―A Genetic Algorithm (GA) is a programming technique that mimics biological evolution as a problem-solving strategy.‖[2] It is based on Darwinian‘s principle of evolution and survival of fittest to optimize a population of candidate solutions towards a predefined fitness. [16][18] GA uses an evolution and natural selection that uses a chromosome-like data structure and evolve the chromosomes using selection, recombination, and mutation operators. The process usually begins with randomly generated population of chromosomes, which represent all possible solution of a problem that are considered candidate solutions. Different positions of each chromosome are encoded as bits, characters or numbers. These positions could be referred to as genes. An evaluation function is used to calculate the goodness of each chromosome according to the desired solution; this function is known as ―Fitness Function‖. During evaluation, two basic operators, crossover and mutation, are used to simulate the natural reproduction and mutation of species. The selection of chromosomes for survival and combination is biased towards the fittest chromosomes. [16][18][20][31][3] The following figure taken from [16] shows the structure of a simple genetic algorithm. Starting by a random generation of initial population, then evaluate and evolve through selection, recombination, and mutation. Finally, the best individual (chromosome) is picked out as the final result once the optimization meet it target (Pohlheim, 2001). GA has some common elements and parameters which should be defined: Fitness Function is defined according to [2], ―The fitness function is defined as a function which scales the value individual relative to the rest of population.‖ It computes the best possible solutions from the amount of candidates located in the population.
27
©2010 International Journal of Computer Applications (0975 – 8887) Volume 1 – No. 2
Figure 1: Application to the Intrusion Detection problem. GA Operators According to the figure above, the selection, mutation and crossover are the most effective parts in the algorithm as they are they participate in the generation of each population. Selection is the phase where population individuals with better fitness are selected, otherwise it gets damaged. Crossover is a process where each pair of individuals selects randomly participates in exchanging their parents with each other, until a total new population has been generated. Mutation flips some bits in an individual, and since all bits could be filled, there is low probability of predicting the change.
4. DETECTION ALGORITHM OVERVIEW In [8], a generic algorithm has been presented which contains a training process. This algorithm is designed to apply set of classification rules according to the input data given. It follows the simple flow of genetic algorithms presented in the Figure 1. [8][16][2]. Applying genetic algorithm to intrusion detection seems to be a promising area. We discuss the motivation and implementation details in this section.
4.1 Overview : Genetic algorithms can be used to evolve simple rules for network traffic (Sinclair, Pierce, and Matzner 1999). These rules are used to differentiate normal network connections from anomalous connections. These anomalous connections refer to events with probability of intrusions. The rules stored in the rule base are usually in thefollowing form (Sinclair, Pierce, and Matzner 1999): if { condition } then { act }
A Generic GA-based intrusion Detection Approach For the problems we presented above, the condition usually refers to a match between current network connection and the rules in IDS, such as source and destination IP addresses and port numbers (used in TCP/IP network protocols), duration of the connection, protocol used, etc., indicating the probability of an intrusion. The act field usually refers to an action defined by the security policies within an organization, such as reporting an alert to the system administrator, stopping the connection, logging a message into system audit files, or all of the above. For example, a rule can be defined as: if {the connection has following information: source IP address 124.12.5.18; destination IP address: 130.18.206.55; destination port number: 21; connection time: 10.1 seconds } then {stop the connection} This rule can be explained as follows: if there exists a network connection request with the source IP address 124.12.5.18, destination IP address 130.18.206.55, destination port number 21, and connection time 10.1 seconds, then stop this connection establishment. This is because the IP address 124.12.5.18 is recognized by the IDS as one of the blacklisted IP addresses; therefore, any service request initiated from it is rejected. The final goal of applying GA is to generate rules that match only the anomalous connections. These rules are tested on historical connections and are used to filter new connections to find suspicious network traffic. In this implementation, the network traffic used for GA is a preclassified data set that differentiates normal network connections from anomalous ones. This data set is gathered using network sniffers (a program used to record network traffic without doing something harmful) such as Tcpdump (http://www.tcpdump.com) or Snort (http://www.snort.com). The data set is manually 28
©2010 International Journal of Computer Applications (0975 – 8887) Volume 1 – No. 2 classified based on experts‘ knowledge. It is used for the fitness evaluation during the execution of GA. By starting GA with only a small set of randomly generated rules, we can generate a larger data set that contains rules for IDS. These rules are ―good enough‖ solutions for GA and can be used for filtering new network traffic. Table 1. Rule definition for connection and range of values of each field
(130.18.176.0 ~ 130.18.255.255), source port number 42335,destination port number 80, duration time 482 seconds, ends with state 11 (the connection terminated by the originator), uses protocol type 2 (TCP), and the originator sends 7320 bytes of data, the responders sends 38891 bytes of data, then this is a suspicious behavior and can be identified as a potential intrusion. The actual validity of this rule will be examined by matching the historical data set comprised of connections marked as either anomalous or normal. If the rule is able to find an anomalous behavior, a bonus will be given to the current chromosome. If the rule matches a normal connection, a penalty will be applied to the chromosome. Clearly no single rule can be used to separate all anomalous connections from normal connections. The population needs evolving to find the optimal rule set. In the example shown in Table 1, some wild cards (the ‗*‘ character and the ‗?‘ character) are used and the corresponding genes within the chromosome are shown as –1. These wild cards are used to represent an appropriate range of specific values (Crosbie and Spafford, 1995). It is useful when representing a network block (a range of IP addresses or port numbers) in a rule. Once the spatial information is included in the rules, the capability of the IDS can be greatly improved as an intrusion may initiate from many different locations. The inclusion of the duration time of a network connection in the chromosome ensures incorporation of temporal information for network connections. The maximum value of duration time is 99999999 seconds, which is more than a year. This is helpful for identifying intrusions because complex intrusions may span hours, days, or even months.
4.2 Data Representation In order to fully exploit the suspicious level, we need to examine all fields related with a specific network connection. For simplicity, we only consider some obvious attributes for each connection. The definition of rules (for TCP/IP protocols) is shown in Table 1. if {the connection has following information: source IP address 209.11.??.??; destination IP address: 130.18.176+?.??; source port number: 42335; destination port number: 80; connection time: 482 seconds; the connection is stopped by the originator; the protocol used is TCP; the originator sent 7320 bytes of data; and the responder sent 38891 bytes of data } then {stop the connection} We can convert the above example into the chromosome form, as described below. (d, 1, 0, b, -1, -1, -1, -1, 8, 2, 1, 2,b, -1, -1, -1, 4, 2, 3, 3, 5, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 4, 8, 2, 1, 1, 2, 0, 0, 0, 0, 0, 0, 7, 3, 2, 0, 0, 0, 0, 0, 0, 3, 8, 8, 9, 1) Altogether there are fifty-seven genes in each chromosome. For simplicity, we use hexadecimal representations for the IP addresses. The rule can be explained as follows: if a network connection with source IP address 209.11.??.?? (209.11.0.0 ~ 209.11.255.255), destination IP address 130.18.176.??
The genetic algorithm starts with a population that has randomly selected rules. The population can evolve by using the crossover and mutations operators. Due to the effectiveness of the evaluation function, the succeeding populations are biased toward rules that match intrusive connections. Ultimately as the algorithm stops, rules are selected and added into the IDS rule base.
4.3 Parameters in Genetic Algorithm There are many parameters to consider for the application of GA. Each of these parameters heavily influences the effectiveness of the genetic algorithm. The methodology and related parameters are discussed in the following section. 4.3.1 Evaluation function The evaluation function is one of the most important parameters in genetic algorithm. The proposed implementation differs from the scheme used by (Crosbie and Spafford, 1995) in that the definition on calculations of outcome and fitness is different. The following steps are used to calculate the evaluation function. First the overall outcome is calculated based on whether a field of the connection matches the pre-classified data set, and then multiply the weight of that field. The Matched value is set to either 1 or 0. Outcome= ∑Matched* Weighti
29
©2010 International Journal of Computer Applications (0975 – 8887) Volume 1 – No. 2 The fitness of a chromosome is computed using the above penalty: fitness = 1- penalty Obviously, the range of the fitness value is between 0 and 1. By defining evaluation, we have incorporated both temporal and spatial information needed for identification of network intrusions.
5. SYSTEM ARCHITECTURE
Figure 3. Architecture of applying GA into Intrusion Detection Figure 2. Order of weights for fields in the evaluation function The order of weight values in the function is shown in Figure 2. These orders are categorized according to different fields in the connection record as reported by network sniffers. Therefore, all genes representing destination IP address field have the same weight. The actual values can be finely tuned at execution time. The basic idea behind this order is the importance of different fields in TCP/IP packets. This scheme is straightforward and intuitive. Destination IP address is the target of an intrusion while the source IP address is the originator of the intrusion. These are the most important pieces of information needed to capture an intrusion. Destination port number indicates to applications that the target system is running (for example, FTP service usually runs on port 21). Some IP addresses are more probable targets for intrusions—for example, IP addresses for military domains. Domain-specific information is less important compared with the source IP addresses. Other parameters like duration, bytes sent by the originator, bytes sent by the receiver, and state are usually less important than the above fields but are still useful. The protocol and source port number fields are commonly dispensable and are used for identifying some specific intrusions. The absolute difference between the outcome of the chromosome and the actual suspicious level is then computed using the following equation. The suspicious level is a threshold that indicates the extent to which two network connections are considered a ―match.‖ The actual value of suspicious level reflects observations from historical data. ∆ = | outcome – suspicious level | Once a mismatch happens, the penalty value is computed using the absolute difference. The ranking in the equation indicates whether or not an intrusion is easy to identify. Penalty = (∆ * ranking)/100
Figure 3 shows the structure of this implementation. We need to collect enough historical data that includes both normal and anomalous network connections. The MIT Lincoln Laboratory (http://www.ll.mit.edu/) data set for testing IDSs, which is represented in the Tcpdump binary format, is a good choice. This is the first part inside the system architecture. This data set is analyzed by the network sniffers and results are fed into GA for fitness evaluation. Then the GA is executed and the rule set is generated. These rules are stored in a database to be used by the IDS.
6. CONCLUSION AND FUTURE WORK In this paper, we discussed a methodology of applying genetic algorithm into network intrusion detection techniques. A brief overview of Intrusion Detection System (IDS), genetic algorithm, and related detection techniques are discussed. The system architecture is also introduced. Factors affecting the GA are addressed in detail. This implementation of genetic algorithm is unique as it considers both temporal and spatial information of network connections during the encoding of the problem; therefore, it should be more helpful for identification of network anomalous behaviors. Future work includes creating a standard test data set for the genetic algorithm proposed in this paper and applying it to a test environment. Detailed specification of parameters to consider for genetic algorithm should be determined during the experiments. Combining knowledge from different security sensors into a standard rule base is another promising area in this work.
7. REFERENCES [1] R. Heady, G. Luger, A. Maccabe, M. Servilla. The architecture of a network level intrusion detection system. Technical Report, University of New Mexico, Department of Computer Science, August 1990.
30
©2010 International Journal of Computer Applications (0975 – 8887) Volume 1 – No. 2 [2]Bobor, V. Efficient Intrusion Detection System Architecture Based on Neural Networks and Genetic Algorithms, Department of Computer and Systems Sciences, Stockholm University / Royal Institute of Technology, KTH/DSV, 2006.
[12] Bradford, P. G., and N. Hu. A Layered Approach to Insider Threat Detection and Proactive Forensics. 21st Annual Computer Security Applications Conference, Applied Computer Security Associates (ACSA),December 5-9, 2005, Tucson, Arizona
[3]Faraoun, K M., and A. Boukelif. Genetic Programming Approach for Multi-Category Pattern Classification Applied to Network Intrusions Detection International Journal of Computational Intelligence, Vol. 3, No. 1, 2006 pp. 79-90.
[13] Brugger, S. T. Data Mining Methods for Network Intrusion Detection. Terry Brugger's Homepage. 9 June 2004. University of California, Davis. 6 Oct. 2006.
[4]Zhang, J., and M. Zulkernine.Anomaly Based Network Intrusion Detection withUnsupervised Outlier Detection, Symposium on Network Security and Information AssuranceProc. of the IEEE International Conference on Communications (ICC), June 2006, Istanbul,Turkey. [5]Kabiri, P., and Ali A. Ghorban. Research on Intrusion Detection and Response: a Survey International Journal of network security, The Intelligent & Adaptive Systems Group (IAS),Vol. 1, No. 2, 4 July 2005, pp. 84-102 . [6] Diaz-Gome, P. A., and D. F. Hougen. Improved Offline Intrusion Detection using a genetic algorithm, Proceedings of the Seventh International Conference on Enterprise Information Systems, 2005, Miami, USA.
[14] Li, W.,Using Genetic Algorithm for Network Intrusion Detection Proceedings of the United States Department of Energy Cyber Security Group 2004 Training Conference, May 24-27, 2004, Kansas City, Kansas, USA. [15] Marczyk, A. "Genetic Algorithms and Evolutionary Computation.", The Talk, Origins Archive. 23 Apr. 2004. 7 Oct. 2006. [16] Song, D., A Linear Genetic Programming Approach to Intrusion Detection, Master Degree for Computer Sciences, Genetic and Evolutionary Computation– GECCO 2003. [17] Smith, L. S. An Introduction to Neural Networks Professor Leslie S. Smith, Centre for Cognitive and Computational Neuroscience. 2 Apr. 2003. 7 Oct. 2006.
[7]Gong, R.H. , M. Zulkernine, P. Abolmaesumi, A Software Implementation of a Genetic Algorithm Based Approach to Network Intrusion Detection,Proceedings of Sixth IEEE ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD) May 2005, Maryland,USA.
[18] Coull, S., Joel Branch, Boleslaw Szymanski, and Eric Breimer. Intrusion Detection: a Bioinformatics Approach Proceedings of the 19th Annual Computer Security Applications Conference, Dec. 2003, Las Vegas, Nevada.
[8] Stein, G., B. Chen, A. S. Wu, and Kien A. Hua. Decision tree classifier for network intrusion detection with GA-based feature selection, In the Proceedings of the 43rd ACM Southeast Conference, March 18-20, 2005, Kennesaw, GA, .
[19]Gorodetsky,V., I.Kotenko, and O.Karsaev. Multi-agent Technologies for Computer Network Security: Attack Simulation, Intrusion Detection and Intrusion Detection Learning International Journal of Computer Systems Science and Engineering. vol.18, No.4, July 2003, pp.191-200.
[9].Folino, G., C. Pizzuti, G. Spezzano,GP Ensemble for Distributed Intrusion Detection Systems, International Conference on Advances in Pattern Recognition, ICAPR05, August22-25, 2005, Bath, UK.
[20]Gomez, J., and D. Dasgupta. Evolving Fuzzy Classifiers for Intrusion Detection.Proceedings of the 2002 IEEE, Workshop on Information Assurance, United States Military Academy, June 2001,West Point, NY .
[10] De Boer, P., and Martin Pels, Host-Based Intrusion Detection Systems, Technical Report:1.10, Faculty of Science, Informatics Institute, University of Amsterdam, 2005.
[21] Liu, Q., S. Bridges and I. Banicescu, Parallel genetic algorithms for tuning a fuzzy data mining system , In Proceedings of the Artificial Neural Networks in Engineering Conference (ANNIE 2001), November 4-7, 2001, St. Louis, MO.
[11] Yao, J. T., S.L. Zhao, and L.V. Saxton, A study on fuzzy intrusion detection , Proceedings of SPIE Vol. 5812, Data Mining, Intrusion Detection, Information Assurance, And Data Networks Security, 28 March - 1 April 2005, Orlando, Florida, USA.
[22] Chittur, A., Model Generation for an Intrusion Detection System Using Genetic Algorithms, High School Honors Thesis, Ossining High School,Ossining, NY., 27 Nov,2001.
31
©2010 International Journal of Computer Applications (0975 – 8887) Volume 1 – No. 2 [23] Dasgupta,D., and F.A. Gonzalez, An Intelligent Decision Support System for Intrusion Detection and Response, In Lecture Notes in Computer Science (publsher: Springer-Verlag) as the proceedings of International Workshop on Mathematical Methods, Models and Architectures for Computer Networks Security (MMM-ACNS), May 21-23, 2001, pp 1-14, St. Petersburg, Russia.
Administration Conf. (LISA ‘99), pp. 229-238.Seattle, Washington. Sinclair, Chris, Lyn Pierce, and Sara Matzner. 1999.
[24]Bridges, S. M., and R. M. Vaughn, Fuzzy Data Mining and Genetic Algorithms Applied to Intrusion Detection, Proceedings of the Twenty-third National Information Systems Security Conference, October 2000, Baltimore, MD. [25] Wang, W., and S.M. Bridges, Genetic Algorithm Optimization of Membership Functions for Mining Fuzzy Association Rules , Proceedings of the 7th International Conference on Fuzzy Theory & Technology, February 27 – March 3, 2000, pp.131-134, Atlantic City, NJ. [26] Dasgupta, D., Immunity-Based Intrusion Detection System: a General Framework., 22nd National Information Systems Security Conference, The University of Memphis, 1999, Virginia, USA. [27] Sinclair,C.,L.Pierce, S. Matzner,An Application of Machine Learning to Network Intrusion Detection, Proceedings of the 15th Annual Computer Security Applications Conference, December 1999, page 371, Phoenix, AZ. [28]Paxson, Vern. 1998. ―Bro: A System for Detecting Network Intruders in Real-time.‖ In Proceedings of 7th USENIX Security Symposium, pp. 31-51. San Antonio, Texas. [29]Pohlheim, Hartmut. 30 Oct. 2003. Genetic and Evolutionary Algorithms: Principles, Methods and Algorithms ,Genetic and Evolutionary Algorithm Toolbox. Hartmut Pohlheim. [30]Roesch, Martin. Nov. 7-12, 1999, Snort - Lightweight Intrusion Detection for Networks, In Proceedings of 13th Systems
32