
IADIS International Telecommunications, Networks and Systems 2007

PREPROCESSING ROUGH NETWORK TRAFFIC FOR INTRUSION DETECTION PURPOSES

Salem Benferhat, Karima Sedki, Karim Tabia
CRIL-CNRS FRE, Université d'Artois
Faculté des Sciences Jean Perrin, Rue Jean Souvraz SP 18, 62307 Lens Cedex, France

ABSTRACT
This paper describes a new tool for preprocessing rough network traffic into connection records. The tool provides summarized and relevant information for intrusion detection and prevention. It preprocesses both off-line and online raw network data into high-level network connection records. It is implemented as a set of new functionalities added to the well-known network protocol analyzer Ethereal. Relevant preprocessed data is critical for intrusion detection and prevention, particularly in terms of efficiency.

KEYWORDS
Intrusion detection, connection records, preprocessing, feature extraction.

1. INTRODUCTION
As information systems rely more and more on networking technologies, gigabytes of data must be analyzed every day in order to detect and prevent attacks. This is particularly hard to handle in the case of real-time detection. Rough network packets are unstructured data that most existing network IDSs (Intrusion Detection Systems) and IPSs (Intrusion Prevention Systems) use directly, although only a few pieces of each packet are relevant for detecting attacks. Usually, IDSs and IPSs use only packet headers. The aim of preprocessing rough network traffic is threefold:
1. Summarizing rough network traffic in order to reduce the amount of data to be analyzed. This is particularly important as there are gigabytes of data to be analyzed every day. For instance, a telnet session of hundreds of kilobytes can be summarized into one single connection record, which allows rapid analysis. For example, the KDD’99 data set [5] is smaller than 800 megabytes while its corresponding rough traffic (the Darpa’98 data set [2]) is larger than 5 gigabytes.
2. Providing relevant information that can be useful for intrusion detection analysis. For example, it is important to count the number of failed login attempts in telnet, ftp, etc. sessions in order to detect some guessing and dictionary attacks. Relevant data are composed of a set of meaningful features revealing anomalous events or attacks. These features depend on expert knowledge about attacks and normal network activities. Computing them usually requires both the packet headers and the data content of packets.
3. Providing a sufficient set of features that can be used to distinguish between normal traffic and attacks. This is particularly important to avoid “noise” in preprocessed data.
Besides, in recent years, data mining techniques have achieved better and better performance in intrusion detection. For instance, classification techniques such as decision trees and Bayesian networks [6] are very effective on preprocessed data like the KDD’99 [5] data set. However, these techniques require structured data and their effectiveness strongly depends on relevant training data. Relevant data mainly refers to


ISBN: 978-972-8924-40-9 © 2007 IADIS

• data safe from noise: normal connections and attacks should be associated with different connection records. This requires a set of meaningful features that are capable of distinguishing normal connections from attacks.
• training data representative of the attacks and normal traffic to be learnt.
Preprocessing rough traffic data could provide structured and relevant network traffic. It is worth noting that the majority of work in intrusion detection uses the KDD’99 [5] data set since it is essentially one of the only available preprocessed and labeled data sets. Preprocessing network traffic is a transformation function that builds summarized connection records from rough packets. Usually, binary data captured on the wire are first transformed into ASCII data and then preprocessed into connection records. For instance, Editcap edits capture files and can translate them from one format to another. The following figure shows the different steps of capturing, preprocessing and building detection models:

Figure 1. Data preprocessing

As no preprocessing tool is currently publicly available, we designed and implemented our preprocessing tool as new functionalities added to the open source network protocol analyzer Ethereal [1]. Before describing our new tool, we first give a brief description of Ethereal.

2. ETHEREAL: NETWORK PROTOCOL ANALYZER
Ethereal is a graphical network protocol analyzer that decodes more than 750 protocols. It allows graphical exploration of each field of the protocol stack in each packet. It is mainly used to:
1. capture live network traffic and save it into dump files. It is also used to read previously saved capture files. Thus Ethereal can be used instead of tcpdump [3] or any other sniffer. In addition to the libpcap format, Ethereal supports most capture file formats,
2. analyze, dissect and filter live and saved capture files,
3. retrieve and rebuild most conversations between pairs of hosts,
4. build connections at the application level. For instance, we can reconstruct an http or telnet connection at the application level and ignore the transport and IP levels,
5. provide various statistics about captured/read traffic.
Ethereal is open source software written mainly in C. It is available for Unix, Linux and MS Windows operating systems and can be run through its friendly graphical user interface or from the command line. In addition, other tools are provided in order to edit and handle capture files. For instance, Editcap edits capture files and can translate them from one format to another.



3. PREPROCESSING ROUGH NETWORK DATA FOR INTRUSION DETECTION PURPOSES
Most IDSs and IPSs use only packet headers. Few tools operate at the connection level, and those that do derive connection records only from packet headers. For instance, the Bro IDS [4] includes a script (conn.bro) that outputs an ASCII summary line for each network connection. This summary line includes only the following basic features derived from packet headers:
• start-time: the start time of the connection, namely the time stamp of the connection’s first packet.
• duration: the duration of the connection expressed in seconds.
• protocol: the protocol (TCP, UDP or ICMP) used at the transport layer of the packet.
• service: usually the service corresponding to the destination port of the connection. Note that Bro outputs “other” for non-well-known ports.
• source bytes: the number of bytes transferred from the source host to the destination host.
• destination bytes: the number of bytes transferred from the destination host to the source host.
• source address: the IP address of the source host.
• destination address: the IP address of the destination host.
• source/destination port: the source/destination port of TCP or UDP connections.
• flag: the connection status (ex. SF = connection is established and terminated correctly, REJ = connection request is rejected, etc.)
These basic features are useful summaries but they are not sufficient to train IDSs or build accurate models, since they provide neither information about the data content of the connection nor statistics about past connections. Information about data content is essential for detecting R2L (Remote to Local) and U2R (User to Root) attacks, while DoS (Denial of Service) and Probing attacks can be revealed through statistics summarizing past connections. For example, the number of connections with the same destination/port is a key statistic for detecting some DoS attacks. On the basis of these facts, Lee [7] proposed a set of useful features to be used in a KDD process in order to provide relevant data that can lead to accurate data mining models. This resulted in the KDD’99 [5] data set, which is suited for data mining techniques applied to intrusion detection. The transformation function used to obtain the KDD’99 data set preprocessed the rough Darpa’98 [2] data set into connection records. Note that each KDD’99 connection is described by 41 features relative to
• basic features of individual TCP connections (ex. protocol type, service, connection flag, etc.)
• content features within a connection suggested by domain knowledge (ex. number of failed login attempts in telnet sessions, etc.)
• time based traffic features computed using a two-second time window (ex. the rate of connections to the same port/destination host, etc.)
• host based traffic features computed using a window of 100 connections, used to characterize attacks that scan hosts/ports using time intervals much larger than two seconds (ex. rate of connections scanning different ports on the same destination host, etc.).
Note also that the only available preprocessed and labeled data set is KDD’99 [5], which suffers from at least two major problems:
• The KDD’99 testing data set contains inconsistencies due to problems with the transformation function that was used [8].
• KDD’99 is old and includes few attacks, targeting only Unix based systems and Cisco routers.
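The time based and host based traffic features described above can be illustrated with a minimal Python sketch. The record type `Conn`, its field names and both function names are hypothetical and chosen only for illustration; the exact feature definitions used for KDD’99 may differ. The sketch only shows the idea of a two-second time window and a 100-connection window.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Conn:
    """Hypothetical minimal connection record (illustrative field names)."""
    start: float     # start time of the connection, in seconds
    dst_host: str    # destination IP address
    service: str     # service name, e.g. "http"


def same_host_count(history: List[Conn], current: Conn, window_s: float = 2.0) -> int:
    """Time based feature: number of past connections to the same destination
    host that started at most `window_s` seconds before `current`."""
    return sum(
        1
        for c in history
        if c.dst_host == current.dst_host
        and 0.0 <= current.start - c.start <= window_s
    )


def same_host_rate(history: List[Conn], current: Conn, window_n: int = 100) -> float:
    """Host based feature: fraction of the last `window_n` past connections
    that target the same destination host as `current`."""
    recent = history[-window_n:]
    if not recent:
        return 0.0
    return sum(c.dst_host == current.dst_host for c in recent) / len(recent)
```

Both functions assume that `history` is ordered by start time, which is why the connection window can be taken as a simple slice of the most recent records.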

4. ADDED PREPROCESSING FUNCTIONALITIES
Our tool builds connection records similar to the KDD’99 ones. In addition, it provides supplementary features. We added the following new functionalities:
1. build KDD/enriched KDD connection records: Our tool preprocesses rough data into records of 41 KDD attributes and can provide additional attributes (ex. connection direction: based on local network IDs, each connection is marked as inbound, outbound or inside). Another example of an added attribute is bad traffic, which counts the number of packets including data that do not comply with the used service (such as syntax errors in ftp commands).
2. build complete/incomplete connection records: This is a new functionality that preprocesses rough data into connection records with a user-specified duration. This function is given a duration t in seconds and builds connection records until the connection duration reaches t. The remaining packets of such connections are ignored. Building incomplete connections provides records at different stages. Incomplete connections are necessary to train models for real-time intrusion detection or intrusion anticipation.
3. build real-time connection records: This functionality aims at providing connection records in real time. Namely, connections are built while network packets are being captured. These real-time connection records can be used to anticipate or detect intrusions in real time.
In addition, when computing time based features, our tool can be configured to use any time window chosen by the user. Similarly, host based traffic features can be computed according to any user-specified connection window.
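As an illustration of two of the functionalities above, the following Python sketch derives a connection direction from a list of local networks and truncates a connection to a user-specified duration t. It is only a sketch under stated assumptions: the function names, the packet representation as (timestamp, size) pairs and the "external" label for traffic between two non-local hosts are hypothetical and are not taken from the tool itself.

```python
import ipaddress
from typing import Iterable, List, Tuple


def direction(src: str, dst: str, local_nets: Iterable[str]) -> str:
    """Mark a connection as inbound, outbound or inside, given local networks
    written in CIDR notation (e.g. "192.168.1.0/24")."""
    nets = [ipaddress.ip_network(n) for n in local_nets]

    def is_local(addr: str) -> bool:
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in nets)

    src_local, dst_local = is_local(src), is_local(dst)
    if src_local and dst_local:
        return "inside"
    if dst_local:
        return "inbound"
    if src_local:
        return "outbound"
    return "external"  # assumption: case not described in the text


def truncate_packets(packets: List[Tuple[float, int]], t: float) -> List[Tuple[float, int]]:
    """Keep only the packets seen during the first `t` seconds of a connection.
    Each packet is a hypothetical (timestamp, size) pair, ordered by time."""
    if not packets:
        return []
    start = packets[0][0]
    return [p for p in packets if p[0] - start <= t]
```

An incomplete connection record would then be built from the truncated packet list exactly as a complete record is built from the full list.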

5. PREPROCESSED CONNECTION RECORDS
Our tool preprocesses rough network traffic into structured high-level connection records. According to the connection protocol (transport protocol TCP/UDP or ICMP), we consider three connection categories:
• TCP connections: these correspond to network services based on the TCP protocol. TCP connections start with the three-way handshake, after which data can be exchanged between hosts. TCP connections normally end with explicit disconnect/acknowledge requests. Note that content features are computed only for sensitive services such as telnet, ftp, login, http, smtp, etc.
• UDP connections: UDP is a connectionless protocol that can be used by any service, but it is usually used by few services since it does not provide reliable transport. We therefore consider as a UDP connection each single UDP packet (with the SF flag to indicate that the connection worked normally) or each UDP request with its corresponding reply, if any. For example, each DNS request and its response over UDP represent one connection. Note that content features are computed for some sensitive services (ex. telnet) when they are based on the UDP protocol.
• ICMP connections: ICMP packets are handled in the same way as UDP ones since ICMP is a connectionless protocol. An ICMP request and its response (ex. ping request/reply) represent one single connection, while each error-reporting ICMP packet is associated with a new connection.
Note that after building connection records we can save them in files in formats such as CSV (Comma Separated Values), which stores data in tabular form with values separated by commas. Our tool can also save records in a space-separated format or display them directly on the screen.
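The handling of UDP requests and replies and the CSV output can be sketched as follows in Python. This is a simplified illustration, not the tool's implementation: the packet layout, the record fields and the function names are hypothetical, and a reply is assumed to close the connection it answers.

```python
import csv
import io
from typing import Dict, List, Tuple

# Hypothetical packet layout: (src, sport, dst, dport, payload_bytes)
UdpPacket = Tuple[str, int, str, int, int]

FIELDS = ["src", "sport", "dst", "dport", "src_bytes", "dst_bytes", "flag"]


def build_udp_records(packets: List[UdpPacket]) -> List[dict]:
    """Group UDP packets into connection records: a request with its reply
    forms one record, any other packet starts a new record."""
    records: List[dict] = []
    pending: Dict[tuple, dict] = {}  # requests still waiting for a reply
    for src, sport, dst, dport, size in packets:
        reply_key = (dst, dport, src, sport)
        if reply_key in pending:
            # This packet answers a pending request: complete that record.
            rec = pending.pop(reply_key)
            rec["dst_bytes"] += size
        else:
            rec = {"src": src, "sport": sport, "dst": dst, "dport": dport,
                   "src_bytes": size, "dst_bytes": 0, "flag": "SF"}
            records.append(rec)
            pending[(src, sport, dst, dport)] = rec
    return records


def to_csv(records: List[dict]) -> str:
    """Serialize connection records as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

For instance, a DNS request followed by its response yields a single record, while a second request from the same client port starts a new one.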

6. SOME IMPLEMENTATION DETAILS
The following generic algorithm represents our preprocessing procedure:
Input: rough packets (capture file/live packets) and preprocessing parameters
Output: connection records
Algorithm:
1. empty connection list
2. while read/capture packet
   - if packet belongs to an open connection in connection list then update the connection
   - else create a new connection with this packet
In this algorithm, an open connection is a connection that has not yet come to its end. For instance, a TCP connection between two hosts/ports is considered open until a FIN/ACK request/acknowledge is seen in both directions. A TCP connection can also be ended if one host sends a RESET packet. Note that RFCs (Requests For Comments) do not specify time-outs that can be used to close idle connections. The algorithm starts with some parameters, such as the duration of the time window for computing time based attributes and the network IDs used to compute the connection’s direction. It then enters a loop that reads/captures packets and updates the connection list accordingly. This loop ends when the live capture is stopped or the capture file has been read entirely. Updating an open connection with a packet means updating the features that may evolve given new packets, such as the connection duration, the amount of data transferred between hosts, etc.
To store connection records, hash tables seem at first more advantageous since they allow rapid access to a given connection record using the source address/port and destination address/port as key. Nevertheless, a hash table in our context would cause large numbers of collisions, particularly when preprocessing DoS connections (several open connections with the same source/destination addresses/ports). Furthermore, to compute time based features and host based traffic features, connection records must be ordered according to their time stamps. Therefore a linked list is better suited to our requirements.
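The generic algorithm above can be sketched in Python for the TCP case. This is a minimal illustration under several assumptions, not the code added to Ethereal: the packet layout, the `Connection` class and the `preprocess` function are hypothetical, a Python list stands in for the time-ordered linked list, and a connection is closed as soon as a FIN has been seen in both directions or a RESET is seen, without waiting for the final acknowledgement.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple

# Hypothetical packet layout:
# (timestamp, src, sport, dst, dport, tcp_flags, payload_len)
# tcp_flags is a string such as "S", "SA", "A", "FA" or "R".
Packet = Tuple[float, str, int, str, int, str, int]


@dataclass
class Connection:
    src: str
    sport: int
    dst: str
    dport: int
    start: float
    last: float
    src_bytes: int = 0
    dst_bytes: int = 0
    fin_from: Set[str] = field(default_factory=set)
    closed: bool = False

    @property
    def duration(self) -> float:
        return self.last - self.start

    def matches(self, src: str, sport: int, dst: str, dport: int) -> bool:
        """True if the packet travels between this connection's two endpoints."""
        return {(src, sport), (dst, dport)} == {
            (self.src, self.sport), (self.dst, self.dport)}

    def update(self, pkt: Packet) -> None:
        """Update the features that evolve with each new packet."""
        ts, src, sport, _dst, _dport, flags, size = pkt
        from_src = (src, sport) == (self.src, self.sport)
        self.last = ts
        if from_src:
            self.src_bytes += size
        else:
            self.dst_bytes += size
        if "F" in flags:
            self.fin_from.add("src" if from_src else "dst")
        # Simplification: close on RESET or once FIN was seen in both directions.
        if "R" in flags or len(self.fin_from) == 2:
            self.closed = True


def preprocess(packets: Iterable[Packet]) -> List[Connection]:
    connections: List[Connection] = []  # step 1: empty list, kept in time order
    for pkt in packets:                 # step 2: read/capture loop
        ts, src, sport, dst, dport, _flags, _size = pkt
        conn = next((c for c in connections
                     if not c.closed and c.matches(src, sport, dst, dport)), None)
        if conn is None:
            conn = Connection(src, sport, dst, dport, start=ts, last=ts)
            connections.append(conn)
        conn.update(pkt)
    return connections
```

Because closed connections are skipped during the lookup, a new SYN between the same two endpoints after a completed exchange creates a new record, and the list remains ordered by start time, which is what the time based and host based features require.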

7. CONCLUSION
Data preprocessing is now a key element in intrusion detection research. Relevant preprocessed data is essential for training IDSs or building accurate data mining models. This requires efficient preprocessing tools that summarize rough packets while increasing the information value of the resulting connection records. Expert knowledge about network protocols, attacks and security threats is necessary to design efficient preprocessing tools. Our tool provides new and important functionalities such as building enriched connection records, incomplete connection records and real-time ones.

ACKNOWLEDGEMENT
This work is supported by the French national project ACI (Action Concertée Incitative) Sécurité Informatique entitled DADDi (Dependable Anomaly Detection with Diagnosis).

REFERENCES
[1] Angela Orebaugh, Ethereal Packet Sniffing, Jay Beale’s Open Source Security Series, 1-468, Syngress Publishing, Inc., 2004. http://www.ethereal.com
[2] R. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, M. A. Zissman, Evaluating Intrusion Detection Systems: the 1998 DARPA Off-Line Intrusion Detection Evaluation, Proceedings of the 2000 DARPA Information Survivability Conference and Exposition (DISCEX), vol. 2, IEEE Press, 2000.
[3] Lawrence Berkeley National Laboratory Network Research Group, TCPDump, 1999. http://www.nrg.ee.lbl.gov
[4] Vern Paxson, Bro: A System for Detecting Network Intruders in Real-Time, Lawrence Berkeley National Laboratory, Proceedings of the 7th USENIX Security Symposium, San Antonio, TX, Jan. 26-29, 1998. http://www.broids.org
[5] http://kdd.ccs.uci.edu/databases/kddcup99/task.html
[6] S. Benferhat, N. Ben Amor, Z. Elouedi, Naive Bayes vs Decision Trees in Intrusion Detection Systems, Proceedings of the 2004 ACM Symposium on Applied Computing, Nicosia, Cyprus, pp. 420-424, 2004.
[7] W. Lee, A Data Mining Framework for Constructing Features and Models for Intrusion Detection Systems, PhD thesis, Columbia University, 1999.
[8] Y. Bouzida, F. Cuppens and S. Gombault, Modeling Network Traffic to Detect New Anomalies Using Principal Component Analysis, HPOVUA 2005, Porto, Portugal, 2005.
