Dipartimento di Informatica e Scienze dell’Informazione
A monitoring behaviour based system for Anomaly Detection by Davide Chiarella
Theses Series
INF/01 DISI-TH-2010-1
DISI, Università di Genova v. Dodecaneso 35, 16146 Genova, Italy
http://www.disi.unige.it/
Università degli Studi di Genova Dipartimento di Informatica e Scienze dell’Informazione Dottorato di Ricerca in Informatica
Ph.D. Thesis in Computer Science
A monitoring behaviour based system for Anomaly Detection by Davide Chiarella
February, 2010
Submitted by Davide Chiarella DISI, Univ. di Genova
[email protected] [email protected]
Date of submission: February 2010 Release date: July 2010
Title: A monitoring behaviour based system for Intrusion Detection
Advisor: Maurizio Aiello IEIIT - U.O.S. di Genova
[email protected]
Supervisor: Giovanni Chiola DISI - Università degli Studi di Genova
[email protected]
Ext. Reviewers: Luigi V. Mancini Dipartimento di Informatica - Università degli Studi di Roma "La Sapienza"
[email protected]
Rodolfo Zunino DIBE - Università degli Studi di Genova
[email protected]
© 2009 by Davide Chiarella. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior written permission of the author.
To Sara and to her lovely gift
Abstract

In this thesis I present my PhD work, a behaviour-based Intrusion Detection System called WPD-pe (WormPoacherDaemon Panworm Engine), born as a consequence of the analysis I made of the SMTP traffic in the CNR-Area di Genova network. The system had three development phases. In the first phase the system was off-line and worked on postfix log files: in this document it will be referred to as Worm-Poacher. In the second phase the system was on-line, still working on postfix files: in this phase the system did not perform well, so I tried to improve it by redesigning most of it. Following these major changes the final system, WPD-pe, was born. The analysis is divided into three main branches: global SMTP flow analysis, sender address usage analysis and rejected e-mail flow analysis. During the analysis I obtained quite good results, identifying several anomalies in the SMTP traffic (e.g. virus infections and spam activities); moreover, the data gathered and analysed by the system give good hints about SMTP traffic and about what can be considered anomalous in it. These hints can be very useful for a complete analysis of a chosen network. For example, the usage of different sender addresses gives information about which kinds of hosts are in a network and how many different e-mail addresses a person owns, while the analysis of the global SMTP flow gives information about the expected average load and the future SMTP traffic trend. All these identified features, on the other hand, make it possible to outline what has to be considered normal and what has not. The purpose of this thesis is to create an on-line monitoring system which can identify SMTP anomalies on a selected network and can potentially react against them.
Chapter 1
Introduction

Better to reign in hell, than serve in heaven.
(John Milton)
The Internet is nowadays an ever-growing reality: it deeply influences our daily life and gives access to services and information. However, with the sharp growth of internet-based services, of information on networks and of connectivity, the number and severity of internet-based computer attacks have distinctly increased. Moreover, our heavy reliance on the Internet and the strengthening of worldwide connectivity have greatly increased the potential damage that can be inflicted by attacks launched over the Internet against remote systems. In addition, compromised computers can be used to launch further attacks, thus giving attackers another level of indirection for further illegal actions. The majority of break-ins, however, are the result of a small number of known attacks, which makes their detection easier; there remains, though, a share of attacks which do not fit any known attack, and preventing them is a difficult task. Thus, Intrusion Detection Systems (IDSs) have become a crucial part of network and host security. First of all, we must answer one question: what is intrusion detection and what are its aims? Intrusion detection is primarily concerned with the detection of illegal activities or acquisitions of privileges that were not intended. Current approaches to detecting intrusions can be broadly classified into two wide categories: Misuse Detection and Anomaly Detection. The first one is based upon the signature concept and aims to detect well known attacks, as well as slight variations of them, by characterizing the rules that govern these attacks: it is accurate, but it lacks the ability to identify intrusions that do not fit a pre-defined signature or attacks that lie beyond its knowledge, and is therefore not adaptive. The second one tries to create a model that characterizes normal behaviour:
the system defines the expected network behaviour and, if there are significant deviations of the short-term usage from the profile, raises an alarm. It is a more adaptive system, ready, in principle, to counter new threats, but this is far from easy: in fact it has a high rate of false positives, and sifting true intrusions from false alarms is very time consuming and labour expensive. Anyhow, anomaly based systems are employed to detect the unknown intrusions which cannot be addressed by misuse detection. Theoretically, misuse and anomaly detection integrated together can give a holistic estimation of the malicious situations on a network. Among all the kinds of malicious situations, e-mail viruses have become one of the major Internet security threats today. An e-mail virus is a malicious program which hides in an e-mail attachment and becomes active when the attachment is opened or read. A principal goal of e-mail virus attacks such as Melissa is to generate a large volume of e-mail traffic over time, so that e-mail servers and clients are eventually overwhelmed with this traffic, thus effectively disrupting the use of the e-mail service. Future viruses may be more damaging, taking actions such as creating hidden back-doors on the infected machines that can be used to command these machines in a subsequent coordinated attack. Current approaches for dealing with e-mail viruses rely on the use of anti-virus software at the desktops, network servers, mail exchange servers and gateways. Detection of e-mail viruses is usually based on a signature-based approach, where the signature captures distinguishing features of a virus, such as a unique subject line or a unique sequence of bytes in its code. As said before, this approach is effective against known e-mail viruses, but is ineffective against unknown (i.e. newly released) viruses. Furthermore, a series of Internet worms exploit the confluence of the relative lack of diversity in the system and server software run by Internet-attached hosts, and the ease with which these hosts can communicate. A worm program is self-replicating: it remotely exploits a software vulnerability on a victim host, such that the victim becomes infected and itself begins remotely infecting other victims. The severity of the worm threat goes far beyond mere inconvenience, so a necessary prerequisite for any response is quick and accurate detection; beyond that, the effectiveness depends on the strategy chosen. Most known worms have very aggressive behaviours: they attempt to infect the Internet in a short period of time. This type of worm is actually easier to detect, because its aggressiveness stands out from the background traffic. Future worms may be modified to circumvent rate-based defence systems and purposely slow down their propagation rate in order to compromise a vast number of systems over the long run without being detected. As mentioned, there are many recent studies and proposals on Anomaly Detection techniques, especially on worm and virus detection. In this field it matters to answer a few important questions, such as at which ISO/OSI layer the data analysis is done and which approach is used. The common approach usually takes connections into account, a fact due partly to the availability of data: works in this sector suffer from a scarcity of real data, due to lack of network resources or to privacy problems, and almost every work in this field uses synthetic data sets (e.g. DARPA, KDD99) that are predisposed to a connection analysis.
The novelty and the features of the approach can be summarized in five main points:

1. Real data. I analyzed quantitatively the e-mail traffic of the National Research Council Genoese Area network and applied my methods to the gathered data to detect indirect worm infections (worms which use e-mail to spread).

2. Mixed layer analysis. The work done in this thesis is based on layer seven quantities (number of e-mails sent in a chosen period, e-mail sender address etc.): moreover the on-line system has been predisposed to a mixed layer (layer 3, 4, 7) analysis for future implementations.

3. Threshold optimization. The approaches use a threshold method and, on the data set at my disposal, they identified various worm activities. Moreover, looking at the final results and at the K-analysis (the K parameter found in the baseline), it has been possible to identify an optimization of the previous baseline (see Chapter 4.2).

4. Quite good performance. During the on-line analysis I had no performance problems, although the system ran on an old P4. Moreover I want to stress that all the daemons have been developed in Perl, which is not a high performance language.

5. High scalability. The on-line system is made up of different daemons, which can run on different computers, so the system can have one or more sniffing daemons (i.e. SMTPDump daemons) which can collect data in different places of a network and then send them to be analysed.

Concluding, this thesis is focused on intrusion detection techniques and in particular on anomaly detection: my research starts from the above premises and its aim is to detect any kind of SMTP anomaly in a monitored network. My system can be centralized or distributed, depending on what the network topology needs, and it uses a time series threshold method. The aim of my work is to design and implement a monitoring behaviour-based system which can identify SMTP worm activities (or, in general, SMTP anomalies) in order to free anti-virus software from daily updates. In chapter 2 the state of the art of intrusion detection is introduced. Chapter 3 shows the architecture of the system during its three development phases: off-line (Worm-Poacher), on-line (WPD) and on-line with Panworm Engine (WPD-pe). Chapter 4 presents the analysis performed on the data and the results obtained.
Chapter 2
State of the art

Where my reason, imagination or interest were not engaged, I would not or I could not learn. (Sir Winston Churchill)

In the beginning, intrusion detection was performed by system administrators sitting in front of a console and monitoring user activities. Intrusions could be detected by noticing, for example, that a user on vacation was logged in locally or that a seldom-used printer was unusually active. Although effective enough at the time, this early form of intrusion detection was ad hoc and not scalable. Intrusion detection, as we know it, started in the early 80s with two main works: "Computer Security Threat Monitoring and Surveillance" by Anderson and "An intrusion-detection model" by Denning. James Anderson published his work in April 1980 [And80], and it lays the basics on which all later work on intrusion detection would be built. It took seven years to make the seeds bloom with Dorothy Denning's article, where she gave the first motivations and guidelines for an intrusion detection system [Den87]. However, before proceeding with the analysis of the two main works in this field, I give a possible definition of what intrusion detection is [NWY02].

The goal of intrusion detection is to discover intrusions into a computer or network, by observing various network or host activities and attributes. It is classified as an intrusion any set of actions that threatens the integrity, availability or confidentiality of a resource. (Noel et al.)

The chapter is organized as follows: in the first two sections I review Anderson's and Denning's works; after this I review intrusion detection basics in order to give the reader a good snapshot of what intrusion detection is today.
2.1 Intrusion detection basics: Anderson's work
Anderson's aim, in his work [And80], is "to design a security surveillance system": a system "which provides an initial set of tools to computer system security officers for use in their jobs". So we can see that Anderson thinks of a set of tools rather than an automated system. However, he focuses his attention on some crucial points that every researcher has to deal with. The first is how to detect the intrusion in the data: in fact he says "it is necessary to understand the types of threats and attacks that can be mounted against a computer system, and how these threats may manifest themselves in the audit data. It is also important to understand the threats and their sources from the viewpoint of identifying other data sources by which the threat may be recognized". Summarizing, he says that a threat can be identified from audit-specific data, from general audit data and from other data sources. Moreover, he gave the following definitions, which still hold nowadays with some minor changes.

• Threat: the potential possibility of a deliberate unauthorized attempt to: 1. access information; 2. manipulate information; 3. render a system unreliable or unusable

• Risk: accidental and unpredictable exposure of information, or violation of operations integrity due to malfunction of hardware or incomplete or incorrect software design

• Vulnerability: a known or suspected flaw in the hardware or software design or operation of a system that exposes the system to penetration or its information to accidental disclosure

• Attack: a specific formulation or execution of a plan to carry out a threat

• Penetration: a successful attack; the ability to obtain unauthorized (undetected) access to files and programs or the control state of a computer system

Anderson, considering the threat problem, created a threat representation on the basis of whether or not an attacker is normally authorized to use the computer system, and whether or not a user is authorized to use a particular resource in the system. This representation is shown in Table 2.1.
                                              Penetrator not authorized        Penetrator authorized
                                              to use data/program resource     to use data/program resource
  Penetrator not authorized to use computer   External penetration             X
  Penetrator authorized to use computer       Internal penetration             Misfeasance

Table 2.1: Anderson Threat Representation

Nowadays this representation is not so useful, because the "access" Anderson speaks about is physical access (remember that he wrote the report in 1980 for a USA air force base), and we all know that, given the high connectivity and diffusion of computers at present, physical access is not necessary to intrude into a computer. However, we can notice that he stressed the fact that an intrusion can be of two different kinds: it can be born outside the network (external intrusion), but at the same time it can originate from the inside (internal intrusion). This fact is very important when designing an Intrusion Detection System, because internal intrusions can be very challenging to detect. In this regard Anderson identified three classes of users that can be identified, in order of increasing difficulty, through audit trails (for our purposes the masquerader case is very interesting, with its concept of extra use or, using our words, searching for anomalies):

1. The masquerader: the masquerader is to all effects a legitimate user. In fact he is a user who wishes to exploit another user's identification and password. There is no particular feature to distinguish the masquerader from the legitimate user. So how can a masquerader be identified? By searching for anomalies, for the extra use of the system. This extra use can be determined by some features like:
• Use outside of normal time
• Abnormal frequency of use
• Abnormal volume of data reference
• Abnormal patterns of reference to programs or data
It is very important to notice that Anderson, after this list, says: "the operative word is abnormal which implies that there is some notion of what normal is for a given user". These are the seeds of anomaly detection, these are the essential facts a researcher can build his system on.

2. The legitimate user: he is a user who is authorized to use the system, so audit trail records don't exhibit any abnormal patterns of reference, log on times and so forth. For this reason detecting the abnormal use of a system by a legitimate user is more difficult: in some cases there is no extra use of
resources that can be of help in detecting the activity, and in other cases a legitimate user can show the same behaviour as a masquerader.

3. The clandestine user: he is the most difficult to detect because the clandestine user has, by assumption, supervisory control of the system and as such can operate below the level at which audit trail data is taken.

It is immediately clear that the author identified very well some characteristic features of future Intrusion Detection Systems. In fact in chapter three of his work Anderson tried to characterize the use of a computer system by observing various parameters and pointing out some interesting issues.

• Session: it denotes a continuous unit or a single unit of use of a computer, with a well defined beginning and a well defined end. The parameters that distinguish one unit from another are the user identifiers and the list of programs used.

• Time parameters: the time of the day (and, in a larger sense, the day of the week) at which a session is operated, and the duration of the session.

• Dataset and program usage: a list of what a user normally uses. It is a per-user measure.

• Monitoring files and devices: a list of which users normally use a device or a file. It is a per-device or per-file measure.

• Group statistics: sessions, time usages etc. referring to the same entity can be considered to belong to the same population and will exhibit similar statistical properties from run to run.

An arbitrary deviation from the norm for the user is a criterion for reporting a particular use and generating an "abnormal volume of data" or an "abnormal (measure of one of the parameters discussed above)" exception. The second time measure (the duration of the session) has to be investigated more deeply, because it can give us some good points to take into account. In fact Anderson says that such a measure is expected to have relatively little variability for a given user, while system usage patterns can exhibit wide fluctuations from one user to another. These anomalies can be detected through analysis of the historical data: in fact he says "Detection of outside of normal times of use is relatively straight forward. Individual sessions are sorted on time of initiation and compared with previously recorded data for the specific user". After this he stresses the importance of granularity: "The basic question to be faced is the granularity of the analysis needed [...] for user exhibiting little variability in their use a gross measure, such as number of jobs per quarter of the day
will be sufficient", while for users "with considerable variability" it may be necessary to record usage by hour. I want to stress these passages because all these observations, in one way or another, stand out during our work. To detect log on anomalies the author used a simple but effective measure (for his experiment): let Hi be the recorded historical value for hour i of the day and Ai the actual recorded value; the score formula was

    ∑_{i=1}^{24} |H_i − A_i|²
Despite its simplicity, the formula worked very well on the log on data: the only obscure point was the threshold value, that is, when the deviation from normal behaviour must be considered an anomaly. It is interesting to stress that the author in his design used different parameters (to draw inspiration from) like:

• the mean of a parameter (e.g. CPU time, I/O operations)
• the maximum and minimum value of a parameter, to establish the range of values
• the standard deviation of a parameter
• mean + (2.58 × standard deviation) = upper bound of the distribution
• mean − (2.58 × standard deviation) = lower bound of the distribution

A small sketch of these computations follows the list.
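Purely as an illustration, and not as code from Anderson's report or from the thesis system, the log-on score and the 2.58 · standard deviation bounds above could be computed with a short Perl script (Perl being the language the rest of the thesis system is written in); the hourly counts and variable names are invented sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Historical hourly log-on counts H[0..23] and the counts A[0..23]
# actually observed today (sample values, purely illustrative).
my @H = map { 2 } 0 .. 23;
my @A = map { 2 } 0 .. 23;
$A[3] = 9;    # an unusual burst of log-ons at 3 a.m.

# Anderson's score: sum over the 24 hours of |H_i - A_i|^2.
my $score = 0;
$score += ($H[$_] - $A[$_]) ** 2 for 0 .. 23;
print "log-on anomaly score: $score\n";

# Upper/lower bounds of a parameter distribution as mean +/- 2.58 * stdev.
sub bounds {
    my @x    = @_;
    my $mean = 0;
    $mean += $_ / @x for @x;
    my $var = 0;
    $var += ($_ - $mean) ** 2 / @x for @x;
    my $stdev = sqrt $var;
    return ($mean - 2.58 * $stdev, $mean + 2.58 * $stdev);
}

my ($lo, $hi) = bounds(@H);
print "expected hourly log-ons between $lo and $hi\n";
```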
2.2 Intrusion detection basics: IDES
Denning's work is focused on a model of a real-time intrusion detection expert system called IDES. Before describing her work, the author enunciates the four motivating factors for this system (some of these factors still hold true today), which I summarize in poetic sentences at the beginning of each point:

1. The present is not perfect: most systems have security flaws that render them susceptible to intrusions, penetrations and other forms of abuse; finding and fixing all these deficiencies is not feasible for technical and economic reasons.

2. The present is hard to change: existing systems with known flaws are not easily replaced by systems that are more secure, mainly because the existing systems have attractive features that are missing in the more secure systems, or else they cannot be replaced for economic reasons.

3. Perfection is far away: developing systems that are absolutely secure is extremely difficult, if not generally impossible.
4. None is secure: even the most secure systems are vulnerable to abuses by insiders who misuse their privileges.

As already said, these factors still hold true today. Denning's model is a host-based one, as Anderson's model was: this is not so strange, since the Internet, and networks in general, were taking their first steps in the '80s. The abnormal uses of a system that the author wants to detect are:

• attempted break-in (symptom: high rate of password failures)
• masquerading or successful break-in (symptoms: different login time, location, directory browsing etc.)
• penetration by a legitimate user (symptoms: execution of different programs, triggering more protection violations, access to commands not normally permitted)
• leakage by a legitimate user (symptoms: login at unusual times, routing data to a remote printer etc.)
• inference by a legitimate user: a user attempting to obtain unauthorized data from a database through aggregation and inference might retrieve more records than usual
• trojan horse (symptoms: different CPU time usage or I/O activity)
• virus (symptoms: increased frequency of write operations on files, increased usage of storage for executable files etc.)
• denial of service (symptoms: monopolizing a resource, making it inaccessible to other users)

Obviously the aim is to have a high rate of detection and a low rate of false alarms: alas, the dream of every intrusion detection researcher! The basic idea is to monitor the standard operations on a target system, looking only for deviations in usage. The model decomposes all activity into single object actions, so that each audit record references only one object. The audit records are six-tuples representing actions performed by a subject (e.g. a user, a process, a system etc.) on objects (e.g. files, programs, messages, records, terminals, printers etc.). For a given subject with respect to a given object an activity profile is created: the profile is characterized in terms of a statistical metric and model. A metric is a random variable x representing a quantitative measure accumulated over a period. The period may be a fixed
interval of time (e.g. minute, hour etc.) or the time between two audit-related events (i.e. login and logout, starting and ending a program, opening and closing a file). Observations (samples) from the audit records are used together with a statistical model to determine whether a new observation is abnormal: the statistical model makes no assumptions about the underlying distribution of x; all knowledge about x is obtained from observations. The author defined three types of metrics x:

1. Event counter: x is the number of audit records satisfying some property occurring during a period
2. Interval timer: x is the length of time between two related events
3. Resource measure: x is the quantity of resources consumed by some action during a period

IDES has several statistical models whose purpose is to determine, given a metric and n previous observations x1 ... xn, whether a new observation xn+1 is abnormal.

• Operational Model (Arbitrary Threshold). This model is based on the operational assumption that abnormality can be decided by comparing a new observation of x against fixed limits. It is the simple concept of an arbitrary threshold: e.g. when we have more than ten password failures during a brief period we presume there is an attempted break-in. This model has a drawback: it requires prior knowledge about normal activity.

• Mean and Standard Deviation Model. This model is based on the mean and standard deviation of the stored observations of x. A new observation is defined to be abnormal if it falls outside a confidence interval that is d standard deviations from the mean for some parameter d, that is: mean ± d × stdev. By Chebyshev's inequality the probability of a value falling outside this interval is at most 1/d²; for d = 4, for example, it is at most 0.0625. This model requires no prior knowledge about normal activity in order to set limits; instead it learns its limits (thresholds) from its observations. Moreover it is more adaptable, because the confidence intervals depend on the observed data, so what is considered to be normal for one user can be considerably different for another. A slight variation of this model is to weight the computations, with greater weights placed on more recent values (a small sketch of this model follows the list).
• Multivariate Model. This model is similar to the previous one, except that it is based on the correlations among two or more metrics. This can be useful if better discriminating power can be obtained by combining related measures.

• Markov Process Model. This model applies only to event counters. It regards each distinct type of event as a state variable, and uses a state transition matrix to characterize the transition frequencies between states. A new observation is defined to be abnormal if its probability is too low. This model can be useful for command sequences, for instance.

• Time Series Model. This model uses an interval timer together with an event counter or a resource measure: it takes into account the order and inter-arrival times of the observations as well as their values. A new observation is abnormal if its probability of occurring at that time is too low. A time series has the advantage of measuring trends of behaviour over time and of detecting gradual but significant shifts in behaviour, but the disadvantage of being more costly than the mean and standard deviation model.

This model list, though primeval, gives a good snapshot of the kinds of Intrusion Detection Systems we can find in most of the works done in this field: for a comprehensive and more complete taxonomy see the following section.
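As an illustration only, here is a possible sketch of the mean and standard deviation model in Perl, assuming d = 4 and invented observation values; it is not code from IDES or from the thesis system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Mean and standard deviation model (sketch): a new observation of a metric x
# is flagged as abnormal if it falls outside mean +/- d * stdev of the
# previous observations. With d = 4, Chebyshev's inequality bounds the
# probability of a legitimate value being flagged by 1/16.
my $d = 4;

sub is_abnormal {
    my ($new, @past) = @_;
    my $mean = 0;
    $mean += $_ / @past for @past;
    my $var = 0;
    $var += ($_ - $mean) ** 2 / @past for @past;
    my $stdev = sqrt $var;
    return abs($new - $mean) > $d * $stdev;
}

# Example metric: number of e-mails sent per hour by a user (sample values).
my @history = (3, 5, 4, 6, 4, 5, 3, 4);
for my $x (5, 40) {
    printf "%d e-mails/hour -> %s\n", $x,
        is_abnormal($x, @history) ? 'abnormal' : 'normal';
}
```

The same subroutine could be weighted towards recent values, as suggested in the text, by multiplying each past observation's contribution by an ageing factor.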
2.3 Intrusion Detection Basics
In this section I introduce intrusion detection basics or, in my humble opinion, all that is needed to understand intrusion detection. The whole section is based on the works of Axelsson [Axe00], Noel [NWY02], Lazarevic [LEK+03], García-Teodoro [GTDVMFV09] and Debar [Deb99]: I want to stress that, since there is no official standard or taxonomy, the one that follows is one of the possible choices.
2.3.1 General IDS architecture
An intrusion detection system can be seen as a set of macro-components. We adopt two schemas. In the first schema an intrusion-detection system can be described, at a very macroscopic level, as a detector that processes information coming from the system that is to be protected (Fig. 2.1). This detector uses three kinds of information: long-term information related to the technique used to detect intrusions (a knowledge base of attacks, for example), configuration information about the current state of the system, and audit information describing the events that occur on the system.
[Figure 2.1: The Debar IDS schema. A detector is fed by a knowledge database, configuration information and audits from the monitored system, producing alarms and countermeasure actions.]
The role of the detector is to eliminate unnecessary information from the audit trail and present a synthetic view of the security-related actions taken by users. A decision is then made to evaluate the probability that these actions can be considered symptoms of an intrusion. In the second schema, the IDWG (Intrusion Detection Working Group) architecture, the schema is based on four types of functional modules (Fig. 2.2):

• E blocks (Event-boxes): this kind of block is composed of sensor elements that monitor the target system, thus acquiring information events to be analysed by other blocks.

• D blocks (Database-boxes): these are elements intended to store information from E blocks for subsequent processing by A and R boxes.

• A blocks (Analysis-boxes): processing modules for analysing events and detecting potential hostile behaviour, so that some kind of alarm will be generated if necessary.

• R blocks (Response-boxes): the main function of this type of block is the execution, if any intrusion occurs, of a response to thwart the detected menace.

It is straightforward that the two schemas are almost the same, but they adopt different words and sometimes they aggregate/disaggregate modules:

• Detector ⇔ Event box & Analysis box
• Database ⇔ Database box
• Countermeasure ⇔ Response box
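A toy Perl sketch of how the four functional blocks could be decomposed in code; the subroutine names, the event structure and the 100-events-per-minute threshold are assumptions made for illustration and are not part of the IDWG specification.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# E-box produces events, D-box stores them, A-box analyses them,
# R-box reacts: a minimal pipeline mirroring the block decomposition above.
my @d_box;                                                # D-box: stored events

sub e_box { return { time => time(), count => shift } }   # E-box: one event
sub a_box {                                                # A-box: flag bursts
    my ($event) = @_;
    push @d_box, $event;
    return $event->{count} > 100;                          # arbitrary threshold
}
sub r_box { print "ALARM: $_[0] events in one minute\n" }  # R-box: respond

for my $count (12, 250, 7) {
    my $event = e_box($count);
    r_box($event->{count}) if a_box($event);
}
```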
2.3.2 IDS concepts
There are a number of concepts/characteristics we can use to classify intrusion detection systems, presented in Figure 2.3. One Intrusion Detection System can belong to different families, but not to different sub-branches of the same family. For example, an intrusion detection system can be network based and proactive at the same time, but it cannot be proactive and passive. Let us examine these concepts in the following sections.
2.3.2.1 Detection method or general strategy detection
The detection method/general strategy detection describes the characteristics of the analyser. When the intrusion-detection system uses information about the normal behaviour of the system it monitors, we qualify it as behaviour-based.
[Figure 2.2: The IDWG IDS schema. E-boxes, A-boxes, a D-box and an R-box around the monitored environment.]

[Figure 2.3: IDS characteristics. Detection method/general strategy detection (behaviour-based/anomaly detection vs. knowledge-based/misuse-signature detection); behaviour on detection/type of response (active: corrective or proactive, vs. passive); audit source location (host-based vs. network-based, centralised vs. distributed); usage frequency/time of detection (continuous monitoring vs. periodic analysis).]
When the intrusion-detection system uses information about the attacks, we qualify it as knowledge-based. These two categories of general strategy detection techniques are also called anomaly detection and misuse/signature detection. Anomaly detection techniques define the expected behaviour of the network (or other entity) in advance. Any significant deviation from this expected behaviour is then reported as a possible attack: such deviations are not necessarily actual attacks, they may simply be new network behaviour that needs to be added to the network profile. This kind of system has the advantage of detecting novel attacks but, in case of bad calibration, we can have a high false positive rate. Misuse detection finds intrusions by looking for activity corresponding to known intrusion techniques. This generally involves monitoring network traffic in search of direct matches to known patterns of attack, called signatures. A disadvantage of this approach is that it can only detect intrusions that follow pre-defined patterns. An advantage is that it has a very low rate of false positives.
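As a minimal illustration of the signature idea (not a real rule set), a misuse detector can be reduced to matching observed log lines against pre-defined patterns; the two signatures and the sample lines below are invented for the sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Misuse detection in its simplest form: compare observed activity against a
# small set of pre-defined signatures (regular expressions here).
my %signature = (
    'repeated authentication failure' => qr/authentication failure/i,
    'mail relay attempt'              => qr/relay access denied/i,
);

while (my $line = <DATA>) {
    for my $name (sort keys %signature) {
        print "match: $name -> $line" if $line =~ $signature{$name};
    }
}

__DATA__
Oct 12 10:01:02 mail postfix/smtpd[123]: NOQUEUE: reject: Relay access denied
Oct 12 10:01:07 mail sshd[456]: authentication failure for user root
```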
2.3.2.2 Type of response or behaviour on detection
Type of response/behaviour on detection describes the response of the intrusion detection system to attacks. When it actively reacts to the attack the IDS is said to be active. This class can be further divided into two subclasses: the first one includes systems that take corrective actions on the attacked systems (e.g. closing holes), the second one includes systems that take proactive actions against the attacker (e.g. logging out attackers, closing down services). These systems generate scripts both to suppress the vulnerability (by changing the permissions on a file system, for example) and to restore the system to its previous state. Hence the application of a countermeasure is made safer by the capability of reverting quickly to a former state in the event of an abnormality. If the IDS merely generates alarms it is said to be passive. These kinds of systems, then, respond by notifying the proper authority, and they do not themselves try to mitigate the damage done, or actively seek to harm or hamper the attacker.
2.3.2.3 Audit source location or data source
The audit source location/data source distinguishes among intrusion-detection systems based on the kind of input information they analyse. This input information can be audit trails, system logs or network packets. So we can identify two categories of audit source location/data source detection techniques: host-based intrusion detection and network-based intrusion detection.
Host-based intrusion detection collects data from individual hosts on the network. Host-based detection systems directly monitor the host data files and operating system processes that will potentially be targets of attack. Host audit sources are the only way to gather information about the activities of the users of a given machine. On the other hand, they are also vulnerable to alterations in the case of a successful attack. This creates an important real-time constraint on host-based intrusion-detection systems, which have to process the audit trail and generate alarms before an attacker taking over the machine can subvert either the audit trail or the intrusion-detection system itself. Which kind of data does a host based IDS process? We can divide the data into three families:

1. System sources. All operating systems have commands to obtain a snapshot of information on the processes currently active on the computer. In a UNIX environment, examples of such commands are ps, pstat, etc. These commands provide very precise information about events because they examine the kernel memory directly. However, they are very difficult to use for continuous audit collection in intrusion-detection tools because they do not offer a structured way of collecting and storing the audit information.

2. Syslog. Syslog is an audit service provided to applications by the operating system. This service receives a text string from the application, prefixes it with a time stamp and the name of the system on which the application runs, and then archives it, either locally or remotely (a small parsing sketch appears at the end of this subsection).

3. Security audit. The security audit records all potentially security-significant events on the system. The kind of recorded events depends on the security audit software.

The host based approach has a main drawback: sometimes it is not possible to put the monitor in every network host, and in the case of a centralized data analyser we can have data transfer problems (e.g. privacy, bandwidth, timing...). Network-based intrusion detection collects data from traffic across the monitored network. This involves placing a set of traffic sensors within the network. A sensor may perform local analysis and detection and report suspicious events to a central location, or only collect data for the central location. Since such monitors perform only the intrusion detection function, they are usually much easier to harden against attacks and to hide from attackers. With the widespread use of the Internet, network based intrusion detection systems have become more intensively used. In fact network attacks (DNS spoofing, TCP hijacking, port scanning, ping of death, etc.) cannot be detected by examining the host audit trail, at least not easily. Therefore, specific tools have been developed that sniff network packets in real time, searching for these network attacks. In addition, a number of classical attacks against servers can also be detected by parsing the payload of the packet and looking for
suspicious commands. Moreover, these tools are often attractive for system administrators because a small number of them can be installed at strategic points in the network to cover most of the current attacks. However, placing the sensors is a critical task because sensors must be put in sensitive network spots, so a good and well planned study is needed. We want to stress that there is an inherent duality in network sniffers, which is also apparent in the firewall world with its differences between application-level gateways and filtering routers. If the analysis is carried out at a low level by performing pattern matching, signature analysis, or some other kind of analysis of the raw content of the TCP or IP packet, then the intrusion detection system can perform its analysis quickly, but does not take into account session information, which could span several network packets. If the intrusion detection system acts as an application gateway and analyses each packet with respect to the application or protocol being followed, then the analysis is more thorough, but also much more costly. Moreover, this analysis of the higher levels of the protocol is also dependent on the particular machine being protected, as implementations of the protocols are not identical from one network stack to another. This approach addresses several problems:

• Detection of network-specific attacks. There are a number of network attacks, particularly denial of service, that cannot be detected in a timely fashion by searching for audit information on the host, but only by analyzing network traffic.

• Impact of auditing on the host performance. Information is collected entirely on a separate machine, with no knowledge of the rest of the network. Therefore, installation of such tools is facilitated because, both in terms of configuration and performance, they do not impact the entire environment.

• Heterogeneous audit trail formats. The current de facto standardization towards TCP/IP facilitates the acquisition, formatting, and cross-platform analysis of the audit information.

• Certain tools analyze the payload of the packet, which allows the detection of attacks against hosts by signature analysis. However, an efficient analysis requires knowledge of the type of machine or application for which the packet is intended.

But it also has a number of drawbacks:

• It is more difficult to identify the culprit when an intrusion is discovered. There is no reliable link between the information contained in the packets and the identity of the user who actually submitted the commands on the host.

• With switched networks (switched Ethernet, switched Token Ring, ATM), it is not obvious where the sniffer should best be placed. Some tools are located on switches,
others at gateways between the protected system and the outside world. The former yields better audit information but is also more costly. One has to realize, however, that switched networks are also much less vulnerable to sniffer attacks and are actually recommended to improve the security of a network.

• Encryption makes it impossible to analyze the payload of the packets, and therefore hides a considerable amount of important information from these tools. Also, it is possible, even without encryption, to obfuscate the contents of the packet to evade detection if the signatures are not sufficiently comprehensive.

• Systematic scanning, for example at the firewall, is difficult because it might create bottlenecks. This will only worsen as the bandwidth to access the Internet is increased at sensitive sites (e.g. banks, electronic commerce web sites).

• Finally, these tools are inherently vulnerable to denial-of-service attacks if they rely on a commercial operating system to acquire network information. As the network stacks of these commercial operating systems are vulnerable to attacks, so is the intrusion-detection system.
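The parsing sketch referred to in point 2 above: a hedged example of splitting a traditional "timestamp host program: message" syslog line, of the kind the thesis system reads from postfix log files. The sample line and the regular expression are assumptions about that classic layout, not code from the system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a classic BSD-style syslog line into timestamp, host, program tag
# (program name plus [pid]) and free-text message.
my $line = 'Feb  3 14:22:51 smtp1 postfix/smtpd[2041]: connect from unknown[10.0.0.7]';

if ($line =~ /^(\w{3}\s+\d+\s[\d:]{8})\s+(\S+)\s+(\S+?):\s+(.*)$/) {
    my ($stamp, $host, $tag, $text) = ($1, $2, $3, $4);
    print "time=$stamp host=$host program=$tag\n";
    print "message=$text\n";
}
```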
2.3.2.4 Usage frequency or time of detection
Usage frequency/time of detection describes when a system applies its detection strategy. Certain intrusion detection systems have real-time continuous monitoring capabilities, whereas others must be run periodically. The first three classes can be seen as grouped in the category of functional characteristics of an intrusion detection system, because they refer to the internal workings of the intrusion detection engine, namely its input information, its reasoning mechanism, and its interaction with the information system. The fourth characteristic distinguishes Real-Time Intrusion Detection (RTID) from scanners usually used for security assessment. These scanners are sometimes attached to the intrusion-detection area, and we must distinguish between them and real intrusion detection systems.
2.4 Intrusion Detection Taxonomy
Given the above concepts, the most common way to classify intrusion detection systems is according to the detection method/general strategy detection: in this family we have Misuse/Signature detection versus Anomaly detection.
2.4.1 Misuse/Signature detection or knowledge based intrusion detection
In signature detection the intrusion detection decision is formed on the basis of knowledge of a model of the intrusive process and of what traces it ought to leave in the observed system. We apply the knowledge accumulated about specific attacks and system vulnerabilities. We can define in any and all instances what constitutes legal or illegal behaviour, and compare the observed behaviour accordingly. It should be noted that these detectors try to detect evidence of intrusive activity irrespective of any idea of what the background traffic, i.e. the normal behaviour of the system, looks like: in other words, any action that is not explicitly recognized as an attack is considered acceptable. The detectors have to be able to operate no matter what constitutes the normal behaviour of the system, looking instead for patterns or clues that are thought by the designers to stand out against the possible background traffic. This places very strict demands on the model of the nature of the intrusion: no sloppiness can be afforded here if the resulting detector is to have an acceptable detection and false alarm rate. Advantages of the knowledge-based approaches are that they have the potential for very low false alarm rates, and that the contextual analysis proposed by the intrusion-detection system is detailed, which makes it easier for the security officer using the intrusion-detection system to take preventive or corrective action. Drawbacks include the difficulty of gathering the required information on the known attacks and keeping it abreast of new vulnerabilities and environments. Maintenance of the knowledge base of the intrusion-detection system requires careful analysis of each vulnerability and is therefore a time-consuming task. Knowledge-based approaches also have to face the generalization issue: knowledge about attacks is very focused on the operating system, version, platform, and application, so the resulting intrusion-detection tool is closely tied to a given environment. Also, detection of insider attacks involving an abuse of privileges is deemed more difficult, because no vulnerability is actually exploited by the attacker. Misuse Detection Systems can be divided into:

• Programmed. The system is programmed with an explicit decision rule, where the programmer has himself pre-filtered away the influence of the channel on the observation space. The detection rule is simple in the sense that it contains a straightforward coding of what can be expected to be observed in the event of an intrusion. Thus, the idea is to state explicitly what traces of the intrusion can be thought to occur uniquely in the observation space. This has a clear correspondence with a default permit security policy, or the formulation that is common in law, i.e. listing illegal behaviour and thereby defining all that is not explicitly listed as permitted.

– State-modelling. State-modelling encodes the intrusion as a number of different states, each of which has to be present in the observation space for the intrusion to be considered to have taken place. They are by their nature time series
models. Two subclasses exist: in the first, state transition, the states that make up the intrusion form a simple chain that has to be traversed from beginning to end; in the second, Petri-net, the states form a Petri net. In this case they can have a more general tree structure, in which several preparatory states can be fulfilled in any order, irrespective of where in the model they occur. Owing to the generality of Petri nets, quite complex signatures can be written easily. However, matching a complex signature against the audit trail may become computationally very expensive. A simple example of a Petri-net signature is shown in the figure below.
[Figure: a simple Petri-net signature. Four unsuccessful logins drive the net from START through states S1–S4 to the FINAL state, with transition times t = T1 and t = T2 constraining the interval T2 − T1.]
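A state-transition flavoured rendering of the same signature, written as a simple counter over a time window: the chain advances on each unsuccessful login for a user and fires when four failures fall inside the window. The 4-attempt and 60-second values, and all names, are illustrative assumptions, not taken from any real rule set.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Four unsuccessful logins for the same account within a short time window
# raise an alarm, mimicking the chain of states in the figure above.
my $max_attempts = 4;
my $window       = 60;                 # seconds between first and last failure
my %failures;                          # per-user list of failure timestamps

sub observe_failure {
    my ($user, $time) = @_;
    my $list = $failures{$user} ||= [];
    push @$list, $time;
    shift @$list while @$list and $time - $list->[0] > $window;  # drop old states
    return @$list >= $max_attempts;    # reached the final state of the chain
}

# Simulated audit events: (user, timestamp in seconds).
my @events = (['bob', 10], ['bob', 15], ['alice', 20], ['bob', 22], ['bob', 30]);
for my $e (@events) {
    print "ALARM: brute force against $e->[0]\n" if observe_failure(@$e);
}
```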
2.4.2.2.2 Default deny systems. The idea is to state explicitly the circumstances under which the observed system operates in a security-benign manner, and to flag all deviations from this operation as intrusive. This has a clear correspondence with a default deny security policy, formulating, as the general legal system does, that which is permitted and labelling all else illegal.

2.4.2.2.2.1 State series modelling. In state series modelling, the policy for security-benign operation is encoded as a set of states. The transitions between the states are implicit in the model, not explicit as when we code a state machine in an expert system shell. As in any state machine, once it has matched one state, the intrusion detection engine waits for the next transition to occur. If the monitored action is described as allowed, the system continues; if the transition would take the system to another state, any (implied) state that is not explicitly mentioned will cause the system to sound the alarm. The monitored actions that can trigger transitions are usually security relevant actions such as file accesses (reads and writes), the opening of "secure" communications ports, etc. The rule matching engine is simpler and not as powerful as a full expert system. There is no unification, for example. It does allow fuzzy matching, however: fuzzy in the sense that an attribute such as "write access to any file in the /tmp directory" could trigger a transition. Otherwise the actual specification of the security-benign operation of the program could probably not be performed realistically.

2.4.2.2.2.2 User Intention Identification. User intention identification [SD96] models the normal behaviour of users by the set of high level tasks they have to perform on the system. The analyser keeps a set of tasks that each user can perform. Whenever an action occurs that does not fit the task pattern, an alarm is issued.

2.4.2.2.2.3 Computer immunology. This technique [FHS96] builds a model of the normal behaviour of the UNIX network services, rather than that of users. This model consists of short sequences of system calls made by the processes. Attacks that exploit flaws in the code are likely to take unusual execution paths. The tool first collects a set of reference audits, which represents the appropriate behaviour of the service, and extracts a reference table containing all the known "good" sequences of system calls. These patterns are then used for live monitoring to check whether the sequences generated are listed in the table; if not, the intrusion-detection system generates an alarm.

2.4.2.2.3 Classification. Classification is the task of assigning database records to one of a pre-defined set of target classes. The difficulty is that target classes are not explicitly given in the database records, but must be derived from the available attribute values.
This division is made by a classifier. To build a classifier we use a so-called training set. Once the classes are defined, the training data set consists of records that have already been classified and are given to the classifier: such a classifier can be used to solve this initial task, and it is then ready to be calibrated with other or new records. Classifiers can use different representations to store their classification knowledge. Two common knowledge representations are if-then rules [YZW07] and decision trees [Ait08]. If-then rules check record attributes in their if-parts and postulate class labels in their then-parts. For example, given the two classes true positive and false positive:

if Source IP = 172.16.22.1 and Destination Port = 27 then false positive

A decision tree is a flow-chart-like tree structure, in which each node denotes a test on an attribute value, each branch represents an outcome of the test and each leaf indicates a class label.
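A minimal if-then rule classifier in the spirit of the example above; the rules, attribute names and records are invented, and a real classifier would of course be learned from a training set rather than hand-coded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classify alert records into "true positive" / "false positive" with
# hand-written if-then rules (the first rule mirrors the example in the text).
sub classify {
    my ($r) = @_;
    return 'false positive'
        if $r->{src_ip} eq '172.16.22.1' and $r->{dst_port} == 27;
    return 'false positive' if $r->{dst_port} == 80 and $r->{bytes} < 500;
    return 'true positive';                       # default class
}

my @records = (
    { src_ip => '172.16.22.1', dst_port => 27, bytes => 1200 },
    { src_ip => '10.1.1.9',    dst_port => 25, bytes => 90000 },
);
printf "%s:%d -> %s\n", $_->{src_ip}, $_->{dst_port}, classify($_) for @records;
```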
2.5 Orthogonal concepts
So we can say that in intrusion detection systems different orthogonal concepts in distinct areas coexist. We can divide the areas into:

1. Knowledge collection: how a system gets its knowledge
2. Detection strategy: how a system discriminates one kind of data/event from another
   • Type of response (sub-area): what a system does when it applies its detection strategy
   • Time of detection: when a system applies its detection strategy
3. Data source (sub-area): where a system gets its data/events

In the knowledge collection area we saw that we have self-learning versus programmed systems. In the detection strategy area we saw that we have anomaly versus signature/misuse systems; in this particular area we have two sub-areas with their own orthogonal concepts, too: the type of response sub-area has active systems versus passive systems. In the data source area we saw that we have host-based versus network-based systems.
2.6 Evaluating an Intrusion Detection System
One important task for researchers is to evaluate the goodness of their systems. Different key aspects concern the evaluation, and thus the comparison, of the performance of alternative intrusion detection approaches: these are the efficiency of the detection process, the cost/performance involved in the operation, the fault tolerance of the system itself and the timeliness of the alarm process [Deb99][GTDVMFV09].

• Efficiency. Without underestimating the importance of the other aspects, the efficiency aspect is actually the most considered. Four situations exist in this context, corresponding to the relation between the result of the detection for an analysed event (normal vs. intrusion) and its actual nature (innocuous vs. malicious). These situations are: false positive (FP), if the analysed event is innocuous (or clean) from the perspective of security, but it is classified as malicious; true positive (TP), if the analysed event is correctly classified as intrusion/malicious; false negative (FN), if the analysed event is malicious but it is classified as normal/innocuous; and true negative (TN), if the analysed event is correctly classified as normal/innocuous. It is clear that low FP and FN rates, together with high TP and TN rates, will result in good efficiency values (a small computation sketch follows this list). We can identify two sub-aspects under efficiency:

  – Accuracy. Accuracy occurs when we have a low false positive rate and a high true positive rate. Conversely, inaccuracy occurs when an intrusion-detection system flags as anomalous or intrusive a legitimate action in the environment.

  – Completeness. Completeness occurs when we have a low false negative rate and a high rate of true negatives. Incompleteness occurs when the intrusion-detection system fails to detect an attack. This measure is much more difficult to evaluate than the others, because it is impossible to have a global knowledge about attacks or abuses of privileges.

• Performance. The performance of an intrusion-detection system is the rate at which audit events are processed. If the performance of the intrusion-detection system is poor, then real-time detection is not possible.

• Fault tolerance. An intrusion-detection system should itself be resistant to attacks, particularly denial of service, and should be designed with this goal in mind. This is particularly important because most intrusion-detection systems run on top of commercially available operating systems or hardware, which are known to be vulnerable to attacks.

• Timeliness. An intrusion-detection system has to perform and propagate its analysis as quickly as possible to enable the security officer to react before much damage
has been done, and also to prevent the attacker from subverting the audit source or the intrusion-detection system itself. This implies more than the measure of performance, because it encompasses not only the intrinsic processing speed of the intrusion-detection system, but also the time required to propagate the information and react to it.
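The sketch referred to in the efficiency bullet above shows how the four outcome counts translate into the usual rates; the counts are invented, and "accuracy" is computed here in the standard (TP + TN) / total sense rather than the qualitative sense used in the text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Derive efficiency figures from the four outcomes defined above.
# In practice the counts come from running the IDS on a labelled data set.
my ($tp, $tn, $fp, $fn) = (45, 900, 30, 5);

my $detection_rate = $tp / ($tp + $fn);            # fraction of attacks caught
my $fp_rate        = $fp / ($fp + $tn);            # innocuous events flagged
my $accuracy       = ($tp + $tn) / ($tp + $tn + $fp + $fn);

printf "detection rate: %.3f  false positive rate: %.3f  accuracy: %.3f\n",
    $detection_rate, $fp_rate, $accuracy;
```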
2.6.1 Evaluation: data sets
When evaluating an IDS it is important to have the same unit of measurement, a fixed standard quantity which gives us an idea of the goodness of a system by comparison with the others. For this purpose two public data sets are used: DARPA 1998 and KDDCup'99. Both data sets are synthetic and a lot of criticism has been written about their problems, above all [Mch00].
2.7 Knowledge discovery in databases and data mining
There are techniques that may help in the task of dealing with the amount of information contained within a data set. Four of these techniques are presented in this work: Association Rules Discovery, Frequent Rules Discovery, Principal Component Analysis (see 2.4.2.1.3.7) and Visual Data Analysis. However, before proceeding, let us characterize data mining and KDD: in most cases the two terms data mining and KDD are interchangeable, although some authors say that knowledge discovery in databases (KDD) denotes the whole process of extracting useful knowledge from large data sets, while data mining, by contrast, refers to one particular step in this process, the step that applies so-called data mining techniques to extract patterns from the data. So the term data mining is commonly used to designate the process of extracting useful information from large databases. In fact data mining tries to obtain simple but understandable models that can be interpreted as interesting or useful knowledge: data mining techniques are pattern discovery algorithms. When studying data mining we have to face some problems, because different data mining techniques assume different input data representations (sets of transactions versus relational databases) and the same representation (relational database) can have different ways to represent the same data set.
2.7.1 ARD and FRD
Association Rules Discovery and Frequent Rules Discovery obtain correlations between different features extracted from a training data set.
2.7.1.1 Association Rules Discovery
Association rules capture implications between attribute values:

(A1 = v1 ∧ ... ∧ Am = vm) ⇒ (Am+1 = vm+1 ∧ ... ∧ An = vn) [s, c]
Formally, association rules have the above form, where the Ai (for i ∈ 1...n) are pairwise distinct attributes and the vi attribute values. The parameters s and c are called support and confidence. s indicates the percentage of database records that satisfy the conjunction of the rule's left-hand side (let us call it L) and right-hand side (R), i.e. the conjunction of Ai = vi for i ∈ 1...n: in other words, the fraction of records that make the rule true if we put their values into it. c is the conditional probability that a database record satisfies the rule's right-hand side provided it satisfies the left-hand side, P(R|L). Support and confidence are generally used to measure the relevance of association rules: a high support value implies that the rule is statistically significant, and similarly a high confidence value is characteristic of a strong association rule. So we can say that the association rule mining problem is formulated as: find all association rules that have support and confidence beyond a user-specified minimum value.
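A small example of measuring support and confidence for a single association rule over a toy record set; the attributes, values and the rule itself are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Support and confidence of the rule (service = smtp) => (flag = rejected) [s, c]
# computed over a toy set of records.
my @records = (
    { service => 'smtp', flag => 'rejected' },
    { service => 'smtp', flag => 'accepted' },
    { service => 'smtp', flag => 'rejected' },
    { service => 'http', flag => 'accepted' },
);

my $left = grep { $_->{service} eq 'smtp' } @records;                          # records satisfying L
my $both = grep { $_->{service} eq 'smtp' and $_->{flag} eq 'rejected' } @records;

my $support    = $both / @records;      # fraction of records satisfying L and R
my $confidence = $both / $left;         # P(R | L)
printf "support = %.2f  confidence = %.2f\n", $support, $confidence;
```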
Frequent Episode Rules Discovery
Frequent episode rules are used when collections of records are ordered: in fact sometimes the order of records contains important information that must not be ignored. Frequent episode rules take record order into account by revealing correlation among temporally close records: so it is clear that frequent episode rules are only applicable for database records that have a time attribute. Formally frequent episode rules have the form: (P ∧ Q ⇒ R)[s, c, w] Where P,Q and R are predicates over a used-defined class of admissible predicates. Intuitively this rule says that two records, that satisfy respectively P and Q, are generally accompanied by a third record R that satisfies R. The temporal correlation is clear when we give the definition of s, c and w (support, confidence and window width). The support s is the probability that a time window of w seconds contains three records p, q and r that satisfy P,Q and R respectively. Moreover, consider a time window of w seconds that 34
contains a record p satisfying P and a record q satisfying Q; then, the confidence c is the probability that a record r satisfying R is also contained within the same time window. So in frequent episode rules we redefined the term support and confidence, however the problem of finding rules remains the same: given the triplet [s, c, w], the problem is to find all frequent episode rules that have support and confidence of at least s and c, respectively. This task involves in particular finding out appropriate predicates P, Q and R.
2.7.2
Visual Data Analysis
Figure 2.7: An example of Visual Data Analysis: a portsweep attack identified Visual data analysis or Visual Analytics [HCGZ09] [TMWJK04] of traffic pattern is a contemporary and proven approach to combine the art of human intuition and the science of mathematical deduction to directly perceive patterns and derive knowledge and insights from them. Visualization mainly aims to provide the supervisor with a synthtetic and intuitive representation of a given situation (e.g. network situation, host situation etc.): see for example FIg.2.7. In Intrusion Detection Visualization techniques aim to make the available statistics from the traffic monitoring systems understandable in an interactive way. This approach requires mapping high-dimensional feature data into low dimensional space 35
presentation, so it uses techniques such as Principal Component Analysis, Linear Discrimination Analysis and Factor Analysis. Visual methods have one special feature: they can discover information that complements the knowledge found by more commonly used approaches and they are particular useful when users want to explore data to learn from it, that is they don’t know exactly what they need to discover from the data.
36
Chapter 3 The system It didn’t occur to me that this might be the preliminary stages of a wormkilling job. I never guessed that big a worm was free and running. Now I see the implications, and I guess you do too, hm? (The Shockwave Rider - John Kilian Houston Brunner) In this chapter I describe my system: I want to stress that the implementation of the system has been done simultaneously with the analysis of the data at my disposal. In fact while I was developing the IDS, I analysed the data (in this case postfix log files) to find better ways to identifies anomalies (see the following chapter for more information). The creation of my Intrusion Detection System is divided in two steps: the first one is the off-line phase, the second one is the on-line phase. In the off-line phase the system in this chapter will be called Worm-Poacher, whilst in the on-line phase the system will be called WPD (WormPoacherDaemon) or WPD-pe (WormPoacherDaemon wih Panworm Engine). The whole system has been developed in PERL. Although the off-line system shows poor performance and, in designing the on-line version, the architecture and the software modules have been changed, the following sections are important to understand the system development and all the obstacles which have been overcome: the sections can be considered by the reader like an historical documentation.
3.1
The off-line system: Worm-Poacher
The off-line system is described here - even if it shows poor performance - as the analysis done with it was vital for the creation of identification of anomalies. The system with its shortcomings and lack of features actually gave way to create the current system. To fully
understand then the analysis proposed in the next chapter is here described the system with which has been made. Worm-poacher [AACP06a][AACP06b] is an off-line system which can read Postfix log files, create an human readable file with the data gathered, store the data in multiple databases and make a traffic analysis on them. Worm-poacher is made up of different modules: Log Translator, GenDB, Statistic DB, Inquirer DB and Alerter.
3.1.1
Log Translator
Log Translator module’s main aim is to read postfix log files and creates an human readable file of them (you can find a stand alone implementation at [ACP07]). In fact one of the most common problem for an administrator of mail domains is answering to users’ requests regarding e-mail sent that never arrived to destination, or messages that he should have received but he didn’t. From log mail analysis it is possible to monitor transactions and understand if something went wrong in e-mail communications. The problem with this approach is that log file is of difficult comprehension, since a single
Figure 3.1: The Postfix architecture 38
transaction is split across several lines. In fact typical mail servers use different modules for dispatching e-mail to users and each module write its own information to log file. An example is shown in Figure 3.1 which describes Postfix architecture. If you take a look to a postfix log file ( see Fig. 3.2 ) you can notice that for every e-mail you have different lines of information about a chosen transaction. These lines are difficult to understand for human beings because they are spread through all the file: you can reconstruct manually if you gather the information you need about one transaction through an hash string called QID (queue identifier are highlighted in Figure 3.2). As already explained, a transaction is spread out over many distant lines: in every line we find the module that manage at the moment the e-mail and through the hash we can search for the next step. Therefore a sysman should perform the following to reconstruct: 1. find the QID related to a specific e-mail 2. search for the other QID instances ( eventually split by log-rotate) 3. reconstruct the whole transaction It’s a boring, time-consuming work. For all these reasons I thought to develop a tool which transforms a log in a cleaner format, normalizing it. Through this tool I can access different and important information about e-mail traffic and I can make it available for the other modules (Database modules and analysis module).
Figure 3.2: A Postfix log example Log Translator reads in input two files: the Postfix log file and a configuration file. The configuration file contains various kind of information and this information is available to all the modules ( for obvious reasons I describe it here): 39
Figure 3.3: The Log Translator module • database type: the system supports Berkeley database and MySQL database. • my domain: this setting identifies the mail domain(s)(all that is written after the ). • my network: this setting identifies the network(s) IP addresses to analyse. For example if I have this setting set to “192.” a mail sent from IP 192.150.1.4 is taken into account, as a mail sent from 10.192.1.1; a mail sent from 172.16.1.1 is not taken into account. • whitelist: this setting identifies the hosts that I don’t want to monitor. This setting is very useful in case of mailserver(s): this kind of host in fact has a behaviour very different compare to simple hosts. • mailing-list: this setting identifies the authorized mailing-lists. It is very useful, cause a mailing list can be seen as an anomalous event when I analyse SMTP traffic flow. Log Translator creates an human readable text file. It reconstructs every single e-mail transaction spread across the mail server log. Every row represents a single transaction and it has the fields described in the following list: • Timestamp It is the moment in which the e-mail has been sent: it is possible to have this information in Unix timestamp format or in julian format. • Client It is the hostname of e-mail sender. • Client IP It is the IP of the sender’s host. • From It is the sender’s e-mail address. 40
• To It is the receiver’s e-mail address. • Status It is the server response (e.g. 250, 450, 550 etc.). • Size It is the e-mail size (in bytes). With this format is possible to find the moment in which the e-mail has been sent, the sender client name and IP, the from and to field of the e-mail and the server response. Lets make an example. A transaction will appear like this if
[email protected] send an e-mail on 22 September 2008 to
[email protected] from 172.16.72.235:
22/09/2008 paulpc.myisp.com 172.16.72.235
[email protected] [email protected] 550 3481
3.1.2
GenDB
Once I had gathered all the information useful from log files I needed a way to store them: GenDB module does this work. GenDB reads in input the Log Translator output file and it creates a database with the data.
Figure 3.4: The GenDB module It has implemented storing data in two ways: 41
• Berkeley DB Berkeley DB is a library database which offers low-level and raw operation to store data on simple files. Berkeley DB is embedded because it links directly into the application, so theres no need of installing third part software and it works on every kind of architecture. • My-SQL MySQL is a relational database management system (RDBMS) so it provides a more flexible choice, however you need to install a mySQL server and all the stuff needed to make it run. About the Berkeley DB implementation I have to spend a few more words. Berkeley DB provides a simple function-call API which has driven me to build some high level operations. For my purposes I have chosen DBHASH and DB-BTREE implementation where a single value key is linked to data value. Its difficult to perform different kind of queries because of the single key: in fact you cannot associate a single key to multiple fields, but every key is linked to a single value. In order to allow searches through the database, I created five different table, implementing a database search engine. The first one is the main table, containing all information about e-mail traffic: every mail is indexed by an integer number and the value is a pipe-delimited string containing six different values (date, sender IP, sender e-mail address, receiver e-mail address, e-mail status and e-mail size). The other four tables are secondary tables, whose keys are the followings: data, IP, sender and receiver; every key points to different lines of the main table(Figure 3.5). For example when we want to search for all the e-mails sent from a given IP address the
Figure 3.5: The Berkeley DB schema process followed by the GenDB Berkeley database engine is as reported below: 42
1. search the selected IP (xxx.xxx.xxx.0) address in sender IP table 2. obtain a list of integer (mail identifiers), which are keys in mail table xxx.xxx.xxx.0 → 2|45|78|3456|8960 3. each element in the list is used to retrieve all the data about that specific e-mail 2 → date|host|xxx.xxx.xxx.0|mailserver|f rom|to|250|320 45 → date|host|xxx.xxx.xxx.0|mailserver|f rom|to|250|2512 78 → date|host|xxx.xxx.xxx.0|mailserver|f rom|to|250|417 3456 → date|host|xxx.xxx.xxx.0|mailserver|f rom|to|250|342 8960 → date|host|xxx.xxx.xxx.0|mailserver|f rom|to|250|1024 I implemented a query interface for Berkeley database which gives users a set of pre-definite queries. For example, you can find how and which e-mails have been sent by a user or to a particular user. Another choice is to see the whole e-mail traffic in a certain period of time or listing all the e-mails rejected by your e-mail server.I implemented the following queries: • IP-STORY: it lists all the e-mails sent by the selected IP address • FROM-STORY: it lists all the e-mail with the given from field • DAILY-EMAIL: it lists all the e-mail traffic in a given day • DAILY-REJECT: same as daily-email, but you have only the e-mails rejected by the mail-server ( for example messages to non-existing users) Speaking of MySQL, it has already a query interface and offers a very powerful query engine, so I only made the data importing function from Log Translator file to MySQL. However, as already said, it is obvious that you need MySQL already installed and running on your system. In this case all the queries are automatically handled by DBMS engine. At this stage of the system, however, the main choice was Berkeley DB: on the contrary, when the system went on-line (see the other sections), the choice changed to MySQL because Berkeley DB with huge amount of data suffered of low performance.
3.1.3
Analysis engine: StatisticDB, Inquirer DB and Alert
The analysis engine is made up of two main modules: StatisticDB and Inquirer DB. The first module (see Fig.3.6) reads in input the genDB databases (Mail and IP), and creates eight different databases with statistical data: each one of these databases contains 43
Figure 3.6: The StatisticDB module
44
the amount of e-mails sent in a given time interval. So we have the following: Stat5m, Stat10m, Stat15m, Stat30m, Stat1h, Stat4h, Stat8h, Stat24h. All the data gathered in these databases, however, were of the e-mail global flow and pertained to all the e-mail sent from the first instant of activity. So I developed Inquirer DB which gives the possibility to do per IP analysis and chosen interval analysis. The analysis engine has an alert module, which writes in a given file if there is something wrong with the e-mails flow. At this development stage there is no module which does automatically from analysis or reject analysis: there is only a primeval analysis engine.
Figure 3.7: The Worm-Poacher architecture
3.2
The on-line system: WPD and WPD-pe
The system described and developed until now is an off line system. The next step of my work was to make Worm-poacher an on line system. The first on line system, as previously said, is called Worm-poacher Daemon (WPD). In this phase, the aim was to do all the work done previously by Worm-Poacher in an automatic way and to have a transparent SMTP 45
server monitoring system with an alert module that can warn the system administrator by e-mail. This first on line system presented a lot of flaws (caused by implementation choices) so, during its development, it was abandoned in favour of WPD with the PanWorm engine. In the second phase I’ll describe the new system: I can speak of a new system, as previously said, because the architecture and the modules have been changed a lot, due to the new aims and poor performance of the previous system.
3.2.1
WPD
At this stage, WPD was thought like a main process which managed the whole architecture. The architecture is made up of six modules (three of these are brand new modules while the other three are old modules or two or more old modules merged together): LogSplitter, LogTranslator, GenDB, Statistic, Scheduler and Alert. 3.2.1.1
Log Splitter
LogSplitter is the module in charge of taking the log file and passing it to LogTranslator. I developed this module to have at my disposal the log ( no Postfix daemon interferences) and not to miss those kind of transactions that are in progress. LogSplitter acts following these main steps: • Activation Time The first time LogSplitter copies a given number of rows (or the whole log file) in a temporary log file to pass to LogTranslator. • n-th step During the normal activity LogSplitter takes the postfix log file and the previous temporary file and it makes a differ operation: when it detects the point where the new rows begin it takes the new piece and a ∆L rows before the point detected to avoid missing part of transaction in progress.
3.2.1.2
LogTranslator, GenDB and Statistic
LogTranslator is almost the same module as we saw in WormPoacher, while GenDB, at this stage, has undergone minor changes. It creates the five databases, but they only refer to the current month (see the Scheduler module). Statistic is the merging of the previous modules StatisticDB and InquirerDB and, like GenDB, creates the databases only with the data referring to the actual month.
46
Figure 3.8: The WPD architecture
47
3.2.1.3
Scheduler and Alerter
The Scheduler module has the aim to schedule the monthly update of the data. In fact it cleans the monthly databases and it inserts the data collected during the passed month in historic databases. Moreover it recalculates the statistical values (e.g. baseline) using the passed month data. The main task of the Alerter module is to send e-mail with warning to a chosen e-mail address. During the tests phase I encountered some problems both of performance both of choices of implementation. In fact the Berkeley DB, my first and my favourite choice, with a huge amount of data began to perform badly and all the databases rotation and log processing resulted a bit over-elaborate: moreover new PERL modules have been developed permitting to have a direct sniffing module. In fact, initially, I had in mind to utilize the methods used by Log Rotator on the sniffed data to perform a semi on-line analysis, because there were no PERL modules to sniff directly the cable. Taken cognizance of these things I decided to interrupt the development of the WPD architecture in favour of the architecture with the Panworm engine: in fact all the information gathered with WPD are a subset of the information gathered with a sniffing module. The term Panworm Engine derives from the ancient greek and it means “All the kind of worms”: in fact the Panworm Engine, sniffing directly the cable, can identified all the worms which use SMTP to spread. Obviously the terms worms or SMTP anomalies are interchangeable.
3.2.2
WPD with PanWorm engine: WPD-pe
WPD-pe has no centralized architecture, but it is made up of four independent daemons: SMTPDump, SMTPStats, SMTPAnalyzer and SMTPGuardian. Every independent daemons has been given a specific task: • SMTPDump This daemon has the duty to read the cable network and to gather the raw data (packets and transactions). • SMTPStats This daemon has the duty to analyse the raw data and to build tables about the e-mail flow. • SMTPAnalyzer This daemon main tasks are to calculate statistic indexes about the e-mail flow and to build historic tables about the e-mail flow means. • SMTPGuardian As Latins said “nomen omen”: this daemon check that everything is going fine, if not so it raises an alarm through an e-mail to a given e-mail address.
48
Figure 3.9: The actual WPD architecture In WPD-pe you have always the choice to select Berkeley DB, but this choice is deprecated: in fact the official database is MySQL for all the already explained reasons. As I previously said, WPD-pe is made up of four independent daemons: this particular feature allows you to have these daemon running in different hosts. For example we can have a scenario like in figure 3.10. 3.2.2.1
SMTPDump
SMTPDump is “the beating heart of” WPD-pe sniffing engine. In fact this daemon read TCP/IP packets (in this thesis I use the term “packet” in a general sense, it is not intended in a TCP/IP way e.g. datalink layer ↔ frame, network layer ↔ packet, transport layer ↔ datagram and application layer ↔ data) directly on the medium. The computer where SMTPDump is running must have some specific features: • its network interface card has to be in promiscuous mode: the host must read all the packets to be able to reconstruct a truthful screen shot of the SMTP activity • the computer must be positioned in a sensible spot of the network. In fact to be effective the system must see all the SMTP activity, so it is necessary a deep analysis 49
Figure 3.10: A possible scenario for WPD-pe
50
of the network topology • the computer must be able to process the data gathered in promiscuous mode, so it must have a good calculating power The daemon builds two tables ( remember that in WPD-pe we use MySQL!) based on the data gathered. The first table is called Packets: it contains all the packets that have set the destination port or source port to 25 (the default SMTP server port). The table is made up of the following field: • Source IP It is the source IP address of the packet. • Source Port It is the source port of the packet. • Destination IP It is the destination IP address of the packet. • Destination Port It is the destination port of the packet. • Sequence Number It is the TCP sequence number of the packet. • Flags They are the TCP flags of the packet. The flags are: – FIN (value = 1) The FIN bit indicates that the host that sent the FIN bit has no more data to send. When the other end sees the FIN bit, it will reply with a FIN/ACK. Once this is done, the host that originally sent the FIN bit can no longer send any data. However, the other end can continue to send data until it is finished, and will then send a FIN packet back, and wait for the final FIN/ACK, after which the connection is sent to a CLOSED state. – SYN (value = 2) The SYN (or Synchronize sequence numbers) is used during the initial establishment of a connection. It is set in two instances of the connection, the initial packet that opens the connection, and the reply SYN/ACK packet. It should never be used outside of those instances. – RST (value = 4) The RESET flag is set to tell the other end to tear down the TCP connection. This is done in a couple of different scenarios, the main reasons being that the connection has crashed for some reason, if the connection does not exist, or if the packet is wrong in some way. – PSH (value = 8) The PUSH flag is used to tell the TCP protocol on any intermediate hosts to send the data on to the actual user, including the TCP implementation on the receiving host. This will push all data through, unregardless of where or how much of the TCP Window that has been pushed through yet.
51
– ACK (value = 16) This bit is set to a packet to indicate that this is in reply to another packet that we received, and that contained data. An Acknowledgment packet is always sent to indicate that we have actually received a packet, and that it contained no errors. If this bit is set, the original data sender will check the Acknowledgment Number to see which packet is actually acknowledged, and then dump it from the buffers. – URG (value = 32) This field tells us if we should use the Urgent Pointer field or not. If set to 0, do not use Urgent Pointer, if set to 1, do use Urgent pointer. – ECE (value = 64) This bit was added with RFC 3268 and is used by ECN. ECE stands for ECN Echo. It is used by the TCP/IP stack on the receiver host to let the sending host know that it has received an CE packet. The same thing applies here, as for the CWR bit, it was originally a part of the reserved field and because of this, some networking appliances will simply drop the packet if these fields contain anything else than zeroes. This is actually still true for a lot of appliances unfortunately. – CWR (value = 128) This bit was also added in RFC 3268 and is used by ECN. CWR stands for Congestion Window Reduced, and is used by the data sending part to inform the receiving part that the congestion window has been reduced. When the congestion window is reduced, we send less data per time unit, to be able to cope with the total network load. • Timestamp It is the instant in which the packet has been sent. • DATA It is the DATA part of the packet, where the SMTP command are contained. The second table is called Transactions: it contains all the SMTP transactions which the system identified. What is a transaction? From my point of view, it is the SMTP communication that takes place between an SMTP client and a SMTP server: in this kind of communication there will be only one side with a port equals to 25 and the other side will have a port different from 25. We can have an almost unique1 identifier merging this information. For example: if x.x.x.45:1488 send an e-mail to SMTP server y.y.y.50:25 all the packets will contain these two quadruples: • Packets from client: x.x.x.45k1488ky.y.y.50k25 • Packets from server: y.y.y.50k25kx.x.x.45k1488 1
We say almost unique identifier because there is a low probability to have a quadruple twice in a long time run. From our point of view, WPD-pe is a experimental system, so it is sufficient an almost unique identifier
52
To obtain an almost unique identifier we can create a rule which says that an identifier has always the server couple in first place. So the identifier will be: y.y.y.50k25kx.x.x.45k1488. And the SMTP communication will be reconstructed taking the DATA packet part of packets with x.x.x.45k1488ky.y.y.50k25 or y.y.y.50k25kx.x.x.45k1488 in Timestamp order. The table is made up of the following fields: • id This is the primary key. It is made up of, as already said, server IP, port 25, client IP and client port. • Source IP It is the source IP address of the first packet. • Source Port It is the source port of the first packet. • Destination IP It is the destination IP address of the first packet. • Destination Port It is the destination port of the first packet. • Sequence Number It is the TCP sequence number of the first packet. • Timestamp It is the instant in which the first packet has been sent. • Last seen It is the instant in which the last packet has been sent. • Duration It is the connection duration. • From It is the the “MAIL FROM” value of the SMTP connection. • To It is the the “RCPT TO” value of the SMTP connection. • Reject It is set to a value different from zero if the mail has been rejected. • Quit It is set to a value different from zero if the connection ended gracefully (in a RFC compliant way). • DATA It is the sequence of SMTP command. These two tables contain a very rich set of information, information that are of great value for the following work.
53
3.2.2.2
SMTPStats
SMTPStats is the daemon which is in charge of calculating how many e-mails have been sent on the network respectively every five minutes, one hour and twenty-four hours. It takes care of reconstruct the missing time slices if there was a service interruption. For example if you have the data of five minutes flow until yesterday, but you lack the last twelve hours (due to daemon death or something similar) SMTPStats reconstruct the table setting the activity of this period to zero. The three tables structure is the same (in fact “five minutes” and “one hour” are common divisors of “twenty-four hours”). For example for the table ∆t: • Timestamp The start point (t) of the time slice examined. • Number of e-mail seen Number of e-mails seen in the period [t, t + ∆t) • Number of rejected e-mails seen Number of rejected e-mails seen in the period [t, t + ∆t) 3.2.2.3
SMTPAnalyzer
SMTPAnalyzer is the daemon which is in charge of recording the history of the SMTP network activity. It calculates a dynamic mean of the e-mails seen and of the rejected e-mails seen. I use the term dynamic to differentiate this mean from the mean we spoke of in Worm-Poacher: in the second case the mean was static, because it was calculated at the start of every month using the activity of that month, not a so accurate method it was! On the other hand, in this case I calculate the mean in the time slice X taking the previous “Period” time slices activities (see the table field). The table is made up of the following fields: • Timestamp It is the instant t in which I calculate all the other fields: it is the present day. • Period It is the historic period (in seconds) p I will analyse before t. • Mean It is the dynamic mean at the instant t (µt ). • Rejected Mean It is the dynamic rejected e-mails mean at the instant t (µRej t ). The “Period” value is set to 1209600 seconds ( fourteen days) by default. For example if it is the instant t = 1230800400 (i.e. the 9.00 a.m. of 1ST January 2009) I will calculate the five minutes expected mean by summing the five minutes means of the previous fourteen days and dividing the result by 4032 ( 1209600 = 300*4032). 54
3.2.2.4
SMTPGuardian
SMTPGuardian is the daemon which is in charge of checking the network status and of identifying anomalies: it is activated every six minutes by default. It is the daemon which calculates the baselines and which does the policeman role. Its main task are: • Checking e-mails flow It checks the actual e-mails activity and it compares it with a calculated baseline (see the following chapter for further details). • Checking rejected e-mails flow It checks the actual rejected e-mails activity and it compares it with a calculated baseline (see the following chapter for further details). • Checking from field anomalies It checks if someone on the network is using too much e-mail addresses ( at the moment it considers an host suspected if it uses six or more different e-mail addresses). Actually for the first two tasks it examines the data of the last thirty days (2592000 seconds) to take a decision, while for the from task it examines the five hundred seconds previous period. Whatever it finds on the network SMTPGuardian sends a summary e-mail to one or more administrator’s addresses.
55
Chapter 4 The analysis and results Postea noli rogare, quom impetrare nolueris. Ask not again, when you wish not to obtain. Lucilium)
(Seneca, Epistulae morales ad
In this chapter I present the analysis done and the results obtained. I have clear in mind that I’m far to have a perfect Intrusion Detection System, but I’m convinced that I have got some good results which make me confident of the future development. Some of these results have been published in [ACP08],[ACP09]. All data described in this chapter are sensitive data: due to this reason all the IP addresses and DNS name have been obfuscated through a bijective function, whose table is stored and can be consulted at IEIIT of Genoa (protocol number “IEIIT Tit: VI CI: PERSONALE N.0000013 11/01/2010”).
4.1
The scenario
All the work I have done it has been done on the network of “CNR - Area della Ricerca di Genova”: a lot of time has passed since I launched the first automatic tool on a log file and the network has undergone a lot of changes (however luckily only two of these are important to understand what I have done). The first scenario configuration can be seen in figure 4.1. In this topology there was an SMTP server (Postfix) and an Anti-virus Server: no virus/worm scanning engine was embedded in the mail-server. The e-mail originated from Internet (1) arrives at the mail-server (2 and 3), which before dispatching it sends to the AV-server (4): the anti-virus server analyses it and returns it to the mail-server. The communication between mail-server and anti-virus server is managed through SMTP. Once the e-mail has again reached the mail-server can be delivered to the recipient address.
Figure 4.1: The old CNR network It is important to notice that if you take a look to the e-mail log file you see that this email is logged twice: the first time when it reaches the e-mail server and the second one when it is sent back from the anti-virus server. This behaviour has to be taken in account when applying analysis methods. The Intrusion Detection System monitored a mail-server (identified by the fake IP address 172.16.70.21): note that there are other mail-servers in the network and these mail-server can communicate each other. Moreover to compare my results and the data gathered the network was monitored by a computer which hosted SNORT, a network intrusion detection system. All the off-line analysis has been done on log files originated from this scenario. The second scenario configuration can be seen in figure 4.2. In this topology it is adopted a new strategy for SMTP service: in fact in this kind of scenario you have the MX records of the CNR-Area domains pointing not to the real mail-server, but to a mail-gateway server: this kind of server (i.e. in the 1ST phase ESVA and in the 2N D ESDA, for more information see my work [ACMQ08]) has a lot of features that make the network more secure. In fact every mail, it receives, is checked by an anti-virus server (Sophos) updated once an hour: this is an important fact because it assures that all the worms found during analysis are zero-day worm. To circumvent spam it has SpamAssassin and Greylisting enabled. SpamAssassin is a software which uses a variety of mechanisms including header and text analysis, Bayesian filtering, DNS blocklists, and collaborative filtering databases to detect spam while a mail transfer agent which uses Greylisting temporarily rejects email from unknown senders. If the mail is legitimate, the originating server will try again to send it later according to RFC, while a mail which is from a spammer, it will probably not be retried because very
57
Figure 4.2: The two development phases of the CNR-Area network: for my purposes they are just the same, because for my analysis there is no difference between a virtual machine and a real one. At my “work level” this change is transparent.
58
few spammers are RFC compliant. The few spam sources which re-transmit later are more likely to be listed in DNSBLs and distributed signature systems (also enabled in the mail-gateway). Greylisting and SpamAssassin reduced heavily our spam percentage. To make a complete description I must add that port 25 is monitored and filtered: in fact the hosts inside CNR network cant communicate with a host outside the network on port 25 and an outsider cant communicate with an host inside the network on port 25. These restriction nullify two threats: • the first one concerns the infected hosts can become spam-zombie pc • the second one concerns the SMTP relaying misuse problem: in fact since CNR is a research institution almost all the hosts are used by a single person who detains root privileges, so she can eventually install a SMTP server. Only few of total hosts are shared among different people (students, fellow researcher etc.). In the network there is a good balancing between Linux operating systems distribution and Windows ones. The average SMTP traffic of CNR-Area network is about 350000 e-mails sent per year.
4.2
Analysis’ methods
In my analysis, I work on the global email flow in a given time interval. I use a threshold detection: if the traffic volume rises above a given threshold, the system triggers an alarm. The given threshold is calculated in a statistical way, where I determine the network normal email traffic in selected slices of time: for example I take the activity of a month and I divide the month in five minutes slices, calculating how many e-mails are normally sent in five minutes. After that, I check that the number of e-mails sent in a day during each interval doesnt exceed the threshold. I call this kind of analysis base-line analysis. Anyway my strategy is to study the temporal correlation between the present behaviour (maybe modified by the presence of a worm activity) of a given entity (pc, entire network) and its past behaviour (normal activity, no virus or worm presence). The analysis I did can be divided in four branches: 1. baseline analysis or flow analysis 2. k-baseline analysis 3. from analysis
59
4. reject analysis In the baseline analysis we calculate a threshold for the SMTP activity of working hours according to the following formula: µt + 3σt The mean and the variance are calculated for every fourteen days period, modelling the network behaviour stored in the Historic mean 5m database (the one created by SMTPAnalyzer): Pn
µt =
µt−jδt n
j=1
where p/δt = n Values are compared with the baseline threshold and if found greater than it they are marked as anomalies. To support the choice of the formula µt + 3σt as baseline value I tried other coefficients rather than 3: I called this kind of analysis k-baseline analysis. In this branch the formula becomes as follows ( k ∈ N ): µt + kσt where µRej t
Pn
=
µRej t−jδt n
j=1
where p/δt = n The “Period” value is set to 1209600 seconds ( fourteen days) by default. For example if it is the instant t = 1230800400 (i.e. the 9.00 a.m. of 1ST January 2009) I will calculate the five minutes mean by summing the five minutes means of the previous fourteen days and dividing the result by 4032 ( 1209600 = 300*4032). My hypothesis is N(µt ,σt ). Looking at the k-baseline formula, given that an infection of an host A with an IP address can be seen as a set of ordered alert events (Ei,A i=1...n), the optimum k-value, if we want to optimize the reaction time, is: min E1,IP ∀ IP K
in other words the minimum k-value chosen between the first anomalous events of every IP address. If we want to optimize the accuracy reducing the false positives the optimum k-value is: min(max Ei,IP ) ∀ IP, i = 1...n K
60
in other words the minimum k-value chosen between the maximum k-value for every IP address. The third branch is the from analysis. In fact sometimes, peaks catch from flow analysis were e-mail sent to mailing list which are bothersome hoaxes. This fact produced from analysis, where I analyze how many different e-mail address every host use: in fact an host, owned by a single person or few persons, is not likely to use a lot of different e-mail addresses in a short time and if it does so, it is highly considerable a suspicious behaviour. So I think that this analysis could be used to identify true positives, or to suggest suspect activity. Of course it isnt so straight that a worm or a virus will change from field continuously, but it is a likely event. Finally there is the rejected e-mails flow analysis: I use the same baseline method, but I filter the traffic only to reject e-mail. I do this because one typical feature of a malware is haste in spreading the infection. This haste leads the malware to send a lot of email to unknown receivers or non-existent e-mail addresses: this is a mistake that, I think, it is very important. In fact all e-mails sent to a non-existent e-mail address are rejected by the mail-server and these rejection can be tracked by logs or by sniffing the communication.
4.3
The off-line analysis
The off-line analysis has been made on the e-mail traffic of eleven C-class network in a period of 900 days: the activity time period range is from January 2004 to November 2006. It is important to understand that all this kind of analysis is an off-line one, that is when I start to analyse the data of the month I make the analysis with already all the data of the month. For example: when I start the analysis of march 2008 I take all the data of march 2008 and I calculate the threshold for march, that is not realistic (see in fact the problem with August 2005 analysis). In fact an IDS can’t foresee the future, the right approach would be to make the analysis for a given month taking the data of the previous month. Although this main concern I have, this kind of analysis turned out very useful, because it gave me a lot of hints for the future analysis. With the data at my disposal, before proceeding with the analysis, however, I pre-processed the data subtracting the mean to the values and cutting all the intervals with a negative number of e-mails, because I wanted to obfuscate the no-activity and few activity periods, not interesting for my purposes. In other words I trashed all the time slices characterized by a number of e-mail sent below the month average, with the purpose of dynamically selecting activity periods (working hours, no holidays etc). If I didnt perform this pre-processing I could have had an average which depended on night time, weekend or holidays duration. We can see the situation before and after the pre-processing operation in Fig.4.3. As you can see, the weeks and the weekends are well identified by my preprocessing: it very interesting to see that one of the weeks delimited has only four working days. This isn’t an error,on the contrary it is a confirmation that the work I did was well promising: in fact the Monday missing in 61
Figure 4.3: The preprocessing: you can see the not filtered activity in the red line and the filtered activity in blue marks
62
the middle week is the Easter Monday (a.k.a. “Luned´ı dell’Angelo”), which is an holiday day in Italy. Making another example, if you take the 2004 data you can see that e-mails sent mean in 2004, before pre-processing, was 524 in a day for 339 activity day: after data pre-processing was 773 in a day for 179 activity day. Before proceeding I must say that the analysis has been done only on the e-mails originated from CNR-Area hosts and not on the e-mails sent to CNR-Area hosts: in fact I want to know if someone in my network is infected, I don’t want to know if someone, that is an outsider, is infected. In the following sections I will show what I found during the analysis. Here follows a little summary to make the reader better understand the results. • January 2004 The global flow (GF) results are a subset of from analysis (FA) events. I observe for the first time the “Spam-forward” phenomenon. In the rejected e-mails analysis (RA) I noticed that in future I can give more importance to rejected e-mails analysis events with different recipient address. • April 2004 The intersection of GF true positives and FA true positives is not empty: FA identifies something more. The RA identifies all the GF true positives. • May 2004 Two formulas to get the optimum K value are given: one optimizes the Reaction Time, the other one the Computational Load. • November 2004 I have only one GF true positive which has a very low K value. The fact can invalidate the K-value optimization (if the value is too low we can’t optimize raising the k-value and cutting off a lot of false positives): however this true positive has been identified by FA, so even if I raise the k-value I have no accuracy loss. • August 2005 I have very few GF true positives, because of a big infection in the network. I discuss in a deeper way the above k-value concepts.
4.3.1
January 2004
January 2004 activity is characterized by some anomalies both in the global e-mail flow both in from analysis. The activity days are in total fifteen. In fig. 4.4, there are eleven anomalies in five minutes analysis, three in one hour analysis and zero in twenty-four hours analysis. You can see these anomalies in tables 4.1 and 4.2. Reading the logs only the red ones were actually symptoms of a virus infection. I identified two infected hosts: x.x.6.24 and the x.x.4.24. The same two hosts are identified in the from analysis (Fig. 4.5). Moreover taking a look to the from usage you can see that there other six hosts that behave in strange ways: 63
Date Fri 16 Jan 2004 17:40:15 Mon 19 Jan 2004 11:5:15 Tue 20 Jan 2004 10:40:15 Tue 20 Jan 2004 10:55:15 Thu 22 Jan 2004 14:50:15 Tue 27 Jan 2004 13:15:15 Tue 27 Jan 2004 13:25:15 Wed 28 Jan 2004 9:55:15 Wed 28 Jan 2004 18:0:15 Thu 29 Jan 2004 10:30:15 Thu 29 Jan 2004 17:5:15
# e-mails 41 49 41 175 35 63 44 55 56 42 77
K K= K= K= K= K= K= K= K= K= K= K=
Host 3.89 4.79 3.89 19.08 3.21 6.38 4.23 5.47 5.59 4.00 7.97
172.16.74.24, 172.16.76.24 172.16.74.24, 172.16.76.24 172.16.75.201 172.16.75.201
Table 4.1: January 2004 Global flow activity 5 minutes Date Tue 20 Jan 2004 10:20:15 Tue 27 Jan 2004 13:20:15 Thu 29 Jan 2004 10:20:15
# e-mails 243 245 218
K Infected % K = 3.54 K = 3.58 79,6% K = 3.04 74,8%
Table 4.2: January 2004 Global flow activity 1 hour 1. 172.16.71.23 A mail-server. 2. 172.16.73.244 A simple host. 3. 172.16.74.29 A mail-server. 4. 172.16.74.24 The infected host already identified by e-mails flow analysis. 5. 172.16.75.201 An infected host already identified in the e-mails flow analysis. 6. 172.16.76.24 The infected host already identified by e-mails flow analysis. 7. 172.16.147.33 An infected host not identified in the e-mails flow analysis. 8. 172.16.78.151 A mail-server The three mail-servers identified forward some e-mail accounts to the mail-server I monitor, but taking a look at the transactions I could recognize two different behaviours, a normal one and an anomalous one. In fact 172.16.74.29 and 172.16.78.151 are victims of a lot of spam e-mails, that are forwarded to the mail-server monitored: this fact will be much more evident in the following subsections for 172.16.74.29. 172.16.71.23 isn’t a spam victim (its 64
from usage will be almost the same in the future months, too). For future use I’ll call this phenomenon “Spam-forward”.
In this month the total amount of rejected e-mails is 637: • 3 e-mails originated from 172.16.79.175 on 19 January 2004: same mistaken user in recipient field. • 17 e-mails originated from 172.16.79.157 on 19 January 2004: same mistaken user in recipient field. • 2 e-mails originated from 172.16.78.151 on 28 January 2004. • 56 e-mails originated from 172.16.77.33 on 28 January 2004: different mistaken users in recipient field. • 1 e-mail originated from 172.16.76.47 on 22 January 2004 • 7 e-mails originated from 172.16.76.40 on 23 January 2004: same mistaken user in recipient field. • 162 e-mails originated from 172.16.76.24 on 27 January 2004: different mistaken users in recipient field. • 236 e-mails originated from 172.16.75.201 on 29 January 2004: different mistaken users in recipient field. • 126 e-mails originated from 172.16.75.201 on 28 January 2004: different mistaken users in recipient field. • 18 e-mails originated from 172.16.74.24 on 27 January 2004: different mistaken users in recipient field. • 4 e-mails originated from 172.16.73.244 on 27 January 2004: same mistaken user in recipient field. • 1 e-mail originated from 172.16.73.244 on 21 January 2004 • 1 e-mail originated from 172.16.71.94 on 20 January 2004 • 2 e-mails originated from 172.16.71.23 on 26 January 2004 • 1 e-mail originated from 172.16.70.12 on 29 January 2004 65
1600 January 2004 - 24 hours baseline = 1650
1400 1200 1000 800 600 400 200
1.0744
1.0746
200
1.0748
1.0750 9 x10
1.0752
1.0754
1.0756
1.0752
1.0754
1.0756
1.0752
1.0754
1.0756
January 2004 - 1 hour baseline = 215
150
100
50
0 1.0744
1.0746
1.0748
1.0750 9 x10
January 2004 - 5 minutes baseline = 33
150
100
50
0 1.0744
1.0746
1.0748
1.0750 9 x10
Figure 4.4: January 2004 baseline analysis 66
Figure 4.5: January 2004 sender addresses
67
If you take a look to the previous list you’ll see that the rejected e-mails analysis is the specular of the from analysis if you look at it from a certain point of view: in fact all the infected host produce a lot of rejected e-mails, these e-mails have always different recipient addresses. However, I can’t do a recipient address analysis because the events described are only a subset of the “event with different recipient addresses”: in fact in the set you can find the symptoms of an infection, but the mailing lists, too. With regard to k-baseline analysis the results say that the value three was a good choice for 1 hour analysis in fact the minimum k of all the one hour alerts was actually three as you can see in the previous table 4.2 while the value four would be better for five minutes analysis ( Table 4.1).
68
4.3.2
April 2004
Date Fri 9 Apr 2004 16:33:30 Tue 20 Apr 2004 13:33:30 Thu 22 Apr 2004 8:33:30 Wed 28 Apr 2004 14:58:30 Wed 28 Apr 2004 15:3:30 Wed 28 Apr 2004 15:53:30 Wed 28 Apr 2004 15:58:30 Thu 29 Apr 2004 9:8:30 Thu 29 Apr 2004 9:13:30 Thu 29 Apr 2004 9:18:30 Thu 29 Apr 2004 9:23:30 Thu 29 Apr 2004 9:28:30 Thu 29 Apr 2004 9:33:30 Thu 29 Apr 2004 9:38:30 Thu 29 Apr 2004 9:43:30 Thu 29 Apr 2004 10:3:30 Thu 29 Apr 2004 11:53:30 Thu 29 Apr 2004 12:3:30
# e-mails 108 82 156 87 181 80 204 270 127 154 289 350 100 138 65 60 62 60
K K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K=
Host 6.49 4.82 9.57 5.14 11.18 4.69 12.65 16.89 7.71 9.44 18.11 22.02 5.98 8.42 3.73 3.41 3.54 3.41
172.16.77.20 172.16.77.20, 172.16.77.20, 172.16.77.20, 172.16.77.20, 172.16.77.20, 172.16.77.20, 172.16.77.20, 172.16.77.20, 172.16.77.20, 172.16.77.20,
172.16.75.216 172.16.75.216 172.16.75.216 172.16.76.36 172.16.76.36 172.16.76.36 172.16.76.36 172.16.76.36 172.16.76.36 172.16.76.36
Table 4.3: April 2004 Global flow activity 5 minutes April 2004 is one of the most representative month I analysed: during this month I found two important infections with global flow e-mails analysis and one infection with from analysis. Moreover the mail-server due to the previous two infection went down: anyway let’s proceed in order. In the global flow analysis I found eighteen alerts: three of these (the first three in the table 4.3) are false positives, while the last four of the table are, in some ways, fallacious, because, after the “e-mails storm” the server had, the log becomes full of internal transactions (i.e. 2004-04-29 11:55:38|local |mailserver ||||0 |2898 ) that anyway have been counted by the IDS. The e-mails storm of the red highlighted five minutes alerts have been caused by three different host: 172.16.77.20 (activity on 28 April and on 29 April), 172.16.75.216 (activity on 28 April) and 172.16.76.36 (activity on 28 April and on Date Wed 28 Apr 2004 15:3:30 Thu 29 Apr 2004 9:3:30
# e-mails K Infected % 546 K = 3.81 91% 1645 K = 12.61 75%
Table 4.4: April 2004 Global flow activity 1 hour 69
3500 3000 April 2004 - 24 hours baseline = 3644
2500 2000 1500 1000 500
1.0810
1.0815
1.0820 9 x10
1.0825
1.0830
1600 1400 April 2004 - 1 hour baseline = 444
1200 1000 800 600 400 200 0 1.0810
1.0815
1.0820 9 x10
1.0825
1.0830
500 400 300
April 2004 - 5 minutes baseline = 53
200 100 0 1.0810
1.0815
1.0820 9 x10
1.0825
1.0830
Figure 4.6: April 2004 baseline analysis 70
Figure 4.7: April 2004 sender addresses
71
29 April). It should be noted that if you look at the one hour alert table and in particular at the infected e-mails percentage you’ll notice that in the first one only the 9% of e-mails sent was legitimate traffic (i.e. 148 e-mails, however a quite big number if considered alone, but quite little if you consider that the one hour baseline was 444). In the from analysis eight hosts can be identified: 1. 172.16.71.23 A mail-server. 2. 172.16.72.126 An infected host not identified in the e-mails flow analysis. 3. 172.16.74.29 A mail-server, victim of a spam attack (see 4.3.1). 4. 172.16.75.216 The infected host already identified by e-mails flow analysis. 5. 172.16.76.36 The infected host already identified by e-mails flow analysis. 6. 172.16.77.20 The infected host already identified by e-mails flow analysis. 7. 172.16.78.151 A mail-server. 8. 172.16.158.36 A simple shared host. It is very interesting to see that the three hosts, already identified in the global flow analysis, have a common logic in building the different from uses. Here in Table 4.5 follows an excerpt of the recipient addresses used (for privacy reasons I can show you only some of these): In this month the total amount of rejected e-mails is 1293: as you can see in the Fig. 4.8 the most part of the traffic happened on 28th and 29th April 2004. On 28th there is a peak
Figure 4.8: April 2004 rejected e-mails flow of 422 rejected e-mails: 72
172.16.75.216 5.2.1.1.0.20031215150323.00b503d0@censored 5.2.1.1.0.20031217152338.00b3ccd0@censored 5.2.1.1.0.20031217182413.00bb40f8@censored 5.2.1.1.0.20031219140705.00b4ee68@censored 5.2.1.1.0.20031219141552.00b3d888@censored 5.2.1.1.0.20031222125138.00b40ea0@censored 172.16.76.36 3.0.5.32.20020321102715.0085bbe0@censored2 3.0.5.32.20020405101820.00863c00@censored2 3.0.5.32.20020405102025.00867980@censored2 3.0.5.32.20020418194802.00879100@censored2 172.16.77.20 3.0.1.32.19990707115450.006a688c@censored3 3.0.1.32.19991102145524.006af018@censored3 3.0.1.32.19991209144913.006babd8@censored3 3.0.1.32.19991224144833.006bcda0@censored3 3.0.1.32.20000322180857.006cc888@censored3 Table 4.5: The common logic in building the recipient address during the infection of 28th and 29th April 2004. The malware adapts itself: the three MX domains (i.e. censored, censored2 and censored3) are different, according to the information found on the infected host. • 186 e-mails have been sent by 172.16.75.216: different mistaken users in recipient field. • 230 e-mails have been sent by 172.16.77.20: different mistaken users in recipient field. On 29th there is a peak of 813 rejected e-mails: • 574 e-mails have been sent by 172.16.76.36: different mistaken users in recipient field. • 233 e-mails have been sent by 172.16.77.20: different mistaken users in recipient field. With regard to k-baseline analysis the results say that the value three was a good choice for 1 hour analysis in fact the minimum k of all the one hour alerts was actually three as you can see in the previous table 4.4 while the value four or five would be better for five minutes analysis ( Table 4.3).
73
4.3.3
May 2004 Date Mon 3 May 2004 9:55:41 Mon 3 May 2004 11:55:41 Mon 3 May 2004 13:55:41 Tue 4 May 2004 12:5:41 Tue 4 May 2004 12:10:41 Tue 4 May 2004 13:15:41 Tue 4 May 2004 13:20:41 Tue 4 May 2004 13:25:41 Wed 5 May 2004 10:25:41 Wed 5 May 2004 14:0:41 Fri 7 May 2004 9:0:41 Fri 7 May 2004 14:40:41 Fri 7 May 2004 15:0:41 Fri 7 May 2004 15:20:41 Fri 7 May 2004 15:35:41 Mon 10 May 2004 10:25:41 Mon 10 May 2004 12:10:41 Mon 10 May 2004 15:10:41 Tue 11 May 2004 16:10:41 Wed 12 May 2004 16:45:41 Thu 13 May 2004 16:20:41 Fri 14 May 2004 17:35:41 Fri 14 May 2004 17:40:41 Fri 14 May 2004 17:45:41 Thu 20 May 2004 18:55:41 Mon 24 May 2004 9:35:41 Wed 26 May 2004 12:30:41 Wed 26 May 2004 13:5:41 Thu 27 May 2004 14:15:41 Thu 27 May 2004 15:5:41
# e-mails 31 46 44 134 95 27 35 30 26 29 28 28 43 27 27 27 27 46 68 32 34 73 43 93 53 36 42 29 68 26
K K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K= K=
Host 3.90 6.16 5.86 19.44 13.56 3.29 4.50 3.75 3.14 3.60 3.44 3.44 5.71 3.29 3.29 3.29 3.29 6.16 9.48 4.05 4.35 10.24 5.71 13.25 7.22 4.65 5.56 3.60 9.48 3.14
172.16.75.158 172.16.75.158 172.16.75.158 172.16.75.158 172.16.75.158
Table 4.6: May 2004 Global flow activity 5 minutes May 2004 activity is quite strange: I have a lot of false positive alarms in the global flow analysis, however the system identifies one true positive on 4th May caused by 172.16.75.158. It is worth mentioning that the third red 1 hour alert is made up of 88% local mailserver transactions, because the mail-server went down. Moreover the “one hour” and “five min74
May 2004 - 5 minutes baseline = 25
120 100 80 60 40 20 0 1.0835
1.0840
1.0845
1.0850
1.0855
1.0860
9
x10
May 2004 - 1 hour baseline = 127
200
150
100
50
0 1.0835
1.0840
1.0845
1.0850
1.0855
9
x10
800
May 2004 - 24 hours baseline = 877
600
400
200
1.0835
1.0840
1.0845
1.0850
1.0855
9
x10
Figure 4.9: May 2004 baseline analysis 75
1.0860
Date Tue 4 May 2004 11:10:41 Tue 4 May 2004 12:10:41 Tue 4 May 2004 13:10:41 Fri 14 May 2004 17:10:41 Wed 26 May 2004 12:10:41 Thu 27 May 2004 14:10:41
# e-mails 211 147 172 230 130 145
K K= K= K= K= K= K=
5.99 3.69 4.59 6.67 3.07 3.61
Infected % 59% 63% 88% local 80%
Table 4.7: May 2004 Global flow activity 1 hour 172.16.75.158 5.0.2.1.0.20040402191946.01da1570@censored1 5.1.0.14.0.20040220144953.009ec450@censored3 5.1.0.14.0.20040304143235.00a6b980@censored2 5.1.0.14.2.20040219135917.00acbda0@censored4 5.1.1.6.0.20040331112914.009f4120@censored5 Table 4.8: Found an old friend in this infection. utes” alerts of fourteenth May (i.e. the green ones), which show a high K values, have been caused by 172.16.76.24, an host that sent an e-mail to a quite big mailing list. In the from analysis four hosts can be identified: 1. 172.16.71.23 A mail-server. 2. 172.16.74.29 A mail-server, victim of a spam attack (see 4.3.1). 3. 172.16.75.158 The infected host already identified by e-mails flow analysis. 4. 172.16.78.132 A mail-server. Taking a look at the recipient addresses used by 172.16.75.158 I found that it was infected by the same malware of the previous month ( see Table 4.8). In this month the total amount of rejected e-mails is 191 (see Fig. 4.11): 107 have been sent on 4th May (106 by 172.16.75.158). With regard to k-baseline analysis the results say that the value three was too low for the “five minutes” analysis, in fact I have a lot of false positives. At this point I can say that a value of five, I suppose, can fit better than a value of three for “five minutes” analysis. In fact if you take a look at the first “five minutes” alert of every malicious event per host you can say that, even if I raise the k value to five, these events would have been triggered anyway (see table 4.9).
Figure 4.10: May 2004 sender addresses
Figure 4.11: May 2004 rejected e-mails flow
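The sender address usage plotted in Fig. 4.10 is, at its core, a count of how many different From addresses each internal host used over the period. As a purely illustrative sketch (the tab-separated input format below is an assumption made for the example, not the actual Worm-Poacher record layout), such a count could be obtained from the parsed log records as follows:

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative only: count the distinct sender addresses used by every
# client host. Input records are assumed to be tab-separated lines of
# the form "client_ip<TAB>sender_address", one line per delivered e-mail.
my %senders_of;    # client ip -> { sender address -> 1 }

while (my $line = <STDIN>) {
    chomp $line;
    my ($client, $sender) = split /\t/, $line;
    next unless defined $sender and length $sender;
    $senders_of{$client}{ lc $sender } = 1;
}

# Hosts using many different sender addresses are the suspicious ones: a
# mass-mailer infection forges a new From address for almost every message.
for my $client (sort { keys %{ $senders_of{$b} } <=> keys %{ $senders_of{$a} } }
                keys %senders_of) {
    printf "%-15s %d different sender addresses\n",
           $client, scalar keys %{ $senders_of{$client} };
}

In a plot like Fig. 4.10 the legitimate mail-servers stand out with a steady, moderate number of addresses, while an infected host shows a sudden spike of forged senders.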
Event date          Host                          first k-value
Tue 27 Jan 2004     172.16.74.24, 172.16.76.24    K = 6.38
Wed 28 Jan 2004     172.16.75.201                 K = 5.59
Wed 28 Apr 2004     172.16.77.20                  K = 5.14
Wed 28 Apr 2004     172.16.75.216                 K = 11.18
Thu 29 Apr 2004     172.16.76.36                  K = 16.89
Tue  4 May 2004     172.16.75.158                 K = 19.44
Table 4.9: The first k-value found in the first “five minutes” event triggered by a host.
4.3.4
August 2004

Date                        # e-mails   K           Host
Mon  2 Aug 2004  9:46:15    44          K = 10.17
Mon  2 Aug 2004 14:26:15    22          K = 4.81
Mon  2 Aug 2004 14:31:15    21          K = 4.57
Tue  3 Aug 2004 12:31:15    18          K = 3.84
Wed  4 Aug 2004  9:26:15    21          K = 4.57
Wed  4 Aug 2004 10:16:15    15          K = 3.10
Wed  4 Aug 2004 11:16:15    19          K = 4.08
Mon  9 Aug 2004  9:21:15    22          K = 4.81
Mon  9 Aug 2004 16:16:15    20          K = 4.32
Tue 10 Aug 2004 14:21:15    16          K = 3.35
Thu 12 Aug 2004  9:31:15    19          K = 4.08
Thu 12 Aug 2004 11:01:15    17          K = 3.59
Thu 12 Aug 2004 11:06:15    39          K = 8.95
Thu 12 Aug 2004 11:11:15    57          K = 13.34
Fri 13 Aug 2004  9:46:15    19          K = 4.08
Fri 13 Aug 2004 15:16:15    15          K = 3.10
Tue 17 Aug 2004 10:36:15    15          K = 3.10
Tue 17 Aug 2004 13:51:15    22          K = 4.81
Tue 17 Aug 2004 14:51:15    41          K = 9.44
Tue 17 Aug 2004 14:56:15    24          K = 5.30
Tue 17 Aug 2004 15:51:15    17          K = 3.59
Thu 19 Aug 2004 17:36:15    23          K = 5.05
Mon 23 Aug 2004  9:51:15    19          K = 4.08
Mon 23 Aug 2004 17:06:15    39          K = 8.95
Wed 25 Aug 2004 11:26:15    45          K = 10.42
Thu 26 Aug 2004 17:01:15    19          K = 4.08
Thu 26 Aug 2004 17:06:15    18          K = 3.84
Mon 30 Aug 2004 10:21:15    22          K = 4.81
Mon 30 Aug 2004 15:36:15    21          K = 4.57
Tue 31 Aug 2004 10:36:15    16          K = 3.35
Tue 31 Aug 2004 14:51:15    135         K = 32.36   172.16.73.234
Tue 31 Aug 2004 17:46:15    16          K = 3.35    172.16.73.234
Table 4.10: August 2004 Global flow activity 5 minutes

August 2004 activity shows a lot of false positives, but only one true positive. The flow is quite low, because August is usually a holiday month (see the global flow in Fig. 4.12).
Figure 4.12: August 2004 baseline analysis (panels: 5 minutes, baseline = 14; 1 hour, baseline = 79; 24 hours, baseline = 550)
Date                        # e-mails   K          Infected %
Mon  2 Aug 2004  9:41:15    88          K = 3.46
Thu 12 Aug 2004 10:41:15    148         K = 6.61
Tue 31 Aug 2004 14:41:15    149         K = 6.67   91%
Table 4.11: August 2004 Global flow activity 1 hour

The true positive was raised by 172.16.73.234 on the 31st of August. It can be seen that between the two alerts (we are speaking of the five minutes analysis) there is a gap: I suppose it was caused by the infected PC being switched off and then on again. The host identified in the global flow analysis, however, shows a perfectly normal behaviour in the “From” analysis, as you can see in Fig. 4.13. In the From analysis three hosts can be identified:

1. 172.16.71.23 A mail-server.
2. 172.16.74.29 A mail-server, victim of a spam attack (see 4.3.1).
3. 172.16.78.151 A mail-server.

The total amount of rejected e-mails in this month is 180: 94 rejected e-mails were sent by the host identified in the global flow analysis (172.16.73.234, Fig. 4.14). With regard to the k-baseline analysis, the results say that the value three is still too low.
Figure 4.13: August 2004 sender addresses
Figure 4.14: August 2004 rejected e-mails flow
4.3.5
November 2004

Date                        # e-mails   K          Host
Tue  2 Nov 2004  9:18:45    44          K = 3.07
Tue  2 Nov 2004 11:13:45    58          K = 4.21
Thu 11 Nov 2004 17:13:45    53          K = 3.80
Fri 12 Nov 2004 10:18:45    118         K = 9.08
Fri 12 Nov 2004 14:18:45    65          K = 4.77
Fri 12 Nov 2004 14:43:45    449         K = 35.95
Thu 18 Nov 2004 16:48:45    178         K = 13.95
Tue 23 Nov 2004 10:38:45    93          K = 7.05
Tue 23 Nov 2004 11:38:45    48          K = 3.39
Thu 25 Nov 2004 12:03:45    55          K = 3.96
Thu 25 Nov 2004 12:08:45    65          K = 4.77
Mon 29 Nov 2004 19:28:45    114         K = 8.75

Hosts identified: 172.16.73.101, 172.16.73.200, 172.16.73.252, 172.16.75.123
Table 4.12: November 2004 Global flow activity 5 minutes

In the November 2004 activity you can see that there are a lot of false positives in the global flow analysis and that the only real alert has a very low K value (172.16.75.123): so I can say that what I said before about raising the K value still holds true, because in a set full of false positives a true positive with such a low value is likely to go unnoticed. The same host, however, is well identified by the From analysis (see Fig. 4.16). Moreover you can see that the mail-server 172.16.74.29 continues to be a spam victim. On the other hand, the reject analysis shows no anomalies during this month (see Fig. 4.17).
Date                        # e-mails   K          Infected %
Fri 12 Nov 2004 14:43:45    501         K = 9.96
Thu 18 Nov 2004 16:43:45    216         K = 3.52
Tue 23 Nov 2004 10:43:45    197         K = 3.10   49%

Table 4.13: November 2004 Global flow activity 1 hour
Figure 4.15: November 2004 baseline analysis (panels: 5 minutes, baseline = 43; 1 hour, baseline = 192; 24 hours, baseline = 1195)
Figure 4.16: November 2004 sender addresses
Figure 4.17: November 2004 rejected e-mails flow
4.3.6
August 2005
August 2005 activity is the strangest case I analysed. The global flow analysis is useless because I had the problem explained at the start of Section 4.3: doing the analysis a posteriori with all the data of the whole month, the extraordinary activity that happened on the 22nd of August put the global flow out of scale, reducing the number of alerts (Fig. 4.18); however, as you can see, if you zoom in on the day taken into account things get clearer (Fig. 4.19). In fact you can notice that the first anomalous activities started at about noon with some peaks; after that there is a silence period, and thereafter you can see the complete outbreak. The From analysis highlights very well the anomalous host 172.16.80.10, which uses 12486 different sender addresses. Moreover the reject analysis shows 136447 rejected e-mails sent during this month: 136324 rejected e-mails were sent on the 22nd of August by 172.16.80.10.
Figure 4.18: August 2005 baseline analysis (panels: 5 minutes, baseline = 17047; 1 hour, baseline = 146247; 24 hours, baseline = 751187)
Figure 4.19: 22nd August 2005 e-mails flow
Figure 4.20: August 2005 sender addresses
Figure 4.21: August 2005 rejected e-mails flow
I notice that during 2005 and 2006 the global flow analysis did not raise true positives, except for August 2005: this phenomenon can be explained by a too low value of the K factor, as previously said. However, during these two years I noticed a rise of the “Spam-forward” phenomenon I spoke about in 4.3.1: the 172.16.74.29 server reached peaks of thousands of different sender addresses in a month (see for example the From usage for 2005 in Fig. 4.22). With regard to rejected e-mails, I can say that the activity for 2005 and 2006 gave me few alerts and these alerts had a very low value: I think that a value of five for the k factor would be better here, too. For example in 2006 I had 102 alerts (roughly one alert every four days) for the rejected e-mails flow, as you can see in Fig. 4.23. The gain of this change is about 75% with no loss (with zero true positives in every case the loss would be zero): in fact the total number of alerts would change from 101 to 25 (roughly one alert every two weeks). Moreover I can extend this reasoning to the global flow. In fact, let's take the 2004 activity: if I raise the K value for the global flow analysis (5m) to five, the number of alerts changes from 255 to 100, with a gain of 60% and a loss of 0%: all the anomalous events are identified anyway, even the November 2004 one, because it is caught by the From analysis. Analysing the results, I can say that I can take K = 5 for the five minutes global flow analysis (gain 60%, loss 0%) and K = 3 for the one hour global flow analysis (gain 0%, loss 0%).
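The gain and loss figures quoted above can be reproduced mechanically: given the K value of every alert raised with the original threshold and a flag saying whether the alert belongs to a real anomalous event, the gain is the fraction of alerts suppressed by the higher threshold and the loss is the fraction of true positives suppressed together with them. A minimal sketch in Perl (the language the system is written in) follows; the alert list is made of invented placeholder values, not the real 2004 data.

#!/usr/bin/perl
use strict;
use warnings;

# Invented placeholder alerts: each entry carries the K value of the alert
# and a flag telling whether it belongs to a real anomalous event.
my @alerts = (
    { k => 12.3, true_positive => 1 },
    { k => 6.1,  true_positive => 1 },
    { k => 4.2,  true_positive => 0 },
    { k => 3.4,  true_positive => 0 },
    { k => 3.1,  true_positive => 0 },
);

sub evaluate_threshold {
    my ($alerts, $k_min) = @_;
    my @kept     = grep { $_->{k} >= $k_min } @$alerts;
    my $tp_total = grep { $_->{true_positive} } @$alerts;
    my $tp_kept  = grep { $_->{true_positive} } @kept;
    my $gain = @$alerts  ? 1 - @kept / @$alerts     : 0;  # fraction of alerts suppressed
    my $loss = $tp_total ? 1 - $tp_kept / $tp_total : 0;  # fraction of true positives lost
    return ($gain, $loss, scalar @kept);
}

for my $k_min (3, 5) {
    my ($gain, $loss, $kept) = evaluate_threshold(\@alerts, $k_min);
    printf "K >= %d: %d alerts kept, gain %.0f%%, loss %.0f%%\n",
           $k_min, $kept, 100 * $gain, 100 * $loss;
}

Run over the real 2004 alert list, the same computation gives the 255 to 100 reduction (gain 60%, loss 0%) reported above for the five minutes analysis.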
Figure 4.22: The different sender address usage for 172.16.74.29 during 2005.
Figure 4.23: The different values of coefficient K in the 2006 false alerts for the rejected analysis.
Figure 4.24: The different values of coefficient K in 2004 for global flow analysis.
4.4
The on-line analysis
The on-line analysis has been made on all the e-mail traffic of the CNR-Area network over a period of 75 days: the activity time period ranges from 4th Mar 2009 to 19th May 2009. Some days are missing because during the analysis period the network underwent maintenance works that required WPD-pe to be switched off. This new analysis has some special features that I have to mention:

• the network traffic analysed is the traffic of all the eleven class C networks which the CNR-Area is made up of;
• the network traffic analysed is the outgoing and incoming SMTP traffic, so the local deliveries (i.e. e-mail traffic between internal users) are not monitored;
• I couldn't always get the sender address, and so apply the From analysis, because during E-SMTP transactions the communication can be encrypted;
• I couldn't always get the rejected status, and so apply the Reject analysis, because during E-SMTP transactions the communication can be encrypted;
• the new security features of the network and the new malware trends didn't help the research: I have no alerts during this analysis period. In fact the anti-virus and anti-spam mail-gateway (172.16.71.80) stopped the incoming viruses and spam very well; moreover the new malware trends point to hitting media storage (USB keys and media cards) and shared folders, so no new nasty mass-mailer worm reached the monitored network;
• with regard to the previous point, I can say that maybe, positioning WPD-pe in another point of the network, for example in front of several internal mail-servers, I could have obtained more interesting results.

The activity, which is described in the tables below, is quite linear and there are no strange peaks: although this result could seem disappointing at first, I can say that it is very promising. In fact the traffic went as expected during the monitored period, a period which had no infections at all (no tickets were issued to the “CNR-Area di Genova” HelpDesk). The graphs, especially the April one, the most complete, show quite good linearity both for the global flow analysis and for the rejected flow analysis. Concerning the From analysis, the results are even better from my point of view. In fact the Sniffing module has been located just before the border router, so I have no trace of the “Spam forward” phenomenon which affected the internal mail-servers. In the From graph the mail-servers are well evident and, above all, the mail-gateway is the most evident. In fact you can see the following IP addresses as simple mail-servers:
• 172.16.70.21 This is the mail-server monitored during the off-line analysis: the log files analysed were from this host.
• 172.16.71.23 An internal mail-server.
• 172.16.71.80 The mail-gateway: it is quite evident that most of the SMTP load of the network is its duty.
• 172.16.74.29 The internal mail-server which was the main actor of the “Spam forward” phenomenon.
• 172.16.125.3 censored44, no information available.
• 172.16.76.40 censored55, no information available.
• 172.16.76.55 censored66, no information available.
• 172.16.77.72 censored77, no information available.
• 172.16.78.151 An internal mail-server.

Here the tables of these three months' activity follow:
Figure 4.25: March 2009 global flow, five minutes slices
Figure 4.26: April 2009 global flow, five minutes slices
Figure 4.27: May 2009 global flow, five minutes slices
Figure 4.28: March 2009 rejected flow, five minutes slices
Figure 4.29: April 2009 rejected flow, five minutes slices
Figure 4.30: May 2009 rejected flow, five minutes slices
Figure 4.31: March-May 2009 different sender address usage on CNR-Area network
4.5
Conclusions
The results obtained are somewhat encouraging: in fact, although the system is far from being perfect, I have identified a lot of good features to characterize an anomalous SMTP activity.

• Baseline analysis The baseline analysis shows good results, but it doesn't catch all the true positives: it can be useful in identifying anomalous SMTP activities, but this approach needs to be integrated with the other methods (From and Reject), because on its own it lacks a complete vision of the SMTP activity.
• K-value Using the From analysis and the Rejected analysis together with the Global Flow analysis it is possible to optimize the k-value with no loss and a boost in accuracy. In fact, with the 5 minutes k-value optimization we have an increase of accuracy: if we select k equal to five, we have a loss of accuracy equal to zero percent and a reduction in alarms from 255 to 100, with a total gain of sixty percent. On the other hand we have seen that the choice of k equal to three has been the best for the 1 hour analysis.
• Good performance and adaptivity My greatest concern was the fact that packet loss usually occurs during sniffing operations because of network congestion: in this respect the on-line system can monitor the traffic in real time with no data loss and can modify the baseline adaptively (an illustrative sketch of one possible adaptive baseline update is given at the end of this section). The system hasn't got any computational load problem, even though the whole system has been developed in PERL, which is not a high performance programming language, and it ran on an old computer (a P4 with only 1 GB of RAM). This fact should not be underestimated, as currently the largest manufacturers of IDS appliances are focusing their efforts on dedicated hardware to get better performance: if you take this factor into account, then the system may yield surprising results when calibrated on ad hoc hardware.
• Scalability Positioning the system is a critical point if you want to get the system working and performing well: the daemon-based architecture can be very useful, because different daemons can be distributed in a network, communicating with each other to get a good picture of what is going on.

I think that the ideas found work quite well; however, during my research, I thought about how to improve the system. All my observations can be read in the following subsection.
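As an illustration of the adaptivity mentioned above, the sketch below keeps a per-slice counter of delivered e-mails and refreshes the baseline with an exponential moving average whenever a slice closes. It is only an assumption made for illustration: the real WPD-pe baseline and the real definition of K are the ones described in Chapter 3 and Section 4.2; the smoothing factor and the simplified K formula used here are placeholders.

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative only: a 5-minute e-mail counter with an exponentially
# smoothed baseline. The smoothing factor, the seed baseline and the
# simplified K formula are assumptions, not the values used by WPD-pe.
my $slice_seconds = 300;
my $alpha         = 0.1;   # smoothing factor (assumed)
my $k_threshold   = 5;     # K value chosen in Section 4.3
my $baseline      = 25;    # seed value, e.g. taken from past traffic

my ($slice_start, $count) = (0, 0);

# The sniffing daemon would call this for every observed SMTP delivery.
sub handle_mail_event {
    my ($timestamp) = @_;
    $slice_start ||= $timestamp;
    if ($timestamp - $slice_start >= $slice_seconds) {
        close_slice();
        ($slice_start, $count) = ($timestamp, 0);
    }
    $count++;
}

sub close_slice {
    my $k = $baseline > 0 ? $count / $baseline : 0;   # placeholder for the real K statistic
    printf "ALERT: K = %.2f with %d e-mails\n", $k, $count if $k >= $k_threshold;
    # Adapt the baseline; a production system might skip anomalous slices
    # here, so that an outbreak does not inflate the notion of "normal".
    $baseline = (1 - $alpha) * $baseline + $alpha * $count;
}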
4.5.1
WPD-pe: more work to do?
As mentioned, all the work done gave rise to a lot of new ideas for future development. At this stage I can identify seven different points to develop:

• Adding DNS to the analysis I think that every event bound to some service can be seen as a set of sub-events bound to given protocols. In my case the exchange of e-mails over the Internet (my event) can be made up of two sub-events bound to two protocols: SMTP and DNS. From these premises the idea of analysing DNS traffic was born: moreover DNS traffic has the advantage that at present it isn't encrypted (as E-SMTP can be), and so some sort of analysis can be done on DNS queries. With regard to SMTP transactions, I think it can be very useful to analyse the DNS MX queries: a threshold analysis like the one done for the global SMTP traffic could be done and could turn out to be interesting (a small illustrative sketch follows this list).
• Use of the new K-value Adopting the K-value results found in this work can have a good effect on the whole performance of the system.
• Rejected e-mails analysis optimization In the Rejected E-mails Analysis it can be useful to give more importance to anomalous events which involve a lot of different recipient addresses (i.e. rejected e-mails with different recipient addresses).
• A user friendly system To improve the system, a simple module could be added which informs the owner of a given host via e-mail about the different sender addresses used from their terminal during the last week.
• Global flow e-mails analysis optimization To better discern the false positives caused by mailing lists, a “Size” analysis could be added: in fact an e-mail sent to a mailing list always has the same size.
• Thwarting the Spam Forward phenomenon To better identify the internal mail-servers, a baseline based upon the number of authorized users could be used: this can be useful for those mail-servers that do not present the “Spam forward” phenomenon.
• Interaction with other systems The introduction of a standard formalism for intermediate data normalization (such as IDMEF [HDF07]) would add a useful feature to the system: in fact this would favour the interchangeability of different tools and technologies, especially in view of future extensions.
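The DNS extension proposed in the first point could reuse the same sliding-window logic already applied to the SMTP flow. The fragment below is a purely illustrative sketch of that idea (the module does not exist yet): it reads query records from standard input in an assumed tab-separated format (epoch time, client IP, query type, queried name), as they could be produced by a separate capture step, and flags hosts that issue too many MX queries in a five-minute slice. The threshold is a placeholder to be tuned like the K factor.

#!/usr/bin/perl
use strict;
use warnings;

my $slice_seconds = 300;
my $mx_threshold  = 50;    # placeholder value, to be tuned like the K factor

my %mx_count;              # client host -> MX queries in the current slice
my $slice_start;

while (my $line = <STDIN>) {
    chomp $line;
    # Assumed record format: epoch <TAB> client ip <TAB> query type <TAB> name
    my ($epoch, $host, $qtype, $name) = split /\t/, $line;
    next unless defined $qtype and $qtype eq 'MX';

    $slice_start //= $epoch;
    if ($epoch - $slice_start >= $slice_seconds) {
        report_slice($slice_start);
        %mx_count    = ();
        $slice_start = $epoch;
    }
    $mx_count{$host}++;
}
report_slice($slice_start) if defined $slice_start;

sub report_slice {
    my ($start) = @_;
    for my $host (sort keys %mx_count) {
        printf "slice starting at %d: %s issued %d MX queries\n",
               $start, $host, $mx_count{$host}
            if $mx_count{$host} >= $mx_threshold;
    }
}

A mass-mailer worm that delivers its e-mails directly has to resolve the MX record of every recipient domain, so a burst of MX queries from a single workstation would show up here even when the SMTP payload itself travels over an encrypted E-SMTP session.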
Acknowledgements

Acknowledgements are always the hardest part of a thesis to write. Sometimes there is the fear of forgetting someone or of not doing justice to someone else: I hope not to run into either of these two cases. My first thanks goes to Sara, who has encouraged and endured for all these years a Ph.D. student who was discouraged and sometimes lost in a world that even now I struggle with. Then I want to thank the little gift that is coming, for giving me something to battle for; perhaps sooner or later we will decide your name ... currently we have a shortlist of three. A thanks goes to my parents, who allowed me to get where I am: without their moral and financial aid I do not know if I would ever have succeeded in this endeavour. I can only wish them to find the harmony that they lost along the road. Thanks also to my friends, who have made all these years pass in the blink of an eye: it seems like yesterday that we finished high school or that we asked "and after you graduate?". Then I want to remember Birba, who gave me so much love and now is there where all the pains have an end. I have arrived at the end of my rant: I just hope that after this long research I have become a better person than the one I was when I started; alas, in this regard, I remain with a strong doubt...
Table of Contents

Chapter 1 Introduction
Chapter 2 State of art
2.1 Intrusion detection basics: Anderson's work
2.2 Intrusion detection basics: IDES
2.3 Intrusion Detection Basics
2.3.1 General IDS architecture
2.3.2 IDS concepts
2.4 Intrusion Detection Taxonomy
2.4.1 Misuse/Signature detection or knowledge based intrusion detection
2.4.2 Strategy detection techniques: anomaly detection or behaviour based intrusion detection
2.5 Orthogonal concepts
2.6 Evaluating an Intrusion Detection System
2.6.1 Evaluation: data sets
2.7 Knowledge discovery in database and data mining
2.7.1 ARD and FRD
2.7.2 Visual Data Analysis
Chapter 3 The system
3.1 The off-line system: Worm-Poacher
3.1.1 Log Translator
3.1.2 GenDB
3.1.3 Analysis engine: StatisticDB, Inquirer DB and Alert
3.2 The on-line system: WPD and WPD-pe
3.2.1 WPD
3.2.2 WPD with PanWorm engine: WPD-pe
Chapter 4 The analysis and results
4.1 The scenario
4.2 Analysis' methods
4.3 The off-line analysis
4.3.1 January 2004
4.3.2 April 2004
4.3.3 May 2004
4.3.4 August 2004
4.3.5 November 2004
4.3.6 August 2005
4.4 The on-line analysis
4.5 Conclusions
4.5.1 WPD-pe: more work to do?
List of Figures
List of Tables
Bibliography
List of Figures

2.1 The Debar IDS schema
2.2 The IDWG IDS schema
2.3 IDS characteristics
2.4 A Petri net example: four login attempts
2.5 Yearly land-air average global temperature deviations from 1900 to 1997
2.6 An example of PCA reduction
2.7 An example of Visual Data Analysis: a portsweep attack identified
3.1 The Postfix architecture
3.2 A Postfix log example
3.3 The Log Translator module
3.4 The GenDB module
3.5 The Berkeley DB schema
3.6 The StatisticDB module
3.7 The Worm-Poacher architecture
3.8 The WPD architecture
3.9 The actual WPD architecture
3.10 A possible scenario for WPD-pe
4.1 The old CNR network
4.2 The two development phases of the CNR-Area network: for my purposes they are just the same, because for my analysis there is no difference between a virtual machine and a real one. At my "work level" this change is transparent.
4.3 The preprocessing: you can see the not filtered activity in the red line and the filtered activity in blue marks
4.4 January 2004 baseline analysis
4.5 January 2004 sender addresses
4.6 April 2004 baseline analysis
4.7 April 2004 sender addresses
4.8 April 2004 rejected e-mails flow
4.9 May 2004 baseline analysis
4.10 May 2004 sender addresses
4.11 May 2004 rejected e-mails flow
4.12 August 2004 baseline analysis
4.13 August 2004 sender addresses
4.14 August 2004 rejected e-mails flow
4.15 November 2004 baseline analysis
4.16 November 2004 sender addresses
4.17 November 2004 rejected e-mails flow
4.18 August 2005 baseline analysis
4.19 22nd August 2005 e-mails flow
4.20 August 2005 sender addresses
4.21 August 2005 rejected e-mails flow
4.22 The different sender address usage for 172.16.74.29 during 2005
4.23 The different values of coefficient K in the 2006 false alerts for the rejected analysis
4.24 The different values of coefficient K in 2004 for global flow analysis
4.25 March 2009 global flow, five minutes slices
4.26 April 2009 global flow, five minutes slices
4.27 May 2009 global flow, five minutes slices
4.28 March 2009 rejected flow, five minutes slices
4.29 April 2009 rejected flow, five minutes slices
4.30 May 2009 rejected flow, five minutes slices
4.31 March-May 2009 different sender address usage on CNR-Area network
List of Tables

2.1 Anderson Threat Representation
4.1 January 2004 Global flow activity 5 minutes
4.2 January 2004 Global flow activity 1 hour
4.3 April 2004 Global flow activity 5 minutes
4.4 April 2004 Global flow activity 1 hour
4.5 The common logic in building the recipient address during the infection of 28th and 29th April 2004. The malware adapts itself: the three MX domains (i.e. censored, censored2 and censored3) are different, according to the information found on the infected host.
4.6 May 2004 Global flow activity 5 minutes
4.7 May 2004 Global flow activity 1 hour
4.8 Found an old friend in this infection.
4.9 The first k-value found in the first "five minutes" event triggered by a host.
4.10 August 2004 Global flow activity 5 minutes
4.11 August 2004 Global flow activity 1 hour
4.12 November 2004 Global flow activity 5 minutes
4.13 November 2004 Global flow activity 1 hour
Bibliography

[AACP06a] Maurizio Aiello, David Avanzini, Davide Chiarella, and Gianluca Papaleo. A tool for complete log mail analysis: LMA. TNC 2006, part of session "Security on the Backbone: Detecting and Responding to Attacks", 2006.

[AACP06b] Maurizio Aiello, David Avanzini, Davide Chiarella, and Gianluca Papaleo. Worm detection using e-mail data mining. In Proceedings of PRISE 2006, Primo Workshop Italiano su PRIvacy e SEcurity, pages 18-21, 2006.

[ACMQ08] Maurizio Aiello, Davide Chiarella, Claudio Martini, and Alfonso Quarati. Introduzione del mail-gateway ESDA nella rete ARIGE. Technical report, prot. GE0108, CNR-IEIIT, Genova, June 2008.

[ACP07] Maurizio Aiello, Davide Chiarella, and Gianluca Papaleo. Log mail analyzer: A tool for log mail analysis, May 2007. http://lma.sourceforge.net/.

[ACP08] Maurizio Aiello, Davide Chiarella, and Gianluca Papaleo. Statistical anomaly detection on real e-mail traffic. In Proceedings of the International Workshop on Computational Intelligence in Security for Information Systems CISIS 2008, pages 170-177, 2008.

[ACP09] Maurizio Aiello, Davide Chiarella, and Gianluca Papaleo. Statistical anomaly detection on real e-mail traffic. Journal of Information Assurance and Security, 4(6):604-611, 2009.

[Ait08] M. J. Aitkenhead. A co-evolving decision tree classification method. Expert Syst. Appl., 34(1):18-25, 2008.

[And80] J. P. Anderson. Computer security threat monitoring and surveillance. Technical report, Fort Washington, April 1980.

[Axe00] Stefan Axelsson. Intrusion detection systems: A survey and taxonomy. Technical report, Department of Computer Engineering, Chalmers University of Technology, Göteborg, Sweden, March 2000.

[BKNS00] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: identifying density-based local outliers. SIGMOD Rec., 29(2):93-104, 2000.

[BM98] J. M. Bonifaco and E. S. Moreira. An adaptive intrusion detection system using neural network. In Proceedings of the 14th Int. Information Security Conference (IFIP-Sec98, part of the 15th IFIP World Computer Congress), 1998.

[Deb99] H. Debar, M. Dacier, and A. Wespi. Towards a taxonomy of intrusion-detection systems. Comput. Netw., 31(9):805-822, 1999.

[Den87] Dorothy E. Denning. An intrusion-detection model. IEEE Transactions on Software Engineering, 13(2):222-232, 1987.

[DMRV03] Christopher Kruegel, Darren Mutz, William Robertson, and Fredrik Valeur. Bayesian event classification for intrusion detection. In Proceedings of ACSAC 2003, Las Vegas, NV, page 14, 2003.

[DSKP05] Dong Seong Kim, Ha-Nam Nguyen, T. Thein, and Jong Sou Park. An optimized intrusion detection system using PCA and BNN. In Proc. of The 6th Asia-Pacific Sym. on Information and Telecommunication Technologies, IEICE Communications Society, pages 356-359, 2005.

[FHS96] Stephanie Forrest, Steven A. Hofmeyr, and Anil Somayaji. Computer immunology. Communications of the ACM, 40:88-96, 1996.

[GTDVMFV09] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, and E. Vázquez. Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security, 28(1-2):18-28, 2009.

[HCGZ09] Álvaro Herrero, Emilio Corchado, Paolo Gastaldo, and Rodolfo Zunino. Neural projection techniques for the visual inspection of network traffic. Neurocomput., 72(16-18):3649-3658, 2009.

[HDF07] H. Debar, D. Curry, and B. Feinstein. The intrusion detection message exchange format. RFC 4765, 2007.

[Hec96] David Heckerman. A tutorial on learning with bayesian networks. Technical report, Learning in Graphical Models, 1996.

[LEK+03] Aleksandar Lazarevic, Levent Ertöz, Vipin Kumar, Aysel Ozgur, and Jaideep Srivastava. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the Third SIAM International Conference on Data Mining, 2003.

[LV02] Yihua Liao and V. Rao Vemuri. Use of k-nearest neighbor classifier for intrusion detection. Computer Security, 21(5), 2002.

[Mch00] John McHugh. The 1998 Lincoln Laboratory IDS evaluation (a critique). 2000.

[NWY02] Steven Noel, Duminda Wijesekera, and Charles Youman. Applications of Data Mining in Computer Security, chapter Modern Intrusion Detection, Data Mining, and Degrees of Attack Guilt, pages 1-25. Advances in Information Security. Kluwer Academic Publishers, 2002.

[QXBG02] Y. Qiao, X. W. Xin, Y. Bin, and S. Ge. Anomaly intrusion detection method based on HMM. Electronics Letters, 38(13):663-664, 2002.

[SD96] T. Spyrou and J. Darzentas. Intention modelling: approximating computer user intentions for detection and prediction of intrusions. In S. K. Katsikas, D. Gritzalis (Eds.), Information System Security, Samos, Greece, pages 319-335. Chapman & Hall, 1996.

[SDW07] I. Stanimirova, M. Daszykowski, and B. Walczak. Dealing with missing values and outliers in principal component analysis. Talanta, 72(1):172-178, 2007.

[SMS01] S. Mukkamala, G. Janowski, and A. H. Sung. Intrusion detection using neural networks and support vector machines. In Proc. Hybrid Information Systems, Advances in Soft Computing, volume 21, 2001.

[TK07] A. N. Toosi and M. Kahani. A new approach to intrusion detection based on an evolutionary soft computing model using neuro-fuzzy classifiers. Computer Communications, 30(10):2201-2212, 2007.

[TKW07] Chi-Ho Tsang, Sam Kwong, and Hanli Wang. Genetic-fuzzy rule mining approach and evaluation of feature selection techniques for anomaly intrusion detection. Pattern Recognition, 40(9):2373-2391, 2007.

[TMWJK04] Soon Tee Teoh, Kwan-Liu Ma, S. Felix Wu, and T. J. Jankun-Kelly. Detecting flaws and intruders with visual data analysis. IEEE Comput. Graph. Appl., 24(5):27-35, 2004.

[WFP99] Christina Warrender, Stephanie Forrest, and Barak Pearlmutter. Detecting intrusions using system calls: Alternative data models. In IEEE Symposium on Security and Privacy, pages 133-145. IEEE Computer Society, 1999.

[WZ06] Ningning Wu and Jing Zhang. Factor-analysis based anomaly detection and clustering. Decision Support Systems, 42(1):375-389, 2006.

[Yin02] H. Yin. ViSOM - a novel method for multivariate data projection and structure visualization. IEEE Transactions on Neural Networks, 13(1):237-243, 2002.

[YZW07] Zhenwei Yu, J. J. P. Tsai, and T. Weigert. An automatically tuning intrusion detection system. IEEE Transactions on Systems, Man and Cybernetics, Part B, 37(2):373-384, 2007.

[ZH06] Jun Zheng and Mingzeng Hu. An anomaly intrusion detection system based on vector quantization. IEICE - Trans. Inf. Syst., E89-D(1):201-210, 2006.

[ZJK05] Chunlin Zhang, Ju Jiang, and Mohamed Kamel. Intrusion detection using hierarchical neural networks. Pattern Recogn. Lett., 26(6):779-791, 2005.