Proceedings of 2010 IEEE Student Conference on Research and Development (SCOReD 2010), 13 - 14 Dec 2010, Putrajaya, Malaysia
A Framework for Automated Malcode Signatures Generation
Hanieh Rajabi, Dr. Muhammad Nadzir Marsono and Alireza Monemi
Faculty of Electrical Engineering, Universiti Teknologi Malaysia, 81310 Johor Bahru, Malaysia
E-mail:
[email protected]
Abstract-Rapid self-replicating malicious codes (malcodes) are malicious programs that represent a major security threat to the Internet. Fast monitoring and early warning systems are essential to prevent rapid malcodes from spreading. The difficulty in detecting malcodes is that they evolve over time. Although signature-based tools such as network intrusion detection systems are widely used to protect critical systems, traditional signature-based malcode detectors fail to detect obfuscated and previously unseen malcode executables. Automatic signature generation techniques are needed to augment these tools due to the speed at which new vulnerabilities are discovered. In particular, we need automated techniques which generate signatures without mistakenly blocking legitimate traffic or increasing false alarms. This work investigates a technique for automatically generating sound vulnerability signatures of novel rapid malcodes. In this paper, rapid malcode signatures are automatically generated based on their spreading behavior, specifically aimed at automatically extracting and deploying signatures at the packet level, without the need for reassembly, so that they can be used by signature-based firewalls and network intrusion detection systems. Evaluation on a Universiti Teknologi Malaysia network corpus shows higher detection accuracy at 87% compared to 56% for Snort signatures. Moreover, the false negative rate reduces to 14% compared to 78% for Snort signatures.
Index Terms-Signatures generation; behavioural approach; rapid malcode; intrusion detection system.
I.
INTRODUCTION
Security of computer and network systems has become increasingly important as more computers are interconnected and more applications are networked. Although intrusion detection is not a new technology, false positives are still a challenging problem in detecting malware. Furthermore, the inability to detect novel attacks is probably a more serious shortcoming of intrusion detection based on signature matching. Rapid malcodes are one of the most destructive and distinct types of attacks over the Internet; they exploit and modify security holes in host software. They also contain highly malicious payloads that can be a threat to network infrastructure. Rapid malcode is designed to corrupt as many machines as possible and to spread quickly enough to infect all vulnerable machines before humans can trace its propagation process [1]. It is apparent that human reaction times are not fast enough to respond to fast-scanning rapid malcodes, which can infect the majority of vulnerable systems in a matter of minutes [2] [3]. Therefore, rapid malcode detection and response systems have to act quickly to identify and quarantine scanning rapid malcodes.
Because anti-virus systems are signature-based and need to be updated, they cannot effectively and automatically detect unknown Internet malcode without proper signatures. Security routers and firewalls are only able to block packets by traffic signatures after rapid malcodes have already spread [4]. Pattern-based signatures are the most common technique employed in malcode detection systems. The advantage of such malcode detectors lies in their simplicity and speed. While the signature-based approach is successful in detecting known malcode, it does not work for new malcode for which signatures have not yet been prepared, and the detector must be retrained often in order to detect new malcodes [5].
Detection of malcodes is best done at the byte-stream level, mainly due to the limited payload information seen at the packet level [6]. Detection at the transport layer and above requires the complete Internet byte stream to be visible [7]. However, reassembly and stateful detection are quite expensive to implement [8]. As network speed gets higher, flow reassembly within the network requires more computational resources in terms of memory and processing power. In addition, current rapid malcode detection mechanisms are prone to false alarms on real-time network traffic. It is therefore necessary to detect rapid malcode attacks accurately, with a low false positive rate, and quickly, before a large number of hosts are infected [9].
This paper proposes a payload-sifting approach that sifts traffic payloads to identify signatures of new rapid malcodes based on their spreading behaviour, and then verifies the proposed technique against the widely used Snort string-matching technique. Furthermore, evaluation on a Universiti Teknologi Malaysia network corpus shows higher detection accuracy at 87% compared to 56% for Snort signatures. However, the false positive rate increases to 16% in comparison with 2% for Snort signatures.
978-1-4244-8648-9/10/$26.00 ©2010 IEEE
II.
RELATED WORKS
Computer worms are the most prominent instances of rapid malcode and are the most rapidly self-spreading over the Internet. Once they penetrate vulnerable systems, they attempt to install malicious code such as backdoors and patches and to launch denial-of-service (DoS) attacks [1].
In recent years, researchers have made a concerted effort to detect malicious activities, especially network worms. Most recent research on Internet malcode concentrates on worm propagation models [10]-[12]. Nevertheless, defence against worms is still an open problem. Rule-based network intrusion detection systems such as Snort and Bro are unable to handle novel rapid malcodes because they depend on known signatures, which are collected only after the malcodes have spread widely. Moreover, these signatures must be generated and entered into the system by humans. Snort, which is open source software, does not provide any assistance in generating the signatures.
Spring and Wetherall [13] applied Rabin fingerprints in web caching to recognize redundant network traffic in order to improve web performance. Shield [14] investigated vulnerability signatures instead of string-oriented content signatures, and stops all attacks that exploit a given vulnerability. Although Shield signatures have to be developed manually, they can specify what the exploit would look like in a packet datagram, and Shield is able to block any connection that fulfils this specification. Many investigators have taken content analysis and packet flows into consideration when creating automatic signature generation systems. Honeycomb [15] is a host-based intrusion detection system that automatically generates signatures, rather than analysing network-wide traffic. It uses honeypots to capture suspicious traffic targeting dark space, and then applies the longest common substring (LCS) algorithm to the packet payloads of connections that communicate with the same services. The resulting substrings are used as candidate malcode signatures. Williamson proposed modifying the network stack so that the rate of connection requests to distinct destinations is bounded [16]. This method relies on the observation that hosts infected with rapidly spreading malcodes such as worms are expected to connect to a larger than usual number of hosts. Autograph [17] uses heuristics to classify traffic into two categories: a flow pool with suspicious scanning activity and a non-suspicious flow pool. TCP flow reassembly is applied to the suspicious flow pool. The Rabin fingerprinting method is used to partition the payload into small blocks. These blocks are then counted to determine their prevalence, and the most frequent substrings from these blocks form worm signatures. The signature generator uses blacklisting in order to decrease the number of false positives.
Earlybird [18] is another scheme that automatically detects new worms using a method similar to Autograph. This approach analyses the content of packets using the Rabin fingerprinting algorithm and observes the repetitiveness of substrings. Meanwhile, the varied IP addresses are recorded in a table that ranks the frequency counts in order to produce the set of likely worm traffic. However, this prototype relies on a prefiltering step to identify flows with worm-like behaviour.
III.
CHARACTERISTICS OF RAPID MALCODES
There are many indicators of rapid malcode characteristics that are generally found in network traffic. Small size, variety of sources and destinations, payload repetition, and spreading by clients are the main properties of known rapid malcodes. When a rapid malcode propagates, its presence is apparent even against the background of other scan activities, and the distribution of the number of scans per source IP versus their overall rank follows a power law [19]. This is due to thousands of infected hosts that behave similarly. Their scan traffic congests networks around the world. All of the described properties are inherent characteristics of rapid malcodes, which generate great similarity in network traffic flows. All existing rapid malcodes behave in a similar manner and have common traits; hence detecting rapid malcodes becomes easier by exploiting the following characteristics [20].
i. Content invariance: There are invariant parts in the payload of all existing rapid malcodes which are identical across each infection attempt. These constant bytes have a fixed value that cannot be changed because they are absolutely necessary for the exploit to work.
ii. Content prevalence: Since rapid malcodes replicate themselves, the invariant content bytes appear very frequently while a rapid malcode is spreading. Thus the most prevalent content can be exploited as signatures.
iii. Address dispersion: Malcodes spread from one host to another, so their propagation involves many distinct sources and destinations. The number of sources and destinations involved in the infection therefore grows as the malcode spreads.
iv. Random probing: An infected source in the self-propagating process needs to communicate with random IP addresses using fixed port numbers in order to exploit vulnerable services, and thereby often probes unused IP addresses.
These properties can be used to distinguish malicious traffic from more benign traffic patterns such as flash crowds, spam, and popular content in a P2P network, which would fulfil the first two characteristics.
IV.
DETECTION TECHNIQUES
Detection approaches are generally divided into two main categories: content-based detection and behaviour-based detection. Fig. 1 presents a taxonomy of malcode control techniques. Content-based detection using signature matching looks for common strings or byte patterns in the content of the packet. All payloads that match specific patterns from the database are considered anomalous. Signatures must be specific enough to be dissimilar to regular traffic, but general enough to catch different versions of the same rapid malcode. Hence, signature-based detection requires considerable analysis of the payload to come up with good signatures. These signatures cover security policy violations, penetration scenarios, and suspected behaviour, and are stored in a signature database.
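The content-based signature matching described in Section IV, where a payload is flagged when it contains a byte pattern from a signature database, can be illustrated with a minimal sketch. The database contents, the signature names and the function name `match_signatures` are hypothetical placeholders chosen for illustration; they are not taken from the paper or from any Snort rule set.

```python
from typing import Dict, List

# Hypothetical signature database: name -> byte pattern (illustrative values only).
SIGNATURES: Dict[str, bytes] = {
    "example-pattern-a": b"\x90\x90\x90\x90\xeb\x1f",
    "example-pattern-b": b"GET /default.ida?",
}


def match_signatures(payload: bytes, signatures: Dict[str, bytes]) -> List[str]:
    """Return the names of all signatures whose byte pattern occurs in the payload."""
    return sorted(name for name, pattern in signatures.items() if pattern in payload)
```

A payload that contains none of the patterns yields an empty list, which corresponds to traffic that the detector treats as regular. The trade-off between specific and general signatures mentioned in the text shows up here as the choice of pattern length: a longer pattern matches fewer payloads.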
Fig. 1. Rapid Malcode Control Techniques Taxonomy
Signature-based detection obviously depends on the accuracy of the signatures. If the signatures are too long, some attacks cannot be detected, and this failure in detection causes false negatives. Conversely, if the signatures are too short, some normal payloads will mistakenly be considered suspicious, which increases the false positive ratio. Content-based detection using classification techniques predicts group membership for data instances. The obstacles in detecting novel malcodes have attracted the attention of many researchers. Several approaches have been proposed that use classification techniques for rapid malcode variants such as polymorphic worms. However, classification is more expensive than signature-based detection [21].
Behaviour-based detection is a type of anomaly detection that defines a statistical model of normal behaviour at the beginning of detection; all incidents that deviate from this model are then treated as abnormal traffic. The noticeable difficulty of this method is that defining a standard for normal traffic is hard [22]. On the other hand, its advantage lies in the ability to recognize new attacks without known signatures, which overcomes the restriction of content-based detection. This is possible by concentrating on normal system behaviour and considering any abnormality as a suspicious flow regardless of the type of attack. Furthermore, behaviour-based detection does not require extensive analysis of the payload. However, this approach requires a huge amount of data to be observed. Table I compares the two methods of malcode detection.
TABLE I
COMPARISON BETWEEN BEHAVIOUR-BASED AND CONTENT-BASED RAPID MALCODE DETECTION
Behaviour-Based | Content-Based
Relies on common behavioural patterns of rapid malcodes | Relies on certain types of rapid malcode signatures (signature database)
Focuses on known and unknown rapid malcode signatures | Focuses on known rapid malcode signatures
Focuses on detected patterns at a higher level of abstraction | Focuses on fixed regular expressions in payloads
V.
OUR PROPOSED TECHNIQUE
Motivated by the design goals given in the previous section, we now present our proposed technique for automatic signature generation. We begin with a schematic overview of the system, shown in Fig. 2. Each component has a different function in providing a reliable signature generation system. The input of the system is all traffic crossing the edge of the Universiti Teknologi Malaysia (UTM) network (the UTM network corpus), and its output is a list of rapid malcode signatures. There are three main stages in our signature generation framework. First, in the data mining stage, suspicious flow selection is done using specific Snort rules in order to classify inbound TCP flows as either suspicious or non-suspicious. After classification, the suspicious packets of these inbound flows are logged by Snort, so that further processing is performed only on payloads in the suspicious flows. Flow classification thus reduces the volume of traffic that must be processed and also reduces the generation of signatures that would subsequently cause many false positives in the system.
Fig. 2. Proposed Signature Generation Modules (Data Mining Analysis -> Signatures Generation -> Evaluation of System)
When flows are extracted from the suspicious flow pool, the n-gram technique, which is a type of probabilistic model for predicting the next item in a sequence, is used in our framework. N-grams can also be used for efficient approximate matching. By converting a sequence of adjacent byte values in a packet payload to a set of n-grams, the sequence can be embedded in a vector space, allowing it to be compared to other sequences in an efficient manner [23]. Flow payloads are split into fixed-size overlapping blocks using 8-gram payload partitioning, starting at the beginning of the payload and sliding one byte at a time towards its end.
The Rabin fingerprint algorithm is applied to all possible substrings of a fixed length of 8 bytes in our prototype, in order to cope with polymorphic rapid malcodes and to detect fixed-length strings that occur frequently even when the packet contents as a whole do not [24]. Since these fingerprints are computed using irreducible polynomials over each bit string, they can be used as features to represent the original flow. Let us suppose the character string A is a bit string containing m bits [b_1, b_2, b_3, ..., b_m]. Then
\[ A(t) = b_1 t^{m-1} + b_2 t^{m-2} + \dots + b_{m-1} t + b_m \tag{1} \]
where (m - 1) is the degree of the polynomial in the indeterminate t. Considering a string represented by A, the
fingerprint of A is
\[ F(A) = A(t) \bmod P(t) \tag{2} \]
where a polynomial P(t) of degree k is defined in
\[ P(t) = a_1 t^{k} + a_2 t^{k-1} + \dots + a_{k-1} t + a_k \tag{3} \]
So if we have (4) for the first substring, we conclude that the Rabin fingerprint for the next substring [b_2, b_3, ..., b_{m+1}] only needs to add the last coefficient and remove the first one (5).
\[ F_1 = A(t) \bmod P(t) \tag{4} \]
\[ F_2 = \left( p \cdot F_1 + b_{m+1} - b_1 \cdot p^{k-1} \right) \tag{5} \]
Fig. 3. Evaluation of the System (blocks: UTM Network Corpus; Data Mining Analysis; Signatures Generation System; Snort Using Original Rules; Snort Using Generated Signatures; Result Analysis)
Likewise, the specific property of the Rabin fingerprint implies that two equal substrings generate the same fingerprint value regardless of where they are in the string, so the frequency of occurrence of a fingerprint directly reflects the frequency of the substring. This also means that the longer the substring, the fewer repetitions we observe. In the behaviour of self-propagating rapid malcodes, the entire malcode is usually identical over each attack, although some special malcodes modify themselves in each infection. Even then, the non-variant portion of the malcode can be observed in the network. In contrast, the variant portion of the malcode will not appear frequently, and is thus not frequent enough to be considered as a signature.
During the signature generation stage, the proposed system performs payload analysis of suspicious flows to obtain sensitive and specific signatures. Content invariance and content prevalence are two common traits of rapid malcodes that suggest the viability of content analysis of rapid malcodes. Thus, we measure the frequency of overlapping payload substrings occurring across all suspicious flow payloads, and propose the most frequently occurring substrings as candidate signatures. Furthermore, rapid malcodes are mostly intended to spread to as many victims as possible, so their network packets have a large number of sources and destinations. Although content prevalence is the most prominent key for recognizing a rapid malcode signature, the address dispersion metric is also crucial for reducing the false positive rate. Without applying this additional metric, our system is not able to differentiate between a probable rapid malcode and benign traffic flowing between two nodes inside the network. Hence, the distinct source and destination addresses are retrieved from the IP header data according to their payload content.
Finally, at the signature generation stage, we use the sifting method to filter out legitimate traffic and extract the more likely anomalous payload traffic as signatures. High prevalence of invariant substrings and dispersion across various sources and destinations are the two main criteria for generating signatures.
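The steps of Section V (8-gram partitioning, incremental fingerprinting, and sifting by content prevalence and address dispersion) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses an integer polynomial hash with a hypothetical base and prime modulus as a stand-in for Rabin fingerprints over an irreducible polynomial P(t), the thresholds are hypothetical parameters because the paper does not state its static threshold values, and the packet record layout and all function names are invented for the example. The helper that renders a signature as a Snort `content` rule shows one plausible way to deploy a generated signature; the paper does not describe its exact rule format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple

N = 8                # block length in bytes, matching the 8-gram partitioning in the text
BASE = 257           # assumption: hash base, not a value from the paper
MOD = (1 << 61) - 1  # assumption: prime modulus standing in for P(t)

Packet = Tuple[str, str, bytes]  # hypothetical record: (source IP, destination IP, payload)


def fingerprint(block: bytes) -> int:
    """Hash one block from scratch using Horner's rule."""
    value = 0
    for byte in block:
        value = (value * BASE + byte) % MOD
    return value


def rolling_fingerprints(payload: bytes, n: int = N) -> Iterator[Tuple[int, bytes]]:
    """Yield (fingerprint, block) for every overlapping n-byte block of the payload."""
    if len(payload) < n:
        return
    top = pow(BASE, n - 1, MOD)  # weight carried by the oldest byte of a window
    value = fingerprint(payload[:n])
    yield value, payload[:n]
    for i in range(n, len(payload)):
        # Drop the first byte of the previous window, shift, and append the new byte.
        value = ((value - payload[i - n] * top) * BASE + payload[i]) % MOD
        yield value, payload[i - n + 1:i + 1]


def sift(packets: Iterable[Packet], min_prevalence: int, min_dispersion: int) -> List[bytes]:
    """Return blocks that are both prevalent and seen across many sources and destinations."""
    count: Dict[int, int] = defaultdict(int)
    sources: Dict[int, Set[str]] = defaultdict(set)
    destinations: Dict[int, Set[str]] = defaultdict(set)
    blocks: Dict[int, bytes] = {}
    for src, dst, payload in packets:
        for fp, block in rolling_fingerprints(payload):
            count[fp] += 1
            sources[fp].add(src)
            destinations[fp].add(dst)
            blocks.setdefault(fp, block)  # hash collisions are ignored in this sketch
    return sorted(
        blocks[fp]
        for fp, seen in count.items()
        if seen >= min_prevalence
        and len(sources[fp]) >= min_dispersion
        and len(destinations[fp]) >= min_dispersion
    )


def to_snort_content_rule(signature: bytes, sid: int) -> str:
    """Render a signature as a Snort rule with a hex-encoded content match."""
    hex_bytes = " ".join(f"{b:02X}" for b in signature)
    return (
        f'alert tcp any any -> any any (msg:"generated signature {sid}"; '
        f'content:"|{hex_bytes}|"; sid:{sid}; rev:1;)'
    )
```

In this sketch the rolling update reuses the previous fingerprint, so each new block costs a constant amount of work instead of rehashing all eight bytes. A block repeated between only two hosts passes the prevalence test but fails the dispersion test, which mirrors the role of address dispersion in reducing false positives.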
VI.
EVALUATION OF SYSTEM
A.
Setup Description
The following evaluation of our proposed system is based on tcpdump trace files collected from the UTM edge network on July 2, 2010. The trace file has a total of 13779 packet records (with payload).
In this paper, we use the Snort open source IDS to evaluate the system. The primary goal of this evaluation is to determine whether the intrusion detection system can detect novel malcodes by using the newly generated signatures, and also to understand how the false positive rate of the system can be reduced, as well as the impact on false positives and false negatives of the substring length over which we compute the Rabin fingerprint. However, previous evaluations of intrusion detection systems focus exclusively on the probability of detection, without regard to the probability of false alarm. The evaluation setup is designed to verify the generated signatures when they are embedded in an IDS, as shown in Fig. 3. The test bed comprises two procedures in which Snort in NIDS mode examines the payload of every incoming packet against either the original Snort rules or the rules created by our proposed prototype. By replaying the same dataset into the system, we then compare and analyse the results in terms of accuracy, false positives and false negatives.
B.
Evaluation of Results
Ideally, a signature generation system should generate signatures that match only malcodes. In describing the efficiency of rapid malcode signatures in sifting traffic, we use two significant parameters that directly reflect signature accuracy and efficiency.
1. Sensitivity relates to the true positive alerts generated by a signature [17]. It describes the fraction of the malicious traffic that the signatures match among benign and malicious flows.
2. Specificity relates to the false positive alerts triggered by the generated signatures [17]. In a mixed population, the false positive ratio is the fraction of non-malicious flows that are incorrectly matched by the signatures; the lower this ratio, the more specific the signatures. A false positive means that a flow identified as malicious by a signature is in fact benign.
The experimental results are shown in Table II. With 16% false positives and 87% true positive alerts, our prototype is more sensitive but less specific than Snort, which has a lower true positive ratio. Snort generates fewer false positives (2%) than our prototype. Hence, Snort is more specific than our proposed signature generation system. However, the lower false negative rate of our prototype makes it more reliable in this evaluation than the Snort open source intrusion detection system.
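The two parameters used in this evaluation can be made concrete with a short sketch. It assumes the standard definitions (sensitivity as the true positive rate, specificity as one minus the false positive rate), which may differ in detail from how the figures in Table II were computed. The class name, the function names and the example counts are hypothetical and are not data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Flow counts for one detector; all values used below are hypothetical."""
    true_pos: int   # malicious flows that raised an alert
    false_neg: int  # malicious flows that raised no alert
    false_pos: int  # benign flows that raised an alert
    true_neg: int   # benign flows that raised no alert


def sensitivity(c: Counts) -> float:
    """Fraction of malicious flows that the signatures match."""
    return c.true_pos / (c.true_pos + c.false_neg)


def false_negative_rate(c: Counts) -> float:
    """Fraction of malicious flows that the signatures miss."""
    return c.false_neg / (c.true_pos + c.false_neg)


def false_positive_rate(c: Counts) -> float:
    """Fraction of benign flows that the signatures match incorrectly."""
    return c.false_pos / (c.false_pos + c.true_neg)


def specificity(c: Counts) -> float:
    """Fraction of benign flows that the signatures correctly leave alone."""
    return c.true_neg / (c.false_pos + c.true_neg)


# Hypothetical counts chosen only to illustrate the arithmetic.
example = Counts(true_pos=80, false_neg=20, false_pos=10, true_neg=90)
```

With these definitions a detector can gain sensitivity while losing specificity, which is the trade-off discussed in the text: more malicious flows are matched, at the cost of more benign flows being flagged. The functions assume that both the malicious and the benign populations are non-empty.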
TABLE II
SYSTEM RESULT
Test Bed | Snort IDS | Automatic Signatures Generation System
Total Packets | 13779 | 13799
Total Rules | 3137 | 1482
Total Alerts | 704 | 1281
Accuracy | 56% | 87%
False Positive | 2% | 16%
False Negative | 78% | 14%
VII.
CONCLUSION
In this paper, rapid malcode signatures are automatically generated based on their spreading behavior, specifically aimed at automatically extracting and deploying signatures at the packet level, without the need for reassembly, so that they can be used by signature-based firewalls and network intrusion detection systems. Evaluation on the UTM corpus shows high detection accuracy and a low false negative rate, at 87% and 14% respectively, while Snort signatures show 56% accuracy and 78% false negatives in our evaluation system. However, the false positive rate increases to 16% compared to 2% for Snort signatures. The main contribution of this study is the development of a prototype based on the spreading behaviour of rapid malcodes, targeting packet-level network traffic. A close study of the prevalence characteristic of rapid malcodes and of the sifting approach for detecting novel malcodes has also been carried out. This study has shown that rapid malcodes can be detected by using the similarity in their network traffic pattern, namely the repetition of packet payloads.
VIII.
FUTURE WORK
The static threshold value is the most important restriction of our prototype; it can be improved by using a dynamic threshold value in the signature generation system. This dynamic value will be determined by the characteristics of the suspicious traffic. Adding content-based classification to the sifting approach can also improve the accuracy and sensitivity of the system. Therefore, using a combination of these two algorithms is planned as future work. Theoretically, it is possible for an attacker to exploit the algorithm used to identify packet similarity. If the attacker knows the parameters of the Rabin fingerprint and the exact value of the mask, the attacker can create a new rapid malcode whose content never matches the mask. It is suggested that additional fine-tuning of the algorithm be conducted.
REFERENCES
[1] N. Weaver, V. Paxson, S. Staniford, and R. Cunningham, "A taxonomy of computer worms," Proceedings of the 2003 ACM Workshop on Rapid Malcode (WORM), October 27, 2003.
[2] D. Moore, V. Paxson, S. Savage, C. Shannon, S. Staniford, and N. Weaver, "Inside the slammer worm," IEEE Security and Privacy, vol. 1, July-Aug. 2003.
[3] S. Staniford, V. Paxson, and N. Weaver, "How to own the internet in your spare time," Proceedings of the 11th USENIX Security Symposium (Security '02), August 7-9, 2002.
[4] M. Rasheed and M. M. Kadhum, "Traffic signature detection for unknown internet worms," International Conference on Network Application, Protocols and Services, 21-22 November 2008.
[5] V. S. Sathyanarayan, P. Kohli, and B. Bruhadeshwar, "Signature generation and detection of malware families," Centre for Security, Theory and Algorithmic Research (C-STAR), International Institute of Information Technology, Hyderabad 500032, India.
[6] T. H. Ptacek and T. N. Newsham, "Insertion, evasion, and denial of service: Eluding network intrusion detection," Secure Networks, Calgary, AB, Canada, Technical Report T2R-0Y6, January 1998.
[7] G. Varghese, J. A. Fingerhut, and F. Bonomi, "Detecting evasion attacks at high speeds without reassembly," in Proceedings of the 2006 SIGCOMM Conference, Pisa, Italy, pp. 327-338, September 2006.
[8] E. Markatos, "Speeding up tcp/ip: Faster processors are not enough," Proceedings of the 21st IEEE International Performance, Computing, and Communications Conference (IPCCC 2002), Phoenix, AZ, USA, pp. 341-345, April 2002.
[9] Y. Al-Hammadi and C. Leckie, "Anomaly detection for internet worms," Integrated Network Management, 2005 9th IFIP/IEEE International Symposium, pp. 133-146, May 2005.
[10] C. Zou, W. Gong, and D. Towsley, "Code red worm propagation modeling and analysis," Proceedings of the 9th ACM Conference on Computer and Communication Security, November 2002.
[11] M. Liljenstam, Y. Yuan, B. Premore, and D. Nicol, "A mixed abstraction level simulation model of large-scale internet worm infestations," Proceedings of the 10th IEEE/ACM Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), October 2002.
[12] D. Moore, C. Shannon, G. Voelker, and S. Savage, "Internet quarantine: Requirements for containing self-propagating code," in Proceedings of the 22nd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), San Francisco, California, USA, March 2003, pp. 1901-1910.
[13] N. T. Spring and D. Wetherall, "A protocol-independent technique for eliminating redundant network traffic," in ACM SIGCOMM 2000, Aug. 2000.
[14] H. J. Wang, C. Guo, D. R. Simon, and A. Zugenmaier, "Shield: Vulnerability-driven network filter for preventing known vulnerability exploits," Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Portland, Oregon, USA, August 30-September 03, 2004.
[15] C. Kreibich and J. Crowcroft, "Honeycomb - creating intrusion detection signatures using honeypots," Proceedings of the 2nd ACM Workshop on Hot Topics in Networks, November 2003.
[16] J. Twycross and M. M. Williamson, "Implementing and testing a virus throttle," to appear in 12th USENIX Security Symposium, Aug. 2003.
[17] K.-A. Kim and B. Karp, "Autograph: Toward automated, distributed worm signature detection," in Proceedings of the 13th USENIX Security Symposium, August 2004.
[18] S. Singh, C. Estan, G. Varghese, and S. Savage, "Automated worm fingerprinting," in Proceedings of the ACM/USENIX Symposium on Operating System Design and Implementation, December 2004.
[19] G. Zipf, "Human behavior and the principle of least-effort," Addison-Wesley, Cambridge, MA, 1949.
[20] H. Kumar, "Seminar report on study of viruses and worms," I.I.T. Bombay, 2009.
[21] J. Kolter and M. Maloof, "Learning to detect malicious executables in the wild," Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA: ACM Press, pp. 470-478, August 2004.
[22] Y. Yao, J. Lv, F. Gao, Y. Zhang, and G. Yu, "Behavior-based worm detection and signature generation," 2008 IEEE International Multi-symposiums on Computer and Computational Sciences, 2008.
[23] "n-grams [online]," available: http://en.wikipedia.org/wiki/N-gram. Last accessed: October 2009.
[24] S. Singh, C. Estan, G. Varghese, and S. Savage, "The earlybird system for real-time detection of unknown worms," Technical Report CS2003-0761, UCSD, 2003.