Second International Conference on Computer Research and Development
A Framework for Malware Detection Using Combination Technique and Signature Generation
Mohamad Fadli Zolkipli
Aman Jantan*
School of Computer Science Universiti Sains Malaysia, USM Penang, Malaysia e-mail:
[email protected]
School of Computer Science Universiti Sains Malaysia, USM Penang, Malaysia e-mail:
[email protected]
Abstract—Malware detection must apply sophisticated techniques to minimize malware threats that can disrupt computer operation. Malware writers now try to avoid detection by using several techniques such as polymorphism, hiding, and zero-day attacks. Commercial anti-virus and anti-spyware products that use signature-based matching to detect malware cannot handle these kinds of attack. To overcome this issue, we propose a new framework for malware detection that combines a signature-based technique with a genetic algorithm (GA) technique. The framework consists of three main components: s-based detection, GA detection, and a signature generator. These three components work together as interrelated processes in the proposed framework. The result of this study is a new framework designed to detect newly launched malware and to automatically generate signatures that can be used in signature-based detection.

Keywords-malware detection; combination technique; signature-based; genetic algorithm (GA)

I. INTRODUCTION

The most popular computer attack is malware, which consists of seventeen categories such as viruses, worms, Trojan horses, spyware, and other malicious software. Malware is a program with malicious intent designed to damage the computer on which it executes or the network over which it communicates [1]. Although each type of malware has its own specific objective, the main purpose is to disrupt computer operation. Because of that, a security mechanism needs to be implemented [2] to protect all code and data against modification, replacement, or subversion.

Malware can be filtered using anti-virus or anti-spyware software that deploys a specific detection algorithm and technique. A detector identifies and contains malware before it can reach a computer system or network. Most commercial anti-virus and anti-spyware products use the signature-based matching technique. The problem [3] is that computers must be infected before a new virus pattern can be captured and stored for future use. The problem becomes worse [3] because not all malware has bit patterns that indicate malicious intent. The signature-based matching technique must be updated frequently, and the virus dictionary must hold a specific signature for each new malware. Moreover, as technology advances, many malware writers use better hiding techniques to avoid detection [4]. Malware, especially rootkits [5], has become a problem for computer security because its hiding ability keeps increasing.

Several new detection methods have been developed for identifying malware, including machine learning and data mining techniques. However, because the input is unstructured, it must be reprocessed to create signature components that detection tools can identify. In this paper, we propose a new framework for malware detection that combines the signature-based matching technique and a machine learning technique, which work together to detect malware. The proposed framework has three main modules: signature-based detection, genetic algorithm-based detection, and a signature generator. Because of its limited scope, this study focuses on only three types of malware: viruses, worms, and Trojan horses.

The rest of the paper is organized as follows. Section II summarizes the important related work. The proposed framework is presented in Section III. Finally, the paper concludes with an outlook on our future work.
* Corresponding author
978-0-7695-4043-6/10 $26.00 © 2010 IEEE DOI 10.1109/ICCRD.2010.25

II. RELATED WORK

Malware is defined [6] as software that, when executed, performs actions intended by an attacker without the consent of the owner. Each malware has specific characteristics, attack goals, and propagation methods. The five main categories of malware are viruses, worms, Trojan horses, backdoors, and spyware [7]. A virus is malware that, when executed, tries to replicate itself into other executable code within a host. A worm shares several characteristics with a virus, but it attacks across a network. A Trojan horse is a program that performs some benign task but secretly contains malicious code to attack the system or leak data. A backdoor is a program used by attackers to allow remote access and control that bypasses normal security policies and procedures. Spyware is useful software that collects user data and transmits it to an unauthorized entity. Because of technology advancement, malware creation has become more sophisticated and has improved significantly since the early days [8].

The signature-based matching technique is one of the most popular approaches to malware detection [1]. It is applied commercially by the anti-virus and anti-spyware products on the market. To prevent, detect, and remove malware, these products compare file content with signatures using a scan string approach that searches for pre-defined bit patterns. The anti-virus product must be updated frequently, and the virus dictionary must hold a specific signature for each new malware [9]. Although this technique is very popular and reliable for host-based security tools, it has some limitations that need to be solved. The main problem is that it fails to detect newly launched malware, known as a zero-day malware attack [10][11]. A certain number of computers must be infected before a new virus pattern can be captured and stored for future use [9].

F. Hsu et al. developed [9] an automatic malware removal and system repair framework that consists of three major components: a monitor, a logger, and a recovery agent. The framework needs to solve two problems. The first is to determine which untrusted programs will break the system integrity. The second is to remove those untrusted programs entirely. The framework monitors and logs the operations of untrusted programs. These logs are then used to remove the untrusted programs and their effects completely and automatically. The framework does not need any prior knowledge about the untrusted program, yet it can defend against both known and unknown malware. The user does not need to modify any existing programs and should not notice that they are running in the framework, because it provides a transparent environment for both trusted and untrusted programs. A prototype of this framework was implemented in a Windows environment and was shown to detect all of the malware's modifications, in contrast to commercial tools that use the signature-based technique.

A machine learning algorithm was applied by Y. Zhou and M. Inge in their malware detection technique. That technique [10] uses adaptive data compression to counter the limitations of the signature-based technique in current commercial anti-malware tools. Y. Zhou and M. Inge identified two such limitations. First, not all malicious programs have bit patterns that are evidence of their malicious nature, and some are not recorded in the virus dictionary. Second, obfuscated malware, which can take many forms of bit patterns, defeats the signature-based technique. The proposed technique uses an adaptive data compression model with prediction by partial matching (PPM) as the learning engine to build two compression models. It works on unstructured input, that is, raw executables, with an underlying statistical compression model. This unstructured learning technique has a number of advantages over standard machine learning algorithms that only work on structured input. First, the compression algorithms work on the character-level raw executables rather than on a preprocessed subset of the original code. Second, because of their character-level nature, the compression algorithms are more robust to obfuscation.

S. B. Mehdi et al. developed [12] a new malware analysis and detection technique called IMAD. IMAD uses a genetic algorithm (GA) to optimize the system parameters so that it can detect malware on the first day of its launch. GA was selected for this purpose because GAs are known to give good results in real-time dynamic environments. The technique was also developed to counter the limitations of the signature-based technique. IMAD not only detects malware but also detects a malicious process while it is executing, before it can cause any significant problem to the operating system. S. B. Mehdi also stated that the static and dynamic techniques previously developed by security experts still do not achieve high detection accuracy or meet three other needs. The architecture of IMAD consists of six main components: a system calls logger, an n-grams generator, an n-grams analyzer, a goodness evaluator, the ngc786 in-execution classifier, and a genetic optimizer. The accuracy of IMAD was tested by comparing it with four other well-known classification techniques: Support Vector Machine, RIPPER, C4.5, and Naïve Bayes.

Data mining is another technique that has already been applied to malware detection. S. M. Tabish et al. proposed [13] a malware detection technique based on the analysis of byte-level file content. This technique is also designed to provide protection against malware on the first day of its launch. As a non-signature-based technique, it has the potential to detect previously unknown and newly launched malware. It does not memorize specific byte sequences or strings that appear in the actual file content. Standard data mining algorithms were used to classify the content of every block as benign or potentially malicious. The proposed technique was tested using six different file types: doc, exe, jpg, mp3, pdf, and zip. Six different types of malware were used as the dataset: backdoor, Trojan, virus, worm, constructor, and miscellaneous.

Y. Ye et al. developed [14] an intelligent malware detection system known as IMDS. The system was developed to overcome the limitation of signature-based antivirus systems, which fail to detect polymorphic and unseen malicious executables. IMDS is an integrated system that uses Objective Oriented Association (OOA) mining based classification. IMDS has three major modules: a PE parser, an OOA rule generator, and a rule based classifier. Malware classification is done by the OOA rule generator using an adapted OOA_Fast_FPGrowth algorithm. The analysis is based on Windows API execution sequences called by PE files. A comprehensive experimental study on a large collection of PE files obtained from the anti-virus laboratory of KingSoft Corporation was performed to compare various malware detection approaches. As a result, the
accuracy of the IMDS system outperformed popular antivirus software such as Norton and McAfee.

III. PROPOSED FRAMEWORK

Our proposed framework is a combination of two malware detection techniques: the signature-based technique and the GA technique. It was designed to solve two malware detection challenges. First, how can newly launched malware be detected? Second, how can a signature be generated from a malware-infected file? Fig. 1 shows the three main components of our framework: GA detection, s-based detection, and the s-based generator. S-based detection is the first line of defense against malware attack. GA detection works as a second layer of defense, especially for detecting newly launched malware. After a new signature has been created from newly launched malware, that signature is used by the signature-based detection technique. These three main components work together as interrelated processes in our proposed framework.

Figure 1. Framework for Malware Detection Technique. (Components shown: S-BASED DETECTION, GA DETECTION, SIGNATURE GENERATOR.)

A. S-based detection
Signature-based detection is one of the static analysis methods commonly used in commercial anti-malware software. The static analysis method scans through the program code for detection purposes, and its patterns are sometimes called scan strings. This technique uses its characterization of the malicious code to decide, through program inspection, whether a program is malware or not. Normally, each malware is represented by one or more signature patterns that uniquely characterize it. When a program is executed, the anti-malware software searches through the bytes of the data stream. Thousands of signatures are stored in a database, and the scanning process compares each signature with the program code being executed. A searching algorithm is used to compare the content of the program code with the signatures in the database. In this framework, the signature-based technique is implemented as the first line of defense against malware attacks that would affect computer operation. This technique was chosen because it has been very effective in detecting well-known malware. The static analysis method also has less run-time overhead than the dynamic analysis method. This technique is therefore included in the framework to improve the efficiency of computer operation.

B. GA detection
GA is a popular technique commonly used as a learning approach to solve computational research problems. This machine learning technique applies genetic programming to learn from an evolving population. The data representation used in this algorithm is a bit string value known as a chromosome. GA selects chromosomes from a population based on a selection value. New chromosomes are then produced by combining the bit strings of existing chromosomes in the current population. How a solution is represented depends on the nature of the problem. The two basic operations in GA are crossover and mutation. The GA technique is implemented in this framework to handle polymorphic malware and new types of malware. Polymorphic malware that combines previous characteristics can be detected through this technique because it works in the same way as the crossover and mutation operations in GA. Hiding techniques normally used by malware attackers can also be detected using this technique because it not only filters the content but also learns the malware behavior.
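The crossover and mutation operations of the GA detection component can be illustrated with a minimal sketch on bit-string chromosomes. This is not the authors' implementation: the paper does not specify a fitness function, so the `fitness` function here (number of bits matching a `target` pattern), the truncation selection in `evolve`, and all parameter values are assumptions chosen only for illustration.

```python
import random
from typing import List, Tuple


def crossover(parent_a: str, parent_b: str, point: int) -> Tuple[str, str]:
    """Single-point crossover: the two parents swap tails at `point`."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def mutate(chromosome: str, rate: float, rng: random.Random) -> str:
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome
    )


def fitness(chromosome: str, target: str) -> int:
    """Hypothetical selection value: count of positions equal to `target`."""
    return sum(a == b for a, b in zip(chromosome, target))


def evolve(population: List[str], target: str,
           generations: int = 50, rate: float = 0.02, seed: int = 0) -> str:
    """Evolve a population (at least 4 chromosomes) and return the fittest."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        # Selection: keep the fitter half unchanged as parents.
        ranked = sorted(population, key=lambda c: fitness(c, target),
                        reverse=True)
        parents = ranked[: size // 2]
        # Reproduction: refill the population with mutated offspring.
        children: List[str] = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            point = rng.randrange(1, len(a))
            for child in crossover(a, b, point):
                children.append(mutate(child, rate, rng))
        population = parents + children[: size - len(parents)]
    return max(population, key=lambda c: fitness(c, target))
```

Because the fitter half of each generation is carried over unchanged, the best selection value in the population never decreases from one generation to the next. A real detector would replace `fitness` with a measure derived from observed malware behavior.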
C. S-based generator
A signature is a string pattern that uniquely identifies and characterizes a malware. Currently, signatures are created by forensic experts after a new malware sample has been found. A signature is created based on the behavior of the malware. Each anti-malware product must create its own signatures, and they must be encrypted to avoid access errors if more than one anti-malware product is installed on one computer. Once a signature has been created, it is added to the signature database. Computer users need an updated copy of the signatures in their anti-malware database to be properly protected against the latest malware threats. Basically, a signature pattern is 16 bytes, which is usually a long enough string to detect 16-bit malware code, for example:
0410 B801 02CE 07BB 0002 33C9 8BD1 419C
The signature generator captures the malware behavior that is identified and analyzed by the GA detection module. The signature pattern is then generated and added to the malware database as a signature for signature-based detection. This module is proposed in this framework to replace the forensic expert's tasks.
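As a minimal sketch of how a generated signature could be stored and then used by signature-based detection, the following example parses the 16-byte pattern quoted above, adds it to an in-memory database, and scans a byte stream for it. The `SignatureDatabase` class, its method names, and the label `example-signature` are hypothetical, and the plain substring search only stands in for whatever searching algorithm a real scanner would use.

```python
from typing import Dict, List

# The 16-byte example pattern quoted in the text.
EXAMPLE_SIGNATURE = "0410 B801 02CE 07BB 0002 33C9 8BD1 419C"


def parse_signature(hex_text: str) -> bytes:
    """Convert a space-separated hex pattern into raw bytes."""
    return bytes.fromhex(hex_text.replace(" ", ""))


class SignatureDatabase:
    """Hypothetical in-memory store mapping a label to a byte pattern."""

    def __init__(self) -> None:
        self._signatures: Dict[str, bytes] = {}

    def add(self, label: str, hex_text: str) -> None:
        """Register a new signature, e.g. one emitted by a generator."""
        self._signatures[label] = parse_signature(hex_text)

    def scan(self, data: bytes) -> List[str]:
        """Return the labels of all signatures found in `data`."""
        return sorted(label for label, pattern in self._signatures.items()
                      if pattern in data)


db = SignatureDatabase()
db.add("example-signature", EXAMPLE_SIGNATURE)

# A byte stream that embeds the pattern, and one that does not.
infected = b"\x00" * 8 + parse_signature(EXAMPLE_SIGNATURE) + b"\xff"
clean = b"ordinary file content"
```

Calling `db.scan(infected)` reports the stored label, while `db.scan(clean)` returns an empty list. Encryption of the stored signatures, which the text mentions, is left out of this sketch.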
IV. CONCLUSIONS
In this paper, we have proposed a new framework for malware detection that combines the signature-based technique and the GA technique. The framework will protect computer systems against both well-known and new malware attacks. This is an important contribution because zero-day malware attacks can be identified using the GA technique, and signatures will be created automatically by the generator so that signature-based detection can use them for future reference. To improve the efficiency and performance of computer operation, this research will be continued by implementing an integrated tool that brings together all three main components of this framework.

ACKNOWLEDGMENT
This work was supported by Short-term Grant No.304/PKOMP/639021, School of Computer Science, Universiti Sains Malaysia, Penang, Malaysia.

REFERENCES
[1] M. D. Preda, M. Christodorescu, S. Jha and S. Debray, “A semantics-based approach to malware detection,” ACM Trans. Program. Lang. Syst. 30, 5, Article 25, August 2008.
[2] L. Hanno, “Framework for Malware Resistance Metrics,” QoP’06 in ACM, pp. 39-44, 2006.
[3] M. Christodorescu and S. Jha, “Testing Malware Detectors,” ISSTA’04 in ACM, pp. 34-44, 2004.
[4] H. Yin and D. Song, “Panorama: Capturing System-wide Information Flow for Malware Detection and Analysis,” CCS’07 in ACM, pp. 116-127, 2007.
[5] Microsoft Corporation, “The Antivirus Defense-in-depth Guide,” 2004, in press.
[6] M. Apel, C. Bockermann and M. Meier, “Measuring Similarity of Malware Behavior,” SICK 2009 in IEEE Xplore, pp. 891-898, October 2009.
[7] G. McGraw and G. Morrisett, “Attacking malicious code: report to the Infosec research council,” IEEE Software, 17(5):33-41, Sept./Oct. 2000.
[8] S. Noreen, S. Murtaza, M. Z. Shafiq and M. Farooq, “Evolvable Malware,” GECCO’09 in ACM, pp. 1569-1576, July 2009.
[9] F. Hsu, H. Chen, T. Ristenpart, J. Li and Z. Su, “Back to the Future: A Framework for Automatic Malware Removal and System Repair,” Proc. IEEE Annual Computer Security Applications Conference (ACSAC'06), IEEE Press, July 2006: 0-7695-2716-7/06
[10] Y. Zhou and M. Inge, “Malware Detection Using Adaptive Data Compression,” AISec’08 in ACM, pp. 53-59, October 2008.
[11] Y. Ye, Q. Jiang and W. Zhuang, “Associative Classification and Post-processing Techniques used for Malware Detection,” IEEE Xplore, 2009.
[12] S. Mehdi, A. K. Tanwani and M. Farooq, “IMAD: In-Execution Malware Analysis and Detection,” GECCO’09 in ACM, pp. 1553-1560, July 2009.
[13] S. M. Tabish, M. Z. Shafiq and M. Farooq, “Malware Detection using Statistical Analysis of Byte-Level File Content,” CSI-KDD’09, pp. 23-31, June 2009.
[14] Y. Ye, D. Wang, T. Li and D. Ye, “IMDS: Intelligent Malware Detection System,” KDD’07 in ACM, pp. 1043-1047, August 2007.