Scientific Research and Essays Vol. 7(22), pp. 2031-2036, 14 June, 2012 Available online at http://www.academicjournals.org/SRE DOI: 10.5897/SRE12.001 ISSN 1992-2248 ©2012 Academic Journals
Full Length Research Paper
Malware detection based on evolving clustering method for classification
Altyeb Altaher1*, Supriyanto2, Ammar ALmomani1, Mohammed Anbar1 and Sureswaran Ramadass1
1 National Advanced IPv6 Centre, Universiti Sains Malaysia, Malaysia.
2 University of Sultan Ageng Tirtayasa (UNTIRTA), Indonesia.
Accepted 30 May, 2012
Malware is a computer program that can replicate itself and damage data files. Faster computers and networks have accelerated the spread of viruses. To avoid infection and data loss, an efficient and effective detection method is needed. This paper proposes an approach for malware detection based on the evolving clustering method. The approach combines the information gain method as a feature selector with the evolving clustering method as an evolving learning classifier. In the experiments, the proposed approach detected malware with a false positive rate of 1% and an accuracy of 99%.
Key words: Malware detection, network security, intelligent classification, information gain.
INTRODUCTION
The Internet has grown very fast and now touches every aspect of human life. However, Internet technology has also given rise to cybercrime that harms organizations as well as individuals. An attacker can create malicious software (malware) to disrupt other people's activities. Malware is among the most serious threats to computer systems, and its spread may have a great impact on the computer world (Nachenberg, 1997; Thimbleby et al., 1998). It can also pose a high risk to business, the economy and other aspects of life (Wierman and Marchette, 2004). Malware includes viruses, Trojans, worms and many other kinds of harmful software (Aron et al., 2002; Perdisci et al., 2008). The amount of malware is increasing continuously; for example, in February 2011 there were 252,187,961 malicious programs detected (Zakorzhevsky, 2011). The Windows application programming interface (API) calling sequence indicates the behavior of a particular portable executable (PE) application (Wang et al., 2009).
*Corresponding author. E-mail: [email protected].
Malware attacks these PE applications to find the API addresses and control the execution of the victim applications (Willems et al., 2007). Once the malware obtains the API addresses of a normal application, it changes certain fields to redirect the execution of the application to its own code. After finishing its task, the malware returns execution control to the normal application (Clemens, 2009). Many researchers have used the API call sequence to detect and classify malware (Jajodia, 2009; Father, 2004; Mori, 2004). Khaled et al. (2010) used the API call sequence to classify malware; after comparing several algorithms, they concluded that false alarms still occur. Sami et al. (2010) used data mining techniques to analyze API call sequences and found that these techniques minimize false alarms. Ye et al. (2008) used the API call sequence to detect infected PE files. The main objective of this paper is to propose an approach for malware detection based on the evolving clustering method. The proposed approach combines the information gain method as a feature selector with the evolving clustering method as an evolving learning classifier. This paper is organized as follows: An
introduction and related work are presented first, followed by the proposed approach for malware detection based on the evolving clustering method. Thereafter, the experiments and their results are presented, followed by the conclusion.
PROPOSED APPROACH FOR MALWARE DETECTION BASED ON EVOLVING CLUSTERING METHOD FOR CLASSIFICATION
The proposed approach for malware detection consists of three phases. In the first phase, the approach analyzes the Windows API execution sequences called by the malware PE files. In the second phase, the most informative features are selected. In the third phase, the evolving clustering method performs the detection (Figure 1).
Figure 1. The overall intelligent approach for malware detection based on evolving clustering method for classification. Stages shown: Data Set; Malware Portable Executable Analyzer; Features Selection using Information Gain; Malware Detection using Evolving Clustering Method; Malware Detection Report.
The malware portable executable (PE) file analyzer
The function of the malware PE analyzer is to analyze the malware PE files and extract the API calls imported by each file. The PE analysis reflects the behavior of the PE file, which explains why malware programmers do their best to learn the PE file structure (Father, 2004). The content of a PE file is divided into sections, and each section stores data with common attributes. The sections of a PE file are illustrated in Figure 2. The PE analyzer extracts the APIs called by a PE file from the Import Address Table (IAT), which contains pointers to all imported functions and dynamic-link libraries (DLLs) (Ye et al., 2008).
Features selector
Informative feature selection is an important step for accurate malware detection. In the proposed approach we adopt correlation measures based on the information-theoretic concept of entropy, a measure of the uncertainty of a random variable. The classification power of each feature is derived by calculating its information gain (IG) from the number of its appearances in the malicious class and in the benign class. Features with negligible information gain can then be removed to reduce the number of features and speed up the classification process. The entropy of a variable X is defined as:
$$H(X) = -\sum_{i} P(x_i) \log_2 P(x_i) \qquad (1)$$
And the entropy of X after observing values of another variable Y is defined as:
$$H(X \mid Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i \mid y_j) \log_2 P(x_i \mid y_j) \qquad (2)$$
where $P(x_i)$ are the prior probabilities for the values of X, and $P(x_i \mid y_j)$ are the posterior probabilities of $x_i$ given the values $y_j$ of Y. The amount by which the entropy of X decreases reflects the additional information about X provided by Y and is called information gain (Lei and Huan, 2003). The information gain is given by:
$$IG(X \mid Y) = H(X) - H(X \mid Y) \qquad (3)$$
Table 1 provides a ranking of some features selected by the proposed approach. The higher the information gain, the more useful a feature is. Based on the information gain ranking, the feature "ExitProcess" has the best distinguishing power, whereas the feature "LoadStringW" has the least.
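The entropy-based information gain described above can be illustrated with a minimal Python sketch. It assumes each executable is represented by the set of API names it imports plus a class label, and it treats the class label as X and the presence of one API as Y. The sample data and the names `entropy`, `information_gain`, `samples` and `ranked` are made up for illustration and are not taken from the paper's data set or implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """H(labels) minus H(labels | feature) for one discrete feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional


# Hypothetical toy data: (imported API names, class label) per executable.
samples = [
    ({"ExitProcess", "malloc"}, "malware"),
    ({"ExitProcess"}, "malware"),
    ({"malloc"}, "benign"),
    ({"malloc", "LoadStringW"}, "benign"),
]
labels = [label for _, label in samples]
apis = sorted(set().union(*(imports for imports, _ in samples)))

# Score each API by the information gain of its presence/absence indicator.
scores = {
    api: information_gain([int(api in imports) for imports, _ in samples], labels)
    for api in apis
}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

In this toy data the presence of one API separates the two classes completely, so it ranks first; features whose score is close to zero would be the candidates for removal.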
Table 1. Malware classification features with information gain ratio.
Feature | Function | Information gain
ExitProcess | Ends a process and all its threads. | 0.14930161545
memcpy | Copies n bytes between two memory areas; if there is overlap, the behavior is undefined. | 0.1467147213
_wcsicmp | Compares one string to another, without case sensitivity. | 0.1441795855
RegOpenKeyExW | Opens the specified registry key. | 0.14136526288
Free | Used to return allocated memory to the system. | 0.1235470557
GetLastError | Retrieves the calling thread's last-error code value. | 0.123511293
malloc | Allocates size bytes and returns a pointer to the allocated memory. | 0.120886020
_wcsnicmp | Compares characters of two strings without regard to case. | 0.120154210
RegQueryValueE | Retrieves the type and data for the specified value name associated with an open registry key. | 0.114342341
GetCommandLine | Retrieves the command-line string for the current process. | 0.1143330390
LoadStringW | Reads a string resource into an existing object. | 0.11207477
Figure 2. A typical structure of a (PE) file.
Evolving clustering method for classification
The evolving clustering method for classification is a classifier that divides a data set into a number of classes in the n-dimensional input space by evolving rule nodes. Each rule node $r_j$ is associated with a class through a label. Its receptive field R(j) covers a part of the n-dimensional space around the rule node (Kasabov, 2001, 2003; Kasabov and Song, 2002). As an online clustering method, the evolving clustering method for classification (ECM) algorithm performs a one-pass partitioning of the input space and evolves in an online mode. The method works in the following steps:
Step 1: Create the first cluster $C_1$ by taking the position of the first example from the input data stream as the first prototype $p_1$, and setting its cluster radius $r_1$ to 0.
Step 2: If all examples from the data stream have been processed, the clustering process finishes. Otherwise, the current input example $x_i$ is taken and the normalized Euclidean distance $d_{ij} = \|x_i - p_j\|$, $j = 1, \dots, l$, between this example and all already created prototypes $p_j$ is calculated.
Step 3: If there is a cluster $C_m$ with prototype $p_m$, cluster radius $r_m$ and distance value $d_{im}$ such that (i) $d_{im} = \|x_i - p_m\| = \min_j d_{ij} = \min_j \|x_i - p_j\|$, for $j = 1, \dots, l$, and (ii) $d_{im} < r_m$, the current example $x_i$ is considered to belong to this cluster. In this case, no new cluster is created and no existing cluster is updated. The algorithm then returns to Step 2. Otherwise:
Step 4: Find the cluster $C_a$ with prototype $p_a$, cluster radius $r_a$ and distance value $d_{ia}$ that has the minimum value $s_{ia} = d_{ia} + r_a = \min_j s_{ij}$, for $j = 1, \dots, l$.
Step 5: If $s_{ia}$ is greater than $2 \times D_{thr}$, where $D_{thr}$ is the distance threshold of the method, the example $x_i$ does not belong to any existing cluster. A new cluster is created in the same way as described in Step 1. The algorithm then returns to Step 2. Otherwise:
Step 6: If $s_{ia}$ is not greater than $2 \times D_{thr}$, the cluster $C_a$ is updated by moving its prototype $p_a$ and increasing its radius $r_a$. The updated radius $r_a^{new}$ is set equal to $s_{ia}/2$, and the new prototype $p_a^{new}$ is located on the line connecting the input vector $x_i$ and the old prototype $p_a$, so that the distance from $p_a^{new}$ to $x_i$ is equal to $r_a^{new}$. The algorithm then returns to Step 2.
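The clustering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' MATLAB implementation: the function name `ecm_cluster` and the parameter `dthr` are hypothetical, plain Euclidean distance is used on the assumption that features are already scaled to [0, 1], and the handling of class labels is left out because the text does not describe it.

```python
import math


def ecm_cluster(examples, dthr):
    """One-pass evolving clustering; returns a list of {'center', 'radius'} dicts."""
    clusters = []
    for x in examples:
        if not clusters:
            # The first example becomes the first prototype with radius 0.
            clusters.append({"center": list(x), "radius": 0.0})
            continue

        # Distance from the current example to every existing prototype.
        dists = [math.dist(x, c["center"]) for c in clusters]

        # If the example lies inside its nearest cluster, nothing changes.
        nearest = min(range(len(clusters)), key=lambda j: dists[j])
        if dists[nearest] < clusters[nearest]["radius"]:
            continue

        # Otherwise pick the cluster minimising s = distance + radius.
        a = min(range(len(clusters)), key=lambda j: dists[j] + clusters[j]["radius"])
        s = dists[a] + clusters[a]["radius"]

        if s > 2 * dthr:
            # Too far from every cluster: start a new one at this example.
            clusters.append({"center": list(x), "radius": 0.0})
            continue

        # Update cluster a: radius becomes s / 2 and the prototype moves along
        # the line towards x so that it ends up new_radius away from x.
        new_radius = s / 2
        if dists[a] > 0:  # guard: a duplicate point gives no direction to move in
            scale = new_radius / dists[a]
            old = clusters[a]["center"]
            clusters[a]["center"] = [xi + (pi - xi) * scale for xi, pi in zip(x, old)]
        clusters[a]["radius"] = new_radius
    return clusters


# Hypothetical 2-D points: two close together, one far away.
points = [(0.10, 0.10), (0.15, 0.10), (0.90, 0.90)]
clusters = ecm_cluster(points, dthr=0.2)
for c in clusters:
    print(c)
```

With these points the second example enlarges the first cluster, while the third is farther than twice the threshold from it and therefore starts a cluster of its own. A larger `dthr` yields fewer, wider clusters.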
EXPERIMENTAL RESULTS
To conduct our experiments, we collected 2700 malicious Windows executables and 2300 benign Windows executables. Of the malicious executables, 1000 were randomly selected from the VX Heavens malware collection (VX Heavens, 2011) and the other 1700 were collected from the Internet between July 2010 and July 2011. All experiments were conducted on Windows 7 with an Intel Quad CPU at 2.66 GHz and 4 GB of RAM. MATLAB version 7 was used for computation and analysis.
The proposed evolving intelligent malware detection approach works in two phases: a learning phase, followed by a testing phase based on the knowledge gained during learning. The accuracy of the learning phase was 100%, whereas the accuracy of the testing phase was 99%. The approach uses rule extraction in the learning phase to obtain the results in the testing phase. It generates fuzzy rules based on the evolving clustering and is capable of detecting different types of malware files at high speed with low computation cost. The following are samples of the rules generated by our approach:
Rule 1: If Centre is [0.13, 0.17] and Radius is 0.27 Then Class is [1] (94 samples in cluster)
Rule 2: If Centre is [0.10, 0.42] and Radius is 0.19 Then Class is [1] (15 samples in cluster)
Rule 3: If Centre is [0.27, 0.21] and Radius is 0.17 Then Class is [1] (13 samples in cluster)
Rule 4: If Centre is [0.38, 0.20] and Radius is 0.11 Then Class is [1] (1 sample in cluster)
Rule 5: If Centre is [0.32, 0.28] and Radius is 0.11 Then Class is [1] (3 samples in cluster)
Rule 6: If Centre is [0.23, 0.34] and Radius is 0.00 Then Class is [2] (0 samples in cluster)
Rule 7: If Centre is [0.75, 0.37] and Radius is 0.43 Then Class is [2] (8 samples in cluster)
Rule 8: If Centre is [0.35, 0.37] and Radius is 0.00 Then Class is [2] (0 samples in cluster)
Rule 9: If Centre is [0.85, 0.30] and Radius is 0.16 Then Class is [2] (23 samples in cluster)
Rule 10: If Centre is [0.89, 0.64] and Radius is 0.37 Then Class is [2] (49 samples in cluster)
Figure 3. The clusters of malware files and normal files generated by the proposed approach.
Figure 3 visualizes the process of the proposed evolving intelligent approach in detecting malware files based on the analysis of the API functions called by the executable files. Figure 3 also shows that the proposed approach finds five separate clusters that differ from the other clusters; these five clusters are malware files, whereas the other clusters on the left side of Figure 3 represent the normal files.
To evaluate the performance of the proposed evolving intelligent malware detection approach, we compared its results with other detection approaches, namely a neural network, naïve Bayes and a decision tree. Table 2 and Figure 4 show the results of this comparison. The proposed approach outperforms the other classification approaches, with the highest accuracy of 99% and the lowest false positive rate of 1%. It achieved the highest detection accuracy because it is able to evolve over time and learn to detect new, previously unseen malware. False positives (benign files flagged as malicious) and false negatives (malicious files flagged as benign) are among the most important problems faced by malware detection systems today. Figure 5 shows that our malware detection approach produces far fewer false positives than the other classification approaches.
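For readers who want to reproduce this kind of comparison, the following Python sketch shows the standard confusion-matrix definitions of true positive rate, false positive rate and accuracy. The paper does not state which exact definitions it used, so these formulas are an assumption, and the counts in the example are hypothetical rather than the paper's results; `detection_rates` is an illustrative name.

```python
def detection_rates(tp, fp, tn, fn):
    """Standard detection metrics from confusion-matrix counts.

    tp: malicious files flagged as malicious
    fp: benign files flagged as malicious
    tn: benign files flagged as benign
    fn: malicious files flagged as benign
    """
    return {
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for a test set of 100 malicious and 100 benign files.
rates = detection_rates(tp=95, fp=2, tn=98, fn=5)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Reporting the false positive rate and the false negative rate (1 minus the true positive rate) separately makes both kinds of error visible, which a single accuracy figure does not.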
Figure 4. Accuracy (%) of the proposed intelligent detection approach compared with other classifiers.
Table 2. Comparison between the proposed intelligent approach and other classification approaches.
Classification method | True positive (%) | False positive (%) | Total accuracy (%)
Naïve Bayes | 90.65 | 9.35 | 90.65
Decision Tree | 95.19 | 4.81 | 95.19
Neural network | 96.42 | 3.58 | 96.42
The proposed evolving intelligent approach | 99 | 1 | 99
Figure 5. False positive rate (%) of the proposed intelligent detection approach compared with other approaches.
Conclusion
In this paper, an evolving intelligent approach, which combines an evolving clustering method for classification with the information gain method for feature selection, was proposed for malware detection in dual stack IPv4/IPv6 networks. A controlled environment of a dual stack IPv4/IPv6 network was deployed to conduct a comprehensive experiment to validate the proposed intelligent malware detection approach. The experiments demonstrate that the proposed evolutionary approach for malware detection successfully evolves and detects known and new, previously unseen malware with a high detection accuracy of 99% and a low false positive rate of 1%.
ACKNOWLEDGEMENT This research is supported by the National Advanced IPv6 Center of Excellence (NAv6) – Universiti Sains Malaysia.
REFERENCES
Aron J, Leary M, Gove RA, Azadegan S, Schneider MC (2002). "The benefits of a notification process in addressing the worsening computer virus problem: results of a survey and a simulation model". Comput. Secur., 21: 142-163.
Clemens KPMC (2009). Effective and Efficient Malware Detection at the End Host. 18th USENIX Security System Montreal, Canada, USENIX Security System.
Father H (2004). "Hooking Windows API - Technics of hooking API functions on Windows." COD. Break-J., 1(2).
Jajodia S (2009). Identifying Malicious Code Through Reverse Engineering, Advances in Information Security, A. Singh. USA, SpringerLink, p. 44.
Kasabov N (2003). Evolving connectionist systems. Methods and applications in bioinformatics, brain study and intelligent machines. Springer Verlag Heidelberg, New York, London, 2002.
Kasabov N, Song Q (2002). "DENFIS: Dynamic, evolving neural-fuzzy inference systems and its application for time-series prediction". IEEE Trans. Fuzzy Syst., 10(2): 144-154.
Kasabov N (2001). "Evolving fuzzy neural networks for on-line supervised/unsupervised, knowledge–based learning". IEEE Trans. SMC – part B, Cybern., 31(6): 902-918.
Khaled A, Abdul-Kader H, Housam R, Weiss H, Davis D, Gregory R, Gebretsadik T, Shintani A (2010). "Artificial Immune Clonal Selection Classification Algorithms for Classifying Malware and Benign Processes Using API Call Sequences." IJCSNS, 10(4): 31.
Lei Y, Huan L (2003). Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. Proceed. 19th Int. Conf. Mach. Learn., pp. 856-863.
Mori A (2004). Detecting Unknown Computer Viruses – A New Approach. Lecture Notes Comput. Sci. Springer Berlin / Heidelberg, 3233: 226-241.
Nachenberg C (1997). Computer virus-Coevolution, Commun. ACM, 40: 46-51.
Perdisci R, Lanzi A, Lee W (2008). "Classification of packed executables for accurate computer virus detection". Patt. Recog. Lett., 29: 1941-1946.
Sami A, Yadegari B, Hossein R (2010). Malware detection based on mining API calls. ACM, pp. 22-26.
Thimbleby H, Anderson S, Cairns P (1998). A framework for modelling trojans and computer virus infection, Comput. J., 41(1998) 444-458.
VX Heavens malware collection (2011): http://vx.netlux.org/vl.php.
Wang C, Pang J, Zhao R, Liu X (2009). "Using API Sequence and Bayes Algorithm to Detect Suspicious Behavior", 2009. Int. Conf. Commun. Software Networks, pp. 544-548.
Wierman JC, Marchette DJ (2004). Modeling computer virus prevalence with a susceptible-infected-susceptible model with reintroduction, Comput. Stat. Data Anal., 45: 3-23.
Willems C, Holz T, Felix F (2007). "Toward automated dynamic malware analysis using cw sandbox." IEEE Security & Privacy. pp. 32-39.
Ye Y, Wang D, Gao Y, Chen G, Gao H, Dai X (2008). "An intelligent PE malware detection system based on association mining." J. Comput. Virol., 4(4): 323-334.
Zakorzhevsky (2011). Monthly Malware Statistics. Available from: http://www.securelist.com/en/analysis/204792182/Monthly_Malware_Statistics_June_2011