Ensemble Based Categorization and Adaptive Model for Malware Detection

Muhammad Najmi Ahmad Zabidi
Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, 53100 Gombak, Kuala Lumpur, Malaysia
[email protected]

Mohd Aizaini Maarof† and Anazida Zainal‡
Information Assurance and Security Research Group, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Skudai, Johor Bahru, Malaysia
[email protected]† , [email protected]‡

Abstract—Malware, a term derived from the two words “malicious software”, has caused many problems for computer users throughout the world. Previously known by many names (trojans, viruses, worms, dialers and many others), this potentially unwanted software is now simply labeled as malware. Malware is software that works like any other benign software but is designed to accomplish the goals of its writers. It is written to exploit vulnerabilities in the target victim’s operating system or applications. Once primitive and easy to detect, it has evolved into sophisticated and professionally written software. Current malware detection methods involve string search algorithms based on pattern detection, which may include signature-based methods. In this paper, we propose an ensemble categorization that uses ensemble classification and clustering together with an adaptive learning model.

Keywords: ensemble; adaptive; malware; soft computing; machine learning

I. INTRODUCTION

Malware is derived from the words “malicious software” and is meant to fulfil its creator’s intention: gaining illegitimate access to a system and executing a pre-programmed task later on. The resulting damage may take the form of file deletion, system failure, encryption (locking) of certain files and others. Our lives have been simplified by the existence of computers and the Internet. Things that were previously done through physical access are now done from home with computers. Things become complex, however, when criminals are also sitting in front of their computers and writing potentially harmful software: malware. In the underground market, there are “malware creation kits” which are written by criminals (the malware authors) for other criminals (the buyers) to create their own malware with the configuration they want. This lowers the entry barrier for malware creation, and is possibly the reason why the number of malware samples has reached the thousands. Anti-malware software can handle much of this, fortunately. However, there is still room for improvement, since detection still produces false negatives, which means that malware may stay on a computer without the user ever realizing that the computer is infected.

Anti-malware research involves several associated sub-areas. “Malware detection” is in the front line, determining the maliciousness of a binary. “Malware analysis” involves two sub-components: static analysis of a binary to find malicious strings or code inserted in it, and dynamic analysis, which monitors the behavior of a binary by executing it. “Malware classification” denotes the effort to find similar variants (the malware family) of a piece of malware, if such variants exist. According to the detailed statistics taken from [1], the majority of attacks came from Portable Executables (PE), which are associated with Microsoft Windows based programs.

Figure 1. File formats [1]

By referring to Figure 1, we can see that the Microsoft Windows operating system, which is known for its popularity, is also the target of most attacks, in this case malware. In this paper, we propose a mixture of adaptive categorization with ensemble based malware detection classification to detect the existence of malicious elements in a binary.

II. RELATED WORKS

Research in malware detection involves, among others, statistical analysis [2], [3], [4], data mining [5], [6], [7], machine learning [8], [9], [10], [11] and monitoring of function calls [12], [13], [14]. The research in [15] used an ensemble of clustering algorithms, applying hybrid hierarchical clustering together with the well-known TF-IDF algorithm. Static analysis of malware involves reverse engineering the malware binaries, and has played a very important role since the beginning of malware detection efforts. However, static analysis has its own limits [16]; for example, opaque constants, a binary code obfuscation method, can be used to evade the analysis. Research was also done on a malware normalizer [17], where the researchers came up with a method to undo the obfuscation applied to a particular malware. The malware normalization process then makes the detection phase easier. Later, researchers started to concentrate on dynamic analysis techniques, which involve emulating vulnerable services for deception using honeypots, and behavior monitoring using a sandbox [18], which also produces a human-readable report. Syntactic analysis, which is mostly used by antivirus products, is limited in detecting polymorphism and metamorphism. Hence came semantic analysis [19], which understands the semantics of instructions in malware and successfully handles obfuscated binaries. This method is useful for malware variants that share common properties. The method was, however, challenged by [16]. Semi-supervised learning for malware detection was introduced in [20], where the authors provide a new paradigm of malware detection using a soft computing approach.

III. PROBLEM BACKGROUND

Figure 2 shows the latest trend of malware and the current effort in handling malware attacks.
As shown in Figure 2, the process of detecting malware is tedious, since creating malware is easy, and a predefined binary may seem “different” only because of a “small portion of change”. The routine process of malware detection needs automation for separating benign and malicious binaries. [21] claimed that signature-based detection is expensive and slow, and that heuristic detection, despite giving protection against unknown threats, is “inefficient and inaccurate”. In addition, [21] wrote that behavior-based detection, although praised, suffers from complex algorithm realization. The automatic classification proposed by [22] did not use an ensemble based model, and the application of ensemble and adaptive learning could contribute to detection accuracy. The feature selection for malware proposed by [23] can also be improved by ensemble based classification and an adaptive learning model. Apart from that, offline automated detection is needed, given that the number of malware samples is very large. Hence, the task can be done using an automated malware classifier, which would segregate benign programs from malicious ones.

Figure 2. Problem background

IV. FEATURE SELECTION

A. Features

In malware detection, features can be defined as anything that represents maliciousness in a particular piece of software. That is, if software contains certain features, or a pair of features that indicate maliciousness when they exist concurrently, then the software is regarded as malicious. The following are possible strings of malicious patterns in software: CreateMutex, NtasdfCreateFile, call shell32, advapi32.RegOpenKey, KERNEL32.CreateProcess,

shdocvw, gethostbyname, advapi32.RegCreate, advapi32.RegSet, http://, OutputDebugString, FindWindow, and finally IsDebuggerPresent.

Regarding the co-existence of features, take the above string “http://” as an example. This string may exist in benign software as well. However, if the software tries to contact blacklisted Internet domains, it is possibly malicious. Kolter and Maloof [24] pointed out several ways to extract text from binaries:
1) extracting three types of resources from Windows executables:
a) a list of Dynamically Linked Libraries (DLLs)
b) function calls from DLLs
c) the number of different system calls from DLLs
2) the UNIX strings command
3) the hexdump utility

Feature selection, as a data reduction or dimension reduction method, is the earliest step prior to learning and categorization. It involves accumulating features and removing redundant or insignificant ones. According to [25], traditional methods for feature selection involve:
1) Filter: the process of selecting the optimal feature subset by analyzing the structure of the data.
2) Wrapper: the process of searching through all possible feature subsets and using the learning algorithm to test the suitability of each feature subset.
3) Hybrid: a combination of the filter and wrapper methods above.

V. ENSEMBLE BASED CLASSIFICATION AND CLUSTERING

Learning algorithms vary, and each has different capabilities for different kinds of data. Ensembling the learning algorithms adds value to the learning process. Given the right combination, it can enhance the categorization of the given data. The ensemble learning approach combines a set of base learners, boosting “base learners” (which are also weak learners) generated from base learning algorithms such as neural networks and decision trees. Generally, an ensemble performs better than a single learner [26].
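As an illustration of how base learners can be combined, the following is a minimal sketch of weighted or unweighted voting, one common way to build an ensemble. It is not the authors' implementation, and it does not implement boosting. The three rule-based learners, their names and the weights are hypothetical stand-ins for trained classifiers such as neural networks or decision trees; only the feature strings are taken from Section IV-A.

```python
from collections import Counter
from typing import Callable, FrozenSet, Optional, Sequence

# A base learner maps the set of strings found in a binary to a label.
Classifier = Callable[[FrozenSet[str]], str]


def uses_network(features: FrozenSet[str]) -> str:
    """Hypothetical weak learner: flags binaries that resolve host names."""
    return "malware" if "gethostbyname" in features else "benign"


def checks_debugger(features: FrozenSet[str]) -> str:
    """Hypothetical weak learner: flags binaries that probe for a debugger."""
    return "malware" if "IsDebuggerPresent" in features else "benign"


def edits_registry(features: FrozenSet[str]) -> str:
    """Hypothetical weak learner: flags binaries that write registry values."""
    return "malware" if "advapi32.RegSet" in features else "benign"


def vote(classifiers: Sequence[Classifier],
         features: FrozenSet[str],
         weights: Optional[Sequence[float]] = None) -> str:
    """Combine base-learner predictions by (optionally weighted) voting."""
    if weights is None:
        weights = [1.0] * len(classifiers)  # unweighted voting
    tally: Counter = Counter()
    for classifier, weight in zip(classifiers, weights):
        tally[classifier(features)] += weight
    # The label with the largest total weight wins.
    return tally.most_common(1)[0][0]


ENSEMBLE = [uses_network, checks_debugger, edits_registry]
sample = frozenset({"gethostbyname", "IsDebuggerPresent"})
print(vote(ENSEMBLE, sample))  # two of three learners say "malware"
```

With an odd number of base learners and two labels, unweighted voting cannot tie. Passing `weights` lets a more trusted learner outvote the others, which is the weighted variant.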
An ensemble consists of a set of individually trained classifiers (such as neural networks or decision trees) whose predictions are combined when classifying novel instances. Previous research has shown that an ensemble is often more accurate than any of the single classifiers in the ensemble [27]. In addition to the said description, according to [28],

“An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way (typically by weighted or unweighted voting) to classify new examples. One of the most active areas of research in supervised learning has been to study methods for constructing good ensembles of classifiers. The main discovery is that ensembles are often much more accurate than the individual classifiers that make them up.”

A. Learning algorithms

Supervised algorithms work with labeled data, so prior knowledge is available for the analysis. Unsupervised algorithms, on the other hand, deal with data about which nothing is known beforehand. We believe that using supervised algorithms on the partially known features of malware can help the classification between benign and malicious software. In addition, clustering can help where there is ambiguity in the data, that is, where we have never encountered such a sample or such features in a sample. Clustering is done by calculating the distances between malware profiles that were trained beforehand (in the learning phase) and the software under test. The tested programs are then segregated according to the profiles, which are mapped using the distance [29], [30]. [31] provides an analysis of both classification, using BayesNet, NNge, Random Forest and Rotation Forest, and clustering, using DBSCAN, Expectation Maximization (EM) and X-means.

Semi-supervised learning is a newer area that exploits unlabeled data using the information gained from labeled data. Data handled by supervised methods (classification) is labeled, while data handled by unsupervised methods (clustering) is unlabeled. According to [32], semi-supervised learning processes a large amount of unlabeled data together with the labeled data in order to build better classifiers.
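The distance-based mapping of a tested program to trained malware profiles can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of [29] or [30]: the paper does not name a distance measure, so Jaccard distance over feature-string sets is assumed here, and the profile names, their contents and the `max_distance` threshold are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Return 1 - |a & b| / |a | b|; 0.0 means identical feature sets."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def nearest_profile(sample: FrozenSet[str],
                    profiles: Dict[str, FrozenSet[str]],
                    max_distance: float = 0.6) -> Tuple[str, float]:
    """Map a sample to its closest profile, or to "unknown" if none is close."""
    best_name, best_distance = "unknown", float("inf")
    for name, profile in profiles.items():
        distance = jaccard_distance(sample, profile)
        if distance < best_distance:
            best_name, best_distance = name, distance
    # Samples too far from every known profile stay unassigned (assumed rule).
    if best_distance > max_distance:
        return "unknown", best_distance
    return best_name, best_distance


# Hypothetical profiles built from the feature strings in Section IV-A.
PROFILES = {
    "downloader": frozenset({"http://", "gethostbyname", "KERNEL32.CreateProcess"}),
    "anti_debug": frozenset({"IsDebuggerPresent", "OutputDebugString", "FindWindow"}),
}

name, distance = nearest_profile(frozenset({"http://", "gethostbyname"}), PROFILES)
print(name, round(distance, 3))
```

A sample that shares no strings with any profile is at distance 1.0 from all of them and is reported as "unknown", which corresponds to the case where a sample or its features have never been encountered.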
Combined with ensemble learning, semi-supervised learning provides more accurate results by accumulating the results of many weak classifiers. As of now, only [20] reports on the application of semi-supervised learners to malware research. The paper by [33] presents promising research on ensemble methods for semi-supervised learning in general. Our proposed method could also be regarded as semi-supervised learning, since it combines supervised (classification) and unsupervised (clustering) methods.

VI. ADAPTIVE CATEGORIZATION

Adaptive categorization makes use of past history for the current learning. Each iteration of the analysis is prepared with the new knowledge gained from the current results. We believe this method will decrease the detection time for malware by cutting steps from the next iteration of analysis. [34] and [35] used adaptive methods together with clustering.

VII. THE METHODOLOGY

Several research activities are involved in malware detection. Figure 3 shows a general overview of the activities:

Figure 3. The ensemble based classification and adaptive categorization model

A. Phase 1 - Preprocessing and Feature Selection

We have around 30,000 malware samples, obtained from the Malaysian Computer Emergency Response Team (MyCERT), which is part of the government-funded CyberSecurity Malaysia (CSM). These samples contain known malware, including the notorious Conficker variants, and some of the samples are not identifiable by major commercial and open source antivirus products.

1) Preprocessing: This phase involves collecting malware samples, which in this research will be “sanitized” with existing known techniques, so that what is processed further is basically raw and “vanilla” malware.

2) Feature Selection: In this phase, features are extracted from the malware and the feature selection process is carried out. Some of the features were already explained in Section IV-A. Some features may only be noted as malicious if they are paired with other suspicious features. Some features may already be regarded as malicious when they exist alone. Some features may cancel each other out (that is, if another feature exists, the feature becomes useless).

B. Phase 2 - First Tier Detection using Classification

Here, we categorize the software according to the common (general) features of known malware. If a binary is known malware, it is dismissed here. For the next step, a binary goes to the second round of this tier, which tests for specific features of known malware. The ideal algorithm here is a binary classifier, since it classifies the analyzed software into two classes, either benign or malware. We chose the Support Vector Machine (SVM) as the classifier candidate.

C. Phase 3 - Ensemble Based Classification and Clustering

Here, we use several weak classification (supervised) algorithms for detection. Assembling different weak algorithms lets the result of each chosen algorithm complement the others. If the malware is still undetected, it goes to the unsupervised learning phase, that is, the singleton clustering sublevel. Clustering does not know what it is grouping, hence the output will be clusters of something which is yet to be known.

D. Phase 4 - Signature Creation and Adaptive Learning

In this phase, a signature will be created for the newly found malware. The classifier will then be retrained with the new signature, so the processing time for the next n classifications is expected to decrease. This design means that any known malware (including malware that was previously unknown but was later learned through the Phase 4 algorithm) will simply stop at Phase 2.

VIII. CONCLUSION

Ensemble based classification, together with clustering and adaptive categorization, could improve the detection accuracy for PE based malware on Windows platforms. Clustering, as the unsupervised method, is able to segregate unknown samples into several clusters. The adaptive model is embedded to reduce the classification time without sacrificing detection accuracy. The same model could be deployed for mobile malware detection, for instance, given that the correct features are obtained in the early stage of the detection model. Future work on this research will hopefully be done in another research domain where speed of detection is not a concern, since ensembling normally trades speed for accuracy.

ACKNOWLEDGMENT

The authors want to thank Universiti Teknologi Malaysia for providing research space for this research. Thanks also go to International Islamic University Malaysia for providing funding to one of the researchers under the Endowment Grant (Type A), EDW A11-151-0942, and finally to the Ministry of Higher Learning (MOHE) of Malaysia for providing support throughout the research period.

REFERENCES

[1] VirusTotal, “VirusTotal - Free Online Virus, Malware and URL Scanner,” http://www.virustotal.com, 2011.
[2] M. R. Chouchane, A. Walenstein, and A. Lakhotia, “Statistical signatures for fast filtering of instruction-substituting metamorphic malware,” in Proceedings of the 2007 ACM workshop on Recurring malcode, ser. WORM ’07. New York, NY, USA: ACM, 2007, pp. 31–37. [Online]. Available: http://doi.acm.org/10.1145/1314389.1314397
[3] M. Saudi, A. Cullen, and M. Woodward, “Statistical Analysis in Evaluating STAKCERT Infection, Activation and Payload Methods,” in Proceedings of the World Congress on Engineering, vol. 1, 2010.
[4] R. Merkel, T. Hoppe, C. Kraetzer, and J. Dittmann, “Statistical detection of malicious pe-executables for fast offline analysis,” in Communications and Multimedia Security, ser. Lecture Notes in Computer Science, B. De Decker and I. Schaumüller-Bichl, Eds. Springer Berlin / Heidelberg, 2010, vol. 6109, pp. 93–105.
[5] X. Sun, Q. Huang, Y. Zhu, and N. Guo, “Mining distinguishing patterns based on malware traces,” in Proc. 3rd IEEE Int Computer Science and Information Technology (ICCSIT) Conf, vol. 2, 2010, pp. 677–681.
[6] D. Komashinskiy and I. Kotenko, “Integrated usage of data mining methods for malware detection,” in Information Fusion and Geographic Information Systems, ser. Lecture Notes in Geoinformation and Cartography, W. Cartwright, G. Gartner, L. Meng, and M. P. Peterson, Eds. Springer Berlin Heidelberg, 2009, pp. 343–357.
[7] D. Komashinskiy and I. Kotenko, “Malware detection by data mining techniques based on positionally dependent features,” in Proc. 18th Euromicro Int Parallel, Distributed and Network-Based Processing (PDP) Conf, 2010, pp. 617–623.
[8] Y. Elovici, A. Shabtai, R. Moskovitch, G. Tahan, and C. Glezer, “Applying machine learning techniques for detection of malicious code in network traffic,” in KI 2007: Advances in Artificial Intelligence, ser. Lecture Notes in Computer Science, J. Hertzberg, M. Beetz, and R. Englert, Eds. Springer Berlin / Heidelberg, 2007, vol. 4667, pp. 44–50.
[9] D. Gavrilut, M. Cimpoesu, D. Anton, and L. Ciortuz, “Malware detection using machine learning,” in Proc. Int. Multiconference Computer Science and Information Technology IMCSIT ’09, 2009, pp. 735–741.
[10] I. Firdausi, C. Lim, A. Erwin, and A. S. Nugroho, “Analysis of machine learning techniques used in behavior-based malware detection,” Advances in Computing, Control, and Telecommunication Technologies, International Conference on, vol. 0, pp. 201–203, 2010.
[11] V. Golovko, S. Bezobrazov, P. Kachurka, and L. Vaitsekhovich, “Neural network and artificial immune systems for malware and network intrusion detection,” in Advances in Machine Learning II, ser. Studies in Computational Intelligence, J. Koronacki, Z. Ras, S. Wierzchon, and J. Kacprzyk, Eds. Springer Berlin / Heidelberg, 2010, vol. 263, pp. 485–513.
[12] L. Bai, J. Pang, Y. Zhang, W. Fu, and J. Zhu, “Detecting malicious behavior using critical api-calling graph matching,” in Proc. 1st Int Information Science and Engineering (ICISE) Conf, 2009, pp. 1716–1719.
[13] S. Ortolani, C. Giuffrida, and B. Crispo, “Bait your hook: A novel detection technique for keyloggers,” in Recent Advances in Intrusion Detection, ser. Lecture Notes in Computer Science, S. Jha, R. Sommer, and C. Kreibich, Eds. Springer Berlin / Heidelberg, 2010, vol. 6307, pp. 198–217.
[14] M. Alazab, S. Venkataraman, and P. Watters, “Towards understanding malware behaviour by the extraction of api calls,” in Cybercrime and Trustworthy Computing Workshop (CTC), 2010 Second, 2010, pp. 52–59.
[15] Y. Ye, T. Li, Y. Chen, and Q. Jiang, “Automatic malware categorization using cluster ensemble,” in Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’10. New York, NY, USA: ACM, 2010, pp. 95–104. [Online]. Available: http://doi.acm.org/10.1145/1835804.1835820
[16] A. Moser, C. Kruegel, and E. Kirda, “Limits of static analysis for malware detection,” in Proc. Twenty-Third Annual Computer Security Applications Conf. ACSAC 2007, 2007, pp. 421–430.
[17] M. Christodorescu, J. Kinder, S. Jha, S. Katzenbeisser, and H. Veith, “Malware Normalization,” 2005.
[18] T. Holz, F. C. Freiling, and C. Willems, “Toward automated dynamic malware analysis using cwsandbox,” IEEE Security & Privacy, vol. 5, no. 2, pp. 32–39, 2007.
[19] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant, “Semantics-aware malware detection,” in Proceedings of the 2005 IEEE Symposium on Security and Privacy (Oakland 2005). Oakland, CA, USA: ACM Press, May 2005, pp. 32–46.
[20] I. Santos, J. Nieves, and P. G. Bringas, “Semi-supervised learning for unknown malware detection.” in Proceedings of the 4th International Symposium on Distributed Computing and Artificial Intelligence (DCAI). 9th International Conference on Practical Applications of Agents and Multi-Agent Systems (PAAMS), 2011, in press.
[21] L. Wang, X. Tan, J. Pan, and H. Xi, “Application of prefixspan* algorithm in malware detection expert system,” in Education Technology and Computer Science, 2009. ETCS ’09. First International Workshop on, vol. 3, 2009, pp. 448–452.
[22] K. Rieck, P. Trinius, C. Willems, and T. Holz, Automatic analysis of malware behavior using machine learning. TU, Professoren der Fak. IV, 2009.
[23] R. Islam, R. Tian, L. Batten, and S. Versteeg, “Classification of malware based on string and function feature selection,” in Cybercrime and Trustworthy Computing Workshop (CTC), 2010 Second, 2010, pp. 9–17.
[24] J. Z. Kolter and M. A. Maloof, “Learning to detect and classify malicious executables in the wild,” J. Mach. Learn. Res., vol. 7, pp. 2721–2744, December 2006. [Online]. Available: http://portal.acm.org/citation.cfm?id=1248547.1248646
[25] J. Rogers and S. Gunn, “Ensemble algorithms for feature selection,” in Deterministic and Statistical Methods in Machine Learning, ser. Lecture Notes in Computer Science, J. Winkler, M. Niranjan, and N. Lawrence, Eds. Springer Berlin / Heidelberg, 2005, vol. 3635, pp. 180–198.
[26] X. Wu and V. Kumar, The top ten algorithms in data mining, ser. Chapman & Hall/CRC data mining and knowledge discovery series. CRC Press, 2009. [Online]. Available: http://books.google.com.my/books?id= kcEn-c9kYAC
[27] D. Opitz and R. Maclin, “Popular ensemble methods: An empirical study,” Journal of Artificial Intelligence Research, vol. 11, no. 1, pp. 169–198, 1999.
[28] T. Dietterich, “Ensemble methods in machine learning,” in Multiple Classifier Systems, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2000, vol. 1857, pp. 1–15.
[29] T. Lee and J. Mody, “Behavioral classification,” in EICAR Conference, 2006.
[30] G. Jacob, H. Debar, and E. Filiol, “Behavioral detection of malware: from a survey towards an established taxonomy,” Journal in Computer Virology, vol. 4, pp. 251–266, 2008, 10.1007/s11416-008-0086-0. [Online]. Available: http://dx.doi.org/10.1007/s11416-008-0086-0
[31] J. A. Morales, A. Al-bataineh, S. Xu, and R. S, “Analyzing and exploiting network behaviors of malware,” 2007.
[32] X. Zhu, “Semi-supervised learning literature survey,” Computer Sciences, University of Wisconsin-Madison, Tech. Rep. 1530, 2005.
[33] Z. Zhou, “When semi-supervised learning meets ensemble learning,” Multiple Classifier Systems, pp. 529–538, 2009.
[34] A. Bagherjeiran, C. F. Eick, C.-S. Chen, and R. Vilalta, “Adaptive clustering: Obtaining better clusters using feedback and past experience,” in Proceedings of the Fifth IEEE International Conference on Data Mining, ser. ICDM ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 565–568. [Online]. Available: http://dx.doi.org/10.1109/ICDM.2005.17
[35] A. Topchy, B. Minaei-Bidgoli, A. Jain, and W. Punch, “Adaptive clustering ensembles,” in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, vol. 1. IEEE, 2004, pp. 272–275.
