Challenges in High Accuracy of Malware Detection

Muhammad Najmi Ahmad Zabidi
Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, 53100 Gombak, Kuala Lumpur, Malaysia
najmi@kict.iium.edu.my

Abstract—Malware threatens computer users regardless of the operating system and hardware platform they use. Microsoft Windows is the most popular operating system, and that popularity also makes it the platform most favoured by attackers. Current detection on Windows relies on signature-based detection, which is fairly fast but leaves some binaries undetected. Here, we propose a method to increase the malware detection rate by applying machine learning methods. Our focus is on Microsoft Windows binaries.

Index Terms—malware; feature selection; machine learning

Mohd Aizaini Maarof† and Anazida Zainal‡
Information Assurance and Security Research Group, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Skudai, Johor Bahru, Malaysia
[email protected]†, [email protected]

I. INTRODUCTION

Work on computer viruses is formally traced to Fred Cohen’s work in 1986 [1]. The term “virus” was given by Cohen’s supervisor, Dr. Adleman, and remains in public use today. Nowadays the term malware, short for MALicious softWARE, is more commonly used; it covers computer worms, viruses, dialers, trojans, rootkits and many other threats.

Machine learning methods have been widely used in areas such as finance (for example, credit card fraud detection) and patients’ drug prescriptions. In malware research, machine learning plays a role in several phases. Feature selection provides dimension reduction, lowering the number of features without reducing the accuracy rate. Clustering (an unsupervised method) can also classify unknown data. The room for improvement depends on how many and which features are selected, that is, on the quality of the chosen features.

A. Problem statement

Two approaches are used to detect the malicious features within a malware sample: static analysis and dynamic analysis. Static analysis parses the malware binary so that malicious strings can be found, and also covers reverse engineering by disassembling the malware. Dynamic analysis, on the other hand, monitors the activities of the malware by executing it in a safe environment, for example a virtual machine. Each method has its own strengths and weaknesses, and most of the time it is advisable to use both to analyze a malware sample [2].

The rich set of malicious features could be reduced for malware detection without sacrificing accuracy. This could save some of the researcher’s time for analysis. Our concern is that there is no obvious need to detect malware with many features when the set could be reduced to several strong features that perform equally well. Before the feature selection step for malware can begin, the candidate mechanisms or algorithms must be reviewed. The problems to be solved are, first, how to reduce the number of features in malware detection without reducing the accuracy rate and, second, how to detect malware that has never been seen before by comparing its commonalities with previous malware.

II. METHODOLOGY

We obtained a dataset of malware samples from CyberSecurity Malaysia (CSM), a Malaysian government agency CERT (Computer Emergency Response Team). The dataset is about 2 GB in size and contains roughly 30,000 malware samples. Pre-processing has to be done to extract the features before the next stage. We use static analysis to parse the visible Application Programming Interface (API) calls and dynamic analysis to capture the obfuscated API calls. The most popular way for malware to invoke commands is through calls within a Dynamically Linked Library (DLL). Two functions within “kernel32.dll”, “LoadLibrary” and “GetProcAddress”, must be invoked to perform the DLL calls [3]. General ideas about the process can be found in [4], [5] and [6].

To test the accuracy of the detection, we mix the testing sample with benign software in order to find out whether the detection produces false positives or false negatives. Roughly, our methods consist of:

1) Feature Selection (Ranking/Pruning)
2) Supervised Classification
3) Unsupervised Classification

Items 2) and 3) above could also be combined into a method known as “Semi-Supervised Classification”.

A. Feature Selection as a Dimension Reduction Method

We review several methods, for example Information Gain (IG) as used in [7], [8], [9], [10], [11].
The formula for IG in [10] and [11] is the same, except that [11] adds an error-correction step after the result to reduce the error rate, which is basically a selection based on the area under the bell curve.

The amount by which the entropy of X decreases reflects the additional information about X provided by Y. This amount is called information gain and is given by

$$IG(X \mid Y) = H(X) - H(X \mid Y) \qquad (1)$$
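As an illustration of how Eq. (1) can rank features, the following minimal sketch computes IG for binary API-call indicators. The toy labels, the feature vectors and the function names `entropy` and `information_gain` are assumptions made for this example and are not taken from the dataset or tooling described here; entropy is measured in bits (log base 2), which is also an assumption.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(X) in bits of a sequence of discrete labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, feature):
    """IG(X|Y) = H(X) - H(X|Y) for class labels X and feature values Y."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        # Labels of the samples that share this feature value.
        subset = [x for x, y in zip(labels, feature) if y == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional

# Hypothetical toy data: 1 = malware, 0 = benign; a feature is 1 if the
# sample imports the named API call and 0 otherwise.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "GetProcAddress": [1, 1, 1, 1, 0, 0, 0, 0],  # separates the classes
    "GetTickCount":   [1, 1, 0, 0, 1, 1, 0, 0],  # carries no class information
    "CreateFileW":    [1, 1, 1, 0, 1, 0, 0, 0],  # partially informative
}

# Rank features from most to least informative.
ranking = sorted(features,
                 key=lambda name: information_gain(labels, features[name]),
                 reverse=True)
```

On this toy data the ranking places GetProcAddress first and GetTickCount last; pruning would then keep only the top-ranked features.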

[11] introduced the following formula to “correct out” the error in the results:

$$IG(X)' = IG(X) \pm \frac{\sum_{i=0}^{n} IG(X_i)}{n} \qquad (2)$$

while [12] used:

$$IG(t) = \sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t' \in \{t, \bar{t}\}} P(t', c) \log \frac{P(t', c)}{P(t')\,P(c)} \qquad (3)$$
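Eq. (3) can be evaluated directly from a 2×2 table of counts. The sketch below is one possible reading of it, assuming probabilities are estimated as relative frequencies, the logarithm is base 2, and terms with a zero count contribute nothing; the function name `ig_term` and the counts are hypothetical.

```python
import math

def ig_term(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    """Eq. (3): sum over t' in {t, not t} and c in {c, not c} of
    P(t', c) * log2(P(t', c) / (P(t') * P(c))), estimated from counts."""
    counts = {
        (True, True): n_tc,            # term present, sample in class c
        (True, False): n_t_notc,       # term present, sample not in c
        (False, True): n_nott_c,       # term absent, sample in c
        (False, False): n_nott_notc,   # term absent, sample not in c
    }
    total = sum(counts.values())
    # Marginal probabilities P(t') and P(c).
    p_t = {t: sum(n for (tt, _), n in counts.items() if tt == t) / total
           for t in (True, False)}
    p_c = {c: sum(n for (_, cc), n in counts.items() if cc == c) / total
           for c in (True, False)}
    gain = 0.0
    for (t, c), n in counts.items():
        if n == 0:
            continue  # 0 * log(0) is taken as 0 by convention
        p_joint = n / total
        gain += p_joint * math.log2(p_joint / (p_t[t] * p_c[c]))
    return gain

# Hypothetical counts: the term occurs in 3 of 4 malware files
# and in 1 of 4 benign files.
example = ig_term(3, 1, 1, 3)
```

For these counts the result is about 0.189 bits, the same value that Eq. (1) gives as H(C) − H(C|T) for the same table, since Eq. (3) is the mutual-information form of the same quantity.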

We intend to look into the IG algorithm above because of the interest previous researchers have shown in applying it for feature ranking. We have already evaluated several malware samples in [4] using the following features: GetSystemTimeAsFileTime, SetUnhandledExceptionFilter, GetCurrentProcess, TerminateProcess, LoadLibraryExW, GetVersionExW, GetModuleFileNameW, GetTickCount, SetLastError, GetCurrentProcessId, GetModuleHandleW, LoadLibraryW, InterlockedExchange, UnhandledExceptionFilter, FreeLibrary, GetCurrentThreadId, QueryPerformanceCounter, CreateFileW, InterlockedCompareExchange, UnmapViewOfFile and GetProcAddress. Our next target is to evaluate the fitness and quality of the chosen features so that only the most relevant ones are used to represent malicious features. Theoretically, this could be done by using distance measurement and data clustering, plotting the details on a graph.

[13], for example, questions the credibility of detection based on API calls. This is a fair point, since API calls are a legitimate way to call operating system functions: they interface the operating system with applications. API call detection, although very popular in research papers, is not the only method for measuring the level of maliciousness. It could also be examined together with the entropy level, the existence of malicious strings, anti-debugger and anti-virtual-machine detector strings, RedPill detection, XORed strings and many more.

B. Supervised Learning for Malware Detection

Supervised learning needs instance labels so that it can, in a sense, “predict” upcoming results based on the rules it was fed before. This could be efficient for relatively known malware, but it has some problems if the analyzed features have never been seen before.

C. Unsupervised Learning for Malware Detection

Because malware grows rapidly, the prospect of using unsupervised methods is high: the unsupervised learning process does not need any instance labels prior to the categorization process. Clusters can be separated, so malware can be segregated into groups and the distance between them measured.
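The idea of grouping unlabeled samples by distance can be sketched as follows. This is only an illustration under stated assumptions: each sample is represented by the set of API calls it imports, the distance is the Jaccard distance, and grouping is single-linkage with a fixed threshold. The sample names, the threshold of 0.5 and the helper names `jaccard_distance` and `cluster` are hypothetical; no particular clustering algorithm is prescribed in this paper.

```python
def jaccard_distance(a, b):
    """1 minus (size of intersection / size of union) of two API-call sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def cluster(samples, threshold):
    """Single-linkage grouping: samples whose distance is at most `threshold`
    end up in the same cluster (connected components of the threshold graph)."""
    names = list(samples)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the representative of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if jaccard_distance(samples[first], samples[second]) <= threshold:
                parent[find(first)] = find(second)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical unlabeled samples, each described by the API calls it imports.
samples = {
    "sample_a": {"LoadLibraryW", "GetProcAddress", "CreateFileW"},
    "sample_b": {"LoadLibraryW", "GetProcAddress", "CreateFileW", "GetTickCount"},
    "sample_c": {"GetTickCount", "QueryPerformanceCounter"},
    "sample_d": {"GetTickCount", "QueryPerformanceCounter", "SetLastError"},
}
clusters = cluster(samples, threshold=0.5)
```

With these inputs, sample_a and sample_b fall into one group and sample_c and sample_d into another, without any labels being supplied. A looser threshold merges the groups, which is why the choice of threshold matters for separating variants from unrelated families.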

III. CHALLENGES

Malware writers are usually regarded as “lazy”: “new” malware in the wild is usually derived from previous malware, so if the similarities are large the new malware is regarded as a new variant. In addition, malware writers insert garbage calls in order to confuse the analyst with fake API calls. This, however, could be countered by analyzing the malware in different ways. Malware writers also encrypt the important details within the malware body; this has been partly addressed by our tool in [4] for the case where the malware writer used XOR as the encryption method.

Malware is also usually packed, sometimes with a well-known packer and sometimes with a rarely used one. “Packing” is a method of compressing a Windows executable so that the user does not need to decompress it manually. The purpose is basically legitimate, since benign Windows executables use it as well, but criminals have exploited it to reduce size and add encryption. Ideally, malware could be analyzed better after unpacking. However, unpacking without the packing software is difficult and time consuming, so methods have been developed to analyze malware at the surface level by using entropy analysis [14], [15], [16], [17]. We ran an entropy test on a recent sample of the Duqu malware [18], [19], which is known as an offspring of the infamous Stuxnet; however, the Duqu sample did not appear suspicious to the entropy test.

There is one more important aspect to consider. According to [20], cyber security experiments are considered impossible to replicate, due to the nature of the sensors and data involved. In malware research, for example, malware samples were usually obtained from the VXhaven website¹, a repository for malware enthusiasts. However, the VXhaven portal is currently down due to a directive by the Ukrainian government [21].
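Returning to the entropy analysis mentioned in this section, the following sketch shows the kind of surface-level measurement involved: the Shannon entropy of a file's bytes, which approaches 8 bits per byte for compressed or encrypted content. The function names and the 7.0 cut-off are assumptions chosen for illustration, not values reported in this paper or in [14]–[17].

```python
import math
from collections import Counter

def byte_entropy(data):
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(data).values())

def looks_packed(data, threshold=7.0):
    """Flag data whose entropy exceeds `threshold`; the 7.0 cut-off is an
    illustrative assumption, not a value taken from the paper."""
    return byte_entropy(data) > threshold

# Repetitive plain text has low entropy.
plain = b"This program cannot be run in DOS mode. " * 64
# Every byte value equally frequent, standing in for well-compressed data.
uniform = bytes(range(256)) * 16

plain_flag = looks_packed(plain)      # False
uniform_flag = looks_packed(uniform)  # True
```

As the Duqu observation above suggests, a low entropy value does not show that a sample is benign, so a check like this is best treated as one signal among several.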
Two other educational institutions have provided their datasets: the University of Mannheim², Germany, and Nexginrc³, which is based in Pakistan.

There are also issues in analyzing the malware itself. A safe environment for malware analysis is needed to ensure that the malware being analyzed does not infect other systems. One way is to disconnect the network connection. This, however, works only for static analysis. Dynamic analysis requires the connection, hence several works on sandboxes or Internet simulators [22], [23], [24]. A virtual environment could also guarantee safety, but many malware samples now include a virtual machine detector. This problem has already been addressed: such detectors can be identified by our tool in [4].

¹ http://vx.netlux.org/index.html
² http://pi1.informatik.uni-mannheim.de/malheur/
³ http://nexginrc.org/Datasets/Default.aspx

IV. CONCLUSION

Malware detection is a problem that researchers have tried to solve for many years using a great variety of methods. Each proposed method has its own weaknesses, which are later improved on by its authors or by other researchers. In the meantime, malware writers are also improving their malware code from time to time and working in a more organized way in the underground world. This “cat and mouse” game is like a vicious circle, as improvements need to be made continually.

ACKNOWLEDGMENT

We would like to thank the Ministry of Higher Education and International Islamic University Malaysia for supporting the main author of this paper. We also thank Universiti Teknologi Malaysia, for the second and third authors, for providing facilities and advice for the contents of this paper.

REFERENCES

[1] F. B. Cohen, “Computer viruses,” Ph.D. dissertation, University of Southern California, Los Angeles, CA, USA, 1986, AAI0559804.
[2] D. Glynos, “Packing heat!” http://census.gr/media/packing-heat.pdf, May 2012, (Accessed at 12 May 2012).
[3] J. Dai, “Detecting malicious software by dynamic execution,” Ph.D. dissertation, University of Central Florida, 2010.
[4] M. N. A. Zabidi, M. A. Maarof, and A. Zainal, “Malware analysis with multiple features,” Computer Modeling and Simulation, International Conference on, vol. 0, pp. 231–235, 2012.
[5] M. N. A. Zabidi, M. A. Maarof, and A. Zainal, “Ensemble Based Categorization and Adaptive Model for Malware Detection,” 2011 7th International Conference on Information Assurance and Security (IAS), vol. 978-1-4577-2153-3, pp. 80–85, 2011.
[6] M. N. A. Zabidi, “Compiling Features for Malicious Software,” http://conference.hitb.org/hitbsecconf2011kul/materials/D1%20SIGINT%20-%20Muhammad%20Najmi%20Ahmad%20Zabidi%20-%20Compiling%20Features%20for%20Malcious%20Binaries.pdf, October 2011, (Accessed at January 11, 2012).
[7] B. Zhang, J. Yin, J. Hao, S. Wang, and D. Zhang, “New malicious code detection based on n-gram analysis and rough set theory,” Y. Wang, Y.-M. Cheung, and H. Liu, Eds. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 626–633.
[8] S. Sabnani, “Computer security: A machine learning approach,” Master’s thesis, 2008.
[9] R. Perdisci, A. Lanzi, and W. Lee, “McBoost: Boosting Scalability in Malware Collection and Analysis Using Statistical Classification of Executables,” in Computer Security Applications Conference, 2008. ACSAC 2008. Annual, Dec. 2008, pp. 301–310.
[10] A. Altaher, S. Ramadass, and A. Ali, “Computer Virus Detection Using Features Ranking and Machine Learning,” Australian Journal of Basic and Applied Sciences, vol. 5, no. 9, pp. 1482–1486, 2011.
[11] P. Singhal and N. Raul, “Malware detection module using machine learning algorithms to assist in centralized security in enterprise networks,” International Journal of Network Security & Its Applications, vol. 4, 2012.
[12] Q. Jiang, X. Zhao, and K. Huang, “A feature selection method for malware detection,” in 2011 IEEE International Conference on Information and Automation (ICIA), June 2011, pp. 890–895.
[13] S. M. Abdulla, M. L. M. Kiah, and O. Zakaria, “Minimizing errors in identifying malicious api to detect pe malwares using artificial costimulation,” in International Conference on Emerging Trends in Computer and Electronics Engineering (ICETCEE’2012), 2012.
[14] X. Ugarte-Pedrero, I. Santos, B. Sanz, C. Laorden, and P. Bringas, “Countering Entropy Measure Attacks on Packed Software Detection,” in Proceedings of the 9th IEEE Consumer Communications and Networking Conference (CCNC2012), 2012, in press.
[15] I. Sorokin, “Comparing files using structural entropy,” Journal in Computer Virology, pp. 1–7, 2011, 10.1007/s11416-011-0153-9. [Online]. Available: http://dx.doi.org/10.1007/s11416-011-0153-9
[16] R. Lyda and J. Hamrock, “Using entropy analysis to find encrypted and packed malware,” Security & Privacy, IEEE, vol. 5, no. 2, pp. 40–45, 2007.
[17] I. Briones and A. Gomez, “Graphs, entropy and grid computing: Automatic comparison of malware,” in Proceedings of the 2004 Virus Bulletin Conference, 2004.
[18] B. Bencsáth, G. Pék, L. Buttyán, and M. Félegyházi, “Duqu: A Stuxnet-like malware found in the wild,” http://www.crysys.hu/publications/files/bencsathPBF11duqu.pdf, 2011.
[19] B. Bencsáth, G. Pék, L. Buttyán, and M. Félegyházi, “Duqu: Analysis, detection, and lessons learned,” ACM European Workshop on System Security (EuroSec) 2012, 2012.
[20] T. Dumitras and P. Efstathopoulos, “The provenance of wine,” http://www.ece.cmu.edu/~tdumitra/public_documents/dumitras12wineprovenance.pdf, 2012, (Accessed at 12 May 2012).
[21] J. Kirk, “Ukraine shuts down forum for malware writers,” http://www.computerworld.com/s/article/9225693/Ukraine_shuts_down_forum_for_malware_writers, March 2012, (Accessed at 12 May 2012).
[22] T. Holz, F. C. Freiling, and C. Willems, “Toward Automated Dynamic Malware Analysis Using CWSandbox,” IEEE Security & Privacy, vol. 5, no. 2, pp. 32–39, 2007.
[23] K. Yoshioka, T. Kasama, and T. Matsumoto, “Sandbox analysis with controlled internet connection for observing temporal changes of malware behavior,” in The Fourth Joint Workshop on Information Security (JWIS 2009), 2009.
[24] K. Aoki, T. Yagi, M. Iwamura, and M. Itoh, “Controlling malware http communications in dynamic analysis system using search engine,” in Cyberspace Safety and Security (CSS), 2011 Third International Workshop on. IEEE, 2011, pp. 1–6.
