Poster: Detecting Malware through Temporal Function-Based Features
Eitan Menahem, Asaf Shabtai, Adi Lavher
Information System Engineering Department, Ben-Gurion University of the Negev, Be’er Sheva, 84105, Israel
{eitanme, shabtaia, adilev}@post.bgu.ac.il
ABSTRACT
In order to evade detection by anti-virus software, malware writers use techniques such as polymorphism, metamorphism and code re-writing. As a result, such malware contains a much larger fraction of “new” code than benign programs, which tend to maximize code reuse. In this research we study this interesting property and show that by performing “archaeological” analysis of functions residing within binary files (i.e., estimating the functions’ creation dates), a new set of informative features can be derived. We show that these features provide a good indication of the existence of malicious code within binary files. Preliminary experiments with the proposed temporal function-based features on a set of over 12,000 files indicate that the proposed set of features can be useful for the detection of malicious files (accuracy of over 90% and AUC of 0.96).
Categories and Subject Descriptors C.2.0 [Computer-Communication Networks]: security and protection; H.2.8 [Information Systems]: Database Applications - data mining
General Terms Experimentation, Security
Keywords Malware, Static analysis, Machine learning
1. INTRODUCTION
Anti-virus vendors face huge quantities (thousands) of suspicious files every day [1]. These files are collected from various sources, including dedicated honeypots, third-party providers, and files reported by customers either automatically or explicitly. Such a large number of files makes efficient and effective inspection particularly important and challenging. Analysis techniques for detecting malware have been proposed; they are broadly divided into static and dynamic analysis.

In dynamic analysis (also known as behavioral analysis), detection relies on information collected from the operating system at runtime (i.e., during the execution of the program), such as system calls, network access, and file and memory modifications [2]. This approach suffers from several limitations. First, it is difficult to set up the appropriate conditions in which the malicious payload of the program will be activated (such as the vulnerable application that the malware exploits). Second, it is not clear how long each malware must be observed before its malicious activity appears. Third, many malicious programs nowadays apply anti-forensic and anti-debugging methods, thus limiting the ability to analyze them during runtime. Fourth, masquerading techniques, such as polymorphism and metamorphism, may be applied to the malware’s system calls to conceal malicious behavioral signatures.

In static analysis, information about the program or its expected behavior is derived from explicit and implicit observations in its binary/source code. The main advantage of static analysis is that it can process a file without actually executing it, thereby providing rapid classification [7]. Until recently, static analysis solutions were primarily implemented using the signature-based method, which relies on the identification of unique strings in the binary code [1]. While very precise, signature-based methods are useless against unknown malicious code [3]. Thus, generalizing the detection methods was crucial in order to detect new versions and types of malware before their execution.

Modern static analysis techniques automate and extend the idea of heuristic-based methods. In these methods the binary code of a file is represented, for example, using byte sequences (i.e., byte n-grams) or OpCode sequences (i.e., OpCode n-grams), and classifiers are used to learn patterns in the code in order to classify unknown files as malicious or benign [1], [4], [5]. Recent studies have shown that classifiers trained on these types of features can achieve very accurate classification results.

To avoid being detected by signature-based techniques and heuristic methods, sophisticated malware programmers conceal suspicious (unique) byte n-grams and OpCode sequences by rewriting entire code segments, or by using polymorphism and metamorphism techniques [8]. While these techniques render detection based on byte n-gram and OpCode sequences useless, they result in binary programs that contain bulks of “new” functions, never previously seen in other executable files.

In this research we propose a novel method for detecting malware that employs the above-mentioned evasion techniques. Our main assumption is that such techniques distort the malware’s temporal profile, making it distinct from that of benign programs (in other words, attackers have coding styles, in terms of code reuse, that differ significantly from those of legitimate code writers). To derive a temporal profile, our method performs “archaeological” analysis of the tested binary file, i.e., it estimates the creation date of the functions residing within the file’s code. Using this information, a series of temporal features, representing the file’s temporal profile, is then extracted. These features can be exploited for various analysis tasks, including malware detection.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CCS’13, November 4–8, 2013, Berlin, Germany. Copyright 2013 ACM 978-1-4503-1651-4/12/10 ...$15.00.

Figure 1: Illustrating the proposed temporal function-based features. Each section represents a function detected within the file’s code (dark black sections indicate unlabeled sections). The age of each function is relative to the release date of the file - T. Function labels shown in the figure: T-2, T-3, T-15, T-26, ? (undated), T-66, T-6, T-13, T-30, T-44, T-42, T-80.

Figure 2: The process of creating the dated function repository. Components shown in the diagram: benign files corpus; file “creation date” extractor (PE header date extractor, file properties date extractor, Web date extractor, …); file “creation date” estimation; dated benign files repository (exe, dll); function extractor; prolog extractor; epilog extractor; function matcher; function “creation date” estimation; dated function repository.
Since learning algorithms usually benefit from a larger set of features, and especially from features that provide different views of the instances (i.e., executable files), combining the proposed temporal features (i.e., a new view) with byte/OpCode n-gram features can be very effective for improving the performance of malware detection applications. Fig. 1 illustrates the proposed temporal function-based features. Each section represents a function detected within the file’s code (note that dark black sections indicate unlabeled sections). The age of each function is relative to the release date of the file - T. For example, the youngest function detected within the file is the top left-most function, with age T-2. It can also be seen that 16.7% of the functions are between 12 and 24 months old. One function could not be assigned a creation date (i.e., it is an undated function).
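The figures quoted for Fig. 1 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and rests on three assumptions: the function labels in Fig. 1 are ages in months before the release date T, the age range is half-open (12 months inclusive, 24 months exclusive), and the percentage is taken over all twelve functions, including the undated one. The name `share_in_age_range` is hypothetical.

```python
# Function ages in months before the file's release date T, as labeled in Fig. 1.
# None marks the one function that could not be dated ("?").
AGES_MONTHS = [2, 3, 15, 26, None, 66, 6, 13, 30, 44, 42, 80]


def share_in_age_range(ages, low, high):
    """Fraction of ALL functions (dated or not) whose age lies in [low, high)."""
    hits = sum(1 for a in ages if a is not None and low <= a < high)
    return hits / len(ages)


dated = [a for a in AGES_MONTHS if a is not None]
youngest = min(dated)                                   # age of the newest function
share_12_24 = share_in_age_range(AGES_MONTHS, 12, 24)   # T-15 and T-13 qualify
undated_share = AGES_MONTHS.count(None) / len(AGES_MONTHS)

print(f"youngest function: T-{youngest}")
print(f"aged 12-24 months: {share_12_24:.1%}")
print(f"undated: {undated_share:.1%}")
```

Under these assumptions two of the twelve functions fall in the 12 to 24 month range, which matches the 16.7% stated in the text.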
2. THE PROPOSED METHOD
The proposed classification method is based on temporal features that are computed for functions detected within executable files. These features are computed based on the estimated “age of functions” (i.e., coding time of function). The proposed method consists of two main phases: (1) creating a repository of dated functions; and (2) extracting features for classification using the dated functions repository.
2.1 Creation of Dated Functions Repository
In this phase a repository of common functions is created, and each function is assigned a creation date. This process is presented in Fig. 2. First, given a corpus of benign executable files, the system sets the creation date of each file. This is done by extracting relevant dates for each file from different sources (e.g., the release date of the file as published on the Web, or the compilation date extracted from the file’s Portable Executable (PE) header).

Next, a collection of functions is extracted from each file. The function extraction procedure (presented in [5], [6]) sequentially scans the file using a sliding window. Each window is classified by a machine learning model (a C4.5 decision tree) trained to detect the prologue and epilogue patterns of functions without disassembly (i.e., by processing the binary file directly). Once a function prologue is detected, the algorithm scans for the last function epilogue prior to the following prologue. When a pair of markers (prologue and epilogue) is found, the new function’s boundaries are stored as a ‘function’ object.

Finally, we detect similar functions and set the creation date of each function. We chose the Levenshtein distance as the distance function between the byte sequences of two functions. Each function is compared to all other extracted functions of the same length: we count the number of differing bytes between the two byte sequences and normalize it by the length. If the result is less than a predefined threshold (we set this threshold to 15%), we consider the compared functions identical. The last step is setting the creation date of the functions. We set the creation date of each function based on the dates of the files from which it was extracted: the date of the oldest such file, after removing outliers, is taken as the function date. The result of this phase is a repository of dated functions.
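The three steps of this phase (boundary pairing, function matching and date assignment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments. The prologue and epilogue offsets are assumed to be already available (the C4.5 marker classifier is not shown). The comparison counts positional byte mismatches on equal-length sequences, as the text describes; this is a Hamming-style count, which for equal-length sequences is an upper bound on the Levenshtein distance. The outlier rule (an interquartile-range cutoff with factor `k`) is an assumption, since the text does not specify one. All function names are hypothetical.

```python
from datetime import date
from statistics import quantiles


def pair_boundaries(prologues, epilogues):
    """Pair each prologue offset with the last epilogue offset found before
    the next prologue. Prologues with no such epilogue are skipped."""
    prologues = sorted(prologues)
    epilogues = sorted(epilogues)
    functions = []
    for i, start in enumerate(prologues):
        limit = prologues[i + 1] if i + 1 < len(prologues) else float("inf")
        candidates = [e for e in epilogues if start < e < limit]
        if candidates:
            functions.append((start, candidates[-1]))
    return functions


def byte_difference(a: bytes, b: bytes) -> float:
    """Share of positions at which two equal-length byte sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def same_function(a: bytes, b: bytes, threshold: float = 0.15) -> bool:
    """Treat two functions as identical if fewer than 15% of their bytes differ."""
    return len(a) == len(b) and len(a) > 0 and byte_difference(a, b) < threshold


def estimate_creation_date(file_dates, k: float = 1.5) -> date:
    """Date of the oldest containing file after dropping unusually old dates.

    Assumed outlier rule: discard dates below Q1 - k * IQR."""
    ordinals = sorted(d.toordinal() for d in file_dates)
    if len(ordinals) >= 4:
        q1, _, q3 = quantiles(ordinals, n=4)
        lower = q1 - k * (q3 - q1)
        ordinals = [o for o in ordinals if o >= lower]
    return date.fromordinal(ordinals[0])
```

Only outliers on the old side are removed here because only they can change the minimum; a single file with a corrupted or forged early timestamp would otherwise fix the date of every function it contains.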
2.2 Feature Extraction
We extract the following 13 features using the dated function repository: the percentage of dated/undated functions (functions extracted from a given file that were detected/not detected in the function repository), the age in years of the oldest/youngest function in the file, the median and average age of the functions in years, the percentage of functions from the last 2, 5 and 10 years, and the percentage of functions older than 10, 15 and 20 years.

Figure 3: Histograms of functions’ estimated age. (Bar chart comparing malware and benign files; x-axis: functions’ “age” in years, in bins 0-1, 1-2, 2-5, 5-10, 10-15, 15-20 and 20-100; y-axis: functions (%), from 0% to 40%.)

In order to induce a classifier based on the defined features, we use two sets of files: benign (different from the files used in the previous phase of creating the dated functions repository, to avoid over-fitting) and malware. For each file we first extract its functions using the same method used for creating the dated functions repository. To determine the functions’ origin date, each extracted function is compared with functions of the same length in the dated functions repository (using the Levenshtein distance). If a matching function is found, the tested function is assigned the origin date of the matching function. Finally, given the extracted functions and their corresponding creation dates, we compute the file’s temporal function-based feature vector, which constitutes the file’s temporal profile.
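A feature vector of this kind could be computed along the following lines. The sketch is illustrative and makes several assumptions: ages are given in years relative to the file's release date, percentages are taken over all extracted functions (dated and undated), and "from the last N years" means an age of at most N years. The text states 13 features but names twelve quantities explicitly; the sketch computes only those twelve. The function and key names are hypothetical.

```python
from statistics import mean, median


def temporal_features(ages_years, undated_count):
    """Temporal profile of one file.

    ages_years:    estimated ages (years before the file's release) of the
                   functions that were matched in the dated repository.
    undated_count: number of extracted functions with no repository match.
    """
    total = len(ages_years) + undated_count
    if not ages_years:
        raise ValueError("at least one dated function is required")

    def pct(predicate):
        # Percentage over ALL functions of the file (assumed denominator).
        return 100.0 * sum(1 for a in ages_years if predicate(a)) / total

    features = {
        "pct_dated": 100.0 * len(ages_years) / total,
        "pct_undated": 100.0 * undated_count / total,
        "oldest_age": max(ages_years),
        "youngest_age": min(ages_years),
        "median_age": median(ages_years),
        "average_age": mean(ages_years),
    }
    for years in (2, 5, 10):
        features[f"pct_last_{years}y"] = pct(lambda a, y=years: a <= y)
    for years in (10, 15, 20):
        features[f"pct_older_{years}y"] = pct(lambda a, y=years: a > y)
    return features


example = temporal_features([0.5, 1.5, 3, 7, 12, 25], undated_count=4)
for name, value in example.items():
    print(f"{name}: {value:.1f}")
```

A file with no dated functions has no defined age statistics; the sketch raises an error in that case, whereas a real pipeline would need an explicit convention for such files.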
3. EXPERIMENTS AND RESULTS
For the experiments we used the following file sets. The first set contains 850 benign executable files, downloaded using a web crawler from various websites (such as filehippo.com and oldapps.com) that indicated the files’ original dates. Since most of the downloaded files were compressed or packed, we used Universal Extractor to decompress them. We then kept only the relevant files (i.e., exe and dll). These files were used to generate the dated functions repository. At the end of this process, the dated functions repository contained a total of 1,999,209 different (dated) functions.

To train and evaluate the malware detection classifier we used 6,756 benign and 6,169 malware files. The benign files were collected from the “Program Files” and “System32” directories of several computers and were verified to be benign using the McAfee anti-virus. The malware files were taken from the VX Heavens collection. For each of these files we extracted the temporal function-based features and conducted a 10-fold cross-validation experiment.

Our first observation is that, on average, the percentage of dated functions was 35.8% in benign files and only 11.4% in malicious files. In addition, Fig. 3 depicts the proportion of functions in benign and malicious files within different age clusters. It can be seen that the percentage of functions from the last year is 8.4% for benign files compared to 30.1% for malicious files. A notable difference can also be observed for functions from the last 2 years and 5 years.

The best classification results were obtained using a Rotation Forest (RotFst) classifier (with Random Forest as the base classifier), with an average accuracy of 90% and an area under the ROC curve (AUC) of 0.96 (see Fig. 4).
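For readers unfamiliar with the evaluation measures, the following sketch shows how a 10-fold split, accuracy and AUC can be computed from classifier scores. It is an illustration only: the classifier itself (Rotation Forest) is not reproduced, the stratified assignment of files to folds and the 0.5 decision threshold are assumptions, and all function names are hypothetical. Labels are 1 for malware and 0 for benign.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign every index to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def accuracy(y_true, scores, threshold=0.5):
    """Share of files whose thresholded score agrees with the true label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(y_true, scores))
    return correct / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen malicious file scores higher than a randomly chosen benign file
    (ties count one half). Quadratic in the number of files."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores for six files; each fold of a real experiment would yield such a list.
y = [1, 1, 0, 0, 1, 0]
s = [0.9, 0.8, 0.7, 0.2, 0.6, 0.4]
print(f"accuracy = {accuracy(y, s):.3f}, AUC = {auc(y, s):.3f}")
```

In a cross-validation run, each fold serves once as the test set while the classifier is trained on the remaining nine, and the reported accuracy is the average over the ten folds.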
Figure 4: ROC curve for the RotFst classifier (TPR versus FPR; AUC = 0.96).
4. CONCLUSION AND FUTURE WORK
In this paper we propose a novel method for detecting malware, based on the file’s temporal profile (time-based features). For future work we are planning to focus on the evaluation of the proposed method on a larger file set and on combining these temporal-based features with other types of features such as byte or OpCode n-grams.
5. REFERENCES
[1] Griffin, K., Schneider, S., Hu, X., and Chiueh, T.-c. Automatic generation of string signatures for malware detection. In Recent Advances in Intrusion Detection (2009), Springer, pp. 101–120.
[2] Jacob, G., Debar, H., and Filiol, E. Behavioral detection of malware: from a survey towards an established taxonomy. Journal in Computer Virology 4, 3 (2008), 251–266.
[3] Moser, A., Kruegel, C., and Kirda, E. Limits of static analysis for malware detection. In Computer Security Applications Conference, 2007. ACSAC 2007. Twenty-Third Annual (2007), IEEE, pp. 421–430.
[4] Moskovitch, R., Feher, C., Tzachar, N., Berger, E., Gitelman, M., Dolev, S., and Elovici, Y. Unknown malcode detection using opcode representation. In Intelligence and Security Informatics. Springer, 2008, pp. 204–215.
[5] Potashnik, D., Fledel, Y., Moskovitch, R., Elovici, Y., et al. Monitoring, analysis, and filtering system for purifying network traffic of known and unknown malicious content. Security and Communication Networks 4, 8 (2011), 947–965.
[6] Shabtai, A., Menahem, E., and Elovici, Y. F-sign: Automatic, function-based signature generation for malware. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 41, 4 (2011), 494–508.
[7] Shabtai, A., Moskovitch, R., Elovici, Y., and Glezer, C. Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey. Information Security Technical Report 14, 1 (2009), 16–29.
[8] Yin, H., Song, D., Egele, M., Kruegel, C., and Kirda, E. Panorama: capturing system-wide information flow for malware detection and analysis. In Proceedings of the 14th ACM conference on Computer and communications security (2007), ACM, pp. 116–127.