Opcodes Histogram for Classifying Metamorphic Portable Executables Malware

Babak Bashari Rad
Faculty of Computer Science and Information Technology, University Technology of Malaysia, Kuala Lumpur, Malaysia
[email protected]

Maslin Masrom
UTM Razak School of Engineering and Technology, University Technology of Malaysia, Kuala Lumpur, Malaysia
[email protected]

Suahimi Ibrahim
Advanced Informatics School, University Technology of Malaysia, Kuala Lumpur, Malaysia
[email protected]
Abstract—Malware writers generate different forms of a malware to evade signature-based scanners. As the number of variants of a metamorphic malware increases, analyzing all variants, selecting an appropriate signature, and updating the antivirus database become more tedious and time-consuming. Furthermore, for automatically generated metamorphic viruses, which use virus kits to produce different instances, it is sometimes not possible to analyze all of them. Therefore, classification methods that speed up the analysis process are necessary. In this paper, we show how the histogram of instruction opcodes can help classify the variants of a metamorphic virus family.

Keywords—metamorphic virus; virus detection; virus classification; opcode frequency histogram

I. INTRODUCTION
Whenever a new virus spreads, experts in antivirus labs start analyzing its code to extract the virus signature and update the antivirus database. If a virus has many variants, it is unwise to spend time analyzing every one of them, because manual code analysis is very time-consuming and, when the samples of a virus family are numerous, sometimes impossible [1]. On the other hand, creating a new version with a virus creation kit, retouching an existing virus, or automatically producing new variants, as metamorphic viruses do, is very easy [2-3]. This makes analysis more difficult for antivirus specialists. Although the variants of a virus differ in the appearance of their binary code, the behavior of all instances is similar [4], so they should share the same statistical features in their binaries [5]. In this paper, we implement the method proposed by Rad et al. in [5] directly on portable executable files. In this way, we can see how the histogram of instruction opcodes can help classify and detect the different forms of metamorphic virus family members.

The paper is organized as follows. In section II, we take a brief look at the most closely related recent research on this topic. In section III, we explain our methodology for applying the histogram similarity concept to portable executable files. In section IV, we present the implementation steps. The classification test, results, and evaluation of the methodology are given in section V, and the conclusion is summarized in section VI.

978-1-4673-1677-4

II. RELATED WORKS
Bilar [6] studied the distribution of opcode frequencies as a feature for detecting and distinguishing malicious code. Overall, he shows that the opcode frequency distribution of malware can be used to discriminate it considerably from non-malicious code. Infrequent opcodes also appear to show more frequency dissimilarity than frequent ones.

Rad et al. [5] show that the frequency histogram of instruction opcodes can be used as a suitable feature for identifying the mutated samples of a metamorphic virus family. The main drawback of that research is its use of the IDA Pro software [7] to disassemble the binary code manually. Their method is based on assembly source code, and the preprocessing steps of disassembling the code and breaking it down into subroutines make it impractical. The purpose of that research is to show that the histogram of opcodes can be used as a statistical feature for classifying virus family members.

Han et al. [2] show that malware classification using instruction frequencies can be a useful technique for speeding up malware detection. The major weakness of their proposed method is its false positives when the method is used on its own.

Moskovitch et al. [8] proposed the use of opcodes obtained by disassembling executable files. The authors introduced a methodology for representing malware and normal executable files in order to detect unknown malicious code using opcodes, and they employed n-grams of the opcodes as features for classification.

III. METHODOLOGY
Figure 1 summarizes the methodology. The main purpose of the proposed methodology is to find a histogram of opcodes for each family to serve as a feature. This histogram presents the average distribution of instruction opcodes in the virus family. To achieve this, we first build a database of different variants of the morphed virus. Then, we extract the opcodes from these files. This can be done with a disassembler program that is able to analyze PE binaries and recognize the first byte of each instruction. In the Portable Executable (PE) format, which is based on the Common Object File Format (COFF), the headers consist of a COFF header, an optional header, an MS-DOS stub, and a file signature. Using this information, we can extract the code segment of the file [9-10]. By analyzing the code segment, we can separate the machine instructions.
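As an illustration of how the header information leads to the code segment, the following Python sketch walks the standard PE layout (MS-DOS header, PE signature, COFF header, section table) and returns the raw bytes of every section flagged as code. It is not the program used in this work: the function name `code_sections` is hypothetical, the sketch assumes a well-formed file, and it does not perform the disassembly needed to locate instruction boundaries.

```python
import struct

# Section characteristics flag defined by the PE/COFF format: section contains code.
IMAGE_SCN_CNT_CODE = 0x00000020


def code_sections(pe_bytes):
    """Return a list of (section_name, raw_bytes) for sections flagged as code.

    Hypothetical helper for illustration; assumes a well-formed PE image.
    """
    if pe_bytes[:2] != b"MZ":
        raise ValueError("missing MS-DOS 'MZ' header")

    # The 4-byte field at offset 0x3C of the DOS header holds the file
    # offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", pe_bytes, 0x3C)
    if pe_bytes[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")

    # The 20-byte COFF header follows the signature.
    coff_offset = pe_offset + 4
    (_machine, num_sections, _timestamp, _symtab, _num_symbols,
     optional_header_size, _characteristics) = struct.unpack_from(
        "<HHIIIHH", pe_bytes, coff_offset)

    # The section table starts right after the optional header;
    # each entry is 40 bytes long.
    table_offset = coff_offset + 20 + optional_header_size
    sections = []
    for index in range(num_sections):
        entry = table_offset + 40 * index
        (name, _virtual_size, _virtual_address, raw_size, raw_pointer,
         _reloc_ptr, _line_ptr, _num_relocs, _num_lines,
         flags) = struct.unpack_from("<8sIIIIIIHHI", pe_bytes, entry)
        if flags & IMAGE_SCN_CNT_CODE:
            section_name = name.rstrip(b"\x00").decode("ascii", "replace")
            sections.append(
                (section_name, pe_bytes[raw_pointer:raw_pointer + raw_size]))
    return sections
```

Selecting sections by the code flag rather than by the name `.text` is a design choice of this sketch; packed or morphed samples may need additional handling that is outside its scope.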
The first byte of each instruction is the part that matters for our purpose. We ignore the following bytes of an instruction, because they are optional bytes, generally one or more operands, depending on the operation of the instruction [8]. These bytes are usually mutated by metamorphic engines, so they differ among the instances of a morphed malware family. Therefore, in the next stage we generate a sequence of the opcodes extracted in the disassembly procedure, in the same order in which the opcodes are located in the executable file, ignoring the extra bytes of instructions such as the succeeding bytes of opcodes, memory addresses, registers, etc. In the next step, we compute the frequency histogram of the first bytes of the opcodes in the generated sequence. There are 256 possible values for the first byte; therefore, each file has a histogram of 256 bins, counted over the whole code segment of the file. Finally, we calculate the average of the histograms of the variants in the virus family database. This histogram can be used as the opcode frequency feature of the family [11]. We will show how the opcode sequence frequency histogram can be used to classify and detect the different instances of a malware family.

Figure 1. Proposed Methodology (boxes shown: Virus Family Database, PE Analyzer, Disassembler, Opcode Extractor, Histogram Calculation, Virus Family Opcode Frequency Histogram)

IV. IMPLEMENTATION

We formed a collection of viruses generated with the Next Generation Virus Creation Kit (W32 NGVCK) [12], containing 200 different instances of this metamorphic family in the form of portable executable files. Then, we wrote a C++ program to analyze the PE files, extract the code segment of each file, and find the start points of the instructions in the code. The next step is to take the first bytes of the instruction opcodes and build a sequence of opcodes for each file. We built an opcode frequency histogram for every file in the NGVCK virus family. MATLAB R2008a [13] is used to build the histograms of the files.

The last step is to produce an average histogram, which shows the distribution of opcodes. Figure 2 shows the average histogram of the opcode distribution in our NGVCK metamorphic virus family data collection.

Figure 3 shows some sample histograms of opcodes extracted for the different kinds of programs in our data set.

V. CLASSIFICATION TEST AND EVALUATION

The classification process is based on the dissimilarity between the histogram of a PE file and the histogram of the virus family. If the dissimilarity value is less than a specific value, called the threshold, we conclude that the file belongs to the family. Otherwise, the file is classified as non-family [2, 14]. The dissimilarity between two histograms can be measured by Minkowski-form distance metrics. The Euclidean distance is the Minkowski-form metric with r = 2, as follows [5, 15]:

$$d_{X,Y} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$

where $x_i$ and $y_i$ are the values of bin $i$ in histograms $X$ and $Y$, and $n$ is the number of bins (256 in our case).
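The histogram construction and the threshold rule built on equation (1) can be sketched in Python as follows. This is a minimal illustration rather than the C++ and MATLAB implementation used in this work. The function names are hypothetical, the input is assumed to be an already extracted sequence of first opcode bytes, and the histograms are scaled to relative frequencies, which is an assumption because the scaling is not stated in the text.

```python
import math
from collections import Counter

NUM_BINS = 256  # one bin per possible value of the first opcode byte


def opcode_histogram(opcodes):
    """Relative frequency of each first opcode byte.

    Normalizing by the sequence length is an assumption of this sketch.
    """
    total = len(opcodes)
    if total == 0:
        return [0.0] * NUM_BINS
    counts = Counter(opcodes)
    return [counts.get(value, 0) / total for value in range(NUM_BINS)]


def average_histogram(histograms):
    """Bin-wise mean of the histograms of all variants in a family."""
    if not histograms:
        raise ValueError("at least one histogram is required")
    count = len(histograms)
    return [sum(h[i] for h in histograms) / count for i in range(NUM_BINS)]


def euclidean_distance(x, y):
    """Equation (1): Minkowski-form distance with r = 2."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def belongs_to_family(file_histogram, family_histogram, threshold):
    """A file is assigned to the family when its distance is below the threshold."""
    return euclidean_distance(file_histogram, family_histogram) < threshold
```

For example, `opcode_histogram([0x55, 0x8B, 0x8B, 0xC3])` places half of the mass in bin `0x8B`, and `belongs_to_family(h, family, 0.37)` applies the rule with a caller-supplied threshold.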
To test the efficiency of the proposed method, we built a test data collection of 100 PE files: 40 different NGVCK viruses, distinct from those used to produce the average histogram; 40 benign programs randomly chosen from Microsoft Windows 7 DLLs (dynamic-link libraries) and Cygwin; and 20 other virus executable files generated using the Virus Creation Lab (VCL) and G2 kits. Figure 4 illustrates the classification result. Table I shows the minimum and maximum distance values for each class in the test set relative to the average histogram of the NGVCK family. By choosing a threshold value in the range 0.36 < θ < 0.39, we are able to classify the executable files in the test collection.

TABLE I. MINIMUM AND MAXIMUM DISTANCE VALUES

Class               Min    Max
NGVCK viruses       0.16   0.36
Benign programs     0.78   1.34
Other virus files   0.39   1.32
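The threshold range quoted above follows directly from Table I: the threshold must exceed the largest distance observed for the family and stay below the smallest distance observed for any other class. The short Python sketch below reproduces that reasoning with the values copied from Table I; the function name `separating_interval` is hypothetical and is not part of the original experiment.

```python
# Distance ranges copied from Table I: class name -> (minimum, maximum).
TABLE_I = {
    "NGVCK viruses": (0.16, 0.36),
    "Benign programs": (0.78, 1.34),
    "Other virus files": (0.39, 1.32),
}


def separating_interval(ranges, family):
    """Return (lower, upper) bounds for a threshold that separates `family`.

    lower is the largest distance seen for the family, upper is the smallest
    distance seen for any other class. Returns None when the classes overlap.
    """
    lower = ranges[family][1]
    upper = min(low for name, (low, _high) in ranges.items() if name != family)
    if lower >= upper:
        return None
    return lower, upper


interval = separating_interval(TABLE_I, "NGVCK viruses")
print(interval)
```

With these values the family maximum is 0.36 and the closest non-family minimum is 0.39, so any threshold strictly between them, such as 0.37, separates the three classes in this test set.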
Figure 2. Average of opcodes distribution in NGVCK virus PE-files
Figure 3. Sample histograms of opcodes for each class of executables in our test data collection
Figure 4. Comparison result and dissimilarity values

TABLE II. CLASSIFICATION RESULT (THRESHOLD = 0.37)

Real Class of Input Executable Files   Number of Files in Test Class   True Positive   False Positive   Precision (%)   Total Accuracy (%)
NGVCK virus files                      40                              40              0                100             100
Benign programs                        40                              40              0                100             100
Other virus files                      20                              20              0                100             100
The classification results are summarized in Table II. As it shows, the proposed classification method classifies all 100 executable files correctly. To evaluate the efficiency of the proposed methodology, we first calculated the True Positives (TP), the number of infected files classified correctly, and the False Positives (FP), the number of non-infected files misclassified. Precision is the rate of correctly classified infected files, and Total Accuracy is the rate of correctly classified files, either positive or negative. Precision and Total Accuracy are obtained via equations 2 and 3 as follows [8, 11, 16-18]:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Total Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

where TN and FN are the numbers of true negatives and false negatives, respectively.
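As a worked check of the Precision and Total Accuracy measures, the following Python sketch evaluates both. The function names are hypothetical. The example counts treat membership in the NGVCK family as the positive class (40 family files accepted, 60 other files rejected, no errors), which is one reading of Table II and is an assumption of this sketch.

```python
def precision(tp, fp):
    """Equation (2): TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("precision is undefined when TP + FP is zero")
    return tp / (tp + fp)


def total_accuracy(tp, tn, fp, fn):
    """Equation (3): (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined for an empty test set")
    return (tp + tn) / total


# Assumed counts: 40 NGVCK files accepted, 60 non-family files rejected.
print(precision(tp=40, fp=0) * 100)                      # 100.0
print(total_accuracy(tp=40, tn=60, fp=0, fn=0) * 100)    # 100.0
```

Both values come out at 100%, matching the figures reported in Table II under this assumption.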
VI. CONCLUSION AND FUTURE WORK

We implemented and evaluated a classification method based on the opcode distribution, using histograms. In the presented method, a histogram of the opcodes extracted from portable executables is generated to show the average distribution of opcodes in files belonging to the same family. We then compare this histogram with the histogram of any executable file to classify whether the file belongs to the same family or not.

Our experimental results show that this method can be used as a reliable technique for classifying metamorphic PE malware. Furthermore, they show that the simple proposed methodology is not only able to recognize the malware variants but can also differentiate benign executable programs. The most significant advantages of this method are its simplicity of implementation and its high accuracy.

As future work, we suggest performing this experiment on a larger dataset with more diverse instances of metamorphic malware.

ACKNOWLEDGMENT

This study is supported by the Razak School of Engineering and Technology grant funded by University Technology of Malaysia (No. 4B010).
REFERENCES

[1] I. Santos, F. Brezo, J. Nieves, Y. Penya, B. Sanz, C. Laorden, and P. Bringas, "Idea: Opcode-Sequence-Based Malware Detection," Engineering Secure Software and Systems, pp. 35-43, 2010.
[2] K. S. Han, B. Kang, and E. G. Im, "Malware classification using instruction frequencies," in Proceedings of the 2011 ACM Symposium on Research in Applied Computation, Miami, Florida, 2011, pp. 298-300.
[3] J. Z. Kolter and M. A. Maloof, "Learning to Detect and Classify Malicious Executables in the Wild," J. Mach. Learn. Res., vol. 7, pp. 2721-2744, 2006.
[4] U. Bayer, A. Moser, C. Kruegel, and E. Kirda, "Dynamic Analysis of Malicious Code," Journal in Computer Virology, vol. 2, no. 1, pp. 67-77, 2006.
[5] B. B. Rad and M. Masrom, "Metamorphic Virus Variants Classification Using Opcode Frequency Histogram," LATEST TRENDS on COMPUTERS, pp. 147-155, 2010.
[6] D. Bilar, "Opcodes as predictor for malware," Int. J. Electron. Secur. Digit. Forensic, vol. 1, no. 2, pp. 156-168, 2007.
[7] Hex-Rays, "IDA Pro Disassembler and Debugger," 2012.
[8] R. Moskovitch, C. Feher, N. Tzachar, E. Berger, M. Gitelman, S. Dolev, and Y. Elovici, "Unknown Malcode Detection Using OPCODE Representation," in Proceedings of the 1st European Conference on Intelligence and Security Informatics, Esbjerg, Denmark, 2008, pp. 204-215.
[9] A. Singh, "Portable Executable File Format," in Identifying Malicious Code Through Reverse Engineering, Advances in Information Security, pp. 1-15: Springer US, 2009.
[10] M. Z. Shafiq, S. M. Tabish, F. Mirza, and M. Farooq, "PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime," in Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection, Saint-Malo, France, 2009, pp. 121-141.
[11] I. Santos, F. Brezo, X. Ugarte-Pedrero, and P. G. Bringas, "Opcode sequences as representation of executables for data-mining-based unknown malware detection," Information Sciences, no. 0, 2011.
[12] VXHeavens, "VX Heavens - Computer Virus Information, Library, Collection, and Sources," http://vx.netlux.org/vl.php.
[13] MathWorks, "MATLAB - The Language Of Technical Computing," MathWorks, 2008.
[14] J. Lee, C. Im, and H. Jeong, "A study of malware detection and classification by comparing extracted strings," in Proceedings of the 5th International Conference on Ubiquitous Information Management and Communication, Seoul, Korea, 2011, pp. 1-4.
[15] S. M. Tabish, M. Z. Shafiq, and M. Farooq, "Malware detection using statistical analysis of byte-level file content," in Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics, Paris, France, 2009, pp. 23-31.
[16] T. Ronghua, L. Batten, R. Islam, and S. Versteeg, "An automated classification system based on the strings of trojan and virus families," pp. 23-30.
[17] B. B. Rad, M. Masrom, S. Ibrahim, and S. Ibrahim, "Morphed Virus Family Classification Based on Opcodes Statistical Feature Using Decision Tree," The International Conference on Informatics Engineering and Information Science (ICIEIS2011), Communications in Computer and Information Science (CCIS), A. Abd Manaf, A. Zeki, M. Zamani et al., eds., pp. 123-131: Springer Berlin Heidelberg, 2011.
[18] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, "Data Mining Methods for Detection of New Malicious Executables," in 2001 IEEE Symposium on Security and Privacy, Oakland, California, 2001, pp. 38.