Malicious Code Detection Using Penalized Splines on Opcode Frequency

Mamoun Alazab, Mohammad Alkadiri, and Sitalakshmi Venkatraman



Abstract— Malicious software has grown exponentially in recent years, driven by the innumerable obfuscations of extended x86 IA-32 operation codes (opcodes) that are employed to evade traditional detection methods. In this paper, we design a novel classifier that combines a multivariate logistic regression model, using a reproducing kernel Hilbert space (HS) in penalized splines, with an opcode frequency feature selection technique to detect obfuscated malware efficiently. The main advantage of our penalized splines based feature selection technique is its performance, achieved by efficiently filtering and identifying the most important opcodes used in the obfuscation of malware. We demonstrate this through an implementation of the proposed model and experimental results on large malware datasets. Overall, the high classification accuracy of the proposed model follows from the observed behaviour of malware opcodes, whose distributions differ statistically significantly from those of benign opcodes.

Keywords — Multivariate statistics, Penalised splines, Operation codes, Obfuscation, Malware detection.

I. INTRODUCTION

Malicious software (malware) affects the secrecy and integrity of data as well as the control flow and functionality of a computer system. Recent attacks using obfuscated malicious code (previously unknown malware) have disrupted services, with huge financial and legal implications [1] [2]. Literature studies on malware detection have shown that no single technique can detect all types of malware [3] [4]. While signature-based techniques are a very efficient and effective way to detect known malware, their major drawback is the inability to detect new or unknown malicious code. Hence, many research studies have moved towards anomaly-based detection, which uses knowledge of normal behaviour patterns to decide the maliciousness of program code [5] [6] [7] [8]. Though anomaly-based detection has the key advantage of being able to detect zero-day attacks, the techniques employed are not efficient and report high false positives. This paper presents an efficient anomaly-based approach using penalised splines that achieves very low false positives even on very large real-life data sets.

M. Alazab is Research Fellow and Associate Investigator for the ARC Centre of Excellence in Policing and Security (CEPS) at the Australian National University (ANU), Canberra 2601 AUS (phone: 61-2-6125 1506). M. Alkadiri has recently completed his PhD at Science, Information Technology and Engineering, University of Ballarat, VIC 3353 AUS (phone: 61-3-5327 9860). S. Venkatraman is Senior Lecturer at School of Science, Information Technology and Engineering, University of Ballarat, Ballarat, VIC 3353 AUS (phone: 61-3-5327 9074).

Cybercrime shows no sign of decelerating and is the fastest growing crime globally. Anti-malware engines use signature matching to detect malware, where signatures are generated by human experts who disassemble the file and select pieces of unique code. Information gathered from the analysis of computer code can help determine the author of an executable [9] [10]. The term obfuscation means modifying program code in a way that preserves its functionality, with the aim of reducing its vulnerability to any kind of static analysis and deterring reverse engineering by making the code less readable and more difficult to understand [11]. Obfuscation techniques such as packing, polymorphism and metamorphism are used by malware authors as well as legitimate software developers [12], though for different reasons. Malware authors use code obfuscation very effectively to evade antivirus scanners, since it modifies the program code to produce offspring copies that have the same functionality but a different byte sequence, or ‘virus signature’, that antivirus scanners do not recognize [13]. The notable contribution of our approach is that it uses opcode frequency statistics and therefore does not require the signature of the malware when detecting it in the AV engine. Because the number of new malware samples generated every day is increasing exponentially, our approach needs an efficient reduction of the opcode feature set to be effective and feasible in real time. We propose a novel approach that combines multivariate logistic regression using kernel HS in penalized splines with an opcode frequency statistics based selection technique to filter a minimal set of opcode features for detecting obfuscated malware efficiently.
Overall, our contribution includes the following, hitherto unreported in the literature:
- A unique approach that uses opcodes as features of binaries to perform a statistical differentiation of opcode frequency,
- Efficient filtering of the opcode feature set using multivariate logistic regression with kernel HS in penalized splines,
- Performance improvements in the accuracy and efficiency of malware detection compared to some existing popular classification methods.

In this paper, we have designed a novel classifier that combines a multivariate logistic regression model using kernel HS in penalized splines with an opcode frequency feature selection technique to efficiently detect binary code that contains obfuscated malicious code. Our approach has been tested directly on PE format files to compare opcode distributions within malicious and benign files. We have automated the inspection of the opcode frequency statistics and have given a preliminary assessment of how opcode frequency can be used to detect various malicious files and differentiate them from benign files.

The rest of the paper is organized as follows. Section II provides the research background describing opcodes and executable files, as well as the need for the study. Section III summarises our literature review of related studies. Section IV describes the research methodology, and the proposed model is provided in Section V. Section VI explains the experimental dataset used for evaluating our model. In Section VII, we present the experimental results, with an analysis of data and outcomes of the study. Lastly, Section VIII provides conclusions and future work.

II. BACKGROUND

A. Opcodes and Executable files

Assembly language is the language of reversing compiled binary code [14]. In this paper, the focus is on ‘x86’ (also called ‘IA-32’, 386, or the i386 architecture), which is Intel’s 32-bit architecture and has been the basis for all of Intel’s x86 CPUs from the first i386 to the present day. We focused on the IA-32 assembly language for these experiments because it is the most popular processor architecture and is used in almost every computer. IA-32 and its 64-bit extension (Intel 64) are almost the same in terms of architecture and programming environment; the main difference is that 64-bit processors use prefixes and extensions to the 80386 instruction set. However, there are popular instructions that are likely to exist in any program, whether 32-bit or 64-bit, such as those for moving data, arithmetic or compare operations, conditional branches and function calls. Instead of focusing only on these basic instructions, our study considers all IA-32 instructions, using the list of instructions from the Intel IA-32 Architecture Software Developer's Manual, Volume 1 [15], Volume 2A [16] and Volume 2B [17].
The Win32 Portable Executable (PE) file format (used by .EXE and .DLL files, for example), introduced by Microsoft, is the standard executable format for all versions of the Windows operating system on all supported processors. A PE file has several headers and sections. Windows PE files start with the DOS header, which is identified by ‘MZ’. The second part is the PE signature field, which when viewed as ASCII text is PE\0\0. The third is the IMAGE_FILE_HEADER, containing the most basic information about the file. The fourth is the IMAGE_OPTIONAL_HEADER, which contains additional information provided by the PE creators beyond the basic information found in the IMAGE_FILE_HEADER. Last is the section table, which describes the code sections (.text) and data sections (.data). The .text section is the default section for code and normally contains the file’s Original Entry Point (OEP), that is, the point where execution of the portable executable file begins. The .data section stores writable global variables, and the .rdata section contains read-only data. The experiments in this paper, which extract opcodes and calculate their frequencies, were run on PE format files to compare opcode distributions within malicious and benign files. We automated the inspection of the opcode frequency statistics and give a preliminary assessment of their use for detecting malicious files and differentiating them from benign ones. IDA Pro Disassembler [18] was selected to view and analyse the PE files, and to build our tool for extracting opcodes from PE files.

Literature reviews over the past decade indicate that some research studies have successfully used opcode and kernel based features, as these are good not only at classifying malware but also at detecting injected malicious executables [19]. However, all these existing methods have a high false positive rate, which poses a major research challenge in arriving at efficient techniques. Malware authors are developing highly sophisticated variant distribution techniques to produce malware quickly enough to bypass anti-malware scanners [20]. Since the signature pattern based detection approach is effective and fast but cannot detect zero-day attacks, there is a need to capture the behaviour of malware based on the anomalies or behavioural patterns it exhibits, and to filter them prudently to achieve efficiency as well as accuracy in obfuscated malware detection. Therefore, our approach in this paper is to combine static signatures and dynamic heuristics based on the extended x86 IA-32 binary assembly instructions or operation codes (opcodes), and to propose a novel algorithm to detect unknown malware.

B. Need for the Study

Literature surveys on malware detection have shown that no single technique can detect all types of malware. However, two techniques are commonly used for malware detection: signature-based detection and anomaly-based detection [21-23]. Anti-virus (AV) engines use malware signatures to detect known malware. A malware signature is a byte sequence that uniquely identifies a specific malware; typically, a malware detector uses it like a fingerprint to identify the malware.
Most AV engines are supplied with a database containing information on existing malware and identify maliciousness by looking for code signatures or byte sequences while scanning the system. A malware detector scans the system for characteristic byte sequences or signatures that match one in the database and, on a match, declares the existence of malware and blocks its access to the system. This signature matching process is called signature-based detection, and most traditional AV engines use it. It is a very efficient and effective method to detect known malware [24]. Its major drawback is the inability to detect new or unknown malicious code. Signature generation also involves manual processing and requires strict code analysis. To defeat signature-based methods, polymorphic malware has a built-in polymorphic engine that generates a new variant, and hence a new signature, each time it is executed. Signature-based approaches therefore fail to detect such malware. Anomaly-based detection, on the other hand, uses knowledge of normal behaviour patterns to decide the maliciousness of program code. This approach can detect some zero-day attacks. However, it is very difficult to specify a system's or program's behaviour accurately, so these approaches usually produce more false positives than signature-based methods.
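The weakness of byte-sequence matching described above can be made concrete with a minimal sketch. The signature name, the byte patterns and the `scan` helper are hypothetical and exist only for illustration; real AV engines use far richer signature formats.

```python
def scan(data, signatures):
    """Return the names of all signatures whose byte sequence occurs in data."""
    return [name for name, pattern in signatures.items() if pattern in data]


# Hypothetical signature database (made-up name and bytes).
SIGNATURES = {"Demo.A": bytes.fromhex("90909090eb05")}

# A sample that contains the signature verbatim.
original = b"\x55\x8b\xec" + bytes.fromhex("90909090eb05") + b"\xc3"

# A variant of the same sample with two junk bytes inserted inside the
# pattern: the exact byte sequence no longer appears anywhere in the file.
variant = (b"\x55\x8b\xec" + bytes.fromhex("909090") + b"\x87\xdb"
           + bytes.fromhex("90eb05") + b"\xc3")

hits_original = scan(original, SIGNATURES)
hits_variant = scan(variant, SIGNATURES)
```

Here the exact signature matches the original sample but not the variant, which is the kind of evasion that motivates frequency-based features.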

In our previous work [25], we proposed and evaluated a method that employs several data mining techniques to detect and classify new malware based on the frequency of Windows API calls. We employed robust classifiers, namely the Naïve Bayes (NB) algorithm, the k-Nearest Neighbor (kNN) algorithm, the Sequential Minimal Optimization (SMO) algorithm with four different kernels (SMO - Normalized PolyKernel, SMO - PolyKernel, SMO - Puk, and SMO - Radial Basis Function (RBF)), the Backpropagation Neural Networks algorithm, and the J48 decision tree, and evaluated their performance. The results show that SVM on the frequency of Windows API calls is the most effective at detecting new malware with high accuracy. Therefore, in this research, we focus on SVM with an improved implementation. We have devised a penalized splines implementation to SVM as an improvement and have investigated this approach by analysing program code behaviour patterns based on the opcode frequency features of benign and malicious executables. Extracting features from obfuscated executables for reverse obfuscation is labor intensive and requires a deep understanding of kernel and assembly programming [26] [27]. Recent studies [28] [29] [30] [31] [32] [33] have used opcode analysis to generate birthmarks for portable executable files. Statistical analysis of binary file content, including statistical N-gram modeling techniques [34] [35] [36], has been tested for identifying malcode in document files, but it does not have sufficient resolution to represent all classes of file types. Other related studies [37] [11] have found that statistical modeling for hidden malcode detection does not yet perform accurately and with the required real-time efficiency against metamorphic engine techniques, whose innumerable code obfuscations can produce an exponentially large number of morphed copies of an original program.
This gap in the literature motivates this research towards a positive contribution to understanding malware behavior by proposing an efficient and novel statistical analysis of opcodes. Analysis of a computer system performed offline is called static analysis; it has been employed in this research to study the patterns of opcodes within binary executables by reverse engineering the code [38]. Static analysis provides a better understanding of the anomalous behavior patterns of the code, since we adopt a methodology that performs a deep analysis of the program code and its statistical properties [12]. Existing techniques and methods exhibit high false positives because they do not perform sufficient statistical analysis to determine whether an anomaly is ‘actually’ malicious [39]. Therefore, in this research, static anomaly-based detection analysis is adopted to perform introspection of the program code, with the goal of determining various dynamic properties of the opcodes extracted from this code in an isolated environment. The results of the following recent studies have been the prime motivation for this research: 1) malware authors can easily fool the detection engine by applying obfuscation techniques to known malware [40], 2) the rate of identifying benign files as malware (false positives) is becoming very high [41], 3) the rate of failing to detect obfuscated malware (false negatives) is high [41], 4) the current detection rate is decreasing [42], and 5) current malware detectors are unable to detect zero-day attacks [42]. These results imply that code obfuscation has become a challenge for digital forensic examiners, given the limitations of signature-based detection [43] [34].

III. RELATED STUDIES

Data mining techniques for malware detection usually start with generating a feature set. In 2005, the study reported in [44] added a temporal consistency element to the opcode frequency calculation. In [45], unigram analysis of binary byte values was applied to generate a fingerprint for identification and classification purposes. In [46], the Portable Executable Analysis Toolkit (PEAT) was developed to determine whether malicious code is present in a Windows Portable Executable (PE) file. The tool relies on structural features of executables; however, it has major weaknesses. Static Analysis for Vicious Executables (SAVE) [47] is another work, but it is based on API calls rather than opcodes, in an attempt to detect polymorphic and metamorphic malware. Its authors defined a signature as a sequence of API calls and started the reverse engineering process from decompressed 16 binaries, which are then passed through a PE file parser. Another signature-free system, which detects polymorphic and unknown malware by analysing Windows API execution sequences extracted from binary executables, is the Intelligent Malware Detection System (IMDS) [48]. IMDS was developed using Objective-Oriented Association (OOA) mining based classification with a large data set gathered for the experiment (29580 binary executables, of which 12214 were benign and 17366 were malicious). In 2010, the authors of IMDS incorporated the CIDCPF method into their existing IMDS system with a larger dataset and called it the CIMDS system [16]. Their results were good, but involved unbalanced test data while the training data was quite balanced.
Also, the detection rate on the training set was about 89.6% with an accuracy of approximately 71.4%, and on the testing set about 88.2% with an accuracy of approximately 67.6%, so the work still needs improvement to achieve a higher detection rate and higher overall accuracy. In 2006, researchers [19] described the use of machine learning and data mining to detect and classify malicious executables. They tested several classifiers, including IBk, naive Bayes, support vector machines (SVMs), decision trees, boosted naive Bayes, boosted SVMs, and boosted decision trees. Kolter found that the support vector machine performed exceptionally well and fast compared to the other classifiers. Hence, for the obfuscated malware detection system, this research adopts SVM as the classifier for detecting hidden malware that invariably uses API call sequences. API based features are good not only at classifying malware but also at detecting injected malicious executables. DOME [13] is a host-based technique that uses static analysis based on monitoring and validating Win32 API calls to detect malicious code in binary executables. In a study on the performance of kernel methods in the context of robustness and generalization capabilities of malware classification [49], results revealed that analysis based on Win API function calls provides good accuracy for classifying malware. Bilar [28] performed a statistical analysis of the frequency distribution of opcodes on 67 malware and 20 benign samples. The results show that malware opcode distributions differ statistically significantly from those of benign software. In 2012, Shabtai et al. [28] used opcode n-gram patterns as features for classification and identification, achieving a detection accuracy of 96%.

IV. RESEARCH METHODOLOGY

We propose a feature selection methodology that does not require any knowledge of the binary signatures (zero knowledge). We use a unique behavior-based fingerprint of executable programs to detect malware. Other behavior-based studies use API call features; in this paper, we consider the opcodes based on the extended x86 IA-32 binary assembly instructions as the unique fingerprint of executables. Our approach is first to disassemble the binary executable to obtain opcode frequency statistics, and then to apply opcode feature selection based on Akaike's Information Criterion (AIC), which depends on the maximum likelihood function, to reduce the number of features. Finally, we classify the files as malware or benign by applying multivariate logistic regression using kernel HS in penalized splines.

The Support Vector Machine (SVM) has become one of the main tools in classification analysis in recent decades. It has played a crucial role as a vehicle for what have recently become known as kernel machines. Penalized splines have connections to both kernel methods and mixed models, which has made the technique popular for analysing different types of data. In a recent monograph, Pearce and Wand [50] established, in a cross-fertilization setting, the connection between penalized splines and SVM, illustrating that SVM can be considered a special case of penalized splines.
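The first step of the methodology, turning a disassembly into opcode frequency statistics, can be sketched as follows. This is only an illustration under stated assumptions: the listing format (one instruction per line, mnemonic first, `;` starting a comment) and the function name `opcode_frequencies` are hypothetical and do not reproduce the output of IDA Pro or the authors' extraction tool.

```python
from collections import Counter


def opcode_frequencies(listing):
    """Map each opcode mnemonic to its relative frequency in the listing."""
    counts = Counter()
    for line in listing.splitlines():
        code = line.split(";", 1)[0].strip()  # drop trailing comments
        if code:
            counts[code.split()[0].upper()] += 1  # first token is the mnemonic
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()} if total else {}


# Hypothetical five-instruction listing used only as sample input.
SAMPLE = """\
push ebp
mov ebp, esp
mov eax, 1
nop            ; padding
retn
"""

freqs = opcode_frequencies(SAMPLE)
```

Each executable then becomes one row of a feature matrix, with one column per opcode, which is the input to the feature selection and classification steps.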
Our research builds on the Pearce and Wand method and improves it further. A good general explanation of the penalized splines methodology, which is also known in the literature as P-splines, low-rank splines and reduced-knot splines, can be found in [51] and [45]. The statistical approach of penalized splines and its simplicity of implementation make it popular in statistical analysis [52]. In this paper, the focus is on the penalized splines implementation to SVM. The advantage of this approach over other kernel approaches is its suitability and efficiency for large and complex malware detection studies. The next section discusses further how our proposed approach uses the knots principle to achieve the required efficiency.

V. PROPOSED MODEL

Statistical methods offer rich tools to analyse and classify malware data. First, we use multivariate logistic regression and Akaike's Information Criterion (AIC), which depends on the maximum likelihood function, to reduce the number of features. We then apply the reproducing kernel HS in penalised splines, discussed below, to classify the data set into two categories, 'Malware' and 'Benign'. In a simple linear regression equation, say y = a + b x + e, we have just one explanatory (or independent) variable, x. The other components in this relation are the response (or dependent variable) y, the coefficients a and b, and the error term e. The response is usually a continuous random variable, for example the Gaussian response, which is the most popular case in regression, whilst in some other cases the response takes discrete values; logistic regression is one example of such cases. In the Cartesian plane, ℝ × ℝ, where the n data points (x_i, y_i), 1 ≤ i ≤ n, are scattered, we define the simple penalized splines case as:

\[ y_i = \beta_0 + \beta_1 x_i + \sum_{k=1}^{K} u_k (x_i - \kappa_k)_+ + \varepsilon_i \tag{1} \]

where $\beta_0, \beta_1$ are constant coefficients, $u_1, \dots, u_K$ are random coefficients from a specific distribution with fixed mean and variance, $\varepsilon_i$ are random noise and $(x - \kappa_k)_+$ is defined as $\max(0, x - \kappa_k)$. The set of knots $\kappa_1, \dots, \kappa_K$ can be chosen on the range of the predictor, and their count, K, is very small compared to the data size n. This last principle is what makes the approach more efficient on large data sets, as noted earlier. The least squares method, which minimizes the squared vertical distances between the scatter points and the proposed fitted line, is one of the simplest techniques that can be used to fit the penalized spline model in (1). Penalized splines least squares can be achieved by minimizing, over $\beta$ and $u$, the expression

\[ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \beta_1 x_i - \sum_{k=1}^{K} u_k (x_i - \kappa_k)_+ \Big)^2 + \lambda \sum_{k=1}^{K} u_k^2 \tag{2} \]

where $\beta = (\beta_0, \beta_1)^T$ and $u = (u_1, \dots, u_K)^T$. The usual vector notation is used here, where the superscript T denotes the transpose operator on vectors or matrices. In equation (2), the last sum is added to avoid overfitting, together with the nonnegative smoothing parameter $\lambda$, which captures the trade-off between overfitting and bias. These general concepts of the penalized least squares method are discussed widely in the literature [53]. We employ a reproducing kernel approach for additional improvement, as such an approach can take kernel methods to a new stage in both literature and application. Reproducing kernel approaches are established within functional decomposition in what is called a reproducing kernel Hilbert space (HS). Wahba (1999) and other researchers have summarized and defined this paradigm [54]. In this paper, particular attention is given to carrying the definition of reproducing kernel methods over to penalized splines, which mainly depends on projection operators on spaces. Our approach is described in the remainder of this section.
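Before turning to that description, the penalized least squares fit of the truncated-line model (1) under criterion (2) can be sketched numerically. This is a minimal pure-Python illustration, not the authors' R implementation: the function names, the small Gaussian-elimination solver and the sample data are all assumptions introduced here. It solves the normal equations (CᵀC + λD)θ = Cᵀy, where C has rows [1, x, (x − κ_1)_+, …, (x − κ_K)_+] and D = diag(0, 0, 1, …, 1) penalizes only the u coefficients.

```python
def truncated_line_basis(x, knots):
    """Row [1, x, (x-k1)+, ..., (x-kK)+] of the design matrix for model (1)."""
    return [1.0, x] + [max(0.0, x - k) for k in knots]


def solve(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def fit_penalized_spline(xs, ys, knots, lam):
    """Return [beta0, beta1, u_1, ..., u_K] minimising criterion (2)."""
    rows = [truncated_line_basis(x, knots) for x in xs]
    p = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    for k in range(2, p):  # penalise only the u coefficients
        a[k][k] += lam
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(a, b)


def predict(theta, x, knots):
    return sum(t * c for t, c in zip(theta, truncated_line_basis(x, knots)))


# Hypothetical data: y = |x| has a kink at 0, so a single knot at 0 fits it.
xs = [-2.0 + 0.5 * i for i in range(9)]
ys = [abs(x) for x in xs]
theta_small = fit_penalized_spline(xs, ys, [0.0], 1e-9)
theta_large = fit_penalized_spline(xs, ys, [0.0], 1e6)
```

With a very small λ the fit follows the kink, while a very large λ shrinks the u coefficient towards zero and the fit approaches a straight line, which is the trade-off that λ controls.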

Assume that we have two features, $x_1$ and $x_2$; then the linear penalized spline model takes the form given in (3):

\[ y_i = f(x_{1i}, x_{2i}) + \varepsilon_i \tag{3} \]

where

\[ f(x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \sum_{k=1}^{K_1} u_{1k} (x_1 - \kappa_{1k})_+ + \sum_{k=1}^{K_2} u_{2k} (x_2 - \kappa_{2k})_+ \tag{4} \]

Here, $\kappa_{1k}$ and $\kappa_{2k}$ are the two sets of knots for the features $x_1$ and $x_2$ respectively. In this model, we adopt the least squares approach to minimise the expression in (5):

\[ \min_{f \in \mathbb{H}} \; \sum_{i=1}^{n} l\big(y_i, f(x_i)\big) + \lambda \|f\|_{\mathbb{H}}^{2} \tag{5} \]

where we have the p predictors/features $x_i$ and the response $y_i$, 1 ≤ i ≤ n, as scatter points $(x_i, y_i)$ which belong to the space $\mathbb{R}^p \times \mathbb{R}$. The unknown function that formulates the relationship between the features and the responses is f, and l is a loss function that needs to be given. The symbol ‖·‖ is the vector norm. If we have only one feature, the reproducing kernel HS in penalized splines can be factorized into orthogonal subspaces as given in (6):

\[ \mathbb{H} = \Big\{ f : f(x) = \underbrace{\beta_0 + \beta_1 x}_{\mathbb{H}_0} + \underbrace{\sum_{k=1}^{K} u_k (x - \kappa_k)_+}_{\mathbb{H}_1} \Big\} \tag{6} \]

where $\mathbb{H}_0$ is orthogonal to $\mathbb{H}_1$. Also, f can be expressed in terms of $f_0 \in \mathbb{H}_0$ and $f_1 \in \mathbb{H}_1$ such that $f = f_0 + f_1$.

The response (or dependent variable) in logistic regression is usually dichotomous (i.e. measured at two levels); that is, it takes the value 1 with probability of success $\pi$, or the value 0 with probability of failure $1 - \pi$. The multivariate logistic model takes the form given in (7):

\[ \operatorname{logit}[\pi(x)] = \log \frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \tag{7} \]

where the constants $\beta_0, \beta_1, \dots, \beta_p$ are called coefficients, and $X_1, X_2, \dots, X_p$ are the “multivariate” explanatory variables. The data corresponding to MLW indicate the occurrence of Malware or Benign (MLW) for the set of files. The MLW data are coded

\[ \mathrm{MLW}_i = \begin{cases} 1 & \text{if the } i\text{th data item is Malware} \\ 0 & \text{if the } i\text{th data item is Benign} \end{cases} \tag{8} \]

and considered as the response variable $y_i$, while all other features are considered as independent or explanatory variables. The logistic regression model is given by (9):

\[ \log \frac{P(\mathrm{MLW} \mid X_1, X_2, \dots, X_p)}{1 - P(\mathrm{MLW} \mid X_1, X_2, \dots, X_p)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \tag{9} \]

where $X_1, X_2, \dots, X_p$ are the features of operation code.

Definition 1: Risk of Malware: We define the risk of malware as given by (10):

\[ \pi(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}} \tag{10} \]

Definition 2: To obtain the probability of a file to have Malware, we compute (11):

\[ P(\mathrm{MLW} \mid X_1, \dots, X_p) = \frac{e^{(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}}{1 + e^{(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}} \tag{11} \]
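The logistic model of this section, and the AIC used to reduce the feature set, can be illustrated with a short sketch. It assumes the standard inverse-logit form of the probability and the standard definition AIC = 2k − 2 log L, where k is the number of estimated coefficients; the function names, coefficients and data are hypothetical and are not taken from the authors' fitted model.

```python
import math


def linear_predictor(beta, x):
    """beta[0] is the intercept; beta[1:] pair with the opcode features in x."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


def probability(beta, x):
    """P(MLW = 1 | x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(beta, x)))


def log_likelihood(beta, rows, labels):
    """Bernoulli log-likelihood of 0/1 labels given feature rows."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = probability(beta, x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def aic(beta, rows, labels):
    """AIC = 2k - 2 log L, with k the number of estimated coefficients."""
    return 2 * len(beta) - 2 * log_likelihood(beta, rows, labels)


# Hypothetical coefficients and one-feature data, for illustration only.
beta = [0.5, 1.0]
p = probability(beta, [0.2])
odds = p / (1.0 - p)  # equals exp(linear predictor)

rows = [[1.0], [2.0], [3.0], [4.0]]
labels = [1, 0, 1, 0]
aic_null = aic([0.0, 0.0], rows, labels)
```

Among candidate models, the one with the smaller AIC is preferred, which is how a stepwise procedure decides whether adding or dropping an opcode feature is worthwhile.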

The goal of logistic regression is to correctly predict the category of outcome for individual cases using the most parsimonious model. To achieve this aim, a model is created that includes all independent (or explanatory) variables that are significant (useful) in predicting the response variable. Several different options are available during model creation. Variables can be entered into the model in the order specified by the researcher, or logistic regression can test the fit of the model after each coefficient is added or deleted, which is called stepwise regression. We have implemented our proposed model in the R programming software, and we have employed the command step to perform this function based on Akaike's Information Criterion (AIC). The next section describes the dataset used for the experimental evaluation of our proposed logistic regression model.

VI. EXPERIMENTAL DATASET

We gathered 66,703 executable files in total, consisting of 51,223 recent malware files and 15,480 benign files, as shown in Table 1. These large malware datasets, which contain obfuscated and unknown malware, were collected from the honeynet project, VX Heavens (VX Heavens 2011) and other sources. The 15,480 benign files include application software such as databases, educational software, mathematical software, image editing, spreadsheet, word processing, decision making software, Internet browsers, email and many others, as well as system software, programming language software and many other applications. Both malware and benign files have been uniquely named according to their MD5 hash value.
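Naming samples by their MD5 hash, as described above, means that two files with identical content receive the same name. A minimal sketch using Python's standard `hashlib` module follows; the helper name `md5_hex` and the chunk size are assumptions made for illustration, not the authors' tooling.

```python
import hashlib
import io


def md5_hex(stream, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# In-memory stand-ins for sample files; real use would open files in "rb" mode.
name_a = md5_hex(io.BytesIO(b"abc"))
name_b = md5_hex(io.BytesIO(b"abc"))
name_empty = md5_hex(io.BytesIO(b""))
```

Reading in chunks keeps memory use flat even for the largest executables in the dataset.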

TABLE 1. Dataset

Type          Qty      Max. Size (KB)   Min. Size (KB)   Avg. Size (KB)
Benign        15,480   109,850          0.8              32,039
Virus         17,509   546              1.9              142
Worm          10,403   13,688           1.6              860
Rootkit       270      570              2.8              380
Backdoor      6,689    1,299            2.4              685
Constructor   1,039    77,662           0.9              1,193
Exploit       1,207    22,746           0.5              375
Flooder       905      16,709           1                1,397
Trojan        13,201   17,810           0.7              1,819

VII. EXPERIMENTAL RESULTS

A. Descriptive Analysis of Data

In our dataset, the aggregate malware files yielded about 48,629,512 opcodes and the aggregate benign files yielded about 405,942 opcodes. The experiment was run on a total of 590 different opcodes collected from Intel, but for the analysis only the opcodes found in our sample binaries were considered, a total of 80 opcodes. The analysis shows that the top 13 opcodes for malware and benign files are identical (ADD/ CALL/ CMP/ JMP/ JNZ/ JZ/ LEA/ MOV/ POP/ PUSH/ RETN/ TEST/ XOR). Many of the new opcodes were not used at all in any of our samples, such as: Move Data from String to String (MOVS/ MOVSB/ MOVSW/ MOVSD/ MOVSQ), Compare String Operand (CMPS/ CMPSB/ CMPSW/ CMPSD/ CMPSQ), Load Machine Status Word (LMSW), Load String (LODS/ LODSB/ LODSW/ LODSD/ LODSQ), Repeat String Operation Prefix (REP/ REPE/ REPZ/ REPNE/ REPNZ), and Scan String (SCAS/ SCASB/ SCASW/ SCASD). Figure 1 shows the top 13 opcodes for both malware and benign executables and their frequencies of use. From Figure 1 it is evident that the percentages of use of the most popular opcodes are almost the same in malware and clean binaries. This shows that obfuscations target less popular opcodes, whose use differs statistically significantly. For example, dead-code insertion with NOPs (No Operation Performed), which inserts operations that do nothing, exhibits a very high frequency of use in malware compared to benign files: 98.8% of NOP use was in malware compared to 1.2% in benign files. The same holds for the opcodes JMP and JZ, which are used to shuffle the binary content: 84% in malware compared to 16% in benign files. This is clear evidence that malware authors are using obfuscation methods to transform malcode into new code without affecting the original functionality or purpose, thereby making it very difficult to reverse engineer and decipher the signature successfully. This phenomenon is effectively utilised in our model to filter the most significant opcodes.
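The per-opcode comparison in this subsection can be sketched as follows. The sketch assumes that a percentage such as the one reported for NOP is the share of all occurrences of that opcode that fall in malware files; this reading, the function names and the counts below are assumptions for illustration, not the authors' data. A second helper computes the frequency of an opcode relative to all opcodes in one class, which separates usage differences from differences in corpus size.

```python
def malware_share(malware_counts, benign_counts, opcode):
    """Share of all occurrences of `opcode` that fall in the malware corpus."""
    m = malware_counts.get(opcode, 0)
    b = benign_counts.get(opcode, 0)
    return m / (m + b) if m + b else 0.0


def relative_frequency(counts, opcode):
    """Frequency of `opcode` relative to all opcodes counted in one corpus."""
    total = sum(counts.values())
    return counts.get(opcode, 0) / total if total else 0.0


# Hypothetical aggregate counts (not taken from the paper's dataset).
malware_counts = {"MOV": 600, "NOP": 300, "PUSH": 100}
benign_counts = {"MOV": 70, "NOP": 5, "PUSH": 25}

nop_share = malware_share(malware_counts, benign_counts, "NOP")
nop_rel_malware = relative_frequency(malware_counts, "NOP")
nop_rel_benign = relative_frequency(benign_counts, "NOP")
```

Because the two corpora can differ greatly in size, looking at the within-class relative frequency alongside the raw share helps show whether an opcode is genuinely over-represented in malware.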

B. Outcomes of the Study

The output generated from the proposed logistic regression model and analysis (Table 2) indicates that the estimated risk of a file being classified as Malware is 2.702. We describe this result as: "A file is 2.7 times as likely to be classified as Malware as it is to be classified as Benign." After introducing all variables or features into the model, the set of features shown in red font is considered significant (Table 2). The others are removed from the opcode feature set as they are considered insignificant at the 0.05 significance level; the resulting model has the smallest AIC, 376.085. Table 3 shows the mean misclassification rates and p-values, with 4 degrees of freedom in each smoothing function, for the significant features found in Table 2. The model (6) was used with all features selected in Table 2. The truncated line additive model extended from (6) was used with an equal number of 20 knots for all features. The result shows that the truncated additive smoothing splines classifier is significant, and that the reduced opcode feature set was successfully used in our penalized splines implementation to SVM to classify malware and benign files accurately.

VIII. CONCLUSIONS AND FUTURE WORK

This paper has attempted to address the exponential growth in unknown malware witnessed today, which is due to the easy availability of automated obfuscations of operation codes (opcodes). Current malware detection systems have proven futile, resulting in escalating zero-day attacks. Many sophisticated techniques reported in the literature are not only time-consuming but also report high false positives.
In this paper, we have proposed a novel and intelligent approach that combines the statistical behaviour patterns of the opcode features extracted from the executables with Multivariate Logistic Regression using kernel HS in penalized splines to filter a minimal set of opcode features for detecting obfuscated malware efficiently. We have adopted Akaike's Information Criterion (AIC), which depends on the maximum likelihood function, to reduce the number of opcode features, and have devised a penalized splines implementation of SVM. We implemented our technique as a fully automated system, from the initial process of extracting the opcode features of each executable up to the final process of classifying it as either malware or benign. To validate our proposed malware detection approach, we conducted a comprehensive experiment on large data sets (51,223 samples) of malware, which were accurately detected using a reduced opcode feature set derived automatically through our approach using penalised splines. Further, our experimental result suggests that an executable file is about 2.7 times as likely to be classified as Malware as it is to be classified as Benign. We have demonstrated through experiments that our proposed approach is significant and was successfully used in our penalized splines implementation of SVM for classifying malware and benign executables in large data sets.

Future work will analyse opcode relationships further to understand the behaviour of malware and its obfuscation patterns, so that malware families can be classified efficiently. The focus will also be on combining the classifiers learned using opcode features with those learned using API call features on different datasets. When training on the same dataset, several classifiers would be employed, with the final class output assigned by a voting strategy.
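The voting strategy mentioned for future work could take many forms. The sketch below shows plain majority voting over the labels produced by several classifiers, as one possible reading. The tie-breaking rule (the earliest classifier in the list wins) and the classifier outputs are assumptions made for illustration, not something the paper specifies.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most classifiers.

    Ties are resolved in favour of the tied label that appears first in
    `predictions` (an assumed rule).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of three classifiers for one executable.
votes = ["malware", "benign", "malware"]
print(majority_vote(votes))
```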

REFERENCES

[1] S. McCombie, J. Pieprzyk, and P. Watters, "Cybercrime Attribution: An Eastern European Case Study," in The 7th Australian Digital Forensics Conference, Perth, Australia, 2009, pp. 41-51.
[2] P. Watters and S. McCombie, "A methodology for analyzing the credential marketplace," Journal of Money Laundering Control, vol. 14, pp. 32-43, 2011.
[3] M. Alazab, S. Venkatraman, P. Watters, and M. Alazab, "Information Security Governance: The Art of Detecting Hidden Malware," in IT Security Governance Innovations: Theory and Research, D. Mellado, L. Sánchez, E. Fernández-Medina, and M. Piattini, Eds. IGI Global, 2012, pp. 293-315.
[4] P. Watters and R. Layton, "Fake file detection in P2P networks by consensus and reputation," in The 1st International Workshop on Complexity and Data Mining, Nanjing, Jiangsu, 2011, pp. 80-83.
[5] A. Stabek, P. Watters, and R. Layton, "The Seven Scam Types: Mapping the Terrain of Cybercrime," in Cybercrime and Trustworthy Computing Workshop, Ballarat, VIC, 2010, pp. 41-51.
[6] D. Lobo, P. Watters, and X. Wu, "Identifying Rootkit Infections Using Data Mining," in The International Conference on Information Science and Applications, Seoul, 2010, pp. 1-7.
[7] D. Lobo, P. Watters, and X. Wu, "RBACS: Rootkit Behavioral Analysis and Classification System," in The International Conference on Knowledge Discovery and Data Mining, Ballarat, 2010, pp. 75-80.
[8] D. Lobo, P. Watters, X. Wu, and L. Sun, "Windows Rootkits: Attacks and Countermeasures," in Cybercrime and Trustworthy Computing Workshop, Ballarat, 2010, pp. 69-78.
[9] R. Layton, P. Watters, and R. Dazeley, "Authorship Attribution for Twitter in 140 characters or less," in The 2nd IEEE Cybercrime and Trustworthy Computing Workshop, Ballarat, VIC, 2010, pp. 1-8.
[10] R. Layton, P. Watters, and R. Dazeley, "Recentred local profiles for authorship attribution," Journal of Natural Language Engineering, vol. 18, pp. 293-312, 2012.
[11] C. Linn and S. Debray, "Obfuscation of executable code to improve resistance to static disassembly," in 10th ACM Conference on Computer and Communications Security, Washington, DC, USA, 2003, pp. 290-299.
[12] M. Alazab, S. Venkatraman, and P. Watters, "Effective digital forensic analysis of the NTFS disk image," Ubiquitous Computing and Communication Journal, vol. 4, pp. 551-558, 2009.
[13] J. C. Rabek, R. I. Khazan, S. M. Lewandowski, and R. K. Cunningham, "Detection of injected, dynamically generated, and obfuscated malicious code," presented at the Proceedings of the 2003 ACM Workshop on Rapid Malcode, Washington, DC, USA, 2003.
[14] E. Eilam, Reversing: Secrets of Reverse Engineering, 1st ed. Wiley Publishing, 2005.
[15] Intel, "Intel® 64 and IA-32 Architectures Software Developer's Manuals: Basic Architecture," vol. 1, 2010.
[16] Y. Yanfang, L. Tao, J. Qingshan, and W. Youyu, "CIMDS: Adapting Postprocessing Techniques of Associative Classification for Malware Detection," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 40, pp. 298-307, 2010.
[17] Intel, "Intel® 64 and IA-32 Architectures Software Developer's Manuals: Instruction Set Reference, N-Z," vol. 2B, 2010.
[18] IDA Pro, "IDA Pro Disassembler and Debugger," 5.7 ed., 2010.
[19] J. Z. Kolter and M. A. Maloof, "Learning to Detect and Classify Malicious Executables in the Wild," J. Mach. Learn. Res., vol. 7, pp. 2721-2744, 2006.
[20] R. Mukhtar, A. Al-Nemrat, M. Alazab, S. Venkatraman, and H. Jahankhani, "Analysis of Firewall Log Based Detection Scenarios for Evidence in Digital Forensics," International Journal of Electronic Security and Digital Forensics, 2012.
[21] A. Dinaburg, P. Royal, M. Sharif, and W. Lee, "Ether: malware analysis via hardware virtualization extensions," in Proceedings of the 15th ACM Conference on Computer and Communications Security, Alexandria, Virginia, USA, 2008, pp. 51-62.
[22] G. Lawton, "Virus Wars: Fewer Attacks, New Threats," IEEE Computer Society, vol. 35, pp. 22-24, 2002.
[23] B. Birrer, R. Raines, R. Baldwin, M. Oxley, and S. Rogers, "Using Qualia and Hierarchical Models in Malware Detection," Special Issue on Intrusion and Malware Detection: Journal of Information Assurance and Security, vol. 4, 2009.
[24] S. Venkatraman, "Autonomic Context-Dependent Architecture for Malware Detection," presented at the e-Tech 2009, International Conference on e-Technology, Singapore, 2009.
[25] M. Alazab, S. Venkatraman, P. Watters, and M. Alazab, "Zero-day Malware Detection based on Supervised Learning Algorithms of API call Signatures," in The 9th Australasian Data Mining Conference, Ballarat, VIC, 2011, pp. 171-182.
[26] M. Alazab, R. Layton, S. Venkatraman, and P. Watters, "Malware Detection Based on Structural and Behavioural Features of API calls," in The 1st International Cyber Resilience Conference, Perth, WA, 2010, pp. 1-10.
[27] M. Alazab, S. Venkatraman, and P. Watters, "Towards Understanding Malware Behaviour by the Extraction of API Calls," in Cybercrime and Trustworthy Computing Workshop, Ballarat, VIC, 2010, pp. 52-59.
[28] A. Shabtai, R. Moskovitch, C. Feher, S. Dolev, and Y. Elovici, "Detecting unknown malicious code by applying classification techniques on OpCode patterns," Security Informatics, vol. 1, pp. 1-22, 2012.
[29] D. Bilar, "Opcodes as predictor for malware," Int. J. Electron. Secur. Digit. Forensic, vol. 1, pp. 156-168, 2007.
[30] R. Moskovitch, C. Feher, N. Tzachar, E. Berger, M. Gitelman, S. Dolev, and Y. Elovici, "Unknown Malcode Detection Using OPCODE Representation," in Proceedings of the 1st European Conference on Intelligence and Security Informatics, Esbjerg, Denmark: Springer-Verlag, 2008, pp. 204-215.
[31] D. Bilar, "Callgraph properties of executables and generative mechanisms," AI Communications - Network Analysis in Natural Sciences and Engineering, vol. 20, pp. 231-243, 2007.
[32] B. Rad and M. Masrom, "Metamorphic Virus Variants Classification Using Opcode Frequency Histogram," in Latest Trends on Computers, vol. 1, N. Mastorakis, V. Mladenov, and Z. Bojkovic, Eds. WSEAS Press, 2010, pp. 147-155.
[33] B. Rad and M. Masrom, "Metamorphic Virus Detection in Portable Executables Using Opcodes Statistical Feature," in International Conference on Advanced Science, Engineering and Information Technology, Kuala Lumpur, Malaysia, 2011, pp. 403-408.
[34] I. Santos, Y. K. Penya, J. Devesa, and P. Bringas, "N-grams-based file signatures for malware detection," 2009, pp. 317-320.
[35] S. Stolfo, K. Wang, and W.-J. Li, "Fileprints: Identifying File Types by n-gram Analysis," in IEEE Workshop on Information Assurance, United States Military Academy, West Point, NY, 2005.
[36] C. Wang, J. Pang, R. Zhao, W. Fu, and X. Liu, "Malware Detection Based on Suspicious Behavior Identification," in First International Workshop on Education Technology and Computer Science, Wuhan, Hubei, China, 2009, pp. 198-202.
[37] S. Venkatraman, "Autonomic Context-Dependent Architecture for Malware Detection," in e-Tech 2009, International Conference on e-Technology, Singapore, 2009, pp. 2927-2947.
[38] M. Alazab, S. Venkatraman, and P. Watters, "Digital forensic techniques for static analysis of NTFS images," in The 4th International Conference on Information Technology, Amman, Jordan, 2009, pp. 1-9.
[39] G. Jacob, H. Debar, and E. Filiol, "Behavioral detection of malware: from a survey towards an established taxonomy," Journal in Computer Virology, vol. 4, pp. 251-266, 2008.
[40] M. Sharif, V. Yegneswaran, H. Saidi, P. Porras, and W. Lee, "Eureka: A Framework for Enabling Static Malware Analysis," in Computer Security - ESORICS 2008, vol. 5283, S. Jajodia and J. Lopez, Eds. Springer Berlin / Heidelberg, 2008, pp. 481-500.
[41] Symantec Enterprise Security. (2011) Symantec Internet Security Threat Report: Trends for 2010. Symantec Enterprise Security.
[42] P. Wood, "Symantec Intelligence - May 2012: Malware Moves Outside of the Windows World," Symantec Intelligence Report, 2012.
[43] K. Tang, M.-T. Zhou, and Z.-H. Zuo, "An Enhanced Automated Signature Generation Algorithm for Polymorphic Malware Detection," Journal of Electronic Science and Technology, vol. 8, pp. 114-121, 2010.
[44] D. J. Malan and M. D. Smith, "Host-based detection of worms through peer-to-peer cooperation," presented at the Proceedings of the 2005 ACM Workshop on Rapid Malcode, Fairfax, VA, USA, 2005.
[45] Y. Li and D. Ruppert, "On the asymptotics of penalized splines," vol. 95, pp. 415-436.
[46] M. Weber, M. Schmid, D. Geyer, and M. Schatz, "A Toolkit for Detecting and Analyzing Malicious Software," in The 18th IEEE Annual Computer Security Applications Conference, Washington, DC, 2002, pp. 423-431.
[47] A. H. Sung, J. Xu, P. Chavez, and S. Mukkamala, "Static analyzer of vicious executables (SAVE)," in 20th Annual Computer Security Applications Conference, Tucson, AZ, USA, 2004, pp. 326-334.
[48] Y. Ye, D. Wang, T. Li, and D. Ye, "IMDS: intelligent malware detection system," presented at the Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, 2007.
[49] M. Shankarapani, K. Kancherla, S. Ramammoorthy, R. Movva, and S. Mukkamala, "Kernel machines for malware classification and similarity analysis," in The 2010 International Joint Conference on Neural Networks, Barcelona, 2010, pp. 1-6.
[50] N. D. Pearce and M. P. Wand, "Penalised spline support vector classifiers: computational issues," The American Statistician, vol. 60, pp. 233-240, 2006.
[51] P. Eilers and B. Marx, "Flexible Smoothing with B-splines and Penalties," Statistical Science, vol. 11, pp. 89-102, 1996.
[52] M. Al Kadiri, R. Carroll, and M. Wand, "Marginal longitudinal semiparametric regression via penalized splines," Statistics and Probability Letters, vol. 80, pp. 1242-1252, 2010.
[53] H. Wu and J.-T. Zhang, Nonparametric Regression Methods for Longitudinal Data Analysis: Mixed-Effects Modeling Approaches. Hoboken, NJ: John Wiley and Sons, 2006.
[54] G. Wahba, Support Vector Machines, Reproducing Kernel Hilbert Spaces and Randomized GACV. Cambridge, MA, USA: MIT Press, 1999.

Table 2: Opcode tests based on the model effects

ID   Opcode Name   Likelihood Ratio Chi-Square   Sig.
1    AAD            0.596                        0.440
2    AAM            3.662                        0.056
3    AAA            0.009                        0.926
4    AAS            0.456                        0.499
5    ADC            0.095                        0.758
6    ADD            4.491                        0.034
7    AND            0.176                        0.675
8    ARPL           0.011                        0.918
9    CALL           0.106                        0.745
10   CBW            0.000                        0.995
11   CLC            1.134                        0.287
12   CLD           16.496                        0.000
13   CLI            0.046                        0.831
14   CMC            0.960                        0.327
15   CMP            0.027                        0.869
16   CWD            0.002                        0.961
17   DAA            2.628                        0.105
18   DAC            0.357                        0.550
19   DEC            5.077                        0.024
20   DIV            5.771                        0.016
21   FLDCW          2.078                        0.149
22   HLT            0.221                        0.638
23   IDIV           0.108                        0.743
24   IMUL           0.656                        0.418
25   IN             0.969                        0.325
26   INC            7.280                        0.007
27   INT            0.079                        0.779
28   INTO           0.160                        0.689
29   IRET           0.890                        0.346
30   JA             2.552                        0.110
31   JB             2.412                        0.120
32   JBE            0.032                        0.858
33   JMP           10.661                        0.001
34   JNB            8.461                        0.004
35   JNZ            2.622                        0.105
36   JZ             2.297                        0.130
37   LAR            0.001                        0.974
38   LEA            0.170                        0.681
39   LGDT           0.000                        1.000
40   LOOP          20.021                        0.000
41   LSL            0.000                        1.000
42   LTR            0.013                        0.909
43   MOV            0.991                        0.320
44   MUL            5.144                        0.023
45   NEG            4.311                        0.038
46   NOP            8.362                        0.004
47   NOT            1.352                        0.245
48   OR            32.249                        0.000
49   OUT            0.157                        0.692
50   POP            0.547                        0.460
51   POPF           0.990                        0.320
52   PUSH           6.798                        0.009
53   PUSHF          5.571                        0.018
54   RCL            1.475                        0.225
55   RCR            5.363                        0.021
56   RETF           1.296                        0.255
57   RETN           0.680                        0.409
58   ROL            3.885                        0.049
59   ROR            7.886                        0.005
60   SAHF           0.380                        0.538
61   SAL            0.997                        0.318
62   SAR            9.039                        0.003
63   SBB            0.695                        0.404
64   SGDT           0.001                        0.979
65   SHL           18.353                        0.000
66   SHR           21.812                        0.000
67   SIDT           0.001                        0.980
68   STC            0.300                        0.584
69   STD            1.078                        0.299
70   STI            0.722                        0.395
71   SUB            2.413                        0.120
72   TEST           0.639                        0.424
73   VERR           0.010                        0.921
74   WAIT           5.206                        0.023
75   XCHG           0.252                        0.615
76   XLAT           0.700                        0.403
77   XOR            0.791                        0.374

Table 3: Mean misclassification rates and p-values

Source                Mean     p-value
Additive model (11)   0.0493   0.034
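The Sig. values in Table 2 appear consistent with upper-tail probabilities of a chi-square distribution with one degree of freedom per opcode. That degree-of-freedom value is an assumption inferred from the tabulated numbers, since the paper does not state it. Under that assumption, the sketch below recomputes the p-value for a few rows copied from Table 2 and keeps the opcodes significant at the 0.05 level. It is an illustration only, with a hypothetical function name, and not the authors' selection procedure.

```python
import math


def lr_p_value(chi_square):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2.0))


# (opcode, likelihood-ratio chi-square) pairs copied from Table 2.
table2_sample = {
    "AAM": 3.662,
    "DEC": 5.077,
    "INC": 7.280,
    "NOP": 8.362,
    "MOV": 0.991,
}

ALPHA = 0.05  # significance level used in the paper

p_values = {op: lr_p_value(x) for op, x in table2_sample.items()}
significant = sorted(op for op, p in p_values.items() if p < ALPHA)

for op, p in p_values.items():
    print(f"{op:5s} p = {p:.3f}")
print("kept at the 0.05 level:", significant)
```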

Figure 1: The Most Frequent 13 Opcodes for Both Malware and Benign Executables

[Bar chart; vertical axis from 0% to 100%. Share of use, malware / benign: add 74% / 26%, call 83% / 17%, cmp 81% / 19%, jmp 84% / 16%, jz 83% / 17%, lea 78% / 22%, mov 83% / 17%, pop 87% / 13%, push 82% / 18%, retn 87% / 13%, test 81% / 19%, xor 77% / 23%.]
