Detecting Metamorphic Malware by using Behavior-based Aggregate Signature

Yanzhen Qu Colorado Technical University Colorado Springs, CO

Kelly Hughes Colorado Technical University Colorado Springs CO


Abstract— The capabilities of advanced malware, such as metamorphic and polymorphic malware, are quickly outpacing our current ability to detect its presence. For example, current signature-based malware detection methods fail to generate effective signatures because they build signatures from the software form, which every instance of advanced malware obfuscates. However, we have discovered that behavior in advanced malware remains consistent across the variants within a family of malware. We have capitalized upon this discovery to create a behavior-based aggregate signature for a family of malware. A signature that spans a malware family can identify both known and new variants of that family. Using one behavior-based aggregate signature versus many signatures, one for each variant, benefits anti-virus vendors and the community as a whole by reducing the size of the signature database, reducing maintenance, and increasing the speed of detection without losing accuracy.

Keywords- metamorphic malware; polymorphic malware; signature; behavior-based aggregate signature; detection

I. INTRODUCTION

Over the last several years there has been explosive growth in advanced malware. “If you compare the first six months of 2011 with the first six months of 2012, the increase is seen … at 392%” [1]. Traditional signature based detection is not able to identify this type of malware. In a study conducted by Trustwave, “Anti-virus detected less than 12% of the targeted malware samples collected during 2011 investigations” [2]. Malware exploits vulnerabilities found in the software and hardware built to run systems. Harris defines a vulnerability as “a software, hardware, procedural, or human weakness that may provide an attacker the open door he is looking for to enter a computer or network and have unauthorized access to resources within that environment” [3]. Poorly written and poorly tested software contributes greatly to the number of vulnerabilities found on systems. An attacker will infiltrate through the easiest vulnerability, which is usually found in unpatched software.

The term “advanced malware” will refer to polymorphic and metamorphic malware throughout this paper. Advanced malware is able to automatically create variants that differ in form but retain the primary functionality. This capability allows the new variants to evade detection, as most detection capabilities rely upon the form (code, strings, and byte identification) of software.

Signature based and anomaly based detection are the two primary methods for detection. Both can be used in either host based or network based intrusion detection systems. Each method has advantages and disadvantages, but neither is effective in detecting advanced malware.

Signature detection has the advantage of a small number of false positives and false negatives. Signature detection generally uses a string based signature that requires an exact match to trigger an alert. The problem with signature detection is that it is unable to handle malware that morphs when it propagates. “Simple signature matching techniques are not effective to detect polymorphic worms because it not possible to define a common string representing all different worm copies with low false positives” [4].

Anomaly detection is good at detecting unusual or abnormal behavior and can detect advanced malware when the malware operates above the thresholds that separate normal from abnormal behavior. The problem with anomaly detection is that it often raises many false positives, so many that anomaly detection alerts are generally ignored unless the analyst learns through another channel that something is wrong with the network or system. Only then will the analyst filter through the plethora of alerts to determine which relate to the issue at hand.
Identification of malware based upon behavioral characteristics (functionality) is not the primary area of research today. This paper will examine how similar behavioral characteristics for a family of malware can be used effectively to identify advanced malware and its associated variants. A common example of behavior is the command and control (C2) server to which the malware calls back (a specific DNS name or IP address) for further information. Other identified behavior may relate to specific files that are created or dropped onto a newly infected system, or to registry modifications. Multiple behavioral characteristics will be discussed throughout this paper.

The paper is structured as follows. Section II presents related work in variant malware detection and how logistic regression has been used in the past in this field. Section III discusses the problem statement and the hypothesis. Section IV discusses the methodology proposed for identification of metamorphic malware and how the logistic regression model is created. Section V discusses the creation of the logistic regression model and presents a discussion of the experiment. Section VI addresses future work for this research and offers conclusions.

II. RELATED WORKS

This section will review related works in detecting advanced malware. Several methodologies will be discussed, and their strengths and weaknesses will be identified. This section will also cover logistic regression modeling and its past use in detecting malware.

A. Advanced Malware Detection Methods

This section will discuss the areas of research towards detecting advanced malware. The variants of an advanced malware family share similar characteristics, which detection methodologies try to exploit. The variation itself comes from programming within the malware that is designed to evade detection. Many methodologies have been proposed to fight advanced malware. We will discuss their strengths and weaknesses.

Graph based methodology is used in [4] [5] and [6]. This type of analysis takes time to analyze the control flow and match it up with other flows created from the same family of malware. This methodology provides a low false positive and false negative rate, but because it uses a sandbox to characterize behavior, real-time identification is not possible.

[7] and [8] use bioinformatics to create signatures. This methodology uses one or more byte invariants combined with distance restrictions to form an advanced malware signature. As the advanced malware worm mutates, these distances may change, causing the signature to fail.

[9] create signatures for advanced malware worms by generating a tree. The signature tree generator extracts invariant bytes from the malware sample to create a signature. Each signature is then arranged using a tree organizational scheme. When the number of samples becomes large, the tree becomes cumbersome and slows down the operation: when approaching 260 samples, the tree update time reaches almost 160 seconds. This methodology would not be effective for real-time detection due to the delay.

The Hierarchical Hidden Markov (HHM) model is investigated for detecting advanced malware in [10]. HHM provides “accurate inductive inference of the malware family” [10]. Inductive inference is achieved by creating a state machine from previous versions of the advanced malware to create signature sequences. This method has a higher rate of false positives than the other methods discussed in this paper: false positives are shown to occur in the range of 10%-30% of the time. The speed of detection varies depending on the sample size and the expected resolution. The analysis time is just under 30 seconds, which the authors claim to be acceptable “with modern computing facilities” [10].

Many researchers are looking to automate the creation of signatures to simplify the detection and analysis of polymorphic malware. [11] [12] [13] and [14] examined how to automate a signature based detection approach. These automated methodologies take time to examine and evaluate samples in order to create signatures. They cannot be used in a real-time environment but may be useful in developing signatures that can then be used in a real-time environment.

[15] investigated the use of neighborhood relations when developing signatures. Neighborhood relation signatures (NRS) “is a collection of distances frequency distributions between neighbor byte” [15]. However, the authors tested with non-polymorphic malware into which they manually induced advanced malware characteristics. This methodology was therefore not tested on “real-world” malware samples, and its actual performance is difficult to predict.

As signature detection commonly suffers from false negatives, many researchers have looked to behavior based methodologies to detect polymorphic malware. [16] created the Malware Behavior Feature (MBF) based malware detection system. This system is based on the understanding that regardless of what the polymorphic malware looks like, its behavior stays the same. MBF falsely identified 7 files as being malicious. The system was only tested against two malware samples, which it identified as being malicious.

[17] created a system specifically to protect web servers, using a sandbox methodology to observe behavior characteristics. However, the effectiveness of [17] is limited.

[18] use Principal Component Analysis (PCA) to identify significant substrings across all advanced malware. These substrings, or shared invariant artifacts, are then used to create signatures. Using this methodology, the authors believe they will be able to detect zero-day polymorphic malware. The largest issue with this methodology is its assumption that the identified artifacts would not be found in normal non-malicious files.

B. Logistic Regression

Logistic regression has been around since the early 1970’s [19]. This methodology was first accepted in the field of epidemiologic research [20]. Since then logistic regression has been used in “biomedical research, business and finance, criminology, ecology, engineering, health policy, linguistics and wildlife biology” [20]. This section will summarize how logistic regression has been used in some of these fields.

C. Used in Malware/Intrusion Detection

Logistic regression has been used for malware detection and intrusion detection, primarily in the form of anomaly detection versus signature based detection. [21] focused on host based anomaly detection that looks at network traffic. A specific period of time was selected to build the logistic regression model, which demonstrated good results, recognizing 46 out of 51 attacks, just over a 90% accuracy level. [22] also focus on host based anomaly detection but use audit log data to create their logistic regression model. Their research produced a 92% accuracy of detection without any false positives. [23] developed a method based on unidirectional flow data in order to detect port scans. They developed a Bayesian logistic regression model that had a 95.5% success rate with only a 0.4% false positive rate. [24] developed a multinomial logistic regression model. The research was based on the KDD-cup 1999 data and looked to identify 13 different risk factors within the data. Wang was able to produce a multinomial logistic regression model that detected about as well as or better than the KDD-cup competitors, with a lower false positive rate. [25] also used the KDD-cup 1999 data. Their focus was on a random effects model to identify anomalies in the data. Their model had an accuracy of 98.94%. The false positive rate was not provided.

None of these methods focused on detection of advanced malware; they targeted malicious activity determined to be anomalous relative to what was considered normal activity. These types of methods depend heavily on the training data and the accuracy of the model. If the training data does not represent the data on which the model will be used, the definition of normal will differ, producing more false positives than a model built from data drawn from the environment where it will be used.

III. PROBLEM STATEMENT & HYPOTHESIS STATEMENT

Logistic regression (LR) used to create an aggregate signature has been shown to be an effective methodology for detecting normal malware [26]. The question to be answered is whether an aggregate signature based on invariant behavior within a family of malware can detect advanced malware families accurately, without too many false positives and false negatives.

A. Problem Statement

Current signature based detection methods for advanced malware are ineffective because they focus on the software form (code, strings, and byte invariants), which can be obfuscated by every instance of advanced malware.

B. Hypothesis Statement

If a set of common behavioral characteristics can be identified within the variants associated with an advanced malware family, these characteristics can be used to create a behavior based aggregate signature that provides an accurate method for identifying and detecting the advanced malware family, including zero-day (new, never-seen-before) variants of the family.

IV. METHODOLOGY

This section will describe the methodology used to identify the string or strings (characteristics) to be used in the aggregate signature for identification of a family of malware. These strings should be unique to the malware family and not strings that are broadly used by every malware or non-malware application.

Set theory can help define the goal of locating the unique characteristics that belong to the metamorphic family of malware but are not prevalent within the entire set of malware characteristics. There exists a set of behavioral characteristics, a subset of the behavioral characteristics of the advanced malware family, that is common across the different variants within the family but is not common across software outside of the family.

Definition: Malware. Let malware m be defined as a tuple m = {S, B}, where S is the software or code set that forms the malware and B is the behavioral characteristic set, such that S = {s1, s2, …, si} where i is an integer and si represents the ith piece of source or executable code of malware m, and B = {b1, b2, …, bj} where j is an integer and bj represents the jth behavioral characteristic of malware m.

Definition: Malware Family. Assume there exists a set of malware M = {m1, m2, …, mk} where k is an integer. We call M a malware family.

Definition: The Complement of a Malware Family. Let Q stand for the Software Space, which contains all malware and all benign software. Let M be a malware family. Let N = Q − M. We call N the complement of M. That is, N consists of all the software that is not in M.

Definition: Behavior-based Aggregate Signature. Assume there exists a malware family M = {m1, m2, …, mk} and its complement N = {n1, n2, …, np}, where k and p are integers. Let mx = {Sx, Bx} stand for any instance of M and ny = {Sy, By} stand for any instance of N, where x ≤ k and y ≤ p, and Bx and By are the behavioral characteristic sets of mx and ny respectively. We define Bd as follows:

$$B_d = \left\{\bigcap_{x=1}^{k} B_x\right\} - \left\{\bigcup_{y=1}^{p} B_y\right\} \qquad (1)$$

If Bd ≠ ∅ and Bx ∩ Bd = Bd for x = 1, 2, …, k, then we call Bd the behavior-based aggregate signature of M. That is, Bd contains the behavioral characteristics that are common to every instance of the given malware family but are not found in other malware families or in any benign software. Formula (1) also gives us a clue on how to detect whether a new malware sample belongs to a known malware family. Guided by (1), we have created an engineering process for generating the right behavior-based aggregate signature. Figure 1 provides an overview of the process to be described.

Figure 1: Signature Aggregation Method

A. Collection of Data

Let Q = M + N where M = {m1, m2, …, mx} and N = {n1, n2, …, ny}, x and y are integers, and |Q| = 2x, which represents the total number of software instances. We must collect the behavior set data B for a family of malware M. All Bx data is collected for every mx where x is an integer and mx ∈ M. This provides the B data set for malware family M. Next we must collect behavioral data By for every ny where y is an integer and ny ∉ M. Collecting behavioral data for every ny is not practically possible; however, we can collect behavioral data from a random sample of ny ∉ M where x ≤ y ≤ x * 2 and |Q|/50 ≥ number of potential indicators. If the last requirement cannot be met there are two choices. The first is to reduce the number of potential indicators. The second is to conclude that x is not large enough to support a proper analysis of the data and to select another family of malware M. In the second case, if samples of the malware can be collected, the malware can instead be allowed to replicate in a secure sandbox environment. Each replicated sample can then be collected and analyzed to gather its behavioral characteristics.
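Formula (1) can be illustrated with ordinary set operations. The following is a minimal sketch, assuming each sample's behavior set has already been collected as a set of strings; the function name and the behavior labels are hypothetical and are not taken from the experiment.

```python
from functools import reduce


def aggregate_signature(family_behaviors, complement_behaviors):
    """Return B_d: behaviors shared by every family sample and seen in no outside sample."""
    if not family_behaviors:
        return set()
    # Intersection of B_x over all family members (left term of formula (1)).
    common = reduce(set.intersection, (set(b) for b in family_behaviors))
    # Union of B_y over all sampled non-members (right term of formula (1)).
    outside = set().union(*(set(b) for b in complement_behaviors))
    return common - outside


# Hypothetical behavior labels, for illustration only.
family = [
    {"com_object", "drop_vcl32", "dns_lookup"},
    {"com_object", "drop_vcl32"},
    {"com_object", "drop_vcl32", "reg_run_key"},
]
complement = [{"dns_lookup", "reg_run_key"}, {"com_object"}]
signature = aggregate_signature(family, complement)
```

In this toy data, "com_object" is common to the family but also appears outside it, so only "drop_vcl32" survives. An empty result corresponds to the case Bd = ∅, in which no aggregate signature exists under the strict definition.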

B. Identify Potential Indicators

The objective of this section is to identify one or more behavioral elements that are exclusive to the selected malware family. In other words, we want to find the potential Bd elements. Let the behavior descriptions be grouped and counted, separately for the behaviors observed in set M and those observed in set N, where M = {m1, m2, …, mx} with mx = {Sx, Bx}, N = {n1, n2, …, ny} with ny = {Sy, By}, N ⊂ M^c (the complement of M), and x and y are integers. Let all the distinct behaviors in set M be represented by BM = {Bz, Cz}, where Bz is a behavior, Cz is the count of that behavior in M, and z is an integer. Let all the distinct behaviors in set N be represented by BN, following the same rules as BM. Those behaviors with the highest counts in BM form a subset BSM ⊂ BM in which the count for each behavior is greater than 80% of x. Those behaviors with the highest counts in BN form a subset BSN ⊂ BN in which the count for each behavior is greater than 20% of y. Then potential Bd = BSM − BSN.

C. Create Multiple Models

These data are then used to identify the attribute or attributes that should be used for identification of the selected malware family. After all the potential elements of Bd have been identified, the data are encoded as follows: for each sample in Q and each potential Bd element, a value of 1 is recorded if the behavior is present in the sample and 0 otherwise. Likewise, a value of 1 is recorded if the sample belongs to M and 0 otherwise. Each sample thus yields a row of data in which the first column indicates whether the sample is in M and the remaining columns hold the 0 or 1 value for each potential Bd element. These data are analyzed using logistic regression. We used backwards logistic regression analysis to identify the important attributes for the selected family of malware. This process, shown in the next section, narrows the potential attributes down to those that provide a relevant indicator for identification of the selected malware family.

D. Evaluate Models for best fit

Backwards logistic regression starts with all of the potential characteristics. “Variables are tested, one at a time, for removal from the model. The first variable that is removed is the variable whose likelihood ratio statistic has the largest probability that is greater than alpha” [38]. This process continues until only significant indicators remain, that is, until all likelihood ratio statistics have probabilities less than alpha.

The formula for the likelihood ratio statistic:

$$D = -2\ln\left[\frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}\right] \qquad (2)$$

The formula for the likelihood:

$$L(\boldsymbol{\beta}) = \ln[l(\boldsymbol{\beta})] = \sum_{i=1}^{n}\left\{y_i \ln[\pi(x_i)] + (1-y_i)\ln[1-\pi(x_i)]\right\} \qquad (4)$$

where $y_i$ is the observed value for the ith sample and $\pi(x_i)$ is the expected value given its characteristics $x_i$. This calculation can be computationally expensive, as it would have to be completed for the original model and for a model created with each characteristic removed individually, and then recalculated with the remaining characteristics. Most software packages provide the deviance of the model. The deviance plays the same role for logistic regression that the SSE (residual sum-of-squares) plays in linear regression [20]. Because the likelihood of the saturated model equals 1 when the outcome is binary, (2) reduces to

$$D = -2\ln(\text{likelihood of the fitted model}) \qquad (5)$$

The Wald statistic can be used to calculate the significance of each variable faster than calculating each likelihood and then the likelihood ratio statistic [29]. It also provides more information than calculating all potential possibilities for removal of each characteristic to determine the characteristic that should be removed. The formula for calculating the Wald statistic is:

$$W = \frac{\hat{\beta}_i}{\widehat{SE}(\hat{\beta}_i)} \qquad (6)$$

where $\hat{\beta}_i$ is the coefficient for the ith characteristic or element and $\widehat{SE}(\hat{\beta}_i)$ is the standard error for the ith characteristic or element. After evaluating every element with the above process and determining which elements should be included in the model, we can create the final model:

$$\text{Probability of Malware Family} = \frac{e^{\hat{g}}}{1+e^{\hat{g}}} \qquad (7)$$

where “e (the base of the natural logarithms) is raised to the power of ĝ” [28] in both the numerator and the denominator. The formula for ĝ is the following:

$$\hat{g} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n \qquad (8)$$

where b0 is a calculated constant based on the data used for the model and b1 through bn are the weighted values for each predictor element or variable. X1 through Xn are the actual values for the predictor elements.
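Equations (6) through (8) are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the procedure of the statistical package used in this paper: the two-sided p-value treats W as approximately standard normal, and the backward step selects on Wald p-values rather than on the likelihood ratio statistic. All function names and the example coefficients are hypothetical.

```python
import math


def wald_statistic(coefficient, standard_error):
    """Equation (6): estimated coefficient divided by its standard error."""
    return coefficient / standard_error


def wald_p_value(w):
    """Two-sided p-value, assuming W is approximately standard normal."""
    return math.erfc(abs(w) / math.sqrt(2.0))


def family_probability(intercept, weights, indicators):
    """Equations (7) and (8): logistic transform of the linear score g."""
    g = intercept + sum(b * x for b, x in zip(weights, indicators))
    return 1.0 / (1.0 + math.exp(-g))  # algebraically equal to e^g / (1 + e^g)


def variable_to_remove(fits, alpha=0.05):
    """One backward step: the least significant variable, or None if all pass."""
    p_values = {name: wald_p_value(wald_statistic(b, se))
                for name, (b, se) in fits.items()}
    name, p = max(p_values.items(), key=lambda item: item[1])
    return name if p > alpha else None


# Hypothetical (coefficient, standard error) pairs for two indicators.
example_fits = {"INDICATOR_A": (2.0, 0.5), "INDICATOR_B": (0.3, 0.6)}
candidate = variable_to_remove(example_fits)
```

With these made-up values INDICATOR_B has the larger p-value and would be dropped first; the model would then be refitted before the next step, since the remaining coefficients and standard errors change once a variable is removed.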

E. Signature Creation and Verification

Once the final model has determined the attributes or behavioral characteristics (the Bd elements have been identified and verified, and should include one or more characteristics), a signature can be created for the family of malware. If only one behavior was identified in Bd, a standard signature can be created. If more than one behavior was identified, the logistic regression model itself can be used as the signature. Once a signature is created it is important to test it on the malware family, on other malware, and on non-malware samples. New malware variants of the selected family can and should be used for testing to confirm the accuracy of the signature.

V. EXPERIMENT

This section will cover the methodology used to create the behavior-based aggregate signature for the malware samples. The malware family called VCL will be used to demonstrate how to create a signature for a family of malware. VCL is an advanced malware family that uses different obfuscation techniques to hide its signature from common detection methods. We will demonstrate how the process described can be used with VCL to identify the strong behavioral characteristics that are found across the family and can be used to effectively detect it.

A. Collecting data and the Analysis Report

The objective of this section is to describe how analysis data for variant malware is collected for use in the rest of the methodology. A standard analysis report has both static and behavioral information on each sample variant of malware. This type of report can be created using existing sandbox software, or reports already created by anti-virus vendors can be used. An example of such a report is what you might see in McAfee’s Threat Library [27]. For this paper the authors selected McAfee’s reports on VCL due to the number of malware variants collected. These reports are used to identify indicators across the malware variants. Data characteristics were collected from McAfee’s VCL analysis reports to identify common indicators across several categories: Activities, Files added to the system, Files changed, Temporary Files, Files Deleted, Registry elements changed, Registry elements added, and Network Connections Made.

After data were collected from the McAfee threat reports, they were entered into a database. In this case, the authors chose Oracle 11g Standard Edition 32 bit; however, any SQL based database would be sufficient. A database was created within Oracle to hold the data. Each data category was entered into a different table with a new row for every element in each category. Each row was connected back to a main table that held the malware name, the type of malware, and the date the malware was analyzed.

Data should be collected both for the VCL variants and for samples outside of the malware family. Data collected outside of the VCL family should be stored in a separate database with a separate table structure, but the format of all the tables should be consistent. The schema for the tables is shown in Appendix A.

B. Identify Potential Indicators

The data were analyzed to identify like characteristics across the different variations of the VCL malware. This was done by creating queries for each table to identify those characteristics that spanned multiple variants. Data for 258 VCL malware samples were entered into the database. A SQL statement for the Network Connections Made table is the following:

SELECT DISTINCT "Network connection", COUNT(*) AS c1
FROM "VCL Network Connections"
GROUP BY "Network connection"
ORDER BY c1 DESC

This query selects the distinct values in the network connection data and counts the occurrences of each value. The result set is ordered so that the characteristic with the most occurrences across the VCL samples appears first, followed by the next highest, and so on for all the network connections made by the VCL samples. This query creates the BM set for the VCL Network Connections table, where BM = {Bz, Cz}, Bz is the behavioral characteristic for each element in Network Connections, and Cz is the count for that behavioral element. This type of query is repeated for each table. Together, the query results for all tables form the entire BM set of behavioral characteristics and their counts. The elements with the highest counts are used. In this result set, the activity “Adds or modifies a COM object” occurred in 217 of the 258 samples collected, that is, in 84% of the samples. It is important to examine all of the data within the collection sample. This element is included in BSM, and it is not commonly found in the BN data set, as seen in the results for that table. All of the tables are analyzed in the same manner. This allows the common characteristics across all the VCL samples to be identified. These characteristics are called independent variables, indicators, or predictor variables for the logistic regression model, which is used to determine which of these indicators are significant in identifying VCL. The identified set of characteristics is listed in Table 1.
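The counting and thresholding used to find potential indicators can also be sketched outside the database. The following assumes each sample is represented as a set of behavior strings and applies the 80% and 20% thresholds from Section IV.B; the function names and sample data are hypothetical and do not reproduce the McAfee reports.

```python
from collections import Counter


def behavior_counts(samples):
    """Count in how many samples each distinct behavior appears (B_M or B_N)."""
    counts = Counter()
    for behaviors in samples:
        counts.update(set(behaviors))  # each behavior counts once per sample
    return counts


def potential_indicators(family_samples, other_samples,
                         family_min=0.80, other_min=0.20):
    """Potential B_d = BS_M - BS_N using the thresholds from Section IV.B."""
    bs_m = {b for b, c in behavior_counts(family_samples).items()
            if c > family_min * len(family_samples)}
    bs_n = {b for b, c in behavior_counts(other_samples).items()
            if c > other_min * len(other_samples)}
    return bs_m - bs_n


# Hypothetical data: 10 family samples and 10 samples from outside the family.
family = ([{"com_object", "dns_lookup", "temp_file"}] * 5
          + [{"com_object", "dns_lookup"}] * 4
          + [{"dns_lookup"}])
others = [{"dns_lookup"}] * 6 + [{"com_object"}] + [set()] * 3
indicators = potential_indicators(family, others)
```

Here "dns_lookup" is frequent in the family but also frequent outside it, and "temp_file" appears in only half of the family samples, so only "com_object" remains as a potential indicator.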

Table 1: Fields and Descriptions for VCL Field Description STRING1 Adds or modifies a COM object (Activity) STRING2 HKEY…\{74630177-839A(Registry 11D5-EBA1-F78EEEEEE983} Added) 5

Count 217 214

STRING3 (File Added) STRING4 (Registry Changed)

%WINDIR%\SYSTEM32\VCL 32.EXE HKEY…\vcl32.exe

strings remain. The software automatically removed STRING3 and STRING4 as they do not provide any more information that was statistically significant than what STRING2 provided.

214 214

E. Identify Best Fit Model For this model, Analysis Studio calculated the deviance to be:

The counts for these variants are not exactly equal to the number of variants selected for the experiment. Unfortunately, the data for all samples are not complete. For example, older malware samples were not analyzed in the same manner as newer ones. Older samples may only have one line of analysis that states something to the effect of “Modified a COM Object” where new analysis results have more detail at every level of analysis. When using an automated collection tool, these data are not collected accurately, thus the difference in counts and results. Once the number of predictor variables were identified (potential Bd) the number of rows to create the logistic regression model can be calculated. Experts recommend having at least 50 rows of per predictor variable [38]. Thus we will have at least 300 rows of data, however for this study:   

602 rows of data were used 258 rows from VCL analysis 344 rows from other malware families  Representing 40 malware families multiple variants each

D = 86.7848

(9)

For Table 2 STRING1 the Wald statistic can be calculated as: W = 1.1817/0.4738 = 2.4941

(10) (11)

As you can see, the calculation in equation (11) matches with the Wald value listed in the table for STRING1. This value will not be calculated for the rest of the paper but will rely on the value listed in the table. Based on the values, the P Value, Wald statistic and the Exp(Coefficient), the first string to be removed from the model will be STRING2. The goal will to bring all the P values for all included strings in the model to below the 0.05 range. Table 3: Variables minus STRING2

with

C. Create Analysis Data The data gathered for VCL was split into a test set and an analysis set which resulted in the 258 rows randomly selected for the analysis. The other portion of VCL data will be used for testing the final aggregate signature. All rows of data specified above will then be entered into the analytic software. To create the dataset a set of SQL queries were used. First query was used on the VCL tables (set M) with only VCL malware. Then, the non-VCL (set N) malware was loaded into similar tables, but renamed as Non-VCL_Tablename no field names were changed. The second query was then to work with the Non-VCL tables to pull for the same characteristics. Both VCL and Non-VCL (set Q) results were put into a single Excel table and imported into the statistic software.

D. Create Multiple Models

For this paper, Analysis Studio© v6.40.0, developed by Appricon Inc., was used to create the logistic regression model. The output was analyzed to determine which indicators have the greatest influence on the identification of VCL malware.

Table 2: Variables Showing All Characteristics

Backward logistic regression model building helps identify the most important characteristics that should be used as indicators for identifying VCL. As you can see, only two

At this point, based on the scores, the model appears to be accurate, with STRING1 being the best indicator. The area under the ROC curve for this last model is 90.3%, which is in an excellent range. The model deviance for the current model is 367.6933.

F. Create Signature

The VCL model can now be created with the selected three indicators.

$$P(\text{VCL}) = \frac{e^{\hat{g}}}{1 + e^{\hat{g}}} \tag{12}$$

The equation for ĝ is the following:

$$\hat{g} = -2.2315 + 4.634 \cdot \text{STRING1} \tag{13}$$

STRING1 has a value of either 0 or 1, depending on whether the string value (the behavior) exists for that sample of malware. A value of 0 indicates that the behavior/characteristic does not exist for that sample; a value of 1 indicates that it does. The value for ĝ shown in equation (13) is now substituted into equation (12) to provide the following:

$$P(\text{VCL}) = \frac{e^{-2.2315 + 4.634 \cdot \text{STRING1}}}{1 + e^{-2.2315 + 4.634 \cdot \text{STRING1}}} \tag{14}$$

Equation (14) provides our aggregate signature. Alternatively, in this case, the single characteristic can be converted to a traditional signature and used in place of the multiple signatures in use for the VCL metamorphic family. This signature was created using the characteristics identified as most common across the malware variants of an individual family. Considering the number of variants of VCL and the number of signatures for each variant, one can imagine the significant savings in space and speed. The selected characteristics can be modified based on the type of signature required. For example, there are host-based signatures and network-based signatures. For network-based signatures, it does not make sense to include characteristics such as registry value changes or added, deleted, or modified files, whereas these characteristics would be relevant for a host-based detector.
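As a minimal sketch, the aggregate signature in equation (14) can be evaluated directly from the two coefficients reported in equation (13). The function names `vcl_probability` and `is_vcl` and the 0.5 decision threshold are assumptions made for illustration; the paper does not state a cut-off value.

```python
import math

# Coefficients as reported in equation (13).
INTERCEPT = -2.2315
STRING1_COEF = 4.634


def vcl_probability(string1: int) -> float:
    """Return P(VCL) from equation (14) for STRING1 equal to 0 or 1."""
    if string1 not in (0, 1):
        raise ValueError("STRING1 must be 0 (behavior absent) or 1 (behavior present)")
    g_hat = INTERCEPT + STRING1_COEF * string1
    # 1 / (1 + e^-g) is algebraically identical to e^g / (1 + e^g) in equation (12).
    return 1.0 / (1.0 + math.exp(-g_hat))


def is_vcl(string1: int, threshold: float = 0.5) -> bool:
    """Classify a sample; the 0.5 threshold is a hypothetical choice, not from the paper."""
    return vcl_probability(string1) >= threshold


if __name__ == "__main__":
    for value in (0, 1):
        print(f"STRING1={value}: P(VCL)={vcl_probability(value):.3f}, flagged={is_vcl(value)}")
```

With these coefficients the estimated probability is roughly 0.10 when the behavior is absent and roughly 0.92 when it is present, which is why a single characteristic is enough to separate the two cases here.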

VI.

CONCLUSION AND FUTURE WORKS

Using a single signature to identify or detect an entire family of malware requires less space than the traditional signature methodology. It also reduces the maintenance required to keep the signature database current: because the signature can detect zero-day variants, the database needs to be updated less often. The capability to identify zero-day variants also provides for a more secure system. This is accomplished by identifying the key behavioral characteristics shared by the majority of variants within the family of malware. Since most variants are created by simply modifying small portions of code, common behavioral characteristics remain in the base code and can be used to identify the malware. Future work includes standardizing the aggregate signature so that it can be used in the various current anti-virus applications, providing a stronger methodology for the detection of malware and the protection of our valuable assets.

VII.

REFERENCES

[1] FireEye, "FireEye Advanced Threat Report - 1H 2012," FireEye, 2012.
[2] R. Barnett, et al., "2012 Global Security Report," Trustwave, Chicago, 2012.
[3] S. Harris, All in One CISSP Exam Guide, 5th ed. New York: McGraw-Hill Companies, 2010.
[4] B. Bayoğlu and İ. Soğukpınar, "Graph based signature classes for detecting polymorphic worms via content analysis," Computer Networks, vol. 56, pp. 832-844, 2012.
[5] S. Cesare and Y. Xiang, "A fast flowgraph based classification system for packed and polymorphic malware on the endhost," in 2010 24th IEEE International Conference on Advanced Information Networking and Applications, 2010, pp. 721-728.
[6] B. S. U. Mendis and L. T. Koezy, "Fuzzy rough signatures," presented at the FUZZ-IEEE 2009, Korea, 2009.
[7] Y. Tang, et al., "Using a bioinformatics approach to generate accurate exploit-based signatures for polymorphic worms," Computers & Security, vol. 28, pp. 827-842, 2009.
[8] M. F. Zolkipli and A. Jantan, "A framework for malware detection using combination technique and signature generation," presented at the Second International Conference on Computer Research and Development, 2010.
[9] Y. Tang, et al., "Signature tree generation for polymorphic worms," IEEE Transactions on Computers, vol. 60, pp. 565-579, 2011.
[10] F. B. Muhaya, et al., "Polymorphic malware detection using hierarchical hidden Markov model," presented at the 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing, 2011.
[11] Y. Tang and S. Chen, "An automated signature-based approach against polymorphic internet worms," IEEE Transactions on Parallel and Distributed Systems, vol. 18, pp. 879-892, 2007.
[12] Q. Zhang and D. S. Reeves, "MetaAware: Identifying metamorphic malware," presented at the 23rd Annual Computer Security Applications Conference, 2007.
[13] S. Cesare, et al., "Malwise - An effective and efficient classification system for packed and polymorphic malware," IEEE Transactions on Computers, 2012.
[14] J. Newsome, et al., "Polygraph: Automatically generating signatures for polymorphic worms," presented at the 2005 IEEE Symposium on Security and Privacy, 2005.
[15] J. Wang, et al., "Polymorphic worm detection using signatures based on neighborhood relations," presented at the 2009 11th IEEE International Conference on High Performance Computing and Communications, 2009.
[16] L. Wu, et al., "Behavior-based malware analysis and detection," presented at the 2011 First International Workshop on Complexity and Data Mining, 2011.
[17] L. Campo-Giralte, et al., "PolyVaccine: Protecting web servers against zero-day, polymorphic and metamorphic exploits," in 2009 28th IEEE International Symposium on Reliable Distributed Systems, 2009, pp. 91-99.
[18] M. M. Z. E. Mohammed, et al., "Detection of zero-day polymorphic worms using principal component analysis," presented at the 2010 Sixth International Conference on Networking and Services, 2010.
[19] C.-Y. J. Peng and T.-S. H. So, "Logistic regression analysis and reporting: A primer," Understanding Statistics, vol. 1, p. 31, 2002.
[20] D. W. Hosmer and S. Lemeshow, Applied Logistic Regression, Second ed. Hoboken, NJ: John Wiley and Sons, 2000.
[21] C. Gao, et al., "Host anomalies detection using logistic regression modeling," IEEE, vol. 9/09, p. 5, 2009.
[22] N. Ye, et al., "Multivariate statistical analysis of audit trails for host-based intrusion detection," IEEE Transactions on Computers, vol. 51, pp. 810-820, July 2002.
[23] C. Gates, et al., "Scan detection on very large networks using logistic regression modeling," in IEEE Symposium on Computers and Communications, 2006, pp. 402-408.
[24] Y. Wang, "A multinomial logistic regression modeling approach for anomaly intrusion detection," Computers & Security, vol. 24, pp. 662-674, 2005.
[25] M. S. Mok, et al., "Random effects logistic regression model for anomaly detection," Expert Systems with Applications, vol. 37, pp. 7162-7166, 2010.
[26] K. Hughes and Y. Qu, "A theoretical model: Using logistic regression for malware signature based detection," presented at the 10th IEEE International Conference on Dependable Autonomic and Secure Computing, Changzhou, China, 2012.
[27] McAfee. (2013, 5/1/2013). Threat Intelligence. Available: http://www.mcafee.com/us/mcafeelabs/threat-intelligence.aspx
[28] L. G. Grimm and P. R. Yarnold, Eds., Reading and Understanding Multivariate Statistics. Washington DC: American Psychological Association, 1995.
[29] S. Menard, Applied Logistic Regression Analysis, Second ed. Thousand Oaks, CA: Sage Publications, Inc, 2001.