Clustering of Similar Malware Behavior via Structural Host-sequence Comparison

Horng-Tzer Wang, Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan ([email protected])
Te-En Wei, Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan ([email protected])
Ching-Hao Mao, Institute for Information Industry, Taipei, Taiwan ([email protected])
Hahn-Ming Lee, Dept. of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan ([email protected]); Institute of Information Science, Academia Sinica, Taipei, Taiwan ([email protected])
Abstract—Malware (malicious software) is used by attackers to gain access to end-users' computing devices with the aim of performing malicious actions, such as sending spam, downloading malicious files, and stealing private information. These malicious actions result in the infection of many machines at considerable monetary expense to affected end-users. Although anti-malware companies can detect malware, they generally rely on signature-based methods. Because attackers use automated malware generation tools to produce new or modified malware, leading to ever more malware instances and variants, variants of the same malware class can often evade such methods. In this study, we propose a system based on structural sequence comparison and probabilistic similarity measures for detecting variants of a malware class. We believe that the structural sequence of malware behavior can reveal variants of the same malware class by profiling the relationship between different behavior patterns. Furthermore, probabilistic similarity measures are used for discovering the behavior patterns of the sequential relation and for finding very similar behavior models of malware classes. The proposed system, which incorporates a structural sequence mechanism and probabilistic similarity measurement, can help detect and discover variants of a malware class. The results of our experiments show that the proposed system detects variants of a malware class more effectively than a two-gram sequence does, and it achieves higher precision and F-measure in detecting malware variants than the approach based on a two-gram sequence and a normalized embedding function, which lowers the effect and weight of behavior patterns.
Keywords: malware detection; sequential data; Markov chain; similarity measures
I. INTRODUCTION
Malicious software, also known as malware, is one of the most active and challenging areas of computer security research. In recent years, many personal computing devices (e.g.,
Notebook, Smartphone [10], [29], and iPad) have suffered attacks over the Internet from such forms of malware as Trojans, viruses, worms, spyware, and botnets. In 2011, Microsoft published a Microsoft Security Intelligence Report [13] that presented a ten-year review of the evolution of malware and the threat landscape. The report focuses on software vulnerabilities, ways of exploiting them, and malicious and potentially unwanted software. Furthermore, a Symantec Internet Security Threat Report [14] discussed malicious code intelligence obtained through the deployment of Symantec's antivirus products in 2011. According to that report, malware has spread to computers worldwide, and most antivirus vendors detect and prevent it using signature-based methods, which are not very efficient: because signatures must be updated so rapidly, the antivirus database can never keep up with the creation rate of novel malware variants. In general, malware analysis methods can be either static or dynamic [23]. Static malware analysis [20], [22], also known as code review, can be performed with a decompiler, or supplemented dynamically with a debugger such as WinDBG [9] or OllyDbg [7]. The results of a source code analysis can be used to identify which software will infect a computer. Code Security [8] and Fortify [4] are two companies known worldwide that provide code review functionality targeting suspicious software. The main advantage of static analysis is that it provides a complete overview of what the given software actually does. However, static analysis is often obstructed by evasion techniques [22] such as binary packers, polymorphism, and anti-debugging techniques. Furthermore, static analysis is more easily obstructed by such malware countermeasures than dynamic analysis is, and it causes a high false-alarm rate. Therefore, dynamic analysis is more
efficient for malware analysis research. It enables monitoring of malware behavior during run time using a sandbox, such as CWSandbox [2], Norman Sandbox [6], or ANUBIS [1], in which malware finds it more difficult to hide. Behavior-based detection is used to obtain the results of the malware's real actions in a general operating system (OS). Furthermore, a sandbox provides a virtual environment in which malware can be executed and monitored without destroying the real OS. Typically, behavior-based detection is used for two purposes, classification and clustering: the analysis architecture allows automatic identification and clustering of new classes of malware with similar behavior, as well as the assignment of unknown or variant malware to these discovered classes for classification. We propose an approach for behavior-based analysis, based on clustering and classification, that enables analysis of the behavior of thousands of malware binaries. Specifically, we submit real-world malware to a sandbox, which executes it in a virtual environment and extracts a behavioral report for each sample. In general, attackers use automated malware generation tools to create new or modified malware, leading to an increasing number of malware instances and variants, and this presents two problems. On the one hand, variants of the same malware can generate similar malicious behavior. On the other hand, variants of the same malware can evade anti-virus (AV) products. In our research, we find the structural sequence of malware behavior to be useful for distinguishing different malware classes. In our experiments, we use the four common actions Create, Delete, Open, and Write, and three familiar change locations, File, Registry, and Network, all involving host-side behavior. Based on this host behavior, we can analyze malware more accurately for clustering and classification. Hence, our goals can be described as follows: 1) prevent variants of the same malware from evading AVs by using the relationship between different behavior patterns; and 2) detect many different types of malware by using similar malicious behavior patterns. However, we face several challenges. The sandbox reports can contain many different behaviors, and constructing the relationship between different behavior patterns becomes a problem. In our method, we use a sequential relation to represent these patterns, and we use a probabilistic model to determine the sequential relation before relation construction. Each behavioral report describes the behavior of one malware program, and we use model similarity to calculate and determine the relationships between malware samples. An additional goal is to identify instances of the same malware; thus, we cluster malware by finding very similar malware behavior models. The major contributions of this paper can be summarized as follows:
• Discovery of a sequential relation profile between different behavior patterns, which provides a structural sequence mechanism.
• Clustering of very similar malicious behavior of malware variants through probabilistic similarity measurement.
• Detection of unknown malware or variants of known malware classes, thereby reducing the reliance on anti-virus signatures.

The rest of this paper is organized as follows: Related work and background knowledge are presented in Section II. Our methodology, including a brief description and the model framework, is introduced in Section III. Experimental results showing that our system clusters and classifies malware accurately are presented in Section IV. Finally, conclusions and future work are discussed in Section V.

II. RELATED WORK

Much work has been done to identify malware using dynamic analysis. Rieck et al. [27] used support vector machines (SVM) [30] to classify malware samples based on OS objects and OS operations. They used OS operations to identify malware classes, which can cause ambiguity in detecting different variants of malware classes; moreover, using both OS objects and OS operations requires a large amount of memory. Trinius et al. [31] proposed the Malware Instruction Set (MIST) to represent malware behavior reports; the goal of MIST is to reduce the memory needed to identify malware. The classification approaches of Rieck et al. [27] and Trinius et al. [31] are susceptible to obfuscation and are ineffective at identifying new malware samples. Bayer et al. [19] used Locality Sensitive Hashing (LSH) to group malware samples by their behavior profiles; their method is limited by its excessive use of OS operations in the behavioral profile, which causes host-level ambiguity. Perdisci et al. [25] used structural similarities among malicious Hypertext Transfer Protocol (HTTP) traffic traces generated by executing HTTP-based malware; their approach cannot cluster malware that lacks HTTP-based behavior. Rieck et al. [28] used prototypes to extend the approaches of Rieck et al. [27] and Trinius et al. [31] with a framework for performing clustering and classification, with the goal of helping anti-virus companies detect new or variant malware. Their approach is limited by the use of two-grams, under which the current state does not capture the situation of previous states, as shown in Figure 1. Furthermore, they used a normalized embedding function that may create more matched behavior patterns, lowering the effect and weight of the behavior patterns. Finally, they defined a heuristic rejected cluster that may cause ambiguity in clustering malware classes. Our goal is to cluster malware classes effectively. The major differences between the present work and that of Rieck et al. [28] are as follows:
• We propose a structural sequence mechanism that discovers sequential relation profiles between different behavior patterns.
Figure 1. With two-grams, the current state does not capture the situation of previous states.
• We use probabilistic similarity measurement to find closely similar sequential relations of malicious behavior among malware variants.

III. STRUCTURAL SIMILAR BEHAVIORAL HOST-SEQUENCE CLUSTERING
In order to solve the problem of malware variants, we propose a structurally similar behavioral host-sequence clustering methodology that clusters many different malware programs and can be used to detect malicious behavior at the host level. The main intention is to help anti-virus software detect many new and different malware programs and to find unknown malware that exhibits a different structural behavior sequence. Our system consists of three components: the Structural Host-sequence, the Behavioral Model Similarity Constructor, and the Clustering Engine, as shown in Figure 2. The Structural Host-sequence extracts the malicious behavioral patterns from each sequential behavioral report. These patterns are used to construct the sequential states structurally, creating a behavioral structure module. The Behavioral Model Similarity Constructor has two subcomponents, the Host-Structural Model Comparator and the Probabilistic Similarity Model Measurer. This component constructs a host-level sequence similarity between each pair of models. It uses similar malicious behavior patterns to discover the sequential relation in order to find unknown or variant malware classes among all of the models. The Clustering Engine uses all of the models to find the least distance between models; it produces clustered results of behavioral models for all of the input malware behavioral reports. To summarize, our system clusters malware behavior into malware classes so that real-world malware can be distinguished. In the following sections, we describe the individual components in further detail.

A. Malware Behavioral Reports
In order to cluster many malware variants across more than 3,000 distinct malware samples, we applied our system in a real-world deployment using a public dataset with input samples collected from public websites [5].
Figure 2. System architecture of Structural Similar Behavioral Host-sequence Identification. The system takes malware behavioral reports as input and produces similarity results of behavioral models as output.
Figure 3. An example of a behavioral report generated by CWSandbox.
The public dataset contains reference data on 24 malware classes from Rieck et al. [28]. The public website [5] provides behavioral reports for the sample malware. All behavioral reports were produced by CWSandbox [34], and the malware samples were labeled by the Kaspersky anti-virus. CWSandbox uses API hooking [18] to monitor the operations of each sample and generates a behavioral report after monitoring, as shown in Figure 3. The available report types include CWS reports, sequential reports, and MIST reports from Trinius et al. [31]. We used the system calls of the sequential reports for each malware sample. These system calls are described in greater detail in the following section.
B. Host-level Behavioral Tokenizer
This component processes the sequential behavioral reports described in the previous section. Each report, produced by CWSandbox, describes the program operations observed through automated dynamic analysis at run time for one sample.
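To make this tokenization concrete, the sketch below is a rough illustration only (not the authors' implementation); the XML tags and attributes are hypothetical and do not reflect the actual CWSandbox schema. It keeps only the operation names at the host level and discards the corresponding OS objects, as explained below.

```python
import xml.etree.ElementTree as ET

# Hypothetical CWSandbox-style fragment; real reports use a different schema.
report = """
<analysis>
  <process filename="sample.exe">
    <load_dll dll="kernel32.dll"/>
    <query_value key="HKLM\\Software\\Example"/>
    <create_mutex name="_x_mutex_"/>
    <open_key key="HKLM\\System\\Example"/>
    <create_open_file srcfile="C:\\Temp\\a.tmp"/>
  </process>
</analysis>
"""


def tokenize_host_patterns(xml_text):
    """Keep only the operation names (host-level OS patterns); drop the OS objects."""
    root = ET.fromstring(xml_text)
    return [op.tag for proc in root.iter("process") for op in proc]


print(tokenize_host_patterns(report))
# ['load_dll', 'query_value', 'create_mutex', 'open_key', 'create_open_file']
```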
Figure 3 shows an analysis report obtained via API hooking. Each such report includes OS patterns and OS objects in a behavior profile. For example, we can observe a CreateFile pattern at the host level that creates a file in a destination folder and can drop malicious file objects with every malicious program. The SetValue host-level pattern may create or modify registry keys or set Windows registry objects, making the computer prone to unexpected flaws, failures, and crashes. To address these problems, we extract only OS patterns at the host level, not the corresponding OS objects. Using the corresponding OS objects creates many variable objects within the same malware class and may cause noise, ambiguity in identifying malware classes, and excessive memory usage. Therefore, we extract only OS patterns at the host level as preprocessing for each malware sample, as shown in Figure 4. This component then passes the OS patterns for each malware sample to the next component, as described in Section III-C.

Figure 4. An example of host-level behavioral patterns (e.g., Load dll, Query value, Create mutex, Open key, Create open file, Get host).

C. Behavioral Patterns Sequence Constructor
Attackers use automated malware generation tools to generate new or modified malware, leading to increased instances and variants of malware, and they also attempt to use similar malicious behavior to generate variants of the malware. To counter the fact that variants of the same malware can evade anti-virus products, we consider a sequential relation between different behavioral OS patterns in the same malware class. In this component, we use the host-level behavioral patterns of all malware reports as input, with the goal of using these patterns to construct the sequential relation in each malware report. The main purpose is to construct a probability sequence between patterns and to integrate the relations among all behavioral patterns in each malware sample.

Figure 5. A Markov model summarizes the situation of previous states.
We apply a Markov chain model [11] because the state transitions in this probabilistic model are directly visible to the observer, and a Markov chain can summarize the situation of previous states, as shown in Figure 5. In this study, we use a Markov chain as the structural sequence of the behavioral patterns in each malware sample, together with counting frequencies [33], initialized uniformly, for all of the different state transition probabilities. For example, as shown in Figure 6, we use the structural sequence of a Markov chain [11] and transition probabilities obtained by counting frequencies [33] to construct a structural sequence for the pattern sequence ABAABAC. Briefly, this component's procedure comprises the following three steps. 1) Tokenize each behavior pattern of a malware report into the current model. 2) Adjust the sequential states of the probabilistic model to match all of the behavioral patterns in each malware report. 3) Create a structural sequence for each malware report as output.
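The following minimal Python sketch (illustrative only, not the authors' code) shows this construction for the sequence ABAABAC of Figure 6: the transition probabilities are estimated by counting transition frequencies, as described above.

```python
from collections import Counter, defaultdict


def build_markov_chain(sequence):
    """Estimate a first-order Markov chain (initial and transition
    probabilities) from a behavior-pattern sequence by counting frequencies."""
    transitions = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        transitions[current][nxt] += 1

    model = {}
    for state, counts in transitions.items():
        total = sum(counts.values())
        model[state] = {nxt: count / total for nxt, count in counts.items()}

    initial = {sequence[0]: 1.0}  # the observed sequence starts in its first state
    return initial, model


# The sequence ABAABAC from Figure 6.
initial, model = build_markov_chain(list("ABAABAC"))
print(model["A"])  # {'B': 0.5, 'A': 0.25, 'C': 0.25}
print(model["B"])  # {'A': 1.0}
```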
Figure 6. An example of the sequence ABAABAC (states 1-7 with their transition probabilities).

Figure 8. Two sequences S1: A C A W X C A and S2: C A C B C A, and their associated models G1 and G2.
D. Behavioral Structure Module
In this module, we take all of the behavioral sequences constructed from the malware reports as input. The module is used mainly for finding similar behavioral sequential models of the same malware class, comparing each pair of models, for the next component, as described in Section III-E. The task is to create a behavioral structure module that describes an individual behavioral sequential model for each malware sample. Each of these models contains a set of behavioral patterns, the initial probability of each state, and the transition probabilities that record the likelihood of moving from one state to the next. In this component, we generate sequential models for all of the malware reports in the behavioral structure module. A behavioral structure module of the POISON malware, rendered as a graph using Graphviz's dot format [15], is shown in Figure 7. The goal of these models is to make similar relationships easy to find, as discussed in Section III-E.

E. Behavioral Model Similarity Constructor
In this section, we propose a behavioral model similarity constructor that builds a similarity matrix between each pair of behavioral structural models over all malware reports, in order to identify variants of malware classes. In our design, each sequence has an associated structural model with a set of transition probabilities and behavioral patterns determined by the sequence [21]. First, for every pair of behavioral structure sequential models, we calculate the correlation between the models and their sequences. Then, the similarity for each pair of sequences is taken to represent their relation as a "distance." Finally, we use all of the pairwise distances to construct
a similarity matrix for malware clustering over all malware reports.
1) Host-Structural Model Comparator: Consider a set of behavioral sequences $S = \{S_1, S_2, \ldots\}$ and an associated set of behavioral structural models $G = \{G_1, G_2, \ldots\}$. If $S_j$ is the $j$-th sequence and $G_i$ is the structural model for the $i$-th sequence, then $f_s(S_j; G_i)$ is the probability density function over sequence $S_j$ according to model $G_i$. We use a length-normalized log-likelihood [16] to measure the similarity between each pair of sequences, as formulated in Equation (1):

$$C(S_j \mid G_i) = \frac{1}{\mathrm{length}(S_j)} \log f_s(S_j; G_i) \tag{1}$$
where C is the length-normalized log-likelihood function. In this subcomponent, we mainly construct the correlation of one sequence with respect to another sequence's model; in essence, we compare one sequence with another sequence. As shown in Figure 8, given two sequences S1 and S2 and their associated models G1 and G2, we individually calculate the length-normalized log-likelihoods C(S1|G2) and C(S2|G1). The model comparator results are then passed to the next subcomponent.
2) Probabilistic Similarity Model Measurer: In this subcomponent, we consider the similarity relationship of host-sequences and measure models using a probabilistic similarity methodology. The method is helpful for discovering similarity correlations between two sequences. Given two sequences $S_i$ and $S_j$ and their corresponding models $G_i$, $G_j$, we measure the probabilistic similarity between the two sequences according to how well one sequence is described by the model of the other. We use a distance to define the probabilistic similarity of two sequences $S_i$ and $S_j$, as in Equation (2):

$$d_{ij} = \left| C(S_j \mid G_i) + C(S_i \mid G_j) \right| \tag{2}$$
where $d_{ij}$ is the distance between the two sequences $S_i$ and $S_j$. In this way, we obtain pairwise similarities for all pairs of sequences; we compute all pairwise distances and construct a similarity matrix over all malware reports.
Figure 7. Behavioral structure module of the POISON malware, generated using Graphviz's dot format [15].
The similarity matrix, whose size is (number of reports) × (number of reports), is as follows:

$$\begin{pmatrix} d_{11} & d_{12} & \cdots \\ d_{21} & d_{22} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$
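A compact sketch of Equations (1) and (2) is shown below, under the assumption that each model is the (initial, transitions) pair of probability dictionaries produced in Section III-C and that a small smoothing constant avoids log(0) for unseen transitions; this is an illustration, not the authors' implementation.

```python
import math


def log_likelihood(sequence, model, eps=1e-6):
    """Length-normalized log-likelihood C(S|G) of Equation (1).

    `model` is a pair (initial, transitions) of probability dictionaries;
    `eps` smooths transitions never observed during training."""
    initial, transitions = model
    logp = math.log(initial.get(sequence[0], eps))
    for cur, nxt in zip(sequence, sequence[1:]):
        logp += math.log(transitions.get(cur, {}).get(nxt, eps))
    return logp / len(sequence)


def distance(seq_i, model_i, seq_j, model_j):
    """Symmetric cross-likelihood distance d_ij of Equation (2)."""
    return abs(log_likelihood(seq_j, model_i) + log_likelihood(seq_i, model_j))


def similarity_matrix(sequences, models):
    """Pairwise distance matrix over all behavioral reports."""
    n = len(sequences)
    return [[distance(sequences[i], models[i], sequences[j], models[j])
             for j in range(n)] for i in range(n)]
```

Combined with the chain-construction sketch in Section III-C (for example, models = [build_markov_chain(s) for s in sequences]), similarity_matrix produces the matrix that is clustered in the next subsection.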
F. Clustering Engine
Input: A set of malware reports with their similarity matrix, and the selected number of clusters num
Output: Clustering result for each report
/* Step 1: Find all centroids */
Centroids ← use fast k-means to find all centroids from the similarity matrix of the malware reports and the number of clusters num;
/* Step 2: Assign each report to the nearest centroid */
for each report x (with its row of the similarity matrix) do
    z ← the nearest centroid in Centroids to x, found by fast k-means;
    assign x to the cluster of z;
end
Algorithm 1: Clustering of all reports
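One possible realization of Algorithm 1 is sketched below, under the assumption that each report is represented by its row of the pairwise distance matrix; we use scikit-learn's KMeans with the Elkan acceleration [12] purely for illustration, whereas the paper cites a Matlab implementation [3].

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_reports(distance_matrix, num_clusters, seed=0):
    """Algorithm 1: find centroids with fast k-means, then assign each report
    (represented by its row of pairwise distances) to the nearest centroid."""
    X = np.asarray(distance_matrix)
    km = KMeans(n_clusters=num_clusters, algorithm="elkan",
                n_init=10, random_state=seed)
    labels = km.fit_predict(X)  # fitting finds centroids; prediction assigns reports
    return labels, km.cluster_centers_


# Example usage (D is the similarity matrix from Section III-E):
# labels, centroids = cluster_reports(D, num_clusters=37)
```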
The last goal is to cluster all of the malware reports according to similar malware behavior. The similarity matrix, which contains the distance for each pair of compared sequences, represents each sequence in such a way that its distances capture its relationship to the other sequences. However, the similarity matrix is high-dimensional data. Therefore, in order to cluster this high-dimensional data and to find the nearest centers
Table I. The data source description of malware reports from Malheur's website [5].

Data source | Malware classes | Reports | Analysis date
Malheur | 24 | 3,131 | 2009/08/01-2009/08/07
of the data quickly, we utilize fast k-means [3], [12] to cluster our similarity matrices for most of our evaluation. The procedure is detailed in Algorithm 1.

IV. EXPERIMENT AND RESULTS
In this section, we describe our dataset, obtained from a public website, in Section IV-A. The evaluation methods used to assess our system's performance are described in Section IV-B. We present experimental results for the proposed method in Section IV-C.

A. Datasets
Rieck et al. [28] provided public datasets on Malheur's website [5]. These datasets were monitored by CWSandbox, the same tool we use to generate all of our behavioral reports. Moreover, Rieck et al. [28] used several anti-virus companies to label these datasets for evaluating their system. We used the reference dataset provided by Rieck et al. [28] as our evaluation dataset. Rieck et al. [28] collected their reference dataset over a period of three years. Labeled by several anti-virus companies during 2009/08/01-2009/08/07, it includes 24 malware classes and a total of 3,131 malware reports. The numbers of reports from the data source are listed in Table I, and their labels are shown in Figure 9.
Figure 9. The public dataset contains 3,131 reports of 24 malware classes from Malheur's website [5].

B. Evaluation Methods
In this section, we present the evaluation methods used in our experiments. We adopted the evaluation methods of Rieck et al. [28], namely precision, recall, and F-measure, applied to the reference dataset described above. The variable n is the number of malware reports; we use the 3,131 malware reports of the reference dataset, covering 24 malware classes, to evaluate our performance. The goal of precision is to find the maximum number of instances of the same malware class in each group. It measures how well individual groups agree with the malware classes; precision reflects whether samples of different variants are assigned to different groups. Y is the set of groups, and y denotes a group. Precision is defined as follows:

$$Precision = \frac{1}{n} \sum_{y \in Y} \max |\sharp y| \tag{3}$$

The goal of recall is to find the maximum number of instances of the same group in each malware class. It measures the extent to which malware classes are distributed over groups; recall reflects whether samples of the same variant are assigned to the same group. M is the set of malware classes, and m denotes a malware class; the reference dataset has 24 malware classes. Recall is defined as follows:

$$Recall = \frac{1}{n} \sum_{m \in M} \max |\sharp m| \tag{4}$$

According to the evaluation methods described above, if every report is assigned to its own group, precision = 1 but recall is worst, and if all reports are placed in a single group, recall = 1 but precision is worst. In order to attain both high precision and high recall when evaluating our system, we attempt to have each group contain all samples of exactly one malware class. Therefore, we use the F-measure, which combines precision and recall. The F-measure is defined as follows:

$$F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{5}$$

In order to evaluate our performance and compare against the methodology of Rieck et al. [28] on the reference dataset, we need to determine the number of clusters. We used the method proposed by Ray and Turi [26], which measures the distance of the samples from their cluster centroids. The method iteratively tries different numbers of clusters and selects the one that yields compact clusters. Therefore, we used both the intra-cluster and inter-cluster distances to determine the number of clusters. First, the intra-cluster distance measures the distance between each sample and its cluster centroid:

$$IntraCluster = \frac{1}{n} \sum_{i=1}^{K} \sum_{x \in Y_i} \lVert x - z_i \rVert^2 \tag{6}$$

where n is the number of reports (samples), K is the number of clusters, and $z_i$ is the centroid of cluster $Y_i$. Second, the inter-cluster distance is the minimum distance between different cluster centroids:

$$InterCluster = \min_{i \neq j} \lVert z_i - z_j \rVert^2 \tag{7}$$

where $z_i$ and $z_j$ are different cluster centroids. Finally, we seek to minimize the intra-cluster distance and maximize the inter-cluster distance when choosing the number of clusters. We use the validity ratio, which combines both distances:

$$Validity = \frac{IntraCluster}{InterCluster} \tag{8}$$

Therefore, we select the number of clusters that minimizes the validity ratio. The validity ratio for different numbers of clusters on the reference dataset is shown in Figure 10; this approach roughly identifies a good number of clusters. For our experiments, 37 clusters minimized the validity ratio, so we fixed the parameter num, the number of clusters, to 37. Finally, we used these 37 clusters to cluster all of the malware reports in the reference dataset.
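The cluster-number selection can be sketched as follows (an illustration only; it evaluates Equations (6)-(8) over a range of candidate values of k, which is itself an assumption, and keeps the minimizer):

```python
import numpy as np
from sklearn.cluster import KMeans


def validity_ratio(X, labels, centroids):
    """Validity = IntraCluster / InterCluster, per Equations (6)-(8)."""
    intra = sum(np.sum((X[labels == i] - c) ** 2)
                for i, c in enumerate(centroids)) / len(X)
    inter = min(np.sum((ci - cj) ** 2)
                for i, ci in enumerate(centroids)
                for j, cj in enumerate(centroids) if i < j)
    return intra / inter


def best_num_clusters(X, k_range=range(2, 61), seed=0):
    """Choose the number of clusters that minimizes the validity ratio."""
    X = np.asarray(X)
    scores = {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        scores[k] = validity_ratio(X, km.labels_, km.cluster_centers_)
    return min(scores, key=scores.get)
```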
Figure 10. Validity ratio for different numbers of clusters.
C. Experimental Results
The experimental results are summarized in Table II. We considered the methodology of Rieck et al. [28] as the baseline for assessing the effectiveness of our system, and we evaluated their method on the same public dataset. For both our work and that of Rieck et al. [28], we focused on the sequential reports from CWSandbox monitoring and used the public dataset of 24 malware classes and 3,131 malware reports mentioned above. Our method achieves a higher F-measure (94.2% versus 92.9%). In our experiments, we used 37 clusters to cluster the 3,131 malware reports, whereas Rieck et al. [28] used 28 clusters. Although we used more clusters than Rieck et al. [28] did, our precision (94.7%) is higher than theirs. Because we divided the samples precisely into the 24 malware classes, portions of some malware classes were distributed over different clusters. Precision rewards concentrating a large share of each malware class's samples within clusters, which helps to discover variants of malware classes. For example, the FLYSTUDIO malware class is distributed over three clusters, each dominated by FLYSTUDIO samples; FLYSTUDIO spreads over many different clusters because it exhibits many different behaviors. The goal of recall is for samples of the same class to be grouped in the same cluster in almost all cases; when many clusters are produced, the recall rate in Equation (4) tends to decrease. Nevertheless, our recall (93.7%) is the same as that of Rieck et al. [28] (93.7%), because our method groups the same malware class into the same cluster, and the F-measure evaluates precision and recall together. Therefore, our method groups variants of malware classes better than that of Rieck et al. [28]. We now discuss some malware classes for which our method and that of Rieck et al. [28] outperform each other. Our method is better than that of Rieck et al. [28] for the following malware:
Figure 11. Recall rate of our method and Rieck et al. [28].
• RBOT: Rieck et al. [28] clustered some RBOT samples together with SPYGAMES. Because RBOT's sequence appears the same as that of SPYGAMES, as shown in Figure 12, Rieck et al. [28] differentiated between them only ambiguously. Because our method involves a probabilistic model, it can easily differentiate between them.
• ROTATOR: Rieck et al. [28] used a heuristic rejected cluster, which caused some ROTATOR samples to fall into the rejected cluster; any cluster with too few reports is absorbed into the rejected cluster by a threshold.
• SALITY: SALITY only changes the order of its behavior patterns, which caused Rieck et al. [28] to cluster these samples ambiguously into the same clusters. Because they used two-grams, the situation of previous states was not considered. Because our method uses a structural sequence that considers the situation of previous states, it differentiates between them easily.
• FLYSTUDIO: The method of Rieck et al. [28] spreads FLYSTUDIO over more clusters than our method does, because many variants of this malware behave differently; for example, some modify Internet Explorer's homepage, some modify an Internet browser's settings, and others hook the keyboard. Because Rieck et al. [28] did not consider the situation of previous states, their method clusters FLYSTUDIO ambiguously.
The method of Rieck et al. [28] is better than our method for the following malware:
• VIRUT: Part of the VIRUT host-sequence is clustered in the same way as SPYGAMES, because part of VIRUT modifies the "LoadIMM" text services registry keys, producing the same behavior as SPYGAMES.
Table II. The experimental results.

Method | Number of clusters | Precision (%) | Recall (%) | F-measure (%)
Our method | 37 | 94.7 | 93.7 | 94.2
Rieck et al. [28] | 28 (including rejected cluster) | 92.0 | 93.7 | 92.9
Figure 12. Identical sequences of the RBOT malware and the SPYGAMES malware.
V. CONCLUSION
In this paper, we presented a system for clustering variants of malware classes using structural sequencing and similarity measurement. To discover sequential relation profiles between different behavior patterns in variants of malware classes, we used a structural sequencing mechanism based on a Markov chain model with counting frequencies to determine the sequential relationships; thus, each report is converted into a structural sequence. To cluster the closely similar malicious behavior of malware variants, we used a probabilistic similarity based on the length-normalized log-likelihood to measure the distance between each pair of sequences. Finally, we used fast k-means to cluster our dataset effectively. Our experiments showed that the proposed system can cluster different malware classes efficiently and effectively. Future work will examine the external malicious network behaviors of botnet malware. We hope that our proposed system can be extended to extract such network behaviors from malware: when a sandbox monitors malware dynamically, malicious network behaviors can also be observed. We hope to trace these behaviors to detect fast-flux service networks [24], [17], [32] in real time and to combine that detection with our proposed system.
REFERENCES

[1] Anubis. http://anubis.seclab.tuwien.ac.at/.
[2] CWSandbox. http://www.cwsandbox.org/.
[3] Fast k-means code for Matlab. http://cseweb.ucsd.edu/~elkan/fastkmeans.html.
[4] Fortify software. http://www.powertest.com/files/datasheet-fortify-360.pdf.
[5] Malheur: Public datasets. http://pi1.informatik.uni-mannheim.de/malheur/.
[6] Norman Sandbox. http://www.norman.com/.
[7] OllyDbg. http://www.ollydbg.de/download.htm.
[8] Software code security and code security analysis. http://www.veracode.com/security/code-security.
[9] Windows debugger. http://msdn.microsoft.com/en-us/windows/hardware/gg463009.aspx.
[10] B. Abhijit, H. Xin, S. K. G., and P. Taejoon. Behavioral detection of malware on mobile handsets. In Proceedings of the 6th International Conference on Mobile Systems, Applications, and Services, MobiSys '08, 2008.
[11] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.
[12] C. Elkan. Using the triangle inequality to accelerate k-means. In ICML, pages 147-153, 2003.
[13] J. Faulhaber. Microsoft security intelligence report. Technical Report Volume 11, Microsoft, Inc., 2011.
[14] M. Fossi, G. Egan, K. Haley, E. Johnson, T. Mack, T. Adams, J. Blackbird, L. M. King, D. Mazurek, D. McKinney, and P. Wood. Symantec internet security threat report: Trend for 2010. Vol. XVI, Apr. 2011.
[15] E. Gansner, E. Koutsofios, and S. North. Drawing graphs with dot. Technical report, 2006.
[16] D. García-García, E. Parrado-Hernández, and F. Díaz-de-María. A new distance measure for model-based sequence clustering. IEEE Trans. Pattern Anal. Mach. Intell., 31(7):1325-1331, July 2009.
[17] S.-Y. Huang, C.-H. Mao, and H.-M. Lee. Fast-flux service network detection based on spatial snapshot mechanism for delay-free detection. In 5th ACM Symposium on Information, Computer and Communications Security (ASIACCS 2010), 2010.
[18] G. Hunt and D. Brubacher. Detours: Binary interception of Win32 functions. In Proceedings of the 3rd Conference on USENIX Windows NT Symposium, Volume 3, pages 14-14, Berkeley, CA, USA, 1999. USENIX Association.
[19] C. Kruegel, E. Kirda, P. M. Comparetti, U. Bayer, and C. Hlauschek. Scalable, behavior-based malware clustering. In Proceedings of the 16th Annual Network and Distributed System Security Symposium (NDSS 2009), January 2009.
[20] C. Linn and S. Debray. Obfuscation of executable code to improve resistance to static disassembly. In Proceedings of the 10th ACM Conference on Computer and Communications Security, CCS '03, pages 290-299, New York, NY, USA, 2003. ACM.
[21] C.-H. Mao, H.-K. Pao, C. Faloutsos, and H.-M. Lee. SBAD: Sequence based attack detection via sequence comparison. In Proceedings of the International ECML/PKDD Conference on Privacy and Security Issues in Data Mining and Machine Learning, PSDML '10, pages 78-91, Berlin, Heidelberg, 2011. Springer-Verlag.
[22] A. Moser, C. Kruegel, and E. Kirda. Limits of static analysis for malware detection. In 23rd Annual Computer Security Applications Conference (ACSAC 2007), December 10-14, 2007, Miami Beach, Florida, USA, pages 421-430. IEEE Computer Society, 2007.
[23] V. P. Nair, H. Jain, Y. K. Golecha, M. S. Gaur, and V. Laxmi. MEDUSA: Metamorphic malware dynamic analysis using signature from API. In Proceedings of the 3rd International Conference on Security of Information and Networks, SIN '10, pages 263-269, New York, NY, USA, 2010. ACM.
[24] R. Perdisci, I. Corona, D. Dagon, and W. Lee. Detecting malicious flux service networks through passive analysis of recursive DNS traces. In Proceedings of the 2009 Annual Computer Security Applications Conference, ACSAC '09, pages 311-320, Washington, DC, USA, 2009. IEEE Computer Society.
[25] R. Perdisci, W. Lee, and N. Feamster. Behavioral clustering of HTTP-based malware and signature generation using malicious network traces. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, NSDI '10, pages 26-26, Berkeley, CA, USA, 2010. USENIX Association.
[26] S. Ray and R. H. Turi. Determination of number of clusters in k-means clustering and application in colour image segmentation. In Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques (ICAPRDT '99), 1999.
[27] K. Rieck, T. Holz, C. Willems, P. Düssel, and P. Laskov. Learning and classification of malware behavior. In Proceedings of the 5th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, DIMVA '08, pages 108-125, Berlin, Heidelberg, 2008. Springer-Verlag.
[28] K. Rieck, P. Trinius, C. Willems, and T. Holz. Automatic analysis of malware behavior using machine learning. Journal of Computer Security (JCS), 19(4):639-668, 2011.
[29] A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, and Y. Weiss. "Andromaly": A behavioral malware detection framework for Android devices. Journal of Intelligent Information Systems, 38(1):161-190, 2011.
[30] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293-300, June 1999.
[31] P. Trinius, C. Willems, T. Holz, and K. Rieck. A malware instruction set for behavior-based analysis. In Proceedings of the Conference "Sicherheit, Schutz und Zuverlässigkeit" (SICHERHEIT), 2010.
[32] H.-T. Wang, C.-H. Mao, K.-P. Wu, and H.-M. Lee. Real-time fast-flux identification via localized spatial geolocation detection. In Proceedings of the 2012 IEEE 36th Annual Computer Software and Applications Conference, COMPSAC '12, pages 244-252. IEEE Computer Society, 2012.
[33] J. A. Whittaker and M. G. Thomason. A Markov chain model for statistical software testing. IEEE Trans. Softw. Eng., 20(10):812-824, Oct. 1994.
[34] C. Willems, T. Holz, and F. Freiling. Toward automated dynamic malware analysis using CWSandbox. IEEE Security and Privacy Magazine, 5(2):32-39, 2007.