F-Sign: Automatic, Function-based Signature Generation for Malware Asaf Shabtai, Eitan Menahem, and Yuval Elovici Deutsche Telekom Laboratories at Ben-Gurion University, and Department of Information Systems Engineering, Ben-Gurion University, Be’er-Sheva, 84105 Israel {shabtaia, eitanme, elovici}@bgu.ac.il
Abstract. This research proposes a novel automatic method (termed F-Sign) for extracting a unique signature from malware. The method is primarily intended for high-speed malware filtering devices that are based on deep-packet inspection and operate in real time. Malicious executables are analyzed using two approaches: disassembly, utilizing IDA-Pro, and the application of a dedicated state-machine, in order to obtain the set of functions comprising the executables. The signature extraction process is based on a comparison with a common function repository. By eliminating functions appearing in the common function repository from the signature candidate list, F-Sign minimizes the risk of false-positive detection errors. To reduce false positives even further, F-Sign performs intelligent candidate selection using an entropy score. Evaluation of F-Sign was conducted under various conditions. The findings suggest that the proposed method can be used for automatically generating signatures that are both specific and sensitive. Keywords: Malware, Automatic Signature Generation (ASG), Malware filtering.
1
Introduction
Modern computer and communication infrastructures are highly susceptible to various types of attack. A common way of launching these attacks is by means of malicious software (malware) such as worms, viruses, and Trojan horses, which, when spread, can cause severe damage to private users, commercial companies and governments. The recent growth in high-speed Internet connections provides a platform for creating and rapidly spreading new malware. Several analysis techniques for detecting malware have been proposed; they are classified as either static or dynamic. In dynamic analysis (also known as behavioral-based analysis), detection is based on information collected from the operating system at runtime (i.e., during the execution of the program), such as system calls, network access, and file and memory modifications [1]-[5]. In static analysis, detection is based on information extracted explicitly or implicitly from the executable binary/source code. The main advantage of static analysis is in providing rapid classification. Since antivirus vendors face an overwhelming number of suspect files for inspection each day [6], rapid detection is essential. Static analysis solutions are primarily implemented using two methods: signature-based and heuristic-based. Signature-based methods rely on the identification of unique strings in the binary code [6]. Heuristic methods are based on rules, determined either by experts or by machine learning techniques, that define malicious or benign behavior in order to detect unknown malware [7], [8]. As a case in point, Zhang et al. [9] apply the random forest data-mining algorithm to detect misuse and anomalous network intrusions. The period of time from the release of new (i.e., unknown) malicious software until security software/hardware vendors detect the malware, analyze it, generate a signature and release an update to clients is highly critical. During this time the malware is undetectable by most signature-based solutions, and it is therefore usually termed a zero-day attack (or zero-day threat). Since the new malware can easily spread and infect other machines, it is highly important
to detect it as soon as possible and to rapidly generate a suitable signature so that signature-based solutions can be updated to block the new threat. One way to protect organizations from malware is to deploy high-speed malware filtering devices on the communication lines (i.e., network-based intrusion detection systems). Such appliances perform deep-packet inspection in real time and thus support very simple signatures for detecting and removing attacks such as malware, propagating worms, denial-of-service attacks, or remote exploitation of vulnerabilities. In order to monitor traffic in real time without causing a major impact on performance (e.g., delay, latency), these devices inspect the content of the packets without reassembling the session. Since these devices use a continuously updated repository of signatures, they can handle known malware for which they have signatures but are unable to handle unknown malware. This study focuses on automating the process of generating unique signatures for these malware-filtering devices. Various techniques have been proposed for automatically deriving malware signatures. The signatures generated by most of these techniques have been tested and reported to be effective for dealing with small-sized malware. However, these approaches ignore the fact that many types of malware appear as full-fledged executables and therefore contain a significant portion of code emanating from code generators, development tools and platforms. Furthermore, these techniques focus on malware detection only after it has been executed, and extract a signature from the traffic it creates in the course of launching an attack. In order to address the problems discussed above, this research proposes and evaluates a signature generation technique, termed F-Sign, for generating signatures which can be used by network intrusion detection systems (NIDS) operating as malware filtering devices. For such devices, F-Sign needs to generate a very simple signature -- actually a string of bytes or, at most, a simple
regular expression -- that a network appliance can use for filtering malware in real time. To enhance its precision, F-Sign employs an exhaustive and structured technique, which first sanitizes the malware's unique code from other segments of common and usually benign code such as shared libraries. When the sanitization process ends, the remaining code segments are the malicious code pieces. The process continues by generating a unique signature from the malicious code, which can later be used for detecting malware in network traffic. The focus of this research is on generating signatures from malicious code in the form of worms, spyware, Trojan horses, and viruses. Our main assumption is that in a prior step, suspected files are classified as benign or malicious by a human expert or by an automated detection tool. This assumption allows us to focus on the signature generation process, but it also means that the quality of the signatures relies on the accuracy of the classification of suspicious files. F-Sign was used as the automatic signature generation module of the eDare (early detection, alert and response) framework [10], which offers "malware filtering as a service" and is targeted toward Network Service Providers (NSP), Internet Service Providers (ISP), and large enterprises. eDare is aimed at mitigating the spread of both known and unknown malware in computer networks. eDare operates by first monitoring network traffic and filtering out known malware using high-speed filtering devices that are continuously updated with signatures generated by F-Sign. Next, unknown files are extracted from the remaining traffic and examined using various machine-learning and temporal reasoning methods in order to classify the files as malicious or benign. F-Sign is implemented in the last step to extract signatures from newly detected malicious
files (i.e., files that were verified as malware). When eDare identifies a new threat, F-Sign automatically produces a signature and then the filtering devices that are stationed on the network infrastructure are automatically updated. Since this process is very fast, much faster than when human intervention is required, it is effective against zero-day attacks. This research describes the F-Sign technique and a set of experiments that were performed on a collection of malicious and benign executables. We were particularly interested in finding the optimal length and selection criteria of a signature among several candidates as well as the size and type of the training set in order to minimize false alarms. The rest of the paper is structured as follows. In section 2 we present related work for this study. In section 3 we present the F-Sign method followed by its evaluation in section 4. In section 5 we conclude the paper with a summary and future research directions.
2
Related Work on Automatic Signature Generation (ASG)
Automated signature generation (ASG) for new malware is extremely difficult since the signature must be general enough to capture as many instances of the malware as possible, yet specific enough to avoid overlapping with the content of normal traffic in order to minimize false positives. Automated signature generation methods have been proposed and evaluated in many studies, most of which aim at expediting the process of signature generation in order to effectively contain worms. Malware signatures can be classified as vulnerability-based, exploit-based and payload-based [11]. A vulnerability-based signature describes the properties of a certain bug in the system that
can be maliciously exploited by the malware [12]. Vulnerability-based signatures do not attempt to detect every malicious code exploiting the vulnerability and therefore can be very effective when dealing with polymorphic malware. However, a vulnerability-based signature can be generated only when the vulnerability is discovered. An exploit-based signature describes a piece of code (a sequence of commands or inputs) triggered by the malware which actually exploits a vulnerability in the system. Exploit-based methods include Autograph [13], the PAYL sensor [14], NetSpy [15] and EarlyBird [16], which focus on analyzing similarities in packet payloads belonging to suspicious network traffic. These systems first identify anomalous traffic originating from suspicious IP addresses and then generate a signature by identifying the most frequently occurring byte sequences. The Nemean architecture [17] first clusters similar sessions and then uses machine-learning techniques to generate semantics-aware signatures for each cluster. Polygraph [18] expands the notion of single substring signatures (i.e., tokens) to conjunctions, ordered sets of multiple tokens that match multiple variants of polymorphic worms. Honeycomb [19] overlays parts of the flows in the traffic and uses a Longest Common Substring (LCS) [20] algorithm to spot similarities in packet payloads. Tang and Chen [21] subsequently designed a double-honeypot system and introduced position-aware distribution signatures (PADS), which are computed from polymorphic worm samples and are composed of a byte frequency distribution (instead of a fixed value) for each position in the signature "string". Tang et al. [11] use sequence alignment techniques, drawn from bioinformatics, in order to derive simplified regular expression exploit-based signatures. Exploit-based signatures can be generated rapidly to detect zero-day exploits of uncovered vulnerabilities. They are, however, less effective on polymorphic malware. In addition, the signatures
generated by the above techniques were tested and reported to be effective for short, stream-based malware (i.e., worms) such as Nimda, Code Red/Code Red II, MS Blaster, Sober, Netsky and Bagle. Nevertheless, larger malware executable files, carrying full-fledged applications, usually contain a significant portion of invariant code segments that are planted by the software development platform spawning the malware. As a result, selecting a signature that will be both sensitive and specific is a very challenging task when dealing with large files. Another limitation of these techniques is that they focus on detecting malware after it has been unleashed and try to generate a signature from the traffic it creates while the attack is being launched. A payload-based signature identifies the actual malware's code or body. The approach proposed in this paper falls into the payload-based signature category. Payload-based signature generation methods are presented in [22] and [6]. Kephart and Arnold [22] present a two-step statistical method for automatically extracting good, "near optimal" signatures from the code of a virus. In the first step, decoy programs on isolated machines are deliberately infected with the virus. Then, the infected regions of the decoys are compared with one another to establish which regions of the virus are constant from one instance to another. Those regions are considered signature candidates. The second phase estimates the probability that each of the candidate signatures will match a randomly chosen block of bytes in the code of a randomly chosen program. The candidate with the lowest estimated false-positive probability is chosen as the signature. The Hancock system [6] was proposed for automatically extracting signatures for anti-virus software. Based on several heuristics, the Hancock system generates a set of signature candidates, selecting the candidates that are not likely to be found in benign code. Similar to our approach, Hancock relies on modeling benign code in order to minimize false-alarm
risks. However, the Hancock system differs from our approach, which is semantic-aware in the sense that it does not rely on arbitrary byte-code sequences but rather on code representing the internal functions of the software. In addition, since the two methods above focus on generating signatures for anti-virus software, the limitation on signature length is not necessarily considered. Other solutions have been proposed for protecting systems and preventing an attack beforehand rather than detecting the attack after it is launched. This can be done by generating signatures based on sequences of instructions that represent malicious or benign behavior. These sequences can be extracted either by statically analyzing the program after disassembly [23] or by monitoring the program during execution. For example, protecting a system from buffer overflow attacks can be achieved by: (1) creating signatures for legitimate instruction blocks and matching instruction sequences of monitored programs with the signature repository [24], [25]; (2) using obfuscation of pointers such that a malicious application that tries to exploit a buffer overflow vulnerability will not be able to create valid pointers [26]; or (3) applying array and pointer boundary checking [27], [28]. As opposed to such methods, our goal in this research is to generate signatures for high-speed traffic filtering devices that do not rely on installation or modification of end-points and that protect the end-points at the network level. In summary, each of the above techniques suffers from at least one critical limitation. Some rely on small and coherent malware files, but such files may not constitute the general case. Other techniques rely on observing malware behavior, but such malware cannot always be fully monitored. Still others search for packet similarities, but this does not assure a truly low false-positive rate. Our method does not rely on the small-malware-size assumption. In addition, it does not require activating the malware. We propose an automatic signature generation technique
capable of generating sensitive and specific signatures for malware of any size and type, while trying to minimize the false positive cases by analyzing the malware at the functional level and taking into account large common code segments.
3
F-Sign: The Proposed Automatic Signature Extraction Method
In order to create and employ signatures for effectively and efficiently detecting malware in executables, our technique should be able to generate a signature that complies with the following requirements:
- The signature has to be sufficiently long to make it unique among benign executables (high specificity).
- When searching for a byte-string signature in network traffic, the string may be split over two or more packets. In such cases, a match - and consequently the malware - will go undetected. Current high-speed network filtering devices, such as the DefensePro intrusion detection system used in our evaluation [29], cope with this problem by applying TCP reassembly that aggregates a trail of the last seen packets (a trail of the last 128 bytes in the case of DefensePro). However, the signature should still be short enough to meet the limitations of such devices. As will be explained below, the proposed method can choose functions in the code that are short enough to meet the limitations of these filtering devices and that are sufficiently unique to become signatures.
- The signature should comply with the limitations of high-speed deep packet inspection devices that detect and remove malware in real time in high-speed data streams.
- The signature should be well defined to enable fully automatic generation.
3.1
Malware signatures
The major challenge in conforming to the aforementioned requirements is to develop a methodology that can locate a code segment or segments unique to a specific malware instance and which can serve as a unique signature. Since many malware instances are in fact developed using code generators and higher-level languages (e.g., Java, C, C++, Delphi), common service routines of such higher-level authoring tools should not be considered part of the actual malware. Such routines are in fact code segments common to many applications; they are not manually written by the creator of the malware but rather automatically linked to the malware by the underlying development package and authoring tool library. These segments of code, termed common function code (CFC), are not good signature candidates for a malware because they can potentially be used in many benign applications and are thus not unique to a particular malware. To significantly decrease the risk of selecting such code segments as signature candidates, we must first identify and disregard the CFC part of a malware instance by analyzing the malware against a repository of known CFC segments, termed a common function library (CFL). The CFL can be derived from a collection of benign executables and should be regularly updated in order to take into account the evolution of benign (and potentially malicious) files. Functions are excellent candidates to act as markers, and the approach that we chose is to identify the start- and end-points of the functions located in the malware and to compare these functions against the CFL. To meet all the constraints described, we defined two distinct stages:
1. Common function library (CFL) construction - build a library that contains representations of functions from standard libraries used by higher-level languages.
2. Signature generation - search for one or more segments that comply with the constraints described.
3.2
Common Function Library Construction
In the preliminary stage, we use a repository of clean files to create the common function library (CFL). As depicted in Figure 1, legitimate files are inserted into the Function Extractor. Each file is then processed and all identified functions are extracted. Next, the Function Matching component filters known functions, leaving only new functions to be inserted into the CFL. The CFL plays a major role in the signature generation process, and generating good signatures highly depends on the accuracy of the CFL. Therefore, as already noted, the CFL should be regularly updated in order to take into account the evolution of benign (and optionally malicious) files. As a case in point, if the CFL is not regularly updated with common functions, a malicious file may share common functions with benign files that do not appear in the CFL. In such cases, F-Sign may select these functions as candidates and possibly as the signature, which would create a false alarm. An inaccurate or incomplete CFL may also result in good candidates (i.e., malware-unique functions) being considered common code, resulting in fewer candidates to choose from (or even none). Creating and updating the CFL, as well as extracting candidates from malware files based on the CFL, are performed in a back-end system where all F-Sign logic runs, and not in real time. To achieve acceptable response time when searching the CFL, large CFLs can be distributed and a parallel algorithm can be implemented. In addition, searching the CFL can be performed more efficiently by grouping common functions according to their specific platform/operating system. Finally, the CFL needs to be constantly re-validated, and irrelevant (old) common functions need to be removed to keep the CFL size from continuously growing. Our experimental results
suggest that the CFL converges to a size for which a low false alarm rate is maintained and that additional common functions do not contribute to further reducing the false alarm rate.
Fig. 1. The Common Function Library (CFL) creation and signature generating processes.
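To make the CFL lookup concrete, the following minimal Python sketch (an illustration under our own assumptions; the actual implementation stored CFL functions in files and was built around an IDA-Pro plug-in and a Java state-machine) keeps a hash of every normalized benign function in a set, so that filtering the functions of a suspect file against the CFL becomes, on average, a constant-time membership test.

import hashlib

def build_cfl(benign_function_lists):
    """Build a common function library (CFL) as a set of hashes of normalized
    function byte strings extracted from benign executables."""
    cfl = set()
    for functions in benign_function_lists:   # one list of byte strings per benign file
        for func in functions:
            cfl.add(hashlib.sha1(func).hexdigest())   # duplicates collapse automatically
    return cfl

def filter_common_functions(cfl, malware_functions):
    """Return only the malware functions that do not appear in the CFL;
    these are the signature candidates."""
    return [f for f in malware_functions
            if hashlib.sha1(f).hexdigest() not in cfl]

Storing only hashes keeps the CFL compact and makes distributing it over several machines, as discussed above, straightforward.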
We propose and compare two approaches for identifying and extracting functions: (1) disassembling the file using IDA-Pro and extracting its functions; and (2) extracting functions directly from the binary code using a specialized state-machine.
3.2.1 Extracting Functions Using File Disassembly
The disassembly process consists of translating the machine code instructions stored in the executable into a more human-readable language, namely, assembly. The next step is to identify the functions of the program. Although such a process seems trivial, malware writers often try to prevent the successful application of the disassembly process in order to thwart experts from analyzing their malware [30]. In this study we used IDA-Pro (http://www.hex-rays.com/idapro/), one of the most advanced commercial disassembly programs available today. In order to process the large number of executables needed for creating and maintaining the
CFL, we implemented a proprietary IDA-Pro plug-in. The plug-in uses IDA-Pro APIs that allow operating IDA-Pro's algorithms through a command-line interface. The plug-in processes each executable, first by disassembling it and then by extracting all its possible functions. Since most executable functions have address references, the same function in different executable files might have different binary representations. For example, a call to another function is usually represented by a command that contains a reference. Since a reference may change from one executable file to another during the linkage editing stage, we can assume that for a given function X and two or more executable files containing function X, the binary code representation may differ between these executable files. The plug-in, using IDA-Pro APIs, is used to pinpoint call references and thereby enable the normalization of the functions by eliminating those references. Consequently, two occurrences of the same function that differ only in references will be identified as one function (see Figure 2).
Fig. 2. The normalization process using IDA-Pro.
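The reference-elimination idea can be illustrated at the byte level. The sketch below is an assumption made for illustration only (the actual plug-in relies on IDA-Pro's reference information rather than a raw opcode scan); it zeroes the 4-byte operand of near relative CALL instructions (opcode E8) so that two copies of the same function that differ only in call targets normalize to identical byte strings.

def normalize_function(code: bytes) -> bytes:
    """Zero the rel32 operand of near CALL instructions (opcode 0xE8).
    A plain byte scan may occasionally hit 0xE8 bytes that belong to data or to
    other instructions' operands; IDA-Pro avoids this through true disassembly."""
    out = bytearray(code)
    i = 0
    while i < len(out):
        if out[i] == 0xE8 and i + 4 < len(out):   # near relative CALL rel32
            out[i + 1:i + 5] = b"\x00\x00\x00\x00"   # mask the call target
            i += 5
        else:
            i += 1
    return bytes(out)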
3.2.2 Extracting Functions Using Specialized State-Machine
With this approach, we use a specialized state-machine to identify the functions that reside in
the executable files without disassembling them. The state-machine-based function extraction method searches for byte sequences that represent either the start or the end of functions within the binary representation of an executable file. In a preparatory phase, we analyzed the assembly code of a representative collection of executable files and identified patterns that indicate the start or end of functions. The matching binary representations of the identified start/end patterns are extracted and compiled into a state-machine. The states of the state-machine represent a start-of-function, an end-of-function, or an insignificant byte sequence. A transition from one state to another is triggered by reading the next byte of a file. At runtime, the state-machine is applied, without disassembling the file, by reading the bytes of an executable one by one and marking byte sequences that reach a start-of-function or end-of-function state. Figure 3 exemplifies the process of generating the state-machine. By examining the assembly code, we see that the commands 'push ebp' and then 'mov ebp, esp' are a common pattern for beginning a function. We map these commands to their binary representation (in this case "55", "8B", "EC"), which is used for constructing the state-machine. During the function identification process, the file's bytes are sequentially fed to the state-machine; when a "Start-of-Function" state has been reached, a new function's start is marked as found and the state-machine is reset. The state-machine example in Figure 3 recognizes a sequence of "55", "8B" and "EC" (all are hexadecimal byte values) as a function's start. In the same way, the following byte sequences are identified as a function's start: "56 57 5D"; "56 57 5C"; "56 57 89". Although we can identify the beginning and the end of a function by using the state-machine
approach, we cannot identify references and therefore cannot normalize the functions. Thus, in order to determine whether two given functions are similar enough to be considered the same function, regardless of reference changes and their offsets in a file, we used a similarity test controlled by a threshold. If the percentage of differing bytes in two functions exceeds the threshold, the two functions are considered different (see the example in Figure 4).
[Figure 3 diagram: state-machine transitions over the byte sequences 55-8B-EC and 56-57 followed by 5D, 5C or 89, leading to the Start-of-Function state.]
Fig. 3. Identifying the beginning of a function using a state-machine – example.
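For illustration, the following Python sketch scans a binary for the start-of-function byte patterns listed above. For brevity it matches the example patterns directly at each offset rather than compiling them into a state machine, and it handles only function starts; the actual implementation compiles many start and end patterns collected in the preparatory phase.

# Example start-of-function patterns from Figure 3 (hexadecimal byte values).
START_PATTERNS = [
    bytes.fromhex("558BEC"),   # push ebp; mov ebp, esp
    bytes.fromhex("56575D"),
    bytes.fromhex("56575C"),
    bytes.fromhex("565789"),
]

def find_function_starts(data: bytes):
    """Return every offset at which one of the start patterns begins."""
    return [offset for offset in range(len(data))
            if any(data.startswith(p, offset) for p in START_PATTERNS)]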
Function X in File1: 8B D8 8B D6 8B C3 E8 F5 CD FF FF 89 73 1C 5E 5B
Function X in File2: 8B D8 8B D6 8B C3 E8 DD E5 FF FF 89 43 38 5E 5B
Fig. 4. A simple example of different binary code representations for the same function X.
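A possible implementation of the threshold-controlled similarity test is sketched below; the equal-length requirement and the 10% threshold are assumptions made for illustration, since the threshold is a configurable parameter.

def same_function(func_a: bytes, func_b: bytes, threshold: float = 0.10) -> bool:
    """Treat two extracted functions as the same function if the fraction of
    differing bytes does not exceed the threshold (threshold value assumed here)."""
    if len(func_a) != len(func_b):   # assumed: compare equal-length byte strings only
        return False
    diff = sum(1 for a, b in zip(func_a, func_b) if a != b)
    return diff / len(func_a) <= threshold

In the Figure 4 example, 4 of the 16 bytes differ (25%), so whether the two occurrences are merged into one function depends on the chosen threshold.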
3.3 Signature extraction
After creating the CFL, a signature can be extracted for a given malware file. The process of signature generation for a malware file (Figure 1) begins with the identification of the internal functions using one of the two proposed approaches that were used to create the CFL (disassembly using IDA-Pro or the specialized state-machine). Next, each of the functions is matched against the database of existing functions (i.e., the CFL). Those functions found in the CFL are marked as common functions. The remaining functions, which do not exist in the CFL, become viable candidates for generating a unique malware signature. In order to increase the specificity of the signature, we can manipulate the candidate functions by adding, for example, an offset to all candidate functions (i.e., adding to the signature candidate additional bytes prior to the beginning of that function or after its end). In the evaluation, we used a 16-byte offset that was added at the end of the candidate. The choice of a 16-byte offset is mainly heuristic: using more than 16 bytes (e.g., 20 bytes) would result in unnecessarily long signatures, while shorter offsets (e.g., 8 or 12 bytes) might not be sufficient, since several bytes after the function's end might be used for padding. The final step is to select a function from the sequence of candidate functions to be used for generating the signature. In order to choose the best candidate function, we used a simple method whereby we calculate the entropy of each candidate (according to Equation 1) and select the one with the highest entropy.
H(c) = - Σ_i p_i · log2(p_i)    (1)
where p_i is the relative frequency of byte value i in candidate c.
This method was chosen on the assumption that a candidate with the highest entropy will increase the probability of choosing only a code section. Since a low entropy score may suggest a mix of code and data, or even just data, which may not be unique, choosing such a candidate as a signature can lead to false-positive cases. Once the best candidate has been selected, this function's body (possibly with an offset) becomes the signature. In some cases, in order to generate a unique signature, the two best candidates may be selected and the signature will consist of two strings. A known attack on such signature-based detection systems is the black-box attack, in which a hacker continuously modifies a malicious program without disabling its functionality (creating mutations) and tests it against the detection system until the system cannot detect the malware mutation [31], [32]. In order to overcome this attack, we can choose several signatures from the collection of candidates and update detection devices with randomly selected signatures. As mentioned before, F-Sign generates signatures for files that were verified as malware by other systems or by human experts, who are prone to errors. If a false-positive error occurs (a benign file is falsely tagged as malicious) and F-Sign extracts a signature from it, this would create false alarms. If a benign file is infected by a virus, we would expect the system to extract the signature from the infected section by removing all common functions.

F-Sign - Generate Signature Procedure

Generate_Signature(malwareFile)
   functionsInFile ← Extract_Functions(malwareFile, functionDetectorType)
   if functionsInFile is empty
      return "no functions were extracted from file"
   candidates ← Filter_Common_Functions(functionsInFile)
   if candidates is empty
      return "no candidates exist for file"
   candidates ← Manipulate_Candidates(candidates, malwareFile)
   signature ← Select_Best_Signature(candidates)
   return signature

Extract_Functions(malwareFile, functionDetectorType)
   if (functionDetectorType == "StateMachine")
      return Extract_Functions_Using_StateMachine(malwareFile)
   else
      return Extract_Functions_Using_Disassembly(malwareFile)

Extract_Functions_Using_StateMachine(malwareFile)
   byteSequence ← null
   startsOfFunctions ← null
   endsOfFunctions ← null
   while end of malwareFile not reached do:
      byte ← read next byte
      byteSequence ← byteSequence + byte
      state ← Apply_StateMachine_To_Get_Next_State(byte)
      if (state == "EndOfFunction")
         add byteSequence to endsOfFunctions
      else if (state == "StartOfFunction")
         add byteSequence to startsOfFunctions
      else if (state == "None")
         byteSequence ← null
   candidates ← use startsOfFunctions and endsOfFunctions to identify functions
   return candidates

Extract_Functions_Using_Disassembly(malwareFile)
   disassembledFile ← Disassemble(malwareFile)
   functionsInFile ← Identify_Functions(disassembledFile)
   return functionsInFile

Filter_Common_Functions(functionsList)
   candidates ← null
   for each function in functionsList do:
      exists ← false
      for each cflFunction in CFL do:
         if (function == cflFunction)
            exists ← true
            break
      if (exists == false)
         add function to candidates
   return candidates

Manipulate_Candidates(candidates, malwareFile)
   newCandidates ← null
   for each candidate in candidates do:
      find candidate in malwareFile
      newCandidate ← add beforeOffset and afterOffset to candidate
      add newCandidate to newCandidates
   return newCandidates

Select_Best_Signature(candidates)
   bestEntropy ← 0
   bestSignature ← null
   for each candidate in candidates do:
      entropy ← Calculate_Entropy(candidate)
      if (entropy > bestEntropy)
         bestSignature ← candidate
         bestEntropy ← entropy
   return bestSignature
Fig. 5. Pseudo-code describing the signature generation process.
Figure 5 describes the pseudo-code of the signature generation process. Theorem 1. Given a malware file of size n bytes and l the number of common functions in the CFL repository, the computational complexity of Generate_Signature is bounded by O(l∙n∙log(n)). Proof. Given a malware file of size n bytes, the computational complexity of the Extract_Functions method is bounded by O(n) when using the state-machine method for extracting functions (the state-machine method sequentially reads the bytes in the file). Extracting functions from files using the disassembly approach depends on the disassembly tool (e.g., IDA-Pro) and algorithm, but it can be bounded by O(n∙log(n)). The disassembler reads the bytes in the file and identifies opcodes. It parses each opcode only once. Parsing an opcode may lead either to moving to the next instruction or to a data item. An already visited instruction is not parsed again except for adding a reference, and there are no more than O(n) references in total. If a tree is used as the inner data structure for holding all the instructions, handling a single instruction is bounded by O(log(n)). Thus, the disassembly process using IDA-Pro is O(n∙log(n)). The number of functions that can be extracted from a file of size n is O(n); thus, Filter_Common_Functions is bounded by O(l∙n), where l is the number of functions in the CFL.
In our implementation, Manipulate_Candidates simply adds a 16-byte offset to each candidate and Select_Best_Signature computes the entropy of each candidate, returning the candidate with the highest entropy. Thus, both functions are bounded by O(n), since both require one pass over all the extracted functions and O(1) is the computational complexity of the
process applied to each function (i.e., adding an offset or computing the entropy, both on candidates of bounded length). Consequently, the overall computational complexity of Generate_Signature is bounded by O(l∙n∙log(n)).
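As an illustration of the selection step, a minimal Python sketch of Equation (1) and of the Calculate_Entropy/Select_Best_Signature routines from Figure 5 is given below; candidates are assumed to be byte strings that already include the 16-byte offset.

import math
from collections import Counter

def calculate_entropy(candidate: bytes) -> float:
    """Shannon entropy (Equation 1) of the byte-value distribution of a candidate."""
    total = len(candidate)
    counts = Counter(candidate)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_best_signature(candidates):
    """Return the candidate with the highest entropy, or None if the list is empty."""
    return max(candidates, key=calculate_entropy, default=None)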
4
Evaluating F-Sign
The development of F-Sign raised many questions. More specifically, we were interested in determining:
1. What are the recommended length and selection criteria of a signature among several candidates in order to minimize false positives?
2. What training set size should be used to create the common function repository for achieving similar goals?
3. Does using the entropy-based signature selection heuristic minimize false positives?
4. Does adding an offset to the candidates contribute to producing better signatures?
5. Which method is best for analyzing a malware file in order to obtain the most accurate map of its functions (disassembly using IDA-Pro or the state-machine)?
To answer these questions we implemented an evaluation process. The following sections describe the evaluation environment, the experiments and the results.
4.1 The evaluation environment
The evaluation process included three steps (see Figure 6): (1) creating the common function library; (2) generating signatures for malware files; and (3) computing statistics and evaluation measurements. The CFL is a basic requirement for our signature generation method since it enables us to filter
common functions that should not be used as signature candidates. In the first step, we created the common function library from a collection of benign files. The extraction of the CFL functions from the collection of benign files was performed using the two function-extraction methods presented in section 3.2. The differences between the two methods lie in the way they detect functions. For the first method, we used the IDA-Pro third-party software to disassemble the files. We implemented a plug-in for IDA-Pro (in the Python programming language) that uses the IDA-Pro API for extracting functions from the disassembly representation of the file (section 3.2.1). For the second method, we implemented the state-machine approach using the Java programming language. The state-machine implementation scans an input file in order to find predefined patterns of start/end of functions using configurable definitions of the state-machine (see section 3.2.2). In both cases, the functions of the CFL were stored in files and were loaded into memory during the second stage of the evaluation, i.e., generating malware signatures. In the second stage, we applied our automatic signature generation method to a collection of malware files, resulting in a list of signatures (the signature generation process is described in section 3.3). First, the signature builder (developed using the Java programming language) extracts the functions of the malware file using one of the two function extraction approaches (IDA or state-machine). Then, common functions were filtered out using the relevant CFL (created in the first step of the evaluation). Last, offsets were added to each candidate function and the best candidate was chosen based on the entropy heuristic. The third stage of the evaluation focused on statistics computation. In this stage, we measured the number of malware signature occurrences in the benign control group files. A malware signature detected in a benign file means that this benign file would have been falsely stopped by a real-time network traffic-filtering device. We consider this a false-positive case. The
best signature generation setting is expected to have minimal false positives. The evaluation program was designed for generating the desired statistics for different signature-generation scenarios.
[Figure 6 diagram: Step I - CFL creation (function extraction and function matching over legitimate files only); Step II - signature generation (function extraction, common function filtering, and signature generation and selection over the malware corpus); Step III - statistics computation (searching the generated signatures in the control group files, hit counting and reporting).]
Fig. 6. The evaluation process of F-Sign is performed in three steps: first, a CFL is created. Next, signatures are generated for the entire malware corpus. Finally, we measure the false positive rate by counting signatures that are detected in the benign control group files.
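Step III can be sketched as a plain byte-level search of every generated signature in every control-group file; the directory layout and variable names below are hypothetical.

from pathlib import Path

def count_false_positives(signatures, control_dir):
    """Count control-group (benign) files that contain at least one generated
    signature; each such hit corresponds to a false positive."""
    hits = 0
    for path in Path(control_dir).glob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()
        if any(sig in data for sig in signatures):   # substring match on raw bytes
            hits += 1
    return hits

# Example with hypothetical inputs:
# fp_rate = count_false_positives(signature_list, "control_set") / 1500.0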
4.2 Experiment settings
For the evaluation we used the following three executable file collections: a malware set with 849 Win32 executable malware files (154 MB) acquired from the VX Heaven website (http://vx.netlux.org); a CFL file set with 500, 1,000, 2,000, 4,000 and 8,000 benign executable files (the total size of the 8,000 benign files in the CFL file set is 1,675 MB); and a control set with 1,500 benign executable files (331 MB). Note that the files in the control set were not included in the CFL file set, representing new, unknown benign files. Benign files, including executable and DLL (Dynamic Link Library) files, were selected randomly from a large collection of 22,735 files gathered from machines running the Windows XP operating system on our campus. To identify the malicious files and to verify that the benign files did not contain any malicious code, we used the Kaspersky antivirus (http://www.kaspersky.com) and the Windows version of the Unix 'file' command. The distribution of malware types in the malware file set is described in Figure 7.
Fig. 7. Distribution of malware types.
Fig. 8. Evaluation of F-Sign with all possible settings.
The evaluation process included eight possible settings that are illustrated in Figure 8:
- Method for extracting functions from files: disassembly using IDA-Pro (IDA) or state-machine (SM)
- Choosing a whole function (WF) or a whole function with an offset add-on (WFO)
- Choosing the signature randomly (Rand) among all the malware candidates or choosing the candidate function with the highest entropy (Entropy)
Thus, the acronym IDA_WF_Entropy is shorthand for a setting where we are: 1) using IDA-Pro for extracting functions; 2) choosing the whole functions as candidates for signatures; and 3) using the entropy score to choose the signature. The acronym SM_WFO_Rand is shorthand for a setting where we: 1) use the state-machine for extracting functions; 2) choose the whole functions with an offset as candidates for signatures; and 3) choose the signature randomly.
We were also interested in seeing how the CFL size influences the quality of the signatures and whether a CFL can be made sufficiently large that additional files add little.
4.3 Evaluation Results
As the first step in evaluating F-Sign, we compared the two function-extraction methods (i.e., SM and IDA) and the common function filtering capability of the CFLs that were generated using the two methods. Figure 9 shows the percentage of candidate and common functions among all functions extracted from the malware set by the SM and IDA-Pro extraction methods for each CFL size (500, 1,000, 2,000, 4,000 and 8,000 files). It is evident from the diagram that SM is capable of extracting more functions from the malware files: IDA extracted 57,929 functions while SM extracted 249,158. Function size was limited to a minimum of 16 bytes and a maximum of 256 bytes. The figure also shows that SM is capable of trimming a larger portion of functions which appear in the CFL and that the portion of remaining functions becomes smaller as the size of the CFL increases. The first observation, regarding SM's extraction superiority, is consistent with IDA-Pro detecting fewer functions from the training set. This is probably due to the fact that IDA-Pro is more rigid software and cannot deal effectively with code obfuscations, which are a prominent technique employed by hackers [30]. On the other hand, the SM method, by nature, works with high recall (extracting as many functions as possible) and low precision (many of the extracted functions might not really be functions); thus it extracts more functions. The second observation, the filtering capability of the two methods, can be explained straightforwardly by the fact that as the size of the CFL grows, the likelihood increases that a function extracted from the malware set will appear in the CFL.
Fig. 9. Percentage/number of common functions vs. candidate functions extracted from malware set for IDA and State-Machine for several CFL dataset sizes.
For some malware files, it might be the case that all of the extracted functions are identified as common functions and therefore are filtered out by the CFL. In such cases, the method cannot generate a signature for the malware. Figure 10 depicts the percentage of malware that was left without candidates. The figure shows that IDA missed more malware. The reasons are that (1) it extracts fewer functions, and (2) IDA only detects functions that are called from other functions using standard protocols, which may not be the case for malware that wishes to camouflage its existence. As expected, for both methods, increasing the CFL size also increases the amount of missed malware, but the gap between IDA and SM narrows. Figures 11-15 depict the detection rate of candidate signatures and of the signatures selected for the malware set in the control file set. This rate serves as a measure of the false positive detection rate of malware in benign files.
We checked the false positive rate of signature candidates in the control set files. Figure 11 depicts the percentage of signature candidates detected in the control set as a function of the candidate's length in bytes and the CFL size. We expected that the false positive rate would drop for longer signature candidates and for larger CFLs. The length of a signature candidate affects the probability of finding the same byte sequence in an arbitrary file. Indeed, regardless of the function and signature extraction technique (SM/IDA), short candidates caused most of the hits. Consequently, based on the diagram, we recommend using function candidates whose length is above 112 bytes in order to ensure a lower false positive rate. An exception is shown for SM with a CFL of 500 files and SM with a CFL of 1,000 files, where we see a high false positive rate for candidates that are 160-176 bytes long. Using a CFL of 2,000 files or more eliminates the problematic candidates. Additionally, we can see that using a larger CFL contributes considerably to lowering the false positive rate for both function extraction methods.
Fig. 10. Malware without signature candidates - the percentage of malware that was left without candidates (i.e., all extracted functions were filtered by the functions in the CFL) and for which the method thus cannot extract signatures.
Fig. 11. False positive rate of candidates in the control set files as a function of the candidate size in bytes.
Fig. 12. Comparing the false positive rate for the two function extraction methods: state-machine (SM) and IDA-Pro (IDA) as a function of the CFL size.
In Figure 12 we compare the false positive rate of the two function extraction methods, state-machine and IDA-Pro, for different CFL sizes. With both extraction methods, the false positive rate is reduced when a large CFL is used. Compared to IDA, the state-machine method achieves a lower false positive rate when using a CFL larger than 1,000 files. Next, we compare the mean false positive rate (averaged over both the SM and IDA function extraction methods) when using candidates with or without a 16-byte offset (Figure 13) and when choosing a signature randomly or by using the entropy score (Figure 14). As expected, adding a 16-byte offset to the candidate functions and using the entropy score to choose a signature helps in reducing the percentage of signatures detected in the control set files. This was consistent for all tested CFL sizes. The entropy selection method favors large signatures; approximately 80% of the signatures that the entropy-based method selected were larger than 112 bytes, whereas with random selection only about 50% were 112 bytes or longer. This observation complies with the results presented in Figure 11 regarding the recommended size of candidates.
Fig. 13. Comparing the mean false positive rate (averaged over SM and IDA function extraction methods) with/without adding offset bytes to function candidates.
Fig. 14. Comparing the mean false positive rate (averaged over SM and IDA function extraction methods) when using random (Rand) signature selection and entropy-based selection. The entropy-based heuristic performs better than the random selection method for all CFL sizes.
Since the entropy method evidently showed better results than Rand, we continued to investigate the signature-generating methods using only the entropy method. The goal of the next and final experiment was to show how the IDA and SM methods are affected by adding an offset to the signature candidate for different CFL sizes. Figure 15 summarizes the results and compares the detection rate of malware signatures in the control set files for different CFL sizes and methods. It is evident that SM with a 16-byte function offset and entropy scoring is the best method for minimizing false positive rates. It also provides the most significant improvement when the CFL size is increased; the detection rate declines from 2.7% to 0% for a CFL larger than 2,000 files. Note that the detection rate in this context relates to false positives, i.e., the undesirable detection of a malware signature in benign files; thus, a lower detection rate is better. The worst signature generation method is IDA without the 16-byte offset. However, even this method, when combined with a CFL containing 8,000 files, still manages to achieve a low false positive rate of 0.4%.
Fig. 15. Rate of malware signatures detected in the control group (1500 benign files) as a function of the CFL size and method (IDA=IDA-Pro; SM=State Machine; WF=No offset; WFO=with offset).
Finally, we tested the signatures generated by F-Sign for false negatives using a DefensePro intrusion detection appliance. A false negative in the context of F-Sign means that a signature generated for a malware file was not identified in an instance of the same malware (e.g., as a result of a long signature split over multiple packets). Therefore, false negatives depend on the detection engine. The malware detection capability of DefensePro is based on IP packet inspection and does not reconstruct the files. We uploaded the signatures extracted by F-Sign to the DefensePro signature database and configured the device to reset any session in which a packet was identified as containing a malware signature. We transmitted all malware (for which F-Sign successfully generated a signature) through the DefensePro appliance (at the maximum speed at which we could load the link) and executed several tests in which DefensePro successfully removed all malware.
Additionally, we measured the time required by F-Sign to generate a signature. The extraction time of a signature as a function of the file size is presented in Figure 16. The graph shows a linear increase in signature extraction time as a function of the file size.
Fig. 16. Signature extraction time for various file sizes.
5
Discussion and Future Work
This paper proposes, for high-speed malware filtering devices, a new approach for automatically extracting signatures from malware files. We consider the fact that large-scale executables, comprising substantial amounts of code that originate from underlying standard development platforms, are replicated in various instances of benign and malicious programs developed by these platforms. In order to minimize the risk of false positive detection of benign executables as malware, we proposed and evaluated a method to sanitize executables from such replicated chunks of code.
We tested our method in a network-security lab on various configurations in terms of: the method used to analyze the executables (IDA-Pro or SM); signature extraction with or without an offset; and random versus entropy-based signature selection. The results indicated that SM outperforms IDA-Pro and that signatures should be extracted using the entropy-based selection method with an offset (in our case, 16 bytes). The state-machine (SM) is a fast technique for extracting functions from executable files. However, since it is platform-specific, extending a state-machine to new compilers and platforms has to be done manually, which makes it prone to errors. In order to overcome this limitation, we are developing an automatic method for detecting the start and end sections of functions within binary code based on machine learning algorithms that are trained for this task. The empirical findings presented in this paper support the viability of the general approach proposed by this research, namely that common code, identified as the functions in the program, can be discarded. Realizing F-Sign for generating signatures for high-throughput network security appliances requires a more exhaustive and systematic methodology for building CFL repositories. Considering the global variety of development platforms and the mobility of threats facilitated by the Internet, ensuring the external validity of this study relies substantially on reaching a critical mass of CFL files. Furthermore, it often does not suffice for a signature to be available; deployed signatures must be managed, distributed and updated by security administrators [33]. F-Sign is also helpful in tackling "allergy attacks" against ASG [34]. Allergy attacks are defined as the process of inducing ASG systems into generating signatures that match normal traffic, leading to a denial of service. This type of attack is mainly relevant to learning-based ASG systems that use machine-learning algorithms in order to extract arbitrary invariant byte sequences, mostly from the malware's successful behavior [35]. In learning-based methods, an
attacker can cause the automated signature generation method to consider benign byte sequences as malicious. Since F-Sign is a knowledge-based method and since it employs a higher level of abstraction in examining executables, allergy attacks are more difficult to conduct. Since this is only the first stage of the F-Sign research, our evaluation is not aimed at generating signatures for malware such as self-decrypting files, polymorphic malware, archive files (e.g., CAB, MSI, ZIP) or highly optimized binaries. We plan to tackle these types of malware in our future research. Assuming that self-decrypting polymorphic malware should have a constant, unchangeable code section that is used for the decryption, we will evaluate the ability of F-Sign to specifically identify these sections while isolating the encrypted sections, possibly by using entropy measures [36]. In addition, an organizational policy may deny any incoming and outgoing communications that contain encrypted files. Polymorphic malware is difficult to detect and remove [37]. We plan to tackle that type of malware by enhancing F-Sign with regular expression signatures and by maintaining malware function libraries. Archive files are easier to handle, especially in gateways where the files can be extracted and examined for matching signatures. We plan to repeat the evaluation of F-Sign on a larger scale with many more malware files and CFLs generated for different development environments. We also plan to evaluate additional methods for detecting and extracting functions in binary code and additional methods for ranking and selecting the best signature out of the collection of candidates. Another direction we intend to examine is the use of a malware function library (MFL) in the signature generation process in order to further strengthen the signatures and minimize the risk of false positives. In addition, in order to further minimize the risk of false positives, we propose to use "composite signatures", which are generated by using two or more distinct signatures for each malware.
These activities address the biggest challenge facing F-Sign: the need to reduce the number of false positives to zero before it can be deployed for generating signatures in high-speed malware filtering devices.
References
[1] S.B. Cho, "Incorporating Soft Computing Techniques Into a Probabilistic Intrusion Detection System," IEEE Transactions on Systems, Man, and Cybernetics – Part C, Vol. 32(2):154-160, 2002.
[2] K. Rieck, T. Holz, C. Willems, P. Düssel, P. Laskov, "Learning and Classification of Malware Behavior," Proc. of the Conference on Detection of Intrusions and Malware & Vulnerability Assessment, Springer Press, pp. 108-125, 2008.
[3] M. Bailey, J. Oberheide, J. Andersen, Z.M. Mao, F. Jahanian, J. Nazario, "Automated Classification and Analysis of Internet Malware," Proc. of the 12th International Symposium on Recent Advances in Intrusion Detection, Springer Press, pp. 178-197, 2007.
[4] W. Lee, S.J. Stolfo, "A framework for constructing features and models for intrusion detection systems," ACM Transactions on Information and System Security, Vol. 3(4):227-261, 2000.
[5] R. Moskovitch, Y. Elovici, L. Rokach, "Detection of unknown computer worms based on behavioral classification of the host," Computational Statistics and Data Analysis, Vol. 52(9):4544-4566, 2008.
[6] K. Griffin, S. Schneider, X. Hu, T. Chiueh, "Automatic Generation of String Signatures for Malware Detection," Proc. of the 12th International Symposium on Recent Advances in Intrusion Detection, Springer Press, pp. 101-120, 2009.
[7] G. Jacob, H. Debar, E. Filiol, "Behavioral detection of malware: from a survey towards an established taxonomy," Journal in Computer Virology, Vol. 4:251-266, 2008.
[8] D. Gryaznov, "Scanners of the Year 2000: Heuristics," The 5th International Virus Bulletin, 1999.
[9] J. Zhang, M. Zulkernine, A. Haque, "Random-Forests-Based Network Intrusion Detection Systems," IEEE Transactions on Systems, Man, and Cybernetics – Part C, Vol. 38(5):649-659, 2008.
[10] A. Shabtai, D. Potashnik, Y. Fledel, R. Moskovitch, Y. Elovici, " Monitoring, Analysis and Filtering System for Purifying Network Traffic of Known and Unknown Malicious Content," Security and Communication Networks, to appear, 2010. [11] Y. Tang, B. Xiao, and X. Lu, " Using a bioinformatics approach to generate accurate exploit-based signatures for polymorphic worms," Computers and Security, Vol. 28(2009):827-842, 2009. [12] D. Brumley, J. Newsome, D. Song, H. Wang, and S. Jha, “Towards automatic generation of vulnerability-based signatures,” Proc. IEEE Symposium on Security and Privacy, IEEE Press, pp. 2-16, 2006. [13] H.A. Kim, and B. Karp, “Autograph: Toward automated, distributed worm Signature detection,” Proc. Usenix Security Symposium (Security 2004), USENIX Association, pp. 19-35, 2004.
[14] K. Wang and S.J. Stolfo, "Anomalous payload-based network intrusion detection," Proc. of Recent Advances in Intrusion Detection (RAID '04), Springer Press, pp. 203-222, 2004.
[15] H. Wang, S. Jha and V. Ganapathy, "NetSpy: Automatic Generation of Spyware Signatures for NIDS," Proc. of the 22nd Annual Computer Security Applications Conference (ACSAC'06), IEEE Press, pp. 99-108, 2006.
[16] S. Singh, C. Estan, G. Varghese, and S. Savage, "Automated worm fingerprinting," Proc. of the 6th USENIX Symposium on Operating Systems Design and Implementation, USENIX Association, pp. 45-60, December 2004.
[17] V. Yegneswaran, J.T. Giffin, P. Barford, and S. Jha, "An architecture for generating semantics-aware signatures," Proc. of the 14th USENIX Security Symposium, USENIX Association, pp. 97-112, August 2005.
[18] J. Newsome, B. Karp, and D. Song, "Polygraph: Automatically generating signatures for polymorphic worms," Proc. IEEE Symposium on Security and Privacy, IEEE Press, pp. 226-241, 2005.
[19] C. Kreibich and J. Crowcroft, "Honeycomb: creating intrusion detection signatures using honeypots," ACM SIGCOMM Computer Communication Review, Vol. 34(1):51-56, 2004.
[20] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms, MIT Press, 2001.
[21] Y. Tang, S. Chen, "Defending against Internet worms: A signature-based approach," Proc. of IEEE INFOCOM'05, IEEE Press, Vol. 2:1384-1394, 2005.
[22] J.O. Kephart and W.C. Arnold, "Automatic Extraction of Computer Virus Signatures," 4th Virus Bulletin International Conference, 1994.
[23] M. Christodorescu, S. Jha, S. Seshia, D. Song, and R.E. Bryant, "Semantics-aware malware detection," Proc. IEEE Symposium on Security and Privacy, IEEE Press, pp. 226-241, 2005.
[24] E. Milenković, A. Milenković, E. Jovanov, "Hardware Support for Code Integrity in Embedded Processors," Proc. of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, ACM, pp. 55-65, 2005.
[25] Y. Fei, Z.J. Shi, "Microarchitectural Support for Program Code Integrity Monitoring in Application-specific Instruction Set Processors," Proc. of the Conference on Design, Automation and Test in Europe, EDA Consortium, pp. 815-820, 2007.
[26] N. Tuck, B. Calder, G. Varghese, "Hardware and Binary Modification Support for Code Pointer Protection from Buffer Overflow," Proc. of the 37th International Symposium on Microarchitecture, IEEE Press, pp. 209-220, 2004.
[27] Z. Shao, C. Xue, Q. Zhuge, M. Qiu, B. Xiao, E.H.M. Sha, "Security Protection and Checking for Embedded System Integration Against Buffer Overflow Attacks via Hardware/Software," IEEE Transactions on Computers, Vol. 55(4):443-453, 2006.
[28] Z. Shao, J. Cao, K.C.C. Chan, C. Xue, E.H.M. Sha, "Hardware/software optimization for array & pointer boundary checking against buffer overflow attacks," Journal of Parallel and Distributed Computing, Vol. 66:1129-1136, 2006.
[29] DefensePro, Radware, http://www.radware.com/
[30] C. Linn and S. Debray, "Obfuscation of Executable Code to Improve Resistance to Static Disassembly," Proc. of the 10th ACM Conference on Computer and Communications Security, ACM, pp. 290-299, 2003.
[31] E. Filiol, "Malware pattern scanning schemes secure against black-box analysis," Journal in Computer Virology, Vol. 2(1):35-50, 2006.
[32] L.A. Goldberg, P.W. Goldberg, C.A. Phillips, and G. Sorkin, "Constructing Computer Virus Phylogenies," Journal of Algorithms, Vol. 26(1):188-208, 1998.
[33] K. Rieck, P. Laskov, "Language models for detection of unknown attacks in network traffic," Journal in Computer Virology, Vol. 2(4):243-256, 2007.
[34] S.P. Chung, A.K. Mok, "Allergy attack against automatic signature generation," Proc. of Recent Advances in Intrusion Detection, Springer Press, pp. 61-80, 2006.
[35] S. Venkataraman, A. Blum, D. Song, "Limits of learning-based signature generation with adversaries," Proc. of the 15th Annual Network and Distributed System Security Symposium, 2008.
[36] R. Lyda, J. Hamrock, "Using Entropy Analysis to Find Encrypted and Packed Malware," IEEE Security and Privacy, Vol. 5(2):40-45, 2007.
[37] Y. Song, et al., "On the Infeasibility of Modeling Polymorphic Shellcode," Proc. of the 14th ACM Conference on Computer and Communications Security, ACM, pp. 541-551, 2007.
[38] Fortinet, Fortinet Global Threat Research Team, "Understanding and Detecting Malware Threats Based on File Sizes," online: www.fortinet.com/doc/whitepaper/DetectingMalwareThreats.pdf, 2006.
[39] Panda Security, Quarterly Report, online: http://www.pandasecurity.com/img/enc/Quarterly_Report_Pandalabs_Q1_2010.pdf, March 2010.