Software Similarity and Metamorphic Detection

Mausami Mungale
Department of Computer Science
San Jose State University
San Jose, California

Mark Stamp
Department of Computer Science
San Jose State University
San Jose, California

Abstract

In this paper, we consider a novel method for measuring the similarity of software. Our technique can be applied to any executable file and no special effort is required when developing the software. In addition, our similarity score can be computed at any point in time—even after the software has been distributed. Our approach was inspired by the success of previous research focused on detecting metamorphic computer viruses. Here, we train a hidden Markov model (HMM) on an opcode sequence extracted from a specific piece of software (the "base" software). This trained model can then be used to score another piece of software, giving a measure of its similarity to the base software. We provide experimental results that show our scheme is robust in the sense that we can extensively modify the base software and still obtain strong scores from the trained HMM. Interestingly, the work presented here has some implications for the metamorphic detection problem that served as the original motivation. We briefly discuss the connections between these two problems.

1 Introduction

In this paper we consider a novel technique for measuring the similarity of software. The work presented here is not a watermarking scheme per se, but it could be used in a similar way, at least in certain cases.

Our approach was inspired by previous studies of metamorphic computer viruses [1, 7, 17, 18]. Metamorphic viruses change their structure (but not their function) each time they replicate. This makes detection difficult, since there is no constant signature available for virus scanning. In this paper, we use elementary morphing techniques to train a hidden Markov model, and we use more advanced metamorphic techniques to test the robustness of our approach. Here, "robust" means that we can properly classify software even in the presence of significant modifications to the code.

Our technique is potentially useful in several contexts. For example, suppose that a company suspects that its copyrighted software has been illegally copied. Then our scoring technique could be used to compare the original software to the suspected copy. A high score would not prove that the suspect software is a modified copy, but it would indicate that further analysis is warranted. In contrast, a low score would imply that the two pieces of software are substantially different.

The remainder of this report is organized as follows. Background information is covered in Section 2. Section 3 contains an overview of our technique, while Section 4 discusses our implementation in more detail. Experimental results are given in Section 5 and Section 6 concludes the paper.

2 Background

In this section, we briefly discuss background topics that are relevant to the discussion in the remainder of the paper. Specifically, we discuss digital watermarking, metamorphism, and hidden Markov models.

2.1 Watermarking

Watermarking is a technique for embedding some special mark in an object, so that the object can be identified [2]. Software watermarks could be used, for example, to detect software piracy and provide evidence to prosecute those responsible. Software watermarking can be accomplished in various ways, such as embedding some particular sequence of instructions at the assembly code level [16]. This embedded code—which serves as a watermark enabling us to identify the code—could be read by disassembling the executable. Ideally, such a watermark should be robust against attacks involving modification of the code.

Software watermarks can be classified as static or dynamic. For example, embedding a sequence of assembly code instructions (as mentioned in the previous paragraph) is a static technique. A dynamic technique might consist of a particular "secret" value that is output when a special input is provided. A good discussion of software watermarking can be found in [3].

It is worth noting that robust watermarking of digital content has proven difficult to achieve in practice. To illustrate this point, consider the Secure Digital Music Initiative (SDMI) [12]. In September 2000, SDMI issued a public challenge, apparently to show off the supposed strength of four "robust" watermarking technologies [11]. However, in spite of the limited information provided with the challenge, all of the watermarking techniques were soundly defeated [4]. The excellent work presented in [4] should serve as a cautionary tale against strong claims of robust digital watermarking.

Various types of attacks against static software watermarks are possible. Additive attacks involve inserting additional watermark information into already watermarked software; the goal of such an attack is to make the original watermark undetectable. Distortion attacks use semantics-preserving transformations to make a watermark undetectable. Subtractive attacks involve identifying the watermark and removing it without changing the functionality of the program. Although our proposed technique is not a true watermarking scheme, the obvious attacks are similar to those used against static watermarks.


2.2 Metamorphism

Metamorphism is the process of transforming a piece of code into copies that are functionally equivalent, but structurally different [15]. Metamorphism has been used by virus writers in an attempt to defeat signature based anti-virus software. Metamorphic software also has the potential for positive uses, such as increasing the diversity of software. More diverse software can reduce the impact of many implementation level attacks, such as buffer overflows [14].

2.2.1 Assembly Language Basics

The full x86 instruction set is large and complex [5]. A typical instruction consists of an "opcode" followed by one or more operands. Operands can be constant values, pointers to values in memory, or registers. The instructions can broadly be classified as data transfer instructions, arithmetic and logical instructions, and control flow instructions. Table 1 shows some typical instructions of each of these types.

Table 1: Examples of x86 Instructions

Data Transfer Instructions
  MOV        Move byte or word to register or memory
  IN, OUT    Input/output byte or word
  LEA        Load effective address
  PUSH, POP  Push/pop word on/from stack

Arithmetic and Logical Instructions
  NOT        Logical NOT of byte or word
  AND        Logical AND of byte or word
  OR         Logical OR of byte or word
  XOR        Logical XOR of byte or word
  ADD, SUB   Add, subtract byte or word
  INC, DEC   Increment, decrement byte or word
  NEG        Negate byte or word (two's complement)
  MUL, DIV   Multiply, divide byte or word (unsigned)

Control Flow Instructions
  JMP        Unconditional jump
  JE, JNE    Jump if equal / jump if not equal
  LOOP       Loop, count in CX, short jump to target address
  CALL, RET  Call / return from procedure


2.3 Metamorphic Code Techniques

An important component of our system is a metamorphic code generator, which is applied at the assembly code level. There are a large number of semantic preserving transformations that can be applied to assembly code to obtain metamorphic copies [7]. Here, we briefly discuss some elementary metamorphic techniques.

2.3.1 Control Flow Preserving Transformations

In this type of transformation, we insert instructions that, taken as a whole, do not change the data flow or the control flow of the program. We list a few elementary transformations of this type and give an example of each.

1. A NOP is a special instruction that has no effect on the execution state—it is simply a "do nothing" instruction. Therefore, we can insert NOPs between instructions, as shown in Table 2.

Table 2: NOP Example

Original Code    Transformed Code
MOV AL,BL        MOV AL,BL
ADD AL,05H       NOP
                 ADD AL,05H

2. We can use groups of arithmetic or logical instructions, the net effect of which does not change the value of any register. Since arithmetic instructions can change the flags, we may need to save and restore the EFLAGS register when inserting such code. Neglecting effects on the flag bits, Table 3 gives examples of such instruction groups.

Table 3: Arithmetic Example

ADD AX,05H
SUB AX,05H
XOR AX,0H
AND AX,FFFFH

3. We can add a label to any instruction and put a JMP instruction to that label just before the instruction. This does not change the program behavior—see Table 4 for an example.

4. We can PUSH the value of a register onto the stack and POP it immediately to preserve the program semantics. This is illustrated in Table 5.


Table 4: JMP Example

Original Code    Transformed Code
MOV AL,BL        MOV AL,BL
ADD AL,05H       JMP LOC
                 LOC: ADD AL,05H

Table 5: PUSH and POP Example

Original Code    Transformed Code
MOV AL,BL        MOV AL,BL
ADD AL,05H       PUSH AX
                 POP AX
                 ADD AL,05H
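The transformations above can be sketched as a simple rewriting pass. In this Python sketch (not the authors' implementation), code is represented as a list of assembly lines, and the snippet list and random selection policy are illustrative assumptions:

```python
import random

# Semantics-preserving snippets from the examples above: a NOP, a
# PUSH/POP pair, and arithmetic groups with no net effect on AX.
# The JMP/label transformation is omitted here, since it requires
# generating a unique label per insertion.
TRANSFORMS = [
    ["NOP"],
    ["PUSH AX", "POP AX"],
    ["ADD AX,05H", "SUB AX,05H"],
    ["XOR AX,0H"],
]

def morph(lines, count, seed=None):
    """Insert `count` randomly chosen no-effect snippets at random
    positions; the original instructions keep their relative order."""
    rng = random.Random(seed)
    out = list(lines)
    for _ in range(count):
        pos = rng.randrange(len(out) + 1)
        out[pos:pos] = rng.choice(TRANSFORMS)
    return out
```

Because every snippet leaves the registers and stack unchanged, the morphed copy is functionally equivalent to the original while its opcode sequence differs.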

2.3.2 Dead Code Insertion

Perhaps the easiest way to morph a program is to insert code that is never executed. Any combination of instructions can be included within a dead code block. Table 6 gives an example of dead code insertion.

Table 6: Dead Code Example

Original Code    Transformed Code
MOV AL,BL        MOV AL,BL
ADD AL,05H       JMP LOC
                 PUSH AX
                 POP AX
                 ADD AL,BL
                 LOC: ADD AL,05H
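The pattern in Table 6 can be sketched in Python. The `DEADn` label scheme and the list-of-lines representation are illustrative assumptions, not the paper's implementation:

```python
import itertools

# Counter for generating fresh labels; each inserted block needs a
# label that does not collide with any existing label.
_labels = itertools.count()

def insert_dead_block(lines, pos, dead_instructions):
    """Splice a never-executed block before the instruction at index
    `pos`: a JMP skips over the dead instructions to a fresh label
    attached to the original instruction."""
    label = f"DEAD{next(_labels)}"
    block = [f"JMP {label}", *dead_instructions, f"{label}: {lines[pos]}"]
    return lines[:pos] + block + lines[pos + 1:]

code = ["MOV AL,BL", "ADD AL,05H"]
out = insert_dead_block(code, 1, ["PUSH AX", "POP AX", "ADD AL,BL"])
# Reproduces the transformed column of Table 6: the three dead
# instructions sit between "JMP DEAD0" and "DEAD0: ADD AL,05H".
```

Since control flow jumps over the inserted instructions, any instructions at all can appear in the dead block, which is what makes dead code drawn from unrelated "normal" files possible.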

It is fairly easy to make dead code stealthy, in the sense that it is non-trivial to determine whether or not it is actually executed. We do not consider this further here, but it is worth noting that we make no effort to remove dead code in our scoring method discussed below. Removing obvious dead code is not difficult and would serve to make our scoring even stronger. In effect, we are treating the inserted dead code as if it were stealthy, although it is not.


2.3.3 Equivalent Code Substitution

We can transform code by replacing an instruction (or instructions) with an equivalent instruction (or instructions). Examples of equivalent code substitution appear in Table 7.

Table 7: Equivalent Code Example

Original Code    Transformed Code
ADD AL,05H       ADD AL,04H
                 ADD AL,01H
MOV AX,BX        PUSH BX
                 POP AX

2.4 Hidden Markov Models

A Markov chain can be viewed as a statistical model in which there are states and known probabilities for state transitions. In such a Markov model, the states are visible to the observer. In contrast, a hidden Markov model (HMM) has "hidden" states, i.e., the states are not directly observable. Although the states of a hidden Markov model are not visible, there is some observable output, which is probabilistically related to the hidden state [10, 13].

A hidden Markov model consists of state transition probabilities, a probability distribution over all possible output symbols for each state, and initial state probabilities. We use the following notation to describe an HMM:

  T = the length of the observation sequence
  N = the number of hidden states in the model
  M = the number of distinct observation symbols
  X = {x0, x1, ..., x(N-1)} = the states of the Markov process
  O = {O0, O1, ..., O(M-1)} = the set of possible observations
  A = the state transition probabilities
  B = the observation probability matrix
  π = the initial state distribution

Figure 1 illustrates a generic HMM, where X0, X1, ..., X(T-1) are the hidden states and O0, O1, ..., O(T-1) are the observed symbols in each state. Note that the matrices A and B represent the state transition probabilities and observation probabilities, respectively. We represent an HMM compactly as λ = (A, B, π).

HMMs are used in many applications, including speech recognition, sequence alignment, and malware detection. The utility of HMMs derives largely from the fact that there are efficient algorithms to solve each of the following three problems [13]:


Figure 1: Generic Hidden Markov Model

(1) Given the model λ = (A, B, π) and a sequence of observations O, find P(O|λ). That is, we can score an observed sequence O relative to a given model λ.

(2) Given the model λ = (A, B, π) and an observation sequence O, find an optimal state sequence for the underlying Markov process. In other words, we can uncover the most likely hidden state sequence.

(3) Given an observation sequence O and parameters N and M (i.e., the number of hidden states and the number of distinct observation symbols, respectively), find the model λ = (A, B, π) that maximizes the probability of observing O. This can be viewed as training the model to best fit the observed data.

Note that when training an HMM, the only free parameter is N, the number of hidden states. This is the sense in which an HMM is a machine learning technique—the user only has to specify N.

In this paper, we employ the algorithms for problems (1) and (3), above. We use (3) to train a model to match a "base" piece of software. Then we can use (1) to score any piece of code against the model—a high score indicates a high degree of similarity with the base code.
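As an illustration of problem (1), the following is a minimal Python sketch of the scaled forward algorithm, which computes the log likelihood log P(O|λ). This is not the authors' implementation, and the toy matrices are illustrative values, not a trained model:

```python
import math

def log_likelihood(A, B, pi, obs):
    """Scaled forward algorithm: returns log P(O | lambda) for the
    model lambda = (A, B, pi) and observation sequence obs."""
    N = len(pi)
    # Alpha pass, normalizing at each step so the probabilities do not
    # underflow on long opcode sequences.
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    c = sum(alpha)
    log_prob = math.log(c)
    alpha = [a / c for a in alpha]
    for t in range(1, len(obs)):
        alpha = [B[i][obs[t]] * sum(alpha[j] * A[j][i] for j in range(N))
                 for i in range(N)]
        c = sum(alpha)
        log_prob += math.log(c)   # sum of log scale factors = log P(O|lambda)
        alpha = [a / c for a in alpha]
    return log_prob

# Toy two-state, two-symbol model (illustrative values only).
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
score = log_likelihood(A, B, pi, [0, 1, 1, 0])
```

Dividing the result by the length of the observation sequence gives a per-symbol score, which is how the log likelihood per opcode (LLPO) scores used later in the paper are normalized.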

3 Design Overview

The goal of this research is to design a robust method for measuring software similarity at a fairly deep level. In this section we give a brief overview of the approach we have followed.

3.1 System Overview

Our system has two phases. In the first phase, we use a metamorphic generator to create slightly morphed copies of the base software. We extract and append the opcode sequences from these morphed copies. We then use the resulting opcode sequence to train a hidden Markov model. The purpose of using morphed copies of the base software is to avoid having the HMM overfit the training data.

The second phase consists of scoring. In this phase, we extract the opcode sequence from a given piece of software and we score it against the trained HMM obtained in the first phase. A high score indicates that the software in question is closely related to the base software, while a low score shows that the software is "far" from the base code. Below, we show that reasonable scoring thresholds can easily be determined and that the technique described here is highly robust.

3.1.1 Design of Metamorphic Generator

Our metamorphic generator makes slightly morphed copies of a given base program. To generate these morphed copies, we use some of the techniques discussed in Section 2.3. The amount of morphing to be applied is an adjustable parameter, given as a percentage. For example, if we select 20% morphing, then after morphing, the program will have expanded by approximately 20%, due primarily to dead code insertion. Figure 2 illustrates the design of the metamorphic generator.

Figure 2: Metamorphic Generator

4 Implementation

In this section we briefly describe how we have implemented the various components in our system. Additional details can be found in [8].

4.1 Metamorphic Generator

In our metamorphic generator, we use dead code insertion and various control flow preserving transformations (see Section 2.3). We make changes directly to assembly code obtained by disassembling the executable file. In addition, we provide a set of "normal" files from which dead code is extracted. After disassembling the base file, the following steps are performed by the metamorphic generator for each morphed copy produced:

(a) Compute the number of lines of code in the base file.

(b) Compute the number of positions where transformed code will be inserted, based on a specified morphing percentage and the number computed in (a).

(c) Select five random locations where dead code blocks will be inserted into the morphed copy.

(d) Insert the transformed code and the dead code at the selected locations.

It is, of course, possible to generate much more highly metamorphic code—see [7] for a discussion of a highly metamorphic generator. However, here we are not trying to make our code extremely metamorphic. Instead, we simply want the code in each morphed copy to differ enough that we avoid potential problems caused by overfitting when we train the HMM.
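The steps above can be sketched as follows. This Python sketch uses hypothetical names and folds the fixed choice of five dead-code block locations in step (c) into a generic position selection; the real generator operates on IDA Pro disassembly, whereas here a "file" is just a list of assembly lines:

```python
import random

def generate_morphed_copy(base_lines, dead_code_pool, morph_percent, seed=None):
    """Sketch of steps (a)-(d): compute the insertion count from the
    morphing percentage, pick random positions, and splice in code
    snippets drawn from the "normal" file pool."""
    rng = random.Random(seed)
    n_lines = len(base_lines)                            # step (a)
    n_inserts = max(1, n_lines * morph_percent // 100)   # step (b)
    positions = sorted(rng.randrange(n_lines + 1)        # step (c), simplified
                       for _ in range(n_inserts))
    out = list(base_lines)
    # Step (d): insert from the highest position down, so earlier
    # insertion points are not shifted by later splices.
    for pos in reversed(positions):
        out[pos:pos] = rng.choice(dead_code_pool)
    return out
```

With 20% morphing, a 100-line file gains roughly 20 insertion points, matching the paper's observation that the program expands by approximately the morphing percentage.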

4.2 Training an HMM

Given the base executable file, we disassemble it and create a collection of morphed versions using the method discussed above. The opcode sequences are then extracted from these morphed files. For training and testing we used a standard five-fold cross-validation approach [18]. Specifically, we do the following:

(1) We partition our opcode sequences into five subsets, each containing opcode sequences from an equal number of files.

(2) Four of these subsets are selected and all opcode sequences in these subsets are appended to create one long sequence. This long opcode sequence is then used to train an HMM.

(3) To determine threshold values, the sequences in the remaining subset are scored against the model. In addition, a collection of sequences extracted from normal executable files is also scored. All scores are computed as a log likelihood per opcode (LLPO) [18].

(4) Repeat steps (2) and (3), reserving a different subset for testing each time. The results obtained over the five iterations are averaged.

Figure 3 illustrates the training phase.

Once we have trained the HMM and determined a threshold, we can score any executable file. To score a file, we simply extract its opcode sequence and use the trained HMM to compute a score. If the resulting score is higher than our predetermined threshold, we classify the file as being similar to the base file. A score below our threshold implies that the file is more similar to the normal files than to the base program used to generate the HMM. However, we actually obtain more information than a simple "yes/no" answer. In fact, we obtain a score, and the higher the score, the closer the match (in the HMM sense) to the base program.
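The cross-validation split and the threshold test can be sketched as follows. The function names are hypothetical, and the `llpo_score` argument stands in for the HMM score of Section 2.4; a real implementation would train a separate model per fold:

```python
def five_fold_splits(sequences):
    """Partition the opcode sequences into five subsets (step (1)) and
    yield each (train, test) combination (steps (2) and (4))."""
    k = 5
    folds = [sequences[i::k] for i in range(k)]
    for i in range(k):
        # Four folds for training, one held out for threshold setting.
        train = [s for j in range(k) if j != i for s in folds[j]]
        yield train, folds[i]

def classify(llpo_score, threshold):
    """Step (3): a score above the threshold marks the file as similar
    to the base software; a score below marks it as a "normal" file."""
    return "similar" if llpo_score > threshold else "normal"
```

In practice, the threshold is chosen between the score ranges of the held-out morphed files and the normal files, and the averaged results over the five folds give the reported classification rates.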

5 Results

In this section, we discuss experimental results for one typical test case. Additional experimental results can be found in [8]. The test data that we used for this experiment consists of 30 randomly selected executables from Cygwin version 1.5.19. Cygwin files have been used as representative “normal” files in several previous studies of metamorphic generators [1, 7, 17, 18].


Figure 3: HMM Training

The 30 Cygwin utility files were named N0.EXE through N29.EXE. Each file was disassembled using IDA Pro, version 4.6.0. For the disassembled files, we added the prefix "IDA" to the respective file names and changed the suffix to "ASM," that is, the disassembled files are denoted IDAN0.ASM through IDAN29.ASM. We randomly selected another Cygwin executable to serve as the base code that we will test against. We named the disassembled version of this particular file IDAW.ASM. For this experiment, we generated 100 morphed copies of IDAW.ASM, which we denote as IDAW0.ASM to IDAW99.ASM.

As discussed in Section 4.2, the files IDAW0.ASM to IDAW99.ASM were used to train the HMM using five-fold cross-validation. The files IDAN0.ASM through IDAN29.ASM serve as our normal files, that is, they form the test set that will be used in setting the threshold.

In general, we would expect Cygwin utility files to be more alike than randomly selected programs. Consequently, it is inherently more difficult for an HMM—or any other statistical discrimination technique—to distinguish Cygwin files from each other, as compared to two randomly selected programs. Therefore, by selecting our base program and our normal programs from the collection of Cygwin utility files, we have created a relatively challenging test for our technique.

To evaluate the ability of our scheme to detect software after it has been "attacked" (i.e., modified), we developed a tampering scheme to modify the base file. The tampering is done in a way that makes the tampered file look more like the normal files since, from the attacker's point of view, the goal is to make the HMM unable to properly classify the tampered file. The tampering scheme used here is essentially the same as that used in [18], where the goal was to evade HMM-based detection of metamorphic malware.
In [7] it was shown that by selecting a relatively small amount of dead code from normal (i.e., non-virus) files, the virus files could evade detection by the HMM technique developed in [18]. It was somewhat surprising that a small amount of dead code would suffice, given that the detection technique in [18] had proven effective against all other metamorphic generators tested, including the strongest hacker-produced generators. We selected the metamorphic generator in [7] since we believe it is the most challenging generator—from the perspective of detection using HMMs—developed to date. This generator should pose a significant challenge for our similarity technique.

To tamper with the base program, we copied instructions from normal files and inserted them as dead code into the base file. We denote these tampered files as IDAT0.ASM, IDAT1.ASM, and so on. Again, each tampered file is a variant of the base file. As more code from the normal files is inserted into the tampered files, the tampered files become closer (in terms of opcode statistics) to the normal files. From the perspective of the HMM, the tampered files should look more similar to the normal files. In addition, the higher the percentage of "tampered" code, the closer the HMM scores should be to the normal score range. Our goal is to experimentally determine a rough estimate of the amount of tampering that can be tolerated before we are unable to correctly classify the tampered files as matching the base file.

For our experiment, we used 10% morphing, which implies that each training file is a slightly morphed version of the base file. We followed the HMM training and scoring procedure discussed above. For each test, we scored 20 morphed files and 30 normal files to obtain a threshold. The tampering included dead code insertion and equivalent instruction substitution. Note that the dead code was extracted from the normal files, namely, other Cygwin utility files. Figure 4 shows the misclassification rates for dead code percentages ranging from 10% to 90%.
Note that we correctly classify all tampered versions of the base file up to 50% tampering, while at higher tampering rates the classification accuracy drops precipitously. At tampering rates of 70% or greater, we cannot correctly classify any of the tampered files. This shows that our technique is extremely robust, since failure only occurs at very high tampering rates. Finally, Figure 5 shows the growth in the size of the tampered files, in terms of the number of lines of code.

5.1 Discussion

Figure 4: Detection Results

In [7], a metamorphic generator was constructed that produced malware variants containing dead code selected from normal files. A similar process was followed in the experiment discussed in this section. However, in [7], only a relatively small amount of dead code insertion was needed to break the HMM detector. That is, a small amount of dead code was sufficient to cause the HMM to fail to properly classify the metamorphic viruses. In contrast, a very high degree of metamorphism is required before the similarity test discussed in this paper fails to properly classify the tampered software.

While this may seem contradictory, in fact, it is not. In the case of metamorphic detection, the HMM is trained on the morphed files, since only the morphed virus files are assumed to be available. This is the most realistic scenario in the case of malware detection. In contrast, for the similarity problem discussed here, we have access to the base file, so we can train our model on a "pure" set of training data, which is independent of the degree of tampering employed by the attacker. For metamorphic malware detection, the model is, in effect, trained on a tainted data set, which implies that the model itself is degraded at higher levels of morphing. When viewed in this light, the large difference between the morphing rates tolerated by these two related techniques is not totally unexpected.

The results in this paper have some interesting implications for the virus detection problem. First, note that if the metamorphic virus generator is available, then it should be possible to train the model on data that does not include dead code from normal files. This should give us results comparable to those presented here, that is, we should then be able to tolerate a much higher degree of "tampering" than when we are forced to train on the actual virus files, as in [18].

The results in this paper also have implications for the virus detection problem even in the case where the generator is not available. Depending on the sophistication of the morphing techniques used, it may be possible to strip some or all of the dead code from the viruses before training [9]. If so, the resulting model would be considerably stronger. In fact, the results here and in [7] indicate that removing dead code from the training data is critical. In contrast, a relatively high degree of tampering can be tolerated in the files that are scored. This is significant in the case of virus detection, since training the model is one-time work, so we can afford to spend considerable effort to obtain pristine training data. In contrast, the scanning phase must be fast. Fortunately, the results here indicate that we can tolerate high levels of tampering at the detection phase (provided the underlying model is strong), which would make scanning much more robust and practical.


Figure 5: Size of Tampered and Base Files

6 Conclusion

In this paper, we have analyzed a software similarity measure inspired by the success of previous research using hidden Markov models to detect metamorphic computer viruses. Experimental results show that this similarity score is robust, in the sense that it can withstand a very high degree of tampering before the classification fails. While a high score using our method would not prove that the software in question is a tampered version of the original base file, it would certainly indicate that further investigation is warranted. Conversely, a low score would clearly indicate that the software is “far” away from the base program, in which case further investigation is unlikely to yield incriminating results. Since our approach can be used after the fact, only executable code is required, and the computational cost is minimal, it would be reasonable to compute this similarity score before proceeding to a more detailed—and costly—analysis.

References

[1] S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models and metamorphic virus detection, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151–169
[2] R. Chandramouli, N. Memon, and M. Rabbani, Digital watermarking, Encyclopedia of Imaging Science and Technology, 2002
[3] C. Collberg and C. Thomborson, Software watermarking: Models and dynamic embeddings, 1999
[4] S. A. Craver, et al., Reading between the lines: Lessons from the SDMI challenge, 10th USENIX Security Symposium, 2001
[5] Intel, IA-32 architectures software developer's manuals, http://www.intel.com/products/processor/manuals/index.htm
[6] K. R. Irvine, Assembly Language for x86 Processors, 6th edition, Prentice-Hall, 2010
[7] D. Lin and M. Stamp, Hunting for undetectable metamorphic viruses, to appear in Journal in Computer Virology
[8] M. Mungale, Robust watermarking using hidden Markov models, Master's Thesis, Department of Computer Science, San Jose State University, Spring 2011
[9] S. Priyadarshi, Metamorphic detection via emulation, Master's Thesis, Department of Computer Science, San Jose State University, Spring 2011
[10] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, Vol. 77, No. 2, pp. 257–286, 1989
[11] SDMI public challenge, September 2000, http://www.hacksdmi.org
[12] Secure Digital Music Initiative, http://www.sdmi.org
[13] M. Stamp, A revealing introduction to hidden Markov models, 2004, http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf
[14] M. Stamp, Risks of monoculture, Inside Risks, Communications of the ACM, 2004
[15] M. Stamp, Information Security: Principles and Practice, 2nd edition, Wiley, 2011
[16] S. Thaker, Software watermarking via assembly code transformations, Master's Thesis, San Jose State University, 2004
[17] S. Venkatachalam and M. Stamp, Detecting undetectable metamorphic viruses, to appear in Proceedings of SAM '11
[18] W. Wong and M. Stamp, Hunting for metamorphic engines, Journal in Computer Virology, Vol. 2, No. 3, December 2006, pp. 211–229
