Software Plagiarism Detection: A Graph-based Approach

Dong-Kyu Chae, Jiwoon Ha, Sang-Wook Kim†
Dept. of Computer and Software, Hanyang University, Korea
[email protected]
[email protected]
[email protected]

BooJoong Kang
Dept. of Electronics and Computer Engineering, Hanyang University, Korea
[email protected]

Eul Gyu Im
Dept. of Computer and Software, Hanyang University, Korea
[email protected]
ABSTRACT
As plagiarism of software increases rapidly, there is a growing need for software plagiarism detection systems. In this paper, we propose a software plagiarism detection system using an API-labeled control flow graph (A-CFG) that abstracts the functionalities of a program. The A-CFG can reflect both the sequence and the frequency of APIs, while previous work rarely considers both of them together. To perform a scalable comparison of a pair of A-CFGs, we use random walk with restart (RWR), which computes an importance score for each node in a graph. With RWR, we can generate a single score vector for an A-CFG and compare A-CFGs by comparing their score vectors. Extensive evaluations on a set of Windows applications demonstrate the effectiveness and the scalability of our proposed system compared with existing methods.
A software plagiarism detection system typically includes two functions: extracting representative features from two programs and computing a similarity using the extracted features. Feature extraction methods can be divided into two categories: static analysis and dynamic analysis. Static analysis methods extract features without executing a program, while dynamic analysis methods focus on the runtime behavior of the program [2]. Dynamic analysis methods may extract different features under different execution environments. Moreover, the features extracted by dynamic analysis may reflect only a small part of the program, since a single execution path may cover only a small part of it [2-4]. In contrast, static analysis methods allow the extracted features to inherit the overall characteristics of a program. One problem with static analysis is that encrypted or compressed programs cannot be analyzed. However, unlike malicious programs, commercial programs are usually not encrypted or compressed. Therefore, our proposed system uses a static analysis method to extract features from a program and computes similarities using them.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – Data Mining
General Terms Security
Keywords Software Plagiarism; Binary Analysis; Graph; Similarity
1. INTRODUCTION
Existing plagiarism detection methods can be categorized into three types based on the form of the extracted features: set-based approaches, frequency-based approaches, and sequence-based approaches. Set-based approaches use, for example, the set of API calls used in a program [2]; such features ignore both the sequence and the frequency of APIs. Frequency-based approaches count the call frequency of each API and generate a frequency vector [3]; these approaches cannot reflect the sequence of APIs. Sequence-based approaches use the full sequence of API calls [4] or instructions [5] from execution traces of a program. However, these approaches may suffer from high time complexity in the similarity computation phase because the number of execution traces may increase exponentially with the number of branches in a program. For this reason, they are applicable only to small programs [5].
Software plagiarism is the act of developing software using someone else's source code or open source code without a license and disguising it as original software [1]. As software plagiarism increases rapidly, it has caused serious economic losses in the software industry. According to the Business Software Alliance (BSA) report, the financial damage due to software plagiarism in 2010 was about 95 million dollars in the USA and about 77 million dollars in China [1]. To mitigate such economic losses, software plagiarism detection systems are in great need. In particular, detecting plagiarism by comparing two executable files (hereafter, programs) without their source code has recently been studied, since the source code of a suspicious program is typically unavailable.
† Corresponding author
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CIKM'13, Oct. 27–Nov. 1, 2013, San Francisco, CA, USA. Copyright 2013 ACM 978-1-4503-2263-8/13/10. http://dx.doi.org/10.1145/2505515.2507848

In this paper, we propose a software plagiarism detection system that uses both the sequence and the frequency of APIs and computes a similarity from these two features at reasonable cost. To achieve these goals, we first use API-labeled control flow graphs (A-CFGs). An A-CFG is a graph representation of a program whose vertices are APIs and whose edges are call sequences between APIs. Since the degree of a node indicates the call frequency of the API and each edge indicates the sequence between APIs, an A-CFG captures both the sequence and the call frequency of APIs, while previous work rarely considers both of them together. The A-CFG not only inherits unique characteristics of a program but is also robust to plagiarism, because it is difficult for a plagiarist to manipulate the call frequency or the sequence of APIs, or to replace APIs with something else, while maintaining the program's original semantics [2-4].
The target procedure's CFG is then placed between the two split blocks. In process (2), basic blocks that contain no API-call instruction are labeled "empty", and blocks that call two or more APIs are split recursively until each block contains a single API-call instruction.
Secondly, we employ random walk with restart (RWR) [6] to generate a single score vector for an A-CFG. Since most graph comparison algorithms, such as graph isomorphism or maximum common subgraph, are known to be NP-complete, and the size of an A-CFG is huge, it is infeasible to compare two A-CFGs directly in practice [10]. Instead, we generate score vectors representing the topologies of the A-CFGs and compare the two vectors using cosine similarity [7]. Since the scores computed by RWR reflect the global structure of a graph [6], the RWR score vectors can be used as representative features of A-CFGs. As a result, it is possible to measure the similarity between two massive graphs in a short time using these vectors. Based on the similarity between two A-CFGs, our proposed system finally determines plagiarism.
Figure 1. CFG Examples.
In our proposed system, we use invocations of application programming interfaces (APIs) as unique and robust features of a program. API invocations are a common way for a program to request services provided by operating systems. Because resources and services used by a program are highly related to the program’s main functionalities, APIs called in the program to access the resources and services are also highly related to the main functionalities of the program. Moreover, it is difficult to replace API invocations with other instructions, while preserving the program’s original semantics.
2. FEATURE EXTRACTION
Figure 2. Example of A-CFG.

Figure 1 shows simple examples of the CFGs of a program consisting of three procedures. Figure 2 shows the A-CFG of the program. Based on process (1), basic blocks b and c are split into two blocks, and the CFGs of the called procedures are placed between the split blocks, respectively. After all CFGs are combined, each block is labeled with the name of the API it calls, as in the case of block a, or labeled "empty" if the block does not have any API calls.
In this paper, we define a novel feature of a program, named the A-CFG, which reflects both the sequence and the call frequency of APIs. The A-CFG is defined as follows:

Definition 1 (A-CFG: API-labeled control flow graph). The API-labeled control flow graph of a program p is a 2-tuple graph A-CFG = (N, E), where
Since an A-CFG usually has tens of thousands or even millions of nodes, most graph comparison algorithms, such as graph isomorphism or maximum common subgraph, are infeasible in practice [10]. To deal with this scalability problem, we exploit random walk with restart (RWR) [6] to extract an n-dimensional vector from each A-CFG and compute the similarity between the two vectors. The vector of an A-CFG, named the RWR score vector, is defined as follows:
• N is a set of nodes, where a node n ∈ N corresponds to an API called in p.
• E ⊆ N × N is a set of edges, where an edge (n1, n2) ∈ E corresponds to a possible sequence between nodes n1 and n2.

To capture all the possible sequences of API calls, we use control flow graphs (CFGs). A CFG is a graph representation of a procedure in which each node represents a basic block. A basic block is a maximal sequence of instructions without a change of control flow. Directed edges are usually used to represent the control flows of a program [11].
Definition 2 (RWR score vector). The RWR score vector of an A-CFG is an n-dimensional vector, where
• n is the number of APIs defined by MSDN [8].
• An RWR score is the score of each node calculated by the RWR method.
• The value of each dimension is the summation of the RWR scores that the corresponding API receives in the A-CFG.
We build the A-CFG through the following processes: (1) bringing all CFGs together into one graph based on the program's inter-procedure call relationships, and then (2) labeling basic blocks with the API calls in each block. In process (1), a basic block that contains a procedure-call instruction is split into two blocks at the position of the instruction, and the target procedure's CFG is placed between them.
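The labeling step of process (2) can be sketched in a few lines of Python. This is a much-simplified illustration, not the authors' implementation: it splits each basic block so that every resulting node carries at most one API call and labels call-free blocks "empty", while the inter-procedure inlining of process (1) is elided. The instruction encoding (`("call", api)` / `("op", mnemonic)`) and block names are hypothetical.

```python
# Sketch of A-CFG labeling (process (2)): split each basic block so that
# every node carries at most one API call, then label nodes with the API
# name or "empty". Inter-procedure inlining (process (1)) is omitted.
def split_and_label(cfg):
    """cfg: {block_id: (instructions, successor_ids)}, where an instruction
    is either ("call", api_name) or ("op", mnemonic).
    Returns (labels, edges): node labels and directed edges of the A-CFG."""
    labels, edges, bounds = {}, [], {}
    for bid, (instrs, _succs) in cfg.items():
        apis = [arg for kind, arg in instrs if kind == "call"]
        chain = apis if apis else ["empty"]      # call-free block -> one "empty" node
        node_ids = [f"{bid}.{i}" for i in range(len(chain))]
        for nid, lab in zip(node_ids, chain):
            labels[nid] = lab
        edges += list(zip(node_ids, node_ids[1:]))  # chain of split sub-blocks
        bounds[bid] = (node_ids[0], node_ids[-1])   # (entry node, exit node)
    for bid, (_instrs, succs) in cfg.items():
        for s in succs:                             # exit of block -> entry of successor
            edges.append((bounds[bid][1], bounds[s][0]))
    return labels, edges

cfg = {
    "a": ([("op", "mov"), ("call", "GetDC"), ("call", "BitBlt")], ["b"]),
    "b": ([("op", "add")], []),
}
labels, edges = split_and_label(cfg)
print(labels)  # block a yields two API-labeled nodes, block b one "empty" node
print(edges)
```

Block a, which calls two APIs, is split into a GetDC node followed by a BitBlt node, matching the "one API call per block" invariant described above.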
If one API appears in multiple places in an A-CFG, we aggregate all the RWR scores that the API receives.
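The aggregation in Definition 2 can be sketched as follows. The tiny API universe standing in for the MSDN API list is hypothetical; the point is only that scores of all nodes sharing one API label are summed into one dimension.

```python
# Sketch of Definition 2: collapse per-node RWR scores into one vector whose
# dimensions are APIs, summing the scores of all nodes labeled with the same
# API. The three-API universe here is a stand-in for the full MSDN API list.
API_INDEX = {"GetDC": 0, "BitBlt": 1, "OpenFile": 2}

def rwr_score_vector(node_labels, node_scores):
    """node_labels: node -> API name; node_scores: node -> RWR score."""
    vec = [0.0] * len(API_INDEX)
    for node, api in node_labels.items():
        if api in API_INDEX:               # "empty" nodes contribute nothing
            vec[API_INDEX[api]] += node_scores[node]
    return vec

node_labels = {"n1": "GetDC", "n2": "GetDC", "n3": "empty", "n4": "BitBlt"}
node_scores = {"n1": 0.3, "n2": 0.2, "n3": 0.4, "n4": 0.1}
vec = rwr_score_vector(node_labels, node_scores)
print(vec)  # GetDC appears twice, so its dimension gets 0.3 + 0.2
```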
4.1 Experimental Setup
We evaluate our proposed system on the 56 benchmark programs shown in Table 1. There are 28 different programs, and each program has two different versions. Because it is difficult to obtain plagiarized samples, we regard the more recent version of a program as a "plagiarized" sample of the program: since a program update mostly consists of modifications of the original program, it can also be viewed as plagiarism [2-3]. SC in Table 1 denotes whether the source code of the corresponding program is available.
RWR is widely used to calculate each node’s importance in a graph [13, 14, 15]. By Equation 1, RWR calculates the probabilities that a random walker reaches each node at step t+1.
R(t+1) = (1 − a) · A · R(t) + a · w        (1)
In Equation (1), A is an adjacency matrix that represents the relationships between APIs. R(t+1) and R(t) are vectors that indicate the probabilities that the random walker reaches each node at steps t+1 and t, respectively. Each element of the initial vector R(0) is assigned the same value, 1/n, where n is the number of nodes. The restart vector w represents the probability that the random walker jumps to each node instead of traversing the edges of the graph, and a is a weight that determines how often the random walker jumps to other nodes rather than following the edges. Generally, RWR iterates until the vector R(t) converges. In the A-CFG, the elements of R(t) indicate the importance of the corresponding APIs in a program.
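The iteration of Equation (1) can be sketched as a plain power iteration. This is an illustrative sketch, not the paper's implementation; it assumes A is column-stochastic (each column sums to 1) so that R remains a probability distribution, and the 3-node cycle graph is toy data.

```python
# Power-iteration sketch of Equation (1): R(t+1) = (1 - a)·A·R(t) + a·w.
# Assumes A is column-stochastic so that R stays a probability distribution.
def rwr(A, w, a=0.15, tol=1e-10, max_iter=1000):
    n = len(w)
    R = [1.0 / n] * n                            # R(0): uniform start
    for _ in range(max_iter):
        nxt = [(1 - a) * sum(A[i][j] * R[j] for j in range(n)) + a * w[i]
               for i in range(n)]
        if max(abs(x - y) for x, y in zip(nxt, R)) < tol:  # converged
            return nxt
        R = nxt
    return R

# Toy 3-node cycle: from node j the walker moves to node (j + 1) % 3.
A = [[0, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]
w = [1/3, 1/3, 1/3]                              # uniform restart vector
R = rwr(A, w)
print(R)  # the cycle is symmetric, so all nodes end up equally important
```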
We implemented our proposed system using the A-CFG, the set-based system proposed by Choi et al. [2], and the frequency-based system proposed by Chae et al. [3]. As another baseline, we implemented a system that uses an API set as the feature and the Jaccard coefficient as the similarity measure. We also implemented the sequence-based system proposed by Lim et al. [5], but excluded it from the evaluations because of its extremely high computational cost: the number of sequences extracted from a program is exponential in the number of branches in the program. For example, 83,104 sequences are extracted from Foobar2000 and 58,492 from WinSCP, and the feature extraction and similarity computation take about 5.4 hours, while our proposed system takes only 35 seconds.
Additionally, our proposed system modifies the restart vector in Equation (1) to avoid unreliable scores caused by common APIs. Since common APIs perform essential tasks such as exception handling or memory management, they are used not only in most programs but also frequently within a program. If these APIs received high scores, it would be difficult to differentiate programs. To reduce the effect of common APIs, the restart vector is modified as in Equation (2). Let PF (program frequency) be the number of programs using an API, and CF (call frequency) be the number of calls of the API in each program. Let w(API) be the value of the element corresponding to the API in the vector w, and let PF(API) and CF(API) be the PF and CF of the corresponding API, respectively. We then give a value to each element of the vector w as follows:

w(API) = (CF(API) / PF(API)) / Σ_API′ (CF(API′) / PF(API′))        (2)
Table 1. Benchmark programs

Program        Version           SC    Program        Version           SC
AkelPad        4.7.6 / 4.7.7     Y     UltraEdit      17.10 / 18.10     N
Notepad++      6.1.4 / 6.1.5     Y     NateOn         4.3.0 / 4.3.1     N
Pidgin         2.10.5 / 2.10.6   Y     BuddyBuddy     7.1.5 / 7.1.1     N
Psi            0.15 / 0.14       Y     RestoShare     0.5.3 / 0.5.4     Y
BadakEncoder   3.0.00 / 3.0.11   N     UmileEncoder   3.1.2 / 3.1.3     N
ACDSeePro      4.0.2 / 5.3.1     N     XNview         0.50.0 / 0.51.0   N
CuteFTP        8.3.2 / 8.3.3     Y     Ncftp          8.3.20 / 8.3.35   Y
FileZilla      3.5.2 / 3.5.3     Y     WinSCP         4.3.8 / 4.3.9     Y
LongPlayer     0.9.9 / 1.0.1     Y     Winamp         5.6.0 / 5.6.3     N
MixPlayer      1.10.0 / 1.10.1   Y     PotPlayer      1.50 / 1.51       N
CoolPlayer     2.1.8 / 2.1.9     Y     Foobar2000     1.1.11 / 1.1.14   N
7zip           9.19 / 9.20       Y     Bandizip       2.7.0 / 3.0.0     N
BackZip        5.0.2 / 5.0.3     Y     ALZip          11.6.3 / 12.6.5   N
Putty          0.60 / 0.62       Y     SecureCRT      6.7.5 / 7.0.1     N
This modification reduces the RWR scores of common APIs and increases those of APIs uniquely called in a specific program. It helps capture the unique characteristics of each program accurately.
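The modified restart vector can be sketched as below. The CF/PF weighting, renormalized to a probability distribution, is one plausible reading of Equation (2), and the API names and counts are toy data.

```python
# Sketch of the modified restart vector: down-weight common APIs (high PF)
# and up-weight APIs called often within this program (high CF), then
# normalize so the weights form a probability distribution.
def restart_vector(cf, pf):
    """cf: API -> call frequency in this program;
    pf: API -> number of benchmark programs that use the API."""
    raw = {api: cf[api] / pf[api] for api in cf}
    total = sum(raw.values())
    return {api: v / total for api, v in raw.items()}

cf = {"GetDC": 10, "HeapAlloc": 50, "ArcTo": 5}
pf = {"GetDC": 20, "HeapAlloc": 100, "ArcTo": 2}  # HeapAlloc common, ArcTo rare
w = restart_vector(cf, pf)
print(w)  # ArcTo gets the largest weight despite the smallest call count
```

Even though HeapAlloc is called ten times more often than ArcTo, its ubiquity across programs (high PF) pushes its restart weight down, which is the intended effect.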
3. SIMILARITY COMPUTATION
As the similarity measure between two RWR score vectors, we use the cosine similarity, which is widely used to calculate the similarity between two vectors [8]. The similarity between two vectors ranges between 0 and 1. Following [3] and [7], the proposed system classifies two programs as follows:

SIM(Vp, Vq) ≥ θ: p and q are classified as plagiarized
SIM(Vp, Vq) < θ: p and q are classified as independent        (3)
To evaluate the systems, we used three measures: area under the F-measure curve (AUC), average similarity (AS), and correlation with source code similarity (CSS). Each measure is described in detail in the following subsections.
4.2 AUC Test
In this experiment, we evaluate the proposed system in terms of AUC. AUC is widely used to represent how a system performs over the entire space of thresholds [5]. A higher AUC indicates that the system provides high accuracy insensitively to the threshold θ. We draw an F-measure curve for each system and calculate its AUC. To draw the F-measure curve, each system computes the similarities between all pairs of the benchmark programs shown in Table 1 and detects "plagiarized" samples while varying the threshold θ between 0 and 1.
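The threshold sweep described above can be sketched as follows. The similarity scores and ground-truth labels are toy data, and the trapezoidal rule is one simple way to integrate the resulting curve; the paper does not specify its integration method.

```python
# Sketch of the AUC test: sweep the threshold over [0, 1], compute the
# F-measure of the plagiarism decisions at each threshold, and integrate
# the curve with the trapezoid rule.
def f_measure(sims, labels, theta):
    tp = sum(1 for s, l in zip(sims, labels) if s >= theta and l)
    fp = sum(1 for s, l in zip(sims, labels) if s >= theta and not l)
    fn = sum(1 for s, l in zip(sims, labels) if s < theta and l)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def auc_of_f_curve(sims, labels, steps=100):
    thetas = [i / steps for i in range(steps + 1)]
    fs = [f_measure(sims, labels, t) for t in thetas]
    return sum((fs[i] + fs[i + 1]) / 2 for i in range(steps)) / steps

sims = [0.95, 0.90, 0.40, 0.30]      # pairwise similarities (toy data)
labels = [True, True, False, False]  # True = the pair is actually plagiarized
auc = auc_of_f_curve(sims, labels)
print(auc)
```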
θ in Equation (3) denotes the plagiarism threshold. When the similarity between two programs is in the range [θ, 1.0], our proposed system classifies the two programs as "plagiarized"; otherwise, it classifies them as "independent". We analyze the accuracy of our proposed system according to the threshold θ in Section 4.
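The decision rule of Equation (3) fits in a few lines. A minimal sketch, with toy score vectors standing in for real RWR score vectors:

```python
# Sketch of Equation (3): cosine similarity of two RWR score vectors,
# compared against the plagiarism threshold θ.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(vp, vq, theta):
    return "plagiarized" if cosine(vp, vq) >= theta else "independent"

vp = [0.5, 0.1, 0.0]                   # toy RWR score vectors
vq = [0.4, 0.2, 0.0]
print(round(cosine(vp, vq), 3))        # 0.965
print(classify(vp, vq, theta=0.8))     # "plagiarized"
```

Since RWR score vectors are non-negative, the cosine similarity indeed stays in [0, 1] as stated above.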
4. EXPERIMENTAL RESULTS
Figure 3 shows the F-measure curves and Table 2 shows the AUC values of each system. Each system reaches its own maximum F-measure value at a specific threshold θ. However, the interval of thresholds over which our proposed system provides high accuracy is wider than those of all the other systems.
This section presents our experimental results on a set of Windows application programs, comparing the A-CFG against state-of-the-art systems for software plagiarism detection.
Table 2 shows the CSS results of the different systems. All the systems show a high CSS, but our proposed system provides the highest. This indicates that our proposed system provides the most trustworthy similarity between programs.
Table 2 also shows that our proposed system outperforms all the other systems in terms of AUC.
Table 2. AUC and CSS results

        Set-based    Freq-based    Jaccard    Proposed method
AUC     0.672        0.453         0.644      0.796
CSS     0.875        0.879         0.869      0.885
5. CONCLUSIONS We have proposed a software plagiarism detection system using an API-labeled control flow graph (A-CFG). The A-CFG can represent both the sequences and the frequencies of APIs, which are hardly changed by semantic-preserving transformation attacks. We also have performed a scalable comparison between A-CFGs by representing each A-CFG as a single score vector through RWR. The experimental results show that our proposed system outperforms existing methods in terms of both accuracy and credibility in a reasonable computation time.
Figure 3. F-measure curves.
4.3 AS Test
AS evaluates how reasonable the similarities produced by a system are. In this experiment, we measure two kinds of AS: the AS over all pairs of the same program with different versions (ASP) and the AS over all pairs of different programs (ASD). A higher ASP indicates a better ability to detect plagiarism; similarly, a lower ASD indicates a better ability to distinguish between different programs.
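The two averages can be sketched directly; the similarity values below are toy data, not the paper's measurements.

```python
# Sketch of the AS test: average similarity over same-program version pairs
# (ASP) and over different-program pairs (ASD).
def average(xs):
    return sum(xs) / len(xs)

same_program_pairs = [0.97, 0.93, 0.95]       # e.g. v1.0 vs v1.1 of one program
different_program_pairs = [0.20, 0.35, 0.10]  # unrelated programs

asp = average(same_program_pairs)
asd = average(different_program_pairs)
print(asp, asd)  # a good detector keeps ASP high and ASD low
```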
6. ACKNOWLEDGMENTS This research was supported by (1) Ministry of Culture, Sports and Tourism (MCST) and from Korea Copyright Commission in 2013, (2) Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2012R1A1A2007817), and (3) MSIP (the Ministry of Science, ICT and Future Planning), Korea, under the IT-CRSP (IT Convergence Research Support Program) (NIPA-2013-H0401-13-1001) supervised by the NIPA (National IT Industry Promotion Agency).
Figure 4(a) shows the results of ASP and Figure 4(b) shows those of ASD. The results in Figure 4(a) show that all the systems provide similar ASPs: Jaccard provides the highest ASP, followed by our proposed system. However, the results in Figure 4(b) show that Jaccard may cause more false alarms because of its high ASD. Both the set-based and frequency-based systems provide lower ASPs and higher ASDs than our proposed system. Our proposed system provides a much lower ASD than the other systems while providing a sufficiently high ASP. This implies that our proposed system distinguishes different programs most credibly while remaining highly competitive at detecting plagiarism.
7. REFERENCES
Figure 4. AS results: (a) ASP; (b) ASD.
4.4 CSS Test
In this experiment, we evaluate the trustworthiness of the systems by measuring the correlation between the program similarity and the source code similarity. A higher CSS indicates that the similarity produced by the system is more trustworthy. To calculate CSS, we compute the similarity between the source codes of the open source programs shown in Table 1. In this step, we employ MOSS [9], which is widely used for comparing source codes, and we measure the CSS with the Pearson correlation coefficient [7].
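The CSS computation reduces to a Pearson correlation between two lists of similarities. A minimal sketch on toy data (the real inputs would be our system's similarities and the MOSS source-code similarities for the same program pairs):

```python
# Sketch of the CSS test: Pearson correlation between the similarities a
# system reports and the MOSS source-code similarities.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

system_sims = [0.95, 0.40, 0.85, 0.20]  # toy program similarities
moss_sims   = [0.90, 0.35, 0.80, 0.25]  # toy MOSS source-code similarities
css = pearson(system_sims, moss_sims)
print(round(css, 3))  # close to 1: the two rankings agree strongly
```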
[1] Business Software Alliance, BSA Global Software Piracy Study, http://globalstudy.bsa.org/2010, 2010.
[2] S. Choi, H. Park, H. Lim, and T. Han, "A Static API Birthmark for Windows Binary Executables," Journal of Systems and Software, 82(5): 862-873, 2009.
[3] D. Chae, S. Kim, J. Ha, S. Lee, and G. Woo, "Software Plagiarism Detection via the Static API Call Frequency Birthmark," ACM SAC, pp. 1639-1643, 2013.
[4] H. Park, S. Choi, H. Lim, and T. Han, "Detecting Java Theft based on Static API Trace Birthmark," Advances in Information and Computer Security, 5312: 121-135, 2008.
[5] H. Lim, H. Park, S. Choi, and T. Han, "A Method for Detecting the Theft of Java Programs through Analysis of the Control Flow Information," Information and Software Technology, 51(9): 1338-1350, 2009.
[6] J. Pan, H. Yang, and C. Faloutsos, "MMSS: Multi-Modal Story-Oriented Video Summarization," IEEE ICDM, pp. 491-494, 2004.
[7] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.
[8] MSDN APIs, http://msdn.microsoft.com.
[9] A. Aiken, Moss: A System for Detecting Software Plagiarism, University of California, Berkeley, http://www.cs.berkeley.edu/~aiken/moss.html.
[10] C. Hoffmann, Group-Theoretic Algorithms and Graph Isomorphism, Springer, Heidelberg, 1982.
[11] K. C. Louden, Compiler Construction, PWS Publishing Company, 1997.
[12] A. Aizawa, "An Information-Theoretic Perspective of TF-IDF Measure," Information Processing and Management, 39(1): 45-65, 2003.
[13] T. Haveliwala, "Topic-Sensitive PageRank," WWW, pp. 517-526, 2002.
[14] W. Hwang, S. Chae, S. Kim, and G. Woo, "Yet Another Paper Ranking Algorithm Advocating Recent Publications," WWW, pp. 1117-1118, 2010.
[15] D. Bae, S. Hwang, S. Kim, and C. Faloutsos, "Constructing Seminal Paper Genealogy," ACM CIKM, pp. 2101-2104, 2011.