A Measure of Similarity for Binary Programs with a Hierarchical Structure

Ciprian Oprișa
Bitdefender
Technical University of Cluj-Napoca
E-mail: [email protected]

Nicolae Ignat
Technical University of Cluj-Napoca
E-mail: [email protected]
Abstract—Finding similar binary programs is a challenging task, as there is no single best way to compute this kind of similarity. The approach in the current paper is based on the fact that some programs retain structural information from the source code, such as the division into packages, classes and methods. Using such a hierarchical structure, comparison algorithms based on OpCode n-grams can be enriched in order to produce more meaningful results. The method was implemented in order to detect plagiarism cases in a collection of Android applications written in Java. By employing the new techniques, the plagiarism detector exceeded 95% in terms of both precision and recall. The algorithm also performs better when the information about the program hierarchy is used: the proposed system managed to compare large applications of over 7000 methods in about 15 seconds. By automatically detecting plagiarism cases at an early stage, our framework can limit the loss of revenue for legitimate developers.
I. INTRODUCTION

Measuring the similarity between programs is useful in various scenarios. A recommender system may find similar programs that a user might like. The maintainers of an application market may find plagiarism cases and eliminate them in order to maximize the revenue for legitimate apps. In malware research, the manual work is greatly reduced by grouping together similar applications and only analyzing a few representatives from each group.

Most programs can be found in binary form, as source code is seldom distributed. Some compilers, especially those that produce native code, do not maintain the structure of the source code, making it harder to separate the binary code into functions or classes. On the other hand, Java compilers [1], including the one for Android, produce binary code that can easily be separated into packages, classes and methods, giving it a hierarchical structure. The similarity measure presented in this paper will take advantage of this property in order to provide more accurate results.

The methodology for finding similar applications will be tested on free Android applications from the Google Play store [2]. This store is accessible for developers to easily publish their apps, simply by paying a one-time developer fee of 25 USD. There are several revenue models for developers to monetize their creation:
• publishing paid apps - the user has to pay a fee for downloading the app.
• in-app purchases - the application is free, but some contents inside the application require a payment. Many free games use this technique for gaining revenue, by allowing the users to buy virtual game items or speed up their progress using real money.
• providing ads - despite the growing popularity of in-app purchases, mobile ads still bring a lot of revenue to their developers. The advertisements are provided by frameworks or SDKs that are specialized in this activity and pay the revenues to the app developer based on his unique id. According to [3], more than 51% of the applications from Google Play contain at least one ads SDK, while some of them go as far as including 8 such components.
The main reason for plagiarizing Android applications is to gain revenue through ads SDKs. Since Android applications written in Java are easy to decompile and revert to source code [4], an attacker can follow a few simple steps:
• download a popular application from the market
• decompile it to Java source code or disassemble it to smali code [5]
• (optionally) make some changes to the app
• recompile
• re-publish the application under a new name and a new vendor
The simplest change that can be made to an existing app is to modify the developer id in the ads SDK, so that the revenue will go to the attacker instead of the original developer. Some attackers might use additional SDKs for a faster monetization (usually, such applications are short-lived, since they are taken down from the market after the plagiarism case is signaled). Few attackers will take their time and modify the code or resources of the original application in order to bypass plagiarism detectors. Most modifications of this type are done by automatic obfuscation tools.

Plagiarism affects legitimate developers in two ways. First, they miss the revenue from the users that download the plagiarized application instead of the original one. Secondly, additional content added by the attacker (usually aggressive ads SDKs) will affect the application's reputation and implicitly the original developer. By identifying plagiarism cases at an early stage, we can limit this damage.

The following section will describe some related work. The next two sections will explain how the similarity functions work, the first one dealing with the feature extraction and the second one with strategies for computing the similarity functions. The experimental results show both the quality metrics for the similarity function and the performance of the similarity algorithm. Finally, the conclusions section wraps up the paper and provides ideas for future work.
II. RELATED WORK

Previous work of the author [6] showed a unified method to identify similar code, for both finding plagiarized homework and clustering malware samples. The best approach proved to be based on OpCode n-grams, which are sequences of consecutive instructions extracted from the code. The system is designed to work with native code that can be disassembled into Intel x86 instructions and does not take into account the structure of the program (separation into functions or classes).

Mobile application plagiarism detection is an interesting research topic, as we identified several recent papers dealing with this subject. The authors of [7] studied attacks where legitimate apps were modified in order to add malicious code. Three schemes were proposed for computing the similarity between applications. Symbol-Coverage relies on the symbol names extracted from the application and works only if no form of obfuscation has been involved. AST-Distance and AST-Coverage rely on Abstract Syntax Trees, a set of features based on the dependencies between the application's methods. AST-Coverage is the more advanced technique: it matches the method-level feature vectors of two applications and searches for the maximum coverage. Juxtapp [8] has a similar approach to our system for the feature extraction part, because it is also based on OpCode n-grams. However, the division of the code into methods and classes is not used, as Juxtapp treats every application like a large code buffer.

III. HIERARCHICAL CODE FEATURES EXTRACTION

This section describes how relevant features can be extracted from an Android binary program, even if it is compiled and we don't have access to the source code. According to [9], an Android application package (called APK) is a zip archive that contains the application's code, resources and manifest. The entire application binary code is stored in a single file inside the archive, called classes.dex. This file is obtained after the Java code is compiled to bytecode, then converted to the Dalvik Executable (dex) format.

Reverse engineering tools like baksmali [5] are able to extract the classes.dex file from an APK, rebuild the classes and packages and disassemble the bytecode into an assembly-like format called smali. Figure 1 shows the structure of the classes.dex file extracted from an Android application using the baksmali tool. The gray rectangles with rounded corners correspond to Java packages and will be extracted as a hierarchy of folders. Each package can contain other packages or Java classes, represented as simple rectangles and extracted as .smali files. Each such file corresponds to a Java class, but instead of Java code it contains assembly-like instructions. The content of .smali files is further divided into methods.

Fig. 1. classes.dex structure as extracted by baksmali
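To make this extraction step concrete, the following minimal Java sketch (our illustration, not the original implementation; the SmaliHierarchy class name and the smaliRoot parameter are assumptions) walks a baksmali output directory and recovers the hierarchy described above: folders correspond to packages, .smali files to classes, and the text between the .method and .end method directives to individual methods.

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class SmaliHierarchy {
    // Maps each .smali file (relative path, i.e. package/ClassName.smali) to its method bodies.
    public static Map<String, List<String>> extract(Path smaliRoot) throws IOException {
        Map<String, List<String>> classes = new TreeMap<>();
        try (Stream<Path> files = Files.walk(smaliRoot)) {
            for (Path file : files.filter(p -> p.toString().endsWith(".smali"))
                                  .collect(Collectors.toList())) {
                List<String> methods = new ArrayList<>();
                StringBuilder current = null;
                for (String line : Files.readAllLines(file)) {
                    String t = line.trim();
                    if (t.startsWith(".method")) {
                        current = new StringBuilder();          // a new method starts here
                    } else if (t.startsWith(".end method") && current != null) {
                        methods.add(current.toString());        // method body is complete
                        current = null;
                    } else if (current != null) {
                        current.append(t).append('\n');         // instruction inside a method
                    }
                }
                classes.put(smaliRoot.relativize(file).toString(), methods);
            }
        }
        return classes;
    }
}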
Following an approach similar to [6] and [10], each smali instruction is associated with a distinct symbol. A sequence of n consecutive instructions from a method will be called an n-gram. Internally, each method will be represented as a set of n-grams instead of a sequence of instructions. The main distinction from the approach in [6] and [10] is that for Android applications we have enough information to give the application a hierarchical structure, while in the previous works the entire application was represented as a single set of n-grams.

For formalism, we will make the following notations. Let G be the set of all possible n-grams that can be extracted from an application's code. Since a method is a set of n-grams, we will define the set of methods as M = P(G) (the set of subsets of G). Similarly, a class can be defined as a set of methods, and a package as a family containing classes and other packages. We will not allow a package to contain itself, directly or indirectly. The family of method containers, formed by classes and packages, will be denoted by C = C1 ∪ C2 (elements from C1 are classes and elements from C2 are packages). An application can be abstracted as a simple package that contains the top-level packages.
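As a small illustration of this representation (a hedged sketch with a toy opcode list, not the exact feature extractor used by the system), a method's opcode sequence can be turned into its set of n-grams as follows:

import java.util.*;

public class NGramFeatures {
    // Builds the set of n-grams for one method; each n-gram is the joined opcode symbols.
    public static Set<String> toNGrams(List<String> opcodes, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= opcodes.size(); i++) {
            grams.add(String.join(" ", opcodes.subList(i, i + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> method = Arrays.asList("invoke-virtual", "move-result", "if-eqz", "return-void");
        System.out.println(toNGrams(method, 3));  // the two 3-grams of this toy method
    }
}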
IV. SIMILARITY FUNCTIONS DESCRIPTION

The basic similarity that we need to compute is the similarity between two methods. Since methods are represented as sets, the most natural similarity function is the Jaccard similarity [11], which is 0 for disjoint sets and otherwise computes the ratio between the size of the sets' intersection and the size of the sets' union, as in Equation 1.

msim : M × M → [0, 1]
msim(M1, M2) = 0, if M1 ∩ M2 = ∅
msim(M1, M2) = |M1 ∩ M2| / |M1 ∪ M2|, otherwise     (1)
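A direct Java sketch of Equation 1, assuming each method has already been reduced to a set of n-gram strings as in the previous section, could look like this:

import java.util.*;

public class MethodSimilarity {
    // msim from Equation (1): Jaccard similarity of two n-gram sets.
    public static double msim(Set<String> m1, Set<String> m2) {
        Set<String> inter = new HashSet<>(m1);
        inter.retainAll(m2);
        if (inter.isEmpty()) return 0.0;               // disjoint (or empty) n-gram sets
        Set<String> union = new HashSet<>(m1);
        union.addAll(m2);
        return (double) inter.size() / union.size();
    }
}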
The Jaccard similarity has some important advantages in our scenario: first of all, the similarity function will always take values between 0 and 1, with 0 only for methods represented by disjoint sets of n-grams and 1 only for identical sets. Another advantage is that the Jaccard distance, defined as dJ : M × M → [0, 1], dJ(M1, M2) = 1 − msim(M1, M2), is a proper metric and can be used in various clustering algorithms [12].

The Jaccard distance may be sufficient for non-hierarchical code features, as each application can be represented as a large set of n-grams. However, for computing the similarity between hierarchical structures, a further step is required. Whether we are computing the similarity between two classes or between two applications, a new similarity function between sets of sets is required, as in Equation 2.

sim : P(M) × P(M) → [0, 1]     (2)
Considering two sets of methods, C1, C2 ∈ P(M), C1 = {m11, m12, ..., m1k}, C2 = {m21, m22, ..., m2p}, their similarity should be estimated by the number of methods they have in common:
• if C1 = C2 (the two sets are comprised of exactly the same methods), then sim(C1, C2) = 1;
• if the similarity between any method from C1 and any method from C2 is very small (msim(m1i, m2j) < ε, ∀ 1 ≤ i ≤ k, 1 ≤ j ≤ p), then the similarity between C1 and C2 should be 0 or close to 0;
• if some methods from C1 match, totally or partially, methods from C2, the similarity function should reflect the match percentage.

We can start by associating each pair (m1, m2) ∈ C1 × C2 with a weight corresponding to their similarity, w(m1, m2) = msim(m1, m2). At this point, we can find a matching between the two sets of methods by solving the maximum weighted bipartite matching problem (also called the assignment problem). Several polynomial algorithms exist for this task, the most notable one being the Hungarian algorithm [13]. In Figure 2, we represented two such method sets, C1 containing 5 methods and C2 containing 4 methods. Methods with a high similarity are connected with lines (dashed or solid). The Hungarian algorithm can find a match between the two sets (the solid lines), such that each method from C1 matches at most one method from C2, each method from C2 matches at most one method from C1, and the sum of the weights of the matched pairs is maximal.

Fig. 2. Bipartite match between two sets of methods
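For illustration, the sketch below builds the weight matrix w(i, j) = msim(m1i, m2j) for two containers and pairs their methods with a simple greedy pass. The greedy pass is a simplification of ours, kept for brevity; the actual system solves the assignment problem exactly with the Hungarian algorithm [13]. The BipartiteMatch class name is an assumption, and MethodSimilarity.msim refers to the earlier sketch.

import java.util.*;

public class BipartiteMatch {
    // w(i, j) = msim(m1i, m2j) for every pair of methods from the two containers.
    public static double[][] buildWeights(List<Set<String>> c1, List<Set<String>> c2) {
        double[][] w = new double[c1.size()][c2.size()];
        for (int i = 0; i < c1.size(); i++)
            for (int j = 0; j < c2.size(); j++)
                w[i][j] = MethodSimilarity.msim(c1.get(i), c2.get(j));
        return w;
    }

    // Greedy approximation of the maximum-weight bipartite matching (not the Hungarian algorithm).
    public static List<int[]> greedyMatch(double[][] w) {
        List<int[]> pairs = new ArrayList<>();            // matched (i, j) index pairs
        int rows = w.length, cols = rows == 0 ? 0 : w[0].length;
        boolean[] usedRow = new boolean[rows], usedCol = new boolean[cols];
        for (int r = 0; r < Math.min(rows, cols); r++) {  // every item of the smaller set gets matched
            int bi = -1, bj = -1;
            double best = -1.0;
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < cols; j++)
                    if (!usedRow[i] && !usedCol[j] && w[i][j] > best) { best = w[i][j]; bi = i; bj = j; }
            usedRow[bi] = true;
            usedCol[bj] = true;
            pairs.add(new int[]{bi, bj});
        }
        return pairs;
    }
}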
In what follows, we will consider that the Hungarian algorithm takes as input two sets C1, C2 ∈ P(M) and outputs a set of paired items bm(C1, C2). The function bm : P(M) × P(M) → P(M × M) has the following properties:
• ∀(xi, yi), (xj, yj) ∈ bm(C1, C2), xi ≠ xj and yi ≠ yj. This property ensures that no method belongs to more than one match.
• |bm(C1, C2)| = min(|C1|, |C2|). In other words, every item from the smaller set will be matched with an item from the larger set. If the sets have different cardinalities, some items from the larger set will remain unmatched.
• Σ_{(x,y) ∈ bm(C1,C2)} msim(x, y) is maximal. Since every element of the sum is at most 1 and the number of elements equals the size of the smaller of the two sets, it follows that this sum is at most the size of either of the two sets.

In the Jaccard similarity formula, we can apply the inclusion-exclusion principle and obtain:

simJ(X, Y) = |X ∩ Y| / |X ∪ Y| = |X ∩ Y| / (|X| + |Y| − |X ∩ Y|)
The contents of the bipartite match can be seen as an approximate intersection between the sets C1 and C2. To devise the new similarity function, we will replace the intersection cardinality in the equation above with a function that depends on the output of the bm function:

sim(X, Y) = MatchScore(bm(X, Y)) / (|X| + |Y| − MatchScore(bm(X, Y)))     (3)
MatchScore(match), where match = bm(X, Y), should be equal to |X ∩ Y| when the weights in the bipartite match are 1 (the matched methods are identical) and should take lower values otherwise. There are two models for computing this function:
• thresholded match: for each pair in the bipartite match, add 1 if the pair has a high similarity and 0 if the pair has a low similarity:
MatchScore(match) = |{(x, y) ∈ match | msim(x, y) ≥ θ}|     (4)
• contiguous match: add every similarity from the best bipartite matching:
MatchScore(match) = Σ_{(x,y) ∈ match} msim(x, y)     (5)
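Both strategies are straightforward to compute once the matching is available; the sketch below takes, as a simplification of ours, the list of matched pairs' similarities as input:

import java.util.Arrays;
import java.util.List;

public class MatchScores {
    // Thresholded match, Equation (4): count the pairs whose similarity reaches theta.
    public static double thresholded(List<Double> pairSimilarities, double theta) {
        return pairSimilarities.stream().filter(s -> s >= theta).count();
    }

    // Contiguous match, Equation (5): sum every similarity in the matching.
    public static double contiguous(List<Double> pairSimilarities) {
        return pairSimilarities.stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        List<Double> sims = Arrays.asList(0.95, 0.80, 0.10);
        System.out.println(thresholded(sims, 0.5));  // 2.0 (two pairs reach the threshold)
        System.out.println(contiguous(sims));        // 1.85 (sum of all three similarities)
    }
}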
V. HIERARCHICAL SIMILARITY ALGORITHM

Algorithm 1 computes the similarity between two method containers (classes or packages). As mentioned in Section III, applications are also abstracted as packages, so the algorithm will successfully compute the similarity between two applications.

Algorithm 1 HIERARCHICAL-SIMILARITY(X, Y)
Require: Two method containers X, Y ∈ C, X = {x1, x2, ..., xk}, Y = {y1, y2, ..., yp}
Ensure: the similarity between X and Y
 1: if (X ∈ C1 and Y ∈ C2) or (X ∈ C2 and Y ∈ C1) then
 2:   return 0
 3: else if X ∈ C1 and Y ∈ C1 then
 4:   for i = 1 → |X|, j = 1 → |Y| do
 5:     w(i, j) ← msim(xi, yj)
 6:   end for
 7: else
 8:   for i = 1 → |X|, j = 1 → |Y| do
 9:     w(i, j) ← HIERARCHICAL-SIMILARITY(xi, yj)
10:   end for
11: end if
12: match ← COMPUTE-BIPARTITE-MATCH(w)
13: return MatchScore(match) / (|X| + |Y| − MatchScore(match))
The algorithm begins by examining the type of the received containers. If the two containers are of different natures (a class and a package) they can't be compared, so their similarity is 0 (line 2). If X and Y have the same type, we build a weight matrix with the pairwise similarities between all the elements of X and Y. If X and Y are classes, their elements are methods; in this case we use the msim function described in the previous section to compute their similarity (line 5). If they are packages, their elements are also method containers, so we need to make a recursive call to HIERARCHICAL-SIMILARITY in order to compute their similarity (line 9). Having the weight matrix, we can compute the maximum weighted bipartite matching (line 12) and use it in the similarity function (line 13). The MatchScore function can follow either of the two strategies presented in the previous section (thresholded match and contiguous match).
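A compact Java sketch of Algorithm 1 is given below, reusing msim and greedyMatch from the earlier sketches. The Container type, the theta parameter and the greedy stand-in for the Hungarian algorithm are simplifications of ours, and the MatchScore shown is the thresholded variant (Equation 4).

import java.util.*;

public class HierarchicalSimilarity {
    static class Container {
        List<Set<String>> methods = new ArrayList<>();   // used when the container is a class
        List<Container> children = new ArrayList<>();    // used when the container is a package
        boolean isClass() { return children.isEmpty(); }
    }

    public static double similarity(Container x, Container y, double theta) {
        if (x.isClass() != y.isClass()) return 0.0;      // a class and a package are not comparable
        int n = x.isClass() ? x.methods.size() : x.children.size();
        int p = y.isClass() ? y.methods.size() : y.children.size();
        if (n == 0 && p == 0) return 1.0;                // two empty containers: nothing to compare
        double[][] w = new double[n][p];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < p; j++)
                w[i][j] = x.isClass()
                        ? MethodSimilarity.msim(x.methods.get(i), y.methods.get(j))
                        : similarity(x.children.get(i), y.children.get(j), theta);
        double score = 0.0;                              // thresholded MatchScore, Equation (4)
        for (int[] pair : BipartiteMatch.greedyMatch(w))
            if (w[pair[0]][pair[1]] >= theta) score += 1.0;
        return score / (n + p - score);
    }
}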
To compute the complexity of Algorithm 1, we will consider the size of the input to be the number of methods the received application or package has. We may consider that the two inputs have roughly the same number of methods. Since we have a divide and conquer algorithm, we can use the master theorem [14], which gives the complexity for recursions of the form in Equation 6.

T(n) = a · T(n / b) + f(n)     (6)

In our case, b is the number of classes or sub-packages a package has (for simplicity, we consider it to be constant, b = |X| = |Y|). The recursive call in line 9 is done for every pair of items in the two sets, so a = |X| · |Y| = b² times. Finally, after the recursive calls, the Hungarian algorithm is called on the weight matrix, but the complexity of this call is O(b³), which does not depend on n. It follows that f(n) = O(1) = O(n⁰). To apply the master theorem, we need to compute log_b a and compare it with c, where f(n) = O(n^c). From the previous paragraph, we have that c = 0, while log_b a = log_b b² = 2. Since 0 < 2, we have c < log_b a and we are in the first case of the master theorem. In this case, the algorithm's complexity is O(n^(log_b a)). We conclude that the complexity of Algorithm 1 is O(n²), where n is the average number of methods an application or package has.

As a note, the master theorem holds if b > 1. The case b = 1 corresponds to the situation where all the application code is in a single class (unlikely but possible). In this case, the algorithm's complexity will be given by the complexity of the Hungarian algorithm and will be O(n³).

VI. EXPERIMENTAL RESULTS

In this section we will evaluate both the quality and the performance of our method. We used a dataset consisting of 53 Android applications that were already clustered into 18 groups. The applications belonging to the same group are either different versions of the same application or real plagiarism cases. An ideal similarity measure will output high similarity values for all the pairs of applications that belong to the same group and low values for the pairs belonging to different groups. To emphasize the quality and the performance of the hierarchical similarity technique, we will also compare it with the flat model. The flat model assumes that no hierarchical information is known about applications and all the methods are situated in a single class.

A. Classification quality

For each of the 53 · (53 − 1) / 2 = 1378 pairs of applications, we computed the hierarchical similarity using both the thresholded and the contiguous match strategy. By comparing the similarity result with a threshold, we can state whether the two applications are similar or not. For each matching strategy, we will divide the pairs into 4 categories and build a confusion matrix:
• True Positive (TP) - the pair was correctly classified as a case of plagiarism.
• False Positive (FP) - the pair was incorrectly classified as a case of plagiarism.
• True Negative (TN) - the pair is not a case of plagiarism and was not flagged as one.
• False Negative (FN) - the pair is a case of plagiarism but it wasn't detected.

Taking into account the number of pairs with each label (TP, FP, TN, FN), we computed the following quality indices:
• Precision - the fraction of retrieved instances that are relevant.
P = TP / (TP + FP)     (7)
• Recall - the fraction of relevant instances that are retrieved.
R = TP / (TP + FN)     (8)
• F-Measure [15] - a measure of a test's accuracy. It considers both the precision P and the recall R of the test to compute the score. The F-Measure can be interpreted as a weighted average of the precision and recall, and it reaches its best value at 1 and the worst at 0. A constant β ≥ 0 can be used to balance the contribution of false negatives (we will pick β = 1).
Fβ = ((β² + 1) · P · R) / (β² · P + R)     (9)

Selecting the similarity threshold at 50%, we obtained the following values for the aforementioned quality indices:
• thresholded: P = 100%; contiguous: P = 100%;
• thresholded: R = 78.33%; contiguous: R = 68.33%;
• thresholded: F1 = 87.85%; contiguous: F1 = 81.18%;
The precision is at 100%, meaning that no pair of dissimilar items was classified as similar. The thresholded match strategy provided a recall about 10% higher than the contiguous match, leading also to a higher F-Measure. By decreasing the similarity threshold, we can increase the recall at the cost of some precision. At a 25% threshold, the values are the following:
• thresholded: P = 89.06%; contiguous: P = 95%;
• thresholded: R = 95%; contiguous: R = 95%;
• thresholded: F1 = 91.93%; contiguous: F1 = 95%;
In this case, the precision is no longer 100%, meaning that some pairs of applications may be flagged as similar even if they are not. However, the recall reached 95%, and the overall F-Measure is also higher.

In order to illustrate how the precision and recall vary with the similarity threshold, we will analyze the receiver operating characteristic (ROC) [16] for the two techniques. The threshold was varied from 0% to 100%, using an increment of 0.5%. At each step, we computed the number of True Positives (TP), False Positives (FP) and False Negatives (FN). The true positive rate is identical to the recall index previously discussed and can be computed as TP / (TP + FN). The false positive rate can be calculated as 1 - precision, or FP / (FP + TP). A ROC curve plots the dependency between the two indices: each point on the curve corresponds to the false positive rate and true positive rate for a given threshold. A perfect classifier would pass through the point (0, 1), which corresponds to 100% true positives and 0% false positives. A random guess would result in a point along the diagonal line (called the line of no discrimination) from the bottom left to the top right corner. A good classifier should be above this line.

Fig. 3. ROC curve for thresholded match (true positive rate vs. false positive rate)

Fig. 4. ROC curve for contiguous match (true positive rate vs. false positive rate)
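The ROC points of Figures 3 and 4 can be reproduced by sweeping the threshold exactly as described above. The sketch below is an illustration of ours (the sims and labels arrays, one entry per application pair, are hypothetical inputs) and uses the false positive rate as defined above (1 - precision):

import java.util.*;

public class RocSweep {
    public static List<double[]> rocPoints(double[] sims, boolean[] labels) {
        List<double[]> points = new ArrayList<>();   // each entry: {false positive rate, true positive rate}
        for (int step = 0; step <= 200; step++) {    // threshold from 0% to 100% in 0.5% increments
            double threshold = step * 0.005;
            int tp = 0, fp = 0, fn = 0;
            for (int i = 0; i < sims.length; i++) {
                boolean flagged = sims[i] >= threshold;
                if (flagged && labels[i]) tp++;
                else if (flagged && !labels[i]) fp++;
                else if (!flagged && labels[i]) fn++;
            }
            double recall = (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);     // Equation (8)
            double precision = (tp + fp == 0) ? 1.0 : (double) tp / (tp + fp);  // Equation (7)
            points.add(new double[]{1.0 - precision, recall});
        }
        return points;
    }
}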
Figure 3 shows the ROC curve for the classifier that uses the thresholded match strategy, while Figure 4 shows the ROC curve for the classifier that uses the contiguous match strategy. We can observe that both curves are closer to the ideal (0, 1) point than to the line of no discrimination, meaning that we managed to build good classifiers.

TABLE I
QUALITY INDICES FOR VARIOUS SIMILARITY STRATEGIES

Strategy       P        R        F1
thresholded    100%     88.33%   93.8%
contiguous     95.08%   96.66%   95.86%
flat           94.64%   88.33%   91.37%

Table I contains the quality indices for the two hierarchical similarity strategies and for the flat similarity, each computed for the threshold whose ROC point is closest to the ideal (0, 1). In Figure 3, this point is obtained for the similarity threshold 30%. The best point on the ROC curve in Figure 4 is obtained for the threshold 17%. These results show that using the contiguous match strategy in the similarity function leads to a slightly better classifier. Also, both strategies are better than the flat strategy, where the best point is obtained at threshold 40%.

B. Algorithm performance
Theoretical computations showed that Algorithm 1 has O(n²) complexity, where n is the average number of methods for the two applications. Figure 5 plots the running time of the algorithm against the number of methods.

Fig. 5. Performance measurement for Algorithm 1 (running time in seconds vs. average number of methods)
The algorithm was implemented in Java and ran on an Intel i7 vPro processor at 2 GHz with 8 GB of RAM. For large applications of more than 7000 methods, the amount of time required to compute the similarity is a little more than 15 seconds, which makes the method suitable for comparing individual applications but too slow for performing every pairwise comparison on large collections. For comparison, we have plotted in Figure 6 the performance of the algorithm in the case where all the methods are situated in a single class (flat similarity). In this case, for large applications with more than 7000 methods, the running time exceeds 5 minutes, which is 20 times worse than the original algorithm.

Fig. 6. Performance measurement for Algorithm 1 assuming a flat model (running time in seconds vs. average number of methods)
VII. CONCLUSION

A new framework for comparing binary programs was proposed. The similarity measure is based on the information extracted from the methods' code and on the hierarchical structure of the applications. The ideas were tested by looking for plagiarism cases in a collection of free Android applications. The idea for computing the hierarchical similarity is to compare two packages by computing the pairwise similarities between their components, then finding the best bipartite match using the Hungarian algorithm. Two strategies for computing the similarity from the bipartite match were examined: thresholded match and contiguous match. Contiguous match gave better results, but both strategies surpassed, in terms of precision and recall, the classical approach where the hierarchical structure was not taken into account. The new algorithm also performs faster, since the complexity is O(n²) in the number of methods, better than O(n³) for the flat approach.

As future work, we will focus on the problem of finding the needle in the haystack. Since the running time of the similarity algorithm exceeds 15 seconds for large applications, it is feasible but not efficient to find plagiarism cases in a large collection by simply comparing every pair of applications. We will work on new techniques to find plagiarism candidates in a faster way and only apply the hierarchical similarity on the interesting pairs.

REFERENCES

[1] B. Joy, G. Steele, J. Gosling, and G. Bracha, "Java (tm) language specification," Addison-Wesley, June, 2000.
[2] "Google play," 2015. [Online]. Available: https://play.google.com/
[3] W. Enck, D. Octeau, P. McDaniel, and S. Chaudhuri, "A study of android application security," in USENIX Security Symposium, vol. 2, 2011, p. 2.
[4] D. Octeau, S. Jha, and P. McDaniel, "Retargeting android applications to java bytecode," in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering. ACM, 2012, p. 6.
[5] "smali/baksmali," 2015. [Online]. Available: https://code.google.com/p/smali
[6] C. Oprișa, G. Cabău, and A. Coleșa, "From plagiarism to malware detection," in Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2013 15th International Symposium on. IEEE, 2013, pp. 227–234.
[7] R. Potharaju, A. Newell, C. Nita-Rotaru, and X. Zhang, "Plagiarizing smartphone applications: attack strategies and defense techniques," in Engineering Secure Software and Systems. Springer, 2012, pp. 106–120.
[8] S. Hanna, L. Huang, E. Wu, S. Li, C. Chen, and D. Song, "Juxtapp: A scalable system for detecting code reuse among android applications," in Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2013, pp. 62–81.
[9] P. O. Fora, "Beginners guide to reverse engineering android apps," RSA Conference, 2014.
[10] C. Oprișa, G. Cabău, and A. Coleșa, "Automatic code features extraction using bio-inspired algorithms," Journal of Computer Virology and Hacking Techniques, vol. 10, no. 3, pp. 165–176, 2014.
[11] P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Impr. Corbaz, 1901.
[12] A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2012.
[13] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[14] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein et al., Introduction to Algorithms. MIT Press, Cambridge, 2001, vol. 2.
[15] C. V. Rijsbergen, Information Retrieval (2nd ed.). Butterworths, 1979.
[16] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.