A Measure of Similarity for Binary Programs with a Hierarchical Structure
ICCP 2015
Ciprian Oprișa and Nicolae Ignat
Bitdefender, Technical University of Cluj-Napoca
September 5, 2015
Agenda
1. Introduction
2. Flat similarity
3. Hierarchical similarity
4. Experimental results
5. Conclusions and future work
C. Oprișa (Bitdefender, TUC-N)
Similarity for Binary Programs with a Hierarchical Structure
September 5, 2015
1. Introduction
Introduction

Why compute program similarity?
- Detect plagiarism
- Detect malware
  - malware created by modifying a legitimate app
  - new malware variants, similar to previous versions

Hierarchical structures in programs
- packages, classes and methods provide extra information

[Figure: tree view of the sample app com.some.application; node labels include a, b, android, com, simple, simpleDld, ap, SearchActivity, onClick, download, onCreate, onPause, onResume, search]
Android apps plagiarism workflow
1. get the binary application (APK)
2. decompile / disassemble
3. modify: remove ads / replace ads / add dubious content
4. repackage
5. re-upload
2. Flat similarity
Flat similarity
- Any program can be disassembled into simple instructions
  - Android code resides in a file called classes.dex
  - It can be disassembled using the baksmali tool
- A sequence of n consecutive operations is called an n-gram
  - the entire program can be represented as a set of n-grams

\[ sim_J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \]

- a method is a set of n-grams
- a program is a set of methods
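The n-gram representation and the Jaccard formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the opcode names are simplified placeholders rather than real baksmali output, and the function names `ngrams` and `sim_j` are chosen here for the example.

```python
from typing import Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(opcodes: Sequence[str], n: int) -> Set[NGram]:
    """Return the set of all runs of n consecutive opcodes."""
    return {tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)}


def sim_j(x: Set[NGram], y: Set[NGram]) -> float:
    """Jaccard similarity |X ∩ Y| / |X ∪ Y|."""
    union = x | y
    # Convention chosen here (not from the slides): two empty sets count as identical.
    return len(x & y) / len(union) if union else 1.0


# Hypothetical opcode sequences of two methods that differ only in the last instruction.
a = ["const", "invoke", "move", "return"]
b = ["const", "invoke", "move", "goto"]
print(sim_j(ngrams(a, 2), ngrams(b, 2)))  # 2 shared 2-grams out of 4 distinct ones: 0.5
```

Using sets rather than lists means repeated n-grams count once, which matches the "set of n-grams" wording on the slide.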
Set of methods - bipartite match (1)
[Figure: bipartite graph matching the methods m11, m12, m13, m14, m15 of program X to the methods m21, m22, m23, m24 of program Y]
Set of methods - bipartite match (2)
- Compute the similarity between each pair of methods
- Use the Hungarian Algorithm to find the best match
  - maximum weighted bipartite matching problem
  - outputs bm(X, Y) ⊂ X × Y
  - complexity O(n^3), n being the average number of methods
- Need to transform the previous similarity function

\[ sim_J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|} \]

\[ sim(X, Y) = \frac{MatchScore(bm(X, Y))}{|X| + |Y| - MatchScore(bm(X, Y))} \]
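As a rough sketch of the matching step, the code below is a textbook O(n^3) Hungarian algorithm in pure Python that maximises the total weight of the match. It is not taken from the authors' code; the names `best_match` and `similarity_from_score` and the example weights are assumptions made for illustration. Weights are turned into non-negative costs so that maximising weight becomes minimising cost.

```python
from typing import List, Tuple


def best_match(w: List[List[float]]) -> List[Tuple[int, int]]:
    """Maximum-weight bipartite match as a sorted list of (row, column) pairs."""
    if not w or not w[0]:
        return []
    transposed = len(w) > len(w[0])
    if transposed:  # the routine below needs rows <= columns
        w = [list(col) for col in zip(*w)]
    n, m = len(w), len(w[0])
    top = max(max(row) for row in w)
    cost = [[top - x for x in row] for row in w]  # maximise weight == minimise cost
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: 1-based row matched to column j, 0 if free
    way = [0] * (m + 1)   # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column: augmenting path found
                break
        while j0:  # flip the assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    pairs = [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j]]
    return sorted((c, r) if transposed else (r, c) for r, c in pairs)


def similarity_from_score(size_x: int, size_y: int, score: float) -> float:
    """sim(X, Y) = MatchScore / (|X| + |Y| - MatchScore)."""
    return score / (size_x + size_y - score)


# Hypothetical pairwise method similarities; a greedy row-by-row choice would be suboptimal here.
w = [[0.9, 0.8], [0.9, 0.1]]
print(best_match(w))  # [(0, 1), (1, 0)]
```

In practice a library routine for the assignment problem could replace `best_match`; the pure-Python version is used here only to keep the sketch self-contained.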
MatchScore approaches
- thresholded match: count the pairs in the match whose weight is at least a given threshold θ

\[ MatchScore(match) = |\{(x, y) \in match \mid msim(x, y) \ge \theta\}| \]

- contiguous match: sum the weights of the pairs from the match

\[ MatchScore(match) = \sum_{(x, y) \in match} msim(x, y) \]
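The two MatchScore variants differ only in how the weights of the matched pairs are aggregated. The sketch below assumes the match is already available as a list of msim weights; the function names and the example numbers are hypothetical.

```python
from typing import Sequence


def thresholded_score(weights: Sequence[float], theta: float) -> int:
    """Count the matched pairs whose similarity is at least theta."""
    return sum(1 for w in weights if w >= theta)


def contiguous_score(weights: Sequence[float]) -> float:
    """Sum the similarities of all matched pairs."""
    return sum(weights)


def sim(size_x: int, size_y: int, score: float) -> float:
    """sim(X, Y) = MatchScore / (|X| + |Y| - MatchScore)."""
    return score / (size_x + size_y - score)


# Hypothetical match between a program with 3 methods and one with 4 methods.
weights = [1.0, 0.6, 0.2]
print(sim(3, 4, thresholded_score(weights, theta=0.5)))  # 2 / (7 - 2) = 0.4
print(sim(3, 4, contiguous_score(weights)))              # about 1.8 / 5.2
```

The thresholded variant ignores how similar a pair is once it passes θ, while the contiguous variant lets weakly similar pairs contribute a little.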
3. Hierarchical similarity
Hierarchical similarity
- the algorithm is aware of the hierarchical structure
  - classes (C1) contain methods
  - packages (C2) contain classes and other packages
  - a program is a package
- to compute the similarity between two packages:
  - recursively compute pairwise similarity between components
  - find the best match
Hierarchical similarity algorithm

Algorithm 1 hierarchical-similarity(X, Y)
Require: two method containers X, Y ∈ C = C1 ∪ C2, X = {x1, x2, ..., xk}, Y = {y1, y2, ..., yp}
Ensure: the similarity between X and Y
  if (X ∈ C1 and Y ∈ C2) or (X ∈ C2 and Y ∈ C1) then
    return 0
  else if X ∈ C1 and Y ∈ C1 then
    w(i, j) ← msim(xi, yj), ∀ 1 ≤ i ≤ |X|, 1 ≤ j ≤ |Y|
  else
    w(i, j) ← hierarchical-similarity(xi, yj), ∀ 1 ≤ i ≤ |X|, 1 ≤ j ≤ |Y|
  end if
  match ← compute-bipartite-match(w)
  return MatchScore(match) / (|X| + |Y| − MatchScore(match))
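A compact Python rendering of Algorithm 1 might look as follows. It is a sketch under several assumptions that are not in the slides: the `Klass` and `Package` types are invented for the example, msim is taken to be the Jaccard similarity of n-gram sets, MatchScore is the sum-of-weights variant, and the bipartite match is found by brute force over permutations as a stand-in for the Hungarian algorithm (so it only suits very small inputs).

```python
from dataclasses import dataclass
from itertools import permutations
from typing import FrozenSet, List, Union


@dataclass
class Klass:               # element of C1: a class containing methods
    methods: List[FrozenSet]   # each method is a set of n-grams


@dataclass
class Package:             # element of C2: contains classes and other packages
    children: List[Union["Klass", "Package"]]


def msim(x: FrozenSet, y: FrozenSet) -> float:
    """Method similarity, assumed here to be the Jaccard index of n-gram sets."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def best_match_score(w: List[List[float]]) -> float:
    """Best total weight of a bipartite match (brute force, small inputs only)."""
    if not w or not w[0]:
        return 0.0
    if len(w) > len(w[0]):                      # make rows <= columns
        w = [list(col) for col in zip(*w)]
    rows, cols = len(w), len(w[0])
    return max(sum(w[i][j] for i, j in enumerate(perm))
               for perm in permutations(range(cols), rows))


def hierarchical_similarity(x, y) -> float:
    if type(x) is not type(y):                  # class vs. package
        return 0.0
    if isinstance(x, Klass):                    # both in C1: compare methods
        xs, ys = x.methods, y.methods
        w = [[msim(a, b) for b in ys] for a in xs]
    else:                                       # both in C2: recurse on components
        xs, ys = x.children, y.children
        w = [[hierarchical_similarity(a, b) for b in ys] for a in xs]
    score = best_match_score(w)
    denom = len(xs) + len(ys) - score
    # Convention chosen here: two empty containers count as identical.
    return score / denom if denom else 1.0


# Hypothetical apps: the second one lost a method from its only class.
m1, m2 = frozenset({1, 2}), frozenset({3})
app_a = Package([Klass([m1, m2])])
app_b = Package([Klass([m1])])
print(hierarchical_similarity(app_a, app_a))  # 1.0
print(hierarchical_similarity(app_a, app_b))  # class level 0.5, package level 0.5 / 1.5
```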
Algorithm complexity
- Reminder: the Hungarian Algorithm has O(n^3) complexity
- We use the Master Theorem to compute the complexity of our divide and conquer algorithm

\[ T(n) = a \, T\!\left(\frac{n}{b}\right) + f(n) \]

  - b: average number of components of a package/class (constant)
  - a: number of recursive calls, a = |X| · |Y| ≈ b^2
  - f(n) ∈ O(n^c) ⇒ c = 0 (because f(n) = O(b^3), and b is constant)
  - \log_b a = \log_b b^2 = 2, so c < \log_b a
- The algorithm complexity is O(n^{\log_b a}) = O(n^2)
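The O(n^2) bound can be sanity-checked on an idealised case. The sketch below assumes two perfectly balanced hierarchies in which every package and class has exactly b components (an assumption made only for this illustration) and counts the method-pair comparisons the recursion performs; `count_msim_calls` is a hypothetical helper, not part of the presented algorithm.

```python
def count_msim_calls(depth: int, b: int) -> int:
    """Method-pair comparisons for two balanced hierarchies with branching factor b.

    depth == 1 means two classes with b methods each; every extra level
    triggers b * b recursive calls, one per pair of components.
    """
    if depth == 1:
        return b * b
    return b * b * count_msim_calls(depth - 1, b)


b, depth = 3, 3
n = b ** depth                       # total number of methods per program
print(n, count_msim_calls(depth, b))  # 27 methods, 729 = 27^2 comparisons
```

Under this assumption the number of leaf comparisons is exactly n^2, which agrees with the bound obtained from the Master Theorem.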
4. Experimental results
Algorithm performance (1)
[Figure: running time (s), axis from 0 to 15, versus average number of methods, axis from 0 to 8,000]
Algorithm performance (2)
[Figure: running time (s), axis from 0 to 300, versus average number of methods, axis from 0 to 8,000, comparing hierarchical similarity with flat similarity]
Classification quality

\[ P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN} \]

At 25% threshold:
- thresholded: P = 89.06%, R = 95%
- contiguous: P = 95%, R = 95%

At 50% threshold: P = 100%
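For readers who want to recompute such figures, precision and recall follow directly from the confusion counts. The counts in the example are hypothetical: they are merely one set of values consistent with the percentages reported for the thresholded match, not data taken from the experiment.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (P, R) with P = TP / (TP + FP) and R = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts: 57 true positives, 7 false positives, 3 false negatives.
p, r = precision_recall(tp=57, fp=7, fn=3)
print(f"P = {p:.2%}, R = {r:.2%}")  # P = 89.06%, R = 95.00%
```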
[Figure: ROC curve for thresholded match and ROC curve for contiguous match, each plotting true positive rate against false positive rate on axes from 0 to 1]
5. Conclusions and future work
Conclusions and future work

Conclusions:
- We have designed an algorithm to compute the similarity for programs with a hierarchical structure
- The algorithm has O(n^2) complexity and compares two Android applications in less than 15 seconds on average
- The similarity function has good results in terms of both Precision and Recall

Future work:
- Design a solution to ignore library code when comparing two apps
- Find plagiarism cases in a large collection, without performing pairwise similarity computations
Thank you! Questions?