A Measure of Similarity for Binary Programs with a Hierarchical Structure

ICCP 2015

Ciprian Oprișa and Nicolae Ignat
Bitdefender, Technical University of Cluj-Napoca

September 5, 2015

Agenda

1. Introduction
2. Flat similarity
3. Hierarchical similarity
4. Experimental results
5. Conclusions and future work

C. Oprișa (Bitdefender, TUC-N)

Similarity for Binary Programs with a Hierarchical Structure

September 5, 2015

1. Introduction


Introduction

Why compute program similarity?
- Detect plagiarism
- Detect malware
  - malware created by modifying a legitimate app
  - new malware variants, similar to previous versions

Hierarchical structures in programs
- packages, classes and methods provide extra information

[Figure: example hierarchy for com.some.application, with nodes a, b, ..., android, ..., com, simple, simpleDld, ap, SearchActivity, onClick, download, onCreate, onPause, onResume, search, ...]


Android apps plagiarism workflow

1. get the binary application (APK)
2. decompile/disassemble
3. remove ads, replace ads, or add doubtful content
4. repackage
5. re-upload

2. Flat similarity


Flat similarity

- Any program can be disassembled into simple instructions
  - Android code resides in a file called classes.dex
  - It can be disassembled using the baksmali tool
- A sequence of n consecutive operations is called an n-gram
  - the entire program can be represented as a set of n-grams

$$\mathrm{sim}_J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

- a method is a set of n-grams
- a program is a set of methods
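As an illustration of the n-gram representation and the Jaccard similarity, here is a minimal Python sketch. It is not the authors' code: the function names `ngrams` and `jaccard`, the opcode sequences, and the convention that two empty sets have similarity 1 are all assumptions made for the example.

```python
from typing import FrozenSet, Sequence, Tuple

NGram = Tuple[str, ...]


def ngrams(opcodes: Sequence[str], n: int) -> FrozenSet[NGram]:
    """Return the set of all runs of n consecutive opcodes."""
    return frozenset(
        tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)
    )


def jaccard(x: FrozenSet[NGram], y: FrozenSet[NGram]) -> float:
    """sim_J(X, Y) = |X n Y| / |X u Y|.

    Two empty sets are treated as identical (an assumption of this sketch).
    """
    union = x | y
    if not union:
        return 1.0
    return len(x & y) / len(union)


# Hypothetical opcode sequences, loosely modelled on Dalvik mnemonics.
method_a = ["const", "invoke-virtual", "move-result", "if-eqz", "return"]
method_b = ["const", "invoke-virtual", "move-result", "goto", "return"]

grams_a = ngrams(method_a, 2)
grams_b = ngrams(method_b, 2)
similarity = jaccard(grams_a, grams_b)
print(similarity)
```

The two methods share two of their six distinct 2-grams, so a one-instruction change lowers the score without driving it to zero.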


Set of methods - bipartite match (1)

[Figure: bipartite graph matching the methods m11, ..., m15 of X with the methods m21, ..., m24 of Y]


Set of methods - bipartite match (2)

- Compute the similarity between each pair of methods
- Use the Hungarian Algorithm to find the best match
  - maximum weighted bipartite matching problem
  - outputs $bm(X, Y) \subset X \times Y$
  - complexity $O(n^3)$, n being the average number of methods
- Need to transform the previous similarity function

$$\mathrm{sim}_J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|}$$

$$\mathrm{sim}(X, Y) = \frac{\mathrm{MatchScore}(bm(X, Y))}{|X| + |Y| - \mathrm{MatchScore}(bm(X, Y))}$$
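The matching step and the transformed similarity can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `best_match` and `sim` are hypothetical names, the best match is found by brute force over permutations rather than by the Hungarian Algorithm (so it is only usable on tiny inputs), and MatchScore is taken to be the sum of the matched weights.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def best_match(w: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Return the maximum-weight bipartite match for weights w[i][j].

    Exhaustive search, exponential in the input size; a stand-in for
    the Hungarian Algorithm, not a replacement for it.
    """
    rows = len(w)
    cols = len(w[0]) if rows else 0
    if rows > cols:
        # Transpose so that the smaller side is the one being assigned.
        transposed = [[w[i][j] for i in range(rows)] for j in range(cols)]
        return [(i, j) for j, i in best_match(transposed)]
    best: List[Tuple[int, int]] = []
    best_score = -1.0
    for chosen in permutations(range(cols), rows):
        score = sum(w[i][j] for i, j in enumerate(chosen))
        if score > best_score:
            best, best_score = list(enumerate(chosen)), score
    return best


def sim(w: Sequence[Sequence[float]]) -> float:
    """MatchScore / (|X| + |Y| - MatchScore), MatchScore = sum of weights."""
    size_x = len(w)
    size_y = len(w[0]) if w else 0
    score = sum(w[i][j] for i, j in best_match(w))
    denominator = size_x + size_y - score
    # Two empty sets are treated as identical (an assumption of this sketch).
    return score / denominator if denominator else 1.0


# Hypothetical pairwise method similarities: X has 3 methods, Y has 2.
weights = [
    [0.75, 0.25],
    [0.25, 0.50],
    [0.50, 0.25],
]
match = best_match(weights)
value = sim(weights)
print(match, value)
```

Because each method appears in at most one pair, the score can never exceed the size of the smaller set, which keeps the result in the range [0, 1] just like the Jaccard index it generalises.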


MatchScore approaches

- thresholded match: count the number of pairs in the match with the weight over a given threshold

$$\mathrm{MatchScore}(match) = |\{(x, y) \in match \mid \mathrm{msim}(x, y) \geq \theta\}|$$

- contiguous match: sum the weights of the pairs from the match

$$\mathrm{MatchScore}(match) = \sum_{(x, y) \in match} \mathrm{msim}(x, y)$$
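The two scoring rules differ only in how each matched pair contributes. A small Python sketch, assuming the msim values of the matched pairs are already available as a list (the function names and the sample weights are hypothetical):

```python
from typing import Sequence


def thresholded_score(weights: Sequence[float], theta: float) -> int:
    """Count the matched pairs whose msim value is at least theta."""
    return sum(1 for weight in weights if weight >= theta)


def contiguous_score(weights: Sequence[float]) -> float:
    """Sum the msim values of all matched pairs."""
    return sum(weights)


# Hypothetical msim values for the three pairs of a match.
match_weights = [0.75, 0.5, 0.25]

thresholded = thresholded_score(match_weights, theta=0.5)
contiguous = contiguous_score(match_weights)
print(thresholded, contiguous)
```

With these weights the thresholded score counts two full pairs, while the contiguous score credits every pair in proportion to its similarity.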


3. Hierarchical similarity


Hierarchical similarity

- the algorithm is aware of the hierarchical structure
  - classes (C1) contain methods
  - packages (C2) contain classes and other packages
  - a program is a package
- to compute the similarity between two packages:
  - recursively compute pairwise similarity between components
  - find the best match


Hierarchical similarity algorithm

Algorithm 1 hierarchical-similarity(X, Y)
Require: Two method containers X, Y ∈ C = C1 ∪ C2, X = {x1, x2, ..., xk}, Y = {y1, y2, ..., yp}
Ensure: the similarity between X and Y
  if (X ∈ C1 and Y ∈ C2) or (X ∈ C2 and Y ∈ C1) then
    return 0
  else if X ∈ C1 and Y ∈ C1 then
    w(i, j) ← msim(xi, yj), ∀ 1 ≤ i ≤ |X|, 1 ≤ j ≤ |Y|
  else
    w(i, j) ← hierarchical-similarity(xi, yj), ∀ 1 ≤ i ≤ |X|, 1 ≤ j ≤ |Y|
  end if
  match ← compute-bipartite-match(w)
  return MatchScore(match) / (|X| + |Y| − MatchScore(match))
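A runnable Python sketch of Algorithm 1 follows. Several choices here are assumptions rather than details from the slides: the `ClassNode` and `PackageNode` types, methods modelled as sets of n-gram labels with Jaccard as msim, the contiguous MatchScore, a brute-force search in place of the Hungarian Algorithm (tiny inputs only), and a similarity of 1 for two empty containers.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import FrozenSet, List, Sequence, Union

Method = FrozenSet[str]  # a method as a set of n-gram labels


@dataclass
class ClassNode:  # element of C1: contains methods
    methods: List[Method]


@dataclass
class PackageNode:  # element of C2: contains classes and packages
    children: List[Union["PackageNode", ClassNode]]


def msim(x: Method, y: Method) -> float:
    """Jaccard similarity between two methods."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def match_score(w: Sequence[Sequence[float]]) -> float:
    """Weight of the best bipartite match, found by brute force."""
    if not w or not w[0]:
        return 0.0
    if len(w) > len(w[0]):
        w = [list(column) for column in zip(*w)]  # assign the smaller side
    return max(
        sum(w[i][j] for i, j in enumerate(chosen))
        for chosen in permutations(range(len(w[0])), len(w))
    )


def hierarchical_similarity(x, y) -> float:
    if type(x) is not type(y):
        return 0.0  # a class is never matched with a package
    if isinstance(x, ClassNode):
        xs, ys = x.methods, y.methods
        w = [[msim(a, b) for b in ys] for a in xs]
    else:
        xs, ys = x.children, y.children
        w = [[hierarchical_similarity(a, b) for b in ys] for a in xs]
    score = match_score(w)
    denominator = len(xs) + len(ys) - score
    return score / denominator if denominator else 1.0


# Hypothetical apps: one package holding one class with two methods.
login = frozenset({"a", "b", "c"})
search = frozenset({"c", "d"})
search_v2 = frozenset({"c", "d", "e"})
app1 = PackageNode([ClassNode([login, search])])
app2 = PackageNode([ClassNode([login, search_v2])])

result = hierarchical_similarity(app1, app2)
print(result)
```

In this example the classes score 5/7 against each other, and the enclosing packages score 5/9, so a difference deep in the hierarchy is attenuated again at each level on the way up.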


Algorithm complexity

- Reminder: the Hungarian Algorithm has $O(n^3)$ complexity
- We use the Master Theorem to compute the complexity of our divide and conquer algorithm

$$T(n) = a\,T\!\left(\frac{n}{b}\right) + f(n)$$

  - b: avg. number of components for a package/class (constant)
  - a: number of recursive calls, $a = |X| \cdot |Y| \approx b^2$
  - $f(n) \in O(n^c) \Rightarrow c = 0$ (because $f(n) = O(b^3)$)
  - $\log_b a = \log_b b^2 = 2$, so $c < \log_b a$
- The algorithm complexity is $O(n^{\log_b a}) = O(n^2)$
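The quadratic bound can be checked numerically by unrolling the recurrence. The sketch below assumes an idealised, perfectly balanced hierarchy in which every container has exactly b components, a unit cost for comparing two methods, and a matching cost of exactly b^3 per call; the name `cost` and the value b = 4 are hypothetical.

```python
def cost(n: int, b: int) -> int:
    """T(n) = b^2 * T(n / b) + b^3 with T(1) = 1, for n a power of b."""
    if n <= 1:
        return 1  # comparing two methods is taken as one unit of work
    return b * b * cost(n // b, b) + b ** 3


b = 4  # hypothetical constant branching factor
sizes = [b ** depth for depth in range(1, 7)]
ratios = [cost(n, b) / n ** 2 for n in sizes]
print(ratios)
```

The ratio T(n) / n^2 grows slowly and stays below a constant (1 + b^3 / (b^2 - 1) under these assumptions), which is what an O(n^2) bound predicts.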


4. Experimental results


Algorithm performance (1)

[Figure: running time (s), axis from 0 to 15, plotted against the average number of methods, axis from 0 to 8,000]

Algorithm performance (2)

[Figure: running time (s), axis from 0 to 300, plotted against the average number of methods, axis from 0 to 8,000, for hierarchical similarity and flat similarity]

Classification quality

$$P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN}$$

- At 25% threshold:
  - thresholded: P = 89.06%, R = 95%
  - contiguous: P = 95%, R = 95%
- At 50% threshold: P = 100%
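For reference, precision and recall can be computed from the confusion counts as below. The slides do not give the underlying counts, so the values of TP, FP and FN here are purely hypothetical; they were picked only because they produce the same percentages as the thresholded row and should not be read as the authors' data.

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN)."""
    return tp / (tp + fn)


# Hypothetical counts, not taken from the source.
tp, fp, fn = 57, 7, 3

p = precision(tp, fp)
r = recall(tp, fn)
print(f"P = {p:.2%}, R = {r:.2%}")
```

Raising the similarity threshold typically trades recall for precision, which is consistent with the slide reporting P = 100% at the 50% threshold.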


[Figure: ROC curve for thresholded match, true positive rate vs. false positive rate]

[Figure: ROC curve for contiguous match, true positive rate vs. false positive rate]

5. Conclusions and future work


Conclusions and future work

Conclusions:
- We have designed an algorithm to compute the similarity for programs with a hierarchical structure
- The algorithm has $O(n^2)$ complexity and compares two Android applications in less than 15 seconds on average
- The similarity function has good results in terms of both Precision and Recall

Future work:
- Design a solution to ignore library code when comparing two apps
- Find plagiarism cases in a large collection, without performing pairwise similarity computations


Thank you! Questions?
