Data Mining in Resilient Identity Crime Detection by

Chun Wei Clifton Phua, BBusSys(Hons), DipIT

Dissertation submitted by Chun Wei Clifton Phua in fulfillment of the requirements for the degree of

Doctor of Philosophy

Supervisors: Prof. Kate Smith-Miles and Assoc. Prof. Vincent Lee
Associate Supervisor: Dr. Ross Gayler

Clayton School of Information Technology
Monash University
December 2007

© Copyright by Chun Wei Clifton Phua 2007

Keywords resilience, adaptivity, quality data, identity crime detection, credit application fraud detection, string and phonetic matching, communal detection, spike detection, data mining-based fraud detection, security, data stream mining, and anomaly detection

For my parents Chye Twee and Siok Moy

献给我敬爱的父母再对和惜美 (Dedicated to my beloved parents, Chye Twee and Siok Moy)


Contents

List of Tables . . . . . . . . . . viii
List of Figures . . . . . . . . . . ix
Notation and Abbreviations . . . . . . . . . . xi
Abstract . . . . . . . . . . xv
Acknowledgments . . . . . . . . . . xviii

1 Introduction . . . . . . . . . . 1
  1.1 Definitions of Identity Crime . . . . . . . . . . 1
    1.1.1 Credit Application Fraud . . . . . . . . . . 2
    1.1.2 Fraudster Attack Cycle . . . . . . . . . . 3
  1.2 Challenges for Data Mining-based Detection Systems . . . . . . . . . . 4
    1.2.1 Resilience . . . . . . . . . . 4
    1.2.2 Other Challenges . . . . . . . . . . 5
  1.3 Existing Detection System . . . . . . . . . . 7
  1.4 Objectives . . . . . . . . . . 8
  1.5 Contributions . . . . . . . . . . 9
  1.6 Outline . . . . . . . . . . 9

2 Data Mining-based Detection . . . . . . . . . . 13
  2.1 Commercial Interest . . . . . . . . . . 13
  2.2 Fraud Detection . . . . . . . . . . 16
    2.2.1 Overview . . . . . . . . . . 16
    2.2.2 Key Ideas . . . . . . . . . . 17
  2.3 Adversarial-related Detection . . . . . . . . . . 18
    2.3.1 Terrorism . . . . . . . . . . 18
    2.3.2 Financial Crime . . . . . . . . . . 19
    2.3.3 Computer Network Intrusion and Spam . . . . . . . . . . 21
  2.4 Identity Crime-related Detection . . . . . . . . . . 22
  2.5 Summary . . . . . . . . . . 23

3 Data and Measure . . . . . . . . . . 25
  3.1 Responsibility . . . . . . . . . . 25
  3.2 Identity Data . . . . . . . . . . 28
    3.2.1 Real Application DataSet (RADS) . . . . . . . . . . 28
    3.2.2 Name Datasets . . . . . . . . . . 32
  3.3 Evaluation Measure . . . . . . . . . . 33
  3.4 Summary . . . . . . . . . . 34

4 Name Detection . . . . . . . . . . 37
  4.1 Personal Names . . . . . . . . . . 37
  4.2 Related Work . . . . . . . . . . 38
  4.3 Algorithm Design . . . . . . . . . . 38
    4.3.1 Step 1: Name Authenticity . . . . . . . . . . 39
    4.3.2 Step 2: Name Order . . . . . . . . . . 41
    4.3.3 Step 3: Name Gender . . . . . . . . . . 42
  4.4 Experimental Design . . . . . . . . . . 43
  4.5 Results and Discussion . . . . . . . . . . 44
  4.6 Summary . . . . . . . . . . 46

5 Communal Detection . . . . . . . . . . 49
  5.1 Adaptive Whitelist . . . . . . . . . . 49
  5.2 Related Work . . . . . . . . . . 50
  5.3 Algorithm Design . . . . . . . . . . 52
    5.3.1 Step 1: Multi-attribute Link . . . . . . . . . . 54
    5.3.2 Step 2: Single-link Communal Detection . . . . . . . . . . 54
    5.3.3 Step 3: Single-link Average Previous Score . . . . . . . . . . 55
    5.3.4 Step 4: Multiple-links Score . . . . . . . . . . 55
    5.3.5 Step 5: Parameter’s Value Change . . . . . . . . . . 56
    5.3.6 Step 6: Whitelist Change . . . . . . . . . . 57
  5.4 Experimental Design . . . . . . . . . . 57
  5.5 Results and Discussion . . . . . . . . . . 59
  5.6 Concluding Remarks . . . . . . . . . . 62

6 Spike Detection . . . . . . . . . . 63
  6.1 Adaptive Attributes . . . . . . . . . . 63
  6.2 Related Work . . . . . . . . . . 64
  6.3 Algorithm Design . . . . . . . . . . 66
    6.3.1 Step 1: Single-step Scaled Count . . . . . . . . . . 67
    6.3.2 Step 2: Single-value Spike Detection . . . . . . . . . . 67
    6.3.3 Step 3: Multiple-values Score . . . . . . . . . . 68
    6.3.4 Step 4: SD Attributes Selection . . . . . . . . . . 68
    6.3.5 Step 5: CD Attribute Weights Change . . . . . . . . . . 68
  6.4 Experimental Design . . . . . . . . . . 69
  6.5 Results and Discussion . . . . . . . . . . 70
  6.6 Summary . . . . . . . . . . 73

7 Conclusion . . . . . . . . . . 75
  7.1 Chapter Contributions . . . . . . . . . . 75
  7.2 Future Research Directions . . . . . . . . . . 77
    7.2.1 Graph Theory . . . . . . . . . . 77
    7.2.2 Utility Measures . . . . . . . . . . 78
    7.2.3 Web-based Identity Crime Detection . . . . . . . . . . 78
  7.3 Closing Remarks . . . . . . . . . . 80

Appendix A Glossary . . . . . . . . . . 81
Appendix B Attributes . . . . . . . . . . 85
Appendix C Name Verification DataSet (NVDS) . . . . . . . . . . 87
Appendix D Real Application DataSet (RADS) Fraud Patterns . . . . . . . . . . 89
Appendix E Parameter Values . . . . . . . . . . 93
Appendix F CD F-Measures on Sets b and c . . . . . . . . . . 95
Appendix G Monthly F-measures on Experiments a2 and a4 . . . . . . . . . . 97
Appendix H Organisations’ F-measures on Experiments a2 and a4 . . . . . . . . . . 99
Appendix I CD and SD Visualisations . . . . . . . . . . 101

Vita . . . . . . . . . . 123

List of Tables

1.1 Contributions to credit application fraud detection . . . . . . . . . . 10
3.1 Confusion matrix . . . . . . . . . . 33
3.2 Evaluation measures . . . . . . . . . . 34
4.1 Name Detection (ND) algorithm . . . . . . . . . . 39
5.1 Communal Detection (CD) algorithm . . . . . . . . . . 53
5.2 CD experimental design . . . . . . . . . . 57
5.3 Adaptive CD experimental design . . . . . . . . . . 58
6.1 Spike Detection (SD) algorithm . . . . . . . . . . 66
6.2 SD best attributes experimental design . . . . . . . . . . 69
6.3 SD and strengthened CD experimental design . . . . . . . . . . 69

List of Figures

1.1 Resilient credit application fraud detection system outline . . . . . . . . . . 11
2.1 Data mining-based detection overview . . . . . . . . . . 16
3.1 Daily application volume for two months . . . . . . . . . . 29
3.2 Fraud percentage across months . . . . . . . . . . 30
3.3 Daily fraud percentage for two months . . . . . . . . . . 31
4.1 Name algorithms’ time . . . . . . . . . . 40
4.2 Name authenticity F-measures . . . . . . . . . . 45
4.3 Name order F-measures . . . . . . . . . . 45
4.4 Name gender F-measures . . . . . . . . . . 46
5.1 Communal Detection (CD) F-measures on set a . . . . . . . . . . 59
5.2 Monthly F-measures on experiment a1 . . . . . . . . . . 60
5.3 Organisations’ F-measures on experiment a1 . . . . . . . . . . 61
5.4 Adaptive CD F-measures on set d . . . . . . . . . . 61
6.1 Spike Detection (SD) F-measures on set e . . . . . . . . . . 71
6.2 SD F-measures on set f . . . . . . . . . . 71
6.3 SD attribute weights on experiments f2, f3, and f4 . . . . . . . . . . 72
C.1 Name Verification DataSet (NVDS) region . . . . . . . . . . 87
C.2 NVDS order, gender, culture . . . . . . . . . . 88
D.1 Fraud percentage / average fraud percentage by hour . . . . . . . . . . 89
D.2 Fraud percentage / average fraud percentage by state . . . . . . . . . . 90
D.3 Top forty postcodes by fraud percentage / average fraud percentage . . . . . . . . . . 90
D.4 Fraud percentage / average fraud percentage by organisation . . . . . . . . . . 91
D.5 Top ten organisations by fraud percentage / average fraud percentage . . . . . . . . . . 91
F.1 CD F-measures on set b . . . . . . . . . . 95
F.2 CD F-measures on set c . . . . . . . . . . 96
G.1 Monthly F-measures on experiment a2 . . . . . . . . . . 97
G.2 Monthly F-measures on experiment a4 . . . . . . . . . . 98
H.1 Organisations’ F-measures on experiment a2 . . . . . . . . . . 99
H.2 Organisations’ F-measures on experiment a4 . . . . . . . . . . 100
I.1 CD visualisation of known fraud application links . . . . . . . . . . 101
I.2 CD visualisation of known fraud attribute links . . . . . . . . . . 102
I.3 SD visualisation of all attributes . . . . . . . . . . 102
I.4 SD visualisation of attribute sparsity . . . . . . . . . . 103

Notation and Abbreviations

General
RADS: Real Application DataSet.
tp: number of true positives.
fp: number of false positives.
fn: number of false negatives.
tn: number of true negatives.
X: number of decision thresholds.
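The confusion-matrix counts defined above (tp, fp, fn, tn) feed the evaluation measures of Chapter 3 and the F-measures reported throughout the experimental chapters. A minimal sketch of how the counts combine, using the standard precision/recall/F-measure formulas (the function name and layout here are illustrative, not code from the thesis):

```python
# Illustrative: standard precision, recall, and balanced F-measure
# computed from the confusion-matrix counts tp, fp, fn defined above.

def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 80 frauds caught, 20 false alarms, 40 frauds missed.
print(round(f_measure(tp=80, fp=20, fn=40), 3))  # 0.727
```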

Name Detection
ND: Name Detection.
NVDS: Name Verification DataSet.
a_{h,name}: NVDS name.
â_{h,name}: encoded NVDS name.
c_{h,order}: NVDS order label.
c_{h,gender}: NVDS gender label.
c_{h,culture}: NVDS culture label.
NDS: Name DataSet.
a_{i,name}: NDS name.
â_{i,name}: encoded NDS name.
c_{i,fraud}: NDS fraud label.
c_{i,order}: NDS order label.
c_{i,gender}: NDS gender label.
T_{similarity}: string similarity threshold between two values.
a_{i,firstname}: current application's first name.
a_{i,lastname}: current application's last name.
a_{i,name-authenticity}: derived authenticity value.
a_{i,name-order}: derived order value.
a_{i,name-gender}: derived gender value.
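The threshold T_{similarity} above compares two string values for approximate equality, in the spirit of the string and phonetic matching the thesis applies to names. A hedged sketch using a normalized Levenshtein similarity; the thesis's actual matching algorithms and threshold value may differ, and the 0.8 default here is purely illustrative:

```python
# Illustrative string matching: two values are treated as matching when
# their normalized edit-distance similarity meets a threshold
# (the role played by T_similarity in the notation above).

def levenshtein(s: str, t: str) -> int:
    """Classic edit distance via dynamic programming, O(|s|*|t|)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def similar(a: str, b: str, t_similarity: float = 0.8) -> bool:
    """True when normalized similarity reaches the threshold."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest >= t_similarity

print(similar("clifton", "cliffton"))  # True: one edit in eight characters
```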

Communal Detection
CD: Communal Detection.
G: overall continuous stream.
g_x: current mini-discrete stream.
x: fixed interval of the current month, fortnight, or week in the year.
p: variable number of micro-discrete streams in a mini-discrete stream.
u_{x,y}: current micro-discrete stream.
y: fixed interval of the current day, hour, minute, or second.
q: variable number of applications in a micro-discrete stream.
v_i: unscored current application.
N: number of attributes.
a_{i,k}: current value.
W: moving window of previous applications.
v_j: scored previous application.
a_{j,k}: previous value.
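The CD notation above describes a stream set-up in which each unscored current application v_i is compared against a moving window W of scored previous applications, linking applications whose attribute values match. A minimal sketch of that set-up, assuming dict-valued applications; the window size and the two-matching-attributes link rule here are illustrative placeholders, not the thesis's tuned parameters:

```python
# Illustrative stream processing: each incoming application is compared
# against a fixed-size moving window W of previous applications, and a
# link is recorded when at least two attribute values match.

from collections import deque

WINDOW_SIZE = 4  # |W|: illustrative moving-window size

def matching_attributes(v_i: dict, v_j: dict) -> list:
    """Names of attributes whose values match between two applications."""
    return [k for k in v_i if v_j.get(k) == v_i[k]]

def process_stream(applications):
    """Yield (application, links-to-window) pairs over a stream."""
    window = deque(maxlen=WINDOW_SIZE)  # W: scored previous applications
    for v_i in applications:            # v_i: unscored current application
        links = []
        for v_j in window:
            matches = matching_attributes(v_i, v_j)
            if len(matches) >= 2:       # illustrative link rule
                links.append(matches)
        yield v_i, links
        window.append(v_i)  # v_i now joins W as a previous application
```

The deque's `maxlen` makes the oldest application fall out automatically as the window slides, which is what keeps comparison cost bounded per incoming application.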
