Data Mining in Resilient Identity Crime Detection

by

Chun Wei Clifton Phua, BBusSys(Hons), DipIT

Dissertation submitted by Chun Wei Clifton Phua in fulfillment of the requirements for the degree of

Doctor of Philosophy

Supervisors: Prof. Kate Smith-Miles and Assoc. Prof. Vincent Lee
Associate Supervisor: Dr. Ross Gayler

Clayton School of Information Technology
Monash University
December 2007

© Copyright by Chun Wei Clifton Phua 2007
Keywords: resilience, adaptivity, quality data, identity crime detection, credit application fraud detection, string and phonetic matching, communal detection, spike detection, data mining-based fraud detection, security, data stream mining, anomaly detection
For my beloved parents Chye Twee and Siok Moy
Contents

List of Tables
List of Figures
Notation and Abbreviations
Abstract
Acknowledgments

1 Introduction
  1.1 Definitions of Identity Crime
      1.1.1 Credit Application Fraud
      1.1.2 Fraudster Attack Cycle
  1.2 Challenges for Data Mining-based Detection Systems
      1.2.1 Resilience
      1.2.2 Other Challenges
  1.3 Existing Detection System
  1.4 Objectives
  1.5 Contributions
  1.6 Outline

2 Data Mining-based Detection
  2.1 Commercial Interest
  2.2 Fraud Detection
      2.2.1 Overview
      2.2.2 Key Ideas
  2.3 Adversarial-related Detection
      2.3.1 Terrorism
      2.3.2 Financial Crime
      2.3.3 Computer Network Intrusion and Spam
  2.4 Identity Crime-related Detection
  2.5 Summary

3 Data and Measure
  3.1 Responsibility
  3.2 Identity Data
      3.2.1 Real Application DataSet (RADS)
      3.2.2 Name Datasets
  3.3 Evaluation Measure
  3.4 Summary

4 Name Detection
  4.1 Personal Names
  4.2 Related Work
  4.3 Algorithm Design
      4.3.1 Step 1: Name Authenticity
      4.3.2 Step 2: Name Order
      4.3.3 Step 3: Name Gender
  4.4 Experimental Design
  4.5 Results and Discussion
  4.6 Summary

5 Communal Detection
  5.1 Adaptive Whitelist
  5.2 Related Work
  5.3 Algorithm Design
      5.3.1 Step 1: Multi-attribute Link
      5.3.2 Step 2: Single-link Communal Detection
      5.3.3 Step 3: Single-link Average Previous Score
      5.3.4 Step 4: Multiple-links Score
      5.3.5 Step 5: Parameter’s Value Change
      5.3.6 Step 6: Whitelist Change
  5.4 Experimental Design
  5.5 Results and Discussion
  5.6 Concluding Remarks

6 Spike Detection
  6.1 Adaptive Attributes
  6.2 Related Work
  6.3 Algorithm Design
      6.3.1 Step 1: Single-step Scaled Count
      6.3.2 Step 2: Single-value Spike Detection
      6.3.3 Step 3: Multiple-values Score
      6.3.4 Step 4: SD Attributes Selection
      6.3.5 Step 5: CD Attribute Weights Change
  6.4 Experimental Design
  6.5 Results and Discussion
  6.6 Summary

7 Conclusion
  7.1 Chapter Contributions
  7.2 Future Research Directions
      7.2.1 Graph Theory
      7.2.2 Utility Measures
      7.2.3 Web-based Identity Crime Detection
  7.3 Closing Remarks

Appendix A Glossary
Appendix B Attributes
Appendix C Name Verification DataSet (NVDS)
Appendix D Real Application DataSet (RADS) Fraud Patterns
Appendix E Parameter Values
Appendix F CD F-Measures on Sets b and c
Appendix G Monthly F-measures on Experiments a2 and a4
Appendix H Organisations’ F-measures on Experiments a2 and a4
Appendix I CD and SD Visualisations

Vita
List of Tables

1.1 Contributions to credit application fraud detection
3.1 Confusion matrix
3.2 Evaluation measures
4.1 Name Detection (ND) algorithm
5.1 Communal Detection (CD) algorithm
5.2 CD experimental design
5.3 Adaptive CD experimental design
6.1 Spike Detection (SD) algorithm
6.2 SD best attributes experimental design
6.3 SD and strengthened CD experimental design
List of Figures

1.1 Resilient credit application fraud detection system outline
2.1 Data mining-based detection overview
3.1 Daily application volume for two months
3.2 Fraud percentage across months
3.3 Daily fraud percentage for two months
4.1 Name algorithms’ time
4.2 Name authenticity F-measures
4.3 Name order F-measures
4.4 Name gender F-measures
5.1 Communal Detection (CD) F-measures on set a
5.2 Monthly F-measures on experiment a1
5.3 Organisations’ F-measures on experiment a1
5.4 Adaptive CD F-measures on set d
6.1 Spike Detection (SD) F-measures on set e
6.2 SD F-measures on set f
6.3 SD attribute weights on experiments f2, f3, and f4
C.1 Name Verification DataSet (NVDS) region
C.2 NVDS order, gender, culture
D.1 Fraud percentage / average fraud percentage by hour
D.2 Fraud percentage / average fraud percentage by state
D.3 Top forty postcodes by fraud percentage / average fraud percentage
D.4 Fraud percentage / average fraud percentage by organisation
D.5 Top ten organisations by fraud percentage / average fraud percentage
F.1 CD F-measures on set b
F.2 CD F-measures on set c
G.1 Monthly F-measures on experiment a2
G.2 Monthly F-measures on experiment a4
H.1 Organisations’ F-measures on experiment a2
H.2 Organisations’ F-measures on experiment a4
I.1 CD visualisation of known fraud application links
I.2 CD visualisation of known fraud attribute links
I.3 SD visualisation of all attributes
I.4 SD visualisation of attribute sparsity
Notation and Abbreviations

General

RADS: Real Application DataSet.
tp: number of true positives.
fp: number of false positives.
fn: number of false negatives.
tn: number of true negatives.
X: number of decision thresholds.
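For reference, these four counts combine into the evaluation measures reported throughout the thesis (Table 3.2); the formulas below are the standard textbook definitions, not a thesis-specific derivation:

    Precision = tp / (tp + fp)
    Recall    = tp / (tp + fn)
    F-measure = 2 × Precision × Recall / (Precision + Recall)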
Name Detection

ND: Name Detection.
NVDS: Name Verification DataSet.
a_{h,name}: NVDS name.
â_{h,name}: encoded NVDS name.
c_{h,order}: NVDS order label.
c_{h,gender}: NVDS gender label.
c_{h,culture}: NVDS culture label.
NDS: Name DataSet.
a_{i,name}: NDS name.
â_{i,name}: encoded NDS name.
c_{i,fraud}: NDS fraud label.
c_{i,order}: NDS order label.
c_{i,gender}: NDS gender label.
T_{similarity}: string similarity threshold between two values.
a_{i,firstname}: current application’s first name.
a_{i,lastname}: current application’s last name.
a_{i,name-authenticity}: derived authenticity value.
a_{i,name-order}: derived order value.
a_{i,name-gender}: derived gender value.
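Several of these symbols describe a single operation: a name a_{i,name} is phonetically encoded to â_{i,name}, and two values count as a match when their string similarity reaches T_{similarity} (the keywords list this as string and phonetic matching). The Python sketch below is a minimal illustration only; the Soundex-style encoding, difflib-based similarity, helper names, and the 0.8 default threshold are all assumptions for illustration, not the thesis’s actual choices:

    from difflib import SequenceMatcher

    def soundex(name):
        # Classic four-character Soundex code; one possible phonetic encoding.
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        name = "".join(ch for ch in name.lower() if ch.isalpha())
        if not name:
            return ""
        encoded, prev = name[0].upper(), codes.get(name[0], "")
        for ch in name[1:]:
            digit = codes.get(ch, "")
            if digit and digit != prev:
                encoded += digit
            if ch not in "hw":  # h and w do not break up a run of equal codes
                prev = digit
        return (encoded + "000")[:4]

    def names_match(a, b, t_similarity=0.8):
        # Match when the phonetic codes agree, or the raw strings are
        # similar above the threshold (both defaults are hypothetical).
        if soundex(a) == soundex(b):
            return True
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= t_similarity

    print(names_match("Clifton", "Cliffton"))  # True: same Soundex code C413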
Communal Detection

CD: Communal Detection.
G: overall continuous stream.
g_x: current mini-discrete stream.
x: fixed interval of the current month, fortnight, or week in the year.
p: variable number of micro-discrete streams in a mini-discrete stream.
u_{x,y}: current micro-discrete stream.
y: fixed interval of the current day, hour, minute, or second.
q: variable number of applications in a micro-discrete stream.
v_i: unscored current application.
N: number of attributes.
a_{i,k}: current value of attribute k.
W: moving window of previous applications.
v_j: scored previous application.
a_{j,k}: previous value of attribute k.
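To make the windowed-stream notation concrete: each unscored current application v_i is compared against every scored previous application v_j in the moving window W by matching current values a_{i,k} against previous values a_{j,k} over the N attributes (Section 5.3.1, Step 1: Multi-attribute Link). The Python sketch below is illustrative only; the window size, the dict representation of an application, and the helper names are assumptions, and the thesis’s actual link and scoring definitions are more involved:

    from collections import deque

    WINDOW_SIZE = 1000  # hypothetical size for the moving window W

    window = deque(maxlen=WINDOW_SIZE)  # scored previous applications v_j

    def multi_attribute_links(v_i, attributes):
        # Return (matching-value count, v_j) pairs for every previous
        # application v_j sharing at least one attribute value with v_i.
        links = []
        for v_j in window:
            matches = sum(1 for k in attributes
                          if k in v_i and v_i[k] == v_j.get(k))
            if matches > 0:
                links.append((matches, v_j))
        return links

    def score_and_store(v_i, attributes):
        # Compare against the window, then admit v_i; the deque silently
        # evicts the oldest application once WINDOW_SIZE is exceeded.
        links = multi_attribute_links(v_i, attributes)
        window.append(v_i)
        return len(links)  # placeholder score: the raw link count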