Data Mining in Resilient Identity Crime Detection

by

Chun Wei Clifton Phua, BBusSys(Hons), DipIT

Dissertation submitted by Chun Wei Clifton Phua in fulfillment of the requirements for the degree of

Doctor of Philosophy

Supervisors: Prof. Kate Smith-Miles and Assoc. Prof. Vincent Lee
Associate Supervisor: Dr. Ross Gayler

Clayton School of Information Technology
Monash University
December 2007

© Copyright by Chun Wei Clifton Phua 2007
Keywords: resilience, adaptivity, quality data, identity crime detection, credit application fraud detection, string and phonetic matching, communal detection, spike detection, data mining-based fraud detection, security, data stream mining, anomaly detection
For my beloved parents Chye Twee and Siok Moy
Contents

List of Tables
List of Figures
Notation and Abbreviations
Abstract
Acknowledgments

1 Introduction
  1.1 Definitions of Identity Crime
      1.1.1 Credit Application Fraud
      1.1.2 Fraudster Attack Cycle
  1.2 Challenges for Data Mining-based Detection Systems
      1.2.1 Resilience
      1.2.2 Other Challenges
  1.3 Existing Detection System
  1.4 Objectives
  1.5 Contributions
  1.6 Outline

2 Data Mining-based Detection
  2.1 Commercial Interest
  2.2 Fraud Detection
      2.2.1 Overview
      2.2.2 Key Ideas
  2.3 Adversarial-related Detection
      2.3.1 Terrorism
      2.3.2 Financial Crime
      2.3.3 Computer Network Intrusion and Spam
  2.4 Identity Crime-related Detection
  2.5 Summary

3 Data and Measure
  3.1 Responsibility
  3.2 Identity Data
      3.2.1 Real Application DataSet (RADS)
      3.2.2 Name Datasets
  3.3 Evaluation Measure
  3.4 Summary

4 Name Detection
  4.1 Personal Names
  4.2 Related Work
  4.3 Algorithm Design
      4.3.1 Step 1: Name Authenticity
      4.3.2 Step 2: Name Order
      4.3.3 Step 3: Name Gender
  4.4 Experimental Design
  4.5 Results and Discussion
  4.6 Summary

5 Communal Detection
  5.1 Adaptive Whitelist
  5.2 Related Work
  5.3 Algorithm Design
      5.3.1 Step 1: Multi-attribute Link
      5.3.2 Step 2: Single-link Communal Detection
      5.3.3 Step 3: Single-link Average Previous Score
      5.3.4 Step 4: Multiple-links Score
      5.3.5 Step 5: Parameter’s Value Change
      5.3.6 Step 6: Whitelist Change
  5.4 Experimental Design
  5.5 Results and Discussion
  5.6 Concluding Remarks

6 Spike Detection
  6.1 Adaptive Attributes
  6.2 Related Work
  6.3 Algorithm Design
      6.3.1 Step 1: Single-step Scaled Count
      6.3.2 Step 2: Single-value Spike Detection
      6.3.3 Step 3: Multiple-values Score
      6.3.4 Step 4: SD Attributes Selection
      6.3.5 Step 5: CD Attribute Weights Change
  6.4 Experimental Design
  6.5 Results and Discussion
  6.6 Summary

7 Conclusion
  7.1 Chapter Contributions
  7.2 Future Research Directions
      7.2.1 Graph Theory
      7.2.2 Utility Measures
      7.2.3 Web-based Identity Crime Detection
  7.3 Closing Remarks

Appendix A Glossary
Appendix B Attributes
Appendix C Name Verification DataSet (NVDS)
Appendix D Real Application DataSet (RADS) Fraud Patterns
Appendix E Parameter Values
Appendix F CD F-Measures on Sets b and c
Appendix G Monthly F-measures on Experiments a2 and a4
Appendix H Organisations’ F-measures on Experiments a2 and a4
Appendix I CD and SD Visualisations

Vita
List of Tables

1.1 Contributions to credit application fraud detection
3.1 Confusion matrix
3.2 Evaluation measures
4.1 Name Detection (ND) algorithm
5.1 Communal Detection (CD) algorithm
5.2 CD experimental design
5.3 Adaptive CD experimental design
6.1 Spike Detection (SD) algorithm
6.2 SD best attributes experimental design
6.3 SD and strengthened CD experimental design
List of Figures

1.1 Resilient credit application fraud detection system outline
2.1 Data mining-based detection overview
3.1 Daily application volume for two months
3.2 Fraud percentage across months
3.3 Daily fraud percentage for two months
4.1 Name algorithms’ time
4.2 Name authenticity F-measures
4.3 Name order F-measures
4.4 Name gender F-measures
5.1 Communal Detection (CD) F-measures on set a
5.2 Monthly F-measures on experiment a1
5.3 Organisations’ F-measures on experiment a1
5.4 Adaptive CD F-measures on set d
6.1 Spike Detection (SD) F-measures on set e
6.2 SD F-measures on set f
6.3 SD attribute weights on experiments f2, f3, and f4
C.1 Name Verification DataSet (NVDS) region
C.2 NVDS order, gender, culture
D.1 Fraud percentage / average fraud percentage by hour
D.2 Fraud percentage / average fraud percentage by state
D.3 Top forty postcodes by fraud percentage / average fraud percentage
D.4 Fraud percentage / average fraud percentage by organisation
D.5 Top ten organisations by fraud percentage / average fraud percentage
F.1 CD F-measures on set b
F.2 CD F-measures on set c
G.1 Monthly F-measures on experiment a2
G.2 Monthly F-measures on experiment a4
H.1 Organisations’ F-measures on experiment a2
H.2 Organisations’ F-measures on experiment a4
I.1 CD visualisation of known fraud application links
I.2 CD visualisation of known fraud attribute links
I.3 SD visualisation of all attributes
I.4 SD visualisation of attribute sparsity
Notation and Abbreviations

General

RADS: Real Application DataSet.
tp: number of true positives.
fp: number of false positives.
fn: number of false negatives.
tn: number of true negatives.
X: number of decision thresholds.
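For reference, these four counts combine into the evaluation measures reported throughout the thesis (Table 3.2); the formulas below are the standard textbook definitions, not a thesis-specific derivation:

    Precision = tp / (tp + fp)
    Recall    = tp / (tp + fn)
    F-measure = 2 × Precision × Recall / (Precision + Recall)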
Name Detection

ND: Name Detection.
NVDS: Name Verification DataSet.
a_{h,name}: NVDS name.
â_{h,name}: encoded NVDS name.
c_{h,order}: NVDS order label.
c_{h,gender}: NVDS gender label.
c_{h,culture}: NVDS culture label.
NDS: Name DataSet.
a_{i,name}: NDS name.
â_{i,name}: encoded NDS name.
c_{i,fraud}: NDS fraud label.
c_{i,order}: NDS order label.
c_{i,gender}: NDS gender label.
T_{similarity}: string similarity threshold between two values.
a_{i,firstname}: current application’s first name.
a_{i,lastname}: current application’s last name.
a_{i,name-authenticity}: derived authenticity value.
a_{i,name-order}: derived order value.
a_{i,name-gender}: derived gender value.
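Several of these symbols describe a single operation: a name a_{i,name} is phonetically encoded to â_{i,name}, and two values count as a match when their string similarity reaches T_{similarity} (the keywords list this as string and phonetic matching). The Python sketch below is a minimal illustration only; the Soundex-style encoding, difflib-based similarity, helper names, and the 0.8 default threshold are all assumptions for illustration, not the thesis’s actual choices:

    from difflib import SequenceMatcher

    def soundex(name):
        # Classic four-character Soundex code; one possible phonetic encoding.
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        name = "".join(ch for ch in name.lower() if ch.isalpha())
        if not name:
            return ""
        encoded, prev = name[0].upper(), codes.get(name[0], "")
        for ch in name[1:]:
            digit = codes.get(ch, "")
            if digit and digit != prev:
                encoded += digit
            if ch not in "hw":  # h and w do not break up a run of equal codes
                prev = digit
        return (encoded + "000")[:4]

    def names_match(a, b, t_similarity=0.8):
        # Match when the phonetic codes agree, or the raw strings are
        # similar above the threshold (both defaults are hypothetical).
        if soundex(a) == soundex(b):
            return True
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= t_similarity

    print(names_match("Clifton", "Cliffton"))  # True: same Soundex code C413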
Communal Detection

CD: Communal Detection.
G: overall continuous stream.
g_x: current mini-discrete stream.
x: fixed interval of the current month, fortnight, or week in the year.
p: variable number of micro-discrete streams in a mini-discrete stream.
u_{x,y}: current micro-discrete stream.
y: fixed interval of the current day, hour, minute, or second.
q: variable number of applications in a micro-discrete stream.
v_i: unscored current application.
N: number of attributes.
a_{i,k}: current value of attribute k.
W: moving window of previous applications.
v_j: scored previous application.
a_{j,k}: previous value of attribute k.
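To make the windowed-stream notation concrete: each unscored current application v_i is compared against every scored previous application v_j in the moving window W by matching current values a_{i,k} against previous values a_{j,k} over the N attributes (Section 5.3.1, Step 1: Multi-attribute Link). The Python sketch below is illustrative only; the window size, the dict representation of an application, and the helper names are assumptions, and the thesis’s actual link and scoring definitions are more involved:

    from collections import deque

    WINDOW_SIZE = 1000  # hypothetical size for the moving window W

    window = deque(maxlen=WINDOW_SIZE)  # scored previous applications v_j

    def multi_attribute_links(v_i, attributes):
        # Return (matching-value count, v_j) pairs for every previous
        # application v_j sharing at least one attribute value with v_i.
        links = []
        for v_j in window:
            matches = sum(1 for k in attributes
                          if k in v_i and v_i[k] == v_j.get(k))
            if matches > 0:
                links.append((matches, v_j))
        return links

    def score_and_store(v_i, attributes):
        # Compare against the window, then admit v_i; the deque silently
        # evicts the oldest application once WINDOW_SIZE is exceeded.
        links = multi_attribute_links(v_i, attributes)
        window.append(v_i)
        return len(links)  # placeholder score: the raw link count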