Incorrect attribute value detection for traffic accident data

Rupam Deb
School of Information and Communication Technology
Griffith University, Gold Coast, Australia
[email protected]

Alan Wee-Chung Liew
School of Information and Communication Technology
Griffith University, Gold Coast, Australia
[email protected]

Abstract—Safe and sustainable road systems are a common goal in all countries, and applications to assist with road asset management and crash minimization are sought universally. A road accident is a special kind of trauma that constitutes a major cause of disability, untimely death, and the loss of loved ones and family breadwinners. Methods to reduce accident severity are therefore of great interest to traffic agencies and the public at large. To analyse traffic accident factors effectively we need a noise-free dataset. Road accident fatality rates depend on many factors, and investigating the dependencies between the attributes is very challenging because of the many environmental and road accident factors involved. Noisy data in the dataset can obscure the discovery of important factors and mislead conclusions. In this paper, we present a novel approach called Neural network and Co-appearance based analysis for Noisy Attribute values Identification (NCNAI). NCNAI separates noisy records from clean records and also identifies the incorrect attribute values. We evaluate our algorithm using two publicly available traffic accident databases of the United States: the largest open federal database in the United States (explore.data.gov), and one based on the National Incident Based Reporting System (NIBRS) of the City and County of Denver (data.opencolorado.org). We compare our technique with three existing methods using several well-known evaluation criteria: value-based Error Recall (vER), value-based Error Precision (vEP), record-based Error Recall (rER), record-based Error Precision (rEP), and record Removal Ratio (rRR). Our results indicate that the proposed method performs significantly better than the three existing algorithms.

Keywords—neural network; co-appearance; data pre-processing; data mining

I. INTRODUCTION

A nation's socio-economic development is highly dependent on the health status of its citizens. Road safety, which is mainly affected by road accidents, is said to be one of the major health concerns. The burden of road accident casualties and damage is a major problem for both developed and developing countries. To reduce the number of road traffic accidents, it is necessary to characterize the causes of accidents and to determine the factors that significantly affect the severity of injuries in road crashes. The global economic cost of road traffic accidents has been estimated at US$518 billion, and has been calculated to account for 0.3% to 4% of the gross national product of many countries [12].

978-1-4799-1959-8/15/$31.00 © 2015 IEEE

The number of people killed globally in road traffic crashes every year is estimated to be 1.2 million, with approximately 50 million injured. In Western Australia during 2009 there were 193 fatalities, and 62% of these occurred in regional areas. In addition, 2755 people were admitted to hospital with serious injuries following road crashes in 2006 alone. Road crashes place a heavy burden in terms of both individual and community cost. The annual cost of road traffic crashes in 2003 was AUD$17 billion at the national level and AUD$1.9 billion at the state level [9-11]. The issue of traffic safety has raised great concern across the globe, and it has become one of the key issues challenging the sustainable development of modern traffic and transportation.

Data mining is an analytic process designed to explore large amounts of data, with prediction as its ultimate goal. Nowadays, large amounts of traffic data are being collected with the advancement of sensor technologies [13]. Using data mining techniques such as classification and clustering, we can uncover patterns of traffic activity and the factors that lead to accidents. The extracted patterns play a vital role in various decision-making processes. Data preprocessing and cleansing play a crucial role in data mining by ensuring good data quality. Data preprocessing includes imputation of missing values, smoothing of noisy data, identification of incorrect data, and correction of inconsistent data, and it takes almost 80% of the total data mining effort [14]. To extract useful information using data mining algorithms, we need noise-free and correct datasets; the presence of noisy values in a dataset is a barrier to the performance of data mining algorithms. Usually data are collected from heterogeneous sources and then integrated into a single dataset to extract useful information. During collection, storage and preparation, data often get corrupted for various reasons, including equipment malfunction, human error, misunderstanding, and faulty data transmission. If an organization does not take extreme care during data collection, approximately 5% or more missing/corrupt data may be introduced into its datasets [15-19].

In this paper, we propose a novel method called NCNAI for detecting noisy attribute values and records in traffic crash datasets. NCNAI produces a clean dataset by separating noisy

records from clean records in the dataset. Our experimental results show that the proposed algorithm has better accuracy than other well-known techniques. In our study, we use two truck crash datasets and the accident dataset of Denver County, totalling almost 177000 records. Most of the columns (attributes) of these records are categorical. Each attribute takes from 2 to more than 700 distinct values. The domain of the categorical data is represented by nominal, ordinal and interval-based variables. Nominal variables have values with no natural ordering (airbag condition: Ruptured, Cut, Torn), ordinal variables do have a natural order (day of the week), and interval variables are created from intervals on a contiguous scale (age group 13-19).

This paper is organized as follows. In Section II we present a literature review of related work. Our proposed technique is described in Section III. Experimental results are discussed in Section IV. Finally, Section V draws the concluding remarks.

II. RELATED WORK

Identification of noisy data is an important data mining task for improving the quality of data mining results. Many noise detection algorithms have been proposed for various applications [1-8, 20]. Among them, Error Detection and Impact-sensitive instance Ranking (EDIR) [5], Co-appearance based Analysis for Incorrect Records and Attribute-values Detection (CAIRAD) [3], Structured Program for Economic Editing and Referrals (SPEER) [6], the automatic edit detection and imputation algorithm [1], and the RDCL case profiling technique [2] are well-known noise identification algorithms.

A class attribute is a label that represents a record. To identify and correct noise, the Polishing technique [8] assumes that a dataset has a natural class attribute. The technique has two stages, a prediction stage and an adjustment stage. In the prediction stage, it builds decision trees using C4.5 and identifies the set of records that are misclassified by those trees. In the adjustment stage, the attribute values of each misclassified record are modified so that the record is correctly classified for the natural class attribute. However, the performance of this technique is low if the dataset has low classification accuracy, since it relies on the accuracy of the decision trees.

EDIR [5] locates erroneous instances and attributes, and ranks suspicious instances by their impact on system performance. EDIR first trains a benchmark classifier T from the noisy dataset D. The instances that cannot be classified by T are treated as suspicious and forwarded to a subset S. An instance contains n attributes A1, A2, ..., An, and each attribute Ai has Vi possible values. To rank the instances in S, EDIR uses an impact measure based on the information-gain ratio. Like the Polishing technique, EDIR changes attribute values to correctly classify a record, but it differs in some respects: unlike Polishing, EDIR can change two or three attribute values at a time if a suspicious record remains misclassified, and the record is stored separately if it remains suspicious even after changing all combinations of two or three values.

The RDCL method [2] uses the k-NN technique to classify a record. The dataset is first divided into training and testing sets. RDCL then classifies each record in the testing set using the majority class of its k nearest neighbours in the training set. RDCL identifies only suspicious records, whereas the Polishing technique and EDIR detect both noisy attribute values and noisy records.

The CAIRAD algorithm [3] exploits the co-appearance between attribute values to detect the noisy values of a dataset. To detect noisy values, the authors compute an observed co-appearance matrix and the expected co-appearance values; these two matrices are generated for all attribute values. If both are the same for an attribute value, the value is declared clean; otherwise it is noisy.

Another interesting technique, "Automatic Edits" [1], detects unrealistic co-occurrences of attribute values in a record. This method first generates all types of edits for the dataset; the data in each record should then be made to satisfy all edits by changing the fewest possible attribute values. The computational complexity of both editing and correction is the major disadvantage of this method: the changes are time-consuming, expensive and error-prone. Another limitation is that it can often be difficult to find a good set of rules.

Like "Automatic Edits", SPEER [6] identifies errors using user-supplied explicit edits. It is divided into four main components: edit generation, edit checking, error localization and imputation. In the edit generation stage, a set of explicit edits is generated. Edit checking determines which edits pass or fail for a given record. If a record fails one or more edits, it is sent to error localization to determine a set of fields to delete so that the remaining fields will be mutually consistent, i.e. the remaining fields will jointly fail no edits. Imputation is then performed for the deleted fields so that all fields of the record are mutually consistent.

In practical scenarios most of the attribute values of a dataset are clean and the volume of noise is low. Another notable point is that noisy values have a random and independent nature and are not correlated with the occurrence of any other values in the dataset; it is very rare for a noisy value to appear repeatedly as a result of the introduction of noise [3].

III. PROPOSED TECHNIQUE

We now present our novel algorithm, the neural network and co-appearance based noisy attribute values identification (NCNAI) algorithm. We illustrate the characteristics of the proposed technique with the toy dataset in Table I.

TABLE I. SAMPLE DATASET D

| Record | District | Severity | Factor    |
|--------|----------|----------|-----------|
| R1     | Mackay   | Fatal    | Manoeuvre |
| R2     | Brisbane | Fatal    | Direction |
| R3     | Brisbane | Hospital | Alcohol   |
| R4     | Brisbane | Fatal    | Manoeuvre |
| R5     | Mackay   | Hospital | Alcohol   |
| R6     | Townsvil | Hospital | Alcohol   |
| R7     | Townsvil | Fatal    | Manoeuvre |
| R8     | Mackay   | Fatal    | Alcohol   |
| R9     | Townsvil | Headache | Manoeuvre |
| R10    | Brisbane | Hospital | Manoeuvre |
Step-I: Generate a co-appearance matrix (CAM) to calculate the co-appearance between attribute values and identify low co-appeared values.

The CAM is generated from the co-occurrence counts of pairs of attribute values and is shown in Table II. The first column gives the attribute names and the second column gives the associated attribute values; the column headings Fatal, Headache and Hospital belong to the Severity attribute, and Manoeuvre, Direction and Alcohol belong to the Factor attribute. Each numerical entry is the co-occurrence count of two attribute values; for example, the count of the "Fatal" value of the "Severity" attribute with the "Manoeuvre" value of the "Factor" attribute is 3. Next, we check for low co-appearance values. In the CAM, "Headache" and "Direction" have low co-appearance: "Headache" has a total co-appearance of 2 (1 in its column and 1 in its row). These two values ("Headache" and "Direction") are marked as possible noisy values.

TABLE II. CAM FOR DATASET D

|          |          | Fatal | Headache | Hospital | Manoeuvre | Direction | Alcohol |
|----------|----------|-------|----------|----------|-----------|-----------|---------|
| District | Mackay   | 2     | 0        | 1        | 1         | 0         | 2       |
|          | Brisbane | 2     | 0        | 2        | 2         | 1         | 1       |
|          | Townsvil | 1     | 1        | 1        | 2         | 0         | 1       |
| Severity | Fatal    |       |          |          | 3         | 1         | 1       |
|          | Headache |       |          |          | 1         | 0         | 0       |
|          | Hospital |       |          |          | 1         | 0         | 3       |

Step-II: Identify low co-appeared attribute values from the CAM as possible noisy attribute values.

An attribute value is marked as possibly noisy if its co-appearance is lower than the co-appearance threshold CO. The value of CO depends on the number of attributes and the volume of the dataset. For our toy dataset, CO is set to 2 (number of attributes - 1), i.e. if a value appears in only one record (our toy dataset contains only 10 records) it should be checked to make sure that it is clean. For the experiments on our two real datasets, we set CO to 2 x (number of attributes - 1). The first reason for this threshold is that it is very rare for noise to appear repeatedly; here we assume that a noisy value will repeat at most twice in the dataset, so any value whose co-appearance falls below this threshold should be checked for noise. The second reason is that each dataset in our experiments contains more than 87000 records, and in such a large dataset a noisy value will usually repeat at most once or twice. In our algorithm, CO can also be set separately for each domain value of each attribute.

Step-III: Identify noisy records using a neural network.

A multilayer perceptron neural network is used to detect whether a record is noisy or not. To train the neural network, we divide the dataset into a training part and a testing part. To simulate noisy data, we artificially create noisy records in the training dataset; the corrupted records are labelled as noisy, and the uncorrupted records are labelled as clean. Based on our domain knowledge of traffic accident data, we inject noise into the attributes that have a major influence on traffic accidents and injuries. The trained neural network then checks each record of the dataset: if the network's prediction for a record is greater than 0.5, the record is marked as noisy.

Step-IV: Finally, declare attribute values and records as noisy.

We now combine Step-II and Step-III: if the values marked as possibly noisy in Step-II reside in a record marked as noisy in Step-III, we declare those attribute values and that record as noisy, and separate them from the clean records. For example, in Table I, R2 and R9 are noisy, with the noisy values "Direction" and "Headache"; these are shown in Table III. Table IV gives the set of clean records.

TABLE III. NOISY ATTRIBUTES AND NOISY RECORDS OF D

| Record | District | Severity | Factor    |
|--------|----------|----------|-----------|
| R2     | Brisbane | Fatal    | Direction |
| R9     | Townsvil | Headache | Manoeuvre |

TABLE IV. CLEAN ATTRIBUTES AND CLEAN RECORDS OF D

| Record | District | Severity | Factor    |
|--------|----------|----------|-----------|
| R1     | Mackay   | Fatal    | Manoeuvre |
| R3     | Brisbane | Hospital | Alcohol   |
| R4     | Brisbane | Fatal    | Manoeuvre |
| R5     | Mackay   | Hospital | Alcohol   |
| R6     | Townsvil | Hospital | Alcohol   |
| R7     | Townsvil | Fatal    | Manoeuvre |
| R8     | Mackay   | Fatal    | Alcohol   |
| R10    | Brisbane | Hospital | Manoeuvre |

A. Proposed algorithm

Notation: dataset D has N records and M attributes; |Aj| is the domain size of attribute Aj; Fpy is the co-appearance of the p-th domain value of attribute Ai and the y-th domain value of attribute Aj; C is the co-appearance matrix; COpj is the co-appearance threshold of the p-th domain value of attribute Aj; RiAj is the value of attribute Aj in record Ri; T is the neural network classifier.

START
Step I: Generate the co-appearance matrix
  Set z <- 1
  FOR i = 1 to M-1 DO
    FOR p = 1 to |Ai| DO
      Set k <- 1
      FOR j = (i+1) to M DO
        FOR y = 1 to |Aj| DO
          Set Czk <- Fpy
          Increment k by 1
        END FOR
      END FOR
      Increment z by 1
    END FOR
  END FOR
Step II: Identify low co-appeared values
  FOR i = 1 to N DO
    FOR j = 1 to M DO
      FOR all domain values p of attribute j DO
        C <- co-appearance between the p-th domain value of attribute j and the other attribute values of record i
        IF C < COpj THEN
          Set RiAj <- "noisy"
        END IF
      END FOR
    END FOR
  END FOR
Step III: Identify noisy records using the neural network
  Train neural network classifier T using noisy and clean records
  FOR i = 1 to N DO
    IF T(Ri) > 0.5 THEN
      FOR j = 1 to M DO
        IF RiAj is "noisy" THEN
          Set Ri <- "noisy"
        END IF
      END FOR
    END IF
  END FOR
Step IV: Get clean records with clean attribute values
END
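As an illustrative sketch (not from the paper), Steps I and II can be implemented directly on the toy dataset of Table I. The names `cam`, `total` and `suspects` below are ours; the threshold `CO = 2` follows the toy example, where a value appearing in only one record has a total co-appearance of exactly 2.

```python
from collections import Counter
from itertools import combinations

# Toy dataset D from Table I: (District, Severity, Factor)
records = [
    ("Mackay",   "Fatal",    "Manoeuvre"),  # R1
    ("Brisbane", "Fatal",    "Direction"),  # R2
    ("Brisbane", "Hospital", "Alcohol"),    # R3
    ("Brisbane", "Fatal",    "Manoeuvre"),  # R4
    ("Mackay",   "Hospital", "Alcohol"),    # R5
    ("Townsvil", "Hospital", "Alcohol"),    # R6
    ("Townsvil", "Fatal",    "Manoeuvre"),  # R7
    ("Mackay",   "Fatal",    "Alcohol"),    # R8
    ("Townsvil", "Headache", "Manoeuvre"),  # R9
    ("Brisbane", "Hospital", "Manoeuvre"),  # R10
]

CO = 2  # toy threshold: number of attributes - 1, as in the paper's example

# Step I: co-appearance count for every pair of values from different attributes
cam = Counter()
for rec in records:
    for (i, u), (j, v) in combinations(enumerate(rec), 2):
        cam[(i, u), (j, v)] += 1

# Step II: a value's total co-appearance is the sum of its row and column in the
# CAM; values whose total does not exceed CO (i.e. values that appear in only
# one record here) are flagged as possible noise
total = Counter()
for ((i, u), (j, v)), c in cam.items():
    total[(i, u)] += c
    total[(j, v)] += c

suspects = {value for (attr, value), c in total.items() if c <= CO}
print(sorted(suspects))  # ['Direction', 'Headache']
```

This reproduces the worked example: "Direction" and "Headache" are the only values flagged, matching Table III.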

IV. EXPERIMENTAL RESULTS

We compare our algorithm NCNAI with three existing techniques, namely CAIRAD [3], EDIR [5], and RDCL [2]. We evaluate the proposed algorithm using five performance indicators: record-based Error Recall (rER), record-based Error Precision (rEP), value-based Error Recall (vER), value-based Error Precision (vEP), and record Removal Ratio (rRR). We introduce the evaluation criteria in equations 2-6. Let

n = total number of records in the dataset
r = number of records with artificially created noisy values
v = number of artificially created noisy values
p = number of correctly detected noisy values
c = number of correctly detected noisy records
d = total number of detected noisy records
q = total number of detected noisy values

rER = c / r    (2)
rEP = c / d    (3)
vER = p / v    (4)
vEP = p / q    (5)
rRR = d / n    (6)
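Assuming the standard ratio definitions implied by the variable list above, the five criteria can be sketched as follows (the helper name `metrics` and the example counts are ours, for illustration only):

```python
def metrics(n, r, v, p, c, d, q):
    """n: total records; r: records with injected noise; v: injected noisy values;
    p: correctly detected noisy values; c: correctly detected noisy records;
    d: all detected noisy records; q: all detected noisy values."""
    return {
        "rER": c / r,   # record-based error recall    (Eq. 2)
        "rEP": c / d,   # record-based error precision (Eq. 3)
        "vER": p / v,   # value-based error recall     (Eq. 4)
        "vEP": p / q,   # value-based error precision  (Eq. 5)
        "rRR": d / n,   # record removal ratio         (Eq. 6)
    }

# Hypothetical counts: 100 records, 10 corrupted, 8 of them correctly detected
m = metrics(n=100, r=10, v=15, p=12, c=8, d=11, q=14)
print(m["rER"])  # 0.8
```

High recall with low removal ratio is the desired behaviour: detecting most injected noise without discarding many clean records.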

The F-measure is the weighted harmonic mean of precision and recall. In our experiments, we use the two most commonly used F-measures, namely F0.5 and F1, defined in equation 7. We use two approaches for calculating a weighted score from error recall, error precision and record removal ratio, given in equations 8 and 9: in Score(P) more importance is given to error precision, while in Score(R) more importance is given to error recall.

F(beta) = ((1 + beta^2) * EP * ER) / (beta^2 * EP + ER), beta in {0.5, 1}    (7)
Score(P) = 0.6 * EP + 0.3 * ER + 0.1 * (1 - RR)    (8)
Score(R) = 0.6 * ER + 0.3 * EP + 0.1 * (1 - RR)    (9)
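As a sketch of equations 7-9: the 0.6 and 0.3 weights appear in the text, while the 0.1 weight on (1 - RR) is our assumption, chosen so that the three weights sum to one; the function names are ours.

```python
def f_beta(precision, recall, beta):
    # Weighted harmonic mean of precision and recall (Eq. 7); beta in {0.5, 1}
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def score_p(ep, er, rr):
    # Precision-weighted aggregate score (Eq. 8); 0.1 weight on (1-rr) assumed
    return 0.6 * ep + 0.3 * er + 0.1 * (1 - rr)

def score_r(ep, er, rr):
    # Recall-weighted aggregate score (Eq. 9); 0.1 weight on (1-rr) assumed
    return 0.6 * er + 0.3 * ep + 0.1 * (1 - rr)

print(round(f_beta(0.5, 0.8, 1.0), 4))   # 0.6154
print(round(score_p(0.5, 0.8, 0.1), 2))  # 0.63
```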

We experiment on 43 text data files (Large Truck Crash Causation Study Files 1 and 2) and the accident dataset of Denver County. The Denver County dataset includes accidents in the City and County of Denver for the previous five calendar years plus the current year to date (30 June 2014). The two Large Truck Crash Causation Study Files have different numbers of attributes and contain 92871 records; the Denver County dataset has 89194 traffic accident records. In both, most attributes (90%) are categorical. The two Large Truck Crash Causation Study File datasets contain 3192 records with missing values, and the Denver County dataset contains 1902. We first remove the records having missing values, thereby creating datasets of 89679 records (Large Truck Crash Causation Study Files) and 87292 records (Denver County). We set CO = 2 x (number of attributes - 1); note that the Large Truck crash files have a variable number of attributes.

We artificially create noisy values in each of the datasets and then apply the different algorithms to detect the noise. Since the original values of the artificially created noisy values are known to us, we can evaluate the performance of the noise identification techniques. Naturally occurring noisy values are random and difficult to simulate. Here, we combine four noise patterns, four noise ratios and two noise models to artificially create noisy values. We use four types of noise patterns [18-19]: in the simple pattern, a record can have at most one noisy value; in the medium pattern, a record can have noisy values in 2-50% of the total number of attributes; in the complex pattern, a record can have noisy values in 51-80% of the total number of attributes; and the blended pattern contains 25% of records with the simple pattern, 50% with the medium pattern and 25% with the complex pattern. We also use two types of noise models, namely overall and uniformly distributed (UD). In the UD model, each attribute has an equal number of noisy values; in the overall model, noisy values are not equally distributed among the attributes, and in the worst case all noisy values can belong to a single attribute. With the 4 noise patterns (simple, medium, complex and blended), 4 noise ratios (2%, 4%, 8%, and 12%) and the two noise models (overall and UD), we have altogether 32 noise combinations. For each combination we use 200 datasets, so we created 6400 datasets (32 combinations x 200 datasets each) for each of the Large Truck Crash Causation Study Files and Denver County datasets.

We present the performance of our algorithm and the three well-known existing techniques (CAIRAD, RDCL, and EDIR) on Denver County in Table V. The best results are marked in bold. Our algorithm wins in all but 5 combinations on the two important evaluation criteria rER and rEP. We obtain better results because the proposed technique exploits both the co-appearance matrix and a neural network to identify noisy attribute values and records, whereas the other algorithms use either a classification technique or statistical prediction alone. Table VI presents the top-level aggregate results for the two datasets, and Table VII presents the aggregate results for the two datasets over the 32 combinations. Tables VI and VII show that NCNAI performs better than the other algorithms except on record removal ratio. Performance on the two datasets using two evaluation criteria is shown in Fig. 1; it is computed on record-based noisy values. Using Student's t-test on independent samples, we obtain the confidence level from the mean and standard deviation.
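The noise injection for the simple pattern can be sketched as follows (a minimal illustration under our own assumptions; the helper `inject_simple_noise` and the synthetic data are ours, not from the paper):

```python
import random

def inject_simple_noise(records, domains, ratio, rng):
    """Simple pattern: each corrupted record receives exactly one noisy value,
    drawn from the attribute's own domain but different from the clean value."""
    out = [list(r) for r in records]
    k = max(1, int(ratio * len(out)))           # number of records to corrupt
    for i in rng.sample(range(len(out)), k):    # distinct records (overall model)
        j = rng.randrange(len(out[0]))          # one attribute per record
        alternatives = [x for x in domains[j] if x != out[i][j]]
        out[i][j] = rng.choice(alternatives)    # replace with a different value
    return [tuple(r) for r in out]

rng = random.Random(0)
data = [("Mackay", "Fatal", "Manoeuvre")] * 50
domains = [["Mackay", "Brisbane"], ["Fatal", "Hospital"], ["Manoeuvre", "Alcohol"]]
noisy = inject_simple_noise(data, domains, 0.04, rng)   # 4% noise ratio
print(sum(a != b for a, b in zip(noisy, data)))  # 2
```

Because the corrupted values differ from the originals by construction, the injected noise positions are known exactly, which is what makes the recall and precision criteria above computable.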

TABLE V. PERFORMANCE OF NCNAI, CAIRAD, RDCL AND EDIR ON DENVER COUNTY DATASET

|       |         |         | rER    |        |        |        | rEP    |        |        |        | rRR    |        |        |        |
| Ratio | Model   | Pattern | NCNAI  | CAIRAD | RDCL   | EDIR   | NCNAI  | CAIRAD | RDCL   | EDIR   | NCNAI  | CAIRAD | RDCL   | EDIR   |
|-------|---------|---------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 2%    | Overall | Simple  | 0.9312 | 0.9123 | 0.4123 | 0.2013 | 0.3005 | 0.2903 | 0.1900 | 0.1600 | 0.2905 | 0.3012 | 0.2130 | 0.1983 |
|       |         | Medium  | 0.9326 | 0.9034 | 0.4033 | 0.1987 | 0.2900 | 0.2789 | 0.1912 | 0.1589 | 0.2434 | 0.2340 | 0.2105 | 0.1923 |
|       |         | Complex | 0.9276 | 0.8912 | 0.4016 | 0.1901 | 0.2885 | 0.2514 | 0.1900 | 0.1567 | 0.2451 | 0.2387 | 0.2104 | 0.1899 |
|       |         | Blended | 0.9277 | 0.8897 | 0.4015 | 0.1701 | 0.2808 | 0.2612 | 0.1901 | 0.1566 | 0.2462 | 0.2376 | 0.2099 | 0.1802 |
|       | UD      | Simple  | 0.9111 | 0.8810 | 0.4176 | 0.1609 | 0.2517 | 0.2401 | 0.1812 | 0.1560 | 0.2031 | 0.2230 | 0.2098 | 0.1803 |
|       |         | Medium  | 0.9101 | 0.8834 | 0.4123 | 0.1567 | 0.2436 | 0.2410 | 0.1834 | 0.1561 | 0.1807 | 0.2203 | 0.2456 | 0.1809 |
|       |         | Complex | 0.9104 | 0.9123 | 0.4876 | 0.1588 | 0.2201 | 0.2399 | 0.1809 | 0.1562 | 0.1805 | 0.2108 | 0.2481 | 0.1806 |
|       |         | Blended | 0.9105 | 0.9076 | 0.4561 | 0.1601 | 0.2240 | 0.2201 | 0.1712 | 0.1563 | 0.2261 | 0.2205 | 0.2483 | 0.1834 |
| 4%    | Overall | Simple  | 0.9191 | 0.8976 | 0.3516 | 0.1412 | 0.6031 | 0.6012 | 0.1617 | 0.1456 | 0.2451 | 0.2301 | 0.2561 | 0.1645 |
|       |         | Medium  | 0.9025 | 0.8876 | 0.3214 | 0.1901 | 0.5814 | 0.5112 | 0.1690 | 0.1455 | 0.2450 | 0.2340 | 0.2341 | 0.1641 |
|       |         | Complex | 0.9013 | 0.8812 | 0.3109 | 0.1903 | 0.5067 | 0.5300 | 0.1681 | 0.1450 | 0.2010 | 0.2103 | 0.2100 | 0.1640 |
|       |         | Blended | 0.9031 | 0.8067 | 0.3200 | 0.1971 | 0.5054 | 0.4871 | 0.1678 | 0.1451 | 0.2309 | 0.2210 | 0.1987 | 0.1532 |
|       | UD      | Simple  | 0.8503 | 0.7654 | 0.3302 | 0.1817 | 0.5091 | 0.4516 | 0.1671 | 0.1452 | 0.2201 | 0.2101 | 0.2207 | 0.1467 |
|       |         | Medium  | 0.8551 | 0.7867 | 0.3401 | 0.1617 | 0.4832 | 0.4418 | 0.1670 | 0.1423 | 0.1671 | 0.2204 | 0.2415 | 0.1675 |
|       |         | Complex | 0.8465 | 0.7902 | 0.3304 | 0.1312 | 0.3860 | 0.4510 | 0.1613 | 0.1433 | 0.2405 | 0.2304 | 0.2461 | 0.1678 |
|       |         | Blended | 0.8278 | 0.7934 | 0.3208 | 0.1409 | 0.3914 | 0.3901 | 0.1610 | 0.1451 | 0.2400 | 0.2500 | 0.2516 | 0.1699 |
| 8%    | Overall | Simple  | 0.8200 | 0.8019 | 0.3124 | 0.1423 | 0.8481 | 0.8190 | 0.1611 | 0.1324 | 0.2405 | 0.2103 | 0.2309 | 0.2001 |
|       |         | Medium  | 0.8231 | 0.7918 | 0.2132 | 0.1421 | 0.8312 | 0.8090 | 0.1567 | 0.1234 | 0.2212 | 0.2100 | 0.2306 | 0.2016 |
|       |         | Complex | 0.8255 | 0.8011 | 0.5123 | 0.1420 | 0.7982 | 0.7981 | 0.1615 | 0.1156 | 0.2498 | 0.2340 | 0.2298 | 0.1543 |
|       |         | Blended | 0.7989 | 0.7789 | 0.3412 | 0.1309 | 0.7481 | 0.7156 | 0.1612 | 0.1312 | 0.2809 | 0.3012 | 0.3001 | 0.1230 |
|       | UD      | Simple  | 0.7931 | 0.7567 | 0.3561 | 0.2301 | 0.6625 | 0.6524 | 0.1612 | 0.1312 | 0.1903 | 0.2034 | 0.2987 | 0.1423 |
|       |         | Medium  | 0.7732 | 0.7234 | 0.3871 | 0.2011 | 0.6431 | 0.6541 | 0.1534 | 0.1300 | 0.1897 | 0.2301 | 0.2945 | 0.1641 |
|       |         | Complex | 0.7625 | 0.7156 | 0.3918 | 0.2109 | 0.6316 | 0.6012 | 0.1530 | 0.1299 | 0.1908 | 0.2000 | 0.2001 | 0.1980 |
|       |         | Blended | 0.7461 | 0.7089 | 0.4012 | 0.2201 | 0.6041 | 0.5534 | 0.1531 | 0.1298 | 0.2988 | 0.3412 | 0.2812 | 0.2998 |
| 10%   | Overall | Simple  | 0.7290 | 0.6908 | 0.2800 | 0.1213 | 0.9013 | 0.8901 | 0.1601 | 0.1109 | 0.3010 | 0.2210 | 0.3100 | 0.2018 |
|       |         | Medium  | 0.7378 | 0.6879 | 0.2801 | 0.1312 | 0.9042 | 0.8819 | 0.1600 | 0.1101 | 0.1670 | 0.2147 | 0.2876 | 0.2019 |
|       |         | Complex | 0.7124 | 0.6578 | 0.2809 | 0.1309 | 0.9161 | 0.8716 | 0.1589 | 0.1104 | 0.1894 | 0.2092 | 0.2654 | 0.2034 |
|       |         | Blended | 0.7045 | 0.6499 | 0.2890 | 0.1310 | 0.8972 | 0.8717 | 0.1588 | 0.1023 | 0.2982 | 0.2509 | 0.2453 | 0.2035 |
|       | UD      | Simple  | 0.6924 | 0.6678 | 0.2803 | 0.1412 | 0.8876 | 0.8615 | 0.1581 | 0.1021 | 0.2871 | 0.2987 | 0.2716 | 0.2100 |
|       |         | Medium  | 0.7000 | 0.7012 | 0.2809 | 0.1201 | 0.8745 | 0.8617 | 0.1579 | 0.1099 | 0.2500 | 0.2314 | 0.2617 | 0.2103 |
|       |         | Complex | 0.6989 | 0.6712 | 0.2901 | 0.1345 | 0.8841 | 0.8512 | 0.1578 | 0.1070 | 0.2103 | 0.2199 | 0.2715 | 0.2106 |
|       |         | Blended | 0.6980 | 0.6656 | 0.2904 | 0.1354 | 0.8840 | 0.8534 | 0.1577 | 0.1081 | 0.2104 | 0.2403 | 0.2614 | 0.2105 |

Fig. 1. Performance comparison on two datasets with confidence level 95%. (The bar chart plots F(0.5), Score(P) and Score(R) for NCNAI, CAIRAD, RDCL and EDIR on the Large Truck and Denver datasets.)

TABLE VI. OVERALL PERFORMANCE: ATTRIBUTE BASED

|             | vER   |        |       | vEP   |        |       |
| Dataset     | NCNAI | CAIRAD | EDIR  | NCNAI | CAIRAD | EDIR  |
|-------------|-------|--------|-------|-------|--------|-------|
| Large Truck | 36.13 | 34.23  | 16.00 | 54.61 | 51.01  | 17.89 |
| Denver      | 28.01 | 26.77  | 15.50 | 85.73 | 80.11  | 37.78 |

TABLE VII. OVERALL PERFORMANCE: RECORD BASED

|             | rER   |        |       | rEP   |        |       | rRR   |        |       |
| Dataset     | NCNAI | CAIRAD | RDCL  | NCNAI | CAIRAD | RDCL  | NCNAI | CAIRAD | RDCL  |
|-------------|-------|--------|-------|-------|--------|-------|-------|--------|-------|
| Large Truck | 78.35 | 77.02  | 49.01 | 54.81 | 53.01  | 35.00 | 6.04  | 5.52   | 4.12  |
| Denver      | 78.37 | 75.10  | 36.89 | 54.01 | 50.01  | 31.09 | 14.01 | 16.01  | 13.01 |

V. CONCLUSION

In this paper, we propose a new method for detecting noisy attribute values and records, with the aim of analyzing traffic accident data. Our algorithm identifies noisy attribute values and records using a neural network and co-appearance, and has been shown to outperform several well-known noisy record and attribute value detection methods on traffic accident data, where a large number of attributes are categorical. In the future, we intend to run our technique on large and diverse traffic accident datasets to investigate whether it consistently performs better than other existing algorithms.

REFERENCES

[2]

[1] I. P. Fellegi and D. Holt, "A systematic approach to automatic edit and imputation," Journal of the American Statistical Association, vol. 71, no. 353, pp. 17-35, March 1976.
[2] S. J. Delany, "The good, the bad and the incorrectly classified: profiling cases for case-base editing," 8th International Conference on Case-Based Reasoning (ICCBR), LNAI 5650, pp. 135-149, July 2009.
[3] M. G. Rahman, M. Z. Islam, T. Bossomaier, and J. Gao, "CAIRAD: A co-appearance based analysis for incorrect records and attribute-value detection," International Joint Conference on Neural Networks (IJCNN), pp. 1-10, June 2012.
[4] I. Tomek, "An experiment with the edited nearest-neighbor rule," IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-6, pp. 448-452, 1976.
[5] X. Zhu, X. Wu, and Y. Yang, "Error detection and impact-sensitive instance ranking in noisy datasets," American Association for Artificial Intelligence (AAAI), pp. 378-383, 2004.
[6] B. Greenberg and T. Petkunas, "SPEER (Structured Program for Economic Editing and Referrals)," Bureau of the Census Statistical Research Division Report Series, SRD Research Report Census/SRD/RR-90/15, October 1990.
[7] N. Lavesson and S. Axelsson, "Similarity assessment for removal of noisy end user license agreements," Knowledge and Information Systems, Springer-Verlag London Limited, 2011.
[8] C. M. Teng, "Correcting noisy data," 16th International Conference on Machine Learning, pp. 239-248, 1999.
[9] Western Australia Police, Western Australia Police fatal traffic crashes and fatalities 2009, Perth, 2010.
[10] Data Analysis Australia, Analysis of Road Crash Statistics, 1995 to 2004, 2006.
[11] Monash University Accident Research Centre, Summary of road safety performance in Western Australia 2003-2006, Melbourne, 2006.
[12] S. Gargett, L. B. Connelly, and S. Nghiem, "Are we there yet? Australian road safety targets and road traffic crash fatalities," BMC Public Health, vol. 11, no. 270, pp. 323-336, April 2011.
[13] Z. Zamani, M. Poumand, and M. H. Saraee, "Application of data mining in traffic management: Case of city of Isfahan," Proceedings of ICECT 2010, Kuala Lumpur, pp. 102-106, May 2010.
[14] S. Zhang, C. Zhang, and Q. Yang, "Data preparation for data mining," Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 375-381, 2003.
[15] A. Farhangfar, L. Kurgan, and J. Dy, "Impact of imputation of missing values on classification error for discrete data," Pattern Recognition, vol. 41, no. 12, pp. 3692-3705, December 2008.
[16] J. I. Maletic and A. Marcus, "Data cleansing: beyond integrity analysis," Proceedings of IQ2000, Citeseer, pp. 200-209, June 2000.
[17] X. Zhu, X. Wu, and Y. Yang, "Error detection and impact-sensitive instance ranking in noisy data sets," Proceedings of AAAI-04, California, pp. 378-384, July 2004.
[18] R. Deb and A. W. C. Liew, "Missing value imputation for the analysis of incomplete traffic accident data," Proceedings of ICMLC 2014, CCIS 481, Springer, pp. 275-286, Lanzhou, China, July 2014.
[19] R. Deb, A. W. C. Liew, and E. Oh, "A correlation based imputation method for incomplete traffic accident data," Proceedings of PRICAI 2014, LNAI 8862, Springer, pp. 905-912, Gold Coast, Australia, December 2014.
[20] H. Paulheim, "Identifying wrong links between datasets by multi-dimensional outlier detection," 11th ESWC 2014 (ESWC2014), May 2014.
[21] Y.-H. Huang, "Artificial neural network model of bridge deterioration," Journal of Performance of Constructed Facilities, vol. 24, no. 6, pp. 597-602, December 2010.
[22] M. Chong, A. Abraham, and M. Paprzycki, "Traffic accident analysis using machine learning paradigms," Informatica, vol. 29, pp. 89-98, 2005.
