Comparison of Weighted and Simple Linear Regression ... - IEEE Xplore

7 downloads 0 Views 306KB Size Report
Comparison of weighted and simple linear regression and artificial neural network models in freeway accidents prediction. (Case study: Qom & Qazvin ...
Second International Conference on Computer and Network Technology

Comparison of weighted and simple linear regression and artificial neural network models in freeway accidents prediction (Case study: Qom & Qazvin Freeways in Iran) Safi [2] separately used artificial neural network model to estimate the number of accidents in urban intersections and Iranian rural roads and show that traffic volume is one of the most influential parameters in both urban and rural accidents. Fajroddin, Yusof, Rahman, Abdul Samad, and Salleh developed a multivariate linear regression model to recognize black spots in Pakistan. The number of local access per length unit, speed of vehicles, traffic volume and distance between vehicles have been known as considerable parameters in their research [3]. Harnen, Radin Umar, Wong, and Wan Hashim have considered speed limit, number of lanes, lane width and land use as effective parameters in urban accidents and developed weighted linear and log-linear regression models to estimate the number of motorcycle crashes in non equipped intersections of traffic lights [4]. Chaterchi, Hummer, Kiattikomol, and Younger have developed a weighted linear regression model to predict accidents in the sections of the freeways either with or without intersection. The study, done in North Carolina and Tennessee, shows that weighted regression models are appropriate models for estimating the number of accidents in freeways [5]. Hassan, Ayman, and Wahaballa developed simple, stepwise, and multivariate linear regression models to recognize the most influential parameters in road accidents in Egypt [6] and concluded that weighted linear regressions are more appropriate models in populated areas and simple linear ones are appropriate in uninhabited area. Results also show that the number of accidents has direct correlation with the number of access points and heavy traffic and indirect correlation with shoulder width.

Abbas Mahmoudabadi PhD Candidate, Technical and Engineering Department, Payam-e-Noor University and Deputy of Safety and Traffic Department, RMTO, Tehran, Iran E-mail: [email protected] Abstract— A number of models have been used for estimating frequency of accidents. Weighted and simple linear regressions are common and in the recent years artificial neural network models have also been used as prediction models of accidents. Researchers need to select and use some models with the best performance particularly with the minimum of mean square errors. In this paper, traffic volume, surface condition, heavy traffic, and monthly accident data have been analysed in two Iranian major freeways named Tehran-Qom and Karaj-Qazvin-zanjan and three different kinds of models including simple and weighted linear regression and artificial neural network have been developed for estimating the number of monthly accident based on the above input variables. The well-known software of MATLAB has been used for analytical process and principle component analysis technique has been used to ensure that input variables don’t have interrelations. Principle components and loading have been calculated and results of PCA show that all input variables should be considered in modeling. The effectiveness of input variables based on T-test has been analyzed and the results show that traffic volume and surface condition have more effect in rural accidents. For models’ performance comparison, the mean square errors have been considered. It can be concluded, from the results, that artificial neural network has the best performance with minimum mean square errors. Keywords- Linear regression; Artificial Neural Network; Traffic Volume; Surface Condition; Rural Accidents; Freeway

II. ESTIMATING MODELS There are various kinds of models that are used for estimating parameters in road safety. Regression models are known as common, because of simplicity. Recently, artificial neural network models have been used widely as prediction models of accidents. In this section a brief discussion of simple and weighted linear regression and artificial neural network models have been presented.

I. INTRODUCTION Traffic parameters are very important for planning road safety programs, so they are known as major concerns for people who are working or responsible in road safety. Because it is very difficult or costly to have data on traffic parameters, it’s a good idea that some of parameters could be estimated regarding to the other parameters, which are available or easy to be accessed. Estimating the number of accidents in urban and rural areas is one of the most common ways for planning road safety programs. Abdolmanaphy, Ahmadinejad, and Afandizadeh [1] and Mahmoudabadi and

A. Simple linear regression Minimizing the mean or total square errors between observations and model outputs is the main concern behind the setting parameters of simple linear regression models [7]. There are two kinds of simple linear regression models. In the first, there is just one variable known as dependent variable and the other known as independent variable where dependent variable is desired one that must be estimated based on the other. Equation (1) shows the relation between dependent and independent variables where α and β are known as coefficients of simple regression model between dependent variables X and independent Y.

978-0-7695-4042-9/10 $26.00 © 2010 IEEE DOI 10.1109/ICCNT.2010.73

392

Y = α + βX (1) If more than one variable, are known as independent variables in the model are entered, linear regression model is known as multivariate linear regression model [7]. Therefore, model set relationship between independent variables X 1, X2,. . . Xn, and dependent variable Y by the equation (2). In the equation (2), coefficients β0, β1, β2, β3 . . . and βn must be set based on minimizing the mean square errors between observations and model outputs. Y = β0 + β1 X 1 + β2 X 2 + β3 X 3 + ... + βn X n (2)

Figure 1.

Yi = F ( ∑Wjiyi )

The most common used of validation criteria is mean or sum square errors between observations and model outputs. III. DATA COLLECTION AND STUDY AREA Because of availability of crash data, scope of the research has been considered in two freeways in Iranian rural roads including Tehran-Qom and Karaj-Qazvin-Zanjan freeways. Tehran-Qom freeway, which is summarized as Qom in this paper, has six lanes (three on each side) with the length of 135 km and Karaj-Qazvin-Zanjan freeway, which is summarized as Qazvin, has six lanes (three on each side) with 150 km in Qazvin province. Data have been gathered in both freeways divided into 5-km sections and different months. Data include crashes during a period of three years in Qom and two years in Qazvin freeways. Traffic data including volume, average speed, and heavy vehicle have been collected by conductive loops. Sections were selected by 5 km so there are many observations for each monthly accident. Surface condition in the different parts of freeways, which were evaluated based on international roughness index (IRI), has been used in this research. Table I shows the general view of data and scope of research. Experimental data for long-term period in both of freeways are shown in table II.

(3)

Parameters are being calculated based on the minimum of mean or total square errors between observations and model outputs [8]. It is obvious that multivariate regression model can be defined when independent variables X 1, X2. . . Xn and dependent variable Y are defined as equation (4) shows. P1 1

P2 2

P3 3

Pn n

Y = β0 + β1X + β2 X + β3 X + ...+ βn X

(5 )

j

B. Weighted Regression model When it is considered that the effect of independent variable on dependent one is power base, the model called weighted linear regression. There are also simple and multivariate regression model such as simple regression models but with the power of independent different from. Assuming the parameters described in the previous section regarding to power independent variables to p. In this case equation (3) is shown the relation between X and Y as respectively independent and independent variables and α, β, and p are parameters.

Y = α + βX p

General view of neural network model

TABLE I.

( 4)

Freeway Name

In equation (4) coefficients β0, β1, β2, β3 . . . and βn and p1, p2, p3… and pn are being estimated regarding to the minimizing the mean or total square errors between observations and model outputs.

Independent Variables Dependent Variable

COLLECTED D ATA AND V ARIABLES Qom and Qazvin Traffic Volume Surface Condition Heavy Vehicle fraction Monthly Accident

TABLE II.

C. Artificial neural network model In the neural network model, the neuron is the basic component [9]. The multi-layer perceptron structure is a fully interconnected set of layers of neurons. Each neuron of a layer is connected to each neuron of the next layer so that only forward transmission through the network is possible, from the input layer to the output layer through the hidden layers as shown in fig. 1 [9]. In this structure the output Yi of each neuron of the nth layer is defined by a derivable nonlinear function F by equation (5) where F is the non-linear activation function, wji are the weights of the connection between the neuron Nj and Ni, yi is the output of the neuron of the (n−1)th layer [9].

Qom Traffic Volume Heavy Vehicle Surface Condition Monthly Accident Observations Qazvin Traffic Volume Heavy Vehicle Surface Condition Monthly Accident Observations

IV.

Symbol TV SF HV AN

E XPERIMENTAL DATA

Min. 49770 0.026 1 0

Max. 86366 0.07 3 83

Min. 4191 0.11 1.2 0

Max. 60494 0.2 4.28 15

Mean 69410 0.045 2.15 7.93

STD 9319 0.0135 0.597 8.82

Mean 22118 0.146 2.45 1.77

STD 20337 0.031 0.749 1.89

672

507

PRINCIPLE COMPONENT ANALYSIS

Principle component analysis (PCA) is a statistical process categorized in data reduction techniques. The PCs are the uncorrelated (orthogonal) variables, obtained by multiplying the original correlated variables with the eigenvector, which is a list of coefficients (loadings or weightings). Thus, the PCs 1 are weighted linear combin1

393

Principle Components

ations of the original variables. It is a powerful technique for pattern recognition that attempts to explain the variance of a large set of inter-correlated variables and transforming into a smaller set of independent (uncorrelated) variables (principal components). Factor analysis further reduces the contribution of less significant variables obtained from PCA and the new group of variables, known as varifactors, is extracted through rotating the axis defined by PCA [10]. Selecting the first n PCs is regarding to eigenvalues greater than one or obtaining enough deviations mostly 85 percent of whole deviations [11].

TABLE V. Freeway Score TV Variable HV SF

Qom Traffic Volume Heavy Vehicle Surface Condition Monthly Accident Observations Qazvin Traffic Volume Heavy Vehicle Surface Condition Monthly Accident Observations

STANDARDIZED DATA

Min. -2.107 -1.459 -1.93 -0.9

Max. 1.189 1.801 1.419 8.516

Min. -0.882 -1.141 -1.667 -0.934

Max. 1.89 1.725 2.442 7

Mean

STD

0

1

Mean

STD

0

1

507

Qom

Qazvin

Score PC1 PC2 PC3 PC1 PC2 PC3

Variation percent 48% 33% 19% 73% 20% 7%

PC1

PC2

0.83 0.85 -0.15

-0.23 0.04 -0.98

0.51 -0.52 -0.14

0.93 0.82 -0.81

0.01 0.53 0.55

AN = 0 + 0.241TV − 0.089HV + 0.003SF AN = 0 + 0.492TV − 0.075HV + 0.259SF

(7 ) (8)

B. Weighted Linear Regression The weighted regression equation of input and output variables are estimated regarding to minimize of mean square errors. Running model to estimate coefficients was regarding to the number of iteration of 500. Equations (9) and (10) show the weighted regression models of Qom and Qazvin respectively.

PRINCIPLE COMPONENTS Eigenvalue 1.43 1.01 0.56 2.2 0.59 0.21

PC3

A. Simple Linear Regression The linear regression equation of input and output variables are estimated regarding to minimize of mean square errors. Equations (7) and (8) show the regression models of Qom and Qazvin respectively.

B. Eigenvalues Calculating eigenvalues is the second stage of PCA. Each score has specific eigenvalue ordered descending shown in table IV. PCs are uncorrelated so percent of deviation means the part of deviation of each PC. If 85 percent of deviation considered PC1, PC2, and PC3 in Qom freeway and PC1 and PC2 in Qazvin must be considered to the next stage of PCA process.

Freeway

Qazvin

PC2

V. DEVELOPING MODELS Three methods including linear regression, weighted regression, and artificial neural network have been used for modeling accident data via well-known computing software of MATLAB. Detailed discussions are as follows.

672

TABLE IV.

Qom PC1

D. Extracting variables Extracting variables is the last stage of principle component analysis. Extracting input variables is based on loadings that are shown in table V. In this stage selecting PCs mostly be considered with loading either more than 0.7 or less that -0.7 [10]. For the first PC in Qom freeway it can be concluded that variables TV and HV must be considered as input variables, and by considering PC2 variable SF would be added to inputs. In the case of Qazvin by considering the first PC, can be concluded that all of the variables TV, HV, and SF must be considered as output. All of desired loading in table V, which have been used in decision process are shown in bold marked style. Briefly it can be concluded that all input variables must be considered in the process of modeling regarding to at least 85 percent of whole deviations.

A. Data Standardizing In this paper, data have been converted to normal type by equation (6), which is commonly used to standardize dada, where n is the number of observations [11]. Table III shows standardized data so the means of all variables and standard deviations are zero and one respectively. x i ( old ) - x x i ( new ) = (6 ) STD ( x ) TABLE III.

LOADING

Cumulative percent 48% 81% 100% 73% 93% 100%

AN = 0.7496 + 0.0801TV −0.355 − 2 .5925 HV 1.24 + 0.1391SF 0.0466

( 9)

AN = 0.1566 + 1.5991TV −0.499 − 0.3259 HV −0.028 + 0.048 SF 0.0639

(10)

C. Artificial neural network In this case, with three input and one output variables, neural network model must be formed of three neurons in input and one neuron in output layer. Hidden layer will be different based on the number of observation. If it’s assumed that the number of observation should not be exceeded than ten times of necessary parameters, the maximum number of hidden neurons in Qom and Qazvin freeway models were calculated 15 and 11. Figs. 2 and 3 show artificial neural network models of both Qom and Qazvin freeways

C. Loading The simple correlation between original variables and scores, known as PCs, is loading [11]. As shown in table V, correlation between variable traffic volume (TV) and the first scores in Qom and Qazvin freeways are high, means that traffic volume has more influence in the first score.

394

respectively and five times of running models have been considered with stop criteria of 1e-5.

regression. Therefore, neural network, weighted regression, and simple linear regression models are comparison order.

D. Regression discussion Effective parameters can be discussed based on t-test value in simple regression models. The values obtained by regression model shown in table VI. To ensure that which of variables are effective in road accidents in rural freeways, with confidence interval of 95%, t-test values have been compared with T0.975. Regarding to t-test values, traffic volume has direct affect in rural accident. Heavy traffic has reverse affect as in the previous researches [2]. Surface condition in Qom freeway doesn’t have significant affect in road accidents but it has a significant affect in rural accidents in Qazvin freeway.

VII. CONCLUSION In this research work, monthly accidents in Tehran-Qom and Karaj-Qazvin-Zanjan freeways have been analyzed regarding to traffic volume, heavy traffic, and surface condition in 5-km sections. Traffic volume, heavy traffic, and surface condition have been considered as input and monthly accidents as output variables. Three estimating models including simple regression, weighted regression, and artificial neural network have been developed and principle component analysis has been used to ensure that input parameters don’t have inter-relation with each other. PCA proved that three parameters must be entered in modeling based on 85 percent of deviations. Mean square errors (MSE), has been used for comparing the models’ performance and results show that artificial neural network is the best model to accident prediction with the minimum of MSE. It’s recommended that further research should be focused on behavioral parameters in rural accidents, particularly in developing countries, where experts believe that human factors are the main factors in rural crashes. Considering further parameters such as whether condition, enforcement, topography, and residential areas can be considered to have more adaptive models.

Figure 2. Neural network model in Qom Freeway

REFERENCES [1] Figure 3. Neural network model in Qazvin Freeway

Variable TV HV SF

[2]

T-VALUES OF REGRESSION MODELS

TABLE VI.

Qom

Qazvin

Coefficient

t-value

Coefficient

t-value

0.241 -0.089 0.003

4.99 -1.84 0.08

0.492 -0.075 0.259

7.68 -1.46 5.13

[3]

VI. PERFORMANCE COMPARISON OF MODELS Mean square error is commonly used for comparing performance of regression and neural network models. The number of iteration for estimating coefficients in weighted linear regression has been considered 500 and the mean square errors in artificial neural network models are the average of five times running with stop criteria of 1e-5. Table VII shows the mean square errors in different models. TABLE VII.

[5]

MEAN SQUARE ERRORS

[6]

Mean Square Errors (e-3)

Model

Qom

Simple linear regression Weighted linear regression Artificial neural network

[4]

9.58 9.43 8.18

Qazvin

[7]

8.98 8.77 7.03

[8]

As it can be revealed from table VII, the least mean square errors belongs to artificial neural network, the next to weighted regression, and the last to the simple linear

[9]

395

S.E. Abdonaphy, M, Ahmadinejad, S. Afandyzadeh, “Designing prediction model of accident in urban intersections based on statistical and neural network”, M.S. Thesis, Iranian University of Science and Technology, 2007, Persian Dtae 1386, A Mahmoudabadi, A. Safi Smghabady, "Estimate the number of daily accidents in Iranian rural network using artificial neural network”, The second congress of Intelligent and Fuzzy systems, Malek Ashtar University of Technology, Persian date Aban 1387, October 2008 F. Mustakim, I. Yusof, I. Rahman, A.Z. Abdul Samad, N.E.B.M. Salleh, "Blackspot Study and Accident Prediction Model Using Multiple Liner Regression", First International Conference on Construction in Developing Countries ( ICCIDC-I), "Advancing and Integrating Construction Education, Research & Practice", Karachi,, Pakistan, pp. 121-136, August 4-5, 2008 S. Harnen, R.S. Radin Umar, S.V. Wong, W.I. Wan Hashim, "Motorcycle accident prediction model for junctions on urban roads in Malaysia", Advances in Transportation Studies an international Journal Section A8, 2006, pp. 31-40, A. Chatterjee, J.E. Hummer, V. Kiattikomol, M.S. Younger, “Planning Level Regression Models for Crash Prediction on Interchange and Non-Interchange Segments of Urban Freeways”, Center for Transportation Research, The University of Tennessee, pp. 1-29, August, 2005 Y.A. Hassan, M.O. Ayman, A.M. Wahaballa, “Traffic Accident Analysis & Modelling for Upper Egypt Rural Roads”, Civil Eng. Dept. Assuit Univ., March 2006. P.G. Hoel, "Elementary Statistics", Fourth Edition, Published by John Wiley, Sons, Inc., New York, 1976. G.A.F. Seber, C. J. Wild, “Nonlinear Regression”, John Wiley & Sons Inc., 1989. H.B. Celikoglua, H.K. Cigizoglub, “Modelling public transport trips by radial basis function neural”, Networks, Mathematical and Computer Modelling 45, 2007, pp. 480–489

[10] K.P. Singha, A. Malika, D. Mohana, S. Sinhab “Multivariate statistical techniques for the evaluation of spatial and temporal variations in water quality of Gomti River (India) a case study”, Water Research 38, 2004, pp. 3980–3992 [11] S. Sharma, “Applied Multivariate Techniques”, Published by University of South Carolina, 1996

396