Matching Patterns for Updating Missing Values of Traffic Counts

Transportation Planning and Technology, April 2006, Vol. 29, No. 2, pp. 141-156

ARTICLE

Matching Patterns for Updating Missing Values of Traffic Counts MING ZHONG*, SATISH SHARMA** & PAWAN LINGRAS$ *Department of Civil Engineering, University of New Brunswick, Canada, **Faculty of Engineering, University of Regina, Canada, $Department of Mathematics and Computing Science, Saint Mary’s University, Canada (Received 24 February 2005; Revised 28 January 2006; In final form 12 April 2006)

ABSTRACT The presence of missing values is an important issue for traffic data programs. Previous studies indicate that a large percentage of permanent traffic counts (PTCs) from highway agencies have missing hourly volumes. These missing values make data analysis and usage difficult. A literature review of imputation practice and previous research reveals that simple factor and time series analysis models have been applied to estimate missing values for transport related data. However, no detailed statistical results are available for assessing imputation accuracy. In this study, typical traditional imputation models identified from practice and previous research are evaluated statistically based on data from an automatic traffic recorder (ATR) in Alberta, Canada. A new method based on a pattern matching technique is then proposed for estimating missing values. Study results show that the proposed models have superior levels of performance over traditional imputation models. KEY WORDS: Data imputation; pattern matching; missing values; traffic counts; ARIMA

Correspondence Address: Ming Zhong, Department of Civil Engineering, University of New Brunswick, GD-128, Head Hall, 17 Dineen Drive, P.O. Box 4400, Fredericton, N.B., Canada E3B 5A3; Email: [email protected] ISSN 0308-1060 print; ISSN 1029-0354 online. © 2006 Taylor & Francis. DOI: 10.1080/03081060600753461

Introduction

Highway agencies invest a significant portion of their financial resources in monitoring their highway networks. For example, Saskatchewan Highways and Transportation has an annual data budget of CA$450 000 (US$393 000; €325 000), whereas Montana spends US$750 000 on its traffic data program each year (Liu et al., 2001). Three types of traffic counts, namely permanent traffic counts (PTCs), seasonal traffic counts (STCs) and short-period traffic counts (SPTCs), are usually collected through traffic monitoring programs (Garber and Hoel, 1988). Since traffic counting devices usually cannot work perfectly all the time, data for certain periods may not be recorded. These 'lost data' are called missing values. A previous study (Zhong et al., 2004) indicates that there are significant missing portions in traffic data sets.

An examination of traffic monitoring practices in North America and Europe shows that many highway agencies estimate missing values for traffic counts. Estimating missing values is known as data imputation. The examination also shows that highway agencies use only simple factor and time series analysis models to impute data. Previous research has mainly focused on detecting missing values. Moreover, statistical analyses for evaluating imputation accuracy are largely absent.

There are increasing concerns about data imputation and Base Data Integrity. The principle of Base Data Integrity is an important theme discussed in both the American Society for Testing and Materials (ASTM) Standard Practice E1442, Highway Traffic Monitoring Standards (ASTM, 1991) and the American Association of State Highway and Transportation Officials (AASHTO) Guidelines for Traffic Data Programs (AASHTO, 1992). The principle is that traffic measurements must be retained without modification and adjustment, and missing values should not be imputed in the base data. However, this does not prohibit imputing data at the analysis stage.
In some cases, traffic counts with missing values could be the only data available for certain purposes, and data imputation is then necessary for further analysis. Proper imputation can help maintain the minimum integrity and cost-effectiveness of traffic data programs. In accordance with the principle of Truth-in-Data, the AASHTO Guidelines (AASHTO, 1992) also recommend that highway agencies document their procedures for editing traffic data. The new models proposed in this study are based on the minimum square error (MSE) technique. The purpose of this method is to compare a set of M objects (e.g. 100 traffic counts) with the object having missing values (e.g. a traffic count having K months of data), each measured on K variables (e.g. K monthly traffic factors), and find the

best-match object. The object having the best match will be used to update the missing values. A similar algorithm was used by Sharma and Werner (1981) to cluster permanent traffic counts (PTCs) based on complete data sets. Sharma and Allipuram (1993) used the same technique to assign seasonal traffic counting sites to PTC groups. Here, the algorithm is modified to compare partially available data to find the best-match pattern for imputation.

The article is organized as follows. First, a literature review examines the imputation practices of highway agencies and previous related research. Then typical models identified from the literature are tested on data from an automatic traffic recorder (ATR) in Alberta, Canada. These models include factor models, the exponential smoothing method, and the autoregressive integrated moving average (ARIMA) model. The statistical results of these models are given. Finally, MSE models are applied to estimate missing values and compared with these traditional imputation models.

Review of Imputation Practices and Previous Research

In 1990, the New Mexico State Highway and Transportation Department (1990) conducted a survey of traffic monitoring practice in the USA. It showed that when portable devices failed, 13 states used some procedure to estimate missing values and complete the data set. When permanent devices failed, 23 states employed some procedure to estimate missing values (Albright, 1991a). Different methods were used for this purpose. For example, in Alabama, if fewer than 6 h are missing, the data are estimated using the previous year or other data from the month; if more than 6 h are missing, the day is voided. In Delaware, estimates of missing values are based on a straight line using the data from the months before and after the failure. Most of these methods apply simple factors to historical data for estimating missing values.
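As an illustration of this factor-style patching, the sketch below imputes a missing hourly volume from the same hour, on the same day of the week, in the same month, one year earlier. The keyed-dictionary layout and all volume figures are hypothetical assumptions for illustration, not any agency's actual data format.

```python
# Sketch of factor-style imputation: a missing hourly volume is replaced
# with the volume recorded at the same hour, same day of the week, same
# month, one year earlier. The dictionary layout and volumes below are
# invented for illustration.

def impute_previous_year(counts, year, month, week, weekday, hour):
    """Return the recorded volume, falling back to last year's value."""
    value = counts.get((year, month, week, weekday, hour))
    if value is not None:                 # value present: keep it
        return value
    return counts.get((year - 1, month, week, weekday, hour))

# Hour 09:00-10:00 on the first Wednesday of July 2000 is missing, so
# the first Wednesday of July 1999 supplies the estimate.
counts = {
    (1999, 7, 1, 2, 9): 812,   # weekday 2 = Wednesday (0 = Monday)
    (2000, 7, 1, 2, 9): None,  # counter failure
}
print(impute_previous_year(counts, 2000, 7, 1, 2, 9))  # -> 812
```

Because the method keys purely on calendar position, it inherits any year-to-year traffic growth or anomaly present in the historical value, which is one source of the errors evaluated later in the article.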
Only in Kentucky was a computer program used to estimate and fill in the blanks (New Mexico State Highway and Transportation Department, 1990). Personal communications with a number of practicing traffic engineers indicated that Canadian highway agencies in the prairie region usually use historical data from the same site to impute missing values. For example, Saskatchewan uses the data from the same period in the previous year to estimate missing values. The same period here means the same day of the week in the same month. For instance, missing values on the first Wednesday in July will be updated with values from the first Wednesday in the previous July. In 1997, the Federal Highway Administration (FHWA) (1997) conducted research into traffic monitoring programs and technologies

in Europe. It was reported that highway agencies in The Netherlands, France and the UK used computer programs for data validation routines. For example, a software system, INTENS, was used in The Netherlands for data analysis and validation. The software used a 'smart' linear interpolation process between locations from which data were available to estimate missing traffic volumes. In France, a system called MELODIE was used for data validation. Validation was conducted visually by system operators, and invalid data were replaced with the previous month's data. Several data validation systems were used in the United Kingdom. One of them was used by the Central Transport Group (CTG) to validate permanent recorder data. Invalid data were replaced with data extracted from the valid data of the last week collected at the same site. These historical data are multiplied by a factor, taken from nearby sites that did work correctly, to convert the previous week's traffic volumes to the current week. No research has been found that assesses the accuracy of such imputations.

In England, a survey of practical solutions used by consultancies and local authorities was conducted in 1993 (Redfern et al., 1993). It reported two broad categories of solution: the 'by-eye' method and computerized packages (Redfern et al., 1993). The 'by-eye' method involved manual estimation of missing values. Most automated practical solutions to patching were based upon simple, moving or exponentially weighted moving averages, or their variants. For example, the then Department of Transport (DoT) in London employed an exponentially weighted moving average model to update missing values. The process involved validating new traffic count data against old data from the same site collected over the previous weeks at the same time.
The following equation was used to estimate missing or rejected data at time t, x̂_{t,s}:

x̂_{t,s} = (1 - u)x_{t-1,s} + (1 - u)u x_{t-2,s} + (1 - u)u² x_{t-3,s} + ... + (1 - u)u^{n-1} x_{t-n,s}    (1)
where x_{t-1,s}, x_{t-2,s}, ..., x_{t-n,s} represent the observations for that particular site and vehicle category s, at the same times for weeks 1, 2, ..., n before the current observation; u is a constant such that 0 < u < 1. A value of 0.7 was typically used for the parameter u.

ARIMA models have been used by researchers to estimate missing data or predict short-term traffic (Nihan and Holmesland, 1980; Harvey and Pierse, 1984; Southworth et al., 1989). In particular, a series of studies (Clark, 1992; Redfern et al., 1993; Watson et al., 1993) was carried out by a group of scholars at the University of Leeds, England in the early 1990s. Redfern et al. (1993) tested four kinds of model on four different traffic data time series supplied by the DoT in
London. These models were the exponentially weighted moving average, an autocorrelation-based influence function, an ARIMA model using large residuals, and an ARIMA model using the Tsay likelihood ratio diagnostics. They used four traffic count series consisting of 153 daily observations from 1 June 1991 to 31 October 1991. It was reported that the estimation of replacement values for both extreme and missing values was most efficiently done using the parametric ARIMA(1,0,0)(0,1,1)_7 model. However, it was also reported that the estimated replacements of missing values showed considerable variation (Redfern et al., 1993). Based on the estimated replacement values and the possible 'true values' identified from the figures in their study, absolute percentage errors (APEs) were usually more than 10%, and many were more than 20%, even for the best ARIMA models. Estimated average errors for the four time series ranged from 10% to 26%. The study also noted the concerns of transport practitioners about base data integrity.

These previous studies used ARIMA models to estimate missing values of traffic counts. Researchers tried to model the long-term trend and seasonal variation of traffic count time series. However, the majority of previous research on transport-related time series focused on detecting missing values or outliers. Predictions of missing values were tested on only a small number of time series, and no detailed statistical results have been found for evaluating the accuracy of these imputations.

Study Data and Approaches

Study Data

Data from ATR C002181 in Alberta, Canada are used to examine the accuracy of the various models. C002181 is located on a commuter road section (of Highway #2) near Calgary. The data from 1996 to 2000 are used in this study. Depending on the model, data from different years are used as training and test sets. There are no missing values in the study data. The study data are in the form of hourly volumes for both travel directions.
Factor Approaches

For the factor approach, the models used by Saskatchewan Highways and Transportation in Canada, the South Dakota Department of Transportation (DOT) and the Delaware DOT in the USA, and the Highway Administration in France are presented here. They represent models using data from the last year, models using data from the previous years, models using current-year data from both before

and after the failure, and models using only current-year historical data. In Saskatchewan, missing data are imputed with the previous year's data, usually from the same day of the week in the same month. The South Dakota DOT estimates missing values with the average of data from the same periods in the previous three years. For Delaware, estimates of missing data are based on the average of the data from the same periods in the months on either side (New Mexico State Highway and Transportation Department, 1990). The French Highway Administration uses the previous month's data to estimate missing values (FHWA, 1997). These models are used to estimate 12 successive missing hourly volumes on various days (e.g. Wednesdays and Saturdays) of different seasons (e.g. summer or winter).

Time Series Analysis Models

London exponential smoothing method. Redfern et al. (1993) showed that most computer imputation programs in England were based on simple, moving or exponentially weighted moving average techniques. The DoT in London used an exponentially weighted moving average model to estimate missing values. It involved validating new traffic count data against old data from the same site collected over the previous weeks at the same time. A mean and variance are maintained for each site, day of the week, period of the day and vehicle type, and updated sequentially as new observations become available. Missing values and outliers (observations that lie more than four standard deviations from the old mean) are replaced with estimated values calculated from Eq. (1). The model is used to impute 12 successive missing hourly volumes for the same periods as those used for the factor models.

Box-Jenkins forecasting procedure (ARIMA models). This procedure is based on fitting an autoregressive integrated moving average (ARIMA) model to a given set of data and then taking conditional expectations. A typical multiplicative seasonal ARIMA model takes the form:

φ_p(B) Φ_P(B^S) W_t = θ_q(B) Θ_Q(B^S) a_t    (2)

where B denotes the backward shift operator; φ_p, Φ_P, θ_q, Θ_Q are polynomials of order p, P, q, Q respectively; and {a_t} is the Box-Jenkins notation for a purely random process with mean zero and variance σ_a². W_t = ∇^d ∇_S^D X_t and B^S W_t = W_{t-S}. {W_t} is the differenced time series and {X_t} is the original non-stationary time series. ∇ is the differencing operator, and d and D are the orders of differencing needed to remove trend and seasonality respectively. An ARIMA model considering seasonality in the data is often represented by ARIMA(p, d, q)(P, D, Q)_S, where p, d, q are the orders of the autoregressive, differencing and moving average components; P, D, and Q are the orders of the seasonal

autoregressive, differencing and moving average components; and S is the seasonal period, which repeats every S observations (Chatfield, 1984). The literature review shows that the majority of previous studies used ARIMA models to estimate missing values. This study uses ARIMA models to impute 12 missing hourly volumes on the 9th day based on the patterns from the same days of the previous eight weeks.

Minimum square error (MSE) approaches. The previous analysis shows that the traditional imputation models are intuitive in nature. Estimates of missing values are obtained by simple patching or computation based on historical data only; the information available from after the failure period is usually neglected. Based on these findings, a method based on the minimum square error (MSE) technique is used to impute missing values from the data both before and after the failure.

When using MSE models to estimate the missing values of a pattern (curve), the first step is to find a set of candidate curves, from which a best-fit curve will be selected and used to update the missing values. Candidate curves here mean curves generated from data having strong correlations with the curve having missing values. When imputing missing hourly volumes, candidate curves are usually daily patterns from the same days of the week. First, the sum of square errors (SSE) between the pattern under study and each candidate pattern is calculated. Before calculating the SSE, the patterns are usually normalized to remove any time series trend; for example, the study and candidate patterns used in this study are all normalized by average daily traffic (ADT) or AADT. The SSE is then calculated based on the available data. The curve having the minimum sum of square errors is chosen as the best-fit curve, and it is used to update the missing values of the curve under study. The equation used for calculating the SSE is as follows:

SSE_j = Σ_{i=1}^{N} (candidateCurve_{ji} - studyCurve_i)²    (3)

where SSE_j is the SSE between candidate curve j and the curve under study; N is the number of available factors from the curve under study; candidateCurve_{ji} is the corresponding ith factor from candidate curve j; and studyCurve_i is the ith available factor from the curve under study (i = 1, 2, ..., N). The best-fit curve is the curve that has the minimum SSE with the curve under study. The minimum square error (MSE) is determined by the following equation:

MSE = min{SSE_1, SSE_2, ..., SSE_j, ..., SSE_m}    (4)

where SSE_j is the SSE between candidate curve j and the curve under study; and m is the number of candidate curves.

The levels of performance of the various models tested in this study are evaluated with absolute percentage errors (APE). Depending on the model, the number of patterns or observations varied. The APE is calculated as:

APE = |actual volume - estimated volume| / actual volume × 100    (5)
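The error statistics built on Eq. (5) can be sketched as follows. The nearest-rank percentile and all the volume figures are illustrative assumptions, not the paper's data.

```python
# Sketch of the evaluation statistics from Eq. (5): the absolute
# percentage error for each imputed hour, followed by the average, 85th
# and 95th percentile errors. Volumes are invented for illustration.

def ape(actual, estimated):
    """Absolute percentage error of one imputed value, Eq. (5)."""
    return abs(actual - estimated) / actual * 100.0

def percentile(errors, p):
    """Nearest-rank percentile; adequate for a sketch of this size."""
    ordered = sorted(errors)
    k = round(p / 100.0 * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, k))]

actual = [700, 820, 910, 880, 760, 690]      # invented true volumes
estimated = [680, 850, 900, 930, 770, 640]   # invented imputed volumes
errors = [ape(a, e) for a, e in zip(actual, estimated)]

print(round(sum(errors) / len(errors), 2))   # average APE
print(round(percentile(errors, 85), 2))      # 85th percentile error
print(round(percentile(errors, 95), 2))      # 95th percentile error
```

Reporting the 85th and 95th percentiles alongside the mean, as the tables below do, exposes the tail of the error distribution that a mean alone would hide.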

The key evaluation parameters consist of the average, 85th and 95th percentile errors. These statistics give a reasonable idea of the error distributions.

Study Results and Discussion

The models discussed above were used to estimate missing values for various days in different seasons. The results were essentially the same except for the magnitude of the errors. For a given model, the errors when imputing weekend days are usually larger than when imputing weekdays, and imputation errors for the winter season are usually higher than those for the summer season. For presentation purposes, only the results for imputing 12 successive missing hourly volumes on Wednesdays in July and August are reported here.

The statistical results for the typical factor models used by Saskatchewan, South Dakota, Delaware and the French Highway Administration are presented in Table 1, which shows the average, 85th percentile and 95th percentile errors of these factor models for imputing 12-h missing values on Wednesdays in July and August. The average errors of the Saskatchewan model are usually about 5-7%; the 95th percentile errors are usually more than 10%, and some are more than 15%. Although the South Dakota DOT uses the previous three years' data to estimate missing values, the errors are similar to those of the Saskatchewan method: some average errors exceed 10%, and some 95th percentile errors exceed 20%. The Delaware method results in average errors of about 5%, with 95th percentile errors usually ranging from 8% to 12%. The French method results in average errors of 6-9%, with 95th percentile errors usually ranging from 12% to over 20%. These analyses indicate that using data from exactly the same periods in the months on either side is the best choice (Delaware method). Taking the average of previous years (South Dakota method) or the previous year's data (Saskatchewan method) may not provide results of similar

Table 1. Comparison of factor imputation models used by highway agencies

               Prediction errors (%)
               Average                      85th percentile              95th percentile
Hour           Sask.  S.Dak. Delw.  Fran.   Sask.  S.Dak. Delw.  Fran.   Sask.  S.Dak. Delw.  Fran.
07-08          10.71  13.11   7.65  10.75    9.07  14.43   9.33  10.03   23.02  22.97  12.50  11.84
08-09           7.45   9.99   5.42   9.02    9.95  11.48   8.68  12.71   18.16  16.19  11.79  18.61
09-10           5.74   6.37   3.18   5.17    9.88   8.82   6.56   8.68   12.03   9.04   7.89  11.35
10-11           4.96   5.32   4.65   6.04    7.79   7.76   8.93  12.67   10.71  11.19  11.23  13.93
11-12           5.75   2.94   5.68   9.00    9.65   4.81  12.07  15.55   13.32   6.02  14.19  22.69
12-13           6.29   4.07   5.42   8.28    9.72   9.73  10.52  16.50   14.89  11.50  11.97  19.98
13-14           5.16   7.59   4.91   7.17    9.08  10.28   9.14  14.22   10.02  14.08  11.58  17.07
14-15           4.93   7.48   3.39   4.71    7.61  10.42   7.05   8.32    9.28  10.71   8.62  14.36
15-16           5.38   3.96   2.71   3.78    8.05   4.95   4.94   6.18   11.63  11.20   7.54   8.50
16-17           5.98   2.96   3.29   4.70    7.74   6.62   5.53   6.35   12.38   8.31   7.67  16.35
17-18           6.34   4.24   4.17   5.47    8.74   5.13   7.49   7.51   15.71   8.59  11.50  20.11
18-19           6.31  10.26   4.32   6.25   10.13  15.20   6.89  10.10   16.05  17.49  17.61  20.55
Total average   6.25   6.52   4.57   6.70    8.95   9.14   8.09  10.74   13.93  12.27  11.17  16.28

Note: Sask. = Saskatchewan; S.Dak. = South Dakota; Delw. = Delaware; Fran. = France.

accuracy. Using historical data only from before the failure month resulted in higher errors (French method). In this study, the Delaware factor method is used as the benchmark model.

The London exponential moving average model and the ARIMA model were applied to the study data, and the statistical results are given in Table 2. It should be noted that no individual vehicle types were considered, since the volume data used in this study combine all vehicle types. The average errors of the London model are usually less than 5%, and most of the 85th percentile errors are lower than 7%; the 95th percentile errors usually range from 7% to 14%. Compared to the benchmark model (the Delaware method), the London method results in consistently lower average errors: its average errors are lower than those of the benchmark model in 10 out of 12 cases. For the ARIMA model, average errors are usually between 3% and 5%, and most 95th percentile errors range from 6% to 10%. For the 95th percentile errors, the ARIMA model outperforms the London model in 9 out of 12 cases.

The MSE model was used to update missing hourly volumes between 8:00 a.m. and 8:00 p.m. in July and August. Seven daily patterns from the previous Wednesdays, together with their average pattern, were used as candidate curves for updating missing values on the 8th Wednesday. Figure 1(a) shows the true curve and the candidate curves. These patterns

Table 2. Errors of time series analysis models

               Prediction errors (%)
               Average           85th percentile    95th percentile
Hour           London  ARIMA     London  ARIMA      London  ARIMA
07-08            6.99   4.86       6.71   7.72       10.85   9.13
08-09            4.41   4.67       5.94   6.71        7.65   9.73
09-10            3.27   3.03       6.34   4.11        8.49   7.93
10-11            3.89   2.87       8.17   3.83       13.99   8.43
11-12            4.47   3.71       6.81   9.16       16.27   9.43
12-13            4.28   3.43       7.56   5.25       13.79   6.03
13-14            4.18   3.31       8.29   6.73       10.92   7.60
14-15            3.04   4.47       5.70   7.58       10.62   9.83
15-16            2.59   2.38       5.35   4.49        7.27   5.20
16-17            3.31   4.46       5.21   6.48        7.12   7.73
17-18            3.29   4.03       5.47   8.37       10.20   8.83
18-19            3.98   5.74       9.08   8.10       11.46  13.47
Total average    3.98   3.91       6.72   6.54       10.72   8.61
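The London update of Eq. (1) tested above can be sketched directly. The weight u = 0.7 follows the text; the weekly volume history is invented for illustration.

```python
# Eq. (1): x_hat(t,s) = sum over k = 1..n of (1 - u) * u**(k-1) * x(t-k,s),
# where x(t-k,s) is the volume observed at the same hour on the same
# weekday k weeks earlier. u = 0.7 as reported for the DoT model.

def ewma_estimate(history, u=0.7):
    """history[0] is last week's volume, history[1] two weeks ago, ..."""
    return sum((1 - u) * (u ** k) * x for k, x in enumerate(history))

# Eight invented weekly volumes for one hour/weekday slot. Note that
# for a finite history the weights sum to 1 - u**n, so a short history
# slightly underestimates a stable mean.
history = [820, 790, 805, 830, 810, 795, 800, 815]
print(round(ewma_estimate(history), 1))
```

The geometric decay means recent weeks dominate the estimate, which is why the method tracks the time series trend better than the pure factor models.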

[Figure 1. True pattern, estimated pattern and candidate patterns for the MSE model: (a) the true curve and the candidate curves for the HF-MSE model (the average pattern and the patterns of the 1st-7th Wednesdays); (b) the true curve, the updating curve and the estimated curve. Both panels plot hourly factors (0-1.2) against time of day (1-24 h).]

are normalized by the ADT calculated from the available data. Based on the 12 available hourly factors (12:00-8:00 a.m. and 8:00 p.m.-12:00 midnight), the hourly factors from 8:00 a.m. to 8:00 p.m. of the 6th day were selected as the estimates for updating the missing values of the true curve: the SSE between the true curve and the curve of the 6th day, calculated over the 12 available hourly factors, is the minimum. Figure 1(b) shows the true curve, the updating curve and the estimated curve. It should be noted, however, that the MSE model does not guarantee the best performance over the missing-data portion. A few other curves result in smaller SSEs over the missing hours, and the curve of the 6th day in fact has the second largest SSE for the missing data (results not shown here). This is justified on the basis that the MSE models make their judgements solely from the available data; they cannot guarantee optimal results for both the available-data and missing-data portions.

Table 3 gives the error distributions of the MSE model. The average errors are usually less than 4%, and the 95th percentile errors are all less than 10%. The average and 95th percentile errors of the MSE model are lower than those of the benchmark factor model in all 12 cases. The MSE model also outperforms the ARIMA model in 10 out of 12 cases. Figure 2 compares the mean average and 95th percentile errors for all study models. It is clear from the evidence that the MSE model has the best performance. Its significantly lower 95th percentile error emphasizes its suitability and capability for infilling missing data of the study patterns.

Table 3. Errors of MSE model

               Prediction errors (%)
Hour           Average   85th %   95th %
07-08            1.89      2.24     3.21
08-09            6.05      8.31     9.55
09-10            2.98      4.64     5.17
10-11            7.35      9.58     9.74
11-12            3.16      4.61     5.15
12-13            3.88      6.57     7.10
13-14            3.04      5.29     5.57
14-15            2.20      3.49     4.87
15-16            2.33      4.30     4.77
16-17            2.83      5.03     5.87
17-18            2.74      4.67     5.83
18-19            4.50      7.38     8.83
Total average    3.58      5.51     6.31
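The matching procedure of Eqs (3)-(4) can be sketched end to end. Normalizing by the mean of the available hourly volumes stands in for the ADT normalization described above, and all volumes are invented, so this is an illustrative assumption rather than the authors' exact implementation.

```python
# Sketch of MSE pattern matching (Eqs (3)-(4)): normalize each daily
# pattern, compute the SSE against the study pattern over the available
# hours only, and patch the missing hours from the best-matching
# candidate. Six-hour "days" keep the example short.

def normalize(day):
    """Hourly factors: volume / mean of the known hours (None kept)."""
    known = [v for v in day if v is not None]
    mean = sum(known) / len(known)          # stands in for ADT here
    return [None if v is None else v / mean for v in day]

def sse(candidate, study):
    # Eq. (3): summed over the available hours of the study curve only.
    return sum((c - s) ** 2
               for c, s in zip(candidate, study) if s is not None)

def impute_mse(study_day, candidate_days):
    study = normalize(study_day)
    known = [v for v in study_day if v is not None]
    scale = sum(known) / len(known)         # rescale factors to volumes
    best = min((normalize(d) for d in candidate_days),
               key=lambda c: sse(c, study))  # Eq. (4)
    return [best[i] * scale if v is None else v
            for i, v in enumerate(study_day)]

study_day = [100, None, None, 400, 300, 200]   # two hours missing
candidates = [[110, 220, 330, 440, 320, 210],  # similar shape
              [300, 300, 300, 300, 300, 300]]  # flat pattern
print(impute_mse(study_day, candidates))       # missing hours filled
```

Here the similarly shaped candidate wins the SSE comparison over the available hours, so its normalized factors, rescaled by the study day's level, fill the gap, mirroring the 6th-Wednesday selection described in the text.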

[Figure 2. Comparison of the mean average and 95th percentile errors of the study models. Average / 95th percentile errors (%): Sask. 6.25/13.93; S.Dak. 6.52/12.27; Delw. 4.57/11.17; Fran. 6.70/16.28; London 3.98/10.72; ARIMA 3.91/8.61; MSE 3.58/6.31.]

Conclusion

Highway agencies in many countries commit a significant portion of their financial and human resources to traffic data programs. The data collected through these programs are widely used in transportation planning, design, operation, management and research, and are also useful in metropolitan planning and business development. However, for a variety of reasons, these data sets usually contain a significant number of missing values, and the presence of missing values makes data analysis and usage difficult. The literature on estimating missing values (known as data imputation) shows that the imputation practices used by highway agencies are varied. The practice has been popular for over half a century (Albright, 1991b). However, few studies have been found that assess the accuracy of these imputation practices.

The traditional imputation methods used by highway agencies can be broadly categorized into factor and exponential smoothing approaches. The factor approach is the mainstream of imputation practice; its virtue is that it is easy to understand and use. The majority of highway agencies use historical data, or averages of historical data, from before the failure period as replacements; the information available from after the failure period is usually neglected. The statistical evaluation carried out in this study shows that these models usually result in large errors, because they provide no mechanism to reflect the seasonality and time series trend in the data. Exponential moving average models are used only by some agencies in England. These methods estimate

missing values based on the sum of weighted averages of past observations. Such an approach considers the time series trend and provides better estimates.

An examination of previous research showed that most previous studies employed autoregressive integrated moving average (ARIMA) techniques to estimate missing values. However, the majority of these studies focused on detecting missing values, and detailed statistical analysis is not available. The ARIMA models tested in this study show higher accuracy than the traditional imputation models used by highway agencies. Such approaches consider seasonality and the time series trend, but it is difficult for them to incorporate the information from after the failure period.

The new approach proposed in this study is based on the minimum square error (MSE) technique. The method selects a set of candidate patterns, compares them with the data from both before and after the failure period, and uses the one with the minimum square error to estimate the missing values. Before calculating the SSE between the candidate patterns and the study pattern, the data are usually normalized, which provides a mechanism to remove the time series trend. Considering the data from both before and after the failure is another important contributor to the model's higher accuracy. The study results clearly show that such an approach provides more accurate estimates than the traditional models used in practice and previous research.

Two potential improvements have been identified during the course of this research. The first is that for roads with unstable traffic patterns (e.g. recreational roads), it may not be easy to find similar patterns in historical data, and MSE models may not perform better than traditional models (e.g. the Delaware method); future research should therefore test MSE models on such patterns. The second is that the MSE models could be adapted to estimate missing daily and monthly volumes. The difficulty with such implementations is mainly the limited historical data available at higher aggregation levels.

Although retaining unmodified base data provides many advantages over imputed data sets, many highway agencies may still have to impute their databases to maintain minimum data integrity. In many cases, data imputation has to be used to provide sufficient data. In such cases, the principle of Truth-in-Data should be applied and the imputation procedures documented.

Acknowledgements

The authors are grateful to NSERC, Canada for financial support and to Alberta Transportation for the data used in this study.

References

Albright, D. (1991a) An imperative for, and current progress toward, national traffic monitoring standards, ITE Journal, 61(6), pp. 23-26.
Albright, D. (1991b) History of estimating and evaluating annual traffic volume statistics, Transportation Research Record 1305, Transportation Research Board, Washington, DC, pp. 103-107.
American Association of State Highway and Transportation Officials (AASHTO) (1992) Guidelines for Traffic Data Programs (Washington, DC: AASHTO).
American Society for Testing and Materials (ASTM) (1991) Standard Practice E1442, Highway Traffic Monitoring Standards (Philadelphia, PA: ASTM).
Chatfield, C. (1984) The Analysis of Time Series: An Introduction, 3rd edn (London: Chapman and Hall).
Clark, S. D. (1992) Application of outlier detection and missing value replacement techniques to various forms of traffic count data, ITS Working Paper 384, University of Leeds, Leeds.
Federal Highway Administration (FHWA) (1995) Traffic Monitoring Guide, FHWA-PL-95-031 (Washington, DC: US Department of Transportation).
FHWA's Scanning Program (1997) FHWA Study Tour for European Traffic Monitoring Programs and Technologies (Washington, DC: Federal Highway Administration, US Department of Transportation).
Garber, N. J. & Hoel, L. A. (1988) Traffic and Highway Engineering (New York: West Publishing Company).
Harvey, A. C. & Pierse, R. G. (1984) Estimating missing observations in economic time series, Journal of the American Statistical Association, 79(385), pp. 125-131.
Liu, G. X., Sharma, S. C. & Luo, J. Z. (2001) Traffic data needs for Saskatchewan highways and transportation, research report, Saskatchewan Highways and Transportation, Regina, Saskatchewan, Canada.
New Mexico State Highway and Transportation Department (1990) 1990 Survey of Traffic Monitoring Practices among State Transportation Agencies of the United States, Report No. FHWA-HRP-NM-90-05, Santa Fe, New Mexico.
Nihan, N. L. & Holmesland, K. O. (1980) Use of the Box and Jenkins time series technique in traffic forecasting, Transportation, 9, pp. 125-143.
Redfern, E. J., Watson, S. M., Tight, M. R. & Clark, S. D. (1993) A comparative assessment of current and new techniques for detecting outliers and estimating missing values in transport related time series data, Proceedings of Highways and Planning Summer Annual Meeting, Institute of Science and Technology, University of Manchester, Manchester.
Sharma, S. C. & Allipuram, R. (1993) Duration and frequency of seasonal traffic counts, Journal of Transportation Engineering, 119(3), pp. 344-359.
Sharma, S. C. & Werner, A. (1981) Improved method of grouping provincewide permanent traffic counters, Transportation Research Record 815, Transportation Research Board, Washington, DC, pp. 12-18.
Southworth, F., Chin, S. M. & Cheng, P. D. (1989) A telemetric monitoring and analysis system for use during large scale population evacuations, Proceedings of the IEEE 2nd International Conference on Road Traffic Monitoring, London.
Watson, S. M., Clark, S. D., Redfern, E. J. & Tight, M. R. (1993) Outlier detection and missing value estimation in time series traffic count data, Proceedings of the 6th World Conference on Transportation Research, Vol. II, pp. 1151-1162.
Zhong, M., Lingras, P. J. & Sharma, S. C. (2004) Estimation of missing traffic counts using factor, genetic, neural, and regression techniques, Transportation Research Part C: Emerging Technologies, 12, pp. 139-166.