Air Pollution Forecasting using Machine Learning Techniques

Marijana Cosovic¹, Emina Junuz²

¹ Faculty of Electrical Engineering, University of East Sarajevo, Istocno Sarajevo, BiH
² Faculty of Information Technology, Dzemal Bijedic University, Mostar, BiH

[email protected], [email protected]

Abstract. Air pollution in Sarajevo, the capital of Bosnia and Herzegovina, is an ever-increasing problem. Very little research is being conducted in the domain of air pollutants' impact on public health in B&H. Recently, several parties have taken an initiative to deploy a particulate matter (PM2.5) sensor network in Sarajevo, and based on this valuable data, future research can be conducted. Weather behavior can be analyzed as time series data with hourly, daily, weekly, and seasonal periodicities. Machine learning techniques are proving useful for air pollution forecasting. In this paper we evaluate the performance of several machine learning algorithms applied to air quality and meteorology datasets. A network architecture based on the multilayer perceptron (MLP), a feedforward type of artificial neural network, was used for air pollution forecasting. We also used Long Short-Term Memory (LSTM) units to build a recurrent neural network capable of learning long-term dependencies in the data. We compare prediction accuracies of urban air quality, as this is of significant importance to the public.

Keywords: Machine learning · Air pollution forecast · MLP · LSTM

1 Introduction

The most important effects of air pollution are on human health [1], the ecosystem [2] and the human-built environment [3]. The Federal Ministry of Environment and Tourism of B&H issued a rule book on the method of air quality monitoring, the definition of pollutant types, limit values and other air quality standards [4], in accordance with USA and EU standards on ambient air pollutants. Even though we recognize the effects of indoor air quality on human health, outdoor air quality was the primary focus of this study. The air pollutants considered in this study are PM10, SO2, NO2, CO and O3. In addition, we used atmospheric pressure, temperature, relative humidity and wind speed. We aim to predict PM10 concentration, based on air pollutant concentrations and major meteorological data, using machine learning techniques. By exploring multilayer perceptron (MLP) feedforward neural networks and Long Short-Term Memory (LSTM) recurrent neural networks, we choose a suitable approach for dealing with nonlinear systems such as air pollution, aiming at next-day particulate matter forecasts.

2 Datasets

The datasets were obtained courtesy of the Federal Hydrometeorological Institute (FHMZ) BH [5]. There are four measuring stations in the vicinity of the greater Sarajevo area (Bjelave, Otoka, Vijecnica, and Ilidza), located and operated by either the Federal Hydrometeorological Institute or the Cantonal Institute for Public Health. Location data of the measuring stations is given in Table 1. The datasets report continuous measurements (averaged hourly values) of temperature, relative humidity, pressure, average wind speed, PM10, SO2, NO2, CO and O3 during 2017.

Table 1. Station location: latitude, longitude and altitude

Station     Latitude        Longitude       Altitude (m)
Bjelave     43°52'03'' N    18°25'23'' E    631
Otoka       43°50'54'' N    18°21'49'' E    512
Vijecnica   43°51'33'' N    18°26'04'' E    554
Ilidza      43°49'40'' N    18°18'49'' E    499

We consider a one-year period with hourly averaged values for air pollutants and major meteorological variables. These variables (date, time, type of day (working weekday, Saturday, and Sunday), temperature, relative humidity, pressure, average wind speed, PM10, SO2, NO2, CO and O3) proved suitable for detecting and forecasting elevated values of particulate matter in ambient air. The dimension of the feature matrix is 8760×13. Hourly average values for PM10 concentration during the year 2017 are shown in Fig. 1.

Fig. 1. PM10 concentrations measured at Bjelave station during 2017.

The red line in the figure represents the guideline value of 50 μg/m³ suggested by the World Health Organization, the maximum value to be tolerated at any given time. During the winter season that value is exceeded frequently, as shown in Fig. 1. The aim of this paper is to structure a forecasting problem in which the pollution of the next day is determined based on previous data.

2.1 Missing data

Ambient air pollutant data was continuously measured during 2017 at the Bjelave monitoring station, in an urban part of Sarajevo. Only data that passed FHMZ control and validation is reported in this study. Table 2 shows available and missing hourly average air pollutant data. Daily average values are computed from hourly average values for those days missing no more than six hourly values. For example, 40 hourly average values are missing for the PM10 pollutant during the fall season, but since those missing values are spread out among different days, a few values at a time, the daily average value could be computed for every day in the fall of 2017. Missing hourly average values for all five air pollutants considered are in the range of 8-20 percent and are due to analyzers not performing the measurements. Equipment failure, the need for calibration of analyzers, and unforeseen problems that are difficult to recognize at their occurrence or to resolve in a short amount of time are some of the reasons for missing data. In general, calibration of the equipment is done during the summer season; the increased number of missing values during that period can be attributed to these scheduled outages.

Table 2. Available and missing hourly average values (total and by season) for five pollutants during 2017

Pollutant   Available   Missing   Missing by season (Spring, Summer, Fall, Winter)
PM10        7761        999       363, 384, 40, 212
SO2         6996        1764      528, 711, 226, 299
NO2         7278        1482      220, 995, 128, 139
CO          8056        704       187, 452, 12, 53
O3          7785        975       280, 445, 39, 211
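The six-missing-hours rule for computing daily averages can be sketched with pandas; the dates and concentration values below are hypothetical, purely for illustration:

```python
import numpy as np
import pandas as pd

# Two hypothetical days of hourly PM10 values (ug/m3), constant 30 for clarity.
idx = pd.date_range("2017-10-01", periods=48, freq="h")
pm10 = pd.Series(30.0, index=idx)
pm10.iloc[0:7] = np.nan    # day 1: 7 missing hours -> daily average rejected
pm10.iloc[24:27] = np.nan  # day 2: 3 missing hours -> daily average accepted

# Compute the daily mean only when at most six hourly values are missing.
daily = pm10.resample("D").agg(
    lambda day: day.mean() if day.isna().sum() <= 6 else np.nan
)
print(daily.tolist())  # [nan, 30.0]
```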

As data from various sources become abundant, the problem of missing data needs to be addressed accordingly. Within our datasets, sporadic points were missing, along with longer gaps of up to several days. We first considered deleting every row (time instance) in which some feature had a missing value, but this was not a plausible option given the limited number of instances in the dataset (one calendar year of observations). We therefore turned to data imputation and examined both statistical and machine learning approaches [6]. Computing and inserting the overall mean, although a fast imputation method, introduces side effects such as reduced dataset variance; we considered seasonal adjustment since our data has seasonal variation. Time-series-specific imputation methods, such as last observation carried forward (LOCF) and next observation carried backwards (NOCB), were used. The authors in [6] presented an imputation model based on machine learning techniques (LASSO regression and a Bagging ensemble); their results show that the suggested imputation method improves on hot-deck methods. We used the root mean squared error (RMSE) to evaluate the different imputation methods, and the LASSO-Bagging method reduced RMSE values on our datasets.
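As a minimal sketch of the LOCF and NOCB imputation methods mentioned above (using pandas; the gap pattern and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hourly PM10 series with sporadic gaps (values are illustrative only).
idx = pd.date_range("2017-01-01", periods=8, freq="h")
pm10 = pd.Series([21.0, np.nan, np.nan, 30.0, 28.0, np.nan, 35.0, 33.0],
                 index=idx)

locf = pm10.ffill()  # last observation carried forward
nocb = pm10.bfill()  # next observation carried backwards

print(locf.tolist())  # [21.0, 21.0, 21.0, 30.0, 28.0, 28.0, 35.0, 33.0]
print(nocb.tolist())  # [21.0, 30.0, 30.0, 30.0, 28.0, 35.0, 35.0, 33.0]
```

Neither method is appropriate across multi-day gaps, which is one reason the LASSO-Bagging approach of [6] was evaluated as well.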

3 Air Pollution Forecasting Models

We considered four seasons for air pollution forecasting: spring, summer, fall, and winter. The dataset for the winter season was collected every hour from January 1, 2017 until March 21, 2017, as well as from December 22, 2017 until December 31, 2017. The spring season dataset included data from March 21, 2017 until June 21, 2017. The summer season dataset was collected from June 21, 2017 until September 23, 2017, while the fall dataset was collected from September 23, 2017 until December 22, 2017. The datasets contain 2208 instances for spring (92 days observed), 2256 instances for summer (94 days observed), 2160 instances for fall (90 days observed) and 2136 instances for winter (89 days observed). We considered nineteen features for each of the datasets: date, time, type of day (working weekday, Saturday, and Sunday), PM10, SO2, NO2, CO, O3, previous-day PM10 (up to seven features for the seven previous days), atmospheric pressure, temperature, relative humidity and wind speed. Hence, the dimensions of the spring, summer, fall, and winter feature matrices are 2208×19, 2256×19, 2160×19, and 2136×19, respectively.
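The seasonal split above can be reproduced with a simple date mask; a sketch assuming a complete 8760-hour index for 2017 (the instance counts it prints match those stated in the text):

```python
import pandas as pd

# Hourly timestamps for 2017: 8760 rows, as in the full feature matrix.
ts = pd.Series(pd.date_range("2017-01-01 00:00", "2017-12-31 23:00", freq="h"))

# Season boundaries used in the text.
spring = ts[(ts >= "2017-03-21") & (ts < "2017-06-21")]
summer = ts[(ts >= "2017-06-21") & (ts < "2017-09-23")]
fall   = ts[(ts >= "2017-09-23") & (ts < "2017-12-22")]
winter = ts[(ts < "2017-03-21") | (ts >= "2017-12-22")]

print(len(spring), len(summer), len(fall), len(winter))  # 2208 2256 2160 2136
```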

3.1 Correlation

Correlation among meteorological parameters and air pollutants can be measured with different coefficients: Pearson's, Spearman's and Kendall's. The Pearson correlation coefficient is a measure of linear correlation between normally distributed variables. When the variables are not normally distributed, or the relationship between them is not linear, it is more appropriate to use the Spearman rank correlation coefficient [7]. The Kolmogorov-Smirnov test was used to determine the type of distribution of our data; accordingly, Spearman's correlation coefficient (SCC) was used. SCC assesses the monotonic relationship between variables and takes values between −1 and +1. Results of SCC for the full-year data are shown in Table 3. Spearman's correlation coefficient amongst the selected features was also computed for all four seasons, and it shows stronger correlation during the winter season, when air pollution is at its maximum. In addition, positive correlation is observed between PM10, CO, NO2, and SO2, while negative correlation is observed between those air pollutants and O3. We also observed negative correlation between all air pollutants and temperature, apart from O3. Atmospheric pressure and relative humidity likewise have positive correlation coefficients with PM10, CO, NO2, and SO2 and a negative correlation with O3. The largest correlation coefficients are present between PM10, CO, NO2, and SO2, confirming that these air pollutants originate from the same sources.

Table 3. Spearman's correlation coefficient amongst selected features

        PM10    CO      NO2     SO2     O3      Press   Temp    Hum     Wind    Class
PM10    1
CO      0.55    1
NO2     0.49    0.70    1
SO2     0.36    0.24    0.41    1
O3     -0.25   -0.39   -0.26   -0.03    1
Press   0.20    0.18    0.20    0.17   -0.13    1
Temp   -0.39   -0.40   -0.28   -0.20    0.55   -0.19    1
Hum     0.10    0.26    0.11    0.06   -0.51    0.08   -0.59    1
Wind   -0.22   -0.29   -0.38   -0.19    0.28   -0.12    0.18   -0.23    1
Class   0.58    0.34    0.33    0.33    0.11    0.16   -0.09   -0.10   -0.14    1

We also observe that temperature and wind speed have negative correlation coefficients with PM10, CO, NO2, and SO2 and a positive correlation with O3. Daily temperature, relative humidity and wind affect O3 formation. In general, higher temperatures with lower relative humidity are more favorable meteorological circumstances for ozone formation than lower temperatures with higher relative humidity. Also, depending on wind speed (high or light), ozone concentration can be diluted or can build up, hence the positive correlation between wind speed and ozone concentration.
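Spearman coefficients of this kind can be computed directly with pandas; a toy sketch (the values below are hypothetical, chosen to give perfectly monotone relationships):

```python
import pandas as pd

# Toy hourly samples (hypothetical): PM10 rises with CO, falls with temperature.
df = pd.DataFrame({
    "PM10": [30.0, 55.0, 80.0, 20.0, 65.0, 40.0],
    "CO":   [0.4, 0.9, 1.3, 0.3, 1.0, 0.6],
    "Temp": [5.0, -2.0, -6.0, 12.0, -4.0, 3.0],
})

# Rank-based correlation: +1 / -1 for perfectly monotone relationships.
scc = df.corr(method="spearman")
print(scc.loc["PM10", "CO"], scc.loc["PM10", "Temp"])  # 1.0 -1.0
```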

3.2 Methodology

We used the Keras [8] deep learning Python library with the Theano backend to develop and evaluate the deep learning models. Two models were developed for the air pollution forecast: an MLP model and an LSTM model.

The first model, based on the backpropagation algorithm for training a fully connected multilayer perceptron (MLP) neural network, is similar to our previous work on short-term load forecasting [9]. It is defined as a sequence of layers: an input layer, hidden layers and an output layer. The shape of the input data needs to be specified only for the first layer in the sequence. We devised MLP models with 13-19 features in the input layer; when forecasting PM10 concentration, we optionally include PM10 values from up to seven prior days. The model that minimizes the performance measures presented in Section 3.3 is chosen. In Keras, the Dense class is one way to define fully connected layers. Network weights were initialized to random numbers drawn from either a uniform or a Gaussian distribution. An appropriate activation function allows for better training of the network [10]: traditionally, sigmoid and tanh activation functions are used, but the authors in [10] have shown that better performance can be achieved with a rectifier activation function. We used 10-fold cross validation to determine accuracy on the test dataset; as we increase the number of hidden layers beyond two, classification accuracy decreases, as noticed in [9]. In our case, a neural network with two hidden layers is therefore the optimal model for air pollution forecasting. Using either too few or too many neurons in the hidden layers may result in underfitting or overfitting, respectively. We selected the network architecture by trial and error, in accordance with the following general guidelines: the number of neurons in a hidden layer should be between the sizes of the input and output layers, roughly 2/3 of the input layer size plus the output layer size. Hence, we trained a neural network with two dense hidden layers of 15 and 10 neurons, respectively [9].

While in feed-forward neural networks information travels in the forward direction only, recurrent neural networks (RNN) can maintain information from the computation of an earlier input, and hence have memory capabilities. RNN performance degrades when there are long-term dependencies between previous inputs and present targets. The LSTM (Long Short-Term Memory) cell provides a better tradeoff between RNN performance on one side and the time elapsed between previous inputs and present targets on the other. An LSTM network is an RNN composed of LSTM cells; LSTMs address the vanishing gradient problem of RNNs by updating the state of each cell in an additive way. We developed an LSTM model for air pollution forecasting in the Keras deep learning library such that, given the meteorological conditions and pollution concentrations of prior days, we can forecast air pollution for the next day. All features are normalized to zero mean and a standard deviation of one. The datasets are split into training and testing sets: we fit our LSTM model on 80% of the data and evaluate it on the remaining 20%.
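The authors build this architecture with Keras Dense layers; purely as a structural sketch, the resulting 19-15-10-1 feed-forward pass can be written in plain NumPy (the weights here are random placeholders, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    """Rectifier activation, as recommended in [10]."""
    return np.maximum(0.0, x)

# Layer sizes from the text: 19 input features, hidden layers of 15 and 10
# neurons, and a single output neuron for the next-day PM10 value.
sizes = [19, 15, 10, 1]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def mlp_forward(x):
    """One feed-forward pass: ReLU on the hidden layers, linear output."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return x @ weights[-1] + biases[-1]

batch = rng.normal(size=(4, 19))  # four hypothetical hourly instances
print(mlp_forward(batch).shape)   # (4, 1)
```

This network has (19·15+15) + (15·10+10) + (10·1+1) = 471 trainable parameters; in Keras the same structure would be three Dense layers trained with backpropagation.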
We trained the LSTM model with 50 neurons in the first hidden layer; a single neuron in the output layer produces the air pollution prediction. We used the Adam optimization algorithm [11] instead of stochastic gradient descent because of its straightforward implementation, computational efficiency and small memory requirements.
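A sketch of the preprocessing described above (feature standardization and the 80/20 chronological split); the matrix shape follows the spring dataset, while the values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(10.0, 3.0, size=(2208, 19))  # stand-in for the spring matrix

# Standardize every feature to zero mean and unit standard deviation.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# Chronological 80/20 train/test split (no shuffling for time series).
cut = int(len(Xn) * 0.8)
train, test = Xn[:cut], Xn[cut:]

# Keras LSTM layers expect 3-D input: [samples, timesteps, features].
train_3d = train.reshape((train.shape[0], 1, train.shape[1]))
print(train.shape, test.shape, train_3d.shape)  # (1766, 19) (442, 19) (1766, 1, 19)
```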

3.3 Performance Measures

The mean absolute error (MAE) is the sum of the absolute differences between the actual value and the forecast, divided by the number of observations:

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert f_i - y_i \rvert \qquad (1)



Hence, the mean absolute error is an average of the absolute errors, where f_i is the prediction and y_i the actual value, as shown in (1); all individual differences have equal weight. The mean absolute percentage error (MAPE), shown in (2), is another measure of the prediction accuracy of a forecasting model: the average of the absolute errors divided by the actual observation values.

\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left\lvert \frac{f_i - y_i}{y_i} \right\rvert \qquad (2)



The mean squared error (MSE), shown in (3), is the sum of the squared errors divided by the number of observations:

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (f_i - y_i)^2 \qquad (3)

The MSE is probably the most commonly used error metric. It penalizes larger errors, because squaring a larger number has a greater impact than squaring a smaller one. The root mean squared error (RMSE), shown in (4), is the sample standard deviation of the differences between predicted and true values:

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (f_i - y_i)^2} \qquad (4)

4 Results and Conclusion

We forecasted air pollution for the last week of each of the four season datasets; those periods were not included in the training step. The air pollution forecast based on the MLP model for the winter season was performed from March 15 until March 21, and its comparison to the real air pollution is shown in Fig. 2. Elevated values of particulate matter can be observed during that period, as well as an increase in the evening hours, mostly due to coal burning for heating. The air pollution forecast for the spring season was performed from June 15 until June 21.

Fig. 2. MLP model Real/Forecasted PM10 Concentration from March 15 until March 20, 2017 at Bjelave Station

A comparison of real and forecasted air pollution concentrations on March 20, 2017 (winter season) and on June 20, 2017 (spring season) using the MLP model is shown in Fig. 3, on the left and right respectively. These are the best prediction results, obtained with one day of prior information; accordingly, the performance measure values were smallest in that case. All of the performance measures are negatively-oriented scores, meaning that lower values are better.

Fig. 3. MLP model Real/Forecasted PM10 Concentration on March 20, 2017 (Winter season) and June 20, 2017 (Spring season) at Bjelave Station

The summer season air pollution forecast was performed from September 17 until September 23, and the fall season forecast from December 16 until December 22. A comparison of real and forecasted air pollution concentrations on September 22, 2017 (summer season) and on December 21, 2017 (fall season) using the LSTM model is shown in Fig. 4, on the left and right respectively. These are the best prediction results, obtained with one day of prior information; accordingly, the performance measure values were smallest in that case.

Fig. 4. LSTM model Real/Forecasted PM10 Concentration on September 22, 2017 (Summer season) and December 21, 2017 (Fall season) at Bjelave Station

The LSTM model, compared to the MLP model, yields reduced performance measures on all datasets considered. The largest positive correlation is observed between particulate matter and carbon monoxide, as shown in Table 3. We observed that during the spring and summer seasons the most missing data comes from the PM10 and CO pollutants. We conclude that due to this the performance measures shown in Table 4 are slightly increased, but nevertheless smaller than in the case of the MLP model. Air pollution forecasts based on the LSTM model for the remaining seasons were performed for the same dates as in the MLP forecasting model.

Table 4. MAPE and RMSE values for Bjelave station using the MLP and LSTM models for the winter, spring, summer and fall seasons

Station: Bjelave

Season     MLP MAPE   MLP RMSE   LSTM MAPE   LSTM RMSE
Winter     0.019      0.027      0.018       0.023
Spring     0.023      0.037      0.019       0.022
Summer     0.027      0.039      0.021       0.031
Fall       0.015      0.026      0.012       0.024

We developed two models for air pollution forecasting based on artificial neural networks: a feed-forward MLP and a recurrent LSTM neural network. In this paper we analyzed the performance of these models applied to datasets collected over one calendar year at one location in the Sarajevo city area. Since we encountered missing data problems, we explored statistical and machine learning methods for data imputation. We considered four seasons for air pollution forecasting: spring, summer, fall, and winter.

The MLP model with two hidden layers was optimal, since adding further hidden layers caused the performance indices to deteriorate; the LSTM model used one hidden layer. We used a cross-validation technique to determine the number of neurons in each of the layers. We found that some features had a greater effect than others on the forecast, and that the performance measures were best for forecasts based on the previous day's information; using more prior information worsened the performance indices. The LSTM model performed slightly better than the MLP model for all seasons considered. For future work we will explore LSTM models with additional layers, as well as other methods of data imputation, seeking to reduce the performance measures further.

References

1. G. S. Martinez, J. V. Spadaro, D. Chapizanis, V. Kendrovski, M. Kochubovski, P. Mudu, "Health Impacts and Economic Costs of Air Pollution in the Metropolitan Area of Skopje," International Journal of Environmental Research and Public Health, vol. 15, no. 4, 626, 2018. [Online]. Available: https://doi.org/10.3390/ijerph15040626
2. L. Jones, G. Mills, A. Milne, "Assessment of the impacts of air pollution on ecosystem services – gap filling and research recommendations," (Defra Project AQ0827), Final Report, July 2014.
3. M. C. Rodriguez, L. Dupont-Courtade, W. Oueslati, "Air pollution and urban structure linkages: Evidence from European cities," Renewable and Sustainable Energy Reviews, vol. 53, pp. 1-9, 2016. [Online]. Available: https://doi.org/10.1016/j.rser.2015.07.190
4. Rule book on the method of air quality monitoring. [Online]. Available: http://extwprlegs1.fao.org/docs/pdf/bih149115.pdf
5. Federal Hydrometeorological Institute (FHMZ) Bosnia and Herzegovina. [Online]. Available: http://www.fhmzbih.gov.ba/latinica/index.php
6. G. Rosati, "Construcción de un modelo de imputación para variables de ingresos con valores perdidos a partir de Ensamble Learning. Aplicación a la Encuesta Permanente de Hogares," Revista SaberES, vol. 9, no. 1, pp. 91-111, 2017.
7. J. Hauke, T. Kossowski, "Comparison of values of Pearson's and Spearman's correlation coefficients on the same sets of data," Quaestiones Geographicae, vol. 30, no. 2, pp. 87-93, 2011.
8. F. Chollet, Keras, 2015, GitHub repository. [Online]. Available: https://keras.io
9. E. Bećirović, M. Ćosović, "Machine learning techniques for short-term load forecasting," in 4th International Symposium on Environmental Friendly Energies and Applications, Belgrade, Serbia, 2016.
10. V. Nair, G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of ICML, 2010, pp. 807-814.
11. D. P. Kingma, J. Ba, "Adam: A Method for Stochastic Optimization," arXiv:1412.6980 [cs.LG], December 2014. [Online]. Available: https://arxiv.org/abs/1412.6980