SAMI 2011 • 9th IEEE International Symposium on Applied Machine Intelligence and Informatics • January 27-29, 2011 • Smolenice, Slovakia
Design and Implementation of Local Data Mining Model for Short-Term Fog Prediction at the Airport
P. Bednár**, F. Babič**, F. Albert*, J. Paralič* and J. Bartók***
* Department of Cybernetics and Artificial Intelligence, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Slovakia
** Centre for Information Technologies, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Slovakia
*** MicroStep-MIS spol. s.r.o., Čavojského 1, 841 08 Bratislava, Slovakia
Abstract: This paper presents short-term prediction of fog occurrence based on suitable data mining methods. The whole process was implemented following the CRISP-DM methodology, the most commonly used approach to data mining. The methodology consists of six main phases, which we describe for our application: business understanding, data understanding, data preparation, modeling, evaluation and deployment. Together they resulted in new and useful knowledge to be used in real practice. The main motivation behind our solution was to develop an effective data mining model, based on local conditions at the airport, for short-term fog prediction, which is a crucial factor for air service management. The first results presented in this paper are promising.
I. INTRODUCTION
Interest in short-term weather warnings with higher localization accuracy has recently grown, especially in connection with the influence of significant and hazardous meteorological events in various areas (for example traffic, agriculture, tourism and public safety). In general, fog forecasting is a difficult process that includes many sources of uncertainty. The important predictors are usually well known, but it is difficult to obtain all the necessary information from the performed observations and collected historical data. Fog has a significant impact on human activities (air and road traffic and shipping, to mention just a few), so an improvement of fog prediction methods is important to society as a whole.
The currently used approach to the prediction of visibility-reducing fog starts with a common 3D meteorological model executed for a limited region; its outputs are converted into visibility using empirical formulae [9]. This approach by itself cannot achieve results of satisfactory quality, and common meteorological models often fail to handle inversion weather conditions, which commonly produce fog. Several experimental models that further process the results of the common meteorological model are therefore in development worldwide: 1D physical fog modeling methods and statistical post-processing of model outputs [10], [11]. The result is then interpreted by a meteorologist, who takes into account further factors: mainly his/her experience with meteorological situations and local conditions, satellite imagery, real-time data from meteorological stations suggesting that fog has started to form or that conditions are
978-1-4244-7430-1/11/$26.00 ©2011 IEEE
favorable for its occurrence, conditions of the soil in the target locations, snow cover, recent fog occurrences, etc.
In this paper, we describe our approach to the prediction process based on the selected data mining methodology, CRISP-DM. We have implemented this approach and evaluated it on meteorological data collected around airports in the United Arab Emirates (e.g. Sharjah Airport). This locality was selected because of an ongoing research project with the company MicroStep-MIS¹, which will deploy the implemented and evaluated models in its monitoring information system.
The paper is organized as follows: after this short introduction, section II presents the state of the art in the relevant domain and a brief description of the CRISP-DM methodology. Section III describes in detail our data mining approach to fog prediction, the created models and their evaluation. The paper closes with a short summary and a sketch of our future work.
II. RELATED WORK
In this section we describe various methods already applied to historical data in order to improve the prediction of weather conditions. Identification of important parameters that have a strong influence on fog formation is described in [2]; the results were obtained from historical data representing meteorological conditions at the International Airport of Rio de Janeiro. An interesting approach was used by a research group from the Italian Aerospace Research Centre, which developed several fog classifiers based on Bayes networks [1]. The same method was used in [7] to create a basic network structure that was further adapted to local prediction models. This approach was developed and tested in the conditions of major Australian airports, and the achieved results are more than 55 forecasted fogs in a row, compared to the previous mean of 7-8, with a reduced false alarm ratio. The K-nearest neighbor algorithm was used in [3] to develop a system for climate prediction.
This system used historical data such as rain, wind speed and temperature to predict weather conditions for a specific time span. Two datasets (40 000 and 80 000+ records) were used for
¹ http://www.microstep-mis.com/
prediction of up to 17 climatic attributes; for example, an accuracy of 96.66% was achieved for Boolean attributes such as fog, snow or thunder.
The weather forecasting problem can be stated as a special case of time series prediction. An interesting approach to time series prediction is the use of neural networks, which provide a model able to learn important characteristics from past and present information and use them to predict future states of the investigated time series [4], [5]. A similar neural network approach was used in [6] for fog prediction at Canberra International Airport. A 44-year database of standard meteorological observations was used to develop, train and test the neural network and to validate the obtained results. The proposed neural network was trained to produce forecasts for lead times of 3, 6, 12 and 18 hours. The results (a cross-validated mean value of 0.937 at the 3-hour lead time, etc.) indicate a good forecasting ability of the neural network, which is robust to error perturbations in the meteorological data.
Y. Radhika and M. Shashi [8] proposed an application of the Support Vector Machine (SVM) to weather prediction. Time series data of the daily maximum temperature at a location are analyzed to predict the maximum temperature of the next day. The performance of the SVM was compared with that of a Multi Layer Perceptron (MLP) in terms of mean square error (MSE), and the SVM performed better: its MSE was in the range of 7.07 to 7.56, whereas the MLP error varied between 8.07 and 10.2.
Several artificial intelligence methods were used in [13] to predict aircraft delays at Frankfurt airport according to weather conditions. In this case the travel time was used as the target value for algorithms such as linear regression, neural networks, decision trees and fuzzy clustering.
The obtained results documented the easy interpretation of decision trees and clustering, and up to 20% higher prediction accuracy than with simple mean estimators.
Fog forecasting based on data mining methods such as association rules is described in [14]. That paper presents in detail all performed steps: data collection and understanding; data pre-processing; feature selection and feature construction; handling of missing values; data transformation; model creation and rule generation. The results are generated association rules, with computed confidence and support, that describe combinations of reasons for fog occurrence. All these rules are then stored in a knowledge base and used to create an expert system.
A. CRISP-DM
CRISP-DM (CRoss Industry Standard Process for Data Mining) [12] is a commonly used methodology for the whole data mining process, which it describes as an integrated life cycle composed of six main phases, see Fig. 1. The methodology is the result of a long standardization process in the data mining domain across the academic and industrial spheres. The main goal of this initiative is to provide common mechanisms, procedures and recommendations for each data mining step, i.e. guidelines or best practices for users. Each phase has its own specific goals and techniques for accomplishing them.
The business understanding phase focuses on project objectives and requirements from a business perspective, which are then converted into data mining goals or problem definitions. A plan for the whole process is created based on the identified inputs and specified goals. The data understanding phase covers basic operations with the initial data collections, mainly aimed at getting familiar with them, evaluating the data quality and obtaining basic characteristics and statistics of the investigated historical dataset.
Figure 1. Life cycle of the data mining process based on CRISP-DM methodology [12]
The data preparation phase covers all activities needed to construct the final dataset for modeling purposes. This is a very important phase for the whole process and the expected results, and in many cases it takes several iterations. Relevant tasks include table, record and attribute selection as well as transformation and cleaning of data according to the selected modeling algorithms and their requirements.
During the modeling phase, various algorithms and techniques are selected and applied to the prepared data. An important step is the calibration of the algorithms' parameters to optimal values based on the obtained results. Several main data mining techniques can be used in this phase: classification, prediction, clustering, association rules and others. The results of this phase are the created models for further evaluation and deployment.
In the evaluation phase, several steps are performed: all created models are interpreted and evaluated with respect to the specified business objectives, and the creation process is reviewed step by step. At the end of this phase, a decision on the use of the data mining results should be reached.
The deployment phase contains a presentation or report describing the obtained results for customers, with proposed deployment steps. In many cases it will be the customer, not the data analyst, who carries out the deployment steps, but collaboration between them is important for effective deployment and successful realization of the whole data mining process.
III. FOG PREDICTION
A. Business understanding
Fog occurrence has a strong impact on air service management at airports. Unpredicted fog at
a large, busy airport can cause many ongoing problems, so our main goal is to improve fog forecasting, which will lead to cost savings and increased public safety. This problem can be transformed into a time series prediction task, where historical data are used to predict the occurrence of fog in the near future. We have specified our task as classification, i.e. the predicted value comes from a finite set of values (the simplest case is binary classification, e.g. value 1 if there will be fog and value 0 if there will not). Apart from the predicted value itself, it is useful to have a “prediction confidence” as one of the predictor's outputs (e.g. “the estimated probability of fog in one hour is 0.8”). Although our primary goal is a prediction of the best possible quality, the interpretation of the rules used for prediction can also be interesting. A secondary task is thus descriptive data mining: the ability to comprehensibly describe the processes leading to the occurrence of fog at a given airport.
B. Data understanding
This phase started with the selection of data relevant to the specified problem. We have investigated the following data sources:
• Set of physical quantities measured automatically by meteorological stations, radars or sounding balloons, or observed manually. These data form time series and have assigned 3D coordinates of the measured area (i.e. ranges or points of longitude, latitude and altitude).
• Set of physical quantities computed by standard physical models. These data can vary with the preconditions and settings of the models, form time series, can be predicted (i.e. it is possible to obtain future cases) and have assigned 3D or 2D coordinates.
• Satellite images, either raw multi-spectral images of the selected area or pre-processed images with indication of various conditions such as fog or clouds.
After making the identified data available, an initial data examination was performed, leading to verification of the quality of the data, see Fig. 2 for data extracted from METAR records.
Figure 2. Example of selected attributes with relevant number of valid records
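The kind of data-quality check summarized in Fig. 2 can be sketched as a count of valid records per attribute. This is only an illustration and not the SPSS Clementine stream used in the project; the record layout, the sample values and the helper name `count_valid` are assumptions made for the example, and missing values are represented as `None`.

```python
def count_valid(records, attributes):
    """Return {attribute: number of records with a non-missing value}."""
    counts = {name: 0 for name in attributes}
    for record in records:
        for name in attributes:
            if record.get(name) is not None:
                counts[name] += 1
    return counts


# Hypothetical hourly records; None marks a missing measurement.
records = [
    {"VIS": 8000, "RH": 71.0, "CTOT": 3},
    {"VIS": None, "RH": 88.5, "CTOT": None},
    {"VIS": 400, "RH": None, "CTOT": 8},
]
valid = count_valid(records, ["VIS", "RH", "CTOT"])
# Share of valid records per attribute, useful for spotting sparse attributes.
share = {name: n / len(records) for name, n in valid.items()}
```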
Part of the examination was also the computation of basic statistics of the key attributes and their correlations.
C. Data preparation
Data pre-processing is usually the most complex and most time-consuming phase of the knowledge discovery process (usually taking 60 to 70 percent of the overall time). For our application, the data were processed by the following operations:
• Data extraction from the meteorological messages: the goal was to extract data encoded in the text messages broadcast from the meteorological stations in METAR format. The format of the messages is fixed, with standard codes denoting the parts of the messages and data values. The output of this task is a relational database with the extracted data.
• Data extraction from the satellite images: the goal was to extract indicator variables that encode fog and low cloud cover in the given area. The output is a relational database with the extracted data. This source could be exploited in future work to make the designed models more precise through enhanced quality of the input data.
• Data integration: meteorological data from all sources (i.e. data extracted from messages, satellite images, meteorological stations and physical model predictions) are integrated into one relational database. Each record in the integrated database has an assigned valid from/to time interval and 3D coordinates of the measured area (i.e. ranges for longitude, latitude and altitude).
• Data interpolation: since each data source had a different data precision and/or granularity, the goal of this task was to interpolate measured values and compute additional data for the requested area and time with the specified data granularity. The same approach was used for the replacement of missing values.
• Data reduction: from a large data set we have selected a representative sample, which is used in modeling.
Reduction is usually necessary because of technical restrictions inherent to some methods, but it can also simplify the task at hand by removing irrelevant attributes and records, thus even increasing the quality of the results.
• Noise reduction: after consultation with domain experts (meteorologists), we have specified valid ranges of values and detected invalid data. Out-of-range data were treated as missing values.
The resulting dataset currently consists of records extracted from METAR only (i.e. outputs of the physical models and satellite images were not used for the experiments described in the following sections). Geographically, the data cover the area of 10 airports in the United Arab Emirates, mainly located around Dubai and the north coastline, with a time span of 10 years and a granularity of one hour. The quality of the available meteorological data was low, with a high number of missing records (on average 30% of records per airport, for some airports as much as 90%). We have tried to integrate
an additional data source, the Climatological Database System (CLDB [15]), but the data quality still has to be improved. Another principal problem specific to fog prediction is the unbalanced distribution of fog-positive and fog-negative examples (fog occurs in only 0.36% of cases).
The extracted data consist of real-valued attributes of physical quantities, nominal attributes denoting observed meteorological conditions and an indicator attribute for the target variable, i.e. the occurrence of fog at a specific location. The key attributes are:
• WX1: actual weather at the airport, e.g. FG (fog), DZ (drizzle), RA (rain), etc.
• C1: cloud amount in the first layer
• CTOT: overall cloud amount
• VIS: visibility
• RH: relative humidity
• CAVOK: Ceiling And Visibility Okay, which indicates no cloud below 1 500 m, visibility of 10 km or more and no cumulonimbus at any level (in our case a Boolean attribute)
and some others.
Additionally, the data were enhanced with derived attributes computed using empirical formulae, as a ratio of physical attributes or as a trend. Derived attributes include information about the fog situation at neighboring airports (average for the 3 or 5 airports closest to the target area) and relative humidity computed empirically from temperature and dew point. Trend attributes were computed for temperature, dew point, relative humidity, the difference between temperature and dew point, and pressure.
Most of the data pre-processing tasks were implemented in the SPSS Clementine data mining environment. For example, the creation of the training and testing sets is displayed in Fig. 3. Tasks are represented graphically as nodes connected into data processing streams. Processing starts with randomization of the data order, followed by the stratified selection of 90% of the data for training (node Balance).
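The derived attributes mentioned in this section (relative humidity from temperature and dew point, and trend attributes) can be illustrated with a short sketch. The paper does not state which empirical formulae were used, so the Magnus-type approximation with the coefficients 6.112, 17.62 and 243.12 and the simple lagged difference used as a trend are assumptions chosen for illustration, as are the function names.

```python
import math


def saturation_vapor_pressure(temp_c):
    """Magnus-type approximation of saturation vapor pressure over water (hPa)."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def relative_humidity(temp_c, dew_point_c):
    """Relative humidity in percent from temperature and dew point (deg C)."""
    if temp_c is None or dew_point_c is None:
        return None  # propagate missing values instead of guessing
    e_actual = saturation_vapor_pressure(dew_point_c)
    e_saturated = saturation_vapor_pressure(temp_c)
    return 100.0 * e_actual / e_saturated


def trend(series, lag=1):
    """Difference between each value and the value `lag` steps earlier.

    Assumed definition of a trend attribute; None where either value is missing.
    """
    result = []
    for i, value in enumerate(series):
        previous = series[i - lag] if i >= lag else None
        if value is None or previous is None:
            result.append(None)
        else:
            result.append(value - previous)
    return result


# Hypothetical hourly temperatures with one missing observation.
temperatures = [24.0, 22.5, None, 20.0]
temperature_trend = trend(temperatures)
rh = relative_humidity(20.0, 10.0)
```

Missing inputs are propagated as `None`, which matches the treatment of out-of-range data as missing values described above.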
D. Modeling
Modeling is the core of the data mining process, in which the selected data mining method is applied to the pre-processed data. Our models are simple predictors for time series, where the prediction of outputs for time t+1, …, t+K is based on a sequence of historical data (i.e. a time “window”) from time …, t-2, t-1, t. Prediction of outputs is limited to one hour ahead (i.e. K = 1). There is a whole range of prediction methods, from statistical methods to artificial intelligence methods, such as linear or logistic regression models, Support Vector Machines, neural nets, probabilistic models (for example Bayesian networks), decision/regression trees and lists, etc. We have tested various methods provided in the SPSS Clementine environment. In the case of neural nets we used the standard SPSS model, i.e. a feed-forward network with one hidden layer trained by the back-propagation algorithm. The number of neurons in the hidden layer was optimized using cross-validation. Finally, we selected decision tree models, which provide a good compromise between prediction accuracy and model comprehensibility, and neural networks. The main difference between these two approaches lies in their requirements on the input data: decision trees are able to process datasets with errors and missing values, whereas neural networks are very sensitive to the quality of inputs, and the obtained results strongly depend on it. In order to obtain optimal results, all parameters of the algorithms were tuned using cross-validation. One example of the obtained results for decision trees, presented in Fig. 4, shows the achieved precision for correctly and wrongly classified examples, with the possibility to evaluate the coincidence matrix for both classes, no fog (0) and fog (1). Besides the binary decision used for evaluation, the models are also able to output the confidence of the prediction.
Additionally, the decision trees were converted to rules, which were evaluated by experts in order to understand the main conditions identified by the model.
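The time-window formulation used in this section can be illustrated by a sketch that turns an hourly sequence into training examples. The window length of 3, the attribute naming scheme and the function name `make_windows` are assumptions for the example; only the horizon K = 1 is taken from the text.

```python
def make_windows(rows, target, window=3, horizon=1):
    """Build (features, label) pairs from an hourly sequence.

    Features are the `window` most recent rows ending at time t;
    the label is the target value at time t + horizon.
    """
    examples = []
    for t in range(window - 1, len(rows) - horizon):
        features = {}
        for back in range(window):
            for name, value in rows[t - back].items():
                features[f"{name}_t-{back}"] = value
        examples.append((features, target[t + horizon]))
    return examples


# Hypothetical hourly relative humidity and fog indicator (1 = fog).
rows = [{"RH": rh} for rh in [60, 70, 80, 90, 95]]
fog = [0, 0, 0, 1, 1]
examples = make_windows(rows, fog, window=3, horizon=1)
```

Each resulting feature dictionary can be fed to any classifier that accepts a fixed set of attributes, which is how a decision tree or a feed-forward network can be applied to a time series.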
Figure 4. One of the obtained results in the case of decision trees
Figure 3. The main processed dataset is divided into two parts for training and testing purposes in SPSS Clementine
In the stream shown in Fig. 3, the selected training data are processed by the data mining method (a neural network in this case). The second branch of the stream (node Merge) removes the training cases from the source data, and the remaining 10% of the data are used for testing (the testing model is not displayed in Fig. 3).
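A stratified 90/10 split of this kind can be sketched outside SPSS Clementine as follows. This is an illustration under assumptions, not a reproduction of the Balance and Merge nodes: the function name, the seeding scheme and the toy label vector are hypothetical.

```python
import random


def stratified_split(labels, train_share=0.9, seed=0):
    """Return (train_idx, test_idx) keeping the class ratio in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize order within each class
        cut = round(len(indices) * train_share)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy labels: 20 fog-positive and 980 fog-negative cases.
labels = [1] * 20 + [0] * 980
# Ten independent random splits, one per seed.
splits = [stratified_split(labels, seed=s) for s in range(10)]
train_idx, test_idx = splits[0]
```

Splitting each class separately guarantees that the rare fog-positive cases appear in both parts, which a purely random split of strongly unbalanced data does not.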
E. Evaluation
All models were evaluated on testing data using the following measures:
• Recall = TP / (TP + FN)
• False alarm = FP / (TP + FP)
• True skill score = Recall - False alarm
where TP (FP) is the number of true (false) positive examples and TN (FN) is the number of true (false) negative examples. For evaluation, the data were randomly divided ten times into training and testing sets with stratification (i.e. the ratio of fog-positive and fog-negative cases was preserved in both the training and the testing set). 90% of the data were used for training the models and 10% for testing. Averaged results with standard deviations are presented in the following table:
TABLE 1. ACCURACY OF MODELS FOR FOG PREDICTION
Model | Recall | False alarm | True skill score
Decision trees | 0.77 ± 0.8 | 0.44 ± 0.14 | 0.33 ± 0.19
Neural networks | 0.68 ± 0.8 | 0.41 ± 0.1 | 0.26 ± 0.12
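The three measures defined in this section can be computed from confusion-matrix counts as in the sketch below. The function name and the counts in the example are hypothetical; the formulas follow the definitions given above.

```python
def fog_scores(tp, fp, fn):
    """Return (recall, false_alarm, true_skill_score) as defined in this section."""
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("scores are undefined without positive cases or predictions")
    recall = tp / (tp + fn)        # share of fog events that were predicted
    false_alarm = fp / (tp + fp)   # share of fog predictions that were wrong
    return recall, false_alarm, recall - false_alarm


# Hypothetical counts for one test split.
recall, false_alarm, skill = fog_scores(tp=30, fp=20, fn=10)
```

With the averaged values in Table 1, the decision tree row gives 0.77 - 0.44 = 0.33, which matches the reported true skill score.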
The results are plausible and comparable with the existing methods, but they still need improvement, mainly in the quality of the input data, which could be extended with data extracted from satellite images. Another challenge inherent to the fog phenomenon is the unbalanced number of positive and negative records, which was a problem mainly for neural networks. In the current training dataset only 0.2% of the cases are fog-positive. We have tried to balance the data using simple re-sampling.
F. Deployment
The deployment phase deals with the final evaluation of the created models and their results, focused on possible exploitation in real practice. In our case, the obtained results will be used as an integrated part of the Airport Weather System developed by MicroStep-MIS, a company specialized in the design, development and manufacturing of various monitoring and information systems.
IV. CONCLUSION
In this paper we have described a data mining approach to fog prediction. According to preliminary results, our models are comparable to the existing methods based on the global physical model and empirical rules. We have implemented the whole chain of data pre-processing tasks, which extract and integrate data from various meteorological sources. To some degree our models can also be compared to the other data mining methods presented in the “Related work” section, but it has to be noted that the results are highly influenced by the fog conditions in the tested area and by the quality of the training data.
In future work we will integrate more data sources and balance the positive examples in order to improve the quality of the training data. The quantity of positive examples is crucial for the specified task, so we will start experiments with balancing methods to obtain a more even class distribution. We also want to test our models on other airports in this locality to achieve a better description of fog occurrence in these conditions. An interesting challenge for us is to predict fog at Slovak airports, where the fog situation is more complicated due to different meteorological conditions, so we have started collecting historical data from these localities in order to create the necessary dataset for mining purposes.
ACKNOWLEDGMENT
The work presented in the paper is supported by the Slovak Research and Development Agency under the contract No. VMSP-P-0048-09 (50%) and by the project implementation: Development of the Center of Information and Communication Technologies for Knowledge Systems (ITMS project code: 26220120030) (50%), supported by the Research & Development Operational Program funded by the ERDF.
REFERENCES
[1] G. Zazzaro, F. M. Pisano, P. Mercogliano, “Data Mining to Classify Fog Events by Applying Cost-Sensitive Classifier”, 2010 International Conference on Complex, Intelligent and Software Intensive Systems, pp. 1093-1098, ISBN 978-1-4244-5917-9.
[2] N. F.F. Ebecken, “Fog Formation Prediction In Coastal Regions Using Data Mining Techniques”, International conference on environmental coastal regions No2, Cancun, Mexico, 1998, pp. 165-174, ISBN 1-85312-527-X.
[3] Z. Jan, M. Abrar, S. Bashir, A. M. Mirza, “Seasonal to Interannual Climate Prediction Using Data Mining KNN Technique”, Wireless Networks, Information Processing and Systems 2009, Vol. 20, pp. 40-51, ISBN 978-3-540-89853-5_7.
[4] G. Acosta, M. Tosini, “A Firmware Digital Neural Network for Climate Prediction Applications”, Proceedings of IEEE International Symposium on Intelligent Control 2001, Mexico City, Mexico, ISBN 0-7803-6722-7.
[5] T. Koskela, M. Lehtokangas, J. Saarinen, K. Kaski, “Time Series Prediction With Multilayer Perceptron, FIR and Elman Neural Networks”, Proceedings of the World Congress on Neural Networks 1996, INNS Press, San Diego, USA, pp. 491-496.
[6] D. Fabbian, R. de Dear, S. Lellyett, “Application of Artificial Neural Network Forecasts to Predict Fog at Canberra International Airport”, Weather and Forecasting 2007, Vol. 22, No. 2, pp. 372-381.
[7] G. T. Weymouth, T. Boneh, P. Newham, J. Bally, R. Potts, K. Korb, “Dealing with uncertainty in fog forecasting for major airports in Australia”, 4th Conference on Fog, Fog Collection and Dew 2007, La Serena, Chile, pp. 73-76.
[8] Y. Radhika, M. Shashi, “Atmospheric Temperature Prediction Using SVM”, International Journal of Computer Theory and Engineering 2009, Vol. 1, No. 1, pp. 1793-8201.
[9] I. Gultepe, M.D. Müller, Z. Boybeyi, “A new visibility parameterization for warm fog applications in numerical weather prediction models”, J. Appl. Meteor 2006, Vol. 45, pp. 1469-1480.
[10] A. Bott, T. Trautmann, “PAFOG - a new efficient forecast model of radiation fog and low-level stratiform clouds”, Atmos. Research 2002, Vol. 64, pp. 191-203.
[11] COST 722 - Short range forecasting methods of fog, visibility and low clouds. Final Report, COST Office, Brussels, Belgium, 2007.
[12] CRISP 1.0 Process and User Guide, available on http://www.crispdm.org/CRISPWP-0800.pdf
[13] F. Rehm, “Prediction of Aircraft Delay at Frankfurt Airport as a Function of Weather”, Presentation from German Aerospace Center, Germany, 2004.
[14] S. Viademonte, F. Burstein, R. Dahni, S. Willians, “Discovering Knowledge from Meteorological Databases: A Meteorological Aviation Forecast Study”, Third International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2001), Conference proceedings, LNCS 2114, pp. 61-70, Munich, Germany.
[15] Climatological Database System, available on: http://www.microstepmis.com/index.php?lang=en&site=src/products/meteorology/cldb