
DEVELOPING A HYBRID MODEL FOR DISASTER PREDICTION USING MACHINE LEARNING WITH ARTIFICIAL NEURAL NETWORKS AND DATA MINING APPROACH

By P.M.H. THILAKARATHNE ICT/10/11/031 2674

Report submitted in partial fulfillment of the requirements for the

B. SC. FOUR YEAR DEGREE IN INFORMATION & COMMUNICATION TECHNOLOGY

FACULTY OF APPLIED SCIENCES RAJARATA UNIVERSITY OF SRI LANKA MIHINTALE SRI LANKA 2016

ABSTRACT
Forecasting natural disasters is a powerful tool because it helps mitigate the loss and damage caused to mankind and nature. This report discusses in detail the practical use of Machine Learning methodologies and data mining techniques in disaster forecasting. The study develops a hybrid model for predicting disasters using Machine Learning techniques with Artificial Neural Networks (ANNs) and a Data Mining approach. Weather data attributes from January 1976 to December 2012, comprising monthly average rainfall and monthly average minimum and maximum temperature of the North Central Province of Sri Lanka, are used as the first set of input data. An Autoregressive Integrated Moving Average (ARIMA) statistical model is used for time series forecasting of these weather attributes. Flood type disaster data for the same period are used to train a machine learning model built with a two-class artificial neural network classifier, which predicts the probability of a flood occurring in a particular month from the weather attribute values of that month. The future values of the weather attributes (monthly average rainfall, monthly average minimum temperature and monthly average maximum temperature) were predicted with error rates of 181.62, 0.29 and 0.88 respectively. In addition, the two-class artificial neural network classifier predicted the probability of flood occurrence in a particular month with an accuracy of 0.919. Increasing the number of weather attributes used as inputs to the second predictive model increased the accuracy of the prediction. The approach presented here uses the Microsoft Azure cloud-based machine learning platform for analyzing data and building the predictive model, demonstrating the ability to use the processing power and scalability of the public cloud. The predictive model has been published as an Application Programming Interface (API) on the Azure cloud, illustrating the practical usage and feasibility of machine learning techniques in developing modern intelligent applications.

Keywords: Machine Learning, Predictive Analytics, Microsoft Azure, Cloud Computing, ARIMA model


DECLARATION I, P.M.H. Thilakarathne (ICT/10/11/031) do hereby declare that this report was compiled by me based on the research project “Developing a Hybrid Model for Disaster Prediction using Machine Learning with Artificial Neural Networks and Data Mining Approach” for partial fulfillment of the requirement for the completion of the B.Sc. (4 year) degree in Information & Communication Technology as stipulated in the syllabus approved by the Senate of Rajarata University of Sri Lanka.

Declarer Name :
Signature :
Date :

Supervisor Name :
Signature :
Date :


ACKNOWLEDGEMENTS
The success and final outcome of this research project required a great deal of guidance and assistance from many parties. I am highly indebted to my research advisor, Dr. Kaushalya Premachandra, for her guidance and constant supervision as well as the encouragement she provided for completing the project. Without her guidance and support, my project would not have been a success. I would also like to express my special gratitude and thanks to Mr. Wellington Perera, Developer Experience director - SEA new markets at Microsoft, and Mr. Indika Dalugama, Data platform solution architect at Microsoft Sri Lanka, for providing me with technical guidance, experimental resources and lab facilities for the completion of the research project. Moreover, I would like to thank the Department of Meteorology, Sri Lanka for providing me with historical data records for research purposes. I would like to use this opportunity to express my sincere thanks to Dr. Shantha Fernando for his valuable comments and feedback, and to all the academic and non-academic staff of the Faculty of Applied Sciences for the support provided. My thanks and appreciation also go to my colleagues who helped me with the development efforts of the project and to the people who willingly helped me with their abilities and expertise. This research project would not have been possible without the support of my family. My dearest parents and sister have always been by my side, supporting and encouraging me in this research effort. Thank you very much for that tremendous support. Last but not least, I would like to thank everyone who helped me make this research project a success. May science make a better future!


TABLE OF CONTENTS
1. Introduction
1.1. Background & Motivation
1.2. Research Problem
1.3. Research Objective
1.3.1. Scientific Objectives
1.3.2. Social Objectives
1.4. Research Questions
1.5. Research Methodology
2. Literature Review
2.1. Machine Learning Models for weather and disaster prediction
2.1.1. Machine Learning Modeling for predicting soil liquefaction susceptibility
2.1.2. Weather forecasting model using Artificial Neural Networks
2.2. Data Mining Aspects Related to Time Series Predictions
2.2.1. Prediction of rainfall using autoregressive integrated moving average model: Case of Kinshasa city (Democratic Republic of the Congo), from the period of 1970 to 2009
2.3. Summary
3. Methodology
3.1. Business Understanding
3.2. Data Understanding
3.3. Data Preparation
3.4. Predictive Modeling
4. Results
5. Discussion
5.1. Usage of cloud technologies for building the predictive experiments
5.2. Application Programmable Interface
References
6. Appendices
A. Azure Machine Learning Studio
B. R Code for checking the seasonality of a time series
C. Selecting a two-class classification algorithm for developing the flood type disaster prediction model


LIST OF FIGURES
Figure 1 - Backpropagation Neural Network
Figure 2 - Process diagram showing the relationship between the different phases of CRISP-DM (Source - [19])
Figure 3: Monthly Average Rainfall data of Anuradhapura district from January 1976 to December 2011
Figure 4: Average Minimum & Maximum temperature values of Anuradhapura district from January 1976 to December 2011
Figure 5: Frequency of the Flood Type Disasters in North Central Province from January 1976 to December 2011
Figure 6: Data flow diagram of the complete predictive model
Figure 7: Steps of building the weather prediction experiment
Figure 8: Azure ML experiment of predicting weather attributes
Figure 9: R scripts used in the ‘Create R Model’
Figure 10: Input and output parameters of weather prediction API
Figure 11: Two-class classification experiment
Figure 12: Input and output parameters of disaster occurrence classification API
Figure 13: Flood type disaster occurrence predictive experiment
Figure 14: Input and output parameters of disaster occurrence API
Figure 15: Deviation of actual rainfall data and predicted rainfall values for the year 2012
Figure 16: Deviation of actual minimum temperature data and predicted minimum temperature values for the year 2012
Figure 17 - Deviation of actual maximum temperature data and predicted maximum temperature values for the year 2012
Figure 18 - ROC curve of Flood type disaster prediction model
Figure 19 - Accuracy metrics of the flood type disaster prediction model
Figure 20 - Azure Web App of disaster predictor
Figure 21 - Web application output of disaster predictor
Figure 22 - Microsoft Cortana Intelligence Suite
Figure 23: ML experiment for selecting a two-class classification algorithm


LIST OF TABLES
Table 1: Statistical Summary of the weather data obtained from the Department of Meteorology, Sri Lanka
Table 2: The actual values of the weather attributes in 2012 vs. the predicted output of the first predictive model
Table 3: Flood type disaster occurrence forecasting for the year 2012
Table 4: Accuracy of four 2-class classification algorithms


CHAPTER 01
1. INTRODUCTION
A disaster can be defined as an accident or a sudden natural catastrophe that causes great damage or loss of life. Natural disasters are a major environmental issue that creates economic and ecological damage, and they have become a great danger to human lives. Since preventing a natural disaster is not practical, predicting one is extremely helpful: it allows policy makers and authorities to alert the public and take the necessary precautions, and hence reduce the loss of life and the damage caused [1]. The underlying mechanisms of natural disasters are predicted by developing computational models, in which historical data on natural disasters are analyzed to recognize the patterns in the natural phenomena.

1.1. Background & Motivation
Data science, the use of automated methods to analyze massive amounts of data and to extract knowledge from them, is an emerging field in various domains [2]. Data science is widely used in the field of weather prediction. Numerical Weather Prediction (NWP) uses computational models to forecast weather and disaster incidents [3]. Although current weather observations serve as input parameters to these computational models, this approach is not efficient for forecasting long-term weather and disaster patterns [4]. Hence, unlike predicting the weather, predicting natural disaster occurrences requires analyzing extended periods of historical meteorological and disaster data.

Data Mining (DM), also known as Knowledge Discovery in Databases (KDD), emerged thanks to advances in Information Technology and has revolutionized the use of business and scientific databases. The data in these databases hold valuable information, such as trends and patterns, which can be used to improve decision-making and forecasting. In addition to Data Mining techniques, Machine Learning (ML) techniques are also emerging in the field of forecasting.

Machine learning is the domain of computational intelligence concerned with the question of how to construct computer programs that automatically improve with experience [5]. In the domain of prediction and forecasting, machine learning is a computational methodology that has proven its accuracy. Records of historical data on disasters and meteorology of a certain geographical area over an extended period of time can be combined and analyzed to identify patterns, trends and relationships between them. This process of data analysis can be used to predict future natural disaster occurrences [6]. For example, researchers from the University of Texas, USA have built a prototype of a national flood data-modelling and mapping system with the potential to provide flood predictions [7].

Human ability to identify patterns in large data sets is limited. Moreover, classical statistical methods such as standard regression are linear and parametric in nature and assume knowledge about the unknown dependency. They become impractical when dealing with large, high-dimensional and complex data [5]. Hence, automated DM tools are used as an alternative to analyze the raw data and extract high-level information for the decision-maker [8]. Several DM techniques, such as classification, regression and anomaly detection, can be used to analyze historical data on natural disasters and meteorology [9]. Since the discovery of nonlinearity in weather data, nonlinear ML techniques such as Artificial Neural Networks (ANNs) have been given much priority as a way to forecast natural incidents more accurately [10]. For example, some weather forecasting models that use ANNs take yearly temperature data of a particular geographical location as the training data set [11]. The set of data used to train the ANN is called the "training data set". During the training stage the same set of data is processed many times as the connection weights are refined.

1.2. Research Problem
Natural disaster forecasting is a powerful tool since it helps to minimize the loss of lives and the damage caused to mankind and nature. So far there is no efficient model that uses ANNs as the machine learning methodology, supported by a data mining approach, to predict natural disasters with improved accuracy.

1.3. Research Objective
The goal of the project is to develop a hybrid natural disaster prediction model that uses ANNs to predict upcoming disaster scenarios. Historical disaster and meteorological data are used as the input variables (training data set) of the prediction model.

1.3.1. Scientific Objectives
The primary scientific objective is to define an accurate hybrid disaster prediction model using ANNs. To build the model, appropriate data mining and machine learning techniques are tested to enhance the accuracy and the reliability of the predictions. The research also investigates the importance of modern computing paradigms such as cloud computing in the machine learning domain.

1.3.2. Social Objectives
This prediction model helps policy makers and disaster management authorities understand the mechanisms underlying the occurrence of a disaster, take appropriate measures to mitigate it, and thereby minimize the loss and damage caused to human life.


1.4. Research Questions
- What are the methods of predicting natural disasters?
- What disaster related data are appropriate for building the model?
- What meteorological data are required as input data for the model?
- What patterns best describe the data and their relationships?
- What methods can be used to identify data patterns and dependencies?
- How and where can the data be processed?
- How can the data be adapted to a machine-learning model?
- What is the best machine learning approach for natural disaster forecasting?
- How can the machine-learning model be implemented?
- How can the efficiency and the reliability of the model be improved?

1.5. Research Methodology
Identifying the data sources using historical data
Previous research on disaster models is studied to choose accurate and relevant data sources that best suit this computational model. The historical disaster data are obtained from the Disaster Management Centre of Sri Lanka (http://www.dmc.gov.lk) and historical meteorological data are obtained from the Department of Meteorology of Sri Lanka (http://www.meteo.gov.lk).

Extracting relevant meteorological data and disaster records from data sources
Since the raw data gathered from the authorities have numerous data fields, the fields relevant to the computational model are identified and pre-processed before their patterns and relations are examined. A data mining software tool such as Weka (https://weka.wikispaces.com) is used to find the patterns and relationships in the data that are needed to develop the forecasting model. The R programming language (https://www.r-project.org/) is used for time series analysis, graphical data representation and related statistical analysis.

Developing the computational machine learning forecasting model
A feed-forward artificial neural network is developed to increase the prediction accuracy of the model by further refining the results using machine-learning approaches. A large amount of input data is used for this purpose, and the computational power of cloud computing is used to implement the model. The nonlinear machine learning model based on ANNs is developed in Azure Machine Learning Studio [12]. Historical natural disaster data and historical meteorological data are used as the training data sets of the ANNs.

A mechanism to combine the outputs of the data mining approach and the outputs of the machine-learning model is devised to provide accurate predictions. Finally, the reliability of the model is tested using historical data sets and its accuracy is measured by testing the model on the available data.


CHAPTER 02
2. LITERATURE REVIEW
Research efforts to predict phenomena such as natural disasters incorporate a diversity of computational approaches, including statistical and machine learning approaches. The research project, “Developing a Hybrid Model for Disaster Prediction using Machine Learning with Artificial Neural Networks (ANNs) and Data Mining Approach”, includes a model that predicts natural disasters by identifying the patterns and relationships between historical natural disaster records and historical meteorological data. The model uses ANNs as the machine learning technique for prediction. The training data set of the ANNs includes historical natural disaster records and historical meteorological data. This chapter focuses on research contributions to predicting chronological weather-related attributes in the spheres of weather forecasting and disaster forecasting, carried out using artificial neural networks and data mining approaches.

The first part of this review describes and analyzes research efforts to predict weather and disaster incidents using machine-learning methodologies. The literature reviewed focuses mostly on prediction models built with ANNs, because the nonlinear nature of ANNs suits natural phenomena [13]. Weather observations such as monthly rainfall data form a time series dataset, since the observations are collected chronologically. The second part of the review covers data analysis using data mining aspects related to time series prediction. The Autoregressive Integrated Moving Average (ARIMA) model is one of the prominent models used for time series prediction. The advantages and disadvantages of this approach are highlighted in the review.

2.1. Machine Learning Models for weather and disaster prediction
Meteorological forecasting by means of numerical models dates back to the early 20th century, when a mathematical approach to forecasting was proposed by Abbe in his paper “The physical basis of long-range weather forecasts” [14]. However, numerical forecasting at that time was not very accurate, since scientists lacked the knowledge to simplify complex atmospheric dynamics (which occur due to variations of weather) into simple mathematical equations [15]. Later, with the evolution of computers, computational models for weather forecasting became more accurate. In his paper on the origins of computer weather prediction and climate modeling, Lynch shows this by describing the evolution of computer weather prediction and the methodologies used in that domain [3]. The paper further describes the prevailing problems in numerical weather prediction, such as the limited time span over which predictions can be made, which requires simplified linear mathematical equations for weather and climate related parameters.

Since weather data have nonlinear characteristics, researchers have been interested in using nonlinear prediction mechanisms for both weather and disaster forecasting. Hence, the nonlinear modeling capability of Artificial Neural Networks (ANNs) has been used in developing nonlinear predictive models for weather analysis [1, 4, 5, 6].

2.1.1. Machine Learning Modeling for predicting soil liquefaction susceptibility
Samui and Sitharam [6] applied two machine-learning techniques to predict the liquefaction susceptibility of soil based on standard penetration test (SPT) data from the Chi-Chi earthquake that occurred in Taiwan. The first algorithm was an ANN based on Multilayer Perceptrons (MLPs) trained with the Levenberg-Marquardt backpropagation algorithm. The second was a Support Vector Machine (SVM). The predictions made by the resulting models were then compared to analyze their efficiency and accuracy [16]. Here, the prediction model is implemented as a classification model.

In the first phase of the research, MODEL I and MODEL II use ANNs with multilayer perceptrons trained with the Levenberg-Marquardt backpropagation algorithm, where 70% of the whole dataset was chosen as the training dataset and the remainder acts as the testing dataset. In MODEL I, the SPT value and the cyclic shear stress ratio are the input variables. MODEL II was developed using the same dataset but different input parameters: the SPT value and the peak ground acceleration. SVM models were developed using the same training/testing datasets and input variables. The ANN models were built using the Neural Network Toolbox of MATLAB. A trial-and-error evaluation process showed that the number of hidden neurons and the number of epochs affect the accuracy and the performance of the model.

According to the comparative study [16], the SVM model is more convenient to use than ANNs with respect to the risk minimization principle, since it deals with only two variable parameters. Because the ANN model has a larger number of controlling parameters, obtaining an optimal combination of the number of hidden neurons, epochs, transfer functions, etc. is more complicated for ANN models than for SVM models.

2.1.2. Weather forecasting model using Artificial Neural Networks
This study was carried out to determine the applicability of the ANN approach to developing nonlinear predictive models for weather forecasting [13]. It emphasizes the advantages of ANNs for weather forecasting over other forecasting methods. Since ANNs minimize errors using various algorithms, performance is improved in contrast to other models in the weather prediction domain. The tool used in this study to carry out the analysis is the Neural Network Fitting Tool, nntool, available in MATLAB. A feed-forward artificial neural network with back-propagation is selected as the training element. Moreover, the study investigates the impact of the number of neurons and the number of hidden layers on the performance of the neural network. The authors selected maximum temperature data from a station in Toronto, Canada for the period 1999-2009 (10 years) as input data. Of the data, 60% was used to train the network, 20% for validation and the remaining 20% for testing. The study shows that increasing the number of neurons improves the performance of the predictive model and decreases the Mean Squared Error (MSE). Increasing the number of samples likewise improves performance and decreases the MSE. Increasing the number of hidden layers also decreased the MSE of the model, although this decrease can be caused by ‘overfitting’. The study also shows that ANN models can be used for predicting weather factors such as humidity and rainfall.
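The 60/20/20 partition and the MSE metric described above can be illustrated with a minimal Python sketch. The function names (`split_60_20_20`, `mean_squared_error`), the chronological (unshuffled) split and the sample temperature values are assumptions made for illustration only; they are not taken from the reviewed study, which was carried out in MATLAB.

```python
def split_60_20_20(series):
    """Split a sequence in order into train (60%), validation (20%), test (20%)."""
    n = len(series)
    train_end = n * 6 // 10   # integer arithmetic avoids float rounding
    val_end = n * 8 // 10
    return series[:train_end], series[train_end:val_end], series[val_end:]


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical maximum temperatures in degrees Celsius (illustrative values only).
temps = [30.1, 31.4, 29.8, 32.0, 30.6, 31.1, 29.5, 30.9, 31.7, 30.2]
train, val, test = split_60_20_20(temps)

# A naive baseline that always predicts the training mean, scored on the test part.
baseline = sum(train) / len(train)
test_mse = mean_squared_error(test, [baseline] * len(test))
```

A lower MSE on the validation and test parts, not only on the training part, is what distinguishes a genuine improvement from the overfitting mentioned above.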

Neural Network Model
Artificial neural networks (ANNs), as used in machine learning, are a family of models inspired by biological neural networks. They are used to estimate or approximate functions that can depend on a large number of inputs. ANNs are presented as systems of interconnected "neurons" which exchange messages with each other. The connections have numeric weights that can be tuned based on experience, making neural nets adaptive to inputs and capable of learning. Most studies using ANNs for weather and disaster prediction use multilayer perceptrons (MLPs) with the Levenberg-Marquardt backpropagation algorithm [1, 6]. Feed-forward neural networks with backpropagation should contain at least three layers (Figure 1):
1. Input Layer
2. Hidden Layer
3. Output Layer

Figure 1 - Backpropagation Neural Network

The optimal number of hidden layers and the number of neurons are chosen so that they minimize the MSE of the output prediction.
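The three-layer structure and the weight refinement by backpropagation can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the network used in the reviewed studies: it uses ordinary stochastic gradient descent rather than the Levenberg-Marquardt algorithm, and the class name `TinyMLP`, the layer sizes, the learning rate and the toy data are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """Input layer -> one hidden layer -> single sigmoid output neuron."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each neuron stores its input weights followed by a bias term.
        self.hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        self.output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
             for ws in self.hidden]
        y = sigmoid(sum(w * v for w, v in zip(self.output[:-1], h))
                    + self.output[-1])
        return h, y

    def train_step(self, x, target, lr):
        h, y = self.forward(x)
        # Output delta for squared error with a sigmoid activation.
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden deltas: the output delta propagated back through the weights.
        delta_hidden = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                        for j in range(len(h))]
        for j in range(len(h)):
            self.output[j] -= lr * delta_out * h[j]
        self.output[-1] -= lr * delta_out
        for j, ws in enumerate(self.hidden):
            for i in range(len(x)):
                ws[i] -= lr * delta_hidden[j] * x[i]
            ws[-1] -= lr * delta_hidden[j]

    def mse(self, data):
        return sum((self.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# Toy training set (hypothetical): the logical OR of two binary inputs.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
net = TinyMLP(n_inputs=2, n_hidden=3)
mse_before = net.mse(data)
for _ in range(2000):            # each pass over the data is one epoch
    for x, t in data:
        net.train_step(x, t, lr=0.5)
mse_after = net.mse(data)
```

Comparing `mse_before` with `mse_after` shows the effect of repeatedly processing the same training set; varying `n_hidden` is the kind of trial-and-error search for the number of neurons that the text describes.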


Advantages
- ANNs are easier to deploy than other machine learning models because they need less statistical training to develop.
- Neural networks are able to detect complex non-linear relationships among the input data.
- Neural networks have the ability to identify all possible interactions between predictor variables.

Disadvantages
- ANN models act as “black box” predictors: it is hard to explain the possible causal relationships between the prediction and the input variables.
- ANN models require large computational power to execute.
- Neural network models are prone to ‘overfitting’.

2.2. Data Mining Aspects Related to Time Series Predictions
A time series represents a collection of values obtained from sequential measurements over time. Time-series data mining stems from the desire to reify the natural human ability to visualize the shape of data. Many studies have been carried out to investigate the applicability of time series predictions in the weather forecasting domain.

2.2.1. Prediction of rainfall using autoregressive integrated moving average model: Case of Kinshasa city (Democratic Republic of the Congo), from the period of 1970 to 2009
This study develops a model that predicts the behavioral pattern of rainfall using the ARIMA technique. ARIMA is a statistical technique for modeling time series data [17]. Monthly precipitation data from 1970 to 2009 are used for the study. Future rainfall predictions are obtained from the fitted ARIMA model.

Advantages of ARIMA model
The ARIMA model is easily interpreted in contemplative studies. The relationship between the independent and dependent variables is easily understood based on the assumptions of the model. To maximize the prediction accuracy of ARIMA models, model selection can be performed over a time series in an automated fashion [18].

Disadvantages of ARIMA model
Since the ARIMA model assumes linear relationships between independent and dependent variables, complex non-linear real-world relationships are often not mapped onto the model well. As a result, the ARIMA model often does not perform well where the data have a complex structure. In ARIMA models, identification and estimation are critically biased by the effect of outliers [18].
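To make the linear structure of ARIMA concrete, the sketch below hand-codes the simplest non-trivial case, ARIMA(1,1,0): the series is differenced once (the "integrated" part), an AR(1) model is fitted to the differences by ordinary least squares, and forecasts are obtained by accumulating the predicted differences. This is an illustrative assumption, not the procedure of the reviewed study; the function names and the sample series are hypothetical, there is no moving-average term, and practical work would rely on a statistical package with automated model selection.

```python
def difference(series):
    """First-order differencing: the 'I' (integrated) step with d = 1."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(values):
    """Least-squares estimate of c and phi in v[t] = c + phi * v[t-1] + e[t]."""
    x, y = values[:-1], values[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi


def forecast_arima_110(series, steps):
    """Forecast `steps` values ahead with a hand-rolled ARIMA(1,1,0) model."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    last_level, last_diff = series[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = c + phi * last_diff       # AR(1) on the differenced series
        last_level = last_level + last_diff   # undo the differencing
        forecasts.append(last_level)
    return forecasts


# Made-up series whose differences grow linearly (2, 3, 4, 5, 6).
series = [10, 12, 15, 19, 24, 30]
next_two = forecast_arima_110(series, steps=2)
```

Because every step is a linear function of earlier values, a single extreme observation shifts both the fitted coefficients and all later forecasts, which is one way to see the sensitivity to outliers noted above.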

2.3. Summary
The numerous weather and disaster prediction studies provide an extensive base of information and approaches to the problem. These studies cover a wide spectrum of computational efforts, from traditional statistical approaches such as numerical weather prediction modeling to machine learning and data mining methodologies. These approaches achieve various levels of success in addressing the disaster prediction domain. An analysis of these efforts provides several key points that must be considered in future disaster prediction models. Statistical models do not give high accuracy in predicting disasters, because it is difficult to map the non-linearity of natural occurrences onto simplified linear mathematical models. Therefore, artificial neural networks are used to capture the non-linearity in predictive models. When training and validating the developed model, it is important to preprocess the data and remove outliers in order to increase the accuracy of the model. ARIMA models are adopted to perform the time series predictions within our model. To increase the accuracy and performance of model execution, it is important to adopt new computational technologies rather than customary computational approaches.

9

CHAPTER 03 3. METHODOLOGY This project includes a model that predicts natural disasters by identifying the patterns and relationships between the historical natural disaster records and historical meteorological data such as rainfall data and average temperature data. As the study aligns with a data mining approach, the method of Cross Industry Standard Process for Data Mining (CRISP-DM) has been followed.

Cross Industry Standard Process for Data Mining (CRISP-DM) –

Figure 2 - Process diagram showing the relationship between the different phases of CRISP-DM (Source - [19])

CRISP-DM proposes a six-phase methodology for solving data mining problems [20]. The process is generally used as a tool-, application- and industry-neutral framework for most data mining tasks. The six phases followed to solve the data mining task in this project are:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment

3.1. Business Understanding
This project consists of two steps. The first step is to build a hybrid model for flood disaster prediction in the North Central Province of Sri Lanka using the data mining approach. Weather attributes are used as input data to obtain the probability of flood occurrence in the North Central Province in a specific month of a year. The second step is to automate the mechanism that predicts upcoming disasters from historical weather-related data, in order to support the disaster precaution decision-making process of the relevant authorities.

3.2. Data Understanding
Since this study is based on historical weather and disaster related data, the following datasets are collected:
- Monthly average rainfall data of Anuradhapura
- Monthly average minimum temperature data of Anuradhapura
- Monthly average maximum temperature data of Anuradhapura
- Past disaster related data in the North Central Province, Sri Lanka

Monthly data records of the above weather attributes from January 1976 to December 2012 (36 years) are purchased from the Department of Meteorology, Sri Lanka [21]. The Department of Meteorology is selected as the data source because it is the official body for recording and analyzing weather-related data in Sri Lanka. The monthly average rainfall data are measured in millimeters (mm) (Figure 3) and the average temperature data are measured in degrees Celsius (ºC) (Figure 4). Each dataset consists of 432 tuples (year, month and value). Table 1 shows the statistical summary of the weather data obtained from the Department of Meteorology, Sri Lanka.

Disaster related data of the North Central Province are extracted from DesInventar, the Disaster Information Management System of Sri Lanka. DesInventar is a system designed for the acquisition, collection, retrieval, query and analysis of information about small, medium and greater impact disasters, based on pre-existing official data, academic records, newspaper sources and institutional reports of a particular country [22]. The Sri Lankan DesInventar system is managed and updated by the Disaster Management Centre (DMC), Ministry of Disaster Management, Sri Lanka. The DesInventar system, with the fields date, province, district, division and event (disaster type), contains 1,048,575 disaster records of Sri Lanka from January 1976 to December 2012. Out of these 1,048,575 disaster records, 296 flood type disaster records of the North Central Province are extracted for the purpose of this project. Both the meteorological (Figures 3, 4 and Table 1) and disaster (Figure 5) datasets are visualized using Microsoft Excel 2016 in order to obtain a basic idea of the data distribution and to perform simple statistical analysis.

Figure 3: Monthly Average Rainfall data of Anuradhapura district from January 1976 to December 2011 (x-axis: Time; y-axis: Rainfall (mm))

Figure 4: Average Minimum & Maximum temperature values of Anuradhapura district from January 1976 to December 2011 (x-axis: Time; y-axis: Temperature (ºC); series: Minimum Temperature, Maximum Temperature)

Table 1: Statistical Summary of the weather data obtained from the Department of Meteorology, Sri Lanka.

Field                       Rainfall (mm)   Minimum Temperature (ºC)   Maximum Temperature (ºC)
Number of Tuples            432             432                        432
Mean                        106.118         23.7603                    32.6997
Median                      69.75           24.1                       33
Minimum Value               0               18.8                       28.3
Maximum Value               527.5           26.1                       37.5
Standard Deviation          111.5492        1.3839                     1.7959
Number of Unique Values     355             72                         86

Figure 5: Frequency of the Flood Type Disasters in North Central Province from January 1976 to December 2011 (x-axis: Time; y-axis: Frequency)

Statistical Summary of the Flood type disaster data:
- Number of tuples - 432
- Number of frequency '0' tuples - 388 (90% of the dataset)
- Maximum Frequency - 11
- Minimum Frequency - 0

3.3. Data Preparation
The datasets contain missing values, which are replaced by the mean (average) of the attribute concerned. The mean is used because the proportion of missing values is low and because this replacement leaves the mean of each attribute unchanged. In addition, no outliers are identified in the data. A new field named 'Frequency' is introduced to convert the disaster records into a monthly attribute. It denotes the number of flood type disasters that occurred in the particular month. A field named 'Id' is introduced to denote whether a flood occurred in a particular month. The preprocessed datasets are stored in relational tables of a SQL Server database in order to combine the weather data with the flood type disaster data. T-SQL join queries are used for this task, with Microsoft SQL Server 2014 as the database server. The final dataset used for the development of the predictive models is created with 432 tuples of data from January 1976 to December 2012. The data fields are as below:
- Date - The 1st date of each month is used as a continuous time series factor. Date is in the MM/DD/YYYY format.
- Series - A number series that runs from 1 to 432
- Notation - Specific ID of the disaster given by the DesInventar information system
- Type - The disaster type (flood, flash flood etc.)
- Province - The province in which the disaster took place (all the data used in the study are from the North Central Province)
- District - The district in which the disaster took place
- Rainfall - The average rainfall of the particular month in millimeters
- T_min - The average minimum temperature of the particular month in degrees Celsius
- T_max - The average maximum temperature of the particular month in degrees Celsius
- Frequency - The number of flood type disasters that happened in a particular month
- Id - Denotes whether a flood happened in the particular month (1 denotes flood positive, 0 denotes flood negative)
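The preparation steps described above (mean replacement of missing values, and deriving the 'Frequency' and 'Id' fields per month) can be illustrated with a minimal Python sketch. This is not the procedure used in the study, which relied on Microsoft Excel and T-SQL join queries; the function names and the sample values below are hypothetical and serve only as an illustration.

```python
from collections import Counter
from statistics import mean

def impute_mean(values):
    """Replace missing readings (None) with the mean of the observed readings."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def monthly_flood_fields(months, flood_record_months):
    """Derive Frequency (flood records per month) and Id (1 if any flood, else 0)."""
    counts = Counter(flood_record_months)
    rows = []
    for m in months:
        freq = counts.get(m, 0)
        rows.append({"Date": m, "Frequency": freq, "Id": 1 if freq > 0 else 0})
    return rows

# Illustrative values only; they are not taken from the study's datasets.
rainfall = impute_mean([10.0, None, 30.0])
rows = monthly_flood_fields(
    ["01/01/1976", "02/01/1976", "03/01/1976"],   # one entry per month
    ["02/01/1976", "02/01/1976"],                 # month of each flood record
)
```

In this sketch the imputed series has the same mean as the observed readings, which is the property the text relies on when justifying mean replacement.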

3.4. Predictive Modeling
The proposed disaster predicting model is composed of two main components:
01. Forecasting the future values of the weather attributes (average rainfall, minimum & maximum temperatures)
02. Forecasting the probability of flood type disaster occurrence based on the values of weather parameters

The output of the first predictive component is used as the input for the second predictive component. The data flow of the complete predictive model is illustrated in the following diagram (Figure 6).

Figure 6: data flow diagram of the complete predictive model

Predicting the future values of the weather attributes
As the output of the first component of the predictive model, future values of three weather attributes (average monthly rainfall, average monthly maximum temperature, and average monthly minimum temperature) are predicted. Monthly weather data over a span of 36 years (historical data) are used as the input. Because these data are a series of observations made successively in time, they form a time series [23]. Here, the autoregressive integrated moving average (ARIMA) model is used to forecast future values based on the historical values of the time series.

ARIMA model
ARIMA is a statistical model that can be applied to the analysis of time series data. The model is based on three parts:
01. Autoregressive (AR) processes
02. Moving average (MA) processes
03. ARIMA process

The autoregressive (AR) part of the model derives from the theory that individual values of a time series can be described by linear models based on preceding observations. The general formula describing autoregressive models is:

$$x_t = \sum_{i=1}^{p} \alpha_i \, x_{t-i}$$

where
$x_t$ - Time series under investigation
$p$ - Order of the autoregressive model (number of time lags)
$\alpha_i$ - Autoregressive parameter of order $i$
$x_{t-i}$ - Time series lagged by $i$ periods

Since time series values can also depend on preceding estimation errors, moving average (MA) models are used. In this project, the previous errors of estimation or forecasting are taken into account when estimating the next time series value. The difference between the estimate $\hat{x}_t$ and the actually observed value $x_t$ is denoted $\varepsilon_t$. The general formula describing moving average models is:

$$x_t = -\sum_{i=1}^{q} \beta_i \, \varepsilon_{t-i}$$

where
$x_t$ - Time series under investigation
$q$ - Order of the moving average model
$\beta_i$ - Moving average parameter of order $i$
$\varepsilon_{t-i}$ - Error term

The ARIMA model is built on the combination of the AR and MA models:

$$x_t = \sum_{i=1}^{p} \alpha_i \, x_{t-i} - \sum_{i=1}^{q} \beta_i \, \varepsilon_{t-i}$$

ARIMA models additionally difference the time series data before this combined model is applied and integrate the result afterwards. They are used specifically when trend filtering is required. The parameter $d$ of the ARIMA $[p, d, q]$ model determines the number of differencing steps.
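To make the roles of the three parameters concrete, the following Python sketch differences a short series ($d = 1$), applies the combined AR and MA formula for one step, and integrates the result back to the original scale. It is only an illustration of the formulas above: the coefficients and the sample series are hypothetical, and the study itself fits the parameters automatically in R rather than by hand.

```python
def difference(series, d=1):
    """Apply d rounds of first differencing: y_t = x_t - x_(t-1)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def arma_one_step(history, errors, alpha, beta):
    """One-step value from x_t = sum(alpha_i * x_(t-i)) - sum(beta_i * eps_(t-i))."""
    ar = sum(a * history[-i] for i, a in enumerate(alpha, start=1))
    ma = sum(b * errors[-i] for i, b in enumerate(beta, start=1))
    return ar - ma

def integrate_once(last_observed, diff_forecasts):
    """Undo one round of differencing by cumulative summation."""
    out, level = [], last_observed
    for step in diff_forecasts:
        level += step
        out.append(level)
    return out

# Hypothetical series and coefficients (p = 1, d = 1, q = 1), for illustration only.
x = [3.0, 5.0, 8.0, 12.0]
dx = difference(x, d=1)
next_dx = arma_one_step(dx, errors=[0.0], alpha=[1.0], beta=[0.5])
forecast = integrate_once(x[-1], [next_dx])
```

Here the differenced series removes the upward trend of `x`, the one-step value is computed on the differenced scale, and the cumulative sum returns the forecast to the scale of the original observations.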

The R language (https://cran.r-project.org) is used to build the ARIMA predictive models for each weather attribute, using the auto ARIMA modeling function in the 'forecast' package. Analyzing large datasets needs high computation power, so Microsoft Azure Machine Learning Studio (Azure ML Studio) (https://studio.azureml.net) is used to build, test and deploy the predictive models. See Appendix A for a descriptive discussion of Azure ML Studio.

The time series datasets containing the monthly average rainfall, minimum temperature and maximum temperature are analyzed in R to observe the seasonality of each time series. The code used to check the seasonality is attached in Appendix B. The script estimates the seasonal period of each time series to be 12 months. The predictive experiment for forecasting the future values of the weather attributes is developed in Azure ML Studio with the steps shown in Figure 7.
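One common way to estimate a seasonal period is to look for the lag at which the autocorrelation of the series peaks. The Python sketch below shows that idea on a synthetic monthly series; it is an assumption about how such a check might be done and is not the R script of Appendix B. The function names and the synthetic data are hypothetical.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mu = sum(series) / n
    dev = [v - mu for v in series]
    denom = sum(d * d for d in dev)
    return sum(dev[t] * dev[t + lag] for t in range(n - lag)) / denom

def estimate_period(series, min_lag=2, max_lag=24):
    """Return the lag in [min_lag, max_lag] with the highest autocorrelation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda k: autocorrelation(series, k))

# Synthetic series with a yearly cycle (12 observations per cycle), 12 years long.
monthly = [math.sin(2 * math.pi * t / 12) for t in range(144)]
period = estimate_period(monthly)
```

Real rainfall and temperature series are noisier than this synthetic example, so in practice the peak is less sharp and the search range for the lag matters.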

Figure 7: Steps of building the weather prediction experiment

Model building steps:
Training dataset - For the prediction of each weather attribute, historical data of 36 years are used as the training dataset (432 tuples each).
Algorithm - The ARIMA model is used as the training algorithm for time series prediction. The auto.arima() function of the forecast package in the R language is used for forecasting.
Train model - The model is trained with the training dataset and the ARIMA algorithm.
Score model - This module is used to generate predictions using the trained model.
Evaluate model - A set of metrics is generated indicating the accuracy of the model.
Test dataset - A separate dataset with the weather attribute values of the year 2012 is used in the initial phase to evaluate the accuracy of the predictive model.

Most of the modules used for developing the predictive model are pre-built in Azure ML Studio. The training algorithm and the scoring element are scripted in R. The predictive experiment developed in Azure ML Studio is shown in Figure 8.

Figure 8: Azure ML experiment of predicting weather attributes

In the experiment, three weather attributes are predicted using the auto ARIMA function. From the initial dataset named 'nc_flood_series_frequence.csv', one weather field (Rainfall, T_min or T_max) is chosen at a time as the input training dataset for the 'Train model' module. The 'Create R Model' module is used to build the prediction model in the R language. The code snippets used in 'Create R Model' are shown in Figure 9.

Figure 9: R scripts used in the ‘Create R Model’

The evaluated experiment is deployed as a web service. The web service is published as a request-response Application Program Interface (API). (https://asiasoutheast.studio.azureml.net/apihelp/workspaces/5cfa71bcddf14589a7693b8edf8b1194/webservices/42f108ec763f4af0b47ad2993fa10971/endpoints/527d64bd6e604541b4fe65682cf64653/score)

The input and output parameters of the API are shown in Figure 10. The web service is published on the Microsoft Azure cloud to increase its reliability and its ability to handle a large number of API requests.

Figure 10: Input and output parameters of weather prediction API

Forecasting the probability of flood type disaster occurrence based on the values of weather parameters
The second component of the disaster prediction model is a predictive model for the probability of flood type disaster occurrence. The model takes three weather attributes (Rainfall, T_min, T_max) as its input data and outputs 0 or 1, where 1 denotes a flood type disaster occurrence and 0 denotes no flood type disaster in the specific month. This experiment can be defined as a two-class classification machine learning problem, since the members of the population have to be separated into two different sets or classes according to the values of the input parameters [1]. The steps followed in building the weather prediction model (Figure 7) are also followed for the classification model.

Model building steps:
Training dataset - For the prediction of flood type disaster occurrence, historical data over a span of 36 years are used. The 'rainfall', 'id', 'T_min' and 'T_max' fields of the 'nc_flood_series_frequence.csv' dataset are chosen as the training dataset.
Algorithm - The two-class neural network model in Azure ML Studio is selected as the classification algorithm after carrying out a comparison test (Appendix C) of four two-class classification algorithms.
Train model - The model is trained with the training dataset and the selected algorithm.
Score model - This module is used to generate predictions using the trained model.
Evaluate model - A set of metrics is generated indicating the accuracy of the model.
Test dataset - A separate dataset is used in the initial phase to evaluate the accuracy of the predictive model.

The predictive experiment developed in Azure ML studio is shown in figure 11.

Figure 11: Two-class classification experiment

The evaluated experiment is deployed as a web service. The web service is published as a request-response Application Program Interface (API). (https://asiasoutheast.studio.azureml.net/apihelp/workspaces/5cfa71bcddf14589a7693b8edf8b1194/webservices/6dd2bb589a834f78b117f59ae9a395fe/endpoints/c515a02bf4e6472dbd00873db64ccbd1/score) The input and output parameters of the API are shown in Figure 12. The web service is published on the Microsoft Azure cloud.

Figure 12: Input and output parameters of disaster occurrence classification API

In order to build the full predictive model, the output values of the weather attribute predictive model are used as the input data of the disaster occurrence classification model. The two experimental models are combined to build the final predictive model, which takes the number of months into the future as its input and predicts the probability of a flood type disaster occurrence in each of those months. The experiment built in Azure ML Studio is shown in Figure 13.
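The chaining of the two components can be sketched in Python as follows. The sketch only illustrates the data flow; the two stand-in functions are hypothetical placeholders, not the ARIMA models or the trained neural network of the study, and their constants are arbitrary.

```python
import math

def predict_floods(months_ahead, forecast_weather, classify_flood):
    """Chain the two components: forecast the weather, then score each month."""
    results = []
    forecasts = forecast_weather(months_ahead)
    for month, (rain, t_min, t_max) in enumerate(forecasts, start=1):
        results.append((month, classify_flood(rain, t_min, t_max)))
    return results

# Hypothetical stand-in for the first component (weather forecast per month).
def dummy_forecaster(months_ahead):
    return [(100.0, 24.0, 33.0)] * months_ahead

# Hypothetical stand-in for the second component (flood probability).
def dummy_classifier(rain, t_min, t_max):
    return 1.0 / (1.0 + math.exp(-(rain - 300.0) / 50.0))

out = predict_floods(3, dummy_forecaster, dummy_classifier)
```

Passing the two components in as functions keeps the chaining logic independent of how each component is implemented, which mirrors how the two Azure ML experiments are developed separately and then combined.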

Figure 13: Flood type disaster occurrence predictive experiment

The evaluated experiment is deployed as a web service. The web service is published as a request-response Application Program Interface (API). (https://asiasoutheast.studio.azureml.net/apihelp/workspaces/5cfa71bcddf14589a7693b8edf8b1194/webservices/5c8c999ea7b1405cb8530d59109a05f9/endpoints/ab1231cd57c9466aa51e516605c8f868/score) The input and output parameters of the API are shown in Figure 14. The web service is published on the Microsoft Azure cloud.

Figure 14: Input and output parameters of disaster occurrence API

The published request/response web service can be used from any web application as well as through the Azure Machine Learning plugin for Microsoft Excel. The results of the web service are evaluated and the accuracy of the outputs is measured in the results chapter.

CHAPTER 04
4. RESULTS
In order to build the flood type disaster prediction model, two sub predictive models are developed in the study. The first model predicts future values of the weather attributes, and the second model predicts the flood type disaster occurrence based on the values of the input weather parameters. The predictive models are built using Azure Machine Learning Studio.

The ARIMA model (discussed in the methodology chapter) is used as the algorithm to predict the weather attributes. The average monthly rainfall, average monthly maximum temperature and average monthly minimum temperature of the North Central Province of Sri Lanka from January 1976 to December 2011 are used as the input weather parameters for the model. The first predictive model outputs forecasts of the same three attributes for the period of one year from January 2012 to December 2012. The forecasts are then compared with the actual weather values from January 2012 to December 2012 to evaluate the accuracy and reliability of the model. A web API is generated from this predictive model in order to obtain the predicted values, and it is accessed via the Microsoft Azure Machine Learning add-in in Excel 2016 (Table 2). The deviations between the predicted and actual values of rainfall (Figure 15), minimum temperature (Figure 16) and maximum temperature (Figure 17) are calculated and plotted to show the level of accuracy of the model.

Table 2: The actual values of the weather attributes in 2012 vs. the predicted output of the first predictive model

        Rainfall (mm)          Minimum Temperature (ºC)   Maximum Temperature (ºC)
Month   Actual    Predicted    Actual    Predicted        Actual    Predicted
1       28.20     127.55       21.58     22.10            30.70     30.12
2       69.30     71.02        22.51     22.39            31.56     32.00
3       75.20     77.27        23.60     23.50            34.55     34.47
4       145.40    203.91       24.08     24.56            33.68     34.28
5       0.20      65.62        25.61     25.44            34.34     33.59
6       0.00      23.67        25.48     25.41            34.49     33.47
7       23.90     37.76        25.21     25.12            34.75     33.50
8       0.00      63.08        25.21     24.94            35.20     33.74
9       16.20     77.01        25.20     24.73            35.43     33.82
10      683.90    214.07       23.89     24.11            33.06     32.63
11      242.10    254.62       23.15     23.30            31.03     30.97
12      593.70    207.82       22.98     22.65            29.16     29.82

Figure 15: Deviation of actual rainfall data and predicted rainfall values for the year 2012 (x-axis: Month; y-axis: Rainfall (mm); series: Predicted Rainfall, Actual Rainfall)

Figure 16: Deviation of actual minimum temperature data and predicted minimum temperature values for the year 2012 (x-axis: Month; y-axis: Temperature (ºC); series: Predicted Minimum Temperature, Actual Minimum Temperature)

Figure 17: Deviation of actual maximum temperature data and predicted maximum temperature values for the year 2012 (x-axis: Month; y-axis: Temperature (ºC); series: Predicted Maximum Temperature, Actual Maximum Temperature)

The Root Mean Square Error (RMSE) of each predicted attribute is calculated as the measure of forecast accuracy:
- RMSE of rainfall forecast of year 2012 - 181.62
- RMSE of minimum temperature forecast of year 2012 - 0.29
- RMSE of maximum temperature forecast of year 2012 - 0.88
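The RMSE calculation can be sketched in Python using the 2012 values listed in Table 2. The function name is illustrative; the study obtains these figures from Azure ML Studio rather than from this code.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Values for January to December 2012, copied from Table 2.
rain_actual = [28.20, 69.30, 75.20, 145.40, 0.20, 0.00,
               23.90, 0.00, 16.20, 683.90, 242.10, 593.70]
rain_pred = [127.55, 71.02, 77.27, 203.91, 65.62, 23.67,
             37.76, 63.08, 77.01, 214.07, 254.62, 207.82]
tmin_actual = [21.58, 22.51, 23.60, 24.08, 25.61, 25.48,
               25.21, 25.21, 25.20, 23.89, 23.15, 22.98]
tmin_pred = [22.10, 22.39, 23.50, 24.56, 25.44, 25.41,
             25.12, 24.94, 24.73, 24.11, 23.30, 22.65]
tmax_actual = [30.70, 31.56, 34.55, 33.68, 34.34, 34.49,
               34.75, 35.20, 35.43, 33.06, 31.03, 29.16]
tmax_pred = [30.12, 32.00, 34.47, 34.28, 33.59, 33.47,
             33.50, 33.74, 33.82, 32.63, 30.97, 29.82]

rmse_rain = rmse(rain_actual, rain_pred)
rmse_tmin = rmse(tmin_actual, tmin_pred)
rmse_tmax = rmse(tmax_actual, tmax_pred)
```

Computed from the two-decimal values of Table 2, the results agree with the reported RMSE values to within about 0.01; small differences are expected because the table values are rounded. Most of the rainfall error comes from October and December, where the actual rainfall far exceeds the forecast.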

The second predictive model is built for forecasting the probability of flood type disaster occurrence in the North Central Province of Sri Lanka. Predicted average rainfall, minimum temperature and maximum temperature values are used as input data in order to obtain an output probability. The model is built on the Azure ML platform, and its forecast accuracy is measured using the 'Evaluate model' module in Azure ML Studio. The values of the weather attributes from January 1976 to December 2011 are used for the model: 70% of the records are chosen to train the model, while the rest are used to evaluate it. The Receiver Operating Characteristic (ROC) curve is used to plot the classification output of the flood type disaster prediction model (Figure 18). A ROC curve is a graphical representation of the different possible cut points of an analytical test, plotting the true positive rate against the false positive rate.
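How a ROC curve is traced from scored probabilities can be illustrated with a short Python sketch: each distinct score is used as a cut point, and the false positive rate and true positive rate are computed at that cut point. The labels and scores below are hypothetical and are not the study's evaluation data; the area under the curve is computed with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, one per cut point."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels (1 = flood) and scored probabilities, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

The sketch assumes that both classes are present in the labels; with a heavily imbalanced dataset such as this one, a small evaluation split could contain very few positive months, which makes the curve coarse.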


Figure 18 - ROC curve of Flood type disaster prediction model

The evaluation metrics of the classification model that are output by Azure ML Studio are shown in Figure 19.

True positive - The number of times the actual class is positive and it is predicted as positive
False positive - The number of times the actual class is negative but it is predicted as positive
True negative - The number of times the actual class is negative and it is predicted as negative
False negative - The number of times the actual class is positive but it is predicted as negative

Accuracy - The proportion of correct predictions to the total number of predictions.

$$\text{Accuracy} = \frac{\text{True positive} + \text{True negative}}{\text{True positive} + \text{True negative} + \text{False positive} + \text{False negative}}$$

Precision - The proportion of predicted positive cases that are actually positive.

$$\text{Precision} = \frac{\text{True positive}}{\text{True positive} + \text{False positive}}$$

Recall - The proportion of actual positive cases that are correctly identified by the model.

$$\text{Recall} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}$$

F1 Score - The harmonic mean of precision and recall.

$$\text{F1 Score} = \frac{2\,(\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}$$

Threshold - The boundary value between the two classes

The Area Under the Curve (AUC) is the portion of the unit square that lies under the ROC curve. The value of AUC is always between 0 and 1, where 1 is the best case, in which everything is predicted correctly.
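The four metrics defined above can be computed directly from the confusion matrix counts, as in the following Python sketch. The counts used here are hypothetical and are not the values shown in Figure 19.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for illustration only.
m = classification_metrics(tp=8, tn=85, fp=2, fn=5)
```

The sketch does not guard against division by zero, which occurs when the model predicts no positives at all. Because about 90% of the months in the dataset are flood negative, a model that never predicts a flood would already score roughly 0.90 on accuracy, so precision and recall are worth reading alongside it.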

Figure 19 - Accuracy metrics of the flood type disaster prediction model

Finally, the first and the second experimental models are combined to build the final predictive model. The input of this model is the number of months into the future, from which the probability of a flood type disaster occurrence in each specific month is forecasted. The forecasted probability (output) is obtained through the web API, which is accessed using the Azure Machine Learning add-in in Microsoft Excel 2016. The flood occurrences forecasted by the final predictive model for the year 2012 match the actual flood occurrences in all 12 months (Table 3).

Table 3 - Flood type disaster occurrence forecasting for the year 2012

Month   Actual Flood Occurrence   Scored Probability   Forecasted Flood Occurrence
1       0                         0.285                0
2       0                         0.069                0
3       0                         0.029                0
4       0                         0.031                0
5       0                         0.027                0
6       0                         0.027                0
7       0                         0.027                0
8       0                         0.028                0
9       0                         0.028                0
10      0                         0.051                0
11      0                         0.195                0
12      0                         0.371                0
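The step from scored probability to forecasted flood occurrence can be sketched in Python with the values of Table 3. The threshold of 0.5 is an assumption made for this illustration; the threshold actually used by the deployed model is not stated in the text.

```python
THRESHOLD = 0.5  # assumed boundary between the two classes; not stated in the study

# Scored probabilities and actual occurrences for January to December 2012 (Table 3).
scored = [0.285, 0.069, 0.029, 0.031, 0.027, 0.027,
          0.027, 0.028, 0.028, 0.051, 0.195, 0.371]
actual = [0] * 12

# A month is forecasted as flood positive when its probability reaches the threshold.
forecast = [1 if p >= THRESHOLD else 0 for p in scored]
matches = sum(1 for f, a in zip(forecast, actual) if f == a)
```

Under this assumed threshold every month of 2012 is classified as flood negative, which agrees with the forecasted column of Table 3. Since 2012 contains no recorded flood months, this comparison exercises only the negative class.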

Hence, the predictive model developed using Azure Machine Learning Studio, with the number of months into the future as its input, can be used to predict the probability of a flood type disaster occurrence in a specific month. The model is deployed as a web service that is accessible through a REST API.

Request Response API
https://asiasoutheast.studio.azureml.net/apihelp/workspaces/5cfa71bcddf14589a7693b8edf8b1194/webservices/5c8c999ea7b1405cb8530d59109a05f9/endpoints/ab1231cd57c9466aa51e516605c8f868/score

In order to demonstrate that the Request Response API can be used for building custom applications, a web application has been built using Azure Web Apps, with ASP.NET as the server-side framework. The application is hosted on the Microsoft Azure cloud. The user can input the number of months into the future through a slider selector (Figure 20). The predicted output is presented in a table as shown in Figure 21.
URL - http://disasterpredictor.azurewebsites.net/

Figure 20 - Azure Web App of disaster predictor

Figure 21 - Web application output of disaster predictor


CHAPTER 05
5. DISCUSSION
Natural disasters cause significant damage to the environment while threatening human lives. The ability to predict disasters before they occur helps to minimize the damage caused to mankind. This study proposes a predictive model that forecasts natural disasters using historical weather data as the input parameters of the model. Artificial neural networks and the statistical concepts of time series forecasting are used to implement the model. The predictive experiments are built on Microsoft Azure Machine Learning Studio. The open source R statistical environment and Microsoft Excel are used for data preparation and plotting purposes.

In order to build the flood type disaster prediction model, two sub predictive models have been developed in the study. The first model predicts the future values of the weather attributes, while the second model predicts the flood type disaster occurrence based on the forecasted values of the input weather parameters. The time series predictions of the weather attributes (average rainfall, average minimum temperature, and average maximum temperature) made by the first predictive model have RMSE values of 181.62, 0.29, and 0.88 respectively. The forecasted values of the average temperature attributes are very close to the real values. However, a considerable deviation is observed between the forecasted rainfall values and the real values, most visibly in October and December 2012 (the predicted outputs are listed in Table 2). These errors might have occurred due to noise in the data, such as outliers. The accuracy of the training dataset increases the reliability of the predictive model, i.e. it yields a better forecast with lower RMSE values. Therefore, to achieve a reliable predictive model, the input dataset should be free of outliers and missing values. Handling missing values and outliers in the data pre-processing stage is one way of increasing the accuracy of the predictive model.

The reliability of the data sources is another vital factor that affects the accuracy of prediction. For this study, the weather-related datasets are gathered from the Department of Meteorology, Sri Lanka. Errors of the recording devices and errors during data storage also affect the final output of the prediction model. In addition, the disaster data of the North Central Province of Sri Lanka extracted from the DesInventar database contain outliers and erroneous values, because it is a manually created database that combines official data sources with media reports.

The second model, which predicts flood type disaster occurrence based on the forecasted values of the input weather parameters, uses three weather parameters as input data. Limiting the number of hidden neurons to two has reduced the probability of overfitting. The accuracy of the second predictive model can be increased by using more weather attributes as input data, since in this study increasing the number of input parameters of the neural network increased the accuracy and reliability of the predictions.

5.1. Usage of cloud technologies for building the predictive experiments
Microsoft Azure Machine Learning Studio is used for building the predictive models. Processing a large amount of data and training a machine learning model on a heavy dataset require relatively high computation power. In this study, Microsoft Azure (https://azure.microsoft.com), one of the most popular cloud services available in the market, is used to train, test, evaluate and host the experiments. The Azure cloud service provides high computational power as a Platform as a Service (PaaS). The data storage is also hosted on the cloud. These features make the developed models more reliable and available. The cloud-based Azure Machine Learning Studio provides a set of data preprocessing modules and machine learning algorithms, which makes it possible to develop an end-to-end predictive experiment. Such an experiment does not depend on a particular programming language or on the underlying physical infrastructure. The cloud-based machine learning service used in this study makes the process of predictive analytics easier and more accessible.

5.2. Application Programming Interface
The ultimate output of this study is the request-response Application Programming Interface (API): (https://asiasoutheast.studio.azureml.net/apihelp/workspaces/5cfa71bcddf14589a7693b8edf8b1194/webservices/5c8c999ea7b1405cb8530d59109a05f9/endpoints/ab1231cd57c9466aa51e516605c8f868/score)
The API takes "the months into the future" as the web service input and forecasts the probability of a flood type disaster occurring in the particular month. An API governs how one application can communicate with another. The use of predictive analysis for real-world intelligent application development is demonstrated through this study. The study also demonstrates that machine learning models are capable of performing predictions based on historical data.



6. APPENDICES


A. AZURE MACHINE LEARNING STUDIO

Microsoft Azure Machine Learning Studio is a product under Microsoft’s product umbrella named the Cortana Intelligence Suite (https://www.microsoft.com/en-us/cloud-platform/cortana-intelligence-suite).

Figure 22 - Microsoft Cortana Intelligence Suite

Microsoft Azure Machine Learning Studio is the collaborative tool that Microsoft provides for building, testing, and deploying predictive analytics solutions for data. Machine Learning Studio publishes models as web services that can easily be consumed by custom applications or business intelligence tools such as Excel and Microsoft Power BI. To develop a predictive analytical model, data are gathered from one or more sources, then transformed and analyzed through various data manipulation and statistical functions to generate the results. Azure Machine Learning Studio provides an interactive, visual workspace to easily build, test, and iterate on a predictive analysis model.

Azure Machine Learning Studio comes with the following advantages.

Fully managed scalable cloud service - Azure ML Studio can handle thousands, and even millions, of data records. The Azure platform provides a scalable and efficient cloud environment that increases computational capacity and speeds up processing.

Ability to develop & deploy - A web service can easily be deployed from the developed ML model and used in any web application.

Wide range of built-in ML algorithms - Azure ML Studio ships with many ML algorithms as pre-built modules that can be used right away for building models.

R & Python integration - Azure ML Studio allows ML models to be extended by integrating Python and R code snippets.


B. R CODE FOR CHECKING THE SEASONALITY OF A TIME SERIES find.freq