Chapter 1
Data mining and Decision Support System
The application of data mining to environmental monitoring has become crucial for a number of tasks related to emergency management. Over recent years, many tools have been developed to support decision support systems (DSS) for emergency management. In this Chapter, a contribution to data mining in the form of a graphical user interface (GUI) for an environmental monitoring system is presented. The interface supports (i) data collection and observation, (ii) data extraction for data mining, and (iii) decision making. We use mathematical models for data analysis and decision making, and the tool may serve as a basis for future development along the lines of the Open Source Software paradigm. The work is based on a project contribution, and we present it from three perspectives: data mining, decision support systems (DSS), and environmental monitoring systems (ENMS). The main contribution of this Chapter is a graphical user interface (GUI) as a tool for environmental management applications. The Chapter therefore has a practical side, concerning environmental monitoring systems, and a theoretical side, concerning the development of mathematical models for emergency management. We use Matlab to develop the models, and all the work has been implemented in practice in the proposed project.
1.1
Introduction
Knowledge discovery (KD) demonstrates intelligent computing at its best, and is the most desirable and interesting end product of information technology (IT). Being able to discover and extract knowledge from data is a task that many researchers and practitioners are endeavoring to accomplish. There is a lot of hidden knowledge waiting to be discovered; this is the challenge created by today’s abundance of data. Knowledge Discovery in Databases (KDD) is the process of identifying valid, novel, useful, and understandable patterns from large datasets. Data mining (DM) is the mathematical core of the KDD process, involving the inferring algorithms that explore the data, develop mathematical models and discover significant patterns (implicit or explicit), which are the essence of useful knowledge [1, 2]. The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities; it has also gained popularity in the database field [2–5]. Data mining is about explaining the past and predicting the future by exploring and analyzing data. It is a multi-disciplinary field that combines statistics, machine learning, artificial intelligence and database technology. Although data mining algorithms are widely used in extremely diverse situations, in practice one or more major limitations almost invariably appear and significantly constrain successful data mining applications. Frequently, these problems are associated with large increases in the rate of data generation, the quantity of data and the number of attributes (variables) to be processed. Increasingly, the data situation is beyond the capabilities of conventional data mining methods [4]. The management of environmental emergencies is one of the fields scientists are most interested in developing, since rapid environmental changes call for continuous surveillance and on-line decision making. The complexity of environmental problems makes it necessary to develop and apply new tools capable of processing not only numerical aspects but also the experience of experts and wide public participation, which are all needed in decision-making systems [2, 4, 8]. As a part of a decision support system for environmental emergency management, data mining plays a central role in data extraction, analysis, and prediction. In an environmental monitoring system (ENMS), data come from
measuring stations (i.e. meteorological ones), and the measurements flow from several sensors to support decision makers. In this Chapter, we present a Matlab graphical user interface (GUI) as a tool for environmental applications. The environmental emergency management we consider here concerns data collection and prediction, and their role in supporting decision making. The data here are sensor measurements observed in real time. In recent years, a large number of potentially useful methods and software tools have been proposed, including methods for environmental surveillance. Our tool’s contribution is to connect the monitored data, after processing and extraction, with powerful Matlab tools for data mining, using outlier detection methods, classification and clustering, linear filtering, polynomial modeling, and nonlinear regression and analysis. We present the data mining algorithms and methods that meet our project phases in a sequence following [2] and [4]. The tool is a contribution to a project named Integrated Network for Emergencies (NIE) [7]. The interface is connected to the other parts of the project to form a comprehensive system for environmental management; our role in the project is to support decision making with scientific prediction tools. Making decisions concerning complex systems (e.g., the management of organizational operations, industrial processes, environmental management or investment portfolios, the command and control of military units, or the control of nuclear power plants) often strains our cognitive capabilities. Even though individual interactions among a system’s variables may be well understood, predicting how the system will react to an external manipulation such as a policy decision is often difficult [6]. There is a substantial amount of empirical evidence that human intuitive judgment and decision making can be far from optimal, and that they deteriorate even further with complexity and stress. Because in many situations the quality of decisions is important, aiding the deficiencies of human judgment and decision making has been a major focus of science throughout history. Disciplines such as statistics, economics, and operations research have developed various methods for making rational choices. More recently, these methods, often enhanced by a variety of techniques originating from information science, cognitive psychology, and artificial intelligence, have been implemented in the form of computer programs, either as stand-alone tools or as integrated computing environments for complex decision making. Such environments are often given the common name of decision support systems (DSSs).
Figure 1.1: The architecture of a DSS (see [6, 11])
Decision Support System (DSS)
Decision support systems are interactive, computer-based systems that aid users in judgment and choice activities. They provide data storage and retrieval, but enhance the traditional information access and retrieval functions with support for model building and model-based reasoning. They support framing, modeling, and problem solving [6]. Decision support systems can be fully computerized, human-powered, or a combination of both. While academics have perceived DSSs as tools to support the decision-making process, DSS users see them as tools to facilitate organizational processes [9]. Some authors have extended the definition of DSS to include any system that might support decision making [10]. There are three fundamental components of DSSs [6, 11]: • Database management system (DBMS) A DBMS serves as a data bank for the DSS. It stores large quantities of data that are relevant to the class of problems for which the DSS has been designed and provides logical data structures (as opposed to the physical data structures) with
which the users interact. A DBMS separates the users from the physical aspects of the database structure and processing. It should also be capable of informing the user of the types of data that are available and how to gain access to them. • Model-base management system (MBMS) The role of an MBMS is analogous to that of a DBMS. Its primary function is to provide independence between the specific models used in a DSS and the applications that use them. The purpose of an MBMS is to transform data from the DBMS into information that is useful in decision making. Since many of the problems that the user of a DSS will cope with may be unstructured, the MBMS should also be capable of assisting the user in model building. • Dialog generation and management system (DGMS) The main product of an interaction with a DSS is insight. As their users are often managers who are not computer trained, DSSs need to be equipped with intuitive and easy-to-use interfaces. These interfaces aid not only in model building, but also in interaction with the model, such as gaining insight and recommendations from it. The primary responsibility of a DGMS is to enhance the ability of the system user to utilize and benefit from the DSS. In this Chapter, we use the broader term user interface rather than DGMS. While the quality and reliability of modeling tools and the internal architectures of DSSs are important, the most crucial aspect of DSSs is, by far, their user interface. Systems with user interfaces that are cumbersome or unclear, or that require unusual skills, are rarely useful and accepted in practice. The most important result of a session with a DSS is insight into the decision problem. In addition, when the system is based on normative principles, it can play a tutoring role; one might hope that users will learn the domain model and how to reason with it over time, and improve their own thinking. A good user interface to a DSS should support model construction and model analysis, reasoning about the problem structure in addition to numerical calculations, and both choice and optimization of decision variables [6, 12]. In this Chapter, we introduce a DSS user interface using DM methods as the mathematical core of the KDD process. We have developed mathematical models connected directly to the main user interface for supporting the decision-making process. The interface here is designed for the project and is to be developed in the
future into Open Source Software (OSS). We show the process in detail in the next Sections and further applications in the next Chapters. A minimal structural sketch of the three DSS components is given below.
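To make the three-component view above concrete, the following minimal Matlab sketch wires together a toy data bank, a toy model base and a user-facing query function. It is only a structural illustration under invented names and data; it does not reproduce the actual NIE implementation described later in this Chapter.

```matlab
% Minimal structural sketch of the three DSS components (toy data, invented names).
% DBMS role: logical access to stored measurements, hiding physical storage.
data.rain = [0.2 0.0 1.4 3.1];           % toy stored series
data.temp = [12 13 11 10];
query = @(name) data.(name);             % data access by logical name

% MBMS role: a registry of models that turn data into decision-relevant output.
models = containers.Map();
models('trend')   = @(y) polyfit(1:numel(y), y, 1);  % linear trend (slope, intercept)
models('summary') = @(y) [mean(y) std(y) max(y)];    % simple descriptive summary

% DGMS / user interface role: combine a data request with a model request.
run_dss = @(sensor, model) feval(models(model), query(sensor));

% Example session: rainfall trend and temperature summary.
rain_trend   = run_dss('rain', 'trend')       % estimated slope and intercept
temp_summary = run_dss('temp', 'summary')     % mean, standard deviation, maximum
```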
1.2
Knowledge Discovery in Databases (KDD)
Knowledge Discovery in Databases (KDD) is an automatic, exploratory analysis and modeling of large data repositories. KDD is the organized process of identifying valid, novel, useful, and understandable patterns from large and complex data sets. Data Mining (DM) is the core of the KDD process, involving the inferring of algorithms that explore the data, develop the model and discover previously unknown patterns. The model is used for understanding phenomena from the data, analysis and prediction. The accessibility and abundance of data today makes knowledge discovery and data mining a matter of considerable importance and necessity. Given the recent growth of the field, it is not surprising that a wide variety of methods is now available to the researchers and practitioners. No one method is superior to others for all cases [1]. KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [2]. Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing. The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The phrase knowledge discovery in databases was coined at the first KDD workshop in 1989 [15], to emphasize that knowledge is the end product of a data-driven discovery. It has been popularized in the artificial intelligence and machine-learning fields. KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets. The data mining component of KDD currently relies heavily on known techniques from machine learning, pattern recognition, and statistics to find patterns from data
in the data mining step of the KDD process. A natural question is, How is KDD different from pattern recognition or machine learning (and related fields)? The answer is that these fields provide some of the data mining methods that are used in the data mining step of the KDD process. KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets and still run efficiently, how results can be interpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (besides machine learning) to contribute to KDD. KDD places a special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and robustness properties of modeling algorithms for large noisy data sets. Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data [2]. Related AI research fields include machine discovery, which targets the discovery of empirical laws from observation and experimentation [16] (see [17] for a glossary of terms common to KDD and machine discovery), and causal modeling for the inference of causal models from data [18]. Statistics in particular has much in common with KDD (see [19] and [20] for a more detailed discussion of this synergy). Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quantifying the uncertainty that results when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data), one can find patterns that appear to be statistically significant but, in fact, are not. Clearly, this issue
is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics, and much of this work is of direct relevance to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical aspects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate the entire process of data analysis and the statistician’s "art" of hypothesis selection [2]. A driving force behind KDD is the database field. Indeed, the problem of effective data manipulation when data cannot fit in main memory is of fundamental importance to KDD. Database techniques for gaining efficient data access, grouping and ordering operations when accessing data, and optimizing queries constitute the basics for scaling algorithms to larger data sets. Most data mining algorithms from statistics, pattern recognition, and machine learning assume data are in main memory and pay no attention to how the algorithm breaks down if only limited views of the data are possible. A related field evolving from databases is data warehousing, which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways: (1) data cleaning and (2) data access. Data cleaning: as organizations are forced to think about a unified logical view of the wide variety of data and databases they possess, they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible. Data access: uniform and well-defined methods must be created for accessing the data and providing access paths to data that were historically difficult to get to (for example, stored offline). Once organizations and individuals have solved the problem of how to store and access their data, the natural next step is the question, What else do we do with all the data? This is where opportunities for KDD naturally arise [2, 21, 22].
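As a small, hedged illustration of the point made above about apparently significant patterns in randomly generated data, the following Matlab sketch searches pure noise for pairwise correlations; at a nominal 5% level, roughly 5% of the pairs come out "significant" even though no real relationship exists. The sample sizes and data are invented for the illustration.

```matlab
% Apparently significant patterns in purely random data (multiple-testing pitfall).
rng(1);                          % fix the random seed for reproducibility
n_obs  = 100;                    % observations per variable
n_vars = 40;                     % number of mutually unrelated variables
X = randn(n_obs, n_vars);        % pure noise: no true relationships

[R, P]  = corrcoef(X);           % pairwise correlations and their p-values
mask    = triu(true(n_vars), 1); % consider each pair of variables once
n_tests = nnz(mask);
n_sig   = nnz(P(mask) < 0.05);   % pairs that look "significant" at the 5% level

fprintf('%d of %d random pairs appear significant (about %.1f%%).\n', ...
        n_sig, n_tests, 100 * n_sig / n_tests);
```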
Figure 1.2: The knowledge discovery (KDD) process
1.2.1
Knowledge discovery process
The knowledge discovery process (see Fig. 1.2) is iterative and interactive, consisting of nine steps (see [6, 11]). Note that the process is iterative at each step, meaning that moving back to adjust previous steps may be required. The process has many "artistic" aspects in the sense that one cannot present one formula or make a complete taxonomy of the right choices for each step and application type. Thus it is necessary to understand the process deeply, along with the different needs and possibilities at each step. The process starts with determining the KDD goals and ends with the implementation of the discovered knowledge. As a result, changes would have to be made in the application domain (such as offering different features to mobile phone users in order to reduce churning). This closes the loop: the effects are then measured on the new data repositories, and the KDD process is launched again. Following is a brief description of the nine-step KDD process, starting with a managerial step (see [1] and references therein, [2, 4, 22–24]); a minimal end-to-end sketch in Matlab follows the list. • Developing and understanding of the application domain This is the initial preparatory step. It prepares the scene for understanding what should be done with the many decisions (about transformation, algorithms, representation, etc.). The people who are in charge of a KDD project need to
understand and define the goals of the end-user and the environment in which the knowledge discovery process will take place (including relevant prior knowledge). As the KDD process proceeds, there may even be a revision and tuning of this step. Once the KDD goals are understood, the preprocessing of the data starts, as defined in the next three steps (note that some of the methods here are similar to data mining algorithms, but are used in the preprocessing context). • Creating a target data set Having defined the goals, the data that will be used for the knowledge discovery should be determined. This includes finding out what data are available, obtaining additional necessary data, and then integrating all the data for the knowledge discovery into one data set, including the attributes that will be considered for the process. This step is very important because data mining learns and discovers from the available data: this is the evidence base for constructing the models. If some important attributes are missing, the entire study may fail. For the success of the process it is therefore good to consider as many attributes as possible at this stage. On the other hand, collecting, organizing and operating complex data repositories is expensive, so there is a trade-off with the opportunity to best understand the phenomena. This trade-off is one aspect where the interactive and iterative nature of KDD comes into play: one starts with the best available data set and later expands it and observes the effect in terms of knowledge discovery and modeling. • Data cleaning and preprocessing In this stage, data reliability is enhanced. It includes data cleaning, such as handling missing values and removing noise or outliers. Several methods are discussed in the next Sections, ranging from doing nothing to becoming the major part (in terms of time consumed) of a KDD process in certain projects. It may involve complex statistical methods, or using a specific data mining algorithm in this context. For example, if one suspects that a certain attribute is not reliable enough or has too many missing values, then this attribute could become the goal of a supervised data mining algorithm: a prediction model for this attribute is developed, and missing data can then be predicted. The extent to which one pays attention to this step depends on many factors. In any case, studying these
aspects is important and often reveals insights by itself regarding enterprise information systems. • Data transformation In this stage, the generation of better data for the data mining is prepared and developed. Methods here include dimension reduction (such as feature selection and extraction, and record sampling) and attribute transformation (such as discretization of numerical attributes and functional transformations). This step is often crucial for the success of the entire KDD project, but it is usually very project-specific. For example, in medical examinations, the quotient of attributes may often be the most important factor, and not each one by itself. In marketing, we may need to consider effects beyond our control as well as efforts and temporal issues (such as studying the effect of advertising accumulation). However, even if we do not use the right transformation at the beginning, we may obtain a surprising effect that hints at the transformation needed (in the next iteration). Thus the KDD process reflects upon itself and leads to an understanding of the transformation needed (like the concise knowledge of an expert in a certain field regarding key leading indicators). More in Section 4.2.2. • Choosing the appropriate data mining task We are now ready to decide which type of data mining to use, for example classification, regression, or clustering. This mostly depends on the KDD goals, and also on the previous steps. There are two major goals in data mining: prediction and description. Prediction is often referred to as supervised data mining, while descriptive data mining includes the unsupervised and visualization aspects of data mining. Most data mining techniques are based on inductive learning, where a model is constructed explicitly or implicitly by generalizing from a sufficient number of training examples. The underlying assumption of the inductive approach is that the trained model is applicable to future cases. The strategy also takes into account the level of meta-learning for the particular set of available data. • Choosing the data mining algorithm Having the strategy, we now decide on the tactics. This stage includes selecting the specific method to be used for searching patterns (including multiple
inducers). For example, in considering precision versus understandability, the former is better with neural networks, while the latter is better with decision trees. For each strategy of meta-learning there are several possibilities of how it can be accomplished. Meta-learning focuses on explaining what causes a data mining algorithm to be successful or not in a particular problem; thus, this approach attempts to understand the conditions under which a data mining algorithm is most appropriate. Each algorithm has parameters and tactics of learning (such as ten-fold cross-validation or another division for training and testing). • Employing the data mining algorithm Finally the implementation of the data mining algorithm is reached. In this step we might need to employ the algorithm several times until a satisfactory result is obtained, for instance by tuning the algorithm’s control parameters, such as the minimum number of instances in a single leaf of a decision tree. • Evaluation In this stage we evaluate and interpret the mined patterns (rules, reliability, etc.) with respect to the goals defined in the first step. Here we consider the preprocessing steps with respect to their effect on the data mining algorithm results (for example, adding features in Step 4, and repeating from there). This step focuses on the comprehensibility and usefulness of the induced model. In this step the discovered knowledge is also documented for further usage. The last step is the usage of, and overall feedback on, the patterns and discovery results obtained by the data mining. • Using the discovered knowledge We are now ready to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we may make changes to the system and measure the effects. Actually the success of this step determines the effectiveness of the entire KDD process. There are many challenges in this step, such as losing the "laboratory conditions" under which we have operated. For instance, the knowledge was discovered from a certain static snapshot (usually a sample) of the data, but now the data becomes dynamic. Data structures may change (certain attributes become unavailable), and the data domain may be modified (for instance, an attribute may have a value that was not assumed before).
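A compact sketch of how these steps might be chained on a single sensor series is given below. The synthetic data and the specific cleaning, transformation and modeling choices (a robust median-based outlier rule, gap interpolation, standardization and a cubic polynomial fit) are placeholders for illustration, not the project’s actual processing chain.

```matlab
% Skeletal KDD pipeline on a synthetic sensor series (illustrative choices only).
rng(0);
t = (1:200)';                                  % time index
y = 0.05*t + 2*sin(0.1*t) + randn(200, 1);     % raw "measurements"
y([30 120]) = y([30 120]) + 15;                % inject two gross outliers

% Step 3 - cleaning: flag points far from the median (robust, MAD-based rule).
dev     = abs(y - median(y));
is_out  = dev > 3 * 1.4826 * median(dev);
y_clean = y;  y_clean(is_out) = NaN;

% Step 4 - transformation: fill the gaps and standardize.
good     = ~isnan(y_clean);
y_filled = y_clean;
y_filled(~good) = interp1(t(good), y_clean(good), t(~good));
y_std    = (y_filled - mean(y_filled)) / std(y_filled);

% Steps 5-7 - mining: fit a simple predictive model (cubic polynomial).
p     = polyfit(t, y_std, 3);
y_hat = polyval(p, t);

% Step 8 - evaluation: goodness of fit on the training data.
rmse = sqrt(mean((y_std - y_hat).^2));
fprintf('Removed %d outliers, fit RMSE = %.3f\n', nnz(is_out), rmse);
```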
1.3
Data mining methods and KDD process
Figure 1.3: Data mining as a step within the KDD process
Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data [2]. The data mining component of the KDD process often involves repeated iterative application of particular data mining methods. Following the lines of [1] and [2], this section presents an overview of the primary goals of data mining, a description of the methods used to address these goals, and a brief description of the data mining algorithms that incorporate these methods. In the next sections we then apply these methods and techniques to our project and show in practice the data mining methods we have used. The knowledge discovery goals are defined by the intended use of the system. We can distinguish two types of goals: (1) verification and (2) discovery. With verification, the system is limited to verifying the user’s hypothesis. With discovery, the system autonomously finds new patterns. We further subdivide the discovery goal into prediction, where the system finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns
for presentation to a user in a human-understandable form. Data mining involves fitting models to, or determining patterns from, observed data. The fitted models play the role of inferred knowledge. Whether the models reflect useful or interesting knowledge is part of the overall, interactive KDD process where subjective human judgment is typically required. Two primary mathematical formalisms are used in model fitting: (1) statistical and (2) logical. The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data mining applications given the typical presence of uncertainty in real-world data generating processes. Most data mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often be bewildering to both the novice and the experienced data analyst. It should be emphasized that of the many data mining methods advertised in the literature, there are really only a few fundamental techniques. The actual underlying model representation being used by a particular method typically comes from a composition of a small number of well-known options: polynomials, splines, kernel and basis functions, threshold-Boolean functions, and so on. Thus, algorithms tend to differ primarily in the goodness-of-fit criterion used to evaluate model fit or in the search method used to find a good fit. The two high-level primary goals of data mining in practice tend to be prediction and description. As stated earlier, prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest, and description focuses on finding human-interpretable patterns describing the data. Although the boundaries between prediction and description are not sharp (some of the predictive models can be descriptive, to the degree that they are understandable, and vice versa), the distinction is useful for understanding the overall discovery goal. The relative importance of prediction and description for particular data mining applications can vary considerably. The goals of prediction and description can be achieved using a variety of particular data mining methods. There are many methods of data mining used for different purposes and goals. A taxonomy is called for to help in understanding the variety of methods,
Figure 1.4: Main data mining taxonomy (see [1, 2, 4])
their interrelation and grouping [1]. Figure 1.4 presents this taxonomy. • Classification Classification is learning a function that maps (classifies) a data item into one of several predefined classes [25, 26]. This method is discussed further in later Sections of this Chapter.
Figure 1.5: Main classification classes
• Regression Regression is learning a function that maps a data item to a real-valued prediction variable. Regression applications are many: for example, predicting the amount of biomass present in a forest given remotely sensed microwave measurements, estimating the probability that a patient will survive given the results of a set of diagnostic tests, predicting consumer demand for a new product as a function of advertising expenditure, and predicting time series where the input variables can be time-lagged versions of the prediction variable. A short Matlab sketch of regression and clustering on synthetic data follows this list.
• Clustering Clustering is a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data [27, 28]. The categories can be mutually exclusive and exhaustive or consist of a richer representation, such as hierarchical or overlapping categories. Examples of clustering applications in a knowledge discovery context include discovering homogeneous subpopulations for consumers in marketing databases and identifying subcategories of spectra from infrared sky measurements [29]. Closely related to clustering is the task of probability density estimation, which consists of techniques for estimating from data the joint multivariate probability density function of all the variables or fields in the database [30]. • Summarization Summarization involves methods for finding a compact description for a subset of data. A simple example would be tabulating the mean and standard deviations for all fields. More sophisticated methods involve the derivation of summary rules [31], multivariate visualization techniques, and the discovery of functional relationships between variables [32]. Summarization techniques are often applied to interactive exploratory data analysis and automated report generation. • Dependency modeling Dependency modeling consists of finding a model that describes significant dependencies between variables. Dependency models exist at two levels: (1) the structural level of the model specifies (often in graphic form) which variables are locally dependent on each other and (2) the quantitative level of the model specifies the strengths of the dependencies using some numeric scale. For example, probabilistic dependency networks use conditional independence to specify the structural aspect of the model and probabilities or correlations to specify the strengths of the dependencies [33, 34]. Probabilistic dependency networks are increasingly finding applications in areas as diverse as the development of probabilistic medical expert systems from databases, information retrieval, and modeling of the human genome.
• Change and deviation detection Change and deviation detection focuses on discovering the most significant changes in the data from previously measured or normative values [35–39].
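To ground two of these method families, the short Matlab sketch below fits a simple regression model to synthetic data and clusters two-dimensional points with k-means (kmeans belongs to the Statistics Toolbox). The data and parameter choices are invented for illustration only.

```matlab
% Two branches of the taxonomy on synthetic data: regression and clustering.
rng(2);

% Regression: learn a mapping x -> y (real-valued prediction variable).
x = linspace(0, 10, 80)';
y = 1.5*x + 4 + randn(size(x));           % hidden linear relationship plus noise
p = polyfit(x, y, 1);                     % estimated slope and intercept
fprintf('Estimated model: y = %.2f*x + %.2f\n', p(1), p(2));

% Clustering: discover groups without labels (a descriptive task).
A = [randn(50, 2) + 5; randn(50, 2) - 5]; % two well separated groups
[idx, centers] = kmeans(A, 2);            % Statistics Toolbox
fprintf('Cluster sizes: %d and %d\n', nnz(idx == 1), nnz(idx == 2));
disp(centers);                            % estimated cluster centers
```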
1.4
Project overview and statements
"Integrated Network for Emergencies (NIE)" is a project aimed at the realization of an information system representing the "Integrated Center for Monitoring and Management of Urban Emergencies of the Municipality of Genoa".
Figure 1.6: Flood of the city of Genoa on November 4, 2011
The science of meteorology is deeply intertwined with the process of emergency management. Weather phenomena are the cause of many disaster events, such as fires and floods, and a factor in many others. Weather can also affect the way assistance is provided during or after an emergency. Since time to prepare is vital, much of meteorology is concerned with forecasting and issuing warnings. These events cause serious problems related to the security and safety of citizens, to mobility and
free traffic, and to the blocking of a large number of productive activities. The adoption of information systems for the monitoring and analysis of critical emergencies is gradually assuming an increasingly prominent role in the management, analysis and correlation of information, including information of a heterogeneous nature, and in the implementation of specialized decision support tools at the operational level across all levels of jurisdiction. The Municipality of Genoa, as part of the activities undertaken in recent years leading to a reorganization of the operational management of emergencies due to critical phenomena of various kinds, intends to adopt an innovative integrated HW-SW system for emergency management and coordination, which will become the reference tool for the new integrated emergency centre of the city. The project was carried out in cooperation between the "University of Genoa" (through the Mathematical Engineering and Simulation department), the "FadeOut" company (the software developer of the project), and the "Acrotec" company (the executor). The main goals of the project are to monitor environmental changes, taking measurements with multiple geo-based sensors at different measuring stations in real time; to design a website as an interactive environment for data collection and extraction; and, in addition, interactive modeling, broadcasting and early warning, emergency-state support, crisis management, urban-based development, and the analysis of environmental measures for a decision support system using mathematical models based on scientific prediction tools. Our role in this project was to design the mathematical models for supporting decision making in environmental emergency states. Thus, we are not going to discuss all the project layers; in line with our role in the DSS, we show the mathematical modeling we have done for this scope. The scientific support in this project takes the name "subsystem research laboratory". This laboratory is equipped with computing facilities based on Matlab for the development, testing and validation of:
The laboratory is configured as a pilot center for the development of synergies between public administrations, companies, organizations, universities and research institutions, oriented to the solution of emergency-related problems concerning: 1. Data fusion of environmental and meteorological information to increase the speed and accuracy of forecasts 2. Innovative communication tools for an effective perception of emergency dangers through "social networks" 3. Decision support systems "customized" for specific territories 4. Data acquisition and analysis. This section shows the role of mathematical modeling in the DSS. The subsystem is equipped with the following general functional modules: • Fire prevention and prediction The spread of a forest fire is regulated by highly complex physical and chemical processes that depend on multiple factors such as weather (air temperature, wind intensity and direction, humidity, and precipitation), terrain (slope of the land and depressions) and vegetation (vegetation type, tree height, and presence of undergrowth). The module is implemented in such a way as to permit a double use: pre-emergency and during the emergency itself. The management of a large forest fire involves the deployment of a large number of resources in terms of staff and equipment, which requires effective coordination for a rapid end of the event and to minimize the direct and indirect damages. During an emergency, the module allows operators to insert the current front of the fire in a simple manner and to obtain in a short time a simulation of the progress of the fire front. In pre-emergency, instead, operators can use the model to test possible scenarios (for example in areas where the fire risk is particularly high) in order to better identify the areas at risk, locate in advance the best positions from which to send vehicles, etc. The model implemented in the module has two main features that respond to specific needs: simplicity and computational efficiency. For this reason we used a semi-empirical approach based on the "Level Set" method, an important class of methods used to evolve curves or surfaces in many applications (seismics,
fluid dynamics, image processing, materials science, etc.). The proposed model is based on the calculation of the level sets of a Hamilton-Jacobi partial differential equation (PDE) that explicitly takes into account the wind direction. Level set methods are versatile and extensible techniques for general front-tracking problems, including the practically important problem of predicting the advance of a fire front across expanses of surface vegetation. Given a rule, empirical or otherwise, to specify the rate of advance of an infinitesimal segment of fire front arc normal to itself (i.e., given the fire spread rate as a function of known local parameters relating to topography, vegetation, and meteorology), level set methods harness the well-developed mathematical machinery of hyperbolic conservation laws on Eulerian grids to evolve the position of the front in time [43]. The Fendell and Wolff model focuses on front velocities at the rear of the front (where propagation is against the wind), at the head of the front (where propagation is with the wind), and on the flanks (where propagation is across the wind direction), see Fig. 1.7. The simplicity of the "Level Set" model allows the possibility of
Figure 1.7: The Fendell and Wolff model introduces velocities at the rear (against the wind), at the head (in the wind direction) and at the flanks (see [41, 43])
applying methods for the adaptation of the significant parameters present in the Hamilton-Jacobi PDE to take account of the characteristics of the terrain (vegetation, humidity, etc.), through identification procedures of the "white box" and "black box" type. These types of models represent the best compromise between precision, reliability, flexibility and ease of use. The module provides an easy interface to display in real time the evolution of the front of a fire on a
map (similar to that of Google Maps). The interface takes input information from different information systems (including the information provided by the static fire risk model) and processes such information to generate a simulation of the propagation of the fire front. The interface displays on the map the progress of the fire front and provides additional information such as the intensity of the fire at various positions. It also allows operators to choose the size of the computational grid, and therefore the precision with which the front is calculated (see [7, 41, 43]). This part of the design was done by my colleagues; my contribution was to connect the models with the DSS interfaces and network. Hence, the next sections will not include this part of the work. • Simulation and resource allocation Emergency management involves the deployment of a large number of resources, both in terms of staff and equipment, which requires effective coordination for a rapid overcoming of events with a minimum of direct and indirect damages, possibly also in pre-emergencies. The occurrence of an interface fire poses the problem of optimally managing rescue teams and fire trucks. Similarly, in the case of snowfall it is essential to better manage the equipment used for spreading salt and removing snow. Obtaining efficient solutions to these resource allocation problems requires the development of: (a) general mathematical models representing the various problems, (b) techniques that are easy for the user to apply yet have high performance, and (c) efficient algorithms that can operate in real time. • Data mining This sub-module contains all the functions required to support the operations, including data acquisition, classification and clustering, regression, and neural networks. In the next sections we discuss this part in detail for this project from a DSS perspective. • Optimization This sub-module is essentially a library of tools that can be called from various modules to solve optimization problems. The problem of statically allocating limited resources
among competing activities can easily and intuitively be formulated as an optimization problem with an objective index that expresses the validity of the assignment, and with the total availability of resources as a constraint. An alternative formulation of the static allocation problem can be obtained by means of dynamic programming. In any case, in the broader context of the project it was more appropriate to follow an approach based on mathematical programming rather than on dynamic programming. A non-exhaustive set of problems associated with the simulation and resource allocation, prevention and forecasting, and data mining modules is formulated as mathematical programming problems covering identification, optimal programming, optimal estimation, regression, and interpolation. A toy formulation of a static allocation problem is sketched after Fig. 1.8.
Figure 1.8: Related sub-fields to optimization framework
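As a hedged illustration of casting static resource allocation as a mathematical programming problem, the toy formulation below assigns a limited number of intervention teams to three sites so as to maximize a benefit index. The numbers are invented, and linprog belongs to the Optimization Toolbox; the project’s actual formulations are richer than this sketch.

```matlab
% Toy static resource allocation posed as a linear program (illustrative only).
benefit = [0.9; 0.6; 0.8];      % benefit index per team sent to each of 3 sites
demand  = [4;   3;   5  ];      % maximum useful number of teams per site
total   = 7;                    % teams available overall

f  = -benefit;                  % linprog minimizes, so negate the benefit
A  = ones(1, 3);   b  = total;  % total assigned teams cannot exceed availability
lb = zeros(3, 1);  ub = demand; % per-site bounds

x = linprog(f, A, b, [], [], lb, ub);     % Optimization Toolbox
fprintf('Teams per site: %s (total benefit %.2f)\n', mat2str(round(x)'), benefit'*x);
```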
1.4.1
Environmental monitoring and data mining
Many environmental systems involve processes which are not yet well known, and for which no formal models are established at present. Because the consequences of an environmental system changing behavior or operating under abnormal conditions may be severe, there is a great need for knowledge discovery in the area [42]. Great quantities of data are available, but as the effort required
to analyze the large masses of data generated by environmental systems is large, much of it is not examined in depth and the information content remains unexploited. The special features of environmental processes demand a new paradigm to improve analysis and consequently management. Approaches beyond straightforward application of conventional classical techniques are needed to meet the challenge of environmental system investigation. Data mining techniques provide efficient tools to extract useful information from large databases, and are equipped to identify and capture the key parameters controlling these complex systems (see [42, 43]).
1.4.2
Decision support system and EMS
The complexity of environmental monitoring problems makes necessary the development and application of new tools capable of processing not only numerical aspects, but also experience from experts and wide public participation, which are all needed in decision-making processes. Environmental decision support systems (EDSSs) are among the most promising approaches to confronting this complexity. The fact that different tools (e.g. artificial intelligence techniques, statistical methods, and geographical information systems [44]) can be integrated under different architectures confers on EDSSs the ability to confront complex problems, and the capability to support learning and decision-making processes.
1.4.3
Statements and methodologies
As mentioned earlier, "Integrated Network for Emergencies (NIE)" is a project aimed at the realization of an information system representing the Integrated Center for Monitoring and Management of Urban Emergencies of the Municipality of Genoa. The scientific support to this project, on which this Chapter is based, lies within the modules we have discussed before. The role here starts with importing the data coming from meteorological stations, then processing these data with the previous modules using data mining applications, and lastly analyzing the results, which are exported to a decision support layer. The end-user application is a graphical user interface implemented in Matlab, which provides an interactive interface for the DSS user. See Fig. 1.9.
Figure 1.9: Framework’s process big-picture
1.5
Interfaces and databases
The final product (i.e. for the DSS end-user) is an interactive graphical user interface. This interface was built in Matlab and is the main environment for the mathematical models. At this stage, we initialize the system, collect the measurements, build the databases, pre-process the data, and then distribute these data to be processed and analyzed with the mathematical models inside this interface. We have shown that this part of the project takes the data from the data warehouse, which stores the measurements in the core server of Acrotec. Hence, a connection initialization is required to import the data. The software and connections we used here are: (1) a licensed Matlab program (R2012b) from MathWorks [77], (2) an external Java library (for Matlab-based synchronization), and (3) CISCO VPN v.5. The interface is connected directly, in a half-duplex connection, with Acrotec, and the interactive modeling was designed to be easy to use and fully integrated with the DSS users.
1.5.1
Database and data form
The database includes measurements from sensors observed in real time. These data flow to the core server in the surveillance management room of Acrotec. The form of the data we need to import into our interface is not compatible with the form of the data in the server, for two main reasons: (1) the data are not in a numerical form, so they cannot be handled directly by Matlab functions; (2) the programming language used to construct the data in the core server is the X-Drops language, a customized programming language developed by the Acrotec company for this project. This code is able to build a connection with Matlab, but the code itself is not readable by Matlab. So the interface has to make the connection between Matlab and X-Drops to create readable data for our interface before starting the processing phases. Data flow to our interface as shown in Fig. 1.10. Sensors are distributed in many locations and are connected with the stations over transmission lines. The data come to the interface from a core server at the environmental monitoring center, and they need to be initialized, pre-processed, and transformed as shown in Fig. 1.9. The connection with the Acrotec server provides our interface with all the required data over an on-line connection, and it supports the feedback process.
Figure 1.10: Data flow from the environmental monitoring system to the interface
1.5.2
Data initialization and extraction
As shown in Section 1.5.1, the first task before data processing is to initialize the system; this occurs through a VPN connection with the core server over X-Drops. This initialization provides the interface with full permission to access the data at the monitoring center (i.e. the data that were collected from the measuring stations). The data then need to be transformed into a form readable by Matlab; this process is built into a hidden layer of the X-Drops connection together with a Java library. These steps are the initial tasks that the system always needs to perform before the interface can be used. Figure 1.11 shows an example of the initialization process in Matlab, and a hedged sketch of such an initialization follows the figure.
Figure 1.11: Initialization of X-Drops to Matlab
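The actual X-Drops calls are internal to the project, so the sketch below only illustrates the general pattern of loading an external Java connector into Matlab and requesting observations. The jar name, class name and methods (XDropsConnector, connect, getObservations) are hypothetical placeholders, not the real API.

```matlab
% Hedged sketch of initializing an external Java connector from Matlab.
% All names below are hypothetical stand-ins for the project library.
javaaddpath('xdrops-connector.jar');              % load the (assumed) jar file

conn = javaObject('XDropsConnector');             % hypothetical connector class
conn.connect('vpn-host.example', 'dss_user');     % hypothetical connection method

% Request raw observations for one sensor and convert them for Matlab use.
raw  = conn.getObservations('thermometer', '2012-12-11');  % hypothetical method
vals = double(raw);                               % Java numeric array -> Matlab vector
fprintf('Imported %d measurements.\n', numel(vals));
```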
1.5.3
Interface structure and features
The main interface includes three phases of data processing: (1) extracting data from the core database and building the connection bridge for data transformation between X-Drops and Matlab; (2) collecting data from the sensor warehouse, where the measurements are collected according to specific search criteria determined by the user; and (3) building databases inside Matlab and exporting these data to other connected interfaces, where we process the data with different mathematical models as shown in Fig. 1.12. The data collection features depend on user commands, and there are two modes for collecting data: (1) collect the measurements of a specific sensor chosen by the user, and (2) import data determined by geographical point criteria. The modes of these phases are shown in the interface as multi-input choices for the user. The main interface, which from now on is called
Figure 1.12: Interfaces structure
the "GetObservation" interface, is the main window for the DSS user. Fig. 1.13 shows the interface in the starting mode after the initialization (note that Fig. 1.11 shows an example of the initialization, which occurs in a hidden layer inside the main interface).
Figure 1.13: The main interface at the starting mode

Table 1.1: Interface functions

Article             | Key                  | Description
Sensor type         | Choose a sensor      | List of all available sensors
Duration            | Choose a time        | List of minutes (1-60)
Date-from           | Starting date        | Java calendar
Date-to             | Last desired date    | Java calendar
Process             | Collect measures     | For all stations in Genoa
All Data            | Collect measures     | For all stations in Italy
Stations List       | Choose a station     | List of all stations in Liguria
Plot and Export     | Plot data            | For one sensor in one station
Station Observation | Collect all measures | All sensors in one station
Data mining panel   | Data processing      | Models and tools
Three panels of data are included: (1) a data input panel, which takes user-based input; (2) measurement visualization; and (3) a data mining panel, which contains the mathematical model interfaces. Table 1.1 shows the functional features of the interface’s panels. From Fig. 1.13 one can see that the user has lists of input choices according to specific criteria. The search-based method offers two possible choices: 1. Collect observations of a chosen sensor (10 different sensors are included, for rain, water, wind, snow, pressure, and radiation measurements).
Figure 1.14: The main interface with specific input at the running mode

Table 1.2: Input

Article          | Choice
Sensor type      | Thermometer
Duration         | Observing each 60 minutes
Date             | 11 December 2012
Station          | Pegli2
Graphical output | Thermometer measures plot
Data collection  | Measures of all available sensors
Data mining      | Exporting 6 data-sets
2. Collect observations under geographical point criteria, with all available sensors (49 stations around the city of Genoa). Fig. 1.14 shows an example of data collection with the specific input choices listed in Table 1.2. We have shown above the structure of the project and its contribution to the DSS. Going back to the Introduction, we discussed the DSS structure and how our interface contributes to it (see Fig. 1.1). The interface corresponds to the Dialog Generation and Management System (DGMS), which we call here the "User Interface"; it is the most crucial aspect of a DSS. We have designed the interface bearing in mind that DGMS users are often managers who are not well trained in computing, so ease of use was a primary design consideration. A hedged sketch of the sensor-based collection mode closes this Section. In the next Sections we discuss the data mining methods in detail and show all the applications of this panel.
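A minimal sketch of the sensor-based collection mode — one sensor at one station over one day, mirroring the choices of Table 1.2 — is given below. The hourly series is simulated here in place of the interface’s internal query, so the values are placeholders.

```matlab
% Hedged sketch of the sensor-based collection mode (simulated measurements).
sensor   = 'Thermometer';
station  = 'Pegli2';
day      = '11-Dec-2012';
step_min = 60;                                    % one observation per hour

% Placeholder for the interface's internal query to the imported database.
t = datenum(day) + (0:step_min:1439) / 1440;      % hourly time stamps over the day
y = 10 + 3*sin(2*pi*(t - floor(t))) + 0.5*randn(size(t));   % simulated readings

plot(t, y, '.-');  datetick('x', 'HH:MM');
xlabel('Time');  ylabel('Temperature [deg C]');
title(sprintf('%s at %s on %s', sensor, station, day));
```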
1.6
Data mining methods and applications
The two high-level primary goals of data mining in practice tend to be prediction and description. As stated in Section 1.3, prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest, and description focuses on finding human-interpretable patterns describing the data. Although the boundaries between prediction and description are not sharp [2], the distinction is useful for understanding the overall discovery goal. The relative importance of prediction and description for particular data mining applications can vary considerably. The goals of prediction and description can be achieved using a variety of particular data mining methods. In line with the interface we have presented, we show in this Section the data mining methods that have been implemented in the project and their features.
1.6.1
Outlier detection
Outlier detection is a primary step in many data mining applications. In this Section we introduce the problem of outlier detection and present the method that we have used in the interface. In Chapters 2 and 3, we present a new method for outlier detection using a robust estimation strategy. In many data analysis tasks a large number of variables are recorded or sampled. One of the first steps towards obtaining a coherent analysis is the detection of outlying observations. Although outliers are often considered an error or noise, they may carry important information. Detected outliers are candidates for aberrant data that may otherwise adversely lead to model misspecification, biased parameter estimation and incorrect results. It is therefore important to identify them prior to modeling and analysis [45–47]. Hawkins [48] formally defined the concept of an outlier as follows: an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature. In most applications, the data are created by one or more generating processes, which could either reflect activity in the system or observations collected about entities. When the generating process
behaves in an unusual way, it results in the creation of outliers. Therefore, an outlier often contains useful information about abnormal characteristics of the systems and entities that impact the data generation process [49].
1.6.1.1
Outlier detection applications
Outlier detection methods have been suggested for numerous applications, such as: 1. Intrusion Detection Systems (IDS): In many host-based or networked computer systems, different kinds of data are collected about the operating system calls, network traffic, or other activity in the system. This data may show unusual behavior because of malicious activity. The detection of such activity is referred to as intrusion detection [49–52]. 2. Credit Card Fraud (CCF): Credit card fraud is quite prevalent, because of the ease with which sensitive information such as a credit card number may be compromised. This typically leads to unauthorized use of the credit card. In many cases, unauthorized use may show different patterns, such as a buying spree from geographically obscure locations. Such patterns can be used to detect outliers in credit card transaction data [53–55]. 3. Interesting Sensor Events: Sensors are often used to track various environmental and location parameters in many real applications. The sudden changes in the underlying patterns may represent events of interest. Event detection is one of the primary motivating applications in the field of sensor networks [49, 56–58]. 4. Medical Diagnosis: In many medical applications the data is collected from a variety of devices such as MRI scans (Magnetic Resonance Imaging), PET (Positron Emission Tomography) scans or ECG (Electrocardiography) time series. Unusual patterns in such data typically reflect disease conditions [59, 60]. 5. Law Enforcement: Outlier detection finds numerous applications in law enforcement, especially in cases where unusual patterns can only be discovered over time through multiple actions of an entity. Determining fraud in financial transactions, trading activity, or insurance claims typically requires the
36
1. Data mining and Decision Support System
determination of unusual patterns in the data generated by the actions of the criminal entity [49]. 6. Earth Science: A significant amount of spatiotemporal data about weather patterns, climate changes, or land cover patterns is collected through a variety of mechanisms such as satellites or remote sensing. Anomalies in such data provide significant insights about hidden human or environmental trends, which may have caused such anomalies [49]. More applications to outlier detection such as Satellite image analysis, motion segmentation, detecting novelty in text, severe weather prediction, geographic information systems, athlete performance analysis, and other data mining tasks can be found in many literature. For more survey studies, see [45–49, 58, 61, 62]. Outliers arise because of human error, instrument error, natural deviations in populations, fraudulent behaviour, changes in behaviour of systems or faults in systems. How the outlier detection system deals with the outlier depends on the application area. If the outlier indicates a typographical error by an entry clerk then the entry clerk can be notified and simply correct the error so the outlier will be restored to a normal record. An outlier resulting from an instrument reading error can simply be expunged. A survey of human population features may include anomalies such as a handful of very tall people. Here the anomaly is purely natural, although the reading may be worth flagging for verification to ensure no errors, it should be included in the classification once it is verified. A system should use a classification algorithm that is robust to outliers to model data with naturally occurring outlier points. An outlier in a safety critical environment, a fraud detection system, an image analysis system or an intrusion monitoring system must be detected immediately (in real-time) and a suitable alarm sounded to alert the system administrator to the problem. Once the situation has been handled, this anomalous reading may be stored separately for comparison with any new fraud cases but would probably not be stored with the main system data as these techniques tend to model normality and use this to detect anomalies [61]. 1.6.1.2
Outlier detection methods
Outlier detection methods have been widely studied and investigated in the literature, usually by statisticians and computer scientists. In this section we
introduce a brief discussion of these methods and the methodologies behind them, and then show the methods that have been applied in the interface. Researchers have introduced different approaches to the taxonomy of outlier detection methods, and one can easily be confused by the types of outlier detection methods and the notions behind them; since our goal here is only to give a brief overview of these methods, we follow [45, 49, 61] in formulating this section. Outlier detection methods can be divided between univariate methods and multivariate methods, which form most of the current body of research. Another fundamental taxonomy of outlier detection methods is between parametric (statistical) methods and nonparametric methods that are model-free. A general categorization can be made as follows:

1. Parametric methods
Statistical parametric methods either assume a known underlying distribution of the observations or, at least, are based on statistical estimates of unknown distribution parameters, see [45]. These methods flag as outliers those observations that deviate from the model assumptions. They are often unsuitable for high-dimensional data sets and for arbitrary data sets without prior knowledge of the underlying data distribution [63]. Parametric methods allow the model to be evaluated very rapidly for new instances, and the model grows only with model complexity, not with data size. However, they limit their applicability by enforcing a pre-selected distribution model to fit the data. If the user knows that the data fits such a distribution model, then these approaches are highly accurate, but many data sets do not fit one particular model [61]. Several techniques assume that the data is generated from a known distribution; the training phase then involves estimating the distribution parameters from the given sample. Several statistical tests, such as the frequently used Grubbs' test [64], also known as the maximum normed residual test, assume a normal distribution of the data. Most of these techniques work with univariate as well as multivariate continuous data. Parametric regression modeling techniques have also been used to fit a regression model on the data. Parametric statistical outlier detection has been used for many applications such as network intrusion detection, mobile phone fraud detection systems, and the medical and public health domain, see [62].
More specifically, these techniques can be further categorized as follows:

• Gaussian Models. Substantial work has been done on detecting outliers in data which is assumed to be normally distributed. The training phase typically involves estimating the mean and variance of the distribution using Maximum Likelihood Estimates (MLE). Common techniques are the Box-plot rule, the Grubbs test, the Rosner test, the Student's t-test, and the Dixon test. More discussion about these techniques can be found in [61, 62, 64, 67, 78].

• Regression Models. Outlier detection using regression has been extensively investigated for time-series data. The primary approach for detecting outliers in time-series data has been regression analysis. The training phase involves fitting a regression model on the data. The testing phase is essentially a model diagnostics phase which involves evaluating each instance with respect to the model. In several techniques, the maximum likelihood estimates of the regression parameters are used as the criteria for outlier detection. The underlying approach in these techniques is to fit a regression model on the time series and estimate certain statistics which are diagnosed to detect outliers. The use of residuals obtained from regression model fitting to detect outliers has been discussed in several approaches [62]. Some of the approaches to handling outliers while fitting regression models are studentized residuals, the AIC (Akaike Information Criterion), and robust regression. These approaches are designed for univariate time series.

• Mixture of Parametric Models. In several scenarios a single statistical model is not sufficient to represent the data. In such cases a mixture of parametric models is used. These techniques can work in two ways. The first approach is supervised and involves modeling the normal instances and the outliers as separate parametric distributions; the testing phase then involves determining which distribution the test instance belongs to. The second approach is semi-supervised and involves modeling the normal instances as a mixture of models; a test instance which does not belong to any of the learned models is declared to be an outlier.
• Markov and Hidden Markov Models. Markov models and Hidden Markov Models (HMMs) are popular statistical techniques used to model sequential data. Variations of these models that follow the Markovian assumption, such as Maxent, Conditional Random Fields, mixtures of Markov models, and mixtures of HMMs, are also used to model sequential data [62, 65, 66]. Applications of such models to sequential data include biological sequences, speech recognition, and other domains.

2. Nonparametric methods
Within the class of nonparametric outlier detection methods one can set apart the data mining methods, also called distance-based methods. This type of technique does not assume knowledge of the data distribution. These methods are usually based on local distance measures and are capable of handling large databases [45]. Nonparametric approaches are more flexible and autonomous than parametric ones [61]. Some of the widely used techniques in this category are histogram analysis (histogramming is very efficient for univariate data, but multivariate data induces additional complexities) and the Parzen windows method (which directly uses the samples drawn from an unknown distribution to model its density). These techniques can also be thought of as kernel-based approaches, since they use a known kernel to model the samples and then extrapolate to the entire data. Nonparametric techniques typically define a distance between a test observation and the statistical model and use some kind of threshold on this distance to determine whether the observation is an outlier. These techniques are particularly popular in the intrusion detection and fraud detection communities, since the behavior of the data is governed by certain profiles (user, software, or system) which can be efficiently captured using the histogram model; see [61] for applications of the nonparametric methods. The concept of dimensionality reduction and principal component analysis (PCA), which uses a nonparametric approach to model the data correlations, is described in [49] together with more outlier analysis approaches. A summary of the nonparametric methods category is given in [62].
3. Univariate Statistical Methods
Most of the earliest univariate methods for outlier detection rely on the assumption of an underlying known distribution of the data, which is assumed to be identically and independently distributed (i.i.d.). Moreover, many discordance tests for detecting univariate outliers further assume that the distribution parameters and the type of expected outliers are also known [45, 67]. In some literature, univariate methods are categorized as follows:

• Single-step vs. Sequential Procedures. Single-step procedures identify all outliers at once, as opposed to the successive elimination or addition of data. In sequential procedures, at each step, one observation is tested for being an outlier [68].

• Inward and Outward Procedures. In inward testing, or forward selection methods, at each step of the procedure the observation with the largest outlyingness measure is tested for being an outlier. If it is declared an outlier, it is deleted from the dataset and the procedure is repeated; if it is declared a non-outlying observation, the procedure terminates [48]. In outward testing procedures, the sample of observations is first reduced to a smaller sample, while the removed observations are kept in a reservoir. The outward testing procedure terminates when no more observations are left in the reservoir [45, 48].

• Univariate Robust Measures. Traditionally, the sample mean and the sample variance give a good estimation of data location and data shape if the data is not contaminated by outliers. When the database is contaminated, those parameters may deviate and significantly affect the outlier detection performance. Some of the most famous basic techniques are:
– Hampel Identifier: Hampel suggested the median and the median absolute deviation (MAD) as robust estimates of the location and the spread [69, 70].
– Tukey's method: Tukey introduced the Boxplot as a graphical display on which outliers can be indicated [71]. The Boxplot, which is extensively used up to date, is based on the distribution quartiles.
This method has been implemented in Matlab and is often found to be practically very effective. We have built the outlier detection process in our interface based on this technique, and more discussion is given in the coming sections.
– For autocorrelated and even non-stationary process data, an outlier detection and data cleaning method is proposed in [47].
– Other methods for outlier detection in univariate datasets are the Standard Deviation (SD) method and the Z-Score method. In some literature these methods are referred to as outlier labeling methods.

• Statistical Process Control (SPC). The field of Statistical Process Control (SPC) is closely related to univariate outlier detection methods. It considers the case where the univariate stream of measures represents a stochastic process, and the detection of the outlier is required online [45, 72]. Traditional SPC methods, such as Shewhart, Cumulative Sum (CUSUM), and Exponentially Weighted Moving Average (EWMA) charts, are extensively implemented in industry.

4. Multivariate Statistical Methods
In many cases multivariable observations cannot be detected as outliers when each variable is considered independently. Outlier detection is possible only when multivariate analysis is performed and the interactions among different variables are compared within the class of data. We derive the shortlist of these methods from [45, 49, 61].

• Statistical Methods. Statistical methods for multivariate outlier detection often indicate those observations that are located relatively far from the center of the data distribution. Several distance measures can be implemented for such a task. The Mahalanobis distance is a well-known criterion which depends on estimated parameters of the multivariate distribution and is used as an outlying degree [45, 58, 73].

• Multivariate Robust Measures. The distribution mean (measuring the location) and the variance-covariance matrix (measuring the shape) are the two most commonly used statistics for data analysis in the presence of outliers [45].
• Data Mining Methods. Data-mining-related methods are often nonparametric and thus do not assume an underlying generating model for the data. These methods are designed to manage large databases from high-dimensional spaces. Related classes in this category include distance-based methods, clustering methods, and spatial methods [45]. Further classes can be added, such as density-based methods and distribution-based methods [58].

We have presented different outlier detection methods based on the data structure and the outlier generating mechanism, which are studied in detail in [45]. Other categories of outlier detection methods, based on neural network approaches, machine learning, and hybrid systems, are found in [49]. Different categorizations and comparisons of outlier detection methods can be found in [45, 48, 49, 56, 58, 61, 62, 67–69].
1.6.1.3
Tukey’s method implementation in the interface
We have shown above, among the univariate robust measures for outlier detection in univariate data sets, a method that addresses the problem of robust estimation, proposed by Tukey [71]. Tukey introduced the Boxplot as a graphical display on which outliers can be indicated. The Boxplot, which is extensively used up to date, is based on the distribution quartiles. Box plots are nonparametric, as noted before: they display variation in samples of a statistical population without making any assumptions about the underlying statistical distribution. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers. In addition to the points themselves, they allow one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Boxplots can be drawn either horizontally or vertically. The method is less sensitive to extreme values than methods which use the sample mean and standard deviation, because it uses quartiles, which are resistant to extreme values. As looking at a statistical distribution is more intuitive than looking at a box plot, comparing the box plot against the probability density function of a normal distribution N(0, σ²) can be a useful aid to understanding the box plot, see Fig. 1.15. The procedure and calculations of Tukey's
Figure 1.15: Boxplot and a probability density function[*]
method are as follows:

1. Quartiles (Q1, Q2, and Q3). We use the following formula for finding the quartile positions in the dataset:
Pindex = PQ (n − 1)
where Pindex is the index of the desired percentile, PQ is the desired percentile (25%, 50%, or 75%), and n is the total number of elements. So we have:
• Q1: Lower Quartile := 25% (n − 1)
• Q2: Median := 50% (n − 1)
• Q3: Upper Quartile := 75% (n − 1).

2. Interquartile Range (IQR). The IQR is a measure of statistical dispersion, equal to the difference between the upper and lower quartiles:
IQR = Q3 − Q1
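Before turning to the fences below, a small worked sketch may help make the quartile formula concrete. The snippet is illustrative only: the sample values are hypothetical, and the 0-based position Pindex = PQ (n − 1) is shifted by one for Matlab's 1-based indexing, with linear interpolation between neighbouring sorted values.

```matlab
% Minimal sketch: quartiles and IQR following Pindex = PQ*(n-1).
% The sample values below are hypothetical thermometer readings.
x = sort([11.9 12.1 12.3 12.4 12.7 13.0 19.8]);

n   = numel(x);
pos = @(p) p*(n - 1) + 1;                 % 0-based formula shifted to 1-based indexing
val = @(k) x(floor(k)) + (k - floor(k)) * (x(ceil(k)) - x(floor(k)));  % linear interpolation

Q1  = val(pos(0.25));                     % lower quartile
Q2  = val(pos(0.50));                     % median
Q3  = val(pos(0.75));                     % upper quartile
IQR = Q3 - Q1;                            % interquartile range
```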
Figure 1.16: Boxplot of thermometer measures with outliers
Figure 1.17: Boxplot of thermometer measures and outliers have been removed
Inner fences are located at a distance of 1.5 IQR below Q1 and above Q3, i.e. the upper limit is [Q3 + 1.5 IQR] and the lower limit is [Q1 − 1.5 IQR]. The values which exceed those two limits are considered to be outliers, see Fig. 1.15. This method has been implemented in the interface's processing phase to give the DSS user a visualized output of a dataset for detecting and removing outliers. It has been implemented as a separate interface with a built-in process, and the user has to perform a task over this output. Fig. 1.16 and Fig. 1.17 show an example of detecting and removing outliers in a dataset of thermometer measures observed over 6 days. The outliers have been detected using Tukey's method, and the user has to decide about keeping or removing those observations. Since the target group of this interface is DSS users, an on-line decision step has been implemented for processing the discovered outliers. Two options are offered for dealing with outliers: removing and transforming.
Table 1.3: Statistics of Boxplot. For each group (Day 1–Day 6) the table reports, per detected outlier, the outlier value, its distance to the median, the lower and upper adjacent values, the p-value, the group median, and the user's decision (R = removed).
We considered the transformation here according to the Q1 and Q3 quartiles: values that are less than the lower adjacent value are transformed to the value of Q1, and values that are greater than the upper adjacent value are transformed to the value of Q3. Table 1.3 shows the boxplot statistics of the given example. A DSS user needs to visually observe outliers in a dataset; although in some analyses it is not preferable to remove outliers, observing them is certainly an important step towards data analysis. The box plot has become the standard technique for presenting the 5-number summary, which consists of the minimum and maximum range values, the upper and lower quartiles, and the median. This collection of values is a quick way to summarize the distribution of a dataset. In addition, the reduced representation afforded by the 5-number summary provides a more straightforward way to compare datasets, since only these characteristic values need to be analyzed. There are two fences that Tukey's method considers: the inner fences (which we have considered in our design) and the outer fences. The outer
fences are located at a distance of 3 IQR below Q1 and above Q3, since Tukey's range is [Q1 − w · IQR, Q3 + w · IQR], where w = 1.5 or 3. There is no statistical basis for the reason that Tukey uses 1.5 and 3 to make the inner and outer fences.
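Putting the pieces together, the following Matlab sketch illustrates the fence rule and the two handling options offered in the interface (removal, or transformation towards the nearest quartile). It is a simplified approximation of the implemented process, not the interface code itself: the data vector is hypothetical, and quantile (from the Statistics Toolbox) is used for brevity, although its default quartile convention may differ slightly from the index formula given earlier.

```matlab
% Minimal sketch of the Tukey fence rule used for outlier handling.
% x is a hypothetical vector of sensor readings; w = 1.5 gives the inner
% fences, w = 3 would give the outer fences.
x = [7.1 7.4 7.8 8.0 8.3 8.5 8.9 9.2 14.5 3.1];
w = 1.5;

q   = quantile(x, [0.25 0.75]);       % Statistics Toolbox; Q1 and Q3
Q1  = q(1);  Q3 = q(2);
IQR = Q3 - Q1;

lo = Q1 - w*IQR;                      % lower fence
hi = Q3 + w*IQR;                      % upper fence
isOut = (x < lo) | (x > hi);          % logical mask of suspected outliers

% Option 1: remove the flagged observations.
xRemoved = x(~isOut);

% Option 2: transform (winsorize) them to the nearest quartile,
% mirroring the interface's "transform" choice.
xTransformed = x;
xTransformed(x < lo) = Q1;
xTransformed(x > hi) = Q3;
```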
1.6.2
Clustering
Clustering techniques are mostly unsupervised methods that can be used to organize data into groups based on similarities among the individual data items. Most clustering algorithms do not rely on assumptions common to conventional statistical methods, such as the underlying statistical distribution of the data, and therefore they are useful in situations where little prior knowledge exists. The potential of clustering algorithms to reveal the underlying structures in data can be exploited in a wide variety of applications, including classification, image processing, pattern recognition, modeling, and identification. Fuzzy C-Means (FCM) is one of the most popular fuzzy clustering techniques; it was proposed by Dunn [74] in 1973 and eventually modified by Bezdek [75] in 1981. It is an approach where the data points have membership values with respect to the cluster centers, which are updated iteratively. The FCM algorithm consists of the following steps [76]:

Step 1: Suppose that N data points of dimension M, represented by $x_i$ (i = 1, 2, · · · , N), are to be clustered.

Step 2: Assume the number of clusters to be made, C, where 2 ≤ C ≤ N.

Step 3: Choose an appropriate level of cluster fuzziness f > 1.

Step 4: Initialize the N × C × M membership matrix U at random, such that $U_{ijm} \in [0, 1]$ and $\sum_{j=1}^{C} U_{ijm} = 1.0$ for each i and a fixed value of m.
Step 5: Determine the cluster centers $CC_{jm}$, for the j-th cluster and its m-th dimension, by using the following expression:
$$CC_{jm} = \frac{\sum_{i=1}^{N} U_{ijm}^{f}\, x_{im}}{\sum_{i=1}^{N} U_{ijm}^{f}} \qquad (1.1)$$
Step 6: Calculate the Euclidean distance between the i-th data point and the j-th cluster center with respect to the m-th dimension:
$$D_{ijm} = \left\| x_{im} - CC_{jm} \right\| \qquad (1.2)$$
Step 7: Update the fuzzy membership matrix U according to $D_{ijm}$. If $D_{ijm} > 0$, then
$$U_{ijm} = \frac{1}{\sum_{c=1}^{C} \left( \frac{D_{ijm}}{D_{icm}} \right)^{\frac{2}{f-1}}} \qquad (1.3)$$
If $D_{ijm} = 0$, then the data point coincides with the corresponding data point of the j-th cluster center $CC_{jm}$, and it has full membership, that is, $U_{ijm} = 1.0$.

Step 8: Repeat Steps 5 to 7 until the change in U is below a pre-specified termination criterion ǫ.

In Matlab, the Fuzzy Logic Toolbox function fcm performs FCM clustering [77]. It starts with an initial guess for the cluster centers, which are intended to mark the mean location of each cluster. The initial guess for these cluster centers is most likely incorrect. Next, fcm assigns every data point a membership grade for each cluster. By iteratively updating the cluster centers and the membership grades for each data point, fcm moves the cluster centers to the right location within the data set. This iteration is based on minimizing an objective function that represents the distance from any given data point to a cluster center, weighted by that data point's membership grade. As shown in Section 1.5.3, the clustering phase has been implemented in the data mining panel, where the datasets were built. Data arrive at the FCM phase as shown in Fig. 1.18.
Figure 1.18: Data flow processing and nodes to FCM phase
Figure 1.19: FCM interface
The FCM tool operates on 2-dimensional data, so, as shown in Fig. 1.18, there is a hidden layer running over Matlab's workspace to construct the required datasets in 2D. The user first has to extract the observations via the Process button, then construct a dataset, and finally choose the FCM process. The interface in Fig. 1.19 shows the starting mode of FCM. As shown, one can choose a sample data set and an arbitrary number of clusters from the drop-down menus on the right, and then click ”Start” to begin the fuzzy clustering process. Once the clustering is done, the user can select one of the clusters by clicking on it, and view the membership function surface by clicking the ”Plot MF” button. The user can also tune the three optional parameters of the FCM algorithm (exponent, maximum number of iterations, and minimum amount of improvement) from the interface and observe how the clustering process is consequently altered. From the option ”choose a sample data set”, a choice named ”GetObservation”
Figure 1.20: FCM for thermometer’s observations over 24 hours
Figure 1.21: Membership function plot of cluster 3
allows the user to choose the dataset from the six datasets which were built before. Fig. 1.20 shows a possible clustering of thermometer’s measurements using 3 clusters. Fig. 1.21 shows the membership function plot of one cluster.
Figure 1.22: FCM for thermometer’s observations over 24 hours for 6 days
Figure 1.23: Membership function plot of cluster 2 (the green one)
Since the DSS user is analyzing meteorological data, it makes sense to observe these measurements over several days. Hence, we added the possibility of constructing a database with several days' observations of a particular sensor. Fig. 1.22 shows an example of clustering the thermometer's measurements.
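For readers who wish to reproduce such a clustering outside the interface, a minimal sketch built around the Fuzzy Logic Toolbox function fcm mentioned above could look as follows. The data matrix and the option values are hypothetical; the first option entry corresponds to the fuzziness exponent f of Eq. (1.3).

```matlab
% Minimal sketch: fuzzy c-means clustering of 2-D observations
% (e.g. [hour-of-day, temperature] pairs). Requires the Fuzzy Logic Toolbox.
data = [ (0:0.5:23.5)'  12 + 6*sin(pi*(0:0.5:23.5)'/24) + randn(48,1) ];  % hypothetical

C = 3;                                   % number of clusters
options = [2.0 100 1e-5 1];              % [exponent f, max iterations, min improvement, display]
[centers, U] = fcm(data, C, options);    % centers: C x 2, U: C x N membership matrix

% Assign each observation to the cluster of maximum membership.
[~, idx] = max(U, [], 1);

% Plot the clusters and their centers.
figure; hold on
for j = 1:C
    plot(data(idx == j, 1), data(idx == j, 2), '.')
end
plot(centers(:,1), centers(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2)
xlabel('hour of day'); ylabel('temperature')
```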
1.6.3
Linear filtering; Analysis of Covariance (ANOCOVA)
The Analysis of Covariance (generally known as ANOCOVA or ANCOVA) is a technique that sits between analysis of variance and regression analysis [78]; ANOCOVA is a combination of ANOVA and regression. There are two uses of ANOCOVA which, on the surface, appear to be separate analyses; in fact, both analyses are identical [79]. The first use (the regression approach) is to check whether the regression lines for the groups are parallel. If there is evidence that the individual regression lines are not parallel, then a separate regression line must be fit for each group for prediction purposes. If there is no evidence of non-parallelism, then the next task is to see whether the lines are coincident, i.e. have both the same intercept and the same slope. If there is evidence that the lines are not coincident, then a series of parallel lines is fit to the data, and all of the data are used to estimate the common slope. If there is no evidence that the lines are not coincident, then all of the data can simply be pooled together and a single regression line fit to all of the data. The second use of ANOCOVA is to test for differences in means among the groups when some of the variation in the response variable can be ”explained” by a covariate. For example, in a study of weight change across groups, some of the variation in weight change may be related to initial weight; by ”standardizing” everyone to some common weight, we can more easily detect differences among the groups. As ANOCOVA is a technique for analyzing grouped data having a response y (the variable to be predicted) and a predictor x (the variable used to do the prediction), and since the framework here is about analysis, it is important to verify the assumptions underlying the analysis before it is started. Both goals of ANOCOVA have similar assumptions [78, 79]:

• The response variable Y is continuous (interval or ratio scaled).

• The data are collected under a completely randomized design (it is possible to relax this assumption). This implies that the treatment must be randomized completely over the entire set of experimental units in an experimental study, or that units must be selected at random from the relevant populations in an observational study.
• There must be no outliers. This can be checked by plotting Y vs. X for each group separately to see whether there are any points that do not appear to follow the straight line. In this scope, as shown in Section 1.6.1, an outlier detection process using Tukey's method has been implemented, and the data were exported to all phases of the data mining panel after processing and cleaning. Another hidden layer has been built inside the Matlab workspace to guarantee the construction of the new data sets (i.e. the datasets that were cleaned of outliers, see Section 1.5.1 and Section 1.6.1).

• The relationship between Y and X must be linear for each group. This assumption can be checked by looking at the individual plots of Y vs. X for each group, see Fig. 1.24.

• The variance must be equal for both groups around their respective regression lines. Check that the spread of the points is equal over the range of X and that the spread is comparable between the groups. This can be formally checked by looking at the MSE from a separate regression line for each group, since the MSE estimates the variance of the data around the regression line.

• The residuals must be normally distributed around the regression line for each group. This assumption can be checked by examining the residual plots from the fitted model for evidence of non-normality.

The ANOCOVA function opens an interactive graphical environment for fitting and prediction with analysis of covariance models, see Fig. 1.24. It fits the models in Table 1.4 for the i-th group, where α is the common intercept, β the common slope, αi and βi the group-specific adjustments to intercept and slope, and ǫ the measurement error, which follows a normal distribution [77].

Table 1.4: ANOCOVA regression models
Model             Regression model
Same mean         y = α + ǫ
Separate means    y = (α + αi) + ǫ
Same line         y = α + βx + ǫ
Parallel lines    y = (α + αi) + βx + ǫ
Separate lines    y = (α + αi) + (β + βi) x + ǫ
Figure 1.24: ANOCOVA interface
The ANOCOVA output consists of three figures: an interactive graph of the data and prediction curves (see Fig. 1.26), an ANOVA table, and a table of parameter estimates (see Fig. 1.25). The user can choose one of the models in Table 1.4: same mean fits a single mean, ignoring grouping; separate means fits a separate mean for each group; same line fits a single line, ignoring grouping; parallel lines fits a separate line for each group but constrains the lines to be parallel; and separate lines fits a separate line for each group with no constraints [77]. A comparison of ANOCOVA models and calculations can be found in [77–79], more on sample size calculations and statistical applications in [80], and a summary of ANOCOVA tables in [81]. For the DSS users, interpreting an analysis of covariance can present certain problems, depending on the nature of the data and, more importantly, the design of the experiment [81].
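A hedged command-line sketch of this phase is given below. It assumes the Statistics Toolbox function aoctool described in [77]; the covariate, response, and grouping vectors are hypothetical stand-ins for the interface's datasets.

```matlab
% Minimal sketch: interactive analysis of covariance with aoctool
% (Statistics Toolbox). The covariate is the hour of day, the response a
% temperature reading, and the group label the observation day.
hours = repmat((0:0.5:23.5)', 3, 1);                 % hypothetical covariate
day   = [ones(48,1); 2*ones(48,1); 3*ones(48,1)];    % hypothetical group labels
temp  = 12 + 0.2*hours + 0.5*day + randn(144,1);     % hypothetical response

% Opens the interactive ANOCOVA figure plus the ANOVA and coefficient tables;
% the last argument selects the initial model, here "separate lines".
aoctool(hours, temp, day, 0.05, 'Hour', 'Temperature', 'Day', 'on', 'separate lines');
```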
Figure 1.25: ANOCOVA tables (thermometer measures analysis over 24 hours)
Figure 1.26: ANOCOVA prediction (using separate lines model with 3 groups)
1.6.4
Polynomial modeling; Least squares (LS)
In statistics, polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y. Multiple regression refers to regression applications in which there is more than one independent variable; it includes a technique called polynomial regression, in which we regress a dependent variable on powers of the independent variables. The basic multiple regression model of a dependent (response) variable Y on a set of k independent (predictor) variables X1, X2, · · · , Xk can be expressed as:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + e_i, \qquad i = 1, 2, \cdots, n$$
where $y_i$ is the value of the dependent variable Y for the i-th case, $x_{ij}$ is the value of the j-th independent variable $X_j$ for the i-th case, $\beta_0$ is the Y-intercept of the regression surface, each $\beta_j$ (j = 1, 2, · · · , k) is the slope of the regression surface with respect to variable $X_j$, and $e_i$ is the random error component for the i-th case [82, 83]. In our design for polynomial curve fitting, we consider the general form of a polynomial of order j:
$$f(x) = \beta_0 + \sum_{k=1}^{j} \beta_k x^k \qquad (1.4)$$
and the general least squares error (residuals):
$$r^2 = \sum_{i=1}^{n} \left[ y_i - \left( \beta_0 + \sum_{k=1}^{j} \beta_k x_i^k \right) \right]^2 \qquad (1.5)$$
Polynomial regression played an important role in the development of regression analysis, with a greater emphasis on issues of design and inference. In our scope, a Matlab-based tool [77] has been used which provides the DSS user with an interactive plot of the result in a graphical interface. The results of this phase are an interactive plot in a graphical interface, parameter values (βk and confidence intervals), estimates, and the residuals, see Fig. 1.28. The user can change the parameters of the fit and export the fit results to the Matlab workspace for further analysis.
Figure 1.27: From the main interface to the polynomial modeling
Figure 1.28: Polynomial modeling of thermometer measures and residuals
1.6.5
Neural networks (NN) approach
Figure 1.29: Example of neural networks
Artificial neural networks (ANN) are a type of nonlinear processing system that is ideally suited for a wide range of tasks, especially tasks where there is no existing algorithm for task completion. ANN can be trained to solve certain problems using a teaching method and sample data. In this way, identically constructed ANN can be used to perform different tasks depending on the training received. With proper training, ANN are capable of generalizing the ability to recognize similarities among different input patterns, especially patterns that have been corrupted by noise. Neural networks are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data. Mathematically, neural networks are nonlinear. Each layer represents a nonlinear combination of nonlinear functions from the previous layer. Each neuron is a multiple-input, multiple-output (MIMO) system that receives signals from the inputs, produces a resultant signal, and transmits that signal to all outputs. Practically, neurons in an ANN are arranged into layers. The first layer that interacts with the environment to receive input is known as the input layer. The final layer that interacts with the output to present the processed data is known as the output layer. Layers between the input and the output layer that do not have any interaction with the environment are known as hidden layers, see Fig. 1.29. Increasing the complexity of an ANN, and thus its computational capacity, requires the addition of more hidden layers, and more neurons per layer [84–86].
There are two main types of neural network models: supervised neural networks, such as the multi-layer perceptron or radial basis functions, and unsupervised neural networks, such as Kohonen feature maps. A supervised neural network uses training and testing data to build a model. The data involve historical data sets containing input variables, or data fields, which correspond to an output. The training data are what the neural network uses to ”learn” how to predict the known output, and the testing data are used for validation. The aim is for the neural network to predict the output for any record given the input variables only. Feed-forward neural networks (FFNNs) are the simplest form of ANN; an FFNN such as the one in Fig. 1.29 consists of three layers: an input layer, a hidden layer, and an output layer, as shown before. In each layer there are one or more processing elements (PEs). PEs are meant to simulate the neurons in the brain, and this is why they are often referred to as neurons or nodes. A PE receives inputs from either the outside world or the previous layer. There are connections between the PEs in each layer that have a weight (parameter) associated with them; this weight is adjusted during training. Information only travels in the forward direction through the network. There are no feedback loops, which means that signals from one layer are not transmitted to a previous layer. This can be stated for layers i and j as:
$$w_{ij} = 0 \quad \text{if } i = j, \qquad w_{ij} = 0 \quad \text{if } \mathrm{layer}(i) \le \mathrm{layer}(j)$$
Weights of direct feedback paths, from a neuron to itself, are zero. Weights from a neuron to a neuron in a previous layer are also zero. Notice that weights for the forward paths may also be zero depending on the specific network architecture, but they do not need to be. A network without all possible forward paths is known as a sparsely connected network, or a non-fully connected network. The percentage of available connections that are utilized is known as the connectivity of the network [87]. NNs have been applied to data mining for mainly three reasons:

1. High Accuracy: Neural networks are able to approximate complex nonlinear mappings.
2. Noise Tolerance: Neural networks are very flexible with respect to incomplete, missing, and noisy data.

3. Independence from prior assumptions: Neural networks can be updated with fresh data, making them useful for dynamic environments.

Hidden nodes in supervised neural networks can be regarded as latent variables, and neural networks can be implemented in parallel hardware. As we are presenting an application of neural networks in data mining, we are not going to discuss neural networks in all their applications and methods; a brief introduction of NNs for data mining is given here, and applications are discussed later with an example using NNs as a tool for data fitting and clustering in our interface. NNs and their soft computing have been used in a variety of DM tasks [88]. The main contribution of NNs towards DM stems from rule extraction and from clustering.

Rule Extraction and Evaluation: Typically a network is first trained to achieve the required accuracy rate. Redundant connections of the network are then removed using a pruning algorithm. The link weights and activation values of the hidden units in the network are analyzed, and classification rules are generated [89]. The generated rules can be evaluated according to some quantitative measures (e.g., accuracy, coverage, fidelity, and confusion). This relates to the preference criterion of goodness of fit chosen for the rules.

Clustering and Dimensionality Reduction: The self-organizing maps (SOM) are deemed highly effective as a sophisticated visualization tool for visualizing high-dimensional, complex data with inherent relationships between the various features comprising the data. The SOM's output emphasizes the salient features of the data and subsequently leads to the automatic formation of clusters of similar data items. This particular characteristic of SOMs alone qualifies them as a potential candidate for data mining tasks that involve classification and clustering of data items [90]. Kohonen's SOM [91] proved to be an appropriate tool for handling huge databases.

Incremental Learning: When designing and implementing data mining applications for large data sets, we face processing time and memory space problems. In this case, incremental learning is a very attractive feature. In the context of supervised training, incremental learning means learning each input-output sample
pair without keeping it for subsequent processing. The NN tool in the interface has been designed for three main tasks: data fitting, clustering, and pattern recognition. A complete explanation of the NN methods and structures used here can be found in [92].

• Data fitting
With this tool, the user can select data, create and train a network, and evaluate its performance using the mean square error and regression analysis. A two-layer feed-forward network with sigmoid hidden neurons and linear output neurons can fit multi-dimensional mapping problems arbitrarily well, given consistent data and enough neurons in its hidden layer. The network is trained with the Levenberg-Marquardt backpropagation algorithm [77].
Figure 1.30: FFNN with 10 hidden neurons used for data fitting, after [77]
As shown before, the user can choose a data set to build the network from the databases which were built in the first phase (Section 1.5). In this example we use a dataset of thermometer measures. The data have been collected for 6 days, with 49 measures for each day (i.e. collecting data every 30 minutes), as shown in the figures on the next pages. The user can follow the instructions which appear in the interface step by step as training starts. At the end of training, the DSS user obtains many output results, such as the regression plot, the error histogram plot, the coefficient of determination, the mean square error, and the network's performance. All the given results can be saved to the Matlab workspace, and the user can apply different analysis tasks to these data. As the main goal of the interface (GetObservations) is to provide the DSS with an easy tool for data management, the NN fitting tool can be considered one of the data mining methods for regression analysis; one of the advantages of NN applications in data mining is the high accuracy which is targeted here. See Fig. 1.31 and Fig. 1.32.
Figure 1.31: Regression plot of FFNN of thermometer’s data sets
In this example, we have used a hidden layer with 10 neurons, and the input vectors and target vectors were randomly divided into three sets as follows: 70% was used for training, 15% was used to validate that the network is generalizing and to stop training before overfitting, and the last 15% was used as a completely independent test of network generalization. The regression plots in Fig. 1.31 display the network outputs with respect to the targets for the training, validation, and test sets. For a perfect fit, the data should fall along a 45-degree line, where the network outputs are equal to the targets. In this example, the fit is reasonably good for all data sets, with R values in each case of 0.94 or above.
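For completeness, a command-line sketch of a comparable fitting network is shown below. It assumes the Neural Network Toolbox function fitnet and reproduces the 70/15/15 division described above; the input and target vectors are hypothetical stand-ins for the interface's datasets.

```matlab
% Minimal sketch: two-layer feed-forward fitting network with 10 hidden
% neurons, trained with Levenberg-Marquardt (Neural Network Toolbox).
x = 0:0.5:23.5;                                  % hypothetical inputs (hour of day)
t = 12 + 6*sin(pi*x/24) + 0.3*randn(size(x));    % hypothetical targets (temperature)

net = fitnet(10, 'trainlm');                     % 10 hidden neurons, Levenberg-Marquardt
net.divideParam.trainRatio = 0.70;               % 70% training
net.divideParam.valRatio   = 0.15;               % 15% validation (early stopping)
net.divideParam.testRatio  = 0.15;               % 15% independent test

[net, tr] = train(net, x, t);                    % tr records the division and performance
yhat = net(x);
perf = perform(net, t, yhat);                    % mean squared error

plotregression(t, yhat);                         % regression plot (cf. Fig. 1.31)
ploterrhist(t - yhat);                           % error histogram (cf. Fig. 1.32)
```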
Figure 1.32: Error histogram
The blue bars represent training data, the green bars represent validation data, and the red bars represent testing data. The histogram can give an indication of outliers, which are data points where the fit is significantly worse than for the majority of the data. It is a good idea to check the outliers to determine whether the data are bad, or whether those data points are different from the rest of the data set. If the outliers are valid data points, but are unlike the rest of the data, then the network is extrapolating for these points; one should collect more data that looks like the outlier points and retrain the network. In this case, one can see that while most errors fall between −1 and 1, there is a validation point with an error of −2.2763, which possibly indicates an outlier. Table 1.5 shows the results of the given network, where the R values measure the correlation and MSE is the mean square error.

Table 1.5: Statistical results
Data         Samples   MSE     R
Training     37        0.234   0.973
Validation   7         0.548   0.947
Testing      7         0.345   0.954
• Clustering with Self-Organizing Map
Self-organizing maps (SOM) learn to classify input vectors according to how they are grouped in the input space. They differ from competitive layers in that neighboring neurons in the self-organizing map learn to recognize neighboring sections of the input space. Thus, self-organizing maps learn both the distribution (as do competitive layers) and the topology of the input vectors they are trained on.
Figure 1.33: SOM neural network, after [77]
A self-organizing map (SOM) consists of a competitive layer which can classify a dataset of vectors with any number of dimensions into as many classes as the layer has neurons. The neurons are arranged in a 2D topology, which allows the layer to form a representation of the distribution and a two-dimensional approximation of the topology of the dataset. The neurons in the layer of an SOM are originally arranged in physical positions according to a topology function; different functions can be used to arrange the neurons in a grid, hexagonal, or random topology. Distances between neurons are calculated from their positions with a distance function; link distance is the most common. For our scope, the SOM is trained using the batch algorithm, which presents the whole data set to the network before any weights are updated. The algorithm then determines a winning neuron for each input vector, and each weight vector moves to the average position of all of the input vectors for which it is a winner, or for which it is in the neighborhood of a winner. The DSS user mainly uses this tool for clustering a large database; broadly, we train the network with unsupervised weight and bias learning rules and batch updates, so the weights and biases are updated at the end of an entire pass through the input data [77, 92]. The next example shows training and clustering of 1372 samples of thermometer measures collected throughout 2 days of observation.
Figure 1.34: SOM neighbor weights distance and weight positions
There are several useful visualizations that the user can access from this tool. In Fig. 1.34, the SOM weight positions plot shows the locations of the data points and the weight vectors. As the figure indicates, after only 200 iterations of the batch algorithm, the map is distributed through the input space. When the input space is high dimensional, one cannot visualize all the weights at the same time; in this case, the SOM neighbor distances plot is useful: the figure on the left in Fig. 1.34 indicates the distances between neighboring neurons. Fig. 1.34 uses the following color coding:

• The blue hexagons represent the neurons.
• The red lines connect neighboring neurons.
• The colors in the regions containing the red lines indicate the distances between neurons.
• The darker colors represent larger distances.
• The lighter colors represent smaller distances.

A group of light segments appears in the lower-right region, bounded by some darker segments. This grouping indicates that the network has clustered the data into two groups. These two groups can be seen in the weight positions figure (on the right). The lower-left region of that figure contains a small group of tightly clustered data points. The corresponding weights are closer together in this region,
which is indicated by the lighter colors in the neighbor distance figure. Where weights in this small region connect to the larger region, the distances are larger, as indicated by the darker band in the neighbor distance figure. The segments in the lower-left region of the neighbor distance figure are darker than those in the upper right. This color difference indicates that data points in this region are farther apart. This distance is confirmed in the weight positions figure.
Figure 1.35: SOM weight planes
Fig. 1.35 shows a weight plane for each element of the input vector (two days, in this case). They are visualizations of the weights that connect each input to each of the neurons (darker colors represent larger weights). If the connection patterns of two inputs are very similar, one can assume that the inputs are highly correlated. In this case, day 1 has connections that are not very different from those of day 2. Neural networks are suitable in data-rich environments and are typically used for extracting embedded knowledge in the form of rules, quantitative evaluation of these rules, clustering, self-organization, classification and regression, feature evaluation, and dimensionality reduction. Since it is not an easy task for a DSS user (i.e. a nonprofessional user of NN tools) to build a suitable network for each targeted task, we have added this sub-tool to the interface with pre-defined networks and different auto-processed databases to help the DSS user deal with this type of processing. Neural networks for data mining are not yet fully integrated with decision-making tasks.
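A corresponding command-line sketch of the SOM clustering described above is given below, assuming the Neural Network Toolbox function selforgmap; the input matrix, map size, and number of epochs are hypothetical choices echoing the example figures.

```matlab
% Minimal sketch: SOM clustering with batch training (Neural Network Toolbox).
% Each column of X is one observation; the two rows are hypothetical
% "day 1" and "day 2" thermometer features.
X = [12 + 4*rand(1, 1372);       % day 1 temperatures
     13 + 4*rand(1, 1372)];      % day 2 temperatures

net = selforgmap([10 10]);       % 10-by-10 hexagonal map, link-distance neighbourhood
net.trainParam.epochs = 200;     % 200 batch iterations, as in the example above
net = train(net, X);             % batch SOM training

classes = vec2ind(net(X));       % index of the winning neuron for each observation

% Built-in visualizations corresponding to Figs. 1.34 and 1.35:
plotsomnd(net);                  % neighbour weight distances
plotsompos(net, X);              % weight positions over the data
plotsomplanes(net);              % weight planes for each input element
```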
1.7
Conclusion and data mining algorithms
In this chapter we have presented a graphical user interface for decision making using mathematical models. We have applied data mining methods within the proposed project, in which the data, as meteorological data, come from many sensors in measuring stations. Algorithms have been constructed to implement the methods we have presented.

Three primary components of data mining algorithms have been identified. The first is model representation, the language used to describe discoverable patterns. If the representation is too limited, then no amount of training time or examples can produce an accurate model for the data. It is important that a data analyst fully comprehend the representational assumptions that might be inherent in a particular method, and it is equally important that an algorithm designer clearly state which representational assumptions are made by a particular algorithm. The second component is the model evaluation criteria: quantitative statements of how well a particular pattern (a model and its parameters) meets the goals of the KDD process. For example, predictive models are often judged by their empirical prediction accuracy on some test set, while descriptive models can be evaluated along the dimensions of predictive accuracy, novelty, utility, and understandability of the fitted model. The last component is the search method, which itself consists of two parts: (1) parameter search and (2) model search. Once the model representation (or family of representations) and the model evaluation criteria are fixed, the data mining problem reduces to a pure optimization task: find the parameters and models from the selected family that optimize the evaluation criteria. In parameter search, the algorithm must search for the parameters that optimize the model evaluation criteria given the observed data and a fixed model representation.

The main contribution of this chapter was to present a graphical user interface as a tool for environmental management applications. The chapter took a practical approach to environmental monitoring systems, and the theoretical contribution was the development of Matlab-based mathematical models for decision making.
References
[1] O. Maimon, and L. Rokach ”Data Mining and Knowledge Discovery Handbook” O. Maimon and L. Rokach Editors, 3rd edition, Springer. Springer New York Dordrecht Heidelberg London, 2010. [2] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. ”From Data Mining to Knowledge Discovery in Databases” AI Magazine, American Association for Artificial Intelligence, pp. 37–54, Fall 1996. [3] I. H. Witten, E. Frank, and M. A. Hall. ”Data Mining: Practical Machine Learning Tools and Techniques (3rd edition).” Elsevier, 30 January 2011. [4] S. Sayad. ”Real Time Data mining”. Self-Help Publishers, Canada, 2011. [5] Ó. Marbán, G. Mariscal, and J. Segovia. ”A Data Mining & Knowledge Discovery Process Model.” In Data Mining and Knowledge Discovery in Real Life Applications, book edited by: J. Ponce and A. Karahoca, pp. 438–453, February 2009, I-Tech, Vienna, Austria. [6] M. J. Druzdzel and R. R. Flynn. ”Decision Support Systems.” Encyclopedia of Library and Information Science, Allen Kent (ed.), New York: Marcel Dekker, Inc., 2002. [7] Acrotec, http://www.acrotec.it/?lang=en. ”Projects: Integrated Network for Emergencies”, 2012.
[8] J. Spate, K. Gibert, M. Sanchez, E. Frank, J. Comas, and I. Athanasiadis. ”Data Mining as a Tool for Environmental Scientist. ” Al Magazine, vol 17, 1996. [9] P. Keen. ”Decision support systems : a research perspective” Cambridge, Mass. : Center for Information Systems Research, Alfred P. Sloan School of Management, 1980. [10] R. Sprague ”A Framework for the Development of Decision Support Systems.” MIS Quarterly, 4(4),pp.1–25, 1980. [11] A. P. Sage. ”Decision Support Systems Engineering”.
John Wiley & Sons, Inc., New York, 1991. [12] P. E. Lehner, T. M. Mullin, and M. S. Cohen. ”A probability analysis of the usefulness of decision aids.” Uncertainty in Artificial Intelligence, Elsevier, pp. 427–436, 1990. [13] J.G. Borges, E.M. Nordström, G. Gonzalo, J. Hujala, and T. Trasobares. ”Computer-based tools for supporting forest management. The experience and the expertise world-wide” Community of Practice Forest Management Decision Support Systems, Umeå, Sweden, 2014. [14] K.A. Delic, L. Douillet, and U. Dayal. ”Towards an architecture for real-time decision support systems: challenges and solutions” IEEE International Symposium on Database Engineering and Applications, pp. 303–311, 2001. [15] G. Piatetsky-Shapiro ”Knowledge Discovery in Real Databases: A Report on the IJCAI-89 Workshop.” AI Magazine, 11(5), pp. 68–70, 1991. [16] J. Shrager, and P. Langley ”Computational Models of Scientific Discovery and Theory Formation.” San Francisco, Calif.: Morgan Kaufmann, 1990. [17] W. Kloesgen, and J. Zytkow ”Knowledge Discovery in Databases Terminology. In Advances in Knowledge Discovery and Data Mining” AAAI Press, Menlo Park, Calif., pp. 569–588, 1996. [18] P. Spirtes, C. Glymour, and R. Scheines. ”Causation, Prediction, and Search”. Springer-Verlag, New York, 1993.
[19] J. Elder, and D. Pregibon ” A Statistical Perspective on KDD. In Advances in Knowledge Discovery and Data Mining” AAAI Press, Menlo Park, Calif eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, pp. 83–116, 1996. [20] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth ” Statistics and Data Mining” Communications of the ACM (Special Issue on Data Mining), November 1996. [21] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. ” From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining”, eds. U. Fayyad, G. Piatetsky- Shapiro, P. Smyth, and R. Uthurusamy, 130. Menlo Park, Calif.: AAAI Press., 1996. [22] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy ” Statistics and Data Mining” AAAI Press.,Menlo Park, Calif, 1996. [23] J. Han, and M. Kamber. ”Data mining: concepts and techniques”. Morgan Kaufmann, 2006. [24] D.T. Larose. ”Discovering knowledge in data: an introduction to data mining”.
John Wiley and Sons, 2005.
[25] S. I. Weiss, and C. Kulikowski. ”Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Networks, Machine Learning, and Expert Systems”. San Francisco, Calif.: Morgan Kaufmann, 1991. [26] D.J. Hand. ”Discrimination and Classification”. Chichester, U.K.: Wiley, 1981. [27] A. K. Jain, , and R. C. Dubes. ”Algorithms for Clustering Data”.
Englewood Cliffs, N.J.: Prentice Hall, 1988. [28] D. M. Titterington, A. F. M. Smith, and U. E. Makov. ”Statistical Analysis of Finite-Mixture Distributions”.
Chichester, U.K.: Wiley, 1985.
[29] P. Cheeseman, and J. Stutz. ” Bayesian Classification: Theory and Results”,In Advances in Knowledge Discovery and Data Mining, eds. U.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 7395. Menlo Park, Calif.: AAAI Press., 1996. [30] B. Silverman ”Density Estimation for Statistics and Data Analysis”. Chapman and Hall, New York, 1986. [31] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and I. Verkamo. ” Fast Discovery of Association Rules”,In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 307328. Menlo Park, Calif.: AAAI Press., 1996. [32] R. Zembowicz, and J. Zytkow. ” From Contingency Tables to Various Forms of Knowledge in Databases”,In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 329351. Menlo Park, Calif.: AAAI Press., 1996. [33] C. Glymour, R. Scheines, P. Spirtes, and K. Kelly.”Discovering Causal Structure”.
Academic, New York, 1987.
[34] D. Heckerman ”Bayesian Networks for Knowledge Discovery”,In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. PiatetskyShapiro, P. Smyth, and R. Uthurusamy, 273306. Menlo Park, Calif.: AAAI Press., 1996. [35] D. Berndt, and J. Clifford. ”Finding Patterns in Time Series: A Dynamic Programming Approach.”,In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 229248. Menlo Park, Calif.: AAAI Press., 1996. [36] O. Guyon, N. Matic, and N. Vapnik. ”Discovering Informative Patterns and Data Cleaning”,In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 181204. Menlo Park, Calif.: AAAI Press., 1996. [37] W. Kloesgen ”A Multipattern and Multistrategy Discovery Assistant”,In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 249271. Menlo Park, Calif.: AAAI Press., 1996.
[38] C. Matheus, G. Piatetsky-Shapiro, and D. McNeill. ”Selecting and Reporting What Is Interesting: The KEfiR Application to Healthcare Data”,In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 495516. Menlo Park, Calif.: AAAI Press., 1996. [39] M. Basseville, and I. V. Nikiforov. ”Detection of Abrupt Changes: Theory and Application”.
Englewood Cliffs, N.J.: Prentice Hall, 1993.
[40] V. Mallet and D.E. Keyes and F.E. Fendell ”Modeling wildland fire propagation with level set methods” Computers Mathematics with Applications, Elsevier, 57(7), pp. 1089–1101, 2009. [41] F.E. Fendell, M.F. Wolff ”Forest Fires Behavior and Ecological Effects” Academic Press, Chapter Wind-aided fire spread, pp. 171–223, 2001. [42] J. Spate, K. Gibertb, M. Sanchez-Marr, E. Frank, J. Comase, I. Athanasiadis, R. Letcher. ”Data Mining as a Tool for Environmental Scientists” Al Magazine, vol. 17, 1996. [43] I. N. Athanasiadis, and P. A. Mitkas ”An agent-based intelligent environmental monitoring system” International Journal of Management of Environmental Quality, 15(3), pp. 238–249, 2004. [44] M. Poch, J. Comas, I. Rodriguez-Roda, M. Sanchez-Marre, U. Cortes. ”Designing and building real environmental decision support systems” Environmental Modelling Software, Elsevier, 2003. [45] I. Ben-Gal ”Outlier Detection” In: Maimon O. and Rockach L. (Eds.) Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic Publishers, 2005. [46] G. J. Williams ,R. A. Baxter ,H. X. He ,S. Hawkins, and L. Gu ”A Comparative Study of RNN for Outlier Detection in Data Mining” IEEE International Conference on Data-mining (ICDM’02), Maebashi City, Japan, CSIRO Technical Report CMIS-02/102, 2002.
[47] H. Liu, S. Shah, and W. Jiang. "On-line Outlier Detection and Data Cleaning". Computers and Chemical Engineering, vol. 28, pp. 1635–1647, 2004.
[48] D. Hawkins. "Identification of Outliers". Chapman and Hall, 1980.
[49] C. C. Aggarwal. "Outlier Analysis". Kluwer Academic Publishers.
[50] N. Devarakonda, S. Pamidi, V. V. Kumari, and A. Govardhan. "Outliers Detection as Network Intrusion Detection System Using Multi Layered Framework". Advances in Computer Science and Information Technology, Communications in Computer and Information Science, Springer, Vol. 131, pp. 101–111, 2011.
[51] K. Prakobphol and J. Zhan. "A Novel Outlier Detection Scheme for Network Intrusion Detection Systems". IEEE International Conference on Information Security and Assurance, pp. 555–560, 2008.
[52] A. Juvonen and T. Hamalainen. "An Efficient Network Log Anomaly Detection System Using Random Projection Dimensionality Reduction". IEEE 6th International Conference on New Technologies, Mobility and Security (NTMS), pp. 1–5, 2014.
[53] W.-F. Yu and N. Wang. "Research on Credit Card Fraud Detection Model Based on Distance Sum". International Joint Conference on Artificial Intelligence (JCAI'09), IEEE, pp. 353–356, 2009.
[54] S. B. E. Raj and A. A. Portia. "Analysis on Credit Card Fraud Detection Methods". IEEE International Conference on Computer, Communication and Electrical Technology (ICCCET), pp. 152–156, 2011.
[55] M. Haiying and L. Xin. "Application of Data Mining in Preventing Credit Card Fraud". IEEE International Conference on Management and Service Science (MASS'09), pp. 1–6, 2009.
[56] O. Ghorbel, M. W. Jmal, W. Ayedi, H. Snoussi, and M. Abid. "An overview of outlier detection technique developed for wireless sensor networks". IEEE 10th International Multi-Conference on Systems, Signals & Devices (SSD), pp. 1–6, 2013.
[57] C. Franke and M. Gertz. "Detection and Exploration of Outlier Regions in Sensor Data Streams". IEEE International Conference on Data Mining Workshops (ICDMW'08), pp. 375–384, 2008.
[58] S. Cateni, V. Colla, and M. Vannucci. "Outlier Detection Methods for Industrial Applications". In Advances in Robotics, Automation and Control, eds. Jesús Arámburo and Antonio Ramírez Treviño, I-Tech, Vienna, Austria, 2008.
[59] S. Ahmad, N. M. Ramli, and H. Midi. "Outlier detection in logistic regression and its application in medical data analysis". IEEE Colloquium on Humanities, Science and Engineering (CHUSER), pp. 503–507, 2012.
[60] Y. Mao, Y. Chen, G. Hackmann, M. Chen, C. Lu, M. Kollef, and T. C. Bailey. "Medical Data Mining for Early Deterioration Warning in General Hospital Wards". IEEE 11th International Conference on Data Mining Workshops (ICDMW), pp. 1042–1049, 2011.
[61] V. J. Hodge and J. Austin. "A Survey of Outlier Detection Methodologies". Kluwer Academic Publishers, 2004.
[62] V. Chandola, A. Banerjee, and V. Kumar. "Outlier Detection: A Survey". University of Minnesota.
[63] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. "LOCI: Fast Outlier Detection Using the Local Correlation Integral". Intel Research Laboratory, Technical Report IRP-TR-02-09, 2002.
[64] F. Grubbs. "Procedures for detecting outlying observations in samples". Technometrics, 11(1), pp. 1–21, 1969.
[65] R. O. Duda, P. E. Hart, and D. G. Stork. "Pattern Classification (2nd Edition)". Wiley, 2000.
[66] L. R. Rabiner and B. H. Juang. "An introduction to Hidden Markov Models". IEEE ASSP Magazine, 3(1), pp. 4–16, 1986.
[67] V. Barnett and T. Lewis. "Outliers in Statistical Data". Wiley, 1994.
[68] L. Davies and U. Gather. "The identification of multiple outliers". Journal of the American Statistical Association, 88(423), pp. 782–792, 1993.
[69] F. R. Hampel. "A general qualitative definition of robustness". Annals of Mathematical Statistics, vol. 42, pp. 1887–1896, 1971.
[70] F. R. Hampel. "The influence curve and its role in robust estimation". Journal of the American Statistical Association, vol. 69, pp. 382–393, 1974.
[71] J. W. Tukey. "Exploratory Data Analysis". Addison-Wesley, 1977.
[72] I. Ben-Gal, G. Morag, and A. Shmilovici. "CSPC: A Monitoring Procedure for State Dependent Processes". Technometrics, 45(4), pp. 293–311, 2003.
[73] P. C. Mahalanobis. "On the generalised distance in statistics". Proceedings of the National Institute of Sciences of India, 2(1), pp. 49–55, 1936.
[74] J. C. Dunn. "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters". Journal of Cybernetics, Vol. 3, pp. 32–57, 1973.
[75] J. C. Bezdek. "Pattern Recognition with Fuzzy Objective Function Algorithms". Kluwer Academic Publishers, Norwell, MA, USA, 1981.
[76] S. Chattopadhyay, D. K. Pratihar, and S. C. De Sarkar. "A Comparative Study of Fuzzy C-Means Algorithm and Entropy-based Fuzzy Clustering Algorithms". Computing and Informatics, Vol. 30, pp. 701–720, 2011.
[77] Matlab and Statistics Toolbox Release 2012b, The MathWorks, Inc., Natick, Massachusetts, United States. Available online at http://www.mathwork.com
[78] G. Miner, R. Nisbet, and J. Elder. "Handbook of Statistical Analysis and Data Mining Applications". Academic Press, Elsevier, 2009.
[79] C. J. Schwarz. "Analysis of Covariance - ANCOVA". Course notes, http://www.stat.sfu.ca/~cschwarz/CourseNotes, accessed 20-10-2014.
[80] G. Shan and C. Ma. "A Comment on Sample Size Calculation for Analysis of Covariance in Parallel Arm Studies". Biometrics and Biostatistics, Vol. 5(1), 1484, 2014.
[81] S. W. Huck. "Reading Statistics and Research (4th ed.)". Boston, MA: Allyn and Bacon, 2004.
[82] E. Ostertagova. "Modelling using polynomial regression". Procedia Engineering, Elsevier, Vol. 48, pp. 500–506, 2012.
[83] A. D. Aczel. "Complete Business Statistics". Irwin, ISBN 0-256-05710-8, 1989.
[84] K. Suzuki. "Artificial Neural Networks: Architectures and Applications". InTech, 2013.
[85] D. Michie and D. J. Spiegelhalter. "Machine Learning, Neural and Statistical Classification". Ellis Horwood, 1994.
[86] J. P. Bigus. "Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support". McGraw-Hill, Hightstown, NJ, USA, 1996.
[87] S. Nirkhi. "Potential use of Artificial Neural Network in Data Mining". 2nd International Conference on Computer and Automation Engineering (ICCAE), pp. 339–343, 2010.
[88] Y. Bengio, J. M. Buhmann, M. Embrechts, and J. M. Zurada. "Introduction to the special issue on neural networks for data mining and knowledge discovery". IEEE Transactions on Neural Networks, Vol. 11, pp. 545–549, 2000.
[89] A. Roy. "Artificial neural networks - a science in trouble". ACM SIGKDD Explorations, Vol. 1, pp. 33–38, 2000.
[90] T. Kohonen. "Self-Organizing Maps". Series in Information Sciences, 2nd edn., Springer, Heidelberg, 1997.
[91] T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and A. Saarela. "Self organization of a massive document collection". IEEE Transactions on Neural Networks, Vol. 11, pp. 574–585, 2000.
[92] M. H. Beale, M. T. Hagan, and H. B. Demuth. "Handbook of Neural Network Toolbox". The MathWorks, Inc., 2014. http://www.mathwork.com
[93] M. Gandhi and L. Mili. "Robust Kalman filter based on a generalized maximum-likelihood-type estimator". IEEE Transactions on Signal Processing, vol. 58, no. 5, pp. 2509–2520, 2010.
[94] D. Shi, T. Chen, and L. Shi. "Event-triggered maximum likelihood state estimation". Automatica, vol. 50, no. 1, pp. 247–254, 2014.
[95] A. H. Jazwinski. "Limited memory optimal filtering". IEEE Transactions on Automatic Control, vol. 13, no. 5, pp. 558–563, 1968.
[96] H. Michalska and D. Q. Mayne. "Moving horizon observers and observer-based control". IEEE Transactions on Automatic Control, vol. 40, no. 6, pp. 995–1006, 1995.
[97] C. V. Rao, J. B. Rawlings, and J. H. Lee. "Constrained linear state estimation – a moving horizon approach". Automatica, vol. 37, no. 10, pp. 1619–1628, 2001.
[98] G. Ferrari-Trecate, D. Mignone, and M. Morari. "Moving horizon estimation for hybrid systems". IEEE Transactions on Automatic Control, vol. 47, no. 10, pp. 1663–1676, 2002.
[99] A. Alessandri, M. Baglietto, and G. Battistelli. "Receding-horizon estimation for discrete-time linear systems". IEEE Transactions on Automatic Control, vol. 48, no. 3, pp. 473–478, 2003.
[100] C. V. Rao, J. B. Rawlings, and D. Q. Mayne. "Constrained state estimation for nonlinear discrete-time systems: stability and moving horizon approximations". IEEE Transactions on Automatic Control, vol. 48, no. 2, pp. 246–257, 2003.
[101] A. Alessandri, M. Baglietto, and G. Battistelli. "Moving-horizon state estimation for nonlinear discrete-time systems: New stability results and approximation schemes". Automatica, vol. 44, no. 7, pp. 1753–1765, 2008.
[102] A. Alessandri, M. Baglietto, G. Battistelli, and M. Gaggero. "Moving-horizon state estimation for nonlinear systems using neural networks". IEEE Transactions on Neural Networks, vol. 22, no. 5, pp. 768–780, 2011.
[103] A. Matasov and V. Samokhvalov. "Guaranteeing parameter estimation with anomalous measurement errors". Automatica, vol. 32, no. 9, pp. 1317–1322, 1996.
[104] A. Alessandri, M. Baglietto, and G. Battistelli. "Robust receding-horizon state estimation for uncertain discrete-time linear systems". Systems & Control Letters, vol. 54, no. 7, pp. 627–643, 2005.
[105] A. Alessandri, M. Baglietto, and G. Battistelli. "Min-max moving-horizon estimation for uncertain discrete-time systems". SIAM Journal on Control and Optimization, vol. 50, no. 3, pp. 1439–1465, 2012.
[106] A. Alessandri, M. Baglietto, and G. Battistelli. "Receding-horizon estimation for discrete-time linear systems". IEEE Transactions on Automatic Control, vol. 48, no. 3, pp. 473–478, 2003.
[107] M. Grewal and A. Andrews. "Kalman Filtering: Theory and Practice Using Matlab". Wiley.
[108] A. Alessandri, M. Baglietto, T. Parisini, and R. Zoppoli. "A neural state estimator with bounded errors for nonlinear systems". IEEE Transactions on Automatic Control, 44(11), pp. 2028–2042, 1999.
[109] A. Alessandri, M. Baglietto, and G. Battistelli. "Receding-horizon estimation for switching discrete-time linear systems". IEEE Transactions on Automatic Control, 50(11), pp. 1736–1748, 2005.
[110] P. E. Moraal and J. W. Grizzle. "Observer design for nonlinear systems with discrete-time measurements". IEEE Transactions on Automatic Control, 40(3), pp. 395–404, 1995.
[111] M. Alamir and L. A. Cavillo-Corona. "Further results on nonlinear receding-horizon observers". IEEE Transactions on Automatic Control, 47(7), pp. 1184–1188, 2002.
[112] C. V. Rao, J. B. Rawlings, and D. Q. Mayne. "Constrained state estimation for nonlinear discrete-time systems: stability and moving horizon approximations". IEEE Transactions on Automatic Control, 48(2), pp. 246–257, 2003.
[113] C. V. Rao and J. B. Rawlings. "Constrained process monitoring: A moving horizon approach". AIChE Journal, 48, pp. 97–109, 2002.
[114] B. Boulkroune, M. Darouach, and M. Zasadzinski. "Moving horizon state estimation for linear discrete-time singular systems". IET Control Theory and Applications, Vol. 4, no. 3, pp. 339–350, 2010.
[115] K. R. Muske and J. B. Rawlings. "Receding horizon recursive state estimation". Proc. IEEE American Control Conference, San Francisco, USA, pp. 900–904, 1993.
[116] F. Yang and R. W. Wilde. "Observers for linear systems with unknown inputs". IEEE Transactions on Automatic Control, 33(7), pp. 677–681, 1988.
[117] M. Darouach, M. Zasadzinski, and J. Y. Keller. "State estimation for discrete systems with unknown inputs using state estimation of singular systems". Proc. IEEE American Control Conference, pp. 3014–3015, 1992.
[118] A. Bemporad, D. Mignone, and M. Morari. "Moving horizon estimation for hybrid systems and fault detection". Proc. IEEE American Control Conference, San Diego, CA, USA, pp. 2471–2475, 1999.
[119] L. Pina and M. A. Botto. "Simultaneous state and input estimation of hybrid systems with unknown inputs". Automatica, 42(5), pp. 755–762, 2006.
[120] J. D. Hedengren. "Advanced Process Monitoring". Chapter submitted to Optimization and Analytics in the Oil and Gas Industry, Volume II: The Downstream, Springer-Verlag, 2012.
[121] R. K. Pearson. "Outliers in process modeling and identification". IEEE Transactions on Control Systems Technology, 10(1), pp. 55–63, 2002.
[122] T. Ortmaier, M. Groger, D. H. Boehm, V. Falk, and G. Hirzinger. "Motion estimation in beating heart surgery". IEEE Transactions on Biomedical Engineering, 52(10), pp. 1729–1740, 2005.
[123] J. Zhang, M. Zulkernine, and A. Haque. "Random-forests-based network intrusion detection systems". IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(5), pp. 649–659, 2008.
[124] T. Ortmaier, M. Groger, D. H. Boehm, V. Falk, and G. Hirzinger. "Motion estimation in beating heart surgery". IEEE Transactions on Biomedical Engineering, 52(10), pp. 1729–1740, 2005.
[125] K. Fallahi, C.-T. Cheng, and M. Fattouche. "Robust positioning systems in the presence of outliers under weak GPS signal conditions". IEEE Systems Journal, 6(3), pp. 401–413, 2012.
[126] S. Meng and L. Liu. "Enhanced monitoring-as-a-service for effective cloud management". IEEE Transactions on Computers, 62(9), pp. 1705–1720, 2013.
[127] H. Ferdowsi, S. Jagannathan, and M. Zawodniok. "An online outlier identification and removal scheme for improving fault detection performance". IEEE Transactions on Neural Networks and Learning Systems, 25(5), pp. 908–919, 2014.
[128] L. Fang and M. Zhi-zhong. "An online outlier detection method for process control time series". Chinese Control and Decision Conference (CCDC), Mianyang, pp. 3263–3267, 2011.
[129] R. D. Cook and S. Weisberg. "Residuals and Influence in Regression". Chapman and Hall, New York, 1982.
[130] P. J. Rousseeuw and B. C. van Zomeren. "Unmasking multivariate outliers and leverage points". Journal of the American Statistical Association, Vol. 85, pp. 623–651, 1990.
[131] J. Osborne. "Notes on the use of data transformations". Practical Assessment, Research & Evaluation, Vol. 8(6), 2002.
[132] J. W. Osborne and A. Overbay. "The power of outliers and why researchers should always check for them". Practical Assessment, Research & Evaluation, Vol. 9(6), 2004.
[133] J. W. Osborne. "Improving your data transformations: Applying the Box-Cox transformation". Practical Assessment, Research & Evaluation, Vol. 15(12), 2010.
[134] J. H. McDonald. "Handbook of Biological Statistics (3rd ed.)". Sparky House Publishing, Baltimore, Maryland, 2014.
[135] B. G. Tabachnick and L. S. Fidell. "Using Multivariate Statistics (5th ed.)". Allyn and Bacon, Boston, 2007.
[136] D. C. Howell. "Statistical Methods for Psychology (6th ed.)". Thomson Wadsworth, Belmont, CA, 2007.
[137] J. W. Tukey. "The comparative anatomy of transformations". Annals of Mathematical Statistics, Vol. 28, pp. 602–632, 1957.
[138] G. E. P. Box and D. R. Cox. "An analysis of transformations". Journal of the Royal Statistical Society, Vol. 26, pp. 211–234, 1964.
[139] R. M. Sakia. "The Box-Cox transformation technique: A review". The Statistician, Vol. 41, pp. 169–178, 1992.
[140] O. Renaud and M.-P. Victoria-Feser. "A robust coefficient of determination for regression". Journal of Statistical Planning and Inference, Vol. 140(7), pp. 1852–1862, 2010.
[141] W. Greene. "Econometric Analysis (3rd ed.)". Prentice Hall, 1997.
[142] R. Brown and P. Hwang. "Introduction to Random Signals and Applied Kalman Filtering (3rd ed.)". Wiley, 1997.
[143] B. Hofmann-Wellenhof, H. Lichtenegger, and J. Collins. "Global Positioning System: Theory and Practice (2nd ed.)". Springer-Verlag, 1993.
[144] Institute of Navigation. "Global Positioning System", Vols. I, II, III, and IV. The Institute of Navigation, 1980–86.
[145] J. C. Rambo. "Receiver Processing Software Design of the Rockwell International DoD Standard GPS Receivers". Proceedings of the 2nd International Technical Meeting of the Satellite Division of the Institute of Navigation, Colorado Springs, pp. 217–225, 1989.
[146] R. M. Kalafus, J. Vilcans, and N. Knable. "Differential Operation of NAVSTAR GPS". Navigation, Journal of the Institute of Navigation, 30(3), pp. 187–204, 1983.
[147] E. G. Blackwell. "Overview of Differential GPS Methods". Navigation, Journal of the Institute of Navigation, 32(2), pp. 114–125, 1985.
[148] X.-W. Chang and Y. Guo. "Huber's M-estimation in relative GPS positioning: computational aspects". McGill University, Canada.
[149] C. Ordonez, J. Martinez, J. R. Rodriguez-Perez, and A. Reyes. "Detection of Outliers in GPS Measurements by Using Functional-Data Analysis". Journal of Surveying Engineering, pp. 150–155, 2010.
[150] D. Wright, C. Stephens, and V. P. Rasmussen. "On Target Geospatial Technologies". USU Geospatial Extension Program, 2010.
[151] D. Cooksey. "Understanding the Global Positioning System (GPS)". Montana State University-Bozeman.
[152] J.-M. Zogg. "GPS Basics: Introduction to the system, Application overview". u-blox AG, 2002.
[153] Y. Bian. "GPS signal selective availability - modelling and simulation for FAA WAAS IVV". Proc. Position Location and Navigation Symposium, IEEE, pp. 515–522, 1996.
[154] Y. Huihai. "Modeling of selective availability for global positioning system". Journal of Systems Engineering and Electronics, 7(3), pp. 515–522, 1996.
[155] A. Franchois. "Determination of GPS positioning errors due to multi-path in civil aviation". Proceedings of the 2nd International Conference on Recent Advances in Space Technologies, IEEE, pp. 400–406, 2005.
[156] L. Liu. "Comparison of Average Performance of GPS Discriminators in Multipath". IEEE International Conference on Acoustics, Speech and Signal Processing, pp. III-1285–III-1288, 2007.
[157] K. Breivik, B. Forssell, C. Kee, P. Enge, and T. Walter. "Estimation of Multipath Error in GPS Pseudorange Measurements". Journal of Navigation, pp. 43–52, 2014.
[158] "Quasi-Zenith Satellite System (QZSS) Service", Satellite Positioning Overview, Japan. http://www.qzs.jp/en/
[159] T.-K. Yeh, C. Hwang, G. Xu, C.-S. Wang, and C.-C. Lee. "Determination of global positioning system (GPS) receiver clock errors: impact on positioning accuracy". Measurement Science and Technology, 20(7), 2009.
[160] F.-C. Chan. "Stochastic modeling of atomic receiver clock for high integrity GPS navigation". IEEE Transactions on Aerospace and Electronic Systems, 50(3), pp. 1749–1764, 2014.
[161] N. L. Knight and J. Wang. "A Comparison of Outlier Detection Procedures and Robust Estimation Methods in GPS Positioning". Journal of Geodesy, 66(4), pp. 699–709, 2009.
[162] E. Gökalp, O. Güngör, and Y. Boz. "Evaluation of Different Outlier Detection Methods for GPS Networks". Sensors, 8, pp. 7344–7358, 2008.
[163] Y. Zhang, F. Wu, and H. Isshiki. "A New Cascade Method for Detecting GPS Multiple Outliers Based on Total Residuals of Observation Equations". IEEE Position Location and Navigation Symposium (PLANS), pp. 208–215, 2012.
[164] E. D. Kaplan. "Understanding GPS: Principles and Applications". Artech House, 1996.
[165] A. Giremus, E. Grivel, and F. Castani. "Is H-infinity filtering relevant for correlated noises in GPS navigation?". IEEE Proceedings DSP, 8, pp. 1–6, 2009.
[166] P. Axelrad and R. G. Brown. "GPS navigation algorithms". In Global Positioning System: Theory and Applications, vol. 1, 1996.
[167] K. Fallahi, C.-T. Cheng, and M. Fattouche. "Robust Positioning Systems in the Presence of Outliers Under Weak GPS Signal Conditions". IEEE Systems Journal, 6(3), pp. 401–413, 2012.
[168] G. Pulford. "Analysis of a nonlinear least squares procedure used in global positioning systems". IEEE Transactions on Signal Processing, 58(9), pp. 4526–4534, 2010.
[169] T.-H. Chang, L.-S. Wang, and F.-R. Chang. "A solution to the ill-conditioned GPS positioning problem in an urban environment". IEEE Transactions on Intelligent Transportation Systems, 10(1), pp. 135–145, 2009.
[170] M. F. Abdel-Hafez. "The autocovariance least-squares technique for GPS measurement noise estimation". IEEE Transactions on Vehicular Technology, 59(2), pp. 574–588, 2010.
[171] Y. Wang. "Position Estimation using Extended Kalman Filter and RTS Smoother in a GPS Receiver". 5th International Congress on Image and Signal Processing (CISP), pp. 1718–1721, 2012.
[172] P. Misra and P. Enge. "Global Positioning System: Signals, Measurements, and Performance (2nd ed.)". Wiley, 2006.
[173] K. Borre and D. M. Akos. "A Software-Defined GPS and Galileo Receiver". Springer, Birkhauser Basel, Aug. 2006.
[174] R. Leandro and M. Santos. "Stochastic models for GPS positioning: An empirical approach". GPS World, February 2007.