Water Resources Management International Journal, Special issue on geocomputation in water resources and environment 16(6),pp. 469-487. 2002

UNCOVERING SPATIO-TEMPORAL PATTERNS IN ENVIRONMENTAL DATA

Monica Wachowicz
Wageningen UR, Centre for Geo-Information
Droevendaalsesteeg 3, PO BOX 47, 6700 AA Wageningen, The Netherlands
Phone: +31.317.47 4764 Fax: +31.317.47 4567
http://www.geo-informatie.nl/
E-mail: [email protected]

Abstract

The integration of data mining and geographic visualization techniques facilitates the identification and the interpretation of spatio-temporal patterns – a process recognized as knowledge construction. Knowledge construction is a dynamic process of manipulating "data" to find, relate, and interpret interesting patterns in large environmental data sets. Toward this end, an overview of the main methods associated with the expanding fields of Knowledge Discovery in Databases (KDD) and Geographic Visualization (GeoVis) is provided. The paper explains how different methods can be combined in order to design a knowledge construction process for the identification and interpretation of the space-time variability of both the composition and the structure of a pattern. Case studies, tools, and prototype implementations are described to illustrate how both KDD and GeoVis methods can be applied to uncovering spatio-temporal patterns. Finally, the specific underlying research issues are described, with particular emphasis on how these relate to the environmental sciences domain.

Keywords: knowledge construction process, data mining, geographic visualization

INTRODUCTION

Much of the environmental data generated today comes from Earth Observation Systems, monitoring efforts in endangered ecosystems, and ground observations. The availability of GIS and spatial statistics has facilitated the first step towards identifying and quantifying patterns in environmental data sets. As a result, there is a better understanding of the range of types of patterns than of the processes responsible for their generation. Few studies translate the knowledge of an environmental process to an explicit pattern context (see Wachowicz, 2000a for a recent review). In general, among empirically based work, there is a tendency to find highly quantitative studies on patterns and more qualitative studies describing their corresponding processes. In other words, the diversity of types of patterns (for example, types of landscapes) is better understood than the processes responsible for their generation (for example, fire, urbanization, and climate change).

One of the main reasons is the complexity of mapping a given pattern to a given process. Two independent processes may produce the same pattern. Two types of patterns may vary in terms of their spatial arrangement of units and variability through time. Therefore, it is often insufficient to identify a given pattern using a single composition (for example, NDVI values from a satellite image) or a single structure (size, shape, adjacency, and sinuosity). In fact, the identification and definition of the space-time variability of both the composition and the structure of a pattern will determine what form the pattern will take and therefore distinguish it from other patterns. This requires a good understanding of how patterns can be found using a variety of methods such as association, correlation, causality, partial periodicity, sequential, and emerging patterns.

Therefore, a knowledge construction process is proposed in this paper for creating a dynamic process of finding, relating, and interpreting interesting, meaningful, and unanticipated patterns in large environmental data sets. The goal is to develop a conceptualization of a knowledge construction process in which scientists achieve insight about spatio-temporal patterns that facilitates the understanding of geo-physical phenomena and their corresponding pattern-process relation in the real world. This knowledge construction process involves both interaction and iteration, through which scientists can achieve insight by manipulating large data sets using data mining techniques for tracking patterns, and geographic visualization techniques for steering the knowledge construction process as it unfolds, visualizing patterns on the fly.

Knowledge Discovery in Databases (KDD) has been defined as: "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad et al., 1996, p. 6).
It is a multi-step process in which "data mining" methods (methods through which patterns are extracted from data) play a central role. Geographic Visualization (GeoVis) has been defined as "the use of concrete visual representations – whether on paper or through computer displays or other media – to make spatial contexts and problems visible, so as to engage the most powerful of human information-processing abilities, those associated with vision" (MacEachren 1992, p. 101).

Both KDD and GeoVis can be characterized as complementary approaches for the design of a knowledge construction process, which aims at the identification and interpretation of spatio-temporal patterns in very large environmental data sets. The integration of KDD and GeoVis will support the development of new tools, which will provide a gateway to both human information processing abilities and the design of knowledge discovery software. Several KDD and GeoVis methods have recently emerged from the literature, and they differ in the conceptualizations developed, reflecting (in part) their separate developments in fields such as database systems, information visualization, machine learning, statistics, cartography, and artificial intelligence.

The next section elaborates upon these definitions of KDD and GeoVis and builds the case for their integration. The main methods developed in KDD and GeoVis are described and a knowledge construction process is proposed for identifying and interpreting spatio-temporal patterns in very large environmental data sets. Case studies, tools, and prototype implementations are described to illustrate how both KDD and GeoVis methods can be applied to uncovering spatio-temporal patterns. Finally, the specific underlying research issues are described, with particular emphasis on how these relate to the environmental sciences domain.

KDD AND GeoVis: COMPLEMENTARY APPROACHES

KDD and GeoVis have similar goals but differ in the extent to which they rely upon computational methods or human vision to manipulate data. In this section, the underlying principles and the key developments of the past decade in both fields are summarized. This review provides a base from which we then explore the commonality of goals and the potential for integration of KDD and GeoVis methods.

The term "knowledge discovery in databases" was coined in 1989 in an effort to distinguish between the application of particular algorithms designed to extract patterns from data (that is, data mining) and the overall process of extracting knowledge from these patterns. The distinction is a critical one. It is based upon an understanding that blind application of data mining methods is unlikely to result in meaningful knowledge (and can easily lead to the opposite). Without careful preprocessing of data, specification of data representations (selection of models), and subsequent interpretation, all by domain experts, data mining has been called "a dangerous activity," even by its proponents (Fayyad et al., 1996, p. 4). The definition cited above for KDD, focusing on a process of identifying patterns that are not only valid and useful but also understandable and novel, has been generally accepted (see Frawley et al., 1991; Brachman and Anand, 1996).

Up to this time, KDD research and tools have primarily been developed for a wide variety of applications such as marketing, sales, telecommunications, epidemiology, and investment trading (Koperski et al., 1999; Roddick and Spiliopoulou, 1999). Only a limited number of data mining techniques have focused on spatial data as a unique source of structure for finding patterns (e.g. Koperski and Han, 1995; Ester et al., 1998; Knorr and Ng, 1996; Wachowicz, 2000b).
Although various authors have proposed somewhat different delineations of the process, the data mining definition introduced during the last NASA workshop on the Issues in the Application of Data Mining to Scientific Data (Behnke et al., 1999) is particularly suited to the application of KDD in environmental sciences. The definition states that data mining involves the "science, tools, environment, and facilities to scale up and/or automate scientific analysis of large-scale data streams, consisting of:

- Exploration of anomalies in geophysical data, where the detection of an anomaly may initiate an "alert" requiring further human-in-the-loop analysis (e.g. using statistical or other methods);
- Scaling up of current analysis techniques that detect known phenomena such that large-scale data product streams may be automatically analyzed;

And characterized by:

- Critical partnerships between physical scientists, computer scientists, and statisticians for the effective integration of analysis processes, scientific algorithms, statistical approaches, and enabling computer architectures." (Behnke et al., 1999, p. 2).

Most data mining techniques perform tasks that depend on the intended outcome of the overall knowledge construction process. Some examples are prediction, classification, clustering, and association. Different methods are used to perform these tasks, such as statistical association, case-based reasoning, neural networks, decision trees, rule induction, Bayesian belief networks, genetic algorithms, and fuzzy and rough sets theory (see Mitchell, 1997 for details of the workings of these methods).

In contrast, the first process-oriented perspective on the GeoVis research field was DiBiase's (1990) characterization of visualization as a 4-stage process that facilitates geographic information science: exploration, confirmation, synthesis, and presentation. In this case, the goal of a map is to stimulate a hypothesis rather than to portray a message. Knowledge is to some extent constructed by processing and understanding a visual representation. In a GeoVis process, visual representations are structures that have an effect on how we “see” and “interact” with data using both vision and information processing cognition. Visual representations allow us to derive meaning from visual displays and interrelate them with different kinds of knowledge, whether in propositional form (understanding by means of abstract concepts of something), analogical form (experiencing imagery as an abstract thought), or procedural form (understanding how to do something) (see Rumelhart and Norman, 1985 for a description of kinds of knowledge).

Different types of visual representations can construct different kinds of knowledge at different stages of a GeoVis process. According to MacEachren (1995), whenever a visual representation depicts a dynamic event (particularly with dynamic and animated symbols, glyphs, and icons), or is used dynamically as a decision making tool, procedural and analogical forms of constructing knowledge are likely to play a role. Visualization has been used as the vehicle for exploration in the work of Monmonier (1991), MacDougall (1992), Tang (1992), Fotheringham and Charlton (1994), Gahegan (1996), Dykes (1997), Wise et al. (1998), Kraak and MacEachren (1999) and many others.
Recently, visualization tools designed specifically to aid the early data mining activities have also been proposed and developed (e.g. Lee and Ong, 1996; Keim and Kriegel, 1994). Useful overviews of visual data mining are provided by Wong (1999) and Hinneburg et al. (1999). So far, limited research has been carried out on the use of visualization as an enabling technology, and possibly as the interface, for a complete knowledge construction process, although some visually supported methods have been proposed (MacEachren et al., 1999; Hao et al., 1999; Inselberg and Avidan, 1999).

Visually based knowledge construction can offer two distinct advantages. First, it creates an opportunity for humans and machines to work together in constructing and evaluating the patterns and relationships required by the analysis, making the best use of the abilities of each. Second, it can provide an environment within which scientists can collaborate (sometimes remotely) on complex modeling and analysis activities (Brodlie et al., 1998). In both ways, visualization plays an important role in “process-pattern tracking” (visual representations that display key aspects of a process as it unfolds) and “process steering” (interactive environments that provide controlling parameters of a knowledge construction process to shape and modify its behavior).

THE KNOWLEDGE CONSTRUCTION PROCESS

KDD and GeoVis share perspectives related to the primary goal of identifying, associating, and understanding interesting and unanticipated patterns in data that can be used to infer the location, identity, and relationships among geophysical phenomena. In both cases, knowledge construction is a complex process, and researchers have recognized the important role of different types of inference according to what we want to infer (that is, the location, changes, attributes, identity, or relationships among entities).

Inference can be interpreted in a variety of logical ways (Harman 1965). However, only two conditions are present in every logical account of explanation. The first is that background theory does not explain the observations, which may represent a novel phenomenon or may actually puzzle the theory. The second is that an explanation is needed to account for the observations (Hempel 1965). These explanations may have various forms: facts, rules, or even theories. Explanatory reasoning is triggered by a surprising phenomenon in need of explanation. In environmental sciences, new discoveries are not just the result of finding a novel geophysical phenomenon, but also the result of finding an anomaly, a pattern that puzzles the theory. On the basis of these conditions, a taxonomy of different modes of reasoning can be defined, comprising the abductive, inductive, and deductive modes of reasoning (Figure 1).

[Figure 1 diagram: the knowledge construction process plotted against time, running from hypothesis generation to hypothesis presentation. Abduction and induction are data driven; deduction is model driven.]

Figure 1 – The three modes of reasoning involved in a knowledge construction process. The arrows show the interrelationships of these types of reasoning over time. In the early stages, the knowledge construction process is largely data driven. Only after the data is successfully explored can models and theories be applied to the data.

In a knowledge construction process, the abductive mode of reasoning involves the adoption of existing or non-existing hypotheses as an explanation of a given observation, in which the explanation is usually represented as propositional clauses. Abduction is not restricted to using existing patterns; it is free to create new patterns that help to explain the data. It is the most flexible inference mode because it requires neither the target nor the hypothesis to be pre-defined. It is therefore highly suited to the initial exploratory phase, especially if little is known concerning the structures in the data. Since computationally based approaches to knowledge construction do not address the complexity of the background knowledge that human experts have, it is not surprising that abduction is missing from most of the existing techniques and tools for KDD and GeoVis. Perhaps the single most important factor that sets the knowledge construction process proposed in this paper apart from other processes is the connection to the user (expert) as a rich source of interpretation for the uncovered pattern, thus opening the way for abductive reasoning to take place.

Most of the automated methods for KDD have been developed to learn from examples that are presented or selected (induction), or to locate pre-defined patterns (deduction). Induction involves the classification or generalization of examples for explaining tendencies in observations, in which the hypotheses become predicate clauses. It is the most reliable means of knowledge construction owing to the availability of many robust and automated algorithms in data mining. On the other hand, deduction can only be applied when objects, categories, or relationships have already been defined. It usually forms the basis of most inferential analysis and modeling because it can be verified straightforwardly. For example, expert systems tend to be deductive whereas decision trees use inductive learning (Simoudis et al., 1996).

The Abductive Mode of Reasoning

In this case, a knowledge construction process involves the search for common attributes among a set of objects, and then the arrangement of these objects into classes, clusters, or patterns according to a meaningful partitioning criterion, model, or rule. An object can be a physical feature (stream-flow measured at irregularly spaced gauging stations), an abstract feature (precipitation deficit – the deviations from climate means), or an event (climate observations over a period of time). The focus is on exploring statistical approaches (probability distributions, hypothesis generation, model estimation and scoring) for performing the task of extracting classes, clusters, or patterns from a data set (Hosking et al., 1997). Although statistics does not provide all the answers, statistical approaches offer a useful and practical framework for supporting the abductive mode of reasoning (Glymour et al., 1997). One example is the use of unsupervised methods to uncover unknown spatio-temporal patterns in large environmental data sets.

The best-known effort is AutoClass (Stutz and Cheeseman, 1994), a public domain software package that provides unsupervised classification based upon a Bayesian belief network (causal theory). A Bayesian belief network consists of a graphical structure augmented by a set of probabilities. The graphical structure is a directed acyclic graph in which nodes represent random variables (domain variables) and edges represent the existence of direct causal influence between linked variables. A conditional probability is associated with the group of edges directed toward each node. Prior probabilities are assigned to source nodes (i.e. any node without any incoming edge). The belief network is represented as a triple (V, E, P), where V is the set of vertices (variables), E is the set of edges, and P is the set of probabilities.
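The triple (V, E, P) described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three Boolean variables (front, rain, cold), the edges, and every probability value are hypothetical and are not taken from AutoClass or from the case study. The sketch shows how the joint probability of a full assignment factorizes over the nodes given their parents.

```python
from itertools import product

# Hypothetical belief network (V, E, P); all names and numbers are illustrative.
V = ["front", "rain", "cold"]                    # vertices (Boolean variables)
E = [("front", "rain"), ("front", "cold")]       # directed edges (cause, effect)
P = {
    "front": {(): 0.3},                          # prior of a source node
    "rain": {(True,): 0.8, (False,): 0.1},       # P(rain=True | front)
    "cold": {(True,): 0.7, (False,): 0.2},       # P(cold=True | front)
}

def parents(node):
    """Return the parents of a node, in the order the edges are listed."""
    return [src for src, dst in E if dst == node]

def node_probability(node, assignment):
    """P(node = assignment[node] | values of its parents)."""
    key = tuple(assignment[p] for p in parents(node))
    p_true = P[node][key]
    return p_true if assignment[node] else 1.0 - p_true

def joint(assignment):
    """Joint probability of a full assignment: product over all nodes."""
    result = 1.0
    for node in V:
        result *= node_probability(node, assignment)
    return result

def marginal(node):
    """P(node = True), obtained by summing the joint over all assignments."""
    total = 0.0
    for values in product([True, False], repeat=len(V)):
        assignment = dict(zip(V, values))
        if assignment[node]:
            total += joint(assignment)
    return total
```

For instance, `marginal("rain")` sums the joint over every assignment in which rain is true. Brute-force enumeration is only workable for a handful of variables; it is used here because it keeps the factorization visible.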
AutoClass includes four models, which are independent and covariant versions of the multinomial model for discrete attributes and the Gaussian normal model for real-valued attributes (with minor variations). To apply a model, two kinds of parameters are specified. First, a set of discrete parameters (T), describing the general form of the model, is normally used to specify the functional form for the likelihood function, that is, a function giving the probability of the data conditioned on the hypothesized model and parameters.

Second, free variables (V) constitute the remaining continuous parameters within a model, such as the magnitude of the correlation or the relative sizes of the classes. A likelihood function, defined as L(E | V, T), embodies an agent's theory of how likely it would be to see each possible evidence combination E in each possible model H (an agent combines the posterior beliefs with prior expectations based on the evidence; for details, see Berger, 1985). E denotes some evidence that is known (for example, the daily observations of precipitation and temperature at a weather station). In summary, E will consist of a set of cases, which can include “unknown” values. H denotes a hypothesis specifying that the real world is in some particular state.

Figure 2 illustrates the potential of the integrated GeoVis-KDD approach to knowledge construction with an application of AutoClass to a sample gridded regional climate data set for northern Mexico and the southern U.S. These data were output by the Goddard Space Flight Center (GSFC) 4D assimilation scheme, based on both observational data and a Global Change Model (GCM), to produce daily gridded data at a resolution of 2° latitude and 2.5° longitude, covering an area bounded by latitudes 20°-43° N and longitudes 110°-90° W. The winter months (November through March) were the focus because most surface cold fronts that affect the study area occur during this season. Two dynamically linked visual representations were used to support the interpretation and evaluation of the knowledge construction process: space-time cubes and parallel coordinate plots (more detailed information is found in MacEachren et al., 1999).
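Returning to the likelihood function L(E | V, T): the sketch below evaluates a log-likelihood for a mixture of classes with independent Gaussian attributes. It is a simplified illustration, not the AutoClass implementation. The assumptions are that T fixes the number of classes and the independent Gaussian form, that V holds the class weights, means, and standard deviations, and that "unknown" values (written as `None`) are simply skipped. All names and numbers are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def log_likelihood(cases, classes):
    """Log of L(E | V, T) for an independent Gaussian mixture.

    cases   -- the evidence E: tuples of attribute values, None = unknown
    classes -- the free parameters V: one dict per class with
               "weight", "means", and "stds"
    """
    total = 0.0
    for case in cases:
        case_probability = 0.0
        for cls in classes:
            p = cls["weight"]
            for x, mean, std in zip(case, cls["means"], cls["stds"]):
                if x is not None:          # assumption: unknown values are skipped
                    p *= gaussian_pdf(x, mean, std)
            case_probability += p
        total += math.log(case_probability)
    return total

# Hypothetical daily cases: (precipitation in mm, temperature in deg C).
cases = [(1.0, 20.0), (1.2, None), (9.0, 5.0), (8.8, 5.5)]

two_classes = [
    {"weight": 0.5, "means": (1.0, 20.0), "stds": (1.0, 2.0)},
    {"weight": 0.5, "means": (9.0, 5.0), "stds": (1.0, 2.0)},
]
one_class = [{"weight": 1.0, "means": (5.0, 12.0), "stds": (1.0, 2.0)}]
```

Comparing `log_likelihood(cases, two_classes)` with `log_likelihood(cases, one_class)` shows the two-class hypothesis explaining this toy evidence better. A full Bayesian comparison would also weigh each hypothesis by its prior, which this sketch omits.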

Figure 2 – The Parallel Coordinate Plot was implemented using the Tool Command Language (Tcl) and X11 Toolkit (Tk) developed at UC Berkeley (http://www.sun.com/960710/cover/). The spacetime cube was implemented using the Data Explorer framework that gives users the ability to apply visualization and analysis techniques to their data (http://www.research.ibm.com/dx/). Tcl/Tk presents the PCP display and defines the execution context and linkage between the PCP and the spacetime cube.

In Figure 2, the Parallel Coordinate Plot depicts each case from the target data set used in data mining. The resulting pattern of lines represents relationships among the classes generated in a data mining run. The space-time cube depicts the spatialization in which non-spatial data dimensions are mapped to the three dimensions of a display space. In this case, they are humidity, precipitation, and time. Color hue is used to distinguish the 7 classes resulting from a data mining run. This integrated set of visual representations has provided the scientists with hundreds of perspectives on the data. When measures of class strength were examined, we found that the classes with the highest strength were ones associated with particular years, that is, classes that exhibit temporal patterns. Typical of these classes were ones whose winters revealed El Niño (very wet, rainy winters) and La Niña (very dry winters) years. On the other hand, classes with intermediate strength were characterized by cyclic patterns, with spatial dominance. For example, classes in this group have shown the location of very wet and very dry areas in the region of the case study over the eight years. Finally, classes with a very low strength were (not unexpectedly) difficult to interpret.

The Inductive Mode of Reasoning

In this case, a knowledge construction process is based on learning as the reduction of uncertainty in knowledge. Several techniques have been developed in the field of Machine Learning, such as rule induction, neural networks, genetic algorithms, case-based learning, and analytical learning (theorem proving). Many techniques partition the target data set into as many regions as there are classes by using some kind of discriminant function, for example, a posterior probability or linear discriminant functions. These techniques provide a data fit, in the sense that the main goal is to generate explicit knowledge describing the data, often called concept hierarchies and rules.

Concept hierarchies are generalization hierarchies specified according to the relations among attributes or by set grouping of attributes using aggregate functions. A concept hierarchy can be specified explicitly by a domain expert or can be generated automatically. Hierarchies for numerical data or ordered attributes can be generated automatically based on an analysis of the data distribution of the corresponding attribute. This requires background knowledge to define the form of the concept hierarchy, which must be given in the mining query. Two methods have been suggested in the KDD literature thus far: the Data Cube Approach (Harinarayan et al., 1996; Holsheimer et al., 1996) and the Attribute-Oriented Induction Approach (Han and Fu, 1996; Han, 1995). Several authors have investigated attribute-oriented induction methods for extracting generalization hierarchies for spatial data (Wang et al., 1997; Han et al., 1997). Data Surveyor (Holsheimer et al., 1996), DBMiner (Han et al., 1996), and its geographic extension GeoMiner (Han et al., 1997) are some examples of interactive tools that use extended database query language operations to select data and generate concept hierarchies directly within a database.
They have also implemented roll-up (progressive generalization) and drill-down (progressive specialization) operations to navigate within different levels of a concept hierarchy. These operations allow users to examine the finer levels of a concept hierarchy only when it is necessary.
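As a rough illustration of roll-up and drill-down, the sketch below navigates a small concept hierarchy stored as a child-to-parent mapping. The land cover hierarchy and the sample values are hypothetical. The function names mirror the operations named above but are not the API of DBMiner, GeoMiner, or Data Surveyor.

```python
from collections import Counter

# Hypothetical concept hierarchy for land cover, stored as child -> parent.
PARENT = {
    "oak": "forest", "pine": "forest",
    "wheat": "cropland", "maize": "cropland",
    "forest": "vegetation", "cropland": "vegetation",
}

def generalize(value, levels):
    """Climb the hierarchy by the given number of levels (stops at the root)."""
    for _ in range(levels):
        value = PARENT.get(value, value)
    return value

def is_under(value, concept):
    """True if the concept is a proper ancestor of the value."""
    while value in PARENT:
        value = PARENT[value]
        if value == concept:
            return True
    return False

def roll_up(values, levels=1):
    """Progressive generalization: count values at a coarser level."""
    return Counter(generalize(v, levels) for v in values)

def drill_down(values, concept):
    """Progressive specialization: count the finer values below one concept."""
    return Counter(v for v in values if is_under(v, concept))

observations = ["oak", "oak", "pine", "wheat", "maize", "maize", "maize"]
```

Here `roll_up(observations)` summarizes the seven observations as forest and cropland counts, and `drill_down(observations, "cropland")` exposes the finer level only for the branch the user chooses to inspect.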

Summarized tables, charts, and maps are also employed to create "snapshots" of concept hierarchies. The inductive mode of reasoning has also been explored for the extraction of different kinds of rules, including characteristic rules, discriminant rules, cluster description rules, and multi-level association rules (see Murthy 2001 for a multidisciplinary survey of inductive rules). A rule is usually in the form "X -> Y (s%, c%)", where X and Y are sets of object predicates, s% is the support of the rule (the probability that X and Y hold together among all possible cases), and c% is the confidence of the rule (the conditional probability that Y is true under the condition of X). Individual rules can be complex and hard to interpret subjectively.

The main algorithms used in KDD focus on defining rules to extract representative clusters from a target data set. Some examples are CLARANS - Clustering Large Applications based upon Randomized Search (Ng and Han, 1994) and BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies (Zhang et al., 1996). Both algorithms assume that the user specifies the rule to be mined from a target data set. Chen et al. (1996) report experiments confirming that they can be used to cluster reasonably large data sets. Based on CLARANS, Ester et al. (1998) developed two spatial data mining algorithms: SD-CLARANS (spatial dominant algorithm) and NSD-CLARANS (non-spatial dominant algorithm). The KDD process starts by collecting the relevant data based on a query that describes the relations between relevant classes of objects. For example, the user may describe parks by representing the relation between parks and railways. Spatial and non-spatial concept hierarchies are used to speed up the extraction of the data at the level of generalized spatial and non-spatial relations.
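The support and confidence of a rule "X -> Y (s%, c%)", as defined above, can be computed directly from a set of cases. The sketch below is a minimal illustration. Each case is assumed to be the set of predicates that hold for one object, and the predicate names and the five cases are hypothetical.

```python
def support(cases, predicates):
    """Fraction of cases in which all the given predicates hold together."""
    predicates = set(predicates)
    return sum(1 for case in cases if predicates <= case) / len(cases)

def confidence(cases, x, y):
    """Conditional probability that Y holds given that X holds."""
    support_x = support(cases, x)
    if support_x == 0.0:
        return 0.0                 # assumption: an unsupported rule gets 0
    return support(cases, set(x) | set(y)) / support_x

# Hypothetical cases: the predicates that hold for each of five objects.
cases = [
    {"is_park", "near_railway"},
    {"is_park", "near_railway", "near_water"},
    {"is_park"},
    {"near_railway"},
    {"is_park", "near_water"},
]

x, y = {"is_park"}, {"near_railway"}
rule_support = support(cases, x | y)        # s: X and Y hold together
rule_confidence = confidence(cases, x, y)   # c: Y holds given X
```

With these five cases the rule is_park -> near_railway has a support of 40% and a confidence of 50%: two of the five objects satisfy both predicates, and two of the four parks are near a railway.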
Another well-accepted algorithm is the Apriori Algorithm (Agrawal et al., 1993), together with its extensions, the DHP (Park et al., 1995) and PARTITION (Savasere et al., 1995) algorithms. Koperski and Han (1995) have extended the Apriori Algorithm to spatial data mining and, based upon it, have proposed multi-level spatial association rules in GIS. In this case, spatial association rules reflect the structure of spatial objects and spatial/spatial or spatial/nonspatial relationships that contain predicates such as adjacent_to, near_by, inside, close_to, intersecting, etc.

GeoVis offers a means of integrating these algorithms with clustering visualization methods by assisting the user in understanding the concept hierarchies and rules. Scientists play an important role in examining the space-time characteristics of a concept hierarchy in the target data set as well as in validating the corresponding rules to which the data in each attribute have been generalized (Wachowicz 2001). Therefore, interactive tools are needed to dynamically manipulate concept hierarchies and mining rules, their hierarchical levels of abstraction, and especially their space-time characteristics. Some examples are scaling (moving and reshaping a display interactively), subsetting (drawing subsamples randomly), and space-time cubes (spaces spanned by three variables at a time).

Unfortunately, very few advances in KDD and GeoVis have been made in the environmental sciences. The problem is basically related to the difficulties of selecting a single spatio-temporal concept for a geophysical phenomenon. Slortz et al. (1995) illustrate the problem by showing that there is no single objective definition in the literature for the concept of a cyclone. "Several working definitions are based upon the detection of a threshold level of vorticity in quantities such as atmospheric pressure at sea level. Others are based upon the determination of local minima of the sea level pressure." (Slortz et al., 1995, p. 302). Their attempt at geophysical phenomenon extraction is demonstrated in the CONQUEST (Content-based Querying in Space and Time) system. In CONQUEST, the visualization tools support static plotting (2D and 3D graphs) and temporal animation of plots of sea level pressure fields overlaid with cyclone tracks. The visualization manager tool was implemented in IDL. The graphical user interface is highly interactive and gives the user the ability to query the data and extract the geo-physical phenomenon from the data repository. It was built using the Tool Command Language (Tcl) and X11 Toolkit (Tk) developed at UC Berkeley. The results of a prototype implementation have outlined the need for interactive and extensive query processing systems that allow scientists to query images by content, using color, texture, and shape, to extract the important features present in geophysical data sets and previously unsuspected patterns of interest.

Figure 3 illustrates the use of a decision tree inducer to infer a classifier for a random point data set, resulting in a branching format decision tree visualization. The data set contained information about in-situ observations (abandoned and non-abandoned areas), topographic aspects (slope, altitude, aspect), and land surface parameters (land cover type, spectral signatures, NDVI). The focus was on gaining insight about rules that can characterize patterns of land abandonment in the region of Southern Spain. Inductive decision trees were used for the classification task mainly because they are strictly non-parametric, free from distribution assumptions, able to deal with non-linear relations, and capable of handling numerical and categorical inputs.
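To make the induction step more concrete, the sketch below shows one common way a decision tree inducer chooses the variable to branch on: it picks the attribute with the highest information gain. This is a generic illustration and not the algorithm used by MineSet in the case study. The attribute names and the four sample points are hypothetical, and only categorical attributes are handled.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of the target after splitting on an attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

def best_split(rows, attributes, target):
    """Attribute whose split gives the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical sample points; values are illustrative only.
points = [
    {"land_cover": "shrub", "slope": "steep", "abandoned": True},
    {"land_cover": "shrub", "slope": "gentle", "abandoned": True},
    {"land_cover": "crop", "slope": "steep", "abandoned": False},
    {"land_cover": "crop", "slope": "gentle", "abandoned": False},
]

root_attribute = best_split(points, ["land_cover", "slope"], "abandoned")
```

In this toy sample, land cover separates the two classes perfectly while slope carries no information, so the root of the tree would branch on land cover. A full inducer would apply the same choice recursively to each subset and would need thresholds to handle numerical inputs such as NDVI.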

Figure 3 - The Rule Visualiser tool of MineSet was used to graphically display the results from the classification rule algorithm. MineSet is a scalable client/server tool set for extracting information from databases, mining the data with analytical algorithms, and visualizing patterns and connections through intuitive, interactive, visual displays (http://www.sgi.com/software/mineset.html).
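The induction step described above can be illustrated with a minimal sketch. It assumes an entropy-based (information gain) binary splitting criterion, which is a common choice but not necessarily the inducer used in MineSet, and it uses a handful of invented records with hypothetical attribute names (`land_cover`, `slope`, `ndvi`, `status`) rather than the study data. It shows how a single tree can mix numerical thresholds and categorical tests without any distributional assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def candidate_tests(rows, attr):
    """Numeric attributes get midpoint thresholds; categorical ones equality tests."""
    values = sorted({row[attr] for row in rows})
    if isinstance(values[0], (int, float)):
        return [("<=", (a + b) / 2) for a, b in zip(values, values[1:])]
    return [("==", v) for v in values]


def passes(row, attr, op, value):
    return row[attr] <= value if op == "<=" else row[attr] == value


def best_split(rows, attrs, target):
    """Return the binary test with the highest information gain, or None."""
    base = entropy([row[target] for row in rows])
    best = None
    for attr in attrs:
        for op, value in candidate_tests(rows, attr):
            yes = [r for r in rows if passes(r, attr, op, value)]
            no = [r for r in rows if not passes(r, attr, op, value)]
            if not yes or not no:
                continue
            remainder = (
                len(yes) * entropy([r[target] for r in yes])
                + len(no) * entropy([r[target] for r in no])
            ) / len(rows)
            gain = base - remainder
            if best is None or gain > best[0]:
                best = (gain, attr, op, value, yes, no)
    return best


def build_tree(rows, attrs, target):
    """Grow the tree recursively until nodes are pure or no test helps."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority
    split = best_split(rows, attrs, target)
    if split is None or split[0] <= 0:
        return majority
    _, attr, op, value, yes, no = split
    return {
        "test": (attr, op, value),
        "yes": build_tree(yes, attrs, target),
        "no": build_tree(no, attrs, target),
    }


def classify(tree, row):
    """Follow the tests from the root down to a leaf label."""
    while isinstance(tree, dict):
        attr, op, value = tree["test"]
        tree = tree["yes"] if passes(row, attr, op, value) else tree["no"]
    return tree


# Invented toy records; attribute names and values are hypothetical.
records = [
    {"land_cover": "shrub", "slope": 25.0, "ndvi": 0.30, "status": "abandoned"},
    {"land_cover": "shrub", "slope": 30.0, "ndvi": 0.35, "status": "abandoned"},
    {"land_cover": "crop", "slope": 5.0, "ndvi": 0.60, "status": "non-abandoned"},
    {"land_cover": "crop", "slope": 8.0, "ndvi": 0.55, "status": "non-abandoned"},
    {"land_cover": "crop", "slope": 28.0, "ndvi": 0.25, "status": "abandoned"},
    {"land_cover": "orchard", "slope": 10.0, "ndvi": 0.50, "status": "non-abandoned"},
]
tree = build_tree(records, ["land_cover", "slope", "ndvi"], "status")
print(tree)
print(classify(tree, {"land_cover": "crop", "slope": 27.0, "ndvi": 0.30}))
```

Each internal node of the resulting dictionary corresponds to one readable rule, which is the property that makes such trees convenient to display in a rule visualization tool.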


The decision tree was visualised as a graphical interface that displays data as a three-dimensional landscape. It presented the data hierarchically in the form of a tree, where each level of the tree branches on the values of a different variable, for example, land cover type. Each node in the tree shows a chart representing all the data in the subtree below it. The chart is composed of a base block with height and colour depending on the data attributes. As shown in Figure 3, blue represents the data that characterise non-abandoned areas, while red represents land abandonment. Each node has a base with bars and disks whose number, label, height, and colour display the variables being used. The lines connecting nodes, called edges, show the relationship of one set of variables to its subsets. Users can drill up to get an overview, or drill down to see a detail, while maintaining the context information. An overview is also available that marks the current location of the user with a red cross in relation to the whole concept hierarchy. This interactive set of visual representations has provided the scientists with a set of functions that give them greater insight into the nature of the target data set.

Analysis of the rules has shown the inappropriateness of the data set for the identification and characterization of land abandonment in the region. The patterns extracted were neither relevant to the problem, nor insensitive to small changes in the data, nor invariant to scaling, rotation, and translation. Land cover data are often obtained from satellite images, a format that is known to pose serious challenges in the extraction of patterns. Further, problems in the knowledge construction process were such that the class of interest (abandonment and non-abandonment) occurred with low probability, making random sampling inapplicable and mining induction techniques ineffective.

The Deductive Mode of Reasoning

This is the most obvious form of KDD and GeoVis integration, useful at least when both are applied to facilitate scientific understanding of the results of a knowledge construction process. In most cases, visualization is a tool to facilitate the interpretation and evaluation of the data mining results (Simoudis et al., 1996). The KDD literature contains frequent mention of the importance of visualization (Brachman and Anand, 1996; Uthurusamy, 1996). The premise is that if patterns can be uncovered, they might offer significant insight into complex domains that might result in commercial advantage, improved decision making, or deeper scientific insight. The deductive mode of reasoning in knowledge construction should always be considered a human-centered process, in which users can dynamically interact with the system and take their analysis decisions at this last stage of a knowledge construction process, mainly because we are always interested in the accurate description of data sets, the exploration of patterns and relationships in such data, and the explanation of such patterns and relationships.

Figure 4 illustrates the patterns found in the classification of 15 years of NDVI data from the Landsat TM system, acquired yearly from 1985 to 1999, for an area located in southern Minas Gerais, Brazil. Inductive decision trees were used for the classification task. The study area was characterized by a complex and fragmented land cover pattern with intense human activities. The main land uses are perennial crops such as coffee, eucalypt, and pasture, enclosing remnants of semideciduous Atlantic forest and savanna-related formations. It was assumed that natural forests have been present in the region for a long period of time, and that changes have occurred with a relatively constant temporal profile in relation to eucalypt and coffee plantations. The results have been visualized using space-time cubes, showing that the patterns of different years of NDVI had a major importance for class discrimination within a decade of measurements (Figure 4). Color hue is used to distinguish the six classes resulting from the knowledge construction process. Long time series (for example, years or decades) can be very effective for class discrimination if the goal is to compare natural with managed land cover types, especially since the latter normally exhibit strong dynamics. Another reason for the different patterns might have been the pronounced spectral overlap of forests with perennial crops, which prevents the efficient use of spectral features for class separation.


Figure 4 – The space-time cube of MineSet was used to graphically display the mining results from the classification of 15 years of NDVI data. (a) shows the classes obtained from the NDVI data from 1985, 1986, and 1987. (b) shows the same classes, now obtained from the NDVI data from 1995, 1996, and 1997. Color hue is used to distinguish the classes.
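To make the idea of a space-time cube more concrete, the sketch below assumes that each cube axis holds the NDVI value of one year, so that every pixel becomes a point in a three-dimensional space and class membership would drive the colour hue. The standard definition NDVI = (NIR - Red) / (NIR + Red) is included for reference. The pixel identifiers, NDVI values, class labels, and the function name `cube_points` are invented for illustration and are not taken from the study or from MineSet.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from near-infrared and red reflectance."""
    return (nir - red) / (nir + red)


def cube_points(ndvi_by_pixel, class_by_pixel, years):
    """Group pixels by class, giving each pixel 3D coordinates from three chosen years."""
    if len(years) != 3:
        raise ValueError("a cube is spanned by exactly three variables")
    points = {}
    for pixel, yearly in ndvi_by_pixel.items():
        coords = tuple(yearly[year] for year in years)
        points.setdefault(class_by_pixel[pixel], []).append(coords)
    return points


# Hypothetical yearly NDVI values per pixel and their assigned classes.
ndvi_by_pixel = {
    "p1": {1985: 0.80, 1986: 0.82, 1987: 0.81},
    "p2": {1985: 0.45, 1986: 0.60, 1987: 0.30},
    "p3": {1985: 0.78, 1986: 0.80, 1987: 0.79},
}
class_by_pixel = {"p1": "forest", "p2": "coffee", "p3": "forest"}

cube = cube_points(ndvi_by_pixel, class_by_pixel, (1985, 1986, 1987))
for label, coords in cube.items():
    print(label, coords)
```

In this toy case the two forest pixels lie close together near the cube diagonal (a stable temporal profile), while the managed pixel lies away from it, which is the kind of separation the cube is meant to expose visually.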

CONCLUSIONS

The objective of this paper has been to make the case for the integration of KDD and GeoVis methods, emphasizing knowledge construction as a dynamic process of manipulating data to find interesting patterns in environmental data sets. Case studies, tools, and prototype implementations were described to illustrate how this knowledge construction process could be understood and applied in the environmental sciences. Machine learning, statistics, databases, and geographic visualization have been identified as important technological fields for the overall design and implementation of knowledge construction. The combination of several methods drawn from each of these fields is needed to make significant progress on the fundamental issues associated with uncovering spatio-temporal patterns in environmental data. As a result, knowledge construction can only achieve its objectives through the formation of coherent interdisciplinary teams involving domain experts, system designers, and users.


At the preparation stage, data mining efforts have been useful in finding holes or errors in environmental data sets. Visual displays of the parameter settings for the transformations that occur during data preparation facilitate user understanding of the problem domain by relating the data back to the world from which they were collected. However, preparing environmental data for mining has been an extremely time-consuming process, primarily carried out by hand since it has been very hard to automate. At the mining stage, nothing can be discovered that is beyond the limits of the data itself. In this paper, the integration of KDD and GeoVis methods has proven to be successful for feature extraction, correlation analysis, anomaly detection, pattern recognition, and filtering. However, the methods implemented were often not spatially aware, and when they were, they used very simple models of spatio-temporal objects and relationships; for instance, snapshots of point objects and Euclidean distances over time. Other complex spatio-temporal objects (for example, moving points, lines, and polygons) and their respective relationships (for example, direction, connectivity, and non-Euclidean distances) still need to be integrated into a knowledge construction process. The "miner" needs to take the pragmatic view of regarding an object as a collection of features about which measurements can be taken. It is the spatio-temporal patterns of features that are taken as the defining characteristics of objects. The extraction of interesting spatio-temporal patterns from large, multidimensional, and complex environmental data sets is a critical issue that must be addressed first, prior to making a model of these data sets. When data are properly prepared and mined, the quality of the models produced will depend mostly on the content of the data rather than on the ability of the modeler. But often today, instead of adequate data preparation and mining, time-consuming models are built and rebuilt in an effort to represent the contents of the data. Modeling and remodeling are not the most effective way to discover what is enfolded in an environmental data set.
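The contrast drawn above between the simple model (snapshots of point objects compared by Euclidean distance) and a non-Euclidean relationship can be sketched as follows. This is only an illustration under stated assumptions: tracks are represented as hypothetical dictionaries mapping a snapshot time to planar (x, y) coordinates, and great-circle distance on a spherical Earth (haversine formula, mean radius 6371 km) is used as one example of a non-Euclidean distance. Neither representation is taken from the tools discussed in the paper.

```python
from math import asin, cos, hypot, radians, sin, sqrt


def euclidean_over_time(track_a, track_b):
    """Euclidean distance between two point objects at each shared snapshot time."""
    shared = sorted(set(track_a) & set(track_b))
    return {
        t: hypot(track_a[t][0] - track_b[t][0], track_a[t][1] - track_b[t][1])
        for t in shared
    }


def great_circle_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Haversine distance in km between two (lon, lat) positions given in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * radius_km * asin(sqrt(a))


# Hypothetical snapshot tracks: time -> (x, y) in planar units.
track_a = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0)}
track_b = {0: (3.0, 4.0), 1: (1.0, 0.0), 3: (9.0, 9.0)}

distances = euclidean_over_time(track_a, track_b)
print(distances)  # only times 0 and 1 are present in both tracks
print(round(great_circle_km(0.0, 0.0, 90.0, 0.0), 1))
```

The snapshot model only compares objects at times that both tracks happen to record, which illustrates why richer models of moving objects and their relationships are called for.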

FUTURE RESEARCH DIRECTIONS

The paper has demonstrated that knowledge construction is a field very much in its infancy, making it a source of several open research problems related to the integration of KDD and GeoVis methods and associated tools for uncovering spatio-temporal patterns in environmental data, especially remotely sensed satellite data. Although there are over 200 mining tools currently available in the public domain (see www.kdnuggets.com/software), several barriers must be overcome in order to apply them to environmental data preparation, mining, and visualization.

One major barrier is related to data issues. Environmental data sets are usually collected for multi-purpose use and have different spatial and temporal scales, accuracy, and taxonomies. This reality coincides with an exponential increase in digital data generated by Earth Observation Systems. The Moderate Resolution Imaging Spectroradiometer (MODIS), developed for global remote sensing of clouds, aerosols, water vapor, land, and ocean properties, provides 1.29 Gbytes per hour. Although large environmental data sets provide a major challenge for the integration of KDD and GeoVis methods, there still exist small data sets that need to be mined as well. For instance, real-world phenomena are so complex that a small data set might contain the associations between patterns and the real-world phenomena they represent. Common technical issues are related to the interoperability of multi-source data sets through commensurate measurements, metadata standards, and mining infrastructures.

Another research challenge deals with the development of a full range of conceptual, logical, and physical models of spatio-temporal objects in both KDD and GeoVis methods. The research challenges in this area are quite extensive, and they represent the majority of new applications of knowledge construction in the future, such as studies on global change, natural hazards, and terrestrial ecology. Space and time are to be treated jointly because, when a concept hierarchy or a rule is seen simultaneously in space and time, it inevitably exposes relations that cannot be traced if the hierarchy is arranged into abstraction levels and drawn out of its space-time context. Is climate changing? What agricultural areas will continue to exist according to future global change trends? A knowledge construction process might be applicable to answering these questions.

REFERENCES

Agrawal, R., Imielinski, T. and Swami, A. (1993). Mining association rules between sets of items in large databases. ACM SIGMOD, pp. 207-216.
Behnke, J., Dobbinson, E., Graves, S., Hinke, T., Nichols, D. and Stolorz, P. (1999). NASA Workshop on Issues in the Application of Data Mining to Scientific Data. Final Report, Goddard Space Flight Center, USA.
Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. New York: Springer Verlag.
Brachman, R.J. and Anand, T. (1996). The process of knowledge discovery in databases. In Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (eds.), Menlo Park, CA: AAAI Press / The MIT Press, pp. 37-57.
Brodlie, K.W., Duce, D.A., Gallop, J.R. and Wood, J.D. (1998). Distributed cooperative visualization. State of the Art Reports at Eurographics'98, A.A. Sousa and F.R.A. Hopgood (eds.), Eurographics Association, pp. 27-50.
Chen, M., Han, J. and Yu, P.S. (1996). Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8, pp. 866-883.
DiBiase, D. (1990). Visualization in the earth sciences. Earth and Mineral Sciences, Bulletin of the College of Earth and Mineral Sciences, Penn State University, 59(2): 13-18.
Dykes, J. (1997). Exploring spatial data representation with dynamic graphics. Computers & Geosciences, Special issue on Exploratory Cartographic Visualization, 23(4): 345-370.
Ester, M., Kriegel, H.-P. and Sander, J. (1998). Algorithms for characterization and trend detection in spatial databases. Proceedings 4th International Conference on Knowledge Discovery and Data Mining (KDD'98), New York, NY, USA, pp. 44-50.
Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996). From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (eds.), Menlo Park, CA: AAAI Press / The MIT Press, pp. 1-34.
Fotheringham, A.S. and Charlton, M. (1994). GIS and exploratory data analysis: An overview of some basic research issues. Geographical Systems, 1(4): 315-327.
Frawley, W.J., Piatetsky-Shapiro, G., Matheus, C.J. and Smyth, P. (1991). Knowledge discovery in databases: An overview. In Knowledge Discovery and Data Mining, G. Piatetsky-Shapiro and B. Frawley (eds.), Cambridge, Mass: AAAI Press / The MIT Press, pp. 1-27.
Gahegan, M.N. (1996). Visualization strategies for exploratory spatial analysis. Proceedings Third International Conference on GIS and Environmental Modeling, Santa Fe, New Mexico, USA. URL: http://www.geog.psu.edu/~mark/santafe.html
Glymour, C., Madigan, D., Pregibon, D. and Smyth, P. (1997). Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery, 1, pp. 11-28.
Han, J. (1995). Mining knowledge at multiple concept levels. Proceedings 4th International Conference on Information and Knowledge Management, Baltimore, Maryland, USA, pp. 19-24.
Han, J. and Fu, Y. (1996). Exploration of the power of attribute-oriented induction in data mining. In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (eds.), Menlo Park, CA: AAAI Press / The MIT Press, pp. 399-421.
Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K., Li, D., Lu, Y., Rajan, A., Stefanovic, N., Xia, B. and Zaiane, O.R. (1996). DBMiner: A system for mining knowledge in large relational databases. Proceedings of International Conference on Mining and Knowledge Discovery (KDD 1996), Portland, Oregon, USA, pp. 250-255.
Han, J., Koperski, K. and Stefanovic, N. (1997). GeoMiner: A system prototype for spatial mining. Proceedings 1997 ACM-SIGMOD International Conference on Management of Data (SIGMOD'97), Tucson, AZ, USA. URL: http://db.cs.sfu.ca/sections/publication/kdd/kdd.html
Hao, M., Dayal, U., Hsu, M., Baker, J. and D'Eletto, R. (1999). A Java-based visual mining infrastructure and applications. IEEE InfoVis'99, San Francisco, CA, USA, pp. 124-127.
Harinarayan, V., Rajaraman, A. and Ullman, J.D. (1996). Implementing data cubes efficiently. Proceedings of ACM-SIGMOD on Management of Data, Montreal, Canada, pp. 205-216.
Harman, G. (1965). The inference to the best explanation. Philosophical Review, 74, pp. 88-95.
Hinneburg, A., Keim, D. and Wawryniuk, M. (1999). HD-Eye: Visual mining of high dimensional data. IEEE Computer Graphics and Applications, September/October 1999, pp. 22-31.
Hempel, C. (1965). Aspects of Scientific Explanation. New York: Free Press.
Holsheimer, M., Kersten, M.L. and Siebes, A. (1996). Exploration of the power of attribute-oriented induction in data mining. In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (eds.), Menlo Park, CA: AAAI Press / The MIT Press, pp. 447-467.
Hosking, J.R.M., Pednault, E.P.D. and Sudan, M. (1997). A statistical perspective on data mining. Future Generation Computer Systems, 13, pp. 117-134.
Inselberg, A. and Avidan, T. (1999). The automated multidimensional detective. IEEE InfoVis'99, San Francisco, CA, USA, pp. 112-119.
Keim, D. and Kriegel, H.-P. (1994). VisDB: Database exploration using multidimensional visualization. Computer Graphics and Applications, September 1994, pp. 44-49.
Knorr, E.M. and Ng, R.T. (1996). Finding aggregate proximity relationships and commonalities in spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 8(6): 884-897.
Koperski, K. and Han, J. (1995). Discovery of spatial association rules in geographic information databases. Proceedings International Symposium on Large Spatial Databases (SSD'95), Portland, Maine, USA, pp. 47-66.
Koperski, K., Han, J. and Adhikary, J. (1999). Mining knowledge in geographic data. Comm. ACM. URL: http://db.cs.sfu.ca/sections/publication/kdd/kdd.html
Kraak, M.-J. and MacEachren, A.M. (eds.) (1999). International Journal of Geographic Information Science: special issue on exploratory cartographic visualization, 13(4).
Lee, H.Y. and Ong, H.L. (1996). Visualization support for data mining. IEEE Expert Intelligent Systems and their Applications, 11(5): 69-75.
MacDougall, E.B. (1992). Exploratory analysis, dynamic statistical visualization and geographic information systems. Cartography and Geographical Information Systems, 19(4): 237-246.
MacEachren, A.M. (1995). How Maps Work: Representation, Visualization and Design. Guilford Press.
MacEachren, A.M. (1992). Visualization. In Geography's Inner Worlds: Pervasive Themes in Contemporary American Geography, R. Abler, M. Marcus and J. Olson (eds.), New Brunswick, NJ: Rutgers University Press, pp. 99-137.
MacEachren, A.M., Wachowicz, M., Edsall, R., Haug, D. and Masters, R. (1999). Constructing knowledge from multivariate spatio-temporal data: integrating geographical visualization with knowledge discovery in database methods. International Journal of Geographic Information Science, 13(4): 311-334.
Mitchell, T.M. (1997). Machine Learning. New York, USA: McGraw Hill.
Monmonier, M. (1991). Ethics and map design: Six strategies for confronting the traditional one-map solution. Cartographic Perspectives, 10, pp. 3-8.
Murthy, S. (2001). On growing better decision trees from data. URL: http://www.tigr.org/~salzberg/murthy_thesis/survey/survey.html
Ng, R. and Han, J. (1994). Efficient and effective clustering method for spatial data mining. Proceedings International Conference on VLDB, Santiago, Chile, pp. 144-155.
Park, J.S., Chen, M.S. and Yu, P.S. (1995). An effective hash-based algorithm for mining association rules. Proceedings ACM-SIGMOD on Management of Data, San Jose, California, USA, pp. 175-186.
Roddick, J.F. and Spiliopoulou, M. (1999). A bibliography of temporal, spatial and spatio-temporal data mining research. SIGKDD Explorations, 1(1). URL: http://www.cis.unisa.edu.au/~cisjfr/STDMPapers/
Rumelhart, D.E. and Norman, D.A. (1985). Representation of knowledge. In Issues in Cognitive Modelling, A.M. Aitkenhead and J.M. Slack (eds.), Erlbaum, pp. 15-62.
Savasere, A., Omiecinski, E. and Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. Proceedings of International Conference on VLDB, Zurich, Switzerland, pp. 37-45.
Simoudis, E., Livezey, B. and Kerber, R. (1996). Integrating inductive and deductive reasoning for data mining. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (eds.), Menlo Park, CA: AAAI Press / The MIT Press, pp. 353-374.
Stolorz, P., Nakamura, H., Mesrobian, E., Muntz, R.R., Shek, C., Santos, J.R., Yi, J., Ng, K., Chien, S., Mechoso, C.R. and Farrara, J.D. (1995). Fast spatiotemporal data mining of large geophysical data sets. Proceedings of the First Conference on Knowledge Discovery and Data Mining, Montreal, Canada, pp. 87-101.
Stutz, J. and Cheeseman, P. (1994). AutoClass - a Bayesian approach to classification. In Maximum Entropy and Bayesian Methods, J. Skilling and S. Sibisi (eds.), Dordrecht, The Netherlands: Kluwer Academic Publishers.
Tang, Q. (1992). A personal visualization system for visual analysis of area-based spatial data. Proc. GIS/LIS'92, 2, American Society for Photogrammetry and Remote Sensing, Bethesda, Maryland, USA, pp. 767-776.
Uthurusamy, R. (1996). From data mining to knowledge discovery: Current challenges and future directions. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (eds.), Menlo Park, CA: AAAI Press / The MIT Press, pp. 561-572.
Wachowicz, M. (2000a). How can knowledge discovery methods uncover spatio-temporal patterns in environmental data? In Data Mining and Knowledge Discovery: Theory, Tools, and Technology II, B.V. Dasarathy (ed.), Proceedings of SPIE, 4057, pp. 221-229.
Wachowicz, M. (2000b). The role of geographic visualisation and knowledge discovery in spatio-temporal data modelling. In Time in GIS: Issues in spatio-temporal modelling, H. Heres (ed.), Publications in Geodesy 47, pp. 13-26.
Wachowicz, M. (2001). GeoInsight: an approach for developing a knowledge construction process based on the integration of GVis and KDD methods. In Geographic Data Mining and Knowledge Discovery, H.J. Miller and J. Han (eds.), London: Taylor & Francis.
Wang, W., Yang, J. and Muntz, R. (1997). STING: A statistical information grid approach to spatial data mining. Proceedings of the 23rd VLDB Conference, Athens, Greece, pp. 186-196.
Wise, S., Haining, R. and Signoretta, P. (1998). The role of visualization for exploratory spatial data analysis of area-based data. Proc. Fourth International Conference on Geocomputation (GeoComputation'98), Bristol, UK. URL: http://www.geog.port.ac.uk/geocomp/geo98/
Wong, P.C. (1999). Visual data mining. IEEE Computer Graphics and Applications, 19(5): 20-21.
Zhang, T., Ramakrishnan, R. and Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases. Proceedings ACM-SIGMOD on Management of Data, Montreal, Canada, pp. 103-114.
