An Integrated Environment for High-dimensional Geographic Data Mining

Diansheng Guo
Department of Geography, University of South Carolina, Columbia, SC 29208
[email protected]

Mark Gahegan and Alan M. MacEachren
GeoVISTA Center, Department of Geography, Pennsylvania State University, University Park, PA 16802
[email protected], [email protected]

Introduction

Geographic data are often very large in volume and "characterized by a high number of attributes or dimensions" [1]. There is an urgent need to develop effective yet efficient approaches for analyzing such voluminous and high-dimensional data to address complex geographic problems [1, 2, 3, 4], e.g., detecting unknown multivariate patterns or relationships between socioeconomic, demographic, and environmental factors and the incidence of various cancers. This paper introduces an integrated geographic data mining environment, which couples a suite of visualization and computational methods to explore multivariate patterns in large and high-dimensional geographic datasets.

The environment comprises four major groups of components: (1) interactive feature selection components to identify interesting subsets of variables for further analysis [5]; (2) self-organizing map (SOM) [6] components to cluster data objects using only the variables selected in (1); (3) a high-dimensional visualization component, the Parallel Coordinate Plot (PCP) [7], to explore and present multivariate patterns or relationships; and (4) a geographic map component to visualize the spatial distribution of the discovered patterns. By interactively manipulating these integrated components, the user can iteratively locate, interpret, and refine patterns.

The Mining Process

As with data mining in other scientific and applied research fields, geographic data mining is by nature an iterative exploration process that involves many steps and various methods [3, 8, 9].
With the integrated geographic data mining environment introduced in this paper, a typical cycle within the iterative process is: loading data; transforming the data; selecting an interesting subset of variables for subsequent analysis; identifying multivariate clusters of the data (using the selected variables); interactively exploring and interpreting those clusters; and visualizing the clusters in a map to examine the spatial distribution of the discovered multivariate patterns.

The environment is designed and implemented with a component-oriented approach. Different components (or suites of components) focus on different analysis tasks, e.g., data pre-processing, feature selection, multivariate clustering, or visualization. These components all comply with the JavaBeans specification and can therefore be easily integrated within GeoVISTA Studio [10].

Feature Selection

Feature selection identifies one or more interesting subsets of variables in a high-dimensional dataset for further analysis. A feature selection step is necessary for two reasons. First, the more variables are involved, the harder it is to find patterns among them. Second, a dataset very often contains many irrelevant variables, which should be removed before subsequent analysis [5, 11].

A measure value (e.g., chi-square or conditional entropy) is calculated for each pair of variables to quantify their bivariate relationship, whether linear or non-linear. See [5, 12] for the calculation details of these measures. A matrix of the measure values for all variable pairs is then constructed, in which each diagonal cell represents a variable and each off-diagonal cell represents the measure value for a pair of variables. Bright cells represent strong bivariate relationships. Variables are first organized into domains (e.g., cancer, census) and then ordered within each domain based on the paired measure values. Subsets of variables that are strongly related to each other are therefore placed next to each other and form a bright block. The user can zoom in on each such hot spot and pick a subset of variables for further analysis (see Figure 1).
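The pairwise measure matrix described above can be illustrated with a minimal sketch. It is not the implementation used in the environment: the equal-interval binning, the number of bins, the use of the larger of the two conditional entropies as a symmetric score, and all function names are assumptions made for illustration (see [5, 12] for the actual measures). In this sketch a lower score means a stronger relationship, so a display would map low values to bright cells.

```python
import math
from collections import Counter


def equal_interval_bins(values, n_bins=4):
    """Discretize a numeric variable into equal-width bins (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def conditional_entropy(x_bins, y_bins):
    """Return H(Y | X) in bits for two equally long lists of bin labels."""
    n = len(x_bins)
    joint = Counter(zip(x_bins, y_bins))
    x_counts = Counter(x_bins)
    h = 0.0
    for (x, _y), count in joint.items():
        h -= (count / n) * math.log2(count / x_counts[x])
    return h


def relationship_matrix(columns, n_bins=4):
    """Score every ordered pair of distinct variables.

    `columns` maps a variable name to its list of values. The score is
    max(H(b|a), H(a|b)), so it is symmetric; 0 means each variable's bin
    fully determines the other's.
    """
    binned = {name: equal_interval_bins(vals, n_bins)
              for name, vals in columns.items()}
    matrix = {}
    for a in binned:
        for b in binned:
            if a != b:
                matrix[(a, b)] = max(conditional_entropy(binned[a], binned[b]),
                                     conditional_entropy(binned[b], binned[a]))
    return matrix


if __name__ == "__main__":
    # Hypothetical toy data: "b" is a rescaled copy of "a", "c" is unrelated.
    toy = {
        "a": [1, 2, 3, 4, 5, 6, 7, 8],
        "b": [2, 4, 6, 8, 10, 12, 14, 16],
        "c": [5, 1, 5, 1, 5, 1, 5, 1],
    }
    for pair, score in sorted(relationship_matrix(toy).items()):
        print(pair, round(score, 3))
```

Ordering the variables so that low-score pairs sit next to each other, as the text describes, would be a separate step applied to this matrix.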
Multivariate Analysis and Visualization

After selecting a subspace (i.e., a subset of variables), the user can detect patterns within this subspace with the self-organizing map (SOM) components, the parallel coordinate plot (PCP) component, and the geographic map component (see Figure 2). The data for the selected variables are first passed to a SOM component, where they are organized into a 2-D layout of nodes. Each non-empty node contains one or more data objects that have similar values for the selected variables. Data objects in nearby nodes are also similar to each other. A 2-D color scheme assigns each node a color so that nearby nodes have similar colors. The non-empty nodes are then passed to a PCP component for visualization, where each node is drawn as a string in the color assigned by the SOM component and with a thickness proportional to the number of data objects the node contains. Each data object (here, a county) is also assigned the color of its containing node, so the map shows the spatial distribution of the discovered multivariate patterns (see Figure 2).
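The flow from SOM nodes to colors and string thickness can be sketched as follows. This is an illustrative toy, not the environment's SOM component: the rectangular grid, the linearly decaying learning rate and neighborhood radius, the simple red/blue color ramp, and every function name are assumptions. Input values are assumed to be scaled to the range [0, 1] beforehand.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the node whose weight vector is closest to x."""
    return min(weights,
               key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)))


def train_som(data, rows, cols, epochs=20, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            rate = 0.5 * (1.0 - frac)                       # decaying learning rate
            radius = max(rows, cols) / 2.0 * (1.0 - frac) + 0.5
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-d2 / (2.0 * radius * radius))
                for i in range(dim):
                    w[i] += rate * influence * (x[i] - w[i])
            step += 1
    return weights


def node_color(node, rows, cols):
    """Assumed 2-D color scheme: row drives red, column drives blue,
    so neighboring nodes receive similar RGB colors."""
    r = node[0] / max(rows - 1, 1)
    c = node[1] / max(cols - 1, 1)
    return (int(255 * r), 128, int(255 * c))


def summarize(data, weights, rows, cols):
    """For each non-empty node: member indices, color, and PCP string thickness."""
    members = {}
    for idx, x in enumerate(data):
        members.setdefault(best_matching_unit(weights, x), []).append(idx)
    return {node: {"members": idxs,
                   "color": node_color(node, rows, cols),
                   "thickness": len(idxs)}  # proportional to node size
            for node, idxs in members.items()}


if __name__ == "__main__":
    # Hypothetical data objects described by two selected variables.
    objects = [[0.10, 0.10], [0.12, 0.08], [0.90, 0.95], [0.88, 0.90], [0.50, 0.50]]
    som = train_som(objects, rows=3, cols=3)
    for node, info in sorted(summarize(objects, som, 3, 3).items()):
        print(node, info)
```

Each data object inherits the color of the node that contains it, which is what lets a map component paint counties with the same colors used in the PCP.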

Figure 1: Feature selection. In the above snapshot the block for AllCancers/Census is selected and shown in the zoom-in window (top right), with all-cancer variables on top and census variables to the right. Seven variables are selected because they form a bright block in the matrix.

Figure 2: Multivariate spatial analysis and visualization with SOM, PCP, and mapping.

Software and Tutorial Download

The integrated geographic data mining environment, together with a tutorial and a sample data set, can be downloaded at: http://www.geovistastudio.psu.edu/jsp/tutorial.jsp.

Acknowledgements: Research presented in this paper was partially funded by NSF grant #9983445, NSF grant #EIA-9983451, and grant CA95949 from the National Cancer Institute (NCI).

References:

[1] National Research Council. IT Roadmap to a Geospatial Future. Washington, D.C.: National Academy Press; 2003.
[2] Miller HJ and Han J. Geographic Data Mining and Knowledge Discovery: an overview. In: Miller HJ and Han J (Eds). Geographic Data Mining and Knowledge Discovery. Taylor & Francis: London and New York; 2001. 3-32.
[3] Fayyad U, Piatetsky-Shapiro G and Smyth P. From data mining to knowledge discovery: an overview. In: Fayyad U, Piatetsky-Shapiro G, Smyth P and Uthurusamy R (Eds). Advances in Knowledge Discovery. AAAI Press: Cambridge, MA; 1996. 1-33.
[4] Guo D, Peuquet D and Gahegan M. ICEAGE: Interactive Clustering and Exploration of Large and High-dimensional Geodata. GeoInformatica 2003; 7(3): 229-253.
[5] Guo D. Coordinating Computational and Visual Approaches for Interactive Feature Selection and Multivariate Clustering. Information Visualization 2003; 2(4): 232-246.
[6] Kohonen T. Self-organizing maps. Berlin; New York: Springer; 2001. 501pp.
[7] Inselberg A. The plane with parallel coordinates. The Visual Computer 1985; 1: 69-97.
[8] MacEachren AM, Wachowicz M, Edsall R, Haug D and Masters R. Constructing knowledge from multivariate spatiotemporal data: integrating geographical visualization with knowledge discovery in database methods. International Journal of Geographical Information Science 1999; 13(4).
[9] Gahegan M and Brodaric B. Computational and Visual Support for Geographic Knowledge Construction: Filling in the Gaps between Exploration and Explanation. Proceedings of the 10th International Symposium on Spatial Data Handling. Springer; 2002. 11-25.
[10] Gahegan M, Takatsuka M, Wheeler M and Hardisty F. Introducing GeoVISTA Studio: an integrated suite of visualization and computational methods for exploration and knowledge construction in geography. Computers, Environment and Urban Systems 2001; 26(4): 267-292.
[11] Procopiuc CM, Jones M, Agarwal PK and Murali TM. A Monte Carlo Algorithm for Fast Projective Clustering. ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, USA, 2002. ACM Press; 418-427.
[12] Guo D, Gahegan M, Peuquet D and MacEachren A. Breaking Down Dimensionality: An Effective Feature Selection Method for High-Dimensional Clustering. Workshop on Clustering High Dimensional Data and its Applications, the Third SIAM International Conference on Data Mining, May 1-3, San Francisco, CA, USA, 2003.
