Uncertainty in Spatial Data Mining and Its Propagation
Binbin He 1,2, Tao Fang 2, Dazhi Guo 1 and Jiang Yun 1
1 School of Environment & Spatial Informatics, China University of Mining and Technology, Xuzhou, Jiangsu, China, 221008 (binb_he@163.com, guodazhi@pub.xz.jsinfo.net)
2 Institute of Image Processing & Pattern Recognition, Shanghai Jiao-Tong University, No. 1954, Huashan Road, Shanghai, China, 200030 (tfang@sjtu.edu.cn)
Abstract
Spatial data mining extracts hidden, implicit, valid, novel and interesting spatial or non-spatial patterns or rules from large, incomplete, noisy, fuzzy, random, real-world spatial databases. An important but underdeveloped issue is how to reveal and handle the uncertainties of patterns and features in the course of spatial data mining and knowledge discovery. In this paper, the types and origins of uncertainty in spatial data, and the models for its measurement and propagation, are analyzed first. Then, some uncertainty factors in the operation of spatial data mining are discussed. We argue that the process of spatial data mining can be regarded as a complex system, namely a linear serial processing system of the kind used in engineering control systems. An uncertainty propagation model for spatial data mining, the fuzzy logic uncertainty propagation model with Creditability Factor, is developed. Finally, several key problems concerning uncertainty handling and propagation in spatial data mining are put forward.
Key Words
Uncertainty propagation; Spatial data mining; Complex system; Creditability Factor (CF); Fuzzy logic
1. Introduction
Spatial data mining extracts hidden, implicit, valid, novel and interesting spatial or non-spatial patterns or rules from large, incomplete, noisy, fuzzy, random, real-world spatial databases (Di, 2000). With the continuous development of techniques for automatically acquiring spatial data, the amount of data in spatial databases has increased exponentially. However, the limited analytical functions of geographic information systems (GIS) leave much of the stored data, and the useful knowledge it contains, unused; in other words, “the spatial data explode but knowledge is poor” (Li, 2002). In recent years, spatial data mining has become an important field in GIS and computer technology, and a great number of research results have been obtained. So far, research on spatial data mining has focused on mining principles and methods themselves, such as mining algorithms for massive data, the mining of multi-source remote sensing data and the mining of distributed spatial data. However, another important issue, uncertainty in spatial data mining and its propagation, has received little attention. In fact, spatial data themselves contain several forms of uncertainty, and many new kinds of uncertainty arise during the process of spatial data mining. Meanwhile, these uncertainties are propagated and accumulated in the spatial data mining process, and thus lead to uncertain knowledge. Traditional spatial data mining did not consider these circumstances and characteristics, and regarded the discovered knowledge as fully useful and certain. Therefore, research on the characteristics and patterns of uncertainty and its propagation in spatial data mining is very important.
In this paper, the types and origins of uncertainty in spatial data, and its measurement and propagation, are reviewed in section 2. In section 3, some uncertainty factors in the operation of spatial data mining are discussed. In section 4, we propose that the process of spatial data mining can be regarded as a complex system, namely a linear serial processing system of the kind used in engineering control systems, and we develop an uncertainty propagation model for spatial data mining, the fuzzy logic uncertainty propagation model with Creditability Factor. Finally, several key problems concerning uncertainty handling and propagation in spatial data mining are put forward in section 5.
2. A Brief Review of Uncertainties in Spatial Data
2.1 The types and origins of uncertainty in spatial data
Uncertainty in spatial data is a major component of the evaluation of spatial data quality. Spatial data quality includes lineage, accuracy, completeness, logical consistency, semantic accuracy and currency (FGDC, 1998). All types of spatial data are subject to uncertainty, since it is impossible to create a perfect representation of the infinitely complex real world (Goodchild, 2003). Error refers to the discrepancy between an observed result and the true value, and it has statistical characteristics. Uncertainty is a broader extension of the concept of error: it measures the degree to which knowledge of the surveyed object is lacking. Uncertainties in spatial data can be classified as error, vagueness, ambiguity and discord (Fisher, 2003). Spatial data capture involves surveying measurement, data interpretation, data input, data processing and data representation. The uncertainties of spatial data stem from two sources. On the one hand, they stem from the instability of natural phenomena and the incompleteness of human cognition. On the other hand, the process of spatial data capture and handling introduces considerable uncertainty. In addition, these uncertainties can be propagated from one phase to the next, and they accumulate according to different laws.
2.2 Uncertainty measurement and propagation of spatial data
At present, a great deal of research has been carried out on positional uncertainty and its propagation, especially the uncertainty of points, lines, and polygons or areas. The uncertainty of points and lines is the basis for that of polygons and areas. Several uncertainty models have been constructed, including the standard ellipse model (Mikhail and Ackerman, 1976) and the circular normal model (Goodchild, 1991) for point position, and the Epsilon-band model (Chrisman, 1982) and the error band model (Dutton, 1992) for line position. In past years, the study of spatial data quality control emphasized positional uncertainty but paid little attention to attribute uncertainty. In recent years, some scholars have studied the attribute uncertainty of GIS data (Liu, 1999; Ehlschlager, 2000; Shi, 2002). Usually, positional uncertainty and attribute uncertainty were studied separately. Shi (2000) constructed the “S-band” model, which combines positional uncertainty with attribute uncertainty. Zhang (1999) constructed a field model in which positional uncertainty and attribute uncertainty are described uniformly. The spatial uncertainty propagation problem can be formulated mathematically as follows:
$$Y(\cdot) = \mathrm{Opt}\big(D_1(\cdot), \ldots, D_m(\cdot)\big) \qquad (1)$$
Here $Y(\cdot)$ is the output of the GIS operation $\mathrm{Opt}(\cdot)$ applied to the $m$ spatial data sets $D_1(\cdot), \ldots, D_m(\cdot)$. The operation $\mathrm{Opt}(\cdot)$ may be any of the various operations in GIS, such as the intersection of data sets. The task of spatial uncertainty propagation analysis is to determine the spatial uncertainty in the output $Y(\cdot)$, given the operation $\mathrm{Opt}(\cdot)$ and the spatial uncertainties in the input data sets. Propagation is relatively easy when $\mathrm{Opt}(\cdot)$ is a linear function, in which case the error propagation law can be applied. However, few operations are linear or can be handled by simple calculation, and in general rigorous analytical methods become very cumbersome. The Monte Carlo method (Openshaw, 1989) takes an entirely different approach to determining the uncertainty of geospatial objects. In this method, the result of Equation (1) is computed repeatedly, with input values $D = [D_1, D_2, \ldots, D_m]$ that are randomly sampled from their joint distribution. The outputs form a random sample, from which distribution parameters such as the mean and variance can be estimated. The Monte Carlo method is a general method for uncertainty handling and can be applied to the processing of both spatial and attribute data. Its outstanding advantage is that it can provide the entire distribution of the output data at an arbitrary level of accuracy. Its other advantages are ease of implementation and broad applicability. However, it is computationally intensive.
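The Monte Carlo procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any cited author: the operation Opt is taken to be the distance between two points, each coordinate is assumed to carry independent Gaussian error with the same standard deviation, and the names `opt_distance` and `monte_carlo_propagate` are hypothetical.

```python
import math
import random
import statistics

def opt_distance(d1, d2):
    """Example nonlinear GIS operation: Euclidean distance between two points."""
    return math.hypot(d2[0] - d1[0], d2[1] - d1[1])

def monte_carlo_propagate(opt, means, sigma, n_runs=20000, seed=0):
    """Estimate mean and standard deviation of opt(D1, ..., Dm) by sampling.

    Assumption: every coordinate has independent Gaussian error with
    standard deviation sigma, so the joint distribution factorizes.
    """
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_runs):
        # Draw one realization of every input data set.
        sample = [tuple(rng.gauss(c, sigma) for c in point) for point in means]
        outputs.append(opt(*sample))
    return statistics.mean(outputs), statistics.stdev(outputs)

# Two surveyed points, nominally 5 units apart, each coordinate with sigma = 0.1.
mean_y, std_y = monte_carlo_propagate(
    opt_distance, [(0.0, 0.0), (3.0, 4.0)], sigma=0.1
)
print(f"estimated distance: {mean_y:.3f} +/- {std_y:.3f}")
```

Increasing `n_runs` tightens the estimates at the cost of more computation, which is the trade-off noted in the text.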
3. Uncertainties in Spatial Data Mining
Uncertainties in spatial data mining may arise during spatial data capture, spatial data processing, data mining and knowledge representation. Research results on the uncertainties of spatial data are not only important in themselves, but are also the starting point for further research on spatial data mining. The source data for spatial data mining usually come from uncertain spatial databases or uncertain spatial data sets. Moreover, uncertainties in spatial data may directly or indirectly affect the quality of spatial data mining (Han and Kamber, 2001). The process of spatial data mining may be divided into four phases: data selection, data preprocessing, data mining, and pattern evaluation and knowledge representation. Many new uncertainties arise in these phases, and they are propagated and accumulated throughout the spatial data mining process (Figure 1). The uncertainties of each phase are analyzed as follows:
Figure 1. Uncertainties and their propagation in the process of spatial data mining. [Diagram labels: Uncertain Spatial Data, Data Selection, Object Data, Data Preprocessing, Generalization Data, Data Mining, Pattern, Evaluation and Representation, Uncertain Knowledge; uncertainty propagation runs through all stages.]
At the data selection phase, uncertainty mainly stems from the operator's subjective choice of object data for the mining task, for example what data should be collected and how much data is enough. Spatial data preprocessing mainly includes data cleaning, data transformation and data reduction. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. During data transformation, the spatial data are transformed or consolidated into forms appropriate for mining. This may involve operations such as smoothing, aggregation, generalization, normalization, attribute construction, data discretization and concept hierarchy generation. In general, data discretization and concept hierarchy generation are the main origins of uncertainty in this phase, and many uncertainties may be hidden by data discretization. At this phase, many uncertainties can be reduced by uncertainty handling methods. However, some data uncertainties can never be eliminated completely, and new uncertainties may even arise during handling, for instance from the use of unsuitable models or processing methods. During the mining process, the limitations of the mathematical model or data mining algorithm may further propagate, and even enlarge, the uncertainty of the spatial data. Moreover, spatial knowledge representation carries many uncertainties of its own, including the randomness, fuzziness and incompleteness of knowledge. The same knowledge can be represented in different ways. Most spatial knowledge discovered by spatial data mining is qualitative, and natural language is the best way to represent it.
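To make concrete how discretization can hide uncertainty, the following sketch compares a crisp class label with the class probabilities implied by measurement error. It is only an illustration: the Gaussian error model, the class breakpoints, the labels and the function names are all assumptions introduced here, not taken from the paper.

```python
import bisect
import math

def normal_cdf(x, mean, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def discretize(value, breakpoints, labels):
    """Crisp discretization: return the one label whose interval holds value."""
    return labels[bisect.bisect_right(breakpoints, value)]

def class_probabilities(value, sigma, breakpoints, labels):
    """Probability that the true value lies in each class interval,
    assuming Gaussian measurement error with standard deviation sigma."""
    edges = [-math.inf, *breakpoints, math.inf]
    return {
        label: normal_cdf(hi, value, sigma) - normal_cdf(lo, value, sigma)
        for label, lo, hi in zip(labels, edges[:-1], edges[1:])
    }

# Hypothetical elevation classes (metres) and a measurement close to a boundary.
BREAKPOINTS = [100.0, 200.0]
LABELS = ["low", "medium", "high"]

crisp = discretize(198.0, BREAKPOINTS, LABELS)
probs = class_probabilities(198.0, 5.0, BREAKPOINTS, LABELS)
print(crisp, {label: round(p, 3) for label, p in probs.items()})
```

The crisp label reports a single class, while the probabilities show that a substantial share of the mass falls in the neighbouring class; that information is lost once only the label is passed to the mining step.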
4. An Uncertainty Propagation Model of Spatial Data Mining
Uncertainties may be analyzed and handled by means of different technologies. Unfortunately, the law of uncertainty propagation in spatial data mining has not yet been established. We consider that the process of spatial data mining can be regarded as a complex system, namely a linear serial processing system of the kind used in engineering control systems, which has several inputs and outputs that are coupled and associated in some way. The implicit or explicit propagation function of each process can then be given, so that the complicated problem can be simplified and solved step by step. On the basis of this principle, an uncertainty propagation model for spatial data mining, the fuzzy logic uncertainty propagation model with Creditability Factor, is constructed. For convenience of description, we only consider propagation among several key links: uncertain spatial data → spatial operation → discretization and concept hierarchy generation → data mining → knowledge representation → uncertain reasoning. Randomness (error) and fuzziness are the two main uncertainties in spatial data and spatial data mining. However, current methods are limited to handling one kind of uncertainty at a time. In fact, both randomness and fuzziness are present in spatial data and in the process of spatial data mining. The fuzzy logic uncertainty propagation model with Creditability Factor can represent and handle both uncertainties simultaneously. In this model, uncertainty arising from randomness is represented by the Creditability Factor (CF), and uncertainty arising from fuzziness is represented by fuzzy sets. Figure 2 shows the pattern of uncertainty propagation in spatial data mining. The entire model of uncertainty propagation can be viewed as a reasoning network, which is relatively complicated. The main purpose of the Creditability Factor (CF) is to describe the propagation of uncertainty through this reasoning network.
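The serial view of the mining process can be illustrated with a small sketch. It assumes crisp (non-fuzzy) Creditability Factors in [0, 1] and borrows the classic certainty-factor combination rules from expert systems: multiplication for sequential links, and a + b − ab for two independent supporting pieces of evidence. The paper does not spell out its exact rules, so these rules, the stage values and the function names are illustrative assumptions only.

```python
from functools import reduce

def sequential(cf_in, cf_stage):
    """Propagate a CF through one stage: the incoming CF is attenuated
    by the CF attached to the stage (negative evidence is clipped to 0)."""
    return max(0.0, cf_in) * cf_stage

def parallel(cf_a, cf_b):
    """Combine two independent supporting CFs, both assumed to lie in [0, 1]."""
    return cf_a + cf_b - cf_a * cf_b

def propagate_chain(cf_data, stage_cfs):
    """Push the CF of the input data through a serial chain of stages."""
    return reduce(sequential, stage_cfs, cf_data)

# Hypothetical CFs for the links SO -> DC -> DM -> UR.
stages = {"SO": 0.95, "DC": 0.80, "DM": 0.85, "UR": 0.90}

cf_knowledge = propagate_chain(0.90, stages.values())
cf_combined = parallel(0.6, 0.5)
print(f"CF of discovered knowledge: {cf_knowledge:.4f}")
print(f"CF from two parallel rules: {cf_combined:.2f}")
```

In a serial chain the CF can only decrease from stage to stage, which matches the accumulation of uncertainty described in the text, whereas parallel support from independent evidence raises the combined CF.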
Figure 2. Uncertainty propagation model in spatial data mining process. [Diagram labels: spatial data D1(CFD1), D2(CFD2), ..., Dn(CFDn); object data OD1, OD2, ..., ODm; generalization data GD1, GD2, ..., GDL; rules; spatial knowledge; Creditability Factors CF1 to CF4 and CFK attached to the links between stages. SO = Spatial Operation; DC = Discretization and Concept hierarchy generation; DM = Data Mining; UR = Uncertain Reasoning.]
The Creditability Factors (CF) may be combined in parallel or sequential ways. Spatial data with a Creditability Factor (CF) are treated as initial evidence, for which an uncertainty measurement model can be given according to the type of spatial data. The Creditability Factor (CF) is represented by a fuzzy linguistic value.
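Let $\mu_{CF_1}(x)$ and $\mu_{CF_2}(y)$ be the membership functions of the fuzzy linguistic values of CF1 and CF2. The operations between them are then defined as follows:
$$CF_1 + CF_2:\quad \mu_{CF_1 + CF_2}(z) = \bigvee_{z = x + y} \big(\mu_{CF_1}(x) \wedge \mu_{CF_2}(y)\big)$$
$$CF_1 - CF_2:\quad \mu_{CF_1 - CF_2}(z) = \bigvee_{z = x - y} \big(\mu_{CF_1}(x) \wedge \mu_{CF_2}(y)\big)$$
$$CF_1 \times CF_2:\quad \mu_{CF_1 \times CF_2}(z) = \bigvee_{z = x \times y} \big(\mu_{CF_1}(x) \wedge \mu_{CF_2}(y)\big)$$
$$CF_1 \div CF_2:\quad \mu_{CF_1 \div CF_2}(z) = \bigvee_{z = x \div y} \big(\mu_{CF_1}(x) \wedge \mu_{CF_2}(y)\big)$$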
Here “∨” denotes the maximum, taken over all pairs (x, y) that yield the same z, and “∧” denotes the minimum. Furthermore, the algorithms involved in the model (such as evidence combination, uncertainty propagation and result combination) follow the corresponding algorithms of the Creditability method in Artificial Intelligence (AI). In theory, the Creditability Factor (CF) model should be consistent with the processing model of each phase, for example a Creditability Factor (CF) model with a probabilistic interpretation, although this is difficult to achieve in this model.
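The four operations above share one pattern (take the minimum of the two memberships for every pair, then the maximum over all pairs that give the same result), so a single helper can cover them. The sketch below assumes discrete fuzzy sets stored as dictionaries from CF value to membership grade; the linguistic values `HIGH` and `MEDIUM`, their grades and the function name `extend` are hypothetical.

```python
import operator

def extend(op, cf1, cf2, ndigits=6):
    """Apply a binary operation to two discrete fuzzy sets.

    cf1 and cf2 map a CF value to its membership grade. For every pair
    (x, y) the grade of z = op(x, y) is min(mu1(x), mu2(y)); when several
    pairs give the same z, the maximum grade is kept.
    """
    result = {}
    for x, mu_x in cf1.items():
        for y, mu_y in cf2.items():
            # Rounding merges results that differ only by floating-point noise.
            z = round(op(x, y), ndigits)
            result[z] = max(result.get(z, 0.0), min(mu_x, mu_y))
    return result

# Hypothetical linguistic CF values as discrete fuzzy sets.
HIGH = {0.8: 0.5, 0.9: 1.0, 1.0: 0.5}
MEDIUM = {0.4: 0.5, 0.5: 1.0, 0.6: 0.5}

cf_sum = extend(operator.add, HIGH, MEDIUM)
cf_product = extend(operator.mul, HIGH, MEDIUM)
print(sorted(cf_product.items()))
```

For division, `operator.truediv` can be passed in the same way, provided the second fuzzy set does not contain 0 among its values.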
5. Conclusion & Discussion
We expect to achieve two objectives by studying the uncertainties of spatial data and spatial data mining. First, the quality of spatial data mining can be improved by analyzing the uncertainties and their characteristics in each phase of spatial data mining, and by finding proper approaches to reduce them. Second, although the uncertainties of spatial data mining cannot be fully eliminated, we may construct a measurement model of uncertainty propagation by analyzing the propagation law of uncertainties in spatial data mining, in order to evaluate them approximately and to make proper use of the knowledge discovered. The origin, measurement and propagation of uncertainty in spatial data mining are analyzed in this paper. Several key problems call for further theories and methods:
(1) A comprehensive measurement model of spatial data uncertainty. Current measurement models of uncertainty are based on the random error of spatial data. In fact, spatial data involving only a single kind of uncertainty do not exist. Constructing a comprehensive measurement model of spatial data is the basis of this research.
(2) A transformation model of uncertainty in data discretization and concept hierarchy generation. These operations are difficult but very important, because many uncertainties may be hidden by them. Quantitative data and qualitative concepts can be transformed into each other, and the uncertainties of the mapping relations can be reflected in the model.
(3) Uncertainty-based mining algorithms. The usual mining algorithms and the published techniques for spatial data mining cannot handle the spatial information in spatial data very well, and they lack an evaluation of the discovered knowledge.
(4) Representation methods for uncertain spatial knowledge. Traditional knowledge representation methods (such as predicate logic and production rules) are not suitable for representing the uncertainty of spatial knowledge.
Acknowledgements
The work described in this paper was supported by the National Natural Science Foundation of China (Project No. 60275021).
References
Chrisman N.R., 1982, A theory of cartography error and its measurement in digital database. In: Proceedings of Auto-Carto 5.
Di K.C., 2000, Spatial Data Mining and Knowledge Discovery. Wuhan, China: Wuhan University Press.
Dutton G., 1992, Handling positional uncertainty in spatial database. In: Proceedings of 5th International Symposium on Spatial Data Mining, 460-469.
Ehlschlager C.R., 2000, Representing uncertainty of area class maps with a correlated inter-map cell swapping heuristic. Computers, Environment and Urban Systems, 24, 451-469.
FGDC (Federal Geographic Data Committee), 1998, Content standard for digital geospatial metadata. FGDC-STD-001-1998, National Technical Information Services, Computer Products Office, Springfield, Virginia, USA.
Fisher P.F., 2003, Data quality and uncertainty: ships passing in the night. In: Proceedings of The 2nd International Symposium on Spatial Data Quality 2003, edited by Shi W.Z., Goodchild M.F., Fisher P.F. Hong Kong: The Hong Kong Polytechnic University.
Goodchild M.F., 1991, Issues of quality and uncertainty. In: Advances in Cartography, edited by Muller J.C.
Goodchild M.F., 2003, Models for uncertainty in area-class maps. In: Proceedings of The 2nd International Symposium on Spatial Data Quality 2003, edited by Shi W.Z., Goodchild M.F., Fisher P.F. Hong Kong: The Hong Kong Polytechnic University.
Han J.W., Kamber M., 2001, Data Mining: Concepts and Techniques. San Francisco, California: Morgan Kaufmann Publishers.
Li D.R., Wang S.L., Li D.Y. and Wang X.Z., 2002, Theories and technologies of spatial data knowledge discovery. Geomatics and Information Science of Wuhan University, 27(3), 221-233.
Liu W.B., Den M. and Xia Z.G., 1999, Analysis of uncertainties of attributes in vector GIS. Acta Geodaetica et Cartographica Sinica, 29(1), 76-81.
Mikhail E.M. and Ackerman F., 1976, Observations and Least Squares. New York: IEP-A Dun-Donnelley Publisher.
Openshaw S., 1989, Learning to live with errors in spatial database. In: Accuracy of Spatial Database, Taylor & Francis Ltd, New York, 263-276.
Shi W.Z., 2000, Theory and Methods for Handling Errors in Spatial Data. Beijing: Science Press.
Shi W.Z., Wang S.L., 2002, Further development of theories and methods on attribute uncertainty in GIS. Journal of Remote Sensing, 6(5), 393-400.
Zhang J.X., Du D.S., 1999, Field based models for positional and attribute uncertainty. Acta Geodaetica et Cartographica Sinica, 28(3), 244-249.