A Try for Handling Uncertainties in Spatial Data Mining

Shuliang Wang (1, 2), Guoqing Chen (1), Deyi Li (3), Deren Li (4), and Hanning Yuan (3)

(1) School of Economics and Management, Tsinghua University, Beijing 100084, China
(2) International School of Software, Wuhan University, Wuhan 430072, China
(3) China Institute of Electronic System Engineering, Fuxing Road 20, Beijing 100039, China
(4) School of Remote Sensing Engineering, Wuhan University, Wuhan 430072, China
[email protected]

Abstract. Uncertainties pervade spatial data mining. This paper proposes a method of spatial data mining handling randomness and fuzziness simultaneously. First, the uncertainties in spatial data mining are presented via characteristics, spatial data, knowledge discovery and knowledge representation. Second, the aspects of the uncertainties in spatial data mining are briefed. They often appear simultaneously, but most of the existing methods cannot deal with spatial data mining with more than one uncertainty. Third, cloud model is presented to mine spatial data with both randomness and fuzziness. It may also act as an uncertainty transition between a qualitative concept and its quantitative data, which is the basis of spatial data mining in the contexts of uncertainties. Finally, a case study on landslide-monitoring data mining is given. The results show that the proposed method can well deal with randomness and fuzziness during the process of spatial data mining.

M.Gh. Negoita et al. (Eds.): KES 2004, LNAI 3215, pp. 513–520, 2004. © Springer-Verlag Berlin Heidelberg 2004

1 Introduction

There are uncertainties in spatial data mining. People are faced with large amounts of spatial data but are short of knowledge, which promotes spatial data mining [1]. Uncertainties are a major component of spatial data quality, and a number of methods have been tried to deal with their elements, measurement, modeling, propagation, and cartographic portrayal [4]. Uncertainties are inherent in most data capture and data analysis because of the limitations of current instruments, technologies, capital, and human skills. Because spatial data are the objects of spatial data mining, uncertainties are brought into spatial data mining along with the spatial data from the very beginning [5]. New uncertainties then arise during the mining process itself. Discovering a small amount of knowledge from a large amount of data is an uncertain process because of the various mining angles, scales, and granularities [3]. The indices of the discovered knowledge, e.g. the degrees of interest, support, and confidence, are also uncertain. These uncertainties may directly or indirectly affect the quality of spatial decision-making based on spatial data mining.

However, uncertainties have not been addressed to the same degree in spatial data mining itself [6]. Although there are methods and techniques for spatial data mining, and for spatial data uncertainties [5], each has developed in its own direction. First, most of the existing models describe only some specific situation. It is difficult for them to deal with the case where more than one uncertainty appears at the same time, e.g. both fuzziness and randomness. In fact, cases with several uncertainties often occur in spatial data mining. Second, some models may be far beyond the comprehension of common users. Without enough background knowledge, these users may have difficulty making sense of the exact nature of the uncertainty that an expert specifies. Third, transforming between a qualitative concept and its quantitative data is an essential issue for spatial data mining. Commonly, the transition models are rigidly specified and too certain, which conflicts with the human recognition process. Fourth, almost none of the existing models can deal well with the uncertainties in spatial data mining, and the integration of spatial data mining with spatial data uncertainties is rarely seen. To continue enjoying its success, spatial data mining should consider the uncertainties carefully, and the theories for handling them may have to be studied further.

2 Uncertainties Inherent in Spatial Data Mining

Spatial uncertainties indicate the degree to which the observed entities are unknown. In spatial data mining, they may arise from the objective complexity of the real world, the subjective limitation of human recognition, the approximation inherent in computing machinery, the shortcomings of computational techniques and methods, the amalgamation of heterogeneous data, the discovery, representation and interpretation of knowledge, and so on. During the process of spatial data mining, the original uncertainties in spatial data may be propagated from the beginning to the end, and they are also affected by the scale, granularity and sampling used in spatial data mining. These uncertainties have to be identified rather than the data being presented as correct [5].

First, there are many sources and causes of uncertainties, e.g. instruments, environments, observers, projection algorithms, slicing and dicing, coordinate systems, image resolutions, spectral properties, temporal changes, etc. Spatial data stored in databases describe and represent the spatial entities of an infinitely complex world, approximating them with binary digits. The spatial database is only an abstracted representation with uncertainties. Because it works with the spatial database as a surrogate for the real entities, spatial data mining cannot avoid the uncertainties.

Second, spatial data mining is an uncertain process. In a computerized spatial system that observes and analyzes the same spatial entities at different levels of granularity, and/or in different worlds of different granularities, it is common to have to use data that are less detailed than one would like, and some data are further eliminated when the spatial data are edited, stored, and analyzed. The unknown knowledge is refined with a high abstraction level, small scales, and small granularities, whereas the existing data are coarse with a low abstraction level, big scales, and big granularities. Sampling creates a representation from limited data, leaving uncertainty as to what actually exists between the sample points. From the same dataset, different knowledge may be mined when different people apply the same technologies, or when the same people apply different technologies.


Third, there are uncertainties in knowledge representation. The discovered knowledge is unknown in advance, potentially useful, and ultimately understandable. Knowledge uncertainty arises when roll-up or drill-down is carried out in spatial data mining. There is also a gap to be bridged between the rigidity of computerized spatial data and the uncertainty of spatial qualitative concepts, i.e. the transition between a qualitative concept and its quantitative data.

Fourth, uncertainty takes various forms, e.g. randomness, fuzziness, chaos, positional uncertainty, attribute uncertainty, and incompleteness. For example, randomness concerns an event that is clearly defined but does not happen every time, whereas fuzziness is the indeterminacy of a concept whose extent cannot be defined exactly.

3 Cloud Model on Randomness and Fuzziness [2]

A cloud model is a mathematical model of the uncertainty transition between a linguistic term of a qualitative concept and its numerical representation. It is named after the natural cloud in the sky because both have a visible overall shape but are fuzzy in detail. A cloud is not a membership curve but is composed of many cloud-drops, each of which is a stochastic mapping of the qualitative fuzzy concept into the discourse universe. In addition, each cloud-drop carries a degree that specifies how well it represents the qualitative concept when this one-to-many transition is carried out. The cloud model integrates fuzziness and randomness via three digital characteristics {Ex, En, He} (Fig. 1).

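[Figure: cloud of drops plotting the certainty degree CT(x) (vertical axis, 0 to 1) against x (horizontal axis), annotated with Ex, 3En, He, and 9mm]

Fig. 1. {Ex, En, He} of the linguistic term “displacement is 9 millimeters around”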

In the discourse universe, Ex (Expected value) is the position corresponding to the center of gravity of the cloud, whose elements are fully compatible with the spatial linguistic concept; En (Entropy) is a measure of the concept coverage, i.e. a measure of the spatial fuzziness, which indicates how many elements can be accepted by the spatial linguistic concept; and He (Hyper-Entropy) is a measure of the dispersion of the cloud-drops, which can also be considered the entropy of En.


Cloud generators may be forward or backward with respect to the integrated set {Ex, En, He}. Given {Ex, En, He}, the forward cloud generator can produce as many cloud-drops as desired, which may be used to visualize the discovered knowledge. The input of the forward cloud generator is {Ex, En, He} and the number of cloud-drops to be generated, N, while the output is the quantitative positions of N cloud-drops in the data space and the certainty degree with which each cloud-drop represents the linguistic term. Conversely, the backward cloud generator mines {Ex, En, He} from cloud-drops specified by many precise data points, which discovers knowledge from the given spatial database. The input of the backward cloud generator is the quantitative positions of N cloud-drops, xi (i=1,…,N), and the certainty degree with which each cloud-drop represents a linguistic term, yi (i=1,…,N), while the output is {Ex, En, He} of the linguistic term represented by the N cloud-drops.

During the process of knowledge discovery with the cloud model, the quantitative data first produce several essential cloud models. Then the roll-up is carried out level by level, and the linguistic atoms become linguistic terms, and further concepts. The higher the roll-up, the more generalized the qualitative concept. A concept that attracts interest, matches the demand, and supports decision-making becomes knowledge. The top of the hierarchy in spatial data mining is the most generalized knowledge, while the bottom is the objective data in the spatial database. The roll-up and drill-down in spatial data mining are implemented by virtual cloud models, i.e. the floating cloud, synthesized cloud, resolved cloud, and geometric cloud.
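The two generators described above can be sketched as follows. The text does not spell out the algorithms, so this sketch assumes the commonly used normal cloud model: each drop draws its own entropy En' from a normal distribution around En with standard deviation He, then draws a position x around Ex with standard deviation |En'|, and receives the certainty degree y = exp(-(x - Ex)^2 / (2 En'^2)). The function names, the `y_max` cutoff, and the sample values of En and He are illustrative assumptions rather than details from the paper.

```python
import math
import random
import statistics


def forward_cloud(ex, en, he, n, rng=None):
    """Return n cloud-drops (x, y) for {Ex, En, He}, assuming a normal cloud."""
    if rng is None:
        rng = random.Random()
    drops = []
    while len(drops) < n:
        # He randomizes the entropy: every drop gets its own spread En'.
        en_i = rng.gauss(en, he)
        if en_i == 0.0:
            continue  # degenerate spread, draw again
        x = rng.gauss(ex, abs(en_i))
        # Certainty degree with which x represents the linguistic term.
        y = math.exp(-((x - ex) ** 2) / (2.0 * en_i ** 2))
        drops.append((x, y))
    return drops


def backward_cloud(drops, y_max=0.99):
    """Estimate {Ex, En, He} from drops (x_i, y_i) by inverting the normal cloud.

    y_max is an assumed cutoff: drops with y close to 1 sit almost on Ex,
    where the inversion below divides by a value near zero.
    """
    ex = statistics.fmean([x for x, _ in drops])
    # Solve y = exp(-(x - Ex)^2 / (2 En'^2)) for each drop's own En'.
    en_samples = [
        abs(x - ex) / math.sqrt(-2.0 * math.log(y))
        for x, y in drops
        if 0.0 < y < y_max
    ]
    en = statistics.fmean(en_samples)  # En: mean of the recovered spreads
    he = statistics.stdev(en_samples)  # He: dispersion of those spreads
    return ex, en, he


if __name__ == "__main__":
    # Hypothetical term "displacement is 9 millimeters around";
    # the En and He values here are made up for the demonstration.
    drops = forward_cloud(9.0, 1.5, 0.1, 20000, random.Random(0))
    print(backward_cloud(drops))
```

Running the forward generator and feeding its drops to the backward generator should recover values close to the original {Ex, En, He}, which is a quick sanity check on both directions of the transition.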

4 A Case Study

The spatial database holds 1 GB of displacement data on the Baota landslide, on which 2,000 people live. The attributes dx, dy and dh are the measured displacements of the landslide-monitoring points in the X, Y and H directions. Baota landslide data mining involves uncertainties, e.g. randomness and fuzziness, and different people may discover different rules with different techniques. In the following, all spatial knowledge is discovered from the databases with dx.

[Figure: concept tree with root “displacement”; concept nodes “very small”, “small”, “smaller”, “common”, “big”, “bigger”, “very big”; leaf nodes “5 mm around”, “9 mm around”, “18 mm around”, “27 mm around”, “36 mm around”, “45 mm around”, “54 mm around”, ……]

Fig. 2. Pan-concept hierarchy tree of different displacements


From the observed values, the backward cloud generator can mine the Ex, En and He of the linguistic term indicating the average level of the landslide displacements. Based on landslide-monitoring characteristics, let the linguistic concepts “smaller (0~9 mm), small (9~18 mm), big (18~27 mm), bigger (27~36 mm), very big (36~50 mm), extremely big (50 mm and over)” for Ex, “lower (0~9), low (9~18), high (18~27), higher (27~36), very high (36~50), extremely high (50 and over)” for En, and “more stable (0~9), stable (9~18), instable (18~27), more instable (27~36), very instable (36~50), extremely instable (50 and over)” for He respectively depict the displacements, scattering levels and stabilities of the displacements. Then, the linguistic terms of different displacements on dx, dy and dh may be depicted by the conceptual hierarchy tree in the conceptual space (Fig. 2). Fig. 3 presents the cloud models of Fig. 2 in the discourse universe.
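The mapping from a mined {Ex, En, He} to the linguistic terms defined above can be sketched as a simple interval lookup. The thresholds and terms come from the text; the function names, the half-open boundary handling, and the sample input are assumptions made for illustration.

```python
import bisect

# Upper bounds of the first five intervals; 50 and over falls in the last term.
BOUNDS = [9, 18, 27, 36, 50]

EX_TERMS = ["smaller", "small", "big", "bigger", "very big", "extremely big"]
# "extremely high" follows the wording used in Table 1.
EN_TERMS = ["lower", "low", "high", "higher", "very high", "extremely high"]
HE_TERMS = ["more stable", "stable", "instable", "more instable",
            "very instable", "extremely instable"]


def to_term(value, terms):
    """Map a non-negative magnitude to its linguistic term.

    Assumption: intervals are half-open [low, high), so a value of exactly 9
    belongs to the second term; the source does not say which side owns a
    boundary.
    """
    return terms[bisect.bisect_right(BOUNDS, value)]


def describe(ex, en, he, direction="south"):
    """Compose a rule sentence in the style of Table 1."""
    return (f"The displacements are {to_term(ex, EX_TERMS)} {direction}, "
            f"{to_term(en, EN_TERMS)} scattered and {to_term(he, HE_TERMS)}.")


if __name__ == "__main__":
    # Hypothetical characteristics, not measured values from the case study.
    print(describe(20.0, 20.0, 20.0))
```

Because the lookup only compares against five thresholds, swapping in a different set of intervals for another landslide would mean changing `BOUNDS` and the term lists, not the logic.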

[Figure: cloud models for the 1st to 4th levels of the concept hierarchy (vertical axis) plotted against Displacement (mm) (horizontal axis, ticks from 0 to 63 in steps of 9)]

Fig. 3. The cloud models of pan-concept hierarchy tree of different displacements

It can be seen from Fig. 2 and Fig. 3 that the nodes “very small” and “small” both have the son node “9 millimeters around”, so the concept hierarchy tree is a pan-tree structure.

In the context of the cloud model, the qualitative concept may be derived from the quantitative data via the cloud generators. Based on the obtained {Ex, En, He}, the forward cloud generator can reproduce as many cloud-drops as desired, i.e. produce synthetic values of landslide displacements. These cloud-drops are reproduced with randomness, and they can be treated as virtual monitoring of the Baota landslide under the given conditions. The virtual monitoring data may further fill the gaps where monitoring points cannot be established on some typical surfaces of the Baota landslide. With the forward and backward cloud generators, the displacement level of the monitoring points is extended to the whole landslide. This may approximate the movement rules of the Baota landslide well. Thus the rules on the Baota landslide in the X direction can be discovered from the databases in the conceptual space (Table 1). Because large amounts of continuous data are replaced by discrete linguistic terms in Table 1, the efficiency of spatial data mining can be improved. Meanwhile, the resulting knowledge will be stable because the cloud model captures the randomness and fuzziness of each concept. Fig. 4 visualizes the displacement rule of each point with 30,000 cloud-drops, where the symbol “+” marks the original position of a monitoring point, different rules are represented by different pieces of cloud, and the level of color in each piece of cloud denotes the discovered rule of a monitoring point.

Table 1. The rules on Baota landslide-monitoring in X direction

Points  Rules
BT11    The displacements are big south, high scattered and instable.
BT12    The displacements are big south, high scattered and very instable.
BT13    The displacements are small south, lower scattered and more stable.
BT14    The displacements are smaller south, lower scattered and more stable.
BT21    The displacements are extremely big south, extremely high scattered and extremely instable.
BT22    The displacements are bigger south, high scattered and instable.
BT23    The displacements are big south, high scattered and extremely instable.
BT24    The displacements are big south, high scattered and more instable.
BT31    The displacements are very big south, higher scattered and very instable.
BT32    The displacements are big south, low scattered and more instable.
BT33    The displacements are big south, high scattered and very instable.
BT34    The displacements are big south, high scattered and more instable.

Fig. 4 indicates that all monitoring points move toward the Yangtze River, i.e. south, or along the negative X axis. Moreover, the displacements differ from each other. BT21 is extremely big south, extremely high scattered and extremely instable, followed by BT31. At the other extreme, BT14 is smaller south, lower scattered and more stable. In short, the displacements of the back part of the Baota landslide are bigger than those of the front part with respect to the Yangtze River, and the biggest exception is BT21.

When the Committee of Yangtze River investigated the region of the Baota landslide, they found that the landslide had moved toward the Yangtze River. Near the monitoring point BT21, a small landslide had taken place, and two big rifts remain. In particular, the rift in the wall of the farmer G. Q. Zhang’s house is nearly 15 millimeters. These findings closely match the discovered spatial knowledge and indicate that the randomness- and fuzziness-based method of spatial data mining in the context of the cloud model is credible.
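The ordering stated above can be checked mechanically against Table 1. The sketch below transcribes the table as tuples of linguistic terms and sorts the points by the ordinal position of each term. Comparing displacement first, then scattering, then stability is an assumed tie-breaking order, and the names are illustrative.

```python
# Ordinal scales, from the mildest to the most severe term.
DISPLACEMENT = ["smaller", "small", "big", "bigger", "very big", "extremely big"]
SCATTER = ["lower", "low", "high", "higher", "very high", "extremely high"]
STABILITY = ["more stable", "stable", "instable", "more instable",
             "very instable", "extremely instable"]

# Table 1 transcribed as (displacement, scattering, stability) per point.
RULES = {
    "BT11": ("big", "high", "instable"),
    "BT12": ("big", "high", "very instable"),
    "BT13": ("small", "lower", "more stable"),
    "BT14": ("smaller", "lower", "more stable"),
    "BT21": ("extremely big", "extremely high", "extremely instable"),
    "BT22": ("bigger", "high", "instable"),
    "BT23": ("big", "high", "extremely instable"),
    "BT24": ("big", "high", "more instable"),
    "BT31": ("very big", "higher", "very instable"),
    "BT32": ("big", "low", "more instable"),
    "BT33": ("big", "high", "very instable"),
    "BT34": ("big", "high", "more instable"),
}


def severity(point):
    """Return the ordinal ranks of a point's three terms as a sortable tuple."""
    displacement, scatter, stability = RULES[point]
    return (DISPLACEMENT.index(displacement),
            SCATTER.index(scatter),
            STABILITY.index(stability))


# Most severe point first (assumed lexicographic comparison of the ranks).
ranked = sorted(RULES, key=severity, reverse=True)

if __name__ == "__main__":
    for point in ranked:
        print(point, RULES[point])
```

Under this ordering BT21 comes first, BT31 second, and BT14 last, which agrees with the reading of Fig. 4 given in the text.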


Fig. 4. Rules on Baota landslide-monitoring points

5 Conclusions

Uncertainties are inherent in spatial data mining. This paper proposed a method to handle randomness and fuzziness simultaneously in spatial data mining, using the cloud model to realize the transition between a qualitative concept and its quantitative data. The method includes the algorithms of the forward and backward cloud generators in the context of the three digital characteristics {Ex, En, He}. The case study of Baota landslide monitoring showed that the method is practical and credible, and that the discovered hierarchical knowledge can match different demands from different users.

Acknowledgements

This study is supported by funds from the National Natural Science Foundation of China (70231010), Wuhan University (216-276081), and the National High Technology R&D Program (863) (2003AA132080).

References

1. ESTER M. et al., 2000, Spatial data mining: databases primitives, algorithms and efficient DBMS support. Data Mining and Knowledge Discovery, 4, 193-216
2. LI D.Y., 1997, Knowledge representation in KDD based on linguistic atoms. Journal of Computer Science and Technology, 12(6): 481-496
3. MILLER H.J., HAN J., 2001, Geographic Data Mining and Knowledge Discovery (London: Taylor & Francis)
4. VIKTOR H.L., PLOOY N.F.D., 2002, Assessing and improving the quality of knowledge discovery data. In: Data Warehousing and Web Engineering, edited by Becker S. (London: IRM Press), pp. 198-205
5. WANG S.L., 2002, Data field and cloud model based spatial data mining and knowledge discovery. Ph.D. Thesis (Wuhan: Wuhan University)
6. ZEITOUNI K., 2002, A survey of spatial data mining methods databases and statistics point of views. In: Data Warehousing and Web Engineering, edited by Becker S. (London: IRM Press), pp. 229-242
