7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Edited by M. Caetano and M. Painho.
Matching imperfect spatial data Ana-Maria Olteanu, Sébastien Mustière and Anne Ruas COGIT Laboratory, Institut Géographique National 2 av. Pasteur, 94160 Saint-Mandé, France Tel.: + 33; Fax: + 001 555 832 1156
[email protected];
[email protected];
[email protected]
Abstract Currently, many independent geographical databases exist in the same area and users need to fusion various information coming from these databases. In order to integrate databases, redundancy and inconsistency between data should be identified. Many steps are required to finalise the databases integration and one of them is automatic data matching. In this paper we study which knowledge is required to guide the matching process and more particularly, how to manage uncertain knowledge. Firstly, we analyse how interactive matching is performed and we identify basic knowledge used in this process: objects with similar location, shapes and attributes are matched. Knowledge used is imperfect and manipulated data hold a certain degree of errors and vagueness. We classify the various kinds of imperfection information used and we distinguish imprecision uncertainty and incompleteness. For each class of imperfection, we choose the appropriate theory to model it. Finally, we illustrate imperfection in spatial data through the results of experiments of data matching through the comparison of toponyms in the context of ethnographical data. Keywords: integration, data matching, uncertainty and imprecise knowledge
1 Introduction Many applications require the integration of geographical databases of different spatial and semantic resolutions covering the same area. Currently, there is a multiplicity of geographical information to describe the same reality. This multiplicity comes from the increasing number of geographical data, because data capture becomes easier, and because there are increasingly needs for precise geographical data and updates. Data is represented at different scales (for example a river can be modelled with a linear or a surface geometry), they are means to use in various applications (visualisation, analyses) and they also come from different modes of acquisition (camera, methods). Nevertheless, we realise that there is independence between databases that represent the same reality (Devogele, 1997; Walter and Fritsch, 1999). This independence affects both data users and producers. Therefore, integration is seemed to be the issue to that problem in order to facilitate both data producers needs (updates, quality evaluation and incoherence detection) and data users needs (study of several adjacent zones, analyses which mixed different points of view). Many researches in the last few years concern the integration of geographical databases. Geographical databases integration requires a detailed analysis of the specifications, data schema and data itself (Devogele, 1997; Sheeren, 2005).
694
7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Edited by M. Caetano and M. Painho.
The integration process is usually carried out in three stages as we show in the figure 1: • Pre-integration: this stage consists in the enrichment of the schemas sources and the modeling standardisation in order to choose a common schema. This enrichment is especially needed since implicit information is present in the geographical data. • Matching: this stage is essential for integration of two databases and it contains schema matching and data matching. Schema matching and data matching are not independent things. Therefore, schema matching needs data and specifications analyses. • Integration: this stage is about defining an integrated schema and data specification, and actually populating the integrated DB.
Mise en Schema matching
specifications
DB1
Analyse and preparation of the individual databases
Définition d’un Integrated schéma et de schema and spécifications
Integrated DB Mise en Data matching
Integrated data
DB2
Figure 1 General process of integration of databases (Sheeren et al., 2004).
The matching process consists in linking the databases elements that represent the same reality. However, these elements can be considered as imperfect knowledge. Many authors are analysing different forms of imperfection (see section 3). In addition, these imperfections are especially important in geographical data, because these are imprecise data by nature. In this context, we focus on the matching process and our aim is to study how uncertainty could be managed in the matching process in order to improve it. The outline of this paper is as follows. In Section 2 the related work is discussed. Section 3 describes the imperfect data. A taxonomy of imperfect data and the link between the concepts and methods are doned. The tests and their results are given in Section 4. Finally, section 5 concludes the paper and gives some future issues.
2 Related work Generally, in order to carry out the process of integration, two databases that represent the same area of the real world are given. However, each database has its own data model which abstracts reality in a certain way and thus objects are represented in different ways. In this context, data matching (Walter and Fritsch, 1999) is a tool that can be used in order to find corresponding elements in different databases. The matching process is used in many fields handling geographical information such as integration of geographical data (Devogele, 1997; Volz, 2006; Mustière, 2006), automated updates propagation from one database to another (Badard, 1998), quality analysis (Bel Hadj
695
7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Edited by M. Caetano and M. Painho.
Ali, 2001) or difference and inconsistency detections between databases (Bruns and Egenhofer, 1996; Sheeren, 2005). Data matching has been studied by several researchers for different purposes and can be seen from two different points of views: schema matching and data matching. Practically, schema matching and data matching depends each other. Thus, in order to undertake schema matching, we should analyse data (Voltz, 2005) and, we observed that data matching depends on schema matching. •
•
•
Many matching methods developed in the literature have revealed good results and efficiency on certain data types of selected tests areas. Matching algorithms depend on the geometric types of the objects to match, the topological relations, the databases scales, and the last but not the least, the semantic attributes. Correspondences between homologous objects are given as links. Links are created between objects in the different databases that are assumed to represent the same real-world object. The possible cardinality of the links are 1:1, 1:m and n:m. A matching evaluation step is necessary in order to validate matching process. In the following sections, we present some existing methods for matching different geographical databases. The methods will be differentiated according to the types of objects. We distinguish two important methods: methods for isolated data and methods for networks.
2.1 Methods for isolated data Isolated data are data that are relatively independent from each other. The methods for isolated data generally consider the geometric and semantic properties of the objects to match. In (Bel Hadj Ali, 2001) a matching algorithm proposed for polygonal data uses geometric tools as intersection, surface distance, etc. This algorithm provides the m:n links, which are often encountered in data at different scales. The quality of the algorithm is evaluated using position and shape measures.
[Beeri et al, 2004] proposed four location-based algorithms for two different databases, which are obtained by different sources and where, moreover the geometry type is different i.e. punctual and polygonal data. Geometric criteria as nearest neighbours are used, and the approach relies on a probabilistic consideration. The performance is measured in terms of recall and precision and it depends on the density of the databases and the degree of overlap amongst them. In (Gombosi et al., 2003) an algorithm for determining differences between two sets of polygons was developed. This approach has been applied in land cadastres in order to detect changes over time. A semantic matching is also possible for isolated data when the toponym is present in the databases. Many works, which compare string distance metrics, were proposed in the literature (Levensthein, 1965; Cohen et al., 2003). In (Dunkars, 2003) a general method for matching databases at different scales was presented. Buildings, roads and forests data are used in order to test the algorithm performance and for each data different criteria are used.
696
7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Edited by M. Caetano and M. Painho.
2.2 Methods for networks matching Many matching algorithms for linear networks have been developed in the literature. One of them is the one of (Devogele, 1997) considering networks at different scales. This matching algorithm was tested on road networks from different sources but also on other networks like rivers, railways, electric lines, etc. (Mustière, 2006). The network-matching algorithm has three main steps as follows: • Pre-matching of nodes: for each node of the less detailed database, we look for candidates nodes in the more detailed database. A geometric criteria is used. • Pre-matching of arcs: for each arc of the more details database, we look for candidates nodes in the less detailed database. Geometric and semantic criteria are used. • Nodes and lines matching: this step is based on the results of the two pre- matching steps. Geometric, topologic and semantic criteria are used.
(Walter and Fritsch, 1999) developed a method which was tested on road networks and that matched geometric elements in two different databases defined at similar scales. The comparison of different elements is based on geometry and topology criteria. It relies on a statistical approach. More generally, many matching algorithms for two network databases at similar scale(Voltz 2006; Haunert, 2005; Bouziani, 2003;) or different scale (Devogele, 1997; Zhang et al., 2005) were proposed. Generally, matching process is based on different criteria and it is carried out in several steps. 2.3 Our approach All the methods seen, either applied to ponctual, linear or polygon data, or employed to match two databases at different scales or at the same scale, use geometrical, semantic, or topological criteria. All the criteria are usualy applied one by one and, more importantly, most approaches do not take into account uncertainty and imprecision of these different criteria.
Therefore, our objective is to find a matching approach which will take into account all the criteria at the same time and also imperfection both in the matching process and the evaluation process.
3 Imperfection in spatial data Data in general, and particularly spatial data, are not perfect. There are some concepts like imprecision, uncertainty, vagueness, error, etc., that are used in the field of geographic information. There is no standard definition of these terms so that conflicts may appear between the definitions used in the field of geographic information and artificial intelligence (AI), but also between the ways of using these concepts by different authors in the spatial formation field. In AI, we distinguish three types of imperfection (Bouchon-Meunier, 1995; Colot, 2000; Masson, 2005): • Imprecision: it is related to the difficulty of expressing clearly and precisely the information e.g (the surface of a forest is approximately of 3 km2). • Uncertainty: it is related to a doubt about the information’s validity e.g (Is this forest really a conifers forest?). • Incompleteness: it refers to the absence of information e.g. (the toponym of the forest is not filled up).
697
7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Edited by M. Caetano and M. Painho.
Data imperfection has been the core of many researches in the last few years. Generally, it is based on mathematic theories like fuzzy set theory (Zadeh, 1965), possibility theory (Dubois et al., 1985) and evidence theory. These are non-probabilistic approaches. The fuzzy set theory is used to manage the imprecision and it also includes the concept of vagueness. The possibility theory (Bouchon-Meunier, 1995) allows modelling both imprecision and uncertainty and it uses two set-functions: a possibility function Π(p), which represents the degree of possibility of p and a necessity function N (p), which denotes the degree of belief of the proposition p. Evidence theory (Colot, 2000; Colot, 2002) uses the basic belief assignment (bba). The mass of belief m (p) is calculated for each source of information and it represents how strongly the evidence supports p. From this mass of belief, a belief function Bel (p) and a plausibility function Pl (p) are defined. The belief function (Colot, 2002) is a measure of one’s belief that the hypothesis p is true. The plausibility function measures how much we do not doubt of the proposition p. In the field of geographic information, the term uncertainty is generally used in order to describe imperfect data. Many taxonomies are carried out and in most of the cases, the uncertainty is the base node of the taxonomy (Fisher, 2003; Fisher et al, 2005). The proposed taxonomy aims at describing what kind of uncertainty appears in the data when the database is instantiated. This taxonomy is guided by the distinction between uncertainties arising from well-defined objects and poorly defined objects. In those taxonomies concepts like error, vagueness, ambiguity, discord and non-specificity are employed. After having analysed our data, we think that the concepts of the imperfect data in AI are sufficient for identifying all types of data imperfection that interest us in the matching process. Let us suppose that we are handling two different databases, DB1 and DB2. We consider that DB1 is more detailed than DB2. In this example, we want to present some imperfections, which we must be taken into account during the matching process: • Let us suppose that we use the nature of the objects in order to match the databases. We can match an object in DB1 whose nature is "peak" with an object in DB2 whose nature is "summit". These two concepts are different but close. In this case, we are confronted with a problem of uncertainty that should be managed in the matching process. • Let us suppose the databases precisions are different. When we want to match two objects, it is not always possible to match it by taking into account the nearest neighbor. In this case, we are handling with a problem of imprecision. • Let finally suppose that we could have objects belonging to DB1, which are nonmatched. In this case, the problem is that DB1 is more detailed than DB2 and thus the object, which represents the same real-world entity, do not exist in DB2.
4 Case study In this section, the aim is to illustrate imperfection in spatial data and how it affects data matching in the context of ethnographical data from quai Branly Museum (Olteanu, 2005). 4.1 Integrating ethnographical data In our case study, we investigated ethnological data, including the art objects owned by the quai Branly Museum and collected in Alaska State. Objects were collected in certain places and they belong to one or more Alaska native people that have a geographical localisation partly known through paper maps. Native people’s geographical localisation changes both in
698
7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Edited by M. Caetano and M. Painho.
space and in time. More, this information coming from the different data sources is heterogeneous and imperfect. In this context, our goal is to detect uncertainties and inadequacies in the current database of the quai Branly Museum. In order to undertake it, we dispose of: • A database without geometry describing the ethnographical objects. For each object, a toponym indicates the place where the object was collected, an ethnonym describes to which native people it belongs to and a date represents the date when the object was collected. • A native people map that represents the reality in a simplified way, on a given date (see figure 2)). • The spatial data, which is considered as reference data and that contain counties, town, rivers, coastlines and roads belonging to Alaska State. Our aim is to integrate the three source of information in order to find out the inconsistencies.
Textuel data
Native people map
Toponym Ethnonym Date
Schema Integration
Spatial Data
Analyse of data
Figure 2 Schema integration of the sources information.
4.2 Analysis of imperfection in the data In this section, we present and analyse the different imperfection that can be observed in our information sources. 4.2.1 Imperfection in the museum database The imperfections in the museum database generally concern the location of the object, its belonging to a native people and the date when it was found. Following our imperfection taxonomy (see Section 3) the next observations can be made: • The location may consist of a reference to a geographic feature having a spatial extent like "along the Yukon river", "in the north of Alaska". A collection date could be "end of the 19° century" or "19° century". In these examples, we can notice an
699
7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Edited by M. Caetano and M. Painho.
•
•
imprecision problem. In the first example, the imprecision is caused by the fact that location is expressed in the natural language. Terms like "along " and "in the north of" can be interpreted in different ways. In the second example, the imprecision is caused by the terms "end of" and "century". We can also notice an imprecision caused by a semantic difference between the toponyms. This difference may be due to a typing error (e.g. "Premorska"and "Promoroska"), to a synonymy (e.g. "Iles Aléoutienne" and "Aleut Islands"), or more it is caused by the fact that for an object we have the official name and the popular name, (e.g., " Aleut\ Sugpiak"). Location may be also expresed by "Cape Nome or Cape Darby", "Kodiak Island or Kodiak Archipel". The ethnonym could be "Yupik or Inuit", "Inuit or Tutsagmiut". In this case, we are confronted with a problem of uncertainty. Firstly, we have an uncertainty caused by a doubt (e.g. "Cape Nome or Cape Darby" and "Yupik or Inuit") and secondly we have an uncertainty caused by the level of detail (e.g. "Kodiak Island or Kodiak Archipel" and " Inuit or Tutsagmiut"). Tutsagmiut native people is a subdivision of Inuit native people. The toponym, the ethnonym and more often the dates of collection are missing. So, we are also confronted with an incompleteness problem.
4.2.2 Imperfection in native people map The native people map (see figure 4 a) was elaborated by an ethnologist and contains the localisation of the Alaska native people. The imperfections are caused by the level of detail that generate two types of imperfection. On the one hand, the granularity of data generates incompleteness problems. These are the subdivisions of the native people that can not be represented on the map. On the other hand, the geometric precision of data (see figure 3) generates problems of imprecision. The limits of the native people change during the time, so it is an uncertainty problem.
Figure 3 The differences of precision between ethnological and spatial data.
4.2.3 Imperfection in spatial data references In spatial data references, imperfections are caused by the level of detail, and we also have all types of imperfections (see section 4.2.2). The granularity can be explained by the absence of the geographical entities, e.g. the archipelago subdivisions or county subdivisions.
700
7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Edited by M. Caetano and M. Painho.
4.3 Consequences on matching process In order to integrate our data, we have to match them. Two matching process are performed, as described below. 4.3.1 Databases matching The first step is a matching process between the map of native people and spatial reference data (see figure 4a and 4b). The aim is to obtain a map of native people with a geographic reference (see figure 4c). The level of detail is different. In our case, this stage is carried out manually using a digitalisation tool because we do not have a matching tool to do it automatically. An automatic matching tool is conceivable but it must be able to deal with the imperfections in the spatial data.
a)
b)
c)
Figure 4 Databases to be integrated (a et b) and integrated data (c).
4.3.2 Toponyms matching The second step is a matching process between spatial reference data and ethnographical data. In order to handle imprecision in the toponyms, we authorise small differences between the strings to be matched, by using the Edit distance (Levenshtein, 1965) and the string comparison does not take into account accents and case.
Before applying matching process, we prepared our data. Firstly, we observed that there are some toponyms in the ethnographical data that are not present in the spatial data. This is caused by the fact that spatial data do not have all administrative divisions. For example, we have an object, which was found in Afognak Island. In our spatial data, the subdivision stops at the archipelago division. Consequently, we consider that the localisation of the object is Afognak Archipelago. This manual matching has been done to overcome an incompleteness problem. In order to match entities that present an imprecision e.g. "along the Kuskokwin River", we considered the toponym " Kuskokwin ", and then, a core area and a broad boundary are associated to it (Tøssebro, 2002; Schneider, 1999), as we see in the figure 5.
701
7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Edited by M. Caetano and M. Painho.
Kuskokw Figure 5 Matched objects.
The links cardinality may be 0-1, 1-1, or 1-2. Thus, we observed that the matching result might be uncertain in the case of 1-2 links, i.e. there are toponyms in the ethnological database that have two corresponding entities in the spatial database, e.g. "Porte Clarence" that can be a city and a lake. In our case study, we did not treat the uncertainties. In this study, we saw that data present many imperfections. Consequently, a manually data preparation step was needed. As a result, a theory is necessary in order to treat automatically and if possible all types of imperfections.
5 Conclusion The purpose of our work is to develop a generic matching method by taking into account the imperfection in spatial data. In order to undertake it, firstly, we analysed the imperfect data and the methods that can be used to manage these imperfection, both in the field of AI and spatial information. Matching process results are influenced by the data imperfection and thus the matching can be more or less efficient. We think that appropriate taxonomy is the AI taxonomy, which classifies the imperfection in three levels: imprecision, uncertainty and incompleteness. We also consider that the evidence theory is useful in our case because it allows to model both imprecision, uncertainty and incompleteness. The theory is particularly adapted because it allows to combine the sources information and to make combined hypothesis. Secondly, we presented in Section 4 a study that illustrates the different types of imperfection encountered in spatial data and how it affects data matching. Finally, our forthcoming works consist in using the evidence theory functions in order to lead the matching process.
References Badard, T., 1998, "Extraction des mises à jour dans les BDG – De l’utilisation de méthodes d’appariement", Revue Internationale de géomatique, 8(1-2), 1998, pp. 121–147. Beeri, C., Kanza, Y., Safra, E. and Sagiv, Y., 2004, " Object Fusion in Geographic Information Systems", In Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004. Bel Hadj Ali A., 2001, Qualité géométrique des entités géographiques surfaciques –Application à l’appariement et définition d’une typologie des écarts géométriques, PhD dissertation, University of Marne la vallée, 2001. Bouziani, M., 2003, "Définition d’une méthode d’extraction des mises à jour de l’information spatiale dans un réseau routier en milieu urbain", In 2nd FIG Regional Conference Marakech, Morocco, 2-5 December, 2003.
702
7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Edited by M. Caetano and M. Painho.
Bruns, H.T. and Egebhofer M, 1996, " Similarity of Spatial Scenes", In Seventh International Symposium on Spatial Data Handling, Delft, the Netherlands, August 1996, London, pp.173-184, Taylor & Francis. Bouchon-Meunier, B., 1995, La logique floue et ses applications, Addison Wesley, Paris, 1995. Cohen, W.W., Ravikumar P. and Fienberg, S.E., 2003, "A Comparison of String Distance Metrics for Name-Matching Tasks", In Proceedings of the IJCAI, 9-10 August 2003, Acapulco, Mexico, pp. 73-78. Colot, O., 2000, Systèmes de perception d’informations incertaines - Application au diagnostic médical, HDR dissertation, University of Rouen. Devogele, T., Parent, C. and Spaccapietra, S., 1998, "On spatial database integration", In International Journal of Geographical Information Science, 12(4), 1998, pp. 335-352. Devogele, T., 1997, Processus d’intégration et d’appariement de bases de données Géographiques, Applications à une base de données routières multi-échelles, PhD dissertation, University of Versailles, 1997. Dubois, D and Prade, H, 1985, Théorie des possibilités : applications à la représentation des connaissances en informatique, Masson, Paris, 1985. Dunkars, M., 2003, "Matching of Datasets", In: Proceedings of the 9th Scandinavian Research Conference on Geographical Information Science, Espoo, Finland, 4-6 June 2003, pp. 67-78. Fisher P. F., "Models of uncertainty in spatial data", In Geographical Information System, vol. 1, Second Edition, 2003, pp. 191-203. Fisher P.F., Comber A., Wadsworth R., 2005, Nature de l’incertitude pour les données spatiales, In Qualité de l’information géographique, 2005, p. 49-64., Paris : Hermes et Lavoisier. Haunert, J.H, 2005, "Link based Conflation of Geographic Datasets", In 8th ICA Workshop on Generalisation and Multiple Representation, Coruna, 7-8 July 2005. Lefevre, E., Colot, O. and Vannoorenberghe, P., 2002, "Belief function combination and conflict management", Information Fusion, 2002, 3(2), pp. 149-162. Levenshtein, V., 1965, "Binary codes capable of correcting deletions, insertions and reversals ", In Doklady Akademii Nauk SSSR, 4 (163), 1965, pp.845-848. Masson, M-H., 2005, Apports de la théorie des possibilités et des fonctions de croyance à l’analyse de données imprécises, PhD dissertation, University of Camiègne, 2005. Mustière, S., 2006, " Results of Experiments on Automated Matching of Networks at Different Scales", In ISPRS Workshop, Multiple representation and interoperability of spatial data, Hanover, Germany, 22-24 February 2006, pp. 92-100. Olteanu A-M., 2005, Modélisation spatio-temporelle de données d’ethnographie, Master dissertation, IGN, 2005. Parent, C. and Spaccapietra, S., 2000, "Database integration: The key to data interoperability", In Advances in Object-Oriented Data Modeling, MIT Press, 2000. Schneider, M., 1999, "Uncertainty Management for Spatial Data in Databases: Fuzzy Spatial Data Types", In Advances in Spatial Databases, 6th International Symposium, Hong Kong, China, July 20-23, 1999, pp. 330-351. Sherren, D., Mustiere, S., Zucker, J-D., 2004, "How to Integrate Heterogeneous Spatial Databases in a Consistent Way?", In Conference on Advanced Databeses and Information Systems(ADBIS), Budapest, September 2004, pp. 364-378. Shereen, D., 2005, Méthodologie d’évaluation de la cohérence inter-représentations pour l’intégration de bases de données spatiales, PhD dissertation, University of Paris 6, 2005. Tøssebro, E., 2005, Representing uncertainty in spatial and spatiotemporal databases, PhD dissertation, Norwegian University of Science and Technology, 2002. Voltz, S., 2005, "Data-Driven Matching of Geospatial Schemas", In Proceedings of the International Conference on Spatial Information Theory, COSIT, Ellicottville, NY, September 2005, pp. 115-132. Voltz, S., 2006, "An Iterative Approach for Matching Multiple Representations of Street Data", In ISPRS Workshop, Multiple representation and interoperability of spatial data, Hanover, Germany, 22-24 February 2006, pp. 101-110. Zadeh, L.A., "Fuzzy sets", Information and Control, vol. 8, pp 338-353, 1965.
703
7th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences. Edited by M. Caetano and M. Painho.
Zhang, M., Shi, W., and Meng, L., 2005, "A generic matching algorithm for line networks of different resolutions", In ICA Workshop on Generalisation and Multiple Representation, Coruna, 7-8 July 2005. Walter, V. & Fritsch, D., 1999. "Matching Spatial Data Sets: Statistical Approach", In International Journal of Geographical Information Science, 1999, 13(5), 445-473.
704