GeoExpert – A Framework for Data Quality in Spatial Databases

Aditya Tadakaluru, Department of Computer Science, Western Kentucky University, [email protected]
Karla Andrew, Center for Water Resource Studies, Western Kentucky University, [email protected]

Abstract

The use of very large sets of historical spatial data in the knowledge discovery process has become a common trend, and to obtain better results from this process the data should be of high quality. We propose a framework for a data quality assessment and cleansing tool for spatial data that integrates the spatial data visualization and analysis capabilities of the ARCGIS Engine with the reasoning and inference capabilities of an expert system. In this paper, we explain the core architecture of the framework, the functionality of each module, and the implementation details of the framework.

1. Introduction

In today's world, integrating data with a spatial component and viewing the data in spatial terms is leading to the accumulation of large quantities of spatial data sets. Eventually, these huge sets of historical data will be used in a knowledge discovery process, and to obtain better results from that process the data should be of high quality. Data quality is a major problem the GIS community faces today. Building a data quality tool for geo-referenced data is not an easy task because of the complexity involved, especially in the visualization and analysis of spatial data. ARCGIS [6] is software with a rich set of spatial data visualization and analysis tools used for managing spatial data. We are trying to harness the spatial data visualization and analysis capabilities

Dr. Mostafa Mostafa, Department of Computer Science, Western Kentucky University, [email protected]
Dr. Andrew Ernest, Center for Water Resource Studies, Western Kentucky University, [email protected]

of the ARCGIS Engine and the reasoning and inference capabilities of an expert system in building a spatial data quality assessment and cleansing tool [16].

2. Data quality and spatial data

The main objective of developing this framework is to find outliers and anomalies. The quality of data is often measured in terms of the percentage of anomalies present in the given data [5]. Sometimes these anomalies can signal that something unusual is happening, but most of the time they occur because of hardware malfunction, garbling of data, or a value greater than the threshold value of the device [2]. Finding an anomaly in geo-referenced data requires spatial analysis capability.

The usual approach to building such data quality assessment and cleansing tools is to develop large amounts of custom code, which is not always a good choice for spatial data; a tool based solely on programming becomes difficult to build when the data must be handled in terms of geographic coordinates [2]. Moreover, a vast number of sub-domains in geography use Geographical Information Systems, and developing a separate tool for each domain, such as one for climate data and another for water quality data, is a cumbersome and tedious process. A tool for handling data quality in spatial databases should therefore be generic: able to handle any kind of spatial data, independent of domain.
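As a minimal illustration of the threshold-type anomaly described above, the following Java sketch flags samples whose measured value exceeds a limit. This is not the paper's code: the record fields and method names are hypothetical, and only the 0.010 limit matches the arsenic MCL example used later in the paper.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a minimal threshold check of the kind the
// framework encodes as expert-system rules. Field and method names
// are hypothetical assumptions, not the paper's actual API.
public class ThresholdCheck {
    record Sample(String siteId, String contaminant, double level) {}

    // Return the samples whose measured level exceeds the MCL.
    static List<Sample> findExceedances(List<Sample> samples, double mcl) {
        List<Sample> anomalies = new ArrayList<>();
        for (Sample s : samples) {
            if (s.level() > mcl) {
                anomalies.add(s);
            }
        }
        return anomalies;
    }

    public static void main(String[] args) {
        List<Sample> data = List.of(
                new Sample("GC-01", "arsenic", 0.004),
                new Sample("GC-02", "arsenic", 0.023));
        System.out.println(findExceedances(data, 0.010));
    }
}
```

In the framework itself, checks of this kind live in the rule base rather than in compiled code, which is what keeps the tool domain independent.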

3. Related work

FIELDS, an expert system for analyzing contaminated sediments, was developed by Yichun Xie and George D. Graettinger [3]. The FIELDS system focuses primarily on finding pollutant hot spots and analyzing contaminated sediments, which places it in the category of domain-specific tools. A Query Knowledge Database module for knowledge acquisition was developed as part of the FIELDS framework, and ArcView 2.1 [15] was used for the spatial data visualization and analysis parts of that framework. A rule-based expert system in the EPA rules domain was developed by Suresh Jayanty and Uta Ziegler using forward chaining and backward chaining engines [1].

Our proposed framework addresses all the major issues of data quality in spatial databases: calibration, threshold, missing data, extra data, and outliers. A forward chaining engine and a backward chaining engine constitute the expert system part of the framework. Suggesting a recommended action on the data, based on the ultimate use of the data, is one of the major contributions of the framework. Using a rule-based expert system shell gives sufficient control over the behavior of the expert system, and it makes the knowledge easy to update because the rule logic is separated from the GIS desktop application, which is another key point in our framework. The framework supports any kind of spatial data irrespective of domain, which is its major contribution. We use the ARCGIS 9.1 Engine, with which one can add GIS functionality to a native Java application, unlike the Avenue scripts used with ArcView; the power of the native Java language is also harnessed in this framework.

4. Expert System Framework for data quality in spatial databases

ARCGIS offers a rich set of tools for spatial data visualization and analysis. The ARCGIS Engine allows us to add GIS functionality to our application by extending ArcObjects, either through the ARCGIS Java SDK or the ARCGIS .NET API [5]. We use two expert system shells, one for forward chaining and one for backward chaining. The use of an expert system shell makes the framework completely domain independent: we provide the user with a tool that has an empty brain, in the form of an expert system shell, to which the user adds domain-specific

knowledge in the form of IF-THEN rules [16]. The user provides the data to be cleansed in the form of facts to the forward engine. Once the data sets with errors have been found, the backward engine starts to work by suggesting to the user which data cleansing technique to use, according to the ultimate purpose of the data [2]. Currently, part of this proposed framework is implemented on geo-referenced water quality data. We feed our expert system shell with rules that check the data for contaminants and their maximum contaminant levels (MCLs) [12]. Later in this paper we explain our prototype with a sample rule that checks the data for an arsenic contaminant level that exceeds the MCL [12].

Figure 1: Underlying architecture of the framework

We divide the application into two modules. In the first module, the user loads the spatially referenced water quality data into the ARCGIS desktop application, selects the rule to be fired, and selects the data sets on which the rule is to be fired. Control then passes to the forward engine, which takes the facts supplied by the user and, using the rules in the rule base, finds the facts that do not satisfy the rule. The second module starts after the data set with errors has been identified. Control passes to the backward engine, which asks the user the purpose of the data. When the user supplies the purpose, the backward engine uses the facts and rules in the rule base to recommend the appropriate data cleansing technique. Depending on the user's choice, the selected data cleansing technique is then run on the data set.
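The two-module control flow above can be sketched in Java as two plain functions. This is a hypothetical stand-in, not the framework's code: module 1 plays the role of the forward engine (find facts that violate a rule), and module 2 plays the role of the backward engine (map the stated purpose of the data to a recommended cleansing technique). The purpose-to-technique table is an illustrative assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two-module flow: forward detection of
// rule violations, then a purpose-driven cleansing recommendation.
public class TwoModuleFlow {

    // Module 1: forward pass over the facts, collecting indices of
    // values that fail the MCL rule.
    static List<Integer> detectErrors(double[] facts, double mcl) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < facts.length; i++) {
            if (facts[i] > mcl) bad.add(i);   // fact violates the rule
        }
        return bad;
    }

    // Module 2: backward pass, recommending a cleansing technique for
    // the user's stated purpose (illustrative table only).
    static String recommend(String purpose) {
        Map<String, String> table = Map.of(
                "trend-analysis", "interpolate from neighboring samples",
                "regulatory-report", "flag and exclude the record");
        return table.getOrDefault(purpose, "manual review");
    }
}
```

In the actual framework both steps are driven by the rule base rather than hard-coded tables, so the behavior can be changed without recompiling the application.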

5. Proof of Concept

We implemented the first module of the proposed framework, using the forward chaining engine to assess the quality of the spatial data. For this we used JESS, a forward chaining expert system shell [8]; dbXML, a native XML database [7]; and the Java Rule Engine API (JSR-94) [13], which provides access to the expert system shell from any Java application. The Java Expert System Shell (JESS) is a rule-based expert system shell with an inference engine and a knowledge base; rules and facts form the knowledge base, and JESS uses the RETE algorithm to process the rules in the knowledge base [8]. We developed a GIS-capable Java application for assessing the quality of spatial data, using Java 2 Platform Standard Edition 5.0 and ARCGIS Engine 9.1, which provides access to ArcObjects, a standard GIS framework. We used JESS for forward chaining in our application. When the user selects the data, the application acquires a stateful rule runtime session of JESS through JSR-94 and submits the selected spatial data sets as facts to JESS.
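The stateful session flow just described can be outlined with a small, self-contained Java sketch. In the real application, a StatefulRuleSession for JESS is obtained through the javax.rules (JSR-94) API; the Session class below is a hypothetical stand-in that only mimics the shape of that flow: add facts, execute the rules, then read back the objects the rules flagged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical, self-contained stand-in for the JSR-94 stateful
// session flow; class and method names echo, but are not, the real API.
public class SessionSketch {
    static class Session<T> {
        private final List<T> facts = new ArrayList<>();
        private final Predicate<T> rule;   // stands in for the loaded rule base
        private List<T> matched = List.of();

        Session(Predicate<T> rule) { this.rule = rule; }

        // cf. StatefulRuleSession.addObjects: submit facts to the engine.
        void addObjects(List<T> objects) { facts.addAll(objects); }

        // cf. StatefulRuleSession.executeRules: run the rules over the facts.
        void executeRules() {
            List<T> hits = new ArrayList<>();
            for (T fact : facts) {
                if (rule.test(fact)) hits.add(fact);
            }
            matched = hits;
        }

        // Read back the facts the rules flagged.
        List<T> getObjects() { return matched; }
    }
}
```

Keeping the session stateful, as JSR-94 allows, means facts accumulate across calls and rules can be re-executed as the user selects additional data sets.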

Figure 2: Application after loading the MXD document

Currently our application can handle spatial data in the format of ArcMap coverages. Initially the user loads the data into the application in MXD format. The MXD document shown in Figure 2 represents the water quality data of Green County, KY, USA. The point markers indicate the water sampling points in Green County, and the underlying layer is the NHD stream coverage of Green County, KY, USA [10] [11]. After loading the MXD document, the user uses the "select features" tool to select the sample sites that will be checked against the rules in the JESS expert system shell.

Figure 3: Application after selecting the features

After selecting the features, the user selects the rules to be executed against the selected features from the list of rules displayed in the right-most panel. As soon as the user clicks the RUN button in the upper right corner, the data on the arsenic contaminant levels of the selected features is retrieved using a spatial query. The application is connected to the JESS rule-based expert system shell [8] using JSR-94 [13], and the retrieved data is passed to JESS as facts. We encoded a rule for the maximum contaminant level (MCL) of arsenic:

(defrule ARSENIC
  ?ars <- (sample (arsenic ?amt&:(> ?amt 0.010)) (OBJECT ?C))
  =>
  (call ?C setmcl 9999.99)
  (printout t "The strange arsenic Value is " (get ?C mcl) crlf))

In the ARSENIC rule above we set the error flag if the contaminant level exceeds 0.010 [12]. Once the execution of a rule is completed, a separate shapefile (with the .shp extension) containing the errors detected by that rule is generated for each rule executed. The newly created shapefile is added as a new layer to the currently loaded MXD document. In this case a new shapefile named arsenic.shp is created with the erroneous data.

Figure 4: Application after the data validation process

In Figure 4, a new layer has been added at the top of the tree structure. This new layer holds the data that JESS detected as erroneous. This demonstrates that the spatial data visualization and analysis capabilities of ARCGIS can be integrated with the reasoning and inference capabilities of an expert system shell for assessing the data quality of spatial databases.

6. Future work

Once the errors in the data set are identified, the backward engine asks the user the purpose of the data and proposes the appropriate cleansing technique according to that ultimate purpose. Once the user selects a particular procedure, it is applied to the erroneous data set. For backward chaining we will be using Mandarax, a backward chaining expert system shell [9]. If the user detects a particular data set as an outlier, the user checks whether it is an anomaly or simply incorrect data; once an abnormality is confirmed, it is reported to the concerned authorities.

Currently our tool can only handle data from a single coverage. One of the objectives behind developing this framework is to reduce the complexity of data quality procedures. Future goals are to implement more tools and features that will enable the framework to work with data from multiple coverages simultaneously and with any geodatabase that can be accessed through ARCSDE [14]. For example, a complex stream network rule for the MXD data shown in Figure 2 needs facts from both coverage 1 (the NHD coverage of Green County) and coverage 2 (the sample sites) in order to validate the data.

7. Conclusions

The use of a rule-based expert system for data quality automates the data cleansing, and the use of the ARCGIS Engine [5] in the application allows us to provide the user with a user-friendly interface that supports spatial data visualization, which is one of the major advantages of the framework. In the future, others may integrate this framework into their applications to automate their data quality assessment and cleansing phase.

8. References

[1] Suresh C. Jayanty and Uta Ziegler, "A Rule Based Expert System Framework for Small Water Systems", Western Kentucky University, 2005.
[2] Rebecca Bari Buchheit, "Vacuum: Automated Procedures for Assessing and Cleansing Civil Infrastructure Data", Carnegie Mellon University, 1998.
[3] Yichun Xie and George D. Graettinger, "An Integrated ARCVIEW Expert System for Analyzing Contaminated Sediments in the Great Lakes Basin", 1996 ESRI User Conference.
[4] Alan F. Karr, Ashish P. Sanil, and David L. Banks, "Data Quality: A Statistical Perspective", 2005.
[5] ARCGIS Engine Developer Guide, ESRI.
[6] ARCGIS, www.esri.com
[7] dbXML, www.dbxml.com
[8] Java Expert System Shell (JESS), http://herzberg.ca.sandia.gov/jess/
[9] Mandarax Project, www.mandarax.org
[10] STORET, www.epa.gov/storet
[11] NHD 100k Streams, http://www.uky.edu/KGS/gis/NHD100DOWN.htm
[12] Drinking water contaminants and MCLs, http://www.epa.gov/safewater/mcl.html
[13] JSR-94, http://www.jcp.org/aboutJava/communityprocess/review/jsr094/
[14] ARCSDE, http://www.esri.com/software/arcgis/arcsde/
[15] ARCVIEW, http://www.esri.com/software/arcview/
[16] Giarratano, J. and Riley, G., "Expert Systems: Principles and Programming", PWS, Boston, 1994.
