WEB VACUUM: A Web-based Environment for Automated Assessment of Civil Infrastructure Data

A thesis by Ping Chen

Submitted in Partial Fulfillment of the Requirements for the Degree of MASTER OF SCIENCE IN CIVIL AND ENVIRONMENTAL ENGINEERING

Department of Civil and Environmental Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
August 2003

Abstract

Infrastructure monitoring data include condition data, usage data, performance data, and construction and maintenance histories. These data are crucial to support efficient allocation of resources, accurate evaluation of designs and deliberate decision-making. Therefore, understanding the quality of the monitoring data and preparing a cleansed data set for analysis are critically important and they are the subject of this research.

However, specialized studies on data quality are rare, and the definition of data quality is obscure. Specifically, there has been very little research done on data quality in the domain of civil infrastructure monitoring data. Rebecca Buchheit, for her PhD dissertation research, developed a two-level procedure to assess the quality of civil infrastructure data and explored the possibility of effectively cleansing the data using the results from the assessment procedure. She also implemented part of her data assessment procedure in a prototype software program, called VACUUM. Buchheit applied the VACUUM procedure and framework to two case studies and effectively identified several types of errors that commonly occur in civil infrastructure data sets.

The research presented in this thesis is an extension of Buchheit's research and development of VACUUM. My objective for this research was to apply Buchheit's procedure to an additional monitoring data set with different types of data and to verify, refine, and extend the functions of the software prototype of VACUUM as needed to address this data set. It was my objective to improve the VACUUM program by adding two statistical methods and correcting the method that uses distribution patterns. It was also my objective to develop WEB-VACUUM, a web-based version of VACUUM in Java.

With the website, potential users are able to find useful tutorials and access the most up-to-date version of the VACUUM application. Users can maintain complete control over their own data because the application is launched on their own machines. The added pre-processing and post-processing procedures in the WEB-VACUUM application help users easily organize their data sources, rules and outputs.

A case study that uses WEB-VACUUM is also presented. The data set is part of the bridge management system of the Pennsylvania Department of Transportation, which stores the inspection and maintenance information for bridges state-wide. The assessment procedure in WEB-VACUUM identified only a few human data entry errors, missing data errors and problems associated with updating; the distribution rules and other detection methods indicated a high level of quality for this data set as a whole.

Finally, I discuss the application opportunities that can make use of our research. Data quality assessment and cleansing techniques can be incorporated into data production processes to correct data errors on the fly. Potential improvements that can make VACUUM more effective and powerful in the future are also addressed, such as allowing non-sequential work flow, using the output from one assessment process to help select algorithms or data sources, and adding more algorithms at the algorithm function level. Additional tests on sets of civil infrastructure monitoring data, similar to or different from those Buchheit and I have studied, must be conducted to examine the adaptability of the VACUUM program. Tests on new types of monitoring data can help define the frequent error types in those data sets and examine the applicability of VACUUM. Furthermore, research on the effectiveness of applying different algorithms in assessing the quality of different types of civil infrastructure monitoring data must also be conducted.

Acknowledgements

This research has been supported by the National Science Foundation under Grant Number CMS 9987871. I would also like to thank the Pennsylvania Department of Transportation for providing the Bridge Management System data used in this research.

I want to thank my advisor, Jim Garrett, for his great patience and guidance in leading me through this research and this thesis. His integrity and meticulous attention to detail have set a great example for me to follow in my academic career. His warm encouragement cheered me up many times, helping me continue my research and escape temporary frustration. I would also like to thank Rebecca Buchheit. She instructed me on this project and started me out in data mining and Java programming with her strong academic knowledge. Her great patience and kind personality encouraged me to get started and move forward in this research. Jim and Becky are both true mentors and friends to me. I really appreciate the qualities they have conveyed to me through their actions. I believe these are valuable assets that will benefit me throughout my future academic life and guide me to be a person of integrity.

I also want to thank Professor Sue McNeil and her student, Hyeon-Shic Shin, from the University of Illinois at Chicago. Their feedback and tireless efforts helped me greatly in improving the VACUUM software.

Contents

Abstract .......... i
Acknowledgements .......... iv
List of Tables .......... viii
List of Figures .......... ix
Chapter 1 .......... 1
Introduction .......... 1
1.1 Problem Statement .......... 1
1.2 Summary of Previous Work .......... 4
1.3 Research Objectives .......... 6
1.4 Overview of Research Approach .......... 6
1.5 Organization .......... 7
Chapter 2 .......... 10
Data Quality and Civil Infrastructure Monitoring Data .......... 10
2.1 KDD & Data Mining .......... 10
2.2 Data Quality .......... 12
2.2.1 Importance of Data Quality .......... 12
2.2.2 Data Quality Concept .......... 13
2.2.3 Data Quality Measurement .......... 16
2.3 Data Quality Assessment .......... 18
2.3.1 Research on Data Quality Assessment .......... 19
2.3.2 Data Quality Assessment Approach .......... 21
2.4 Civil Infrastructure Monitoring Data .......... 23
2.5 Characteristics of Monitoring Data .......... 24
2.6 Error Types in Monitoring Data .......... 25
2.6.1 Systematic Error Type .......... 26
2.6.2 Individual Record Error Types .......... 28
2.7 Summary .......... 30
Chapter 3 .......... 32
Approach .......... 32
3.1 General Approach .......... 32
3.2 Data Quality Assessment Approach – Aggregate Level Assessment .......... 35
3.2.1 Statistical Methods .......... 35
3.2.2 Clustering Methods .......... 37
3.2.3 Pattern-Based Detection Methods .......... 40
3.2.4 Majority Voting .......... 43
3.3 Data Quality Assessment Approach – Individual Level Assessment .......... 44
3.4 VACUUM: Algorithm Summary .......... 45
3.4.1 Binary Rule .......... 46
3.4.2 Distribution Rule .......... 48
3.4.3 Conditional Rule .......... 49
3.5 Summary .......... 50
Chapter 4 .......... 52
Description of Web-based VACUUM .......... 52
4.1 Overview .......... 52
4.2 Functionality Description .......... 54
4.2.1 Functionality Levels .......... 54
4.2.2 Work Flow .......... 55
4.3 Technology .......... 58
4.4 Software Architecture .......... 60
4.4.1 General Control Model (MainFrame) .......... 62
4.4.2 SessionXMLReader .......... 62
4.4.3 Data Source Management Panel (dsmPanel) .......... 63
4.4.4 Rule Set Management Panel (rsmPanel) .......... 63
4.4.5 Evaluation Process Panel (evlPanel) .......... 64
4.4.6 Detection Result Application Panel (rstPanel) .......... 65
4.5 User Interface .......... 65
4.5.1 Main Frame Interface .......... 65
4.5.2 Step I – Data Source Management .......... 66
4.5.3 Step II – Rule Set Management .......... 70
4.5.4 Step III – Evaluation Process .......... 75
4.5.5 Step IV – Detection Result Application .......... 76
4.6 Summary .......... 78
Chapter 5 .......... 79
Assessment Case Study: BMS Data Set .......... 79
5.1 BMS Data .......... 79
5.1.1 BMS Overview .......... 79
5.1.2 Data Set Schema .......... 82
5.2 Data Quality Exploration .......... 84
5.2.1 Binary Constraints .......... 85
5.2.2 Statistical Methods .......... 98
5.2.3 Missing Records .......... 111
5.3 BMS Data Quality Summary .......... 111
Chapter 6 .......... 112
Discussion .......... 112
6.1 Discussion of the Software Architecture .......... 112
6.2 Discussion of the WEB-BASED VACUUM .......... 114
6.3 Discussion of Applications on the BMS data .......... 116
Chapter 7 .......... 119
Conclusion .......... 119
7.1 Summary .......... 119
7.2 Future Work .......... 121
Reference .......... 125

List of Tables

Table 1.1: Taxonomy of Dirty Data (Source: KIM ET AL [9]) .......... 15
Table 2.1: Duplication Error (1) .......... 29
Table 2.2: Duplication Error (2) .......... 29
Table 5.1: BMS Binary Constraints (1) .......... 94
Table 5.2: BMS Binary Constraints (2) .......... 95
Table 5.4: BMS Binary Constraints (3) .......... 96
Table 5.5: Examination on a Binary Constraint Violation .......... 97
Table 6.5: Deck Condition Rating Empirical Distribution (1985) .......... 104
Table 5.7: Deck Condition Rating Normal Distribution Test .......... 108
Table 5.8: Deck Condition Rating Empirical Distribution Test .......... 109
Table 5.9: BMS Correlation Coefficients for Aggregate Attributes .......... 110

List of Figures

Figure 3.1 Approach for Automated Assessment and Cleansing Data (Source: Buchheit [7]) .......... 34
Figure 3.2 Box Plot Diagram .......... 37
Figure 3.3 Example of clustering method .......... 38
Figure 3.4 Distribution Rule Sample Data .......... 49
Figure 4.1 Web VACUUM Work Flow .......... 57
Figure 4.2 Software Architecture: VACUUM application .......... 61
Figure 4.3 Main Frame Window .......... 66
Figure 4.4 Text File Mode Data Source Dialog .......... 68
Figure 4.5 Data Base Mode Data Source Dialog .......... 69
Figure 4.6 Attribute Definition Dialog .......... 69
Figure 4.7 Step II: Rule Set Management Panel .......... 70
Figure 4.8 Rule Set Management Dialog .......... 71
Figure 4.9 Attribute-Value Rule Dialog .......... 72
Figure 4.10 Attribute-Attribute Rule Dialog .......... 72
Figure 4.11 Attribute - Lagged Rule Attribute .......... 72
Figure 4.12 Standard Deviation Rule Dialog .......... 72
Figure 4.13 Percentile Rule Dialog .......... 72
Figure 4.14 Distribution Rule Dialog: normal distribution .......... 73
Figure 4.15 Distribution Rule Dialog: exponential distribution .......... 73
Figure 4.16 Distribution Rule Dialog: empirical distribution .......... 74
Figure 4.17 Binary Condition Rule Dialog .......... 75
Figure 4.18 IF Condition Dialog .......... 75
Figure 4.19 Evaluation Process Interface .......... 76
Figure 4.20 Detection Result Application Panel .......... 77
Figure 4.21 Anomaly Summary Dialog .......... 77
Figure 5.1 Binary Rule ("Remaining Life Deficiency") .......... 87
Figure 5.2 Anomaly Summary ("Remaining Life Deficiency") .......... 88
Figure 5.3 IF Condition Tree .......... 91
Figure 5.4 Binary Condition Rule .......... 91
Figure 5.5 Anomaly Summary .......... 92
Figure 5.6 Normal Distribution Rule .......... 100
Figure 5.7 Anomaly Summary Table (Normal Distribution Rule) .......... 101
Figure 5.8 Empirical Distribution Rule .......... 103
Figure 5.9 Rule Set (Empirical Distribution Rule) .......... 103
Figure 5.10 Anomaly Summary Set (Empirical Distribution Rule) .......... 104

Chapter 1
Introduction

Assessing the quality of civil infrastructure data is critically important because these data are widely and crucially employed to support decision making, design evaluation and other significant activities in the field of civil infrastructure engineering and management. In her PhD research, Rebecca Buchheit developed a two-level data quality assessment procedure and implemented a software prototype to support this procedure [7]. She validated this approach by applying it to several case studies of civil infrastructure monitoring data. My research is focused on (1) improving the automated data quality assessment process by incorporating additional functionality in the assessment software; (2) implementing a web-based version of the application for wide and easy use of this assessment software; and (3) further validating the assessment procedure by applying it to a data set of a different type from those considered by Buchheit.

1.1 Problem Statement

Civil infrastructure monitoring data are collected to describe the usage, condition, performance and maintenance activities of an infrastructure element. As one example, consider a state's Bridge Management System as a source of monitoring data; it describes the inventory, condition, repair and maintenance activities on bridges within that state. The state agency can thus plan the optimal time to allocate limited funding to the most appropriate bridges for maintenance. Another example is a database of freight flows by origin and destination for all long-haul modes of transport. States can use these data to explore future freight transportation needs and optimize infrastructure planning.

Because these data are used in all kinds of critical decision-making processes, high quality data are vitally important for performing deterioration modeling, for detailed cost-benefit analysis, for infrastructure planning and management, for budgeting decisions and for research purposes.

At the same time, the volume of civil infrastructure monitoring data is growing continually. Since most civil infrastructure projects have long service lives, they lead to huge historical databases. In addition, the wide use of sensors and the rapid development of sensing techniques also lead to large amounts of data being accumulated during the lifetime of a structure. Both of these trends will increase the amount of infrastructure monitoring data to be understood and acted upon in the future. Thus, evaluating the quality of these data sets is becoming more and more important and urgent.

However, although crucial, assessing data quality is not an easy task. First, there is no explicit and rigorous definition of data quality. Data quality, when it is mentioned, is often related to data errors or anomalies. Yet, to date, there is no complete definition of error types or operational description of how to find them. Obvious errors such as missing values and incomplete records are easily picked out, while other errors may be obscure and can be discovered only in certain contexts and after careful examination of their coherence with other data. For example, 49 kg and 36 kg can each be a reasonable value for an inventory rating load and an operating rating load, respectively. But once we understand that, for the same bridge element, the inventory rating load should be no greater than the operating rating load, we realize that a data entry mistake is likely when, for the same bridge deck, the inventory rating load is recorded as 49 kg and the operating rating load as 36 kg.

Second, there is no standard measurement to score the level of data quality. The quality of a data set is relative to the application of the data. Because of the complexity of data applications, it is hard to compare the quality of data between various uses of that data. Given a data set with a certain amount of missing values, those errors may be negligible for one algorithm, but may heavily bias the results when the data are used in another algorithm.

Third, there is less research work on the topic of data quality than has been done on data mining methods, algorithms and software. In most situations, data mining experiments have relatively explicit questions that need to be answered through the mining process. For example, predicting the future condition of a bridge element based on its historic condition records can be a data mining task. Unlike a data mining task, assessing data quality can be viewed as solving a much more complicated and implicit set of questions. Data errors can arise under any condition, and we can never expect to find all the errors by looking at only one aspect of the data. Finding anomalies requires both a general and a detailed understanding of the data, and data quality questions should be gradually formed in the process of understanding the data and solved in a comprehensive and systematic way. In practice, data assessment and preparation processes are often treated on a case-by-case basis, and field knowledge is required to determine whether an identified anomaly is actually an anomaly and whether the proposed repair activities are acceptable.
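To make the inventory/operating rating example concrete, the sketch below shows how such a cross-attribute consistency check might be expressed in Java. It is only an illustration: the class, field and record names are hypothetical and are not taken from the VACUUM source code.

    // Minimal sketch of a cross-attribute ("binary") consistency check.
    // All names here are illustrative, not part of the VACUUM implementation.
    import java.util.ArrayList;
    import java.util.List;

    public class RatingConsistencyCheck {

        /** A simplified bridge record holding the two rating attributes of interest. */
        static class BridgeRecord {
            final String bridgeId;
            final double inventoryRating;   // load the element can carry indefinitely
            final double operatingRating;   // maximum permissible load

            BridgeRecord(String bridgeId, double inventoryRating, double operatingRating) {
                this.bridgeId = bridgeId;
                this.inventoryRating = inventoryRating;
                this.operatingRating = operatingRating;
            }
        }

        /** Returns the records violating "inventory rating <= operating rating". */
        static List<BridgeRecord> findViolations(List<BridgeRecord> records) {
            List<BridgeRecord> violations = new ArrayList<>();
            for (BridgeRecord r : records) {
                if (r.inventoryRating > r.operatingRating) {
                    violations.add(r);   // each value is plausible alone; together they conflict
                }
            }
            return violations;
        }

        public static void main(String[] args) {
            List<BridgeRecord> records = List.of(
                    new BridgeRecord("B-001", 36.0, 49.0),   // consistent
                    new BridgeRecord("B-002", 49.0, 36.0));  // flagged as a likely entry error
            findViolations(records).forEach(r ->
                    System.out.println("Possible data entry error in record " + r.bridgeId));
        }
    }

Each value passes a simple range check on its own; only the relationship between the two attributes reveals the likely error, which is why this kind of rule is evaluated on pairs of attributes rather than on single values.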

The impact of an effective data quality assessment and cleansing procedure is obvious. Such a procedure can cut data preparation time notably, and well-prepared data have a much wider range of use. A data set in which most of the incomplete records have been filled can be applied to more algorithms and greatly decreases the effort spent on choosing modeling methods. Ultimately, the reliability of data mining conclusions can be greatly improved if they are based on a scientific and systematic data assessment and cleansing process.

Thus, the major purpose of this research is to find a general approach that is reasonably inclusive in answering the questions related to data quality assessment. We based most of our cases on civil infrastructure monitoring data, but our solution should not be limited to infrastructure monitoring data alone. To support an informed and systematic application of engineers' data, we also set out to deliver effective techniques and tools that can intelligently and automatically assist engineers in evaluating the quality of their data sets.

Buchheit developed a generic procedure for conducting data quality assessment and cleansing. She also implemented a software prototype for exploring the possibility of automating this procedure. An overview of her work is presented in the following section. My research work is an extension of Buchheit's research; I have focused on further evaluating this data quality assessment approach by using a new type of monitoring data and adding to the functionality of the software prototype. A summary of my research objectives and approach can be found in Sections 1.3 and 1.4, respectively.

1.2 Summary of Previous Work

In 2002, Buchheit summarized the characteristics of infrastructure monitoring data, and identified and classified data quality errors in civil infrastructure monitoring data [7]. In her research, she experimented with applying techniques from the exploratory data analysis and data mining fields to develop an assessment procedure for determining data quality. She discussed the characteristics of infrastructure monitoring data and classified them into two classes: time-based data and event-based data. The features of each type of data are described in her thesis [7]. She also summed up the errors that occur frequently in civil infrastructure data and classified them into two major types: systematic error types and individual record error types.

The primary contribution of Buchheit’s research was her automated procedure to assess the quality of civil infrastructure monitoring data. The assessment procedure that she developed is a two-level hierarchical procedure fitting into the general Cross Industry Standard Process for Data Mining (CRISP-DM) model. The CRISP-DM model is an abstract, high-level process model for data mining. Data quality assessment is part of the data understanding phase of CRISP-DM, and data cleansing belongs to its data preparation phase [5].

In the first level of her procedure, different types of data quality assessment methods are used to detect anomalies in an aggregated data set. In the second level of the procedure, the analyst focuses on the specific anomalies discovered and tries to learn what types of errors are present by looking at the disaggregated data set. Once the error types have been identified, the analyst can choose a cleansing technique to counteract the negative effect of these errors based on the ultimate use of the data and the type and severity of the errors detected.

Buchheit's research had three objectives: developing and determining the effectiveness of the data quality assessment procedures; developing and determining the effectiveness of a cleansing procedure that uses the results from assessment; and determining the potential for automating the procedure. She addressed the effectiveness of the assessment procedure by presenting two case studies that illustrate her procedure. The first one was a traffic data set collected by a weigh-in-motion scale. Her procedure identified three types of errors: data missing from the right-hand lane, extraneous passenger vehicle data, and records in which two tailgating vehicles are combined into a single vehicle. The other case study examined data collected from an HVAC system in a highly monitored experimental building; her procedure identified missing data in the monitoring data set.

Buchheit then addressed the effectiveness of the cleansing procedure by using the results from the assessment procedure to cleanse one of the case study data sets. She also developed a test bench, in which known errors were introduced into a clean data set, to study the sensitivity of the algorithms used in the assessment procedure. Finally, she addressed the potential for automating the procedure by presenting a prototype software program that implements part of the data quality assessment procedure. She called the prototype VACUUM.

1.3 Research Objectives

My research is an extension of Buchheit's work. As such, the primary objectives of my research were to (1) reexamine Buchheit's two-level data quality assessment procedure; (2) improve the software prototype by integrating several more data quality assessment approaches; (3) develop a web-based application for widespread and secure use of this procedure; and (4) perform a case study using a new data set, and examine the performance of Buchheit's procedure on this new type of monitoring data.

1.4 Overview of Research Approach

To meet these objectives, I added two statistical methods to the existing software prototype and validated the distribution rules in the VACUUM framework. I also readjusted the software architecture and divided it into three levels: (1) the meta data level, where data sources and their attributes are defined; (2) the algorithm level, where rules are created and applied to a selected data source; and (3) the application level, where users can output the outliers and evaluation results to a user-defined file format. A new software prototype implemented as a Java package was created as part of my research. New user interfaces were also created that make each step in the whole assessment procedure much clearer and more intuitive than the original version of VACUUM.
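As a rough illustration of how such a three-level separation can be expressed, the sketch below defines one Java interface per level. The interface and method names are hypothetical and do not come from the WEB-VACUUM source code; they only mirror the responsibilities described above.

    import java.io.File;
    import java.util.List;

    // Hypothetical interfaces illustrating the three architectural levels.

    /** Meta data level: where data sources and their attributes are defined. */
    interface DataSourceDefinition {
        String getName();
        List<String> getAttributeNames();
        List<Object[]> readRecords();        // raw records, one Object per attribute
    }

    /** Algorithm level: where rules are created and applied to a selected data source. */
    interface AssessmentRule {
        String describe();
        List<Integer> evaluate(DataSourceDefinition source);   // indices of suspect records
    }

    /** Application level: where outliers and evaluation results are exported. */
    interface ResultExporter {
        void export(DataSourceDefinition source, List<Integer> suspectRecords, File target);
    }

Keeping the levels behind separate interfaces is what allows new rules or new output formats to be added without touching the data source definitions.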

I then developed the web-based version of VACUUM by publishing the improved VACUUM application over the Internet using Java Web Start technology. The web-based version, to a much larger degree than other networked software, reduces the concerns users may have about sending their data set to an unsecured website for analysis. The application is actually launched on the users' local machines and connects to their data only when the connection is approved and the Java application is working locally. Another benefit that the web site brings is that users can always access and apply the latest version of VACUUM without having to compile it again.
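For readers unfamiliar with Java Web Start, the fragment below sketches roughly what a JNLP deployment descriptor for such an application could look like. The codebase URL, jar file name and main class shown here are placeholders for illustration only; they are not the actual WEB-VACUUM deployment files.

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Illustrative JNLP descriptor; URL, jar and class names are placeholders. -->
    <jnlp spec="1.0+" codebase="http://example.edu/webvacuum" href="vacuum.jnlp">
      <information>
        <title>WEB-VACUUM</title>
        <vendor>Carnegie Mellon University</vendor>
        <offline-allowed/>
      </information>
      <security>
        <!-- full permissions let the locally launched application read local data files -->
        <all-permissions/>
      </security>
      <resources>
        <j2se version="1.4+"/>
        <jar href="vacuum.jar"/>
      </resources>
      <application-desc main-class="vacuum.MainFrame"/>
    </jnlp>

When a user clicks the link on the web site, Java Web Start downloads the jar described above, caches it, and launches the application locally; subsequent visits transparently pick up whatever version is currently published at the codebase URL.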

Finally, WEB-VACUUM is applied to the Pennsylvania Department of Transportation Bridge Management System data provided to us by PENNDOT. Buchheit's assessment procedure, delivered in WEB-VACUUM, proved its applicability and usefulness on this bridge inventory type of data. Missing records and possible human data entry errors were detected, and a lack of timely updating was identified as the major cause of a series of similar outliers. In general, however, the other evaluation rules and the results acquired from the distribution methods indicated a high level of quality for the given PENNDOT BMS data set as a whole.

1.5 Organization

Chapter 1 of this thesis introduces background material relevant to civil infrastructure monitoring data and their data quality issues. In Chapter 2, definitions of data mining, knowledge discovery in databases, and data quality are presented first. The latest research on the data quality concept and the identification of data quality problems is then reviewed. I then introduce Buchheit's classification of the different types of monitoring data and their characteristics, followed by her definition of the different types of error that occur commonly in civil infrastructure monitoring data. The two types of errors are systematic error types and individual record error types.

For my research, I have verified and supplemented the algorithms used in VACUUM. The original VACUUM program is the result of Buchheit's attempt at automating the data quality assessment procedure. As such, in Chapter 3, I introduce Buchheit's two-level data quality assessment framework and the data cleansing procedures that she developed. The existing algorithms and approaches used in each level of the data quality assessment procedure are presented in more detail. The implementation of these algorithms in the software prototype, VACUUM, is then described.

Chapter 4 discusses the functionality, technology support, software architecture and user interfaces of the web-based version of VACUUM, called WEB-VACUUM, which is implemented by publishing the latest Java version of VACUUM via the Internet using Java Web Start technology. Different from the original VACUUM framework, WEB-VACUUM is focused on improving the usability of the software. It clarifies the functions of the original version of VACUUM created by Buchheit by defining a three-tier software architecture.

In Chapter 5, I present a case study in which I applied WEB-VACUUM to a bridge management system data set provided by PENNDOT and illustrate the two-level assessment procedure discussed previously. I first briefly introduce the data source and data properties of the BMS data set; then I describe examples of the different types of algorithms (binary rules, distribution rules and conditional rules) defined on the BMS data set. The application of the binary rules and conditional rules exposed some problems within the data set, such as missing data, definition violations, or a lack of updates in certain parts of the data set. However, the extensive use of distribution rules, now available in WEB-VACUUM, indicated that, in general, the acquired BMS data set exhibited sufficiently good quality on the whole.

Chapter 6 presents an extensive discussion of my research work. Topics such as evaluations of the new software architecture and of WEB-VACUUM, and issues related to performing the case study on the BMS data set using WEB-VACUUM, are discussed in that chapter. Finally, Chapter 7 presents future work that can make use of our data quality assessment procedure. Some potential work on the software application that could make VACUUM more effective and powerful is also discussed. For example, the output from a previous round of the assessment process can help locate new algorithms or new data sources that can be executed more effectively in the next round.

Chapter 2
Data Quality and Civil Infrastructure Monitoring Data

In this chapter, background information about data mining (DM) and knowledge discovery in databases (KDD) is presented first. Then, the concept of data quality and its significance in the data mining process is discussed, followed by a brief summary of data quality assessment approaches. A review of Buchheit's research describing civil infrastructure monitoring data and its characteristics is then provided [7]. Finally, the data quality problems that have been found to date in civil infrastructure monitoring data are classified by error type.

2.1 KDD & Data Mining

Knowledge discovery in databases (KDD) is "a process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in database." [3] In some literature, knowledge discovery in databases is defined as a general concept that represents all activities related to extracting new ideas and knowledge from data, while data mining occupies specific phases of the KDD process and uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. Data mining technology uses techniques from database systems, especially very large database systems, machine learning and statistics. The importance and potentially huge benefit of acquiring knowledge from large databases have been recognized by many researchers from different fields and by various industrial companies. For example, credit card companies analyze their immense customer records to find the potential customers most likely to accept their new services.

The Cross-Industry Standard Process Model for Data Mining (CRISP-DM) is a hierarchical model of the KDD process [5]. It defines and validates a data mining process that is applicable in diverse industry sectors. Six high-level phases compose this process: domain understanding, data understanding, data preparation, modeling, results evaluation, and results deployment.

A detailed description of the CRISP-DM process model can be found in its manual [5]; I summarize the phases here. As an integrated process to accomplish a knowledge discovery task, a data mining process begins with the domain understanding phase, during which the objectives and requirements of the task are well understood. For example, a target for a data mining process using a bridge inventory database could be to classify the bridges according to their manner of deterioration and describe the common characteristics of each class. Usually, the outcome of the domain understanding phase is converted into a problem definition and an initial plan toward the final achievement of the target [5]. The data understanding phase follows the domain understanding phase, and its major purpose is to have the analyst become familiar with the collected data. To achieve this, the analyst needs to understand data properties, identify data quality problems, and extract subsets of data that may form interesting topics for mining hidden knowledge. The data preparation phase yields cleansed and well-formatted data for the modeling stage; the selection of tables, records and attributes, transformation and data cleansing are its major tasks. In the modeling phase, one or more data mining algorithms are selected and applied to the prepared data. Then, the results of the modeling phase are evaluated against the goals set in the domain understanding phase. If they are acceptable, the results are deployed in a report or perhaps in the implementation of a decision support system.

Buchheit based her research upon the CRISP-DM model to explain the process of data mining and expanded the model by adding her procedures for assessing data quality and cleansing the data [7]. She also emphasized that the KDD process is iterative. The sequence of the six phases discussed above is not strict; moving back and forth between different phases is always required. Which phase, or which particular task of a phase, has to be performed next depends on the outcome of each phase. For example, if the initial results are unacceptable, the data mining analysts might try a different data mining algorithm. Typically, there are several techniques for the same data mining problem type, and some techniques have specific requirements on the form of the data. Therefore, stepping back to the data preparation phase is often necessary [5].

2.2 Data Quality

Data quality is extremely important in the current environment, where information clearly influences decisions being made. This section first discusses the importance of data quality. Data generating mechanisms, which are the direct factors in determining the quality of data, will be briefly addressed. Then, I will discuss the concept of data quality. To define data quality in an explicit and comprehensive way is difficult, and researchers and practitioners continue to attempt to do so. Finally, data quality measurement and the difficulty of quantifying data quality are addressed.

2.2.1 Importance of Data Quality

Useful and accurate information has brought tremendous profit, while poor quality data has led to lost fortunes and mistaken decisions. According to an estimate from the Data Warehousing Institute, poor-quality customer data costs U.S. businesses a staggering $611 billion a year in postage, printing, and staff overhead [13]. At the same time, many loyal customers are frustrated by incorrectly addressed letters.

The process of cleansing data and acquiring better quality data is crucial because of the "garbage in, garbage out" principle [2]. Errors in a data set can decrease the correctness of learning algorithms; data mining techniques may detect patterns that reflect the errors in the poor quality data instead of discovering new knowledge.

A data quality problem might also reflect a deficiency of the data generating and collecting system. For example, hand-recorded data may contain human errors made during data entry, and sensor sensitivity may influence the accuracy of an automated recording system. Poor quality data also become evident when merging data from different data collection divisions to create an enterprise data warehouse. The previously described mistakes in customer addresses may arise from a defect in the merging approach: several different descriptions of the same customer may be found in several data sources, and the merging approach is not able to pick the correct one, leaving outdated or incomplete records in the combined database. Although the accuracy of various measurement approaches continues to improve, there is no guarantee of "absolute accuracy".

2.2.2 Data Quality Concept

Wand and Wang (1996) [8] defined the term "data quality" by using an ontological framework to describe different dimensions of data quality. Before them, "correctness" or "accuracy" was identified as one of the most often cited characteristics of data quality. Wand and Wang emphasized that data quality problems are produced by a representation deficiency, which is the difference between the user's view of the real world system as inferred from the information system and the view that is acquired by directly observing the real world. Thus, they founded their data quality concepts on the role of an information system and derived four data quality dimensions: completeness, unambiguousness, meaningfulness and correctness [8]. Incomplete data representation loses information about the application domain; ambiguous data provide insufficient information because the data can be interpreted in multiple ways; meaningless data cannot be interpreted in a meaningful way; and data derived from the information system that are inconsistent with the real-world data used to create them are incorrect.

Wand and Wang's evaluation of data quality is theoretically sound. They disclosed the intrinsic deficiency within the data representation system that causes data quality problems, and their analysis of the dimensions of data quality has been widely cited in the literature. However, although essentially correct, Wand and Wang's definition did not offer concrete guidelines or operational criteria that can be used for the specification and audit of information systems. Researchers continue to supplement and refine its contents.

Recently, Kim, Choi, Hong et al. (2003) developed a nearly comprehensive taxonomy of dirty data that provides a framework for understanding the origins of a complete spectrum of low quality data [9]. They adopted a "successive hierarchical refinement" structure and expanded it to present a near-complete taxonomy of dirty data. The concept behind their taxonomy is that dirty data present themselves in three distinguishable ways: "missing data"; "non-missing but wrong data"; and "non-missing and not wrong, but unusable" data. They initialize their hierarchy of dirty data by putting two nodes in the first level: "missing data" and "non-missing data". They then decompose each node into successive levels of subclasses. The taxonomy of Kim et al.'s categories can be found in Table 1.1. For example, the "non-missing data" node is split into "non-missing, wrong data" and "non-missing, not wrong, but unusable data"; no other category is allowed according to the restrictions of the near-complete hierarchy approach. The "not-missing, not-wrong, but unusable data" node is then split into three child nodes: "different data for the same entity across multiple databases"; "ambiguous data"; and "non-standard conforming data". The node of "ambiguous data" is further branched into two child nodes, "use of abbreviation" (e.g., Dr. for doctor or drive) [9] and "incomplete context" (e.g., Madison city without specifying the state) [9], which are the two causes of ambiguous data.

1. Missing data
  1.1 Missing data where there is no Null-not-allowed constraint
  1.2 Missing data where Null-not-allowed constraint should be enforced
2. Not-missing, but
  2.1 Wrong data, due to
    2.1.1 Non-enforcement of automatically enforceable integrity constraints
      2.1.1.1 Integrity constraints supported in relational database systems today
        2.1.1.1.1 User-specifiable constraints
          2.1.1.1.1.1 Use of wrong data type (violating data type constraint, including value range)
          2.1.1.1.1.2 Dangling data (violating referential integrity)
          2.1.1.1.1.3 Duplicated data (violating non-null uniqueness constraint)
          2.1.1.1.1.4 Mutually inconsistent data (action not triggered upon a condition taking place)
        2.1.1.1.2 Integrity guaranteed through transaction management
          2.1.1.1.2.1 Lost update (due to lack of concurrency control)
          2.1.1.1.2.2 Dirty read (due to lack of concurrency control)
          2.1.1.1.2.3 Unrepeatable read (due to lack of concurrency control)
          2.1.1.1.2.4 Lost transaction (due to lack of proper crash recovery)
      2.1.1.2 Integrity constraints not supported in relational database systems today
        2.1.1.2.1 Wrong categorical data (e.g., wrong abstraction level, out of category range data)
        2.1.1.2.2 Outdated temporal data (violating temporal valid time constraint; e.g., a person's age or salary not having been updated)
        2.1.1.2.3 Inconsistent spatial data (violating spatial constraint; e.g., incomplete shape)
    2.1.2 Non-enforceability of integrity constraints
      2.1.2.1 Data entry error involving a single field
        2.1.2.1.1 Data entry error involving a single field
          2.1.2.1.1.1 Erroneous entry (e.g., age mistyped as 26 instead of 25)
          2.1.2.1.1.2 Misspelling (e.g., principle instead of principal, effect instead of affect)
          2.1.2.1.1.3 Extraneous data (e.g., name and title, instead of just the name)
        2.1.2.1.2 Data entry error involving multiple fields
          2.1.2.1.2.1 Entry into wrong fields (e.g., address in the name field)
          2.1.2.1.2.2 Wrong derived-field data (due to error in functions for computing data in a derived field)
      2.1.2.2 Inconsistency across multiple tables/files (e.g., the number of employees in the Employee table and the number of employees in the Department table do not match)
  2.2 Not wrong, but unusable data
    2.2.1 Different data for the same entity across multiple databases (e.g., different salary data for the same person in two different tables or two different databases)
    2.2.2 Ambiguous data, due to
      2.2.2.1 Use of abbreviation (Dr. for doctor or drive)
      2.2.2.2 Incomplete context (homonyms; and Miami, of Ohio or Florida)
    2.2.3 Non-standard conforming data, due to
      2.2.3.1 Different representations of non-compound data
        2.2.3.1.1 Algorithmic transformation is not possible
          2.2.3.1.1.1 Abbreviation (ste for suite, hwy for highway)
          2.2.3.1.1.2 Alias/nick name (e.g., Mopac, Loop 1, and Highway 1; Bill Clinton, President Clinton, William Jefferson Clinton)
        2.2.3.1.2 Algorithmic transformation is possible
          2.2.3.1.2.1 Encoding formats (ASCII, EBCDIC, ...)
          2.2.3.1.2.2 Representations (including negative number, currency, date, time, precision, fraction)
          2.2.3.1.2.3 Measurement units (including date, time, currency, distance, weight, area, volume, ...)
      2.2.3.2 Different representations of compound data
        2.2.3.2.1 Concatenated data
          2.2.3.2.1.1 Abbreviated version (e.g., John Kennedy for John Fitzgerald Kennedy)
          2.2.3.2.1.2 Uses of special characters (space, no space, dash, parenthesis, in a social security number or phone number)
          2.2.3.2.1.3 Different orderings (John Kennedy vs. Kennedy, John)
        2.2.3.2.2 Hierarchical data (e.g., address concept hierarchy: state-county-city vs. state-city)
          2.2.3.2.2.1 Abbreviated version
          2.2.3.2.2.2 Uses of special characters
          2.2.3.2.2.3 Different orderings (city-state, state-city)

Table 1.1: Taxonomy of Dirty Data (Source: KIM ET AL [9])

Kim et al.'s taxonomy is much more concrete and clear in representing the origins of poor quality data and revealing their hierarchical relationships. It is an advancement in conceptualizing data quality and provides a detailed guideline for grouping dirty data. However, the consistency and uniqueness of the classification definitions in this hierarchy need to be further inspected. The extent to which this hierarchy can explain existing data errors also needs to be examined.

2.2.3 Data Quality Measurement

Research on the concept of data quality continues, while assessing data quality is still a difficult part of the knowledge discovery process. One of the difficulties of assessing data quality comes from quantifying data quality.

The first difficulty comes from the fact that it is hard to find a standard against which to score poor quality data. In Wand and Wang's approach, users' views serve as a standard against which data quality is defined. However, in practice, the view of the user cannot be completely and accurately described; obviously, if it could be acquired, there would be no representation deficiency and no need to measure data quality. In the same way, we cannot find a reference data set against which to measure the quality of a given data set, because if the correct values were known for sure, there would be no reason to use a data set with questionable data quality. In practice, the quality of some small data sets and artificially created data may be measurable, because their data errors can be confidently identified with the aid of expert knowledge and appropriate technologies.

The second difficulty is that a data set having one data error type cannot readily be compared with another data set having a different data error type, even using Kim et al.'s contribution. For example, it is unfair to say that a data set with missing data is better than another data set with ambiguous data.

A third difficulty is that low quality data has varying degrees of negative impact on different data mining algorithms and applications. That is, data quality requirements vary with data mining algorithms: a data set with satisfactory quality for one mining algorithm may not be adequate for another. Thus, the question of quantitative measures of data quality is important, and it should be closely connected with the actual uses of the data. A quantitative measure tells us whether a data set is sufficient for a specific modeling method, and this question must be answered before applying any data mining approach.

However, “it is not easy to quantify this impact because of the statistical nature of the computations performed by the algorithms, their different tolerances for noise (dirty or exceptional data), and various data transformation requirements” [9]. For example, a neural network requires data of any type to be transformed into numerical data, while a Bayesian network is comparatively less sensitive to noise arising from missing data. Even for the same data mining algorithm, the impact depends on the real application that is the target of the data mining task. For example, a research project that intends to find frequent bridge deterioration patterns in a bridge inventory system is more tolerant of dirty data than one looking for less frequent patterns, because the relatively few occurrences of questionable data are unlikely to be mistaken for a frequent pattern, whereas they can mislead analysts into believing they have found rare deterioration patterns.

A complete hierarchy of data errors may help drive the measurement of data quality. For example, a taxonomy of dirty data like the one Kim et al. (2003) [9] proposed could serve to define a metric that measures data accuracy loss: according to the characteristics and requirements of a certain type of application or algorithm, different error types can be assigned different weights reflecting their significance in affecting the performance of that application or algorithm. In a real situation, the detected errors can be classified by error type and assigned the appropriate weight for their impact. From the summed weights and the possible accuracy loss incurred by using these questionable data, we may measure the total impact on the final data mining output.
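As a rough sketch of how such a weighted metric might be computed, the Java fragment below assigns weights to a few error types and aggregates them over the detected errors. The error categories and weight values are illustrative assumptions for one hypothetical application, not values proposed by Kim et al. or by this thesis.

    import java.util.EnumMap;
    import java.util.List;
    import java.util.Map;

    public class WeightedQualityImpact {

        /** A few illustrative error categories loosely drawn from the taxonomy above. */
        enum ErrorType { MISSING, WRONG_VALUE, OUTDATED, AMBIGUOUS }

        /**
         * Sums application-specific weights over the detected errors and normalizes
         * by the number of records, giving a crude per-record impact score.
         */
        static double impactScore(List<ErrorType> detectedErrors,
                                  Map<ErrorType, Double> weights,
                                  int totalRecords) {
            double total = 0.0;
            for (ErrorType e : detectedErrors) {
                total += weights.getOrDefault(e, 0.0);
            }
            return total / totalRecords;
        }

        public static void main(String[] args) {
            // Hypothetical weights for an application that tolerates missing data well
            Map<ErrorType, Double> weights = new EnumMap<>(ErrorType.class);
            weights.put(ErrorType.MISSING, 0.2);
            weights.put(ErrorType.WRONG_VALUE, 1.0);
            weights.put(ErrorType.OUTDATED, 0.6);
            weights.put(ErrorType.AMBIGUOUS, 0.4);

            List<ErrorType> detected = List.of(ErrorType.MISSING, ErrorType.MISSING,
                                               ErrorType.WRONG_VALUE, ErrorType.OUTDATED);
            System.out.printf("Per-record impact score: %.4f%n",
                              impactScore(detected, weights, 1000));
        }
    }

A different application, for example one searching for rare patterns, would simply supply a different weight table, which reflects the application-dependence of data quality discussed above.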

2.3 Data Quality Assessment

Research on data quality assessment involves approaches and methods from the fields of statistics, machine learning and databases, especially very large databases. At the same time, approaches to assessing data quality are still limited by problem-specific constraints. In this section, these aspects of data quality assessment will be discussed, along with related progress in the software industry and a new trend of research in assessing data quality. In the second part of this section, existing data quality assessment approaches are classified and briefly described.


2.3.1 Research on Data Quality Assessment

Although crucial, the amount of research on data quality assessment and cleansing techniques conducted to date is not proportional to its significance. Most of the research effort in the KDD community has been directed toward data modeling (mining) algorithms and automated algorithm selection; relatively little work has been done on general approaches and automated procedures for data quality assessment and data cleansing [7]. In fact, before data quality evaluation was explicitly required in the standardized data mining process, data quality problems were often exposed only at the mining stage, when some expected “new” knowledge was finally shown to be a reflection of errors in the original data set. Data quality assessment has become a specialized topic only as the critical importance of data quality has gradually been recognized by data miners.

On the other hand, techniques that help detect data quality problems mostly draw inspiration from data mining technology. Advances in assessing data quality have come as a by-product of developments in statistics, database theory and machine learning. In reality, “data quality assessment and cleansing are usually meta-modeling activities; algorithms that are used for data modeling are also used to identify and correct data quality problems.” [7] For example, algorithms using decision trees, nearest neighbor concepts, clustering, regression and neural networks, which are among the most popular data modeling tools, can also replace missing, incorrect or otherwise poor quality data.

Other data analysis methods used in quality assessment are taken from traditional statistics. However, in traditional statistics, data sets are generally small and can be handled by hand. As data sizes keep growing, automation becomes necessary to solve the problems present in very large data sets [7]. Because of this difficulty, data quality assessment and cleansing are often performed only on individual data sets of limited size, with the help of expert knowledge and specific techniques. Since automated model selection methods are already available in the data modeling phase of the CRISP-DM process, exploring the possibility of automating the data quality assessment and cleansing procedure becomes more and more attractive.

Software products that help cleanse dirty data have been developed recently (Vality Technology Inc.; Trillium, 1998; Williams, 1997) [9] and are available on the market. Commercial data quality tools, such as Trillium, First Logic and Vality, have been developed over many years and have focused on customer data quality management. They are proving helpful in converting names and addresses in several countries into their standard and complete representations, with the aid of country-wide directories of names and addresses. For example, these tools can even detect and correct wrongly entered street addresses [9]. Their products focus on providing standardization and cleansing techniques through cross-database operations and solving data problems occurring during data merge/purge processes [9].

In recent years, a new discipline, known as Enterprise Data Quality Management (EDQM), has emerged to address the need for appropriately managing data quality [14]. Research in this domain focuses on ensuring the accuracy, standardization, timeliness, relevance, and consistency of data throughout an organization. Researchers are also looking at the lifecycle of original data production, data transformation, data flow and data merges, until the ultimate application of the data, so as to ensure that decisions are made on consistent and accurate information.


2.3.2 Data Quality Assessment Approach

Section 2.3.1 presented the current situation in, and developments from, research on data quality assessment. The process of assessing data quality is also a process of detecting outliers. As mentioned, most data quality assessment procedures borrow techniques from the data mining field. Drawing on the related research in these two fields and on Buchheit’s description, I review the approaches to assessing data quality using the following classifications:

(a) Traditional statistical methods: Most of the methods of this type can be found in statistics textbooks. Primarily, they employ standard distribution models and flag as outliers those objects that deviate from the model. Popular models include the normal, exponential, and Poisson distributions. Common methods for defining deviations are the standard deviation method, quantile ranges and regression analysis. More sophisticated methods have also been developed. The limitation of this class of methods is that the models must be known beforehand and they only deal with univariate problems.

(b) Pattern-based methods: Buchheit defined “pattern-based methods” [7] in her research; they employ existing patterns in a data set to identify records that do not conform to the pattern. The patterns can be known initially or discovered by examining the actual data sets. Some patterns are defined according to domain knowledge about the given data set. The methods for finding unknown patterns can be further divided into “distance-based approaches” and “density-based approaches”. An object in a data set is a distance-based outlier if a sufficiently large fraction of the other objects lie farther than a given distance from it. Density-based approaches depend on the local density of the neighborhood of each object. Both approaches can deal with high-dimensional data features and large data sets.

(c) Clustering: Many clustering algorithms detect outliers as by-products. Clustering is a data mining method that tries to group data by their natural properties. For example, a collection of rectangles and circles can be separated correctly because the two shapes by nature have different edge smoothness and curvature. Similarly, data that are more alike will be grouped together. Because outliers are by nature different from the normal data, they are expected to be classified into separate groups. Since the major objective of these approaches is clustering, they are not optimized for outlier detection. The suspicious subsets must be examined carefully with the aid of domain knowledge, in case unknown, new knowledge is classified as an outlier.

(d) Association rules detect combinations of attribute values or records that occur together with greater frequency than would be expected if the values or items were independent of each other. Strong association rules may help recognize anomalies, because certain error sources may trigger a series of reactions, and these reactions will appear as associated values even though they are not normally supposed to occur together. Buchheit employed the first three of these four approaches in her case studies of two civil infrastructure monitoring data sets. Association rules were judged not very compatible with civil infrastructure monitoring data, because they usually require nominal data or data sets comprised of binary decisions, which are not common in civil infrastructure data sets. [7]


2.4 Civil Infrastructure Monitoring Data

Buchheit gave a brief definition of civil infrastructure monitoring data in her thesis: “Data that are collected to describe the state of an infrastructure element are called monitoring data.” [7] Infrastructure monitoring data are composed of data gathered throughout the life cycle of a civil infrastructure system and integrate measurements of its state during the construction, maintenance, rehabilitation and renovation stages. The measurement of the state consists of a description of condition, an evaluation of environment, and records of usage and performance of the infrastructure element.

The Pennsylvania Department of Transportation (PENNDOT) Bridge Management System (BMS) data that is discussed in detail in Chapter 5 is an example of monitoring data: it is designed to store structure inventory, inspection, and appraisal data and to compute needs estimates and rankings. The system accepts, stores, updates, and reports data on the physical characteristics and operating descriptions of all structures in Pennsylvania. During a bridge inspection, the bridge inspector assigns numerical condition ratings to each major structural part of the bridge; these condition codes, as well as the geometry and location of each bridge, are stored in the BMS. The bridge inspector may also take notes to document specific problems; these materials will also be included in the BMS.

Monitoring data are most widely used to aid decision making in infrastructure management, such as maintenance, rehabilitation and renovation decisions. For example, the PENNDOT bridge management system is designed to assist in determining the optimal time for an agency to execute improvement actions on a bridge, given the funds available. Monitoring data are also used to help evaluate design decisions and support research activities; an example is the Mn/ROAD data set Buchheit studied in her PhD dissertation [7], which belongs to the Long Term Pavement Performance (LTPP) Project. The LTPP project uses the Mn/ROAD data and similar data from other states to aid the analysis of pavement performance and to support better pavement design.

Monitoring data are collected in a variety of ways. Human inspection has long been one of the most common means of collecting civil infrastructure monitoring data. Bridge inventories and condition inspections are often performed by humans, although photographic or video systems can sometimes be used to automatically detect cracks or other problems. In recent years, more and more data are collected through sensor systems, and these devices are becoming more sophisticated and more functional. They have created an ever-growing amount of data in civil infrastructure, such as the Mn/ROAD data, which are gathered through weigh-in-motion scales deployed along a two-lane mainline test road. Another increasingly attractive area of monitoring is smart sensing. Smart sensors are embedded in the infrastructure itself and are meant to monitor it throughout its lifetime. A simple example is a loop detector at a triggered traffic light; the loop detector signals that a car has arrived so that the light will change. “As sensors become cheaper and smaller, these types of monitoring systems are likely to become more prevalent [7].”

2.5 Characteristics of Monitoring Data

Buchheit classified monitoring data into time-based systems and event-based systems according to how their records are triggered [7].

Time-based monitoring systems take a record at a preset time interval. This interval can be long; for example, most bridges are inspected only once every two years. In other cases, it can be very short, with hundreds of observations recorded in one second. Event-based systems are activated only when an event satisfying predefined conditions occurs, such as when a deck repair activity is performed.


Both types of monitoring data are often examined in a summarized form [7]. For example, when comparing the distribution of bridge conditions from year to year, the number of bridges in each condition level is of primary interest; similarly, the total number of vehicles passing over a sensor on a daily or yearly basis is of more interest than the individual observations.

Time-based monitoring data sets are often auto-correlated [7]; the value of an attribute at time t is correlated with its value at time t-1. For example, one would expect a bridge’s deck condition to vary only slightly between inspections at two-year intervals.

In general, there is little correlation between records in an event-based data set. However, slight auto-correlation can occur. For example, the speed of a leading vehicle affects the speed of the following vehicle in the same lane; if the leading vehicle slows considerably, the following vehicle will also slow or switch lanes [7].
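The auto-correlation property described above can be checked with a lag-1 autocorrelation coefficient. The sketch below is illustrative only; the sample series of condition codes is an assumption, not BMS data.

/**
 * A minimal sketch of checking lag-1 autocorrelation for a time-ordered
 * attribute. A positive coefficient indicates that successive observations
 * are correlated, as expected for slowly varying monitoring data.
 */
public class Lag1Autocorrelation {

    static double lag1(double[] series) {
        int n = series.length;
        double mean = 0;
        for (double v : series) mean += v;
        mean /= n;

        double numerator = 0, denominator = 0;
        for (int t = 0; t < n; t++) {
            denominator += (series[t] - mean) * (series[t] - mean);
            if (t > 0) numerator += (series[t] - mean) * (series[t - 1] - mean);
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Hypothetical deck condition codes from successive two-year inspections of one bridge.
        double[] deckCondition = {8, 8, 7, 7, 7, 6, 6, 5};
        System.out.println(lag1(deckCondition));   // positive, reflecting correlation between successive inspections
    }
}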

A monitoring system may contain anywhere from a few mandatory data elements to a long list of them, depending on the objectives of the system [6]. Domain knowledge reveals the relationships between the data items in a database, so that the relatively important ones can be selected for analysis.

2.6 Error Types in Monitoring Data

Buchheit distinguished two types of error in civil monitoring data in her research: systematic errors are identified when exploring the summarized data, while individual record errors are picked out when examining the individual records [7]. She inferred these error types from her investigation of five civil infrastructure monitoring data sets. In this research project, I worked with another data set: a bridge management system data set. My research shows that some of the error types discussed in Buchheit’s research have also appeared in the BMS data set. My assessment of data quality issues with the BMS data set is presented in detail in Chapter 5. To provide a background on the data quality errors that have been found to date in civil infrastructure monitoring data, I introduce Buchheit’s findings below in terms of the data error types she defined; new examples from the BMS data set are briefly introduced.

2.6.1 Systematic Error Type

Systematic errors occur in aggregated data sets and usually exist in relatively large-scale, contiguous data [7]. They can be identified by observing differences between the distributions of data values. For example, the distribution of the accumulated number of vehicles passing along one road each month in 1998 is comparable to the corresponding distribution in 1999, and the two would be expected to be largely similar. An abrupt difference in the same month of these two years is highly suspicious and would be suspected to include errors. Four systematic error types were identified in Buchheit’s research.

(1) Calibration
A calibration error appears when a group of erroneous records is caused by the same error source with the same magnitude of effect [7]. Usually, an identical additive or multiplicative deviation is applied to each observation in the erroneous record group when a calibration error occurs. For example, in the BMS data set, inspections of the same bridge are performed by different inspection companies in different years, and a poorly trained inspector may tend to over- or under-estimate the condition codes for all the bridges he inspected. Another situation is that recording devices, such as sensors, are not properly calibrated before measuring. Additive calibration errors shift the entire distribution toward the positive or negative side of the distribution histogram; multiplicative calibration errors shrink or magnify the spread of the distribution.
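A minimal sketch of how an additive calibration shift might be flagged is shown below: the mean of a suspect group of observations is compared against a baseline group and the shift is reported if it exceeds a chosen tolerance. The tolerance and the sample values are illustrative assumptions, not part of VACUUM.

/**
 * A minimal sketch of flagging a possible additive calibration shift by
 * comparing group means. The tolerance of 1.0 condition-code units is an
 * illustrative assumption.
 */
public class CalibrationShiftCheck {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    /** Returns true if the suspect group's mean deviates from the baseline by more than tolerance. */
    static boolean possibleAdditiveShift(double[] baseline, double[] suspect, double tolerance) {
        return Math.abs(mean(suspect) - mean(baseline)) > tolerance;
    }

    public static void main(String[] args) {
        double[] inspection1998 = {5, 6, 5, 6, 7, 6};   // condition codes from one inspection cycle
        double[] inspection2000 = {7, 8, 7, 8, 8, 7};   // same bridges, later cycle, different inspector
        System.out.println(possibleAdditiveShift(inspection1998, inspection2000, 1.0));   // true
    }
}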


(2) Threshold
Threshold errors occur when the real value exceeds or falls below the allowable value range set by the data collection system [7]. For example, in the BMS manual, the item that records structure adequacy and safety is defined to lie within the range (0, 55). However, in the actual data set, 8.6% of the values were found to be greater than this preset maximum. A distribution exhibits two characteristic signs when a threshold error occurs. If values above or below the allowable range are ignored, the distribution graph appears truncated on one or both sides. If out-of-range values are recorded as the maximum or minimum value of the range, there will be a spike at one or both ends of the distribution. In other situations, maximum or minimum values will be found outside the allowable range.
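The sketch below illustrates a simple range check of the kind described above, using the (0, 55) allowable range mentioned for the structure adequacy and safety item. The field name "structuralAdequacy" and the record layout are illustrative assumptions, not the actual BMS schema.

import java.util.ArrayList;
import java.util.List;

/**
 * A minimal sketch of a threshold (range) check. Records whose value lies
 * outside the allowable (min, max) range are collected as suspects.
 */
public class ThresholdCheck {

    record BridgeRecord(String bridgeId, double structuralAdequacy) {}

    static List<BridgeRecord> findThresholdViolations(List<BridgeRecord> records,
                                                      double min, double max) {
        List<BridgeRecord> violations = new ArrayList<>();
        for (BridgeRecord r : records) {
            if (r.structuralAdequacy() < min || r.structuralAdequacy() > max) {
                violations.add(r);
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<BridgeRecord> sample = List.of(
                new BridgeRecord("B-001", 42.0),
                new BridgeRecord("B-002", 61.5),   // outside the (0, 55) range
                new BridgeRecord("B-003", 17.3));
        System.out.println(findThresholdViolations(sample, 0.0, 55.0));
    }
}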

(3) Missing Data
Missing data is the most frequently occurring data error type. Large amounts of missing data are usually due to sensor malfunction, system shutdown, or schedule changes [7]. The histogram of a distribution with missing data will be shifted downward. Buchheit found a period of several months during which no observations were recorded by the right-lane sensors in the Mn/ROAD data set; it ultimately proved to be a temporary shutdown of the sensors [7]. There are no obvious large amounts of missing data in the BMS data.

(4) Extra Data
Both data duplication and mistakenly included data can cause extra data errors. Contrary to the effect of missing data, extra data result in an upward shift of the distribution histogram [7]. The Mn/ROAD data set investigated by Buchheit provides an interesting example of extra data. The data set should not include passenger vehicles. However, due to heavy traffic and tailgating on several occasions, significant numbers of passenger vehicles were included in the data set because they were mistakenly classified as non-passenger vehicles under the Federal Highway Administration (FHWA) definition [7]. There do not appear to be obvious extra data in the BMS data set.

2.6.2 Individual Record Error Types

Compared with aggregate errors, individual record errors occur on a relatively small scale. The causes of this type of error are varied, including human judgment errors, recording errors, sensor malfunction, and lack of measurement sensitivity [7]. Four types of individual record error were identified in Buchheit’s research.

(1) Missing Records
In contrast with the missing data among the systematic error types, missing records have occasional causes and appear randomly [7]. They are easy to identify in time-based monitoring systems and hard to detect in event-based monitoring systems, because a gap appears in the time series of a time-based system, whereas there is rarely any indication that an event happened but was not recorded [7]. The BMS data set provided many examples of missing records: the bridges are expected to be inspected every two years, yet many bridges did not have inspection data in the database for every two-year cycle.

(2) Garbling Errors
A garbling error happens when a state of the real world is incorrectly recorded in a data set. Incorrect values are hard to recognize because the true value is not easy to determine [7]. A garbling error can be detected easily only when the recorded value is extreme; for example, the temperature of flowing water is recorded as a negative value. A missing value in the data set can also be called a garbling error, because the true value can be regarded as having been mistakenly recorded as empty. A missing value in this sense means that the record exists but one or several data items in the record are missing. There are two kinds of situations for a missing value: either a data item is not applicable for an observation and the record leaves this data item blank, or the data item is expected to have a value but it is not recorded [7]. For example, in the BMS data set, if a bridge does not have any sign postings, then the condition codes for these postings will be missing in the BMS; this is an example of non-applicable missing data and is not an error. If a bridge does have sign postings and the condition codes for the postings are missing, a garbling error has occurred.

(3) Duplications
In an event-based system, a duplication error may occur when the same event is recorded two or more times [7]. In a time-based system, several observations taken at the same time are regarded as a duplication error. Duplicate records may not be identical in appearance; they may represent the same event but have two different expressions [7]. For example, the two entries in Table 2.1 actually represent the same contractor; the duplication error occurs because of different representation and abbreviation conventions. One of the two records in Table 2.2 is obviously an erroneous data entry, because both represent the temperature of the same object at the same time and only one temperature value can occur at a given time.

Duplications are easy to detect in a time-based system because the same time stamp is repeated in the duplicates. They are relatively difficult to recognize in an event-based system unless each event has a unique distinguishing property [7].

Contractor Name    Address
John Smith         4900 Fifth Avenue, Pittsburgh, PA, 15213
Smith, John        4900 Fifth Ave, Pitts, PA, 15213

Table 2.1: Duplication Error (1)

Time          Temperature
2002:13:30    221
2002:13:30    243

Table 2.2: Duplication Error (2)


(4) Combinations
Combination errors occur in event-based monitoring systems when two events are recorded as one single event or one event is decomposed into two events [7]. In a time-based monitoring system, one observation being recorded as two observations at the same time, or two observations taken at two different time points being combined into one record, is regarded as a combination error. A combination error can also result in a missing data error or a duplication error [7]. In general, these three error types (missing, duplication, combination) are usually caused by a lack of sensitivity in the measuring equipment. Buchheit detected combination errors in the Mn/ROAD data set, where a tailgating vehicle and the leading vehicle were counted as one vehicle by the weigh-in-motion scale [7]. Combination errors are rare in the BMS data set.

Compared with Kim et al.’s categories of dirty data [9], which were introduced in Section 2.2.2, Buchheit’s research identified the error patterns that occur most frequently in civil infrastructure monitoring data. Her study looked at the actual operations performed on a data set: data are first examined in aggregated form and then examined record by record. Kim et al. gave a seemingly more detailed classification of questionable data; however, the two approaches overlap in many of their classification characteristics. For example, the misspelling error defined in Kim et al.’s category can be classified as a garbling error in Buchheit’s classification. In practice, Buchheit’s definitions give the analyst more operational guidance for examining data from scratch.

2.7 Summary

I introduce the concept of data quality in this chapter. Data quality assessment is one of the six phases of a standardized data mining process. It is critically important as people become more and more dependent on accurate information to acquire supporting knowledge. On the other hand, the lack of a complete and mature definition of data quality and the difficulty of quantifying data quality make assessing data quality a hard task. Most data quality assessment approaches have evolved from data mining technologies. The most popular ones are classified into four categories and briefly described in this chapter.

Because my research extends the work developed by Buchheit, to explain the background of my research I then introduce a substantial amount of Buchheit’s research on implementing an automated procedure for assessing civil infrastructure data, based on her studies of five typical civil infrastructure monitoring data sets. The highlights of her definition of civil infrastructure monitoring data and her description of their characteristics are given afterwards. Finally, the typical errors she has found to date in infrastructure data are briefly introduced, accompanied by new findings from my case study.

In the following chapter, Buchheit’s approach to producing such a procedure is presented. The detailed algorithms have been included in a software prototype implementing this automated data quality assessment procedure, which is the basis of the new research work described in Chapter 4 and Chapter 5 of this thesis.


Chapter 3 Approach

In this chapter, I will introduce a general approach to assessing data quality proposed by Buchheit. This approach is a two-level data quality assessment procedure. How this procedure is incorporated into the other data understanding phases will be explained, and the approaches in each level of the procedure will be presented in detail. Finally, I will describe the algorithms implemented in a software prototype, VACUUM, which is designed to achieve some degree of automation of this data quality assessment procedure.

3.1 General Approach

Buchheit’s data quality assessment process is based upon the data understanding phase of the CRISP-DM process model. Data understanding is desired before any mining procedure begins. In the CRISP-DM process model, data understanding has four stages: collecting the initial data, describing the data, exploring the data, and verifying the data quality [5]. In the initial data collection report, the acquired data sets are listed, including descriptions of the data sources and the methods used to collect the data. The report also needs to note whether any problems occurred during the data collection process and the solutions taken to solve them. The data description report describes the properties of the data set, including the format of the data, the quantity of the data, and the attributes and definition of each data item. In the data exploration phase, the analyst is interested in using querying and visualization tools to find simple statistical information about the data, such as distributions of attributes and relations between pairs or small numbers of attributes. Once these stages are finished and basic knowledge about the data has been acquired, the analyst can begin to assess the quality of the data.


Buchheit’s assessment procedure is a two-level hierarchical process: “anomalies are detected at the aggregate level and then explored further at the individual data level.” [7] The purpose of this procedure is to bring suspicious records in the collected data sets to light and to look for potential error causes. Once the mechanisms of the anomalies have been discovered and identified, the data can be cleansed; that is part of the data preparation phase of the CRISP-DM process.

As depicted in Figure 3.1, the aggregate level assessment and the individual level assessment constitute the main components of the data quality assessment stage. Aggregate data collected from the field and physical constraints organized by the domain experts are sent to the aggregate level assessment procedure, where the algorithms applied include statistical methods, clustering and pattern-based detection methods. After the evaluation process in this phase is complete, aggregate level anomalies are recognized and sent to the individual level assessment procedure. This procedure also receives normal individual data, which are helpful for comparison with and evaluation of the anomalous data. In this phase, visualization of the data is very helpful; histograms are widely used to compare normal data and suspect data in order to find the possible causes of the variance. Finally, the error types of the anomalies are decided according to the guidelines defined in the previous chapter.

Once the data understanding phase is done, the analyst can enter the next step, the data preparation phase. After picking out the subset of data that may be used for a certain mining topic, and selecting and deriving related attributes, the analyst may need to cleanse the data set according to the requirements set by the mining approach and the target arranged in the domain understanding stage. In this phase, the error types identified in the data quality assessment phase are input as parameters to the cleansing procedure. Considering the purpose of the data’s future use, the model that is going to be built and the identified error types, one or several cleansing techniques will be selected and applied to the erroneous data to achieve the goal of cleansing that data.

Figure 3.1 Approach for Automated Assessment and Cleansing Data (Source: Buchheit [7])

In the following sections, I will introduce the first level of Buchheit’s assessment procedure, in which the aggregate properties of the data are explored; I will focus on explaining the approaches that can be employed at this level. I will then explain the second level of her assessment procedure, in which the analyst focuses on the anomalies identified in the aggregate level assessment; the approaches that are helpful there will be addressed.

3.2 Data Quality Assessment Approach – Aggregate Level Assessment

Four general types of data quality assessment methods were identified for use in the first level of Buchheit’s procedure: traditional statistical methods, clustering, pattern-based methods, and association rules. Buchheit recommended that the first three of them be applied to civil infrastructure monitoring data, because association rules usually require nominal data or data sets comprised of binary decisions [7]. In addition, she proposed another approach, called majority voting. In the following sections, I discuss these approaches in detail, incorporating a few new research results.

3.2.1 Statistical Methods

Many studies of outlier detection have been conducted in the area of statistics. The statistical methods discussed here are traditional statistical methods that make use of basic statistical concepts: observe the characteristics of the distribution of the sample data and pick out records that deviate greatly from it. Two statistical methods are applied in Buchheit’s research: variance measurement and prediction algorithms.

The calculation of the sample’s standard deviation is a simple example of measuring variance [7]:

s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}

In the above formula, \bar{x} is the mean of the sample, n is the number of data points in the sample, and s is the sample standard deviation. A criterion for deciding that a record is an anomaly is to test whether its value exceeds \bar{x} \pm s. This test can be made more or less sensitive by adding a multiplier to s and setting this multiplier to be less than 1 or greater than 1 [7].
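The following is a minimal sketch of the standard deviation test just described: a value is flagged when it lies outside the mean plus or minus a multiplier times the sample standard deviation. The multiplier of 2.0 and the sample values are illustrative assumptions.

/**
 * A minimal sketch of the standard deviation variance test: a value is an
 * anomaly when it lies outside mean +/- k * s, where the multiplier k controls
 * the sensitivity of the test.
 */
public class StandardDeviationRule {

    static boolean isAnomaly(double value, double[] sample, double multiplier) {
        double mean = 0.0;
        for (double x : sample) mean += x;
        mean /= sample.length;

        double sumSq = 0.0;
        for (double x : sample) sumSq += (x - mean) * (x - mean);
        double s = Math.sqrt(sumSq / (sample.length - 1));   // sample standard deviation

        return Math.abs(value - mean) > multiplier * s;
    }

    public static void main(String[] args) {
        double[] dailyVehicleCounts = {980, 1010, 995, 1005, 990, 1000, 2100};
        System.out.println(isAnomaly(2100, dailyVehicleCounts, 2.0));   // true
    }
}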

The percentile range approach measures variance using another indicator, the percentile value, and follows comparison principles similar to those of the standard deviation method. Percentile values, i.e., the median (50th percentile), which is a measure of the center or location of the distribution, together with the lower quartile (25th percentile) and upper quartile (75th percentile), can indicate the skewness and spread of a variable, which need not follow a standard distribution. Given a preset quantile range, a criterion for deciding that a data point is an anomaly is to test whether its value lies outside the given percentile values. The box plot display is a powerful tool for visualizing the quantile properties of a set of data [18]. Figure 3.2 is a diagram of a box plot display. The black circle represents the median. The upper and lower ends of the box are the upper and lower quartile values. The distance between these two values is the interquartile range, which is a measure of the spread of the distribution; the middle 50% or so of the data lie between the lower and upper quartiles. If the interquartile range is small, the middle data are tightly congregated around the median; if it is large, the middle data spread out far from the median. The relative distances of the upper and lower quartiles from the median give information about the shape of the distribution of the data: if one distance is much bigger than the other, the distribution is skewed. The dashed tails of the box plot correspond to the adjacent values. Letting r be the interquartile range, the distance between the upper (lower) adjacent value and the upper (lower) quartile value is 1.5r. Outside values beyond the adjacent values are graphed individually. Thus, greatly deviating data points are clearly marked on the box plot diagram.
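A minimal sketch of the box plot rule described above is shown below: values lying more than 1.5 times the interquartile range beyond the lower or upper quartile are treated as outside values. The simple linear-interpolation percentile computation and the sample data are illustrative assumptions.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * A minimal sketch of the percentile range (box plot) rule: collect values
 * beyond the adjacent-value cutoffs at quartile +/- 1.5 * interquartile range.
 */
public class PercentileRangeRule {

    /** Linear-interpolation percentile of a sorted copy of the data. */
    static double percentile(double[] data, double p) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double rank = p * (sorted.length - 1);
        int lo = (int) Math.floor(rank);
        int hi = (int) Math.ceil(rank);
        return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
    }

    static List<Double> outsideValues(double[] data) {
        double q1 = percentile(data, 0.25);
        double q3 = percentile(data, 0.75);
        double r = q3 - q1;                       // interquartile range
        double lowerCut = q1 - 1.5 * r;
        double upperCut = q3 + 1.5 * r;
        List<Double> outside = new ArrayList<>();
        for (double x : data) {
            if (x < lowerCut || x > upperCut) outside.add(x);
        }
        return outside;
    }

    public static void main(String[] args) {
        double[] sample = {12, 14, 13, 15, 14, 13, 45, 12};
        System.out.println(outsideValues(sample));   // [45.0]
    }
}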


Figure 3.2 Box Plot Diagram (Source: William S. Cleveland [18])

The prediction algorithm uses the values of one or more attributes in each record to predict the value of a target attribute [7]. The predicted value is then compared with the actual value; if there is a great discrepancy between the two, the actual value is suspected to be an anomaly. There are many different types of prediction algorithms, such as regression analysis (linear or non-linear), neural networks, and nearest neighbor approaches. The simplest approach is regression analysis; a simple multivariate linear regression equation takes the form [7]:

A_m = c_0 + c_1 A_1 + c_2 A_2 + \ldots + c_n A_n

where c_i is a coefficient and A_i is a data set attribute. A neural network is built by determining a simple or complex network structure, which is composed of many simple processing elements operating in parallel. Naturally, the better the model fits the whole data set, or the lower its prediction error rate, the more reliable the prediction algorithm is to apply.
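The sketch below illustrates prediction-based detection with a one-variable least-squares regression: records whose actual value deviates from the predicted value by more than a chosen threshold are flagged. The threshold and the sample data are illustrative assumptions.

/**
 * A minimal sketch of prediction-based anomaly detection: fit a one-variable
 * least-squares line A_m = c0 + c1 * A_1, then flag records with large residuals.
 */
public class PredictionRule {

    public static void main(String[] args) {
        double[] a1 = {1, 2, 3, 4, 5, 6};                   // predictor attribute
        double[] am = {2.1, 4.0, 6.2, 8.1, 30.0, 12.1};     // target attribute (one suspect value)

        // Least-squares estimates of the slope c1 and intercept c0.
        int n = a1.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += a1[i]; meanY += am[i]; }
        meanX /= n; meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (a1[i] - meanX) * (am[i] - meanY);
            sxx += (a1[i] - meanX) * (a1[i] - meanX);
        }
        double c1 = sxy / sxx;
        double c0 = meanY - c1 * meanX;

        // Flag records whose residual exceeds the (assumed) threshold.
        double threshold = 8.0;
        for (int i = 0; i < n; i++) {
            double predicted = c0 + c1 * a1[i];
            double residual = am[i] - predicted;
            if (Math.abs(residual) > threshold) {
                System.out.printf("Record %d suspect: actual=%.1f predicted=%.1f%n",
                        i, am[i], predicted);
            }
        }
    }
}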

3.2.2 Clustering Methods

Clustering analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity [17]. Clustering algorithms try to group the data into “natural” classes such that patterns within a valid cluster are more similar to each other than they are to patterns belonging to a different cluster. Clustering algorithms can detect anomalies because an anomaly is by definition different from normal data, so anomalies will be grouped separately from the normal data [7]. Figure 3.3 is an example of clustering analysis in which the data points are classified into three groups. The three different clusters imply that groups 2 and 3 have different generating mechanisms than group 1.

Figure 3.3 Example of clustering method

The variety of techniques for representing data, measuring proximity (similarity) between elements, and grouping data elements has produced a rich variety of clustering methods. A. K. Jain et al. have proposed a hierarchy of clustering techniques [17], in which clustering techniques are classified according to different criteria. The following three paragraphs present the key points of their classifications:

(1) Agglomerative vs. divisive: The distinction between these two types of methods is made by algorithmic structure and operation. An agglomerative algorithm begins by assigning one cluster to each pattern, and then merges the similar clusters into a new cluster. A divisive one begins with all patterns in a single cluster and then splits it until meeting a stopping criterion.


(2) Monothetic vs. polythetic: These two are distinguished by how the data features are used in the clustering procedure. “Feature” is a term commonly employed in the data mining field; in simple terms, it can be represented by a data attribute. Most algorithms are polythetic, that is, all features are considered when computing the discrepancy between clusters. A monothetic algorithm, by contrast, considers data features sequentially when dividing the given collection of patterns. The problem with monothetic methods is that when the number of data features grows, the clusters produced may be uninterestingly small and fragmented [17].

(3) Hard vs. fuzzy: A hard method allots each pattern to a single cluster, while a fuzzy algorithm assigns to each input pattern a degree of membership in several clusters. A fuzzy method can be converted to a hard method by assigning each pattern to the cluster with the largest membership probability.

Two further distinctions are deterministic vs. stochastic, which concerns partitional approaches designed to optimize a squared error function, and incremental vs. non-incremental, which arises when the pattern set to be clustered becomes large; the latter two are distinguished by whether or not they try to reduce the number of scans through the pattern set.

In her research, Buchheit used one clustering algorithm, called Autoclass [7]. This is an iterative algorithm that consists of a model-level search and a parameter-level search. It experiments with different models, each of which consists of a set of classes, the probability of each class and the overall probability of the set. In the model-level search, Autoclass finds the best models and their relative probabilities. In the parameter-level search, data are moved between classes until the overall probability that each data point belongs to its assigned class reaches a maximum [7]; that is, changing the class assignment of any data point would result in a lower probability. From the point of view of Jain et al., Autoclass is a divisive, polythetic, and fuzzy classification algorithm.

3.2.3 Pattern-Based Detection Methods

Two pattern-based detection methods have been applied in Buchheit’s research: binary constraints and distribution constraints.

Generally speaking, binary constraints use binary comparison operators to compare the value of an attribute to a static value, to the value of another attribute, or to a lagged value of an attribute (i.e., compare Ai and Ai-1) [7]. Buchheit identified three types of binary constraints: inviolable, violable and error. Inviolable constraints are the strictest type; in principle they cannot be violated [7]. For example, an inspection date of 2010/03/12 is a wrong date because the latest update of the BMS data set was in 2002. Violable constraints cannot be violated in most situations and are violated only in exceptional instances. For example, it is illegal to exceed the speed limit on a highway, but it is still done occasionally. Error constraints are generated from the error codes produced by the sensor system; they represent the known problems in the data set. For example, the Mn/ROAD weigh-in-motion scales record an error code when they detect tailgating.

Investigating the relations between two binary constraint violations that occur at the same time can help uncover possible causes of the violations. In other words, we can check whether some common external factor results in both violations or whether one violation is causing the other. Of course, this analysis requires expert domain knowledge to help the analyst explain the mechanisms producing these violations.


With distribution constraints, the actual set of data is compared with a preset probability distribution, or with another set of data coming from a similar data source, using goodness-of-fit testing. Simply put, the comparison algorithm “looks” at a sample histogram (i.e., the value distribution) and decides how similar it is to a baseline histogram. If the sample histogram is too dissimilar, it is tagged as an anomaly [7].

Goodness-of-fit testing is used to decide if a random sample appears to come from a given distributional form [16]. It is based on the hypothesis that a sample of size n comes from a population with a specified probability distribution. It measures the discrepancy between the sample and a given distribution function or another sample. The given distribution function may be completely specified or may contain parameters which must be estimated from the sample.

There are several different goodness-of-fit test statistics. One important and well known test is Pearson's Chi-squared test [21]. It was introduced to test the fit of discrete (both quantitative and qualitative) distributions, and it can also be used with continuous distributions provided that the data are grouped into classes.

Most of goodness-of-fit test statistics are non-parametric, that is, they make no assumptions about the shape of the sample distribution. In Buchheit’s thesis, she introduced two of them: Kolmogorov-Smirnov (K-S) test and Anderson-Darling (A-D) test.

The K-S test measures the maximum difference between the distribution of the sample data and the probability distribution [7]. For each data point in the sample, the distance between the theoretical distribution and the sample data is calculated. The K-S statistic is equal to the maximum distance among these data points. It is useful for validating whether a sample comes from a given continuous random variable.

The A-D statistic measurement is similar to a sum-of-squared-error measurement of the distance [7]:

A_n = n \int_{-\infty}^{+\infty} \frac{[F_n(x) - F(x)]^2}{F(x)\,(1 - F(x))} \, dF(x)

where F_n(x) is the distribution of the sample data, F(x) is the theoretical distribution, and n is the number of values in the sample data.

Numerical integration gives the iterative computational form of the equation [7]:

A_n = -n - \frac{1}{n} \sum_{i=1}^{n} \left[ (2i - 1)\ln F(x_i) + (2n + 1 - 2i)\ln\bigl(1 - F(x_i)\bigr) \right]

A_n is the value of the A-D test statistic. If the parameters of the distribution function are calculated from the sample, a slight modification is made to A_n to account for this. Given the A-D test statistic, the statistician can find a corresponding significance level, usually by looking up the closest critical value in a table [7]. If A_n exceeds the critical value at the chosen significance level, the test fails, which means that the sample does not come from the given probability distribution. If A_n is less than or equal to the critical value, it does not mean that the sample does come from the probability distribution; all the statistician can say is that it has not been shown that the sample does not come from the theoretical distribution. In Buchheit’s procedure, she implemented an approximation that calculates the significance level (p-value) for a given A_n directly. This approximation can be used in place of the critical value tables with little loss of accuracy. The A-D test appears to be suitable for data sets with any skewness (symmetric, left-skewed or right-skewed distributions) [21] (Aksenov and Savageau, 2002).
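The sketch below evaluates the computational form of the A-D statistic given above against a normal distribution whose mean and standard deviation are estimated from the sample. The error-function style normal CDF is a standard approximation; the mapping from the statistic to a p-value (performed in VACUUM with a saddle point algorithm) is not reproduced here.

import java.util.Arrays;

/**
 * A minimal sketch of the Anderson-Darling computational form, tested against
 * a normal distribution fitted to the sample. Illustrative only.
 */
public class AndersonDarling {

    /** Abramowitz-Stegun style approximation of the standard normal CDF. */
    static double normalCdf(double z) {
        double t = 1.0 / (1.0 + 0.2316419 * Math.abs(z));
        double d = 0.3989422804014327 * Math.exp(-z * z / 2.0);
        double p = d * t * (0.31938153 + t * (-0.356563782
                + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        return z >= 0 ? 1.0 - p : p;
    }

    static double statistic(double[] sample) {
        double[] x = sample.clone();
        Arrays.sort(x);                             // the formula uses the ordered sample
        int n = x.length;

        double mean = 0;
        for (double v : x) mean += v;
        mean /= n;
        double var = 0;
        for (double v : x) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / (n - 1));

        double sum = 0;
        for (int i = 1; i <= n; i++) {
            double fi = normalCdf((x[i - 1] - mean) / sd);
            sum += (2 * i - 1) * Math.log(fi) + (2 * n + 1 - 2 * i) * Math.log(1 - fi);
        }
        return -n - sum / n;
    }

    public static void main(String[] args) {
        double[] sample = {4.9, 5.1, 5.0, 4.8, 5.2, 5.1, 4.95, 5.05};
        System.out.println(statistic(sample));
    }
}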


Binary constraints and distribution constraints are applied to recognize explicit patterns in the distribution of the values. Two other pattern-based methods are less direct: distance-based and density-based methods are good at finding implicit patterns in the distribution of multi-dimensional features.

Most notions of distance-based outliers are generalized from distribution-based approaches. This is a more complex statistical method and was initially proposed by E. M. Knorr and R. T. Ng [21]. According to their definition, “an object O in a dataset T is a Distance Based (p, D)-outlier if at least a fraction p of the objects in T lies greater than distance D from O.” This outlier definition is based on a single, global criterion determined by the parameters p and D. The algorithm works well when the data set has only dense or only sparse regions. [22]
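The following is a naive O(n^2) sketch of the (p, D)-outlier definition quoted above: an object is an outlier if at least a fraction p of the other objects lie farther than distance D from it. The parameter values and the two-dimensional sample points are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;

/**
 * A naive sketch of distance-based (p, D)-outlier detection using Euclidean
 * distance and a full pairwise scan.
 */
public class DistanceBasedOutliers {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    static List<Integer> findOutliers(double[][] data, double p, double d) {
        List<Integer> outliers = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            int farAway = 0;
            for (int j = 0; j < data.length; j++) {
                if (i != j && distance(data[i], data[j]) > d) farAway++;
            }
            if (farAway >= p * (data.length - 1)) outliers.add(i);
        }
        return outliers;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 1}, {1.2, 0.9}, {0.8, 1.1}, {1.1, 1.0}, {9, 9}};
        System.out.println(findOutliers(data, 0.9, 3.0));   // [4]
    }
}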

A density-based outlier detection method was proposed by M. Breunig, et al [19]. It makes use of a new concept called the local outlier factor (LOF). The value of LOF of an object depends on how isolated the object is with respect to the surrounding neighborhood. In practice, an object with a high LOF value is more likely to be an outlier. LOF methods can deal with data sets that have both dense and sparse regions.

3.2.4 Majority Voting

Buchheit recommended that a voting scheme be used to identify anomalies when several of these data quality assessment methods are applied at the same time [7]. Majority voting is one of the simplest voting schemes: it recognizes a record as an anomaly only when the majority of the assessment methods do so. To be more accurate, the statistician can assign weights to the different assessment methods depending on whether one method is thought to perform better or worse than another for the types of errors found in a particular data set [7]. Thus, a more effective method is given more weight than a less effective one.
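A minimal sketch of such weighted voting over per-method anomaly flags is shown below. The weights and the 0.5 decision threshold are illustrative assumptions, not values prescribed by Buchheit's procedure.

/**
 * A minimal sketch of weighted majority voting over the anomaly flags produced
 * by several assessment methods.
 */
public class MajorityVoting {

    /** Returns true if the weighted share of methods flagging the record exceeds the threshold. */
    static boolean isAnomaly(boolean[] methodFlags, double[] methodWeights, double threshold) {
        double total = 0, flagged = 0;
        for (int i = 0; i < methodFlags.length; i++) {
            total += methodWeights[i];
            if (methodFlags[i]) flagged += methodWeights[i];
        }
        return flagged / total > threshold;
    }

    public static void main(String[] args) {
        // Flags from: standard deviation rule, percentile range rule, clustering.
        boolean[] flags = {true, true, false};
        double[] weights = {1.0, 1.0, 0.5};
        System.out.println(isAnomaly(flags, weights, 0.5));   // true: 2.0 / 2.5 = 0.8
    }
}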

3.3 Data Quality Assessment Approach – Individual Level Assessment

Once the aggregate level assessment is finished, the detected anomalies are sent to the second level. The major purpose of this level is to focus on the individual data and compare the questionable data with the normal data. This comparison helps explain whether individual anomalies occur in the data set and what kinds of error types exist in the anomalies, using Buchheit’s guidelines described in Section 2.6.

Visualization is one of the most important approaches in exploratory data analysis, and it is also very helpful for identifying individual level data errors. Histograms are widely used in this level of the assessment procedure to make comparisons. These histograms include the value distributions for the anomalous data and for the normal data; the anomalous histograms are compared to the normal histograms to identify discrepancies.

Chapter 2 presented descriptions of the variations that show up on the graphs when certain errors occur. Missing records lower the distribution histogram of the anomalous data relative to that of the normal data; thus, a decreased traffic count in one year compared with the same period in another year may be caused by a temporary shutdown of sensors. Conversely, duplication errors shift the distribution of the anomalies upward. Unexpected recording of passenger vehicles can occur when the traffic is heavy or tailgating occurs, which can shift the histogram of aggregated vehicle counts upward. An additive shift in a distribution may also indicate a calibration error; for example, uniformly higher vehicle speeds recorded in some week compared with normal weeks may imply a calibration error in the velocity sensors during that period. An unexpected spike at the extreme values of a distribution indicates a threshold error. More distribution shifts are described in more detail in the previous chapter.

However, in general there are no commonly adopted tests for characterizing the differences between two distributions. Goodness-of-fit testing may indicate that the shapes of the two distributions differ, but it cannot indicate how. Thus, the comparison still needs to be done by hand.

3.4 VACUUM: Algorithm Summary

Buchheit tested her two-level data quality assessment procedure with two civil infrastructure monitoring systems: the Mn/ROAD data and the Intelligent Workplace data. In her test on the weigh-in-motion (WIM) data, her procedure identified three types of anomalies: days on which the vehicle counts were too high, days on which the counts were too low, and days on which the counts appeared to be normal. By combining knowledge about the weigh-in-motion scale with the discrepancies between the normal and anomalous data, she discovered why each of these types of anomalies was occurring. The high vehicle counts were caused by recording tailgating passenger vehicles. The low vehicle counts occurred when the scale in the right-hand lane went off-line. And the apparently normal vehicle counts were identified as anomalies because a number of overweight trucks were recorded on those days [7].

These tests successfully demonstrated the feasibility of her data quality assessment procedure. The details of these two data sets and the operations performed on them can be found in her doctoral dissertation [7]. However, most of the anomaly detection methods applied to these two case studies were implemented by hand, so an objective of Buchheit’s work was to design an automated procedure to support data quality assessment. Buchheit called this prototype VACUUM.

VACUUM is a Java software prototype that integrates some of the algorithms Buchheit used in the first level of her data quality assessment procedure. Buchheit developed the framework of VACUUM. One objective of my research has been to extend this framework by adding new algorithms and continuing to validate the approach; I have added the pattern-based approach and the statistical methods to the latest version of VACUUM.

An algorithm is implemented in VACUUM by defining a type of rule. Binary Rules, Distribution Rules and Conditional Rules are implemented in the latest version of VACUUM. To express the result of applying a certain data quality assessment approach to a data set, we use a confidence value as an indication of high or low data quality. The confidence value is a number between 0 and 1: the higher the confidence value, the more reliably one can say that the data conform to the constraints expressed by the rule. The following three subsections discuss the implementation of these three types of rule.

3.4.1 Binary Rule

A Binary Rule is one kind of pattern-based detection method. The Binary Rule is of the form (<attribute> <operator> <comparator>) [7]. It is applicable to categorical and continuous data. For categorical data, the <operator> can be either equal or not-equal; for continuous data, the <operator> has more options, such as greater than or less than. In both cases, the <comparator> can be another attribute, a value, or a lagged attribute.


The variance measurement methods from the statistical approaches are also implemented as Binary Rules in VACUUM and use the same form (<attribute> <operator> <comparator>). The <comparator> in the variance measurement methods is a value that indicates the deviation from the original distribution of that attribute. For the standard deviation method, the <comparator> is equal to the sample mean plus (or minus) the multiplier times the sample standard deviation, which indicates the deviation from the sample mean. For the percentile range method, it is equal to a percentile value, which indicates the deviation from the sample median.

A tuple that satisfies the binary rule will be regarded as normal; otherwise, it will be counted as a violator (anomaly). The confidence value of a Binary Rule is defined as [7]:

confidence = 1 - \frac{\text{NumberOfViolators}}{\text{TotalNumberOfTuples}}    (Equation 3.1)

This is a Binary Rule example; the following data are part of an original sample data set:

Origin    Destination    Commodity_index    Tonnage        Year
...
'AK'      'AK'           7000               93733.0        2001
'AK'      'AK'           8099               2037583.0      2001
'AK'      'CA'           2100               22935158.0     2001
'AK'      'CA'           2229               29105.0        2001

A Binary Rule defined as “origin != destination” means that no record should have the same origin location and destination location. Any record that has the same origin and destination values is identified as an anomaly; in the sample data above, the first and second tuples are outliers violating this Binary Rule.
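The following sketch evaluates the "origin != destination" Binary Rule on records like those above and computes the confidence value of Equation 3.1. The record layout is an illustrative assumption; VACUUM's actual rule classes are not reproduced here.

import java.util.List;

/**
 * A minimal sketch of a Binary Rule check and its confidence value
 * (1 - violators / total tuples).
 */
public class BinaryRuleExample {

    record ShipmentRecord(String origin, String destination, int commodityIndex,
                          double tonnage, int year) {}

    static double confidence(List<ShipmentRecord> tuples) {
        long violators = tuples.stream()
                .filter(t -> t.origin().equals(t.destination()))   // rule "origin != destination" violated
                .count();
        return 1.0 - (double) violators / tuples.size();
    }

    public static void main(String[] args) {
        List<ShipmentRecord> sample = List.of(
                new ShipmentRecord("AK", "AK", 7000, 93733.0, 2001),
                new ShipmentRecord("AK", "AK", 8099, 2037583.0, 2001),
                new ShipmentRecord("AK", "CA", 2100, 22935158.0, 2001),
                new ShipmentRecord("AK", "CA", 2229, 29105.0, 2001));
        System.out.println(confidence(sample));   // 0.5: two of the four tuples violate the rule
    }
}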

3.4.2 Distribution Rule

A Distribution Rule is another type of pattern-based detection method. It is used to detect the pattern of one attribute in an aggregated state. In practice, the distribution of one attribute can be compared with a designated probability distribution, or with the distribution of the same attribute from another data source. The specified probability distribution might be a normal (Gaussian) distribution, another type of theoretical distribution (e.g., exponential, uniform), or a mixture of two distributions. Goodness-of-fit methods are the main approach used to perform the comparison. The latest version of VACUUM uses the Anderson-Darling method to compute the A-D statistic; VACUUM then maps this statistic to a corresponding significance value using a saddle point algorithm. This significance value is returned as the confidence value of the test on the Distribution Rule, representing the degree of belief that the data distribution comes from the given probability distribution or from the other data distribution.

The following is an example of applying a Distribution Rule in VACUUM. Figure 3.4 shows the histogram of the sample data set. A normal distribution test without specified parameters for this sample data set returns a confidence value of 0.3383; the shape and location parameters of the probability distribution are calculated from the sample data for this comparison. At first sight of this histogram, the statistician may suspect that the sample data set comes from a normally distributed population, and the result from VACUUM indicates that the user can have at least 33.83% confidence in saying so.


Figure 3.4 Distribution Rule Sample Data

3.4.3 Conditional Rule

A Conditional Rule is another kind of pattern-based detection method that can deal with complex patterns. A Conditional Rule takes the form of an If-Then-Else expression. If the conditional clause (If Clause) evaluates to true for a tuple, the Then Clause of the rule is evaluated against this tuple. If the conditional clause evaluates to false for the tuple and an Else Clause is specified, the Else Clause is evaluated against this tuple.

The conditional clause is composed of one or more Binary Rules and uses the following grammar [7]:

<condition> : <clause> (<connector> <clause>)*
<connector> : ["AND", "OR"]
<clause> : <binary rule> | <condition>

A well-formed condition is like: AND ([x>1] OR ([x
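The sketch below illustrates how an If-Then-Else conditional rule might be evaluated over a tuple, with Java predicates standing in for Binary Rules. The Tuple representation and the rule composition are illustrative assumptions, not VACUUM's actual classes.

import java.util.Map;
import java.util.function.Predicate;

/**
 * A minimal sketch of evaluating an If-Then-Else Conditional Rule over a tuple.
 */
public class ConditionalRuleExample {

    /** A tuple is represented here as a simple attribute-to-value map. */
    record Tuple(Map<String, Double> values) {
        double get(String attribute) { return values.get(attribute); }
    }

    record ConditionalRule(Predicate<Tuple> ifClause,
                           Predicate<Tuple> thenClause,
                           Predicate<Tuple> elseClause) {
        /** Returns true if the tuple satisfies the rule. */
        boolean evaluate(Tuple t) {
            return ifClause.test(t) ? thenClause.test(t) : elseClause.test(t);
        }
    }

    public static void main(String[] args) {
        // If x > 1 then y must be positive, else y must be zero.
        ConditionalRule rule = new ConditionalRule(
                t -> t.get("x") > 1,
                t -> t.get("y") > 0,
                t -> t.get("y") == 0);

        Tuple ok = new Tuple(Map.of("x", 2.0, "y", 5.0));
        Tuple violator = new Tuple(Map.of("x", 0.5, "y", 3.0));
        System.out.println(rule.evaluate(ok));        // true
        System.out.println(rule.evaluate(violator));  // false
    }
}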
