A DECLARATIVE APPROACH TO DATA VALIDATION OF STATISTICAL DATA SETS, BASED ON METADATA¹

George A. Petrakos and Gregory E. Farmakis
Liaison Systems S.A. – Research and Development of Information Systems
Academias 77, Athens, 106 78 GR (www.liaison.gr)
E-Mail: [email protected], [email protected]

Abstract
Consistent validation of Statistical Data is an essential prerequisite of any process aiming at ensuring the quality and homogeneity of Statistical Information, especially when data are acquired from distributed heterogeneous sources. The ad-hoc, “filter”-like solutions currently used are not sufficient to ensure a systematic and consistent validation process. The present paper provides a framework for the formal definition and classification of validation rules, which in turn is used as the theoretical basis for the definition of abstract data structures able to hold the rules’ declarations. Thus, the construction of an abstract information model that stores and handles validation rules as metadata becomes feasible. This leads to a declarative formalisation of rules, as opposed to the usual procedural (algorithmic) approach. Furthermore, based on these concepts, the paper describes the implementation of this data model in the form of distributed, globally accessible repositories (i.e. databases) of validation rules. Suitable validation engines can then access these repositories to consistently validate data sets, even in ad-hoc cases.
Keywords: Data Validation, Metadata, Outliers, Data Quality, Editing, Distributed Metadata Repositories.

¹ This paper was partially supported by the Statistical Office of the European Communities through the research project SUPCOM 98-lot 23, 98/S 119-76952.
1. INTRODUCTION
Information Technologies already provide the technical infrastructure that makes possible the transmission of large amounts of statistical data. Despite the abundance of data, though, information is still a scarce resource. This discrepancy results from a lack of context-preserving data integration. A networked system, whose ultimate mission is the transformation of statistical data into statistical information, must accommodate fundamental requirements inherent to the collection of statistical data from distributed sources, among which data validation is of utmost importance. In this respect, a Data Validation System is much more than just data type and nomenclature checking or filtering out of invalid data; it must satisfy a number of Data Quality requirements, stemming from the fact that data are provided by different data sources, and therefore the validation procedures must be consistent across these sources in order to ensure system-wide comparability and enable aggregation and further processing.

Most approaches to automated Data Validation are based on a procedural concept, i.e. the straightforward implementation (programming) of the validation logic, leading to a set of computer procedures which, in turn, prove difficult to maintain or adapt to changing requirements. Our approach, on the contrary, is based on a declarative concept: rules are in fact information metadata and should be stored and handled in the same way as the data themselves, i.e. in a metadata repository, such as a suitably designed relational database. A generic validation engine can then reconstruct the rules out of the metadata and apply them to the actual data. Apart from the obvious advantages of this approach, namely ensuring flexibility and consistency, there is also an important side effect: such an inventory can be distributed and broadly accessible, thus ensuring homogeneity in the validation process.

In order to group the validation rules into more comprehensive and meaningful categories, we have to identify the targets of the validation process. These are the data elements (variables), which are the elementary data building blocks and have values; the entities, which are groups of logically related elements that have instances; and the data schema, which is a group of related entities. Furthermore, we identify two types of data elements (variables): qualitative and quantitative. The qualitative variables (codes) do not have a meaningful metric and correspond to categorical validation rules, called logical edits. For simplicity, we include ordinal variables in the same treatment category. The quantitative or arithmetic variables are meaningful only in the presence of a metric, and correspond to numerical validation rules, also called arithmetic edits.
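As an illustration of these validation targets, the following minimal sketch (in Java, in line with the validation engine described later; all type and field names are illustrative assumptions, not part of the system described in this paper) models data elements, entities and a data schema, together with the qualitative/quantitative distinction:

import java.util.List;

// Minimal, hypothetical sketch of the validation targets introduced above; the
// names (MetricType, DataElement, Entity, DataSchema) are illustrative only.
enum MetricType { QUALITATIVE, QUANTITATIVE }                    // codes vs. arithmetic variables

record DataElement(String name, MetricType metric) { }           // elementary building block, has values
record Entity(String name, List<DataElement> elements) { }       // logically related elements, has instances
record DataSchema(String name, List<Entity> entities) { }        // group of related entities

public class ValidationTargets {
    public static void main(String[] args) {
        DataElement nace = new DataElement("NACE_CODE", MetricType.QUALITATIVE);
        DataElement turnover = new DataElement("TURNOVER", MetricType.QUANTITATIVE);
        DataSchema schema = new DataSchema("SBS",
                List.of(new Entity("ENTERPRISE", List.of(nace, turnover))));
        System.out.println(schema);
    }
}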
2. CLASSIFICATION
Based on the above definitions, we can identify four levels of validation rules, according to the data subset they address, each one divided into several distinct categories. Level 0 consists of preliminary controls applied to the metadata elements of the data set, Level 1 of controls applied to a single data element, Level 2 of controls applied to a single entity, while Level 3 controls validate a group of entities (data schema).

2.1 LEVEL 0: PRELIMINARY CONTROLS BASED ON META-DATA OF THE WHOLE DATA SET
In these tests, performed in the early stages of the Data Quality Control process, preliminary metadata attributed to the whole data set are examined against specific values. For example, Level 0 validation tests include verification of the sender of the data (in case of electronic submission), checking of the volume of the incoming data file, etc. STADIUM, the Eurostat electronic transmission platform, performs some of these preliminary validations, checking basic data set attributes such as sender verification, new or update file, date, confidentiality, etc. These are ad hoc controls based on our knowledge of the basic metadata of the statistical sets under control. The design and development of a structural metadata module in a statistical database will trigger more systematic validation at this level.

2.2 LEVEL 1 (L1JK): CONTROLS ON SPECIFIC ATTRIBUTES OF INDIVIDUAL DATA ELEMENTS (VARIABLES)
The attributes of individual data elements usually relate to data type and length (J=1) and to domain (J=2) (interval range, set of discrete values). According to the previous definitions, these data elements are also divided into qualitative (K=1) and quantitative (K=2) ones. According to this classification, we can identify L11+ controls based on the variable data type as specified in its metadata vector. A data type can be number, character, date, GIS [Spatial Statistics], etc. The number of digits is also specified in the metadata vector and is therefore tested under L11+ controls. While this control does not differentiate according to the variable's metric level, controls L121 and L122 do. L121 controls are domain controls for categorical/ordinal variables, also called logical edits. The logical edits, based on code values that are not acceptable, are evaluated against associated tables containing the acceptable codes of the variable under control. In a distributed system, these tables need to be accessible to all users and updated by the system administrator.
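To make the L11+ and L121 controls concrete, the following hedged sketch (Java; the reduced "metadata vector" of a maximum length and a code set, and all names, are assumptions made for the example, not the system's actual interfaces) checks a value against its declared type and length, and a code against its table of acceptable values:

import java.util.Set;

// Hypothetical sketch of Level 1 controls: an L11+ data type/length check and an
// L121 logical edit that validates a code against its table of acceptable values.
public class Level1Checks {

    /** L11+: the value must be numeric and must not exceed the declared number of digits. */
    static boolean checkNumericTypeAndLength(String value, int maxDigits) {
        return value.matches("\\d+") && value.length() <= maxDigits;
    }

    /** L121: the code must belong to the table of acceptable codes for this variable. */
    static boolean checkCodeDomain(String code, Set<String> acceptableCodes) {
        return acceptableCodes.contains(code);
    }

    public static void main(String[] args) {
        System.out.println(checkNumericTypeAndLength("10234", 6));           // true
        System.out.println(checkCodeDomain("EL", Set.of("EL", "DE", "FR"))); // true
    }
}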
On the other hand, L122 domain controls, addressed to numerical variables, are based on the determination of a series of intervals [a_i, b_i], which divide the real line ℜ into two subsets: S_A = ∪_i [a_i, b_i] and its complement with respect to ℜ, S_B = {∪_i [a_i, b_i]}^c. S_B is the unacceptable region for the variable under control, and when a realisation of the variable falls into that region an action (warning, deletion, correction, etc.) is taken. In most applications ∪_i [a_i, b_i] reduces to a single interval [a, b], which simplifies the definition of S_A and S_B to S_A = [a, b] and S_B = [a, b]^c with respect to ℜ. Furthermore, in certain cases, i.e. when b or a ∉ ℜ or when a or b = 0, the bracket ] or [ is replaced by ) or ( respectively. The lower and upper limits of the interval [a, b] can be either deterministic or stochastic. A lower (upper) limit is defined as deterministic when any value below (above) it is definitely unacceptable. The definition of stochastic limits, on the other hand, is actually the calculation of a confidence interval with a certain probability, which requires either knowledge of the distribution of the variable under study or access to a clean data set. This process is known as univariate outlier detection.

2.3 LEVEL 2 (L2CK): CONTROLS THAT COMPLY WITH THE NATURE OF AN ENTITY
In a group of logically related variables, controls coded L21+ refer to vertical coherence (C=1), while controls coded L22+ refer to horizontal coherence (C=2). Furthermore, as in Level 1, L2 controls are differentiated according to variable type into qualitative (K=1) and quantitative (K=2) ones. In this respect, we can identify L211 controls on possible repetitions of the values of certain key variables (double or multiple entries). The use of vertical coherence extends to sets of quantitative variables, where L212 controls refer to real functions computed vertically over these sets. A real function T(V) = T(V_1, V_2, …, V_n) of the n values V_i that a single data element V takes over n data records is tested against either a constant value, another variable or a function of another element. Some functions, such as the mean, the median, the variance and the coefficient of variation, are widely used, and their well-known properties can form the basis for meaningful controls of summary errors, biased results, etc.

Moving along the data type dimension, we identify one control type for qualitative variables and two for quantitative variables. The first, coded L221, consists of controls on qualitative variables corresponding to logical edits based on combinations of code values in different fields that are not acceptable. Let r be the total number of fields in a record and A_i the set of possible code values in field i = 1, 2, …, r. A record g has in field i a code belonging to some subset of A_i, say A_i0, so that g ∈ A_i0. A combination of code values in different fields can generally be expressed as ƒ(A_10, A_20, …, A_r0). The function ƒ becomes a logical edit rule when it is formulated as a set of unacceptable code combinations, so that a record g fails the rule specified by ƒ when g ∈ ƒ(A_10, A_20, …, A_r0), stating that certain specified combinations of code values are not permissible. In this category we should also include ANOVA-type checks, where a certain combination of code values dictates the value of a quantitative variable. An extensive description of the algebra of logical edits can be found in Fellegi & Holt, 1976.
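As an illustration of such a logical edit, the following hedged sketch (Java; the representation of ƒ as a set of field-to-code maps and the example field names are assumptions made for illustration, not the formalism of Fellegi & Holt) declares the unacceptable code combinations and flags any record that contains one of them:

import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an L221 logical edit: the rule is declared as a set of
// unacceptable combinations of code values across fields; a record fails the edit
// when all codes of some declared combination occur in it.
public class CombinationEdit {

    /** Unacceptable combinations, each given as field -> code. */
    private final Set<Map<String, String>> unacceptable;

    CombinationEdit(Set<Map<String, String>> unacceptable) {
        this.unacceptable = unacceptable;
    }

    /** A record g fails the rule when some unacceptable combination is fully present in g. */
    boolean fails(Map<String, String> record) {
        return unacceptable.stream()
                .anyMatch(combo -> combo.entrySet().stream()
                        .allMatch(e -> e.getValue().equals(record.get(e.getKey()))));
    }

    public static void main(String[] args) {
        // e.g. "age group = child" combined with "marital status = married" is not permissible
        CombinationEdit edit = new CombinationEdit(
                Set.of(Map.of("AGE_GROUP", "CHILD", "MARITAL_STATUS", "MARRIED")));
        System.out.println(edit.fails(Map.of("AGE_GROUP", "CHILD",
                                             "MARITAL_STATUS", "MARRIED")));   // true -> record fails
    }
}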
There are two approaches to implementing L222 controls in a group of logically related quantitative variables. The univariate approach consists of the definition of a real function that evaluates the relation of the variables in the entity. This reflects a projection from ℜ^k, k < r, where r is the number of variables in the same entity, into the reduced space ℜ. Formally speaking, this is a type L122 domain control where, instead of a single variable, a real function g: ℜ^k → ℜ, g(V_i), is under control (a sketch of such a control is given at the end of section 2). Again, L222 controls are based on the determination of a series of intervals [a_i, b_i], which divide ℜ into two subsets: S_A = ∪_i [a_i, b_i] and its complement with respect to ℜ, S_B = {∪_i [a_i, b_i]}^c. Univariate methods in principle cannot incorporate in their decision-making process the correlation among the variables of the data set, or with other external sets of variables of the same structure and behaviour. The commonly used multivariate techniques first determine a clean subset of the data set, in the form of sequential ellipses (Bartkowiak & Szustalewicz, 1998) or bivariate plots (Grossi, 1998) respectively, and then use a forward search to detect outliers among the remaining (suspect) data points. Both multivariate techniques are robust, since they are based on a nonparametric approach. Hadi & Simonoff (1993) provide a brief literature review of various methods aiming at the identification of a subset of candidate outliers within a given data set and its separation from the remaining, so-called clean data set. An interesting approach, applicable to most of the above-mentioned cases, is the significance editing strategy, based on predetermined expected amended values derived from subject-matter knowledge of the study or studies under control (Lawrence and McKenzie, 2000). Other approaches include the score variable method (Hidiroglou and Berthelot, 1986), the weighted deviance from group medians (Van de Pol, 1997) and the method of scoring edit failures (Latouche and Berthelot, 1992).

2.4 LEVEL 3: CONTROLS IN WHICH ELEMENTS OF DIFFERENT ENTITIES ARE INVOLVED
It is possible to identify related groups of elements belonging to different data sets, distributed in time and space, with certain covariance structures. In other cases, the same entities appear in different data sets and should do so in an identical or compatible manner. In both cases, a validation treatment will require the availability of all the data sets involved (data schema). Level 3 controls follow the same classification as L2 controls, except that the variables involved belong to different entities or even different data sets; they are closely related to the harmonisation process, which is a major quality component of large and complex databases (Eurostat, 1997; Statistics Canada, 1998).
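The following minimal sketch illustrates the univariate approach to an L222 control described in section 2.3 (Java; the ratio function, the clean reference sample and the mean ± 3·sd confidence limits are illustrative assumptions, not prescriptions of the paper): a real function g of two quantitative variables is projected into ℜ and tested against an acceptable region S_A whose limits are estimated stochastically from a clean data set, as in L122 controls.

import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an L222 control in the univariate approach: project the
// entity's variables through g into ℜ and test the result against S_A = ∪ [a_i, b_i],
// here a single interval with stochastic limits estimated from a clean sample.
public class RatioEdit {

    record Interval(double a, double b) {
        boolean contains(double x) { return x >= a && x <= b; }
    }

    /** g: ℜ^2 → ℜ, the function under control (e.g. turnover per employee). */
    static double g(double turnover, double employees) {
        return turnover / employees;
    }

    /** Stochastic limits: a confidence interval (mean ± z·sd) estimated from a clean data set. */
    static Interval stochasticInterval(double[] cleanSample, double z) {
        double mean = Arrays.stream(cleanSample).average().orElse(0.0);
        double var = Arrays.stream(cleanSample).map(x -> (x - mean) * (x - mean)).sum()
                     / (cleanSample.length - 1);
        double sd = Math.sqrt(var);
        return new Interval(mean - z * sd, mean + z * sd);
    }

    /** The value passes when it falls into S_A; otherwise an action (warning, correction) is triggered. */
    static boolean inAcceptableRegion(double value, List<Interval> sA) {
        return sA.stream().anyMatch(i -> i.contains(value));
    }

    public static void main(String[] args) {
        double[] cleanRatios = {42.0, 45.5, 39.8, 44.1, 41.3};
        List<Interval> sA = List.of(stochasticInterval(cleanRatios, 3.0));
        double ratio = g(400_000.0, 10.0);                   // 40,000: far outside S_A
        System.out.println(inAcceptableRegion(ratio, sA));   // false -> action is taken
    }
}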
3. VALIDATION RULES REPOSITORY
In contrast to the usual procedural (algorithmic) approach to the validation process, our approach is based on the semantic, declarative modelling of the data themselves. The validation logic for any given data set inherently stems from the nature of the underlying real-world objects represented by these data. Thus, it is an inseparable part of the data model, rather than just a set of “externally” applied rules. In this respect, validation logic is in fact another form of information metadata and should be treated as such, i.e. semantically declared within the data model and integrated with the other metadata, while the actual validation rules to be applied can be generated from this model. Our model is based on the concept of the data domain, as defined in relational algebra, albeit extending this concept in order to allow: (a) the flexible declaration of domains applicable to composite, higher-level data structures; (b) a mechanism for the hierarchical generation of domains out of lower-level ones; as well as (c) the declaration of non-deterministic domains. The model (Fig. 1) is designed so as to allow the parallel declaration of data objects (ranging from simple data elements to entire data schemata) along with the declaration of domains (i.e. sets of allowable values) for all these objects, regardless of their hierarchical level (i.e. a domain may be defined for a composite structure or even an entire data set). The model hierarchies imply a recursive domain generation mechanism: only the lowest-level domains (those of simple data elements) are explicitly defined as sets of values or ranges, while all higher-level domains are subsequently generated with algebraic operations on domains at the immediately lower level only.
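A minimal sketch of this recursive generation mechanism is given below (Java; the Domain interface, the union/complement operations and all example codes and ranges are assumptions made for illustration, not the repository's actual object model, which is described in section 3.1): lowest-level domains are declared explicitly, while higher-level ones are obtained by algebraic operations on them.

import java.util.List;
import java.util.Set;

// Hypothetical sketch of recursive domain generation: explicit lowest-level domains,
// higher-level domains built by union, complement and product-space membership.
interface Domain<T> {
    boolean contains(T value);

    /** Union of this domain with another one at the same level. */
    default Domain<T> union(Domain<T> other) {
        return v -> this.contains(v) || other.contains(v);
    }

    /** Complement with respect to the basic data type. */
    default Domain<T> complement() {
        return v -> !this.contains(v);
    }
}

public class DomainGeneration {

    /** Lowest level: a categorical domain, explicitly enumerated. */
    static Domain<String> categorical(Set<String> codes) {
        return codes::contains;
    }

    /** Lowest level: a deterministic numerical range [min, max]. */
    static Domain<Double> range(double min, double max) {
        return v -> v >= min && v <= max;
    }

    /** Higher level: membership of a tuple in the product space of component domains. */
    static boolean compositeContains(List<Domain<Object>> components, List<Object> tuple) {
        for (int i = 0; i < components.size(); i++) {
            if (!components.get(i).contains(tuple.get(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Domain<String> nuts = categorical(Set.of("EL30", "EL41"));
        Domain<Double> turnover = range(0.0, 1_000_000.0).union(range(2_000_000.0, 3_000_000.0));
        System.out.println(nuts.contains("EL30"));           // true
        System.out.println(turnover.contains(1_500_000.0));  // false: in the unacceptable region
    }
}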
3.1 REPOSITORY OBJECT MODEL

Data Objects
The model is characterised by a hierarchy of data objects, namely (from top to bottom): a data schema consisting of different data sets, which in turn include different data record types. Each data record is an ordered vector of several generic data elements, which in turn can be either simple data elements (i.e. the elementary building blocks of the model, which can be of either a quantitative or a qualitative nature) or composite data elements (i.e. composed of several simple ones). Recursive Data Element Hierarchies, on the other hand, allow the elegant definition of unlimited-level hierarchies of simple and composite data elements. In this way, the structure of complex data sets can be semantically declared in the form of metadata. For each of these objects, the system holds rich metadata (definitions, descriptions, aliases, etc.), including domains of definition.

Domain Objects
A similar hierarchical structure exists for the objects that represent domains of definition, which are independent objects. Thus, simple data elements are defined on basic data types, defined subsets of which form simple domains. A simple domain may also be defined as the union of other, simpler domains. Mirroring the data element categories, there are also qualitative and quantitative domains, as well as subtypes of them. Qualitative domains can be of two distinct types: (a) categorical (i.e. a finite set of values, such as codes, explicitly enumerated); and (b) format (i.e. a syntax rule defining the set of allowable values for a character string). Numerical domains, on the other hand, are defined as a union of subsets (ranges) of the corresponding basic data type. According to how these ranges are defined, they fall into two mutually exclusive categories, namely deterministic and stochastic. Deterministic numerical domains are trivially defined by a minimum and a maximum. Stochastic ranges are based on a confidence interval for the limits, which is either known in advance, if the distribution is also known, or calculated ad hoc through a selected, applicable univariate outlier detection method. Composite Data Elements are defined on Composite Domains, which are subsets of the product space of the domains of the component data elements. An important point here is that we can implicitly define a composite domain by simply defining its complement with respect to the product space of the already defined component domains. Similarly to Generic Elements, from which Simple and Composite Data Elements inherit most of their metadata attributes, simple or composite domain objects inherit metadata from the Generic Domain abstract object. This includes, among others, a trigger attribute that defines the action to be taken when the domain is violated (i.e. "warning", "error", etc.).
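The sketch below is a hypothetical, much-simplified rendering of this object model (Java; the class names follow the terminology above, but all fields, the enumeration of trigger actions and the two concrete domain types are illustrative assumptions, not the IDEF1X model of Fig. 1):

import java.util.List;

/** Action taken when a domain is violated (the "trigger" attribute). */
enum Trigger { WARNING, ERROR }

/** Generic domain: simple and composite domains inherit common metadata from it. */
abstract class GenericDomain {
    String name;
    Trigger trigger;                 // action to take on violation
    abstract boolean contains(Object value);
}

/** Deterministic numerical domain: trivially defined by a minimum and a maximum. */
class DeterministicNumericalDomain extends GenericDomain {
    double min, max;
    boolean contains(Object value) {
        return value instanceof Number n && n.doubleValue() >= min && n.doubleValue() <= max;
    }
}

/** Categorical domain: a finite, explicitly enumerated set of codes. */
class CategoricalDomain extends GenericDomain {
    List<String> codes;
    boolean contains(Object value) { return codes.contains(value); }
}

/** Generic data element: common metadata for simple and composite elements. */
abstract class GenericDataElement {
    String name, description;
    GenericDomain domain;            // the element's domain of definition
}

/** Simple data element: elementary building block, qualitative or quantitative. */
class SimpleDataElement extends GenericDataElement { }

/** Composite data element: composed of several elements, defined on a composite
 *  domain over the product space of the component domains. */
class CompositeDataElement extends GenericDataElement {
    List<GenericDataElement> components;
}

/** Data record, data set and data schema complete the hierarchy from the bottom up. */
class DataRecordType { List<GenericDataElement> elements; }
class DataSetObject { List<DataRecordType> recordTypes; }
class DataSchemaObject { List<DataSetObject> dataSets; }

/** Tiny demonstration of the model. */
class ObjectModelDemo {
    public static void main(String[] args) {
        CategoricalDomain sex = new CategoricalDomain();
        sex.codes = List.of("M", "F");
        sex.trigger = Trigger.ERROR;
        System.out.println(sex.contains("X") ? "ok" : "violation -> " + sex.trigger);
    }
}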
3.2 SYSTEM ARCHITECTURE
The system architecture consists of the following basic sub-systems. The first is a distributed (and replicated) repository of validation rules, implemented on a relational Oracle 8i database along the general lines already described; authentication and access control schemes will also be implemented on the repository. The second is a validation engine, implemented as a set of platform-independent Java modules, capable of remotely (over the Internet) accessing the rules in the repository, interpreting their declarations and acting upon the data sets under inspection. The validation engine in turn includes several modules, namely a Rules Retrieval Module, a Validation Module, a Quality Metrics Module and a Reporting Module. The third is a user client, using a local database as a repository for transient data sets under inspection and including modules for graphical rules representation, rules management, local storage access and user authentication, as well as data set transmission. The system architecture is graphically depicted in the accompanying diagram (Fig. 2).
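To indicate how these modules might cooperate, the following hedged sketch wires a rules retrieval step, a validation step and a reporting step together (Java; it reuses the hypothetical GenericDomain and Trigger types of the previous sketch, and the module interfaces shown here are assumptions made for illustration, not the actual APIs of the implemented system):

import java.util.List;

/** Hypothetical rules retrieval module: fetches declared domains for a data set,
 *  e.g. over the Internet from the remote repository. */
interface RulesRetrievalModule {
    List<GenericDomain> retrieveRules(String dataSetName);
}

/** Hypothetical reporting module: records the outcome of each violated rule. */
interface ReportingModule {
    void report(String element, Trigger action);
}

/** Hypothetical validation module: applies the retrieved domains to a record.
 *  Assumes the retrieved domains are ordered to match the record's elements. */
class ValidationModule {
    private final RulesRetrievalModule rules;
    private final ReportingModule reporter;

    ValidationModule(RulesRetrievalModule rules, ReportingModule reporter) {
        this.rules = rules;
        this.reporter = reporter;
    }

    void validate(String dataSetName, List<Object> record, List<String> elementNames) {
        List<GenericDomain> domains = rules.retrieveRules(dataSetName);
        for (int i = 0; i < domains.size(); i++) {
            GenericDomain d = domains.get(i);
            if (!d.contains(record.get(i))) {
                reporter.report(elementNames.get(i), d.trigger);   // "warning", "error", ...
            }
        }
    }

    public static void main(String[] args) {
        // Wire the modules with trivial stand-ins (lambdas) for demonstration.
        RulesRetrievalModule rules = dataSet -> List.of();         // no rules retrieved in this stub
        ReportingModule reporter = (element, action) -> System.out.println(element + ": " + action);
        new ValidationModule(rules, reporter).validate("SBS", List.of(), List.of());
    }
}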
4. IMPLEMENTATION
The logical schema of the rules repository described above has been implemented on an Oracle 8i database, along with a prototype rules definition interface. In order to validate the feasibility of the repository, actual validation rules for a real statistical data set have been analysed and used to populate the repository. This has shown in practice that it is possible to semantically declare all possible Level 1 and Level 2 validation rules, while the declaration of complex Level 3 rules needs some model enhancements. Level 3 tests refer to data elements across different Data Records, Data Sets or even different Data Schemata. Moreover, they can include referential integrity constraints, cross validation or mirror statistics. Due to their inherent complexity, Level 3 tests cannot be efficiently defined declaratively and performed in the pre-load data validation process, which is the scope of this model. This is mainly due to the non-normalised structure of the Data Sets and the one-data-set-at-a-time nature of the validation process. Furthermore, Level 3 tests, even when only modestly more complex than simple referential integrity checks, can only be performed efficiently if the complete data are stored in a semantically rich relational database and the tests are coded and stored as database stored procedures. Nevertheless, since the modelled validation system will be based on the concepts of transient data storage and open database connectivity, Level 3 tests remain feasible, although they are handled in an ad hoc way through advanced database mechanisms such as stored procedures, events, triggers and so forth.
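For illustration only, the following minimal sketch shows the logic of one simple Level 3 test, a referential integrity check between two data sets (Java; the data set names and codes are invented for the example, and in the system described here such tests would rather be delegated to database mechanisms such as stored procedures):

import java.util.List;
import java.util.Set;

// Hypothetical sketch of a Level 3 referential integrity test: every foreign-key value
// appearing in one data set must exist as a key in another data set of the schema.
public class Level3Checks {

    /** Returns the foreign-key values of the first data set that are missing from the second. */
    static List<String> danglingReferences(List<String> foreignKeys, Set<String> referencedKeys) {
        return foreignKeys.stream()
                .filter(k -> !referencedKeys.contains(k))
                .toList();
    }

    public static void main(String[] args) {
        List<String> regionCodesInTradeData = List.of("EL30", "EL41", "XX99");
        Set<String> regionCodesInNomenclature = Set.of("EL30", "EL41", "EL42");
        // "XX99" is reported as violating cross-data-set consistency
        System.out.println(danglingReferences(regionCodesInTradeData, regionCodesInNomenclature));
    }
}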
5. CONCLUSIONS
Validation procedures must be well defined, complete and consistent across the various stages of the data life cycle in order to ensure the quality of the produced statistical information. This paper provides a formal classification of validation rules based on the level of application (four targets), data attributes, variable types and directional coherence. Furthermore, it describes the architecture of a distributed metadata repository containing the objects that perform the controls identified and classified above. This repository can be implemented in a (distributed) relational database, accessible through the Internet by (possibly CORBA-based) applications for rules maintenance and data set validation. Based on these results, further research on the theoretical basis of validation rules and the design of the metadata repository, as well as a working prototype, will be carried out in the framework of INSPECTOR, an IST/EPROS research project.
Fig. 1 – Rules Repository IDEF1X Model
Fig. 2 – System Architecture
REFERENCES

BARTKOWIAK, A., SZUSTALEWICZ, A. (1998), "Outliers - which genuine and which spurious", NTTS'98 New Techniques & Technologies for Statistics, Sorrento, IT.
DATE, C. J. et al. (1998), "Relational Database Writings 1994-1997", Addison-Wesley.
DUTKA, A. F., HANSON, H. H. (1989), "Fundamentals of Data Normalisation", Addison-Wesley.
EUROSTAT (1997), "Mirror Leaflet".
FELLEGI, I. P., HOLT, D. (1976), "A Systematic Approach to Automatic Edit and Imputation", Journal of the American Statistical Association, Vol. 71, No. 353, Application Section, p. 17-35.
GROSSI, L. (1998), "Outlier detection for quality assessment of data sets", NTTS'98 New Techniques & Technologies for Statistics, Sorrento, IT.
HADI, A. S., SIMONOFF, J. S. (1993), "Procedures for the Identification of Multiple Outliers in Linear Models", Journal of the American Statistical Association, Vol. 88, No. 424, p. 1254-1272.
HIDIROGLOU, M. A., BERTHELOT, J. M. (1986), "Statistical Editing and Imputation for Periodic Business Surveys", Survey Methodology, Vol. 12, p. 73-84.
LATOUCHE, M., BERTHELOT, J. M. (1992), "Use of a Score Function to Prioritize and Limit Recontacts in Editing Business Surveys", Journal of Official Statistics, Vol. 8, p. 389-400.
LAWRENCE, D., MCKENZIE, R. (2000), "The General Application of Significance Editing", Journal of Official Statistics, Vol. 16, No. 3, p. 243-253.
NIST (1993), "Integration Definition for Information Modelling (IDEF1X)", Federal Information Processing Standards Publication 184, National Institute of Standards and Technology.
STATISTICS CANADA (1998), "Quality Guidelines", Third Edition.