Diploma Thesis
CONTINUOUS DATA QUALITY ASSESSMENT IN INFORMATION SYSTEMS

by Ian Michael Prilop

Natural Language Processing Group, Institute of Computer Science, Faculty of Mathematics and Computer Science, University of Leipzig, Germany in cooperation with Fraunhofer MOEZ - Center for Central and Eastern Europe, Leipzig, Germany

Leipzig, 30th July, 2014

Supervisors:
Professor Dr. Gerhard Heyer, Natural Language Processing Group, University of Leipzig
Junior Professor Dr. Lutz Maicher, Technology Transfer Research, Friedrich Schiller University of Jena; Competitive Intelligence Group, Fraunhofer MOEZ, Leipzig

Abstract

Deficiencies in data quality cause major costs for businesses all over the world. It is therefore essential for any information system to deliver data of high quality. To achieve this, a continuous improvement process is necessary. With this aim, we have developed continuous data quality assessment (CDQA). CDQA enables a single data quality score to be calculated for any information system. This score is intuitively comprehensible and allows comparison of information systems and tracking of data quality development over time. It is transparently calculated from several data quality dimensions which are based on established literature. The dimensions are clarified and grouped into three categories to facilitate the discussion and implementation of assessments. CDQA supports the implementation of objective assessments and uses continuous subjective assessment feedback to meet the data quality expected by consumers. Objective assessments are carried out using the agent software artifact. The dimensions and metrics to include in the assessment are specified in easy-to-understand plain text files. Each dimension is assessed continuously and the results are gathered by the collector software artifact. The collector visualizes the results over time, giving insight into how data quality develops. In addition, improvement advice is displayed to guide the editorial team in improving data quality. CDQA is demonstrated by applying it to the IP Industry Base (IPIB) information system. The evaluation shows that CDQA allows data quality deficiencies to be identified and is useful for continuously improving the data quality of an information system.

Keywords: data quality, continuous assessment, information system, CDQA


Contents

1 Introduction
  1.1 Motivation and Problem Statement
  1.2 Goals
  1.3 Methodology
2 Definitions and Terminology
  2.1 Data
  2.2 Quality
  2.3 Data Quality
  2.4 Information System and Data Quality Assessment
3 Data Quality Dimensions
  3.1 Prior Work
  3.2 Selected Data Quality Dimensions
  3.3 Categorization of Data Quality Dimensions
4 Continuous Data Quality Assessment Process
  4.1 Data Quality Assessment Implementation Cycle
  4.2 Continuous Data Quality Assessment Process
  4.3 Data Quality Calculation
5 Data Quality Dimensions in Detail
  5.1 Objective Measurement Dimensions
  5.2 Objective Evaluation Dimensions
  5.3 Subjective Evaluation Dimensions
6 Continuous Data Quality Assessment Software Artifacts
  6.1 Agent
  6.2 Collector
7 Demonstration of Applying CDQA
  7.1 Overview of the IPIB Application
  7.2 IPIB Data Items and Existing Data Quality Methods
  7.3 CDQA Implementation
  7.4 Assessment of Historic IPIB Versions and Data
8 Evaluation
  8.1 Results and Insights from Applying CDQA
  8.2 Conclusion and Outlook
Bibliography
Appendix


List of Figures (Figures 1–33)

List of Tables (Tables 1–10)

List of Listings (Listings 1–14)


Abbreviations

agent: CDQA-Agent
API: Application Programming Interface
BDD: Behavior Driven Development
CDQA: Continuous Data Quality Assessment
CI: Competitive Intelligence
collector: CDQA-Collector
CRUD: Create Read Update Delete
CSV: Comma-separated Values
DQ-Test: Data Quality Test
DSL: Domain Specific Language
FK: Foreign Key
GB: Gigabyte
GET: HTTP GET
HEAD: HTTP HEAD
HTTP: Hypertext Transfer Protocol
HTTPS: HTTP over TLS
IP: Intellectual Property
IPIB: IP Industry Base
IPST: Intellectual Property Services Taxonomy
ISO: International Organization for Standardization
MVC: Model-View-Controller
NULL: special marker to indicate that a value does not exist
ORM: Object-Relational-Mapper
PDCA: Plan-Do-Check-Act
PK: Primary Key
Rails: Ruby on Rails
REST: Representational State Transfer
RESTful: Web services following the architectural constraints of REST
SOC 2: Service Organization Controls 2
SQL: Structured Query Language
UI: User Interface
UML: Unified Modeling Language
URL: Uniform Resource Locator
WCAG: Web Content Accessibility Guidelines
XML: Extensible Markup Language


1 Introduction

Data quality¹ plays a central role in nearly every company's business [1] and even more so in information systems. Low data quality results in avoidable costs, and many companies report achieving benefits by using data of good quality [2]. The issue of data quality has been discussed for many years in both academia and industry but still remains a challenge.

Although no generally accepted method of evaluating the costs of poor data quality exists [3], there are several widely quoted studies estimating the costs for a specific sector or region. Most prominent is a report by The Data Warehousing Institute from 2002, estimating that "data quality problems cost U.S. businesses more than $600 billion a year" [1]. The institute confirmed the losses companies suffer due to poor quality data in a later study [4]. This study reported that nearly half of all companies recognize poor data quality as a problem. Problems are not only caused by the state of the data but also "arise from technical issues [...], business processes [...] even come from the outside [...]" [4]. Thus, to improve data quality, not only the data but also the context of the data has to be taken into account.

More recently, in 2011, the Economist Intelligence Unit² explored "the impact of big data and how companies are handling it". Almost half of all companies still report deficits in the management of data. Among other issues, problems with accuracy, timeliness, and accessing the right data are reported [2].

To institutionalize a data quality improvement program, continuous assessments are necessary [5]. These must be supported by an appropriate process and software, which will be created in this thesis.

¹ Definitions and a terminology are given in chapter 2.
² The Economist Intelligence Unit is part of the Economist Group, most widely known for the weekly international news and business publication "The Economist".


1.1 Motivation and Problem Statement

As stated above, achieving high data quality is essential. Unfortunately, this is still a big challenge, as reported by a majority of data analysts [6]. Such a challenge motivated this thesis.

The IP Industry Base³ (IPIB) is an information system covering companies in the Intellectual Property (IP) industry. The question of data quality in the IPIB arose very early and was initially covered by editorial guidelines. Later, a small library was introduced to help editors improve individual company entries. The library calculates a quality score for any company on an open scale. A company thus carries a quality score such as "42". This score does not represent any well-known quality dimension such as accuracy, so a consumer is unable to interpret its meaning. In addition, because the score is calculated on an open scale, a consumer cannot tell whether "42" is a good score or not. Current questions asked by the IPIB consumers include "How good is the data quality of the whole system?" and "What does this value actually mean?".

Data quality in an information system is constantly changing. Even without active edits, data quality will deteriorate over time as the information provided becomes less and less up to date. The IPIB editorial and development team is concerned with how to continuously work on improving data quality while making sure they do not invest in improvements which are not needed by the consumers. The capability to review the data quality development over time, in order to evaluate the impact of data quality initiatives or system redesigns, is also required.

To meet these requirements it is necessary to continuously assess data quality. To our knowledge, there is no methodology and software library that solves the above problems efficiently. The goal of this thesis is to create such a methodology and software, which will be applied and evaluated within the IPIB information system. The results will improve the consumers' understanding of the IPIB data quality and support the development and editorial team in improving data quality.

³ A detailed introduction to the IPIB is given in chapter 7.1.


1.2 Goals

Our overall objective is to systematically implement continuous data quality assessment (CDQA) for information systems. We will do this by building on established data quality concepts. CDQA will introduce a continuous process and software artifacts to support the implementation and evaluation of assessments. In detail we set the following goals:

• Total data quality score for information systems: Methods to calculate a single unified data quality score on a fixed scale will be provided. This will allow consumers to quickly review the current quality state of the information system (a minimal sketch of such a score aggregation follows this list).

• Understandable data quality scores: The meaning of the total data quality score is difficult to convey. Therefore the total data quality score for an information system will be transparently calculated from several data quality dimensions. These dimensions will form the foundation of CDQA and are selected from established literature. Thus the consumer will have an understanding of these, and existing definitions can be utilized to explain their meaning. Furthermore, a more detailed explanation and example metrics will be given for each dimension.

• CDQA process: A process will be designed which supports implementing data quality assessments and achieving a set quality goal using an iterative approach. The process will also help to avoid over-optimizing data quality beyond the consumers' needs.

• Software artifacts to support CDQA: The software artifacts will support defining data quality assessments and expectations. Continuous calculation of results will be possible. Results will be collected over time and visualized to support evaluation.
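To make the first goal more concrete, the following Ruby sketch shows one possible way to combine per-dimension scores, each normalized to a fixed 0.0–1.0 scale, into a single weighted total on the same scale. The dimension names, scores, and weights are illustrative assumptions only; the calculation actually used by CDQA is defined in chapter 4.3.

    # Minimal sketch (illustrative names and weights): combine per-dimension
    # scores, each normalized to the fixed scale 0.0..1.0, into one weighted
    # total on the same scale.
    def total_data_quality(scores, weights)
      weighted_sum = scores.sum { |dimension, score| score * weights.fetch(dimension, 1.0) }
      total_weight = scores.keys.sum { |dimension| weights.fetch(dimension, 1.0) }
      weighted_sum / total_weight
    end

    scores  = { syntactic_accuracy: 0.92, internal_completeness: 0.78, currency: 0.65 }
    weights = { syntactic_accuracy: 2.0, internal_completeness: 1.0, currency: 1.0 }

    puts total_data_quality(scores, weights).round(2)   # => 0.82

Because every dimension score lives on the same fixed scale, the resulting total stays comparable across information systems and over time, which is exactly what the first goal requires.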


1.3 Methodology

To achieve the above goals this thesis will follow the principles of design science research [7]. Design science research "addresses important unsolved problems in unique or innovative ways or solved problems in more effective or efficient ways" [7]. As mentioned in the motivation, the issue of data quality is as yet unsolved, and especially in our case there is no software artifact capable of accomplishing the goals stated above. By applying design science research we will not only solve the identified problems for our use case but also provide a more general solution which will be disseminated to a larger audience.

We will use the process suggested by the design science research methodology [8]. This process follows six steps to "help provide a road map" and provide a "template for readers", making it easier to evaluate this research. The entry point of this research is problem centered, as "[...] the idea for the research resulted from observation of the problem [...]" [8]. The six steps to be undertaken are as follows:

1. Problem identification and motivation: given above in chapter 1.1.

2. Definition of the objectives for a solution: given above in chapter 1.2.

3. Design and development: As a first step we select a list of data quality dimensions (chapter 3). This selection will be based on consensus, using dimensions from existing literature. This achieves recognizability of the dimensions and gives them credibility. As a second step we design a process to be used for CDQA (chapter 4). This includes explaining how to calculate data quality scores in a comparable but flexible way and how to create a single score from several data quality dimensions. As a third step we explain each dimension in detail and give example metrics and possible methods of improvement (chapter 5). As a fourth step we design software artifacts which make it easy to describe how to assess dimensions and which support the continuous assessment and collection of data quality scores (chapter 6).

4. Demonstration: The artifacts created are used to implement CDQA for the IPIB (chapter 7).

5. Evaluation: The process and artifacts are evaluated against the objectives. This is done on the basis of the results of the demonstration and by questioning professionals using the demonstration system (chapter 8).

6. Communication: The thesis results are disseminated to the scientific community in this publication. The resulting software artifacts will be provided for use in future projects, which makes them available to practitioners.


2 Definitions and Terminology

To better understand the term data quality, we will compare different definitions of the terms data, quality, and data quality, and define the terms as used in this thesis. We will also outline the main components of an information system and the relevant actors, and give short descriptions of these.

2.1 Data

Figure 1: Knowledge Pyramid as given by Rowley [9, Fig. 1], who calls it the wisdom hierarchy. Its levels, from bottom to top, are Data, Information, Knowledge and Wisdom.

Definitions of data often refer to the "Knowledge Pyramid" as depicted in Figure 1 and to data as the base level of this pyramid. The higher-level concepts in this pyramid are information, knowledge and wisdom. Rowley compares the definitions of these concepts in 16 relevant sources as follows: definitions of data are "largely in terms of what data lacks; data lacks meaning or value, is unorganized and unprocessed" [9]. Rowley quotes Ackoff⁴, whose definition of data is "symbols that represent properties of objects, events and their environment", and who states that "The difference between data and information is functional, not structural". Thus data represents real-world objects, and, for instance, a meaningful selection depending on a query makes the difference between data and information.

In the literature from the data quality field the definition of data is often more precise. For instance, Batini and Scannapieco state: "Data represent real world objects, in a format that can be stored, retrieved, and elaborated by a software procedure, and communicated through a network." [10, p. 6]. ISO 8000 ("Data Quality") is said to define data as "the electronic representation of information" [11], which brings information and data even closer together than Ackoff, who still saw a functional difference.

We define data not only as a collection of symbols but also as having an internal structure. Data therefore is an electronic representation of real-world objects. But we still see a difference between data and information, similar to Ackoff: only the conversion of data into representations delivers information to the consumer.

2.2 Quality

Quality is a concept which is difficult to define. To be of quality, something not only has to satisfy formal requirements but also needs features specific to a situation or usage. This is reflected in the ISO 8402 definition of quality: "The totality of features and characteristics of a product that bear on its ability to satisfy stated or implied needs [ISO 8402 1994]" [12, p. 1490]. This also implies that the quality of something is not universal but can differ between usages. In our case, the same data can be of high quality for some consumers while being of low quality for others, depending on the use case.

⁴ Ackoff is often referred to as the original author of the knowledge pyramid.


2.3 Data Quality

Finding a precise definition for data quality is hard. Most authors simply describe data quality as an intuitive concept by providing examples or discussing data quality dimensions. A suitable source for such a definition could be ISO 8000 ("Data Quality"), which unfortunately is not available to us due to financial restrictions. Batini and Scannapieco state "Data Quality is a multifaceted concept, as in whose definition different dimensions concur" [10, p. 6], thus deferring to the definitions of the dimensions. A more direct definition is given by Wang and Strong, who define "data quality" as "data that are fit for use by data consumers". This uses the widespread notion of "fitness for use", which originated in the work of J. M. Juran: "Data are of high quality if they are 'fit for use' in their intended operational, decision-making, and other roles." [13, p. 34.9]. Fitness for use captures the difficulty that the quality of data cannot be defined universally but has to be evaluated depending on the situation.

The literature reviewed does not differentiate between data quality and information quality. For instance, the International Association for Information and Data Quality refers, for the definition of data quality, to the definition of information quality [14]. Other authors speak of either data quality or information quality while referencing the same concepts. For instance, Wang and Strong use the term data quality, and Naumann and Rolker refer to Wang's work by simply replacing data quality with information quality [15].

We understand data quality as the entire information system's "fitness for use" for consumers. We also do not distinguish between information quality and data quality. Information is delivered by the system's representations and is therefore included in the system's fitness for use. Thus, although data is only one component of the system, data quality covers the whole system.

2.4 Information System and Data Quality Assessment

We follow the definition of an information system as a computer-based application system for performing business tasks [16]. Here we narrow this understanding to information systems which are accessible via the Internet and provide a browser-based user interface. Figure 2 shows such an information system and the components we will refer to throughout this thesis, along with all relevant actors. The components on which data quality assessments are based are highlighted separately.

Figure 2: Overview of information system components, data quality assessment and actors referenced in this thesis. The arrows indicate the direction of information flow or actions taken. Components shown: data sources, data model, data items, representations, data quality expectations, data quality assessment descriptions and data quality assessment implementations. Actors shown: author, editor, consumer, developer, data quality architect and data quality developer.

As can be seen in Figure 2, in our understanding the author feeds data into an information system based on sources. The resulting data is made up of data items and can be further curated by the editor. The data (including data items) is then transformed into representations which are accessed by the consumer. The developer is responsible for changes in the data model, the representations, or the method of transforming data into representations. The data quality assessments are based on the expectations the consumers have regarding data quality. Analyzing these expectations, the data quality architect creates assessment descriptions. These are then implemented as assessments in the information system by the data quality developer. We will now explain the terms used in Figure 2 in more detail:


Sources contain data in any form, such as unstructured (natural language), semi-structured (Wikipedia info boxes), or structured (SQL database, XML, etc.). They are mostly available over the Internet or at least in an electronic form.

Data model defines a certain structure for the stored data. The degree of formalization depends on the system used to store data. For instance, a SQL database includes a very strict data model (schema), whereas a document-oriented database may only provide an informal data model in the form of documentation.

Data is the information extracted from the sources into the information system. Data is a collection of data items.

Data item is the smallest unit of data which makes sense to identify. The size depends on the task at hand. For instance, in some cases a data item contains all information associated with a single company, whereas in another case a single attribute is viewed as a data item.

Representations are different views of the data and transform the data into meaningful information. A representation can simply display a data item or be the result of a complex query by a consumer.

Information system is the stack which contains the data, the representations, and the transformations to convert data into representations.

Data quality expectations contain the expectations a consumer has in regard to the information system's data quality. These expectations are always consumer and context dependent. For instance, a consumer may expect "complete" company information.

Data quality assessment descriptions are a formalization of the consumer expectations. The consumer expectations are thereby transformed into units for which assessments can be made. In the example above, a company would be measured as complete if all possible attributes contain data.

Data quality assessment implementations are the actual implementations of the assessment descriptions. They can, for instance, take the form of executable code or questionnaires. They are specific to the information system but are not necessarily implemented directly inside the system. In the example above, the ratio of present to possible attributes would be calculated for a company.
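To illustrate the step from an assessment description to an assessment implementation, the following Ruby sketch computes the completeness ratio described above for a single company data item. The attribute list and the hash-based data item are assumptions made for this example; they are not the actual IPIB data model.

    # Sketch of an assessment implementation for the description
    # "a company is complete if all possible attributes contain data".
    # The attribute list and the hash-based data item are assumed examples.
    POSSIBLE_ATTRIBUTES = [:name, :website, :country, :founding_year, :description]

    def completeness_ratio(data_item)
      present = POSSIBLE_ATTRIBUTES.count do |attribute|
        value = data_item[attribute]
        !value.nil? && !value.to_s.strip.empty?
      end
      present.to_f / POSSIBLE_ATTRIBUTES.size
    end

    company = { name: "Example IP Services Ltd.", website: "http://example.com",
                country: "DE", founding_year: nil, description: "" }

    puts completeness_ratio(company)   # => 0.6 (3 of 5 attributes present)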


Actors:

• Author transforms sources into data. This is done manually or supported by an automated process.

• Consumer is presented information as a representation and can generate specific representations by querying. Each consumer group has specific requirements for an information system and its data quality.

• Editor curates data already present in the system.

• Data quality architect interprets the consumers' data quality expectations and creates formalized definitions of what is to be assessed.

• Data quality developer implements the assessment descriptions, which can be part of the information system or independent of it.

• Developer works on and improves the information system components in general.


3 Data Quality Dimensions

Beyond an intuitive concept of data quality it is necessary to break data quality down into smaller components, or dimensions. These data quality dimensions each capture small, single aspects of data quality. Many dimensions have been identified in the literature using various approaches. These collections of dimensions allow a more systematic approach to evaluating data quality. Each dimension can be measured, or assessed, on its own. Measuring single aspects of data quality is the first step in improving overall quality. In the following section we select prior work to build a consensus-based list of relevant data quality dimensions.

Three general approaches to defining data quality dimensions have been identified in the literature [17]:

• Intuitive approach: "is based on the researchers' experience or intuitive understanding about what attributes are 'important.'"

• Theoretical approach: "focuses on how data may become deficient during the data manufacturing process", by formally comparing the view presented by the data model to the real-world view.

• Empirical approach: "analyzes data collected from data consumers to determine the characteristics they use to assess whether data are fit for use".

While reviewing the literature it became clear that the most commonly accepted approach is the empirical one, and especially the work done by Wang and Strong in 1996 [17]. But there are also many intuitive approaches based on the selection by Wang and Strong. To incorporate more recent work we will also review a comparative study by Knight and Burn [18] from 2005 and a monograph on data quality by Batini and Scannapieco [10] from 2006.

Our goal is to find established data quality dimensions and definitions. It is therefore sufficient to review only the three publications mentioned before. We explicitly do not aim to do a comprehensive and systematic review of data quality dimensions. There is far more literature which could be included in a systematic review. For instance, Zaveri et al. [19] present a comprehensive list of dimensions and metrics based on a systematic review. With the restriction of analyzing only work focused on assessing the data quality of Linked Open Data, they already discuss 21 articles from the years 2002 to 2012.

3.1 Prior Work

In this chapter we review the data quality dimensions and frameworks of other authors. The dimensions which may be of interest in the context of this thesis are listed.

Empirically founded data quality dimensions

As mentioned before, the work by Wang and Strong has been a foundation for the definition of data quality dimensions, and many frameworks build on the dimensions defined in their article "Beyond accuracy: What data quality means to data consumers" [17]. Wang and Strong understand "fitness for use" as the essential measure of data quality. They define data of high quality as "data that are fit for use by data consumers" and a data quality dimension as "a set of data quality attributes that represent a single aspect or construct of data quality".

They conducted a two-phase survey with data consumers. First they collected an extensive list of potential data quality attributes by querying data consumers. This returned 179 potential data quality attributes, of which 118 were selected. In the second phase, 1500 data consumers provided 355 viable responses rating the importance of these 118 attributes. To further reduce the number of dimensions, a factor analysis and a manual grouping were performed. This resulted in a total of 20 dimensions, each consisting of one or more data quality attributes. Table 1 lists these dimensions together with their associated attributes.


Dimension: attribute list (frequency)

Believability: believable (3)
Accuracy: data are certified error-free, accurate, correct, flawless, reliable, errors can be easily identified, the integrity of the data, precise (8)
Objectivity: unbiased, objective (4)
Reputation: reputation of the data source, reputation of the data (3)
Value-added: data give you a competitive edge, data add value to your operations (3)
Relevancy: applicable, relevant, interesting, usable (4)
Timeliness: age of data (7)
Completeness: breadth, depth, and scope of information contained in the data (5)
Appropriate amount of data: the amount of data (3)
Interpretability: interpretable (5)
Ease of understanding: easily understood, clear, readable (7)
Representational consistency: data are continuously presented in same format, consistently represented, consistently formatted, data are compatible with previous data (5)
Concise representation: well-presented, concise, compactly represented, well-organized, aesthetically pleasing, form of presentation, well-formatted, format of the data (4)
Accessibility: accessible, retrievable, speed of access, available, up-to-date (7)
Access security: data cannot be accessed by competitors, data are of a proprietary nature, access to data can be restricted, secure (-)
Traceability *1: well-documented, easily traced, verifiable (-)
Cost-effectiveness *1: cost of data accuracy, cost of data collection, cost-effective (-)
Ease of operation *1: easily joined, easily changed, easily updated, easily downloaded/uploaded, data can be used for multiple purposes, manipulable, easily aggregated, easily reproduced, data can be easily integrated, easily customized (-)
Variety of data and data sources *1: you have a variety of data and data sources (-)
Flexibility *1: adaptable, flexible, extendable, expandable (-)

Table 1: Data quality dimensions as given in [17]. The number in parentheses indicates the frequency with which the dimension occurred in the frameworks compared by Knight and Burn [18]. Dimensions which were not included by Wang and Strong in the final list are denoted by *1 following their name.


Comparative study of data quality frameworks

Knight and Burn compare twelve frameworks in their article "Developing a framework for assessing information quality on the world wide web". The frameworks, and with them the data quality dimensions, were published between 1996 and 2002. Knight and Burn select the 20 most common dimensions and rank them by the frequency with which they occurred. They state that, except for three dimensions, all had previously been suggested by Wang and Strong. Unfortunately it is not shown exactly how the mapping from the dimensions in the 12 frameworks to those they present was done.

For the following four dimensions they simplified the names given by Wang and Strong (simplified name in parentheses): Representational consistency (Consistency), Access security (Security), Concise representation (Concise), Ease of understanding (Understandability). For the following three dimensions Knight and Burn state that they had previously been defined by Wang and Strong, but no match could be made by us: Reliability, Useful, and Efficiency. The dimensions listed among the most common in Knight and Burn's comparative study and not included in Table 1 are found in Table 2.

Availability: extent to which information is physically accessible
Usability: extent to which information is clear and easily used
Navigation: extent to which data are easily found and linked to
Reliability *1: extent to which information is correct and reliable
Useful *1: extent to which information is applicable and helpful for the task at hand
Efficiency *1: extent to which data are able to quickly meet the information needs for the task at hand

Table 2: Dimensions listed under the top 20 by Knight and Burn [18] and not included in Table 1. Denoted by *1 are the dimensions which Knight and Burn claim had already been defined by Wang and Strong.

The first three dimensions in Table 2 partly deal with problems encountered when accessing information systems via the Internet using a web browser. This might explain why these dimensions did not occur in Wang and Strong's study. Wang's study was published in 1996, when far fewer people used the Internet than in 2005, when Knight and Burn published their article⁵.

⁵ Internet World Stats [20] estimates 32 million Internet users worldwide in 1996, compared to an estimated 1018 million for December 2005.


In the frameworks compared, some define dimensions extremely specific to such Internet-based information systems. For example, Conciseness is measured in [21] by analyzing the navigation hierarchy of a website. A very specific form of the Reputation dimension (named Authority) is defined in [22] as a score given to a site by the (no longer existent) "Yahoo Internet Life reviews". And in [23] "accuracy of the hyperlinks" is, among other factors, measured as "whether or not the individual Web site contains any broken links".

Monograph on data quality

Batini and Scannapieco devote chapter 2 of their monograph "Data Quality: Concepts, Methodologies and Techniques" to the description of various data quality dimensions [10, Ch. 2]. Most dimensions are also found in Wang and Strong's work but are described in more detail. Batini and Scannapieco often refer to relational databases for examples, so their dimensions are opinionated in this direction. In addition, they discuss dimensions resulting from the "evolution of information systems toward networked, web-based information systems". No simple list of dimensions is given, but instead descriptions in which, in some cases, a dimension is split into sub-dimensions. We analyzed these descriptions and split them into single dimensions to make them comparable. The following list shows those dimensions which add new aspects to the quality dimensions of Wang and Strong. A complete list of all quality dimensions described by Batini and Scannapieco can be found in the appendix.

Syntactic Accuracy: is the closeness of a value v to the elements of the corresponding definition domain D.
Semantic Accuracy: is the closeness of the value v to the true value v'.
Schema Completeness: the degree to which concepts and their properties are not missing from the schema.
Column Completeness: a measure of the missing values for a specific property or column in a table.
Population Completeness: evaluates missing values with respect to a reference population.
Currency: concerns how promptly data are updated.
Consistency: captures the violation of semantic rules defined over (a set of) data items.
Interpretability: concerns the documentation and metadata that are available to correctly interpret the meaning and properties of data sources.
Accessibility: measures the ability of the user to access the data from his or her own culture, physical status/functions, and technologies available.

Table 3: Data quality dimensions as given by Batini and Scannapieco [10, Ch. 2] which are not included in Table 1 or 2.
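To make two of these definitions more tangible, the following Ruby sketch computes a syntactic accuracy score (the share of present values that fall inside their definition domain D) and a column completeness score (the share of non-missing values for one property) over a small assumed data set. The domain and the records are illustrative assumptions, not data from this thesis.

    # Illustrative metrics for two dimensions from Table 3 (assumed toy data):
    # syntactic accuracy  = share of present values inside the definition domain D
    # column completeness = share of non-missing values for one property
    COUNTRY_DOMAIN = ["DE", "US", "GB", "FR"]   # definition domain D (assumed)

    records = [
      { name: "Alpha IP", country: "DE" },
      { name: "Beta IP",  country: "Germany" },   # syntactically inaccurate value
      { name: "Gamma IP", country: nil }          # missing value
    ]

    countries = records.map { |record| record[:country] }
    present   = countries.compact

    syntactic_accuracy  = present.count { |value| COUNTRY_DOMAIN.include?(value) }.to_f / present.size
    column_completeness = present.size.to_f / countries.size

    puts format("syntactic accuracy:  %.2f", syntactic_accuracy)    # => 0.50
    puts format("column completeness: %.2f", column_completeness)   # => 0.67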

3.2 Selected Data Quality Dimensions

From the data quality dimensions listed above, the relevant ones were selected. As a basis we selected all dimensions from the work of Wang and Strong, as almost all of the literature reviewed contains these dimensions. In addition, several dimensions were split, merged, or included from other authors, as described in the following chapter. This was done to obtain a list of dimensions which is a good fit for the assessment of the information systems considered here. For instance, splitting dimensions allows us to separate purely technical measurements from those which cannot be assessed automatically. This results in the 23 data quality dimensions shown in Table 4. The definitions of the dimensions Believability, Objectivity, Traceability, Value Added, Relevancy, Timeliness, Interpretability, Representational Consistency, and Access Security can be found as given in the appendix of [17]. The origins of the remaining dimensions are discussed below.


Believability: The extent to which data are accepted or regarded as true, real and credible.
Objectivity: The extent to which data are unbiased (unprejudiced) and impartial.
Reputation: The extent to which data are trusted or highly regarded in terms of their source and history.
Syntactic Accuracy: The extent to which a data value is correct in respect to the definition model.
Currency: The delay until changes in the source are reflected in the data.
Traceability: The extent to which data is well documented, verifiable and easily attributed to a source.
Internal Completeness: The extent to which included data are complete in respect to the defined model.
Internal Consistency: The extent to which data follows semantic rules defined over a set of data items.
Value Added: The extent to which data are beneficial and provide advantages from their use.
Relevancy: The extent to which data are relevant and helpful for the task at hand.
Timeliness: The extent to which the age of the data is appropriate for the task at hand.
Model Completeness: The extent to which concepts and their properties needed are not missing.
Population Completeness: The extent to which the quantity of data fully represents the real-world population.
Amount of Data: The extent to which the volume of data is appropriate for the task at hand.
Semantic Accuracy: The closeness of data values to the true values aimed to be represented.
Interpretability: The extent to which documentation and metadata are available to correctly interpret the meaning and properties of data.
Ease of Understanding: The extent to which data representations are clear without ambiguity, easily comprehended and not overwhelming.
Representational Consistency: The extent to which data are always presented in the same format and are compatible with previous data.
Accessibility: The ease of accessing the data taking into account a consumer's culture, physical status and technologies available.
Availability: The degree to which system availability and response times can be guaranteed.
Access Security: The extent to which access to data can be restricted and hence kept secure.
Openness: The degree to which data usage and manipulation is granted by license and easy due to technical measures such as data format.
Metadata: The extent to which metadata such as source, license or author is present.

Table 4: The 23 selected data quality dimensions.


Dimensions removed

Cost-effectiveness and Variety of data and data sources were not included in our list of selected data quality dimensions. Cost-effectiveness is included implicitly in Value Added, as data cannot provide advantages in an economic sense if costs and benefits are not balanced. Variety of data and data sources is covered by Reputation and Objectivity: many high-quality sources result in data of a high Reputation, and Objectivity is higher if data is supported by a variety of reputable sources. Additionally, both dimensions were excluded from the final categorized list of 15 dimensions by Wang and Strong.

Dimensions added

Currency and Internal Consistency (Consistency) were added from the work of Batini and Scannapieco. Availability is identified by Knight and Burn as a data quality dimension which is included in four ([15], [22], [24], [25]) of the 12 compared data quality frameworks. Availability fits web-based information systems very well as it captures the purely technical issues previously included in the Accessibility dimension. Metadata measures how well documented a single data item, and thus a data collection, is: for example, who entered the data item, when it was entered, what the source of the item is, and what license the source has. This dimension is not explicitly defined in the reviewed literature, but as several other dimensions rely on metadata we decided to include Metadata as a separate dimension.

Refined dimensions

Two of the most important dimensions (top 5 in Knight and Burn), Accuracy and Completeness, were divided into several dimensions. We thereby follow the work of Batini and Scannapieco and divide Accuracy into Syntactic Accuracy and Semantic Accuracy. Completeness is divided into Internal Completeness, Model Completeness, and Population Completeness.

Accuracy was defined by Wang and Strong as "The extent to which data are correct, reliable, and certified free of error". Batini and Scannapieco take a more formal approach (while essentially defining the same dimension) and define Accuracy as "the closeness between a value v and a value v', considered as the correct representation of the real-life phenomenon that v aims to represent" [10, p. 20]. This allows Batini and Scannapieco to introduce the two sub-dimensions of Accuracy given in Table 3. To avoid having to define variables such as v and v', and as we use a slightly different terminology, we altered the definitions of Syntactic Accuracy and Semantic Accuracy as given in Table 4.

Completeness was defined by Wang and Strong as the "extent to which data are of sufficient breadth, depth, and scope for the task at hand". This definition is also used by Batini and Scannapieco, but again several sub-dimensions are introduced. They present four sub-dimensions which can only be applied to relational databases and one sub-dimension which is focused on web data. Batini and Scannapieco hereby build on the work of Pipino et al. [5], who identified three types of completeness. In order not to be bound to relational databases, we use the three types identified by Pipino as dimensions: Schema Completeness, Column Completeness and Population Completeness. As we use a different terminology, Schema Completeness was renamed Model Completeness and Column Completeness was renamed Internal Completeness. We also define Internal Completeness as covering not only the narrow notion of columns and attributes but completeness against the whole definition model (i.e. the schema for relational databases). We also narrow the definition of Population Completeness to describe the percentage of data items available against the real-world reference population.

Altered and merged dimensions

For several dimensions included in the work by Wang and Strong, the definition, and thus the meaning, has changed. In addition, in two cases two dimensions were merged into a new dimension. Table 5 compares the old and new definitions of all altered or merged dimensions.


Interpretability
  Definition as given in [17]: The extent to which data are in appropriate language and units and the data definitions are clear.
  New definition: The extent to which documentation and metadata are available to correctly interpret the meaning and properties of data.

Accessibility
  Definition as given in [17]: The extent to which data are available or easily and quickly retrievable.
  New definition: The ease of accessing the data taking into account a consumer's culture, physical status and technologies available.

Reputation
  Definition as given in [17]: The extent to which data are trusted or highly regarded in terms of their source or content.
  New definition: The extent to which data are trusted or highly regarded in terms of their source and history.

Amount of Data (originally Appropriate amount of data)
  Definition as given in [17]: The extent to which the quantity or volume of available data is appropriate.
  New definition: The extent to which the volume of data is appropriate for the task at hand.

Openness (originally Ease of operation)
  Definition as given in [17]: The extent to which data are easily managed and manipulated (i.e., updated, moved, aggregated, reproduced, customized).
  New definition: The degree to which data usage and manipulation is granted by license and easy due to technical measures such as data format.

Flexibility (merged into Openness)
  Definition as given in [17]: The extent to which data are expandable, adaptable, and easily applied to other needs.
  New definition: merged into Openness.

Ease of Understanding
  Definition as given in [17]: The extent to which data are clear without ambiguity and easily comprehended.
  New definition: The extent to which data representations are clear without ambiguity, easily comprehended and not overwhelming.

Concise (merged into Ease of Understanding)
  Definition as given in [17]: The extent to which data are compactly represented without being overwhelming.
  New definition: merged into Ease of Understanding.

Table 5: Dimensions for which the definitions have been altered or which have been merged.

In the following section we discuss why the definitions were altered or the dimensions merged.

For Interpretability we use a definition similar to the one given by Batini and Scannapieco. Their definition better captures how important it is to correctly interpret the complex models and data used. This is especially true for online systems, which do not offer any human assistance. The use of appropriate units is already covered by Ease of Understanding, and the appropriate language is now covered by Accessibility.


Accessibility as defined by Wang and Strong has a broader meaning than the definition used here. This is mainly caused by the introduction of Availability as a separate dimension, which captures the purely technical issues included in the original definition. This leaves Accessibility with a far stronger focus on barriers which exist for humans, especially those caused by physical or technological limitations and by ambiguities stemming from cultural differences. Good examples studying this "social-level" of data quality are given by Shanks and Corbitt in [26].

Reputation has been altered as the definition given by Wang and Strong in [17] may be confusing. Originally they associated Reputation with "reputation of the data source, reputation of the data", as listed in Table 1. They then translated "reputation of the data" to the "extent to which data are trusted […] in terms of their [...] content". This is confusing, as the reputation of data is better understood as how good the reputation was in the past and is not directly linked to the current content. Also, in order to make the distinction between Reputation and Believability clearer, we decided to alter this definition.

Amount of Data now more clearly refers to the amount of data being appropriate for the consumer's task at hand. This is the definition also used in later work of Wang's working group, for instance in [5].

Openness now captures not only how easy it is to reuse the data technically but also the legal aspect. The legal aspect has become increasingly prominent due to the open data movement. As mentioned in the Open Data Handbook: "if you are planning to make your data available you should put a license on it" [27]. This is also valid for any closed data collections for which reuse should be possible for consumers.

For Ease of Understanding the definition has been altered to make clear that the dimension deals with data representations and not with the data itself. The data itself is already covered by Internal Consistency and Interpretability. Also, "not overwhelming" was added as we merged Wang and Strong's dimension Concise with Ease of Understanding.


3.3 Categorization of Data Quality Dimensions

To determine a data quality score, each dimension relies on different methods of assessment. The components which have to be analyzed also differ from dimension to dimension. After analyzing the necessary methods and components we distinguish two main methods (evaluation and measurement) and four main components (consumer, data, data model, system)⁶.

To discuss data quality dimension assessment it is necessary to understand the requirements each dimension has for assessment. Due to the high number of dimensions it is useful to categorize the dimensions in a way which makes the assessment implementation and execution visible. The methods and components identified above give us the possibility of categorizing the data quality dimensions. Knowing the methods and components helps us to understand which dimensions can be assessed fully automatically (for instance those which rely on measurements of data alone) and which need manual execution (for instance those which evaluate consumer feedback).

Unfortunately, using the methods or components directly as categories does not work out well. Using only the method results in just two categories, one nearly twice the size of the other. Using the components as categories does not result in an unambiguous assignment, as one dimension may require several components for assessment. Furthermore, neither is based on known categories, and the names of the methods and components are specific to this work. Therefore we reviewed existing categorizations to either adopt one or reuse existing vocabulary for a categorization which fits the dimensions used here. Naumann and Rolker distinguish four types of data quality dimension categorization in [15] as follows:

• Semantic-oriented: "solely based on the meaning of the criteria [dimension]"

• Processing-oriented: "partitions IQ criteria [data quality dimension] according to their deployment in different phases of information processing"

• Goal-oriented: "matches goals that are to be reached with the help of quality reasoning"

• Assessment-oriented: "based [...] on the entity/process that is the source of the assessed scores"

⁶ A more detailed explanation of the methods and components is given in chapter 5.

In detail we will review a semantic-oriented categorization approach by Wang and Strong [17] and the assessment-oriented approach by Naumann and Rolker [15]. Wang and Strong's approach is included as most of the dimensions we use here originate from their work. Naumann and Rolker's approach promises to be helpful for the task of data quality assessment which is our main focus. The examples found for the remaining two types of categorization differ too much from the selection of dimensions used here and do not promise additional advantages. The approaches reviewed are evaluated against the goals which should be met by the categorization. These goals will make discussing and thus implementing and executing assessments easier and are as follows:

• A category should indicate the possible execution of a dimension assessment.
• A category should list the methods and components needed for assessment.
• The categories should be a good fit for the selected dimensions.

Semantic-oriented categorization

Wang and Strong [17] conducted a two-phase sorting study in which subjects were asked to sort 20 data quality dimensions into four categories according to their understanding of the dimensions. The categories were originally derived from former research but the category descriptions were adjusted according to the first phase of the sorting study. Table 6 contains the categories with descriptions and the assigned dimensions from the second phase sorting study. Wang and Strong give several reasons for building data quality categories. One reason is that "Twenty dimensions were too many for practical evaluation purposes." But they also want to "capture the essential aspects of data quality" and "substantiate a hierarchical structure of data quality dimensions". These reasons ultimately aim to make data quality easier to understand and evaluate. Their categorization is based on the understanding of the dimensions and their semantics by data consumers. Thus using these categories can help when discussing data quality issues with data consumers.

Category | Description / Definition | Dimensions
Intrinsic | "The extent to which data values are in conformance with the actual or true values" | Believability, Accuracy, Objectivity, Reputation
Contextual | "The extent to which data are applicable (pertinent) to the task of the data user" | Value-added, Relevancy, Timeliness, Completeness, Appropriate amount of data
Representational | "The extent to which data are represented in an intelligible and clear manner" | Interpretability, Ease of understanding, Representational consistency, Concise representation
Accessibility | "The extent to which data are available or obtainable" | Accessibility, Access security

Table 6: Data quality dimension categorization by Wang and Strong [17].

Although the categories have been constructed using empirical methods it is not clear if these categories are universally accepted. The actual sorting study was executed by 30 participants who all, while being from the industry, were enrolled in the same evening university class. This could have introduced bias due to former teaching or discussions among the students. Also it is not clear if the results are transferable to different countries or cultures. The categorization given has an 81% average placement ratio. The consistency with which a single dimension had been assigned may well have been below this ratio. And finally those dimensions which were not consistently placed were dropped although they had been identified as relevant before. If the dropped dimensions are used it is not clear how they should be categorized.

After applying the categories to the dimensions we selected, it became clear that in some cases placing a dimension is very difficult. Table 8 shows the categories in which we placed the dimensions given the definitions used here. A value between one and three indicates how fitting the category is (three being the better fit). For instance the Traceability dimension measures if the data has provenance metadata associated to it. This information could be included in a representation and may make a data representation clearer. This can only be taken as a weak indication to categorize Traceability as a representational dimension. The subjects of Wang and Strong's sorting study seem to have had the same problem and therefore the dimension was excluded from the list of categorized dimensions. We see the necessity to include this dimension, for which Wang and Strong's categorization is not a good fit. Similarly we included several dimensions which cover the compliance or existence of data in comparison to the data model. These dimensions were categorized by us as intrinsic. But actually they do not fit the definition of intrinsic as given above. For instance the Internal Completeness dimension only checks if data is present as defined by the data model. It is not relevant if the data model requires values which actually exist. Nevertheless Intrinsic is still the best match from the categories given. But again it seems as if different category definitions or additional categories are necessary.

Assessment-oriented categorization

Naumann and Rolker [15] researched possible assessment methods for data quality dimensions. They argue that most previous research had avoided this issue and that an assessment-oriented classification "[...] is necessary to discuss assessment issues in an ordered manner, and also to guide creators of assessment methods in establishing new methods for possibly new criteria." Thus their categorization supports the implementation of data quality assessments. Naumann and Rolker identify three sources for assessment and categorize a selection of dimensions using these. Table 7 contains the resulting categories and their description as well as the associated dimensions.


Category | Description / Definition | Dimensions
Subject-criteria | "Information quality criteria are subject-criteria, if their scores can only be determined by individual users based on their personal views, experience, and background. Thus, the source of their scores is the individual user. Subject-criteria have no objective, globally accepted score. A representative subject-criterion is understandability." | Believability, Concise Representation, Interpretability, Relevancy, Reputation, Understandability, Value Added
Object-criteria | "The scores of object information quality criteria can be determined by a careful analysis of information. Thus, the source of their scores is the information itself. A representative object-criterion is completeness." | Completeness, Customer Support, Documentation, Objectivity, Price, Reliability, Security, Timeliness, Verifiability
Process-criteria | "The scores of process-criteria can only be determined by the process of querying. The source of the scores are the actual query process. Thus, the scores cannot be fixed but may vary from query to query. The scores are objective but temporary. A representative process-criterion is response time." | Accuracy, Amount of Data, Availability, Consistent Representation, Latency, Response Time

Table 7: Data quality dimension categorization by Naumann and Rolker [15].

Naumann and Rolker's goal was to create categories which support implementing data quality assessment. They use the same approach as we did to identify components needed for assessment. Their three sources for assessment are based on their model of an information system's usage and components (parts and actors) which are included, as depicted in Figure 3. As we use a more complex model and more distinct components it is natural that we identify more possible sources for assessment and thus more categories. Applied in the context of this thesis and the dimensions selected here, we encountered difficulties using the three categories suggested by Naumann and Rolker. The dimensions selected by Naumann and Rolker contain many which are focused on the query process. For instance Latency, Response Time and Availability are single dimensions which in our case are all contained in the Availability dimension. Similarly the definition they use for the Amount of Data dimension is "Size of result" which of course can be measured for every query. This conflicts with our definition in which the size of the result has to be appropriate for the task at hand. This cannot be easily extracted from the query process.


Figure 3: Naumann and Rolker's figure [15, Fig. 1] of the information system components (on the left) which are the source of data quality assessments in the categories shown (on the right).

The application of Naumann and Rolker's categorization to the dimensions used here is shown in Table 8. It resulted in the process category only containing a single dimension. Therefore in our case, using this category does not seem useful. Nevertheless, using an assessment-oriented categorization is preferable in the context of this thesis, but the underlying models used by Naumann and Rolker are too different from those used by us. Therefore their categorization cannot be used directly.

Categorization of selected quality dimensions

Applying the categorizations from previous research did not result in the clear and useful categorization needed. As mentioned above, the placement of dimensions into the existing categories was difficult in several cases and a dimension could easily have been placed into several categories. In Table 8 this can be seen by the number given in parentheses after the category: a value between one and three indicates how well the dimension fits the category (three being the best fit). Table 8 also lists the methods and components we found necessary for assessment. It is possible for a dimension to rely on two components for input. For instance Accessibility mainly evaluates the system on its accessibility, but of course the expected accessibility depends on the consumer group which is supposed to use the system. Therefore the consumer group and their specific handicap also have to be evaluated at least at the beginning.


A more detailed explanation why the assessment of a dimension relies on the given methods and components is to be found in chapter 5.

Dimension | Wang and Strong | Naumann and Rolker | Methods | Components | Final Category
Believability | Intrinsic (3) | Subject (3) | Evaluation | Consumer | Subjective Evaluation
Objectivity | Intrinsic (3) | Object (3) | Evaluation | Data | Objective Evaluation
Reputation | Intrinsic (3) | Subject (3) | Evaluation | Consumer | Subjective Evaluation
Syntactic Accuracy | Intrinsic (1) | Object (3) | Measurement | Data, Data model | Objective Measurement
Currency | Intrinsic (2) | Object (2) | Measurement | Data | Objective Measurement
Traceability | Representational (1) | Object (3) | Measurement | Data, System | Objective Measurement
Internal Completeness | Contextual (1) | Object (3) | Measurement | Data, Data model | Objective Measurement
Internal Consistency | Intrinsic (1) | Object (3) | Measurement | Data, Data model | Objective Measurement
Value Added | Contextual (3) | Subject (3) | Evaluation | Consumer | Subjective Evaluation
Relevancy | Contextual (3) | Subject (3) | Evaluation | Consumer | Subjective Evaluation
Timeliness | Contextual (3) | Subject (3) | Evaluation | Consumer | Subjective Evaluation
Model Completeness | Contextual (2) | Object (3) | Evaluation | Data model, Consumer | Objective Evaluation
Population Completeness | Contextual (1) | Object (2) | Measurement | Data | Objective Measurement
Amount of Data | Contextual (3) | Subject (2) | Evaluation | Consumer | Subjective Evaluation
Semantic Accuracy | Intrinsic (3) | Object (3) | Evaluation | Data | Objective Evaluation
Interpretability | Representational (2) | Object (3) | Evaluation | Data model, System | Objective Evaluation
Ease of Understanding | Representational (3) | Subject (3) | Evaluation | Consumer | Subjective Evaluation
Representational Consistency | Representational (3) | Object (1) | Evaluation | System | Objective Evaluation
Accessibility | Accessibility (3) | Object (1) | Evaluation | System, Consumer | Objective Evaluation
Availability | Accessibility (3) | Process (3) | Measurement | System | Objective Measurement
Access Security | Accessibility (3) | Object (1) | Evaluation | System | Objective Evaluation
Openness | Accessibility (3) | Object (2) | Evaluation | System | Objective Evaluation
Metadata | Contextual (1) | Object (3) | Measurement | Data, Data model | Objective Measurement

Table 8: Categorization of all data quality dimensions selected for this thesis. The categorization is done using the category definitions given by the respective authors and the dimension definitions as used here. The number in parentheses behind each categorization indicates how well the category fits (three being the best fit). The Methods and Components columns list the methods and components needed for an assessment of the dimension as distinguished in this thesis.

As mentioned above, none of the categorizations reviewed are a good fit if applied as given. Nevertheless, several single categories do fit well and for others an altered definition makes them more useful. To analyze the interdependence of the different categorization systems an exploratory data analysis approach was used. The goal was to visualize the connection between the different categories and dimensions to gain a better understanding of which categories can be useful and/or reused. We used Gephi7 [28], a visual data mining tool which combines both statistical and visual analysis. The dimensions and categories can be seen as nodes in a graph which are connected by edges according to whether a dimension is being categorized into a category or not. The edges have weights corresponding to the numbers given in Table 8. Following the methodology applied in Heymann and Le Grand [29] the resulting graph was imported into Gephi and visualized. The layout of nodes was generated using the ForceAtlas2 algorithm [30] (scaling set to 800, gravity 1.0, edge weight influence 1.0). This layout algorithm calculates the position of a node based on repulsive and attracting forces which represent the links between the nodes. Thus heavily linked nodes mostly end up in closer vicinity than those with looser links. This already reveals cluster structures as can be seen in Figure 4 (a). To highlight potential clusters the Louvain modularity algorithm [31] was used (resolution set to 1.0). The algorithm detects non-overlapping communities of nodes which can be used as an indication for a cluster. Figure 4 (b) shows the resulting graph. The Louvain communities are color coded. Three communities were detected.

7 https://gephi.org/

The communities discovered using these methods have to be carefully reviewed manually. There exists no direct connection between the actual domain and the methods used. Thus the results may be misleading but can still be a good base to start with. For example the Louvain algorithm reports slightly different communities from run to run due to randomization. In the following we discuss whether the discovered communities are a good fit for categorization.

Analyzing Figure 4 (b), the most distinct community of nodes is the blue colored one on the top. It only contains dimensions which rely on evaluation of consumer feedback for assessment. The categorization by Naumann and Rolker (Subject Category) is identical to this cluster. As we can reuse the categorization by Naumann and Rolker and all dimensions also share the same main method (evaluation) and input (consumer) this seems to be a fitting category. To immediately give an insight into which method is used we choose to name this category "Subjective Evaluation". The community on the lower left (red colored) is centered around the Object Category. All dimensions included rely on data and/or data model as input but the methods differ. The top part of the community relies on evaluations while the lower part uses measurements. The dimensions in the community on the right (green colored) are all included in the Accessibility or Representational Category and all rely on the system as input. A single dimension (Availability) does not rely on evaluation but on measurement. If our aim was a categorization based on the components needed for assessment these two clusters would be a good option. But as we require the categories to give an indication as to the methods used and possible implementation, these two clusters do not seem to be the best fit as categories.
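For readers who want to reproduce this kind of community analysis without Gephi, the following is a minimal sketch under stated assumptions: it uses networkx (2.8 or newer) instead of Gephi's Louvain implementation, only a handful of the edges from Table 8 are listed, and node names are prefixed so that a dimension and a category sharing a name (such as Accessibility) remain distinct nodes.

```python
# Exploratory sketch of the community analysis described above. The thesis used
# Gephi (ForceAtlas2 + Louvain); here only the Louvain step is reproduced with
# networkx. Edge weights correspond to the fit values in Table 8.
import networkx as nx
from networkx.algorithms.community import louvain_communities

edges = [  # illustrative subset of Table 8
    ("dim:Believability", "cat:Intrinsic", 3), ("dim:Believability", "cat:Subject", 3),
    ("dim:Relevancy", "cat:Contextual", 3), ("dim:Relevancy", "cat:Subject", 3),
    ("dim:Currency", "cat:Intrinsic", 2), ("dim:Currency", "cat:Object", 2),
    ("dim:Syntactic Accuracy", "cat:Intrinsic", 1), ("dim:Syntactic Accuracy", "cat:Object", 3),
    ("dim:Availability", "cat:Accessibility", 3), ("dim:Availability", "cat:Process", 3),
    ("dim:Access Security", "cat:Accessibility", 3), ("dim:Access Security", "cat:Object", 1),
]

G = nx.Graph()
for dimension, category, fit in edges:
    G.add_edge(dimension, category, weight=fit)

# Like Gephi, Louvain is randomized; a seed is fixed here for reproducibility.
for i, community in enumerate(louvain_communities(G, weight="weight", resolution=1.0, seed=42)):
    print(f"community {i}: {sorted(community)}")
```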

Figure 4: Overview of three graphs showing the connection between data quality dimensions and the categories they can be assigned to: (a) graph laid out; (b) colors mapped to communities; (c) colors mapped to the final categories.

Therefore the graph was analyzed further. Figure 4 (c) shows a different approach. The Subjective Evaluation category remains as before but the remaining dimensions were categorized by the method needed for evaluation. This results in one community across the middle which is based on evaluation as the method of assessment. All contained dimensions rely primarily on a combination of system, data, and data model as main input for assessment.


As (subjective) consumer feedback is at most a secondary input we name this category "Objective Evaluation". The remaining dimensions all rely on measurements as the main method of assessment. They only rely on measurements of system, data or data model. We name this category "Objective Measurement". This results in the three categories, their definitions and the associated dimensions as given in Table 9.

New assessment-oriented categorization

Category | Definition | Dimensions
Subjective Evaluation | Dimensions which mainly rely on evaluation of subjective consumer feedback to make an assessment. | Amount of Data, Believability, Ease of Understanding, Relevancy, Reputation, Timeliness, Value Added
Objective Evaluation | Dimensions which mainly rely on evaluation as the method for assessment. Data, data model, and system are the possible objective input to assessments. | Accessibility, Access Security, Representational Consistency, Interpretability, Openness, Model Completeness, Objectivity, Semantic Accuracy
Objective Measurement | Dimensions which mainly rely on measurements as the method for assessment. Data, data model, and system are the possible objective input to assessments. | Availability, Traceability, Internal Completeness, Internal Consistency, Syntactic Accuracy, Metadata, Population Completeness, Currency

Table 9: Final data quality dimension categorization used here.

This new assessment-oriented categorization fulfills the goals set above. The first goal, i. e. categories indicate the possible execution, is achieved as follows: all Subjective Evaluation dimensions have to be assessed manually by evaluating consumer feedback. The results depend heavily on the current consumer group and the sample giving feedback. They are therefore subjective assessments. All Objective Evaluation dimensions also rely on evaluations but do not depend on the consumers. Therefore the results should be objective. The assessments may be easier to undertake than those relying on consumer feedback as input. All Objective Measurement dimensions can in general be assessed automatically. This makes continuous assessment easier as the rate at which an assessment is carried out is only determined by the computing resources available.


The second goal, i. e. to list the methods and components needed for assessment, is achieved by the category definitions which each explicitly list the main methods and components the dimensions rely on. As can be seen in Table 9 the dimensions are grouped by this categorization into three groups of almost identical size (7-8 elements). Also all dimensions could be assigned to their category with high confidence. Therefore the new assessment-oriented categorization is a good fit for the dimensions used here. The three new categories give a direct understanding of the method used to execute an assessment. Previously we only identified two methods: measurement and evaluation. By creating this categorization it became clear that we are actually dealing with three main methods, one per category. The method of evaluation must be split into subjective and objective evaluation. Each evaluation method uses different techniques (consumer questionnaires vs. expert evaluations), which justifies speaking of three different methods in total.


4 Continuous Data Quality Assessment Process

In this chapter we will outline an iterative process to implement data quality assessments and combine it with a continuous process to support assessments after implementation. Implementing data quality assessments for an information system is a task which depends to a great extent on the context. As Pipino puts it, a "'One size fits all' set of metrics is not a solution." [5]. Thus it is not sufficient to define universal metrics once for all dimensions and then simply apply these to an information system. Instead as Knight and Burn [18] suggest the first step is to identify the user, the environment and the task at hand. Only then can the dimensions to be assessed be selected. Therefore every information system, environment and possibly each consumer has different requirements and applicable dimensions. The specific system and consumer determine the dimensions to be selected. Also the expectation of what good quality data must be is determined individually. For instance an information system aimed at an international audience has a very different Accessibility requirement compared to a system which is only used as an internal tool in a local company. Even dimensions with seemingly objective metrics, such as Syntactic Accuracy, can be of no importance to a specific system or consumer and may be excluded from the assessment.


4.1 Data Quality Assessment Implementation Cycle

To implement data quality assessment an iterative process based on the four steps of the Plan-Do-Check-Act (PDCA) cycle [32] is used. In Figure 5 the tasks necessary to implement CDQA are shown for each step of the PDCA cycle.

Figure 5 depicts the cycle as follows: the Plan step (analyze environment, select DQ dimensions, select metrics) produces the dimension definition file which lists the weighted metrics to be included; the Do step (implement metrics, apply metrics to a test system) produces the CDQA agent with a metric implementation (executable code or questionnaire); the Check step (compare reported scores with the expected outcome) produces the CDQA agent with a verified metric implementation; and the Act step (deploy to the production system, run assessments continuously) produces assessment results and insights gained, which feed back into the next Plan step.

Figure 5: Data Quality Assessment Implementation Cycle. For each of the four PDCA steps Plan, Do, Check, and Act the tasks to be done are listed. The step order is indicated by the arrows. Outside the circle between each step the outcome of the previous step which may be used as input for the next step is shown in a rectangular box.

To implement data quality assessments work begins in the "Plan" step. In this step the data quality architect analyzes the environment, i. e. the information system's consumers, context and tasks. He then selects all necessary data quality dimensions. For each dimension, a dimension description file (details in chapter 6.1) is created in which the metrics to be included and their weights are listed. Each dimension can hereby contain any number of metrics which contribute to the dimension's score. The following "Do" step is characterized by the data quality developer implementing the metrics and applying the metrics to a test system. As described before it depends on the dimension whether a metric is implemented as executable code or as a questionnaire. The "Check" step is used to carefully compare the results from the test system with the expected results. Finally in the "Act" step, data quality assessment is rolled out to the production system. From then on the assessments are run continuously. In order to allow response to changes in the consumer expectation of data quality and to align perceived and assessed data quality the CDQA process is introduced below.

4.2 Continuous Data Quality Assessment Process

Once implemented the data quality assessments are run continuously. Improvement-advice is used to determine what to work on. Ideally the data quality achieves the expected level and can be verified by continuous assessments. Nevertheless one can still see disagreement between the scores reported by the metrics and the quality perceived by consumers. Pipino describes "subjective and objective assessments of data quality" [5] which are executed at the same time. The results obtained are then compared and discrepancies identified. In the case of discrepancies the root cause for these is then determined and actions to improve data quality derived. We extend this approach and define two assessment cycles which are executed independently. In our understanding the objective assessments are executed more often than the subjective assessments. And they do not depend on each other but only influence each other by manipulating the shared components as depicted in Figure 6. The process described here is also more formalized and implemented as a software artifact (described in chapter 6.2). The objective assessment is based on the dimensions and metrics which were included and implemented in the data quality assessment implementation cycle. The metric assessments are not necessarily calculated automatically as Objective Measurement but as described in 3.3 can also rely on Objective Evaluation or Subjective Evaluation.


score meets the dimension's expectation the cycle is complete. Hereby both score and expectation are real numbers between zero and one. Improvement-advices and notifications are only created if the score does not meet the expectation.

The subjective assessment cycle is more complex. As depicted in Figure 7 four cases can occur. The overall goal of the CDQA process is to achieve case (A) for all included dimensions. As described in case (B) the process also avoids over-optimizing data quality. Case (C) causes the dimension expectation to be incrementally increased up to the maximum value. This causes the editors to try to increase the score by following the improvement-advice. Only if this does not lead to satisfying the consumer is case (D) reached and an implementation cycle triggered. As this is not the initial implementation cycle the data quality architect may decide, for instance, only to adjust the metric weights or add a new metric to the dimension. Also if the data quality score is far below the expectation the further execution of the implementation cycle can be delayed to give the editors more time to improve the score. These decisions are made during the "Plan" step in which the current situation is analyzed by the data quality architect.

The four cases shown in Figure 7 are:

• (A) The subjective assessment is satisfying and the score meets the expectation: Nothing further is to be done. The data quality is satisfying. This is the desired state.
• (B) The subjective assessment is satisfying and the score does not meet the expectation: The consumer is satisfied with the data quality. To avoid unnecessary effort to improve the objective assessment score the expectation should be lowered to the current score.
• (C) The subjective assessment is not satisfying and the expectation is below 1.0: The expectation is increased. This will in most cases lead to additional improvement effort triggered by the objective assessment cycle.
• (D) The subjective assessment is not satisfying and the expectation is already at 1.0: The expectation is already set to the maximum value. Most likely there are fundamental discrepancies between what the consumer expects and what is being worked on. An assessment implementation cycle is triggered.

Figure 7: The four cases which can occur in the subjective assessment cycle. Each rectangle represents one case. On the left the outcome of the subjective assessment is shown for each case. Above each case the dimension state is indicated.

By following the CDQA process described data quality assessments are implemented in a structured manner. Achieving a set data quality goal is not only supported but it also incorporates consumer feedback and thus avoids pointless work. The process relies on subjective assessments as described above and objective assessments. The method of calculating objective assessments is described in the following chapter.
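The decision logic of the four cases can be condensed into a short sketch. This is an illustration only, not the CDQA implementation described in chapter 6; the class, the function name and the size of the expectation increment are assumptions.

```python
# Illustrative sketch of the four cases of the subjective assessment cycle.
from dataclasses import dataclass

EXPECTATION_STEP = 0.05  # assumed; the thesis only states "incrementally increased"

@dataclass
class Dimension:
    name: str
    score: float        # latest objective assessment result, in [0, 1]
    expectation: float  # current expectation, in [0, 1]

def subjective_cycle(dim: Dimension, consumer_satisfied: bool) -> str:
    if consumer_satisfied:
        if dim.score >= dim.expectation:
            return "A: nothing further to be done"
        # B: consumer is satisfied, avoid over-optimizing the objective score
        dim.expectation = dim.score
        return "B: expectation lowered to current score"
    if dim.expectation < 1.0:
        # C: raise the bar so the objective cycle triggers improvement work
        dim.expectation = min(1.0, dim.expectation + EXPECTATION_STEP)
        return "C: expectation increased"
    # D: expectation already at the maximum, fundamental mismatch
    return "D: trigger a new assessment implementation cycle"
```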

4.3 Data Quality Calculation

As the application data quality is to be determined by the dimensions included for the application and each dimension in turn is made up of metrics it is important to give a simple and easily comprehensible method of calculating a combined score. Only then can the total data quality score for an application also be understood. Such a total data quality score is, as Pipino states, a controversial issue but "if the assumptions and limitations are understood and the index [score] is interpreted accordingly, such a measure could help companies assess data quality status" [5]. We also follow the convention suggested by Pipino [5] of higher scores representing superior data quality. Additionally the goal is to obtain a total data quality score on a fixed scale. We therefore require the total quality score to be a real number in the range of zero to one.

Calculating Total Data Quality Score

As the metrics included in the data quality dimensions are weighted individually, the easiest approach to create the dimension score is to calculate a weighted average of all included metrics. The total score can then be calculated as the average of all dimension scores as

\[ \text{Total Data Quality Score} := \frac{1}{n} \sum_{j=1}^{n} \text{dimension}_j \quad \text{whereby} \quad \text{dimension}_j := \frac{1}{\sum_{i=1}^{m} w_i} \sum_{i=1}^{m} w_i \cdot \text{metric}_i \]

while \( \text{metric}_i \in [0,1] \land \text{metric}_i \in \mathbb{R} \); \( w_i \in \mathbb{R} \land w_i \geq 0 \).
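A minimal sketch of this calculation, assuming metric scores already lie between zero and one:

```python
# Dimension score = weighted average of its metrics; total score = plain average
# of all dimension scores, as defined above.
def dimension_score(metrics):
    """metrics: list of (score, weight) with score in [0, 1] and weight >= 0."""
    total_weight = sum(weight for _, weight in metrics)
    return sum(score * weight for score, weight in metrics) / total_weight

def total_data_quality_score(dimensions):
    """dimensions: list of metric lists, one list per assessed dimension."""
    scores = [dimension_score(metrics) for metrics in dimensions]
    return sum(scores) / len(scores)

# Example: two dimensions, the first built from two weighted metrics.
print(total_data_quality_score([[(0.9, 2.0), (0.6, 1.0)], [(1.0, 1.0)]]))  # 0.9
```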


As stated above we require the metric weights to be positive and all metrics to report scores in the range of zero to one. These requirements have the advantage that each dimension score is then also in the range between zero and one. Thus no further normalization has to be applied to then calculate the total data quality score. Additionally this allows us to easily compare metric and dimension scores with each other and to easily visualize several scores in one chart. To enable every metric to report only scores between zero and one the metric assessment results have to be transformed.

Calculating metric scores

For any metric assessment it is essential to know the units and range of possible measurements. For instance a questionnaire can be based on answering yes-no questions. For ratios such as the percentage of invalid items a value between zero and one is returned. Time values such as the response time of a system might be measured in milliseconds on a range from zero to infinity. In any case it has to be ensured that the metric measurement is converted into a value in the range from zero to one whereby one represents the higher data quality. In most cases the simplest approach is to calculate a ratio based on the actual value and the maximum possible value anticipated for this metric. If the value expresses a positive influence on the data quality the metric is calculated as

\[ \text{metric}_i := \frac{\text{actual value}}{\text{maximum possible value}} . \]

If the value expresses a negative influence on data quality the calculation is obviously

\[ \text{metric}_i := 1 - \frac{\text{actual value}}{\text{maximum possible value}} . \]

This makes it possible to cover most metrics. For instance for the questionnaire example from above the maximum possible value would be the number of questions and the actual value the number of answers which indicated good data quality.

If more control over how the actual values are mapped to a data quality score is necessary it is possible to use a more complex method of normalization. For instance the average response time of a system might be assessed by a metric. For a


The two functions for a given lower and upper boundary can be derived from the standard linear function by solving the equation system for two points. Given the linear function \( y = m \cdot x + b \) and two points \( A = (x_1, y_1) \), \( B = (x_2, y_2) \), solving the equation system

\[ m \cdot x_2 + b = y_2 \]
\[ m \cdot x_1 + b = y_1 \]

produces the two point form of a linear equation

\[ \frac{y_2 - y}{x_2 - x} = \frac{y_2 - y_1}{x_2 - x_1} . \]

Into the two point form we can now insert values to calculate an actual formula. In our case the y values are zero and one and the x values the boundaries. The two generic linear functions mentioned are then derived as follows. To derive \( \text{Qual-falling}^{high}_{low}(value) \) we set A and B as \( A = (low, 1) \) and \( B = (high, 0) \). Inserted into the two point form we can derive the function

\[ \frac{0 - y}{high - x} = \frac{0 - 1}{high - low} \]
\[ y = \frac{high - x}{high - low} \]
\[ \text{Qual-falling}^{high}_{low}(x) = \frac{high - x}{high - low} . \]

Analogously we obtain with \( A = (low, 0) \) and \( B = (high, 1) \)

\[ \text{Qual-rising}^{high}_{low}(x) = 1 - \frac{high - x}{high - low} . \]

Of course both functions return one instead of any y above one and zero for any negative y. To simplify the formulas we did not include this extra distinction of cases. We can now easily apply the two linear mapping functions to any set of low and high boundaries. Together with the ratios mentioned before this is sufficient to explain the metrics described in the following chapter.
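A minimal sketch of the two mapping functions, including the clamping to the range from zero to one that the derivation above omits:

```python
# Clamped linear mappings Qual-falling and Qual-rising as derived above.
def qual_falling(value, low, high):
    """1 at or below `low`, 0 at or above `high`, linear in between."""
    return min(1.0, max(0.0, (high - value) / (high - low)))

def qual_rising(value, low, high):
    """0 at or below `low`, 1 at or above `high`, linear in between."""
    return 1.0 - qual_falling(value, low, high)

print(qual_falling(2.0, 1.0, 10.0))     # ~0.89: a 2 s response time is still quite good
print(qual_rising(0.95, 0.9, 0.9999))   # ~0.5: availability halfway between the boundaries
```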


5 Data Quality Dimensions in Detail

In several cases a dimension can be influenced by the quality of another dimension. Figure 9 shows all the dimensions and how they can influence each other. Influence between two dimensions does not mean that one dimension's quality score is used in another dimension's assessment. Instead, the dimensions under influence can in some cases only be assessed if the influencing dimension is of good quality. But the degree of influence always depends on the actual system at hand and does not have to occur in every case. For example the Metadata dimension influences several other dimensions. In the case of the Currency dimension, the influence is due to the reliance of the Currency dimension on volatility and source information being available. This information is of course a form of metadata. If no or only very little metadata is available and therefore no source or volatility information, the Currency dimension cannot be reliably assessed.

5.1 Objective Measurement Dimensions

The following dimensions rely on measurements for assessment. Measurements can in most cases be performed automatically without the need of human interaction. This is possible as the components needed for assessment are data, data model, and system. In our understanding these are objective sources of assessment. This is why we named the method of assessment objective measurement. The assessment rate for the following dimensions is only limited by the computing resources available.

Availability

The degree to which system availability and response times can be guaranteed.

Availability is the combination of measurements of the system availability and the average response time. System availability is defined and measured as the ratio of time the system is operational to the length of a given interval. Typical availability is given in decimal notation or percentage as for instance 0.999 or 99.9% availability and specified for a year of operation. For 0.999 or "three nines" availability this is roughly equivalent to the system downtime not exceeding 9 hours in one year.


For response time, the time period is measured between a consumer issuing a query and results being displayed. If a response time is guaranteed, this time cannot be exceeded by any possible query under any circumstances. This very strict understanding is native to real-time systems but can be loosened to guaranteeing average response times or excluding corner cases. For instance a web application response time should stay below one second so as not to interrupt the consumer's flow of thought. In order not to exceed a consumer's maximum attention span it must stay below 10 seconds [33, Ch. 5.5]. Availability is measured by assessing the actual production system as

\[ \text{MEASUREMENT}(\text{system}) = \frac{1}{2} \left[ \text{Qual-rising}^{0.9999}_{0.9}(\text{system availability}) + \text{Qual-falling}^{10s}_{1s}(\text{average response time}) \right] . \]
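A small sketch of this measurement with the boundaries given above (availability between 0.9 and 0.9999, average response time between 1 s and 10 s); the helper functions repeat the linear mappings from chapter 4.3:

```python
# Availability score as defined above: mean of the mapped availability ratio
# and the mapped average response time.
def qual_falling(value, low, high):
    return min(1.0, max(0.0, (high - value) / (high - low)))

def qual_rising(value, low, high):
    return 1.0 - qual_falling(value, low, high)

def availability_score(system_availability, avg_response_time_s):
    return 0.5 * (qual_rising(system_availability, 0.9, 0.9999)
                  + qual_falling(avg_response_time_s, 1.0, 10.0))

# "Three nines" availability with a 2 second average response time.
print(availability_score(0.999, 2.0))  # ~0.94
```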

To check system availability and response time for a web application it is easiest to use external services. An external service can access the system in similar ways as a consumer. This takes into account for instance limited bandwidth or different geographic locations. Alternatively one may use internal tools which can cover most availability and response time issues. For instance server monitoring systems show system downtimes but not connectivity issues between the consumer and the system. For response times the application log contains the time between receiving the query and responding. Again this does not account for external delays such as rendering time on the consumer system. Thus an appropriate web analytics8 platform can be used to retrieve full response time values. To improve Availability it is most important to proactively evaluate the system design against its specifications and compare it to best-practices. For system availability this covers issues as diverse as actually securing physical access, redundant power supply, network connections, hot-spare systems and revolving system maintenance. Response time values can be checked during a simulation of the load for which the system is specified. For small systems ApacheBench9 can be used to simulate many concurrent users making requests. If improving a system which already exists, the log files may be used to replay and multiply user interaction via httperf10.

8 For example Piwik, an open source web analytics platform http://piwik.org/
9 http://httpd.apache.org/docs/2.2/programs/ab.html


Tasks which exceed the response times and cannot be accelerated could be calculated in a background process. The consumer can then continue using the system. As soon as the task is completed the consumer is notified, which lowers the perceived response time to an acceptable level.

Currency

The delay until changes in the source are reflected in the data.

Currency captures the delay until real-world changes are reflected in the data and thus available to consumers. As in most cases it is not easy to know when changes occur, the rate of change can be estimated. The time in between expected changes (volatility) is metadata which is associated to data items. If volatility information is present Currency can then be assessed by comparing the time since the data item was last changed to the volatility value as follows:

\[ \text{MEASUREMENT}(\text{data}) = \underset{\text{all data items}}{\text{Average}} \left[ \text{Qual-falling}^{volatility}_{0}(\text{time since data item was last changed}) \right] . \]
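A sketch of this measurement, assuming each data item carries its volatility as metadata and that both values are given in days:

```python
# Currency score: per data item the time since the last change is mapped against
# the item's volatility, the dimension score is the average over all items.
def qual_falling(value, low, high):
    return min(1.0, max(0.0, (high - value) / (high - low)))

def currency_score(items):
    """items: list of (days_since_last_change, volatility_in_days)."""
    scores = [qual_falling(age, 0.0, volatility) for age, volatility in items]
    return sum(scores) / len(scores)

# One item changed yesterday with a weekly volatility, one item overdue for an update.
print(currency_score([(1.0, 7.0), (30.0, 14.0)]))  # ~0.43
```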

For sources which can be accessed in an automated fashion Currency may be improved by monitoring the source for changes. Ideally the monitoring system is capable of detecting changes in the same frequency as given by the volatility rate. The time until an editor enters the changes can be further reduced by automatic processing. This is possible if sources offer structured information or sufficient methods exist to extract structured information.

Internal completeness

The extent to which included data are complete in respect to the defined model.

Internal Completeness measures the percentage of present values. Therefore it is of course necessary to know how many values are expected by the given data model. For systems based on a relational database the easiest approach is to count nil values and total values in every row of every table (as suggested in [10, p. 26]). As discussed in the evaluation chapter, this simple approach is not sufficient but may be a base line to build upon.

10 http://www.hpl.hp.com/research/linux/httperf/


By excluding technical tables and columns which do not contribute to data quality a more meaningful metric can be created. Internal Completeness is measured by checking the data present against the data model as

\[ \text{MEASUREMENT}(\text{data, data model}) = \frac{\text{values count}}{\text{total values expected}} . \]

Unfortunately there are many cases in which no data for an attribute is present but the data item is complete. To encourage editors to simply fill such attributes with values in order to achieve a high score would in total result in very low Semantic Accuracy. As described in [10, p. 24] a NULL value in a database table can have three different meanings:

• the attribute does not exist for the data item at hand
• the attribute exists but is not known
• it is not known whether or not the attribute exists

A completeness measure should value the three possibilities differently. This could be done by agreeing on a value to be entered by the author for each of the three cases. For instance "N/A" could be entered for not available. Storing this information as a special value in the field may conflict with other mechanisms. For example "N/A" cannot be stored in an integer value. Thus it is desirable to store this information as metadata associated to the value in question. The author can then select the meaning of the missing value and the missing value is then no longer counted.

Internal consistency

The extent to which data follows semantic rules defined over a set of data items.

While Syntactic Accuracy only checks a single value against its datatype, Internal Consistency checks dependency among several values. An example of a consistency rule in a medical database might be: a patient who is pregnant must be female. These consistency rules can be defined between attributes or more complex objects (both are data items). To implement Internal Consistency the relevant consistency rules have to be selected from the data model and applied to the data as follows:


\[ \text{MEASUREMENT}(\text{data, data model}) = \frac{\text{number of times consistency rules applied return true}}{\text{total number of consistency rules applied}} . \]

To improve Internal Consistency the rules can be strictly enforced by checking them on each data entry. Using a Domain Specific Language (DSL) can make it easier to formulate and deploy consistency rules thus improving Internal Consistency.

Metadata

The extent to which metadata such as source, license or author is present.

The dimension of Metadata evaluates how much metadata is present compared to the specification of the system. It is dependent upon the system on which level metadata is present. There might be systems for which all data is attributed to a single author while others track input on attribute level. It does not evaluate if the amount of metadata is sufficient for the tasks the system is supposed to fulfill. Metadata can, for example, be present for information on the source including a Uniform Resource Locator (URL) [34] for web-resources, for information on the license under which the source was accessed, for information on the author and editor of data, for information on when the data was entered and when manipulated (timestamps), for information on volatility and any other useful information. For implementation, the possible metadata must be defined. This can be done on a per-value basis while implementing (for instance defining that each company must provide information on which author added the company). Some data models explicitly define metadata which can be retrieved automatically and thus used directly as input for the assessment. For the general approach we assess the Metadata score as

\[ \text{MEASUREMENT}(\text{data, data model}) = \frac{\text{metadata present}}{\text{amount of metadata possible}} . \]
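A sketch of this measurement; the set of expected metadata fields is an assumption and would be taken from the system's specification:

```python
# Metadata score: ratio of metadata fields actually present to the metadata
# fields the system could provide per data item.
EXPECTED_METADATA = ("source", "license", "author", "created_at", "volatility")  # assumed

def metadata_score(items):
    possible = len(items) * len(EXPECTED_METADATA)
    present = sum(1 for item in items
                  for field in EXPECTED_METADATA if item.get(field))
    return present / possible if possible else 1.0

print(metadata_score([{"source": "https://example.com", "author": "editor1"}]))  # 0.4
```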

To improve Metadata some information can be added automatically: for instance who created and updated a dataset, similar to timestamps which show when the data item was created and last manipulated. Also source URLs can be checked for licensing information presented in a machine readable format.


For some types of sources licenses can be estimated. For instance a company website and its content are usually copyright of the company. By monitoring a source, metadata such as volatility may be extracted over time and added to the system. Choosing a data model which directly supports metadata such as the Resource Description Framework is ideal for building information systems with a high Metadata data quality.

Population completeness

The extent to which the quantity of data fully represents the real-world population.

For any population such as all companies engaged in a certain industry sector it is important to be confident that the data which is analyzed is of sufficient breadth. This does not mean the data must have the same size as the actual population but only a sufficient size for the tasks at hand and the information system in question. For instance if a company database stores data on the main employees of the company the number of people stored is not related to the world population. Instead it can be expected that for each company on average two employees are present in the data. To implement Population Completeness the expected population size must be defined beforehand. The score can then be calculated by analyzing the data as

\[ \text{MEASUREMENT}(\text{data}) = \frac{\text{entities contained in data}}{\text{expected size of population}} . \]

To improve Population Completeness a suggestion system could provide new possible data items. For instance for companies this might be based on co-occurring companies in news articles or on external data sources such as company registers.

Syntactic accuracy

The extent to which a data value is correct in respect to the defined model domain.

To assess if a value is correct it is necessary to know the domain of the value first. This is done in many cases on the level of datatypes. The information system implements a data model which usually defines a datatype for the value of every attribute. The existing values can then be checked against the datatype.


Examples of basic datatypes generally available in most databases are varchar (a string) or integer. The datatypes of date or timestamp which are also available in databases already use a finer level by enforcing a specific format and possibly using plausibility checks (for instance there is no 99th of January). Syntactic Accuracy only checks the correctness of values which are actually present. In the calculation, absent values are ignored (those are measured in the Internal Completeness dimension). When implementing the Syntactic Accuracy dimension it is necessary to define which data values should be checked and how the validity of a data value is determined. The Syntactic Accuracy score is based on applying the data model (validity rules) to the data and can in general be calculated as

\[ \text{MEASUREMENT}(\text{data, data model}) = \frac{\text{valid data values}}{\text{total number of data values}} . \]
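A sketch of such a check; the attribute names and validity rules are assumptions standing in for the datatypes defined by the data model:

```python
# Syntactic Accuracy: validate present values against datatype rules and report
# the ratio of valid values to checked values. Absent values are skipped
# (they belong to Internal Completeness).
import re

VALIDITY_RULES = {  # assumed, simplified stand-in for the data model's datatypes
    "founding_year": lambda v: isinstance(v, int) and 1500 <= v <= 2100,
    "website": lambda v: isinstance(v, str) and re.match(r"^https?://\S+$", v) is not None,
}

def syntactic_accuracy(records):
    checked, valid = 0, 0
    for record in records:
        for attribute, rule in VALIDITY_RULES.items():
            value = record.get(attribute)
            if value is None:
                continue
            checked += 1
            valid += rule(value)
    return valid / checked if checked else 1.0

print(syntactic_accuracy([
    {"founding_year": 1998, "website": "https://example.com"},
    {"founding_year": "next year", "website": None},
]))  # 2 of 3 present values are valid -> ~0.67
```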

To improve Syntactic Accuracy the datatype can be strictly enforced thus not allowing invalid data to be entered into the system. Another method of improvement would be to use more predefined datatypes with a finer granularity. This could be accomplished by allowing the incorporation of datatypes defined using XML Schema [35]. XML Schema has a selection of built-in datatypes and allows constructed datatypes which are defined in the terms of built-in datatypes. Any XML datatype has three properties: a value space, a lexical space and a collection of functions on the datatype such as equality and a mapping between lexical and value space. The lexical space of the datatype, which is defined as a set of literals used to denote the values, can be validated by the system.

Traceability

The extent to which data is well documented, verifiable and easily attributed to a source.

Traceability captures the degree to which provenance information which is stored alongside data items can be used by consumers to verify the data item. For instance if the source of a fact is stored in the system this source should be accessible for the consumer wherever the fact is presented.


Traceability is influenced by Metadata as source and provenance information is also a form of metadata. Traceability is measured by analyzing which provenance data the system presents to the user as

\[ \text{MEASUREMENT}(\text{data, system}) = \frac{\text{provenance information usable by consumer}}{\text{provenance information present in data}} . \]

To improve Traceability automatic checks might be included to verify if source information is contained in the same view as the actual fact. This can be complemented by assessments ensuring the visibility of source information. More advanced provenance information such as transformation information also improves Traceability. For example categorizing a company can be based on the company's self-description. The self-description may use different terms than those in the category definition. A mapping between the terms should therefore be included as provenance information. Creating more advanced provenance information such as mapping information might be aided by a browser plugin which enables the editor to select the information on a web page, apply transformations and directly store this information in the system.

5.2 Objective Evaluation Dimensions

Objective evaluation dimensions rely mainly on evaluation to assess the data quality. Evaluations are a professional and systematic review based on a set of standards. Evaluations can use questionnaires, interviews, statistics or other methods. The assessment of the following dimensions is based mainly on data, data model and system as possible input. As these are objective sources we named the assessment method objective evaluation. Consumers are sources of some assessments but are not questioned directly but only evaluated as a group. Objective evaluation dimensions can in general not be assessed as frequently as objective measurement dimensions. In most cases human interaction is necessary for assessment although some metrics can be automated. In the following section we describe the objective evaluation dimensions in detail, giving examples and further explanations.


Access security

The extent to which access to data can be restricted and hence kept secure.

For consumers it could be important to have exclusive access to data or to keep their own analysis private. Furthermore legislation can require that access to data remains restricted. This means the communication between consumer and system has to be protected as well as access to the information system itself. In some cases it may even be necessary to protect the consumer systems. Access Security can be evaluated by comparing and auditing the system against best practices. For example Bank-Level-Security or industry standards such as Service Organization Controls (SOC) 2 11 can be used. It is therefore generally assessed as EVALUATION(system).

Access Security can be improved on many levels. The "defense in depth" [36] principle advises not only to focus on securing technology but also to focus on people and operations. For instance server systems should always be physically secured and access restricted to as few people as possible. Improving technological defense can be done by implementing layered defense in multiple places. For instance the communication layer can be transparently encrypted using HTTP over TLS (HTTPS) [37] and the system itself can apply fine grained access controls. Improving operations focuses on activities required to sustain the security of the organization, for instance by continuously applying security patches and virus updates. The location of hosting is also of importance due to differences in legislation concerning data privacy and protection.

Accessibility

The ease of accessing the data taking into account a consumer's culture, physical status and technologies available.

Accessibility evaluates how difficult the data is to access by consumers, for example by those who are not native to the interface's language or are limited in some other way. Another reason for low Accessibility are systems which cannot be used by limited devices such as older computers or those with a slow connection.

11 http://www.ssae16professionals.com/services/soc-2/


Of course access is only possible if the system is available, thus Accessibility is influenced by Availability. Access Security and Accessibility also influence each other as some measures improving security can make access more difficult and vice versa. Accessibility is assessed by evaluating the system at hand. The system's consumer group is also evaluated but not questioned directly (thus avoiding subjective feedback). The general assessment is EVALUATION(system, consumer).

To improve Accessibility, the Web Content Accessibility Guidelines (WCAG) [38] can be used to evaluate a system. In addition, recently web sites targeting the general public12 provide a version in plain language. Plain language uses texts the consumers understand the first time they read them [39]. To ensure technological compatibility it is good practice to use only valid markup for web pages. This ensures that any device compatible with these open standards is capable of displaying the content. Separating content from layout and having fall backs for JavaScript make it easier for older devices to at least display the main information. Several tools are available to (semi)automatically check a system for compliance against the above guidelines and best practices. "HTML CodeSniffer"13 validates against WCAG, the Markup Validation Service14 checks validity of web documents. Services such as Browsershots15 generate screenshots of a site for different browsers, operating systems and devices. These tools can be integrated into the development of the system, for instance into the continuous integration process, and thus they enforce compliance with these standards.

Interpretability

The extent to which documentation and metadata are available to correctly interpret the meaning and properties of data.

Interpretability captures the ability of a person to be able to understand the data and its meaning given the documentation and metadata.

12 For instance the German Federal Government: http://www.bundesregierung.de/Webs/Breg/DE/LeichteSprache/leichteSprache_node.html
13 http://squizlabs.github.com/HTML_CodeSniffer/
14 http://validator.w3.org/
15 http://browsershots.org/


For instance a person's name given alongside an article could be interpreted as the author of the article while actually it is the subject of the article. This documentation is especially important if data is to be further processed and is offered in a machine readable format (thus possibly losing information which can be deduced from a representation designed for a consumer). Interpretability is therefore assessed by evaluating the system presenting the data and the data model, which should offer the possibility to include metadata, as EVALUATION(data model, system).

As described above Interpretability relies on the presence of metadata and thus is influenced by the Metadata dimension. To improve Interpretability one can follow the best practices given by Batini and Scannapieco [10, p. 33]. Amongst other things they advise to design the system to provide good documentation on the data model and consistency constraints. This might be achieved by having a single place to enforce these. For instance a relational database schema can include the data model and consistency constraints. If a system is used as data storage which directly supports metadata for resource description, it is easier to provide this information alongside the data.

Model completeness

The extent to which concepts and their properties needed are not missing.

When designing the model (data model) to hold the data it is important to ensure everything needed for the tasks at hand can be stored. For instance a model to hold contact information for companies must provide the possibility of storing a company's address and phone number. For more complex examples, of course, it is not as obvious what may be missing. As Model Completeness does not aim to create perfect models but only to check if they provide what is needed, not only the data model is evaluated but also the consumers' needs and interaction with the system. Model Completeness is therefore assessed as EVALUATION(data model, consumer).

Model Completeness can be improved beforehand by careful analysis of the tasks the system is supposed to fulfill and which properties are necessary. For example a system prototype can be used to check if Model Completeness is sufficient for the task at hand.


Objectivity
The extent to which data are unbiased (unprejudiced) and impartial.
Data of high Objectivity should not be influenced by personal feelings or interpretations of the authors and editors who created the data. On the contrary, objective data should be based on facts and be unbiased, unprejudiced and impartial. Semantic Accuracy influences Objectivity because semantically accurate data is very close to reality and therefore less likely to be influenced or biased. Objectivity is assessed by evaluating the data as EVALUATION(data).
To improve Objectivity it can be helpful to perform transformations which introduce bias several times, independently. Consider, for instance, a task which requires an author to categorize a company based on a description; the author may be personally biased. This can be mitigated by having two or more authors categorize the company independently. The results can then be compared and checked for differences, and a merged version of the categorization can be agreed upon.
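A minimal sketch of how two independent categorizations could be compared is given below; the data is invented for illustration.

# Compare two independent categorizations and list the companies on which the
# authors disagree, so that a merged version can be agreed on.
def categorization_conflicts(by_author_a, by_author_b)
  (by_author_a.keys & by_author_b.keys).select do |company_id|
    by_author_a[company_id] != by_author_b[company_id]
  end
end

author_a = { 1 => 'law firm', 2 => 'patent broker', 3 => 'consulting' }
author_b = { 1 => 'law firm', 2 => 'consulting',    3 => 'consulting' }

puts "Needs review: #{categorization_conflicts(author_a, author_b).inspect}"
# => Needs review: [2]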


Openness
The degree to which data usage and manipulation is granted by license and made easy by technical measures such as the data format.
To enable the consumer to reuse the data it is necessary both to permit this and to facilitate it. Legal and technical openness can be distinguished. For legal openness it is necessary to provide clear license information alongside the data; the consumer then knows what he may or may not do with the data. For technical openness the data format is crucial for the ease of reuse. If a system only provides data as HTML pages designed for human reading it can be very hard to transfer the data into a different system. This is especially true if the data is aggregated or displayed in graphical form. Import and reuse is a lot easier if the system provides the data as a simple comma-separated values file with appropriate documentation. This can go as far as data formats which are used for the semantic web and are machine readable without further human interaction.
Openness is influenced by the Metadata dimension, as license information is metadata. It is also influenced by Representational Consistency, because a consistent representation format, even if not well described, is suitable for reuse. Furthermore Access Security and Openness influence each other. Although Access Security only aims to limit illegal access and not legal reuse, security measures can have a negative effect. An Openness measure such as providing a Comma-separated Values (CSV) file for paying consumers lowers Access Security, as the further handling of such a CSV file can no longer be controlled. Openness is assessed by evaluating the system as EVALUATION(system).
To improve Openness one can follow the guidelines given by the “Open Data Handbook” [27], consisting of four simple steps:
1. Choose your dataset(s)
2. Apply an open license (legal openness)
3. Make the data available (technical openness)
4. Make it discoverable
This applies not only to the complete data collection of the system but can also be applied to single data items or consumer-generated representations. For instance it could be useful to provide exports for filter and aggregation results from consumer queries.
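One way to provide such an export is sketched below; the controller, model and attribute names are illustrative assumptions and not taken from the IPIB.

# Minimal sketch of technical openness for query results: a Rails controller
# action that exports the current filter result as CSV.
require 'csv'

class CompaniesController < ApplicationController
  def export
    companies = Company.where(country: params[:country])

    csv = CSV.generate do |rows|
      rows << %w[name country start_of_operations]        # documented header columns
      companies.find_each do |company|
        rows << [company.name, company.country, company.start_of_operations]
      end
    end

    send_data csv, filename: 'companies.csv', type: 'text/csv'
  end
end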


Representational Consistency
The extent to which data are always presented in the same format and are compatible with previous data.
Representational Consistency captures how stable the system's representations are. It can be divided into consistency over time and consistency over designs. Consistency over time requires that representations change as little as possible; if they do change, older versions should be maintained. Especially for access via an Application Programming Interface (API), backwards compatibility should always be guaranteed. Consistency over designs covers applications presenting the same information regardless of the medium, also called content parity. For instance a consumer is presented the same information on a mobile and on a desktop computer, but in ways native to each platform. Representational Consistency is assessed by evaluating the system as EVALUATION(system).
Consistency over time can be improved by versioning the API and continuing support for older versions. For example versioncake16 enables a Rails application to version any view, so not only APIs but also HTML views may be versioned, allowing the consumer to continue using the known format. Consistency over designs can be improved by building responsive websites, which make it easier to guarantee content parity. This is achieved by using the same code base and content for all devices instead of separate, dedicated websites per device class; the responsive website displays the same content in device-adapted ways. For example the widespread Bootstrap17 framework can be used to build responsive websites.
16 https://github.com/bwillis/versioncake
17 http://getbootstrap.com/
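A minimal sketch of consistency over time at the API level is given below. It uses plain namespaced Rails routes rather than the view-level versioning provided by versioncake; the controllers are assumed to exist and are named for illustration only.

# Old representations stay reachable under a versioned path while new
# consumers use the current one.
Rails.application.routes.draw do
  namespace :api do
    namespace :v1 do
      resources :companies, only: [:index, :show]   # frozen, backwards-compatible format
    end
    namespace :v2 do
      resources :companies, only: [:index, :show]   # current format
    end
  end
end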


Semantic Accuracy
The closeness of data values to the true values they aim to represent.
Semantic Accuracy might best be explained using a person's email address as an example. The string “user@example.com” is a syntactically correct email address, but we can be sure it is not a real person's email address18 and thus it is semantically incorrect. Semantic Accuracy is influenced by Internal Completeness, as values for which nothing was entered19 cannot represent the true values. Semantic Accuracy is assessed by evaluating the data as EVALUATION(data).
Semantic Accuracy might be improved by several measures. In the case of email addresses many systems validate them by sending an email and requesting the user to follow an included link; this is of course only an option if the user is interested in having his email address entered into the system. During data entry, errors can be reduced by avoiding repetitive work for the author; for instance auto-completion can be used to support entering values. A more advanced approach, as described in [40], allows the most unpredictable questions to be asked first and the following questions to be narrowed down to only a few options, which reduces the possibility of error. Analogous to improving Objectivity, two authors can separately enter data for a single data item and then agree on a merged version. Maybe the best source for detecting errors are the consumers themselves: they should be enabled to easily give feedback on errors in representations, which can then be checked by editors.
18 example.com is a domain set aside according to RFC 2606 for documentation purposes and not available for registration
19 Information such as “not available” can be an entry.
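The email example above can be turned into a crude automatic check. The following sketch combines a simple syntactic test with a list of the RFC 2606 documentation domains; it is an illustration only and no substitute for the confirmation emails described above.

# Accept only syntactically valid addresses and flag addresses in domains
# that cannot belong to a real person.
RESERVED_DOMAINS = %w[example.com example.net example.org example.edu].freeze

def plausible_email?(address)
  return false unless address =~ /\A[^@\s]+@[^@\s]+\z/      # crude syntactic check
  domain = address.split('@').last.downcase
  !RESERVED_DOMAINS.include?(domain)                        # crude semantic check
end

plausible_email?('user@example.com')         # => false
plausible_email?('jane.doe@uni-leipzig.de')  # => true (still unverified)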

5.3 Subjective Evaluation Dimensions
Subjective evaluation dimensions rely mainly on the evaluation of consumer feedback. Consumer feedback is hereby seen as subjective; hence the name of this category. The evaluation, however, still follows the usual principles and standards as described in the previous chapter. Therefore the assessment results are not subjective but objective. This is also the main difference to the subjective assessments used in the CDQA process in chapter 4.2. The subjective evaluation relies on possibly extensive questionnaires which are answered by consumers. They are designed to reliably capture the quality of the dimension, and the score is calculated from them in a reproducible way. In contrast, the subjective assessment from chapter 4.2 is a simple yes-no question alongside the dimension definition. Nevertheless, in the case of subjective evaluation dimensions both assessments are more similar than for the other dimensions. Subjective evaluation dimensions can in general be assessed even less often than objective evaluation dimensions. This is due to the possibly large number of consumers that has to be included. In the following section we describe the subjective evaluation dimensions in detail, giving examples and further explanations.
Amount of Data
The extent to which the volume of data is appropriate for the task at hand.
The amount of data provided for a consumer query must be neither too little nor too large. Consider for instance a query for lawyers specialized in IP. The result must be big enough to allow the consumer to choose according to personal preferences. At the same time the result should not be too big, so that time is not wasted looking through the result set. This assumes all data items presented are relevant


to the consumer. Thus this dimension is influenced by Relevancy. Additionally there must be sufficient data to generate a result, so an influence of Population Completeness is present. To evaluate and improve Amount of Data the consumer has to be assessed as EVALUATION(consumer). In many cases the quality can be increased by suitable ranking mechanisms. In a ranked and paginated result list the consumer will only perceive the small result set on the first page, and only if this set is not sufficient for his task will he continue to further pages.
Believability
The extent to which data are accepted or regarded as true, real and credible.
Without taking into account the source of a data item, Believability captures the consumer's acceptance of the data item as true information. Believability is of course influenced by low Semantic Accuracy: recognized false values make the data as a whole less believable. It is also influenced by Reputation, as highly reputable data will be accepted as true more easily. To assess Believability the consumer has to be evaluated as EVALUATION(consumer). It is thereby evaluated how highly the consumer regards the data as true; if acceptance is low, the root cause has to be found and worked on.
Ease of Understanding
The extent to which data representations are clear without ambiguity, easily comprehended and not overwhelming.
Ease of Understanding concerns the consumer's understanding of the data representations. It is necessary for the consumer to understand the representation and its meaning easily. This is the biggest difference to Interpretability, which can require intensive work to understand the data. But Ease of Understanding is still influenced by Interpretability because Interpretability ensures the data can be understood at all. This dimension is also influenced by Accessibility because only good


accessibility, especially taking the consumer's background into account, allows the representations to be viewed as intended. To assess Ease of Understanding consumers must be evaluated as EVALUATION (consumer) . To improve Ease of Understanding web analytics tools such as Google analytics20 or Piwik can be used. These allow tracking user interaction with the system and help to identify bad navigation or pages which are difficult to understand. For instance one may identify pages from which the consumer would be expected to continue but, in fact, exits above average. This can be a sign of the page being too complex and the user not being able to identify the possible actions and navigation options presented. Also if the possibility of easily giving feedback and requesting assistance is provided, it can be used to increase the understanding of the system. Relevancy The extent to which data are relevant and helpful for the task at hand. Relevancy captures how helpful or relevant the data and results given by the system are to the consumer for the task he wishes to achieve. To assess Relevancy the consumer has to be evaluated as EVALUATION (consumer) . Relevancy can be improved by adjusting ranking algorithms and improving search results. Ideally individual consumers could be questioned to construct a gold standard. Using a gold standard, queries which return result sets may be optimized using methods from the field of information retrieval (such as optimization against fscore). Reputation The extent to which data are trusted or highly regarded in terms of their source or history. Reputation evaluates the historical reputation of the data and the source the data is derived from. Meaning, if the data of a system has been regarded as trustworthy in the past and the source it is derived from is also regarded as trustworthy it has a 20 http://www.google.com/analytics/


high Reputation. Reputation is influenced by Traceability, because to decide if a source is trustworthy a consumer must be able to access the data source. To assess Reputation in most cases the consumer must be evaluated as EVALUATION (consumer) . This is especially true for all data which is the result of transformations and calculations. It could be possible to estimate Reputation for data derived directly from sources by using consumer feedback as samples or by letting the consumer assign trust values to certain source types. For instance tweets could be given a lower trust value than blog posts. In addition, for websites the Google Page Rank score can be used as a reputation value. The part of Reputation stemming from the trustworthiness in the past is dependent on many factors. For instance a consumer, who discovered one single untrustworthy data item, might not consider the data trustworthy for a long time. Consumer feedback can be used to improve Reputation. For example if a consumer does not trust Wikipedia entries, a conventional encyclopedia can be used for source information. Making the original author who created a data item visible can also increase the consumer's trust. Timeliness The extent to which the age of the data is appropriate for the task at hand. Timeliness describes if the data is useful at the point of time it is needed. For instance a consumer who wants to take part in a course has to know the course starting time well in advance. For another consumer, it might be acceptable not to know the starting time in advance but later, as he is only interested in what time courses start at in general. Thus Timeliness cannot be determined in general but only as a consumer dependent dimension. Timeliness is influenced by Currency as data which is highly current should also be timely. But this is not the case if the source which the data is based upon is not the most up to date or does not reflect world changes fast enough. Timeliness can be assessed by evaluating the consumer as EVALUATION (consumer) .


To improve Timeliness it is necessary to anticipate at what point of time data becomes useless for certain consumer groups. Therefore consumers have to provide information on this. Given this information the system could check if the data is sufficiently up to date. Value added The extent to which data are beneficial and provide advantages from their use. Value Added captures the advantages a consumer has from using the data, the competitive edge, and value it adds to his operations. This can be measured as monetary advantage. Value Added is influenced by all other consumer dimensions. In our understanding value can only be added if the data is objective, believable, of high reputation, easily understood, available in the appropriate amount, relevant and timely. To assess Value Added the consumer has to be evaluated as EVALUATION (consumer) . In some cases analytical software could be used again to attribute a monetary value to a certain use of the system such as a file download. Value Added may then be calculated on the assumption that the consumer advantage is correctly captured by the value associated to the sum of all downloads.


6 Continuous Data Quality Assessment Software Artifacts
In the following section we will describe the software artifacts constructed to enable CDQA for an information system. We will distinguish between the CDQA-Agent (agent) and the CDQA-Collector (collector) artifact. The agent implements the actual assessment of data quality. The collector is a web application responsible for storing and visualizing data quality over time.


Figure 10: CDQA Software artifacts overview: at the top the collector application is shown. Below two applications are shown. The data quality of both applications is assessed by agents. The agents report the results to the collector API. Application 1 retrieves its total data quality score via the collector API. The score can then be displayed to a consumer accessing the application. On the left the access to the collector user interface is shown. The architect has the right to manipulate settings, the editor only to view results.


As illustrated in Figure 10 the agent is embedded in an application. This allows the agent to directly assess the data quality of the application. The results are sent to the collector via an API. The collector processes the results and provides a user interface to be accessed via a web browser. This can be used by editors to view the data quality development of the application. To implement the artifacts described, the following requirements were compiled.

CDQA-Collector requirements:
• API to collect assessment results.
• Capability to store the anticipated amount of assessment results.
• Store original assessment results indefinitely.
• Display data quality development over time using line charts.
• Support the CDQA process, especially subjective assessments.

CDQA-Agent requirements:
• Reliable execution of data quality assessments.
• Report assessment results to a file or to the collector.
• The dimension description files containing the metrics should be easily understood by non-programmers.
• The agent must have full access to the data models and business logic of the application to make any kind of assessment possible.
• It should be possible to implement agents for different platforms.
• Adding custom metrics must be possible.
• Recurring assessment with the possibility of specifying different intervals.

According to these requirements, the agent and collector are implemented as described in the following chapters.


6.1 Agent
In order to minimize the effort of implementing the agent it was desirable to build on a tool already fulfilling some of the requirements. We therefore decided to adopt cucumber21, a Behavior Driven Development (BDD) tool. Cucumber uses a DSL called gherkin22. Gherkin allows dimension description files to be written in a plain-text format which is easy to understand. Being a testing tool, cucumber can of course be executed in the application context and has full access to models and business logic. Additionally, cucumber is designed to allow reliable execution even if exceptions occur: exceptions are surfaced but only stop a single test (or scenario in gherkin terms). Furthermore, cucumber has already been adapted to suit several different platforms, so implementing an agent for one of the already supported platforms is simple. Our aim was not to employ cucumber for its original use case, BDD. However, cucumber is very flexible and has been adapted to other scenarios before23. We use the gherkin syntax to describe which metrics are to be included in a dimension and the cucumber framework to execute the assessments.
The actual agent is implemented using the Ruby programming language and distributed as a gem using the Ruby package manager RubyGems24. The agent currently relies on several other gems, most notably the cucumber-rails25 gem which allows a cucumber environment to be easily set up and executed for Rails based applications. This of course implies that the current agent can only be used in Rails applications. The data quality assessment can be invoked by using the cucumber binary and specifying the data quality directory and files. The assessments are executed continuously by adding appropriate cronjobs. The agent provides templates for all relevant elements such as dimension description files or crontab entries.

21 http://cukes.info/ 22 https://github.com/cucumber/cucumber/wiki/Gherkin 23 Cucumber has been used to build the automatic security test framework gauntlt (https://github.com/gauntlt/gauntlt) 24 http://en.wikipedia.org/wiki/RubyGems 25 https://rubygems.org/gems/cucumber-rails


Dimension description file
As can be seen in Listing 1, the dimension description file is a plain-text file with indentation to structure the file. It is written using the gherkin language and therefore has no need for the special symbols usually found in programming languages. Only very few keywords are used by the gherkin language. The parser divides the file content into features, scenarios and steps. Features and scenarios are indicated by the keywords “Feature” and “Scenario”. In our case we only want to add one feature and in most cases only one scenario to a single file.

1   Feature: Metadata data quality dimension
2     The extent to which metadata as source, license or author is present.
3
4     @daily
5     Scenario: Assess Metadata dimension
6       Given Metadata is assessed
7       And presence of all Company sources checks are included
8       And presence of an author for "Company" check is included
9       And presence of an author for "Person" check is included
10      When assessment is done
11      Then the quality score should meet expectation

Listing 1: CDQA-Agent dimension definition example. The first line defines the dimension to be assessed and is followed by its definition. The dimension is repeated in line 5 and again in line 6, which starts the assessment setup. Lines 7–9 add metrics to the assessment. Line 10 performs the assessment and line 11 reports the results. The tag in line 4 specifies that this dimension assessment is to be performed daily. Lines 6–11 are actual steps.

Of more interest are the steps which are included in a scenario. Each line inside a scenario beginning with one of the keywords (“Given”, “And”, “When”, “Then”) is a step and backed by a step definition as indicated in Figure 11. The steps inside a scenario are supposed to be divided into three parts. This is due to the origin of cucumber and gherkin as a BDD tool which follows the classic Arrange-Act-Assert [41, p. 97] steps of testing. In our case it divides the description of what is to be included in the data quality assessment (Arrange), the calculation (Act) and the reporting (Assert). As the indentation is optional the order of appearance is what determines the order of execution. Additional metrics therefore must be included after the Given step and before the calculation. The frequency in which the assessment is to be carried out is indicated by a tag (@daily in Listing 1) before the actual scenario. Currently the agent supports hourly, daily and weekly execution and provides templates for setting up corre-


sponding cronjobs. Other frequencies can easily be implemented by creating a new tag and executing cucumber using this tag. Currently it is not possible to specify the frequency for a single metric. Instead, one can add several scenarios to a single dimension file; for instance one scenario tagged “@daily” contains all metrics which are to be assessed daily and a second scenario tagged “@hourly” contains those which are to be assessed hourly.
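Such a setup could, for example, consist of crontab entries like the following; the installation path and the directory of the dimension description files are illustrative assumptions.

# m h dom mon dow  command                                (illustrative crontab entries)
0 * * * *  cd /srv/ipib && bundle exec cucumber --tags @hourly features/data_quality
0 2 * * *  cd /srv/ipib && bundle exec cucumber --tags @daily  features/data_quality
0 3 * * 0  cd /srv/ipib && bundle exec cucumber --tags @weekly features/data_quality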

Step definitions
As can be seen in Listing 2, the step definitions are similar to method definitions. They register a regular expression against which the steps are matched. The regular expressions are found in lines 1 and 6 and are started and ended with a slash. When executing the dimension description file, cucumber looks up the matching regular expression for each step. All matching groups from the regular expression are passed into the following block of code. Therefore the first step definition in Listing 2 can be used to set up a metric for both the company and the person model (or any model available). The code executed for each of these metric step definitions then adds one or more metrics to the list of all metrics for this dimension. As can be seen in lines 3 and 10, a metric also adds a list of data items. The data items are retrieved by a helper which tries to identify the model passed as argument. For memory and performance reasons, not an array of data item objects is passed but a reference to the relation behind this model. The metric can then operate on this relation (for instance calling count for the total number).

And(/^presence of an author for "(.*?)" check is included$/) do |model|
  @metrics

    quality_test 'QT 01', :method_name => :not_empty, :attr => :start_of_operations
    quality_test 'QT 02', :method_name => :not_empty, :attr => :end_of_operations,
                 :if => lambda { |company| company.start_of_operations.present? }
    quality_test 'QT 03', :method_name => :each_not_empty, :attr => :role,
                 :function => :employees
    quality_test 'Check regularly!', :method_name => :not_expired
    quality_test 'QT 04', :description => 'a custom quality test' do |company|
      company.is_small? && company.has_less_than_50_employees? ||
        company.is_big? && company.has_more_than_50_employees? ||
        some_very_complex_condition
    end
  end
end

Listing 6: A company model with five different DQ-Tests defined on it. The first four tests use predefined test methods. Test one checks if the company has a start of operations set. The second test checks if an end of operations is set, but only if a start of operations is existent. The third test checks if all of the company's employees have a role. The fourth test will fail if the company has not been updated in over a year. The fifth and last test uses a block to check several conditions.

The data-quality gem provides several predefined tests, and custom tests can be written. An example DQ-Test definition is shown in Listing 6. A DQ-Test must specify a unique identifier and either a method from the set of default methods or a block which evaluates to true or false. Any test can be combined with a condition; this allows tests to be active only on objects which fulfill these conditions. Custom DQ-Tests are passed a block in which arbitrary code can be evaluated. This includes checks on other models or even calls to external services. The only requirement for the block is to return true or false. This allows for very high flexibility in what DQ-Tests can cover.
When calculating the DQ-Test score for a single object each DQ-Test is evaluated and reports one of the following four results:
• pass – the object fulfills the requirements,
• fail – the object does not fulfill the requirements,
• inactive – the test contains a condition which is not fulfilled and thus the test is not evaluated,
• not applicable – the test was manually marked as not applicable to this object.

Depending on the test state, the test contributes to the DQ-Test score. Currently a passed test adds three points and a not applicable test adds one point to the quality score. Inactive and failed tests do not add points. The DQ-Test score and the number of failed tests are stored as model attributes. In order to store the data about which test is not applicable to a specific object, a separate table is used and linked via a polymorphic association.
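The scoring rule just described can be summarized in a few lines; the following sketch illustrates the rule and is not the implementation of the data-quality gem.

# Per-object DQ-Test points: 3 per passed test, 1 per not-applicable test,
# 0 for failed or inactive tests.
POINTS = { pass: 3, not_applicable: 1, fail: 0, inactive: 0 }.freeze

def dq_test_points(results)
  results.inject(0) { |total, state| total + POINTS.fetch(state) }
end

dq_test_points([:pass, :pass, :not_applicable, :fail, :inactive])  # => 7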

Figure 22: IPIB editorial interface showing the DQ-Test tab for a single company. The page shows the total score of this company and next to it the company score development over time. A list of all active DQ-Tests is shown. For each test identifier, description, status, message, additional information and action (setting applicable or not applicable) are displayed.


The main reason for the implementation of the DQ-Tests was to aid the editors in their task of completing IPIB company profiles. Every action taken by an editor should earn him DQ-Test points which is why not applicable tests also contribute points. To allow seamless interaction with DQ-Tests they were added to the editorial user interface. As the IPIB uses the RailsAdmin interface the necessary views and models could easily be added. A separate rails-admin-data-quality gem uses a Rails Engine to add a DQ-Test tab for each model using DQ-Tests. In the case of the IPIB the editors can order the company list view by DQ-Test score or number of failed tests. This allows them to easily find hot spots and systematically work on completing company profiles. Figure 22 shows the DQ-Test tab provided by the gem. It lists all DQ-Tests which are active for this company. The editor has the possibility of marking tests as not applicable according to editorial guidelines. For instance in Figure 22 the company is still in operation thus the editor set the requirement for the end of operations date to not applicable. For predefined tests which evaluate several associated objects, like DQ-Test CO-008, those objects which do not satisfy the expectation of the test are listed. This allows the editor to directly access the objects which cause the test to fail.

7.3 CDQA Implementation
CDQA was implemented for the IPIB and released on the 1st of May 2014. As described above, the initial implementation focused on the company as data item. As data quality tests had already been implemented they were used as metrics in several dimensions. The metric implementations can be improved in some cases, for instance by caching assessment results and only re-assessing if the underlying data item has changed, or by using a sample of data items instead of assessing all data items. For the sake of demonstrating CDQA the implementations, which are naïve in parts, are sufficient and in some cases more robust47. In the following section we will describe the metrics used in each dimension by discussing the dimension feature file. The actual implementation will not be shown in detail but only discussed where appropriate.
47 “There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.” (Source unknown)


For the IPIB, CDQA was set up to report assessment results to a collector. Most dimension assessments are run daily. All dimensions contain only metrics with the same frequency; therefore each dimension contains only one scenario.

Availability

Feature: Availability data quality dimension
  The degree to which system availability and response times can be guaranteed.

  @hourly
  Scenario: Assess Availability dimension
    Given Availability is assessed
    And System Availability measured by Pingdom is weighted 1
    And "Average Response Time" measured by NewRelic is weighted 5
    When assessment is done
    Then the quality score should meet expectation

Listing 7: Availability dimension description for the IPIB.

The Availability dimension is the only dimension with hourly assessments. The two metrics both only query external services and therefore cause very little load. Additionally it could be of interest at what time of day a degradation of quality in this dimension occurred. As described in chapter 5.1, Availability is measured as the weighted average of response time and system availability.
System availability is measured by the Pingdom48 web service. Pingdom tests access to the IPIB main page once a minute from one of over 50 locations distributed throughout the world. The results can be accessed via an API and are reported as the number of successful and the number of unsuccessful accesses for a given time period. For the IPIB an uptime of 99.9% is set as the upper target, with 90% uptime as the limit from which point onwards the metric score is set to zero. The system availability metric is then calculated as

m_{\text{system-availability}} = \text{Qual-rising}_{0.9}^{0.999}\left(\frac{\text{successful accesses}}{\text{total accesses in query period}}\right) .

The average response time is measured by the New Relic49 web service. New Relic is a performance monitoring service which is used for the IPIB. It is directly included in the application and measures application performance on several levels.
48 https://www.pingdom.com/
49 http://newrelic.com/


Server side response time is measured for every request and can be retrieved via an API as the average response time in milliseconds for a given time period. The response time does not include client side delays; therefore the upper boundary (data quality value of 0) was set to 5 seconds instead of 10 seconds. The metric is then calculated as

m_{\text{average-response}} = \text{Qual-falling}_{0.5\,\text{seconds}}^{5\,\text{seconds}}\left(\frac{\text{response time in milliseconds}}{1000}\right) .

As the response times are currently more important than system availability the metric weights were altered. This resulted in the following formula for the Availability dimension score:

d_{\text{availability}} = \frac{1}{1+5}\left[\,1 \cdot m_{\text{system-availability}} + 5 \cdot m_{\text{average-response}}\,\right]
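For illustration, the following sketch combines the two metric values into the dimension score. The linear shape of Qual-rising and Qual-falling between their bounds is an assumption made for this example, and the helper names are not part of the CDQA gem.

# Hedged sketch: weighted average of the two Availability metrics.
def qual_rising(value, zero_at, one_at)
  [[(value - zero_at) / (one_at - zero_at), 0.0].max, 1.0].min
end

def qual_falling(value, one_at, zero_at)
  1.0 - qual_rising(value, one_at, zero_at)
end

m_system_availability = qual_rising(0.995, 0.9, 0.999)   # uptime ratio in query period
m_average_response    = qual_falling(1.2, 0.5, 5.0)      # average response time in seconds
d_availability = (1 * m_system_availability + 5 * m_average_response) / (1 + 5)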

For the following dimensions we will only give calculations for single metrics. The weighted average is always calculated as given in chapter 4.3.

Currency

Feature: Currency data quality dimension
  The delay until changes in the source are reflected in the data.

  @daily
  Scenario: Assess Currency dimension
    Given Currency is assessed
    And "Company" sources are checked for updates
    And "Person" sources are checked for updates
    When assessment is done
    Then the quality score should meet expectation

Listing 8: Currency dimension description for the IPIB.

As the IPIB does not store volatility values for data sources, the definition in chapter 5.1 could not be used. Instead, because all sources are available on the Internet, a simple source check was implemented. For each source URL an HTTP HEAD (HEAD) [52, Sec. 4.3.2] request is made using the HTTParty50 library and the last-modified information is extracted. The data quality value for the source check metric is then calculated as

\frac{\text{number of data items which were updated since their source was last modified}}{\text{number of data items for which sources provided valid last-modified fields}} .

50 HTTParty is used for all web requests mentioned in this chapter if they are not made via a specialized API gem provided by a third party. HTTParty is available at http://johnnunemaker.com/httparty/


Thus only sources for which a valid last-modified field is returned are counted. For the Currency dimension the source attributes for companies and people are included in the check. For each model, custom support code was written which identifies the actual source attributes. Each source attribute is then added as a metric, which currently results in seven company metrics and two person metrics for this dimension.

Internal Completeness

1   Feature: Internal Completeness data quality dimension
2     The extent to which included data are complete in respect to the defined model.
3
4     @daily
5     Scenario: Assess Internal Completeness dimension
6       Given Internal Completeness is assessed
7       And all data quality "not_empty" tests are included
8       And all data quality "each_not_empty" tests are included
9       And the data quality test "CO-020" is included
10      And null values in all database tables are calculated and weighted 0.1
11      And null values for "Company" model are calculated and weighted 0.1
12      And blank attributes for "Company" model are calculated and weighted 0.1
13      When assessment is done
14      Then the quality score should meet expectation

Listing 9: Internal Completeness dimension description for the IPIB.

For Internal Completeness a subset of the DQ-Tests already present is used to evaluate the company's data quality. The DQ-Tests were designed to aid the completion of data for company profiles and are therefore well suited as metrics for this dimension. The subset of DQ-Tests which count towards Internal Completeness is made up of
• all DQ-Tests using the built-in not_empty method,
• all DQ-Tests using the built-in each_not_empty method,
• DQ-Test CO-020.

For each of the tests a metric is added. This results in eighteen not_empty and four each_not_empty DQ-Test metrics. The custom DQ-Test CO-020, which evaluates if either an English or an original language description is present, was also added.


We treat passing and not applicable tests equally and add one point for each of these. The total score is then divided by the number of active tests. For each DQ-Test metric in this and other dimensions the data quality value is thus calculated as

\frac{\text{number of DQ-Tests which passed or were not applicable}}{\text{number of active DQ-Tests}} .

In addition three generic metrics were added with a low weight. The metrics on lines 10 and 11 in Listing 9 calculate the ratio of non-null database fields compared to their total number. The metric in line 10 does this for all database tables present, the metric in line 11 only for the company database table. Both metrics rely on the usage of a relational database, as they use SQL to efficiently retrieve the necessary counts. The metrics are calculated as follows:

1 - \frac{\text{count of NULL values in table}}{\text{number of table entries} \cdot \text{number of table columns}} .

The last metric, in line 12, evaluates the number of blank attributes for each company. This differs from the two metrics before as this metric instantiates the actual company objects and checks each attribute to establish whether it is blank51 or not. The methods used to retrieve all attributes52 are provided by the ORM ActiveRecord; thus this metric is only applicable in applications which use ActiveRecord. This metric can also cause high load, as instantiating an object is resource intensive compared to the SQL statements used in the two metrics before. The metric is calculated as

1 - \frac{\text{blank attributes}}{\text{total number of attributes}} .

51 The blank method is provided by the ActiveSupport library and evaluates if an object (in this case an attribute of an ActiveRecord object) is false, empty (for example an empty array or hash), or a whitespace string. See http://api.rubyonrails.org/classes/Object.html#method-i-blank-3F for more information.
52 More accurately, all non-protected attributes.


Internal Consistency

Feature: Internal Consistency data quality dimension
  The extent to which data follows semantic rules defined over a set of data items.

  @daily
  Scenario: Measure Internal Consistency dimension
    Given Internal Consistency is assessed
    And the data quality test "CO-012" is included
    And the data quality test "CO-013" is included
    When assessment is done
    Then the quality score should meet expectation

Listing 10: Internal Consistency dimension description for the IPIB

Internal Consistency only contains two metrics. Both are DQ-Test metrics for the company model. The metric using DQ-Test CO-012 evaluates that the company status is not set to active if the company's end of operations date is set to a past date. The second metric (DQ-Test CO-013) evaluates whether a company which is an initiative is a member of a corporate grouping and whether this corporate grouping contains at least one company which is not an initiative. Calculation for both metrics is analogous to the general calculation for DQ-Tests as described before.

Metadata

1   Feature: Metadata data quality dimension
2     The extent to which metadata as source, license or author is present.
3
4     @daily
5     Scenario: Assess Metadata dimension
6       Given Metadata is to be assessed
7       And presence of an author for "Company" is checked
8       And presence of an author for "Person" is checked
9       And presence of "Company" sources are checked
10      And presence of "Person" sources are checked
11      And the data quality test "CO-009" is included
12      And the data quality test "CO-011" is included
13      And the data quality test "CO-016" is included
14      And the data quality test "CO-021" is included
15      And the data quality test "CO-026" is included
16      When assessment is done
17      Then the quality score should meet expectation

Listing 11: Metadata dimension description for the IPIB.

The Metadata dimension contains three different metric types. The first metric type checks if information on the author is present for each data item. The second


checks the presence of source information for a given model. The third metric type is based on DQ-Tests.
The first metric type relies on the presence of a versioning library to check whether author information is present. In lines 7 and 8 of Listing 11 two metrics are added to check companies and persons for author information. The IPIB uses PaperTrail53, which tracks all life cycle events such as creation, update of attributes and deletion for ActiveRecord models. In addition, for each event the user who triggered the event is recorded. This is, of course, only possible if the user has to authenticate himself before triggering the actions. For the IPIB, access to the editorial interface, and thus the possibility of creating companies, is limited to authenticated users with appropriate rights. Thus a high score can be expected for this metric if the library has been configured correctly since the beginning of data collection and if companies have not been added by other means. The metric is calculated as

\frac{\text{number of data items with author information}}{\text{total number of data items}} .

53 https://github.com/airblade/paper_trail
The second metric type is added for all Company and Person sources (lines 9 and 10 in Listing 11). Analogous to the metrics in the Currency dimension, the actual information about which attributes are source attributes is provided by custom support code. In the case of the IPIB this relies on naming conventions, as all source fields simply append “_source” to the name of the data field. They are therefore easily extracted from the list of all attributes for any given model. Obviously source fields should only be checked if the corresponding data attribute is present. For each source a metric is added, which results in five metrics for the company model and two metrics for the person model. These metrics are calculated as

\frac{\text{number of filled source fields}}{\text{number of source fields for which the data field is filled}} .

The third type of metric is based on DQ-Tests. Five metrics of this kind are added in lines 11 to 15. They are all defined for the company model and test the following:

• DQ-Test CO-009 tests if a source is given for all locations associated with the company.
• DQ-Test CO-011 tests if a source is given for all key employments associated with the company.
• DQ-Test CO-016 tests if a source is given for the number of employees category assigned to the company.
• DQ-Test CO-021 tests if a source is given for either the English or the original language description if one of them is present.
• DQ-Test CO-026 tests if a source is given for contact information added to the company such as address, email, telephone number, fax number.

Calculation for these metrics is analogous to the general calculation for DQ-Tests as described before.

Population Completeness

Feature: Population Completeness data quality dimension
  The extent to which the quantity of data fully represents the real-world population.

  @daily
  Scenario: Assess Population Completeness dimension
    Given Population Completeness is to be assessed
    And a total population for "Company" of 2000 is assumed
    And a total population for "Person" of 6000 is assumed
    And a total population for "Location" of 3000 is assumed
    When assessment is done
    Then the quality score should meet expectation

Listing 12: Population Completeness dimension description for the IPIB

For Population Completeness three metrics are added. Each of these metrics is based on an expert estimation of the number of data items expected in the IPIB. For each of the data models referenced in the metric the number of data items is simply counted and a quality value reported as follows:

\frac{\text{number of data items in the IPIB}}{\text{assumed number of data items}} .


Syntactic Accuracy

Feature: Syntactic Accuracy data quality dimension
  The extent to which a data value is correct in respect to the definition model.

  @daily
  Scenario: Assess Syntactic Accuracy dimension
    Given Syntactic Accuracy is to be assessed
    And validations for "Company" model of type "custom_date_format" are included
    And validations for "Company" model of type "presence" are included
    And validations for "Company" model of type "associated" are included
    And validations for "Person" model of type "presence" are included
    When assessment is done
    Then the quality score should meet expectation

Listing 13: Syntactic Accuracy dimension description for the IPIB.

The Syntactic Accuracy dimension only includes metrics based on ActiveRecord validations. The validation metric checks the validity of all relevant data items for the specified kind of validation. Metrics based on the correctness of datatypes or constraints enforced in the database were not added. As described earlier, validations are provided by the Rails framework and are similar to DQ-Tests. There are predefined validations to validate, for instance, the presence, uniqueness or type of an attribute. Additional custom validations can be added. Validations can be restricted by conditions as explained for DQ-Tests. The main difference to DQ-Tests is that validations hook into the life-cycle of an ActiveRecord object. They are, for instance, executed after calling save on the object. If any of the validations return false, no insert or update query is issued against the database. The object is thus not saved. Instead the user is informed, in most cases with detailed information, which attribute is invalid. Therefore no invalid data should have been inserted and all metrics added to this dimension are expected to report a score of one. Still, it is possible to bypass validations by manipulating data directly via SQL queries or due to race conditions between Ruby processes54. Also invalid data items can exist if validations were added after data items had been created55. 54 For instance the validation of uniqueness can be easily bypassed if two processes simultaneously save objects. Thus uniqueness should be additionally enforced in the database. Still the validations remain useful as a convenience layer to easily provide meaningful error messages to users. 55 ActiveRecord validations do not check existing data in contrast to most databases which strictly enforce data types or conditions.


Metrics are added for the company model and the person model, with one metric added for every validation. For example a model can have several presence validations for different attributes. Validations can be restricted to only be active under certain conditions, similar to DQ-Tests (using an if or an unless condition). Therefore the score for these metrics is calculated as follows:

\frac{\text{number of valid data items}}{\text{number of data items for which the validation is active}} .

Unfortunately ActiveRecord validations are not designed to be called separately as done here. The implementation therefore uses internal methods which could be changed in future releases without deprecation warning. It might be better to implement a simple "data item valid" metric which checks whether a data item is valid with respect to all defined validations; this can be done using the public API provided by ActiveRecord, and a sketch of this alternative follows the list below. The dimension definition above results in the following metrics:

• Two metrics validating the IPIB custom date format for company attributes.
• Five metrics validating presence for company attributes, which check if an attribute is not blank.
• Three metrics validating objects associated to a company.
• Three metrics validating presence of attributes on person data items.
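A sketch of the alternative based on the public ActiveRecord API could look as follows; it simply counts the objects for which all defined validations pass.

# Minimal sketch: score syntactic accuracy through the public ActiveRecord API.
# valid? runs all validations defined on the model.
total = 0
valid = 0
Company.find_each do |company|
  total += 1
  valid += 1 if company.valid?
end
score = total.zero? ? 1.0 : valid.to_f / total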

Of course the validations specified in the above metrics are enforced on the creation of companies. Therefore a high quality score can be expected.


Traceability

1   Feature: Traceability data quality dimension
2     The extent to which data is well documented, verifiable and easily attributed to a source.
3
4     @daily
5     Scenario: Assess Traceability dimension
6       Given Traceability is to be assessed
7       And presence of source information in company views is checked
8       And retrievability of "Company" sources is checked
9       And retrievability of "Person" sources is checked
10      When assessment is done
11      Then the quality score should meet expectation

Listing 14: Traceability dimension description for the IPIB.

For the Traceability dimension two kinds of metrics are used. One metric is added to check the presence of source information in each company view (Listing 14, line 7) and two metrics are added to check the retrievability of sources for companies and people (lines 8 and 9).
The presence of source information in company views is assessed by checking whether all source URLs which are present for the company are also contained in the web page as a link. This is done by retrieving the company page from the actual production website via an HTTP GET (GET) [52, Sec. 4.3.1] request. The response is then parsed using the Nokogiri56 HTML parser and all links contained are compared against the expected source links. The current implementation does not evaluate whether the links are actually visible to the consumer. Also, links which are contained in parts of pages that are loaded via JavaScript are missed, as JavaScript is not executed. For instance, on the company main page only three employees are shown; the full list of employees is on a separate tab which is loaded via JavaScript. Thus not all source links for employees will be found.
56 http://nokogiri.org/
In the IPIB all companies use the same page layout. We therefore expect to see similar results for all source URLs of a certain type. For instance, if the source URL for the number of employees is missing for one company it is likely to also be missing for all other companies. Hence we do not add one metric per source URL type but instead use a single metric for all company source URLs. The metric assessment could be optimized to only check one company which offers the respective source URL. But if the company page were to display different companies using different methods, defects would be missed. We therefore choose to assess


a random sample of 100 companies. The metric value is calculated as

\frac{\text{number of expected source links found}}{\text{total number of expected source links}} .

The retrievability of sources metric checks whether the content behind a source URL can be retrieved. This is done by evaluating the response of a HEAD request; a source is thereby defined as retrievable if the HEAD request returns a 200 (OK) response. We add three metrics: for the source information of the start-of-operations and the end-of-operations of companies and for person images. The metrics are then calculated as

\frac{\text{number of retrievable source URLs}}{\text{total number of source URLs}} .
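For illustration, the retrievability check can be sketched with HTTParty as follows; the URLs are placeholders and the timeout is an arbitrary choice.

# A source URL counts as retrievable if a HEAD request answers with 200 (OK).
require 'httparty'

def retrievable?(source_url)
  HTTParty.head(source_url, timeout: 10).code == 200
rescue StandardError
  false
end

urls = ['http://www.example.com/', 'http://www.example.com/missing']
score = urls.count { |url| retrievable?(url) }.to_f / urls.size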

7.4 Assessment of Historic IPIB Versions and Data
To evaluate the long-term usage of CDQA and the collector it was necessary to assess historic IPIB data. To do this, previous versions of the IPIB and its data had to be recreated. It also had to be decided how to implement the assessments and which dimensions and metrics to include. Usually the assessments would have evolved along with the application development, but it is not easy to estimate their development retroactively. Therefore it was decided to use the current implementation of data quality dimensions and metrics. Metrics for which no values can be calculated in retrospect were excluded completely; for instance there is no data available on the average response times of previous versions of the IPIB. Metrics for which values can only be calculated for some time-points are included. They only contribute to the dimension score once they report their first value. This is analogous to the usual addition of a metric due to changes in the application.
The IPIB was launched in early 2012 and has been under ongoing development with frequent releases. Over time the data model was changed, as was the application's user interface. New releases were deployed on a weekly basis. Many components, such as the Rails framework, had major version updates. Even the database used was switched from MySQL to PostgreSQL. The CDQA implementation developed for the current IPIB version could be easily applied to all previous releases. This is due to the robust implementation of the CDQA gem and of course


due to the cucumber framework which can be used with all libraries and databases ever utilized for the IPIB.



The IPIB code base is version controlled and a tag for every release is set as shown in Figure 23. Therefore it is no problem to regain the state of code for any given point in time. To enable data quality assessment the CDQA gem and data quality configuration is added to every IPIB version used. As many metrics rely on DQ-Tests we decided to add DQ-Tests to the first IPIB version which originally did not include them. The first version can be seen as the version for which DQ-Tests had just been released. The editors therefore had not yet improved data quality by using them. DQ-Tests were originally added two months before the second version used here. Thus the second version already contains data quality improvements due to the DQ-Tests. Otherwise no changes were made to the IPIB versions.











05 06 07 08 09 10 11 12

01 02 03 04 05 06 07 08 09 10 11 12

01 02 03 04 05 06

2012

2013

2014

Figure 23: Overview of time points for which the state of the IPIB was analyzed. Above the release tags which were valid for the given dates are shown. Below the availability of database dumps is indicated. Regular database backups were performed for the IPIB. These consisted of automated nightly backups with a retention period of a week and additional manually performed backups before deploys. Unfortunately monthly backups with an unlimited retention period only existed since May 2013. Thus we could not freely choose a number of backups to compare. To have almost equidistant time periods we chose six points in time from May 2012 to May 2014 (Figure 23). For five time points a database backup exists. For January 2013 no backup exists but we were able to create an approximation of the state in January 2013. This is possible as the IPIB uses PaperTrail57, a library for storing version information on all mod57 https://rubygems.org/gems/paper_trail

112

Chapter 7: Demonstration of Applying CDQA

els. Using the May 2013 database dump the following three steps were applied to derive an approximate database of January 2013: 1. Delete all entries created after 01-01-2013. 2. Revert attributes for all remaining models to the version from 01-01-2013. 3. Revert all database migrations applied between the releases in January and May. As all tables used in the IPIB are generated with created-at and updated-at timestamps the first step can be easily applied. The tables containing the actual versioning information and database migration states were skipped. In the second step every entry belonging to a versioned model is reverted. The versioning library PaperTrail stores every create, update and destroy event for each tracked model. As only the actual object attributes are included it is not possible to track changes on associations reliably58. Additional problems occur if database entries were manipulated directly, thus bypassing the versioning. Also renaming attributes or classes makes reverting entries difficult. It would require a more advanced approach in which reverting versions and reverting database migrations or code changes have to be synchronized. Our derived database dump is therefore only an estimate of the January data state. To check our method of “rewinding” the database we used it to derive a May 2013 dump from the June 2013 dump. As we have an original May 2013 dump we were able to compare the derived state with the actual state at that point. For both dumps the count of entries for each table is equal. But comparisons on deeper levels revealed small differences. For instance the number of present (not nil or empty) attributes summed over all companies differs by 20 with a total of over 17.000 counted. As we only use the data to observe trends in the overall development, we find these differences acceptable.

58 PaperTrail tries to track changes on belongs_to associations by searching a three-second time window for changes. This is of course not foolproof. In addition, no other kind of association is tracked.
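To make the three-step rewinding procedure concrete, a rough Ruby sketch is given below. This is not the script used for the thesis; the model name, the cut-off constant, and the exact PaperTrail call (`paper_trail.version_at` in recent gem versions) are assumptions for illustration.

```ruby
# Rough sketch of "rewinding" the database to a cut-off date (illustrative only).
CUTOFF  = Time.new(2013, 1, 1)
SKIPPED = %w[versions schema_migrations] # versioning and migration state tables

connection = ActiveRecord::Base.connection

# Step 1: delete all entries created after the cut-off date.
(connection.tables - SKIPPED).each do |table|
  next unless connection.column_exists?(table, :created_at)
  connection.execute("DELETE FROM #{table} WHERE created_at > '#{CUTOFF.to_s(:db)}'")
end

# Step 2: revert the remaining versioned models to their state at the cut-off.
Company.find_each do |company|
  old = company.paper_trail.version_at(CUTOFF) # object as of the cut-off, or nil
  company.update_columns(old.attributes) if old
end

# Step 3: reverting the database migrations applied between the two releases
# is done outside this script, e.g. with `rake db:migrate VERSION=...`.
```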


8 Evaluation

In the following chapter we evaluate the use of CDQA. This is done by discussing the experience gained from applying CDQA to the IPIB. As the implementation has only recently been finished, we cannot evaluate long-term use. Instead we assess historic data and versions of the IPIB. An editor of the IPIB used the collector and described the insight he gained into the data quality59. The insights and data quality results are discussed below. Finally, we conclude by comparing the results obtained with the goals set in this thesis and offer an outlook on future work.

8.1 Results and Insights from Applying CDQA

After selecting the database dumps and the matching IPIB release versions as described in chapter 7.4, the development over the past years could be analyzed. A small script was used to recreate the IPIB for each time-point and subsequently gather simple metrics and execute the data quality assessment. To give an understanding of the IPIB data growth, the number of entries was counted for all main models at each time-point. As can be seen in Figure 24, the data contained in the IPIB has been growing roughly linearly. This was expected, as the method of input and the number of people working as editors have not changed substantially.

59 Fabian Bartsch is thanked for extensively using the collector application and for providing valuable insights in an interview conducted on 13th June, 2014.


could not be used (such as altering metrics, weights, and dimension expectations). In addition, all metrics were implemented for the current IPIB version; data quality expectations may have been different for earlier versions. We therefore try to show how the application development influenced the data quality from today's point of view. All data issues found are discussed.

Figure 25: Collector showing the data quality development of the IPIB. All included dimensions are listed with their current quality score. The graph shows the development for the selected two-year period. The IPIB score is shown in blue and the dimension scores in white. Marked (1) is the Availability dimension, marked (2) the Currency dimension, and marked (3) the Internal Completeness dimension.

The initial increase can be explained by the DQ-Tests which were added in August 2012; editorial work was then focused on improving these. In early 2013 the editorial guidelines were changed to allow sparse company profiles, and a large number of these were added. This reduced the Internal Completeness (3) score and explains the drop in data quality. Since then, new companies have been added with higher quality due to the presence of DQ-Tests. Unfortunately, older companies have not been worked on since. The data quality of most dimensions has improved over the past two years. In the following sections we discuss each dimension in more detail. For each dimension a screenshot is included showing the data displayed by the collector for the respective dimension.

Availability

Figure 26: Collector showing data quality development for the Availability dimension. The line-chart shows the dimension development in blue and the metrics in white. The time-range was set to display only May 2014. Marked (1) is the system-availability metric and marked (2) the average-response-time metric.

As mentioned previously, Availability could not be calculated for past time-points: for the two metrics included, no historic data is available. Therefore Figure 26 shows only the quality values collected for May 2014.


Comparing the two metric developments shows how the higher weight of the average-response-time metric (2) dominates the dimension quality score. The system-availability metric (1) reported values of 1.0 for all but three days. The dip in system availability on 7th May was due to high server load following a deployment and could be resolved after a couple of hours. Currently the Availability dimension has satisfying scores.
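The dominance of the heavier metric follows directly from the aggregation. Assuming the dimension score is a weight-normalized average of its metric scores, a minimal Ruby sketch illustrates the effect; the weights and scores below are illustrative and not the actual IPIB configuration.

```ruby
# Weight-normalized dimension score; values are illustrative.
def dimension_score(metrics)
  total_weight = metrics.sum { |m| m[:weight] }.to_f
  metrics.sum { |m| m[:weight] * m[:score] } / total_weight
end

availability = [
  { name: "system-availability",   weight: 1, score: 1.0 },
  { name: "average-response-time", weight: 3, score: 0.6 },
]

dimension_score(availability) # => 0.7, pulled towards the heavier metric
```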

Currency

Figure 27: Collector showing data quality development for the Currency dimension. The line-chart shows the dimension development in blue and the metrics in white. The time-range was set to display only May 2014. Marked (1) is the person-image-source metric, marked (2) the company-description-source metric, and marked (3) the company-default-market-source metric.

The Currency dimension also contains metrics for which no values could be calculated for past time-points. As described in chapter 7.3, all metrics depend on checking the last-modified header of source websites. These values could not be retrieved in retrospect. Therefore, again only values for May 2014 are displayed.


Only two metrics assess a large number of data items; they are marked (1) and (2) in Figure 27. Both degrade slowly, as expected of these metrics (meaning source websites were updated). Unfortunately, most metrics have a high volatility: their values change rapidly from one day to the next. In all cases this is due to the metrics assessing only very few data items, so the score changes quickly if just one more data item fails or passes. The fact that a metric assesses very few data items can have several reasons:

– By design. The metric does not aim to assess more data items.
– Due to unmet dependencies. The metric cannot assess a data item because it does not meet the requirements for assessment. For instance, the Currency metrics all require a source website to be present. If this is missing, no assessment can be made. The absence of a source website should then show up in the Metadata dimension.
– Due to shortcomings in the implementation. The metric can only make few assessments because the metric implementation does not work as expected. For instance, checking the last-modified headers can fail due to networking issues.

For the metrics shown in Figure 27 it was determined that the reason for very few assessments was only partly unmet dependencies. For instance, metric (3) checks whether a source website is given as default for all market information. Currently 152 companies provide such a source website, which should be sufficient for assessment. But for only a few source websites could a successful HEAD request be made by the agent, and for even fewer (4-8) did the response contain a valid last-modified header. Changing the network used by the agent resulted in far more successful HEAD requests. In addition, most websites which do not provide a valid last-modified date do return content for a GET request. Therefore, an improved metric implementation will be used in future: an external service with better network access and additional content-based identification of website updates.

In general, the number of data items which were checked should be added to each metric assessment. This makes it easier to identify those metrics which rely on only very few assessed items. Currently they can only be identified by viewing the improvement-advice, which in most cases contains every data item failing the metric. By comparing the number of failing data items with the score it becomes clear how many data items have been checked.

As a first effort to improve Currency data quality, the editor followed the improvement-advice for the person-image-source metric. In some cases the advice did show person images which had been updated, but in most cases the image appeared to be the same and only the last-modified date had changed. Again, better or additional methods to detect changes in images should be used. The current implementation of the Currency metrics is unsatisfactory. An implementation relying only on volatility values and data item update information would certainly have reported more consistent values. Nevertheless, the editor's opinion is that the advantages of monitoring real changes outweigh the disadvantages of setting up a complex monitoring system.
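The kind of check these Currency metrics perform can be sketched as a HEAD request that reads the Last-Modified header; the URL below and the error handling are illustrative and not the agent's actual implementation.

```ruby
# Fetch the Last-Modified date of a source website via a HEAD request.
require "net/http"
require "uri"
require "time"

def last_modified(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end
  header = response["Last-Modified"]
  header && Time.httpdate(header) # nil if the server sends no (valid) header
rescue StandardError
  nil # networking issues count as "not assessable"
end

last_modified("https://example.com/") # => Time instance or nil
```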


Internal Completeness

Figure 28: Collector showing data quality development for the Internal Completeness dimension. The line-chart shows the dimension development in blue and the metrics in white. The metric null-values-database is marked (1), the metric null-values-company is marked (2), and the metric blank-attributes-company is marked (3). Three metrics added in May 2013 are marked (4). Marked (5) is the metric using the DQ-Test CO-007.

As described in chapter 7.3, the Internal Completeness dimension currently includes almost only metrics based on DQ-Tests. As can be seen in Figure 28, about half of the metrics could be calculated for the first data-point (May 2012). Over the following months further DQ-Tests were added which are used as metrics. For instance, for the January 2013 time-point seven new metrics could be calculated, as the underlying DQ-Tests were added in the corresponding release. The Internal Completeness score has improved over the two-year period. The quick rise until January 2013 is due to the newly added DQ-Tests and the editors' focused work on these. The drop after January 2013 is due to the change in editorial guidelines: in early 2013 many sparse company profiles were added. These contained less data than usual and thus lowered the score.

After adding a new DQ-Test, the score for the corresponding metric usually increased quickly. For example, the three metrics marked (4) reported a very low score at the beginning. New DQ-Tests are mostly added because new data fields are added to the data model. The editors then work on all companies to add information for the newly added data fields, which quickly increases the score of the metric.

As mentioned in chapter 7.3, we added three generic metrics with a low weight. These were added to evaluate whether a generic metric can capture the data quality. The most generic metric (1), which counts all nil values against possible values, initially dropped in score. This was due to new attributes added to several models. From September 2013 on, the metric reported a score very near to one. This was due to the addition of large computed tables without nil values, for instance a table containing one million entries matching companies to patent classes60. In order to continue providing useful measures these tables have to be excluded, thus a customized metric is necessary. The other two generic metrics (2) and (3) are already customized to the IPIB as they only concern the main data model, the company. Metric (2) counts nil values against all possible values for the company table only. Metric (3) instantiates all companies and counts blank attributes against total attributes. Metric (2) continuously reported a score about 0.2 above metric (3). Both metrics experienced only small changes over the entire time period. As the perceived data quality for companies was lower at the beginning (the reason why DQ-Tests were implemented), these metrics do not capture the data quality adequately. The small weights on these metrics were therefore justified, and the metrics should be removed completely in future releases.

60 The patents are categorized with the International Patent Classification (www.wipo.int/classifications/ipc/en/).

The metric based on DQ-Test CO-007 (5) checks the presence of an OpenCorporates URL. Between September 2012 and January 2013 many companies from Singapore were added. For this jurisdiction an OpenCorporates URL exists for many companies, which explains the quick rise of this metric during this time. In the following period, companies from jurisdictions which are underrepresented in the OpenCorporates database were added. As the editors do not set this DQ-Test to not-applicable for companies which are not present in the OpenCorporates database, the data quality decreased again. This revealed a deficiency in the way DQ-Tests are used. Originally, DQ-Tests were designed to be set to not-applicable in exactly these cases. The editors adopted a different understanding and only set DQ-Tests to not-applicable in very few cases. One reason for this is the bad user experience of the DQ-Test interface: setting a single DQ-Test for a company to not-applicable results in a full page reload and takes nearly four seconds. This issue should be addressed both with revised editorial guidelines and an improved implementation of the DQ-Test user interface.

The metric based on DQ-Test CO-003 checks the presence of an end-of-operations date. As most companies in the IPIB are still active, this date is only present for very few companies. This is one of the few DQ-Tests which editors do set to not-applicable. As the possibility of setting a DQ-Test to not-applicable was only added in late 2012, the first assessments reported a very low value. The editors then focused on this DQ-Test and set it to not-applicable for many companies; the score therefore increased quickly. Again, due to high load times and the bad user experience of the DQ-Test interface, not all companies were reworked during this time. The editors would expect a far higher score if all companies were reworked.
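For illustration, the generic null-values metric (1) discussed above can be sketched as the share of non-nil cells over all cells of a set of tables; the table list is illustrative and, as noted, large computed tables would have to be excluded.

```ruby
# Share of non-nil cells over all cells of the given tables (rough sketch).
def null_value_score(tables)
  connection = ActiveRecord::Base.connection
  present = possible = 0
  tables.each do |table|
    columns = connection.columns(table).map(&:name)
    rows = connection.select_value("SELECT COUNT(*) FROM #{table}").to_i
    possible += rows * columns.size
    columns.each do |column|
      # COUNT(column) counts only non-NULL values
      present += connection.select_value("SELECT COUNT(#{column}) FROM #{table}").to_i
    end
  end
  possible.zero? ? 1.0 : present.to_f / possible
end

null_value_score(%w[companies people locations])
```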


Internal Consistency

Figure 29: Collector showing data quality development for the Internal Consistency dimension. The line-chart shows the dimension development in blue and the metrics in white. Marked (1) is the metric based on DQ-Test CO-012, marked (2) the metric based on DQ-Test CO-013.

Internal Consistency currently includes two metrics based on DQ-Tests. Both metrics have reported high values over a long time period; in general, Internal Consistency is satisfying. Metric (1) checks whether the company status is set to not-active if an end date is given. Only three companies failed this test. An analysis of these showed that the metric had correctly identified a data quality issue: all three companies had an end date set but the status was still set to active. This is due to the layout of the user interface used to edit a company, in which the fields for end date and status are far apart. For all three companies the end date was set some time after the company was initially created. If the status field had been next to the end-date field, the editor would most likely not have forgotten to set it correctly.
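The consistency rule behind metric (1) can be expressed in a few lines; the column names below are assumptions and not the actual IPIB schema.

```ruby
# Companies with an end-of-operations date must not have the status "active".
applicable = Company.where.not(end_of_operations: nil)
failing    = applicable.where(status: "active")

score = applicable.count.zero? ? 1.0 : 1.0 - failing.count.to_f / applicable.count
failing.pluck(:id) # candidates for the improvement-advice
```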


The values reported by metric (2) dropped in January 2014. This metric essentially evaluates a company's role in a corporate grouping. The drop in data quality is due to a refactoring of the way corporate groupings are represented in the database: the migration which transformed the grouping data did not function correctly and thus corrupted the data. This degradation of data quality went unnoticed because the control samples of companies which were checked did not show the problem.

Metadata

Figure 30: Collector showing data quality development for the Metadata dimension. The line-chart shows the dimension development in blue and the metrics in white. Marked (1) are two presence-of-author metrics which are not distinguishable. Marked (2) and (6) are three metrics based on DQ-Tests. Marked (3) is the company-end-of-operations metric. Marked (4) is the person-thumbnail-source-present metric and marked (5) is a gap between three metrics on the left and three metrics on the right.

The score of the Metadata dimension has increased over time, as shown in Figure 30, and is currently satisfying. Nevertheless, the line-chart for this dimension reveals several issues, both with the collector application and with the metric implementation.

An issue with the library used to visualize the data quality development is shown by the two metrics marked (1). They both consistently report a score of 1.0. As is visible in Figure 30, they seem to be one metric because they overlap exactly. Even when using the line-chart mouse-over information (as shown in Figure 14) they cannot be distinguished; only by viewing the list of metrics can they be separated. This is a shortcoming of the library used to display the line-charts which should be improved. Both metrics check the presence of an author who created the company or person. As it is not possible to add companies or people via the UI without being logged in, it is easy to record the author (the current user). Evidently the libraries used for this purpose have been configured and implemented correctly.

As described in chapter 7.3, the source-present metrics in this dimension are generated automatically based on naming conventions (a sketch of this approach is given at the end of this section). Issues with this approach become visible when analyzing the gap between several metrics marked (5). The gap lies between three metrics which reported their last values for September 2013 and three metrics which reported their first values in January 2014. The new metrics are actually the continuation of the three old metrics. The gap is due to a renaming of the assessed source attributes (for instance, thumbnail was renamed to image). As the names of the metrics were also automatically changed, the collector interprets these as new metrics. Ideally it should be possible to rename metrics or to merge two metrics into one using the collector.

The metric marked (4) seems to have improved substantially in the first year. This is not what was expected, as this metric checks the presence of a source for person thumbnails. As these are photos which are retrieved from company websites, a source should always have been present. The increase is mainly due to a change in how the library and implementation handle people for whom no photo is present. Earlier versions reported the thumbnail as present even if no photo was uploaded, due to a default image which is displayed for those without a photo. Later versions only report the thumbnail as present if a photo has been uploaded. Thus the metric values up to January 2013 assessed the presence of source information for all people, while since May 2013 people with no thumbnail are not checked by this metric. This shows how a metric implementation has to change as the application and underlying libraries evolve.

The metric marked (3) also showed a high volatility: it first reported a very low value and then quickly rose to a far higher value. This is due to the metric initially assessing only very few data items. Metric (3) checks the presence of an end-of-operations source only if there is an end-of-operations date. In May 2012 only five companies had an end-of-operations date and only one provided source information. Due to the low number of companies, this score improved quickly when source information was added. These issues were corrected through focused work on improving DQ-Test results in late 2012 (a DQ-Test checking the source for end-of-operations values exists). The metrics marked (2) and (6) are again an example of how adding new DQ-Tests resulted in quick improvements for the metrics based on them.
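As an illustration of the naming-convention approach mentioned above, the following hypothetical sketch derives a source-present metric for every attribute that has a sibling column with a `_source` suffix; the suffix and the metric naming scheme are assumptions, not the actual IPIB conventions. It also shows why renaming an attribute renames the derived metric.

```ruby
# Hypothetical generation of source-present metrics from naming conventions.
def source_present_metrics(model)
  columns = model.column_names
  columns.select { |c| columns.include?("#{c}_source") }.map do |attribute|
    applicable = model.where.not(attribute => nil)
    failing    = applicable.where("#{attribute}_source" => nil)
    score = applicable.count.zero? ? 1.0 : 1.0 - failing.count.to_f / applicable.count
    # The metric name is derived from the attribute name, so renaming the
    # attribute (e.g. thumbnail -> image) produces a "new" metric.
    { name: "#{model.name.underscore}-#{attribute.tr('_', '-')}-source-present",
      score: score }
  end
end

source_present_metrics(Person)
```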


Population Completeness

Figure 31: Collector showing data quality development for the Population Completeness dimension. The line-chart shows the dimension development in blue and the metrics in white. Marked (1) is the location-count metric.

The Population Completeness dimension score has steadily increased over the past two years. This is expected, as new companies are continuously being added to the database while the estimates of how many data items are expected per metric were not changed. The estimation of how many companies and other data items are expected was done in 2012. If CDQA had been in use since then, the number of expected data items would have been increased. Currently metric (1) has already reached a data quality of one; it is now clear that far more locations per company are to be expected and the number should be increased. The same holds in general for all Population Completeness metrics, as the number of companies expected to be contained in the IPIB has recently been raised to 10,000. This new estimate has not yet been reflected in the metrics.
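A Population Completeness metric of this kind can be sketched as the ratio of actual to expected data items, capped at one; the expected counts below are illustrative estimates, which is exactly why such a metric saturates once the estimate is outgrown.

```ruby
# Ratio of actual to expected data items, capped at 1.0 (illustrative).
EXPECTED_COUNTS = { companies: 10_000, locations: 20_000 }

def population_completeness(model, expected)
  [model.count.to_f / expected, 1.0].min
end

population_completeness(Company, EXPECTED_COUNTS[:companies])
# Once the actual count exceeds the estimate the score stays at 1.0,
# which is the situation observed for the location-count metric.
```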


Analyzing Population Completeness can show whether estimates must be adjusted. If an estimate is far too low or far too high, this indicates issues in the assumptions made for the application.

Syntactic Accuracy

Figure 32: Collector showing data quality development for the Syntactic Accuracy dimension. The line-chart shows the dimension development in blue. The metrics are overlain by the dimension and not visible.

The Syntactic Accuracy dimension has constantly reported a score of one. The individual metrics are not visible in Figure 32 because the metric and dimension lines overlap. This is analogous to the shortcoming of the visualization library described for the Metadata dimension. As described in chapter 7.3, a score of one was expected. Still, the result shows that no data was added or manipulated bypassing the validations in place, and that no validations were added for which invalid data was already present. For the IPIB the advantage of this dimension is to detect data issues which could have been introduced by bypassing the actual application. For instance, the switch of the database system from MySQL to PostgreSQL could have introduced invalid data.
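One way such a check can be implemented in a Rails application is to re-run the model validations over all stored records. This is a minimal sketch under that assumption, not necessarily how the IPIB metrics of chapter 7.3 are defined.

```ruby
# Share of records that still pass the application's model validations,
# catching data that was written past the application (e.g. via raw SQL).
def syntactic_accuracy(model)
  total = model.count
  return 1.0 if total.zero?
  invalid = 0
  model.find_each { |record| invalid += 1 unless record.valid? }
  1.0 - invalid.to_f / total
end

syntactic_accuracy(Company) # => 1.0 if no record violates a validation
```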

Traceability

Figure 33: Collector showing data quality development for the Traceability dimension. The line-chart shows the dimension development in blue and the metrics in white. Marked (1) is the metric for the presence of source information in company views. Marked (2) is the person-image-source-retrievability, marked (3) the company-start-source-retrievability, and marked (4) the company-end-source-retrievability metric.

The historic assessments of the Traceability dimension were based on a single metric; the dimension score up to January 2014 is therefore identical to the score of metric (1). Only in May 2014 were three additional metrics added. These are retrievability metrics which check whether the content behind a source is retrievable.

Metric (1) initially reported a value of zero, with a huge increase in the second assessment. The low result in the first assessment is due to the absence of a consumer UI in this version of the IPIB. The metric checks the presence of source URLs in the consumer presentations. As the initial IPIB version only provided editorial interfaces and no consumer representations, no source URLs were found and thus a score of zero resulted. The metric score fell again between May 2013 and September 2013. This was due to a restriction in the representations displayed to anonymous consumers of the IPIB: the information on services and markets and their source information has since then only been shown to authenticated users. Of course the metric can be altered to make authenticated requests, but the actual shortcoming is that there is no check whether the data to which the source URL is associated is actually displayed. Still, the metric did reveal two types of source URLs for which the data is displayed but not the source URL.

The three retrievability metrics have reported scores since May 2014. Noticeably low is the quality of the retrievability metric for the company end-of-operations source (4). This is due to the editorial guidelines, which suggest simply adding the no longer existent old company homepage as a source for the end-of-operations date. In most cases the domain is still registered but a call to the page returns a server error or page-not-found response. This is of course sufficient to demonstrate that the company (at least under the given homepage) is no longer in operation, but it cannot be assessed with the current retrievability metric implementation.

For the company start-of-operations metric (3), most source URLs are assessed as retrievable. Analyzing the improvement-advice, it was found that most sources which were assessed as not retrievable do indicate a change in the website structure. In some cases websites are not retrievable by the agent but are retrievable by manually visiting the site. In these cases the web server blocks access based on the HTTP “User-Agent” header field [52, Sec. 5.3.3]. The User-Agent set by the current implementation identifies itself as a robot; by setting the User-Agent to a known browser this can be avoided.

For the person-image-source metric (2), the source URL is the URL under which the image was originally accessible and from which it was retrieved. An image source can be not retrievable due to the use of an updated photo with a different file name and/or path. In some cases, however, a non-retrievable image source leads to the discovery that the person has left the company. In the sample analyzed we did not find any misleading responses from web servers; thus, for images, using a HEAD request to check retrievability seems to be sufficient.
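A retrievability check along the lines suggested above, sending a browser-like User-Agent and falling back to a GET request when the HEAD request is not answered properly, could look roughly as follows; the User-Agent string and the URL are illustrative.

```ruby
# Check whether the content behind a source URL is retrievable.
require "net/http"
require "uri"

BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)" # illustrative browser-like value

def retrievable?(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    head = http.head(uri.request_uri, "User-Agent" => BROWSER_UA)
    return true if head.is_a?(Net::HTTPSuccess)
    # Some servers answer HEAD incorrectly; fall back to a full GET.
    http.get(uri.request_uri, "User-Agent" => BROWSER_UA).is_a?(Net::HTTPSuccess)
  end
rescue StandardError
  false
end

retrievable?("https://example.com/")
```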


Summary of insights gained in the demonstration

Using the collector application, several insights were gained. In the following we list the most interesting insights regarding the IPIB but also the agent and collector implementation. For the IPIB we gained two general insights and also identified four deficiencies in the IPIB data and user interface:

– DQ-Tests initially help to quickly improve data quality but then stagnate. Therefore, DQ-Tests must be kept in the editors' focus, i.e. by using them as metrics.
– The actual usage of DQ-Tests by the editors, especially setting DQ-Tests to not-applicable, is not as intended. The editors should be trained accordingly.
– The UI response times for DQ-Tests in the IPIB should be improved, as they caused the editors to stop using them.
– The order of input elements in the UI for editing companies is not ideal. Elements which are used at the same time should be located next to each other.
– A change in the data model introduced corrupted data which went unnoticed until now. The data should be corrected.
– There is data which should be accessible to the consumer but is not. This data should be included in the appropriate representations.

Regarding the agent and metric implementation, the insights gained were:

– Generic metrics do not seem to be sufficient to capture the quality of a dimension. If possible, custom metrics should be used.
– When accessing resources on the web, it is in most cases necessary to use more sophisticated approaches than simple GET or HEAD requests.
– Each metric value should provide information on the number of data items assessed.

Regarding the collector application, several improvements should be made:

– The line-chart visualization fails in some corner cases and should be improved.
– It should be possible to merge metrics in cases where one metric is the continuation of another.

In summary, the implementation of CDQA for the IPIB offered valuable insights for both the editors and the developers of the IPIB. Several issues were detected and will be dealt with. The most helpful feature for the editor interviewed is the improvement-advice, which allows him to quickly switch to and improve the data item that failed a test. The editor suggested additional improvements, such as the aggregation of results for a single data item over all metrics. This would allow hot-spot data items or data items with an unusual pattern of failing metrics to be identified. The UI can be improved in many places: for instance, the line-chart could draw metric lines thicker depending on their weight, the filter resolution should be set automatically depending on the selected time-range, and setting the time-range could be done by marking an area on the line-chart.

8.2 Conclusion and Outlook

At the beginning of this thesis we formulated four goals in order to systematically implement CDQA for information systems. After creating the CDQA process and artifacts and after demonstrating their usage, we now discuss whether the goals of the thesis have been achieved. We then provide an outlook on possible enhancements to CDQA and future work.

Achieved goals

The goal of creating a total data quality score for information systems has been achieved as described in chapter 4. The CDQA process and artifacts allow a single score to be calculated for any application. As the score is always in the range between zero and one, it can be directly interpreted by consumers. It allows easy comparison of the current data quality of different applications and of the data quality development over time.

To achieve the goal of understandable data quality scores, the following steps were taken. First, the data quality score is transparently calculated as the average of all included data quality dimensions. Secondly, all dimensions, except for one, and their definitions are widely used in the literature and other data quality frameworks. Even for consumers without knowledge of data quality, the definitions are understandable. The editor interviewed was previously untrained in the topic of data quality. He mostly understood the meaning of the dimensions when given only their definitions. This included the Metadata dimension, which had not been used in the literature reviewed. When given the detailed metric explanations and examples from chapter 5, the editor was able to fully understand the meaning of all dimensions. Therefore the goal of understandable data quality scores has been achieved, although there is room for improvement.

The continuous data quality assessment process was defined in chapter 4.2. The process supports an iterative approach to achieving the expected data quality and it avoids over-optimization. Thus the set goal has formally been achieved. Unfortunately, the process could not be evaluated in practice as it has not been in use long enough. Initial feedback from the editor interviewed indicates that the process is suited to its task. Nevertheless, first improvements have been suggested, as described in the outlook.

Two software artifacts to support continuous data quality assessment were implemented as described in chapter 6. The CDQA-Agent uses plain text files to easily define dimensions and metrics. Continuous assessments can be executed at any required time interval. The CDQA-Collector can gather assessment results from any number of agents and applications. Line-chart visualizations of the data quality scores are available and have successfully been used to gain insight into the data quality development. The collector supports the CDQA process, including subjective assessments and notifications. Hence the goal of implementing software artifacts for continuous data quality assessment has been achieved.


Outlook

CDQA has successfully been applied and is currently in operation for several applications. However, it will be enhanced further. The demonstration and the following evaluation have shown several limitations which should be resolved, and additional improvements are foreseeable. To keep track of changes in CDQA we have adopted semantic versioning [53] to identify the impact of changes. The CDQA version is thereby based on the collector, which implements or interfaces almost all relevant parts of CDQA61. For instance, the collector includes the dimension definitions, the subjective assessment cycles, and the calculation of data quality scores. The collector also exposes the API which defines the behavior of the agent. The final version of the collector, and therefore of CDQA as described in this thesis, is given the version number 1.0.0. Semantic versioning then describes how to increase the three version numbers (major, minor, and patch) depending on the impact of a change on the public API. The patch version is increased if “you make backwards-compatible bug fixes”, the minor version if “you add functionality in a backwards-compatible manner”, and the major version if “you make incompatible API changes” [53]. In the following we propose solutions to the limitations which were identified and indicate the impact of the changes.

Include dimension description file in collector

Currently the file containing the metrics to be included is added directly to the agent. To understand, for instance, why a certain metric is present in the collector, it is necessary to access the agent code to review the dimension description file. In addition, for any modification, such as changing a metric weight, one needs access to the development and deployment tools used for the agent. It would therefore be an improvement if the dimension description files were provided by the collector. The files can easily be presented and edited in the collector UI and distributed to the agent on its next assessment run. Adding new metrics becomes easier for the data quality architect and independent of the work the data quality developer performs. In addition, the dimension description file format is made part of the collector API. This is highly desirable as CDQA versioning then covers all relevant CDQA parts. This is an incompatible change of the API against which the agent is implemented and therefore would require a major version increment.

61 The only exception is the dimension description file and its format.


Understandability of dimension definitions

The feedback from the IPIB editor has indicated that the dimension definitions can be improved. In the future we will not take the definitions found in the literature directly but improve them by adding additional information similar to that given in chapter 5. The majority of dimensions were not included in the demonstration and thus not evaluated in practice; concerning these, it is not clear whether the definitions can be easily understood. Improving the dimension definitions only increments the patch level as the API does not change.

Individual data quality scores

The selection of metrics and dimensions is based on the information system and its environment. This does not directly account for different consumer groups having different expectations as to which dimensions and metrics should be included. For example, the IPIB developers would like a single application score based only on the Availability dimension, while the IPIB editors see no need to include this dimension in their score. Currently this can only be solved by implementing several independent agents reporting several application scores. Instead, it should be sufficient to use a single agent assessing all dimensions and metrics. Individual scores can then be created in addition to the total quality score. The CDQA-Collector can be used to specify which metrics and dimensions to include and how to weight them. To make the creation of individual scores easier, it should be possible to assign weights not only to metrics but also to dimensions. These changes are backward compatible and thus increment the minor version number.

Aggregated subjective assessments

Currently each subjective assessment cycle can directly influence the expectation of the dimension. In high-traffic applications a potentially large number of consumers may give subjective assessments. This could cause high volatility in the dimension expectation and a large number of notifications sent to the editors. To avoid this, there should be a mechanism to gather subjective assessments over a certain time period and evaluate them as a whole. For instance, a dimension could be evaluated as satisfying if more than 50% of the assessments from the previous day said so. This change in the API is not backward compatible and thus a major version increment would be necessary.


Simple Subjective Assessments

Currently the consumers have to understand the meaning of each dimension to give a subjective assessment. As described above, this can be difficult to achieve. Hence there should be a simpler way of gathering useful feedback. The subjective assessment cycle could be adapted to only assess whether the application as a whole is satisfying. This basic question can be answered easily by any consumer. As the possibility of not using subjective assessments for an application should remain, this change is backward compatible and thus a minor version number increment is sufficient.


Bibliography

[1] “TDWI’s Data Quality Report,” The Data Warehousing Institute, Feb. 2002.
[2] “Big Data: Harnessing a Game-Changing Asset,” The Economist, Survey Report, 2011.
[3] D. Loshin, “Questions About the ‘Cost of Poor Data Quality,’” The Practitioner’s Guide to Data Quality Improvement, 25-Jul-2011.
[4] P. Russom, “Taking data quality to the enterprise through data governance,” The Data Warehousing Institute, Seattle, Mar. 2006.
[5] L. L. Pipino, Y. W. Lee, and R. Y. Wang, “Data quality assessment,” Communications of the ACM, vol. 45, no. 4, pp. 211–218, Apr. 2002.
[6] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer, “Enterprise Data Analysis and Visualization: An Interview Study,” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2917–2926, Dec. 2012.
[7] A. R. Hevner, S. T. March, J. Park, and S. Ram, “Design science in information systems research,” MIS Quarterly, vol. 28, no. 1, pp. 75–105, 2004.
[8] K. Peffers, T. Tuunanen, M. A. Rothenberger, and S. Chatterjee, “A design science research methodology for information systems research,” Journal of Management Information Systems, vol. 24, no. 3, pp. 45–77, 2007.
[9] J. Rowley, “The wisdom hierarchy: representations of the DIKW hierarchy,” Journal of Information Science, vol. 33, no. 2, pp. 163–180, Apr. 2007.
[10] C. Batini and M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques. Springer, 2006.
[11] P. R. Benson, “ISO 8000 Data Quality – The Fundamentals Part 1,” Real-World Decision Support (RWDS) Journal, vol. 3, no. 4, Nov. 2009.
[12] A. B. Tucker, Ed., The computer science and engineering handbook. Boca Raton, Fla.: CRC Press, 1997.
[13] J. M. Juran and A. B. Godfrey, Juran’s quality handbook. McGraw Hill, 1999.
[14] E. Larry, “Data quality,” IQ/DQ glossary. IAIDQ – International Association for Information and Data Quality.
[15] F. Naumann and C. Rolker, “Assessment Methods for Information Quality Criteria,” in Fifth Conference on Information Quality (IQ 2000), 2000, pp. 148–162.
[16] R. Gabriel, “Informationssystem,” Enzyklopaedie der Wirtschaftsinformatik. Oldenbourg Wissenschaftsverlag, 16-Oct-2013.
[17] R. Y. Wang and D. M. Strong, “Beyond accuracy: What data quality means to data consumers,” Journal of Management Information Systems, pp. 5–33, 1996.
[18] S. Knight and J. Burn, “Developing a framework for assessing information quality on the world wide web,” Informing Science: International Journal of an Emerging Transdiscipline, vol. 8, no. 5, pp. 159–172, 2005.
[19] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, S. Auer, and P. Hitzler, “Quality assessment methodologies for linked open data,” under review in Semantic Web Journal, 2013.
[20] “Internet Growth Statistics - the Global Village Online.” [Online]. Available: http://www.internetworldstats.com/emarketing.htm. [Accessed: 05-Mar-2014].
[21] M. J. Eppler and P. Muenzenmayer, “Measuring Information Quality in the Web Context: A Survey of State-of-the-Art Instruments and an Application Methodology,” in Seventh International Conference on Information Quality (IQ 2002), 2002, pp. 187–196.
[22] X. Zhu and S. Gauch, “Incorporating Quality Metrics in Centralized/Distributed Information Retrieval on the World Wide Web,” in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2000, pp. 288–295.
[23] P. Katerattanakul and K. Siau, “Measuring information quality of web sites: development of an instrument,” in Proceedings of the 20th International Conference on Information Systems, 1999, pp. 279–285.
[24] R. H. J. Van Zeist and P. R. H. Hendriks, “Specifying software quality with the extended ISO model,” Software Quality Journal, vol. 5, no. 4, pp. 273–284, Dec. 1996.
[25] A. Dedeke, “A Conceptual Framework for Developing Quality Measures for Information Systems,” in Fifth Conference on Information Quality (IQ 2000), 2000, pp. 126–128.
[26] G. Shanks and B. Corbitt, “Understanding data quality: Social and cultural aspects,” in Proceedings of the 10th Australasian Conference on Information Systems, 1999, vol. 785.
[27] “Apply an Open License (Legal Openness) - Open Data Handbook.” [Online]. Available: http://opendatahandbook.org/en/how-to-open-up-data/apply-an-open-license.html. [Accessed: 15-May-2014].
[28] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An Open Source Software for Exploring and Manipulating Networks,” in Proceedings of the Third International Conference on Weblogs and Social Media, 2009.
[29] S. Heymann and B. Le Grand, “Visual Analysis of Complex Networks for Business Intelligence with Gephi,” in Proceedings of the 1st International Symposium on Visualisation and Business Intelligence, in conjunction with the 17th International Conference Information Visualisation, 2013.
[30] M. Jacomy, T. Venturini, S. Heymann, and M. Bastian, “ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software,” PLoS ONE, vol. 9, no. 6, p. e98679, Jun. 2014.
[31] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008, 2008.
[32] “Demingkreis,” Wikipedia, 05-Nov-2013. [Online]. Available: http://de.wikipedia.org/w/index.php?title=Demingkreis&stableid=124145715. [Accessed: 29-Apr-2014].
[33] J. Nielsen, Usability Engineering. Morgan Kaufmann, 1993.
[34] L. Masinter, T. Berners-Lee, and R. T. Fielding, “Uniform Resource Identifier (URI): Generic Syntax.” [Online]. Available: http://tools.ietf.org/html/rfc3986. [Accessed: 03-Jul-2014].
[35] D. Peterson, S. (Sandy) Gao, A. Malhotra, C. M. Sperberg-McQueen, H. S. Thompson, and P. V. Biron, “W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes,” 05-Apr-2012. [Online]. Available: http://www.w3.org/TR/xmlschema11-2/. [Accessed: 17-Jul-2014].
[36] Information Assurance Solutions Group, “Defense in Depth – A practical strategy for achieving Information Assurance in today’s highly networked environments.” National Security Agency.
[37] “RFC 2818 - HTTP Over TLS.” [Online]. Available: http://tools.ietf.org/html/rfc2818. [Accessed: 25-Jun-2014].
[38] “Web Content Accessibility Guidelines (WCAG) 2.0.” [Online]. Available: http://www.w3.org/TR/WCAG/. [Accessed: 03-Jan-2013].
[39] “What is Plain Language?,” Plain Language. [Online]. Available: http://www.plainlanguage.gov/whatisPL/. [Accessed: 07-Jul-2014].
[40] K. Chen, J. M. Hellerstein, and T. S. Parikh, “Designing Adaptive Feedback for Improving Data Entry Accuracy,” in Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, New York, NY, USA, 2010, pp. 239–248.
[41] K. Beck, Test Driven Development: By Example, 1st edition. Boston: Addison-Wesley Professional, 2002.
[42] R. T. Fielding, “Architectural Styles and the Design of Network-based Software Architectures,” Donald Bren School of Information and Computer Sciences, 2000.
[43] “Ruby on Rails.” [Online]. Available: http://rubyonrails.org/. [Accessed: 03-Apr-2014].
[44] “Optimizing Relational Databases for Time Series Data (Time Series Database Overview Part 3),” TempoDB. [Online]. Available: http://blog.tempo-db.com/post/48645898017/optimizing-relational-databases-for-time-series-data. [Accessed: 13-Apr-2014].
[45] “RRDtool - The Time Series Database,” RRDtool. [Online]. Available: http://www.rrdtool.org/. [Accessed: 13-Apr-2014].
[46] “OpenTSDB - A Distributed, Scalable Monitoring System.” [Online]. Available: http://opentsdb.net/index.html. [Accessed: 13-Apr-2014].
[47] S. Johnston, “The Unique Database Requirements of Time-Series Data,” Database Trends and Applications, 15-Feb-2008. [Online]. Available: http://www.dbta.com/Articles/ReadArticle.aspx?ArticleID=52035. [Accessed: 15-Apr-2014].
[48] M. Prilop, L. Tonisson, and L. Maicher, “Designing Analytical Approaches for Interactive Competitive Intelligence,” International Journal of Service Science, Management, Engineering, and Technology, vol. 4, no. 2, pp. 34–45, 2013.
[49] N. Tsitoura and D. Stephens, “Development and evaluation of a framework to explain causes of competitive intelligence failures,” Information Research, vol. 17, no. 2, Jun. 2012.
[50] “About :: Info :: OpenCorporates.” [Online]. Available: http://opencorporates.com/info/about. [Accessed: 14-Feb-2014].
[51] L. Maicher, L. Tonisson, F. Bartsch, and P. Jha, “Intellectual Property Services Taxonomy (IPST).” [Online]. Available: http://ipib.ci.moez.fraunhofer.de/ipst. [Accessed: 23-Oct-2012].
[52] R. T. Fielding and J. F. Reschke, “Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content.” [Online]. Available: http://tools.ietf.org/html/rfc7231. [Accessed: 25-Jun-2014].
[53] T. Preston-Werner, “Semantic Versioning 2.0.0,” 18-Jun-2013. [Online]. Available: http://semver.org/spec/v2.0.0.html. [Accessed: 10-Jul-2014].

Appendix

Data Quality Dimensions by Batini and Scannapieco

The following overview contains the data quality dimensions, sub-dimensions, and their descriptions extracted from [10, Ch. 2].

Accuracy: defined as the closeness between a value v and a value v', considered as the correct representation of the real-life phenomenon that v aims to represent.
– Syntactic accuracy: the closeness of a value v to the elements of the corresponding definition domain D.
– Semantic accuracy: the closeness of the value v to the true value v'.

Completeness: the extent to which data are of sufficient breadth, depth, and scope for the task at hand.
– Schema completeness: the degree to which concepts and their properties are not missing from the schema.
– Column completeness: a measure of the missing values for a specific property or column in a table.
– Population completeness: evaluates missing values with respect to a reference population.

Completeness for the data model of relational data: What does a NULL value mean? Open world assumption vs. closed world assumption. The following are defined for the closed world assumption with NULL values allowed.
– Value completeness: captures the presence of null values for some fields of a tuple.
– Tuple completeness: characterizes the completeness of a tuple with respect to the values of all its fields.
– Attribute completeness: measures the number of null values of a specific attribute in a relation.
– Relation completeness: captures the presence of null values in a whole relation.

Completeness for the data model of web data: information that is continuously published.
– Completability: gives information about how fast the completeness will grow in time.

Time-Related Dimensions: data's change and update in time.
– Currency: concerns how promptly data are updated.
– Volatility: characterizes the frequency with which data vary in time.
– Timeliness: expresses how current data are for the task at hand.

Consistency: captures the violation of semantic rules defined over (a set of) data items.
– Integrity Constraints (database consistency): integrity constraints are properties that must be satisfied by all instances of a database schema. They can be defined on an intra-relation or inter-relation level.
– Key Dependency (database consistency): uniqueness of a combination of attributes for each instance of a relation.
– Inclusion Dependency (database consistency): some columns of an instance must be contained in other columns (possibly of another table), e.g. a foreign key constraint (the foreign key must exist as a primary key).
– Functional Dependency (database consistency): a function must be valid for each instance of a relation, e.g. in a location relation, if longitude and latitude are the same, so must be the country.
– Data Edits (statistics domain): the task of detecting inconsistencies by formulating rules that must be respected by every correct set of answers. An edit is a rule (formula) which denotes error conditions, e.g. “marital status = married ^ age < 14”.

Other Data Quality Dimensions: sometimes domain-specific data quality dimensions are used, such as positional accuracy in the geographical domain. The following two dimensions are important for network information systems.
– Interpretability: concerns the documentation and metadata that are available to correctly interpret the meaning and properties of data sources (5 types of documentation should be available).
– Synchronization between different time series: concerns the proper integration of data having different time stamps.

Accessibility: measures the ability of the user to access the data from his or her own culture, physical status/functions, and technologies available.

Quality of Information Sources: several dimensions proposed by different authors, e.g. believability, reputation, and objectivity.