Generating Data Quality Rules and Integration into ETL Process

Jasna Rodić
Oracle Croatia, Oracle Consulting BIW, Budmanijeva 1, Zagreb, Croatia
[email protected]

Mirta Baranović
Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, Zagreb, Croatia
[email protected]
ABSTRACT
Many data quality projects are integrated into data warehouse projects without enough time allocated for the data quality part, which leads to a need for a quicker data quality process implementation that can be easily adopted as the first stage of a data warehouse implementation. We will see that many data quality rules can be implemented in a similar way, and thus generated based on metadata tables that store information about the rules. These generated rules are then used to check data in designated tables and mark erroneous records, or to perform certain updates of invalid data. We also store information about rule violations in order to support analysis of such data, which can give significant insight into our source systems. The entire data quality process is integrated into the ETL process in order to make the data warehouse load as automated, as correct and as quick as possible. Only a small number of records is left for manual inspection and reprocessing.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications

General Terms
Design, Performance, Languages

Keywords
Data Quality, Rules, Generator, Metadata, Oracle

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DOLAP'09, November 6, 2009, Hong Kong, China. Copyright 2009 ACM 978-1-60558-801-8/09/11 ...$10.00.

1. INTRODUCTION
Due to a lack of budget for a separate project, DQ is usually integrated into DW projects. In such cases, we lack the time or resources for the implementation of the DQ part of the project. In order to minimize the risks of such implementations, we search for simplified and quicker solutions. One of the possibilities is the DQ rules generator described in this paper.

Low quality data may cause many issues in a DW, such as an unsuccessful DW load due to constraint violations or wrong data types, inconsistencies in data that need business rules to be applied, etc. We study DQ rules that can be applied on certain parts of DW systems that we call the SA layer and the DW layer. The SA layer is the staging area; it is a temporary layer, and thus all SA tables are truncated before each load starts in order to have empty tables before we load them with a new set of data. The DW layer is any part of the DW system where a dimensional model is implemented; it could be an enterprise data warehouse or any data mart. Most DQ rules are applied on the SA layer in order to prevent erroneous records from reaching the DW layer. Such rules are called SA rules. Some of them are used to correct invalid values, and some of them simply mark erroneous records. There is also another type of rules that can be applied on the DW layer, as suggested by [8]; we will call them DW rules. The generators for SA and DW rules are very similar, so we will describe the SA rules generator and then explain the differences between the two. SA rules can be split into three main categories:

• Database integrity rules: Rules based on database constraints and column data types.
• Match and merge rules: Rules that match records from different data sources and merge them into a single record. This problem is also known as deduplication, record linkage or object identity.
• Business rules: Rules defined by business users that find suspicious values based on, e.g., comparison or statistical methods. They include missing values, erroneous data according to a certain business definition, and outliers.

We will focus on database integrity rules and business rules. Each rule should also be categorized based on the severity of the rule violation. When a certain rule is violated, we should know which action to take (a minimal sketch of how the load might react to each severity follows the list):

• Stop: Certain rule violations are so severe that we have to stop the entire load.
• Error: Records that are marked as errors are prevented from being loaded into the DW layer, and they must be inspected manually in order to decide what to do with them: correct them manually or discard them.
• Warning: Records that are marked as warnings are moved to the DW layer, but we have the information that some data has either been updated by the rules, or that the information in the reports might be inconsistent up to an allowed level.
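The following is a minimal sketch of such load control, assuming that rule violations are recorded in the error fact table described in Section 3.1 with a one-letter severity code; the procedure name and the 'S'/'E'/'W' coding are illustrative assumptions rather than a fixed part of the approach.

-- Abort the load if any stop-severity violation was recorded for the
-- current batch; error records simply stay back in the SA layer,
-- while warnings are allowed through.
create or replace procedure p_check_load_severity(p_batch_id number) is
  v_stop_cnt pls_integer;
begin
  select count(*)
    into v_stop_cnt
    from error_table
   where batch_id = p_batch_id
     and severity_code = 'S';   -- 'S' = stop (assumed coding)

  if v_stop_cnt > 0 then
    raise_application_error(-20001,
      'DQ stop-severity violations found, load aborted for batch ' || p_batch_id);
  end if;
end p_check_load_severity;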
2. RELATED WORK
Much of the research done in the DQ area is connected to match and merge techniques. Theory on this topic, as well as case studies, is covered in [4]. Most authors try to solve the problems with object identity, and the main difference between them is the approach they use. An implementation as an extension to the Common Warehouse Metamodel is described in [3], and different record linkage algorithms are described in [5], [7] and [14]. Many tools have been developed in this area, like [17], [2], D-Dupe, The Link King and Link Plus. There are also tools that cover many DQ areas, such as record linkage and data cleaning problems: Oracle Warehouse Builder (OWB), Oracle Data Integrator (ODI), WinPure, DataCleaner and DataFlux. Data mining techniques are intensively used to solve problems with missing values, erroneous data and outlier corrections, as in [10]. Referential integrity rules are covered by [15]. Correcting specific data problems connected to a network traffic database is presented in [9]. Business rules are explained in [19] and [12].

The previously mentioned DQ tools are used to create DQ modules and then run them in order to produce clean data records. Certain tools are also ETL tools, so they generate DQ mappings that can be included in ETL processing; using ETL tools with DQ capabilities therefore makes the entire process easier. Other tools contain many different DQ algorithms to clean and conform data and load them into flat files or tables. Such tools are useful if you use ETL tools with no DQ capabilities, or if you do not use ETL tools at all and rather use 3GL programming such as Oracle PL/SQL. The problem is how to integrate such DQ tools with your ETL process in order to make it automatic.

If DQ tools are not available on your project, or if you have trouble with DQ tool integration, we suggest the approach explained in this paper. We need to predict possible flaws in our data, create rules that correct or filter out data with such flaws, and then incorporate such DQ modules in our ETL, mostly between the extraction and transformation parts. In this way, most of our DQ processing is done automatically as part of ETL. Manual intervention is needed only to examine records that are left in the SA layer with error severity, and to construct new rules that can be applied in order to minimize the number of such records in the future.

3. IMPLEMENTATION DESCRIPTION
The DQ model described in this paper can be implemented using any technology we prefer; in our examples, we show an implementation using Oracle technology. Our goal is to integrate the DQ process into the ETL tasks, that is, to create a DQ module for each SA table that runs automatically after the SA load and before the DW load. In Oracle RDBMS we implement each program as a PL/SQL package. Thus, the DQ module for each SA or DW table becomes a single PL/SQL package called a DQ package. Executing this package applies the DQ rules on a certain table.

Each DQ module marks suspicious records in the table and corrects invalid values where possible. At this point, we expect to have a rather small number of erroneous records in comparison to the total number of records. Correct records, as well as warning records and corrected records, are further processed and loaded into the DW layer, while erroneous records stay in the SA layer and require manual intervention by business users. In order to achieve this integration of the DQ process into the ETL tasks, we have to make several decisions about our design.
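For illustration, the specification of such a DQ package might look as follows; the package and procedure names follow the naming used in the example of Section 5.3, and the batch id parameter reflects our assumption about how the load identifies itself.

-- Specification of the DQ package for one SA table; its generated body
-- applies all DQ rules defined for that table in a single run.
create or replace package PKG_DQ_SA_CUSTOMERS as
  procedure main(p_batch_id number);
end PKG_DQ_SA_CUSTOMERS;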
3.1 Storing of DQ Rules Violations
The first decision we have to make is how to store DQ rule violations. There are three possibilities:

• Flags: Create additional columns in each SA table: a warning flag, an error flag and a text message.
• Error Log: Create an additional error log table for each SA table that stores the erroneous records.
• Error Subject Area: Create a fact table, named error table, with information about the load that ran the DQ rules, the rule that was violated, the table name and rowid of the record that violated the rule, and severity flags.

The advantage of having error and warning flags in SA tables is simpler filtering of erroneous records that must be stopped, and of warning and correct records that can be moved into the DW layer. Using error logs permanently stores the actual values of the erroneous records; when error logs are not used, we lose this information, since SA tables are truncated before each load. If we decide to use error logs, Oracle 10g RDBMS provides functionality to populate error log tables during DML operations, so correct records are moved to the desired table while error records are moved to the error log table.

The Error Subject Area enables error analysis, since rule violations are stored permanently in a dimensional model. Such analysis leads to definitions of even more business rules that could correct data automatically in the future, or it can even point to a problem in the source systems that must be corrected directly there. The error subject area consists of one fact table, named error table, similar to the one proposed by [8], which stores the rowids of the erroneous records and severity flags that help during analysis. It also contains three dimensions: a time dimension, a load dimension and a rule dimension. We can reuse the same time dimension used in our DW. The load dimension can be a real dimension, or we can use a degenerate dimension in the error table and include only a batch id column. Since we are going to use a metadata table to describe the DQ rules, the same table serves as the rule dimension. This dimension contains information about the table where the rules are applied, so we do not need an additional column in the error table to provide such information. We can also decide to combine all three possibilities, since each provides certain advantages.
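As an illustration, the error subject area might be declared roughly as follows; the column names and types are our assumptions based on the description above, not a prescribed schema, and the rule dimension is simply the md_dq_rules metadata table introduced in Section 5.

-- Sketch of the error fact table (names and types are assumptions).
create table error_table (
  load_date     date        not null,  -- join to the reused time dimension
  batch_id      number      not null,  -- degenerate load dimension
  rule_id       number      not null,  -- references md_dq_rules (rule dimension)
  tab_rowid     rowid       not null,  -- rowid of the violating record in its SA table
  severity_code varchar2(1) not null   -- 'S' = stop, 'E' = error, 'W' = warning (assumed coding)
);

-- The error log option mentioned above can rely on Oracle 10g DML error logging:
-- exec dbms_errlog.create_error_log('SA_CUSTOMERS');
-- insert into dw_user.customers select ... from sa_user.sa_customers
--   log errors into err$_sa_customers ('batch 42') reject limit unlimited;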
3.2 SA Tables Design
SA tables should have no database (DB) constraints implemented, such as primary keys, unique keys, foreign keys and check constraints. Otherwise, our load of the SA tables might fail due to constraint violations. We should rather create a DQ rule for each DB constraint that is supposed to be implemented on a certain SA table, and use these rules to check possible DB constraint violations.

If our data sources are flat files, it is advisable to create for each flat file an SA table that contains only columns of a very large variable-length character datatype. The advantage of such an approach is that the source data can be loaded into the SA tables without exceptions. This is especially important because flat files can contain errors; for example, some numerical data or dates may arrive in a wrong format. If we designed our SA columns with number and date datatypes, such a load would fail, but if the SA columns are of a very large character datatype, the files are loaded successfully. We can then apply rules that check data formats and update the data in the SA layer to the format we need, as sketched below. Flat files can also contain errors such as unnecessary characters, e.g. a blank instead of null; our DQ rules should find such cases and correct the values.

Another advantage of running DQ modules on the SA layer is less workload on the data sources. Since cleaning and marking of erroneous records is done in the SA layer, there is almost no additional workload on the data sources. This is especially important for production systems, which would otherwise be queried heavily in case of loading retries: when we load data from a production system into the SA layer and such a load is unsuccessful, we have to query the source system multiple times. We can avoid most of such retries with this SA layer design. We continue processing the records that are correct, we update erroneous records that violate certain DQ rules, and we mark all other erroneous records as errors and prevent them from being loaded into the DW layer. Usually, we expect a small number of such erroneous records in comparison to the number of records that are loaded successfully. Thus we have to correct the data in the production system and reload only this small number of records instead of loading the entire set of records again.
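One way such a format check might be implemented is as an auxiliary function like the one below; the auxiliary PL/SQL package described in Section 5.2 plays exactly this role, but the function name and signature here are our own illustration.

-- Returns 1 if the string can be converted to a number, 0 otherwise.
-- A NULL argument is treated as valid, since it is not a format error.
create or replace function f_is_number(p_value in varchar2) return number is
  v_dummy number;
begin
  v_dummy := to_number(p_value);
  return 1;
exception
  when others then
    return 0;
end f_is_number;

-- Example use in an attribute value rule on an all-character SA table:
-- select rowid from sa_user.sa_customers where f_is_number(cust_credit_limit) = 0;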
3.3 SA Copy of DW Tables
The last decision we have to make is whether to create a copy of each DW table in the SA layer or not. If we have similar tables on both the SA layer and the DW layer, we can use information from the RDBMS catalogues about the column datatypes and constraints of the DW tables. Such information is then used to create database integrity rules on the corresponding SA tables. Thus, we can create a metadata table that serves as a DQ rule descriptor and populate it automatically from the RDBMS catalogues, as sketched below. This is the part that speeds up the DQ implementation. More details about the metadata table are given in Section 5.
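For instance, database integrity rules could be seeded from the Oracle data dictionary along the following lines; the mapping of catalogue rows to metadata columns is only a sketch of the idea, not the generator's exact logic, and the table name is illustrative.

-- Candidate not null and datatype rules for an SA table, derived from the
-- corresponding DW table (here assumed to share the table name).
select c.table_name,
       c.column_name,
       c.data_type,
       case when c.nullable = 'N' then 'NOT NULL rule candidate' end as not_null_rule
  from user_tab_columns c
 where c.table_name = 'CUSTOMERS';

-- Candidate key, referential integrity and check constraint rules.
select k.table_name,
       k.constraint_name,
       k.constraint_type   -- 'P', 'U', 'R' or 'C'
  from user_constraints k
 where k.table_name = 'CUSTOMERS';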
4. GROUPING OF RULES
Now we have to implement the DQ rules in a smart way in order to make the DQ process as fast as possible, since it is going to be one of the phases of each DW load. Thus we introduce grouping of rules. DQ rules can be split into groups based on their implementation. Since rules that check, e.g., not null constraints and column data types are implemented in a very similar way, we group them together. By grouping we mean checking the rules in the same group at the same time. This gives us a performance gain, since we have one table scan for each group instead of one table scan for each rule. It is especially visible when we have very large tables with many rules, e.g. a customer table with millions of records and fifty columns, for each of which the data type must be checked before the load. Since rules that check column data types belong to the same group, we have one table scan of the customer table instead of fifty table scans. The groups on the SA layer are:

• Attribute Value Constraints: This group contains rules that check column datatypes, not null constraints, attribute-based check constraints, and simple business rules.
• Referential Integrity Constraints: This group contains rules that check foreign key constraints.
• Key Constraints: This group contains rules that check unique key constraints. Primary key constraints can be implemented as a combination of a unique key rule and not null rules for each column that is part of the primary key.
• Assertions: This group is used for rules that depend on the data in the entire table. It contains rules that search for outliers in the data, and rules that check record count reasonability.

Each rule group consists of one or more rule subtypes. Rule subtypes are used to create smaller sets of rules that have an identical implementation, and they are used by the generator to determine the type of statement that is to be generated.

When a certain rule is violated, it is enough to temporarily store the rowid of the record where the rule is violated and the rule text message, which contains the rule description, rule ID and severity code. This information is enough to populate the error fact table and to do possible updates of the processed SA table at the end. When rules are grouped together, we may find multiple rules violated on the same record. Thus, for each rowid we temporarily store the concatenation of the text messages of all the rules from a certain group that were violated on this record. Such information must be found with a single select statement, and that is the reason why we group rules together based on the implementation: it is easier to create such a select statement if the rules are implemented in a similar way. When we finish this process for each group of rules, we merge those violations together in order to get a list of rowids and, for each rowid, the concatenation of the text messages of all the rules on a certain SA table that were violated on that record. We now have aggregated information about the rule violations on a certain SA table. Since the text messages contain the rule IDs of the violated rules, we can easily populate our error fact table, which is a more detailed version of the same information.

The aggregated version of the rule violations is very useful for updates of the processed SA table. We would like to create a single update statement of the processed SA table for all the rules that require an update on this table. It is very easy to do so using a case expression for each column that needs to be updated: if we find in the text message the rule ID of a rule that updates a certain column, we set the column value to the new value, otherwise we leave it as it is.
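The grouping and merging described above needs two working tables for the rowids and the concatenated text messages. A minimal sketch of how they might be declared as Oracle global temporary tables follows; only the table and column names are taken from the generated code shown later in Section 5.3, while the column types and the ON COMMIT option are our assumptions.

-- Violations found while checking one rule group on the current SA table.
create global temporary table gt_group_violation (
  tab_rowid rowid,
  txt_msg   varchar2(4000)
) on commit preserve rows;

-- Accumulated violations of all rule groups on the current SA table.
create global temporary table gt_table_violation (
  tab_rowid rowid,
  txt_msg   varchar2(4000)
) on commit preserve rows;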
The DQ process that we have just described is done for each SA table. We can summarize it in this pseudo-code:

-- For each SA table, do this DQ processing:
Begin
  For each rule group repeat:
    Insert into gt_group_violation to temporarily store rowids
      for rule violations within one rule group;
    Merge into gt_table_violation using gt_group_violation
      to temporarily store rowids for all rule violations
      on the processed SA table;
    Truncate table gt_group_violation to have it empty
      before the next rule group is processed;
  End repeat;

  -- At this point, all rule violations on the processed SA table
  -- are recorded in the gt_table_violation table.

  Load error_table fact table using gt_table_violation;

  If any rule on the processed SA table does an update Then
    Update the processed SA table;
  End if;

  Truncate table gt_table_violation to have it empty
    before the next SA table is processed;
End;

More details on the grouping of rules and the implementation of each rule group can be found in [18].

5. GENERATING RULES
When we design the DQ process as described in this paper, we notice that the SA tables have very similar DQ modules. The rule groups that form these DQ modules, and the rule subtypes that form these rule groups, are implemented in the same way. The difference between the modules is in the table and column names used by the rules, and in the number of rules. Thus, the idea to generate such DQ modules arises naturally. The variable part of each DQ module, such as table and column names, is stored in metadata tables. The static part of each DQ module, such as the type of implementation of each rule group, is described in a so-called generator.

5.1 Metadata Tables
In order to store the variable part of each DQ module, we use five metadata tables: two for generation processing control and three for the DQ rules definitions. Fig. 1 shows the data model diagram.

Figure 1: Data model diagram for generation processing control and rules definition.

5.1.1 Processing Control
Generation processing control is done using two metadata tables. Table md_dq_init is used to store information about each rule type. This information defines the environment where the rules are to be created: it tells the generator where to create the scripts, which tables to use to create rules, in which database schema to find them, and similar. Depending on the type of implementation we choose for our DQ modules, we choose the columns we need in this table. They could be the following:

• tab_schema: Schema of the SA tables, i.e., the tables that SA rules should be created for.
• tab_name: Prefix or suffix of the SA table names.
• pkg_name: Prefix or suffix of the DQ module name. We assume that each DQ module name contains a table name and a certain prefix or suffix.
• pkg_directory_name: Directory name where the DQ module scripts will be generated.
• pkg_schema: Database schema where the DQ modules will be deployed.

Table md_dq_init_details is used for any information about the rule types that requires multiple records. Such information could be parameter and variable definitions that must be created in the DQ module, or a list of housekeeping columns. By housekeeping columns, we mean special columns in SA and DW tables that carry processing information, like a batch id, special flags, or any other column that rules are not to be created for.

5.1.2 Rules Definition
Three metadata tables give the rules definition. Table md_dq_rules is used to describe each rule that needs to be generated. Several of its columns are the following:

• rule_id: Rule ID.
• rule_group_code: Rule group, such as attribute value constraints, key constraints, etc.
• rule_subtype_code: Rule subtype, such as number data type, not null constraint, outlier, etc.
• table_name: Table to which the rule is applied.
• function_code: Function applied to the column name. Such functions should be implemented as auxiliary programs.
• column_name: Column in table_name to which the rule is applied.
• condition_code: Condition applied to the column name.
• severity_code: Severity code indicating the action to be taken if the rule is violated.
• txt_msg: Auxiliary column, the text message of the rule used in the DQ module as described earlier.

Table md_dq_rule_cond is used to describe additional conditions used by the rule, if more than one condition is needed. Table md_dq_rule_upd is used to describe the update statement of the rule, if the rule violation requires an update of any column.
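As an illustration of how these tables might be populated, the following rows would describe the warning rule used in the example of Section 5.3. The exact values, the use of a rule_order column, the split of the predicate between column_name and condition_code, and the columns of md_dq_rule_upd are our guess at one plausible encoding, not the authors' actual metadata content.

-- Rule 271: warn when CUST_INCOME_LEVEL is missing although a credit limit is set.
insert into md_dq_rules
  (rule_id, rule_group_code, rule_subtype_code, table_name,
   function_code, column_name, condition_code,
   severity_code, txt_msg, rule_order)
values
  (271, 'CHECK', 'SIMPLE_BUSINESS_RULE', 'SA_CUSTOMERS',
   null, 'CUST_INCOME_LEVEL',
   'IS NULL and CUST_CREDIT_LIMIT IS NOT NULL',
   'W', 'CUST_INCOME_LEVEL CHECK violated (CHECK_271)W; ', 10);

-- The update performed on violating records.
insert into md_dq_rule_upd (rule_id, column_name, new_value)
values (271, 'CUST_CREDIT_LIMIT', '1500');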
5.2 Generator
The generator is program code that describes the static part of the DQ modules. We use it to create a DQ module for each table; to do so, it uses the information in the metadata tables. The generator should also be able to insert database integrity rules into the metadata tables automatically if they do not already exist, it should maintain auxiliary columns in the metadata tables, and it should create all other objects that are necessary for the generation of DQ rules, such as database directories and run scripts used for the deployment of DQ modules, or any other object we might need depending on the way we implement the DQ modules and the generator itself.

The generator implemented in Oracle technology is a PL/SQL package whose main purpose is to generate SQL scripts in a certain OS directory. To do so, we use UTL_FILE, an Oracle built-in package. Each script contains a CREATE OR REPLACE PACKAGE statement for one DQ package. Thus, we generate DQ packages as SQL scripts and then deploy them in the database. Likewise, all auxiliary programs are implemented in an auxiliary PL/SQL package. Auxiliary programs are functions and procedures that we need for the DQ rules implementation, such as functions that check whether an argument is of a certain datatype, a procedure with an outlier rule algorithm implementation, and auxiliary procedures for the error table load.

When we need to implement a new rule, we should try to implement it as an already existing rule group, since this requires no changes to the generator, only a new rule in the metadata, which makes the process faster. If we need to create a new subtype or group of rules, we have to change the generator in order to implement these definitions. Such an implementation should take no longer than creating such a rule manually. So, if we need to implement more than one rule that belongs to a new group, using the generator is much faster, since we spend approximately the same amount of time as we would spend on the manual implementation of a single rule of that subtype.

5.3 Example in Oracle Technology
Suppose that we would like to mark a warning on all records in the SA_CUSTOMERS table where the customer income level is not set and the customer credit limit is set. We would also like to update such records in order to set the customer credit limit value to 1500. To simplify this example, suppose that this is the only rule that must be checked on the SA_CUSTOMERS table. The statement that should be generated in order to check this rule is similar to this:

create or replace package body PKG_DQ_SA_CUSTOMERS as
procedure main(P_BATCH_ID NUMBER) is
begin
  -- rule group CHECK
  insert into gt_group_violation
    (tab_rowid, txt_msg)
    (select tab.rowid as tab_rowid,
            case when CUST_INCOME_LEVEL IS NULL
                  and CUST_CREDIT_LIMIT IS NOT NULL
                 then 'CUST_INCOME_LEVEL CHECK '||
                      'violated (CHECK_271)W; '
                 else ''
            end
       from SA_USER.SA_CUSTOMERS tab
      where (CUST_INCOME_LEVEL IS NULL
             and CUST_CREDIT_LIMIT IS NOT NULL));

  merge into gt_table_violation t
  using gt_group_violation s
  on (t.tab_rowid = s.tab_rowid)
  when matched then update
    set t.txt_msg = t.txt_msg || s.txt_msg
  when not matched then insert ...;

  execute immediate 'truncate table gt_group_violation';

  -- next rule group, if any

  -- load error_table
  pk_aux.p_load_gt_rules;
  insert into error_table...;

  -- update SA_CUSTOMERS table
  merge into SA_USER.SA_CUSTOMERS t
  using gt_table_violation s
  on (t.rowid = s.tab_rowid)
  when matched then update
    set t.CUST_CREDIT_LIMIT =
          case when s.txt_msg like '%_271)%'
               then '1500'
               else t.CUST_CREDIT_LIMIT
          end;

  -- truncate table gt_table_violation
  execute immediate 'truncate table gt_table_violation';
end main;
end PKG_DQ_SA_CUSTOMERS;

Fig. 3 and Fig. 4 show how the metadata tables for rule definition are populated using the graphical user interface. The generator is a PL/SQL package that contains numerous procedures. One of them generates statements for the rule group named CHECK. Part of the generator implementation is shown below:
procedure p_generate_check(p_table_name varchar2) is
  cursor rul_cur is
    select *
      from md_dq_rules
     where table_name = p_table_name
       and rule_group_code = 'CHECK'
     order by rule_order;
  ...
begin
  ...
  utl_file.put_line(f_pkg, 'insert into gt_group_violation');
  utl_file.put_line(f_pkg, '(tab_rowid, txt_msg)');
  utl_file.put_line(f_pkg, '(select tab.rowid as tab_rowid, ');
  for rul in rul_cur loop
    utl_file.putf(f_pkg, ' case when %s %s %s ',
                  rul.function_code, rul.column_name, rul.condition_code);
    utl_file.put_line(f_pkg, ' from '||p_table_name);
    ...
  end loop;
  ...
end;

As we can see, we use cursors to get data from the metadata tables, and then we use the UTL_FILE package with the put_line and putf procedures to write the statements to files. Similar statements are used to create all other rule groups that we want to implement.

5.4 Graphical User Interface
In order to make the usage of the generator as user friendly as possible, we should create a graphical user interface (GUI) for the following tasks:

• Query and modify data in the metadata tables.
• Run the generator to insert rules automatically and to create DQ modules.
• Run the generator to create DQ modules.
• Deploy DQ modules in the database.

This GUI is not used to execute the DQ modules, because executing them should be one of the steps of the DW load. Thus, we use the generator to create, modify and implement the rules and modules, while the load management of our ETL process ensures that the DQ modules are executed when needed.

Oracle Designer is an Oracle tool used for business process modeling, systems analysis, software design and system generation. It is integrated with Oracle Forms Developer, an Oracle database application development tool. We used these tools to create a GUI on top of the metadata tables to query and modify the metadata, and to execute the DQ generation and deployment. Examples of the GUI implemented in Oracle technology can be found in Fig. 2 for the processing control metadata, and in Fig. 3 and Fig. 4 for the rule definition metadata.

5.5 DW Rules
Rules that are applied on the DW layer are called DW rules. We use the same metadata tables to define the variable part of the rules, but the static part can be defined either in the same generator as the SA rules or in a separate generator, since we apply different types of rules on this layer. One of the most useful rules for the DW layer is a hierarchy rule. It checks whether the elements of a hierarchy level roll up to a single parent in the upper level. Since DW systems usually contain more than one hierarchy, and hierarchies usually contain more than one level, we have to create one rule for each level of each hierarchy, except for the lowest level. The number of hierarchy rules can be very large, so generating such rules results in significant time savings. Oracle RDBMS provides a logical database object called dimension; if we define our dimensions as such logical objects, we can automatically populate the metadata tables with hierarchy rules as well. We can also apply certain business rules on this layer. In this case, we can use the same implementation we used for business rules on the SA layer and include it in the DW generator.
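As a sketch of what a generated hierarchy rule might check, consider a hypothetical product dimension with a subcategory-to-category level: the generated statement would flag every lower-level member that rolls up to more than one parent. The table and column names are illustrative only.

-- Subcategories that violate the roll-up property (more than one parent category).
select prod_subcategory,
       count(distinct prod_category) as parent_count
  from dw_user.products
 group by prod_subcategory
having count(distinct prod_category) > 1;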
Figure 2: Example of GUI for DQ processing control.
Figure 3: Example of GUI for DQ rules definition.
Figure 4: Updates tab of DQ rules definition.
5.6 Results
Our implementation in Oracle technology has shown that the execution of a single rule group is significantly faster than the sequential execution of its rules. Also, the initial implementation of the generator is somewhat slower than a manual implementation of a DQ module for a single SA table, but significantly faster than the manual implementation of DQ modules for every SA table. We built a similar DQ rules generator for a leading banking group in the Central and Eastern Europe region. The generator is used there to create over 200 DQ packages with over 2500 rules. The time needed to generate these 200 SQL scripts is less than a minute, and the deployment of the DQ packages in the database takes less than five minutes. Using the generator makes rule modifications, as well as the creation of a new DQ package for a new SA table, very easy and fast: it requires only changes in the metadata tables. Existing rules can be excluded from a certain package and included again by changing a flag in the metadata table. Similarly, changing the rule severity is done merely by changing it in the metadata. Even changes of the environment where the DQ packages are executed, such as parameter changes, can be done by implementing the change in a single place, the generator package, instead of in each of the 200 DQ packages.

6. CONCLUSIONS
The first generator implementation in Oracle technology was done for a leading banking group in the Central and Eastern Europe region, as mentioned previously. This implementation was proven to work well, but it was built for their specific needs and environment. Thus, we decided to create a more general implementation in the form of a DQ application that contains a GUI, as described in this paper. If the generator is implemented in a very general way, it can be reused in different DW environments. In this case, there is almost no initial implementation effort: we simply have to set up the environment, install the generator and initialize the data in the processing control metadata table using the GUI. After we use the generator for the first time, our DQ system automatically contains a database integrity rules implementation that can be executed instantly. We can then use the GUI to add business rules to our DQ system, and to generate and deploy a new version of the DQ packages. Even if we have to create our own generator, the entire process is still faster than creating DQ modules for each SA or DW table manually. Such an approach speeds up the initial build of DQ systems, as well as the implementation of any change of the rules or the environment.

This paper should inspire further work: investigating other rules that could be implemented, different outlier algorithms, grouping of outlier rules, creating more DW rules, using hints to optimize the rules even more, and making the generator even more general. We could also try to include match and merge algorithms in our DQ modules in order to perform as many DQ tasks as possible automatically.

7. ACKNOWLEDGMENTS
I would like to thank my colleagues, Mr. František Melko from Oracle Slovakia and Mr. Amel Salihbašić from Oracle Austria, for their help with the optimization of rules by creating groups, as well as Mr. Rudolf Jovanović for his help with the GUI.

8. REFERENCES
[1] T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley and Sons, Chichester, 2003.
[2] H. Galhardas, D. Florescu, D. Shasha, E. Simon, J. P. Matsumoto and C. A. Saita. AJAX: an Extensible Data Cleaning Tool. http://web.tagus.ist.utl.pt/helena.galhardas/ajaxdesc.html
[3] P. Gomes, J. Farinha and M. J. Trigueiros. A Data Quality Metamodel Extension to CWM. In 4th Asia Pacific Conference on Conceptual Modelling Proceedings, pages 17–26. APCCM, February 2007.
[4] T. N. Herzog, F. Scheuren and W. E. Winkler. Data Quality and Record Linkage Techniques. Springer, Heidelberg, 2007.
[5] Y. Hong, T. Yang, J. Kang and D. Lee. Record Linkage as DNA Sequence Alignment Problem. In Proceedings of the 6th International Workshop on Quality in Databases and Management of Uncertain Data, pages 13–22. QDB/MUD, August 2008.
[6] S. K. Kachigan. Multivariate Statistical Analysis: A Conceptual Introduction. Radius Press, New York, 1991.
[7] H. Kim and D. Lee. Parallel Linkage. In 16th ACM Int'l Conf. on Information and Knowledge Management, pages 283–292. CIKM, November 2007.
[8] R. Kimball and J. Caserta. The Data Warehouse ETL Toolkit. Wiley Publishing Inc., Indianapolis, 2004.
[9] F. Korn, S. Muthukrishnan and Y. Zhu. Checks and Balances: Monitoring Data Quality Problems in Network Traffic Database. In 29th International Conference on Very Large Databases Proceedings, pages 536–547. VLDB, September 2003.
[10] D. Luebbers, U. Grimmer and M. Jarke. Systematic Development of Data Mining-Based Data Quality Tools. In 29th International Conference on Very Large Databases Proceedings, pages 548–559. VLDB, September 2003.
[11] A. Maydanchik. Data Quality Assessment. Technics Publications, Denville, 2007.
[12] T. Morgan. Business Rules and Information Systems: Aligning IT with Business Goals. Addison-Wesley, Reading, Massachusetts, 2002.
[13] J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, San Francisco, 2003.
[14] B. W. On, N. Koudas, D. Lee and D. Srivastava. Group Linkage. In Proceedings of the 23rd International Conference on Data Engineering, pages 496–505. ICDE, April 2007.
[15] C. Ordonez and J. García-García. Referential integrity quality metrics. In Decision Support Systems, Volume 44, Issue 2, pages 495–508. Elsevier Science Publishers B. V., Amsterdam, 2008.
[16] M. G. Piattini, C. Calero and M. Genero. Information and Database Quality. Kluwer Academic Publishers, Boston / Dordrecht / London, 2002.
[17] V. Raman and J. M. Hellerstein. Potter's Wheel: An Interactive Data Cleaning System. In 27th International Conference on Very Large Databases Proceedings, pages 381–390. VLDB, September 2001.
[18] J. Rodić. Data Quality Rules Implementation Using Oracle Technology. In 32nd International Convention Mipro Proceedings, pages 216–221. Mipro, May 2009.
[19] R. G. Ross. Principles of the Business Rule Approach. Addison-Wesley, Reading, Massachusetts, 2003.
[20] D. Taniar. Progressive Methods in Data Warehousing and Business Intelligence. Idea Group Inc., Calgary, 2009.
[21] J. D. Ullman and J. Widom. A First Course in Database Systems. Prentice Hall, Upper Saddle River, New Jersey, 1997.