International Journal of Dynamics of Fluids, ISSN 0973-1784, Volume 5, Number 2 (2009), pp. 145–164. © Research India Publications. http://www.ripublication.com/ijdf.htm

Detection and Elimination of Duplicate Data Using Token-Based Method for a Data Warehouse: A Clustering Based Approach

1J. Jebamalar Tamilselvi and 2V. Saravanan

1PhD Research Scholar, Department of Computer Application, Karunya University, Coimbatore – 641 114, Tamilnadu, INDIA. E-mail: [email protected]
2Professor & HOD, Department of Computer Application, Karunya University, Coimbatore – 641 114, Tamilnadu, INDIA. E-mail: [email protected]

Abstract
The process of detecting and removing database defects and duplicates is referred to as data cleaning. The fundamental issue in duplicate detection is that inexact duplicates in a database may refer to the same real-world object, owing to errors and missing data. Duplicate elimination is hard because duplicates arise from different types of errors, such as typographical errors, missing values, abbreviations, and different representations of the same logical value. In existing approaches, duplicate detection and elimination is domain dependent; these domain-dependent methods rely on similarity functions and thresholds for duplicate elimination and produce a high number of false positives. This research work presents a general sequential framework for duplicate detection and elimination. The proposed framework uses six steps to improve the process of duplicate detection and elimination. First, an attribute selection algorithm is used to identify the best and most suitable attributes for duplicate identification and elimination. In the next step, tokens are formed from the selected attribute field values. After token formation, a clustering algorithm or blocking method is used to group the records based on their similarity values. The best blocking key is selected by comparing the duplicate detection performance of candidate keys. In the next step, the threshold value is calculated based on the similarities between records and fields. Then a rule-based approach is used to detect duplicates and to eliminate poor-quality duplicates, retaining only one copy of the best duplicate record. Finally, all the cleaned records are grouped or merged and made available for the next process. This research work is efficient at reducing the number of false positives without missing true duplicates. Compared with previous approaches, the token concept is included to speed up the data cleaning process and reduce its complexity. Several blocking keys are analyzed through extensive experiments to select the best key for bringing similar records together and thus avoid comparing all pairs of records. A rule-based approach is used to identify exact and inexact duplicates and to eliminate them.

Introduction
In the 1990s, as organizations of scale began to need more timely data for their business, they found that traditional information systems technology was simply too cumbersome to provide relevant data efficiently and quickly. Completing reporting requests could take days or weeks using antiquated reporting tools that were designed more to 'execute' the business than to 'run' the business. A data warehouse is basically a database, and unintentional duplication of records, created from the millions of records drawn from other sources, can hardly be avoided. In the data warehousing community, the task of finding duplicated records within a data warehouse has long been a persistent problem and has become an area of active research. There have been many research undertakings to address the problems caused by duplicate contamination of data.

There are two issues to be considered in duplicate detection: accuracy and speed. The measure of accuracy in duplicate detection depends on the number of false negatives (duplicates that were not classified as such) and false positives (non-duplicates that were classified as duplicates). The algorithm's speed is mainly affected by the number of records compared and by how costly these comparisons are. Generally, CPUs are not able to perform duplicate detection on large databases within any reasonable time, so the number of record comparisons normally needs to be cut down [4].

In this research work, a framework is developed to handle duplicate data in a data warehouse. The main objective is to improve data quality and increase the speed of the data cleaning process. A high-quality, scalable blocking algorithm, a similarity computation algorithm, and a duplicate elimination algorithm are used and evaluated on real datasets from an operational data warehouse to achieve this objective.
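The accuracy measure just described can be made concrete with a small, hypothetical example (not taken from this paper): given the set of record pairs known to be duplicates and the set of pairs a detector flagged, the false positives, false negatives, precision, and recall follow directly.

# Hypothetical illustration of the accuracy measure for duplicate detection.
# "true_pairs" are record pairs known to be duplicates (gold standard);
# "predicted_pairs" are the pairs a detector flagged as duplicates.
def accuracy_report(true_pairs, predicted_pairs):
    true_pairs, predicted_pairs = set(true_pairs), set(predicted_pairs)
    true_positives = predicted_pairs & true_pairs
    false_positives = predicted_pairs - true_pairs   # non-duplicates flagged as duplicates
    false_negatives = true_pairs - predicted_pairs   # duplicates that were missed
    precision = len(true_positives) / len(predicted_pairs) if predicted_pairs else 1.0
    recall = len(true_positives) / len(true_pairs) if true_pairs else 1.0
    return {"false positives": len(false_positives),
            "false negatives": len(false_negatives),
            "precision": precision,
            "recall": recall}

# Records 1/2 and 3/4 are true duplicates; the detector found (1, 2) and (4, 5):
print(accuracy_report({(1, 2), (3, 4)}, {(1, 2), (4, 5)}))
# -> one false positive, one false negative, precision 0.5, recall 0.5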

Framework Design
A sequential framework is developed for the detection and elimination of duplicate data. This framework comprises some existing data cleaning approaches and new approaches, which are used to reduce the complexity of duplicate data detection and elimination and to clean with more flexibility and less effort.


Fig. 1 shows the framework for cleaning duplicate data in a sequential order. Each step of the framework is suited to a different purpose. The framework works according to the data by using a software agent in each step, with little user interaction. The principle of this framework is as follows:

A. Selection of attributes: There is a clear need to identify and select attributes. These selected attributes are used in the other steps.
B. Formation of tokens: A well-suited token is created to check the similarities between records as well as fields.
C. Clustering/blocking of records: A clustering algorithm or blocking method is used to group the records based on the similarity of the block-token key value.
D. Similarity computation for selected attributes: The Jaccard similarity method is used for token-based similarity computation.
E. Detection and elimination of duplicate records: A rule-based detection and elimination approach is used for detecting and eliminating the duplicates.
F. Merge: The resulting cleaned data is merged.

A minimal sketch of this sequential flow is given below, after the attribute selection overview.

A. Selection of attributes
The data cleaning process is complex owing to the large amount of data in the data warehouse. Attribute selection is very important for reducing the time and effort of the subsequent steps, such as record similarity computation and the elimination process, and is especially important when comparing two records [5]. This step is the foundation for all the remaining steps; therefore, time and effort are two important requirements for promptly and qualitatively selecting the attributes to be considered.
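To make the sequential flow of steps A–F concrete, the sketch below strings the stages together on a toy example. It is an illustration only: the tokenization, blocking key, elimination rule, and threshold are assumptions made for the sketch, not the implementation evaluated in this paper.

# Illustrative, self-contained sketch of the cleaning flow (steps A-F).
# Names, rules, and thresholds are assumptions, not the authors' code.

def form_tokens(value):
    # Step B: form tokens from a field value (here: lower-cased whitespace split).
    return set(str(value).lower().split())

def jaccard(a, b):
    # Step D: token-based Jaccard similarity of two token sets.
    return len(a & b) / len(a | b) if a | b else 1.0

def clean(records, block_attr, compare_attrs, threshold=0.6):
    # Step A is approximated by passing the selected attributes in directly.
    # Step C: block the records on the token string of the chosen blocking attribute.
    blocks = {}
    for rec in records:
        key = " ".join(sorted(form_tokens(rec[block_attr])))
        blocks.setdefault(key, []).append(rec)

    cleaned = []
    for block in blocks.values():
        kept = []
        for rec in block:
            # Step E: a simple elimination rule -- a record is treated as a duplicate
            # if its average field similarity to an already-kept record passes the threshold.
            dup = any(
                sum(jaccard(form_tokens(rec[a]), form_tokens(k[a]))
                    for a in compare_attrs) / len(compare_attrs) >= threshold
                for k in kept
            )
            if not dup:
                kept.append(rec)
        cleaned.extend(kept)
    return cleaned  # Step F: the merged, cleaned result

rows = [
    {"name": "John Smith", "city": "Coimbatore"},
    {"name": "Jon Smith",  "city": "Coimbatore"},   # inexact duplicate of the first row
    {"name": "Mary Jones", "city": "Chennai"},
]
print(clean(rows, block_attr="city", compare_attrs=["name", "city"]))
# Only one of the two Smith records survives; Mary Jones is untouched.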

[Figure 1 is a diagram: New Data → Selection of Attributes (A) → Forming Tokens (B) → Clustering/Blocking Algorithm (C) → Similarity computation for selected attributes using selected functions (D) → Elimination, drawing on a bank of elimination functions (E) → Merge (F) → Cleaned Data, with LOG tables and similarity LOG tables maintained in the data warehouse at the intermediate stages.]

Figure 1: Framework for Data Cleaning.
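The choice of blocking key in step C directly controls how many record pairs must be compared. The hypothetical sketch below (not from the paper) counts the candidate pairs produced by two alternative keys; comparisons of this kind, together with checks on which true duplicates end up in the same block, are the basis for selecting the best key.

# Hypothetical illustration: comparing blocking keys by the number of candidate
# record pairs they produce. Fewer pairs means less comparison work, provided
# true duplicates still land in the same block.
from collections import Counter

records = [
    {"name": "John Smith", "city": "Coimbatore"},
    {"name": "Jon Smith",  "city": "Coimbatore"},
    {"name": "Mary Jones", "city": "Coimbatore"},
    {"name": "Mary Jones", "city": "Chennai"},
]

def candidate_pairs(records, key_fn):
    # Number of within-block pairs a blocking key would generate.
    block_sizes = Counter(key_fn(r) for r in records).values()
    return sum(n * (n - 1) // 2 for n in block_sizes)

by_city = lambda r: r["city"].lower()
by_name_prefix = lambda r: r["name"].lower()[:3]   # first three letters of the name

print("city key:", candidate_pairs(records, by_city))                # 3 candidate pairs
print("name-prefix key:", candidate_pairs(records, by_name_prefix))  # 1 candidate pair
# The name-prefix key is cheaper, but it also splits "John Smith" and "Jon Smith"
# into different blocks, so that duplicate pair would never be compared; selecting
# a key means weighing comparison cost against missed duplicates.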


A missing value is expressed and treated as a string of blanks; obviously, missing character values are not the smallest strings. A distinct count is used to retrieve the number of rows that have unique values for each attribute. Uniqueness is the characteristic of an attribute that makes up its relevance for duplicate detection. Uniqueness is reflected in the distribution of the similarity of attribute pairs of non-duplicates: if an attribute is highly unique, the mean similarity on non-duplicates for this attribute will be low. The measurement type of each attribute is also considered for attribute selection. Data cleaning with numeric data is not effective; categorical data is more useful for the data cleaning process. Three criteria are therefore used to identify relevant attributes for the further data cleaning process: (a) identifying key attributes, (b) classifying attributes with high distinct counts and low missing-value counts, and (c) the measurement types of the attributes. Based on this information, a 'weight' or 'rank value' is calculated for all the attributes. Finally, the highest-priority attributes are selected for the further data cleaning process [2].

The flow diagram (Fig. 3) shows the attribute selection procedure in a sequential way. First, the data set for the data cleaning process is identified. From this data set, the attributes are analyzed by identifying the type of each attribute, the relationships between attributes, and the properties of each attribute in order to select the appropriate attributes. The type of an attribute is classified using its data type and its size or length. A threshold value is used to identify the best attributes for the data cleaning process. The threshold is assessed using three different criteria: a high threshold value, data quality, and a high rank. A high threshold value is calculated to identify high-power attributes for the data cleaning process. Attributes are ranked based on the threshold value and the data quality value. Finally, high-rank attributes are selected for the next cleaning step to improve the speed of the data cleaning process.

The developed attribute selection algorithm is presented in Fig. 2. It can eliminate both irrelevant and redundant attributes, is applicable to any type of data (nominal, numeric, etc.), and handles the different attribute types smoothly. The quality of the algorithm is confirmed by applying a set of rules. The main purpose of attribute selection for data cleaning is to reduce the time taken by, and improve the speed of, the subsequent data cleaning steps such as token formation, record similarity computation, and the elimination process.
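A minimal sketch of this kind of attribute scoring is given below. The specific weights and the scoring formula are assumptions made for the illustration; the paper's own weight/rank computation is the one given in its attribute selection algorithm (Fig. 2).

# Illustrative attribute scoring for selection (assumed weights, not the paper's
# exact formula). Attributes with many distinct values, few missing values, and
# a categorical (string) type score higher and are ranked first.

def attribute_scores(rows, attributes):
    n = len(rows)
    scores = {}
    for attr in attributes:
        values = [r.get(attr) for r in rows]
        present = [v for v in values if v is not None and str(v).strip() != ""]
        distinct_ratio = len(set(present)) / n if n else 0.0       # high distinct value
        completeness = len(present) / n if n else 0.0              # low missing value
        categorical = all(isinstance(v, str) for v in present)     # measurement type
        # Assumed weighting: favour distinctness and completeness, prefer categorical data.
        scores[attr] = 0.5 * distinct_ratio + 0.3 * completeness + (0.2 if categorical else 0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

rows = [
    {"name": "John Smith", "city": "Coimbatore", "age": 42},
    {"name": "Jon Smith",  "city": "Coimbatore", "age": 42},
    {"name": "Mary Jones", "city": "",           "age": 35},
]
print(attribute_scores(rows, ["name", "city", "age"]))
# 'name' ranks highest: it is fully distinct, complete, and categorical.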

Input      : N attributes, number of tuples n, relation instance r
Output     : S – subset of the attributes
Initialize : L – temporary relation instance, X – attribute set
Var        : L – temporary relation instance, X – attribute set, F – field set, i, j
begin
  I.  Analyze attribute set X to select the best attributes
  II. Calculate threshold value σ for each attribute
      for each attribute xi, i
