Proceedings of the
5th International Workshop on Quality in Databases (QDB 2007)
at the VLDB 2007 conference, Vienna, Austria

Editors:
Venkatesh Ganti (Microsoft Research)
Felix Naumann (Hasso Plattner Institute, Potsdam)
5th International Workshop on Quality in Databases (QDB) 2007 at the VLDB 2007 conference in Vienna, Austria

Data and information quality have become an increasingly important and interesting topic for the database community. Solutions to measure and improve the quality of data stored in databases are relevant for many areas, including data warehouses, scientific databases, and customer relationship management. The QDB 2007 workshop focuses on practical methods for data quality assessment and data quality improvement. QDB 2007 continues and combines the three successful IQIS workshops held at SIGMOD 2004-2006 and the CleanDB workshop held at VLDB 2006. We are happy to host invited talks by Renée Miller (University of Toronto) and AnHai Doan (University of Wisconsin), and a great program of six research papers with authors from seven different countries. Our thanks are due to Microsoft, both for hosting the conference management tool and in particular for its generous sponsorship of the participation fee.

The editors
Venkatesh Ganti (Microsoft Research)
Felix Naumann (Hasso Plattner Institute, Potsdam)
Program, Sunday, September 23

9:15 - 9:30    Introduction
               Felix Naumann and Venkatesh Ganti

9:30 - 10:30   Invited talk: Management of Inconsistent and Uncertain Data
               Renée J. Miller (University of Toronto)

10:30 - 11:00  Coffee break

11:00 - 13:00  Research Session 1
               - Accuracy of Approximate String Joins Using Grams
                 Oktie Hassanzadeh, Mohammad Sadoghi, and Renée J. Miller
               - QoS: Quality Driven Data Abstraction Generation For Large Databases
                 Charudatta V. Wad, Elke A. Rundensteiner, and Matthew O. Ward
               - Quality-Driven Mediation for Geographic Data
                 Yassine Lassoued, Mehdi Essid, Omar Boucelma, and Mohamed Quafafou
               - On the Performance of One-to-Many Data Transformations
                 Paulo Carreira, Helena Galhardas, Joao Pereira, Fernando Martins, and Mario J. Silva

13:00 - 14:00  Lunch

14:00 - 15:00  Invited talk: Data Quality Challenges in Community Systems
               AnHai Doan (University of Wisconsin)

15:00 - 16:00  Research Session 2
               - Towards a Benchmark for ETL Workflows
                 Panos Vassiliadis, Anastasios Karagiannis, Vasiliki Tziovara, and Alkis Simitsis
               - Information Quality Measurement in Data Integration Schemas
                 Maria da Conceição Moraes Batista and Ana Carolina Salgado

16:00 - 16:30  Coffee break & end of workshop
Keynote presentation
Management of Inconsistent and Uncertain Data
Renée Miller (University of Toronto)

Although integrity constraints have long been used to maintain data consistency, there are situations in which they may not be enforced or satisfied. In this talk, I will describe ConQuer, a system for efficient and scalable answering of SQL queries on databases containing inconsistent or uncertain data. ConQuer permits users to postulate a set of constraints together with their queries. The system rewrites the queries to retrieve data that are consistent with respect to the constraints. When data is uncertain, ConQuer returns each query answer with a likelihood that the answer is consistent. Hence, ConQuer allows a user to understand what query answers are known to be true, even when a database contains uncertainty. Our rewriting is into SQL, and I will show that the rewritten queries can be efficiently optimized and executed by a commercial database system. I will conclude with some open problems.
Bio
Renée J. Miller is a professor of computer science and the Bell University Lab Chair of Information Systems at the University of Toronto. She received the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers. She received an NSF CAREER Award, the Premier's Research Excellence Award, and an IBM Faculty Award. Her research interests are in the efficient, effective use of large volumes of complex, heterogeneous data. This interest spans data integration and exchange, inconsistent and uncertain data management, and knowledge curation. She serves on the Board of Trustees of the VLDB Endowment, was a member of and chaired the ACM Kanellakis Awards committee, and served as PC co-chair of VLDB in 2004. She received her PhD in Computer Science from the University of Wisconsin, Madison and bachelor's degrees in Mathematics and Cognitive Science from MIT.
Keynote presentation
Data Quality Challenges in Community Systems
AnHai Doan (University of Wisconsin)

Over the past three years, in Cimple, a joint effort between Wisconsin and Yahoo! Research, we have been trying to build community systems. Such systems employ automatic data management techniques, such as information extraction and integration, as well as user-centric Web 2.0-style technologies, to build structured data portals for online communities. As the work progresses, we have encountered a broad range of fascinating data cleaning challenges. Some of these (e.g., data quality evaluation, record reconciliation) also arise in traditional ETL processes. But here they become exacerbated, take on new nuances, or are amenable to novel solutions that exploit community characteristics. Many other challenges, however, are new, and arise due to the fact that community systems engage a multitude of users of varying skills and knowledge. Examples include how to entice users to collaboratively clean data, how to handle "noisy" users, and how to make certain cleaning tasks easy for "the masses". We describe the challenges and our initial solutions. We also describe the infrastructure support (code, data, etc.) that we can provide, in the hope that other researchers will join and help us address these problems.
Bio
AnHai Doan works in the database group at the University of Wisconsin-Madison. His interests cover databases, AI, and the Web. His current research focuses on Web community management, data integration, mass collaboration, text management, information extraction, and schema matching. Selected recent honors include the ACM Doctoral Dissertation Award (2003), a CAREER Award (2004), an Alfred P. Sloan Research Fellowship (2007), and an IBM Faculty Award (2007). Selected recent professional activities include co-chairing WebDB at SIGMOD 2005 and the AI Nectar track at AAAI 2006.
Accuracy of Approximate String Joins Using Grams

Oktie Hassanzadeh, Mohammad Sadoghi, and Renée J. Miller
University of Toronto, 10 King's College Rd., Toronto, ON M5S 3G4, Canada
{oktie, mo, miller}@cs.toronto.edu
ABSTRACT
Approximate join is an important part of many data cleaning and integration methodologies. Various similarity measures have been proposed for accurate and efficient matching of string attributes. The accuracy of the similarity measures highly depends on the characteristics of the data, such as the amount and type of the errors and the length of the strings. Recently, there has been an increasing interest in using methods based on q-grams (substrings of length q) made out of the strings, mainly due to their high efficiency. In this work, we evaluate the accuracy of the similarity measures used in these methodologies. We present an overview of several similarity measures based on q-grams. We then thoroughly compare their accuracy on several datasets with different characteristics. Since the efficiency of approximate joins depends on the similarity threshold they use, we study how the value of the threshold (including values used in recent performance studies) affects the accuracy of the join. We also compare different measures based on the highest accuracy they can achieve on different datasets.
1. INTRODUCTION
Data quality is a major concern in operational databases and data warehouses. Errors may be present in the data due to a multitude of reasons, including data entry errors, lack of common standards and missing integrity constraints. String data is by nature more prone to such errors. Approximate join is an important part of many data cleaning methodologies and is well-studied: given two large relations, identify all pairs of records that approximately match. A variety of similarity measures have been proposed for string data in order to match records. Each measure has certain characteristics that make it suitable for capturing certain types of errors. Using a string similarity function sim() for the approximate join algorithm, all pairs of records that have a similarity score above a threshold θ are considered to approximately match and are returned as the output. Performing approximate join on a large relation is a notoriously time-consuming task. Recently, there has been an increasing interest in using approximate join techniques based on q-grams (substrings of length q) made out of the strings. Most of the efficient approximate join algorithms (which we describe in Section 2) are based on a specific similarity measure, along with a fixed threshold value, and return pairs of records whose similarity is greater than the threshold. The effectiveness of the majority of these algorithms depends on the value of the threshold used. However, there has been little work studying the accuracy of the join operation. The accuracy is known to be dataset-dependent, and there is no common framework for the evaluation and comparison of the accuracy of different similarity measures and techniques, which makes comparing their accuracy a difficult task. Nevertheless, we argue that it is possible to evaluate the relative performance of different measures for approximate joins by using datasets containing different types of known quality problems, such as typing errors and differences in notation and abbreviations. In this paper, we present an overview of several similarity measures for approximate string joins using q-grams and thoroughly evaluate their accuracy for different values of the threshold and on datasets with different amounts and types of errors. Our results include:
• We show that for all similarity measures, the value of the threshold that results in the most accurate join highly depends on the type and amount of errors in the data.

• We compare different similarity measures by comparing the maximum accuracy they can achieve on different datasets using different thresholds. Although choosing a proper threshold for the similarity measures without prior knowledge of the data characteristics is known to be a difficult task, our results show which measures can potentially be more accurate, assuming that there is a way to determine the best threshold. Therefore, an interesting direction for future work is to find an algorithm for determining the value of the threshold for the most accurate measures.

• We show how the amount and type of errors affect the best value of the threshold. An interesting consequence is that many previously proposed algorithms for enhancing the performance of the join operation and making it scalable for large datasets are not effective enough in many scenarios, since the performance of these algorithms highly depends on choosing a high value for the threshold, which could result in a very
low accuracy. Our evaluation shows the effectiveness of those join algorithms that are less sensitive to the value of the threshold, and opens another interesting direction for future work: finding algorithms that are both efficient and accurate using the same threshold.

The paper is organized as follows. In Section 2, we overview related work on approximate joins. In Section 3, we describe in detail the approximate join algorithms we evaluate and compare, along with the similarity measures they use. Section 4 presents a thorough evaluation of these algorithms and measures, and finally, Section 5 concludes the paper and explains future directions.
2. RELATED WORK
Approximate join, also known as similarity join or record linkage, has been extensively studied in the literature. Several similarity measures for string data have been proposed [14, 4, 5]. A recent survey [9] presents an excellent overview of different types of string similarity measures. Recently, there has been an increasing interest in using measures from the Information Retrieval (IR) field along with q-grams made out of strings [10, 6, 2, 18, 5]. In this approach, strings are treated as documents and q-grams are treated as tokens in the documents. This makes it possible to take advantage of several indexing techniques as well as various algorithms that have been proposed for efficient set-similarity joins. Furthermore, these measures can be implemented declaratively over a DBMS with vanilla SQL statements [5].

Recent work addresses the problem of efficiency and scalability of similarity join operations for large datasets [6, 2, 18]. Many techniques have been proposed for set-similarity join, which can be used along with q-grams for the purpose of (string) similarity joins. Most of these techniques are based on the idea of creating signatures for sets (strings) to reduce the search space. Some signature generation schemes are derived from dimensionality reduction for the similarity search problem in high-dimensional space. One efficient approach uses the idea of Locality Sensitive Hashing (LSH) [13] to hash similar sets into the same value with high probability, and is therefore an approximate solution to the problem. Arasu et al. [2] propose algorithms specifically for set-similarity joins that are exact and outperform previous approximation methods in their framework, although the parameters of the algorithms require extensive tuning. Another class of work is based on using indexing algorithms, primarily derived from IR optimization techniques. A recent proposal in this area [3] presents algorithms based on novel indexing and optimization strategies that do not rely on approximation or extensive parameter tuning and outperform previous state-of-the-art approaches. More recently, Li et al. [15] propose VGRAM, a technique based on the idea of using variable-length grams instead of q-grams. At a high level, it can be viewed as an efficient index structure over the collection of strings. VGRAM can be used along with previously proposed signature-based algorithms to significantly improve their efficiency.

Most of the techniques described above mainly address the scalability of the join operation and not its accuracy. The choice of the similarity measure is often limited in these algorithms. The signature-based algorithm of [6] also considers accuracy by introducing a novel similarity measure called fuzzy match similarity and creating signatures for this measure. However, the accuracy of this measure is not compared with other measures. In [5], several such similarity measures are benchmarked for approximate selection, which is a special case of similarity join. Given a relation R, the approximate selection operation, using similarity predicate sim(), reports all tuples t ∈ R such that sim(tq, t) ≥ θ, where θ is a specified numerical similarity threshold and tq is a query string. While several predicates are introduced and benchmarked in [5], the extension of approximate selection to approximate joins is not considered. Furthermore, the effect of threshold values on the accuracy of approximate joins is also not considered.

3. FRAMEWORK
In this section, we explain our framework for similarity join. The similarity join of two relations R = {ri : 1 ≤ i ≤ N1} and S = {sj : 1 ≤ j ≤ N2} outputs a set of pairs (ri, sj) ∈ R × S where ri and sj are similar records. Two records are considered similar when their similarity score based on a similarity function sim() is above a threshold θ. For the definitions and experiments in this paper, we assume we are performing a self-join on relation R. Therefore, the output is a set of pairs (ri, rj) ∈ R × R where sim(ri, rj) ≥ θ for some similarity function sim() and a threshold θ. This is a common operation in many applications such as entity resolution and clustering.

In keeping with many approximate join methods, we model records as strings. We denote by r the set of q-grams (sequences of q consecutive characters of a string) in r. For example, for t = 'db lab', t = {'db ', 'b l', ' la', 'lab'} for tokenization using 3-grams. In certain cases, a weight may be associated with each token that reflects the commonality of the token in the relation. The similarity measures discussed here are those based on q-grams created out of strings, together with similarity measures that have been shown to be effective in previous work [5]. A small illustrative sketch of this tokenization and of a naive threshold-based self-join follows the list below. These measures share one or both of the following properties:

• High scalability: There are various techniques proposed in the literature, as described in Section 2, for enhancing the performance of the similarity join operation using q-grams along with these measures.

• High accuracy: Previous work has shown that in most scenarios these measures perform better than or as well as other string similarity measures in terms of accuracy. Specifically, these measures have shown good accuracy in name-matching tasks [8] and in approximate selection [5].
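The following minimal Python sketch illustrates the q-gram tokenization and the threshold-based self-join described above. It is not the efficient algorithms of Section 2, only a naive quadratic loop for illustration; the function names and the example relation are ours, not the paper's.

```python
def qgrams(s, q=3):
    """Return the set of q-grams (length-q substrings) of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(r1, r2, q=3):
    """Unweighted Jaccard similarity over q-gram sets."""
    g1, g2 = qgrams(r1, q), qgrams(r2, q)
    if not g1 and not g2:
        return 1.0
    return len(g1 & g2) / len(g1 | g2)

def similarity_self_join(records, sim, theta):
    """All pairs (i, j), i < j, with sim(records[i], records[j]) >= theta."""
    return [(i, j)
            for i in range(len(records))
            for j in range(i + 1, len(records))
            if sim(records[i], records[j]) >= theta]

R = ['db lab', 'the db lab', 'database laboratory']
print(qgrams('db lab'))                       # {'db ', 'b l', ' la', 'lab'}
print(similarity_self_join(R, jaccard, 0.4))  # [(0, 1)]
```

Any of the measures defined in Section 3 can be passed in place of jaccard as the sim() argument.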
3.1 Edit Similarity
Edit distance is widely used as the measure of choice in many similarity join techniques. Specifically, previous work [10] has shown how to use q-grams for an efficient implementation of this measure in a declarative framework. Recent work on enhancing the performance of similarity joins has also proposed techniques for scalable implementation of this measure [2, 15]. The edit distance between two string records r1 and r2 is defined as the transformation cost of r1 to r2, tc(r1, r2), which is equal to the minimum cost of edit operations applied to r1 to transform it to r2. Edit operations include character copy, insert, delete and substitute [11]. The edit similarity is defined as:

    sim_edit(r1, r2) = 1 - tc(r1, r2) / max{|r1|, |r2|}    (1)

There is a cost associated with each edit operation, and several cost models have been proposed for these operations. The most commonly used model, Levenshtein edit distance, which we refer to simply as edit distance in this paper, uses unit cost for all operations except copy, which has zero cost.

3.2 Jaccard and Weighted Jaccard
Jaccard similarity is the fraction of tokens in r1 and r2 that are present in both. Weighted Jaccard similarity is the weighted version of Jaccard similarity, i.e.,

    sim_WJaccard(r1, r2) = (Σ_{t ∈ r1 ∩ r2} w_R(t)) / (Σ_{t ∈ r1 ∪ r2} w_R(t))    (2)

where w_R(t) is a weight function that reflects the commonality of the token t in the relation R. We choose the RSJ (Robertson-Sparck Jones) weight for the tokens, which was shown to be more effective than the commonly used Inverse Document Frequency (IDF) weight [5]:

    w_R(t) = log((N - n_t + 0.5) / (n_t + 0.5))    (3)

where N is the number of tuples in the base relation R and n_t is the number of tuples in R containing the token t.
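A compact sketch of Equations (1)-(3), reusing the q-gram tokenizer from the earlier sketch; the helper names and the toy relation are ours. Note that RSJ weights are intended for large relations and can be negative on a toy example.

```python
import math
from collections import Counter

def qgrams(s, q=3):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def edit_distance(r1, r2):
    """Levenshtein distance: unit cost for insert/delete/substitute, zero cost for copy."""
    prev = list(range(len(r2) + 1))
    for i, c1 in enumerate(r1, 1):
        cur = [i]
        for j, c2 in enumerate(r2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

def sim_edit(r1, r2):
    """Equation (1): 1 - tc(r1, r2) / max(|r1|, |r2|)."""
    return 1.0 - edit_distance(r1, r2) / max(len(r1), len(r2))

def rsj_weights(relation, q=3):
    """Equation (3): RSJ weight of every q-gram occurring in the relation."""
    N = len(relation)
    df = Counter(t for r in relation for t in qgrams(r, q))
    return {t: math.log((N - n + 0.5) / (n + 0.5)) for t, n in df.items()}

def sim_wjaccard(r1, r2, w, q=3):
    """Equation (2): weighted Jaccard over q-gram sets with weights w."""
    g1, g2 = qgrams(r1, q), qgrams(r2, q)
    num = sum(w.get(t, 0.0) for t in g1 & g2)
    den = sum(w.get(t, 0.0) for t in g1 | g2)
    return num / den if den else 0.0

R = ['acme inc', 'acme incorporated', 'globex corp']
w = rsj_weights(R)
print(round(sim_edit('acme inc', 'acme incorporated'), 2))   # 0.47
print(sim_wjaccard('acme inc', 'acme incorporated', w))
```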
3.3 Measures from IR
A well-studied problem in information retrieval (IR) is, given a query and a collection of documents, to return the documents most relevant to the query. In the measures in this part, records are treated as documents and q-grams are seen as words (tokens) of the documents. Therefore, the same techniques used for finding documents relevant to a query can be used to return records similar to a query string. In the rest of this section, we present three measures that have been shown to have higher performance for the approximate selection problem [5].

3.3.1 Cosine w/tf-idf
The tf-idf cosine similarity is a well-established measure in the IR community which leverages the vector space model. This measure determines the closeness of the input strings r1 and r2 by first transforming the strings into unit vectors and then measuring the angle between their corresponding vectors. The cosine similarity with tf-idf weights is given by:

    sim_Cosine(r1, r2) = Σ_{t ∈ r1 ∩ r2} w_r1(t) · w_r2(t)    (4)

where w_r1(t) and w_r2(t) are the normalized tf-idf weights of each common token in r1 and r2, respectively. The normalized tf-idf weight of token t in a given string record r is defined as follows:

    w_r(t) = w'_r(t) / sqrt(Σ_{t' ∈ r} w'_r(t')^2),    w'_r(t) = tf_r(t) · idf(t)    (5)

where tf_r(t) is the term frequency of token t within string r and idf(t) is the inverse document frequency with respect to the entire relation R.

3.3.2 BM25
The BM25 similarity score for a query r1 and a string record r2 is defined as follows:

    sim_BM25(r1, r2) = Σ_{t ∈ r1 ∩ r2} ŵ_r1(t) · w_r2(t)

where

    ŵ_r1(t) = ((k3 + 1) · tf_r1(t)) / (k3 + tf_r1(t))
    w_r2(t) = w_R^(1)(t) · ((k1 + 1) · tf_r2(t)) / (K(r2) + tf_r2(t))

and w_R^(1) is the RSJ weight:

    w_R^(1)(t) = log((N - n_t + 0.5) / (n_t + 0.5))
    K(r) = k1 · ((1 - b) + b · |r| / avg_rl)

where tf_r(t) is the frequency of the token t in string record r, |r| is the number of tokens in r, avg_rl is the average number of tokens per record, N is the number of records in the relation R, n_t is the number of records containing the token t, and k1, k3 and b are independent parameters. We set these parameters based on the TREC-4 experiments [17], where k1 ∈ [1, 2], k3 = 8 and b ∈ [0.6, 0.75].

3.3.3 Hidden Markov Model
Approximate string matching can be modeled by a discrete Hidden Markov process, which has been shown to have better performance than Cosine w/tf-idf in the IR literature [16] and high accuracy and low running time for approximate selection [5]. This particular Markov model consists of only two states: the first state models the tokens that are specific to one particular "String", and the second state models the tokens in "General English", i.e., tokens that are common in many records. Refer to [5] and [16] for a complete description of the model and possible extensions. The HMM similarity function accepts two string records r1 and r2 and returns the probability of generating r1 given that r2 is a similar record:

    sim_HMM(r1, r2) = Π_{t ∈ r1} (a0 · P(t|GE) + a1 · P(t|r2))    (6)

where a0 and a1 = 1 - a0 are the transition probabilities of the Markov model, and P(t|GE) and P(t|r2) are given by:

    P(t|r2) = (number of times t appears in r2) / |r2|
    P(t|GE) = (Σ_{r ∈ R} number of times t appears in r) / (Σ_{r ∈ R} |r|)
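A short sketch of the Cosine w/tf-idf measure of Section 3.3.1 over q-gram tokens. The paper does not spell out its exact idf variant, so idf(t) = log(N / n_t) is an assumption of this sketch, as are the function names.

```python
import math
from collections import Counter

def qgram_bag(s, q=3):
    """Multiset of q-grams, so term frequencies are available."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def idf_table(relation, q=3):
    N = len(relation)
    df = Counter(t for r in relation for t in set(qgram_bag(r, q)))
    return {t: math.log(N / n) for t, n in df.items()}

def tfidf_vector(r, idf, q=3):
    """Normalized tf-idf weights of Equation (5)."""
    raw = {t: tf * idf.get(t, 0.0) for t, tf in qgram_bag(r, q).items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()} if norm else {}

def sim_cosine(r1, r2, idf, q=3):
    """Equation (4): dot product over the common tokens."""
    v1, v2 = tfidf_vector(r1, idf, q), tfidf_vector(r2, idf, q)
    return sum(w * v2[t] for t, w in v1.items() if t in v2)

R = ['acme inc', 'acme incorporated', 'globex corp', 'initech llc']
idf = idf_table(R)
print(round(sim_cosine('acme inc', 'acme incorporated', idf), 3))
```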
3.4 Hybrid Measures
The implementation of these measures involves two similarity functions: one that compares the strings by comparing their word tokens, and another similarity function which is more suitable for short strings and is used for the comparison of the word tokens.
3.4.1 GES
The generalized edit similarity (GES) [7], a modified version of the fuzzy match similarity presented in [6], takes two strings r1 and r2, tokenizes the strings into a set of words, and assigns a weight w(t) to each token. GES defines the similarity between the two given strings as a minimum transformation cost required to convert string r1 to r2 and is given by:

    sim_GES(r1, r2) = 1 - min{tc(r1, r2) / wt(r1), 1.0}    (7)

where wt(r1) is the sum of the weights of all tokens in r1 and tc(r1, r2) is the minimum cost of a sequence of the following transformation operations:

• token insertion: inserting a token t in r1 with cost w(t) · c_ins, where the insertion factor c_ins is a constant in the range between 0 and 1. In our experiments, c_ins = 1.

• token deletion: deleting a token t from r1 with cost w(t).

• token replacement: replacing a token t1 by t2 in r1 with cost (1 - sim_edit(t1, t2)) · w(t1), where sim_edit is the edit similarity of Equation (1).
3.4.2 SoftTFIDF
SoftTFIDF is another hybrid measure, proposed by Cohen et al. [8], which relies on the normalized tf-idf weights of word tokens and can work with an arbitrary similarity function to find the similarity between word tokens. In this measure, the similarity score sim_SoftTFIDF is defined as follows:

    sim_SoftTFIDF(r1, r2) = Σ_{t1 ∈ C(θ, r1, r2)} w(t1, r1) · w(argmax_{t2 ∈ r2} sim(t1, t2), r2) · max_{t2 ∈ r2} sim(t1, t2)    (8)

where w(t, r) is the normalized tf-idf weight of word token t in record r, and C(θ, r1, r2) returns the set of tokens t1 ∈ r1 such that sim(t1, t2) > θ for some t2 ∈ r2, for a similarity function sim() suitable for comparing word strings. In our experiments, sim(t1, t2) is the Jaro-Winkler similarity, as suggested in [8].
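A rough sketch of Equation (8) over word tokens. For brevity, the internal token similarity is a simple character-bigram Dice coefficient standing in for Jaro-Winkler, and idf uses log(N / n_t); both are assumptions of this sketch rather than the authors' exact choices, and the names are ours.

```python
import math
from collections import Counter

def words(s):
    return s.lower().split()

def bigram_dice(t1, t2):
    """Stand-in token similarity on character bigrams (the paper uses Jaro-Winkler)."""
    b1 = Counter(t1[i:i + 2] for i in range(len(t1) - 1))
    b2 = Counter(t2[i:i + 2] for i in range(len(t2) - 1))
    overlap = sum((b1 & b2).values())
    total = sum(b1.values()) + sum(b2.values())
    return 2.0 * overlap / total if total else 0.0

def word_tfidf(relation):
    """Normalized tf-idf weight dict for each record, over word tokens."""
    N = len(relation)
    df = Counter(t for r in relation for t in set(words(r)))
    idf = {t: math.log(N / n) for t, n in df.items()}
    vecs = []
    for r in relation:
        raw = {t: tf * idf[t] for t, tf in Counter(words(r)).items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        vecs.append({t: v / norm for t, v in raw.items()})
    return vecs

def soft_tfidf(w1, w2, sim=bigram_dice, theta=0.8):
    """Equation (8): w1, w2 are the weight dicts of the two records."""
    score = 0.0
    for t1, weight1 in w1.items():
        best_t2, best_sim = None, 0.0
        for t2 in w2:
            s = sim(t1, t2)
            if s > best_sim:
                best_t2, best_sim = t2, s
        if best_t2 is not None and best_sim > theta:   # t1 is in C(theta, r1, r2)
            score += weight1 * w2[best_t2] * best_sim
    return score

R = ['acme incorporated', 'acme inc', 'globex corporation']
vecs = word_tfidf(R)
print(round(soft_tfidf(vecs[0], vecs[1]), 3))
```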
4. EVALUATION

4.1 Datasets
In order to evaluate the effectiveness of the different similarity measures described in the previous section, we use the same datasets used in [5]. These datasets were created using a modified version of the UIS data generator, which has previously been used for the evaluation of data cleaning and record linkage techniques [12, 1]. The data generator has the ability to inject several types of errors into a clean database of string attributes. These errors include commonly occurring typing mistakes (edit errors: character insertion, deletion, replacement and swap), token swap and abbreviation errors (e.g., replacing Inc. with Incorporated and vice versa). The data generator has several parameters to control the injected error in the data, such as the size of the dataset to be generated, the distribution of duplicates (uniform, Zipfian or Poisson), the percentage of erroneous duplicates, the extent of error injected in each string, and the percentage of different types of errors. The data generator keeps track of the duplicate records by assigning a cluster ID to each clean record and to all duplicates generated from that clean record.

For the results presented in this paper, the datasets are generated out of a clean dataset of 2139 company names with an average record length of 21.03 and an average of 2.9 words per record. The errors in the datasets have a uniform distribution. For each dataset, on average 5000 dirty records are created out of 500 clean records. We have also run experiments on datasets generated using different parameters. For example, we generated data using a Zipfian distribution, and we also used data from another clean source (DBLP titles) as in [5]. We also created larger datasets. For these other datasets, the accuracy trends remain the same. Table 1 describes all the datasets used for the results in this paper. We used 8 datasets with mixed types of errors (edit errors, token swap and abbreviation replacement). Moreover, we used 5 datasets with only a single type of error (edit errors, token swap or abbreviation replacement errors) to measure the effect of each type of error individually. Following [5], we believe the errors in these datasets are highly representative of common types of errors in databases with string attributes.

Table 1: Datasets Used in the Experiments

Group          Name   Erroneous        Errors in        Token      Abbr.
                      Duplicates (%)   Duplicates (%)   Swap (%)   Error (%)
Dirty          D1     90               30               20         50
               D2     50               30               20         50
Medium Error   M1     30               30               20         50
               M2     10               30               20         50
               M3     90               10               20         50
               M4     50               10               20         50
Low Error      L1     30               10               20         50
               L2     10               10               20         50
Single Error   AB     50               0                0          50
               TS     50               0                20         0
               EDL    50               10               0          0
               EDM    50               20               0          0
               EDH    50               30               0          0

4.2 Measures
We use well-known measures from IR, namely precision, recall, and F1, for different values of the threshold to evaluate the accuracy of the similarity join operation. We perform a self-join on the input table using a similarity measure with a fixed threshold θ. Precision (Pr) is defined as the percentage of similar pairs of records among the pairs that have a similarity score above the threshold θ. In our datasets, similar records are marked with the same cluster ID, as described above. Recall (Re) is the ratio of the number of similar pairs that have a similarity score above the threshold θ to the total number of similar pairs. Therefore, a join that returns all the pairs of records in the two input tables as output has low (near zero) precision and recall of 1. A join that returns an empty answer has precision 1 and zero recall. The F1 measure is the harmonic mean of precision and
recall, i.e.,

    F1 = (2 × Pr × Re) / (Pr + Re)    (9)

We measure precision, recall, and F1 for different values of the similarity threshold θ. For comparison of different similarity measures, we use the maximum F1 score across different thresholds.
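The evaluation protocol above can be sketched directly in code: a pair of records is a true match when both carry the same cluster ID, and a self-join at threshold θ is scored with precision, recall and F1. The helper name and usage line are illustrative, not the authors' implementation.

```python
from itertools import combinations

def evaluate_join(records, cluster_ids, sim, theta):
    """Precision, recall and F1 of a similarity self-join at threshold theta."""
    tp = fp = fn = 0
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        predicted = sim(r1, r2) >= theta
        actual = cluster_ids[i] == cluster_ids[j]
        tp += predicted and actual
        fp += predicted and not actual
        fn += (not predicted) and actual
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1

# Usage with any sim() from Section 3, e.g. the q-gram Jaccard sketch:
# pr, re, f1 = evaluate_join(records, cluster_ids, jaccard, theta=0.5)
```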
4.3 Results
Figures 1 and 2 show the precision, recall, and F1 values for all measures described in Section 3, over the datasets we have defined with mixed types of errors. For all measures except HMM and BM25, the horizontal axis of the precision/recall graph is the value of the threshold. For HMM and BM25, the horizontal axis is the percentage of the maximum value of the threshold, since these measures do not return a score between 0 and 1.

Effect of amount of errors. As shown in the precision/recall curves in Figures 1 and 2, the "dirtiness" of the input data greatly affects the value of the threshold that results in the most accurate join. For all the measures, a lower value of the threshold is needed as the degree of error in the data increases. For example, Weighted Jaccard achieves the best F1 score over the dirtiest datasets with threshold 0.3, while it achieves the best F1 for the cleanest datasets at threshold 0.55. BM25 and HMM are less sensitive and work well on both dirty and low-error datasets with the same value of the threshold. We will discuss later how the degree of error in the data affects the choice of the most accurate measure.

Effect of types of errors. Figure 3 shows the maximum F1 score across different values of the threshold for different measures on datasets containing only edit errors (the EDL, EDM and EDH datasets). These figures show that Weighted Jaccard and Cosine have the highest accuracy, followed by Jaccard and edit similarity, on the low-error dataset EDL. As the amount of edit error in each record increases, HMM performs as well as Weighted Jaccard, while Jaccard, edit similarity, and GES perform much worse on high edit-error datasets. Considering that edit similarity is mainly proposed for capturing edit errors, this shows the effectiveness of Weighted Jaccard and its robustness to varying amounts of edit errors. Figure 4 shows the effect of token swap and abbreviation errors on the accuracy of different measures. This experiment indicates that edit similarity is not capable of modeling such types of errors. HMM, BM25 and Jaccard are also not capable of modeling abbreviation errors properly.

Figure 3: Maximum F1 score for different measures on datasets with only edit errors

Figure 4: Maximum F1 score for different measures on datasets with only token swap and abbreviation errors
Figure 5: Maximum F1 score for different measures on dirty, medium and low-error group of datasets
Comparison of measures. Figure 5 shows the maximum F1 score across different values of the threshold for different measures on the dirty, medium and low-error groups of datasets. Here, we have aggregated the results for all the dirty datasets together (respectively, the moderately dirty or medium datasets and the low-error datasets). The results show the effectiveness and robustness of Weighted Jaccard and Cosine in comparison with other measures. Again, HMM is among the most accurate measures when the data is extremely dirty and has relatively low accuracy when the percentage of error in the data is low.

Remark. As stated in Section 2, the performance of many algorithms proposed for improving the scalability of the join operation highly depends on the value of the similarity threshold used for the join. Here we show the accuracy numbers on our datasets using the values of the threshold that make these algorithms effective. Specifically, we address the results in [2], although similar observations can be made for the results of other similar work in this area. Table 2 shows the F1 values for the thresholds that result in the best accuracy on our datasets and the best performance in the experimental results of [2]. The PartEnum and WtEnum algorithms presented in [2] significantly outperform previous algorithms for a threshold of 0.9, but have roughly the same performance as previously proposed algorithms such as LSH when a threshold of 0.8 or less is used. The results in Table 2 show that there is a big gap between the value of the threshold that results in the most accurate join on our datasets and the threshold that makes PartEnum and WtEnum effective in the studies in [2].
Table 2: F1 score for thresholds that result in the best running time in previous performance studies and the highest accuracy on our datasets, for two selected similarity measures

                       Jaccard Join                         Weighted Jaccard Join
Group          Threshold                     F1      Threshold                     F1
Dirty          0.5  (Best Acc.)              0.293   0.3  (Best Acc.)              0.528
               0.8                           0.249   0.8                           0.249
               0.85                          0.248   0.85                          0.246
               0.9  (Best Performance)       0.247   0.9  (Best Performance)       0.244
Medium Error   0.65 (Best Acc.)              0.719   0.55 (Best Acc.)              0.776
               0.8                           0.611   0.8                           0.581
               0.85                          0.571   0.85                          0.581
               0.9  (Best Performance)       0.548   0.9  (Best Performance)       0.560
Low Error      0.7  (Best Acc.)              0.887   0.55 (Best Acc.)              0.929
               0.8                           0.854   0.8                           0.831
               0.85                          0.831   0.85                          0.819
               0.9  (Best Performance)       0.812   0.9  (Best Performance)       0.807
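The "Best Acc." thresholds in Table 2 can be found by a simple sweep over candidate thresholds, keeping the one with the highest F1. The sketch below reuses the evaluate_join() helper sketched in Section 4.2; the candidate grid and variable names are illustrative.

```python
def best_threshold(records, cluster_ids, sim, thresholds=None):
    """Return (best F1, threshold achieving it) over a grid of thresholds."""
    thresholds = thresholds or [t / 100 for t in range(5, 100, 5)]
    best_f1, best_theta = 0.0, thresholds[0]
    for theta in thresholds:
        _, _, f1 = evaluate_join(records, cluster_ids, sim, theta)
        if f1 > best_f1:
            best_f1, best_theta = f1, theta
    return best_f1, best_theta

# e.g. best_f1, theta_star = best_threshold(records, cluster_ids, jaccard)
```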
5. CONCLUSION
We have presented an overview of several similarity measures for efficient approximate string joins and thoroughly evaluated their accuracy on several datasets with different characteristics and common quality problems. Our results show the effect of the amount and type of errors in the datasets, along with the value of the similarity threshold used with the similarity measures, on the accuracy of the join operation. Considering that the effectiveness of many algorithms proposed for enhancing the scalability of approximate joins relies on the value chosen for the similarity threshold, our results show the effectiveness of those algorithms that are less sensitive to the value of the threshold, and open an interesting direction for future work: finding algorithms that are both efficient and accurate using the same threshold. Finding an algorithm that determines the best value of the threshold (regardless of the type and amount of errors) for the similarity measures that showed higher accuracy in our work is another interesting subject for future work.
6. REFERENCES
[1] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE'06, page 30.
[2] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB'06, pages 918-929.
[3] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW'07, pages 131-140.
[4] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16-23, 2003.
[5] A. Chandel, O. Hassanzadeh, N. Koudas, M. Sadoghi, and D. Srivastava. Benchmarking declarative approximate selection predicates. In SIGMOD'07, pages 353-364.
[6] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD'03, pages 313-324.
[7] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE'06, page 5.
[8] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb'03, pages 73-78.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE TKDE, 19(1):1-16, 2007.
[10] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB'01, pages 491-500.
[11] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA, 1997.
[12] M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9-37, 1998.
[13] P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-preserving hashing in multidimensional spaces. In STOC'97, pages 618-625.
[14] N. Koudas and D. Srivastava. Approximate joins: Concepts and techniques. In VLDB'05 Tutorial, page 1363.
[15] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB'07, pages 303-314.
[16] D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden Markov model information retrieval system. In SIGIR'99, pages 214-221.
[17] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, M. Gatford, and A. Payne. Okapi at TREC-4. In TREC'95.
[18] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD'04, pages 743-754.
Figure 1: Accuracy of Edit-Similarity, Jaccard and Weighted Jaccard measures relative to the value of the threshold on different datasets. For each measure, panels (a), (b) and (c) show the Low Error, Medium Error and Dirty data sets, respectively.
Figure 2: Accuracy of measures from IR (Cosine w/tf-idf, BM25, HMM) and hybrid measures (SoftTFIDF, GES) relative to the value of the threshold on different datasets. For each measure, panels (a), (b) and (c) show the Low Error, Medium Error and Dirty data sets, respectively.
QoS: Quality Driven Data Abstraction Generation For Large Databases ∗
Charudatta V. Wad, Elke A. Rundensteiner, and Matthew O. Ward
Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA, USA
{charuw, rundenst, matt}@cs.wpi.edu
ABSTRACT
Data abstraction is the process of reducing a large dataset into one of moderate size, while maintaining dominant characteristics of the original dataset. Data abstraction quality refers to the degree to which the abstraction represents the original data. The quality of an abstraction directly affects the confidence an analyst can have in results derived from such abstracted views about the actual data. Some initial measures to quantify the quality of abstraction have been proposed; however, they currently can only be utilized as an after-thought. An analyst can be made aware of the quality of the data he works with, but he cannot control the quality he desires, nor the trade-off between the time required to generate the abstraction and its quality. While some analysts require at least a certain minimal level of quality, others must make do with whatever abstraction quality their time and resource limitations allow. To tackle these problems, we propose a new data abstraction generation model, called the QoS model, that presents the performance-quality trade-off to the analyst. It then generates an abstraction based on the desired level of quality versus performance as indicated by the analyst. The framework has been integrated into XmdvTool, a freeware multivariate data visualization tool developed at WPI. Our experimental results show that our approach provides better quality compared to existing abstraction techniques.

1. INTRODUCTION
1.1 Motivation
Data abstraction techniques are commonly used to facilitate the efficient detection of patterns in large datasets and to analyze a huge database without actually having to explore the original data [1]. Thus, analysts typically infer characteristics of large databases by analyzing the abstracted data rather than looking at the full data. Some abstraction techniques select a subset of the original dataset as its abstraction, such as sampling and filtering, while others construct a new abstract/summary representation, such as clustering and summarizing [1]. Tasks conducted based on abstracted data include pattern detection, cluster analysis, outlier analysis, subspace cluster analysis, filtering and sample analysis [1]. (∗ This work was supported under NSF grants IIS-0119276 and IIS-00414380.)

Figures 1(a) and 1(b) represent an example of a dataset and its abstraction. The visualization technique used, called parallel coordinates [13], is a popular multivariate visualization technique. In this method, each dimension corresponds to an axis, and the N axes are organized as uniformly spaced vertical or horizontal lines. A data element in an N-dimensional space manifests itself as a connected set of points, one on each axis. Thus one polyline is generated for representing each data point.

Figure 1: Figure 1(a) displays the cars dataset using the parallel coordinates visual technique, while Figure 1(b) represents cluster centers of the dataset.

Abstraction quality versus data quality: Abstraction quality captures how well the abstracted dataset represents the original dataset. Intuitively, a good data abstraction represents all the main features of the original dataset. Since the abstraction in Figure 1(b) captures all the main clusters present in the original dataset (Figure 1(a)), the abstraction is considered to be of high quality. Lack of knowledge regarding quality can lead to inaccurate results, jeopardizing the reliability of conclusions gleaned from the abstraction. Validating the quality of an abstraction is made difficult by a lack of data abstraction quality measures. Although some initial measures [8] have recently been proposed to measure data abstraction, those measures do not scale well to higher numbers of dimensions. Furthermore, a scalable data abstraction measure by itself does not solve the problem. The main problem is a lack of consideration of quality by the abstraction generation process before its commencement. To further complicate matters, most systems and thus users of these systems assume that the raw data itself is always good. However, real-world data is known to be imperfect, suffering from various forms of defects such as sensor variability, estimation errors, uncertainty, human errors in data entry, and gaps in data gathering. Data quality refers to the quality of the underlying data used for
the abstraction generation. Clearly, if the quality of the underlying data is not considered during abstraction generation, the quality of an abstraction may indeed be adversely affected.
1.2 Existing Abstraction Generation Solutions
Figure 2 sketches the process most commonly used by abstraction generation systems [2] [9].
Figure 2: Existing Data Abstraction Solution.

Predicaments of such a process include:

• Quality measures, if available, are plugged in only as an after-thought to calculate the quality of a given abstraction.

• Data abstraction is a one-way process. Thus, when an analyst initiates the generation of an abstraction, he cannot indicate the desired level of quality nor control the output of the process in terms of its resulting quality. Rather, he is simply informed as an afterthought of the quality (or lack thereof) that has been obtained.

• Furthermore, the analyst does not know how much time he should budget for the abstraction process. Without such control, he cannot trade off the acceptable level of quality against the amount of time he is able to spend on the abstraction process itself.

• Data quality is not taken into account. As discussed earlier, if the data is imperfect (or of low quality), the abstraction result should also reflect the underlying data quality.

1.3 Our Approach
To overcome the above identified problems, we propose to make abstraction generation quality-aware. We present the analyst with a quality-performance trade-off indicating the different values of quality measures achievable and the time required for the process to generate them. Using these computations, an analyst can demand a quality level beforehand, or he can request a certain performance, knowing what quality he can expect, and QoS will generate the abstraction accordingly. QoS takes into consideration both the data abstraction quality and the underlying data quality to calculate a complete data abstraction quality measure. The system framework for QoS, depicted in Figure 3, consists of the following main phases:

Figure 3: Proposed QoS Framework.

1. Pre-processing phase: We introduce a pre-processing phase to compute the quality-performance trade-off. The computation is done using a multi-dimensional histogram which calculates density information. Two main components in this phase are:
   a) A scalable data abstraction measure to quantify the data abstraction result, called the Multi-dimensional Histogram Difference Measure (MHDM). Other measures [8] could also be plugged in.
   b) The estimator, which calculates the performance-quality trade-off, including confidence intervals and time estimations for the process. This is at the heart of QoS, presenting the analyst with various trade-offs before the process of abstraction commences.

2. Generation phase: This phase generates an abstraction based on the quality values set by the analyst.

3. Post-processing phase: This phase combines the measure of the abstraction with the quality of the underlying dataset to determine the overall quality.

4. Interaction interface: This interface presents the performance-quality trade-off and the final abstraction quality to the analyst.

The QoS framework could be applied to many different data abstraction tasks, including hierarchical sampling, clustering, and selection. However, for the sake of explanation, in the rest of this paper we focus on clustering for large databases.

2. QOS FOR CLUSTER ANALYSIS
Summarization techniques for data abstraction summarize the data by creating fewer new representatives to convey the underlying data [1]. Clustering is one such technique, where cluster representatives are used to represent the data. Since clustering is memory and computationally intensive, clustering of large databases typically employs sampling as a pre-processing step [2][3].

2.1 Quality Measure (MHDM)
We now propose a measure of abstraction quality for high dimensional data. This multi-dimensional data abstraction quality measure captures the distributions present in a high dimensional dataset. The measure can be calculated before the abstraction is actually generated. The proposed measure, called the Multi-dimensional Histogram Difference Measure (MHDM), is a histogram difference method. Histograms are widely used for density and selectivity estimation [4]. MHDM calculates the difference between the
multi-dimensional histogram of the original dataset and that of the abstraction generated from the data. For the measure, we assume that the two multi-dimensional histograms (original and abstracted) have the same number of bins, with bin values corresponding to the percentage of data falling into each bin. MHDM is based on the summation of the differences between the corresponding bins. MHDM ranges from 0.0 to 1.0, with 0 implying the worst case and 1.0 indicating the best case. One potential disadvantage of a multi-dimensional histogram is its inability to scale, due to its high memory requirements [4]. Unfortunately, we cannot simply use 1-dimensional histograms, which are less costly, because they fail to capture the correlation present in high dimensional data. To overcome the space inefficiency of multi-dimensional histograms [4], we encode the multi-dimensional histogram structure by explicitly associating each multi-dimensional cell address with its cell content value. For example, Figure 4 represents the formation of an encoded multi-dimensional histogram: the cell with dimension 1 at bin 5, dimension 2 at bin 2 and dimension 3 at bin 1, having a value of 6, would be encoded explicitly by the pair shown in the figure.

Building the encoded multi-dimensional histogram: Assume an input tuple with d dimensions and data values v1, v2, .., vd.
Step I: We partition each of the d dimensions into a number of distinct partitions. For simplicity, we assume here that there are exactly n such partitions for each dimension, though other more sophisticated strategies could be employed for bin sizing in the future. The partitioning of dimension i is denoted as ui1, ui2, ..., uin, with n the number of partitions. For each input tuple v1, v2, .., vd, we determine which bin b of dimension i its i-th value vi falls into. Given that each tuple value is mapped to a particular partition, we have d partition numbers for a given input tuple. Let us denote this by u1i1, u2i2, .., udid, with ij the partition number for dimension j. Thus, the number of 1-dimensional partitions formed directly influences the number of multi-dimensional bins formed.

Step II: We encode the multi-dimensional bin from the partition numbers obtained for each dimension by appending the bin numbers into one code; u1i1 u2i2 .. udid is the multi-dimensional bin corresponding to the example input tuple above. Thus, if most of the d-dimensional cells remain empty, our histogram is relatively small. Most real datasets are very sparse in nature (confirmed by our experimental study in Section 4). Thus this technique saves a lot of memory in practice.

Figure 4: Formation of encoded multi-dimensional histogram.

Advantages of this explicit encoding approach, compared to a full matrix representation, include:

• We never encode empty bins, leading to huge savings in terms of memory in practice.

• The algorithm has linear complexity (in the number of data points), and thus can build multi-dimensional histograms efficiently even for high dimensional data.

MHDM can be expressed by the following equation:

    MHDM = 1.0 - (Σ_{i=1}^{N} |Po_i - Ps_i|) / MAXP_h    (1)

where

• Po_i is the percentage of data that falls into the i-th bin of the original histogram;
• Ps_i is the percentage of data that falls into the i-th bin of the abstracted histogram;
• MAXP_h is the maximum histogram difference.

Need for MHDM: In clustering of large databases, if the samples used for clustering are chosen randomly, they may fail to represent the original dataset. In that case, the clustering process fails to abstract the original dataset, independently of the clustering algorithm used and the quality of the clusters formed. There is no intuitive way of setting the "correct" sampling rate for a dataset. In the absence of measures to guide the process, the easiest method to ensure high data abstraction quality is to increase the sampling rate. The user may over-sample the database, yielding poor clustering performance without necessarily guaranteeing an improvement in quality. Users might also under-sample the dataset, leading to low data abstraction quality; in that case, the clustering result might not be accurate, misleading users with clustering results that may not represent the original dataset. Thus, even though sampling can be a direct representation of quality, setting the correct sampling rate requires a quality measure.
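Below is a minimal Python sketch of the encoded multi-dimensional histogram (Steps I and II) and of Equation (1). It assumes equi-width bins per dimension and takes MAXP_h to be 2, the largest possible total difference between two percentage histograms; both choices and all function names are assumptions of this sketch, not the authors' implementation.

```python
from collections import Counter

def encode(point, mins, maxs, n_bins):
    """Steps I and II: map a d-dimensional point to its multi-dimensional bin code."""
    code = []
    for v, lo, hi in zip(point, mins, maxs):
        b = min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1) if hi > lo else 0
        code.append(b)
    return tuple(code)

def histogram(data, mins, maxs, n_bins):
    """Encoded histogram: only non-empty bins are stored, as fractions of the data."""
    counts = Counter(encode(p, mins, maxs, n_bins) for p in data)
    return {c: k / len(data) for c, k in counts.items()}

def mhdm(original, abstraction, n_bins=5):
    """Equation (1), with MAXPh assumed to be 2 (the largest possible total difference)."""
    mins = [min(col) for col in zip(*original)]
    maxs = [max(col) for col in zip(*original)]
    ho = histogram(original, mins, maxs, n_bins)
    ha = histogram(abstraction, mins, maxs, n_bins)
    diff = sum(abs(ho.get(b, 0.0) - ha.get(b, 0.0)) for b in set(ho) | set(ha))
    return 1.0 - diff / 2.0

data = [(1.0, 2.0), (1.1, 2.1), (5.0, 9.0), (5.2, 8.8), (9.9, 0.1)]
sample = [(1.0, 2.0), (5.0, 9.0), (9.9, 0.1)]
print(round(mhdm(data, sample), 3))   # high (0.867): the sample hits every occupied bin
```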
2.1.1 Noise Elimination
Real-world data is often fraught with noise. Clearly, noise elimination is crucial for high quality abstractions. The multi-dimensional histogram of the original data is thus regulated to filter noise. Here we propose one method in particular that targets the elimination of noise in support of the task of clustering; however, other methods for the elimination of noise may need to be designed to support alternate tasks. Clearly, noise in the context of clustering may be important information when in search of outliers. Our proposed cluster-centric noise elimination phase consists of eliminating bins whose bin count is below a threshold (γ). This threshold γ can either be empirically determined (explained in Section 4) or set by the analyst. Intuitively, we observe that the bin count of a multi-dimensional bin will be below a threshold if either the point is random noise or the point belongs to the edge of a cluster. Figure 5 displays a grid representing a 2-dimensional histogram placed over the data. Ignoring points from bins with low counts may have the side effect of ignoring points from the edges of clusters. However, since we are interested in picking more points from near the center of a cluster rather than its edges, ignoring points from the edges effectively adds more weight to the points in the center. This improves the abstraction quality, as our experimental study confirms (see Section 4). It also decreases the number of multi-dimensional bins to be maintained, increasing the efficiency of the QoS estimator (Section 2.2).

Figure 5: Existence of noise in datasets.

2.2 QoS Estimator
The QoS estimator computes the performance-quality trade-off by generating a look-up table that indicates the relationship
between MHDM, the sampling level and the estimated time required for clustering. Figure 6 shows an example of a look-up table generated by the QoS estimator. Since sampling is the preliminary step for clustering, the abstraction quality largely depends on the sampling: if the samples chosen for clustering do not represent the original dataset well, the abstraction quality of the clusters can be low. Thus, the abstraction quality is determined by the sampling. Various sampling techniques defined in the literature can be used in this framework. Palmer et al. devised the strategy of density biased sampling [5], a probability-based approach which samples more from dense regions and less from sparse regions. According to density biased sampling [5]: suppose that we have n values x1, x2, ..., xn that are partitioned into g groups of sizes n1, n2, ..., ng, and we want to generate a sample with expected size M in which the probability of point xi depends on the size of the group containing xi. To bias the sample size, the probability function is defined as [5]:

    f(n_i) = β / n_i^e    (2)

where n_i is the number of points in group g_i and e is a constant. The number of points selected from group g_i is:

    n = f(n_i) · n_i    (3)

β is defined based on the sample size M as follows:

    β = M / Σ_{i=1}^{g} n_i^(1-e)    (4)

Group formation: It is very important to form groups based on density for density biased sampling [5] to be effective. The group assignment is done using the encoded multi-dimensional histogram: each bin is treated as a group of points used by density biased sampling.

Algorithm for estimation using density biased sampling: A look-up table is generated after the multi-dimensional histogram for the original data has been formed. Starting with sampling level α, the number of points falling in each bin is calculated. This enables us to calculate MHDM for sampling level α. This is repeated until MHDM reaches the maximum value of 1.0. The look-up table (as shown in Figure 6) will have the sampling level, the minimum quality level achieved for that sampling, and the time required for the process to complete. Whenever an analyst chooses a quality value, the value closest to it is returned.

Algorithm 1: Populating the look-up table
Input: x = initial sampling rate, α = increment in the sampling rate.
/* Populate the look-up table for the performance-quality trade-off. Initialize by setting M ← x and calculating β from Equation 4. */
1: while MHDM ≤ 1 do
2:   for each bin in the multi-dimensional histogram do
3:     compute the number of points selected from the group using Equation 3;
4:   end for
5:   compute MHDM for M;
6:   compute the time and confidence interval;
7:   update the look-up table with the sampling rate and MHDM;
8:   M ← M + α; compute β;
9: end while
Output: look-up table of the performance-quality trade-off.

Figure 6: Sample look-up table created by the QoS estimator.
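The sketch below illustrates density biased sampling over the histogram groups and a stripped-down version of the look-up-table loop of Algorithm 1, under the simplifying assumption that each occupied bin is one group g_i; the time and confidence-interval columns are omitted, and it is meant to be used with encode-style and mhdm-style functions such as those sketched in Section 2.1. All names are ours.

```python
import random
from collections import defaultdict

def group_points(data, encode_fn):
    """Group formation: each occupied histogram bin becomes one group g_i."""
    groups = defaultdict(list)
    for p in data:
        groups[encode_fn(p)].append(p)
    return groups

def density_biased_sample(groups, M, e=0.5, rng=random):
    """Equations (2)-(4): keep each point of group g_i with probability beta / n_i^e."""
    beta = M / sum(len(g) ** (1.0 - e) for g in groups.values())
    sample = []
    for g in groups.values():
        p = min(beta / len(g) ** e, 1.0)
        sample.extend(x for x in g if rng.random() < p)
    return sample

def lookup_table(data, encode_fn, mhdm_fn, start=0.05, step=0.05):
    """Simplified Algorithm 1 loop: rows of (sampling rate, MHDM)."""
    groups = group_points(data, encode_fn)
    table, rate = [], start
    while rate <= 1.0:
        sample = density_biased_sample(groups, M=max(1, int(rate * len(data))))
        table.append((rate, mhdm_fn(data, sample)))
        if table[-1][1] >= 1.0:
            break
        rate += step
    return table
```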
22
Once the analyst decides on the quality and other performance settings he desires, the interaction interface passes the sampling level to the abstraction generator. The abstraction generator samples the database using density biased sampling with a sampling level set by the interaction interface. The abstraction generator then passes the generated samples to a clustering algorithm. At this point, we can use any existing clustering technique [1] to cluster the data.
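As a rough illustration of how a density biased sample could be drawn from the encoded histogram following equations (2)-(4) above (a sketch only; function and variable names are ours, not the system's):

import random

def density_biased_sample(groups, M, e=0.5):
    # groups: dict bin_id -> list of points (each histogram bin is one group).
    # beta = M / sum(n_i^(1-e));  f(n_i) = beta / n_i^e;  take f(n_i)*n_i points per group,
    # so the expected total sample size is M.
    sizes = {g: len(pts) for g, pts in groups.items()}
    beta = M / sum(n ** (1 - e) for n in sizes.values())
    sample = []
    for g, pts in groups.items():
        k = round(beta / (sizes[g] ** e) * sizes[g])   # expected number of points from this group
        sample.extend(random.sample(pts, min(k, sizes[g])))
    return sample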
2.5 Inclusion of Data Quality
As a last step, the quality of the underlying data is incorporated into the abstraction result. As is commonly done [12, 14], we assume that each data tuple has an associated record quality. Every cluster consists of data points of the original dataset. Thus, to calculate the total abstraction quality, we incorporate the data quality of all its members using a statistical function. Many alternatives are possible, such as the arithmetic mean and standard deviation, the median, the geometric mean, the root mean square, and so on. For illustration purposes, we henceforth choose to represent the data quality of a cluster as the arithmetic mean of the record qualities, called Cluster Data Quality (CDQ). The CDQ can be expressed as:

CDQ = (Σ_{i=1}^{n} RecordQuality_i) / n    (5)

• CDQ: Cluster Data Quality;
• RecordQuality: data quality of the record, in [0, 1];
• n: number of points in the cluster.

2.6 Total Abstraction Quality
The MHDM of the clustered data is identical to the value set by the analyst in the pre-processing phase. However, we can also evaluate the performance of the clustering algorithm itself using a quality measure [9]. One possible clustering quality measure is the average distance of every point from its nearest cluster center [9]. We can plug such clustering quality measures into the generation phase to determine the quality of the clustering performed, which we call Cluster Quality (CQ). MHDM can be viewed as a global measure over the entire dataset, whereas the clustering quality measure gives a quality value for each cluster formed. Thus, the total data abstraction quality can be calculated as the weighted average of cluster data quality, cluster quality and abstraction quality:

TAQ = (ρ · CQ + δ · CDQ + λ · MHDM) / 3    (6)

• TAQ: total data abstraction quality;
• ρ: weight associated with clustering quality;
• CQ: cluster quality;
• δ: weight associated with the data quality;
• CDQ: cluster data quality;
• MHDM: abstraction quality;
• λ: weight associated with abstraction quality.

ρ, δ and λ can be user-set parameters or can be set to 1 to default to the arithmetic mean. We display the TAQ visually using an InterRing display [15].

3. RELATED WORK
Recently, some abstraction measures have been introduced in the field of information visualization. Cui et al. [8] proposed a histogram-based measure. In contrast to HDM [8], our measure uses a multi-dimensional histogram to capture correlations in higher-dimensional data. Sampling and clustering have been studied extensively [1, 2, 3]. Olken et al. [11] proposed the idea of random sampling for data analysis, which was improved by Palmer et al. [5] using density biased sampling. We employ density biased sampling in QoS due to its density-preservation property. Human interaction in clustering was advocated by Chen et al. [6] via the Vista software, which allows manual clustering of databases by analysts. Widom proposed a data model called Trio [14] which incorporates lineage and accuracy of the data. However, Trio deals neither with the quality of abstracted data nor with clustering tasks; rather, it assumes simple SQL-style queries are being processed against the data. In other words, the proposed data model is in effect an extended relational model.

Figure 7: QoS sampling interface.

4. EXPERIMENTAL EVALUATION
We have evaluated the framework using both real and synthetic datasets. The framework is integrated into XmdvTool, a public-domain data visualization tool [7] developed at WPI. Experiments were conducted on a Pentium 4 (1.66 GHz) machine running Microsoft Windows XP with 1.0 GB of RAM. We have conducted experiments assessing the different components of QoS.

4.1 Practicality of Encoded Multi-Dimensional Histograms for Real Datasets
In this experiment, we formed encoded multi-dimensional histograms for numerous real datasets with different numbers of dimensions, such as Iris, Out5d, Cars, Aaup, Census income and Supercos2. Figure 8 compares the number of bins actually formed with the maximum number of bins possible. As seen from Figure 8, the savings (the difference between the maximum possible bins and the bins actually formed) increase enormously. This confirms that real datasets are sparse in nature and that our encoding-based approach indeed saves memory in practice.

Figure 8: Savings for real datasets.

The savings also increase greatly if we form a larger number of partitions. Figure 9 shows the effect of increasing the number of partitions on the number of multi-dimensional bins formed, and thus on the MHDM.
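A small sketch of how CDQ and TAQ could be computed from per-record qualities, mirroring equations (5) and (6) (illustrative only; the weights and the cluster-quality value are assumptions):

def cdq(record_qualities):
    # Cluster Data Quality: arithmetic mean of record qualities in [0, 1] (eq. 5).
    return sum(record_qualities) / len(record_qualities)

def taq(cq, cdq_value, mhdm, rho=1.0, delta=1.0, lam=1.0):
    # Total Abstraction Quality: weighted average of cluster quality,
    # cluster data quality and abstraction quality (eq. 6).
    return (rho * cq + delta * cdq_value + lam * mhdm) / 3

print(taq(cq=0.8, cdq_value=cdq([0.9, 0.7, 1.0]), mhdm=0.95))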
Figure 9: Effect of increasing the number of 1-d partitions for the Aaup dataset.

4.2 Validating QoS Clustering Accuracy
For this experiment, we used synthetic datasets generated with a known number of clusters. We compare the clustering result of the dataset without sampling with the result after applying QoS. We used the K-means algorithm and measured the RMS error between the cluster centers of the original datasets and those of the abstractions generated by QoS. The RMS error (ε) is defined as:

ε = sqrt( Σ_{i=1}^{m} (c_o^i − c_s^i)^2 / m )    (7)

• c_o^i: original cluster center;
• c_s^i: cluster center of the abstraction;
• m: number of clusters.

Figure 10: Decrease in RMS error with increase in MHDM.

As seen from Figure 10, the RMS error decreases with an increase in MHDM. Thus, an increase in MHDM leads to a more accurate identification of clusters.

4.3 Conformance of MHDM with Average Distortion
For this experiment, we calculate the average distortion of the dataset, i.e. the average distance of a data point from its cluster center. Figure 11 shows the average distortion with and without the noise elimination phase. In the absence of noise elimination, the average distortion increases with an increase in MHDM (and thus with the chosen sampling level). This is caused by the noise introduced during sampling. We therefore introduced the noise elimination phase, with γ set at 0.2 percent, eliminating all bins below the threshold. With noise elimination, the average distortion decreases linearly as MHDM increases. Since the K-means clustering algorithm was used, the number of clusters formed remains the same. In short, the noise elimination stage facilitates the formation of dense clusters.

Figure 11: Average distortion with and without noise elimination for various real datasets.

5. CONCLUSIONS
To enable effective use of abstraction results, we observe, first, that data abstraction quality must be quantified and, second, that the analyst must be given the ability to control this quality. We address this problem by making the abstraction process both quality-aware and interactive. In particular, we have introduced an abstraction generation framework which enables the analyst to trade off the time dedicated to generating the abstraction against the level of quality one can expect from the abstraction process. This framework has been implemented and incorporated into the XmdvTool data visualization application. Experiments have demonstrated that our proposed framework indeed improves the quality and scalability of the abstraction process to a great extent. The framework is general in that it can be used for any application that works with abstraction generation in support of tasks such as clustering.

6. REFERENCES
[1] P. Berkhin. Survey of Clustering Data Mining Techniques. Technical report, Accrue Software, Inc., San Jose, CA, 2002.
[2] T. Zhang, R. Ramakrishnan, M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 103-114, Montreal, Canada, June 1996.
[3] S. Guha, R. Rastogi, K. Shim. CURE: An Efficient Clustering Algorithm for Large Databases. In Proceedings of the ACM SIGMOD Conference, Washington, USA, 1998.
[4] H. Wang, K. Sevcik. A Multi-dimensional Histogram for Selectivity Estimation and Fast Approximate Query Answering. In Proceedings of the 2003 Conference of the Centre for Advanced Studies on Collaborative Research, pages 328-342, Toronto, Ontario, Canada, 2003.
[5] C. Palmer, C. Faloutsos. Density Biased Sampling: An Improved Method for Data Mining and Clustering. In Proceedings of ACM SIGMOD, May 2000.
[6] K. Chen, L. Liu. VISTA: Validating and Refining Clusters via Visualization. In Proceedings of the 3rd IEEE International Conference on Data Mining, page 501, 2003.
[7] M. Ward. XmdvTool: Integrating Multiple Methods of Visualizing Multivariate Data. In Proceedings of IEEE Visualization, pages 326-331, 1996.
[8] Q. Cui, M. Ward, E. Rundensteiner, J. Yang. Measuring Data Abstraction Quality in Multiresolution Visualization. In IEEE Symposium on Information Visualization (InfoVis 2006), October 2006.
[9] F. Kovács, C. Legány, A. Babos. Cluster Validity Measurement Techniques. In 6th International Symposium of Hungarian Researchers on Computational Intelligence, Budapest, November 2005.
[10] S. Thompson. Sampling. John Wiley and Sons, Inc., New York, 2nd edition, 1992.
[11] F. Olken, D. Rotem. Random Sampling from Database Files: A Survey. In Proceedings of the Fifth International Conference on Statistical and Scientific Database Management, 1990.
[12] Z. Xie, S. Huang, M. Ward, E. Rundensteiner. Exploratory Visualization of Multivariate Data with Variable Quality. In IEEE Symposium on Visual Analytics Science and Technology, pages 183-190, October 2006.
[13] E. Wegman. Hyperdimensional Data Analysis Using Parallel Coordinates. Journal of the American Statistical Association, 85(411):664-675, 1990.
[14] J. Widom. Trio: A System for Integrated Management of Data, Accuracy, and Lineage. In Proceedings of the Second Biennial Conference on Innovative Data Systems Research (CIDR '05), Pacific Grove, California, January 2005.
[15] J. Yang, M. Ward, E. Rundensteiner. InterRing: An Interactive Tool for Visually Navigating and Manipulating Hierarchical Structures. In Proceedings of the IEEE Symposium on Information Visualization, pages 77-84, 2002.
Quality-Driven Mediation for Geographic Data

Yassine Lassoued
Coastal and Marine Resources Centre, Naval Base – Haulbowline, Cobh, Co. Cork, Ireland
[email protected]

Mehdi Essid
Laboratoire des Sciences de l'Information et des Systèmes (LSIS), Avenue Escadrille Normandie Niemen, 13397 Marseille Cedex 20, France
[email protected]

Omar Boucelma
Laboratoire des Sciences de l'Information et des Systèmes (LSIS), Avenue Escadrille Normandie Niemen, 13397 Marseille Cedex 20, France
[email protected]

Mohamed Quafafou
Laboratoire des Sciences de l'Information et des Systèmes (LSIS), Avenue Escadrille Normandie Niemen, 13397 Marseille Cedex 20, France
[email protected]
ABSTRACT
Data quality and metadata are crucial for the development of Geographic Information Systems (GIS). With the availability of geographic data sources over the internet, assessing fitness for use may become a tricky task. In this paper, we describe a quality-driven mediation approach and system that allow a community of users to share a set of autonomous, heterogeneous and distributed geospatial data sources with different quality information. Users share a common vision of the data, which is defined by means of a global schema and a metadata schema. The paper shows how metadata and quality information (1) may provide efficiency, (2) improve semantic interoperability and (3) ensure fitness for use by providing results with satisfying quality.

1. INTRODUCTION
Metadata, and particularly quality information, play a crucial role in the development of Geographic Information Systems (GIS). Not only do they provide information (data) about the data to be manipulated by the users, but they may also contribute to building efficient applications. In the context of geographic data, potential users must understand beforehand the quality of the data they intend to procure; the critical question is whether the data under consideration are "fit for the intended use" (fitness for use). Currently, such questions can only be answered by experience, i.e. trial and error. With the availability of geographic data sources over the internet, a significant issue that must be addressed is the manipulation and querying of multiple heterogeneous data sources, each of them providing (or not) some metadata and quality information. Problems arise both (1) in allowing users to express metadata and quality conditions in a query over several distributed heterogeneous data sources, and (2) in providing them with an answer (dataset) that satisfies their requirements, possibly by combining data from different sources. The issue is threefold: first, we have to perform data integration; second, we have to manage metadata and quality information, if any; and third, we have to provide a mechanism that combines the two technologies.
The data integration problem has been extensively studied by the database (DB) community. Solutions such as data warehouses [14], federated databases [25] or mediation [30] have been proposed to help the development of applications that need access to multiple heterogeneous data sources. As for the GIS community, the work has focused on interoperability aspects, see [8, 18, 26] for instance. Most of the approaches propose to enrich the data models in order to conform to a "unified model", and the creation of the OpenGIS consortium is the most visible output of this trend. There is a growing awareness of the metadata and data quality problem in both the DB and GIS user communities. DB researchers have proposed metadata and quality-based data integration techniques [20, 12, 22, 13], while in the GIS community several geographic metadata models have been proposed (the most important being the ISO-19115 [15] and ISO-19139 [16] standards) and several implementations of the OGC Catalogue Services (CSW) [23] are now available.
In this paper, we describe a metadata and quality-driven mediation approach and system that allow a community of users to share a set of autonomous, heterogeneous and distributed geospatial data sources with different metadata and quality information. Users share a common vision of the data, which is defined by means of a global schema and a metadata schema. A user of our system poses a query over the global schema and models their needs as metadata conditions complying with the proposed metadata model. The system uses the data sources as well as their metadata in order to provide a solution (dataset) that satisfies the user's needs, possibly by combining data from different sources. The claims of this paper are as follows: (1) metadata and quality help efficient data access, by allowing the selection of pertinent data sources and providing an implicit index mechanism that reduces the execution time of mediated queries; (2) metadata may improve semantic interoperability, through the use of context metadata; (3) quality-driven mediation helps users specify their needs in terms of data quality; and (4) quality-driven mediation can combine datasets and generate higher quality data.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. VLDB ‘07, September 23-28, 2007, Vienna, Austria. Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.
2. DATA QUALITY AND METADATA IN DATA INTEGRATION
In [20], work was focused on the selection and ranking of data sources according to their contents and quality descriptions. A data source provides information about one or more data types (for example Air, Precipitation, etc.) with a certain quality (location, time, completeness, last update, granularity). Data types are represented as entity relationships. For example Air and Precipitations data types are represented respectively as Air(time, city, temperature, pressure) and Precipitation(time, city, rainfall). A metadata model allows to define the Source Content and Quality Description (scqd) for each data source. The overall aim is to answer queries such as ”Which data sources provide information about a given data type (e.g. Air) with a given data quality (e.g. completeness≥ 0.6 and granularity close to 1h, etc.)?”. A query engine ranks the data sources according to their data qualities into ”maximal compatibility classes” [20] and, if needed, combines them in order to satisfy the user query. Combinations, here, consist in the union operation. The situation where attributes have different qualities and partial information (subset of the attributes) from different tables need to be joined is not considered. Work in [12] examined managing data quality in federated databases. Data (tables) are supposed to be already restructured according to the same schema and ready to be integrated. A set of quality statements compare the qualities of the different tables vis a vis each quality dimension (element): accuracy, completeness, timeliness, availability, reliability and data volume. The query language is an extension to SQL, where a user query is expressed using the (Select, From, Where, With goal) clauses. The latter contains quality predicates such as most up-to-date, most accurate, most complete or most reliable. Quality predicates are considered in the same order as in the query. Only tables satisfying the first predicate are considered to start. Next, a second selection is carried out according to the second quality predicate, and so on. The final result is the union of the remaining tables. This approach is very interesting in the way that it manages ordinal, non-quantified quality information, which is often the case in real world applications. However, like the previous approach, a table is viewed as a whole ”solid” entity, and the selection of attributes from different tables is not possible. Work described in [22] defines a mediation approach allowing to answer queries, over a given mediation schema, containing quality conditions. The situation where data sources are heterogeneous and provide only partial information (i.e. a data source table provides only a subset of the attributes required by the user) is considered. However, selection of attributes from data source tables based on their qualities is not possible. The problem of selecting partial information from data sources based on the quality of attributes has been studied in [13]. The approach is based on the use of object identifiers to perform object fusion, i.e. combination (join) of partial information from different data sources. However the approach takes into consideration only three quality elements (timeliness, completeness, data accuracy). The approaches cited above provide different interesting ways of taking into consideration metadata and data quality in data integration. All of these approaches are suitable for general databases, however none of them has been designed for spatial data. 
Besides, only the approach described in [13] tackles the problem of attribute quality and selection, yet it only takes account of the timeliness, completeness and data accuracy quality elements. Our contribution consists in defining a metadata and data quality-based integration solution for geospatial data. We consider a large standard metadata
element set, applicable at both the feature-class and attribute levels. Our approach, however, differs from a simple Catalogue Service in that (1) it allows dynamic combination of data sources depending on the query, and (2) it provides a query language for querying both data and metadata.
3. DATA AND METADATA MODELS
3.1 Data Model and Schemas
In this article, we are using the abstract feature model with its GML encoding [19, 3] as recommended by the OpenGIS consortium. A GML data source provides two main elements: a GML Application Schema (GAS) and a XML document. The GAS is a XML Schema Description (XSD)[28] that describes the structure of the XML document; it represents a source schema. The GML application schema is composed of a set of feature classes extending the GML AbstractFeature class. Each of these feature classes represents a geographic class of objects. An instance of a feature class is referred to as feature of geographical object. A feature class has properties that describe geographical objects contained within the data source. The main property is the geometric property, which contains the shape of the object. Since data sources are independent, they do not use the same object identifiers. Practically, the only way to identify geographical objects, in a general way, is their geometries. Thus, we consider that the geometric property is the key property allowing object fusion, i.e. joining objects information from different sources (cf. section 4.3). In this study, object-properties, depending on their semantics, are divided into five disjoint categories: • Geometric properties (G) are the feature classes’ geometries. • Temporal properties (T) are properties that identify time instants or periods of events related to feature classes (e.g. a property specifying the date a road was made operational). • Classification properties (C) are enumerative properties used to identify a fixed number of subclasses of objects within one feature class (e.g. a property specifying the type of a health care center: Hospital, Clinic, etc.). • Quantitative properties (Q) are properties that represent measured quantities related to features (e.g. length, weight, temperature, etc.). • Non-quantitative properties (N) are qualitative properties, excluding classification ones (e.g. street names, feature descriptions, etc.). Among the various representations for GML application schemas, we use a tree representation of a GAS, this means that a schema is viewed as a tree whose nodes correspond to XSD [28] elements. A non-terminal node corresponds to a class (a complex-type element) and a leaf corresponds to an attribute (a simple type element or a geometric property).
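As a small illustration of this tree view of a GML application schema and the five property categories (G, T, C, Q, N), here is a sketch with names of our own choosing (not taken from the paper):

CATEGORIES = {"G": "geometric", "T": "temporal", "C": "classification",
              "Q": "quantitative", "N": "non-quantitative"}

# Two-level tree: the root is the feature class, the leaves are its properties.
buildID_schema = {
    "feature_class": "buildID",
    "properties": {"geom": "G", "status": "N", "function": "C", "elevation": "Q"},
}

def leaves_by_category(schema, category):
    return [p for p, c in schema["properties"].items() if c == category]

print(leaves_by_category(buildID_schema, "Q"))   # ['elevation']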
3.2 Metadata Model
A second part of the system consists of information about data, or metadata. A metadata model defines the set of metadata elements and their representation (structure and types). Several metadata models are available. However, given the growing trend among leading organizations, we believe that the ISO-19115 model [15] and its XML implementation ISO-19139 [16] guarantee the most interoperability.
Figure 1: Metadata elements and their definitions (metadata element name, short code, ISO-19115 definition):

Discovery Metadata
• Identification Information: information to uniquely identify the data
• Language (L): language used within the dataset
• Citation: citation data for the resource
• Dataset Title (DT): name by which the resource is known
• Date - Publication (DP): date of publication of the resource
• Extent Information: information about horizontal, vertical, and temporal extent
• Geographic Extent: geographic area of the dataset
• Geographic Bounding Box (BB): geographic position of the dataset
• Vertical Extent (VE): vertical domain of the dataset
• Temporal Extent (TE): time period covered by the content of the dataset
• Point of Contact: identification of, and means of communication with, person(s) and organization(s) associated with the resource
• Organisation Name ([R]ON): name of the responsible organization
• Role ([R]): function performed by the responsible party (Resource Provider [Pvd], Owner [Own], etc.)
• Reference System Information: information about the reference system
• Reference System Identifier: identifier used for reference systems
• Code (RSC): alphanumeric value identifying an instance in the namespace
• Code Space (RSCS): name or identifier of the person or organization responsible for the namespace

Data Quality Metadata
• Data Quality Information: quality information for the data specified by a data quality scope
• Lineage: information about the events or source data used in constructing the data
• Source: information about the source data used in creating the data specified by the scope
• Scale Denominator (SD): denominator of the representative fraction on a source map
• Positional Accuracy: accuracy of the position of features
• Absolute External Positional Accuracy (AEPA): closeness of reported coordinate values to values accepted as or being true
• Thematic Accuracy: accuracy of quantitative attributes and the correctness of non-quantitative attributes and of the classification of features and their relationships
• Thematic Classification Correctness (TCC): comparison of the classes assigned to features or their attributes to a universe of discourse
• Quantitative Attribute Accuracy (QAA): accuracy of quantitative attributes
• Non-quantitative Attribute Accuracy (NQAA): accuracy of non-quantitative attributes
• Temporal Accuracy: accuracy of the temporal attributes and temporal relationships of features
• Accuracy of a Time Measurement (ATM): correctness of the temporal references of an item (reporting of error in time measurement)
• Temporal Validity (TV): validity of data specified by the scope with respect to time

We define the set of queryable metadata as the subset of the metadata elements that the user is able to query in order to identify (select) the data sources he needs or to define his own context (reference system, language, etc.). The queryable metadata elements considered in this study correspond to a subset of the European core metadata for discovery recommended by CEN (the European Committee for Standardization) [4], together with a subset of the ISO 19115 quality elements. While discovery metadata provide high-level information to enable a potential user (human user or application) to find out what resources exist and what their nature and content are, quality information defines the data's fitness for use for a given application. The metadata elements considered in this study are defined in Figure 1. The organisation name element ([R]ON) is always associated with a role (prefix [R]). The roles defined by the ISO standards are: Resource Provider (prefix [Pvd]), Custodian ([Cst]), Owner ([Own]), User ([Usr]), Distributor ([Dst]), Originator ([Org]), and Point of Contact ([Ptc]). A metadata element has a scope; to simplify, we consider feature classes and properties as the only scopes. Figure 2 shows the scopes to which metadata elements are applicable. Each line corresponds to one of the metadata elements previously defined in Figure 1, the first six columns correspond to scopes, and a cell with a √ symbol means that the corresponding metadata element is applicable to the corresponding scope. The last column defines the type of the metadata element, when applicable. We divide the set of metadata elements into two categories: selective metadata and context metadata.
• Selective metadata, represented in black cells in Figure 2, allow the selection of data sources that satisfy the user's needs. For instance, if a user asks only for data provided by a given organization X, only data sources provided by X will be selected.
• Context metadata, represented in white cells in Figure 2, are metadata elements that describe the context of the data (reference system, language, etc.).
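A toy sketch of how selective metadata could drive source selection (flat metadata records and condition tests are assumptions of this sketch, not the system's actual structures):

sources = {
    "BDTQ-1": {"L": "French",  "[Pvd]ON": "Ministry of Natural Resources", "DP": 1995, "AEPA": 4.0},
    "NTDB":   {"L": "English", "[Pvd]ON": "Topographic Information Center", "DP": 1996, "AEPA": 10.0},
}

def select_sources(metadata, conditions):
    # Keep only sources whose selective metadata satisfy every user condition.
    return [name for name, md in metadata.items()
            if all(test(md.get(element)) for element, test in conditions.items())]

# Sources published since 1992 with positional accuracy of at most 5 m:
print(select_sources(sources, {"DP": lambda v: v is not None and v >= 1992,
                               "AEPA": lambda v: v is not None and v <= 5}))
# ['BDTQ-1']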
4. INTEGRATION EXAMPLE
To illustrate our mediation approach, we refer throughout this paper to an example drawn from a geographic data integration problem studied in the REV!GIS project [17]. In this section, we first define the data sources and their schemas, as well as the global schema (cf. subsection 4.1). Next, in subsection 4.2, we describe the mappings between the data source schemas and the global schema. Finally, in subsection 4.3, we describe a practical data integration problem.
4.1 Data Sources and Schemas
The example described in this article consists of three real data sources referred to as BDTQ-21E02-200-201, BDTQ-31H08-200202 (Bases de Données Topographique du Québec) [21] and NTDB21e05 (National Topographic DataBase) [6], represented by their geographic extents in Figure 3. For the sake of simplicity of notation, we refer to these data sources as BDTQ-1, BDTQ-2 and NTDB, respectively. BDTQ-1 and BDTQ-2 have the same BDTQ schema.
Figure 2: Metadata elements and their scopes and types. (Scopes: Feature Class, Geometric Property, Temporal Property, Quantitative Property, Non-quantitative Property, Classification Property. Element types, when applicable: L: String; DT: String; DP: Date; BB: Bounding Box; VE: Z Interval [meter]; TE: Time Interval; [R]ON: String; RSC: String; RSCS: String; SD: Float; AEPA: Float [meter]; TCC: Percentage; QAA: Float [property unit]; NQAA: Enumeration/Number; ATM: Float [time unit]; TV: Time Interval.)

Figure 5 summarizes the available metadata for the BDTQ and NTDB data sources. Some metadata elements have been voluntarily omitted in this figure, as no values can be associated with them (e.g. temporal validity). Lines correspond to metadata elements and columns to data schema elements (feature classes or properties); each cell contains the value of the corresponding metadata element for the corresponding data element. Conforming to the scopes defined in Figure 2, the set of metadata elements applicable to a property or a feature class depends on its category. For instance, as the description property is non-quantitative, only the Language (L), Date of Publication (DP), Resource Provider ([Pvd]ON) and Non-Quantitative Attribute Accuracy (NQAA) elements are considered for it. Empty white cells correspond to unavailable values.
The problem is to seamlessly access the various data sources described above, taking their metadata into consideration. Our solution consists in a mediation system that supplies a mediation schema, or global schema. An example of such a global schema is given in Figure 4, where class Building represents buildings. In this global schema, the Function property is a textual property providing the building's function. In Figure 4, we assign to each property of the global schema one of the categories defined in subsection 3.1.
Figure 3: Geographic extents of the data sources BDTQ-1, BDTQ-2 and NTDB around Sherbrooke, with query regions B, B′ (sub-regions B′1-B′4) and B′′.

The BDTQ schema initially consists of 25 feature classes covering several geographic themes at the scale of 1/20,000. For the sake of simplicity, we consider only the classes batimP and batimS, which represent punctual and polygonal buildings, respectively. Instances of batimP and batimS are complementary. NTDB has another schema, which consists of 196 feature classes covering several features at a scale of 1/50,000. Here again, for the sake of simplicity, we consider only the class buildID, which represents polygonal buildings. The source schemas are illustrated in Figure 4.
Figure 4: Data sources' schemas (BDTQ: batimS and batimP with properties geom, toponyme, description; NTDB: buildID with properties geom, status, function, elevation) and their correspondences with the global schema (Building: Geom (G), Toponym (N), Description (N), Status (N), Function (C), Elevation (Q)).

In the BDTQ schema, the toponyme property is a textual property representing the toponym of the building. The description property is a text field containing a description of the object, including its operational status (ruins, under construction, etc.). In the NTDB schema, the status property defines the operational status of the building; the function property is a numeric property that provides the type of the building (0: generic, 1: arena, ..., 41: educational building); and the elevation property provides the height of the building.

4.2 Schema Correspondences
The BDTQ and NTDB source schemas are semantically linked to the global schema by means of schema mappings. Mappings identify pairs of elements, one from a source schema and one from the global schema, that correspond (totally or partially). Figure 4 shows the mappings between BDTQ and the global schema and those between NTDB and the global schema. Note that a property from a source schema and one from the global schema that correspond semantically may need a mapping function to translate the values of the former into values of the latter. For instance, to translate the numeric values of the NTDB function property into textual values of the global schema's Function property, we need to specify a mapping function ϕ that maps 0 to "Generic", 1 to "Arena", 17 to "Hospital", etc.
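For example, the mapping functions ϕ and ϕ′ could be sketched as follows (only the code-to-text pairs cited in the text are shown; the "operational" default and the keyword matching in ϕ′ are assumptions of this sketch):

PHI = {0: "Generic", 1: "Arena", 17: "Hospital"}   # NTDB function codes -> global Function values

def phi(code):                                     # numeric code -> textual Function
    return PHI.get(code)

def phi_prime(description):                        # BDTQ free-text description -> Status
    for status in ("ruins", "under construction"):
        if status in description.lower():
            return status
    return "operational"

print(phi(17), "|", phi_prime("Old mill, ruins"))  # Hospital | ruins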
Figure 5: Metadata for data sources BDTQ-1, BDTQ-2 and NTDB. (Recoverable values: BDTQ-1/BDTQ-2 (batimS & batimP) — L: French; DT: BDTQ-X; DP: 1995; [Pvd]ON: Ministry of Natural Resources; RSC: WGS84; RSCS: World Geodetic System; SD: 20 000; AEPA: 4 m; TCC: 1%; the organisation associated with the toponym property is the Québec Toponym Commission. NTDB (buildID) — L: English; DT: NTDB-X; DP: 1996; [Pvd]ON: Topographic Information Center; RSC: WGS84; RSCS: World Geodetic System; SD: 50 000; AEPA: 10 m; TCC: 5%; QAA (elevation): 5 m. Bounding boxes (West, East, South, North): BDTQ-1 (-72, -71.75, 45.38, 45.5); BDTQ-2 (-72.25, -72, 45.38, 45.5); NTDB (-72, -71.5, 45.25, 45.5).)

In the same way, a class from the global schema may correspond to a class from a source schema under a given restriction (conditions over the values of one or several properties). This occurs when a full correspondence cannot be established between both classes, due to heterogeneous classifications or specifications. For instance, if the global schema had defined a feature class Hospital, instead of the Building class, then the correspondence between classes Hospital and buildID would be under the restriction σ: buildID/function = 17. Note that a source property may match several properties in the global schema. For instance, the BDTQ description property corresponds to both the Description and Status properties; another mapping function ϕ′ is needed for the correspondence with the latter.
4.3 Problem Sketch
Consider a user who is interested in information about buildings in a given region β. She may express a SQL-like query Q0:
Q0: Select all information (Geom, Toponym, Description, etc.) from Building features such that data cover region β and have been published during the last 15 years, with a geometric accuracy better than or equal to 5 meters, a high accuracy of the operational status of buildings, and English as the data language.
First of all, we must be aware that the processing of a query may differ depending on the choice of the geographic extent β. For example, if we choose region B, B′ or B′′ of Figure 3, the number of data sources that will participate in the query processing is respectively two, three or zero. The case of B′′ is trivial, since the query has no answer.

4.3.1 Case of region B
Both (and only) BDTQ-1 and NTDB are eligible over the whole surface B. Both sources satisfy the condition over the date of publication. According to the user's condition regarding positional accuracy (AEPA ≤ 5m), the geometric property must be extracted from BDTQ-1. According to the schema mappings (cf. Figure 4), the status information (Status property) can be extracted from the NTDB status property as well as from the BDTQ description property. Non-quantitative attribute accuracy (NQAA) values for the status information are not reported in the sources' metadata. Yet, it is relevant to define these values at the integrated level: the Status values extracted from the NTDB status property are in fact more accurate than those obtained from the BDTQ description property. Suppose that we assign the accuracy values High and Low respectively to the NTDB status and BDTQ description properties. Then, according to the user's query, the status information must be extracted from NTDB rather than from BDTQ-1. Finally, the language condition (L = "English") does not affect the selection of the data sources, since translations can be performed in order to conform to the language specified by the user. Note that by modifying the quality conditions we may also modify the origins of properties. For example, requiring a validation date between 1996 and 2007 instead of a geometric accuracy of at most 5m implies that the geometric property be extracted from NTDB instead of BDTQ-1.
To summarize, both the BDTQ-1 and NTDB data sources provide buildings in area B, but each provides only a part of the information required by the user. The amount of information to be extracted from each data source is determined by the user's conditions in terms of selective metadata and data quality. However, joined together, both data sources can supply the complete required information. Consequently, the answer to the user's query Q0 consists in the (full) join of the selected information from BDTQ-1 and that from NTDB within region B. In this join process, objects (features) from both sources are compared geometrically. An object from BDTQ-1 and another from NTDB are considered equivalent (i.e. they represent the same real feature) if their geometries (position and shape) correspond. In such a case, both objects are fused and we obtain a single object with the complete information (Geom, Toponym, Description, Status, Function, Elevation) required by the user. Note that the Function values (respectively the Status values), which have been obtained from data source NTDB (respectively BDTQ-1), must be translated using mapping function ϕ (respectively ϕ′) introduced in subsection 4.2.
In practice, in the fusion process, geometries of equivalent objects that come from different data sources are not exactly equal; moreover, they can have different representations. Our work in [2] defines techniques for dealing with multiple representations when performing topological joins in a mediation system. Work of the Bureau of the Census in Washington DC [24], as well as the work described in [9, 1], defines various geometric mapping techniques which compare objects' shapes as well as their positions. Geometric mapping techniques can be very costly; consequently, performing object fusion can be very costly too, especially when useless operations are not avoided. Hence, it is important to take this problem into account during query processing in order to reduce the query execution time.
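As a rough sketch of the kind of geometry-based object fusion described above (purely illustrative: a real system would use topological predicates and spatial indexes rather than the naive centroid-distance test and hypothetical field names used here):

def fuse(features_a, features_b, tol=5.0):
    # Join two feature lists on geometric equivalence.
    # Each feature is a dict with a 'centroid' (x, y) plus thematic properties;
    # two objects are taken to represent the same real-world feature when their
    # centroids are closer than `tol` (in map units).
    fused = []
    for a in features_a:
        for b in features_b:
            (xa, ya), (xb, yb) = a["centroid"], b["centroid"]
            if ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 <= tol:
                merged = {**b, **a}          # a's values win where both sources overlap
                fused.append(merged)
                break
    return fused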
4.3.2 Case of region B′
In region B′, the data sources cover different parts of the area. Sub-regions B′1, B′2, B′3 and B′4 (see Figure 3) do not involve the same data sources, but each of them can be treated in the same way as B and B′′ above. In region B′1, the result is the join of data from BDTQ-1 and NTDB. In region B′2, only NTDB is eligible; consequently, only the Geometry, Function and Elevation information can be obtained. In region B′3, no data source is eligible, and the result is empty. In region B′4, only data source BDTQ-2 is available; consequently, only the Geometry, Toponym and Description information can be obtained.

Figure 6: The VirGIS architecture (GQuery queries posed against the global schema are rewritten by the VirGIS mediator, using the mapping rules and the metadata schema, into queries over WFS-wrapped sources S1, S2, ..., Sn).
5. PROPOSED SOLUTION
As a solution to the problem presented in section 4, we define a metadata-driven mediator for vector geospatial data and metadata. Our interest in mediation comes from the need to manage a large number of data source accesses. Data sources are autonomous, can be updated more or less regularly, and evolve independently; they can be stored locally or at the integration level. For this purpose, we need a system that allows a sufficient degree of scalability and source autonomy, and mediation is an approach that satisfies these needs. In this section, we give a brief overview of the extension of the VirGIS [10] geospatial mediator to support metadata, as well as of its architecture.
In the VirGIS system, whose architecture is illustrated in Figure 6, a global data schema and a global metadata schema are defined for users. These schemas are managed by the system administrator. Data are distributed and stored in their sources, which are accessible through Web Feature Services (WFS) [29]. A user poses a query over the global schema, and the role of the mediator is to rewrite the user's query into queries over the data sources. For this purpose, the mediator uses the mapping rules (correspondences) that link the global schema to the source schemas. All of this is transparent to the user: the user does not know where and how data are stored or how the mediator manages to retrieve data from the data sources.
The VirGIS mediation system uses a set of mapping rules that describe correspondences between the global and local sources. A mapping rule associates pairs of corresponding elements (from the global schema and a source schema) together with their metadata. These rules allow the mediator (i) to select the appropriate sources and data for the user's query and (ii) to rewrite the user's query into queries over the data source schemas. In order to ensure expressiveness and to handle classification, structural and semantic conflicts, we define a mapping language that supports restrictions and conversion functions. Mappings are organized in tree-structured mapping rules. A mapping rule has the same structure as the global schema and describes each element as corresponding to one or more element(s) of the source schema tree, possibly with a mapping function. This tree structure of the mapping rules facilitates the search for correspondents during query rewriting; it also plays the role of an implicit index, allowing efficient traversal of the mappings when seeking correspondences. In order to facilitate query rewriting and to ensure system extensibility (addition and deletion of data sources), a mapping rule describes a part (subtree) of the global schema according to only one data source schema. Combining information from various data sources is performed automatically by the mediator, using object geometries as identifiers.
6. METADATA-DRIVEN INTEGRATION
DEFINITION 1 (INTEGRATION SYSTEM). A (metadata-driven) integration system is defined as a tuple I = (G, S, M), where G is a global schema, S = {Si, i ∈ [1..n]} is the set of source schemas (also called local schemas), and M = {Mi, i ∈ [1..n]} is the set of their respective descriptions, i.e. for each i ∈ [1..n], Mi is the set of mapping rules between Si and G. As mentioned in section 3.1, a data schema (i.e. the global schema or a source schema), implemented as a GML application schema, can be represented as a tree structure. For the sake of simplicity, let us consider two-level trees, in which the root node corresponds to the feature class (geographic class) and the leaves correspond to its properties (geometric and thematic properties). This remains sufficient in most practical cases today, since most GML data found in practice are generated from data that are initially stored in simple relational tables, making the GML structure very similar to the original table. A source description specifies the mappings between the global schema and the source schema. Source descriptions are used by the mediator during query rewriting. In order to ensure expressive and generic correspondences between schemas, we use mapping functions (such as restrictions, concatenation, conversion functions, geometrical or topological functions, specific administrator-defined functions, etc.).
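To make Definition 1 concrete, here is a minimal sketch of how an integration system I = (G, S, M) and two-level schema trees might be represented (names and structure are ours, for illustration only):

from dataclasses import dataclass, field

@dataclass
class FeatureClassSchema:      # two-level tree: root = feature class, leaves = properties
    name: str
    properties: dict = field(default_factory=dict)   # property name -> category in {G, T, C, Q, N}

@dataclass
class IntegrationSystem:       # I = (G, S, M)
    G: FeatureClassSchema      # global schema
    S: list                    # source schemas S_1..S_n
    M: list                    # M_i = mapping rules between S_i and G

building = FeatureClassSchema("Building", {
    "Geom": "G", "Toponym": "N", "Description": "N",
    "Status": "N", "Function": "C", "Elevation": "Q",
})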
6.1 Mappings
DEFINITION 2 (MAPPING). We define a mapping as an expression of the following form:

E ←−µ f(e)

where E is an element of the global schema, e is an element of the source schema (identified by its absolute or relative path), f
is a mapping function (conversion function or restriction), and µ is the metadata vector describing the values or instances of E as obtained from e. E is called the target of the mapping, e is its source, f is its mapping function, and µ is its metadata assertion. For each metadata element d applicable to E, µ[d] is the value of d for element E in its correspondence with e. For example, the mapping between the global schema's Function property and the NTDB function property is expressed as follows:

Function ←−µ ϕ(function)

where ϕ is the mapping function introduced in subsection 4.2, and µ is the metadata vector defined by: µ[L] = "English", µ[PvdON] = "Topographic Information Center", µ[TCC] = 5%.

Figure 7: Mapping rules for BDTQ-1, NTDB and BDTQ-2 (rules R1/R2 map Building to the BDTQ-1 classes, R3 maps Building to NTDB/buildID, and R4/R5 map Building to the BDTQ-2 classes; each mapping carries its metadata assertion, e.g. language, publication date, bounding box b1/b2/b3, provider, AEPA, SD, TCC, QAA or NQAA values).
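A minimal sketch of how such a mapping with its metadata assertion µ could be represented (illustrative only; the class and field names are ours):

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Mapping:
    target: str                       # element E of the global schema
    source: str                       # element e of the source schema (path)
    f: Callable = lambda v: v         # mapping function (conversion or restriction)
    mu: dict = field(default_factory=dict)   # metadata assertion: element -> value

function_mapping = Mapping(
    target="Building/Function",
    source="buildID/function",
    f=lambda code: {0: "Generic", 1: "Arena", 17: "Hospital"}.get(code, "Generic"),
    mu={"L": "English", "[Pvd]ON": "Topographic Information Center", "TCC": 0.05},
)
print(function_mapping.f(17))   # "Hospital"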
6.2 Mapping Rules
Mappings associated with the same source are organized in tree-structured mapping rules. A mapping rule has the same structure as the global schema and associates with each element an element of the source schema tree. An example of this tree structure is given in Figure 7, which illustrates mapping rules R1 (for BDTQ-1/batimS), R3 (for NTDB/buildID), and R4 (for BDTQ-2/batimS). Mapping rules R2 and R5, for BDTQ-1 and BDTQ-2, are similar respectively to R1 and R4. In this figure, b1, b2 and b3 refer respectively to the BDTQ-1, BDTQ-2 and NTDB bounding boxes. For implementation purposes, we use XML to encode mappings. Figure 8 shows the XML version of mapping rule R3 between classes Building and buildID of NTDB and their sub-elements.
7. QUERY LANGUAGE
The VirGIS system uses the GQuery [7] query language. GQuery is based on XQuery [5] and allows users to perform both attribute and spatial queries against GML [3, 19] documents. Spatial semantics is extracted and interpreted from XML tags without affecting the XQuery syntax, i.e. a spatial query is a pure FLWR (Flower) expression and hence looks like a regular XQuery. GQuery may be used either as a standalone program, or as a query language in the context of an integration system, to query several distributed heterogeneous GIS data sources, as it is currently deployed in VirGIS.
In order to handle metadata in user queries, we extended GQuery with a SUCH THAT clause, which contains metadata conditions. For the sake of simplicity of presentation, we focus in this paper on elementary queries that aim at extracting the objects of one given feature class under metadata conditions. A general (complex) query can be written as a combination (join, union and operation) of such elementary queries; in the VirGIS system, a preliminary decomposition phase [10] produces elementary queries from a complex query. For example, query Q0, defined in subsection 4.3, is an elementary query since it uses only the Building feature class. Q0 can be written in GQuery as follows:

for $x in document(Building)
such that BB($x) = b AND DP($x) >= 1992 AND AEPA($x/Geom) <= 5
          AND NQAA($x/Status) = High AND L($x) = "en"
return $x

We define the query tree AQ of a given query Q as the subtree extracted from the global schema tree whose nodes are present in the expression of query Q. (This is not the standard query graph that describes the algebraic operations involved in a query.) For example, the query tree associated with Q0 is the whole global schema tree illustrated in Figure 4.
8. QUERY PROCESSING
Query rewriting is the process of reformulating the user's query (expressed according to the global schema and including metadata conditions) into a combination of queries over the data sources. The aim of this process is to construct a result that satisfies the user's data quality and metadata conditions. In our approach, query rewriting is performed in four steps: (1) decomposition of the user's initial query into elementary sub-queries, which leads to a Global Execution Plan (GEP); (2) correspondence discovery and sub-query reformulation over local sources; (3) space partitioning, which leads to Sectoral Execution Plans (SEP); and (4) execution of the Final Execution Plan (FEP). In the next subsections, we describe these steps.

8.1 Obtaining a Global Execution Plan
The global execution plan is the result of the decomposition of the user query into a set of elementary sub-queries expressed in terms of the global schema [10]. In fact, data sources are queried through a WFS interface. Although providing easy access to any geographic repository, WFS is unable to handle complex queries or geographic data integration on its own. Thus we build execution plans as a combination (join, union and operation) of elementary queries, which are queries posed on a single feature.

8.2 Extraction of Correspondences and Reformulation

8.2.1 Extraction of Correspondences
The aim of this phase is to explore the source descriptions in order to identify the relevant sources and the interesting mappings for each elementary sub-query. The result is a source evaluation for each sub-query and data source, showing how to express the sub-query, fully or partially, over the source schema.

DEFINITION 3 (EVALUATION AND BINDINGS). For an elementary query Q and a description M of a data source s, we define the evaluation of Q over s, denoted E(Q, M), as the maximal (possibly empty) set of bindings of query Q over data source s. A binding is a subtree of a mapping rule that describes the maximum subset of elements required by the query.

Note that if a binding provides mappings for all the elements required by Q, then it is called a full binding. The process of computing bindings and evaluations is facilitated by the tree structure of the mapping rules: it consists in comparing the query tree to each mapping rule, starting from their root nodes. Selective metadata are taken into account in order to select only data sources and elements that satisfy the user's needs in terms of metadata conditions. Given a source description M and a query Q, we traverse all the mapping rules within M. A mapping rule is taken into consideration if it describes the same feature class as required in query Q (with at least one of the required properties), and with satisfying metadata values (with respect to the selective metadata conditions specified by the user). If the value of a required selective metadata element is unknown for a given mapping, then this mapping is rejected.
For instance, let us consider query Q0 defined in subsection 4.3, having as geographic extent the rectangle β = B′ defined in Figure 3. In Figure 7, the white subtrees of the mapping rules correspond to the bindings obtained for query Q0 over the corresponding data sources; parts represented in gray correspond to the pruned subtrees, and dark gray cells correspond to the metadata values that violate the metadata conditions required by query Q0. While parsing mapping rule R1 (idem for R2), illustrated in Figure 7, we conclude that all mappings are relevant and satisfy the query's selective metadata conditions, except the one for element Status. Consequently, the evaluation of query Q0 over data source BDTQ-1 is ε = {R′1, R′2}, where R′1 and R′2 are respectively the subtrees of R1 and R2 that exclude the mappings for the Status element. Subtrees R′1 and R′2 are represented in white in Figure 7. In the same way, by parsing mapping rule R3, we find that all mappings are relevant and satisfy the query's selective metadata conditions, except the one for the geometric property. However, in this instance we retain the mapping: the geometric property is the key property for object fusion, and we take it into consideration only in order to join results from different data sources. Consequently, the evaluation of query Q0 over data source NTDB is ε′ = {R3}. Finally, the evaluation of query Q0 over data source BDTQ-2 is the empty set, ε′′ = ∅.

Figure 8: Mapping rule between classes Building and buildID (XML encoding of rule R3; it lists the source feature class NTDB21e05/buildID, its bounding box, and property mappings such as Function ← ϕ(function) with the code-to-text pairs 0: Generic, 1: Arena, ...).
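A toy sketch of the binding computation just described (hypothetical flat structures; the actual system works on tree-structured mapping rules and GML documents):

def evaluate(query_props, metadata_conditions, mapping_rules):
    # mapping_rules: list of dicts {property: {"source": ..., "mu": {...}}} for one feature class.
    # A mapping is kept only if every selective condition on it is satisfied;
    # unknown values cause rejection, except for the geometric key property 'Geom'.
    bindings = []
    for rule in mapping_rules:
        kept = {}
        for prop in query_props:
            m = rule.get(prop)
            if m is None:
                continue
            ok = all(m["mu"].get(k) is not None and test(m["mu"][k])
                     for k, test in metadata_conditions.get(prop, {}).items())
            if ok or prop == "Geom":      # geometry is always retained as the fusion key
                kept[prop] = m
        if kept:
            bindings.append(kept)
    return bindings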
8.2.2 Query Reformulation
Once evaluations are performed, it is possible to reformulate the (elementary) query in terms of the source schemas. Given a source evaluation E(Q, M) = {B1, · · · , Bm} of query Q over source s, each binding Bi, i ∈ [1..m], results in a reformulated query qi. The reformulated query qi is obtained by: 1. replacing the elements of query Q by their correspondents according to binding Bi; 2. adding the restrictions resulting from the mappings to the WHERE clause; 3. considering the query's context metadata conditions and adding the resulting context conversion operations.
It is possible to disregard this phenomenon, because, by considering all relevant datasets, the comparisons of two objects ”belonging to different zones” fails naturally and the objects will not be joined. Nevertheless, it is significant to take into account this space partitioning in order to avoid useless computations and comparisons while trying to join information of an area to those of a disjoined area. The aim of this technique, studied in [11], is to make efficient topological join operations in the object fusion process. The global idea of this technique is to benefit from the fact that data sources are distributed, increase the number of linear operations (in terms of complexity) that are performed by data sources (simple object extraction), hence can be parallelized, and decrease the number of complex operations (topological joins) to be performed at the mediator level.
8.3.1
The reformulated query over data source s, denoted by Qs , is then the union of queries q1 , · · · , qm : Qs = q1 ∪ · · · ∪ qm . To illustrate, let us consider query Q0 and its evaluations ε = {R01 , R02 } and ε0 = {R3 } respectively over BDTQ-1 and NTDB. By reformulating Q0 using mapping rules R01 , we obtain the following query q1 : for $x in document(BDTQ-21ee201/batimS) return $x/geom fr2en($x/toponym) fr2en($x/description)
Figure 9: Space partitioning example (the query bounding box b is partitioned into a grid by the limits of the source bounding boxes b1, b2 and b3; the numbers in each sector indicate which of the sources 1, 2 and 3 are relevant there).

The function fr2en is an operation that performs French-to-English translation. The reformulation of query Q0 using R′2 results in a query q2 very similar to q1, except for the name of the feature class (batimP instead of batimS). The reformulation of Q0 using binding R3 results in the query q3, defined as follows:
for $x in document(NTDB-21e05/buildID) return $x/geom $x/elevation phi($x/function) $x/status
8.3.2 Sectoral Execution Plans
By partitioning the space (the query bounding box), we obtain m sectors Σ1, · · · , Σm. For each sector, only a subset of the data sources is relevant (see Figure 3 for instance). Hence, we create an execution plan per sector, called a sectoral execution plan, which includes only the reformulated queries over the relevant data sources. For example, if we consider query Q0 with bounding box B′, defined in Figure 3, then in sector B′1 both reformulated queries Q0′ and Q0′′ over BDTQ-1 and NTDB are considered in order to establish the sectoral execution plan, whereas in sector B′2 only Q0′′ is considered. Consider a sector Σ and s1, · · · , sk the relevant datasets over Σ. Let Q1, · · · , Qk be the respective reformulated queries over s1, · · · , sk. Then the sectoral execution plan QΣ over sector Σ is the (full outer) join of Q1, · · · , Qk:
Finally, the reformulated queries of Q0 over data sources BDTQ-1 and NTDB are respectively Q0′ = q1 ∪ q2 and Q0′′ = q3.
8.3 Space Partitioning
At the start of this phase we have a set of reformulated queries q1, · · · , qn over relevant data sources s1, · · · , sn. Each data source si, i ∈ [1..n], has as geographic extent a bounding box bi. The space partitioning consists of dividing the query bounding box b into smaller sectors. Different partitioning strategies can be used; for example, one can divide the space into m isometric rectangles. The strategy used in the work described in this article consists of partitioning the space b into a grid (horizontally and vertically) according to all the horizontal and vertical limits of the data sources' bounding boxes b1, · · · , bn that fall within the query bounding box. For instance, the reader can refer to sectors B′1, B′2, B′3 and B′4 of Figure 3. Figure 9 shows a more complex example; in each sector, the numbers refer to the data sources that are relevant to this sector.
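An illustrative sketch of this grid partitioning strategy (bounding boxes are assumed to be (west, east, south, north) tuples; the function and variable names are ours):

def partition(query_bbox, source_bboxes):
    # Split the query bounding box into grid sectors along every horizontal and
    # vertical limit of the source bounding boxes that falls inside it, and record
    # which sources overlap each sector.
    w, e, s, n = query_bbox
    xs = sorted({w, e} | {x for (a, b, _, _) in source_bboxes.values() for x in (a, b) if w < x < e})
    ys = sorted({s, n} | {y for (_, _, c, d) in source_bboxes.values() for y in (c, d) if s < y < n})
    sectors = {}
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            relevant = [name for name, (a, b, c, d) in source_bboxes.items()
                        if a < x1 and b > x0 and c < y1 and d > y0]
            sectors[(x0, y0, x1, y1)] = relevant
    return sectors

sources = {"BDTQ-1": (-72, -71.75, 45.38, 45.5), "NTDB": (-72, -71.5, 45.25, 45.5)}
print(partition((-72.1, -71.6, 45.3, 45.45), sources))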
Sectoral Execution Plans
As seen in Section 4.3, depending on its bounding box, a query Q may create different sectors, each of them involving a different set of datasets and calling for a different method of evaluation (cf. the example of bounding box B′ in Figure 3).
QΣ = Q1 ⋈ · · · ⋈ Qk .
35
Final Execution Plan
is: Ts = mN + (n − 1)Tj(N/m). Note that the effect of partitioning is to increase the number of data source calls (whose complexity is linear in the number of objects) and to decrease the cost of object fusion (whose temporal complexity is higher). This is beneficial in practice, as will be shown in Section 10.
The final execution plan of a query consists of the union of all sectoral execution plans. In the execution phase, each sector is considered separately in order to avoid useless computations (fusion of objects from different sectors). In each sector, queries over data sources that reside on different (WFS) servers are executed in parallel. The join (fusion) process is performed at the mediator level. Note that although the original data sources did not fit the user's requirements when considered separately, the query processing algorithm helped find a "good" combination of data sources of satisfying quality for the user. Also, context metadata allow us to identify the context of the data, such as the data language. By taking such metadata elements into consideration we can perform context mediation; for instance, we can automatically call a translator from the data source's language to the user's language.
9. PROCESS COMPLEXITY
The complexity of the whole query processing is evaluated through two main factors: (1) the complexity of the query rewriting, and (2) that of the query execution. In fact, the total time for query processing is the sum of the time needed for the query rewriting and the time needed for the execution of the reformulated queries and the join of their results. The execution time of the query rewriting process depends mainly on the time needed for the extraction of correspondences and that needed for the computation of the sectoral execution plans. The former varies bi-linearly with the number of mapping rules and the number of properties of the feature class required by the query. The latter varies bi-linearly with the number of generated sectors and the number of reformulated queries. This makes the query rewriting process quite fast. The process is made even quicker by the selection based on metadata conditions, which reduces the number of relevant data sources. Nevertheless, the limiting time is that of the query execution, since this process might involve large datasets and a large number of object fusion (join) operations. In this section, we study the theoretical complexity of the query execution using space sectoring versus the classical approach (without sectoring). Let n be the number of relevant data sources. Suppose that the number of objects in a data source or in the queried area is bounded above by a certain number N. Let m be the number of sectors that have been generated. The average number of objects per sector is then of the order of N/m. Let Tn be the temporal complexity of the query execution without space partitioning and Ts that with space partitioning. Given two datasets, the size of each being bounded by x, the complexity of the spatial join of their objects depends on the join algorithm that is used: it is quadratic (x²) with a naive join, and cannot be better than x log(x) (with spatial indexing). Let Tj(x) be the complexity of this join process. Without partitioning, we perform the join of the results of the n data sources. The complexity of this process is (n − 1)Tj(N). Extracting data from the data sources can be done in parallel and is linear in the number of objects per data source. Hence the complexity of the query execution without space partitioning is: Tn = N + (n − 1)Tj(N).
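To give a feel for the gain, here is a small worked example using the paper's cost expressions Tn = N + (n − 1)Tj(N) and Ts = mN + (n − 1)Tj(N/m); the concrete values (a naive quadratic join, n = 4 sources, N = 10^5 objects, m = 10 sectors) are illustrative assumptions, not measurements from the paper.

\[
T_j(x) = x^2,\qquad n = 4,\qquad N = 10^5,\qquad m = 10
\]
\[
T_n = N + (n-1)\,T_j(N) = 10^5 + 3\cdot 10^{10} \approx 3.0\cdot 10^{10}
\]
\[
T_s = mN + (n-1)\,T_j\!\left(\tfrac{N}{m}\right) = 10^6 + 3\cdot 10^8 \approx 3.0\cdot 10^8
\]

Under these assumptions the sectored plan is roughly two orders of magnitude cheaper, which matches the qualitative behaviour reported in Section 10.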
[Figure 10 plots execution time against query area, with and without space partitioning; each measurement point is labeled with (number of sources, number of sectors).]
Figure 10: Query execution times with and without space partitioning

As shown in Figure 10, the performance gain increases dramatically when the size of the query's bounding box, and consequently the number of involved data sources and sectors, increases. While the execution time becomes prohibitive for the simple query processing algorithm as the number of data sources increases, that of the space-aware algorithm remains reasonable even in extreme cases (a huge number of data sources).
11.
Using the space partitioning, we call each data source at most m times (if the data source covers all sectors). Then, in each sector, we perform a join of at most n datasets of N/m objects each. This means that the complexity of the query execution with space partitioning is Ts = mN + (n − 1)Tj(N/m).
10. RESULTS
We validated our metadata-driven mediation approach with experiments evaluating the query execution time. We were particularly interested in the effect of space partitioning, using the spatial extent metadata parameter, on the query execution time. For this purpose, we used the SEQUOIA 2000 regional benchmark point data [27]. Out of this initial data we created smaller, overlapping data sources of different sizes. Next, we generated queries with different bounding boxes and processed them using both the simple query rewriting algorithm and the space-aware one. Each resulting execution plan was executed 10 times and the average execution time was recorded. We studied the influence of the number of sectors, the number of data sources and the size of the data on the execution time for both algorithms. Figure 10 shows the variation of the execution time for both the simple and the space-aware algorithms, depending on the size of the query's bounding box. The size of each data source is fixed to 4% of the whole space covered by the data.
8.4
CONCLUSION
Although the role of metadata and quality information in the development of GIS applications is well acknowledged, most (geographic) data integration approaches do not take metadata and quality information into account in the integration process. Most
36
of the time they assume that the data have been brought to equal quality, for instance by means of some data cleaning mechanism. In this paper, we described a metadata- and quality-driven mediation approach and system that we developed in the context of a GIS integration project. We believe that our approach has three main advantages. First, it allows the dynamic combination of data sources in accordance with the metadata requirements, with the purpose of providing the user with results that satisfy their needs in terms of data quality and which are consequently fit for use in their application. Second, it leverages existing metadata in order to optimize data access, significantly reducing the execution time of mediated queries. Third, semantic interoperability is facilitated by means of context metadata. The paper covers only a subset of the quality parameters defined in the ISO standards; future work should address further quality parameters such as completeness and logical consistency, amongst others.
12. ACKNOWLEDGMENTS

This paper was improved by the proofreading of Cathal O'Mahony (Coastal and Marine Resources Centre).

13. REFERENCES
[1] A. B. H. Ali and F. Vauglin. Geometric Matching of Polygons in GIS and assessment of Geometrical Quality of Polygons. In M. G. Wenzhong Shi and P. Fisher, editors, Proceedings of the International Symposium on Spatial Data Quality'99, pages 33–43, Hong Kong Polytechnic Univ, 1999.
[2] A. Belussi, O. Boucelma, B. Catania, Y. Lassoued, and P. Podestà. Towards Similarity-based Topological Query Languages. In Query Languages and Query Processing (QLQP 2006), Munich, Germany, March 2006.
[3] D. S. Burggraf. Geography markup language. Data Science Journal, 5:178–204, October 2006. http://www.jstage.jst.go.jp/article/dsj/5/0/178/pdf.
[4] CEN. Geographic information – European core metadata for discovery. European Committee for Standardization, The Netherlands, 2005.
[5] D. Chamberlin, D. Florescu, J. Robie, J. Siméon, and M. Stefanescu. XQuery: A Query Language for XML. W3C, 2000. http://www.w3.org/TR/xmlquery.
[6] CIT. Centre for Topographic Information. http://www.cits.rncan.gc.ca/, 2007. Accessed 13 January 2007.
[7] F.-M. Colonna and O. Boucelma. Querying GML Data. In Proc. 1ère Conférence en Sciences et Techniques de l'Information et de la Communication (CoPSTIC'03), Rabat, December 2003.
[8] C. Cranston, F. Brabec, G. Hjaltason, D. Nebert, and H. Samet. Interoperability via the Addition of a Server Interface to a Spatial Database: Implementation Experiences with OpenMap. In interop99, Zurich, March 1999.
[9] T. Devogele. Processus d'Intégration et d'Appariement des Bases de Données Géographiques. Application à une base de données routière multi-échelles. PhD thesis, Institut Géographique National, 1997.
[10] M. Essid, O. Boucelma, F.-M. Colonna, and Y. Lassoued. Query Processing in a Geographic Mediation System. In ACM GIS 2004, Washington D.C., 12–13 November 2004.
[11] M. Essid, Y. Lassoued, and O. Boucelma. Processing Mediated Geographic Queries: a Space Partitioning
Approach. In The 10th AGILE International Conference on Geographic Information Science (AGILE'2007), Aalborg, Denmark, May 2007.
37
[12] M. Gertz. Managing Data Quality and Integrity in Federated Databases. In Proceedings of the IFIP Second Working Conference on Integrity and Internal Control in Information Systems, pages 211–230. Kluwer, B.V., 1998.
[13] M. Gertz and I. Schmitt. Data Integration Techniques based on data Quality Aspects. In E. H. I. Schmitt, C. Türker and M. Höding, editors, Proceedings 3. Workshop Föderierte Datenbanken, pages 1–19, Magdeburg, December 1998. Shaker Verlag.
[14] W. H. Inmon. Building the Data Warehouse. John Wiley & Sons, Inc., 2nd edition, 1996.
[15] ISO. ISO 19115:2003 – Geographic Information – Metadata, May 2003.
[16] ISO. ISO/PRF TS 19139 – Geographic information – Metadata – XML schema implementation, October 2006.
[17] R. Jeansoulin. The REV!GIS project. http://www.lsis.org/REVIGIS/, 2000–2004.
[18] H. Kemppainen. Designing a Mediator for Managing Relationships between Distributed Objects. In INTEROP '99: Proceedings of the Second International Conference on Interoperating Geographic Information Systems, pages 253–263. Springer-Verlag, 1999.
[19] R. Lake, D. S. Burggraf, M. Trninić, and L. Rae. Geography Mark-Up Language (GML) – Foundation for the Geo-Web. John Wiley, 2004.
[20] G. A. Mihaila, L. Raschid, and M. E. Vidal. Source Selection and Ranking in the WebSemantics Architecture Using Quality of Data Metadata. Advances in Computers, 55, July 2001.
[21] MRNF. Base de données topographiques du Québec (BDTQ). http://www.usherbrooke.ca/biblio/numstat/DonneesGeo/bdtq.htm, 2006. Accessed 13 January 2007.
[22] F. Naumann, U. Leser, and J. C. Freytag. Quality-driven Integration of Heterogeneous Information Systems. In Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.
[23] D. Nebert and A. Whiteside. OGC Catalogue Services Specification. Open Geospatial Consortium Inc., May 2005.
[24] B. Rosen and A. Saalfeld. Match Criteria for Automatic Alignment. In Proceedings of Auto Carto 7, pages 1–20, 1985.
[25] J. Samos, F. Saltor, J. Sistac, and A. Bardés. Database Architecture for Data Warehousing: An Evolutionary Approach. In Proceedings of the 9th International Conference on Database and Expert Systems Applications, pages 746–756. Springer-Verlag, 1998.
[26] S. Shimada and H. Fukui. Geospatial Mediator Functions and Container-Based Fast Transfer Interface in Si3CO Test-Bed. In Proceedings of the Second International Conference on Interoperating Geographic Information Systems, pages 265–276. Springer-Verlag, 1999.
[27] M. Stonebraker, J. Frew, K. Gardels, and J. Meredith. The SEQUOIA 2000 Storage Benchmark. In SIGMOD '93: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 2–11, New York, USA, 1993.
[28] E. van der Vlist. XML Schema. O'Reilly, Sebastopol, California, June 2002.
[29] P. A. Vretanos. Web Feature Service Implementation Specification. Open Geospatial Consortium Inc., May 2005. [30] G. Wiederhold. Mediators in the Architecture of Future Information Systems. IEEE Computer, pages 38–49, March 1992.
38
On the performance of one-to-many data transformations

Paulo Carreira
Helena Galhardas
João Pereira
University of Lisbon
Technical University of Lisbon
Technical University of Lisbon
[email protected]
[email protected]
[email protected]
Fernando Martins
Mário J. Silva
University of Lisbon
University of Lisbon
[email protected]
[email protected]
ABSTRACT
type of data heterogeneity arises when data is represented in the source and in the target using different aggregation levels. For instance, source data may consist of salaries aggregated by year, while the target data consists of salaries aggregated by month. In this case, the data transformation that takes place is frequently required to produce several tuples in the target relation to represent each tuple of the source relation. Currently, one-to-many data transformations are implemented resorting to one of the following alternatives: (i) using a programming language, such as C or Java, (ii) using an ETL tool, which often requires the development of proprietary data transformation scripts; or (iii) using an RDBMS extension like recursive queries [26] or table functions [15]. In this paper we investigate the adequacy of RDBMSs for expressing and executing one-to-many data transformations. Implementing data transformations in this way is attractive since the data is usually stored in an RDBMS. Therefore, executing the data transformation inside the RDBMS appears to be the most efficient approach. The idea of adopting database systems as platforms for running data transformations is not revolutionary (see, e.g., [20, 5]). In fact, Microsoft SQL Server and Oracle, already include additional software packages that provide specific support for ETL tasks. But, as far as the authors are aware of, no experimental work to compare alternative RDBMS implementations of one-to-many data transformations has been undertaken. The main contributions of our work are the following:
Relational Database Systems often support activities like data warehousing, cleaning and integration. All these activities require performing some sort of data transformations. Since data often resides on relational databases, data transformations are often specified using SQL, which is based on relational algebra. However, many useful data transformations cannot be expressed as SQL queries due to the limited expressive power of relational algebra. In particular, an important class of data transformations that produces several output tuples for a single input tuple cannot be expressed in that way. In this paper, we analyze alternatives to process one-to-many data transformations using Relational Database Management Systems, and compare them in terms of expressiveness, optimizability and performance.
1. INTRODUCTION
In modern information systems, an important number of activities rely, to a great extent, on the use of data transformations. Well known applications are the migration of legacy data, ETL (extract-transform-load) processes supporting data warehousing, data cleaning processes and the integration of data from multiple sources [25]. Declarative query languages propose a natural way of expressing data transformations as queries (or views) over the source data. Due to the broad adoption of RDBMSs, the language of choice is SQL, which is based on Relational Algebra (RA) [12]. Unfortunately, the limited expressive power of RA hinders the use of SQL for specifying important classes of data transformations [3]. A class of data transformations that may not be expressible in RA corresponds to the so called one-to-many data transformations [10], which are characterized by producing several output tuples for each input tuple. One-to-many data transformations are required for addressing certain types of data heterogeneities [33]. One familiar
• we arrange one-to-many data transformations into subclasses using the expressive power of RA as dividing line; • we study different possible implementations for each sub-class of one-to-many data transformations; • we conduct an experimental comparison of alternative implementations, identifying relevant factors that influence the performance and optimization potential of each alternative.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. VLDB ‘07, September 23-28, 2007, Vienna, Austria. Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.
The remainder of the paper is organized as follows: In Section 2 we further motivate the reader and introduce the two sub-classes of one-to-many data transformations by example. In Section 3, we focus on the implementation possibilities of the sub-classes of one-to-many data transformations.
39
The experimental assessment is carried out in Section 4. Related work is reviewed in Section 5 and Section 6 presents the conclusions of the paper.
2. MOTIVATION
We now motivate the concept of one-to-many data transformations by introducing two examples based on real-world problems.
Example 2.1: Consider a relational table LOANEVT that, for each given loan, keeps the events that occur from the establishment of the loan contract until it is closed. A loan event consists of a loan number, a type and several columns with amounts. For each loan and event, one or more event amounts may apply. The field EVTYP maintains the event type, which can be OPEN when the contract is established, PAY meaning that a loan installment has been paid, EARLY when an early payment has been made, FULL meaning that a full payment was made, or CLOSED meaning that the loan contract has been closed. In the target table named EVENTS, the same information is represented by adding one row per event with the corresponding amount. An event row is added only if the amount is greater than zero.

Relation LOANEVT
LOANNO  EVTYP   CAPTL   TAX   EXPNS  BONUS
1234    OPEN       0.0  0.19   0.28    0.1
1234    PAY     1000.0  0.28   0.0     0.0
1234    PAY     1250.0  0.30   0.0     0.0
1234    EARLY    550.0  0.0    0.0     0.0
1234    FULL    5000.0  1.1    5.0     3.0
1234    CLOSED     0.0  0.1    0.0     0.0

Relation EVENTS
LOANNO  EVTYPE  AMTYP   AMT
1234    OPEN    TAX     0.19
1234    OPEN    EXPNS   0.28
1234    OPEN    BONUS   0.1
1234    PAY     CAPTL   1000
1234    PAY     TAX     0.28
1234    PAY     CAPTL   1250
1234    PAY     TAX     0.30
1234    EARLY   CAPTL   550
1234    FULL    CAPTL   5000
1234    FULL    TAX     1.1
1234    FULL    EXPNS   5.0
1234    FULL    BONUS   3.0
1234    CLOSED  EXPNS   0.1

Figure 1: Illustration of a bounded one-to-many data transformation: source relation LOANEVT for loan number 1234 (at the top) and the corresponding target relation EVENTS (at the bottom).
Clearly, in the data transformation described in Example 2.1, each input row of the table LOANEVT corresponds to several output rows in the table EVENTS (see Figure 1). Moreover, for a given input row, the number of output rows depends on whether the contents of the columns CAPTL, TAX, EXPNS and BONUS are positive. Thus, each input row can result in at most four output rows. This means that there is a known bound on the number of output rows produced for each input row. We designate these data transformations as bounded one-to-many data transformations. However, in other one-to-many data transformations such a bound cannot always be established a priori, as shown in the following example:
Relation LOANS
ACCT   AM
12     20.00
3456   140.00
901    250.00
Example 2.2: Consider the source relation LOANS[ACCT, AM] (represented in Figure 2) that stores the details of loans per account. Suppose LOANS data must be transformed into PAYMENTS[ACCTNO, AMOUNT, SEQNO], the target relation, according to the following requirements: 1. In the target relation, all the account numbers are left padded with zeroes. Thus, the attribute ACCTNO is obtained by (left) concatenating zeroes to the value of ACCT. 2. The target system does not support payment amounts greater than 100. The attribute AMOUNT is obtained by breaking down the value of AM into multiple parcels with a maximum value of 100, in such a way that the sum of amounts for the same ACCTNO is equal to the source amount for the same account. Furthermore, the target field SEQNO is a sequence number for the parcel, initialized at one for each sequence of parcels of a given account.
Relation PAYMENTS
ACCTNO  AMOUNT  SEQNO
0012    20.00   1
3456    100.00  1
3456    40.00   2
0901    100.00  1
0901    100.00  2
0901    50.00   3
Figure 2: Illustration of an unbounded data transformation: the source relation LOANS (on the left) and the corresponding target relation PAYMENTS (on the right).

Unlike in Example 2.1, the upper bound on the number of output rows cannot be determined by analyzing the data transformation specification. We designate these data transformations as unbounded one-to-many data transformations. Other sources of unbounded data transformations exist: for example, converting the collection-valued attributes of SQL:1999 [26], where each element of the collection is mapped to a distinct row in the target table. A common data transformation in the context of data cleaning consists of converting a set of values encoded as a string attribute with a varying number of elements into rows. This data transformation is unbounded because the exact number of output rows can only be determined by analyzing the string.
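To make the string-splitting case concrete, the following sketch (not from the paper) unnests a comma-separated attribute with a SQL:1999 recursive query. The table TAGS(ID, CSV) and the delimiter handling are assumptions made purely for illustration, and some systems (e.g., DB2) omit the RECURSIVE keyword.

-- Hypothetical source: TAGS(ID, CSV), where CSV holds a comma-separated list of values.
-- Each element becomes one output row, so the number of output rows per input row
-- is only known after inspecting the string, i.e., the transformation is unbounded.
with recursive split(ID, ITEM, REST) as (
  select ID,
         case when position(',' in CSV) = 0 then CSV
              else substring(CSV from 1 for position(',' in CSV) - 1) end,
         case when position(',' in CSV) = 0 then ''
              else substring(CSV from position(',' in CSV) + 1) end
  from TAGS
  union all
  select ID,
         case when position(',' in REST) = 0 then REST
              else substring(REST from 1 for position(',' in REST) - 1) end,
         case when position(',' in REST) = 0 then ''
              else substring(REST from position(',' in REST) + 1) end
  from split
  where REST <> ''
)
select ID, ITEM from split;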
3. IMPLEMENTATION
Bounded one-to-many data transformations can be expressed as RA expressions. In turn, as we formally demonstrate elsewhere [10], no relational expression is able to capture unbounded one-to-many data transformations. Therefore, bounded data transformations can be implemented as relational algebra expressions while unbounded one-to-many
The implementation of data transformations like those for producing the target relation PAYMENTS of Example 2.2 is challenging, since the number of output rows, for each input row, is determined by the value of the attribute AM.
40
1: insert into EVENTS (LOANNO, EVTYP, AMTYP, AMT) 2: select LOANNO, EVTYP, ’CAPTL’ as AMTYP, CAPTL 3: from LOANEVT 4: where CAPTL > 0 5: union all 6: select LOANNO, EVTYP, ’TAX’ as AMTYP, TAX 7: from LOANEVT 8: where TAX > 0 9: union all 10: select LOANNO, EVTYP, ’EXPNS’ as AMTYP, EXPNS 11: from LOANEVT 12: where EXPNS > 0 13: union all 14: select LOANNO, EVTYP, ’BONUS’ as AMTYP, BONUS 15: from LOANEVT 16: where BONUS > 0;
1: with recpayments(ACCTNO, AMOUNT, SEQNO,
2: REMAMNT) as
3: (select digits(ACCT),
4: case when base.AM < 100 then base.AM
5: else 100 end,
6: 1,
7: case when base.AM < 100 then 0
8: else base.AM - 100 end
9: from LOANS as base
10: union all
11: select ACCTNO,
12: case when step.REMAMNT < 100 then
13: step.REMAMNT
14: else 100 end,
15: SEQNO + 1,
16: case when step.REMAMNT < 100 then 0
17: else step.REMAMNT - 100 end
18: from recpayments as step
19: where step.REMAMNT > 0)
20: select ACCTNO, AMOUNT, SEQNO
21: from recpayments as PAYMENTS
Figure 3: RDBMS implementation of Example 2.1 as an SQL union query.

data transformations have to be implemented resorting to SQL 1999 recursive queries [26] or to SQL 2003 Persistent Stored Modules (PSMs) [15]. We now examine these alternatives.
3.1
Figure 4: RDBMS implementation of Example 2.2 as a recursive query in SQL 1999. to optimize and hard to understand. In Figure 4 we present a solution for Example 2.2 written in SQL 1999. A recursive query written in SQL 1999 is divided in three sections. The first section is the base of the recursion that creates the initial result set (lines 3–9). The second section, known as the step, is evaluated recursively on the result set obtained so far (lines 11–19). The third section specifies through a query, the output expression responsible for returning the final result set (lines 20–21). In the base step, the first parcel of each loan is created and extended with the column REMAMNT whose purpose is to track the remaining amount. Then, at each step we enlarge the set of resulting rows. All rows without REMAMNT constitute already a valid parcel and are not expanded by recursion. Those rows with REMAMNT > 0 (line 19) generate a new row with a new sequence number set to SEQNO + 1 (line 15) and with remaining amount decreased by 100 (line 17). Finally, the PAYMENTS table is generated by projecting away the extra REMAMNT column. Clearly, when using recursive queries to express data transformations, the logic of the data transformation becomes hard to grasp, specially if several functions are used. Even in simple examples like Example 2.2, it becomes difficult to understand how the cardinality of the output tuples depends on each input tuple. Furthermore, a great deal of ingenuity is often needed for developing recursive queries.
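A different way to sidestep recursion, sketched here only as an illustration and not taken from the paper, is to join LOANS with an auxiliary numbers relation. The helper table NUMBERS(N), holding 1, 2, 3, ... up to a sufficiently large value, is an assumption, and the approach is only "unbounded" up to the size of that table.

-- Each loan joins with as many consecutive N values as it has parcels of 100;
-- the last parcel carries the remainder of the amount.
insert into PAYMENTS (ACCTNO, AMOUNT, SEQNO)
select lpad(ACCT, 4, '0'),
       case when N * 100 <= AM then 100
            else AM - (N - 1) * 100 end,
       N
from LOANS, NUMBERS
where (N - 1) * 100 < AM;

Because this is a plain select-project-join query it is readily optimizable by the RDBMS, but it trades the recursion for the maintenance of the auxiliary table.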
Relational Algebra
Bounded one-to-many data transformations can be expressed as relational expressions by combining projections, selections and unions, at the expense of the query length. Consider k to be the maximum number of tuples generated by a one-to-many data transformation, and let the condition Ci encode the decision of whether the ith tuple, where 1 ≤ i ≤ k, should be generated. In general, given a source relation s with schema X1, ..., Xn, we can define a one-to-many data transformation over s that produces at most k tuples for each input tuple through the expression

πX1,...,Xn(σC1(s)) ∪ · · · ∪ πX1,...,Xn(σCk(s))
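For instance, instantiated for Example 2.1 (with k = 4 and a generalized projection used to add the constant AMTYP column), the expression could look as follows; the exact projection lists are our own reading of Figure 3, not a formula given in the paper.

\[
\begin{aligned}
\text{EVENTS} \;=\;& \pi_{\text{LOANNO},\,\text{EVTYP},\,\text{'CAPTL'},\,\text{CAPTL}}\bigl(\sigma_{\text{CAPTL}>0}(\text{LOANEVT})\bigr) \\
\cup\;& \pi_{\text{LOANNO},\,\text{EVTYP},\,\text{'TAX'},\,\text{TAX}}\bigl(\sigma_{\text{TAX}>0}(\text{LOANEVT})\bigr) \\
\cup\;& \pi_{\text{LOANNO},\,\text{EVTYP},\,\text{'EXPNS'},\,\text{EXPNS}}\bigl(\sigma_{\text{EXPNS}>0}(\text{LOANEVT})\bigr) \\
\cup\;& \pi_{\text{LOANNO},\,\text{EVTYP},\,\text{'BONUS'},\,\text{BONUS}}\bigl(\sigma_{\text{BONUS}>0}(\text{LOANEVT})\bigr)
\end{aligned}
\]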
To illustrate the concept, in Figure 3 we present the SQL implementation of the bounded data transformation presented in Example 2.1 using multiple union all (lines 5, 9 and 13) statements. Each select statement (lines 2–4, 6–8, 10–12 and 14–16) encodes a separate condition and potentially contributes with an output tuple. The drawback of this solution is that the size of the query grows proportionally to the maximum number of output tuples k that has to be generated for each input tuple. If this bound value k is high, the query becomes too big. Expressing one-to-many data transformations in this way implies a lot of repetition, in particular if many columns are involved.
3.2 RDBMS Extensions
3.2.2
We now turn to express one-to-many data transformations using RDBMS extensions, namely, recursive queries and table functions. Although these solutions enable expressing both bounded and unbounded transformations, here we introduce them for expressing unbounded transformations.
3.2.1
Persistent Stored Modules
Several RDBMSs support some form of procedural construct for specifying complex computations. This feature is primarily intended for storing business logic in the RDBMS for performance reasons or to perform operations on data that cannot be handled by SQL. Several database systems support their own procedural languages, like SQL-PL in the case of DB2 [22], TransactSQL in the case of Microsoft SQL Server and Sybase [24], or PL/SQL in the case of Oracle [16]. These extensions, designated as Persistent Stored Modules (PSMs), were introduced in the SQL 1999 standard [18, Section 8.2]. A module of a PSM can be, among others, a procedure, usually known as stored procedure (SP), or a function, known as a user defined function (UDF).
Recursive Queries
The expressive power of RA can be considerably extended through the use of recursion [3]. Although the resulting setting is powerful enough to express many useful one-tomany data transformations, we argue that this alternative undergoes a number of drawbacks. Recursive queries are not broadly supported by RDBMSs, and they are difficult
41
1: create function LOANSTOPAYMENTS
2: return PAYMENTS_TABLE_TYPE pipelined is
3: ACCTVALUE LOANS.ACCT%TYPE;
4: AMVALUE LOANS.AM%TYPE;
5: REMAMNT INT;
6: SEQNUM INT;
7: cursor CLOANS is
8: select ACCT, AM from LOANS;
9: begin
10: open CLOANS;
11: loop
12: fetch CLOANS into ACCTVALUE, AMVALUE; exit when CLOANS%NOTFOUND;
13: REMAMNT := AMVALUE;
14: SEQNUM := 1;
15: while REMAMNT > 100
16: loop
17: pipe row(PAYMENTS_ROW_TYPE(
18: LPAD(ACCTVALUE, 4, '0'), 100.00, SEQNUM));
19: REMAMNT := REMAMNT - 100;
20: SEQNUM := SEQNUM + 1;
21: end loop;
22: if REMAMNT > 0 then
23: pipe row(PAYMENTS_ROW_TYPE(
24: LPAD(ACCTVALUE, 4, '0'),
25: REMAMNT, SEQNUM));
26: end if;
27: end loop;
28: close CLOANS; return; end LOANSTOPAYMENTS;
Figure 5: Possible RDBMS implementation of Example 2.2 as a table function using Oracle PL/SQL. The details concerning the creation of the supporting row and table types PAYMENTS_ROW_TYPE and PAYMENTS_TABLE_TYPE are not shown.

1: select LOANNO, EVTYP, AMTYP, AMT
2: from LOANEVT
3: unpivot (AMT for
4: AMTYP in (CAPTL, TAX, EXPNS, BONUS)) as u
5: where AMT > 0
Figure 6: Implementation of Example 2.1 using an unpivot operation on SQL Server 2005.
Table functions extend the expressive power of SQL because they may return a relation. Table functions allow recursion1 and make it feasible to generate several output tuples for each input tuple. The advantages are mainly enhanced performance and re-use [33]. Moreover, complex data transformations can be expressed by nesting UDFs within SQL statements [33]. However, table functions are often implemented using procedural constructs that hamper the possibilities of undergoing the dynamic optimizations familiar to relational queries. Besides table functions, other kinds of UDFS exist, like user defined scalar functions (UDSFs), and user defined aggregate functions (UDAFs) [21]. Still, SQL extended with UDSFs and UDAFs may not be enough for expressing oneto-many data transformations. First, calls to UDSFs need to be embedded in an extended projection operator, which, as discussed in Section 3.1, is not powerful enough for expressing one-to-many transformations. Second, UDAFs must be embedded in aggregation operations, which can only represent many-to-one data transformations.
An interesting aspect of PSMs is that they are powerful enough to specify bounded as well as unbounded data transformations. Figure 5 presents the implementation of the data transformation introduced in Example 2.2 as a user-defined table function (TF), as proposed by SQL 2003 [15]. The table function implementation written in PL/SQL has two sections: a declaration section and a body section. The first one defines the set of working variables that are used in the procedure body and the cursor CLOANS (lines 7–8), which will be used for iterating through the LOANS table. The body section starts by opening the cursor. Then, a loop and a fetch statement are used for iterating over CLOANS (lines 11–12). The loop cycles until the fetch statement fails to retrieve more tuples from CLOANS. The value contained in AMVALUE is loaded into the working variable REMAMNT (line 13). The value of this variable will later be decreased in parcels of 100 (line 19). The number of parcels is controlled by the guard conditions on REMAMNT (lines 15 and 22). An inner loop is used to form the parcels based on the value of REMAMNT (lines 15–21). A new parcel row is inserted in the target table PAYMENTS for each iteration of the inner loop. The tuple is generated through a pipe row statement that is also responsible for padding the value of ACCTVALUE with zeroes (lines 17–18 and 24–25). When the inner loop ends, a last pipe row statement is issued to insert the parcel that contains the remainder. The details concerning the creation of the row and table types PAYMENTS_ROW_TYPE and PAYMENTS_TABLE_TYPE are not presented. The main drawback of PSMs is that they use a number of procedural constructs that are not amenable to optimization. Moreover, there are no elegant solutions for expressing the dynamic creation of tuples using PSMs. One needs to resort to intricate loop and pipe row statements (or insert into statements in the case of a stored procedure) as shown in Figure 5. From the description of Example 2.2, it is clear that a separate piece of logic is used to compute each of the attributes. Nevertheless, in the PL/SQL code, the computation of ACCTNO is coupled with the computation of AMOUNT. Thus, the logic to calculate ACCTNO is duplicated in the code. This makes code maintenance difficult and the code itself hard to optimize.
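The stored-procedure variant mentioned above and used later in the experiments is not listed in the paper; the following PL/SQL sketch is only a plausible reconstruction, with the insert into statement playing the role of pipe row. The table names and types match Example 2.2; everything else is an assumption.

create procedure LOANSTOPAYMENTS_SP is
begin
  -- insert into statements like these always generate log, which is why, as
  -- discussed in Section 4, logging cannot be disabled for the stored-procedure runs.
  for r in (select ACCT, AM from LOANS) loop
    declare
      REMAMNT NUMBER := r.AM;
      SEQNUM  INT := 1;
    begin
      while REMAMNT > 100 loop
        -- one full parcel of 100 per iteration, padded account number as in Figure 5
        insert into PAYMENTS (ACCTNO, AMOUNT, SEQNO)
        values (LPAD(r.ACCT, 4, '0'), 100.00, SEQNUM);
        REMAMNT := REMAMNT - 100;
        SEQNUM := SEQNUM + 1;
      end loop;
      if REMAMNT > 0 then
        -- the last parcel carries the remainder of the amount
        insert into PAYMENTS (ACCTNO, AMOUNT, SEQNO)
        values (LPAD(r.ACCT, 4, '0'), REMAMNT, SEQNUM);
      end if;
    end;
  end loop;
end LOANSTOPAYMENTS_SP;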
3.2.3 Pivoting operations
The pivot and unpivot operators constitute an important extension to RA, which were first natively supported by SQL Server 2005 [27]. The pivot operation collapses similar rows into a single wider row adding new columns on-the-fly [13]. In a sense, this operator collapses rows to columns. Thus, it can be seen as expressing a many-to-one data transformation. Its dual, the unpivot operator transposes columns into rows. Henceforth, the discussion focuses on the unpivot operator, since this operator can be used for expressing bounded one-to-many data transformations. In what concerns expressiveness, the unpivot operator does not increase the expressive power of RA, since, as [13] admit, the unpivot operator can be implemented with multiple unions. Its semantics can be emulated by employing multiple union operations as proposed above for expressing bounded one-to-many data transformations through RA (Section 3.1). Nevertheless, expressing one-to-many data transformations using the unpivot operator brings two main benefits comparatively to using multiple unions. First, the syntax is more
1 Recursive calls of table functions are constrained in some RDBMSs, like DB2.
42
compact. Figure 6 shows how the unpivot operator can be employed to express the bounded one-to-many data transformation of Example 2.1. Second, data transformations expressed using the unpivot operator are more readily optimizable using the logical and physical optimizations proposed in [13].
4. EXPERIMENTS
We now compare the performance of the alternative implementations of the one-to-many data transformations introduced in Examples 2.1 and 2.2 using relational queries, recursive queries, table functions and stored procedures. We start by comparing the performance of each alternative for addressing bounded and unbounded transformations. Then, we investigate how the different solutions react to two intrinsic factors of one-to-many data transformations. Finally, we analyze the optimization possibilities of each solution. We tested the alternative implementations of one-to-many data transformations on two RDBMSs, henceforth designated as DBX and OEX². The entire set of planned implementations is shown in Figure 7. Unbounded data transformations cannot be implemented as relational queries. Furthermore, the class of recursive queries supported by the OEX system is not powerful enough for expressing unbounded data transformations. Additionally, due to limitations of the DBX system, table functions could not be implemented. Thus, to test another implementation across both systems, bounded and unbounded data transformations were also implemented as stored procedures. Finally, unpivot operations were not considered because neither DBX nor OEX supports them.
4.1
partitions are the logical partitions accessed as raw devices. These partitions hold the data and log files. Each RDBMS accesses tablespaces created on distinct raw devices. The first logical partition (/dev/hda5) holds the tablespace named RAWSRC for input data; the second logical partition (/dev/hda6) holds the tablespace named RAWTGT for output data. The partition /dev/hda7 is used for raw logging and, finally, /dev/hda8 is used as the temporary tablespace. To minimize the I/O overhead, both input and output tables were created with PCTFREE set to 0. In addition, kernel asynchronous I/O [6] was turned off.
Setup
Block sizes. In our experiments, tables are accessed through full-table scans. Since there are no updates and no indexed scans, different block sizes have virtually no influence on performance. The block size parameters are set to the same value of 8KB. Since full table scans use multi-block reads, we configure the amount of data transferred in a multi-block read to 64KB.

Buffers. To improve performance, RDBMSs cache frequently accessed pages in independent memory areas. One such area is the buffer pool, which caches disk pages [14]. The configuration of buffer pools in DBX differs from that of the OEX system. For our purposes, the main difference lies in the fact that, in DBX, individual buffer pools can be assigned to each tablespace, while OEX uses one global buffer pool for all tablespaces. In DBX, we assign a buffer pool of 4MB to the RAWSRC tablespace, which contains the source data. In OEX we set the size of the cache to 4MB.

Logging. Both DBX and OEX use write-ahead logging mechanisms that produce undo and redo log [19, 28]. We attempt to minimize the logging activity by disabling logging in both the DBX and OEX experiments. However, we note that logging cannot be disabled in the case of stored procedures, because insert into statements executed within stored procedures always generate log.
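As a purely illustrative sketch of the kind of settings involved (the paper does not list its DDL, so the statements below are Oracle-flavoured assumptions rather than the authors' actual configuration):

-- Source table placed in the raw-device tablespace, with no free space reserved
-- in each block and logging suppressed for bulk operations.
create table LOANEVT (
  LOANNO integer,
  EVTYP  varchar2(6),
  CAPTL  number(10,2),
  TAX    number(10,2),
  EXPNS  number(10,2),
  BONUS  number(10,2)
) pctfree 0 nologging tablespace RAWSRC;

-- Target table in the output tablespace, configured the same way.
create table EVENTS (
  LOANNO integer,
  EVTYPE varchar2(6),
  AMTYP  varchar2(5),
  AMT    number(10,2)
) pctfree 0 nologging tablespace RAWTGT;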
The tests were executed on a synthetic workload that consists of input relations whose schemas are based on those used in Examples 2.1 and 2.2, for bounded and unbounded data transformations, respectively. Since the representation of data types may not be the same across all RDBMS, special attention must be given to record length. To equalize the sizes of the input rows of bounded and unbounded data transformations, a dummy column was added to the table LOANS so that its record size matches the record size of the table LOANEVT. We computed the average record size of each input table after its load. Both LOANS and LOANEVT have approximately 29 bytes in all experiments. In addition, several parameters of both RDBMSs were carefully aligned. Below, we summarize the main issues that received our attention.
We measured the throughput, i.e., the amount of work done per second, of the considered implementations of one-to-many data transformations. Throughput is expressed as the number of source records transformed per second and is computed by measuring the response time of a data transformation applied to an input table. The response time is measured as the time interval between the submission of the data transformation implementation from the command line prompt and its conclusion. The interval between the submission of the request and the start of its execution by the system, known as the reaction time, is considered negligible. The hardware used was a single-CPU machine (running at 3.4 GHz) with 1GB of RAM and Linux (kernel version 2.4.2) installed.
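In other words, with t_submit and t_end the command-line submission and completion instants and |S| the number of source records, the reported figure is simply (our own notational restatement, not the paper's):

\[
\text{throughput} \;=\; \frac{|S|}{t_{\text{end}} - t_{\text{submit}}}\qquad\text{[source records per second]}
\]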
I/O conditions. An important aspect regarding I/O is that all experiments use the same region of the hard disk. To induce the use of the same area of the disk, I/O was forced through raw devices. The hard disk is partitioned on cylinder boundaries as illustrated in Figure 8. The first partition is a primary partition formatted with the Ext3 file system with journaling enabled; it is used for the operating system and RDBMS installations as well as for the database control files. The second partition is used as swap space. The remaining
4.2 Throughput comparison
To compare the throughput of the evaluated alternatives, we executed their implementations on input relations with increasing sizes. The results for both bounded and unbounded implementations, are shown in Figure 9. We observe that table functions are the most performant of the
2
Due to the restrictions imposed by DBMS licensing agreements, the actual names of the systems used for this evaluation will not be revealed.
43
Implementations of one-to-many data transformations

                      Bounded                                   Unbounded
       Relational   Table       Stored         Recursive   Table       Stored
       Query        Function    Procedure      Query       Function    Procedure
DBX    yes          no          yes            yes         no          yes
OEX    yes          yes         yes            no          yes         yes

Figure 7: Different mechanisms used for implementing the one-to-many data transformations developed for the experiments.
Figure 8: Hard-disk partitioning for the experiments — OS hda1 58GB, swap hda2 2GB, raw hda5 25GB, raw hda6 25GB, raw hda7 25GB, raw hda8 25GB.
implementations decreases because more time is spent writing the output tuples. In the case of recursive queries, more I/O is incurred because higher fanouts increase the size of the intermediate relations used for evaluating the recursive query. Finally, for stored procedures, the more tuples are written, the more log is generated.
implementations. Then, implementations using unions and recursive queries are considerably more efficient than stored procedures. Figure 9b shows that the throughput is mostly constant as the input relation size increases. The low throughput observed for stored procedures is mainly due to the huge amount of redo logging activity incurred during their execution. Unlike the remaining solutions, it is not possible to disable logging for stored procedures. In particular, the logging overhead monitored for stored procedures is ≈ 118.9 blocks per second in the case of DBX and ≈ 189.2 blocks per second in the case of OEX. We may conclude that, if logging were disabled, stored procedures would execute with performance comparable to table functions.
4.3 Influence of selectivity and fanout
In one-to-many data transformations, each input tuple may correspond to zero, one, or several output tuples. The ratio of input tuples for which at least one output tuple is produced is known as the selectivity of the data transformation. Similarly to [11], the average number of output tuples produced for each input tuple is called the fanout. Data sets generating data transformations with different selectivities and fanouts have been used in our working examples. These data sets produce predefined average selectivities and fanouts. A set of experiments varying the selectivity and fanout factors was put in place to help understand the effect of selectivity and fanout on data transformations. The results are depicted in Figure 10. Concerning selectivity, we observe in Figure 10a that higher throughputs are obtained for smaller selectivities. This stems from fewer output tuples being created when the selectivity is smaller. The degradation observed is explained by more output tuples being produced and materialized at higher selectivities. Stored procedures degrade faster due to an increase in log generation. With respect to the fanout factor, greater fanout factors imply generating more output tuples for each input tuple, and hence I/O activity is directly influenced. To observe the impact of this parameter, we increase the fanout factor from 1 to 32. Figure 10b illustrates the evolution of the throughput for unbounded transformations. The throughput of all
44
4.4 Query optimization issues
The analysis of the query plans of the different implementations shows that the RDBMSs used in this evaluation are not always capable of optimizing queries involving one-to-many data transformations. To validate this hypothesis, we contrasted the execution of a simple selection applied to a one-to-many transformation, represented as σACCTNO>p(T(s)), with its corresponding optimized equivalent, represented as T(σACCT>p(s)), where T represents the data transformation specified in Example 2.2, except that the column LOANS is directly mapped, and p is a constant used only to induce a specific selectivity. We stress that the optimized versions are obtained manually, by pushing down the selection condition. Figure 11a presents the response times of the original and optimized versions implemented as recursive queries and as table functions. We observe that the optimized versions are considerably more efficient than their corresponding originals. We conjecture that the optimization handicap of RDBMSs for processing one-to-many data transformations has to do with the intrinsic difficulty of optimizing recursive queries and table functions. In fact, the optimization of recursive queries is far from being a closed subject [30]. In turn, table functions are implemented using procedural constructs that hamper optimizability. Once the table function makes use of procedural constructs, it is not possible to perform the kind of optimizations that relational queries undergo. We have found that bounded one-to-many data transformations take advantage of the logical optimizations built into the RDBMS when they are implemented through a union statement: applying a filter to a union is readily optimized. The response time for this experiment was included in Figure 11a for comparison. Another type of optimization that RDBMSs can apply to one-to-many data transformations is the use of the cache. This factor is important for optimizing the execution of queries that use multiple union statements and therefore need to scan the input relation multiple times. Likewise, recursive queries perform multiple joins with intermediate relations. This happens because the physical execution of a recursive query involves performing one full select to seed the recursion and then a series of successive union and join operations to unfold the recursion. As a result, these operations are likely to
be influenced by the buffer cache size. To evaluate the impact of the buffer pool cache size on one-to-many transformations, we executed a set of experiments varying the buffer pool size. The results, depicted in Figure 11b, show that a larger buffer pool cache is most beneficial for bounded data transformations implemented as unions. This is explained by larger buffer pool caches reducing the number of physical reads required when scanning the input relations multiple times. We also remark a distinct behavior of the RDBMSs used in the evaluation as the cache size increases. The throughput in OEX increases smoothly while in DBX there is a sharp increase. This has to do with the differences in the cache replacement policies of these systems when performing table scans [14]. DBX uses a least recently used (LRU) policy [29] to select the next page to be replaced from the cache, while the OEX system, according to its documentation, uses a most recently used (MRU) replacement policy. The LRU replacement policy performs quite poorly on sequential scans if the cache is smaller than the input relation: LRU purges the cache when full table scans are involved and the size of the buffer pool is smaller than the size of the table [23]. We conclude that for small input tables using multiple unions is the most advantageous alternative for bounded one-to-many data transformations. However, in the presence of large input relations, table functions are the best alternative since they are invariant to the cache size. This is due to the fact that the input relation is scanned only once. Stored procedure implementations also scan the input relation only once but are less performant due to logging. According to [13], the pivot operator processes the input relation only once. As a result, it is not likely to be influenced by the buffer cache size, unlike the chaining of multiple unions presented in Section 3.1.

Figure 9: Throughput of data transformation implementations with different relation sizes. Fanout is fixed to 2.0, selectivity is fixed to 0.5, and the cache size is set to 4MB. (a) Average throughput; (b) evolution with relation size.
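To illustrate the manual push-down used for Figure 11a on the bounded/union variant, a sketch follows; the predicate LOANNO > :p, the bind variable and the restriction to two of the four branches are illustrative assumptions rather than the authors' actual script.

-- Original formulation: the filter is applied after the one-to-many transformation.
select * from (
  select LOANNO, EVTYP, 'CAPTL' as AMTYP, CAPTL from LOANEVT where CAPTL > 0
  union all
  select LOANNO, EVTYP, 'TAX', TAX from LOANEVT where TAX > 0
) T
where LOANNO > :p;

-- Manually optimized formulation: the same filter is pushed below each branch,
-- so every scan of LOANEVT already discards the irrelevant tuples.
select LOANNO, EVTYP, 'CAPTL' as AMTYP, CAPTL from LOANEVT where LOANNO > :p and CAPTL > 0
union all
select LOANNO, EVTYP, 'TAX', TAX from LOANEVT where LOANNO > :p and TAX > 0;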
5. RELATED WORK

In Codd's original model [12], RA expressions denote trans-
45
formations among relations. In the following years, the idea of using queries for specifying data transformations would be pursued by two prototypes, Convert and Express [36, 37], shortly followed by results on the expressivity limitations of RA [3, 31]. Many useful data transformations can be appropriately defined in terms of relational expressions if we consider relational algebra equipped with a generalized projection operator [38, p. 104]. However, this extension is still too weak to express unbounded one-to-many data transformations. To support the growing range of RDBMS applications, several extensions to RA have been proposed in the form of new declarative operators and also through the introduction of language extensions to be executed by the RDBMS. One such extension, interesting for one-to-many transformations, is the pivot operator [13], which is not influenced by the buffer cache size. However, the pivot operator cannot express unbounded one-to-many data transformations and, as far as we know, it is only implemented by SQL Server 2005. Recursive query processing was addressed early on by [3], and then by several works on recursive query optimization, like, for example, [35, 39]. There are also proposals for extending SQL to handle particular forms of recursion [2], like the Alpha operator [1]. Despite being relatively well understood at the time, recursive query processing was not supported by SQL-92. By the time SQL:1999 [26] was introduced, some of the leading RDBMSs (e.g., Oracle, DB2 or POSTGRES) were in the process of supporting recursive queries. As a result, these systems ended up supporting different subsets of recursive queries with different syntaxes. Presently, broad support for recursion still constitutes a subject of debate [32]. The problem of specifying one-to-many data transformations has also been addressed in the context of data cleaning and transformation by tools like Potter's Wheel [34], Ajax [17] and Data Fusion [7]. These tools have proposed operators for expressing one-to-many data transformations. Pot-
Figure 10: Evolution of throughput for varying selectivities and fanouts over input relations with 1M tuples and 4MB of cache: (a) shows the evolution for bounded transformations with increasing selectivity (fanout set to 2.0) and (b) shows the evolution for unbounded transformations with increasing fanout (selectivity fixed to 0.5). The corresponding unbounded and bounded variants display identical trends.
ter's Wheel fold operator addresses bounded one-to-many transformations, while Ajax and Data Fusion also implement operators addressing unbounded data transformations. Building on the above contributions, we recently proposed the extension of RA with a specialized operator named the data mapper, which addresses one-to-many transformations [8, 9, 10]. The interesting aspect of this solution lies in the fact that mappers are declarative specifications of a one-to-many data transformation, which can then be logically and physically optimized.
6. CONCLUSIONS
We organized our discussion of one-to-many data transformations into two groups, representing bounded and unbounded data transformations. There is no general solution for expressing one-to-many data transformations using RDBMSs. We have seen that although bounded data transformations can be expressed by combining unions and selections, unbounded data transformations require advanced constructs such as the recursive queries of SQL:1999 [26] and the table functions introduced in the SQL 2003 standard [15]. However, these are not yet supported by many RDBMSs. We then conducted an experimental assessment of how RDBMSs handle the execution of one-to-many data transformations. Our main finding was that RDBMSs cannot, in general, optimize the execution of queries that comprise one-to-many data transformations. One-to-many data transformations expressed either as unions or as recursive queries incur unnecessary consumption of resources, involving multiple scans over the input relation and the generation of intermediate relations, which makes them sensitive to the buffer cache size. Table functions are acceptably efficient since their implementation emulates an iterator that scans the input relation only once. However, their procedural nature blends logical and physical aspects, hampering dynamic optimization. An additional outcome of the experiments was the identification of selectivity and fanout, two important factors of one-to-many data transformations that influence their cost. Together with the input relation size, these factors can be used to predict the cost of one-to-many data transformations. This information can be exploited by the cost-based optimizer when choosing among alternative execution plans involving one-to-many data transformations. In fact, we believe that one-to-many data transformations can be logically and physically optimized when expressed through a specialized relational operator like the one we propose in [9, 10]. As future work, we plan to extend the Derby [4] open source RDBMS to execute and optimize one-to-many data transformations expressed as queries that incorporate this operator. In this way, we equip an RDBMS to be used not only as a data store but also as a data transformation engine.
7. REFERENCES
[1] R. Agrawal. Alpha: An extension of relational algebra to express a class of recursive queries. IEEE Transactions on Software Engineering, 14(7):879–885, 1988. [2] R. Ahad and S. B. Yao. Rql: A recursive query language. IEEE Transactions on Knowledge and Data Engineering, 5(3):451–461, June 1993. [3] A. V. Aho and J. D. Ullman. Universality of data retrieval languages. In Proc. of the 6th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Lang., pages 110–119. ACM Press, 1979. [4] Apache. Derby homepage. http://db.apache.org/derby, 2006.
46
[Figure 11 compares the response times of the original, manually optimized and automatically optimized variants (a), and shows throughput as a function of the cache size as a percentage of the input relation (b).]
Figure 11: Sensibility of data transformation implementations with one 1M tuples: (a) to optimization, with cache size fixed to 4MB and, (b) to cache size variations. Selectivity is fixed to 0.5 and fanout set to 2.0. [5] P. A. Bernstein and E. Rahm. Data wharehouse scenarios for model management. In International Conference on Conceptual Modeling / The Entity Relationship Approach, pages 1–15, 2000. [6] S. Bhattacharya, S. Pratt, B. Pulavarty, and J. Morgan. Asynchronous I/O Support in Linux 2.5. In Proc. of the Linux Symposium, 2003. [7] P. Carreira and H. Galhardas. Efficient development of data migration transformations. In ACM SIGMOD Int’l Conf. on the Managment of Data, June 2004. [8] P. Carreira and H. Galhardas. Execution of Data Mappers. In Int’l Workshop on Information Quality in Information Systems (IQIS). ACM, June 2004. [9] P. Carreira, H. Galhardas, A. Lopes, and J. Pereira. Extending relational algebra to express one-to-many data transformations. In 20th Brasillian Symposium on Databases SBBD’05, Oct. 2005. [10] P. Carreira, H. Galhardas, A. Lopes, and J. Pereira. One-to-many transformation through data mappers. Data and Knowledge Engineering Journal, 62(3):483–503, September 2007. [11] S. Chaudhuri and K. Shim. Query optimization in the presence of foreign functions. In Proc. of the Int’l Conf. on Very Large Data Bases (VLDB’93), pages 529–542, 1993. [12] E. F. Codd. A relational model of data for large shared data banks. Communic. of the ACM, 13(6):377–387, 1970. [13] C. Cunningham, G. Graefe, and C. A. Galindo-Legaria. PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS. In Proc. of the Int’l Conf. on Very Large Data Bases (VLDB’04), pages 998–1009. Morgan Kaufmann, 2004. [14] W. Effelsberg and T. Haerder. Principles of database buffer management. ACM Transactions on Database
Systems (TODS), 9(4):560–595, 1984. [15] A. Eisenberg, J. Melton, K. K. J.-E. Michels, and F. Zemke. SQL:2003 has been published. ACM SIGMOD Record, 33(1):119–126, 2004. [16] S. Feuerstein and B. Pribyl. Oracle PL/SQL Programming. O’Reilly & Associates, 4 edition, 2005. [17] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. A. Saita. Declarative data cleaning: Language, model, and algorithms. In Proc. of the Int’l Conf. on Very Large Data Bases (VLDB’01), 2001. [18] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems – The Complete Book. Prentice-Hall, 2002. [19] J. Gray, P. McJones, M. Blasgen, B. Lindsay, R. Lorie, T. Price, F. Putzolu, and I. Traiger. The Recovery Manager of the System R Database Manager. ACM Computing Surveys, 13(2):223–242, 1981. [20] L. Haas, R. Miller, B. Niswonger, M. T. Roth, P. Scwarz, and E. L. Wimmers. Transforming heterogeneous data with database middleware: Beyond integration. Special Issue on Data Transformations. IEEE Data Engineering Bulletin, 22(1), 1999. [21] M. Jaedicke and B. Mitschang. On parallel processing of aggregate and scalar functions in object-relational DBMS. In ACM SIGMOD Int’l Conf. on Management of Data, pages 379–389. ACM Press, 1998. [22] Z. Janmohamed, C. Liu, D. Bradstock, R. chong, M. G. abd F. McArthur, and P. Yip. DB2 SQL PL. Essential Guide for DB2 UDB. Prentice-Hall, 2005. [23] S. Jiang and X. Zhuang. Lirs: An efficient low inter-reference recency set replacement policy to improve buffer cache performance. In Proc. of SIGMETRICS 2002, 2002. [24] K. Kline, L. Gould, and A. Zanevsky. TransactSQL Programming. O’Reilly & Associates, 1 edition, 1999.
47
48
Towards a Benchmark for ETL Workflows
Panos Vassiliadis, Anastasios Karagiannis, Vasiliki Tziovara
Dept. of Computer Science, Univ. of Ioannina, Ioannina, Hellas
{pvassil, ktasos, vickit}@cs.uoi.gr
Alkis Simitsis
IBM Almaden Research Center, San Jose, California, USA
[email protected]
ABSTRACT Extraction–Transform–Load (ETL) processes comprise complex data workflows, which are responsible for the maintenance of a Data Warehouse. Their practical importance is underlined by the fact that a plethora of ETL tools currently constitutes a multi-million dollar market. However, each one of them follows a different design and modeling technique and internal language. So far, the research community has not agreed upon the basic characteristics of ETL tools. Hence, there is a necessity for a unified way to assess ETL workflows. In this paper, we investigate the main characteristics and peculiarities of ETL processes and we propose a principled organization of test suites for the problem of experimenting with ETL scenarios.
1. INTRODUCTION
Data Warehouses (DW) are collections of data coming from different sources, used mostly to support decision-making and data analysis in an organization. To populate a data warehouse with up-to-date records that are extracted from the sources, special tools are employed, called Extraction–Transform–Load (ETL) tools, which organize the steps of the whole process as a workflow. To give a general idea of the functionality of these workflows we mention their most prominent tasks, which include: (a) the identification of relevant information at the source side; (b) the extraction of this information; (c) the transportation of this information to the Data Staging Area (DSA), where most of the transformations usually take place; (d) the transformation (i.e., customization and integration) of the information coming from multiple sources into a common format; (e) the cleansing of the resulting data set, on the basis of database and business rules; and (f) the propagation and loading of the data to the data warehouse and the refreshment of data marts.

Due to their importance and complexity (see [1, 12] for relevant discussions and case studies), ETL tools constitute a multi-million dollar market. There is a plethora of commercial ETL tools available. The traditional database vendors provide ETL solutions built on top of their DBMSs: IBM with WebSphere DataStage [5], Microsoft with SQL Server 2005 Integration Services (SSIS) [7], and Oracle with Oracle Warehouse Builder [8]. There also exist independent vendors that cover a large part of the market (e.g., Informatica with PowerCenter 8 [6]). Nevertheless, an in-house development of the ETL workflow is preferred in many data warehouse projects, due to the significant cost of purchasing and maintaining an ETL tool. The spread of existing solutions comes with a major drawback: each one of them follows a different design approach, offers a different set of transformations, and provides a different internal language to represent essentially similar necessities.

The research community has only recently started to work on problems related to ETL tools. There have been several efforts towards (a) modeling tasks and the automation of the design process, (b) individual operations (with duplicate detection being the area with most of the research activity), and (c) some first results towards the optimization of the ETL workflow as a whole (as opposed to optimal algorithms for their individual components). For lack of space, we refer the interested reader to [11] for a detailed survey on research efforts in the area of ETL tools and to [14] for a survey on duplicate detection.

The wide spread of industrial and ad-hoc solutions, combined with the absence of a mature body of knowledge from the research community, is responsible for the absence of a principled foundation of the fundamental characteristics of ETL workflows and their management. Here is a short list of gaps concerning these fundamental characteristics: no principled taxonomy of individual activities is available, few research efforts have been made towards the optimization of ETL workflows as a whole, and practical problems like the recovery from failures have mostly been ignored. To add a personal touch to this landscape, on various occasions during our research we have faced the problem of constructing ETL suites and varying several of their parameters; unfortunately, there is no commonly agreed benchmark for ETL workflows. Thus, a commonly agreed, realistic framework for experimentation is also absent.

In this paper, we take a step towards this latter issue. Our goal is to provide a principled categorization of test suites for the problem of experimenting with a broad range of ETL workflows. First, we provide a principled way of constructing ETL workflows. We identify the main functionality provided by representative commercial ETL tools and we categorize the most frequent ETL operations into abstract logical activities. Based on that, we propose a categorization of ETL workflows which covers frequent design cases. Then, we describe the main configuration parameters and a set of crucial measures to be monitored in order to capture the generic functionality of ETL tools. Also, we discuss how different parallelism techniques may affect the execution of ETL processes. Finally, we provide a set of specific ETL scenarios based on the aforementioned analysis, which can be used as an experimental testbed for the evaluation of ETL methods or tools.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Database Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permissions from the publisher, ACM. VLDB ’07, September 23-28, 2007, Vienna, Austria. Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.
Contributions. Our main contributions are as follows: − A principled way of constructing ETL workflows based on an abstract categorization of frequently used ETL operations.
49
− The provision of a set of measures and parameters which are crucial for the efficient execution of an ETL workflow.
− A principled organization of test suites for the problem of experimenting with ETL scenarios.

Outline. In Section 2, we discuss the nature and structure of ETL activities and workflows, and provide a principled way for constructing the latter. In Section 3, we present the main measures to be assessed by a benchmark and the basic problem parameters that should be considered for the case of ETL workflows. In Section 4, we give specific scenarios that could be used as a test suite for the evaluation of ETL scenarios. In Section 5, we discuss issues that are left open and mention some parameters that require further tuning when constructing test suites. In Section 6, we discuss related work and, finally, in Section 7, we summarize our results and propose issues for future work.

2. Problem Formulation
In this section, we introduce ETL workflows along with their constituents and internal structure and discuss the way ETL workflows operate. We start with a formal definition and an intuitive discussion of ETL workflows as graphs. Then, we zoom into the micro-level of ETL workflows, inspecting each individual activity in isolation, and afterwards we return to the macro-level, inspecting how individual activities are tied together to compose an ETL workflow. Both levels can be further examined from a logical or a physical perspective. The final subsection discusses the characteristics of the operation of ETL workflows and ties them to the goals of the proposed benchmark.

An ETL workflow at the logical level is a design blueprint for the ETL process. The designer constructs a workflow of activities, usually in the form of a graph, to specify the order of cleansing and transformation operations that should be applied to the source data before they are loaded to the data warehouse. In what follows, we will employ the term recordset to refer to any data store that obeys a schema (with relational tables and record files being the most popular kinds of recordsets in the ETL environment), and the term activity to refer to any software module that processes the incoming data, either by performing a schema transformation over the data or by applying data cleansing procedures. Activities and recordsets are logical abstractions of physical entities. At the logical level, we are interested in their schemata, semantics, and input-output relationships; we do not deal with the actual algorithm or program that implements the logical activity or with the storage properties of a recordset. When, at a later stage, the logical-level workflow is refined at the physical level, a combination of executable programs/scripts that perform the ETL workflow is devised. Then, each activity of the workflow is physically implemented using various algorithmic methods, each with a different cost in terms of time requirements or system resources (e.g., CPU, memory, space on disk, and disk I/O).

Formally, we model an ETL workflow as a directed acyclic graph G(V,E). Each node v∈V is either an activity a or a recordset r. An edge (a,b)∈E denotes that b receives data from node a for further processing. In this provider relationship, nodes a and b play the role of the data provider and data consumer, respectively.
− Each recordset r is a pair (r.name, r.schema), with the schema being a finite list of attribute names.
− Each activity a is a tuple (N,I,O,S,A). N is the activity’s name. I is a finite set of input schemata. O is a finite set of output schemata. S is a declarative description of the relationship of its output schema with its input schema in an appropriate language (without delving into algorithmic or implementation issues). A is the algorithm chosen for the activity’s execution.
The following well-formedness constraints determine the interconnection of nodes in ETL workflows:
− The data consumer of a recordset cannot be another recordset. Still, more than one consumer is allowed for recordsets.
− Each activity must have at least one provider, either another activity or a recordset. When an activity has more than one data provider, these providers can be other activities or activities combined with recordsets.
− Feedback of data is not allowed; i.e., the data consumer of an activity cannot be the same activity.

2.1 Micro-level activities
Concerning the micro level, we consider three broad categories of ETL activities: (a) extraction activities, (b) transformation and cleansing activities, and (c) loading activities. Extraction activities extract the relevant data from the sources and transport them to the ETL area of the warehouse for further processing (possibly including operations like ftp, compress, etc.). The extraction involves either differential data sets with respect to the previous load, or full snapshots of the source. Loading activities deal with the population of the warehouse with clean and appropriately transformed data. This is typically done through a bulk loader program; nevertheless, the process also includes the maintenance of indexes, materialized views, reports, etc. Transformation and cleansing activities can be coarsely categorized with respect to the result of their application to data and the prerequisites which some of them should fulfill. In this context, we discriminate the following categories of operations:
− Row-level operations, which are locally applied to a single row.
− Router operations, which locally decide, for each row, which of the many (output) destinations it should be sent to.
− Unary Grouper operations, which transform a set of rows to a single row.
− Unary Holistic operations, which perform a transformation on the entire data set. These are usually blocking operations.
− Binary or N-ary operations, which combine many inputs into one output.

A taxonomy of activities at the micro level is depicted in Table A1 (in the appendix). For each one of the above categories, a representative set of transformations provided by three popular commercial ETL tools is presented. The table is indicative and in many ways incomplete. The goal is not to provide a comparison among the three tools; on the contrary, we would like to stress the genericity of our classification. For most ETL tools, the set of built-in transformations is enriched by user-defined operations and a plethora of functions. Still, as Table A1 shows, the transformations most frequently built into commercial solutions fall into our classification.
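To make the graph model concrete, here is a minimal Python sketch — our own illustration, not code from the paper — of recordsets, activities, and checks for the well-formedness constraints listed above; all class and function names are ours, and acyclicity is assumed rather than re-checked.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Recordset:
    name: str
    schema: tuple          # finite list of attribute names

@dataclass(frozen=True)
class Activity:
    name: str
    inputs: tuple          # input schemata I
    outputs: tuple         # output schemata O
    semantics: str         # declarative description S
    algorithm: str         # chosen implementation A

class ETLWorkflow:
    def __init__(self):
        self.nodes = set()
        self.edges = set()   # (provider, consumer) pairs

    def add_edge(self, provider, consumer):
        self.nodes.update([provider, consumer])
        self.edges.add((provider, consumer))

    def check_well_formed(self):
        errors = []
        for provider, consumer in self.edges:
            # a recordset may not feed another recordset
            if isinstance(provider, Recordset) and isinstance(consumer, Recordset):
                errors.append(f"recordset {provider.name} feeds recordset {consumer.name}")
            # no feedback of data to the same node
            if provider == consumer:
                errors.append(f"{provider.name} feeds itself")
        for node in self.nodes:
            # every activity needs at least one provider
            if isinstance(node, Activity) and not any(c == node for _, c in self.edges):
                errors.append(f"activity {node.name} has no data provider")
        return errors

# tiny usage example: R -> a1 -> V
R = Recordset("R", ("pkey", "availqty"))
V = Recordset("V", ("pkey", "availqty"))
a1 = Activity("not_null_pkey", ("in",), ("out",), "pkey IS NOT NULL", "scan")
wf = ETLWorkflow()
wf.add_edge(R, a1)
wf.add_edge(a1, V)
print(wf.check_well_formed())   # [] -> the workflow is well formed
```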
50
Figure 1. Abstract butterfly components

Figure 2. Butterfly configuration

2.2 Macro level workflows
The macro level deals with the way individual activities and recordsets are combined together in a large workflow. The possibilities of such combinations are infinite. Nevertheless, our experience suggests that most ETL workflows follow several high-level patterns, which we present in a principled fashion in this section. We introduce a broad category of workflows, called Butterflies. A butterfly (see also Figure 1) is an ETL workflow that consists of three distinct components: (a) the left wing, (b) the body, and (c) the right wing of the butterfly. The left and right wings (shown with dashed lines in Figure 1) are two non-overlapping groups of nodes which are attached to the body of the butterfly. Specifically:
− The left wing of the butterfly includes one or more sources, activities and auxiliary data stores used to store intermediate results. This part of the butterfly performs the extraction, cleaning and transformation part of the workflow and loads the processed data to the body of the butterfly.
− The body of the butterfly is a central, detailed point of persistence that is populated with the data produced by the left wing. Typically, the body is a detailed fact or dimension table; still, other variants are also possible.
− The right wing gets the data stored at the body and utilizes them to support reporting and analysis activity. The right wing consists of materialized views, reports, spreadsheets, as well as the activities that populate them. In our setting, we abstract all the aforementioned static artifacts as materialized views.

Assume the small ETL workflow of Figure 2 with 10 nodes. R and S are source tables providing 100,000 tuples each to the activities of the workflow. These activities apply transformations to the source data. Recordset V is a fact table and recordsets Z and W are target tables. This ETL scenario is a butterfly with respect to the fact table V. The left wing of the butterfly is {R, S, 1, 2, 3} and the right wing is {4, 5, Z, W}.

Balanced Butterflies. A butterfly that includes medium-sized left and right wings is called a Balanced butterfly and stands for a typical ETL scenario where incoming source data are merged to populate a warehouse table along with several views or reports defined over it. Figure 2 is an example of this class of butterflies. This variant represents a symmetric workflow (there is symmetry between the left and right wings). However, this is not always the practice in real-world cases. For instance, the butterfly’s triangle wings are distorted in the presence of a router activity that involves multiple outputs (e.g., copy, splitter, switch, and so on). In general, the two fundamental wing components can be either lines or combinations. In the sequel, we discuss these basic patterns for ETL workflows, which can be further used to construct more complex butterfly structures. Figure 3 pictorially depicts example cases of the above variants.

Combinations. A combinator activity is a join variant (a binary activity) that merges parallel data flows through some variant of a join (e.g., a relational join, diff, merge, lookup or any similar operation) or a union (e.g., the overall sorting of two independently sorted recordsets). A combination is built around a combinator with lines or other combinations as its inputs. We differentiate combinations into left-wing and right-wing combinations.

Left-wing combinations are constructed by lines and combinations forming the left wing of the butterfly. The left wing contains at least one combination. The inputs of the combination can be:
− Two lines. Two parallel data flows are unified into a single flow using a combination. These workflows are shaped like the letter ‘Y’ and we call them Wishbones.
− A line and a recordset. This refers to the practical case where data are processed through a line of operations, some of which require a lookup to persistent relations. In this setting, the Primary Flow of data is the line part of the workflow.
− Two or more combinations. The recursive usage of combinations leads to many parallel data flows. These workflows are called Trees.

Observe that in the cases of trees and primary flows, the target warehouse acts as the body of the butterfly (i.e., there is no right wing). This is a quite practical situation that covers (a) fact tables without materialized views and (b) the case of dimension tables that also need to be populated through an ETL workflow. There are some cases, too, where the body of the butterfly is not necessarily a recordset, but an activity with many outputs (see the last example of Figure 5). In these cases, the main goal of the scenario is to distribute data to the appropriate flows; this task is performed by an activity serving as the butterfly’s body.

Right-wing combinations are constructed by lines and combinations on the right wing of the butterfly. These lines and combinations form either a flat or a deep hierarchy.
− Flat Hierarchies. These configurations have small depth (usually 2) and large fan-out. An example of such a workflow is a Fork, where data are propagated from the fact table to the materialized views in two or more parallel data flows.
− Right-Deep Hierarchies. To handle all possible cases, we employ configurations with right-deep hierarchies. These configurations have significant depth and medium fan-out.
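As an illustration of the butterfly vocabulary (again our own sketch, with made-up node names a1–a5 standing for activities 1–5), the following snippet encodes the workflow of Figure 2 as a plain adjacency list and recovers the left and right wings around the body V by traversing data providers and data consumers.

```python
from collections import deque

edges = [("R", "a1"), ("S", "a2"), ("a1", "a3"), ("a2", "a3"), ("a3", "V"),
         ("V", "a4"), ("V", "a5"), ("a4", "Z"), ("a5", "W")]

def wings(edges, body):
    preds, succs = {}, {}
    for u, v in edges:
        succs.setdefault(u, []).append(v)
        preds.setdefault(v, []).append(u)

    def reachable(start, neighbours):
        seen, todo = set(), deque([start])
        while todo:
            n = todo.popleft()
            for m in neighbours.get(n, []):
                if m not in seen:
                    seen.add(m)
                    todo.append(m)
        return seen

    # nodes feeding the body form the left wing; nodes fed by it, the right wing
    return reachable(body, preds), reachable(body, succs)

left, right = wings(edges, body="V")
print("left wing:", sorted(left))    # R, S, a1, a2, a3
print("right wing:", sorted(right))  # a4, a5, W, Z
```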
51
Figure 3. Butterfly classes: (a) Linear Workflow, (b) Wishbone, (c) Primary Flow, (d) Tree, (e) Flat Hierarchy – Fork, (f) Right-Deep Hierarchy
2.3 Goals of the benchmark
The design of a benchmark should be based upon a clear understanding of the characteristics of the inspected systems that do matter. Our fundamental motivation for coming up with the proposed benchmark was the complete absence of a principled way to experiment with ETL workflows in the related literature. Therefore, we propose a configuration that covers a broad range of possible workflows (i.e., a large set of configurable parameters) and a limited set of monitored measures.

The goal of this benchmark is to provide the experimental testbed to be used for the assessment of ETL methods or tools concerning their basic behavioral properties (measures) over a broad range of ETL workflows.

The benchmark’s goal is to study and evaluate workflows as a whole. We are not interested in providing specialized performance measures for very specific tasks in the overall process. Neither are we interested in exhaustively enumerating all the possible alternatives for specific operations. For example, this benchmark is not intended to facilitate the comparison of alternative methods for the detection of duplicates in a data set, since it does not take the tuning of all the possible parameters for this task into consideration. On the contrary, this benchmark can be used for the assessment of the integration of such methods in complex ETL workflows, assuming that all the necessary knobs and bolts have been appropriately tuned.

There are two modes of operation for an ETL workflow. In the traditional off-line mode, the workflow is executed during a specific time window of some hours (typically at night), when the systems are not servicing their end-users. Due to the low load of both the source systems and the warehouse, both the refreshment of data and any other administrative activities (cleanups, auditing, etc.) are easier to complete. In the active mode, the sources continuously try to send new data to the warehouse. This is not necessarily done instantly; rather, small groups of data are collected and sent to the warehouse for further processing. The difference between the two modes lies not only in the frequency of the workflow execution, but also in the load of the systems whenever the ETL workflow executes.

Independently of the mode under which the ETL workflow operates, the two fundamental goals that should be reached are effectiveness and efficiency. Hence, given an ETL engine or a specific experimental method to be assessed over one or more ETL workflows, these fundamental goals should be evaluated. To organize the benchmark better, we classify the assessment of the aforementioned goals through the following questions:

Effectiveness
Q1. Does the workflow execution reach the maximum possible (or, at least, the minimum tolerable) level of data freshness, completeness and consistency in the warehouse within the necessary time (or resource) constraints?
Q2. Is the workflow execution resilient to occasional failures?

Efficiency
Q3. How fast is the workflow executed?
Q4. What resource overheads does the workflow incur at the source and the warehouse side?

In the sequel, we elaborate on these questions.

Effectiveness. The objective is to have data respect both database and business rules. A clear business rule is the need to have data as fresh as possible in the warehouse. Also, we need all of the source data to be eventually loaded at the warehouse – not necessarily immediately as they appear at the source side – nevertheless, the sources and the warehouse must be consistent
52
at least at a certain frequency (e.g., at the end of a day). Subproblems that occur in this larger framework:
− Recovery from failures. If some data are lost from the ETL process due to failures, then we need to synchronize sources and warehouse and compensate for the missing data.
− Missing changes at the source. Depending on what kind of change detector we have at the source, it is possible that some changes are lost (e.g., if we have a log sniffer, bulk updates not passing through the log file are lost). Also, in an active warehouse, if the active ETL engine needs to shed some incoming data in order to be able to process the rest of the incoming data stream successfully, it is imperative that these left-over tuples be processed later.
− Transactions. Depending on the source sniffer (e.g., a trigger-based sniffer), tuples from aborted transactions may be sent to the warehouse and, therefore, they must be undone.

Minimal overheads at the sources and the warehouse. The production systems are under continuous load due to the large number of OLTP transactions performed simultaneously. The warehouse system supports a large number of readers executing client applications or decision support queries. In the off-line ETL, the overheads incurred are of rather secondary importance, in the sense that the contention with such processes is practically non-existent. Still, in active warehousing, the contention is clear.
− Minimal overhead of the source systems. It is imperative to impose the minimum additional workload on the source, in the presence of OLTP transactions.
− Minimal overhead of the DW system. As writer processes populate the warehouse with new data and reader processes ask data from the warehouse, the desideratum is that the warehouse operates with the lightest possible footprint for such processes as well as the minimum possible delay for incoming tuples and user queries.

3. Problem Parameters
In this section, we propose a set of configuration parameters along with a set of measures to be monitored in order to assess the fulfillment of the aforementioned goals of the benchmark.

Experimental parameters. Given the previous description of ETL workflows, the following problem parameters appear to be of particular importance to the measurement:
P1. the size of the workflow (i.e., the number of nodes contained in the graph),
P2. the structure of the workflow (i.e., the variation of the nature of the involved nodes and their interconnection as the workflow graph),
P3. the size of input data originating from the sources,
P4. the overall selectivity of the workflow, based on the selectivities of the activities of the workflow,
P5. the values of probabilities of failure.

Measured Effects. For each set of experimental measurements, certain measures need to be assessed in order to characterize the fulfillment of the aforementioned goals. In the sequel, we classify these measures according to the assessment question they are employed to answer.

Q1. Measures for data freshness and data consistency. The objective is to have data respect both database and business rules. Also, we need data to be consistent with respect to the source as much as possible. The latter possibly incurs a certain time window for achieving this goal (e.g., once a day), in order to accommodate high refresh rates in the case of active data warehouses or failures in the general case. Concrete measures:
− (M1.1) Percentage of data that violate business rules.
− (M1.2) Percentage of data that should be present at their appropriate warehouse targets, but are not.

Q2. Measures for the resilience to failures. The main idea is to perform a set of workflow executions that are intentionally abnormally interrupted at different stages of their execution. The objective is to discover how many of these workflows were successfully compensated within the specified time constraints. Concrete measures:
− (M2) Percentage of successfully resumed workflow executions.

Q3. Measures for the speed of the overall process. The objective is to perform the ETL process as fast as possible. In the case of off-line loading, the objective is to complete the process within the specified time window. Naturally, the faster this is performed the better (especially in the context of failure resumption). In the case of an active warehouse, where the ETL process is performed very frequently, the objective is to minimize the time that each tuple spends inside the ETL module. Concrete measures:
− (M3.1) Throughput of regular workflow execution (this may also be measured as total completion time).
− (M3.2) Throughput of workflow execution including a specific percentage of failures and their resumption.
− (M3.3) Average latency per tuple in regular execution.

Q4. Measured overheads. The overheads at the source and the warehouse can be measured in terms of consumed memory and latency with respect to regular operation. Concrete measures:
− (M4.1) Min/max/avg timeline of memory consumed by the ETL process at the source system.
− (M4.2) Time needed to complete the processing of a certain number of OLTP transactions in the presence (as opposed to the absence) of ETL software at the source, in regular source operation.
− (M4.3) The same as M4.2, but in the case of source failure, where ETL tasks are to be performed too, concerning the recovered data.
− (M4.4) Min/max/avg timeline of memory consumed by the ETL process at the warehouse system.
− (M4.5) (Active warehousing) Time needed to complete the processing of a certain number of decision support queries in the presence (as opposed to the absence) of ETL software at the warehouse, in regular operation.
− (M4.6) The same as M4.5, but in the case of any (source or warehouse) failure, where ETL tasks are to be performed too at the warehouse side.
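A hedged sketch of how some of these measures could be computed from raw run statistics; the record layout and the function names are assumptions of ours, not part of the benchmark definition.

```python
def m1_violations(rows_checked, rows_violating_rules):
    """M1.1: percentage of data violating business rules."""
    return 100.0 * rows_violating_rules / rows_checked

def m2_resumption(runs):
    """M2: percentage of interrupted runs that were successfully resumed.
    `runs` is a list of dicts such as {"interrupted": True, "resumed": True}."""
    interrupted = [r for r in runs if r["interrupted"]]
    resumed = [r for r in interrupted if r["resumed"]]
    return 100.0 * len(resumed) / len(interrupted) if interrupted else None

def m3_throughput(tuples_loaded, completion_seconds):
    """M3.1: throughput of a regular workflow execution (tuples/second)."""
    return tuples_loaded / completion_seconds

def m3_avg_latency(per_tuple_latencies):
    """M3.3: average latency per tuple in regular execution (seconds)."""
    return sum(per_tuple_latencies) / len(per_tuple_latencies)

# Example: 1,000,000 tuples loaded in 250 seconds -> 4000 tuples/s.
print(m3_throughput(1_000_000, 250))
```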
53
4. SPECIFIC SCENARIOS
A particular problem that arises in designing a test suite for ETL workflows concerns the complexity (structure and size) of the employed workflows. The first possible way to deal with the problem is to construct a workflow generator, based on the aforementioned disciplines. The second possible way is to come up with an indicative set of ETL workflows that serve as the basis for experimentation. Clearly, the first path is feasible; nevertheless, it is quite hard to artificially produce large volumes of workflows of different sizes and complexities all of which make sense. In this paper, we follow the second approach. We discuss an exemplary set of sources and a warehouse based on the TPC-H benchmark [13] and we propose specific ETL scenarios for this setting.

4.1 Database Schema
The information kept in the warehouse concerns parts and their suppliers, as well as the orders that customers place, along with some demographic data for the customers. The scenarios used in the experiments clean and transform the source data into the desired warehouse schema. The sources for our experiments are of two kinds, storage houses and sales points. Every storage house keeps data for the suppliers and parts, while every sales point keeps data for the customers and the orders. The schemata of the sources and the data warehouse are depicted in Figure 4.

Data Warehouse:
PART (rkey, s_partkey, name, mfgr, brand, type, size, container, comment)
SUPPLIER (s_suppkey, name, address, nationkey, phone, acctbal, comment, totalcost)
PARTSUPP (s_partkey, s_suppkey, availqty, supplycost, comment)
CUSTOMER (s_custkey, name, address, nationkey, phone, acctball, mktsegment, comment)
ORDER (s_orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, clerk, shippriority, comment)
LINEITEM (s_orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount, tax, returnflag, linestatus, shipdate, commitdate, receiptdate, shipinstruct, shipmode, comment, profit)

Storage House:
PART (partkey, name, mfgr, brand, type, size, container, comment)
SUPPLIER (suppkey, name, address, nationkey, phone, acctbal, comment)
PARTSUPP (partkey, suppkey, availqty, supplycost, comment)

Sales Point:
CUSTOMER (custkey, name, address, nationkey, phone, acctball, mktsegment, comment)
ORDER (orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, clerk, shippriority, comment)
LINEITEM (orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount, tax, returnflag, linestatus, shipdate, commitdate, receiptdate, shipinstruct, shipmode, comment)

Figure 4. Database schemata

4.2 ETL Scenarios
In this subsection, we propose a set of ETL scenarios, which are depicted in Figure 5, while some statistics are shown in Table 1. We consider the butterfly cases discussed in Section 2 to be representative of a large number of ETL scenarios and thus, we propose a specific scenario for each kind. We provide only small-size scenarios as an indication (thus, a right-deep scenario is not given); the rigorous definition of medium- and large-size scenarios is still open.

The line workflow has a simple form since it applies a set of filters, transformations, and aggregations to a single table. This scenario type is used to filter source tables and assure that the data meet the logical constraints of the data warehouse. In the proposed scenario, we start with an extracted set of new source rows LineItem.D+ and push them towards the warehouse as follows:
1. First, we check the fields "partkey", "orderkey" and "suppkey" for NULL values. Any NULL values are replaced by appropriate special values.
2. Next, a calculation of a value "profit" takes place. This value is locally derived from other fields in a tuple as the amount of "extendedprice" reduced by the values of the "tax" and "discount" fields.
3. The third activity changes the fields "extendedprice", "tax", "discount" and "profit" to a different currency.
4. The results of this operation are loaded first into a delta table DW.D+ and subsequently into the data warehouse DWH. The first load simply replaces the respective recordset, whereas the second involves the incremental appending of these rows to the warehouse.
5. The workflow does not stop after the completion of the left wing, since we would like to create some materialized views. The next operation is a filter that keeps only records whose return status is "False".
6. Next, an aggregation calculates the sum of the "extendedprice" and "profit" fields grouped by "partkey" and "linestatus".
7. The results of the aggregation are loaded in view View01 by (a) updating existing rows and (b) inserting new groups wherever appropriate.
8. The next activity is a router, sending the rows of view View01 to one of its two outputs, depending on whether the "linestatus" field has the value "delivered" or not.
9. The rows with value "delivered" are further aggregated for the sum of the "profit" and "extendedprice" fields grouped by "partkey".
10. The results are loaded in view View02 as in the case of view View01.
11. The rows with value different than "delivered" are further aggregated for the sum of the "profit" and "extendedprice" fields grouped by "partkey".
12. The results are loaded in view View03 as in the case of view View01.

A wishbone workflow joins two parallel lines into one. This scenario is preferred when two tables in the source database should be joined in order to be loaded to the data warehouse, or in the case where we perform similar operations on different data that are later joined. In our exemplary scenario, we track the changes that happen in a source containing customers. We compare the customers of the previous load to the ones of the current load and search for new customers to be loaded in the data warehouse.
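The line scenario described above is dominated by row-level operations (NULL replacement, the derivation of "profit", and a currency conversion). The following sketch, with an assumed conversion rate and an assumed special value for missing keys, illustrates such a row-level activity; it is our own illustration, not the paper's implementation.

```python
RATE = 0.85            # assumed currency conversion rate, for illustration only
MISSING_KEY = -1       # assumed special value replacing NULL keys

def transform_lineitem(row):
    """row is a dict with the LineItem attributes used in the scenario."""
    out = dict(row)
    # step 1: replace NULL key values by a special value
    for key in ("partkey", "orderkey", "suppkey"):
        if out[key] is None:
            out[key] = MISSING_KEY
    # step 2: profit = extendedprice reduced by tax and discount
    out["profit"] = out["extendedprice"] - out["tax"] - out["discount"]
    # step 3: currency conversion of the monetary fields
    for key in ("extendedprice", "tax", "discount", "profit"):
        out[key] = round(out[key] * RATE, 2)
    return out

row = {"partkey": None, "orderkey": 7, "suppkey": 3,
       "extendedprice": 100.0, "tax": 5.0, "discount": 2.0}
print(transform_lineitem(row))
```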
54
Figure 5. Specific ETL workflows: Line, Wishbone, Primary Flow, Tree
55
Figure 5 (cont’d). Specific ETL workflows: Fork, Balanced Butterfly (1), Balanced Butterfly (2)

Table 1. Summarized statistics of the constituents of the ETL workflows depicted in Figure 5 (counts of filters, functions, routers, aggregators, holistic functions, joins, diffs and unions, and the load mode of the body and of the views, for each of the workflows Line, Wishbone, Primary Flow, Tree, Fork, BB(1) and BB(2))
56
1. The first activity on the new data set checks for NULL values in the "custkey" field. The problematic rows are kept in an error log file for further off-line processing.
2. Both the new and the old data are passed through a surrogate key transformation. We assume a domain size that fits in main memory for this source; therefore, the transformation is not performed as a join with a lookup table, but rather as a lookup function call invoked per row.
3. Moreover, the next activity converts the phone numbers into a numeric format, removing dashes and replacing the '+' character with the "00" equivalent.
4. The transformed recordsets are persistently stored in relational tables or files, which are subsequently compared through a difference operator (typically implemented as a join variant) to detect new rows.
5. The new rows are stored in a file C.D+ which is kept for the possibility of failure. Then the rows are appended to the warehouse dimension table Customer.
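Steps 2 and 4 of the wishbone above — a per-row surrogate-key lookup for a domain that fits in main memory, followed by a difference operator that detects new customers — could look roughly as follows; the field and function names are our own assumptions.

```python
def make_sk_lookup(mapping, next_sk):
    """Returns a function that maps source keys to surrogate keys,
    assigning fresh surrogates to unseen keys (domain fits in memory)."""
    def lookup(custkey):
        nonlocal next_sk
        if custkey not in mapping:
            mapping[custkey] = next_sk
            next_sk += 1
        return mapping[custkey]
    return lookup

def new_customers(new_rows, old_rows, lookup):
    # difference operator: keep only rows whose key is absent from the old load
    old_keys = {lookup(r["custkey"]) for r in old_rows}
    return [dict(r, s_custkey=lookup(r["custkey"]))
            for r in new_rows
            if lookup(r["custkey"]) not in old_keys]

lookup = make_sk_lookup({}, next_sk=1)
old = [{"custkey": 10}, {"custkey": 11}]
new = [{"custkey": 11}, {"custkey": 12}]
print(new_customers(new, old, lookup))   # only custkey 12 is new
```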
The primary flow scenario is a common scenario in cases where the source table must be enriched with surrogate keys. The exemplary primary flow that we use has as input the Orders table. The scenario is simple: all key-based values ("orderstatus", "custkey", "orderkey") pass through surrogate key filters that look up (join) the incoming records in the appropriate lookup table. The resulting rows are appended to the relation DW.Orders. If incoming records exist in the DW.Orders relation and they have changed values, then they are overwritten (thus the Slowly Changing Dimension Type 1 tag in the figure); otherwise, a new entry is inserted in the warehouse relation.

The tree scenario joins several source tables and applies aggregations on the result recordset. The join can be performed either over heterogeneous relations, whose contents are combined, or over homogeneous relations, whose contents are integrated into one unified (possibly sorted) data set. In our case, the exemplary scenario involves three sources for the warehouse relation PartSupp. The scenario evolves as follows:
1. Each new version of the source is sorted by its primary key and checked against its past version for the detection of new or updated records. The DIFFI,U operator checks the two inputs for the combination of pkey, suppkey matches. If a match is not found, then a new record is found. If a match is found and there is a difference in the field "availqty", then an update needs to be performed.
2. These new records are assigned surrogate keys per source.
3. The three streams of tuples are united in one flow and they are also sorted by "pkey", since this ordering will be exploited later. Then, a delta file PS.D is produced.
4. The contents of the delta file are appended to the warehouse relation DW.PS.
5. At the same time, the materialized view View04 is refreshed too. The delta rows are summarized for the available quantity per pkey and then the appropriate rows in the view are either updated (if the group exists) or inserted (if the group is not present).

The fork scenario applies a set of aggregations on a single source table. First the source table is cleaned, just like in a line scenario, and the result table is used to create a set of materialized views. Our exemplary scenario uses the Lineitem table as the butterfly’s body and starts with a set of extracted new records to be loaded.
1. First, surrogate keys are assigned to the fields "partkey", "orderkey" and "suppkey".
2. We convert the dates in the "shipdate" and "receiptdate" fields into a "dateId", a unique identifier for every date.
3. The third activity is a calculation of a value "profit". This value is derived from other fields in every tuple as the amount of "extendedprice" reduced by the values of the "tax" and "discount" fields.
4. This activity changes the fields "extendedprice", "tax", "discount" and "profit" to a different currency. The result of this activity is stored at a delta table D+.LI. The records are appended to the data warehouse LineItem table and they are also reused for a number of aggregations at the right wing of the butterfly. All records pushed towards the views either update or insert new records in the views, depending on the existence (or not) of the respective groups.
5. The aggregator for View05 calculates the sum of the "profit" and "extendedprice" fields grouped by the "partkey" and "linestatus" fields.
6. The aggregator for View06 calculates the sum of the "profit" and "extendedprice" fields grouped by the "linestatus" field.
7. The aggregator for View07 calculates the sum of the "profit" field and the average of the "discount" field grouped by the "partkey" and "suppkey" fields.
8. The aggregator for View08 calculates the average of the "profit" and "extendedprice" fields grouped by the "partkey" and "linestatus" fields.

The most general-purpose scenario type is a butterfly scenario. It joins two or more source tables before a set of aggregations is performed on the result of the join. The left wing of the butterfly joins the source tables, while the right wing performs the desired aggregations producing materialized views.

Our first exemplary scenario uses new source records concerning Partsupp and Supplier as its input.
1. Concerning the Partsupp source, we generate surrogate key values for the "partkey" and "suppkey" fields. Then, the "totalcost" field is calculated and added to each tuple.
2. Then, the transformed records are saved in a delta file D+.PS and appended to the relation DW.Partsupp.
3. Concerning the Supplier source, a surrogate key is generated for the "suppkey" field and a second activity transforms the "phone" field.
4. Then, the transformed records are saved in a delta file D+.S and appended to the relation DW.Supplier.
5. The delta relations are subsequently joined on the "ps_suppkey" and "s_suppkey" fields and populate the view View09, which is augmented with the new records. Then, several views are computed from scratch, as follows.
6. View View10 calculates the maximum and the minimum value of the "supplycost" field grouped by the "nationkey" and "partkey" fields.
57
Concerning data sizes, the numbers given by TPC-H can be a valid point of reference for data warehouse contents. Still, in our view, a more important factor is the fraction of the source data over the warehouse contents. In our research we have used fractions that range from 0.01 to 0.7. We also consider values between 0.5 and 1.2 reasonable for the selectivity of the left wing of a butterfly. Selectivity refers both to detected dirty data that are placed in quarantine and to newly produced data due to some transformation (e.g., unpivot). A low value of 0.5 means an extremely dirty (50%) data population, whereas a high value means intense data generation. In terms of failure rates, we think that the probability of a failure during a workflow execution can range between the reasonable rates of 10^-4 and 10^-2. Concerning workflow size, although we provide scenarios of small scale, medium-size and large-size scenarios are also needed.
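As an illustration only, the ranges mentioned here could be turned into a grid of test-suite configurations; the specific grid points below are our own choice, not values prescribed by the benchmark.

```python
from itertools import product

source_fraction = [0.01, 0.1, 0.3, 0.7]       # source data / warehouse contents
left_wing_selectivity = [0.5, 0.8, 1.0, 1.2]  # overall selectivity of the left wing
failure_probability = [1e-4, 1e-3, 1e-2]      # probability of failure per execution

configurations = [
    {"fraction": f, "selectivity": s, "p_fail": p}
    for f, s, p in product(source_fraction, left_wing_selectivity,
                           failure_probability)
]
print(len(configurations), "configurations")   # 48
```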
7. View View12 calculates the maximum and the minimum of the "supplycost" field grouped by the "partkey" field.
8. View View11 calculates the sum of the "totalcost" field grouped by the "nationkey" and "suppkey" fields.
9. View View13 calculates the sum of the "totalcost" field grouped by the "suppkey" field.

A second exemplary scenario introduces a Slowly Changing Dimension plan, populating the dimension table PART and retaining its history at the same time. The trick lies in the combination of the "rkey" and "s_partkey" attributes. The "s_partkey" assigns a surrogate key to a certain tuple (e.g., assume it assigns 10 to a product X). If the product changes in one or more attributes at the source (e.g., X’s "size" changes), then a new record is generated, with the same "s_partkey" and a different "rkey" (which can be a timestamp-based key, or similar). The proposed scenario works as follows:
1. A new and an old version of the source table Part are compared for changes. Changes are directed to P.D++ (for new records) and to P.DU (for updates in the fields "size" and "container").
2. Surrogate and recent keys are assigned to the new records that are propagated to the table PART for storage.
3. An auxiliary table MostRecentPART, holding the most recent "rkey" per "s_partkey", is appropriately updated.
Auxiliary structures and processes. We have intentionally avoided backup and maintenance processes in our test suites. We have also avoided delving too deep in physical details of our test suites. A clear consequence of this is the lack of any discussion on indexes of any type in the warehouse. Still, we would like to point out that if an experiment should require the existence of special structures such as indexes, it is quite straightforward to separate the computation of elapsed time or resources for their refreshment and to compute the throughput or the consumed resources appropriately.
Observe that in this scenario the body of the butterfly is an activity.
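The change-detection step used by the tree and slowly-changing-dimension scenarios above (a DIFF operator that separates new records from updated ones) can be sketched as follows; the record layout and the function name are illustrative assumptions of ours.

```python
def diff_iu(new_rows, old_rows, key, tracked):
    """Match records on `key`; unmatched records are inserts, matched records
    whose `tracked` attributes differ are updates."""
    old_index = {tuple(r[k] for k in key): r for r in old_rows}
    inserts, updates = [], []
    for row in new_rows:
        k = tuple(row[c] for c in key)
        old = old_index.get(k)
        if old is None:
            inserts.append(row)
        elif any(row[c] != old[c] for c in tracked):
            updates.append(row)
    return inserts, updates

new = [{"pkey": 1, "suppkey": 9, "availqty": 50},
       {"pkey": 2, "suppkey": 9, "availqty": 10}]
old = [{"pkey": 1, "suppkey": 9, "availqty": 40}]
ins, upd = diff_iu(new, old, key=("pkey", "suppkey"), tracked=("availqty",))
print(ins)   # pkey 2 is a new record
print(upd)   # pkey 1 has a changed availqty
```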
Parallelism and Partitioning. Although the benchmark is currently not intended to be used for system comparison, the underlying physical configuration in terms of parallelism, partitioning and platform can play an important role for the performance of an ETL process. In general, there exist two broad categories of parallel processing: pipelining and partitioning. In pipeline parallelism, the various activities operate simultaneously in a system with more than one processor. This policy performs well for ETL processes that handle a relatively small volume of data. For large volumes of data, a different parallelism policy should be devised: the partitioning of the dataset into smaller sets. Then, we use different instances of the ETL process for handling each partition of data. In other words, the same activity of an ETL process runs simultaneously on several processors, each processing a different partition of data. At the end of the process, the data partitions should be merged and loaded to the target recordset(s). Frequently, a combination of the two policies is used to achieve maximum performance. Hence, while an activity is processing partitions of data and feeding pipelines, a subsequent activity may start operating on a certain partition before the previous activity has finished.
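A minimal sketch of partition parallelism using only the Python standard library; the partitioning scheme and the placeholder activity are assumptions for illustration, not an implementation prescribed by the benchmark.

```python
from concurrent.futures import ProcessPoolExecutor

def activity(partition):
    # stand-in for an arbitrary row-level ETL activity
    return [row * 2 for row in partition]

def run_partitioned(rows, n_partitions=4):
    # split the input into disjoint partitions, one worker per partition
    partitions = [rows[i::n_partitions] for i in range(n_partitions)]
    with ProcessPoolExecutor(max_workers=n_partitions) as pool:
        results = list(pool.map(activity, partitions))
    # merge the partial results before the final load
    return [row for part in results for row in part]

if __name__ == "__main__":
    print(sum(run_partitioned(list(range(1000)))))   # 999000
```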
5. OPEN ISSUES
Although we have structured the proposed test suites to be as representative as possible, there are several other tunable parameters of a benchmark that are not thoroughly explored. We discuss these parameters in this section.

Nature of data. Clearly, the proposed benchmark is constructed on the basis of a relational understanding of the data involved. Neither the sources nor the warehouse deal with semi-structured or web data. It is clear that the benchmark can be enriched with a part of the warehouse schema that is (incrementally) refreshed with HTML/XML source data.

Active vs. off-line modus operandi. We do not specify different test suites for the active and off-line modus operandi of the refreshment process. The construction of the test suites is influenced by an off-line understanding of the process. Although these test suites can be used to evaluate strategies for active warehousing (since there can be no compromise with respect to the transformations required for the loading of source data), it is understood that an active process (a) should involve some tuning for the micro-volumes of data that are dispatched from the sources to the warehouse in every load and (b) could involve some load-shedding activities if the transmitted volumes are higher than the ETL workflow can process.
In Figure 6, the execution of an abstract ETL process is pictorially depicted. In Figure 6(a), the execution is performed sequentially. In this case, only one instance of it exists. Figures 6(b) and 6(c) show the parallel execution of the process in a pipelining and a partitioning fashion, respectively. In the latter case, larger volumes of data may be handled efficiently by more than one instance of the ETL process; in fact, there are as many instances as the partitions used.
Tuning of the values for the data sizes, workflow selectivity, failure rate and workflow size. We have intentionally avoided providing specific numbers for several problem parameters; we believe that a careful assignment of values based on a large number of real-world case studies (that we do not possess) should be a topic for a full-fledged benchmark. Still, we would like to mention here what we think as reasonable numbers.
Platform. Depending on the system architecture and hardware, the parallel processing may be either symmetric multiprocessing – a single operating system, with the processors communicating through shared memory – or clustered processing – multiple operating systems, with the processors communicating through the network. The choice of an appropriate strategy for the execution of an ETL process, apart from the availability of resources, relies on the nature of the activities participating in it. In terms of performance, an activity is bounded by three main factors: CPU, memory, and/or disk I/O. For an ETL process that includes mainly CPU-limited activities, the choice of a symmetric multiprocessing strategy would be beneficial. For ETL processes containing mainly activities with memory or disk I/O limitations – sometimes, even with CPU limitations – the clustering approach may improve the total performance due to the usage of multiple processors, which have their own dedicated memory and disk access. However, the designer should confront the trade-off between the advantages of the clustering approach and the potential problems that may occur due to network traffic. For example, a process that needs frequent repartitioning of data should not use clusters in the absence of a high-speed network.
58
Figure 6. (a) Sequential, (b) pipelining, and (c) partitioning execution of ETL processes

6. RELATED WORK
Several benchmarks have been proposed in the database literature in the past. Most of the benchmarks that we have reviewed make careful choices: (a) on the database schema and instance they use, (b) on the type of operations employed, and (c) on the measures to be reported. Each benchmark has a guiding goal, and these three parts of the benchmark are employed to implement it.

To give an example of the above, we mention two benchmarks mainly coming from the Wisconsin database group. The OO7 benchmark was one of the first attempts to provide a comparative platform for object-oriented DBMSs [3]. The OO7 benchmark had the clear target to test as many aspects as possible of the efficiency of the measured systems (speed of pointer traversal, update efficiency, query efficiency). The BUCKY benchmark had a different viewpoint: the goal was to narrow down the focus only to the aspects of an OODBMS that were object-oriented (or object-relational): queries over inheritance, set-valued attributes, pointer navigation, methods and ADTs [4]. Aspects covered by relational benchmarks were not included in the BUCKY benchmark.

TPC has proposed two benchmarks for the case of decision support. The TPC-H benchmark [13] is a decision support benchmark that consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. The database describes a sales system, keeping information for the parts and the suppliers, and data about orders and the supplier's customers. The relational schema of TPC-H consists of eight separate tables, with five of them being clearly dimension tables, one being a clear fact table, and a couple of them combinations of fact and dimension tables. Unfortunately, the refreshment operations provided by the benchmark are primitive and not particularly useful as templates for the evaluation of ETL scenarios.

TPC-DS is a new Decision Support (DS) workload being developed by the TPC [10]. This benchmark models the decision support system of a retail product supplier, including queries and data maintenance. The relational schema of this benchmark is more complex than the schema presented in TPC-H. There are three sales channels: store, catalog and the web. There are two fact tables in each channel, sales and returns, and a total of seven fact tables. In this dataset, the row counts scale differently per table category: specifically, in fact tables the row count grows linearly, while in dimension tables it grows sub-linearly. This benchmark also provides refreshment scenarios for the data warehouse. Still, all these scenarios belong to the category of primary flows, in which surrogate and global keys are assigned to all tuples.

7. CONCLUSIONS
In this paper, we have dealt with the challenge of presenting a unified experimental playground for ETL processes. First, we have presented a principled way for constructing ETL workflows and we have identified their most prominent elements. We have classified the most frequent ETL operations based on their special characteristics. We have shown that this classification adheres to the built-in operations of three popular commercial ETL tools; we do not anticipate any major deviations for other tools. Moreover, we have proposed a generic categorization of ETL workflows, namely butterflies, which covers frequent design cases. We have identified the main parameters and measures that are crucial in an ETL environment and we have discussed how parallelism affects the execution of an ETL process. Finally, we have proposed specific ETL scenarios based on the aforementioned analysis, which can be used as an experimental testbed for the evaluation of ETL methods or tools.

The main message of our work is the need for a commonly agreed benchmark that realistically reflects real-world scenarios, both for research purposes and, ultimately, for the comparison of ETL tools. Feedback from industry is necessary (both with respect to the complexity of the workflows and the frequencies of typically encountered ETL operations) in order to further tune the benchmark to reflect the particularities of real-world ETL workflows more precisely.
59
8. REFERENCES [1] J. Adzic, V. Fiore. Data Warehouse Population Platform. In DMDW, 2003.
[8] Oracle. Oracle Warehouse Builder 10g. Retrieved, 2007. URL: http://www.oracle.com/technology/products/warehouse/
[2] Ascential Software Corporation. DataStage Enterprise Edition: Parallel Job Developer’s Guide. Version 7.5, Part No. 00D-023DS75, 2004.
[9] Oracle. Oracle Warehouse Builder Transformation Guide. 10g Release 1 (10.1), Part No. B12151-01, 2006. [10] R. Othayoth, M. Poess. The Making of TPC-DS. In VLDB, 2006.
[3] M. J. Carey, D. J. DeWitt, J. F. Naughton. The OO7 Benchmark. In SIGMOD, 1993. [4] M. J. Carey et al. The BUCKY Object-Relational Benchmark (Experience Paper). In SIGMOD, 1997.
[11] A. Simitsis, P. Vassiliadis, S. Skiadopoulos, T. Sellis. Data Warehouse Refreshment. In "Data Warehouses and OLAP: Concepts, Architectures and Solutions", IRM Press, 2006.
[5] IBM. WebSphere DataStage. Retrieved, 2007. URL: http://www-306.ibm.com/software/data/integration/datastage/
[12] K. Strange. Data Warehouse TCO: Don’t Underestimate the Cost of ETL. Gartner Group, DF-15-2007, 2002.
[6] Informatica. PowerCenter 8. Retrieved, 2007. URL: http://www.informatica.com/products/powercenter/
[13] TPC. TPC-H benchmark. Transaction Processing Council. URL: http://www.tpc.org/.
[7] Microsoft. SQL Server 2005 Integration Services (SSIS). Url: http://technet.microsoft.com/en-us/sqlserver/bb331782.aspx
[14] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios. Duplicate Record Detection: A Survey. IEEE TKDE (1): 1-16 (2007).
APPENDIX Table A1. Taxonomy of activities at the micro level and similar built-in transformations provided by commercial ETL tools
Extr.
Transformation and Cleansing
Transformation Category*
SQL Server Integration Services (SSIS) [7]
Row-level: Function that can be applied locally to a single row
− − − − − − −
Routers: Locally decide, for each row, which of the many outputs it should be sent to Unary Grouper: Transform a set of rows to a single row
− Conditional Split − Multicast
Unary Holistic: Perform a transformation to the entire data set (blocking) Binary or N-ary: Combine many inputs into one output
− Sort − Percentage Sampling − Row Sampling Union-like: − Union All − Merge Join-like: − Merge Join (MJ) − Lookup (SKJ) − Import Column (NLJ)
Character Map Copy Column Data Conversion Derived Column Script Component OLE DB Command Other filters (not null, selections, etc.)
− Aggregate − Pivot/Unpivot
DataStage [2] − Transformer (A generic representative of a broad range of functions: date and time, logical, mathematical, null handling, number, raw, string, utility, type conversion/casting, routing.) − Remove duplicates − Modify (drop/keeps columns or change their types) − − − − − − − −
Copy Filter Switch Aggregator Make/Split subrecord Combine/Promote records Make/Split vector Sort (sequential, parallel, total)
Oracle Warehouse Builder [9] − − − − −
Deduplicator (distinct) Filter Sequence Constant Table function (it is applied on a set of rows for increasing the performance) − Data Cleansing Operators (Name and Address, Match-Merge) − Other SQL transformations (Character, Date, Number, XML, etc.) − Splitter − Aggregator − Pivot/Unpivot − Sorter
Union-like: − Funnel (continuous, sort, sequence) Join-like: − Join − Merge − Lookup Diff-like: − Change capture/apply − Difference (record-by-record) − Compare (column-by-column)
Union-like: − Set (union, union all, intersect, minus) Join-like: − Joiner − Key Lookup (SKJ)
− Import Column Transformation
− Compress/Expand − Column import
− Merge − Import
− Export Column − Slowly Changing Dimension
− Compress/Expand − Column import/export
Load
− Merge − Export − Slowly Changing Dimension * All ETL tools provide a set of physical operations that facilitate either the extraction or the loading phase. Such operations include: extraction from hashed/sequential files, delimited/fixed width/multi-format flat files, file set, ftp, lookup, external sort, compress/uncompress, and so on.
Information Quality Measurement in Data Integration Schemas
Maria da Conceição Moraes Batista, Ana Carolina Salgado
Centro de Informática, Universidade Federal de Pernambuco
Av. Professor Luis Freire s/n, Cidade Universitária, 50740-540 Recife – PE, Brasil
Telephone Number: +55 81 2126.8430
{mcmb, acs}@cin.ufpe.br

ABSTRACT
Integrated access to distributed data is an important problem faced in many scientific and commercial applications. A data integration system provides a unified view for users to submit queries over multiple autonomous data sources. The queries are processed over a global schema that offers an integrated view of the data sources. Much work has been done on query processing and on choosing plans under cost criteria. However, not much is known about incorporating Information Quality analysis into data integration systems, particularly into the integrated schema. In this work we present an approach to Information Quality analysis of schemas in data integration environments. We discuss the evaluation of schema quality focusing on minimality, consistency and completeness aspects, and define some schema transformations to be applied in order to improve schema design and, consequently, the quality of data integration query execution.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Database Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permissions from the publisher, ACM. VLDB '07, September 23-28, 2007, Vienna, Austria. Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.

1. INTRODUCTION
Information quality (IQ) has become a critical aspect in organizations and, consequently, in Information Systems research. The notion of IQ has emerged only during the past ten years and shows a steadily increasing interest. IQ is a multidimensional aspect based on a set of dimensions or criteria, each of which assesses and measures a specific IQ aspect. One of these dimensions is minimality: this criterion states that an element has good quality if it has no redundancies.

A data integration system based on the Global-as-View (GAV) approach [14] provides users with a unified view of several data sources, called the integrated schema. In this kind of system, data is spread over multiple, distributed and heterogeneous sources and, consequently, query execution is an essential feature. To the best of our knowledge, not much is known about the important problem of incorporating IQ aspects into the usual data integration components and processes, such as query result integration, schema maintenance and source selection.

The primary contribution of this paper is the proposal of IQ criteria analysis in a data integration system, mainly related to the system's schemas. The main goal we intend to accomplish is to improve the quality of query execution. Our hypothesis is that an acceptable alternative to optimize query execution is the construction of good schemas, with high quality scores, and we have based our approach on this assumption.

We focused our work on developing IQ analysis mechanisms to address schema generation and maintenance, especially for the integrated schema. Initially we built a list of IQ criteria related to data integration aspects but, due to space limitations, we decided to formally specify the algorithms and definitions of the schema IQ criteria – minimality, completeness and type consistency – and restricted the presented approach to these aspects. We also defined an algorithm to perform schema minimality improvements. The paper is organized as follows: Section 2 discusses approaches to Information Quality (IQ) and its use in data integration and schemas; Section 3 introduces the main issues related to schema IQ criteria; Section 4 discusses the formalism of schema representation; Section 5 presents the formal specification of the chosen schema IQ criteria and Section 6 discusses some examples of these criteria; Section 7 presents the schema improvement algorithm addressing minimality aspects; and Section 8 contains our concluding remarks and final considerations about the mentioned topics.

2. APPROACHES OF IQ IN DATA INTEGRATION SYSTEMS
It has long been recognized that IQ is described or analyzed by multiple attributes or dimensions. During the past years, more and more dimensions and approaches have been identified in several works ([11], [17]). Naumann and Leser [17] define a framework addressing the IQ of query processing in a data integration system. This approach proposes the interleaving of query planning with quality considerations and creates a classification with twenty-two dimensions divided into three classes: one related to the user preferences, the second concerning query processing aspects and the last related to the data sources.

Another relevant topic to consider in IQ and data integration is the set of quality criteria for schemas. These are critical due to the importance of the integrated and data source schemas for query processing. Some works are related to IQ aspects of schema equivalence and transformations, as in [2], where the authors exploit the use of normalization rules to improve IQ in conceptual database schemas.
Some works are also relevant because they are related to schema-based data integration and schema correspondence definition, as in [6], [15] and [1]. The work proposed by Herden [11] deals with measuring the quality of conceptual database schemas. In this approach, given a quality criterion, the schema is reviewed by a specialist in the mentioned criterion.

In [21] the authors propose IQ evaluation for data warehouse schemas focusing on the analyzability and simplicity criteria.

The work presented in [9] enhances our consistency point of view by identifying a measure, column heterogeneity, to quantify the data quality problems that can arise when merging data from different sources. Similarly, Halevy in [10] discusses the importance of detecting data types in handling schema heterogeneity for several tasks, including data integration. Our proposition is centered on IQ analysis for schemas in data integration systems with the goal of query optimization, and it is described in the next section.

3. SCHEMA QUALITY ISSUES
The main feature of a data integration system is to free the user from having to know about specific data sources and interact with each one. Instead, the user submits queries to an integrated schema, which is a set of views over a number of data sources, designed for a particular data integration application. Each source must publish a data source schema with the representation of its contents. As a consequence, the data integration system must first reformulate a user query into queries that refer directly to the schemas of the sources. For this reformulation step, a set of correspondences, called schema mappings, is required. There are also user schemas, which represent the information requirements defined for one user or a group of users. The user requirements and their schemas are not the focus of this work.

Commonly, the tasks of query processing involving query submission, planning, decomposition and result integration are performed by a software module called mediator [27].

As a starting point, we adopted IQ classifications proposed in previous works ([3], [17], [11], [20], [24], [25]) with small variations: some criteria are not considered (not applicable), and some were adapted to our environment. In our classification, the IQ aspects were adapted and associated to the main elements of a data integration system.

When considering any data integration task, component, process or element (for example, a user query execution, data source selection or integrated schema generation), we perceive that each one can be associated with one of three components: data, schemas and data sources. These components are the core of our IQ criteria classification. We classify as data all the data objects that flow into the system, for example query results or an attribute value. The schemas are the structures exported by the data sources (source schemas), the structures that are relevant for users to build queries (user schemas) and the mediation entities (integrated schema). The data sources are the origin of all data and schema items in the system.

All IQ criteria in the data integration system are associated with one of the three groups of elements according to Table 1.

Table 1. Data integration IQ criteria classification
− Data Sources: Reputation, Verifiability, Availability, Response Time
− Schema: Schema Completeness, Minimality, Type Consistency
− Data: Data Completeness, Timeliness, Accuracy

In this paper, we present our approach to schema maintenance with quality aspects, using three IQ criteria: schema completeness, minimality and type consistency.

Schema Completeness. The completeness can be measured as the percentage of real-world objects modeled in the integrated schema that can be found in the sources. Therefore, the schema completeness criterion is the number of concepts provided by the schema with respect to the application domain.

Minimality. Minimality is the extent to which the schema is compactly modeled and without redundancies. In our point of view, the minimality concept is very important to data integration systems because the integrated schema generated by the system may have redundancies. The key motivation for analyzing minimality is the statement that the more minimal the integrated schema is, the fewer redundancies it contains and, consequently, the more efficient the query execution becomes [12]. Thus, we believe that our minimality analysis will help decrease the extra time spent by the mediator on access to unnecessary information represented by redundant schema elements.

Type Consistency. Type consistency is the extent to which the attributes corresponding to the same real world concept are represented with the same data type across all schemas of a data integration system.

Table 2 lists each criterion with its definition and the metric used to calculate scores.

Table 2. IQ criteria for schema quality analysis
− Schema Completeness. Definition: the extent to which entities and attributes of the application domain are represented in the schema. Metric: 1 − (#incomplete items / #total items).
− Minimality. Definition: the extent to which the schema is modeled without redundancies. Metric: 1 − (#redundant schema elements / #total schema elements).
− Type Consistency. Definition: data type uniformity across the schemas. Metric: 1 − (#inconsistent schema elements / #total schema elements).
(# denotes the expression "number of".)

The quality analysis is performed by a software module called IQ Manager. At the moment of integrated schema generation or update, this module proceeds with the criteria assessment and then, according to the obtained IQ scores, may execute adjustments over the schema to improve its design and, consequently, the query execution. This last step of schema tuning, after the IQ evaluation, is presented in Section 7.
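To make the metrics of Table 2 concrete, the following minimal Python sketch shows how a module like the IQ Manager could turn element counts into scores. It is an illustration under our own assumptions, not the authors' implementation: the counting of incomplete, redundant and inconsistent elements is assumed to be done elsewhere, and all function names are ours.

```python
# Illustrative sketch of the three schema IQ metrics of Table 2.
# The element counts are assumed to be produced by a schema matcher/analyzer.

def completeness(incomplete_items: int, total_items: int) -> float:
    """Schema completeness: 1 - (#incomplete items / #total items)."""
    return 1.0 - incomplete_items / total_items

def minimality(redundant_elements: int, total_elements: int) -> float:
    """Minimality: 1 - (#redundant schema elements / #total schema elements)."""
    return 1.0 - redundant_elements / total_elements

def type_consistency(inconsistent_elements: int, total_elements: int) -> float:
    """Type consistency: 1 - (#inconsistent schema elements / #total schema elements)."""
    return 1.0 - inconsistent_elements / total_elements

if __name__ == "__main__":
    # Hypothetical counts: a schema with 8 elements, 2 redundant, 1 typed inconsistently.
    print(minimality(2, 8))        # 0.75
    print(type_consistency(1, 8))  # 0.875
    print(completeness(4, 10))     # 0.6
```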
4. SCHEMA REPRESENTATION
Commonly, data integration systems use XML to represent the data and XML Schema to represent schemas. To provide a high-level abstraction for XML schema elements, we use a conceptual data model called X-Entity [15], described in what follows. We also present the schema mappings in this notation.

4.1 X-Entity Model
The X-Entity model is an extension of the Entity-Relationship model [8], i.e., it uses some basic features of the ER model and extends it with additional ones to better represent XML schemas. The main concept of the model is the entity type, which represents the structure of XML elements composed of other elements and attributes. In the X-Entity model, relationship types represent element-subelement relationships and references between elements. An X-Entity schema S is denoted by S = (E, R), where E is a set of entity types and R is a set of relationship types.
• Entity type: an entity type E, denoted by E({A1,…,An},{R1,…,Rm}), is made up of an entity name E, a set of attributes {A1,…,An} and a set of relationships {R1,…,Rm}. Each attribute Ai represents either an XML attribute or a simple XML element. In X-Entity diagrams, entity types are drawn as rectangles.
• Containment relationship: a containment relationship between two entity types E and E1 specifies that each instance of E contains instances of E1. It is denoted by R(E,E1,(min,max)), where R is the relationship name and (min,max) are the minimum and maximum numbers of instances of E1 that can be associated with an instance of E.
• Reference relationship: a reference relationship, denoted by R(E1,E2,{A11,…,A1n},{A21,…,A2n}), where R is the name of the relationship in which the entity type E1 references the entity type E2. {A11,…,A1n} and {A21,…,A2n} are the referencing attributes between entities E1 and E2, such that the value of A1i, 1 ≤ i ≤ n, in any entity of E1 must match a value of A2i, 1 ≤ i ≤ n, in E2.

4.2 Schema Mappings
We use schema mappings to represent the correspondences between entities and attributes of distinct sources. Schema mappings, as defined in [22], are constraints used to assert that the semantics of some components in a schema is somehow related to the semantics of some components in another schema. In our approach, they are used to represent correspondences between X-Entity elements that represent the same real world concept, called semantically equivalent.

A data integration system is largely based on the existence of metadata describing the individual sources and the integrated schema, and on schema mappings [22] specifying correspondences between the integrated schema concepts and the source schema concepts. There are several types of schema mappings to formally describe the associations between the concepts of X-Entity schemas. We consider an X-Entity element to be an entity type, a relationship type or an attribute:
• Entity schema mappings: if E1 and E2 are entity types, the schema mapping E1 ≡ E2 specifies that E1 and E2 are semantically equivalent, i.e., they describe the same real world concept.
• Attribute schema mappings: the mappings among attributes of semantically equivalent entities. The mapping E1.A1 ≡ E2.A2 indicates that the attributes A1 and A2 are semantically equivalent (correspond to the same real concept).
• Path mappings: specify special types of mappings between attributes and subentities of semantically equivalent entity types with different structures. Before defining a path mapping, it is necessary to define two concepts: link and path. A link between two X-Entity elements X1 and X2 (X1.X2) occurs if X2 is an attribute of the entity type X1, or X1 is an entity of the relationship type X2 (or vice-versa). A sequence of links is called a path; a path may be a normal path, X1.X2.….Xn, or an inverse path, (X1.X2.….Xn)-1. Entity attributes and relationships are represented by paths. A path mapping can occur in four cases, explained in the following (assuming P1 and P2 are two paths):
1. Case 1: P1 = X1.X2…Xn and P2 = Y1.Y2…Ym, where X1 ≡ Y1. The mapping P1 ≡ P2 specifies that the entity types Xn and Ym are semantically equivalent.
2. Case 2: P1 = X1.X2…Xn.A and P2 = Y1.Y2…Ym.A', where X1 ≡ Y1. The mapping P1 ≡ P2 specifies that the attribute A ∈ Xn and the attribute A' ∈ Ym are semantically equivalent.
3. Case 3: P1 = X1.X2…Xn and P2 = (Y1.Y2…Yn)-1, where X1 ≡ Yn. The mapping P1 ≡ P2 specifies that the entity types Xn and Y1 are semantically equivalent.
4. Case 4: P1 = X1.X2…Xn.Ak and P2 = (Y1.Y2…Yn)-1.Ak', where X1 ≡ Yn. The mapping P1 ≡ P2 specifies that the attribute Ak ∈ Xn and the attribute Ak' ∈ Y1 are semantically equivalent.
To illustrate the cases, consider the integrated and data source schemas presented in Figure 1.

Figure 1. Integrated schema (Smed) and schemas of data sources (S1 and S2):
Smed = ({bookm({titlem, publisherm}, {bookm_chapterm}), chapterm({chapter_titlem}, {})}, {bookm_chapterm(bookm, chapterm, (1,N))})
S1 = ({publication1({title1, publisher1}, {}), section1({section_title1, book_title1}, {ref_section1_publication1})}, {ref_section1_publication1(section1, publication1, {book_title1}, {title1})})
S2 = ({novel2({name2, year2}, {novel2_chapter2, novel2_publisher2}), chapter2({ch_title2}, {}), publisher2({pub_name2}, {})}, {novel2_chapter2(novel2, chapter2, (1,N)), novel2_publisher2(novel2, publisher2, (1,1))})

Table 3 presents the relevant schema mappings identified to compute bookm and chapterm. The mappings MP1 to MP12 specify the semantic equivalences between the integrated and data source schema elements.

In data integration, the mappings are essential to assure query processing over the integrated schema. We assume that the mappings and the schema element equivalences are already defined, automatically by the system or even manually by advanced users.
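As a concrete illustration of how X-Entity schemas and mappings could be represented programmatically, consider the following sketch. The encoding (class names, fields, string-based paths) is ours and deliberately simplified; it is not part of the X-Entity proposal itself.

```python
# A simplified, illustrative encoding of X-Entity schemas and schema mappings,
# mirroring Section 4 and the Smed schema of Figure 1.

from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    attributes: list                                   # attribute names
    relationships: list = field(default_factory=list)  # relationship names

@dataclass
class Relationship:
    name: str
    source: str                  # containing / referencing entity
    target: str                  # contained / referenced entity
    cardinality: tuple = (1, "N")

@dataclass
class Schema:
    name: str
    entities: dict               # entity name -> Entity
    relationships: dict          # relationship name -> Relationship

# The integrated schema Smed of Figure 1.
smed = Schema(
    name="Smed",
    entities={
        "bookm": Entity("bookm", ["titlem", "publisherm"], ["bookm_chapterm"]),
        "chapterm": Entity("chapterm", ["chapter_titlem"]),
    },
    relationships={
        "bookm_chapterm": Relationship("bookm_chapterm", "bookm", "chapterm", (1, "N")),
    },
)

# Schema mappings kept as pairs of semantically equivalent paths,
# e.g. MP2 and MP8 of Table 3.
mappings = [
    ("bookm.titlem", "publication1.title1"),
    ("bookm.titlem", "novel2.name2"),
]
print(smed.entities["bookm"].attributes, mappings[0])
```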
In particular, in the environment used to experiment with the proposed IQ evaluation ([5], [15]), there is a schema matcher component responsible for maintaining equivalences and defining mappings among the data sources and the integrated schema.

Table 3. Schema mappings between the integrated schema Smed and the source schemas S1 and S2
MP1: bookm ≡ publication1
MP2: bookm.titlem ≡ publication1.title1
MP3: bookm.publisherm ≡ publication1.publisher1
MP4: chapterm ≡ section1
MP5: chapterm.chapter_titlem ≡ section1.section_title1
MP6: bookm.bookm_chapterm.chapterm ≡ (section1.section_ref_publication1.publication1)-1
MP7: bookm ≡ novel2
MP8: bookm.titlem ≡ novel2.name2
MP9: chapterm ≡ chapter2
MP10: bookm.bookm_chapterm.chapterm ≡ novel2.novel2_chapter2.chapter2
MP11: chapterm.chapter_titlem ≡ chapter2.ch_title2
MP12: bookm.publisherm ≡ novel2.novel2_publisher2.publisher2.pub_name2

Every information system (even a data integration one) is constructed from a number of requirements. Moreover, embedded in this set of requirements is the application domain information [13], which is very important for schema construction.

5. SCHEMA IQ CRITERIA
As previously mentioned, high IQ schemas are essential to accomplish our goal of improving integrated query execution. It is important to notice that the proposed approach is not only applicable to X-Entity schemas: the IQ aspects may be useful in any integrated schema to minimize problems acquired from schema integration processes, for example the same concept being represented more than once in a schema. The next section describes some definitions required to introduce the minimality criterion. It is important to point out that semantic equivalence specification is not the focus of this paper; we use the approach presented in [15]. From now on, we assume that the integrated schema is already created and, consequently, that the equivalences between entities, attributes and relationships are already defined.

5.1 Definitions
More formally, a data integration system is defined as follows:

Definition 1 – Data Integration System (Ð)
A data integration system is a 4-element tuple Ð = ⟨δ, Sm, ρ, ϕ(Ð)⟩, where:
• δ is the set of data source schemas Si, i.e., δ = ⟨S1, S2, …, Sw⟩, where w is the total number of data sources participating in Ð;
• Sm is the integrated schema, generated by internal modules of Ð;
• ρ is the set of user schemas, ρ = ⟨U1, U2, …, Uu⟩, where u is the total number of users of Ð. Together with the data source schemas, it is the basis of the integrated schema generation;
• ϕ(Ð) is the set of all distinct concepts in the application domain of the data integration system, as stated in the next definition. This set can be extracted from the schema mappings between the data source schemas and the integrated schema.
In Ð, the following statements hold:
• Sm is an X-Entity integrated schema such that Sm = ⟨E1, E2, …, Enm⟩, where Ek is an integration (mediation) entity, 1 ≤ k ≤ nm, and nm is the total number of entities in Sm;
• ∀Ek ∈ Sm, Ek({Ak1, Ak2, …, Akak}, {Rk1, Rk2, …, Rkrk}), where {Ak1, Ak2, …, Akak} is the set of attributes of Ek, ak is the number of attributes of Ek (ak > 0), {Rk1, Rk2, …, Rkrk} is the set of relationships of Ek, and rk is the number of relationships of Ek (rk ≥ 0);
• if X1 and X2 are schema elements (attributes, relationships or entities), the schema mapping X1 ≡ X2 specifies that X1 and X2 are semantically equivalent, i.e., they describe the same real world concept and have the same semantics. More details about the definition of semantic equivalences can be found in [15].

Definition 2 – Domain Concepts Set
We define ϕ(β) as the set of domain concepts of β: ϕ(β) = ⟨C1, C2, …, Cσβ⟩, where β is either a given integrated schema Sm or a data integration system Ð, and σβ is the number of real world concepts in β. Each Ck is an application domain concept represented by a schema element Y in one of the two following ways: i) Y ∈ Sm, if β is an integrated schema, or ii) Y belongs to one of the schemas in δ = ⟨S1, …, Sw⟩, if β is the data integration system.

Usually, the data integration system has mechanisms to generate and maintain the schemas. It is very difficult to guarantee that these mechanisms, specifically those concerning schema generation, produce schemas without anomalies, e.g., redundancies. In the data integration context, we define a schema as redundant if it has occurrences of redundant entities, attributes or relationships. To contextualize these schema aspects, we introduce Definitions 3 to 6.

Definition 3 – Redundant attribute in a single entity
An attribute Aki of entity Ek, Aki ∈ {Ak1, Ak2, …, Akak}, is redundant, i.e., Red(Aki, Ek) = 1, if ∃Ek.Akj, j ≠ i, Akj ∈ {Ak1, Ak2, …, Akak}, such that Ek.Aki ≡ Ek.Akj, 1 ≤ i, j ≤ ak.

Definition 4 – Redundant attribute in different entities
An attribute Aki of entity Ek, Aki ∈ {Ak1, Ak2, …, Akak}, is redundant, i.e., Red(Aki, Ek) = 1, if ∃Eo, o ≠ k, Eo ∈ Sm, Ek ≡ Eo, where Eo({Bo1, Bo2, …, Boao}) has attributes Boj, and ∃Eo.Boj, Boj ∈ {Bo1, Bo2, …, Boao}, such that Ek.Aki ≡ Eo.Boj, 1 ≤ i ≤ ak, 1 ≤ j ≤ ao. If an attribute Aki ∈ {Ak1, Ak2, …, Akak} has Red(Aki, Ek) = 0, we say that Aki is non-redundant.
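The checks of Definitions 3 and 4 can be sketched mechanically as follows, assuming the semantic equivalences are available as sets of attribute pairs and entity pairs. The data layout and names are ours, for illustration only; the example anticipates the artistm/actorm schema of Figure 2 below.

```python
# Illustrative check for redundant attributes (Definitions 3 and 4).

def equivalent(pair_set, x, y):
    """Symmetric lookup of a semantic equivalence x ≡ y."""
    return frozenset((x, y)) in pair_set

def is_redundant_attribute(entity, attribute, schema, attr_equiv, entity_equiv):
    """Red(Aki, Ek) = 1 under Definition 3 or Definition 4, else 0."""
    # Definition 3: an equivalent attribute exists in the same entity.
    for other in schema[entity]:
        if other != attribute and equivalent(attr_equiv, (entity, attribute), (entity, other)):
            return 1
    # Definition 4: an equivalent attribute exists in a semantically equivalent entity.
    for other_entity, attrs in schema.items():
        if other_entity == entity or not equivalent(entity_equiv, entity, other_entity):
            continue
        for other in attrs:
            if equivalent(attr_equiv, (entity, attribute), (other_entity, other)):
                return 1
    return 0

# Figure 2 example: artistm.originCountrym is redundant, addressm is not.
schema = {
    "actorm": ["sshm", "agem", "countrym"],
    "artistm": ["originCountrym", "addressm"],
}
entity_equiv = {frozenset({"artistm", "actorm"})}
attr_equiv = {frozenset({("artistm", "originCountrym"), ("actorm", "countrym")})}

print(is_redundant_attribute("artistm", "originCountrym", schema, attr_equiv, entity_equiv))  # 1
print(is_redundant_attribute("artistm", "addressm", schema, attr_equiv, entity_equiv))        # 0
```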
Definition 5 – Entity Redundancy Degree
We say that a given entity Ek has a positive redundancy degree in schema Sm, i.e., Red(Ek, Sm) > 0, if Ek has at least one redundant attribute. The redundancy degree is calculated by the following formula:
Red(Ek, Sm) = (Σ i=1..ak Red(Aki, Ek)) / ak,
where Σ i=1..ak Red(Aki, Ek) is the number of redundant attributes in Ek and ak is the total number of attributes in Ek. An entity Ek is considered fully redundant when all of its attributes are redundant, i.e., Red(Ek, Sm) = 1. In this case, we assume that the entity Ek may be removed from the original schema Sm without loss of relevant information. Any existing relationship of Ek may be associated to a remaining equivalent entity Eo, as will be shown in Section 7.

As an example of redundant attributes and the entity redundancy degree, suppose the schema and mappings of Figure 2.

Figure 2. Schema with redundant attributes: moviem contains actorm (attributes sshm, agem, countrym) and artistm (attributes originCountrym, addressm); mappings: artistm ≡ actorm, originCountrym ≡ countrym.

The attribute originCountrym in artistm is redundant because it has a semantic correspondent in the entity actorm (attribute countrym), and the entities artistm and actorm are semantically equivalent. Thus, we have the following:
Red(originCountrym, artistm) = 1
Red(addressm, artistm) = 0
Red(countrym, actorm) = 0
Red(agem, actorm) = 0
Red(sshm, actorm) = 0

The entities in Figure 2 have the following entity redundancy degrees in schema Sm:
Red(artistm, Sm) = (1 + 0) / 2 = 0.5
Red(actorm, Sm) = (0 + 0 + 0) / 3 = 0
The entity artistm is 50% redundant because it has only two attributes and one of them is redundant.

It is interesting to mention that originCountrym ≡ countrym, but only the first is classified as redundant. This occurs because only one of them must be marked as redundant and removed, while the other has to be kept in the schema to assure that domain information is not lost.

Definition 6 – Redundant Relationship
Consider a relationship R ∈ Sm between the entities Ek and Ey, represented by the path Ek.R.Ey; then R ∈ {Rk1, …, Rkrk} and R ∈ {Ty1, …, Tyry}, where {Rk1, …, Rkrk} is the set of relationships of the entity Ek and {Ty1, …, Tyry} is the set of relationships of the entity Ey. Thus, the relationship R connects Ek and Ey if and only if R ∈ {Rk1, …, Rkrk} and R ∈ {Ty1, …, Tyry}.
We define R as a redundant relationship in Sm, i.e., Red(R, Sm) = 1, if ∃P1, where P1 = Ek.Rj.….Ts.Ey is a path with Rj ∈ {Rk1, …, Rkrk} and Ts ∈ {Ty1, …, Tyry}, such that P1 ≡ R. In other words, a relationship between two entities is redundant if there is another semantically equivalent relationship whose path connects the same two entities.

Consider the schema and mappings illustrated in Figure 3.

Figure 3. Schema with redundant relationship: enterprisem contains departmentm, departmentm contains sectionm, and enterprisem also contains sectionm; P1 = enterprisem.enterprisem_departmentm.departmentm.departmentm_sectionm.sectionm, P2 = enterprisem.enterprisem_sectionm.sectionm, with P1 ≡ P2.

The relationship connecting enterprisem and sectionm is redundant (Red(enterprisem_sectionm(enterprisem, sectionm, (1,N)), Sm) = 1) because it has a semantically equivalent correspondent represented by P1.

We agree with the study presented in [28], where Zhang states that redundancy is not a symmetric metric. An entity Ej may cause Ek to be marked as redundant, but if the comparison order is changed, Ej may not be redundant when related to Ek. A simple example is the case where an entity E1 is entirely contained in E2: E1 is redundant but E2 is not.
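A minimal sketch of Definitions 5 and 6 follows, assuming the per-attribute redundancy flags Red(Aki, Ek) and the path equivalences are given (for example, produced by a check like the one sketched after Definition 4). The representation of paths as strings is our own simplification; in practice only one element of each equivalent pair would be marked for removal.

```python
# Illustrative sketch of the entity redundancy degree (Definition 5) and of
# redundant-relationship detection via an equivalent path (Definition 6).

def entity_redundancy_degree(attr_red):
    """Red(Ek, Sm): redundant attributes over the total number of attributes of Ek."""
    return sum(attr_red.values()) / len(attr_red)

def is_redundant_relationship(rel_path, path_equivalences):
    """Red(R, Sm) = 1 if some semantically equivalent path P1 ≡ R exists."""
    return 1 if any(rel_path in pair for pair in path_equivalences) else 0

# Figure 2 example: actorm has no redundant attribute, artistm has one of two.
print(entity_redundancy_degree({"sshm": 0, "agem": 0, "countrym": 0}))   # 0.0
print(entity_redundancy_degree({"originCountrym": 1, "addressm": 0}))    # 0.5

# Figure 3 example: the direct enterprisem-sectionm relationship is redundant
# because the longer path P1 through departmentm is semantically equivalent.
p1 = "enterprisem.enterprisem_departmentm.departmentm.departmentm_sectionm.sectionm"
p2 = "enterprisem.enterprisem_sectionm.sectionm"
path_equivalences = {frozenset({p1, p2})}
print(is_redundant_relationship(p2, path_equivalences))                   # 1
```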
5.2 Minimality
A major problem of conceptual schema design is to avoid the generation of a schema with redundancies. A schema is minimal if all of the domain concepts relevant for the application are described only once; in other words, a minimal schema represents each application requirement only once ([12], [23], [16], [19]). Thus, we can say that the minimality of a schema is the degree of absence of redundant elements in the schema. In line with our point of view, Kesh [12] argues that a more minimal (or more concise) schema will make itself more efficient and consequently improve the effectiveness of operations and queries over it. We state that if the integrated schema is minimal, query execution will be improved: redundancy elimination (i.e., increasing minimality) avoids the query processor spending extra time querying redundant elements. Therefore, to measure minimality, we must first determine the redundancy degree of the schema. For the next redundancy definitions (Definitions 7 and 8), we assume the following:
• nrel is the total number of relationships in Sm;
• nm is the total number of entities in Sm;
• rk is the number of relationships of each entity Ek in Sm.

Definition 7 – Entity Redundancy of a Schema
The total entity redundancy of a schema Sm is computed by the formula:
ER(Sm) = (Σ k=1..nm Red(Ek, Sm)) / nm,
where Red(Ek, Sm) is the redundancy degree of each Ek in Sm.

Definition 8 – Relationship Redundancy of a Schema
The relationship redundancy degree of Sm is measured by the equation:
RR(Sm) = (Σ l=1..nrel Red(Rl, Sm)) / nrel,
where Σ l=1..nrel Red(Rl, Sm) is the number of redundant relationships in Sm, as stated in Definition 6.

Definition 9 – Schema Minimality
We define the overall redundancy of a schema in a data integration system as the sum of the aforementioned redundancy values for entities (ER) and relationships (RR). The schema minimality is measured by the formula:
Mi(Sm) = 1 − (ER(Sm) + RR(Sm)).

5.3 Schema Completeness
The schema completeness is the percentage of domain concepts represented in the integrated schema relative to the concepts represented in all data source schemas ([12], [23], [16], [19]). For instance, suppose that a given data integration system has a total of 10 distinct domain concepts (described by entities and relationships) in all the data sources' published schemas. If the integrated schema has only 6 representations of these concepts, we can say that the integrated schema is 60% complete with respect to the current set of data sources.

To introduce the schema completeness metric, we assume that the data integration system (Ð) has a minimal integrated schema Sm.

As mentioned in Definition 2, a data integration system is constructed over a domain, and this domain aggregates a number of real world concepts (obtained from the requirements) that are represented by schema elements. Every relationship, entity or attribute in a user, data source or integrated schema is (part of) a real world concept. Sometimes, the same concept can be found in more than one element, in one or more schemas. In this case the concept is replicated and the corresponding schema elements are semantically equivalent. For example, in a given schema A the name of a movie director can be an attribute director.name, and the same concept may be the element movie.director in another schema B (i.e., director.name ≡ movie.director).

Let us consider Ð with two data sources (Sa and Sb) and their respective schemas and mappings as in Figure 4.

Figure 4. Data source schemas and schema mappings in Ð:
Sa = ({booka({titlea, pricea, yeara}, {booka_authora, booka_chaptera}), authora({namea, agea}, {}), chaptera({sizea, pagea}, {})}, {booka_authora(booka, authora, (1,N)), booka_chaptera(booka, chaptera, (1,N))})
Sb = ({authorb({nameb, nationalityb}, {authorb_publicationb}), publicationb({titleb, countryb, yearb}, {})}, {authorb_publicationb(authorb, publicationb, (1,N))})
Schema mappings:
MP1: booka.yeara ≡ publicationb.yearb
MP2: booka.titlea ≡ publicationb.titleb
MP3: authora.namea ≡ authorb.nameb
MP4: booka.booka_authora.authora ≡ (authorb.authorb_publicationb.publicationb)-1

With respect to the schemas in Figure 4, after investigating and analyzing the schemas and mappings, it is possible to find the set of distinct concepts, ϕ(Ð) = ⟨…⟩ ⇒ σÐ = 13, where σÐ is the number of concepts in Ð (Definition 2). It is important to emphasize that the relationship (author, publication, 1, N) is not present in the domain set of concepts because it is equivalent to the relationship (book, author, 1, N), i.e., both represent the same concept. To determine the schema completeness metric, we introduce the following definition.

Definition 10 – Schema Completeness
The overall schema completeness degree of a given schema Sx ∈ Ð is obtained by the following ratio:
SC(Sx) = σSx / σÐ,
where Sx can be either a data source schema or the integrated schema, σSx is the number of distinct concepts in the schema Sx, and σÐ is the number of distinct concepts contained in all the schemas of the data integration system Ð.
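The schema-level scores of Definitions 7 to 10 can be sketched as follows, taking the per-entity and per-relationship redundancy values and the concept counts as inputs. Names and numbers are illustrative only. Note that the denominators below follow Definitions 7 and 8 literally (nm and nrel); the worked example in Figure 6 (Section 6.1) instead divides by the total number of entities plus relationships, in line with the Table 2 metric.

```python
# Illustrative aggregation of ER, RR, Mi and SC (Definitions 7-10).

def entity_redundancy(red_by_entity):
    """ER(Sm): average of Red(Ek, Sm) over the nm entities of Sm (Definition 7)."""
    return sum(red_by_entity.values()) / len(red_by_entity)

def relationship_redundancy(red_by_relationship):
    """RR(Sm): redundant relationships over the nrel relationships of Sm (Definition 8)."""
    return sum(red_by_relationship.values()) / len(red_by_relationship)

def minimality(er, rr):
    """Mi(Sm) = 1 - (ER(Sm) + RR(Sm)) (Definition 9)."""
    return 1 - (er + rr)

def schema_completeness(concepts_in_schema, concepts_in_system):
    """SC(Sx) = sigma_Sx / sigma_D (Definition 10)."""
    return concepts_in_schema / concepts_in_system

# A hypothetical integrated schema with four entities and four relationships,
# one entity fully redundant and one relationship redundant.
er = entity_redundancy({"E1": 0, "E2": 0, "E3": 0, "E4": 1})
rr = relationship_redundancy({"R1": 0, "R2": 0, "R3": 0, "R4": 1})
print(er, rr, minimality(er, rr))   # 0.25 0.25 0.5
print(schema_completeness(6, 10))   # 0.6, the 60% example of Section 5.3
```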
5.4 Type Consistency
In databases, the consistency property states that only valid data will be written to the database. The stored data must adhere to a number of consistency rules. If, for some reason, a transaction that violates the database's consistency rules is executed, the entire transaction will be rolled back and the database will be restored to a state consistent with those rules. On the other hand, if a transaction executes successfully, it will take the database from one state that is consistent with the rules to another state that is also consistent with them. These statements refer to data consistency, but they can be extended to adequately represent data type consistency constraints [20].

A data type is a constraint placed upon the interpretation of data in a type system in computer programming. Common types of data in programming languages include primitive types (such as integers, floating point numbers or characters), tuples, records, algebraic data types, abstract data types, reference types, classes and function types. A data type describes the representation, interpretation and structure of values manipulated by algorithms or objects stored in computer memory or other storage devices. The type system uses data type information to check the correctness of computer programs that access or manipulate the data [7].

When an integrated schema management system experiences problems with consistency, the same kind of information is recorded with more than one data type. The first step in resolving this consistency problem is to determine which alternative data type is preferable. This data type is then defined as the standard, namely, the accepted way of recording the information.

In this case, a schema element is called consistent if it adheres to the standard data type. If it does not adhere, the data type conversion is commonly a difficult process performed by the mediator, and achieving consistency can be both time-consuming and expensive. As in [4], we have based the consistency metric on an essential factor: the number of semantically equivalent attributes in the schemas that adhere to the standard data type defined for the attribute.

We approximate consistency rules and data types to create the Type Consistency IQ concept. The use of different coding schemes to record the same attribute falls into the lack of IQ in this category. We use the Type Consistency criterion to investigate which data elements in the schemas are always represented with the same type, or adhere to a consistency standard. This is an indicator of quality and of query improvement, since the query processor will not have to perform type conversions for a schema element in order to access its correspondences in the data source schemas. For type consistency measurement, we use a metric similar to the one presented in [24].

Since X-Entity is a high-level abstraction for XML Schema structures, it is necessary to define the concept of type for an X-Entity attribute.

Definition 11 – X-Entity Attribute Data Type
A data type Tkj for the attribute Akj, where Akj ∈ Ek, is a domain element or structural metadata associated with the attribute data, as defined in previous works [7]. As the data integration system is concerned with XML data, every Tkj may be one of the valid datatypes defined for XML Schema (including user-defined ones). From the XML Schema specification [18], we import the concept of datatype as follows. A datatype T is a tuple ⟨α, λ, γ⟩ consisting of:
• α, a set of distinct values, called the value space, containing the values that the elements of the type can have;
• λ, a set of lexical representations, called the lexical space;
• γ, a set of facets that characterize properties of the value space, individual values or lexical items;
and T ∈ £, where £ is the set of all XML Schema datatypes. In our work, to use datatypes it is only necessary to refer to the value space α of valid values in the datatype specification.

To determine the type consistency criterion, we define the following:

Definition 12 – Attribute
∀Ek ∈ Sm, every attribute Akj (Akj ∈ Ek) is defined by the tuple (Tkj, vkj), where:
• Tkj = ⟨αkj, λkj, γkj⟩ is the datatype of attribute Akj (1 ≤ j ≤ ak);
• vkj is the value of attribute Akj (1 ≤ j ≤ ak) and vkj ∈ αkj.

Definition 13 – Data Type Consistency Standard
The data type consistency standard is the alternative data type that is most appropriate for an attribute. This data type is defined as the standard, namely, the accepted way of recording the attribute. Formally, a data type consistency standard is an X-Entity attribute data type Tstd such that:
∀Ek.Akj, Ek ∈ Ð, Akj ∈ {Ak1, Ak2, …, Akak} ∧ ∃Tstd, Tstd = ⟨αstd, λstd, γstd⟩ ∧ ∃Ex.A, Ex ∈ Ð ∧ Ex.A ≡ Ek.Akj, A = (Tstd, v),
where Tstd is the data type most frequently used in Ð for attribute A and its equivalents.

Definition 14 – Attribute Type Consistency
Given a set of data source schemas Si (1 ≤ i ≤ w) and a mediation schema Sm, we say that an attribute Apj = (Tpj, vpj) (Tpj being a valid datatype as in Definition 12) of an entity Ep ∈ Sp (Sp = Si or Sp = Sm) is consistent, i.e., Con(Apj, Sp) = 1, if it adheres to the data type consistency standard defined for it, even when equivalent attributes appear in other entities (or in the same entity) with a different datatype: ∃Tstd ∈ Ð, Tstd ∈ £, such that Apj = (Tstd, vpj).

Definition 15 – Schema Data Type Consistency
The overall schema type consistency score in a given data integration system, Con(Sm, Ð), is obtained by the following calculation:
Con(Sm, Ð) = (Σ k=1..nm Σ j=1..ak Con(Akj, Ð)) / (Σ k=1..nm ak),
where Σ k=1..nm Σ j=1..ak Con(Akj, Ð) is the total number of consistent attributes in Ð, Akj ∈ Ð, nm is the total number of entities in the schema Sm, and ak is the number of attributes of the entity Ek.
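The computation of the standard type and of the consistency scores (Definitions 13 to 15) can be sketched as follows; the dictionary-based layout is our own simplification, and the example anticipates the birth-date scenario later used in Section 6.3.

```python
# Illustrative computation of the type consistency standard (Definition 13)
# and of the attribute/schema consistency scores (Definitions 14 and 15).

from collections import Counter

def consistency_scores(equiv_classes, attr_type):
    """For each equivalence class of attributes, the standard type Tstd is the
    most frequent type; an attribute is consistent (Con = 1) iff it uses Tstd."""
    con = {}
    for cls in equiv_classes:
        counts = Counter(attr_type[a] for a in cls)
        tstd = counts.most_common(1)[0][0]
        for a in cls:
            con[a] = 1 if attr_type[a] == tstd else 0
    return con

def schema_consistency(con, attributes):
    """Con(Sm, D): consistent attributes divided by the total number of attributes.
    Attributes without any equivalent are counted as consistent here (an assumption)."""
    return sum(con.get(a, 1) for a in attributes) / len(attributes)

# Section 6.3 example: three Date occurrences versus one String.
attr_type = {
    "actorm.birthdatem": "Date",
    "actor1.birth1": "String",
    "actor2.birth2": "Date",
    "actor3.bd3": "Date",
}
equiv = [{"actorm.birthdatem", "actor1.birth1", "actor2.birth2", "actor3.bd3"}]
con = consistency_scores(equiv, attr_type)
print(con["actor1.birth1"], con["actorm.birthdatem"])   # 0 1
print(schema_consistency(con, list(attr_type)))         # 0.75
```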
6. EXAMPLES
In this section we present practical examples of the proposed criteria evaluation in schemas. For each of the IQ criteria, one schema with anomalies in the referred aspect is presented, and the evaluation process is detailed.

6.1 Minimality Analysis
Consider the redundant schema of Figure 5 for the minimality example.

Figure 5. Schema with redundant elements: an integrated schema with the entities moviem, actorm, artistm and theaterm, and the mappings artistm ≡ actorm, idm ≡ sshm, nationalitym ≡ countrym.

The entity artistm is redundant because it is semantically equivalent to actorm and all its attributes have a semantically equivalent correspondent in actorm.

The relationship moviem_artistm(moviem, artistm, (1,N)) is also redundant because there is a semantically equivalent relationship moviem_actorm(moviem, actorm, (1,N)) and actorm ≡ artistm.

The schema minimality value is obtained as illustrated in Figure 6. The minimality of schema Sm is 75%, which means that the schema has 25% of redundancy that can possibly be eliminated.

Figure 6. Schema minimality score:
Red(moviem, Sm) = 0; Red(actorm, Sm) = 0; Red(theaterm, Sm) = 0; Red(artistm, Sm) = 1
ER(Sm) = 1 / (4 + 4) = 0.125
RR(Sm) = 1 / (4 + 4) = 0.125
Mi(Sm) = 1 − (0.125 + 0.125) = 0.75

6.2 Schema Completeness Analysis
For the completeness evaluation, consider the integrated schema presented in Figure 7 and the data source schemas presented in Figures 8 to 10.

Figure 7. Integrated schema Sm:
Sm = ({moviem({titlem, genrem, yearm, directorm}, {moviem_actorm}), actorm({namem, nationalitym}, {})}, {moviem_actorm(moviem, actorm, (1,N))})

Figure 8. Data source schema S1:
S1 = ({movie1({title1, runtime1, genre1}, {movie1_actor1, movie1_director1}), actor1({name1, nationality1}, {}), director1({name1, nationality1}, {})}, {movie1_actor1(movie1, actor1, (1,N)), movie1_director1(movie1, director1, (1,N))})

Figure 9. Data source schema S2:
S2 = ({movie2({title2, genre2, actor2}, {movie2_director2}), director2({name2, nationality2}, {director2_award2}), award2({year2, category2}, {})}, {movie2_director2(movie2, director2, (1,N)), director2_award2(director2, award2, (1,N))})

Figure 10. Data source schema S3:
S3 = ({director3({name3, nationality3}, {director3_movie3}), movie3({title3, country3, year3}, {})}, {director3_movie3(director3, movie3, (1,N))})

The schema mappings between the schemas are listed in Table 4.

Table 4. Schema mappings between the integrated schema Sm and the source schemas S1, S2 and S3
SM1: moviem ≡ movie1
SM2: moviem ≡ movie2
SM3: moviem ≡ movie3
SM4: moviem.genrem ≡ movie1.genre1
SM5: moviem.genrem ≡ movie2.genre2
SM6: moviem.titlem ≡ movie1.title1
SM7: moviem.titlem ≡ movie2.title2
SM8: moviem.titlem ≡ movie3.title3
SM9: moviem.yearm ≡ movie3.year3
SM10: moviem.directorm ≡ director1.name1
SM11: moviem.directorm ≡ director2.name2
SM12: moviem.directorm ≡ director3.name3
SM13: moviem.moviem_actorm.actorm ≡ movie1.movie1_actor1.actor1
SM14: moviem.moviem_actorm.actorm.namem ≡ movie1.movie1_actor1.actor1.name1
SM15: moviem.moviem_actorm.actorm.namem ≡ movie2.actor2
SM16: moviem.moviem_actorm.actorm.nationalitym ≡ movie1.movie1_actor1.actor1.nationality1
Analyzing and compiling the schemas and mappings, it is possible to say that Ð has the following set of distinct concepts: ϕ(Ð) = ⟨…⟩ ⇒ σÐ = 17.

Analogously, examining the integrated schema of Figure 7, it is possible to identify the following concepts: ϕ(Sm) = ⟨…⟩ ⇒ σSm = 9.

Thus, for our example, the overall schema completeness score of Sm (Figure 7) is obtained as follows:
SC(Sm) = σSm / σÐ = 9 / 17 = 0.5294

Therefore, the completeness of Sm is 52.94%, which means that 47.06% of the domain concepts are missing from the integrated schema. Improvements in schema completeness can be made by running a set of tasks that investigate the data source schemas seeking concepts that are not in the current integrated schema. After that, the system must generate schema mappings and propagate the new concepts, converted into entities and relationships, to the integrated schema. This can be done, for example, by applying the techniques presented in [15].

6.3 Type Consistency Analysis
As an example of type consistency evaluation, assume a hypothetical schema with the following attribute equivalences:
SM1: actorm.birthdatem ≡ actor1.birth1
SM2: actorm.birthdatem ≡ actor2.birth2
SM3: actorm.birthdatem ≡ actor3.bd3
Suppose that the data type of the attribute actor1.birth1 is String, and the data type of the attributes actorm.birthdatem, actor2.birth2 and actor3.bd3 is Date. We have three Date occurrences versus a single occurrence of the String data type for the same attribute. Thus, the IQ Manager will consider the data type Date as the data type consistency standard:
Tstd = Date
Con(actorm.birthdatem, Ð) = 1
Con(actor1.birth1, Ð) = 0
Con(actor2.birth2, Ð) = 1
Con(actor3.bd3, Ð) = 1
The attributes of type Date are consistent and the attribute of type String is inconsistent. To compute the consistency degree of a given schema, it is necessary to sum the consistency values of all attributes in the schema and divide the result by the total number of attributes, as stated in Definition 15.

7. SCHEMA MINIMALITY IMPROVEMENT
After detecting the schema IQ anomalies, it is possible to restructure the schema to achieve better IQ scores [2]. In order to improve minimality scores, redundant elements must be removed from the schema. In this section, we present an algorithm with schema improvement actions to be executed after the integrated schema generation or update. The sequence of steps is specified in the algorithm of Table 5. It is important to note that we can reach a total minimality score, i.e., a schema with no redundancies, by removing redundant elements until a minimality value equal to 1 is achieved.

Table 5. Schema adjustment algorithm
1. Calculate the minimality score; if minimality = 1, then stop.
2. Search for fully redundant entities in Sm.
3. If there are fully redundant entities, eliminate them from Sm.
4. Search for redundant relationships in Sm.
5. If there are redundant relationships, eliminate them from Sm.
6. Search for redundant attributes in Sm.
7. If there are redundant attributes, eliminate them from Sm.
8. Go to Step 1.

The detection of redundant elements performed in steps 2, 4 and 6 has already been described in the previous definitions. The next sections describe the proposed redundancy elimination actions executed in steps 3, 5 and 7 of the improvement algorithm. In the following, we present details about the schema adjustments performed when the IQ Manager has to remove redundant elements.
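Before turning to the individual elimination steps, the overall loop of Table 5 can be sketched as follows. This is a greatly simplified illustration under our own assumptions: the three redundancy tests of Section 5 are reduced to precomputed sets, relationship relocation (Section 7.1) is omitted, and the concrete sets in the example are only loosely based on Figure 5.

```python
# Compact sketch of the schema adjustment loop of Table 5 (not the authors' code).

def improve_minimality(schema, red_entities, red_relationships, red_attributes):
    """Iteratively remove redundant elements until the minimality score is 1."""
    while True:
        elements = schema["entities"] | schema["relationships"]
        redundant = elements & (red_entities | red_relationships)
        score = 1 - len(redundant) / len(elements)           # Step 1
        if score == 1:
            return schema
        schema["entities"] -= red_entities                    # Steps 2-3
        schema["relationships"] -= red_relationships          # Steps 4-5
        for entity in list(schema["attributes"]):             # Steps 6-7
            if entity not in schema["entities"]:
                del schema["attributes"][entity]
                continue
            schema["attributes"][entity] = [
                a for a in schema["attributes"][entity]
                if (entity, a) not in red_attributes
            ]
        # Step 8: loop back and recompute the score.

schema = {
    "entities": {"moviem", "actorm", "theaterm", "artistm"},
    "relationships": {"moviem_actorm", "moviem_artistm", "moviem_theaterm"},
    "attributes": {"actorm": ["sshm", "countrym"], "artistm": ["idm", "nationalitym"]},
}
schema = improve_minimality(schema, {"artistm"}, {"moviem_artistm"}, set())
print(schema["entities"], schema["relationships"])
```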
7.1 Redundant Entities Elimination
It is important to point out that, after removing a redundant entity E, its relationships must be relocated to a semantically equivalent remaining entity. When removing a redundant entity E1 (E1 ≡ E2), the IQ Manager transfers the relationships of E1 to the remaining equivalent entity E2. Three different situations may occur when moving a relationship Rx, Rx ∈ E1:
• if Rx ∈ E2, then Rx is deleted because it is no longer necessary;
• if Rx ∉ E2 but ∃Ry, Ry ∈ E2, such that Rx ≡ Ry, then Rx is deleted;
• if Rx ∉ E2 and there is no Ry, Ry ∈ E2, such that Rx ≡ Ry, then Rx is connected to E2.

The first and second situations are not supposed to cause any schema modification besides the entity deletion. However, the third case needs more attention, since the relationships of the removed entity have to be relocated.

Definition 16 – Substitute Entity
We say that Ek is a fully redundant entity if and only if Red(Ek, Sm) = 1 and Ek has at least one substitute entity Es, i.e., Subst(Ek) = Es, such that:
• Ek({Ak1, …, Akak}, {Rk1, …, Rkrk}), where the Akx are attributes and the Rky are relationships of Ek;
• Es({As1, …, Asas}, {Rs1, …, Rsrs}), where the Asz are attributes and the Rst are relationships of Es, and Ek ≡ Es;
• ∀Ek.Aki ∈ {Ak1, …, Akak}, ∃Es.Asj ∈ {As1, …, Asas} with Ek.Aki ≡ Es.Asj, 1 ≤ i, j ≤ ak.

Definition 16 states that an entity Ek is considered fully redundant when all of its attributes are redundant (Red(Ek, Sm) = 1) and it has a substitute entity Es in Sm: all the attributes of Ek are contained in Es. In this case, Ek may be removed from the original schema Sm without loss of relevant information if it is replaced by its substitute entity Es. Any existing relationship of Ek may be associated to Es, as stated in the following definition.

Definition 17 – Relationship Relocation
In a schema Sm, if Subst(Ek) = Es, then Ek can be eliminated from Sm. In this case, in order not to lose any information, the relationships of Ek are relocated to Es according to the following rules, i.e., ∀Ek.Rkj:
i. if Ek.Rkj ∈ {Rs1, …, Rsrs}, then Rkj must be deleted because it is no longer useful;
ii. if Ek.Rkj ∉ {Rs1, …, Rsrs} but ∃Es.Rsp such that Ek.Rkj ≡ Es.Rsp, then Ek.Rkj must be deleted because it has an equivalent relationship in Es;
iii. if Ek.Rkj ∉ {Rs1, …, Rsrs} and there is no Es.Rsp such that Ek.Rkj ≡ Es.Rsp, then Es is redefined as Es = ({As1, …, Asas}, {R's1, …, R'srs}), where the Asz are attributes, the R'st are relationships of Es, and {R's1, …, R'srs} = {Rs1, …, Rsrs} ∪ {Rkj}.

The first and second cases above do not imply relevant schema changes, only the relationship removal. The third one, where the relationship relocation occurs, is exemplified in Figures 11 and 12.

Figure 11. Redundant entity detection: an integrated schema in which moviem contains actorm and artistm, and artistm contains awardm; actorm has the attributes sshm, countrym and namem, artistm has idm, nationalitym and descriptionm, and awardm has categorym, editionm and yearm; with the mappings artistm ≡ actorm, idm ≡ sshm and nationalitym ≡ countrym, we have Red(artistm, Sm) = 1.

The fully redundant entity artistm (with its attributes) is removed and substituted by the semantically equivalent actorm. Consequently, the relationship moviem_artistm(moviem, artistm, (1,N)) may be deleted because it can be replaced by the remaining equivalent relationship moviem_actorm(moviem, actorm, (1,N)).

The relationship artistm_awardm(artistm, awardm, (1,N)) is relocated to actorm, turning into the new relationship actorm_awardm(actorm, awardm, (1,N)). With these operations, it is possible to obtain a schema with no redundancies, as illustrated in Figure 12.

Figure 12. Relationship relocation: the resulting schema, in which moviem contains actorm and actorm contains awardm.

7.2 Redundant Relationships Elimination
After removing redundant entities and possibly performing the necessary relationship relocations, the IQ Manager identifies the remaining redundant relationships in order to eliminate them. This can be accomplished by simply deleting from the schema the relationships identified as redundant. Considering the example of Figure 13, the relationship enterprisem_sectionm(enterprisem, sectionm, (1,N)) is redundant because it has a semantically equivalent correspondent represented by P1. After eliminating enterprisem_sectionm(enterprisem, sectionm, (1,N)), the schema with no redundancies is shown in Figure 14.

Figure 13. Redundant relationship detection: enterprisem contains departmentm, departmentm contains sectionm, and enterprisem also contains sectionm; P1 = enterprisem.enterprisem_departmentm.departmentm.departmentm_sectionm.sectionm, P2 = enterprisem.enterprisem_sectionm.sectionm, P1 ≡ P2 ⇒ Red(enterprisem_sectionm, Sm) = 1.

Figure 14. Redundant relationship elimination: the resulting schema, in which enterprisem contains departmentm and departmentm contains sectionm.

It is important to note that the schema remaining after the relationship eliminations does not lose relevant information. Instead, without redundancies, it has better IQ scores and consequently is more useful to assist query processing.
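The relocation rules of Definition 17 can be sketched as follows. The data layout and names are our own, and renaming a relocated relationship (e.g. to actorm_awardm, as in Figure 12) is omitted for brevity; this is an illustration, not the IQ Manager's implementation.

```python
# Illustrative sketch of removing a fully redundant entity ek and relocating
# its relationships to its substitute entity es (Definitions 16 and 17).

def relocate_and_remove(schema, rel_equiv, ek, es):
    """Remove ek, handling each of its relationships by cases (i)-(iii)."""
    for rel in list(schema["entity_rels"][ek]):
        in_es = rel in schema["entity_rels"][es]
        has_equiv = any(frozenset((rel, r)) in rel_equiv
                        for r in schema["entity_rels"][es])
        if in_es or has_equiv:                      # cases (i) and (ii): drop it
            schema["relationships"].pop(rel, None)
            for rels in schema["entity_rels"].values():
                rels.discard(rel)
        else:                                       # case (iii): reconnect to es
            src, tgt, card = schema["relationships"][rel]
            schema["relationships"][rel] = (es if src == ek else src,
                                            es if tgt == ek else tgt, card)
            schema["entity_rels"][es].add(rel)
    del schema["entity_rels"][ek]
    schema["entities"].discard(ek)
    return schema

# Figures 11 and 12: artistm is replaced by actorm; moviem_artistm is deleted
# (equivalent to moviem_actorm) and artistm_awardm is reconnected to actorm.
schema = {
    "entities": {"moviem", "actorm", "artistm", "awardm"},
    "relationships": {
        "moviem_actorm": ("moviem", "actorm", (1, "N")),
        "moviem_artistm": ("moviem", "artistm", (1, "N")),
        "artistm_awardm": ("artistm", "awardm", (1, "N")),
    },
    "entity_rels": {
        "moviem": {"moviem_actorm", "moviem_artistm"},
        "actorm": {"moviem_actorm"},
        "artistm": {"moviem_artistm", "artistm_awardm"},
        "awardm": {"artistm_awardm"},
    },
}
rel_equiv = {frozenset({"moviem_artistm", "moviem_actorm"})}
relocate_and_remove(schema, rel_equiv, "artistm", "actorm")
print(sorted(schema["relationships"]))   # ['artistm_awardm', 'moviem_actorm']
```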
7.3 Redundant Attributes Elimination
The last step of the schema improvement algorithm consists of investigating and eliminating the remaining redundant attributes in the schema. Similarly to the redundant relationship removal step, these attributes may simply be deleted from the schema. This is possible because the schema always has semantically equivalent attributes to substitute the redundant ones. In Figure 15, the attribute nationalitym is removed because there is a semantically equivalent attribute countrym, which will substitute it.

Figure 15. Redundant attribute detection: an integrated schema in which moviem contains actorm (with attributes such as sshm, agem, countrym, namem) and artistm (with attributes such as addressm, nationalitym); mappings: artistm ≡ actorm, nationalitym ≡ countrym.

After executing the schema improvement steps, the IQ Manager can recalculate and analyze the minimality scores in order to determine whether the desired IQ has been accomplished.

7.4 Implementation Issues
We implemented the IQ Manager as a module of an existing mediator-based data integration system. More details about the system can be found in [5]. The module was written in Java, and the experiment used two databases – MySQL and PostgreSQL – to store the data sources. As mentioned before, the data in the system is XML and the schemas are represented with XML Schema. The experiment was carried out in the following steps: (i) initially, the queries were submitted over an integrated schema that was 26% redundant and the execution times were measured; (ii) the redundancy elimination algorithm was executed over the redundant integrated schema, generating a minimal schema (100% minimality); (iii) the same queries of step (i) were executed again. The results obtained with these experiments have been satisfactory.

8. CONCLUSION
Data integration systems may suffer from a lack of quality in the produced query results: they can be outdated, erroneous, incomplete, inconsistent, redundant and so on. As a consequence, query execution can become rather inefficient. To minimize the impact of these problems, we propose a quality approach that serves to analyze and improve the integrated schema definition and, consequently, query execution. It is known that a major problem in data integration systems is to execute user queries efficiently. The main contribution of the presented approach is the specification of IQ criteria assessment methods for the maintenance of high quality integrated schemas, with the objective of achieving better integrated query execution. We also proposed an algorithm used to improve the schema's minimality score. We have specified the IQ Manager module to carry out all schema IQ analysis and also to execute improvement actions by eliminating the redundant items.

As future work, similarly to what was done with the minimality criterion, we must formally describe and implement the algorithms to evaluate the other IQ criteria and to execute the schema IQ improvement actions for each of them.

9. REFERENCES
[1] DOAN, A., DOMINGOS, P. and HALEVY, A. Learning to Match the Schemas of Data Sources: A Multistrategy Approach. Machine Learning, 50(3), 2003.
[2] ASSENOVA, P. and JOHANNESSON, P. Improving Quality in Conceptual Modeling by the Use of Schema Transformations. In Proc. 15th Int. Conf. on Conceptual Modeling (ER'96), Cottbus, Germany, 1996.
[3] BALLOU, D.P. and PAZER, H.L. Modeling Data and Process Quality in Multi-input, Multi-output Information Systems. Management Science, 1985.
[4] BALLOU, D.P. and PAZER, H.L. Modeling Completeness versus Consistency Tradeoffs in Information Decision Contexts. IEEE Transactions on Knowledge and Data Engineering, 15(1), 2003.
[5] BATISTA, M.C., LÓSCIO, B.F. and SALGADO, A.C. Optimizing Access in a Data Integration System with Caching and Materialized Data. In Proc. of the 5th ICEIS, 2003.
[6] CALI, A., CALVANESE, D., DE GIACOMO, G. and LENZERINI, M. Data Integration under Integrity Constraints. In Proc. Conference on Advanced Information Systems Engineering, 2002.
[7] CARDELLI, L. and WEGNER, P. On Understanding Types, Data Abstraction, and Polymorphism. ACM Computing Surveys, 17(4), Dec. 1985.
[8] CHEN, P.P. The Entity-Relationship Model: Toward a Unified View of Data. ACM Transactions on Database Systems, 1976.
[9] DAI, B.T., KOUDAS, N., OOI, B.C., SRIVASTAVA, D. and VENKATASUBRAMANIAN, S. Column Heterogeneity as a Measure of Data Quality. In Proc. of the 1st Int'l VLDB Workshop on Clean Databases, 2006.
[10] HALEVY, A. Why Your Data Won't Mix. ACM Queue, 3(8), 2005.
[11] HERDEN, O. Measuring Quality of Database Schemas by Reviewing – Concept, Criteria and Tool. In Proc. 5th Int'l Workshop on Quantitative Approaches in Object-Oriented Software Engineering, 2001.
[12] KESH, S. Evaluating the Quality of Entity Relationship Models. Information and Software Technology, 1995.
[13] KOTONYA, G. and SOMMERVILLE, I. Requirements Engineering: Processes and Techniques. 1st Edition, Wiley & Sons, 1997.
[14] LENZERINI, M. Data Integration: A Theoretical Perspective. In Proc. of the 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (PODS 2002), pages 233–246, 2002.
[15] LÓSCIO, B.F. Managing the Evolution of XML-Based Mediation Queries. PhD thesis, Centro de Informática, UFPE, Recife, 2003.
[16] MOODY, D. Measuring the Quality of Data Models: An Empirical Evaluation of the Use of Quality Metrics in Practice. In Proc. of the 11th European Conference on Information Systems, 2003.
[17] NAUMANN, F. and LESER, U. Quality-driven Integration of Heterogeneous Information Systems. In Proc. of the 25th VLDB, 1999.
[18] PETERSON, D., BIRON, P.V., MALHOTRA, A. and SPERBERG-MCQUEEN, C.M. XML Schema 1.1 Part 2: Data Types – W3C Working Draft, 2006. http://www.w3.org/TR/xmlschema11-2/.
[19] PIATTINI, M., GENERO, M. and CALERO, C. Data Model Metrics. In Handbook of Software Engineering and Knowledge Engineering: Emerging Technologies, World Scientific, 2002.
[20] SCANNAPIECO, M. Data Quality at a Glance. Datenbank-Spektrum 14, 6–14.
[21] SI-SAID, S.C. and PRAT, N. Multidimensional Schemas Quality: Assessing and Balancing Analyzability and Simplicity. Lecture Notes in Computer Science 2814, Springer, 140–151, 2003.
[22] SPACCAPIETRA, S. and PARENT, C. View Integration: A Step Forward in Solving Structural Conflicts. IEEE Transactions on Knowledge and Data Engineering, 1994.
[23] VARAS, M. Diseño Conceptual de Bases de Datos: Un Enfoque Basado en la Medición de la Calidad. Actas Primer Workshop Chileno de Ingeniería de Software, Punta Arenas, 2001.
[24] WAND, Y. and WANG, R.Y. Anchoring Data Quality Dimensions in Ontological Foundations. Communications of the ACM 39(11), 86–95, 1996.
[25] WANG, R.Y. and STRONG, D.M. Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 1996.
[26] WANG, Y. and HU, J. Detecting Tables in HTML Documents. Lecture Notes in Computer Science 2423, Springer-Verlag, 249–260, 2002.
[27] WIEDERHOLD, G. Mediators in the Architecture of Future Information Systems. IEEE Computer, 1992.
[28] ZHANG, Y., CALLAN, J. and MINKA, T. Novelty and Redundancy Detection in Adaptive Filtering. In Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2002.