Extending Dependencies with Conditions for Data Cleaning

Wenfei Fan∗
University of Edinburgh & Bell Laboratories
wenfei@{inf.ed.ac.uk, research.bell-labs.com}
Abstract
Data cleaning aims to effectively detect and repair errors and inconsistencies in real-life data. The increasing costs and risks of dirty data highlight the need for data cleaning techniques. This paper provides an overview of recent advances in data cleaning based on conditional dependencies, an extension of functional and inclusion dependencies for characterizing the consistency of relational data.

1 Introduction

Real-life data in all industries worldwide is routinely found dirty, i.e., inconsistent, inaccurate, stale or deliberately falsified. Recent statistics reveal that enterprises typically expect data error rates of approximately 1%–5%. The costs and risks of dirty data are increasingly recognized. It is reported that dirty data costs US businesses billions of dollars annually (cf. [14]), and that wrong price data in retail databases alone costs US consumers $2.5 billion each year [15]. It is also estimated that data cleaning accounts for 30%–80% of the development time and budget in most data warehouse projects (cf. [24]). While the prevalent use of the Web has made it possible to extract and integrate data from diverse sources, it has also increased the risks, on an unprecedented scale, of creating and propagating dirty data. These highlight the need for data cleaning tools to effectively detect and repair inconsistencies in the data. Indeed, the market for data-cleaning tools is growing at 17%, well above the 7% average forecast for other IT segments, and is projected to pass $677 million by 2011 [19].

One of the central technical questions associated with data cleaning is how to characterize the consistency of data, i.e., how to tell whether the data is clean or dirty. Most data cleaning tools today, including those embedded in commercial ETL (extraction, transformation, loading) tools, rely heavily on manual effort and low-level programs that are difficult to write and maintain [22]. A more systematic approach is constraint-based data cleaning, which captures inconsistencies and errors as violations of integrity constraints [3, 4, 5, 8, 9, 11, 12, 21, 25]. Indeed, integrity constraints specify a fundamental part of the semantics of the data, which is critical to data quality [22]. Better still, inference systems, analysis algorithms and profiling methods developed for constraints yield systematic methods to effectively reason about the semantics of the data, and to deduce, discover and apply cleaning rules. However, the constraints used for data cleaning have mostly been traditional dependencies such as functional and inclusion dependencies. These were developed mainly for schema design; as will be seen shortly, they are not capable of capturing errors and inconsistencies commonly found in real-life data. This calls for new constraint languages designed for data cleaning [22].

In response to this need, an extension of functional and inclusion dependencies, referred to as conditional dependencies, has recently been proposed [16, 7]. In contrast to their traditional counterparts, conditional dependencies specify patterns of semantically related data values. They are capable of capturing many common errors and inconsistencies that traditional dependencies cannot detect.

To use conditional dependencies as rules for data cleaning, one first wants to make sure that the rules are clean themselves. With this comes the need for static analyses of conditional dependencies. There are two important issues associated with conditional dependencies. One concerns consistency analysis, to determine whether or not a given set of conditional dependencies makes sense. The other concerns implication analysis, to decide whether a set of conditional dependencies logically entails another dependency. These decision problems are more intriguing for conditional dependencies than for their traditional counterparts.

We want to detect and repair errors and inconsistencies based on conditional dependencies. Given a set Σ of conditional dependencies (cleaning rules) and a database D, there are SQL techniques to automatically identify tuples in D that violate one or more dependencies in Σ. Furthermore, we want to fix the errors and find candidate repairs by editing D. While the repairing problem is difficult (intractable), it is possible to develop scalable heuristic algorithms for finding database repairs with performance guarantees.

The remainder of the paper is organized as follows. We present conditional dependencies in Section 2, and their reasoning techniques in Section 3, followed by inconsistency detection and repairing techniques in Section 4. Finally, we identify open research issues in Section 5. This paper is by no means a comprehensive survey: a number of related articles are not referenced due to the space constraint.

∗ Supported in part by EPSRC GR/S63205/01, GR/T27433/01 and EP/E029213/1.
2 Extending Dependencies with Conditions

We extend functional and inclusion dependencies with conditions to characterize the consistency of data.
      CC   AC    phn       name   street     city   zip
t1:   44   131   1234567   Mike   Mayfield   NYC    EH4 8LE
t2:   44   131   3456789   Rick   Crichton   NYC    EH4 8LE
t3:   01   908   3456789   Joe    Mtn Ave    NYC    07974

Figure 1. An instance of customer relation
Conditional functional dependencies. Let us consider the following relational schema for customer data:
customer (CC: int, AC: int, phn: int, name: string, street: string, city: string, zip: string)
where each customer tuple specifies a customer’s phone number (country code (CC), area code (AC), phone (phn)), name, and address (street, city and zip code). An instance D0 of the customer schema is shown in Fig. 1. Traditional functional dependencies (FDs) on customer relations include:
(a) Tableau T1 of ϕ1 = ([CC, zip] → [street], T1):

  CC   zip   street
  44   _     _

(b) Tableau T2 of ϕ2 = ([CC, AC, phn] → [street, city, zip], T2):

  CC   AC    phn   street   city   zip
  _    _     _     _        _      _
  44   131   _     _        EDI    _
  01   908   _     _        MH     _

Figure 2. Example CFDs
f1: [CC, AC, phn] → [street, city, zip]
f2: [CC, AC] → [city]
That is, a customer’s phone uniquely determines her address (f1), and her country code and area code determine her city (f2). The instance D0 of Fig. 1 satisfies f1 and f2. In other words, when f1 and f2 are used to specify the consistency of customer data, no errors or inconsistencies can be detected in D0, and hence D0 is considered clean. A closer examination of D0, however, reveals that none of the tuples in D0 is error-free. The inconsistencies are captured by conditional functional dependencies (CFDs):

ϕ1: ([CC, zip] → [street], T1)
ϕ2: ([CC, AC, phn] → [street, city, zip], T2)
where T1 and T2 are the pattern tableaux shown in Fig. 2. Each tuple in T1 or T2 specifies a constraint, in which ‘_’ denotes a wild-card that can take an arbitrary value from the corresponding domain. The CFD ϕ1 asserts that for customers in the UK (CC = 44), zip determines street. In other words, ϕ1 is an “FD” that is to hold on the subset of tuples that satisfies the pattern “CC = 44”, e.g., {t1, t2} in D0, rather than on the entire customer relation D0. Tuples t1 and t2 in D0 violate ϕ1: they have the same zip but differ in street. The CFD ϕ2 defines three constraints, each by a distinct tuple in the tableau T2. The first one encodes the standard FD f1, and the other two refine f1. More specifically, the second constraint asserts that in the UK (CC = 44) and for area code 131, if two tuples have the same phn, then they must have the same street and zip, and moreover, the city must be EDI; similarly for the third constraint. While D0 satisfies f1, each of t1 and t2 in D0 violates ϕ2: CC = 44 and AC = 131, but city ≠ EDI. Similarly, t3 violates ϕ2.

More formally, a CFD ϕ defined on a relation schema R is a pair (R: X → Y, Tp), where (1) X → Y is a standard FD, referred to as the FD embedded in ϕ; and (2) Tp is a tableau with attributes in X and Y, referred to as the pattern tableau of ϕ, where for each A in X ∪ Y and each tuple tp ∈ Tp, tp[A] is either a constant ‘a’ in dom(A), or an unnamed variable ‘_’ that draws values from dom(A). We write ϕ as (X → Y, Tp) when R is clear from the context.

To give the semantics of CFDs, we define a match operator ≍ on data values and ‘_’: η1 ≍ η2 if either η1 = η2, or η1 is a constant ‘a’ and η2 is ‘_’. The operator ≍ extends to tuples, e.g., (Mayfield, EDI) ≍ (_, EDI) but (Mayfield, EDI) ≭ (_, NYC). We say that a tuple t1 matches t2 if t1 ≍ t2. An instance D of R satisfies the CFD ϕ, denoted by D |= ϕ, if for each pair of tuples t1, t2 in D, and for each tuple tp in the pattern tableau Tp of ϕ, if t1[X] = t2[X] ≍ tp[X], then t1[Y] = t2[Y] ≍ tp[Y]. Intuitively, each tuple tp in the pattern tableau Tp of ϕ is a constraint defined on the set Dtp = {t | t ∈ D, t[X] ≍ tp[X]} such that for any t1, t2 ∈ Dtp, if t1[X] = t2[X], then (a) t1[Y] = t2[Y], and (b) t1[Y] ≍ tp[Y]. Here (a) enforces the semantics of the embedded FD, and (b) assures the binding between the constants in tp[Y] and the constants in t1[Y]. Note that this constraint is defined on the subset Dtp of D identified by tp[X], rather than on the entire D.
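To make these definitions concrete, the following minimal Python sketch (ours, for illustration only, not the implementation behind [16]) encodes the match operator ≍ and the satisfaction check D |= ϕ over tuples represented as dictionaries, and replays the violation of ϕ1 by t1 and t2 of Fig. 1:

    WILDCARD = "_"

    def match(v, p):
        # value-level operator: v ≍ p iff p is the wild-card or v = p
        return p == WILDCARD or v == p

    def matches(t, tp, attrs):
        # tuple-level ≍, restricted to the attributes in attrs
        return all(match(t[a], tp[a]) for a in attrs)

    def satisfies_cfd(D, X, Y, Tp):
        # D |= (X -> Y, Tp): for all t1, t2 in D and each pattern tuple tp,
        # t1[X] = t2[X] ≍ tp[X] implies t1[Y] = t2[Y] ≍ tp[Y]
        return all(not (matches(t1, tp, X) and matches(t2, tp, X)
                        and all(t1[a] == t2[a] for a in X))
                   or (all(t1[a] == t2[a] for a in Y) and matches(t1, tp, Y))
                   for tp in Tp for t1 in D for t2 in D)

    attrs = ["CC", "AC", "phn", "name", "street", "city", "zip"]
    rows = [("44", "131", "1234567", "Mike", "Mayfield", "NYC", "EH4 8LE"),
            ("44", "131", "3456789", "Rick", "Crichton", "NYC", "EH4 8LE"),
            ("01", "908", "3456789", "Joe", "Mtn Ave", "NYC", "07974")]
    D0 = [dict(zip(attrs, r)) for r in rows]

    # ϕ1 of Fig. 2: t1 and t2 agree on (CC, zip) but differ in street
    T1 = [{"CC": "44", "zip": WILDCARD, "street": WILDCARD}]
    print(satisfies_cfd(D0, ["CC", "zip"], ["street"], T1))  # False

Note that the check ranges over pairs with t1 = t2 as well; this is what enforces the binding between the pattern constants in tp[Y] and the data values.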
Conditional inclusion dependencies. Next consider two schemas, referred to as source and target, respectively:

Source: order (asin: string, title: string, type: string, price: real)
Target: book (isbn: string, title: string, price: real, format: string)
        CD (id: string, album: string, price: real, genre: string)
The source database contains a single relation order, specifying items of various types, such as books, CDs and DVDs, ordered by customers. The target database has two relations, specifying customer orders of books and CDs, respectively. Example source and target databases are shown in Fig. 3. To find a schema mapping from source to target (e.g., [20]), or to detect errors across these databases (e.g., [5]), one may want to specify inclusion dependencies (INDs) such as order(title, price) ⊆ book(title, price) and order(title, price) ⊆ CD(album, price). These INDs, however, do not make sense: one cannot expect the title and price of a book item in the order table to find a matching CD tuple; similarly for CDs in the order table. In contrast, one can specify the following conditional inclusion dependencies (CINDs), an extension of INDs:
(a) Example order data

      asin   title          type   price
t4:   a23    Snow White     CD     7.99
t5:   a12    Harry Potter   book   17.99

(b) Example book data

      isbn   title          price   format
t6:   b32    Harry Potter   17.99   hard-cover
t7:   b65    Snow White     7.99    paper-cover

(c) Example CD data

      id    album        price   genre
t8:   c12   J. Denver    7.94    country
t9:   c58   Snow White   7.99    a-book

Figure 3. Example order, book and CD data
ϕ3: (order(title, price; type) ⊆ book(title, price), T3)
ϕ4: (order(title, price; type) ⊆ CD(album, price), T4)
ϕ5: (CD(album, price; genre) ⊆ book(title, price; format), T5)
where T3–T5 are the pattern tableaux shown in Fig. 4. The CIND ϕ3 asserts that for each order tuple t, if its type is “book”, then there must exist a book tuple t′ such that t and t′ agree on their title and price attributes; similarly for ϕ4. The CIND ϕ5 states that for each CD tuple t, if its genre is “a-book” (audio book), then there must be a book tuple t′ such that the title and price of t′ match the album and price of t, and moreover, the format of t′ must be “audio”. While the databases of Fig. 3 satisfy ϕ3 and ϕ4, they violate ϕ5. Indeed, tuple t9 in the CD table has an “a-book” genre, but it has no match in the book table: while t9 and t7 of the book table agree on their album (title) and price, the format of t7 is “paper-cover” rather than “audio” as required by the pattern in tableau T5.

Formally, a CIND ψ defined on schemas R1 and R2 is a pair (R1[X; Xp] ⊆ R2[Y; Yp], Tp), where (1) X, Xp and Y, Yp are lists of attributes of R1 and R2, respectively, such that X and Xp (resp. Y and Yp) are disjoint; (2) R1[X] ⊆ R2[Y] is a standard IND, referred to as the IND embedded in ψ; and (3) Tp is the pattern tableau of ψ with attributes in X, Xp and Y, Yp, such that for each tuple tp ∈ Tp, (a) tp[X] = tp[Y], consisting of the unnamed variable ‘_’; and (b) for each A in Xp or Yp, tp[A] is a constant ‘a’.

An instance (D1, D2) of (R1, R2) satisfies the CIND ψ, denoted by (D1, D2) |= ψ, iff for each tuple t1 in the relation D1 and for each tuple tp in the pattern tableau Tp, if t1[Xp] = tp[Xp], then there exists t2 in D2 such that t1[X] = t2[Y] and moreover, t2[Yp] = tp[Yp]. That is, tp is a constraint defined on D(1,tp) = {t1 | t1[Xp] = tp[Xp]}, such that (a) the IND R1[X] ⊆ R2[Y] embedded in ψ is defined on D(1,tp) rather than on the entire D1; and (b) for each t1 ∈ D(1,tp), there exists a tuple t2 in D2 such that t1[X] = t2[Y], as required by the standard IND, and moreover, t2[Yp] matches the pattern tp[Yp]. Intuitively, Xp identifies the R1 tuples on which ψ is defined, and Yp enforces that the matching R2 tuples satisfy a certain form.

From these examples one can see that, in contrast to FDs and INDs, CFDs and CINDs specify patterns of semantically related constants, and are capable of capturing more errors and inconsistencies than their traditional counterparts can catch. In practice, dependencies that hold conditionally may arise in a number of domains. In particular, when integrating data, dependencies that hold only in a subset of the sources will hold only conditionally in the integrated data. Traditional FDs and INDs are special cases of CFDs and CINDs, respectively, in which the pattern tableau consists of a single tuple containing the unnamed variable ‘_’ only.
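As with CFDs, the CIND semantics can be replayed with a small Python sketch (again ours and purely illustrative): for each tuple matching the source pattern Xp, it searches for a target tuple that agrees on the embedded IND attributes and matches Yp. Run on the data of Fig. 3, it reports the violation of ϕ5 by t9:

    def satisfies_cind(D1, D2, X, Xp, Y, Yp, Tp):
        # (D1, D2) |= (R1[X; Xp] ⊆ R2[Y; Yp], Tp): whenever t1 matches a
        # pattern tuple on Xp, some t2 in D2 must agree with t1 on (X, Y)
        # and match the pattern on Yp
        for tp in Tp:
            for t1 in D1:
                if any(t1[a] != tp[a] for a in Xp):
                    continue  # tp does not constrain this tuple
                if not any(all(t1[a] == t2[b] for a, b in zip(X, Y))
                           and all(t2[b] == tp[b] for b in Yp)
                           for t2 in D2):
                    return False
        return True

    cd = [{"id": "c12", "album": "J. Denver", "price": 7.94, "genre": "country"},
          {"id": "c58", "album": "Snow White", "price": 7.99, "genre": "a-book"}]
    book = [{"isbn": "b32", "title": "Harry Potter", "price": 17.99, "format": "hard-cover"},
            {"isbn": "b65", "title": "Snow White", "price": 7.99, "format": "paper-cover"}]

    # ϕ5 of Fig. 4: the X and Y entries of a pattern tuple are wild-cards
    # by definition, so only the Xp and Yp constants need to be listed
    T5 = [{"genre": "a-book", "format": "audio"}]
    print(satisfies_cind(cd, book, ["album", "price"], ["genre"],
                         ["title", "price"], ["format"], T5))  # False: t9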
(a) T3 in ϕ3 = (order(title, price; type) ⊆ book(title, price), T3):

  title   price   type  |  title   price
  _       _       book  |  _       _

(b) T4 in ϕ4 = (order(title, price; type) ⊆ CD(album, price), T4):

  title   price   type  |  album   price
  _       _       CD    |  _       _

(c) T5 in ϕ5 = (CD(album, price; genre) ⊆ book(title, price; format), T5):

  album   price   genre   |  title   price   format
  _       _       a-book  |  _       _       audio

Figure 4. Example CINDs
As remarked earlier, the dependencies considered for data cleaning so far include traditional FDs and INDs, as well as a form of full dependencies referred to as denial constraints (see [12] for a recent survey). There have also been extensions of CFDs that support inequality and disjunction without incurring extra complexity [6]. Data cleaning tools based on CFDs and CINDs are also being developed.
3 Reasoning about Dependencies

To use CFDs and CINDs to detect and repair errors and inconsistencies, a number of fundamental questions associated with these conditional dependencies have to be settled. In this section we address three central technical problems, namely, consistency, implication and axiomatizability.

Consistency. Given a set Σ of CFDs (resp. CINDs), can one tell whether the dependencies in Σ are dirty themselves? If the input set Σ is found inconsistent, then there is no need to check the cleaning rules against the data at all. Further, the analysis helps the user discover errors in the cleaning rules. Formally, this can be stated as the consistency problem for conditional dependencies. For a set Σ of CFDs (resp. CINDs) and a database D, we write D |= Σ if D |= ϕ for all ϕ ∈ Σ. The consistency problem is to determine, given Σ defined on a relational schema R, whether or not there exists a nonempty instance D of R such that D |= Σ.

One can specify arbitrary FDs and INDs without worrying about consistency. This is no longer the case for CFDs.

Example 3.1: Consider two CFDs ψ1 = ([A] → [B], T1) and ψ2 = ([B] → [A], T2), where dom(A) is bool, T1 has two pattern tuples (true, b1) and (false, b2), and T2 contains (b1, false) and (b2, true). Then there exists no nonempty instance D such that D |= {ψ1, ψ2}. Indeed, for any tuple t in D, no matter what value t[A] takes, ψ1 and ψ2 together force t[A] to take the other value from the finite domain bool. □
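The finite-domain interaction in Example 3.1 can be checked mechanically. The brute-force Python sketch below (ours; it handles only the shape of CFDs used in the example, i.e., constant patterns over a two-attribute relation) enumerates all candidate tuples and confirms that none satisfies both ψ1 and ψ2:

    from itertools import product

    def violates(t, lhs, rhs, patterns):
        # a pattern (p, q) applies when t[lhs] == p, and then demands
        # t[rhs] == q (all patterns in Example 3.1 are constants)
        return any(t[lhs] == p and t[rhs] != q for p, q in patterns)

    psi1 = (0, 1, [(True, "b1"), (False, "b2")])   # [A] -> [B]
    psi2 = (1, 0, [("b1", False), ("b2", True)])   # [B] -> [A]

    # since both CFDs here constrain single tuples, a nonempty model
    # exists iff some single (A, B) tuple satisfies both of them
    models = [t for t in product([True, False], ["b1", "b2"])
              if not violates(t, *psi1) and not violates(t, *psi2)]
    print(models)  # [] -- psi1 and psi2 exclude every combination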
Table 1. Complexity and finite axiomatizability

Dependencies    Consistency    Implication         Fin. Axiom
CFDs            NP-complete    coNP-complete       Yes
FDs             O(1)           O(n)                Yes
CINDs           O(1)           EXPTIME-complete    Yes
INDs            O(1)           PSPACE-complete     Yes
CFDs + CINDs    undecidable    undecidable         No
FDs + INDs      O(1)           undecidable         No

With a predefined schema or in the absence of finite-domain attributes:
CFDs            O(n²)          O(n²)               Yes
CINDs           O(1)           PSPACE-complete     Yes
It turns out that while for CINDs the consistency problem is not an issue, for CFDs it is nontrivial. Worse still, when CFDs and CINDs are put together, the problem becomes undecidable, in contrast to the trivial consistency analysis of their traditional counterparts.

Theorem 3.1 [16, 7]: The consistency problem is
• NP-complete for CFDs,
• trivially decidable for CINDs, i.e., for any set Σ of CINDs defined on a schema R, there always exists a nonempty instance D of R such that D |= Σ, and
• undecidable for CFDs and CINDs taken together. □
Fortunately, there are effective approximate and heuristic algorithms for checking the consistency of CFDs, and of CFDs and CINDs taken together, respectively (see [16, 6] for details).
Implication. Another central technical problem is the implication problem: given a set Σ of CFDs (resp. CINDs) and a single CFD (resp. CIND) ϕ defined on a relational schema R, it is to determine whether or not Σ entails ϕ, denoted by Σ |= ϕ, i.e., whether or not for all instances D of R, if D |= Σ then D |= ϕ. Effective implication analysis allows us to deduce new cleaning rules and to remove redundancies from a given set of rules, among other things. It is known that for FDs the implication problem is decidable in linear time, while for INDs it is PSPACE-complete. It becomes more intriguing for CFDs and CINDs.

Theorem 3.2 [16, 7]: The implication problem is
• coNP-complete for CFDs,
• EXPTIME-complete for CINDs, and
• undecidable for CFDs and CINDs taken together. □
The undecidability result is not surprising: the problem is already undecidable for FDs and INDs put together. In certain practical cases the consistency and implication analyses for CFDs and CINDs have complexity comparable to their traditional counterparts, as stated below. For data cleaning in practice, the relational schema is often fixed, and only the dependencies vary and are treated as the input.

Theorem 3.3 [16, 7]: For CFDs and CINDs defined on a relational schema R, if either R is predefined, or no attributes in the given dependencies have a finite domain, then
• the consistency and implication problems are both decidable in quadratic time for CFDs; and
• the implication problem is PSPACE-complete for CINDs, the same as for standard INDs. □

Axiomatizability. Armstrong’s Axioms for FDs are found in almost every database textbook, and are fundamental to the implication analysis of FDs. Similarly, there exists a finite set of inference rules for INDs. For conditional dependencies, finite axiomatizability is also important, as it reveals insight into the implication analysis and helps us understand how cleaning rules interact with each other. This motivates us to find a finite set I of inference rules that are sound and complete for implication analysis, i.e., for any set Σ of CFDs (resp. CINDs) and a single CFD (resp. CIND) ϕ, Σ |= ϕ iff ϕ is provable from Σ using I. The good news is that when CFDs and CINDs are taken separately, they are finitely axiomatizable. However, just like their traditional counterparts, CFDs and CINDs taken together are not finitely axiomatizable.

Theorem 3.4 [16, 7]: There exist finite inference systems that are sound and complete for CFDs and for CINDs, taken separately. CFDs and CINDs taken together are not finitely axiomatizable. □

Table 1 compares the complexity bounds for the static analyses, as well as the finite axiomatizability, of CFDs and CINDs with those of their traditional counterparts.

4 Detecting and Repairing Inconsistencies
Given a set Σ of conditional dependencies defined on a schema R and an instance D of R, we want to effectively detect inconsistencies in D that emerge as violations of Σ, and moreover, if D is dirty, to find candidate repairs of D.
Detecting inconsistencies. Given Σ and D, one needs an automated method to find all the inconsistent tuples in D w.r.t. Σ, i.e., the tuples that (perhaps together with other tuples in D) violate some dependency in Σ. In contrast to traditional FDs, a CFD ϕ = (X → Y, Tp) carries a possibly large pattern tableau Tp. Nevertheless, one can use a single pair of SQL queries (QCϕ, QVϕ) to find all tuples in D that violate ϕ. In a nutshell, QCϕ detects single-tuple violations, i.e., the tuples t in D that match some pattern tuple tp ∈ Tp on the X attributes but do not match tp on the Y attributes. The query QVϕ finds multi-tuple violations, i.e., tuples that match tp[X] for some tp ∈ Tp but violate the standard FD embedded in ϕ. The size of the SQL queries is independent of the size of Tp.

Example 4.1: When evaluated against a customer relation D0, the two SQL queries given in Fig. 5 find all tuples in D0 that violate the CFD ϕ2 of Fig. 2. Here t[A] ≍ tp[A] denotes the SQL expression (t[A] = tp[A] OR tp[A] = ‘_’), while t[B] ≭ tp[B] denotes (t[B] ≠ tp[B] AND tp[B] ≠ ‘_’). □
QCϕ2:  select t
       from   customer t, T2 tp
       where  t[CC] ≍ tp[CC] AND t[AC] ≍ tp[AC] AND t[phn] ≍ tp[phn]
              AND (t[street] ≭ tp[street] OR t[city] ≭ tp[city] OR t[zip] ≭ tp[zip])

QVϕ2:  select distinct t[CC], t[AC], t[phn]
       from   customer t, T2 tp
       where  t[CC] ≍ tp[CC] AND t[AC] ≍ tp[AC] AND t[phn] ≍ tp[phn]
       group by t[CC], t[AC], t[phn]
       having count(distinct t[street], t[city], t[zip]) > 1
Figure 5. SQL queries for checking CFD ϕ2
This method can be extended to a set Σ of CFDs (resp. CINDs): one can find a single pair of SQL queries that detects all inconsistent tuples in D w.r.t. Σ, such that the size of the queries depends on neither the number of CFDs (resp. CINDs) in Σ nor the size of the pattern tableau in each dependency in Σ (see [16, 6] for details).
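For illustration, the logic of QCϕ and QVϕ can be mimicked over the in-memory representation used earlier; the sketch below (ours, not the SQL technique of [16]) returns the violating tuples of ϕ2 in D0 rather than a yes/no answer:

    def matches(t, tp, attrs):
        # tuple-level match operator, with "_" as the wild-card
        return all(tp[a] == "_" or t[a] == tp[a] for a in attrs)

    def violators(D, X, Y, Tp):
        bad = []
        for t in D:
            for tp in Tp:
                if not matches(t, tp, X):
                    continue
                # single-tuple violation (the QC case): t matches tp[X]
                # but not tp[Y]
                single = not matches(t, tp, Y)
                # multi-tuple violation (the QV case): some tuple in the
                # scope of tp agrees with t on X but differs on Y
                multi = any(matches(s, tp, X)
                            and all(s[a] == t[a] for a in X)
                            and any(s[a] != t[a] for a in Y)
                            for s in D)
                if single or multi:
                    bad.append(t)
                    break
        return bad

    attrs = ["CC", "AC", "phn", "name", "street", "city", "zip"]
    rows = [("44", "131", "1234567", "Mike", "Mayfield", "NYC", "EH4 8LE"),
            ("44", "131", "3456789", "Rick", "Crichton", "NYC", "EH4 8LE"),
            ("01", "908", "3456789", "Joe", "Mtn Ave", "NYC", "07974")]
    D0 = [dict(zip(attrs, r)) for r in rows]
    T2 = [dict(zip(["CC", "AC", "phn", "street", "city", "zip"], p))
          for p in [("_",) * 6,
                    ("44", "131", "_", "_", "EDI", "_"),
                    ("01", "908", "_", "_", "MH", "_")]]

    # all three tuples are reported: t1 and t2 fail the EDI pattern,
    # and t3 fails the MH pattern, as in Section 2
    for t in violators(D0, ["CC", "AC", "phn"], ["street", "city", "zip"], T2):
        print(t["name"])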
Finding candidate repairs. Given Σ and a possibly dirty D, we want to find a candidate repair of D, i.e., an instance D′ of R that is consistent, i.e., D′ |= Σ, and moreover minimally differs from the original database D. That is, we edit D to fix the errors and make the data consistent. This is the data cleaning approach that US national statistical agencies, among others, have been practicing for decades [17]. The effectiveness and complexity of data repair methods depend on the repair model used. One model allows tuple deletions only [11], assuming that the information in D is inconsistent but complete. Here a repair D′ is a maximal subset of D such that D′ |= Σ. Another model allows both tuple deletions and insertions [3], assuming that D is neither consistent nor complete. Here a repair D′ is an instance of R such that (D \ D′) ∪ (D′ \ D) is minimal as D′ ranges over all instances of R that satisfy Σ. A more practical model is based on updates, i.e., attribute-value modifications. It is common that in an inconsistent tuple only some fields contain errors. One should fix these fields rather than remove the entire tuple, to avoid loss of correct information. This is the model adopted by US national statistical agencies [17] and recently revisited in [25, 5]. An immediate question about the update model concerns which values should be changed and which values should be chosen to replace the old ones. One should make these decisions based on both the accuracy of the attribute values to be modified and the “closeness” of the new value to the original value. Following the practice of US national statistical agencies [17], one can define a cost metric as follows. Assume that a weight in the range [0, 1] is associated with each attribute A of each tuple t in D, denoted by w(t, A) (if w(t, A) is not available, a default weight can be used instead). The weight reflects the confidence the user places in the accuracy of the attribute t[A], and can be propagated via data provenance analysis in data transformations. For two values v, v′ in the same domain, assume that a distance function dis(v, v′) is in place, with lower values indicating greater similarity. One way to define the cost of
changing the value of an attribute t[A] from v to v′ is:

cost(v, v′) = w(t, A) · dis(v, v′).

Intuitively, the more accurate the original value v of t[A] is, and the more distant the new value v′ is from v, the higher the cost of the change. The cost of changing a tuple t to t′ is the sum of cost(t[A], t′[A]) over all attributes A of t whose value is modified. The cost of changing D to D′, denoted by cost(D, D′), is the sum of the costs of modifying the tuples in D. A repair of D in the update model is an instance D′ of R such that cost(D, D′) is minimal as D′ ranges over all instances of R that satisfy Σ. This allows us to reduce the repairing problem to an optimization problem. In practice, we want to pick new values v′ from a reference database or from the active domain of the database, based on certain statistical analysis. The accuracy of a repair can be measured by precision and recall metrics: the ratio of the number of errors correctly fixed to the total number of changes made, and the ratio of the number of errors correctly fixed to the total number of errors in the database, respectively.

It is prohibitively expensive to find a repair by manual effort. The objective of data cleaning is to develop effective methods that automatically find candidate repairs of D, which are then subject to inspection and changes by human experts. It is, however, nontrivial to find a candidate repair.

Theorem 4.1 [5]: Given a set Σ of dependencies and a database D, the problem of determining whether there exists a repair D′ with minimal cost(D, D′) is NP-complete, even when Σ is a fixed set of FDs or a fixed set of INDs. □

To cope with the intractability, several heuristic algorithms have been developed (e.g., [5, 13]). A central idea is to separate the decision of which attribute values should be made equal from the decision of what value should be assigned to these attributes. Delaying value assignment allows a poor local decision to be improved at a later stage of the repairing process, and also allows a user to inspect and modify a repair. To this end an equivalence class eq(t, A) can be associated with each tuple t in the dirty database D and each attribute A in t. The repairing is conducted by merging and modifying the equivalence classes of attributes in D. For example, if tuples t1, t2 in D violate an FD X → A, one can fix the inconsistency by merging eq(t1, A) and eq(t2, A) into one, i.e., by forcing t1 and t2 to agree on their A attribute. If a tuple t1 violates an IND R1[X] ⊆ R2[Y], one can resolve the conflict by picking a tuple t2 in the R2 relation that is close to t1, or by inserting a new tuple t2 into the R2 table, such that for each corresponding attribute pair (A, B) in [X] and [Y], t1[A] = t2[B], by merging eq(t1, A) and eq(t2, B) into one. A target value is picked and assigned to each equivalence class when no more merging is possible. Based on this idea, heuristic algorithms have been developed for repairing databases using FDs and INDs [5].
The algorithms modify tuple attributes in the right-hand side of an FD or an IND in the presence of a violation. This strategy, however, no longer works for CFDs: the process may not even terminate if only the attributes on the right-hand side of a CFD can be modified. Heuristic algorithms for repairing data with CFDs have been developed [13]; these may modify attributes on either the left-hand side or the right-hand side of a CFD. Together with a statistical method, this approach guarantees that the accuracy of the candidate repairs found is above a predefined bound with high confidence.
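To illustrate the delayed-value-assignment idea, here is a highly simplified Python sketch (all names and the naive value choice are ours; the actual algorithms of [5, 13] are considerably more sophisticated). It maintains equivalence classes with a union-find structure, merges the right-hand-side classes of tuples violating an FD X → A, and only then assigns a value per class; weight and dis are caller-supplied, mirroring w(t, A) and dis(v, v′) above:

    class EqClasses:
        """Union-find over (tuple index, attribute) pairs."""
        def __init__(self):
            self.parent = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                x = self.parent[x]
            return x

        def union(self, x, y):
            self.parent[self.find(x)] = self.find(y)

    def repair_fd(D, X, A, weight, dis):
        # phase 1: merge eq(t1, A) and eq(t2, A) whenever t1, t2 agree
        # on X, i.e., decide *which* values must become equal
        eq = EqClasses()
        for i in range(len(D)):
            for j in range(i + 1, len(D)):
                if all(D[i][a] == D[j][a] for a in X):
                    eq.union((i, A), (j, A))
        # phase 2: group tuples by class and assign each class the member
        # value minimizing the summed update cost w(t, A) * dis(v, v')
        groups = {}
        for i in range(len(D)):
            groups.setdefault(eq.find((i, A)), []).append(i)
        for members in groups.values():
            candidates = {D[i][A] for i in members}
            best = min(candidates,
                       key=lambda v: sum(weight(i, A) * dis(D[i][A], v)
                                         for i in members))
            for i in members:
                D[i][A] = best
        return D

Restricting the target value to the class members’ own values is the naive choice mentioned above; in practice the value may instead be drawn from a reference database or the active domain.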
5 Concluding Remarks

The primary goal of this paper is to provide an overview of recent advances in conditional dependencies for data cleaning. There is much more to be done. One topic for future research is to find heuristic methods, with performance guarantees, for reasoning about CFDs and CINDs taken together. Another topic is data profiling: developing effective methods to discover useful CFDs and CINDs from sample data. While there has been work on discovering FDs and INDs, we are not aware of any solid technique for discovering CFDs and CINDs, a task more involved than for their traditional counterparts. A more challenging topic is to develop scalable algorithms, with performance guarantees, for finding repairs based on both CFDs and CINDs.

The notion of constraint-based repairs was introduced in [3]. Also proposed in [3] is the notion of consistent query answers: given a query Q posed on an inconsistent database D, find the tuples that are in the answer of Q over every repair of D [3, 9, 11, 21, 25] (see [10, 11] for surveys). Another alternative to finding database repairs is to develop finite and succinct representations of all possible repairs [25, 1, 2].

Data cleaning systems reported in the literature include AJAX [18], which provides users with a declarative language for specifying cleaning programs, and Potter’s Wheel [23], which extracts structure for attribute values and uses it to flag discrepancies in the data. Most commercial ETL tools have little built-in cleaning capability, covering mainly data transformation needs such as type conversions and string functions (see [22] for a survey). While a constraint repair facility will logically become part of the cleaning process, we are not aware of analogous functionality currently in any of these systems. It is interesting to extend these systems by supporting CFDs and CINDs.
References
[1] L. Antova, C. Koch, and D. Olteanu. 10^(10^6) worlds and beyond: Efficient representation and processing of incomplete information. In ICDE, 2007.
[2] L. Antova, C. Koch, and D. Olteanu. From complete to incomplete information and back. In SIGMOD, 2007.
[3] M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS, 1999.
[4] L. Bertossi and J. Chomicki. Query answering in inconsistent databases. In J. Chomicki, R. van der Meyden, and G. Saake, editors, Logics for Emerging Applications of Databases. Springer, 2003.
[5] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, 2005.
[6] L. Bravo, W. Fan, F. Geerts, and S. Ma. Increasing the expressivity of conditional functional dependencies without extra complexity. In ICDE, 2008.
[7] L. Bravo, W. Fan, and S. Ma. Extending dependencies with conditions. In VLDB, 2007.
[8] R. Bruni and A. Sassano. Errors detection and correction in large scale data collecting. In IDA, 2001.
[9] A. Calì, D. Lembo, and R. Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases. In PODS, 2003.
[10] J. Chomicki. Consistent query answering: Five easy pieces. In ICDT, 2007.
[11] J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Information and Computation, 197(1-2):90–121, 2005.
[12] J. Chomicki and J. Marcinkowski. On the computational complexity of minimal-change integrity maintenance in relational databases. In L. Bertossi, A. Hunter, and T. Schaub, editors, Inconsistency Tolerance, pages 119–150. Springer, 2005.
[13] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In VLDB, 2007.
[14] W. W. Eckerson. Data Quality and the Bottom Line: Achieving Business Success through a Commitment to High Quality Data. Technical report, The Data Warehousing Institute, 2002. http://www.tdwi.org/research/display.aspx?ID=6064.
[15] L. English. Plain English on data quality: Information quality management: The next frontier. DM Review Magazine, April 2000.
[16] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. TODS. To appear.
[17] I. Fellegi and D. Holt. A systematic approach to automatic edit and imputation. J. American Statistical Association, 71(353):17–35, 1976.
[18] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: Language, model and algorithms. In VLDB, 2001.
[19] Gartner. Forecast: Data quality tools, worldwide, 2006–2011, 2007.
[20] L. Haas, M. Hernández, H. Ho, L. Popa, and M. Roth. Clio grows up: From research prototype to industrial tool. In SIGMOD, 2005.
[21] A. Lopatenko and L. Bertossi. Complexity of consistent query answering in databases under cardinality-based and incremental repair semantics. In ICDT, 2007.
[22] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4), 2000.
[23] V. Raman and J. M. Hellerstein. Potter’s Wheel: An interactive data cleaning system. In VLDB, 2001.
[24] C. C. Shilakes and J. Tylman. Enterprise information portals, Nov. 1998.
[25] J. Wijsen. Database repairing using updates. TODS, 30(3):722–768, 2005.