Fast Identification of Relational Constraint Violations

Amit Chandel, University of Toronto, Toronto, Ontario, Canada, [email protected]
Nick Koudas, University of Toronto, Toronto, Ontario, Canada, [email protected]
Ken Q. Pu, University of Ontario Institute of Technology, Oshawa, Ontario, Canada, [email protected]
Divesh Srivastava, AT&T Labs-Research, New Jersey, USA, [email protected]
Abstract

Logical constraints (e.g., 'phone numbers in Toronto can have prefixes 416, 647, 905 only') are ubiquitous in relational databases. Traditional integrity constraints, such as functional dependencies, are examples of such logical constraints as well. However, under frequent database updates, schema evolution and transformations, they can easily be violated. As a result, tables become inconsistent and data quality is degraded. In this paper we study the problem of validating collections of user defined constraints on a number of relational tables. Our primary goal is to quickly identify which tables violate such constraints. Logical constraints are potentially complex logical formulae, and we demonstrate that they cannot be efficiently evaluated by SQL queries. In order to enable fast identification of constraint violations, we propose to build and maintain specialized logical indices on the relational tables. We choose Boolean Decision Diagrams (BDDs) as the index structure to aid in this task. We first propose efficient algorithms to construct and maintain such indices in a space efficient manner. We then describe a set of query re-write rules that aid in the efficient utilization of logical indices during constraint validation. We have implemented our approach on top of a relational database and tested our techniques using large collections of real and synthetic data sets. Our results indicate that utilizing our techniques in conjunction with logical indices during constraint validation offers very significant performance advantages.

1. Introduction

Logical constraints are prevalent in relational databases. Traditional integrity constraints (e.g., primary and foreign key constraints) are examples of such logical constraints. However, the set of logical constraints associated with a database is much richer than that. Application semantics offer a large collection of logical constraints (in the form of 'known facts') that are expected to be true in the information recorded in a collection of tables. For example, the phone number associated with a resident of New Jersey should have a prefix in the set {201, 973, 908}. A tuple in violation of such a logical constraint may, depending on application semantics, be erroneous and/or a prime candidate for further inspection.

Databases, however, are primarily dynamic: both the schemas and the content of relational tables evolve. As a result of such evolution, several logical constraints may be violated. As an example, a tuple associating a phone number prefix of 416 with a resident of NJ may be inserted into a table. Being able to identify constraints that are violated within and across tables is highly important. Constraint violation is a primary reason for poor data quality, and identifying constraint violations provides insight into the actions one has to take in order to correct such violations and improve database quality. Towards this end, one would typically express a logical constraint as a selection query whose result set is the set of tuples violating the constraint. Essentially, identifying a violated constraint is as costly as identifying the tuples violating the constraint. However, for constraints that are not violated in a collection of tables, one still has to issue a possibly complex query seeking to identify violating tuples, even though the result of such a query is empty. From a performance standpoint, it would be desirable to first identify, fast, the set of violated constraints and the associated tables, and only then focus on identifying the precise tuples violating such constraints.

EXAMPLE: Consider a simple database consisting of three tables:

STUDENT(student_id, department, contact)
COURSE(course_id, area)
TAKES(student_id, course_id)

Suppose the policy is that all students in the department of computer science ("CS") must take some course in the area of "Programming". It can be expressed in first order logic (FOL) as:

∀x_S ∃z ( STUDENT(x_S, "CS", z) ⟹ ∃x_C ( COURSE(x_C, "Programming") ∧ TAKES(x_S, x_C) ) )   (1)

Using SQL, we express the violating tuples as follows:

SELECT * FROM STUDENT S
WHERE S.department = 'CS'
  AND NOT EXISTS (
    SELECT * FROM COURSE, TAKES
    WHERE TAKES.student_id = S.student_id
      AND COURSE.area = 'Programming'
      AND COURSE.course_id = TAKES.course_id )

We will show that logical constraints in the form of Formula 1 can be checked much more efficiently than by the equivalent SQL statement. Instead of issuing and evaluating such a complex SQL query, it would be much more desirable to quickly identify whether the constraint of Formula 1 is violated. If the answer is affirmative, then one would focus on identifying the violating tuples, possibly by conducting more expensive SQL query processing (such as that of the query above).

In this paper, we present techniques that, given a set of constraints and a set of relational tables, quickly identify which constraints are violated on which tables. Our approach is to construct specialized logical indices in order to support complex logical constraint checking against the indexed relational attributes. Such indices should have the following characteristics: (a) allow for fast evaluation of boolean constraints formulated in first order logic, (b) be space efficient, and (c) permit incremental maintenance under dynamic update operations on the indexed attributes. To this end, we propose to employ the Reduced Ordered Boolean Decision Diagram (ROBDD) as the logical index structure. BDDs offer a compact representation of relational data, and there exist robust algorithms and implementations [7] for logical processing of BDDs.

OUTLINE OF THE PAPER: We introduce background material on BDDs and how a BDD is used to represent relational data in Section 2. In Section 3, we study the problem of constructing compact BDD representations for relational data and present two variable ordering approaches based on statistical measures derived from the relational data. Query processing is discussed in Section 4, where we present a collection of rewrite rules to optimize the evaluation of user defined constraints. Experimental evaluation of the variable ordering heuristics and constraint evaluation is presented in Section 5.
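The two-phase strategy sketched above (first detect that some violation exists, and only then retrieve the violating tuples) can be tried out on a toy sqlite3 instance of the STUDENT/COURSE/TAKES schema from the introduction; the rows below are hypothetical and the boolean EXISTS probe is our own illustration, not the paper's BDD-based check:

```python
import sqlite3

# Toy instance of the STUDENT/COURSE/TAKES schema; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STUDENT(student_id INTEGER, department TEXT, contact TEXT);
CREATE TABLE COURSE(course_id INTEGER, area TEXT);
CREATE TABLE TAKES(student_id INTEGER, course_id INTEGER);
INSERT INTO STUDENT VALUES (1,'CS','a@x'), (2,'CS','b@x'), (3,'EE','c@x');
INSERT INTO COURSE VALUES (10,'Programming'), (20,'Databases');
INSERT INTO TAKES VALUES (1,10), (2,20), (3,20);
""")

# The violating-tuples query: CS students taking no 'Programming' course.
violating = """
SELECT S.student_id FROM STUDENT S
WHERE S.department = 'CS'
  AND NOT EXISTS (
    SELECT * FROM COURSE, TAKES
    WHERE TAKES.student_id = S.student_id
      AND COURSE.area = 'Programming'
      AND COURSE.course_id = TAKES.course_id)
"""

# Phase 1: is the constraint violated at all? (a cheap boolean probe)
violated = conn.execute(f"SELECT EXISTS ({violating})").fetchone()[0]
# Phase 2: only if so, enumerate the violating tuples.
violations = conn.execute(violating).fetchall() if violated else []
print(violated, violations)  # student 2 takes no Programming course
```

Note that even the cheap probe still forces the database to evaluate the correlated subquery; the point of the paper is to answer phase 1 without any such query.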
2. Background and Related Work

BDD is a data structure for the representation of generic boolean functions [6]. It has since proved to be an invaluable tool in areas such as verification of digital circuits, software logic flow, model checking and constraint satisfaction. BDDs have also been applied in the field of program analysis [11]. Data cleaning has attracted a lot of research attention recently; comprehensive surveys of data cleaning problems are available elsewhere [10, 8]. Bohannan et al. [3] addressed the orthogonal problem of constraint repairing, in which the primary goal is to minimize the number of changes required to repair a constraint.

2.1. Binary Decision Diagram

Consider a function f(x1, x2, …, xn) that maps the boolean variables {xi}_{1≤i≤n} to a boolean value of 0 or 1, i.e., f : {0,1}^n → {0,1}. It can be represented as a decision tree. Of course, the size of the decision tree can be exponential in the number of variables. In order to reduce the encoding to a tractable size, one can merge isomorphic subtrees of the decision tree to form a decision diagram. Such a decision diagram representation of boolean functions is known as an Ordered Boolean Decision Diagram (OBDD). A number of reduction rules can be applied to further eliminate unnecessary nodes. An OBDD that can no longer be reduced by the reduction rules is referred to as a Reduced OBDD (ROBDD). A foremost salient feature of the BDD encoding of boolean functions is that every OBDD reduces to a unique ROBDD. This leads to the following important consequence.

Fact 1 (Bryant 1986). If two boolean functions f1 and f2 are logically equivalent, then BDD(f1) and BDD(f2) are structurally the same.

This leads to the following rather surprising results. Let f be some boolean function with its ROBDD representation given as BDD(f). Then:

• testing the validity of f is in O(1),
• testing the satisfiability of f is in O(1).

This means that many classically intractable problems, such as validity (coNP-complete) and satisfiability (NP-complete) of logical formulae, can be decided in constant time provided that the predicates are represented as a ROBDD, making the ROBDD a very attractive encoding of logical formulae. Of course, since satisfiability of conjunctive normal form (CNF) is NP-complete, it follows that converting a formula in CNF to its ROBDD form is intractable in the worst case. However, as a large body of literature supports, in practice the ROBDD encoding is very efficient. For the remainder of the paper, we will refer to ROBDDs simply as BDDs.

BDDs can also be used to encode boolean functions over finite domain variables. A variable x is of finite domain if its value is drawn from a finite set dom(x). Without loss of generality, we may assume dom(x) = {1, 2, …, |dom(x)|}. It is possible to replace x with k boolean variables ~y = y1, y2, …, yk, where k = ⌈log2 |dom(x)|⌉ and each boolean variable yj is the j-th bit of the binary representation of the value of x. Therefore, it is straightforward to reduce a boolean function f(x1, x2, …, xn) over variables of finite domains to a boolean function f′(~y1, ~y2, …, ~yn) over boolean variables. Each vector ~yj of boolean variables is commonly known as a finite domain block.

Given a boolean function f(x1, x2, …, xn), the following logical manipulations can be done on its BDD representation.
• Restriction: (f|_{xj=a})(x1, …, xj−1, xj+1, …, xn).
• Substitution: (f[xj/x′j])(x1, …, xj−1, x′j, xj+1, …, xn).
• Conjunction, Disjunction, Negation, etc.: BDD op BDD′, for f1 ∧ f2, f1 ∨ f2, and ¬f.
• Quantification over finite domain variables: (∀xj f)(x1, …, xj−1, xj+1, …, xn), and (∃xj f).

One of the most important properties of a given ROBDD is the ordering of its variables. Recall that a BDD is a decision diagram; thus, the order in which the variables are accessed directly affects the node count of the BDD. It is well known that for certain boolean circuits, the worst variable ordering can lead to an exponential blow-up in node count, while the blow-up can be avoided by the best variable ordering.

2.2. Representing Relations using BDD

Relations over finite domain variables can naturally be represented as BDDs, since a relation R(x1, x2, …, xn) is equivalent to its characteristic function fR mapping tuples (x1, x2, …, xn) to true (belonging to the relation) or false (not belonging to the relation). Given a relation R(A1, A2, …, An) with attributes Ai, let domi be the active domain of attribute Ai. Then the relation R can be represented by the following boolean function over the finite domains domi:

R = ⋁_{~a∈R} ⋀_i (xi = ai)

To check if a given tuple ~b is in the relation R, we simply test whether BDD_R(~x = ~b) = 1.

RELATIONAL QUERY PROCESSING: With respect to bounded active domains, the set of BDD operations is relationally complete; thus any SQL query can be simulated by logical operations on BDDs. However, due to the differences in the data structures, BDD operations have very different performance characteristics. For instance, suppose R1(A, B) and R2(C, D) are represented by the BDDs BDD(R1)(xA, xB) and BDD(R2)(xC, xD) respectively. The Cartesian product R1 × R2 is an expensive SQL operation, since the resulting relation has cardinality ||R1|| × ||R2||. The equivalent BDD operation is the conjunction BDD(R1) ∧ BDD(R2), whose node count is only additive: ||BDD(R1)|| + ||BDD(R2)||. In fact, things are even better with the shared node implementation [7], in which nodes are shared between BDDs.

VARIABLE ORDERING FOR RELATIONS: Variable ordering is also important in our context of logical processing of relational data. A good variable ordering can dramatically reduce the number of nodes used by the BDD, which leads to a reduction in memory consumption as well as a performance gain for later manipulations of the BDD. Variable ordering has received much attention. Bollig and Wegener [5] showed that deciding whether a given variable ordering is optimal is NP-complete. Consequently, in practice one can only resort to heuristics to find sub-optimal variable orderings. A number of heuristics have been proposed using statistical techniques [4], sampling schemes [9] and dynamic approaches [13]. However, the existing variable ordering heuristics are not suitable for our needs. For instance, the dynamic variable ordering algorithm [13] is quite expensive and requires a BDD representation of the relational data (which can be quite large) to be constructed a priori. The sampling technique [9] is intended for applications of BDDs in VLSI design and is tailored towards that particular application domain.

There is no general rule of thumb on how the variables should be ordered such that the resulting BDD is minimal in node size. However, with respect to a specific domain, structural knowledge of the underlying problem can greatly assist the ordering of the BDD variables. For instance, Aziz et al. [1] and Beyer [2] demonstrate that good variable orderings for BDDs representing finite state machines can be efficiently obtained from the overall connectivity of the interacting finite state machines. In our context, knowledge of the relational structure of the underlying data can also greatly help us in efficiently obtaining a good variable ordering. In particular, product structures and, more generally, multivalued dependencies (MVD) are of particular interest.

Suppose that one knows that the relation R(A, B, C, D, E) is in fact the product of two other relations R1(A, B) and R2(C, D, E), i.e., R = R1 × R2. Then one can immediately conclude that a good ordering is one in which the two sets of variables V1 = {A, B} and V2 = {C, D, E} are consecutive. The reason that an ordering ⟨A, B, C, D, E⟩ is much better than ⟨C, A, D, B, E⟩ for the relation R = R1 × R2 is that the former choice allows the BDD to more often conclude that a tuple does not belong to the relation R by simply examining the values of attributes A and B. In the latter case, by knowing only the values of C and A, it is far less likely that the BDD can deduce whether the tuple is in R or not. Thus, the orderings o1 = ⟨A, B, C, D, E⟩ and o1′ = ⟨C, D, E, A, B⟩ are much better choices than o2 = ⟨C, A, D, B, E⟩, since in o1 the variables from V1 (resp. V2) are adjacent to each other, whereas in o2 the variables from V1 and V2 are interleaved. Similarly, if the relation R(id, A, B, C, D, E) satisfies the MVD id −≫ {A, B}, then the variables {A, B}, and similarly {C, D, E}, should be placed together in the ordering.

Theorem 1 (Optimal ordering of single product). Let the relation R(a1, a2, …, am, b1, b2, …, bn) be the Cartesian product of R1(a1, …, am) and R2(b1, …, bn), where both R1 and R2 are random relations. Then the optimal variable ordering for the BDD representation BDD(R) of R is such that the attributes {ai}_{i≤m} (resp. {bj}_{j≤n}) are adjacent to each other.
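To make the effect of attribute adjacency concrete, the following sketch builds a naive ROBDD (a unique table with hash-consing, nothing more) for the characteristic function of a small product relation under two orderings. The class and all names are our own illustration, not the paper's implementation; a real system would use a package such as BuDDy:

```python
class BDD:
    """Minimal ROBDD sketch: a unique table of (var, low, high) nodes; 0/1 are terminals."""
    def __init__(self):
        self.table = {}   # (var, low, high) -> node id
        self.count = 0    # number of internal nodes created

    def mk(self, var, low, high):
        if low == high:               # reduction rule: skip a redundant decision node
            return low
        key = (var, low, high)
        if key not in self.table:     # hash-consing merges isomorphic subgraphs
            self.count += 1
            self.table[key] = self.count + 1   # ids 0 and 1 are reserved for terminals
        return self.table[key]

    def from_tuples(self, tuples, nbits):
        """Build the BDD of the characteristic function of a set of bit-tuples."""
        def build(var, subset):
            if var == nbits:
                return 1 if () in subset else 0
            low = build(var + 1, frozenset(t[1:] for t in subset if t[0] == 0))
            high = build(var + 1, frozenset(t[1:] for t in subset if t[0] == 1))
            return self.mk(var, low, high)
        return build(0, frozenset(map(tuple, tuples)))

# R = R1(A, B) x R2(C, D) with R1 = {(0,1),(1,0)} and R2 = {(0,0),(1,1)}.
R = [(a, b, c, d) for (a, b) in [(0, 1), (1, 0)] for (c, d) in [(0, 0), (1, 1)]]

good, bad = BDD(), BDD()
good.from_tuples(R, 4)                                    # ordering <A, B, C, D>
bad.from_tuples([(a, c, b, d) for (a, b, c, d) in R], 4)  # interleaved <A, C, B, D>
print(good.count, bad.count)  # keeping R1's and R2's attributes adjacent needs fewer nodes
```

Even on this four-tuple toy relation the adjacent ordering yields a strictly smaller diagram (6 vs. 9 internal nodes), and the gap widens with larger domains.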
Theorem 1 tells us exactly how the variables of relations with product structures should be ordered so that their BDD representations are compact, but it is quite limited in two ways. First, it requires knowledge of any existing product structure, which cannot easily be obtained from offline analysis, before the BDD can be constructed. Second, it only deals with relations which are single products, whereas in reality most relations are unions of products of smaller relations. For instance, it is unclear how the variables should be ordered for the relation R(A, B, C, D) =

R1(A, B) × R2(B, C) ∪ R3(A, C) × R4(B) × R5(D)

In fact, in general, choosing the optimal variable ordering is NP-hard [5]. We present two variable ordering approaches in Section 3 which do not explicitly assume any knowledge of the structure of the relation.

2.3. Problem Formulation

We propose to build BDD representations of relational views using a fixed amount of allocated memory. As stated in Section 2.2, the variable ordering strongly influences the resulting BDD size; thus choosing a good variable ordering is an important part of the index building process. In order to effectively make use of BDD structures, we need efficient heuristics for selecting good variable orderings so that the BDD indices are as compact as possible. The second problem we study is how to effectively use the BDD indices to verify user defined constraints. Since BDD operations have very different characteristics from relational query processing, optimization of constraint checking is naturally also quite different.

3. Constructing A Logical Index

In this section, we describe the BDD building process from base relations or views of base relations. Since the size of a BDD can grow arbitrarily large, it is particularly important to choose a good variable ordering, whenever one exists, to minimize the memory usage. In practice, we assume that only a fixed amount of memory is allocated to BDD indices.

3.1. Variable ordering based on Information Gain

Since BDDs are decision diagrams, it is natural to consider variable ordering techniques from the decision tree literature. In the construction of decision trees [12], variables are ordered such that at each step the information gain is maximized. Quinlan's ID3 procedure [12] is an example of such a variable ordering procedure. However, there is a fundamental difference between variable ordering for decision tree construction and for BDD construction; in particular, during BDD construction we seek to impose a global ordering of the attributes, whereas during decision tree construction the variable ordering can differ across branches of the tree. As a result, our first approach adapts the information gain measure from decision tree construction to the BDD domain.

Let R(v1, v2, …, vn) be a relation with attributes ~V = ⟨vi⟩_{i∈I}, where I = {1, 2, …, n}. Let the active domain of attribute v ∈ V be dom(v). Given some subset of attributes ~v ⊆ ~V and a tuple of values ~a, denote by R|_{~v=~a} the selection on R with the predicate ~v(i) = ~a(i). We also define the active domain of the sequence of attributes ~v as dom(~v) = ∏_{v∈~v} dom(v).

Definition 1 (Entropy and Information Gain). Given a relation R and some of its attributes ~v, define a probability distribution over the multivariate variable ~x as

p(~v = ~x) = ||R|_{~v=~x}|| / ||R||

Given an attribute v′ ∉ ~v, we can define the conditional probability distribution over the variables ~x and x′ as

p(v′ = x′ | ~v = ~x) = p(⟨~v, v′⟩ = ⟨~x, x′⟩) / p(~v = ~x)

The entropy of the attributes ~v is given by

H(~v) = −E(log p(~v = ~x)) = −Σ_{~x ∈ dom(~v)} p(~v = ~x) · log p(~v = ~x)

The conditional entropy of attribute v′ with respect to ~v is given by

H(v′ | ~v) = −E(log p(v′ = x′ | ~v = ~x)) = −Σ_{~x} Σ_{x′} p(⟨~v, v′⟩ = ⟨~x, x′⟩) · log p(v′ = x′ | ~v = ~x)

The information gain between ~v and v′ is defined as

I(~v; v′) = H(~v) − H(v′ | ~v)

The algorithm orders the attributes V into a sequence ~v∗ such that ~v∗(0) has minimal entropy and, for each i > 0, I(~v∗(0 … i−1); ~v∗(i)) is maximized. We refer to this ordering algorithm as MaxInf-Gain; it is shown in Figure 1.

    ~v∗ = MaxInf-Gain(R)
    1:  let V = attributes of R
    2:  let ~v∗(0) = argmin_{v ∈ V} H(v)
    3:  for i = 1 to ||V|| − 1
    4:      let ~u = ~v∗(0 … i−1)
    5:      ~v∗(i) = argmax_{v ∈ V − ~u} I(~u; v)
    6:  end for

Figure 1. The algorithm MaxInf-Gain

3.2. Variable ordering based on Probability

We observe that the node size of a BDD is reduced if the variables are ordered such that a decision of whether a tuple belongs to the relation or not can be made as soon as possible. Thus, our approach is to order the variables such that the membership of a tuple can be resolved as early as possible. Consider a relation R with attributes V = {v1, v2, …, vn} as before. Let ~v∗ = ⟨vi1, vi2, …, vin⟩ be a total ordering of the attributes V, and let ~v = ⟨vi1, vi2, …, vik⟩, where k ≤ n, be a prefix subsequence of ~v∗. Consider the following experiment: a random tuple ~a is generated in dom(~v∗), but we only know that the values of ~a corresponding to the attributes ~v are ~x. What is the probability that such a tuple belongs to the relation R? Denote this probability by φ(~v = ~x). It is easily computed:

φ(~v = ~x) = (number of tuples in R such that ~v = ~x) / (number of all random tuples such that ~v = ~x) = ||R|_{~v=~x}|| / ∏_{v ∈ V−~v} ||dom(v)||

where V − ~v is the set of attributes not in ~v. If V − ~v = ∅, then by definition ∏_{v∈V−~v} ||dom(v)|| = 1. In case ~v = V, then φ(~v = ~x) ∈ {0, 1}. This coincides with the fact that if one knows the values of all the attributes, then the tuple is deterministically in the relation or not.

The BDD representation of the relation R is a decision process which distinguishes the tuples from dom(~v∗) that belong to R from those that do not. Our observation is that a more rapid resolution of the membership of a random tuple implies a simpler BDD, i.e., a smaller node count. In terms of the probability measure φ, we prefer cases in which φ(~v = ~x) is close to the extrema of 0 and 1. Thus, we adopt the entropy-like measure

Φ(~v) = Σ_{~x} φ(~v = ~x) · log φ(~v = ~x)

Note that this is not an entropy, since φ is not a proper distribution. Φ(V) is guaranteed to be 0. We order the variables as ⟨v1, v2, …, vn⟩ such that Φ(~vi) converges as rapidly as possible to 0, where ~vi = ⟨v1, v2, …, vi⟩. We refer to the probability based variable ordering algorithm as Prob-Converge. Since MaxInf-Gain and Prob-Converge differ only in the measure used, to obtain Prob-Converge from Figure 1 we simply replace line 2 with

let ~v∗(0) = argmin_{v ∈ V} Φ(⟨v⟩)

and line 5 with

let ~v∗(i) = argmin_{v ∈ V − ~u} Φ(⟨~u, v⟩)

4 Query Processing using BDDs

Given a collection of BDDs built on attributes of relations in a database, checking whether constraints are violated involves manipulating the BDDs pertaining to the relations (and associated attributes) as dictated by the logical expressions describing the constraints.

EXAMPLE: Consider the constraint:

∀x_S ∃z ( STUDENT(x_S, "CS", z) ⟹ ∃x_C ( COURSE(x_C, "Programming") ∧ TAKES(x_S, x_C) ) )

Each of the logical operations ∧, ∃, ∀, ⟹ has been implemented in the BuDDy library [7], so a straightforward evaluation of the constraint is to perform the conjunction

BDD1 = COURSE(x_C, "Programming") ∧ TAKES(x_S, x_C)

followed by BDD2 = ∃x_C BDD1, and so on. The final result is the BDD obtained by ∀x_S (⋯). Note that the final BDD has no variables, i.e., it is either TRUE or FALSE; thus it contains only one node. However, the intermediate BDDs, e.g. BDD1, can be quite large. There are many alternative ways of evaluating the above constraint, since it can be rewritten into many equivalent forms. For instance, we can transform it into prenex normal form:

∀x_S ∃z, x_C ( STUDENT(x_S, "CS", z) ⟹ ( COURSE(x_C, "Programming") ∧ TAKES(x_S, x_C) ) )   (2)

What is particularly attractive about Constraint 2 is that it is of the form ∀x_S φ. To check if Constraint 2 holds, one only needs to check whether BDD(φ) equals TRUE. Therefore, one does not need to invoke the BDD operation ∀x_S; furthermore, checking whether BDD(φ) = TRUE can be done in constant time. There are, in fact, other advantages to Constraint 2, as we shall explain later.

This example demonstrates two important points related to constraint checking using BDDs, namely that (a) during evaluation, intermediate BDDs will be created, and (b) there are numerous choices regarding the order in which BDDs will be manipulated. The first point raises performance concerns. Since BDDs can have exponential size in the worst case, during the manipulation of existing BDDs the sizes of the intermediate BDDs created might be very large. Therefore it would be desirable to have a BDD size estimation procedure that could estimate the size of the BDD constructed from a specific input (possibly a set of BDDs) without actually constructing it. Unfortunately, this problem is NP-hard.
Theorem 2. Given a set of relations {Ri}, a logical constraint φ on {Ri} and some integer k > 0, deciding whether the BDD size corresponding to φ is greater than k is NP-hard.

In light of the hardness of the BDD size estimation problem, we adopt the following query processing strategy. We manipulate BDDs and monitor the size of the BDD as it is constructed; once the size exceeds a threshold, we abort the construction process and default to SQL statements to check the constraint. We adopt the same strategy when we construct a BDD on a set of attributes of a base relation: if the size of the resulting BDD is larger than a threshold, we do not materialize the BDD, but default to SQL processing for all constraints involving attributes of the table.

The maximum BDD size threshold is a system dependent parameter, and there is a trade-off between setting it too small and setting it too large. If too little memory is permitted, most non-trivial constraints cannot be checked, and we will default to SQL processing. On the other hand, large thresholds are also undesirable: certain constraints (e.g., satisfiability of CNF) are inherently intractable, so if too much memory is allocated, we will take a very long time to fill up the allocated memory before defaulting to SQL processing. The ideal scenario is a memory size large enough to solve most constraints, while small enough that explosions in node size are detected quickly. Obviously, the ideal threshold depends on the processing power of the system; the amount of time required to fill up the memory threshold during BDD construction determines the overhead that BDD processing imposes, since in this case our approach resorts to SQL processing to validate constraints. We evaluate these options in Section 5.

The second point raised by the above example establishes an interesting connection to concepts from relational query processing, where query rewrite techniques transform queries into forms amenable to more efficient execution. We propose, and subsequently evaluate, a set of re-write rules in the BDD domain that improve the performance of constraint checking with BDDs significantly.

4.1 Leading Quantifier Elimination

The first query re-write rule we discuss stems from the nature of our problem. Consider the clause ∃x φ. Checking whether this constraint is violated is equivalent to checking whether it has a TRUE or FALSE answer. As a result, we can drop the ∃ quantifier and check whether BDD(φ) ≠ FALSE. Checking the latter condition is very efficient using BDDs. This technique carries over naturally even when φ has extra variables other than x. Similarly, for the clause ∀x1, x2, …, xn φ(x1, x2, …, xn), we can drop the ∀ quantifier and test whether BDD(φ(x1, x2, …, xn)) = TRUE.

We remark that the key to the quantifier elimination rule is that the BDD representation of a constraint is always reduced to the normal form; hence, if φ(x) = TRUE or φ(x) = FALSE, its BDD must necessarily contain only one node.

4.2 Equi-join Conditions

The second re-write rule we propose applies to constraints involving relational equi-join conditions, namely R1(x1, …, xn) ⋈_{x1=y1,…,xk=yk} R2(y1, …, ym). In the BDD domain we have the following two options to process such expressions:

1. BDD(R1) ∧ BDD(R2) ∧ (⋀_{1≤i≤k} BDD([xi = yi])).

2. Let BDD(R2[x/y]) denote the rename operation, which renames variable x to y, and let BDD_S = BDD(R2[x1/y1][x2/y2]⋯[xk/yk]). Then BDD(R1) ∧ BDD_S gives the desired join result.

The rename operation is beneficial compared to generating the BDD([xi = yi]) terms and applying the ∧ operation. The renaming operation is quite efficient (a linear scan of BDD(R2)), and after renaming one only needs to evaluate the single conjunction BDD(R1) ∧ BDD_S, whereas joining using equality clauses requires evaluating two conjunctions involving three BDDs. Furthermore, BDD([xi = yi]) can be quite large, depending on the active domains associated with the variables xi, yi.

4.3 Quantifier Pull-up or Push-down

One can easily verify the following identities:

∃x1 φ1(x1) ∨ ∃x2 φ2(x2) ⟺ ∃x (φ1(x) ∨ φ2(x))   (3)
∀x1 φ1(x1) ∧ ∀x2 φ2(x2) ⟺ ∀x (φ1(x) ∧ φ2(x))   (4)

Thus, we need to decide whether to distribute, or push down, a quantification (∃, ∀) across a boolean operation (∨, ∧ resp.), or to factor, or pull up, a quantification out of a boolean operation. The BuDDy implementation [7], along with other well-known BDD implementations, offers two optimized procedures, bdd_appex and bdd_appall, which evaluate logical expressions of the form on the right-hand sides of Equation 3 and Equation 4. Due to the internal implementation, these procedures are more efficient than applying the quantification after the boolean connective. Therefore, one may wish to pull up the quantifiers whenever possible so that bdd_appex and bdd_appall can be invoked. However, contrary to the above argument, our optimization rules push down ∀ quantification across the boolean connective ∧:

∀x (φ1(x) ∧ φ2(x))  ⟶  ∀x φ1(x) ∧ ∀x φ2(x)   (5)

The left-hand side needs to evaluate a conjunction φ1 ∧ φ2, which can be quite expensive if the φi are large BDDs.
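The soundness of Rule 5, and the implementation of ∀x as a conjunction over the active domain dom(x), can be sanity-checked with a small sketch; the truth tables phi1 and phi2 and all names here are our own toy illustration, not BuDDy's API:

```python
import random

random.seed(0)
dom = range(4)
# Two hypothetical predicates over (x, y), represented as explicit truth tables.
phi1 = {(x, y): random.random() < 0.8 for x in dom for y in dom}
phi2 = {(x, y): random.random() < 0.8 for x in dom for y in dom}

def forall(dom, pred):
    """Universal quantification as a conjunction over the finite domain."""
    return all(pred(c) for c in dom)

for y in dom:  # check Rule 5 for every value of the free variable y
    lhs = forall(dom, lambda c: phi1[(c, y)] and phi2[(c, y)])              # ∀x (φ1 ∧ φ2)
    rhs = forall(dom, lambda c: phi1[(c, y)]) and forall(dom, lambda c: phi2[(c, y)])
    assert lhs == rhs  # the push-down changes the cost, never the result
```

The two sides always agree; the rewrite is purely a cost decision, since on the right-hand side each quantification is applied to a single, typically smaller, operand.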
[Figure 2. Effect and Ranking of the variable orderings. Three plots of BDD node count against variable orderings ranked from best to worst: (a) Effect of variable ordering, for the 1-PROD, 4-PROD, 8-PROD and RANDOM families; (b) Ranking Variable Ordering by MaxInf-Gain (true ranking vs. ranking based on MaxInf-Gain); (c) Ranking Variable Ordering by Prob-Converge (true ranking vs. ranking based on Prob-Converge).]
However, on the right-hand side, the conjunction is applied to the results of ∀x φi(x), which is implemented as ∧_{c∈dom(x)} φi(x)|x=c. In realistic cases, ∀x φi(x) is a much smaller BDD than φi(x), so the conjunction on the right-hand side of Rule 5 is evaluated much more efficiently than the conjunction on the left-hand side. For existential quantification ∃, we choose to pull up the quantification through the boolean connective ∨:

∃x1 φ1(x1) ∨ ∃x2 φ2(x2) ≡ ∃x (φ1(x) ∨ φ2(x))    (6)

Unlike the case of Rule 5, ∃x φi(x) is not as likely to result in a smaller BDD, so there is no major difference between the cost of evaluating the disjunctions on the left-hand and right-hand sides of Rule 6. By applying the pull-up rule, however, we are able to make use of the procedure bdd_appex.

4.4. Applying the Rewrite Rules

The rewrite rules we have presented each individually benefit evaluation performance; however, they are not confluent, i.e., they can conflict with one another. Specifically, if one pushes down an outermost universal quantifier according to the push-down rule, then the quantifier elimination rule is no longer applicable, and vice versa. We resolve such conflicts by prioritizing the rewrite rules as follows. First, we convert a constraint into prenex normal form, i.e., we apply the pull-up rule for existential as well as universal quantifiers. Then, we apply the quantifier elimination rule to drop as many leading quantifiers of the same kind as possible. For instance, ∀x1 ∀x2 ∃x3 φ becomes ∃x3 φ, and ∃x1 ∃x2 ∀x3 ψ becomes ∀x3 ψ. Before we test the validity or satisfiability of the transformed constraint, we push down the universal quantifiers into conjunctions as much as possible. Finally, we build the BDD of the resulting logical formula and test its validity or satisfiability as needed.

5 Experimental Results

In this section we first evaluate the quality of the variable ordering heuristics and then demonstrate the efficiency of using BDDs for checking relational constraint violations, using synthetic as well as real data. Below we describe the data used in the experiments.
Real Data: We used a database of 406,769 customers from the US and Canada with schema (areacode, number, city, state, zipcode); the active domain sizes of the attributes are (281, 889, 10894, 50, 17557), respectively.
Synthetic Data: We also use synthetic data in our experiments so that we can vary the statistical properties of the data in a controlled way. We use a synthetic schema with a varying number of attributes, each with an active domain of size at most 100. We generate several relations over this schema by varying the statistical properties of the data; the columns are populated uniformly at random. The specific methodology for varying the statistical properties is detailed in the following sections. The sizes of the relations range from 1,000 to 1 million tuples.
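The two rewrite rules rest on standard quantifier identities. As a sanity check, the following self-contained Python sketch (our own illustration, not part of the paper's implementation) verifies both identities exhaustively over every pair of unary predicates on a small finite domain:

```python
from itertools import product

DOM = [0, 1, 2]

def forall(p):
    return all(p(c) for c in DOM)

def exists(p):
    return any(p(c) for c in DOM)

# Enumerate every unary predicate over DOM as a truth table (2^3 = 8 of them).
preds = [dict(zip(DOM, bits)) for bits in product([False, True], repeat=len(DOM))]

for t1, t2 in product(preds, preds):
    phi1, phi2 = t1.__getitem__, t2.__getitem__
    # Rule 5 (universal push-down): (forall x phi1) AND (forall x phi2) == forall x (phi1 AND phi2)
    assert (forall(phi1) and forall(phi2)) == forall(lambda x: phi1(x) and phi2(x))
    # Rule 6 (existential pull-up): (exists x phi1) OR (exists x phi2) == exists x (phi1 OR phi2)
    assert (exists(phi1) or exists(phi2)) == exists(lambda x: phi1(x) or phi2(x))
```

The identities are what make the rewrites sound; the performance argument in the text is about *where* the BDD conjunction/disjunction is performed, not about changing the result.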
5.1. Evaluating Variable Ordering

We evaluate the quality of the variable ordering approaches with respect to different types of relations. We generated four families of relations with varying structure; all relations have 5 attributes and contain 400,000 tuples. The first family, 1-PROD, contains the most structured relations: every relation is the product of several smaller relations, i.e., R = R1 × R2 × ···. The second family, 4-PROD, contains relations obtained as the union of four 1-PROD relations, i.e., R = ∪_{1≤i≤4} (R1^i × R2^i × ···). The attributes of the smaller relations Rk^j are non-overlapping and chosen randomly. The third family, 8-PROD, consists of relations that are unions of eight 1-PROD relations. Finally, the last family, RAND, contains the least structured relations possible, namely random relations.

EFFECT OF VARIABLE ORDERING: We first demonstrate the varying importance of variable ordering for different types of relations. Figure 2(a) shows the average BDD size corresponding to the different variable orderings (from best to worst). The ratios of the most to the least compact sizes for each of the families are as follows.

Dataset: 1-PROD  4-PROD  8-PROD  RAND
Ratio:   71.29   6.29    2.26    1.02

Observe that the effect of variable ordering is significant for the highly structured relations, and negligible for random relations, as expected.

PERFORMANCE: We investigate the ranking of the variable orderings produced by MaxInf-Gain and Prob-Converge, and the size of the resulting BDD for each variable ordering. Figures 2(b) and 2(c) show the actual BDD sizes of the variable orderings ranked by the two statistical measures for the 1-PROD family. Observe that the ranking of the orderings by Prob-Converge correlates with the true ranking much more strongly than that by MaxInf-Gain. In fact, the top 10 variable orderings ranked by Prob-Converge coincide exactly with the true ranking, whereas only the top two ranked by MaxInf-Gain coincide with the true ranking. Of course, one is ultimately concerned only with the top-1 variable ordering picked by each approach. For each type of structure, we generated 20 relations and, for each relation R, compared the optimal BDD node count ||BDD(R)_optimal|| (obtained via an exhaustive search over all possible orderings) with the size ||BDD(R)_MaxInf-Gain|| (resp. ||BDD(R)_Prob-Converge||) of the BDD built using the variable ordering obtained by MaxInf-Gain (resp. Prob-Converge). We compute the ratios α_R = ||BDD(R)_MaxInf-Gain|| / ||BDD(R)_optimal|| and β_R = ||BDD(R)_Prob-Converge|| / ||BDD(R)_optimal||. Note that α_R, β_R ≥ 1, with equality in the best case. We plot the histograms of α_R and β_R for the structure types 1-PROD, 4-PROD, 8-PROD and RAND in Figures 3(a) and 3(b), respectively. We threshold at 2.5, i.e., we use one bin for all occurrences in which an approach produces a BDD whose size is over 2.5 times the optimal size. One can observe that β_R < 1.5, i.e., Prob-Converge produces BDD representations that are at worst about 1.5 times the size of the optimal representation. MaxInf-Gain, however, does not perform nearly as well; for 1-PROD and 4-PROD in particular, α_R > 2.5 for several runs. A more explicit comparison of the accuracies of the two approaches is shown in Figure 3(c), in which we plot accuracy on the x-axis and the percentage of runs achieving that accuracy on the y-axis. Clearly, Prob-Converge outperforms MaxInf-Gain in all cases in which there is some product structure. Only for the random relations does MaxInf-Gain outperform Prob-Converge, and there the effect of variable ordering is completely negligible, as demonstrated in Figure 2(a).

Figure 3. Comparing accuracy of the two approaches: (a) MaxInf-Gain, (b) Prob-Converge, (c) Comparison.
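The sensitivity of BDD size to variable ordering for product-structured functions can be reproduced with a toy reduced-BDD builder. The sketch below is our own illustration (not the paper's code): it counts the internal nodes of the reduced BDD of the classic function f = (x1∧y1) ∨ (x2∧y2) ∨ (x3∧y3) under a good interleaved ordering and a bad separated ordering.

```python
def bdd_node_count(f, order):
    """Count internal nodes of the reduced BDD of boolean function f
    (a function of a dict var -> bool) under the given variable order."""
    nodes = set()

    def build(i, env):
        if i == len(order):
            return f(env)                       # terminal: True / False
        var = order[i]
        lo = build(i + 1, {**env, var: False})
        hi = build(i + 1, {**env, var: True})
        if lo == hi:                            # reduction: skip redundant tests
            return lo
        node = (var, lo, hi)
        nodes.add(node)                         # structural sharing of equal subgraphs
        return node

    build(0, {})
    return len(nodes)

f = lambda e: (e['x1'] and e['y1']) or (e['x2'] and e['y2']) or (e['x3'] and e['y3'])

good = bdd_node_count(f, ['x1', 'y1', 'x2', 'y2', 'x3', 'y3'])  # paired ordering
bad  = bdd_node_count(f, ['x1', 'x2', 'x3', 'y1', 'y2', 'y3'])  # separated ordering
assert good == 6 and bad > good   # linear-size vs. much larger BDD
```

Exhaustive enumeration of all assignments makes this exponential in the number of variables, so it is only a pedagogical device; it does, however, mirror the exhaustive-search baseline used to obtain ||BDD(R)_optimal|| on small instances.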
SUMMARY: Prob-Converge is the more robust choice for selecting, in an off-line fashion, the variable ordering for the BDD representation of a given relation. For a highly structured relation, Prob-Converge produces a near-optimal variable ordering; if the relation is nearly random, then no variable ordering is significantly better than any other, so using Prob-Converge certainly does not hurt.
5.2 Checking Logical Constraints with BDD

In this section we demonstrate the efficiency of using BDDs for checking relational constraint violations, using synthetic as well as real data.

ENCODING RELATIONS INTO BDDS: We demonstrate the space efficiency and incremental maintenance cost of BDDs used as a logical index, using the real data. We build logical indices on (areacode, city, state) and (city, state, zipcode). The first index requires 29 (⌈log(281)⌉ + ⌈log(10894)⌉ + ⌈log(50)⌉) boolean variables (ncs in the figures) and the second one (csz in the figures) requires 35 boolean variables. Figures 4(a) and 4(b) show the BDD construction time and the average time to update (insert and delete) the BDD when the base relation is updated, as the size of the relation increases. Evidently, building a BDD index and maintaining it is very efficient.

Figure 4. BDD Creation Time, Maintenance Time and Memory Utilization: (a) BDD Construction Time, (b) BDD Update Time, (c) BDD Size (x-axis: relation size in thousands of tuples).
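The boolean-variable counts quoted above follow directly from the active domain sizes of the customer schema. A quick plain-Python sketch (independent of any BDD library; the per-attribute enumeration of domain values is our own illustrative convention):

```python
from math import ceil, log2

# Active domain sizes of the customer schema (areacode, number, city, state, zipcode).
dom = {'areacode': 281, 'number': 889, 'city': 10894, 'state': 50, 'zipcode': 17557}

def bits(attr):
    """Boolean variables needed to encode one attribute: ceil(log2 |dom(attr)|)."""
    return ceil(log2(dom[attr]))

ncs = bits('areacode') + bits('city') + bits('state')   # 9 + 14 + 6
csz = bits('city') + bits('state') + bits('zipcode')    # 14 + 6 + 15
assert (ncs, csz) == (29, 35)

# A tuple is then encoded as the concatenation of its attributes' bit strings,
# where v is the value's index in some fixed enumeration of the active domain:
def encode(attr, v):
    return format(v, '0{}b'.format(bits(attr)))

assert len(encode('city', 10893)) == 14
```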
Figure 4(c) presents the memory requirements of the index in terms of the number of nodes in the BDD (in our implementation, the space overhead of a BDD node is 20 bytes). Even for a large number of tuples, both indexes are small; hence a BDD used as a logical index is memory efficient in practice.

BDD QUERY PROCESSING: We now demonstrate the efficiency of using BDDs for constraint checking. Using the telephone customer data, we create different relations of varying sizes. We use Prob-Converge to obtain the best orderings for these relations, and create a BDD index on each relation using the predicted near-optimal variable ordering. Figure 5(a) compares the use of the BDD index versus SQL for checking constraints of the form if city='X' then areacode ∈ Set (e.g., if city='Toronto' then areacode ∈ {416, 905}). In this experiment, which is representative of the performance improvements, we create a relation Constraints of 10,000 such constraints (similar behavior is observed with a varying number of constraints) with schema (city, areacode). In the SQL approach, we join the base relation with Constraints to determine whether any of the constraints is violated. In the BDD approach, we encode the Constraints relation into a BDD on the fly and then apply a logical ∧ with the BDD of the base relation. It can be observed that the BDD approach outperforms the SQL approach by significant margins. Another constraint of a similar nature has the form if city='Toronto' then state='Ontario'; Figure 5(a) also compares the BDD and SQL approaches for such constraints in a similar fashion. Figure 5(b) compares the BDD and SQL approaches for checking the logical constraint areacode → state. Testing this constraint using BDDs involves projecting the suitable attributes to construct new BDDs and manipulating the resulting BDDs; using SQL involves a group-by query. These experiments show that using BDDs for logical constraint checking outperforms the SQL counterpart by a factor of 6 to 8.

Figure 5. Comparing BDD and SQL approach: (a) Join Query, (b) Implication Query.

QUERY REWRITING: We now discuss the query performance observed using the query rewrite rules. The experiments that follow were conducted on synthetic data; similar behavior was observed on the real data as well. We first consider the rewrite rule for a join query over relations P and Q. The naive join strategy refers to the use of an equality clause on the joining attributes, and the optimized strategy refers to the use of variable renaming followed by a logical ∧. Figure 6(a) compares the time taken by the two strategies for a join involving one and two joining attributes, over various sizes of BDD(R1), fixing ||BDD(R2)|| = 50,000 nodes. The optimized strategy is 2-3 times faster than the naive strategy; similar behavior was observed when we varied ||BDD(R2)||. We next compare the quantifier pull-up and push-down approaches by varying the BDD sizes of φ1 and φ2. Figures 6(b) and 6(c) compare the quantifier pull-up and push-down approaches for various BDD sizes of φ1 (P in the figures) and a fixed ||BDD(φ2)||. The performance trends are consistent with our expectations, suggesting that one should pull up the existential quantifier and push down the universal quantifier; similar behavior is observed when varying ||BDD(φ2)||.

VARIABLE ORDERING GAIN: To assess the utility of our variable ordering approaches to query processing using BDDs, we conducted the following experiment. For a number of queries testing for different types of constraint violations (detailed description omitted due to space limitations),
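The two constraint classes checked above have simple relational semantics. The plain-Python sketch below (our illustration of those semantics, not the BDD implementation; the sample tuples are hypothetical) returns the violating tuples for an implication constraint and tests the functional dependency areacode → state:

```python
# Base relation as (areacode, city, state) tuples; values are illustrative.
R = [(416, 'Toronto', 'Ontario'),
     (905, 'Toronto', 'Ontario'),
     (973, 'Toronto', 'Ontario'),      # violates the implication constraint below
     (973, 'Newark', 'New Jersey')]

# Implication constraint: if city = 'Toronto' then areacode in {416, 905}.
allowed = {'Toronto': {416, 905}}
violations = [t for t in R
              if t[1] in allowed and t[0] not in allowed[t[1]]]
assert violations == [(973, 'Toronto', 'Ontario')]

# Functional dependency areacode -> state: project onto (areacode, state) and
# check that each areacode maps to a single state (the SQL group-by analogue).
seen = {}
fd_holds = True
for areacode, _, state in R:
    if seen.setdefault(areacode, state) != state:
        fd_holds = False
assert not fd_holds   # 973 maps to both 'Ontario' and 'New Jersey'
```

The paper's point is that the same checks can be evaluated on the compressed BDD encoding (conjunction for the implication constraints, projection for the FD) instead of scanning or joining the base tuples.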
Figure 6. Comparison of Query Rewriting Techniques: (a) Join Rewrite, (b) Rewrite: Existential Quantifier, (c) Rewrite: Universal Quantifier (x-axis: size of BDD(P) in thousands of nodes).
we compare the time required by the SQL approach with the times required by the BDD approach when the BDDs are constructed using (a) a random variable ordering (BDD: random) and (b) the optimized ordering derived using Prob-Converge (BDD: optimized). The difference in size between the BDDs constructed using a random ordering and using Prob-Converge is up to a factor of five. Table 1 shows the time taken by the SQL approach, BDD: random and BDD: optimized for the various queries on synthetic data.

Table 1. Variable Ordering Gain (time in milliseconds)
Approach        Q1    Q2    Q3    Q4    Q5
SQL             1778  1957  3960  4234  3151
BDD: random     1113  1215  2347  1718  1353
BDD: optimized  452   240   1041  720   528

Note that using a random variable ordering yields a gain of up to a factor of 2, while using the predicted optimal variable ordering improves the overall gain over SQL to a factor of 4 to 6. Similar behavior was observed for the real data set as well.

EVALUATING BDD OVERHEAD: We now discuss our thresholding strategy for BDD sizes. The following table presents the time to fill a buffer of a given size (measured in BDD nodes).

Space Threshold: 10^3  10^5  10^6  10^7
Time (sec):      2.0   2.2   3.5   17

In our environment we choose a threshold of 10^6 BDD nodes. As is evident from the table, this threshold imposes a constant overhead of 3.5 seconds when the threshold is exceeded and we resort to SQL processing. In our experiments, this threshold was never exceeded on our real data set. For synthetic data, when the threshold is exceeded, the corresponding SQL queries take 100-250 seconds; thus in such cases the BDD overhead is only 1-3% of the overall processing time.

SUMMARY: A BDD used as a logical index is memory efficient and incrementally maintainable. Using the BDD index in conjunction with the query rewriting and variable ordering techniques introduced here significantly outperforms SQL approaches for constraint checking, and our execution strategy incurs only a small constant overhead when it resorts to SQL execution.

6 Conclusion

We have presented a novel approach to checking relational constraint violations via logical indexing with BDDs. We presented BDD variable ordering strategies for constructing space-efficient BDDs and proposed rewrite strategies for efficient constraint evaluation. Our results indicate that large performance benefits can be obtained with our approach.