Fundamenta Informaticae 99 (2010) 1–30 DOI 10.3233/FI-2010-253 IOS Press
Computing Supports of Conjunctive Queries on Relational Tables with Functional Dependencies Tao-Yuan Jen, Dominique Laurent∗ Université Cergy-Pontoise - ETIS - CNRS - F-95000 Cergy-Pontoise
[email protected];
[email protected]
Nicolas Spyratos Université Paris-Sud - LRI - CNRS - F-91405 Orsay
[email protected]
Abstract. The problem of mining all frequent queries on a relational table is a problem known to be intractable even for conjunctive queries. In this article, we restrict our attention to conjunctive projection-selection queries and we assume that the table to be mined satisfies a set of functional dependencies. Under these assumptions, we define and characterize two pre-orderings with respect to which the support measure is shown to be anti-monotonic. Each of these pre-orderings induces an equivalence relation for which all queries of the same equivalence class have the same support. The goal of this article is not to provide algorithms for the computation of frequent queries, but rather to provide basic properties of pre-orderings and their associated equivalence relations showing that functional dependencies can be used for an optimized computation of supports of conjunctive queries. In particular, we show that one of the two pre-orderings characterizes anti-monotonicity of the support, while the other one refines the former, but allows to characterize anti-monotonicity with respect to a given table, only. Basic computational implications of these properties are discussed in the article.
Keywords: Functional dependencies, data mining, level-wise algorithms, frequent queries, equivalent queries. ∗
Address for correspondence: Université Cergy-Pontoise - ETIS - CNRS - F-95000 Cergy-Pontoise
1
2
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
1. Introduction In this article we address the issue of computing the supports of conjunctive queries on a (relational) table ∆ over an attribute set U , where the support of a given query is defined to be the cardinality of its answer in ∆. This issue is closely related to the problem of mining frequent queries, given that a query is said to be frequent if its support is above a given threshold. This important but challenging problem of data mining, is known to be intractable even for conjunctive queries [8]. Indeed, given a relational table ∆, the size of the search space can be shown to be exponential not only in the number of attributes, but also in the number of tuples (since there are as many selection conditions as there are sub-tuples of tuples in ∆). This problem has attracted significant attention in the last few years [5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17], because of the importance of its potential applications. As argued in [11], mining all frequent conjunctive queries from a given database is an important issue. Indeed, using the example of [11], consider the well known Internet Movie Database (http://imdb.com) containing almost all possible information about movies, actors and everything related to that, and consider the following queries: first, we ask for all actors that have starred in a movie of the genre ‘drama’; then, we ask for all actors that have starred in a movie of the genre ‘drama’, but that also starred in a (possibly different) movie of the genre ‘comedy’. Now suppose the answer to the first query consists of 1000 actors, and the answer to the second query consists of 900 actors. Obviously, each of these answers considered in isolation, does not necessarily reveal any significant insight, but when combined, these two answers reveal the potentially interesting pattern that actors starring in ‘drama’ movies typically (with a probability of 90%) also star in a ‘comedy’ movie. In the present work, we consider a fixed attribute set, denoted by U , and a fixed set of functional dependencies F D over U . In this setting, we denote by instF D (U ) the set of all relational tables over U that satisfy the functional dependencies in F D and we restrict our attention to conjunctive projectionselection queries. Given a relational table ∆ in instF D (U ), a conjunctive projection-selection query is of the form πX (σY =y (∆)) where X and Y are subsets of U and y is a tuple over Y . We define the support of such a query to be the number of tuples in the answer, i.e., the cardinality of the answer set. Under these assumptions, we use the set of functional dependencies to define two pre-orderings over queries, and we show that 1. the support measure is anti-monotonic with respect to these two pre-orderings, 2. in the equivalence relations induced by each of these pre-orderings, all queries of the same equivalence class have the same support. The implications of these results are very important from a computational point of view because of the following: 1. The fact that the support measure is anti-monotonic with respect to the two pre-orderings allows to design algorithms inspired from the well known Apriori algorithm ([1]) for the computation of frequent queries. 2. The fact that all queries in the same equivalence class have the same support implies that only one computation per equivalence class is necessary in order to compute the support of all queries in the class.
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
∆
Cid c1 c1 c2 c2 c3 c4
Pid p1 p2 p1 p3 p3 p4
Cname John John Mary Mary Paul Peter
Caddr Paris Paris Paris Paris NY Paris
Ptype milk beer milk beer beer milk
3
Qty 10 10 1 5 10 15
Figure 1. The table ∆ of the running example
Moreover, as one of the major contributions of this article, we show that one of these pre-orderings characterizes anti-monotonicity of the support, independently from the table to be considered, while the other one allows for such a characterization with respect to a given table of instF D (U ). Although this article does not provide algorithms for the computation of frequent projection-selection queries, the work presented in this article can be seen as a generalization of our previous work [13], in which we presented algorithms concerning one of the two pre-orderings of this article, when the table to be mined is the join of all tables in a data warehouse operating from a star schema. We stress that, in this particular case, the complexity of our algorithm in terms of the number of scans of the table is linear in |U | (i.e., linear in the number of attributes). Let us illustrate the basic concepts of our approach through an example (that will serve as a running example throughout the article). Example 1.1. Consider the table ∆ defined over the attribute set U = {Cid, Cname, Caddr, Pid, Ptype, Qty}, as shown in Figure 1, where: • Cid, Cname and Caddr stand for Customer Identifier, Customer Name and Customer Address, • Pid and Ptype stand for Product Identifier and Product Type, • Qty stands for Quantity (i.e., number of products sold). Moreover, assume that the table ∆ satisfies the following set F D of functional dependencies: F D = {Cid → Cname Caddr, Pid → Ptype, Cid Pid → Qty}. One can easily verify that the table ∆ shown in Figure 1 does satisfy the above dependencies. We note that, in a star schema, the table ∆ could be the result of joining the following three tables: Customer(Cid, Cname, Caddr), Product(Pid, Ptype), Sales(Cid, Pid, Qty) where Customer and Product are dimension tables and Sales is the fact table. Assuming that all frequent projection-selection queries are computed using the table ∆, then we can consider the confidence of rules of the form q1 ⇒ q2 (i.e., the ratio of the support of q2 over that of q1 ), where q1 and q2 are frequent queries such that the support of q2 is less than the support of q1 . To illustrate this point, assume that the queries q1 = πCid (σCaddr=Paris (∆)) and q2 = πCid (σCaddr P type=Paris beer (∆)) are frequent queries. Then, the fact that the confidence of q1 ⇒ q2 is 80% means that 80% of the customers from Paris buy beer.
4
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
We give later in the article (see end of Section 3) examples of more general association rules that allow to mine functional dependencies, as well as conditional functional dependencies ([4]). In this article, we focus only on frequent projection-selection queries, and we leave the study of rules for future work. Given a query q, we denote by |q| the cardinality of the answer set of q in ∆. Then, referring to Figure 1, it is easy to verify that we have |πCid (∆)| ≥ |πCaddr (∆)|, and that this inequality holds because the table ∆ satisfies the dependency Cid → Caddr (which is a consequence of F D). Notice that this inequality holds in any table that satisfies the dependencies in F D. Similarly, it is easy to verify that |πCid (∆)| = |πCid Caddr (∆)|, and that this equality holds because the table ∆ satisfies the dependencies Cid → Cid Caddr and Cid Caddr → Cid (which are consequences of F D). Notice again that this equality holds in any table that satisfies the dependencies in F D. We point out that, although not equivalent according to standard query equivalence [18], the queries πCid (∆) and πCid Caddr (∆) have the same support. Thus, as long as we are interested in computing supports, only one computation is necessary for these two queries. One can also easily verify that |σP id P type=p2 beer (∆)| ≤ |σP id=p2 (∆)|, and that this inequality holds in any table because, as p2 is a subtuple of p2 beer, this is a case of query containment [18]. On the other hand, it is easy to verify that |σP id=p2 (∆)| ≤ |σP type=beer (∆)|, and that this inequality holds because the table ∆ satisfies the dependency Pid → Ptype and p2 is associated with beer. Clearly, this inequality holds in any table that satisfies the dependency Pid → Ptype and in which the Pid value p2 is associated with the Ptype value beer. Notice however that this last inequality does not hold in every table that satisfies the dependencies in F D. Indeed, consider the table ∆′ obtained from ∆ by replacing all occurrences of beer by wine. Then ∆′ still satisfies F D but now, p2 is associated with wine instead of beer. As a result, we have |σP id=p2 (∆′ )| > |σP type=beer (∆′ )|, since the latter answer set in ∆′ is empty. As shown in the previous example, mining frequent queries can be used for discovering useful knowledge from a given table. We refer to [11] for other examples showing the pertinence of mining frequent queries and corresponding association rules. We emphasize in this respect that such association rules allow to discover unknown functional dependencies and conditional functional dependencies ([4]). In what follows, based on the observations of Example 1.1, we define two pre-orderings over the set of all projection-selection queries, and show that the support is anti-monotonic with respect to these pre-orderings. Moreover, given an attribute set U and a set of functional dependencies F D over U , the following results are shown: 1. The first pre-ordering we consider, denoted by , characterizes the anti-monotonicity of the support, in the sense that for all projection-selection queries q and q ′ , q q ′ holds if and only if |q ′ | ≤ |q| holds for all tables ∆ satisfying the given set of functional dependencies F D. 2. Given a table ∆ satisfying F D, the second pre-ordering we consider, denoted by ∆ , generalizes the previous one (in the sense that it allows for more comparisons) and characterizes the antimonotonicity of the support with respect to all tables ∆′ in which the queries under comparison have the same ‘statut’. In other words, it is shown that, for all projection-selection queries q and q ′ , q ∆ q ′ holds if and only if |q ′ | ≤ |q| holds for all tables ∆′ satisfying F D and such that q and q ′ have the same ‘status’ in ∆′ as in ∆ (see Section 4). Roughly, the queries q and q ′ are said to have the same ‘status’ in ∆′ as in ∆, when the tuples involved in their selection conditions can be
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
5
found in ∆′ if and only if they also can be found in ∆ (we refer to Section 4 for a formal definition of the notion of ‘status’). On the other hand, each of the two pre-orderings induces an equivalence relation over projection-selection queries, and these equivalence relations play a fundamental role in our approach, since two equivalent queries are shown to have the same support. The importance of this property lies in the fact that only one computation is necessary in order to obtain the support of all queries in the same equivalence class. The article is organized as follows: In Section 2, we briefly review previous work in the area, and in Section 3 we recall basic properties of projection-selection queries. In Section 4, we introduce two pre-orderings for query comparison. In particular, we characterize these two pre-orderings and show that the support measure is anti-monotonic with respect to both of them. In Section 5, we consider the equivalence relations induced by the pre-orderings, we characterize and compare the content of equivalence classes modulo each of the equivalence relations. Section 6 concludes the article and discusses future work.
2. Related Work The paper [8] considers conjunctive queries, as we do in this article, and points out that without restrictions on the database schema, the problem is intractable. Although some hints on possible restrictions are mentioned in [8], no specific case is studied. In [9], the authors consider restrictions on the frequent queries to be mined using the formalism of rule based languages. Although the considered class of queries is larger than in our approach, it should be pointed out that, in [9], (i) equivalent queries can be generated that can not be tested efficiently (a problem that does not exist in our approach), and (ii) functional dependencies are not taken into account, as we do in this article. Comparing now our present contribution to the work in [10, 11] that deals with frequent query computation, we notice that in our approach, we provide two pre-orderings for query comparison that take functional dependencies into account, which is not the case in [10, 11]. As a consequence, we show in this article that these two pre-oderings generalize standard query containment ([18]) used in [10], as well as diagonal containment used in [11]. On the other hand, the work of [10], although dealing with mining tree queries in a graph, is nevertheless closely related to ours. Indeed, in [10], a graph is represented by a binary relation, and frequent tree queries, expressed as SQL queries involving projections, selections and joins, are mined. This work is somehow generalized in [11] to the case of projection-selection-join queries, in which a given relation in the database occurs at most once. Therefore, the approach in [10, 11] can be considered as being more general than ours because (i) particular joins are taken into account, which is not the case in our approach, and because (ii) for a given join, all frequent projection-selection queries are mined, as we do in our approach. We also note that the consideration of joins, as done in [10, 11], implies that selection conditions involve equalities between two attributes, which we do not consider in this article. In our previous work [15], we consider the particular case of a database in which relations are organized according to a star schema. In this setting, frequent selection-projection queries are mined from
6
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
the associated weak instance ([18]). However, in [15], equivalence classes are defined based on projection only, and thus, only projection queries can be mined through one run of the proposed level-wise algorithm. In our subsequent work [13], we extended these results to projection-selection queries over an arbitrary table with functional dependencies, and we proposed algorithms for mining frequent queries in the case where the considered table is the join of all relations organized according to a star schema. Then, this work has been further extended in [14] in the sense that, in a given star schema, all joins of dimension tables with the fact table are explicitly used in query comparison, based on the key and foreign key constraints. In [5], a set of attributes, called the key, provides the set of values according to which the supports are to be counted. Then, using a bias language, the different tables involving the key attributes are mined, based on a level-wise algorithm. Our approach (as well as those of [9, 10, 11]) can be seen as a generalization of the work in [5], in the sense that we mine all frequent queries for all keys. The work of [7] follows roughly the same strategy as in [5], except that joins are first performed in a level-wise manner; and for each join, frequent queries are mined, based also on a level-wise algorithm. All other approaches dealing with mining frequent queries [6, 12, 16, 17] consider a fixed set of “objects” to be counted during the mining phase and only one table for a given mining task. For instance, in [17], objects are characterized by values over given attributes, whereas in [6], objects are characterized by a query, called the reference. On the other hand, except for [17], all these approaches are restricted to conjunctive queries, as is the case in the present article. To end this section, we emphasize that, to the best of our knowledge, this work along with that of [13, 14, 15], is the first attempt to consider explicitly constraints on the data set (e.g. functional dependencies or inclusion dependencies) for optimizing the computation of frequent queries.
3. Frequent Queries 3.1. Preliminaries We assume that the reader is familiar with the relational model for which we follow the notation of [18]. In particular, we consider a fixed attribute set U , each attribute A being associated with a domain of values, denoted by dom(A). For every nonempty subset or schema X of U , the domain of X is defined as usual, that is, dom(X) = ΠA∈X (dom(A)). It is important to note that, in this work, every attribute domain is assumed to contain more than one value. In particular, this hypothesis, which is not really a restriction in practice, is required in the proofs of Theorem 4.1 and Theorem 4.2, where tables containing at least two distinct tuples are built up. (Notice however that domains containing a single value are considered in Example 4.2 and in Example 4.4, for the sake of simplification.) Given two schemas X and Y , the schema X ∪ Y will be denoted as XY and, similarly, if A is an attribute in U , X ∪ {A} will be denoted by XA. As an additional notational convenience, tuples will be denoted by lowercase characters and their schema by the corresponding uppercase characters. Given a tuple x over X, for every subset Y of X, x.Y denotes the restriction of x over Y . In this work, we consider a fixed set of functional dependencies over U , denoted by F D, and we denote by instF D (U ) the set of all relational tables defined over the attribute set U and satisfying all
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
7
dependencies of F D. As in [18], we denote by F D + the set of all functional dependencies that can be inferred from F D, based on Armstrong’s axioms ([2]). We recall that these axioms state the following: for all attribute sets X, Y and Z, 1. Pseudo-reflexivity: if Y ⊆ X then infer X → Y ; 2. Augmentation: if X → Y holds then infer XZ → Y Z; 3. Transitivity: if X → Y and Y → Z hold then infer X → Z. Given a table ∆ in instF D (U ), for every X → Y in F D + and every tuple x over X occurring in ∆, we denote by ∆Y (x) the tuple over Y associated in ∆ with x through X → Y . In other words, ∆Y (x) is the (common) Y -value of any tuple t in ∆ such that t.X = x. It should be noticed that, if Y ⊆ X, then every table ∆ over U satisfies X → Y and, for every x-value occurring in ∆, ∆Y (x) is the restriction of x over Y , that is, ∆Y (x) = x.Y . Referring back to Figure 1, Cid → Caddr is in F D + , and we have, for instance, ∆Caddr (c1 ) = Paris, as all tuples t such that t.Cid = c1 satisfy t.Caddr = Paris. In this article, given an attribute set X, we consider the notion of closure of X as in [18], that is: denotes the closure of X (with respect to F D), namely, the set of all attributes A in U such that the dependency X → A is in F D + . A schema X such that X = X + is said to be closed.
X+
The following lemma, that will be used for the characterization of what we call equivalent queries (see Section 5), relates schemas having the same closure to the functions induced by the functional dependencies of F D. Lemma 3.1. Let ∆ be in instF D (U ), and X1 and X2 two schemas such that X1+ = X2+ . For all tuples x1 and x2 over X1 and X2 , respectively, we have x2 = ∆X2 (x1 ) if and only if x1 = ∆X1 (x2 ). Proof: Due to the symmetry of the statement in the lemma, we only prove the implication x2 = ∆X2 (x1 ) ⇒ x1 = ∆X1 (x2 ). Let t in ∆ such that t.X2 = x2 , then, as X2 → X1 ∈ F D + (because X1+ = X2+ ), t.X1 = ∆X1 (x2 ). Similarly, let t′ in ∆ such that t′ .X1 = x1 , then, as X1 → X2 ∈ F D + (because X1+ = X2+ ), t′ .X2 = ∆X2 (x1 ). Thus, as we assume that x2 = ∆X2 (x1 ), we have t′ .X2 = x2 . Therefore, t′ .X2 = t.X2 , which implies that t.X1 = t′ .X1 (since X2 → X1 ∈ F D + ). Consequently, x1 = ∆X1 (x2 ), which completes the proof. ⊓ ⊔ Moreover, functional dependencies allow for comparisons of the cardinalities of the answers to queries, as stated in the following lemma. Lemma 3.2. For all nonempty schemas X and Y such that X → Y is in F D + and for every table ∆ in instF D (U ), we have: 1. |πX (∆)| ≥ |πY (∆)| and 2. |σX=x (∆)| ≤ |σY =∆Y (x) (∆)|.
8
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
Proof: 1. Since ∆ satisfies X → Y , there exists a total, onto function from πX (∆) to πY (∆). Therefore, |πX (∆)| ≥ |πY (∆)|. 2. For every t in σX=x (∆), t.X = x. Thus, we have t.Y = ∆Y (x), showing that t is in σY =∆Y (x) (∆). Therefore, we have |σX=x (∆)| ≤ |σY =∆Y (x) (∆)|, and the proof is complete. ⊓ ⊔ In the standard relational model, schemas are assumed to be nonempty sets. However, for the purposes of this article, we also consider the empty schema, denoted by ∅. This is so because, as will be seen later on: 1. Queries reduced to a projection only will be considered as projection-selection queries with a selection condition holding over the empty attribute set. 2. Projections over the empty attribute set make sense in our approach, in particular when considering equivalent queries. Contrary to the case of nonempty attribute sets, the set dom(∅) is assumed to contain a single tuple, namely the empty tuple, denoted by ⊤. We consider ⊤ as being a subtuple of any tuple, and, given a table ∆ over U , we have: • σ∅=⊤ (∆) = ∆, meaning that the selection condition ∅ = ⊤ always evaluates to true. • If ∆ = ∅ then π∅ (∆) = ∅, as any projection of an empty table results in an empty table. Otherwise π∅ (∆) = {⊤}, as in this case the projection should not be empty and the domain of the empty schema is reduced to the single tuple ⊤. Moreover, we consider functional dependencies involving ∅ along with the following rules, which can be shown to be consistent with the Armstrong axioms. • For every X (possibly empty), every table ∆ satisfies X → ∅. Thus, for every x in dom(X), ⊤ = ∆∅ (x). • For every set of functional dependencies F D, ∅+ = ∅. We note however that, in this article, we do not consider functional dependencies of the form ∅ → X, which are satisfied by a table ∆ if πX (∆) is reduced to a single tuple. This is so because, in practice, these dependencies are usually not considered. On the other hand, some of the proofs in this article are based on the characterization of functional dependency satisfaction given in Proposition 3.1 below, which is an immediate consequence of Theorem 6.1 of [3]. In order to state this proposition, we introduce the following notation: for all tuples x1 and x2 over X1 and X2 , respectively, match(x1 , x2 ) denotes the set of all attributes A of X1 ∩ X2 such that x1 .A = x2 .A. Proposition 3.1. Given an attribute set U over which the set F D of functional dependencies is assumed, let ∆ be a table over U . Then, ∆ is in instF D (U ) if and only if for all t and t′ in ∆, match(t, t′ ) is closed, i.e., match(t, t′ ) = (match(t, t′ ))+ .
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
9
3.2. Queries The projection-selection queries considered in our approach are standard relational conjunctive projectionselection queries. Definition 3.1. A conjunctive selection condition, or a selection condition for short, is an equality of the form Y = y where Y is a possibly empty relation schema and y a tuple in dom(Y ). Let S = (Y = y) be a selection condition. A tuple t over U is said to satisfy S, denoted by t |= S, if: either Y = ∅ and y = ⊤, or Y 6= ∅ and t.Y = y. We denote by Q(U ) the set of all queries of the form πX (σY =y (δ)) where δ stands for an arbitrary table over U , X and Y are possibly empty subsets of U , and y is a tuple in dom(Y ). For the sake of simplicity, every query πX (σY =y (δ)) in Q(U ) is denoted by πX σy (δ), where the schema Y of y is understood. Given a table ∆ in instF D (U ) and a query q = πX σy (δ) in Q(U ), the answer to q in ∆, denoted by ans∆ (q), is defined as usual, i.e., as being the set of those tuples x over X such that there exists a tuple t in ∆ such that x = t.X and t |= (Y = y). Moreover, Q(∆) denotes the set of all queries πX σy (δ) of Q(U ) such that y ∈ πY (∆). We draw attention on the fact that, in the remainder of the article, according to the notational convention mentioned at the beginning of the present section, for every query of the form πX σy (δ), Y will denote the schema of the tuple y. We note from Definition 3.1 that all queries in Q(U ) of the form πX σ⊤ (δ), where X is a nonempty schema, are simply the relational queries of the form πX (δ). Consequently πU σ⊤ (δ) stands for the simple query δ. Moreover, for every ∆ in instF D (U ) and every q in Q(U ) \ Q(∆), ans∆ (q) = ∅. In particular, if y ∈ πY (∆) then ans∆ (π∅ σy (δ)) = {⊤}, else ans∆ (π∅ σy (δ)) = ∅. Example 3.1. Referring back to Example 1.1, πCid σ⊤ (δ) and πCaddr σParis beer (δ) are queries in Q(∆), and we have: ans∆ (πCid σ⊤ (δ)) = {c1 , c2 , c3 , c4 }, and ans∆ (πCaddr σParis beer (δ)) = {Paris}. Similarly, π∅ σParis beer (δ) is also a query in Q(∆), and we have ans∆ (π∅ σParis beer (δ)) = {⊤}. On the other hand, the queries πCid σNY milk (δ) and π∅ σNY milk (δ) are in Q(U ) but not in Q(∆), because NY milk is not in πCaddr P type (∆). Thus, ans∆ (πCid σNY milk (δ)) = ans∆ (π∅ σNY milk (δ)) = ∅. The notion of frequent query in our approach is defined in much the same way as in [8, 11]. Definition 3.2. Let ∆ be in instF D (U ) and q be in Q(U ). The support of q in ∆, denoted by sup∆ (q), is the cardinality of the answer to q in ∆, i.e., sup∆ (q) = |ans∆ (q)|. Given a support threshold min-sup, a query q is said to be frequent in ∆ if sup∆ (q) ≥ min-sup. We notice that for every ∆ in instF D (U ) and every query q in Q(U ), we have 0 ≤ sup∆ (q) ≤ |∆|. Moreover, for every q in Q(U ), we have that q 6∈ Q(∆) if and only if sup∆ (q) = 0. Indeed, if q = πX σy (δ) is in Q(U ) \ Q(∆), then ∆ contains no tuple t such that t.Y = y, which shows that sup∆ (q) = 0. Conversely, if q ∈ Q(∆), then ∆ contains at least one tuple t such that t.Y = y, and thus, sup∆ (q) ≥ 1. As a consequence, when it comes to compute all frequent queries, those in Q(U ) \ Q(∆) are not relevant since their support is 0, a value meant to be less than the support threshold min-sup.
10
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
On the other hand, if q = πX σy (δ) is in Q(∆) and Y → X is in F D + , then we have sup∆ (q) = 1. This is so because, if sup∆ (q) > 1 then ∆ contains at least two tuples t and t′ such that t.Y = t′ .Y = y and t.X 6= t′ .X, which is a violation of the dependency Y → X assumed to be in F D + . Referring back to Example 1.1, it is easy to see that sup∆ (q1 ) = 3 and sup∆ (q2 ) = 2. Thus, for a support threshold equal to 3, q1 is frequent whereas q2 is not. Before ending this section, we point out that having computed all frequent queries of Q(U ), it is possible to mine functional dependencies other than those in F D + , as well as conditional functional dependencies ([4]). Indeed, given a table ∆, it can be seen that if in ∆, the confidence of the rule πX1 X2 (δ) ⇒ πX1 (δ) is 1 (i.e., if the ratio of sup∆ (πX1 (δ)) over sup(πX1 X2 (δ)) is equal to 1), then ∆ satisfies the functional dependency X1 → X2 . This is so because, if the confidence of the rule πX1 X2 (δ) ⇒ πX1 (δ) is 1, then sup∆ (πX1 X2 (δ)) = sup∆ (πX1 (δ)), meaning that the answers in ∆ of πX1 X2 (δ) and πX1 (δ) have the same cardinality. Therefore, this implies that, in the table πX1 X2 (∆), there exists a function from the set of X1 -values to the set of X2 -values. In other words, this means that ∆ satisfies the functional dependency X1 → X2 . Regarding conditional functional dependencies, we recall from [4] that such a dependency is denoted by (∆ : X1 → X2 , T ), where ∆ is a table, X1 and X2 are subsets of the schema U of ∆, and T is a specific table over U containing partial tuples (i.e., tuples with constants and null values) that specify selection conditions. For every τ in T , let Y1τ = y1τ and Y2τ = y2τ be the two selection conditions induced by τ on X1 and X2 , respectively (i.e., for i = 1, 2, yiτ is the largest total sub-tuple of τ such that Yiτ ⊆ Xi ). Then, ∆ is said to satisfy (∆ : X1 → X2 , T ) if for every τ in T and for all tuples t and t′ in ∆ such that t.X1 = t′ .X1 and t.Y1τ = t′ .Y1τ = y1τ , then t.X2 = t′ .X2 and t.Y2τ = t′ .Y2τ = y2τ . Consequently, it can be seen that ∆ satisfies (∆ : X1 → X2 , T ) if and only if, for every τ in T , (i) the table σy1τ (∆) satisfies X1 → X2 and (ii) σy1τ (∆) ⊆ σy2τ (∆). Therefore, ∆ satisfies (∆ : X1 → X2 , T ) if and only if, for every τ in T , the confidences of the rules (i) πX1 X2 σy1τ (δ) ⇒ πX1 σy1τ (δ) and (ii) σy1τ (δ) ⇒ σy1τ y2τ (δ) are both equal to 1.
4. Query Comparison In this section, we define and characterize two pre-orderings so as to compare queries of Q(U ). The first one is shown to characterize the anti-monotonicity of the support when considering all tables in instF D (U ). Then, we refine this first pre-ordering by considering a fixed table of instF D (U ).
4.1. Query Comparison Characterizing Anti-monotonicity Definition 4.1. Let q = πX σy (δ) and q1 = πX1 σy1 (δ) in Q(U ), q1 is said to be more specific than q, denoted by q q1 , if 1. XY1 → X1 is in F D + , and 2. y is a subtuple of y1 . It is easy to see from Definition 4.1 that for all possibly empty schemas X and X1 and every tuple y, we have:
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
11
• If y is a subtuple of y1 , πX σy (δ) πX σy1 (δ) (because XY1 → X is a trivial dependency). In particular, πX σ⊤ (δ) πX σy1 (δ) (because ⊤ is a subtuple of any tuple y1 ). • If X → X1 is in F D + then πX σy (δ) πX1 σy (δ) (because XY → X1 can be deduced from X → X1 ). In particular, πX σ⊤ (δ) π∅ σ⊤ (δ) (because X → ∅ is assumed to be in any F D + ). • If Y → X is in F D + , π∅ σ⊤ (δ) πX σy (δ) (because in this case ∅Y → X is in F D + ). It is also important to note that, in this case, we also have πX σy (δ) π∅ σy (δ) (because XY → ∅ is assumed to be in any F D + ) and π∅ σy (δ) πX σy (δ) (because Y → X is assumed to be in F D + ). • πU σ⊤ (δ) πX σy (δ) (because ⊤ is a subtuple of any tuple y, and because U Y = U and X ⊆ U entail that U Y → X is in any F D + ). This means that πU σ⊤ (δ) (i.e., δ) is the less specific query in Q(U ). • If u is a tuple over U and y is a subtuple of u, πX σy (δ) πX1 σu (δ) (because y is a subtuple of u, and because XU = U and X1 ⊆ U entail that XU → X1 is in any F D + ). The following proposition shows that the relation is a pre-ordering over the set Q(U ), i.e., that is reflexive and transitive. Proposition 4.1. The relation is a pre-ordering over Q(U ). Proof: First, it is easy to see that is reflexive. So, we only prove the transitivity of . To this end, let q = πX σy (δ), q1 = πX1 σy1 (δ) and q2 = πX2 σy2 (δ) be in Q(U ) such that q q1 and q1 q2 . Then, by Definition 4.1: XY1 → X1 and X1 Y2 → X2 are in F D + , and y is a subtuple of y1 and y1 is a subtuple of y2 . Thus, y is a subtuple of y2 , and, as Y1 ⊆ Y2 , XY2 → XY1 is in F D + . Therefore, XY2 → X1 Y2 is in F D + (because XY1 → X1 ∈ F D + ), and so, XY2 → X2 is in F D + (because X1 Y2 → X2 ∈ F D + ). Hence, q q2 , and the proof is complete. ⊓ ⊔ Notice that is not a partial ordering over Q(U ), as is shown in the following example. Example 4.1. Referring back to Example 1.1, q = πCid σ⊤ (δ) and q1 = πCid Caddr σ⊤ (δ) are distinct queries of Q(U ) and we have: • q q1 (because Cid → Cid Caddr can be deduced from F D), • q1 q (because Cid Caddr → Cid can be trivially deduced from F D). The following example shows an exhaustive graph of query comparisons in a simple case. Example 4.2. Let us consider the simple case where U contains only two attributes A and B and F D is empty. Contrary to our hypothesis on the cardinalities of attribute domains, we assume that dom(A) = {a} and dom(B) = {b}, for the sake of simplification. In this case, Q(U ) contains all queries πX σy (δ) where X ∈ {∅, A, B, AB} and y ∈ {⊤, a, b, ab}.
12
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
The pre-ordering allows to compare these queries according to the graph shown below, where a top-down link between two queries q and q1 should be read as q q1 , and where two queries q and q1 in the same box are such that both q q1 and q1 q hold.
πAB σ⊤ (δ) jj jjjj jjjj
UUUU UUUU UU
πA σ⊤ (δ)
πB σ⊤ (δ) SSSS k k k SSSS k kkk SSS kkkk πAB σb (δ), πA σb (δ) π∅ σ⊤ (δ) πB σa (δ), πAB σa (δ) RRR RRR lll l l RRR ll R lll π∅ σb (δ), πB σb (δ)
πA σa (δ), π∅ σa (δ) UUUU i i i i UUUU UUU iiiiiii
πA σab (δ), πAB σab (δ), πB σab (δ), π∅ σab (δ)
We note that replacing item 1 of Definition 4.1 by X → X1 ∈ F D + would also lead to a partial preordering according to which the support is anti-monotonic. However this pre-ordering would be strictly more specific than , because XY1 → X1 ∈ F D + does not imply X → X1 ∈ F D + . For instance, in the context of our running example, πCid σ⊤ (δ) and πQty σp2 (δ) would not be comparable, whereas πCid σ⊤ (δ) πQty σp2 (δ) holds because Cid Pid → Qty ∈ F D + . Moreover, it can be seen that the pre-ordering generalizes query containment ([18]) and diagonal containment ([11]). Indeed, denoting these pre-orderings by ⊑ and ⊑D , respectively, given two queries q = πX σy (δ) and q1 = πX1 σy1 (δ) in Q(U ), in our context, we have • q1 ⊑ q if and only if X = X1 and y is a subtuple of y1 , which implies that q q1 . • According to [11], q1 ⊑D q holds if and only if q1 ⊑ πX1 (q), that is, if and only if X1 ⊆ X and y is a subtuple of y1 . This again implies that q q1 . We note that even in the case where F D = ∅, is still strictly more general than ⊑D , because in this case, for all q = πX σy (δ) and q1 = πX1 σy1 (δ) in Q(U ), we have q q1 if X1 ⊆ XY1 and y is a subtuple of y1 , which is implied by but not equivalent to the fact that q1 ⊑D q. As an example, referring back to Example 1.1, πCid σ⊤ (δ) πCname σJohn (δ) holds (because Cname ⊆ Cid Cname), whereas these queries are not comparable according to ⊑D . This implies that, even if functional dependencies are not considered, as done in [11], diagonal containment is not the optimal way of comparing queries, since more comparisons are possible according to our pre-ordering . Moreover, as shown in Theorem 4.1 below, is optimal in this respect, that is, q q1 is equivalent to the fact that sup∆ (q1 ) ≤ sup∆ (q) holds in every ∆ of instF D (U ). Theorem 4.1. For all queries q and q1 in Q(U ), q q1 holds if and only if, for every ∆ in instF D (U ), sup∆ (q1 ) ≤ sup∆ (q).
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
13
Proof: Let q = πX σy (δ) and q1 = πX1 σy1 (δ) be in Q(U ). We first show that, given a table ∆ in instF D (U ), if q q1 , then sup∆ (q1 ) ≤ sup∆ (q). If y1 = ⊤, as we assume that y is a subtuple of y1 , we have y = ⊤. Thus, the result directly follows from Lemma 3.2(1), because in this case, X → X1 ∈ F D + .
If y1 6= ⊤, then, the case where y1 6∈ πY1 (∆) is trivial because we have sup∆ (q1 ) = 0. So, let us assume that y1 ∈ πY1 (∆). In this case, the table σy1 (∆) is not empty and satisfies X → Y1 . As XY1 → X1 is assumed to be in F D + , σy1 (∆) satisfies X → X1 . Thus, by Lemma 3.2(1), we obtain that |πX1 σy1 (∆)| ≤ |πX σy1 (∆)|. On the other hand, as y is a subtuple of y1 , |πX σy1 (∆)| ≤ |πX σy (∆)|, and thus, we obtain |ans∆ (q1 )| ≤ |ans∆ (q)|. Therefore, it follows that sup∆ (q1 ) ≤ sup∆ (q).
Conversely, we proceed by contraposition, namely, assuming that one of the two items in Definition 4.1 is not satisfied, we construct a table ∆ in instF D (U ) in which sup∆ (q1 ) > sup∆ (q). Let us first assume that y is not a subtuple of y1 . In this case, as y 6= ⊤ we have Y 6= ∅. Thus there exists an attribute A in Y such that, either A does not belong to Y1 , or A is in Y1 and y.A 6= y1 .A. Let ∆ be a table over U containing a single tuple t such that t.Y1 = y1 and t.A 6= y.A. Then, clearly, ∆ is in instF D (U ) (because a table containing a single tuple satisfies any set of functional dependencies), and we have sup∆ (q1 ) = 1 (because ans∆ (q1 ) = {t.X1 }) and sup∆ (q) = 0 (because, as t.Y 6= y, ans∆ (q) = ∅). Thus, sup∆ (q1 ) > sup∆ (q).
Let us now assume that y is a subtuple of y1 , but that XY1 → X1 is not in F D + . In this case, we consider a table ∆ over U containing two tuples t1 and t2 defined as follows: − t1 .(XY1 )+ = t2 .(XY1 )+ , − t1 .Y1 = t2 .Y1 = y1 , and − for every attribute A not in (XY1 )+ , t1 .A 6= t2 .A. We note that there exist attributes not in (XY1 )+ because, assuming that (XY1 )+ = U entails XY1 → X1 ∈ F D + . Then, as match(t1 , t2 ) = (XY1 )+ , by Proposition 3.1, ∆ satisfies F D. Moreover, we have sup∆ (q) = 1 because, t1 .X = t2 .X and, since y is a subtuple of y1 and t1 .Y1 = t2 .Y1 = y1 , we have t1 .Y = t2 .Y = y. On the other hand, we have t1 .X1 6= t2 .X1 because X1 6⊆ (XY1 )+ . Indeed, if X1 ⊆ (XY1 )+ , then XY1 → X1 is in F D + , which is a contradiction to our hypothesis. Since t1 .Y1 = t2 .Y1 = y1 , we obtain that sup∆ (q1 ) = 2, and thus that sup∆ (q1 ) > sup∆ (q). Therefore, the proof is complete. ⊓ ⊔
We recall that showing that the support is anti-monotonic with respect to a specificity relation is a basic property that allows to design level-wise algorithms (such as Apriori [1]), in order to compute all frequent queries, given a table in instF D (U ). Now, although Theorem 4.1 provides an important and generic characterization of anti-monotonicity of the support of queries, it fails to capture many basic comparisons when a specific table of instF D (U ) is considered, which is the case in practice. For instance, we recall from Example 1.1 that, in the given table ∆, we have sup∆ (πU σp2 (δ)) ≤ sup∆ (πU σbeer (δ)). However, these two queries are not comparable according to because none of the tuples p2 and beer is a subtuple of the other. We address this important issue next.
14
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
4.2. Query Comparison Based on a Given Table in instF D (U) In order to cope with the problem mentioned above, we assume a fixed table ∆ in instF D (U ) and we define the following pre-ordering (which is an extension of the one given in [13] to all queries in Q(U )). Definition 4.2. Let ∆ be in instF D (U ) and q = πX σy (δ) and q1 = πX1 σy1 (δ) be in Q(U ). q1 is said to be more specific than q with respect to ∆, denoted by q ∆ q1 , if one of the following cases holds: 1. q1 is in Q(U ) but not in Q(∆), 2. q and q1 are in Q(∆) and Y1 → X1 ∈ F D + , 3. q and q1 are in Q(∆) such that XY1 → X1 ∈ F D + , Y1 → Y ∈ F D + , and y = ∆Y (y1 ). It should be noticed from Definition 4.2 that, contratry to , the pre-ordering ∆ is instance dependent, in the sense that testing whether q ∆ q1 holds requires to access the table ∆. Indeed, given q = πX σy (δ) and q1 = πX1 σy1 (δ) in Q(U ), in order to know that q ∆ q1 , one must check that: 1. In case 1, y1 does not belong to πY (∆). 2. In case 2, y and y1 belong to πY (∆) and πY1 (∆), respectively. 3. In case 3, y and y1 belong to πY (∆) and πY1 (∆), respectively, and y = ∆Y (y1 ), which is equivalent to the fact that yy1 belongs to πY Y1 (∆). We point out that these tests are linear in the size of ∆, because a simple scan of ∆ allows to conclude on whether q ∆ q1 holds. Moreover, it can be seen from [13] that, in our algorithms, such tests are even not necessary in order to mine all frequent queries, because only queries in Q(∆) are considered. The following example shows various comparisons according to ∆ . Example 4.3. In the context of Example 1.1, we have: • πCid σParis (δ) ∆ πCid σParis beer (δ), because the two queries are in Q(∆), Cid Caddr Ptype → Cid and Caddr Ptype → Caddr are in F D + , and ∆Caddr (Paris beer) = Paris. Notice that, in this case, we also have πCid σParis (δ) πCid σParis beer (δ). • πCid σ⊤ (δ) ∆ πCid σParis (δ), because the two queries are in Q(∆), Cid Caddr → Cid and Caddr → ∅ are in F D + , and ∆∅ (Paris) = ⊤. In this case again, we also have πCid σ⊤ (δ) πCid σParis (δ). • πCid σbeer (δ) ∆ πQty σp2 (δ), because the two queries are in Q(∆), Cid Pid → Qty and Pid → Ptype are in F D + , and ∆P type (p2 ) = beer. In this case however, πCid σbeer (δ) and πQty σp2 (δ) are not comparable according to . • πCid σbeer (δ) ∆ πP type σc4 p4 (δ), because the two queries are in Q(∆), and Cid Pid → Ptype is in F D + . As above, πCid σbeer (δ) and πP type σc4 p4 (δ) are not comparable according to .
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
15
• πCid σbeer (δ) ∆ πQty σp2 milk (δ), because, the fact that p2 milk is not in πP id P type (∆) entails that πQty σp2 milk (δ) ∈ Q(U ) \ Q(∆). Again in this case, πCid σbeer (δ) and πQty σp2 milk (δ) are not comparable according to . More generally, given a table ∆ in instF D (U ), based on Definition 4.2, it can be checked that, as in the case of : • For every possibly empty schema X and every tuple y, πX σ⊤ (δ) ∆ πX σy (δ). • Given q = πX σy (δ) and q1 = πX1 σy1 (δ) in Q(∆) such that X1 ⊆ X, Y ⊆ Y1 and y = y1 .Y , q ∆ q1 holds for any set F D of functional dependencies. However, contrary to , case 1 of Definition 4.2 shows that the most specific queries with respect to ∆ are all queries in Q(U ) but not in Q(∆). The following proposition shows that ∆ as defined above is a pre-ordering that generalizes , in the sense that all queries comparable according to are also comparable according to ∆ . Proposition 4.2. The relation ∆ is a pre-ordering over Q(U ) that generalizes , that is, for all queries q and q1 in Q(U ), if q q1 then, for every ∆ in instF D (U ), q ∆ q1 . Proof: It is easy to see that ∆ is reflexive, thus we show the transitivity of ∆ . Let q = πX σy (δ), q1 = πX1 σy1 (δ) and q2 = πX2 σy2 (δ) be in Q(U ) such that q ∆ q1 and q1 ∆ q2 . We first note that if q2 is not in Q(∆), then we have q ∆ q2 . So, let us assume that q2 is in Q(∆). Then, as we have q1 ∆ q2 , we must have that q1 is also in Q(∆), and thus that q is in Q(∆) as well. If Y2 → X2 is in F D + , then we have q ∆ q2 . If Y2 → X2 6∈ F D + , then we show that Y1 → X1 cannot be in F D + either. Indeed, if Y1 → X1 ∈ F D + then, as we assume that q1 ∆ q2 , X1 Y2 → X2 and Y2 → Y1 are in F D + . So, Y2 → X1 is in F D + , and thus, Y2 → X2 ∈ F D + , which is a contradiction. Thus, in this case, we also have that Y → X is not in F D + . Therefore, the following dependencies are in F D + : XY1 → X1 , Y1 → Y , X1 Y2 → X2 and Y2 → Y1 . As a consequence, XY2 → XY1 ∈ F D + , and so, using XY1 → X1 , XY2 → X1 Y2 is in F D + . As X1 Y2 → X2 ∈ F D + , we obtain that XY2 → X2 ∈ F D + . On the other hand, as Y2 → Y1 and Y1 → Y are in F D + , so is Y2 → Y . Moreover, as we have y = ∆Y (y1 ) and y1 = ∆Y1 (y2 ), we obtain y = ∆Y (y2 ), which shows that q ∆ q2 . Let us now assume that q and q1 are such that q q1 . If q1 is not in Q(∆), then we have q ∆ q1 . Assuming now that q1 is in Q(∆), since y is a subtuple of y1 , q is in Q(∆) as well. Moreover, in this case, Y ⊆ Y1 , and so, Y1 → Y ∈ F D + and ∆Y (y1 ) = y (as y is a subtuple of y1 ). Since q q1 also implies that XY1 → X1 be in F D + , we have q ∆ q1 , and thus, the proof is complete. ⊓ ⊔ As was the case for the pre-ordering , it is easy to see that ∆ is not an ordering. This is so because the two distinct queries q = πCid σ⊤ (δ) and q1 = πCid Caddr σ⊤ (δ) of Example 4.1 are such that q q1 and q1 q, and thus, by Proposition 4.2, q ∆ q1 and q1 ∆ q. The following example shows how the pre-ordering ∆ behaves in the simple case of Example 4.2.
16
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
Example 4.4. As in Example 4.2, let U = {A, B}, F D = ∅, and, contrary to our hypothesis on the cardinalities of attribute domains, assume that dom(A) = {a} and dom(B) = {b}, for the sake of simplification. In this case, considering the database ∆ = {ab}, the sets Q(U ) and Q(∆) are equal and contain all queries πX σy (δ) where X ∈ {∅, A, B, AB} and y ∈ {⊤, a, b, ab}. Then, the pre-ordering ∆ allows to compare these queries according to the graph shown below, where a top-down link between two queries q and q1 should be read as q ∆ q1 , and where two queries q and q1 in the same box are such that both q ∆ q1 and q1 ∆ q hold. πAB σ⊤ (δ) jj jjjj πA σ⊤ (δ)
TTTT TT πB σ⊤ (δ)
πAB σb (δ), πA σb (δ) πB σa (δ), πAB σa (δ) UUUU i i UUU iiii i π∅ σ⊤ (δ), π∅ σa (δ), π∅ σb (δ), π∅ σab (δ), πA σa (δ), πA σab (δ), πB σb (δ), πB σab (δ), πAB σab (δ)
The graph above clearly shows more comparisons than the graph of Example 4.2, which results in a simpler structure. We shall come back to this consequence of Proposition 4.2 later in the article. As a consequence of Proposition 4.2, it is easy to see ∆ generalizes query containment ([18]) and diagonal containment ([11]), because we have seen that so does . Moreover, all comparison cases mentioned after Definition 4.1 also hold in this case. That is, for all possibly empty schemas X and X1 and every tuple y, we have: • If y is a subtuple of y1 , πX σy (δ) ∆ πX σy1 (δ), and in particular πX σ⊤ (δ) ∆ πX σy1 (δ). Moreover, if y 6∈ πY (∆) (which entails that y1 6∈ πY1 (∆)) we also have πX σy1 (δ) ∆ πX σy (δ). • If X → X1 is in F D + then πX σy (δ) ∆ πX1 σy (δ), and in particular, πX σ⊤ (δ) ∆ π∅ σ⊤ (δ). Moreover, if y 6∈ πY (∆) we also have πX1 σy (δ) ∆ πX σy (δ). • If Y → X is in F D + , π∅ σ⊤ (δ) ∆ πX σy (δ). In this case, we also have πX σy (δ) ∆ π∅ σy (δ) and π∅ σy (δ) ∆ πX σy (δ). • πU σ⊤ (δ) ∆ πX σy (δ), meaning that, as is the case for , πU σ⊤ (δ) is the less specific query in Q(U ) with respect to ∆ . • If u is a tuple over U and y is a subtuple of u, πX σy (δ) ∆ πX1 σu (δ). Moreover, if y 6∈ πY (∆) (which entails that u 6∈ ∆) we also have πX1 σu (δ) ∆ πX σy (δ). Of course, the fact that ∆ allows for more comparisons in ∆ than is relevant only if the support measure can be shown to be anti-monotonic with respect to ∆ . This is precisely what is stated in the following proposition. Proposition 4.3. Let ∆ be a table in instF D (U ). For all queries q and q1 in Q(U ), if q ∆ q1 then sup∆ (q1 ) ≤ sup∆ (q).
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
17
Proof: Let q = πX σy (δ) and q1 = πX1 σy1 (δ) be two queries in Q(U ). We first note that the result is trivial if q1 6∈ Q(∆). So let us assume that q1 ∈ Q(∆), and thus that q ∈ Q(∆). If Y1 → X1 is in F D + , then sup(q1 ) = 1 and sup(q) ≥ 1, thus the result also holds in this case. We now consider the case where q and q1 are in Q(∆) and where Y1 → X1 is not in F D + . If y1 = ⊤, as we assume q ∆ q1 , Y1 → Y is in F D + , and so, Y = ∅. Thus, y = ⊤ and the result directly follows from Lemma 3.2(1). If y1 6= ⊤, the table σy1 (∆) satisfies X → Y1 , and as XY1 → X1 is in F D + , σy1 (∆) satisfies X → X1 . Thus, by Lemma 3.2(1), |πX1 σy1 (∆)| ≤ |πX σy1 (∆)|. As Y1 → Y ∈ F D + and y = ∆Y (y1 ), by Lemma 3.2(2), we have |πX σy1 (∆)| ≤ |πX σy (∆)|. Therefore, we obtain |ans∆ (q1 )| ≤ |ans∆ (q)|, and the proof is complete. ⊓ ⊔ However, contrary to the pre-ordering , the pre-ordering ∆ does not allow for a characterization of the anti-monotonicity of the support as general as in Theorem 4.1. In particular, it is not true that If, for a given ∆ in instF D (U ), we have q ∆ q1 , then, for every ∆′ in instF D (U ), sup∆′ (q1 ) ≤ sup∆′ (q) holds. To see this, when considering the table ∆ of Figure 1, we have πU σbeer (δ) ∆ πU σp2 (δ) (because P id → P type ∈ F D + and ∆P type (p2 ) = beer). On the other hand, in the table ∆′ obtained from ∆ by replacing all occurrences of beer by wine, we have sup∆′ (πU σbeer (δ)) = 0 and sup∆′ (πU σp2 (δ)) = 1. Thus, although ∆′ is in instF D (U ), we have sup∆′ (πU σbeer (δ)) < sup∆′ (πU σp2 (δ)). Nevertheless, ∆ can be shown to characterize the anti-monotonicity of the support, provided that we consider only tables in instF D (U ) in which q and q1 have the same ‘status’ as in ∆. In other words, we show that: q ∆ q1 holds for a given ∆ in instF D (U ) if and only if sup∆′ (q1 ) ≤ sup∆′ (q) holds for every table ∆′ in instF D (U ) in which q and q1 have the same ‘status’ as in ∆. The tables in which q and q1 have the same ‘status’ as in ∆ are defined as follows. Definition 4.3. Let ∆ be in instF D (U ) and let q = πX σy (δ) and q1 = πX1 σy1 (δ) be in Q(U ). The set of all tables in which q and q1 have the same ‘status’ as in ∆, denoted by instF D (∆, q, q1 ), is the set of all tables ∆′ in instF D (U ) such that: • q ∈ Q(∆′ ) if and only if q ∈ Q(∆), • q1 ∈ Q(∆′ ) if and only if q1 ∈ Q(∆), and • if q and q1 are in Q(∆) and Y1 → Y ∈ F D + , then ∆Y (y1 ) = ∆′Y (y1 ). It is easy to see from Definition 4.3 that, for every ∆ in instF D (U ) and all queries q and q1 in Q(U ), ∆ is in instF D (∆, q, q1 ), and for every ∆′ in instF D (∆, q, q1 ), instF D (∆, q, q1 ) = instF D (∆′ , q, q1 ). Moreover, it should be clear from Definition 4.3 that testing wether ∆′ belongs to instF D (∆, q, q1 ) is linear in the sum of the sizes of ∆ and ∆′ . On the other hand, the following proposition is an immediate consequence of Definition 4.2 and Definition 4.3.
18
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
Proposition 4.4. Let ∆ be in instF D (U ) and q and q1 in Q(U ). Then, for every ∆′ in instF D (∆, q, q1 ), q ∆ q1 holds if and only if q ∆′ q1 holds. Referring back again to Example 1.1, let ∆′ be the table obtained from ∆ by replacing all occurrences of beer by wine. Then, it should be clear that for q = πU σp2 (δ) and q1 = πU σbeer (δ), ∆′ is not in instF D (∆, q, q1 ). However, if we consider q ′ = πU σp1 (δ) and q1′ = πU σmilk (δ), then ∆′ does belong to instF D (∆, q ′ , q1′ ). We now prove that ∆ characterizes anti-monotonicity of the support with respect to all tables in which the queries under comparison have the same ‘status’ as in ∆. To do so, we show the following two lemmas. Lemma 4.1. Let X, Y , X1 and Y1 attribute sets such that • Y1 → X1 6∈ F D + and • XY1 → X1 6∈ F D + or Y1 → Y 6∈ F D + . Let y and y1 be tuples over Y and Y1 , respectively, and Ym = match(y, y1 ). If (Ym+ \ Ym ) ∩ (Y ∩ Y1 ) = ∅, then there exists ∆ in instF D (U ) such that q = πX σy (δ) and q1 = πX1 σy1 (δ) are in Q(∆) and sup∆ (q) < sup∆ (q1 ). Proof: We first note that, under the hypotheses of the lemma, as Y1 → X1 is not in F D + , X1 6= ∅. Given y and y1 over Y and Y1 , respectively, in order to prove the lemma, we successively consider the following two cases: (1) Ym = Y ∩ Y1 and (2) Ym ⊂ Y ∩ Y1 . (1) If Ym = Y ∩ Y1 , we first notice that the condition (Ym+ \ Ym ) ∩ (Y ∩ Y1 ) = ∅ is trivially satisfied. In this case, we construct a two tuple table ∆ = {t, t1 } of instF D (U ) such that (i) t.Y = y, (ii) t.Y1 = t1 .Y1 = y1 , and (iii) sup∆ (q) = 1 and sup∆ (q1 ) = 2. To this end, we study separately the cases whereby Y1 → Y is or not in F D + . (1.a) If Y1 → Y ∈ F D + , the second item in the lemma implies that XY1 → X1 is not in F D + . In other words, we have Y + ⊆ Y1+ and X1+ 6⊆ (XY1 )+ . In this case, (XY Y1 )+ 6= U because, as (XY Y1 )+ = (XY1 )+ , (XY Y1 )+ = U implies that (XY1 )+ = U , and thus that X1+ ⊆ (XY1 )+ , which is not possible. Now, let ∆ = {t, t1 } where t and t1 are two tuples over U such that: − t.(XY Y1 )+ = t1 .(XY Y1 )+ , − t.Y = t1 .Y = y, − t.Y1 = t1 .Y1 = y1 , and − for every A 6∈ (XY Y1 )+ (such attributes exist since (XY Y1 )+ 6= U ), t.A 6= t1 .A. We note that, for t and t1 to be defined, we must have y.(Y ∩ Y1 ) = y1 .(Y ∩ Y1 ), which holds because Ym = Y1 ∩ Y2 . Moreover, as match(t, t1 ) = (XY Y1 )+ , by Proposition 3.1, ∆ is in instF D (U ). It is also important to note that, in this case, we have ∆Y (y1 ) = y, since t.Y Y1 = t1 .Y Y1 = yy1 . On the other hand, ∆ clearly satisfies (i) and (ii) above, that is t.Y = y and t.Y1 = t1 .Y1 = y1 . Regarding (iii), we have sup∆ (q) = 1 (because t.X = t1 .X and t.Y = t1 .Y = y), and sup∆ (q1 ) = 2 (because t.Y1 = t1 .Y1 = y1 and, as X1 6⊆ (XY Y1 )+ , t.X1 6= t1 .X1 ). Therefore, sup∆ (q) < sup∆ (q1 ). (1.b) If Y1 → Y 6∈ F D + , we have Y + 6⊆ Y1+ , and so, Y1+ 6= U . Let ∆ = {t, t1 } where t and t1 are tuples over U such that:
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
19
− t.Y1+ = t1 .Y1+ , − t.Y = y, − t.Y1 = t1 .Y1 = y1 , and − for every A 6∈ Y1+ (such attributes exist since Y1+ 6= U ), t.A 6= t1 .A. We note that, for t and t1 to be defined, we must have y.(Y ∩ Y1 ) = y1 .(Y ∩ Y1 ), which holds because Ym = Y ∩ Y1 . Moreover, as match(t, t1 ) = Y1+ , by Proposition 3.1, ∆ is in instF D (U ). On the other hand, ∆ clearly satisfies (i) and (ii) above, that is t.Y = y and t.Y1 = t1 .Y1 = y1 . Regarding (iii), we have t.Y 6= t1 .Y , because assuming that t.Y = t1 .Y implies that Y ⊆ Y1+ and thus that Y + ⊆ Y1+ , which is not possible. Thus, sup∆ (q) = 1. Moreover, as X1+ 6⊆ Y1+ , we have t.X1 6= t1 .X1 , and since t.Y1 = t1 .Y1 = y1 , we obtain sup∆ (q1 ) = 2. Therefore, sup∆ (q) < sup∆ (q1 ), which completes the proof of case (1). (2) Let us now assume that Ym ⊂ Y ∩ Y1 . We notice that, in this case, given ∆ over U , for q and q1 to be in Q(∆), y and y1 must respectively belong to πY (∆) and πY1 (∆). Thus, y.(Y ∩ Y1 ) and y1 .(Y ∩ Y1 ) are two distinct tuples that must both belong to π(Y ∩Y1 ) (∆). Consequently, if ∆ is assumed to be instF D (U ), these two tuples must at least satisfy the functional dependencies of F D + concerning Y ∩ Y1 , meaning that F D + should not contain functional dependencies X → A such that X ⊆ Ym and A ∈ (Y ∩ Y1 ) \ Ym . Therefore, (Ym+ \ Ym ) ∩ (Y ∩ Y1 ) must be empty, and this is precisely what is assumed in the lemma. On the other hand, as the inclusion Ym ⊂ Y ∩ Y1 is strict, we have Y ∩ Y1 6= ∅. Let y ′ be a tuple over Y such that: − for every A in (Y \ ((Y ∩ Y1 )Ym+ )), y ′ .A 6= y.A, − for every A in Y ∩ Ym+ , y ′ .A = y.A, and − for every A in ((Y ∩ Y1 ) \ Ym ), y ′ .A = y1 .A. We first notice that y ′ is well defined, because Y can be partitioned into (Y \((Y ∩ Y1 )Ym+ )), Y ∩ Ym+ and ((Y ∩ Y1 ) \ Ym+ ) and, as we assume that (Ym+ \ Ym ) ∩ (Y ∩ Y1 ) = ∅, ((Y ∩ Y1 ) \ Ym+ ) = ((Y ∩ Y1 ) \ Ym ). Moreover, we have match(y ′ , y1 ) = Y ∩ Y1 . Thus, denoting by q ′ the query πX σy′ (δ), the proof of case (1) above shows that there exists ∆′ = {t, t1 } in instF D (U ) such that t.Y = y ′ , t1 .Y1 = y1 , sup∆′ (q1 ) = 2 and sup∆′ (q ′ ) = 1. Now, let t2 be a tuple over U such that: − t2 .Y = y, − t2 .Ym+ = t.Ym+ , and − for every A 6∈ Ym+ , t2 .A 6= t.A and t2 .A 6= t1 .A. We first note that Ym+ 6= U . Indeed, as Ym ⊂ Y ∩ Y1 implies that Ym+ ⊆ (Y ∩ Y1 )+ and as (Y ∩ Y1 )+ ⊆ Y + ∩ Y1+ always holds, we have Ym+ ⊆ Y + ∩ Y1+ . Therefore, Ym+ = U implies that Y + = Y1+ = U and thus that, in particular, X1+ ⊆ Y1+ . As we assume that Y1 → X1 is not in F D + , this is not possible. Therefore, t2 6= t and t2 6= t1 . As in the proof of case (1) above, t and t1 have been shown to be distinct, t, t1 and t2 are three pairwise distinct tuples. Moreover, for t2 to be defined, we must have: ′ + + + + + + y .(Ym ∩ Y ) = y.(Ym ∩ Y ), because t.(Ym ∩ Y ) = t2 .(Ym ∩ Y ) = y.(Ym ∩ Y ) and t.(Ym ∩ Y ) = ′ + ′ y .(Ym ∩ Y ). This holds by definition of y . + ′ ′ For every A in Y \ Ym , y .A 6= y.A, because in this case, t.A 6= t2 .A, t.A = y .A and t2 .A = y.A. Indeed, let A ∈ Y \ Ym+ . If A ∈ Y1 , then A ∈ (Y ∩ Y1 ) \ Ym+ . Thus, A ∈ (Y ∩ Y1 ) \ Ym , and so, by definition of y ′ , y ′ .A = y1 .A 6= y.A. If A 6∈ Y1 , then A ∈ (Y \ ((Y ∩ Y1 )Ym+ )), in which case, by definition of y ′ , y ′ .A 6= y.A.
20
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
For every A in (Y1 ∩ Y ) \ Ym+ , y1 .A 6= y.A, because in this case, t1 .A 6= t2 .A, t2 .A = y.A and t1 .A = y1 .A. Indeed, let A ∈ (Y1 ∩ Y ) \ Ym+ . Then A ∈ (Y ∩ Y1 ) \ Ym , and so, as above, y ′ .A = y1 .A 6= y.A. Thus, y1 .((Y1 ∩ Y ) \ Ym+ ) 6= y.((Y1 ∩ Y ) \ Ym+ ).
Now, let ∆ = ∆′ ∪ {t2 } and let us prove that ∆ is in instF D (U ). Indeed, as ∆′ satisfies F D, we have match(t, t1 ) = (match(t, t1 ))+ , and by definition of t2 , we also have match(t2 , t) = Ym+ . Moreover, in ∆′ , the proof of case (1) above shows that t.Y1 = t1 .Y1 , and thus, t.Ym = t1 .Ym , which entails that t.Ym+ = t1 .Ym+ . Therefore, t2 .Ym+ = t.Ym+ = t1 .Ym+ , and so, Ym+ ⊆ match(t2 , t1 ). Since we have match(t2 , t1 ) ⊆ Ym+ by construction of t2 , we obtain match(t2 , t1 ) = Ym+ . Hence, by Proposition 3.1, ∆ is in instF D (U ). Regarding the supports of q and q1 in ∆, we first note that t2 is the only tuple of ∆ such that t2 .Y = y. Indeed, as t.Y = y ′ and y 6= y ′ , we have t.Y 6= y. On the other hand, we have t1 .(Y ∩ Y1 ) = y1 .(Y ∩ Y1 ) = y ′ .(Y ∩ Y1 ) and y.(Y ∩ Y1 ) 6= y1 .(Y ∩ Y1 ) (as Y ∩ Y1 6= ∅ and as we assume that Ym 6= Y ∩ Y1 ). Since assuming that t1 .Y = y entails that t1 .(Y ∩ Y1 ) = y.(Y ∩ Y1 ), we have a contradiction showing that t1 .Y 6= y. Therefore, we have sup∆ (q) = 1. On the other hand, as ∆′ ⊆ ∆, we have sup∆′ (q1 ) ≤ sup∆ (q1 ), and thus sup∆ (q1 ) ≥ 2 (because sup∆′ (q1 ) = 2). Hence, we obtain that sup∆ (q1 ) > sup∆ (q), which completes the proof. ⊓ ⊔ Lemma 4.2. Let ∆ be in instF D (U ), and q = πX σy (δ) and q1 = πX1 σy1 (δ) in Q(∆) such that Y1 → X1 6∈ F D + , Y1 → Y ∈ F D + , and y 6= ∆Y (y1 ). Then there exists ∆′ in instF D (∆, q, q1 ) such that sup∆′ (q1 ) > sup∆′ (q). Proof: Let y ′ = ∆Y (y1 ) and Ym = match(y, y ′ ). Under the hypotheses of the lemma, we have Y + ⊆ Y1+ and Ym ⊂ Y (as assuming that Ym = Y entails that y = ∆Y (y1 )). Thus, Ym+ ⊆ Y + ⊆ Y1+ . Moreover, we also have that Y1+ 6= U (because Y1+ = U entails that Y1 → X1 ∈ F D + ). Let us consider the table ∆′ = {t, t1 , t′1 } where t, t1 and t′1 are tuples over U defined as follows: − t.Y = y, − t1 .Y = t′1 .Y = y ′ , − t1 .Y1 = t′1 .Y1 = y1 , − t.Ym+ = t1 .Ym+ = t′1 .Ym+ , − t1 .Y1+ = t′1 .Y1+ , − for every A not in Y1+ , t1 .A 6= t′1 .A, t1 .A 6= t.A and t′1 .A 6= t.A. We note that, for t1 and t′1 to be defined, we must have y1 .(Y ∩ Y1 ) = y ′ .(Y ∩ Y1 ), which holds because Y → Y1 is in F D + and y ′ = ∆Y (y1 ). Moreover, since Y1+ 6= U , the tuples t, t1 and t′1 are pairwise distinct. On the other hand, using Proposition 3.1, ∆′ satisfies F D because we have match(t1 , t′1 ) = (match(t1 , t′1 ))+ = Y1+ , match(t, t1 ) = (match(t, t1 ))+ = Ym+ and match(t, t′1 ) = (match(t, t′1 ))+ = Ym+ . Moreover, we have ∆′Y (y1 ) = ∆Y (y1 ) = y ′ , because t1 .Y1 = t′1 .Y1 = y1 and t1 .Y = t′1 .Y = y ′ . Thus, as q and q1 are in Q(∆) and in Q(∆′ ), by Definition 4.3, ∆′ ∈ instF D (∆, q, q1 ). Regarding the supports of q and q1 in ∆′ , since Ym ⊂ Y , t1 .Y and t′1 .Y are different than y, and thus, sup∆′ (q) = 1. On the other hand, as Y1 → X1 6∈ F D + , we have X1+ 6⊆ Y1+ , and so t1 .X1 6= t′1 .X1 . Hence, sup∆′ (q1 ) ≥ 2, showing that sup∆′ (q1 ) > sup∆′ (q). Thus, the proof is complete. ⊓ ⊔
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
21
Based on Lemma 4.1, Lemma 4.2, Proposition 4.3 and Proposition 4.4, we are now ready to prove the following theorem, which shows that the pre-ordering ∆ characterizes anti-monotonicity of the support with respect to all tables of inst∆ in which the queries under comparison have the same ‘status’ as in ∆. Theorem 4.2. Let ∆ be in instF D (U ) and q and q1 in Q(U ). Then, q ∆ q1 holds if and only if, for every ∆′ in instF D (∆, q, q1 ), sup∆′ (q1 ) ≤ sup∆′ (q). Proof: The ‘only if’ part of the proof is a consequence of Proposition 4.3 and Proposition 4.4. Indeed, for every table ∆′ in instF D (∆, q, q1 ), by Proposition 4.4, q ∆ q1 implies that q ∆′ q1 and so, by Proposition 4.3, sup∆′ (q1 ) ≤ sup∆′ (q). Let us now turn to the ‘if’ part of the proof that can be stated as follows, by contraposition: For all q = πX σy (δ) and q1 = πX1 σy1 (δ) in Q(U ), if q 6∆ q1 then there exists ∆′ in instF D (∆, q, q1 ) such that sup∆′ (q1 ) > sup∆′ (q). By Definition 4.2, assuming that q 6∆ q1 implies that q1 ∈ Q(∆), Y1 → X1 6∈ F D + and one of the following statement holds: (i) XY1 → X1 6∈ F D + , or (ii) Y1 → Y 6∈ F D + , or (iii) Y1 → Y ∈ F D + and y 6= ∆Y (y1 ). We consider successively the two cases whereby q is or not in Q(∆). ′ ′ ′ If q 6∈ Q(∆), then, for every ∆ in instF D (∆, q, q1 ), we have q 6∈ Q(∆ ) and q1 ∈ Q(∆ ). Thus, sup∆′ (q) = 0 and sup∆′ (q1 ) ≥ 1. Hence, we have sup∆′ (q1 ) > sup∆′ (q). If q ∈ Q(∆), then q and q1 are in Q(∆). In this case, as mentioned in the proof of Lemma 4.1, denoting by Ym the attribute set match(y, y1 ), the fact that ∆ ∈ instF D (U ) implies that (Ym+ \Ym )∩(Y ∩Y1 ) = ∅. Moreover, as we assume that q 6∆ q1 , Y1 → X1 is not in F D + (see case 2 of Definition 4.2). Now, if (iii) holds, then Lemma 4.2 applies, showing that there exists ∆′ in instF D (∆, q, q1 ) such that sup∆′ (q1 ) > sup∆′ (q), which completes the proof in this case. On the other hand, if (iii) does not hold, then (i) or (ii) hold. In this case, Lemma 4.1 shows that there exists ∆′ in instF D (U ) such that q and q1 are in Q(∆′ ) and sup∆′ (q1 ) > sup∆′ (q). We now prove that ∆′ is in instF D (∆, q, q1 ). First, we have that q and q1 are both in Q(∆) and in Q(∆′ ). Thus, the first two items of Definition 4.3 are satisfied. Therefore, if Y1 → Y is not in F D + , we have ∆′ ∈ instF D (∆, q, q1 ). If Y1 → Y is in F D + then ∆Y (y1 ) = y (because (iii) is assumed not to hold), which implies that Ym = Y ∩ Y1 . On the other hand, it has been seen in the proof of Lemma 4.1 that, when Y1 → Y is in F D + and Ym = Y ∩ Y1 (i.e., in case (1.a) of the proof), we have ∆′Y (y1 ) = y. Therefore, if Y1 → Y is in F D + then ∆Y (y1 ) = ∆′Y (y1 ). Consequently, by Definition 4.3, in this case again, we obtain that ∆′ is in instF D (∆, q, q1 ). Thus, in any case, we have shown that there exists ∆′ in instF D (∆, q, q1 ) such that sup∆′ (q1 ) > sup∆′ (q). Therefore, the proof is complete. ⊓ ⊔ Referring back to our running example, we have seen in Example 4.3 that q = πCid σbeer (δ) and q1 = πQty σp2 (δ) are in Q(∆) and q ∆ q1 . Then, Theorem 4.2 states that this holds if and only if sup∆′ (q1 ) ≤ sup∆′ (q) holds in every ∆′ of instF D (∆, q, q1 ), i.e., in every ∆′ of instF D (U ) such that q and q1 are in Q(∆′ ) and product p2 is of type beer. The following example shows that the conditions in Definition 4.3 are necessary for Theorem 4.2 to hold. That is, it is not true that q ∆ q1 holds if and only if for every ∆′ in instF D (U ), sup∆′ (q1 ) ≤ sup∆′ (q).
22
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
Example 4.5. Let U = {A, B, C} and F D = {B → C}. Considering the tables ∆ = {abc} and ∆′ = {ab′ c′ } where b 6= b′ and c 6= c′ and the queries q = πA σc (δ) and q1 = πA σb′ (δ), then ∆ and ∆′ are in instF D (U ), q ∆ q1 , and we have: • q ∈ Q(∆), q1 6∈ Q(∆) and sup∆ (q1 ) < sup∆ (q) • q 6∈ Q(∆′ ), q1 ∈ Q(∆′ ) and sup∆′ (q) < sup∆′ (q1 ). This shows that the first two conditions in Definition 4.3 are necessary in order to make sure that we consider tables with respect to which q and q1 have the same status. We now illustrate the necessity of the third condition as follows: let ∆ = {a′ bc, ab′ c} and ∆′ = {abc, ab′ c′ , a′ b′ c′ }, where a 6= a′ , b 6= b′ and c 6= c′ . Then clearly, ∆ and ∆′ are in instF D (U ), q ∆ q1 , and: • q and q1 are both in Q(∆) and Q(∆′ ), but ∆C (b′ ) 6= ∆′C (b′ ) • sup∆ (q1 ) < sup∆ (q) and sup∆′ (q) < sup∆′ (q1 ). Thus, even if the first two conditions in Definition 4.3 hold, the third one must also hold, in order to make sure that we consider tables in which the functions induced for y and y1 by the functional dependencies are equal.
5. Equivalent Queries 5.1. Equivalence Relations Each of the two pre-orderings and ∆ induces an equivalence relation over Q(U ) defined as follows. Definition 5.1. Let q and q1 be two queries in Q(U ). Then: • q and q1 are said to be equivalent, denoted by q ≡ q1 , if q q1 and q1 q. The equivalence class of q modulo ≡ is denoted by [q], and the set of all equivalence classes modulo ≡ is denoted by C(U ). • If ∆ is in instF D (U ), q and q1 are said to be equivalent with respect to ∆, denoted by q ≡∆ q1 , if q ∆ q1 and q1 ∆ q. The equivalence class of q modulo ≡∆ is denoted by [q]∆ , and the set of all equivalence classes modulo ≡∆ is denoted by C∆ (U ). Two important remarks are in order regarding the definition of ≡∆ in Definition 5.1 above. 1. First, by Definition 4.1, it is easy to see that all queries q in Q(U ) \ Q(∆) are equivalent modulo 0. ≡∆ . In the remainder of the article, the equivalence class Q(U ) \ Q(∆) will be denoted by C∆ 2. Second, all queries q = πX σy (δ) in Q(∆) such that Y → X is in F D + are equivalent modulo ≡∆ . Indeed, it is easy to see from Definition 4.1 that if q and q1 are such queries, then q ≡∆ q1 . Conversely, if q1 = πX1 σy1 (δ) is in [q]∆ , then, according to Definition 5.1, q ∆ q1 . Thus, it can be seen from the proof of Proposition 4.2 that this entails that Y1 → X1 is in F D + (since Y → X ∈ F D + ). In the remainder of the article, the equivalence class modulo ≡∆ containing all 1. queries q = πX σy (δ) of Q(∆) such that Y → X is in F D + will be denoted by C∆
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
23
Moreover, as a consequence of Proposition 4.4, given ∆ in instF D (U ) and two queries q and q1 in Q(U ), we have [q]∆ = [q]∆′ and [q1 ]∆ = [q1 ]∆′ , for every ∆′ in instF D (∆, q, q1 ). Each of the two pre-orderings and ∆ over Q(U ) induces a partial ordering over C(U ) and C∆ (U ), respectively, also denoted by and ∆ , respectively. These orderings are defined as follows: • For all classes [q] and [q1 ] in C(U ), [q1 ] is said to be more specific than [q], denoted by [q] [q1 ], if q q1 holds. • If ∆ is in instF D (U ), for all classes [q]∆ and [q1 ]∆ in C∆ (U ), [q1 ]∆ is said to be more specific than [q]∆ with respect to ∆, denoted by [q]∆ ∆ [q1 ]∆ , if q ∆ q1 holds. It is easy to see that and ∆ over C(U ) and C∆ (U ), respectively, are indeed two partial orderings that are independent from the chosen representatives in [q] and [q1 ], and in [q]∆ and [q1 ]∆ , respectively. This explains why we use the same notations for the pre-orderings in Q(U ) and their associated orderings in C(U ) and C∆ (U ). As a consequence of Theorem 4.1 and Theorem 4.2, the following corollary holds. Corollary 5.1. For all q and q1 in Q(U ): 1. q ≡ q1 holds if and only if, for every ∆ in instF D (U ), sup∆ (q) = sup∆ (q1 ). 2. If ∆ is in instF D (U ), q ≡∆ q1 holds if and only if, for every ∆′ in instF D (∆, q, q1 ), sup∆′ (q) = sup∆′ (q1 ). We recall from Example 4.1 that, in the context of our running example, πCid σ⊤ (δ) πCid Caddr σ⊤ (δ) and πCid Caddr σ⊤ (δ) πCid σ⊤ (δ) hold. Thus these queries are equivalent with respect to ≡, and Corollary 5.1 above shows that their supports are equal in every table ∆ of instF D (U ). Similarly, we recall from Example 1.1 that we have πCid σp2 (δ) ∆ πCid Cname P ype σp2 beer (δ) and πCid Cname P ype σp2 beer (δ) ∆ πCid σp2 (δ). Thus, these queries are equivalent with respect to ≡∆ , and Corollary 5.1 above shows that their supports are equal in every table ∆′ of instF D (∆, πCid σp2 (δ), πCid Cname P ype σp2 beer (δ)). Based on Corollary 5.1, given a class [q] in C(U ) (respectively [q]∆ in C∆ (U )), we denote by sup∆ ([q]) (respectively sup∆ ([q]∆ )) the support in ∆ of [q] (respectively of [q]∆ ), i.e., the support in ∆ of any query in [q] (respectively in [q]∆ ). Thus, similarly to individual queries, given a support threshold min-sup and a class [q] in C(U ) (respectively [q]∆ in C∆ (U )), [q] (respectively of [q]∆ ) is said to be frequent if sup∆ ([q]) ≥ min-sup (respectively sup∆ ([q]∆ ) ≥ min-sup). 0 is the class that contains all queries in Q(U ) \ Q(∆), we have sup (C 0 ) = 0. Recalling that C∆ ∆ ∆ 1 is the set of all queries q = π σ (δ) in Q(∆) such that Y → X is in F D + , On the other hand, since C∆ X y 1 ) = 1. we have sup∆ (C∆ The following theorem states that the support of equivalence classes is anti-monotonic with respect to the partial orderings and ∆ over equivalence classes. Theorem 5.1. For all q and q1 in Q(U ): 1. [q] [q1 ] holds if and only if, for every ∆ in instF D (U ), sup∆ ([q1 ]) ≤ sup∆ ([q]).
24
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
2. If ∆ is in instF D (U ), [q]∆ ∆ [q1 ]∆ holds if and only if, for every ∆′ in instF D (∆, q, q1 ), sup∆′ ([q1 ]∆′ ) ≤ sup∆′ ([q]∆′ ). Proof: 1. The result follows directly from Theorem 4.1 and Corollary 5.1(1). 2. Since ∆′ ∈ instF D (∆, q, q1 ), by Proposition 4.4, [q]∆ = [q]∆′ and [q1 ]∆ = [q1 ]∆′ . Thus, we have sup∆′ ([q]∆ ) = sup∆′ ([q]∆′ ) and sup∆′ ([q1 ]∆ ) = sup∆′ ([q1 ]∆′ ), and the result follows from Theorem 4.2 and Corollary 5.1(2), which completes the proof. ⊓ ⊔ It is important to note that, given a table ∆ in instF D (U ), the impact of Corollary 5.1 and Theorem 5.1 on the computation of frequent queries is as follows: 1. Only one computation per equivalence class is necessary. Consequently, instead of considering individual queries of Q(U ) in algorithms, it is enough to consider one query per equivalence class of C(U ) or C∆ (U ). 2. Frequent classes can be computed using the Apriori trick ([1]). In other words, as for individual queries, given a support threshold min-sup, if sup∆ ([q]) < min-sup (respectively sup∆ ([q]∆ ) < min-sup), then for every class [q1 ] such that [q] [q1 ] (respectively [q1 ]∆ such that [q]∆ ∆ [q1 ]∆ ), we have that sup∆ ([q1 ]) < min-sup (respectively sup∆ ([q1 ]∆ ) < min-sup). Consequently, the support of [q1 ] (respectively [q1 ]∆ ) has not to be computed. We also point out that, since ∆ refines , given a table ∆ in instF D (U ), equivalence classes modulo ≡ are smaller (with respect to set inclusion) than those modulo ≡∆ . In other words, for every query q in Q(U ), we have [q] ⊆ [q]∆ . This implies that, given ∆ in instF D (U ), it is preferable to compute frequent classes modulo ≡∆ , because the cardinality of C(U ) is greater than that of C∆ (U ). This is precisely what has been considered in our previous work [13]. However, computing frequent classes instead of individual frequent queries pays off only if equivalent queries can be easily characterized. The next section deals with this issue.
5.2. Equivalence Classes In what follows, we characterize the content of equivalence classes modulo ≡ and ≡∆ , using the notion of key of a schema, as defined below. Definition 5.2. Given a schema X, K is said to be a key of X if K is a minimal (with respect to set inclusion) schema such that K + = X + . The set of all keys of X is denoted by keys(X). It is important to note that it is not the case that all keys of X are subsets of X. However keys(X) contains at least one key K such that K ⊆ X (this is so because the set {Y | Y + = X + ∧ Y ⊆ X} is not empty, since it contains X). In order to illustrate this remark, consider the universe U = {A, B, C} and the set of functional dependencies F D = {A → B, B → A}. For X = AC, we have X + = ABC and keys(X) = {A, B}. Thus, in this case, one of the two keys is a subset of X, whereas the other one is not. The following proposition characterizes the content of classes in C(U ).
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
25
Proposition 5.1. For every q = πX σy (δ) in Q(U ), [q] is the set of all queries q1 = πX1 σy1 (δ) where y1 = y and there exists K in keys(XY ) such that (K \ Y + ) ⊆ X1 ⊆ (XY )+ . Proof: Let Q be the set of queries as defined in the proposition, and q1 = πX1 σy1 (δ) in [q]. It follows from Definition 4.1 that (XY )+ = (X1 Y1 )+ and y1 = y. Thus, Y1 = Y , and as (XY )+ = (X1 Y )+ , by Definition 5.2, we have keys(XY ) = keys(X1 Y ). Therefore, there exists K ∈ keys(XY ) such that K ⊆ X1 Y ⊆ (XY )+ . Hence, (K \ Y + ) ⊆ K ⊆ X1 Y , which entails that (K \ Y + ) ⊆ X1 (as (K \ Y + ) ∩ Y = ∅). So, we have (K \ Y + ) ⊆ X1 ⊆ (XY )+ , and thus, q1 ∈ Q, showing that [q] ⊆ Q. Conversely, let q1 = πX1 σy1 (δ) be in Q. Then, y = y1 and (K \ Y + ) ⊆ X1 ⊆ (XY )+ . Thus, as Y ⊆ (XY )+ , we have (K \ Y + )Y ⊆ X1 Y ⊆ (XY )+ , and so, ((K \ Y + )Y )+ ⊆ (X1 Y )+ ⊆ (XY )+ . In order to prove that (X1 Y )+ = (XY )+ , we show that ((K \ Y + )Y )+ = (XY )+ . Indeed, since K = (K\Y + )Y + and (K\Y + )Y + ⊆ ((K\Y + )Y )+ , we have K + ⊆ ((K\Y + )Y )+ . As K is in keys(XY ), we have K + = (XY )+ , and so, (XY )+ ⊆ ((K \ Y + )Y )+ . Moreover, as K ⊆ XY , (K \ Y + ) ⊆ (XY \ Y + ). Thus, (K \ Y + )Y ⊆ (XY \ Y + )Y . Since (XY \ Y + )Y ⊆ XY , we obtain that ((K \ Y + )Y )+ ⊆ (XY )+ . Therefore, ((K \Y + )Y )+ = (XY )+ , showing that (X1 Y )+ = (XY )+ . As this entails that X1 Y → ⊓ ⊔ X and XY → X1 are in F D + , we obtain that πX1 σy (δ) ∈ [q], and the proof is complete. To illustrate Proposition 5.1, let us consider q = πCid P type σp2 (δ) in the context of our running example. In this case, we have Y + = Pid Ptype and (Cid Ptype Pid)+ = U , and thus, [q] is the set of all queries πX1 σp2 (δ) such that Cid ⊆ X1 ⊆ U . As a consequence of Proposition 5.1 above, for every q in Q(U ), [q] contains exactly one representative πX σy (δ) such that X + = X and Y ⊆ X. Therefore, C(U ) is isomorphic to the set of all queries q = πX σy (δ) such that X + = X and Y ⊆ X. In the remainder of the article, we identify every class of C(U ) with this unique, particular representative. In order to characterize the content of classes in C∆ (U ), we first define the notion of keys of a query as follows. Definition 5.3. For every q = πX σy (δ) in Q(∆), the set of keys of q, denoted by Keys(q), is the set of all queries q0 = πX0 σy0 (δ) in Q(∆) such that • X0 = K0 \ Y + , where K0 ∈ keys(XY ) and • y0 = ∆Y0 (y), where Y0 ∈ keys(Y ). We note that, in Definition 5.3, the tuple y0 is correctly defined because, as Y0 ∈ keys(Y ), Y0+ = Y + , and thus, Y → Y0 is in F D + . Moreover, applying Lemma 3.1 shows that we also have y = ∆Y (y0 ). On the other hand, it follows from Definition 5.3 that, for every q = πX σy (δ) in Q(∆) such that Y → X is in F D + , Keys(q) = {π∅ σy0 (δ) | Y0 ∈ keys(Y ) ∧ y0 = ∆Y0 (y)}. Indeed, in this case, X + ⊆ Y + and thus, for every X0 ∈ keys(XY ), X0 ⊆ (XY )+ = Y + . Hence, X0 \ Y + = ∅. Example 5.1. In the context of Example 1.1, let q = πCid Cname σ⊤ (δ). As keys(Cid Cname) = {Cid}, we have Keys(q) = {πCid σ⊤ (δ)}.
26
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
For q ′ = πCid P type Qty σp2 (δ), we have keys(Cid Ptype Qty Pid) = {Cid Pid} and keys(Pid) = {Pid}. Since Pid+ = Pid Ptype, we obtain Keys(q ′ ) = {πCid σp2 (δ)}. Now, for q ′′ = πP type σp2 (δ), since Pid → Ptype and keys(Pid) = {Pid}, we have Keys(q ′′ ) = {π∅ σp2 (δ)}. The following lemma shows that for every query q in Q(∆), every key of q is in [q]∆ . Lemma 5.1. For every ∆ in instF D (U ) and every query q in Q(∆), every q0 in Keys(q) is such that q0 ≡∆ q. Proof: 1 . In this case, as mentioned above, for every q = π σ in Keys(q), Let us first assume that q is in C∆ 0 X0 y0 1 . Therefore, we have q ≡ q. we have X0 = ∅, which entails that q0 ∈ C∆ 0 ∆ 1 . Given q = π σ in Keys(q), we successively Let us now assume that q = πX σy (δ) is not in C∆ 0 X0 y0 prove that q ∆ q0 and q0 ∆ q, which then implies that q0 ≡∆ q. To prove that q ∆ q0 , we first show that XY0 → X0 is in F D + , or that X0+ ⊆ (XY0 )+ . Indeed, by Definition 5.3, we have X0+ = (K0 \ Y + )+ and K0+ = (XY )+ (as K0 ∈ keys(XY )). Therefore, we obtain that K0+ = (XY + )+ . Moreover, as Y0 ∈ keys(Y ), we have Y0+ = Y + and thus, Y → Y0 is in F D + and K0+ = (XY0+ )+ . Hence, K0+ = (XY0 )+ , and as X0+ ⊆ K0+ , we obtain that X0+ ⊆ (XY0 )+ . Since by Definition 5.3, we also have y0 = ∆Y0 (y), q ∆ q0 holds.
To prove that q0 ∆ q, we first show that X0 Y → X is in F D + , or that X + ⊆ (X0 Y )+ . Indeed, by Definition 5.3, we have X0 = K0 \ Y + and K0 ∈ keys(XY ). Thus (XY )+ = K0+ , and as K0+ = ((K0 \ Y + )Y )+ , we obtain that (XY )+ = (X0 Y )+ . Since X + ⊆ (XY )+ , it follows that X + ⊆ (X0 Y )+ , thus that X0 Y → X is in F D + . Moreover, as Y0 ∈ keys(Y ), we have Y0+ = Y + and thus, Y → Y0 is in F D + . By Lemma 3.1, we also have ∆Y (y0 ) = y, because Y + = Y0+ and y0 = ∆Y0 (y) (by Definition 5.3). Therefore, q0 ∆ q holds, and the proof is complete. ⊓ ⊔
Now, the following proposition characterizes equivalent classes modulo ≡∆ . Proposition 5.2. For every ∆ in instF D (U ), we have: 0 = Q(U ) \ Q(∆). 1. C∆ 1 = {π σ (δ) ∈ Q(∆) | X ⊆ Y + }. 2. C∆ X y
3. For every q = πX σy (δ) in Q(∆) such that Y → X is not in F D + : [q]∆ = {πX1 σy1 (δ) | (∃πX0 σy0 (δ) ∈ Keys(q))((X0 ⊆ X1 ⊆ (X0 Y0 )+ )∧ (Y0 ⊆ Y1 ⊆ Y0+ ) ∧ (y1 = ∆Y1 (y0 )))}. Proof: The first two items have been stated previously, after Corollary 5.1. Thus, we only prove item 3. 1 , denoting by Q the set of queries as defined in the proposition, let q = π σ (δ) be If [q]∆ 6= C∆ 1 X1 y1 in [q]∆ . It follows from Definition 4.2 that (XY )+ = (X1 Y1 )+ , Y + = Y1+ and y1 = ∆Y1 (y). As in the proof of Proposition 5.1, it can be seen that there exists K ∈ keys(XY ) such that K ⊆ X1 Y1 ⊆ (XY )+ . Similarly, there exists Y0 ∈ keys(Y ) such that Y0 ⊆ Y1 ⊆ Y + and y0 = ∆Y0 (y).
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
27
Moreover, for X0 = K \ Y0+ , as X0 ∩ Y0+ = ∅ and Y1 ⊆ Y0+ , X0 ∩ Y1 = ∅. Since X0 ⊆ K ⊆ X1 Y1 , we have X0 ⊆ X1 Y1 , and thus, X0 ⊆ X1 . Therefore q1 ∈ Q. Conversely, let q1 be in Q. As X0 ⊆ X1 , X1 → X0 ∈ F D + , and as Y0 ⊆ Y1 , Y1 → Y0 ∈ F D + . Therefore, X1 Y1 → X0 Y0 ∈ F D + . As X1 ⊆ (X0 Y0 )+ and Y1 ⊆ Y0+ , X0 Y0 → X1 Y1 and Y0 → Y1 are in F D + . Thus, (a) X0 Y1 → X1 and Y1 → Y0 are in F D + , and (b) X1 Y0 → X0 and Y0 → Y1 are in F D+ . Moreover, since q1 ∈ Q, we have y1 = ∆Y1 (y0 ), which with (a) above shows that q0 ∆ q1 . On the other hand, as Y0+ = Y1+ , by Lemma 3.1, y1 = ∆Y1 (y0 ) entails that y0 = ∆Y0 (y1 ), which with (b) above shows that q1 ∆ q0 . Hence, q1 ≡∆ q0 . Applying Lemma 5.1, we have q0 ≡∆ q, and so, q1 ≡∆ q. Thus, q1 ∈ [q]∆ , which completes the proof. ⊓ ⊔ 1 , [q] As a consequence of Proposition 5.2, for every q = πX σy (δ) in Q(∆) such that [q]∆ 6= C∆ ∆ ′ + ′ + ′ ′ contains exactly one representative πX ′ σy′ (δ) such that X = X , Y = Y and Y ⊂ X . 0 , C 1 } is isomorphic to the set of all queries π σ (δ) of Q(∆) such that X + = X, Thus, C∆ (U )\{C∆ X y ∆ 0 , C 1 } with Y + = Y and Y ⊂ X. In the remainder of the article, we identify every class of C∆ (U ) \ {C∆ ∆ this unique, particular representative.
Considering again the queries q = πCid Cname σ⊤ (δ) and q ′ = πCid P type Qty σp2 (δ) of Example 5.1, we have: [q]∆ = {πCid σ⊤ (δ), πCid Cname σ⊤ (δ), πCid Caddr σ⊤ (δ), πCid Cname Caddr σ⊤ (δ)} [q ′ ]∆ = {πX σy (δ) | (Cid ⊆ X ⊆ U ) ∧ (y = p2 ∨ y = p2 beer)}. Thus the equivalence classes [q]∆ and [q ′ ]∆ are respectively represented by πCid Cname Caddr σ⊤ (δ) and πCid Cname Caddr P id P type Qty σp2 beer (δ) (or more simply, by σp2 beer (δ)). 1 because We note that, for the query q ′′ = πP type σp2 (δ) of Example 5.1, we have [q ′′ ]∆ = C∆ + Pid → Ptype ∈ F D and p2 ∈ πP id (∆). As a consequence, no particular representative of this class is considered.
The following example shows exhaustive comparisons of equivalence classes in the simple case of Example 4.2 and Example 4.4. Example 5.2. As in Example 4.2, let us consider the case where U = {A, B} and F D = ∅. Assuming that dom(A) = {a} and dom(B) = {b} and considering the particular representatives as defined above, the set C(U ) along with comparisons according to are shown below (case (a)). πAB σ⊤ (δ) RRR l l RR l l l πA σ⊤ (δ)
πAB σ⊤ (δ) SSS k SS kkkk
πB σ⊤ (δ)
RRR ll RR lll πAB σb (δ) π∅ σ⊤ (δ) π σ (δ) RRRR AB a l l l R ll πB σb (δ) πA σa (δ) RRR ll RR lll πAB σab (δ) (a) The set C(U )
πA σ⊤ (δ)
πB σ⊤ (δ)
πAB σb (δ) RRRR RRRR
1 C∆
πAB σa (δ) ll l l l lll
(b) The set C∆ (U )
28
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
0 = ∅. Thus, the set C (U ) Now, considering ∆ = {ab} as in Example 4.4, we note that, in this case C∆ ∆ along with comparisons according to ∆ are shown above (case (b)). We draw attention on the fact that, even in this simple case without functional dependencies, the structures of C(U ) and C∆ (U ) are simpler than those of Q(U ) shown in Example 4.2 and Example 4.4, respectively.
We point out that, although the graphs in the previous example show that, in this case, C(U ) and C∆ (U ) are lattices, this does not hold in general. The following example, borrowed from [13], shows a case where neither C(U ) nor C∆ (U ) has a lattice structure. Example 5.3. Let U = {A, B, C, D} and F D = {ABC → D}. In order to see that in this case, C(U ) is not a lattice, we consider the two classes represented by q1 = πB σ⊤ (δ) and q2 = πABD σa (δ) where a is in dom(A), and we show that [q1 ] and [q2 ] have two distinct greatest lower bounds in C(U ). Indeed, for q = πBC σ⊤ (δ) and q ′ = πBD σ⊤ (δ), for i = 1, 2, [q] [qi ] and [q ′ ] [qi ], i.e., • πBC σ⊤ (δ) πB σ⊤ (δ) and πBC σ⊤ (δ) πABD σa (δ), • πBD σ⊤ (δ) πB σ⊤ (δ) and πBD σ⊤ (δ) πABD σa (δ). (These comparisons hold because F D + contains the following dependencies: BC → B, ABC → ABD, BD → B and ABD → ABD.) Thus, [q] and [q ′ ] are two distinct lower bounds of [q1 ] and [q2 ] in C(U ). Moreover, let q0 = πX0 σy0 (δ) where X0 = X0+ and Y0 ⊆ X0 be such that, for i = 1, 2, [q] ≺ [q0 ] [qi ]. Then, we have [q] 6= [q0 ] and: 1. X0 ⊆ (BCY0 )+ , B ⊆ X0 and ABD ⊆ (X0 A)+ ; 2. ⊤ is a subtuple of y0 and y0 is a subtuple of ⊤ and a. Then, by item 2 above, we obtain y0 = ⊤. So, Y = ∅, and thus, (BCY0 )+ = BC. Taking into account that [q] 6= [q0 ], the first inclusion of item 1 above can be written as X0 ⊂ BC. Therefore, either (i) X0 = ∅ or (ii) X0 = B or (iii) X0 = C. In case (i), the last inclusion of item 1 implies that ABD ⊆ A, which is not possible. Similarly, in cases (ii) and (iii), the last inclusion of item 1 implies that ABD ⊆ AB and ABD ⊆ AC, respectively, which is again not possible. As a consequence, [q] is a greatest lower bound of [q1 ] and [q2 ] in C(U ). Since a similar reasoning holds when considering [q ′ ] instead of [q], we conclude that [q] and [q ′ ] are two distinct lower bounds of [q1 ] and [q2 ] in C(U ). Consequently, C(U ) is not a lattice in this case. Now, considering ∆ = {abcd}, for the same reasons as above, we also have: • πBC σ⊤ (δ) ∆ πB σ⊤ (δ) and πBC σ⊤ (δ) ∆ πABD σa (δ), • πBD σ⊤ (δ) ∆ πB σ⊤ (δ) and πBD σ⊤ (δ) ∆ πABD σa (δ). It follows that [q]∆ and [q ′ ]∆ are two distinct lower bounds of [q1 ]∆ and [q2 ]∆ in C∆ (U ). Moreover, for q0 = πX0 σy0 (δ) where X0 = X0+ , Y0 = Y0+ and Y0 ⊆ X0 , if, for i = 1, 2, [q]∆ ≺∆ [q0 ]∆ ∆ [qi ]∆ then we have [q]∆ 6= [q0 ]∆ and: 1. X0 ⊆ (BCY0 )+ , B ⊆ X0 and ABD ⊆ (X0 A)+ ;
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
29
2. ∅ ⊆ Y0 , Y0 ⊆ ∅, and Y0 ⊆ A along with ∆Y0 (a) = y0 . In this case, item 2 above entails that Y0 = ∅, and thus that y0 = ⊤, which satisfies ∆Y0 (a) = y0 . Thus, as above, it can be shown that q0 does not exist. As a consequence, the set C∆ (U ) is not a lattice.
6. Conclusion and Further Work In this article, we have considered the problem of comparing conjunctive queries for the efficient computation of the supports of projection-selection queries on a given relational table satisfying a given set of functional dependencies. In this setting, we defined and characterized two pre-orderings with respect to which the support measure has been shown to be anti-monotonic. Furthermore, we showed that these pre-orderings allow to consider equivalence classes of queries instead of individual queries. We are currently implementing algorithms for mining frequent projection-selection queries in the case of the pre-ordering ∆ defined in this article. Regarding possible extensions of the present approach, we are investigating the following issues: • Extending our approach to the case of projection-selection-join queries. We are investigating this issue under the standard restriction whereby joins are performed in the presence of key and foreignkey constraints. The particular case of star schemas has been considered in [14]. • Extending the selection conditions to equalities of the form Y = Y ′ where Y and Y ′ are two schemas, as done in [10, 11], is another issue that we are investigating in the context of the present work. • Finally, we intend to study the discovery of functional dependencies and conditional functional dependencies in the context of the present work.
References [1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 309–328. AAAI-MIT Press, 1996. [2] W.W. Armstrong. Dependency structures of data base relationships. In IFIP Congress, pages 580–583. North Holland, 1974. [3] C. Beeri, M. Downd, R. Fagin, and R. Statman. On the structure of armstrong relations for functional dependencies. Journal of the ACM, 31(1):30–46, 1984. [4] Ph. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746–755, 2007. [5] L. Dehaspe and L. De Raedt. Mining association rules in multiple relations. In 7th International Workshop on Inductive Logic Programming, volume 1297 of LNCS, pages 125–132. Springer Verlag, 1997. [6] C.T. Diop, A. Giacometti, D. Laurent, and N. Spyratos. Composition of mining contexts for efficient extraction of association rules. In EDBT’02, volume 2287 of LNCS, pages 106–123. Springer Verlag, 2002. [7] A. Faye, A. Giacometti, D. Laurent, and N. Spyratos. Mining rules in databases with multiple tables: Problems and perspectives. In 3rd International Conference on Computing Anticipatory Systems (CASYS), 1999.
30
T.-Y. Jen et al. / Computing Supports of Conjunctive Queries
[8] B. Goethals. Mining queries, (unpublished paper). In Workshop on inductive databases and constraint based mining. Available at http://www.informatik.uni-freiburg.de/˜ml/IDB/talks/Goethals_ slides.pdf, 2004. [9] B. Goethals and J. Van den Bussche. Relational association rules: getting warmer. In ESF Exploratory Workshop on Pattern Detection and Discovery in Data Mining, volume 2447 of LNCS, pages 125–139. Springer-Verlag, 2002. [10] B. Goethals, E. Hoekx, and J. Van den Bussche. Mining tree queries in a graph. In 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 61–69, 2005. [11] B. Goethals, W. Le Page, and H. Mannila. Mining association rules of simple conjunctive queries. In SIAMSDM, pages 96–107, 2008. [12] J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane. Dmql : A data mining query language for relational databases. In SIGMOD-DMKD’96, pages 27–34, 1996. [13] T.Y. Jen, D. Laurent, and N. Spyratos. Mining all frequent selection-projection queries from a relational table. In EDBT’08, pages 368–379. ACM Press, 2008. [14] T.Y. Jen, D. Laurent, and N. Spyratos. Mining frequent conjunctive queries in star schemas. In International Database Engineering and Applications Symposium (IDEAS), pages 97–108. ACM Press, 2009. [15] T.Y. Jen, D. Laurent, N. Spyratos, and O. Sy. Towards mining frequent queries in star schemes. In International Workshop on Knowledge Discovery in Databases (KDID), volume 3933 of LNCS, pages 104–123. Springer Verlag, 2005. [16] R. Meo, G. Psaila, and S. Ceri. An extension to sql for mining association rules. Data Mining and Knowledge Discovery, 9:275–300, 1997. [17] T. Turmeaux, A. Salleb, C. Vrain, and D. Cassard. Learning caracteristic rules relying on quantified paths. In PKDD, volume 2838 of LNCS, pages 471–482. Springer Verlag, 2003. [18] J.D. Ullman. Principles of Databases and Knowledge-Base Systems, volume 1. Computer Science Press, 1988.