Towards an Algebraic Framework for Querying Inductive Databases

Hong-Cheu Liu¹, Aditya Ghose², and John Zeleznikow³

¹ Department of Computer Science and Multimedia Design, Diwan University, Madou, Tainan County, 72153 Taiwan
[email protected]
² School of Computer Science and Software Engineering, University of Wollongong, Wollongong, NSW 2522 Australia
[email protected]
³ School of Information Systems, Victoria University, Melbourne, Vic. Australia
[email protected]
Abstract. In this paper, we present a theoretical foundation for querying inductive databases, one which can accommodate disparate mining tasks. We present a data mining algebra, including some essential operations for manipulating data and patterns, and illustrate the use of a fix-point operator in a logic-based mining language. We show that the mining algebra has expressive power equivalent to the logic-based paradigm with a fix-point operator.
1 Introduction
While knowledge discovery from large databases has gained great success and popularity, there is a conspicuous lack of a unifying theory and a generally accepted framework for data mining. The integration of data mining with the underlying database systems has been formalised in the concept of inductive databases, a very active research area over the past decade. The key idea of inductive database systems is that data and patterns (or models) are first-class citizens. One of the crucial criteria for the success of inductive databases is the ability to reason about query evaluation and optimisation. So far, research on inductive query languages has not led to significant commercial deployments, owing to performance concerns and practical limitations. By contrast, relational database technology was built on algebraic and logical frameworks and then developed on both theoretical and system fronts, leading to an extremely successful technology and science. A theoretical framework that unifies different data mining tasks as well as different data mining approaches is therefore needed; it would help the field and provide a basis for future research [1,2].

H. Kitagawa et al. (Eds.): DASFAA 2010, Part II, LNCS 5982, pp. 306–312, 2010.
© Springer-Verlag Berlin Heidelberg 2010
In this article, we present an algebraic framework based on a complex value data model. Compared to the model presented in [3,4], we believe it is more appropriate to adopt a complex value data model, as data and mining results are normally represented as complex values. Some natural and important data mining computations can be expressed as compositions and/or combinations of known mining tasks. For example, an analyst might find a collection of frequently co-purchased item-sets, and may then further analyse these sets using a decision tree to predict under what conditions customers fall into a given credit-rating class and make such frequent co-purchases. Problems of this kind can easily be expressed as expressions over the proposed data mining algebra operators. The algebraic framework also provides a promising approach to query optimisation.

Constraints play a central role in data mining, and constraint-based data mining is now a recognised research topic. The area of inductive databases, inductive queries and constraint-based data mining has begun to develop into a unifying theory.

Declarative query languages play an important role in the next generation of Web database systems with data mining functionality. One of the important properties of declarative query languages is closure: the result of a query is always a relation, which is then available for further querying. This closure property is also an essential feature of inductive query languages. The inductive query languages proposed in the literature require users only to provide high-level declarative queries specifying their mining goals; the underlying inductive database systems then need sophisticated optimisers that select suitable algorithms and query execution strategies to perform the mining tasks. The alternative tight-coupling approach, implemented directly in SQL, places an unrealistically heavy burden on users, who must write complicated SQL queries.
So it is reasonable to explore alternative methods that make inductive databases realisable with current technology. In this paper, we also consider a logical framework for querying inductive databases.
2 An Algebraic Framework for Data Mining
In this section, we present an algebraic framework for data mining. The framework is based on a complex value data model.

2.1 A Data Mining Algebra
Let Ω = {R1, ..., Rn} be a signature, where each Ri, 1 ≤ i ≤ n, is a database relation. The data mining algebra over Ω is denoted DMA(Ω). A family of core operators of the algebra is presented as follows.

- Set operations: Union (∪), Cartesian product (×), and difference (−) are binary set operations.
- Tuple operations: Selection (σ) and projection (π) are defined in the natural manner.
- Powerset: powerset(r) is a relation of sort {τ}, where powerset(r) = {ν | ν ⊆ r}.
- Tuple creation: If A1, ..., An are distinct attributes, tup_create_{A1,...,An}(r1, ..., rn) is of sort ⟨A1 : τ1, ..., An : τn⟩, and tup_create_{A1,...,An}(r1, ..., rn) = {⟨A1 : ν1, ..., An : νn⟩ | ∀i (νi ∈ ri)}.
- Set creation: set_create(r) is of sort {τ}, and set_create(r) = {r}.
- Tuple destroy: If r is of sort ⟨A : τ⟩, tup_destroy(r) is a relation of sort τ and tup_destroy(r) = {ν | ⟨A : ν⟩ ∈ r}.
- Set destroy: If τ = {τ′}, then set_destroy(r) is a relation of sort τ′ and set_destroy(r) = ∪r = {w | ∃ν ∈ r, w ∈ ν}.
- Aggregation: The standard aggregate functions SUM, COUNT, AVG, MIN, MAX are defined in the usual manner. For example, if r is of sort ⟨A : τ1, B : τ2⟩, then _A G^Σ(r) is the relation over ⟨A, S⟩ given by _A G^Σ(r) = {⟨a, s⟩ | ∃⟨a, v⟩ ∈ r ∧ s = Σ{t⟨B⟩ | t ∈ r, t⟨A⟩ = a}}, where Σ is one of the allowed aggregate operators.

Definition 1. If r is a relation of l-tuples, then the append operator Δ_{δ(i1,...,ik)}(r) is a set of (l+1)-tuples, k ≤ l, where δ is an arithmetic operator on the components i1, ..., ik. The last component of each tuple is the value of δ(i1, ..., ik).

Example 1. Let the sort of a relation r be R : {⟨X : dom, Y : dom⟩}, and let a value of this sort be {⟨X : 2, Y : 3⟩, ⟨X : 5, Y : 4⟩}. Then Δ_{X+Y=Z}(r) = {⟨X : 2, Y : 3, Z : 5⟩, ⟨X : 5, Y : 4, Z : 9⟩}.

Definition 2. [5] Consider two relations r and s with the same sort {Item, Count}. The sub-join is defined as
r ⋈_{sub,k} s = {t | ∃u ∈ r, v ∈ s such that u[Item] ⊆ v[Item] ∧ ∃t′ such that (u[Item] ⊆ t′ ⊆ v[Item] ∧ |t′| = k), t = ⟨t′, v[Count]⟩}.
Here, we treat the result of r ⋈_{sub,k} s as a multi-set.

Example 2. Consider the sort {⟨Item : {dom}, Count : dom⟩} and two relations r = {⟨{a}, 0⟩, ⟨{b, f}, 0⟩} and s = {⟨{a, b, c}, 3⟩, ⟨{b, c, f}, 4⟩} of that sort. The result of r ⋈_{sub,2} s is {⟨{a, b}, 3⟩, ⟨{a, c}, 3⟩, ⟨{b, f}, 4⟩}.

Definition 3. Let X be a set. The map(f) operator applies a function f to every member of the set X, i.e., map(f) : X → {f(x) | x ∈ X}.
2.2 Frequent Item-Set Computation
Given a database D = (Item, Support) and a support threshold δ, the following fix-point algorithm computes the frequent item-sets of D. Let f^k be a function which, applied to a set T, yields the set of degree-k subsets of T. For any two sets S and s, s is said to be a degree-k subset of S if s ⊂ S and |s| = k.

Algorithm Fix-point
Input: An object-relational database D and support threshold δ.
Output: L, the frequent item-sets in D.
Method:
begin
  L1 := σ_{Support≥δ}(_{Item}G^{sum(Support)}(map(f^1)(D)))
  for (k := 2; T ≠ ∅; k++) {
    S := sub_join(L_{k−1}, D)
    T := σ_{Support≥δ}(_{Item}G^{sum(Support)}(S))
    L_k := L_{k−1} ∪ T
  }
  return L_k
end

procedure sub_join(T: frequent (k−1)-itemsets; D: database)
  for each itemset l1 ∈ T
    for each itemset l2 ∈ D
      c := l1 ⋈_{sub,k} l2
      if has_infrequent_subset(c, T) then delete c
      else add c to S
  return S

procedure has_infrequent_subset(c: k-itemset; T: frequent (k−1)-itemsets)
  for each (k−1)-subset s of c
    if s ∉ T then return TRUE
  return FALSE
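The fix-point computation can be sketched in Python as follows. This is a simplified reading under stated assumptions: D is a list of (item-set, count) pairs, candidate generation is inlined over each transaction rather than routed through the sub-join operator, and the has_infrequent_subset pruning is relaxed to the requirement that a candidate extend some frequent (k−1)-set.

```python
from itertools import combinations

def group_sum(pairs):
    """The grouping operator _Item G^{sum(Support)}: sum counts per item-set."""
    acc = {}
    for items, count in pairs:
        acc[items] = acc.get(items, 0) + count
    return acc

def fix_point_frequent(D, delta):
    """Sketch of Algorithm Fix-point.  D is a list of (item-set, count)
    pairs; delta is the support threshold.  Each level's frequent sets
    are accumulated into L, and the loop stops when a level adds nothing."""
    # L1: frequent single items, via map(f^1), grouping, then selection
    singles = [(frozenset([i]), c) for items, c in D for i in items]
    L = {s: c for s, c in group_sum(singles).items() if c >= delta}
    k = 2
    while True:
        prev = [s for s in L if len(s) == k - 1]
        counts = {}
        for v_items, v_count in D:
            # degree-k subsets of the transaction that extend a frequent (k-1)-set
            cands = {frozenset(c) for c in combinations(sorted(v_items), k)
                     if any(p <= frozenset(c) for p in prev)}
            for t in cands:
                counts[t] = counts.get(t, 0) + v_count
        T = {t: c for t, c in counts.items() if c >= delta}
        if not T:  # fix-point reached: no new frequent item-sets
            return L
        L.update(T)
        k += 1
```

Deduplicating candidates per transaction makes each transaction contribute its count exactly once per candidate, which stands in for the paper's multi-set bookkeeping.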
2.3 Decision Tree Induction
In this subsection, we give an algebraic expression for constructing a decision tree in DMA. The process is based on the well-known C4.5 algorithm [6] for tree induction. We assume that the algorithm is called with three parameters: D, attribute_list, and Attribute_selection_method. We refer to D as a data partition; initially, it is the complete set of training tuples and their associated class labels. The parameter attribute_list is a list of attributes describing the tuples, i.e., D = (A1, A2, ..., An, Class). Attribute_selection_method specifies a heuristic procedure for selecting the attribute that best discriminates the given tuples according to class; the procedure employs the information gain attribute selection measure. The complete algebraic expression is formulated by Algorithm Generate_decision_tree, shown below. The output of the algorithm is a complex value relation T which holds the sets of inequalities on the edges from the root to each leaf, together with their labels.

Algorithm: Generate_decision_tree
Input:
– Data partition D, a set of training tuples and their associated class labels;
– attribute_list, the set of candidate attributes;
– Attribute_selection_method, a procedure to determine the splitting criterion.
Output: A complex value relation which holds the sets of all inequalities on the edges from the root to the leaves, together with their labels.
Method:
1. T := {};
2. Split := {};
3. if Count(π_Class(D)) = 1 then
4.   T := T ∪ set_create(tup_create(Split, π_Class(D)));
5. if attribute_list = {} then
6.   T := T ∪ π_Class σ_{Count=Max}(_{Class}G^{Count}(D));
7. apply Attribute_selection_method(D, attribute_list) to find the best splitting_attribute;
8. Split := Split ∪ {splitting_attribute};
9. if splitting_attribute is discrete-valued and multi-way splits are allowed then
10.   attribute_list := attribute_list − splitting_attribute;
11. for each outcome j of the splitting criterion
12.   D_j := σ_j(D);
13.   if D_j = ∅ then
14.     T := T ∪ π_Class σ_{Count=Max}(_{Class}G^{Count}(D));
15.   else Generate_decision_tree(D_j, attribute_list);
16. endfor
17. return T

The tree starts as an empty relation. If the tuples in D are all of the same class, the resulting output relation contains only one class value. Note that steps 5 and 6 are terminating conditions. Otherwise, the algorithm calls Attribute_selection_method to determine the splitting criterion, and a branch is grown for each outcome j of the splitting criterion, with D_j holding the corresponding subset of tuples. Similarly, the procedure Attribute_selection_method(D, attribute_list) can itself be specified as an algebraic expression in DMA.

Example 3. As described in the Introduction, an analyst might find a collection of frequently bought item-sets, and may further analyse these sets using a decision tree to determine the circumstances (e.g., the class for credit rating) under which such frequent co-purchases are made by this category of customers. This query is easily expressed in DMA; it is formulated as Generate_decision_tree(Fix-point(D), (age, ...), Attribute_selection_method).
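A compact recursive sketch of the same idea follows. It works at the level of plain Python rather than algebra operators, and its assumptions are illustrative: tuples are dicts with a 'Class' key, attributes are discrete with equality-only multi-way splits, and the result is the flat output relation of (root-to-leaf conditions, majority label) pairs described above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a multiset of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(D, attrs):
    """Attribute_selection_method sketch: maximise C4.5-style information gain."""
    base = entropy([t['Class'] for t in D])
    def gain(a):
        rem = sum(len(part := [t for t in D if t[a] == v]) / len(D)
                  * entropy([t['Class'] for t in part])
                  for v in {t[a] for t in D})
        return base - rem
    return max(attrs, key=gain)

def generate_decision_tree(D, attrs, path=()):
    """Return a set of (conditions, label) pairs, where conditions is the
    tuple of (attribute, value) tests on the path from root to a leaf --
    a flat relational encoding of the decision tree."""
    labels = [t['Class'] for t in D]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:   # terminating conditions
        return {(path, majority)}
    a = best_split(D, attrs)
    T = set()
    for v in {t[a] for t in D}:              # one branch per outcome
        Dj = [t for t in D if t[a] == v]
        T |= generate_decision_tree(Dj, [x for x in attrs if x != a],
                                    path + ((a, v),))
    return T
```

The returned relation can then be queried further, in keeping with the closure property discussed in the Introduction.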
3 A Logical Framework for Data Mining
In this section, we give some basic data mining concepts based on logic. Inductive database queries can be formalised in a higher-order logic satisfying certain constraints.

Definition 4. Given an inductive database I and a pattern class P, a pattern discovery task can be specified as a query q such that q(I) = {t ∈ P | I ⊨ ϕ(t)}, where ϕ is a higher-order formula.

Definition 5. A constraint is a predicate on the powerset of the set of items I, i.e., C : 2^I → {true, false}. An itemset X satisfies a constraint C if C(X) is true.

Definition 6. Let I be an inductive database instance with a pattern class P and a complex value relational schema R = (A1, ..., An). An association rule is an element of the set L = {A ⟹ B | A, B ∈ {A1, ..., An}} such that ⟨A ⟹ B⟩ ∈ q(I) if and only if freq(A ∪ B, r) ≥ s and freq(A ∪ B, r)/freq(A, r) ≥ c, where freq(X, r) is the frequency of X in the set r, s is the support threshold and c is the confidence threshold.

Definition 7. Given an inductive database I, an inductive clause is an expression of the form P(u) ← R1(u1), ..., Rn(un), where n ≥ 1, the Ri are relation names and u, u1, ..., un are free tuples of appropriate arity.

Example 4. Let a transaction relation be T = (ID, Items), and suppose each item in the transaction database has an attribute value (such as profit). The constraint C_avg ≡ avg(S) ≥ 25 requires that, for each item-set S, the average of the profits of the items in S be equal to or greater than 25. The frequent pattern mining task is to find all frequent item-sets such that the above constraint holds. We express it as inductive clauses as follows.

freq_pattern(support, ⟨Items⟩) ← T(ID, Items), support = freq⟨Items⟩/Count(T), support ≥ s
F_pattern(Items, AVG) ← freq_pattern(support, Items), Value(Item, value), Item ∈ Items, AVG = SUM(value)/COUNT(Items)
Ans(Items) ← F_pattern(Items, AVG), AVG ≥ 25

It is simple to specify naive Bayesian classification by means of a deductive database program; the detailed program can be found in [5]. We present a deductive program performing the partitioning-based clustering task as follows.

P(Y, Ci) ← r(X), 1 ≤ i ≤ k, Yi = X
Cluster(Y, Ci, mi) ← P(Y, Ci), mi = mean{Y}

Here mean is a function used to calculate the cluster mean value, and distance is a similarity function. The following two rules show the clustering process; the operational semantics of this datalog program is fix-point semantics.

Example 5. The clustering process can be expressed as follows.
new_cluster(X, C) ← r(X), Cluster(Y, C, m), Cluster(Y′, C′, m′), C ≠ C′, distance(X, m) < distance(X, m′)
Cluster(X, C, m) ← new_cluster(X, C), m = mean{new_cluster.X}

Theorem 1. Any data mining query expressible in DMA with a while loop can be specified as inductive clauses in Datalog^{cv,¬}.

Proof sketch. DMA is equivalent to CALC^{cv}. A query is expressible in Datalog^{cv,¬} with stratified negation if and only if it is expressible in the complex value calculus CALC^{cv}, and CALC^{cv} is equivalent to CALC^{cv} + fix-point. So DMA with a while loop is equivalent to CALC^{cv} + fix-point, and hence any data mining query expressible in DMA with a while loop can be specified as inductive clauses in Datalog^{cv,¬}.
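The clustering fix-point of Example 5 admits a direct procedural sketch. The reading below is a simplification under stated assumptions: points are one-dimensional and distinct, and mean and distance are instantiated as the arithmetic mean and absolute difference, which are illustrative choices, not the paper's.

```python
def cluster_fix_point(points, means, max_iter=100):
    """Fix-point reading of the two clustering rules: repeatedly reassign
    each point to the cluster with the nearest mean (the new_cluster rule),
    then recompute each cluster mean (the Cluster rule), until the means
    stop changing."""
    assign = {}
    for _ in range(max_iter):
        # new_cluster(X, C): X goes to the cluster whose mean is closest
        assign = {x: min(range(len(means)), key=lambda c: abs(x - means[c]))
                  for x in points}
        # Cluster(X, C, m): m = mean of the points now assigned to C
        new_means = []
        for c in range(len(means)):
            members = [x for x in points if assign[x] == c]
            new_means.append(sum(members) / len(members) if members else means[c])
        if new_means == means:  # fix-point reached
            break
        means = new_means
    return assign, means
```

The iteration stops exactly when another application of the rules derives nothing new, i.e., when the program's immediate-consequence operator has reached its fix-point.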
4 Query Optimisation Issues
An important step towards efficient query evaluation for inductive databases is to identify a suitable algebra in which query plans can be formulated. The algebraic framework presented in Section 2 provides a promising foundation for query optimisation. However, many challenges remain. For example, it is difficult to establish a cost model for mining operations, and formally enumerating all valid query plans for an inductive query and then choosing the optimal one is not straightforward.

We argue that if SQL allowed expressing our sub-join, ⋈_{sub,k}, in an intuitive manner, and algorithms implementing this operator were available in a DBMS, this would greatly facilitate the processing of fix-point queries for frequent item-set discovery. A Datalog expression mapping to our fix-point operator is more intuitive than the corresponding SQL expressions. In our opinion, a fix-point operator is more appropriately exploited in the deductive paradigm, which is a promising approach for inductive database systems. We may improve performance by exploiting relational optimisation techniques, for example optimising subset queries, index support, and algebraic equivalences for nested relational operators [7]. In the deductive paradigm, we may also apply pruning techniques by using anti-monotonicity.
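As an illustration of this anti-monotonicity pruning, the has_infrequent_subset test from the fix-point algorithm of Section 2.3 can be sketched as:

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Anti-monotonicity pruning: a k-item candidate can be discarded if
    any of its (k-1)-subsets is absent from the frequent (k-1)-item-sets,
    since support can only shrink as an item-set grows."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(sorted(candidate), k - 1))
```

In a deductive setting, the same check can be pushed into a rule body so that pruned candidates are never derived in the first place.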
5 Conclusion
We have presented an algebraic framework for querying inductive databases. The framework should be helpful for understanding the querying aspect of inductive databases. We have also presented a logic programming inductive query language. The results provide theoretical foundations for inductive database research and could be useful for query language design in inductive database systems.
References

1. Dzeroski, S.: Towards a general framework for data mining. In: Džeroski, S., Struyf, J. (eds.) KDID 2006. LNCS, vol. 4747, pp. 259–300. Springer, Heidelberg (2007)
2. Calders, T., Lakshmanan, L., Ng, R., Paredaens, J.: Expressive power of an algebra for data mining. ACM Transactions on Database Systems 31(4), 1169–1214 (2006)
3. Blockeel, H., Calders, T., Fromont, É., Goethals, B., Prado, A., Robardet, C.: An inductive database prototype based on virtual mining views. In: Proc. of ACM KDD (2008)
4. Richter, L., Wicker, J., Kessler, K., Kramer, S.: An inductive database and query language in the relational model. In: Proc. of EDBT, pp. 740–744. ACM, New York (2008)
5. Liu, H.C., Yu, J., Zeleznikow, J., Guan, Y.: A logic-based approach to mining inductive databases. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp. 270–277. Springer, Heidelberg (2007)
6. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
7. Liu, H.C., Yu, J.: Algebraic equivalences of nested relational operators. Information Systems 30(3), 167–204 (2005)