NEW ALGEBRAIC OPERATORS AND SQL PRIMITIVES FOR MINING ASSOCIATION RULES

Ricardo Timarán Pereira, Universidad de Nariño, San Juan de Pasto, Colombia
Marta Millán, Universidad del Valle, Santiago de Cali, Colombia
Fernando Machuca, CKMC, Bogotá, Colombia

Abstract
Many approaches to implementing data mining systems tightly coupled with a Database Management System (DBMS) have been proposed: expressing certain data mining operations as a series of SQL queries; extending the SQL language with unified operators that support certain pattern discovery tasks (DMQL, M-SQL, MINE RULE); and defining generic SQL primitives that facilitate the knowledge discovery process without supporting a particular task (NonStop SQL/MX primitives, the Count by Group primitive, and the FilterPartition, ComputeNodeStatistics and PredictionJoin primitives). A major drawback of the first approach is poor performance, mainly because rather simple SQL operations such as join, grouping and aggregation are not sufficient to execute data mining tasks efficiently. In this paper the last two approaches are combined to support new primitives based on new algebraic operators and to integrate a unified operator that efficiently supports association tasks in a DBMS. Two algebraic operators, Associator and Extract, are proposed. They are implemented in the SQL SELECT clause as the ASSOCIATOR RANGE and EXTRACT IN primitives, and unified in a new SQL operator for mining association rules called DESCRIBES ASSOCIATION RULES. Auxiliary algebraic operators useful in the association task are also introduced.

Key Words
Knowledge discovery in databases, data mining, primitives.

1. Introduction
Most data mining systems are loosely coupled with a Database Management System (DBMS). Thus, data mining algorithms are found outside the kernel of the DBMS. Integration is provided through an interface whose function, in most cases, is limited to the commands "read from" and "write to" [1]. The main disadvantages of this architecture are poor scalability and poor performance. The scalability problem arises when large data sets do not fit into the available memory and therefore cannot be mined efficiently. The performance problem arises when records are carried from the database address space to the application address space [2]. To solve both problems, mining algorithms should be integrated into the DBMS engine as primitives in a tightly coupled architecture [3].

Many approaches to implementing this kind of system have been proposed: expressing certain data mining operations as a series of SQL queries [4], [5], [6], [7], [8], [9]; extending the SQL language with unified operators that support certain pattern discovery tasks: DMQL [10], M-SQL [11], [12], MINE RULE [13], [14], [15]; and defining generic SQL primitives that facilitate the knowledge discovery process without supporting a particular task: NonStop SQL/MX primitives [16], [3], the Count by Group primitive [17], [18], and the FilterPartition, ComputeNodeStatistics and PredictionJoin primitives [19]. A major drawback of the first approach is poor performance, mainly because rather simple SQL operations such as join, grouping and aggregation are not sufficient to execute data mining tasks efficiently [5], [8], [19].

In this paper the last two approaches are combined to support new primitives based on new algebraic operators and to integrate a unified operator that efficiently supports association tasks in a DBMS. The primitives are unified in a single SQL command that expresses the association task. To guarantee the efficiency of the new SQL primitives, the new algebraic operators execute the most expensive processes of the data mining task.

Two algebraic operators for the association task, Associator and Extract, are proposed. From each tuple of a relation, Associator generates all subsets (itemsets) whose sizes fall within a user-specified range. Associator scans a table only once to obtain the itemsets. For each tuple of a relation, Extract keeps only the attribute values that are in a pattern set (a set of large itemsets). Extract also eliminates empty tuples (i.e., tuples whose attribute values are all null) and empty attributes (i.e., columns of a relation that are all null). Extract makes Associator more efficient by reducing the number of attributes with values in each tuple, and therefore the number of possible combinations. Associator and Extract are implemented in the SQL SELECT clause as the ASSOCIATOR RANGE and EXTRACT IN primitives, and unified in a new SQL operator for association rules called DESCRIBES ASSOCIATION RULES.

An auxiliary algebraic operator useful in the association task is also introduced. The Enhance operator transforms a simple dimension table (i.e., a table with schema (TID, ITEM)), very common in market basket analysis, into a multicolumn transactional table.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 gives a brief introduction to mining association rules. Section 4 defines the new relational algebraic operators for the association task. Section 5 describes the new SQL primitives for association rules. Section 6 presents the unified operator for association. Finally, section 7 presents the conclusions.

2. Related Work
One important approach to efficiently supporting knowledge discovery in databases is to extend a DBMS engine with new operators and primitives. Meo et al. [13], [14], [15] propose a unifying model to discover association rules. The model is based on a new operator, named MINE RULE, designed as an extension of the SQL language with a formal semantics. The semantics is described by means of an extended relational algebra with new operators (Group by, Unnest, Extend, Substitute, Rename, Powerset), which transform a relational table into an object-relational table (i.e., a table with multivalued attributes) in order to discover association rules. MINE RULE is supported by a tightly coupled architecture, where data mining is integrated within a classical SQL server [15]. The difference between that approach and the one proposed in this paper is that the former does not propose SQL primitives that could be used in other discovery tasks. In addition, the new algebraic operators proposed here preserve the closure property of the relational model and use relations with atomic attributes.

In [16], [3] the implementation of a set of new SQL primitives is reported: Transpose, Vertical Partitioning, Round-robin, Horizontal Partitioning, sequence functions and sampling, which were added to NonStop SQL/MX, a parallel, object-relational DBMS from the Tandem Division of Compaq. These primitives, along with other high-performance features of the SQL/MX engine, enable basic knowledge discovery tasks to be performed in a scalable, efficient and parallel manner. However, this type of integration is a very specific solution to the tight-coupling problem, since non-parallel DBMSs may not be able to use these primitives. Also, unlike the primitives proposed here, they do not have a formal definition in the relational algebra.

3. Association Rules
Association rules [20], [21] describe relationships between the different items in a database of sales transactions. Let I = {i1, i2, ..., im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its TID. A transaction T is said to contain X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The rule X ⇒ Y holds in D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in D if s% of the transactions in D contain X ∪ Y. Given a set of transactions D, the problem of mining association rules is to generate all association rules whose support and confidence are greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf), respectively.
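The definitions of support and confidence above can be illustrated with a minimal Python sketch. The function names and the toy database D below are hypothetical and chosen only for illustration; values are returned as fractions rather than percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    x, y = frozenset(x), frozenset(y)
    containing_x = [t for t in transactions if x <= t]
    if not containing_x:
        return 0.0  # X never occurs, so the rule has no evidence
    return sum(1 for t in containing_x if y <= t) / len(containing_x)


# Hypothetical toy database D: each transaction is a set of items.
D = [frozenset(t) for t in (
    {"i1", "i2"},
    {"i1", "i4"},
    {"i5"},
    {"i1", "i2", "i4"},
)]

# Rule {i1} => {i2}: support counts transactions containing X ∪ Y.
rule_support = support(D, {"i1", "i2"})
rule_confidence = confidence(D, {"i1"}, {"i2"})
```

A rule would be reported only if both values exceed minsup and minconf, respectively.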

4. New Operators of Relational Algebra for Association Task
The overall performance of mining association rules is determined by the discovery of the large itemsets, i.e., the itemsets whose support is above a predetermined minimum support [22], [23]. This process is facilitated by extending the relational algebra with the following new operators:

4.1 Associator Operator (α)
The Associator operator (α) generates, for each tuple of the relation R, all its possible subsets (itemsets) of different sizes. Associator takes each tuple t of R and two numbers IS and ES as input and returns, for each tuple t, the different combinations of attributes Xi from size IS up to size ES, as tuples in a new relation. In each tuple Xi, only the combined attributes have values; the remaining attributes are set to null. The order of the attributes in the schema of R determines the order of the attributes in the subsets. Formally, let A = {A1, ..., An} be the set of attributes of relation R; let n and m be the degree and cardinality of R, respectively; and let IS and ES be the initial and final sizes of the subsets to obtain, respectively:

α(IS; ES; R) = { ∪all Xi | Xi ⊆ ti, ∀i ∀k ( Xi = [...], (i ≤ (2^n − 1) * m), (k = IS..ES) ), and A1 [...]

TID   ITEM
100   i1
100   i2
200   i1
200   i4
300   i5

Figure 4. Relation R.

TID   ITEM1   ITEM2
100   i1      i2
200   i1      i4
300   i5

Figure 5. R1 = η(tid; item; R).

The other clauses are standard SQL clauses, and their functions are therefore well known. The following SQL statement generates, from the table Transaction(TID, ITEM1, ITEM2, ITEM3), all itemsets of size 1 and 2 with their corresponding supports in the table TransPowerSet(ITEM1, ITEM2, SUP).

    SELECT item1, item2, item3, count(*) AS sup
    FROM Transaction INTO TransPowerSet
    ASSOCIATOR item1, item2, sup RANGE 1 TO 2
    GROUP BY item1, item2
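As a rough illustration of these operators, the following Python sketch first pivots relation R of Figure 4 into the multicolumn form of Figure 5, in the spirit of the Enhance operator (η), and then enumerates and counts all itemsets of size 1 to 2, in the spirit of Associator followed by grouping with count(*). The function names are hypothetical, `None` stands in for SQL NULL, and itemsets are returned as compact value tuples rather than as full-width tuples padded with nulls. This is a simplification made for the sketch, not the paper's implementation.

```python
from collections import Counter
from itertools import combinations


def enhance(rows):
    """Pivot (TID, ITEM) rows into one multicolumn tuple per TID."""
    by_tid = {}
    for tid, item in rows:
        by_tid.setdefault(tid, []).append(item)
    width = max(len(items) for items in by_tid.values())
    result = []
    for tid, items in by_tid.items():
        padding = [None] * (width - len(items))  # None plays the role of NULL
        result.append((tid, *items, *padding))
    return result


def associator(tuples, initial_size, final_size):
    """Yield every combination of non-null values of each tuple,
    for sizes initial_size..final_size, preserving attribute order."""
    for values in tuples:
        present = [v for v in values if v is not None]
        for k in range(initial_size, final_size + 1):
            yield from combinations(present, k)


# Relation R of Figure 4.
R = [(100, "i1"), (100, "i2"), (200, "i1"), (200, "i4"), (300, "i5")]

R1 = enhance(R)  # multicolumn form, as in Figure 5

# Drop the TID column, then count each itemset of size 1 to 2.
supports = Counter(associator((row[1:] for row in R1), 1, 2))
```

Each tuple is visited once, which mirrors the single table scan attributed to Associator; the `Counter` plays the role of GROUP BY with count(*).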

5. SQL Primitives for Mining Association Rules

5.1 Associator Range Primitive
The algebraic operator Associator is implemented by the Associator Range primitive in the SQL SELECT clause. This primitive has the following syntax:

    SELECT FROM [INTO

5.2 Extract In Primitive
The algebraic operator Extract is implemented by the Extract In primitive in the SQL SELECT clause. Extract In has the following syntax:

    SELECT FROM [INTO ] WHERE
    EXTRACT IN

where the EXTRACT IN clause extracts, into the table named in INTO, all records from the table named in FROM, keeping in each record only the attribute values that are present both in the records of the referenced table and in the referenced set. The remaining attribute values become null.
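The behaviour described for Extract can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the pattern set is simplified to a plain set of item values instead of a set of large itemsets, the TID column is assumed to have been removed beforehand, and `None` stands in for SQL NULL.

```python
def extract(tuples, pattern_values):
    """Keep only attribute values found in `pattern_values`, then drop
    tuples and columns that have become entirely null."""
    # Step 1: null out every value that is not in the pattern set.
    masked = [
        tuple(v if v in pattern_values else None for v in row)
        for row in tuples
    ]
    # Step 2: eliminate empty tuples (all attribute values null).
    masked = [row for row in masked if any(v is not None for v in row)]
    if not masked:
        return []
    # Step 3: eliminate empty attributes (columns that are all null).
    keep = [
        j for j in range(len(masked[0]))
        if any(row[j] is not None for row in masked)
    ]
    return [tuple(row[j] for j in keep) for row in masked]


# Hypothetical multicolumn transactions (item columns only).
rows = [("i1", "i2"), ("i1", "i4"), ("i5", None)]

# Assume i1 and i4 are the only items that appear in large itemsets.
reduced = extract(rows, {"i1", "i4"})
```

Fewer non-null values per tuple mean fewer combinations for a subsequent Associator pass, which is the efficiency argument made for Extract.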
