Value Added Association Rules

T.Y. Lin
San Jose State University
[email protected]

Glossary

Association Rule Mining: Association rule mining is an exploratory learning task to discover hidden dependency relationships among items, such as states and actions, in a database.

Relation: A relation is a tuple (H, B) with H, the header, and B, the body, a set of tuples that all have the domain H. Such a relation closely corresponds to what is usually called the extension of a predicate in first-order logic, except that here we identify the places in the predicate with attribute names. Usually, in the relational model, a database schema is said to consist of a set of relation names, the headers associated with these names, and the constraints that should hold for every instance of the database schema.

Relational Database: A relational database is a finite set of relation schemas (called a database schema) and a corresponding set of relation instances (called a database instance). The relational database model represents data as two-dimensional tables called relations and consists of three basic components: a set of domains and a set of relations, operations on relations, and integrity rules.

Data Mining: Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing
it into useful information - information that can be used to increase revenue, cut costs, or both.

Summary

Association rule mining finds interesting associations and/or correlation relationships among large sets of data items. Association rules show attribute-value conditions that occur frequently together in a given dataset. A typical and widely used example of association rule mining is market basket analysis. For example, data are collected using bar-code scanners in supermarkets. Such market basket databases consist of a large number of transaction records. Each record lists all items bought by a customer in a single purchase transaction. Managers would be interested to know if certain groups of items are consistently purchased together. They could use this data for adjusting store layouts (placing items optimally with respect to each other), for cross-selling, for promotions, for catalog design, and to identify customer segments based on buying patterns.

Association rules provide information of this type in the form of "if-then" statements. These rules are computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature. In addition to the antecedent (the "if" part) and the consequent (the "then" part), an association rule has two numbers that express the degree of uncertainty about the rule. In association analysis the antecedent and consequent are sets of items (called itemsets) that are disjoint (do not have any items in common).

The first number is called the support for the rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. (The support is sometimes expressed as a percentage of the total number of records in the database.) The other number is known as the confidence of the rule. Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent. For example, if a supermarket database has 100,000 point-of-sale transactions, out of which 2,000 include both items A and B and 800 of these include item C, the association rule "If A and B are purchased then C is purchased on the same trip" has a support of 800 transactions (alternatively 0.8% = 800/100,000) and a confidence of 40% (= 800/2,000). One way to
think of support is that it is the probability that a randomly selected transaction from the database will contain all items in the antecedent and the consequent, whereas the confidence is the conditional probability that a randomly selected transaction will include all the items in the consequent given that the transaction includes all the items in the antecedent.

Lift is one more parameter of interest in association analysis. Lift is the ratio of confidence to expected confidence. Expected confidence in this case means, using the above example, "confidence, if buying A and B does not enhance the probability of buying C." It is the number of transactions that include the consequent divided by the total number of transactions. Suppose the total number of transactions that include C is 5,000. Then the expected confidence is 5,000/100,000 = 5%. For our supermarket example the lift = confidence/expected confidence = 40%/5% = 8. Hence lift is a value that gives us information about the increase in probability of the "then" (consequent) part given the "if" (antecedent) part.

Abstract. Value added product is an industrial term referring to a minor addition to some major product. In this paper, we borrow the term to denote a minor semantic addition to the well-known association rules. We consider the addition of numerical values to the attribute values, such as sale price, profit, degree of fuzziness, level of security, and so on. Such additions lead to the notion of random variables (as added values to the attributes) in the data model and hence to probabilistic considerations in data mining.
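As a concrete illustration of the support, confidence, and lift computations described in the Summary, the following minimal Python sketch reproduces the supermarket example above (the transaction counts are those quoted there):

```python
# Support, confidence, and lift for the rule "if A and B then C",
# using the counts from the supermarket example above.

total_transactions = 100_000
count_antecedent = 2_000     # transactions containing both A and B
count_rule = 800             # transactions containing A, B, and C
count_consequent = 5_000     # transactions containing C

support = count_rule / total_transactions                    # 0.008 (0.8%)
confidence = count_rule / count_antecedent                    # 0.40  (40%)
expected_confidence = count_consequent / total_transactions   # 0.05  (5%)
lift = confidence / expected_confidence                       # 8.0

print(f"support = {support:.1%}, confidence = {confidence:.0%}, lift = {lift:.0f}")
```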
1 Introduction
Association rules are mined from transaction databases with the goal of improving sales and services. Two standard measures, called support and confidence, are used for mining association rules. However, neither measure is directly linked to the use of association rules in the context of marketing. In order to resolve this problem, many proposals have been made that add market semantics to the data model. Using first-order logic, one can add semantics by functions and/or relations (function symbols or predicates). Barber and Hamilton, and Lu et al., considered semantics/constraints prescribed by binary relations (neighborhood systems) and predicates, respectively [7,?,?]. With the introduction of such semantics or constraints, the mined association rules are more suitable for marketing purposes. In this
paper, we consider a framework for value added association rules by attaching numerical values to itemsets, representing the profit, importance, or benefit of itemsets. Within the proposed framework, we re-examine some fundamental issues and open the door to a probabilistic approach to data mining.
2 Semantics and Relational Data Model
Relational database theory assumes that the universe is a classical set, namely, data is discrete and no additional structures are embedded. In practice, additional semantics often exist. For example, there may be monetary values attached to objects, similarities among events, distances between locations, and so on. To express the additional semantics, we need to extend the expressive power of the relational model. This may be achieved by adopting first-order logic, which uses relations and functions, or predicates and function symbols, to capture such additional semantic information.
2.1 Structure Added by Relations
There are many studies on semantic modeling of relationships between objects in a database. Typically, the relationships are expressed in terms of some specific relations or predicates in the logic view of databases. Details of such models can be found in [7,6,5,9,11].
2.2 Value Added by Functions
In this paper, we focus on the data model with values added by functions. For an attribute $A^j$, we assume that there exists a non-negative real-valued function, $f^j : Dom(A^j) \to R^+$, called a value added function, where $Dom(A^j)$ is the domain of the attribute. An attribute can be regarded as a map, $A^j : U \to Dom(A^j)$. By composing $A^j$ with $f^j$, we obtain $X^j = f^j \circ A^j : U \to R^+$, which maps an object to a non-negative real number. For simplicity, we write the inverse image as $X^j_u = (X^j)^{-1}(X^j(u))$. It consists of all objects having the same value on $A^j$, and is called a granule, namely the equivalence class containing $u$. The counting probability $P(X^j_u) = |X^j_u|/|U|$ gives:

Proposition 1. $X^j$ is a random variable.
A random variable is not a variable that varies randomly; it is merely "a function whose numerical values are determined by chance." In other words, the chance (probability) of the function taking each of its individual values is known. See [2] (page 88) for connections between the mathematical notion and its intuition.

Definition 1.
1. The system $(U, A^j, Dom(A^j)), j = 1, 2, \ldots, n$, is called a granular data model (GDM). This model allows one to generate automatically all possible attributes (features), including concept hierarchies [4].
2. The system $(U, A^j, X^j), j = 1, 2, \ldots, n$, is called a value added granular data model (VA-GDM).

We will work in the value added granular data model $(U, A^j, X^j), j = 1, 2, \ldots, n$. An itemset is a sub-tuple in a relation. In terms of the GDM, a sub-tuple corresponds to a finite intersection of elementary granules. By abuse of notation, we will use the same symbol to denote both an attribute value and the corresponding elementary granule. So a sub-tuple $b = (b_1, b_2, \ldots, b_q)$ could also mean the finite intersection, $b = b_1 \cap b_2 \cap \cdots \cap b_q$, of elementary granules.
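The following minimal Python sketch illustrates these notions on a hypothetical relation: a value added function maps attribute values to non-negative reals, each attribute value determines an elementary granule (equivalence class), and the counting probability of a granule is its relative size. All table contents, attribute names, and value functions below are illustrative assumptions, not data from the paper.

```python
from collections import defaultdict

# Hypothetical relation: each row is an object u in U, each column an attribute A^j.
table = [
    {"item": "milk",  "store": "S1"},
    {"item": "bread", "store": "S1"},
    {"item": "milk",  "store": "S2"},
    {"item": "bread", "store": "S2"},
    {"item": "milk",  "store": "S1"},
]

# Hypothetical value added functions f^j : Dom(A^j) -> R+ (e.g. profit per value).
value_fn = {
    "item":  {"milk": 0.8, "bread": 0.5},
    "store": {"S1": 0.2, "S2": 0.3},
}

def granules(attribute):
    """Elementary granules: object indices grouped by their value on the attribute."""
    groups = defaultdict(list)
    for u, row in enumerate(table):
        groups[row[attribute]].append(u)
    return groups

def counting_probability(granule):
    """P(X^j_u) = |X^j_u| / |U|."""
    return len(granule) / len(table)

# X^j = f^j o A^j maps each object to a non-negative real; the chance of each of
# its values is given by the counting probability of the corresponding granule.
for value, members in granules("item").items():
    print(f"X(item={value}) = {value_fn['item'][value]}, "
          f"P = {counting_probability(members):.2f}")
```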
3 Value Added Association Rules
The value function $f^j$ may be associated with intuitive interpretations such as profit. It seems intuitively natural to compute profit additively, namely, $f(A) = \sum_{i \in A} f(i)$ for an itemset in association rule mining. In general, the value may not be additive. For example, in security, the level of security of an itemset is often computed by $f(A) = \max_{i \in A} f(i)$, and integrity by $f(A) = \min_{i \in A} f(i)$. We will use the semantically neutral term and call $f$ a value function.

Definition 2. Large value itemsets (LVA-itemsets); by abuse of language, we may refer to them as value added association rules (although they are not in rule form). Let B be a subset of the attributes A, let $f$ be a real-valued function that assigns a value to each itemset, and let $s_q$ be a given threshold value for q-itemsets, q = 1, 2, ....

1. Sum-version: A granule $b = b_1 \cap b_2 \cap \cdots \cap b_q$, namely a sub-tuple $b = (b_1, b_2, \ldots, b_q)$, is a large value q-VA-itemset if $Sum(b) \ge s_q$, where
$$Sum(b) = \sum_j x^j_0 \cdot p(x^j_0) = \sum_j f^j(b_j) \cdot |b|/|U|, \qquad (1)$$

where $x^j_0 = f^j(b_j)$.

2. Min-version: A granule $b = b_1 \cap b_2 \cap \cdots \cap b_q$ is a large value q-VA-itemset if $Min(b) \ge s_q$, where

$$Min(b) = \min_j x^j_0 \cdot p(x^j_0) = \min_{j=1}^{q} f^j(b_j) \cdot |b|/|U|. \qquad (2)$$

3. Max-version: A granule $b = b_1 \cap b_2 \cap \cdots \cap b_q$ is a large value q-VA-itemset if $Max(b) \ge s_q$, where

$$Max(b) = \max_j x^j_0 \cdot p(x^j_0) = \max_{i=1}^{q} f^i(b_i) \cdot |b|/|U|. \qquad (3)$$

4. Traditional version: The Max- and Min-versions become the traditional one iff the profit function is the constant 1.

5. Mean version: It captures the mean trends of the data. Two attributes $A^{j_1}$, $A^{j_2}$ are mean associated if $|E(X^{j_1}) - E(X^{j_2})| \le s_q$, where $E(\cdot)$ is the expected value and $|\cdot|$ the absolute value.

An LVA-itemset is an association rule without direction, since we have used only the "support". One can easily derive value added and directed association rules from LVA-itemsets.
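A minimal Python sketch of the three measures in Definition 2, computed for a sub-tuple and its granule; the table, value functions, sub-tuple, and thresholds are hypothetical illustrations only.

```python
# Sum, Min, and Max measures (equations (1)-(3)) for a sub-tuple b = (b_1, ..., b_q).
# The relation and value functions below are hypothetical.

table = [
    {"A1": "x", "A2": "p"},
    {"A1": "x", "A2": "p"},
    {"A1": "x", "A2": "q"},
    {"A1": "y", "A2": "p"},
]
value_fn = {"A1": {"x": 0.8, "y": 0.1}, "A2": {"p": 0.5, "q": 0.9}}

def granule_size(subtuple):
    """|b|: number of objects matching every (attribute, value) pair in b."""
    return sum(all(row[a] == v for a, v in subtuple) for row in table)

def measures(subtuple):
    p = granule_size(subtuple) / len(table)          # |b| / |U|
    values = [value_fn[a][v] for a, v in subtuple]   # the f^j(b_j)
    return {
        "Sum": sum(values) * p,   # equation (1)
        "Min": min(values) * p,   # equation (2)
        "Max": max(values) * p,   # equation (3)
    }

b = [("A1", "x"), ("A2", "p")]    # sub-tuple, i.e. intersection of two granules
print(measures(b))                # b is a large value 2-VA-itemset if the chosen
                                  # measure meets the threshold s_2
```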
3.1 Algorithms for the Sum-version
An immediate thought would be to mimic the classical theory. Unfortunately, "apriori" may not always be applicable. Note that counting plays the major role in classical association rules. In the value added case, however, the function values are the main concern: thresholds are compared against the sum, max, min, or average of the function values. Thus, the results are quite different. Consider the case q = 2. Assume $s_1 = s_2$ and $f$ is not the constant 1. Let $b = b_1 \cap b_2$ be a 2-large granule. We have

$$Sum(b_1) = f(b_1) \cdot |b_1|/|U|, \quad Sum(b_2) = f(b_2) \cdot |b_2|/|U|, \qquad (4)$$

$$Sum(b) = Sum(b_1) + Sum(b_2) \ge s_2. \qquad (5)$$
In the classical case, $|b| \le |b_i|$, i = 1, 2, and the apriori algorithm exploits this relationship. In the value added case, no such relationship exists, so the apriori criteria are not useful. The algorithm for finding value added association rules is a brute-force exhaustive search: each q is computed independently.
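A brute-force sketch in Python of the Sum-version search, in which every length q is handled independently because, as just noted, Sum(b) is not bounded by the sums of its sub-granules; the data, thresholds, and names are hypothetical.

```python
from itertools import combinations, product

# Exhaustive search for Sum-version LVA-itemsets: no apriori pruning is available,
# so every candidate sub-tuple of every length q is evaluated. Data are hypothetical.

table = [
    {"A1": "x", "A2": "p", "A3": "u"},
    {"A1": "x", "A2": "p", "A3": "v"},
    {"A1": "y", "A2": "q", "A3": "u"},
]
value_fn = {"A1": {"x": 0.8, "y": 0.1},
            "A2": {"p": 0.5, "q": 0.9},
            "A3": {"u": 0.3, "v": 0.7}}
thresholds = {1: 0.4, 2: 0.5, 3: 0.6}      # s_q for each length q

def sum_measure(subtuple):
    support = sum(all(row[a] == v for a, v in subtuple) for row in table)
    return sum(value_fn[a][v] for a, v in subtuple) * support / len(table)

def sum_version_lva_itemsets():
    attrs = list(value_fn)
    found = []
    for q in range(1, len(attrs) + 1):                  # each length independently
        for attr_subset in combinations(attrs, q):
            domains = [value_fn[a].keys() for a in attr_subset]
            for values in product(*domains):            # every candidate sub-tuple
                b = list(zip(attr_subset, values))
                if sum_measure(b) >= thresholds[q]:
                    found.append(b)
    return found

print(sum_version_lva_itemsets())
```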
3.2 Algorithm for the Max- and Min-versions
As above, the key question is: can we conclude any relationship among $M(b_1)$, $M(b_2)$, and $M(b)$, where M = Max or Min? Nothing holds for Max, but for Min we do have

$$Min(b) = \min(f(b_1), f(b_2)) \cdot |b|/|U| \le Min(b_i), \quad i = 1, 2. \qquad (6)$$

Hence, we have apriori algorithms for the Min-version.
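Inequality (6) allows a level-wise, apriori-style search for the Min-version: a q-candidate needs to be examined only if all of its (q-1)-sub-tuples are already large. The following Python sketch is one possible realization under the same hypothetical representation used above; it is not the paper's implementation.

```python
# Apriori-style search for Min-version LVA-itemsets: by inequality (6), Min(b)
# never exceeds Min of any sub-granule, so non-large sub-tuples prune the search.
# The relation, value function, and threshold are hypothetical.

table = [
    {"A1": "x", "A2": "p", "A3": "u"},
    {"A1": "x", "A2": "p", "A3": "u"},
    {"A1": "y", "A2": "q", "A3": "v"},
]
value_fn = {"A1": {"x": 0.8, "y": 0.1},
            "A2": {"p": 0.5, "q": 0.9},
            "A3": {"u": 0.6, "v": 0.2}}
s = 0.3                                    # single threshold s_q = s for all q

def min_measure(b):
    support = sum(all(row[a] == v for a, v in b) for row in table)
    return min(value_fn[a][v] for a, v in b) * support / len(table)

def min_version_lva_itemsets():
    items = [(a, v) for a in value_fn for v in value_fn[a]]
    level = [frozenset([i]) for i in items if min_measure([i]) >= s]  # large 1-sets
    large = set(level)
    q = 2
    while level:
        candidates = set()
        for x in level:
            for y in level:
                c = x | y
                # Keep q-sets with one value per attribute whose (q-1)-subsets
                # are all large (the apriori pruning step).
                if len(c) == q and len({a for a, _ in c}) == q:
                    if all(c - {i} in large for i in c):
                        candidates.add(c)
        level = [c for c in candidates if min_measure(list(c)) >= s]
        large.update(level)
        q += 1
    return [sorted(c) for c in large]

print(min_version_lva_itemsets())
```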
3.3 Experiments
This section reports the experimental results of the algorithms for LVA-itemsets. There are three basic routines: generating candidates (potential LVA-itemsets), counting the candidates, and finally selecting the LVA-itemsets that exceed the threshold. Finding the LVA-itemsets is an exhaustive search; the search is conducted from longest to shortest. For each q, we search all q-tuples. The generated raw data set has 8 attributes and 500 tuples. The threshold for selecting an itemset is 5.7. Two potential LVA-itemsets are embedded in the data. Each granule is represented by an (attribute, value) pair.

1. The LVA-itemset (LVA-granule) of length 6 is ((C175), (C490), (C524), (C661), (C752), (C84)). Its frequency is 2, and the sum of its weights is (0.8 + 0.7 + 0.2 + 0.3 + 0.7 + 0.2) * 2 = 5.8.

2. The LVA-itemset (LVA-granule) of length 8 is ((C175), (C246), (C323), (C445), (C556), (C679), (C779), (C817)). Its frequency is 1, and the sum of its weights is (0.8 + 0.5 + 0.9 + 0.6 + 0.3 + 0.9 + 0.8) * 1 = 5.7.
The result of finding LVA-itemsets based on weights is summarized as follows:

Length | Candidates | Generation time | Count time | LVA-itemsets
-------|------------|-----------------|------------|-------------
1      | 110        | 0.01            | 0.0        | 88
2      | 4491       | 11.20           | 0.561      | 416
3      | 22540      | 322.173         | 3.585      | 342
4      | 34200      | 601.505         | 6.559      | 42
5      | 27926      | 327.671         | 6.269      | 14
6      | 13997      | 43.562          | 3.606      | 1
7      | 4000       | 1.112           | 1.172      | 0
8      | 5000       | 0.020           | 0.170      | 6
Total  | 107764     |                 |            | 909
In this table, the first column is the length of the itemsets. The second column is the number of candidates in the given data; for length 8, it is the table length (5000 rows). The third, fourth, and fifth columns are the times needed to generate the candidates, to count their support, and to find (check the criteria for) the LVA-itemsets. Generating candidates dominates the runtime of the algorithm. Since the dataset is converted to granules, counting the candidates is fast. The runtime is independent of the threshold: the number of candidates to be checked is the same regardless of the threshold value. LVA-itemsets are found at "random" lengths; for example, six LVA-itemsets of length 8 are found although none of length 7 exist, so longer LVA-itemsets are not influenced by shorter ones. The algorithms may be improved; performance is not our focus here.
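The remark that counting is fast once the dataset is converted to granules can be illustrated with a short sketch: each (attribute, value) pair is stored once as the set of object ids it covers, so the support of any candidate is just the size of an intersection. The data below are hypothetical, not the experimental data set.

```python
# Granule-based counting: precompute, for every (attribute, value) pair, the set of
# object ids it covers; the support of a candidate sub-tuple is the size of the
# intersection of the corresponding granules. The relation below is hypothetical.

table = [
    {"A1": "x", "A2": "p"},
    {"A1": "x", "A2": "q"},
    {"A1": "y", "A2": "p"},
    {"A1": "x", "A2": "p"},
]

granule = {}
for u, row in enumerate(table):
    for a, v in row.items():
        granule.setdefault((a, v), set()).add(u)

def support(candidate):
    """|b| of a candidate sub-tuple, computed by intersecting elementary granules."""
    return len(set.intersection(*(granule[item] for item in candidate)))

print(support([("A1", "x"), ("A2", "p")]))   # -> 2
```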
4 Probabilistic Data Mining Theory
The VA-GDM $(U, A^j, X^j), j = 1, 2, \ldots, n$, provides a framework for probabilistic considerations. The model naturally produces a numerical information table $(U, A^j, X^j), j = 1, 2, \ldots, n$, so that we can immediately apply techniques for numerical databases. Let $Y^i = X^{j_i}$, $i = 1, 2, \ldots, m$, $m < n$, be the reduct [10], that is, the smallest functionally independent subset. The collection $V = \{(Y^1(u), Y^2(u), \ldots, Y^m(u)) \mid u \in U\}$ is a finite set of points in Euclidean space. Since U, and hence V, is finite, a functional dependency can take polynomial form. So the rest of the $X^j$ are
polynomials over the $Y^i$. We will regard them as random variables over V. By combining the work of [4] and [3], we can express all possible numerical attributes (features) by finitely many polynomials over the $Y^i$. In other words, we will be able to search for association rules over all possible attributes, not only the given attributes, using probability theory. We will report this study in the near future.
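As a small, hedged illustration of the claim that functional dependencies over a finite point set can be written in polynomial form: with a single reduct attribute Y and one dependent attribute X, an interpolating polynomial of degree |V| - 1 reproduces X exactly on V. The values below are hypothetical and the sketch assumes NumPy is available.

```python
import numpy as np

# Hypothetical finite point set V: values of one reduct attribute Y^1 and one
# dependent attribute X^j on the objects of U.
y = np.array([1.0, 2.0, 3.0, 4.0])   # Y^1(u) for each u in U
x = np.array([0.5, 0.9, 0.2, 0.7])   # X^j(u), functionally dependent on Y^1

# Since V is finite, X^j can be written exactly as a polynomial in Y^1 of
# degree |V| - 1 (here, ordinary polynomial interpolation).
coeffs = np.polyfit(y, x, deg=len(y) - 1)

# The polynomial reproduces X^j on every point of V (up to rounding error).
print(np.allclose(np.polyval(coeffs, y), x))   # True
```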
5 Conclusions
Value added association rules extend standard association rules by taking the semantics of the data into consideration. The value added granular data model allows us to import probability theory into data mining. In general, there are no apriori criteria for the value added cases. However, if we require the thresholds to increase with the lengths, that is, $S_q \ge q \cdot \max(s_1, s_2, \ldots, s_q)$, then there are apriori criteria: q-large implies that all sub-tuples are (q - i)-large, where $i \ge 0$. This paper reports our preliminary findings; more results will be presented in the near future.