Scoring the Data using Association Rules Liu B., Ma Y., Wong C. K., Yu P. S.
Presented by – Omkar Deshmukh 07007005
Outline: Motivation, The Problem, Target, Evaluation Criterion, Association Rules, Challenges/Solutions, Scoring using Association Rules, Experimental Results, Critical Analysis, References
Motivation
Use association rules fully for target selection
Current methods are not "good enough"; more appropriate and extensive use of association rules is possible
Even a small improvement in accuracy can lead to a significant increase in profits
Potential to achieve that improvement
Highly useful in market basket analysis
The Problem
D – the dataset used
N – the number of data cases
l – the number of attributes per data case
Positive class – the minority class
Negative class – the majority class
Target
Use association rules to correctly predict the positive-class data cases in the testing data
Use a scoring function to predict the likelihood of a data case belonging to the positive class
Use a ranking procedure that ranks each data case by its score
Using all of the above, predict the data cases belonging to the positive class
Association Rules
Let I be the set of items and D be the dataset
Association rule: an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅
A rule has confidence c in D if c% of the transactions in D that support X also support Y
A rule has support s in D if s% of the transactions in D contain X ∪ Y
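The support and confidence definitions above can be sketched directly. The transactions and items below are made-up examples for illustration only:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, X, Y):
    """conf(X => Y) = support(X union Y) / support(X)."""
    return support(transactions, set(X) | set(Y)) / support(transactions, X)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
# For X = {bread}, Y = {milk}:
# support({bread, milk}) = 2/4 = 0.5, support({bread}) = 3/4,
# so conf({bread} => {milk}) = 0.5 / 0.75 = 2/3
```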
Challenges/Solutions (1/5)
minsup: minimum support required for a rule to be considered in the rule base
minconf: minimum confidence required for a rule to be considered in the rule base
minsup and minconf cannot be too high or too low in the case of imbalanced data
Solution: use a different minsup and minconf for rules of different classes
Challenges/Solutions (2/5)
t_minsup: overall minimum support
For each class c_i: minsup_i = t_minsup × |D_i| / |D|
|D_i| is the number of class c_i cases in the training data; |D| is the total number of cases in the training data
minconf is likewise set per class with a formula based on the class distribution
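A sketch of the per-class support thresholds, assuming they distribute t_minsup in proportion to class frequency (t_minsup × |D_i| / |D|); the class counts are an invented example:

```python
def class_minsup(t_minsup, class_counts):
    """Distribute the overall minimum support across classes in
    proportion to each class's frequency in the training data."""
    total = sum(class_counts.values())
    return {c: t_minsup * n / total for c, n in class_counts.items()}

# Imbalanced data: 50 positive vs. 950 negative cases, t_minsup = 1%.
# Positive-class rules then need only 0.05% support, so rules of the
# minority class are not filtered out by a one-size-fits-all threshold.
thresholds = class_minsup(0.01, {"pos": 50, "neg": 950})
```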
Challenges/Solutions (3/5)
Mining association rules from relational data
Need to discretize numerical attributes into intervals; potential for inducing errors
Solution: entropy-based discretization method
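The core step of entropy-based discretization is choosing the cut point that minimizes the weighted class entropy of the resulting intervals. This is only a minimal single-cut sketch (full methods such as Fayyad-Irani recurse with a stopping criterion); the age data is an invented example:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def best_cut(values, labels):
    """Return the single binary cut point on `values` that minimizes
    the size-weighted class entropy of the two resulting intervals."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot cut between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        w_ent = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if w_ent < best[0]:
            best = (w_ent, cut)
    return best[1]

ages = [18, 22, 25, 40, 45, 50]
labels = ["n", "n", "n", "y", "y", "y"]
# The cut between 25 and 40 separates the classes perfectly -> 32.5
```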
Challenges/Solutions (4/5)
Large number of rules present; need to improve efficiency without compromising accuracy
Solution: pruning of rules
Error rate of a rule = 1 − confidence of the rule
If rule r's estimated error rate is higher than the estimated error rate of the rule r⁻ obtained by deleting one condition from r's antecedent, then rule r is pruned
This way, if r is a rule in the rule base, then r⁻ must also be a rule and in memory
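The pruning test above can be sketched as follows. The dict-of-antecedents representation and the confidences are invented for illustration; the error estimate is the slide's simple 1 − confidence, not the paper's pessimistic estimate:

```python
def error_rate(conf):
    """Slide's definition: error rate of a rule = 1 - confidence."""
    return 1.0 - conf

def prune(rules):
    """Keep rule r only if no one-condition-shorter generalization r-
    has an estimated error rate at or below r's.  `rules` maps a
    frozenset antecedent to the rule's confidence; generalizations
    are looked up in the same map."""
    kept = {}
    for ante, conf in rules.items():
        generalizations = [ante - {c} for c in ante]
        if any(
            g in rules and error_rate(conf) > error_rate(rules[g])
            for g in generalizations
        ):
            continue  # a shorter rule is more accurate: prune r
        kept[ante] = conf
    return kept

rules = {
    frozenset({"a"}): 0.80,
    frozenset({"a", "b"}): 0.75,  # worse than {a} alone -> pruned
    frozenset({"a", "c"}): 0.95,  # better than {a} -> kept
}
pruned = prune(rules)
```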
Challenges/Solutions (5/5)
Association rules used for scoring bring "completeness": both an opportunity and a challenge
A lot of information, but also conflicting information
Solution: come up with an intuitive scoring-based method
Try to look at all rules covering a given data case while still avoiding conflicts
Scoring using Association Rules (1/4)
Scoring Based on Associations: SBA
Many confident positive-class rules → high score
Positive-class rules not confident, negative-class rules confident → low score
Scoring using Association Rules (2/4) SBA – Scoring Based on Associations
Scoring using Association Rules (3/4)
conf_j and sup_j are the original confidence and support of the negative-class rule j
Scoring using Association Rules (4/4)
|POS| = number of rules in POS; |NEG| = number of rules in NEG
P is the priority value function
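As a rough illustration of the intuition on these slides (confident positive-class rules push a case's score up, confident negative-class rules push it down), here is a simplified sketch. It is not the paper's exact SBA formula, which additionally involves the priority value function P; the rule lists are invented examples:

```python
def sba_score(pos_rules, neg_rules):
    """Simplified score for one data case.

    pos_rules / neg_rules: (confidence, support) pairs for the
    positive- and negative-class rules covering the case.  Each rule
    contributes support-weighted evidence: a positive rule votes with
    its confidence, a negative rule with 1 - confidence.
    """
    num = sum(c * s for c, s in pos_rules) + sum((1 - c) * s for c, s in neg_rules)
    den = sum(s for _, s in pos_rules) + sum(s for _, s in neg_rules)
    return num / den if den else 0.0

# A case covered by confident positive rules scores high...
high = sba_score([(0.9, 0.1), (0.8, 0.2)], [])
# ...while a case covered only by a confident negative rule scores low.
low = sba_score([], [(0.9, 0.1)])
```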
Evaluation Criterion
Lift curve: a cumulative plot of the percentage of total positive cases vs. the percentage of testing cases
The lift curve is often not suitable for comparing multiple performances
Solution: use the lift index
Lift index: area under the lift curve
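A sketch of the lift index as the normalized area under the cumulative lift curve, computed here by averaging the fraction of positives captured at each rank (an assumption about the exact normalization; the label sequences are invented examples):

```python
def lift_index(ranked_labels):
    """Normalized area under the cumulative lift curve.

    ranked_labels: 1/0 class labels sorted by descending score.
    At each position we record the cumulative fraction of all
    positives captured so far, then average over all positions.
    """
    total_pos = sum(ranked_labels)
    if total_pos == 0:
        return 0.0
    n = len(ranked_labels)
    cum, area = 0, 0.0
    for label in ranked_labels:
        cum += label
        area += cum / total_pos  # cumulative % of positives captured
    return area / n

# Perfect ranking (all positives first) scores higher
# than the worst ranking (all positives last).
good = lift_index([1, 1, 0, 0])
bad = lift_index([0, 0, 1, 1])
```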
Experimental Results
The SBA method was compared with three other state-of-the-art classification methods, viz. C4.5, the Naïve Bayesian classifier, and boosted C4.5
20 datasets were used
t_minsup: 1%; rule limit: 80,000; k: 3
Positives of the work done
Proposes a method that uses association rule mining to extract as many relevant rules as possible, and combines them effectively through a generic scoring function
The scoring function makes intuitive sense, and that intuition is backed by impressive experimental results
Imbalanced datasets are especially attended to
Future work possible
Is it possible to "learn" the scoring function?
Are other ranking functions possible? Can we learn those too?
How do different discretization procedures affect the output?
How do different pruning methods affect the output?
Is the "lift index" really a good criterion to measure the performance of a target selection method? What other criteria can be used?
Can this be extended to the "multiple" target selection problem?
References
[1] Liu B., Ma Y., Wong C. K., Yu P. S., "Scoring the Data Using Association Rules", Applied Intelligence, Vol. 18, No. 2, March-April 2003.
[2] Agrawal R., Srikant R., "Fast Algorithms for Mining Association Rules", Proceedings of the International Conference on Very Large Data Bases (VLDB-94), pp. 487-499, 1994.
[3] Bayardo R., Agrawal R., Gunopulos D., "Constraint-Based Rule Mining in Large, Dense Databases", Proceedings of the IEEE International Conference on Data Engineering (ICDE-99), pp. 188-197, 1999.
[4] http://en.wikipedia.org/wiki/Association_rule_learning
Thank you!
Questions?