Scoring the Data using Association Rules
Liu B., Ma Y., Wong C. K., Yu P. S.

Presented by – Omkar Deshmukh 07007005

Outline
 Motivation
 The Problem
 Target
 Evaluation Criterion
 Association Rules
 Challenges and Solutions
 Scoring using Association Rules
 Experimental Results
 Critical Analysis
 References

Motivation
 Use association rules fully for target selection
 Current methods are not "good enough"
 More appropriate and extensive use of association rules is possible
 Even a small improvement in accuracy can lead to a significant increase in profits
 Potential to achieve that improvement
 Highly useful in market-basket analysis

The Problem
 D: the dataset used
 N: the number of data cases
 l: the number of attributes of each data case
 Positive class: the minority class
 Negative class: the majority class

Target
 Use association rules to correctly predict the positive-class data cases from the testing data
 Use a scoring function to predict the likelihood of a data case belonging to the positive class
 Use a ranking procedure which uses these scores to rank each data case
 Using all the above, predict the data cases belonging to the positive class

Association Rules
 Let I be the set of items and D be the dataset
 Association rule: an implication of the form X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅
 A rule has confidence c in D if c% of the transactions in D that support X also support Y
 A rule has support s in D if s% of the transactions in D contain X ∪ Y

Challenges/Solutions (1/5)
 minsup: minimum support required for a rule to be considered in the rule base
 minconf: minimum confidence required for a rule to be considered in the rule base
 minsup and minconf cannot be too high or too low in the case of imbalanced data
 Solution: use different minsup and minconf for rules of different classes

Challenges/Solutions (2/5)
 t_minsup: overall minimum support
 For each class c_i, minsup_{c_i} = t_minsup × |D_{c_i}| / |D|, where |D_{c_i}| is the number of class c_i cases in the training data and |D| is the total number of cases in the training data
 A corresponding per-class formula is used for minconf

Challenges/Solutions (3/5)
 Mining association rules from relational data
 Need to discretize numerical attributes into intervals
 Potential for inducing errors
 Solution: entropy-based discretization method
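One step of entropy-based discretization can be sketched as below: find the binary cut point on a numeric attribute that minimises the weighted class entropy of the two resulting intervals. This is a minimal illustration; the full method recurses on each interval and uses an MDL stopping criterion, both omitted here:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the cut point minimising the weighted class entropy
    of the two intervals it induces (one level of the recursion)."""
    pairs = sorted(zip(values, labels))
    best_cut, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_cut, best_e = (pairs[i - 1][0] + pairs[i][0]) / 2, e
    return best_cut

ages = [18, 22, 25, 40, 45, 60]
cls = ["pos", "pos", "pos", "neg", "neg", "neg"]
print(best_split(ages, cls))   # 32.5: a pure cut between the two classes
```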

Challenges/Solutions (4/5)
 Large number of rules present
 Goal: improve efficiency without compromising accuracy
 Solution: pruning of rules
 Error rate of a rule = 1 − confidence of the rule
 If rule r's estimated error rate is higher than the estimated error rate of rule r′ (obtained by deleting one condition from the conditions of r), then rule r is pruned
 This way, if r is a rule, then r′ must also be a rule and in memory
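The pruning step can be sketched as follows, assuming the rule base maps each rule's condition set to its confidence (the data layout and names are mine): a rule is dropped whenever some one-condition generalisation of it has a strictly lower estimated error.

```python
def prune_rules(rules):
    """Prune rule r if a generalisation r' (r with one condition
    deleted) has a lower estimated error, where error = 1 - confidence.
    `rules` maps frozenset-of-conditions -> confidence."""
    kept = {}
    for conds, conf in rules.items():
        err = 1 - conf
        generalisations = (conds - {c} for c in conds)
        if any(g in rules and 1 - rules[g] < err for g in generalisations):
            continue        # r' is simpler and more accurate: drop r
        kept[conds] = conf
    return kept

rules = {
    frozenset({"a", "b"}): 0.90,   # error 0.10
    frozenset({"a"}):      0.95,   # error 0.05 -> makes {a, b} redundant
    frozenset({"b"}):      0.60,
}
kept = prune_rules(rules)
print([sorted(c) for c in sorted(kept, key=sorted)])   # [['a'], ['b']]
```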

Challenges/Solutions (5/5)
 Association rules used for scoring bring "completeness"
 Both an opportunity and a challenge: a lot of information, but also conflicting information
 Solution: come up with an intuitive scoring-based method
 Try to look at all rules covering a given data case while still avoiding conflicts

Scoring using Association Rules (1/4)
 Scoring Based on Associations: SBA
 Many confident positive-class rules cover a case → high score
 Positive-class rules covering it are not confident, or negative-class rules are confident → low score

Scoring using Association Rules (2/4)  SBA – Scoring Based on Associations

Scoring using Association Rules (3/4)

 conf_j and sup_j are the original confidence and support of the negative-class rule j

Scoring using Association Rules (4/4)
 |POS| = number of rules in POS
 |NEG| = number of rules in NEG
 P is the priority value function
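Putting the pieces together, an SBA-style score can be sketched as below. This is an illustrative simplification, not the paper's exact formula: covering rules are weighted by confidence × support, a confident negative rule contributes (1 − conf), and the priority function P is omitted. All names and example rules are mine:

```python
def score(case, pos_rules, neg_rules):
    """Illustrative SBA-style score for one data case.
    Rules are (condition-set, confidence, support) triples; a rule
    covers the case when its conditions are a subset of the case."""
    num = den = 0.0
    for conds, conf, sup in pos_rules:
        if conds <= case:
            num += conf * sup          # confident positive rule raises score
            den += sup
    for conds, conf, sup in neg_rules:
        if conds <= case:
            num += (1 - conf) * sup    # confident negative rule lowers score
            den += sup
    return num / den if den else 0.0

pos_rules = [({"young", "urban"}, 0.9, 0.10)]
neg_rules = [({"rural"}, 0.8, 0.30)]
print(score({"young", "urban"}, pos_rules, neg_rules))   # ≈ 0.9
print(score({"young", "rural"}, pos_rules, neg_rules))   # ≈ 0.2
```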

Evaluation Criterion
 Lift curve: a cumulative plot of the percentage of total positive cases vs. the percentage of testing cases
 The lift curve is often not suitable for comparing multiple performances
 Solution: use the lift index
 Lift index: area under the lift curve
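The lift index can be computed by ranking the test cases by score and integrating the cumulative gains curve. This is one straightforward reading of "area under the lift curve"; the paper's exact binning may differ, and the function name is mine:

```python
def lift_index(scores, labels):
    """Rank cases by score (descending), build the cumulative lift curve
    (fraction of all positives captured vs. fraction of cases examined),
    and return the area under it via the trapezoidal rule.
    `labels`: 1 for a positive case, 0 for a negative case."""
    ranked = [l for _, l in sorted(zip(scores, labels), reverse=True)]
    total_pos = sum(ranked)
    n = len(ranked)
    caught, area, prev = 0, 0.0, 0.0
    for l in ranked:
        caught += l
        frac = caught / total_pos
        area += (prev + frac) / 2 / n    # trapezoid of width 1/n
        prev = frac
    return area

# A perfect ranking puts both positives first; a random one scores ~0.5
print(lift_index([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))   # 0.75
```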

Experimental Results
 The SBA method was compared with three other state-of-the-art classifier methods, viz. C4.5, the Naïve Bayesian classifier, and Boosted C4.5
 20 datasets were used
 Parameters: t_minsup = 1%, rule limit = 80,000, k = 3

Positives of the work done
 Proposes a method to use association rule mining to extract the maximum number of relevant rules possible, and uses them effectively via a generic scoring function
 The suggested scoring function makes intuitive sense, and that intuition is backed by impressive results
 Imbalanced datasets are especially attended to

Future work possible
 Is it possible to "learn" the scoring function?
 Are other ranking functions possible? Can we learn those too?
 How do different discretization procedures affect the output?
 How do different pruning methods affect the output?
 Is the "lift index" really a good criterion to measure the performance of a target selection method? What other criteria can be used?
 Can this be extended to the "multiple" target selection problem?

References
[1] Liu B., Ma Y., Wong C. K., Yu P. S., "Scoring the Data Using Association Rules", Applied Intelligence, Vol. 18, No. 2, March-April 2003.
[2] Agrawal R., Srikant R., "Fast Algorithms for Mining Association Rules", Proceedings of the International Conference on Very Large Data Bases (VLDB-94), pp. 487-499, 1994.
[3] Bayardo R., Agrawal R., Gunopulos D., "Constraint-Based Rule Mining in Large, Dense Databases", Proceedings of the IEEE International Conference on Data Engineering (ICDE-99), pp. 188-197, 1999.
[4] http://en.wikipedia.org/wiki/Association_rule_learning

Thank you!

Questions?
