Decision Trees
Dr. Ankur M. Teredesai
Overview
• Decision trees
• Appropriate problems for decision trees
• Entropy and information gain
• The ID3 algorithm
• Avoiding overfitting via pruning
• Handling continuous-valued attributes
• Handling missing attribute values
• Alternative measures for selecting attributes
Time to look at the classification model. The decision tree works a lot like playing twenty questions. The tree on the right decides whether it is possible to go play tennis outdoors.
[Figure: a small example tree. Outlook is tested at the root (Sunny, Overcast), followed by Temp tests (< 35F, < 70F) leading to + and - leaves.]
E.g., it is overcast but reasonably warm (55F). Answer: Go out and play!
Decision Trees
Definition: A decision tree is a tree such that:
• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
[Figure: the Play Tennis decision tree. Outlook: sunny → Humidity (high → no, normal → yes); overcast → yes; rainy → Windy (false → yes, true → no).]
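The tree above can also be written down directly as a small data structure. Below is a minimal sketch in Python, assuming a nested-dict representation; the names TREE and classify are illustrative, not from the slides.

```python
# A minimal sketch of the Play Tennis tree above as a nested dict.
# The names TREE and classify are illustrative, not part of the original slides.
TREE = {
    "attribute": "Outlook",
    "branches": {
        "sunny":    {"attribute": "Humidity",
                     "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy":    {"attribute": "Windy",
                     "branches": {"false": "yes", "true": "no"}},
    },
}

def classify(node, instance):
    """Walk the tree like twenty questions until a leaf (a class label) is reached."""
    while isinstance(node, dict):              # internal node: test an attribute
        value = instance[node["attribute"]]    # look up the attribute value
        node = node["branches"][value]         # follow the matching branch
    return node                                # leaf: the classification

print(classify(TREE, {"Outlook": "overcast"}))                   # yes
print(classify(TREE, {"Outlook": "sunny", "Humidity": "high"}))  # no
```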
Data Set for Playing Tennis

Outlook    Temp  Humidity  Windy  Play
Sunny      Hot   High      False  No
Sunny      Mild  High      False  No
Sunny      Hot   High      True   No
Sunny      Cool  Normal    False  Yes
Overcast   Hot   High      False  Yes
Rainy      Mild  Normal    False  Yes
Rainy      Mild  High      False  Yes
Sunny      Mild  Normal    True   Yes
Rainy      Cool  Normal    False  Yes
Overcast   Mild  High      True   Yes
Rainy      Cool  Normal    True   No
Overcast   Hot   Normal    False  Yes
Overcast   Cool  Normal    True   Yes
Rainy      Mild  High      True   No
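The later entropy and information-gain examples are easier to follow with these 14 instances in machine-readable form. A minimal sketch; the names PLAY_TENNIS, ATTRIBUTES, and TARGET are illustrative, not from the slides.

```python
# The 14 Play Tennis instances from the table above.
PLAY_TENNIS = [
    {"Outlook": o, "Temp": t, "Humidity": h, "Windy": w, "Play": p}
    for o, t, h, w, p in [
        ("Sunny", "Hot", "High", "False", "No"),
        ("Sunny", "Mild", "High", "False", "No"),
        ("Sunny", "Hot", "High", "True", "No"),
        ("Sunny", "Cool", "Normal", "False", "Yes"),
        ("Overcast", "Hot", "High", "False", "Yes"),
        ("Rainy", "Mild", "Normal", "False", "Yes"),
        ("Rainy", "Mild", "High", "False", "Yes"),
        ("Sunny", "Mild", "Normal", "True", "Yes"),
        ("Rainy", "Cool", "Normal", "False", "Yes"),
        ("Overcast", "Mild", "High", "True", "Yes"),
        ("Rainy", "Cool", "Normal", "True", "No"),
        ("Overcast", "Hot", "Normal", "False", "Yes"),
        ("Overcast", "Cool", "Normal", "True", "Yes"),
        ("Rainy", "Mild", "High", "True", "No"),
    ]
]
ATTRIBUTES = ["Outlook", "Temp", "Humidity", "Windy"]  # predictor attributes
TARGET = "Play"                                        # class label attribute
```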
Decision Tree For Playing Tennis
[Figure: the tree learned from this data. Outlook: sunny → Humidity (high → no, normal → yes); overcast → yes; rainy → Windy (false → yes, true → no).]
When to Consider Decision Trees
• Instances are described by attribute-value pairs with discrete values (e.g., Outlook = Sunny).
• The classification is over discrete values (e.g., yes/no).
• It is okay to have disjunctive descriptions: each path in the tree represents a conjunction of attribute tests, and the tree as a whole represents a disjunction of such combinations. Any Boolean function can be represented!
• It is okay for the training data to contain errors: decision trees are robust to classification errors in the training data.
• It is okay for the training data to contain missing values: decision trees can be used even if instances have missing attribute values.
Decision Tree Induction
Basic algorithm:
1. A ← the "best" decision attribute for a node N.
2. Assign A as the decision attribute for node N.
3. For each value of A, create a new descendant of node N.
4. Sort the training examples to the leaf nodes.
5. IF the training examples are perfectly classified, THEN STOP; ELSE iterate over the new leaf nodes.
How do we pick the splitting attribute?
Determine the attribute that contributes the most information. For example, if we knew the Outlook was Sunny, it is more likely that we would go out and play than if we just knew it is not humid outside!
The measure we need is called the Information Gain of the attribute. Once we know the splitting attribute, we branch the tree in the direction of all the unique values for that attribute; for example, for 3 unique values, a 3-way branch is necessary.
Decision Tree Induction
[Figure: the root test Outlook splits the training data into three partitions: Sunny, Overcast, and Rain.]

Outlook = Sunny:
Outlook  Temp  Hum     Wind    Play
Sunny    Hot   High    Weak    No
Sunny    Hot   High    Strong  No
Sunny    Mild  High    Weak    No
Sunny    Cool  Normal  Weak    Yes
Sunny    Mild  Normal  Strong  Yes

Outlook = Overcast:
Outlook   Temp  Hum     Wind    Play
Overcast  Hot   High    Weak    Yes
Overcast  Cool  Normal  Strong  Yes

Outlook = Rain:
Outlook  Temp  Hum     Wind    Play
Rain     Mild  High    Weak    Yes
Rain     Cool  Normal  Weak    Yes
Rain     Cool  Normal  Strong  No
Rain     Mild  Normal  Weak    Yes
Rain     Mild  High    Strong  No
Entropy
Let S be a sample of training examples, let p+ be the proportion of positive examples in S, and let p- be the proportion of negative examples in S. Then entropy measures the impurity of S:
E(S) = - p+ log2 p+ - p- log2 p-
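A small sketch of this formula in Python. It handles the general multi-class case, which reduces to the two-class form above; the function name entropy and the PLAY_TENNIS list sketched earlier are illustrative.

```python
import math
from collections import Counter

def entropy(examples, target="Play"):
    """E(S) = -sum over classes of p_i * log2(p_i);
    for two classes this is -p+ log2 p+ - p- log2 p-."""
    counts = Counter(ex[target] for ex in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

print(round(entropy(PLAY_TENNIS), 2))  # 0.94 for the 9-yes / 5-no sample
```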
Entropy
Entropy is a measure of how much we know about a particular class:
• The more we know, the lower the entropy.
For an attribute a with v distinct values {a1, a2, …, ak, …, av} over a data set s, the expected entropy after splitting on a is:
E(a) = Σk (( |s1k| + |s2k| + … + |smk| ) / |s|) · Ik(a), summed over the v values of a
where Ik(a) is the expected information needed to classify a sample in the partition with a = ak, and |sik| is the number of training samples of class ci that have a = ak.
Entropy Example from the Dataset
In the Play Tennis dataset we have two target classes: yes and no. Out of 14 instances, 9 are classified yes and the rest no.
The yes term: -(9/14) log2(9/14) = 0.41
The no term: -(5/14) log2(5/14) = 0.53
E(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.41 + 0.53 = 0.94
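The same arithmetic can be checked in a couple of lines (illustrative, standard library only).

```python
import math

p_yes, p_no = 9 / 14, 5 / 14
term_yes = -p_yes * math.log2(p_yes)   # about 0.41
term_no = -p_no * math.log2(p_no)      # about 0.53
print(round(term_yes + term_no, 2))    # 0.94
```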
Information Gain Information Gain is the expected reduction in entropy caused by partitioning the instances according to a given attribute.
Gain(S, A) = E(S) - Σ_{v ∈ Values(A)} ( |Sv| / |S| ) · E(Sv)
where Sv = { s ∈ S | A(s) = v }
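A sketch of Gain(S, A) in Python, reusing the entropy helper and the PLAY_TENNIS data sketched earlier (names illustrative).

```python
from collections import defaultdict

def information_gain(examples, attribute, target="Play"):
    """Gain(S, A) = E(S) - sum over values v of (|Sv| / |S|) * E(Sv)."""
    partitions = defaultdict(list)
    for ex in examples:
        partitions[ex[attribute]].append(ex)          # Sv = {s in S : A(s) = v}
    remainder = sum(len(sv) / len(examples) * entropy(sv, target)
                    for sv in partitions.values())
    return entropy(examples, target) - remainder

for a in ATTRIBUTES:
    print(a, round(information_gain(PLAY_TENNIS, a), 3))
# Outlook has the highest gain on the full data set, so it becomes the root test.
```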
Example
[Figure: the Outlook test at the root and the three data partitions (Sunny, Overcast, Rain) from the previous slide, repeated.]
Which attribute should be tested here?
Gain(Ssunny, Humidity)    = 0.970 - (3/5) 0.0 - (2/5) 0.0 = 0.970
Gain(Ssunny, Temperature) = 0.970 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = 0.570
Gain(Ssunny, Wind)        = 0.970 - (2/5) 1.0 - (3/5) 0.918 = 0.019
ID3 Algorithm
Informally:
• Determine the attribute with the highest information gain on the training set.
• Use this attribute as the root, and create a branch for each of the values the attribute can have.
• For each branch, repeat the process with the subset of the training set that falls under that branch.
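Putting the pieces together, a minimal ID3 sketch built on the information_gain helper above. It omits pruning and missing-value handling, and the function name id3 is illustrative.

```python
from collections import Counter

def id3(examples, attributes, target="Play"):
    """Recursively grow a tree, always splitting on the highest-gain attribute."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                      # all examples share one class
        return labels[0]
    if not attributes:                             # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in {ex[best] for ex in examples}:    # one branch per observed value
        subset = [ex for ex in examples if ex[best] == value]
        tree["branches"][value] = id3(subset, remaining, target)
    return tree

tree = id3(PLAY_TENNIS, ATTRIBUTES)
print(tree["attribute"])  # Outlook
```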
Hypothesis Space Search in ID3
The hypothesis space is the set of all decision trees defined over the given set of attributes. ID3's hypothesis space is a complete space; i.e., the target description is there! ID3 performs a simple-to-complex, hill-climbing search through this space.
Hypothesis Space Search in ID3 The evaluation function is the information gain. ID3 maintains only a single current decision tree. ID3 performs no backtracking in its search. ID3 uses all training instances at each step of the search.
Inductive Bias in ID3
• Preference for short trees.
• Preference for trees with high-information-gain attributes near the root.
• The bias is a preference for some hypotheses, not a restriction on the hypothesis space.
Overfitting Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some hypothesis h’ ∈ H, such that h has smaller error than h’ over the training instances, but h’ has a smaller error than h over the entire distribution of instances.
Reasons for Overfitting
• Noisy training instances. Consider a noisy training example:
Outlook = Sunny, Temperature = Hot, Humidity = Normal, Wind = Strong, PlayTennis = No
[Figure: the Play Tennis tree. The noisy example is sorted to the sunny → Humidity = normal leaf, which is labeled yes, so fitting it forces the algorithm to add a new test below that leaf.]
Reasons for Overfitting
• A small number of instances is associated with a leaf node. In this case coincidental regularities may occur that are unrelated to the actual target concept.
[Figure: a scatter of + and - training instances; a split fitted to only a few instances carves out an area with probably wrong predictions.]
Approaches to Avoiding Overfitting
• Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.
• Post-pruning: allow the tree to overfit the data, and then post-prune the tree.
Criteria for Pruning
• Use a separate set of instances, distinct from the training instances, to evaluate the utility of nodes in the tree. This requires splitting the data into a training set and a validation set, which is then used for pruning. The rationale is that the validation set is unlikely to suffer from the same errors or fluctuations as the training set.
• Use all the available data for training, but apply a statistical test to estimate whether expanding or pruning a particular node is likely to produce an improvement beyond the training set.
Reduced-Error Pruning
Split the data into training and validation sets.
Pruning a decision node d consists of:
• removing the subtree rooted at d,
• making d a leaf node,
• assigning d the most common classification of the training instances associated with d.
Do until further pruning is harmful:
• Evaluate the impact on the validation set of pruning each possible node (plus those below it).
• Greedily remove the one that most improves validation-set accuracy.
[Figure: the Play Tennis tree. Outlook: sunny → Humidity (high → no, normal → yes); overcast → yes; rainy → Windy (false → yes, true → no).]
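A rough sketch of the greedy pruning loop, assuming the nested-dict tree representation and the classify, id3, and PLAY_TENNIS helpers sketched earlier (all names illustrative; it assumes every attribute value seen in the validation set has a branch in the tree).

```python
import copy
from collections import Counter

def accuracy(tree, examples, target="Play"):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def internal_paths(node, path=()):
    """Yield the branch-value path to every internal (dict) node."""
    if isinstance(node, dict):
        yield path
        for value, child in node["branches"].items():
            yield from internal_paths(child, path + (value,))

def replace_node(tree, path, leaf):
    """Return a copy of the tree with the node at `path` replaced by `leaf`."""
    if not path:
        return leaf
    new_tree = copy.deepcopy(tree)
    node = new_tree
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = leaf
    return new_tree

def examples_at(tree, path, examples):
    """Training examples that are sorted down the given branch-value path."""
    node, kept = tree, examples
    for value in path:
        kept = [ex for ex in kept if ex[node["attribute"]] == value]
        node = node["branches"][value]
    return kept

def reduced_error_prune(tree, training, validation, target="Play"):
    """Greedily replace nodes by majority-class leaves while validation accuracy
    does not get worse (ties are pruned too, preferring smaller trees)."""
    while True:
        best, best_acc = None, accuracy(tree, validation, target)
        for path in internal_paths(tree):
            reaching = examples_at(tree, path, training) or training
            majority = Counter(ex[target] for ex in reaching).most_common(1)[0][0]
            candidate = replace_node(tree, path, majority)
            acc = accuracy(candidate, validation, target)
            if acc >= best_acc:
                best, best_acc = candidate, acc
        if best is None:
            return tree
        tree = best

# Toy usage (illustrative): hold out the last 4 instances as a validation set.
train, valid = PLAY_TENNIS[:10], PLAY_TENNIS[10:]
pruned = reduced_error_prune(id3(train, ATTRIBUTES), train, valid)
```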
Reduced-Error Pruning Example
Rule Post-Pruning
• Convert the tree to an equivalent set of rules.
• Prune each rule independently of the others.
• Sort the final rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
[Figure: the Play Tennis tree. Outlook: sunny → Humidity (high → no, normal → yes); overcast → yes; rainy → Windy (false → yes, true → no).]
IF (Outlook = Sunny) & (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) & (Humidity = Normal) THEN PlayTennis = Yes
…
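The tree-to-rules step can be sketched as a walk over the nested-dict representation used earlier; the function name tree_to_rules is illustrative, and the rule-pruning and rule-sorting steps are not shown.

```python
def tree_to_rules(node, conditions=()):
    """Each root-to-leaf path becomes one IF ... THEN ... rule."""
    if not isinstance(node, dict):                       # leaf: emit one rule
        if_part = " & ".join(f"({a} = {v})" for a, v in conditions) or "(TRUE)"
        return [f"IF {if_part} THEN PlayTennis = {node}"]
    rules = []
    for value, child in node["branches"].items():        # extend the path per branch
        rules += tree_to_rules(child, conditions + ((node["attribute"], value),))
    return rules

for rule in tree_to_rules(TREE):
    print(rule)
# e.g. IF (Outlook = sunny) & (Humidity = high) THEN PlayTennis = no
```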
Continuous-Valued Attributes
Create a set of discrete attributes that test the continuous attribute against thresholds, then apply information gain to choose the best one.
Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No
Candidate tests, placed at the class boundaries: Temperature > 54 and Temperature > 85.
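A sketch of the threshold idea: candidate cut points are the midpoints between adjacent sorted values where the class label changes, and the candidate with the highest information gain is kept (function names are illustrative).

```python
import math

def binary_entropy(labels, positive="Yes"):
    p = sum(1 for y in labels if y == positive) / len(labels)
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values where the class changes."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1) if pairs[i][1] != pairs[i + 1][1]]

def best_threshold(values, labels):
    """Choose the threshold whose 'value > t' test has the highest information gain."""
    base, n = binary_entropy(labels), len(labels)
    def gain(t):
        above = [y for v, y in zip(values, labels) if v > t]
        below = [y for v, y in zip(values, labels) if v <= t]
        return (base - len(above) / n * binary_entropy(above)
                     - len(below) / n * binary_entropy(below))
    return max(candidate_thresholds(values, labels), key=gain)

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, labels))  # [54.0, 85.0]
print(best_threshold(temps, labels))        # 54.0
```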
An Alternative Measure for Attribute Selection

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

where:
SplitInformation(S, A) = - Σ_{i=1}^{c} ( |Si| / |S| ) log2 ( |Si| / |S| )
and S1, …, Sc are the subsets of S produced by the c values of attribute A.
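A sketch of GainRatio on top of the information_gain helper from earlier; names are illustrative, and the guard avoids division by zero when an attribute has only one observed value.

```python
import math
from collections import Counter

def split_information(examples, attribute):
    """SplitInformation(S, A) = -sum over values of (|Si|/|S|) * log2(|Si|/|S|)."""
    counts = Counter(ex[attribute] for ex in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(examples, attribute, target="Play"):
    si = split_information(examples, attribute)
    return information_gain(examples, attribute, target) / si if si > 0 else 0.0

for a in ATTRIBUTES:
    print(a, round(gain_ratio(PLAY_TENNIS, a), 3))
```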
Missing Attribute Values
Strategies when the value of attribute A is missing for an example:
• Assign the most common value of A among the other examples belonging to the same concept.
• If node n tests attribute A, assign the most common value of A among the other examples sorted to node n.
• If node n tests attribute A, assign a probability to each possible value of A, estimated from the observed frequencies of the values of A. These probabilities are then used in the information gain measure.
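The first strategy (most common value) can be sketched in a few lines; the names are illustrative. The third strategy would instead weight an example across branches in proportion to the observed value frequencies, which is not shown here.

```python
from collections import Counter

def fill_most_common(examples, attribute, missing=None):
    """Replace missing values of `attribute` with its most common observed value."""
    observed = [ex[attribute] for ex in examples if ex[attribute] is not missing]
    most_common = Counter(observed).most_common(1)[0][0]
    return [dict(ex, **{attribute: most_common}) if ex[attribute] is missing else ex
            for ex in examples]
```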
Summary Points
• Decision tree learning provides a practical method for concept learning.
• ID3-like algorithms search a complete hypothesis space.
• The inductive bias of decision trees is a preference (search) bias.
• Overfitting the training data is an important issue in decision tree learning.
• A large number of extensions of the ID3 algorithm have been proposed for overfitting avoidance, handling missing attributes, handling numerical attributes, etc.
References
Mitchell, T. M. 1997. Machine Learning. New York: McGraw-Hill.
Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning.
Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. New Jersey: Prentice Hall.
RainForest - A Framework for Fast Decision Tree Construction of Large Datasets
Paper by: J. Gehrke, R. Ramakrishnan, V. Ganti, Dept. of Computer Sciences, University of Wisconsin-Madison
Introduction to Classification
An important data mining problem.
Input: a database of training records with
• a class label attribute
• predictor attributes
Goal: to build a concise model of the distribution of the class label in terms of the predictor attributes.
Applications: scientific experiments, medical diagnosis, fraud detection, etc.
Decision Tree: A Classification Model
• One of the most attractive classification models.
• There are a large number of algorithms to construct decision trees, e.g., SLIQ, CART, C4.5, SPRINT.
• Most are main-memory algorithms.
• There is a tradeoff between supporting large databases, performance, and constructing more accurate decision trees.
Motivation of RainForest
• Developing a unifying framework that can be applied to most decision tree algorithms and that results in a scalable version of the algorithm without changing its result.
• Separating the scalability aspects of these algorithms from the central features that determine the quality of the decision tree.
Decision Tree Terms
• Root, leaf, and internal nodes.
• Each leaf is labeled with one class label.
• Each internal node is labeled with one predictor attribute, called the splitting attribute.
• Each edge e from a node n has a predicate q associated with it; q involves only the splitting attribute of n.
• P: the set of predicates on all outgoing edges of an internal node; the predicates are non-overlapping and exhaustive.
• crit(n): the splitting criteria of n; the combination of the splitting attribute and the predicates.
Decision Tree Terms (Cont'd)
F(n): the family of tuples of the database D associated with node n.
Definition: let E = {e1, e2, …, ek} and Q = {q1, q2, …, qk} be the edge set and predicate set of node n, and let p be the parent node of n.
• If n is the root, F(n) = D.
• If n is not the root, let q(p→n) be the predicate on edge e(p→n); then F(n) = {t : t ∈ D, t ∈ F(p), and q(p→n)(t) = True}.
Decision Tree Terms (Cont'd)
[Figure: an internal node n with outgoing edges e1, e2, …, ek labeled with predicates q1, q2, …, qk; P = { q1, q2, …, qk }.]
RainForest Framework: Top-Down Tree Induction Schema
Input: node n, partition D, classification algorithm CL
Output: decision tree for D rooted at n
Top-Down Decision Tree Induction Schema:
BuildTree(Node n, datapartition D, algorithm CL)
(1) Apply CL to D to find crit(n)
(2) let k be the number of children of n
(3) if (k > 0)
(4)   Create k children c1, …, ck of n
(5)   Use best split to partition D into D1, …, Dk
(6)   for (i = 1; i <= k; i++)
(7)     BuildTree(ci, Di)
(8)   endfor
(9) endif