Decision Trees

Dr. Ankur M. Teredesai (AMT)

NOTICE: Proprietary and Confidential This material is proprietary to A. Teredesai and GCCIS, RIT.


Overview
- Decision trees
- Appropriate problems for decision trees
- Entropy and Information Gain
- The ID3 algorithm
- Avoiding Overfitting via Pruning
- Handling Continuous-Valued Attributes
- Handling Missing Attribute Values
- Alternative Measures for Selecting Attributes


Time to look at the classification model: a decision tree works a lot like playing twenty questions. The tree on the right decides whether it is possible to go out and play tennis.

Outlook
  Sunny    → +
  Overcast → Temp
               < 35F → −
               < 70F → +

E.g., it's overcast, but it's reasonably warm (55F). Answer: go out and play!


Decision Trees
Definition: A decision tree is a tree such that:
- each internal node tests an attribute,
- each branch corresponds to an attribute value,
- each leaf node assigns a classification.

Outlook
  sunny    → Humidity
               high   → no
               normal → yes
  overcast → yes
  rainy    → Windy
               false → yes
               true  → no

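Below is a minimal sketch (not from the slides) of one way to encode this tree in Python and classify an instance by walking it, twenty-questions style; the nested-dict layout and attribute names are illustrative assumptions.

```python
tennis_tree = {
    "attribute": "Outlook",
    "branches": {
        "sunny":    {"attribute": "Humidity",
                     "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy":    {"attribute": "Windy",
                     "branches": {"false": "yes", "true": "no"}},
    },
}

def classify(tree, instance):
    """Walk the tree, testing one attribute per internal node, until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

print(classify(tennis_tree, {"Outlook": "sunny", "Humidity": "normal", "Windy": "false"}))  # yes
```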

Data Set for Playing Tennis

Outlook    Temp  Humidity  Windy  Play
--------   ----  --------  -----  ----
Sunny      Hot   High      False  No
Sunny      Hot   High      True   No
Overcast   Hot   High      False  Yes
Rainy      Mild  High      False  Yes
Rainy      Cool  Normal    False  Yes
Rainy      Cool  Normal    True   No
Overcast   Cool  Normal    True   Yes
Sunny      Mild  High      False  No
Sunny      Cool  Normal    False  Yes
Rainy      Mild  Normal    False  Yes
Sunny      Mild  Normal    True   Yes
Overcast   Mild  High      True   Yes
Overcast   Hot   Normal    False  Yes
Rainy      Mild  High      True   No

(14 instances: 9 Yes, 5 No)
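A small sketch (an assumed encoding, not from the slides): the same data as Python tuples, with a quick count of the class labels that the entropy example later relies on.

```python
from collections import Counter

DATA = [
    # (Outlook, Temp, Humidity, Windy, Play)
    ("Sunny",    "Hot",  "High",   "False", "No"),
    ("Sunny",    "Hot",  "High",   "True",  "No"),
    ("Overcast", "Hot",  "High",   "False", "Yes"),
    ("Rainy",    "Mild", "High",   "False", "Yes"),
    ("Rainy",    "Cool", "Normal", "False", "Yes"),
    ("Rainy",    "Cool", "Normal", "True",  "No"),
    ("Overcast", "Cool", "Normal", "True",  "Yes"),
    ("Sunny",    "Mild", "High",   "False", "No"),
    ("Sunny",    "Cool", "Normal", "False", "Yes"),
    ("Rainy",    "Mild", "Normal", "False", "Yes"),
    ("Sunny",    "Mild", "Normal", "True",  "Yes"),
    ("Overcast", "Mild", "High",   "True",  "Yes"),
    ("Overcast", "Hot",  "Normal", "False", "Yes"),
    ("Rainy",    "Mild", "High",   "True",  "No"),
]

print(Counter(row[-1] for row in DATA))  # Counter({'Yes': 9, 'No': 5})
```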

Decision Tree For Playing Tennis

Outlook
  sunny    → Humidity
               high   → no
               normal → yes
  overcast → yes
  rainy    → Windy
               false → yes
               true  → no

When to Consider Decision Trees
- Each instance is described by attributes with discrete values (e.g. Outlook = Sunny).
- The classification is over discrete values (e.g. yes/no).
- Disjunctive descriptions may be required: each path in the tree represents a conjunction of attribute tests, and the tree as a whole represents a disjunction of these conjunctions. Any Boolean function can be represented!
- The training data may contain errors: decision trees are robust to classification errors in the training data.
- The training data may contain missing values: decision trees can be used even if instances have missing attribute values.


Decision Tree Induction
Basic algorithm:
1. A ← the "best" decision attribute for node N.
2. Assign A as the decision attribute for node N.
3. For each value of A, create a new descendant of node N.
4. Sort the training examples to the leaf nodes.
5. IF the training examples are perfectly classified, THEN stop; ELSE iterate over the new leaf nodes.


How do we pick the splitting attribute?

Determine the attribute that contributes the most information. For example, knowing that the Outlook is Sunny tells us more about whether we will go out and play than merely knowing it is not humid outside.


The measure we need is termed the Information Gain of the attribute. Once we know the splitting attribute, we branch the tree in the direction of all the unique values for that attribute. For example, for 3 unique values, a 3-way branch is necessary.


Decision Tree Induction

Outlook = Sunny:
  Outlook  Temp  Hum     Wind    Play
  Sunny    Hot   High    Weak    No
  Sunny    Hot   High    Strong  No
  Sunny    Mild  High    Weak    No
  Sunny    Cool  Normal  Weak    Yes
  Sunny    Mild  Normal  Strong  Yes

Outlook = Overcast:
  Outlook   Temp  Hum     Wind    Play
  Overcast  Hot   High    Weak    Yes
  Overcast  Cool  Normal  Strong  Yes

Outlook = Rain:
  Outlook  Temp  Hum     Wind    Play
  Rain     Mild  High    Weak    Yes
  Rain     Cool  Normal  Weak    Yes
  Rain     Cool  Normal  Strong  No
  Rain     Mild  Normal  Weak    Yes
  Rain     Mild  High    Strong  No


Entropy
Let S be a sample of training examples, let p+ be the proportion of positive examples in S, and let p− be the proportion of negative examples in S. Entropy measures the impurity of S:

E(S) = − p+ log2 p+ − p− log2 p−
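A minimal sketch (not from the slides) of this formula in Python, generalized to any number of class labels; the usage line reproduces the 0.94 value computed on a later slide.

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = -sum over classes of p_i * log2(p_i), where p_i is the class proportion."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total) for count in Counter(labels).values())

# 9 "Yes" and 5 "No" instances, as in the Play Tennis data:
print(round(entropy(["Yes"] * 9 + ["No"] * 5), 2))  # 0.94
```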


Entropy
The entropy of an attribute a, given a data set s, is:

E(a) = Σ_{k=1..v} ((|s_k1| + |s_k2| + … + |s_kn|) / |s|) · I_k(a)

where, for attribute a, there are v distinct values {a1, a2, …, ak, …, av}; I_k(a) is the expected information needed to classify a sample with a = a_k; and |s_ki| is the number of training samples in s with class c_i and value a_k.

Entropy is a measure of how much we know about a particular class: the more we know, the lower the entropy.


Entropy Example from the Dataset
In the Play Tennis dataset we have two target classes: yes and no. Out of 14 instances, 9 are classified yes and the rest no.

−(9/14) log2(9/14) = 0.41
−(5/14) log2(5/14) = 0.53

E(S) = 0.41 + 0.53 = 0.94


Information Gain Information Gain is the expected reduction in entropy caused by partitioning the instances according to a given attribute.

Gain(S, A) = E(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · E(S_v)

where S_v = { s ∈ S | A(s) = v }
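A minimal sketch (not from the slides) of this definition: the entropy of S minus the weighted entropy of the partitions S_v induced by attribute A; the row/label encoding is an assumption.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """rows: list of attribute tuples; labels: parallel list of class labels."""
    partitions = defaultdict(list)
    for row, label in zip(rows, labels):
        partitions[row[attribute_index]].append(label)   # group labels by the attribute's value
    weighted = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - weighted

# Toy usage: the second column perfectly predicts the label, so the gain equals the full entropy.
rows = [("sunny", "high"), ("sunny", "normal"), ("rainy", "high"), ("rainy", "normal")]
labels = ["no", "yes", "no", "yes"]
print(information_gain(rows, labels, 1))  # 1.0
```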


Example

The data is split on Outlook (Sunny / Overcast / Rain) as on the previous slide. For the Sunny branch:

  Outlook  Temp  Hum     Wind    Play
  Sunny    Hot   High    Weak    No
  Sunny    Hot   High    Strong  No
  Sunny    Mild  High    Weak    No
  Sunny    Cool  Normal  Weak    Yes
  Sunny    Mild  Normal  Strong  Yes

Which attribute should be tested here?
Gain(S_sunny, Humidity)    = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(S_sunny, Temperature) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(S_sunny, Wind)        = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
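A quick check of these numbers (a sketch, not from the slides), run on the five Sunny rows with the attribute order Temp, Humidity, Wind assumed from the table above.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

# The five Sunny instances: (Temp, Humidity, Wind) -> Play
sunny = [(("Hot",  "High",   "Weak"),   "No"),
         (("Hot",  "High",   "Strong"), "No"),
         (("Mild", "High",   "Weak"),   "No"),
         (("Cool", "Normal", "Weak"),   "Yes"),
         (("Mild", "Normal", "Strong"), "Yes")]

labels = [label for _, label in sunny]
for i, name in enumerate(["Temperature", "Humidity", "Wind"]):
    parts = defaultdict(list)
    for attrs, label in sunny:
        parts[attrs[i]].append(label)
    gain = entropy(labels) - sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    print(name, round(gain, 3))
# Temperature 0.571, Humidity 0.971, Wind 0.02 -- matching 0.570, 0.970, and 0.019 up to rounding
```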


ID3 Algorithm
Informally:
• Determine the attribute with the highest information gain on the training set.
• Use this attribute as the root, and create a branch for each of the values the attribute can have.
• For each branch, repeat the process with the subset of the training set that is classified by that branch.
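A compact sketch of this recursion (not the slides' code): examples are assumed to be dicts mapping attribute names to values, and "best" means highest information gain as defined earlier.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    parts = defaultdict(list)
    for example, label in zip(examples, labels):
        parts[example[attribute]].append(label)
    weighted = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - weighted

def id3(examples, labels, attributes):
    """Return a nested-dict tree: {"attribute": A, "branches": {value: subtree or class label}}."""
    if len(set(labels)) == 1:                       # all examples agree -> leaf
        return labels[0]
    if not attributes:                              # nothing left to test -> majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    groups = defaultdict(list)
    for example, label in zip(examples, labels):
        groups[example[best]].append((example, label))
    remaining = [a for a in attributes if a != best]
    return {"attribute": best,
            "branches": {value: id3([e for e, _ in members],
                                    [l for _, l in members],
                                    remaining)
                         for value, members in groups.items()}}
```

With the Play Tennis data encoded this way, id3 should reproduce the Outlook / Humidity / Windy tree shown earlier.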


Hypothesis Space Search in ID3
The hypothesis space is the set of all decision trees defined over the given set of attributes. ID3's hypothesis space is a complete space; i.e., the target description is in there! ID3 performs a simple-to-complex, hill-climbing search through this space.


Hypothesis Space Search in ID3
• The evaluation function is the information gain.
• ID3 maintains only a single current decision tree.
• ID3 performs no backtracking in its search.
• ID3 uses all training instances at each step of the search.


Inductive Bias in ID3
• Preference for short trees.
• Preference for trees with high-information-gain attributes near the root.
• The bias is a preference for some hypotheses, not a restriction on the hypothesis space.



Overfitting Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some hypothesis h’ ∈ H, such that h has smaller error than h’ over the training instances, but h’ has a smaller error than h over the entire distribution of instances.


Reasons for Overfitting

• Noisy training instances. Consider a noisy training example:
  Outlook = Sunny, Temperature = Hot, Humidity = Normal, Wind = Strong, PlayTennis = No

  This example is sorted to the (Outlook = sunny, Humidity = normal) leaf of the tree, which predicts yes; to fit it, the tree has to add a new test below that leaf.

Reasons for Overfitting

• A small number of instances is associated with a leaf node. In this case it is possible for coincidental regularities to occur that are unrelated to the actual target concept.

(Figure: a scatter of + and − training instances; a boundary fit around a few points creates an area with probably wrong predictions.)

Approaches to Avoiding Overfitting
Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.
Post-pruning: allow the tree to overfit the data, and then post-prune it.


Criteria for Pruning
• Use a separate set of instances, distinct from the training instances, to evaluate the utility of nodes in the tree. This requires the data to be split into a training set and a validation set, which is then used for pruning. The reason is that the validation set is unlikely to suffer from the same errors or fluctuations as the training set.
• Use all the available data for training, but apply a statistical test to estimate whether expanding/pruning a particular node is likely to produce an improvement beyond the training set.


Reduced-Error Pruning
Split the data into training and validation sets.

Pruning a decision node d consists of:
- removing the subtree rooted at d,
- making d a leaf node,
- assigning d the most common classification of the training instances associated with d.

Do until further pruning is harmful:
- Evaluate the impact on the validation set of pruning each possible node (plus those below it).
- Greedily remove the one that most improves validation-set accuracy.
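A rough sketch (not from the slides) of this procedure on the nested-dict trees used earlier; it assumes each internal node also stores a "majority" key holding the most common training label among the examples that reached it.

```python
import copy

def classify(tree, instance):
    while isinstance(tree, dict):
        child = tree["branches"].get(instance[tree["attribute"]])
        if child is None:                      # unseen attribute value: fall back to the node's majority label
            return tree["majority"]
        tree = child
    return tree

def accuracy(tree, validation):
    """validation: list of (instance_dict, label) pairs."""
    return sum(classify(tree, x) == y for x, y in validation) / len(validation)

def internal_node_paths(tree, path=()):
    """Yield the branch-value path from the root to every internal (dict) node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["branches"].items():
            yield from internal_node_paths(child, path + (value,))

def pruned_copy(tree, path):
    """Copy the tree with the node at `path` replaced by that node's majority label."""
    new_tree = copy.deepcopy(tree)
    if not path:
        return new_tree["majority"]
    parent = new_tree
    for value in path[:-1]:
        parent = parent["branches"][value]
    parent["branches"][path[-1]] = parent["branches"][path[-1]]["majority"]
    return new_tree

def reduced_error_prune(tree, validation):
    """Greedily prune the node whose removal helps validation accuracy most, until pruning hurts."""
    while isinstance(tree, dict):
        base = accuracy(tree, validation)
        candidates = [(accuracy(pruned_copy(tree, p), validation), p) for p in internal_node_paths(tree)]
        best_acc, best_path = max(candidates, key=lambda c: c[0])
        if best_acc < base:                    # further pruning is harmful
            break
        tree = pruned_copy(tree, best_path)
    return tree
```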


Reduced Error Pruning Example


Rule Post-Pruning
• Convert the tree to an equivalent set of rules.
• Prune each rule independently of the others.
• Sort the final rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.

IF (Outlook = Sunny) & (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) & (Humidity = Normal) THEN PlayTennis = Yes
…
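A small sketch (not from the slides) of the conversion step: each root-to-leaf path of the nested-dict tree used earlier becomes one IF ... THEN rule.

```python
def extract_rules(tree, conditions=()):
    if not isinstance(tree, dict):                  # leaf: emit one rule for this path
        body = " AND ".join(f"({attr} = {value})" for attr, value in conditions) or "TRUE"
        return [f"IF {body} THEN PlayTennis = {tree}"]
    rules = []
    for value, child in tree["branches"].items():
        rules += extract_rules(child, conditions + ((tree["attribute"], value),))
    return rules

tennis_tree = {
    "attribute": "Outlook",
    "branches": {
        "sunny":    {"attribute": "Humidity", "branches": {"high": "No", "normal": "Yes"}},
        "overcast": "Yes",
        "rainy":    {"attribute": "Windy",    "branches": {"false": "Yes", "true": "No"}},
    },
}
for rule in extract_rules(tennis_tree):
    print(rule)
# IF (Outlook = sunny) AND (Humidity = high) THEN PlayTennis = No
# IF (Outlook = sunny) AND (Humidity = normal) THEN PlayTennis = Yes
# ...
```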


Continuous Valued Attributes

Create a set of discrete attributes to test continuous ones, and apply information gain to choose the best one. For example:

Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No

Candidate thresholds: Temp > 54 and Temp > 85 (the midpoints between adjacent values where the class label changes).
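A small sketch (not from the slides) of how candidate thresholds can be scored: each midpoint where the class label changes defines a binary split, and information gain picks the best one.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

# Candidate thresholds: midpoints between adjacent values where the label changes.
pairs = list(zip(zip(temps, labels), zip(temps[1:], labels[1:])))
candidates = [(a + b) / 2 for (a, la), (b, lb) in pairs if la != lb]   # [54.0, 85.0]

base = entropy(labels)
for t in candidates:
    below = [l for x, l in zip(temps, labels) if x <= t]
    above = [l for x, l in zip(temps, labels) if x > t]
    gain = base - (len(below) * entropy(below) + len(above) * entropy(above)) / len(labels)
    print(f"Temp > {t}: gain = {gain:.3f}")
# Temp > 54.0 scores higher, matching the Temp > 54 candidate on the slide.
```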

Dr. Ankur M. Teredesai

P29

An Alternative Measure for Attribute Selection

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

where:

SplitInformation(S, A) = − Σ_{i=1..c} (|S_i| / |S|) · log2(|S_i| / |S|)

and S_1 … S_c are the subsets of S produced by the c distinct values of attribute A.
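A minimal sketch (not from the slides) of SplitInformation and GainRatio; `values` is the column of attribute A over S, and `gain` is Gain(S, A) computed as before.

```python
import math
from collections import Counter

def split_information(values):
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def gain_ratio(gain, values):
    return gain / split_information(values)

# SplitInformation penalizes attributes with many distinct values (e.g. a date-like attribute):
print(split_information(["a", "a", "b", "b"]))        # 1.0
print(split_information(["d1", "d2", "d3", "d4"]))    # 2.0 -> a heavier denominator in GainRatio
```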


Missing Attribute Values
Strategies:
• Assign the most common value of A among the other examples belonging to the same concept.
• If node n tests attribute A, assign the most common value of A among the other examples sorted to node n.
• If node n tests attribute A, assign a probability to each possible value of A, estimated from the observed frequencies of the values of A; these probabilities are then used in the information gain measure.
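A tiny sketch (not from the slides) of the first strategy: fill a missing value of attribute A using the most common value among other examples with the same class label; the dict encoding is an assumption.

```python
from collections import Counter

def impute(examples, attribute, target, label):
    """examples: list of dicts; return the most common value of `attribute` among examples of class `label`."""
    peers = [e[attribute] for e in examples if e.get(attribute) is not None and e[target] == label]
    return Counter(peers).most_common(1)[0][0]

data = [{"Outlook": "Sunny", "Play": "No"},
        {"Outlook": "Sunny", "Play": "No"},
        {"Outlook": "Rainy", "Play": "No"},
        {"Outlook": None,    "Play": "No"}]
print(impute(data, "Outlook", "Play", "No"))  # Sunny
```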


Summary Points
• Decision tree learning provides a practical method for concept learning.
• ID3-like algorithms search a complete hypothesis space.
• The inductive bias of decision trees is a preference (search) bias.
• Overfitting the training data is an important issue in decision tree learning.
• A large number of extensions of the ID3 algorithm have been proposed for overfitting avoidance, handling missing attributes, handling numerical attributes, etc.


References
Mitchell, Tom M. 1997. Machine Learning. New York: McGraw-Hill.
Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning.
Russell, Stuart, and Peter Norvig. 1995. Artificial Intelligence: A Modern Approach. New Jersey: Prentice Hall.


RainForest - A Framework for Fast Decision Tree Construction of Large Datasets

Paper By: J. Gehrke, R. Ramakrishnan, V. Ganti Dept. of Computer Sciences University of Wisconsin-Madison


Introduction to Classification
An important data mining problem.
Input: a database of training records with
– class label attribute
– predictor attributes
Goal: to build a concise model of the distribution of the class label in terms of the predictor attributes.
Applications: scientific experiments, medical diagnosis, fraud detection, etc.


Decision Tree: A Classification Model
• One of the most attractive classification models.
• There are a large number of algorithms to construct decision trees, e.g. SLIQ, CART, C4.5, SPRINT.
• Most are main-memory algorithms.
• There is a tradeoff between supporting large databases, performance, and constructing more accurate decision trees.


Motivation of RainForest

Developing a unifying framework that can be applied to most decision tree algorithms, and results in a scalable version of this algorithm without modifying the results. Separating the scalability aspects of these algorithms from the central features that determine the quality of the decision trees


Decision Tree Terms

Root, leaf, and internal nodes.
Each leaf is labeled with one class label.
Each internal node is labeled with one predictor attribute, called the splitting attribute.
Each edge e from a node n has a predicate q associated with it; q involves only n's splitting attribute.
P: the set of predicates on all outgoing edges of an internal node; non-overlapping and exhaustive.
crit(n): the splitting criteria of n; a combination of the splitting attribute and predicates.


Decision Tree Terms (Cont'd)

F(n): the family of database (D) tuples of a node n.
Definition: let E = {e1, e2, …, ek} and Q = {q1, q2, …, qk} be the edge set and predicate set for a node n, and let p be the parent node of n.
If n = root, F(n) = D.
If n ≠ root, let q(p→n) be the predicate on e(p→n); then F(n) = { t ∈ D : t ∈ F(p) and q(p→n)(t) = True }.


Decision Tree Terms (Cont’d)

(Figure: a node n with outgoing edges e1 {q1}, e2 {q2}, …, ek {qk}; P = {q1, q2, …, qk}.)

RainForest Framework: Top-down Tree Induction Schema
Input: node n, data partition D, classification algorithm CL
Output: decision tree for D rooted at n

Top-Down Decision Tree Induction Schema:
BuildTree(Node n, data partition D, algorithm CL)
(1) Apply CL to D to find crit(n)
(2) Let k be the number of children of n
(3) if (k > 0)
(4)   Create k children c1, …, ck of n
(5)   Use the best split to partition D into D1, …, Dk
(6)   for (i = 1; i <= k; i++)
(7)     BuildTree(ci, Di, CL)