Learning (Decision Tree)
Session 12. Lecturers: Sriyani Violina, Danang Junaedi
Jurusan Teknik Informatika – Universitas Widyatama

Classification: Definition
• Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.
Examples of Classification Task
• Predicting tumor cells as benign or malignant.
• Classifying credit card transactions as legitimate or fraudulent.
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil.
• Categorizing news stories as finance, weather, entertainment, sports, etc.

Illustrating Classification Task
Training set (class known) → Learn Model → Model; Test set (class unknown) → Apply Model → predicted class.

Training set:
Tid  Attrib1  Attrib2  Attrib3  Class
 1   Yes      Large    125K     No
 2   No       Medium   100K     No
 3   No       Small     70K     No
 4   Yes      Medium   120K     No
 5   No       Large     95K     Yes
 6   No       Medium    60K     No
 7   Yes      Large    220K     No
 8   No       Small     85K     Yes
 9   No       Medium    75K     No
10   No       Small     90K     Yes

Test set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small     55K     ?
12   Yes      Medium    80K     ?
13   Yes      Large    110K     ?
14   No       Small     95K     ?
15   No       Large     67K     ?
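As an illustration of the learn-model / apply-model loop sketched above, the following minimal Python example trains a decision tree on the ten training records and applies it to the five test records. The use of scikit-learn and the numeric encoding of the attribute values are assumptions made for this sketch; the slides do not prescribe any particular library.

# Minimal sketch of the Learn Model / Apply Model steps using scikit-learn.
# The encoding of the attributes (Yes/No -> 1/0, sizes -> ordinal codes) is an
# assumption made for illustration only.
from sklearn.tree import DecisionTreeClassifier

size = {"Small": 0, "Medium": 1, "Large": 2}
yes_no = {"No": 0, "Yes": 1}

# Training set (Tid 1-10): Attrib1, Attrib2, Attrib3, Class
train = [
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),   ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),  ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),  ("No", "Small", 90, "Yes"),
]
X = [[yes_no[a1], size[a2], a3] for a1, a2, a3, _ in train]
y = [c for *_, c in train]

model = DecisionTreeClassifier(criterion="gini").fit(X, y)   # Learn Model

# Test set (Tid 11-15): class unknown
test = [("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
        ("No", "Small", 95), ("No", "Large", 67)]
X_test = [[yes_no[a1], size[a2], a3] for a1, a2, a3 in test]
print(model.predict(X_test))                                  # Apply Model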
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines

Example of a Decision Tree

Training data:
Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund
  Yes → NO
  No  → MarSt
          Married → NO
          Single, Divorced → TaxInc
                               < 80K → NO
                               > 80K → YES
Another Example of a Decision Tree

The same training data also fits a different tree:
MarSt
  Married → NO
  Single, Divorced → Refund
                       Yes → NO
                       No  → TaxInc
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task
Training set (Tid 1-10, class known) → Learn Model (tree induction) → Decision Tree.
Test set (Tid 11-15, class unknown) → Apply Model → predicted class.
Apply Model to Test Data

Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch that matches each attribute value:
1. Root node Refund: the record has Refund = No, so take the 'No' branch to the MarSt node.
2. MarSt node: the record has Marital Status = Married, so take the 'Married' branch, which leads directly to the leaf NO.
   (Had the record been Single or Divorced, the TaxInc node would have been tested next: < 80K → NO, > 80K → YES.)
3. The record reaches the leaf NO, so assign Cheat to "No".

Decision Tree Classification Task
In the Apply Model phase, the induced decision tree is used in the same way to assign a class label to every test record (Tid 11-15) whose class is unknown.
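The walk just described can be expressed directly in code. The sketch below is an illustration only: the nested-dictionary representation of the tree is an assumption, not a format used in the slides.

# Sketch of the "Apply Model to Test Data" walk above, with the tree hand-coded
# as nested dictionaries (a representation chosen for illustration only).
def classify(record, node):
    """Walk the tree from the root until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        attribute, branches = node["attribute"], node["branches"]
        value = record[attribute]
        if attribute == "TaxInc":                    # continuous test
            node = branches["< 80K"] if value < 80 else branches["> 80K"]
        else:                                        # categorical test
            node = branches[value]
    return node

tree = {
    "attribute": "Refund",
    "branches": {
        "Yes": "NO",
        "No": {
            "attribute": "MarSt",
            "branches": {
                "Married": "NO",
                "Single": {"attribute": "TaxInc",
                           "branches": {"< 80K": "NO", "> 80K": "YES"}},
            },
        },
    },
}
# "Divorced" follows the same branch as "Single" in the slides' tree:
tree["branches"]["No"]["branches"]["Divorced"] = tree["branches"]["No"]["branches"]["Single"]

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(record, tree))   # -> NO, i.e. assign Cheat = "No"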
Decision Trees
• A decision tree is a chronological representation of the decision problem.
• Each decision tree has two types of nodes: round nodes correspond to the states of nature, while square nodes correspond to the decision alternatives.
• The branches leaving each round node represent the different states of nature, while the branches leaving each square node represent the different decision alternatives.
• At the end of each limb of a tree are the payoffs attained from the series of branches making up that limb.
When To Consider a Decision Tree?

Strengths of Decision Trees

Weaknesses of Decision Trees

Decision Tree Concept
• Transform the data into a decision tree and a set of decision rules.
Decision Tree

Data Concepts in a Decision Tree
• Data is expressed as a table with attributes and records.
• An attribute is a parameter used as a criterion in building the tree. For example, to decide whether to play tennis, the criteria considered are the weather, the wind, and the temperature. One attribute states the solution (class) of each data item and is called the target attribute.
• Attributes have values, which are called instances. For example, the attribute weather has the instances sunny, cloudy, and rainy.

Process in a Decision Tree
1. Specify the problem: determine the attributes and the target attribute based on the available data.
2. Transform the data (table) into a tree model, using for example:
   – Hunt's Algorithm (one of the earliest)
   – CART
   – ID3, C4.5
   – SLIQ, SPRINT
   – etc.
3. Transform the tree model into rules:
   – Disjunction (∨, OR)
   – Conjunction (∧, AND)
4. Simplify the rules (pruning).

Step 1: Transforming the Data into a Tree
General Structure of Hunt's Algorithm
• Let Dt be the set of training records that reach a node t.
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  – If Dt is an empty set, then t is a leaf node labeled by the default class yd.
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.

Hunt's Algorithm (applied to the Refund / Marital Status / Taxable Income data above):
1. Start with a single leaf labeled with the default class: Don't Cheat.
2. Split on Refund: Yes → Don't Cheat; No → still mixed (?).
3. Under Refund = No, split on Marital Status: Married → Don't Cheat; Single, Divorced → still mixed (?).
4. Under Single, Divorced, split on Taxable Income: < 80K → Don't Cheat; >= 80K → Cheat.
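A minimal sketch of this general procedure is shown below, assuming records are plain dictionaries with the class stored under the key "class". The attribute chosen at each step here is simply the first one that still splits the data; real induction algorithms (ID3, CART, C4.5) would pick the best split with an impurity measure instead. The toy data is a simplified subset of the example above.

# Illustrative sketch of Hunt's algorithm. Attribute selection is naive here
# (first attribute that splits the data); impurity-based selection comes later.
from collections import Counter

def hunt(records, attributes, default="Don't Cheat"):
    if not records:                                   # empty set -> default class
        return default
    classes = [r["class"] for r in records]
    if len(set(classes)) == 1:                        # pure node -> leaf
        return classes[0]
    majority = Counter(classes).most_common(1)[0][0]
    for attr in attributes:                           # find an attribute that splits Dt
        values = {r[attr] for r in records}
        if len(values) > 1:
            children = {}
            for v in values:
                subset = [r for r in records if r[attr] == v]
                rest = [a for a in attributes if a != attr]
                children[v] = hunt(subset, rest, default=majority)
            return {attr: children}
    return majority                                   # no attribute splits -> majority leaf

data = [
    {"Refund": "Yes", "MarSt": "Single",   "class": "Don't Cheat"},
    {"Refund": "No",  "MarSt": "Married",  "class": "Don't Cheat"},
    {"Refund": "No",  "MarSt": "Single",   "class": "Cheat"},
    {"Refund": "No",  "MarSt": "Divorced", "class": "Cheat"},
]
print(hunt(data, ["Refund", "MarSt"]))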
Tree Induction
• Greedy strategy:
  – Split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  – Determine how to split the records:
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting.

How to Specify the Test Condition?
• Depends on the attribute type:
  – Nominal
  – Ordinal
  – Continuous
• Depends on the number of ways to split:
  – 2-way split
  – Multi-way split
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values.
  Example: CarType → {Family}, {Sports}, {Luxury}.
• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Example: CarType → {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}.

Splitting Based on Continuous Attributes
• Different ways of handling:
  – Discretization to form an ordinal categorical attribute:
    • Static – discretize once at the beginning.
    • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
  – Binary decision: (A < v) or (A ≥ v):
    • Consider all possible splits and find the best cut.
    • Can be more compute intensive.
ID3 Algorithm

ID3 Algorithm (contd)

C4.5
• Simple depth-first construction.
• Uses Information Gain.
• Sorts continuous attributes at each node.
• Needs the entire data to fit in memory.
• Unsuitable for large datasets:
  – Needs out-of-core sorting.
• You can download the software from:
  http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz

How to Determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1.
Which test condition is the best?
How to Determine the Best Split
• Greedy approach:
  – Nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:
  – Non-homogeneous → high degree of impurity.
  – Homogeneous → low degree of impurity.

Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
How to Find the Best Split
Before splitting, the class counts (C0, C1) at the parent node give impurity M0. Candidate splits A? and B? each produce two child nodes (N1, N2 and N3, N4) with impurities M1, M2 and M3, M4; their weighted combinations are M12 and M34.
Gain = M0 - M12 vs. M0 - M34: choose the split with the larger gain (the largest impurity reduction).

Measure of Impurity: GINI
• Gini Index for a given node t:

  GINI(t) = 1 - \sum_j [p(j \mid t)]^2

  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information.
  – Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples (two classes C1, C2):
  C1 = 0, C2 = 6 → Gini = 0.000
  C1 = 1, C2 = 5 → Gini = 0.278
  C1 = 2, C2 = 4 → Gini = 0.444
  C1 = 3, C2 = 3 → Gini = 0.500
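The definition above translates directly into a small helper, shown here as a sketch that takes the class counts of a node:

# Gini index of a node from its class counts: GINI(t) = 1 - sum_j p(j|t)^2.
def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

# The 2-class examples from the slide:
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5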
Examples for Computing GINI

  GINI(t) = 1 - \sum_j [p(j \mid t)]^2

  C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Gini = 1 - 0² - 1² = 0
  C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6 → Gini = 1 - (1/6)² - (5/6)² = 0.278
  C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6 → Gini = 1 - (2/6)² - (4/6)² = 0.444

Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as

  GINI_{split} = \sum_{i=1}^{k} (n_i / n) \, GINI(i)

  where n_i = number of records at child i, and n = number of records at node p.

Binary Attributes: Computing the GINI Index
• Splits into two partitions.
• Effect of weighing partitions: larger and purer partitions are sought.
Example: parent node has C1 = 6, C2 = 6 (Gini = 0.500); split B? yields N1 (C1 = 5, C2 = 2) and N2 (C1 = 1, C2 = 4):
  Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
  Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320
  Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Categorical Attributes: Computing the Gini Index
• For each distinct value, gather the counts for each class in the dataset.
• Use the count matrix to make decisions.

Multi-way split (CarType):
        Family  Sports  Luxury
  C1       1       2       1
  C2       4       1       1
  Gini = 0.393

Two-way split (find the best partition of values):
  {Sports, Luxury} vs {Family}:  C1 = 3 / 1,  C2 = 2 / 4,  Gini = 0.400
  {Sports} vs {Family, Luxury}:  C1 = 2 / 2,  C2 = 1 / 5,  Gini = 0.419
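A sketch of the split-quality computation, reproducing the CarType numbers above from their count matrices (the helper functions are assumptions of this illustration, not the slides' notation):

# Quality of a split: GINI_split = sum_i (n_i / n) * GINI(child i).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(children):
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# CarType count matrices from the slide ([C1, C2] per partition):
print(round(gini_split([[1, 4], [2, 1], [1, 1]]), 3))   # multi-way: 0.393
print(round(gini_split([[3, 2], [1, 4]]), 3))           # {Sports,Luxury} | {Family}: 0.400
print(round(gini_split([[2, 1], [2, 5]]), 3))           # {Sports} | {Family,Luxury}: 0.419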
Continuous Attributes: Computing the Gini Index
• Use binary decisions based on one value.
• Several choices for the splitting value:
  – Number of possible splitting values = number of distinct values.
• Each splitting value v has a count matrix associated with it:
  – Class counts in each of the partitions, A < v and A ≥ v.
• Simple method to choose the best v:
  – For each v, scan the database to gather the count matrix and compute its Gini index.
  – Computationally inefficient! Repetition of work.

Continuous Attributes: Computing the Gini Index...
• For efficient computation, for each attribute:
  – Sort the attribute on its values.
  – Linearly scan these values, each time updating the count matrix and computing the Gini index.
  – Choose the split position that has the least Gini index.

Cheat (sorted by Taxable Income): No  No  No  Yes Yes Yes No  No  No  No
Taxable Income (sorted values):   60  70  75  85  90  95  100 120 125 220

Split position v  Yes (≤v, >v)  No (≤v, >v)  Gini
      55             0, 3          0, 7      0.420
      65             0, 3          1, 6      0.400
      72             0, 3          2, 5      0.375
      80             0, 3          3, 4      0.343
      87             1, 2          3, 4      0.417
      92             2, 1          3, 4      0.400
      97             3, 0          3, 4      0.300  ← best split
     110             3, 0          4, 3      0.343
     122             3, 0          5, 2      0.375
     172             3, 0          6, 1      0.400
     230             3, 0          7, 0      0.420
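The efficient procedure above can be sketched as follows; the midpoint thresholds and the helper names are choices made for this illustration, and only interior split positions are scanned:

# Sketch of the efficient scan for a continuous attribute: sort by value, then
# sweep candidate thresholds, updating class counts incrementally and keeping
# the threshold with the lowest weighted Gini.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    n, best = len(labels), (None, float("inf"))
    for i, (v, lab) in enumerate(pairs):
        left[lab] += 1
        right[lab] -= 1
        if i == n - 1:
            break
        threshold = (v + pairs[i + 1][0]) / 2          # midpoint between neighbours
        w = (i + 1) / n
        score = w * gini(list(left.values())) + (1 - w) * gini(list(right.values()))
        if score < best[1]:
            best = (threshold, score)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_threshold(income, cheat))   # ~ (97.5, 0.3), matching the best split near 97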
Alternative Splitting Criteria Based on INFO
• Entropy at a given node t:

  Entropy(t) = - \sum_j p(j \mid t) \log_2 p(j \mid t)

  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Measures the homogeneity of a node:
    • Maximum (log₂ nc) when records are equally distributed among all classes, implying the least information.
    • Minimum (0.0) when all records belong to one class, implying the most information.
  – Entropy-based computations are similar to the GINI index computations.

Examples for Computing Entropy
  C1 = 0, C2 = 6: P(C1) = 0, P(C2) = 1 → Entropy = -0 log₂ 0 - 1 log₂ 1 = 0
  C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6 → Entropy = -(1/6) log₂(1/6) - (5/6) log₂(5/6) = 0.65
  C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6 → Entropy = -(2/6) log₂(2/6) - (4/6) log₂(4/6) = 0.92
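As a sketch, the entropy of a node can be computed from its class counts exactly as in the examples above:

# Entropy of a node from its class counts: Entropy(t) = -sum_j p(j|t) log2 p(j|t).
# Written as p * log2(1/p) to avoid a negative zero for pure nodes.
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c) if n else 0.0

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(entropy(counts), 2))   # 0.0, 0.65, 0.92, 1.0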
Splitting Based on INFO...
• Information Gain:

  GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} (n_i / n) \, Entropy(i)

  Parent node p is split into k partitions; n_i is the number of records in partition i.
  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
  – Used in ID3 and C4.5.
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

Splitting Based on INFO...
• Gain Ratio:

  GainRATIO_{split} = GAIN_{split} / SplitINFO,   where   SplitINFO = - \sum_{i=1}^{k} (n_i / n) \log (n_i / n)

  Parent node p is split into k partitions; n_i is the number of records in partition i.
  – Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
  – Used in C4.5.
  – Designed to overcome the disadvantage of Information Gain.
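Both criteria can be sketched from class counts as below. The example call evaluates a split of the ten-record Refund data (parent 3 Yes / 7 No into children [0, 3] and [3, 4]); this particular split is chosen here as an illustration, not a computation shown on the slides.

# Information gain and gain ratio of a split, from the class counts of the
# parent node and of each child partition.
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c) if n else 0.0

def information_gain(parent, children):
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

def gain_ratio(parent, children):
    n = sum(parent)
    split_info = sum((sum(ch) / n) * log2(n / sum(ch)) for ch in children if sum(ch))
    return information_gain(parent, children) / split_info if split_info else 0.0

print(round(information_gain([3, 7], [[0, 3], [3, 4]]), 3))   # ~ 0.192
print(round(gain_ratio([3, 7], [[0, 3], [3, 4]]), 3))         # ~ 0.217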
Splitting Criteria Based on Classification Error
• Classification error at a node t:

  Error(t) = 1 - \max_i P(i \mid t)

• Measures the misclassification error made by a node:
  – Maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information.
  – Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples for Computing Error
  C1 = 0, C2 = 6: P(C1) = 0, P(C2) = 1 → Error = 1 - max(0, 1) = 1 - 1 = 0
  C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6 → Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
  C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6 → Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
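As a sketch, matching the examples above:

# Classification error of a node: Error(t) = 1 - max_i P(i|t).
def classification_error(counts):
    n = sum(counts)
    return 1 - max(counts) / n if n else 0.0

for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, round(classification_error(counts), 3))   # 0.0, 0.167, 0.333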
Comparison Among Splitting Criteria
For a 2-class problem: (figure comparing Entropy, Gini, and Misclassification error as functions of the class probability p; all three are maximal at p = 0.5 and zero at p = 0 or p = 1).

Misclassification Error vs. Gini
Parent node: C1 = 7, C2 = 3 (Gini = 0.42). Split A? yields N1 (C1 = 3, C2 = 0) and N2 (C1 = 4, C2 = 3):
  Gini(N1) = 1 - (3/3)² - (0/3)² = 0
  Gini(N2) = 1 - (4/7)² - (3/7)² = 0.489
  Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Gini improves (0.342 < 0.42), even though the weighted misclassification error stays at 3/10 before and after the split.
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same class.
• Stop expanding a node when all the records have similar attribute values.
• Early termination (to be discussed later).

Step 2: Tree → Rule
Step 2: Tree → Rule (contd)

Example Training Data Set

Example (contd)

Extracting Classification Rules From the Tree

Raw data → sample with attributes and a target attribute → Decision Tree??
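Returning to Step 3 of the process (tree → rules): each root-to-leaf path yields a rule whose conditions are joined by AND (conjunction), and the rules are combined by OR (disjunction). The sketch below uses the nested-dictionary tree format from the earlier illustration and a deliberately simplified two-level version of the example tree; both are assumptions of this sketch.

# Sketch of "tree -> rules": one rule per root-to-leaf path, conditions joined
# by AND, rules combined by OR across paths.
def extract_rules(node, conditions=()):
    if not isinstance(node, dict):                      # leaf: emit one rule
        body = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {body} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        cond = f"{node['attribute']} = {value}"
        rules.extend(extract_rules(child, conditions + (cond,)))
    return rules

# Simplified two-level tree used only for this illustration:
tree = {"attribute": "Refund",
        "branches": {"Yes": "NO",
                     "No": {"attribute": "MarSt",
                            "branches": {"Married": "NO",
                                         "Single, Divorced": "YES"}}}}
for rule in extract_rules(tree):
    print(rule)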
Initial Entropy
• Number of instances = 8
• Number of positive instances = 3
• Number of negative instances = 5

  Entropy(Hipertensi) = -P(+) log₂ P(+) - P(-) log₂ P(-)
                      = -(3/8) log₂(3/8) - (5/8) log₂(5/8)
                      = -(0.375 × -1.415) - (0.625 × -0.678)
                      = 0.531 + 0.424 = 0.955

Entropy of Usia (age)
• Number of instances = 8
• Instances of Usia:
  – Muda (young): 1 positive, 3 negative → Entropy(Muda) = -(1/4) log₂(1/4) - (3/4) log₂(3/4) = 0.811
  – Tua (old): 2 positive, 2 negative → Entropy(Tua) = -(2/4) log₂(2/4) - (2/4) log₂(2/4) = 1

Gain of Usia

  Gain(S, Usia) = Entropy(S) - \sum_{v \in \{Muda, Tua\}} (|S_v| / |S|) Entropy(S_v)
                = 0.955 - (4/8)(0.811) - (4/8)(1)
                = 0.955 - 0.406 - 0.500 = 0.049

Entropy of Berat (weight)
• Number of instances = 8
• Instances of Berat:
  – Overweight: 3 positive, 1 negative → Entropy(Overweight) = -(3/4) log₂(3/4) - (1/4) log₂(1/4) = 0.811
  – Average: 0 positive, 2 negative → Entropy(Average) = 0
  – Underweight: 0 positive, 2 negative → Entropy(Underweight) = 0
Gain of Berat

  Gain(S, Berat) = Entropy(S) - \sum_{v \in \{Overweight, Average, Underweight\}} (|S_v| / |S|) Entropy(S_v)
                 = 0.955 - (4/8)(0.811) - (2/8)(0) - (2/8)(0)
                 = 0.955 - 0.406 = 0.549

Entropy of Jenis Kelamin (gender)
• Number of instances = 8
• Instances of Jenis Kelamin:
  – Pria (male): 2 positive, 4 negative → Entropy(Pria) = -(2/6) log₂(2/6) - (4/6) log₂(4/6) = 0.918
  – Wanita (female): 1 positive, 1 negative → Entropy(Wanita) = 1

Gain of Jenis Kelamin

  Gain(S, JenisKelamin) = Entropy(S) - \sum_{v \in \{Pria, Wanita\}} (|S_v| / |S|) Entropy(S_v)
                        = 0.955 - (6/8)(0.918) - (2/8)(1)
                        = 0.955 - 0.689 - 0.250 = 0.016

• The attribute Berat is selected as the root because its Information Gain is the highest.

  Berat
    Overweight   (4 instances)
    Average      (2 instances)
    Underweight  (2 instances)

• Compute the highest Gain within each branch to determine the next split.
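The whole worked example can be checked with a short script. The sketch below recomputes the entropies and gains from the per-value class counts listed above (function names are choices of this illustration):

# Recomputing the worked example from the per-value class counts given above
# (8 records, 3 positive / 5 negative). Counts are (positive, negative).
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c) if n else 0.0

def gain(parent, partitions):
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in partitions)

parent = (3, 5)                                           # Hipertensi: 3 yes, 5 no
splits = {
    "Usia":          [(1, 3), (2, 2)],                    # Muda, Tua
    "Berat":         [(3, 1), (0, 2), (0, 2)],            # Overweight, Average, Underweight
    "Jenis Kelamin": [(2, 4), (1, 1)],                    # Pria, Wanita
}
for name, parts in splits.items():
    print(name, round(gain(parent, parts), 3))
# Berat has the largest gain, so it is chosen as the root attribute.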
Node for the Overweight Branch
• Number of instances = 4
• Instances with (Berat = Overweight) & Usia:
  – Muda: 1 positive, 0 negative
  – Tua: 2 positive, 1 negative
• Instances with (Berat = Overweight) & Jenis Kelamin:
  – Pria: 2 positive, 1 negative
  – Wanita: 1 positive, 0 negative

The Resulting Decision Tree

Classification rules ????
Practical Issues of Classification
• Underfitting and overfitting:
  – Underfitting: when the model is too simple, both training and test errors are large.
  – Overfitting results in decision trees that are more complex than necessary.
• Missing values.
• Costs of classification.

How to Address Overfitting
• Pre-pruning (early stopping rule):
  – Stop the algorithm before it becomes a fully grown tree.
  – Typical stopping conditions for a node:
    • Stop if all instances belong to the same class.
    • Stop if all the attribute values are the same.
  – More restrictive conditions:
    • Stop if the number of instances is less than some user-specified threshold.
    • Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test).
    • Stop if expanding the current node does not improve the impurity measures (e.g., Gini or information gain).
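For reference, a sketch of how such pre-pruning conditions appear in practice, assuming scikit-learn is used (the slides do not prescribe a library, and the threshold values below are arbitrary examples):

# Pre-pruning in practice: scikit-learn's DecisionTreeClassifier exposes
# stopping conditions like those listed above as constructor parameters.
from sklearn.tree import DecisionTreeClassifier

pre_pruned = DecisionTreeClassifier(
    max_depth=4,                 # stop growing below a fixed depth
    min_samples_split=10,        # stop if a node has fewer than 10 instances
    min_impurity_decrease=0.01,  # stop if the split does not improve impurity enough
)
# pre_pruned.fit(X_train, y_train)  # X_train / y_train are assumed to exist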
How to Address Overfitting…
• Post-pruning:
  – Grow the decision tree to its entirety.
  – Trim the nodes of the decision tree in a bottom-up fashion.
  – If the generalization error improves after trimming, replace the sub-tree by a leaf node.
  – The class label of the leaf node is determined from the majority class of instances in the sub-tree.
  – Can use MDL for post-pruning.

Handling Missing Attribute Values
• Missing values affect decision tree construction in three different ways:
  – They affect how impurity measures are computed.
  – They affect how to distribute an instance with a missing value to child nodes.
  – They affect how a test instance with a missing value is classified.
Computing the Impurity Measure (with a missing value)

Training data (Tid 10 has a missing Refund value):
Tid  Refund  Marital Status  Taxable Income  Class
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   ?       Single           90K            Yes

Before splitting: Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813

Split on Refund (class counts):
              Class=Yes  Class=No
Refund=Yes        0          3
Refund=No         2          4
Refund=?          1          0

Entropy(Refund=Yes) = 0
Entropy(Refund=No) = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183
Entropy(Children) = 0.3 × (0) + 0.6 × (0.9183) = 0.551
Gain = 0.9 × (0.8813 - 0.551) = 0.297

Distribute Instances
The record with the missing value (Tid 10) is distributed to the children in proportion to the known values:
Probability that Refund = Yes is 3/9; probability that Refund = No is 6/9.
Assign the record to the left child with weight 3/9 and to the right child with weight 6/9:
  Refund=Yes: Class=Yes 0 + 3/9, Class=No 3
  Refund=No:  Class=Yes 2 + 6/9, Class=No 4
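A sketch of the same computation in code, assuming the C4.5 convention of scaling the gain by the fraction of records whose Refund value is known (9/10):

# Gain computation when one record has a missing Refund value: entropies use
# only the 9 known records; the gain is scaled by the known fraction (9/10).
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c) if n else 0.0

parent = (3, 7)                       # all 10 records: 3 Yes, 7 No
known = ((0, 3),                      # Refund = Yes: 0 Yes, 3 No
         (2, 4))                      # Refund = No : 2 Yes, 4 No
n_known = sum(sum(c) for c in known)  # 9 records with a known Refund value

children = sum(sum(c) / 10 * entropy(c) for c in known)   # 0.3*0 + 0.6*0.9183
gain = (n_known / 10) * (entropy(parent) - children)
print(round(entropy(parent), 4), round(children, 4), round(gain, 4))
# ~ 0.8813  0.551  0.2973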
Classify Instances
New record: Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?

Class counts at the MarSt node (after the weighted distribution):
            Married  Single  Divorced  Total
Class=No      3        1        0       4
Class=Yes     6/9      1        1       2.67
Total         3.67     2        1       6.67

Probability that Marital Status = Married is 3.67/6.67.
Probability that Marital Status = {Single, Divorced} is 3/6.67.
The test instance is propagated along both branches with these weights when it is classified.

Other Approaches
• Fill the missing data with the most frequently occurring value.
• Compute the probability of occurrence of each value, then distribute those values over the missing entries according to their probabilities.
  – For example, if there are 3 missing entries and P(Yes) = 0.75 and P(No) = 0.25, then the value "Yes" is filled into 2 of the missing entries and "No" into the remaining missing entry.
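Both approaches can be sketched on a small example column; the helper names and the toy data below are assumptions of this illustration:

# Two simple imputation strategies for a column with missing values (None):
# 1) fill with the most frequent value, 2) distribute values in proportion to
# their observed probabilities.
from collections import Counter

column = ["Yes", "Yes", "No", None, "Yes", None, None]

known = [v for v in column if v is not None]
counts = Counter(known)

# 1) Fill every missing entry with the most frequent value.
mode = counts.most_common(1)[0][0]
filled_mode = [v if v is not None else mode for v in column]

# 2) Distribute values over the missing entries according to their probabilities
#    (3 missing entries, P(Yes) = 0.75, P(No) = 0.25 -> two "Yes" and one "No").
n_missing = column.count(None)
quota = {v: round(c / len(known) * n_missing) for v, c in counts.items()}
fill_values = [v for v, q in quota.items() for _ in range(q)]
it = iter(fill_values + [mode] * n_missing)   # pad with the mode if rounding leaves a gap
filled_prob = [v if v is not None else next(it) for v in column]

print(filled_mode)
print(filled_prob)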
Tugas & Praktikum (Assignment & Practicum)
• Complete the tree-building process (the Entropy and Gain calculations) for the training data on slide 68, then write down the resulting rules.
• Build rules using a decision tree based on the following training data (columns: No. USM, Nilai USM, Psikologi, Wawancara, Prodi):
  – No. USM: A001, A002, A003, A004, A005, A006, A007, A008, A009, A010, A011
  – Nilai USM: 3.75, 3.25, 3.93, 3.12, 2.85, 2.79, 2.98, 2.83, 2.21, 2.63, 2.5
  – Psikologi: Tinggi, Sedang, Sedang, Rendah, Tinggi, Sedang, Rendah, Sedang, Rendah
  – Wawancara: Baik, Baik, Buruk, Baik, Buruk, Baik, Buruk, Baik
  – Prodi: IF, IF, IF, TI, IF, SI, TI, SI, TI, SI, SI
• Build or find an implementation of a Decision Tree.

Referensi
1. Ruoming Jin. 2008. Classification: Basic Concepts and Decision Trees [online]. URL: www.cs.kent.edu/~jin/DM07/ClassificationDecisionTree.ppt. Accessed: 16 April 2011.
2. Mourad Ykhlef. 2009. Decision Tree Induction System. King Saud University.
3. Achmad Basuki, Iwan Syarif. 2003. Decision Tree. Politeknik Elektronika Negeri Surabaya (PENS) – ITS.
4. Simon Colton. 2004. Decision Tree Learning.
5. Tom M. Mitchell. 1997. Machine Learning. McGraw-Hill.
6. Suyanto. 2011. Artificial Intelligence. Informatika.