Classification and Decision Trees


Jurusan Teknik Informatika – Universitas Widyatama

Learning (Decision Tree)
Pertemuan (Session): 12. Dosen Pembina (Lecturers): Sriyani Violina, Danang Junaedi

Classification: Definition
• Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set and a test set: the training set is used to build the model and the test set is used to validate it.

Illustrating Classification Task

Training set (input to "Learn Model"):

  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Test set (input to "Apply Model", which assigns the unknown class labels):

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
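The Learn Model / Apply Model workflow above can be made concrete with a short, hedged sketch in Python. The use of pandas and scikit-learn, and the one-hot encoding of Attrib1/Attrib2, are assumptions made for illustration only; the slides do not prescribe any particular library.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Attrib1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Attrib2": ["Large", "Medium", "Small", "Medium", "Large",
                "Medium", "Large", "Small", "Medium", "Small"],
    "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands (125K -> 125)
    "Class":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
test = pd.DataFrame({
    "Attrib1": ["No", "Yes", "Yes", "No", "No"],
    "Attrib2": ["Small", "Medium", "Large", "Small", "Large"],
    "Attrib3": [55, 80, 110, 95, 67],
})

# One-hot encode the categorical attributes so the tree learner can use them.
X_train = pd.get_dummies(train.drop(columns="Class"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

# Learn Model on Tid 1-10, then Apply Model to Tid 11-15.
model = DecisionTreeClassifier(criterion="gini").fit(X_train, train["Class"])
print(model.predict(X_test))   # predicted class labels for Tid 11-15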

Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines

Example of a Decision Tree

Training Data:

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

  Refund?
  ├── Yes → NO
  └── No  → MarSt?
            ├── Married          → NO
            └── Single, Divorced → TaxInc?
                                   ├── < 80K → NO
                                   └── > 80K → YES

Another Example of Decision Tree

The same training data also fits a different tree, with MarSt tested first:

  MarSt?
  ├── Married          → NO
  └── Single, Divorced → Refund?
                         ├── Yes → NO
                         └── No  → TaxInc?
                                   ├── < 80K → NO
                                   └── > 80K → YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task

This is the general workflow from before with a decision tree as the model: learn the tree from the training set (Tid 1–10 with Attrib1, Attrib2, Attrib3, Class) and apply it to the test set (Tid 11–15) to predict the missing class labels.

Apply Model to Test Data

Test record:  Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch that matches the test record at each node:
1. Refund? The test record has Refund = No, so take the "No" branch to the MarSt node.
2. MarSt? The test record has Marital Status = Married, so take the "Married" branch.
3. That branch ends in a leaf labeled NO, so assign Cheat to "No".

Decision Tree Classification Task

The decision tree is the model in the workflow shown earlier: it is learned from the training set (Tid 1–10) and then applied to the test set (Tid 11–15) to assign their unknown class labels.
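The same traversal can be written as a few lines of Python. This is only an illustrative sketch of the tree above; the dict-based record format is a convenience, and the routing of a Taxable Income of exactly 80K is a choice made here, since the slide labels only the "< 80K" and "> 80K" branches.

def classify(record):
    # Refund -> MarSt -> TaxInc, as in the decision tree above.
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Records with TaxInc >= 80K take the "> 80K" branch here (an assumption).
    return "No" if record["TaxInc"] < 80 else "Yes"

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))   # "No" -> assign Cheat to "No"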

Decision Trees
• A decision tree is a chronological representation of the decision problem.
• Each decision tree has two types of nodes: round nodes correspond to the states of nature, while square nodes correspond to the decision alternatives.
• The branches leaving each round node represent the different states of nature, while the branches leaving each square node represent the different decision alternatives.
• At the end of each limb of the tree are the payoffs attained from the series of branches making up that limb.

Strength of Decision Tree

When To Consider Decision Tree?

Weakness of Decision Tree

Decision Tree Concept (Konsep Decision Tree)
• Transform the data into a decision tree and a set of decision rules.

Decision Tree

Data Concepts in a Decision Tree (Konsep Data dalam Decision Tree)
• Data is expressed as a table of attributes and records.
• An attribute is a parameter used as a criterion in building the tree. For example, to decide whether to play tennis, the criteria considered are the weather, the wind, and the temperature. One of the attributes states the solution (class) for each data item; it is called the target attribute.
• An attribute takes values called instances. For example, the weather attribute has the instances sunny, cloudy, and rainy.

The Decision Tree Process (Proses Dalam Decision Tree)
1. Specify the problem: determine the attributes and the target attribute from the available data.
2. Transform the data (table) into a tree model, using for example:
   – Hunt's Algorithm (one of the earliest)
   – CART
   – ID3, C4.5
   – SLIQ, SPRINT
   – etc.
3. Transform the tree model into rules:
   – Disjunction (∨, OR)
   – Conjunction (∧, AND)
4. Simplify the rules (pruning).

Step 1: Transforming the Data into a Tree (Mengubah Data → Tree)

General Structure of Hunt's Algorithm
• Let Dt be the set of training records that reach a node t.
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  – If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
A code sketch of this recursion follows the illustration below.

(Training data: the Tid 1–10 table with Refund, Marital Status, Taxable Income, and Cheat shown earlier; Dt is the subset of these records reaching the node under consideration.)

Hunt's Algorithm (applied to that training data)

Step 0: a single node labeled with the majority class, Don't Cheat.
Step 1: split on Refund.
  Refund = Yes → Don't Cheat
  Refund = No  → ? (still a mix of classes)
Step 2: under Refund = No, split on Marital Status.
  Marital Status = Married           → Don't Cheat
  Marital Status = Single, Divorced  → ? (still a mix of classes)
Step 3: under Single, Divorced, split on Taxable Income.
  Taxable Income < 80K   → Don't Cheat
  Taxable Income >= 80K  → Cheat
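The following is a minimal, hedged sketch of the recursive procedure above in Python. The split-selection step is passed in as a function (for example, one that minimizes the weighted Gini index defined later in the deck); the helper name choose_best_split and the record format are assumptions made for illustration, not part of the original slides.

from collections import Counter

def hunt(records, attributes, choose_best_split, default_class):
    """records: list of (attribute_dict, class_label) pairs reaching this node (Dt)."""
    if not records:                                  # Dt is empty -> leaf with default class yd
        return default_class
    classes = [label for _, label in records]
    if len(set(classes)) == 1:                       # all records share the same class yt
        return classes[0]
    if not attributes:                               # no attribute test left -> majority class
        return Counter(classes).most_common(1)[0][0]

    attr = choose_best_split(records, attributes)    # attribute test used to split Dt
    majority = Counter(classes).most_common(1)[0][0]
    subtree = {attr: {}}
    for value in {rec[attr] for rec, _ in records}:  # one child per observed attribute value
        subset = [(rec, label) for rec, label in records if rec[attr] == value]
        remaining = [a for a in attributes if a != attr]
        subtree[attr][value] = hunt(subset, remaining, choose_best_split, majority)
    return subtree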

Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  – Determine how to split the records.
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting.

How to Specify Test Condition?
• Depends on attribute types:
  – Nominal
  – Ordinal
  – Continuous
• Depends on the number of ways to split:
  – 2-way split
  – Multi-way split

Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values.
  Example: CarType → {Family}, {Sports}, {Luxury}
• Binary split: divide the values into two subsets; need to find the optimal partitioning.
  Example: CarType → {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}

Splitting Based on Continuous Attributes
• Different ways of handling:
  – Discretization to form an ordinal categorical attribute
    • Static: discretize once at the beginning
    • Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • Consider all possible splits and find the best cut
    • Can be more compute-intensive

ID3 Algorithm

ID3 Algorithm (contd)

C4.5
• Simple depth-first construction.
• Uses Information Gain.
• Sorts continuous attributes at each node.
• Needs the entire data to fit in memory.
• Unsuitable for large datasets.
  – Needs out-of-core sorting.
• You can download the software from:
  http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz

How to determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1.
Which test condition is the best?

How to determine the Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:
  – Non-homogeneous → high degree of impurity
  – Homogeneous → low degree of impurity

Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error

How to Find the Best Split
Before splitting: class counts C0 = N00, C1 = N01, with impurity M0.
Candidate attribute A? splits the records into Node N1 (C0 = N10, C1 = N11) and Node N2 (C0 = N20, C1 = N21), with impurities M1 and M2 and combined impurity M12.
Candidate attribute B? splits the records into Node N3 (C0 = N30, C1 = N31) and Node N4 (C0 = N40, C1 = N41), with impurities M3 and M4 and combined impurity M34.
Compare Gain = M0 − M12 vs M0 − M34 and choose the split with the larger gain.

Measure of Impurity: GINI
• Gini Index for a given node t:

    GINI(t) = 1 − Σ_j [ p(j | t) ]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information.
  – Minimum (0.0) when all records belong to one class, implying the most interesting information.

  Two-class examples:
    C1 = 0, C2 = 6 → Gini = 0.000
    C1 = 1, C2 = 5 → Gini = 0.278
    C1 = 2, C2 = 4 → Gini = 0.444
    C1 = 3, C2 = 3 → Gini = 0.500

Examples for computing GINI

    GINI(t) = 1 − Σ_j [ p(j | t) ]²

  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                   Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0
  C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
                   Gini = 1 − (1/6)² − (5/6)² = 0.278
  C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
                   Gini = 1 − (2/6)² − (4/6)² = 0.444

Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as

    GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

  where n_i = number of records at child i, and n = number of records at node p.

Binary Attributes: Computing GINI Index
• Splits into two partitions.
• Effect of weighting the partitions: larger and purer partitions are sought.

  Parent: C1 = 6, C2 = 6, Gini = 0.500
  Split B? into:
    Node N1: C1 = 5, C2 = 2 → Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
    Node N2: C1 = 1, C2 = 4 → Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
  Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Categorical Attributes: Computing Gini Index
• For each distinct value, gather counts for each class in the dataset.
• Use the count matrix to make decisions.

  Multi-way split (CarType):
          Family  Sports  Luxury
    C1      1       2       1
    C2      4       1       1
    Gini = 0.393

  Two-way split (find the best partition of values):
          {Sports, Luxury}  {Family}            {Sports}  {Family, Luxury}
    C1          3               1                   2            2
    C2          2               4                   1            5
    Gini =        0.400                               0.419
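A short sketch of the two formulas above in Python, reproducing the binary-split example (parent 6/6 split by B? into N1 = 5/2 and N2 = 1/4); the function names are illustrative only.

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from a list of per-class record counts."""
    n = sum(counts)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """GINI_split = sum_i (n_i / n) * GINI(i); children is a list of per-class count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini([6, 6]), 3))                   # parent impurity: 0.5
print(round(gini_split([[5, 2], [1, 4]]), 3))   # weighted impurity of the B? split: 0.371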

Continuous Attributes: Computing Gini Index
• Use binary decisions based on one value.
• Several choices for the splitting value:
  – Number of possible splitting values = number of distinct values.
• Each splitting value v has a count matrix associated with it:
  – Class counts in each of the partitions, A < v and A ≥ v.
• Simple method to choose the best v:
  – For each v, scan the database to gather the count matrix and compute its Gini index.
  – Computationally inefficient! Repetition of work.

Continuous Attributes: Computing Gini Index...
• For efficient computation, for each attribute:
  – Sort the attribute on values.
  – Linearly scan these values, each time updating the count matrix and computing the Gini index.
  – Choose the split position that has the least Gini index.

  Worked on the Taxable Income attribute of the Tid 1–10 training data:

  Cheat             No    No    No    Yes   Yes   Yes   No    No    No    No
  Taxable Income    60    70    75    85    90    95    100   120   125   220    (sorted values)
  Split positions      55    65    72    80    87    92    97    110   122   172   230
    Yes  (≤ | >)    0|3   0|3   0|3   0|3   1|2   2|1   3|0   3|0   3|0   3|0   3|0
    No   (≤ | >)    0|7   1|6   2|5   3|4   3|4   3|4   3|4   4|3   5|2   6|1   7|0
    Gini            0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

  The best split position is 97 (Taxable Income ≤ 97), with the lowest Gini index, 0.300.
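This is a minimal sketch, in Python, of the sorted linear-scan procedure just described, run on the Taxable Income column of the Tid 1–10 data. Taking candidate split positions only at midpoints between consecutive sorted values is an assumption consistent with the positions in the table above; the function names are illustrative.

def gini(counts):
    n = sum(counts)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    """Return (split value v, Gini of the split A <= v vs A > v) with the least Gini."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    total = {c: labels.count(c) for c in classes}
    left = {c: 0 for c in classes}
    best_v, best_g = None, float("inf")
    for i in range(len(pairs) - 1):
        left[pairs[i][1]] += 1                      # record i moves into the left partition
        v = (pairs[i][0] + pairs[i + 1][0]) / 2     # candidate split at the midpoint
        right = [total[c] - left[c] for c in classes]
        n_left, n_right = i + 1, len(pairs) - i - 1
        g = (n_left * gini(list(left.values())) + n_right * gini(right)) / len(pairs)
        if g < best_g:
            best_v, best_g = v, g
    return best_v, best_g

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # (97.5, 0.3): the Gini = 0.300 split at position 97 above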

Alternative Splitting Criteria based on INFO
• Entropy at a given node t:

    Entropy(t) = − Σ_j p(j | t) log₂ p(j | t)

  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Measures the homogeneity of a node.
    • Maximum (log n_c) when records are equally distributed among all classes, implying the least information.
    • Minimum (0.0) when all records belong to one class, implying the most information.
  – Entropy-based computations are similar to the GINI index computations.

Examples for computing Entropy
  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                   Entropy = −0 log 0 − 1 log 1 = −0 − 0 = 0
  C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
                   Entropy = −(1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65
  C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
                   Entropy = −(2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92

Splitting Based on INFO...
• Information Gain:

    GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) · Entropy(i)

  Parent node p is split into k partitions; n_i is the number of records in partition i.
  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
  – Used in ID3 and C4.5.
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

Splitting Based on INFO...
• Gain Ratio:

    GainRATIO_split = GAIN_split / SplitINFO,   where   SplitINFO = − Σ_{i=1..k} (n_i / n) log (n_i / n)

  Parent node p is split into k partitions; n_i is the number of records in partition i.
  – Adjusts the Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
  – Used in C4.5.
  – Designed to overcome the disadvantage of Information Gain.
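A compact sketch of Entropy, GAIN_split, SplitINFO, and GainRATIO_split as defined above, using log base 2 as in the slides; class counts are given as lists, and the function names are illustrative.

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0) if n else 0.0

def information_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

def gain_ratio(parent_counts, children_counts):
    n = sum(parent_counts)
    split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children_counts if sum(c) > 0)
    return information_gain(parent_counts, children_counts) / split_info if split_info else 0.0

print(round(entropy([1, 5]), 2))   # 0.65, as in the example above
print(round(entropy([2, 4]), 2))   # 0.92, as in the example above
# Many tiny pure partitions: the gain is 1.0, but the gain ratio drops to about 0.23.
print(round(gain_ratio([10, 10], [[1, 0]] * 10 + [[0, 1]] * 10), 2))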

Splitting Criteria based on Classification Error
• Classification error at a node t:

    Error(t) = 1 − max_i P(i | t)

• Measures the misclassification error made by a node.
  – Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information.
  – Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples for Computing Error
  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                   Error = 1 − max(0, 1) = 1 − 1 = 0
  C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
                   Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6
  C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
                   Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3

Comparison among Splitting Criteria
For a 2-class problem:

Misclassification Error vs Gini

  Parent: C1 = 7, C2 = 3, Gini = 0.42
  Split A? into:
    Node N1: C1 = 3, C2 = 0 → Gini(N1) = 1 − (3/3)² − (0/3)² = 0
    Node N2: C1 = 4, C2 = 3 → Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
  Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

  The weighted Gini improves from 0.42 to 0.342, while the misclassification error stays at 0.3 before and after the split.

Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same class.
• Stop expanding a node when all the records have similar attribute values.
• Early termination (to be discussed later).

Step 2: Tree → Rule
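As a sketch of Step 2 (derived here from the Refund → MarSt → TaxInc tree shown earlier, not from the original slide's figure), each root-to-leaf path becomes one conjunctive rule, and the rule set as a whole forms a disjunction:

  IF Refund = Yes THEN Cheat = No
  IF Refund = No AND MarSt = Married THEN Cheat = No
  IF Refund = No AND MarSt ∈ {Single, Divorced} AND TaxInc < 80K THEN Cheat = No
  IF Refund = No AND MarSt ∈ {Single, Divorced} AND TaxInc > 80K THEN Cheat = Yes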

Step 2: Tree → Rule (contd)

Example Training Data Set

Example (contd)

Extracting Classification Rules From the Tree

Raw data (Data Mentah) → Sample → Attributes (Atribut) and Target Attribute → Decision Tree??

Worked Example (target attribute: Hypertension; attributes: Age, Weight, Gender)

Initial Entropy (Entropy Awal)
• Number of instances = 8; positive instances = 3; negative instances = 5.

    Entropy(Hypertension) = −P(positive) log₂ P(positive) − P(negative) log₂ P(negative)
                          = −(3/8) log₂(3/8) − (5/8) log₂(5/8)
                          = −(0.375 × −1.415) − (0.625 × −0.678)
                          = 0.531 + 0.424 = 0.955

Entropy of Age (Usia)
• Number of instances = 8.
• Age = Young (Muda): 1 positive, 3 negative → Entropy(Young) = −(1/4) log₂(1/4) − (3/4) log₂(3/4) = 0.811
• Age = Old (Tua): 2 positive, 2 negative → Entropy(Old) = 1

Gain of Age
    Gain(S, Age) = Entropy(S) − Σ_{v ∈ {Young, Old}} (|S_v| / |S|) · Entropy(S_v)
                 = 0.955 − (4/8)(0.811) − (4/8)(1)
                 = 0.955 − 0.406 − 0.5 = 0.049

Entropy of Weight (Berat)
• Number of instances = 8.
• Weight = Overweight: 3 positive, 1 negative → Entropy(Overweight) = 0.811
• Weight = Average: 0 positive, 2 negative → Entropy(Average) = 0
• Weight = Underweight: 0 positive, 2 negative → Entropy(Underweight) = 0

Gain of Weight
    Gain(S, Weight) = Entropy(S) − (4/8)·Entropy(S_Overweight) − (2/8)·Entropy(S_Average) − (2/8)·Entropy(S_Underweight)
                    = 0.955 − (4/8)(0.811) − (2/8)(0) − (2/8)(0)
                    = 0.955 − 0.406 = 0.549

Entropy of Gender (Jenis Kelamin)
• Number of instances = 8.
• Gender = Male (Pria): 2 positive, 4 negative → Entropy(Male) = 0.918
• Gender = Female (Wanita): 1 positive, 1 negative → Entropy(Female) = 1

Gain of Gender
    Gain(S, Gender) = Entropy(S) − (6/8)·Entropy(S_Male) − (2/8)·Entropy(S_Female)
                    = 0.955 − (6/8)(0.918) − (2/8)(1)
                    = 0.955 − 0.689 − 0.25 = 0.016

• The Weight attribute is selected as the root, because its Information Gain is the highest.

    Weight?
    ├── Overweight   (4 instances)
    ├── Average      (2 instances)
    └── Underweight  (2 instances)

• Within each branch, compute the Gains again and pick the highest to choose the next split.

Node for the Overweight branch
• Number of instances = 4.
• Instances with (Weight = Overweight) & Age:
  – Young: 1 positive, 0 negative
  – Old:   2 positive, 1 negative
• Instances with (Weight = Overweight) & Gender:
  – Male:   2 positive, 1 negative
  – Female: 1 positive, 0 negative

Resulting Decision Tree

Classification rules?
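A short Python check of the worked example above, computing entropy and information gain directly from the stated class counts (3 positive and 5 negative instances overall); the helper names are illustrative.

from math import log2

def entropy(pos, neg):
    probs = [c / (pos + neg) for c in (pos, neg) if c > 0]
    return -sum(p * log2(p) for p in probs)

def gain(parent, partitions):
    n = sum(parent)
    weighted = sum((p + q) / n * entropy(p, q) for p, q in partitions)
    return entropy(*parent) - weighted

parent = (3, 5)
print(round(gain(parent, [(1, 3), (2, 2)]), 3))          # Age:    ~0.049
print(round(gain(parent, [(3, 1), (0, 2), (0, 2)]), 3))  # Weight: ~0.549  (highest, so selected)
print(round(gain(parent, [(2, 4), (1, 1)]), 3))          # Gender: ~0.016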

Practical Issues of Classification
• Underfitting and Overfitting
  – Underfitting: when the model is too simple, both training and test errors are large.
  – Overfitting results in decision trees that are more complex than necessary.
• Missing Values
• Costs of Classification

How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
  – Stop the algorithm before it grows into a fully-grown tree.
  – Typical stopping conditions for a node:
    • Stop if all instances belong to the same class.
    • Stop if all the attribute values are the same.
  – More restrictive conditions:
    • Stop if the number of instances is less than some user-specified threshold.
    • Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test).
    • Stop if expanding the current node does not improve the impurity measures (e.g., Gini or information gain).

How to Address Overfitting…
• Post-pruning
  – Grow the decision tree to its entirety.
  – Trim the nodes of the decision tree in a bottom-up fashion.
  – If the generalization error improves after trimming, replace the sub-tree by a leaf node.
  – The class label of the leaf node is determined from the majority class of instances in the sub-tree.
  – Can use MDL for post-pruning.
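As a hedged, library-specific aside (the slides themselves are library-agnostic), scikit-learn's DecisionTreeClassifier exposes both kinds of controls: max_depth and min_samples_split act as pre-pruning (early-stopping) thresholds, while ccp_alpha performs cost-complexity post-pruning of the fully grown tree, which is a different criterion from the MDL approach mentioned above. The Iris data is used only to make the sketch runnable.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop early via depth and minimum-split-size thresholds.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10).fit(X, y)

# Post-pruning: grow fully, then prune back with cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())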

Handling Missing Attribute Values
• Missing values affect decision tree construction in three different ways:
  – They affect how impurity measures are computed.
  – They affect how to distribute an instance with a missing value to the child nodes.
  – They affect how a test instance with a missing value is classified.

Computing Impurity Measure (with a missing value)

Training data: the Tid 1–10 table (Refund, Marital Status, Taxable Income, Class), except that the Refund value of Tid 10 (Single, 90K, Class = Yes) is missing ("?").

Before splitting: Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.8813

Split on Refund (count matrix):
                  Class = Yes   Class = No
  Refund = Yes         0             3
  Refund = No          2             4
  Refund = ?           1             0

  Entropy(Refund = Yes) = 0
  Entropy(Refund = No) = −(2/6) log(2/6) − (4/6) log(4/6) = 0.9183
  Entropy(Children) = 0.3 (0) + 0.6 (0.9183) = 0.551
  Gain = 0.9 × (0.8813 − 0.551) = 0.297

Distribute Instances

The record with the missing value (Tid 10) is sent to both children in proportion to the distribution of the known values:
  Probability that Refund = Yes is 3/9; probability that Refund = No is 6/9.
  Assign the record to the left child (Refund = Yes) with weight 3/9 and to the right child (Refund = No) with weight 6/9:

                  Refund = Yes   Refund = No
  Class = Yes       0 + 3/9        2 + 6/9
  Class = No           3              4
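A quick numerical check of the computation above, under the slide's own weighting convention: the children are weighted by their share of all 10 records (0.3 and 0.6), and the resulting gain is scaled by the 9/10 fraction of records whose Refund value is known. Whether this is exactly the intended convention is an assumption.

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

parent = entropy([3, 7])                                    # 0.8813 over all 10 records
children = 0.3 * entropy([0, 3]) + 0.6 * entropy([2, 4])    # 0.551, weights 3/10 and 6/10
gain = 0.9 * (parent - children)                            # scaled by the known-value fraction
print(round(parent, 4), round(children, 4), round(gain, 4)) # 0.8813 0.551 0.2973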

Classify Instances

New record (Tid 11): Refund = ?, Marital Status = Married, Taxable Income = 85K, Class = ?

The record is pushed down the tree (Refund → MarSt → TaxInc). At the MarSt node, the weighted class counts, after distributing the training record with the missing Refund value, are:

                 Married   Single   Divorced   Total
  Class = No        3         1         0        4
  Class = Yes      6/9        1         1       2.67
  Total            3.67       2         1       6.67

  Probability that Marital Status = Married is 3.67/6.67.
  Probability that Marital Status = {Single, Divorced} is 3/6.67.

Other approaches (Cara lain)
• Fill each missing entry with the value that occurs most frequently.
• Compute the probability of each value occurring, then distribute the values over the missing entries according to those probabilities.
  – For example, with 3 missing entries, probability(Yes) = 0.75 and probability(No) = 0.25, the value "Yes" is filled into 2 of the missing entries and "No" into the remaining one.

Assignment & Lab (Tugas & Praktikum)
• Build or find an implementation of a decision tree.
• Complete the tree-building process (the Entropy and Gain computations) for the training data on slide 68, then write down the resulting rules.
• Build rules using a decision tree based on the following training data (columns: No. USM, Nilai USM, Psikologi, Wawancara, Prodi):
  No. USM:    A001, A002, A003, A004, A005, A006, A007, A008, A009, A010, A011
  Nilai USM:  3.75, 3.25, 3.93, 3.12, 2.85, 2.79, 2.98, 2.83, 2.21, 2.63, 2.5
  Psikologi:  Tinggi, Sedang, Sedang, Rendah, Tinggi, Sedang, Rendah, Sedang, Rendah
  Wawancara:  Baik, Baik, Buruk, Baik, Buruk, Baik, Buruk, Baik
  Prodi:      IF, IF, IF, TI, IF, SI, TI, SI, TI, SI, SI
