Web Mining
Nguyen Tri Thanh
Email: [email protected]
Department of Information Systems, Faculty of Information Technology, College of Technology
May 11, 2011
Web Content Mining: Evaluating Clustering
Approaches to Evaluating Clustering
Similarity-Based Criterion Functions
Probabilistic Criterion Functions
MDL-Based Model and Feature Evaluation
Classes to Clusters Evaluation
Precision, Recall and F-measure
Entropy
Approaches to Evaluating Clustering
Criterion functions evaluate clustering models objectively, i.e. using only the document content:
Similarity-based functions: intracluster similarity and sum of squared errors
Probabilistic functions: log-likelihood and category utility
MDL-based evaluation
Approaches to Evaluating Clustering (cont'd)
Document labels (if available) may also be used to evaluate clustering models:
If the labeling is correct and accurately reflects the document content, we can evaluate the quality of the clustering
If the clustering accurately reflects the document content, we can evaluate the quality of the labeling
Criterion function based on labeled data: classes-to-clusters evaluation
Similarity-Based Criterion Functions (distance)
Basic idea: the cluster center $c_i$ (centroid or mean in the case of numeric data) best represents cluster $D_i$ if it minimizes the sum of the squared lengths of the "error" vectors $x - c_i$ for all $x \in D_i$:

$$J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - c_i\|^2, \qquad c_i = \frac{1}{|D_i|} \sum_{x \in D_i} x$$

Alternative formulation based on the pairwise distance between cluster members:

$$J_e = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{x_j, x_l \in D_i} \|x_j - x_l\|^2$$
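As a quick illustration, here is a minimal Python sketch (NumPy assumed; the point data are made up) checking numerically that the centroid form and the pairwise form of $J_e$ agree:

```python
import numpy as np

def sse(clusters):
    """Sum-of-squared-errors criterion: J_e = sum_i sum_{x in D_i} ||x - c_i||^2."""
    return sum(np.sum((D - D.mean(axis=0)) ** 2) for D in clusters)

def sse_pairwise(clusters):
    """Equivalent pairwise form: J_e = (1/2) sum_i (1/|D_i|) sum_{j,l} ||x_j - x_l||^2."""
    total = 0.0
    for D in clusters:
        diffs = D[:, None, :] - D[None, :, :]   # all pairwise difference vectors
        total += np.sum(diffs ** 2) / (2 * len(D))
    return total

# Toy data: two clusters of 2-D points (made up for illustration)
clusters = [np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5]]),
            np.array([[8.0, 9.0], [9.0, 8.0]])]
print(sse(clusters), sse_pairwise(clusters))    # both print 2.0
```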
Similarity-Based Criterion Functions (cosine similarity)
For document clustering, cosine similarity to the cluster centroid $c_i$ is used:

$$J_s = \sum_{i=1}^{k} \sum_{d_j \in D_i} \mathrm{sim}(c_i, d_j), \qquad \mathrm{sim}(c_i, d_j) = \frac{c_i \cdot d_j}{\|c_i\| \, \|d_j\|}, \qquad c_i = \frac{1}{|D_i|} \sum_{d_j \in D_i} d_j$$

Equivalent form based on pairwise similarity:

$$J_s = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j, d_l \in D_i} \mathrm{sim}(d_j, d_l)$$

Another formulation is based on the average pairwise intracluster similarity $\mathrm{sim}(D_i)$ (used to control merging of clusters in hierarchical agglomerative clustering):

$$J_s = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j, d_l \in D_i} \mathrm{sim}(d_j, d_l) = \frac{1}{2} \sum_{i=1}^{k} |D_i| \, \mathrm{sim}(D_i)$$
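A similar sketch for the cosine-based criterion; this computes only the centroid form of $J_s$, with hypothetical term-frequency vectors standing in for documents:

```python
import numpy as np

def centroid_similarity(clusters):
    """J_s = sum_i sum_{d in D_i} cos(c_i, d), with c_i the mean vector of cluster D_i."""
    total = 0.0
    for D in clusters:
        c = D.mean(axis=0)
        # cosine similarity between the centroid and each document vector
        total += np.sum(D @ c / (np.linalg.norm(c) * np.linalg.norm(D, axis=1)))
    return total

# Toy term-frequency vectors (rows = documents); any non-negative vectors work here
clusters = [np.array([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]),
            np.array([[0.0, 1.0, 1.0], [0.0, 2.0, 1.0]])]
print(centroid_similarity(clusters))   # higher values indicate tighter clusters
```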
Similarity-Based Criterion Functions (example)
[Figure: sum of centroid similarity evaluation of four clusterings]
Finding the Number of Clusters
Probabilistic Criterion Functions
A document is a random event. Probability of a document:

$$P(d) = \sum_A P(d \mid A) \, P(A)$$

Probability of the sample (assuming that documents are independent events):

$$P(d_1, d_2, \ldots, d_n) = \prod_{i=1}^{n} \sum_A P(d_i \mid A) \, P(A)$$

Log-likelihood (log of the probability of the sample):

$$L = \sum_{i=1}^{n} \log \sum_A P(d_i \mid A) \, P(A)$$
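A minimal sketch of the log-likelihood computation, assuming a naive Bayes mixture over binary term occurrences (all model parameters below are hypothetical):

```python
import numpy as np

def log_likelihood(docs, priors, cond):
    """L = sum_i log sum_A P(d_i|A) P(A), for binary term vectors under a
    naive Bayes mixture: P(d|A) = prod_t p_At^d_t * (1 - p_At)^(1 - d_t)."""
    L = 0.0
    for d in docs:
        p_d = sum(pA * np.prod(np.where(d == 1, p, 1 - p))
                  for pA, p in zip(priors, cond))
        L += np.log(p_d)
    return L

# Hypothetical two-cluster model over 3 terms
priors = [0.5, 0.5]                         # P(A)
cond = [np.array([0.9, 0.8, 0.1]),          # P(term = 1 | cluster 1)
        np.array([0.1, 0.2, 0.9])]          # P(term = 1 | cluster 2)
docs = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0]])
print(log_likelihood(docs, priors, cond))   # larger (closer to 0) is better
```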
Probabilistic Criterion Functions (cont'd)
Category utility (based on probabilities of attributes):
Takes into account the probabilities of attributes having particular values within clusters and across clusters (Gluck & Corter)
CU measures both the probability that two objects in the same category have attribute values in common and the probability that objects from different categories have different attribute values:

$$CU = \sum_i \sum_j \sum_k P(a_j = v_{ij} \mid C_k) \, P(C_k \mid a_j = v_{ij}) \, P(a_j = v_{ij})$$
Category Utility (basic idea)

$$CU = \sum_i \sum_j \sum_k P(a_j = v_{ij} \mid C_k) \, P(C_k \mid a_j = v_{ij}) \, P(a_j = v_{ij})$$

$P(a_j = v_{ij} \mid C_k)$ is the probability that an object has value $v_{ij}$ for its attribute $a_j$ given that it belongs to category $C_k$. The higher this probability, the more likely two objects in a category share the same attribute values.
$P(C_k \mid a_j = v_{ij})$ is the probability that an object belongs to category $C_k$ given that it has value $v_{ij}$ for its attribute $a_j$. The greater this probability, the less likely objects from different categories have attribute values in common.
$P(a_j = v_{ij})$ is a weight coefficient ensuring that frequent attribute values have a stronger influence on the evaluation.
Category Utility Function

$$CU = \sum_i \sum_j \sum_k P(a_j = v_{ij} \mid C_k) \, P(C_k \mid a_j = v_{ij}) \, P(a_j = v_{ij})$$

Applying Bayes' rule

$$P(C_k \mid a_j = v_{ij}) = \frac{P(a_j = v_{ij} \mid C_k) \, P(C_k)}{P(a_j = v_{ij})}$$

gives

$$CU = \sum_k P(C_k) \sum_i \sum_j P(a_j = v_{ij} \mid C_k)^2$$

where $\sum_i \sum_j P(a_j = v_{ij} \mid C_k)^2$ is the expected number of attribute values of a member of $C_k$ correctly guessed using a probability matching strategy.
Category Utility Function (cont'd)
Assume we don't know the categories (clusters) in our sample; then

$$\sum_i \sum_j P(a_j = v_{ij})^2$$

is the expected number of correctly guessed attribute values without knowing the categories (clusters) in the sample.
The final expression of the function is

$$CU(C_1, C_2, \ldots, C_n) = \frac{1}{n} \sum_k P(C_k) \sum_i \sum_j \left( P(a_j = v_{ij} \mid C_k)^2 - P(a_j = v_{ij})^2 \right)$$
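A minimal sketch of computing CU from data, assuming categorical attributes given as integer arrays (the two-cluster sample below is made up):

```python
import numpy as np

def category_utility(clusters):
    """CU(C_1..C_n) = (1/n) sum_k P(C_k) sum over attributes/values of
    [P(a=v|C_k)^2 - P(a=v)^2], for categorical data as integer arrays."""
    data = np.vstack(clusters)
    cu = 0.0
    for C in clusters:
        p_ck = len(C) / len(data)
        inner = 0.0
        for j in range(data.shape[1]):               # attributes
            for v in np.unique(data[:, j]):          # attribute values
                p_v_given_c = np.mean(C[:, j] == v)  # P(a_j = v | C_k)
                p_v = np.mean(data[:, j] == v)       # P(a_j = v)
                inner += p_v_given_c ** 2 - p_v ** 2
        cu += p_ck * inner
    return cu / len(clusters)

# Hypothetical binary attribute vectors split into two clusters
clusters = [np.array([[1, 1, 0], [1, 1, 0], [1, 0, 0]]),
            np.array([[0, 0, 1], [0, 1, 1]])]
print(category_utility(clusters))   # higher CU = better category structure
```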
Category Utility (example)
A = {1,3,4,6,8,10,12,16,17,18,19}
B = {2,5,7,9,11,13,14,15,20}
Category Utility Function for Continuous Attributes
The category utility function can be extended to continuous attributes by assuming a normal distribution and replacing probabilities with densities:

$$CU(C_1, C_2, \ldots, C_n) = \frac{1}{n} \sum_{k=1}^{n} P(C_k) \sum_i \left[ \int f(v_{ik})^2 \, dv_{ik} - \int f(v_i)^2 \, dv_i \right]$$

where $f(\cdot)$ is the probability density function of the normal distribution.
After solving the integrals, we obtain

$$CU(C_1, C_2, \ldots, C_n) = \frac{1}{n} \sum_{k=1}^{n} P(C_k) \, \frac{1}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{ik}} - \frac{1}{\sigma_i} \right)$$
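A sketch of the continuous version under the normality assumption; it presumes every attribute has nonzero variance both per cluster and overall (the numeric values below are made up):

```python
import numpy as np

def continuous_cu(clusters):
    """CU = (1/n) sum_k P(C_k) * 1/(2*sqrt(pi)) * sum_i (1/sigma_ik - 1/sigma_i),
    for numeric data under the normality assumption."""
    data = np.vstack(clusters)
    sigma = data.std(axis=0)        # per-attribute std over the whole sample
    total = 0.0
    for C in clusters:
        sigma_k = C.std(axis=0)     # per-attribute std within the cluster
        total += (len(C) / len(data)) * np.sum(1 / sigma_k - 1 / sigma) / (2 * np.sqrt(np.pi))
    return total / len(clusters)

# Toy numeric sample: two well-separated 1-D clusters (values made up)
clusters = [np.array([[1.0], [1.2], [0.9]]), np.array([[5.0], [5.1], [4.8]])]
print(continuous_cu(clusters))      # positive CU indicates informative clusters
```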
Data Example
[Table: 20 documents (1-20) described by six binary attributes: history, science, research, offers, students, hall]
Finding Regularities in Data
Describe the cluster as a pattern of attribute values that repeats in the data
Include attribute values that are the same for all members of the cluster and omit attributes that have different values (generalization by dropping conditions)
Hypothesis 1 (using attributes science and research):

H1 =
R1: IF (science = 0) AND (research = 1) THEN Class = A
R2: IF (science = 1) AND (research = 0) THEN Class = A
R3: IF (science = 1) AND (research = 1) THEN Class = A
R4: IF (science = 0) AND (research = 0) THEN Class = B

A = {1,3,4,6,8,10,12,16,17,18,19}
B = {2,5,7,9,11,13,14,15,20}
Finding Regularities in Data (cont'd)
Hypothesis 2 (using attribute offers):

H2 =
R1: IF (offers = 0) THEN Class = A
R2: IF (offers = 1) THEN Class = B

A = {1,3,4,6,8,10,12,16,17,18,19}
B = {2,5,7,9,11,13,14,15,20}

Which one is better, H1 or H2?
Occam's Razor
"Entities are not to be multiplied beyond necessity" (William of Occam, 14th century)
Among several alternatives, the simplest one is usually the best choice
H2 looks simpler (shorter) than H1, so H2 may be better than H1
How do we measure simplicity?
Measuring Complexity
Dropping more conditions produces simpler (shorter) hypotheses; the simplest one is the empty rule
The latter, however, is a single cluster including the whole dataset (overgeneralization)
The most complex (longest) hypothesis has 20 clusters and is equivalent to the original dataset (overfitting)
How do we find the right balance?
The answer to both questions is Minimum Description Length (MDL)
Minimum Description Length (MDL)
Given a data set D (e.g. our document collection) and a set of hypotheses H1, H2, ..., Hn, each of which describes D
Find the most likely hypothesis:

$$H_i = \arg\max_i P(H_i \mid D)$$

Direct estimation of $P(H_i \mid D)$ is difficult, so we apply Bayes' rule:

$$P(H_i \mid D) = \frac{P(H_i) \, P(D \mid H_i)}{P(D)}$$

Taking $-\log_2$ of both sides:

$$-\log_2 P(H_i \mid D) = -\log_2 P(H_i) - \log_2 P(D \mid H_i) + \log_2 P(D)$$
Minimum Description Length (MDL) (cont'd)
Consider hypotheses and data as messages and apply Shannon's information theory, which defines the information in a message as the negative logarithm of its probability
Then estimate the number of bits L needed to encode the messages:

$$L(H_i \mid D) = L(H_i) + L(D \mid H_i) - L(D)$$
Minimum Description Length Principle
L(Hi) and L(D) are the minimum numbers of bits needed to encode the hypothesis and the data
L(D|Hi) is the number of bits needed to encode D if we know Hi
If we think of Hi as a pattern that repeats in D, we don't have to encode all its occurrences; rather, we encode only the pattern itself and the differences that identify each individual instance in D
Thus, the more regularity in the data, the shorter the description length L(D|Hi)
Minimum Description Length Principle (cont'd)
We need a good balance between L(Hi) and L(D|Hi): if Hi describes the data exactly, then L(D|Hi) = 0, but L(Hi) will be large
We can exclude L(D) because it does not depend on the choice of hypothesis
Minimum Description Length (MDL) principle:

$$H_i = \arg\min_i \left[ L(H_i) + L(D \mid H_i) \right]$$
MDL-Based Model Evaluation (Basics)
Choose a description language (e.g. rules)
Use the same encoding scheme for both hypotheses and data given the hypotheses
Assume that hypotheses and data are uniformly distributed:
The probability of the occurrence of an item out of n alternatives is 1/n
The minimum code length of the message informing us that a particular item has occurred is $-\log_2(1/n) = \log_2 n$
How do we encode the occurrence of k items among n items? (A choice of k out of n can be made in $\binom{n}{k}$ ways, so it takes $\log_2 \binom{n}{k}$ bits)
Recall Data Example
Recall Hypotheses
Hypothesis 1 (using attributes science and research):

H1 =
R1: IF (science = 0) AND (research = 1) THEN Class = A
R2: IF (science = 1) AND (research = 0) THEN Class = A
R3: IF (science = 1) AND (research = 1) THEN Class = A
R4: IF (science = 0) AND (research = 0) THEN Class = B

Hypothesis 2 (using attribute offers):

H2 =
R1: IF (offers = 0) THEN Class = A
R2: IF (offers = 1) THEN Class = B

A = {1,3,4,6,8,10,12,16,17,18,19}
B = {2,5,7,9,11,13,14,15,20}
MDL-Based Model Evaluation (Example)
Description language: 12 attribute-value pairs (6 attributes, each with two possible values)
Rule R1 covers documents 8, 16, 18 and 19
There are 9 different attribute-value pairs that occur in these documents: {history=0}, {science=0}, {research=1}, {offers=0}, {offers=1}, {students=0}, {students=1}, {hall=0}, {hall=1}
Specifying rule R1 is equivalent to choosing 9 out of 12 attribute-value pairs, which can be done in $\binom{12}{9}$ ways

No | History | Science | Research | Offers | Students | Hall
 8 |    0    |    0    |    1     |   0    |    0     |  0
16 |    0    |    0    |    1     |   0    |    1     |  1
18 |    0    |    0    |    1     |   1    |    1     |  1
19 |    0    |    0    |    1     |   1    |    1     |  1
MDL-Based Model Evaluation (Example) (cont'd)
Since we use only 1 among $\binom{12}{9}$ possibilities, $\log_2 \binom{12}{9}$ bits are needed to encode the left-hand side of R1
In addition, we need one bit ($\log_2 2$, a choice of one out of two cluster labels) to encode the choice of the class
Thus the code length of R1 is

$$L(R_1) = \log_2 \binom{12}{9} + 1 = \log_2 220 + 1 = 8.78136$$
MDL-Based Model Evaluation (L(H1))
Similarly, we compute the code lengths of R2, R3, and R4:

$$L(R_2) = \log_2 \binom{12}{7} + 1 = \log_2 792 + 1 = 10.6294$$
$$L(R_3) = \log_2 \binom{12}{7} + 1 = \log_2 792 + 1 = 10.6294$$
$$L(R_4) = \log_2 \binom{12}{10} + 1 = \log_2 66 + 1 = 7.04439$$

Using the additivity of information, we obtain the code length of H1 by adding the code lengths of its constituent rules:

$$L(H_1) = 37.0845$$
MDL-Based Model Evaluation (L(D|R1))
Consider the message exchange setting, where the hypothesis R1 has already been communicated
This means that the recipient of the message already knows the subset of 9 attribute-value pairs selected by rule R1
Then, to communicate each document covered by R1, we need to choose 6 pairs (those occurring in that document) out of the 9 pairs
This choice takes $\log_2 \binom{9}{6}$ bits to encode
MDL-Based Model Evaluation (L(D|R1)) (cont'd)
As R1 covers 4 documents (8, 16, 18, 19), the code length needed for all of them is

$$L(\{8,16,18,19\} \mid R_1) = 4 \times \log_2 \binom{9}{6} = 4 \times \log_2 84 = 25.5693$$
MDL-Based Model Evaluation (MDL(H1))
Similarly, we compute the code lengths of the subsets of documents covered by the other rules:

$$L(\{4,6,10\} \mid R_2) = 3 \times \log_2 \binom{7}{6} = 3 \times \log_2 7 = 8.4220$$
$$L(\{1,3,12,17\} \mid R_3) = 4 \times \log_2 \binom{7}{6} = 4 \times \log_2 7 = 11.2294$$
$$L(\{2,5,7,9,11,13,14,15,20\} \mid R_4) = 9 \times \log_2 \binom{10}{6} = 9 \times \log_2 210 = 69.4282$$

The code length needed to communicate all documents given hypothesis H1 is the sum of these code lengths: L(D|H1) = 114.649
Adding this to the code length of the hypothesis, we obtain

$$MDL(H_1) = L(H_1) + L(D \mid H_1) = 37.0845 + 114.649 = 151.733$$
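The arithmetic above can be checked mechanically. A sketch using Python's math.comb, with the pair counts for H1 taken from the example and those for H2 inferred from the slide's totals:

```python
from math import comb, log2

def rule_length(n_pairs, total_pairs=12, n_classes=2):
    """L(R) = log2 C(total, selected) + log2(number of class labels)."""
    return log2(comb(total_pairs, n_pairs)) + log2(n_classes)

def data_length(n_docs, pairs_in_rule, pairs_per_doc=6):
    """L(docs|R) = n_docs * log2 C(pairs_in_rule, pairs_per_doc)."""
    return n_docs * log2(comb(pairs_in_rule, pairs_per_doc))

# H1: (pairs selected by rule, documents covered, pairs available given the rule)
h1 = [(9, 4, 9), (7, 3, 7), (7, 4, 7), (10, 9, 10)]
L_H1 = sum(rule_length(p) for p, _, _ in h1)               # 37.0845
L_D_H1 = sum(data_length(n, avail) for _, n, avail in h1)  # 114.649
print(L_H1 + L_D_H1)                                       # MDL(H1) = 151.733

# H2 (counts inferred from the slide's totals): two rules, each fixing one of
# 12 pairs; the documents covered by each rule span 11 distinct pairs
h2 = [(1, 11, 11), (1, 9, 11)]
L_H2 = sum(rule_length(p) for p, _, _ in h2)               # 9.16992
L_D_H2 = sum(data_length(n, avail) for _, n, avail in h2)  # 177.035
print(L_H2 + L_D_H2)                                       # MDL(H2) = 186.205
```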
MDL-Based Model Evaluation (H1 or H2?)
Similarly, we compute MDL(H2):

$$MDL(H_2) = L(H_2) + L(D \mid H_2) = 9.16992 + 177.035 = 186.205$$

$$MDL(H_1) < MDL(H_2) \;\Rightarrow\; H_1 \text{ is better than } H_2$$

Also (quite intuitively), L(H1) > L(H2) and L(D|H1) < L(D|H2)
How about the most general and the most specific hypotheses?
Information (Data) Compression
The most general hypothesis (the empty rule {}) does not restrict the choice of attribute-value pairs, so it selects 12 out of 12 pairs:

$$L(\{\}) = \log_2 \binom{12}{12} + 1 = 1$$
$$L(D \mid \{\}) = L(D) = 20 \times \log_2 \binom{12}{6} = 20 \times \log_2 924 = 197.035$$

The most specific hypothesis S has 20 rules, one for each document:

$$L(S) = 20 \times \left( \log_2 \binom{12}{6} + 1 \right) = 20 \times (\log_2 924 + 1) = 217.035$$

Both {} and S represent extreme cases, undesirable in learning: overgeneralization and overspecialization
Good hypotheses should provide a smaller MDL than {} and S, or, stated otherwise (the principle of information compression):

$$L(H) + L(D \mid H) < L(D) \qquad \text{or} \qquad H_i = \arg\max_i \left[ L(D) - L(H_i) - L(D \mid H_i) \right]$$
MDL-Based Feature Evaluation
An attribute splits the set of documents into subsets, each including the documents that share the same value of that attribute
Consider this split as a clustering and evaluate its MDL score
Rank attributes by their MDL scores and select the attributes that provide the lowest scores
MDL(H2) was actually the MDL score of attribute offers for our 6-attribute document sample
MDL-based Feature Evaluation (Example)
Classes to Clusters Evaluation
Assume that the classification of the documents in a sample is known, i.e. each document has a class label
Cluster the sample without using the class labels
Assign to each cluster the class label of the majority of documents in it
Compute the error as the proportion of documents whose class label differs from their cluster label
Or compute the accuracy as the proportion of documents whose class label matches their cluster label (see the sketch below)
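A minimal sketch of this procedure, with clusters represented as lists of true class labels (the labels below are hypothetical):

```python
from collections import Counter

def classes_to_clusters_accuracy(clusters):
    """Each cluster is labeled with its majority class; accuracy is the
    proportion of documents whose class matches their cluster's label."""
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in clusters)
    total = sum(len(labels) for labels in clusters)
    return correct / total

# Hypothetical clustering of labeled documents (labels 'A'/'B')
clusters = [['A', 'A', 'A', 'B'], ['B', 'B', 'A']]
acc = classes_to_clusters_accuracy(clusters)
print(acc, 1 - acc)   # accuracy = 5/7, error = 2/7
```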
Classes to Clusters Evaluation (Example)
Confusion Matrix (Contingency Table)
TP (true positives), FN (false negatives), FP (false positives), TN (true negatives)

$$Error = \frac{FP + FN}{TP + FP + TN + FN} \qquad Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$
Precision and Recall

$$Precision = \frac{TP}{TP + FP} \qquad Recall = \frac{TP}{TP + FN}$$
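The four measures in one small helper (the confusion-matrix counts below are hypothetical):

```python
def binary_metrics(tp, fp, fn, tn):
    """Error, accuracy, precision and recall from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        "error": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts for one class-to-cluster assignment
print(binary_metrics(tp=8, fp=3, fn=2, tn=7))
```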
F-Measure
Generalized confusion matrix for m classes and k clusters, where $n_{ij}$ is the number of documents of class i in cluster j:

$$P(i,j) = \frac{n_{ij}}{\sum_{i=1}^{m} n_{ij}} \qquad R(i,j) = \frac{n_{ij}}{\sum_{j=1}^{k} n_{ij}}$$

Combining precision and recall:

$$F(i,j) = \frac{2 \, P(i,j) \, R(i,j)}{P(i,j) + R(i,j)}$$

Evaluating the whole clustering:

$$F = \sum_{i=1}^{m} \frac{n_i}{n} \max_{j=1,\ldots,k} F(i,j)$$

where $n_i = \sum_{j=1}^{k} n_{ij}$ and $n = \sum_{i=1}^{m} \sum_{j=1}^{k} n_{ij}$ is the total number of documents.
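A sketch computing the overall F-measure from an m x k contingency matrix (the table values below are made up):

```python
import numpy as np

def clustering_f_measure(n):
    """Overall F-measure for an m-class x k-cluster contingency matrix n,
    where n[i, j] counts documents of class i in cluster j."""
    n = np.asarray(n, dtype=float)
    P = n / n.sum(axis=0, keepdims=True)      # precision P(i,j), per cluster
    R = n / n.sum(axis=1, keepdims=True)      # recall R(i,j), per class
    with np.errstate(divide="ignore", invalid="ignore"):
        F = np.where(P + R > 0, 2 * P * R / (P + R), 0.0)
    class_sizes = n.sum(axis=1)               # n_i
    return np.sum(class_sizes / n.sum() * F.max(axis=1))

# Hypothetical contingency table: 2 classes x 3 clusters
print(clustering_f_measure([[5, 1, 0],
                            [2, 4, 3]]))
```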
F-Measure (Example)
Entropy
Consider the class label as a random event and evaluate its probability distribution in each cluster
The probability of class i in cluster j is estimated by the proportion of occurrences of class label i in cluster j:

$$p_{ij} = \frac{n_{ij}}{\sum_{i=1}^{m} n_{ij}}$$

Entropy is a measure of "impurity" and accounts for the average information in an arbitrary message about the class label:

$$H_j = -\sum_{i=1}^{m} p_{ij} \log p_{ij}$$
Entropy (cont'd)
To evaluate the whole clustering, we sum the entropies of the individual clusters, weighted by the proportion of documents in each:

$$H = \sum_{j=1}^{k} \frac{n_j}{n} H_j$$
Entropy (Examples)
A "pure" cluster, where all documents have a single class label, has entropy 0
The highest entropy is achieved when all class labels have the same probability
For example, for a two-class problem the 50-50 situation has the highest entropy: $-0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$
Entropy (Examples) (cont'd)
Compare the entropies of the previously discussed clusterings for attributes offers and students:

$$H(\text{offers}) = \frac{11}{20}\left(-\frac{8}{11}\log\frac{8}{11} - \frac{3}{11}\log\frac{3}{11}\right) + \frac{9}{20}\left(-\frac{3}{9}\log\frac{3}{9} - \frac{6}{9}\log\frac{6}{9}\right) = 0.878176$$

$$H(\text{students}) = \frac{15}{20}\left(-\frac{10}{15}\log\frac{10}{15} - \frac{5}{15}\log\frac{5}{15}\right) + \frac{5}{20}\left(-\frac{1}{5}\log\frac{1}{5} - \frac{4}{5}\log\frac{4}{5}\right) = 0.869204$$
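These values can be reproduced from the per-cluster class counts that appear in the formulas above, (8 A, 3 B) and (3 A, 6 B) for offers, (10 A, 5 B) and (1 A, 4 B) for students:

```python
from math import log2

def clustering_entropy(clusters):
    """H = sum_j (n_j / n) * H_j, with H_j = -sum_i p_ij log2 p_ij.
    clusters: list of per-cluster class-count lists [n_1j, n_2j, ...]."""
    n = sum(sum(c) for c in clusters)
    H = 0.0
    for counts in clusters:
        n_j = sum(counts)
        H_j = -sum((c / n_j) * log2(c / n_j) for c in counts if c > 0)
        H += (n_j / n) * H_j
    return H

# Class counts (A, B) per cluster for the splits by offers and students
print(clustering_entropy([[8, 3], [3, 6]]))    # H(offers)   = 0.878176
print(clustering_entropy([[10, 5], [1, 4]]))   # H(students) = 0.869204
```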
Summary
Evaluating Clustering:
Similarity-Based Criterion Functions
Probabilistic Criterion Functions
MDL-Based Model and Feature Evaluation
Classes to Clusters Evaluation
Entropy