Web Mining
Nguyen Tri Thanh
Email: [email protected]
Department of Information Systems, Faculty of Information Technology, College of Technology
May 11, 2011
Web Content Mining: Evaluating Clustering
Approaches to Evaluating Clustering
Similarity-Based Criterion Functions
Probabilistic Criterion Functions
MDL-Based Model and Feature Evaluation
Classes to Clusters Evaluation
Precision, Recall and F-measure
Entropy
Approaches to Evaluating Clustering
Criterion functions evaluate clustering models objectively, i.e. using only the document content:
Similarity-based functions: intracluster similarity and sum of squared errors
Probabilistic functions: log-likelihood and category utility
MDL-based evaluation
Approaches to Evaluating Clustering (cont'd)
Document labels (if available) may also be used to evaluate clustering models:
If the labeling is correct and accurately reflects the document content, we can evaluate the quality of the clustering
If the clustering accurately reflects the document content, we can evaluate the quality of the labeling
Criterion function based on labeled data: classes-to-clusters evaluation
Similarity-Based Criterion Functions (distance)
Basic idea: the cluster center $c_i$ (centroid or mean in the case of numeric data) best represents cluster $D_i$ if it minimizes the sum of the squared lengths of the "error" vectors $x - c_i$ for all $x \in D_i$:

$$J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - c_i\|^2, \qquad c_i = \frac{1}{|D_i|} \sum_{x \in D_i} x$$

Alternative formulation based on the pairwise distance between cluster members:

$$J_e = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{x_j, x_l \in D_i} \|x_j - x_l\|^2$$
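As a quick illustration, here is a minimal Python sketch (NumPy assumed; the point data are made up) checking numerically that the centroid form and the pairwise form of $J_e$ agree:

```python
import numpy as np

def sse(clusters):
    """Sum-of-squared-errors criterion: J_e = sum_i sum_{x in D_i} ||x - c_i||^2."""
    return sum(np.sum((D - D.mean(axis=0)) ** 2) for D in clusters)

def sse_pairwise(clusters):
    """Equivalent pairwise form: J_e = (1/2) sum_i (1/|D_i|) sum_{j,l} ||x_j - x_l||^2."""
    total = 0.0
    for D in clusters:
        diffs = D[:, None, :] - D[None, :, :]   # all pairwise difference vectors
        total += np.sum(diffs ** 2) / (2 * len(D))
    return total

# Toy data: two clusters of 2-D points (made up for illustration)
clusters = [np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5]]),
            np.array([[8.0, 9.0], [9.0, 8.0]])]
print(sse(clusters), sse_pairwise(clusters))    # both print 2.0
```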
Similarity-Based Criterion Functions (cosine similarity)
For document clustering, cosine similarity to the cluster centroid $c_i$ is used:

$$J_s = \sum_{i=1}^{k} \sum_{d_j \in D_i} \mathrm{sim}(c_i, d_j), \qquad \mathrm{sim}(c_i, d_j) = \frac{c_i \cdot d_j}{\|c_i\| \, \|d_j\|}, \qquad c_i = \frac{1}{|D_i|} \sum_{d_j \in D_i} d_j$$

Equivalent form based on pairwise similarity:

$$J_s = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j, d_l \in D_i} \mathrm{sim}(d_j, d_l)$$

Another formulation is based on the average pairwise intracluster similarity $\mathrm{sim}(D_i)$ (used to control merging of clusters in hierarchical agglomerative clustering):

$$J_s = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j, d_l \in D_i} \mathrm{sim}(d_j, d_l) = \frac{1}{2} \sum_{i=1}^{k} |D_i| \, \mathrm{sim}(D_i)$$
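A similar sketch for the cosine-based criterion; this computes only the centroid form of $J_s$, with hypothetical term-frequency vectors standing in for documents:

```python
import numpy as np

def centroid_similarity(clusters):
    """J_s = sum_i sum_{d in D_i} cos(c_i, d), with c_i the mean vector of cluster D_i."""
    total = 0.0
    for D in clusters:
        c = D.mean(axis=0)
        # cosine similarity between the centroid and each document vector
        total += np.sum(D @ c / (np.linalg.norm(c) * np.linalg.norm(D, axis=1)))
    return total

# Toy term-frequency vectors (rows = documents); any non-negative vectors work here
clusters = [np.array([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]),
            np.array([[0.0, 1.0, 1.0], [0.0, 2.0, 1.0]])]
print(centroid_similarity(clusters))   # higher values indicate tighter clusters
```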
Similarity-Based Criterion Functions (example)
[Figure: sum of centroid similarity evaluation of four clusterings]
Finding the Number of Clusters
Probabilistic Criterion Functions
A document is a random event. Probability of a document:

$$P(d) = \sum_A P(d \mid A) \, P(A)$$

Probability of the sample (assuming that documents are independent events):

$$P(d_1, d_2, \ldots, d_n) = \prod_{i=1}^{n} \sum_A P(d_i \mid A) \, P(A)$$

Log-likelihood (log of the probability of the sample):

$$L = \sum_{i=1}^{n} \log \sum_A P(d_i \mid A) \, P(A)$$
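A minimal sketch of the log-likelihood computation, assuming a naive Bayes mixture over binary term occurrences (all model parameters below are hypothetical):

```python
import numpy as np

def log_likelihood(docs, priors, cond):
    """L = sum_i log sum_A P(d_i|A) P(A), for binary term vectors under a
    naive Bayes mixture: P(d|A) = prod_t p_At^d_t * (1 - p_At)^(1 - d_t)."""
    L = 0.0
    for d in docs:
        p_d = sum(pA * np.prod(np.where(d == 1, p, 1 - p))
                  for pA, p in zip(priors, cond))
        L += np.log(p_d)
    return L

# Hypothetical two-cluster model over 3 terms
priors = [0.5, 0.5]                         # P(A)
cond = [np.array([0.9, 0.8, 0.1]),          # P(term = 1 | cluster 1)
        np.array([0.1, 0.2, 0.9])]          # P(term = 1 | cluster 2)
docs = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0]])
print(log_likelihood(docs, priors, cond))   # larger (closer to 0) is better
```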
Probabilistic Criterion Functions (cont'd)
Category utility (based on probabilities of attributes):
Takes into account the probabilities of attributes having particular values within clusters and across clusters (Gluck & Corter)
CU measures both the probability that two objects in the same category have attribute values in common and the probability that objects from different categories have different attribute values:

$$CU = \sum_i \sum_j \sum_k P(a_j = v_{ij} \mid C_k) \, P(C_k \mid a_j = v_{ij}) \, P(a_j = v_{ij})$$
Category Utility (basic idea)

$$CU = \sum_i \sum_j \sum_k P(a_j = v_{ij} \mid C_k) \, P(C_k \mid a_j = v_{ij}) \, P(a_j = v_{ij})$$

$P(a_j = v_{ij} \mid C_k)$ is the probability that an object has value $v_{ij}$ for its attribute $a_j$ given that it belongs to category $C_k$. The higher this probability, the more likely two objects in a category share the same attribute values.
$P(C_k \mid a_j = v_{ij})$ is the probability that an object belongs to category $C_k$ given that it has value $v_{ij}$ for its attribute $a_j$. The greater this probability, the less likely objects from different categories have attribute values in common.
$P(a_j = v_{ij})$ is a weight coefficient ensuring that frequent attribute values have a stronger influence on the evaluation.
Category Utility Function

$$CU = \sum_i \sum_j \sum_k P(a_j = v_{ij} \mid C_k) \, P(C_k \mid a_j = v_{ij}) \, P(a_j = v_{ij})$$

Applying Bayes' rule

$$P(C_k \mid a_j = v_{ij}) = \frac{P(a_j = v_{ij} \mid C_k) \, P(C_k)}{P(a_j = v_{ij})}$$

gives

$$CU = \sum_k P(C_k) \sum_i \sum_j P(a_j = v_{ij} \mid C_k)^2$$

where $\sum_i \sum_j P(a_j = v_{ij} \mid C_k)^2$ is the expected number of attribute values of a member of $C_k$ correctly guessed using a probability matching strategy.
Category Utility Function (cont'd)
Assume we don't know the categories (clusters) in our sample; then

$$\sum_i \sum_j P(a_j = v_{ij})^2$$

is the expected number of correctly guessed attribute values without knowing the categories (clusters) in the sample.
The final expression of the function is

$$CU(C_1, C_2, \ldots, C_n) = \frac{1}{n} \sum_k P(C_k) \sum_i \sum_j \left( P(a_j = v_{ij} \mid C_k)^2 - P(a_j = v_{ij})^2 \right)$$
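A minimal sketch of computing CU from data, assuming categorical attributes given as integer arrays (the two-cluster sample below is made up):

```python
import numpy as np

def category_utility(clusters):
    """CU(C_1..C_n) = (1/n) sum_k P(C_k) sum over attributes/values of
    [P(a=v|C_k)^2 - P(a=v)^2], for categorical data as integer arrays."""
    data = np.vstack(clusters)
    cu = 0.0
    for C in clusters:
        p_ck = len(C) / len(data)
        inner = 0.0
        for j in range(data.shape[1]):               # attributes
            for v in np.unique(data[:, j]):          # attribute values
                p_v_given_c = np.mean(C[:, j] == v)  # P(a_j = v | C_k)
                p_v = np.mean(data[:, j] == v)       # P(a_j = v)
                inner += p_v_given_c ** 2 - p_v ** 2
        cu += p_ck * inner
    return cu / len(clusters)

# Hypothetical binary attribute vectors split into two clusters
clusters = [np.array([[1, 1, 0], [1, 1, 0], [1, 0, 0]]),
            np.array([[0, 0, 1], [0, 1, 1]])]
print(category_utility(clusters))   # higher CU = better category structure
```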
Category Utility (example)
A = {1,3,4,6,8,10,12,16,17,18,19}
B = {2,5,7,9,11,13,14,15,20}
Category Utility Function for Continuous Attributes
The category utility function can be extended to continuous attributes by assuming a normal distribution and replacing probabilities with densities:

$$CU(C_1, C_2, \ldots, C_n) = \frac{1}{n} \sum_{k=1}^{n} P(C_k) \sum_i \left[ \int f(v_{ik})^2 \, dv_{ik} - \int f(v_i)^2 \, dv_i \right]$$

where $f(\cdot)$ is the probability density function of the normal distribution.
After solving the integrals, we obtain

$$CU(C_1, C_2, \ldots, C_n) = \frac{1}{n} \sum_{k=1}^{n} P(C_k) \, \frac{1}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{ik}} - \frac{1}{\sigma_i} \right)$$
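A sketch of the continuous version under the normality assumption; it presumes every attribute has nonzero variance both per cluster and overall (the numeric values below are made up):

```python
import numpy as np

def continuous_cu(clusters):
    """CU = (1/n) sum_k P(C_k) * 1/(2*sqrt(pi)) * sum_i (1/sigma_ik - 1/sigma_i),
    for numeric data under the normality assumption."""
    data = np.vstack(clusters)
    sigma = data.std(axis=0)        # per-attribute std over the whole sample
    total = 0.0
    for C in clusters:
        sigma_k = C.std(axis=0)     # per-attribute std within the cluster
        total += (len(C) / len(data)) * np.sum(1 / sigma_k - 1 / sigma) / (2 * np.sqrt(np.pi))
    return total / len(clusters)

# Toy numeric sample: two well-separated 1-D clusters (values made up)
clusters = [np.array([[1.0], [1.2], [0.9]]), np.array([[5.0], [5.1], [4.8]])]
print(continuous_cu(clusters))      # positive CU indicates informative clusters
```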
Data Example
[Table: 20 documents (1-20) described by six binary attributes: history, science, research, offers, students, hall]
Finding Regularities in Data
Describe the cluster as a pattern of attribute values that repeats in the data
Include attribute values that are the same for all members of the cluster and omit attributes that have different values (generalization by dropping conditions)
Hypothesis 1 (using attributes science and research):

H1 =
R1: IF (science = 0) AND (research = 1) THEN Class = A
R2: IF (science = 1) AND (research = 0) THEN Class = A
R3: IF (science = 1) AND (research = 1) THEN Class = A
R4: IF (science = 0) AND (research = 0) THEN Class = B

A = {1,3,4,6,8,10,12,16,17,18,19}
B = {2,5,7,9,11,13,14,15,20}
Finding Regularities in Data (cont'd)
Hypothesis 2 (using attribute offers):

H2 =
R1: IF (offers = 0) THEN Class = A
R2: IF (offers = 1) THEN Class = B

A = {1,3,4,6,8,10,12,16,17,18,19}
B = {2,5,7,9,11,13,14,15,20}

Which one is better, H1 or H2?
Occam's Razor
"Entities are not to be multiplied beyond necessity" (William of Occam, 14th century)
Among several alternatives, the simplest one is usually the best choice
H2 looks simpler (shorter) than H1, so H2 may be better than H1
How do we measure simplicity?
Measuring Complexity
Dropping more conditions produces simpler (shorter) hypotheses; the simplest one is the empty rule
The latter, however, is a single cluster including the whole dataset (overgeneralization)
The most complex (longest) hypothesis has 20 clusters and is equivalent to the original dataset (overfitting)
How do we find the right balance?
The answer to both questions is Minimum Description Length (MDL)
Minimum Description Length (MDL)
Given a data set D (e.g. our document collection) and a set of hypotheses H1, H2, ..., Hn, each of which describes D
Find the most likely hypothesis:

$$H_i = \arg\max_i P(H_i \mid D)$$

Direct estimation of $P(H_i \mid D)$ is difficult, so we apply Bayes' rule:

$$P(H_i \mid D) = \frac{P(H_i) \, P(D \mid H_i)}{P(D)}$$

Taking $-\log_2$ of both sides:

$$-\log_2 P(H_i \mid D) = -\log_2 P(H_i) - \log_2 P(D \mid H_i) + \log_2 P(D)$$
Minimum Description Length (MDL) (cont'd)
Consider hypotheses and data as messages and apply Shannon's information theory, which defines the information in a message as the negative logarithm of its probability
Then estimate the number of bits L needed to encode the messages:

$$L(H_i \mid D) = L(H_i) + L(D \mid H_i) - L(D)$$
Minimum Description Length Principle
L(Hi) and L(D) are the minimum numbers of bits needed to encode the hypothesis and the data
L(D|Hi) is the number of bits needed to encode D if we know Hi
If we think of Hi as a pattern that repeats in D, we don't have to encode all its occurrences; rather, we encode only the pattern itself and the differences that identify each individual instance in D
Thus, the more regularity in the data, the shorter the description length L(D|Hi)
Minimum Description Length Principle (cont'd)
We need a good balance between L(Hi) and L(D|Hi): if Hi describes the data exactly, then L(D|Hi) = 0, but L(Hi) will be large
We can exclude L(D) because it does not depend on the choice of hypothesis
Minimum Description Length (MDL) principle:

$$H_i = \arg\min_i \left[ L(H_i) + L(D \mid H_i) \right]$$
MDL-Based Model Evaluation (Basics)
Choose a description language (e.g. rules)
Use the same encoding scheme for both hypotheses and data given the hypotheses
Assume that hypotheses and data are uniformly distributed:
The probability of the occurrence of an item out of n alternatives is 1/n
The minimum code length of the message informing us that a particular item has occurred is $-\log_2(1/n) = \log_2 n$
How do we encode the occurrence of k items among n items? (A choice of k out of n can be made in $\binom{n}{k}$ ways, so it takes $\log_2 \binom{n}{k}$ bits)
Recall Data Example
Recall Hypotheses
Hypothesis 1 (using attributes science and research):

H1 =
R1: IF (science = 0) AND (research = 1) THEN Class = A
R2: IF (science = 1) AND (research = 0) THEN Class = A
R3: IF (science = 1) AND (research = 1) THEN Class = A
R4: IF (science = 0) AND (research = 0) THEN Class = B

Hypothesis 2 (using attribute offers):

H2 =
R1: IF (offers = 0) THEN Class = A
R2: IF (offers = 1) THEN Class = B

A = {1,3,4,6,8,10,12,16,17,18,19}
B = {2,5,7,9,11,13,14,15,20}
MDL-Based Model Evaluation (Example)
Description language: 12 attribute-value pairs (6 attributes, each with two possible values)
Rule R1 covers documents 8, 16, 18 and 19
There are 9 different attribute-value pairs that occur in these documents: {history=0}, {science=0}, {research=1}, {offers=0}, {offers=1}, {students=0}, {students=1}, {hall=0}, {hall=1}
Specifying rule R1 is equivalent to choosing 9 out of 12 attribute-value pairs, which can be done in $\binom{12}{9}$ ways

No | History | Science | Research | Offers | Students | Hall
 8 |    0    |    0    |    1     |   0    |    0     |  0
16 |    0    |    0    |    1     |   0    |    1     |  1
18 |    0    |    0    |    1     |   1    |    1     |  1
19 |    0    |    0    |    1     |   1    |    1     |  1
MDL-Based Model Evaluation (Example) (cont'd)
Since we use only 1 among $\binom{12}{9}$ possibilities, $\log_2 \binom{12}{9}$ bits are needed to encode the left-hand side of R1
In addition, we need one bit ($\log_2 2$, a choice of one out of two cluster labels) to encode the choice of the class
Thus the code length of R1 is

$$L(R_1) = \log_2 \binom{12}{9} + 1 = \log_2 220 + 1 = 8.78136$$
MDL-Based Model Evaluation (L(H1))
Similarly, we compute the code lengths of R2, R3, and R4:

$$L(R_2) = \log_2 \binom{12}{7} + 1 = \log_2 792 + 1 = 10.6294$$
$$L(R_3) = \log_2 \binom{12}{7} + 1 = \log_2 792 + 1 = 10.6294$$
$$L(R_4) = \log_2 \binom{12}{10} + 1 = \log_2 66 + 1 = 7.04439$$

Using the additivity of information, we obtain the code length of H1 by adding the code lengths of its constituent rules:

$$L(H_1) = 37.0845$$
MDL-Based Model Evaluation (L(D|R1))
Consider the message exchange setting, where the hypothesis R1 has already been communicated
This means that the recipient of the message already knows the subset of 9 attribute-value pairs selected by rule R1
Then, to communicate each document covered by R1, we need to choose 6 pairs (those occurring in that document) out of the 9 pairs
This choice takes $\log_2 \binom{9}{6}$ bits to encode
MDL-Based Model Evaluation (L(D|R1)) (cont'd)
As R1 covers 4 documents (8, 16, 18, 19), the code length needed for all of them is

$$L(\{8,16,18,19\} \mid R_1) = 4 \times \log_2 \binom{9}{6} = 4 \times \log_2 84 = 25.5693$$
MDL-Based Model Evaluation (MDL(H1))
Similarly, we compute the code lengths of the subsets of documents covered by the other rules:

$$L(\{4,6,10\} \mid R_2) = 3 \times \log_2 \binom{7}{6} = 3 \times \log_2 7 = 8.4220$$
$$L(\{1,3,12,17\} \mid R_3) = 4 \times \log_2 \binom{7}{6} = 4 \times \log_2 7 = 11.2294$$
$$L(\{2,5,7,9,11,13,14,15,20\} \mid R_4) = 9 \times \log_2 \binom{10}{6} = 9 \times \log_2 210 = 69.4282$$

The code length needed to communicate all documents given hypothesis H1 is the sum of these code lengths: L(D|H1) = 114.649
Adding this to the code length of the hypothesis, we obtain

$$MDL(H_1) = L(H_1) + L(D \mid H_1) = 37.0845 + 114.649 = 151.733$$
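The arithmetic above can be checked mechanically. A sketch using Python's math.comb, with the pair counts for H1 taken from the example and those for H2 inferred from the slide's totals:

```python
from math import comb, log2

def rule_length(n_pairs, total_pairs=12, n_classes=2):
    """L(R) = log2 C(total, selected) + log2(number of class labels)."""
    return log2(comb(total_pairs, n_pairs)) + log2(n_classes)

def data_length(n_docs, pairs_in_rule, pairs_per_doc=6):
    """L(docs|R) = n_docs * log2 C(pairs_in_rule, pairs_per_doc)."""
    return n_docs * log2(comb(pairs_in_rule, pairs_per_doc))

# H1: (pairs selected by rule, documents covered, pairs available given the rule)
h1 = [(9, 4, 9), (7, 3, 7), (7, 4, 7), (10, 9, 10)]
L_H1 = sum(rule_length(p) for p, _, _ in h1)               # 37.0845
L_D_H1 = sum(data_length(n, avail) for _, n, avail in h1)  # 114.649
print(L_H1 + L_D_H1)                                       # MDL(H1) = 151.733

# H2 (counts inferred from the slide's totals): two rules, each fixing one of
# 12 pairs; the documents covered by each rule span 11 distinct pairs
h2 = [(1, 11, 11), (1, 9, 11)]
L_H2 = sum(rule_length(p) for p, _, _ in h2)               # 9.16992
L_D_H2 = sum(data_length(n, avail) for _, n, avail in h2)  # 177.035
print(L_H2 + L_D_H2)                                       # MDL(H2) = 186.205
```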
MDL-Based Model Evaluation (H1 or H2?)
Similarly, we compute MDL(H2):

$$MDL(H_2) = L(H_2) + L(D \mid H_2) = 9.16992 + 177.035 = 186.205$$

$$MDL(H_1) < MDL(H_2) \;\Rightarrow\; H_1 \text{ is better than } H_2$$

Also (quite intuitively), L(H1) > L(H2) and L(D|H1) < L(D|H2)
How about the most general and the most specific hypotheses?
Information (Data) Compression
The most general hypothesis (the empty rule {}) does not restrict the choice of attribute-value pairs, so it selects 12 out of 12 pairs:

$$L(\{\}) = \log_2 \binom{12}{12} + 1 = 1$$
$$L(D \mid \{\}) = L(D) = 20 \times \log_2 \binom{12}{6} = 20 \times \log_2 924 = 197.035$$

The most specific hypothesis S has 20 rules, one for each document:

$$L(S) = 20 \times \left( \log_2 \binom{12}{6} + 1 \right) = 20 \times (\log_2 924 + 1) = 217.035$$

Both {} and S represent extreme cases, undesirable in learning: overgeneralization and overspecialization
Good hypotheses should provide a smaller MDL than {} and S, or, stated otherwise (the principle of information compression):

$$L(H) + L(D \mid H) < L(D) \qquad \text{or} \qquad H_i = \arg\max_i \left[ L(D) - L(H_i) - L(D \mid H_i) \right]$$
MDL-Based Feature Evaluation
An attribute splits the set of documents into subsets, each including the documents that share the same value of that attribute
Consider this split as a clustering and evaluate its MDL score
Rank attributes by their MDL scores and select the attributes that provide the lowest scores
MDL(H2) was actually the MDL score of attribute offers for our 6-attribute document sample
MDL-based Feature Evaluation (Example)
Classes to Clusters Evaluation
Assume that the classification of the documents in a sample is known, i.e. each document has a class label
Cluster the sample without using the class labels
Assign to each cluster the class label of the majority of documents in it
Compute the error as the proportion of documents whose class label differs from their cluster label
Or compute the accuracy as the proportion of documents whose class label matches their cluster label (see the sketch below)
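A minimal sketch of this procedure, with clusters represented as lists of true class labels (the labels below are hypothetical):

```python
from collections import Counter

def classes_to_clusters_accuracy(clusters):
    """Each cluster is labeled with its majority class; accuracy is the
    proportion of documents whose class matches their cluster's label."""
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in clusters)
    total = sum(len(labels) for labels in clusters)
    return correct / total

# Hypothetical clustering of labeled documents (labels 'A'/'B')
clusters = [['A', 'A', 'A', 'B'], ['B', 'B', 'A']]
acc = classes_to_clusters_accuracy(clusters)
print(acc, 1 - acc)   # accuracy = 5/7, error = 2/7
```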
Classes to Clusters Evaluation (Example)
Confusion Matrix (Contingency Table)
TP (true positives), FN (false negatives), FP (false positives), TN (true negatives)

$$Error = \frac{FP + FN}{TP + FP + TN + FN} \qquad Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$
Precision and Recall

$$Precision = \frac{TP}{TP + FP} \qquad Recall = \frac{TP}{TP + FN}$$
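The four measures in one small helper (the confusion-matrix counts below are hypothetical):

```python
def binary_metrics(tp, fp, fn, tn):
    """Error, accuracy, precision and recall from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        "error": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts for one class-to-cluster assignment
print(binary_metrics(tp=8, fp=3, fn=2, tn=7))
```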
F-Measure
Generalized confusion matrix for m classes and k clusters, where $n_{ij}$ is the number of documents of class i in cluster j:

$$P(i,j) = \frac{n_{ij}}{\sum_{i=1}^{m} n_{ij}} \qquad R(i,j) = \frac{n_{ij}}{\sum_{j=1}^{k} n_{ij}}$$

Combining precision and recall:

$$F(i,j) = \frac{2 \, P(i,j) \, R(i,j)}{P(i,j) + R(i,j)}$$

Evaluating the whole clustering:

$$F = \sum_{i=1}^{m} \frac{n_i}{n} \max_{j=1,\ldots,k} F(i,j)$$

where $n_i = \sum_{j=1}^{k} n_{ij}$ and $n = \sum_{i=1}^{m} \sum_{j=1}^{k} n_{ij}$ is the total number of documents.
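A sketch computing the overall F-measure from an m x k contingency matrix (the table values below are made up):

```python
import numpy as np

def clustering_f_measure(n):
    """Overall F-measure for an m-class x k-cluster contingency matrix n,
    where n[i, j] counts documents of class i in cluster j."""
    n = np.asarray(n, dtype=float)
    P = n / n.sum(axis=0, keepdims=True)      # precision P(i,j), per cluster
    R = n / n.sum(axis=1, keepdims=True)      # recall R(i,j), per class
    with np.errstate(divide="ignore", invalid="ignore"):
        F = np.where(P + R > 0, 2 * P * R / (P + R), 0.0)
    class_sizes = n.sum(axis=1)               # n_i
    return np.sum(class_sizes / n.sum() * F.max(axis=1))

# Hypothetical contingency table: 2 classes x 3 clusters
print(clustering_f_measure([[5, 1, 0],
                            [2, 4, 3]]))
```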
F-Measure (Example)
Entropy
Consider the class label as a random event and evaluate its probability distribution in each cluster
The probability of class i in cluster j is estimated by the proportion of occurrences of class label i in cluster j:

$$p_{ij} = \frac{n_{ij}}{\sum_{i=1}^{m} n_{ij}}$$

Entropy is a measure of "impurity" and accounts for the average information in an arbitrary message about the class label:

$$H_j = -\sum_{i=1}^{m} p_{ij} \log p_{ij}$$
Entropy (cont'd)
To evaluate the whole clustering, we sum the entropies of the individual clusters, weighted by the proportion of documents in each:

$$H = \sum_{j=1}^{k} \frac{n_j}{n} H_j$$
Entropy (Examples)
A "pure" cluster, where all documents have a single class label, has entropy 0
The highest entropy is achieved when all class labels have the same probability
For example, for a two-class problem the 50-50 situation has the highest entropy: $-0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$
Entropy (Examples) (cont'd)
Compare the entropies of the previously discussed clusterings for attributes offers and students:

$$H(\text{offers}) = \frac{11}{20}\left(-\frac{8}{11}\log\frac{8}{11} - \frac{3}{11}\log\frac{3}{11}\right) + \frac{9}{20}\left(-\frac{3}{9}\log\frac{3}{9} - \frac{6}{9}\log\frac{6}{9}\right) = 0.878176$$

$$H(\text{students}) = \frac{15}{20}\left(-\frac{10}{15}\log\frac{10}{15} - \frac{5}{15}\log\frac{5}{15}\right) + \frac{5}{20}\left(-\frac{1}{5}\log\frac{1}{5} - \frac{4}{5}\log\frac{4}{5}\right) = 0.869204$$
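These values can be reproduced from the per-cluster class counts that appear in the formulas above, (8 A, 3 B) and (3 A, 6 B) for offers, (10 A, 5 B) and (1 A, 4 B) for students:

```python
from math import log2

def clustering_entropy(clusters):
    """H = sum_j (n_j / n) * H_j, with H_j = -sum_i p_ij log2 p_ij.
    clusters: list of per-cluster class-count lists [n_1j, n_2j, ...]."""
    n = sum(sum(c) for c in clusters)
    H = 0.0
    for counts in clusters:
        n_j = sum(counts)
        H_j = -sum((c / n_j) * log2(c / n_j) for c in counts if c > 0)
        H += (n_j / n) * H_j
    return H

# Class counts (A, B) per cluster for the splits by offers and students
print(clustering_entropy([[8, 3], [3, 6]]))    # H(offers)   = 0.878176
print(clustering_entropy([[10, 5], [1, 4]]))   # H(students) = 0.869204
```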
Summary
Evaluating Clustering:
Similarity-Based Criterion Functions
Probabilistic Criterion Functions
MDL-Based Model and Feature Evaluation
Classes to Clusters Evaluation
Entropy