Clustering

Clustering DHS 10.6-10.7, 10.9-10.10, 10.4.3-10.4.4

Clustering: Definition
A form of unsupervised learning, in which we identify groups in feature space for an unlabeled sample set.

• Define class regions in feature space using unlabeled data

• Note: the classes identified are abstract, in the sense that we obtain 'cluster 0' ... 'cluster n' as our classes (e.g. clustering MNIST digits, we may not get 10 clusters)

Applications
Clustering applications include:

• Data reduction: represent samples by their associated cluster

• Hypothesis generation
  - Discover possible patterns in the data; validate on other data sets

• Hypothesis testing
  - Test assumed patterns in the data

• Prediction based on groups
  - e.g. selecting medication for a patient using clusters of previous patients and their reactions to medication for a given disease

Kuncheva: Supervised vs. Unsupervised Classification

A Simple Example
Assume the class distributions are known to be normal: we can then define clusters by a mean and covariance matrix.
However... we may need more information to cluster well:

• Many different distributions can share the same mean and covariance matrix

• How many clusters should there be?

FIGURE 10.6. These four data sets have identical statistics up to second order, that is, the same mean µ and covariance Σ. In such cases it is important to include in the model more parameters to represent the structure more completely. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Steps for Clustering
1. Feature Selection
   • Ideal: a small number of features with little redundancy

2. Similarity (or Proximity) Measure
   • A measure of similarity or dissimilarity

   (Steps 1 and 2, shown in red on the original slide, define the 'cluster space')

3. Clustering Criterion
   • Determines how distance patterns determine cluster likelihood (e.g. preferring circular to elongated clusters)

4. Clustering Algorithm
   • The search method used with the clustering criterion to identify clusters

5. Validation of Results
   • Using appropriate tests (e.g. statistical)

6. Interpretation of Results
   • A domain expert interprets the clusters (clusters are subjective)

Choosing a Similarity Measure
Most common: Euclidean distance. Roughly speaking, we want the distance between samples in the same cluster to be smaller than the distance between samples in different clusters.

• Example (Figure 10.7 below): define clusters by a maximum distance d0 between a point and the nearest point in a cluster

• Rescaling features can be useful (transform the space)

• Unfortunately, normalizing data (e.g. by setting features to zero mean, unit variance) may eliminate subclasses (Figure 10.9)

• One might also choose to rotate the axes so they coincide with the eigenvectors of the covariance matrix (i.e. apply PCA); a brief sketch of these preprocessing options follows
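As a rough illustration of the rescaling and rotation options above (not from the original slides; the helper names and toy data are mine), a minimal NumPy sketch:

```python
import numpy as np

def standardize(X):
    """Shift each feature to zero mean and scale to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca_rotate(X):
    """Rotate the centered data onto the eigenvectors of its covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # largest-variance direction first
    return Xc @ eigvecs[:, order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
    print(standardize(X).std(axis=0))                     # ~[1, 1]
    print(np.cov(pca_rotate(X), rowvar=False).round(2))   # ~diagonal covariance
```

As Figure 10.9 warns, whether such normalization helps depends on whether the data arise from a single process or from several distinct ones.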

[Figure 10.7: three plots of the same data in (x1, x2), with lines drawn between points closer than d0 = 0.3, 0.1, and 0.03.]

FIGURE 10.7. The distance threshold affects the number and size of clusters in similarity-based clustering methods. For three different values of distance d0, lines are drawn between points closer than d0: the smaller the value of d0, the smaller and more numerous the clusters. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

[Figure 10.8: the same point set clustered in the original space (upper left), with the vertical axis stretched by 2.0 and the horizontal axis shrunk by 0.5 (right), and with the opposite scaling (bottom).]

FIGURE 10.8. Scaling axes affects the clusters in a minimum-distance cluster method. The original data and minimum-distance clusters are shown in the upper left; points in one cluster are shown in red, while the others are shown in gray. When the vertical axis is expanded by a factor of 2.0 and the horizontal axis shrunk by a factor of 0.5, the clustering is altered (as shown at the right). Alternatively, if the vertical axis is shrunk by a factor of 0.5 and the horizontal axis is expanded by a factor of 2.0, smaller and more numerous clusters result (shown at the bottom). In both these scaled cases, the assignment of points to clusters differs from that in the original space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

[Figure 10.9: well-separated clusters before (left) and after (right) scaling the full data to unit variance.]

FIGURE 10.9. If the data fall into well-separated clusters (left), normalization by scaling for unit variance for the full data may reduce the separation, and hence be undesirable (right). Such a normalization may in fact be appropriate if the full data set arises from a single fundamental process (with noise), but inappropriate if there are several different processes, as shown here. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Other Similarity Measures
Minkowski metric (dissimilarity): vary the exponent q in

d(x, x') = \left( \sum_{k=1}^{d} |x_k - x'_k|^q \right)^{1/q}

• q = 1: Manhattan (city-block) distance
• q = 2: Euclidean distance (the only form invariant to translation and rotation in feature space)

Cosine similarity:

s(x, x') = \frac{x^T x'}{||x|| \, ||x'||}

Characterizes similarity by the cosine of the angle between two feature vectors (in [0, 1] for non-negative feature values).

• Ratio of the inner product to the product of the vector magnitudes
• Invariant to rotation and dilation (but not translation)

A small code sketch of both measures follows.
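A minimal NumPy sketch of the two measures (the helper names and toy vectors are mine, not from the text):

```python
import numpy as np

def minkowski(x, xp, q=2):
    """Minkowski dissimilarity d(x, x') = (sum_k |x_k - x'_k|^q)^(1/q).

    q=1 gives the Manhattan distance, q=2 the Euclidean distance."""
    return np.sum(np.abs(x - xp) ** q) ** (1.0 / q)

def cosine_similarity(x, xp):
    """Cosine of the angle between two feature vectors."""
    return float(x @ xp) / (np.linalg.norm(x) * np.linalg.norm(xp))

if __name__ == "__main__":
    x, xp = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])
    print(minkowski(x, xp, q=1), minkowski(x, xp, q=2))  # 2.0, ~1.414
    print(cosine_similarity(x, xp))                      # 0.8
```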

More on Cosine Similarity

s(x, x') = \frac{x^T x'}{||x|| \, ||x'||}

If the features are binary-valued:

• The inner product is the sum of shared feature values
• The product of magnitudes is the geometric mean of the number of attributes in the two vectors

Variations, frequently used in Information Retrieval:

• Ratio of shared attributes (for identical lengths):

s(x, x') = \frac{x^T x'}{d}

• Tanimoto distance: ratio of shared attributes to attributes in x or x':

s(x, x') = \frac{x^T x'}{x^T x + x'^T x' - x^T x'}

Cosine Similarity: Tag Sets for YouTube Videos (example by K. Kluever)
Let A and B be binary tag-occurrence vectors over the union of the tags on two videos (author tags A_t and related-video tags R_t). Because the vectors contain only 1's and 0's, the dot product reduces to the size of the intersection of the two tag sets, |A_t ∩ R_t|, and the product of magnitudes reduces to \sqrt{|A_t|} \sqrt{|R_t|}, so

cos θ = \frac{|A_t ∩ R_t|}{\sqrt{|A_t|} \, \sqrt{|R_t|}}

Example: A_t = {dog, puppy, funny} and R_t = {dog, puppy, cat}. The tag-occurrence table over the union of both tag sets is:

Tag set | dog | puppy | funny | cat
A       |  1  |   1   |   1   |  0
B       |  1  |   1   |   0   |  1

Reading row-wise, the occurrence vectors are A = {1, 1, 1, 0} and B = {1, 1, 0, 1}, and SIM(A, B) = 2 / (√3 · √3) = 2/3.
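To make the binary-vector case concrete, here is a small sketch (my own code) that reproduces the 2/3 result for the tag vectors above and also computes the Tanimoto coefficient from the previous slide:

```python
import numpy as np

def cosine_binary(a, b):
    """Cosine similarity for binary occurrence vectors: |A∩B| / sqrt(|A| |B|)."""
    return float(a @ b) / np.sqrt(a.sum() * b.sum())

def tanimoto(a, b):
    """Tanimoto coefficient: shared attributes over attributes in either vector."""
    shared = float(a @ b)
    return shared / (a @ a + b @ b - shared)

# Tag universe: [dog, puppy, funny, cat]
A = np.array([1, 1, 1, 0])   # {dog, puppy, funny}
B = np.array([1, 1, 0, 1])   # {dog, puppy, cat}
print(cosine_binary(A, B))   # 2/3 ≈ 0.667
print(tanimoto(A, B))        # 2/4 = 0.5
```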

Additional Similarity Metrics
The Theodoridis text defines a large number of alternative distance metrics, including:

• Hamming distance: the number of locations where two vectors (usually bit vectors) disagree
• Correlation coefficient
• Weighted distances
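For completeness, a tiny sketch of the Hamming distance (illustrative only, my own helper name):

```python
import numpy as np

def hamming(a, b):
    """Number of positions where the two vectors disagree."""
    return int(np.count_nonzero(a != b))

print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])))  # 2
```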

Criterion Functions for Clustering

A criterion function quantifies the 'quality' of a set of clusters.

• Clustering task: partition the data set D into c disjoint sets D1, ..., Dc

• Choose the partition that optimizes (maximizes or minimizes, depending on the criterion) the criterion function


Criterion: Sum of Squared Error

J_e = \sum_{i=1}^{c} \sum_{x \in D_i} ||x - \mu_i||^2

Measures the total squared 'error' incurred by the choice of cluster centers µ_i (the cluster means).

'Optimal' clustering minimizes this quantity.

Issues:

• Well suited when clusters are compact and well separated

• Sensitive to outliers

• Very different numbers of points per cluster can lead to large clusters being split 'unnaturally' (Figure 10.10 below)
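A sketch of the sum-of-squared-error criterion J_e for a given label assignment (NumPy; the function name and toy data are mine):

```python
import numpy as np

def sse_criterion(X, labels):
    """J_e: sum over clusters of squared distances to the cluster mean."""
    total = 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]
        mu = cluster.mean(axis=0)
        total += np.sum((cluster - mu) ** 2)
    return total

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
print(sse_criterion(X, np.array([0, 0, 1, 1])))  # 0.5 + 0.5 = 1.0
```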

[Figure 10.10: the natural two-cluster grouping (top, Je = large) versus the partition with smaller Je (bottom, Je = small).]

FIGURE 10.10. When two natural groupings have very different numbers of points, the clusters minimizing a sum-squared-error criterion Je of Eq. 54 may not reveal the true underlying structure. Here the criterion is smaller for the two clusters at the bottom than for the more natural clustering at the top. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Related Criteria: Minimum Variance
An equivalent formulation of the SSE criterion:

J_e = \frac{1}{2} \sum_{i=1}^{c} n_i \bar{s}_i,  where  \bar{s}_i = \frac{1}{n_i^2} \sum_{x \in D_i} \sum_{x' \in D_i} ||x - x'||^2

\bar{s}_i is the mean squared distance between points in cluster i (a variance).

Alternative criteria: use the median, maximum, or another descriptive statistic of the within-cluster distances.

Variation: use a similarity function instead of a distance, e.g.

\bar{s}_i = \frac{1}{n_i^2} \sum_{x \in D_i} \sum_{x' \in D_i} s(x, x')   or   \bar{s}_i = \min_{x, x' \in D_i} s(x, x')

where s may be any similarity function (e.g. Tanimoto); in this case the criterion is maximized.
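The equivalence with J_e is easy to check numerically; a hedged sketch under the same toy data as the SSE example above (pairwise sums include x = x', which contribute zero):

```python
import numpy as np

def min_variance_criterion(X, labels):
    """J_e = 1/2 * sum_i n_i * s_bar_i, with s_bar_i the mean squared
    pairwise distance within cluster i."""
    total = 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]
        n_i = len(cluster)
        diffs = cluster[:, None, :] - cluster[None, :, :]
        s_bar = np.sum(diffs ** 2) / n_i ** 2
        total += 0.5 * n_i * s_bar
    return total

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
print(min_variance_criterion(X, labels))  # 1.0, same value as sse_criterion(X, labels)
```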

Criterion: Scatter Matrix-Based

S_W = \sum_{i=1}^{c} \sum_{x \in D_i} (x - \mu_i)(x - \mu_i)^T

\mathrm{trace}[S_W] = \sum_{i=1}^{c} \sum_{x \in D_i} ||x - \mu_i||^2 = J_e

Minimize the trace of S_W (the within-cluster scatter matrix): this is equivalent to the SSE criterion!

Recall that the total scatter is the sum of the within-cluster and between-cluster scatter (S_m = S_W + S_B). This means that by minimizing the trace of S_W, we also maximize the trace of S_B (as S_m is fixed):

\mathrm{trace}[S_B] = \sum_{i=1}^{c} n_i ||\mu_i - \mu_0||^2
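A sketch of the within- and between-cluster scatter matrices, checking that trace(S_W) equals J_e (function names and toy data are mine):

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (S_W, S_B): within- and between-cluster scatter matrices."""
    d = X.shape[1]
    mu0 = X.mean(axis=0)                      # total mean
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(labels):
        cluster = X[labels == c]
        mu_i = cluster.mean(axis=0)
        diffs = cluster - mu_i
        S_W += diffs.T @ diffs                # sum of outer products (x - mu_i)(x - mu_i)^T
        m = (mu_i - mu0).reshape(-1, 1)
        S_B += len(cluster) * (m @ m.T)
    return S_W, S_B

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
S_W, S_B = scatter_matrices(X, labels)
print(np.trace(S_W))   # 1.0, equal to J_e for this labeling
```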

Scatter-Based Criteria, Cont'd
Determinant criterion:

J_d = |S_W| = \left| \sum_{i=1}^{c} \sum_{x \in D_i} (x - \mu_i)(x - \mu_i)^T \right|

Roughly measures the square of the scattering volume; it is proportional to the product of the variances along the principal axes (minimize!).

• The minimum-error partition does not change under axis scaling, unlike the SSE criterion.

Scatter-Based: Invariant Criteria
Invariant criteria (eigenvalue-based): the eigenvalues λ_1, ..., λ_d of S_W^{-1} S_B measure the ratio of between-cluster to within-cluster scatter in the direction of the corresponding eigenvectors (maximize!).

• The trace of a matrix is the sum of its eigenvalues (here d is the length of the feature vector):

\mathrm{trace}[S_W^{-1} S_B] = \sum_{i=1}^{d} \lambda_i

• Eigenvalues are invariant under non-singular linear transformations (rotations, translations, scaling, etc.)

J_f = \mathrm{trace}[S_m^{-1} S_W] = \sum_{i=1}^{d} \frac{1}{1 + \lambda_i}
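A sketch of the eigenvalue-based criteria, reusing the S_W and S_B matrices from the scatter-matrix sketch above (assumes S_W is non-singular; my own code):

```python
import numpy as np

def invariant_criteria(S_W, S_B):
    """Return trace(S_W^{-1} S_B) and J_f = trace(S_m^{-1} S_W)."""
    lam = np.real(np.linalg.eigvals(np.linalg.solve(S_W, S_B)))  # eigenvalues of S_W^{-1} S_B
    trace_ratio = lam.sum()               # = trace(S_W^{-1} S_B)
    J_f = np.sum(1.0 / (1.0 + lam))       # = trace(S_m^{-1} S_W), since S_m = S_W + S_B
    return trace_ratio, J_f
```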

Clustering with a Criterion
Choosing a criterion creates a well-defined problem:

• Define clusters so as to optimize the criterion function

• This is a search problem

• Brute-force solution: enumerate all partitions of the training set and select the partition with the best criterion value

Comparison: Scatter-Based Criteria


Hierarchical Clustering
Motivation: capture similarity/distance relationships between sub-groups and samples within the chosen clusters.

• Common in scientific taxonomies (e.g. biology)

• Can operate bottom-up (from individual samples to clusters: agglomerative clustering) or top-down (from a single cluster to individual samples: divisive clustering)

Agglomerative Hierarchical Clustering
Problem: given n samples, we want c clusters.
One solution: create a sequence of partitions (clusterings):

• First partition, k = 1: n clusters (one cluster per sample)
• Second partition, k = 2: n − 1 clusters
• Continue reducing the number of clusters by one, merging the 2 closest clusters (a cluster may be a single sample) at each step k, until...
• Goal partition, k = n − c + 1: c clusters
• Done; but if we're curious, we can continue until the...
• ...final partition, k = n: one cluster

Result: all samples and sub-clusters are organized into a tree (a dendrogram).

• Dendrogram diagrams often show cluster similarity on the Y-axis
• If, as stated above, two samples that share a cluster remain together in a cluster at all higher levels, we have a hierarchical clustering

A code sketch of this bottom-up procedure appears after this list.
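For reference, the merge sequence and the cut at c clusters are available off-the-shelf; a hedged sketch using SciPy's hierarchical clustering (assuming SciPy is installed; the toy data are mine):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(3, 0.3, size=(20, 2))])

# 'single' merges by d_min (nearest neighbor); 'complete' and 'average' are also available.
Z = linkage(X, method='single')                    # (n-1) x 4 merge history
labels = fcluster(Z, t=2, criterion='maxclust')    # cut the tree at c = 2 clusters
print(np.bincount(labels))                         # sizes of the recovered clusters
# dendrogram(Z) would draw a Figure-10.11-style diagram (requires matplotlib).
```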

[Figure 10.11: dendrogram over points x1 ... x8 with merge levels k = 1 ... 8 and a similarity scale from 100 down to 0 on the vertical axis.]

FIGURE 10.11. A dendrogram can represent the results of hierarchical clustering algorithms. The vertical axis shows a generalized measure of similarity among clusters. Here, at level 1 all eight points lie in singleton clusters; each point in a cluster is highly similar to itself, of course. Points x6 and x7 happen to be the most similar, and are merged at level 2, and so forth. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

[Figure 10.12: nested set (Venn) diagram of the same eight points, with cluster levels k = 2 ... 8.]

FIGURE 10.12. A set or Venn diagram representation of two-dimensional data (which was used in the dendrogram of Fig. 10.11) reveals the hierarchical structure but not the quantitative distances between clusters. The levels are numbered by k, in red. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Distance Measures

d_min(D_i, D_j) = \min_{x \in D_i, \, x' \in D_j} ||x - x'||

d_max(D_i, D_j) = \max_{x \in D_i, \, x' \in D_j} ||x - x'||

d_avg(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} ||x - x'||

d_mean(D_i, D_j) = ||m_i - m_j||

Listed above: the minimum, maximum, and average inter-sample distance between clusters i and j (with samples D_i, D_j), and the distance between the cluster means (m_i, m_j).
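A direct sketch of the four measures (the helper names and toy clusters are mine):

```python
import numpy as np

def pairwise_dists(Di, Dj):
    """All ||x - x'|| for x in Di, x' in Dj."""
    return np.linalg.norm(Di[:, None, :] - Dj[None, :, :], axis=-1)

def d_min(Di, Dj):  return pairwise_dists(Di, Dj).min()
def d_max(Di, Dj):  return pairwise_dists(Di, Dj).max()
def d_avg(Di, Dj):  return pairwise_dists(Di, Dj).mean()
def d_mean(Di, Dj): return np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))

Di = np.array([[0.0, 0.0], [1.0, 0.0]])
Dj = np.array([[4.0, 0.0], [5.0, 0.0]])
print(d_min(Di, Dj), d_max(Di, Dj), d_avg(Di, Dj), d_mean(Di, Dj))  # 3.0 5.0 4.0 4.0
```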

Nearest-Neighbor Algorithm
Also known as the 'single-linkage' algorithm: agglomerative hierarchical clustering using d_min.

• The two nearest neighbors in separate clusters determine which clusters are merged at each step

• If we continue until k = n (c = 1), we produce a minimum spanning tree (similar to Kruskal's algorithm)

• MST: a path exists between all node (sample) pairs, and the sum of edge costs is minimum over all spanning trees

Issues: sensitive to noise and slight changes in the position of data points (the chaining effect).

• Example: next figure

FIGURE 10.13. Two Gaussians were used to generate two-dimensional samples, shown in pink and black. The nearest-neighbor clustering algorithm gives two clusters that well approximate the generating Gaussians (left). If, however, another particular sample is generated (circled red point at the right) and the procedure is restarted, the clusters do not well approximate the Gaussians. This illustrates how the algorithm is sensitive to the details of the samples. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Farthest-Neighbor Algorithm
Agglomerative hierarchical clustering using d_max.

• The clusters with the smallest maximum distance between two points are merged at each step

• Goal: minimize the increase in the largest cluster diameter at each iteration (discourages elongated clusters)

• Known as the 'complete-linkage algorithm' if terminated when the distance between the nearest clusters exceeds a given threshold distance

Issues: works well when clusters are compact and roughly equal in size; with elongated clusters, the result can be meaningless.

[Figure 10.14: the same data with a large d_max (three clusters, left) and a smaller d_max (four clusters, right).]

FIGURE 10.14. The farthest-neighbor clustering algorithm uses the separation between the most distant points as a criterion for cluster membership. If this distance is set very large, then all points lie in the same cluster. In the case shown at the left, a fairly large dmax leads to three clusters; a smaller dmax gives four clusters, as shown at the right. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Using Mean and Average Distances

d_avg(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} ||x - x'||

d_mean(D_i, D_j) = ||m_i - m_j||

Using the mean or average inter-cluster distance reduces sensitivity to outliers.

• d_mean is less expensive to compute than d_avg, d_min, or d_max (each of those requires n_i * n_j distances)

Stepwise Optimal Hierarchical Clustering
Problem: none of the agglomerative methods discussed so far directly minimizes a specific criterion function.

Modified agglomerative algorithm: for k = 1 to (n − c + 1):

• Find the pair of clusters D_i and D_j whose merger changes the criterion the least, and
• Merge D_i and D_j

Example: minimal increase in SSE (J_e):

d_e(D_i, D_j) = \sqrt{\frac{n_i n_j}{n_i + n_j}} \, ||m_i - m_j||

d_e identifies the cluster pair that increases J_e as little as possible. This may not minimize the overall SSE, but it is often a good starting point.

• It prefers merging single elements or small clusters with large clusters over merging medium-sized clusters.
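A sketch of the merge cost d_e and of picking the cheapest merger at one step (illustrative only; names and toy clusters are mine). Minimizing d_e at each step minimizes the increase in J_e for that step:

```python
import numpy as np
from itertools import combinations

def d_e(Di, Dj):
    """sqrt(n_i n_j / (n_i + n_j)) * ||m_i - m_j||; its square is the
    increase in J_e caused by merging clusters Di and Dj."""
    ni, nj = len(Di), len(Dj)
    return np.sqrt(ni * nj / (ni + nj)) * np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))

clusters = [np.array([[0.0, 0.0]]),
            np.array([[0.0, 1.0]]),
            np.array([[5.0, 5.0], [5.0, 6.0]])]
i, j = min(combinations(range(len(clusters)), 2),
           key=lambda ij: d_e(clusters[ij[0]], clusters[ij[1]]))
print(i, j)   # 0 1: merging the two singletons increases J_e the least
```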

k-Means Clustering
k-Means algorithm, for a number of clusters k:
1. Choose k data points at random as the initial cluster centers
2. Assign all data points to the closest of the k cluster centers
3. Re-compute the k cluster centers as the mean vector of each cluster

• If the cluster centers do not change, stop
• Else, go to step 2

Complexity: O(ndcT) for T iterations, d features, n points, c clusters; in practice usually T ≪ n
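A minimal NumPy sketch of the algorithm as described above (random initialization from the data points; no empty-cluster handling; names and toy data are mine):

```python
import numpy as np

def k_means(X, k, rng=np.random.default_rng(0), max_iter=100):
    """Lloyd-style k-means: returns (centers, labels)."""
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random initial centers
    for _ in range(max_iter):
        # step 2: assign each point to the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # step 3: recompute centers as cluster means
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):                # stop if centers are unchanged
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.default_rng(1).normal(m, 0.2, size=(30, 2)) for m in (0.0, 3.0)])
centers, labels = k_means(X, k=2)
print(np.round(centers, 1))   # roughly [[0, 0], [3, 3]] (cluster order may vary)
```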