Pattern Recognition Prof. Christian Bauckhage
outline lecture 13
recap
data clustering
k-means clustering
    Lloyd’s algorithm
    Hartigan’s algorithm
    MacQueen’s algorithm
GMMs and k-means
soft k-means
summary
remember . . .
nearest neighbors, space partitioning, Voronoi cells . . .
questions
can we combine individual Voronoi cells V(xi) into larger ones?
how to choose / determine the centers of “super” Voronoi cells automatically?
observe these are questions related to data clustering
data clustering
clustering
⇔ given a finite data set X, automatically identify latent structures or groups of similar data points within X
note
in general, clustering is an ill-posed problem
note
from now on, we shall assume that X = {x1, . . . , xn} ⊂ ℝ^m
note
there are many different clustering philosophies
    relational clustering (e.g. spectral clustering, graph cuts, . . . )
    hierarchical clustering (e.g. divisive or agglomerative linkage methods)
    density-based clustering (e.g. DBSCAN, . . . )
    prototype-based clustering (e.g. k-means, LBG, . . . )
note
differences between clustering algorithms are often rather fuzzy than crisp and basically boil down to how they answer two questions
Q1: what defines similarity? Q2: what properties should a cluster have?
k-means clustering
k-means clustering
is the “most popular” algorithm for vector quantization
determines k ≪ n clusters Ci and answers Q2 as follows
    Ci ⊂ X
    Ci ∩ Cj = ∅ ∀ i ≠ j
    C1 ∪ C2 ∪ . . . ∪ Ck = X
considers cluster centroids µi to answer Q1 as follows
$$C_i = \bigl\{ x \in X \;\big|\; \|x - \mu_i\|^2 \leq \|x - \mu_l\|^2 \;\forall\, l \bigr\}$$
observe
each cluster therefore corresponds to a Voronoi cell in ℝ^m: Ci = V(µi)
the problem at the heart of k-means clustering is thus to determine k distinct suitable cluster centroids µ1, . . . , µk
this can be done by minimizing an objective function E(C1, C2, . . . , Ck) = E(k)
k-means objective function
various equivalent formulations possible, in particular
$$E(k) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \bigl\| x_j - \mu_i \bigr\|^2 \tag{1}$$
$$E(k) = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij} \bigl\| x_j - \mu_i \bigr\|^2 \tag{2}$$
where
$$z_{ij} = \begin{cases} 1, & \text{if } x_j \in C_i \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
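To make the objective concrete, here is a minimal NumPy sketch (illustrative; the function and variable names are not from the lecture) that evaluates E(k) for a data matrix X of shape (n, m) and a centroid matrix mu of shape (k, m):

import numpy as np

def kmeans_objective(X, mu):
    """Evaluate E(k): each x_j contributes its squared distance to the nearest centroid."""
    # squared distances of every point to every centroid, shape (n, k)
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # Voronoi assignment: each point is counted towards its closest centroid, cf. (1)
    return float(d2.min(axis=1).sum())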
note
the problem of solving $\operatorname*{argmin}_{\mu_1, \dots, \mu_k} E(k)$ using either (1) or (2) looks innocent but is actually NP-hard
E(k) has numerous local minima and there is no algorithm known today that is guaranteed to find the optimal solution
⇔ any available algorithm for k-means clustering is a heuristic
Lloyd’s algorithm
set t = 0 and initialize µ_1^t, µ_2^t, . . . , µ_k^t
repeat until convergence
    update all clusters
    $$C_i^t = \bigl\{ x \in X \;\big|\; \|x - \mu_i^t\|^2 \leq \|x - \mu_l^t\|^2 \;\forall\, l \bigr\} \tag{4}$$
    update all cluster means
    $$\mu_i^{t+1} = \frac{1}{|C_i^t|} \sum_{x \in C_i^t} x \tag{5}$$
    increase iteration counter t = t + 1
possible convergence criteria
cluster assignments stabilize, i.e. C_i^t ∩ C_i^{t−1} = C_i^t ∀ i
cluster centroids stabilize, i.e. ‖µ_i^t − µ_i^{t−1}‖ ≤ ε ∀ i
number of iterations exceeds threshold, i.e. t > t_max
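A minimal NumPy sketch of Lloyd's algorithm as outlined above (illustrative, not the lecture's reference implementation); the initialization with randomly chosen data points is an assumption, and the loop stops once the cluster assignments stabilize or t exceeds t_max:

import numpy as np

def lloyd_kmeans(X, k, t_max=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # initialize the centroids, here simply with k randomly chosen data points
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.full(n, -1)
    for t in range(t_max):
        # update all clusters: assign each point to its nearest centroid, cf. (4)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # cluster assignments stabilized
            break
        labels = new_labels
        # update all cluster means, cf. (5); keep the old centroid if a cluster runs empty
        for i in range(k):
            if np.any(labels == i):
                mu[i] = X[labels == i].mean(axis=0)
    return mu, labels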
example
[figures: initialization, 1st update, . . . , final result]
question is Lloyd’s algorithm guaranteed to converge?
answer
we note that
$$\mu_i = \operatorname*{argmin}_{y} \sum_{x \in C_i} \|x - y\|^2$$
so that the mean updates in (5) cannot increase E(k)
we also note that by design the cluster updates in (4) cannot increase E(k)
we therefore have 0 ≤ E^{t+1}(k) ≤ E^t(k), which implies that the algorithm converges to a (local) minimum
assignment
given X = {x_1, . . . , x_n}, prove the following important property of the sample mean
$$\mu = \operatorname*{argmin}_{x} \sum_{j} \|x_j - x\|^2 = \frac{1}{n} \sum_{j} x_j$$
note
the fact that Lloyd’s algorithm will converge says nothing about the quality of the solution
the fact that Lloyd’s algorithm will converge says nothing about the speed of convergence
in fact, it usually converges quickly, but the quality of its result crucially depends on the initialization of the means µ1, µ2, . . . , µk
assignment
read C. Bauckhage, “Lecture Notes on Data Science: k-Means Clustering”, dx.doi.org/10.13140/RG.2.1.2829.4886
watch www.youtube.com/watch?v=5I3Ei69I40s www.youtube.com/watch?v=9nKfViAfajY
observe
there are many algorithms / heuristics for minimizing E(k)
Hartigan’s algorithm
    is much less well known than Lloyd’s algorithm
    is provably more robust than Lloyd’s algorithm
    provably converges
    converges quickly
Hartigan’s algorithm
for all xj ∈ {x1, . . . , xn}, randomly assign xj to a cluster Ci
for all Ci ∈ {C1, . . . , Ck}, compute µi
repeat until converged
    converged ← True
    for all xj ∈ {x1, . . . , xn}
        determine Ci = C(xj)
        remove xj from Ci and recompute µi
        determine Cw = argmin_{Cl} E(C1, . . . , Cl ∪ {xj}, . . . , Ck)
        if Cw ≠ Ci, then converged ← False
        assign xj to Cw and recompute µw
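A straightforward NumPy sketch of this procedure (illustrative and deliberately simple: it re-evaluates the full objective E for every candidate move, which is slow; efficient implementations such as Hartigan–Wong use an incremental update instead):

import numpy as np

def wcss(X, labels, k):
    """Total within-cluster sum of squares E(C_1, ..., C_k)."""
    total = 0.0
    for i in range(k):
        Ci = X[labels == i]
        if len(Ci) > 0:
            total += ((Ci - Ci.mean(axis=0)) ** 2).sum()
    return total

def hartigan_kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    labels = rng.integers(0, k, size=n)          # randomly assign every x_j to a cluster
    converged = False
    while not converged:
        converged = True
        for j in range(n):
            current = labels[j]
            # placing x_j into candidate cluster l yields the partition C_1, ..., C_l u {x_j}, ..., C_k
            costs = []
            for l in range(k):
                labels[j] = l
                costs.append(wcss(X, labels, k))
            winner = int(np.argmin(costs))
            labels[j] = winner                   # assign x_j to the winning cluster C_w
            if winner != current:
                converged = False
    # centroids of the final clusters (assumes no cluster ended up empty)
    mu = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    return mu, labels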
example
[figures: random initialization, result after one for-loop, result after two for-loops]
assignment
watch www.youtube.com/watch?v=ivr91orblu8
question what if |X| ≫ 1 or what if the xj ∈ X arrive one at a time?
answer consider the use of online k-means clustering
observe
$$\mu^{(n)} = \frac{1}{n} \sum_{j=1}^{n} x_j = \frac{1}{n} \sum_{j=1}^{n-1} x_j + \frac{1}{n}\, x_n = \frac{n-1}{n}\, \mu^{(n-1)} + \frac{1}{n}\, x_n$$
⇒ µ^(n) is a convex combination of µ^(n−1) and x_n
⇔ µ^(n) can be computed iteratively
observe
$$\mu^{(n)} = \frac{n-1}{n}\, \mu^{(n-1)} + \frac{1}{n}\, x_n = \mu^{(n-1)} - \frac{1}{n}\, \mu^{(n-1)} + \frac{1}{n}\, x_n = \mu^{(n-1)} + \frac{1}{n} \Bigl[ x_n - \mu^{(n-1)} \Bigr]$$
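As a quick numerical sanity check (illustrative, not part of the slides), a few lines of NumPy confirm that this incremental update reproduces the batch mean:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))           # a "stream" of sample points x_1, x_2, ...

mu = np.zeros(2)                         # running mean mu^(0)
for n, x in enumerate(X, start=1):
    mu += (x - mu) / n                   # mu^(n) = mu^(n-1) + (1/n) [x_n - mu^(n-1)]

print(np.allclose(mu, X.mean(axis=0)))   # True: the recursion matches the batch mean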
MacQueen’s algorithm
for all Ci ∈ {C1, . . . , Ck}, initialize µi and set ni = 0
for all xj ∈ {x1, x2, . . .}
    determine the winner centroid
    $$\mu_w = \operatorname*{argmin}_{i} \|x_j - \mu_i\|^2$$
    update cluster size and centroid
    nw ← nw + 1
    µw ← µw + (1/nw) [xj − µw]
for all Ci ∈ {C1, . . . , Ck}
    $$C_i = \bigl\{ x \in X \;\big|\; \|x - \mu_i\|^2 \leq \|x - \mu_l\|^2 \;\forall\, l \bigr\}$$
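A minimal NumPy sketch of MacQueen's algorithm (illustrative; the random centroid initialization and the function name are assumptions, not part of the slides). It processes the points one at a time and only keeps the centroids and cluster sizes in memory:

import numpy as np

def macqueen_kmeans(stream, k, m, seed=0):
    """Online k-means over an iterable of m-dimensional points."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=(k, m))       # initialize the centroids (here: at random)
    counts = np.zeros(k, dtype=int)    # cluster sizes n_i
    for x in stream:
        # determine the winner centroid mu_w = argmin_i ||x_j - mu_i||^2
        w = int(((x - mu) ** 2).sum(axis=1).argmin())
        # update cluster size and centroid
        counts[w] += 1
        mu[w] += (x - mu[w]) / counts[w]
    # afterwards, points can be assigned to clusters via their nearest centroid as above
    return mu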
assignment
read C. Bauckhage, “Lecture Notes on Data Science: Online k-Means Clustering”, dx.doi.org/10.13140/RG.2.1.1608.6240
watch www.youtube.com/watch?v=hzGnnx0k6es
GMMs and k-means
pathological example
[figures: a sample of data points xj ∈ ℝ², and the result of k-means clustering for k = 2]
question what went wrong?
answer we applied k-means to data on which it cannot work well!
probabilistic view on clustering
imagine the given samples xj ∈ X were produced as follows
    sample a cluster Ci according to a discrete probability p(Ci)
    sample a point xj according to a continuous conditional probability p(x | Ci)
under this generative model, the probability for observing any sample point xj amounts to
$$p(x_j) = \sum_{i=1}^{k} p(x_j \mid C_i)\, p(C_i)$$
modeling assumptions
let the elements in each cluster be distributed according to
$$p(x \mid C_i) = \mathcal{N}(x \mid \mu_i, \Sigma_i) = \gamma_i\, e^{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}$$
i.e. a multivariate Gaussian with normalization constant γi
let Σi = I (each Gaussian is isotropic and of unit variance)
$$p(x \mid C_i) = \mathcal{N}(x \mid \mu_i) = \gamma_i\, e^{-\frac{1}{2} \|x - \mu_i\|^2}$$
consequence
letting wi = p(Ci), we thus consider a particularly simple Gaussian mixture model (GMM)
$$p(x_j) = \sum_{i=1}^{k} w_i\, \mathcal{N}(x_j \mid \mu_i)$$
and might be interested in estimating its parameters θ = {w1, µ1, . . . , wk, µk} from the data . . .
likelihood and log-likelihood
$$L(\theta) = \prod_{j=1}^{n} p(x_j) = \prod_{j=1}^{n} \sum_{i=1}^{k} w_i\, \mathcal{N}(x_j \mid \mu_i)$$
$$\log L(\theta) = \sum_{j=1}^{n} \log \Biggl[ \sum_{i=1}^{k} w_i\, \mathcal{N}(x_j \mid \mu_i) \Biggr]$$
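For concreteness, a small NumPy sketch (illustrative, not the lecture's code) that evaluates this log-likelihood for given mixture weights w and means mu under the isotropic unit-variance assumption made above:

import numpy as np

def gmm_log_likelihood(X, w, mu):
    """log L(theta) for the simplified GMM with Sigma_i = I."""
    n, m = X.shape
    gamma = (2.0 * np.pi) ** (-m / 2)                          # normalization constant gamma_i
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # ||x_j - mu_i||^2, shape (n, k)
    densities = gamma * np.exp(-0.5 * d2)                      # N(x_j | mu_i)
    return float(np.log(densities @ w).sum())                  # sum_j log sum_i w_i N(x_j | mu_i)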
note
our trusted recipe of considering ∇L(θ) = 0 does not lead to a closed form solution
a great idea due to Dempster et al. (1977) is to assume a set of indicator variables Z = {z11, z12, . . . , zkn}, just as introduced in (3), and to consider . . .
complete likelihood
$$L(\theta, Z) = \prod_{j=1}^{n} \sum_{i=1}^{k} z_{ij}\, w_i\, \mathcal{N}(x_j \mid \mu_i)$$
observe
since zij ∈ {0, 1} and ∑_i zij = 1, we are allowed to write
$$\sum_{i=1}^{k} z_{ij}\, w_i\, \mathcal{N}(x_j \mid \mu_i) = \prod_{i=1}^{k} \bigl[ w_i\, \mathcal{N}(x_j \mid \mu_i) \bigr]^{z_{ij}}$$
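For instance (an illustrative example, not from the slides), for k = 2 and a point xj that belongs to cluster C1, i.e. (z1j, z2j) = (1, 0), both sides reduce to the same term:
$$\sum_{i=1}^{2} z_{ij}\, w_i\, \mathcal{N}(x_j \mid \mu_i) = w_1\, \mathcal{N}(x_j \mid \mu_1) = \bigl[ w_1\, \mathcal{N}(x_j \mid \mu_1) \bigr]^{1} \bigl[ w_2\, \mathcal{N}(x_j \mid \mu_2) \bigr]^{0} = \prod_{i=1}^{2} \bigl[ w_i\, \mathcal{N}(x_j \mid \mu_i) \bigr]^{z_{ij}}$$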
complete log-likelihood
$$\log L(\theta, Z) = \sum_{j=1}^{n} \sum_{i=1}^{k} z_{ij} \bigl[ \log w_i + \log \mathcal{N}(x_j \mid \mu_i) \bigr]$$
$$= \sum_{j=1}^{n} \sum_{i=1}^{k} z_{ij} \Bigl[ \log w_i + \log \gamma_i - \tfrac{1}{2} \|x_j - \mu_i\|^2 \Bigr]$$
$$= \underbrace{\sum_{j=1}^{n} \sum_{i=1}^{k} z_{ij} \log w_i}_{T_1} + \underbrace{\sum_{j=1}^{n} \sum_{i=1}^{k} z_{ij} \log \gamma_i}_{T_2} - \underbrace{\sum_{j=1}^{n} \sum_{i=1}^{k} z_{ij}\, \frac{\|x_j - \mu_i\|^2}{2}}_{T_3}$$
note
maximizing log L(θ, Z) = T1 + T2 − T3 requires minimizing T3
looking at
$$2 \cdot T_3 = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij}\, \|x_j - \mu_i\|^2$$
we recognize the k-means minimization objective in (2)
⇒ k-means clustering implicitly fits a simplified GMM to X
note
one can also show that k-means clustering implicitly fits isotropic Gaussians of small variance
⇒ if the data in X does not consist of “Gaussian blobs”, k-means clustering will produce questionable results
assignment
read
C. Bauckhage, “Lecture Notes on Data Science: k-Means Clustering Is Gaussian Mixture Modeling”, dx.doi.org/10.13140/RG.2.1.3033.2646
C. Bauckhage, “Lecture Notes on Data Science: k-Means Clustering Minimizes Within Cluster Variances”, dx.doi.org/10.13140/RG.2.1.1292.4649
soft k-means
observe
so far, we have been focusing on k-means for hard clustering with indicator variables zij ∈ {0, 1} where
$$z_{ij} = \begin{cases} 1, & \text{if } x_j \in C_i \\ 0, & \text{otherwise} \end{cases} = \begin{cases} 1, & \text{if } \|x_j - \mu_i\|^2 \leq \|x_j - \mu_l\|^2 \;\forall\, l \neq i \\ 0, & \text{otherwise} \end{cases}$$
observe
we may relax this to the idea of soft clustering with indicator variables zij ∈ [0, 1] where zij > 0 and ∑_i zij = 1
a common approach towards this idea is to consider
$$z_{ij} = \frac{e^{-\beta \|x_j - \mu_i\|^2}}{\sum_{l} e^{-\beta \|x_j - \mu_l\|^2}}$$
soft k-means clustering
set t = 0 and initialize µ_1^t, µ_2^t, . . . , µ_k^t
repeat until convergence
    compute all indicator variables
    $$z_{ij} = \frac{\exp\bigl[-\beta \|x_j - \mu_i^t\|^2\bigr]}{\sum_{l} \exp\bigl[-\beta \|x_j - \mu_l^t\|^2\bigr]}$$
    update all centroids
    $$\mu_i^{t+1} = \frac{\sum_j z_{ij}\, x_j}{\sum_j z_{ij}}$$
    increase iteration counter t = t + 1
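A minimal NumPy sketch of soft k-means clustering as stated above (illustrative; the initialization with random data points and the convergence tolerance tol are assumptions, not part of the slides):

import numpy as np

def soft_kmeans(X, k, beta=1.0, t_max=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)    # initial centroids
    for t in range(t_max):
        # compute all indicator variables z_ij (soft assignments)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))  # shifted for numerical stability
        z /= z.sum(axis=1, keepdims=True)
        # update all centroids as z-weighted means of the data
        mu_new = (z.T @ X) / z.sum(axis=0)[:, None]
        if np.linalg.norm(mu_new - mu) < tol:                     # centroids stabilized
            mu = mu_new
            break
        mu = mu_new
    return mu, z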
example
[figures: initial cluster centroids and corresponding soft cluster assignments; centroids and soft assignments after the first update step; centroids and soft assignments upon convergence]
example
[figures: effect of the stiffness parameter β > 0 on the data, for β = 1/2, β = 1, and β = 2]
assignment
read C. Bauckhage, “Lecture Notes on Data Science: Soft k-Means Clustering”, dx.doi.org/10.13140/RG.2.1.3582.6643
watch www.youtube.com/watch?v=Np9VuEg aqo
summary
we now know about
k-means clustering
    the fact that it implicitly fits a GMM and is therefore tailored to locally Gaussian data
    the fact that it is a difficult problem (NP-hard) whose optimal solution cannot be guaranteed
    the fact that there are various algorithms (heuristics!)
        Lloyd’s algorithm
        Hartigan’s algorithm
        MacQueen’s algorithm