Lecture Notes on Data Science: Online k-Means Clustering
Christian Bauckhage
B-IT, University of Bonn

In this note, we discuss an approach to k-means clustering that allows for application in online settings where the data arrive one at a time.
Introduction

In practice, we often resort to k-means clustering¹ not because we want to determine clusters but because we are interested in vector quantization.

¹ C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering, 2015c. DOI: 10.13140/RG.2.1.2829.4886

That is, given a set of n data points

    X = {x_1, ..., x_n},   x_j ∈ ℝ^m,                                   (1)

the goal is to compute a codebook

    V = {v_1, ..., v_k},   v_i ∈ ℝ^m,                                   (2)
of k ≪ n prototypes or codebook vectors. On the one hand, this allows for data compression because appropriate codebooks can represent possibly many data points in terms of only a few prototypes. On the other hand, prototypes are used for feature learning, hierarchical information retrieval, or nearest neighbor classification, to name but a few applications².

Given our earlier discussions, it is easy to see that k-means clustering provides a solution to the vector quantization problem. Recall that when we use it to partition X into k clusters C_1, ..., C_k, the algorithm initializes k centroids µ_1, ..., µ_k, determines clusters

    C_i = { x_j ∈ X | ‖x_j − µ_i‖² ≤ ‖x_j − µ_l‖² ∀ l ≠ i },             (3)

updates the centroids

    µ_i = (1/n_i) ∑_{x_j ∈ C_i} x_j    where    n_i = |C_i|,             (4)
and repeats steps (3) and (4) until convergence. Upon convergence, each centroid thus indicates the mean of a cluster of data points and can therefore be considered a prototype of the data in X.

However, in vector quantization, we are not actually interested in the clusters C_i but only in the prototypes µ_i. Step (3) therefore appears to introduce overhead, and the question is:

Q: Is there an algorithm that determines the centroids µ_i but avoids the computation of the clusters C_i?

Below, we will answer this question affirmatively: yes, such an algorithm exists. Moreover, as it avoids (3), it does not require prior knowledge of the full data set X and thus applies to settings where X is not known in advance but the data arrive one at a time. It is therefore commonly known as online k-means clustering.
² Interesting examples can be found in: A. Coates and A.Y. Ng. Learning Feature Representations with K-Means. In G. Montavon, G.B. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 7700 of LNCS. Springer, 2012; and D. Nister and H. Stewénius. Scalable Recognition with a Vocabulary Tree. In Proc. CVPR, 2006.
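To make the batch procedure we just recalled concrete, here is a minimal NumPy sketch of the iteration over steps (3) and (4); note that the function name batch_kmeans, the random initialization, and the convergence test are our own choices for illustration and are not part of the original notes.

import numpy as np

def batch_kmeans(X, k, n_iter=100, seed=0):
    # Lloyd-style k-means: alternate steps (3) and (4) until the centroids stop moving.
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # step (3): assign each x_j to its closest centroid
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # step (4): recompute each centroid as the mean of its cluster
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, labels

Note that this sketch needs the full data set X up front, which is exactly the requirement the online variant discussed next removes.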
Online k-Means Clustering

The fundamental observation for the design of an online k-means algorithm is that sample means can be computed recursively. To see how, let us consider the mean µ^(n) of a sample of n points {x_1, ..., x_n} and observe that

    µ^(n) = (1/n) ∑_{j=1}^{n} x_j
          = (1/n) ∑_{j=1}^{n−1} x_j + (1/n) x_n
          = ((n−1)/n) µ^(n−1) + (1/n) x_n.                                (5)
This is to say that the mean µ^(n) is a convex combination of the previous mean µ^(n−1) and the new data point x_n. After another truly straightforward algebraic manipulation, we also realize that

    ((n−1)/n) µ^(n−1) + (1/n) x_n = µ^(n−1) − (1/n) µ^(n−1) + (1/n) x_n       (6)

so that the mean µ^(n) in (5) can be written as

    µ^(n) = µ^(n−1) + (1/n) (x_n − µ^(n−1)).                                  (7)
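As a quick sanity check of (7) (not part of the original notes), the following sketch feeds a simulated stream of points through the recursive update and confirms that it reproduces the batch mean:

import numpy as np

rng = np.random.default_rng(1)
stream = rng.normal(size=(1000, 3))          # a "stream" of 1000 points in R^3

mu = np.zeros(3)
for n, x_n in enumerate(stream, start=1):
    mu += (x_n - mu) / n                     # update rule (7)

print(np.allclose(mu, stream.mean(axis=0)))  # True: recursive and batch mean agree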
Looking at (7), we recognize an instance of the standard rule of competitive learning. That is, we can estimate the mean iteratively: once a new data point arrives, we move our current estimate slightly towards the new point. Regarding our problem of online k-means clustering, this then leads to the algorithm in Fig. 1.

Assuming a constant stream of data points x_j, the algorithm operates in two phases. In the initial phase, the first k data points are used to initialize k centroids µ_i; for each centroid, the number of points n_i it represents is also initialized to 1. In the second phase, each incoming data point x_j is compared to all centroids, and the centroid µ_i that is closest to x_j is updated according to (7). In addition, in order to register the fact that another data point has contributed to µ_i, the count n_i of points it represents is incremented.
    i ← 1
    for all x_j ∈ {x_1, x_2, ...} do
        if j ≤ k then
            // initialize centroids
            µ_i ← x_j
            n_i ← 1
            i ← i + 1
        else
            // determine winner centroid
            µ_i ← argmin_{µ_l} ‖x_j − µ_l‖²
            // update winner centroid and n_i
            µ_i ← µ_i + (1/(n_i + 1)) (x_j − µ_i)
            n_i ← n_i + 1

Figure 1: Online k-means.
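For readers who prefer running code, the following is a minimal Python/NumPy sketch of the procedure in Fig. 1; the class name OnlineKMeans and the use of plain Python lists for the centroids are our own choices, not part of the original notes.

import numpy as np

class OnlineKMeans:
    # Online k-means as in Fig. 1: the first k points become the centroids,
    # and every later point pulls its nearest centroid slightly towards itself.

    def __init__(self, k):
        self.k = k
        self.centroids = []   # centroid vectors mu_i
        self.counts = []      # n_i, the number of points represented by centroid i

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if len(self.centroids) < self.k:
            # initial phase: use the first k data points as centroids
            self.centroids.append(x.copy())
            self.counts.append(1)
        else:
            # determine the winner centroid, i.e. the one closest to x
            mu = np.stack(self.centroids)
            i = int(((mu - x) ** 2).sum(axis=1).argmin())
            # move the winner towards x with step size 1/(n_i + 1), cf. (7)
            self.centroids[i] += (x - self.centroids[i]) / (self.counts[i] + 1)
            self.counts[i] += 1

# example usage on a simulated stream
rng = np.random.default_rng(0)
okm = OnlineKMeans(k=3)
for x in rng.normal(size=(10000, 2)):
    okm.update(x)

The step size 1/(n_i + 1) is the learning rate from Fig. 1: since n_i counts the points the winner already represents, the incoming point is its (n_i + 1)-th contribution, which makes the update consistent with (7).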
To conclude our discussion, we emphasize the following:

1. Should we want to use this algorithm for clustering rather than merely for vector quantization, we could, at any point during its execution, use the current centroid estimates µ_1, ..., µ_k to compute clusters according to (3).

2. It may be hard to believe that this simple procedure really works, but it does³.

3. However, our earlier caveats regarding suitable initializations and the tendency of k-means to produce Gaussian clusters⁴ of small variance⁵ still apply.

³ Here is a video to illustrate this point: www.youtube.com/watch?v=hzGnnx0k6es
⁴ C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering Is Gaussian Mixture Modeling, 2015a. DOI: 10.13140/RG.2.1.3033.2646
⁵ C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering Minimizes Within Cluster Variances, 2015b. DOI: 10.13140/RG.2.1.1292.4649
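Regarding the first concluding remark, cluster assignments can be recovered at any time by a single application of step (3) to the centroid estimates gathered so far; a small helper (ours, for illustration) might look as follows.

import numpy as np

def assign_clusters(X, centroids):
    # step (3): label each row of X with the index of its closest centroid
    mu = np.asarray(centroids, dtype=float)
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)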
References

C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering Is Gaussian Mixture Modeling, 2015a. DOI: 10.13140/RG.2.1.3033.2646.
C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering Minimizes Within Cluster Variances, 2015b. DOI: 10.13140/RG.2.1.1292.4649.
C. Bauckhage. Lecture Notes on Data Science: k-Means Clustering, 2015c. DOI: 10.13140/RG.2.1.2829.4886.
A. Coates and A.Y. Ng. Learning Feature Representations with K-Means. In G. Montavon, G.B. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 7700 of LNCS. Springer, 2012.
D. Nister and H. Stewénius. Scalable Recognition with a Vocabulary Tree. In Proc. CVPR, 2006.