Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering

Bo Yang∗ [email protected]
Xiao Fu∗ [email protected]
Nicholas D. Sidiropoulos∗ [email protected]
Mingyi Hong† [email protected]

∗Department of ECE, University of Minnesota
†Department of IMSE, Iowa State University

arXiv:1610.04794v1 [cs.LG] 15 Oct 2016

Abstract Most learning approaches treat dimensionality reduction (DR) and clustering separately (i.e., sequentially), but recent research has shown that optimizing the two tasks jointly can substantially improve the performance of both. The premise behind the latter genre is that the data samples are obtained via linear transformation of latent representations that are easy to cluster; but in practice, the transformation from the latent space to the data can be more complicated. In this work, we assume that this transformation is an unknown and possibly nonlinear function. To recover the ‘clustering-friendly’ latent representations and to better cluster the data, we propose a joint DR and K-means clustering approach in which DR is accomplished via learning a deep neural network (DNN). The motivation is to keep the advantages of jointly optimizing the two tasks, while exploiting the deep neural network’s ability to approximate any nonlinear function. This way, the proposed approach can work well for a broad class of generative models. Towards this end, we carefully design the DNN structure and the associated joint optimization criterion, and propose an effective and scalable algorithm to handle the formulated optimization problem. Experiments using five different real datasets are employed to showcase the effectiveness of the proposed approach.

1 Introduction

Clustering is one of the most fundamental tasks in data mining and machine learning, with an endless list of applications. It is also a notoriously hard task, whose outcome is affected by a number of factors – including data acquisition and representation, use of preprocessing such as dimensionality reduction (DR), the choice of clustering criterion and optimization algorithm, and initialization [2, 7]. K-means is arguably the bedrock clustering algorithm. Since its introduction in 1957 by Lloyd (published much later in 1982 [15]), K-means has been extensively used either alone or together with suitable preprocessing, due to its simplicity and effectiveness. K-means is suitable for clustering data samples that are evenly spread around some centroids (cf. the first subfigure in Fig. 1), but many real-life datasets do not exhibit this 'K-means-friendly' structure. Much effort has been spent on mapping high-dimensional data to a certain space that is suitable for performing K-means. Earlier approaches to achieving this goal include the classical DR algorithms, namely principal component analysis (PCA) and canonical correlation analysis (CCA). More recent DR tools like nonnegative matrix factorization (NMF) and sparse coding (dictionary learning) have also drawn a lot of attention. In addition to the above approaches that learn linear DR operators (e.g., a projection matrix), nonlinear DR techniques such as those used in spectral clustering [19] and sparse subspace clustering [28, 32] have also been considered.

In recent years, motivated by the success of deep neural networks (DNNs) in supervised learning, unsupervised deep learning approaches are now widely used for DR prior to clustering. For example, the stacked autoencoder (SAE) [27], deep CCA (DCCA) [1], and the sparse autoencoder [18] take insights from PCA, CCA, and sparse coding, respectively, and make use of DNNs to learn nonlinear mappings from the data domain to low-dimensional latent spaces. These approaches treat their DNNs as a preprocessing stage that is designed separately from the subsequent clustering stage. The hope is that the latent representations of the data learned by these DNNs will be naturally suitable for clustering. However, since no clustering-driven objective is explicitly incorporated in the learning process, the learned DNNs do not necessarily output reduced-dimension data that are suitable for clustering – as will be seen in our experiments.

In [30], Yang et al. considered performing DR and clustering jointly. The rationale behind [30] is that if there exists some latent space where the entities nicely fall into clusters, then it is natural to seek a DR transformation that reveals this structure, i.e., one which yields a low K-means clustering cost. This motivates using the K-means cost in latent space as a prior that helps choose the right DR, and pushes DR towards producing K-means-friendly representations. By performing joint DR and K-means clustering, substantially improved clustering results have been observed in [30]. The limitation of [30] (see also [6, 20]) is that the observable data is assumed to be generated from the latent clustering-friendly space via a simple linear transformation. While a linear transformation works well in many cases, there are other cases where the generative process is more complex, involving a nonlinear mapping.


Figure 1: The learned 2-D reduced-dimension data by different methods. The observable data lie in a 100-D space and are generated from 2-D data (cf. the first subfigure) through the nonlinear transformation in (4.9). The true cluster labels are indicated using different colors. Subfigures (left to right): Generated, SVD, NMF, LLE, MDS, LapEig, SAE, DCN w/o reconstruction, DCN (Proposed).

Contributions  In this work, we take a step further from [30]: We propose a joint DR and K-means clustering framework, where the DR part is implemented through learning a DNN, rather than a linear model. The rationale is that a DNN is capable of approximating any nonlinear continuous function using a reasonable number of parameters [8], and thus is expected to overcome the limitations of the work in [30]. Although implementing this idea is highly non-trivial (much more challenging than [30], where the DR part only needs to learn a linear model), our objective is well-motivated: by better modeling the data transformation process, a much more K-means-friendly latent space can be learned – as we will demonstrate. A sneak peek of the kind of performance that can be expected using our proposed method can be seen in Fig. 1, where we generate four clusters of 2-D data that are well separated in the 2-D Euclidean space and then transform them to a 100-D space using a complex nonlinear mapping [cf. (4.9)] that destroys the cluster structure. We apply several well-appreciated DR methods to the 100-D data and observe the recovered 2-D space (see Section 4.1 for an explanation of the different acronyms). One can see that the proposed algorithm outputs reduced-dimension data that are most suitable for applying K-means. Our specific contributions are as follows:

• Optimization Criterion Design: We propose an optimization criterion for joint DNN-based DR and K-means clustering. The criterion is a combination of three parts, namely, dimensionality reduction, data reconstruction, and cluster structure-promoting regularization, which takes insights from the linear generative model case as in [30] but is much more powerful in capturing nonlinear models. We deliberately include the reconstruction part and implement it using a decoding network, which is crucial for avoiding trivial solutions. The criterion is also flexible – it can be extended to incorporate different DNN structures and clustering criteria, e.g., subspace clustering.

• Effective and Scalable Optimization Procedure: The formulated optimization problem is very challenging to handle, since it involves layers of nonlinear activation functions and integer constraints that are induced by the K-means part. We propose a judiciously designed solution package, including an empirically effective initialization and a novel alternating stochastic gradient algorithm. The algorithmic structure is simple, enables online implementation, and is very scalable.

• Comprehensive Experiments and Validation: We provide a set of synthetic-data experiments and validate the method on five different real datasets including various document and image corpora. Clearly visible improvement over the respective state of the art is observed for all the datasets that we experimented with.

• Reproducibility: We open-source our codes at https://github.com/boyangumn/DCN.

2 Background and Problem Formulation

2.1 K-means Clustering and DR  Given a set of data vectors {x_i}, i = 1, ..., N, where x_i ∈ R^M, the task of clustering is to group the N data vectors into K categories. K-means approaches this task by optimizing the following cost function:

\[
\min_{M \in \mathbb{R}^{M\times K},\ \{s_i \in \mathbb{R}^{K}\}}\ \sum_{i=1}^{N}\, \| x_i - M s_i \|_2^2
\quad \text{s.t.}\ s_{j,i}\in\{0,1\},\ 1^{\top}s_i = 1\ \ \forall i,j,
\tag{2.1}
\]

where s_i is the assignment vector of data point i, which has only one non-zero element, s_{j,i} denotes the jth element of s_i, and the kth column of M, i.e., m_k, denotes the centroid of the kth cluster.

K-means works well when the data points are evenly scattered around their centroids in the Euclidean space. For example, when the data samples exhibit a geometric distribution as depicted in the first subfigure of Fig. 1, K-means can easily distinguish the four clusters. We therefore consider datasets with a similar structure as being 'K-means-friendly'. However, high-dimensional data are in general not very K-means-friendly.

In practice, using a DR pre-processing step, e.g., PCA or NMF, to reduce the dimension of x_i to a much lower-dimensional space and then applying K-means usually gives better results. The insight is simple: the data points may exhibit a better K-means-friendly structure in a certain latent feature space, and DR approaches such as PCA and NMF help find this space. In addition to the above classic DR methods that essentially learn a linear generative model from the latent space to the data domain, nonlinear DR approaches such as those used in spectral clustering and DNN-based DR are also widely used as pre-processing before K-means or other clustering algorithms [1, 4, 27].

2.2 Joint DR and Clustering  Instead of using DR as a pre-processing step, joint DR and clustering has also been considered in the literature [6, 20, 30]. This line of work can be summarized as follows. Consider the generative model where a data sample is generated by x_i = W h_i, where W ∈ R^{M×R} and h_i ∈ R^R, with R ≪ M. Assume that the data clusters are well-separated in the latent domain (i.e., where h_i lives) but distorted by the transformation introduced by W. Reference [30] formulated the joint optimization problem as follows:

\[
\min_{M,\{s_i\},W,H}\ \|X - W H\|_F^2 + \lambda\sum_{i=1}^{N}\|h_i - M s_i\|_2^2 + r_1(H) + r_2(W)
\quad\text{s.t.}\ s_{j,i}\in\{0,1\},\ 1^{\top}s_i = 1\ \ \forall i,j,
\tag{2.2}
\]

where X = [x_1, ..., x_N], H = [h_1, ..., h_N], and λ ≥ 0 is a parameter for balancing data fidelity and the latent cluster structure. In (2.2), the first term performs DR and the second term performs latent clustering. The terms r_1(·) and r_2(·) are regularizations (e.g., nonnegativity or sparsity) to prevent trivial solutions, e.g., H → 0 ∈ R^{R×N}; see details in [30].

The approach in (2.2) enhances the performance of data clustering substantially in many cases. However, the assumption that the clustering-friendly space can be obtained by a linear transformation of the observable data is restrictive, and in practice the relationship between the observable data and the latent representations can be highly nonlinear – as implied by the success of kernel methods and spectral clustering [19]. How can we transcend the ideas and insights of joint linear DR and clustering to deal with such complex (yet more realistic) cases? In this work, we address this question.

2.3 Proposed Formulation  Our idea is to model the relationship between the observable data x_i and its clustering-friendly latent representation h_i using a nonlinear mapping, i.e.,

\[
h_i = f(x_i;\mathcal{W}), \qquad f(\cdot\,;\mathcal{W}): \mathbb{R}^{M}\to\mathbb{R}^{R},
\]

where f(·; W) denotes the mapping function and W is a set of parameters that characterize the nonlinear mapping. Many nonlinear mapping functions can be considered, e.g., Gaussian and sinc kernels. In this work, we propose to use a DNN as our mapping function, since DNNs have the ability to approximate any continuous mapping using a reasonable number of parameters [8]. For the rest of this paper, f(·; W) is a DNN and W collects the network parameters, i.e., the weights of the links that connect the neurons between layers and the biases in each hidden layer.

Using the above notation, it is tempting to formulate joint DR and clustering as follows:

\[
\min_{\mathcal{W},M,\{s_i\}}\ \widehat{L} = \sum_{i=1}^{N}\|f(x_i;\mathcal{W}) - M s_i\|_2^2
\quad\text{s.t.}\ s_{j,i}\in\{0,1\},\ 1^{\top}s_i = 1\ \ \forall i,j.
\tag{2.3}
\]

The formulation in (2.3) seems intuitive at first glance. However, it is seriously ill-posed and could easily lead to trivial solutions. A globally optimal solution to Problem (2.3) is f(x_i; W) = 0, and the optimal objective value L̂ = 0 can always be achieved by simply letting M = 0. Another trivial solution is simply putting arbitrary samples into K clusters, which will lead to a small value of L̂ – but this could be far from being desired since there is no provision for respecting the data samples x_i; see the bottom-middle subfigure in Fig. 1 [Deep Clustering Network (DCN) w/o reconstruction]. In fact, the formulation proposed in a parallel unpublished work [29] is very similar to that in (2.3). The only differences are that the hard 0-1 type assignment constraint on s_i is relaxed to a probabilistic soft constraint and that the fitting criterion there is based on the Kullback-Leibler (KL) divergence instead of least squares – unfortunately, such a formulation still has trivial solutions, just as discussed above.

One of the most important components for preventing trivial solutions in the linear DR case lies in the reconstruction part, i.e., the term ‖X − WH‖_F^2 in (2.2). This term ensures that the learned h_i's can (approximately) reconstruct the x_i's using the basis W. This motivates incorporating a reconstruction term into the joint DNN-based DR and K-means formulation. In the realm of unsupervised DNNs, there are several well-developed approaches for reconstruction – e.g., the stacked autoencoder (SAE) is a popular choice for serving this purpose. To prevent trivial low-dimensional representations such as all-zero vectors, the SAE uses a decoding network g(·; Z) to map the h_i's back to the data domain and requires that g(h_i; Z) and x_i match each other well under some metric, e.g., mutual information or least-squares-based measures.







By the above reasoning, we come up with the following cost function:

\[
\min_{\mathcal{W},\mathcal{Z},M,\{s_i\}}\ \sum_{i=1}^{N}\Big(\ell\big(g(f(x_i)),x_i\big) + \frac{\lambda}{2}\,\|f(x_i) - M s_i\|_2^2\Big)
\quad\text{s.t.}\ s_{j,i}\in\{0,1\},\ 1^{\top}s_i = 1\ \ \forall i,j,
\tag{2.4}
\]

where we have simplified the notation f(x_i; W) and g(h_i; Z) to f(x_i) and g(h_i), respectively, for conciseness. The function ℓ(·): R^M → R is a certain loss function that measures the reconstruction error. In this work, we adopt the least-squares loss ℓ(x, y) = ‖x − y‖_2^2; other choices such as ℓ1-norm based fitting and the KL divergence can also be considered. λ ≥ 0 is a regularization parameter which balances the reconstruction error versus finding K-means-friendly latent representations.

2.4 Network Structure  Fig. 2 presents the network structure corresponding to the formulation in (2.4). On the left-hand side of the 'bottleneck' layer are the so-called encoding or forward layers that transform raw data to a low-dimensional space. On the right-hand side are the 'decoding' layers that try to reconstruct the data from the latent space. The K-means task is performed at the bottleneck layer. The forward network, the decoding network, and the K-means cost are optimized simultaneously. In our experiments, the structure of the decoding network is a 'mirrored version' of the encoding network, and for both the encoding and decoding networks, we use rectified linear unit (ReLU) activation-based neurons [16]. Since our objective is to perform DNN-driven K-means clustering, we will refer to the network in Fig. 2 as the Deep Clustering Network (DCN) in the sequel. From a network-structure point of view, DCN can be considered as an SAE with a K-means-friendly structure-promoting function at the bottleneck layer.

We should remark that the proposed optimization criterion in (2.4) and the network in Fig. 2 are very flexible: Other types of neuron activation, e.g., the sigmoid and binary step functions, can be used. For the clustering part, other clustering criteria, e.g., K-subspace and soft K-means [2, 12], are also viable options. Nevertheless, we will concentrate on the proposed DCN in the sequel, as our interest is to provide a proof-of-concept rather than exhausting the possibilities of combinations.
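To make the objective in (2.4) concrete, the following is a minimal NumPy sketch of evaluating the per-sample cost ℓ(g(f(x_i)), x_i) + (λ/2)‖f(x_i) − M s_i‖_2^2 with a toy single-hidden-layer ReLU encoder and a mirrored decoder. The dimensions, parameter names, and random initialization are illustrative assumptions only; the paper's actual implementation is written in Theano.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration): 100-D data, 10-D bottleneck, K = 4 clusters.
M_dim, R_dim, K = 100, 10, 4
lam = 0.1  # regularization parameter lambda in (2.4)

# Encoder f(.; W) and mirrored decoder g(.; Z): one ReLU layer each.
W1, b1 = 0.1 * rng.standard_normal((R_dim, M_dim)), np.zeros(R_dim)
Z1, c1 = 0.1 * rng.standard_normal((M_dim, R_dim)), np.zeros(M_dim)

def f(x):
    """Forward (encoding) network: maps R^M -> R^R with a ReLU activation."""
    return np.maximum(W1 @ x + b1, 0.0)

def g(h):
    """Decoding network: maps R^R back to the data domain R^M."""
    return Z1 @ h + c1

def dcn_sample_loss(x, M_centroids, s):
    """Per-sample DCN cost: least-squares reconstruction + (lambda/2) * K-means term, as in (2.4)."""
    h = f(x)
    recon = np.sum((g(h) - x) ** 2)              # l(g(f(x)), x)
    kmeans = np.sum((h - M_centroids @ s) ** 2)  # ||f(x) - M s||_2^2
    return recon + 0.5 * lam * kmeans

# Example usage with a random sample, random centroids, and a one-hot assignment vector.
x = rng.standard_normal(M_dim)
M_centroids = rng.standard_normal((R_dim, K))
s = np.eye(K)[0]  # assign the sample to cluster 0
print(dcn_sample_loss(x, M_centroids, s))
```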

3 Optimization Procedure

Optimizing (2.4) is highly non-trivial since both the cost function and the constraints are non-convex. In addition, there are scalability issues that need to be taken into account. In this section, we propose a pragmatic optimization procedure including an empirically effective initialization method and an alternating optimization based algorithm for handling (2.4).

Figure 2: Structure of the proposed deep clustering network (DCN).

3.1 Initialization via Layer-wise Pre-Training  For dealing with hard non-convex optimization problems like that in (2.4), initialization is usually crucial. To initialize the parameters of the network, i.e., (W, Z), we use the layer-wise pre-training method in [3] for training autoencoders. This pre-training technique may be avoided in large-scale supervised learning tasks. For the proposed DCN, which is completely unsupervised, however, we find that the layer-wise pre-training procedure is important regardless of the size of the dataset. The layer-wise pre-training procedure is as follows: we start by using a single-layer autoencoder to train the first layer of the forward network and the last layer of the decoding network. Then, we use the outputs of the trained forward layers as inputs to train the next pair of encoding and decoding layers, and so on. After pre-training, we perform K-means on the outputs of the bottleneck layer to obtain initial values of M and {s_i}.

3.2 Alternating Stochastic Optimization  Even with a good initialization, handling Problem (2.4) is still very challenging. The commonly used stochastic gradient descent (SGD) algorithm cannot be directly applied to jointly optimize W, Z, M and {s_i} because the block variable {s_i} is constrained on a discrete set. Our idea is to combine the insights of alternating optimization and SGD. Specifically, we propose to optimize the subproblem with respect to (w.r.t.) one of M, {s_i} and (W, Z) while keeping the other two sets of variables fixed.

For fixed (M, {s_i}), the subproblem w.r.t. (W, Z) is similar to training an SAE – but with an additional penalty term on the clustering performance. We can take advantage of the mature tools for training DNNs, e.g., back-propagation based SGD and its variants. To implement SGD for updating the network parameters, we look at the problem w.r.t. the incoming data sample x_i:

\[
\min_{\mathcal{W},\mathcal{Z}}\ L_i = \ell\big(g(f(x_i)),x_i\big) + \frac{\lambda}{2}\,\|f(x_i)-M s_i\|_2^2.
\tag{3.5}
\]

The gradient of the above function over the network parameters is easily computable, i.e.,

\[
\nabla_{\mathcal{X}} L_i = \frac{\partial\,\ell\big(g(f(x_i)),x_i\big)}{\partial \mathcal{X}} + \lambda\,\frac{\partial f(x_i)}{\partial \mathcal{X}}\big(f(x_i)-M s_i\big),
\]

where X = (W, Z) is a collection of the network parameters, and the gradients ∂ℓ/∂X and ∂f(x_i)/∂X can be calculated by back-propagation [22] (strictly speaking, what we calculate here is a subgradient w.r.t. X, since the ReLU function is non-differentiable at zero). Then, the network parameters are updated by

\[
\mathcal{X} \leftarrow \mathcal{X} - \alpha\,\nabla_{\mathcal{X}} L_i,
\tag{3.6}
\]

where α > 0 is a pre-defined learning rate.

For fixed network parameters and M, the assignment vector of the current sample, i.e., s_i, can be naturally updated in an online fashion. Specifically, we update s_i as follows:

\[
s_{j,i} \leftarrow
\begin{cases}
1, & \text{if}\ j = \arg\min_{k\in\{1,\dots,K\}} \|f(x_i) - m_k\|_2,\\
0, & \text{otherwise.}
\end{cases}
\tag{3.7}
\]

When fixing {s_i} and X, the update of M is simple and may be done in a variety of ways. For example, one can simply use m_k = (1/|C_k^i|) Σ_{j∈C_k^i} f(x_j), where C_k^i is the recorded index set of samples assigned to cluster k from the first sample up to the current sample i. Although the above update is intuitive, it could be problematic for online algorithms, since the historical data seen so far (i.e., x_1, ..., x_i) might not be representative enough to model the global cluster structure, and the initial s_i's might be far from correct. Therefore, simply averaging the currently assigned samples may cause numerical problems. Instead, we employ the idea in [24] to adaptively change the learning rate for updating m_1, ..., m_K. The intuition is simple: assume that the clusters are roughly balanced in terms of the number of data samples they contain. Then, after updating M for a number of samples, one should update the centroids of the clusters that already have many assigned members more gracefully, while updating the others more aggressively, to keep balance. To implement this, let c_k^i be the count of the number of times the algorithm has assigned a sample to cluster k before handling the incoming sample x_i, and update m_k by a simple gradient step:

\[
m_k \leftarrow m_k - \big(1/c_k^i\big)\,\big(m_k - f(x_i)\big)\,s_{k,i},
\tag{3.8}
\]

where the gradient step size 1/c_k^i controls the learning rate. The above update of M can also be viewed as an SGD step, thereby resulting in an overall alternating block SGD procedure that is summarized in Algorithm 1. Note that an epoch corresponds to a pass of all data samples through the network.

Algorithm 1 Alternating SGD
1: Initialization;
2: for t = 1 : T do  % perform T epochs over the data
3:   Update network parameters by (3.6);
4:   Update assignments by (3.7);
5:   Update centroids by (3.8);
6: end for

Algorithm 1 has many favorable properties. First, it can be implemented in a completely online fashion, and thus is very scalable. Second, many known tricks for enhancing the performance of DNN training can be directly used. In fact, we have used a mini-batch version of SGD and batch normalization [9] in our experiments, which indeed help improve performance. We should remark that the proposed algorithm is by no means 'optimal' and is not guaranteed to converge to a stationary point of Problem (2.4). Nevertheless, our extensive experiments indicate that the algorithm works fairly well in practice. In reminiscence of inexact alternating optimization [21], it may be possible to provide theory-backed learning-rate-choosing strategies that ensure monotonic decrease of the overall cost in (2.4) and even convergence of the solution sequence – but we leave this as a future direction.
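As a complement to Algorithm 1, here is a minimal NumPy sketch of the per-sample assignment update (3.7) and the count-based centroid update (3.8), assuming the latent representation f(x_i) has already been computed by the forward network; the network update (3.6) itself would be handled by any standard back-propagation/SGD implementation. The count is incremented before the step so that the step size is well defined even for an empty cluster, in the spirit of the web-scale K-means of [24].

```python
import numpy as np

def update_assignment(h, M_centroids):
    """Eq. (3.7): assign the current latent point h = f(x_i) to its nearest centroid."""
    K = M_centroids.shape[1]
    k_star = np.argmin(np.linalg.norm(M_centroids - h[:, None], axis=0))
    s = np.zeros(K)
    s[k_star] = 1.0
    return s, k_star

def update_centroid(M_centroids, counts, h, k_star):
    """Eq. (3.8): gradient step on the assigned centroid with adaptive step size 1/c_k."""
    counts[k_star] += 1                       # count incremented first (assumption; avoids 1/0)
    M_centroids[:, k_star] -= (M_centroids[:, k_star] - h) / counts[k_star]
    return M_centroids, counts

# Example usage on random latent points (illustrative stand-ins for f(x_i)).
rng = np.random.default_rng(0)
R_dim, K = 10, 4
M_centroids = rng.standard_normal((R_dim, K))
counts = np.zeros(K, dtype=int)
for _ in range(100):
    h = rng.standard_normal(R_dim)
    s, k_star = update_assignment(h, M_centroids)
    M_centroids, counts = update_centroid(M_centroids, counts, h, k_star)
```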

4 Experiments

In this section, we use synthetic and real-world data to showcase the effectiveness of DCN. We implement DCN using the deep learning toolbox Theano [26]. The experiments are conducted on a workstation equipped with an NVIDIA GTX 970 GPU.

4.1 Synthetic-Data Demonstration  Our settings are as follows: Assume that the data points have a K-means-friendly structure in a two-dimensional domain (cf. the first subfigure of Fig. 1). This two-dimensional domain is a latent domain which we do not observe, and we denote the latent representations of the data points as h_i's in this domain. What we observe is x_i ∈ R^100, obtained via the following transformation:

\[
x_i = \big(\sigma(W h_i)\big)^{2},
\tag{4.9}
\]

where W ∈ R^{100×2} is a matrix whose entries follow the zero-mean unit-variance i.i.d. Gaussian distribution, σ(·) is a sigmoid function that introduces nonlinearity, and (·)^2 is an element-wise squaring operator that further complicates the transformation. Under the above generative model, recovering the K-means-friendly domain where the h_i's live seems very challenging.

We generate four clusters, each of which has 2,500 samples, and their geometric distribution on the 2-D plane is shown in the first subfigure of Fig. 1 that we have seen before. The other subfigures show the recovered 2-D data from the x_i's using a number of DR methods, namely, NMF [13], local linear embedding (LLE) [23], multidimensional scaling (MDS) [10], and the Laplacian eigenmap (LapEig) [19] – which is the first step of spectral clustering. We also present the result of using the formulation in (2.3) (DCN w/o reconstruction), which is essentially the same idea as in [29]. As one can see in Fig. 1, all the DR methods except the proposed DCN fail to map the x_i's to a 2-D domain that is suitable for applying K-means. In particular, DCN w/o reconstruction indeed gives a trivial solution: the reduced-dimension data are separated into four clusters, and thus L̂ is small. But this solution is meaningless, since the data partitioning is arbitrary. In the supplementary materials, we present two additional experiments, where the nonlinear expansions from the 2-D space to the 100-D data space are x_i = tanh(σ(W h_i)) and x_i = σ(U σ(W h_i)), respectively, where U is another random matrix. From there, one can see that the proposed DCN still gives clear clusters in the learned 2-D domain in these two cases. This further illustrates the DCN's ability to recover clustering-friendly structure under different nonlinear generative models.
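For concreteness, below is a minimal NumPy sketch of the synthetic-data generation in (4.9): well-separated 2-D cluster points h_i are pushed through a random linear map, a sigmoid, and an element-wise square to obtain the 100-D observations x_i. The cluster centroid locations and spread are illustrative assumptions; the paper only specifies four clusters of 2,500 samples each and the transformation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_per_cluster, latent_dim, obs_dim = 4, 2500, 2, 100

# Well-separated 2-D clusters (centroid locations/spread are assumptions, not from the paper).
centers = np.array([[5.0, 5.0], [-5.0, 5.0], [-5.0, -5.0], [5.0, -5.0]])
H = np.vstack([c + rng.standard_normal((n_per_cluster, latent_dim)) for c in centers])
labels = np.repeat(np.arange(K), n_per_cluster)

# Nonlinear transformation in (4.9): x_i = (sigmoid(W h_i))^2, with W ~ i.i.d. N(0, 1).
W = rng.standard_normal((obs_dim, latent_dim))
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
X = sigmoid(H @ W.T) ** 2          # shape (10000, 100); rows are the observed x_i

print(X.shape, labels.shape)
```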

4.2 Real-Data Validation  In this section, we validate the proposed approach on several real datasets, all of which are publicly available.

4.2.1 Baseline methods  We compare the proposed DCN with a variety of baseline methods:
1) K-means (KM): the classic K-means [15].
2) Spectral Clustering (SC): the classic SC algorithm [19].
3) Sparse Subspace Clustering with Orthogonal Matching Pursuit (SSC-OMP) [32]: SSC is considered very competitive for clustering images; we use the newly proposed greedy version here for scalability.
4) Locally Consistent Concept Factorization (LCCF) [5]: LCCF is based on NMF with a graph Laplacian regularization and is considered state-of-the-art for document clustering.
5) XRAY [11]: XRAY is an NMF-based document clustering algorithm that scales very well.
6) NMF followed by K-means (NMF+KM): this approach applies NMF for DR, and then applies K-means to the reduced-dimension data.
7) Stacked Autoencoder followed by K-means (SAE+KM): this is also a two-stage approach; we use the SAE for DR first and then apply K-means.
8) Joint NMF and K-means (JNKM) [30]: JNKM performs joint DR and K-means clustering as the proposed DCN does – but the DR part is based on NMF.

Note that we do not use all the baselines for all the experiments, since some of the baselines are not very scalable and some of them are customized for particular applications. Instead, for each experiment, we select the baselines that are considered most competitive and suitable for that application from the above pool.

4.2.2 Evaluation metrics  We adopt standard metrics for evaluating clustering performance. Specifically, we employ the following three metrics: normalized mutual information (NMI) [5], adjusted Rand index (ARI) [31], and clustering accuracy (ACC) [5]. In a nutshell, all three metrics are commonly used in the clustering literature, and all have pros and cons; using them together suffices to demonstrate the effectiveness of the clustering algorithms. Note that NMI and ACC lie in the range of zero to one, with one being the perfect clustering result and zero the worst. ARI is a value within −1 to 1, with one being the best clustering performance and minus one the opposite.

4.2.3 RCV1  We first test the algorithms on a large-scale text corpus, namely, the Reuters Corpus Volume 1 Version 2 (RCV1-v2). The RCV1-v2 corpus [14] contains 804,414 documents, which were manually categorized into 103 different topics. We use a subset of the documents from the whole corpus. This subset contains 20 topics and 365,968 documents, and each document has a single topic label. As in [25], we pick the 2,000 most frequently used words (in the tf-idf form) as the features of the documents. We conduct experiments using different numbers of clusters. Towards this end, we first sort the clusters according to the number of documents that they contain in descending order, and then apply the algorithms to the first 4, 8, 12, 16, and 20 clusters, respectively. Note that the first several clusters have many more documents than the other clusters (cf. Fig. 3). This way, we gradually increase the number of documents in our experiments and create cases with much more unbalanced cluster sizes for testing the algorithms – which means we gradually increase the difficulty of the experiments. For the experiment with the first 4 clusters, we use a DCN whose forward network has five hidden layers with 2000, 1000, 1000, 1000, and 50 neurons, respectively. The reconstruction network has a mirrored structure. We set λ = 0.1 to balance the reconstruction error and the clustering regularization. Similar settings are used in the other cases; a more detailed description of the experiment setups can be found in the supplementary materials. Note that we increase both the width and the depth of the deep network when K and N increase, since for the cases with more clusters and samples, many more parameters need to be learned.
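As a side note, the three metrics described in Section 4.2.2 can be computed with off-the-shelf tools: NMI and ARI via scikit-learn, and ACC via a Hungarian matching between predicted cluster indices and true labels. The sketch below is ours (the helper name `clustering_acc` and the toy labels are illustrative, not from the paper's released code), and it assumes integer-coded labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """Clustering accuracy: best one-to-one match between cluster IDs and class labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_pred.max(), y_true.max()) + 1
    cost = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    row, col = linear_sum_assignment(-cost)   # maximize matched counts
    return cost[row, col].sum() / y_true.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]                   # same partition, permuted cluster IDs
print(normalized_mutual_info_score(y_true, y_pred),   # NMI
      adjusted_rand_score(y_true, y_pred),             # ARI
      clustering_acc(y_true, y_pred))                  # ACC = 1.0 here
```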

Figure 3: The sizes of the 20 clusters in the experiment (x-axis: index of clusters; y-axis: size of clusters, ×10^4).

Table 1: Evaluation on the RCV1-v2 dataset

                       DCN     SAE+KM   KM      XRAY
 4 Clust.    NMI       0.88    0.8      0.75    0.12
             ARI       0.92    0.86     0.66    -0.01
             ACC       0.95    0.92     0.72    0.34
 8 Clust.    NMI       0.69    0.66     0.53    0.24
             ARI       0.64    0.59     0.29    0.09
             ACC       0.73    0.66     0.48    0.39
12 Clust.    NMI       0.67    0.65     0.6     0.22
             ARI       0.52    0.51     0.35    0.05
             ACC       0.6     0.56     0.53    0.29
16 Clust.    NMI       0.63    0.62     0.54    0.23
             ARI       0.46    0.45     0.27    0.04
             ACC       0.54    0.49     0.44    0.29
20 Clust.    NMI       0.63    0.61     0.52    0.25
             ARI       0.42    0.43     0.18    0.04
             ACC       0.47    0.46     0.43    0.28

Table 1 shows the results given by the proposed DCN, SAE+KM, KM, and XRAY; the other baselines are not scalable enough to handle the RCV1-v2 dataset and thus are dropped. One can see that, for each case that we have tried, the proposed method gives a clear improvement relative to the other methods. In particular, the DCN approach outperforms the two-stage approach, i.e., SAE+KM, in almost all the cases and for all the evaluation metrics – this clearly demonstrates the advantage of using the joint optimization criterion. Fig. 4 shows how NMI, ARI, and ACC change when the proposed algorithm runs from epoch to epoch. One can see a clear ascending trend of every evaluation metric. This result shows that both the network structure and the optimization algorithm work towards a desired direction. In the future, it would be intriguing to derive (sufficient) conditions for guaranteeing such improvement using the proposed algorithm. Nevertheless, the empirical observation in Fig. 4 is already very interesting and encouraging.

4.2.4 20Newsgroup  The 20Newsgroup corpus is a collection of 18,846 text documents which are partitioned into 20 different newsgroups. Using this corpus, we can observe how the proposed method works with a relatively small number of samples. As in the previous experiment, we use the tf-idf representation of the documents and pick the 2,000 most frequently used words as the features.

Figure 4: Performance metrics (NMI, ARI, ACC) vs. epochs.

Since this dataset has a much smaller size relative to RCV1-v2, we include four additional baselines that could not be run on large datasets like RCV1-v2, namely, LCCF, NMF+KM, SC, and JNKM. Among them, both JNKM and LCCF are considered state-of-the-art for document clustering. In this experiment, we use a DNN with three forward layers which have 250, 100, and 20 neurons, respectively. This is a relatively 'small' network, since the 20Newsgroup corpus may not have sufficient samples to fit a large network. As before, the decoding network for reconstruction has a mirrored structure of the encoding part, and the baseline SAE+KM uses the same network for the autoencoder part.

Table 2 summarizes the results of this experiment. As one can see, LCCF indeed gives the best performance among the algorithms that do not use DNNs. SAE+KM improves ARI and ACC quite substantially by involving a DNN – this suggests that the generative model may indeed be nonlinear. DCN performs even better by using the proposed joint DR and clustering criterion, which supports our motivation that a K-means regularization can help discover a clustering-friendly space. One interesting observation is that the results of SAE+KM presented here are in fact obtained from the layer-wise pre-trained network, since we observe that further optimization using the SAE worsens the clustering results. This suggests that, without a cluster structure-promoting term as in (2.4), the SAE does not tend to output K-means-friendly latent representations. Another observation is that all the performance indices in Table 2 are worse compared to those in Table 1. Of course, the two tables reflect very different experiments. On the other hand, this at least implies that the DNN-based methods may gain more from larger datasets – since learning the parameters of DNNs usually requires a large amount of data.

Table 2: Evaluation on the 20Newsgroup dataset.
Methods   DCN    SAE+KM   LCCF   NMF+KM   KM     SC     XRAY   JNKM
NMI       0.48   0.47     0.46   0.39     0.41   0.40   0.19   0.40
ARI       0.34   0.28     0.17   0.17     0.15   0.17   0.02   0.10
ACC       0.44   0.42     0.32   0.33     0.3    0.34   0.18   0.24

Table 3: Evaluation on the raw MNIST dataset.
Methods   DCN    SAE+KM   KM     SSC-OMP
NMI       0.63   0.53     0.50   0.31
ARI       0.44   0.39     0.37   0.13
ACC       0.58   0.55     0.53   0.30

4.2.5 Raw MNIST  In this and the next subsection, we present two experiments using two versions of the MNIST dataset. We first employ the raw MNIST dataset, which has 70,000 data samples. Each sample is a 28 × 28 gray-scale image containing a handwritten digit, i.e., one of {0, 1, . . . , 9}. We apply the algorithms to cluster all the 70,000 images into K = 10 clusters. For the proposed DCN method, we use a network with six forward hidden layers that have 2000, 1000, 500, 500, 250, and 50 neurons, respectively. The hyperparameter λ is set to 0.05. We use SSC-OMP, which yields the most favorable clustering results on MNIST (to the best of our knowledge), and KM as baselines for this experiment.

Table 3 shows the results of applying DCN, SAE+KM, KM and SSC-OMP to the raw MNIST data – the other baselines are not efficient enough to handle 70,000 samples and thus are left out. One can see very substantial improvement obtained by using the proposed method. The three performance metrics, i.e., NMI, ARI, and ACC, are improved by 19%, 19%, and 5%, respectively, over the best results given by the competitors.

4.2.6 Pre-Processed MNIST  Besides the above experiment using the raw MNIST data, we also provide another interesting experiment using pre-processed MNIST data. The pre-processing is done by a recently introduced technique, namely, the scattering network (ScatNet) [4]. ScatNet is a cascade of multiple layers of wavelet transforms, which is able to learn a good feature space for clustering / classification of images. Utilizing ScatNet, the work in [32] reported very promising clustering results on MNIST using SSC-OMP. Our objective here is to see if the proposed DCN can further improve the performance of SSC-OMP. Our idea is simple: SSC-OMP is essentially a procedure for constructing a similarity matrix of the data; after obtaining this matrix, it performs K-means on the rows of a matrix comprising several selected eigenvectors of the similarity matrix [19]. Therefore, it makes sense to treat the whole ScatNet + SSC-OMP procedure as pre-processing for performing K-means, and one can replace the classic K-means by DCN to improve performance.

Table 4: Evaluation on pre-processed MNIST.
Methods   DCN    SAE+KM   KM (SSC-OMP)
NMI       0.85   0.83     0.85
ARI       0.84   0.82     0.82
ACC       0.93   0.91     0.86

Table 5: Evaluation on the Pendigits dataset.
Methods   DCN    SAE+KM   SC     KM
NMI       0.69   0.65     0.67   0.67
ARI       0.56   0.53     0.55   0.55
ACC       0.72   0.70     0.71   0.69

The results are shown in Table 4. One can see that the proposed method exhibits the best performance among the algorithms. We note that the result of using KM on the data processed by ScatNet and SSC-OMP is worse than what was reported in [32]. This is possibly because we use all the 70,000 samples, while only a subset was selected for conducting the experiments in [32]. This experiment is particularly interesting since it suggests that, for any clustering algorithm that employs K-means as a key component, e.g., spectral clustering and sparse subspace clustering, one can use the proposed DCN to replace K-means and a better result can be expected. This is meaningful since many datasets are originally not suitable for K-means due to the nature of the data – but after pre-processing (e.g., kernelization and eigendecomposition), the pre-processed data is already more K-means-friendly, and using the proposed DCN at this point can further strengthen the result.

4.2.7 Pendigits  The Pendigits dataset consists of 10,992 data samples. Each sample records 8 coordinates on a tablet, on which a subject was instructed to write the digits from 0 to 9. So each sample corresponds to a vector of length 16, and represents one of the digits. Note that this dataset is quite different from MNIST – each digit in MNIST is represented by an image (pixel values), while digits in Pendigits are represented by the 8 coordinates of the stylus while a person was writing a certain digit. Since each digit is represented by a very small vector of length 16, we use a small network that has three forward layers with 16, 16, and 10 neurons, respectively. Table 5 shows the results: the proposed method gives the best clustering performance compared to the competing methods, and the methods using DNNs outperform the 'shallow' ones that do not use neural networks for DR.

5 Conclusion

In this work, we proposed a joint DR and K-means clustering approach where the DR part is accomplished via learning a deep neural network. Our goal is to automatically map high-dimensional data to a latent space where K-means is a suitable tool for clustering. We carefully designed the network structure to avoid trivial and meaningless solutions, and proposed an effective and scalable optimization procedure to handle the formulated challenging problem. Synthetic and real-data experiments showed that the algorithm is very effective on a variety of datasets.

References

[1] Galen Andrew, Raman Arora, Jeff A Bilmes, and Karen Livescu. Deep canonical correlation analysis. In Proc. ICML, pages 1247–1255, 2013.
[2] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. In Proc. SDM, 2004.
[3] Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise training of deep networks. NIPS, 19:153, 2007.
[4] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1872–1886, 2013.
[5] Deng Cai, Xiaofei He, and Jiawei Han. Locally consistent concept factorization for document clustering. IEEE Trans. Knowl. Data Eng., 23(6):902–913, 2011.
[6] Geert De Soete and J Douglas Carroll. K-means clustering in a low-dimensional Euclidean space. In New Approaches in Classification and Data Analysis, pages 212–219. Springer, 1994.
[7] Levent Ertöz, Michael Steinbach, and Vipin Kumar. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proc. SDM, pages 47–58, 2003.
[8] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[10] Joseph B Kruskal and Myron Wish. Multidimensional Scaling, volume 11. Sage, 1978.
[11] Abhishek Kumar, Vikas Sindhwani, and Prabhanjan Kambadur. Fast conical hull algorithms for near-separable non-negative matrix factorization. In Proc. ICML, pages 231–239, 2013.
[12] Martin HC Law, Alexander P Topchy, and Anil K Jain. Model-based clustering with probabilistic constraints. In Proc. SDM, pages 641–645. SIAM, 2005.
[13] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
[14] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. JMLR, 5(Apr):361–397, 2004.
[15] Stuart Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28(2):129–137, 1982.

[16] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, pages 807–814, 2010.
[17] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[18] Andrew Ng. Sparse autoencoder. CS294A Lecture Notes, 72:1–19, 2011.
[19] Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. NIPS, 2:849–856, 2002.
[20] Vishal M Patel, Hien Van Nguyen, and René Vidal. Latent space sparse subspace clustering. In Proc. CVPR, pages 225–232, 2013.
[21] Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
[22] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
[23] Lawrence K Saul and Sam T Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. JMLR, 4(Jun):119–155, 2003.
[24] David Sculley. Web-scale k-means clustering. In Proc. WWW, pages 1177–1178. ACM, 2010.
[25] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
[26] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
[27] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11(Dec):3371–3408, 2010.
[28] Sholom M Weiss. Subspace clustering of high dimensional data. In Proc. SDM, 2004.
[29] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. arXiv preprint arXiv:1511.06335, 2015.
[30] Bo Yang, Xiao Fu, and Nicholas D Sidiropoulos. Learning from hidden traits: Joint factor analysis and latent clustering. IEEE Trans. Signal Process., to appear, 2016.
[31] Ka Yee Yeung and Walter L Ruzzo. Details of the adjusted Rand index and clustering algorithms, supplement to the paper "An empirical study on principal component analysis for clustering gene expression data". Bioinformatics, 17(9):763–774, 2001.
[32] Chong You, D Robinson, and René Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In Proc. CVPR, volume 1, 2016.

Supplementary materials of "Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering"


1 Additional synthetic data experiments

In this section, we provide two more examples to illustrate the ability of DCN to recover K-means-friendly spaces under different generative models. We first consider the following transformation:

\[
x_i = \tanh\big(\sigma(W h_i)\big),
\tag{1.10}
\]

where σ(·) is the sigmoid function as before and we use a tanh function to replace the squaring operator used in the main text. The corresponding results can be seen in Fig. 1 of this supplementary document. One can see that a similar pattern to what we observed in the main text is also present here: the proposed DCN recovers a 2-D K-means-friendly space very well and the other methods all fail.

In Fig. 2, we test the algorithms under the generative model

\[
x_i = \sigma\big(U\,\sigma(W h_i)\big),
\tag{1.11}
\]

where W ∈ R^{10×2} and U ∈ R^{100×10}. Note that there are two layers of sigmoid functions in (1.11), and such a transformation is (intuitively) harder to deal with than before. Nevertheless, the proposed DCN still gives very clear clusters in the recovered 2-D space, and outputs a clustering accuracy greater than 0.99. The results in this section and the synthetic-data experiment presented in the main text are encouraging: under a variety of complicated nonlinear generative models, DCN can still output clustering-friendly latent representations.

Figure 1: The generated latent representations {h_i} in the 2-D space and the recovered 2-D representations from x_i ∈ R^100, where x_i = tanh(σ(W h_i)). Subfigures (left to right): Generated, SVD, NMF, LLE, MDS, LapEig, SAE, DCN w/o reconstruction, DCN (Proposed).

2 Detailed Settings of Real-Data Experiments

2.1 Algorithm Parameters  There is a set of parameters in the proposed algorithm which need to be pre-defined: specifically, the learning rate α, the number of epochs T (recall that one epoch corresponds to a pass of all the data samples through the network), and the balancing regularization parameter λ. These parameters vary from case to case since they are related to a number of factors, e.g., the dimension of the data samples, the total number of samples, and the scale (or energy) of the samples. In practice, a reasonable way to tune these parameters is by observing the performance of the algorithm under various parameters on a small validation subset whose labels are known.

Note that the proposed algorithm has two stages, i.e., pre-training and the main algorithm, and they usually use two different sets of parameters, since the algorithmic structures of the two stages are quite different (to be more precise, the pre-training stage does not work with the whole network but only deals with a pair of encoding-decoding layers greedily). Therefore, we distinguish the parameters of the two stages as listed in Table 1, to better describe the settings.

We implement SGD for solving the subproblem w.r.t. X using the Nesterov-type acceleration [17], the mini-batch version, and the momentum method. Batch normalization [9], which has recently been shown to be very effective for training supervised deep networks, is also employed. Throughout the experiments, the momentum parameter is set to 0.9, the mini-batch size is selected to be ≈ 0.01 × N, and the other parameters are adjusted accordingly in each experiment – as described in detail in the next section.

2.2 Network Parameters  The considered network has two parts, namely, the forward encoding network that reduces the dimensionality of the data and the decoding network that reconstructs the data. We let the two networks have a mirrored structure of each other. There are also two parameters of a forward network, i.e., the width of each layer (number of neurons) and the depth of the network (number of layers). There is no strict rule for setting up these two parameters, but the rule of thumb is to adjust them according to the amount of data and the dimension of each sample. Using a deeper and wider network may be able to better capture the underlying nonlinear transformation of the data, as the network has more degrees of freedom. However, fitting a large number of parameters accurately requires a large amount of data, since the procedure can essentially be considered as solving a large system of nonlinear equations – and finding more unknowns needs more equations in the system, or, in this case, more data samples. Therefore, there is a clear trade-off between network depth/width and the overall performance. In the experiments, we adjust the width and depth of the networks according to the sizes of the datasets.


Figure 2: The generated latent representations {h_i} in the 2-D space and the recovered 2-D representations from x_i ∈ R^100, where x_i = σ(U σ(W h_i)). Subfigures (left to right): Generated, SVD, NMF, LLE, MDS, LapEig, SAE, DCN w/o reconstruction, DCN (Proposed).

Table 1: List of parameters used in DCN.
Notation    Meaning
λ           regularization parameter
αp          pre-training stepsize
αl          learning stepsize
Tp          pre-training epochs
Tl          learning epochs

2.3 Parameter Settings  The detailed parameter settings for the experiments on RCV1-v2 are shown in Tables 2 and 3. Parameter settings for 20Newsgroup, raw MNIST, pre-processed MNIST, and Pendigits are shown in Tables 4, 5, 6, and 7, respectively.

3 More Discussions

We have the following several further points of discussion:

1. For the two-stage method, i.e., SAE+KM, we always use the same network structure, except that the cluster structure-promoting term used in DCN is dropped. We should remark, again, that in a lot of cases we have observed that running the SAE for more epochs may even worsen the clustering performance in the two-stage approach. In Fig. 3, we show how the clustering performance indices change with the epochs when we run the SAE without the K-means regularization. One can see that the performance in fact becomes worse compared to merely using pre-training (i.e., initialization). This means that using the SAE does not necessarily help clustering – and this supports our motivation for adding a K-means-friendly structure-enhancing regularization.

2. To alleviate the effect brought by the intrinsic randomness of the algorithms, e.g., the random initialization of pre-training, the reported results are all obtained by running the experiments several times and taking the average (specifically, we run the experiments on the smaller datasets, i.e., 20Newsgroup, raw and pre-processed MNIST, and Pendigits, ten times, while the results for the much larger dataset RCV1-v2 are from a single run). Therefore, the presented results reflect the performance of the algorithms in an average sense.

3. We treat this work as a proof-of-concept: joint DNN learning and clustering is a highly viable task according to our design and experiments. In the future, many practical issues will be investigated – e.g., designing theory-backed ways of setting up network and algorithm parameters. Another very intriguing direction is of course to design convergence-guaranteed algorithms for optimizing the proposed criterion and its variants. We leave these interesting considerations for future work.

Figure 3: Clustering performance degrades when training with only the reconstruction error term (y-axis: NMI, ARI, ACC; x-axis: epochs). This is in sharp contrast with Figure 4 in the paper, where clustering performance improves when training the proposed DCN model.

Table 2: Parameter settings for RCV1-v2 – 4 Clust. and 8 Clust.
parameters                 description
f(x_i; W): R^M → R^R       M = 2,000 and R = 50
Sample size N              178,603 or 267,466
forward net. depth         5 layers
layer width                2000/1000/1000/1000/50
λ                          0.1
αp                         0.01
αl                         0.05
Tp                         50
Tl                         50

Table 3: Parameter settings for RCV1-v2 – 12 Clust., 16 Clust. and 20 Clust.
parameters                 description
f(x_i; W): R^M → R^R       M = 2,000 and R = 50
Sample size N              312,694 or 342,407 or 365,968
forward net. depth         7 layers
layer width                2000/1000/1000/1000/500/500/50
λ                          0.1
αp                         0.01
αl                         0.05
Tp                         50
Tl                         50

Table 4: Parameter settings for 20Newsgroup
parameters                 description
f(x_i; W): R^M → R^R       M = 2,000 and R = 20
Sample size N              18,846
forward net. depth         3 layers
layer width                250/100/20
λ                          10
αp                         0.01
αl                         0.001
Tp                         10
Tl                         50

Table 5: Parameter settings for raw MNIST
parameters                 description
f(x_i; W): R^M → R^R       M = 784 and R = 50
Sample size N              70,000
forward net. depth         6 layers
layer width                2,000/1,000/500/500/250/50
λ                          0.05
αp                         0.01
αl                         0.05
Tp                         50
Tl                         50

Table 6: Parameter settings for Pre-Processed MNIST
parameters                 description
f(x_i; W): R^M → R^R       M = 10 and R = 5
Sample size N              70,000
forward net. depth         3 layers
layer width                50/20/5
λ                          0.1
αp                         0.01
αl                         0.01
Tp                         10
Tl                         50

Table 7: Parameter settings for Pendigits
parameters                 description
f(x_i; W): R^M → R^R       M = 16 and R = 10
Sample size N              10,992
forward net. depth         3 layers
layer width                50/16/10
λ                          0.5
αp                         0.01
αl                         0.01
Tp                         50
Tl                         50
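To make the 'layer width' rows concrete, the sketch below builds the mirrored encoder/decoder layer dimensions for the raw-MNIST setting of Table 5 (input M = 784, forward widths 2,000/1,000/500/500/250/50). Treating each entry as a fully connected ReLU layer is our assumption for illustration; only the widths and the mirrored structure are taken from the paper.

```python
# Mirrored encoder/decoder dimensions for the raw-MNIST DCN of Table 5.
input_dim = 784                                   # M for raw MNIST
forward_widths = [2000, 1000, 500, 500, 250, 50]  # bottleneck R = 50

encoder_dims = list(zip([input_dim] + forward_widths[:-1], forward_widths))
decoder_dims = [(o, i) for (i, o) in reversed(encoder_dims)]  # mirrored structure

print(encoder_dims)  # [(784, 2000), (2000, 1000), ..., (250, 50)]
print(decoder_dims)  # [(50, 250), (250, 500), ..., (2000, 784)]
```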
