Initialization of Nonnegative Matrix Factorization with Vertices of Convex Polytope

Rafal Zdunek

Institute of Telecommunications, Teleinformatics and Acoustics, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
[email protected]

Abstract. Nonnegative Matrix Factorization (NMF) is an emerging unsupervised learning technique that has already found many applications in machine learning and in multivariate nonnegative data processing. NMF problems are usually solved with an alternating minimization of a given cost function, which leads to non-convex optimization. For this approach, the initialization of the factors to be estimated plays an essential role, not only for a fast convergence rate but also for the selection of desirable local minima. If the observations follow the exact factorization model (consistent data), NMF can be obtained easily by finding the vertices of the convex polytope determined by the observed data projected onto the probability simplex. For an inconsistent case, this model can be relaxed by approximating the mean locations of the vertices. In this paper, we discuss these issues and propose an initialization algorithm based on an analysis of the geometrical structure of the observed data. The approach is demonstrated to be robust, even for moderately noisy data.

1 Introduction

Since the alternating minimization procedure in NMF is non-convex, the initialization of the factors to be estimated plays a predominant role. There are several strategies for initializing the factors in the standard NMF model [1]. A typical approach initializes both factors with uniformly distributed random numbers [1, 2]. However, this strategy requires many iterations to converge, especially when the estimated factors are very sparse. To avoid convergence to unfavorable local minima, multi-start random initialization [3] can be applied. In this technique, the estimated factors are initialized several times with random initializers, and the initializer that ensures the steepest descent of the objective function after a fixed number of alternating steps is selected; see the sketch below. This strategy, combined with the multilayer technique [4], significantly improves the performance if the observed data is sparse and weakly redundant. Another approach involves the centroid decomposition or spherical k-means [5, 6]. Unfortunately, this preprocessing is computationally expensive, is itself non-convex, and may not guarantee the right initializer.
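A minimal sketch of this multi-start heuristic, assuming multiplicative Euclidean (Lee-Seung) updates as the alternating step; the function name and parameter values are illustrative, not taken from [3]:

```python
import numpy as np

def multistart_nmf_init(Y, J, n_starts=30, n_steps=10, seed=0):
    """Multi-start heuristic: keep the random (A, X) pair that gives the
    lowest Euclidean cost after a few alternating multiplicative updates."""
    rng = np.random.default_rng(seed)
    best_pair, best_cost = None, np.inf
    for _ in range(n_starts):
        A = rng.uniform(size=(Y.shape[0], J))
        X = rng.uniform(size=(J, Y.shape[1]))
        for _ in range(n_steps):
            X *= (A.T @ Y) / (A.T @ A @ X + 1e-12)   # Lee-Seung updates
            A *= (Y @ X.T) / (A @ X @ X.T + 1e-12)
        cost = np.linalg.norm(Y - A @ X, "fro")
        if cost < best_cost:
            best_pair, best_cost = (A, X), cost
    return best_pair
```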


Langville et al. proposed in [2] four strategies for the initialization of NMF. One of them is the SVD-centroid initialization, that is, the centroid decomposition applied to the right singular vectors of the observation matrix. This approach is computationally attractive since the space of right singular vectors is considerably smaller than the observation space; nevertheless, it involves computing the SVD of a large matrix of observations. Another approach averages randomly selected data columns. However, this strategy usually performs only slightly better than random initialization. An improved version randomly selects the data columns of the longest length, which usually means selecting the densest columns. The last approach constructs the co-occurrence matrix, which is computationally very expensive. An SVD-based initialization has also been considered by Boutsidis and Gallopoulos in [7]. This strategy initializes both factors in NMF with the positive parts of the rank-1 matrices obtained from the leading left and right singular vectors of the observation matrix. To better model sparse part-based image representations with NMF, Kim and Choi [8] discussed an initialization based on hierarchical clustering of the observed data with a similarity measure reflecting "closeness to rank-one". Unfortunately, when the observation matrix is large, the computational complexity of this approach is considerable due to its iterative cluster-merging character. A clustering-based initialization has also been proposed in [9]. This is an iterative strategy based on k-means clustering, in which the similarity measure between samples and their centroids is determined by the generalized Kullback-Leibler (KL) divergence. Geometrically, this approach is equivalent to finding centroid vectors that are collinear with the extreme rays of the convex cone spanned by the observation vectors [10]. In a noise-free case, the extreme rays may determine the vertices of the convex polytope created from the data points. As with the centroid decomposition, the computational complexity of this approach is rather high. Our approach is related to the latter, but instead of using computationally expensive k-means clustering, we attempt to find the vertices of the convex polytope by searching for the observation vectors that maximize its volume. In a noise-free case, this approach recovers the exact NMF model. In an inconsistent case, the locations of the vertices are approximated by averaging a few observation vectors in their neighborhood.

The paper is organized as follows: Section 2 discusses the geometry of NMF. The initialization algorithm is presented in Section 3. The experiments are described in Section 4. Finally, the conclusions are drawn in Section 5.

2 Geometrical Interpretation

The aim of NMF is to find lower-rank nonnegative matrices $A = [a_{ij}] \in \mathbb{R}_+^{I \times J}$ and $X = [x_{jt}] \in \mathbb{R}_+^{J \times T}$ such that $Y = [y_{it}] \cong AX \in \mathbb{R}_+^{I \times T}$, given the data matrix $Y$, the lower rank $J$, and possibly some prior knowledge on the matrices $A$ or $X$.


The orthant of nonnegative real numbers is denoted by $\mathbb{R}_+$. Typically we have high redundancy, i.e., $T \gg I \geq J$. The exact nonnegative factorization $Y = AX$ means that each column vector in $Y$ is a convex combination of the column vectors in $A$. The vectors $\{a_1, \ldots, a_J\}$ form a simplicial cone [10] in $\mathbb{R}^I$ that lies inside the nonnegative orthant $\mathbb{R}_+^I$.

Definition 1. The $(I-1)$-dimensional probability simplex $\mathcal{S}_I = \{y = [y_i] \in \mathbb{R}_+^I : y_i \geq 0, \; \mathbf{1}_I^T y = 1\}$ contains all the points of $\mathbb{R}_+^I$ located on the hyperplane $\Pi : \|y\|_1 = 1$. Its vertices are determined by the versors (unit vectors) of the Cartesian coordinate system.

Definition 2. The matrix $X = [x_1, \ldots, x_T] \in \mathbb{R}_+^{J \times T}$ is sufficiently sparse if there exists a square diagonal full-rank submatrix $\tilde{X} \in \mathbb{R}_+^{J \times J}$ created from a subset of its column vectors.

The projection of the nonzero columns in $Y$ onto $\mathcal{S}_I$ can be expressed as

$$P_{\mathcal{S}_I}(Y) = \bar{Y} = \left[ \frac{y_1}{\|y_1\|_1}, \ldots, \frac{y_T}{\|y_T\|_1} \right]. \qquad (1)$$
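Equation (1) amounts to an L1 normalization of each nonzero column; a short NumPy sketch (the function name is illustrative):

```python
import numpy as np

def project_onto_simplex(Y):
    """Equation (1): scale each nonzero column of a nonnegative matrix
    to unit L1 norm, i.e., project the data onto the probability simplex."""
    Y = np.asarray(Y, dtype=float)
    norms = Y.sum(axis=0)               # L1 norms (entries are nonnegative)
    return Y[:, norms > 0] / norms[norms > 0]
```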

The projected columns on $\mathcal{S}_I$ form the convex polytope $C(Y)$ [11]. If the matrix $X$ is sufficiently sparse (see Definition 2), the vertices of $C(Y)$ correspond to the column vectors of $A$ projected onto $\mathcal{S}_I$. Any column vector $\bar{y}_t$ whose corresponding vector $x_t$ contains at most 2 positive entries lies on the boundary of the convex polytope $C(Y)$.

Example 1. Assuming $I = J = 3$ and $T = 1000$, we generated $A \in \mathbb{R}_+^{3 \times 3}$ from a uniform distribution ($\mathrm{cond}(A) \cong 5.2$) and $X \in \mathbb{R}^{3 \times 1000}$ from a normal distribution $\mathcal{N}(0,1)$, replacing the negative entries with zeros. Thus $\mathrm{sparsity}(X) \cong 50\%$. The column vectors of $Y = AX$ plotted in $\mathbb{R}^I$ are shown in Fig. 1(a) as the blue points. The red squares indicate the directions of the column vectors in $A$. Note that all the blue points are contained inside the simplicial cone determined by the column vectors of $A$. Fig. 1(b) shows the observation points (blue points) projected onto the 2D probability simplex (the equilateral triangle marked with the black lines). The red squares denote the projected columns of $A$. Note that all the observation points are contained inside the convex polytope (triangle) $C(Y)$. Figs. 1(c) and (d) refer to the noisy cases, where the observation data is corrupted with zero-mean Gaussian noise with the variance adjusted to (c) $SNR = 30$ dB and (d) $SNR = 20$ dB. Note that even for a very weak noise ($SNR = 30$ dB), the smallest convex polytope (the smallest triangle) that contains all the observation points differs considerably from the one marked by the red squares. For a moderately noisy case ($SNR = 20$ dB), the locations of the columns of $A$ can be estimated with a statistical approach, e.g., by searching for the highest density of observation points (provided that the observed data is very sparse). The negative entries of the noisy observations were replaced with zeros; hence many columns of $\bar{Y}$ lie on the boundary of the probability simplex $\mathcal{S}_I$.
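The data of Example 1 can be reproduced along the following lines (a sketch; the seed is illustrative, so the exact condition number will differ):

```python
import numpy as np

rng = np.random.default_rng(42)          # illustrative seed
I, J, T = 3, 3, 1000
A = rng.uniform(size=(I, J))             # uniform mixing matrix
X = np.maximum(rng.standard_normal((J, T)), 0)   # rectified N(0,1): ~50% zeros
Y = A @ X                                # noise-free observations

snr_db = 20                              # corrupt at a target SNR
N = rng.standard_normal(Y.shape)
N *= np.linalg.norm(Y) / (np.linalg.norm(N) * 10 ** (snr_db / 20))
Y_noisy = np.maximum(Y + N, 0)           # clip negatives, as in the example
```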


Fig. 1. Geometric visualization of the column vectors (blue points) of $Y$ and the column vectors of $A$ (red squares) for $I = J = 3$, $T = 1000$: (a) noise-free observation points in $\mathbb{R}^3$, (b) noise-free observation points projected onto the $(I-1)$-dimensional probability simplex, (c) noisy observation points with $SNR = 30$ dB projected onto the probability simplex, (d) noisy observation points with $SNR = 20$ dB projected onto the probability simplex.

Remark 1. From Example 1 we may conclude that if the underlying matrix $X$ is sufficiently sparse and the factorization model $Y = AX$ is exact, the columns of the matrix $A$ can be readily estimated by finding the vertices of the convex polytope $C(Y)$, that is, the columns in $\bar{Y}$ that correspond to the vertices. For moderately noisy data, the exact locations of the columns in $A$ cannot be found, but they can be roughly estimated from the mean locations of cluster centroids of the observations. These estimates may also serve as the initial vectors of $A$.

3 Initialization Algorithm

Corollary 1. Given the exact factorization model $Y = AX$ with $X$ sufficiently sparse, the vertices of the convex polytope $C(Y)$ are determined by those column vectors of $\bar{Y}$ which span a polytope of maximal volume [12]. Let $\bar{Y}^{(J)} = [\bar{y}_1^{(J)}, \ldots, \bar{y}_J^{(J)}] \in \mathbb{R}_+^{I \times J}$ be a submatrix created from the columns of $\bar{Y}$; then

$$[a_1, \ldots, a_J] = \arg\max_{\bar{Y}^{(J)}} V(\bar{Y}^{(J)}) = \arg\max_{\bar{Y}^{(J)}} \det\left( (\bar{Y}^{(J)})^T \bar{Y}^{(J)} \right), \qquad (2)$$

where $V(\bar{Y}^{(J)})$ is the volume of the polytope spanned by the vectors in $\bar{Y}^{(J)}$.
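The objective in (2) is the Gram determinant of the candidate vertex set; a one-function sketch of this criterion (the function name is illustrative):

```python
import numpy as np

def gram_volume(V):
    """Criterion of Eq. (2): det(V^T V), proportional to the squared
    volume of the parallelotope spanned by the columns of V."""
    return np.linalg.det(V.T @ V)
```

Maximizing this quantity over all $J$-element subsets of columns is combinatorial, which is why the algorithm below builds the vertex set greedily, one column at a time.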

Following Remark 1 and Corollary 1, the vectors $\{a_1, \ldots, a_J\}$ from (2) can be used to initialize the matrix $A$ for NMF.

For the exact factorization model (noise-free data), the problem (2) can be solved with a recursive algorithm. In the first step, we attempt to find the vector $\bar{y}_t$ from $\bar{Y}$ that is located at the furthest distance from a random vector $z \sim U[0,1] \in \mathbb{R}_+^I$. Such a vector determines one of the vertices to be estimated. In the next step, another vector $\bar{y}_s$ from $\bar{Y}$ is searched for that maximizes the area of the parallelogram formed by the vectors $\bar{y}_t$ and $\bar{y}_s$. In each recursive step, a new vector from $\bar{Y}$ is added to the basis of the previously found vertex vectors. For noisy data, in each recursive step we attempt to find the $p$ vectors from $\bar{Y}$ that have the highest impact on the solution to (2). These vectors are then averaged to form a rough estimate of the desired vertex. The final form of the proposed recursive algorithm is given by Algorithm 1. The function $[c^{(sort)}, K] = \mathrm{sort}(c, p, \mathrm{descend})$ sorts the entries of $c$ in descending order, $c^{(sort)}$ contains the $p$ largest entries, and $K$ is the set of their indices.

4 Experiments

The experiments are carried out for a Blind Source Separation (BSS) problem, using a benchmark of 7 synthetic sparse nonnegative signals (the file AC-7 2noi.mat) taken from the Matlab toolbox NMFLAB for Signal Processing [13], available at http://www.bsp.brain.riken.jp. Thus $X \in \mathbb{R}_+^{7 \times 1000}$, and this is a sufficiently sparse matrix according to Definition 2. The entries of the mixing matrix $A \in \mathbb{R}_+^{21 \times 7}$ were generated from a normal distribution $\mathcal{N}(0,1)$, with $\mathrm{cond}(A) \cong 4.3$, where the negative entries were replaced with zeros. To estimate the matrices $A$ and $X$ from $Y$, we use the standard Lee-Seung algorithm [1] (denoted here by the MUE acronym) for minimizing the Euclidean distance, using 500 iterations.


Algorithm 1. SimplexMax
Input: $Y \in \mathbb{R}_+^{I \times T}$, $J$ – number of lateral components, $p$ – number of nearest neighbors
Output: $A \in \mathbb{R}_+^{I \times J}$ – estimated initial basis matrix

1: Initialize $A = 0$; replace negative entries (if any) in $Y$ with zeros, and remove zero-value columns from $Y$;
2: $\bar{Y} = \left[ \frac{y_1}{\|y_1\|_1}, \ldots, \frac{y_T}{\|y_T\|_1} \right]$;  // projection onto the probability simplex
3: $z \sim U[0,1] \in \mathbb{R}_+^I$;
4: $r_t = \|\bar{y}_t - z\|_2$;
5: $[r^{(sort)}, K_p^{(0)}] = \mathrm{sort}(r, p, \mathrm{descend})$, where $r = [r_t] \in \mathbb{R}^T$;
6: $a_1 = \frac{1}{p} \sum_{k \in K_p^{(0)}} \bar{y}_k$;
7: for $j = 1, 2, \ldots, J-1$ do
8:   $d = [d_t] = 0$;
9:   for $t = 1, 2, \ldots, T$ do
10:    $D = [a_1, \ldots, a_j, \bar{y}_t] \in \mathbb{R}_+^{I \times (j+1)}$;
11:    $d_t = \det(D^T D)$;
12:  $[d^{(sort)}, K_p] = \mathrm{sort}(d, p, \mathrm{descend})$, where $d = [d_t] \in \mathbb{R}^T$;
13:  $a_{j+1} = \frac{1}{p} \sum_{k \in K_p} \bar{y}_k$;
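A self-contained NumPy sketch of Algorithm 1 (illustrative, not the author's code; variable names follow the pseudocode, and the random reference point z makes the first vertex choice stochastic, as in step 3):

```python
import numpy as np

def simplex_max(Y, J, p=1, seed=None):
    """Greedy volume maximization over the simplex-projected data
    (a sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    Y = np.maximum(np.asarray(Y, dtype=float), 0)   # step 1: clip negatives
    norms = Y.sum(axis=0)
    Ybar = Y[:, norms > 0] / norms[norms > 0]       # step 2: Eq. (1), drop zero columns
    I, T = Ybar.shape
    A = np.zeros((I, J))

    z = rng.uniform(size=(I, 1))                    # step 3: random reference point
    r = np.linalg.norm(Ybar - z, axis=0)            # step 4: distances to z
    Kp = np.argsort(r)[::-1][:p]                    # step 5: p furthest columns
    A[:, 0] = Ybar[:, Kp].mean(axis=1)              # step 6: first vertex estimate

    for j in range(1, J):                           # steps 7-13: one vertex per pass
        d = np.empty(T)
        for t in range(T):
            D = np.column_stack((A[:, :j], Ybar[:, t]))
            d[t] = np.linalg.det(D.T @ D)           # Gram determinant (squared volume)
        Kp = np.argsort(d)[::-1][:p]
        A[:, j] = Ybar[:, Kp].mean(axis=1)
    return A
```

For the experimental setup below, the call would be, e.g., A0 = simplex_max(Y, J=7, p=3).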

To test the efficiency of the discussed initialization methods, 100 Monte Carlo (MC) runs of NMF were performed; each time, the initial matrix $A$ was estimated with the tested initialization method, while $X$ was randomly generated from a uniform distribution. We tested the following initialization methods: random, multilayer with multistart [3] (3 layers and 30 restarts), ALS-based initialization [3], SVD-based initialization [7], and the proposed SimplexMax. The efficiency of the initializers was evaluated with the Signal-to-Interference Ratio (SIR) [3] between the true matrix $A$ and the estimated one. Fig. 2 shows the SIR statistics for estimating the mixing matrix $A$ using the MUE algorithm initialized with the various initialization methods. We analyzed 3 cases: noise-free data, weakly noisy data with $SNR = 30$ dB, and moderately noisy data with $SNR = 20$ dB. The SIR statistics plotted in Fig. 2 concern only the noisy cases. We set $p = 1$ and $p = 3$ for $SNR = 30$ dB and $SNR = 20$ dB, respectively. For the noise-free data, the SimplexMax method estimates the columns of the matrix $A$ with $SIR > 200$ dB for $p = 1$, and hence no further alternating steps of NMF are needed (see Table 1).
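A hedged sketch of the SIR evaluation; the per-column convention below (unit-norm columns, greedy matching of each true column to its closest estimate) is a common BSS choice and only an assumption about the exact formula used in [3]:

```python
import numpy as np

def mean_sir_db(A_true, A_est):
    """Mean SIR [dB] over the columns of A, with unit-norm columns and
    greedy matching of each true column to its closest estimate.
    This convention is an assumption; [3] defines the exact measure."""
    An = A_true / np.linalg.norm(A_true, axis=0)
    En = A_est / (np.linalg.norm(A_est, axis=0) + 1e-12)
    sirs = [-20 * np.log10(np.linalg.norm(En - a[:, None], axis=0).min() + 1e-12)
            for a in An.T]
    return float(np.mean(sirs))
```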


Fig. 2. SIR statistics for estimating the mixing matrix $A$ from noisy observations, using the MUE algorithm with various initialization methods (random, multilayer, ALS, SVD, and SimplexMax): (a) $SNR = 20$ dB, (b) $SNR = 30$ dB. For noise-free data, the SimplexMax method estimates the matrix $A$ with $SIR > 200$ dB.

Table 1. Mean-SIR values [dB] and standard deviations (in parentheses) averaged over 100 MC runs of the SimplexMax method (without the MUE algorithm)

Data          | p = 1        | p = 3        | p = 5         | p = 10
--------------|--------------|--------------|---------------|-------------
noise-free    | 276 (1.43)   | 151.8 (2.62) | 63.23 (10.42) | 10.93 (3.05)
SNR = 30 dB   | 21.23 (0.81) | 17.91 (1.83) | 15.13 (2.82)  | 9.91 (2.17)
SNR = 20 dB   | 7.27 (1.36)  | 10.18 (2.06) | 8.69 (1.85)   | 8.37 (1.67)

5 Conclusions

Fig. 2 demonstrates that for observations with $SNR \geq 20$ dB the proposed SimplexMax method provides the best initializer for $A$ among the tested methods. For noise-free data that satisfies the sufficient sparsity condition (Definition 2), the SimplexMax method gives the exact estimator. However, such data is difficult to obtain in practice, hence SimplexMax should nearly always be combined with some alternating optimization algorithm for NMF. The performance of Algorithm 1 for $p = 1$ depends considerably on the SNR of the observations. We noticed that for $SNR \cong 30$ dB and $p = 1$, the mean SIR of the initial matrix $A$ estimated with SimplexMax is about 21 dB (Table 1), but after applying the MUE algorithm, the SIR grows to about 33 dB. When the observed data is more strongly corrupted with noise, SimplexMax needs an adaptation of the parameter $p$. Further study is needed to determine the relation between $p$ and the noise level.

Summing up, the proposed SimplexMax method seems to be efficient for the initialization of the basis vectors in NMF when the observations are sufficiently sparse and corrupted with at most moderate noise. For a noise-free case, the proposed method gives exact estimators.


Acknowledgment. This work was supported by the habilitation grant N N515 603139 (2010-2012) from the Ministry of Science and Higher Education, Poland.

References

[1] Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
[2] Langville, A.N., Meyer, C.D., Albright, R.: Initializations for the nonnegative matrix factorization. In: Proc. of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, USA (2006)
[3] Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.I.: Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley and Sons (2009)
[4] Cichocki, A., Zdunek, R.: Multilayer nonnegative matrix factorization. Electronics Letters 42(16), 947–948 (2006)
[5] Wild, S.: Seeding non-negative matrix factorization with the spherical k-means clustering. M.Sc. Thesis, University of Colorado (2000)
[6] Wild, S., Curry, J., Dougherty, A.: Improving non-negative matrix factorizations through structured initialization. Pattern Recognition 37(11), 2217–2232 (2004)
[7] Boutsidis, C., Gallopoulos, E.: SVD based initialization: A head start for nonnegative matrix factorization. Pattern Recognition 41, 1350–1362 (2008)
[8] Kim, Y.D., Choi, S.: A method of initialization for nonnegative matrix factorization. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007), Honolulu, Hawaii, USA, vol. II, pp. 537–540 (2007)
[9] Xue, Y., Tong, C.S., Chen, Y., Chen, W.S.: Clustering-based initialization for non-negative matrix factorization. Applied Mathematics and Computation 205(2), 525–536 (2008)
[10] Donoho, D., Stodden, V.: When does non-negative matrix factorization give a correct decomposition into parts? In: Thrun, S., Saul, L., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems (NIPS), vol. 16. MIT Press, Cambridge (2004)
[11] Chu, M.T., Lin, M.M.: Low dimensional polytope approximation and its applications to nonnegative matrix factorization. SIAM Journal on Scientific Computing 30, 1131–1151 (2008)
[12] Wang, F.Y., Chi, C.Y., Chan, T.H., Wang, Y.: Nonnegative least-correlated component analysis for separation of dependent sources by volume maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(5), 875–888 (2010)
[13] Cichocki, A., Zdunek, R.: NMFLAB for Signal and Image Processing. Technical report, Laboratory for Advanced Brain Signal Processing, BSI, RIKEN, Saitama, Japan (2006)
