Projective Nonnegative Matrix Factorization based on α-Divergence

Zhirong Yang and Erkki Oja
Department of Information and Computer Science∗
Aalto University School of Science and Technology
P.O. Box 15400, FI-00076 Aalto, Finland

∗ Supported by the Academy of Finland in the project Finnish Center of Excellence in Adaptive Informatics Research.

Abstract

The well-known Nonnegative Matrix Factorization (NMF) method can be provided with more flexibility by generalizing the non-normalized Kullback-Leibler divergence to α-divergences. However, the resulting α-NMF method can only achieve mediocre sparsity for the factorizing matrices. We have earlier proposed a variant of NMF, called Projective NMF (PNMF), that has been shown to have superior sparsity over standard NMF. Here we propose to incorporate both merits of α-NMF and PNMF. Our α-PNMF method can produce a much sparser factorizing matrix, which is desired in many scenarios. Theoretically, we provide a rigorous convergence proof that the iterative updates of α-PNMF monotonically decrease the α-divergence between the input matrix and its approximate. Empirically, the advantages of α-PNMF are verified in two application scenarios: (1) it is able to learn highly sparse and localized part-based representations of facial images; (2) it outperforms α-NMF and PNMF for clustering in terms of higher purity and smaller entropy.

1 Introduction

Nonnegative learning based on matrix factorization has received a lot of research attention recently. After Lee and Seung [11, 12] presented their Nonnegative Matrix Factorization (NMF) algorithms, a multitude of NMF variants have been proposed and applied to many areas such as signal processing, data mining, pattern recognition and gene expression studies [14, 6, 21, 3, 9, 4]. NMF is not only applicable to the feature axis for finding sparse and part-based representations (e.g. [13, 10]), but also to the sample axis, e.g. for finding clusters of data items (e.g. [8, 7, 18]).

The original NMF algorithm minimizes one of two kinds of difference measure between the data matrix and its approximate: the least square error or the non-normalized Kullback-Leibler divergence (or I-divergence). When the latter is used, NMF actually maximizes the Poisson likelihood of the observed data [11]. It was recently pointed out that the divergence minimization can be generalized by using the α-divergence [1], which leads to a family of new algorithms [5, 23]. The convergence proof of NMF with α-divergence is given in [5]. The empirical study by Cichocki et al. shows that the generalized NMF can achieve better performance by using suitable α values.

Projective Nonnegative Matrix Factorization (PNMF) [22] is another variant of NMF. It identifies a nonnegative subspace by integrating the nonnegativity constraint into the PCA objective. PNMF has proven to outperform NMF in feature extraction, where PNMF is able to generate sparser patterns which are more localized and non-overlapping [22]. Clustering results of text data also demonstrate that PNMF is advantageous as it provides a better approximation to the binary-valued multi-cluster indicators than NMF [18].

In this paper we combine the above two techniques by using the α-divergence instead of the I-divergence as the error measure in PNMF. We provide a multiplicative optimization algorithm which is theoretically convergent. Experiments are conducted in which the new algorithm is shown to outperform α-NMF for feature extraction and clustering on a variety of datasets.

Part of the work can be found in our preliminary paper [19]. As an extension, we propose here a novel multiplicative update rule which monotonically decreases the α-divergence between the data matrix and its approximate, without additional normalization or stabilization steps. The new algorithm is more desirable because it makes the objectives at different iterations and with different initial guesses comparable. The proof uses a novel convex function for the α-divergence which has not been used in the previous literature on divergence measures. We also provide the multiplicative update rule for the special case α → 0, which completes these algorithms for the entire family of α-divergences.

The rest of the paper is organized as follows. We first briefly review the NMF and PNMF methods in Section 2. In Section 3, we present the α-PNMF objective, its multiplicative optimization algorithm and convergence proof. The experiments are presented in Section 4, and Section 5 concludes the paper.

2 Related Work

2.1 Nonnegative Matrix Factorization

Given a nonnegative data matrix X ∈ R_+^{m×N}, Nonnegative Matrix Factorization (NMF) seeks an approximative decomposition of X that is of the form

    X ≈ WH,    (1)

where W ∈ R_+^{m×r} and H ∈ R_+^{r×N}, with the rank r ≪ min(m, N). Denote by X̂ = WH the approximating matrix. The approximation can be achieved by minimizing two widely used measures: (1) the least square criterion ε = ∑_{i,j} (X_ij − X̂_ij)² and (2) the non-normalized Kullback-Leibler divergence (or I-divergence)

    D_I(X||X̂) = ∑_{i,j} ( X_ij log(X_ij / X̂_ij) − X_ij + X̂_ij ).

In this paper we focus on the second approximation criterion, which leads to multiplicative update rules of the form

    H_kj^new = H_kj (W^T Z)_kj / ∑_i W_ik,
    W_ik^new = W_ik (Z H^T)_ik / ∑_j H_kj,

where we use Z_ij = X_ij / X̂_ij for notational brevity.
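For concreteness, the following NumPy sketch illustrates these I-divergence updates. The function name, the random initialization and the small ε guard against division by zero are implementation choices of ours, not details prescribed above.

```python
import numpy as np

def nmf_i_divergence(X, r, n_iter=200, eps=1e-12, seed=0):
    """Minimal sketch of NMF with the I-divergence multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, N)) + eps
    for _ in range(n_iter):
        Z = X / (W @ H + eps)                            # Z_ij = X_ij / Xhat_ij
        H *= (W.T @ Z) / (W.sum(axis=0)[:, None] + eps)
        Z = X / (W @ H + eps)                            # recompute Z with the new H
        W *= (Z @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```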

2.2 Nonnegative Matrix Factorization with α-divergence

The α-divergence [1] is a parametric family of divergence functionals, including several well-known divergence measures as special cases. NMF equipped with the following α-divergence as the approximation measure was introduced by Cichocki et al. and called α-NMF [5]:

    D_α(X||X̂) = (1/(α(1−α))) ∑_ij [ αX_ij + (1−α)X̂_ij − X_ij^α X̂_ij^(1−α) ].

The corresponding multiplicative update rules are given by the following, where we define Z̃_ij = Z_ij^α:

    H_kj^new = H_kj ( (W^T Z̃)_kj / ∑_i W_ik )^(1/α),
    W_ik^new = W_ik ( (Z̃ H^T)_ik / ∑_j H_kj )^(1/α).

α-NMF reduces to the conventional NMF with I-divergence when α → 1. Another choice of α characterizes a different learning principle, in the sense that the model distribution is more inclusive (α → ∞) or more exclusive (α → −∞). Such flexibility enables α-NMF to outperform NMF with α properly selected.
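A hedged NumPy sketch of the α-NMF updates is given below, together with the α-divergence itself. As before, the identifiers, the ε terms and the recomputation of Z̃ between the two updates are our own implementation choices rather than details prescribed in [5].

```python
import numpy as np

def alpha_divergence(X, Xhat, alpha, eps=1e-12):
    """D_alpha(X || Xhat) for alpha not equal to 0 or 1."""
    return np.sum(alpha * X + (1 - alpha) * Xhat
                  - X**alpha * (Xhat + eps)**(1 - alpha)) / (alpha * (1 - alpha))

def alpha_nmf(X, r, alpha=0.5, n_iter=200, eps=1e-12, seed=0):
    """Minimal sketch of the alpha-NMF multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, N)) + eps
    for _ in range(n_iter):
        Zt = (X / (W @ H + eps))**alpha                  # Ztilde_ij = Z_ij^alpha
        H *= ((W.T @ Zt) / (W.sum(axis=0)[:, None] + eps))**(1.0 / alpha)
        Zt = (X / (W @ H + eps))**alpha
        W *= ((Zt @ H.T) / (H.sum(axis=1)[None, :] + eps))**(1.0 / alpha)
    return W, H
```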

2.3 Projective Nonnegative Matrix Factorization

Replacing H = W^T X in (1), we get the Projective Nonnegative Matrix Factorization (PNMF) approximation scheme [22]

    X ≈ WW^T X.    (2)

Denote by X̂ = WW^T X the approximating matrix, Z_ij = X_ij / X̂_ij, and 1_m a column vector of length m filled with ones. The PNMF multiplicative update rule for I-divergence is given by [22]

    W'_ik = W_ik (AW)_ik / (BW)_ik,    (3)

where A = ZX^T + XZ^T and B = 1_m 1_n^T X^T + X 1_n 1_m^T.

In practice, iterations with only the update rule (3) are sensitive to the initial guess of W and often have a very zigzag learning path, where the overall scaling of W fluctuates between odd and even iterations. This is overcome in practice by using an additional normalization step [22]

    W^new = W' / ||W'||,

or a stabilization step [18]

    W^new = W' √( ∑_ij X_ij / ∑_ij (W'W'^T X)_ij ).

The name PNMF comes from another derivation of the approximation scheme (2), where a projection matrix P in X ≈ PX is factorized into WW^T. This interpretation connects PNMF with the classical Principal Component Analysis subspace method except for the nonnegativity constraint [22]. Compared with NMF, PNMF is able to learn a much sparser matrix W [22, 23, 18]. This property is especially desired for extracting part-based representations of data samples or finding cluster indicators.
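The sketch below implements update (3) followed by the stabilization step of [18]. It exploits the fact that B can be written as 1_m s^T + s 1_m^T with s = X 1_n, so that B never needs to be formed explicitly; this shortcut, the names and the ε terms are our own choices.

```python
import numpy as np

def pnmf_i_divergence(X, r, n_iter=200, eps=1e-12, seed=0):
    """Sketch of PNMF with I-divergence: update (3) followed by the stabilization step."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    W = rng.random((m, r)) + eps
    s = X.sum(axis=1)                                    # s = X 1_n, so B = 1_m s^T + s 1_m^T
    for _ in range(n_iter):
        Z = X / (W @ (W.T @ X) + eps)                    # Z_ij = X_ij / (W W^T X)_ij
        AW = Z @ (X.T @ W) + X @ (Z.T @ W)               # (Z X^T + X Z^T) W
        BW = np.outer(np.ones(m), s @ W) + np.outer(s, W.sum(axis=0))
        W *= AW / (BW + eps)                             # multiplicative update (3)
        W *= np.sqrt(X.sum() / ((W @ (W.T @ X)).sum() + eps))   # stabilization step [18]
    return W
```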

3 PNMF with α-divergence

In this section we combine the flexibility of α-NMF and the sparsity of PNMF into a single algorithm. We call the resulting method α-PNMF, which stands for Projective Nonnegative Matrix Factorization with α-divergence.

3.1 Multiplicative update rule

α-PNMF solves the following optimization problem:

    minimize_{W≥0} J(W) = D_α(X||WW^T X).

The gradient of the objective with respect to W is given by

    ∂J(W)/∂W_ik = (1/α) [ −(ÃW)_ik + (BW)_ik ],

where Z̃_ij = Z_ij^α, Ã = Z̃X^T + XZ̃^T, and again B = 1_m 1_n^T X^T + X 1_n 1_m^T.

Denote by Λ_ik the Lagrangian multipliers associated with the constraint W_ik ≥ 0. The Karush-Kuhn-Tucker (KKT) conditions require

    ∂J(W)/∂W_ik = Λ_ik    (4)

and Λ_ik W_ik = 0, which indicates Λ_ik W_ik^(1/η) = 0 for 1/η ≠ 0. Multiplying both sides of Eq. (4) by W_ik^(1/η) leads to (∂J(W)/∂W_ik) W_ik^(1/η) = 0. This suggests a multiplicative update rule:

    W_ik^new = W_ik ( (ÃW)_ik / (BW)_ik )^η    (5)

for all α ≠ 0. The values of η are given in Eq. (6) below.

3.2 Convergence proof

In this section, we prove that iteratively applying (5) monotonically decreases the objective function D_α(X||WW^T X). The convergence of NMF and most of its variants, including α-NMF, to a local minimum of the cost function is analyzed by using an auxiliary function as its tight upper-bound. This is achieved in α-NMF [5] by using the Jensen inequality based on the convex function

    h(z) = ( α + (1−α)z − z^(1−α) ) / (α(1−α)).

This convex function is however not applicable to the α-PNMF case because it is not decomposable, i.e. it fulfils neither h(xy) ∝ h(x)h(y) nor h(xy) = h(x) + h(y) + constant. Here we overcome this problem by using a novel convex function

    g(x, y) = − x^α y^(1−α) / (α(1−α)).

We further introduce f(y) = g(X_ij, y) for notational brevity. Notice that f(y) is convex with respect to y, and

    f(by) = b^(1−α) f(y),
    f(yz) = − (α(1−α) / X_ij^α) f(y) f(z).

Let W be the current estimate, X̂ = W̃W̃^T X, and

    γ_ijk = W_ik (W^T X)_kj / (WW^T X)_ij = W_ik (W^T X)_kj / ∑_l W_il (W^T X)_lj,
    β_ajk = W_ak X_aj / (W^T X)_kj = W_ak X_aj / ∑_b W_bk X_bj,
    Ṽ ≡ Ṽ(W̃, W) with Ṽ_ik = W̃_ik^(1−α) W_ik^α,
    V ≡ V(W, W) with V_ik = W_ik,
    S = − (1/(α(1−α))) XZ̃^T (so that S + S^T = − (1/(α(1−α))) Ã).

Obviously, γ_ijk ≥ 0, ∑_k γ_ijk = 1, β_ajk ≥ 0, and ∑_a β_ajk = 1. In the derivation below we also employ the following inequalities for any symmetric positive matrix M independent of W̃ [8]:

    Tr(W̃^T M W̃) ≤ ∑_ik (W̃_ik² / W_ik) (MW)_ik,
    Tr(W̃^T M W̃) ≥ ∑_ijk M_ij W_ik W_jk ln(W̃_ik W̃_jk) + constant,

where the equality holds if and only if W̃ = W.

Recall B = 1_m 1_n^T X^T + X 1_n 1_m^T. We have

    J_1(W̃) ≡ ∑_ij (1/α) (W̃W̃^T X)_ij = (1/(2α)) Tr(W̃^T B W̃)
            ≤ (1/α) ∑_ik (W̃_ik² / (2W_ik)) (BW)_ik ≡ G_1^(1)(W̃, W)

for α > 0, and

    J_1(W̃) ≤ G_1^(2)(W̃, W) ≡ (1/α) ∑_ijk B_ij W_ik W_jk ln(W̃_ik W̃_jk) + constant

for α < 0.

Next, we apply the Jensen inequality twice to obtain the upper bound of J_2(W̃) ≡ − ∑_ij X_ij^α X̂_ij^(1−α) / (α(1−α)) (see Figure 1). When 0 < α < 1, S + S^T = − (1/(α(1−α))) Ã is symmetric and negative. Therefore,

    J_2(W̃) ≤ − (1/(α(1−α))) ∑_ijk Ã_ij V_ik V_jk ln(Ṽ_ik Ṽ_jk) + constant
            = − (1/α) ∑_ijk Ã_ij W_ik W_jk ln(W̃_ik W̃_jk) + constant
            ≡ G_2^(1)(W̃, W).

When α > 1 or α < 0, − (1/(α(1−α))) Ã is symmetric and positive. Therefore,

    J_2(W̃) ≤ − (1/(α(1−α))) ∑_ik (Ṽ_ik² / (2V_ik)) (ÃV)_ik
            = − (1/(α(1−α))) ∑_ik (W̃_ik^(2−2α) / (2W_ik^(1−2α))) (ÃW)_ik
            ≡ G_2^(2)(W̃, W).

In summary, the auxiliary upper-bounding function of J(W̃) = J_1(W̃) + J_2(W̃) + ∑_ij X_ij / (1−α) is

    G(W̃, W) = G_1^(1)(W̃, W) + G_2^(2)(W̃, W)   when α > 1,
               G_1^(1)(W̃, W) + G_2^(1)(W̃, W)   when 0 < α < 1,
               G_1^(2)(W̃, W) + G_2^(2)(W̃, W)   when α < 0,

up to some additive constant. Minimization of the auxiliary function over W̃ is implemented by setting its gradient to zero, which leads to Eq. (5), where

    η = 1/(2α)       when α > 1,
        1/2          when 0 < α < 1,
        1/(2α − 2)   when α < 0.    (6)

Because all upper-bounds used are tight, the above update rule leads to

    J(W^new) = G(W^new, W^new) ≤ G(W^new, W) ≤ G(W, W) = J(W),

where the first inequality comes from the upper bound and the second from the minimization. Iteratively applying (5) thus monotonically decreases D_α(X||WW^T X).

Remark 1: Theoretically convergent update rules for PNMF based on the non-normalized KL-divergence are unresolved in the previous PNMF literature [22, 23, 18]. This is now given by our proof as a special case (α → 1):

    W_ik^new = W_ik √( (ZX^T W + XZ^T W)_ik / ( ∑_j (W^T X)_kj + ∑_j X_ij ∑_b W_bk ) ).
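A NumPy sketch of this α → 1 special case is shown below (identifiers are ours). Since the numerator equals (AW)_ik and the denominator equals (BW)_ik with A and B as in Section 2.3, the rule is simply update (3) taken to the power 1/2.

```python
import numpy as np

def pnmf_kl_convergent(X, r, n_iter=200, eps=1e-12, seed=0):
    """Sketch of the theoretically convergent PNMF update for I-divergence (alpha -> 1)."""
    rng = np.random.default_rng(seed)
    m, _ = X.shape
    W = rng.random((m, r)) + eps
    s = X.sum(axis=1)                                    # row sums of X
    for _ in range(n_iter):
        Z = X / (W @ (W.T @ X) + eps)
        AW = Z @ (X.T @ W) + X @ (Z.T @ W)               # (Z X^T + X Z^T) W
        BW = np.outer(np.ones(m), s @ W) + np.outer(s, W.sum(axis=0))
        W *= np.sqrt(AW / (BW + eps))                    # exponent 1/2; no extra steps needed
    return W
```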

Remark 2: An exception occurs when α → 0, where the derivative has the form 0/0. We thus apply L'Hôpital's rule to obtain the limit and then set it to zero. This yields the update rule

    W_ik^new = W_ik exp( (1/2) (Ã^(0) W)_ik / (BW)_ik ),

where Z̃^(0)_ij = log Z_ij and Ã^(0) = Z̃^(0) X^T + X Z̃^(0)T.

Figure 1: Upper-bounding J_2(W̃) ≡ − ∑_ij X_ij^α X̂_ij^(1−α) / (α(1−α)). Writing X̂_ij = ∑_k W̃_ik (W̃^T X)_kj and applying the Jensen inequality to the convex function f twice, first over k with the weights γ_ijk and then over a with the weights β_ajk, yields J_2(W̃) ≤ ∑_aik Ṽ_ik Ṽ_ak S_ai = Tr(Ṽ^T S Ṽ) = (1/2) Tr(Ṽ^T (S + S^T) Ṽ), up to an additive constant.

Remark 3: We have previously proposed an algorithm that iterates the following two steps [19]:

    W'_ik = W_ik ( (ÃW)_ik / (BW)_ik )^(1/α),    (7)
    W_ik^new = W'_ik ( ∑_ij X̂_ij Z̃_ij / ∑_ij X̂_ij )^(1/(2α)).    (8)

The update rule (7) is obtained by turning α-PNMF into a constrained α-NMF with H = W^T X. It guarantees that the Lagrangian objective decreases in each iteration. However, the definition of such a function varies across different iterations and also across different starting values, because the Lagrangian multipliers solved from the KKT conditions are determined by the current W. The resulting objectives are therefore not comparable, which hinders monitoring the convergence and prevents improvement by multiple runs using different initial guesses. By contrast, the update rule (5) assures the monotonic decrease of the original α-PNMF objective, whose definition does not depend on the iterations or the starting values of W. Therefore one may easily monitor the convergence, rerun the algorithm several times and select the solution with the best objective.

The new multiplicative algorithm also overcomes another shortcoming of the previous one. The update rule (7) is sensitive to the overall scaling of W and results in zigzag learning paths. Therefore it must be accompanied by the stabilization step (8) with recalculated X̂ and Z̃. However, a proof of the consistency of this additional update rule with the original objective D_α(X||WW^T X) is still lacking. In contrast, the new algorithm using (5) does not require any additional normalization or stabilization steps, which facilitates its theoretical analysis.
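For reference, the following hedged NumPy sketch combines update (5) with the exponent η from (6) and the exponential rule of Remark 2 for α = 0. The function signature, the initialization, the ε guards and the decision to handle α = 0 inside the same loop are our own choices, not prescriptions from the text.

```python
import numpy as np

def alpha_pnmf(X, r, alpha=0.5, n_iter=200, eps=1e-12, seed=0):
    """Sketch of alpha-PNMF: update (5) with eta from (6), plus the alpha = 0 rule of Remark 2."""
    rng = np.random.default_rng(seed)
    m, _ = X.shape
    W = rng.random((m, r)) + eps
    s = X.sum(axis=1)                                    # B = 1_m s^T + s 1_m^T with s = X 1_n
    if alpha > 1:
        eta = 1.0 / (2 * alpha)
    elif alpha > 0:
        eta = 0.5                                        # 0 < alpha <= 1 (alpha = 1 gives Remark 1)
    elif alpha < 0:
        eta = 1.0 / (2 * alpha - 2)
    else:
        eta = None                                       # alpha = 0: exponential rule of Remark 2
    for _ in range(n_iter):
        Z = X / (W @ (W.T @ X) + eps)
        Zt = np.log(Z + eps) if alpha == 0 else Z**alpha
        AW = Zt @ (X.T @ W) + X @ (Zt.T @ W)             # (Ztilde X^T + X Ztilde^T) W
        BW = np.outer(np.ones(m), s @ W) + np.outer(s, W.sum(axis=0))
        if eta is None:
            W *= np.exp(0.5 * AW / (BW + eps))
        else:
            W *= (AW / (BW + eps))**eta
    return W
```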

4 Experiments

Suppose the nonnegative matrix X ∈ R_+^{m×N} is composed of N data samples x_j ∈ R_+^m, j = 1, ..., N. Basically, α-PNMF can be applied to this matrix in two different ways. Firstly, one employs the approximation scheme X ≈ WW^T X and performs feature extraction by projecting each sample into a nonnegative subspace. The second approach approximates the transposed matrix X^T by WW^T X^T, where W ∈ R_+^{N×r}; here α-PNMF can be used for clustering, with the elements of W now indicating the membership of each sample in the r clusters. We conduct benchmark experiments on both cases.
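As a usage sketch, relying on the hypothetical alpha_pnmf function outlined at the end of Section 3 and on made-up matrix shapes, the two modes could look as follows; reading cluster memberships off W with a row-wise argmax is our reading of the membership interpretation above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1024, 2409))            # features x samples, e.g. vectorized face images

# (1) Feature extraction: X ~ W W^T X, the columns of W act as basis vectors.
W_feat = alpha_pnmf(X, r=25, alpha=0.5)
codes = W_feat.T @ X                    # nonnegative projections of the samples

# (2) Clustering: factorize the transposed matrix, X^T ~ W W^T X^T.
W_clust = alpha_pnmf(X.T, r=10, alpha=2.0)
labels = W_clust.argmax(axis=1)         # row-wise maxima as cluster indicators
```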

4.1 Feature extraction

We have used the FERET database of facial images [15] as the training data set. After face segmentation, 2,409 frontal images (poses "fa" and "fb") of 867 subjects were stored in the database for the experiments. All face boxes were normalized to the size of 32×32 and then reshaped to a 1024-dimensional vector by column-wise concatenation. Thus we obtained a 1024×2409 nonnegative data matrix, whose elements are re-scaled into the region [0,1] by dividing with their maximum. For good visualization, we empirically set r = 25 in the feature extraction experiments. After training, the basis vectors are stored in the columns of W in α-NMF and α-PNMF.

The basis vectors have the same dimensionality as the image samples and thus can be visualized as basis images. In order to encode the features of different facial parts, it is expected to find some localized and non-overlapping patterns in the basis images. The resulting basis images using α = 0.5 (Hellinger divergence), α = 1 (I-divergence) and α = 2 (χ² divergence) are shown in Figure 2. Both methods can identify some facial parts such as eyebrows and lips. In comparison, α-PNMF is able to generate much sparser basis images with more part-based visual patterns.

Notice that two non-negative vectors are orthogonal if and only if they do not have the same non-zero dimensions. Therefore we can quantify the sparsity of the basis vectors by measuring their orthogonalities with the τ measurement [20]:

    τ = 1 − ||R − I||_F / (r(r − 1)),

where ||·||_F is the Frobenius matrix norm and the element R_st of the matrix R gives the normalized inner product between two basis vectors w_s and w_t:

    R_st = w_s^T w_t / (||w_s|| ||w_t||).

Larger τ values indicate higher orthogonality, and τ reaches 1 when the columns of W are completely orthogonal. The numerical values of the orthogonality τ for the two compared methods are given under the respective basis image plots in Figure 2. All τ values on the right are considerably larger than their left counterparts, which confirms that α-PNMF is able to extract a sparser transformation matrix W.

Figure 2: The basis images of (left) α-NMF and (right) α-PNMF for α = 0.5, α = 1 and α = 2. The orthogonality values are τ = 0.77, 0.75 and 0.75 for α-NMF, and τ = 0.99, 0.99 and 0.92 for α-PNMF, respectively.

4.2 Clustering

We have used a variety of datasets, most of which are frequently used in machine learning and information retrieval research. Table 1 summarizes the characteristics of the datasets. The descriptions of these datasets are as follows:

• Iris, Ecoli5, WDBC, and Pima, which are taken from the UCI data repository with respective datasets Iris, Ecoli, Breast Cancer Wisconsin (Prognostic), and Pima Indians Diabetes. The Ecoli5 dataset contains only samples of the five largest classes in the original Ecoli database.

• AMLALL gene expression database [2]. This dataset contains acute lymphoblastic leukemia (ALL) that has B and T cell subtypes, and acute myelogenous leukemia (AML) that occurs more commonly in adults than in children. The data matrix consists of 38 bone marrow samples (19 ALL-B, 8 ALL-T and 11 AML) with 5000 genes as their dimensions.

• ORL database of facial images [16]. There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions and facial details. In our experiments, we down-sampled the images to size 46×56 and rescaled the gray-scale values to [0, 1].

Table 1: Dataset descriptions

    datasets   #samples   #dimensions   #classes   r
    Iris       150        4             3          3
    Ecoli5     327        7             5          5
    WDBC       569        30            2          10
    Pima       768        8             2          10
    AMLALL     38         5000          3          3
    ORL        400        2576          40         40

The number of clusters r is generally set to the number of classes. This work focuses on cases where r > 2, as there exist closed-form approximations for the two-way clustering solution (see e.g. [17]). We thus set r equal to five times the number of classes for WDBC and Pima.

Suppose there is ground truth data that labels the samples by one of q classes. We have used the purity and entropy measures to quantify the performance of the compared clustering algorithms:

    purity = (1/N) ∑_{k=1}^{r} max_{1≤l≤q} n_k^l,
    entropy = − (1/(N log₂ q)) ∑_{k=1}^{r} ∑_{l=1}^{q} n_k^l log₂ (n_k^l / n_k),

where n_k^l is the number of samples in cluster k that belong to original class l and n_k = ∑_l n_k^l. A larger purity value and a smaller entropy indicate better clustering performance.
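A sketch of these two measures in NumPy (our own function; it assumes integer label vectors for the predicted clusters and the ground-truth classes):

```python
import numpy as np

def purity_entropy(cluster_labels, class_labels):
    """Sketch of the purity and entropy measures for a clustering result."""
    clusters = np.unique(cluster_labels)
    classes = np.unique(class_labels)
    N, q = len(class_labels), len(classes)
    counts = np.array([[np.sum((cluster_labels == k) & (class_labels == l))
                        for l in classes] for k in clusters])       # n_k^l
    purity = counts.max(axis=1).sum() / N
    nk = counts.sum(axis=1, keepdims=True)                          # n_k
    ratio = counts / np.maximum(nk, 1)
    terms = np.where(counts > 0, counts * np.log2(np.maximum(ratio, 1e-300)), 0.0)
    entropy = -terms.sum() / (N * np.log2(q))
    return purity, entropy
```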

The resulting purities and entropies are shown in Table 2 (a) and (b), respectively. α-PNMF performs the best for all selected datasets. Recall that when α = 1 the proposed method reduces to PNMF and thus returns results identical to the latter. Nevertheless, α-PNMF can outperform PNMF by adjusting the α value. When α = 0.5, the new method achieves the highest purity and lowest entropy for the gene expression dataset AMLALL. For the other five datasets, one can set α = 2 and obtain the best clustering result using α-PNMF. In addition, one can see that Nonnegative Matrix Factorization with α-divergence works poorly in our clustering experiments, much worse than the other methods. This is probably because α-NMF has to estimate many more parameters than those using projective factorization. α-NMF is therefore prone to falling into bad local optima.

Table 2: Clustering (a) purities and (b) entropies using α-NMF, PNMF and α-PNMF. The best result for each dataset is highlighted with boldface font.

(a) Purities

    datasets   α-NMF                         PNMF    α-PNMF
               α = 0.5   α = 1   α = 2               α = 0.5   α = 1   α = 2
    Iris       0.83      0.85    0.84        0.95    0.95      0.95    0.97
    Ecoli5     0.62      0.65    0.67        0.72    0.72      0.72    0.73
    WDBC       0.70      0.70    0.72        0.87    0.86      0.87    0.88
    Pima       0.65      0.65    0.65        0.65    0.67      0.65    0.67
    AMLALL     0.95      0.92    0.92        0.95    0.97      0.95    0.92
    ORL        0.47      0.47    0.47        0.75    0.76      0.75    0.80

(b) Entropies

    datasets   α-NMF                         PNMF    α-PNMF
               α = 0.5   α = 1   α = 2               α = 0.5   α = 1   α = 2
    Iris       0.34      0.33    0.33        0.15    0.15      0.15    0.12
    Ecoli5     0.46      0.58    0.50        0.40    0.40      0.40    0.40
    WDBC       0.39      0.38    0.37        0.16    0.17      0.16    0.14
    Pima       0.92      0.90    0.90        0.91    0.90      0.91    0.89
    AMLALL     0.16      0.21    0.21        0.16    0.08      0.16    0.21
    ORL        0.35      0.34    0.35        0.14    0.14      0.14    0.12

5 Conclusions

We have presented a new variant of NMF by introducing the α-divergence into the PNMF algorithm. Our α-PNMF algorithm theoretically converges to a local minimum of the cost function. The resulting factor matrix is of high sparsity or orthogonality, which is desired for part-based feature extraction and data clustering. Experimental results with various datasets indicate that the proposed algorithm can be considered as a promising replacement for both α-NMF and PNMF.

References

[1] S. Amari. Differential-Geometrical Methods in Statistics, volume 28 of Lecture Notes in Statistics. Springer-Verlag, New York, 1985.

[2] Jean-Philippe Brunet, Pablo Tamayo, Todd R. Golub, and Jill P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences, 101(12):4164–4169, 2004.

[3] Seungjin Choi. Algorithms for orthogonal nonnegative matrix factorization. In Proceedings of IEEE International Joint Conference on Neural Networks, pages 1828–1832, 2008.

[4] A. Cichocki, R. Zdunek, A.-H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multiway Data Analysis. John Wiley, 2009.

[5] Andrzej Cichocki, Hyekyoung Lee, Yong-Deok Kim, and Seungjin Choi. Non-negative matrix factorization with α-divergence. Pattern Recognition Letters, 29:1433–1440, 2008.

[6] I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In Advances in Neural Information Processing Systems, volume 18, pages 283–290, 2006.

[7] Chris Ding, Tao Li, and Michael I. Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):45–55, 2010.

[8] Chris Ding, Tao Li, Wei Peng, and Haesun Park. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 126–135, 2006.

[9] K. Drakakis, S. Rickard, R. de Fréin, and A. Cichocki. Analysis of financial data using non-negative matrix factorization. International Mathematical Forum, 3:1853–1870, 2008.

[10] C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 21(3):793–830, 2009.

[11] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

[12] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13:556–562, 2001.

[13] W. Liu, N. Zheng, and X. Lu. Non-negative matrix factorization for visual coding. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), volume 3, pages 293–296, 2003.

[14] Alberto Pascual-Montano, J. M. Carazo, Kieko Kochi, Dietrich Lehmann, and Roberto D. Pascual-Marqui. Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3):403–415, 2006.

[15] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss. The FERET evaluation methodology for face recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:1090–1104, October 2000.

[16] Ferdinando Samaria and Andy Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of 2nd IEEE Workshop on Applications of Computer Vision, pages 138–142, Sarasota, FL, December 1994.

[17] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, August 2000.

[18] Z. Yang and E. Oja. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks, 21(5):734–749, 2010.

[19] Zhirong Yang and Erkki Oja. Projective nonnegative matrix factorization with α-divergence. In Proceedings of the 19th International Conference on Artificial Neural Networks (ICANN), pages 20–29, Limassol, Cyprus, 2009. Springer.

[20] Zhirong Yang, Zhijian Yuan, and Jorma Laaksonen. Projective non-negative matrix factorization with applications to facial image processing. International Journal of Pattern Recognition and Artificial Intelligence, 21(8):1353–1362, December 2007.

[21] S. S. Young, P. Fogel, and D. Hawkins. Clustering Scotch whiskies using non-negative matrix factorization. Joint Newsletter for the Section on Physical and Engineering Sciences and the Quality and Productivity Section of the American Statistical Association, 14(1):11–13, 2006.

[22] Zhijian Yuan and Erkki Oja. Projective nonnegative matrix factorization for image compression and feature extraction. In Proceedings of the 14th Scandinavian Conference on Image Analysis (SCIA 2005), pages 333–342, Joensuu, Finland, June 2005.

[23] Zhijian Yuan and Erkki Oja. A family of modified projective nonnegative matrix factorization algorithms. In Proceedings of the 9th International Symposium on Signal Processing and Its Applications (ISSPA), pages 1–4, 2007.