Nonnegative Matrix Factorization with Temporal Smoothness and/or Spatial Decorrelation Constraints

Zhe Chen and Andrzej Cichocki
Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute
Wako-shi, Saitama 351-0198, Japan
{zhechen,cia}@brain.riken.jp
Abstract

Approximate nonnegative matrix factorization (NMF) is an emerging technique with a wide spectrum of potential applications in biomedical and neurophysiological data analysis. Currently, the most popular algorithms for NMF are those proposed by Lee and Seung. However, most existing NMF algorithms do not provide a unique solution, and the factorized components are often difficult to interpret. The key open issue is to find nonnegative components and corresponding basis vectors that have clear physical or physiological interpretations. Since temporally smooth structures are important for hidden components in many applications, we propose two novel constraints and derive new multiplicative learning rules that allow us to estimate components with physiological meaning. Specifically, we present a novel and promising application of the proposed NMF algorithm to early detection of Alzheimer's disease using EEG recordings.
1 Introduction
Many kinds of real-life data, such as image pixels, stock indices, and gene expression levels, are known to be nonnegative. Positive or nonnegative matrix factorization (NMF) is a powerful technique for analyzing such data [4,8,9]. Specifically, Lee and Seung's algorithms are simple yet elegant in that the multiplicative learning rules follow an expectation-maximization (EM) like procedure and monotonically increase/decrease the likelihood/cost function. When viewed as a generative model, NMF is intrinsically linked to other popular generative models such as ICA. However, NMF generally differs from ICA in that the decomposed components from NMF are not necessarily independent and not necessarily all sparse; in addition, their different objective functions lead to different optimization procedures. Recently, there have been research efforts to enhance NMF by exploiting further constraints beyond nonnegativity [3,5]. When nonnegativity and independence constraints are both imposed on the sources while the mixing matrix may contain negative entries, nonnegative ICA or nonnegative source separation arises [1,6,7].
In this paper, we study the NMF problem with temporal smoothness (with unit variance) and spatial decorrelation constraints, where the sources are regarded as temporal signals. We propose new regularized cost functions and derive the corresponding learning rules; similar to the original NMF procedure, the learning rules are EM-like and guarantee monotonic convergence. Their validity and performance are demonstrated on EEG (electroencephalographic) data and synthetic benchmarks.
2 Nonnegative matrix factorization
Consider a temporal linear mixing model

\[
\mathbf{x}(t) = \mathbf{A}\,\mathbf{s}(t) \quad \text{or} \quad \mathbf{x}_t = \mathbf{A}\mathbf{s}_t \qquad (t = 1, 2, \ldots, T), \tag{1}
\]

where all entries of the vectors $\mathbf{s}_t$ and of the mixing matrix $\mathbf{A}$ are assumed to be nonnegative. In matrix form, equation (1) can be rewritten as

\[
\mathbf{X} = \mathbf{A}\mathbf{S}, \tag{2}
\]
where $\mathbf{X} = [x_{it}] \in \mathbb{R}^{m \times T}_{+}$ is a nonnegative data matrix whose columns $\mathbf{x}_t$ are $m \times 1$ vectors with entries $x_{it} = x_i(t)$, and $T$ denotes the total number of observations; $\mathbf{A} = [a_{ij}] \in \mathbb{R}^{m \times r}_{+}$ and $\mathbf{S} = [s_{it}] \in \mathbb{R}^{r \times T}_{+}$ are the two factorized, nonnegative matrices, with individual elements $a_{ij} \ge 0$, $s_{it} \ge 0$. In general, a low-rank factorization ($r \le m$) is used. Here, $\mathbf{A}$ can be viewed as a bank of $r$ basis vectors of size $m \times 1$ associated with the $T$ observations in $\mathbf{S}$, for which the generative model $\mathbf{x}_t = \mathbf{A}\mathbf{s}_t$ emerges. The goal of NMF is to find nonnegative matrices $\mathbf{A}$ and $\mathbf{S}$ such that a prescribed cost function is minimized. Obviously, the solution to this problem is generally nonunique, except in special cases [2]. In practice, we wish to specify further constraints or impose prior knowledge to derive desired solutions. Although there are many potential optimization procedures for NMF, their performance and complexity vary. One powerful algorithm for NMF was developed by Lee and Seung [4]. Depending on the specific application, the objective function used in NMF can be

\[
J_1 = \sum_{i=1}^{m}\sum_{t=1}^{T}\bigl[x_{it}\log(\mathbf{A}\mathbf{S})_{it} - (\mathbf{A}\mathbf{S})_{it}\bigr], \tag{3}
\]

\[
J_2 = \|\mathbf{X} - \mathbf{A}\mathbf{S}\|^2 = \sum_{i=1}^{m}\sum_{t=1}^{T}\bigl[x_{it} - (\mathbf{A}\mathbf{S})_{it}\bigr]^2, \tag{4}
\]

\[
J_3 = \mathrm{KL}(\mathbf{X}\,\|\,\mathbf{A}\mathbf{S}) = \sum_{i=1}^{m}\sum_{t=1}^{T}\Bigl[x_{it}\log\frac{x_{it}}{(\mathbf{A}\mathbf{S})_{it}} - x_{it} + (\mathbf{A}\mathbf{S})_{it}\Bigr]. \tag{5}
\]

Specifically, the multiplicative learning rules for cost function (3) may be derived:

\[
a_{ij} \leftarrow a_{ij}\sum_{t}\frac{x_{it}}{(\mathbf{A}\mathbf{S})_{it}}\,s_{jt}, \qquad
a_{ij} \leftarrow \frac{a_{ij}}{\sum_{p} a_{pj}}, \qquad
s_{jt} \leftarrow s_{jt}\sum_{i} a_{ij}\frac{x_{it}}{(\mathbf{A}\mathbf{S})_{it}}.
\]
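For reference, a minimal NumPy sketch of these standard multiplicative updates (the function name, random initialization, and iteration count are illustrative assumptions, not part of the paper):

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-12, seed=0):
    """Standard Lee-Seung style multiplicative updates for the KL-type cost.

    X : (m, T) nonnegative data matrix; r : number of components.
    Initialization and iteration count are arbitrary illustrative choices.
    """
    rng = np.random.default_rng(seed)
    m, T = X.shape
    A = rng.random((m, r)) + eps
    S = rng.random((r, T)) + eps
    for _ in range(n_iter):
        AS = A @ S + eps
        # s_jt <- s_jt * sum_i a_ij x_it/(AS)_it / sum_i a_ij
        S *= (A.T @ (X / AS)) / A.sum(axis=0, keepdims=True).T
        AS = A @ S + eps
        # a_ij <- a_ij * sum_t s_jt x_it/(AS)_it / sum_t s_jt, then column-normalize A
        A *= ((X / AS) @ S.T) / S.sum(axis=1, keepdims=True).T
        A /= A.sum(axis=0, keepdims=True)
    return A, S
```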
3 NMF with constraints

3.1 Temporal smoothness constraint
Suppose each "factorized" temporal source (a row vector of the matrix S) represents a temporal signal of length T; we then write the ith source as $s_i(t) \equiv s_{it}$ and the ith row vector of S as $\mathbf{s}_i$. When $s_i(t)$ is temporally locally smooth, its short-term variance is relatively small compared to the larger long-term variance [10]. Let $m_i = \frac{1}{T}\sum_t s_i(t)$ denote the mean value of $\{s_i(t)\}$ over the total $T$ observations, and let $\bar{s}_i(t)$ denote the short-term exponentially weighted average of the temporal signal $s_i(t)$, namely

\[
\bar{s}_i(t) = \alpha\bar{s}_i(t-1) + (1-\alpha)s_i(t) \equiv \alpha\bar{s}_i(t-1) + \beta s_i(t), \tag{6}
\]

where $0 < \alpha < 1$ is a forgetting factor that determines the local smoothness range, and $\beta = 1 - \alpha$. In particular, the temporal smoothness under consideration is measured by the ratio of the short-term variance to the long-term (i.e., computed with the complete data) variance in the temporal domain:¹

\[
R = \log\frac{\sum_t \bigl(\bar{s}_i(t) - s_i(t)\bigr)^2}{\sum_t \bigl(s_i(t) - m_i\bigr)^2}
  = \log\frac{\sum_t \alpha^2\bigl(\bar{s}_i(t-1) - s_i(t)\bigr)^2}{\sum_t \bigl(s_i(t) - m_i\bigr)^2}. \tag{7}
\]

Hence, the smaller the ratio value $R$, the smoother the temporal signal $s_i(t)$. In vector notation, let $\bar{\mathbf{s}}_i$ denote the short-term average vector corresponding to the row vector $\mathbf{s}_i = [s_{i1}, \ldots, s_{iT}]$; if we further constrain the variance of the row vector $\mathbf{s}_i$ to be 1, then the optimization problem (7) may be equivalently rewritten as

\[
R = \frac{1}{T}\|\mathbf{s}_i - \bar{\mathbf{s}}_i\|^2 = \frac{1}{T}(\mathbf{s}_i - \bar{\mathbf{s}}_i)(\mathbf{s}_i - \bar{\mathbf{s}}_i)^{\top}, \qquad \text{s.t. } \mathrm{Var}[\mathbf{s}_i] = 1. \tag{8}
\]

Now, we wish to represent $\bar{\mathbf{s}}_i$ in terms of $\mathbf{s}_i$. Note that the exponential average $\bar{s}_i(t)$ is indeed the convolution of $s_i(t)$ with a template operator. Suppose $\bar{s}_i(0) = 0$ and the template vector has an exponentially decreasing profile with length $L$ (here we use $L = 5$); namely, template = $[\beta, \alpha\beta, \alpha^2\beta, \alpha^3\beta, \alpha^4\beta]$ (if $\alpha = 0.5$ then sum(template) = 0.9688). The convolution can also be conveniently expressed as a matrix product:

\[
\bar{\mathbf{s}}_i^{\top} = \mathbf{T}\mathbf{s}_i^{\top}, \qquad
\mathbf{T} =
\begin{bmatrix}
\beta & 0 & 0 & 0 & 0 & 0 & \cdots & 0\\
\alpha\beta & \beta & 0 & 0 & 0 & 0 & \cdots & 0\\
\alpha^2\beta & \alpha\beta & \beta & 0 & 0 & 0 & \cdots & 0\\
\alpha^3\beta & \alpha^2\beta & \alpha\beta & \beta & 0 & 0 & \cdots & 0\\
\alpha^4\beta & \alpha^3\beta & \alpha^2\beta & \alpha\beta & \beta & 0 & \cdots & 0\\
0 & \alpha^4\beta & \alpha^3\beta & \alpha^2\beta & \alpha\beta & \beta & \cdots & 0\\
\vdots & & \ddots & & & & \ddots & \vdots\\
0 & \cdots & 0 & \alpha^4\beta & \alpha^3\beta & \alpha^2\beta & \alpha\beta & \beta
\end{bmatrix}, \tag{9}
\]
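For concreteness, a small sketch of building the exponential template, the Toeplitz operator T of equation (9), and the matrix Q = (1/T)(I − T)ᵀ(I − T) used later (the function names and the use of SciPy's toeplitz helper are assumptions of this illustration; α = 0.5 and L = 5 follow the text):

```python
import numpy as np
from scipy.linalg import toeplitz

def smoothing_operator(n, alpha=0.5, L=5):
    """Toeplitz matrix T whose rows contain the right-shifted exponential template,
    plus Q = (1/n) (I - T)^T (I - T)."""
    beta = 1.0 - alpha
    template = beta * alpha ** np.arange(L)   # [beta, alpha*beta, ..., alpha^(L-1)*beta]
    first_col = np.zeros(n)
    first_col[:L] = template                  # lower-triangular banded structure
    first_row = np.zeros(n)
    first_row[0] = beta
    T = toeplitz(first_col, first_row)
    I = np.eye(n)
    Q = (I - T).T @ (I - T) / n
    return T, Q

def smoothness_ratio(s, T):
    """R in (10): (1/n) * ||(I - T) s||^2 for a (unit-variance) row vector s."""
    resid = (np.eye(len(s)) - T) @ s
    return resid @ resid / len(s)
```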
where $\mathbf{T}$ defined above is a $T \times T$ Toeplitz matrix with the right-shifted template in each row. Then $R$ can be written as

\[
R = \frac{1}{T}\|\mathbf{s}_i^{\top} - \mathbf{T}\mathbf{s}_i^{\top}\|^2 = \frac{1}{T}\|(\mathbf{I} - \mathbf{T})\mathbf{s}_i^{\top}\|^2. \tag{10}
\]

Hence, we can optimize the regularized cost functions based on (4) or (5), respectively, as follows:

\[
J_4 = \sum_{i=1}^{m}\sum_{t=1}^{T}\bigl[x_{it} - (\mathbf{A}\mathbf{S})_{it}\bigr]^2 + \frac{\lambda}{T}\sum_{i=1}^{r}\|(\mathbf{I} - \mathbf{T})\mathbf{s}_i^{\top}\|^2, \tag{11}
\]

\[
J_5 = \sum_{i=1}^{m}\sum_{t=1}^{T}\Bigl[x_{it}\log\frac{x_{it}}{(\mathbf{A}\mathbf{S})_{it}} - x_{it} + (\mathbf{A}\mathbf{S})_{it}\Bigr] + \frac{\lambda}{2T}\sum_{i=1}^{r}\|(\mathbf{I} - \mathbf{T})\mathbf{s}_i^{\top}\|^2, \tag{12}
\]
where λ is a small regularization coefficient that balances the trade-off between the reconstruction error and the temporal smoothness constraint. Now we are ready to state the main results, namely the learning rules for optimizing the above-defined regularized cost functions.

¹ We do not simply minimize the "regular" variance of the temporal signal, since the optimization would then favor a constant output signal.
Theorem 1  Under the condition that λ is sufficiently small such that the matrix $\mathbf{A}^{\top}\mathbf{A}\mathbf{S} + \lambda\mathbf{S}\mathbf{Q}$ is nonnegative, where $\mathbf{Q} = \frac{1}{T}(\mathbf{I} - \mathbf{T})^{\top}(\mathbf{I} - \mathbf{T})$ is a symmetric square matrix, the regularized cost function (11) is monotonically non-increasing under the learning rules

\[
s_{jt} \leftarrow s_{jt}\frac{(\mathbf{A}^{\top}\mathbf{X})_{jt}}{(\mathbf{A}^{\top}\mathbf{A}\mathbf{S} + \lambda\mathbf{S}\mathbf{Q})_{jt}}, \qquad
a_{ij} \leftarrow a_{ij}\frac{(\mathbf{X}\mathbf{S}^{\top})_{ij}}{(\mathbf{A}\mathbf{S}\mathbf{S}^{\top})_{ij}}, \tag{13}
\]

where the alternating updates of S and A are interleaved with proper scalings of the row vectors $\mathbf{s}_i$ (i = 1, ..., r) and of A in order to ensure that each row vector of S has unit variance.² The cost function is invariant under these updates if and only if A and S are at a stationary point of the distance metric.

Proof: First, for 0 < α < 1, it is seen from (9) that (I − T) has positive diagonal entries and non-positive entries elsewhere. It is easy to verify that in the matrix Q = (1/T)(I − T)ᵀ(I − T) only the diagonal values are positive (e.g., 0.333/T if α = 0.5), whereas the remaining entries are non-positive. Since λ > 0, and AᵀAS and S are both nonnegative, AᵀAS + λSQ is also nonnegative provided the scaled off-diagonal elements λq_{tk} are sufficiently small; this was indeed confirmed in our implementation. Next, we may verify that the update rules in (13) guarantee monotonic non-increasing convergence. This can be done by finding an auxiliary function in the EM-like setup [4]; see the Appendix for details.

Theorem 2  Given the matrix T defined in (9) and a small positive λ, the regularized cost function (12) is monotonically non-increasing under the learning rules

\[
s_{jt} \leftarrow \frac{-b + \sqrt{b^2 - 4ac}}{2a}, \qquad
a = \lambda\sum_{k=1}^{T} q_{tk}, \quad
b = \sum_{i} a_{ij}, \quad
c = -\sum_{i} x_{it}\frac{a_{ij}s_{jt}}{\sum_{j} a_{ij}s_{jt}}, \tag{14}
\]

\[
a_{ij} \leftarrow a_{ij}\frac{\sum_{t} s_{jt}\,x_{it}/(\mathbf{A}\mathbf{S})_{it}}{\sum_{t} s_{jt}}, \tag{15}
\]

where the alternating updates of S and A are again interleaved with proper scalings of the row vectors $\mathbf{s}_i$ (i = 1, ..., r) and of A in order to ensure that each row vector of S has unit variance. The cost function is invariant under these updates if and only if A and S are at a stationary point of the distance metric.

Proof: This follows the same line of analysis as Theorem 1; see the Appendix for details. Note that the nonnegativity constraint is preserved in (14): it is easy to verify that a = λ Σ_k q_{tk} > 0, i.e., each row sum of the matrix Q is positive; in addition, since c is non-positive, −4ac is nonnegative and so is the numerator of (14). Note also that if (15) is followed by the normalization a_{ij} ← a_{ij} / Σ_p a_{pj}, then b = 1.
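A minimal NumPy sketch of the updates (13), including the per-row unit-variance rescaling of S with a compensating column rescaling of A (the column-by-column compensation is one concrete way to realize the rescaling described in the theorem; Q is assumed to be built as in the earlier sketch, and the iteration count and initialization are illustrative):

```python
import numpy as np

def constrained_nmf_euclid(X, r, Q, lam=0.001, n_iter=500, eps=1e-12, seed=0):
    """Multiplicative updates (13) for J4 = ||X - AS||^2 + (lam/T) sum_i ||(I-T) s_i||^2.

    Q : (T, T) symmetric matrix (1/T)(I - T)^T (I - T) from the smoothness constraint.
    """
    rng = np.random.default_rng(seed)
    m, T = X.shape
    A = rng.random((m, r)) + eps
    S = rng.random((r, T)) + eps
    for _ in range(n_iter):
        # s_jt <- s_jt * (A^T X)_jt / (A^T A S + lam * S Q)_jt
        S *= (A.T @ X) / (A.T @ A @ S + lam * (S @ Q) + eps)
        # rescale each row of S to unit variance; compensate in the columns of A
        gamma = 1.0 / (S.std(axis=1, keepdims=True) + eps)
        S *= gamma
        A /= gamma.T
        # a_ij <- a_ij * (X S^T)_ij / (A S S^T)_ij
        A *= (X @ S.T) / (A @ S @ S.T + eps)
    return A, S
```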
3.2 Spatial decorrelation constraint
The spatial decorrelation constraint implies that the column vectors $\mathbf{s}_t$ are uncorrelated and the sample correlation matrix $\mathbf{V} = \mathbf{S}\mathbf{S}^{\top}/T$ is diagonally dominant. For this purpose, we propose to minimize the following regularized cost function:

\[
J_6 = \sum_{i=1}^{m}\sum_{t=1}^{T}\Bigl[x_{it}\log\frac{x_{it}}{(\mathbf{A}\mathbf{S})_{it}} - x_{it} + (\mathbf{A}\mathbf{S})_{it}\Bigr]
+ \frac{\lambda}{T}\sum_{i,j\neq i}(\mathbf{S}\mathbf{S}^{\top})_{ij}
- \frac{\lambda}{2T}\sum_{i=1}^{r}(\mathbf{S}\mathbf{S}^{\top})_{ii}. \tag{16}
\]

² Suppose each row vector $\mathbf{s}_i$ is scaled by a coefficient γ_i to ensure unit variance; then A needs to be rescaled correspondingly (the ith column of A divided by γ_i) before updating a_{ij}. See the Appendix for details.
Theorem 3  Given a small positive λ, the regularized cost function (16) is monotonically non-increasing under the learning rule

\[
s_{jt} \leftarrow \frac{b - \sqrt{b^2 - 4ac}}{-2a}, \qquad
a = -\frac{\lambda}{T}, \quad
b = \sum_{i=1}^{m} a_{ij} + \frac{\lambda}{T}\sum_{i=1, i\neq j}^{r} s_{it}, \quad
c = -\sum_{i=1}^{m} x_{it}\frac{a_{ij}s_{jt}}{\sum_{j} a_{ij}s_{jt}}, \tag{17}
\]

followed by the same update equation (15). The cost function is invariant under these updates if and only if A and S are at a stationary point of the distance metric.

Proof: The proof is similar to that of Theorem 2, except for the definition of the auxiliary function; we omit it here due to lack of space. Note that b² ≥ 4ac ≥ 0 is necessary to ensure that the square root is real-valued; this assumption is usually valid when λ is small and T is relatively large.

Remarks: When b² ≪ 4ac, the minimization of (16) may be approximated by the following learning rule (similar to [5]):

\[
s_{jt} \approx \mu\sqrt{\frac{c}{a}} = \mu\sqrt{\frac{T}{\lambda}\sum_{i} x_{it}\frac{a_{ij}s_{jt}}{\sum_{j} a_{ij}s_{jt}}}, \tag{18}
\]

where μ > 0 is a scaling parameter whose effect is wiped out by the normalization of a_{ij}. It is also possible to integrate the temporal smoothness and spatial decorrelation constraints into one cost function, for example

\[
J = \sum_{i=1}^{m}\sum_{t=1}^{T}\bigl[x_{it} - (\mathbf{A}\mathbf{S})_{it}\bigr]^2
+ \frac{\lambda_1}{T}\sum_{i=1}^{r}\|(\mathbf{I} - \mathbf{T})\mathbf{s}_i^{\top}\|^2
+ \frac{\lambda_2}{T}\sum_{i,j\neq i}(\mathbf{S}\mathbf{S}^{\top})_{ij}
- \frac{\lambda_2}{2T}\sum_{i=1}^{r}(\mathbf{S}\mathbf{S}^{\top})_{ii},
\]

whose minimization leads to a new update rule for the entries of S:

\[
s_{jt} \leftarrow s_{jt}\frac{(\mathbf{A}^{\top}\mathbf{X})_{jt}}{(\mathbf{A}^{\top}\mathbf{A}\mathbf{S} + \lambda_1\mathbf{S}\mathbf{Q})_{jt} + \frac{\lambda_2}{T}\bigl(\sum_{i=1, i\neq j}^{r} s_{it} - s_{jt}\bigr)}, \tag{19}
\]

where the choices of λ₁ and λ₂ depend on the practical needs regarding the specific constraints. Our experiments have confirmed its validity in various setups.
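A sketch of a single step of (19); the identity Σ_{i≠j} s_it = Σ_i s_it − s_jt is used to vectorize the decorrelation term, and the values of λ₁ and λ₂ remain the user's choice (as in the theorems, they must be small enough to keep the denominator positive):

```python
import numpy as np

def update_S_combined(X, A, S, Q, lam1=0.001, lam2=0.001, eps=1e-12):
    """One multiplicative step of (19): temporal smoothness + spatial decorrelation."""
    T = X.shape[1]
    col_sum = S.sum(axis=0, keepdims=True)         # sum_i s_it for each t
    decorr = (lam2 / T) * (col_sum - 2.0 * S)      # sum_{i != j} s_it - s_jt
    denom = A.T @ A @ S + lam1 * (S @ Q) + decorr + eps
    return S * (A.T @ X) / denom
```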
4 Simulation results
Early detection of Alzheimer's disease (AD) is a practically important yet challenging problem in the evaluation of human brain aging. In our experiments, real-life EEG data [11] were collected in Japanese hospitals from three distinct groups: healthy subjects, AD patients, and MCI (mild cognitive impairment) patients at high risk of developing AD within 1 to 1.5 years. The goal here is to extract features that help discriminate between healthy age-matched controls and MCI patients. The EEG recordings (200 Hz for 20 s, 21 channels) were first preprocessed to remove artifacts (e.g., eye blinks). Our method starts with a BSS/ICA algorithm [11] to extract the 6 most significant spatiotemporal components from each recording, which yields 6 × 4000 data points in the temporal domain. Second, Welch's method is applied to calculate the power spectra of the temporal signals, which are further normalized (i.e., each power spectrum integrates to 1). For illustration purposes, the normalized power spectra of two typical (but randomly selected) categories of data are shown in Fig. 1.
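A sketch of this spectral preprocessing step with SciPy's Welch estimator (the segment length nperseg and the function name are illustrative assumptions; only the 200 Hz sampling rate comes from the recordings described above):

```python
import numpy as np
from scipy.signal import welch

def normalized_power_spectra(components, fs=200.0, nperseg=512):
    """components: array of shape (n_components, n_samples), e.g. 6 x 4000.

    Returns frequencies and power spectra normalized so each spectrum integrates to 1.
    """
    freqs, psd = welch(components, fs=fs, nperseg=nperseg, axis=-1)
    df = freqs[1] - freqs[0]
    psd /= psd.sum(axis=-1, keepdims=True) * df   # unit area under each spectrum
    return freqs, psd
```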
Figure 1: Left: EEG normalized power spectra of randomly selected normal controls and MCI subjects. Right: Factorized spectral components.
Given their power spectra, we further apply the constrained NMF learning rule (the parameters α = 0.8 and λ = 0.001 were chosen by trial and error in our experiments) to extract 3 smooth spectral components from the power spectra of each group. The results showed that for MCI and AD patients the relative peaks are higher in the theta band (4-7 Hz) than for healthy subjects, whereas MCI patients also show much broader components in the alpha band (8-12 Hz) and beta band (13-30 Hz) compared to the normal case. These empirical yet quite consistent observations (across various MCI patients and healthy subjects) motivated us to exploit this idea to extract relevant features for pattern clustering or classification. For each subject, 100 Monte Carlo runs (each with a different random initialization) were conducted, and the mean and variance of two features were calculated: the powers of the alpha and beta bands (i.e., the area summed over the specific bandwidth and then averaged over the three power spectrum curves). The extracted features for all 60 subjects (38 healthy and 22 MCI) are shown in Fig. 2. At first sight (assuming the class labels are known for each subject), these two classes form quite distinct pattern clusters, despite the obvious overlap in some cases. Given the limited data at hand, we use a support vector machine (SVM) with a leave-one-out cross-validation procedure (www.kyb.tuebingen.mpg.de/bs/people/spider/) to classify the two classes based on the extracted two-dimensional features (mean values). A Gaussian kernel is used for training the SVM with the standard parameter setup σ = 1.1, C = 100. Empirical results showed that a misclassification rate of 13.3% (7 false negatives, 1 false positive) was obtained, achieving performance similar to our earlier results [11].
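The original experiments used the Spider Matlab toolbox; a roughly equivalent sketch with scikit-learn is shown below (mapping σ = 1.1 to gamma = 1/(2σ²) assumes the usual RBF parameterization and is an assumption of this illustration, as are the label conventions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loo_misclassification_rate(features, labels, sigma=1.1, C=100.0):
    """features: (n_subjects, 2) band-power features; labels: 0 = control, 1 = MCI."""
    clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), C=C)
    scores = cross_val_score(clf, features, labels, cv=LeaveOneOut())
    return 1.0 - scores.mean()   # leave-one-out misclassification rate
```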
5 Summary
By enforcing temporal smoothness, unit variance, and/or spatial decorrelation constraints on the row/column vectors of S, we derived constrained NMF rules for extracting physiologically meaningful temporal components, which may be useful for a wide range of engineering and neuroscience problems. Our preliminary results using simple NMF features have shown a promising application to early detection of Alzheimer's disease. We also argue that the constraints imposed on the learning rules enable the extracted components to have more interesting interpretations, which may be important for many biological/biomedical signals.

Appendix: The update rules are derived by fixing one factor and alternately updating the other. The convergence proof is based on the auxiliary function technique [4] and is briefly sketched here. Let G(θ, θᵏ) be an auxiliary function for the cost function J(θ) (to be minimized) if G(θ, θᵏ) ≥ J(θ) and G(θ, θ) = J(θ) hold. We then update the parameter θ by θᵏ⁺¹ = arg min_θ G(θ, θᵏ) to achieve monotonic non-increasing convergence: J(θᵏ⁺¹) ≤ G(θᵏ⁺¹, θᵏ) ≤ G(θᵏ, θᵏ) = J(θᵏ).
Figure 2: Left: scatter plot of healthy vs. MCI subjects projected onto the two-dimensional feature space (blue dots: normal controls; red circles: MCI patients). The two extracted features correspond to the averaged power in the theta band (abscissa) and the alpha band (ordinate). The central dots or circles correspond to mean values over 100 Monte Carlo runs, whereas the crosses represent the respective variances. Right: The decision boundary found by the SVM (circled points represent support vectors; squared points represent the misclassified points).

Proof of Theorem 1: For cost function (11), by fixing A, we first write an additive update rule for an individual s_{jt}:

\[
s_{jt} \leftarrow s_{jt} + \eta_{jt}\bigl[(\mathbf{A}^{\top}\mathbf{X})_{jt} - (\mathbf{A}^{\top}\mathbf{A}\mathbf{S})_{jt} - \lambda(\mathbf{S}\mathbf{Q})_{jt}\bigr]. \tag{20}
\]

If we set the learning rate $\eta_{jt} = \dfrac{s_{jt}}{(\mathbf{A}^{\top}\mathbf{A}\mathbf{S})_{jt} + \lambda(\mathbf{S}\mathbf{Q})_{jt}}$, then the multiplicative update rule for s_{jt} in (13) follows. When λ is sufficiently small, the nonnegativity of s_{jt} is assured. By reversing the roles of A and S, we can derive the multiplicative update rule for a_{ij}, which is independent of the regularization term. Next, we can follow a procedure similar to that in [4] to prove that the update rules guarantee monotonic non-increasing convergence; this is omitted here due to its great similarity. Finally, we show that the rescaling of the row vectors $\mathbf{s}_i$ (to assure unit variance), followed by the rescaling of A, does not change the convergence result. Suppose each row vector $\mathbf{s}_i$ is scaled by a coefficient γ_i such that the new row vector $\mathbf{s}'_i = \gamma_i\mathbf{s}_i$ has unit variance, Var[$\mathbf{s}'_i$] = 1; this is followed by the corresponding rescaling of A (the ith column of A divided by γ_i) before updating the new a_{ij} in (13). Note that $\mathbf{A}'\mathbf{S}' = \mathbf{A}\mathbf{S}$, where $\mathbf{S}' = [(\mathbf{s}'_1)^{\top}, \ldots, (\mathbf{s}'_r)^{\top}]^{\top}$. From

\[
\mathbf{A} \leftarrow \mathbf{A} \otimes (\mathbf{X}\mathbf{S}'^{\top}) \oslash (\mathbf{A}'\mathbf{S}'\mathbf{S}'^{\top}) \equiv \mathbf{A} \otimes (\mathbf{X}\mathbf{S}^{\top}) \oslash (\mathbf{A}\mathbf{S}\mathbf{S}^{\top})
\]

(where ⊗ and ⊘ denote elementwise multiplication and division, respectively), it is obvious that the rescalings of S and A do not change the update equation.

Proof of Theorem 2: For cost function (12), by fixing A, we construct an auxiliary function for S as follows:

\[
G(\mathbf{S}, \mathbf{S}^k) = \sum_{it}(x_{it}\log x_{it} - x_{it}) + \sum_{it}\sum_{j} a_{ij}s_{jt}
- \sum_{it}\sum_{j} x_{it}\frac{a_{ij}s^k_{jt}}{\sum_{p} a_{ip}s^k_{pt}}
\Bigl[\log(a_{ij}s_{jt}) - \log\frac{a_{ij}s^k_{jt}}{\sum_{p} a_{ip}s^k_{pt}}\Bigr]
+ \lambda\sum_{jt} s_{jt}(\mathbf{S}\mathbf{Q})_{jt}.
\]

It is easy to verify that G(S, S) = J(S). Similar to [4], we can also prove G(S, Sᵏ) ≥ J(S). To minimize J(S) with respect to s_{jt}, we update $s^{k+1}_{jt} = \arg\min G(s_{jt}, s^k_{jt})$, which can be solved by setting $\partial G(s_{jt}, s^k_{jt})/\partial s_{jt} = 0$:

\[
\frac{\partial G(s_{jt}, s^k_{jt})}{\partial s_{jt}}
= -\frac{1}{s_{jt}}\sum_{i} x_{it}\frac{a_{ij}s^k_{jt}}{\sum_{j} a_{ij}s^k_{jt}}
+ \sum_{i} a_{ij} + \lambda s_{jt}\sum_{k} q_{tk} = 0. \tag{21}
\]

Multiplying both sides by s_{jt} (assumed positive) and rearranging the terms yields

\[
\lambda s_{jt}^2\sum_{k} q_{tk} + \sum_{i} a_{ij}s_{jt} - \sum_{i}\frac{a_{ij}s^k_{jt}}{\sum_{j} a_{ij}s^k_{jt}}\,x_{it} = 0. \tag{22}
\]

Solving this quadratic equation in s_{jt} yields the update rule (14). Reversing the roles of S and A and repeating a similar procedure, we obtain (15). Note that a and −4ac are both nonnegative, and the regularization term has no influence on the update rule for a_{ij}. Likewise, it is readily shown that the rescalings of S and A do not change (15), since

\[
a_{ij} \leftarrow a_{ij}\frac{\sum_{t} s'_{jt}\,x_{it}/(\mathbf{A}'\mathbf{S}')_{it}}{\sum_{t} s'_{jt}}
\equiv a_{ij}\frac{\sum_{t} s_{jt}\,x_{it}/(\mathbf{A}\mathbf{S})_{it}}{\sum_{t} s_{jt}}.
\]
References

[1] Cichocki, A. and Georgiev, P. (2003). Blind source separation algorithms with matrix constraints. IEICE Trans. Info. Syst., vol. E86-A, no. 3, 522–531.
[2] Donoho, D. and Stodden, V. (2003). When does non-negative matrix factorization give a correct decomposition into parts? Proc. NIPS2002.
[3] Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5, 1457–1469.
[4] Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791. Also in Proc. NIPS2000 (Algorithms for non-negative matrix factorization).
[5] Li, S. Z., Hou, X., Zhang, H., and Cheng, Q. (2001). Learning spatially localized, parts-based representation. Proc. IEEE CVPR2001, pp. 207–210.
[6] Li, Y. and Cichocki, A. (2003). Non-negative matrix factorization and its application in blind sparse source separation with less sensors than sources. Proc. XIII Int. Symposium on Theoretical Electrical Engineering, Warsaw, Poland, pp. 285–288.
[7] Plumbley, M. D. (2003). Algorithms for nonnegative independent component analysis. IEEE Trans. Neural Networks, 14, 534–543.
[8] Sajda, P. et al. (2004). Nonnegative matrix factorization for rapid recovery of constituent spectra in magnetic resonance chemical shift imaging of the brain. IEEE Trans. Medical Imaging, 23(12), 1453–1465.
[9] Sha, F. and Saul, L. (2005). Real-time pitch determination of one or more voices by nonnegative matrix factorization. Proc. NIPS2004.
[10] Stone, J. V. (2001). Blind source separation using temporal predictability. Neural Computation, 13, 1559–1574.
[11] Cichocki, A. et al. (2005). EEG filtering based on blind source separation (BSS) for early detection of Alzheimer's disease. Clinical Neurophysiology, 116(3), 729–737.
Figure 3: Left: four source signals. Right: ten mixed signals.
Figure 4: Left: normalized power spectra of the four original source signals and the ten mixed signals. Right: four factorized spectral components (each with unit power) obtained with our constrained NMF (λ = 0.001).
Supplementary Experiments

Frequency Component Factorization: We first construct four synthetic signals with varying harmonic components, as follows:

s1(t) = [2 + sin(2πt)] sin(8πt), an amplitude-modulated signal;
s2(t) = 2 sin[2π(1 + 2t)t], a chirp signal;
s3(t) = 2 cos[2π(2 + 0.05 sin(10πt))t], a frequency-modulated signal;
s4(t): smoothed uniform noise.
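For concreteness, a sketch of generating these sources and the nonnegative mixtures described in the next paragraph (the smoothing window for the noise source and the random seed are assumptions of this illustration):

```python
import numpy as np

fs, n = 200.0, 1000                       # 200 Hz sampling, 1000 samples (5 s)
t = np.arange(n) / fs

s1 = (2 + np.sin(2 * np.pi * t)) * np.sin(8 * np.pi * t)              # AM signal
s2 = 2 * np.sin(2 * np.pi * (1 + 2 * t) * t)                          # chirp signal
s3 = 2 * np.cos(2 * np.pi * (2 + 0.05 * np.sin(10 * np.pi * t)) * t)  # FM signal
rng = np.random.default_rng(0)
s4 = np.convolve(rng.random(n), np.ones(20) / 20, mode="same")        # smoothed uniform noise

S = np.vstack([s1, s2, s3, s4])           # 4 x 1000 source matrix
A = rng.random((10, 4))                   # nonnegative random mixing matrix
X = A @ S                                 # 10 x 1000 observed mixtures
```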
The data are generated with a sampling frequency of 200 Hz, and 4 × 1000 samples are obtained in total (shown in Figure 3). Further, we generate a 10 × 4 nonnegative random mixing matrix A (A = rand(10,4) in Matlab notation) to obtain 10 × 1000 observed (mixed) samples. We then apply the standard Welch's method to estimate the normalized power spectra of the 4 source signals as well as the 10 mixed signals (illustrated in Figure 4); these power spectra are all nonnegative. We truncate the power spectra of the 10 mixed signals and save them in a 10 × 40 data matrix X. We then apply our constrained NMF learning rule (13) to X and inspect the factorized spectral components.

Blind extraction with sparse mixing and temporally smooth sources: This simulation shows that the NMF method can be used for blind source extraction when the number of sensors is much larger than the number of sources (this is often the case for EEG recordings).
Figure 5: Left: five source signals. Right: MSE error curve comparison between the constrained NMF (CNMF) and the standard NMF.
Figure 6: Left: decomposed signals with the standard NMF. Right: decomposed signals with our temporally smooth NMF (λ = 10 in this example).

We simulate five nonnegative source signals and a 20 × 5 random mixing matrix A with 20% sparsity (i.e., with 20 zero entries). We then apply NMF to the matrix X (of size 20 × 1000) and compare the results of the standard NMF and the constrained NMF with the temporal smoothness constraint (given the same initial conditions). Experimental results show that in this case our proposed algorithm produces better performance than the standard NMF algorithm, especially when the mixed sources appear more noisy (see Fig. 5 for an illustration of the performance comparison). In fact, in some cases we observed that our algorithm may achieve faster convergence than the standard NMF given the same initial conditions. However, since both the standard NMF algorithm and our modified algorithm may still get stuck in local minima, it is not guaranteed that our proposed algorithm always achieves better performance.
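A sketch of the data generation for this comparison (the particular smooth nonnegative sources and the smoothing window are assumptions of this illustration; the 20% sparsity pattern of the mixing matrix follows the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sensors, n_sources, n_samples = 20, 5, 1000

# nonnegative, temporally smooth sources (smoothed rectified noise is an assumption)
S = np.abs(rng.standard_normal((n_sources, n_samples)))
S = np.apply_along_axis(lambda s: np.convolve(s, np.ones(25) / 25, mode="same"), 1, S)

# 20 x 5 nonnegative mixing matrix with 20% sparsity (20 entries set to zero)
A = rng.random((n_sensors, n_sources))
zero_idx = rng.choice(A.size, size=int(0.2 * A.size), replace=False)
A.flat[zero_idx] = 0.0

X = A @ S   # 20 x 1000 observations fed to the standard and constrained NMF
```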