VARIATIONAL BAYESIAN LEARNING FOR WAVELET INDEPENDENT COMPONENT ANALYSIS

E. Roussos∗,†, S. Roberts∗ and I. Daubechies∗∗
∗ PARG, University of Oxford, Parks Road, Oxford, OX1 3PJ, UK.
† PACM, Princeton University, USA.
∗∗ PACM, Princeton University, Washington Road, Princeton, USA.

Abstract. In an exploratory approach to data analysis, it is often useful to consider the observations as generated from a set of latent generators or "sources" via a generally unknown mapping. For the noisy overcomplete case, where we have more sources than observations, the problem becomes extremely ill-posed. Solutions to such inverse problems can, in many cases, be achieved by incorporating prior knowledge about the problem, captured in the form of constraints [1]. This setting is a natural candidate for the application of the Bayesian methodology, allowing us to incorporate "soft" constraints in a natural manner. The work described in this paper is mainly driven by problems in functional magnetic resonance imaging of the brain, for the neuro-scientific goal of extracting relevant "maps" from the data. This can be stated as a 'blind' source separation problem. Recent experiments in the field of neuroscience [2] show that these maps are sparse, in some appropriate sense. The separation problem can be solved by independent component analysis (ICA), viewed as a technique for seeking sparse components, assuming appropriate distributions for the sources. We derive a hybrid wavelet-ICA model, transforming the signals into a domain where the modeling assumption of sparsity of the coefficients with respect to a dictionary is natural [6], [3]. We follow a graphical modeling formalism, viewing ICA as a probabilistic generative model. We use hierarchical source and mixing models and apply Bayesian inference to the problem. This allows us to perform model selection in order to infer the complexity of the representation, as well as automatic denoising. Since exact inference and learning in such a model is intractable, we follow a variational Bayesian mean-field approach in the conjugate-exponential family of distributions, for efficient unsupervised learning in multi-dimensional settings. The performance of the proposed algorithm is demonstrated on some representative experiments.

Keywords: Component separation, variational Bayesian learning, wavelets, fMRI
INTRODUCTION

Independent Component Analysis (ICA) is a data analysis framework that aims to extract the underlying "causes" that generate the observed data. ICA can be seen as a method for solving two complementary problems: (a) blind separation of hidden sources under an unknown mixing (an inverse problem); and (b) representation of the data in new coordinates such that the resulting signals are maximally independent. Indeed, geometrically, ICA can be seen as a method for finding an intrinsic coordinate frame in data space. There have been several algorithms for ICA in the statistics and machine learning literature, developed from the information maximization, maximum likelihood, and Bayesian points of view, as energy-based models, as well as several others from a signal processing perspective (see for example [9]). In this paper we take the view of ICA as a generative model, and treat the problem of estimating the sources as a statistical inference problem [7] in a Bayesian framework.

This paper is structured as follows. We first introduce ICA as a general decompositional model for spatio-temporal data. Then we present the hybrid wavelet-independent component analysis model and various ways of modelling sparsity in a graphical modelling framework, as well as the details of inference in this model. Finally, we present some representative results, followed by conclusions and discussion.
PRINCIPLES OF INDEPENDENT COMPONENT ANALYSIS

Given a data matrix X ∈ ℜ^{d×N}, we can formulate ICA as a general, noisy data decomposition model of the form:

$$ X = AS + E. \qquad (1) $$

For ICA applied to spatio-temporal data, we may regard the matrix A ∈ ℜ^{d×L} as containing the dynamics of the "sources" in its L columns, and the matrix S ∈ ℜ^{L×N} as containing the "spatial components" in its L rows. E is the noise process, modelled as a zero-mean Gaussian with diagonal covariance matrix Λ = diag(Λ_j), j = 1, …, d. Since both A and S are unknown, and in the presence of the noise process E, this is a severely ill-posed problem, especially for the case L > d, that we can only hope to solve by appropriately constraining it [1], [8]. This restricts the "search space" for {A, S} to a meaningful subset of the original one. ICA forces independence between the components of the basis, which in statistical language means that the joint density of the sources factorises; this is equivalent to minimizing the mutual information between the inferred sources [9]:

$$ p(\mathbf{s}_n) = \prod_{i=1}^{L} p_i(s_{i,n}), \qquad n = 1, \ldots, N. \qquad (2) $$
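To make the decomposition concrete, the following minimal Python sketch simulates data from the noisy model of Eq. (1) with super-Gaussian sources. The dimensions and the Laplacian source choice are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, N = 4, 6, 1000      # illustrative sizes; L > d gives the overcomplete case

A = rng.normal(size=(d, L))                # mixing matrix: L source "dynamics" columns
S = rng.laplace(scale=1.0, size=(L, N))    # sparse, super-Gaussian sources
Lam = 100.0 * np.ones(d)                   # diagonal noise precisions Lambda_j
E = rng.normal(scale=Lam**-0.5, size=(N, d)).T   # zero-mean Gaussian sensor noise
X = A @ S + E                              # Eq. (1): X = AS + E
```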
We can either select an appropriate functional form for the individual marginal distributions p_i(s_{i,n}) based on our prior knowledge about the problem, as was done in the original formulation of InfoMax ICA of Bell & Sejnowski for the separation of speech signals, or adaptively learn it from the data, as was done in [8], for example. In order to obtain the distribution of the data given the model parameters we must 'integrate out' the sources,
$$ p(\mathbf{x}|\mathbf{A}, \boldsymbol{\Lambda}) = \int d\mathbf{s}\; p(\mathbf{s})\, p(\mathbf{x}|\mathbf{s}, \mathbf{A}, \boldsymbol{\Lambda}). \qquad (3) $$
Unfortunately, this integration is in general intractable. There are several ways to approximate p(x|A, Λ), for example by a Laplace approximation, by introducing additional variational parameters, by using mixture models, or by sampling methods. Bayesian inference [7] provides a powerful method for ICA by incorporating uncertainty in the model and data into the estimation process, thus preventing "overfitting" and allowing a principled method for using our domain knowledge. The outputs are uncertainty estimates over all variables in the model, given the data. This requires performing integrations in high-dimensional spaces that, if tackled by sampling methods such as Markov chain Monte Carlo, can be very time-consuming, or even impractical, for large data sets. A mathematically elegant alternative, providing many of the benefits of sampling in terms of accuracy without the computational overhead, is offered by the variational Bayesian method. This is the inference method that we use in this paper.
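As a hedged illustration of why a sampling treatment of the integral in Eq. (3) becomes costly, consider a naive Monte Carlo estimator of the marginal likelihood. This is a sketch, not the paper's method; `sample_prior` is a hypothetical routine drawing sources from p(s):

```python
import numpy as np

def log_marginal_mc(x, A, Lam, sample_prior, n_samples=10_000):
    """Estimate log p(x|A, Lam) by averaging the Gaussian likelihood over
    source draws from the prior; the variance grows quickly with L."""
    d = x.shape[0]
    S = sample_prior(n_samples)                  # (n_samples, L) draws from p(s)
    resid = x[None, :] - S @ A.T                 # residuals x - A s per sample
    log_like = (0.5 * np.log(Lam).sum()
                - 0.5 * d * np.log(2 * np.pi)
                - 0.5 * (resid**2 * Lam[None, :]).sum(axis=1))
    m = log_like.max()                           # log-mean-exp for stability
    return m + np.log(np.exp(log_like - m).mean())
```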
THE WAVELET INDEPENDENT COMPONENT ANALYSIS MODEL

The 'standard' model of independent component analysis as a data representation model forms a mapping from observation space to a space of latent or hidden variables. For example, in its use as a multi-dimensional linear blind source separation model, the observation space, spanned by translations of the delta function {δ_n}∀n, is transformed to a space spanned by the columns of the "mixing" matrix, A. In this paper we consider an extension to this view of ICA by forming a representation in an intermediate feature space spanned, for example, by a wavelet family of localised time-frequency 'atoms' ψ_k(n). This is in essence a structural constraint, consistent with the Bayesian philosophy of data analysis, in the sense that features are matched to our notion of signals. The representation of a signal x(n) is formed by taking the projections of the data on the atoms as c_k = ⟨x, ψ_k⟩, ψ_k ∈ Ψ, a non-redundant dictionary. These projections are the inner products, that is the correlations, of the corresponding signals with the atoms. The idea of linear W-ICA is based on the following proposition (see also section 13.1 of [9]):

Proposition 1. A (component-wise) linear operator T = (T_1, …, T_L) leaves the property of independence unaffected.

Having ensured independence, we can exploit the properties of the wavelet transform in order to do blind signal separation. One big advantage of the wavelet transform is that it gives a parsimonious representation of the data; that is, we only need a few wavelet coefficients c_k in order to represent a given signal. In other words, most c_k's will be almost zero. In probabilistic terms this means that the distribution of c_k will be highly peaked. An important consequence of this representation is that the notion of independence is more plausible in feature space [6], [3], i.e. with respect to the wavelet coefficients. Using the above representation for all signals in the model, the equivalent ICA model in k-space becomes:

$$ \tilde{\mathbf{x}}_k = \mathbf{A}\,\mathbf{c}_k + \tilde{\boldsymbol{\varepsilon}}_k. \qquad (4) $$
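In practice the transform to k-space can be computed with any non-redundant wavelet dictionary. A minimal sketch using the PyWavelets package follows; the choice of `db4` and periodic boundary handling is our assumption for illustration:

```python
import numpy as np
import pywt

def to_k_space(X, wavelet='db4'):
    """Apply the same orthogonal wavelet transform to every row of X.
    Linearity of the transform means X = A S + E carries over to
    x_tilde_k = A c_k + eps_tilde_k (Eq. (4)), coefficient by coefficient."""
    rows = [np.concatenate(pywt.wavedec(x, wavelet, mode='periodization'))
            for x in X]
    return np.stack(rows)    # d x K matrix of wavelet coefficients
```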
Modelling sparsity: Bayesian priors and wavelets

The above idea gives us the ability to use ICA algorithms that enforce super-Gaussian distributions on the sources. By the central limit theorem, we can be sure that since the observations (i.e. the intermediate variables here) are described by highly-peaked distributions centered at zero, the "true" sources will be described by even more
highly-peaked ones. Furthermore, we can use any of the standard wavelet denoising algorithms in order to clean the signals. We note, however, that as we have an explicit noise process in the ICA model, small wavelet coefficients are explained by the noise process rather than included in any reconstructed source. We thus achieve simultaneous source reconstruction and denoising.

A classical prior for wavelets is the Laplacian distribution, corresponding to an L1 norm, or the generalised exponential (also called generalised Gaussian) distribution, corresponding to an L_p norm [3]. Another sparse prior that has been used is the Cauchy-Lorentz distribution. Since we are interested in working in the variational Bayesian framework, we would like to use distributions from the conjugate-exponential family, in order to obtain analytical expressions. One option is to use a 'sparse Bayesian' hierarchical prior, which corresponds to a Student-t marginal for p(s_i). In our implementation we enforce sparsity by restricting the general mixture-of-Gaussians source model to a two-component, zero-mean mixture of Gaussians prior over c_k. The two components have zero mean, with hyperpriors over the precisions such that one component has a high precision and the other a low precision. A two-state, zero-mean mixture model is an effective distribution for modelling sparse signals, both empirically and theoretically. Using this flexible model we can adapt to various degrees of sparsity and different shapes of prior. The parameters of the component Gaussians are the means µ_{i,q_i}, the precisions β_{i,q_i}, and the associated mixing proportions p(q_i = q_i|π) = π_{i,q_i}, where q_i ∈ {1, …, M_i} denotes the (hidden) state of the mixture modelling the i-th source (i = 1, …, L). The set of all parameters for the source model is collected in θ = {π, µ, β}. Since we are interested in sparsity we set µ_{i,q_i} = 0 a priori. The joint prior on the wavelet coefficients is therefore:

$$ p(C|q, \beta) = \prod_{ik} p(c_{ik}|q_{ik}, \beta_{i,q_i}) = \prod_{ik} \mathcal{N}\!\left(c_{ik};\, 0,\, \beta_{i,q_i}^{-1}\right) = \prod_{ik} (2\pi)^{-\frac{1}{2}} \beta_{i,q_i}^{\frac{1}{2}}\, e^{-\frac{1}{2}\beta_{i,q_i} c_{ik}^2}, \qquad (5) $$

and the prior on the hidden state variables q = [q_{ik}] is:

$$ p(q|\pi) = \prod_{ik} p(q_{ik}|\pi_i) = \prod_{ik} \pi_{i,q_{ik}}, \quad \text{where} \quad p\big(q_{ik} = q_i \,\big|\, (\pi_{i,1}, \ldots, \pi_{i,M_i})\big) = \pi_{i,q_i}. \qquad (6) $$
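A quick numerical check of this prior is to draw coefficients from a two-state, zero-mean mixture and verify that the marginal is heavy-tailed. The precisions and proportions below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)
K = 4096                                # number of coefficients to draw

pi = np.array([0.15, 0.85])             # mixing proportions pi_{i,q}
beta = np.array([0.5, 200.0])           # one low precision, one high precision

q = rng.choice(2, size=K, p=pi)         # hidden states, Eq. (6)
c = rng.normal(0.0, beta[q]**-0.5)      # coefficients, Eq. (5)

print(kurtosis(c))   # strongly positive excess kurtosis: a sparse marginal
```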
GRAPHICAL MODELLING FRAMEWORK

The probabilistic dependence relationships in our model can be represented in a directed acyclic graph (DAG) G, also known as a Bayesian network structure. In this graph, random variables are represented as nodes and "causal" relationships between nodes as edges. The graphical model for W-ICA is depicted in Fig. 1. Since we are interested in a fully Bayesian approach, we need to put distributions over all unknowns in the model, both latent variables {C, q} and parameters {A, Λ, θ}. Before we proceed with inference we need to state the prior distributions over the nodes in G; we will use conjugate priors for this model.
[Figure 1: DAG with parameter nodes π_i, µ_i, σ_i and hidden states q_i feeding the wavelet coefficients c_1, …, c_L, which mix through the rows A_j of A, with noise precisions Λ_j, to produce the observations x_j, j = 1, …, d.]
FIGURE 1. The probabilistic graphical model for Wavelet-Independent Component Analysis. Thick arrows represent deterministic relations and dotted squares represent plates.
Precisions p(β). The prior on the precisions of the components is a gamma distribution, with β = {β_i}_i and β_i = (β_{i,1}, …, β_{i,M_i}):

$$ p(\beta) = \prod_i p(\beta_i) = \prod_i \prod_{q_i=1}^{M_i} \mathcal{G}\big(\beta_{i,q_i} \,\big|\, b_i^0, c_i^0\big) = \prod_i \prod_{q_i=1}^{M_i} \frac{1}{\Gamma(c_i^0)\,(b_i^0)^{c_i^0}}\, \beta_{i,q_i}^{c_i^0 - 1}\, e^{-\beta_{i,q_i}/b_i^0}. \qquad (7) $$
Mixing proportions p(π). The prior over the mixing proportions is a symmetric Dirichlet, with π = {π_i} and π_i = (π_{i,1}, …, π_{i,M_i}):

$$ p(\pi) = \prod_i \mathcal{D}\big(\pi_i \,\big|\, \{\lambda_i^0\}\big) = \prod_i \frac{\Gamma\!\left(\sum_{q_i=1}^{M_i} \lambda_i^0\right)}{\prod_{q_i=1}^{M_i} \Gamma(\lambda_i^0)} \prod_{q_i=1}^{M_i} (\pi_{i,q_i})^{\lambda_i^0 - 1}. \qquad (8) $$
Mixing matrix p(A|α). For the wavelet-ICA model, the prior over the mixing matrix is Gaussian with zero mean and precision α_{ji}, i.e.

$$ p(A|\alpha) = \prod_{j=1}^{d} \prod_{i=1}^{L} \mathcal{N}\big(A_{ji}\,\big|\,0, \alpha_{ji}^{-1}\big) = \prod_{j=1}^{d} \prod_{i=1}^{L} (2\pi)^{-\frac{1}{2}} \alpha_{ji}^{\frac{1}{2}}\, e^{-\frac{1}{2}\alpha_{ji} A_{ji}^2}, \quad \text{where } \alpha_{ji} = \alpha_i \;\; \forall j. \qquad (9) $$

Precisions of the mixing matrix p(α). The prior over each α_i is a gamma distribution, p(α_i) = \mathcal{G}(α_i | b_{α_i}, c_{α_i}). Hence p(α) is a product of gamma distributions:
$$ p(\alpha) = \prod_{i=1}^{L} \mathcal{G}\big(\alpha_i \,\big|\, b_{\alpha_i}^0, c_{\alpha_i}^0\big) = \prod_{i=1}^{L} \frac{1}{\Gamma(c_{\alpha_i}^0)\,(b_{\alpha_i}^0)^{c_{\alpha_i}^0}}\, \alpha_i^{c_{\alpha_i}^0 - 1}\, e^{-\alpha_i / b_{\alpha_i}^0}, \qquad (10) $$
where the components of α = (α_1, …, α_i, …, α_L) correspond to the column vectors {a_i} of the mixing matrix, and respectively to the sources i = 1, …, L. By monitoring the evolution of the precision hyperparameters α, the relevance of each source may be determined (this is referred to as Automatic Relevance Determination (ARD)). If α_i is large, column i of A will be close to zero, indicating that source i is irrelevant.

Sensor noise precision p(Λ). Finally, the prior over the sensor noise precision is a product of gammas:

$$ p(\Lambda) = \prod_{j=1}^{d} \mathcal{G}\big(\Lambda_j \,\big|\, b_{\Lambda_j}, c_{\Lambda_j}\big) = \prod_{j=1}^{d} \frac{1}{\Gamma(c_{\Lambda_j})\,(b_{\Lambda_j})^{c_{\Lambda_j}}}\, \Lambda_j^{c_{\Lambda_j} - 1}\, e^{-\Lambda_j / b_{\Lambda_j}}. \qquad (11) $$
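A concrete, weakly informative initialisation of these conjugate hyperpriors might look as follows; the numerical values are our assumptions, as the paper does not specify them:

```python
import numpy as np

d, L, M = 4, 6, 2     # sensors, sources, mixture states per source

hyper = {
    'b0':      1e3 * np.ones((L, M)),   # gamma scales for beta,   Eq. (7)
    'c0':      1e-3 * np.ones((L, M)),  # gamma shapes for beta,   Eq. (7)
    'lambda0': np.ones(L),              # symmetric Dirichlet on pi, Eq. (8)
    'b_alpha': 1e3 * np.ones(L),        # gamma scales for alpha,  Eq. (10)
    'c_alpha': 1e-3 * np.ones(L),       # gamma shapes for alpha,  Eq. (10)
    'b_Lam':   1e3 * np.ones(d),        # gamma scales for Lambda, Eq. (11)
    'c_Lam':   1e-3 * np.ones(d),       # gamma shapes for Lambda, Eq. (11)
}
```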
Approximate Inference by Free Energy Minimization

We have just stated a generative model for ICA: the observed data are explained by propagating probabilities, initially drawn from the "roots" of the DAG, via the edges and using the observation model Eq. (1), performed in feature space in our case. Let us collect all unknowns in U = {Λ, A, α, C, q, π, µ, β}. Exact Bayesian inference in these types of models is generally intractable, since we need, in principle, to obtain the joint posterior of U given the data X, p(U|X). Instead, we will use the variational Bayes framework for efficient approximate inference. The idea is to approximate the exact posterior with an approximate one, Q(U), that is closest to p(U|X) in an appropriate sense, here the Kullback-Leibler divergence. The cost functional for the hybrid Wavelet-ICA model is the variational negative free energy of the system:

$$ \mathcal{F}(Q, X) = \langle \log p(X, U) \rangle + H[Q(U)], \qquad (12) $$

where the average is over Q(U) and the second term is the entropy of Q(U). We choose to restrict the variational posterior Q(U) to belong to a class of distributions that factorise over subsets of the ensemble of variables:

$$ Q(U) = Q(\Lambda)\, Q(A)\, Q(\alpha)\, Q(C)\, Q(q)\, Q(\pi)\, Q(\mu)\, Q(\beta). \qquad (13) $$

Performing functional optimisation with respect to the distributions of the unknown variables, we obtain the optimal form of the posterior for each u ∈ U:

$$ Q(u) \propto p(u)\, \exp \langle \log p(D, U) \rangle_{Q(U \setminus \{u\})}. \qquad (14) $$

This results in a system of coupled update equations, which need to be solved iteratively. Theoretical results show that the algorithm converges to a (local) optimum [9]. Since we have chosen to work in the conjugate-exponential family, the posterior distributions have the same functional form as the priors, and the update equations are essentially "moves", in parameter space, of the parameters of the priors due to observing the data.
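The overall scheme implied by Eq. (14) is thus a coordinate ascent on the free energy of Eq. (12). The skeleton below sketches the loop, with each `update_*` standing for the closed-form refinement of one factor in Eq. (13); the `posterior` object and its method names are hypothetical:

```python
import numpy as np

def fit_vb(X_tilde, posterior, n_iter=200, tol=1e-5):
    """Iterate the coupled VB updates until the negative free energy,
    which is guaranteed not to decrease, stops improving."""
    F_old = -np.inf
    for _ in range(n_iter):
        posterior.update_C(X_tilde)        # wavelet coefficients, Eqs. (15)-(16)
        posterior.update_q(X_tilde)        # mixture responsibilities gamma
        posterior.update_A(X_tilde)        # mixing matrix, Eq. (17)
        posterior.update_alpha()           # ARD precisions, prune dead sources
        posterior.update_pi()              # mixing proportions
        posterior.update_beta()            # component precisions
        posterior.update_Lambda(X_tilde)   # sensor noise precisions
        F = posterior.free_energy(X_tilde) # Eq. (12)
        if F - F_old < tol:
            break
        F_old = F
    return posterior
```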
Here we give only a representative selection of these update equations, due to lack of space.

Wavelet coefficients C. The variational posterior has a Gaussian functional form, N(c_{ik} | µ̂_{ik}, β̂_{ik}^{-1}), with precision and mean:

$$ \hat{\beta}_{ik} = \sum_{q_i=1}^{M_i} \gamma_{ikq_i} \langle \beta_{iq_i} \rangle + \sum_{j=1}^{d} \langle \Lambda_j \rangle \langle (A_{ji})^2 \rangle, \qquad (15) $$

and

$$ \hat{\mu}_{ik} = \frac{1}{\hat{\beta}_{ik}} \left[ \sum_{q_i=1}^{M_i} \gamma_{ikq_i} \langle \beta_{iq_i} \rangle \mu_{i,q_i} + \sum_{j=1}^{d} \langle \Lambda_j \rangle \langle A_{ji} \rangle \left( \tilde{x}_{j,k} - \langle \hat{x}_{j,k|i} \rangle \right) \right], \quad \text{where} \quad \hat{x}_{j,k|i} = \sum_{i' \neq i} A_{ji'}\, c_{i'k}. \qquad (16) $$
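In code, one sweep of Eqs. (15) and (16) for source i can be vectorised over k as below. This is a sketch under our zero-prior-mean setting µ_{i,q_i} = 0, so the first term of Eq. (16) drops out; the argument names are assumptions:

```python
import numpy as np

def update_coeffs_i(i, X_tilde, E_A, E_A2, E_Lam, E_beta_i, gamma_i, E_C):
    """gamma_i: (K, M_i) responsibilities; E_* are posterior moments of
    A, A^2, Lambda, beta_i, and the coefficient means E_C (L, K)."""
    beta_hat = gamma_i @ E_beta_i + E_Lam @ E_A2[:, i]          # Eq. (15)
    # contribution of all other sources to each sensor, <x_hat_{j,k|i}>
    X_hat_minus_i = E_A @ E_C - np.outer(E_A[:, i], E_C[i])
    mu_hat = (E_Lam * E_A[:, i]) @ (X_tilde - X_hat_minus_i) / beta_hat  # Eq. (16)
    return mu_hat, beta_hat
```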
Mixing matrix A. The variational posterior over the mixing matrix is a product of Gaussians with precision and mean parameters:

$$ \hat{\alpha}_{ji} = \alpha_i + \langle \Lambda_j \rangle \sum_{k=1}^{K} \langle (c_{i,k})^2 \rangle, \qquad \hat{m}_{ji} = \frac{1}{\hat{\alpha}_{ji}} \langle \Lambda_j \rangle \sum_{k=1}^{K} \langle c_{i,k} \rangle \left( \tilde{x}_{j,k} - \langle \hat{x}_{j,k|i} \rangle \right). \qquad (17) $$

All averages ⟨·⟩ are over the corresponding posteriors Q(·).
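Similarly, Eq. (17) vectorises over all (j, i) pairs. The sketch below assumes the coefficient posterior means `E_C` and second moments `E_C2` are available, and uses the factorisation of Q to split the cross term:

```python
import numpy as np

def update_A(X_tilde, E_A, E_alpha, E_Lam, E_C, E_C2):
    """E_A: current (d, L) posterior mean of A; E_C: (L, K) coefficient
    means; E_C2: (L, K) second moments. Implements Eq. (17)."""
    alpha_hat = E_alpha[None, :] + np.outer(E_Lam, E_C2.sum(axis=1))
    G = E_C @ E_C.T                           # sum_k <c_{i'k}> <c_{ik}>
    cross = E_A @ G - E_A * np.diag(G)        # excludes the i' = i term
    m_hat = E_Lam[:, None] * (X_tilde @ E_C.T - cross) / alpha_hat
    return m_hat, alpha_hat
```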
RESULTS

As with any mathematical model applied to "real-world" data, the semantic interpretation of the results of an ICA decomposition can only be performed according to knowledge in the particular application domain. ICA has been successfully applied to 4D spatio-temporal fMRI data (see [4] for example), in two different "flavours": temporal ICA (tICA), where each source is a timeseries, and spatial ICA (sICA), where each source is an image. In [5] we reported results of applying "classical" ICA algorithms to fMRI from the point of view of relating the statistical and geometrical characteristics of the data to the properties of the algorithm. While the present paper is more mathematically oriented, this work was mainly influenced by experiments of the neuroscience community [2] in functional MRI, and it is a continuation of the results reported in [5]. This is a new visual-task block-design fMRI experiment inspired by [4], using a superposition of two spatio-temporal patterns. There were 17 runs for each subject, each run resulting in 238 full brain volumes of 15 slices each. Here, we present a representative 'run' of W-ICA on one of the datasets of that experiment, keeping in mind that it is outside the scope of this paper to describe the experiment in detail; the reader is kindly referred to [5] for more details. We ran W-ICA on a dataset that corresponds to a "full activation checkerboard" visual pattern in order to detect 'consistently task related' ("predictable") components. The result is shown in Fig. 2. The corresponding timecourse of the component, also shown in Fig. 2, shows a very good correspondence with the stimulus. The selection of the wavelet basis was found to be not very critical for the results, as long as we retained a certain level of smoothness.
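The matching of components to the stimulus described above can be automated with a simple correlation criterion; the following is a hypothetical post-processing helper, not part of the W-ICA algorithm itself:

```python
import numpy as np

def best_matching_component(A_mean, stimulus):
    """Return the index of the column of <A> (the component timecourses)
    most correlated, in absolute value, with the stimulus waveform."""
    r = [abs(np.corrcoef(A_mean[:, i], stimulus)[0, 1])
         for i in range(A_mean.shape[1])]
    return int(np.argmax(r))
```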
FIGURE 2. Component 'map' and corresponding 'timecourse' resulting from the application of W-ICA to fMRI data. Line 1 corresponds to the stimulus and line 2 to the extracted timecourse from the algorithm; this is the column of the mixing matrix A that best matches the stimulus.
DISCUSSION AND CONCLUSIONS

We have presented an extension of ICA to a hybrid model incorporating wavelets and sparsity-inducing adaptive priors under a fully Bayesian paradigm. This enables the estimation of the latent sources and the mixing system in a probabilistic graphical modelling formalism. We employed a variational framework for efficient inference. This model can be extended in various ways. First, we can take into account the residual dependencies in the wavelet coefficients in order to enhance separation. Second, we can adapt the dictionary itself in order to better match the structure of the signals. Another area of current investigation is the use of complex wavelets. We are also currently developing blind separation models from a functional-analytic point of view. Comparing the benefits of these with Bayesian models is of great interest, and results will be reported in the near future. Finally, full results of applying sparse models to neuroscientific data, and their importance for brain science, remain to be thoroughly investigated.
REFERENCES

1. I. Daubechies, M. Defrise, and C. De Mol, Comm. Pure Appl. Math., 57, Wiley, 2004.
2. J. Haxby, M. I. Gobbini, M. L. Furey, and A. Ishai, "Distributed and Overlapping Representations of Faces and Objects in Ventral Temporal Cortex", Science, 293, pp. 2425–2430, 2001.
3. A. Mohammad-Djafari and M. Ichir, AIP Conf. Proc., 659, 1 (2003).
4. V. Calhoun, T. Adali, G. Pearlson, and J. Pekar, "Spatial and temporal independent component analysis of functional MRI data containing a pair of task-related waveforms", Hum. Brain Mapp., 13(1), pp. 43–53, May 2001.
5. M. Benharrosh, E. Roussos, S. Takerkart, K. D'Ardenne, W. Richter, J. Cohen, and I. Daubechies, "ICA components in fMRI analysis: Independent sources?", in 10th Neuroinformatics Annual Meeting, A Decade of Neuroscience Informatics: Looking Ahead, April 26–27, 2004.
6. M. Zibulevsky and B. Pearlmutter, "Blind Source Separation by Sparse Decomposition in a Signal Dictionary", Neural Computation, 13, pp. 863–882, 2001.
7. K. Knuth, "A Bayesian approach to source separation", in Proceedings of the Independent Component Analysis Workshop, pp. 283–288, 1999.
8. S. Roberts, E. Roussos, and R. Choudrey, "Hierarchy, priors and wavelets: structure and signal modelling using ICA", Signal Processing, 84(2), Feb. 2004.
9. A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley, 2001.