SOFT BAYESIAN PURSUIT ALGORITHM FOR SPARSE REPRESENTATIONS

Angélique Drémeau (a,b), Cédric Herzet (c) and Laurent Daudet (a)

(a) Institut Langevin, ESPCI ParisTech, Univ Paris Diderot, CNRS UMR 7587, F-75005 Paris, France
(b) Fondation Pierre-Gilles De Gennes pour la Recherche, 29 rue d'Ulm, F-75005 Paris, France
(c) INRIA Centre Rennes - Bretagne Atlantique, Campus universitaire de Beaulieu, F-35000 Rennes, France

ABSTRACT

In this paper, we address the problem of sparse representation within a Bayesian framework. As a continuation of previous work [1], we consider a Bernoulli-Gaussian model and the use of a mean-field approximation. The resulting algorithm is shown to have very good performance over a wide range of sparsity levels.

Index Terms— Sparse representations, Bernoulli-Gaussian model, mean-field approximation.

1. INTRODUCTION

Sparse representations (SR) aim at describing a signal as the combination of a small number of atoms chosen from an overcomplete dictionary. More precisely, let y ∈ R^N be an observed signal and D ∈ R^{N×M} a rank-N matrix whose columns are normalized to 1. One possible formulation of the SR problem writes

x^\star = \arg\min_{x} \|y - Dx\|_2^2 + \lambda \|x\|_0,    (1)
where ‖x‖_0 denotes the number of nonzero elements in x and λ is a parameter specifying the trade-off between sparsity and distortion. Finding the exact solution of (1) is usually an intractable problem. Hence, suboptimal algorithms have to be considered in practice. Among the large number of SR algorithms available in the literature, let us mention: iterative hard thresholding (IHT) [2], which iteratively thresholds to zero certain coefficients of the projection of the SR residual on the considered dictionary; matching pursuit (MP) [3] or subspace pursuit (SP) [4], which build up the sparse vector x by making a succession of greedy decisions; and basis pursuit (BP) [5], which solves a relaxed version of (1) by means of standard convex optimization procedures. A particular family of SR algorithms relies on a Bayesian formulation of the SR problem, see e.g., [6, 7, 8]. In a nutshell, the idea of these approaches is to model y as the output of a stochastic process (promoting sparsity on x) and apply statistical tools to infer the value of x. In this context, we
recently introduced [1] a new family of Bayesian pursuit algorithms based on a Bernoulli-Gaussian probabilistic model. These algorithms generate a solution of the SR problem by making a sequence of hard decisions on the support of the sparse representation. In this paper, building on our previous work [1], we propose a novel SR algorithm dealing with "soft" decisions on the support of the sparse representation. Our algorithm is based on the combination of a Bernoulli-Gaussian (BG) model and a mean-field (MF) approximation. The proposed methodology allows us to keep a measure of the uncertainty on the decisions made on the support throughout the whole estimation process. We show that, as far as our simulation setup is concerned, the proposed algorithm is very competitive with state-of-the-art procedures.

2. MODEL AND BAYESIAN PURSUIT

In this section, we first introduce the probabilistic model which will be used to derive our SR algorithm. Then, for the sake of comparison with the proposed methodology, we briefly recall the main expressions of the Bayesian Matching Pursuit (BMP) algorithm introduced in [1].

2.1. Probabilistic Model

Let s ∈ {0, 1}^M be a vector defining the SR support, i.e., the subset of columns of D used to generate y. Without loss of generality, we adopt the following convention: if s_i = 1 (resp. s_i = 0), the ith column of D is (resp. is not) used to form y. Denoting by d_i the ith column of D, we then consider the following observation model:
y = \sum_{i=1}^{M} s_i x_i d_i + n,    (2)
where n is a zero-mean white Gaussian noise with variance σ_n^2. Therefore,

p(y|x, s) = \mathcal{N}(D_s x_s, \sigma_n^2 I_N),    (3)
where I_N is the N×N identity matrix and D_s (resp. x_s) is a matrix (resp. vector) made up of the d_i's (resp. x_i's) such that s_i = 1. We suppose that x and s obey the following probabilistic model:

p(x) = \prod_{i=1}^{M} p(x_i),    p(s) = \prod_{i=1}^{M} p(s_i),    (4)
where p(x_i) = \mathcal{N}(0, \sigma_x^2), p(s_i) = \mathrm{Ber}(p_i), and Ber(p_i) denotes a Bernoulli distribution with parameter p_i. Note that model (3)-(4) (or variants thereof) has already been used in many Bayesian algorithms available in the literature, see e.g., [1, 6, 9, 10]. The originality of this contribution is in the way we exploit it.

2.2. Bayesian Matching Pursuit

We recently showed in [1] that, under mild conditions, the solution of the maximum a posteriori (MAP) estimation problem,

(\hat{x}, \hat{s}) = \arg\max_{x,s} \log p(x, s|y),    (5)
is equal to the solution of the standard SR problem (1). This result led us to the design of a new family of Bayesian pursuit algorithms. In particular, we recall hereafter the main expressions of the Bayesian Matching Pursuit (BMP) algorithm. BMP is an iterative procedure looking sequentially for a solution of (5). It proceeds like its standard counterpart, MP, by modifying a single couple (x_i, s_i) at each iteration, namely the one leading to the highest increase of log p(x, s|y). It can then be shown that the (locally) optimal update of the selected coefficient x_i is given by
\hat{x}_i^{(n)} = \hat{s}_i^{(n)} \, \frac{\sigma_x^2}{\sigma_n^2 + \sigma_x^2} \, r_i^{(n)T} d_i,    (6)

where

r_i^{(n)} = y - \sum_{j \neq i} \hat{s}_j^{(n-1)} \hat{x}_j^{(n-1)} d_j,    (7)

and n is the iteration number.
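To make the BMP update above concrete, the following minimal Python/NumPy sketch implements the coefficient update (6) and the residual (7) for a given index i. It is only an illustration of the equations as reconstructed here, not the authors' implementation: the names (bmp_coefficient_update, x_hat, s_hat, etc.) are ours, and BMP's atom-selection rule, its decision rule for ŝ_i and its stopping criterion are not reproduced.

```python
import numpy as np

def bmp_coefficient_update(y, D, x_hat, s_hat, i, sigma_n2, sigma_x2):
    """One BMP-style update of the coefficient x_i, following (6)-(7).

    Illustrative sketch only: the rule deciding s_hat[i] and the
    atom-selection step of BMP are not reproduced here.
    """
    # Residual excluding the contribution of atom i, cf. (7)
    r_i = y - D @ (s_hat * x_hat) + s_hat[i] * x_hat[i] * D[:, i]
    # Locally optimal coefficient given the current support decision, cf. (6)
    x_hat[i] = s_hat[i] * sigma_x2 / (sigma_n2 + sigma_x2) * (D[:, i] @ r_i)
    return x_hat
```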
3. A NEW SR ALGORITHM BASED ON A MEAN-FIELD APPROXIMATION

The equivalence between (5) and (1) motivates the use of model (3)-(4) in SR problems and offers interesting perspectives. We study in this paper the possibility of considering some of the variables as hidden. In particular, we consider the problem of making a decision on the SR support as

\hat{s} = \arg\max_{s \in \{0,1\}^M} \log p(s|y),    (8)

where p(s|y) = \int_x p(x, s|y) \, dx. Note that, as long as (3)-(4) is the true generative model for the observations y, (8) is the decision minimizing the probability of a wrong decision on the SR support. It is therefore optimal in that sense.

Unfortunately, problem (8) is intractable since it typically requires evaluating the cost function log p(s|y) for all 2^M possible sequences in {0, 1}^M. In this paper, we propose to simplify this optimization problem by considering a MF approximation of p(x, s|y). Note that the combination of a BG model and MF approximations to address the SR problem has already been considered in some contributions [8, 11]. However, the latter differ from the proposed approach in several aspects. In [8], the authors considered a tree-structured version of the BG model which was dedicated to a specific application (namely, the sparse decomposition of an image in wavelet or DCT bases). Moreover, the authors considered a different MF approximation than the one proposed here (see Section 3.1). In [11], we applied MF approximations to a different BG model, which led to different SR algorithms.

3.1. MF approximation

A MF approximation of p(x, s|y) is a probability distribution constrained to have a "suitable" factorization, p(x, s|y) ≃ \prod_i q(x_i, s_i), while minimizing the Kullback-Leibler distance with p(x, s|y). This estimation problem can be solved by the so-called "variational Bayes EM (VB-EM) algorithm", which iteratively evaluates the different elements of the factorization. We refer the reader to [12] for a detailed description of the VB-EM algorithm. In this paper, we consider the particular case where the MF approximation of p(x, s|y), say q(x, s), is constrained to have the following structure:

q(x, s) = \prod_i q(x_i, s_i).    (9)

Particularized to (9), the VB-EM algorithm evaluates the q(x_i, s_i)'s by computing at each iteration¹:

q(x_i, s_i) = q(x_i|s_i) \, q(s_i),  ∀i,    (10)

where

q(x_i|s_i) = \mathcal{N}(m(s_i), \Gamma(s_i)),    (11)

q(s_i) \propto \sqrt{2\pi\Gamma(s_i)} \, \exp\!\left( \frac{1}{2} \frac{m(s_i)^2}{\Gamma(s_i)} \right) p(s_i),    (12)

and

\Gamma(s_i) = \frac{\sigma_x^2 \sigma_n^2}{\sigma_n^2 + \sigma_x^2 s_i},    (13)

m(s_i) = s_i \, \frac{\sigma_x^2}{\sigma_n^2 + \sigma_x^2 s_i} \, \langle r_i \rangle^T d_i,    (14)

\langle r_i \rangle = y - \sum_{j \neq i} q(s_j = 1) \, m(s_j = 1) \, d_j.    (15)
Note that the VB-EM algorithm is guaranteed to converge to a saddle point or a (local or global) maximum of the problem.

¹ When clear from the context, we drop the iteration indices in the rest of the paper.
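As an illustration of the updates (11)-(15), here is a minimal Python/NumPy sketch of one VB-EM sweep over all indices. It is a sketch under simplifying assumptions, not the authors' implementation: the function and variable names are ours, only q(s_i = 1) and m(s_i = 1) are stored (since m(s_i = 0) = 0), and atoms are assumed to have unit norm.

```python
import numpy as np

def vbem_sweep(y, D, m1, q1, p, sigma_n2, sigma_x2):
    """One sweep of the VB-EM updates (11)-(15) over all indices i.

    m1[i] stores m(s_i = 1), q1[i] stores q(s_i = 1), p[i] is the
    Bernoulli prior p_i; dictionary atoms are assumed unit-norm.
    """
    N, M = D.shape
    for i in range(M):
        # Mean of the residual excluding atom i, cf. (15)
        r_i = y - D @ (q1 * m1) + q1[i] * m1[i] * D[:, i]
        corr = D[:, i] @ r_i
        log_q = np.empty(2)
        m_s = np.empty(2)
        for s in (0, 1):
            gamma = sigma_x2 * sigma_n2 / (sigma_n2 + sigma_x2 * s)   # (13)
            m_s[s] = s * sigma_x2 / (sigma_n2 + sigma_x2 * s) * corr  # (14)
            prior = p[i] if s == 1 else 1.0 - p[i]
            # log of (12), up to a constant shared by s = 0 and s = 1
            log_q[s] = 0.5 * np.log(2 * np.pi * gamma) \
                       + 0.5 * m_s[s] ** 2 / gamma + np.log(prior)
        log_q -= log_q.max()              # normalize q(s_i) in a stable way
        w = np.exp(log_q)
        q1[i] = w[1] / w.sum()
        m1[i] = m_s[1]
    return m1, q1
```

Such sweeps are repeated until convergence of the q(s_i)'s.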
At this point of the discussion, it is interesting to compare the proposed algorithm with BMP: i) Although the nature of the updates may appear quite different (BMP makes a hard decision on the (x_i, s_i)'s whereas the proposed algorithm rather updates probabilities on the latter), both algorithms share some similarities. In particular, the mean of the distribution q(x_i|s_i) computed by the proposed algorithm (14) has the same form as the coefficient update performed by BMP (6). They rely however on different variables, namely the residual r_i (7) and its mean ⟨r_i⟩ (15). This fundamental difference between both algorithms leads to quite distinct approaches. In BMP, a hard decision is made on the SR support at each iteration: the atoms of the dictionary are either used or not (each x̂_j^(n−1) is multiplied by ŝ_j^(n−1), which is equal to 0 or 1). On the contrary, in the proposed algorithm, the contributions of the atoms are simply weighted by q(s_j = 1), i.e., the probability distributions of the s_j's. In a similar way, the coefficients x̂_j^(n−1) used in (7) are replaced by their means m(s_j = 1) in (15), taking into account the uncertainty we have on the values of the x_j's. ii) The complexity of one update step is similar in both algorithms and equal to that of MP: the most expensive operation is the update equation (15), which scales as O(NM). However, in BMP a single couple (x_i, s_i) is involved at each iteration, while in the proposed algorithm all indices are updated one after the other. To the extent of our experiments (see Section 4), we observed that the proposed algorithm converges in a reasonable number of iterations, which keeps it competitive with state-of-the-art algorithms.

3.2. Simplification of the support decision problem

Coming back to the maximum a posteriori problem (8) and exploiting the MF approximation (9), we obtain

\hat{s}_i = \arg\max_{s_i \in \{0,1\}} \log q(s_i),  ∀i,    (16)
where we have used the following approximation:

p(s|y) ≃ \int_x \prod_i q(x_i, s_i) \, dx = \prod_i q(s_i).
The solution of (16) can then be found by a simple thresholding operation: ŝ_i = 1 if q(s_i = 1) > 1/2 and ŝ_i = 0 otherwise.

3.3. Estimation of the noise variance

The estimation of unknown model parameters can easily be embedded within the VB-EM procedure (9)-(15). In particular, we estimate the noise variance via the procedure described in [13]. This leads to the following update of the noise variance:

\hat{\sigma}_n^2 = \frac{1}{N} \left\langle \left\| y - \sum_i s_i x_i d_i \right\|^2 \right\rangle_{\prod_i q(x_i, s_i)},    (17)

where ⟨f(θ)⟩_{q(θ)} ≜ \int_θ f(θ) q(θ) \, dθ. Note that, although it is in principle unnecessary when the noise variance is known, we found that including the noise-variance update (17) in the VB-EM iterations improves the convergence. An intuitive explanation of this behavior is that, at a given iteration, σ̂_n^2 is a measure of the (mean) discrepancy between the observation and the sparse model.

4. SIMULATIONS

In this section, we study the performance of the proposed algorithm by extensive computer simulations. In particular, we assess its performance in terms of the reconstruction of the SR support and the estimation of the nonzero coefficients. To that end, we evaluate different figures of merit as a function of the number of atoms used to generate the data, say K: the ratio of the average number of false detections to K, the ratio of the average number of missed detections to K, and the mean-square error (MSE) between the nonzero coefficients and their estimates.

Using (16), we reconstruct the coefficients of a sparse representation given its estimated support ŝ, say x̂_ŝ, by

\hat{x}_{\hat{s}} = D_{\hat{s}}^{+} y,    (18)

where D_{\hat{s}}^{+} is the Moore-Penrose pseudo-inverse of the matrix made up of the d_i's such that ŝ_i = 1. In the sequel, we refer to the procedure defined in (11)-(18) as the Soft Bayesian Pursuit (SoBaP) algorithm.

Observations are generated according to model (3)-(4). We use the following parameters: N = 128, M = 256, σ_n^2 = 10^-3, σ_x^2 = 100. For the sake of fair comparison with standard algorithms, we consider the case where all atoms have the same occurrence probability, i.e., p_i = K/M, ∀i. Finally, the elements of the dictionary are i.i.d. realizations of a zero-mean Gaussian distribution with variance N^-1. For each simulation point, we run 1500 trials.

We evaluate and compare the performance of 8 different algorithms: MP, IHT, BP, SP, BCS, VBSR1 ([11]), BMP and SoBaP. We use the algorithm implementations available on the authors' webpages². VBSR1 is run for 50 iterations. MP is run until the ℓ2-norm of the residual drops below √(N σ_n^2). SoBaP is run until the estimated noise variance drops below 10^-3.

Fig. 1(a) shows the MSE on the nonzero coefficients as a function of the number of nonzero coefficients, K, for each considered algorithm. For K ≥ 40, we can observe that SoBaP is dominated by VBSR1 but outperforms all other algorithms. Below this bound, while VBSR1 performs noticeably worse than IHT (up to K = 22), SP (up to K = 38) and BMP (up to K = 20), SoBaP remains competitive with these algorithms.

² Resp. at http://www.personal.soton.ac.uk/tb1m08/sparsify/sparsify.html, http://sites.google.com/site/igorcarron2/cscodes/, http://www.acm.caltech.edu/l1magic/ (ℓ1-magic)
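To make the simulation protocol above concrete, here is a minimal Python/NumPy sketch of the data generation and of the pseudo-inverse reconstruction (18). It reflects our reading of the setup and is not the authors' code: the variable names are ours, a support of exactly K atoms is drawn for illustration (the text specifies p_i = K/M), and the estimated support s_hat is a placeholder standing for the output of the thresholding rule (16).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulation parameters stated in the text
N, M, K = 128, 256, 20
sigma_n2, sigma_x2 = 1e-3, 100.0

# Dictionary with i.i.d. zero-mean Gaussian entries of variance 1/N
D = rng.normal(scale=np.sqrt(1.0 / N), size=(N, M))

# Bernoulli-Gaussian generation of (s, x) and the observation y, cf. (2)-(4)
s = np.zeros(M)
s[rng.choice(M, size=K, replace=False)] = 1.0
x = rng.normal(scale=np.sqrt(sigma_x2), size=M)
y = D @ (s * x) + rng.normal(scale=np.sqrt(sigma_n2), size=N)

# Given an estimated support s_hat (e.g., from the thresholding rule (16)),
# the nonzero coefficients are re-estimated by the pseudo-inverse, cf. (18)
s_hat = s.astype(bool)                      # placeholder: true support here
x_hat_s = np.linalg.pinv(D[:, s_hat]) @ y   # Moore-Penrose pseudo-inverse
```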
[Fig. 1 appears here: three panels plotted versus the number of nonzero coefficients K — (a) MSE on nonzero coefficients, (b) number of missed detections / number of nonzero coefficients, (c) number of false detections / number of nonzero coefficients — with curves for MP, IHT, BP, SP, BCS, VBSR1, BMP and SoBaP.]
Fig. 1. SR reconstruction performance versus the number of nonzero coefficients K.

Fig. 1(b) and Fig. 1(c) show the algorithm performance for the reconstruction of the SR support. We can observe that SoBaP succeeds in keeping both the missed detection and the false detection rates small over a large range of sparsity levels. This is not the case for the other algorithms: while some of them (IHT and SP in Fig. 1(b), BMP in Fig. 1(c)) present better performance for small values of K, the gains are very slight in comparison to the large losses observed for greater values. Note finally that Fig. 1(b) and Fig. 1(c) explain to some extent the singular behavior of VBSR1 observed in Fig. 1(a). Below K = 50, each atom selected by VBSR1 is a "good" one, i.e., has been used to generate the data, but this comes at the expense of the missed detection rate, which remains quite high for small numbers of nonzero coefficients. This "thrifty" strategy is also adopted by BP to a large extent.

5. CONCLUSION

In this paper, we consider the SR problem within a BG framework. We propose a tractable solution by resorting to a MF approximation and the VB-EM algorithm. The resulting algorithm is shown to have very good performance over a wide range of sparsity levels, in comparison to other state-of-the-art algorithms. This comes with a low complexity per update step, similar to that of MP. Dealing with soft decisions seems to be a promising way to solve SR problems and is de facto more and more considered in the literature (e.g., [14]).

6. REFERENCES

[1] C. Herzet and A. Drémeau, "Bayesian pursuit algorithms," in Proc. EUSIPCO, 2010.

[2] T. Blumensath and M. E. Davies, "Iterative thresholding for sparse approximations," Journal of Fourier Analysis and Applications, vol. 14, no. 5-6, pp. 629-654, December 2008.

[3] S. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. on Signal Processing, vol. 41, no. 12, pp. 3397-3415, December 1993.
[4] W. Dai and O. Milenkovic, "Subspace pursuit for compressive sensing signal reconstruction," IEEE Trans. on Information Theory, vol. 55, no. 5, pp. 2230-2249, May 2009.

[5] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, pp. 33-61, 1998.

[6] C. Soussen, J. Idier, D. Brie, and J. Duan, "From Bernoulli-Gaussian deconvolution to sparse signal restoration," Tech. Rep., CRAN/IRCCyN, 2010.

[7] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," Journal of Machine Learning Research, vol. 1, pp. 211-244, 2001.

[8] L. He, H. Chen, and L. Carin, "Tree-structured compressive sensing with variational Bayesian analysis," IEEE Signal Processing Letters, vol. 17, pp. 233-236, 2010.

[9] H. Zayyani, M. Babaie-Zadeh, and C. Jutten, "Sparse component analysis in presence of noise using EM-MAP," in Proc. ICA, 2007.

[10] H. Zayyani, M. Babaie-Zadeh, and C. Jutten, "An iterative Bayesian algorithm for sparse component analysis in presence of noise," IEEE Trans. on Signal Processing, vol. 57, pp. 4378-4390, 2009.

[11] C. Herzet and A. Drémeau, "Sparse representation algorithms based on mean-field approximations," in Proc. ICASSP, 2010, pp. 2034-2037.

[12] M. J. Beal and Z. Ghahramani, "The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures," Bayesian Statistics, vol. 7, pp. 453-463, 2003.

[13] T. Heskes, O. Zoeter, and W. Wiegerinck, "Approximate expectation maximization," in Advances in Neural Information Processing Systems 16, MIT Press, Cambridge, MA, 2004.

[14] A. Divekar and O. Ersoy, "Probabilistic matching pursuit for compressive sensing," Tech. Rep., School of Electrical and Computer Engineering, 2010.