
Generalized Approximate Message Passing for Estimation with Random Linear Mixing

arXiv:1010.5141v1 [cs.IT] 25 Oct 2010

Sundeep Rangan, Member, IEEE

Abstract—We consider the estimation of a random vector observed through a linear transform followed by a componentwise probabilistic measurement channel. Although such linear mixing estimation problems are generally highly non-convex, Gaussian approximations of belief propagation (BP) have proven to be computationally attractive and highly effective in a range of applications. Recently, Bayati and Montanari have provided a rigorous and extremely general analysis of a large class of approximate message passing (AMP) algorithms that includes many Gaussian approximate BP methods. This paper extends their analysis to a larger class of algorithms to include what we call generalized AMP (G-AMP). G-AMP incorporates general (possibly non-AWGN) measurement channels. Similar to the AWGN output channel case, we show that the asymptotic behavior of the G-AMP algorithm under large i.i.d. Gaussian transform matrices is described by a simple set of state evolution (SE) equations. The general SE equations recover and extend several earlier results, including SE equations for approximate BP on general output channels by Guo and Wang.

Index Terms—Optimization, random matrices, estimation, belief propagation.

I. INTRODUCTION

Consider the general estimation problem shown in Fig. 1. An input vector q ∈ Q^n has components q_j ∈ Q for some set Q and generates an unknown random vector x ∈ R^n through a componentwise input channel described by a conditional distribution p_{X|Q}(x_j|q_j). The vector x is then passed through a linear transform
\[ z = Ax, \qquad (1) \]
where A ∈ R^{m×n} is a known transform matrix. Finally, each component z_i of z randomly generates an output component y_i of a vector y ∈ Y^m through a second scalar conditional distribution p_{Y|Z}(y_i|z_i), where Y is some output set. The problem is to estimate the transform input x and output z from the system input vector q, system output y and linear transform A.

When m = n and the transform matrix is the identity matrix (A = I), z = x and the estimation problem decouples into m = n scalar estimation problems, each defined by the Markov chain:
\[ q_j \sim p_Q(q_j) \;\xrightarrow{\;p_{X|Q}(x_j|q_j)\;}\; x_j = z_j \;\xrightarrow{\;p_{Y|Z}(y_j|z_j)\;}\; y_j. \]

Since all the quantities in this Markov chain are scalars, one could numerically compute the exact posterior distribution on any component x_j (= z_j) from q_j and y_j relatively easily.

S. Rangan (email: [email protected]) is with Polytechnic Institute of New York University, Brooklyn, NY.

[Figure 1 block diagram: q ∈ Q^n → p_{X|Q}(x_j|q_j) → x ∈ R^n → A → z ∈ R^m → p_{Y|Z}(y_i|z_i) → y ∈ Y^m]

Fig. 1. General estimation problem with linear mixing. The problem is to estimate the transform input and output vectors x and z from the system input and output vectors q and y, channel transition probability functions pX|Q (·|·), pY |Z (·|·) and transform matrix A.

The challenge in the general estimation problem is that a non-identity matrix A couples, or “mixes,” the input components x_j with the output components z_i. In this case, the posterior distribution of any particular component x_j or z_i would involve a high-dimensional integral that is, in general, difficult to evaluate.

One common and highly effective approach for such linear mixing estimation problems is belief propagation (BP). BP is a general estimation method where the statistical relationships between variables are represented via a graphical model. The BP algorithm then iteratively updates estimates of the posterior distribution of the variables via a message passing procedure along the graph edges [1], [2]. For the linear mixing estimation problem, the graph is precisely the incidence matrix of A and the BP messages are passed between nodes corresponding to the input variables x_j and output variables z_i. In communications and signal processing, BP is best known for its connections to iterative decoding in turbo and LDPC codes [3]–[5]. However, while turbo and LDPC codes typically involve computations over finite fields, BP has also been successfully applied in a number of problems with linear real-valued mixing, including CDMA multiuser detection [6]–[8], lattice codes [9] and compressed sensing [10]–[12].

Although exact implementation of BP for dense graphs is generally computationally difficult, in the linear mixing estimation problem, BP often admits very good Gaussian approximations for large, dense matrices A [6], [12]–[16]. Gaussian approximations have also been used in the analysis of LDPC codes in [17], [18] and their design [19]. In the case of linear mixing estimation, Gaussian approximations typically enable BP to be implemented in O(mn) operations per iteration for an arbitrary (even dense) m × n matrix A. Moreover, for certain large i.i.d. random matrices A,


the asymptotic behavior of such approximate BP methods can often be described exactly by simple scalar state evolution (SE) equations that are analogous to the density evolution equations for BP decoding of LDPC codes [20]. Also, under certain conditions on the fixed points of these equations, one can show that approximate BP can be asymptotically optimal amongst all estimators. In this way, Gaussian approximations of BP are computationally tractable and admit precise performance characterizations with testable conditions for optimality.

Bayati and Montanari [16] have recently provided a highly general and rigorous analysis of a large class of approximate message passing (AMP) algorithms that includes Gaussian approximated BP. Their analysis provides a general SE description of the behavior of AMP for large Gaussian i.i.d. matrices A that recovers the SE equations for Gaussian approximated BP as special cases. Prior to their work, analyses of approximate BP algorithms were based on either heuristic arguments such as [12], [13], [21], [22] or certain large sparse limit assumptions for the matrix A [6], [14], [15], [23], [24]. The sparse limit model in particular assumed that the matrix A was sufficiently sparse so that the graph was locally tree-like and the messages were independent – a common requirement in the analysis of message passing algorithms. However, in many applications, the matrix A is dense and the sparse limit model is only an approximation.

Using a conditioning argument due to Erwin Bolthausen [25], Bayati and Montanari in [16] provided a completely rigorous analysis of AMP algorithms that applies to dense matrices A without any sparseness assumptions. In addition, their analysis could incorporate arbitrary mismatches between the true distributions on the variables and postulated distributions used by the estimator. Such mismatches are important to study since they often occur due to either computational limitations or lack of knowledge at the estimator. Their work also provided a rigorous justification of many earlier analyses of non-Gaussian estimation problems based on the replica method from statistical physics such as [15], [26]–[28]. While the replica method is heuristic, Bayati and Montanari's analysis is fully rigorous and also provides an algorithm to achieve the performance predicted by the replica method. The replica method, in contrast, only describes the theoretical performance of certain estimators with no algorithm to actually implement those estimators efficiently.

The main contribution of this paper is to extend Bayati and Montanari's analysis further to an even larger class of algorithms, which we call generalized AMP (G-AMP). The AMP algorithm analyzed by Bayati and Montanari assumes a linear operation at the output nodes, which corresponds to an additive white Gaussian noise (AWGN) output channel for the mapping between z and y in Fig. 1. The work here, in contrast, will consider arbitrary, possibly nonlinear, operations at the output, and outputs y that are generated by essentially arbitrary conditional distributions p_{Y|Z}(·) from the transform output z. Such non-AWGN output channels arise in a range of applications. Some examples are given in [15] and repeated here in Section II.

Similar to the analysis of AMP and related algorithms, our results show that the asymptotic behavior of G-AMP


algorithms under i.i.d. Gaussian matrices is described by a simple set of state evolution (SE) equations. However, unlike the SE equations for AMP which involve a single scalar parameter, the G-AMP algorithm requires the evolution of two scalar parameters. The second parameter is required to track a certain correlation between the error vector and the estimate. For AWGN channels, this parameter disappears and our results recover earlier SE equations.

The behavior of BP on general channels was analyzed earlier by Guo and Wang in [24], who also provided SE equations to describe the asymptotic behavior of the BP iterations. Some algorithmic aspects were also discussed in [15]. Both works, however, assumed a sparse random matrix and were restricted to the case where the estimator uses the optimal operations corresponding to the correct distributions p_{X|Q} and p_{Y|Z}. Since the results in this work use Bayati and Montanari's method, they apply, in a completely rigorous manner, to dense matrices, and can also characterize the behavior of arbitrary mismatches between the postulated and true distributions. In the case of the “matched” distributions, we recover the SE equations of Guo and Wang.

Interestingly, we will see that extending Bayati and Montanari's analysis in [16] to general output channels requires only a minor modification of their analysis. Their work already developed an analysis of a very general class of recursions, of which the AMP algorithm is a special case. With a small extension to vector-valued inputs, we will see that their analysis immediately provides an analysis of AMP algorithms for arbitrary output channels.

II. EXAMPLES AND APPLICATIONS

The linear mixing model is extremely general and can be applied in a range of circumstances. We review some simple examples for both the output and input channels from [15].

A. Output Channel Examples

a) AWGN output channel: For an additive white Gaussian noise (AWGN) output channel, the output vector y can be written as
\[ y = z + w = Ax + w, \qquad (2) \]
where w is a zero-mean, Gaussian i.i.d. random vector independent of x. The corresponding channel transition probability distribution is given by
\[ p_{Y|Z}(y|z) = \frac{1}{\sqrt{2\pi\mu^w}}\exp\left(-\frac{(y-z)^2}{2\mu^w}\right), \qquad (3) \]
where \mu^w > 0 is the variance of the components of w.

b) Non-Gaussian noise models: Since the output channel can incorporate an arbitrary separable distribution, the linear mixing model can also include the model (2) with non-Gaussian noise vectors w, provided the components of w are i.i.d. One interesting application for a non-Gaussian noise model is to study the bounded noise that arises in quantization as discussed in [15].


c) Logistic channels: A quite different channel is based on a logistic output. In this model, each output y_i is 0 or 1, where the probability that y_i = 1 is given by some sigmoidal function such as
\[ p_{Y|Z}(y_i = 1 | z_i) = \frac{1}{1 + a\exp(-z_i)}, \qquad (4) \]
for some constant a > 0. Thus, larger values of z_i result in a higher probability that y_i = 1.

This logistic model can be used in classification problems as follows [29]: Suppose one is given m samples, with each sample being labeled as belonging to one of two classes. Let y_i = 0 or 1 denote the class of sample i. Also, suppose that the i-th row of the transform matrix A contains a vector of n data values associated with the i-th sample. Then, using a logistic channel model such as (4), the problem of estimating the vector x from the labels y and data A is equivalent to finding a linear dependence on the data that classifies the samples between the two classes. This problem is often referred to as logistic regression and the resulting vector x is called the regression vector. By adjusting the prior on the components of x, one can then impose constraints on the components of x including, for example, sparsity constraints.
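To make the logistic example concrete, the following minimal Python sketch evaluates the channel (4) and generates synthetic labels y from a random data matrix A and a sparse regression vector x, i.e., the setup used in the logistic regression interpretation above. The problem sizes, the sparsity level and the constant a are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

def logistic_output_prob(z, a=1.0):
    """P(y = 1 | z) for the logistic output channel (4)."""
    return 1.0 / (1.0 + a * np.exp(-z))

# Illustrative problem sizes (assumptions, not from the paper).
rng = np.random.default_rng(0)
m, n, sparsity = 200, 100, 0.1

A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))    # rows = data samples
x = rng.normal(size=n) * (rng.random(n) < sparsity)   # sparse regression vector
z = A @ x
y = (rng.random(m) < logistic_output_prob(z)).astype(int)  # binary class labels
```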


B. Input Channel Examples

The distribution p_{X|Q}(x_j|q_j) models the prior on x_j, with the variable q_j being a parameter of that prior that varies over the components, but is known to the estimator. When all the components x_j are identically distributed, we can ignore q_j and use a constant distribution p_X(x_j).

d) Sparse priors and compressed sensing: One class of non-Gaussian priors that has attracted considerable recent attention is sparse distributions. A vector x is sparse if a large fraction of its components are zero or close to zero. Sparsity can be modeled probabilistically with a variety of heavy-tailed distributions p_X(x_j) including Gaussian mixture models, generalized Gaussians and Bernoulli distributions with a high probability of the component being zero. The estimation of sparse vectors with random linear measurements is the basic subject of compressed sensing [30]–[32] and fits naturally into the linear mixing framework.

e) Discrete distributions: The linear mixing model can also incorporate discrete distributions on the components of x. Discrete distributions arise often in communications problems where discrete messages are modulated onto the components of x. The linear mixing with the transform matrix A comes into play in the CDMA spread-spectrum systems and lattice codes mentioned above.

f) Power variations and dynamic range: For multiuser detection problems in communications, the parameter q_j can be used to model a priori power variations amongst the users that are known to the receiver. For example, suppose q_j > 0 almost surely and x_j = \sqrt{q_j}\,u_j, with u_j being identically distributed across all components j and independent of q_j. Then, E(x_j^2|q_j) = q_j E(u_j^2) and q_j thus represents a power scaling. This model has been used in a variety of analyses of CDMA multiuser detection [14], [27], [33] and random access channels [34].

III. GENERALIZED APPROXIMATE MESSAGE PASSING

We now describe the generalized AMP (G-AMP) algorithm. The algorithm is identical to a general recursion analyzed by Bayati and Montanari in [16]. However, our analysis will be somewhat different to incorporate the effect of general output channels.

The key idea of the G-AMP algorithm is to iteratively decompose the vector-valued estimation problem into a sequence of scalar operations at the input and output nodes. The scalar operations are defined by two functions: g_out(·) and g_in(·). We will see in Section IV that with appropriate choices of the functions, the G-AMP algorithm can be interpreted as an approximation of belief propagation.

Given the functions g_out(·) and g_in(·), in each iteration t = 0, 1, 2, ..., the G-AMP algorithm produces estimates \hat{x}_j(t) and \hat{z}_i(t) of the transform input and output variables x_j and z_i, respectively. The steps of the algorithm are as follows:

1) Initialization: Set t = 0 and let \hat{x}_j(t) and \mu^x_j(t) be any initial sequences.

2) Output linear step: For each i, compute:
\[ \mu^p_i(t) = \sum_j a_{ij}^2 \mu^x_j(t), \qquad (5a) \]
\[ \hat{p}_i(t) = \sum_j a_{ij}\hat{x}_j(t) - \mu^p_i(t)\hat{s}_i(t-1), \qquad (5b) \]
\[ \hat{z}_i(t) = \sum_j a_{ij}\hat{x}_j(t), \qquad (5c) \]
where initially, we take \hat{s}(-1) = 0.

3) Output nonlinear step: For each i,
\[ \hat{s}_i(t) = g_{\rm out}(t, \hat{p}_i(t), y_i, \mu^p_i(t)), \qquad (6a) \]
\[ \mu^s_i(t) = -\frac{\partial}{\partial \hat{p}}\, g_{\rm out}(t, \hat{p}_i(t), y_i, \mu^p_i(t)). \qquad (6b) \]

4) Input linear step: For each j,
\[ \mu^r_j(t) = \Bigl[\sum_i a_{ij}^2 \mu^s_i(t)\Bigr]^{-1}, \qquad (7a) \]
\[ \hat{r}_j(t) = \hat{x}_j(t) + \mu^r_j(t)\sum_i a_{ij}\hat{s}_i(t). \qquad (7b) \]

5) Input nonlinear step: For each j,
\[ \hat{x}_j(t+1) = g_{\rm in}(t, \hat{r}_j(t), q_j, \mu^r_j(t)), \qquad (8a) \]
\[ \mu^x_j(t+1) = \mu^r_j(t)\frac{\partial}{\partial \hat{r}}\, g_{\rm in}(t, \hat{r}_j(t), q_j, \mu^r_j(t)). \qquad (8b) \]

Then increment t = t + 1 and return to step 2 until a sufficient number of iterations have been performed.

The above algorithm permits functions g_in(t, \hat{r}, q, \mu^r) and g_out(t, \hat{p}, y, \mu^p) that can change with the iteration number t. When the functions do not change with the iteration, we will simply drop the parameter and write g_in(\hat{r}, q, \mu^r) and g_out(\hat{p}, y, \mu^p).
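As a concrete companion to the steps above, here is a minimal Python sketch of one pass of the recursion (5)-(8). It is a direct transcription under stated assumptions: g_in and g_out are supplied as callables that also return the derivatives needed in (6b) and (8b), and the AWGN g_out shown at the end corresponds to the AWGN case discussed later in Section IV-C. No claim is made about numerical robustness.

```python
import numpy as np

def gamp_iteration(A, y, q, x_hat, mu_x, s_hat, g_in, g_out):
    """One G-AMP iteration, following steps 2)-5) of Section III.

    g_out(p_hat, y, mu_p) must return (value, negative derivative) as in (6a)-(6b);
    g_in(r_hat, q, mu_r) must return (value, derivative) as in (8a)-(8b).
    For the first call, pass s_hat = 0 (i.e., s_hat(-1) = 0).
    """
    A2 = A ** 2
    # 2) Output linear step (5a)-(5c).
    mu_p = A2 @ mu_x
    p_hat = A @ x_hat - mu_p * s_hat
    z_hat = A @ x_hat
    # 3) Output nonlinear step (6a)-(6b).
    s_hat, mu_s = g_out(p_hat, y, mu_p)
    # 4) Input linear step (7a)-(7b).
    mu_r = 1.0 / (A2.T @ mu_s)
    r_hat = x_hat + mu_r * (A.T @ s_hat)
    # 5) Input nonlinear step (8a)-(8b).
    x_hat, d_in = g_in(r_hat, q, mu_r)
    mu_x = mu_r * d_in
    return x_hat, mu_x, s_hat, z_hat

# Example g_out for an AWGN output channel with (assumed) noise variance mu_w,
# matching (34)-(35) later in the paper.
def g_out_awgn(p_hat, y, mu_p, mu_w=0.1):
    return (y - p_hat) / (mu_w + mu_p), 1.0 / (mu_w + mu_p)
```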



IV. G-AMP FOR FIRST-ORDER APPROXIMATIONS OF BP

Although our analysis of the G-AMP algorithm applies to essentially arbitrary functions g_in(·) and g_out(·), we will be most interested in the case when the functions are selected to approximate belief propagation (BP). The exact form of the functions depends on whether we use BP to approximate maximum a posteriori (MAP) estimation or minimum mean squared error (MMSE) estimation. A heuristic “derivation” of the functions for each of these options is sketched in Appendices B and C and summarized here. We emphasize that the approximations are entirely heuristic – we don't claim any formal properties of the approximation. However, the analysis of the G-AMP algorithm with these or other functions will be fully rigorous.

A. Approximations of BP-based MAP Estimation

To describe the MAP estimator, observe that for the linear mixing estimation problem in Section I, the posterior distribution of the vector x given the system input q and output y is given by the conditional distribution
\[ p_{x|q,y}(x|q,y) := \frac{1}{Z}\exp\bigl(F(x, Ax, q, y)\bigr), \qquad (9) \]
where
\[ F(x, z, q, y) := \sum_{j=1}^n f_{\rm in}(x_j, q_j) + \sum_{i=1}^m f_{\rm out}(z_i, y_i), \qquad (10) \]
and
\[ f_{\rm out}(z, y) := \log p_{Y|Z}(y|z), \qquad (11a) \]
\[ f_{\rm in}(x, q) := \log p_{X|Q}(x|q). \qquad (11b) \]
The constant Z in (9) is a normalization constant. Given this distribution, the maximum a posteriori (MAP) estimator is the maximizer of (9), which is given by the optimization
\[ \hat{x}^{\rm map} := \arg\max_{x \in \mathbb{R}^n} F(x, z, q, y), \qquad \hat{z} = A\hat{x}. \qquad (12) \]

The implementation of approximate BP for the MAP estimation problem (12) is discussed in Appendix B. It is suggested there that a possible input function to approximately implement BP-based MAP estimation is given by
\[ g_{\rm in}(\hat{r}, q, \mu^r) := \arg\max_x F_{\rm in}(x, \hat{r}, q, \mu^r), \qquad (13) \]
where
\[ F_{\rm in}(x, \hat{r}, q, \mu^r) := f_{\rm in}(x, q) - \frac{1}{2\mu^r}(\hat{r} - x)^2. \qquad (14) \]
The appendix also shows that the function (13) has a derivative satisfying
\[ \mu^r \frac{\partial}{\partial \hat{r}}\, g_{\rm in}(\hat{r}, q, \mu^r) = \frac{\mu^r}{1 - \mu^r f_{\rm in}''(\hat{x}, q)}, \qquad (15) \]
where the second derivative f_in''(x, q) is with respect to x and \hat{x} = g_{\rm in}(\hat{r}, q, \mu^r). The initial condition for the G-AMP algorithm should be taken as
\[ \hat{x}_j(0) = \arg\max_{x_j} f_{\rm in}(x_j, q_j), \qquad \mu^x_j(0) = \frac{-1}{f_{\rm in}''(\hat{x}_j(0), q_j)}. \qquad (16) \]
This initial condition corresponds to the output of (8) with t = 0, the functions in (13) and (15) and \mu^r(-1) \to \infty.

Observe that when f_in is given by (11b), g_in(\hat{r}, q, \mu^r) is precisely the scalar MAP estimate of a random variable X given Q = q and \hat{R} = \hat{r} for the random variables
\[ \hat{R} = X + V, \qquad V \sim \mathcal{N}(0, \mu^r), \qquad (17) \]
where X \sim p_{X|Q}(x|q), Q \sim p_Q(q) and with V independent of X and Q. With this definition, \hat{R} can be interpreted as a Gaussian noise-corrupted version of X with noise level \mu^r.

Appendix B shows that the output function for the approximation of BP-based MAP estimation is given by
\[ g_{\rm out}(\hat{p}, y, \mu^p) := \frac{1}{\mu^p}(\hat{z}^0 - \hat{p}), \qquad (18) \]
where
\[ \hat{z}^0 := \arg\max_z F_{\rm out}(z, \hat{p}, y, \mu^p), \qquad (19) \]
and
\[ F_{\rm out}(z, \hat{p}, y, \mu^p) := f_{\rm out}(z, y) - \frac{1}{2\mu^p}(z - \hat{p})^2. \qquad (20) \]
The negative derivative of this function is given by
\[ -\frac{\partial}{\partial \hat{p}}\, g_{\rm out}(\hat{p}, y, \mu^p) = \frac{-f_{\rm out}''(\hat{z}^0, y)}{1 - \mu^p f_{\rm out}''(\hat{z}^0, y)}, \qquad (21) \]
where the second derivative f_out''(z, y) is with respect to z. Now, when f_out(z, y) is given by (11a), F_out(z, \hat{p}, y, \mu^p) in (20) can be interpreted as the log posterior of a random variable Z given Y = y and
\[ Z \sim \mathcal{N}(\hat{p}, \mu^p), \qquad Y \sim p_{Y|Z}(y|z). \qquad (22) \]
In particular, \hat{z}^0 in (19) is the MAP estimate of Z. In this way, the G-AMP algorithm reduces the vector MAP estimation problem to the evaluation of two sets of scalar MAP estimation problems, one set at the inputs and the other set at the outputs. The scalar parameters \mu^r(t) and \mu^p(t) represent effective Gaussian noise levels in these problems.

B. Approximations of BP-based MMSE Estimation

The minimum mean squared error (MMSE) estimate is the conditional expectation
\[ \hat{x}^{\rm mmse} := \mathbb{E}[x \,|\, y, q], \qquad (23) \]
relative to the conditional distribution (9). The selection of the functions g_in(·) and g_out(·) to implement approximate BP-based MMSE estimation is considered in Appendix C. Heuristic arguments in that section show that the input function to implement BP-based MMSE estimation is given by
\[ g_{\rm in}(\hat{r}, q, \mu^r) := \mathbb{E}[X \,|\, \hat{R} = \hat{r}, Q = q], \qquad (24) \]
where the expectation is over the variables in (17). Also, the derivative is given by the variance,
\[ \mu^r \frac{\partial}{\partial \hat{r}}\, g_{\rm in}(\hat{r}, q, \mu^r) = \mathrm{var}[X \,|\, \hat{R} = \hat{r}, Q = q]. \qquad (25) \]
The initial condition for the G-AMP algorithm for MMSE estimation should be taken as
\[ \hat{x}_j(0) = \mathbb{E}(X \,|\, Q = q_j), \qquad \mu^x_j(0) = \mathrm{var}(X \,|\, Q = q_j), \qquad (26) \]



where the expectations are over the distribution p_{X|Q}(x_j|q_j). Thus, the algorithm is initialized to the prior mean and variance on x_j based on the parameter q_j but no observations. Equivalently, the initial condition (26) corresponds to the output of (8) with t = 0, the functions in (24) and (25) and \mu^r(-1) \to \infty.

To describe the output function g_out(\hat{p}, y, \mu^p) for the MMSE problem, consider a random variable z with conditional probability distribution
\[ p(z|\hat{p}, y, \mu^p) \propto \exp F_{\rm out}(z, \hat{p}, y, \mu^p), \qquad (27) \]
where F_out(z, \hat{p}, y, \mu^p) is given in (20). The distribution (27) is the posterior distribution of the random variable Z with observation Y in (22). Appendix C shows that the output function g_out(\hat{p}, y, \mu^p) to implement approximate BP for the MMSE problem is given by
\[ g_{\rm out}(\hat{p}, y, \mu^p) := \frac{1}{\mu^p}(\hat{z}^0 - \hat{p}), \qquad \hat{z}^0 := \mathbb{E}(z|\hat{p}, y, \mu^p), \qquad (28) \]
where the expectation is over the distribution (27). Also, the negative derivative of g_out(·) is given by
\[ -\frac{\partial}{\partial \hat{p}}\, g_{\rm out}(\hat{p}, y, \mu^p) = \frac{1}{\mu^p}\left(1 - \frac{\mathrm{var}(z|\hat{p}, y, \mu^p)}{\mu^p}\right). \qquad (29) \]
The appendix also provides an alternative definition for g_out(·): The function g_out(·) in (28) is given by
\[ g_{\rm out}(\hat{p}, y, \mu^p) := \frac{\partial}{\partial \hat{p}} \log \mathbb{E}\bigl[p_{Y|Z}(y|Z)\bigr], \qquad (30) \]
where the expectation is over Z \sim \mathcal{N}(\hat{p}, \mu^p). As a result, its negative derivative is
\[ -\frac{\partial}{\partial \hat{p}}\, g_{\rm out}(\hat{p}, y, \mu^p) = -\frac{\partial^2}{\partial \hat{p}^2} \log \mathbb{E}\bigl[p_{Y|Z}(y|Z)\bigr]. \qquad (31) \]
Hence g_out(\hat{p}, y, \mu^p) has the interpretation of a score function of the parameter \hat{p} in the distribution of the random variable Y in (22). Thus, similar to the MAP estimation problem, the G-AMP algorithm reduces the vector MMSE estimation problem to the evaluation of two scalar estimation problems from Gaussian noise. In the MMSE case, a scalar MMSE estimation is performed at the input nodes, and a score function of an ML estimation problem is performed at the output nodes.
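For channels without closed-form posteriors, the scalar expectations in (24), (25), (28) and (29) can be evaluated by one-dimensional numerical integration. The Python sketch below does this on a grid; the Bernoulli-Gaussian prior and the logistic channel used to exercise it are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def gaussian(u, mean, var):
    return np.exp(-(u - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def g_in_mmse(r_hat, mu_r, prior_pdf, grid):
    """Scalar MMSE input functions: the mean in (24) and the variance in (25)."""
    w = prior_pdf(grid) * gaussian(r_hat, grid, mu_r)   # posterior of X given R_hat = r_hat
    w /= np.trapz(w, grid)
    mean = np.trapz(grid * w, grid)
    var = np.trapz((grid - mean) ** 2 * w, grid)
    return mean, var

def g_out_mmse(p_hat, y, mu_p, channel_lik, grid):
    """Scalar MMSE output functions (28)-(29) by numerical integration."""
    w = channel_lik(y, grid) * gaussian(grid, p_hat, mu_p)   # posterior (27) of Z
    w /= np.trapz(w, grid)
    z0 = np.trapz(grid * w, grid)
    vz = np.trapz((grid - z0) ** 2 * w, grid)
    return (z0 - p_hat) / mu_p, (1.0 - vz / mu_p) / mu_p     # (28) and (29)

# Illustrative usage with an assumed Bernoulli-Gaussian prior and logistic channel.
grid = np.linspace(-10, 10, 2001)
bg_prior = lambda x: 0.9 * gaussian(x, 0.0, 1e-4) + 0.1 * gaussian(x, 0.0, 1.0)
logistic_lik = lambda y, z: 1.0 / (1.0 + np.exp(-z)) if y == 1 else 1.0 - 1.0 / (1.0 + np.exp(-z))
print(g_in_mmse(0.5, 0.2, bg_prior, grid))
print(g_out_mmse(0.3, 1, 0.5, logistic_lik, grid))
```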

C. AWGN Output Channels

In the special case of an AWGN output channel, we will see that the output functions for approximate BP-based MMSE or MAP estimation are identical and reduce to the update in the AMP algorithm of Bayati and Montanari in [16]. Suppose that the output channel is described by the AWGN distribution (3) for some output variance \mu^w > 0. Then, it can be checked that the distribution p(z|\hat{p}, y, \mu^p) in (27) is the Gaussian
\[ p(z|\hat{p}, y, \mu^p) \sim \mathcal{N}(\hat{z}^0, \mu^z), \qquad (32) \]
where
\[ \hat{z}^0 := \hat{p} + \frac{\mu^p}{\mu^w + \mu^p}(y - \hat{p}), \qquad (33a) \]
\[ \mu^z := \frac{\mu^w \mu^p}{\mu^w + \mu^p}. \qquad (33b) \]
It can be verified that the conditional mean \hat{z}^0 in (33a) agrees with both \hat{z}^0 in (19) for the MAP estimator and \hat{z}^0 in (28) for the MMSE estimator. Therefore, the output function g_out for both the MAP estimate in (18) and the MMSE estimate in (28) is given by
\[ g_{\rm out}(\hat{p}, y, \mu^p) := \frac{y - \hat{p}}{\mu^w + \mu^p}. \qquad (34) \]
The negative derivative of the function is given by
\[ -\frac{\partial}{\partial \hat{p}}\, g_{\rm out}(\hat{p}, y, \mu^p) = \frac{1}{\mu^p + \mu^w}. \qquad (35) \]
Therefore, for both the MAP and MMSE estimation problems, g_out(\hat{p}, y, \mu^p) and its derivative are given by (34) and (35). When we apply these equations to the input updates (8), we precisely obtain the original AMP algorithm (with some scalings) given in [16].

V. ASYMPTOTIC ANALYSIS

A. Empirical Convergence of Random Variables

Bayati and Montanari's analysis in [16] employs certain deterministic models on the vectors and then proves convergence properties of related empirical distributions. To apply the same analysis here, we need to review some of their definitions.

We say a function \phi : \mathbb{R}^r \to \mathbb{R}^s is pseudo-Lipschitz of order k > 1, if there exists an L > 0 such that for any x, y \in \mathbb{R}^r,
\[ \|\phi(x) - \phi(y)\| \leq L(1 + \|x\|^{k-1} + \|y\|^{k-1})\|x - y\|. \]
Now suppose that for each n = 1, 2, \ldots, v(n) is a block vector with components v_i(n) \in \mathbb{R}^s, i = 1, 2, \ldots, \ell(n) for some \ell(n). So, the total dimension of v(n) is s\ell(n). We say that the components of the vectors v(n) empirically converge with bounded moments of order k as n \to \infty to a random vector V on \mathbb{R}^s if, for all pseudo-Lipschitz continuous functions \phi of order k,
\[ \lim_{n \to \infty} \frac{1}{\ell(n)} \sum_{i=1}^{\ell(n)} \phi(v_i(n)) = \mathbb{E}(\phi(V)) < \infty. \]
When the nature of convergence is clear, we may write (with some abuse of notation)
\[ \lim_{n \to \infty} v_i(n) \stackrel{d}{=} V. \]

B. Assumptions

With these definitions, we can now formally state the modeling assumptions, which follow along the lines of the asymptotic model considered by Bayati and Montanari in [16]. Specifically, we consider a sequence of random realizations of the estimation problem in Section I indexed by the input signal dimension n. For each n, we assume the output dimension m = m(n) is deterministic and scales linearly with the input dimension in that
\[ \lim_{n \to \infty} n/m(n) = \beta, \qquad (36) \]
for some \beta > 0 called the measurement ratio. We also assume that the transform matrix A \in \mathbb{R}^{m \times n} has i.i.d. Gaussian components a_{ij} \sim \mathcal{N}(0, 1/m) and z = Ax.



We assume that for some order k > 2, the components of the initial condition \hat{x}(0), \mu^x(0) and input vectors x and q empirically converge with bounded moments of order 2k - 2 as
\[ \lim_{n \to \infty} (\hat{x}_j(0), \mu^x_j(0), x_j, q_j) \stackrel{d}{=} (\hat{X}_0, \mu^x(0), X, Q), \qquad (37) \]
for some random variable triple (\hat{X}_0, X, Q) with distribution p_{\hat{X}_0, X, Q}(\hat{x}_0, x, q) and constant \mu^x(0).

To model the dependence of the system output vector y on the transform output z, we assume that, for each n, there is a deterministic vector w \in W^m for some set W, which we can think of as a noise vector. Then, for every i = 1, \ldots, m, we assume that
\[ y_i = h(z_i, w_i), \qquad (38) \]
where h is some function h : \mathbb{R} \times W \to Y and Y is the output set. Finally, we assume that the components of w empirically converge with bounded moments of order 2k - 2 to some random variable W with distribution p_W(w).

We also need certain continuity assumptions. Specifically, we assume that the partial derivatives of the functions g_in(t, \hat{r}, q, \mu^r) and g_out(t, \hat{p}, h(z, w), \mu^p) with respect to \hat{r}, \hat{p} and z exist almost everywhere and are pseudo-Lipschitz continuous of order k. This assumption implies that the functions g_in(t, \hat{r}, q, \mu^r) and g_out(t, \hat{p}, y, \mu^p) are Lipschitz continuous in \hat{r} and \hat{p} respectively.
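The following Python sketch generates one random problem instance satisfying these assumptions. The Bernoulli-Gaussian prior and the AWGN choice of the output map h are illustrative assumptions used only to make the construction concrete.

```python
import numpy as np

def make_instance(n=1000, beta=2.0, sparsity=0.1, mu_w=0.01, seed=0):
    """Random instance with A i.i.d. N(0, 1/m), z = A x, and y_i = h(z_i, w_i)."""
    rng = np.random.default_rng(seed)
    m = int(round(n / beta))                      # measurement ratio beta = n/m, as in (36)
    A = rng.normal(0.0, 1.0 / np.sqrt(m), (m, n))
    x = rng.normal(size=n) * (rng.random(n) < sparsity)   # assumed Bernoulli-Gaussian prior
    z = A @ x
    w = rng.normal(0.0, np.sqrt(mu_w), m)         # noise vector with i.i.d. components
    y = z + w                                     # h(z, w) = z + w (AWGN case of (38))
    return A, x, z, w, y

A, x, z, w, y = make_instance()
```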

C. State Evolution Equations

Similar to [16], the key result here is that the behavior of the G-AMP algorithm is described by a set of state evolution (SE) equations. However, unlike the standard scalar SE equations for AMP, we will see that the G-AMP algorithm requires, in general, a two-dimensional update.

To describe the equations, it is convenient to introduce two random vectors – one corresponding to the input channel, and the other corresponding to the output channel. At the input channel, given \alpha^r, \tau^r \in \mathbb{R} and distributions p_Q(q) and p_{X|Q}(x|q), define the random variable triple
\[ \theta^r(\tau^r, \alpha^r) := (X, Q, \hat{R}), \qquad (39) \]
where Q \sim p_Q(q), X \sim p_{X|Q}(x|q) and \hat{R} is given by
\[ \hat{R} = \alpha^r X + V, \qquad V \sim \mathcal{N}(0, \tau^r), \qquad (40) \]
with V independent of X and Q. At the output channel, given a covariance matrix K^p \in \mathbb{R}^{2\times 2}, K^p > 0, and distribution p_{Y|Z}(y|z), define the four-dimensional random vector
\[ \theta^p(K^p) := (Z, \hat{P}, W, Y), \qquad Y = h(Z, W), \qquad (41) \]
such that W and (Z, \hat{P}) are independent with distributions
\[ (Z, \hat{P}) \sim \mathcal{N}(0, K^p), \qquad W \sim p_W(w). \qquad (42) \]
We will write p_{Y|Z}(y|z) for the conditional distribution of Y given Z in this model. With these definitions, we can now state the state evolution (SE) equations for the G-AMP algorithm:

• Initialization: Set t = 0, let \mu^x(0) be the initial value in (37) and set
\[ K^x(0) = \mathbb{E}\left[\begin{bmatrix} X \\ \hat{X}(0) \end{bmatrix}\begin{bmatrix} X & \hat{X}(0) \end{bmatrix}\right], \qquad (43) \]
where the expectation is over the variables (X, \hat{X}(0)) in the limit (37).

• Output node update: Compute
\[ \mu^p(t) = \beta\mu^x(t), \qquad K^p(t) = \beta K^x(t), \qquad (44a) \]
\[ \mu^r(t) = -\mathbb{E}^{-1}\left[\frac{\partial}{\partial \hat{p}}\, g_{\rm out}(t, \hat{P}, Y, \mu^p(t))\right], \qquad (44b) \]
\[ \tau^r(t) = (\mu^r(t))^2\, \mathbb{E}\left[g_{\rm out}^2(t, \hat{P}, Y, \mu^p(t))\right], \qquad (44c) \]
where the expectations are over the random variable triples \theta^p(K^p(t)) = (Z, \hat{P}, W, Y) in (41). Also, let
\[ \alpha^r(t) = \mu^r(t)\, \mathbb{E}\left[\frac{\partial}{\partial z}\, g_{\rm out}(t, \hat{P}, h(z, W), \mu^p(t))\Big|_{z=Z}\right]. \qquad (44d) \]

• Input node update: Compute
\[ \mu^x(t+1) = \mu^r(t)\, \mathbb{E}\left[\frac{\partial}{\partial \hat{r}}\, g_{\rm in}(t, \hat{R}, Q, \mu^r(t))\right], \qquad (45a) \]
\[ K^x(t+1) = \mathbb{E}\left[\begin{bmatrix} X \\ \hat{X}(t+1) \end{bmatrix}\begin{bmatrix} X & \hat{X}(t+1) \end{bmatrix}\right], \qquad (45b) \]
where the expectation is over the random variable triple \theta^r(\tau^r(t), \alpha^r(t)) = (X, Q, \hat{R}) in (39), and
\[ \hat{X}(t+1) = g_{\rm in}(t, \hat{R}, Q, \mu^r(t)). \]
Increment t = t + 1 and return to the output node update.

Since the computations in (44) and (45) are expectations over scalar random variables, they can be evaluated numerically given functions g_in(·) and g_out(·).

D. Main Result

Unfortunately, we cannot directly analyze the G-AMP algorithm as stated in Section III. Instead, we need to consider a slight variant of the algorithm with two small modifications: First, we replace the equations (5a) and (7a) in the G-AMP algorithm with
\[ \mu^p_i(t) = \mu^p(t) := \frac{1}{m}\sum_{j=1}^n \mu^x_j(t), \qquad (46a) \]
\[ \frac{1}{\mu^r_j(t)} = \frac{1}{\mu^r(t)} := \frac{1}{m}\sum_{i=1}^m \mu^s_i(t). \qquad (46b) \]

Second, in (6) and (8), we replace the parameters \mu^p_i(t) and \mu^r_j(t) with \mu^p(t) and \mu^r(t) from the SE equations. We also replace \mu^r_j(t) with \mu^r(t) in (7b). This modified G-AMP algorithm is still entirely implementable, since the SE outputs \mu^p(t) and \mu^r(t) can be computed from the distributions of the random variables. Also, at least heuristically, the modifications should have a negligible effect on the behavior of the G-AMP algorithm for high dimensions under the assumptions of Section V-B: Replacing



(5a) and (7a) with (46) can be heuristically justified since a_{ij} \sim \mathcal{N}(0, 1/m). Replacing \mu^p_i(t) and \mu^r_j(t) with \mu^p(t) and \mu^r(t) can be justified with part (a) of Theorem 1 below.

Theorem 1: Consider the G-AMP algorithm in Section III under the assumptions in Section V-B and with the above modifications. Then, for any fixed iteration number t:
(a) Almost surely, we have the limits
\[ \lim_{n \to \infty} \mu^r(t) = \mu^r(t), \qquad \lim_{n \to \infty} \mu^p(t) = \mu^p(t), \qquad (47) \]
where the quantities on the left are the empirical values (46) computed by the algorithm and those on the right are the deterministic values given by the SE equations.
(b) The components of the vectors x, q, \hat{r} and \hat{x} empirically converge with bounded moments of order k as
\[ \lim_{n \to \infty} (x_j, q_j, \hat{r}_j(t)) \stackrel{d}{=} \theta^r(\tau^r(t), \alpha^r(t)), \qquad (48) \]
where \theta^r(\tau^r(t), \alpha^r(t)) = (X, Q, \hat{R}) is the random variable triple in (39) and \tau^r(t) is from the SE equations.
(c) The components of the vectors z, \hat{p}, w and y empirically converge with bounded moments of order k as
\[ \lim_{n \to \infty} (z_i, \hat{p}_i(t), w_i, y_i) \stackrel{d}{=} \theta^p(K^p(t)), \qquad (49) \]
where \theta^p(K^p(t)) = (Z, \hat{P}, W, Y) is the random vector in (41) and K^p(t) is given by the SE equations.

[Figure 2 diagram: q_j ∼ p_Q(·), x_j ∼ p_{X|Q}(x_j|q_j); \hat{r}_j(t) = \alpha(t)x_j + v(t) with v(t) ∼ \mathcal{N}(0, \tau^r(t)); \hat{x}_j(t) = g_{\rm in}(\hat{r}_j(t))]
Fig. 2. Scalar equivalent model. The joint distribution of any one component x_j and its estimate \hat{x}_j(t) from the t-th iteration of the AMP algorithm is identical asymptotically to an equivalent estimator with Gaussian noise.

The theorem is proven in Appendix A where it is shown to be a special case of a more general result on certain vector-valued AMP-like recursions.

A useful interpretation of Theorem 1 is that it provides a scalar equivalent model for the behavior of the G-AMP estimates. The equivalent scalar model is illustrated in Fig. 2. To understand this diagram, observe that part (b) of Theorem 1 shows that the joint empirical distribution of the components (x_j, q_j, \hat{r}_j(t)) converges to \theta^r(\tau^r(t), \alpha^r(t)) = (X, Q, \hat{R}) in (39). This distribution is identical to \hat{r}_j(t) being a scaled and noise-corrupted version of x_j. Then, the estimate \hat{x}_j(t) = g_{\rm in}(t, \hat{r}_j(t), q_j, \mu^r_j(t)) is a nonlinear scalar function of \hat{r}_j(t) and q_j. A similar interpretation can be drawn at the output nodes.
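As noted at the end of Section V-C, the expectations in (44)-(45) can be evaluated numerically. The Python sketch below performs the output-node update (44a)-(44d) by Monte Carlo for a generic g_out; the sample size, the finite-difference derivatives and the AWGN g_out used in the usage example are illustrative assumptions rather than part of the paper's development.

```python
import numpy as np

def se_output_update(mu_x, Kx, g_out, h, w_sampler, beta, n_mc=200000, eps=1e-4, seed=1):
    """Monte Carlo evaluation of the SE output-node update (44a)-(44d)."""
    rng = np.random.default_rng(seed)
    Kp = beta * np.asarray(Kx)                       # (44a)
    mu_p = beta * mu_x
    ZP = rng.multivariate_normal([0.0, 0.0], Kp, n_mc)
    Z, P = ZP[:, 0], ZP[:, 1]
    W = w_sampler(rng, n_mc)
    Y = h(Z, W)
    g = g_out(P, Y, mu_p)
    dg_dp = (g_out(P + eps, Y, mu_p) - g_out(P - eps, Y, mu_p)) / (2 * eps)
    dg_dz = (g_out(P, h(Z + eps, W), mu_p) - g_out(P, h(Z - eps, W), mu_p)) / (2 * eps)
    mu_r = -1.0 / np.mean(dg_dp)                     # (44b)
    tau_r = mu_r ** 2 * np.mean(g ** 2)              # (44c)
    alpha_r = mu_r * np.mean(dg_dz)                  # (44d)
    return mu_p, mu_r, tau_r, alpha_r

# Illustrative check with an AWGN output channel: h(z, w) = z + w and g_out from (34).
mu_w = 0.1
h = lambda z, w: z + w
g_out = lambda p, y, mu_p: (y - p) / (mu_w + mu_p)
w_sampler = lambda rng, n: rng.normal(0.0, np.sqrt(mu_w), n)
print(se_output_update(mu_x=0.5, Kx=[[1.0, 0.5], [0.5, 0.5]], g_out=g_out,
                       h=h, w_sampler=w_sampler, beta=2.0))
```

For this AWGN check, the returned alpha_r should be close to 1, consistent with the claim in Section VI-A below.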

VI. SPECIAL CASES

We now show that several previous results can be recovered from the general SE equations in Section V-C as special cases.

A. AWGN Output Channel

First consider the case where the output function (38) is given by the additive noise model
\[ y_i = h(z_i, w_i) = z_i + w_i. \qquad (50) \]
Assume the components w_i of the output noise vector empirically converge to a random variable W with zero mean and variance \mu^w > 0. The output noise W need not be Gaussian. But, suppose the G-AMP algorithm uses an output function g_out(·) in (34) corresponding to an AWGN output for some postulated output variance \mu^w_{\rm post} that may differ from \mu^w. This is the scenario analyzed in Bayati and Montanari's paper [16]. Substituting (35) with the postulated noise variance \mu^w_{\rm post} into (44b) we get
\[ \mu^r(t) = \mu^w_{\rm post} + \mu^p(t) = \mu^w_{\rm post} + \beta\mu^x(t). \qquad (51) \]
Also, using (34) and (50), we see that
\[ \frac{\partial}{\partial z}\, g_{\rm out}(\hat{p}, h(z, w), \mu^p(t)) = \frac{\partial}{\partial z}\, \frac{z + w - \hat{p}}{\mu^w_{\rm post} + \mu^p(t)} = \frac{1}{\mu^r(t)}. \]
Therefore, (44d) implies that \alpha^r(t) = 1. Now define
\[ \tau^x(t) := \begin{bmatrix} 1 & -1 \end{bmatrix} K^x(t) \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \qquad (52a) \]
\[ \tau^p(t) := \beta\tau^x(t) = \begin{bmatrix} 1 & -1 \end{bmatrix} K^p(t) \begin{bmatrix} 1 \\ -1 \end{bmatrix}. \qquad (52b) \]
With these definitions, note that if (Z, \hat{P}) \sim \mathcal{N}(0, K^p(t)), then
\[ \tau^p(t) = \mathbb{E}(Z - \hat{P})^2. \qquad (53) \]
Also, with K^x(t) defined in (45b),
\[ \tau^x(t+1) = \mathbb{E}(X - \hat{X}(t+1))^2 = \mathbb{E}\bigl[X - g_{\rm in}(t, \hat{R}, Q, \mu^r(t))\bigr]^2. \qquad (54) \]
Therefore,
\[ \tau^r(t+1) \stackrel{(a)}{=} (\mu^r(t+1))^2\, \frac{\mathbb{E}(Y - \hat{P})^2}{(\mu^w_{\rm post} + \mu^p(t+1))^2} \stackrel{(b)}{=} \mathbb{E}(Y - \hat{P})^2 \stackrel{(c)}{=} \mathbb{E}(W + Z - \hat{P})^2 = \mu^w + \mathbb{E}(Z - \hat{P})^2 \stackrel{(d)}{=} \mu^w + \tau^p(t+1) \stackrel{(e)}{=} \mu^w + \beta\,\mathbb{E}\bigl[X - g_{\rm in}(t, \hat{R}, Q, \mu^r(t))\bigr]^2, \qquad (55) \]
where (a) follows from substituting (34) into (44c); (b) follows from (51); (c) follows from the output channel model assumption (50); (d) follows from (53) and (e) follows from (52) and (54). The expectation in (55) is over \theta^r(\tau^r(t), \alpha^r(t)) = (X, Q, \hat{R}) in (39). Since \alpha^r(t) = 1, the expectation is over random variables (X, Q, \hat{R}) where X \sim p_{X|Q}(x|q), Q \sim p_Q(q) and
\[ \hat{R} = X + V, \qquad V \sim \mathcal{N}(0, \tau^r(t)), \]

and V is independent of X and Q. With this expectation, (55) is precisely the SE equation in [16] for the AMP algorithm with a general input function gin (·).



B. AWGN Output Channels with MMSE Estimation The SE equation (55) applies to a general input function gin (·). Now suppose that the estimator uses an input function gin (·) based on the MMSE estimator (24). To account for possible mismatches with the estimator, suppose that the G-AMP algorithm is run with some postulated distribution ppost X|Q (·) that may differ from the true distribution pX|Q (·). Let x bpost r, q, µr ) be the corresponding scalar MMSE estimator mmse (b   r b x bpost (b r , q, µ ) := E X| R = r b , Q = q , mmse where the expectation is over the random variable b = X + V, R

V ∼ N (0, µr ),

(57)

where X given Q follows the postulated distribution is independent of X and Q. Let ppost X|Q (x|q) and V post Emmse (b r, q, µr ) denote the corresponding postulated variance. Then, (24) and (25) can be re-written gin (b r, q, µr ) = x bpost r, q, µr ), mmse (b

(58a)

with other works including [14] and [23]. In this regard, the analysis here can be seen as an extension of that work to handle dense matrices and the case of possibly mismatched distributions. C. AWGN Output Channels with MAP Estimation Now suppose that the estimator uses an input function gin (·) based on the MAP estimator (13) again with a postulated distribution ppost X|Q (·) that may differ from the true distribution pX|Q (·). Let x bpost r, q, µr ) be the corresponding scalar MAP map (b estimator   1 2 post r r − x) , (61) x bmap (b r, q, µ ) := arg max fin (x, q) − r (b 2µ x∈R where fin (x, q) = log ppost X|Q (x|q). With this definition, x bpost map is the scalar MAP estimate of the b = rb and Q = q random variable X from the observations R in the model (57). Also, following (15) define

and

∂ post gin (b r, q, µr ) = Emmse (b r, q, µr ). µ ∂b r With this notation, r

post Emap (b r, q, µr ) :=

(58b) where x b=

x bpost map .

1−

Then, (13) and (15) can be re-written as

gin (b r, q, µr ) = x bpost r, q, µr ), map (b

(a)

x µr (t+1) = µw post + βµ (t+1)   post (b) = µw r, q, µr (t)) post + βE Emmse (b

µr

, 00 (b fin x, q)µr

(59)

where (a) follows from (51) and (b) follows from substituting (58b) into (45a). Also, substituting (58a) into (55), we obtain
\[ \tau^r(t+1) = \mu^w + \beta\,\mathbb{E}\bigl[X - \hat{x}^{\rm post}_{\rm mmse}(\hat{R}, Q, \mu^r(t))\bigr]^2. \qquad (60) \]
The fixed points of the updates (59) and (60) are precisely the equations given by Guo and Verdú in their replica analysis of MMSE estimation with AWGN outputs and non-Gaussian priors [27]. Their work uses the replica method from statistical physics to argue the following: Let \hat{x}^{\rm post} be the exact MMSE estimate of the vector x based on the postulated distributions p^{\rm post}_{X|Q} and postulated noise variance \mu^w_{\rm post}. By “exact,” we mean the actual MMSE estimate for that postulated prior – not the G-AMP or any other approximation. Then, in the limit of large i.i.d. random matrices A, it is claimed that the joint distribution of (x_j, q_j) for some component j and the corresponding exact MMSE estimate \hat{x}^{\rm post}_j follows the same scalar model as Fig. 2 for the G-AMP algorithm. Moreover, the effective noise variance of the exact MMSE estimator is a fixed point of the updates (59) and (60). This result of Guo and Verdú generalized several earlier results including analyses of linear estimators in [35] and [33] based on random matrix theory and BPSK estimation in [26] based on the replica method. As discussed in [16], the SE analysis of the AMP algorithm provides some rigorous basis for the replica analysis, along with an algorithm to achieve the performance. Also, when the postulated distribution matches the true distribution, the general equations (59) and (60) reduce to the SE equations for Gaussian approximated BP on large sparse matrices derived originally by Boutros and Caire [6] along

(62a)

and

∂ post gin (b r, q, µr ) = Emap (b r, q, µr ). (62b) ∂b r Following similar computations to the previous subsection, we obtain the SE equations h i post b r µr (t+1) = µw + βE E ( R, Q, µ (t)) (63a) post map h i2 r b τ r (t+1) = µw + βE X − x bpost . map (R, Q, µ (t)) (63b) µr

Similar to the MMSE case, the fixed point of these equations precisely agree with the equations for replica analysis of MAP estimation with AWGN output noise given in [36] and related works in [28]. Thus, again, the G-AMP framework provides some rigorous justification of the replica results along with an algorithm to achieve the predictions by the replica method. D. General Output Channels with Matched Distributions Finally, let us return to the case of general (possibly nonAWGN) output channels and consider the G-AMP algorithm with the MMSE estimations function in Section IV-B. In this case, we will assume that the postulated distributions for pX|Q and pY |Z match the true distributions. Guo and Wang in [24] derived SE equations for standard BP in this case for large sparse random matrices. The work [15] derived identical equations for a relaxed version of BP. We will see that the same SE equations will follow as a special case of the more general SE equations above. To prove this, we will show by induction that, for all t, Kx (t) has the form   µx0 µx0 − µx (t) x , (64) K (t) = µx0 − µx (t) µx0 − µx (t)



where µx0 is the variance of X. For t = 0, recall that for bj (0) = 0 (the mean of MMSE estimation, µx (0) = µx0 and x X) for all j. Therefore, (43) shows that (64) holds for t = 0. Now suppose that (64) holds for some t. Then, (44a) shows that Kp (t) has the form   µz0 µz0 − µp (t) p , (65) K (t) = µz0 − µp (t) µz0 − µp (t)

work thus shows that the identical SE equations hold for dense matrices A. In addition, the results here extend the earlier SE equations by considering the cases where the distributions postulated by the estimator may not be identical to the true distribution.

where µz0 = βµx0 . Now, if (Z, Pb) ∼ N (0, Kp (t)) with Kp (t) given by (65), then the conditional distribution of Z given Pb is Z ∼ N (Pb, µp (t)). Therefore, gout (·) in (30) can be interpreted as

We have extended the analysis of Bayati and Montanari in [16] to a large class of nonlinear estimators for problems with linear mixing and arbitrary (possibly non-AWGN) measurements. The algorithms are based on what we call generalized approximate message passing (G-AMP) and included Gaussian approximations of belief propagation. Similar to the analysis of AMP algorithms, we have shown that in the limit of certain large Gaussian random mixing matrices, the G-AMP algorithms are precisely described by a simple set of state evolution (SE) equations. In the special case where the nonlinear operations are optimally selected to match the true input and output distributions, the SE equations for GAMP algorithm agree with earlier results due to Guo and Wang in [24]. The results here, however, are more general and can account for mismatches between the postulated and true distributions and different estimation objectives (including MAP estimation, not just MMSE). Also, the results apply to dense mixing matrices and do not require any sparsity assumptions. The G-AMP algorithm thus provides a computationally simple method that can apply to a large class of estimation problems with provable exact performance characterizations. As a result, we believe that the G-AMP algorithm can have wide ranging applicability. In [37], we have already explored applications to nonlinear wireless scheduling problems. In [15], we have suggested applications to logistic regression in pattern recognition and quantization. The main outstanding theoretical issue is to obtain a performance lower bound. An appealing feature of the SE analysis of sparse random matrices such as [6], [14], [15], [24] as well as well as the replica analysis in [26], [27], [36] is that they provide lower bounds on the performance of the optimal estimator. These lower bounds are also described by SE equations. Then, using a sandwiching argument – a technique used commonly in the study of LDPC codes [38] – shows that when fixed points of the SE equations are unique, BP is optimal. Finding a similar lower bound for the G-AMP algorithm is a possible avenue of future research.

gout (b p, y, µp ) :=

∂ log pY |Pb (y|Pb = pb), ∂ pb

(66)

where pY |Pb (·|·) is the likelihood of Y given Pb for the random variables θp (Kp (t)) in (41). Now let F (µp (t)) be the Fisher information  2 ∂ p b log pY |Pb (Y |P ) , (67) F (µ (t)) := E ∂ pb where the expectation is also over θp (Kp (t)) with Kp (t) given by (65). A standard property of the Fisher information is that  2  ∂ p b F (µ (t)) = E log pY |Pb (Y |P ) . (68) ∂ pb2 Substituting (67) and (68) into (44) we see that µr (t) = τ r (t) =

1 . F (µp (t))

(69)

We will show in Appendix D that   ∂ E gout (t, h(Z, W ), Pb, µp (t)) ∂ pb   ∂ = E gout (t, h(Z, W ), Pb, µp (t)) . ∂z

(70)

It then follows from (44d) that αr (t) = 1.

(71)

Now consider the input update (45). Since αr (t) = 1 and µr (t) = τ r (t), the expectation in (24) agrees with the expectations in (45). Substituting (25) into (45a), we obtain h i b , µx (t+1) = E var(X|Q, R) (72) where the expectation is over θr (τ r (t), αr (t)) in (39) with b b then α = 1. Also, if X(t+1) = E(X|Q, R) h i 2 b b = µx (t+1) E(X − X(t+1)) = E var(X|Q, R) h i b b E X(t+1)(X − X(t+1)) = 0. x0

2

These relations, along with the definition µ = E(X ), show that the covariance Kx (t+1) in (45b) is of the form (64). This completes the induction argument to show that (64) and the other equations above hold for all t. The updates (69) and (72) are precisely the SE equations given in [24] and [15] for BP MMSE estimation with matched distributions on large sparse random matrices. The current

C ONCLUSIONS

A PPENDIX A V ECTOR -VALUED AMP A. SE Equations for Vector-Valued AMP The analysis of the G-AMP requires a vector-valued version of the recursion analyzed by Bayati and Montanari. Fix dimensions nd and nb , and let Θu and Θv be any two sets. Let Gin (t, d, θu ) ∈ Rnb and Gout (t, b, θv ) ∈ Rnd be vector-valued functions over the arguments t = 0, 1, 2, . . ., b ∈ Rnb , d ∈ Rnd , θu ∈ Θu and θ ∈ Θv . Assume Gin (t, d, θu ) and Gout (t, b, θv ) are Lipschitz continuous in d



and b, respectively. Let A ∈ Rm×n and generate a sequence of vectors bi (t) and dj (t) by the iterations X bi (t) = aij uj (t) − λ(t)vi (t−1), (73a) j

dj (t)

=

X

aij vi (t) − ξ(t)uj (t)

(73b)

=

Gin (t, dj (t), θju ),

(74a)

=

Gout (t, bi (t), θiv ),

i

where uj (t+1) vi (t)

where, again, the expectations are over random variables θu and θv in the limits (76), and B(t) and D(t) are Gaussian vectors in (78) independent of θu and θv . Theorem 2: Consider the recursion in (73) and (74) with either the empirical update (75) or expected update (80). Then, under the above assumptions, for any fixed iteration number t, the variables in the recursions with either the empirical or expected update converge empirically as lim (dj (t), θju ) =

d

(D(t), θu )

(81a)

d

(B(t), θv ),

(81b)

n→∞

(74b)

lim (bi (t), θiv ) =

n→∞

and m

ξ(t)

=

1 X ∂ Gout (t, bi (t), θiv ) m i=1 ∂b

λ(t+1)

=

1 X ∂ Gin (t, dj (t), θjv ). m j=1 ∂d

(75a)

n

(75b)

Here, we interpret the derivatives as matrices ξ(t) ∈ Rnd ×nb and λ(t+1) ∈ Rnb ×nd . The recursion is initialized with t = 0, vi (t−1) = 0 and some values for uj (0). Now similar to Section V, consider a sequence of random realizations of the parameters indexed by the input dimension n. For each n, we assume that the output dimension m = m(n) is deterministic and scales linearly as in (36) for some β ≥ 0. Assume that the transform matrix A has i.i.d. Gaussian components aij ∼ N (0, 1/m). Also assume that the following components converge empirically with bounded moments of order 2k − 2 with the limits lim θiv

n→∞

lim ui (0)

n→∞

d

= θv ,

d

lim θju = θu ,

n→∞

d

= U0 ,

(76a) (76b)

for random variables θv , θu and U0 . We also assume U0 is zero mean. Under these assumptions, we will see that SE equations for the vector-valued recursion is given by: Kd (t)

where are θu and θv are the random variables in the limits (76), and B(t) and D(t) are Gaussians (78) independent of θu and θv . The scalar case when nb = nd = 1 is proven by Bayati and Montanari [16]. The modifications for the vector-valued case is straightforward but tedious. We provide a sketch of the arguments in Appendix E. B. Proof of Theorem 1 The theorem is a special case of the result, Theorem 2, above. Let nb = 2 and nd = 1 and define the variables     xj zi uj (t) = , bi (t) = (82a) x bj (t) pbi (t) vi (t) = sbi (t) (82b) 1 r (b rj (t) − α (t)xj ) (82c) dj (t) = µr (t) θju = (xj , qj ), θiv = wi (82d)   0 λ(t+1) = (82e) µp (t) 1 [αr (t) − 1] . (82f) ξ(t) = r µ (t) Also, for θu = (x, q), θv = w and b = (z pb)T , define the functions Gin (t, d, θu )   x := , gin (t, µr (t)d + αr (t)x, q, µr (t))

  := E Gout (t, B(t), θv )Gout (t, B(t), θv )T (77a) Kb (t+1)   := βE Gin (t, D(t), θu )Gin (t, D(t), θu )T (77b) where the expectations are over the random variables θu and θv in the limits (76), and B(t) and D(t) are Gaussian vectors B(t) ∼ N (0, Kb (t)), D(t) ∼ N (0, Kd (t))

(78)

independent of θu and θv . The SE equations (77) are initialized with   Kb (0) := βE U0 UT0 . (79) The matrices in (75) are derived empirically from the variables b(t) and d(t). We will also be interested in the case where the algorithm directly uses the expected values   ∂ ξ(t) = E Gout (t, B(t), θv ) (80a) ∂b   ∂ λ(t+1) = E Gin (t, D(t), θv ) , (80b) ∂d

(83a)

and Gout (t, b, θv ) := gout (t, pb, h(z, w), µp (t)).

(83b)

With these definitions, it is easily checked that the modified GAMP algorithm agrees with the recursion described by equations (73), and (74), with λ(t) being defined with the empirical update in (75) and ξ(t) being defined by the expected value in (80). For example,   (a) zi bi (t) = pbi (t)     X (b) xj 0 = aij − x bj (t) µp (t)b si (t−1) j X (c) = aij uj (t) − λ(t)vi (t−1), j

RANGAN

11

where (a) follows from the definition of bi (t) in (82a); (b) follows and (5) and the fact that z = Ax and (c) follows from the remaining definitions in (82). Therefore, variables in (82) satisfy (73a). Similarly,

θp (Kp (t)). Therefore, τ r (t)

h i 2 (t, Y, Pb, µp (t)) (µr (t))2 E gout   (µr (t))2 E G2out (t, B(t), θv )

(a)

=

(b)

dj (t)

(a)

=

(b)

=

(c)

=

=

1 (b rj (t) − αr (t)xj ) µr (t) X 1 aij sbi (t) + r (b xj (t) − αr (t)xj ) µ (t) i X aij vi (t) − ξ(t)uj (t),

(c)

i

where (a) follows from the definition of dj (t) in (82c); (b) follows from (7b) with the modification that µrj (t) is replaced with µr (t); and (c) follows from the other definitions in (82) Hence the variables in (82) satisfy (73b). The equations in (74) can also easily verified. We next consider λ(t) and ξ(t). For λ(t), first observe that (82c) and (8b) show that ∂ gin (t, µr (t)d + αr (t)xj , qj , µr (t))|d=dj (t) ∂d ∂ = µr (t) gin (t, rbj (t), qj , µr (t)) = µxj (t). ∂b r

where the limit is in the sense of empirical convergence of order k and D(t) ∼ N (0, Kd (t)) is independent of (X, Q). But, this limit is equivalent to d b lim (xj , qj , rbj (t)) = (X, Q, R),

Combining these relations with the definition of Gout (·) in (83b), shows that ξ(t) defined in (82f) satisfies (80). Therefore, we can apply Theorem 2 which shows that the limits (81) hold for the matrices Kb (t) and Kd (t) from the SE equations (77) with initial condition (79). Since the definitions in (82) set θju = (xj , qj ) and θiv = wi , the expectations in 77 are over the random variables θv ∼ W,

where (a) follows from (44c); (b) follows from the definition of b(t) and θv in (82) and Gout (·) in (83b); and (c) follows from (77a). Similarly, one can show that, if (85b) holds for some t, then (85a) holds for t + 1. With these equivalences, we can now prove the assertions in Theorem 1. To prove (48), first observe that the limit (81) along with (84) and the definitions in (82) show that   1 d r (b r (t) − α (t)x ), x , q = (D(t), X, Q), lim j j j j n→∞ µr (t)

n→∞

Therefore, with the definition of Gin (·) in (83a) and the modified version of µp (t) in (46a), we can see that λ(t) in (82e) satisfies (75b). Similarly, for ξ(t), observe that (44d) and (44b) show that   αr (t) ∂ = E gout (t, Pb, h(z, W ), µp (t)) ∂z µr (t) z=Z # " −1 ∂ = E gout (t, pb, h(Z, W ), µp (t)) . ∂ pb µr (t) p b=P

θu ∼ (X, Q),

b = αr (t)X + µr (t)D(t). R Since D(t) ∼ N (0, Kd (t)), µr (t)D(t) ∼ N (0, (µr (t))2 Kd (t)) = N (0, τ r (t)), b in where the last equality is due to (85b). Therefore, (X, Q, R) r r r (86) is identically distributed to θ (τ (t), α (t)) in (39). This proves (48), and part (b) of Theorem 1. Part (c) of Theorem 1 is proven similarly. To prove (47) in part (a), m

1 X s 1 (a) = lim µi (t) n→∞ µr (t) n→∞ m i=1 lim

(b)

=

(c)

τ r (t)

= Kb (t) = βKx (t) =

(µr (t))2 Kd (t).

(85a) (85b)

For example, one can show that (85a) holds for t = 0 by comparing the initial conditions (79) with (43) and using the definition of u(0) in (82a). Now, suppose that (85a) holds or some t. Then, (85a) and (84) show that (B(t), θv ) in the expectation (77b) is identically distributed to ((Z, Pb), W ) in

(86)

where

(84)

where (X, Q) and W are the limiting random variables in Section V-B. Now using the SE equations (43), (44) and (45), one can easily show by induction that Kp (t)

(µr (t))2 Kd (t),

=

=

(d)

=

m

1 X ∂ gout (t, pbi (t), yi , µp (t)) n→∞ m ∂ p b i=1   ∂ −E gout (t, Pb, Y, µp (t)) ∂ pb 1 , µr (t)

− lim

where (a) follows from (46); (b) follows from (6) and that fact that we are considering the modified algorithm where µpi (t) is replaced with µp (t); (c) follows the limit (49) and the assumption that the derivative of gout (t, pb, y, µp (t)) is of order k; and (d) follows from (44). Similarly, one can show lim µp (t) = µp (t).

n→∞

This proves (47) and completes the proof of the theorem.

12

GENERALIZED APPROXIMATE MESSAGE PASSING

i  j ( x j )

and j. The messages from the input nodes are given by

i  j ( x j )

∆i←j (t+1, xj ) = const X + fin (xj , qj ) + ∆`→j (t, xj )

x1

z1

x2

z2

(88)

`6=i

The BP iterations are initialized with t = 0 and ∆i→j (−1, xj ) = 0. The BP algorithm is terminated after some finite number of iterations. After the final tth iteration, the final estimate for xj can be taken as the maximum of X ∆j (t+1, xj ) = fin (xj , qj ) + ∆i→j (t, xj ). (89)

x3 zm xn

i

Fig. 3.

Factor or Tanner graph for the linear mixing estimation problem.

B. Quadratic Legendre Transforms A PPENDIX B G-AMP FOR BP-BASED MAP In this section, we show that with the functions gin and gout in (13) and (18), the G-AMP algorithm can be seen heuristically as a first-order approximation of BP for the MAP estimation problem. The derivations in this section are not rigorous, since we do not formally claim any properties of this approximation.

To approximate the BP algorithm, we need the following simple result. Given a function f : R → R, define the functions (Lf )(x, r, µ) (Γf )(r, µ)

(90a)

:=

(90b)

x

(Λf )(r, µ)

:=

max(Lf )(x, r, µ)

(90c)

(Λ(k) f )(r, µ)

:=

∂ (Λf )(r, µ), ∂rk

(90d)

A. Standard BP for MAP Estimation We first review how we would apply standard BP for the MAP estimation problem (12). For both the MAP and MMSE estimation problem, standard BP operates by associating with the transform matrix A a bipartite graph G = (V, E) called the factor or Tanner graph as illustrated in Fig. 3. The vertices V in this graph consists of n “input” or “variable” nodes associated with the variables xj , j = 1, . . . , n, and m “output” or “measurements” nodes associated with the transform outputs zi , i = 1, . . . , m. There is an edge (i, j) ∈ E between the input node xj and output node zi if and only if aij 6= 0. Now let ∆j (xj ) be the maximum of the MAP optimization problem (12) subject to the additional constraint that jth component of the vector x is equal to xj . For the MAP estimation problem, BP iteratively sends messages between the input nodes xj and output nodes zi representing estimates of the value function ∆j (xj ). The value message from the input node xj to output node zi in the tth iteration is denoted ∆i←j (t, xj ) and the reverse message for zi to xj is denoted ∆i→j (t, xj ). BP updates the value messages with the following simple recursions: The messages from the output nodes are given by ∆i→j (t, xj ) = const +

max fout (zi , yi ) + x

X

∆i←r (t, xr )

(87)

r6=j

where the maximization is over vectors x with the jth component equal to xj and zi = aTi x, where aTi is the ith row of the matrix A. The constant term is any term that does not depend on xj , although it may depend on t or the indices i

1 (r − x)2 2µ arg max(Lf )(x, r, µ)

:= f (x) −

x k

over the variables r, µ ∈ R with µ > 0 and k = 1, 2, . . .. The function Λf can be interpreted as a quadratic variant of the standard Legendre transform of f [39]. Lemma 1: Let f : R → R be twice differentiable and assume that all the maximizations in (90) exist and are unique. Then, (Λ(1) f )(r, µ)

=

(Λ(2) f )(r, µ)

=

∂ x b = ∂r

x b−r µ f 00 (b x) , 1 − µf 00 (b x) 1 1 − µf 00 (b x)

where x b = (Γf )(r, µ). Proof: Equation (91a) follows from the fact that ∂L(r, x) x b−r (Λ(1) f )(r, µ) = = . ∂r x=bx µ Next observe that ∂2 L(x, r) = ∂x2 ∂2 L(x, r) = ∂x∂r

f 00 (x) −

1 µ

1 . µ

So the derivative of x b is given by  2 −1 ∂ ∂ ∂2 x b=− L(b x , r) L(b x, r) 2 ∂r ∂x ∂x∂r 1 1/µ = = , 00 1/µ − f (x) 1 − µf 00 (x)

(91a) (91b) (91c)

RANGAN

13

C. G-AMP First-Order Approximation

We now show that the G-AMP algorithm with the functions in (13) and (18) can be seen as a first-order approximation of the updates (87) and (88). We begin by considering the output node update (87). Let

    x̂_j(t)         := arg max_{x_j} ∆_j(t, x_j),                                        (92a)
    x̂_{i←j}(t)     := arg max_{x_j} ∆_{i←j}(t, x_j),                                    (92b)
    1/µ^x_j(t)     := −∂^2/∂x_j^2 ∆_j(t, x_j) |_{x_j = x̂_j(t)},                         (92c)
    1/µ^x_{i←j}(t) := −∂^2/∂x_j^2 ∆_{i←j}(t, x_j) |_{x_j = x̂_{i←j}(t)}.                 (92d)

Now, for small a_{ir}, the values of x_r in the maximization (87) will be close to x̂_{i←r}(t). So, we can approximate each term ∆_{i←r}(t, x_r) with the second-order approximation

    ∆_{i←r}(t, x_r) ≈ ∆_{i←r}(t, x̂_{i←r}(t)) − (1/(2µ^x_r(t))) (x_r − x̂_{i←r}(t))^2,    (93)

where we have additionally made the approximation µ^x_{i←r}(t) ≈ µ^x_r(t) for all i. Now, given x_j and z_i, consider the minimization

    J := min_x Σ_{r≠j} (1/(2µ^x_r)) (x_r − x̂_{i←r})^2,                                  (94)

subject to

    z_i = a_{ij} x_j + Σ_{r≠j} a_{ir} x_r.

A standard least-squares calculation shows that the minimum in (94) is given by

    J = (1/(2µ^p_{i→j})) (z_i − p̂_{i→j} − a_{ij} x_j)^2,

where

    p̂_{i→j} = Σ_{r≠j} a_{ir} x̂_{i←r},      µ^p_{i→j} = Σ_{r≠j} |a_{ir}|^2 µ^x_r.

So, with the approximation (93), the update (87) reduces to

    ∆_{i→j}(t, x_j) ≈ max_{z_i} [ f_out(z_i, y_i) − (1/(2µ^p_{i→j}(t))) (z_i − p̂_{i→j}(t) − a_{ij} x_j)^2 ] + const
                    = H( p̂_{i→j}(t) + a_{ij} x_j, y_i, µ^p_{i→j}(t) ) + const,          (95)

where

    p̂_{i→j}(t)   := Σ_{r≠j} a_{ir} x̂_{i←r}(t),                                         (96a)
    µ^p_{i→j}(t) := Σ_{r≠j} |a_{ir}|^2 µ^x_r(t),                                        (96b)

and

    H(p̂, y, µ^p) := max_z [ f_out(z, y) − (1/(2µ^p)) (z − p̂)^2 ].                       (97)

The constant term in (95) does not depend on z_i. Now let

    p̂_i(t) := Σ_j a_{ij} x̂_{i←j}(t),                                                   (98)

and let µ^p_i(t) be given as in (5a). Then it follows from (96) that

    p̂_{i→j}(t)   = p̂_i(t) − a_{ij} x̂_{i←j}(t),                                         (99a)
    µ^p_{i→j}(t) = µ^p_i(t) − a_{ij}^2 µ^x_j(t).                                        (99b)

Using (99) and neglecting terms of order O(a_{ij}^2), (95) can be further approximated as

    ∆_{i→j}(t, x_j) ≈ H( p̂_i(t) + a_{ij}(x_j − x̂_j(t)), y_i, µ^p_i(t) ) + const.        (100)

Now let

    g_out(p̂, y, µ^p) := ∂/∂p̂ H(p̂, y, µ^p).                                              (101)

Comparing H(·) in (97) with the definitions in (90) and using the properties in (91), it can be checked that

    H(p̂, y, µ^p) = (Λ f_out(·, y))(p̂, µ^p),

and that g_out(·) in (101) and its derivative agree with the definitions in (18) and (21). Now define ŝ_i(t) and µ^s_i(t) as in (6). Then, expanding (100) to second order in a_{ij}(x_j − x̂_j(t)) shows that

    ∆_{i→j}(t, x_j) ≈ const + ŝ_i(t) a_{ij} (x_j − x̂_j(t)) − (µ^s_i(t)/2) a_{ij}^2 (x_j − x̂_j(t))^2
                    = const + [ ŝ_i(t) a_{ij} + a_{ij}^2 µ^s_i(t) x̂_j(t) ] x_j − (µ^s_i(t)/2) a_{ij}^2 x_j^2,    (102)

where again the constant term does not depend on x_j.

We next consider the input update (88). Substituting the approximation (102) into (88) we obtain

    ∆_{i←j}(t+1, x_j) ≈ const + f_in(x_j, q_j) − (1/(2µ^r_{i←j}(t))) (r̂_{i←j}(t) − x_j)^2,    (103)

where the constant term does not depend on x_j, and

    1/µ^r_{i←j}(t) = Σ_{ℓ≠i} a_{ℓj}^2 µ^s_ℓ(t),                                          (104a)
    r̂_{i←j}(t)    = µ^r_{i←j}(t) Σ_{ℓ≠i} [ ŝ_ℓ(t) a_{ℓj} + a_{ℓj}^2 µ^s_ℓ(t) x̂_j(t) ]
                  = x̂_j(t) + µ^r_{i←j}(t) Σ_{ℓ≠i} ŝ_ℓ(t) a_{ℓj}.                         (104b)
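The following short check (an illustration added here, not from the original text) works out H and g_out from (97) and (101) for an assumed Gaussian output channel of the form f_out(z, y) = −(y − z)^2/(2µ_w), with µ_w a noise variance chosen for the example. For this choice the maximization in (97) is quadratic, giving H(p̂, y, µ^p) = −(y − p̂)^2/(2(µ_w + µ^p)) and hence g_out(p̂, y, µ^p) = (y − p̂)/(µ_w + µ^p); by (91b) one also gets −∂g_out/∂p̂ = 1/(µ_w + µ^p). The script confirms the closed forms against a direct numerical maximization.

```python
import numpy as np

# Illustrative AWGN-style output cost (an assumption for this example, since the
# specific output channel is not fixed in this appendix):
#   f_out(z, y) = -(y - z)^2 / (2 mu_w).
mu_w, mu_p, y = 0.3, 0.7, 1.2

z = np.linspace(-20.0, 20.0, 2_000_001)          # fine grid for the max over z
f_out_grid = -(y - z) ** 2 / (2 * mu_w)          # values of f_out(z, y) on the grid

def H(p_hat):
    """H(p_hat, y, mu_p) from (97), evaluated by maximizing over the z grid."""
    return np.max(f_out_grid - (z - p_hat) ** 2 / (2 * mu_p))

def z_star(p_hat):
    """Maximizer of (97); by (91a), g_out = (z_star - p_hat) / mu_p."""
    return z[np.argmax(f_out_grid - (z - p_hat) ** 2 / (2 * mu_p))]

p_hat, h = 0.4, 1e-3
print("H    :", H(p_hat), "vs closed form", -(y - p_hat) ** 2 / (2 * (mu_w + mu_p)))
print("g_out:", (H(p_hat + h) - H(p_hat - h)) / (2 * h),
      "and", (z_star(p_hat) - p_hat) / mu_p,
      "vs closed form", (y - p_hat) / (mu_w + mu_p))
```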


Now define g_in(r, q, µ^r) as in (13). Then, using (103) and (13), x̂_{i←j}(t) in (92b) can be re-written as

    x̂_{i←j}(t+1) ≈ g_in( r̂_{i←j}(t), q_j, µ^r_{i←j}(t) ).                               (105)

Then, if we define r̂_j(t) and µ^r_j(t) as in (7), the quantities r̂_{i←j}(t) and µ^r_{i←j}(t) in (104) can be re-written as

    µ^r_{i←j}(t) ≈ µ^r_j(t),                                                            (106a)
    r̂_{i←j}(t)  ≈ x̂_j(t) + µ^r_j(t) Σ_{ℓ≠i} ŝ_ℓ(t) a_{ℓj}
                = r̂_j(t) − µ^r_j(t) a_{ij} ŝ_i(t),                                      (106b)

where in the approximations we have ignored terms of order O(a_{ij}^2). We can then simplify (105) as

    x̂_{i←j}(t+1) ≈ g_in( r̂_j(t) − a_{ij} ŝ_i(t) µ^r_j(t), q_j, µ^r_j(t) )               [step (a)]
                 ≈ x̂_j(t+1) − a_{ij} ŝ_i(t) D_j(t+1),                                   [step (b)]    (107)

where (a) follows from substituting (106) into (105) and (b) follows from a first-order approximation with the definitions

    x̂_j(t+1) := g_in( r̂_j(t), q_j, µ^r_j(t) ),                                          (108a)
    D_j(t+1) := µ^r_j(t) ∂/∂r̂ g_in( r̂_j(t), q_j, µ^r_j(t) ).                            (108b)

Now,

    D_j(t+1) ≈ µ^r_j(t) ∂/∂r̂ (Γ f_in(·, q_j))( r̂_j(t), µ^r_j(t) )                       [step (a)]
             = µ^r_j(t) / ( 1 − µ^r_j(t) f''_in( x̂_j(t+1), q_j ) )                      [step (b)]
             ≈ −[ ∂^2/∂x_j^2 ∆_{i←j}(t+1, x_j) |_{x_j = x̂_j(t+1)} ]^{−1}                [step (c)]
             ≈ µ^x_j(t+1),                                                              [step (d)]    (109)

where (a) follows from (108b) and by comparing (13) to (90b); (b) follows from (91c); (c) follows from (103); and (d) follows from (92d). Substituting (109) into (107) and (98), we obtain

    p̂_i(t) = Σ_j a_{ij} x̂_j(t) − µ^p_i(t) ŝ_i(t−1),                                     (110)

which agrees with the definition in (5c). In summary, we have shown that with the choice of g_in(·) and g_out(·) in (13) and (18), the G-AMP algorithm can be seen as a first-order approximation of BP-based MAP estimation.
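To make the resulting iteration concrete, the sketch below (an illustration added here, not from the original text) runs the first-order updates assembled above: (110) for p̂_i(t) together with µ^p_i(t) = Σ_j a_{ij}^2 µ^x_j(t), the input-side quantities (104a) and (106b) with the full sums over i, and the scalar steps (108a) and (109). It does so for one assumed all-Gaussian instance, x_j ∼ N(q_j, µ_0) and y = Ax + w with w ∼ N(0, µ_w I), for which g_in and g_out have the simple closed forms used in the code (obtained from (13), (97) and (101) for quadratic f_in and f_out). The problem sizes, initialization, and iteration count are assumptions for the example. In this all-Gaussian case the MAP estimate is the ridge-regression solution, so the script reports the distance of the iterates to that closed form; for large i.i.d. A the iteration typically settles within a few tens of steps.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 100, 200
mu_0, mu_w = 1.0, 0.1                       # prior and noise variances (assumed)
A = rng.normal(size=(m, n)) / np.sqrt(m)    # i.i.d. Gaussian mixing matrix
q = rng.normal(size=n)
x_true = q + np.sqrt(mu_0) * rng.normal(size=n)
y = A @ x_true + np.sqrt(mu_w) * rng.normal(size=m)

# MAP estimate for this all-Gaussian instance: the ridge-regression solution.
x_ridge = np.linalg.solve(A.T @ A / mu_w + np.eye(n) / mu_0,
                          A.T @ y / mu_w + q / mu_0)

# G-AMP iteration assembled from the first-order approximations above.
x_hat, mu_x = q.copy(), mu_0 * np.ones(n)   # initialization (assumed)
s_hat = np.zeros(m)
for t in range(25):
    mu_p = (A ** 2) @ mu_x                  # mu_p_i(t) = sum_j a_ij^2 mu_x_j(t)
    p_hat = A @ x_hat - mu_p * s_hat        # (110), with s_hat from the previous step
    s_hat = (y - p_hat) / (mu_w + mu_p)     # g_out for the AWGN output, from (97)/(101)
    mu_s = 1.0 / (mu_w + mu_p)              # -d g_out / d p_hat
    mu_r = 1.0 / ((A ** 2).T @ mu_s)        # (104a) with the full sum over i
    r_hat = x_hat + mu_r * (A.T @ s_hat)    # (106b) with the full sum over i
    x_hat = (mu_0 * r_hat + mu_r * q) / (mu_0 + mu_r)   # g_in for the Gaussian input, (108a)
    mu_x = mu_r * mu_0 / (mu_0 + mu_r)      # (109)
    if t % 5 == 0 or t == 24:
        err = np.linalg.norm(x_hat - x_ridge) / np.linalg.norm(x_ridge)
        print(f"iter {t:2d}: relative distance to ridge solution = {err:.2e}")
```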

APPENDIX C
G-AMP FOR BP-BASED MMSE

A. Preliminary Lemma

Our analysis needs the following standard result.

Lemma 2: Consider a random variable U with a conditional probability distribution of the form

    p_{U|V}(u|v) := (1/Z(v)) exp( φ_0(u) + u v ),

where Z(v) is a normalization constant (called the partition function). Then,

    ∂/∂v log Z(v)     = E( U | V = v ),                       (111a)
    ∂^2/∂v^2 log Z(v) = ∂/∂v E( U | V = v )                   (111b)
                      = var( U | V = v ).                     (111c)

Proof: The relations are standard properties of exponential families [2].
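As a concrete instance of Lemma 2 (added here for illustration, not part of the original text), take U ∈ {0, 1} with φ_0 ≡ 0, so that Z(v) = 1 + e^v and the conditional law is Bernoulli with mean σ(v) = 1/(1 + e^{−v}). The snippet below confirms (111a)–(111c) by comparing finite-difference derivatives of log Z with the conditional mean and variance; the particular distribution and test point are assumptions made for the example.

```python
import numpy as np

# Lemma 2 for a Bernoulli example: U in {0, 1}, phi_0 = 0, so Z(v) = 1 + exp(v).
log_Z = lambda v: np.log1p(np.exp(v))
sigma = lambda v: 1.0 / (1.0 + np.exp(-v))      # E[U | V = v]

v, h = 0.9, 1e-4
d1 = (log_Z(v + h) - log_Z(v - h)) / (2 * h)
d2 = (log_Z(v + h) - 2 * log_Z(v) + log_Z(v - h)) / h**2

print("(111a):  ", d1, "vs E[U|v]   =", sigma(v))
print("(111b,c):", d2, "vs var(U|v) =", sigma(v) * (1 - sigma(v)))
```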

B. Standard BP for MMSE Estimation

BP-based MMSE estimation is similar to BP-based MAP estimation discussed in Appendix B. For the MMSE estimation problem, the BP messages can also be represented as functions ∆_{i→j}(t, x_j) and ∆_{i←j}(t, x_j). However, these messages are to be interpreted as estimates of the log likelihoods of the variables x_j, conditioned on the system input and output vectors, q and y, and the transform matrix A. The updates for the messages are also slightly different between the MAP and MMSE algorithms. For BP-based MMSE, the output node update (87) is replaced by the update equation

    ∆_{i→j}(t, x_j) = log E( p_{Y|Z}(y_i | z_i) | x_j ) + const,                        (112)

where the expectation is over the random variable z_i = a_i^T x, with the components x_r of x being independent with distributions

    p_{i←r}(x_r) ∝ exp ∆_{i←r}(t, x_r).                                                 (113)

The constant term in (112) can be any term that does not depend on x_j. The MMSE input node update is identical to the MAP update (88). Similar to the MAP estimation algorithm, BP-based MMSE is terminated after some number of iterations. The final estimate for the conditional distribution of x_j is given by p_j(x_j) ∝ exp ∆_j(t, x_j), where ∆_j(t, x_j) is given in (89). From this conditional distribution, one can compute the conditional mean of x_j.

C. G-AMP First-Order Approximation

We now show that with the g_in(·) in (24) and g_out(·) in (28), the G-AMP algorithm can heuristically be regarded as a first-order approximation of the above updates. The calculations are largely the same as the approximations for BP-based MAP estimation in Appendix B, so we will just point out the key differences.

We begin with the output node update (112). Let

    x̂_j(t)       := E[ x_j | ∆_j(t, ·) ],                                               (114a)
    x̂_{i←j}(t)   := E[ x_j | ∆_{i←j}(t, ·) ],                                           (114b)
    µ^x_j(t)     := var[ x_j | ∆_j(t, ·) ],                                             (114c)
    µ^x_{i←j}(t) := var[ x_j | ∆_{i←j}(t, ·) ],                                         (114d)

where we have used the notation E(g(x) | ∆(·)) to mean the expectation over a random variable x with probability distribution

    p_∆(x) ∝ exp( ∆(x) ).                                                               (115)


Therefore, x̂_{i←r}(t) and µ^x_{i←r}(t) are the mean and variance of the random variable x_r with distribution (113). Now, the expectation in (112) is over z_i = a_i^T x with the components x_r being independent with probability distribution (113). So, for large n, the Central Limit Theorem suggests that the conditional distribution of z_i given x_j should be approximately Gaussian,

    z_i ∼ N( p̂_{i→j}(t) + a_{ij} x_j, µ^p_{i→j}(t) ),

where p̂_{i→j}(t) and µ^p_{i→j}(t) are defined in (96). Hence, ∆_{i→j}(t, x_j) can be approximated as

    ∆_{i→j}(t, x_j) ≈ H( p̂_{i→j}(t) + a_{ij} x_j, y_i, µ^p_{i→j}(t) ),                  (116)

where

    H(p̂, y, µ^p) := log E[ p_{Y|Z}(y|z) | p̂, µ^p ],                                     (117)

and the expectation is over the random variable z ∼ N(p̂, µ^p). It is easily checked that

    H(p̂, y, µ^p) = log p(z | p̂, y, µ^p) + const,                                        (118)

where p(z | p̂, y, µ^p) is given in (27) and the constant term does not depend on p̂. Now, identical to the argument in Appendix B, we can define p̂_i(t) as in (98) and µ^p_i(t) as in (5a) so that ∆_{i→j}(t, x_j) can be approximated as (100). Also, if we define g_out(p̂, y, µ^p) as in (101), and ŝ_i(t) and µ^s_i(t) as in (6), then ŝ_i(t) and µ^s_i(t) are respectively the first- and second-order derivatives of H(p̂_i(t), y_i, µ^p_i(t)) with respect to p̂. Then, taking a second-order approximation of (100) results in (102). Also, using (118), g_out(·) as defined in (101) agrees with the definition in (30). To show that g_out(·) is also equivalent to (28), note that we can rewrite H(·) in (118) as

    H(p̂, y, µ^p) = log [ Z(p̂, y, µ^p) / Z_0(p̂, y, µ^p) ],

where

    Z_0(p̂, y, µ^p) := ∫ exp[ (1/(2µ^p)) (2 p̂ z − z^2) ] dz,
    Z(p̂, y, µ^p)   := ∫ exp[ f_out(z, y) + (1/(2µ^p)) (2 p̂ z − z^2) ] dz.

Then the relations (28) and (29) follow from the relations (111).

We next consider the input node update (88). Similar to Appendix B, we can substitute the approximation (102) into (88) to obtain (103), where r̂_{i←j}(t) and µ^r_{i←j}(t) are defined in (104). Using the approximation (103) and the definition of F_in(·) in (14), the probability distribution p_∆(x_j) in (115) with ∆ = ∆_{i←j}(t, x_j) is approximately

    p_{∆_{i←j}(t,·)}(x_j) ≈ (1/Z) exp F_in( x_j, r̂_{i←j}(t), q_j, µ^r_{i←j}(t) ),

where Z is a normalization constant. But, based on the form of F_in(·) in (14), this distribution is identical to the conditional distribution of X in θ^r(µ^r) in (39) given (R̂, Q) = (r̂, q). Therefore, if we define g_in(·) as in (24), it follows that x̂_{i←j}(t) in (114b) satisfies (105). The remainder of the proof now follows as in Appendix B.
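The equivalence just used, namely that the derivative form (101)/(30) of g_out matches the posterior-mean form (28), g_out(p̂, y, µ^p) = (ẑ_0 − p̂)/µ^p with ẑ_0 = E(z | p̂, y, µ^p), can be checked numerically. The snippet below (added here for illustration) does so for one assumed output channel, a probit-type likelihood p_{Y|Z}(y|z) = Φ(yz) with y ∈ {−1, +1}; the channel, grid, and test point are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

# Assumed illustrative output channel: y in {-1, +1}, P(y|z) = Phi(y z).
y, mu_p = +1, 0.6
p_y_given_z = lambda z: norm.cdf(y * z)

z = np.linspace(-12, 12, 200001)
dz = z[1] - z[0]

def H(p_hat):
    """H(p_hat, y, mu_p) = log E[p_{Y|Z}(y|z)] with z ~ N(p_hat, mu_p), cf. (117)."""
    w = norm.pdf(z, loc=p_hat, scale=np.sqrt(mu_p))
    return np.log(np.sum(p_y_given_z(z) * w) * dz)

def z0(p_hat):
    """Posterior mean E[z | p_hat, y, mu_p] under p(z) proportional to P(y|z) N(z; p_hat, mu_p)."""
    w = p_y_given_z(z) * norm.pdf(z, loc=p_hat, scale=np.sqrt(mu_p))
    return np.sum(z * w) / np.sum(w)

p_hat, h = 0.3, 1e-4
gout_deriv = (H(p_hat + h) - H(p_hat - h)) / (2 * h)   # derivative form, (101)/(30)
gout_mean = (z0(p_hat) - p_hat) / mu_p                  # posterior-mean form, (28)
print("g_out via dH/dp_hat    :", gout_deriv)
print("g_out via (z0 - p)/mu_p:", gout_mean)
```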

APPENDIX D
PROOF OF (70)

Define

    D(K^p) := E[ ∂/∂p̂ g_out(t, h(Z, W), P̂, µ^p(t)) ] − E[ ∂/∂z g_out(t, h(Z, W), P̂, µ^p(t)) ],    (119)

where the expectation is over (Z, P̂) ∼ N(0, K^p) and W ∼ p_W(w) independent of (Z, P̂). We must show that when K^p is of the form (65) and g_out(·) is given by (28), then D(K^p) = 0.

To this end, let Q be the two-dimensional random vector

    Q = [ Z  P̂ ]^T ∼ N(0, K^p),                                                         (120)

and let G ∈ R^{1×2} be the derivative

    G = E[ ∂/∂q g_out(t, h(Z, W), P̂, µ^p(t)) ],

which is the row vector whose two components are the partial derivatives with respect to z and p̂. Thus, D(K^p) in (119) can be re-written as

    D(K^p) = G [ 1  −1 ]^T.                                                             (121)

Also, using Stein's Lemma (Lemma 4 below) and the covariance (120),

    E[ Q g_out(t, h(Z, W), P̂, µ^p(t)) ] = E[ Q Q^T ] G^T = K^p G^T.

Applying this equality to (121) we obtain

    D(K^p) = E[ φ(Z, P̂) g_out(t, h(Z, W), P̂, µ^p(t)) ],                                 (122)

where φ(·) is the function

    φ(z, p̂) := [ z  p̂ ] (K^p)^{−1} [ 1  −1 ]^T.

When K^p is of the form (65), it is easily checked that

    φ(z, p̂) = p̂ / µ^p(t).                                                               (123)

Therefore,

    D(K^p) = (1/µ^p(t)) E[ P̂ g_out(t, h(Z, W), P̂, µ^p(t)) ]                             [step (a)]
           = (1/µ^p(t)) E[ P̂ ∂/∂P̂ log p_{Y|P̂}(Y | P̂) ]                                 [step (b)]
           = (1/µ^p(t)) E[ P̂ ∫ ∂/∂p̂ p_{Y|P̂}(y | P̂) dy ]                                [step (c)]
           = (1/µ^p(t)) E[ P̂ ∂/∂p̂ (1) ] = 0,

where (a) follows from substituting (123) into (122); (b) follows from (66); and (c) follows from taking the derivative of the logarithm. This shows that D(K^p) = 0 and proves (70).
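The key identity used above is Stein's Lemma (Lemma 4 in Appendix E). The short Monte Carlo check below (added here for illustration) demonstrates its scalar special case, E[Z_1 φ(Z_2)] = Cov(Z_1, Z_2) E[φ'(Z_2)] for zero-mean jointly Gaussian (Z_1, Z_2). The covariance matrix, the test function φ = tanh, and the sample size are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Jointly Gaussian (Z1, Z2) with an assumed covariance; phi = tanh as a smooth test function.
K = np.array([[1.0, 0.6],
              [0.6, 0.8]])
Z = rng.multivariate_normal(mean=[0.0, 0.0], cov=K, size=2_000_000)
Z1, Z2 = Z[:, 0], Z[:, 1]

lhs = np.mean(Z1 * np.tanh(Z2))                  # E[Z1 phi(Z2)]
rhs = K[0, 1] * np.mean(1.0 / np.cosh(Z2) ** 2)  # Cov(Z1, Z2) * E[phi'(Z2)]
print("E[Z1 phi(Z2)]       :", lhs)
print("Cov(Z1,Z2) E[phi'Z2]:", rhs)
```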


APPENDIX E
PROOF SKETCH FOR THEOREM 2

As stated earlier, Bayati and Montanari in [16] have already proved the result for the scalar case when n_b = n_d = 1. Only very minor modifications are required for the vector-valued case, so we will just provide a sketch of the key changes. For the vector case, it is convenient to introduce the notation b(t) for the matrix whose rows are the vectors b_i(t)^T:

    b(t) := [ b_1(t)^T ; ··· ; b_m(t)^T ] ∈ R^{m×n_b}.

We define u(t), v(t) and d(t) similarly. Then, in analogy with the definitions in [16], we let

    x(t) := d(t) + u(t) ξ(t)^T ∈ R^{n×n_d},
    y(t) := b(t) + v(t−1) λ(t)^T ∈ R^{m×n_b},

and define the matrices

    X(t) := [ x(0) | ··· | x(t−1) ] ∈ R^{n×t n_d},
    Y(t) := [ y(0) | ··· | y(t−1) ] ∈ R^{m×t n_b},
    U(t) := [ u(0) | ··· | u(t−1) ] ∈ R^{n×t n_b},
    V(t) := [ v(0) | ··· | v(t−1) ] ∈ R^{m×t n_d}.

The updates (73) can then be re-written in matrix form as

    X(t) = A^T V(t),      Y(t) = A U(t).                                                (124)

Next, also following the proof in [16], let v_∥(t) be the projection of each column of v(t) onto the column space of V(t), and let v_⊥(t) be its orthogonal component, v_⊥(t) = v(t) − v_∥(t). Thus, we can write

    v_∥(t) = Σ_{i=0}^{t−1} v(i) α_i(t),                                                 (125)

for matrices α_i(t) ∈ R^{n_d×n_d}. Similarly, let u_∥(t) and u_⊥(t) be the parallel and orthogonal components of u(t) with respect to the column space of U(t), so that

    u_∥(t) = Σ_{i=0}^{t−1} u(i) β_i(t),                                                 (126)

where β_i(t) ∈ R^{n_b×n_b}.

Now let G(t_1, t_2) be the sigma-algebra generated by b(0), ..., b(t_1), v(0), ..., v(t_1), d(0), ..., d(t_2−1) and u(0), ..., u(t_2). Also, for any sigma-algebra G and random variables X and Y, we will say that X is equal in distribution to Y conditional on G if E(φ(X)Z) = E(φ(Y)Z) for any Z that is G-measurable. In this case, we write X|_G =^d Y. With these definitions, the vector-valued analogue of the main technical lemma [16, Lemma 1] can be stated as follows.

Lemma 3: Under the assumptions of Theorem 2, the following hold for all t ≥ 0:

(a) The conditional distributions of d(t) and b(t) are given by

    d(t) |_{G(t,t)}   =^d  Σ_{i=0}^{t−1} d(i) α_i(t) + Ã^T v_⊥(t) + Ũ(t) o⃗_t(1),
    b(t) |_{G(t−1,t)} =^d  Σ_{i=0}^{t−1} b(i) β_i(t) + Ã u_⊥(t) + Ṽ(t) o⃗_t(1),

where Ã is an independent copy of A, and the matrices α_i(t) and β_i(t) are the coefficients in the expansions (125) and (126). The matrix Ũ(t) is such that its columns form an orthogonal basis for the column space of U(t) with Ũ(t)^T Ũ(t) = n I_{t n_b}. Similarly, Ṽ(t) is such that its columns form an orthogonal basis for the column space of V(t) with Ṽ(t)^T Ṽ(t) = m I_{t n_d}. The vector o⃗_t(1) goes to zero almost surely.

(b) For any pseudo-Lipschitz functions φ_d : R^{n_d (t+1)} × Θ^u → R and φ_b : R^{n_b (t+1)} × Θ^v → R:

    lim_{n→∞} (1/n) Σ_{j=1}^n φ_d( d_j(0), ..., d_j(t), θ_j^u ) = E[ φ_d( D(0), ..., D(t), θ^u ) ],
    lim_{m→∞} (1/m) Σ_{i=1}^m φ_b( b_i(0), ..., b_i(t), θ_i^v ) = E[ φ_b( B(0), ..., B(t), θ^v ) ],

where the expectations are over Gaussian random vectors (D(0), ..., D(t)) and (B(0), ..., B(t)) and the random variables θ^u and θ^v in the limit (76). The variables θ^u and θ^v are independent of B(r) and D(r), and the marginal distributions of B(r) and D(r) are given by (78).

(c) For all 0 ≤ r, s ≤ t, the following limits exist, are bounded, and are non-random:

    lim_{n→∞} (1/n) Σ_{j=1}^n d_j(r) d_j(s)^T = lim_{m→∞} (1/m) Σ_{i=1}^m v_i(r) v_i(s)^T,
    lim_{m→∞} (1/m) Σ_{i=1}^m b_i(r) b_i(s)^T = β lim_{n→∞} (1/n) Σ_{j=1}^n u_j(r) u_j(s)^T.

(d) Suppose φ_d : R^{n_d} × Θ^u → R and φ_b : R^{n_b} × Θ^v → R are almost everywhere continuously differentiable with a bounded derivative with respect to the first argument.


Then, for all 0 ≤ r, s ≤ t, the following limits exist, are bounded, and are non-random:

    lim_{n→∞} (1/n) Σ_{j=1}^n d_j(r) φ_d( d_j(s), θ_j^u )^T = lim_{n→∞} (1/n) Σ_{j=1}^n d_j(r) d_j(s)^T G_d(s)^T,
    lim_{m→∞} (1/m) Σ_{i=1}^m b_i(r) φ_b( b_i(s), θ_i^v )^T = lim_{m→∞} (1/m) Σ_{i=1}^m b_i(r) b_i(s)^T G_b(s)^T,

where G_d(s) and G_b(s) are the empirical derivatives

    G_d(s) := lim_{n→∞} (1/n) Σ_{j=1}^n ∂/∂d φ_d( d_j(s), θ_j^u ),
    G_b(s) := lim_{m→∞} (1/m) Σ_{i=1}^m ∂/∂b φ_b( b_i(s), θ_i^v ).

(e) For ℓ = k − 1, the following bounds hold almost surely:

    lim_{n→∞} (1/n) Σ_{j=1}^n ‖d_j(t)‖^{2ℓ} < ∞,      lim_{m→∞} (1/m) Σ_{i=1}^m ‖b_i(t)‖^{2ℓ} < ∞.

This lemma is a verbatim copy of [16, Lemma 1] with some minor changes for the vector-valued case. Observe that Theorem 2 is a special case of part (b) of Lemma 3, obtained by considering functions of the form

    φ_d( d(0), ..., d(t), θ^u ) = φ_d( d(t), θ^u ),
    φ_b( b(0), ..., b(t), θ^v ) = φ_b( b(t), θ^v ).

That is, we consider functions that depend only on the most recent iteration.

The proof of Lemma 3 also follows almost identically the proof of the analogous lemma in the scalar case in [16]. The key idea in that proof is the following conditioning argument, originally used by Bolthausen in [25]: to evaluate the conditional distributions of d(t) with respect to G(t, t) and of b(t) with respect to G(t−1, t) in part (a) of Lemma 3, one evaluates the corresponding conditional distribution of the matrix A. This conditional distribution is precisely the distribution of A conditioned on linear constraints of the form (124), which is simply the distribution of a Gaussian random matrix conditioned on lying in an affine subspace. That distribution has a simple expression as a deterministic offset plus a projected Gaussian. The detailed expressions are given in [16, Lemma 6], and the identical equations can be used here. The proof uses this conditioning principle along with an induction argument on the iteration number t and the statements (a) to (e) of the lemma. We omit the details, as they are involved but follow with only minor changes from the original scalar proof in [16].

The only other non-trivial extension that is needed is the matrix form of Stein's Lemma, which can be stated as follows.

Lemma 4 (Stein's Lemma [40]): Suppose Z_1 ∈ R^{n_1} and Z_2 ∈ R^{n_2} are jointly Gaussian random vectors and φ : R^{n_2} → R^{n_3} is any function such that the expectations

    G := E[ ∂/∂z_2 φ(Z_2) ] ∈ R^{n_3×n_2}      and      E[ Z_1 φ(Z_2)^T ] ∈ R^{n_1×n_3}

exist. Then,

    E[ Z_1 φ(Z_2)^T ] = E[ (Z_1 − Z̄_1)(Z_2 − Z̄_2)^T ] G^T,

where Z̄_i is the expectation of Z_i. Note that part (d) of Lemma 3 is of the same form as this lemma.

REFERENCES

[1] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann, 1988.
[2] M. J. Wainwright and M. I. Jordan, Graphical Models, Exponential Families, and Variational Inference, ser. Foundations and Trends in Machine Learning. Hanover, MA: NOW Publishers, 2008, vol. 1.
[3] R. J. McEliece, D. J. C. MacKay, and J.-F. Cheng, "Turbo decoding as an instance of Pearl's 'belief propagation' algorithm," IEEE J. Sel. Areas Comm., vol. 16, no. 2, pp. 140-152, Feb. 1998.
[4] D. J. C. MacKay, "Good error-correcting codes based on very sparse matrices," IEEE Trans. Inform. Theory, vol. 45, no. 3, pp. 399-431, Mar. 1999.
[5] D. J. C. MacKay and R. M. Neal, "Near Shannon limit performance of low density parity check codes," Electron. Letters, vol. 33.
[6] J. Boutros and G. Caire, "Iterative multiuser joint decoding: unified framework and asymptotic analysis," IEEE Trans. Inform. Theory, vol. 48, no. 7, pp. 1772-1793, Jul. 2002.
[7] M. Yoshida and T. Tanaka, "Analysis of sparsely-spread CDMA via statistical mechanics," in Proc. IEEE Int. Symp. Inform. Th., Seattle, WA, Jun. 2006, pp. 2378-2382.
[8] D. Guo and C.-C. Wang, "Multiuser detection of sparsely spread CDMA," IEEE J. Sel. Areas Comm., vol. 26, no. 3, pp. 421-431, Mar. 2008.
[9] N. Sommer, M. Feder, and O. Shalvi, "Low-density lattice codes," IEEE Trans. Inform. Theory, vol. 54, no. 4, pp. 1561-1585, Apr. 2008.
[10] D. Baron, S. Sarvotham, and R. G. Baraniuk, "Bayesian compressive sensing via belief propagation," IEEE Trans. Signal Process., vol. 58, no. 1, pp. 269-280, Jan. 2010.
[11] D. Guo, D. Baron, and S. Shamai, "A single-letter characterization of optimal noisy compressed sensing," in Proc. 47th Ann. Allerton Conf. on Commun., Control and Comp., Monticello, IL, Sep. 2009.
[12] D. L. Donoho, A. Maleki, and A. Montanari, "Message passing algorithms for compressed sensing," arXiv:0907.3574v1 [cs.IT], Jul. 2009.
[13] T. Tanaka and M. Okada, "Approximate belief propagation, density evolution, and neurodynamics for CDMA multiuser detection," IEEE Trans. Inform. Theory, vol. 51, no. 2, pp. 700-706, Feb. 2005.
[14] D. Guo and C.-C. Wang, "Asymptotic mean-square optimality of belief propagation for sparse linear systems," in Proc. Inform. Th. Workshop, Chengdu, China, Oct. 2006, pp. 194-198.
[15] S. Rangan, "Estimation with random linear mixing, belief propagation and compressed sensing," arXiv:1001.2228v1 [cs.IT], Jan. 2010.
[16] M. Bayati and A. Montanari, "The dynamics of message passing on dense graphs, with applications to compressed sensing," arXiv:1001.3448v1 [cs.IT], Jan. 2010.
[17] S.-Y. Chung, T. J. Richardson, and R. L. Urbanke, "Analysis of sum-product decoding of low-density parity-check codes using a Gaussian approximation," IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 657-670, Feb. 2001.
[18] H. El Gamal and R. Hammons, "Analyzing the turbo decoder using the Gaussian approximation," IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 671-686, Feb. 2001.
[19] L. R. Varshney, "Performance of LDPC codes under noisy message-passing decoding," in Proc. Inform. Th. Workshop, Lake Tahoe, CA, Sep. 2007, pp. 178-183.


[20] S. ten Brink, "Convergence behavior of iteratively decoded parallel concatenated codes," IEEE Trans. Comm., vol. 49, no. 10, pp. 1727-1737, Oct. 2001.
[21] Y. Kabashima, "A CDMA multiuser detection algorithm on the basis of belief propagation," J. Phys. A: Math. Gen., vol. 36.
[22] J. P. Neirotti and D. Saad, "Improved message passing for inference in densely connected systems," Europhys. Lett., vol. 71, no. 5, pp. 866-872, Sep. 2005.
[23] A. Montanari and D. Tse, "Analysis of belief propagation for non-linear problems: The example of CDMA (or: How to prove Tanaka's formula)," arXiv:cs/0602028v1 [cs.IT], Feb. 2006.
[24] D. Guo and C.-C. Wang, "Random sparse linear systems observed via arbitrary channels: A decoupling principle," in Proc. IEEE Int. Symp. Inform. Th., Nice, France, Jun. 2007, pp. 946-950.
[25] E. Bolthausen, "On the high-temperature phase of the Sherrington-Kirkpatrick model," Seminar at Eurandom, Eindhoven, Sep. 2009.
[26] T. Tanaka, "A statistical-mechanics approach to large-system analysis of CDMA multiuser detectors," IEEE Trans. Inform. Theory, vol. 48, no. 11, pp. 2888-2910, Nov. 2002.
[27] D. Guo and S. Verdú, "Randomly spread CDMA: Asymptotics via statistical physics," IEEE Trans. Inform. Theory, vol. 51, no. 6, pp. 1983-2010, Jun. 2005.
[28] Y. Kabashima, T. Wadayama, and T. Tanaka, "Typical reconstruction limit of compressed sensing based on lp-norm minimization," arXiv:0907.0914 [cs.IT], Jun. 2009.
[29] C. M. Bishop, Pattern Recognition and Machine Learning, ser. Information Science and Statistics. New York, NY: Springer, 2006.
[30] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Trans. Inform. Theory, vol. 52, no. 2, pp. 489-509, Feb. 2006.
[31] D. L. Donoho, "Compressed sensing," IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1289-1306, Apr. 2006.
[32] E. J. Candès and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?" IEEE Trans. Inform. Theory, vol. 52, no. 12, pp. 5406-5425, Dec. 2006.
[33] D. Tse and S. Hanly, "Linear multiuser receivers: Effective interference, effective bandwidth and capacity," IEEE Trans. Inform. Theory, vol. 45, no. 3, pp. 641-675, Mar. 1999.
[34] A. K. Fletcher, S. Rangan, and V. K. Goyal, "On-off random access channels: A compressed sensing framework," arXiv:0903.1022v1 [cs.IT], Mar. 2009.
[35] S. Verdú and S. Shamai, "Spectral efficiency of CDMA with random spreading," IEEE Trans. Inform. Theory, vol. 45, no. 3, pp. 622-640, Mar. 1999.
[36] S. Rangan, A. K. Fletcher, and V. K. Goyal, "Asymptotic analysis of MAP estimation via the replica method and applications to compressed sensing," arXiv:0906.3234v2 [cs.IT], Aug. 2009.
[37] S. Rangan and R. K. Madan, "Belief propagation methods for intercell interference coordination," arXiv:1008.0060v1 [cs.NE], Jul. 2010.
[38] T. J. Richardson and R. L. Urbanke, "The capacity of low-density parity check codes under message-passing decoding," Bell Laboratories, Lucent Technologies, Tech. Rep. BL01121710-981105-34TM, Nov. 1998.
[39] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton University Press, 1970.
[40] C. Stein, "A bound for the error in the normal approximation to the distribution of a sum of dependent random variables," in Proc. Sixth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, 1972.
