1
Channel Vector Subspace Estimation from Low-Dimensional Projections Saeid Haghighatshoar∗, Giuseppe Caire∗
arXiv:1509.07469v1 [cs.IT] 24 Sep 2015
∗ Communications
and Information Theory Group
Technische Universit¨at Berlin Emails:
[email protected],
[email protected]
Abstract In this paper, we propose efficient algorithms for estimating the received signal subspace in a Single-Input Multiple-Output (SIMO) wireless communication channel consisting of a single-antenna user (transmitter) and a multi-antenna base-station (receiver) with M antennas (M ≫ 1). Specifically, the signal may correspond to the uplink pilot of a given user that is sent while setting up a multi-user MIMO data transmission. The signal subspace depends on the channel vector covariance matrix, which is time-invariant and independent of frequency according to the well-known Wide Sense Stationary Uncorrelated Scattering (WSSUS) fading model, thus, it can be accurately estimated over multiple pilot transmissions. Once the appropriate subspaces for different users are identified, they can be used to build suitable beamforming matrices to improve the transmission rate or manage multiuser interference according to the so-called Joint Space Division and Multiplexing (JSDM) approach. For reducing the RF front-end complexity and overall A/D bit-rate, it is desirable to split the JSDM downlink precoder (resp., uplink linear receiver) into the product of a baseband linear projection (digital) and an RF reconfigurable beamforming network (analog), such that only a reduced number m ≪ M of A/D converters and RF modulation/demodulation chains is used. This implies that only m analog observations can be obtained for subspace
estimation, thus, the standard sample covariance estimator is not available. Rather, the subspace information must be extracted by snapshots of the projected channel vector, where the projection, implemented by the analog beamforming network, reduces the M -dimensional signal in the antenna space to the m-dimensional (m ≪ M ) analog outputs. We develop efficient algorithms for estimating the dominant signal subspace from such low-dimensional projections. In particular, when the projection matrix contains all zeros but a single element equal to 1 on each row, the projection √ reduces to “antenna selection” or “array subsampling”. Our algorithms sample only m = O(2 M ) specific array elements according to a coprime sampling scheme, and for a given p ≪ M , return a p-dimensional signal subspace
that has a performance comparable with the best p-dimensional beamformer that can be designed by knowing the exact covariance matrix of the received signal. We asses the performance of our proposed estimators both analytically and empirically via numerical simulations, and provide extensive comparisons with the other state-of-the-art methods proposed in the literature, which are also reviewed and put in the context of estimating the dominant signal subspace.
2
1 I NTRODUCTION Consider a multiuser MIMO channel formed by a base-station with M antennas and K single-antenna mobile users in a cellular network. For simplicity, we focus here on a flat-fading channel in which the bandwidth of the signal is less than the channel’s coherence bandwidth. In an OFDM system, this corresponds to focusing on a single subcarrier. The wideband extension of our treatment is not conceptually difficult although it leads to some interesting considerations depending on the frequency-selective channel model, and will be developed in a subsequent companion paper. Following the current massive MIMO approach [1–6] uplink and downlink are organized in Time Division Duplexing (TDD), and the base-station transmit/receive hardware is designed or calibrated in order to preserve uplink-downlink reciprocity [2, 7] such that the base-station can estimate the channel vectors of the users from uplink training signals sent by the users on orthogonal dimensions. Since there is no multiuser interference on the uplink training phase, in this paper we shall focus on the channel estimation problem for a single user. In massive MIMO systems, the number of antennas M is typically much larger than the number of users scheduled to communicate over a given transmission time slot. Coherent channel state information necessary to set up the linear receiver (uplink) or the linear precoder (downlink) should be obtained from the uplink training phase in the same channel coherence time [1]. Letting D denote the duration of a time slot (expressed in channel uses), τ D channel uses are dedicated to training and the remaining (1 − τ )D channel uses are dedicated to data transmission. It turns out that for isotropically distributed channel vectors with min{M, K} ≥ D/2, it is optimal to devote τ = 1/2 fraction of the time to channel estimation while serving only D/2 out of K users in the remaining half [1].1 The corresponding high-SNR scaling of the sum-rate is Rsum = (D/4) log(snr) + O(1), where snr is the signal-to-noise ratio (SNR). Notice that in order to achieve this optimal pre-log factor the number of antennas M is irrelevant (it influences only the constant O(1) term) as long as M ≥ D/2. However, for M ≫ K = D/2, the very large excess antennas yield several system-level advantages summarized in the above mentioned literature (and in many more papers, which would be too long to mention here). These “massive MIMO” gains come at the price of a large complexity in the base-station front-end because a very large number M of analog RF signals must be handled both in transmission (downlink) and in reception (uplink), requiring the replication of A/D conversion and RF modulation/demodulation for each antenna. In order to reduce the complexity, Hybrid Digital Analog (HDA) architectures have been proposed in which the precoding (downlink) or detection (uplink) linear transformation is split into the product of two transformations. One is implemented in the analog RF domain, and reduces the dimensionality from M to some intermediate dimension m ≪ M . Then, the resulting m outputs (downlink) or inputs (uplink) are converted into digital baseband signals, and are further processed in the digital domain. Along these lines, a particularly effective scheme is the Joint Space Division and Multiplexing (JSDM) approach proposed and analyzed in [8–12]. JSDM starts from the consideration that in massive MIMO systems the channel vector for each user is typically highly correlated, i.e., it is far from an 1 When
K > D/2, then groups of D/2 users are scheduled over diferent time slots such that all users achieve a positive throughput (i.e.,
rate averaged over a long sequence of scheduling slots).
3
isotropically distributed Gaussian vector. Consequently, the covariance matrix for each user is typically low-rank, because the propagation scattering from a user to a massive MIMO base-station occurs over a narrow angular spread. This is especially true in the case of a tower-mounted base-station and/or in the case of mm-wave channels, as experimentally confirmed by channel measurements (see [10] and references therein). This property prompts a few important observations: 1) Let h ∈ C
M
denote the channel vector of a user, and let S = E[hhH ] denote its covariance matrix. Then,
by the KL decomposition, we can write h = VΛ1/2 w, where V and Λ are obtained form the singular value decomposition (SVD) of S given by S = VΛVH , and where w is a random vector with uncorrelated and normalized (unit-variance) components. If rank(S) = r ≪ M , then V is an M × r tall unitary matrix. This means that while h is a random (Gaussian) vector, it is contained in the r-dimensional subspace span(V). Furthermore, according to the well-known and widely accepted Wide-Sense Stationary Uncorrelated Scattering (WSSUS) channel model, S is invariant over time and frequency. Therefore, the channel subspace V can be estimated from snapshots of the vector h by SVD of the corresponding sample covariance matrix. In practice, the WSS assumption is only a local approximation, that holds for small motion around a given geometry of transmitter, receiver and scattering objects. However, while the small scale fading has a coherence time between 0.1s and 10 ms for motion speed between 1 m/s to 10 m/s at carrier of 3 GHz,2 the time over which the channel vector can be considered WSS is of the order of tens of seconds, i.e., from 2 to 4 orders of magnitude larger. Hence, estimating the channel subspace information V is a much easier task than estimating the instantaneous channel vector h on each coherence time slot.
2) For mm-wave channels (e.g., carrier frequency of the order of 30 GHz), the channel coherence time is decreased by a factor of 10 (between 10ms to 1ms). This has a very important impact on the efficiency of massive MIMO systems. From what said before, for isotropically distributed channels the best possible system multiplexing gain (i.e., the sum-rate pre-log factor) is D/4. The number D of channel uses spanning a time slot can be quantified as D ≈ Wc ∆tc , where Wc , the channel coherence bandwidth in Hz, depends on the delay spread and ∆tc is the channel coherence time. As the latter decreases by a factor of 10, we see that D also decreases by the same factor. Thus, schemes that treat mm-wave channels as isotropically distributed, and try to estimate them on a slot-by-slot basis from the training phase result in a severe limitation of the system multiplexing gain, since for mm-wave channels the coherence time is much shorter than traditional cellular carrier frequencies. 3) By grouping users into G > 1 groups with approximately the same channel subspace [8–10] it is possible to overcome the above dimensionality bottleneck. In short, the channel subspace information (i.e., the columns of Vg for each group g = 1, . . . , G) can be accurately estimated over a long sequence of time slots. Based 2 The
maximum Doppler bandwidth for a receiver-transmitter relative motion speed of v m/s and carrier frequency f0 Hz is given by vf0 /c,
where c is the speed of light. For a quick order-of-magnitude evaluation, consider f0 = 3 GHz, and users moving between 1m/s (e.g., walking speed, or just handwaving the handset) and 10m/s (slowly moving cars in a urban environment). This yields Doppler bandwidth between 10 and 100 Hz.
4
on this information, the groups can be separated by a zero-forcing projection that nulls out the inter-group interference. Then, additional multiuser multiplexing gain can be obtained by serving Kg ≥ 1 users in each group g. For the “inner” intra-group precoding, only the low-dimensional projected channels must be learned on a per-slot basis. In this way, the system multiplexing gain can be boosted to DG/4, such that a decrease in D can be compensated by a large G > 1. Selecting groups with nearly orthogonal channel subspaces is referred to as channel correlation diversity in [13]. 4) As anticipated above, JSDM is also ideally suited for an HDA implementation of the base-station precoder (downlink) and receiver (uplink), with a clear separation between the analog RF processing and the digital baseband processing. The transformation that depends only on subspace information of the channels is implemented in analog, and it is essentially similar to standard analog RF beamforming, with the only difference that different beamforming weights must be used for different user groups. This requires a fast reconfigurable RF beamforming network, implementing an M × m complex matrix (amplitude scaling and phase shifting in the analog domain). The additional processing for further signal separation on a per-group basis can be implemented in the baseband digital domain. Notice that if the groups are already sufficiently well separated in the analog RF domain, then the baseband precoding (resp., detection) transformation takes on the form of a block-diagonal matrix, with G diagonal blocks, one for each group of users [8]. In the extreme case where only one user per group is scheduled, the baseband processing is particularly simple, involving only non-interfering single-input single-output channels [10]. From the above discussion, it is apparent that a central task at the base-station side consists in estimating, for each user, the subspace of the signal span(V) that contains most of the user channel power. However, in an HDA implementation, we do not have direct access to all the M antennas, but only to m ≪ M analog output observations. Hence, we need to estimate this subspace from snapshots of a projected version of the signal. 1.1 Contribution In this paper, we aim to design such a subspace estimator for a large uniform linear array (ULA) with M ≫ 1 antennas. The geometry of the array has been shown in Fig. 1, with array elements having the uniform distance d from each other. We assume that the array scans the angular range [−θmax , θmax ] for some θmax ∈ (0, π/2). In this paper, we consider the antenna spacing d =
λ 2 sin(θmax ) ,
where λ is the wave-length. We model the statistics of the
channel vector with a power angular distribution function (angular scattering function) that expresses the received power density as a function of the signal’s angle of arrival (AoA) θ as shown in Fig. 1. As motivated before, we assume that we have only a low-dimensional sketch of the received signal via m ≪ M linear projections. In the case where the projection matrix contains a single non-zero element equal to 1 in each row, we recover the case of array subsampling as a special case. In particular, we shall consider a coprime sampling scheme which samples √ only O(2 M ) specific array elements. The coprime subsampling was first developed by Vaidyanathan and Pal in [14, 15], where they showed that for a given spatial span for the array, one obtains approximately the same resolution as a uniform linear array by nonuniformly sampling only a few array elements in coprime locations.
5
(M − 1)d • 4d • 3d • 2d •
θ
d• 0• Fig. 1: Array configuration in a multi-antenna receiver
Applying coprime sampling, we can significantly reduce the required sampling rate and power consumption of the base-station front-end. We propose several algorithms for estimating the signal subspace and cast them as convex optimization problems that can be efficiently solved. We also empirically compare the performance of our algorithms with other state-of-the-art algorithms in the literature via numerical simulations. 1.2 Notations We do not distinguish, in terms of notation, between deterministic quantities and random variables, since the meaning of the various symbols is clear from the context. We use P(·) to denote a generic probability measure for the probability space that includes all the random variables that appear in the event inside the parentheses. We use M, m, T for the array size, the dimension of the observation (sketch) and the number of training slots. We denote vectors by boldface small letters (e.g., x), matrices by boldface capital letters (e.g., X), scalar constant by non-boldface letters (e.g., x or X), and sets by calligraphic letters (e.g., X ). The i-th element of a vector x and the (i, j)-th element of a matrix X will be denoted equivalently as xi and Xi,j , or [x]i and [X]i,j respectively. We denote by CN (µ, Σ) the complex circularly symmetric multivariate Gaussian distribution with mean vector µ and covariance matrix Σ, and we use ∼ to indicate “distributed as”. Throughout the paper, the output of an optimization algorithm arg minx f (x) is denoted by x∗ . We denote the Hermitian and the transpose of a matrix X by XH and XT , respectively. The same notation is used for vectors. We use T and T+ for the space of all M × M Hermitian Toeplitz and Hermitian semi-definite Toeplitz matrices. We always use I for the identity matrix, where the dimensions may be explicitly indicated for the sake of clarity (e.g., Ip denotes the p × p identity matrix). We
define H(M, p) = {UM×p ∈ CM×p : UH U = Ip } as the set of tall unitary matrices of dimension M × p. For
matrices and vectors of appropriate dimensions, we define the inner product hK, Li = Tr(KLH ) and the induced p norm kKk = hK, Ki, also known as Frobenius norm for matrices. For an integer k ∈ Z, we use the shorthand
notation [k] for the set of non-negative integers {0, 1, 2, . . . , k − 1}, where the set is empty if k < 0. We denote the cardinality of a set X by |X |.
6
2 R ELATED W ORK Signal subspace estimation is a classical topic in signal processing [16–19]. In order to put our work in perspective, we review here a number of approaches and results that are most directly related to our setting. These also represent the main terms of comparison for our proposed algorithms. We identify the following four categories: 1) Subspace tracking from incomplete observations; 2) Low-rank matrix recovery; 3) Direction-of-arrival (DoA) estimation and super-resolution (SR); 4) Multiple Measurement Vectors (MMV) problem in compressed sensing (CS). In order to illustrate these methods, we assume a simple channel model in which the channel vector is given by a random superposition of fixed antenna array response vectors, corresponding to a finite number p of AoAs (see Fig. 2). As a consequence, a discrete-time complex baseband sample of the received signal takes on the form y=
p X
a(θℓ )wℓ x + n,
(1)
ℓ=1
where x is the transmitted (training) symbol, wℓ ∼ CN (0, σℓ2 ) is the channel gain of the ℓ-th multipath component, n ∼ CN (0, IM ) is the additive white Gaussian noise of the receiver antenna, and where a(θ) ∈ CM denotes the array response at AoA θ, whose k-th component is given by 2πd sin(θ) sin(θ) [a(θ)]k = exp jk k = exp jkπ , k ∈ [M ]. λ sin(θmax ) This model shall be generalized to more realistic continuous or mixed-type AoA distributions in Section 3. Scattering Channel (M − 1)d • 4d • 3d • 2d •
User θi .. .
d• 0•
Fig. 2: A simple model for the scattering channel with discrete angle of arrivals.
(2)
7
According to the uncorrelated scattering model, the channel gains for different paths, i.e., {wℓ }pℓ=1 , are uncorre-
lated, and since they are (jointly) Gaussian, they are mutually statistically independent.3 Without loss of generality, we can assume x = 1 in all training snapshots. Letting A = [a(θ1 ), a(θ2 ), . . . , a(θp )], we can write the received signal at the M base-station antennas at snapshot t ∈ [T ] as y(t) = Aw(t) + n(t),
(3)
where w(t) = (w1 (t), w2 (t), . . . , wp (t))T is the channel gain vector at snapshot t, and n(t) is the array noise. We also write the received signal during the training phase comprising T snapshots as Y = AW + N,
(4)
where Y = [y(0), . . . , y(T − 1)], N = [n(0), . . . , n(T − 1)], W = [w(0), . . . , w(T − 1)]. Here we assume that the training snapshots are sufficiently separated in time such that the channel gain vectors w(t) for different t ∈ [T ]
are i.i.d. ∼ CN (0, Σ) where Σ = diag(σ12 , . . . , σp2 ) contains the channel strength for the p different AoAs. We also
assume that the AoAs {θℓ }pℓ=1 (and therefore the matrix A) remain invariant over the T training slots. Since the
noise variance per component is normalized to 1, and ka(θi )k2 = M for all θi , we define the training signal-to-noise ratio (SNR) as snr = Tr(Σ). As anticipated before, we assume that the baseband processor at the base-station does not have direct acess to the observation Y. Rather, it has access to the low-dimensional sketch X = BY, where B ∈ C
m×M
is a projection
matrix. In particular, the observation sketch at slot time t is given by x(t) = By(t) = BAw(t) + Bn(t).
(5)
2.1 Subspace tracking from incomplete observations From (3), the covariance matrix of y(t) (for all t ∈ [T ]) is given by: Cy = E[y(t)y(t)H ] = AΣAH + IM =
p X
σℓ2 a(θℓ )a(θℓ )H + IM .
(6)
ℓ=1
Let Cy = UΛUH be the SVD of Cy , where Λ = diag(λ1 , λ2 , . . . , λM ) denotes the diagonal matrix of singular values.4 Letting Up the M × p matrix consisting of the first p columns of U, we have that span(A) = span(Up ) such that the columns of Up form an orthonormal basis for the signal space. Hence, finding the subspace of the signal spanned by A from the collection of noisy observations (4) can be decomposed into two subproblems: 1) estimate the covariance matrix Cy ; 2) Find an estimate of Up by taking the SVD of the estimated covariance matrix Cy . The problem becomes more interesting when only x(t) = By(t) is available. For the noiseless case, Chi et. al. in [20] developed a sequential recursive Least-Squares algorithm named PETRELS to estimate the underlying subspace from the sequence of observations x(t), t = 1, 2, 3, . . .. Another 3 Obviously, 4 We
they are also independent of the receiver noise n.
assume that the singular values are always sorted in non-increasing order.
8
algorithm named Grassmanian Rank-One Update Subspace Estimation (GROUSE) was proposed by Balzano et. al. in [21]. Both algorithms need the dimension of the subspace (in our case the number of AoAs p) or an estimate thereof. They iteratively update the estimate of the subspace of the signal for every new data vector x(t) and are especially suitable for high-dimensional data. However, their convergence is not guaranteed for the noisy incomplete observations that we address in this paper. In particular, there is no analysis of the convergence time or the required number of data for a specific precision, especially for the noisy case. PETRELS and GROUSE are geared to optimize the computational complexity rather than the training data size T , and principally suit situations in which the dimension is high (very large M and T ). In particular, their tracking “recursive” nature makes them suitable for the case where T → ∞ and the snapshots x(t) are acquired sequentially. In the massive MIMO channel estimation application that we have in mind, T should be as short as possible, since this determines how many slots the base-station needs before it can identify the channel subspace of a given user to be able to schedule this user for data transmission. Also, the problem dimension M is large for wireless communication technology (e.g., M between 100 and 500) but is definitely quite small in comparison with the “big data” scenarios for which PETRELS and GROUSE were designed. Moreover, in general we can exploit the fact that the signal subspace is the span of the array vectors {a(θℓ )}pℓ=1 , while such a priori information is not exploited by these algorithms. Since PETRELS generally beats GROUSE [20], we shall provide numerical comparisons of our proposed schemes with PETRELS. 2.2 Low-rank matrix recovery For p ≪ M and for a high SNR, the covariance matrix Cy in (6) is nearly low-rank. Recovery of low-rank matrices from a collection of a few possibly noisy samples has recently attracted great attention in signal processing and machine learning. In particular, with the emergence of Big-Data applications, people deal more and more with very high-dimensional and frequently noisy data that has a very low-dimensional structure. In turns, this can be exploited to recover the signal from a small number of low-dimensional sketches. The Matrix Completion (MC) problem consists of recovering a low-rank matrix by sampling only a few entries and can be cast as the following optimization problem: X∗ = arg min rank (M) subject to MΩ = XΩ ,
(7)
M
where X is an r × c matrix, Ω ⊂ [r] × [c] is the index subset corresponding to the observed entries, and XΩ and MΩ denote the collections of entries of X and M, respectively, indexed by Ω. Solving (7) is in general NP-hard. A convex relaxation of the MC problem yields the nuclear norm minimization X∗ = arg min kMk∗ subject to MΩ = XΩ ,
(8)
M
where the nuclear norm kMk∗ is defined as the sum of the singular values of M. The nuclear norm is the tightest convex relaxation of the rank function [22, 23]. Cand`es and Recht showed that (8) can recover (with high probability) all rank q matrices of dimension n × n by taking only m = O(q n1.2 log(n)) measurements, when
9
the set Ω is randomly selected [24]. Faster algorithms with slightly better bound on the number of measurements m were developed by Keshavan et. al. [25] and Cai et. al. [26]. The connection between classical compressed sensing [27, 28] and low-rank matrix recovery was further explored in [29, 30]. For positive semi-definite (PSD) matrices, the nuclear norm reduces to the trace. This was used by Chen et al. [31] to study the recovery of low-rank covariance matrices under noisy rank-one (quadratic) Gaussian measurements. In our problem, we have only the collection of noisy sketches X = [x(0), . . . , x(T − 1)], where x(t) is as in (5). We define the sample covariance matrices T X b y = 1 YYH = 1 y(t)y(t)H , C T T t=1
T X b x = 1 XXH = 1 b y BH . C x(t)x(t)H = BC T T t=1
(9)
(10)
b x ≈ BCy BH . Thus, the components of the sample covariance C b x can be seen For sufficiently large T , we have C
as sketches of the covariance matrix Cy via the matrix B. In particular, for a random Gaussian matrix B, the b x correspond to the rank-one random measurements studied in [31]. However, in our case diagonal elements of C
b x that can be also exploited. we have many more measurements (the non-diagonal elements of C
To compare the performance of our algorithms with the state of the art algorithms for low-rank matrix recovery,
we will consider the following extension of the nuclear norm minimization b x − BMBH k ≤ ǫ, C∗y = arg min Tr(M) subject to kC
(11)
M∈T+
where the nuclear norm is replaced by the trace, since Cy is PSD and Toeplitz [8], and where ǫ is an estimate of the ℓ2 -norm of the error that we provide to the algorithm. Numerical comparisons of our algorithms with the nuclear norm minimization (11) are provided in Section 5. 2.3 DoA estimation and super-resolution The signal in (3), for each snapshot t, can be seen as the noisy superposition of p independent Gaussian sources impinging onto the array at different AoAs. This is exactly the same model considered in DoA estimation, widely studied in many array processing applications such as radar, sonar, optical devices, and acoustic and speech processing systems. It is also closely related to the problem of estimating the frequencies of the superposition of p sinusoids in Gaussian noise [32]. Classical algorithms to accomplish this task are MUSIC [16] and ESPRIT [17]. They are sometimes referred to as “super-resolution” (SR) methods since they are able to resolve the AoAs in the continuous domain, although when the minimum angular separation mini6=j |θi − θj | becomes much smaller than π M
the Fisher Information Matrix for the joint estimation of the AoAs tends to be highly ill-conditioned, and all
these methods have a poor performance. More recently, inspired by the compressed sensing (CS) theory, the DoA estimation problem has been revisited in the framework of sparsity-based algorithms. In order to cast the DoA estimation as an instance of CS, the AoAs are discretized on a grid. Such a discretized approach has been vastly studied in the compressed sensing
10
literature, and is considered here in the next subsection dealing with MMV [33–41]. The assumption that the AoAs belong to the grid leads to some model mismatch when the propagation between the transmitter and the receiving array is characterized by a continuum of multipath components over the AoA domain. This may lead to significant performance degradation [42]. Cand`es and Fernandez-Granda [43, 44] have reconsidered SR by formulating the problem as a convex optimization. Since their method considers AoAs in a continuous domain, it does not suffer from the mismatch problem of grid-based approaches. Adapted to our problem, and by defining u =
sin(θ) sin(θmax )
for θ ∈ [−θmax , θmax ], the results proved in [43] guarantee the stable recovery of a discrete (complex) measure Pp µ = ℓ=1 wℓ δ(u − uℓ ) over u ∈ [−1, 1) from a collection of its low-frequency Fourier coefficients Z 1 p X wℓ a(uℓ ), (12) hµ, ai = a(u)µ(du) = −1
ℓ=1
2 provided that mini6=j |ui − uj | ≥ M . From (2), it is seen that the M -dimensional vector hµ, ai contains the M R1 Fourier coefficients [hµ, ai]k = −1 exp(jkπu)µ(du), k ∈ [M ] of the measure µ. Given the vector of Fourier
coefficients hµ, ai, the following convex optimization was proposed in [43] to recover the underlying measure µ: µ∗ = arg min kνkTV subject to hν, ai = hµ, ai,
(13)
ν
where, for a measure ν over [−1, 1), the total-variation (TV) norm is defined by kνkTV = sup
Pq
j=1
|ν{[tj−1 , tj )}|
where the supremum is taken over all arbitrary finite partitions −1 = t0 < t1 < · · · < tq = 1 of the interval [−1, 1) of arbitrary size q. In [43], it was also shown that the support of the optimal solution, which corresponds to the AoAs, can be extracted from the dual polynomial. In particular, a finite-dimensional representation of the dual was proposed that can be efficiently solved. The TV-minimization can be seen as the ℓ1 -norm minimization over the space of signed (complex) measures. This, in analogy with the well-known ℓ1 -norm minimization in CS [27, 28], promotes the sparsity of the resulting solution. This approach was extended by Tan et. al. in [45] to DoA estimation with coprime arrays under Gaussian noise. The algorithm proposed in [45] computes the sample covariance matrix of the observations to obtain noisy estimates of the Fourier coefficients of the discrete measure corresponding to AoAs, and uses the TV-minimization to find these angles. As the covariance matrix is PSD and Toeplitz, it can be shown that the corresponding measure must be positive. However, this has not been considered in [45] (because the dual optimization might not be easy to solve). In the wireless environment, the AoAs are clustered. This implies that the separation requirement of SR may not be satisfied. For example, often a continuous AoA density function has been observed in measurements (e.g., see [46]), and is considered in channel models (e.g., see [47]). This represents an obstacle for a straightforward application of SR methods such as [45]. However, since we aim at estimating the signal subspace rather than AoAs, in Section 4.4 we use the positivity of the underlying measure to directly solve the primal problem (13) in order to directly estimate the signal subspace rather than the dual one that is used for DoA estimation.
11
2.4 Multiple Measurement Vectors The conventional compressed sensing problem amounts to estimating a sparse vector from a set of possibly noisy linear projections [27, 28]. The problem involves a single snapshot of the type x = By, where in our case y is given in (1). An extension of the basic problem involves Multiple Measurement Vectors (MMV), where the observation is given by X = BY with Y as in (4). The underlying assumption is that the sparse vector snapshots y(t), t ∈ [T ]
have the same sparsity pattern or support over {a(θ), θ ∈ [−θmax , θmax ]}, e.g. the same AoAs {θi }pi=1 , for all
t ∈ [T ], although they might have different coefficients w(t) for different t ∈ [T ]. This problem has been widely studied in the literature (see e.g., [38, 48–53]) and it has been generally shown that MMV can provide further reduction in the number of measurements by exploiting the joint sparsity pattern (group sparsity) of the signals. As already observed, in our setting the sparsity is over the continuous AoA domain. Hence, in order to apply the classical MMV methods borrowed from the CS literature, discretization over a fine grid of angles is used. This yields the already discussed model mismatch. Another possibility is represented by a recently developed off-grid method [54, 55]. We review both approaches in the following. Discretization on a grid. Let D = [Ba(θ1 ), Ba(θ2 ), . . . , Ba(θG )] be an m × G dictionary with G ≥ m, where G is the grid size and where Θ = {θi : i = 1, . . . , G} denotes the discrete grid of AoAs. MMV assumes a model {x(t) = Dw(t) + n(t) : t ∈ [T ]}, where w(t) ∈ C
G
are the vectors of signal coefficients, and where n(t) is
the measurement noise. This can be written as X = DW + N. It is also assumed that the support of w(t), i.e., the location of nonzero components, is the same for all t ∈ [T ]. Notice that unless the true AoAs of the channel
{θℓ }pℓ=1 coincide with points in the grid, the model is mismatched. Nevertheless, if Θ is fine enough, the model mismatch can be expected to be small. Determining the sparsity pattern corresponds to finding which columns of D are effectively present in X. This can be obtained by solving the following optimization problem W∗ = arg min kMk2,1 subject to kX − DMk ≤ ǫ,
(14)
M∈CG×T
where the so-called ℓ2,1 norm of the matrix M = [m1 , . . . , mG ]T is defined as kMk2,1 = T
PG
i=1
kmi k, where
mi ∈ C , i = 1, 2, . . . , G, denote the rows of M. This norm promotes the “row-sparsity” of the resulting estimate
W∗ and, as a result, the group-sparsity of the channel gain vectors w(t), t ∈ [T ]. This approach has been applied
to array processing in [38] and to sparse regression in [49]. The common support of the sparse coefficient vectors is given by the indices corresponding to the non-zero rows of W∗ . Once this is identified, if needed, the signal coefficients can be estimated by using standard Least-Squares. If instead one is only interested in estimating the signal subspace, it is sufficient to let it equal to span(D∗ ), where D∗ contains the “active” columns of D, i.e., those indexed by the support set. One of the main issues with problem (14) is that the time-complexity scales quadratically with the number of observations T , which is a harsh limitation for these methods. We compare the performance of our algorithms with that resulting from solving (14) via numerical simulations. We show empirically that even with a reasonable grid size (G = 3M ), albeit having the grid mismatch problem, (14) has a reasonable performance but our algorithms perform better and are generally much faster even for a small training size T = 100.
12
Gridless approach. A gridless MMV scheme was proposed in [54, 55] by applying the atomic norm minimization over the continuous dictionary D = {a(θ)rH : θ ∈ [−θmax , θmax ), r ∈ CT , krk = 1}, formed by a family of rank-1 M × T matrices. The atomic norm of an M × T matrix Z with respect to the dictionary D is defined by n o kZkA = inf t : Z ∈ t conv(D) = inf
p nX ℓ=1
cℓ : ∃ p, {cℓ > 0}, {θℓ }, {rℓ : krℓ k = 1} such that Z =
p X ℓ=1
o cℓ a(θℓ )rH ℓ ,
(15) (16)
where conv(D) denotes the convex-hull of the atoms in the dictionary D obtained by taking the convex combination of all finite subsets of atoms of D. Atomic norm can be seen as the tightest convex relaxation of the ℓ0 -norm of Z over D, which is defined as the minimum number of elements in D whose linear combination gives Z. Although atomic norm can be shown to be a convex function, finding efficient algorithms for computing it over a given dictionary is a challenging problem. Fortunately, for the dictionary D this can be done through the following semi-definite program [54–56] kZkA =
min
M∈T+ ,W∈CT ×T
M 1 1 Tr(M) + Tr(W) subject to 2 2 ZH
Z W
0.
(17)
The recovery of the signal part Z = AW in (3) from the noisy observations Y can be cast as the following convex optimization (atomic norm minimization) problem Z∗ = arg min kZkA subject to kY − Zk ≤ ǫ.
(18)
Z∈CM ×T
Atomic norm minimization (18) produces a sparse solution Z∗ from which the active elements in the dictionary can be identified. In turns, this yields the discrete AoAs and the corresponding estimate of the signal subspace. This problem was studied in [54, 55] for the off-grid spectrum estimation (or equivalently DoA estimation). A regularized version of (18) along with an ADMM formulation [57] was used in [54] to reduce the computational complexity. However, the parameters of ADMM need to be selected very carefully to guarantee convergence. When instead of the whole observations Y only a noisy sketch X = BY is available, one can solve a variation of (18) as follows Z∗ = arg min kZkA subject to kX − BZk ≤ ǫ′ ,
(19)
Z∈CM ×T
where ǫ′ is an estimate of ℓ2 -norm of the noise in the projected data. Also in this case, the computational complexity scales with the number of observation snapshots T . This poses a problem when training time T is large. 3 C HANNEL M ODEL
AND
P ROBLEM S TATEMENT
As anticipated before, the channel may be formed by the superposition of a continuum of array responses. In order to include this relevant case, we define the AoA scattering function γ(u), which describes the received power sin(θ) for θ ∈ [−θmax , θmax ]. Without loss of density along direction identified by u ∈ [−1, 1], where u = sin(θ max ) R1 generality, we can normalize γ(u) such that −1 γ(u)du = 1. Therefore, γ(u) can be seen as a probability density
function or, more in general, a non-negative measure (including the possibility of discrete angles, corresponding to
13
Dirac delta functions). With some abuse of notation, we denote the array vector in the u domain by a(u), where [a(u)]k = exp(jkπu). Then, we generalize (1) to Z 1p √ y(t) = snr γ(u)a(u)z(u, t)du + n(t), t ∈ [T ],
(20)
−1
where snr is the SNR and z(u, t) is a white circularly symmetric Gaussian process with the covariance function i h (21) E z(u, t)z(u′ , t′ )∗ = δ(u − u′ )δt,t′ . The covariance matrix of the observed signal can be obtained as Z i h p p H Cy = E y(t)y(t) = snr γ(u) γ(u′ )a(u)a(u′ )H E[z(u, t)z(u′ , t)∗ ]du du′ + IM u,u′ ∈[−1,1)
= snr
Z
u,u′ ∈[−1,1)
= snr where S(γ) = snr
R1
−1
Z
p p γ(u) γ(u′ )a(u)a(u′ )H δ(u − u′ )du du′ + IM
1
γ(u)a(u)a(u)H du + IM = S(γ) + IM ,
(22)
−1
γ(u)a(u)a(u)H du denotes the covariance matrix of the signal part, and where IM is the
covariance matrix of the white additive noise. For the ULA, S is a Toeplitz matrix with [S]ij = [f ]i−j , where f is R an M -dimensional vector with [f ]k = snr γ(u) exp(jkπu) du for k ∈ [M ], and corresponds to the k-th Fourier
coefficient of the density γ scaled by snr.
We consider a wireless channel modeled by (20) during the training period. From the base-station perspective, every user is identified by a density γ(u) with corresponding covariance matrix given in (22). Letting S = VΛVH = Pp PM H H i=1 λi vi vi , i=1 λi vi vi be the SVD of S, we define the dominant p-dimensional subspace of S by Sp = Pp where the signal’s power contained in this subspace is given by Tr(Sp ) = i=1 λi . It is not difficult to check
that Tr(Sp ) can be equivalently given by Tr(Sp ) = hS, Vp VpH i, where Vp = arg maxV∈H(M,p) hS, VVH i is the
best p-dimensional bemaformer for capturing the signal’s power. Recall that H(M, p) is the space of all M × p
tall unitary matrices U such that UH U = Ip . The optimal beamformer Vp can be implemented as the RF analog layer of beamforming in a JSDM configuration [8–12]. In this paper, we are concerned with the estimation of Vp , for some appropriately chosen p, from the noisy snapshots of the projected channel (sketches) X = BY (see (5)), obtained during a training period of length T , where B is an m × M projection matrix for m ≪ M implemented in the analog RF domain, such that only m RF chains and A/D conversion are used. In order to measure the “goodness” of estimators, we propose the following performance metric which is relevant to our problem. First, we define the efficiency of the best p-dimensional beamformer by ηp =
Tr{VpH SVp } hS, Vp VpH i = . Tr(S) Tr(S)
For example, when S ∝ IM the signal’s power is uniformly distributed in all singular values, thus, ηp =
(23) p M.
If ηp ≈ 1 for some p ≪ M , then a significant amount of signal’s power can be captured by a low-dimensional ep = V e p (X) be an estimator of Vp from X. We define the following metric for the efficiency beamformer. Let now V
14
e p: of V
Γp =
e Hi e pV e Hi e pV hS, V hS, Vp VpH i − hS, V p p = 1 − , hS, Vp VpH i hS, Vp VpH i
(24)
e pH i ≥ 0 is the amount of power lost due to the mismatch between the optimal e pV where hS, Vp VpH i − hS, V
e p . It is immediate to see that Γp ∈ [0, 1], where it is desirable to make it as beamformer Vp and its estimate V
close to 1 as possible. It is also important to notice that Γp close to 1 does not necessarily imply a good overall signal power capture. For example, assume that S ≈ IM . In this case, for every beamformer matrix U ∈ H(M, p), we have hS, UUH i ≈ hS, Vp VpH i ≈ p, thus, Γp ≈ 1, but essentially only a fraction
p M
of the whole signal power
is captured by every p-dimensional beamformer. However, for the relevant case of interest here, which is when the signal subspace is (approximately) low-dimensional, there exist some p ≪ M for which ηp ≈ 1 and good estimators are effectively identified by Γp ≈ 1. Remark 3.1: We shall compare subspace estimators for a given channel statistics, number of antennas and number of measurements (i.e., γ, snr, M and m) according to the following procedure: 1) fix some ǫ > 0; 2) find minimum p such that ηp ≥ 1 − ǫ; 3) compare subspace estimators in terms of Γp . This approach is quite different from the classical DoA estimation used in array processing (e.g., in radar). There, the relevant parameters to be estimated are the AoAs. In our problem, we do not really care about discrete angles, but only about a good approximation (in terms of captured signal power) of the span of the corresponding array response vectors. It follows that the problem of identifiability that typically arises in DoA estimation when the minimum angular spacing is too small, is irrelevant here. This is the reason why we can handle continuous AoA scattering functions γ, in contrast ♦
to some SR methods that assume discrete and sufficiently spaced AoAs. 4 P ROPOSED A LGORITHMS
FOR
S UBSPACE E STIMATION
In this section, we introduce the coprime sampling that we will use throughout the paper. We explain our proposed algorithms for estimating the signal subspace and provide further intuitions and discussions about their performance. In Section 5, we will compare our proposed algorithms with the most representative existing algorithms reviewed in Section 2 on the basis of computer simulation. 4.1 Coprime Subsampling Operator Let D be a subset of [M ] of size K and consider a ULA whose elements are located at
iλ 2 sin(θmax ) ,
i ∈ D (see
Fig. 1). The array is called a minimum-redundancy linear arrays (MRLA) if for every ℓ ∈ [M ], with ℓ 6= 0, there √ or approximately K ≈ 2M . are unique elements i, i′ ∈ D such that ℓ = i − i′ . This implies that M = K(K−1) 2 Now, consider an arbitrary configuration of sensors D ⊂ [M ] and let us define the difference set ∆D = {i − i′ : i, i′ ∈ D with i ≥ i′ }.
(25)
It is clear that ∆D ⊂ [M ]. We call D a complete cover (CC) if ∆D = [M ]. This implies that for every ℓ ∈ [M ],
there is at least a (not necessarily unique) pair i, i′ ∈ D such that ℓ = i−i′ . By this definition, the location of sensors
15
√ for a MRLA builds a CC with a minimum size. For large values of M , it is possible to build a CC of size 2 M . One way to do this is to use the coprime sampling. Suppose q1 , q2 are coprime numbers, i.e., gcd(q1 , q2 ) = 1, that √ are very close to each other with q1 q2 ≈ M , thus, q1 ≈ q2 ≈ M . Let D be the set of all nonnegative integer combinations of q1 and q2 less than or equal to M − 1, i.e., D = ∪i=1,2 {k : k ∈ [M ], mod(k, qi ) = 0}.
(26)
√ Note that |D| ≈ 2 M . For suitable selection of q1 and q2 and for sufficiently large M , the set ∆D is approximately equal to [M ], thus, D is a CC for [M ]. For small values of M , the set D might not be a CC, however, as we will explain the performance of the algorithm will not change dramatically as far as the number of uncovered elements in [M ] is negligible compared with M . For simplicity, in the rest of the paper we always assume that D is a CC for [M ]. Suppose the elements of D are sorted in an increasing order with di ∈ [M ] being the i-th largest element in the list. Also, let m = |D| be the number of elements in D and let B be the m × M binary matrix with elements [B]i,di = 1 for i ∈ {1, 2, . . . , m} and zero otherwise. We can simply check that BBH = Im , where Im denotes the identity matrix of order m. We will use B as the projection matrix that
produces the low-dimensional observations X = BY. In passing, this has the advantage that the projection reduces to array subsampling, or “antenna selection”, which is very easy to implement in the analog RF domain by simple switches connecting the selected antennas to the RF demodulation chains and A/D converters. Proposition 4.1: Let S be an M × M Hermitian Toeplitz matrix and let B be the coprime sampling matrix.
Then the mapping S → BSBH is a bijection.
Proof: Since S is Toeplitz, for any i, j ∈ [M ] with i ≥ j, we have [S]i,j = [f ]i−j , for some M -dimensional
vector f . Also, as S is Hermitian, f fully specifies S. Let i, i′ ∈ {1, 2, . . . , m} with i ≥ i′ . We can check that [BSBH ]i,i′ = [S]di ,di′ = [f ]di −di′ .
(27)
As D is a complete cover for [M ], for any k ∈ [M ] there are di , di′ ∈ D such that di − di′ = k, which using
(27) implies that [BSBH ]i,i′ = [f ]k . Thus, all the elements of S can be recovered from the low-dimensional matrix BSBH and vice-versa, thus, the mapping is a bijection. Although in this paper, for simplicity of implementation, we focus on coprime sampling matrix B, all the proposed algorithm, except the super-resolution (SR) algorithm in Section 4.4, can be applied to other sampling matrices (e.g., Gaussian i.i.d. random matrices). 4.2 Algorithm 1: Approximate Maximum Likelihood (AML) Estimator For the signal model (20) we can immediately prove the following result. bx = Proposition 4.2: Let C
1 H T XX
be the sample covariance of the observations X as defined in (10). Then
b x is a sufficient statistics for estimating the signal covariance matrix S. C
Proof: Recall that the signal covariance matrix is given by Cy = S + IM , thus, Cx = BCy BH = BSBH + Im ,
where we have explicitly used the fact that BBH = Im . As the observations X are Gaussian, the corresponding
16
likelihood function can be written as T Y p(X|S) = p(x(t)|S)
(28)
t=1
=
T Y
t=1
πm
n o 1 exp − x(t)H (BSBH + Im )−1 x(t) H det(BSB + Im )
n o 1 H H −1 exp − Tr(x(t)x(t) (BSB + I ) ) m π m det(BSBH + Im ) t=1 o n 1 1 H H −1 = Tm exp − T Tr XX (BSB + I ) m π det(BSBH + Im )T T n o 1 b x (BSBH + Im )−1 . = Tm exp − T Tr C H T π det(BSB + Im ) =
T Y
(29)
(30) (31) (32)
b x . From the Fischer-Neyman factorization theorem It follows that the likelihood function depends on X only via C
b x is a sufficient statistics. [58, 59], it follows that C
The ML estimator for the subsampled data can be written as S∗ = arg minS∈T+ L(S), where b x (Im + BSBH )−1 . L(S) = log det(Im + BSBH ) + Tr C
(33)
By direct inspection, we have:
Proposition 4.3: L(S) is the sum of a concave function Lcav (S) = log det(Im + BSBH ) and a convex function b x (Im + BSBH )−1 . Lvex (S) = Tr C Proof: See Appendix C.1.
As L(S) is not convex, local optimization techniques such as gradient descent are not guaranteed to converge to the globally optimal solution. Since S scales with SNR, it is possible to obtain a convex (indeed, linear) approximation of the concave function Lcav (S), which is tight especially for low SNR. More precisely, we have: Proposition 4.4: Lcav (S) ≤ Tr(BSBH ) for all S ∈ T+ . Moreover, for the low-SNR regime we have Lcav (S) =
Tr(BSBH ) + o(snr).
Proof: See Appendix C.2. Proposition 4.4 implies that in the low-SNR regime, Tr(BSBH ) is the best linear approximation for Lcav (S). This in turns implies that b x (Im + BSBH )−1 ), Lapp (S) = Tr(BSBH ) + Tr(C
(34)
is the best convex upper bound of L(S) for the low-SNR regime.
Remark 4.1: It is interesting to note that this approximation of the log-likelihood function is valid independent of the length of the training period T as far as snr is sufficiently small. Although the total signal-to-noise ratio of the estimation problem increases by increasing T , the validity of this approximation only depends on the SNR of an individual sample rather than the accumulative signal-to-noise ratio of the whole training samples. On the other b x → Cx = E[x(t)x(t)H ] by consistency of the sample covariance estimator. hand, increasing T yields C
♦
The next proposition shows that the approximate maximum likelihood (AML) estimation can be cast as a semi-
definite programming.
17
b x (Im + BSBH )−1 ) and let C b x = UΛUH be the SVD of Proposition 4.5: Let Lapp (S) = Tr(BSBH ) + Tr(C
b x . Then the AML estimate can be obtained from the following semi-definite program C H I + BMB ∆ m 0, S∗ = arg min Tr(BMBH ) + Tr(W) subject to M∈T+ ,W ∆H W 1/2
bx where ∆ = C
= UΛ1/2 .
(35)
Proof: See Appendix C.3.
Some remarks about optimization problem (35) are in order here: Remark 4.2: Although the optimization (35) is an approximation of the ML cost for low-SNR regime, the algorithm does not need the explicit knowledge of SNR. However, an estimate of noise level in the array is necessary to scale the covariance of the noise Im properly. By Proposition 4.4, we expect that the performance of ♦
AML be very close to the performance of the ML estimator in the low-SNR regime.
Remark 4.3: The semi-definite program in (35) shows a close resemblance to the atomic norm computation introduced in (17) for the MMV problem. However, there are also some interesting differences. For example, the term Im + BMBH appears in the PSD constraint rather than M in the MMV formulation. The reason is that in our formulation we assumed that the noise variance in each array element is normalized to 1. In practice, the noise variance can be estimated during the system’s operation. However, since the noise is white, if our goal is to estimate the dominant subspace of the signal, we can simply focus on ML estimate of the whole covariance matrix Cy = IM + S rather than the signal part S. Using BBH = Im and the identity Tr(BH Cy B) = m + Tr(BSBH ), by reparametrization, we can simply remove the term corresponding to Im in the PSD constraint, thus obtaining the equivalent estimator
C∗y = arg min Tr(BMBH ) + Tr(W) subject to M∈T+ ,W
BMBH ∆
H
∆ W
0.
(36)
b 1/2 Moreover, by replacing the whole observation X during the training period by ∆ = C x , the complexity of
(35) is independent of the training length T . One of the main issues with the MMV optimization in (17) is that
the time complexity of the algorithm scales with T , which prohibits its use for even short training times such as √ T = 100. In contrast, using coprime sampling with m = O(2 M ), the minimization (35) in the AML algorithm runs very fast.
♦
Improving AML via Concave-Convex Procedure. As L(S) = Lcav (S) + Lvex (S) is a sum of a convex and a concave function, by slightly modifying the algorithm (35), we can obtain better estimates of the signal covariance matrix S even for high-SNR regime via the concave-convex procedure (CCCP). CCCP was first introduced in [60] and it is mainly a majorization-minimization algorithm that minimizes the difference of two convex functions, or equivalently the sum of a convex and a concave function, via a sequence of convex programs. It runs iteratively and in each iteration a better estimate of the optimal (not necessarily the globally optimal) solution is found. We denote the resulting approximate solutions for our problem by Sℓ , ℓ = 0, 1, . . . , where Sℓ is the estimate generated at iteration ℓ. Consider the iteration k, where the approximate estimates S1 , S2 , . . . , Sk are available. Given the last
18
estimate Sk , we can approximate the concave function Lcav (S) as follows: Lcav (S) = log det(I + BSBH ) = log det(I + BSk BH + B(S − Sk )BH ) = log det(I + BSk BH ) + log det I + (I + BSk BH )−1/2 B(S − Sk )BH (I + BSk BH )−1/2
(37) (38)
≤ log det(I + BSk BH ) + Tr (I + BSk BH )−1/2 B(S − Sk )BH (I + BSk BH )−1/2
(39)
= Lcav (Sk ) + hBH (I + BSk BH )−1 B, S − Sk i,
(40)
(a)
where in (a), we use an extension of Proposition 4.4 proved in Appendix C.2 to upper bound log det(I + H) by the linear function Tr(H), which holds for all Hermitian matrices H for which I + H 0. We also define Υk (S; Sk ) = Lcav (Sk ) + hBH (I + BSk BH )−1 B, S − Sk i,
(41)
and Lk (S; Sk ) , Υk (S; Sk )+Lvex (S). It results that that for arbitrary S and Sk in T+ , we have L(S) ≤ Lk (S; Sk ). Also, from Proposition 4.4, we know that Υk (S; Sk ) is a tight convex upper for the concave function Lcav (S) especially around Sk , thus, Lk (S; Sk ) is a tight convex upper bound for L(S). To find the next estimate Sk+1 , we solve the following convex optimization Sk+1 = arg min Lk (M; Sk ).
(42)
M∈T+
Using Proposition 4.5, this can also be cast as a semi-definite program that can be efficiently solved. We initialize the estimates with S0 = 0 for k = 0. It is immediately seen that Υ0 (S; S0 ) = Tr(BSBH ), and L0 (S, S0 ) = Lapp (S) coincides with the AML function in (34). Thus, the estimate S1 corresponds to the AML estimate. We can also see that the sequence of estimates {Sk }∞ k=1 monotonically improve the likelihood function, i.e., L(Sk+1 ) ≤ Lk (Sk+1 ; Sk ) = min Lk (M; Sk ) ≤ Lk (Sk ; Sk ) = L(Sk ), M∈T+
(43)
where we use the identity Υk (Sk ; Sk ) = Lcav (Sk ), which implies Lk (Sk ; Sk ) = L(Sk ). 4.3 Algorithm 2: MMV with Reduced Dimensionality (RMMV) One of the main problems with grid-based and off-grid MMV optimizations in (14) and (17) is that their complexity scales very fast with the sample size T . Here, we explain an SVD-based technique as in [38] to reduce the computational complexity of (14) and (17). Consider again the observation X = BY from the subsampled antennas received signal during the training period of length T . For simplicity, assume that the discrete AoA model holds and the arrival angles belong to a prefixed grid Θ with elements in the interval [−θmax , θmax ]. In this case, the model X = DW + N with the discrete dictionary D defined in Section 2.4 holds exactly. √ H We assume that T ≫ m = O( M ). Thus, the “economy form” SVD of X is X = UΣm Vm , where Vm is a T × m tall unitary matrix and Σm is the m × m diagonal matrix of the non-zero singular values. We define the e = XVm = UΣm . Notice that X e can be simply computed from the sample covariance matrix of the new data X
19
bx = data C
1 H T XX ,
thus, it is not necessary to store the whole observation X during the training time. Moreover,
b x can also be computed from X, e thus, Proposition 4.2 implies that X e is also a sufficient statistics. We also have C e = DWVm + NVm = DW f + N, e X
(44)
f G×m and N e m×m are the modified channel gains and array noise. It is not difficult to check that the where W
f has nonzero rows only on the reduced problem in (44) is still in the MMV format in the sense that the matrix W
grid points corresponding to the channel AoAs, however, the dimension of the problem now is fixed and does not f and N e lose their independence and Gaussianity. In scale with T . The drawback of SVD-based reduction is that W
f and the fact, although when Vm is a deterministic matrix with orthogonal columns the reduced channel gain W
e are still independent and Gaussian, this is not the case in (44) since Vm is a function of reduced received noise N
W and N.
Our second algorithm for subspace estimation, referred to in the following as Reduced MMV (RMMV), simply applies the off-grid atomic norm minimization for the MMV problem reviewed in Section 2.4 to the low-dimensional e This can be cast as the following semi-definite program data X. S∗ =
arg min M∈T+ ,W∈Cm×m ,Z∈CM ×m
Tr(M) + Tr(W) subject to
M
Z
ZH
W
e − BZk ≤ ǫ, 0, kX
(45)
e For large values of T , we expect that C b x ≈ Im + BSBH by where ǫ is an estimate of the matrix norm of N.
e remain approximately the consistency of the sample covariance estimator, such that the noise components in N √ independent and Gaussian. If m is sufficiently large then the optimal value of ǫ concentrates around ǫ∗ = σ m2 = √ mσ ≈ 2σ M , where σ 2 is the noise variance in each array element, and where we used the fact that, for coprime √ sampling, m ≈ 2 M . The noise level σ 2 at the output of the array elements can be typically estimated during the system’s operation. 4.4 Algorithm 3: Super Resolution (SR) Consider the channel model defined in Section 3, where here γ(u) denotes a positive measure not necessarily normalized. In order to stress the dependence of the signal covariance matrix on γ(u) in this section we use the R1 notation S(γ) = −1 γ(u)a(u)a(u)H du, consistently with (22). Since S(γ) is Toeplitz, Proposition 4.1 yields that
4.4 Algorithm 3: Super Resolution (SR)
Consider the channel model defined in Section 3, where here γ(u) denotes a positive measure, not necessarily normalized. In order to stress the dependence of the signal covariance matrix on γ(u), in this section we use the notation S(γ) = ∫_{−1}^{1} γ(u)a(u)a(u)^H du, consistently with (22). Since S(γ) is Toeplitz, Proposition 4.1 yields that, for the coprime sampling matrix B introduced in Section 4.1, all the elements of S, and as a result the Fourier coefficients of the measure γ(u) given by [f]_k = ∫_{−1}^{1} exp(jkπu)γ(u)du, k ∈ [M], can be identified from BSB^H. This implies that, for sufficiently large T, we can accurately estimate all the coefficients [f]_k, k ∈ [M], using the elements of the sample covariance matrix Ĉ_x = BĈ_yB^H. Let X_k = {(i, i′) : i ≥ i′, d_i − d_{i′} = k} be the covering set of the element k by the coprime sampling D defined in Section 4.1. Also, let c_k = |X_k| be the covering number of the element k ∈ [M]. We define the following estimator of [f]_k:

[f̂]_k = (1/c_k) Σ_{(i,i′)∈X_k} [Ĉ_x]_{i,i′}.   (46)
In Proposition A.1 in Section A.1, we prove that for k ≠ 0 the estimator [f̂]_k is an unbiased estimate of [f]_k with an approximate variance (exact if c_k = 1) given by (σ² + [f]_0)²/(T c_k), converging to 0 as T → ∞. Note that in this case the SNR is given by snr = [f]_0/σ².
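A small numpy sketch of the coprime geometry and the lag-domain estimator (46) is given below. The helper names (coprime_positions, covering_sets, fourier_coeff_estimates) are illustrative, the rows of Ĉ_x are assumed to be ordered according to the sorted positions in D, and the construction of D from the two coprime factors q1, q2 follows the simulation setup of Section 5.

import numpy as np
from collections import defaultdict

def coprime_positions(q1, q2, M):
    """Sampled antenna positions D = {0, q1, 2*q1, ...} U {0, q2, 2*q2, ...} inside [0, M)."""
    return np.array(sorted(set(range(0, M, q1)) | set(range(0, M, q2))))

def covering_sets(D, M):
    """Covering sets X_k = {(i, i') : d_i - d_{i'} = k >= 0} and covering numbers c_k = |X_k|."""
    Xk = defaultdict(list)
    for i, di in enumerate(D):
        for j, dj in enumerate(D):
            if di >= dj:
                Xk[di - dj].append((i, j))
    return Xk, np.array([len(Xk[k]) for k in range(M)])

def fourier_coeff_estimates(Cx_hat, D, M):
    """Estimator (46): average the subsampled sample-covariance entries over each lag.
    Note: the k = 0 estimate also contains the noise variance sigma^2 (see (59)),
    and some lags may have c_k = 0 (Remark 4.4) and are left at zero here."""
    Xk, ck = covering_sets(D, M)
    fhat = np.zeros(M, dtype=complex)
    for k in range(M):
        if ck[k] > 0:
            fhat[k] = np.mean([Cx_hat[i, j] for (i, j) in Xk[k]])
    return fhat, ck

# Example matching the simulation setup: M = 80, q1 = 7, q2 = 9 -> 19 sampled antennas.
D = coprime_positions(7, 9, 80)
print(len(D))   # 19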
As in (12), let us denote the vector of M Fourier coefficients by f = ⟨γ, a⟩, and their estimates by f̂. We propose the following TV-minimization to recover the subspace of the signal from the estimates f̂:

γ* = arg min_γ ‖γ‖_TV subject to ‖⟨γ, a⟩ − f̂‖ ≤ ε,   (47)
where ε is an estimate of the norm of the noise in the data. For a non-negative measure γ, ‖γ‖_TV is simply given by ∫_{−1}^{1} γ(u)du = [f]_0. Moreover, it is not difficult to check that the vector of Fourier coefficients f can be obtained from the first column of the Toeplitz matrix S(γ). Consequently, we can write the optimization (47) directly in terms of the covariance matrix:

S* = arg min_{M∈T_+} Tr(M) subject to ‖M e_1 − f̂‖ ≤ ξ √(M/T) (σ² + [M]_{11}),   (48)
where e_1 = (1, 0, ..., 0)^T has dimension M × 1, where [M]_{11} is the diagonal element of the Toeplitz matrix M (equivalent to [f]_0), and where the ε parameter in (47) has been replaced by an estimate thereof in which ξ is some parameter that can be tuned appropriately. The motivation for (48) is that, for sufficiently large M and for the true signal distribution γ, the best value of ε in (47) can be estimated by

‖⟨γ, a⟩ − f̂‖² = Σ_k |[f̂]_k − [f]_k|² → Σ_k E[ |[f̂]_k − [f]_k|² ]   (49)
= Σ_k Var[[f̂]_k] ≤(a) M (σ² + [f]_0)²/T,   (50)

where in (a) we used the results proved for the variance of the estimate [f̂]_k in Section A.1, and the fact that for those elements with c_k > 1 the resulting variance is less than (σ² + [f]_0)²/T. Thus, we have replaced ε in (47) by √(M/T)(σ² + [f]_0) = √(M/T)(σ² + [M]_{11}), where an additional tuning by the scaling parameter ξ has been added to include the variation of this optimal value around its mean. Algorithm (48) is a convex optimization that can be solved if an estimate of the noise variance σ² is available. In particular, no prior knowledge of the SNR is necessary.
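For small M, (48) can also be prototyped with a convex solver. The following cvxpy sketch is one possible rendering; the SCS solver, the ℓ2 norm in the constraint, and the assumption that the noise contribution σ² has been removed from the zero-lag entry of f̂ are illustrative choices made here.

import numpy as np
import cvxpy as cp

def sr_estimate(fhat, T, sigma2, xi=1.0):
    """Sketch of (48). fhat: estimated Fourier coefficients (zero lag assumed noise-free),
    T: training length, sigma2: estimated noise variance, xi: tuning parameter."""
    Mdim = len(fhat)
    S = cp.Variable((Mdim, Mdim), hermitian=True)
    constraints = [S >> 0]
    # Hermitian Toeplitz structure: constant diagonals.
    for i in range(Mdim - 1):
        for j in range(Mdim - 1):
            constraints.append(S[i, j] == S[i + 1, j + 1])
    # ||S e_1 - fhat|| <= xi * sqrt(M/T) * (sigma^2 + [S]_{11})
    bound = xi * np.sqrt(Mdim / T) * (sigma2 + cp.real(S[0, 0]))
    constraints.append(cp.norm(S[:, 0] - fhat, 2) <= bound)
    cp.Problem(cp.Minimize(cp.real(cp.trace(S))), constraints).solve(solver=cp.SCS)
    return S.value

Uncovered lags (c_k = 0, Remark 4.4 below) would simply be dropped from the fitted entries of the first column in such a prototype.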
Remark 4.4: It might happen, especially for small array size M, that some of the elements k ∈ [M] are not covered by the coprime sampling D, i.e., c_k = 0. In this case, [f̂]_k cannot be estimated for those elements. However, we can still run (47) or, equivalently, (48) by including in the constraint only the Fourier coefficients for which c_k > 0. Note that, since the optimization is done over T_+, if the number of unsampled elements of f (i.e., those elements with c_k = 0) is negligible compared with M, they do not affect the performance considerably. ♦

Remark 4.5: All the other algorithms proposed in this paper can be applied to sampling or projection matrices B not necessarily given by the coprime sampling (e.g., B can be a random i.i.d. Gaussian matrix). However, the SR algorithm in this section can only work with the coprime sampling.
♦
Remark 4.6: A coprime sampling scheme similar to ours along with TV-minimization has been used in [45] for DoA estimation. Provided the AoAs are well-separated, the estimation algorithm in [45] can estimate them
from the dual optimization proposed in [43, 44]. However, the authors do not use the positivity of the measure (in our case γ) that naturally arises because of the positive semi-definite property of the signal covariance matrix. In this paper, we mainly deal with a wireless scattering channel for which the AoAs are clustered. This implies that the separation requirement for the super-resolution setup may not be met. However, since our aim is to estimate the subspace of the signal rather than the AoAs, using the positivity of the underlying measure, we can directly solve the primal problem (47) rather than the dual one that is used for DoA estimation. In particular, we do not need to go through the complicated and error-prone procedure of estimating the support (AoAs) via the dual polynomial that is necessary for DoA estimation in [45]. ♦

4.5 Algorithm 4: Covariance Matrix Projection (CMP)
Let B be the m × M matrix with m ≪ M . Consider the space of all M × M matrices and define the sampling
operator sub: C^{M×M} → C^{m×m}, where sub(K) = LT(BKB^H), and LT keeps only the lower-triangular part of the matrix BKB^H (diagonal included) and sets the remaining elements equal to zero. Using the operator sub, we can define a positive semi-definite bilinear form on the space of M × M matrices by ⟨K, L⟩_B = ⟨sub(K), sub(L)⟩, where the latter inner product is the traditional one defined on the space of m × m matrices. Using the newly defined bilinear form, we can define the seminorm ‖K‖_B = √⟨K, K⟩_B. Note that ‖·‖_B is not a norm on the space of all M × M matrices, since we can simply find an M × M matrix K ≠ 0 for which ‖K‖_B = 0. Similarly, ⟨K, L⟩_B is
not a positive-definite bilinear form; thus, it does not define an inner product. However, from what we explained in Section 4.1 and Proposition 4.1, we can check that sub is a bijection over the linear subspace of Toeplitz Hermitian matrices of order M, which implies that the restriction of the newly defined bilinear form to the space of M × M Toeplitz matrices is indeed an inner product that gives a well-defined norm. We also define

α_B(M) = max_{K∈T} ‖K‖ / ‖K‖_B,   (51)

where the dependence on M shows that the maximization is done over the space of all M × M Hermitian Toeplitz matrices. The parameter α_B(M) is a measure of coherence of the sampling matrix B with respect to the space of Toeplitz matrices. It is not difficult to check that, for the coprime matrix B,

1 ≤ α_B(M) ≤ √(M − 1).   (52)
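The sampling operator and the induced bilinear form are straightforward to write down. The following numpy sketch mirrors the definitions above; taking B to be a real antenna-selection matrix (built from the coprime positions D) is an assumption made here for simplicity.

import numpy as np

def selection_matrix(D, M):
    """Antenna-selection matrix B: row i picks the array element at position D[i]."""
    B = np.zeros((len(D), M))
    B[np.arange(len(D)), D] = 1.0
    return B

def sub(K, B):
    """sub(K) = LT(B K B^H): keep the lower triangle (diagonal included) of B K B^H."""
    return np.tril(B @ K @ B.conj().T)

def inner_B(K, L, B):
    """<K, L>_B = <sub(K), sub(L)> with the usual (Frobenius) inner product."""
    return np.vdot(sub(K, B), sub(L, B))

def seminorm_B(K, B):
    return np.sqrt(inner_B(K, K, B).real)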
Our analysis shows that the CMP algorithm, defined in the following, performs better for sampling matrices B with a smaller α_B(M). Although we develop the results for coprime sampling matrices, one can simply check that similar results hold when B is an m × M Gaussian random matrix.

Let Ĉ_x = (1/T) XX^H. In order to recover the dominant p-dimensional subspace of the signal, we first find an estimate of the signal covariance matrix by

C_y* = arg min_{M∈T_+} ‖LT(Ĉ_x) − LT(BMB^H)‖.   (53)
From the definition of LT, this is equivalent to the following optimization problem:

C_y* = arg min_{M∈T_+} ‖Ĉ_y − M‖_B,   (54)

which implies that the optimal solution C_y* is the projection of the sample covariance matrix of the whole array signal onto T_+ under the seminorm ‖·‖_B. Note that this projection is unique, since the restriction of the seminorm ‖·‖_B to T_+ is indeed a norm and the projection theorem holds.
If an estimate of the noise variance σ² is available, we can directly estimate the covariance matrix of the signal via the following variant of (54):

S* = arg min_{M∈T_+} ‖LT(Ĉ_x) − σ²I_m − LT(BMB^H)‖.   (55)
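For small array sizes, the fits (53)-(55) can be prototyped with a convex solver. The following cvxpy sketch is one such rendering of (55) (it reduces to (53) when σ² is set to zero); the Frobenius norm, the SCS solver, and a real selection matrix B are assumptions made here for illustration.

import numpy as np
import cvxpy as cp

def cmp_estimate(Cx_hat, B, sigma2=0.0):
    """Sketch of (55): fit a Hermitian Toeplitz PSD matrix to LT(Cx_hat) - sigma^2 I_m."""
    m, Mdim = B.shape
    S = cp.Variable((Mdim, Mdim), hermitian=True)
    constraints = [S >> 0]
    for i in range(Mdim - 1):                      # Hermitian Toeplitz structure
        for j in range(Mdim - 1):
            constraints.append(S[i, j] == S[i + 1, j + 1])
    mask = np.tril(np.ones((m, m)))                # LT(.) keeps the lower triangle
    # B is assumed real (antenna selection), so B.T plays the role of B^H here.
    residual = cp.multiply(mask, Cx_hat - sigma2 * np.eye(m) - B @ S @ B.T)
    cp.Problem(cp.Minimize(cp.norm(residual, 'fro')), constraints).solve(solver=cp.SCS)
    return S.value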
Once C_y* or S* is estimated, we use its p-dimensional dominant subspace (for some appropriately chosen 1 ≤ p ≤ M) as an estimate of the signal subspace. We have the following result:

Theorem 4.6: Consider the signal model given by (20) over a training period of length T. Then, for a given 1 ≤ p ≤ M, the CMP subspace estimator recovers a p-dimensional subspace of the signal, and has a performance measure Γ_p (see (24)) satisfying

E[Γ_p] ≥ max{ 1 − (2√p/(η_p √T)) (1 + 1/snr), 0 },   (56)
Var[Γ_p] ≤ (4p/(T η_p²)) (1 + 1/snr)²,   (57)

where η_p is defined in (23), and where snr denotes the received SNR in one snapshot t ∈ [T].
Proof: See Appendix A.2.
Some remarks about the performance of CMP are in order here.
Remark 4.7: As the variance of the estimator converges to zero as T increases, the subspace estimator is indeed consistent. In particular, for large T the performance metric Γ_p converges to 1, which implies that the estimate is as efficient as the best p-dimensional subspace of the signal.
♦
Remark 4.8: Even for infinite SNR, we still need to take some measurements. The main reason is that, even in the absence of noise, the signal model in (20) is stochastic, y(t) = √snr ∫_{−π}^{π} √γ(u) a(u) z(u, t) du, t ∈ [T], and it takes some time for the estimator to discover the underlying structure. It is also seen that the performance is not very sensitive to SNR over a moderate range of SNR. However, for very small values of SNR (snr → 0), the required training time for achieving a specific precision scales like T = O(1/snr²).
♦
Remark 4.9: Let us assume that the signal's power is concentrated in an α-dimensional subspace. In this case, η_α ≈ 1, and for a fixed SNR, the required training time scales like T = O(α). Two different models can be considered for the signal. In the first model, the effective dimension α does not scale with M; thus, the required training length is independent of the embedding dimension or the number of array elements. In the second one, the user has a fixed angular range ∆θ = βπ for some β ∈ (0, 1). In this case, α ≈ βM scales linearly with M, and the required training length also scales linearly with the number of array elements M.
♦
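Before turning to the numerical results, the following numpy sketch illustrates the final step shared by all the estimators: extracting the p-dimensional dominant subspace of an estimated covariance matrix and scoring it with the captured-power ratio Γ_p = ⟨S, ṼṼ^H⟩/⟨S, VV^H⟩ used in the proof of Theorem 4.6. The true covariance S is known only in simulation, and the function names are illustrative.

import numpy as np

def dominant_subspace(C, p):
    """Columns spanning the p-dimensional dominant subspace of a Hermitian matrix C."""
    w, V = np.linalg.eigh(C)                      # eigenvalues in ascending order
    return V[:, np.argsort(w)[::-1][:p]]

def gamma_p(S, Cy_est, p):
    """Captured-power ratio of the estimated subspace relative to the best p-dim beamformer."""
    V = dominant_subspace(S, p)                   # best p-dimensional beamformer for S
    V_tilde = dominant_subspace(Cy_est, p)        # subspace extracted from the estimate
    num = np.trace(V_tilde.conj().T @ S @ V_tilde).real
    den = np.trace(V.conj().T @ S @ V).real
    return num / den

In the simulations below, p is chosen as the smallest dimension for which the optimal beamformer captures at least 95% of the signal power (η_p = 0.95).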
5 SIMULATION RESULTS
In this section, we assess the performance of our proposed estimators via numerical simulations and compare them with the state-of-the-art algorithms in the literature. We use the CVX package [61] for running all the convex optimizations. We consider a scattering channel in which the AoAs are uniformly distributed in Θ = [40, 50] ∪ [100, 110] (in degrees); thus, the total angular support is 20 degrees. We compare the performance of each subspace estimator with the optimal beamformer that captures more than 95% of the signal's power, from which we obtain the required dimension p for the beamformer; thus, η_p = 0.95. We estimate the efficiency of each estimator, denoted by Γ_p, via numerical simulations. In particular, a Γ_p very close to 1 in the results implies that the estimated subspace captures around 95% of the whole power of the signal.

For the simulations, we consider an array of size M = 80. We use coprime sampling with q1 = 7, q2 = 9; thus, in total, we sample only 19 out of 80 array elements, located at D = {0, 7, ..., 77} ∪ {0, 9, ..., 72}. Although there are still some array indices in [M] = {0, 1, ..., 79} not covered by ∆D, the simulations show that the estimator is not very sensitive to these unsampled (missed) elements in [M].

To compare the performance of our algorithms with the state of the art, we have selected three candidate algorithms that we have also reviewed in Section 2.

PETRELS. One class of the algorithms that we reviewed is based on subspace tracking methods. For this class we will use the PETRELS algorithm introduced in [20]. The difference with [20] is that we suppose that the data size is fixed and is not increasing with time. In this case, in every step of the algorithm, PETRELS randomly selects a training sample from the fixed training set and updates its estimate of the subspace of the signal.

Nuclear norm minimization algorithm. We run the optimization problem given in (11). The algorithm needs to know the noise variance to fix the parameter ε. To have an optimistic measure to compare with, we provide the algorithm with the best ε, given by ε* = ‖Ĉ_x − BC_yB^H‖. This value of ε* gives the smallest constraint set that still contains the true covariance matrix C_y.
MMV Algorithms. We compare our algorithm with the state-of-the-art performance of the MMV method. The first class, which uses the sample covariance matrix of the data, results in a similar optimization problem as in (??). The second class uses the group sparse structure of the signal and uses suitable regularizers to improve the recovery performance. For this class we will use two algorithms. The first algorithm runs the optimization introduced in (14), where we consider a quantized grid of size G = 3M equally spaced AoAs. We call this algorithm grid-based MMV (GBMMV). The second one is based on the off-grid techniques given by the optimization (19), which, similar to [54], can be equivalently written as

Z* = arg min_{Z∈C^{M×T}, M∈T_+, W∈C^{T×T}}  Tr(M) + Tr(W)  subject to  ‖X − BZ‖ ≤ ε′,  [M, Z; Z^H, W] ⪰ 0.   (58)
A byproduct of this optimization is that we directly obtain an estimate of the covariance of the data, C_y* = M*, given by the matrix M* that achieves the minimum in (58). Then, we extract the dominant p-dimensional subspace from
such an estimate. We call this algorithm gridless MMV (GLMMV). We set ε′ to its optimal value, given by the ℓ2-norm of the noise in the subsampled observations X.

Performance vs. signal-to-noise ratio. Fig. 3 compares the performance of our proposed algorithms with the ones in the literature for a range of SNR.
Fig. 3: Comparison of the performance of the estimators versus the received signal-to-noise ratio (SNR) for training length T = 100. It is seen that AML, RMMV, and SR perform comparably with GLMMV but have much lower computational complexity, which in particular does not scale with T. The performance of CMP is as good as that of GBMMV and better than that of nuclear norm minimization, especially at higher SNR, but its complexity is much lower than that of GBMMV since it does not scale with T. PETRELS does not perform very well for the fixed data size; e.g., its performance even for T = 800 is worse than that of the other algorithms.
Performance vs. training length T. Fig. 4 compares the performance of our proposed algorithms as well as nuclear norm minimization for different training lengths. Since the performance of AML and RMMV is comparable with that of GLMMV and better than that of GBMMV, and since GLMMV and GBMMV are very time-consuming to run for large training lengths T, we have not included the latter two in this figure.

Fig. 4: Scaling of the performance of the different estimators with training length T ∈ {200, 400, 800, 1600}. (Each panel plots the performance metric Γ_p versus SNR [dB] for NucNorm, AML, RMMV, SR, and CMP.)

APPENDIX A: ANALYSIS AND PROOF TECHNIQUES

A.1 Analysis of the subsampled signal
Let γ(u) be the power distribution of a user and let S(γ) = ∫_{−1}^{1} γ(u)a(u)a(u)^H du be its signal covariance matrix. In this section, we assume that γ is merely a positive measure and not necessarily a normalized one. We also assume that C_y = S(γ) + σ²I_M, where σ² is the variance of the array noise. Let B be the coprime sampling
matrix corresponding to the coprime locations D, whose i-th largest element gives the index of the array element at position d_i, where D and d_i ∈ D are as in Section 4.1. From the signal model in (20), it is seen that the received signal y(t) is a complex Gaussian vector with covariance matrix C_y = S(γ) + σ²I_M. Since we assume that y(t) is independent across different snapshots t ∈ [T], this completely specifies its statistics. Similarly, it results that the statistics of the sketches x(t) = By(t) are fully specified by the covariance matrix C_x = BC_yB^H. For the coprime sampling matrix B, and for i, j ∈ {1, 2, ..., m} with i ≥ j, we obtain

[C_x]_{i,j} = [S(γ)]_{d_i,d_j} + σ²δ_{ij} = [f]_{d_i−d_j} + σ²δ_{ij} := [g]_{d_i−d_j},   (59)

where [f]_k = ∫_{−1}^{1} γ(u) exp(jkπu) du for k ∈ [M] denotes the k-th Fourier coefficient of the measure γ. Note that
in the unnormalized case that we consider here, the SNR is given by snr = [f]_0/σ². Also, since x(t) is Gaussian, (59) fully specifies its statistics.

Now consider a specific k ∈ [M] and let d_i, d_j ∈ D be such that k = d_i − d_j. Since, as in Section 4.1, we assume that D is a complete cover for [M], such d_i and d_j exist. Let us also define [ĝ]_k = [Ĉ_x]_{i,j}. The following proposition characterizes the mean and the variance of [ĝ]_k. The proof uses the properties of the complex Gaussian variables reviewed in Appendix B.

Proposition A.1: Let k ∈ [M] and let [ĝ]_k = [Ĉ_x]_{i,j}, with k = d_i − d_j as defined before. Then, we have

E[[ĝ]_k] = [g]_k,   Var[[ĝ]_k] = (σ² + [f]_0)²/T = σ⁴(1 + snr)²/T.

Proof: Let k ∈ [M], and let [ĝ]_k = [Ĉ_x]_{i,j}, where k = d_i − d_j. Taking the expectation, we have

E[[ĝ]_k] = E[[Ĉ_x]_{i,j}] = [C_x]_{i,j} = [C_y]_{d_i,d_j} = [f]_{d_i−d_j} + σ²δ_{ij} = [g]_k.   (60)

Since the observations y(t), and as a result x(t), are independent across t ∈ [T], and

[Ĉ_x]_{i,j} = (1/T) Σ_{t=1}^{T} [x(t)]_i [x(t)]_j^* = (1/T) Σ_{t=1}^{T} [y(t)]_{d_i} [y(t)]_{d_j}^*,   (61)

it results that Var[[ĝ]_k] = (1/T) Var[[y(t)]_{d_i} [y(t)]_{d_j}^*]. Hence, using the properties of the complex Gaussian variables proved in Proposition B.3, we obtain

Var[[y(t)]_{d_i}[y(t)]_{d_j}^*] = E[|[y(t)]_{d_i}[y(t)]_{d_j}^*|²] − |E[[y(t)]_{d_i}[y(t)]_{d_j}^*]|²   (62)
= E[|[y(t)]_{d_i}|²] E[|[y(t)]_{d_j}|²] + |E[[y(t)]_{d_i}[y(t)]_{d_j}^*]|² − |E[[y(t)]_{d_i}[y(t)]_{d_j}^*]|²   (63)
= ([f]_0 + σ²)² = σ⁴(1 + snr)²,   (64)

where snr = [f]_0/σ². This implies that Var[[ĝ]_k] = σ⁴(1 + snr)²/T. This completes the proof.
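Proposition A.1 is easy to check numerically. The following Monte Carlo sketch uses a small, arbitrary Hermitian Toeplitz covariance (not the physical channel model) and verifies that the lag estimator is unbiased and that its variance is close to (σ² + [f]_0)²/T; all sizes and values are illustrative.

import numpy as np

rng = np.random.default_rng(1)
M, T, sigma2, trials = 8, 200, 0.5, 2000
f = np.array([2.0, 0.8 + 0.3j, 0.2 - 0.1j] + [0.0] * (M - 3))   # first-column entries of S(gamma)
Cy = (np.array([[f[abs(i - j)] if i >= j else np.conj(f[abs(i - j)]) for j in range(M)]
                for i in range(M)]) + sigma2 * np.eye(M))

L = np.linalg.cholesky(Cy)
i, j = 3, 1                           # an off-diagonal pair with lag k = i - j = 2
est = []
for _ in range(trials):
    Y = L @ (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))) / np.sqrt(2)
    est.append(np.mean(Y[i] * np.conj(Y[j])))   # one entry of the sample covariance
est = np.array(est)
print(np.mean(est), f[2])                                  # unbiased: close to [f]_2
print(np.var(est), (sigma2 + f[0].real) ** 2 / T)          # variance ~ (sigma^2 + [f]_0)^2 / T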
A.2 Analysis of the Performance of the CMP Estimator
In this section we analyze the performance of the CMP estimator. First, we need the following propositions, proved in Appendix C.4, C.5, and C.6, respectively.

Proposition A.2: Let Ĉ_y be the sample covariance of the signal y(t), t ∈ [T], and let C_y* be the CMP estimate given by the optimization (53) or, equivalently, (54). Let C′ ∈ T_+ be an arbitrary Hermitian PSD Toeplitz matrix. Then

max{ ‖Ĉ_y − C_y*‖_B, ‖C_y* − C′‖_B } ≤ ‖Ĉ_y − C′‖_B.

Proposition A.3: Let C_y* and C_y be as defined before. Suppose ‖C_y* − C_y‖_B ≤ ε and let V ∈ H(M, p) be an arbitrary M × p matrix with V^H V = I_p. Then, |⟨C_y, VV^H⟩ − ⟨C_y*, VV^H⟩| ≤ ε√(pM).
After finding the projection C_y*, we use its p-dimensional dominant subspace to design a beamformer matrix for the received signal y(t), t ∈ [T]. Let C_y* = UΛU^H be the SVD of C_y*, and let U_p be the M × p matrix consisting of the first p columns of U. If the estimate C_y* is very close to C_y, then we expect that U_p, in terms of capturing the signal power, is a good approximation of the dominant p-dimensional subspace of the signal. This is formalized in the following propositions.
Proposition A.4: Let C_y* and C_y be as defined before and let S be the signal covariance matrix, where we have C_y = S + σ²I_M. Assume that ‖C_y* − C_y‖_B ≤ ε. Let C_y = UΛU^H and C_y* = ŨΛ̃Ũ^H be the SVDs of C_y and C_y*, respectively. Let V and Ṽ be M × p matrices consisting of the first p columns of U and Ũ. Then, we have |⟨S, VV^H⟩ − ⟨S, ṼṼ^H⟩| ≤ 2ε√(pM).

Remark A.1: Notice that V is the optimal p-dimensional beamformer for S (or, equivalently, C_y). Proposition A.4 implies that the optimal beamformer for the estimated covariance matrix C_y* is 2ε√(pM)-optimal for S. ♦

We also need the following lemma, proved in Appendix C.7.

Lemma A.5: Let C_y* and C_y be as before. Then E[‖C_y* − C_y‖²_B] ≤ 2Mσ⁴(1 + snr)²/T, which also implies E[‖C_y* − C_y‖_B] ≤ σ²√(2M) (1 + snr)/√T.
Proof of Theorem 4.6: Using the definition of Γ_p and taking the expectation we obtain

E[Γ_p] = 1 − E[ (⟨S, VV^H⟩ − ⟨S, ṼṼ^H⟩) / ⟨S, VV^H⟩ ]   (65)
(a)= 1 − E[ |⟨S, ṼṼ^H⟩ − ⟨S, VV^H⟩| / ⟨S, VV^H⟩ ]   (66)
(b)= 1 − E[ |⟨C_y, ṼṼ^H⟩ − ⟨C_y, VV^H⟩| ] / (η_p Tr(S))   (67)
(c)≥ 1 − 2√(pM) E[ ‖C_y* − C_y‖_B ] / (η_p M [f]_0)   (68)
(d)≥ 1 − 2σ²√(2p) M (1 + snr) / (η_p M [f]_0 √T)   (69)
≥ 1 − (2√(2p)/(η_p √T)) (1 + 1/snr),   (70)

where in (a) we used ⟨S, ṼṼ^H⟩ ≤ ⟨S, VV^H⟩, in (b) we used C_y = S + σ²I_M and Tr(VV^H) = Tr(ṼṼ^H), in (c) we used Proposition A.4, and finally in (d) we used Lemma A.5. As Γ_p ≥ 0, this implies that E[Γ_p] ≥ max{ 1 − (2√(2p)/(η_p √T))(1 + 1/snr), 0 }. Applying a similar method, we obtain
Var[Γ_p] = Var[ |⟨S, ṼṼ^H⟩ − ⟨S, VV^H⟩| / ⟨S, VV^H⟩ ] = Var[ |⟨C_y, ṼṼ^H⟩ − ⟨C_y, VV^H⟩| ] / (η_p² Tr(S)²)   (71)
≤ E[ |⟨C_y, ṼṼ^H⟩ − ⟨C_y, VV^H⟩|² ] / (η_p² Tr(S)²)   (72)
≤ E[ (2√(pM) ‖C_y* − C_y‖_B)² ] / (η_p² Tr(S)²) = 4pM E[ ‖C_y* − C_y‖²_B ] / (η_p² M² [f]_0²)   (73)
(a)≤ 8p M² (snr + 1)² σ⁴ / (T η_p² M² [f]_0²) = (8p/(T η_p²)) (1 + 1/snr)²,   (74)

where in (a) we used Lemma A.5 and the fact that snr = [f]_0/σ². This completes the proof.
APPENDIX B: COMPLEX GAUSSIAN RANDOM VARIABLES
In this appendix, we review some of the properties of the complex circularly symmetric Gaussian random variables that we need in this paper.

Proposition B.1: Let X = X_r + jX_i and Y = Y_r + jY_i be two zero-mean, unit-variance circularly symmetric Gaussian variables with correlation coefficient ρ = ρ_r + jρ_i. Then we have:
1) E[XX*] = E[Y Y*] = 1, and E[XX] = E[Y Y] = E[XY] = 0.
2) E[X_r²] = E[X_i²] = E[Y_r²] = E[Y_i²] = 1/2.
3) E[X_r X_i] = E[Y_r Y_i] = 0, E[X_r Y_r] = E[X_i Y_i] = ρ_r/2, and E[X_r Y_i] = −E[X_i Y_r] = ρ_i/2.
We also use the following proposition for real-valued Gaussian variables.

Proposition B.2 (Price's Theorem [64]): Let Z and W be two real-valued jointly Gaussian N(0, 1) variables with covariance ρ. Let g(z, w) be a differentiable function of (z, w) and let I(ρ) = E[g(Z, W)]. Then

dI/dρ = E[ ∂²g(Z, W)/∂z∂w ].

Using Proposition B.2, we can prove the following result.
Proposition B.3: Let X and Y be as in Proposition B.1. Then, we have
1) E[X_i²Y_r²] = E[X_r²Y_i²] = (1 + 2ρ_i²)/4 and E[X_i²Y_i²] = E[X_r²Y_r²] = (1 + 2ρ_r²)/4.
2) E[|XY*|²] = E[|X|²]² + |E[XY*]|².

Proof: For part 1) we only prove one of the identities; the proof of the other ones follows similarly. Let us consider E[X_r²Y_i²]. Note that, from the properties of the complex Gaussian variables, it results that (Z, W) = (√2 X_r, √2 Y_i) are jointly Gaussian N(0, 1) random variables with covariance ρ_i. This implies that g(ρ_i) = 4E[X_r²Y_i²] = E[Z²W²] is a function of ρ_i. Applying Price's theorem in Proposition B.2, we have

dg/dρ_i = 4E[ZW] = 4ρ_i,   (75)

which implies that g(ρ_i) = 2ρ_i² + c, where c is a constant. For ρ_i = 0, the random variables (Z, W) are independent of each other, and g(0) = E[Z²W²] = E[Z²]E[W²] = 1, which implies that c = 1. Hence, we have g(ρ_i) = 2ρ_i² + 1. Thus, we obtain E[X_r²Y_i²] = (1 + 2ρ_i²)/4.
1 + 2ρ2i 1 + 2ρ2r ) + 2( ) = 1 + ρ2i + ρ2r = 1 + |ρ|2 . 4 4
(76)
Moreover, we have E[XY ∗ ] = E[Xr Yr + Xi Yi ] + jE[Xi Yr − Xr Yi ] = 2( Using Eq. (76), (77) and E[|X|2 ] = 1, we obtain the desired result.
ρr ρi ) + j2(− ) = ρr − jρi = ρ∗ . 2 2
(77)
29
A PPENDIX C P ROOFS
OF THE
P ROPOSITIONS
C.1 Proof of Proposition 4.3 We first prove that Lcav (S) is a concave function of S. It is sufficient to prove that for arbitrary S1 and S2 in T+ , the function κ : [0, 1] → R+ given by κ(t) = Lcav ((1 − t)S1 + tS2 ) is a concave function of t ∈ [0, 1]. Let us
e i = Im + BSi BH , i = 1, 2. Note that S e i ≻ 0 are both positive definite matrices. Thus, we have define S e 1 + tS e 2 = log det(S e −1/2 e2 S e 1 ) + log det (1 − t)Im + tS e −1/2 S κ(t) = log det (1 − t)S 1 1 (a) e 1 ) + log det (1 − t)Im + tUΛUH (b) e 1 ) + log det (1 − t)Im + tΛ = log det(S = log det(S e1) + = log det(S
m X ℓ=1
log(1 − t + tλℓ ),
(78) (79) (80)
e −1/2 = UΛUH , and where in (b) we used UUH = Im . e2 S e −1/2 S where in (a) we used the SVD of the PSD matrix S 1 1
As λℓ ≥ 0, the function log(1 − t + tλℓ ) is a well-defined concave function of t over [0, 1], thus, κ(t) is also a concave function over [0, 1].
−1 o n e 1 + tS e2 b x (1 − t)S . We To prove the convexity of Lvex , we define h : [0, 1] → R+ given by h(t) = Tr C
have
o n −1 b 1/2 S e −1/2 (1 − t)I + tUΛUH b 1/2 e −1/2 C S h(t) = Tr C x x 1 1 o n −1 b 1/2 S e −1/2 U (1 − t)I + tΛ b 1/2 . e −1/2 C = Tr C UH S x
1
1
(81) (82)
x
b 1/2 e −1/2 U by vi , i = 1, 2, . . . , m. Then we obtain h(t) = Pm Let us denote the m columns of C x S1 ℓ=1
kvℓ k2 1−t+tλℓ .
As
λℓ ≥ 1, it is seen that h(t) is the sum of m convex functions, thus, it is convex over [0, 1], which implies that Lvex
is a convex function of S. This completes the proof. C.2 Proof of Proposition 4.4 Rπ It is seen from S = snr −π γ(u)a(u)a(u)H du that S scales linearly with snr. As BSBH is a PSD matrix, we
prove a more general statement that for m × m any Hermitian matrix H for which Im + snrH is PSD, we have
log det(Im + snrH) = snrTr(H) + o(snr). Let UΛUH be the SVD of H. Then, we have log det(Im + snrH) = log det(Im + snrUΛUH ) = log det(Im + snrΛ) =
m X
log(1 + snrλℓ )
(83)
ℓ=1
= snr
m X
λℓ + o(snr) = snrTr(H) + o(snr).
(84)
ℓ=1
In particular, from the concavity of the Logarithm, it results that log(1 + snrλℓ ) ≤ snrλℓ . This implies that log det(Im + snrH) ≤ snrTr(H). This complete the proof.
30
C.3 Proof of Proposition 4.5 Let Sopt be the ML estimate, i.e., Sopt = arg minS∈T+ Lapp (S) and let (S∗ , W∗ ) be the output of the semidefinite program (35). Let us also define H = Im + BSopt BH and Wopt = ∆H H−1 ∆. Applying the Woodbury identity [62, 63] to the PSD constraint at point (Sopt , Wopt ) we obtain that H 0 I Im 0 H ∆ m = 0 Wopt − ∆H H∆ 0 ∆H H−1 Im ∆H Wopt
H−1 ∆ Im
Since Wopt = ∆H H−1 ∆, using (85), it is not difficult to check that for an arbitrary vector
[uH vH ]
H
∆
∆H
Wopt
u v
u v
.
(85)
,
= rH H r ≥ 0
(86)
where r = u + H−1 ∆ v. This implies that (Sopt , Wopt ) satisfies the PSD constraint. Hence, Lapp (Sopt ) = Tr(BSopt BH ) + Tr(∆H (Im + BSopt BH )−1 ∆) = Tr(BSopt BH ) + Tr(Wopt )
(87)
(a)
≥ Tr(BS∗ BH ) + Tr(W∗ ) ≥ Tr(BS∗ BH ) + Tr(∆H (Im + BS∗ BH )−1 ∆) = Lapp (S∗ ),
(88)
where in (a) we apply the Woodbury identity to the PSD constraint which gives W∗ ∆H (Im + BS∗ BH )−1 ∆. Taking the trace of both sides we obtain the desired inequality in (a). As Sopt is the ML estimate, we also have Lapp (Sopt ) ≤ Lapp (S∗ ). Combining with (87), we obtain the proof. C.4 Proof of Proposition A.2 b y − C′ kB simply follows from the definition of C∗ as the b y − C∗ k B ≤ k C First, note that the inequality kC y y
b y onto the space T+ . Thus, we only need to prove the other inequality kC∗ − C′ kB ≤ kC b y − C′ k B . projection of C y
To prove this, note that the seminorm k.kB is defined from a PSD bilinear form. As C′ itself belongs to T+ , it is
not difficult to see that at the projection C∗y , the vector C′ − C∗y is a feasible direction to move because from the
convexity of the space T+ , it results that C∗y + α(C′ − C∗y ) ∈ T+ for any α ∈ [0, 1]. Thus, from the optimality of b y − C∗ , C′ − C∗ iB ≤ 0. This implies that C∗y , it results that hC y y b y − C′ k2B = kC b y − C∗y + C∗y − C′ k2B kC
b y − C∗ k2 + kC∗ − C′ k2 − 2hC b y − C∗ , C′ − C∗ iB = kC y B y B y y
b y − C∗ k2 + kC∗ − C′ k2 , ≥ kC y B y B
(89) (90) (91)
b y − C′ kB results. Combining the two inequalities, we obtain from which the desired inequality kC∗y − C′ kB ≤ kC the result.
C.5 Proof of Proposition A.3
We use the Cauchy-Schwarz inequality for the inner product. We have:

|⟨C_y, VV^H⟩ − ⟨C_y*, VV^H⟩| = |⟨C_y − C_y*, VV^H⟩| ≤ ‖C_y − C_y*‖ √(Tr{VV^H VV^H})   (92)
≤(a) α_B(M) ‖C_y − C_y*‖_B √p ≤ ε√(pM),   (93)

where in (a) we used the incoherence parameter of the coprime matrix, α_B(M) ≤ √M, as stated in (52).
C.6 Proof of Proposition A.4
First note that C_y and S differ by a multiple of the identity matrix I_M. As ⟨I_M, VV^H⟩ = ⟨I_M, ṼṼ^H⟩ = p, we can equivalently prove the following inequality: |⟨C_y, VV^H⟩ − ⟨C_y, ṼṼ^H⟩| ≤ 2ε√(pM).

First note that 0 ≤ ⟨C_y, ṼṼ^H⟩ ≤ ⟨C_y, VV^H⟩ because, from the SVD, V is the best p-dimensional beamformer for C_y. Similarly, we have 0 ≤ ⟨C_y*, VV^H⟩ ≤ ⟨C_y*, ṼṼ^H⟩. Consequently, we can always make |⟨C_y*, ṼṼ^H⟩ − ⟨C_y, VV^H⟩| larger by either changing V into Ṽ or vice versa. In other words,

|⟨C_y*, ṼṼ^H⟩ − ⟨C_y, VV^H⟩| ≤ max{ |⟨C_y*, VV^H⟩ − ⟨C_y, VV^H⟩|, |⟨C_y*, ṼṼ^H⟩ − ⟨C_y, ṼṼ^H⟩| }.   (94)

Furthermore, we have

|⟨C_y, ṼṼ^H⟩ − ⟨C_y, VV^H⟩| ≤ |⟨C_y, ṼṼ^H⟩ − ⟨C_y*, ṼṼ^H⟩ + ⟨C_y*, ṼṼ^H⟩ − ⟨C_y, VV^H⟩|
≤ |⟨C_y, ṼṼ^H⟩ − ⟨C_y*, ṼṼ^H⟩| + |⟨C_y*, ṼṼ^H⟩ − ⟨C_y, VV^H⟩|
≤(a) ε√(pM) + |⟨C_y*, ṼṼ^H⟩ − ⟨C_y, VV^H⟩|
≤(b) ε√(pM) + max{ |⟨C_y*, VV^H⟩ − ⟨C_y, VV^H⟩|, |⟨C_y*, ṼṼ^H⟩ − ⟨C_y, ṼṼ^H⟩| }
≤ ε√(pM) + ε√(pM) = 2ε√(pM),

where in (a) we used Proposition A.3, and in (b) we used (94) and Proposition A.3.

C.7 Proof of Lemma A.5
First notice that we have
=
M−1 X k=0
≈
(95)
h 2 i ck E [b g]k − [g]k
(96)
i (b) σ 4 (1 + snr)2 M−1 h X m(m + 1) σ 4 (1 + snr)2 ck Var [b g ]k = ck = T 2 T
2M σ 4 (1 + snr)2 , T
(97)
k=0
(98)
where in (a) we used the inequality proved in Proposition A.2 by replacing C′ = Cy , where [g]k are as in (59), √ where ck denotes the covering number of k ∈ [M ] by the coprime sampling D with m = |D| = O(2 M ), and
32
where (b) results from Proposition A.1. The other result simply follows from the identity (E[‖C_y* − C_y‖_B])² ≤ E[‖C_y* − C_y‖²_B].

REFERENCES
[1] T. L. Marzetta, "Noncooperative cellular wireless with unlimited numbers of base station antennas," IEEE Trans. on Wireless Commun., vol. 9, no. 11, pp. 3590-3600, Nov. 2010.
[2] C. Shepard, H. Yu, N. Anand, E. Li, T. Marzetta, R. Yang, and L. Zhong, "Argos: Practical many-antenna base stations," in Proceedings of the 18th Annual International Conference on Mobile Computing and Networking. ACM, 2012, pp. 53-64.
[3] H. Huh, G. Caire, H. Papadopoulos, and S. Ramprashad, "Achieving massive MIMO spectral efficiency with a not-so-large number of antennas," IEEE Trans. on Wireless Commun., vol. 11, no. 9, pp. 3226-3239, 2012.
[4] J. Hoydis, S. Ten Brink, and M. Debbah, "Massive MIMO in the UL/DL of cellular networks: How many antennas do we need?" IEEE J. on Sel. Areas on Commun. (JSAC), vol. 31, no. 2, pp. 160-171, 2013.
[5] E. Larsson, O. Edfors, F. Tufvesson, and T. Marzetta, "Massive MIMO for next generation wireless systems," Communications Magazine, IEEE, vol. 52, no. 2, pp. 186-195, 2014.
[6] E. Bjornson, J. Hoydis, M. Kountouris, and M. Debbah, "Massive MIMO systems with non-ideal hardware: Energy efficiency, estimation, and capacity limits," IEEE Trans. on Inform. Theory, vol. 60, no. 11, pp. 7112-7139, 2014.
[7] R. Rogalin, O. Y. Bursalioglu, H. Papadopoulos, G. Caire, A. F. Molisch, A. Michaloliakos, V. Balan, and K. Psounis, "Scalable synchronization and reciprocity calibration for distributed multiuser MIMO," IEEE Trans. on Wireless Commun., vol. 13, no. 4, pp. 1815-1831, 2014.
[8] A. Adhikary, J. Nam, J.-Y. Ahn, and G. Caire, "Joint spatial division and multiplexing: The large-scale array regime," IEEE Trans. on Inform. Theory, vol. 59, no. 10, pp. 6441-6463, 2013.
[9] J. Nam, A. Adhikary, J.-Y. Ahn, and G. Caire, "Joint spatial division and multiplexing: Opportunistic beamforming, user grouping and simplified downlink scheduling," IEEE J. of Sel. Topics in Sig. Proc. (JSTSP), vol. 8, no. 5, pp. 876-890, 2014.
[10] A. Adhikary, E. Al Safadi, M. K. Samimi, R. Wang, G. Caire, T. S. Rappaport, and A. F. Molisch, "Joint spatial division and multiplexing for mm-wave channels," IEEE J. on Sel. Areas on Commun. (JSAC), vol. 32, no. 6, pp. 1239-1255, 2014.
[11] A. Adhikary, H. S. Dhillon, and G. Caire, "Massive-MIMO meets HetNet: Interference coordination through spatial blanking," IEEE J. on Sel. Areas on Commun. (JSAC), 2014.
[12] ——, "Spatial blanking and inter-tier coordination in massive-MIMO heterogeneous cellular networks," in Globecom Workshops (GC Workshop). IEEE, 2014, pp. 1229-1234.
[13] J. Nam, G. Caire, Y. Ko, and J. Ha, "On the role of transmit correlation diversity in multiuser MIMO systems," CoRR, vol. abs/1505.02896, 2015. [Online]. Available: http://arxiv.org/abs/1505.02896
[14] P. P. Vaidyanathan and P. Pal, "Sparse sensing with co-prime samplers and arrays," IEEE Transactions on Signal Processing, vol. 59, no. 2, pp. 573-586, 2011.
[15] P. Vaidyanathan and P. Pal, "Theory of sparse coprime sensing in multiple dimensions," Signal Processing, IEEE Transactions on, vol. 59, no. 8, pp. 3592-3608, 2011.
[16] R. O. Schmidt, "Multiple emitter location and signal parameter estimation," Antennas and Propagation, IEEE Transactions on, vol. 34, no. 3, pp. 276-280, 1986.
[17] R. Roy and T. Kailath, "ESPRIT-estimation of signal parameters via rotational invariance techniques," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 37, no. 7, pp. 984-995, 1989.
[18] H. Wang and M. Kaveh, "Coherent signal-subspace processing for the detection and estimation of angles of arrival of multiple wide-band sources," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 33, no. 4, pp. 823-831, 1985.
[19] B. Yang, "Projection approximation subspace tracking," Signal Processing, IEEE Transactions on, vol. 43, no. 1, pp. 95-107, 1995.
[20] Y. Chi, Y. C. Eldar, and R. Calderbank, "PETRELS: Parallel subspace estimation and tracking by recursive least squares from partial observations," Signal Processing, IEEE Transactions on, vol. 61, no. 23, pp. 5947-5959, 2013.
[21] L. Balzano, R. Nowak, and B. Recht, "Online identification and tracking of subspaces from highly incomplete information," in Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on. IEEE, 2010, pp. 704-711.
[22] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, "The convex geometry of linear inverse problems," Foundations of Computational Mathematics, vol. 12, no. 6, pp. 805-849, 2012.
[23] M. Fazel, "Matrix rank minimization with applications," Ph.D. dissertation, Stanford University, 2002.
[24] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717-772, 2009.
[25] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from a few entries," Information Theory, IEEE Transactions on, vol. 56, no. 6, pp. 2980-2998, 2010.
[26] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956-1982, 2010.
[27] D. L. Donoho, "Compressed sensing," Information Theory, IEEE Transactions on, vol. 52, no. 4, pp. 1289-1306, 2006.
[28] E. J. Candes and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?" Information Theory, IEEE Transactions on, vol. 52, no. 12, pp. 5406-5425, 2006.
[29] B. Recht, M. Fazel, and P. A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Review, vol. 52, no. 3, pp. 471-501, 2010.
[30] D. Gross, "Recovering low-rank matrices from few coefficients in any basis," Information Theory, IEEE
Transactions on, vol. 57, no. 3, pp. 1548–1566, 2011. [31] Y. Chen, Y. Chi, and A. Goldsmith, “Exact and stable covariance estimation from quadratic sampling via convex programming,” arXiv preprint arXiv:1310.0807, 2013. [32] D. C. Rife and R. R. Boorstyn, “Multiple tone parameter estimation from discrete-time observations,” Bell System Technical Journal, vol. 55, no. 9, pp. 1389–1410, 1976. [33] W. U. Bajwa, J. Haupt, A. M. Sayeed, and R. Nowak, “Compressed channel sensing: A new approach to estimating sparse multipath channels,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1058–1076, 2010. [34] R. Baraniuk and P. Steeghs, “Compressive radar imaging,” in Radar Conference, 2007 IEEE.
IEEE, 2007,
pp. 128–133. [35] M. F. Duarte and R. G. Baraniuk, “Spectral compressive sensing,” Applied and Computational Harmonic Analysis, vol. 35, no. 1, pp. 111–129, 2013. [36] A. C. Fannjiang, T. Strohmer, and P. Yan, “Compressed remote sensing of sparse objects,” SIAM Journal on Imaging Sciences, vol. 3, no. 3, pp. 595–618, 2010. [37] M. Herman, T. Strohmer et al., “High-resolution radar via compressed sensing,” Signal Processing, IEEE Transactions on, vol. 57, no. 6, pp. 2275–2284, 2009. [38] D. Malioutov, M. C¸etin, and A. S. Willsky, “A sparse signal reconstruction perspective for source localization with sensor arrays,” Signal Processing, IEEE Transactions on, vol. 53, no. 8, pp. 3010–3022, 2005. [39] S. Kunis and H. Rauhut, “Random sampling of sparse trigonometric polynomials, ii. orthogonal matching pursuit versus basis pursuit,” Foundations of Computational Mathematics, vol. 8, no. 6, pp. 737–763, 2008. [40] P. Stoica and P. Babu, “Spice and likes: Two hyperparameter-free methods for sparse-parameter estimation,” Signal Processing, vol. 92, no. 7, pp. 1580–1590, 2012. [41] P. Stoica, P. Babu, and J. Li, “New method of sparse parameter estimation in separable models and its use for spectral analysis of irregularly sampled data,” Signal Processing, IEEE Transactions on, vol. 59, no. 1, pp. 35–47, 2011. [42] Y. Chi, L. L. Scharf, A. Pezeshki et al., “Sensitivity to basis mismatch in compressed sensing,” Signal Processing, IEEE Transactions on, vol. 59, no. 5, pp. 2182–2195, 2011. [43] E. J. Cand`es and C. Fernandez-Granda, “Towards a mathematical theory of super-resolution,” Communications on Pure and Applied Mathematics, vol. 67, no. 6, pp. 906–956, 2014. [44] ——, “Super-resolution from noisy data,” Journal of Fourier Analysis and Applications, vol. 19, no. 6, pp. 1229–1254, 2013. [45] Z. Tan, Y. C. Eldar, and A. Nehorai, “Direction of arrival estimation using co-prime arrays: A super resolution viewpoint,” Signal Processing, IEEE Transactions on, vol. 62, no. 21, pp. 5565–5576, 2014. [46] M. Toeltsch, J. Laurila, K. Kalliola, A. F. Molisch, P. Vainikainen, and E. Bonek, “Statistical characterization of urban spatial radio channels,” Selected Areas in Communications, IEEE Journal on, vol. 20, no. 3, pp. 539–549, 2002. [47] H. Asplund, A. A. Glazunov, A. F. Molisch, K. Pedersen, M. Steinbauer et al., “The cost 259 directional
channel model-part ii: macrocells,” Wireless Communications, IEEE Transactions on, vol. 5, no. 12, pp. 3434– 3450, 2006. [48] J. A. Tropp, A. C. Gilbert, and M. J. Strauss, “Algorithms for simultaneous sparse approximation. part i: Greedy pursuit,” Signal Processing, vol. 86, no. 3, pp. 572–588, 2006. [49] J. A. Tropp, “Algorithms for simultaneous sparse approximation. part ii: Convex relaxation,” Signal Processing, vol. 86, no. 3, pp. 589–602, 2006. [50] K. Lee, Y. Bresler, and M. Junge, “Subspace methods for joint sparse recovery,” Information Theory, IEEE Transactions on, vol. 58, no. 6, pp. 3613–3641, 2012. [51] J. M. Kim, O. K. Lee, and J. C. Ye, “Compressive music: revisiting the link between compressive sensing and array signal processing,” Information Theory, IEEE Transactions on, vol. 58, no. 1, pp. 278–301, 2012. [52] M. E. Davies and Y. C. Eldar, “Rank awareness in joint sparse recovery,” Information Theory, IEEE Transactions on, vol. 58, no. 2, pp. 1135–1146, 2012. [53] M. Mishali and Y. C. Eldar, “Reduce and boost: Recovering arbitrary sets of jointly sparse vectors,” Signal Processing, IEEE Transactions on, vol. 56, no. 10, pp. 4692–4702, 2008. [54] Y. Li and Y. Chi, “Off-the-grid line spectrum denoising and estimation with multiple measurement vectors,” arXiv preprint arXiv:1408.2242, 2014. [55] Z. Yang and L. Xie, “Exact joint sparse frequency recovery via optimization methods,” arXiv preprint arXiv:1405.6585, 2014. [56] G. Tang, B. N. Bhaskar, P. Shah, and B. Recht, “Compressed sensing off the grid,” Information Theory, IEEE Transactions on, vol. 59, no. 11, pp. 7465–7490, 2013. [57] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004. [58] R. A. Fisher, “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, pp. 309–368, 1922. [59] J. Neyman, Su un teorema concernente le cosiddette statistiche sufficienti.
Istituto Italiano degli Attuari,
1936. [60] A. L. Yuille and A. Rangarajan, “The concave-convex procedure,” Neural computation, vol. 15, no. 4, pp. 915–936, 2003. [61] M. Grant, S. Boyd, and Y. Ye, “Cvx: Matlab software for disciplined convex programming,” 2008. [62] M. A. Woodbury, “Inverting modified matrices,” Memorandum report, vol. 42, p. 106, 1950. [63] ——, “The stability of out-input matrices,” Chicago, IL, vol. 93, 1949. [64] R. Price, “A useful theorem for nonlinear devices having gaussian inputs,” Information Theory, IRE Transactions on, vol. 4, no. 2, pp. 69–72, 1958.