Compressed Sensing with Nonlinear Observations and Related Nonlinear Optimisation Problems

Thomas Blumensath
University of Oxford
Centre for Functional Magnetic Resonance Imaging of the Brain
J R Hospital, Oxford, OX3 9DU, UK
[email protected]
Abstract. Non-convex constraints have recently proven a valuable tool in many optimisation problems. In particular, sparsity constraints have had a significant impact on sampling theory, where they are used in Compressed Sensing and allow structured signals to be sampled far below the rate traditionally prescribed. Nearly all of the theory developed for Compressed Sensing signal recovery assumes that samples are taken using linear measurements. In this paper we instead address the Compressed Sensing recovery problem in a setting where the observations are non-linear. We show that, under conditions similar to those required in the linear setting, the Iterative Hard Thresholding algorithm can be used to accurately recover sparse or structured signals from few non-linear observations. Similar ideas can also be developed in a more general non-linear optimisation framework. In the second part of this paper we therefore present related results that show how this can be done under sparsity and union of subspaces constraints, whenever a generalisation of the Restricted Isometry Property traditionally imposed on the Compressed Sensing system holds.

Key words and phrases: Compressed Sensing, Nonlinear Optimisation, Non-Convex Constraints, Inverse Problems
1 Introduction
Compressed Sensing [1, 2, 3] deals with the acquisition of finite dimensional sparse signals. Let x be a sparse vector of length N and assume we sample x using M linear measurements. The M samples can then be collected into a vector y of length M and the sampling process can be described by a matrix Φ.
If the observations are noisy, then the Compressed Sensing observation model is
$$ y = \Phi x + e, \qquad (1) $$
where $e$ is the noise vector. If $M < N$, then such a linear system is not uniquely invertible in general, unless we use additional assumptions on $x$. Sparsity of $x$ is such an assumption and Compressed Sensing theory tells us that, for certain $\Phi$, we can recover $x$ from $y$ even if $M \ll N$.

[…] $\epsilon > 0$, as there might not exist an $x_{opt}$ such that $\|x - x_{opt}\|_2 = \inf_{\hat{x}\in A}\|x - \hat{x}\|_2$. However, for
simplicity, we will assume for the rest of this paper that $A$ is a so-called proximal set, which is just a fancy way of saying that the required optimal points indeed lie in the set $A$, so that we use $\epsilon = 0$ here. Nevertheless, it is easy to adapt our theory to the more general setting. Note that this "projection" might not be defined uniquely in general, as for a given $x$ there might be several elements $x_A$ that satisfy the condition in (9). However, all we require here is that the map $P_A(x)$ returns a single element from the set of admissible $x_A$ (which is guaranteed to be non-empty [14]). How this selection is done is of no consequence for our arguments here. It is further worth noting that the relaxation offered by an $\epsilon > 0$ in the definition of the above projection also has a computational advantage. Instead of having to compute exact optima, which for many problems are often difficult to find, many approximate algorithms can be used instead (see [15] for a more detailed discussion).
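For concreteness, in the canonical Compressed Sensing case where $A$ is the set of all $k$-sparse vectors, the map $P_A$ can be realised by hard thresholding, that is, by keeping the $k$ largest-magnitude entries and setting all others to zero. The following Python sketch illustrates this; it is our own minimal example (the function name and the restriction to $k$-sparse sets are illustrative assumptions, not notation from the paper).

```python
import numpy as np

def project_k_sparse(x, k):
    """Return a closest k-sparse vector to x (in the Euclidean norm) by
    keeping the k largest-magnitude entries and zeroing the rest.
    This realises P_A for the special case where A is the set of k-sparse
    vectors; ties are broken arbitrarily, which is all the theory requires."""
    x = np.asarray(x, dtype=float)
    x_A = np.zeros_like(x)
    if k > 0:
        idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
        x_A[idx] = x[idx]
    return x_A
```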
2.2 The Iterative Hard Thresholding Algorithm for Non-Linear Compressed Sensing
For the linear Compressed Sensing problem, the Iterative Hard Thresholding (IHT) algorithm uses the following iteration
$$ x^{n+1} = P_A\big(x^n + \mu\Phi^*(y - \Phi x^n)\big), \qquad (10) $$
where $\Phi$ is the linear measurement operator. In the non-linear case, let us approximate $\Phi(x)$ using an affine Taylor series type approximation around a point $x^\star$, so that $\Phi(x) \approx \Phi(x^\star) + \Phi_{x^\star}(x - x^\star)$, where $\Phi_{x^\star}$ is a linear operator (such as the Jacobian of $\Phi(x)$, evaluated at the point $x^\star$). The matrix $\Phi_{x^\star}$ will thus in general depend on $x^\star$. At iteration $n$ we then write the IHT algorithm as
$$ x^{n+1} = P_A\big(x^n + \mu\Phi_{x^n}^*(y - \Phi(x^n))\big). \qquad (11) $$
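To make the iteration (11) concrete, the following Python sketch implements it as a simple loop. It is a minimal illustration under our own assumptions: `Phi` evaluates the non-linear measurement map, `jacobian` returns the linearisation $\Phi_{x^\star}$ at a given point, and `project` implements $P_A$ (for example the $k$-sparse projection sketched above); these names and interfaces are ours, not the paper's.

```python
import numpy as np

def nonlinear_iht(y, Phi, jacobian, project, mu, N, n_iter=100):
    """Iterative Hard Thresholding for non-linear observations, following (11):
       x^{n+1} = P_A( x^n + mu * Phi_{x^n}^* (y - Phi(x^n)) ).
    Phi      : callable, Phi(x) -> length-M observation vector
    jacobian : callable, jacobian(x) -> (M, N) matrix, the linearisation of Phi at x
    project  : callable, project(v) -> an element of the constraint set A closest to v
    mu       : step size, assumed to satisfy the conditions of Theorem 1
    N        : signal dimension
    """
    x = np.zeros(N)                                    # x^0 = 0
    for _ in range(n_iter):
        residual = y - Phi(x)                          # y - Phi(x^n)
        gradient_step = jacobian(x).conj().T @ residual  # Phi_{x^n}^* (y - Phi(x^n))
        x = project(x + mu * gradient_step)
    return x
```

When `Phi` is linear the Jacobian is constant and this loop reduces to the standard linear IHT iteration (10).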
Indeed, as we show below in Section 2.4, this algorithm can recover $x$ under similar conditions to those required by the IHT algorithm in the linear setting. All we require is that the matrices $\Phi_{x^\star}$ satisfy a Restricted Isometry Property and that the error introduced by the linearisation is not too large, i.e. that $\|\Phi(x_A) - \Phi(x^n) - \Phi_{x^n}(x_A - x^n)\|$ is small for large $n$.

Theorem 1. Assume that $y = \Phi(x) + e$ and that $\Phi_{x^\star}$ is a linearisation of $\Phi(\cdot)$ at $x^\star$, so that the Iterative Hard Thresholding algorithm uses the iteration $x^{n+1} = P_A(x^n + \mu\Phi_{x^n}^*(y - \Phi(x^n)))$. Assume that $\Phi_{x^\star}$ satisfies the RIP
$$ \alpha\|x_1 - x_2\|_2^2 \leq \|\Phi_{x^\star}(x_1 - x_2)\|_2^2 \leq \beta\|x_1 - x_2\|_2^2 \qquad (12) $$
for all $x_1, x_2, x^\star \in A$, with constants satisfying $\beta \leq 1/\mu < 1.5\alpha$. Define
$$ e_A^n = y - \Phi(x^n) - \Phi_{x^n}(x_A - x^n) \qquad (13) $$
and $\epsilon^k = b\sum_{n=0}^{k-1} a^{k-1-n}\|e_A^n\|^2$, where $b = 4/\alpha$ and $a = 2/(\mu\alpha) - 2$. Then after
$$ k^\star = 2\,\frac{\ln\!\big(\delta\sqrt{\epsilon^k}/\|x_A\|\big)}{\ln\!\big(2/(\mu\alpha) - 2\big)} \qquad (14) $$
iterations we have
$$ \|x - x^{k^\star}\| \leq (1 + \delta)\sqrt{\epsilon^k} + \|x_A - x\|. \qquad (15) $$
Obviously, for the above theorem to make sense, we require the error term $\epsilon^k$ to be well behaved. This is true whenever $\|y - \Phi(x^n) - \Phi_{x^n}(x_A - x^n)\|^2$ is bounded, as then $\epsilon^k \leq b\sum_{n=0}^{k-1} a^{k-1-n} C$ for some constant $C$, so that the requirement that $1/\mu < 1.5\alpha$, which ensures that $a = 2/(\mu\alpha) - 2 < 1$, in turn implies that the geometric series $\sum_{n=0}^{k-1} a^{k-1-n}$ is bounded. Indeed, if $a < 1$ and if we can show that $\epsilon^n = \|e_A^n\|^2$ is bounded and converges to some $\epsilon_{lim}$, then $\epsilon^k$ will also be bounded, as the following argument shows:
$$\begin{aligned}
\epsilon^k/b &= \sum_{n=0}^{k-1} a^{k-n-1}\epsilon^n \\
&= \sum_{n=0}^{p-1} a^{k-n-1}\epsilon^n + \sum_{n=p}^{k-1} a^{k-n-1}\epsilon^n \\
&\leq \sum_{n=0}^{p-1} a^{k-n-1}\epsilon^n + \epsilon^p \sum_{n=p}^{k-1} a^{k-n-1} \\
&\leq \sum_{n=0}^{p-1} a^{k-n-1}\epsilon^n + \epsilon^p\,\frac{1}{1-a} \\
&= a^{k-p}\sum_{n=0}^{p-1} a^{p-n-1}\epsilon^n + \epsilon^p\,\frac{1}{1-a} \\
&\leq a^{k-p}\,\epsilon^0 \sum_{n=0}^{p-1} a^{p-n-1} + \epsilon^p\,\frac{1}{1-a} \\
&\leq a^{k-p}\,\frac{\epsilon^0}{1-a} + \frac{\epsilon^p}{1-a}.
\end{aligned} \qquad (16)$$
Thus, if we let k and p increase to infinity such that 0 < k − p → ∞, then the first term on the left converges to zero whilst the second term converges to a
limit depending on $\epsilon_{lim}$, so that, if we iterate the algorithm long enough, then
$$ \lim_{k\to\infty} \epsilon^k \leq \epsilon_{lim}\,\frac{b}{1-a} \qquad (17) $$
and the error term converges to
$$ \|x - x^\star\| \leq \sqrt{\epsilon_{lim}\,\frac{b}{1-a}} + \|x_A - x\|. \qquad (18) $$
Actually, as shown in Section 2.5, more can be said if we can establish the following bound for $\Phi(x)$ and its linearisation: $\|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2 \leq C\|x_1 - x_2\|_2^2$.

Corollary 2. Assume that $y = \Phi(x) + e$ and that $\Phi_{x^\star}$ is a linearisation of $\Phi(\cdot)$ at $x^\star$, so that the Iterative Hard Thresholding algorithm uses the iteration $x^{n+1} = P_A(x^n + \mu\Phi_{x^n}^*(y - \Phi(x^n)))$. Assume that $\Phi_{x^\star}$ satisfies the RIP
$$ \alpha\|x_1 - x_2\|_2^2 \leq \|\Phi_{x^\star}(x_1 - x_2)\|_2^2 \leq \beta\|x_1 - x_2\|_2^2 \qquad (19) $$
for all $x_1, x_2, x^\star \in A$, and assume $\Phi(x)$ and $\Phi_x$ satisfy
$$ \|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2 \leq C\|x_1 - x_2\|_2^2, \qquad (20) $$
with constants satisfying $\beta \leq 1/\mu < 1.5\alpha - 4C$. Then the algorithm converges to a solution $x^\star$ that satisfies
$$ \|x - x^\star\| \leq c\|e_A\| + \|x_A - x\|, \qquad (21) $$
where $e_A = y - \Phi(x_A)$ and $c = \frac{2}{0.75\alpha - 1/\mu - 2C}$.

2.3 Example
Before we prove Theorem 1 and Corollary 2, let us give a simple example that shows how the above method and theory can be applied in a particular setting. Assume we have constructed a Compressed Sensing system in which a sparse signal $x \in \mathbb{R}^N$ is measured using a linear measurement system $\Phi$. Also assume that we have constructed the system so that $\Phi$ satisfies the Restricted Isometry Property with constants $\alpha$ and $\beta$. Now unfortunately, the sensors we have available for the actual measurements are not exactly linear but have a slight non-linearity, so that our measurements are of the form
$$ y = \Phi f(x) + e, \qquad (22) $$
where $f(\cdot)$ is a non-linear function applied to each element of the vector $x$. For simplicity, we will write $f(x) = x + h(x)$, where again $h(x)$ is a function applied element-wise. We then have $f'(x) = 1 + h'(x)$.
It is not difficult to see that the Jacobian of $\Phi f(x)$ can be written as
$$ \Phi_{x^\star} = \Phi + \Phi H'_{x^\star}, \qquad (23) $$
where $H'_{x^\star}$ is the diagonal matrix with the elements $h'(x^\star)$ along the diagonal. To use Corollary 2, we thus need to determine a) the RIP constants of $\Phi_{x^\star}$ and b) a bound on $\|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2$ as a function of $\|x_1 - x_2\|_2$.

The RIP constants are bounded for our example as follows. Assume that $|h'(x)| \leq M < 1$, that $x_1, x_2 \in A$ and that $\Phi$ satisfies the RIP with constants $\alpha$ and $\beta$. We then have
$$\begin{aligned}
(\alpha^{1/2} - \beta^{1/2}M)\|x_1 - x_2\| &\leq \|\Phi(x_1 - x_2)\| - \|\Phi H'_{x^\star}(x_1 - x_2)\| \\
&\leq \|\Phi_{x^\star}(x_1 - x_2)\| \\
&= \|\Phi(x_1 - x_2) + \Phi H'_{x^\star}(x_1 - x_2)\| \\
&\leq \|\Phi(x_1 - x_2)\| + \|\Phi H'_{x^\star}(x_1 - x_2)\| \\
&\leq \beta^{1/2}\|x_1 - x_2\| + \beta^{1/2}\|H'_{x^\star}(x_1 - x_2)\| \\
&\leq \beta^{1/2}(1 + M)\|x_1 - x_2\|,
\end{aligned} \qquad (24)$$
which proves the following lemma.

Lemma 3. Let $\Phi(x) = \Phi f(x)$, where the function $f(x) = x + h(x)$ is applied element-wise and where the derivative $h'(x)$ is absolutely bounded, $|h'(\cdot)| \leq M$. Also assume that the matrix $\Phi$ satisfies the RIP condition with constants $\alpha$ and $\beta$ for a set $A$. Then the matrix $\Phi(I + H'_{x^\star})$ satisfies the RIP with constants $(\alpha^{1/2} - \beta^{1/2}M)^2$ and $\beta(1 + M)^2$.

Let us now turn to point b). We have the bound
$$\begin{aligned}
\|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2 &= \|\Phi h(x_1) - \Phi h(x_2) - \Phi H'_{x_1}(x_1 - x_2)\|_2^2 \\
&= \|\Phi(h(x_1) - h(x_2) - H'_{x_1}x_1 + H'_{x_1}x_2)\|_2^2 \\
&\leq \beta\|h(x_1) - h(x_2) - H'_{x_1}x_1 + H'_{x_1}x_2\|_2^2,
\end{aligned} \qquad (25)$$
where in the last inequality we assume $\Phi$ to satisfy the RIP property and that $h(0) = 0$ (note that if we do not assume that $h(0) = 0$, then the same results still hold, though we have to replace $\beta$ by the operator norm of $\Phi$). Let us introduce the function $d_{x^\star}(x) = h(x) - H'_{x^\star}x$, so that
$$ \|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2 \leq \beta\|d_{x_1}(x_1) - d_{x_1}(x_2)\|_2^2. \qquad (26) $$
Thus if $d_{x^\star}$ is Lipschitz for all $x^\star \in A$ with a small constant $K$, then the condition
$$ \|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2 \leq C\|x_1 - x_2\|_2^2 \qquad (27) $$
in Corollary 2 holds with $C = \beta K$. Thus it remains to show that $d_{x_1}$ is Lipschitz. If the Jacobian $D_{x_1}(x^\star)$ of $d_{x_1}$ satisfies $\|D_{x_1}(x + th)\| \leq M$ for all $0 \leq t \leq 1$, then we know that
$$ \|d_{x_1}(x + h) - d_{x_1}(x)\| \leq M\|h\|, \qquad (28) $$
so that
$$ \|d_{x_1}(x_1) - d_{x_1}(x_2)\| = \|d_{x_1}(x_2 + x_1 - x_2) - d_{x_1}(x_2)\| \leq M\|x_1 - x_2\| \qquad (29) $$
holds if
$$ \|D_{x_1}(x_2 + t(x_1 - x_2))\| \leq M \qquad (30) $$
for all $0 \leq t \leq 1$. For our simple example, we see that $D_{x_1}$ is in fact a diagonal matrix with entries $\{D_{x_1}(x^\star)\}_{i,i} = h'_i(x^\star) - h'_i(x_1)$, where $h'_i(x_1) = h'(\{x_1\}_i)$, so that
$$ \{D_{x_1}(x_2 + t(x_1 - x_2))\}_{i,i} = h'(\{x_2 + t(x_1 - x_2)\}_i) - h'(\{x_1\}_i). \qquad (31) $$
Thus if $|h'(\cdot)|$ is bounded, that is, if $|h'(\cdot)| \leq M$, then $\|D_{x_1}(x)\| \leq M$. We have thus demonstrated the following.

Lemma 4. Let $\Phi(x) = \Phi f(x)$, where the function $f(x) = x + h(x)$ is applied element-wise and where the derivative $h'(x)$ is absolutely bounded, $|h'(\cdot)| \leq M$. Then
$$ \|\Phi(x_1) - \Phi(x_2) - \Phi_{x_1}(x_1 - x_2)\|_2^2 \leq C\|x_1 - x_2\|_2^2, \qquad (32) $$
where $C = \beta M$.
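To illustrate the example numerically, the following sketch (our own construction, reusing `project_k_sparse` and `nonlinear_iht` from the sketches above) measures a sparse vector through a mildly non-linear sensor with $h(x) = 0.1\tanh(x)$, so that $|h'(\cdot)| \leq M = 0.1 < 1$, and runs the non-linear IHT iteration with the Jacobian $\Phi_{x^\star} = \Phi(I + H'_{x^\star})$ from (23). The Gaussian $\Phi$, the dimensions and the step size are arbitrary illustrative choices.

```python
import numpy as np

# Illustrative, assumed choices (not from the paper): a Gaussian Phi and the
# element-wise non-linearity h(x) = 0.1*tanh(x), so that |h'(x)| <= M = 0.1 < 1.
rng = np.random.default_rng(0)
M_meas, N, k = 40, 100, 5
Phi_lin = rng.standard_normal((M_meas, N)) / np.sqrt(M_meas)

h = lambda x: 0.1 * np.tanh(x)
h_prime = lambda x: 0.1 * (1.0 - np.tanh(x) ** 2)

Phi = lambda x: Phi_lin @ (x + h(x))                              # y = Phi f(x), eq. (22)
jacobian = lambda x: Phi_lin @ (np.eye(N) + np.diag(h_prime(x)))  # Phi_x = Phi(I + H'_x), eq. (23)

# Ground-truth k-sparse signal and (noiseless) non-linear measurements.
x_true = np.zeros(N)
x_true[rng.choice(N, k, replace=False)] = rng.standard_normal(k)
y = Phi(x_true)

# Non-linear IHT (Section 2.2) with the k-sparse projection; the step size mu
# is an ad-hoc choice here and should in principle satisfy Theorem 1.
x_hat = nonlinear_iht(y, Phi, jacobian,
                      project=lambda v: project_k_sparse(v, k),
                      mu=0.5, N=N, n_iter=300)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

In this mildly non-linear regime the recovery is typically accurate, though the guarantees of Theorem 1 and Corollary 2 only apply when the stated conditions on $\mu$, $\alpha$, $\beta$ and $C$ hold.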
2.4 Proof of Theorem 1
The proof basically follows that in [14], but with some important modifications to account for the non-linear setting analysed here.

Proof. As always, we start with the triangle inequality
$$ \|x - x^{n+1}\| \leq \|x_A - x^{n+1}\| + \|x_A - x\| \qquad (33) $$
and then bound the first term on the right using the definition
$$ e_A^n = y - \Phi(x^n) - \Phi_{x^n}(x_A - x^n) \qquad (34) $$
and the inequalities
$$\begin{aligned}
\|x_A - x^{n+1}\|^2 &\leq \frac{1}{\alpha}\|\Phi_{x^n}(x_A - x^{n+1})\|^2 \\
&= \frac{1}{\alpha}\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n) - (y - \Phi(x^n) - \Phi_{x^n}(x_A - x^n))\|^2 \\
&= \frac{1}{\alpha}\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n) - e_A^n\|^2 \\
&= \frac{1}{\alpha}\Big[\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2 + \|e_A^n\|^2 - 2\langle e_A^n, y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\rangle\Big] \\
&\leq \frac{1}{\alpha}\Big[\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2 + \|e_A^n\|^2 + \|e_A^n\|^2 + \|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2\Big] \\
&= \frac{2}{\alpha}\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2 + \frac{2}{\alpha}\|e_A^n\|^2.
\end{aligned} \qquad (35)$$
We here used the fact that
$$ -\langle e_A^n, y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\rangle \leq \|e_A^n\|\,\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\| \leq 0.5\big(\|e_A^n\|^2 + \|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2\big). $$
The left term in the last line of (35) is bounded by the next inequality
$$ \|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2 \leq \left(\frac{1}{\mu} - \alpha\right)\|x_A - x^n\|^2 + \|e_A^n\|^2, \qquad (36) $$
which is a result of the following argument, in which we use $g = 2\Phi_{x^n}^*(y - \Phi(x^n))$:
$$\begin{aligned}
\|y - \Phi(x^n) - \Phi_{x^n}(x^{n+1} - x^n)\|^2 - \|y - \Phi(x^n)\|^2 &= -\langle x^{n+1} - x^n, g\rangle + \|\Phi_{x^n}(x^{n+1} - x^n)\|^2 \\
&\leq -\langle x^{n+1} - x^n, g\rangle + \frac{1}{\mu}\|x^{n+1} - x^n\|^2 \\
&= \frac{1}{\mu}\Big[\big\|x^{n+1} - x^n - \tfrac{\mu}{2}g\big\|^2 - \big\|\tfrac{\mu}{2}g\big\|^2\Big] \\
&= \frac{1}{\mu}\inf_{x\in A}\Big[\big\|x - x^n - \tfrac{\mu}{2}g\big\|^2 - \big\|\tfrac{\mu}{2}g\big\|^2\Big] \\
&= \inf_{x\in A}\Big[-\langle x - x^n, g\rangle + \frac{1}{\mu}\|x - x^n\|^2\Big] \\
&\leq -\langle x_A - x^n, g\rangle + \frac{1}{\mu}\|x_A - x^n\|^2 \\
&= -2\langle x_A - x^n, \Phi_{x^n}^*(y - \Phi(x^n))\rangle + \frac{1}{\mu}\|x_A - x^n\|^2 \\
&= -2\langle x_A - x^n, \Phi_{x^n}^*(y - \Phi(x^n))\rangle + \alpha\|x_A - x^n\|^2 + \Big(\frac{1}{\mu} - \alpha\Big)\|x_A - x^n\|^2 \\
&\leq -2\langle\Phi_{x^n}(x_A - x^n), y - \Phi(x^n)\rangle + \|\Phi_{x^n}(x_A - x^n)\|^2 + \Big(\frac{1}{\mu} - \alpha\Big)\|x_A - x^n\|^2 \\
&= \|y - \Phi(x^n) - \Phi_{x^n}(x_A - x^n)\|^2 - \|y - \Phi(x^n)\|^2 + \Big(\frac{1}{\mu} - \alpha\Big)\|x_A - x^n\|^2 \\
&= \|e_A^n\|^2 - \|y - \Phi(x^n)\|^2 + \Big(\frac{1}{\mu} - \alpha\Big)\|x_A - x^n\|^2,
\end{aligned} \qquad (37)$$
where the first and last inequalities are due to the RIP property of $\Phi_{x^n}$ and the choice $\beta \leq \frac{1}{\mu}$, whilst the second inequality is due to the fact that $x_A \in A$. We have thus shown that
$$ \|x_A - x^{n+1}\|^2 \leq 2\left(\frac{1}{\mu\alpha} - 1\right)\|x_A - x^n\|^2 + \frac{4}{\alpha}\|e_A^n\|^2. \qquad (38) $$
We can now iterate the above expression. Using $\epsilon^k = b\sum_{n=0}^{k-1} a^{k-1-n}\|e_A^n\|^2$, where $b = 4/\alpha$ and $a = 2/(\mu\alpha) - 2$, we get
$$ \|x_A - x^k\|^2 \leq \left(2\left(\frac{1}{\mu\alpha} - 1\right)\right)^k\|x_A\|^2 + \epsilon^k. \qquad (39) $$
Thus
$$ \|x - x^k\| \leq \sqrt{c^k\|x_A\|^2 + \epsilon^k} + \|x_A - x\| \leq c^{k/2}\|x_A\| + \sqrt{\epsilon^k} + \|x_A - x\|, $$
where $c = \frac{2}{\mu\alpha} - 2$. Choosing $k = k^\star$ as in (14) makes the first term no larger than $\delta\sqrt{\epsilon^k}$, which gives (15), and the theorem is proven.

2.5 Proof of Corollary 2
Proof. Let us start with the bound in (38),
$$ \|x_A - x^{n+1}\|^2 \leq 2\left(\frac{1}{\mu\alpha} - 1\right)\|x_A - x^n\|^2 + \frac{4}{\alpha}\|e_A^n\|^2, \qquad (40) $$
and let us look a bit more closely at
$$ e_A^n = y - \Phi(x^n) - \Phi_{x^n}(x_A - x^n) = \Phi(x_A) + e_A - \Phi(x^n) - \Phi_{x^n}(x_A - x^n), \qquad (41) $$
where $e_A = y - \Phi(x_A)$. We then have
$$ \|e_A^n\| \leq \|\Phi(x_A) - \Phi(x^n) - \Phi_{x^n}(x_A - x^n)\| + \|e_A\|. \qquad (42) $$
Now by assumption, $\|\Phi(x_A) - \Phi(x^n) - \Phi_{x^n}(x_A - x^n)\|$ is bounded as a function of $\|x_A - x^n\|$, i.e.
$$ \|\Phi(x_A) - \Phi(x^n) - \Phi_{x^n}(x_A - x^n)\|^2 \leq C\|x_A - x^n\|^2, \qquad (43) $$
so that (38) becomes
$$ \|x_A - x^{n+1}\|^2 \leq 2\left(\frac{1}{\mu\alpha} - 1 + \frac{4}{\alpha}C\right)\|x_A - x^n\|^2 + \frac{8}{\alpha}\|e_A\|^2. \qquad (44) $$
Thus we require that $\frac{1}{\mu\alpha} - 1 + \frac{4}{\alpha}C \leq 0.5$, that is, that $1/\mu \leq 1.5\alpha - 4C$. The same argument used in the main proof now holds: whenever the constant in front of the first term on the right-hand side is smaller than one, we can iterate the error bound and the corollary follows.
3 The Iterative Hard Thresholding Algorithm for Non-Linear Optimisation
Let us now return to the more general problem of minimising a non-linear function f (x) under the constraint that x ∈ A, where A is a Union of Subspaces. Let
us recall again that for minimisation problems of the form $\mathrm{argmin}_{x\in A}\|y - \Phi x\|^2$ we use the algorithm
$$ x^{n+1} = P_A\big(x^n + \Phi^*(y - \Phi x^n)\big). \qquad (45) $$
Note that the update $\Phi^*(y - \Phi x^n)$ is a scaled version of the gradient of the cost function $\|y - \Phi x\|^2$. In the more general setting $\mathrm{argmin}_{x\in X} f(x)$, where $x$ is a Euclidean vector, we can simply replace this update direction with the gradient of $f(x)$ (evaluated at $x^n$), whilst in more general spaces we assume that $f(x)$ is Fréchet differentiable with respect to $x$, that is, for each $x_1$ there exists a linear functional $D_{x_1}(\cdot)$ such that
$$ \lim_{h\to 0}\frac{f(x_1 + h) - f(x_1) - D_{x_1}(h)}{\|h\|} = 0. \qquad (46) $$
We can then use the Riesz representation theorem to write the linear functional $D_{x_1}(\cdot)$ using its inner product equivalent
$$ D_{x_1}(\cdot) = \langle\nabla(x_1), \cdot\rangle, \qquad (47) $$
where $\nabla(x_1) \in H$. Using $h = x_2 - x_1$ we see that for each $x_1$ we require the existence of a $\nabla(x_1)$ such that
$$ \lim_{x_2\to x_1}\frac{f(x_2) - f(x_1) - \langle\nabla(x_1), (x_2 - x_1)\rangle}{\|x_2 - x_1\|_H} = 0. \qquad (48) $$
In Euclidean spaces the Fréchet derivative is obviously the differential of $f(x)$ at $x_1$, in which case $\nabla(x_1)$ is the gradient and $\langle\cdot,\cdot\rangle$ the Euclidean inner product. With a slight abuse of terminology, we will therefore call $\nabla(x_1)$ 'the gradient' even in more general Hilbert space settings.

Having thus defined an update direction $\nabla(x)$ in quite general spaces, we are now in a position to define an algorithmic strategy to optimise $f(x)$. We again use a version of our trusty Iterative Hard Thresholding algorithm, but replace the update direction with $\nabla(x)$. With this modification, the algorithm might also be called the Projected Landweber Algorithm [16], and is defined formally by the iteration
$$ x^{n+1} = P_A\big(x^n - (\mu/2)\nabla(x^n)\big), \qquad (49) $$
where x0 = 0 and µ is a step size parameter chosen to satisfy the condition in Theorem 5 below.
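The iteration (49) only requires evaluations of the 'gradient' $\nabla(x)$ and of the projection $P_A$. The sketch below shows the generic loop and applies it to a toy sparsity-constrained least-squares objective; the objective, dimensions and step size are our own illustrative assumptions and, for this quadratic choice of $f$, the update simply reduces to the linear IHT iteration.

```python
import numpy as np

def iht_general(grad, project, mu, N, n_iter=200):
    """Projected Landweber / IHT iteration (49): x^{n+1} = P_A(x^n - (mu/2) grad(x^n))."""
    x = np.zeros(N)                       # x^0 = 0
    for _ in range(n_iter):
        x = project(x - 0.5 * mu * grad(x))
    return x

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v (a projection onto k-sparse vectors)."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

# Toy use (assumed example, not from the paper): f(x) = ||y - Phi x||^2 with a
# k-sparse constraint, for which grad(x) = -2 Phi^T (y - Phi x).
rng = np.random.default_rng(1)
Phi = rng.standard_normal((30, 80)) / np.sqrt(30)
x_true = np.zeros(80)
x_true[:4] = [1.0, -2.0, 0.5, 3.0]
y = Phi @ x_true
grad = lambda x: -2.0 * Phi.T @ (y - Phi @ x)
x_hat = iht_general(grad, project=lambda v: hard_threshold(v, 4), mu=0.5, N=80)
```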
3.1 Theoretical Performance Bound
We now come to the second main result of this paper, which states that, if $f(x)$ satisfies the Restricted Strict Convexity Property, then the Iterative Hard Thresholding algorithm can find a vector $x$ that is close to the true minimiser of $f(x)$ among all $x \in A$. In particular, we have the following theorem.
Theorem 5. Let $A$ be a union of subspaces. Given the optimisation problem $\mathrm{argmin}_{x\in A} f(x)$, where $f(\cdot)$ is a positive function that satisfies the Restricted Strict Convexity Property
$$ \alpha \leq \frac{f(x_1) - f(x_2) - \mathrm{Re}\langle\nabla(x_2), (x_1 - x_2)\rangle}{\|x_1 - x_2\|^2} \leq \beta \qquad (50) $$
for all $x_1, x_2 \in H$ for which $x_1 - x_2 \in A + A + A$. Let $x_{opt} = \mathrm{argmin}_{x\in A} f(x)$ and assume that $\beta \leq \frac{1}{\mu} < \frac{4}{3}\alpha$. Then, after
$$ n^\star = 2\,\frac{\ln\!\big(\delta\sqrt{f(x_{opt})}/\|x_{opt}\|\big)}{\ln\!\big(4(1 - \mu\alpha)\big)} \qquad (51) $$
iterations, the IHT algorithm calculates a solution $x^{n^\star}$ satisfying
$$ \|x^{n^\star} - x\| \leq \left(2\sqrt{\frac{\mu}{1-c}} + \delta\right)\sqrt{f(x_{opt})} + \|x - x_{opt}\|, \qquad (52) $$
where $c = 4(1 - \mu\alpha)$.
In the traditional Compressed Sensing setting, this result is basically that derived in [7].
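To see why, note that for the quadratic cost $f(x) = \|y - \Phi x\|_2^2$ we have $\nabla(x_2) = -2\Phi^*(y - \Phi x_2)$, so that (this short calculation is our own illustration and is not spelled out in the paper)
$$ f(x_1) - f(x_2) - \mathrm{Re}\langle\nabla(x_2), x_1 - x_2\rangle = \|y - \Phi x_1\|_2^2 - \|y - \Phi x_2\|_2^2 + 2\,\mathrm{Re}\langle y - \Phi x_2, \Phi(x_1 - x_2)\rangle = \|\Phi(x_1 - x_2)\|_2^2. $$
The quotient in (50) therefore equals $\|\Phi(x_1 - x_2)\|_2^2/\|x_1 - x_2\|_2^2$, so the Restricted Strict Convexity constants of $f$ coincide with the RIP constants of $\Phi$.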
3.2 Proof of the Second Main Result
Proof of Theorem 5. The proof requires the orthogonal projection onto a subspace $\Gamma$, defined as follows. Let $\Gamma$ be the sum of no more than three of the subspaces in $A$, such that $x_{opt}, x^n, x^{n+1} \in \Gamma$. Let $P_\Gamma$ be the orthogonal projection onto the subspace $\Gamma$. We write $a_\Gamma^n = P_\Gamma a^n$ and $P_\Gamma\nabla(x^n) = \nabla_\Gamma(x^n)$. Note that this ensures that $P_\Gamma x^n = x^n$, $P_\Gamma x^{n+1} = x^{n+1}$ and $P_\Gamma x_{opt} = x_{opt}$. We note for later that with this notation
$$ \mathrm{Re}\langle\nabla_\Gamma(x^n), (x_{opt} - x^n)\rangle = \mathrm{Re}\langle P_\Gamma\nabla(x^n), (x_{opt} - x^n)\rangle = \mathrm{Re}\langle\nabla(x^n), P_\Gamma(x_{opt} - x^n)\rangle = \mathrm{Re}\langle\nabla(x^n), (x_{opt} - x^n)\rangle \qquad (53) $$
and
$$ \|\nabla_\Gamma(x^n)\|^2 = \langle\nabla_\Gamma(x^n), \nabla_\Gamma(x^n)\rangle = \langle P_\Gamma\nabla(x^n), P_\Gamma\nabla(x^n)\rangle = \langle\nabla(x^n), P_\Gamma^* P_\Gamma\nabla(x^n)\rangle = \langle\nabla(x^n), \nabla_\Gamma(x^n)\rangle. \qquad (54) $$
We also need the following lemma.

Lemma 6. Under the assumptions of the theorem,
$$ \left\|\frac{\mu}{2}\nabla_\Gamma(x^n)\right\|^2 - \mu f(x^n) \leq 0. \qquad (55) $$
Proof. Using the Restricted Strict Convexity Property we have
$$\begin{aligned}
\left\|\frac{\mu}{2}\nabla_\Gamma(x^n)\right\|^2 &= -\frac{\mu}{2}\,\mathrm{Re}\Big\langle\nabla(x^n), -\frac{\mu}{2}\nabla_\Gamma(x^n)\Big\rangle \\
&\leq \frac{\mu}{2}\beta\left\|\frac{\mu}{2}\nabla_\Gamma(x^n)\right\|^2 + \frac{\mu}{2}f(x^n) - \frac{\mu}{2}f\Big(x^n - \frac{\mu}{2}\nabla_\Gamma(x^n)\Big) \\
&\leq \frac{\mu}{2}\beta\left\|\frac{\mu}{2}\nabla_\Gamma(x^n)\right\|^2 + \frac{\mu}{2}f(x^n).
\end{aligned} \qquad (56)$$
Thus
$$ (2 - \mu\beta)\left\|\frac{\mu}{2}\nabla_\Gamma(x^n)\right\|^2 \leq \mu f(x^n), \qquad (57) $$
which is the desired result, as $\mu\beta \leq 1$ by assumption.

To prove the theorem, we start by bounding the distance between the current estimate $x^{n+1}$ and the optimal estimate $x_{opt}$. Let $a_\Gamma^n = x^n - (\mu/2)\nabla_\Gamma(x^n)$. Because $x^{n+1}$ is the closest element in $A$ to $a_\Gamma^n$, we have
$$\begin{aligned}
\|x^{n+1} - x_{opt}\|^2 &\leq \big(\|x^{n+1} - a_\Gamma^n\| + \|a_\Gamma^n - x_{opt}\|\big)^2 \leq 4\|a_\Gamma^n - x_{opt}\|^2 \\
&= 4\|x^n - (\mu/2)\nabla_\Gamma(x^n) - x_{opt}\|^2 \\
&= 4\|(\mu/2)\nabla_\Gamma(x^n) + (x_{opt} - x^n)\|^2 \\
&= \mu^2\|\nabla_\Gamma(x^n)\|^2 + 4\|x_{opt} - x^n\|^2 + 4\mu\,\mathrm{Re}\langle\nabla_\Gamma(x^n), (x_{opt} - x^n)\rangle \\
&= \mu^2\|\nabla_\Gamma(x^n)\|^2 + 4\|x_{opt} - x^n\|^2 + 4\mu\,\mathrm{Re}\langle\nabla(x^n), (x_{opt} - x^n)\rangle \\
&\leq 4\|x_{opt} - x^n\|^2 + \mu^2\|\nabla_\Gamma(x^n)\|^2 + 4\mu\big[-\alpha\|x^n - x_{opt}\|^2 + f(x_{opt}) - f(x^n)\big] \\
&= 4(1 - \mu\alpha)\|x_{opt} - x^n\|^2 + 4\mu f(x_{opt}) + 4\big[\|(\mu/2)\nabla_\Gamma(x^n)\|^2 - \mu f(x^n)\big] \\
&\leq 4(1 - \mu\alpha)\|x_{opt} - x^n\|^2 + 4\mu f(x_{opt}).
\end{aligned} \qquad (58)$$
Here, the second to last inequality is the RSCP and the last inequality is due to Lemma 6. We have thus shown that
$$ \|x^{n+1} - x_{opt}\|^2 \leq 4(1 - \mu\alpha)\|x_{opt} - x^n\|^2 + 4\mu f(x_{opt}). \qquad (59) $$
Thus, with $c = 4(1 - \mu\alpha)$,
$$ \|x^n - x_{opt}\|^2 \leq c^n\|x_{opt}\|^2 + \frac{4\mu}{1-c} f(x_{opt}), \qquad (60) $$
so that, if $\frac{1}{\mu} < \frac{4}{3}\alpha$, we have $c = 4(1 - \mu\alpha) < 1$, so that $c^n$ decreases with $n$. Taking the square root on both sides and noting that for positive $a$ and $b$, $\sqrt{a^2 + b^2} \leq a + b$,
$$ \|x^n - x_{opt}\| \leq c^{n/2}\|x_{opt}\| + 2\sqrt{\frac{\mu}{1-c} f(x_{opt})}. \qquad (61) $$
The theorem then follows using the triangle inequality
$$ \|x^n - x\| \leq \|x^n - x_{opt}\| + \|x - x_{opt}\| \leq c^{n/2}\|x_{opt}\| + 2\sqrt{\frac{\mu}{1-c} f(x_{opt})} + \|x - x_{opt}\|. \qquad (62) $$
The iteration count is found by setting
$$ c^{n/2}\|x_{opt}\| \leq \delta\sqrt{f(x_{opt})}, \qquad (63) $$
so that after
$$ n = 2\,\frac{\ln\!\big(\delta\sqrt{f(x_{opt})}/\|x_{opt}\|\big)}{\ln c} \qquad (64) $$
iterations
$$ \|x^n - x\| \leq \left(2\sqrt{\frac{\mu}{1-c}} + \delta\right)\sqrt{f(x_{opt})} + \|x - x_{opt}\|. \qquad (65) $$

3.3 When and Where is this Theory Applicable?
Since we first derived the result here, it has been shown that properties such as the Restricted Strict Convexity Property do indeed hold for certain non-linear functions, such as those encountered in certain logistic regression problems [13]. These recent findings thus further strengthen the case for a detailed study of non-convexly constrained non-linear problems and the derivation of novel methodologies for their solution.

It may thus seem tempting to use this theory also in a non-linear Compressed Sensing setting, where we would have $f(x) = \|y - \Phi(x)\|_B^2$, where $\|\cdot\|_B$ is some Banach space norm and where $\Phi(\cdot)$ is some non-linear function.¹ If this $f(x)$ were to satisfy the Restricted Strict Convexity Property, then the theory in the second part of this paper would indeed tell us how to solve the non-linear Compressed Sensing problem. Unfortunately, it is far from clear yet under which conditions on $f(x) = \|y - \Phi(x)\|_B^2$ Restricted Strict Convexity type properties hold. Indeed, the following lemma shows that such a condition cannot be fulfilled in general for Hilbert spaces.

Lemma 7. Assume $B$ is a Hilbert space and assume $f(x)$ is convex on $A + A$ for all $y$ (i.e. it satisfies the Restricted Strict Convexity Property). Then $\Phi$ is affine on all subspaces of $A + A$.

¹ This was indeed the setting proposed in [11].
Proof. The proof was suggested by an anonymous reviewer of the earlier version of this manuscript [11] and uses contradiction. Assume $\Phi$ is not affine on some subspace of $A + A$. Thus, there is a subspace $S = A_i + A_j$ and $x_n \in S$ such that for $x = \sum_n \lambda_n x_n$, where $\sum_n \lambda_n = 1$ and $0 \leq \lambda_n$, we have $\sum_n \lambda_n\Phi(x_n) - \Phi(x) \neq 0$. Now by the assumption of strong convexity on $S$, we have (using $y_n = \Phi(x_n)$ and $\bar{y} = \Phi(x)$)
$$ 0 \leq \sum_n \lambda_n\|y - \Phi(x_n)\|^2 - \|y - \Phi(x)\|^2 = \sum_n \lambda_n\|y - y_n\|^2 - \|y - \bar{y}\|^2 = 2\Big\langle y, \bar{y} - \sum_n \lambda_n y_n\Big\rangle + \sum_n \lambda_n\|y_n\|^2 - \|\bar{y}\|^2, \qquad (66) $$
where the inequality is due to the assumption of convexity. But the above inequality cannot hold for all $y$ (it fails, for example, for a suitable multiple of $-(\bar{y} - \sum_n \lambda_n y_n)$). Thus $\Phi$ needs to be affine on the linear subsets of $A + A$.

Whilst this implies that the property cannot hold in Hilbert spaces for non-affine $\Phi$ and all $y$, it does not preclude the possibility that it could hold for specific observations $y$. This would not allow us to build a general signal recovery framework, but might still allow us the recovery of a subset of signals. Thus, for the non-linear Compressed Sensing problem in Hilbert space, the Restricted Isometry Property of the Jacobian of $\Phi(x)$, together with the ability to construct a good linear approximation of $\Phi(x)$, seems to be the more suitable tool with which to study recovery performance. Nevertheless, for certain other non-convexly constrained non-linear optimisation problems, such as those addressed in [13], the Restricted Strict Convexity Property might be the more appropriate framework. Whilst there are many similarities between these requirements, and they both boil down to the same RIP property in the linear setting, it remains to be seen what the exact relationship is between these two measures in general non-linear problems.
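As a small sanity check of this obstruction, the following sketch (a toy example of our own, not taken from the paper) uses the scalar non-affine map $\Phi(x) = x^2$ and shows numerically that $f(x) = (y - \Phi(x))^2$ violates the convexity inequality used in (66) for a suitably chosen observation $y$.

```python
import numpy as np

# Toy illustration of Lemma 7 (assumed example): Phi(x) = x^2 is not affine,
# and f(x) = (y - Phi(x))^2 is not convex for every observation y.
Phi = lambda x: x ** 2
f = lambda x, y: (y - Phi(x)) ** 2

x1, x2, lam = -1.0, 1.0, 0.5
x_mid = lam * x1 + (1 - lam) * x2          # the convex combination, here 0

for y in [-2.0, 0.0, 2.0]:
    lhs = f(x_mid, y)                       # f at the convex combination
    rhs = lam * f(x1, y) + (1 - lam) * f(x2, y)
    print(f"y = {y:+.1f}:  f(mid) = {lhs:.2f},  convex bound = {rhs:.2f},  "
          f"convexity {'holds' if lhs <= rhs else 'fails'}")
```

For $y = 2$ the mid-point value $f(0) = 4$ exceeds the convex bound $1$, so convexity fails for that observation, in line with Lemma 7.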
4 Conclusions
Compressed Sensing ideas can be developed in much more general settings than those considered traditionally. We have shown previously [14] that sparsity is not the only structure that allows signals to be recovered and that the finite dimensional setting can be replaced with a much more general Hilbert space framework. In this paper we have made a further important generalisation and have introduced the concept of non-linear measurements into Compressed Sensing theory. Under certain conditions, such as the requirement that the Jacobian of the measurement system satisfies a Restricted Isometry Property, the Iterative Hard Thresholding algorithm can be used to recover signals from a non-convex constraint set with similar error bounds to those derived in Compressed Sensing.
In the second part of this paper we have then looked at the related and in some sense more general setting of non-linear optimisation under non-convex constraints. Here we have looked at the Restricted Strict Convexity Property as a tool to study recovery performance, and it was shown that this condition is indeed sufficient for the Iterative Hard Thresholding algorithm to find points that are near the optimal solution.
Acknowledgment

This work was supported in part by the UK's Engineering and Physical Sciences Research Council grants EP/J005444/1 and D000246/1 and a Research Fellowship from the School of Mathematics at the University of Southampton.
References

[1] E. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, pp. 489–509, 2006.

[2] D. L. Donoho, "For most large underdetermined systems of linear equations the minimal 1-norm solution is also the sparsest solution," Communications on Pure and Applied Mathematics, vol. 59(6), pp. 797–829, 2006.

[3] E. Candès and M. B. Wakin, "An introduction to compressive sampling," IEEE Signal Processing Magazine, vol. 25(2), pp. 21–30, 2008.

[4] E. Candès, "The restricted isometry property and its implications for compressed sensing," Comptes Rendus de l'Académie des Sciences, Paris, Série I, vol. 346, pp. 589–592, 2008.

[5] D. Needell and J. Tropp, "CoSaMP: Iterative signal recovery from incomplete and inaccurate samples," Applied and Computational Harmonic Analysis, vol. 26, pp. 301–321, 2008.

[6] W. Dai and O. Milenkovic, "Subspace pursuit for compressed sensing: Closing the gap between performance and complexity," IEEE Transactions on Information Theory, vol. 55, no. 5, pp. 2230–2249, 2009.

[7] T. Blumensath and M. E. Davies, "Iterative hard thresholding for compressed sensing," Applied and Computational Harmonic Analysis, vol. 27, no. 3, pp. 265–274, 2009.

[8] M. Mishali and Y. C. Eldar, "Blind multi-band signal reconstruction: Compressed sensing for analog signals," IEEE Transactions on Signal Processing, vol. 57(3), pp. 993–1009, 2009.
[9] D. Goldfarb and S. Ma, "Convergence of fixed point continuation algorithms for matrix rank minimization," arXiv:0906.3499v3, 2010.

[10] R. Baraniuk, V. Cevher, M. Duarte, and C. Hegde, "Model-based compressive sensing," IEEE Transactions on Information Theory, vol. 56(4), pp. 1982–2001, 2010.

[11] T. Blumensath, "Compressed Sensing with non-linear observations," http://eprints.soton.ac.uk/164753/, Oct. 2010.

[12] W. Xu, M. Wang, and A. Tang, "Sparse recovery from nonlinear measurements with applications in bad data detection for power networks," arXiv:1112.6234v1, 2011.

[13] S. Bahmani, B. Raj, and P. Boufounos, "Greedy sparsity-constrained optimization," arXiv:1203.5483v1, 2012.

[14] T. Blumensath, "Sampling and reconstructing signals from a union of subspaces," IEEE Transactions on Information Theory, vol. 57(7), pp. 4660–4671, 2011.

[15] A. Kyrillidis and V. Cevher, "Combinatorial selection and least absolute shrinkage via the CLASH algorithm," preprint, 2011.

[16] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Kluwer Academic Publishers, 2000.

[17] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu, "A unified framework for the analysis of regularized M-estimators," in Advances in Neural Information Processing Systems, Vancouver, Canada, December 2009.