Draft: to be submitted to the IEEE Transactions on Information Theory in September 2002
Maximal Sparsity Representation via l1 Minimization
David L. Donoho∗ and Michael Elad†
August 15, 2002
Abstract
Finding a sparse representation of signals is desired in many applications. For a representation dictionary D and a given signal S ∈ span{D}, we are interested in finding the sparsest vector γ such that Dγ = S. Previous results have shown that if D is composed of a pair of unitary matrices, then under some restrictions dictated by the nature of the matrices involved, one can find the sparsest representation by minimizing the l1 norm of the representation rather than its l0 norm. Obviously, such a result is highly desirable, since it leads to a convex Linear Programming form. In this paper we extend previous results and prove a similar relationship for the most general dictionary D. We also show that previous results emerge as special cases of the new, extended theory. In addition, we show that the above results can be markedly improved if an ensemble of such signals is given and higher order moments are used.
Keywords: Sparse Representation, Atomic Decomposition, Convex Optimization, Linear Programming, Basis Pursuit, Matching Pursuit.
∗ Department of Statistics, Stanford University, Stanford 94305-9025 CA, USA.
† Department of Computer Science (SCCM), Stanford University, Stanford 94305-9025 CA, USA.
1 Introduction
A sparse representation of a signal is an efficient description of it that can be used for its analysis or compression [1]. However, far deeper reasons lead to the search for sparse representations of signals. As it turns out, one of the most natural and effective priors in Bayesian theory for signal estimation is the existence of a sparse representation over a suitable dictionary. This prior leans on the assumption that the ground-truth representation of the signal is expected to be simple, and thus sparse in some representation space [1]. Indeed, it is sparsity that led to the vast theoretical and applied work on Wavelet theory [1].

More formally, we are given a representation dictionary D defined as a matrix of size [N × L]. We assume throughout that the columns of D, denoted {d_k}_{k=1}^{L}, are normalized, i.e., ∀ 1 ≤ k ≤ L, d_k^H d_k = 1. These columns are to be used to represent incoming signals S ∈ span{D} ⊆ C^N. Note that we do not claim any relationship between N and L; in particular, N may be larger than L, implying that the proposed representation space is not complete.

Given a signal vector S, we are interested in finding the sparsest vector γ such that Dγ = S. This process is commonly referred to as atomic decomposition, since we decompose the signal S into its building atoms, taken from the dictionary. The emphasis here is on finding a decomposition that uses as few atoms as possible. Thus, we arrive at the following optimization problem:
(P_0)    Minimize ‖γ‖_0    subject to    S = Dγ.        (1)
Obviously, two easy-to-solve special cases are the case of a unique solution to Dγ = S and the case of no feasible solution at all. While both these cases lead to an easy-to-solve (P_0), in general the solution of (P_0) requires a combinatorial search through all the combinations of columns from D, and as such its complexity grows exponentially with L. Thus, we are interested either in an approximation of the (P_0) solution or, better yet, in a numerical shortcut leading to its exact solution. Matching Pursuit (MP) [1, 2] and Basis Pursuit (BP) [3] are two different methods for achieving this simplification. In MP and related algorithms, a sequential sub-optimal representation is sought using a greedy algorithm; as such, this family of algorithms leads to an approximation of the (P_0) solution. A numerically more involved approach, which in some cases leads to the exact solution of (P_0), is the BP algorithm. BP suggests solving (P_0) by replacing it with a related problem (P_1), defined by
(P_1)    Minimize ‖γ‖_1    subject to    S = Dγ.        (2)
As can be seen, the penalty is replaced by an l1 norm (sum of absolute values). As such, (P_1) is a convex programming problem, implying that we expect no local-minima problems in its numerical solution. Actually, a well-known result from optimization theory shows that an l1 minimization can be solved using a Linear Programming procedure [3, 4, 5]. Recent results in numerical optimization and the introduction of interior point methods turn the above-described problem into a practically solvable one, even for very large dictionaries. A most interesting and surprising result due to Donoho and Huo [6] is that the solution of (P_1) in some cases coincides with the (P_0) one. Donoho and Huo assumed a specific structure
of D, built by concatenating two unitary matrices Φ and Ψ, of size N × N each, thus giving L = 2N. For this specific dictionary form, they developed conditions for the equivalence between the (P_0) and (P_1) solutions. These conditions were expressed in terms of the involved dictionary D (more accurately, in terms of Φ and Ψ). Later these conditions were improved by Elad and Bruckstein to show that the equivalence actually holds for a wider class of signals [7, 8].

In this paper we further extend the results in [6, 7, 8], and prove a (P_0)-(P_1) equivalence for the most general form of dictionaries. In order to prove this equivalence we address two questions for a given dictionary D and signal S:

1. Uniqueness: Having solved the (P_1) problem, under which conditions can we guarantee that this is also the (P_0) solution? This question is answered by generalizing the uniqueness Theorem in [6, 7, 8].

2. Equivalence: Knowing the solution of the (P_0) problem (or, actually, knowing its l0 norm), under which conditions is (P_1) guaranteed to lead to the exact same solution? This question is answered by generalizing the equivalence Theorem in [6, 7, 8].

The proposed analysis adopts a totally new line of reasoning compared to the work done in [6, 7, 8], and yet we show that all previous results emerge as special cases of this new analysis.

So far, atomic decomposition was targeted at a single given vector, finding the limitations of using (P_1) instead of (P_0) in order to decompose it into its building atoms taken from the dictionary D. This is the problem solved in [6, 7, 8] and in this paper as well. An interesting extension of the above results concerns a source generating an ensemble of random sparse-representation signals from the same dictionary, using the same stationary random rule. The questions raised are whether there is something to gain from the given multiple signals, and if so, how. As it turns out, the use of higher moments leads in this case to a formulation similar to (P_0), and again a similar (P_1) form replaces it as a tractable alternative. We show that, indeed, similar relations between (P_0) and (P_1) hold, with far weaker conditions due to the increased dimensionality, implying that fewer restrictions are needed in order to guarantee the desired (P_0)-(P_1) equivalence.

This paper is organized as follows: In the next Section we briefly review the main results from [6, 7, 8] on the uniqueness and equivalence Theorems for the two-unitary-matrices dictionary. Section 3 then extends the uniqueness Theorem to an arbitrary dictionary. Section 4 similarly extends the equivalence results to general dictionaries. The idea of using an ensemble of signals and higher moments for accurate sparse decomposition is covered in Section 5. We summarize and draw future research directions in Section 6.
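Before turning to the previous results, the following is a minimal sketch of the Linear Programming reformulation of (P_1) mentioned above, using the standard variable splitting γ = u − v with u, v ≥ 0. It is an illustration only: the use of SciPy's linprog, the real-valued data, and all names are assumptions rather than part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_p1(D, S):
    """Sketch: solve (P1), min ||gamma||_1 s.t. D gamma = S, as a linear program.

    Split gamma = u - v with u, v >= 0; then ||gamma||_1 = sum(u) + sum(v) at the
    optimum, and the constraint becomes [D, -D][u; v] = S (real-valued data assumed).
    """
    N, L = D.shape
    c = np.ones(2 * L)                        # objective: sum of u plus sum of v
    A_eq = np.hstack([D, -D])                 # equality constraint [D, -D][u; v] = S
    res = linprog(c, A_eq=A_eq, b_eq=S, bounds=(0, None), method="highs")
    u, v = res.x[:L], res.x[L:]
    return u - v                              # the recovered representation gamma

# Tiny usage example: a random normalized dictionary and a 2-sparse signal.
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 16))
D /= np.linalg.norm(D, axis=0)                # normalize the atoms
gamma0 = np.zeros(16)
gamma0[[3, 11]] = [1.0, -0.5]
S = D @ gamma0
# Recovery of gamma0 is not guaranteed here; the conditions under which it is
# guaranteed are the subject of this paper.
print(np.round(solve_p1(D, S), 3))
```

For complex-valued dictionaries the same idea applies after separating real and imaginary parts, or via a second-order cone formulation.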
2 Previous Results
As was said before, previous results refer to the special case where the dictionary is built by concatenating two unitary matrices, Φ and Ψ of size N × N each, giving D = [Φ, Ψ]. We define φi and ψj (1 ≤ i, j ≤ N ) as the columns of these two unitary matrices. Following [6] we define a real-positive scalar M representing the cross-correlation between these two bases by
M = sup { |⟨φ_i, ψ_j⟩| : 1 ≤ i, j ≤ N }.        (3)
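As a small numerical illustration of this definition (a hedged sketch, not part of the original text), M can be computed directly from the inner products of the two bases; for the Identity-Hadamard pair it equals 1/√N, as discussed next:

```python
import numpy as np
from scipy.linalg import hadamard

N = 16
Phi = np.eye(N)                               # spike (Identity) basis
Psi = hadamard(N) / np.sqrt(N)                # normalized Hadamard basis
# M = sup |<phi_i, psi_j>| over all pairs of columns of the two bases
M = np.max(np.abs(Phi.T @ Psi))
print(M, 1 / np.sqrt(N))                      # both values are 0.25 here
```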
Thus, given the two matrices Φ and Ψ, M can be computed, and it is easy to show [6, 8] that 1/√N ≤ M ≤ 1. The lower bound is obtained for pairs such as spikes and sines [6] or the Identity and Hadamard matrices [8]. The upper bound is obtained if at least one of the vectors in Φ also appears in Ψ. Using this definition of M, the following Theorem states the requirement on a given representation such that it is guaranteed to be the solution of the (P_0) problem:

Theorem 1 - Uniqueness: Given a dictionary D = [Φ, Ψ], given its corresponding cross-correlation scalar value M as defined in Equation (3), and given a signal S, a representation of the signal by S = Dγ is necessarily the sparsest one possible if ‖γ‖_0 < 1/M.

This Theorem's proof is given in [7, 8]. A somewhat weaker version of it, requiring ‖γ‖_0 < 0.5(1 + M^{-1}), is proven in [6]. Thus, having solved (P_1) for the incoming signal, we measure the l0 norm of its solution, and if it is sparse enough (below 1/M), we conclude that this is also the (P_0) solution. For the best cases, where 1/M = √N, the requirement translates into ‖γ‖_0 < √N.
A parallel equivalence result holds as well: Theorem 2 - Equivalence: Given a dictionary D = [Φ, Ψ] and its cross-correlation value M, if the sparsest representation of a signal satisfies ‖γ‖_0 < 0.5(1 + 1/M), then the (P_1) solution coincides with it [6]; this requirement was subsequently relaxed in [7, 8].

3 A Uniqueness Theorem for Arbitrary Dictionaries

In order to treat an arbitrary dictionary D of size N × L, we need a property of D that replaces the cross-correlation scalar M used above. We define the Spark of the dictionary, denoted σ, as the largest number of columns such that every sub-group of that many columns from D is linearly independent. Now, assume that a non-zero signal S has two different representations, S = Dγ_1 = Dγ_2. Then D(γ_1 − γ_2) = 0, so the columns of D corresponding to the support of γ_1 − γ_2 are linearly dependent, and by the definition of the Spark this support must contain more than σ entries, i.e.,

‖γ_1 − γ_2‖_0 > σ.        (4)
On the other hand, we have that ‖γ_1 − γ_2‖_0 ≤ ‖γ_1‖_0 + ‖γ_2‖_0. Thus, we get that for the two arbitrary representations found, the following must hold true:

‖γ_1‖_0 + ‖γ_2‖_0 > σ.        (5)
This inequality can be interpreted as an uncertainty law:

Theorem 3 - Uncertainty Law: Given a dictionary D and its corresponding Spark value σ, for every non-zero signal S and every pair of different representations of it, i.e., S = Dγ_1 = Dγ_2, the combined sparsity of the two representations must exceed σ, as in Equation (5).

An immediate consequence of this result is a new general uniqueness Theorem. Using Equation (5), if there exists a representation satisfying ‖γ_1‖_0 ≤ σ/2, then necessarily, due to the above Theorem, any other representation γ_2 of this signal must satisfy ‖γ_2‖_0 > σ/2, implying that γ_1 is the sparsest possible one.
Theorem 4 - New Uniqueness: Given a dictionary D, given its corresponding Spark value σ, and given a signal S, a representation of the signal by S = Dγ is necessarily the sparsest one possible if ‖γ‖_0 ≤ σ/2.

The obvious question at this point is what the relationship is between the M defined in the previous results and the newly defined notion of the Spark of a matrix. In order to explain this relationship, we present the following analysis on bounding the Spark. Note that the proposed bounds are important not only for relating the new results to the previous ones, but also because we need methods to approximate the Spark, replacing the impossible sweep through all column combinations by a tractable and computationally reasonable procedure.
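Although the exact Spark requires an exhaustive sweep over column subsets, for very small dictionaries a brute-force computation is feasible and is useful for sanity-checking Theorem 4 and the bounds developed next. The following is an illustrative sketch only (the function name and the NumPy/SciPy usage are assumptions), using the convention of Theorems 3 and 4 that σ is the largest number such that every set of σ columns is linearly independent:

```python
import numpy as np
from itertools import combinations
from scipy.linalg import hadamard

def sigma_spark(D, tol=1e-10):
    """Largest sigma such that EVERY set of sigma columns of D is linearly independent.

    Brute force (exponential in L); only meant as a sanity check on tiny dictionaries.
    """
    N, L = D.shape
    for size in range(1, L + 1):
        for idx in combinations(range(L), size):
            if np.linalg.matrix_rank(D[:, list(idx)], tol=tol) < size:
                return size - 1              # a linearly dependent subset of this size exists
    return L                                 # all columns together are linearly independent

# Example: Identity plus normalized Hadamard with N = 4.
N = 4
D = np.hstack([np.eye(N), hadamard(N) / np.sqrt(N)])
print(sigma_spark(D))                        # prints 3, i.e. 2*sqrt(N) - 1
```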
3.1 Bounding the Spark from Below
Let us build the Gramian matrix of our dictionary, G = D^H D. Every entry in G is an inner product of a pair of columns from the dictionary; the main diagonal contains exact 1's due to the normalization of D's columns, and all entries outside the main diagonal are, in the general case, complex values with magnitude smaller than or equal to 1. If the Spark is known to be σ, then any leading minor of G of size σ × σ must be positive definite [9]. This reasoning works the other way around as well: if every σ × σ leading minor is guaranteed to be positive definite, then the Spark is at least σ. The problem is that we do not want to sweep through all combinations of columns from D, nor do we want to check every possible σ × σ leading minor of the Gramian matrix. Instead, we use the well-known Gersgorin Disk Theorem [9], or better yet, its special case:
the property that every strictly diagonally dominant matrix (being Hermitian with a positive diagonal, as is the case here) must be positive definite. A matrix is strictly diagonally dominant if, for each row, the sum of the absolute values of the off-diagonal entries is strictly smaller than the main diagonal entry. In our case, if every σ × σ leading minor is strictly diagonally dominant, then these minors are positive definite, and thus the Spark of D is at least σ.

Using the above rule, let us search for the most problematic set of column vectors from the dictionary. By problematic we refer to the set that tends to create the smallest possible diagonally non-dominant leading minor. Thus, if we take the Gramian matrix G and perform a decreasing rearrangement of the absolute entries in each row, the first column contains all 1's (taken from the main diagonal), and as we move from left to right in each row we see a decrease in magnitude. Computing the cumulative sum of each such row, excluding the first entry, let us define P_k as the smallest number of entries in the k-th row whose cumulative sum reaches 1, and assume that we computed P = Min_{1≤k≤L} P_k. Then, clearly, every leading minor of size P × P must be diagonally dominant, whereas among the minors of size (P + 1) × (P + 1) at least one is expected to be diagonally non-dominant. Thus, P is a lower bound on the actual Spark of D. The process we have just described is exactly the method to find the 'most problematic' set of columns from D, and in this way to bound the matrix's Spark. To summarize, we have the following Theorem:

Theorem 5 - Lower-Bound on the Spark:
Given the dictionary D and its corresponding Gramian matrix G = D^H D, apply the following stages:
1. Perform a decreasing rearrangement of the absolute values |G| in each row;
2. Compute the cumulative sum of each such row, excluding the first entry;
3. Compute P_k, the smallest number of entries in the k-th row whose cumulative sum reaches 1; and
4. Find P = Min_{1≤k≤L} P_k.
Then, σ = Spark(D) ≥ P.

A special case of interest to us is obtained if we simply assume that ∀ i ≠ j, |G_{i,j}| ≤ M. Note the resemblance of this M to the M defined in Equation (3) and originally in [6]. Then, clearly, for an arbitrary leading minor of size (P + 1) × (P + 1) we must require P·M ≥ 1 in order to allow it to become diagonally non-dominant. Thus we get σ = Spark(D) ≥ P ≥ 1/M, and therefore:

Theorem 6 - Lower-Bound on the Spark (special case 1): Given the dictionary D and its corresponding Gramian matrix G = D^H D, define M as the upper bound on the off-diagonal entries of G, i.e., ∀ i ≠ j, |G_{i,j}| ≤ M. Then the following relationship holds: σ = Spark(D) ≥ 1/M.

Using this new simple bound and plugging it into Theorem 4, we get exactly the uniqueness Theorem as stated by Donoho and Huo [6]: if a proposed representation has fewer than 0.5(1 + 1/M) non-zeros, it is necessarily the sparsest one possible. Note that, although they look different, the requirements ‖γ‖_0 < 0.5(1 + 1/M) and ‖γ‖_0 ≤ 0.5/M are equivalent since ‖γ‖_0 is an integer.

As an example of this last result, if we return to the special case where D = [Φ, Ψ], Φ = I_N and Ψ = F_N, then we have that M = 1/√N, and thus σ = Spark(D) ≥ √N. Remember that we claimed that for this case σ = 2√N − 1, so clearly the new bound should be improved.
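The Theorem 5 procedure is easy to implement directly. The following is a hedged sketch (the function name, the SciPy import, and the test dictionary are illustrative assumptions rather than part of the paper):

```python
import numpy as np
from scipy.linalg import hadamard

def spark_lower_bound(D):
    """Lower bound on the Spark of D, following the Theorem 5 procedure (sketch).

    For each row of |G| = |D^H D|, sort the off-diagonal magnitudes in decreasing
    order and find the smallest number of them whose sum reaches 1; the minimum of
    these counts over the rows bounds the Spark from below.
    """
    G = np.abs(D.conj().T @ D)
    L = G.shape[0]
    P = L
    for k in range(L):
        row = np.sort(np.delete(G[k], k))[::-1]    # off-diagonal entries, decreasing
        csum = np.cumsum(row)
        hit = np.flatnonzero(csum >= 1.0 - 1e-12)  # where the sum first reaches 1
        Pk = hit[0] + 1 if hit.size else L         # entries needed to break dominance
        P = min(P, Pk)
    return P

# Example: Identity plus normalized Hadamard, N = 16, so M = 1/4 off the blocks.
N = 16
D = np.hstack([np.eye(N), hadamard(N) / np.sqrt(N)])
print(spark_lower_bound(D))                        # prints 4, i.e. 1/M = sqrt(N)
```

For the Identity-Hadamard pair this generic procedure evaluates to 1/M, matching Theorem 6 rather than the tighter two-ortho bound derived next, since it does not exploit the zero pattern of G.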
An interesting question that emerges is why we did not get the better 1/M result, as stated in [7, 8] and as appears in Theorem 1. Answering this question may also lead to the improvement mentioned for the above example. It turns out that if we plug in the fact that the dictionary is built of two unitary matrices, D = [Φ, Ψ], the Gramian matrix contains many zeros corresponding to the orthogonality of the columns within Φ and within Ψ, and in this case the bound can be further improved. In an attempt to compose the 'worst' (P + 1) × (P + 1) leading minor, it is not hard to see that we need to take half of the vectors from Φ and half from Ψ (let us conveniently assume that P is odd). Then, in each row there are (P − 1)/2 exact zeros, (P + 1)/2 values assumed to be at most M in absolute value, and one unit entry corresponding to the main diagonal. Thus, in this special case, diagonal non-dominance requires (P + 1)/2 · M ≥ 1, leading to P ≥ 2/M − 1. Using Theorem 5 we get σ = Spark(D) ≥ P ≥ 2/M − 1. Thus,

Theorem 7 - Lower-Bound on the Spark (special case 2):
Given the specific dictionary D = [Φ, Ψ], where Φ and Ψ are both unitary N × N matrices, and assuming that its corresponding Gramian matrix G = D^H D satisfies ∀ i ≠ j, |G_{i,j}| ≤ M, we have that σ = Spark(D) ≥ 2/M − 1.

Note that, again, using this result and plugging it into Theorem 4, we get exactly the result in Theorem 1 as taken from [7, 8] (and again, the difference in appearance is caused by replacing the [≥] sign by a [>] one). Returning again to the example with D = [I_N, F_N], we know that M = 1/√N, and thus σ = Spark(D) ≥ 2√N − 1. On the other hand, we have seen that there exists a set of 2√N columns that are linearly dependent, and therefore we conclude that σ = 2√N − 1 in this case.
So, in this example, the proposed lower bound on the Spark in Theorem 7 is actually tight. Returning to the general dictionary form, we should ask how close the found bound is to the actual Spark. As it turns out, this bound is rather loose, which typically means that σ = Spark(D) ≫ 1/M. This gap is not surprising if we bear in mind that requiring diagonal dominance for positive definiteness is a highly restrictive approach, and it is commonly known that Gersgorin disks are far too loose as bounds on eigenvalue locations [9]. This is why we turn to methods that bound the Spark from above, as discussed hereafter.
3.2 Bounding the Spark from Above
Let us propose a presumably practical method for finding the matrix Spark. Define the following sequence of optimization problems for k = 1, 2, ..., L:
(R_k)    Minimize ‖U‖_0    subject to    DU = 0,  U_k = 1.        (6)
If the Spark value σ is achieved by a set of columns from D containing the first column, then the solution of (R_1) necessarily satisfies Min ‖U‖_0 = σ. Similarly, by sweeping through k = 1, 2, ..., L we guarantee that the solution with the minimal l0 norm is exactly the matrix's Spark. That is to say, if we denote the solution of the (R_k) problem as U_k^opt, then we get
σ = Spark(D) = Min_{1≤k≤L} ‖U_k^opt‖_0.        (7)
However, as we know by now, minimization of the l0 norm is notoriously hard. Thus, in the spirit of the Basis Pursuit approach discussed in this paper, let us replace the minimization of the l0 norm by the more convenient l1 norm, and define the following sequence of optimization problems for k = 1, 2, ..., L:
(Q_k)    Minimize ‖V‖_1    subject to    DV = 0,  V_k = 1.        (8)
This time we have a set of convex programming problems that we can solve using a Linear Programming solver. Let us define the solution of the (Q_k) problem as V_k^opt. Then, clearly,
‖U_k^opt‖_0 ≤ ‖V_k^opt‖_0    =⇒    σ = Spark(D) ≤ Min_{1≤k≤L} ‖V_k^opt‖_0.        (9)
So, let us recap the above discussion in the following new Theorem on bounding the Spark:

Theorem 8 - Upper-Bound on the Spark: Given the dictionary D, apply the following stages:
1. Solve the sequence of L optimization problems defined as (Q_k), and denote their corresponding solutions by V_k^opt;
2. Compute the l0 norm of the found solutions V_k^opt and find the smallest one, denoted ‖V‖_0^min.
Then, σ = Spark(D) ≤ ‖V‖_0^min.

It is interesting to note that in order to make a statement regarding the relation between the (P_0) and (P_1) solutions with a dictionary D of size N × L, we find ourselves required to use the (P_0) and (P_1) relation on dictionaries of size N × (L − 1). Is there some sort of recursiveness that could be exploited? We leave this as an open question.
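Each (Q_k) is an ordinary linear program after the usual splitting V = u − v with u, v ≥ 0. The following sketch of the Theorem 8 procedure is illustrative only; the function name, the use of SciPy's linprog, and the support-counting threshold are assumptions rather than part of the paper:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.linalg import hadamard

def spark_upper_bound(D, tol=1e-8):
    """Upper bound on the Spark via the Theorem 8 procedure (illustrative sketch).

    For each k, solve (Q_k): min ||V||_1 s.t. DV = 0, V_k = 1, using the usual
    variable splitting V = u - v, and return the smallest l0 norm encountered.
    Returns inf if D has an empty null space (every (Q_k) is infeasible).
    """
    N, L = D.shape
    best = np.inf
    for k in range(L):
        c = np.ones(2 * L)                           # ||V||_1 = sum(u) + sum(v)
        A_eq = np.vstack([np.hstack([D, -D]),        # rows enforcing D V = 0
                          np.zeros(2 * L)])          # one extra row for V_k = 1
        A_eq[-1, k], A_eq[-1, L + k] = 1.0, -1.0
        b_eq = np.concatenate([np.zeros(N), [1.0]])
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        if res.success:
            V = res.x[:L] - res.x[L:]
            best = min(best, int(np.sum(np.abs(V) > tol)))
    return best

# Example: Identity plus normalized Hadamard with N = 4; the smallest linearly
# dependent set has 2*sqrt(N) = 4 columns, and the value returned is at least that.
N = 4
D = np.hstack([np.eye(N), hadamard(N) / np.sqrt(N)])
print(spark_upper_bound(D))
```

Note that the returned value is only an upper bound: the l1 solutions V_k^opt need not be the sparsest null-space vectors, so it may exceed the true value.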
4 An Equivalence Theorem for Arbitrary Dictionaries
In the previous Section we focused on extending the uniqueness Theorem to an arbitrary dictionary D. Here we similarly extend the equivalence Theorem from [6, 8] to such general dictionaries. The question we now focus on is: if a signal S has a sparse representation γ in the dictionary D, what are the conditions under which solving the (P_1) optimization problem leads to this solution as well?

Assume that the sparsest representation has been found and is denoted γ_0, so that Dγ_0 = S. Assume also that a second representation γ_1 is found, i.e., Dγ_1 = S, with ‖γ_1‖_0 > ‖γ_0‖_0. In order for (P_1) to lead to the γ_0 solution, we must have that ‖γ_1‖_1 ≥ ‖γ_0‖_1; that is to say, the sparsest solution γ_0 must also be the "shortest" in the l1 metric. In addition, since Dγ_0 = S = Dγ_1, we have D(γ_1 − γ_0) = Dx = 0. So, let us define the following optimization problem:
Minimize ‖γ_1‖_1 − ‖γ_0‖_1 = Σ_{k=1}^{L} |γ_0(k) + x(k)| − Σ_{k=1}^{L} |γ_0(k)|    subject to    Dx = 0.        (10)
If the value of the penalty function at the minimum is positive, it implies that the l1 norm of the denser representation is higher than that of the sparse solution. This in turn means that the (P_1) problem leads to the sparsest solution γ_0, as required. Since the optimization problem in Equation (10) is difficult to work with, following [6, 8], we
perform several simplification stages, while guaranteeing that the minimum value of the penalty function only gets smaller. First, we split the summation into the on-support and off-support parts of the sparse representation γ_0:

Σ_{k=1}^{L} |γ_0(k) + x(k)| − Σ_{k=1}^{L} |γ_0(k)| = Σ_{off support of γ_0} |x(k)| + Σ_{on support of γ_0} ( |γ_0(k) + x(k)| − |γ_0(k)| ).
Using |v + m| ≥ |v| − |m| we have that |v + m| − |v| ≥ |v| − |m| − |v| = −|m| and thus
Σ_{off support of γ_0} |x(k)| + Σ_{on support of γ_0} ( |γ_0(k) + x(k)| − |γ_0(k)| )
    ≥ Σ_{off support of γ_0} |x(k)| − Σ_{on support of γ_0} |x(k)|
    = Σ_{k=1}^{L} |x(k)| − 2 · Σ_{on support of γ_0} |x(k)|.
So, if we replace the optimization problem in Equation (10) with

Minimize  Σ_{k=1}^{L} |x(k)| − 2 · Σ_{on support of γ_0} |x(k)|    subject to    Dx = 0,        (11)
then obviously the minimum value of the penalty function can only get lower, and if it is still above zero, it implies that solving (P_1) leads to the proper sparsest solution. Following [8], we add a constraint in order to avoid the trivial solution x = 0, which corresponds to the case where the two representations are the same. The new constraint 1^T |x| = 1 implies that the sum of the absolute entries of x is required to be 1. Thus, Equation (11) becomes
Minimize  1 − 2 · 1_{γ_0}^T |x|    subject to    Dx = 0  and  1^T |x| = 1.        (12)
Note that in the new formulation we used a slightly different notation: the vector 1_{γ_0} is a binary vector of length L obtained by putting 1's where the condition γ_0 ≠ 0 holds and 0's elsewhere. Looking at Equation (12), we see that both x and |x| appear in it, which complicates its solution. Let us therefore replace the constraint Dx = 0 with a weaker requirement posed on the vector |x|. If a feasible solution satisfies Dx = 0, then it must also satisfy the weaker condition D^H Dx = Gx = 0, where G is the Gramian matrix already used in the previous Section to bound the Spark. Thus we have that
Gx = 0  =⇒  (G − I + I)x = 0  =⇒  −x = (G − I)x  =⇒  |x| = |(G − I)x| ≤ |G − I| · |x|.        (13)
The matrix (G − I) is the Gramian matrix with its main diagonal nulled. If we take this new constraint and plug it into Equation (12) instead of the original Dx = 0, we get
Minimize  1 − 2 · 1_{γ_0}^T z    subject to    { |G − I| · z ≥ z,  1^T z = 1,  z ≥ 0 }.        (14)
Note that we have defined z = |x|. In order to further simplify the problem and obtain simple requirements for this optimization problem to give a positive value at the minimum, we assume that the off-diagonal entries of G satisfy ∀ i ≠ j, |G_{i,j}| ≤ M. Thus the constraint |G − I| · z ≥ z is replaced by
z ≤ |G − I| z ≤ M · (1 − I)z,
where I is the L × L identity matrix and 1 is an L × L matrix with all entries equal to 1. Using the fact that 1·z is the all-ones vector (since 1^T z = 1), we get that the constraint becomes
z ≤ M · (1 − I)z = M·1 − M·z    =⇒    z ≤ (M / (1 + M)) · 1.
Going back to our minimization task as written in Equation (14), the minimum value is obtained by assuming that on the support of γ_0 all the z(k) values are exactly z(k) = M/(1 + M), and then the penalty term becomes
1 − 2 · 1_{γ_0}^T z = 1 − (2M / (1 + M)) · ‖γ_0‖_0,
and therefore we require
1 − (2M / (1 + M)) · ‖γ_0‖_0 ≥ 0    =⇒    ‖γ_0‖_0 ≤ (1/2) · (1 + 1/M).
To summarize, we have the following result:

Theorem 9 - New Equivalence: Given a dictionary D, given its corresponding M value extracted from the Gramian matrix G = D^H D, and given a signal S, if the sparsest representation of the signal by S = Dγ_0 satisfies ‖γ_0‖_0 ≤ 0.5(1 + 1/M), then the (P_1) solution is guaranteed to find this sparse representation.

The above Theorem poses the same requirement as the one posed by Donoho and Huo [6] in their equivalence Theorem. However, there is a basic and major difference between the two results: the new result does not assume any structure on the dictionary, whereas Donoho and Huo assumed the two-unitary-matrices dictionary. As in the uniqueness case, if we assume that the dictionary is composed of two unitary matrices concatenated together, this may lead to an improvement of the bound towards 1/M. We skip this analysis because major parts of it are exactly the same as those described in [8]. A question that remains unanswered at this stage is whether we can prove a more general equivalence result without using M, but rather using the notion of the Spark of the dictionary.
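A quick numerical check of Theorem 9 is straightforward. The following hedged sketch repeats the linear-programming (P_1) reformulation from the Introduction; the Identity-Hadamard pair is used only because its M value is known, and all names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.linalg import hadamard

def solve_p1(D, S):
    """(P1) via linear programming, using the splitting gamma = u - v, u, v >= 0."""
    N, L = D.shape
    res = linprog(np.ones(2 * L), A_eq=np.hstack([D, -D]), b_eq=S,
                  bounds=(0, None), method="highs")
    return res.x[:L] - res.x[L:]

rng = np.random.default_rng(1)
N = 64
D = np.hstack([np.eye(N), hadamard(N) / np.sqrt(N)])   # normalized two-ortho dictionary
L = D.shape[1]

G = D.T @ D
M = np.max(np.abs(G - np.eye(L)))                       # off-diagonal bound: 1/8 here
k_max = int(np.floor(0.5 * (1 + 1 / M)))                # Theorem 9 guarantee: 4 atoms

gamma0 = np.zeros(L)
support = rng.choice(L, size=k_max, replace=False)
gamma0[support] = rng.standard_normal(k_max)
S = D @ gamma0

gamma_hat = solve_p1(D, S)
print(k_max, np.max(np.abs(gamma_hat - gamma0)))        # error should be near zero
```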
5 Sparse Representation of Random Ensemble of Signals
So far we have assumed that only one signal is given to us, and we seek its decomposition into a set of building atoms using our knowledge that it is a sparse linear combination of the dictionary columns. Assume now that the source generating this signal is activated infinitely many times, creating an infinite sequence of signals {S_k}_{k=1}^∞. If we assume that the source uses the same probabilistic law for generating the representation coefficients in all instances, then this should somehow serve to improve our ability to decompose the incoming signals into their building atoms. More specifically, let us assume that the source draws the representation coefficients for all these
instances, {S_k}_{k=1}^∞ = {Dγ_k}_{k=1}^∞, from the same L distribution laws {P_j(α)}_{j=1}^L in an independent and identically distributed manner. Thus, each coefficient is generated using a different statistical rule. We further assume that F of the coefficients are capable of taking non-zero values, while the remaining L − F coefficients are exactly zero with probability 1. Thus, we may claim that each signal is a sparse composition of atoms over the same F elements, but with varying coefficients due to the probabilities {P_j(α)}_{j=1}^L.

Clearly, given the signal S_1, we can apply the previous results and seek its sparse representation using the (P_1) solution, provided that F, the number of non-zero entries in the representation, is low enough. If another signal S_2 is given as well, then apart from our knowledge that it also has a sparse representation, we know that the non-zeros in the two representations are expected to appear in the same locations. This is new and powerful knowledge that we seek to exploit. Here we use higher moments to achieve this gain, and thus we need an infinitely long sequence of signals. Let us define the mean and variance values per each representation coefficient as
m_j = ∫ α · P_j(α) dα ,    σ_j² = ∫ (α − m_j)² · P_j(α) dα ,    j = 1, ..., L.
Thus, the mean and covariance of the representation vector γ are given by
E{γ} = [ m_1  m_2  ...  m_L ]^H = M,        (15)
E{γγ^H} = diag( σ_1², σ_2², ..., σ_L² ) + M·M^H = Σ + M·M^H.        (16)
Using these definitions, the mean and covariance of the incoming signals are given by
E{S} = E{Dγ} = D·E{γ} = D·M,
E{S S^H} = E{D γ γ^H D^H} = D·E{γγ^H}·D^H = D·(Σ + M·M^H)·D^H.
Since we gather an infinitely long sequence of signals {S_k}_{k=1}^∞, and since the process described here is ergodic, we may obtain the above mean and covariance pair by computing the estimates
E{S} = D·M = lim_{n→∞} (1/n) Σ_{k=1}^{n} S_k ,
E{S S^H} = D·(Σ + M·M^H)·D^H = lim_{n→∞} (1/n) Σ_{k=1}^{n} S_k S_k^H .
Now recall that, according to our assumptions, only F of the diagonal entries of Σ are expected to be non-zero, and all remaining L − F are exactly zero. Thus, after removal of the rank-one mean term D·M·M^H·D^H (based on the estimated first moment), the covariance matrix of the signal is expected to be of rank F exactly. This, by the way, leads to a good cleaning method for the estimated covariance in cases of insufficient measurements, i.e., applying an SVD to the estimated DΣD^H and nulling the last (and thus smallest) L − F singular values [10].

So, at this stage we are capable of computing, from the incoming signals, the rank-F matrix DΣD^H. A slightly different way to write this matrix is
DΣD^H = Σ_{k=1}^{L} σ_k² · d_k d_k^H .        (17)
However, our goal is to find the building atoms of the measured signals, i.e., the indices where the σ_j are non-zero. Whereas the SVD can reduce the matrix rank [10], it cannot map the remaining F rank-one terms to the columns of the dictionary, and thus cannot be used to solve our problem. Instead, we suggest the following optimization problem:
Minimize ‖α‖_0    subject to    DΣD^H = Σ_{k=1}^{L} α_k · d_k d_k^H .        (18)
Since the sparsest solution has exactly F non-zeros, we are expecting to obtain this result from the above problem. As before, replacing this l0 optimization problem by an l1 one we should solve instead
Minimize ‖α‖_1    subject to    DΣD^H = Σ_{k=1}^{L} α_k · d_k d_k^H ,        (19)
and all the previous Theorems on uniqueness and equivalence hold as well. Note that if we refer to the rank-one terms as vectors rather than matrices, the formulation is exactly the same as before: the dictionary in this case is built from the outer product of each atom with itself, and is thus of size N² × L. Therefore, if for the one-signal problem we used the scalar M_single, defined as M_single = sup_{1≤i≠j≤L} |d_i^H d_j|, then the new M_multiple becomes

M_multiple = sup_{1≤i≠j≤L} |(d_i ⊗ d_i)^H (d_j ⊗ d_j)| = sup_{1≤i≠j≤L} |d_i^H d_j|² = M_single².        (20)
Since M is a value smaller than 1, M_multiple = M_single² is smaller still, and thus all the sparseness requirements in the developed Theorems improve markedly. As an example, for the dictionary composed of the Identity and the Hadamard unitary matrices with N = 256, it is easy to show that M_single = 1/√N = 1/16. Thus, using Theorem 9, if a signal has a representation with fewer than 0.5(1 + 1/M_single) = 8.5 atoms, (P_1) will find this representation. Now, if a sequence of such signals is obtained, then as long as the signals are composed of fewer than 0.5(1 + 1/M_multiple) = 128.5 atoms, a correct decomposition can be found using (P_1).

Before we conclude this topic of how to exploit the existence of multiple signals, several comments are in order:

• So far we have seen two extremes - one signal or infinitely many signals. A more practical question is how to exploit the support overlap in the multiple-signal case if only a finite set of signals is given to us. This question remains unanswered at the moment.

• Notice that in the above two optimization problems we conveniently chose not to add a constraint forcing positivity on the unknowns, even though their origin as variances would justify it. As it turns out, this additional constraint does not impact the correctness of the Theorems obtained in this paper, and thus we are allowed to use them as they are. Of course, from the practical point of view, adding the positivity constraint should improve the conditioning of the optimization problems involved and thereby stabilize their solutions.
• A vital analysis of inaccurate cases is missing, both here and in the single-signal case. An extension of the obtained results to the case where the representation is an approximate one is in order. This is especially important in the multiple-signal setting, where an estimated covariance matrix is used. Again, we leave these questions open at the moment.

• The improvement found in using the large ensemble of signals comes from the fact that the representation coefficients are statistically independent. Actually, a K-th order moment (e.g., K = 3, 4, ...) could be used just as well, leading to a better bound 0.5(1 + 1/M_multiple), since then we have M_multiple = M_single^K. This leads to the conclusion that, in the case of an infinite amount of measured signals, (P_1) can be applied successfully as a replacement for (P_0) for all signals, provided that a sufficiently high-order moment is used.
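To tie the pieces of this Section together, here is a hedged end-to-end sketch for the Identity-Hadamard pair with N = 16 (all names and sizes are illustrative assumptions; in line with the infinite-ensemble assumption, the exact mean-removed second moment is used in place of a sample estimate):

```python
import numpy as np
from scipy.optimize import linprog
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
N, F = 16, 8                                     # signal length, number of active atoms
D = np.hstack([np.eye(N), hadamard(N) / np.sqrt(N)])
L = D.shape[1]

# Ground truth: F non-zero variances on a common support, zeros elsewhere.
support = np.sort(rng.choice(L, size=F, replace=False))
variances = np.zeros(L)
variances[support] = 0.5 + rng.random(F)

# The Section assumes an infinite ensemble, so the exact mean-removed second
# moment D Sigma D^H is used here instead of a finite-sample estimate.
R = D @ np.diag(variances) @ D.T

# Lifted dictionary: every atom becomes the flattened outer product d_k d_k^H,
# i.e. an N^2 x L dictionary, and problem (19) becomes an ordinary l1 problem.
A = np.column_stack([np.outer(D[:, k], D[:, k]).ravel() for k in range(L)])

# Solve (19) as a linear program via the usual splitting alpha = u - v, u, v >= 0.
res = linprog(np.ones(2 * L), A_eq=np.hstack([A, -A]), b_eq=R.ravel(),
              bounds=(0, None), method="highs")
alpha = res.x[:L] - res.x[L:]
print(support)
print(np.flatnonzero(np.abs(alpha) > 1e-6))      # should list the same F indices
```

For this configuration M_single = 1/4, so the single-signal bound of Theorem 9 guarantees only 2 atoms, whereas M_multiple = 1/16 raises the guarantee to 8 atoms, which is the sparsity used in the sketch.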
6 Summary and Conclusions
This work addresses the problem of decomposing a signal into its building atoms. We assume that a dictionary of the atoms is given, and we seek the sparsest representation. The Basis Pursuit method [3] proposes to replace the minimization of the l0-norm of the representation coefficients with the l1-norm, leading to a solvable problem. Later contributions [6, 7, 8, 12] provided the theoretical background for such a replacement, proving that, indeed, a sufficiently sparse representation is unique, and also proving that for sufficiently sparse representations there is an equivalence between the l0-norm and the l1-norm minimizations. However, all these theoretical results were based on the assumption that the dictionary is composed of a pair of unitary bases.

In this paper we propose extensions to the uniqueness and equivalence results mentioned
above, and treat the most general dictionary case. We show that similar theorems hold true for any dictionary. A basic tool used in order to prove these theorems is the Spark of a matrix. We bound this value from both below and above in order to evaluate it in practice. Another contribution of this paper is the decomposition of multiple signals generated by the same statistical source. We show that, using the above understanding, far better results are achieved when higher order moments are used.

Open questions for future research:

• Our equivalence Theorem for the general dictionary case is weaker than the uniqueness Theorems that are based on the Spark of the dictionary matrix. We did not find a parallel result to the σ/2 uniqueness result, nor did we find a bound using the ordering method described in Theorem 5. Further work is needed here in order to improve our results.

• We found ways to bound the Spark of a matrix from below and from above. Are there better ways to compute or bound the Spark? Is there a way to exploit the order-reduction property we found in the upper bound on the Spark? Further work is required in order to establish better methods to compute the Spark.

• The multiple-signals case was solved only for an infinite amount of measurements, building on the estimation of moments. A similar result should be obtained for the case of a finite number of signals, using a deterministic approach rather than a statistics-based one.

• All the results in this paper should be extended to the case of approximate representation, where a bounded error is allowed in the equation Dγ = S. We expect all the results to hold as well, and somehow improve as a function of the allowed error norm.
References

[1] S. Mallat, A Wavelet Tour of Signal Processing, Second Edition, Academic Press, 1998.

[2] S. Mallat and Z. Zhang, Matching Pursuit with Time-Frequency Dictionaries, IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397-3415, December 1993.

[3] S.S. Chen, D.L. Donoho, and M.A. Saunders, Atomic Decomposition by Basis Pursuit, SIAM Review, vol. 43, no. 1, pp. 129-159, January 2001.

[4] P.E. Gill, W. Murray, and M.H. Wright, Numerical Linear Algebra and Optimization, Cambridge University Press, 1985.

[5] D. Bertsekas, Non-Linear Programming, Athena Scientific, 1995.

[6] D.L. Donoho and X. Huo, Uncertainty Principles and Ideal Atomic Decomposition, IEEE Transactions on Information Theory, vol. 47, no. 7, pp. 2845-2862, November 2001.

[7] M. Elad and A.M. Bruckstein, On Sparse Representations, International Conference on Image Processing (ICIP), Thessaloniki, Greece, November 2001.

[8] M. Elad and A.M. Bruckstein, A Generalized Uncertainty Principle and Sparse Representation in Pairs of Bases, IEEE Transactions on Information Theory, 2002.