A NEW ALGORITHM FOR COMPUTING SPARSE SOLUTIONS TO LINEAR INVERSE PROBLEMS

G. Harikumar and Yoram Bresler
Coordinated Science Laboratory, University of Illinois at Urbana-Champaign
1308 W. Main St., Urbana, IL 61801
G. Harikumar: Ph. (217) 244 6384, E-Mail [email protected]
Y. Bresler: Ph. (217) 244 9660, E-Mail [email protected]
ABSTRACT
In this paper, we present an iterative algorithm for computing sparse solutions (or sparse approximate solutions) to linear inverse problems. The algorithm is intended to supplement the existing arsenal of techniques. It is shown to converge to the local minima of a function of the form used for picking out sparse solutions, and its connection with existing techniques is explained. Finally, it is demonstrated on subset selection and deconvolution examples. The fact that the proposed algorithm is sometimes successful when existing greedy algorithms fail is also demonstrated.
* This work was supported in part by the National Science Foundation grant No. MIP 91-57377, and a Schlumberger-Doll research grant.
† Yoram Bresler is on sabbatical leave at the Technion, Israel Institute of Technology, during 1995-96.

1. INTRODUCTION

Consider the problem of estimating a vector $x \in \mathbb{R}^n$ from $y \in \mathbb{R}^n$, where $y = Ax + \eta$, $A$ is a matrix and $\eta$ is an unknown noise vector. If the matrix $A$ is ill-conditioned (or non-invertible), the set $S$ defined by $S = \{x : \|Ax - y\| \le \epsilon\}$ is large even for small $\epsilon > 0$. Thus even a small perturbation in the output could lead to an unacceptably large "search space" for $x$. Under such circumstances, we have to narrow the search space in some way, and this narrowing is called "regularization". One of the most useful and intuitively attractive techniques of regularization is the use of prior knowledge about $x$. This technique limits the search to those $x$ in $S$ that satisfy the prior knowledge.

A discrete-time vector $x$ is said to be sparse if a significant fraction of its component elements are zero. One very important type of prior knowledge that is often available concerns the degree of "sparsity" of the solution $x$. The problem of obtaining sparse solutions (or sparse approximate solutions) to linear inverse problems is an important one, and arises in numerous areas of engineering and applied mathematics. A variant of this problem has been studied under the name "subset selection" by Golub and Van Loan [1]. For binary matrices, it has been studied as a "minimum weight solution" in the theory of error-correcting coding [2]. It has also been studied as the "sparse null-space problem" in the theory of non-linear optimization and the "minimum set-cover problem" in the theory of algorithms (see [3] and the references therein). Other examples are: the deconvolution of geophysical signals (where the unknown reflectivity series can be assumed to be sparse), the variable selection problem in regression (where one attempts to express a vector $y$ as a linear combination of, i.e., to "explain" $y$ by, the smallest number of columns of a matrix $A$), image restoration with a sparse-edge model (where one attempts to recover an image from a corrupted version, and the gradient of the image is assumed to be sparse), the design of digital filters with few non-zero coefficients, and sinusoid retrieval. Some existing techniques for solving this problem are greedy algorithms, relaxation techniques, and methods based on graduated non-convexity.

This problem is known to be NP-complete [3]. This is significant, since it rules out greedy algorithms and other exhaustive-search methods for large problems. Thus there exists a need for an algorithm that can, in some sense, pick the most sparse solution directly out of the feasible set. In this paper, we present a new iterative algorithm for picking out sparse elements of a convex, compact set $S$. This is a more general version of the linear inverse problem. For example, in the "subset selection" problem, $S$ is a piece of the linear variety defined by $Ax = b$. In the filter-design problem, $S$ is an ellipsoid defined by $\|Ax - b\|^2 \le \epsilon$.

2. THE ALGORITHM

The number of non-zero components of $x$ can be written as $\sum_{i=1}^{n} g(x_i)$, where

$$ g(t) = \begin{cases} 0, & t = 0, \\ 1, & t \neq 0, \end{cases} \qquad (2) $$

and the maximally sparse elements of $S$ are the solutions of

$$ \min_{x \in S} \sum_{i=1}^{n} g(x_i). \qquad (3) $$
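For very small problems, (3) can be solved by brute force, searching over supports of increasing size. The following sketch (illustrative only, with $S$ taken to be the linear variety $\{x : Ax = b\}$ and a hypothetical helper name) makes the combinatorial cost explicit.

import itertools
import numpy as np

def brute_force_sparsest(A, b, tol=1e-9):
    # Exhaustive search for the sparsest exact solution of A x = b:
    # try supports of increasing size k; the inner loop visits C(n, k) subsets.
    n = A.shape[1]
    for k in range(n + 1):
        for support in itertools.combinations(range(n), k):
            cols = list(support)
            if k == 0:
                if np.linalg.norm(b) <= tol:
                    return np.zeros(n)
                continue
            xs, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
            if np.linalg.norm(A[:, cols] @ xs - b) <= tol:
                x = np.zeros(n)
                x[cols] = xs
                return x
    return None  # A x = b has no solution

The inner loop grows as $\binom{n}{k}$, which is what makes this approach impractical beyond small $n$.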
Equation (3) is difficult to implement in practice, as the function $f(x) = \sum_{i=1}^{n} g(x_i)$ is discontinuous and has large flat regions that will confound any descent-based optimization algorithm. Many of the existing techniques for obtaining sparse solutions minimize a continuous approximation to $f$. For example, when $S$ is a subset of a linear variety, a commonly used choice of $f$ ([4, 5] and the references therein) is the $l_1$ norm. The actual minimization is done via linear programming. The basic idea can be gleaned from Figure 1. $S$ is shown as a thick line, and the diamonds drawn in dotted lines represent the isocontours of the function $|x_1| + |x_2|$. A local minimum over $S$ occurs when the first of the isocontours touches $S$. However, this technique will not pick out the other sparse solution in the example: there is no local minimum at the point where $S$ cuts the $x_1$ axis. Also, minimizing the $l_1$ norm does not work well in cases where $S$ is not a linear variety. An example is given in Figure 2. Here $S$ is an ellipsoid that is not aligned with the axes. $x_s$ is the point on $S$ that minimizes the $l_1$ norm, but neither component of $x_s$ is zero.
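The linear-programming step mentioned above is the standard split of $x$ into positive and negative parts. A minimal sketch, assuming $S = \{x : Ax = b\}$ (the helper name l1_min and the use of SciPy's linprog are our illustrative choices, not those of [4, 5]):

import numpy as np
from scipy.optimize import linprog

def l1_min(A, b):
    # Minimize ||x||_1 subject to A x = b via the standard LP reformulation:
    # x = u - v with u, v >= 0, and minimize 1'(u + v) s.t. [A, -A][u; v] = b.
    n = A.shape[1]
    c = np.ones(2 * n)
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b, bounds=(0, None))
    u, v = res.x[:n], res.x[n:]
    return u - v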
Figure 1: Figure illustrating how minimizing $\sum_i |x_i|$ over a linear variety yields sparse solutions.

Figure 2: Figure illustrating how minimizing the $l_1$ norm over $S$ need not yield sparse solutions.
A better approximation to $f$ is $\rho(x) = \sum_i \rho(x_i)$, where $\rho(r)$ is a strictly concave, monotone increasing function of $|r|$ with $\rho(0) = 0$. Again, the best way to see this is to study the isocontours of $\rho(x)$, as shown in Figures 2 and 3. For a concave $\rho$, these "bulge" inwards. In Figure 2, the global minimum of $\rho$ over the ellipse $S$ is at a sparse solution. In Figure 3, $\rho$ has multiple local minima on the linear variety $S$, at each point of intersection of $S$ with the axes.

Figure 3: Figure illustrating how minimizing $\rho(x) = \sum_i \rho(x_i)$, where $\rho(\cdot)$ is concave, over a linear variety $S$ yields multiple sparse solutions.

It would seem that if there is a local minimum at each sparse solution, the algorithm would get trapped in these spurious minima and never converge to the maximally sparse solution. But minimizing these concave functions works very well in practice. However, the problem of minimizing such a concave function is fraught with difficulties. The minimization would be relatively easier if $\rho(r)$ had been convex. But there is not much point in choosing $\rho$ to be convex, since its isocontours would bulge outwards, and the first isocontour need not necessarily meet $S$ at a sparse point. Recently, attempts to combine the best of both worlds have been made. Heuristic iterative techniques that converge to sparse points within $S$ have been proposed [6]. At each iterate $x_n$, these algorithms choose the next iterate $x_{n+1}$ as the minimum of a quadratic function $q_n(\cdot)$ over $S$. The coefficients of the quadratic are chosen such that the components of $x$ corresponding to the smallest components of $x_n$ are weighted heavily. Our algorithm, presented in Table 1, is a generalized version of such techniques.

1. Start from $x_0 \in S$.

2. Set $e_i^n = \phi(D_i(x_n))$,  (4)
   where $\phi(r)$ is a monotonic, strictly decreasing, continuously differentiable function of $|r|$ with $\phi(0) = a$ and $\phi(\infty) = 0$, as shown in Figure 4, and the $D_i$, $1 \le i \le d$, are continuously differentiable, strictly convex functions of $x$ with $D_i(0) = 0$ for all $i$.

3. Set $x_{n+1} = \arg\min_{x \in S} \sum_i e_i^n D_i(x)$, with initial point $x_n$.  (5)

4. If the convergence criterion is not met, go to step 2.

Table 1: Table explaining the algorithm for obtaining a sparse solution.
Figure 4: Figure showing the form of $\phi(\cdot)$ required for the algorithm in Table 1 to work ($\phi(0) = a$, decreasing monotonically to zero as $r \to \infty$).

Before discussing convergence, we justify the algorithm by heuristic arguments. First of all, note that at each $x_n$, the next point $x_{n+1}$ is uniquely defined and can be obtained without any numerical instabilities, since computing it just involves the minimization of a convex functional over a convex, compact set. The cost function in (5) is of the weighted sum-of-squares type (e.g., $\sum_i e_i^n x_i^2$ when $D_i(x) = x_i^2$). The $e_i^n$ are chosen such that the smaller the $D_i(x_n)$, the larger the $e_i^n$. Thus the larger entries are lightly weighted compared to the smaller ones, so the algorithm will seek a next estimate $x_{n+1}$ that has small values at those components where $x_n$ has small values. The algorithm can be classified under the broad category of relaxation techniques. Such algorithms are observed to converge quickly to sparse solutions in practice. The reason for this is the following. It turns out that the fixed points of this algorithm in $S$ correspond to the local minima of a function of the form $\sum_i \rho(x_i)$, where $\rho$ is a particular function depending on $\phi$; an appropriate choice of $\phi$ thus leads to a concave, monotonically increasing $\rho$. It is often advantageous to use this algorithm instead of trying to minimize $\sum_i \rho(x_i)$ directly using a descent-based technique. Since, at every iterate, we are minimizing a convex function over a convex set, it can be a lot faster, and closed-form solutions for the next iterate may exist, as when $S$ is a subset of a linear variety. Also, not all local minima of $\sum_i \rho(x_i)$ need be at sparse solutions, but the algorithm is defined to converge to those that are sparse. The algorithm is also related to the "alternate minimizations" algorithms that have been recently proposed [7] for edge-preserving regularization.
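For concreteness, here is a minimal sketch of one way to implement steps 2 and 3 of Table 1, assuming $D_i(x) = x_i^2$ and $S = \{x : Ax = b\}$ with $A$ of full row rank; in that case step 3 has the closed form $x_{n+1} = W A^T (A W A^T)^{-1} b$ with $W = \mathrm{diag}(1/e_i^n)$. The routine names are placeholders, and the particular $\phi$ is the one used later in Eq. (8).

import numpy as np

def phi(t, delta=1e-8):
    # A weight function with the properties required in Table 1
    # (the piecewise form used later in Eq. (8)).
    t = np.asarray(t, dtype=float)
    return np.where(t >= delta,
                    1.0 / (4.0 * np.maximum(t, delta) ** 0.75),
                    343750.0 - 9.375e20 * t ** 2)

def sparse_solve(A, b, n_iter=200, x0=None, seed=0):
    # Iterates of Table 1 for D_i(x) = x_i^2 and S = {x : A x = b}.
    # Step 3 is the weighted minimum-norm solution, computed in closed form.
    n = A.shape[1]
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n) if x0 is None else np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        e = phi(x ** 2)                                   # step 2: e_i = phi(D_i(x_n))
        W = np.diag(1.0 / e)                              # W = diag(1 / e_i)
        x = W @ A.T @ np.linalg.solve(A @ W @ A.T, b)     # step 3
    return x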
3. CONVERGENCE OF THE ALGORITHM

The following theorems and their proofs are modifications of results obtained by Delaney and Bresler [8] for a similar problem.
Theorem 1. Let $S$ be a convex and compact subset of $\mathbb{R}^d$, defined by the equations $g_i(x) \le 0$, $1 \le i \le c$. Let $x_n$ be a sequence of iterates generated by the algorithm in Table 1. Any convergent subsequence of this sequence converges to a local minimum of the functional $J$ defined by

$$ J(x) = \sum_{i=1}^{d} \psi(D_i(x)), \qquad (6) $$

where $\psi : [0, \infty) \to [0, \infty)$ is defined by

$$ \psi'(t) = \phi(t), \quad \psi(0) = 0. \qquad (7) $$
For example, if

$$ \phi(t) = \begin{cases} \dfrac{1}{4\, t^{0.75}}, & t \ge 10^{-8}, \\[4pt] 343750 - 9.3750 \times 10^{20}\, t^2, & t \le 10^{-8}, \end{cases} \qquad (8) $$

and

$$ D_i(x) = x_i^2, \qquad (9) $$

then $\psi(D_i(x)) \approx \sqrt{|x_i|}$.
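The last step can be checked by integrating (8) as prescribed by (7): for $t \ge 10^{-8}$,

$$ \psi(t) = \int_0^{t} \phi(s)\, ds = \int_0^{10^{-8}} \bigl(343750 - 9.3750 \times 10^{20} s^2\bigr)\, ds + \int_{10^{-8}}^{t} \frac{ds}{4\, s^{0.75}} = t^{1/4} + C, $$

with $C \approx -6.9 \times 10^{-3}$. Hence, for $D_i(x) = x_i^2$ and $|x_i| \ge 10^{-4}$, $\psi(D_i(x)) = \sqrt{|x_i|} + C$; the additive constant does not affect the minimizers, so $J(x)$ in (6) behaves like $\sum_i \sqrt{|x_i|}$. The two branches of (8) also match in value ($2.5 \times 10^{5}$) and in slope ($-1.875 \times 10^{13}$) at $t = 10^{-8}$, so $\phi$ is continuously differentiable, as required in Table 1.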
Theorem 2. If the local minima of $\sum_i \psi(D_i(x))$ in $S$ are isolated, i.e., if there exists $r > 0$ such that $\|z_1 - z_2\| > r$ for all local minima $z_1$ and $z_2$ of $\sum_i \psi(D_i(x))$, then any sequence of iterates of the algorithm in Table 1 converges. Thus in the above example, the algorithm converges from all starting points.
4. RESULTS

We demonstrate the algorithm with two Monte-Carlo studies.
4.1. Subset selection example
In this case, we try to compute maximally sparse solutions in S where
$$ S = \{x \in \mathbb{R}^{30} : Ax = A\hat{x},\ |x_i| \le 5,\ 1 \le i \le 30\}. \qquad (11) $$

For each of 100 trial runs, $A$ is a $15 \times 30$ matrix with random coefficients, and $\hat{x}$ is chosen to be a random vector having exactly 10 nonzero components. Note that there are several sparse solutions in $S$ having 15 zeros, but only $\hat{x}$ has more. An exhaustive search for $\hat{x}$ would involve searching over $\binom{30}{10}$ possibilities. The algorithm is applied to this problem from 200 different starting points, with the $\phi$ and $D_i$ as defined in Equations (8) and (9), respectively. For each trial run, the algorithm converged to a solution with at least 15 zeros from all the starting points, and to a solution approximately equal to $\hat{x}$ from about 10% of the starting points.
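A rough reproduction sketch of this experiment, using the sparse_solve routine sketched in Section 2 (for simplicity the box constraints $|x_i| \le 5$ are dropped, so the feasible set is only an approximation of $S$ in (11)):

import numpy as np

# Hypothetical reproduction of the experiment in this subsection, using the
# sparse_solve sketch from Section 2 (box constraints omitted for simplicity).
rng = np.random.default_rng(1)
n, m, k = 30, 15, 10
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
b = A @ x_true

x_est = sparse_solve(A, b, n_iter=200, x0=rng.standard_normal(n))
print("zeros in estimate:", int(np.sum(np.abs(x_est) < 1e-3)))
print("max deviation from the true sparse vector:", float(np.max(np.abs(x_est - x_true))))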
4.2. Deconvolution example
The second example is the problem of obtaining a vector $x$ from $y = h \ast \hat{x} + \eta$, where the star denotes convolution, $h \in \mathbb{R}^{n_h}$ is a known vector, and $\eta \sim N(0, \sigma^2 I)$. It is assumed that an upper limit $n_x$ on the length of $x$ is known, that $x$ is sparse, and that each component of $x$ is bounded (say by $t$). The problem now reduces to searching for the maximally sparse solutions in the set $S$, where

$$ S = \{z \in \mathbb{R}^{n_x} : \|h \ast z - y\|^2 \le \epsilon,\ |z_i| \le t\ \forall i\}, \qquad (12) $$

and $\epsilon$ is chosen to be a large enough multiple of $\sigma^2$. We demonstrate this case with $h \in \mathbb{R}^8$, $\hat{x} \in \mathbb{R}^6$, $n_x = 10$, $\sigma = 0.01$ and $t = 5$; $\epsilon$ is chosen to be $\epsilon = 10 (n_x + n_h - 1) \sigma^2$. Note that, for Gaussian noise, this choice means that the true $x$ lies within $S$ with > 99% probability. The test data $y$ is synthesized with $\hat{x}$ chosen to be 25% sparse. For each of 30 Monte-Carlo trial runs, the vectors $\hat{x}$ and $h$ are generated randomly. For 30 different values of the noise vector $\eta$, the algorithm is applied to the problem of finding the sparse solution from 30 random starting points. For each trial run, each noise realization, and every starting point, the algorithm converged to a solution at least as sparse as $\hat{x}$ and approximately equal to it. The rms per-component error in the estimate, averaged across trial runs, noise realizations and starting points, was 0.08. In contrast, the least-squares estimate had a smaller error (0.02), but almost all of its component elements were non-zero.
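When $S$ has the form (12), step 3 of Table 1 no longer has a closed form; one way to carry it out is with a general-purpose constrained convex solver. The following is a minimal sketch; the helper names and the use of SciPy's SLSQP are our illustrative choices, not necessarily what was used for the experiments above.

import numpy as np
from scipy.optimize import minimize

def conv_matrix(h, n_x):
    # Full-convolution matrix H such that H @ z equals np.convolve(h, z).
    n_y = len(h) + n_x - 1
    H = np.zeros((n_y, n_x))
    for i in range(n_x):
        H[i:i + len(h), i] = h
    return H

def step3(H, y, e, eps, t, z0):
    # One instance of Eq. (5): minimize sum_i e_i z_i^2 over the set S of Eq. (12).
    cons = [{'type': 'ineq',
             'fun': lambda z: eps - np.sum((H @ z - y) ** 2),
             'jac': lambda z: -2.0 * H.T @ (H @ z - y)}]
    res = minimize(lambda z: np.sum(e * z ** 2), z0,
                   jac=lambda z: 2.0 * e * z,
                   method='SLSQP',
                   bounds=[(-t, t)] * H.shape[1],
                   constraints=cons)
    return res.x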
4.3. Comparison with existing Greedy Algorithms
At this point, we would like to compare our algorithm with a few commonly used greedy algorithms for subset selection and other feature extraction problems. The ones we choose are Sequential Forward Selection (SFS), Sequential Backward Selection (SBS), "plus 2, take away 1" (SFSB), and "remove 2, add 1" (SBSF), as explained in [9]. Specifically, we would like to demonstrate with an example that our heuristic often works when these competing algorithms fail. In this case, we are looking for sparse exact solutions to $Ax = b$, where
$A$ is a particular $4 \times 6$ matrix with real entries and $b \in \mathbb{R}^4$.
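For reference, a minimal sketch of a generic forward-selection baseline of the SFS type (one common variant; not necessarily the exact procedure of [9]):

import numpy as np

def sfs(A, b, tol=1e-6):
    # Generic sequential forward selection: grow the support one column at a
    # time, each time adding the column that most reduces ||A[:, S] x_S - b||.
    n = A.shape[1]
    support, best_x = [], None
    while len(support) < n:
        best_j, best_r, best_x = None, np.inf, None
        for j in range(n):
            if j in support:
                continue
            cols = support + [j]
            xs, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
            r = np.linalg.norm(A[:, cols] @ xs - b)
            if r < best_r:
                best_j, best_r, best_x = j, r, xs
        support.append(best_j)
        if best_r <= tol:   # an (almost) exact solution has been reached
            break
    x = np.zeros(n)
    x[support] = best_x
    return x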
Table 2 shows the selections of each algorithm. A zero in a column means that the corresponding element is less than $10^{-3}$ in magnitude. It can be seen that all four greedy algorithms have converged to points having two zero components. Our algorithm converged to a solution approximately equal to the one shown (with 3 zero components) from about half of the random starting points.

Our alg.    SFS      SBS      SFSB     SBSF
 0           0        1.19     1.40     0.06
-0.41        0        0       -1.28     0.15
-0.91       -0.89     0        0.09    -0.84
 1.51        0.53    -2.51     0        0
 0           0.02    -0.54    -0.71     0
 0          -0.87    -2.43     0       -1.28

Table 2: The points selected by the greedy algorithms and the algorithm presented in this paper. Our algorithm converged to a solution approximately equal to the one shown from about half of the random starting points.
5. CONCLUSIONS AND FUTURE WORK

In this paper, we present an iterative algorithm for computing sparse solutions (or sparse approximate solutions) to linear inverse problems. The algorithm is seen to work very well in practice. It is shown to converge to the local minima of a function of the form used for picking out sparse solutions, and thus ties in with the existing techniques. It is demonstrated on subset selection and deconvolution examples.
Our heuristic is not intended to replace the popular greedy algorithms, but instead to supplement the arsenal of existing techniques. The fact that the proposed algorithm is sometimes successful when popular greedy algorithms fail is demonstrated with an example. The rate of convergence of the algorithm is a function of the choices of $\phi$ and the $D_i$, but theoretical results are as yet unavailable. Future work should be directed at these areas.
6. REFERENCES

[1] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore: The Johns Hopkins University Press, 1989.
[2] R. G. Gallager, Information Theory and Reliable Communication. New York: John Wiley, 1968.
[3] B. K. Natarajan, "Sparse approximate solutions to linear systems," SIAM Journal on Computing, vol. 24, pp. 227-234, April 1995.
[4] M. S. O'Brien, A. N. Sinclair, and S. M. Kramer, "Recovery of a sparse spike time series by l1 norm deconvolution," IEEE Transactions on Signal Processing, vol. 42, pp. 3353-3365, December 1994.
[5] J.-J. Fuchs, "Extension of the Pisarenko method to sparse linear arrays," in Proceedings of the ICASSP, vol. 3, (Detroit), pp. 2100-2103, IEEE, May 1995.
[6] I. Gorodnitsky and B. D. Rao, "Convergence properties of an adaptive weighted norm extrapolation algorithm," in Proceedings of the ICASSP, vol. 3, (Minneapolis), pp. 456-459, IEEE, May 1993.
[7] P. Charbonnier et al., "An adaptive reconstruction method involving discontinuities," in Proceedings of the ICASSP, vol. 5, (Minneapolis), pp. 491-494, IEEE, April 1993.
[8] A. H. Delaney and Y. Bresler, "Edge preserving regularization for limited angle tomography," submitted to IEEE Transactions on Image Processing, 1994.
[9] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Englewood Cliffs: Prentice Hall, 1982.