Maximum Likelihood, Least Squares, and Penalized Least Squares for PET

Linda Kaufman

IEEE Transactions on Medical Imaging, Vol. 12, No. 2, June 1993

Manuscript received September 19, 1991; revised July 23, 1992. The author is with AT&T Bell Laboratories, Murray Hill, NJ 07974. IEEE Log Number 9208162.

Abstract- The EM algorithm is the basic approach used to maximize the log likelihood objective function for the reconstruction problem in PET. The EM algorithm is a scaled steepest ascent algorithm that elegantly handles the nonnegativity constraints of the problem. We show that the same scaled steepest descent algorithm can be applied to the least squares merit function, and that it can be accelerated using the conjugate gradient approach. Our experiments suggest that one can cut the computation by about a factor of 3 by using this technique. Our results also apply to various penalized least squares functions which might be used to produce a smoother image.

I. INTRODUCTION

POSITRON emission tomography (PET) is used to study blood flow and metabolism of a particular organ. The patient is given a tagged substance (such as glucose for brain study) which emits positrons. Each positron annihilates with an electron and emits two photons in opposite directions. The patient is surrounded by a ring of detectors, which are wired so that whenever any pair of detectors senses a photon within a very small time interval, the size of which is system-dependent, the count for that pair is incremented. In a matter of minutes, several million photon pairs may be detected. The reconstruction problem in PET is to determine a map of the annihilations, and hence a map of the blood flow, given the data gathered by the ring of detectors. There are two main approaches given in the literature: convolution backprojection [28], which was originally devised for CAT, and the probability matrix approach, which better captures the physics of the positron annihilation but in practice has not been as popular as convolution backprojection. There are two main arguments usually leveled against the probability approach: in the first place, the images are often speckled, and secondly, they can be expensive to produce. There have been various proposals for different merit functions, maximum likelihood (ML) [31], least squares (LS), and maximum a posteriori, intended to give a better image. Various smoothers have been proposed, which tend to consider nearest neighbor interactions (see Green [9], Hebert and Leahy [11], Lange [19], Geman and McClure [7], and Levitan and Herman [22]). These smoothing techniques also choose a particular solution when there is no unique one with the ML or LS approaches. A disadvantage is that they have parameters which must be determined. Herman and Odhner [12] have arrested some of the controversy over the desirability of some of the merit functions by showing that the suitability of an approach depends on the medical application, and sometimes the speckling is inconsequential.

The EM algorithm proposed in [29] and [21] is the basic approach for solving the ML problem. Techniques have been suggested for speeding up each iteration of the EM algorithm by taking advantage of the fact that the algorithm is well suited for parallel computation [4], [24], [13], and by decreasing the number of unknowns by multigridding and adaptive gridding [27], [26]. Various people have suggested treating the steps of the EM algorithm as a direction and then using an inexact line search to speed up convergence. The EM algorithm is a scaled steepest ascent algorithm. The scaling is a very good way to incorporate the nonnegativity constraints. However, as a steepest ascent technique, it is a linearly convergent scheme. Steepest ascent algorithms are notorious for going across steep canyons rather than along canyons, and for taking very small steps whenever the level curves are ellipsoidal. Using a line search usually improves the rate of convergence, but the algorithm is still linearly convergent. However, one can create a superlinearly convergent scheme using the ideas of the conjugate gradient algorithm. The conjugate gradient algorithm uses a linear combination of the current step and the previous one to create directions which are A-orthogonal, where A is an approximation to the Hessian matrix. It tends to go along canyons. Tsui et al. [30] have used the conjugate gradient algorithm with the least squares objective function, and showed that in one setting, LS-CG was ten times faster than ML-EM.

In Section II, we develop the EM algorithm for both the maximum likelihood and least squares merit functions, and we show that EM for ML is equivalent to applying EM to a continually reweighted least squares problem. The Kuhn-Tucker conditions which are used to develop the EM algorithm elegantly incorporate the nonnegativity constraints. Most algorithms that have been used for the LS problem either do not incorporate the constraints (see [30]), include them more as an afterthought (see [16]), or force one to decide whether a variable is small or 0 (see [2] and [17]). Moreover, the same techniques apply to various smoothing penalty functions. Using the same type of technique with various merit functions eliminates some of the factors that tend to obscure the issue of determining which, if any, is the best merit function. In Section III, we discuss ways of accelerating the EM algorithm for the least squares computation. Our techniques are similar to those discussed in [15] for the maximum likelihood function.




However, since differences in function values can be computed more easily for LS than for ML, acceleration techniques based on function differences should be much more acceptable to the medical imaging community than these same techniques were when applied to the maximum likelihood function in [15]. Like [30], we turn to the conjugate gradient algorithm, but we suggest a preconditioned conjugate gradient algorithm (PCG) based on the scaled steepest descent algorithm in order to take into consideration the nonnegativity constraints that they ignore. Our algorithm is similar to the one proposed by Kawata and Nalcioglu [17], but we give a bit more freedom in choosing a diagonal scaling matrix. We show that with little modification, the algorithms can be used with a merit function with a smoothing penalty term. We also suggest that the scaling in the EM algorithm might not be optimal, especially if it is used in a multigrid setting.

In Section IV, numerical results are given. In general, there is little difference between the images produced using EM for least squares and EM for ML. Applying our PCG algorithm to the least squares function tends to reduce the number of iterations by about a factor of 3 over the EM-ML algorithm. The main features of the image appear early in the sequence of pictures produced by the EM algorithm applied to LS and in those produced by the EM-based PCG algorithm. As in the case of the EM algorithm for ML, as the algorithms converge, the images become snowier. Some researchers suggest terminating the EM algorithm before the speckling obscures the image, while others suggest some type of smoothing. Adding a smoothing penalty term, such as a squared difference as in [22] or an ln(cosh) as in Green [9], decreases the amount of speckling, but the PCG approach is still just as effective. The appropriate weighting parameters in these penalty approaches depend on the total number of annihilations counted and the shape of the image, and adjusting them might not be an easy task.

When comparing various merit functions, the algorithm used to optimize a particular merit function must be considered. Images obtained using different algorithms or starting guesses to optimize the same objective function might be radically different. Just because an algorithm is producing iterates that appear to have converged does not mean that the optimum of the function has been obtained. An algorithm that stops when the gradient is small may terminate prematurely in a region that is "almost" flat. Iterates could be bouncing back and forth between the sides of a steep canyon, and thus might appear to be converging. Different algorithms may take different paths to the solution, and using even the same stopping criteria may produce different results. The initial guess also tends to be a big contributing factor to the appearance of an image, as the results in Kaufman [15] indicate. Furthermore, when the solution is not unique or there are multiple local optima, various algorithms will approach different optima. For example, with an initial guess of 0, for an underdetermined system, the conjugate gradient algorithm is guaranteed, if roundoff error is not considered, to find the least squares solution that has minimum norm. In general, it will determine the solution that is closest to the starting point.

II. EM APPLIED TO LS AND ML AND PENALIZED LIKELIHOOD

In the discrete reconstruction problem, one has data η, where η_t represents the number of photon pairs detected in tube t. One would like to determine x(z), the number of photon pairs emitted at each point z. However, this is not computationally feasible, but one can impose a grid of B boxes on the affected organ and try to compute as unknowns x_b, the number emitted in box b. We would like the x's to be nonnegative, and it would be nice if the sum of the emitted pairs equaled the sum of the detected pairs. We assume that a matrix P can be constructed such that p_{b,t} represents the probability that a photon pair emitted in box b will be detected in tube t.

There are various mathematical approaches to determine the map of the annihilations, i.e., x. One approach, suggested by Shepp and Vardi [29], is based on the assumption that the emissions occur according to a spatial Poisson point process in a certain region. If the η_t's are assumed to be independent Poisson variables with means λ_t, Shepp and Vardi show that x can be found by maximizing the likelihood function

L(x) = Π_{t=1}^{T} e^{−λ_t} λ_t^{η_t} / η_t!     (2.1)

where T is the number of detecting tubes. The vector x which maximizes L(x) also maximizes l(x) = log(L(x)), whose gradient is much simpler to compute than that of L(x). Assuming that

λ_t = Σ_{b=1}^{B} x_b p_{b,t}

and

Σ_{t=1}^{T} p_{b,t} = 1,

where p_{b,t} is the probability that photons emitted from box b will be detected by detecting pair t, the gradient of l(x) is given by

∂l(x)/∂x_b = −1 + Σ_{t=1}^{T} η_t p_{b,t} / Σ_{b'=1}^{B} x_{b'} p_{b',t}.     (2.2)
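For concreteness, a small NumPy sketch of l(x) and the gradient (2.2) is given below; the dense matrix P (with p_{b,t} stored at P[b, t]) and the count vector eta are illustrative assumptions, and the constant −log(η_t!) terms are dropped since they do not affect the maximizer.

```python
import numpy as np

def log_likelihood(x, P, eta, eps=1e-12):
    """l(x) = sum_t (-lambda_t + eta_t log lambda_t), dropping the constant -log(eta_t!) terms."""
    lam = P.T @ x                               # lambda_t = sum_b x_b p_{b,t}
    return np.sum(-lam + eta * np.log(np.maximum(lam, eps)))

def grad_log_likelihood(x, P, eta, eps=1e-12):
    """(2.2): dl/dx_b = -1 + sum_t eta_t p_{b,t} / lambda_t, using the unit row sums of P."""
    lam = P.T @ x
    return -1.0 + P @ (eta / np.maximum(lam, eps))
```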

The Kuhn-Tucker conditions (see [8] or [23]) for maximizing l subject to nonnegativity constraints are

∂l(x)/∂x_b = 0 for b = 1, ..., B and x_b > 0     (2.3)

and

∂l(x)/∂x_b ≤ 0 for b = 1, ..., B and x_b = 0.     (2.4)

The Kuhn-Tucker conditions along with the formula given in (2.2) lead to the EM algorithm of Dempster et al. [5], proposed for PET by Shepp and Vardi [29] and Lange and Carson [21]:

x_b^{new} = x_b^{old} Σ_{t=1}^{T} η_t p_{b,t} / Σ_{b'=1}^{B} x_{b'}^{old} p_{b',t}.     (2.5)
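As an illustration of (2.5), a minimal NumPy sketch of one EM-ML update follows; the dense probability matrix P of shape (B, T) with unit row sums and the tube-count vector eta are assumptions made for the example, and a real implementation would exploit the sparsity of P.

```python
import numpy as np

def em_ml_step(x, P, eta, eps=1e-12):
    """One EM-ML update (2.5): x_b <- x_b * sum_t eta_t p_{b,t} / (sum_b' x_b' p_{b',t})."""
    u = P.T @ x                       # u_t = sum_b x_b p_{b,t}, the expected tube counts
    ratio = eta / np.maximum(u, eps)  # eta_t / u_t, guarded against empty tubes
    return x * (P @ ratio)            # scale each box by its back-projected ratio
```

Because the rows of P sum to one, a single update already makes Σ_b x_b equal to Σ_t η_t, which is the tube-count preservation property returned to in Section III.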



Another way of rewriting the EM algorithm is

x_b^{new} = x_b^{old} + x_b^{old} ∂l(x^{old})/∂x_b     (2.6)

so that the EM algorithm might be thought of as a scaled steepest ascent algorithm with the distance of each element to the nonnegativity constraint used as the scale factor. In the rest of the paper, any scaled steepest descent or ascent method using this scale factor we will call an EM-like algorithm.

Another approach for determining the annihilation map is the least squares approach in which, given the tube counts η, one minimizes

f(x) = (1/2) ||P^T x − η||_2^2     (2.7)

subject to nonnegativity constraints. We note that for a rather small problem, one may impose a 128 x 128 grid leading to 16 384 unknowns, and that there might be 128 detectors or 128 x 127/2 columns in P. Thus, P would have 125 million elements, of which only about 1.6 million are nonzero. Because of its size and density structure, using a factorization of the P matrix to solve (2.7) is ill advised. The gradient of (2.7) is

∇f = P(P^T x − η).     (2.8)

Using the Kuhn-Tucker conditions for minimizing a function subject to nonnegativity constraints, one can derive an EM-like algorithm for the least squares function, namely,

x_b^{(k+1)} = x_b^{(k)} − x_b^{(k)} z_b, where z = P(P^T x^{(k)} − η).

More formally, one can write the algorithm in matrix form as follows.

EM-LS: For k = 1, 2, ... until convergence:
1) Set z = P(P^T x^{(k)} − η)
2) Set x_b^{(k+1)} = x_b^{(k)} − x_b^{(k)} z_b.

The EM-LS algorithm and the EM algorithm for the maximum likelihood function, which we will call EM-ML, are rather similar. This becomes more apparent when EM-ML is written in matrix form as follows.

EM-ML: For k = 1, 2, ... until convergence:
1) Set u_t = (P^T x^{(k)})_t for t = 1, 2, ..., T
2) Set ψ_t = η_t/u_t
3) Set y = Pψ
4) Set x_b^{(k+1)} = x_b^{(k)} y_b for b = 1, 2, ..., B.
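Continuing the dense-P illustration used for the EM-ML sketch above, the EM-LS iteration can be written in a few lines of NumPy; P and eta are again assumed inputs.

```python
import numpy as np

def em_ls_step(x, P, eta):
    """One EM-LS update: x_b <- x_b - x_b * z_b with z = P (P^T x - eta)."""
    z = P @ (P.T @ x - eta)   # scaled steepest descent direction for f(x) = 0.5 ||P^T x - eta||^2
    return x * (1.0 - z)      # elementwise: x_b^(k+1) = x_b^(k) - x_b^(k) z_b
```

Note that nothing in this update enforces nonnegativity or a decrease in f; Section III returns to this point.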

Both EM-LS and EM-ML require matrix-vector multiplications with the matrices P and P^T, and thus require roughly the same amount of work per iteration. Let

σ_t = (u_t − η_t)/u_t     (2.9)

so that ψ_t = 1 − σ_t. Let e represent the vector containing all 1's. Because all the row sums of P are 1, Pe = e. Thus,

y = Pψ = P(e − σ) = e − Pσ.     (2.10)

Equation (2.10) implies that in the EM-ML algorithm,

x_b^{(k+1)} = x_b^{(k)} − x_b^{(k)} z_b with z = Pσ,     (2.11)

which, if u_t = 1 for all t, is the EM-LS algorithm. Another way to look at the correspondence between the two methods is to consider minimizing

w(x) = (1/2) ||D(P^T x − η)||_2^2     (2.12)

where D is a diagonal weighting matrix. The gradient of w(x) is simply

∇w = PD^2(P^T x − η).

As we did for f(x), we can derive an EM-like algorithm for w(x), namely,

x_b^{(k+1)} = x_b^{(k)} − x_b^{(k)} z_b

where

z = PD^2(P^T x^{(k)} − η) = PD^2(u − η).

If D were allowed to change each iteration and d_tt = (1/u_t)^{1/2}, then it should be obvious from (2.11) and (2.9) that the iterates obtained from the EM-ML algorithm would be exactly those obtained from applying an EM-like algorithm to a continually reweighted least squares problem.

Our development of the EM algorithm can also be extended to merit functions that include a penalized potential function. In the likelihood situation, these can take the form of the penalized log likelihood

l(x) − U(x)     (2.13)

where U(x) is designed to penalize large differences in estimated parameters for neighboring boxes and has the general form

U(x) = γ Σ_j v(x_j; x_i, x_i ∈ N_j)     (2.14)

where γ is a positive constant and N_j are usually the eight or so nearest neighbors. Various suggestions have been given for v, including the squared difference

v(x_j, x_i) = (x_i − x_j)^2     (2.15)

suggested in [22], where N_j represents the eight nearest neighbors, Green's suggestion in [9] of

v(x_j, x_i) = ln(cosh((x_i − x_j)/δ))     (2.16)

where δ is another parameter to be set, and the nonlinear function of Hebert and Leahy [11],

v(x_j, x_i) = ln(1 + (x_i − x_j)^2/δ),     (2.17)

which also has an additional parameter.
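To make the penalty terms concrete, here are small NumPy versions of (2.15)-(2.17) and their derivatives with respect to x_j, which is what the penalized algorithms of Section III need; the function names and the parameter name delta are illustrative choices.

```python
import numpy as np

def v_map(xj, xi):                 # squared difference, as in (2.15) [22]
    return (xi - xj) ** 2

def dv_map(xj, xi):                # d/dx_j of (2.15)
    return -2.0 * (xi - xj)

def v_lc(xj, xi, delta):           # ln(cosh) penalty, as in (2.16) [9]
    return np.log(np.cosh((xi - xj) / delta))

def dv_lc(xj, xi, delta):          # d/dx_j of (2.16)
    return -np.tanh((xi - xj) / delta) / delta

def v_ln(xj, xi, delta):           # ln(1 + (.)^2/delta) penalty, as in (2.17) [11]
    return np.log1p((xi - xj) ** 2 / delta)

def dv_ln(xj, xi, delta):          # d/dx_j of (2.17)
    return -2.0 * (xi - xj) / (delta + (xi - xj) ** 2)
```

U(x) and ∇U(x) are then obtained by summing these pairwise terms over each box's neighborhood N_j and scaling by γ.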



These, and others suggested in [11] and [7], are all easy to differentiate, nonnegative, and even. One can obviously add U(x) to f(x) to form a penalized least squares function and apply the EM algorithm as given above. The form of (2.15) is very conducive to a least squares situation, but as suggested by various authors, it penalizes high deviations between neighboring boxes excessively. As shown by Lange [19], the ln(cosh) function has many desirable properties, but it is critical that δ in (2.16) be appropriate to the problem. The term suggested by Hebert and Leahy seems to have most of the advantages of (2.16), is easier to evaluate, and seems to have fewer numerical considerations. In the remainder of this paper, MAP will denote a penalized least squares function using (2.15), LC one using the ln(cosh) function in (2.16), and LN one using the penalty term in (2.17).

III. ALTERNATIVE WAYS FOR SOLVING THE LEAST SQUARES PROBLEM

In this section, we will discuss several general approaches for minimizing

f(x) = (1/2) ||P^T x − η||_2^2     (3.1a)

such that

x ≥ 0.     (3.1b)

Our aim in this discussion is to determine ways to accelerate the EM-LS algorithm given in Section II. We also point out that extending our results to merit functions involving a penalized smoothing term as in (2.14) is easy. Many algorithms for minimizing (3.1a) subject to (3.1b) have the following general form.

General Minimization Algorithm:
1) Determine x^{(1)}.
2) For k = 1, 2, ... until convergence:
   a) Determine a search direction s^{(k)}
   b) Determine a step size α^{(k)}
   c) Set x^{(k+1)} = x^{(k)} + α^{(k)} s^{(k)}.     (3.2)

Step 1) of the above algorithm should not be treated lightly. For certain algorithms for solving (3.1), one has to be "close enough" to get convergence. For some methods, like the EM-LS algorithm of Section II, starting at x = 0 spells disaster. For others, it may be fine. As our data indicate in the next section, certain algorithms produce much better pictures when the initial guess is uniform.

The parameter α in step 2b) is often used to obtain a sufficient decrease in f along s and to maintain feasibility. For EM-LS, described in Section II, it is set to 1, which does not guarantee feasibility. However, the EM-ML algorithm follows the general outline given above with α set to 1, and as proved in [31], the iterates will always be nonnegative. Moreover, if the EM-ML algorithm were modified to include a step size parameter, as long as α^{(k)} = 1, one would have

Σ_{b=1}^{B} x_b^{(k+1)} = Σ_{t=1}^{T} η_t

so that there is a preservation of tube counts in the tomography problem. For the least squares problem, the EM-LS algorithm, which sets α = 1, does not guarantee that the tube counts will be preserved or that f(x^{(k+1)}) ≤ f(x^{(k)}). Allowing α to vary in the EM-like algorithms gives much greater flexibility.

A. Ensuring Nonnegativity

There are four main approaches to handling the constraints in the general algorithm given above. In the first place, s can be determined without considering (3.1b), as in the well-known active set techniques sometimes used for such problems. Nonnegativity is maintained by restricting α. Secondly, if for some θ, x_b + θ s_b is 0, then for α > θ, one might consider setting s_b = 0. Thus, s might be considered a bent line, and as one travels along s, whenever a component of x + αs becomes 0, s would bend. Thus, the constraints would determine the breakpoints in the bent line. Thirdly, the constraints can be explicitly used while forming s, as in the EM algorithm and in barrier methods. Finally, the constraints can be used in initially forming s as in the third approach, and then a bent line approach can be applied.

In the active set procedures, at each iteration, one separates the elements of x into those which should be kept at 0 and the variables that are free to vary. The direction s is chosen to minimize some approximation to f in the space of the free variables. One travels along s until some approximation to f has been sufficiently decreased or a variable becomes negative. The approach assumes that it is important to determine whether variables are 0, and works well when one knows a priori almost all the variables that will be at bound. If there are initially many variables that are positive, which will eventually be driven to zero, many small steps might be required. Because the ultimate use of the variables in tomography is a picture, in which elements that are small and elements that are zero might be displayed by the same color, it is not that important whether a variable is 0 or just close to 0. Moreover, there may be a large number of variables at 0 which a priori were thought to be positive. Thus, the active set procedures are rarely cost effective for the tomography problem.

The problem of small steps is partially overcome in the bent line approach. Here, α can be larger than in the active set approach so that some of the elements of x may become initially negative. After the step is taken, negative elements are set to 0. Various algorithms have been suggested for determining α, and the reader is referred to a recent paper by Beierlaire et al. [2]. The bent line approaches assume that the cost of reevaluating f at the projections of x^{(k)} for various values of α is much less than that of computing a new direction s. The tests in [2] indicate that when there are more than a few variables at bound, which is often the case in tomography, it is better to do an inexact line search that stops when f has been sufficiently decreased than to do an exact line search.

The third approach involves the explicit incorporation of the constraints into the search direction s. Included in this category are the EM algorithm, with a line search, and various interior point methods given in [18] and other recent papers.




Often, the objective function is changed to reflect the constraints, and a standard unconstrained method is used to determine s for the new function. The advantage of incorporating the constraints into s is that, usually, elements of s corresponding to elements close to 0 are kept small. Thus, large steps in the "freer" variables are tolerated. The active set methods and the bent line methods eventually reach this state, but rarely as quickly as those that involve the distance to the constraints in the determination of s. The bounded line search EM algorithm, given below, follows the third approach.

Bounded Line Search EM-LS:
1) Determine x^{(1)}.
2) Set g = P(P^T x^{(1)} − η).
3) Let W be the diagonal matrix containing x^{(1)}.
4) Set γ = g^T W g.
5) For k = 2, 3, ... until convergence:
   a) Set s = −W g, the new search direction
   b) Set u = P^T s
   c) Set θ = γ/u^T u, the minimum of f along s
   d) Set α = min(θ, min_{s_b < 0}(−x_b^{(k−1)}/s_b))
   e) Set x^{(k)} = x^{(k−1)} + α s
   f) Reset W to the diagonal matrix containing x^{(k)}
   g) Set z = P u
   h) Set g = g + α z
   i) Set γ = g^T W g.

If one had a penalized least squares function of the form f(x) + U(x), where U(x) might be something like (2.16), then one would substitute g + ∂U(x)/∂x for g in the bounded line search EM-LS algorithm and change the formula for θ in step 5c) accordingly.
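A NumPy sketch of the bounded line search EM-LS loop above, again assuming a dense P and counts eta; the fixed iteration count stands in for a real convergence test.

```python
import numpy as np

def bounded_line_search_em_ls(x, P, eta, iters=50):
    """Bounded line search EM-LS: scaled steepest descent s = -W g with W = diag(x),
    stepping to the minimum of f along s but not past the nonnegativity boundary."""
    g = P @ (P.T @ x - eta)                  # gradient of f(x) = 0.5 ||P^T x - eta||^2
    for _ in range(iters):
        s = -x * g                           # s = -W g with W = diag(x)
        u = P.T @ s
        theta = (g @ (x * g)) / (u @ u)      # exact minimizer of f along s
        neg = s < 0
        bound = np.min(-x[neg] / s[neg]) if np.any(neg) else np.inf
        alpha = min(theta, bound)            # do not step past the first constraint
        x = x + alpha * s
        g = g + alpha * (P @ u)              # gradient update: g <- g + alpha * P P^T s
    return x
```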

B. Determining s and Acceleration Schemes

In (3.2), the direction s is chosen to approximately minimize f or a modification of f that includes constraint information. If the approximation is linear, as in the EM algorithm, one quickly gets somewhat close to the solution, but then little progress is made. When the level curves of a function are very elongated hyperellipsoids and minimization is tantamount to finding the lowest point in a flat, steep-sided valley, a steepest descent algorithm, such as the EM algorithm, tends to traverse across the valley rather than going along the valley. The directions generated are too similar, and information is not gathered in subspaces orthogonal to these directions.

Several alternatives exist which remove or decrease the effect of ellipsoidal level curves. One can use a quadratic approximation as in the method suggested by Han et al. [10]. However, their algorithm involves solving a huge linear system involving P, not an easy task. Often, an iterative method is used to approximately solve the system. The conjugate gradient algorithm is an easy-to-use algorithm which generates gradients that are mutually orthogonal. The search directions tend to go along steep valleys, rather than across them. The search directions s satisfy the condition that s_i^T P P^T s_j = 0 for i ≠ j. (See [8].) The linear conjugate gradient algorithm, originally proposed by Hestenes and Stiefel in 1952 [14], is used to iteratively solve a symmetric positive semidefinite linear system when it is easy to do a matrix-vector multiplication with the coefficient matrix.

The nonlinear conjugate gradient algorithm, proposed first by Fletcher and Reeves in 1964 [6], is used in function minimization. If the function is quadratic, as in (3.1a), and there are no constraints, then the standard nonlinear conjugate gradient algorithm will produce the same sequence of iterates as the linear conjugate gradient method applied to the system P P^T x = P η. If P^T has m positive distinct singular values, the conjugate gradient algorithm is guaranteed to converge in at most m steps, each involving a matrix by vector multiplication to determine s. (See [3].) In our experience, good pictures are obtained in significantly fewer than m steps, and for the least squares problem, in many fewer steps than the bounded line search EM-LS algorithm. As we shall see in Section IV, dramatic decreases in function values can be obtained very quickly. The strength of the conjugate gradient method is captured in the following theorem recast from Luenberger [23].

Theorem: Assume x^{(0)} is the initial guess of an iteration procedure and Q = P P^T, and consider the class of procedures given by

+

,(k+l) = Rlc(Q)Vf(z(O') where RI,(Q)is a polynomial of degree IC. Assume x* is the solution to the problem P P T z = P q so that

--*

=

( I + Q R ~ ( Q ) ) ( ~ ( O ) -z*).

Let

E(z(k+l))

l(z(k+l)

-

1

z * ) T Q ( z ( k + l ) - z*

implying E(z(I,+'))= ;(z(~+')- z * ) ~ Q ( I QRI,(Q))(z(O) - z*).

+

The point z('+l) generated by the conjugate gradient method satisfies qz("+l)) = min ;(z(k+') - z*)T Rk

. Q ( I + QRI,(Q))(JO) - z*) where the minimum is taken with respect to all possible polynomials RI, of degree IC. The above theorem states that in one sense, the conjugate gradient method is optimal over a class of procedures that is easy to implement. In particular, every step of the conjugate gradient method is at least as good as the steepest descent step would be from the same point. There are various ways of stating the conjugate gradient algorithm for minimizing a quadratic function, all of which are equivalent in infinite precision arithmetic. The variant that seems least sensitive to roundoff error in finite precision arithmetic is LSQR [25]. As explained in Section 111-A in general terms, there are a number of ways LSQR can be modified to take into consideration nonnegativity constraints. One can use an active set strategy in the space of free variables, and whenever taking a step of Q to minimize f , or an approximation thereof, violates a nonnegativity constraint, one only goes as far as the constraint and restarts at the top of the algorithm. Computational tests suggest that this strategy makes little progress in tomography problems. It is probably a bad strategy in general. One can do a bent line search approach, and again restart whenever one goes past the first bend. (See [2].) The efficacy of this



The efficacy of this approach is problem-dependent, and also depends on how the bent line search is implemented. One can be very lucky, and after a few iterations, all the elements that eventually will be 0 are determined, and one gets the superlinear convergence associated with the conjugate gradient technique.

A third possibility involves using the constraints more explicitly, as in the EM algorithm. Let W be the diagonal matrix with w_ii = x_i. The line search EM algorithm, given above, would use s = −Wg, where g is the gradient of f(x). To accelerate the EM algorithm in the same way that the conjugate gradient algorithm accelerates the steepest descent algorithm, one might consider a conjugate gradient EM algorithm with

s^{(k+1)} = −Wg − δ s^{(k)}.     (3.4)

Let us derive such an algorithm for the quadratic problem (3.1a). The traditional linear conjugate gradient algorithm is designed to solve the system

P P^T x = P η.     (3.5)

Solving (3.5) is also equivalent to solving

W^{1/2} P P^T W^{1/2} z = W^{1/2} P η     (3.6)

where W^{1/2} z = x. The matrix W is usually called a preconditioner because it is assumed that the new system is better conditioned than the old one and the algorithm will converge faster. (See [8].) If it is assumed that W is a constant matrix in (3.6) that is updated each time there is a restart, then the linear conjugate gradient algorithm, retaining nonnegativity, applied to (3.6) is as follows.

PCG:
1) Determine x^{(1)}. Set k to 1.
2) Set d = η − P^T x^{(k)}. Set W to the diagonal matrix containing x^{(k)}.
3) Let φ̄^{(k)} = ||d||_2 and u = d/φ̄^{(k)}.
4) Set y = Pu.
5) Let g = y/(y^T W y)^{1/2}.
6) Set s^{(k)} = Wg, γ = (y^T W y)^{1/2}, and ρ̄^{(k)} = γ.
7) Until convergence iterate:
   a) Set d = P^T(Wg) − γ u
   b) Set β = ||d||_2 and u = d/β
   c) Set ρ^{(k)} = ((ρ̄^{(k)})^2 + β^2)^{1/2}, c = ρ̄^{(k)}/ρ^{(k)}, and τ = β/ρ^{(k)}
   d) Set φ^{(k)} = c φ̄^{(k)} and δ = φ^{(k)}/ρ^{(k)}
   d') If δ > 0, set α = min(δ, min_{s_b^{(k)} < 0}(−x_b^{(k)}/s_b^{(k)}))
   e) Set x^{(k+1)} = x^{(k)} + α s^{(k)}
   e') If |α| < |δ|, increment k by 1 and go back to step 2)
   f) Set y = Pu − β g
   g) Set γ = (y^T W y)^{1/2}
   h) Set g = y/γ
   i) Set ρ̄^{(k+1)} = −c γ, φ̄^{(k+1)} = τ φ̄^{(k)}, θ = τ γ, and δ = θ/ρ^{(k)}
   j) Set s^{(k+1)} = Wg − δ s^{(k)}.

In algorithm PCG, whenever a constraint is hit, |α| < |δ|, and the algorithm is restarted with a new preconditioner, and a line search EM step is taken. If the algorithm never hits a constraint, the whole algorithm is just the standard linear conjugate gradient algorithm applied to the preconditioned system. Because the objective function is quadratic, termination is assured. In practice, the role of the preconditioner is to ensure that the distance to the nearest constraint is large enough so that progress is not hindered. It also reweights the problem so that information corresponding to nonnegligible x's is considered more important. Although the algorithm does not seem simple, most of the work is involved in multiplication by P and P^T, operations that must be done with the standard EM algorithm. Notice that the search direction generated in step j) is just a linear combination of the EM step and the previous direction. The main differences between the standard LSQR algorithm as given in [25] and algorithm PCG given above are steps d') and e') and the inclusion of W.

One problem with the above algorithm is that once a variable becomes 0, it will never increase. Thus, although the algorithm will terminate in a finite number of steps, there is no guarantee that if x_j = 0, f(x) could not be further decreased by letting x_j become positive again. One way around this problem is to check on termination whether ∂f/∂x_j is positive for all x_j = 0, and if not, restart the algorithm. If x_j has the largest negative gradient of all x's that are 0, reset w_jj to max_i x_i. In theory, allowing only one element to change and checking the sign of the gradient only after termination of the inner PCG algorithm guarantees convergence in a finite number of steps. In practice, one usually checks the gradient for zero x's in the inner loop and restarts immediately.

Of course, one never sees the power of the PCG approach if there is a restart every iteration. In our examples in Section IV, a restart, because a constraint was hit, occurred about every six iterations. The EM algorithm for maximum likelihood can, in principle, be accelerated just as the EM algorithm for LS is accelerated in PCG. The problem we encountered in practice was that for each iteration, a restart was necessary so that the conjugate gradient inner iteration was never activated. As reported in [15], this can be overcome by instituting a bent line approach initially so that one momentarily permits infeasibility and then sets negative elements to 0. This returns us to the old problem, however, of trying to determine which elements are "0" rather than letting the EM algorithm itself do it.

The PCG algorithm given above is similar to the algorithm given by Kawata and Nalcioglu [17] (KN), but the differences are very important. In the KN algorithm, the elements of the diagonal scaling matrix force one to decide whether an element is small or 0, which can be numerically difficult and rather unnecessary. The smoothness with which the nonnegativity constraint is handled in the EM algorithm does not appear in the KN algorithm. However, there is one problem with the EM algorithm, and its variants like PCG above, that the KN algorithm overcomes. When s_b > 0 or there is no fear of hitting the boundary, it would be nice not to scale the search direction by x_b.
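The restart logic and the LSQR recurrences above are intricate; the sketch below shows only the underlying idea in NumPy terms: conjugate gradient on the normal equations, preconditioned by W = diag(x) and restarted with a fresh W whenever the bounded step hits a constraint. It is a simplified stand-in for algorithm PCG (it uses the textbook CG recurrence rather than LSQR), and the names and iteration limits are illustrative assumptions.

```python
import numpy as np

def pcg_nonneg_ls(x, P, eta, outer=10, inner=20, eps=1e-12):
    """Simplified PCG for min 0.5 ||P^T x - eta||^2 subject to x >= 0.

    The preconditioner W = diag(x) is frozen between restarts; a restart (with a new W)
    is triggered whenever the step had to be cut to stay nonnegative."""
    Q = lambda v: P @ (P.T @ v)          # normal-equations operator Q = P P^T
    b = P @ eta
    for _ in range(outer):
        W = np.maximum(x, eps)           # diagonal preconditioner from the current iterate
        r = b - Q(x)                     # r = -grad f(x)
        z = W * r
        p = z.copy()                     # first direction is the (scaled) EM-like step
        rz = r @ z
        for _ in range(inner):
            q = Q(p)
            a_cg = rz / (p @ q)          # unconstrained CG step length
            neg = p < 0
            a_max = np.min(-x[neg] / p[neg]) if np.any(neg) else np.inf
            a = min(a_cg, a_max)
            x = x + a * p
            if a < a_cg:                 # hit a constraint: restart with a new preconditioner
                break
            r = r - a * q
            z = W * r
            rz_new = r @ z
            p = z + (rz_new / rz) * p    # scaled gradient plus a multiple of the previous direction
            rz = rz_new
    return x
```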



The EM algorithm is very sensitive to the initial guess, and starting with a random start is disastrous, whether it be in the ML setting or the LS setting. Little improvement is made in those components which are very small initially but should be relatively large. The picture is particularly speckled. An initial homogeneous guess is a good starting point for w_ii = x_i. In a multigrid or an adaptive grid situation, one may want to use as an initial guess information that might come from a coarser grid. With the EM approach, if one is not very careful, the coarser grid tends to bleed through and give unnecessary artifacts.

There is another change which one may wish to make with the W matrix in the PCG algorithm. When the problem is underdetermined, which could be the case in tomography, the conjugate gradient algorithm seeks out the solution with the least norm if the initial guess is 0. Even in our situation with the preconditioner changing, theoretically, in a roundoff-error-free environment, the iterates will always lie in the range space of P if the starting guess is 0. However, starting with an initial guess of 0 is not an option given in our current PCG scheme with the W matrix defined above. If initially w_ii is set to 1 if y_i > 0, then one could take advantage of this property.

In order to take advantage of a penalty smoothing term as in (2.14), the PCG algorithm may be reworked as follows.

PCG-Penalized:
1) Determine x^{(1)}. Set k to 1.
2) Set d = η − P^T x^{(k)}. Set W to the diagonal matrix containing x^{(k)}.
3) Set y = Pd.
4) Let g = y − ∇U(x^{(k)}).
5) Set ρ = g^T W g.
6) Set s^{(k)} = Wg.
7) Until convergence iterate:
   a) Determine θ, the minimum with respect to α of f(x^{(k)} + αs^{(k)}) + U(x^{(k)} + αs^{(k)})
   b) Set α = min(θ, min_{s_b^{(k)} < 0}(−x_b^{(k)}/s_b^{(k)}))
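Following the pattern of steps 2)-6) above, the penalty only changes how the (negative) gradient is formed before scaling by W; a hedged sketch, assuming a user-supplied grad_U built, for example, from the illustrative pairwise derivatives sketched after (2.17):

```python
import numpy as np

def penalized_direction(x, P, eta, grad_U):
    """Scaled (EM-like) descent direction for f(x) + U(x), as in steps 2)-6) of PCG-Penalized.

    grad_U : callable returning the gradient of the penalty U at x, e.g. gamma times the
             sum of the pairwise derivatives over each box's neighborhood N_j."""
    g = P @ (eta - P.T @ x) - grad_U(x)   # negative gradient of f + U
    s = x * g                             # s = W g with W = diag(x)
    return g, s
```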