Dealing with Flat Likelihood Functions


Alejandro C. Frery
Departamento de Tecnologia da Informação, Universidade Federal de Alagoas, Campus A. C. Simões, BR 104 Norte km 14, Bloco 12, 57072-970 Maceió, AL, Brazil. [email protected]. Tel: +55 (82) 214-1415, Fax: +55 (82) 214-1615.

Francisco Cribari-Neto and Marcelo O. de Souza
Departamento de Estatística, CCEN, Universidade Federal de Pernambuco, Cidade Universitária, 50740-540 Recife, PE, Brazil. [email protected], [email protected]. Tel: +55 (81) 3271-8420.

Proposed area: Econometrics


Abstract

This paper deals with numerical problems arising when performing maximum likelihood parameter estimation in situations where the likelihood function is not concave enough to allow direct optimization. The technique proposed here is discussed for the estimation of the parameters of the G_A^0 distribution, a law that arises in the modelling of speckled imagery. The properties of speckle noise are well described by the Multiplicative Model, a flexible statistical framework from which several important distributions stem. The literature reports that techniques for obtaining estimates of the parameters of the G_A^0 distribution (maximum likelihood, based on moments and on order statistics) require samples of hundreds, even thousands, of observations in order to yield sensible values. This is verified for maximum likelihood estimation, and a proposal based on alternated optimization is made to alleviate this situation. The proposal is assessed with real and simulated data. A Monte Carlo experiment is devised to evaluate the accuracy of maximum likelihood estimators in small samples, and real data is successfully analyzed with the proposed alternated procedure. A detailed discussion of the behavior of optimization routines commonly used in statistical and econometric analysis is presented. Stylized empirical influence functions are computed and used to choose a maximum likelihood estimator that is resistant to outliers. The results presented here can be applied to other situations of interest.

Keywords: Image analysis, likelihood, maximum likelihood estimation, nonlinear optimization, small samples.

1 Introduction

The curvature of likelihood functions is regarded as a very important feature in parameter estimation. In particular, it can be related to the influence that observations have on the resulting estimated values (Poon & Poon 1999). The chief purpose of this paper is to propose an iterated algorithm that aims at solving the problem of lack of convergence of the usual optimization algorithms when applied to flat likelihood functions. Though the discussion and the results concentrate on a distribution that arises in the modelling of speckled data (the G_A^0 law), the proposal can be applied to other situations where the lack of convergence arises.

Ultrasound, sonar, laser and synthetic aperture radar images are formed by active sensors (since they carry their own source of illumination) that send and retrieve signals whose phases are recorded. The imagery is formed by detecting the echo from the target, and in this process a noise is introduced due to interference phenomena. This noise departs from classical hypotheses: it is not Gaussian in most cases, and it is not added to the true signal. Classical techniques derived from the assumption of additive noise with Gaussian distribution may lead to suboptimal procedures, or to the complete failure of the processing and analysis of the data. Several models have been proposed in the literature to cope with this departure from classical assumptions, the K and G_A^0 distributions being the most successful ones. These are parametric models, so inference takes on a central role. In many applications inference based on sample moments is used but, whenever possible, maximum likelihood estimators are preferred due to their optimal asymptotic properties. Oliver & Quegan (1998) provide a comprehensive introduction to the subject of Synthetic Aperture Radar (SAR) image processing and analysis, and Mejail, Jacobo-Berlles, Frery & Bustos (2003) present an application of parameter estimation to image classification.
Since the family of G_A^0 laws is regarded as a universal model for speckled imagery, our work focuses on maximum likelihood inference of the parameters that index this distribution. The literature reports severe numerical problems when traditional nonlinear optimization algorithms are used to obtain such estimates. The solution commonly proposed consists of using large samples, even though small samples are desirable for minute feature analysis and for techniques that do not introduce unacceptable blurring. Two systems of ML equations can be derived for this model, and the choice between them is made by computing the empirical stylized influence functions and selecting the one that is least sensitive to outliers. This paper evaluates the performance of several classical optimization techniques for maximum likelihood parameter estimation in the G_A^0 model, showing that none of them is reliable for practical applications with


small samples. A proposal based on alternated optimization of the reduced log-likelihood function is then made and assessed with real and simulated data.

Our numerical results show that classical algorithms fail to converge in almost 9,000 out of 80,000 samples (around 11% failure) when performing maximum likelihood estimation for the G_A^0 model. With the same samples, the proposed algorithm does not fail at all, i.e., it yields sensible estimates for all 80,000 samples. When we use real data extracted from a SAR image with squared windows of size 3, the classical approach fails to produce sensible results in up to nearly 70% of the samples, whereas the algorithm we propose always returns estimates. The computational platform is Ox, well known for its numerical soundness.

The remainder of the paper unfolds as follows. Section 2 presents the main properties of the G_A^0 model, our main object of interest. Section 3 recalls the main algorithms used for maximum likelihood estimation, with special emphasis on their availability in the Ox platform. Once it is verified that these algorithms oftentimes fail to produce acceptable estimates, Section 4 describes and assesses a proposed optimization algorithm. Applications are discussed in Section 5. Finally, conclusions and future research directions are listed in Section 6.

2 The Universal Model

As proposed and assessed by Frery, Müller, Yanasse & Sant'Anna (1997) and Mejail, Frery, Jacobo-Berlles & Bustos (2001), G_A^0 distributions can be successfully used to describe data contaminated by speckle noise. This family of distributions stems from making the following assumptions about the signal formation:

1. The observed data (return) can be described by the random variable Z = XY, where the independent random variables X and Y describe the ground truth and the speckle noise, respectively.

2. The random variable X: Ω → R₊ follows the square root of the reciprocal of a gamma law, characterized by the density

f_X(x) = \frac{2}{\gamma^{\alpha}\,\Gamma(-\alpha)}\, x^{2\alpha-1} \exp\left(-\frac{\gamma}{x^{2}}\right) \mathbb{I}_{\mathbb{R}_+}(x),   (1)

where (α, γ) ∈ (R₋ × R₊), I_A denotes the indicator function of the set A, and Γ is the gamma function.

3. The random variable Y obeys the square root of gamma distribution, whose density is

f_Y(y) = \frac{2 L^{L}}{\Gamma(L)}\, y^{2L-1} \exp\left(-L y^{2}\right) \mathbb{I}_{\mathbb{R}_+}(y),   (2)

where L ≥ 1 is the (equivalent) number of looks, a parameter that can be controlled in the image generation process and, therefore, will be considered known.

The distribution characterized by equation (1) describes properties of the terrain, while the one in equation (2) models the speckle noise. Under these assumptions, the density of Z is given by

f_Z(z) = \frac{2 L^{L}\,\Gamma(L-\alpha)}{\gamma^{\alpha}\,\Gamma(L)\,\Gamma(-\alpha)}\, \frac{z^{2L-1}}{(\gamma + L z^{2})^{L-\alpha}}\, \mathbb{I}_{\mathbb{R}_+}(z),   (3)

where α, γ are the (unknown) parameters. The properties of this distribution, denoted G_A^0(α, γ, L), are discussed by Frery, Müller, Yanasse & Sant'Anna (1997), Mejail, Jacobo-Berlles, Frery & Bustos (2000) and Mejail et al. (2001); moments of order r are given by

E(Z^r) = \left(\frac{\gamma}{L}\right)^{r/2} \frac{\Gamma(-\alpha - r/2)\,\Gamma(L + r/2)}{\Gamma(-\alpha)\,\Gamma(L)}   (4)

if α < −r/2, and are not finite otherwise. The mean and variance of a G_A^0(α, γ, L) distributed random variable can be computed using equation (4), yielding

\mu_Z = \sqrt{\frac{\gamma}{L}}\, \frac{\Gamma(L + 1/2)\,\Gamma(-\alpha - 1/2)}{\Gamma(L)\,\Gamma(-\alpha)},

\sigma_Z^2 = \frac{\gamma \left[ L\,\Gamma^2(L)\,(-\alpha - 1)\,\Gamma^2(-\alpha - 1) - \Gamma^2(L + 1/2)\,\Gamma^2(-\alpha - 1/2) \right]}{L\,\Gamma^2(L)\,\Gamma^2(-\alpha)},

provided α < −1/2 and α < −1, respectively.
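As a sanity check on equations (3) and (4), the density and moments can be implemented directly. The following Python/SciPy sketch (ours, not the paper's Ox code) verifies numerically that the density integrates to one and that the numerically computed first moment matches equation (4):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

# Density of the G_A^0(alpha, gamma, L) law, equation (3),
# assembled on the log scale for numerical stability.
def ga0_pdf(z, alpha, gamma, L):
    logc = (np.log(2.0) + L * np.log(L) + gammaln(L - alpha)
            - alpha * np.log(gamma) - gammaln(L) - gammaln(-alpha))
    return np.exp(logc + (2 * L - 1) * np.log(z)
                  - (L - alpha) * np.log(gamma + L * z ** 2))

# Moment of order r, equation (4); finite only when alpha < -r/2.
def ga0_moment(r, alpha, gamma, L):
    return (gamma / L) ** (r / 2) * np.exp(
        gammaln(-alpha - r / 2) + gammaln(L + r / 2)
        - gammaln(-alpha) - gammaln(L))

alpha, gamma, L = -3.0, 2.0, 1.0  # illustrative parameter values
total, _ = quad(ga0_pdf, 0, np.inf, args=(alpha, gamma, L))
mean_num, _ = quad(lambda z: z * ga0_pdf(z, alpha, gamma, L), 0, np.inf)
print(abs(total - 1) < 1e-6, abs(mean_num - ga0_moment(1, alpha, gamma, L)) < 1e-6)
```

Here α = −3 keeps both the mean and the variance finite; for α ≥ −1/2 the first moment diverges, in agreement with the condition stated below equation (4).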

As previously noticed, in many applications estimators for (α, γ) are derived using moment equations. When the first and second moments are used, severe numerical instabilities often appear and only samples from laws with α < −1 can be analyzed. The dependence of this distribution on the parameter α < 0 can be seen in Figure 1. It is noticeable that the larger the value of α the more asymmetric and the heavier-tailed the density.

Figure 1: Densities of the G_A^0(α, 10, 1) distribution, with α ∈ {−5, −2, −1} (solid line, dash-dot, dashes).

If Z follows the G_A^0(α, γ, L) distribution, then its cumulative distribution function is given by

F_Z(z) = \frac{L^{L}\,\Gamma(L-\alpha)\, z^{2L}}{\gamma^{L}\,\Gamma(L+1)\,\Gamma(-\alpha)}\, H(L, L-\alpha; L+1; -L z^{2}/\gamma),   (5)

with z > 0, where

H(a, b; c; t) = \frac{\Gamma(c)}{\Gamma(a)\,\Gamma(b)} \sum_{k=0}^{\infty} \frac{\Gamma(a+k)\,\Gamma(b+k)}{\Gamma(c+k)\, k!}\, t^{k}

is the hypergeometric function. Equation (5) can also be written as

F_Z(z) = \Upsilon_{2L,\,-2\alpha}\!\left(\frac{-\alpha z^{2}}{\gamma}\right),   (6)

where Υ_{2L,−2α} is the cumulative distribution function of Snedecor's F law with 2L and −2α degrees of freedom. This form is useful for the following reasons:

1. The cumulative distribution function of a G_A^0(α, γ, L) random variable, needed to perform the Kolmogorov–Smirnov test and to work with order statistics, can be computed using relation (6) and the Υ_{·,·} function, available in most statistical software platforms.

2. The function Υ_{·,·}^{−1} is also available in most statistical platforms, so outcomes of the random variable Z ∼ G_A^0(α, γ, L) can be obtained using this inverse function, returning outcomes of Z = (−γ Υ_{2L,−2α}^{−1}(U)/α)^{1/2}, with U uniformly distributed on (0, 1). This was the method employed in the forthcoming Monte Carlo experiment.
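In any platform that exposes the F quantile function, this inversion is a one-liner. The sketch below (Python/SciPy rather than the paper's Ox) draws G_A^0 variates that way; the chosen parameters are illustrative, not from the paper:

```python
import numpy as np
from scipy.stats import f

# Outcomes of Z ~ G_A^0(alpha, gamma, L) through relation (6):
# Z = (-gamma * F^{-1}_{2L,-2alpha}(U) / alpha)^{1/2}, with U ~ U(0, 1).
def ga0_sample(n, alpha, gamma, L, rng):
    u = rng.uniform(size=n)
    return np.sqrt(-gamma * f.ppf(u, 2 * L, -2 * alpha) / alpha)

rng = np.random.default_rng(1234)
z = ga0_sample(100_000, alpha=-3.0, gamma=2.0, L=1, rng=rng)
# Equation (4) gives a theoretical mean of about 0.833 for these parameters.
print(z.min() > 0, abs(z.mean() - 0.833) < 0.01)
```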


A desirable feature of the distribution characterized by equation (3) is that its parameters are interpretable: γ is a scale parameter and α is related to the roughness of the target. Small values of α (say α < −10) describe smooth regions such as, for instance, crops and burnt fields. When α is close to zero (say α > −5) the observed target is extremely rough, as is the case of urban spots. Intermediate situations (−10 < α < −5) are usually related to rough areas such as, for instance, forests. The equivalent number of looks L is known beforehand or is estimated for the whole image using extended targets, i.e., very large samples. Note that estimating (α, γ) amounts to making inference about the unobservable ground truth X.

The densities of the G_A^0(−2.5, 7.0686/π, 1) and the Gaussian N(1, 4(1.1781 − π/4)/π) distributions are shown in Figure 2, along with their common mean. The parameters were chosen so that the two laws have the same mean and variance. The different decays of their tails are evident: the former decays logarithmically, while the latter decays quadratically. This behavior ensures the ability of the G_A^0 distribution to model data with extreme variability but, at the same time, the slow decay is prone to producing problems when performing parameter estimation.

Figure 2: Densities of the G_A^0(−2.5, 7.0686/π, 1) (solid line) and the N(1, 4(1.1781 − π/4)/π) (dashes) distributions in semilogarithmic scale.

Systems that employ coherent illumination are used to survey inaccessible and/or unobservable regions (the interior of the human body, the bottom of the sea, areas under cloud cover, etc.). It is, therefore, of paramount importance to be able to make reliable inference about the kind of target under analysis, since visual information is seldom available. Inference can be performed through the estimation of the parameters (α, γ) ∈ Θ = (R₋ × R₊) from samples z = (z₁, …, z_n) taken from homogeneous areas, in order to guarantee that the observations come from identically distributed populations. The larger the sample size, in principle, the more accurate the estimation but, also, the bigger the chance of including spurious observations. If the goal is to perform some kind of image processing or enhancement (Bustos, Lucini & Frery 2002, Frery, Sant'Anna, Mascarenhas & Bustos 1997), as is the case of filtering based on distributional properties, large samples obtained with large windows usually cause heavy blurring. It is noteworthy that inference with small samples is gaining attention in the specialized literature (see, for instance, Rousseeuw & Verboven 2002), and reliable inference using small samples is the core contribution of the present paper.


2.1 Inference techniques

Usual inference techniques include methods based on the analogy principle (moment and order statistics estimators being the most popular members of this class) and on maximum likelihood (Bickel & Doksum 2001). Moment estimators are favored in applications, since they are easy to derive and are usually computationally attractive. An estimator based on the median and on the first moment was successfully used by Bustos et al. (2002) as the starting point for computing maximum likelihood estimates. Maximum likelihood (ML) estimators will be considered in this work since they exhibit well known optimal properties (consistency, asymptotic efficiency, asymptotic normality, etc.). ML estimation for another model for SAR data is discussed by Joughin, Percival & Winebrenner (1993).

Given the sample z = (z₁, …, z_n), and assuming that these observations are outcomes of independent and identically distributed random variables with common distribution D(θ), with θ ∈ Θ ⊂ R^p, p ≥ 1, a ML estimator of θ is given by

\hat\theta = \arg\max_{\theta\in\Theta} L(\theta; z),   (7)

where L is the likelihood of the sample z under the parameter θ. It is typically easier to work with the reduced log-likelihood function ℓ(θ; z) ∝ ln L(θ; z), where the terms that do not depend on θ are ignored. Though direct maximization of equation (7) is possible (either analytically or using numerical tools), and oftentimes desirable, quite often one finds ML estimates by solving the system of p (usually non-linear) equations given by

\nabla \ell(\hat\theta) = 0.   (8)

This system is referred to as the likelihood equations. The choice between solving equation (7) or equation (8) heavily relies on computational issues: availability of reliable algorithms, computational effort required to implement and/or to obtain the solution, etc. These equations, in general, have no explicit solution.

In our case the likelihood function is L((α, γ); z) = \prod_{i=1}^{n} f_Z(z_i), with f_Z given in equation (3). Therefore, the reduced log-likelihood can be written as

\ell((\alpha, \gamma); z, L) = \ln\frac{\Gamma(L-\alpha)}{\gamma^{\alpha}\,\Gamma(-\alpha)} - \frac{L-\alpha}{n} \sum_{i=1}^{n} \ln(\gamma + L z_i^{2}).   (9)

The system given by equation (8) is, in our case,

n\left[\Psi(-\hat\alpha) - \Psi(L-\hat\alpha)\right] + \sum_{i=1}^{n} \ln\frac{\hat\gamma + L z_i^{2}}{\hat\gamma} = 0,   (10)

\frac{n\hat\alpha}{\hat\gamma} + (L-\hat\alpha) \sum_{i=1}^{n} \left(\hat\gamma + L z_i^{2}\right)^{-1} = 0,   (11)

where Ψ(τ) = d ln Γ(τ)/dτ is the digamma function. No explicit solution for this system is available in general and, therefore, numerical routines have to be used. The single look case (L = 1) is an important special situation for which a deeper analytical analysis is performed and presented in section 2.2.

Figure 3 shows a typical situation. A sample of size n = 9 from the G_A^0(−8, γ*, 3) law was generated, and the log-likelihood function of this sample is shown. The parameter γ* is chosen such that the expected value is one (see equation (17)). It is noticeable that finding the maximum of this function (provided it exists) is not an easy task due to the almost flat area it displays around the candidates. The ML estimates for this sample were (α̂, γ̂) = (−1.84, 1.44). The same sample will be revisited in section 4, when we analyze the proposed estimation procedure.
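The likelihood equations can be checked against equation (9) by finite differences. The Python sketch below (ours, with made-up sample values, not data from the paper) verifies that equations (10) and (11) are, up to the factor n and one sign, the gradient of the reduced log-likelihood:

```python
import numpy as np
from scipy.special import gammaln, digamma

# Reduced log-likelihood, equation (9).
def ell(alpha, gamma, z, L):
    return (gammaln(L - alpha) - alpha * np.log(gamma) - gammaln(-alpha)
            - (L - alpha) / z.size * np.sum(np.log(gamma + L * z ** 2)))

# Left-hand sides of equations (10) and (11); both vanish at the ML estimate.
def likelihood_eqs(alpha, gamma, z, L):
    n = z.size
    eq10 = (n * (digamma(-alpha) - digamma(L - alpha))
            + np.sum(np.log((gamma + L * z ** 2) / gamma)))
    eq11 = n * alpha / gamma + (L - alpha) * np.sum(1.0 / (gamma + L * z ** 2))
    return eq10, eq11

z = np.array([0.8, 1.1, 0.5, 1.7, 0.9, 1.3, 0.6, 1.0, 1.4])  # hypothetical sample
L, a, g, h = 1.0, -3.0, 2.0, 1e-6
eq10, eq11 = likelihood_eqs(a, g, z, L)
d_alpha = (ell(a + h, g, z, L) - ell(a - h, g, z, L)) / (2 * h)
d_gamma = (ell(a, g + h, z, L) - ell(a, g - h, z, L)) / (2 * h)
# (10) equals n * d(ell)/d(alpha); (11) equals minus n * d(ell)/d(gamma).
print(abs(eq10 - z.size * d_alpha) < 1e-4, abs(eq11 + z.size * d_gamma) < 1e-4)
```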

2.2 Stylized Empirical Influence Functions

Two sets of solutions can be obtained from the system formed by equations (10) and (11). The choice between them will be made by studying the behavior of estimates of α when a single observation varies in R₊. In order to perform an analytical analysis, the single look case (L = 1) will be considered.

As presented by Rousseeuw & Verboven (2002), under very general conditions, a convenient tool for assessing the robustness of an estimator θ̂ based on n independent samples is its 'empirical influence function' (EIF). This quantity describes the behaviour of the estimator when a single observation varies freely. For the univariate sample z = (z₁, …, z_{n−1}) the EIF of the estimator θ̂ is given by

\mathrm{EIF}(z; \mathbf{z}) = \hat\theta(\mathbf{z}, z),   (12)

Figure 3: Log-likelihood function of a sample of size n = 9 from the G_A^0(−8, γ*, 3) law.

where z ranges over the whole support of the underlying distribution. In order to avoid the dependence of equation (12) on the n − 1 observations z, an artificial and 'typical' sample can be formed using the n − 1 quantiles of the distribution of interest. The sample z_i will then be replaced by the quantile z_i* = F^−((i − 1/3)/(n − 2/3)) for every 1 ≤ i ≤ n − 1, where F^−(t) = inf{x ∈ R : F(x) ≥ t} is the generalised inverse cumulative distribution function. This yields the 'stylised empirical influence function' (SEIF). Denoting the vector of n − 1 quantiles as z* = (z_i*)_{1≤i≤n−1}, one has

\mathrm{SEIF}(z; \mathbf{z}^{*}) = \hat\theta(\mathbf{z}^{*}, z),

with z ranging over the whole support of the underlying distribution. When the random variable is continuous, F^− is replaced by F^{−1}, the inverse cumulative distribution function.

For the single-look case the cumulative distribution function of a G_A^0(α, γ, 1) distributed random variable reduces from equation (5) to F_Z(t) = 1 − (1 + t²/γ)^α, with inverse F_Z^{−1}(t) = (γ((1 − t)^{1/α} − 1))^{1/2}.

The likelihood equations for a sample of size n, assuming G_A^0(α, γ, 1) independent and identically distributed random variables, are, therefore,

n\left(\ln\hat\gamma - \frac{1}{\hat\alpha}\right) = \sum_{i=1}^{n} \ln(\hat\gamma + z_i^{2}),   (13)

\frac{n\hat\alpha}{\hat\gamma} = (\hat\alpha - 1) \sum_{i=1}^{n} (\hat\gamma + z_i^{2})^{-1}.   (14)

We can form two systems of estimation equations. The first is obtained by taking α̂ out of equation (14),

\hat\alpha_1 = \left[1 - \frac{n}{\hat\gamma \sum_{i=1}^{n} (\hat\gamma + z_i^{2})^{-1}}\right]^{-1},   (15)

and plugging equation (15) into equation (13) to obtain γ̂₁. The second system is built by taking α̂ out of equation (13),

\hat\alpha_2 = -\left[\frac{1}{n} \sum_{i=1}^{n} \ln\frac{\hat\gamma + z_i^{2}}{\hat\gamma}\right]^{-1},   (16)

and plugging equation (16) into equation (14) to obtain γ̂₂.

Since the estimation of the roughness parameter is of paramount importance, in what follows only results regarding inference on α will be assessed. The SEIF will be computed for the estimators given in equations (15) and (16), assuming γ = 1. These stylized empirical influence functions will be referred to as 'SEIF1' and 'SEIF2', respectively. They are given by

\mathrm{SEIF1}(z) = \left[1 - \frac{n}{\sum_{i=1}^{n-1} \left(\frac{n-2/3}{n-i-1/3}\right)^{1/\alpha} + \frac{1}{1+z^{2}}}\right]^{-1}

and

\mathrm{SEIF2}(z) = -\left[\frac{1}{n} \left(\frac{1}{\alpha} \sum_{i=1}^{n-1} \ln\frac{n-i-1/3}{n-2/3} + \ln(1+z^{2})\right)\right]^{-1};

in both cases z ∈ R₊.

Figure 4 shows the functions SEIF1 and SEIF2 (first and second columns, respectively) for α = −1 with varying sample size (first row) and for samples of size 9 and varying α (second row). In the first row, n = 9 is shown with a solid line, n = 25 with dashes and n = 49 with dots. The second row depicts the situations α = −1 (solid line), α = −3 (dashes) and α = −5 (dots). It is readily seen that SEIF1 is less sensitive than SEIF2 to variations of the observation z ∈ R₊. This behavior is consistent when both α and the sample size n vary, and it was also observed with other values of L. We therefore chose to work with the system of equations formed by taking α̂ out of equation (11) and then plugging it into equation (10) in order to compute γ̂.

Figure 4: Functions SEIF1 (left) and SEIF2 (right) for n ∈ {9, 25, 49} with α = −1 (first row), and for α ∈ {−1, −3, −5} with n = 9 (second row).
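Under the single-look, γ = 1 assumptions above, the two stylized empirical influence functions are straightforward to compute. The following Python sketch (ours, not the paper's Ox code) reproduces the qualitative comparison: SEIF1 moves over a narrower range than SEIF2 as the free observation z grows.

```python
import numpy as np

# SEIF of the estimator alpha_hat_1 (equation (15)), with L = 1, gamma = 1.
def seif1(z, alpha, n):
    i = np.arange(1, n)  # indices of the n - 1 stylized quantiles
    s = np.sum(((n - 2 / 3) / (n - i - 1 / 3)) ** (1 / alpha)) + 1 / (1 + z ** 2)
    return 1 / (1 - n / s)

# SEIF of the estimator alpha_hat_2 (equation (16)), with L = 1, gamma = 1.
def seif2(z, alpha, n):
    i = np.arange(1, n)
    t = (np.sum(np.log((n - i - 1 / 3) / (n - 2 / 3))) / alpha
         + np.log(1 + z ** 2)) / n
    return -1 / t

zs = np.linspace(0.1, 50, 200)
r1 = np.ptp([seif1(z, -1.0, 9) for z in zs])  # spread of SEIF1 values
r2 = np.ptp([seif2(z, -1.0, 9) for z in zs])  # spread of SEIF2 values
print(r1 < r2)
```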


3 Algorithms for Inference

The routines reported here were used as provided by the Ox platform, a robust, fast, free and reliable matrix-oriented language with excellent numerical capabilities and support for object orientation. This platform is available for a variety of operating systems (Doornik 2001).

Two categories of routines were tested: those devoted to direct maximization (or minimization), referred to as optimization procedures, and those that look for the solution of systems of equations. In the first category the Simplex Downhill, Newton-Raphson and Broyden-Fletcher-Goldfarb-Shanno (generally referred to as 'the BFGS method') algorithms were used to maximize equation (9). In the second category the Broyden algorithm was used to find the roots of the system given in equations (10) and (11). These routines impose different requirements for their use. Some require or accept the derivatives of the function to be maximized, while others try to perform their tasks with mere evaluations of the target function. In our case, BFGS plays a central role, and it was used with analytical first derivatives of the objective function.

Since the main goal of this work is to find suitable solutions, all routines were tested following the guidelines provided with the Ox platform: a variety of tuning parameters, starting points, steps and convergence criteria were employed. The results confirmed what is reported in the literature, namely, that inference for the G_A^0 law requires huge samples in order to converge and deliver sensible estimates.

The analysis was performed using samples of size n ∈ {9, 25, 49, 81, 121}, roughness parameters α ∈ {−1, −3, −5, −15} and looks L ∈ {1, 2, 3, 8}. The scale parameter γ was chosen to yield unit mean, so it was set to

\gamma^{*} = L \left( \frac{\Gamma(L)\,\Gamma(-\alpha)}{\Gamma(L + 1/2)\,\Gamma(-\alpha - 1/2)} \right)^{2}.   (17)

The sample sizes considered reflect the fact that most image processing techniques employ estimation in square windows of odd side and, therefore, samples of size n = s² were analyzed, where s is the side of the window. Windows of sides 3, 5, 7, 9 and 11 are commonly used in practical applications. The roughness parameter describes regions with a wide range of smoothness, as discussed in section 2. The number of looks also reflects situations of practical interest, ranging from raw images (L = 1) to smoothed out data with L = 8. It is convenient to note here that the bigger the number of looks the smoother the image, at the expense of less spatial resolution.

One thousand replications were performed for each of these eighty situations. In each of them, we generated samples with the specified parameters and then applied the four algorithms. Success (convergence to a point and numerical evidence of convergence to either a maximum or a root) or failure to converge was recorded, and specific situations of both outcomes were traced out. Table 1 shows the percentage of times (in 1,000 independent trials) that the BFGS and Simplex algorithms failed to converge in each of the eighty aforementioned situations. The larger the sample size the better the performance, and the smoother the target the worse the convergence rate. In nearly 9,000 out of 80,000 situations the algorithms did not converge, and in the worst case (n = 9, α = −15 and L = 1) about sixty percent of the samples were left unanalyzed, i.e., no sensible estimate was obtained. Similar (mostly worse) behaviour is observed with the other algorithms, and it is noteworthy that all of them were fine-tuned for the problem at hand.

The overall behaviour of these algorithms falls into one of three situations, namely:

1. All of them converge to the same (sensible) estimate.
2. All of them converge, but not to the same value.
3. At least one algorithm fails to converge.
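The choice of γ* can be verified numerically. A brief sketch (ours) checks, via equation (4), that γ* indeed gives unit mean for every (α, L) pair used in the experiment:

```python
import numpy as np
from scipy.special import gammaln

# Scale parameter gamma* of equation (17), chosen so that E(Z) = 1.
def gamma_star(alpha, L):
    return L * np.exp(2 * (gammaln(L) + gammaln(-alpha)
                           - gammaln(L + 0.5) - gammaln(-alpha - 0.5)))

# First moment of the G_A^0(alpha, gamma, L) law, from equation (4).
def ga0_mean(alpha, gamma, L):
    return np.sqrt(gamma / L) * np.exp(gammaln(L + 0.5) + gammaln(-alpha - 0.5)
                                       - gammaln(L) - gammaln(-alpha))

ok = all(abs(ga0_mean(a, gamma_star(a, L), L) - 1) < 1e-12
         for a in (-1.0, -3.0, -5.0, -15.0) for L in (1, 2, 3, 8))
print(ok)
```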
In order to illustrate these behaviours, two G_A^0 samples were chosen, one leading to situation 1 above (denoted z₁) and the other to situation 2 (denoted z₂). For each sample the likelihood function was computed, and level curves of this function and of the maximum likelihood equations were analyzed. Situation 1 is illustrated in Figure 5, where it is noticeable that the point of convergence of the Broyden algorithm (denoted '∗') is in the interior of the highest level curve. This point coincides with the intersection of the curves corresponding to ∂ℓ/∂α = ∂ℓ/∂γ = 0 and, regardless of the precision of the estimation procedure, is an acceptable estimate. Similarly, situation 2 is illustrated in Figure 6. In this case the point to which the Broyden algorithm converges is outside the highest level curve and, thus, does not correspond to the maximum of the likelihood function.


Table 1: Percentage of situations for which BFGS and Simplex fail to converge in 1,000 replications.

                      BFGS, n                        Simplex, n
 L    α      9    25    49    81   121       9    25    49    81   121
 1  −15   59.9  48.2  36.2  27.8  25.2    65.2  54.0  42.1  35.2  33.3
 1   −5   52.6  30.1  14.5   8.6   3.9    56.9  34.9  19.1  12.5   6.1
 1   −3   42.3  19.1   6.1   1.5   0.4    47.8  22.9   7.9   1.8   0.4
 1   −1   17.6   1.0   0.1   0.0   0.0    17.8   0.9   0.0   0.0   0.0
 2  −15   51.9  35.4  25.8  16.2  11.4    57.6  41.2  31.2  21.2  15.8
 2   −5   37.7  13.5   5.4   1.7   0.2    40.6  17.0   7.2   1.9   0.3
 2   −3   25.0   5.4   0.4   0.0   0.0    28.1   6.3   0.9   0.0   0.0
 2   −1    4.6   0.0   0.0   0.0   0.0     5.5   0.0   0.0   0.0   0.0
 3  −15   46.5  28.7  16.6   9.9   7.1    50.6  34.5  19.6  12.5   8.4
 3   −5   28.1   7.9   1.4   0.1   0.0    29.8  10.0   1.5   0.1   0.0
 3   −3   17.4   2.3   0.0   0.0   0.0    18.9   2.6   0.0   0.0   0.0
 3   −1    2.1   0.0   0.0   0.0   0.0     2.7   0.0   0.0   0.0   0.0
 8  −15   31.2   9.1   2.3   0.8   0.2    34.9  10.9   2.9   1.4   0.3
 8   −5    8.2   0.3   0.0   0.0   0.0     9.6   0.5   0.0   0.0   0.0
 8   −3    2.9   0.0   0.0   0.0   0.0     2.9   0.0   0.0   0.0   0.0
 8   −1    0.1   0.0   0.0   0.0   0.0     0.1   0.0   0.0   0.0   0.0

The Broyden algorithm seemed to have the best performance, since it often reported convergence. But when at least two of the other algorithms converged, most of the time they did so to the same point, whereas Broyden frequently stopped very far from it. When checking the value of the likelihood at the solutions, the one computed by Broyden was orders of magnitude smaller than the one found by the maximization techniques. For this reason, though Broyden allegedly outperformed the optimization procedures in terms of convergence, it was considered unreliable for the application at hand. This behaviour motivated the proposal of an algorithm able to converge to sensible estimates. This will be done in the next section.

4 Proposal: alternated optimization

Since simultaneous optimization is not reliable enough, an analysis of the marginal functions to be maximized was conducted for a variety of situations. In all these situations it was checked that, whereas the reduced log-likelihood function showed flat regions, where simultaneous optimization may get lost or stuck, these surfaces could be sliced in order to yield better-behaved functions. This motivated the proposal of an alternated algorithm that consists of writing two equations out of equation (9): one depending on α, given γ fixed, and the other depending on γ, given α fixed. Provided a starting point for γ, say γ̂(0), one maximizes the first equation on α to find α̂(1). One can now use this crude estimate of α to maximize the second equation on γ, and continue alternating until evidence of convergence is achieved. The equations to be maximized are

\ell_1(\alpha; \gamma(j), \mathbf{z}) = \ln \frac{\Gamma(L-\alpha)}{(\gamma(j))^{\alpha}\,\Gamma(-\alpha)} + \frac{\alpha}{n} \sum_{i=1}^{n} \ln(\gamma(j) + L z_i^{2})   (18)

and

\ell_2(\gamma; \alpha(j), \mathbf{z}) = -\alpha(j) \ln\gamma - \frac{L - \alpha(j)}{n} \sum_{i=1}^{n} \ln(\gamma + L z_i^{2}).   (19)

In practice, equation (18) always showed excellent behaviour, while equation (19) presented flat areas in a few situations (in 6 out of the 80,000 samples analyzed in Table 1). In these situations, though, varying the value of α(j) led to well-behaved and easy-to-optimize functions. Figure 7 shows the functions ℓ₁ and ℓ₂ for the same three-looks sample used in Figure 3, and a variety of values of γ and α (left and right, respectively).

Algorithm 1 (Alternated optimization for parameter estimation). Read the input data z = (z_i)_{1≤i≤n} and fix the smallest acceptable variation to proceed (typically ϵ = 10⁻⁴) and the maximum number of iterations (typically M = 10³). Then:

Figure 5: Log-likelihood function for z₁: contour plots (solid lines), ∂ℓ/∂α (black dashes) and ∂ℓ/∂γ (grey dashes).

1. Compute an initial estimate of γ, for example

\hat\gamma(0) = L \left( \hat m_1 \, \frac{\Gamma(L)}{\Gamma(L + 1/2)} \right)^{2},   (20)

where \hat m_1 = n^{-1} \sum_{i=1}^{n} z_i is the first sample moment.

2. Set the values needed to execute step (3c) for the first time, ε = 10³ and α̂(0) = −10⁶, and start the counter at j = 1.

3. While ε ≥ ϵ and j ≤ M do:

(a) Find α̂(j) = arg max_{α ∈ R₋} ℓ₁(α; γ̂(j − 1), z), with ℓ₁ given in equation (18).

(b) Find γ̂(j) = arg max_{γ ∈ R} ℓ₂(γ; α̂(j), z), with ℓ₂ given in equation (19) and R ⊂ R₊ a compact set, typically R = [10⁻², 10²] · γ̂(0).

(c) Compute

\varepsilon = \left|\frac{\hat\alpha(j) - \hat\alpha(j-1)}{\hat\alpha(j)}\right| + \left|\frac{\hat\gamma(j) - \hat\gamma(j-1)}{\hat\gamma(j)}\right|,

the sum of the absolute values of the relative inter-iteration variations.

(d) Update the counter j ← j + 1.

4. If ε > ϵ, return an error message; else return the estimate (α̂(j − 1), γ̂(j − 1)) and a message of success.
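A compact rendering of Algorithm 1 in Python is given below. This is a sketch under our own choices (SciPy's bounded Brent routine instead of the univariate BFGS used in the paper, and an assumed search interval for α); the synthetic test sample is ours, not data from the paper.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar
from scipy.stats import f

def ell1(alpha, gamma, z, L):  # equation (18)
    return (gammaln(L - alpha) - alpha * np.log(gamma) - gammaln(-alpha)
            + alpha / z.size * np.sum(np.log(gamma + L * z ** 2)))

def ell2(gamma, alpha, z, L):  # equation (19)
    return (-alpha * np.log(gamma)
            - (L - alpha) / z.size * np.sum(np.log(gamma + L * z ** 2)))

def alternated_ml(z, L, eps=1e-4, max_iter=1000):
    # Step 1: crude moment-based starting point, equation (20).
    g0 = L * (z.mean() * np.exp(gammaln(L) - gammaln(L + 0.5))) ** 2
    alpha, gamma, err, j = -1e6, g0, 1e3, 1
    while err >= eps and j <= max_iter:
        # Step (3a): maximize ell1 on alpha (assumed interval for alpha).
        a_new = minimize_scalar(lambda a: -ell1(a, gamma, z, L),
                                bounds=(-50.0, -1e-3), method="bounded").x
        # Step (3b): maximize ell2 on the compact set [1e-2, 1e2] * g0.
        g_new = minimize_scalar(lambda g: -ell2(g, a_new, z, L),
                                bounds=(1e-2 * g0, 1e2 * g0), method="bounded").x
        # Step (3c): relative inter-iteration variation.
        err = abs((a_new - alpha) / a_new) + abs((g_new - gamma) / g_new)
        alpha, gamma, j = a_new, g_new, j + 1
    return alpha, gamma

# Synthetic single-look sample with known parameters, drawn by inversion (6).
rng = np.random.default_rng(7)
true_alpha, true_gamma, L = -3.0, 2.0, 1
z = np.sqrt(-true_gamma * f.ppf(rng.uniform(size=5000), 2 * L, -2 * true_alpha)
            / true_alpha)
a_hat, g_hat = alternated_ml(z, L)
print(round(a_hat, 2), round(g_hat, 2))
```

With n = 5000 the estimates should land close to the true (−3, 2); the point of the algorithm, however, is that it keeps returning sensible values even for the small samples where joint optimization fails.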


Figure 6: Log-likelihood function for z₂: contour plots (solid lines), ∂ℓ/∂α (black dashes) and ∂ℓ/∂γ (grey dashes).

Equation (20) is derived using r = 1 in equation (4) and discarding the dependence on α. It is thus a crude estimator of γ based on the first sample moment m̂₁. Other starting points were considered, and their effect on the convergence of the algorithm was negligible. Step (3b) seeks the estimate of γ in a compact set rather than in R₊ due to the aforementioned behavior of the function ℓ₂. If there is no attainable maximum in R, a new value of α(j) will be used in the next iteration and, ultimately, convergence will be achieved.

We chose to work with the BFGS algorithm in steps (3a) and (3b) since, for the univariate equations considered, it outperformed the other methods in terms of speed and convergence in the numerical evaluations we performed. The BFGS is generally regarded as the best performing method for multivariate non-linear optimization (Mittelhammer, Judge & Miller 2000). In our case the explicit analytical first derivatives of the objective function were provided, desirable information whenever available. This alternated algorithm can be easily generalized to estimate parameters with as many components as desired, and its implementation in any computational platform is immediate, provided reliable univariate optimization routines are available.

Using this algorithm there was convergence in all the 80,000 samples analyzed in Table 1, while classical procedures failed in about 9,000 situations, i.e., in about 11% of the samples (considering both good and bad situations); this is a notable improvement with respect to classical algorithms. With real data, where most of the samples are 'bad', our proposal also outperforms classical algorithms, as will be seen in the next section.

5 Application

Using Algorithm 1 it was possible to conduct a Monte Carlo experiment to evaluate the bias and mean square error of the ML estimator in a variety of situations that had to be left unexplored when using classical procedures. The results on the bias of α̂ are shown in Figure 8. The bias can be huge, confirming

previous results (Cribari-Neto, Frery & Silva 2002, Mejail et al. 2001, Mejail et al. 2003). Efforts to reduce this undesirable behavior of ML estimators are reported in (Cribari-Neto et al. 2002). Two applications were devised to show the applicability of the alternated algorithm: one with simulated 0 data and the other with a real SAR image. The former consists of generating samples from the GA (α, γ ∗ , 1) law. 0 Two-hundred and fifty samples of size n = 121 were generated, being fifty from the GA (−5, γ ∗ , 1), fifty from 0 ∗ 0 ∗ 0 the GA (−1, γ , 1), fifty from the GA (−15, γ , 1) and the remaining 100 samples from the GA (αj , γ ∗ , 1) where αj = 0.14j − 15 and 1 ≤ j ≤ 100 is the integer index. For each of these samples two algorithms were employed to obtain the ML estimates, namely the BFGS and alternated algorithms. The procedure was repeated for each sample, but using 81, 49, 25 and 9 observations out of the complete data set. In every situation the alternated algorithm achieved convergence, and the same did not hold for the BFGS algorithm. The percentage of situations for which BFGS did not converge is presented in Table 2. Again, the classical procedure is unreliable. Table 2: Situations where BFGS failed to converge. n %

121 1.6

81 4.8

49 10.8

25 19.2

9 41.2
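Assuming the usual multiplicative-model construction of G_A^0 amplitudes (the square root of an inverse-gamma backscatter times unit-mean gamma speckle; scale conventions differ across references, so this is an illustrative sketch rather than the paper's exact generator), the simulated-data design above can be set up as:

```python
import numpy as np

rng = np.random.default_rng(0)

def rga0_amplitude(alpha, gamma, n, rng, looks=1):
    """Draw n amplitude variates under one common G_A^0(alpha, gamma, L)
    construction: sqrt(inverse-gamma backscatter * unit-mean gamma speckle)."""
    backscatter = gamma / rng.gamma(shape=-alpha, scale=1.0, size=n)
    speckle = rng.gamma(shape=looks, scale=1.0 / looks, size=n)
    return np.sqrt(backscatter * speckle)

# Sampling design of the simulated-data application: 250 samples of size 121,
# fifty each with alpha in {-5, -1, -15} and one hundred with alpha_j = 0.14j - 15.
alphas = ([-5.0] * 50) + ([-1.0] * 50) + ([-15.0] * 50) \
         + [0.14 * j - 15.0 for j in range(1, 101)]
samples = [rga0_amplitude(a, 1.0, 121, rng) for a in alphas]
```

Each entry of `samples` would then be handed to both optimization procedures to compare convergence behavior.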

Figure 9 shows, for n = 25, the true value of −α (in semilogarithmic scale) along with the estimates: '×' for the alternated algorithm and '◦' for the estimates obtained with the BFGS procedure. The missing circles correspond to the situations where BFGS failed to converge (roughly 20% of the samples). It can be seen that when both algorithms converge they converge to similar values, and that there are many situations for which BFGS was unable to return an estimate. Similar behaviour is exhibited for other sample sizes: the smaller the sample, the less reliable BFGS becomes.

Figure 10 shows a SAR image obtained by the E-SAR sensor, operated by the German Aerospace Center (DLR). This is an airborne sensor with polarimetric and high spatial resolution capabilities. The scene was taken over the surroundings of München, and typical classes are marked as 'U' (Urban), 'F' (Forest) and 'C' (Crops). A hypothesized flight track is marked with the NW-SE white arrow, along which small samples are collected at every passage point. One thousand samples were collected, and they were divided into four groups of the same size for the sake of simplicity. The analysis of these on-flight samples was performed with both the BFGS and the alternated algorithm. The latter always returned estimates, while the number of samples for which the former failed to converge is reported in Table 3. Even with windows of side 11 (n = 121), almost a third of the coordinates would be left unanalyzed by the classical algorithm.

Figures 11, 12, 13 and 14 show the values of α̂ in two hundred and fifty sites using n = 121, 49, 25 and 9 observations, corresponding to groups 1, 2, 3 and 4, respectively. It can be seen that the larger the window the smoother the analysis, leading to the conclusion that most sites correspond to heterogeneous or extremely heterogeneous spots (since α̂ > −7). When the window is small, more heterogeneous areas appear (α̂ < −10). The sensed area is suburban, and typical spots consist of scattered houses and small buildings (extremely heterogeneous return) with trees and gardens in between, where SAR will return heterogeneous and homogeneous clutter, respectively. The only exception is Group 3 (Figure 13), for which the estimated roughness is consistent across all window sizes.
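The per-site analysis described above amounts to extracting a small square window around each track coordinate and feeding its flattened pixel values to the estimator. A minimal sketch follows; the function and variable names are illustrative, not from the paper's implementation:

```python
import numpy as np

def site_samples(image, coords, half):
    """Extract (2*half + 1)^2-pixel windows centred at each track coordinate,
    skipping coordinates too close to the image border."""
    out = []
    for r, c in coords:
        if half <= r < image.shape[0] - half and half <= c < image.shape[1] - half:
            win = image[r - half:r + half + 1, c - half:c + half + 1]
            out.append(win.ravel())  # n = 9, 25, 49, 81 or 121 observations
    return out

# Toy 10x10 "image" and a short hypothetical track; (0, 0) is discarded
# because a 3x3 window around it would fall outside the image.
img = np.arange(100.0).reshape(10, 10)
track = [(2, 2), (5, 5), (0, 0)]
wins = site_samples(img, track, half=1)  # 3x3 windows, n = 9
```

Each returned window would then be passed to the alternated ML procedure to obtain the per-site roughness estimate α̂.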

Table 3: Percentage of samples for which BFGS failed to converge in the four groups of real data using samples of size n.

  n      G1     G2     G3     G4
  121    20.8   21.2   39.2   32.0
  81     22.4   28.8   42.8   36.4
  49     32.4   34.8   53.6   48.0
  25     42.0   46.4   55.6   54.0
  9      52.4   65.4   69.2   65.6

The ground resolution of this sensor can be less than one meter, so minute features of about two meters on a side can be detected with the use of the alternated algorithm and the G_A^0 model.

6 Conclusions and future work

The present paper addresses the issue of numerical maximization of likelihood functions that display flat regions, for which traditional nonlinear optimization methods oftentimes fail to yield sensible ML estimates. To overcome this problem, we proposed an alternated optimization algorithm that performs well regardless of whether the likelihood function under maximization has flat regions. The motivating example used throughout the paper was maximum likelihood estimation of the parameters that index the G_A^0 distribution, which is commonly used in image analysis. The literature has shown that large samples are typically needed for G_A^0 likelihood maximization, since the optimization algorithms usually employed too often fail to converge; the alternated algorithm we proposed solves this problem. Stylized empirical influence functions were computed to choose the ML estimator that is least sensitive to outliers. Finally, it is noteworthy that the algorithm proposed in this paper can be used for likelihood estimation in other statistical and econometric models, especially those where traditional optimization algorithms fail to deliver sensible ML estimates. The soundness of the G^0 family of distributions is under assessment for applications other than speckled imagery. The alternated optimization technique has been ported to other platforms, namely R and IDL, and it is being used with success in higher-dimensional parameter spaces Θ ⊂ R^p. Parameter estimation under restrictions is immediate with the alternated algorithm proposed here.

Acknowledgements

The authors gratefully acknowledge partial financial support from CNPq.

References

Bickel, P. J. & K. A. Doksum (2001), Mathematical Statistics: Basic Ideas and Selected Topics, Vol. 1, 2 edn, Prentice-Hall, NJ.

Bustos, O. H., M. M. Lucini & A. C. Frery (2002), 'M-estimators of roughness and scale for GA0-modelled SAR imagery', EURASIP Journal on Applied Signal Processing 2002(1), 105–114.

Cribari-Neto, F., A. C. Frery & M. F. Silva (2002), 'Improved estimation of clutter properties in speckled imagery', Computational Statistics and Data Analysis 40(4), 801–824.

Doornik, J. A. (2001), Ox: An Object-Oriented Matrix Programming Language, 4 edn, Timberlake Consultants & Oxford, London. http://www.nuff.ox.ac.uk/users/doornik

Frery, A. C., H.-J. Müller, C. C. F. Yanasse & S. J. S. Sant'Anna (1997), 'A model for extremely heterogeneous clutter', IEEE Transactions on Geoscience and Remote Sensing 35(3), 648–659.

Frery, A. C., S. J. S. Sant'Anna, N. D. A. Mascarenhas & O. H. Bustos (1997), 'Robust inference techniques for speckle noise reduction in 1-look amplitude SAR images', Applied Signal Processing 4, 61–76.

Joughin, I. R., D. B. Percival & D. P. Winebrenner (1993), 'Maximum likelihood estimation of K distribution parameters for SAR data', IEEE Transactions on Geoscience and Remote Sensing 31(5), 989–999.

Mejail, M. E., A. C. Frery, J. Jacobo-Berlles & O. H. Bustos (2001), 'Approximation of distributions for SAR images: proposal, evaluation and practical consequences', Latin American Applied Research 31, 83–92.

Mejail, M. E., J. Jacobo-Berlles, A. C. Frery & O. H. Bustos (2000), 'Parametric roughness estimation in amplitude SAR images under the multiplicative model', Revista de Teledetección 13, 37–49.

Mejail, M. E., J. Jacobo-Berlles, A. C. Frery & O. H. Bustos (2003), 'Classification of SAR images using a general and tractable multiplicative model', International Journal of Remote Sensing 24. In press.

Mittelhammer, R. C., G. G. Judge & D. J. Miller (2000), Econometric Foundations, Cambridge University Press, New York.

Oliver, C. & S. Quegan (1998), Understanding Synthetic Aperture Radar Images, Artech House, Boston.

Poon, W.-Y. & Y. S. Poon (1999), 'Conformal normal curvature and assessment of local influence', Journal of the Royal Statistical Society B 61(1), 51–61.

Rousseeuw, P. J. & S. Verboven (2002), 'Robust estimation in very small samples', Computational Statistics and Data Analysis 40(4), 741–758.

Figure 7: Functions `1 and `2 with γ ∈ {1, 3, 5, 10} and −α ∈ {1, 3, 5, 10} (dash-dot, dashes, dots and solid lines, respectively).

Figure 8: Estimated bias B̂(α̂) of the ML estimator of α for one look, for sample sizes n = 9, 25, 49, 81 and 121.

Figure 9: ML estimates of α with n = 25 and L = 1.

Figure 10: E-SAR synthetic aperture image with L = 1.

Figure 11: Estimates of α in 250 sites with different window sizes: Group 1.

Figure 12: Estimates of α in 250 sites with different window sizes: Group 2.

Figure 13: Estimates of α in 250 sites with different window sizes: Group 3.

Figure 14: Estimates of α in 250 sites with different window sizes: Group 4.
