these estimators are easily implemented through fast algorithms so they are very appealing in practical ..... 3 The wavelet approach to nonparametric regression.
Wavelet Estimators in Nonparametric Regression: A Comparative Simulation Study Anestis Antoniadis, Jeremie Bigot Laboratoire IMAG-LMC, University Joseph Fourier, BP 53, 38041 Grenoble Cedex 9, France. and Theofanis Sapatinas, Department of Mathematics and Statistics, University of Cyprus, P.O. Box 20537, CY 1678 Nicosia, Cyprus.
Journal of Statistical Software, 6, Issue 6, 1–83 (2001). (available from http://www.jstatsoft.org/v06/i06/updates/jssv06i06.pdf)
1
Abstract Wavelet analysis has been found to be a powerful tool for the nonparametric estimation of spatially-variable objects. We discuss in detail wavelet methods in nonparametric regression, where the data are modelled as observations of a signal contaminated with additive Gaussian noise, and provide an extensive review of the vast literature of wavelet shrinkage and wavelet thresholding estimators developed to denoise such data. These estimators arise from a wide range of classical and empirical Bayes methods treating individual or blocks of wavelet coefficients either globally or in a level-dependent fashion. We compare various estimators in an extensive simulation study on a variety of sample sizes, test functions, signal-to-noise ratios and wavelet filters. Because there is no single criterion that can adequately summarise the behaviour of an estimator, we use various criteria to measure performance in finite sample situations. Insight into the performance of these estimators is obtained from graphical outputs and numerical tables. In order to provide some hints of how these estimators should be used to analyse real-life data sets, a detailed practical step-by-step illustration of a wavelet denoising analysis on electrical consumption is provided. Matlab codes are provided so that all figures and tables in this paper can be reproduced. Some key words: EM algorithm; Empirical Bayes; Monte Carlo experiments; Nonparametric regression; Smoothing Methods; Shrinkage; Thresholding; Wavelets.
2
1
INTRODUCTION
Nonparametric regression has been a fundamental tool in data analysis over the past two decades and is still an expanding area of ongoing research. The goal is to recover an unknown function, say g, based on sampled data that are contaminated with additive Gaussian noise. Only very general assumptions about g are made such as that it belongs to a certain class of smooth functions. Nonparametric regression (or denoising) techniques provide a very effective and simple way of finding structure in data sets without the imposition of a parametric regression model (as in linear or polynomial regression for example). However, nonparametric and parametric regression models should not be viewed as mutually exclusive competitors. In many cases, a nonparametric regression estimate will suggest a simple parametric model, while in other cases it will be clear that the underlying regression function is too complicated and no reasonable parametric model would be adequate. During the 1980s and 1990s, the literature was inundated by hundreds (and perhaps thousands) of papers regarding various linear (on a fixed spatial scale) estimators to the nonparametric regression problem.
Some of the more popular are those based on kernel
functions, smoothing splines and orthogonal series.
Each of these approaches has its own
particular strengths and weaknesses. We refer, for example, to the monographs of H¨ardle (1990), Green & Silverman (1994), Wand & Jones (1995), Fan & Gijbels (1996) and Eubank (1999) for extensive bibliographies. For linear smoothers, asymptotic results are easily obtained. Usually, if g is sufficiently smooth (for example if g belongs to a Sobolev space of regularity s), then the mean integrated squared error (which is usually considered to measure asymptotic performance) converges to zero at a rate n−r as the sample size n increases. This is known as the optimal asymptotic rate and for all reasonable linear smoothers, r is the same (for example, r = 2s/(2s + 1) for the Sobolev case). Unfortunately, the asymptotics are not particularly helpful to practitioners faced with finite data sets when deciding what type of smoother to use. As a public service, Breiman & Peters (1992) designed a useful simulation study comparing some popular linear smoothing methods found in the literature on a variety of sample sizes, regression functions and signal-to-noise ratios using a number of different criterion to measure performance. As expected, they found that no linear smoother uniformly dominates in all aspects. However, valuable conclusions were drawn from these simulations about the small sample behaviour of the various linear smoothers and their computational complexity. The fixed spatial scale estimators mentioned above assume that g belongs to some given set of functions {gθ ; θ ∈ Θ} where Θ is an infinite dimensional parameter set. It turns out that the
3
prior knowledge of the family {gθ ; θ ∈ Θ} influences both the construction of the estimators and their performances. Roughly speaking, the larger the family the larger the risk, revealing a major drawback of the preceding estimators. The problem of spatially adaptive estimation concerns whether one can, without prior knowledge of the family {gθ ; θ ∈ Θ}, built estimators that achieve the optimal asymptotic rates on some privileged subsets of the parameter set Θ. Over the last decade, various nonlinear (spatially adaptive) estimators in nonparametric regression have been proposed. The most popular are variable-bandwidth kernel methods, classification and regression trees, and adaptive regression splines. Although some of these methods achieve the optimal asymptotic rates, they can be computationally intensive and they are usually designed to denoise regular functions. During the 1990s, the nonparametric regression literature was dominated by (nonlinear) wavelet shrinkage and wavelet thresholding estimators. These estimators are a new subset of an old class of nonparametric regression estimators, namely orthogonal series methods. Moreover, these estimators are easily implemented through fast algorithms so they are very appealing in practical situations. Donoho & Johnstone (1994) and Donoho, Johnstone, Kerkyacharian & Picard (1995) have introduced nonlinear wavelet estimators in nonparametric regression through thresholding which typically amounts to term-by-term assessment of estimates of coefficients in the empirical wavelet expansion of the unknown function g. If an estimate of a coefficient is sufficiently large in absolute value – that is, if it exceeds a predetermined threshold – then the corresponding term in the empirical wavelet expansion is retained (or shrunk toward to zero by an amount equal to the threshold); otherwise it is omitted. In particular, these papers study nonparametric regression from a minimax viewpoint, using some important function classes not previously considered in statistics when linear smoothers were studied (and have also provided new viewpoints for understanding other nonparametric smoothers as well). These function classes model the notion of different amounts of smoothness in different locations more effectively than the usual smooth classes. In other words, these function classes contain function spaces like the H¨older and Sobolev spaces of regular functions, as well as spaces of irregular functions such as those of ‘bounded variation’. These considerations are not simply esoteric, but are of statistical importance, since these classes of functions typically arise in practical applications such as the processing of speech, electrocardiogram or seismic signals.
(Mathematically, all these function spaces can be formalized in terms of the so-
called Besov or Triebel spaces; see, for example, Meyer (1992), Wojtaszczyk (1997), Donoho & Johnstone (1998) and H¨ardle, Kerkyacharian, Picard & Tsybakov (1998).) Although these contributions describe performance in terms of convergence rates that are achieved over large function classes, concise accounts of mean squared error for single functions also exist and have 4
been discussed in detail by Hall & Patil (1996a, 1996b). To date, the study of these methods has been mostly asymptotic in character. In particular, it has been shown that nonlinear wavelet thresholding estimators are asymptotically near optimal or optimal while traditional linear estimators are suboptimal for estimation over particular classes of the Besov or Triebel spaces (see, for example, Delyon & Juditsky, 1996; Donoho & Johnstone, 1998; Abramovich, Benjamini, Donoho & Johnstone, 2000). As with any asymptotic result, there remain doubts as to how well the asymptotics describe small sample behaviour. In other words, how large the sample size should be before the asymptotic theory applies is a very important question. To shed some light on this question, Marron, Adak, Johnstone, Newmann & Patil (1998) applied the tool of exact risk analysis to understand the small sample behaviour of the two wavelet thresholding estimators introduced by Donoho & Johnstone (1994), namely the minimax and universal wavelet estimators, and thus to check directly the conclusions suggested by asymptotics. Also, their analysis provide insight as to why the viewpoints and conclusions of Donoho-Johnstone (convergence rates achieved uniformly over large function classes) differ from those of Hall-Patil (convergence rates achieved for single functions). Bruce & Gao (1996), Gao & Bruce (1997), Gao (1998) and Antoniadis & Fan (2001) have also developed analytical tools to understand the finite sample behaviour of minimax wavelet estimators based on various thresholding rules. Since the seminal papers by Donoho & Johnstone (1994) and Donoho, Johnstone, Kerkyacharian & Picard (1995), various alternative data-adaptive wavelet thresholding estimators have been developed.
For example, Donoho & Johnstone (1995) proposed the
SureShrink estimator based on minimizing Stein’s unbiased risk estimate; Weyrich & Warhola (1995a, 1995b), Nason (1996) and Jansen, Malfait & Bultheel (1997) have considered estimators based on cross-validation approaches to choosing the thresholding parameter; while Abramovich & Benjamini (1995, 1996) and Ogden & Parzen (1996a, 1996b) considered thresholding as a multiple hypotheses testing procedure. Hall, Penev, Kerkyacharian & Picard (1997), Hall, Kerkyacharian & Picard (1998, 1999), Cai (1999), Efromovich (1999, 2000) and Cai & Silverman (2001) suggested further modifications of the basic thresholding by considering wavelet block thresholding estimators meaning that the wavelet coefficients are thresholded in blocks rather than term-by-term. Some of these alternative wavelet thresholding estimators possess near optimal asymptotic properties. Moreover, it has been shown that wavelet block thresholding estimators have excellent mean squared error performances relative to wavelet term-by-term thresholding estimators in finite sample situations. Various Bayesian approaches for nonlinear wavelet thresholding and nonlinear wavelet shrinkage estimators have also recently been proposed. To fix terminology, a shrinkage rule 5
shrinks wavelet coefficients to zero, whilst a thresholding rule in addition sets actually to zero all coefficients below a certain level (as in the classical approach). These estimators have been shown to be effective and it is argued that they are less ad-hoc than the classical proposals discussed above. In the Bayesian approach a prior distribution is imposed on the wavelet coefficients.
The prior model is designed to capture the sparseness of wavelet expansions
common to most applications. Then the function is estimated by applying a suitable Bayesian rule to the resulting posterior distribution of the wavelet coefficients. Different choices of loss function lead to different Bayesian rules and hence to different nonlinear wavelet shrinkage and wavelet thresholding rules. Such wavelet estimators have been discussed, for example, by Chipman, Kolaczyk & McCulloch (1997), Abramovich, Sapatinas & Silverman (1998), Clyde, Parmigiani & Vidakovic (1998), Crouse, Nowak & Baraniuk (1998), Johnstone & Silverman (1998), Vidakovic (1998a), Clyde & George (1999, 2000), Vannucci & Corradi (1999), Huang & Cressie (2000), Huang and Lu (2000), Vidakovic & Ruggeri (2001) and Angelini, De Canditiis & Leblanc (2003). Moreover, it has been shown that Bayesian wavelet shrinkage and thresholding estimators outperform the classical wavelet term-by-term thresholding estimators in terms of mean squared error in finite sample situations. Recently, Abramovich, Besbeas & Sapatinas (2002) have considered Bayesian wavelet block shrinkage and block thresholding estimators, and have shown that they outperform existing classical block thresholding estimators in terms of mean squared error in finite sample situations. Extensive reviews and descriptions of the various classical and Bayesian wavelet shrinkage and wavelet thresholding estimators are given in the books by Ogden (1997), Vidakovic (1999) and Percival and Walden (2000), in the papers appeared in the edited volume by M¨ uller & Vidakovic (1999), and in the review papers by Antoniadis (1997), Vidakovic (1998b) and Abramovich, Bailey and Sapatinas (2000). It is evident that, although a number of wavelet estimators has been compared to get insight as to which ones are to be preferred against the others, a more detailed study involving recent classical and Bayesian wavelet methods is required in the development towards high-performance wavelet estimators. Furthermore, with the increased applicability of these estimators in nonparametric regression, their finite sample properties become even more important. Therefore, as a public service, similar to one given by Breiman & Peters (2000) for linear smoothers, in this paper we design an extensive simulation study to compare most of the above mentioned wavelet estimators on a variety of sample sizes, test functions, signal-to-noise ratios and wavelet filters. Because there is no single criterion that can adequately summarise the behaviour of an estimator, various criteria (including the traditional mean squared error) are used to measure performance. Insight into the performance of these estimators in finite sample 6
situations is obtained from graphical outputs and numerical tables. In particular, from the classical wavelet thresholding estimators we consider: minimax estimators, the universal estimator, SureShrink estimators, a translation invariant estimator (a variant of the universal estimator), the two-fold cross-validation estimator, multiple hypotheses testing estimators, a nonoverlapping block estimator and an overlapping block estimator. A linear wavelet estimator (extending spline smoothing estimation methods) is also considered. From the Bayesian wavelet shrinkage and wavelet thresholding estimators we consider: posterior mean estimators, posterior median estimators, a hypothesis testing estimator, a deterministic/stochastic decomposition estimator, a nonparametric mixed-effect estimator and nonoverlapping block estimators. The elicitation of the hyperparameters of the above estimators are resolved in detail by employing empirical Bayes methods that attempt to determine the hyperparameters of the prior distributions from the data being analysed. The plan of this paper is as follows. Section 2 recalls some known results about wavelet series and the discrete wavelet transform. Section 3 briefly discusses the nonlinear wavelet approach to nonparametric regression.
In Section 4 the different classical and empirical
Bayes wavelet estimators used in the simulation study are presented.
The description of
the actual simulation study is given in Section 5. Section 6 summarises and discusses the results of the simulations. The overall conclusions of our simulation comparison is presented in Section 7. Section 8 explains in some detail which files the user should download and how to start running them. In order to provide some hints of how the functions should be used to analyse real-life data sets, a detailed practical step-by-step illustration of a wavelet denoising analysis on electrical consumption is provided. Finally, following the principle of reproducible research as advocated by Buckheit & Donoho (1995), Matlab routines and a description of software implementing these routines (so that all figures and tables in this paper can be reproduced) are available at http://www-lmc.imag.fr/SMS/software/GaussianWaveDen.html or http://www.ucy.ac.cy/∼fanis/links/software.html .
2
Wavelet Series and the Discrete Wavelet Transform
In this section we give a brief overview of some relevant material on the wavelet series expansion and a fast wavelet transform that we need later.
2.1
The wavelet series expansion
The term wavelets is used to refer to a set of orthonormal basis functions generated by dilation and translation of a compactly supported scaling function (or father wavelet), φ, and a mother 7
wavelet, ψ, associated with an r-regular multiresolution analysis of L2 (R). A variety of different wavelet families now exist that combine compact support with various degrees of smoothness and numbers of vanishing moments (see Daubechies (1992)), and these are now the most intensively used wavelet families in practical applications in statistics. Hence, many types of functions encountered in practice can be sparsely (i.e. parsimoniously) and uniquely represented in terms of a wavelet series. Wavelet bases are therefore not only useful by virtue of their special structure, but they may also be (and have been!) applied in a wide variety of contexts. For simplicity in exposition, we shall assume that we are working with periodised wavelet bases on [0, 1] (see, for example, Mallat (1999), Section 7.5.1), letting φpjk (t) =
X
p φjk (t − l) and ψjk (t) =
l∈Z
X
ψjk (t − l),
for t ∈ [0, 1],
l∈Z
where φjk (t) = 2j/2 φ(2j t − k) and ψjk (t) = 2j/2 ψ(2j t − k). For any given primary resolution level j0 ≥ 0, the collection p {φpj0 k , k = 0, 1, . . . , 2j0 − 1; ψjk , j ≥ j0 ≥ 0, k = 0, 1, . . . , 2j − 1}
is then an orthonormal basis of L2 ([0, 1]). The superscript “p” will be suppressed from the notation for convenience. Despite the poor behaviour of periodic wavelets near the boundaries (they create high amplitude wavelet coefficients in the neighborhood of the boundaries when the analysed function is not periodic) they are commonly used because the numerical implementation is particularly simple. Also, as Johnstone (1994) has pointed out, this computational simplification affects only a fixed number of wavelet coefficients at each resolution level and does not affect the qualitative phenomena that we wish to present. The idea underlying such an approach is to express any function g ∈ L2 ([0, 1]) in the form g(t) =
j0 −1 2X
j
αj0 k φj0 k (t) +
where
Z αj0 k = hg, φj0 k i = Z βjk = hg, ψjk i =
βjk ψjk (t),
j0 ≥ 0,
t ∈ [0, 1],
j=j0 k=0
k=0
and
∞ 2X −1 X
0
0
1
g(t)φj0 k (t) dt,
j0 ≥ 0,
1
g(t)ψjk (t) dt,
j ≥ j0 ≥ 0,
k = 0, 1, . . . , 2j0 − 1
k = 0, 1, . . . , 2j − 1.
For detailed expositions of the mathematical aspects of wavelets we refer to, for example, Daubechies (1992), Meyer (1992), Wojtaszczyk (1997) and Mallat (1999). 8
2.2
The discrete wavelet transform
In statistical settings we are more usually concerned with discretely sampled, rather than continuous, functions. It is then the wavelet analogy to the discrete Fourier transform which is of primary interest and this is referred to as the discrete wavelet transform (DWT). Given a vector of function values g = (g(t1 ), ..., g(tn ))0 at equally spaced points ti , the discrete wavelet transform of g is given by d = W g, where d is an n×1 vector comprising both discrete scaling coefficients, cj0 k , and discrete wavelet coefficients, djk , and W is an orthogonal n × n matrix associated with the orthonormal wavelet basis chosen. The cj0 k and djk are related to their continuous counterparts αj0 k and βjk (with an approximation error of order n−1 ) via the relationships cj0 k ≈ The factor
√ n αj0 k
and djk ≈
√ n βjk .
√ n arises because of the difference between the continuous and discrete
orthonormality conditions. This root factor is unfortunate but both the definition of the DWT and the wavelet coefficients are now fixed by convention, hence the different notation used to distinguish between the discrete wavelet coefficients and their continuous counterpart. Note that, because of orthogonality of W , the inverse DWT (IDWT) is simply given by 0
g = W d, 0
where W denotes the transpose of W . If n = 2J for some positive integer J, the DWT and IDWT may be performed through a computationally fast algorithm developed by Mallat (1989) that requires only order n operations. In this case, for a given j0 and under periodic boundary conditions, the DWT of g results in an n-dimensional vector d comprising both discrete scaling coefficients cj0 k , k = 0, ..., 2j0 − 1 and discrete wavelet coefficients djk , j = j0 , ..., J − 1; k = 0, ..., 2j − 1. We do not provide technical details here of the order n DWT algorithm mentioned above. Essentially the algorithm is a fast hierarchical scheme for deriving the required inner products which at each step involves the action of low and high pass filters, followed by a decimation (selection of every even member of a sequence). The IDWT may be similarly obtained in terms of related filtering operations. For excellent accounts of the DWT and IDWT in terms of filter operators we refer to Nason & Silverman (1995), Strang & Nguyen (1996), or Burrus, Gonipath & Guo (1998).
9
3
The wavelet approach to nonparametric regression
Consider the standard univariate nonparametric regression setting yi = g(ti ) + σ ²i ,
i = 1, . . . , n,
(1)
where ²i are independent N (0, 1) random variables and the noise level σ may, or may not, be known. The goal is to recover the underlying function g from the observations y = (y1 , . . . , yn )0 without assuming any particular parametric structure on its form. In what follows we assume, without loss of generality, that the sample points ti are within the unit interval [0, 1]. For simplicity, we also assume that the sample points are equally spaced, i.e. ti = i/n, and that the sample size n is a power of two: n = 2J for some positive integer J. These assumptions allow us to perform both the DWT and the IWDT using Mallat’s (1989) fast algorithm. Remark 3.1 It should be noted that for non-equispaced or random designs, or sample sizes which are not a power of two, or data contaminated with correlated noise, modifications are needed to the standard wavelet-based estimation procedures that will be discussing in Section 4. We refer, for example, to Deylon & Juditsky (1995), Neumann & Spokoiny (1995), Wang (1996), Hall & Turlach (1997), Johnstone & Silverman (1997), Antoniadis, Gr´egoire & Vial (1997), Antoniadis & Pham (1998), Cai & Brown (1998, 1999), Kovac & Silverman (2000), von Sachs & MacGibbon (2000), Nason (2002) and Angelini, De Canditiis & Leblanc (2003). In our simulation comparison in Section 5, we have not included these latter methods but rather concentrated on the standard nonparametric regression setting. Although a more general comparison will be valuable it is, however, outside the scope of this paper. One of the basic approaches to nonparametric regression is to consider the unknown function g expanded as a generalised Fourier series and then to estimate the generalised Fourier coefficients from the data. The original nonparametric problem is thus transformed to a parametric one, although the potential number of parameters is infinite. An appropriate choice of basis for the expansion is therefore a key point in relation to the efficiency of such an approach. A ‘good’ basis should be parsimonious in the sense that a large set of possible response functions can be approximated well by only few terms of the generalized Fourier expansion employed. As already discussed, wavelet series allow a parsimonious expansion for a wide variety of functions, including inhomogeneous cases. It is therefore natural to consider applying the generalized Fourier series approach using a wavelet series.
10
Due to the orthogonality of the matrix W , the DWT of white noise is also an array of independent N (0, 1) random variables, so from (1) it follows that cˆj0 k = cj0 k + σ ²jk ,
k = 0, 1, . . . , 2j0 − 1,
dˆjk = djk + σ ²jk ,
j = j0 , . . . , J − 1,
k = 0, . . . , 2j − 1,
(2) (3)
where cˆj0 k and dˆjk are respectively the empirical scaling and the empirical wavelet coefficients of the noisy data y, and ²jk are independent N (0, 1) random variables. The sparseness of the wavelet expansion makes it reasonable to assume that essentially only a few ‘large’ djk contain information about the underlying function g, while ‘small’ djk can be attributed to the noise which uniformly contaminates all wavelet coefficients. If we can decide which are the ‘significant’ large wavelet coefficients, then we can retain them and set all others equal to zero, thus obtaining an approximate wavelet representation of the underlying function g. It is also advisable to keep the scaling coefficients cj0 k , the coefficients on the lower coarse levels, intact because they represent ‘low-frequency’ terms that usually contain important components about the underlying function g. Finally, we mention that the primary resolution level j0 that we have used throughout our simulations was chosen to be j0 = [log2 (log(n))] + 1, following the asymptotic considerations given in Chapter 10 of H¨ardle, Kerkyacharian, Picard & Tsybakov (1998).
3.1
The classical approach to wavelet thresholding
A wavelet based linear approach, extending simply spline smoothing estimation methods as described by Wahba (1990), is the one suggested by Antoniadis (1996) and independently by Amato & Vuza (1997). Of non-threshold type, this method is appropriate for estimating relatively regular functions. Assuming that the smoothness index s of the function g to be recovered is known, the resulting estimator is obtained by estimating the scaling coefficients cj0 k by their empirical counterparts cˆj0 k and by estimating the wavelet coefficients djk via a linear shrinkage
dˆjk , 1 + λ22js where λ > 0 is a smoothing parameter. The parameter λ is chosen by cross-validation in Amato d˜jk =
& Vuza (1997), while the choice of λ in Antoniadis (1996) is based on risk minimization and depends on a preliminary consistent estimator of the noise level σ. The above linear wavelet method is not designed to handle spatially inhomogeneous functions with low regularity. For such functions one usually relies upon nonlinear wavelet (thresholding or shrinkage) methods. Donoho & Johnstone (1994, 1995, 1998) and Donoho, Johnstone, Kerkyacharian & Picard (1995) proposed a nonlinear wavelet estimator of g based on reconstruction by keeping the 11
empirical scaling coefficients cˆj0 k in (2) intact and from a more judicious selection of the empirical wavelet coefficients dˆjk in (3). They suggested the extraction of the significant wavelet coefficients by thresholding in which wavelet coefficients are set to zero if their absolute value is below a certain threshold level, λ ≥ 0, whose choice we discuss in more detail in Section 4.1. Under this scheme we obtain thresholded wavelet coefficients using either the hard or soft thresholding rule given respectively by
( δλH (dˆjk )
and
=
0 if |dˆjk | ≤ λ dˆjk if |dˆjk | > λ
(4)
if |dˆjk | ≤ λ 0 S ˆ δλ (djk ) = dˆjk − λ if dˆjk > λ dˆ + λ if dˆ < −λ. jk jk
(5)
Thresholding allows the data itself to decide which wavelet coefficients are significant; hard thresholding (a discontinuous function) is a ‘keep’ or ‘kill’ rule, while soft thresholding (a continuous function) is a ‘shrink’ or ‘kill’ rule. Bruce & Gao (1996) and Marron, Adak, Johnstone, Newmann & Patil (1998) have shown that simple threshold values with hard thresholding results in larger variance in the function estimate, while the same threshold values with soft thresholding shift the estimated coefficients by an amount of λ even when |dˆjk | stand way out of noise level, creating unnecessary bias when the true coefficients are large. Also, due to its discontinuity, hard thresholding can be unstable, that is, sensitive to small changes in the data. To remedy the drawbacks of both hard and soft thresholding rules, Gao & Bruce (1997) and considered the firm threshold thresholding 0 λ (|dˆ |−λ1 ) F δλ1 ,λ2 (dˆjk ) = sign(dˆjk ) 2 λ2jk −λ1 dˆ jk
if |dˆjk | ≤ λ1 if λ1 < |dˆjk | ≤ λ2
(6)
if |dˆjk | > λ2
which is a “keep” or “shrink” or “kill” rule (a continuous function). The resulting wavelet thresholding estimators offer, in small samples, advantages over both hard thresholding (generally smaller mean squared error and less sensitivity to small perturbations in the data) and soft thresholding (generally smaller bias and overall mean squared error) rules. For values of |dˆjk | near the lower threshold λ1 , δ F (dˆjk ) behaves like δ S (dˆjk ). λ1 ,λ2
λ1
For values of |dˆjk | above the upper threshold λ2 , δλF1 ,λ2 (dˆjk ) behaves like δλH2 (dˆjk ). Note that the hard thresholding and soft thresholding rules are limiting cases of (6) with λ1 = λ2 and λ2 = ∞ respectively.
12
Note that firm thresholding has a drawback in that it requires two threshold values (one for ‘keep’ or ‘shrink’ and another for ‘shrink’ or ‘kill’), thus making the estimation procedure for the threshold values more computationally expensive. To overcome this drawback, Gao (1998) considered the nonnegative garrote thresholding 0 δλG (dˆjk ) = dˆjk −
λ2 dˆjk
if |dˆjk | ≤ λ if |dˆjk | > λ
(7)
which is a “shrink” or “kill” rule (a continuous function). The resulting wavelet thresholding estimators offer, in small samples, advantages over both hard thresholding and soft thresholding rules that is comparable to the firm thresholding rule, while the latter requires two threshold values. In the same spirit to that in Gao (1998), Antoniadis & Fan (2001) suggested the SCAD thresholding rule ˆ ˆ ˆ sign(djk ) max (0, |djk | − λ) if |djk | ≤ 2λ (α−1)dˆjk −aλsign(dˆjk ) δλSCAD (dˆjk ) = if 2λ < |dˆjk | ≤ αλ α−2 dˆ if |dˆjk | > αλ jk
(8)
which is also a “keep” or “shrink” or “kill” rule (a piecewise linear function). It does not over penalize large values of |dˆjk | and hence does not create excessive bias when the wavelet coefficients are large. Antoniadis & Fan (2001) have recommended to use the value of α = 3.7 based on a Bayesian argument. Remark 3.2 In our simulation comparison in Section 5, we have only considered hard, soft and SCAD thresholding rules since the firm and the nonnegative garrote thresholding rules have been extensively studied by Gao & Bruce (1997) and Gao (1998) – see also Vidakovic (1999) for a discussion on other types of thresholding rules. Moreover, hard and soft thresholding rules are the most commonly used (if not the only ones!) among the various classical wavelet thresholding estimators considered in practice that we will be discussing in Section 4.1. The thresholded wavelet coefficients obtained by applying any of the thresholding rules δλ , given in (4)–(7), are used to obtain a selective reconstruction of the response function g. The resulting estimate can be written as gˆλ (t) =
j0 −1 2X
k=0
j
J−1 −1 X 2X δλ (dˆjk ) cˆj0 k √ φj0 k (t) + √ ψjk (t). n n
(9)
j=j0 k=0
Often, we are interested in estimating the unknown response function at the observed dataˆλ of the corresponding estimates can be derived by simply points. In this case the vector g 13
performing the IDWT of {ˆ cj0 k , δλ (dˆjk )} and the resulting three-step selective reconstruction estimation procedure can be summarized by the following diagram n o thresholding n o DWT IDWT ˆλ . y −→ cˆj0 k , dˆjk −→ cˆj0 k , δλ (dˆjk ) −→ g
(10)
As mentioned in Section 2.2, for the equally spaced design and sample size n = 2J assumed throughout this section, both DWT and IDWT in (10) can be performed by a fast algorithm of order n, and so the whole process is computationally very efficient.
3.2
The Bayesian approach to wavelet shrinkage and thresholding
Recall from (2) that the empirical scaling coefficients {ˆ cj0 k : k = 0, . . . , 2j0 − 1} conditionally on cj0 k and σ 2 are independently distributed as cˆj0 k | cj0 k , σ 2 ∼ N (cj0 k , σ 2 ).
(11)
Similarly, from (3), the empirical wavelet coefficients {dˆjk : j = j0 , . . . , J − 1; k = 0, . . . , 2j − 1} conditionally on djk and σ 2 are independently distributed as dˆjk | djk , σ 2 ∼ N (djk , σ 2 ).
(12)
In the Bayesian approach a prior model is imposed on the function’s wavelet coefficients, designed to capture the sparseness of the wavelet expansion common to most applications. It is assumed that in the prior model the wavelet coefficients djk are mutually independent random variables (and independent of the empirical wavelet coefficients dˆjk ). A popular prior model for each wavelet coefficient djk is a scale mixture of two distributions, one mixture component corresponding to negligible coefficients, the other to significant coefficients. Usually, a scale mixture of two normal distributions or a mixture of one normal distribution and a point mass at zero is considered. Such mixtures have been recently applied to stochastic search variable selection problems (see, for example, George & McCullogh (1993)). The mixture component with larger variance represents the significant coefficients, while the mixture component with smaller variance represents the negligible ones. The limiting case of a mixture of one normal distribution and a point mass at zero corresponds to the case where the smaller variance is actually set to zero. An important distinction between the use of a scale mixture of two normal distributions (considered by Chipman, Kolaczyk & McCulloch (1997)) and a scale mixture of a normal distribution and a point mass at zero (considered by Clyde, Parmigiani & Vidakovic (1998) and Abramovich, Sapatinas & Silverman (1998)) in the Bayesian wavelet approach to nonparametric regression is the type of shrinkage obtained. In the former case, no wavelet coefficient estimate 14
based on the posterior analysis will be exactly equal to zero. However, in the latter case, with a proper choice of a Bayes rule, it is possible to get wavelet coefficient estimates that are exactly zero, resulting in a bonafide thresholding rule. We consider here the mixture model of one normal distribution and a point mass at zero in detail. (Other choices of prior models will also be discussed in Section 4.3.) Following the discussion given above, a hierarchical model that expresses the belief that some of the wavelet coefficients {djk : j = j0 , . . . , J − 1; k = 0, . . . , 2j − 1} are zero is obtained by djk | γjk ∼ N (0, γjk τj2 )
(13)
γjk ∼ Bernoulli(πj ),
(14)
where djk | γjk are mutually independent random variables. The binary random variables γjk determine whether the wavelet coefficients are nonzero (γjk = 1), arising from N (0, τj2 ) distributions, or zero (γjk = 0), arising from point masses at zero.
In the next stage
of the hierarchy, it is assumed that the γjk have independent Bernoulli distributions with P (γjk = 1) = 1 − P (γjk = 0) = πj for some fixed hyperparameter 0 ≤ πj ≤ 1. The probability πj gives the proportion of nonzero wavelet coefficients at resolution level j while the variance τj2 is a measure of their magnitudes. The same prior parameters πj and τj2 for all coefficients at a given resolution level j are used, resulting in level-dependent wavelet threshold and shrinkage estimators. Recall from Section 3.1 that is advisable to keep the coefficients on the lower coarse levels intact because they represent ‘low-frequency’ terms that usually contain important components of the function g.
Thus, to complete the prior specification of g, the scaling coefficients
{cj0 k : k = 0, . . . , 2j0 − 1} are assumed to be mutual independent random variables (and independent of the empirical scaling coefficients cˆj0 k ) and vague priors are placed on them cj0 k ∼ N (0, ²),
² → ∞.
(15)
Once the data are observed, the empirical wavelet coefficients dˆjk and empirical scaling coefficients cˆj0 k are determined, and we seek the posterior distributions on the wavelet coefficients djk and scaling coefficients cj0 k . Using (12), (13) and (14), the djk are a posteriori conditionally independent
à djk | γjk , dˆjk , σ 2 ∼ N
γjk
τj2 σ 2 + τj2
dˆjk , γjk
σ 2 τj2 σ 2 + τj2
! .
(16)
In order to incorporate model uncertainty about which of the wavelet coefficients djk are zero, we now average over all possible γjk . Using (16), the marginal posterior distribution of djk
15
conditionally on σ 2 , is then given by à djk | dˆjk , σ 2 ∼ p(γjk = 1 | dˆjk , σ 2 ) N
τj2 σ 2 + τj2
dˆjk ,
σ 2 τj2
!
σ 2 + τj2
+(1 − p(γjk = 1 | dˆjk , σ 2 )) δ(0),
(17)
where δ(0) is a point mass at zero. It is not difficult to see the posterior probabilities that wavelet coefficients djk are nonzero can be expressed as p(γjk = 1 | dˆjk , σ 2 ) =
1 1 + Ojk (dˆjk , σ 2 )
,
where the posterior odds ratios Ojk (dˆjk , σ 2 ) that γjk = 0 versus γjk = 1 are given by à ! τj2 dˆ2jk (σ 2 + τj2 )1/2 1 − π j 2 Ojk (dˆjk , σ ) = exp − 2 2 . πj σ 2σ (σ + τj2 )
(18)
(19)
Based on some Bayes rules (BR), as we shall discuss in Section 4.3, expressions (16) and (17) can be used to obtain wavelet threshold and shrinkage estimates BR(djk | dˆjk , σ 2 ), of the wavelet coefficients djk . Also, using (11) and (15), the cj0 k are a posteriori conditionally independent cj0 k | cˆj0 k , σ 2 ∼ N (ˆ cj0 k , σ 2 )
(20)
and therefore, using (20), cj0 k are estimated by cˆj0 k . The posterior-based wavelet threshold and wavelet shrinkage coefficients are used to obtain a selective reconstruction of the response function. The resulting estimate can be written as gˆBR (t) =
j0 −1 2X
k=0
j
J−1 −1 X 2X cˆj0 k BR(djk | dˆjk , σ 2 ) √ φj0 k (t) + √ ψjk (t). n n
(21)
j=j0 k=0
ˆBR of the corresponding estimates of the unknown response function g Finally, the vector g at the observed data-points can be derived by simply performing the IDWT of {ˆ cj0 k , BR(djk | 2 dˆjk , σ )} and the resulting four-step selective reconstruction estimation procedure can be summarized by the following diagram DWT
y −→
n o n o IDWT Priors Posterior estimates ˆBR . (22) −→ cˆj0 k , BR(djk | dˆjk , σ 2 ) −→ g cˆj0 k , dˆjk −→ {cj0 k , djk }
As mentioned in Section 2.2, for the equally spaced design and sample size n = 2J assumed throughout this section, both DWT and IDWT in (22) can be performed by a fast algorithm of order n, and so the whole process is computationally very efficient.
16
4
Description of the Wavelet Estimators
In this section, we give a brief description of the various classical and empirical Bayes wavelet shrinkage and thresholding estimators that we will be using in our simulation study given in Section 5.
4.1
Classical Methods: Term-by-Term Thresholding
Given the basic framework of function estimation using wavelet thresholding as discussed in Section 3.1, there are a variety of methods to choose the threshold level λ in (9) or (10) for any given situation. These can be grouped into two categories: global thresholds and level-dependent thresholds. The former means that we choose a single value of λ to be applied globally to all empirical wavelet coefficients {dˆjk : j = j0 , . . . , J − 1; k = 0, 1, . . . , 2j − 1} while the latter means that a possibly different threshold value λj is chosen for each resolution level j = j0 , . . . , J − 1. In what follows, we consider both global and level-dependent thresholds. These thresholds all require an estimate of the noise level σ. The usual standard deviation of the data values is clearly not a good estimator, unless the underlying response function g is reasonably flat. Donoho & Johnstone (1994) considered estimating σ in the wavelet domain and suggested a robust estimate that is based only on the empirical wavelet coefficients at the finest resolution level. The reason for considering only the finest level is that corresponding empirical wavelet coefficients tend to consist mostly of noise. Since there is some signal present even at this level, Donoho & Johnstone (1994) proposed a robust estimate of the noise level σ (based on the median absolute deviation) given by median({|dˆJ−1,k | : k = 0, 1, . . . , 2J−1 − 1}) . (23) 0.6745 The estimator (23) has become very popular in practice and it is used in subsequent sections σ ˆ=
unless stated otherwise. 4.1.1
The Minimax Threshold
An optimal threshold, derived to minimize the constant term in an upper bound of the risk involved in estimating a function, was obtained by Donoho & Johnstone (1994). The proposed minimax threshold, that depends on the sample size n, is defined as λM = σ ˆ λ?n , where λ?n is defined as the value of λ which achieves ½ ¾ Rλ (d) ? Λn := inf sup , λ n−1 + Roracle (d) d 17
(24)
(25)
ˆ − d)2 and Roracle (d) is the ideal risk achieved with the help of an oracle. where Rλ (d) = E(δλ (d) Two oracles were considered by Donoho & Johnstone (1994): diagonal linear projection (DLP), an oracle which tell us when to “keep” or “kill” each empirical wavelet coefficient, and diagonal linear shrinker (DLS), an oracle which tells you how much to shrink each wavelet coefficient. The ideal risks for these oracles are given by DLP Roracle (d) := min(d2 , 1)
DLS and Roracle (d) :=
d2 . +1
d2
Donoho and Johnstone (1994) computed the DLP minimax thresholds for the soft thresholding rule (5), while the DLP minimax thresholds for the hard thresholding rule (4) and the DLS minimax thresholds for both soft and hard thresholding rules were obtained by Bruce & Gao (1996). DLP minimax thresholds for the SCAD thresholding rule (8) were obtained by Antoniadis & Fan (2001). Numerical values for the sample sizes that we will be using in our simulative study in Section 5 are given in Table 1.
n
128
256
512
1024
HARD
2.913
3.117
3.312
3.497
SOFT
1.669
1.859
2.045
2.226
SCAD
1.691
1.881
2.061
2.241
Table 1: Diagonal linear projection minimax thresholds for HARD, SOFT and SCAD thresholding rules for various sample sizes. Since the type of oracle used has little impact on the minimax thresholds, Table 1 only reports the DLP minimax thresholds and can be used as a look-up table in any software. These values were computed using a grid search over λ with increments ∆λ = 0.0001. At each point, the supremum over d in (24) was computed using a quasi-Newton optimisation with numerical derivatives (see, for example, Dennis & Mei, 1979). Remark 4.1 Although not considered in our simulation study in Section 5, we note that DLP minimax thresholds for the firm thresholding rule (6) and the nonnegative garrote thresholding rule (7) were considered by Gao & Bruce (1997) and Gao (1998) respectively. Numerical values for selected sample sizes can be found, for example, in Table 2 in Gao (1998). 4.1.2
The Universal Threshold
As an alternative to the use of minimax thresholds, Donoho & Johnstone (1994) suggested thresholding of empirical wavelet coefficients {dˆjk : j = j0 , . . . , J − 1; k = 0, 1, . . . , 2j − 1} by 18
using the universal threshold λU = σ ˆ
p 2 log n.
(26)
This threshold is easy to remember and its implementation in software requires no costly development of look-up tables. The universal threshold ensures, with high probability, that every sample in the wavelet transform in which the underlying function is exactly zero will be estimated as zero. This is so, because if X1 , . . . , Xn are independent and identically distributed N (0, 1) random variables, then ¾ ½ p 1 , P max |Xi | > 2 log n ∼ √ 1≤i≤n π log n
as n → ∞.
The rate at which the probability above tends to one is however quite slow. Remark 4.2 In fact, it can be shown that ½ ¾ p P max |Xi | > c log n ∼ 1≤i≤n
√ 2 √ , c/2−1 n cπ log n
as n → ∞.
(27)
Although not considered in our simulation study in Section 5, we mention that one can define other universal thresholds such that (27) converges to zero faster (ie, to suppress noise more √ √ thoroughly). For example, λU 4 log n makes the above convergence rate 1/n 2π log n and 1 = √ √ λU 6 log n makes the above convergence rate 1/n2 3π log n. Such thresholds have been 2 = used for applications involving indirect noisy measurements (usually called inverse problems); in particular the latter universal threshold has been used by Abramovich & Silverman (1998) for the estimation of a function’s derivative as a practical important example of an inverse problem. 4.1.3
The Translation Invariant Threshold
It has been noted from scientists (and other users) that wavelet thresholding with either the minimax threshold (24) or the universal threshold (26) suffers from artifacts of various kinds. In other words, in the vicinity of discontinuities, these wavelet thresholding estimators can exhibit pseudo-Gibbs phenomena, alternating undershoot and overshoot of a specific target level. While these phenomena are less pronounced than in the case of Fourier-based estimates (in which Gibbs phenomena are global, rather than local, and of large amplitude), it seems reasonable to try to improve the resulting wavelet-based estimation procedures. Coifman & Donoho (1995) proposed the use of the translation invariant wavelet thresholding scheme which helps to suppress these artifacts. The idea is to correct unfortunate mis-alignments between features in the function of interest and features in the basis. When the function of interest contains several discontinuities, the approach of Coifman & Donoho (1995) is to apply 19
a range of shifts in the function, and average over the several results so obtained. Given data y = (y1 , . . . , yn )0 from model (1), the translation invariant wavelet thresholding estimator is defined as
n
gˆTI =
1X (W Sk )0 δλ (W Sk y), n
(28)
k=i
where W is the order n orthogonal matrix associated with the DWT, Sk is the shift matrix à ! Ok×(n−k) Ik×k , I(n−k)×(n−k) O(n−k)×k δλ is either the hard thresholding rule (4) or the soft thresholding rule (5), Ik×k is the identity matrix with k rows, and Or1 ×r2 is an r1 × r2 matrix of zeroes. Alternatively, the translation invariant thresholding is via the nondecimated discrete wavelet transform (NDWT) (or stationary or maximal overlap transform) – see, for example, Nason & Silverman (1995) or Percival & Walden (2000). By denoting W TI to be the matrix associated with the NDWT (which actually maps a vector of length n = 2J into a vector of length (J +1)n), then (28) is equivalent to gˆTI = WGTI δλ (W TI y),
(29)
¡ ¢−1 where WGTI = (W TI )0 W TI (W TI )0 is the generalised inverse of W TI . Remark 4.3 Coifman & Donoho (1995) applied the above procedure by using the threshold p λ = σ ˆ 2 loge ((n log2 (n)). Although lower threshold levels could be used in conjunction with translation invariant thresholding, Coifman & Donoho (1995) pointed out that such lower thresholds result in a very large number of noise spikes, apparently much larger than in the noninvariant case. Therefore, in our simulation study in Section 5, we only consider translation invariant thresholding in conjunction with the above threshold and with both hard thresholding (4) and soft thresholding (5). 4.1.4
Thresholding as a Multiple Hypotheses Testing Problem
The idea of wavelet thresholding can be viewed as a multiple hypotheses testing. For each wavelet coefficient dˆjk ∼ N (djk , σ ˆ 2 ), test the following hypothesis H0 : djk = 0
versus H1 : djk 6= 0.
If H0 is rejected, the coefficient dˆjk is retained in the model; otherwise it is discarded. Classical approaches to multiple hypotheses testing in this case face serious problems because of the large numbers of hypotheses being tested simultaneously. In other words, if the error is 20
controlled at an individual level, the chance of keeping erroneously a coefficient is extremely high; if the simultaneous error is controlled, the chance of keeping a coefficient is extremely low. Abramovich & Benjamini (1995,1996) proposed a way to control such dissipation of power based on the false discovery rate (FDR) method of Benjamini & Hochberg (1995). Let R be the number of empirical wavelet coefficients that are not dropped by the thresholding procedure for a given sample (thus, they are kept in the model). Of these R coefficients, S are correctly kept in the model and V = R − S are erroneously kept in the model. The error in such a procedure is expressed in terms of the random variable Q = V /R – the proportion of the empirical wavelet coefficients kept in the model that should have been dropped. (Naturally, it is defined that Q = 0 when R = 0 since no error of this type can be made when no coefficient is kept.) The FDR of empirical wavelet coefficients can now be defined as the Expectation of Q, reflecting the expected proportion of erroneously kept coefficients among the ones kept in the model. Following the method of Benjamini & Hochberg (1995), Abramovich & Benjamini (1995,1996) proposed maximizing the number of empirical wavelet coefficients kept in the model subject to condition EQ < α, for some prespecified level α, yielding the following procedure 1. Take j0 = 0.
For each of the n − 1 empirical wavelet coefficients {dˆjk : j =
0, 1, . . . , J −1; k = 0, 1, . . . , 2j −1} calculate the corresponding 2-sided p-value, pjk , (testing H0 : djk = 0),
à pjk = 2 1 − Φ
Ã
|dˆjk | σ ˆ
!! ,
where Φ is the cumulative distribution function of a standard normal random variable. 2. Order the pjk according to their size, p(1) ≤ p(2) ≤ . . . ≤ p(n−1) (i.e., each p(i) corresponds to some coefficient djk ). 3. Find k = max (i : p(i) < (i/m)α). For this k, calculate ³ p(k) ´ λFDR = σ ˆ Φ−1 1 − . 2
(30)
4. Apply the hard thresholding rule (4) or soft thresholding rule (5) to all empirical wavelet coefficients (ignoring the coefficients at the coarsest levels) {dˆjk : j = j0 , . . . , J − 1; k = 0, 1, . . . , 2j − 1} with the threshold value (30) (using the traditional levels for significance testing, i.e., α = 0.01 or α = 0.05). Remark 4.4 We note that the universal threshold (26) can be viewed as a critical value of a similar test to the ones considered above. The level for this test is p p α = P (|dˆjk | > σ 2 log n | H0 ) ≈ (n π log n)−1 , 21
which is also equal to its power against the alternative H1 : djk = d (6= 0). Thus, as mentioned by Abramovich & Benjamini (1995), the approach of Donoho & Johnstone (1994) based on the universal threshold (26) is equivalent to the ‘panic’ procedure of controlling the probability of √ even one erroneous inclusion of a wavelet coefficient at the level (n π log n)−1 , but the level at which the error is controlled approaches zero as n tends to infinity. 4.1.5
Thresholding Using Cross-Validation
One way to choose the threshold level λ is by minimising the mean integrated squared error (MISE) between a wavelet threshold estimator gˆλ and the true function g. In symbols, the threshold λ should minimise Z M (λ) = E
(ˆ gλ (x) − g(x))2 dx.
(31)
In practice, the function g is unknown and so an estimate of M is required. Cross-validation is widely used as an automatic procedure to choose the smoothing parameter in many statistical settings – see, for example, Green & Silverman (1994) or Eubank (1999). The classical cross-validation method is performed by systematically expelling a data point from the construction of an estimate, predicting what the removed value would be and, then, comparing the prediction with the value of the expelled point. Cross-validation is usually numerically intensive unless there are some updating formulae that allow to calculate the ‘leaving-out-one’ predictions on the basis of the ‘full’ predictions only. In this respect very helpful is the ‘leaving-out-one’ Lemma 4.2.1 of Wahba (1990), which shows that such updating may be done when the so-called ‘compatibility condition’ holds. Although this condition is easy to derive for projection-type estimators, it fails to hold for nonlinear shrinkage or thresholding rules. One way to proceed is to pretend that this condition ‘almost’ holds. This approach to cross-validation in wavelet regression was adopted by Nason (1994, 1995, 1996). In order to directly apply the DWT, the latter author suggested breaking the original data set into 2 subsets of equal size: one containing only the even-indexed data, and the other, the odd-indexed data. The odd-indexed data will be used to ‘predict’ the even-indexed data, and vice-versa, leading to a ‘leave-out-half’ strategy. To be more specific, given data y = (y1 , . . . , yn )0 from model (1) with n = 2J , remove all the odd-indexed yi from the set. This leaves 2J−1 evenly indexed yi which are reindexed from j = 1, . . . , 2J−1 . These reindexed data are then used to construct a function estimate gˆλE by using a particular threshold parameter λ with either hard thresholding (4) or soft thresholding (5). To compare the function estimator with the left-out noisy data an interpolated version of
22
gˆλE is formed
1 E E E g¯λ,j = (ˆ g + gˆλ,j ), 2 λ,j+1
j = 1, . . . , n/2,
E E because g is assumed to be periodic. The g setting gˆλ,n/2+1 = gˆλ,1 ¯λO is computed for the odd-
indexed points and the interpolant is, similarly, formed as 1 O O O g¯λ,j = (ˆ g + gˆλ,j ), 2 λ,j+1
j = 1, . . . , n/2.
The full estimate for the MISE given in (31) compares the interpolated wavelet estimators and the left out points ˆ (λ) = M
n/2 X £
¤ E O (¯ gλ,j − y2j+1 )2 + (¯ gλ,j − y2j )2 .
(32)
j=1
It has been showed by Nason (1994) that one can almost always find a unique minimum of (32) ˆ (λ). λmin = arg min M λ≥0
This minimum value depends on n/2 data points (since both estimates of g, gˆλE and g¯λO are based on n/2 data points) and, therefore, a correction for the sample size is needed. Nason (1994, 1995) considered the universal threshold λU given in (26) to supply a heuristic method for obtaining a cross-validated threshold for n data points. By using this adjustment, the leaveout-half cross-validation threshold is defined as CV
λ
¶ µ log 2 −1/2 λmin . = 1− log n
(33)
Remark 4.5 Nason (1996) also developed a leave-one-out cross-validation method that works for any number of data points, removing the above algorithm’s restriction of n = 2J data points. However, since we are only considered the case of n = 2J in this paper, this latter algorithm is not considered in our simulation study in Section 5. We also pinpoint that Weyrich & Warhola (1995a, 1995b) and Jansen, Malfait & Bultheel (1997), by mimicking the classical cross-validation, applied the method of generalized crossvalidation to choose the threshold level λ. The criterion they used is GCV (λ) =
1 kw − wλ k2 , n k nn0 k2
where wλ is the vector of thresholded normalised coefficients and n0 /n is the fraction of coefficients replaced by zero by this particular threshold value. They show that the minimizer of this function is an asymptotically optimal threshold in the mean-squared error sense. This alternative threshold, however, is not considered in our simulation study in Section 5. 23
4.1.6
The SureShrink Threshold
Donoho & Johnstone (1995) introduced a scheme that uses the empirical wavelet coefficients at each resolution level j to choose a threshold value λj with which to threshold the empirical wavelet coefficients. The idea is to employ Stein’s unbiased risk criterion (see Stein (1981)) to get an unbiased estimate of the l2 -risk. Consider the following equivalent problem to (1) or, equivalently, to (2)–(3).
Suppose
X1 , . . . , Xs are independent N (µi , 1), i = 1, . . . , s, random variables. The problem is to estimate the mean vector µ = (µ1 , . . . , µs )0 with minimum l2 -risk. A result of Stein (1981) states that the ˆ l2 -loss can be estimated unbiasedly for any estimator µ that can be written as µ(X) = X+g(X), where the function g = (g1 , . . . , gs )0 : Rs → Rs is weakly differentiable. In other words, we have that ˆ Eµ ||µ(X) − µ||2 = s + Eµ {||g(X)||2 + 2 5 ·g(X)}, where
(34)
s X ∂gi 5·g ≡ . ∂xi i=1
Using the soft thresholding rule (5), we easily see that ||g(X)||2 =
s X [min (|Xi |, λ)]2
and
5 ·g(X) = −
s X
i=1
1[−λ,λ] (Xi ),
i=1
where 1A (x) is the usual indicator function for any set A. Then, the following quantity SURE(λ; X) = s − 2 · #{i : |Xi | ≤ λ} + [min (|Xi |, λ)]2 , where #B denotes the cardinality of any set B, is an unbiased estimate of the l2 -risk, i.e., ˆ λ (X) − µ||2 = Eµ SURE(λ; X). Eµ ||µ The threshold level λ is then set so as to minimise the estimate of the l2 -risk for a given data X1 , . . . , Xs , i.e., λ = arg min ? SURE(λ; X), 0≤λ≤λ
where λ? =
(35)
√ 2 log s. By considering dˆjk /ˆ σ = Xi , s = 2j and applying (35) at any level
j = j0 , . . . , J − 1, the SureShrink threshold is finally given by à ! ˆjk d λSj = arg min SURE λ; , j = j0 , ..., J − 1; k = 0, ..., 2j − 1, σ ˆ 0≤λ≤λU where, in this case, λU is given in (26) with n = 2j . 24
(36)
The SureShrink threshold (36) has a serious drawback in situations of extreme sparsity of the wavelet coefficients. In such cases, as noted by Donoho & Johnstone (1995), “... the noise contributed to the SURE profile by the many coordinates at which the signal is zero swamps the information contributed to the SURE profile by the few coordinates where the signal is nonzero”. To avoid this drawback, Donoho & Johnstone (1995) considered a hybrid scheme of the SureShrink threshold by the following heuristic idea: if the set of empirical wavelet coefficients is judged to be sparsely represented, then the hybrid scheme defaults to the levelp ˆ 2 log (2j ); otherwise the SURE criterion is used to select wise universal threshold λU j = σ a threshold value. In mathematical terms, the hybrid scheme of the SureShrink threshold is expressed, for j = j0 , ..., J − 1, as ( P2j −1 ˆ2 λU if ˆ 2 2j/2 (2j/2 + j 3/2 ) j k=0 djk ≤ σ HS λj = λSj otherwise.
(37)
Remark 4.6 The SureShrink threshold (36) and the hybrid SureShrink threshold (37) could also be obtained for the hard thresholding rule (4). However, the latter threshold is not continuous and does not have a bounded weak derivative, meaning that a more complicated SURE formula would be required to implement the idea on a particular data set. Also, it is possible to derive SureShrink-type thresholds for other thresholding rules, for example, as the ones given in (4)– (7). However, the simplicity of the SURE formula is again lost – see, for example, Gao (1998) for a SureShrink-type threshold for the nonnegative garrote threshold (7). 4.1.7
Thresholding as a Recursive Hypothesis Testing Problem
The multiple hypotheses testing approach to thresholding discussed in Section 4.1.4 produces a global threshold λ. On the other hand, Ogden & Parzen (1996a) developed a hypothesis testing procedure that produces level-dependent thresholds λj , as does the SureShrink methods discussed in Section 4.1.6. Rather than seeking to include as many wavelet coefficients as possible (subject to constraint) as in Abramovich & Benjamini (1995, 1996), the procedure of Ogden & Parzen (1996a) includes a wavelet coefficient only when there is strong evidence that is needed in the reconstruction. Consider the following equivalent problem to (1) or, equivalently, to (2)–(3). Let X1 , . . . , Xs be independent N (µs , 1), i = 1, . . . , s, random variables that represent the empirical wavelet coefficients at any level j = j0 , . . . , J − 1 with s = 2j . Let Is represent a non-empty subset of indices {1, . . . , s}. Then the multiple hypotheses testing problem could be expressed as H0 : µi = 0, i ∈ Is
versus H1 : µi 6= 0 for all i ∈ Is ;
25
µi = 0 for all i 6∈ Is .
(38)
The approach of Ogden & Parzen (1996a) to test the set of hypotheses given in (38) is as follows. If the cardinality of the set Is is not known, the standard likelihood ratio test for the P above hypotheses would be based on the test statistic si=1 Xi2 ∼ χ2s when H0 is true. (Note that this is also the test statistic that would be used if it were known that Is = {1, . . . , s}.) However, this is not the most appropriate test statistics, since it is usually assumed that very few of the µi ’s are non-zero, resulting in poor power of detection when Is contains only a few coefficients (since the noise of the zero wavelet coefficients will tend to overwhelm the signal of the nonzero wavelet coefficients). If the cardinality of the set Is is known, say equal to m, then the standard likelihood ratio test statistic would be the sum of squares of the m largest Xi ’s. However, in practice, m is unknown, so Ogden & Parzen (1996a) suggested a recursive testing procedure for Is containing only one element each time. Hence, the appropriate test statistics is the largest of the squared Xi ’s. The α-critical point of this distribution is shown to be equal to ( xαs
=
Φ
" −1
(1 − α)1/s + 1 2
#)2 ,
(39)
where Φ is the cumulative distribution function of a standard normal random variable. The recursive method then for choosing a threshold λj at each level j = j0 , . . . , J − 1 consists of the following steps 1. Compare the largest Xi2 with the critical point xαs given in (39). 2. If the Xi2 is larger, this indicates that there still significant signal among the wavelet coefficients. Remove the Xi with the largest absolute value from consideration, set s to s − 1, and return to Step 1. 3. If Xi2 < xαs , then there is no strong evidence of strong signal among the (remaining) wavelet coefficients. The threshold λj for the current level j is set equal to the largest remaining Xi in absolute value. By following the above algorithm, at each level j, we are throwing out ‘large’ wavelet coefficients from the data set until everything left (the set of ‘small’ wavelet coefficients) is not distinguishable from pure noise. By setting the threshold λj equal to the maximum absolute value of the ‘small’ wavelet coefficients, we are ensuring that they will all be shrunk to zero, and that each ‘large’ wavelet coefficient will be included in the reconstruction, but shrunk toward zero by the same amount. In other words, this procedure has a natural interpretation as a soft thresholding rule (5) with level-dependent thresholds λj . 26
In determining the thresholds λj , the user has some control over the amount of smoothness that is done via the choice of α involved in (39). In general, choosing a relatively small value of α will make it very difficult for a wavelet coefficient to be judged ‘significant’ resulting in a smoother estimate. On the other hand, choosing a relatively large value of α makes it easier for a wavelet coefficient to be included in the reconstruction, resulting in a less smooth estimate. The recommended value is α = 0.05. Remark 4.7 Although it is not considered in our simulation study in Section 5, we mention that Ogden & Parzen (1996b) also developed a recursive hypothesis testing procedure to chose leveldependent thresholds λj that take into account not only the relative magnitudes of the wavelet coefficients (as in Abramovich & Benjamini (1995, 1996) and Ogden & Parzen (1996a)), but also the relative position of large wavelet coefficients. Their approach adapts standard changepoint methods to test the set of hypotheses given in (38) based on omnibus tests that can be used to test the null hypothesis of equal means versus a very wide variety of possible alternatives. However, as mentioned by Ogden (1997), this approach suffers somewhat from a lack of power in detecting ‘large’ wavelet coefficients.
4.2
Classical Methods: Block Thresholding
The wavelet thresholding procedure described in Section 3.1 achieves adaptivity through termby-term thresholding of the empirical wavelet coefficients. There, each individual empirical wavelet coefficient is compared with a predetermined threshold; a coefficient is retained if its magnitude in absolute value is above the threshold and is discarded otherwise. This approach achieves a degree of trade-off between variance and bias contribution to mean squared error. However, this trade-off is not optimal; it removes too many terms from the empirical wavelet expansion, with the result the estimator is too biased and has a sub-optimal L2 -risk convergence rate (and also in other metrics Lp , 1 ≤ p ≤ ∞). One way to increase estimation precision is by utilising information about neighboring empirical wavelet coefficients. In other words, empirical wavelet coefficients could be thresholded in blocks (or groups) rather than individually. As, a result, the amount of information available from the data for estimating the “average” empirical wavelet coefficient within a block, and making a decision about retaining or discarding it, would be an order of magnitude larger than the case of a term-by-term threshold rule. This would allow threshold decisions to be made more accurately and permit convergence rates to be improved. In what follows we consider both nonoverlapping and overlapping block thresholding estimators.
27
4.2.1
A Nonoverlapping Block Thresholding Estimator
A nonoveralpping block thresholding estimator was proposed by Cai (1999) via the approach of ideal adaptation with the help of an oracle. At each resolution level j = j0 , . . . , J − 1, the empirical wavelet coefficients dˆjk are grouped into nonoverlapping blocks of length L. In each case, the first few empirical wavelet coefficients might be re-used to fill the last block (which is called the Augmented case) or the last few remaining empirical wavelet coefficients might not be used in the inference (which is called the Truncated case), should L not divide 2j exactly. Let (jb) denote the set of indices of the coefficients in the bth block at level j, that is, (jb) = {(j, k) : (b − 1)L + 1 ≤ k ≤ bL}, 2 and let S(jb) denote the L2 -energy of the noisy signal in the block (jb). Within each block (jb),
estimate the wavelet coefficients djk simultaneously via a James-Stein thresholding rule ! Ã 2 2 S − λLσ (jb) (jb) dˆjk . d˜jk = max 0, 2 S(jb)
(40)
Then, an estimate of the unknown function g is obtained by applying the IDWT to the vector consisting of both empirical scaling coefficients cˆj0 k (k = 0, 1, . . . , 2j0 − 1) and thresholded (jb) empirical wavelet coefficients d˜ (j = j0 , . . . , J − 1; k = 0, 1, . . . , 2j − 1). jk
Cai (1999) suggested using L = log n and setting λ = 4.50524 which is the solution of the equation λ − log λ − 3 = 0. This particular threshold was chosen so that the corresponding wavelet thresholding estimator is (near) optimal in function estimation problems. The resulting block thresholding estimator was called BlockJS. Remark 4.8 Hall, Penev, Kerkyacharian & Picard (1997) and Hall, Kerkyacharian & Picard (1998, 1999) considered wavelet block thresholding estimators by first obtaining a near unbiased estimate of the L2 -energy of the true coefficients within a block and then keeping or killing all the empirical wavelet coefficients within the block based on the magnitude of the estimate. Although it would be interesting to numerically compare their estimators, they require the selection of smoothing parameters – block length and threshold level – and it seems that no specific criterion is provided for choosing these parameters in finite sample situations. 4.2.2
An Overlapping Block Thresholding Estimator
Cai & Silverman (2001) considered an overlapping block thresholding estimator by modifying the nonoverlapping block thresholding estimator of Cai (1999). The effect is that the treatment 28
of empirical wavelet coefficients in the middle of each block depends on the data in the whole block. At each resolution level j = j0 , . . . , J − 1, group the empirical wavelet coefficients dˆjk into nonoverlapping blocks (jb) of length L0 . Extend each block by an amount L1 = max (1, [L0 /2]) in each direction to form overlapping larger blocks (jB) of length L = L0 + 2L1 . 2 Let S(jB) denote the L2 -energy of the noisy signal in the larger block (jB). Within each block
(jb), estimate the wavelet coefficients simultaneously via the following James-Stein thresholding rule
à (jb) d˘jk
= max 0,
2 S(jB) − λLˆ σ2 2 S(jB)
! dˆjk .
(41)
Then, an estimate of the unknown function g is obtained by applying the IDWT to the vector consisting of both empirical scaling coefficients cˆj0 k (k = 0, 1, . . . , 2j0 − 1) and thresholded (jb) empirical wavelet coefficients d˘ (j = j0 , . . . , J − 1; k = 0, 1, . . . , 2j − 1). jk
Cai & Silverman (2001) suggested using either L0 = [log n/2] and taking λ = 4.50524 (which results in the NeighBlock estimator) or L0 = L1 = 1 (i.e., L = 3) and taking λ =
2 3
log n
(which results in the NeighCoeff estimator). NeighBlock uses neighbouring coefficients outside the block of current interest in fixing the threshold, whilst NeighCoeff chooses a threshold for each coefficient by reference not only to that coefficient but also to its neighbours. Remark 4.9 The above thresholding rule (41) is different to the one given in (40) since the empirical wavelet coefficients dˆjk are thresholded with reference to the coefficients in the larger block (jB). One can envision (jB) as a sliding window which moves L0 positions each time and, for each window, only half of the coefficients in the center of the window are estimated.
4.3
Bayesian Methods: Term-By-Term Shrinkage and Thresholding
The basic Bayesian framework of function estimation as discussed in Section 3.2 can now be used to obtain wavelet shrinkage and threshold estimates. Obviously, using expressions (17), (18) and(19), different losses will lead to different Bayesian rules and, therefore, to different wavelet shrinkage and threshold estimates for the unknown function g. The problem of eliciting the hyperparameters of the prior distributions is obviously dependent on the parametric forms chosen for the prior distributions.
If the hyperparameters have
meaningful interpretations or if a connection can be drawn between the hyperparameters and some quantities that are easier to specify, criteria for choosing the form of the prior distributions can be obtained. Alternatively, one can employ empirical Bayes methods that attempt to determine the hyperparameters of the prior distributions from the data being analysed. These latter methods will be discussed in subsequent sections. 29
4.3.1
Shrinkage estimates based on L2 -losses
Clyde & George (1999, 2000) obtained wavelet shrinkage estimates by considering leveldependent posterior mean estimates. Using (17), (18) and(19), it is easily seen that L2 -based Bayes rules BR(djk | dˆjk , σ 2 ) correspond to marginal posterior means of wavelet coefficients djk conditionally on σ 2 , given by E(djk
| dˆjk , σ 2 ) =
τj2
1
dˆjk .
1 + Ojk (dˆjk , σ 2 ) σ 2 + τj2
(42)
Expression (42) corresponds to a level-dependent shrinkage rule. It shrinks the empirical wavelet coefficients dˆjk by a nonlinear factor of (1 + τ 2 )/((1 + Ojk (dˆjk , σ 2 ))(σ 2 + τ 2 )). j
j
Estimates of the hyperparameters πj , τj2 and σ 2 can now be obtained by using empirical Bayes methods which are based on using marginal maximum likelihood estimates of the hyperparameters. Marginalizing over wavelet coefficients djk and model uncertainty γjk , and conditioning on πj , τ 2 and σ 2 , the empirical wavelet coefficients dˆjk are independently distributed j
as a mixture of two normal distributions.
ˆ j = (dˆjk : Define d
k = 0, 1, . . . , 2j − 1) for
j = j0 , . . . , J − 1. At each level j, the marginal log-likelihood for πj , τj2 and σ 2 is therefore, up to a constant, L(πj , τj2 , σ 2
ˆj) = |d
j −1 2X
(
à 2
log πj (σ +
τj2 )−1/2 exp
k=0
−
!
dˆ2jk 2(σ 2 + τj2 )
à + (1 − πj )σ
−1
exp −
dˆ2jk
!)
2σ 2 (43)
Because the empirical Bayes estimates of πj , τj2 and σ 2 based on (43) are generally not available in closed forms, approximations of the marginal maximum likelihood are used which could be maximized quickly. Maximum likelihood estimation of πj and τj2 using the EM algorithm By estimating the noise level σ with the robust estimate (23), we now discuss an approach to obtain the maximum likelihood estimates of πj and τj2 using the EM algorithm. This approach uses a complete data (or ‘augmented’) likelihood and applies the EM algorithm as developed in exponential family problems. Rather than using the marginal log-likelihood (43) at each level j, consider now the loglikelihood given the latent or ‘missing’ vector γ j = (γjk : k = 0, 1, . . . , 2j − 1). This complete data log-likelihood takes the form, up to an additive constant, j −1 · µ ¶ ¸ 2X πj 1 2 ˆ 2 2 L(πj , τj | dj , γ j ) = log − log(ˆ σ + τj ) γjk 1 − πj 2 k=0
+ 2j log(1 − πj ) + 30
τj2 2(ˆ σ2 +
j −1 2X
τj2 )
k=0
γjk
dˆ2jk σ ˆ2
(44)
.
which belongs to a regular exponential family of the form [a1 (ζ 1 )]T b1 (X) + c1 (ζ 1 ) + d1 (X), where ζ 1 = (πj , τj2 ), a1 (ζ 1 ) is the vector of natural parameters, X = (dj , γ j ) and b1 (X) = P P ( k γjk , k γjk dˆ2jk /ˆ σ 2 )T is the vector of sufficient statistics. We can now apply the EM algorithm developed for exponential family problems (see Dempster, Laird & Rubin (1977)) that is particularly simple to implement. Hence, using (44), we have that • E-step: It consists of computing the expectations of the sufficient statistics with respect ˆ j , πj and τ 2 to the distribution of γ j given d j ³ ´ ˆb(i) (X) = E b1 (X) | d ˆ j , π (i) , (τ 2 )(i) j 1 j T j −1 j −1 2X 2X 2 ˆ djk (i) (i) = ηjk (dˆjk , σ ˆ 2 ), ηjk (dˆjk , σ ˆ2) 2 , σ ˆ k=0
k=0
where
1
(i)
ηjk (dˆjk , σ ˆ2) =
(i) 1 + Ojk (dˆjk , σ ˆ2)
(i) are the posterior expectations of γjk , and Ojk (dˆjk , σ 2 ) are the posterior odds ratios that (i)
γjk = 0 versus γjk = 1 given by (19) evaluated using σ ˆ 2 and the current estimates πj and (τj2 )(i) . (i)
• M-step: It consists of maximizing [a1 (ζ 1 )]T ˆb1 (X) + c1 (ζ 1 ), resulting in the solution P2j −1 (i+1) πj
(τj2 )(i+1)
=
k=0
(i) ηjk (dˆjk , σ ˆ2)
j 2P j 2 −1 (i) ˆ 2 ) dˆ2 η ( d , σ ˆ jk k=0 jk = max 0, P2j −1 −σ ˆ2 . (i) 2 ˆ ˆ ) k=0 η (djk , σ
(45) (46)
Because the complete data log-likelihood (44) belongs to a regular exponential family, if the parameter estimates are in the interior of the parameter space, standard exponential family theory ensures that the solutions for πj and τj2 are the unique global solutions, conditional on the values of the latent vector γ j . The E-step and M-step above are repeated until the estimates converge, and yield a stationary point of the marginal log-likelihood (43). As in the case of direct maximization of the marginal log-likelihood (43) using the GaussSeidel algorithm (or other algorithms) applied to (43), the EM algorithm applied to the complete data log-likelihood (44) may converge to a local mode. Also, the direct maximization methods may result in faster convergence, because the convergence rate of the EM algorithm is linear 31
(see Dempster, Laird & Rubin (1977)). However, the iterative solutions (45), (46) using the EM algorithm are in closed form and provide some insight into the problem and connections to the conditional maximum likelihood estimates that we will discuss below. Moreover, the M-step estimate (45) of πj has a natural interpretation: is the posterior expected fraction of nonzero wavelet coefficients. Maximum likelihood estimation of σ, πj and τj2 using the EM algorithm Instead of using the robust estimate (23) of σ, Clyde & George (2000) observed that the complete data log-likelihood (44) at each level j can be combined to construct the complete data logˆ = (dˆjk : likelihood based on all levels for estimating σ 2 using the EM algorithm. By setting d j = j0 , . . . , J − 1; k = 0, 1, . . . , 2j − 1), γ = (γjk : j = j0 , . . . , J − 1; k = 0, 1, . . . , 2j − 1), π = (πj : j = j0 , . . . , J − 1) and τ 2 = (τj2 : j = j0 , . . . , J − 1), it is not difficult to see that this complete data log-likelihood based on all levels j takes the form, up to a constant, µ
¶ J−1 2j −1 J−1 2j −1 1 X X 1 X X (1 − γjk ) − 2 (1 − γjk )dˆ2jk σ2 2σ j=j0 k=0 j=j0 k=0 Ã ! j j −1 J−1 2X −1 2X X 1 1 1 log + γjk − 2 γjk dˆ2jk 2 2 + τ2 2 σ σ + τ j j k=0 j=j0 k=0 j −1 µ ¶ 2X J−1 X π j 2j log(1 − πj ) + log + γjk 1 − πj
ˆ γ) = L(π, τ , σ | d, 2
2
1 log 2
j=j0
(47)
k=0
which still belongs to a regular exponential family of the form [a2 (ζ 2 )]T b2 (X) + c2 (ζ 2 ) + d2 (X), where ζ 2 = (π, τ 2 , σ 2 ), a2 (ζ 2 ) is the vector of natural parameters, X = (d, γ) and b2 (X) is the P j −1 P j −1 γjk dˆ2jk ) for (2(J − j0 ) + 1) × 1 vector of sufficient statistics with components ( 2k=0 γjk , 2k=0 PJ−1 P2j −1 (1 − γjk )dˆ2 . j = j0 , . . . , J − 1 and j=j0
k=0
jk
Applying again the EM algorithm developed for exponential family problems, we have • E-step: It consists of computing the expectations of the sufficient statistics with respect ˆ π, τ 2 and σ 2 to the distribution of γ given d, ³ ´ ˆb(i) (X) = E b2 (X) | d, ˆ π (i) , (τ 2 )(i) 2 which has components
j −1 2X
k=0
(i) ηjk (dˆjk , σ 2 ),
j −1 2X
T (i) ηjk (dˆjk , σ 2 ) dˆ2jk ,
k=0
32
for j = j0 , . . . , J − 1,
and
j −1 J−1 X 2X
(i)
(1 − ηjk (dˆjk , σ 2 )) dˆ2jk .
j=j0 k=0
The quantities
1
(i)
ηjk (dˆjk , σ 2 ) =
1+
(i) Ojk (dˆjk , σ 2 )
(i) are the posterior expectations of γjk , where Ojk (dˆjk , σ 2 ) are the posterior odds ratios that (i)
γjk = 0 versus γjk = 1 given by (19) evaluated using the current estimates πj , (τj2 )(i) and (σ 2 )(i) . (i)
• M-step: It consists of maximizing [a2 (ζ 2 )]T ˆb2 (X) + c2 (ζ 2 ),resulting in the solution P2j −1 (i+1) πj
(τj2 )(i+1) (σ 2 )(i+1)
=
k=0
(i)
ηjk (dˆjk , σ 2 )
j 2P j 2 −1 (i) ˆ 2 ) dˆ2 η ( d , σ jk k=0 jk jk = max 0, P j − (σ 2 )(i+1) . 2 −1 (i) ˆ 2 k=0 ηjk (djk , σ ) PJ−1 P2j −1 (i) ˆ 2 ˆ2 j=j0 k=0 (1 − ηjk (djk , σ )) djk = . P P2j −1 (i) ˆ 2 2(J−j0 ) − J−1 j=j0 k=0 ηjk (djk , σ )
(48) (49)
(50)
As before, because the complete data log-likelihood (47) belongs to a regular exponential family, if the parameter estimates are in the interior of the parameter space, standard exponential family theory ensures that the solutions for πj , τj2 and σ 2 are the unique global solutions, conditional on the values of the latent vector γ. The E-step and M-step above are repeated until the estimates converge, and yield a stationary point of the marginal log-likelihood (43). The M-step estimates (48) and (50) of πj and σ 2 , respectively, have natural interpretations: (48) is the posterior expected fraction of nonzero wavelet coefficients, while (50) is the ratio of the posterior expected error sum of squares to the posterior expected degrees of freedom. Remark 4.10 Rather than maximizing the marginal log-likelihood L given by (43) (with σ estimated by the robust estimate (23)) directly, Clyde & George (2000) also used the conditional maximum likelihood approach of George & Foster (2000) which can be maximized very quickly in practice. This method takes the ‘augmented’ log-likelihood (44) and evaluates it at the mode for γjk , rather than using the posterior mean, as in the EM algorithm discussed above. However, this latter algorithm is not considered in our simulation study in Section 5. This is so because the conditional maximum likelihood estimators have the same form as the EM maximum likelihood estimators (45) and (46), and are exactly the same when the posterior distribution of γjk is degenerate at one or zero. The difference between the EM maximum 33
likelihood and the conditional maximum likelihood estimates will be the most extreme when the posterior mean of γjk is 0.5 and when τj2 is small. However, while the EM maximum likelihood estimates of πj and τj2 appear to be asymptotically consistent (as 2j → ∞), this is not necessarily the case with the conditional maximum likelihood estimates (see, Johnstone & Silverman, 1998). On the other hand, because the conditional maximum likelihood estimators are very rapidly computable, they can be used as starting values for the EM algorithms for computing the maximum likelihood estimates of πj and τj2 . 4.3.2
Thresholding estimates based on L1 -losses
Abramovich, Sapatinas & Silverman (1998) obtained wavelet thresholding estimates by considering level-dependent posterior median estimates. Using (17), (18) and(19), it is easily seen that L1 -based Bayes rules BR(djk | dˆjk , σ 2 ) correspond to marginal posterior medians of wavelet coefficients djk conditionally on σ 2 , given by Median(djk | dˆjk , σ 2 ) = sign(dˆjk ) max(0, ζjk ), where ζjk =
τj2 σ 2 + τj2
Ã
στj
|dˆjk | − q Φ σ 2 + τj2
−1
1 + min(Ojk (dˆjk , σ 2 ), 1) 2
(51) !
and Φ is the cumulative distribution function of a standard normal random variable. The quantity ζjk is negative for all dˆjk in some implicitly defined interval [−λj , λj ], and hence Median(djk | dˆjk , σ 2 ) = 0 whenever |dˆjk | falls below the threshold λj .
Expression (51) corresponds to a bonafide
thresholding rule (a level-dependent ‘kill’ or ‘shrink’ thresholding rule) with thresholds λj . Note that, unlike the soft thresholding rule (5), extent of shrinkage in (51) depends on |dˆjk |: large |dˆjk | are shrunk less. For large dˆjk the thresholding rule is asymptotic to linear shrinkage by a factor of τj2 /(σ 2 + τj2 ), since the value of Φ−1 above becomes negligible as |dˆjk | → ∞. One can now use the iterative EM solutions (48), (49) and (50) of Clyde & George (2000) to get maximum likelihood estimates of πj , τj2 and σ 2 . Alternatively, by estimating the noise level σ with the robust estimate (23), one can use the iterative EM solutions (45), (46) of Clyde & George (1999) to get maximum likelihood estimates of πj and τj2 . As discussed in Section 4.3.1, the latter approach uses a complete data (or ‘augmented’) likelihood and applies the EM algorithm as developed in exponential family problems. Johnstone & Silverman (1998) suggested another approach that uses an EM algorithm based on a derivation that introduces an entropy function to create a modified likelihood, where the global maxima 34
of this modified likelihood are the global maximum likelihood estimates of the marginal loglikelihood (43). We now present this alternative method to obtain the maximum likelihood estimates of πj and τj2 . Maximum likelihood estimation of πj and τj2 using the EM algorithm – an alternative derivation Consider the following binary entropy function H(ξ) = −ξ log ξ − (1 − ξ) log(1 − ξ)
(52)
log(1 + e−x ) = sup (H(ξ) − ξx),
(53)
which has conjugate 0≤ξ≤1
the maximum being attained at ξ = 1/(1 + ex ). At each level j, the marginal log-likelihood (43) can therefore be rewritten as, up to a constant, L(πj , τj2
ˆ j ) = 2 log(1 − πj ) + |d j
j −1 2X
log[1 + exp{−h(dˆjk , πj , τj2 , σ ˆ 2 )}],
(54)
k=0
where µ h(dˆjk , πj , τj2 , σ ˆ 2 ) = log
1 − πj πj
Ã
¶ + log
(ˆ σ 2 + τj2 )1/2 σ ˆ
! −
τj2 2ˆ σ 2 (ˆ σ 2 + τj2 )
dˆ2jk .
By defining ?
L
(πj , τj2 ;
ξ0 , ξ1 , . . . ξ2j −1 ) =
j −1 2X
[H(ξk ) − ξk h(dˆjk , πj , τj2 , σ ˆ 2 )] + 2j log(1 − πj ),
(55)
k=0
it then follows from (53) that if we maximize (55) over all its arguments, we get the maximum of the original marginal log-likelihood (43). For fixed πj and τj2 , the function (55) is clearly a sum of concave functions of the individual ξk (k = 0, 1, . . . , 2j − 1). The unique maxima of each of these concave functions are given by solving, over ξk ∈ [0, 1], the equation d H(ξk ) = h(dˆjk , πj , τj2 , σ ˆ 2 ), dξk which after some manipulation lead to ξˆk =
1 1 + Ojk (dˆjk , σ ˆ2)
,
(56)
where Ojk (dˆjk , σ ˆ 2 ) are the posterior odds ratios that γjk = 0 versus γjk = 1 given by (19) with σ 2 replaced by σ ˆ2. 35
We can now find the global maximum of (55) over πj ∈ [0, 1] and τj2 ∈ [0, ∞) by fixing ξˆk (k = 0, 1, . . . , 2j − 1). In this case, the function (55) can be rewritten as a sum of two functions Q1 and Q2 . The first function, which does not involve πj and τj2 , is given by Q1 = log σ ˆ
2
j −1 2X
ξˆk −
k=0
j −1 2X
ξˆk log ξˆk −
k=0
j −1 2X
[(1 − ξˆk ) log(1 − ξˆk )],
k=0
while the second function, which involves πj and τj2 , is given by Q2 =
j −1 j −1 2X 2X τj2 1 ξˆk dˆ2jk − log(ˆ σ 2 + τj2 ) ξˆk 2 σ ˆ 2 (ˆ σ 2 + τj2 ) k=0
k=0
µ +2j log(1 − πj ) + log
πj 1 − πj
j −1 ¶ 2X
ξˆk .
(57)
k=0
The last two terms of expression (57) correspond to a concave function of πj with maximum given by
P2j −1 ˆ ξk π ˆj = k=0j . 2 The first term of expression (57) is not a concave function though. However, we have j −1 j −1 2X 2X ∂ ? 1 ξˆk , L (πj , τj2 ; ξˆ0 , ξˆ1 , . . . , ξˆ2j −1 ) = ξˆk dˆ2jk − (ˆ σ 2 + τj2 ) ∂τj2 2(ˆ σ 2 + τj2 )2 k=0
(58)
(59)
k=0
the product of a strictly positive quantity and a linearly decreasing function of τj2 . It follows then from (59) that the global maximum of (55) over τj2 ∈ [0, ∞) is given by P2j −1 ˆ ˆ2 ξ d k=0 k jk −σ ˆ2 . τˆj2 = max 0, P2j −1 ˆk ξ k=0
(60)
By optimizing alternately over the ξˆk and over (πj , τj2 ), we finally see the connection with the EM algorithm. The ξˆk , as given by (56), are the posterior expected values of γjk given dˆjk and the current values of πj and τj2 . The function Q2 , as given by (57), is just the expected complete data log-likelihood for (πj , τ 2 ) given dˆjk and the previous iterate’s estimates of (πj , τ 2 ), j
j
as obtained in (56) (compare with (44)). Thus, the estimates (58) and (60) are just the ones obtained by (45) and (46), respectively. Remark 4.11 Although it is not considered in our simulation study in Section 5, we mention that Abramovich, Sapatinas & Silverman (1998) have studied a particular form for the hyperparameters πj and τj2 of the prior model (13), (14), and estimated σ by the robust estimate 36
(23).
These hyperparameters depend on additional hyperparameters through the following
structure τj2 = C1 2−αj
and
πj = min (1, C2 2−βj ),
j = j0 , . . . , J − 1,
where α, β, C1 and C2 are non-negative constants. Some interpretation of these constants were given by Abramovich, Sapatinas & Silverman (1998) to explain how they might be derived. Part of the novelty of the approached was the idea that by simulating observations from different Besov spaces, one can elicit the correct space from which to choose the prior. In particular, the parameters α and β and the Besov space parameters were connected, so that if a particular Besov space was chosen to represent prior beliefs, α and β could be numerically derived. A weakness of this approach is that while α and β has nice interpretations, the parameters C1 and C2 do not have good intrinsic interpretability, and so elicitation of these parameters would be very difficult. One recommendation was to choose the parameters C1 and C2 by arguments related to the method of moments and, for practical applications, to choose α = 1 and β = 0.5 that found to work well for standard test cases. 4.3.3
Thresholding estimates using a Bayesian hypothesis testing approach
Similar in spirit to the multiple hypotheses testing procedures discussed in Sections 4.1.4 and 4.1.7, a Bayesian method for obtaining a bonafide wavelet thresholding estimator was considered by Vidakovic (1998). For each wavelet coefficient dˆjk | djk , σ 2 ∼ N (djk , σ 2 ), this method involves testing the following hypothesis H0 : djk = 0 versus H1 : djk 6= 0 according to the Bayesian framework that requires a prior distribution that has a point mass component. Otherwise, the testing is impossible because any continuous prior density will give the prior (and hence the posterior) probability of zero to the precise hypothesis – see, for example, Berger (1985). If the hypothesis H0 is rejected, then djk is estimated by dˆjk . In each level j = j0 , . . . , J − 1, the prior distribution could therefore be taken as djk ∼ πj ξ(djk ) + (1 − πj )δ(0),
k = 0, 1, . . . , 2j − 1,
(61)
where, as before, δ(0) is a point mass at zero and ξ describes the behaviour of djk when djk is nonzero (i.e. when H0 is false), which occurs with probability πj . Considering the above setting to the prior mixture model of one normal distribution and a point mass at zero (discussed in detail in Section 3.2), and applying the usual Bayes methods for hypothesis testing, Abramovich & Sapatinas (1999) obtained the following wavelet thresholding 37
estimator (which is called the Bayes factor thresholding rule since the posterior odds ratio is obtained by multiplying the Bayes factor with the prior odds ratio) d˜jk = dˆjk χ(ηjk < 1)
with ηjk =
P (H0 | dˆjk ) , P (H1 | dˆjk )
(62)
where χ is the usual indicator function and ηjk is the posterior odds ratio that is given by (19). It essentially mimics the hard thresholding rule (4), since a wavelet coefficient dˆjk will be thresholded if the corresponding posterior odds ratio ηjk > 1 and will be kept as it is otherwise. The wavelet thresholding estimate (62) is then incorporated into expressions (21) or (22) in order to get an estimate of the unknown response function g. To apply the wavelet thresholding rule (62), the parameters πj , τj2 and σ 2 should be chosen appropriately. One, as before, could use the robust estimate (23) of σ and the iterative EM solutions (45) and (46) of Clyde & George (1999) to get maximum likelihood estimates of πj and τj2 (or, equally, the estimates (58) and (60) obtained by Johnstone & Silverman, 1998). Alternatively, the iterative EM solutions (48), (49) and (50) of Clyde & George (2000) to get maximum likelihood estimates of πj , τj2 and σ 2 could be adopted. Remark 4.12 To compare the Bayesian thresholding rules (51) and (62), note that the latter rule is always a ‘keep’ or ‘kill’ thresholding, whilst the former rule is a ‘shrink’ or ‘kill’ thresholding, where extend of shrinkage depends on the absolute values of the wavelet coefficients. In addition, (62) thresholds dˆjk if the corresponding ηjk ≥ 1. One can verify that (51) will ‘kill’ those dˆjk , whose
ˆ τ | d | j jk ≥ 1 − 2Φ − q 2 2 σ σ + τj
ηjk
and, hence, it will threshold more coefficients. 4.3.4
Alternative shrinkage estimates based on L2 -losses
Vidakovic & Ruggeri (2001) obtained wavelet shrinkage estimates by putting a distribution on σ 2 and considering a prior distribution for the wavelet coefficients djk that is similar in spirit to the prior mixture model (13), (14). For each wavelet coefficient dˆjk | djk , σ 2 ∼ N (djk , σ 2 ), a standard way of integrating out σ 2 is to choose the prior distribution of σ 2 to be exponential, σ 2 ∼ E(µ), where µ > 0. It is well known now that the exponential distribution is the entropy maximiser among all distributions supported on (0, ∞) with a fixed first moment. Thus, given the moment, the exponential prior choice on σ 2 is most uninformative. 38
√ The marginal distribution is then the double exponential, dˆjk | djk ∼ DE(djk , 1/ 2µ) with probability density function given by f (dˆjk | djk ) =
√ p 2µ exp (− 2µ |dˆjk − djk |) 2
which follows from the fact that the double exponential distribution is a scale mixture of normals. Vidakovic & Ruggeri (2001) observed that by using the prior distribution djk ∼ DE(0, ν), the marginal distribution (predictive distribution) of dˆjk is given by p ˆjk |/ν) − µ/2 exp (−√2µ |dˆjk |) µν exp (−| d f (dˆjk ) = (63) 2µν 2 − 1 and the corresponding posterior means of wavelet coefficients djk are given by √ ν(ν 2 − 1/(2µ))dˆjk exp (−|dˆjk |/ν) + ν 2 (exp (−|dˆjk | 2µ) − exp (−|dˆjk |/ν))/µ ˆ . E(djk | djk ) = √ √ (ν 2 − 1/(2µ))(ν exp (−|dˆjk |/ν) − (1/ 2µ) exp (−|dˆjk | 2µ)) (64) However, it can be seen that expression (64) is close to a linear shrinkage rule known to be under-performing in wavelet-based methods. To obtain Bayesian shrinkage rules with a more desirable shape, Vidakovic & Ruggeri (2001) considered the ²-contaminated priors djk ∼ ²j DE(0, ν) + (1 − ²j )δ(0)
(65)
which is similar in spirit to the prior mixture model (13), (14). Under the above prior mixture model, the marginal distribution (predictive distribution) of dˆjk is given by p f (²) (dˆjk ) = ²j f (dˆjk ) + (1 − ²j )DE(0, 1/ 2µ)
(66)
and the corresponding L2 -based Bayes rules BR(djk | dˆjk , σ 2 ) correspond to posterior means of wavelet coefficients djk are given by E(²) (djk | dˆjk ) =
²j f (dˆjk )E(djk | dˆjk )
, √ ²j f (dˆjk ) + (1 − ²j )DE(0, 1/ 2µ)
(67)
where f (dˆjk ) and E(djk | dˆjk ) are given by (63) and (64) respectively. The shrinkage rule (67) is now ‘close’ to a thresholding rule – it heavily shrinks small empirical wavelet coefficients while the large ones are shrunk slightly. Expression (67) is called by Vidakovic & Ruggeri (2001) the Bayesian adaptive multiresolution smoother (BAMS). In order to apply it in practice, they proposed an empirical (moment matching) specification of the parameters µ, ν and ²j that works well for standard test cases and emphasized that the nature of the data may call for different parameter values. They suggest the following choices: 39
µ: Since µ is the reciprocal of the mean for the prior on σ 2 , σ is first estimated by a robust Tukey’s σ ˆ = |Q1 − Q3 |/C, where Q1 and Q3 are the first and third quartile of the finest level J − 1 of the wavelet decomposition, and C ∈ [1.3, 1.5]. Then, µ is estimated by µ ˆ = 1/ˆ σ , which according to the Strong Law of large Numbers should be close to µ. ²j : Since 1 − ²j is the weight of the point mass at zero in the prior distribution (65), it should be closed to one at the finest level of the wavelet decomposition and zero at the coarsest levels. A good estimator of ²j is then given by a hyperbolic decay ²ˆj = 1/(j − j0 + 1)1.5 for all j = j0 , . . . , J − 1. ν: Since ν is the scale of the ‘spread part’ (which is a double exponential having variance 2ν 2 ) in the prior distribution (65) and because of the independence between the noise and the signal parts, we have the σs2 = 2²2j ν 2 + 1/µ, where σs2 is the variance of the sample data. p Taking 2²2j ≈ 1 as a mid-point of range of ²j , ν is estimated by νˆ = max (0, σs2 − 1/µ). Remark 4.13 Although it is not considered in our simulation study in Section 5, we mention that Vidakovic (1998) proposed wavelet shrinkage estimates by considering a symmetric prior distribution on djk (i.e., f (djk ) = f (−djk )).
Although, the choice of normal distribution
is not recommended for robustness reasons, Vidakovic (1998) suggested the prior distribution djk ∼ tn (0, ν), a t distribution with mean zero, scaling parameter ν and n degrees of freedom. Empirical specification of the parameters µ and ν and n that works well for standard test cases was also suggested. The corresponding L2 -based shrinkage rules also shrink small empirical wavelet coefficients heavily and large only slightly. However, close expressions are not available and Monte Carlo methods, for example, could be used to approximate the integrals involved. 4.3.5
Shrinkage estimates based on Deterministic/Stochastic Decompositions
All the Bayesian approaches described previously to obtain wavelet shrinkage and wavelet thresholding estimates, assumed a prior for each wavelet coefficient djk with zero mean. Huang & Cressie (2000) proposed a Bayesian approach that does not put such a ‘strong’ assumption on the prior mean but rather they estimated it and plugged it into the wavelet shrinkage formulae. Moreover, they assumed that the underlying signal is composed of a piecewise-smooth deterministic part plus a zero mean stochastic part. ˆj0 = (ˆ Consider the vector of empirical scaling coefficients c cj0 ,0 , cˆj0 ,1 , . . . , cˆj0 ,2j0 −1 )0 and the ˆ j = (dˆj,0 , dˆj,1 , . . . , dˆj,2j −1 )0 for levels j = j0 , . . . , J − 1. vectors of empirical wavelet coefficients d Similarly, let cj0 = (cj0 ,0 , cj0 ,1 , . . . , cj0 ,2j0 −1 )0 be the vector of scaling coefficients and let dj = (dj,0 , dj,1 , . . . , dj,2j −1 )0 be the vectors of wavelet coefficients for levels j = j0 , . . . , J − 1. 40
Huang & Cressie (2000) considered the following Bayesian model ω | β, σ 2 ∼ N (β, σ 2 I), 0
(68)
0
0 0 0 0 0 0 ˆ ,...,d ˆ where ω = (ˆ cj0 , d j0 J−1 ) and the signal β = (cj0 , dj0 , . . . , dJ−1 ) is assumed to have a
prior distribution given as β | µ, θ ∼ N (µ, Σ(θ)), 0
0
0
where µ = ((µ?j0 ) , µj0 , . . . , µJ−1 )0 is the deterministic mean structure and Σ(θ) describes the variability and the correlation in the signal. The hyperparameter µ represents the large-scale variation (low-frequency term) in β. In other words, one could write β = µ + η,
(69)
where η ∼ N (0, Σ(θ)) is the stochastic component representing the small-scale variation (highfrequency term). Note that in the Bayesian approaches described earlier, µ = 0. As pointed out by Huang & Cressie (2000), the presence of both deterministic mean structure and a stochastic structure help us to recover a wider variety of signals, including piecewisesmooth signals considered in nonparametric regression and nonsmooth signals often appearing in time series analysis and spatial statistics. The identification of the deterministic µ and stochastic η in (69) is however an ill-posed problem. Although, in finite samples, it is possible to separate them out asymptotically with some further assumptions on µ and η (see, for example, Johnstone & Silverman (1997)), it is impossible to distinguish them. It is easily seen that the corresponding L2 -based Bayes rule corresponds to the posterior mean of β conditionally on σ 2 , given by E(β | ω, σ 2 ) = µ + Σ(θ)(Σ(θ) + σ 2 I)−1 (ω − µ),
(70)
which is a shrinkage rule (called the DecompShrink I method). In order to apply it in practice, Huang & Cressie (2000) suggested the following empirical specification of the parameters σ 2 , µ and θ σ: Rather than using the robust estimate (23), σ is estimated using a method based on the variogram of the original process y = (y1 , . . . , yn )0 given in (1), resulting in a more reliable estimate when the signal β is either deterministic (i.e Σ(θ) = 0) or stochastic (i.e. Σ(θ) 6= 0). This is so, because when the underlying signal contain stochastic components, they will confound with the noise component in all the empirical wavelet coefficients and,
41
therefore, (23) tends to overestimate the true value of σ. The estimator of σ 2 is σ ˆ 2 , where ³ ´1/2 k2 γ ˆ (k1 )−k1 γ ˆ (k2 ) if k2 γˆ (k1 ) ≥ k1 γˆ (k2 ) ≥ k1 γˆ (k1 ) k2 −k1 ³ ´1/2 γ ˆ (k1 )+ˆ γ (k2 ) σ ˆ= (71) if γˆ (k2 ) < γˆ (k1 ) 2 0 otherwise for 0 < k1 < k2 and 2ˆ γ (k) is the robust estimator (based on the median absolute deviation) of the variogram 2γ(k) = V(yt+k − yt ) at lag k. The recommended values are k1 = 1 and k2 = 2. µ: Because the scaling coefficients represent low-frequency features of the underlying function, ˆj0 ; it is declared that the empirical scaling coefficients µ?j0 is estimated, as before, by µ ˆ?j0 = c ˆj0 are deterministic (i.e. its stochastic counterpart η ˆ ?j0 = 0). For the wavelet coefficients c at each level j (j = j0 , . . . , J −1), the deterministic mean µj could be considered as coming ˆ j because from components that are potential outliers in the normal probability plot of d significant mean components usually stand out among the nonzero stochastic components, which are more evenly distributed at each scale. Therefore, quantities based on normal probability plots can be used (see, equations (8) and (9) in Huang & Cressie (2000)) to ˆ j of µj for j = j0 , . . . , J − 1. obtain estimates µ θ: The vector of hyperparameters θ is estimated by maximum likelihood based on the ˆ estimated above. marginal distribution of the data ω, with the plug-in values σ ˆ 2 and µ That is, ˆ = arg inf {log (|Σ(θ) + σ ˆ 0 (Σ(θ) + σ ˆ θ ˆ 2 I|) + (ω − µ) ˆ 2 I)−1 (ω − µ)}. θ
(72)
The prior covariance matrix Σ(θ) is assumed to be a block diagonal matrix. Specifically, ˆ ?j0 = 0 (which is in line with the earlier assumption the stochastic scaling coefficients η ˆj0 are attributed to the deterministic mean component that all the scaling coefficients c µ?j0 ). At each level j = j0 , . . . , J − 1, the stochastic wavelet coefficients η j are assumed to be independent random variables with zero means (which allows one to model η j at each level j separately) with V(η j ) = σj2 I. Therefore, from (72), the components of the pseudo ˆ of θ = (σ 2 , . . . , σ 2 ) are given by maximum likelihood estimator θ j0 J−1 σ ˆj2
¶ µ ˆ j )0 (ω j − µ ˆj) (ω j − µ 2 −σ ˆ , = max 0, 2
j = j0 , . . . , J − 1.
(73)
Remark 4.14 Although it is not considered in our simulation study in Section 5, we mention that Huang & Cressie (2000) also dealt with the case of a more general prior covariance matrix 42
Σ(θ) which allows us for correlation between the stochastic components η j for j = j0 , . . . , J − 1. Binary tree structures and an optimal-prediction Kalman-filter algorithm are used to obtain recursive estimates ηˆjk of ηjk for j = j0 , . . . , J − 1 and k = 0, 1, . . . , 2j − 1. The resulting wavelet shrinkage estimator is called DecompShrink II. However, as pointed out by Huang & Cressie (2000), the improvement of DecompShrink II in finite samples, if any, over DecompShrink I is small. Therefore, the latter method is preferable in terms of its good performance and ease of computation. 4.3.6
Thresholding estimates based on nonparametric mixed-effects models
Similar in spirit to the deterministic/stochastic decomposition approach discussed in Section 4.3.5, Huang & Lu (2000) considered wavelet thresholding estimators based on nonparametric mixed-effect models. By considering the wavelet expansion of the unknown response function g, a prior model for g is given by g(t) =
j0 −1 2X
j
αj0 k φj0 k (t) + δZ(t) with Z(t) ∼
∞ 2X −1 X
γjk ψjk (t),
t ∈ [0, 1],
(74)
j=j0 k=0
k=0
where αj0 k = hg, φj0 k i and γjk = hg, ψjk i (‘∼’ means ‘equal in distribution’). The coefficients αj0 k are now modelled as fixed effects (which usually reflect the main features of g), whilst the coefficients γjk are modelled as random effects (which usually reflect the fine features of g). 2 ) = λ . Huang These random coefficients are assumed uncorrelated with zero mean and E(γjk j
& Lu (2000) used the prior model (74) to model the relation between the prior parameters and the Besov spaces (similar in spirit to the work of Abramovich, Sapatinas & Silverman (1998) – see Remark 4.11). It can be seen that the regularity of posterior estimators can be controlled via the prior parameter λj (in fact, the ‘posterior’ of g lies in space that is smoother than the ‘prior’ space). Assuming that Z is a Gaussian process, the Bayes rule under the L2 -loss is the posterior mean of g given the observations y = (y1 , . . . , yn )0 from model (1). It is easily seen that this can be calculated as E(g(t) | y) = µ(t) + where µ(t) =
P2j0 −1 k=0
δ2 0 [w(t)] M −1 (y − µ), 2 σ
(75)
αj0 k φj0 k (t), w(t) is the 1×n matrix defined by W (t, tj ) (for any j = 1, . . . , n
and any t ∈ [0, 1]), M = w + (δ 2 /σ 2 )W , µ = (µ(1), . . . , µ(n))0 , w is the n × n matrix defined by W (ti , tj ) (for any i = 1, . . . , n and any j = 1, . . . , n). Recall that W is the n × n orthogonal matrix associate with the DWT discussed in Section 2.2.
43
If the coefficients αj0 k ( k = 0, 1, . . . , 2j0 − 1) are unknown, ones needs to estimate them from the data y. Huang & Lu (2000) proposed a generalized least squares estimate for αj0 k . It turns out that the resulting shrinkage estimator is the best linear unbiased predictor (BLUP) and it is equivalent to a method of regularization estimator (MORE). Moreover, it is asymptotically equivalent to a diagonal shrinkage estimator. When the parameters values for σ, δ and λj are not available, adaptive estimators are necessary. The method of regularization (as discussed in Antoniadis (1996) and Amato & Vuza (1997)) could be used to obtain the resulting wavelet shrinkage estimator. An alternative adaptive and computationally economical thresholding estimator for g was suggested by Huang & Lu (2000) given by gˆ(t) =
j0 −1 2X
α ˆ j0 k φj0 k (t) +
k=0
where α ˆ j0 k =
1 n
Pn
i=1 yi φj0 k (ti )
j −1 J−1 X 2X
à max 0,
j=j0 k=0
and γˆjk =
to get the estimates α ˆ j0 k and γˆjk .) When
2 − σ2 nˆ γjk 2 nˆ γjk
! γˆjk ψjk (t),
t ∈ [0, 1],
(76)
1 Pn i=1 yi ψjk (ti ). (Note that the DWT can be applied n 2 σ is unknown, Huang & Lu (2000) suggested to select
its value in (76) by generalized cross-validation (with low computational cost) and the resulting estimator is called the GGV-BLUPWAVE. Remark 4.15 We mention that similar ideas to the ones discussed above have been independently explored and developed by Angelini, De Canditiis & Leblanc (2003). We note that the methods discussed in these papers could also be applied to non-equispaced designs. A fast algorithm to obtain the resulting shrinkage estimator in such cases was developed by Angelini, De Canditiis & Leblanc (2003). However, since this paper only deals with the standard nonparametric regression model as defined in (1), this algorithm is not considered in our simulation study in Section 5.
4.4
Bayesian Methods: Block Shrinkage and Thresholding
In Section 4.2, it was shown that one way to increase estimation precision in classical term-byterm thresholding estimators was by utilising information about neighboring empirical wavelet coefficients. In other words, empirical wavelet coefficients could be thresholded in blocks rather than individually using global thresholds. This idea has been recently discussed by Abramovich, Besbeas & Sapatinas (2002) in a Bayesian framework to obtain level-dependent nonoverlapping block shrinkage and nonoverlapping block thresholding estimates. Consider the model given by (2) and (3). At each resolution level j (j = j0 , . . . , J − 1), the wavelet coefficients djk are grouped into nonoverlapping blocks bjK of length lj = j. In each 44
case, the first few empirical wavelet coefficients might be re-used to fill the last block (which is called the Augmented case) or the last few remaining empirical wavelet coefficients might not be used in the posterior-based inference (which is called the Truncated case), should lj not divide 2j exactly. Let mj be the number of blocks (i.e. K = 1, . . . , mj ) and consider, at each level j, the following prior model on bjK bjK | γjK ∼ γjK N (0, Vj ) + (1 − γjK )δ(0),
K = 1, . . . , mj ,
(77)
where δ(0) is a vector of lj point masses at zero. The matrix Vj is an lj ×lj nonsingular covariance |k−l|
matrix given by Vj = τj2 Pj where Pj is the lj × lj matrix with elements Pj [k, l] = ρj
for
k, l = 1, . . . , lj with |ρj | < 1 (otherwise Pj cannot be a positive definite matrix). It is also assumed that γjK has its own prior distribution given by P (γjK = 1) = 1 − P (γjK = 0) = πj ,
with
0 ≤ πj ≤ 1
and that, at each level j, the blocks bjK (K = 1, . . . , mj ) are independent. The marginal prior distribution of bjK is then of the form bjK ∼ πj N (0, Vj ) + (1 − πj )δ(0),
K = 1, . . . , mj .
(78)
Note that, at each level j, the same prior parameters are used for all blocks bjK and that all the variances (i.e diagonal elements of Vj ) are equal to τj2 . According to the prior model (78), a block bjK is either zero with probability 1 − πj , or multivariate normally distributed with zero-mean and covariance Vj . This prior model supposes that if a wavelet coefficient is non-zero (zero), then its neighboring wavelet coefficients are likely to be non-zero (zero). As in the classical approach, to complete the Bayesian model, vague priors on the scaling coefficients cj0 k , k = 0, . . . , 2j0 − 1 are placed which are therefore estimated by their empirical counterparts cˆj0 k , k = 0, . . . , 2j0 − 1. The above model is, obviously, an extension of the prior model of one normal distribution and a point mass at zero (lj = 1) discussed in Section 3.2. At each level j, consider the corresponding blocks ˜bjK of empirical wavelet coefficients dˆjk | djk , σ 2 ∼ N (djk , σ 2 ). Then, by combining the prior model (78) with the normal likelihood ˜bjK | bjK , σ 2 ∼ N (bjK , σ 2 I), the posterior distribution of bjK conditionally on σ 2 can be expressed as bjK | ˜bjK , σ 2 ∼
2 1 ˜bjK , σ 2 Aj ) + OjK (bjK , σ ) δ(0), N (A j 1 + OjK (bjK , σ 2 ) 1 + OjK (bjK , σ 2 )
(79)
where Aj = (σ 2 Vj−1 + I)−1 and the posterior odds ratio that γjK = 0 versus γjK = 1 is given by s ) ( 0 ˜b Aj ˜bjK det(Vj ) 1 − πj jK 2 OjK (bjK , σ ) = . (80) exp − πj 2σ 2 σ 2lj det(Aj ) 45
The posterior (79) can be used to generate block shrinkage and block thresholding estimators ˜ using Bayes rules under L2 and L1 -losses. Define for the jK-th block the vector dˆj = Aj ˜bjK ˜ and its elements dˆjk . Then, we have • posterior means: It is immediately seen that the posterior mean of bjK conditionally on σ 2 is given by E(bjK | ˜bjK , σ 2 ) =
1 ˜ dˆj . 1 + OjK (bjK , σ 2 )
(81)
This is a block shrinkage estimator where each empirical wavelet coefficient within a block is shrunk by the same shrinkage factor depending on all coefficients within the block. It is called the PostBlockMean estimator. • marginal posterior medians: It is easily seen that, for the posterior distribution given by (79), the marginal posterior distribution of djk conditionally on σ 2 is expressed as djk | ˜bjK , σ 2 ∼
2 1 ˜jk , σ 2 Ajj ) + OjK (bjK , σ ) δ(0), N ( d 1 + OjK (bjK , σ 2 ) 1 + OjK (bjK , σ 2 )
where Ajj is the diagonal entry of Aj (they are the same for all k). Hence, following the arguments of Abramovich, Sapatinas & Silverman (1988) discussed in Section 4.3.2, the posterior median of djk | ˜bjK is of the following closed form ˜ Median(djk | ˜bjK , σ 2 ) = sign(dˆjk ) max (0, ζjK ), where ζjK
p ˜ = |dˆjk | − σ Ajj Φ−1
µ
1 + min(OjK (bjK , σ 2 ), 1) 2
(82) ¶ .
This is an individual thresholding estimator where each empirical wavelet coefficient is thresholded utilising information about neighbouring coefficients within a block. It is called the PostBlockMed estimator. ˆ of the corresponding estimates of the unknown response function g at the The vector g observed data-points can be derived by simply performing the IDWT to the vector consisting of both the empirical scaling coefficients (obtained from applying any of the previous losses on the resulting posterior distributions using the vague priors on the scaling coefficients) and the shrunk or thresholded empirical wavelet coefficients (obtained from one of (81) or (82)). As before, estimates of the hyperparameters πj , τj2 , ρj and σ 2 can now be obtained by using empirical Bayes methods which are based on using marginal maximum likelihood estimates of 46
the hyperparameters. At each level j, it is easily seen that the marginal distribution of the empirical blocks ˜bjK is a mixture of two multivariate normal distributions. Therefore, defining ˜ jK = (˜bjK : k = 1, . . . , mj ), the marginal log-likelihood function is, up to a constant b L(π, τj2 , ρj , σ 2
˜ jK ) = |b
½ µ ¶ 1 ˜0 −1˜ −1/2 log πj (det(Bj )) exp − bjK Bj bjK 2 K=1 µ ¶¾ 1 0 + (1 − πj )σ −lj exp − 2 ˜bjK ˜bjK , 2σ mj X
(83)
where Bj = σ 2 I + τj2 Pj . Since expression (83) does not lead to closed form solutions for the maximum likelihood estimates of πj , τj2 , ρj and σ 2 , numerical minimisation of −L in (83) must be used in order to obtain the maximum likelihood estimates of these hyperparameters. Abramovich, Besbeas & Sapatinas (2002) suggested to use the robust estimate (23) for σ and to numerically minimize −L. The log-likelihood function was reparametrised with πj =
1 , 1 + exp (−θ1j )
τj = |θ2j | and ρj =
2 arctan (θ3j ) π
so that parameter estimates would lie in the ranges 0≤π ˆj ≤ 1,
τˆj ≥ 0 and
− 1 < ρˆj < 1,
respectively. The algorithm that they have used for the minimisation of −L is the Nelder-Mead simplex search method which does not require first derivatives of −L. Remark 4.16 We have also considered hybrid schemes by applying, on the first few resolution levels, after a fixed level (which is a user-choice), the posterior mean and median estimators discussed in Sections 4.3.1 and 4.3.2, and using the block shrinkage and thresholding estimators (discussed in this section) on the remaining resolution levels. We mention that Abramovich, Besbeas & Sapatinas (2002) have also considered block thresholding estimators using a Bayesian hypothesis testing approach, similar in spirit to the one discussed in Section 4.3.3. Furthermore, at each resolution level j, they have considered blocks of size O(j) and have studied the effect of various block sizes in the numerical performance of the resulting empirical Bayes block shrinkage and block thresholding estimators. Since there are 2j empirical wavelet coefficients at each resolution level j, it is often more convenient to chose the block sizes lj to be dyadic integers; this results in block sizes that evenly divide the empirical wavelet coefficients at each resolution level j into nonoverlapping blocks, and it has been also considered by Abramovich, Besbeas & Sapatinas (2002). However, for brevity, these alternative choices have not been considered in our simulation study in Section 5. 47
We conclude this section with Table 2 which reports, in a synthetic way, the main denoising procedures that were discussed and their corresponding properties. Method
Bayes
Global
Level
Shrink
Thresh
Minimax
•
•
SCAD
•
•
VisuShrink
•
•
Translation Invariant
•
•
Multiple Hypotheses Testing
•
•
Cross-validation
•
•
Block
SureShrink
•
•
Recursive Hypothesis Testing
•
•
BlockJS
•
•
•
NeighBlock
•
•
•
Single Posterior Mean
•
•
•
Single Posterior Mean 2
•
•
•
Single Posterior Median
•
•
•
Single Posterior Median 2
•
•
•
Bayesian Hypothesis Testing
•
•
•
BAMS
•
•
•
Decompsh
•
•
•
Mixed
•
•
•
Blocking Posterior Median
•
•
•
•
Hybrid Posterior Median
•
•
•
•
Blocking Posterior Mean
•
•
•
•
Hybrid Posterior Mean
•
•
•
•
σ ˆ 6= M AD
• • •
Table 2: Main characteristic properties of the set of denoising procedures discussed in Section 4. The column Bayes groups all methods relying on a Bayesian procedure. The columns Global and Level refer to the method of choice for the threshold level in wavelet thresholding which are grouped into two categories: global thresholds and level-dependent thresholds. The columns Shrink and Thresh denote the type of thersholding used, while the column Block refers to methods for which the wavelet coefficients are thresholded in blocks rather than term-by-term. Finally, the column σ ˆ 6= M AD indicates whether or not the noise level σ is estimated with the robust estimate (23).
5 Description of the Simulation
In all cases, the data (x_i, y_i) were generated from the model

y_i = f(x_i) + ε_i,   {ε_i} i.i.d. N(0, σ²),

where the {x_i} are equispaced in [0, 1], with x_0 = 0 and x_n = 1. The factors were:

1. The sample sizes n.
2. The test functions f(x).
3. The values of σ².

For each combination of these factor levels, a simulation run was repeated 100 times holding all factor levels constant, except the {ε_i} which were regenerated. In order to compare the behaviour of the various estimation methods (see Table 3), each of them based on two different wavelet filters, we have used six different criteria: MSE, L1, RMSE, RMSB, MXDV and CPU. These criteria were computed as follows:

MSE: This is the average over the 100 runs of (1/n) Σ_{i=1}^n (f(x_i) − f̂(x_i))².

L1: This is the average over the 100 runs of Σ_{i=1}^n |f(x_i) − f̂(x_i)|.

RMSE: The mean squared error was computed for each run and averaged over the 100 runs; then its square root was taken.

RMSB: Let f̄(x_i) be the average of f̂(x_i) over the 100 runs. The RMSB is the square root of (1/n) Σ_{i=1}^n (f(x_i) − f̄(x_i))².

MXDV: This is the average over the 100 runs of max_{1≤i≤n} |f(x_i) − f̂(x_i)|.

CPU: This is the average over the 100 runs of the CPU time.
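For concreteness, a minimal Matlab sketch of these computations is given below; the variable names (fhat, a matrix of estimates over the 100 runs, and f, the vector of true values at the design points) are assumptions made purely for illustration.

% Sketch (assumed variable names): fhat is a 100 x n matrix whose r-th row is
% the estimate of f at the design points in run r; f is a 1 x n vector of true values.
err  = fhat - repmat(f, size(fhat, 1), 1);    % estimation errors, one row per run
mse  = mean(mean(err.^2, 2));                 % MSE: average over runs of the per-run MSE
l1   = mean(sum(abs(err), 2));                % L1: average over runs of the absolute errors
rmse = sqrt(mse);                             % RMSE: square root of the averaged MSE
fbar = mean(fhat, 1);                         % pointwise average of the estimates
rmsb = sqrt(mean((f - fbar).^2));             % RMSB: root mean squared bias
mxdv = mean(max(abs(err), [], 2));            % MXDV: average maximum deviation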
In these simulations we have concentrated on the Symmlet 8 wavelet basis (as described on page 198 of Daubechies (1992)) and on the Coiflet 3 basis (as described on page 258 of Daubechies (1992)). The set of test functions, f, that we have used is similar to the one used by Marron, Adak, Johnstone, Neumann & Patil (1998). It contains functions for which a variety of wavelet estimators work both quite well and quite poorly, and is shown in Figure 5.1. Explicit formulae for these curves are given in Appendix I. For the sake of completeness, we summarise here from Marron, Adak, Johnstone, Neumann & Patil (1998) the motivation behind each of these functions. A visual idea of the noise level that we have used in this paper is also given in Figure 5.2.

1. Step: This function should be very hard to estimate with linear methods, because of its jumps, but relatively easy for nonlinear wavelet estimators.

2. Wave: This is a sum of two periodic sinusoids. Since this signal is smooth, linear methods compare favourably with nonlinear ones.

3. Blip: This is essentially the sum of a linear function with a Gaussian density, and has often been used as a target function in nonparametric regression. To make the jump induced by the assumed periodicity visually clear, the function has been periodically rotated so that the jump is at x = 0.8.

4. Blocks: This step function has many more jumps than the Step function above, and has been used in several Donoho and Johnstone papers, for example Donoho & Johnstone (1994).

5. Bumps: This also comes from Donoho and Johnstone, and is very challenging for any smoother.

6. HeaviSine: Another Donoho and Johnstone example. This looks promising for linear smoothers, except for the two jumps.

7. Doppler: The final Donoho and Johnstone example. The time-varying frequency makes this very hard for linear methods, with power spread all across the spectrum. It is more suitable for wavelets, with their space-time localisation.

8. Angles: This function is piecewise linear and continuous, but has big jumps in its first derivative.

9. Parabolas: This function is piecewise parabolic. The function and its first derivative are continuous, but there are big jumps in its second derivative. It is ideal for the Symmlet 8 basis. Estimation should be reasonable for linear methods because both the function and its first derivative are continuous.
Figure 5.1: The twelve signals used in the simulation study in this paper, based on 256 design points.
Figure 5.2: The noisy versions of the signals shown in Figure 5.1, giving a visual impression of our 'high noise' Gaussian errors setting.
10. Time Shifted Sine: This is a time-shifted sine wave. It is intended to be a very smooth function, but rather far from a linear combination of sine waves. We view this as representing the type of curve that "traditional smoothers" would consider estimating.

For consistency of mnemonics, the various denoising procedures that we have used (acronyms, type of thresholding and reference to the appropriate sections of this paper) are summarised in Table 3. Insight into the performance of the various wavelet-based denoising procedures described in Table 3 can be obtained from graphical outputs and numerical tables. However, the resulting graphical outputs and numerical tables across all criteria, sample sizes, test functions, noise levels, wavelet filters and denoising procedures are very extensive. Looking at our simulation results, it was apparent that the criteria L1 and MSE are closely correlated with RMSE. Also, most of the conclusions hold whatever wavelet filter is used. Hence, for reasons of space, we only report in detail, in Appendix II, the results for two sample sizes, n = 128 (a moderate sample size) and n = 512 (a large sample size), for four criteria (RMSE, RMSB, MXDV and CPU), a particular wavelet filter (Symmlet 8) and a root-signal-to-noise ratio rsnr = 3 (a high noise level), for all test functions and all smoothing procedures. Even so, the graphical outputs summarising the behaviour of the smoothers with respect to the various criteria are quite extensive. Graphical outputs for other combinations of sample sizes and noise levels, as well as numerical tables, can be derived using appropriate Matlab scripts. These scripts are based on Version 8 of the WaveLab toolbox for Matlab (Buckheit, Chen, Donoho, Johnstone & Scargle (1995)) and are available at http://www-lmc.imag.fr/SMS/software/GaussianWaveDen.html or http://www.ucy.ac.cy/∼fanis/links/software.html. We refer at this point to Section 8 for more details on which files the user should download and how to start running them. We remark below on some of the conclusions we drew after a careful examination of the simulation output. A sketch of a single simulation run of the kind just described is given below.
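The following Matlab sketch illustrates a single simulation replication of the kind just described. It is only a sketch: the routines recTI and recneighblock are called as in the example of Section 8, their default filter and primary resolution level are assumed to apply, and the root-signal-to-noise ratio is taken here as std(f)/σ.

% Sketch of one simulation replication (assumptions noted above).
n      = 512;
x      = (0:n-1)'/n;                                  % equispaced design
f      = 0.5 + 0.2*cos(4*pi*x) + 0.1*cos(24*pi*x);    % the Wave test function (Appendix I)
sigma  = std(f)/3;                                    % noise level for rsnr = 3
y      = f + sigma*randn(n, 1);                       % noisy observations
fTI    = recTI(y, 'S');                               % TI-S: translation-invariant, soft
fNB    = recneighblock(y);                            % NEIGHBL: NeighBlock
rmseTI = sqrt(mean((f - fTI(:)).^2));                 % per-run root mean squared error
rmseNB = sqrt(mean((f - fNB(:)).^2));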
6 Summary of Results
As expected, for the same test function and denoising procedure, whatever wavelet filter is used, the RMSE is roughly proportional to the inverse of the square root of the sample size. In terms of RMSE, TI-H, BAMS and DECOMPSH are markedly better on the Step function, whatever the sample size, and the results are quite similar with respect to the filter used. As for computational efficiency, for small sample sizes TI-H and BAMS are much more efficient than DECOMPSH, while for large sample sizes the three methods are equivalent.
No.  Acronym       Procedure                           Type       Section
 1   VISU-H        VisuShrink                          Hard       4.1.2
 2   VISU-S        VisuShrink                          Soft       4.1.2
 3   SURE          SureShrink                                     4.1.6
 4   HYBSURE       SureShrink                          Hybrid     4.1.6
 5   TI-H          Translation-Invariant               Hard       4.1.3
 6   TI-S          Translation-Invariant               Soft       4.1.3
 7   MINIMAX-H     Minimax                             Hard       4.1.1
 8   MINIMAX-S     Minimax                             Soft       4.1.1
 9   CV-H          Cross-Validation                    Hard       4.1.5
10   CV-S          Cross-Validation                    Soft       4.1.5
11   NEIGHBL       NeighBlock                                     4.2.2
12   BLOCKJS-A     Block Thresholding                  Augment    4.2.1
13   BLOCKJS-T     Block Thresholding                  Truncate   4.2.1
14   THRDA1        Hypothesis Testing                  Soft       4.1.7
15   FDR-H         False Discovery Rate                Hard       4.1.4
16   FDR-S         False Discovery Rate                Soft       4.1.4
17   PENWAV        Linear Penalization                            3.1
18   SCAD          Nonlinear Penalization                         3.1 & 4.1.1
19   DECOMPSH      Deterministic/Stochastic                       4.3.5
20   MIXED         Mixed-Effects                                  4.3.6
21   BLMED-A       Blocking Posterior Median           Augment    4.4
22   BLMED-T       Blocking Posterior Median           Truncate   4.4
23   HYBMED-A      Hybrid Blocking Median              Augment    4.4
24   HYBMED-T      Hybrid Blocking Median              Truncate   4.4
25   BLMEAN-A      Blocking Posterior Mean             Augment    4.4
26   BLMEAN-T      Blocking Posterior Mean             Truncate   4.4
27   HYBMEAN-A     Hybrid Blocking Mean                Augment    4.4
28   HYBMEAN-T     Hybrid Blocking Mean                Truncate   4.4
29   SINGLMED      Single Posterior Median                        4.3.2
30   SINGLMED-2    Single Posterior Median                        4.3.2
31   SINGLMEAN     Single Posterior Mean                          4.3.1
32   SINGLMEAN-2   Single Posterior Mean                          4.3.1
33   SINGLHYP      Bayesian Hypothesis Testing                    4.3.3
34   BAMS          Bayesian Adaptive Multiresolution              4.3.4

Table 3: Acronyms for the set of denoising procedures applied in the simulation study. The Section column refers to the actual sections where these procedures have been defined.
One would expect a similar conclusion with respect to the Blocks function. While this is true for BAMS and DECOMPSH, CV-S outperforms TI-H in this case in terms of RMSE. Note also that Symmlets are better suited than Coiflets in this case, which is mainly due to the presence of many jumps in the Blocks function. For the smooth function Wave, most Bayesian methods do well. Among the non-Bayesian denoising procedures, PENWAV and TI-H are competitive, especially for large sample sizes. The fact that PENWAV does well is not surprising, given that this is the type of function for which linear methods are generally equivalent to nonlinear ones. Similar conclusions also hold for the Time Shifted Sine function. For the Blip function as well as for the Parabolas function, TI-H is markedly better than the other methods, and the use of Coiflets improves the RMSE. At the other end, SURE, MIXED and SINGLHYP perform poorly for these test functions. For the Bumps, Angles and Spikes functions, almost all Bayesian procedures are equivalent (with the exception of SINGLHYP) and do markedly better than the other procedures in terms of RMSE. Among the non-Bayesian procedures, once again TI-H performs well. Both wavelet filters lead to similar results. For such functions it is therefore advisable to denoise using Bayesian procedures, if computational cost is not an issue. Finally, for the HeaviSine and Doppler functions almost all procedures give equivalent results, with the exception of SURE and SINGLHYP which perform poorly.

The criterion MXDV allows a finer comparison between Bayesian methods when these are in competition. Larger values of MXDV occur for functions with many spikes or discontinuities, but this is expected. The curious behaviour of MXDV for some of the methods with the Bumps signal calls for some explanation. Recall that, throughout the simulations, the primary resolution level j_0 = [log_2(log(n))] + 1 was used for all methods (for the sample sizes used here, n = 128 and n = 512, this gives j_0 = 3). This value of j_0 affects whether or not the spikes in the Bumps signal are felt in the lowest level of wavelet coefficients. For j_0 = 3, the standard methods, and especially PENWAV and DECOMPSH, smooth out the spike effect to a large extent.

Among all denoising procedures and almost all test functions, the minimum RMSB is achieved by the SURE procedure, which shows that the bias in the procedures is almost never a substantial contributor to the RMSE, reflecting the capability of these automatic denoising procedures to fit a large variety of functions. In terms of the mean squared error criterion, a conceivable competitor to SURE among the other methods is NEIGHBLOCK, especially when the underlying function is of significant spatial variability.
The strange behaviour of some of the methods with the Wave signal is probably due to the fact that the same primary resolution level j_0 = [log_2(log(n))] + 1 was used for all methods, and most methods smooth out to some extent the high frequencies in the Wave signal. Table 4 in Appendix II shows the average CPU time involved in computing the estimates for the Corner function by each method and for the two sample sizes. Our simulations show that non-Bayesian methods uniformly outperform Bayesian methods in terms of CPU time in all examples, and indeed the relative performance of the Bayesian procedures is even worse for some examples other than the Corner function presented in detail here.
7 Overall Conclusions
As expected, no wavelet-based denoising procedure uniformly dominates in all aspects. For larger sample sizes, and when a function is expected to be mainly smooth, Coiflets lead to better results because the scaling functions associated with Coiflets have better approximation properties than other Daubechies filters. While Bayesian methods perform reasonably well at small sample sizes for relatively inhomogeneous functions, their computational cost may be a handicap when compared with translation-invariant thresholding procedures.
8 An illustrative example
This section explains in some detail which files the user should download and how to start running them. In order to provide some hints on how the functions should be used to analyse real data sets, a detailed practical step-by-step illustration of a wavelet denoising analysis of electrical consumption data is provided.
Prerequisites

As already noted, our library makes extensive use of the Matlab routines available in the WaveLab package developed by Buckheit, Chen, Donoho, Johnstone & Scargle (1995) at Stanford University. WaveLab has over 800 subroutines which are well documented, indexed and cross-referenced. The library is available, free of charge, over the World Wide Web (WWW). Versions are provided for Macintosh, Unix and Windows platforms. The WaveLab package is made available as a compressed archive, in a format suitable for the machine in question: .zip (for MS-Windows), .tar.Z (for Unix) and .sea.hqx (for Macintosh). The archives may be accessed at http://www-stat.stanford.edu/∼wavelab.
Once the appropriate compressed archive has been transferred to your machine, it should be decompressed, with the relevant tools, and installed. On a personal computer (Macintosh or Windows), the archive should be decompressed and installed as a subdirectory of the toolbox directory inside the Matlab folder. On a Unix workstation or server, the archive can be installed either in the systemwide Matlab directory, if you have permission to do this, or in your own personal Matlab directory, if you do not. Once the actual files are installed, you should have a number of subdirectories of .m files in the directory WaveLab. Matlab can automatically, at startup time, make the whole WaveLab package available (see the installation instructions accompanying the archive). We will assume from now on that the WaveLab routines are available on your machine. We will also assume that you have already downloaded our GaussianWaveDen package from http://www-lmc.imag.fr/SMS/software/GaussianWaveDen or http://www.ucy.ac.cy/∼fanis/links/software.html and that you have installed its subroutines by specifying its path in Matlab. Once this is done, you can try running the demo that we describe in this section. Each function in GaussianWaveDen has an accompanying html help documentation in the directory html-help.
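As a minimal sketch (the installation directories below are hypothetical and should be replaced by the locations used on your machine), both packages can be made visible for the current Matlab session as follows.

% Sketch: add WaveLab and GaussianWaveDen to the Matlab path for this session.
% The directory names are hypothetical; adapt them to your own installation.
addpath(genpath(fullfile(matlabroot, 'toolbox', 'WaveLab')));
addpath(genpath(fullfile(matlabroot, 'toolbox', 'GaussianWaveDen')));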
An electrical consumption example

The example we present here involves a real-world signal: electrical consumption measured over the course of three days. This signal is particularly interesting because of the noise introduced whenever a defect is present in the monitoring equipment. The data consist of measurements of a complex, highly aggregated plant: the electrical load consumption, sampled minute by minute, over a 5-week period. The resulting time series of 50,400 points is partly plotted in the top panel of Figure 8.3. External information is given by electrical engineers, and additional indications can be found in Misiti, Misiti, Oppenheim & Poggi (1994). This information includes the following remarks:

• The load curve is the aggregation of hundreds of sensor measurements, thus generating measurement errors.

• About 50% of the consumption is accounted for by industry and the other half by individual consumers. The component of the load curve produced by industry has a rather regular profile and exhibits low-frequency changes. On the other hand, the consumption of individual consumers may be highly irregular, leading to high-frequency components.

• There are more than 10 million individual consumers.
• Daily consumption patterns also change according to rate changes at different times (e.g. relay-switched water heaters that benefit from special night rates).

• For the 3-day observations, indexed from 1 to 4096, the measurement errors are unusually high, due to sensor failures.
Figure 8.3: An electricity consumption signal and its denoised versions.

We shall not report here a complete analysis, which is included in Misiti, Misiti, Oppenheim & Poggi (1994). We only want to illustrate some of the denoising procedures developed in this paper on a local description of this time series, which effectively remove the noise. We choose a portion of the sampled signal corresponding to a midday period. Observe, however, that the midday period has a complicated structure, because the intensity of the consumers' activity is high and presents very large changes. An appropriate noise removal allows the identification of interesting features of the data. Figure 8.3 has been generated using the following Matlab command lines.

% Load the original 1-D signal, choose a portion of it and plot it.
% Load the signal in s.
s = file2var('eleccum.dat');
% Fix the time axis.
x = (1:4096);
signal = s(x);
% Plot the original signal.
plot(x,signal);
title('Electrical Signal');
% Denoise the plotted portion of the signal using a few of our procedures.
% Using the Translation Invariant procedure with soft thresholding.
f = recTI(signal,'S');
% Using the Neighblock procedure.
g = recneighblock(signal);
% Plot the results.
subplot(3,1,1); plot(x,signal);
axis([-20 4097 100 600]); title('Electrical Signal');
subplot(3,1,2); plot(x,f);
axis([-20 4097 100 600]); title('Soft TI Denoising');
subplot(3,1,3); plot(x,g);
axis([-20 4097 100 600]); title('Neighblock Denoising');

One may note on the denoised signals the abrupt changes due to automatic switches. Note also that the TI-S procedure produces a smooth fit, efficiently removing the massive, high-frequency changes due to personal electric appliances in the consumption, while the NEIGHBLOCK procedure undersmooths the high-frequency portions of the observed signal. To end this section, we would like to mention a few types of signal processing problems where the wavelet methods discussed and compared in this paper have been used in practice. Wavelet denoising procedures have been found to be a particularly useful tool in machinery fault detection (see Staszewski & Tomlinson (1994) and Lin & McFadden (1997)). Typical examples of signals encountered in this field are vibration signals generated by defective bearings and gears rotating at constant speeds. When a machine or its parts change from one state into another, transients may be seen in the vibration signals. Transients usually have relatively high frequencies but relatively small time scales and contain rich information about machinery conditions. This explains why wavelet denoising procedures provide considerable improvement over certain traditional techniques in the fault detection of mechanical systems. Wavelet denoising procedures, in conjunction with hypothesis testing, have also been used for detecting change points in several biomedical applications.
Typical examples are the detection of life-threatening cardiac arrhythmias (see Khadra, Al-Fahoum & Al-Nashash (1997)) in electrocardiographic (ECG) signals recorded during the monitoring of patients, or the detection of venous air embolism in Doppler heart sound signals recorded during surgery when the incision wounds lie above the heart (see Chan, Chan, Lam, Lui & Poon (1997)). They have also been used in astronomical applications, for example for estimating periodicities in the light-curves of the variable star R Aquilae, after appropriate denoising (see Foster (1996)). A number of interesting applications of wavelets may also be found in economics and finance; Ramsay, Usikov & Zaslavsky (1995) provide an account of research arising from earlier concerns in the analysis of the stock market. In conclusion, it is apparent that wavelets are particularly well adapted to the statistical analysis of several types of data, and denoising tools, like the ones presented in this paper, will certainly be of great help in revealing features present in data.
Acknowledgements This work was supported by the IMAG project AMOA whose financial support is acknowledged. The authors would like to thank Jean-Michel Poggi for his suggestions and comments on a preliminary version of this paper. Theofanis Sapatinas would like to thank Anestis Antoniadis for financial support and excellent hospitality while visiting Grenoble to carry out this work. Helpful comments made by the editor and an anonymous referee are gratefully acknowledged.
References [1] Abramovich, F., Bailey, T.C. & Sapatinas, T. (2000). Wavelet analysis and its statistical applications. The Statistician, 49, 1–29. [2] Abramovich, F. & Benjamini, Y. (1995). Thresholding of wavelet coefficients as multiple hypotheses testing procedure. In Wavelets and Statistics, Antoniadis, A. & Oppenheim, G. (Eds.), Lecture Notes in Statistics 103, pp. 5–14, New York: Springer-Verlag. [3] Abramovich, F. & Benjamini, Y. (1996). Adaptive thresholding of wavelet coefficients. Comput. Statist. Data Anal., 22, 351–361. [4] Abramovich, F., Besbeas, P. & Sapatinas, T. (2002). Empirical Bayes approach to block wavelet function estimation. Comput. Statist. Data Anal., 39, 435–451. [5] Abramovich, F. & Sapatinas, T. (1999). Bayesian approach to wavelet decomposition and shrinkage. In Bayesian Inference in Wavelet Based Models, M¨ uller, P. & Vidakovic, B. (Eds.), Lect. Notes Statist., 141, pp. 33–50, New York: Springer-Verlag. [6] Abramovich, F. & Silverman, B.W. (1998). The vaguelette-wavelet decomposition approaches to statistical inverse problems. Biometrika, 85, 115–129. [7] Abramovich, F., Sapatinas, T. & Silverman, B.W. (1998). Wavelet thresholding via a Bayesian approach. J. R. Statist. Soc. B, 60, 725–749. [8] Abramovich, F., Benjamini, Y., Donoho, D.L. & Johnstone, I.M. (2000). Adapting to unknown sparsity by controlling the false discovery rate. Technical Report 00-19, Department of Statistics, Stanford University, USA. [9] Amato, U. & Vuza, D.T. (1997). Wavelet approximation of a function from samples affected by noise. Rev. Roumanie Math. Pure Appl., 42, 481–493. [10] Angelini, C., De Canditiis, D. & Leblanc, F. (2003). Wavelet regression estimation in nonparametric mixed effect models. J. Multivariate Anal., 85, 267–291. [11] Antoniadis, A. (1996). Smoothing noisy data with tapered coiflets series. Scand. J. Statist., 23, 313–330. [12] Antoniadis, A. (1997). Wavelets in statistics: a review (with discussion). J. Ital. Statist. Soc., 6, 97–144.
[13] Antoniadis, A. & Fan, J. (2001) Regularization of wavelets approximations (with discussion). J. Am. Statist. Ass., 96, 939–967. [14] Antoniadis, A. & Oppenheim, G. (Eds.) (1995). Wavelets and Statistics, Lect. Notes Statist., 103, New York: Springer-Verlag. [15] Antoniadis, A. & Pham, D.T. (1998). Wavelet regression for random or irregular design. Computat. Statist. Data Anal., 28, 353–369. [16] Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57, 289–300. [17] Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis. New York: SpringerVerlag [18] Breiman, L. & Peters, S. (1992). Comparing automatic smoothers (a public service enterprise). Int. Statist. Rev., 60, 271–290. [19] Bruce, A.G. & Gao, H.-Y. (1996). Understanding WaveShrink:
variance and bias
estimation. Biometrika, 83, 727–745. [20] Buckheit, J.B. & Donoho, D.L. (1995). WaveLab and Reproducible Research. In Wavelets and Statistics, Antoniadis, A. & Oppenheim, G. (Eds.), Lect. Notes Statist., 103, pp. 55–81, New York: Springer-Verlag. [21] Buckheit, J.B., Chen, S., Donoho, D.L., Johnstone, I.M. & Scargle, J. (1995). About WaveLab. Technical Report, Department of Statistics, Stanford University, USA. [22] Burrus, C.S., Gonipath, R.A. & Guo, H. (1998). Introduction to Wavelets and Wavelet Transforms: A Primer. Englewood Cliffs: Prentice Hall. [23] Cai, T.T. (1999). Adaptive wavelet estimation: a block thresholding and oracle inequality approach. Ann. Statist., 27, 898–924. [24] Cai, T.T. & Brown, L.D. (1998). Wavelet shrinkage for nonequispaced samples. Ann. Statist., 26, 1783-1799. [25] Cai, T.T. & Brown, L.D. (1999). Wavelet estimation for samples with random uniform design. Statist. Probab. Lett., 42, 313–321. [26] Cai, T.T. & Silverman, B.W. (2001). Incorporating information on neighbouring coefficients into wavelet estimation. Sankhy¯ a B, 63, 127–148. 62
[27] Chan, C.B., Chan, F.H., Lam, F.K., Lui, P.-W. & Poon, P.W. (1997). Fast detection of venous air embolism in doppler heart sound using the wavelet transform. IEEE Trans. Biomed. Eng., 44, 237–246. [28] Chipman, H.A., Kolaczyk, E.D. & McCulloch, R.E. (1997). Adaptive Bayesian Wavelet Shrinkage. J. Am. Statist. Ass., 92, 1413–1421. [29] Clyde, M. & George, E.I. (1999). Empirical Bayes estimation in wavelet nonparametric regression. In Bayesian Inference in Wavelet Based Models, M¨ uller, P. & Vidakovic, B. (Eds.), Lect. Notes Statist., 141, pp. 309–322, New York: Springer-Verlag. [30] Clyde, M. & George, E.I. (2000). Flexible empirical Bayes estimation for wavelets. J. R. Statist. Soc. B, 62, 681–698. [31] Clyde, M., Parmigiani, G. & Vidakovic, B. (1998). Multiple shrinkage and subset selection in wavelets. Biometrika, 85, 391–401. [32] Coifman, R.R. & Donoho, D.L. (1995). Translation-invariant de-noising. In Wavelets and Statistics, Antoniadis, A. & Oppenheim, G. (Eds.), Lect. Notes Statist., 103, pp. 125–150, New York: Springer-Verlag. [33] Crouse, M.S., Nowak, R.D. & Baraniuk, R.G. (1998). Wavelet-based statistical signal processing using hidden Markov models. IEEE Trans. Sig. Proc., 46, 886–902. [34] Daubechies, I. (1992). Ten Lectures on Wavelets. Philadelphia: SIAM. [35] Delyon, B. & Juditsky, A. (1995). Estimating wavelet coefficients In Wavelets and Statistics, Antoniadis, A. & Oppenheim, G. (Eds.), Lect. Notes Statist., 103, pp. 15–168, New York: Springer-Verlag. [36] Delyon, B. & Juditsky, A. (1996). On minimax wavelet estimators. Appl. Comput. Harm. Anal., 3, 215–228. [37] Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B, 39, 1–38. [38] Dennis, J.E. & Mei, H.H.W. (1979). Two new unconstrained optimization algorithms which use function and gradient values. J. Optimiz. Theory Appl., 28, 453–483. [39] Donoho, D.L. & Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425–455. 63
[40] Donoho, D.L. & Johnstone, I.M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Am. Statist. Ass., 90, 1200–1224. [41] Donoho, D.L. & Johnstone, I.M. (1998). Minimax estimation via wavelet shrinkage. Ann. Statist., 26, 879–921. [42] Donoho, D.L., Johnstone, I.M., Kerkyacharian, G. & Picard, D. (1995). Wavelet shrinkage: asymptopia? (with discussion). J. R. Statist. Soc. B, 57, 301–337. [43] Efromovich, S. (1999). Quasi-linear wavelet estimation. J. Am. Statist. Ass., 94, 189–204. [44] Efromovich, S. (2000). Sharp linear and block shrinkage wavelet estimation. Statist. Probab. Lett., 49, 323–329. [45] Eubank, R.L. (1999). Nonparametric Regression and Spline Smoothing. 2nd Edition, New York: Marcel Dekker. [46] Fan, J. & Gijbels, I. (1996). Local Polynomial Modelling and its Applications. London: Chapman & Hall. [47] Foster, G. (1996). Wavelets for period analysis of unevenly sampled time series. Astron. J., 112, 1709–1729. [48] Gao, H.-Y. & Bruce, A.G. (1997). WaveShrink with firm shrinkage. Statist. Sinica, 7, 855– 874. [49] Gao, H.-Y. (1998). Wavelet shrinkage denoising using the non-negative garrote. J. Comp. Graph. Statist., 7, 469–488. [50] George, E.I. & Foster, D.P. (2000). Calibration and empirical Bayes variable selection. Biometrika, 87, 731–747. [51] George, E.I. & McCulloch, R. (1993). Variable selection via Gibbs sampling. J. Am. Statist. Ass., 88, 881–889. [52] Green, P.J. & Silverman, B.W. (1994). Nonparametric regression and generalised linear models. London: Chapman & Hall. [53] Hall, P., Kerkyacharian, G. & Picard, D. (1998). Block threshold rules for curve estimation using kernel and wavelet methods. Ann. Statist., 26, 922–942.
[54] Hall, P., Kerkyacharian, G. & Picard, D. (1999). On the minimax optimality of block thresholded wavelet estimators. rules for curve estimation using kernel and wavelet methods. Statist. Sinica., 9, 33–50. [55] Hall, P., Penev, S., Kerkyacharian, G. & Picard, D. (1997). Numerical performance of block thresholded wavelet estimators. Statist. Comput., 7, 115–124. [56] Hall, P. & Nason, G.P. (1997). On choosing a non-integer resolution level when using wavelet methods. Statist. Probab. Lett., 34, 5–11. [57] Hall, P. & Patil, P. (1996a). Effect of threshold rules on performance of wavelet-based curve estimators. Statist. Sinica, 6, 331–345. [58] Hall, P. & Patil, P. (1996b). On the choice of smoothing parameter, threshold and truncation in nonparametric regression by non-linear wavelet methods. J. R. Statist. Soc. B, 58, 361–377. [59] Hall, P. & Turlach, B.A. (1997). Interpolation methods for nonlinear wavelet regression with irregularly spaced design. Ann. Statist., 25, 1912–1925. [60] H¨ardle, W. (1990). Applied Nonparametric Regression. Cambridge: Cambridge University Press. [61] H¨ardle, W., Kerkyacharian, G., Pikard, D. & Tsybakov, A. (1998). Wavelets, Approximation, and Statistical Applications. Lecture Notes in Statistics 129, New York: Springer-Verlag. [62] Huang, H.-C. & Cressie, N. (2000). Deterministic/stochastic wavelet decomposition for recovery of signal from noisy data. Technometrics, 42, 262–276. [63] Huang, S.Y. & Lu, H.H.-S. (2000). Bayesian wavelet shrinkage for nonparametric mixedeffects models. Statist. Sinica, 10, 1021–1040. [64] Jansen, M., Malfait, M. & Bultheel, A. (1997). Generalised cross-validation for wavelet thresholding. Sig. Proc., 56, 33–44. [65] Johnstone, I.M. (1994). Minimax Bayes, asymptotic minimax and sparse wavelet priors. In Statistical Decision Theory and Related Topics, V, Gupta, S.S. and Berger, J.O. (Eds.), pp. 303–326, New York: Springer-Verlag. [66] Johnstone, I.M. & Silverman, B.W. (1997). Wavelet threshold estimators for data with correlated noise. J. R. Statist. Soc. B, 59, 319–351. 65
[67] Johnstone, I.M. & Silverman, B.W. (1998). Empirical Bayes approaches to mixture problems and wavelet regression. Technical Report, School of Mathematics, University of Bristol, UK. [68] Khadra, L., Al-Fahoum, A.S. & Al-Nashash, H. (1997). Detection of life-threatening cardiac arrhythmias using the wavelet transformation. Med. Biol. Eng. Comput., 35, 626–632. [69] Kovac, A. & Silverman, B.W. (2000). Extending the scope of wavelet regression methods by coefficient-dependent thresholding. J. Am. Statist. Ass., 95, 172–183. [70] Lang, M., Guo, H., Odegard, J.E., Burrus, C.S. & Wells Jr, R.O. (1996). Noise reduction using an undecimated discrete wavelet transform. IEEE Sig. Proc. Lett., 3, 10–12. [71] Lin, S.T. & McFadden, P.D. (1997). Gear vibration analysis by B-spline wavelet based linear wavelet transform. Mech. Syst. Signal Proc., 11, 603–609. [72] Mallat, S.G. (1989). A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattn Anal. Mach. Intell., 11, 674–693. [73] Mallat, S.G. (1999). A Wavelet Tour of Signal Processing. 2nd Edition, San Diego: Academic Press. [74] Marron, J.S., Adak, S., Johnstone, I.M., Neumann, M.H. & Patil, P. (1998). Exact risk analysis of wavelet regression. J. Comp. Graph. Statist., 7, 278–309. [75] Meyer, Y. (1992). Wavelets and Operators. Cambridge: Cambridge University Press. [76] Misiti, M., Misiti, Y., Oppenheim, G. & Poggi, J.-M. (1994). D´ecomposition en ondelettes et m´ethodes comparatives : ´etude d’une courbe de charge ´el´ectrique. Revue de Statistique Appliqu´ee, 17, 57–77. uller, P. & Vidakovic, B. (Eds.) (1999). Bayesian Inference in Wavelet Based Models. [77] M¨ Lect. Notes Statist., 141, New York: Springer-Verlag. [78] Nason, G.P. (1994). Wavelet regression by cross-validation. Technical Report 447, Department of Statistics, Standord University, USA. [79] Nason, G.P. (1995). Wavelet function estimation using cross-validation. In Wavelets and Statistics, Antoniadis, A. & Oppenheim, G. (Eds.), Lect. Notes Statist., 103, pp. 261–280, New York: Springer-Verlag.
[80] Nason, G.P. (1996). Wavelet shrinkage using cross-validation. J. R. Statist. Soc. B, 58, 463–479. [81] Nason, G.P. (2002). Choice of wavelet smoothness, primary resolution and threshold in wavelet shrinkage. Statist. Comput., 12, 219–227. [82] Nason, G.P. & Silverman, B.W. (1995). The stationary wavelet transform and some statistical applications. In Wavelets and Statistics, Antoniadis, A. & Oppenheim, G. (Eds.), Lect. Notes Statist., 103, pp. 281–300, New York: Springer-Verlag. [83] Neumann, M.H. & Spokoiny, V. (1995). On the efficiency of wavelet estimators under arbitrary error distributions. Math. Meth. Stat., 4, 137–166. [84] Ogden, R.T. (1997). Essential Wavelets for Statistical Applications and Data Analysis. Boston: Birkh¨auser. [85] Ogden, R.T. & Parzen, E. (1996a). Change-point approach to data analytic wavelet thresholding. Statist. Comput., 6, 93–99. [86] Ogden, R.T. & Parzen, E. (1996b). Data dependent wavelet thresholding in nonparametric regression with change-point applications. Comput. Statist. Data Anal., 22, 53–70. [87] Percival, D.B. & Walden, A.T. (2000). Wavelet Methods for Time Series Analysis. Cambridge: Cambridge University Press. [88] Ramsay, J.B., Usikov, D. & Zaslavsky, D. (1995). An analysis of U.S. stock price behavior using wavelets. Fractals, 3, 377–389. [89] Staszewski, W.J. & Tomlinson, G.R. (1994). Application of the wavelet transform to fault detection in a spur gear. Mech. Syst. Signal Proc., 8, 289–307. [90] Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist., 9, 1135–1151. [91] Strang, G. & Nguyen, T. (1996). Wavelets and Filter Banks. Wellesley:
Wellesley-Cambridge Press. [92] von Sachs, R. & MacGibbon, B. (2000). Non-parametric curve estimation by wavelet thresholding with locally stationary errors. Scand. J. Statist., 27, 475–499. [93] Vannucci, M. & Corradi, F. (1999). Covariance structure of wavelet coefficients: theory and models in a Bayesian perspective. J. R. Statist. Soc. B, 61, 971–986.
[94] Vidakovic, B. (1998a). Non-linear wavelet shrinkage with Bayes rules and Bayes factors. J. Am. Statist. Ass., 93, 173–179. [95] Vidakovic, B. (1998b). Wavelet based nonparametric Bayes methods. In Practical Nonparametric and Semiparametric Bayesian Statistics, Dey, D.D., M¨ uller, P. & Sinha, D. (Eds.), Lecture Notes in Statistics 133, pp. 133-155, New York: Springer-Verlag. [96] Vidakovic, B. (1999). Statistical Modeling by Wavelets. New York: John Wiley & Sons. [97] Vidakovic, B. & Ruggeri, F. (2001). BAMS method: theory and simulations. Sankhy¯ a B, 63, 234–249. [98] Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM. [99] Wang, Y. (1996). Function estimation via wavelet shrinkage for long-memory data. Ann. Statist., 24, 466–484. [100] Wand, M.P. & Jones, M.C. (1995). Kernel Smoothing. London: Chapman & Hall. [101] Weyrich, N. & Warhola, G.T. (1995a). De-noising using wavelets and cross-validation. NATO Adv. Study Inst. C, 454, 523–532. [102] Weyrich, N. & Warhola, G.T. (1995b). Wavelet shrinkage and generalized cross-validation for de-noising with applications to speech. In Wavelets and Multilevel Approximation, Chui, C.K. & Schumaker, L.L. (Eds.), Approximation Theory VIII 2, pp. 407–414, Singapore: World Scientific. [103] Wojtaszczyk, P. (1997). A Mathematical Introduction to Wavelets. Cambridge: Cambridge University Press.
Appendix I

Here are the analytical formulae of the test functions introduced in Section 5.

1. Step: f_1(x) = 0.2 + 0.6 I_{(1/3, 3/4)}(x).

2. Wave: f_2(x) = 0.5 + 0.2 cos(4πx) + 0.1 cos(24πx).

3. Blip: f_3(x) = (0.32 + 0.6x + 0.3 e^{-100(x-0.3)^2}) I_{[0, 0.8]}(x) + (-0.28 + 0.6x + 0.3 e^{-100(x-1.3)^2}) I_{(0.8, 1]}(x).

4. Blocks: Donoho and Johnstone's (1994) Blocks function vertically rescaled to [0.2, 0.8].

5. Bumps: Donoho and Johnstone's (1994) Bumps function vertically rescaled to [0.2, 0.8].

6. HeaviSine: Donoho and Johnstone's (1994) HeaviSine function vertically rescaled to [0.2, 0.8].

7. Doppler: Donoho and Johnstone's (1994) Doppler function vertically rescaled to [0.2, 0.8].

8. Angles: f_8(x) = (2x + 0.5) I_{[0, 0.15]}(x) + (-12(x - 0.15) + 0.8) I_{(0.15, 0.2]}(x) + 0.2 I_{(0.2, 0.5]}(x) + (6(x - 0.5) + 0.2) I_{(0.5, 0.6]}(x) + (-10(x - 0.6) + 0.8) I_{(0.6, 0.65]}(x) + (-5(x - 0.65) + 0.3) I_{(0.65, 0.85]}(x) + (2(x - 0.85) + 0.2) I_{(0.85, 1]}(x).

9. Parabolas: f_9(x) = 0.8 - 30 r(x, 0.1) + 60 r(x, 0.2) - 30 r(x, 0.3) + 500 r(x, 0.35) - 1000 r(x, 0.37) + 1000 r(x, 0.41) - 500 r(x, 0.43) + 7.5 r(x, 0.5) - 15 r(x, 0.7) + 7.5 r(x, 0.9), where r(x, c) = (x - c)^2 I_{(c, 1]}(x).

10. Time Shifted Sine: f_{10}(x) = 0.3 sin{3π[g(g(g(g(x)))) + x]} + 0.5, where g(x) = (1 - cos(πx))/2.

11. Spikes: g(x) = 15.6676 e^{-500(x-0.23)^2} + 2 e^{-2000(x-0.33)^2} + 4 e^{-8000(x-0.47)^2} + 3 e^{-16000(x-0.69)^2} + e^{-32000(x-0.83)^2}, and f_{11}(x) = (0.6/range(g)) g(x) + 0.2.

12. Corner: g(x) = 623.87 x^3 (1 - 4x) I_{(0, 0.5]}(x) + 187.161 (0.125 - x^3) x^4 I_{(0.5, 0.8]}(x) + 3708.470441 (x - 1)^3 I_{(0.8, 1]}(x), and f_{12}(x) = (0.6/range(g)) g(x) + 0.6.
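As an illustration of these formulae (a sketch under assumed conventions: n equispaced design points on (0, 1] and the root-signal-to-noise ratio taken as std(f)/σ), the Corner function and a noisy version of it can be generated in Matlab as follows.

% Sketch: the Corner test function f12 of Appendix I and a noisy version at rsnr = 3.
n  = 256;
x  = (1:n)'/n;                                        % assumed equispaced design on (0,1]
g  = 623.87*x.^3.*(1 - 4*x).*(x <= 0.5) + ...
     187.161*(0.125 - x.^3).*x.^4.*(x > 0.5 & x <= 0.8) + ...
     3708.470441*(x - 1).^3.*(x > 0.8);
f  = (0.6/(max(g) - min(g)))*g + 0.6;                 % vertical rescaling as in Appendix I
sigma = std(f)/3;                                     % noise level corresponding to rsnr = 3
y  = f + sigma*randn(n, 1);
plot(x, y, '.', x, f, '-'); title('Corner');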
Appendix II

Each figure in this appendix shows, for sample sizes of 128 (left) and 512 (right) design points and from top to bottom, the root mean squared bias, the root mean squared error and the maximum deviation of the estimators, plotted against the method numbers 1–34 of Table 3.

Figure 8.4: Performance of the estimators for the Step function over 100 simulations. The root signal-to-noise ratio is equal to 3 for sample sizes of 128 (left) and 512 (right) design points. The wavelet filter used is the Symmlet 8.
Figure 8.5: Performance of the estimators for the Wave function over 100 simulations. The root signal-to-noise ratio is equal to 3 for sample sizes of 128 (left) and 512 (right) design points. The wavelet filter used is the Symmlet 8.
Figure 8.6: Performance of the estimators for the Blip function over 100 simulations. The root signal-to-noise ratio is equal to 3 for sample sizes of 128 (left) and 512 (right) design points. The wavelet filter used is the Symmlet 8.
Figure 8.7: Performance of the estimators for the Blocks function over 100 simulations. The root signal-to-noise ratio is equal to 3 for sample sizes of 128 (left) and 512 (right) design points. The wavelet filter used is the Symmlet 8.
Figure 8.8: Performance of the estimators for the Bumps function over 100 simulations. The root signal-to-noise ratio is equal to 3 for sample sizes of 128 (left) and 512 (right) design points. The wavelet filter used is the Symmlet 8.
Figure 8.9: Performance of the estimators for the HeaviSine function over 100 simulations. The root signal-to-noise ratio is equal to 3 for sample sizes of 128 (left) and 512 (right) design points. The wavelet filter used is the Symmlet 8.
Figure 8.10: Performance of the estimators for the Doppler function over 100 simulations. The root signal-to-noise ratio is equal to 3 for sample sizes of 128 (left) and 512 (right) design points. The wavelet filter used is the Symmlet 8.
Figure 8.11: Performance of the estimators for the Angles function over 100 simulations. The root signal-to-noise ratio is equal to 3 for sample sizes of 128 (left) and 512 (right) design points. The wavelet filter used is the Symmlet 8.
Figure 8.12: Performance of the estimators for the Parabolas function over 100 simulations. The root signal-to-noise ratio is equal to 3 for sample sizes of 128 (left) and 512 (right) design points. The wavelet filter used is the Symmlet 8.
Figure 8.13: Performance of the estimators for the Time Shifted Sine function over 100 simulations. The root signal-to-noise ratio is equal to 3 for sample sizes of 128 (left) and 512 (right) design points. The wavelet filter used is the Symmlet 8.
Figure 8.14: Performance of the estimators for the Spikes function over 100 simulations. The root signal-to-noise ratio is equal to 3 for sample sizes of 128 (left) and 512 (right) design points. The wavelet filter used is the Symmlet 8.
Figure 8.15: Performance of the estimators for the Corner function over 100 simulations. The root signal-to-noise ratio is equal to 3 for sample sizes of 128 (left) and 512 (right) design points. The wavelet filter used is the Symmlet 8.
Method        Average CPU Time (n=128)   St. Dev.   Average CPU Time (n=512)   St. Dev.
VISU-H            0.0047                  0.0052       0.0057                   0.005
VISU-S            0.0044                  0.0052       0.0075                   0.0044
SURE              0.0131                  0.0053       0.0219                   0.0044
HYBSURE           0.0108                  0.0044       0.018                    0.004
TI-H              0.0948                  0.008        0.3842                   0.0067
TI-S              0.0951                  0.0085       0.3842                   0.0075
MINIMAX-H         0.0071                  0.0048       0.0089                   0.0031
MINIMAX-S         0.007                   0.0048       0.0094                   0.0024
CV-H              0.788                   0.0423       2.8711                   0.1194
CV-S              0.8123                  0.0803       2.9715                   0.1487
NEIGHBL           0.0258                  0.0065       0.0621                   0.005
BLOCKJS-A         0.0131                  0.0056       0.0295                   0.0026
BLOCKJS-T         0.0124                  0.0051       0.0282                   0.0039
THRDA1            0.11                    0.0236       0.1602                   0.0155
FDR-H             0.0314                  0.0053       0.0345                   0.0052
FDR-S             0.0311                  0.0053       0.0341                   0.0057
PENWAV            0.0088                  0.0041       0.0108                   0.0031
SCAD              0.0059                  0.0049       0.009                    0.003
MIXED             0.0535                  0.0083       0.0643                   0.0056
DECOMPSH          0.1146                  0.0073       0.2216                   0.0072
BLMED-A           4.9719                  0.7042       7.5799                   0.7529
BLMED-T           4.8445                  1.0492       8.4865                   0.8633
HYBMED-A          8.1662                 18.5313      10.4675                  24.1012
HYBMED-T          8.1802                 18.4423      10.4619                  24.2981
BLMEAN-A          4.8831                  0.7118       7.3968                   0.7442
BLMEAN-T          4.776                   1.0416       8.314                    0.8564
HYBMEAN-A         8.1064                 18.4009      10.3045                  24.2128
HYBMEAN-T         8.0896                 18.3561      10.3007                  24.2864
SINGLMED         15.0696                 36.0658      35.8483                  60.1153
SINGLMED-2        1.27                    0.8122       2.0824                   1.2284
SINGLMEAN        15.0723                 36.1797      35.7626                  60.0166
SINGLMEAN-2       1.2227                  0.8109       2.0086                   1.2153
SINGLHYP         15.0615                 36.1177      35.7836                  60.0317
BAMS              0.0876                  0.0077       0.3524                   0.0062

Table 4: Average and standard deviation (over 100 simulations) of the CPU time involved in computing the estimates for the Corner function by each method and the two sample sizes (n = 128, n = 512). Non-Bayesian methods uniformly outperform Bayesian methods in terms of CPU time in all examples.