Folia Fac. Sci. Nat. Univ. Masarykianæ Brunensis, Mathematica 15 (2004): 419–436


COMPARATIVE STUDY OF TWO KERNEL SMOOTHING TECHNIQUES

JIŘÍ ZELINKA, VÍTĚZSLAV VESELÝ AND IVANA HOROVÁ

Abstract. Kernel functions (kernels) can be used in many types of nonparametric methods: estimation of the density function of a random variable, of the hazard function, or of the regression function. These methods belong among the most efficient nonparametric methods. Another nonparametric method uses so-called frames, i.e. overcomplete systems of functions of some type. This paper compares kernel smoothing with frame smoothing based on frames of a special kind whose construction uses kernel functions. Both smoothing procedures are applied to simulated data and the results are presented graphically.

Keywords: functional approximations, overcomplete frame expansions, kernel smoothing, kernel operators.
AMS classification: Primary: 47B34, 65D15, 65F20; Secondary: 41A45, 42C15, 65K05

1. Introduction

In this paper a novel kernel smoothing technique developed by the second author is confronted with classical kernel smoothing handled by the other two co-authors. These techniques were tested on simulated data (20 and 101 samples) with i.i.d. normally distributed errors having variance equal to 0, 0.01, 0.04 and 0.25. Two types of regression functions were used; the first one, with a wave shape, is given by

m(x) = \frac{\sin(4\pi x)}{(1 + \cos(0.6\pi x))^2},

the second one, with a sharp peak, is given by

m(x) = (x^2 - x)(3x - 1) + 2e^{-50(3x-2)^2}.

Research supported by the MSM:J07/98:1431 0001.
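The following minimal Python sketch (an illustration only; the paper's computations were done in MATLAB) shows how the simulated data described above can be generated. The function names m1, m2 and simulate are ours, and the formulas follow the reconstruction given above.

```python
import numpy as np

def m1(x):
    """Wave-shaped regression function."""
    return np.sin(4 * np.pi * x) / (1 + np.cos(0.6 * np.pi * x)) ** 2

def m2(x):
    """Regression function with a sharp peak near x = 2/3."""
    return (x**2 - x) * (3 * x - 1) + 2 * np.exp(-50 * (3 * x - 2) ** 2)

def simulate(m, n=101, sigma=0.2, seed=0):
    """Return design points x_i and observations Y_i = m(x_i) + sigma * eps_i."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)              # ordered design points in [0, 1]
    y = m(x) + sigma * rng.standard_normal(n)
    return x, y

if __name__ == "__main__":
    for sigma in (0.0, 0.1, 0.2, 0.5):        # variances 0, 0.01, 0.04, 0.25
        x, y = simulate(m1, n=20, sigma=sigma)
        print(sigma, y[:3])
```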


2. Standard kernel technique

As the kernel smoothing technique is described in detail in many publications (e.g. [13], [3], [6], [7], [9], [8]), only a basic description is given in this paper. Consider data points (x_i, Y_i), i = 1, \dots, n, and suppose the regression model

Y_i = m(x_i) + v^{1/2}(x_i)\varepsilon_i, \quad i = 1, \dots, n, \qquad (2.1)

for ordered non-random numbers x_1, \dots, x_n and independent random variables \varepsilon_1, \dots, \varepsilon_n with E\varepsilon_i = 0, var(\varepsilon_i) = 1, i = 1, \dots, n.

The function m is the unknown regression function, E(Y_i) = m(x_i), and v is called the variance function, Var(Y_i) = v(x_i). Often it is assumed that v(x_i) = \sigma^2 for all i. Without loss of generality we can assume that 0 \le x_1 < \dots < x_n \le 1. The estimate of m will be denoted by \hat{m}.

The kernel functions (simply kernels) used for kernel smoothing are defined as follows. Let K \in Lip[-1,1], support(K) \subseteq [-1,1]. Let \nu, k be nonnegative integers with the same parity, k \ge \nu + 2, and let the following moment conditions be satisfied:

\int_{-1}^{1} x^j K(x)\,dx =
\begin{cases}
0, & 0 \le j < k,\ j \ne \nu,\\
(-1)^\nu \nu!, & j = \nu,\\
\beta_k \ne 0, & j = k.
\end{cases} \qquad (2.2)

Then the function K is called a kernel of order (\nu, k) and we write K \in M_{\nu,k}.

A kernel of order (\nu, k) is used for the estimation of the \nu-th derivative of the regression function m. As only the regression function itself is estimated in this paper, we use \nu = 0. The Gasser-Müller estimator of the function m takes the form

\hat{m}_h(x) = \frac{1}{h}\sum_{i=1}^{n} W_i(x,h)\,Y_i, \qquad W_i(x,h) := \int_{s_{i-1}}^{s_i} K\Big(\frac{x-u}{h}\Big)\,du, \qquad K \in M_{0,k}, \qquad (2.3)

with s_0 = 0, s_i = (x_i + x_{i+1})/2, i = 1, \dots, n-1, s_n = 1. The parameter h is called the bandwidth and it has a major influence on the quality of the estimate. The Mean Squared Error (MSE) describes the statistical properties of the estimate:

MSE(\hat{m}_h(x)) = E(\hat{m}_h(x) - m(x))^2 = (E\hat{m}_h(x) - m(x))^2 + var\,\hat{m}_h(x).

The leading term of MSE(\hat{m}_h(x)) can be expressed as

MSE(\hat{m}_h(x)) = \frac{v(x)V(K)}{nh} + h^{2k} B_k^2 \big(m^{(k)}(x)\big)^2, \qquad (2.4)

where V(K) = \int_{-1}^{1} K^2(x)\,dx, \quad B_k = \frac{(-1)^k \beta_k}{k!}, \quad \beta_k = \int_{-1}^{1} x^k K(x)\,dx.
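A minimal Python sketch (our illustration, not the authors' implementation) of the Gasser-Müller estimator (2.3) with the kernel K_{0,2}(x) = -\tfrac{3}{4}(x^2-1) on [-1,1] (the kernel K02 used later in section 4.2). The weights W_i(x,h) are obtained from the exact antiderivative of the kernel; the function names are ours.

```python
import numpy as np

def K02_antiderivative(u):
    """Antiderivative of K_{0,2}(x) = 3/4 (1 - x^2); equals 0 at u = -1 and 1 at u = 1."""
    u = np.clip(u, -1.0, 1.0)                 # K vanishes outside [-1, 1]
    return 0.75 * (u - u**3 / 3.0) + 0.5

def gasser_mueller(x_eval, x, y, h):
    """Evaluate m_hat_h at the points x_eval for data (x_i, Y_i) and bandwidth h."""
    n = len(x)
    s = np.empty(n + 1)
    s[0], s[-1] = 0.0, 1.0
    s[1:-1] = 0.5 * (x[:-1] + x[1:])          # s_i = (x_i + x_{i+1}) / 2
    m_hat = np.zeros_like(x_eval, dtype=float)
    for i in range(n):
        # W_i(x,h)/h = F((x - s_{i-1})/h) - F((x - s_i)/h), F = antiderivative of K,
        # so the 1/h factor of (2.3) is already absorbed by the substitution u -> (x-u)/h.
        W_over_h = (K02_antiderivative((x_eval - s[i]) / h)
                    - K02_antiderivative((x_eval - s[i + 1]) / h))
        m_hat += W_over_h * y[i]
    return m_hat

# e.g.: x, y = simulate(m1, n=101, sigma=0.1); fit = gasser_mueller(x, x, y, h=0.1)
```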


To get a global criterion for the quality of the estimate, the average of the values MSE(\hat{m}_h(x)) over the points x_1, \dots, x_n is evaluated:

AMSE(\hat{m}_h) = \frac{1}{n}\sum_{i=1}^{n} MSE(\hat{m}_h(x_i)) = \frac{V(K)}{nh}\,\frac{1}{n}\sum_{i=1}^{n} v(x_i) + h^{2k} B_k^2 \frac{1}{n}\sum_{i=1}^{n}\big(m^{(k)}(x_i)\big)^2.

Denote \bar{v} = \frac{1}{n}\sum_{i=1}^{n} v(x_i) and (\bar{m}^{(k)})^2 = \frac{1}{n}\sum_{i=1}^{n}\big(m^{(k)}(x_i)\big)^2; then

AMSE(\hat{m}_h) = \frac{V(K)\bar{v}}{nh} + h^{2k} B_k^2 (\bar{m}^{(k)})^2. \qquad (2.5)

The bandwidth minimizing this expression is called the optimal bandwidth and is given by

h_{opt} = \left(\frac{V(K)\bar{v}}{2kn B_k^2 (\bar{m}^{(k)})^2}\right)^{\frac{1}{2k+1}}. \qquad (2.6)

The estimate of h_{opt} was evaluated using the generalized cross-validation function given by the formula

GCV(h) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{Y_i - \hat{m}(x_i)}{1 - tr\,S/n}\right)^2, \qquad (2.7)

where S = [W_i(x_j, h)] is the so-called smoothing matrix. Hence

\hat{h}_{opt} = \arg\min_{h \in J_n} GCV(h) \qquad (2.8)

for a suitable interval J_n. The optimal kernels were used for the estimation of the regression function m (see [8]).
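A minimal sketch of the generalized cross-validation step (2.7)-(2.8): the bandwidth is chosen by minimizing GCV over a finite grid standing in for the interval J_n. It reuses the K02_antiderivative helper from the previous sketch; the smoothing matrix is taken with the 1/h factor absorbed, so that \hat{m}(x_j) = (Sy)_j.

```python
import numpy as np

def smoothing_matrix(x, h):
    """S with S[j, i] = W_i(x_j, h)/h, so that m_hat at the design points is S @ y."""
    n = len(x)
    s = np.empty(n + 1)
    s[0], s[-1] = 0.0, 1.0
    s[1:-1] = 0.5 * (x[:-1] + x[1:])
    S = np.empty((n, n))
    for i in range(n):
        S[:, i] = (K02_antiderivative((x - s[i]) / h)
                   - K02_antiderivative((x - s[i + 1]) / h))
    return S

def gcv_bandwidth(x, y, grid):
    """Return the bandwidth from `grid` minimizing GCV(h) of (2.7)."""
    n = len(x)
    best_h, best_gcv = None, np.inf
    for h in grid:
        S = smoothing_matrix(x, h)
        resid = y - S @ y
        gcv = np.mean((resid / (1.0 - np.trace(S) / n)) ** 2)
        if gcv < best_gcv:
            best_h, best_gcv = h, gcv
    return best_h

# e.g.: h_hat = gcv_bandwidth(x, y, np.linspace(0.05, 0.5, 46))
```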

3. Novel kernel frame technique

We apply the technique of Hilbert-space expansions in overcomplete systems (frames), so-called dictionaries. Details regarding the general theory can be found in [11] and [12]. See also [2] and the review paper [1], where many numerical examples demonstrate how digital signals can be effectively represented in terms of overcomplete time-scale dictionaries. In the following subsections we briefly recapitulate the main issues.

3.1. Theoretical background.

Notation.
Z, C . . . sets of integers and complex numbers, respectively.
J \subseteq Z, F \subseteq J . . . at most countable subscripting sets, F finite.
H_i . . . symbolic notation for a Hilbert space (H-space) over the scalar field C.
H_1 := \ell_2(J) := \{\{\xi_j\}_{j \in J} \mid \sum_{j \in J} |\xi_j|^2 < \infty,\ \xi_j \in C\} . . . square-summable sequences.
E_1 := \{\varepsilon_j\}_{j \in J}, \varepsilon_j = \{\delta_{jk}\}_{k \in J} . . . natural ONB in H_1, card J = dim H_1.


H_2 . . . a given H-space with inner product \langle\cdot,\cdot\rangle and induced norm \|\cdot\|.
H = L(\Phi) \subseteq H_2 . . . a suitable separable H-subspace of H_2.
\Phi := \{\phi_j\}_{j \in J} . . . (over)complete system in H (dictionary); the \phi_j are called atoms.
f \in H_2 . . . unknown object (function) to be approximated (represented) in H.
\mathbf{f} . . . N samples of f on a finite mesh.
y = \mathbf{f} + e . . . f possibly corrupted by a zero-mean white noise e with variance \sigma^2.
P_H : H_2 \to H . . . orthogonal projection operator of H_2 onto H.
\hat{f} := P_H f . . . best LSQ approximation of f in H: \|f - \hat{f}\| = \min_{g \in H} \|f - g\|.

Definition 3.1. A dictionary \Phi is called a frame in H if \phi_j = T\varepsilon_j for some bounded linear operator T : H_1 \to H_2 having closed range space R(T) = H. A frame which is not minimal (exact) is called overcomplete. A frame is called a Riesz basis (RB) if T is a topological linear isomorphism, a special case of it being the orthonormal basis (ONB), where T is a unitary isomorphism.

We assume \Phi to be a frame in what follows. Given f, our goal is to find an H(f) spanned by a small-sized finite Riesz basis \Phi^* = \Phi^*(f) picked out from an a priori given oversized frame \Phi but still yielding a satisfactory approximation \hat{f}^* of f. Observe that every (even oversized) finite dictionary always constitutes a frame, because R(T) has a possibly very high but still finite dimension and is therefore closed. For convenience, and with regard to the numerical feasibility of our procedures, we shall not lose much precision when starting from a finite and sufficiently large frame \Phi instead of a countable one admissible by the general theory.

Reconstruction of \hat{f} via T: \hat{f} \in H = R(T) implies that there exists \xi \in H_1 such that

\hat{f} = T\xi = T\Big(\sum_{j \in J}\xi_j\varepsilon_j\Big) = \sum_{j \in J}\xi_j T\varepsilon_j = \sum_{j \in J}\xi_j\phi_j. \qquad (3.1)

Weighted discretization of f or \hat{f} via the adjoint T^*:

\langle T\xi, f\rangle = \Big\langle\sum_{j \in J}\xi_j\phi_j, f\Big\rangle = \sum_{j \in J}\xi_j\langle\phi_j, f\rangle = \langle\xi, T^*f\rangle, \qquad (3.2a)

where T^* stands for the adjoint operator of T:

T^*f = \{\langle f, \phi_j\rangle\}_{j \in J} = \{\langle\hat{f}, \phi_j\rangle\}_{j \in J} = T^*\hat{f} \qquad (3.2b)

due to

\langle f, \phi_j\rangle = \langle\hat{f} + f^\perp, \phi_j\rangle = \langle\hat{f}, \phi_j\rangle + \underbrace{\langle f^\perp, \phi_j\rangle}_{0} = \langle\hat{f}, \phi_j\rangle.

We denote by L := T^* the discretization (Bessel) operator, and by R := T^*T = LT the correlation operator: R\varepsilon_k = LT\varepsilon_k = L\phi_k = \{\langle\phi_k, \phi_j\rangle\}_{j \in J}.

Theorem 3.2 (see also [12, Theorem 5.2]). \xi is a solution of the original least-squares (LSQ1) problem T\xi = \hat{f} iff \xi is an (exact) solution of the equivalent linear (L2) problem R\xi = Lf.


Proof.
⇒: T\xi = \hat{f} implies R\xi = LT\xi = L\hat{f} = Lf, the last equality by (3.2b).
⇐: L is injective on H in view of [12, Corollary 2.6(4)]. Hence R\xi = Lf = L\hat{f} (by (3.2b)) implies LT\xi = L\hat{f}, and therefore T\xi = \hat{f}, because both T\xi and \hat{f} belong to H. □

Corollary 3.3. The solution sets of the original LSQ1 problem \|f - T\xi\| = \|f - \sum_{j \in J}\xi_j\phi_j\| \to \min and of the equivalent L2 problem R\xi = Lf are the same.

Remark 3.4. As y are imprecise samples of f, we have to replace the exact discretization operator L by a suitable matrix estimator¹ \mathbf{L} such that \mathbf{L}y is a good approximation of Lf. Typically, in the case of a functional space H, the j-th row of \mathbf{L} = [l_{jn}], j \in J, n = 1, \dots, N, may originate from a suitable linear quadrature formula approximately evaluating \langle f, \phi_j\rangle = \int_I f(t)\phi_j(t)\,dt \approx \sum_{n=1}^{N} l_{jn}y_n, where l_{jn} = w_n\phi_j(n\Delta t) with quadrature weights w_n. Then we solve the least-squares problem \|\mathbf{L}y - R\xi\| \to \min (LSQ2 problem) instead of the linear problem L2, because R is usually badly conditioned and consequently L2 would produce a numerically unstable solution. Anyway, we usually prefer LSQ2, where R is known exactly, while T (or \phi_j) would have to be discretized¹ to a matrix \mathbf{T} when solving the approximate LSQ1 problem \|y - \mathbf{T}\xi\| \to \min.
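A minimal sketch of the matrix estimators of Remark 3.4: the discretization (Bessel) matrix L built from a linear quadrature rule (the trapezoidal rule on the sampling mesh is our assumption; any quadrature formula of the kind described above would do), and the correlation (Gram) matrix R evaluated with the same rule. The input `atoms` is an (M, N) array holding the atoms \phi_j sampled on the mesh t; the function names are ours.

```python
import numpy as np

def discretization_matrix(atoms, t):
    """L[j, n] = w_n * phi_j(t_n), so that (L @ y)_j approximates <f, phi_j>."""
    w = np.empty_like(t)                      # trapezoidal quadrature weights w_n
    w[1:-1] = (t[2:] - t[:-2]) / 2.0
    w[0] = (t[1] - t[0]) / 2.0
    w[-1] = (t[-1] - t[-2]) / 2.0
    return atoms * w                          # broadcasting gives l_{jn} = w_n phi_j(t_n)

def correlation_matrix(atoms, t):
    """R[j, k] ~ <phi_k, phi_j>, the Gram matrix, computed by the same quadrature."""
    L = discretization_matrix(atoms, t)
    return L @ atoms.T

# The LSQ2 problem of Remark 3.4 then reads: minimize || L @ y - R @ xi || over xi.
```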

Definition 3.5 (Finite ε-suboptimal frame for f). Let \varepsilon > 0 be a given precision tolerance and let F \subseteq J be finite such that \|f - \sum_{j \in F}\xi_j\phi_j\| < \varepsilon holds for suitable \xi_\varepsilon = \{\xi_j\}_{j \in F}. We can assume without loss of generality that \xi_\varepsilon was chosen to minimize the residual error:

\sum_{j \in F}\xi_j\phi_j = P_{H_\varepsilon(f)}f =: \hat{f}_\varepsilon, \quad where H_\varepsilon(f) = L(\Phi_\varepsilon(f)), \ \Phi_\varepsilon(f) := \{\phi_j\}_{j \in F} \subset \Phi.

Then \Phi_\varepsilon(f) will be called a finite ε-suboptimal frame for f and \xi_\varepsilon(f) := \xi_\varepsilon will be called an ε-suboptimal spectral representation of f in terms of \Phi. Clearly the choice of such \Phi_\varepsilon(f) and \xi_\varepsilon(f) is not unique.

OUR MAIN GOAL. We are looking for a sparse ε-suboptimal spectral representation \xi_\varepsilon^*(f) := \xi_\varepsilon^* = \{\xi_j^*\}_{j \in F^*} where card F^* is as small as possible: \Phi_\varepsilon^*(f) := \{\phi_j\}_{j \in F^*}, H_\varepsilon^*(f) := L(\Phi_\varepsilon^*(f)). We can assume without loss of generality that the \phi_j, j \in F^*, are linearly independent, and thus \Phi_\varepsilon^*(f) constitutes a (Riesz) basis and dim H_\varepsilon^*(f) = card F^*.

3.2. Solution procedure.

¹ Hereafter estimators are set in boldface to distinguish them from the exact operators.


3.2.1. Optional step: reducing the big frame \Phi to a medium-sized frame \Phi_\varepsilon(f). If the frame \Phi is extremely big or even infinite, we can try to locate finitely many atoms most significant for our data. The algorithm A2 described in [12] starts the recurrent construction of \Phi_\varepsilon(f) by putting there an atom exhibiting the largest correlation with f; in the next step we add another one most correlated with the residual term f - \hat{f}_\varepsilon, etc. The procedure is stopped as soon as the precision tolerance \|f - \hat{f}_\varepsilon\| < \varepsilon is achieved, provided that this is possible. One should keep in mind, however, that this algorithm is not able to resolve atoms strongly correlated with each other, and with high probability it will not insert all such atoms into \Phi_\varepsilon(f), regardless of their correlation with f. We can thus miss significant atoms in cases when the resolution feature is important for our model.

3.2.2. Looking for some \xi_\varepsilon being ε-suboptimal for f, or estimates thereof.

(1) Direct method. By Remark 3.4,

\hat{f}_\varepsilon = T\xi_\varepsilon \ \Rightarrow\ \xi_\varepsilon \approx \arg\min_{\xi \in \ell_2(F)}\|y - \mathbf{T}\xi\|^2,

or alternatively, in view of Corollary 3.3 and Remark 3.4,

\xi_\varepsilon \approx \arg\min_{\xi \in \ell_2(F)}\|\mathbf{L}y - R\xi\|^2.

A direct solution is obtained:
a) by a g-inverse: \xi_\varepsilon \approx \xi^- = \mathbf{T}^- y, or \xi_\varepsilon \approx \xi^- = R^-\mathbf{L}y. For example, using MATLAB we get \xi^- = \mathbf{T}\backslash y, or \xi^- = R\backslash(\mathbf{L}y).
b) by a Moore-Penrose pseudoinverse: \xi_\varepsilon \approx \xi^+ = \mathbf{T}^+ y, or \xi_\varepsilon \approx R^+\mathbf{L}y. For example, using MATLAB we get \xi^+ = pinv(\mathbf{T})*y, or \xi^+ = pinv(R)*\mathbf{L}*y.
Except for small-sized and well-conditioned problems, the direct method yields solutions subject to defects due to numerical instability and round-off error propagation. That is why we typically do not keep them as a final solution but rather use them as a starting point for special sophisticated iterative procedures providing more reliable results (see the detailed discussion of algorithm A1 in [12]). Below we prefer an approach based on [1] which treats noisy data very well.
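A minimal numpy counterpart (our illustration, not the authors' MATLAB code) of the direct estimates in (1): a least-squares/g-inverse type solution and a Moore-Penrose pseudoinverse solution of y ≈ Tξ and Ly ≈ Rξ. Here T is the (N, M) matrix whose columns are the sampled atoms, and L, R are the matrices from Remark 3.4.

```python
import numpy as np

def direct_estimates(T, y, L=None, R=None):
    """Raw spectral estimates xi^- and xi^+ from the direct method."""
    xi_lsq = np.linalg.lstsq(T, y, rcond=None)[0]         # counterpart of MATLAB's T\y
    xi_pinv = np.linalg.pinv(T) @ y                       # counterpart of pinv(T)*y
    if L is not None and R is not None:
        xi_R = np.linalg.lstsq(R, L @ y, rcond=None)[0]   # counterpart of R\(L*y)
        return xi_lsq, xi_pinv, xi_R
    return xi_lsq, xi_pinv
```

As discussed above, these raw estimates are typically used only as starting points \xi^0 for the iterative method below.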

(2) Iterative method. Using a suitable algorithm (see 3.3 below) we solve iteratively the following minimization problem, starting with appropriate initial estimates \xi^0; for example we can start with the raw estimates \xi^0 = \xi^- or \xi^0 = \xi^+ obtained in (1):

\xi_\varepsilon = \arg\min_{\xi \in \ell_2(F)} \tfrac{1}{2}\|\mathbf{f} - \mathbf{T}\xi\|^2 + \lambda\|\xi\|_1 \approx \arg\min_{\xi \in \ell_2(F)} \tfrac{1}{2}\|y - \mathbf{T}\xi\|^2 + \lambda\|\xi\|_1, \qquad (3.3a)

or alternatively

\xi_\varepsilon = \arg\min_{\xi \in \ell_2(F)} \tfrac{1}{2}\|Lf - R\xi\|^2 + \lambda\|\xi\|_1 \approx \arg\min_{\xi \in \ell_2(F)} \tfrac{1}{2}\|\mathbf{L}y - R\xi\|^2 + \lambda\|\xi\|_1. \qquad (3.3b)


Above we have assumed a normalized frame, \|\phi_j\| = 1 for all j \in F. If this is not the case, then in view of T\xi = \sum_{j \in F}\xi_j\phi_j = \sum_{j \in F}\|\phi_j\|\xi_j\frac{\phi_j}{\|\phi_j\|} we have to replace \|\xi\|_1 = \sum_{j \in F}|\xi_j| by \sum_{j \in F}\|\phi_j\|\,|\xi_j|.

The parameter \lambda > 0 is the so-called smoothing parameter controlling the level of smoothing, which increases with growing values of \lambda: \xi_\varepsilon \to 0 as \lambda \to \infty (maximal smoothing: T\xi = 0, the zero constant level), while with \lambda = 0 we get the usual least-squares result (minimal smoothing). The optimal choice is \lambda = \sigma\sqrt{2\ln(M)}, where M := card F stands for the cardinality of the dictionary and \sigma^2 is the variance of the additive white noise e. This can be motivated as follows. In the case of a dictionary that is an orthonormal basis, a number of papers [4, 5] have carefully studied an approach to denoising by so-called soft thresholding in an orthonormal basis. The next Theorem 3.6 justifies this choice also for an overcomplete and nonorthogonal dictionary; soft thresholding appears as a special case of either of equations (3.3a) and (3.3b).

Theorem 3.6 (see [1, sec. 5.2]). If \Phi is an orthonormal basis in H over R, then equations (3.3a) and (3.3b) have the same unique solution \xi_\varepsilon = \{\xi_j\}_{j \in F}, \xi_j = sign(\eta_j)(|\eta_j| - \lambda)_+, j \in F, where \eta_j = \langle f, \phi_j\rangle, i.e. \eta = \{\eta_j\}_{j \in F} = Lf.
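A one-line sketch of the soft-thresholding rule of Theorem 3.6 (exact only when the dictionary is an orthonormal basis): eta stands for the analysis coefficients Lf (in practice estimated by L @ y) and lam for the smoothing parameter, e.g. sigma * sqrt(2 * log(M)).

```python
import numpy as np

def soft_threshold(eta, lam):
    """xi_j = sign(eta_j) * (|eta_j| - lam)_+ componentwise."""
    return np.sign(eta) * np.maximum(np.abs(eta) - lam, 0.0)

# e.g.: xi = soft_threshold(L @ y, sigma * np.sqrt(2.0 * np.log(M)))
```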

3.2.3. Computing \xi_\varepsilon^* having minimal \ell_1-norm among all \xi_\varepsilon that are ε-suboptimal for \hat{f}_\varepsilon. Having estimates \xi_\varepsilon satisfying eqs. (3.3a) or (3.3b), we search for

\xi_\varepsilon^* = \arg\min_{\mathbf{T}\xi = \mathbf{T}\xi_\varepsilon}\|\xi\|_1 = \arg\min_{R\xi = R\xi_\varepsilon}\|\xi\|_1. \qquad (3.4a)

Assuming the simplified subscripting F := \{1, 2, \dots, M\}, this is equivalent to the solution of the linear programming (LP) problem of minimizing

\sum_{j=1}^{2M} x_j \quad subject to \quad Ax = b, \ x_j \ge 0, \qquad (3.4b)

where \xi_j = x_j - x_{j+M} and A := [\mathbf{T}, -\mathbf{T}] with b = \mathbf{T}\xi_\varepsilon, or alternatively A := [R, -R] with b = R\xi_\varepsilon. See algorithm A3 in [12] for a more detailed discussion. If x^* is the solution of (3.4b), we put \xi_\varepsilon^* = \{x_j^* - x_{j+M}^*\}_{j \in F}.
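A minimal sketch of the LP reformulation (3.4b) using scipy.optimize.linprog (the paper itself relies on M. Saunders' primal-dual log-barrier MATLAB code, see 3.3 below): given a least-squares estimate xi_eps, we look for a representation with minimal l1-norm reproducing the same fit.

```python
import numpy as np
from scipy.optimize import linprog

def l1_sparsify(T, xi_eps):
    """Solve (3.4b): minimize sum(x) s.t. [T, -T] x = T @ xi_eps, x >= 0."""
    M = T.shape[1]
    A_eq = np.hstack([T, -T])            # A = [T, -T]; alternatively [R, -R]
    b_eq = T @ xi_eps                    # b = T xi_eps; alternatively R xi_eps
    c = np.ones(2 * M)                   # objective: sum_{j=1}^{2M} x_j
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    x = res.x
    return x[:M] - x[M:]                 # xi*_j = x_j - x_{j+M}

# Negligible coefficients |xi*_j| <= delta are then dropped as in step 3.2.4.
```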

3.2.4. Removing negligible terms from \xi_\varepsilon^* to obtain \Phi_\varepsilon^*(f) and the respective f_\varepsilon^*. We choose a suitable near-zero tolerance \delta > 0, as large as possible and such that \|f - f_\varepsilon^*\| \approx \|y - \mathbf{T}\xi_\varepsilon^*\| < \varepsilon, with f_\varepsilon^* = \sum_{j \in F^*}\xi_j^*\phi_j and \Phi_\varepsilon^*(f) := \{\phi_j\}_{j \in F^*}, where F^* = \{j \mid |\xi_j^*| > \delta\}. We do not distinguish between \xi_\varepsilon^* = \{\xi_j^*\}_{j \in F^*} and \{\xi_j^*\}_{j \in F} with \xi_j^* = 0 for j \in F - F^*.

3.2.5. Optional step: recomputing \xi_\varepsilon^* with \Phi_\varepsilon^* to obtain a possibly improved \hat{f}_\varepsilon^*. We use 3.2.2(2) with F replaced by the significantly reduced F^* and the initial estimate \xi^0 = \xi_\varepsilon^* from 3.2.4. We expect to obtain a possibly improved sparse representation \xi_\varepsilon^* of \hat{f}_\varepsilon^* = P_{H_\varepsilon^*}f, where H_\varepsilon^* := H_\varepsilon^*(f) = L(\Phi_\varepsilon^*(f)).


3.3. Algorithmic implementation of steps 3.2.2(2), 3.2.3 and 3.2.5. We use the Primal-Dual Log-Barrier LP Algorithm by M. Saunders [10] for solving the following perturbed linear program:

minimize \ c^T x + \tfrac{1}{2}\|\gamma x\|^2 + \tfrac{1}{2}\|p\|^2 \quad subject to \quad Ax + \delta p = b, \ x \ge 0, \qquad (3.5)

where \gamma and \delta are normally small (\approx 10^{-3}) regularization parameters and c^T x = \lambda\sum_{j=1}^{2M} x_j = \lambda\sum_{j=1}^{M}\|\phi_j\|\,|\xi_j| (see also 3.2.2(2) and 3.2.3). Setting specific values of the control parameters allows us to accomplish the desired task:
• \lambda \ge 0, \delta = 1, \gamma \approx 10^{-3} solves the problems (3.3a) and (3.3b) from 3.2.2(2) or 3.2.5.
• \lambda = 1, \delta \approx 10^{-3}, \gamma \approx 10^{-3} solves the problem 3.2.3.

3.4. Kernel frames. We put H_2 = L_2(R) and choose a fixed 'mother kernel function' K(t) \in L_2(R), typically a bell-shaped one like a Gaussian (see also section 4.1). Let 1_I stand for the characteristic function of a bounded interval I \subset R on which the unknown function f \in L_2(R) was sampled to y. We construct an oversized shift-scale dictionary \Phi = \{\phi_{a,b}\}_{a \in A, b \in B} derived from K(t) by varying the shift parameter a and the scale (width) parameter b between values from big finite (or even countable) sets A \subset R and B \subset R^+, respectively:

H = L(\{\phi_{a,b}\}_{a \in A, b \in B}), \qquad \phi_{a,b}(t) = 1_I(t)\,K\Big(\frac{t-a}{b/B}\Big), \quad where \ K\big(\pm\tfrac{B}{2}\big) = \tfrac{1}{2}.

The value of B, the so-called full width at half-maximum (FWHM) of the mother kernel K(t), accomplishes scale-normalizing, allowing us to interpret b as the FWHM of the atom \phi_{a,b}. If necessary, \Phi can be reduced to a medium-sized frame \Phi_\varepsilon(f) according to 3.2.1. Then H_\varepsilon(f) = L(\{\phi_{a_i,b_j}\}_{i=1,\dots,\alpha,\ j=1,\dots,\beta}). In our simulations from section 4 we do not play around with \varepsilon and get along with \Phi = \Phi_\varepsilon, where \alpha = \beta = 20 yields a frame with 400 atoms, which is computationally feasible on a standard PC running at 350-500 MHz.

4. Comparative study

4.1. Design of the numerical experiment. We use the uniform mesh t_i = (i-1)\Delta t, i = 1, \dots, n, with \Delta t = \frac{1}{n-1}, on the interval I = [0,1], and the number of sample points n = 20 or n = 101. According to section 3.4, a 20-point uniform grid was used both for the shifts a_i and the scales b_j, as follows: A = \{i\,\Delta a\}_{i=0}^{19} and B = \{b_0 + j\,\Delta b\}_{j=0}^{19}, where \Delta a = \Delta b = \frac{1}{19}, along with the minimal scale b_0 = 5\Delta t for \sigma \ge 0.2 and b_0 = 3\Delta t otherwise. Observe that the smoothing level is also influenced by b_0: higher values of b_0 eliminate narrow lines from the model and thus increase the smoothing effect, in addition to


the choice of \lambda as well. On the other hand, with noisy data, values of b_0 close to \Delta t cause the model to catch the noise, destroying the quality of the approximation.

In all steps of the frame-based kernel smoothing from section 3, the original LSQ1 problem was solved for noisy data (\sigma > 0), while the equivalent LSQ2 problem (see 3.4) was used for the approximation of exact data (\sigma = 0). In all cases we preferred the sup-norm \|\phi\|_\infty for the atom normalization, which turned out to be more appropriate for bell-shaped kernels than the L_2-normalization commonly used with oscillatory wavelet atoms. Unfortunately, the optimal value of \lambda designed by Donoho and Johnstone [4] turned out not to be appropriate any more and led to oversmoothing with growing \sigma. By way of trial we have corrected its value by a multiplicative factor related to the respective standard deviation \sigma_k of the corrupting white noise (\sigma_1 := 0, \sigma_2 := 0.1, \sigma_3 := 0.2 and \sigma_4 := 0.5).

4.2. Inspection of results. The results of the numerical experiment are shown in Figures 1-6. Let us explain in more detail the meaning of the curves distinguished by different colors and briefly glossed by the attached legend:
Exact data (green): the exact function and (noisy) samples thereof marked by circles;
Kernel smoothing (blue): data smoothed via the standard kernel technique from section 2;
the remaining curves show results of the novel kernel smoothing procedure described in section 3 at its intermediate steps:
All frame atoms (magenta): LSQ solution \hat{f}_\varepsilon from step 3.2.2(2);
Significant frame atoms (red): sparse solution by \ell_1-optimization after removing the negligible terms in step 3.2.4;
Reduced frame (black): final sparse solution \hat{f}_\varepsilon^* with the reduced frame from step 3.2.5.

So-called optimal kernels were used for the experiment. They were of orders (0,2), (0,4) and (0,12), and the exact formulas are given by

K_{0,2}(x) = -\tfrac{3}{4}(x^2 - 1)\cdot I_{[-1,1]}, \qquad (K02)
K_{0,4}(x) = \tfrac{15}{32}(x^2 - 1)(7x^2 - 3)\cdot I_{[-1,1]}, \qquad (K04)
K_{0,12}(x) = \tfrac{9009}{524288}(x^2 - 1)(52003x^{10} - 124355x^8 + 106590x^6 - 39270x^4 + 5775x^2 - 231)\cdot I_{[-1,1]}. \qquad (K12)

The actual value of the smoothing parameter \lambda is printed in the title of each plot. The label below each plot indicates the number n_s of significant atoms of the final sparse solution and the root mean square error RMSE = \sqrt{\sum_{i=1}^{n}(y_i - \hat{f}_\varepsilon^*(t_i))^2/(n - n_s)}, where n = 20 or n = 101 is the number of sample points.
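A minimal sketch of the shift-scale kernel dictionary of section 3.4 on the grid described in section 4.1, using K02 as the mother kernel and the sup-norm atom normalization mentioned above; the variable names and the choice of K02 as mother kernel are our illustrative assumptions.

```python
import numpy as np

def K02(x):
    """Kernel K_{0,2}(x) = -3/4 (x^2 - 1) on [-1, 1], zero elsewhere."""
    return np.where(np.abs(x) <= 1.0, 0.75 * (1.0 - x**2), 0.0)

def kernel_dictionary(t, n, sigma):
    """Return a (400, N) array whose rows are the atoms phi_{a,b} sampled on the mesh t."""
    B = np.sqrt(2.0)                          # FWHM of K02: K02(+-B/2) = half of its maximum
    dt = 1.0 / (n - 1)
    b0 = 5 * dt if sigma >= 0.2 else 3 * dt   # minimal scale b_0 from section 4.1
    shifts = np.arange(20) / 19.0             # A = {i/19}, i = 0, ..., 19
    scales = b0 + np.arange(20) / 19.0        # B = {b_0 + j/19}, j = 0, ..., 19
    atoms = []
    for a in shifts:
        for b in scales:
            phi = K02((t - a) / (b / B))      # atom with shift a and FWHM b
            atoms.append(phi / np.max(np.abs(phi)))   # sup-norm normalization
    return np.array(atoms)

# e.g.: t = np.linspace(0, 1, 101); Phi = kernel_dictionary(t, 101, sigma=0.1)  # (400, 101)
```

Its transpose plays the role of the matrix T, while the matrices L and R of section 3.2 can be obtained from it as in the quadrature sketch after Remark 3.4.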


Figure 1: Reconstruction from data 1 corrupted with additive Gaussian white noise WN(0, σ²), σ = 0, 0.1, 0.2 and 0.5 (top-down), using 400 atoms K02; a) 20 samples, b) 101 samples. [Plots not reproduced here: each panel shows the exact data, the standard kernel smoothing and the frame-based fits (all frame atoms, significant frame atoms, reduced frame), together with the value of λ, the number of significant atoms out of 400 and the RMSE.]

Figure 2: Reconstruction from data 1 corrupted with additive Gaussian white noise WN(0, σ²), σ = 0, 0.1, 0.2 and 0.5 (top-down), using 400 atoms K04; a) 20 samples, b) 101 samples. [Plots not reproduced here.]


Figure 3: Reconstruction from data 1 corrupted with additive Gaussian white noise WN(0, σ²), σ = 0, 0.1, 0.2 and 0.5 (top-down), using 400 atoms K12; a) 20 samples, b) 101 samples. [Plots not reproduced here.]

Figure 4: Reconstruction from data 2 corrupted with additive Gaussian white noise WN(0, σ²), σ = 0, 0.1, 0.2 and 0.5 (top-down), using 400 atoms K02; a) 20 samples, b) 101 samples. [Plots not reproduced here.]


Figure 5: Reconstruction from data 2 corrupted with additive Gaussian white noise WN(0, σ²), σ = 0, 0.1, 0.2 and 0.5 (top-down), using 400 atoms K04; a) 20 samples, b) 101 samples. [Plots not reproduced here.]

Figure 6: Reconstruction from data 2 corrupted with additive Gaussian white noise WN(0, σ²), σ = 0, 0.1, 0.2 and 0.5 (top-down), using 400 atoms K12; a) 20 samples, b) 101 samples. [Plots not reproduced here.]


Figure 7: Least-squares coefficient estimates sorted in decreasing order of their absolute values |ξ_j|, obtained in step 3.2.2(2). [Plot not reproduced here; data 1, 20 samples, σ = 0, kernel K12, λ = 0.]

Figure 8: ℓ1-optimized coefficient estimates sorted in decreasing order of their absolute values |ξ_j|, obtained in step 3.2.3. [Plot not reproduced here; maximal deviation due to ZeroTol: 0.0052983.]


4.3. Discussion of results. The crucial role of the ℓ1-optimizing step 3.2.3 is demonstrated by Figures 7 and 8, which relate to the approximation of the 20-point exact data (σ = 0) using 400 atoms and the kernel shape K12. The coefficients produced by the standard least-squares procedure do not yield a sparse solution (Figure 7) because of the slow decay of their magnitudes towards zero. Thus the information about the data is spread over almost all of them (400 values), giving us no idea where to cut them off. This contrasts with the ℓ1-optimized representation of (nearly) the same least-squares solution (Figure 8), where just a few coefficients carry useful information. The cut-off point is easily located by the abrupt change in the slope of their decay. See also the red line in the top plot of Figure 3a), produced by only 11 significant coefficients (the cut-off point) out of 400, which is even more precise than the solution (magenta line) related to Figure 7.

5. Conclusion

We can see that both techniques give similar results for the first regression function and for all used orders of the kernel (Figures 1-3). In Figures 4-6 one can see the typical behaviour of the kernel estimate for a regression function with a sharp peak: kernels of low order give a worse estimate of the peak (Figures 4 and 5), while the estimate for the kernel of high order (Figure 6) is wavy in the part of the regression function without the peak. The novel kernel smoothing technique produces a better estimate for the second regression function, so we can say that this technique seems to be more suitable for this type of regression function. For a complete comparison of both techniques many more numerical experiments would have to be carried out. One should also take into account that we have used kernel shapes designed to be asymptotically optimal in some sense for the standard kernel smoothing, but not for the novel technique, which does not rely on asymptotics. In particular, the low-order kernels K02 and K04 are not smooth enough to approximate roughly sampled ideal smooth shapes well by linear combinations (see section 1); see the bumps in the top-left subplots of Figures 1-3, which damp down with increasing order (smoothness) of the kernel atoms. Gaussian-like kernel atoms would not produce such defects.

References

[1] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput. 20 (1998), no. 1, 33-61; reprinted in SIAM Review 43 (2001), no. 1, 129-159.
[2] O. Christensen, An introduction to frames and Riesz bases, Applied and Numerical Harmonic Analysis, Birkhäuser, Boston-Basel-Berlin, 2003.
[3] L. Devroye and L. Györfi, Nonparametric density estimation: the L1 view, Wiley, New York, 1985.
[4] D. L. Donoho, De-noising by soft-thresholding, IEEE Trans. Inform. Theory 41 (1995), no. 3, 613-627.
[5] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian and D. Picard, Wavelet shrinkage: Asymptopia?, J. Royal Statist. Soc. B 57 (1995), no. 2, 301-337.
[6] T. Gasser and H. G. Müller, Kernel estimation of regression functions, in: Smoothing techniques for curve estimation (eds. Gasser & Rosenblatt), Springer, Heidelberg, 1979, 23-68.
[7] T. Gasser and H. G. Müller, Non-parametric estimation of regression functions and their derivatives by the kernel method, Scand. J. Statist. 11 (1984), 171-185.
[8] I. Horová, P. Vieu and J. Zelinka, Optimal choice of nonparametric estimates of a density and its derivatives, Statistics & Decisions 20 (2002), 355-378.
[9] H. G. Müller, Nonparametric regression analysis of longitudinal data, Lecture Notes in Statistics 46, Springer-Verlag, Berlin-Heidelberg, 1988.
[10] M. A. Saunders, MATLAB code for minimizing convex separable objective functions subject to Ax = b, x ≥ 0, http://www-stat.stanford.edu/~atomizer.
[11] V. Veselý, Kernel frame smoothing operators, Proceedings of the summer school ROBUST'2000, Nečtiny near Plzeň, September 2000 (J. Antoch and G. Dohnal, eds.), JČMF (Society of Czech Mathematicians and Physicists), 2001, 308-323.
[12] V. Veselý, Hilbert-space techniques for spectral representation in terms of overcomplete bases, Proceedings of the summer school DATASTAT'2001, Čihák near Žamberk (I. Horová, ed.), Folia Fac. Sci. Nat. Univ. Masaryk. Brunensis, Mathematica, vol. 11, Dept. of Appl. Math., Masaryk University of Brno, Czech Rep., 2002, 259-273.
[13] M. P. Wand and M. C. Jones, Kernel smoothing, Chapman & Hall, 1995.

Jiří Zelinka
Department of Applied Mathematics, Faculty of Science, Masaryk University in Brno, Janáčkovo nám. 2a, 602 00 Brno, Czech Republic
E-mail: [email protected]

Vítězslav Veselý
Department of Applied Mathematics and Computer Science, Faculty of Economics and Administration, Masaryk University in Brno, Lipová 41a, 602 00 Brno, Czech Republic
E-mail: [email protected]

Ivana Horová
Department of Applied Mathematics, Faculty of Science, Masaryk University in Brno, Janáčkovo nám. 2a, 602 00 Brno, Czech Republic
E-mail: [email protected]
