Lévy Adaptive Regression Kernels

Chong Tu, Merlise Clyde, and Robert L. Wolpert∗

Department of Statistical Science, Duke University

Revised: August 3, 2007
Summary

This paper describes a new class of prior distributions for nonparametric function estimation. The unknown function is approximated as a weighted sum of kernel or generator functions with arbitrary location parameters. Scaling (and other) parameters for the generating functions are also modeled as location-specific and thus are adaptive, as with wavelet bases and overcomplete dictionaries. Lévy random fields are introduced to construct prior distributions on the unknown functions, leading to a joint prior distribution for the number of kernels, the kernel regression coefficients, and the kernel-specific parameters. Under Gaussian errors, the problem may be formulated as a sparse regression problem, with regularization induced via the Lévy random field prior. Posterior inference on the unknown functions is based on a reversible jump Markov chain Monte Carlo algorithm. We compare the Lévy Adaptive Regression Kernel (LARK) method to wavelet-based methods using some of the standard test functions. In all cases the LARK method leads to sparser solutions and improvements in mean squared error.
Key words: Bayes; Lévy random field; nonparametric regression; relevance vector machine; kernel regression; reversible jump Markov chain Monte Carlo; splines; support vector machine; wavelets.

∗Address for correspondence: Professor Robert L. Wolpert ([email protected]), Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA.
1 Introduction
Suppose we have $n$ noisy measurements $Y_1, \dots, Y_n$ of an unknown real-valued function $f : X \to \mathbb{R}$ on some complete separable metric space $X$,
\[
  Y_i = f(x_i) + e_i, \qquad e_i \overset{\text{iid}}{\sim} \operatorname{No}(0, \sigma^2), \tag{1}
\]
observed at points $\{x_i\}_{i \in I} \subset X$. In nonparametric regression models, the mean function $f(\cdot)$ is often regarded as an element of some Hilbert space $\mathcal{H}$ of real-valued functions on $X$, and is expressed as a linear combination of basis functions $\{g_j\} \subset \mathcal{H}$:
\[
  f(x_i) = \sum_{0 \le j} g_j(x_i)\,\beta_j. \tag{2}
\]

[…]

For the symmetric $\alpha$-stable (S$\alpha$S) LARK model, the Lévy measure $\nu(d\beta, d\omega)$ of (33) has index $0 < \alpha < 2$ and $\omega$-dependence of the simple form $\gamma(d\omega) = \gamma\, d\chi\, \pi_\lambda(d\lambda)$ for some $\gamma > 0$, proportional to a uniform measure in location $\chi$ and a specified probability measure $\pi_\lambda(d\lambda)$ on $\mathbb{R}_+$ governing scale. Truncating at $|\beta| \ge \epsilon\gamma^{1/\alpha}$ again disentangles $\gamma$'s roles of determining jump sizes and magnitudes, leading to
\[
  \nu_+^{\epsilon} \equiv \nu^{\epsilon}(\mathbb{R} \times \Omega)
  = \frac{2}{\pi}\,\Gamma(\alpha)\sin(\pi\alpha/2)\,\epsilon^{-\alpha}
\]
in general, or $\nu_+^{\epsilon} = 2/(\pi\epsilon)$ for the Cauchy case $\alpha = 1$. Again (32f) specifies that, given $J$ and $\gamma$, $\{\beta_j, \chi_j, \lambda_j\}$ are independent and identically distributed, with the same distributions as before for $\chi_j$ and $\lambda_j$, but now the jumps have symmetric Pareto distributions,
\[
  \pi_\beta(\beta_j) = \frac{\alpha\,|\beta_j|^{-\alpha-1}}{2\,\gamma^{-1}\epsilon^{-\alpha}}\,
  \mathbf{1}\bigl\{\beta_j \in \bigl[-\epsilon\gamma^{1/\alpha},\, \epsilon\gamma^{1/\alpha}\bigr]^{c}\bigr\}
\]
in general, or
\[
  \pi_\beta(\beta_j) = \frac{\gamma\epsilon}{2}\,\beta_j^{-2}\,\mathbf{1}\{|\beta_j| > \gamma\epsilon\}
\]
for the Cauchy case. Zolotarev's (1986) (M) parametrization employs the compensator $h(u) = \sin(u)$, so in principle (32g) must be changed slightly to
\[
  f(x_i) \equiv \beta_0 + \delta_\epsilon(x_i, \gamma) + \sum_{j=1}^{J} g(x_i, \omega_j)\,\beta_j, \tag{34a}
\]
where
\[
  \delta_\epsilon(x_i, \gamma) \equiv \iint_{\mathbb{R} \times \Omega} g(x_i, \omega)\,
  \bigl[\beta\,\mathbf{1}_{\{|\beta|^\alpha \le \gamma\epsilon^\alpha\}}(\beta) - \sin(\beta)\bigr]\,
  \nu(d\beta, d\omega), \tag{34b}
\]
but because $\nu(\beta, \omega)$ of (33) is an even function of $\beta$ while the bracketed term in (34b) is odd, $\delta_\epsilon(x_i, \gamma) \equiv 0$ for the S$\alpha$S LARK model. For asymmetric $\alpha$-stable models (for example, fully-skewed ones), the compensation adjustment of (34) is essential for convergence as $\epsilon \to 0$.
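To make the construction concrete, the following is a minimal sketch of drawing one prior realization under the truncated S$\alpha$S specification above: $J \sim \text{Poisson}(\nu_+^{\epsilon})$ kernels, symmetric Pareto jumps with minimum $\epsilon\gamma^{1/\alpha}$, and i.i.d. locations and scales. The choices $X = [0, 1]$ with uniform location measure, a Gamma $\pi_\lambda$, and a Gaussian kernel $g(x; \chi, \lambda) = e^{-\lambda(x-\chi)^2}$ are illustrative assumptions, not prescribed by the text.

```python
import numpy as np
from scipy.special import gamma as Gamma

def sample_lark_prior(alpha, gam, eps, rng, a_lam=2.0, b_lam=1.0):
    """Draw (beta, chi, lam) from the truncated SaS Levy measure on X = [0, 1]."""
    # Expected number of kernels: nu_+^eps = (2/pi) Gamma(alpha) sin(pi alpha/2) eps^{-alpha}
    nu_plus = (2.0 / np.pi) * Gamma(alpha) * np.sin(np.pi * alpha / 2) * eps ** (-alpha)
    J = rng.poisson(nu_plus)
    chi = rng.uniform(0.0, 1.0, size=J)          # locations, uniform on [0, 1]
    lam = rng.gamma(a_lam, 1.0 / b_lam, size=J)  # scales; Gamma pi_lambda is an assumption
    # Symmetric Pareto jumps: |beta| >= eps * gam^{1/alpha}, tail index alpha
    u = rng.uniform(size=J)
    beta = eps * gam ** (1.0 / alpha) * u ** (-1.0 / alpha)
    beta *= rng.choice([-1.0, 1.0], size=J)
    return beta, chi, lam

def f_draw(x, beta, chi, lam):
    """Evaluate f(x) = sum_j beta_j g(x; chi_j, lam_j) for an illustrative Gaussian kernel."""
    G = np.exp(-lam[None, :] * (x[:, None] - chi[None, :]) ** 2)
    return G @ beta

rng = np.random.default_rng(0)
beta, chi, lam = sample_lark_prior(alpha=1.0, gam=1.0, eps=0.1, rng=rng)
f = f_draw(np.linspace(0, 1, 200), beta, chi, lam)
```

Note that, as in the text, the truncated total mass $\nu_+^{\epsilon}$ is free of $\gamma$; only the jump magnitudes depend on it.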
4.5 Posterior Inference
The joint posterior density of all parameters under the symmetric Gamma LARK model of (32), given observations $\mathbf{Y} = \{Y_i\}$, is
\[
  p(\alpha, \tau, J, \sigma^2, \beta_0, \boldsymbol{\beta}, \boldsymbol{\omega} \mid \mathbf{Y})
  \;\propto\; \pi_\alpha(\alpha)\,\pi_\tau(\tau)\,
  \frac{\exp\bigl(-\nu_\epsilon(\mathbb{R} \times \Omega)\bigr)}{J!}\,
  \prod_{j=1}^{J} \nu_\epsilon(\beta_j, \omega_j)
  \;\times\; \sigma^{-n-2}
  \exp\Bigl\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\Bigl(Y_i - \beta_0 - \sum_{j} g(x_i, \omega_j)\,\beta_j\Bigr)^{2}\Bigr\}. \tag{35}
\]
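For reference, the right-hand side of (35) can be evaluated directly as an unnormalized log density. The following is a minimal sketch, assuming a Gaussian kernel for $g$ and treating $\pi_\alpha$, $\pi_\tau$, and $\nu_\epsilon$ as user-supplied functions whose concrete forms depend on the choices made in (32):

```python
import numpy as np
from math import lgamma

def log_posterior(Y, x, beta0, beta, chi, lam, sigma2,
                  log_nu_eps, nu_eps_total, log_prior_hyper):
    """Unnormalized log of (35).  `log_nu_eps(b, c, l)` returns log nu_eps(beta_j, omega_j)
    and `nu_eps_total` is nu_eps(R x Omega); both are assumed supplied.
    The Gaussian kernel below is an illustrative stand-in for g(x_i, omega_j)."""
    J = len(beta)
    G = np.exp(-lam[None, :] * (x[:, None] - chi[None, :]) ** 2)  # g(x_i; chi_j, lam_j)
    resid = Y - beta0 - G @ beta
    n = len(Y)
    lp = log_prior_hyper                    # log pi_alpha(alpha) + log pi_tau(tau)
    lp += -nu_eps_total - lgamma(J + 1)     # exp(-nu_eps(R x Omega)) / J!
    lp += sum(log_nu_eps(beta[j], chi[j], lam[j]) for j in range(J))
    lp += -0.5 * (n + 2) * np.log(sigma2) - resid @ resid / (2.0 * sigma2)
    return lp
```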
Only $\beta_0$ and $\sigma^2$ may be integrated out analytically; the posterior (and full conditional) distributions of the other parameters are not available in closed form. Since some of our parameters ($\boldsymbol{\beta}$ and $\boldsymbol{\omega}$) have varying dimension, some form of trans-dimensional Markov chain Monte Carlo, such as a reversible jump (RJ-MCMC) algorithm (Green, 1995; Wolpert, Ickstadt and Hansen, 2003; DiMatteo et al., 2001), must be used to draw samples from (35) for posterior inference. A typical RJ-MCMC procedure for sampling parameters of varying dimension involves (at least) three types of moves: Birth, Death, and Update. A Birth step generates a new point $(\beta^*, \omega^*)$, adds it to $\{(\beta_1, \omega_1), \dots, (\beta_J, \omega_J)\}$, and increments $J$ by one; a Death step removes some $(\beta_j, \omega_j)$ from $\{(\beta_1, \omega_1), \dots, (\beta_J, \omega_J)\}$ and decrements $J$ by one; an Update step replaces at least one point $(\beta_j, \omega_j)$ with a new value $(\beta_j^*, \omega_j^*)$. A Metropolis–Hastings algorithm (Gilks, Richardson and Spiegelhalter, 1996, §1.3.3) is used to sample the fixed-dimensional parameters. We now turn our attention to simulated and real examples to illustrate the performance of the LARK models in practice.
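Before turning to the examples, the Birth/Death/Update sweep just described can be summarized schematically. This is a structural sketch rather than the authors' implementation: the birth proposal $q$, the update kernel, and the move probabilities are placeholders, and the move- and point-selection factors in the acceptance ratio are noted in a comment rather than derived.

```python
import numpy as np

def rj_sweep(state, log_target, draw_q, log_q, update_one, rng):
    """One reversible-jump move on the variable-dimension part of the posterior.

    state      : dict with equal-length lists state["beta"], state["omega"]
    log_target : unnormalized log posterior of the full state, e.g. (35)
    draw_q     : rng -> (beta*, omega*), the birth proposal
    log_q      : (beta, omega) -> log birth-proposal density
    update_one : (beta_j, omega_j, rng) -> perturbed point (symmetric proposal assumed)
    """
    move = rng.choice(["birth", "death", "update"])
    J = len(state["beta"])
    prop = {k: list(v) for k, v in state.items()}

    if move == "birth":
        b, w = draw_q(rng)
        prop["beta"].append(b)
        prop["omega"].append(w)
        log_alpha = log_target(prop) - log_target(state) - log_q(b, w)
    elif move == "death" and J > 0:
        j = int(rng.integers(J))
        b, w = prop["beta"].pop(j), prop["omega"].pop(j)
        log_alpha = log_target(prop) - log_target(state) + log_q(b, w)
    elif move == "update" and J > 0:
        j = int(rng.integers(J))
        prop["beta"][j], prop["omega"][j] = update_one(state["beta"][j],
                                                       state["omega"][j], rng)
        log_alpha = log_target(prop) - log_target(state)
    else:
        return state  # nothing to remove or update when J == 0

    # NOTE: a complete acceptance ratio also includes move-selection and
    # point-selection probabilities (e.g. a 1/(J+1) term for Death moves),
    # which depend on the proposal design and are omitted here.
    if np.log(rng.uniform()) < log_alpha:
        return prop
    return state
```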
5 Examples and Illustrations
In this section we first use simulated data (where the ‘truth’ is known) to compare the performance of the LARK model with that of other nonparametric methods; we then present an application to the motorcycle crash test data of Schmidt, Mattern and Schüler (1981) to illustrate LARK's performance with unequally-spaced data.
Test Function    Kernel $g(x_i; \chi_j, \lambda_j)$
Blocks           1{0
Bumps
Doppler
Heavisine
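Given posterior draws of $(\beta_0, \{\beta_j, \omega_j\})$ from the sampler of Section 4.5, a fitted curve for any of these test functions follows from the representation (34a) by averaging over draws. A minimal sketch, with all names illustrative and `kernel(x, omega)` standing in for whichever $g$ the table assigns to the test function:

```python
import numpy as np

def posterior_mean_curve(x, samples, kernel):
    """Average f(x) = beta0 + sum_j beta_j * g(x; omega_j) over MCMC draws.
    `samples` is a sequence of dicts with keys "beta0", "beta", "omega";
    `kernel(x, omega)` evaluates the chosen g at every point of the grid x."""
    fhat = np.zeros_like(x, dtype=float)
    for s in samples:
        f = np.full_like(x, s["beta0"], dtype=float)
        for b, w in zip(s["beta"], s["omega"]):
            f = f + b * kernel(x, w)
        fhat += f
    return fhat / len(samples)

def mse(fhat, f_true):
    """Mean squared error against a known test function on the grid."""
    return float(np.mean((fhat - f_true) ** 2))
```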