Lévy Adaptive Regression Kernels

Chong Tu, Merlise Clyde, and Robert L. Wolpert∗

Department of Statistical Science, Duke University

Revised: August 3, 2007
Summary

This paper describes a new class of prior distributions for nonparametric function estimation. The unknown function is approximated as a weighted sum of kernel or generator functions with arbitrary location parameters. Scaling (and other) parameters for the generating functions are also modeled as location-specific and thus are adaptive, as with wavelet bases and overcomplete dictionaries. Lévy random fields are introduced to construct prior distributions on the unknown functions, leading to a joint prior distribution for the number of kernels, the kernel regression coefficients, and the kernel-specific parameters. Under Gaussian errors, the problem may be formulated as a sparse regression problem, with regularization induced via the Lévy random field prior. Posterior inference on the unknown functions is based on a reversible jump Markov chain Monte Carlo algorithm. We compare the Lévy Adaptive Regression Kernel (LARK) method to wavelet-based methods using some of the standard test functions. In all cases the LARK method leads to sparser solutions and improvements in mean squared error.
Key words: Bayes; Lévy random field; nonparametric regression; relevance vector machine; kernel regression; reversible jump Markov chain Monte Carlo; splines; support vector machine; wavelets.

∗Address for correspondence: Professor Robert L. Wolpert ([email protected]), Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA.
1 Introduction
Suppose we have $n$ noisy measurements $Y_1, \dots, Y_n$ of an unknown real-valued function $f : X \to \mathbb{R}$ on some complete separable metric space $X$,
\[
  Y_i = f(x_i) + e_i, \qquad e_i \overset{\text{iid}}{\sim} \operatorname{No}(0, \sigma^2), \tag{1}
\]
observed at points $\{x_i\}_{i \in I} \subset X$. In nonparametric regression models, the mean function $f(\cdot)$ is often regarded as an element of some Hilbert space $\mathcal{H}$ of real-valued functions on $X$, and is expressed as a linear combination of basis functions $\{g_j\} \subset \mathcal{H}$:
\[
  f(x_i) = \sum_{0 \le j} g_j(x_i)\,\beta_j. \tag{2}
\]

[…]

For the symmetric $\alpha$-stable (S$\alpha$S) LARK model, the Lévy measure $\nu(d\beta, d\omega)$ of (33) has index $0 < \alpha < 2$ and $\omega$-dependence of the simple form $\gamma(d\omega) = \gamma\, d\chi\, \pi_\lambda(d\lambda)$ for some $\gamma > 0$, proportional to a uniform measure in location $\chi$ and a specified probability measure $\pi_\lambda(d\lambda)$ on $\mathbb{R}_+$ governing scale. Truncating at $|\beta| \ge \epsilon\gamma^{1/\alpha}$ again disentangles $\gamma$'s roles of determining jump sizes and magnitudes, leading to
\[
  \nu_+^{\epsilon} \equiv \nu^{\epsilon}(\mathbb{R} \times \Omega)
  = \frac{2}{\pi}\,\Gamma(\alpha)\sin(\pi\alpha/2)\,\epsilon^{-\alpha}
\]
in general, or $\nu_+^{\epsilon} = 2/(\pi\epsilon)$ for the Cauchy case $\alpha = 1$. Again (32f) specifies that, given $J$ and $\gamma$, $\{\beta_j, \chi_j, \lambda_j\}$ are independent and identically distributed, with the same distributions as before for $\chi_j$ and $\lambda_j$, but now the jumps have symmetric Pareto distributions,
\[
  \pi_\beta(\beta_j) = \frac{\alpha\,|\beta_j|^{-\alpha-1}}{2\,\gamma^{-1}\epsilon^{-\alpha}}\,
  \mathbf{1}\bigl\{\beta_j \in \bigl[-\epsilon\gamma^{1/\alpha},\, \epsilon\gamma^{1/\alpha}\bigr]^{c}\bigr\}
\]
in general, or
\[
  \pi_\beta(\beta_j) = \frac{\gamma\epsilon}{2}\,\beta_j^{-2}\,\mathbf{1}\{|\beta_j| > \gamma\epsilon\}
\]
for the Cauchy case. Zolotarev's (1986) (M) parametrization employs the compensator $h(u) = \sin(u)$, so in principle (32g) must be changed slightly to
\[
  f(x_i) \equiv \beta_0 + \delta_\epsilon(x_i, \gamma) + \sum_{j=1}^{J} g(x_i, \omega_j)\,\beta_j, \tag{34a}
\]
where
\[
  \delta_\epsilon(x_i, \gamma) \equiv \iint_{\mathbb{R} \times \Omega} g(x_i, \omega)\,
  \bigl[\beta\,\mathbf{1}_{\{|\beta|^\alpha \le \gamma\epsilon^\alpha\}}(\beta) - \sin(\beta)\bigr]\,
  \nu(d\beta, d\omega), \tag{34b}
\]
but because $\nu(\beta, \omega)$ of (33) is an even function of $\beta$ while the bracketed term in (34b) is odd, $\delta_\epsilon(x_i, \gamma) \equiv 0$ for the S$\alpha$S LARK model. For asymmetric $\alpha$-stable models (for example, fully-skewed ones), the compensation adjustment of (34) is essential for convergence as $\epsilon \to 0$.
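To make the construction concrete, the following is a minimal sketch of drawing one prior realization under the truncated S$\alpha$S specification above: $J \sim \text{Poisson}(\nu_+^{\epsilon})$ kernels, symmetric Pareto jumps with minimum $\epsilon\gamma^{1/\alpha}$, and i.i.d. locations and scales. The choices $X = [0, 1]$ with uniform location measure, a Gamma $\pi_\lambda$, and a Gaussian kernel $g(x; \chi, \lambda) = e^{-\lambda(x-\chi)^2}$ are illustrative assumptions, not prescribed by the text.

```python
import numpy as np
from scipy.special import gamma as Gamma

def sample_lark_prior(alpha, gam, eps, rng, a_lam=2.0, b_lam=1.0):
    """Draw (beta, chi, lam) from the truncated SaS Levy measure on X = [0, 1]."""
    # Expected number of kernels: nu_+^eps = (2/pi) Gamma(alpha) sin(pi alpha/2) eps^{-alpha}
    nu_plus = (2.0 / np.pi) * Gamma(alpha) * np.sin(np.pi * alpha / 2) * eps ** (-alpha)
    J = rng.poisson(nu_plus)
    chi = rng.uniform(0.0, 1.0, size=J)          # locations, uniform on [0, 1]
    lam = rng.gamma(a_lam, 1.0 / b_lam, size=J)  # scales; Gamma pi_lambda is an assumption
    # Symmetric Pareto jumps: |beta| >= eps * gam^{1/alpha}, tail index alpha
    u = rng.uniform(size=J)
    beta = eps * gam ** (1.0 / alpha) * u ** (-1.0 / alpha)
    beta *= rng.choice([-1.0, 1.0], size=J)
    return beta, chi, lam

def f_draw(x, beta, chi, lam):
    """Evaluate f(x) = sum_j beta_j g(x; chi_j, lam_j) for an illustrative Gaussian kernel."""
    G = np.exp(-lam[None, :] * (x[:, None] - chi[None, :]) ** 2)
    return G @ beta

rng = np.random.default_rng(0)
beta, chi, lam = sample_lark_prior(alpha=1.0, gam=1.0, eps=0.1, rng=rng)
f = f_draw(np.linspace(0, 1, 200), beta, chi, lam)
```

Note that, as in the text, the truncated total mass $\nu_+^{\epsilon}$ is free of $\gamma$; only the jump magnitudes depend on it.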
4.5 Posterior Inference
The joint posterior density of all parameters under the symmetric Gamma LARK model of (32), given observations $\mathbf{Y} = \{Y_i\}$, is
\[
  p(\alpha, \tau, J, \sigma^2, \beta_0, \boldsymbol{\beta}, \boldsymbol{\omega} \mid \mathbf{Y})
  \;\propto\; \pi_\alpha(\alpha)\,\pi_\tau(\tau)\,
  \frac{\exp\bigl(-\nu_\epsilon(\mathbb{R} \times \Omega)\bigr)}{J!}\,
  \prod_{j=1}^{J} \nu_\epsilon(\beta_j, \omega_j)
  \;\times\; \sigma^{-n-2}
  \exp\Bigl\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\Bigl(Y_i - \beta_0 - \sum_{j} g(x_i, \omega_j)\,\beta_j\Bigr)^{2}\Bigr\}. \tag{35}
\]
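For reference, the right-hand side of (35) can be evaluated directly as an unnormalized log density. The following is a minimal sketch, assuming a Gaussian kernel for $g$ and treating $\pi_\alpha$, $\pi_\tau$, and $\nu_\epsilon$ as user-supplied functions whose concrete forms depend on the choices made in (32):

```python
import numpy as np
from math import lgamma

def log_posterior(Y, x, beta0, beta, chi, lam, sigma2,
                  log_nu_eps, nu_eps_total, log_prior_hyper):
    """Unnormalized log of (35).  `log_nu_eps(b, c, l)` returns log nu_eps(beta_j, omega_j)
    and `nu_eps_total` is nu_eps(R x Omega); both are assumed supplied.
    The Gaussian kernel below is an illustrative stand-in for g(x_i, omega_j)."""
    J = len(beta)
    G = np.exp(-lam[None, :] * (x[:, None] - chi[None, :]) ** 2)  # g(x_i; chi_j, lam_j)
    resid = Y - beta0 - G @ beta
    n = len(Y)
    lp = log_prior_hyper                    # log pi_alpha(alpha) + log pi_tau(tau)
    lp += -nu_eps_total - lgamma(J + 1)     # exp(-nu_eps(R x Omega)) / J!
    lp += sum(log_nu_eps(beta[j], chi[j], lam[j]) for j in range(J))
    lp += -0.5 * (n + 2) * np.log(sigma2) - resid @ resid / (2.0 * sigma2)
    return lp
```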
Only $\beta_0$ and $\sigma^2$ may be integrated out analytically; the posterior (and full conditional) distributions of the other parameters are not available in closed form. Since some of our parameters ($\boldsymbol{\beta}$ and $\boldsymbol{\omega}$) have varying dimension, some form of trans-dimensional Markov chain Monte Carlo, such as a reversible jump (RJ-MCMC) algorithm (Green, 1995; Wolpert, Ickstadt and Hansen, 2003; DiMatteo et al., 2001), must be used to draw samples from (35) for posterior inference. A typical RJ-MCMC procedure for sampling parameters of varying dimension involves (at least) three types of moves: Birth, Death, and Update. A Birth step generates a new point $(\beta^*, \omega^*)$, adds it to $\{(\beta_1, \omega_1), \dots, (\beta_J, \omega_J)\}$, and increments $J$ by one; a Death step removes some $(\beta_j, \omega_j)$ from $\{(\beta_1, \omega_1), \dots, (\beta_J, \omega_J)\}$ and decrements $J$ by one; an Update step replaces at least one point $(\beta_j, \omega_j)$ with a new value $(\beta_j^*, \omega_j^*)$. A Metropolis–Hastings algorithm (Gilks, Richardson and Spiegelhalter, 1996, §1.3.3) is used to sample the fixed-dimensional parameters. We now turn our attention to simulated and real examples to illustrate the performance of the LARK models in practice.
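Before turning to the examples, the Birth/Death/Update sweep just described can be summarized schematically. This is a structural sketch rather than the authors' implementation: the birth proposal $q$, the update kernel, and the move probabilities are placeholders, and the move- and point-selection factors in the acceptance ratio are noted in a comment rather than derived.

```python
import numpy as np

def rj_sweep(state, log_target, draw_q, log_q, update_one, rng):
    """One reversible-jump move on the variable-dimension part of the posterior.

    state      : dict with equal-length lists state["beta"], state["omega"]
    log_target : unnormalized log posterior of the full state, e.g. (35)
    draw_q     : rng -> (beta*, omega*), the birth proposal
    log_q      : (beta, omega) -> log birth-proposal density
    update_one : (beta_j, omega_j, rng) -> perturbed point (symmetric proposal assumed)
    """
    move = rng.choice(["birth", "death", "update"])
    J = len(state["beta"])
    prop = {k: list(v) for k, v in state.items()}

    if move == "birth":
        b, w = draw_q(rng)
        prop["beta"].append(b)
        prop["omega"].append(w)
        log_alpha = log_target(prop) - log_target(state) - log_q(b, w)
    elif move == "death" and J > 0:
        j = int(rng.integers(J))
        b, w = prop["beta"].pop(j), prop["omega"].pop(j)
        log_alpha = log_target(prop) - log_target(state) + log_q(b, w)
    elif move == "update" and J > 0:
        j = int(rng.integers(J))
        prop["beta"][j], prop["omega"][j] = update_one(state["beta"][j],
                                                       state["omega"][j], rng)
        log_alpha = log_target(prop) - log_target(state)
    else:
        return state  # nothing to remove or update when J == 0

    # NOTE: a complete acceptance ratio also includes move-selection and
    # point-selection probabilities (e.g. a 1/(J+1) term for Death moves),
    # which depend on the proposal design and are omitted here.
    if np.log(rng.uniform()) < log_alpha:
        return prop
    return state
```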
5 Examples and Illustrations
In this section we first use simulated data (where the ‘truth’ is known) to compare the performance of the LARK model with that of other nonparametric methods; we then present an application to the motorcycle crash test data of Schmidt, Mattern and Schüler (1981) to illustrate LARK's performance with unequally-spaced data.
Test Function    Kernel $g(x_i; \chi_j, \lambda_j)$
Blocks           1{0
Bumps
Doppler
Heavisine
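Given posterior draws of $(\beta_0, \{\beta_j, \omega_j\})$ from the sampler of Section 4.5, a fitted curve for any of these test functions follows from the representation (34a) by averaging over draws. A minimal sketch, with all names illustrative and `kernel(x, omega)` standing in for whichever $g$ the table assigns to the test function:

```python
import numpy as np

def posterior_mean_curve(x, samples, kernel):
    """Average f(x) = beta0 + sum_j beta_j * g(x; omega_j) over MCMC draws.
    `samples` is a sequence of dicts with keys "beta0", "beta", "omega";
    `kernel(x, omega)` evaluates the chosen g at every point of the grid x."""
    fhat = np.zeros_like(x, dtype=float)
    for s in samples:
        f = np.full_like(x, s["beta0"], dtype=float)
        for b, w in zip(s["beta"], s["omega"]):
            f = f + b * kernel(x, w)
        fhat += f
    return fhat / len(samples)

def mse(fhat, f_true):
    """Mean squared error against a known test function on the grid."""
    return float(np.mean((fhat - f_true) ** 2))
```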