Boosted Partial Least-Squares Regression

Jean-François Durand, Montpellier II University, France
E-mail: [email protected]
Web site: www.jf-durand-pls.com

Outline

I. Introduction
  – Machine Learning versus Data Mining
  – The data mining prediction process
  – Partial Least-Squares boosted by splines
II. Short introduction to splines
  – Few words on smoothing splines
  – Regression splines
    ∗ Two sets of basis functions
    ∗ Bivariate regression splines
    ∗ Least-Squares Splines
    ∗ Penalized Splines
III. Boosted PLS regression
  – What is L2 Boosting?
  – Ordinary PLS viewed as an L2 Boost algorithm
  – The linear PLS regression: algorithm and model
  – The building-model stage: choosing M
  – PLS Splines (PLSS): a main effects additive model
    ∗ The PLSS model
    ∗ Choosing the tuning parameters
    ∗ Example 1: Multi-collinearity and outliers, the orange juice data
    ∗ Example 2: The Fisher iris data revisited by PLS boosting
  – MAPLSS to capture interactions
    ∗ The ANOVA type model for main effects and interactions
    ∗ The building-model stage
    ∗ Example 3: Comparison between MAPLSS, MARS and BRUTO on simulated data
    ∗ Example 4: Multi-collinearity and bivariate interaction, the "chem" data
IV. References

I. Introduction

Machine Learning versus Data Mining

Machine learning: machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn". The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods. [http://www.wikipedia.org/]

Data mining: data mining has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases." [http://www.wikipedia.org/]

To be frank, I do not like the term "Machine Learning" very much. I prefer "Data Mining", which suggests helping humans to learn rather than machines... More concretely, many automatic black-box methods involve thresholds for decision rules whose values may be more or less well understood and controlled by the lazy user, who mostly accepts the default values proposed by the author. In the R package called PLSS, for Partial Least-Squares Splines, I tried to make the automatic part, inescapable when facing huge amounts of data and computation, easy to master through on-line conversational controls.

The data mining prediction process

The timing of the data mining prediction process to be followed with the PLSS functions can be split up into 3 steps:

1. Set up the aims of the problem and the associated schedule conditions.
2. The building-model phase: a 2.1-2.2 round trip until a validated model is obtained.
   2.1 Build an evolutionary training data base following the retained schedule conditions.
   2.2 Process the regression and validate (or not) the model built on the data at hand.
3. Elaborate a scenario of prediction. A scenario allows the user to conveniently enter new real or fictive observations to test the validated models.

Partial Least-Squares boosted by splines

Partial Least-Squares regression [19, S. Wold et al.], in short PLS, may be viewed as a repeated Least-Squares fit of residuals from regressions on latent variables that are linear compromises of the predictors and of maximum covariance with the responses. This method is presented here in the framework of L2 boosting methods [11, Hastie et al.], [8, J.H. Friedman], by considering PLS components as the base learners.

Historically very popular in chemistry and now in many scientific domains, Partial Least-Squares regression of responses Y on predictors X, PLS(X, Y), produces linear models. It has recently been extended to ANOVA style decomposition models called PLS Splines (PLSS) and Multivariate Additive PLS Splines (MAPLSS), [4], [5], [12]. The key point of this nonlinear approach was inspired by the book of A. Gifi [9], who, in exploratory data analysis methods, replaced the design matrix X by the super-coding matrix B obtained by transforming the variables through B-splines.

Let B_i be the coding of predictor i by B-splines. PLSS produces main effects additive models,

  PLSS(X, Y) ≡ PLS(B, Y)   with   B = [B_1 | ... | B_p],

while capturing main effects plus bivariate interactions leads to

  MAPLSS(X, Y) ≡ PLS(B, Y)   with   B = [B_1 | ... | B_p | ... | B_{i,i'} | ...],

B_{i,i'} being the tensor product of the splines for the two predictors i and i'.

The aim of this course is twofold: first, to detail the theory of this way of boosting PLS, which uses regression splines in the base learner; second, to present real and simulated examples treated with the free PLSS package available at http://www.jf-durand-pls.com.

II. Short introduction to splines

Few words on smoothing splines

Consider the signal plus noise model

  y_i = s(x_i) + ε_i,   i = 1, ..., n,   x_1 < ... < x_n ∈ [0, 1],
  (ε_1, ..., ε_n)' ~ N(0, σ² I_{n×n}),   σ² unknown,

with s ∈ W_2^m[0, 1], where

  W_2^m[0, 1] = { s : s^{(l)}, l = 1, ..., m − 1, are absolutely continuous and s^{(m)} ∈ L²[0, 1] }.

The smoothing spline estimator of s is

  s_λ = argmin_{μ ∈ W_2^m[0,1]}  (1/n) Σ_{i=1}^n (y_i − μ(x_i))² + λ ∫_0^1 [μ^{(m)}(t)]² dt,   λ > 0,

usually with m = 2, so that the penalty controls curvature and overfitting. The solution s_λ is unique and belongs to the space of univariate natural spline functions of degree 2m − 1 (usually 3) with knots at the distinct data points x_1 < ... < x_n.
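As a quick illustration (not part of the original slides), the sketch below fits a cubic smoothing spline with the base R function smooth.spline(), which solves this penalized criterion for m = 2 and can select λ by generalized cross-validation; the simulated data are only an example.

  # Smoothing spline on noisy data; lambda chosen by GCV (cv = FALSE).
  set.seed(1)
  n <- 100
  x <- sort(runif(n))
  y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
  fit <- smooth.spline(x, y, cv = FALSE)      # GCV choice of the smoothing parameter
  plot(x, y, col = "grey")
  lines(predict(fit, seq(0, 1, length.out = 200)), col = "red", lwd = 2)
  fit$lambda                                  # the selected smoothing parameter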

Now it is time to say what splines are! To transform a continuous variable x whose values range within [a, b], a spline function s is made of adjacent polynomials of degree d that join end to end at points called "the knots", with continuity conditions on the derivatives [1, De Boor].

[Figure: a spline s(x) of degree 2 with knots (2, 3) on [1, 5].]

Two kinds of splines, which mainly differ by the way they are used:
• Smoothing splines: the degree is fixed to 3 and the knots are located at the distinct data points; the tuning parameter λ is a positive number that controls the smoothness.
• Regression splines: few knots {τ_j}_j, whose number and location constitute, together with the degree, the tuning parameters. The splines are computed according to a regression model.

Regression splines

A spline belongs to a functional linear space S(m, {τ_{m+1}, ..., τ_{m+K}}, [a, b]) of dimension m + K, characterized by three tuning parameters:
- the degree d, or the order m = d + 1, of the polynomials,
- the number K of knots, and
- the location of the knots {τ_{m+1}, ..., τ_{m+K}}, with

  τ_1 = ... = τ_m = a < τ_{m+1} ≤ ... ≤ τ_{m+K} < b = τ_{m+K+1} = ... = τ_{2m+K}.

A spline s ∈ S(m, {τ_{m+1}, ..., τ_{m+K}}, [a, b]) can be written

  s(x) = Σ_{i=1}^{m+K} β_i B_i^m(x),

where {B_i^m(.)}_{i=1,...,m+K} is a basis of spline functions. The vector β of coordinate values is to be estimated by a regression method.

Two sets of basis functions

• The truncated power functions

  x ↦ (x − τ)_+^d

[Figure: truncated power functions at knot τ = 2 for degrees d = 0, 1, 2, 3.]

When the knots are distinct, a basis of S(m, {τ_{m+1}, ..., τ_{m+K}}, [a, b]) is given by

  1, x, ..., x^d, (x − τ_{m+1})_+^d, ..., (x − τ_{m+K})_+^d.

Notice that, when K = 0, S(m, ∅, [a, b]) is the set of polynomials of order m on [a, b].

• The B-splines

B-splines of degree d (order m = d + 1): for j = 1, ..., m + K,

  B_j^m(x) = (−1)^m (τ_{j+m} − τ_j) [τ_j, ..., τ_{j+m}] (x − τ)_+^d,

where [τ_j, ..., τ_{j+m}] (x − τ)_+^d is the divided difference of order m computed at τ_j, ..., τ_{j+m} for the function τ ↦ (x − τ)_+^d.

This basis is the most popular, partly thanks to the next property, which allows the values of the B-splines to be computed recursively [1, De Boor]:

  B_j^1(x) = 1 if τ_j ≤ x ≤ τ_{j+1}, 0 otherwise,

and, for k = 2, ..., m,

  B_j^k(x) = (x − τ_j)/(τ_{j+k−1} − τ_j) B_j^{k−1}(x) + (τ_{j+k} − x)/(τ_{j+k} − τ_{j+1}) B_{j+1}^{k−1}(x).

Many statistical packages implement these formulae to compute the m + K values of the B-splines on an x sample. In R:

  library(splines)
  x <- seq(1, 2 * pi, length.out = 100)
  B <- bs(x, degree = 2, knots = c(pi / 2, 3 * pi / 2), intercept = TRUE)
  # What are the dimensions of the matrix B?
  # (n rows and m + K = 3 + 2 = 5 columns, i.e. dim(B) = 100 x 5.)

The attractive B-splines family for coding data

[Figure: B-spline bases for S(2, ∅, [0, 5]) (a), S(3, ∅, [0, 5]) (b) and S(1, {1, 2, 4}, [0, 5]).]

[Figure: B-spline bases for S(2, {1, 2, 4}, [0, 5]) and S(3, {1, 2, 4}, [0, 5]).]

• Local support: B_i^m(x) = 0 for all x ∉ [τ_i, τ_{i+m}].
  ♥ One observation x_i has a local influence on s(x_i) that depends only on the m basis functions whose supports encompass this data point.
  ♠ The counterpart is that s(x) = 0 outside [a, b].

• Fuzzy coding functions:

  0 ≤ B_i^m(x) ≤ 1  (1)   and   Σ_{i=1}^{m+K} B_i^m(x) = 1  (2).

  B_i^m(x) measures the degree of membership of x to [τ_i, τ_{i+m}].
  ♥ The set {[τ_i, τ_{i+m}] | i = 1, ..., m + K} is a fuzzy partition of [a, b].

• The multiplicity of knots controls the smoothness: the multiplicity of a knot is the number of knots that merge at the same point. It may vary from 1, a simple knot, to m, a multiple knot of order m. Let m_i be the multiplicity of τ_i, 1 ≤ m_i ≤ m; then the first m − 1 − m_i right and left derivatives are equal,

  s_−^{(j)}(τ_i) = s_+^{(j)}(τ_i),   j ≤ m − 1 − m_i.

  m_i = m ⇒ discontinuity at τ_i;   m_i = 1 ⇒ locally C^{m−2} at τ_i.

• Coding a variable x through B-splines

Due to (2), two B-spline bases are generally used:

  {B_j^m(x) | j = 1, ..., m + K}        the usual basis,
  {1, B_j^m(x) | j = 2, ..., m + K}     the modified basis.

Let X = (x_1, ..., x_n)' be an n-sample of the variable x, and denote by

  B = [B^1(X) ... B^{m+K}(X)]   or   B = [B^2(X) ... B^{m+K}(X)]

the complete n × (m + K), or incomplete n × (d + K), coding matrix of the sample. Notice that d = 0 provides a binary coding matrix B.

D-centering the coding matrix: when the columns of B are centered, then rank(B) ≤ min(n − 1, d + K).

Bivariate regression splines for (x, z)

Let {1, B_1^j(x) | j ∈ I_1} and {1, B_2^j(z) | j ∈ I_2} be the univariate bases; then

  s(x, z) ∈ span[ {1, B_1^j(x) | j ∈ I_1} ⊗ {1, B_2^j(z) | j ∈ I_2} ].

A bivariate regression spline splits into the ANOVA decomposition

  s(x, z) = β_1 + Σ_{j∈I_1} β_j^1 B_1^j(x) + Σ_{j∈I_2} β_j^2 B_2^j(z) + Σ_{i∈I_1} Σ_{j∈I_2} β_{i,j}^{1,2} B_1^i(x) B_2^j(z),

with main effects s^1 and s^2 in x and z,

  s^1(x) = Σ_{j∈I_1} β_j^1 B_1^j(x),   s^2(z) = Σ_{j∈I_2} β_j^2 B_2^j(z),

and interaction part

  s^{12}(x, z) = Σ_{i∈I_1} Σ_{j∈I_2} β_{i,j}^{1,2} B_1^i(x) B_2^j(z).

How to measure the importance of an ANOVA term? By the range of the transformation (when the variables are standardized).

B = [B_1 | B_2 | B_{1,2}] is the column-centered coding matrix built from X and Z.

Curse of dimensionality: the column dimension expands, e.g. ncol(B_1) = 10, ncol(B_2) = 10 ⇒ ncol(B_{1,2}) = 100.

The Least-Squares Splines (LSS) [14, Stone]

Denote by X, n × p, and Y, n × q, the sample matrices for the p predictors and the q responses (centered). The centered coding matrix of X, B = [B_1 | ... | B_p], leads to q separate additive spline models, j = 1, ..., q,

  ŷ^j = s^{j,1}(x^1) + ... + s^{j,p}(x^p),

through

  LSS(X, Y) ≡ OLS(B, Y)  ⇔  Ŷ = HY = B β̂ = Σ_i B_i β̂_i,

where H = B(B'B)^{-1}B' is the so-called linear smoother.

Tuning parameters: the spline spaces used for the predictors.

LSS drawbacks: numerical instability of (B'B)^{-1}, when it exists!
  ♠ needs a large ratio observations/(column dimension of B)
  ♠* very sensitive to the knots' location
  ♠ perturbing concurvity effects due to correlated predictors
  ♠** no interaction terms involved

♥** MARS [7, Friedman] proposes a remedy to ♠**: an automatic forward/backward procedure to select knots and high-order interactions through linear truncated power functions.
♥* EBOK [13, Molinari et al.]: K fixed, optimal location of the knots.
◦ others...

The Penalized Least-Squares Splines [6, P. Eilers and B. Marx]

Because of the drawbacks cited above, L-S Splines are mostly efficient in the one-dimensional context (p = 1). In that univariate case, to remedy ♠*, [6] proposed to use a large number of equidistant knots and to penalize the L2 cost with a penalty (λ) on the k-th finite differences of the coefficients of adjacent B-splines:

  S = Σ_{i=1}^n ( y_i − Σ_{j=1}^r β_j B_j(x_i) )² + λ Σ_{j=k+1}^r (Δ^k β_j)².

The linear smoother associated with the P-spline model becomes

  H = B (B'B + λ D_k' D_k)^{-1} B'.

Notice that λ = 0 leads to LSS and k = 0 to ridge regression.

Default: k = 2 (strong connection with second derivatives). Example, r = ncol(B) = 5:

  D_2 = [ 1 −2  1  0  0 ]
        [ 0  1 −2  1  0 ]
        [ 0  0  1 −2  1 ]

Following [10, T. Hastie and R. Tibshirani], trace(H) is the effective dimension of the smoother.

The Generalized Cross-Validation index

  GCV(λ, k) = var(residuals) / (1 − trace(H)/n)²

is therefore a surrogate for the cross-validation Predictive Error Sum of Squares (PRESS).

One example: (x_i, y_i)_{i=1,...,100}, y_i = sin(x_i) + ε_i, x_i ~ U[0, 2π], ε_i ~ N(0, 0.5).

[Figure: Least-Squares splines of degree 2 (GCV = 0.2768) and degree 3 (GCV = 0.3433). L-S splines (red), signal (dotted); vertical lines indicate the location of the knots.]

[Figure: P-splines of degree 3 (red) with second-order differences and λ ∈ {0.2, 1.5, 3, 30}; GCV = 0.2705, 0.2969, 0.3252 and 0.4267 respectively.]

III. Boosted PLS regression

What is L2 Boosting? [17, Tukey] ("Twicing"), [11, Hastie et al.], [8, Friedman], [2, Bühlmann and Bin Yu]

The training sample is {y_i, x_i}_{i=1}^n, y ∈ IR, x ∈ IR^p, centered. The goal is to estimate the function F*, restricted to be of "additive" type,

  F(x; {α_m, θ_m}_1^M) = Σ_{m=1}^M α_m h(x, θ_m),

that minimizes the expected L2 cost IE[C(y, F(x))], with C(y, F) = (y − F)²/2.

• The base learner h(x, θ) is a function of the input variables x characterized by parameters θ = {θ_1, θ_2, ...}: neural nets, wavelets, splines, regression trees... but also latent variables,

  θ ∈ IR^p,   t = h(x, θ) = <θ, x>.

• M is the "dimension" of the additive model.

Boosting is a stagewise functional gradient descent method with respect to F.

L2 boost algorithm:

  F_0(x) = 0
  For m = 1 to M do
    ỹ_i = −∂C(y_i, F)/∂F |_{F = F_{m−1}(x_i)} = y_i − F_{m−1}(x_i),   i = 1, ..., n
    (α_m, θ_m) = argmin_{α,θ} Σ_{i=1}^n [ỹ_i − α h(x_i; θ)]²        (*)
    F_m(x) = F_{m−1}(x) + α_m h(x; θ_m)
  endFor

Extended L2 boost algorithm: replace (*) by
  (*1) a criterion or procedure to construct θ_m from {(x_i, y_i)}_{i=1,...,n},
  (*2) α_m = argmin_α Σ_{i=1}^n [ỹ_i − α h(x_i; θ_m)]².

To summarize, L2 boosting is characterized by
• choosing the type of the base learner,
• repeated least-squares fitting of residuals,
• fitting the data by a linear combination of the base learners.

Ordinary PLS as an L2 Boost algorithm

The context of PLS(X, Y):
• X_{n×p} for the p predictors, Y_{n×q} for the q responses.
• D = diag(p_1, ..., p_n) = n^{-1} I_n, the weights of the observations. All variables are centered (standardized) with respect to D, so that cov(x, y) = <x, y>_D = y'Dx and var(x) = ||x||_D².

The algorithm [18, H. Wold], [19, S. Wold et al.]: PLS constructs components {t^m}_{m=1,...,M} as linear compromises of X, on which the LS residuals are repeatedly regressed.

  Initialization:                X_{(0)} = X,   Y_{(0)} = Y
  Step m, for m = 1, ..., M:
    (*1) base learner:           t = X_{(m−1)} w,   u = Y v,
                                 (w^m, v^m) = argmax_{w'w = 1 = v'v} cov(t, u),
                                 t^m = X_{(m−1)} w^m ∈ span(X_{(m−1)}),   u^m = Y v^m
    update X (deflation):        X_{(m)} = X_{(m−1)} − H_m X_{(m−1)}
    (*2) Y-residuals:            Y_{(m)} = Y_{(m−1)} − H_m Y_{(m−1)}

where the so-called linear PLS learner

  H_m = Π^D_{t^m} = t^m t^{m'} D / ||t^m||_D²

is the n × n D-orthogonal Least-Squares projector onto t^m. Notice that the projector H_m also depends on Y, since t^m is of maximum covariance with a Y-compromise.

To solve the optimization problem (*1) with its two constraints, the Lagrange multipliers technique leads to

Proposition: the PLS base learner problem is solved by the first term in the singular value decomposition of the p × q covariance matrix X'_{(m−1)} D Y. Let (λ_m, w^m, v^m) be the triple corresponding to the largest (first) singular value λ_m; then

  t^m = X_{(m−1)} w^m,   u^m = Y v^m,   λ_m = cov(t^m, u^m).

In the one-response case, v^m = 1 and w^m = X'_{(m−1)} D Y / ||X'_{(m−1)} D Y||.
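A compact R sketch of this boosted view of PLS, under the slide's conventions (D = I/n, centered data): the weights come from the first singular triple of X'_{(m−1)} D Y and both blocks are deflated by the projector H_m. This is an illustration, not the PLSS package; the function name pls_boost is made up here.

  pls_boost <- function(X, Y, M) {
    n <- nrow(X); D <- diag(n) / n
    X <- scale(X, scale = FALSE); Y <- as.matrix(scale(as.matrix(Y), scale = FALSE))
    Xm <- X; Ym <- Y; Tcomp <- matrix(0, n, M)
    for (m in seq_len(M)) {
      s  <- svd(t(Xm) %*% D %*% Y)                          # first triple solves (*1)
      tm <- Xm %*% s$u[, 1]
      Hm <- tm %*% t(tm) %*% D / drop(t(tm) %*% D %*% tm)   # D-orthogonal projector onto t^m
      Xm <- Xm - Hm %*% Xm                                  # deflation of X
      Ym <- Ym - Hm %*% Ym                                  # Y-residuals
      Tcomp[, m] <- tm
    }
    list(components = Tcomp, Yhat = Y - Ym)                 # Yhat = (H_1 + ... + H_M) Y
  }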

Components {t^m}, with t^m = X θ^m:

• belong to span(X);
• are mutually D-orthogonal, that is, t^{m_1'} D t^{m_2} = 0 for m_1 ≠ m_2;
• partial regressions of the pseudo variables give the same results as LS regressions of the original variables on the components:

  X̂_m = H_m X_{(m−1)} = H_m X = t^m p^{m'},
  Ŷ_m = H_m Y_{(m−1)} = H_m Y = t^m α^{m'}.

[Figure: in (IR^n, D), the LS regressions of both the pseudo response Y^i_{(m−1)} and the natural response Y^i on the component t^m give the same regression line, with slope α^i_m.]

The linear model with respect to {t^1, ..., t^M}

Denote T^M = [t^1 ... t^M]_{n×M}. After M steps, the fit from both the X and Y sides is

  X̂(M) = Σ_{m=1}^M X̂_m = Π^D_{T^M} X = (H_1 + ... + H_M) X,
  Ŷ(M) = Σ_{m=1}^M Ŷ_m = Π^D_{T^M} Y = (H_1 + ... + H_M) Y,        (1)

so that the set {H_m}_1^M tries to sequentially reconstruct X,

  X = X̂(M) + X_{(M)},

and provides a typical L2-Boosting additive model (1) whose base learners are the PLS components,

  Y = t^1 α^{1'} + ... + t^M α^{M'} + Y_{(M)}.

If M = rank(X), then span(T^M) = span(X), X_{(M)} = 0 and PLS(X, Y) ≡ OLS(X, Y), thus providing an upper bound for the number of components: 1 ≤ M ≤ rank(X).

PCA of X is the "self"-PLS regression of X onto itself: PLS(X, Y = X) ≡ PCA(X).

The linear model with respect to X

Recall that a component is a linear combination of the natural predictors, t^m = X θ^m. The set {θ^m}_m provides a (V = X'DX)-orthogonal basis of span(X'):

  <t^{m_1}, t^{m_2}>_D = <θ^{m_1}, θ^{m_2}>_V = 0.

The use of {θ^m}_m is twofold:
• V-project the X-samples on {θ^m} and look at 2-D scatterplots of the X observations.
• Stepwise build the linear PLS model with respect to the natural predictors,

  Ŷ(M) = (H_1 + ... + H_M) Y = X β(M),        (2)

where β(M) is the p × q matrix of the linear model, recursively computed as

  β(0) = 0,   β(m) = β(m − 1) + θ^m θ^{m'} X'DY / ||θ^m||_V².

The building-model stage: choosing M

• Cross-Validation criterion: PRESS, for m = 1, ..., rank(X); prop = proportion of samples left out, default 0.1. When prop is such that one observation is left out at a time,

  PRESS(prop, m) = (1/n) Σ_{i=1}^n Σ_{j=1}^q ( Y_i^j − X_i β̂(m)^j |_{(−i)} )².

[Figure: PRESS against the model dimension; optimal dimension 3, y PRESS = 0.03 (1 out).]
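A generic leave-one-out PRESS sketch (illustrative only; fit_fun and pred_fun are placeholders for any fit/predict pair, for example a PLS fit at a given dimension m — the PLSS package computes PRESS internally).

  press_loo <- function(X, y, fit_fun, pred_fun) {
    n <- nrow(X); err <- numeric(n)
    for (i in seq_len(n)) {
      f <- fit_fun(X[-i, , drop = FALSE], y[-i])          # refit without observation i
      err[i] <- (y[i] - pred_fun(f, X[i, , drop = FALSE]))^2
    }
    mean(err)                                             # PRESS with one observation out
  }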

• A surrogate to CV: Generalized Cross-Validation (GCV), for m = 1, ..., rank(X); α = penalty term (default 1),

  GCV(α, m) = Σ_{j=1}^q (1/n) ||Y^j − Ŷ(m)^j||² / [1 − α m/n]².

Calibration of the penalty α: GCV(α = 1.7, M = 3) = 0.29.

[Figure: GCV against the model dimension for α = 1, 1.5 and 1.7, with mean, mean + 2 sdv and mean − 2 sdv curves.]

PLS Splines (PLSS): a main effects additive model [4, J.F. Durand]

The PLSS model: the centered coding matrix of X being B = [B_1 | ... | B_p], PLS through splines (PLSS) is defined as

  PLSS(X, Y) ≡ PLS(B, Y),   t = h(x, θ) = Σ_{i=1}^p Σ_{k=1}^{r_i} θ_k^i B_k^i(x^i) = Σ_{i=1}^p h^i(x^i).

- Tuning parameters: the spline spaces for each predictor; the model dimension M ↦ cross-validation.
- The PLSS additive model: for j = 1, ..., q,

  ŷ_M^j = s_M^{j,1}(x^1) + ... + s_M^{j,p}(x^p).

♥ if M = rank(B), then PLSS(X, Y) = LSS(X, Y), if the latter exists
♥ PLSS(X, Y = B) ≡ PLS(B, B) = NL-PCA(X), [9, Gifi]
♥ efficient with a low ratio observations/(column dimension of B)
♥ efficient in the multi-collinear context for predictors (concurvity)
♥ robust against extreme values of the predictors (local polynomials)
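Putting the pieces together, a minimal sketch of PLSS(X, Y) ≡ PLS(B, Y): code each predictor with centered B-splines, concatenate the blocks into B, and run a PLS routine on (B, Y). Here pls_boost refers to the illustrative sketch given earlier and the knot choice is arbitrary; this is not the PLSS package.

  library(splines)
  make_B <- function(X, degree = 2, n_knots = 3) {
    blocks <- lapply(seq_len(ncol(X)), function(i) {
      xi <- X[, i]
      # interior knots at equally spaced quantiles (an arbitrary illustrative choice)
      knots <- quantile(xi, probs = seq(0, 1, length.out = n_knots + 2))[-c(1, n_knots + 2)]
      scale(bs(xi, degree = degree, knots = knots, intercept = TRUE), scale = FALSE)
    })
    do.call(cbind, blocks)                               # B = [B_1 | ... | B_p], centered
  }
  # B <- make_B(X); fit <- pls_boost(B, Y, M = 2)        # PLSS(X, Y) = PLS(B, Y)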

♠ no automatic procedure for choosing the spline parameters
♠ no interaction terms involved

Choosing the tuning parameters

1. For each predictor: degree, number and location of the knots.
2. The number of PLS components ↦ cross-validation.

1. Two strategies for choosing the spline spaces
• The ascending strategy
  – First, take d = 1 with no knots (K = 0) ↦ linear model.
  – Increase the degree d, keeping K = 0 ↦ polynomial model.
  – For fixed d, add knots ↦ local polynomial model: "adding a knot increases the local flexibility of the spline and thus the freedom to fit the data in this area."
• The descending strategy
  – First, take a high degree, d = 3, and more knots than necessary.
  – Remove superfluous knots and decrease the degree as much as possible.
2. To stop a strategy: find a balance between thriftiness (M and the total spline dimension) and goodness-of-prediction, PRESS and/or GCV.

Example 1: Multi-collinearity, nonlinearity and outliers, the orange juice data

- (x^1, ..., x^p): p predictors, X = [X^1 | ... | X^p], n × p.
- (y^1, ..., y^q): q responses, Y = [Y^1 | ... | Y^q], n × q.

Both sample matrices are standardized. Here, n = 24 orange juices (A, B, C, ..., X), p = 10, q = 1.

PREDICTORS: COND (conductivity), SiO2, Na, K, Ca, Mg, Cl, HCO3, SO4, Sum.
Sensorial RESPONSE: Heavy.
Collinearity: SUM = SiO2 + ... + SO4.

[Figure: biplots between Heavy and the predictors, smoothed with 'lowess'.]

Linear PLS on the orange juice data

[Figure: PRESS against the model dimension; optimal dimension 2, Heavy PRESS = 0.32 (2 out).]

R²(1) = 0.754,   PRESS(0.1, 1) = 0.32.   Retained dimension: M = 1.

Predictors' influence on Heavy (dim 1)

[Figure: linear transformations of the predictors, each panel headed by its coefficient: COND 0.182, Mg 0.181, Ca 0.172, SUM 0.167, SO4 0.16, HCO3 0.063, Na 0.004, Cl −0.003, K −0.007, SiO2 −0.047.]

PLSS on the orange juice data

Degree 2 for all predictors. Table of selected knots for the predictors:

  COND: 400, 1600
  SiO2: 10, 20, 40
  Na:   10, 30, 40
  K:    2.5, 5
  Ca:   160, 400
  Mg:   40, 110
  Cl:   4, 11
  SO4:  400, 1700
  HCO3: 100, 300, 500
  Sum:  600, 2600

[Figure: PRESS against the model dimension; optimal dimension 2, Heavy PRESS = 0.154 (2 out).]

R²(2) = 0.917,   PRESS(0.1, 2) = 0.154.

Predictors' influence on Heavy (2 dim.)

[Figure: estimated spline transformations of the ten predictors (COND, SiO2, Na, K, Ca, Mg, Cl, SO4, HCO3, SUM).]

Nonlinear 2-D component scatterplots

[Figure: (t1, t2) scatterplot of the 24 orange juices.]

A PLSS component t is additively decomposed into ANOVA terms,

  t = Bθ = Σ_{i=1}^p B_i θ_i = Σ_{i=1}^p h_i(X^i),

where θ_i is the sub-vector of θ associated with the block B_i, thus allowing t to be interpreted through the predictors. Because r(t1, Y) = 0.924, h_i^1(X^i) ≈ s_M^i(X^i), and one can use the Heavy ANOVA plots to explain the t1 coordinates.

Example 2: The Fisher iris data revisited by PLS boosting

A multi-response PLS discriminant analysis context, [15, Sjöström et al.], [16, Tenenhaus]:
p = 4 predictors: sepal width and length, petal width and length;
q = 3 responses: the binary indicators of the 3 iris species, setosa (s), versicolor (c), virginica (v);
n = 150 iris specimens (50 per species).

Linear PLS: the (t1, t2) scatterplot and the table of PLS diagnostics.

[Figure: (t1, t2) scatterplot of the 150 specimens, labelled s, c, v.]

  Table of diagnostics   s true   c true   v true
  s predicted                49        0        0
  c predicted                 1       42       13
  v predicted                 0        8       37

PRESS(3) = 1.379, 10% out.

The splines: degree 2 and 3 equally spaced knots. No relevant interaction was detected, and the pure main effects model provides efficient diagnostics.

Boosted PLS (PLSS): the (t1, t2) scatterplot and the table of PLSS diagnostics.

[Figure: (t1, t2) scatterplot of the 150 specimens, labelled s, c, v.]

  Table of diagnostics   s true   c true   v true
  s predicted                50        0        0
  c predicted                 0       48        2
  v predicted                 0        2       48

PRESS(4) = 0.3551, 10% out.

How to interpret a latent variable with respect to the predictors?

Predictors' influence on t1

[Figure: spline transformations of Petal.Length, Petal.Width, Sepal.Length and Sepal.Width. The spline functions are ordered, from left to right and top to bottom, according to the range of their vertical values. Data are coded setosa (1), versicolor (2), virginica (3); dashed vertical lines indicate the location of the knots.]

Predictors' influence on t2

[Figure: spline transformations of Petal.Length, Petal.Width, Sepal.Length and Sepal.Width, ordered by the range of their vertical values.]

To summarize: Petal Length and Width are the relevant factors for discrimination, as well as Sepal Length to a lesser degree.

Multivariate Additive PLS Splines (MAPLSS): capture of interactions [5, Durand & Lombardo], [12, Lombardo, Durand, De Veaux]

The ANOVA type model for main effects and interactions

The centered main effects + interactions coding matrix is

  B = [B_1 | ... | B_p | ... | B_{i,i'} | ...],

where the couples (i, i') belong to the set I of accepted interactions, and

  MAPLSS(X, Y) ≡ PLS(B, Y),

  t = h(x, θ) = Σ_{i=1}^p Σ_{k=1}^{r_i} θ_k^i B_k^i(x^i) + Σ_{{i,i'}∈I} [ Σ_{k=1}^{r_i} Σ_{l=1}^{r_{i'}} θ_{k,l}^{i,i'} B_k^i(x^i) B_l^{i'}(x^{i'}) ].

The q simultaneous models are cast in the ANOVA decomposition: for j = 1, ..., q,

  ŷ_M^j = Σ_{i=1}^p s_M^{j,i}(x^i) + Σ_{(i,i')∈I} s_M^{j,ii'}(x^i, x^{i'}).

PLSS with interactions shares the preceding ♥ properties.
♠ no automatic procedure for choosing the spline parameters

The building-model stage

Inputs: threshold = 20%, M_max = maximum dimension to explore.

Phase 0 (preliminary, mandatory): the main effects model. Decide on the spline parameters as well as on M_m for the main effects model (m); denote GCV_m(M_m) and R²_m(M_m).

Phase 1: individual evaluation of all interactions. Each interaction i is separately added to the main effects:

  crit(M_i) = max_{M ∈ {1, ..., M_max}} [ (R²_{m+i}(M) − R²_m(M_m)) / R²_m(M_m) + (GCV_m(M_m) − GCV_{m+i}(M)) / GCV_m(M_m) ].

Eliminate the interactions i such that crit(M_i) < 0 and order the accepted candidate interactions decreasingly.

Phase 2: add interactions successively to the main effects (forward phase). Set GCV_0 = GCV_m(M_m) and i = 0.
  REPEAT
    i ← i + 1
    add interaction i and index the new model with i
    GCV_i ← optimal GCV among all dimensions
  UNTIL (GCV_i ≥ GCV_{i−1} − threshold * GCV_{i−1}), i.e. stop when the relative GCV gain falls below the threshold.

Phase 3: prune the ANOVA terms of lowest influence (backward phase); criterion: the range of the values of the ANOVA function.

Example 3: Comparison between MAPLSS, MARS and BRUTO on simulated data

BRUTO [10, Hastie & Tibshirani]: a multi-response additive model fitted by adaptive back-fitting using smoothing splines.

MARS [7, Friedman]: Multivariate Adaptive Regression Splines. A two-phase process to build ANOVA style decomposition models by fitting truncated power basis functions. Variables, knots and interactions are optimized simultaneously by using a GCV criterion.

Campaign of simulations: three signals (pure additive, one and two interactions respectively),

  f1(x) = 0.1 exp(4 x1) + 4 / (1 + exp(−20 (x2 − 0.5))) + 3 x3 + 2 x4 + x5 + 0 · Σ_{i=6}^{10} xi,
  f2(x) = 10 sin(π x1 x2) + 20 (x3 − 0.5)² + 10 x4 + 5 x5 + 0 · Σ_{i=6}^{10} xi,
  f3(x) = 10 sin(π x1 x2) + 20 (x3 − 0.5)² + 20 x4 x5 + 0 · Σ_{i=6}^{10} xi,

none of which depends on the last five variables.

The model associated with the signal f_j is

  y_i = f_j(x_i) + ε_i,   i = 1, ..., n,   x_i ~ U[0, 1]^{10},   ε_i ~ N(0, 1).

100 training data sets (x_i, y_i)_{i=1,...,n} were generated for each combination of n = 50, 100, 200 and j = 1, 2, 3. For each of the 900 data sets, the goodness-of-prediction of BRUTO, MARS and MAPLSS was measured on a new test set (n_test = n) by computing the Mean Squared Error

  MSE = n_test^{-1} Σ_{i=1}^{n_test} (f_j(x_i) − ŷ_i)².

These data sets were used not only to compare the domains of efficiency of the three methods, but also to calibrate the MAPLSS tuning parameter allowing an interaction to be accepted or not: threshold = 20%.


Example 4: Multi-collinearity and bivariate interaction, the "chem" data [3, De Veaux et al.]

61 observations from a proprietary polymerization process: 10 explanatory variables, x1, ..., x10 (inputs), and 1 response, y (output). Due to the proprietary nature of the process, no more information could be disclosed.

[Figure: image of the correlations between predictors and response.]

cor(x5, x7) = 1, cor(x1, x2) = −0.95, cor(x6, x9) = −0.82.

[Figure: {(xi, y)}_i plots of the chem data smoothed with L-S splines.]

Towards a final model...

Phase 0: build a pure main effects model (PLSS).

Phase 1: select the candidate interactions in decreasing order.

[Figure: bar chart of TOTCRIT = R2CRIT + GCV(1, .)CRIT for the candidate interactions x4*x10, x5*x10, x4*x8, x5*x6, x5*x8, x5*x9, x1*x5, x2*x4, x2*x9, x3*x8, x3*x4.]

Phase 2: add interactions to the main effects model.

  candidate 1: x4*x10, ACCEPTED at 20% relative GCV gain
    i = 4, j = 10,   M = 9,   GCV = 0.04820853,   relative GCV gain = 89.22%

  candidate 2: x6*x10, NOT ACCEPTED at 20% relative GCV gain
    i = 6, j = 10,   M = 9,   GCV = 0.04757621,   relative GCV gain = 1.31%

  Continue anyway exploring (y/n)? 1: n

Phase 3: evaluate PRESS(0.02, 7) = 0.0584 and the ANOVA terms. Range of the transformed predictors in descending order:

  x4*x10  x10     x8      x5      x7      x2      x6     x4      x9     x1      x3
  3.8238  1.7266  0.6202  0.5647  0.4616  0.4163  0.266  0.2106  0.188  0.1457  0.0762

Phase 4: prune low ANOVA terms. Final PRESS(0.02, 6) = 0.0569.

  x4*x10   x8       x2       x5       x7
  5.11868  0.55858  0.55846  0.53523  0.42331

Leave-one-out predicted versus observed y samples.

[Figure: predicted versus observed y for linear PLS (PRESS = 0.5709), PLSS (PRESS = 0.4835) and MAPLSS (PRESS = 0.0569).]

[Figure: y-ANOVA plots ordered from left to right and top to bottom according to the vertical ranges (the X4*X10 interaction, X10, X4, X2, X5, X8, X7), together with a bar chart of the ranges of the transformed predictors.]

References

[1] C. De Boor. A Practical Guide to Splines, Springer-Verlag, Berlin, 1978.
[2] P. Bühlmann and Bin Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98, 324-339, 2003.
[3] R.D. De Veaux, D.C. Psichoggios and L.H. Ungar. A comparison of two nonparametric estimation schemes: MARS and neural networks. Computers chem. Engng, Vol. 17, No. 8, 819-837, 1993.
[4] J.F. Durand. Local polynomial additive regression through PLS and splines: PLSS. Chemometrics and Intelligent Laboratory Systems, 58, 235-246, 2001.
[5] J.F. Durand and R. Lombardo. Interactions terms in nonlinear PLS via additive spline transformations. In "Studies in Classification, Data Analysis, and Knowledge Organization", Springer-Verlag, 2003.
[6] P.H.C. Eilers and B.D. Marx. Flexible smoothing with B-splines and penalties (with discussion). Statistical Science, 11, 89-121, 1996.
[7] J.H. Friedman. Multivariate adaptive regression splines (with discussion). The Annals of Statistics, 19, 1-123, 1991.
[8] J.H. Friedman. Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29, 5, 1189-1232, 2001.
[9] A. Gifi. Non Linear Multivariate Analysis, J. Wiley & Sons, Chichester, 1990.
[10] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall, London, 1990.
[11] T. Hastie, R. Tibshirani and J.H. Friedman. The Elements of Statistical Learning, Springer, 2001.
[12] R. Lombardo, J.F. Durand and R.D. De Veaux. Multivariate Additive Partial Least-Squares Splines, MAPLSS. Submitted.
[13] N. Molinari, J.F. Durand and R. Sabatier. Bounded optimal knots for regression splines. Computational Statistics & Data Analysis, 2, 159-178, 2004.
[14] C.J. Stone. Additive regression and other nonparametric models. The Annals of Statistics, 13, 689-705, 1985.
[15] M. Sjöström, S. Wold and B. Söderström. PLS discrimination plots. In E.S. Gelsema and L.N. Kanal (Eds): Pattern Recognition in Practice II. Elsevier, Amsterdam, 1986.
[16] M. Tenenhaus. La régression PLS, Théorie et Applications, Technip, Paris, 1998.
[17] J.W. Tukey. Exploratory Data Analysis. Reading, Massachusetts: Addison-Wesley, 1977.
[18] H. Wold. Estimation of principal components and related models by iterative least squares. In Multivariate Analysis, P.R. Krishnaiah (Ed.), New York: Academic Press, 391-420, 1966.
[19] S. Wold, H. Martens and H. Wold. The multivariate calibration problem in chemistry solved by the PLS method. In: A. Ruhe, B. Kågström (Eds), Lecture Notes in Mathematics, Proceedings of the Conference on Matrix Pencils, Springer-Verlag, Heidelberg, 286-293, 1983.
