TESTING STRUCTURAL ASSUMPTIONS IN REGRESSION WHEN THE COVARIATE IS FUNCTIONAL
Laurent Delsol1, Frédéric Ferraty2 and Philippe Vieu2
1 Institut de Statistique, Université Catholique de Louvain, 20, voie du roman pays, Louvain-la-Neuve, Belgique (e-mail: [email protected])
2 Institut de Mathématiques de Toulouse, Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse cedex 9, France (e-mail: [email protected], [email protected])
ABSTRACT. In many real-world studies, the aim is to predict a real value of interest from the observation of a functional phenomenon. A great variety of multivariate regression models based on structural assumptions have been adapted to handle a functional covariate. To avoid such structural assumptions, one may consider nonparametric models in which only regularity assumptions are made. Before addressing prediction, it is natural to ask whether the explanatory variable has an effect on the response and, more generally, whether this effect has a specific structure. Starting from classical ideas on structural tests in multivariate statistics and recent advances in functional kernel smoothing methods, we propose structural testing procedures adapted to regression on a functional variable. Various bootstrap procedures are then introduced to compute the threshold value and, finally, some applications of our methods to spectrometric studies are discussed.
1 INTRODUCTION
Many real-world issues are related to the study of functional phenomena (which may be represented, for instance, as curves, surfaces, or images). Recent technological advances make it possible to collect, store and process data discretized on fine grids adapted to reflect the structure and the regularity of such phenomena. A direct use of multivariate methods on such data is often irrelevant. Indeed, in addition to the high dimension of the variable and the high correlation of its components, a multivariate model is ill-suited to take into account the regularity and the structure of the phenomenon. An alternative approach consists in considering these functional data as (discretized) realisations of functional random variables (i.e. random variables taking values in an infinite-dimensional space). This point of view often leads to a more synthetic representation of the data that exploits their underlying functional nature. In addition to the great variety of real-world issues it may concern, the development of functional statistical methods is also motivated by the many theoretical and methodological challenges it raises. Some classical methods of multivariate statistics have been adapted to the study of functional variables (see Ramsay and Silverman, 1997, 2002, 2005, Bosq, 2000, or Ferraty and Vieu, 2006).
In this talk we focus on the way a real variable of interest $Y$ depends on a functional covariate $X$ taking values in some semi-metric space $(\mathcal{E}, d)$. Practical considerations often lead one to ask whether this link exists or, more generally, whether it has a specific nature. In the next section we first discuss the interest of considering various kinds of regression models with a functional covariate. After providing a definition of parametric and nonparametric functional regression models (see also Ferraty and Vieu, 2006), we motivate the study of in-between alternatives which seem to be interesting in terms of interpretability and flexibility. Then, we present a way to construct a great variety of structural testing procedures. Such tests are useful to get more information on the underlying regression model and to check the validity of specific regression models. Extending the multivariate method introduced by Härdle and Mammen (1993), we first provide a general result and then explain how bootstrap methods can be used to compute the threshold value. Finally, some applications in real-world spectrometric studies are discussed.
2 STRUCTURAL TESTING PROCEDURES
The issue of testing structural assumptions in regression on a functional variable has not yet received much attention. Indeed, the literature seems to be limited to papers dealing with testing procedures in the specific case of the functional linear model (Cardot et al., 2003, 2004), a no effect test using projection methods (Gadiaga and Ignaccolo, 2005), and a heuristic goodness-of-fit test (Chiou and Müller, 2007). The main objective of this talk is to present a general way to construct structural testing procedures that may be useful to check the validity of a great number of the specific functional regression models discussed in Section 2.1. The methodologies proposed by Härdle and Mammen (1993) for the multivariate case are adapted by taking into account recent advances in functional kernel smoothing methods (see Ferraty and Vieu, 2000, 2006, for an overview).
2.1 A FEW WORDS ON FUNCTIONAL REGRESSION MODELS
In this talk, we focus on functional regression models of the form

$$Y = r(X) + \varepsilon, \quad \text{with } r \in \mathcal{R}, \tag{1}$$
in which $Y$ is a real-valued random variable, $X$ is a functional random variable taking values in a semi-metric space $(\mathcal{E}, d)$, $\mathcal{R}$ is a given family of operators on $\mathcal{E}$, and the residual $\varepsilon$ fulfils $E[\varepsilon \mid X] = 0$. A functional regression model is defined as parametric (respectively nonparametric) if the family $\mathcal{R}$ can (respectively cannot) be indexed by a finite number of elements of $\mathcal{E}$. Parametric models (e.g. the functional linear model) have attractive features in terms of interpretability and rates of convergence. However, they are based on structural assumptions and may hence lead to substantial modelling errors. Nonparametric models (e.g. when $\mathcal{R}$ is the set of Hölderian operators with respect to $d$) are much more flexible, but in return the convergence rates are slower and the results are harder to interpret. Consequently, it may be important to consider alternatives that strike a compromise between interpretability and generality (functional single index model, partial linear functional model, functional additive model, partial multivariate functional model, ...). It seems relevant to propose structural tests to check the validity of such specific models and to get more information on the nature of the regression operator. This additional information may also be useful when designing the estimation method.
2.2 HYPOTHESIS AND TEST STATISTIC
Assume we have at our disposal a data set composed of $N$ ($> n$) independent pairs $(X_i, Y_i)_{1 \le i \le N}$ identically distributed as $(X, Y)$. Given a family $\mathcal{R}$ of $L^2$ operators, we are interested in structural testing procedures in which the aim is to test the null hypothesis

$$H_0 : \{\exists m \in \mathcal{R},\ P(r(X) = m(X)) = 1\}$$

against local alternatives of the form

$$H_{1,n} : \inf_{m \in \mathcal{R}} \|r - m\|_{L^2(w\,dP_X)} \ge \eta_n,$$
where $w$ is a weight operator and $P_X$ is the law of $X$. The rate at which $\eta_n$ decreases to 0 reflects the capacity to detect smaller and smaller differences between $r$ and the family $\mathcal{R}$ as $n$ grows. Following ideas introduced by Härdle and Mammen (1993), we construct our test statistic from the $L^2$-norm of the functional kernel estimator of the residuals associated with the specific model corresponding to $H_0$. For technical reasons, it is more convenient to suppress the denominator and use the following test statistic:

$$T_n^{\mathcal{R}} = \int \left( \sum_{i=1}^{n} \left( Y_i - \hat{r}_0^{\mathcal{R}}(X_i) \right) K\!\left( \frac{d(x, X_i)}{h_n} \right) \right)^2 w(x)\, dP_X(x),$$

where $\hat{r}_0^{\mathcal{R}}$ is a specific estimator (corresponding to the model induced by $\mathcal{R}$) computed from the dataset $(X_i, Y_i)_{n+1 \le i \le N}$, $K$ is a kernel function, $d$ is a semi-metric on $\mathcal{E}$, and $h_n$ is a smoothing parameter.
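As an illustration of how this statistic might be evaluated in practice, the following minimal sketch replaces the integral with respect to $P_X$ by an empirical mean over a set of evaluation curves. It assumes curves discretized on a common grid, an $L^2$-type semi-metric and a quadratic kernel supported on $[0, 1]$. These choices, and the names `l2_semimetric`, `quadratic_kernel` and `structural_statistic`, are illustrative assumptions rather than part of the procedure itself.

```python
import numpy as np

def l2_semimetric(A, B):
    """Pairwise L2-type distances between the rows of A (m, p) and B (n, p).

    Assumption of this sketch: all curves are observed on the same grid.
    """
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt(np.mean(diff ** 2, axis=2))

def quadratic_kernel(u):
    """Kernel supported on [0, 1] (illustrative choice)."""
    return np.where((u >= 0) & (u <= 1), 1.0 - u ** 2, 0.0)

def structural_statistic(X, Y, fitted_null, X_eval, h, weight=None):
    """Empirical version of T_n.

    X, Y        : curves (n, p) and responses (n,) of the first sample
    fitted_null : values of the null-model estimator at the X_i, shape (n,)
    X_eval      : evaluation curves (m, p) used to approximate the integral
    h           : smoothing parameter h_n
    weight      : optional values w(x) at the evaluation curves, shape (m,)
    """
    residuals = Y - fitted_null                           # Y_i - r0_hat(X_i)
    K = quadratic_kernel(l2_semimetric(X_eval, X) / h)    # K(d(x, X_i) / h_n)
    inner = K @ residuals                                 # sum over i, one value per x
    w = np.ones(len(X_eval)) if weight is None else np.asarray(weight)
    return float(np.mean(w * inner ** 2))                 # empirical mean over x
```

For a no effect test, for instance, `fitted_null` would simply be a constant vector equal to the mean response of the second sample.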
2.3 A GENERAL THEORETICAL RESULT AND ITS APPLICATIONS
A general result stated in Delsol et al. (2009) ensures the asymptotic normality of the test statistic $T_n$ under the null hypothesis and its divergence under the alternative. To obtain this result, assumptions are needed on the specific estimator. Under the null hypothesis, we assume that it converges in mean square to the true regression operator at a certain rate. Many asymptotic results obtained for estimators corresponding to specific models can be used to show that this holds. Under the alternative, we assume that it is close to $\mathcal{R}$ and fulfils some regularity condition. In many cases the first part of this assumption is trivially fulfilled because the specific estimator belongs to $\mathcal{R}$. Results showing the convergence of the specific estimator under the alternative may be used to check the second part. The result is given under various sets of assumptions, allowing its use for a great variety of structural tests:
• Test of a priori models: $\mathcal{R} = \{r_0\}$, $N = n$, and $\hat{r}_0^{\mathcal{R}} = r_0$.
• No effect tests: $\mathcal{R} = \{\text{constant operators}\}$, and $\hat{r}_0^{\mathcal{R}} = \bar{Y}_n$.
• Linearity tests: $\mathcal{R} = \{\text{affine and continuous operators}\}$, and $\hat{r}_0^{\mathcal{R}}$ may be the splines estimator (see Crambes et al., 2008).
• Tests of multivariate models: given a known operator $V : \mathcal{E} \to \mathbb{R}^d$, take $\mathcal{R} = \{m,\ m(X) = m_0(V(X))\}$, and $\hat{r}_0^{\mathcal{R}}$ can be the kernel estimator constructed from the pairs $(Y_i, V(X_i))$.
• Test of functional single index: $\mathcal{R} = \{m,\ m(X) = m_0(\langle X, \theta \rangle),\ \theta \in \mathcal{E}\}$, with $\theta$ chosen by cross-validation (see Ait Saidi et al., 2008) and a kernel estimator used for $m_0$.

Other specific models could be checked using the result given in Delsol et al. (2009). The main difficulty is to find a specific estimator. This is discussed for instance in Ferraty and Vieu (2009b).
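Two of the null-model estimators listed above can be sketched as follows: the sample mean for the no effect test, and a kernel (Nadaraya-Watson) estimator built on $V(X_i)$ for the test of a multivariate model. This is only a sketch under stated assumptions: the Gaussian kernel, the Euclidean distance on $\mathbb{R}^d$, and the names `fit_no_effect` and `fit_multivariate_kernel` are hypothetical choices made for illustration.

```python
import numpy as np

def fit_no_effect(Y_train):
    """No effect null: the constant operator is estimated by the sample mean."""
    level = float(np.mean(Y_train))
    return lambda X_new: np.full(len(X_new), level)

def fit_multivariate_kernel(X_train, Y_train, V, bandwidth):
    """Null model m(X) = m0(V(X)): kernel estimator of m0 built on V(X_i).

    V is a known map from a discretized curve to a vector of R^d.
    A Gaussian kernel is assumed here so that the weights never all vanish
    for moderately distant points (very distant points may still underflow).
    """
    Y_train = np.asarray(Y_train, dtype=float)
    Z_train = np.array([V(x) for x in X_train], dtype=float).reshape(len(X_train), -1)

    def predict(X_new):
        Z_new = np.array([V(x) for x in X_new], dtype=float).reshape(len(X_new), -1)
        dist = np.linalg.norm(Z_new[:, None, :] - Z_train[None, :, :], axis=2)
        weights = np.exp(-0.5 * (dist / bandwidth) ** 2)
        return (weights @ Y_train) / weights.sum(axis=1)

    return predict
```

In both cases the returned function is fitted on the second sample $(X_i, Y_i)_{n+1 \le i \le N}$ and then evaluated at the curves of the first sample to form the residuals entering the test statistic.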
2.4 BOOTSTRAP PROCEDURES
From the result mentioned above, we would like to reject the null hypothesis when $T_n$ is larger than a threshold value. To compute the threshold corresponding to a given level, one may try to use the asymptotic normality obtained in Delsol et al. (2009) directly. However, the dominant bias and variance terms in the asymptotic distribution of $T_n$ are complicated and seem difficult to estimate, so the direct use of asymptotic normality may lead to irrelevant results. To avoid these drawbacks, we propose to use bootstrap methods (see e.g. the seminal paper by Efron, 1979, or Härdle and Mammen, 1993) to generate bootstrap samples for which the null hypothesis is approximately fulfilled. We then compute the test statistic on each of these samples and take as threshold the $1 - \alpha$ empirical quantile of the values obtained. We keep the explanatory variables fixed and bootstrap the residuals, then generate bootstrap responses such that the null hypothesis approximately holds for the pairs $(X_i, Y_i^b)$. We propose the following bootstrap procedure, in which steps 1-4 have to be done separately on the datasets $D : (X_i, Y_i)_{1 \le i \le n}$ and $D_1 : (X_i, Y_i)_{n+1 \le i \le N}$. In the following lines, $\hat{r}$ stands for the functional kernel estimator of the regression operator computed from the dataset under consideration ($D$ or $D_1$).

Bootstrap procedure:

Pre-treatment:
1. $\hat{\varepsilon}_i = Y_i - \hat{r}(X_i)$
2. $\tilde{\varepsilon}_i = \hat{\varepsilon}_i - \bar{\hat{\varepsilon}}$

Repeat B times steps 3-5:
3. Generate bootstrap residuals using one of the following methods:
   • NB: $(\varepsilon_i^b)_{1 \le i \le n}$ drawn with replacement from $(\tilde{\varepsilon}_i)_{1 \le i \le n}$
   • SNB: $(\varepsilon_i^b)_{1 \le i \le n}$ generated from a smooth version $\tilde{F}_n$ of the empirical cumulative distribution function of $(\tilde{\varepsilon}_i)_{1 \le i \le n}$ ($\varepsilon_i^b = \tilde{F}_n^{-1}(U_i)$, $U_i \sim \mathcal{U}(0, 1)$)
   • WB: $\varepsilon_i^b = \tilde{\varepsilon}_i V_i$, where $V_i \sim P_W$ fulfils the following conditions: $E[V_i] = 0$, $E[V_i^2] = 1$ and $E[V_i^3] = 1$
4. Generate bootstrap responses “fulfilling” $H_0$: $Y_i^b = \hat{r}_0^{\mathcal{R}}(X_i) + \varepsilon_i^b$
5. Compute the bootstrap test statistic $T_n^b$ from the bootstrap sample $(X_i, Y_i^b)_{1 \le i \le N}$

Compute the empirical threshold value:
6. For a test of level $\alpha$, take as threshold value the $1 - \alpha$ empirical quantile of the family $(T_n^b)_{1 \le b \le B}$.

Three interesting examples of possible laws for the variables $V_i$ are given for instance in Mammen (1993). Finally, the integral with respect to $P_X$ used to define the test statistic is approximated in practice by the empirical mean over a third sub-dataset.
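The procedure above can be sketched in code for the WB variant in the simplest setting, a no effect test, where the null estimator is the mean response of $D_1$. The sketch below is a hedged illustration, not a reference implementation: it assumes curves discretized on a common grid, an $L^2$-type semi-metric, a quadratic kernel, a weight $w \equiv 1$, and a classical two-point law for $V_i$ that satisfies the three moment conditions. All function and variable names are hypothetical.

```python
import numpy as np

# Two-point law with E[V] = 0, E[V^2] = 1, E[V^3] = 1 (assumed choice for P_W).
GOLDEN_VALUES = np.array([(1 - np.sqrt(5)) / 2, (1 + np.sqrt(5)) / 2])
GOLDEN_PROBS = np.array([(5 + np.sqrt(5)) / 10, (5 - np.sqrt(5)) / 10])

def kernel_matrix(A, B, h):
    """K(d(a, b) / h) for all rows a of A and b of B (quadratic kernel on [0, 1])."""
    d = np.sqrt(np.mean((A[:, None, :] - B[None, :, :]) ** 2, axis=2))
    u = d / h
    return np.where(u <= 1, 1.0 - u ** 2, 0.0)

def kernel_fit(X, Y, h):
    """Functional kernel estimator evaluated at the sample curves.

    The diagonal term K(0) = 1 keeps the denominator strictly positive.
    """
    K = kernel_matrix(X, X, h)
    return (K @ Y) / K.sum(axis=1)

def statistic(X, Y, fitted_null, X_eval, h):
    """Empirical test statistic; the integral is a mean over X_eval (w = 1)."""
    inner = kernel_matrix(X_eval, X, h) @ (Y - fitted_null)
    return float(np.mean(inner ** 2))

def centred_residuals(X, Y, h):
    """Steps 1-2: nonparametric residuals, then centring."""
    eps = Y - kernel_fit(X, Y, h)
    return eps - eps.mean()

def wild_bootstrap_threshold(X, Y, X1, Y1, X_eval, h, alpha=0.05, B=200, seed=0):
    """Threshold of level alpha for a no effect test, wild bootstrap (WB)."""
    rng = np.random.default_rng(seed)
    eps = centred_residuals(X, Y, h)       # steps 1-2 on D
    eps1 = centred_residuals(X1, Y1, h)    # steps 1-2 on D1
    null_level = Y1.mean()                 # null estimator: constant fitted on D1
    stats = np.empty(B)
    for b in range(B):
        V = rng.choice(GOLDEN_VALUES, size=len(Y), p=GOLDEN_PROBS)
        V1 = rng.choice(GOLDEN_VALUES, size=len(Y1), p=GOLDEN_PROBS)
        Yb = null_level + eps * V          # steps 3-4 on D
        Y1b = null_level + eps1 * V1       # steps 3-4 on D1
        fitted_b = np.full(len(Yb), Y1b.mean())             # refit the null model on D1
        stats[b] = statistic(X, Yb, fitted_b, X_eval, h)    # step 5
    return float(np.quantile(stats, 1 - alpha))             # step 6
```

With these helpers, the null hypothesis would be rejected when `statistic(X, Y, np.full(len(Y), Y1.mean()), X_eval, h)` exceeds the returned threshold. Swapping the line that draws `V` for resampling with replacement from `eps` would give the NB variant.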
3 APPLICATION TO SPECTROMETRIC STUDIES
To illustrate the interest of testing for specific models in real-world studies, we now consider a dataset coming from the spectrometric study of pieces of meat (see Delsol, 2009, for more details). The aim is to predict the fat content of each piece of meat from the observation of the corresponding spectrometric curve. Previous studies of this dataset have shown that the derivatives of the spectrometric curves are good predictors (see for instance Ferraty and Vieu, 2006, 2009a). No effect tests can be used to test whether the successive derivatives have a significant effect, first on the fat content and then on the residuals computed from the best predictors. This leads to the conclusion that, once the effect of the second and third derivatives has been taken into account, the others do not have a significant effect. This conclusion is similar to the results obtained by Ferraty and Vieu (2009a) with boosting methods. Moreover, linearity tests highlight that some derivatives have a significant nonlinear effect, for which a nonparametric estimator is preferable, while for the others a linear estimator gives more relevant results.

The question of testing which part of the spectrometric curve has a significant effect has been considered on a dataset corresponding to the spectrometric analysis of corn samples. It seems that the effect of the whole curve can be reduced to the effect of specific portions. Taking this into account yields better prediction results when the informative part of the spectrometric curve is used as predictor. It would also be interesting to test whether the effect of the spectrometric curve reduces to the effect of some values corresponding to specific wavelengths. In conclusion, our structural tests provide interesting information on the underlying regression model.
4 CONCLUSION AND PROSPECTS
Testing the validity of specific models in regression on a functional variable can be done by testing structural assumptions on the unknown regression operator. The general theoretical result cannot be used directly to compute the threshold value. However, it is an important starting point for proving the theoretical efficiency of bootstrap methods, which lead to relevant results in practice. Because of the lack of testing procedures in regression on functional variables, further developments are needed to study some specific structural tests more precisely, to propose other test statistics, or to adapt our procedures to dependent datasets. The extension of our method to functional responses is another prospect for further research.
REFERENCES
AIT SAIDI, A., FERRATY, F., KASSA, R. and VIEU, P. (2008) Cross-validated estimations in the single functional index model. Statistics, 42, 475-494.
BOSQ, D. (2000) Linear Processes in Function Spaces: Theory and Applications. Lecture Notes in Statistics, 149, Springer-Verlag, New York.
CARDOT, H., FERRATY, F., MAS, A. and SARDA, P. (2003) Testing hypotheses in the functional linear model. Scandinavian Journal of Statistics, 30, 241-255.
CARDOT, H., GOIA, A. and SARDA, P. (2004) Testing for no effect in functional linear regression models, some computational approaches. Comm. Statist. Simulation Comput., 33 (1), 179-199.
CHIOU, J.M. and MÜLLER, H.G. (2007) Diagnostics for functional regression via residual processes. Computational Statist. and Data Analysis, 51 (10), 4849-4863.
CRAMBES, C., KNEIP, A. and SARDA, P. (2008) Smoothing splines estimators for functional linear regression. Annals of Stat., 37, 35-72.
DELSOL, L. (2009) No-effect tests in regression on functional variable and some applications to spectrometric studies. (submitted)
DELSOL, L., FERRATY, F. and VIEU, P. (2009) Structural test in regression on functional variable. (submitted)
EFRON, B. (1979) Bootstrap methods: another look at the jackknife. Annals Statist., 7 (1), 1-26.
FERRATY, F. and VIEU, P. (2006) Nonparametric functional data analysis: theory and practice. Springer-Verlag, New York.
FERRATY, F. and VIEU, P. (2009a) Additive prediction and boosting for functional data. Comp. Statist. and Data Analysis, 53, 1400-1413.
FERRATY, F. and VIEU, P. (2009b) Kernel regression estimation for functional data. In Handbook on Functional Data Analysis. To appear.
GADIAGA, D. and IGNACCOLO, R. (2005) Test of no-effect hypothesis by nonparametric regression. Afr. Stat., 1 (1), 67-76.
HÄRDLE, W. and MAMMEN, E. (1993) Comparing nonparametric versus parametric regression fits. Ann. Statist., 21 (4), 1926-1947.
MAMMEN, E. (1993) Bootstrap and wild bootstrap for high-dimensional linear models. Ann. Statist., 21 (1), 255-285.
RAMSAY, J. and SILVERMAN, B. (1997) Functional Data Analysis. Springer-Verlag, New York.
RAMSAY, J. and SILVERMAN, B. (2002) Applied functional data analysis: Methods and case studies. Springer-Verlag, New York.
RAMSAY, J. and SILVERMAN, B. (2005) Functional Data Analysis (Second Edition). Springer-Verlag, New York.