Robust PLS regression based on simple least median squares regression

Biagio Simonetti, DASES, University of Sannio
[email protected]
Smail Mahdi, Department of Computer Sciences, University of the West Indies, Barbados
[email protected]
Ida Camminatiello, Dipartimento di Matematica e Statistica, Università di Napoli Federico II, Italy
[email protected]
Keywords: multivariate regression, partial least squares regression, least median squares regression, simulation study.

1. Introduction
A classical statistical problem is to estimate the linear relationship between two sets of variables, X (n, p) (explanatory variables) and y (n, 1) (dependent variable), where n is the number of statistical units and p the number of explanatory variables. The technique most widely used to solve this problem is the multivariate regression model y = a + Xb + e, where a (n, 1) is the intercept term, b (p, 1) the vector of slopes and e (n, 1) the error term. This linear model is well defined only if p < n. If p > n, the inverse of X'X does not exist (or is unstable) and therefore the classical least squares regression model cannot be applied. A solution to this problem is offered by Partial Least Squares (PLS) Regression (Wold, 1975).
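The rank argument can be checked numerically. The following minimal numpy sketch (the dimensions and simulated data are illustrative assumptions, not taken from the paper) shows that X'X has full rank when p < n but is necessarily singular when p > n:

```python
import numpy as np

rng = np.random.default_rng(0)

# p < n: X'X is a p x p matrix of full rank, so OLS is well defined.
X_tall = rng.normal(size=(50, 5))          # n = 50, p = 5
rank_tall = np.linalg.matrix_rank(X_tall.T @ X_tall)

# p > n: X'X is a p x p matrix whose rank is at most n, hence singular,
# and the OLS estimate (X'X)^{-1} X'y does not exist.
X_wide = rng.normal(size=(5, 50))          # n = 5, p = 50
rank_wide = np.linalg.matrix_rank(X_wide.T @ X_wide)
```

Here `rank_tall` equals p = 5, while `rank_wide` cannot exceed n = 5 even though X'X is 50 x 50.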
Garthwaite (1994) shows that linear combinations of the explanatory variables can be formed sequentially and related to the y variable by ordinary least squares regression. Since PLS regression is very sensitive to the presence of outliers in the data, several robust versions have been proposed. In this paper we propose a robust version of PLS regression based on least median of squares regression (Rousseeuw, 1984): the least median of squares regression is substituted for the least squares regression used systematically in Garthwaite's (1994) set-up for forming the latent PLS factors.
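To fix ideas, a minimal sketch of a least median of squares fit for simple regression is given below. It uses the classical device of searching over lines through pairs of observations, a common approximation to the exact LMS fit; the simulated data and function names are my own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def lms_simple(x, y):
    # Approximate LMS for simple regression y = a + b*x:
    # among lines through each pair of points, keep the one that
    # minimizes the median of the squared residuals.
    n = len(x)
    best_a, best_b, best_med = 0.0, 0.0, np.inf
    for i in range(n):
        for j in range(i + 1, n):
            if x[j] == x[i]:
                continue
            b = (y[j] - y[i]) / (x[j] - x[i])
            a = y[i] - b * x[i]
            med = np.median((y - a - b * x) ** 2)
            if med < best_med:
                best_a, best_b, best_med = a, b, med
    return best_a, best_b

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=50)
y[:10] += 20.0                  # contaminate 20% of the responses
a_hat, b_hat = lms_simple(x, y)
```

Despite 20% contamination, the fit stays close to the true line (intercept 2, slope 3), illustrating the high breakdown point that motivates substituting LMS for OLS in the sequential construction.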
2. Partial Least Squares Regression
Partial Least Squares (PLS) regression is a multivariate data analysis technique that relates a response variable y to several explanatory x-variables. The method aims to identify the underlying factors, that is, the linear combinations of the x-variables, that best model the dependent variable y. PLS deals efficiently with data sets consisting of a large number of variables that may be highly correlated and affected by substantial random noise. To describe the technique, let X (n, p) be a matrix of p explanatory variables observed on n units and y the vector of the response variable. PLS projects X onto a set of latent variables t_j (linear combinations of the columns of X), j = 1, 2, ..., a, chosen to maximize Cov(t_j, y), where a stands for the number of components retained in the model.
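For the first component, the covariance-maximizing weight vector has a closed form: among unit-norm weights w, Cov(Xw, y) is maximized by w proportional to X'y. A minimal numpy sketch (the simulated data are an illustrative assumption, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = 1.0
y = X @ beta + rng.normal(scale=0.1, size=n)

# PLS works with centred variables.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# First PLS component: the unit-norm weight vector maximising
# Cov(t, y), with t = X w, is proportional to X'y.
w = Xc.T @ yc
w /= np.linalg.norm(w)
t1 = Xc @ w

cov_t_y = t1 @ yc / (n - 1)
```

By the Cauchy-Schwarz inequality, no other unit-norm linear combination of the columns of Xc has a larger covariance with y than t1; subsequent components are extracted in the same way after deflation.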
3. Garthwaite's PLS approach
In order to ease the interpretation of the regression coefficients, Garthwaite (1994) proposed an alternative PLS approach based on simple linear regression. To study the relationship between y and X, he proposed the following form for the regression equation:
ŷ = b_1 t_1 + b_2 t_2 + ... + b_p t_p
where each component t_j is a linear combination of the predictors weighted by the regression coefficient b_j. The components are computed sequentially, as in PLS regression.

4. Least median of squares regression
Rousseeuw (1984) proposed the Least Median of Squares (LMS) estimator, which is obtained from the following optimization problem:

minimize the median over i of e_i^2

where e_i is the residual of observation i. This estimator turns out to be very robust with respect to outliers in y as well as outliers in X: its breakdown point is 50%. Furthermore, the LMS estimator is equivariant with respect to linear transformations of the explanatory variables, because it only makes use of the residuals.

5. PLS under LMS
The PLS method is known to be very sensitive to outlying observations, which are typically expected to be present in experimental data. This drawback of classical partial least squares regression has been addressed by several authors, who propose different ways to construct a robust version of partial least squares regression. We propose here to combine Garthwaite's PLS technique with Rousseeuw's least median of squares estimator and to study the performance of this new procedure by simulation.

6. A Simulation Study
In this section we illustrate the statistical properties of R-PLS regression in comparison with PLS and the main robust techniques proposed in the literature. First, the efficiency of the estimators is investigated. We generate 1000 samples of size n with components
randomly drawn from a normal distribution with mean zero and standard deviation 0.001.

References
Garthwaite, P. H. (1994), An interpretation of partial least squares, Journal of the American Statistical Association, 89(425), 122-127.
Hubert, M., Verboven, S. (2003), A robust PCR method for high-dimensional regressors, Journal of Chemometrics, 17, 438-452.
Rousseeuw, P. J. (1984), Least median of squares regression, Journal of the American Statistical Association, 79, 871-880.
Vanden Branden, K., Hubert, M. (2002), A robustified version of the SIMPLS algorithm, Proceedings of the International Conference on Robust Statistics, ICORS 2002, Canada.
Vanden Branden, K., Hubert, M. (2003), The influence function of the classical and robust PLS weight vectors, submitted.
Wold, H. (1966), Estimation of principal components and related models by iterative least squares, in Multivariate Analysis, Krishnaiah, P. R. (Ed.), Academic Press, New York.
Wold, H. (1975), Soft modeling by latent variables: the non-linear iterative partial least squares approach, in Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett.