
Transformations of Data

David Ruppert, Cornell University, Ithaca, NY, USA

Int. Encyc. Social and Behavioral Sciences
14 October 1999

1 Introduction

Data transformations, such as replacing a variable by its logarithm or by its square-root, are used to simplify the structure of the data so that they follow a convenient statistical model. Transformations have one or more objectives, including: a) inducing a simple systematic relationship between a response and predictor variables in regression, b) stabilizing a variance, that is, inducing a constant variance in a group of populations or in the residuals after a regression analysis, and c) inducing a particular type of distribution, e.g., a normal or symmetric distribution. The second and third goals are concerned with simplifying the "error structure", or random component, of the data.

Data transformations were developed at a time when there were far fewer tools available to the applied statistician than now. Many problems that in the past would have been tackled by data transformation plus a linear model are now solved using more sophisticated techniques, for example, generalized linear models, nonlinear models, smoothing, and variance function models. Nonetheless, model simplification is still important and data transformation is still an effective and widely used tool.

2 Transformations to simplify the systematic model

In regression, if the expected value of the response $y$ is related to predictors $x_1, \ldots, x_J$ in a complex way, often a transformation of $y$ will simplify this relationship by inducing linearities or removing interactions. For example, if

$$y = \beta_0 x_1^{\beta_1} \cdots x_J^{\beta_J} + \epsilon, \qquad (1)$$

where $\epsilon$ is a small random error, then $\log y$ approximately follows the linear model

$$\log y = \beta_0^* + \beta_1 x_1^* + \cdots + \beta_J x_J^* + (\beta_0 x_1^{\beta_1} \cdots x_J^{\beta_J})^{-1} \epsilon, \qquad (2)$$

where $\beta_0^* = \log(\beta_0)$ and $x_j^* = \log(x_j)$. Model (1) is nonlinear in each $x_j$ and has large and complex interactions. In contrast, model (2) is linear and additive in the transformed predictors. However, transformations should not be used blindly to linearize a systematic relationship. If the errors in (1) are homoscedastic, then the errors in (2) will be heteroscedastic. Also, the log transformation can induce skewness and transform the data nearest 0 into outliers. Any intelligent use of transformations requires that the effects of the transformations on the error structure be understood; see Ruppert, Carroll, and Cressie (1989), where the use of transformations to linearize the Michaelis-Menten equation is discussed.
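To make the linearization concrete, here is a small simulation sketch (the predictors, coefficient values, and error variance below are arbitrary choices, not taken from the article): data are generated from model (1) with homoscedastic errors, the log-linear model (2) is fit by ordinary least squares, and the heteroscedasticity induced on the log scale shows up in the residual spread.

```python
# Illustrative sketch, not code from the article: fit model (2) to data
# generated from model (1).  All numerical settings are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(1.0, 5.0, n)
x2 = rng.uniform(1.0, 5.0, n)
beta0, beta1, beta2 = 2.0, 0.7, -0.3
eps = rng.normal(0.0, 0.05, n)                  # small homoscedastic error
y = beta0 * x1**beta1 * x2**beta2 + eps         # model (1)

# Model (2): regress log(y) on log(x1) and log(x2).
X = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print("log(beta0), beta1, beta2 estimates:", coef)   # ~ (log 2, 0.7, -0.3)

# The price of linearization: on the log scale the error is roughly eps/mean,
# so its variance falls as the mean rises, i.e. the errors are heteroscedastic.
resid = np.log(y) - X @ coef
true_mean = beta0 * x1**beta1 * x2**beta2
print("residual SD for small vs large means:",
      resid[true_mean < np.median(true_mean)].std(),
      resid[true_mean >= np.median(true_mean)].std())
```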

3 Transformations to simplify error structure

The effects of a transformation upon the distribution of data can be understood by examining the transformation's derivative. A nonlinear transformation tends to stretch out data where its derivative is large and to squeeze together data where its derivative is small. These effects are illustrated in Fig 1(a), where the logarithm transformation is applied to normalize right-skewed lognormally distributed data. Along the x-axis one sees the density of the lognormal data. The log transformation is shown as a solid curve. On the y-axis is the normal density of the transformed data. The mapping of the 5th percentile of the original data to the same percentile of the transformed data is shown with dashed lines, and similar mappings for the median and 95th percentiles are shown with solid and dashed-and-dotted lines. The log is a concave transformation, that is, its derivative is decreasing. Therefore, the long right tail and short left tail of the lognormal distribution are shrunk, respectively stretched, by the mapping to produce a symmetric distribution of the transformed data. The density on the y-axis is evaluated at equally-spaced points. Their inverse images on the x-axis are increasingly spread out as one moves from left to right. The same effect can be seen in the percentiles. In general, concave transformations are needed to symmetrize right-skewed distributions. Analogously, convex transformations, which have increasing first derivatives, are used to symmetrize left-skewed distributions.

Data often exhibit the property that populations with larger means also have larger variances. In regression, if $E(Y \mid X) = \mu(X)$ is the conditional mean of the response $Y$ given a vector $X$ of covariates, then we might find that $\mathrm{var}(Y \mid X)$ is a function of $\mu(X)$, i.e., $\mathrm{var}(Y \mid X) = g\{\mu(X)\}$ for some function $g$. If $Y$ is transformed to $h(Y)$, then a linearizing approximation shows that

$$\mathrm{var}(h(Y) \mid X) \approx \{h'(\mu(X))\}^2 g\{\mu(X)\}. \qquad (3)$$

The transformation $h$ stabilizes the variance if $\{h'(y)\}^2 g(y)$ is constant. For example, if $g(y) \propto y^\theta$, then $h(y) \propto y^{1 - \theta/2}$ is variance-stabilizing for $\theta \neq 2$. For Poisson data, $\theta = 1$ and the square-root transformation is variance-stabilizing. For data with a constant coefficient of variation, $\theta = 2$ and the log transformation is variance-stabilizing.

When $g$ is an increasing function, the variance-stabilizing transformation will be concave, as illustrated in Fig 1(b). The original data on the x-axis are such that their square-roots are normally distributed data truncated in the extreme tails to be strictly positive. There are two original populations with different means and variances. After a square-root transformation, the two populations have different means but equal variances. The dashed lines show the mapping of the first and third quartiles of the first population of the original data to their transformed values. The solid lines show the same mapping for the second population. When $\mathrm{var}(Y \mid X)$ is not a function of $\mu(X)$, there is no variance-stabilizing transformation, and variance function modeling (Carroll and Ruppert, 1988) should be used as an alternative, or at least an adjunct, to transformation of $Y$.
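As a quick numerical check of these two special cases (a sketch with arbitrary means and sample sizes, not an example from the article), the square-root transform holds the variance of Poisson counts roughly constant across means, and the log transform does the same for data with a constant coefficient of variation:

```python
# Illustrative sketch: variance stabilization for theta = 1 (Poisson, sqrt)
# and theta = 2 (constant coefficient of variation, log).
import numpy as np

rng = np.random.default_rng(1)
for mu in (5.0, 20.0, 80.0):
    pois = rng.poisson(mu, 100_000)
    cv_const = mu * rng.lognormal(mean=0.0, sigma=0.25, size=100_000)
    print(f"mean {mu:5.1f}:",
          f"var(Y)={pois.var():7.1f}",
          f"var(sqrt Y)={np.sqrt(pois).var():5.3f}",        # ~0.25 at every mean
          f"var(log Y_cv)={np.log(cv_const).var():5.3f}")   # ~sigma^2 at every mean
```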

4 Power transformations

There are several parametric families of data transformation in general use. The most important are the power or Box-Cox (1964) family and the power-plus-shift family. The Box-Cox power transformations are

$$h(y; \lambda) = \begin{cases} (y^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\ \log(y), & \lambda = 0. \end{cases}$$

For $\lambda \neq 0$ the ordinary power transformation, $y^{\lambda}$, is shifted and rescaled so that $h(y; \lambda) \to \log(y)$ as $\lambda \to 0$. This embeds the log transformation into the power transformation family in a continuous fashion. The shift and rescaling are linear transformations and have no effect on whether the transformation achieves the three objectives enumerated in Section 1. As mentioned before, power transformations are variance-stabilizing when the variance is a power of the mean, a model that often fits well in practice. The cube-root transformation, called the Wilson-Hilferty transformation, approximately normalizes gamma-distributed data.

The effect of $h(y; \lambda)$ on data depends on its derivative $h'(y; \lambda) = (\partial/\partial y)\, h(y; \lambda) = y^{\lambda - 1}$. Consider two points $y_1 < y_2$. The "strength" of the transformation can be measured by the ratio

$$\frac{h'(y_2; \lambda)}{h'(y_1; \lambda)} = \left( \frac{y_2}{y_1} \right)^{\lambda - 1},$$

which shows how far data are spread out at $y_2$ relative to at $y_1$. This ratio is one for $\lambda = 1$ (linear transformation), is less than one for $\lambda < 1$ (concave transformations), and is greater than one for $\lambda > 1$ (convex transformations). For fixed $y_1$ and $y_2$, the ratio is increasing in $\lambda$, so the transformation's concavity or convexity becomes increasingly strong as $\lambda$ moves away from 1. If a certain value of $\lambda$ transforms a right-skewed distribution to symmetry, then a larger value of $\lambda$ will not completely remove right-skewness, while a smaller value of $\lambda$ will induce some left-skewness. Analogously, if the variance is an increasing function of the mean and a certain value of $\lambda$ is variance-stabilizing, then for a larger value of $\lambda$ the variance of the transformed data is still increasing in the mean, while for a smaller value of $\lambda$ the variance is decreasing in the mean. Understanding this behavior is important when a symmetrizing or variance-stabilizing power transformation is sought by trial-and-error.
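The following sketch (assuming only the definitions above; the test values are arbitrary) implements the Box-Cox family and its strength ratio, and verifies the continuity at $\lambda = 0$ and the concave/convex behavior just described:

```python
# Illustrative sketch of the Box-Cox family and its "strength" ratio.
import numpy as np

def box_cox(y, lam):
    """Box-Cox transformation h(y; lambda) for y > 0."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam

def strength_ratio(y1, y2, lam):
    """h'(y2; lambda) / h'(y1; lambda) = (y2 / y1)**(lambda - 1)."""
    return (y2 / y1) ** (lam - 1.0)

# Continuity at lambda = 0: h(y; lambda) -> log(y) as lambda -> 0.
y = np.array([0.5, 1.0, 2.0, 10.0])
print(np.max(np.abs(box_cox(y, 1e-8) - np.log(y))))      # ~0

# Concave (lambda < 1) vs convex (lambda > 1) strength at y1 = 1, y2 = 4.
for lam in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(lam, strength_ratio(1.0, 4.0, lam))             # <1, =1, >1 as described
```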

The shifted-power transformation is

$$h(y; \lambda, \gamma) = h(y + \gamma; \lambda).$$

The ordinary power transformation should be applied only to nonnegative data, but the shifted power transformation can be used for any data set provided that $\gamma > -\min(y)$, where $\min(y)$ is the smallest value of the data. Thus, shifted power transformations are particularly well suited to variables that have a lower bound but no upper bound. Atkinson (1985) discusses families of transformations that are appropriate for percentages, proportions, and other variables that are constrained to a bounded interval. For data that are neither bounded above nor bounded below, Manly (1976) proposes the exponential data transformations

$$h(y; \lambda) = \begin{cases} (\exp(\lambda y) - 1)/\lambda, & \lambda \neq 0 \\ y, & \lambda = 0, \end{cases}$$

which is the transformation from $y$ to $\exp(y)$, a positive variate, followed by the Box-Cox power transformation.
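A companion sketch for the shifted-power and Manly families, again hypothetical rather than code from the article; the last line checks that Manly's transformation is the Box-Cox transformation applied to $\exp(y)$:

```python
# Illustrative sketch of the shifted-power and Manly exponential families.
import numpy as np

def box_cox(y, lam):
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam

def shifted_power(y, lam, gamma):
    """h(y; lambda, gamma) = h(y + gamma; lambda); requires gamma > -min(y)."""
    y = np.asarray(y, dtype=float)
    if np.min(y) + gamma <= 0:
        raise ValueError("gamma must exceed -min(y)")
    return box_cox(y + gamma, lam)

def manly(y, lam):
    """Manly (1976) exponential transformation, defined for data of either sign."""
    y = np.asarray(y, dtype=float)
    return y if lam == 0.0 else (np.exp(lam * y) - 1.0) / lam

# Manly's transform is the Box-Cox transform applied to exp(y):
y = np.array([-2.0, -0.5, 0.0, 1.5])
print(np.allclose(manly(y, 0.3), box_cox(np.exp(y), 0.3)))   # True
```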

5 Estimation of transformation parameters


There are situations where a transformation may be selected based on a priori information. For example, if we know that model (1) holds, then we know that the log transformation will linearize and remove interactions. In other situations, transformations are often based on the data, though perhaps selected from a pre-determined parametric family, e.g., shifted power transformations. Trial and error can be an effective method of parameter selection. For example, one might try power transformations with $\lambda$ equal to $-1$, $-1/2$, $0$, $1/2$, and $1$ and see which value of $\lambda$ produces the best residual plots. When all aspects of the data are parametrically modeled, the transformation parameter as well as other parameters can be estimated simultaneously by maximum likelihood. Semiparametric transformations are used when the model is not fully parametric, e.g., when it is assumed that there is a parametric transformation to symmetry but not to a specific symmetric parametric family.
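A minimal sketch of this trial-and-error grid, using a crude numerical stand-in (the sample skewness of the transformed response) in place of visual inspection of residual plots; the response `y` below is a hypothetical lognormal sample:

```python
# Illustrative sketch: try the grid of lambda values mentioned above.
import numpy as np
from scipy.stats import skew

def box_cox(y, lam):
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam

y = np.random.default_rng(2).lognormal(mean=1.0, sigma=0.8, size=400)  # stand-in data

for lam in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"lambda = {lam:4.1f}   skewness = {skew(box_cox(y, lam)):+.3f}")
# In practice one would examine residual plots from the fitted model at each
# lambda rather than rely on a single summary statistic.
```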

5.1 Maximum likelihood estimation

Box and Cox (1964) assumed that after a parametric transformation of the response, the data follow a simple linear model with homoscedastic, normally distributed errors. They find the likelihood for this model and show how to maximize it to estimate the transformation parameter, the regression parameters, and the variance of the errors. The Box-Cox model attempts to achieve each of the three objectives in Section 1 with a single transformation. Although it may be rare for all three goals to be achieved exactly, often in practice both the systematic model and the error structure can be simplified by the same transformation, and because of this the Box-Cox technique has seen widespread use. Simultaneous estimation of a power and shift by maximum likelihood is not recommended for technical reasons (Atkinson, 1985), but $\lambda$ can be estimated by maximum likelihood for a fixed value of the shift, which can be chosen by trial and error.

Carroll and Ruppert (1984, 1988) study a somewhat different parametric model than that of Box and Cox, called transform-both-sides (TBS). They assume that there is a known theoretical model relating the response to the predictors. Transforming only the response will destroy this relationship, so they transform both the response and the theoretical model in the same way. It is assumed that after transforming by the correct $\lambda$, the errors are homoscedastic and normally distributed. TBS simplifies only the error structure while leaving the systematic model unchanged.

The model of Box and Tidwell (1962) assumes that the untransformed $Y$ is linear in transformed predictors, for example, that

$$Y = \beta_0 + \beta_1 h(x_1; \lambda_1) + \cdots + \beta_p h(x_p; \lambda_p) + \epsilon.$$

This model is a special case of nonlinear regression and can be fit by ordinary nonlinear least-squares. Atkinson (1985) discusses a combination of the Box-Tidwell model and the Box-Cox model where

$$h(Y; \lambda) = \beta_0 + \beta_1 h(x_1; \lambda_1) + \cdots + \beta_p h(x_p; \lambda_p) + \epsilon.$$
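Returning to the basic Box and Cox (1964) setup, the likelihood is easy to profile numerically. The sketch below assumes a hypothetical design matrix `X` and positive response `y` (not data from the article); for each $\lambda$ it transforms the response, fits ordinary least squares, and evaluates the profile log-likelihood including the Jacobian term $(\lambda - 1)\sum_i \log y_i$:

```python
# Illustrative profile-likelihood sketch for the Box-Cox regression model.
import numpy as np

def box_cox(y, lam):
    return np.log(y) if abs(lam) < 1e-8 else (y**lam - 1.0) / lam

def profile_loglik(lam, y, X):
    """Profile log-likelihood of lambda, with beta and sigma^2 profiled out."""
    z = box_cox(y, lam)
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    rss = np.sum((z - X @ beta) ** 2)
    n = len(y)
    # -(n/2) log(sigma_hat^2) plus the Jacobian of the transformation
    return -0.5 * n * np.log(rss / n) + (lam - 1.0) * np.sum(np.log(y))

# Hypothetical example; in practice X and y come from the application.
rng = np.random.default_rng(3)
n = 300
x = rng.uniform(1.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
y = (1.0 + 0.5 * x + rng.normal(0.0, 0.2, n)) ** 2     # true lambda ~ 1/2

grid = np.linspace(-1.0, 2.0, 61)
ll = np.array([profile_loglik(l, y, X) for l in grid])
print("lambda_hat =", grid[np.argmax(ll)])              # close to 0.5
```

In practice one would refine the grid or hand the profile to a one-dimensional optimizer; for a single sample without covariates, scipy.stats.boxcox performs the analogous maximum likelihood fit.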

5.2 Semiparametric estimation

Hinkley (1975) estimates a transformation to symmetry by setting the third-moment skewness coefficient of the transformed data to zero and solving for the transformation parameter. Other measures of skewness can be used as well. Ruppert and Aldershof (1989) estimate a variance-stabilizing transformation by minimizing the correlation between the squared residuals and the fitted values.
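A sketch of Hinkley's approach under the Box-Cox family (with a hypothetical positive sample, not data from the article): the skewness of the transformed data increases in $\lambda$, so a root of the skewness, as a function of $\lambda$, can be bracketed and found numerically:

```python
# Illustrative sketch of Hinkley's (1975) transformation to symmetry.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import skew

def box_cox(y, lam):
    return np.log(y) if abs(lam) < 1e-8 else (y**lam - 1.0) / lam

def skew_after_transform(lam, y):
    return skew(box_cox(y, lam))

y = np.random.default_rng(4).lognormal(mean=0.0, sigma=1.0, size=2000)

# Skewness of the transformed data increases in lambda, so for data that the
# family can symmetrize these endpoints bracket a sign change.
lam_hat = brentq(skew_after_transform, -2.0, 2.0, args=(y,))
print("lambda to symmetry:", lam_hat)    # near 0 for lognormal data
```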

5.3 Effects of outliers and robustness

An outlier is an observation that is poorly fit by the model and parameter values that fit the bulk of the data. When the model includes a transformation, outliers can be difficult to identify. For example, if the original data are lognormally distributed, apparent outliers in the long right tail may be seen to be conforming after the normalizing transformation. Conversely, the smallest observations may be quite close to the bulk of the original data but outlying after a log transformation. It is essential that the transformation parameter not be chosen to fit a few outliers but rather to fit the main data set. Diagnostics and robust transformations can be useful for this purpose (Atkinson, 1985; Carroll and Ruppert, 1985, 1988).

5.4 Inference after a transformation

There has been some debate on whether a data-based transformation can be treated as fixed when one makes inferences about the regression parameters (Box and Cox, 1982). In the Box-Cox model, the values of the regression parameters are highly dependent on the choice of transformation, and there is a general consensus that inference for regression parameters should treat the transformation parameters as fixed. If one is estimating the expected response or predicting responses at new values of the predictors, then the predictions are more variable with an estimated transformation than when the correct transformation is known (Carroll and Ruppert, 1981). Prediction intervals can be adjusted for this effect to obtain coverage probabilities closer to nominal values (Carroll and Ruppert, 1991).

6 Multivariate transformations

Most of the classical techniques of multivariate analysis assume that the population has a multivariate normal distribution, an assumption that is stronger than that the individual components are univariate normal. Andrews, Gnanadesikan, and Warner (1971) generalize the Box-Cox model to multivariate samples. They transform each coordinate of the observations using the same transformation family, e.g., power transformations, for all coordinates but with the transformation parameters being coordinate specific. It is assumed that within this family of multivariate transformations, there exists a transformation to multivariate normality. All parameters in this model are estimated by maximum likelihood.
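A sketch of this multivariate profile likelihood, assuming a hypothetical positive $n \times p$ data matrix: each column has its own Box-Cox parameter, and the vector of $\lambda$s is chosen to maximize the multivariate normal log-likelihood with the covariance matrix profiled out and the per-column Jacobian terms included:

```python
# Illustrative sketch of the Andrews-Gnanadesikan-Warner (1971) approach.
import numpy as np
from scipy.optimize import minimize

def box_cox(y, lam):
    return np.log(y) if abs(lam) < 1e-8 else (y**lam - 1.0) / lam

def neg_profile_loglik(lams, Y):
    n, p = Y.shape
    Z = np.column_stack([box_cox(Y[:, j], lams[j]) for j in range(p)])
    S = np.cov(Z, rowvar=False, bias=True)                 # MLE covariance of Z
    sign, logdet = np.linalg.slogdet(S)
    jacobian = sum((lams[j] - 1.0) * np.sum(np.log(Y[:, j])) for j in range(p))
    return 0.5 * n * logdet - jacobian                     # negative profile loglik

# Hypothetical data: each column is a monotone function of a bivariate normal.
rng = np.random.default_rng(5)
W = rng.multivariate_normal([5.0, 8.0], [[1.0, 0.6], [0.6, 1.5]], size=400)
Y = np.column_stack([W[:, 0] ** 2, np.exp(W[:, 1] / 4.0)])  # true lambdas ~ (0.5, 0)

fit = minimize(neg_profile_loglik, x0=np.array([1.0, 1.0]), args=(Y,),
               method="Nelder-Mead")
print("estimated lambdas:", fit.x)
```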

7 Nonparametric transformations

For interpretability, a data transformation should be continuous and monotonic, but parametric assumptions are only made for simplicity. A nonparametric estimate of a transformation allows one to select a suitable parametric family of transformations, to check if an assumed family is appropriate, or to forego parametric assumptions when necessary. Tibshirani's (1988) AVAS method uses nonparametric variance function estimation and eqn (3) to estimate the variance-stabilizing transformation. Nychka and Ruppert (1995) use nonparametric spline-based transformations within the TBS model.
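The following is not Tibshirani's AVAS algorithm, only a crude illustration of its variance-stabilizing ingredient: the variance function $g$ is estimated by binning hypothetical data, and eqn (3) then gives the stabilizing transformation as the accumulated integral of $g(u)^{-1/2}$:

```python
# Illustrative sketch: estimate g by binning, then integrate g**(-1/2).
import numpy as np

rng = np.random.default_rng(6)
mu = rng.uniform(1.0, 50.0, 5000)
y = rng.poisson(mu)                       # var(Y | mu) = mu, so theta = 1

# Crude estimate of g on a grid: bin y by mu and take within-bin variances.
grid = np.linspace(1.0, 50.0, 40)
bins = np.digitize(mu, grid)
g_hat = np.array([y[bins == k].var() for k in range(1, len(grid))])
mid = 0.5 * (grid[:-1] + grid[1:])

# h(t) = integral of g_hat(u)**(-1/2) du, accumulated by the trapezoid rule.
h = np.concatenate([[0.0], np.cumsum(
    0.5 * (g_hat[:-1] ** -0.5 + g_hat[1:] ** -0.5) * np.diff(mid))])
# For Poisson-like data h should grow like 2*sqrt(t), up to an additive shift.
print(np.corrcoef(h, 2.0 * np.sqrt(mid))[0, 1])   # close to 1
```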

Bibliography

[1] Andrews D F, Gnanadesikan R, and Warner J L 1971 Transformations of multivariate data. Biometrics 27: 825-40
[2] Atkinson A C 1985 Plots, Transformations, and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Oxford University Press, New York
[3] Box G E P and Cox D R 1964 An analysis of transformations (with discussion). J. Roy. Statist. Soc., Ser. B 26: 211-46
[4] Box G E P and Cox D R 1982 An analysis of transformations revisited, rebutted. J. Amer. Statist. Assoc. 77: 209-10
[5] Box G E P and Tidwell P W 1962 Transformations of the independent variables. Technometrics 4: 531-50
[6] Carroll R J and Ruppert D 1981 On prediction and the power transformation family. Biometrika 68: 606-619
[7] Carroll R J and Ruppert D 1984 Power transformation when fitting theoretical models to data. J. Amer. Statist. Assoc. 79: 321-8
[8] Carroll R J and Ruppert D 1985 Transformations in regression: A robust analysis. Technometrics 27: 1-12
[9] Carroll R J and Ruppert D 1988 Transformation and Weighting in Regression. Chapman & Hall, New York and London
[10] Carroll R J and Ruppert D 1991 Prediction and tolerance intervals with transformation and/or weighting. Technometrics 33: 197-210
[11] Hinkley D V 1975 On power transformations to symmetry. Biometrika 62: 101-111
[12] Nychka D and Ruppert D 1995 Nonparametric transformations for both sides of a regression model. J. Roy. Statist. Soc., Ser. B 57: 519-532
[13] Manly B F J 1976 Exponential data transformations. Statistician 25: 37-42
[14] Ruppert D and Aldershof B 1989 Transformations to symmetry and homoscedasticity. J. Amer. Statist. Assoc. 84: 437-46
[15] Ruppert D, Carroll R J, and Cressie N 1989 A transformation/weighting model for estimating Michaelis-Menten parameters. Biometrics 45: 637-56
[16] Tibshirani R 1988 Estimating transformations for regression via additivity and variance stabilization. J. Amer. Statist. Assoc. 83: 394-405

[Figure 1 near here: two panels. Panel (a): the log transformation, with the x-axis labeled "lognormal" and the y-axis labeled "normal". Panel (b): the square-root transformation, with the x-axis labeled "sqrtnormal" and the y-axis labeled "normal".]

Figure 1

(a) A transformation to remove right skewness. Here the concave log transformation converts right-skewed lognormal data to normally distributed data. The densities of the original and transformed data are shown on the x and y axes, respectively. The lines show the mappings of the 5th, 50th, and 95th percentiles. (b) A variance-stabilizing transformation. Here the square-root transformation converts heteroscedastic data consisting of two populations to data with a constant variance. "*" and "o" denote the two populations. The lines show the mappings of the first and third quartiles of the two populations.
