Winsorization on linear mixed model (Case study: National exam of senior high school in West Java) Leny Yuliyani, Anang Kurnia, and Indahwati

Citation: AIP Conference Proceedings 1827, 020020 (2017); doi: 10.1063/1.4979436 View online: http://dx.doi.org/10.1063/1.4979436 View Table of Contents: http://aip.scitation.org/toc/apc/1827/1 Published by the American Institute of Physics

Leny Yuliyani a), Anang Kurnia b), Indahwati c)
Department of Statistics, Faculty of Mathematics and Natural Science, Bogor Agricultural University, Indonesia
Corresponding author: a) [email protected] b) [email protected] c) [email protected]

Abstract. Hierarchical data are typically modeled with a linear mixed model (LMM). The LMM assumes that both the errors and the random effects are normally distributed. In practice this assumption is difficult to meet, especially when the sample is small, and violations can be caused by outliers. In this paper we examine the effect of outliers on the random effects and the errors and handle them with the Winsorization technique. The application showed that Winsorization with an iterative tuning constant c produced a smaller root mean squared error, AIC, and BIC than the other methods. We conclude that Winsorization can be used to overcome outliers when fitting a linear mixed model.

Keywords: winsorization, outliers, linear mixed model.

INTRODUCTION

Data collected in research usually come from respondents with diverse backgrounds, such as differences in geography and culture. These differences give the data a hierarchical (multi-level) structure. The diversity can be handled by multistage random sampling, and it is sometimes necessary to analyze each level. In hierarchical data, individuals in the same group tend to be similar, so observations are in general dependent. This violates the independence assumption: if such data were analyzed with simple regression, the standard errors of the coefficients would be underestimated and hypothesis tests on the auxiliary variables would tend to be statistically significant (Hox, 2010).

Hierarchical data contain random effects at other levels, for example from the selection of areas. Ronald Fisher (1918) introduced a random effect model to study genetic relationships in heredity. In the 1950s, Charles Roy Henderson introduced the Best Linear Unbiased Estimate (BLUE) of fixed effects and the Best Linear Unbiased Prediction (BLUP) of random effects. A linear mixed model (LMM) is a model in which the response variable Y is affected not only by auxiliary variables X, known as fixed effects, but also by randomness in the selection of, for example, area or time, known as random effects.

In general the LMM assumes that the errors and the random effects are normally distributed. In practice this assumption may well be violated when the data contain outliers, and the violation may make the estimates of the regression coefficients and variance components inaccurate. Winsorization is a statistical transformation that limits extreme values in the data to reduce the effect of outliers. This approach was introduced by Charles P. Winsor (1895-1951).
Handling of outliers in simple regression was reviewed by Pusparum (2015), who showed that the Winsor approach with an added tuning-constant iteration performs better than the regular Winsor approach. Kurnia et al. (2013) used the Winsor method in small area estimation (SAE) and showed that Winsorization can serve as an alternative method to overcome outliers in small area estimation. In this study, outliers in the random effects and the errors are handled, with an application to the national examination scores of senior high schools in West Java.

Statistics and its Applications AIP Conf. Proc. 1827, 020020-1–020020-7; doi: 10.1063/1.4979436 Published by AIP Publishing. 978-0-7354-1495-2/$30.00

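METHODS

Linear Mixed Model

The general form of the linear mixed model (LMM) is

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{v} + \boldsymbol{\varepsilon} \tag{1}$$

where y is an observation vector of dimension n×1, X is a known covariate matrix of dimension n×p, β is a regression coefficient vector (fixed effects), Z is a design matrix of dimension n×q, v is a normally distributed random effect vector, and ε is an error vector. The basic assumption of the linear mixed model is that the random effects and the errors are uncorrelated, with v ~ N(0, G) and ε ~ N(0, R), where G = Var(v) and R = Var(ε) are covariance matrices that depend on unknown dispersion parameters or variance components (σ). From equation (1) and the normality assumption, y ~ N(Xβ, V) with V = R + ZGZ'. Estimation in the LMM covers the fixed effects and the random effects:

$$\hat{\boldsymbol{\beta}}_{BLUE} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y} \tag{2}$$

$$\tilde{\mathbf{v}}_{BLUP} = \mathbf{G}\mathbf{Z}'\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \tag{3}$$

The fixed effects are estimated with the Best Linear Unbiased Estimator (BLUE) and the random effects are predicted with the Best Linear Unbiased Predictor (BLUP). In practice the variance components in V are unknown. Therefore, in estimating the fixed and random effects, V is replaced by the estimator $\hat{\mathbf{V}}$, and the result is the empirical best linear unbiased predictor (EBLUP):

$$\hat{\mathbf{V}} = \hat{\mathbf{R}} + \mathbf{Z}\hat{\mathbf{G}}\mathbf{Z}' \tag{4}$$

where $\hat{\mathbf{G}}$ and $\hat{\mathbf{R}}$ are estimated by Maximum Likelihood Estimation (MLE) or Restricted Maximum Likelihood (REML). Jiang (2007) mentioned that the REML estimator of the variance components is consistent without assuming normality of the random effects and the errors. Henderson et al. (1959) assumed normality of v and ε and maximized the joint density of y and v with respect to β and v. This is equivalent to maximizing, up to an additive constant,

$$\log f(\mathbf{y}, \mathbf{v}) = -\frac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{v})'\mathbf{R}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{v}) - \frac{1}{2}\mathbf{v}'\mathbf{G}^{-1}\mathbf{v}$$

The partial derivatives with respect to β and v are

$$\frac{\partial \log f(\mathbf{y}, \mathbf{v})}{\partial \boldsymbol{\beta}} = \mathbf{X}'\mathbf{R}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{v})$$

$$\frac{\partial \log f(\mathbf{y}, \mathbf{v})}{\partial \mathbf{v}} = \mathbf{Z}'\mathbf{R}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{v}) - \mathbf{G}^{-1}\mathbf{v}$$

Setting both derivatives to zero gives the following equations:

$$\mathbf{X}'\mathbf{R}^{-1}\mathbf{X}\boldsymbol{\beta} + \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z}\mathbf{v} = \mathbf{X}'\mathbf{R}^{-1}\mathbf{y}$$

$$\mathbf{Z}'\mathbf{R}^{-1}\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z}\mathbf{v} + \mathbf{G}^{-1}\mathbf{v} = \mathbf{Z}'\mathbf{R}^{-1}\mathbf{y}$$

In matrix form these can be written as

$$\begin{bmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z} + \mathbf{G}^{-1} \end{bmatrix} \begin{bmatrix} \boldsymbol{\beta} \\ \mathbf{v} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{y} \end{bmatrix}$$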
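As a numerical illustration of the derivation above, the following minimal sketch solves the mixed model equations on made-up data and compares the solution with the BLUE and BLUP formulas (2) and (3). The random-intercept layout, the sizes q, m, p, and the values chosen for G and R are assumptions for illustration only; in practice G and R would be replaced by their ML or REML estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy random-intercept layout (hypothetical sizes): q groups, m units each.
q, m, p = 4, 5, 2
n = q * m
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # n x p fixed-effect design
Z = np.kron(np.eye(q), np.ones((m, 1)))                # n x q random-effect design
G = 2.0 * np.eye(q)                                    # Var(v), assumed known here
R = 1.0 * np.eye(n)                                    # Var(eps), assumed known here
y = (X @ np.array([1.0, 0.5])
     + Z @ rng.normal(scale=np.sqrt(2.0), size=q)
     + rng.normal(size=n))

# Equations (2) and (3): BLUE and BLUP through V = R + Z G Z'.
V = R + Z @ G @ Z.T
Vinv = np.linalg.inv(V)
beta_blue = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
v_blup = G @ Z.T @ Vinv @ (y - X @ beta_blue)

# Mixed model equations: one (p + q) x (p + q) linear system in (beta, v).
Rinv, Ginv = np.linalg.inv(R), np.linalg.inv(G)
lhs = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
                [Z.T @ Rinv @ X, Z.T @ Rinv @ Z + Ginv]])
rhs = np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y])
sol = np.linalg.solve(lhs, rhs)
beta_mme, v_mme = sol[:p], sol[p:]

# Both routes should agree up to floating-point error.
same_beta = np.allclose(beta_blue, beta_mme)
same_v = np.allclose(v_blup, v_mme)
print(same_beta, same_v)
```

The mixed model equations avoid inverting the n×n matrix V, which is the usual reason for preferring them when n is large and R is simple.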


Winsorization on Mixed Model

The Winsor method (Winsorizing or Winsorization) was first introduced by Charles P. Winsor in 1946 as an alternative for computing statistics when the observed data contain outliers. The basic principle of Winsorizing is to replace the value of an outlier with another, meaningful value under a certain condition. Welsh (1987) used this approach to estimate regression parameters, converting the observed values of the response variable (y) into winsorized values (y*). The winsorized observations are defined as

$$y_i^{*} = y_i\, I\big(\hat{\eta}(\alpha_1) \le e_i \le \hat{\eta}(\alpha_2)\big) + \hat{\eta}(\alpha_1)\big(I(e_i < \hat{\eta}(\alpha_1)) - \alpha_1\big) + \hat{\eta}(\alpha_2)\big(I(e_i > \hat{\eta}(\alpha_2)) - (1 - \alpha_2)\big)$$

where
$y_i$: the i-th value of the response variable (y)
$e_i$: the i-th error from parameter estimation, for i = 1, 2, ..., n
$\hat{\eta}(\alpha_1)$ and $\hat{\eta}(\alpha_2)$: the empirical $\alpha_1$ and $\alpha_2$ quantiles of the errors, with $0 < \alpha_1 < 0.5 < \alpha_2 < 1$

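Outliers in the LMM can occur in the random effects or in the errors. Winsorization of the random effects is defined as

$$\hat{v}_i^{*} = \begin{cases} \hat{v}_i, & \hat{\eta}(\alpha_1) \le \hat{v}_i \le \hat{\eta}(\alpha_2) \\ \hat{\eta}(\alpha_1), & \hat{v}_i < \hat{\eta}(\alpha_1) \\ \hat{\eta}(\alpha_2), & \hat{v}_i > \hat{\eta}(\alpha_2) \end{cases}$$

where $\hat{\eta}(\alpha_1)$ and $\hat{\eta}(\alpha_2)$ are the empirical $\alpha_1$ and $\alpha_2$ quantiles of the random effects. Similarly, outliers in the errors are handled by

$$e_{ij}^{*} = \begin{cases} e_{ij}, & \hat{\eta}(\alpha_1) \le e_{ij} \le \hat{\eta}(\alpha_2) \\ \hat{\eta}(\alpha_1), & e_{ij} < \hat{\eta}(\alpha_1) \\ \hat{\eta}(\alpha_2), & e_{ij} > \hat{\eta}(\alpha_2) \end{cases}$$

where the quantiles are now those of the errors. Later in this paper this method is called Winsor 1. Another definition of winsorized observations was introduced by Huber and Ronchetti (2009), in which the winsorization boundary is set by a tuning constant (c). The tuning constant used was c = 1.96, with the Winsor method given by the following formulas. This method is called Winsor 2.

$$\hat{v}_i^{*} = \begin{cases} \hat{v}_i, & |\hat{v}_i| \le c\,s_v \\ -c\,s_v, & \hat{v}_i < -c\,s_v \\ c\,s_v, & \hat{v}_i > c\,s_v \end{cases}$$

$$\varepsilon_{ij}^{*} = \begin{cases} \varepsilon_{ij}, & |\varepsilon_{ij}| \le c\,s_\varepsilon \\ -c\,s_\varepsilon, & \varepsilon_{ij} < -c\,s_\varepsilon \\ c\,s_\varepsilon, & \varepsilon_{ij} > c\,s_\varepsilon \end{cases}$$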

where $s_v$ is the standard deviation of the random effects and $s_\varepsilon$ is the standard deviation of the errors. The procedure we develop is as follows:
1. Fit the LMM of the response variable (y) on the auxiliary variables (x) to obtain the fitted values, fixed effects, random effects, and errors.
2. Transform the random effects ($\hat{v}_i$) by winsorization to obtain new random effects ($\hat{v}_i^{*}$).
3. Substitute the new random effects ($\hat{v}_i^{*}$) into the model to obtain $y_{ij}^{*} = X\hat{\beta} + \hat{v}_i^{*} + \hat{\varepsilon}_{ij}$.
4. Fit the LMM of the new response variable ($y_{ij}^{*}$) on the auxiliary variables (x) to obtain the fitted values, fixed effects, random effects, and errors.
5. Transform the errors ($\hat{\varepsilon}_{ij}$) by winsorization to obtain new errors ($\hat{\varepsilon}_{ij}^{*}$).
6. Substitute the new errors ($\hat{\varepsilon}_{ij}^{*}$) into the model to obtain $y_{ij}^{w} = X\hat{\beta} + \hat{v}_i + \hat{\varepsilon}_{ij}^{*}$.
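The two winsorizing rules used in steps 2 and 5 can be sketched as simple clipping functions. This is a minimal illustration, not the authors' implementation: the function names, the default quantile levels for Winsor 1, the use of the sample standard deviation (ddof=1), and the example values are all assumptions. Only c = 1.96 comes from the text.

```python
import numpy as np

def winsor1(u, alpha1=0.05, alpha2=0.95):
    """Winsor 1: pull values outside the empirical quantiles back to them.
    The default levels alpha1 and alpha2 are illustrative assumptions."""
    u = np.asarray(u, dtype=float)
    eta1, eta2 = np.quantile(u, [alpha1, alpha2])
    return np.clip(u, eta1, eta2)

def winsor2(u, c=1.96):
    """Winsor 2: pull values outside [-c*s, c*s] back to those bounds."""
    u = np.asarray(u, dtype=float)
    s = u.std(ddof=1)  # sample standard deviation (ddof=1 is an assumption)
    return np.clip(u, -c * s, c * s)

# Made-up predicted random effects with one gross outlier in the last position.
v_hat = np.array([-0.4, -0.1, 0.0, 0.2, 0.3, 5.0])
v_w1 = winsor1(v_hat, 0.1, 0.9)
v_w2 = winsor2(v_hat)
print(v_w1)
print(v_w2)
```

Both functions apply equally to predicted random effects and to errors. Winsor 1 always moves the most extreme values to the chosen quantiles, while Winsor 2 changes a value only if it lies more than c standard deviations from zero.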


In Winsor 2 with the iterative procedure, the steps are repeated using the new errors, the new random effects, and the new standard deviations, until the resulting $y_{ij}^{w}$ gives convergent parameter estimates (fixed effects).
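The six steps and the iteration can be sketched end to end as follows. This is an illustration under stated assumptions rather than the authors' code: fit_mme is a hypothetical stand-in that solves the mixed model equations for a random-intercept model with a known variance ratio lam, whereas the paper estimates variance components by ML or REML. The stopping rule (maximum change in the fixed effects below tol), the iteration cap, and the simulated data are likewise assumptions.

```python
import numpy as np

def fit_mme(y, X, Z, lam=1.0):
    """Stand-in fit: mixed model equations with R = s2*I and G = (s2/lam)*I.
    The variance ratio lam is assumed known (the paper uses ML/REML)."""
    p, q = X.shape[1], Z.shape[1]
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + lam * np.eye(q)]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    beta, v = sol[:p], sol[p:]
    return beta, v, y - X @ beta - Z @ v

def winsor2(u, c=1.96):
    """Clip to [-c*s, c*s], with s the sample standard deviation of u."""
    s = np.std(u, ddof=1)
    return np.clip(u, -c * s, c * s)

def winsorized_fit(y, X, Z, c=1.96, tol=1e-6, max_iter=50):
    y_w = np.asarray(y, dtype=float).copy()
    beta_old = None
    for _ in range(max_iter):
        beta, v, e = fit_mme(y_w, X, Z)             # step 1
        y_star = X @ beta + Z @ winsor2(v, c) + e   # steps 2 and 3
        beta, v, e = fit_mme(y_star, X, Z)          # step 4
        y_w = X @ beta + Z @ v + winsor2(e, c)      # steps 5 and 6
        # Assumed stopping rule: fixed effects stop changing between passes.
        if beta_old is not None and np.max(np.abs(beta - beta_old)) < tol:
            break
        beta_old = beta
    return beta, v, y_w

# Simulated data (hypothetical): 8 areas, 5 units each, one planted outlier.
rng = np.random.default_rng(1)
q, m = 8, 5
groups = np.repeat(np.arange(q), m)
Z = np.eye(q)[groups]                               # n x q area indicators
X = np.column_stack([np.ones(q * m), rng.normal(size=q * m)])
y = X @ np.array([2.0, 1.0]) + Z @ rng.normal(size=q) + rng.normal(size=q * m)
y[0] += 30.0

beta_w, v_w, y_w = winsorized_fit(y, X, Z)
print(beta_w)
```

Each pass recomputes the standard deviations from the latest fit, so the clipping bounds tighten as the influence of the outlier shrinks. With a real mixed model routine, fit_mme would be replaced by an ML or REML fit that returns the same three quantities.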

RESULT AND DISCUSSION

Our case study concerns national exam scores and school exam scores, known in Indonesia as UN and US, respectively. The response (average UN score) is a continuous variable that was assumed to be normally distributed. The response was modelled with the average US score and accreditation as the auxiliary variables. The average US score is a continuous variable and accreditation is a categorical variable, so accreditation entered the model as dummy variables. Accreditation has the categories A, B, C, and NA (not accredited), so three dummy variables are required. Regencies and municipalities in West Java were defined as the area level, and the smallest unit is the school. Before fitting the regression model, we explored the data to detect outliers and to examine the distribution of the data.

FIGURE 1. Histogram (a) and Boxplot (b) of the Average Score of UN at School Level in West Java

Figure 1 shows that there are some outliers and that the histogram extends to the right. A normality test was conducted and the result showed that the average UN score could not be assumed to be normally distributed (Figure 2). Hence, we need a robust method to overcome the outliers.

Figure 2 legend (probability plot of UN): Mean 38.70; StDev 2.145; N 424; AD 30.360; P-Value