
Modelling GME and PLS estimation methods for evaluating the Job Satisfaction in the Public Sector

Enrico Ciavolino, Researcher of Statistics, University of Salento, Department of Philosophy & Social Science, Palazzo Parlangeli, Via Stampacchia n. 1, 73100, Lecce, Italy, [email protected]

Keywords: Generalized Maximum Entropy, Partial Least Squares, Job Satisfaction, Multicollinearity, Bootstrap
Category: Research Paper

1. Introduction
Golan et al. (1996) proposed an alternative method for estimating the parameters of regression models in the case of ill-posed problems, as an extension of the entropy measure introduced by Shannon (1948) and as a generalization of the Maximum Entropy Principle (MEP) developed by Jaynes (1957, 1968). The job satisfaction model is used as a motivating example for a comparative study of the performance of the GME and PLS regression methods. Both methods are applied to the job satisfaction model, where several levels of correlation are generated among the predictors. For each level of correlation, regression coefficients and diagnostic values are calculated, in order to show the performance of both methods in the case of ill-posed problems. The paper is divided into two main parts: the first introduces the two estimation methods, giving a general overview of both techniques and their main characteristics; the second gives a brief introduction to the job satisfaction model and then presents the simulation study, comparing the estimation results and discussing the multicollinearity problem.

2. The Generalized Maximum Entropy Method
The GME method is based on the re-parameterization and re-formulation of the linear regression model y = Xβ + ε, so that the parameters can be estimated within the framework of the MEP.

2.1 Re-parameterization and Re-formulation of the Model
The basic idea of the GME method is to re-parameterize the regression coefficients and the error terms as expected values of discrete random variables, and to re-formulate the linear regression model so that the parameters can be estimated within the MEP approach. Consider a linear regression model with n observations and m variables, as follows:

y_{n,1} = X_{n,m}\,\beta_{m,1} + \varepsilon_{n,1} = X_{n,m}\, Z_{m,mM}\, p_{mM,1} + V_{n,nN}\, w_{nN,1}

(1)

Model (1) replaces the m coefficients and the n error terms with mM + nN unknown probabilities, which are the quantities estimated under the MEP. It is always possible to write each parameter β_k as a convex combination of a finite set of support values, in this case five, {z_{k1}, z_{k2}, z_{k3}, z_{k4}, z_{k5}} (Paris, 2001), that is: β_k = p_{k1} z_{k1} + p_{k2} z_{k2} + p_{k3} z_{k3} + p_{k4} z_{k4} + p_{k5} z_{k5}, where the probabilities satisfy 0 ≤ p_{kj} ≤ 1, j = 1, …, 5, and Σ_j p_{kj} = 1. Similarly, each error term is treated as a discrete random variable. The matrices Z and V are block-diagonal matrices, whose diagonal blocks are the vectors of support values:

\beta = Z\,p =
\begin{bmatrix}
z_1' & 0 & \cdots & 0 \\
0 & z_2' & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & z_m'
\end{bmatrix}
\begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_m \end{bmatrix},
\qquad
\varepsilon = V\,w =
\begin{bmatrix}
v_1' & 0 & \cdots & 0 \\
0 & v_2' & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & v_n'
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}

(2)

The support values are defined by the vectors z and v (3), whose dimension, usually from 2 to 7 elements (Golan et al., 1996), is given by the number of fixed points, M and N respectively. The constant c defines a support symmetric around zero; in this application c = 1 and M = N = 5.

z_k' = [-c, \; -c/2, \; 0, \; c/2, \; c]
\qquad
v_k' = [-c, \; -c/2, \; 0, \; c/2, \; c]

(3)

The vectors p and w (4) are the probabilities associated, respectively, with the β regression coefficients and the ε error terms. The objective is to estimate these probabilities, so as to represent the coefficients and the error terms as expected values of discrete random variables.

p_k' = [p_{k1}, \; p_{k2}, \; p_{k3}, \; p_{k4}, \; p_{k5}]
\qquad
w_k' = [w_{k1}, \; w_{k2}, \; w_{k3}, \; w_{k4}, \; w_{k5}]

(4)
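As a purely illustrative numerical example (the probabilities below are assumed, not estimated from the data), take the support z_k' = [-1, -0.5, 0, 0.5, 1] of (3) and p_k' = [0.05, 0.15, 0.30, 0.30, 0.20]; the coefficient is then recovered as the expected value

\beta_k = \sum_{j=1}^{5} p_{kj}\, z_{kj} = (-1)(0.05) + (-0.5)(0.15) + (0)(0.30) + (0.5)(0.30) + (1)(0.20) = 0.225.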

The estimation of the unknown parameters p and w is obtained by maximizing Shannon's entropy function:

H(p, w) = -\,p_{1,mM}'\, \ln p_{mM,1} \;-\; w_{1,nN}'\, \ln w_{nN,1}

(5)
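To see the role of the constraints, consider a purely illustrative calculation with the same vectors as above: without any data constraint, (5) is maximized by the uniform distribution p_k' = [0.2, 0.2, 0.2, 0.2, 0.2], whose contribution to the entropy is -\sum_j 0.2 \ln 0.2 = \ln 5 \approx 1.609, whereas the vector p_k' = [0.05, 0.15, 0.30, 0.30, 0.20] gives approximately 1.479, a lower entropy. It is therefore the consistency constraints introduced below that move the solution away from the uniform distribution, which, with the symmetric support (3), would imply β_k = 0.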

The constraints used to estimate the unknown parameters are of two kinds: consistency and normalization constraints. The former represent the information generated from the data, that is, the model defined in equation (1); the latter impose the conditions 0 ≤ p_{kj} ≤ 1 {j = 1, …, 5; k = 1, …, m}, Σ_j p_{kj} = 1 {k = 1, …, m} and 0 ≤ w_{kj} ≤ 1 {j = 1, …, 5; k = 1, …, n}, Σ_j w_{kj} = 1 {k = 1, …, n}. The following sections discuss the consistency and normalization constraints, the definition of the global matrix of constraints and the optimization function.

2.2 Matrix of the Consistency Constraints
The matrix equation reported in (6) is the explicit representation of equation (1), where all the elements are written out. The consistency constraints are defined by the known part of the following matrix formulation. The vectors to estimate are p_{mM,1}, whose elements are the vectors p_{M,1}, and w_{nN,1}, whose elements are the vectors w_{N,1}. The number of fixed points is M = 5 and N = 5 respectively.

\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \cdots & x_{n,m}
\end{bmatrix}
\begin{bmatrix}
z' & 0 & \cdots & 0 \\
0 & z' & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & z'
\end{bmatrix}
\begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_m \end{bmatrix}
+
\begin{bmatrix}
v' & 0 & \cdots & 0 \\
0 & v' & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & v'
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
\quad (6)

with z' = v' = [-c, -c/2, 0, c/2, c].

Therefore, the information generated from the data consists of the matrix X of the predictors, the matrices Z and V of the support values, and the vector y of the dependent variable, while the unknown parameters are the vectors p and w.

2.3 Matrix of the Normalization Constraints
The normalization constraints (7, 8) are necessary because, for each probability vector of the coefficients and of the error terms (e.g., p_{M,1}), the estimated probabilities have to sum to 1, that is, Σ_j p_{kj} = 1 {k = 1, …, m} and Σ_j w_{kj} = 1 {k = 1, …, n}.

I^*_{m,mM}\; p_{mM,1} = 1_{m,1}
\;\Rightarrow\;
\begin{bmatrix}
1\,1\,1\,1\,1 & 0 & \cdots & 0 \\
0 & 1\,1\,1\,1\,1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1\,1\,1\,1\,1
\end{bmatrix}
\begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_m \end{bmatrix}
=
\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}

(7)

J^*_{n,nN}\; w_{nN,1} = 1_{n,1}
\;\Rightarrow\;
\begin{bmatrix}
1\,1\,1\,1\,1 & 0 & \cdots & 0 \\
0 & 1\,1\,1\,1\,1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1\,1\,1\,1\,1
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}
=
\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}

(8)

The matrices I* (7) and J* (8) are the Kronecker products between an identity matrix and a row vector of ones, respectively for the coefficients and for the errors, that is, I^*_{m,mM} = I_{m,m} ⊗ 1'_{1,M} and J^*_{n,nN} = I_{n,n} ⊗ 1'_{1,N}. Both normalization constraints enter the global matrix of constraints defined in the next section; a minimal construction of I* and J* is sketched below.
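A minimal MATLAB sketch of this construction (it assumes m and n hold the number of predictors and of observations):

M = 5;  N = 5;                            % number of support points for coefficients and errors
Istar = kron(eye(m), ones(1, M));         % I*: m x (m*M), a block of five ones per coefficient
Jstar = kron(eye(n), ones(1, N));         % J*: n x (n*N), a block of five ones per error term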

2.4 Definition of the Global Estimation Problem
The whole system of constraints can be expressed through a matrix A of constraints and a vector b collecting the normalization values and the observed responses (9):

A_{(m+2n),(mM+nN)} =
\begin{bmatrix}
I^*_{m,mM} & 0_{m,nN} \\
0_{n,mM} & J^*_{n,nN} \\
(X\,Z)_{n,mM} & V_{n,nN}
\end{bmatrix}
\qquad
b_{(m+2n),1} =
\begin{bmatrix} 1_{m+n,1} \\ y_{n,1} \end{bmatrix}

(9)

A juxtaposed supervector P_{mM+nN,1} = [p_{mM,1}; w_{nN,1}] (in this application P_{25+500,1} = [p_{25,1}; w_{500,1}]), containing the unknown parameters of both the coefficients and the error terms, defines the function to be maximized:

f = H(P) = -\,P_{1,mM+nN}'\, \ln P_{mM+nN,1}

(10)

The GME objective function is strictly concave on the interior of the normalization constraints, and a unique solution exists if the intersection between these constraints and the consistency constraints is non-empty. In MATLAB the problem can be solved with fmincon, where A and b define the equality constraints (9), lb and ub are the lower and upper bounds on the probabilities, and p0 is the (mM + nN, 1) vector of initialization values, with all elements fixed to 0.1:

[P, fval] = fmincon('f', p0, [ ], [ ], A, b, lb, ub, [ ])

(11)
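As a self-contained illustration, the whole GME estimation can be sketched in MATLAB as follows. The variable names (X, y, negH, beta_gme) and the small positive lower bound on the probabilities are assumptions introduced here; the support points and the constraint matrices follow (1)-(9), and fmincon minimizes the negative of the entropy (10), which is equivalent to maximizing it.

[n, m] = size(X);                         % X: n x m predictor matrix, y: n x 1 response
M = 5;  N = 5;  c = 1;                    % number of support points and support bound, as in (3)
z = [-c, -c/2, 0, c/2, c];                % support for the coefficients
v = [-c, -c/2, 0, c/2, c];                % support for the errors
Z = kron(eye(m), z);                      % block-diagonal support matrix, m x (m*M)
V = kron(eye(n), v);                      % block-diagonal support matrix, n x (n*N)
Istar = kron(eye(m), ones(1, M));         % normalization constraints for p, cf. (7)
Jstar = kron(eye(n), ones(1, N));         % normalization constraints for w, cf. (8)
A = [Istar,         zeros(m, n*N);        % global constraint matrix, cf. (9)
     zeros(n, m*M), Jstar;
     X*Z,           V];
b = [ones(m + n, 1); y];
negH = @(P) sum(P .* log(P));             % negative entropy, cf. (10)
p0 = 0.1 * ones(m*M + n*N, 1);            % initialization values
lb = 1e-8 * ones(m*M + n*N, 1);           % small positive bound to avoid log(0) (assumption)
ub = ones(m*M + n*N, 1);
[P, fval] = fmincon(negH, p0, [], [], A, b, lb, ub);   % A, b act as equality constraints
p_hat = P(1 : m*M);                       % estimated probabilities of the coefficients
w_hat = P(m*M + 1 : end);                 % estimated probabilities of the errors
beta_gme = Z * p_hat;                     % recovered regression coefficients, beta = Z*p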

3. The Partial Least Squares Estimation Method
The aim of PLS regression (PLSr) is to estimate a number p of components T, with p ≤ rank(X), such that they contain a maximum of the relevant information concerning the relation between X and y. The PLS algorithm can be interpreted as the problem of computing weight vectors w and c which maximize the covariance between the X-score vector t and the Y-score vector u; hence the function to be maximized is f(w, c) = t'u, with respect to w and c, subject to orthonormality constraints on w and c. The matrix of X-scores, T = Xw*, reconstructs the X matrix with minimal residuals and, at the same time, provides good predictors of y. The regression model can therefore be written as y = TC' + E, where C' is the matrix of regression coefficients of y on the T scores. The above equation can be rewritten as:

y = X\,w^*\,C' + E

(13)

The matrix w* is equal to w(P'w)^{-1} (cf. Wold et al., 2001), which allows the model to be rewritten in terms of the original manifest variables X; for a single response, the first weight vector is w = X'y, suitably normalized. The product w*C' gives the PLS regression coefficients β_PLS, so the model can be formalized as:

y = X\,\beta_{PLS} + E

(14)

The parameter estimates are computed with the Non-linear Iterative Partial Least Squares (NIPALS) algorithm (Wold et al., 2001); a minimal sketch for a single response is given below.
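The following MATLAB sketch of NIPALS for a single response (PLS1) is an illustration under assumed conventions: X and y are mean-centred, a holds the chosen number of components, and the variable names are not taken from the paper.

[nObs, mVar] = size(X);                   % centred predictors X and centred response y
a = 3;                                    % assumed number of components, a <= rank(X)
W = zeros(mVar, a);  Pload = zeros(mVar, a);  C = zeros(a, 1);
Xk = X;  yk = y;
for k = 1:a
    w = Xk' * yk;  w = w / norm(w);       % weight vector maximizing cov(t, y)
    t = Xk * w;                           % X-score
    c = (t' * yk) / (t' * t);             % y-loading: regression of y on t
    p = (Xk' * t) / (t' * t);             % X-loading
    Xk = Xk - t * p';                     % deflation of X
    yk = yk - t * c;                      % deflation of y
    W(:, k) = w;  Pload(:, k) = p;  C(k) = c;
end
Wstar = W / (Pload' * W);                 % w* = W (P'W)^(-1), cf. (13)-(14)
beta_pls = Wstar * C;                     % PLS regression coefficients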

4. The Job Satisfaction Model in the Public Sector
The GME method is applied following a study for evaluating job satisfaction in the public sector, with reference to the labour market of the city of Lecce, Italy. The model, reported in Figure 1, relates the five explicative constructs (Leadership, Salary, Future Expectations, Professional Improvement, Working Time) to Job Satisfaction.

Figure 1 – The Job Satisfaction Model in the Public Sector

The questionnaire is divided into clearly defined thematic areas. There are five explicative variables: X1 Leadership; X2 Salary; X3 Future Expectations; X4 Professional Improvement; X5 Working Time; and one dependent variable: Y Job Satisfaction. The evaluations are made on an ordinal scale from 1 (minimum) to 5 (maximum). The latent variables are estimated as the average of the manifest variables belonging to each of them.

5. Comparative Study
The comparative study aims to show the potential of the GME method in the presence of multicollinearity among the predictors. The first step analyses the data as collected, with their original values. The following steps introduce an experimental condition into the dataset, in which the multicollinearity among the predictors is progressively increased by means of the following formula:

x_j^{new} = r \cdot x_{j-1} + (1 - r) \cdot x_j^{old}

(15)

Formula (15) expresses the degree of association between two consecutive variables: as the coefficient r increases towards one, the new variable x_j^{new} becomes almost equal to the preceding x_{j-1}. The condition number K = (λ_max/λ_min)^{1/2} is calculated at each step to measure the degree of multicollinearity of the matrix, with r increasing from 0.8 to 1 in steps of 0.02 (Table 1).
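Before turning to the results in Table 1, one simulation step can be sketched in MATLAB as follows. It assumes X holds the five predictors as columns, and that the recursion (15) is applied sequentially to the already transformed columns, so that at r = 1 every predictor collapses onto the first one (one possible reading of the formula).

r = 0.90;                                 % association degree, raised from 0.8 towards 1
Xr = X;
for j = 2:size(Xr, 2)
    Xr(:, j) = r * Xr(:, j-1) + (1 - r) * X(:, j);   % recursion (15)
end
s = svd(Xr);                              % singular values of the transformed predictor matrix
K = s(1) / s(end);                        % K = (lambda_max/lambda_min)^(1/2), with lambda the
                                          % eigenvalues of Xr'*Xr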

Step                 1        2        3        4        5        6        7        8        9        10       11       12
Condition Number K   3.9625   29.751   33.725   38.741   45.249   53.999   66.342   84.981   116.22   178.98   367.83   +∞

Table 1 – Steps of Simulation and Condition Number 'K'

The λ_max and λ_min are, respectively, the largest and the smallest eigenvalue associated with the X matrix (i.e., of X'X), and values of K greater than 30 indicate serious multicollinearity (Belsley et al., 1980). At each step a bootstrap re-sampling technique is used to estimate the following quantities: beta coefficients, standard errors, significances, bias and mean squared error. The values are calculated for both the PLS and GME methods.
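The bootstrap evaluation at a given simulation step can be sketched as follows. The number of resamples B, the helper function gme_estimate (for which the PLS sketch above could equally be substituted) and the full-sample reference estimate beta_full are hypothetical names introduced here for illustration, not details taken from the paper.

B = 500;                                  % assumed number of bootstrap resamples
[nObs, mVar] = size(Xr);
beta_boot = zeros(mVar, B);
for bIdx = 1:B
    idx = randi(nObs, nObs, 1);           % resample units with replacement
    beta_boot(:, bIdx) = gme_estimate(Xr(idx, :), y(idx));   % hypothetical estimator function
end
beta_hat = mean(beta_boot, 2);            % bootstrap estimate of the coefficients
se_hat   = std(beta_boot, 0, 2);          % bootstrap standard errors
bias_hat = beta_hat - beta_full;          % bias w.r.t. the full-sample estimate beta_full
mse_hat  = mean((beta_boot - beta_full).^2, 2);   % mean squared error of the coefficients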

One of the most interesting results concerns the regression coefficients: as the multicollinearity increases, the GME estimates converge, whereas the PLSr estimates diverge. This is a significant result because, in the limiting case of maximum correlation among the variables, all the regression coefficients should take the same value, as shown in Figure 2 (GME panel).


Figure 2 – The GME and PLS Beta Coefficients

There are real applications in which strong collinearity among the variables is present and it is not possible to leave any of them out: for instance, the dynamic equation of a vehicle, which contains both the square and the cube of the velocity, or social and economic models, where it is of interest to show how strongly correlated variables can have the same impact on the dependent variable.


Figure 3 – The GME and PLS Standard Errors

Moreover, since the GME regression coefficients become equal, the same standard error is estimated for all of them: in Figure 3 (GME panel) the standard errors converge to the same value, whereas for PLS they increase with the multicollinearity. This inflation of the standard errors leads the researcher to reject coefficients whose p-value is greater than 0.05, because the significance test depends on the standard error.


Figure 4 – The GME and PLS Significance

PLSr outperforms GME in terms of Mean Squared Error but, as reported in Figure 5, the difference is of the order of one part in one thousand.


Figure 5 – The GME and PLS Mean Squared Error

Parameter reformulation with GME always produces, mathematically, unbiased estimators (Golan et al., 1996): assuming that the support Z can be specified so as to span the true value of β, the GME estimator is consistent, and the simulation results reported in Figure 6 (GME panel) support this assertion.


Figure 6 – The GME and PLS BIAS


6. Concluding Remarks
The implementation and the potential of the GME method have been shown by defining the re-parameterization and re-formulation of the regression model as a non-linear programming problem, expressing the consistency and normalization constraints in a single matrix formulation, and presenting the MATLAB estimation function. The job satisfaction model illustrates the results of the GME method and its potential for the more general evaluation of customer satisfaction, in social, economic and psychometric contexts where regression models are used. The comparative bootstrap study, based on the job satisfaction model, shows how, in the presence of multicollinearity, GME regression can be considered a valid alternative to PLSr: as the multicollinearity increases, the beta coefficients converge to a unique value, which means that in the extreme case of identical predictors the regression coefficients are equal. The standard errors also decrease, giving assurance that significant coefficients are not rejected. The GME estimator is asymptotically unbiased and its results are more stable; moreover, there is no wrong-sign problem, that is, no discrepancy between the sign of the beta coefficients and that of the correlation matrix. The drawback of this approach is that the GME estimator depends crucially on the subjective and exogenous information supplied by the researcher, namely the matrix of support values. Another problem is the computational burden (Ciavolino, 2007): the number of constraints increases with the number of variables and units. Future developments concern the implementation of a new algorithm for structural equation models, already proposed by Al-Nasser (2003).

REFERENCES
Al-Nasser A. D. (2003). Customer Satisfaction Measurement Models: Generalized Maximum Entropy Approach. Pakistan Journal of Statistics, 19(2), 213-226.
Belsley D. A., Kuh E., Welsch R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York.
Ciavolino E. (2007). The Entropy Theory for evaluating the Job Satisfaction. GFKL 2007, Freiburg, March 2007.
Ciavolino E., Al-Nasser A. D., D'Ambra A. (2006). The Generalized Maximum Entropy Estimation method for the Structural Equation Models. GFKL 2006, Berlin, March 2006.
Golan A., Judge G., Miller D. (1996). Maximum Entropy Econometrics. Wiley, London.
Höskuldsson A. (1988). PLS Regression Methods. Journal of Chemometrics, 2, 211-228.
Jaynes E. T. (1957). Information Theory and Statistical Mechanics. The Physical Review, 106(4), 620-630.
Jaynes E. T. (1968). Prior Probabilities. IEEE Transactions on Systems Science and Cybernetics, SSC-4(3), 227-241.
Paris Q. (2001). Multicollinearity and maximum entropy estimators. Economics Bulletin, 3(11), 1-9.
Shannon C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27, 379-423.
Wold S., Sjöström M., Eriksson L. (2001). PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58, 109-130.
