COMPARING THREE METHODS OF HANDLING MULTICOLLINEARITY USING SIMULATION APPROACH

NORLIZA BINTI ADNAN

UNIVERSITI TEKNOLOGI MALAYSIA

"We hereby declare that we have read this thesis and in our opinion this thesis is sufficient in terms of scope and quality for the award of the degree of Master of Science (Mathematics)"

Signature : ........................................................................
Name of Supervisor I : Dr. Maizah Hura Binti Ahmad
Date : April 28, 2006

Signature : ........................................................................
Name of Supervisor II : Dr. Robiah Binti Adnan
Date : April 28, 2006

* Delete as necessary

BAHAGIAN A – Pengesahan Kerjasama*

Adalah disahkan bahawa projek penyelidikan tesis ini telah dilaksanakan melalui kerjasama antara _______________________ dengan _______________________

Disahkan oleh:
Tandatangan : ....................................................
Nama : ....................................................
Jawatan : ....................................................
(Cop rasmi)
Tarikh : ...............................

* Jika penyediaan tesis/projek melibatkan kerjasama.

BAHAGIAN B – Untuk Kegunaan Pejabat Sekolah Pengajian Siswazah

Tesis ini telah diperiksa dan diakui oleh:

Nama dan Alamat Pemeriksa Luar :
Prof Madya Dr Abd Aziz bin Jemain
School of Mathematical Sciences
Faculty of Science & Technology
Universiti Kebangsaan Malaysia
43600 Bangi
Selangor Darul Ehsan

Nama dan Alamat Pemeriksa Dalam :
Dr Zalina binti Mohd Daud
Angkatan Tentera Malaysia (ATMA)
Kem Sungai Besi
57000 Kuala Lumpur

Nama Penyelia Lain (jika ada) :

Disahkan oleh Penolong Pendaftar di SPS:
Tandatangan : ..................................................................
Nama : GANESAN A/L ANDIMUTHU
Tarikh : ..............................

PSZ 19:16 (Pind. 1/97)

UNIVERSITI TEKNOLOGI MALAYSIA

BORANG PENGESAHAN STATUS TESIS

JUDUL : COMPARING THREE METHODS OF HANDLING MULTICOLLINEARITY USING SIMULATION APPROACH

SESI PENGAJIAN : 2/2005/2006

Saya NORLIZA BINTI ADNAN (HURUF BESAR) mengaku membenarkan tesis (PSM/Sarjana/Doktor Falsafah)* ini disimpan di Perpustakaan Universiti Teknologi Malaysia dengan syarat-syarat kegunaan seperti berikut:

1. Tesis adalah hakmilik Universiti Teknologi Malaysia.
2. Perpustakaan Universiti Teknologi Malaysia dibenarkan membuat salinan untuk tujuan pengajian sahaja.
3. Perpustakaan dibenarkan membuat salinan tesis ini sebagai bahan pertukaran antara institusi pengajian tinggi.
4. **Sila tandakan (  )

   SULIT          (Mengandungi maklumat yang berdarjah keselamatan atau kepentingan Malaysia seperti yang termaktub di dalam AKTA RAHSIA RASMI 1972)

   TERHAD         (Mengandungi maklumat TERHAD yang telah ditentukan oleh organisasi/badan di mana penyelidikan dijalankan)

   TIDAK TERHAD

Disahkan oleh

____________________________________
(TANDATANGAN PENULIS)

Alamat Tetap:
Rancangan Belia Dua, Sungai Panjang,
45300 Sungai Besar,
Selangor Darul Ehsan

Tarikh : APRIL 28, 2006

__________________________________
(TANDATANGAN PENYELIA)

Dr. Maizah Hura binti Ahmad
Nama Penyelia

Tarikh : APRIL 28, 2006

CATATAN : * Potong yang tidak berkenaan.
** Jika tesis ini SULIT atau TERHAD, sila lampirkan surat daripada pihak berkuasa/organisasi berkenaan dengan menyatakan sekali sebab dan tempoh tesis ini perlu dikelaskan sebagai SULIT atau TERHAD.
Tesis dimaksudkan sebagai tesis bagi Ijazah Doktor Falsafah dan Sarjana secara penyelidikan, atau disertasi bagi pengajian secara kerja kursus dan penyelidikan, atau Laporan Projek Sarjana Muda (PSM).

COMPARING THREE METHODS OF HANDLING MULTICOLLINEARITY USING SIMULATION APPROACH

NORLIZA BINTI ADNAN

A thesis submitted in fulfilment of the requirements for the award of the degree of Master of Science (Mathematics)

Faculty of Science Universiti Teknologi Malaysia

MAY 2006


"I declare that this thesis entitled "Comparing Three Methods Of Handling Multicollinearity Using Simulation Approach" is the result of my own research except as cited in the references. The thesis has not been accepted for any degree and is not concurrently submitted in candidature of any other degree."

Signature : ....................................................
Name : Norliza binti Adnan
Date : April 28, 2006


To My loving and supportive parents, Hj Adnan and Pn Ramnah My siblings, Hairil, Aizam, Haffiz and Sarah

and

Pn Rusnah and family, and all my supportive friends


ACKNOWLEDGEMENT

"In the name of ALLAH, the All-Merciful, the All-Compassionate. All praise be to ALLAH for giving me the strength and courage to complete this thesis."

Great gratitude and appreciation are expressed to all those who contributed to the successful completion of this thesis, either directly or indirectly. In particular, I wish to express my sincere appreciation to my main thesis supervisor, Dr. Maizah Hura binti Ahmad, for her encouragement, guidance, criticism and motivation. I am also very thankful to my co-supervisor, Dr. Robiah binti Adnan, for her guidance, advice and motivation. Without their continued support and interest, this thesis would not have been the same as presented here. An appreciation also goes to Universiti Teknologi Malaysia for financial support.

Lots of gratitude and special thanks also to my family for their support, love and encouragement throughout my study. My sincere appreciation also extends to all my lovely friends.


ABSTRACT

In regression, the objective is to explain the variation in one or more response variables by associating this variation with proportional variation in one or more explanatory variables. A frequent obstacle is that several of the explanatory variables will vary in rather similar ways. This phenomenon, called multicollinearity, is a common problem in regression analysis. Handling the multicollinearity problem in regression analysis is important because least squares estimation assumes that the predictor variables are not correlated with each other. The performances of ridge regression (RR), principal component regression (PCR) and partial least squares regression (PLSR) in handling the multicollinearity problem in simulated data sets are compared to give future researchers a comprehensive view of the best procedure for handling multicollinearity problems. PCR is a combination of principal component analysis (PCA) and ordinary least squares regression (OLS), while PLSR is an approach similar to PCR in that components that can be used to reduce the number of variables need to be constructed. RR, on the other hand, is a modified least squares method that allows a biased but more precise estimator. The algorithm for each method is described, and for the purpose of comparing the three methods, simulated data sets in which the number of predictor variables was less than the number of observations were used. The goal was to develop a linear equation that relates all the predictor variables to a response variable. For comparison purposes, mean square errors (MSE) were calculated. A Monte Carlo simulation study was used to evaluate the effectiveness of these three procedures. The analysis, including all simulations and calculations, was done using the statistical package S-PLUS 2000.


ABSTRAK

Objektif bagi regresi ialah untuk menerangkan variasi bagi satu atau lebih pembolehubah bersandar dengan cara menghubungkan variasi ini berkadaran dengan satu atau lebih pembolehubah tak bersandar. Halangan yang sering berlaku ialah apabila wujudnya kebersandaran antara pembolehubah-pembolehubah tak bersandar. Fenomena ini dipanggil multikolinearan. Mengawal dan mengatasi masalah multikolinearan di dalam analisis regresi adalah penting kerana kaedah penganggaran kuasa dua terkecil menganggap bahawa pembolehubah tak bersandar tidak berkorelasi antara satu sama lain. Perbandingan antara penggunaan kaedah regresi permatang (RR), regresi komponen berkepentingan (PCR) dan regresi sebahagian kuasa dua terkecil (PLSR) di dalam mengawal masalah multikolinearan dilakukan menggunakan set data yang disimulasi bagi membantu dan memberi satu pendekatan kepada para pengkaji yang akan datang tentang pemilihan kaedah terbaik bagi mengawal masalah multikolinearan. Kaedah regresi permatang ialah pengubahsuaian kaedah kuasa dua terkecil (LS) yang memasukkan pemalar kepincangan, di dalam penganggar kuasa dua terkecil. Regresi komponen berkepentingan pula merupakan gabungan analisis komponen berkepentingan (PCA) dengan kaedah kuasa dua terkecil biasa (OLS) sementara kaedah regresi sebahagian kuasa dua terkecil adalah hampir sama dengan kaedah regresi komponen berkepentingan di mana komponen baru perlu dibina untuk mengurangkan bilangan pembolehubah. Algoritma bagi setiap kaedah turut diterangkan dan untuk tujuan perbandingan bagi setiap kaedah, set data bagi kes bilangan pembolehubah tak bersandar lebih kecil dari bilangan pemerhatian. Perbandingan keberkesanan bagi ketiga-tiga kaedah tersebut menggunakan ralat min kuasa dua (MSE). Kaedah simulasi Monte Carlo digunakan untuk menilai keberkesanan ketiga-tiga kaedah yang dibincangkan. Semua simulasi dan pengiraan dilakukan dengan menggunakan pakej statistik S-PLUS 2000.


TABLE OF CONTENTS

CHAPTER   SUBJECT                                                      PAGE

          COVER                                                        i
          DECLARATION                                                  ii
          DEDICATION                                                   iii
          ACKNOWLEDGEMENT                                              iv
          ABSTRACT                                                     v
          ABSTRAK                                                      vi
          TABLE OF CONTENTS                                            vii
          LIST OF TABLES                                               x
          LIST OF FIGURES                                              xiv
          LIST OF SYMBOLS                                              xvii
          LIST OF APPENDICES                                           xix

1         INTRODUCTION
          1.1  Background                                              1
          1.2  The Problem of Multicollinearity                        2
          1.3  Statement of Problem                                    5
          1.4  Research Objectives                                     5
          1.5  Scope of Research                                       5
          1.6  Summary and Outline of Research                         6

2         LITERATURE REVIEW
          2.1  Introduction                                            7
          2.2  Ordinary Least Squares Regression                       7
          2.3  Multicollinearity Problem in Regression Analysis        10
               2.3.1  Explanation of Multicollinearity                 10
               2.3.2  Effects of Multicollinearity in Least Squares
                      Regression                                       13
               2.3.3  Multicollinearity Diagnostics                    23
                      2.3.3.1  Informal Diagnostics                    24
                      2.3.3.2  Formal Methods                          25
          2.4  Concluding Remarks                                      32

3         METHODS FOR HANDLING MULTICOLLINEARITY
          3.1  Introduction                                            34
          3.2  Partial Least Squares Regression                        35
               3.2.1  The construction of k components                 37
               3.2.2  Regress the response into k components           40
          3.3  Principal Components Regression                         49
               3.3.1  The construction of k components                 50
               3.3.2  Regress the response into k components           53
               3.3.3  Bias in Principal Components Coefficient         54
          3.4  Ridge Regression                                        61

4         SIMULATION AND ANALYSIS
          4.1  Introduction                                            71
          4.2  Generating Simulated Data Sets                          72
          4.3  Performance Measures                                    76
          4.4  Simulation Results                                      78
               4.4.1  Partial Least Squares Regression                 82
               4.4.2  Principal Component Regression                   104
               4.4.3  Ridge Regression                                 116

5         COMPARISONS ANALYSIS AND DISCUSSIONS
          5.1  Introduction                                            125
          5.2  Performance on Classical Data                           125
          5.3  Performance on Simulated Data Sets                      127
          5.4  Comparison Analysis                                     134
          5.5  Discussions                                             142

6         SUMMARY, CONCLUSIONS AND FUTURE RESEARCH
          6.1  Introduction                                            145
          6.2  Summary                                                 145
          6.3  Significant Findings and Conclusions                    147
          6.4  Future Research                                         149

          REFERENCES
          APPENDICES


LIST OF TABLES

TABLE NO.  TITLE  PAGE

3.1  Tobacco Data  45
3.2  Variance Inflation Factors for Tobacco Data  46
3.3  PLS weights vectors, rk, for the PLS Components  46
3.4  Loadings for the PLS Components  47
3.5  PLS Components  47
3.6  RMSE values for k components  48
3.7  MSE values for k components  49
3.8  The correlation matrix  58
3.9  The eigenvalues of the correlation matrix  58
3.10  Matrix of eigenvectors  58
3.11  The principal components  59
3.12  Values of C and the regression coefficients for various values of the shrinkage parameter  69
4.1  Factors and levels for the simulated data sets  72
4.2  The specific values of xip for each set of p regressors  74
4.3  The response variable, y, for each set of p regressors  74
4.4  The VIF values for the choice of the variance of the noise matrix  75
4.5  The VIF values for each set of p regressors  76
4.6  The values of correlation for p = 2 regressors  79
4.7  The values of correlation for p = 4 regressors  79
4.8  The values of correlation for p = 6 regressors  79
4.9  Regression model for p = 2 regressors for n = 100  80
4.10  Regression model for p = 4 regressors for n = 100  80
4.11  Regression model for p = 6 regressors for n = 100  81
4.12  PLS weights, ri, for the PLS Components for p = 2 regressors  83
4.13  PLS weights, ri, for the PLS Components for p = 4 regressors  83
4.14  PLS weights, ri, for the PLS Components for p = 6 regressors  83
4.15  PLS weights, ri, for the PLS Components for p = 50 regressors  84
4.16  PLS loadings, pi, for the PLS Components for p = 2 regressors  85
4.17  PLS loadings, pi, for the PLS Components for p = 4 regressors  85
4.18  PLS loadings, pi, for the PLS Components for p = 6 regressors  85
4.19  PLS loadings, pi, for the PLS Components for p = 50 regressors  86
4.20  Correlations between each PLS component and y  87
4.21  Correlation between each component  89
4.22  RMSE values for p = 2 data sets  97
4.23  RMSE values for p = 4 data sets  97
4.24  RMSE values for p = 6 data sets for first five PLS components  97
4.25  RMSE values for p = 50 data sets for first five PLS components  97
4.26  Regression model using all the PLS scores for p = 2 and n = 100  99
4.27  Regression model using all the PLS scores for p = 4 and n = 100  100
4.28  Regression model using selected PLS scores for p = 4 (kopt = 1)  101
4.29  Regression model using selected PLS scores (kopt = 5) for p = 6 and n = 100  102
4.30  The eigenvalues of the correlation matrix  105
4.31  Matrix of eigenvectors  106
4.32  Percentage of variance explained and the eigenvalues for p = 2  107
4.33  Percentage of variance explained and the eigenvalues for p = 4  108
4.34  Percentage of variance explained and the eigenvalues for p = 6  108
4.35  Percentage of variance explained and the eigenvalues for p = 50  109
4.36  Regression model using all the PC scores for p = 2 and n = 100  111
4.37  Regression model using selected PC scores (PC1) for p = 2 and n = 100  111
4.38  Regression model using all the PC scores for p = 4 and n = 100  112
4.39  Regression model using selected PC scores (PC1) for p = 4 and n = 100  113
4.40  Regression model using all the PC scores for p = 6 and n = 100  114
4.41  Regression model using selected PC scores (PC1) for p = 6 and n = 100  114
4.42  Values of the shrinkage parameter, C and the coefficient vectors employed for p = 2 and n = 100  119
4.43  Values of the shrinkage parameter, C and the coefficient vectors employed for p = 4 and n = 100  120
4.44  Values of the shrinkage parameter, C and the coefficient vectors employed for p = 6 and n = 100  121
5.1  Performance of PLS, PC and RR methods on classical data sets  126
5.2  Cross-validation of PLS, PC and RR methods, R² for p = 2  127
5.3  Cross-validation of PLS, PC and RR methods, R² for p = 4  130
5.4  Cross-validation of PLS, PC and RR methods, R² for p = 6  130
5.5  Cross-validation of PLS, PC and RR methods, R² for p = 50  130
5.6  MSE values for p = 2 and specified n = 20, 30, 40, 60, 80 and 100  131
5.7  MSE values for p = 4 and specified n = 20, 30, 40, 60, 80 and 100  131
5.8  MSE values for p = 6 and specified n = 20, 30, 40, 60, 80 and 100  135
5.9  MSE values for p = 50 and specified n = 60, 80 and 100  135
5.10  Summary of the performances of the three methods with p = 2  142
5.11  Summary of the performances of the three methods with p = 4  142
5.12  Summary of the performances of the three methods with p = 6  143
5.13  Summary of the performances of the three methods with p = 50  143


LIST OF FIGURES

FIGURE NO.  TITLE  PAGE

1.1  Multicollinearity in simple linear regression  3
1.2  Multicollinearity in multiple linear regression  4
2.1  Picket Fences illustrations  15
2.2  The choice of VIF value against the R-square value  29
3.1  Steps in SIMPLS algorithm  42
3.2  Steps in Principal Component Regression algorithm  56
3.3  The sampling distribution of biased and unbiased estimator  61
3.4  Steps in Ridge Regression algorithm  67
3.5  Plot of C against the shrinkage parameter  70
4.1  Flowchart summarizing performance assessment of methodology  77
4.2  Correlation between x1 and x2 for p = 2 regressors  88
4.3  Correlation between first and second components for p = 2 regressors  89
4.4  Correlation between each component for p = 4 regressors  90
4.5  Correlation between each component for p = 6 regressors  92
4.6  X- and Y-scores for p = 2 regressors (first and second components)  94
4.7  X- and Y-scores for p = 4 regressors (all components)  94
4.8  Correlation between the first five components of the p = 6 regressors data set and the response variable, y  95
4.9  Plot of the shrinkage parameter against C for p = 2 and n = 100  117
4.10  Plot of the shrinkage parameter against C for p = 4 and n = 100  117
4.11  Plot of the shrinkage parameter against C for p = 6 and n = 100  118
4.12  Plot of the shrinkage parameter against C for p = 50 and n = 100  118
5.1  Plot of regression coefficients against number of regressors  126
5.2  Plot of R² against m = 100 replications for p = 2 for specified n = 20  128
5.3  Plot of R² against m = 100 replications for p = 2 for specified n = 100  128
5.4  Plot of R² against m = 100 replications for p = 4 for specified n = 20  131
5.5  Plot of R² against m = 100 replications for p = 4 for specified n = 100  131
5.6  Plot of R² against m = 100 replications for p = 6 for specified n = 20  132
5.7  Plot of R² against m = 100 replications for p = 6 for specified n = 100  132
5.8  Plot of R² against m = 100 replications for p = 50 for specified n = 60  133
5.9  Plot of R² against m = 100 replications for p = 50 for specified n = 100  133
5.10  Plot of MSE values against m = 100 replications for p = 2 for n = 20  138
5.11  Plot of MSE values against m = 100 replications for p = 2 for n = 100  138
5.12  Plot of MSE values against m = 100 replications for p = 4 for n = 20  139
5.13  Plot of MSE values against m = 100 replications for p = 4 for n = 100  139
5.14  Plot of MSE values against m = 100 replications for p = 6 for n = 20  140
5.15  Plot of MSE values against m = 100 replications for p = 6 for n = 100  140
5.16  Plot of MSE values against m = 100 replications for p = 50 for n = 60  141
5.17  Plot of MSE values against m = 100 replications for p = 50 for n = 100  141


LIST OF SYMBOLS

y  Response (dependent) variable
x  Predictor (independent) variable
β  Parameter (regression coefficient), known constant
ε  Error
σ²  Variance
Y  Matrix of observations
X  Matrix of predictors
β  Vector of parameters
ε  Vector of error terms
I  Identity matrix
β̂  Estimated regression coefficient
β̂  Vector of estimated regression coefficients
ŷ  Fitted value
H  Hat matrix
hii  ith diagonal element of the hat matrix
e  Residual term
e  Vector of residual terms
n  Number of observations
p  Number of regressors
Shrinkage parameter
V  Matrix of normalized eigenvectors of X'X
Λ  Diagonal matrix of eigenvalues of X'X
T  Components for Partial Least Squares Regression
P  Matrix of x-loadings
r  PLS weight vectors
k  Number of components for PLS and PCR
R²  Coefficient of Determination
Z  Components for Principal Component Regression
λ  Eigenvalue


ABBREVIATIONS  MEANING

CI  Condition Indices
cov  Covariance
GCV  Generalized Cross Validation
LS  Least Squares
max  Maximum
MSE  Mean Square Error
OLS  Ordinary Least Squares
PCR  Principal Component Regression
PLSR  Partial Least Squares Regression
RMSE  Root Mean Square Error
RR  Ridge Regression
SMC  Squared Multiple Correlation
VIF  Variance Inflation Factors


LIST OF APPENDICES

APPENDIX  TITLE  PAGE

A  S-PLUS codes for data generating function  156
B  S-PLUS codes and simulation results for Partial Least Squares Regression method in Chapter IV  162
C  S-PLUS codes and simulation results for Principal Component Regression method in Chapter IV  182
D  S-PLUS codes and simulation results for Ridge Regression method in Chapter IV  200
E  S-PLUS code for classical data set  215
F  Simulation Results for Chapter V  225

CHAPTER 1

INTRODUCTION

1.1

Background

Regression analysis is one of the most widely used of all statistical tools. It utilizes the relation between two or more quantitative variables so that one variable can be predicted from the others. The relationship of each predictor to the criterion is measured by the slope of the regression line of the criterion Y on the predictor; the regression coefficients are the values of these slopes.

The multiple linear regression model relates Y to X1, X2, ..., Xp and can be expressed in matrix terms as y = Xβ + ε, where y is the n×1 vector of observed response values, X is the n×p matrix of p regressors, β is the p×1 vector of regression coefficients and ε is the n×1 vector of error terms. The objectives of regression analysis are (1) to find estimates of the unknown parameters β and to test β1, β2, ..., βk for the significance of the associated predictors, (2) to use the regression equation to estimate Y from X1, X2, ..., Xp and (3) to measure the error involved in the estimation. The multiple linear regression model may also be used to identify important regressor variables that can be used to predict future values of the response variable.

The method of least squares is used to find the best line that, on average, is closest to all points. In other words, it finds the best estimates of the β's with the least squares criterion, which minimizes the sum of squared vertical distances from the actual observations to the regression surface. The name least squares comes from minimizing the squared residuals. By the Gauss-Markov theorem, least squares is always the best linear unbiased estimator (BLUE), and if ε is assumed to be normally distributed with mean 0 and variance σ², then least squares is the uniformly minimum variance unbiased estimator.

1.2

The Problem of Multicollinearity

In applications of regression analysis, multicollinearity is a problem that occurs when two or more predictor variables are correlated with each other. This problem can cause the values of the least squares estimated regression coefficients to be conditional upon the correlated predictor variables in the model. As defined by Bowerman and O'Connell (1990), multicollinearity is a problem in regression analysis that arises when the independent variables in a regression model are intercorrelated or are dependent on each other.

A variety of informal and formal methods have been developed for detecting the presence of serious multicollinearity. One of the most commonly used is the variance inflation factor (VIF), which measures how much the variances of the estimated regression coefficients are inflated compared to when the independent variables are not linearly related (Neter et al., 1990). The problem of multicollinearity can be remedied using some other method of estimation or some modification of the method of least squares for estimating the regression coefficients.

The problem of multicollinearity can occur in both simple linear regression and multiple linear regression. Figure 1.1 illustrates the problem of multicollinearity that occurs in simple regression (Wannacott and Wannacott, 1981). The figure shows how the estimate β̂ becomes unreliable if the Xi's are closely bunched, that is, if the regressor X has little variation. When the Xi's are concentrated on one single value X̄, then β̂ is not determined at all. For each line, the sum of squared deviations is the same, since the deviations are measured vertically from (X̄, Ȳ). If Xi = X̄, then all xi = 0 and the term involving β̂ is zero; hence the sum of squares does not depend on β̂ at all. Therefore, when the values of X show little or no variation, the effect of X on Y can no longer be sensibly investigated. The best fit for Y for such data is not a line, but rather the point (X̄, Ȳ). In explaining Y, multicollinearity makes the Xi's lose one dimension.

Figure 1.1 : Multicollinearity in simple regression

Figure 1.2 illustrates the problem of multicollinearity in multiple regression (Wannacott and Wannacott, 1981). All the observed points in the scatter plot lie in the vertical plane running up through L. In explaining Y, multicollinearity makes the Xi's lose one dimension, and the best fit for Y is not a plane but instead the line F.

Figure 1.2 : Multicollinearity in multiple regression

Several approaches for handling the multicollinearity problem have been developed, such as Principal Component Regression, Partial Least Squares Regression and Ridge Regression. Principal Components Regression (PCR) is a combination of principal component analysis (PCA) and ordinary least squares regression (OLS). Partial Least Squares Regression (PLSR) is an approach similar to PCR because one needs to construct components that can be used to reduce the number of variables. Ridge Regression is a modified least squares method that allows biased estimators of the regression coefficients.

1.3

Statement of Problem

This study will explore the following question: Which method among Principal Component Regression, Partial Least Squares Regression and Ridge Regression performs best as a method for handling the multicollinearity problem in regression analysis?

1.4

Research Objectives

1. To compare the use of Partial Least Squares Regression, Principal Component Regression and Ridge Regression for handling the multicollinearity problem.

2. To study the degree of efficiency among the three methods and hence rank them in terms of their capabilities in overcoming the multicollinearity problem using simulated data sets.

1.5

Scope of Research

For this research, the problem is focused on the analysis of multicollinearity problems in regression analysis using simulated data sets.

In Principal Component Regression, features from Principal Component Analysis (PCA) and multiple regression are combined to handle multicollinearity. The principal components are computed as linear combinations of the explanatory variables (regressors). The Partial Least Squares Regression method is similar to Principal Component Regression because it is also a two-step procedure. However, PLSR searches for a set of components, called latent vectors, that performs a simultaneous decomposition of X and Y with the constraint that these components explain as much as possible of the covariance between X and Y. This step generalizes PCA, and it is followed by a regression step in which the decomposition of X is used to predict Y.
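As a rough illustration only (the thesis itself uses the SIMPLS algorithm, described in Chapter 3), the following S-PLUS-style sketch builds PLS components for a single response by the classical NIPALS-type recipe: each weight vector is proportional to X'y for the current, deflated X, and the response is then regressed on the resulting scores. The function name, the placeholder matrix X, the response y and the number of components k are illustrative and are not taken from the thesis appendices.

    # Minimal NIPALS-style PLS sketch for a single response (illustration only).
    # Assumes the columns of X and the vector y have been centred.
    pls.sketch <- function(X, y, k) {
      scores  <- matrix(0, nrow(X), k)
      weights <- matrix(0, ncol(X), k)
      Xk <- X
      for (j in 1:k) {
        w.j <- crossprod(Xk, y)                  # weight direction proportional to X'y
        w.j <- w.j / sqrt(sum(w.j^2))            # normalise the weight vector
        t.j <- Xk %*% w.j                        # scores of the jth component
        p.j <- crossprod(Xk, t.j) / sum(t.j^2)   # x-loadings
        Xk  <- Xk - t.j %*% t(p.j)               # deflate X before the next component
        scores[, j]  <- t.j
        weights[, j] <- w.j
      }
      scores                                     # the response is then regressed on these
    }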

The other method considered is Ridge Regression. It is performed by adding a small biasing constant to the diagonal elements of the matrix (X'X) to be inverted; that is, it is a modification of the least squares estimator.
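A minimal sketch of that modification (illustrative names; theta stands for the biasing constant, whose choice is discussed later in the thesis):

    # Ridge regression sketch: add a small constant theta to the diagonal of X'X
    # before inverting; theta = 0 gives back the ordinary least squares estimator.
    ridge.coef <- function(X, y, theta) {
      p <- ncol(X)
      solve(crossprod(X) + theta * diag(p), crossprod(X, y))
    }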

1.6

Summary and Outline of Research

The goal of this research is to find the best procedure for handling multicollinearity problems by comparing the performances of the three methods to determine which method is superior to the others in terms of practicality and efficiency. Practicality means how effective or convenient a method is in actual use, while efficiency means how well a method works in producing the best regression model, as measured by the tests discussed in Chapter 4. Basically, three different methods are put forward, two of which are two-step procedures in which components are computed and the response variable is then regressed on them. The third method is a modification of the least squares method that allows biased estimators of the regression coefficients. The algorithms for each method used in this study are given in Chapter 3.

Chapter 2 reviews the relevant literature on recently published work concerning the problems of multicollinearity. A discussion of methods for handling multicollinearity problems in regression analysis is presented in Chapter 3. Chapter 4 describes the simulation work and the analysis of the three methods. Chapter 5 discusses the performances of the three methods and makes comparisons among them. Lastly, Chapter 6 concludes the study and makes recommendations for further research.

CHAPTER 2

LITERATURE REVIEW

2.1

Introduction

This chapter discusses ordinary least squares (OLS) regression, the multicollinearity problem in regression analysis, methods for detecting multicollinearity and its effects on least squares regression. The discussion of ordinary least squares regression is presented since it is the most commonly used estimator in regression analysis. This chapter also reviews the relevant literature on recently published work.

2.2

Ordinary Least Squares Regression

The least squares method is a very popular technique, widely used to compute estimates of parameters and to fit data. Statisticians often use this method to estimate the parameters of a linear regression model.

The term regression describes statistical relations between variables. The least squares method consists of minimizing the sum of squares of the differences between the predicted and observed values of the dependent variable (Gruber, 1990).

The linear regression model can be written in matrix form as follows:

y = Xβ + ε                                                          (2.1)

where y is an n×1 vector of response values, X is an n×p matrix of predictor variables, β is an unknown p×1 vector of regression coefficients and ε is an n×1 vector of random errors, which are assumed to be independent and normally distributed with mean 0 and variance σ². The goal is to estimate β. In terms of the estimated coefficients β̂, the fitted model is y = Xβ̂ + e, where ŷ = Xβ̂ is the vector of predicted response values, e is the vector of observed residuals and β̂ is the estimate of the regression coefficients. To compute β̂, the sum of the squared residuals is minimized with the least squares criterion:

min over β̂ of  Σ_{i=1}^{n} e_i²                                      (2.2)

where e_i = y_i − x_i'β̂, i = 1, 2, ..., n.

The least squares estimator is derived by minimizing the sum of squares of the error terms. That is, it minimizes

F(β) = ε'ε = (y − Xβ)'(y − Xβ)
     = y'y − β'X'y − y'Xβ + β'X'Xβ
     = y'y + β'X'Xβ − 2β'X'y                                        (2.3)

Differentiating F(β) with respect to β and setting the derivative to zero gives

∂F(β)/∂β = −2X'y + 2X'Xβ̂ = 0

so the normal equations of least squares can be written as X'Xβ̂ = X'y. If (X'X)⁻¹ exists, then

β̂ = (β̂_0, β̂_1, ..., β̂_{p−1})' = (X'X)⁻¹X'y                         (2.4)

This model provides information about the relationship between the response variable and the regressors. The model can also identify important regressor variables that can be used to predict future values of the response variable.

The least squares estimator β̂ is an unbiased estimator, that is, E(β̂) = β, and it has minimum variance, with Cov(β̂) = σ²(X'X)⁻¹, within the class of linear unbiased estimators of β, as stated in the Gauss-Markov theorem. A common estimate of σ² is the mean square error (MSE), defined as e'e/(n − p). Under the assumption that the error terms are independent and identically distributed normal variates with mean 0 and covariance matrix σ²I, the least squares estimate of β is also the maximum likelihood estimator. Model inferences such as hypothesis tests and confidence intervals are very powerful if the error terms are independent and identically normally distributed with mean 0 and covariance matrix σ²I.

Therefore, when two or more regressors are nearly the same, the least squares estimator will have very large variances and covariances, and hence broad confidence intervals, since σ²(X'X)⁻¹ is the covariance matrix of β̂. This shows that the least squares estimator performs very poorly in the presence of multicollinearity.
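As a compact illustration of the computations in this section, the sketch below (S-PLUS-style code on toy data; all names are illustrative and not from the thesis appendices) obtains β̂ = (X'X)⁻¹X'y from the normal equations together with the MSE estimate e'e/(n − p); the built-in lm() function gives the same fit.

    # Ordinary least squares via the normal equations (Eq. 2.4) on toy data.
    set.seed(1)
    n <- 100
    p <- 3
    X <- matrix(rnorm(n * p), n, p)                    # simulated regressors
    y <- X %*% c(1, 2, 3) + rnorm(n)                   # simulated response

    beta.hat <- solve(crossprod(X), crossprod(X, y))   # (X'X)^(-1) X'y
    e   <- y - X %*% beta.hat                          # residuals
    mse <- sum(e^2) / (n - p)                          # estimate of sigma^2
    # fit <- lm(y ~ X - 1)                             # equivalent built-in fit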

2.3

Multicollinearity Problem in Regression Analysis

The presence of multicollinearity causes the values of X to have little or no independent variation, such that the effect of X on Y can no longer be sensibly investigated. Hence, various multicollinearity diagnostic methods have been developed to detect these problems.

2.3.1

Explanation of Multicollinearity

Multicollinearity is a condition in a set of regression data in which two or more regressors are redundant and carry essentially the same information. The linear dependencies among the regressors can affect the model's ability to estimate regression coefficients.

The redundant information means that what one independent variable explains about Y is exactly what the other independent variable explains. In this case, the two or more redundant predictor variables would be completely unreliable, since the bi's would measure the same effect of those xi's, and the same goes for the other elements of the least squares estimate b. Furthermore, (X'X)⁻¹ would not exist because the denominator 1 − r_ik² is zero. As a result, values for b cannot be found, because the elements of the inverse matrix and the coefficients become extremely large (Younger, 1979).

Hair et al. (1998) noted that as multicollinearity increases, it complicates the interpretation of the variate because it is more difficult to ascertain the effect of any single variable, owing to the variables' interrelationships. Multicollinearity is also an issue in other multivariate techniques because of the difficulty in understanding the true impact of multicollinear variables.

According to Neter et al. (1996), there are two types of multicollinearity: perfect multicollinearity (or extreme multicollinearity) and high multicollinearity (or near-extreme multicollinearity). Perfect multicollinearity means that at least two of the independent variables in a regression equation are perfectly related by a linear function. When perfect multicollinearity is present, there is no unique solution. Perfect multicollinearity occurs when (1) independent variables are linear functions of each other, for example age and year of birth, (2) dummy variables are created for all values of a categorical variable and (3) there are fewer observations than variables.

Consider the following example. Suppose the model

Y = A + B1X1 + B2X2 + ε

is to be estimated, and suppose that in the sample it happens to be the case that

X1 = 2 + 3X2.

Assume that there is no error term in this relation. Then the correlation between X1 and X2 is 1.0, and this is an example of a model having perfect multicollinearity. The other type of multicollinearity is high multicollinearity. High multicollinearity means that there is a strong (but not perfect) linear relationship among the independent variables. If the regression model has only two independent variables, high multicollinearity occurs if the two variables have a correlation that is close to 1 or −1; the closer it gets to 1 or −1, the greater is the association between the independent variables. When there is high but imperfect multicollinearity, a solution is still possible, but as the independent variables increase in correlation with each other, the standard errors of the regression coefficients become inflated.
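A small numerical illustration of the perfect case above (hypothetical numbers): when X1 = 2 + 3X2 holds exactly, the correlation is 1 and the cross-product matrix of the centred regressors is singular, so the normal equations cannot be solved.

    # Perfect multicollinearity: X1 is an exact linear function of X2.
    x2 <- c(1, 2, 3, 4, 5)
    x1 <- 2 + 3 * x2                              # X1 = 2 + 3*X2
    X  <- cbind(x1 - mean(x1), x2 - mean(x2))     # centred regressors
    cor(x1, x2)                                   # equals 1
    prod(eigen(crossprod(X))$values)              # determinant of X'X: essentially zero
    # solve(crossprod(X))                         # would fail: X'X cannot be inverted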

The problem of multicollinearity can also be explained through eigenvalues and eigenvectors. According to Myers (1986), consider the X'X matrix in correlation form; there exists an orthogonal matrix V = [v1, v2, ..., vk] such that

V'(X'X)V = diag(λ1, λ2, ..., λk)                                     (2.5)

where the λi are the eigenvalues of the correlation matrix, and the operation given by Eq. (2.5) is called the eigenvalue decomposition of X'X. The columns of V are the normalized eigenvectors associated with the eigenvalues of X'X. Denote the ith element of the vector vj by v_ij. When multicollinearity is present, at least one λi ≈ 0, so that for at least one value of j,

vj'(X'X)vj ≈ 0

which implies that for at least one eigenvector vj,

Σ_{l=1}^{k} v_lj x_l* ≈ 0.

Note that the symbol ≈ is used because, if the right-hand side were identically zero, the linear dependencies would be exact and (X'X)⁻¹ would not exist.

The mean squared error (MSE) of β̂ is

E[(β̂ − β)'(β̂ − β)] = σ² tr(X'X)⁻¹ = σ² Σ_{i=1}^{p} 1/λi

where λ1 ≥ λ2 ≥ ... ≥ λp are the eigenvalues of X'X. For collinear data, some λi are close to zero, and that results in large MSEs. The MSE is the expected squared distance in p-dimensional Euclidean space from the estimate to the true β, so collinearity results in poor least squares estimation (Hoerl et al., 1986).

Multiple regression tries to separate out the effects of two or more variables, even though they are correlated with each other. To do this, however, there must be some remaining variation in each X variable when the other X variables are held constant. If two variables are perfectly correlated, then while one of the variables is held constant the other one must be constant as well. Hence, it is impossible to separate their effects on the dependent variable.

2.3.2

Effects of Multicollinearity in Least Squares Regression

The presence of multicollinearity in least squares regression can cause larger variances of the parameter estimates, which means that the estimates of the parameters tend to be less precise. As a consequence, the model will have insignificant tests and wide confidence intervals. Therefore, the higher the multicollinearity level, the less interpretable are the parameters.

To understand the effects of multicollinearity, consider a model with two independent variables. The model is

Y = β1X1 + β2X2 + e

and the normal equations, in correlation form, are given by

[[1, r], [r, 1]] (β̂1, β̂2)' = (r1y, r2y)'                            (2.6)

where r denotes the correlation between x1 and x2, and r1y and r2y denote the correlations of x1 and x2 with y. From Eq. (2.6), we get

(β̂1, β̂2)' = [[1, r], [r, 1]]⁻¹ (r1y, r2y)'
           = 1/(1 − r²) · (r1y − r·r2y, r2y − r·r1y)'                (2.7)

If the predictors are highly correlated, that is, |r| ≈ 1, then the divisor, which is the determinant of X'X, namely (1 − r²), is close to zero. The variances of the estimates,

Var(β̂i) = σ²/(1 − r²),  i = 1, 2,

are inflated by this divisor. The estimates are also highly correlated, with Corr(β̂1, β̂2) = −r. If r is negative, the estimates are approximately equal, with inflated magnitude. If r is positive, the estimates are approximately equal in magnitude but opposite in sign, with the magnitude being inflated by the divisor.
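Equations (2.6)-(2.7) and the inflation factor 1/(1 − r²) are easy to illustrate; the correlations used below are hypothetical.

    # Correlation-form normal equations for two predictors (Eqs. 2.6-2.7) and the
    # factor 1/(1 - r^2) that inflates the variances as |r| approaches 1.
    r   <- 0.95                           # correlation between x1 and x2
    r1y <- 0.70
    r2y <- 0.65                           # correlations of x1 and x2 with y
    Rxx <- matrix(c(1, r, r, 1), 2, 2)
    solve(Rxx, c(r1y, r2y))               # estimates from Eq. (2.7)
    1 / (1 - r^2)                         # variance inflation, about 10.3 here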

2.3.2.1 Effects on Least Squares Estimated Regression Coefficients

The values of the least squares estimated regression coefficients tend to change depending on the particular independent variables included in the model; that is, their values depend on which correlated independent variables are in the model.

Serious cases of multicollinearity can cause the least squares estimated regression coefficients β̂1, β̂2, β̂3, ..., β̂p to be far from the true values of the regression parameters β1, β2, β3, ..., βp. They can also cause the least squares estimated regression coefficients to be highly dependent on the particular sample of observed values of the response variable (Neter et al., 1985).

The popular picket fence example describes the problem of multicollinearity, as shown in Figure 2.1 (Hocking, 1996):

Figure 2.1 : Picket Fences illustrations

This figure shows two independent variables (x1 and x2) exhibiting strong multicollinearity, where the heights of the pickets on the fence represent the y observations. Here, the least squares point estimates describe the slant of the fitted plane. Note that b1 represents the plane's slope in the x1 direction and b2 represents the plane's slope in the x2 direction. The plane is quite unstable; that is, if a small change occurs in the height of one of the pickets (that is, in a y value), then the slant of the plane (that is, the least squares point estimates) might change radically. On the other hand, if multicollinearity does not exist, the observed combinations of values of x1 and x2 would not tend to fall around a line but would be more uniformly spread out in the (x1, x2) plane. Then the fitted plane would be more stable.


2.3.2.2 Effects on Computational Accuracy and the Standardized Multiple Regression Model

The effect of multicollinearity can lead to serious rounding error problem in the calculation of least squares estimated regression coefficients. This is because strong multicollinearity can cause the columns of the X matrix (independent variables) to be nearly dependent.

The magnitudes of standardized regression coefficients may also be seriously affected when independent variables are highly correlated among each other. This happens because the standardized multiple regression model is a standardized form of multiple regression model that is employed to control and minimize roundoff errors in normal equations calculations, especially when the number of predictor variables is small.

Roundoff errors tend to enter least squares calculations when the inverse of X'X is taken. They may be serious when X'X has a determinant that is close to 0, in which case (X'X)⁻¹ almost does not exist. This results in inaccurate values of the least squares estimated regression coefficients b. Roundoff error may also arise when the elements of X'X differ substantially in magnitude, that is, when the data in the X variables cover a large range.

According to Myers (1986), the danger of serious roundoff errors in (X'X)⁻¹ is greatest when X'X has a determinant close to 0 and when the elements of X'X differ substantially in order of magnitude. X'X has a determinant close to 0 when some or all of the independent variables are highly correlated. Omitting or discarding one or several of the intercorrelated independent variables can shift the determinant away from near 0, so that roundoff errors will not have severe effects. The elements of X'X differ substantially in order of magnitude when the variables themselves have substantially different magnitudes, so that the entries in the X'X matrix cover a wide range, for example 15 to 29,000,000. A possible solution is to transform the variables and thereby reparameterize the regression model, because this makes all entries in the X'X matrix for the transformed variables fall between −1 and +1 inclusive. Hence, the calculation of the inverse matrix becomes much less subject to roundoff error due to dissimilar orders of magnitude than with the original variables.

The standardized variables, as given by the correlation transformation, are

Yi' = (1/√(n − 1)) (Yi − Ȳ)/SY,      for i = 1, 2, ..., n

Xik' = (1/√(n − 1)) (Xik − X̄k)/Sk,   for k = 1, 2, ..., p

where Ȳ and X̄k are the respective means of the Y and Xk observed values.

SY and Sk are the respective standard deviations:

SY = √[ Σ_{i=1}^{n} (Yi − Ȳ)² / (n − 1) ]

and

Sk = √[ Σ_{i=1}^{n} (Xik − X̄k)² / (n − 1) ]

The standardized multiple regression model is the multiple regression model in the transformed form (as defined by the correlation transformation):

Yi' = β1'Xi1' + β2'Xi2' + ... + βp'Xip' + εi'

The X matrix for the transformed variables (without the intercept term) is the n×p matrix

X = [ x11'  x12'  ...  x1p'
      x21'  x22'  ...  x2p'
      ...
      xn1'  xn2'  ...  xnp' ]

Then X'X = rxx, where rxx is the correlation matrix of the X variables, which contains the coefficients of simple correlation between all pairs of X variables. That is, rxx is the p×p matrix

rxx = [ 1     r12   ...  r1p
        r21   1     ...  r2p
        ...
        rp1   rp2   ...  1   ]

The coefficient of simple correlation between the predictor variables Xj and Xk (j ≠ k) is denoted by rjk and ranges between −1 and +1, because it also includes the negative square root of the coefficient of multiple determination when the number of predictors is one. Therefore, the entries in the X'X matrix for the transformed variables will fall between −1 and 1. When the response variable Y is regressed on a predictor variable X, the coefficient of simple correlation is given as rYX. The correlation matrix rYX consists of the coefficients of simple correlation between Y and each of the predictor variables, denoted by rY1, rY2, ..., rYp.

Similarly to the algebraic definition of the X'X matrix,

X'Y = rYX,  where rYX = (rY1, rY2, ..., rYp)'.

The normal equations for the standardized multiple regression are given by

rxx b = rYX

and the vector of estimated standardized regression coefficients b is

b = (rxx)⁻¹ rYX

where (rxx)⁻¹ is the inverse of the matrix rxx.

The existence of roundoff errors in the least squares estimated regression coefficients can be checked by transforming the estimated standardized regression coefficients back into the original least squares estimated regression coefficients.
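A compact sketch of the correlation transformation (toy data, illustrative names): after the transformation, X'X reproduces the correlation matrix rxx and the standardized coefficients solve rxx b = rYX.

    # Correlation transformation and the standardized normal equations.
    set.seed(3)
    n <- 60
    X <- cbind(rnorm(n), rnorm(n), rnorm(n))
    y <- drop(X %*% c(2, -1, 0.5)) + rnorm(n)

    std <- function(v) (v - mean(v)) / sqrt(var(v))   # (value - mean) / std. deviation
    Xs  <- apply(X, 2, std) / sqrt(n - 1)             # transformed regressors
    ys  <- std(y) / sqrt(n - 1)                       # transformed response

    rxx <- crossprod(Xs)                              # equals cor(X)
    ryx <- crossprod(Xs, ys)                          # correlations of each x with y
    b   <- solve(rxx, ryx)                            # standardized coefficients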

2.3.2.3 Effects on the Estimated Standard Deviation of Least Squares Estimated Regression Coefficients

The value of the estimated standard deviation of a least squares estimated regression coefficient, s(bk), will increase as the degree of multicollinearity becomes higher. The high degree of multicollinearity among the predictor variables is responsible for the substantial effects on the estimated standard deviations of the least squares estimated regression coefficients. This effect of multicollinearity can be demonstrated when the variables in the least squares model are transformed by means of the correlation transformation.

Consider a model with two predictor variables:

Yi = β1Xi1 + β2Xi2 + εi

The (X'X)⁻¹ matrix in the standardized model is

(X'X)⁻¹ = rxx⁻¹ = 1/(1 − r12²) · [[1, −r12], [−r12, 1]]

The variance-covariance matrix of the estimated standardized regression coefficients is

σ²{b'} = (σ')² rxx⁻¹ = (σ')²/(1 − r12²) · [[1, −r12], [−r12, 1]]

where (σ')² is the error term variance for the standardized multiple regression model. Therefore

σ²{b1'} = σ²{b2'} = (σ')²/(1 − r12²)                                 (2.8)

This means that the estimated regression coefficients b1' and b2' have the same variance, and each of these variances becomes larger as the correlation between X1 and X2 increases. When the predictor variables are not correlated at all (r12² = 0), Eq. (2.8) reduces to σ²{b1'} = σ²{b2'} = (σ')². When r12² → 1, the variances of b1' and b2' grow without limit. This proves that the standard deviations of b1' and b2' are also very large when r12² ≈ 1.

Since b1', b2', ..., bp' can be transformed back to b1, b2, ..., bp, it follows that s(bk) is inflated as the degree of correlation among the predictor variables in the model increases. The large value of s(bk) when high correlation exists among the predictor variables will then affect any statistical inference that involves s(bk).

2.3.2.4 Effects on t-Test

The multicollinearity problem affects the estimated standard deviations of the least squares estimated regression coefficients; therefore, a large value of s(bk) also affects the value of the t-test.

The test statistic

t* = bk / s(bk)

may become small when s(bk) is large. Strong multicollinearity can lead to the conclusion that βk = 0 because of the small value of t*, even though the actual value of βk is not 0. Therefore, the smaller value of the t-test statistic can lead to the conclusion that βk = 0 (Neter et al., 1996). When multicollinearity exists, the size of the t-test measures the additional importance of the predictor variable Xk over the combined importance of the other predictor variables in the model, which means that two or more correlated predictor variables contribute unessential information.
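The effect on t* can be seen by fitting the same kind of model with and without strong intercorrelation between the predictors; the data below are hypothetical.

    # Effect of multicollinearity on the t statistics t* = b_k / s(b_k).
    set.seed(4)
    n  <- 40
    x1 <- rnorm(n)
    x2.ind <- rnorm(n)                    # independent second predictor
    x2.col <- x1 + 0.05 * rnorm(n)        # nearly collinear second predictor
    y1 <- 1 + 2 * x1 + 2 * x2.ind + rnorm(n)
    y2 <- 1 + 2 * x1 + 2 * x2.col + rnorm(n)
    summary(lm(y1 ~ x1 + x2.ind))         # clearly significant t values
    summary(lm(y2 ~ x1 + x2.col))         # much smaller t values despite real effects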

2.3.2.5 Effects on Extra Sum of Squares

When two predictor variables are correlated, the marginal contribution of one predictor variable in reducing the sum of squares is dependent on the other predictor variables already in the model.

According to Hocking (1996) and Neter et al. (1985), the extra sum of squares for a predictor variable xk, given the other correlated predictor variables already in the model, is usually smaller than before these other predictor variables are in the model. This is because, when the other predictor variables are already in the model, the marginal contribution of xk in reducing the error sum of squares is comparatively small, since the rest of the predictor variables already contain much of the same information as xk. When the predictors are correlated, there is no unique sum of squares that can be attributed to an individual predictor variable as reflecting its effect in reducing the total variation in Y; it depends on the rest of the predictor variables that are already in the model.

The important conclusion is that, when independent variables are correlated, there is no unique sum of squares which can be ascribed to an independent variable as reflecting its effect in reducing the total variation in Y. The reduction must be viewed in the context of the other independent variables included in the model, whenever the independent variables are correlated (Neter et al., 1985).

2.3.2.6 Effects on Fitted Value and Predictions

Prediction is damaged by collinearity, and this problem can seriously affect least squares prediction at points that represent extrapolation outside the range of the data. The reason is that the quality of the prediction, ŷ(x0), depends on where x0 lies in the regressor space (Myers, 1986). If the fit of the least squares model is good, there are regions in x where prediction will be effective. However, the presence of multicollinearity will likely produce areas where prediction is quite poor.

This problem however does not affect inferences about mean responses or predictions that are made within the region of observations or the range of data. It also does not affect the precision of the estimated mean response.


2.3.2.7 Effects on Coefficients of Partial Determination

The existence of multicollinearity will make the value of the coefficient of partial correlation between the response variable and each predictor variable decrease. Therefore, the coefficient of partial determination between Y and xk, when the rest of the predictor variables are already in the model, will also appear to be smaller.

2.3.3

Multicollinearity Diagnostics

Neter et al. (1990) noted some key problems that typically arise when the independent variables being considered for the regression model are highly correlated among themselves. These are, (1) adding or deleting an independent variable will change the regression coefficients, (2) the extra sum of squares associated with an independent variable varies, depending upon which independent variables are already included in the model, (3) the estimated standard deviations of the regression coefficients become large when the independent variables in the model are highly correlated with each other and lastly, (4) the estimated regression coefficients individually may not be statistically significant even though a definite statistical relation exists between the dependent variable and the set of independent variables. However, these problems can also arise without substantial multicollinearity being present, but only under unusual circumstances not likely to be found in practice.

2.3.3.1 Informal Diagnostics for Detecting Multicollinearity

A variety of informal diagnostics can be used to detect multicollinearity problems. These informal diagnostics do not provide quantitative measurements of the impact of multicollinearity nor may they identify the nature of multicollinearity. Another limitation of the informal diagnostic methods is that sometimes the observed behavior may occur without multicollinearity being present.

According to Neter et al. (1996), Younger (1979) and Hocking (1996), there are about five informal diagnostics that are most commonly used to check for the presence of multicollinearity. They are as follows:

1. When multicollinearity is present, there will be large coefficients of simple correlation between pairs of predictor variables in the correlation matrix rxx that are linearly related to each other. This can be verified by inspecting the scatter plot matrix and the values of the coefficients of simple correlation obtained from the correlation matrix.

2. Multicollinearity can also cause large changes in the least squares estimated regression coefficients when a predictor variable is added or deleted, or when an observation is altered or deleted.

3. Wide confidence intervals for the regression coefficients of important predictor variables are another sign that multicollinearity is present in the regression analysis.

4. The appearance of nonsignificant results in individual t-tests on the regression coefficients of important predictor variables is a sign of the presence of multicollinearity.

5. If a predictor variable xk is regressed on the rest of the predictor variables in the model and its R² value is examined, multicollinearity almost certainly exists between xk and the rest of the predictor variables when this R² value is substantially high, near the value 1.

These informal diagnostics only give signs that multicollinearity is present in the analysis. The determination of which variables are involved in serious multicollinearity and need to be omitted should be done using the formal methods that have been developed and are widely used.
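In practice, these informal checks amount to inspecting the correlation matrix and scatter plots of the regressors, refitting after dropping a variable, and reading the individual t-tests; a brief sketch is given below, where X and y are assumed to be a regressor matrix and response already in the workspace (illustrative names).

    # Informal multicollinearity checks (illustrative; X is a regressor matrix, y the response).
    round(cor(X), 2)            # 1. large pairwise correlations in rxx
    pairs(X)                    #    scatter plot matrix of the regressors
    coef(lm(y ~ X))             # 2. compare with the fit after dropping a column:
    coef(lm(y ~ X[, -1]))       #    large coefficient changes suggest multicollinearity
    summary(lm(y ~ X))          # 3. and 4. wide intervals, nonsignificant individual t-tests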

2.3.3.2 Formal Methods for Detecting Multicollinearity

Formal methods for detecting the multicollinearity problem have been developed to determine how seriously the problem affects the analysis and to identify which variables are correlated and need to be omitted or deleted.

Variance Inflation Factors (VIF)

The variance inflation factor is a measure of the speed with which variances and covariances increase, and it is the most commonly used method for detecting the multicollinearity problem. It measures multicollinearity in the regression design matrix (that is, among the independent variables) through a scaled version of the multiple correlation coefficient between one independent variable and the rest of the independent variables. The measure shows the number of times the variance of the corresponding parameter estimate is increased due to multicollinearity, compared to what it would be if there were no multicollinearity. Therefore, this diagnostic is designed to indicate the strength of the linear dependencies and how much the variance of each regression coefficient is inflated above the ideal (Myers, 1986). The formula is

(VIF)j = 1 / (1 − Rj²)

where Rj² is the multiple correlation coefficient, which measures the correlation between xj and the remaining variables, with −1 ≤ Rj ≤ 1.

If Rj² = 0, there is no correlation between xj and the remaining independent variables, and the VIF attains its minimum value.

The VIF measure comes from the fact that the variances of the estimated coefficients are inflated due to collinearity, compared to when the independent variables are not linearly related.

Consider the variance-covariance matrix of the estimated parameters:

σ²{b} = σ²(X'X)⁻¹                                                    (2.9)

The variance-covariance matrix of the estimated standardized regression coefficients is obtained from Eq. (2.9) together with the fact that the X'X matrix for the transformed variables is rxx:

σ²{b'} = (σ')² rxx⁻¹

where rxx is the matrix of pairwise simple correlation coefficients among the X variables and (σ')² is the error term variance for the transformed model.

Thus, the variance of an estimated standardized regression coefficient is equal to the product of the error term variance and the kth diagonal element of the inverse of the correlation matrix:

σ²{bk'} = (σ')²/(1 − Rk²) = (σ')² (VIF)k

where Rk² is the coefficient of multiple determination when xk is regressed on the remaining independent variables; the elements of the diagonal of the rxx⁻¹ matrix are the VIFs.

According to Hocking (1996), the VIF associated with the jth predictor is the variance of its estimated coefficient divided by the variance that coefficient would have if the predictors were orthogonal. That is,

VIF_j = Var[β̂_j] / σ^2 = r^{jj}

where r^{jj} is the jth diagonal element of the inverse of the correlation matrix of the predictors.

The diagonal elements of the inverse matrix are therefore examined; for simplicity of notation, consider j = 1.

The correlation matrix is partitioned as

C = [ 1      r_1'
      r_1    C_1 ]

where r_1' denotes the row vector of correlations of the first predictor with the remaining ones, and C_1 denotes the correlation matrix of the remaining predictors. C is the coefficient matrix of the normal equations in standard form; hence the diagonal elements of its inverse, when multiplied by σ^2, give the variances of the estimates.

Using the result on the inverse of a partitioned matrix,

r^{11} = (1 - r_1' C_1^{-1} r_1)^{-1} = 1 / (1 - R_1^2)

where C_1^{-1} r_1 is the vector of coefficient estimates for the regression of the first predictor on the remaining m - 1 predictors, and R_1^2 is the coefficient of determination for that regression.

Therefore, the jth variance inflation factor is

VIF_j = 1 / (1 - R_j^2)

If the jth predictor is well described by the remaining predictors, R_j^2 will be close to 1 and VIF_j will be large. There is no formal cutoff value to use with the VIF for determining the presence of multicollinearity, but Neter et al. (1996) recommended looking at the largest VIF value. A value greater than 10 is often used as an indication of a potential multicollinearity problem.

The cutoff value of VIF that can be used to determine whether collinearity is a problem is shown in Figure 2.2.

Figure 2.2 : The choice of VIF value against the R-squared value.

Figure 2.2 shows the VIF values corresponding to different values of R-squared for x_j regressed on the other variables. The choice of R-squared reflects different levels of correlation between x_j and the other variables. For example, a VIF equal to 10 implies that the R-squared value for that regression is 0.90.
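As an illustration only (the thesis code was written in S-PLUS, Appendix E), the following is a minimal Python/NumPy sketch of the VIF diagnostic described above; the function name vif and the toy data are illustrative and not part of the original analysis.

import numpy as np

def vif(X):
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 is the coefficient of determination
    # of x_j regressed (with intercept) on the remaining columns of X.
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)     # least squares fit
        resid = y - Z @ beta
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Two nearly collinear predictors give VIF values well above the cutoff of 10.
rng = np.random.default_rng(0)
x1 = rng.normal(size=30)
x2 = x1 + rng.normal(scale=0.1, size=30)
print(vif(np.column_stack([x1, x2])))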

Tolerance

Tolerance is an index (or set of indices) of linear dependence among the independent variables X_1, ..., X_j in the intercept model (Jennrich, 1995). It is the inverse of the variance inflation factor: a value near 1 indicates independence of the predictors, while a value close to 0 indicates that the variables are multicollinear.

Therefore, tolerance has a range from 0 to 1, and the closer the tolerance value is to 0, the higher the level of multicollinearity that exists. It is calculated as follows:

30 Tolerancek 1 Rk2

Tolerance limits that are frequently used are 0.01, 0.001 or 0.0001, below which the variable is not entered into the model.

Eigenvalues, Condition Indices (CI) and Variance Proportions

Multicollinearity is also indicated through the eigenvalues and their associated eigenvectors: small, near-constant eigenvalues are indicators of multicollinearity (near-linear relations), and the associated eigenvectors are used to describe the multicollinearity. A criterion is needed to assess the severity of this multicollinearity, analogous to that established for the VIF (Hocking, 1996). This can be done by applying the eigen-analysis to the correlation matrix of the predictors, since multicollinearity is only a function of the correlation structure. The determinant of a matrix is the product of its eigenvalues, and the presence of one or more small eigenvalues implies a near-singular matrix. Notice that, if the predictors are orthogonal, all the eigenvalues are 1.

The condition index (CI) is a criterion from numerical analysis that is used to measure the stability of a matrix. The formula is

CI_j = √( λ_max / λ_j )

The condition number is the maximum of the condition indices, and the criterion condition number > 30 is often used as an indicator of near-singularity of a matrix.

According to Myers (1986), the appearance of a small eigenvalue of X'X implies that any or all regression coefficients may be adversely affected, and the proportion of the variance of each coefficient that is attributed to each dependency needs to be determined.

The eigenvalue decomposition of X'X (scaled but not centered) is considered as

V'(X'X)V = diag( λ_0, λ_1, ..., λ_k )

Then the variance-covariance matrix (X'X)^{-1} is written as

(X'X)^{-1} = [V_0 V_1 ... V_k] diag( 1/λ_0, 1/λ_1, ..., 1/λ_k ) [V_0 V_1 ... V_k]'

where the notation λ_0, λ_1, ..., λ_k is used to take into account one additional eigenvalue due to the increase in the dimension of X'X when the intercept is involved.

The proportion of the variance of b_i, denoted p_ij, attributed to the multicollinearity characterized by the eigenvalue λ_j is defined as follows:

p_ij = ( v_ij^2 / λ_j ) / c_ii

where

c_ii = Σ_{r=0}^{k} v_ir^2 / λ_r = var(b_i) / σ^2

Here v_ij is the ith element of the eigenvector associated with λ_j, the eigenvectors appearing as the columns of V. The expression for c_ii illustrates that a small eigenvalue deposits its influence, to some degree, on all the variances.

Qualitatively, these variance proportions are useful as follows: a small eigenvalue implies a serious linear dependency, and when it is accompanied by a subset of regressors (at least two) with high variance proportions, it represents a dependency involving the regressors in that subset. The dependency is damaging to the precision of estimation of the coefficients in that subset (Hocking, 1996).
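As an illustrative sketch only (not the thesis code), the eigenvalue-based diagnostics above can be computed in Python/NumPy as follows; the diagnostics are applied here to the correlation matrix of the predictors, the condition indices are taken with the square root as in the equation above, and all names are illustrative.

import numpy as np

def collinearity_diagnostics(X):
    # Eigenvalues, condition indices CI_j = sqrt(lam_max / lam_j) and
    # variance-decomposition proportions p_ij = (v_ij^2 / lam_j) / c_ii.
    X = np.asarray(X, dtype=float)
    R = np.corrcoef(X, rowvar=False)              # correlation matrix of predictors
    lam, V = np.linalg.eigh(R)                    # eigenvalues (ascending), eigenvectors in columns
    ci = np.sqrt(lam.max() / lam)                 # condition indices
    phi = V**2 / lam                              # element (i, j) = v_ij^2 / lam_j
    prop = phi / phi.sum(axis=1, keepdims=True)   # variance proportions p_ij
    return lam, ci, prop

rng = np.random.default_rng(1)
x1 = rng.normal(size=40)
X = np.column_stack([x1, x1 + rng.normal(scale=0.1, size=40), rng.normal(size=40)])
lam, ci, prop = collinearity_diagnostics(X)
print("eigenvalues:", lam)
print("condition indices:", ci)
print("variance proportions:\n", prop.round(3))

A large condition index together with two or more large variance proportions in the same row signals a damaging dependency among that subset of regressors, in the sense described above.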

2.4 Concluding Remarks

This chapter presented OLS, the multicollinearity problem in regression analysis, methods for detecting multicollinearity and the effects of multicollinearity on least squares regression. In the next chapter, the three methods that will be used in this study for handling multicollinearity will be discussed. These methods are partial least squares regression, principal component regression and ridge regression. They are chosen based on their popularity and efficiency.

PLS was first developed by Herman Wold in 1966 as an econometric technique which is applied to the 'soft' type of model (Wold, 1975). It is also one of the most widely used chemometrical tools to estimate concentrations from measured spectra. SIMPLS was proposed in 1993 by Sijmen de Jong (De Jong, 1993) as an alternative interpretation of the abbreviation 'SIM' in SIMPLS, that is, Straightforward Implementation. It has been shown to be a powerful technique for developing models for process modelling and multivariate statistical process control in the presence of highly correlated or collinear data (Li et al., 2001). Algorithmically, one of the major differences between classical PLS and SIMPLS is the way the deflation is done. From the second latent variable on, the weighting vectors are the dominant eigenvectors of X'yy'X, multiplied by a projection matrix. One can deflate the matrix X by means of the projection matrix (for classical PLS); an alternative choice is to deflate the matrix X'y (Serneels and Van Espen, 2003).

Principal component analysis is a statistical technique originated by Hotelling in 1933, which is performed in order to simplify the description of a set of interrelated variables (Afifi and Clark, 1984). It was then developed by Massy in 1965 to handle the problem of multicollinearity and is known as principal component regression. The basic idea of PCR is to use a set of orthogonal variables, that is, PCs, that span the original measurement space (Xie and Kalivas, 1997a). It allows the transformation of a set of correlated explanatory X-variables into an equal number of uncorrelated variables. These new variables (PCs) are all linear combinations of the original correlated X-variables.

The RR approach was first developed by Hoerl in 1962 with the purpose of controlling the inflation and general instability associated with the least squares estimates. As an estimation procedure, it is based on adding small positive quantities to the diagonal of X'X. To overcome the deficiencies of multicollinearity, the ridge regression technique using the ridge trace was then suggested by Hoerl and Kennard in 1970 and was developed further in numerous works.

The detailed discussion on these methods will be presented in Chapter 3.

CHAPTER 3

METHODS FOR HANDLING MULTICOLLINEARITY

3.1 Introduction

Several methods have been developed to overcome the deficiencies of multicollinearity.

Among such methods are continuum regression (Stone and Brooks, 1990, Sundberg, 1993), partial least squares (PLS) regression and principal component regression (PCR) (Bjorkstrom and Sundberg, 1999).

In the case of multicollinearity problems in regression analysis, the number of predictors p is generally substantially smaller than the number of cases n, so that multicollinearity arises through the use of highly correlated predictors. But in some chemometrics problems p is substantially larger than n. These problems, with their massive rank deficiency, have led to new interest in methods for dealing with multicollinearity. Subset selection, proposed by Miller (1990), seeks to find some small subset of the predictors and use just these in a conventional regression.

The current study explores PLSR, PCR and RR as methods to handle multicollinearity, and these methods will be discussed in this chapter. The PLS regression method is discussed in Section 3.2. This is followed by the PCR method in Section 3.3 and the ridge regression method in Section 3.4. In each section, the algorithm of the method is described and an example using a classical data set is used to illustrate the method.

3.2 Partial Least Squares Regression

Partial least squares (PLS) is a method of modeling relationships between a Y variable and other explanatory variables (Garthwaite, 1994). The method was first developed by Wold (1966); it originated in the social sciences, specifically economics, but first became popular in chemometrics through Wold's son, Svante. PLS is a predictive technique which can handle many independent variables, especially when these display multicollinearity.

The goal of PLS regression is to predict Y from X and to describe their common structure when X is likely to be singular and the ordinary regression approach can no longer be used because of the multicollinearity problem. This method is similar to principal component regression because components are extracted before they are regressed to predict Y.

In contrast, PLS regression searches for a set of components (called latent vectors, factors or components) from X that are also relevant for Y, performing a simultaneous decomposition of X and Y with the constraint that these components explain as much as possible of the covariance between X and Y (Abdi, 2003). This step generalizes principal component analysis (PCA) and is followed by a regression step in which the decomposition of X is used to predict Y. In other words, it combines features from PCA and multiple linear regression (MLR), and the regression equation takes the form

Ŷ = β_0 + β_1 T_1 + β_2 T_2 + ... + β_p T_p          ( 3.1 )

where each component Tk is a linear combination of the X j and the sample correlation for any pair of components is 0. PLS also reduces the number of terms, as the components in Equation (2.2) are usually far fewer than the number of X variables (Garthwaite, 1994).

The principle is that, when considering the relationship between Y and some specified X variable, other X variables are not allowed to influence the estimate of the relationship directly but are only allowed to influence it through the components.

SIMPLS is an extension of PLSR and was proposed by De Jong (1993). The SIMPLS algorithm is the leading PLSR algorithm because of its speed and efficiency. The algorithm is based on the empirical cross-covariance matrix between the response variables and the regressors and on linear least squares regression. The main difference between the SIMPLS algorithm and the standard PLSR algorithm lies in their residual (deflation) matrices:

PLSR :    X_A = (I_n - TT') X_0
SIMPLS :  X_A = X_0 (I_p - VV')

with the SIMPLS deflation

S_xy^a = S_xy^{a-1} - v_a ( v_a' S_xy^{a-1} )

applied to the cross-product S_xy and not to the larger data matrices X and Y. The SIMPLS method assumes that the x- and y-variables are related through a bilinear model

x_i = x̄ + P t_i + g_i          ( 3.2 )

y_i = ȳ + A' t_i + f_i          ( 3.3 )

where x̄ and ȳ denote the means of the x- and y-variables. The t_i are called the scores, which are k-dimensional with k << p, whereas P = P_{p,k} is the matrix of x-loadings. The residuals of the two equations are represented by g_i and f_i respectively. The matrix A = A_{k,q} represents the slope matrix in the regression of y_i on t_i.

The bilinear structure shown above implies a two-step algorithm. After mean-centering the data, SIMPLS first constructs k latent variables, T_{n,k} = (t_1, ..., t_n)', and secondly the responses are regressed onto these k variables. The columns of T_{n,k} are the components.

The steps will be described in the next subsections.

3.2.1 The first stage : The construction of k components

The elements of the scores t_i are defined as linear combinations of the mean-centered data, t_ia = x̃_i' r_a, or equivalently T_{n,k} = X̃_{n,p} R_{p,k} with R_{p,k} = (r_1, ..., r_k).

De Jong (1993) stated that the weights should be determined so as to maximize the covariance between the score vectors t_a and u_a under some constraints. He pointed out four conditions that are specified to control the solution:

1. Maximization of covariance :  u_a' t_a = q_a' (Y_a' X_a) r_a = max!

2. Normalization of the weights r_a :  r_a' r_a = 1

3. Normalization of the weights q_a :  q_a' q_a = 1

4. Orthogonality of the t scores :  t_b' t_a = 0 for a > b

The last constraint has to be added as the orthogonality restriction, in order to obtain more than one solution and to generate a set of orthogonal factors of X. Let X̃_{n,p} and Ỹ_{n,q} denote the mean-centered data matrices, with x̃_i = x_i - x̄ and ỹ_i = y_i - ȳ. The k components are obtained as linear combinations of the x-variables which have maximum covariance with a certain linear combination of the y-variables, and they depend on the normalized PLS weight vectors r_a and q_a (for a = 1, ..., k), defined as the vectors that maximize the covariance between the x- and y-components:

max_{||r_a||=1, ||q_a||=1} cov( X̃ r_a, Ỹ q_a ) = max_{||r_a||=1, ||q_a||=1} r_a' S_xy q_a          ( 3.4 )

where S_yx' = S_xy = X̃'_{p,n} Ỹ_{n,q} / (n - 1) is the empirical cross-covariance matrix between the x- and y-variables. The maximization has the additional restriction that the components T̃_a = X̃ r_a be uncorrelated (orthogonal), i.e.

r_j' X̃' X̃ r_a = T̃_a' T̃_j = Σ_{i=1}^{n} t̃_ij t̃_ia = 0 ,   a ≠ j          ( 3.5 )

This constraint is imposed to obtain more than one solution and in addition it avoids multicollinearity between the regressors in the second stage of this algorithm.

The x-loading p_j that describes the linear relation between the x-variables and the jth component X̃ r_j has to satisfy the condition in Eq. (3.5). It is computed as

p_j = ( r_j' X̃' X̃ r_j )^{-1} X̃' X̃ r_j = ( r_j' S_x r_j )^{-1} S_x r_j

with S_x the empirical covariance matrix of the x-variables. This definition implies that Eq. (3.5) is fulfilled when p_j' r_a = 0 for a > j.

The SIMPLS weight vectors are the pairs (r_a, q_a). The first pair (r_1, q_1) is obtained as the first left and right singular vectors of S_xy. This implies that q_1 is the dominant eigenvector of S_yx S_xy and r_1 is the dominant eigenvector of S_xy S_yx (with S_yx = S_xy'). The following pairs of SIMPLS weight vectors (r_a, q_a), for 2 ≤ a ≤ k, are obtained as the dominant eigenvectors of S_xy^a S_yx^a and S_yx^a S_xy^a respectively, with S_xy^a the deflated cross-covariance matrix

S_xy^a = S_xy^{a-1} - v_a ( v_a' S_xy^{a-1} ) = ( I_p - v_a v_a' ) S_xy^{a-1}

and S_xy^1 = S_xy. The set {v_1, ..., v_{a-1}} represents an orthonormal basis of the x-loadings p_1, ..., p_{a-1}. Thus the iterative algorithm starts with S_xy = S_xy^1, and the process is repeated until k components are obtained.

According to Engelen et al. (2003), the most commonly used approach for choosing the number of components k is the Root Mean Squared Error (RMSE). The predictive ability of the methods is measured by the RMSE, defined as

RMSE_k = √( (1/n) Σ_{i=1}^{n} ( y_i - ŷ_{i,k} )^2 )          ( 3.6 )

with ŷ_{i,k} the predicted y-value of observation i from the test set when the regression parameter estimates are based on the training set (X, Y) of size n and k scores are retained.

The optimal number of components is often selected as that k for which this RMSE value is minimal.
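As a small illustrative sketch of this selection rule (not the thesis code), the RMSE_k curve of Eq. (3.6) can be computed as follows in Python/NumPy; fit_predict is a hypothetical, user-supplied routine that fits a model with k components on the training data and returns test-set predictions.

import numpy as np

def rmse_curve(fit_predict, X_train, y_train, X_test, y_test, k_max):
    # RMSE_k of Eq. (3.6) for k = 1, ..., k_max;
    # fit_predict(X_train, y_train, X_test, k) is a hypothetical placeholder.
    rmse = []
    for k in range(1, k_max + 1):
        y_hat = fit_predict(X_train, y_train, X_test, k)
        rmse.append(np.sqrt(np.mean((y_test - y_hat) ** 2)))
    return np.array(rmse)

# k_opt is then the k with the smallest RMSE_k:
# k_opt = 1 + int(np.argmin(rmse_curve(...)))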

3.2.2 The second stage : Regress the response onto the k components

For the second stage of the algorithm, the response is regressed onto these k components. The formal regression model considered is

y_i = α_0 + A'_{q,k} t_i + f_i

where E(f_i) = 0 and Cov(f_i) = Σ_f; this is a multiple linear regression (MLR) of the original y-variables on the extracted components T_1, T_2, ..., T_k.

Multiple linear regression provides the estimates

Â_{k,q} = ( S_t )^{-1} S_ty = ( R'_{k,p} S_x R_{p,k} )^{-1} R'_{k,p} S_xy

α̂_0 = ȳ - Â'_{q,k} t̄

S_f = S_y - Â'_{q,k} S_t Â_{k,q}

where S_y and S_t stand for the empirical covariance matrices of the y- and t-variables. Because t̄ = 0, the intercept α_0 is thus estimated by ȳ. By plugging t_i = R'_{k,p}( x_i - x̄ ) into the equation y_i = ȳ + A'_{q,k} t_i + f_i, the estimates for the parameters in the original model y_i = β_0 + B'_{q,p} x_i + e_i are obtained. These are

B̂_{p,q} = R_{p,k} Â_{k,q}          ( 3.7 )

β̂_0 = ȳ - B̂'_{q,p} x̄          ( 3.8 )

Note that for a univariate response variable (q = 1), the parameter estimate B̂_{p,1} can be rewritten as the vector β̂ (Hubert and Vanden (2003), Barker (1997), Garthwaite (1994) and De Jong (1993)).

Partial least squares regression (PLSR) has become an established tool for modelling linear relations between multivariate measurements in chemometrics (De Jong, 1993). One of the well-known features of PLS is that the sample correlation between any pair of components is 0 (Garthwaite, 1994). This follows because the residuals from a regression are uncorrelated with a regressor; for example, r_a is uncorrelated with T_i, and each of the components T_{i+1}, ..., T_k is a linear combination of the r_a's, so they are uncorrelated with T_i. The method is used to compress the predictor data matrix X = [x_1, x_2, ..., x_p], which contains the values of p predictors for n samples, into a set of k latent variable or factor scores T = [t_1, t_2, ..., t_k], where k << p (De Jong, 1993). The factors t_a, a = 1, 2, ..., k, are determined sequentially using the nonlinear iterative partial least squares (NIPALS) algorithm (Wold, 1966).

Figure 3.1 shows the steps in the SIMPLS algorithm. This algorithm is used in this study.

STEP 1 : Compute the mean-centered data matrices X̃_{n,p} and Ỹ_{n,q}:
         x̃_i = x_i - x̄ ,  ỹ_i = y_i - ȳ

STEP 2 : Compute the first pair of normalized SIMPLS weight vectors, r_1 and q_1:
         q_1 is the dominant eigenvector of S_yx S_xy and r_1 = S_xy q_1, where
         S_yx' = S_xy = X̃'_{p,n} Ỹ_{n,q} / (n - 1)
         is the empirical cross-covariance matrix between the x- and y-variables.

STEP 3 : For each a = 1, 2, ..., k the normalized SIMPLS weight vectors r_a and q_a
         (with ||r_1|| = ||q_1|| = 1) are defined as the vectors that maximize
         cov( Ỹ_{n,q} q_a, X̃_{n,p} r_a ) = q_a' ( Ỹ'_{q,n} X̃_{n,p} / (n - 1) ) r_a = q_a' S_yx r_a

STEP 4 : Compute the SIMPLS scores T̃_{n,k} = X̃_{n,p} R_{p,k} with R_{p,k} = [ r_1, r_2, ..., r_k ].
         The first score is t_1 :  t_i1 = x̃_i' r_1.

STEP 5 : Check the restriction
         r_j' X̃' X̃ r_a = Σ_{i=1}^{n} t̃_ij t̃_ia = T̃_a' T̃_j = 0 ,  a ≠ j
         where the components X̃ r_j are required to be orthogonal to obtain more than one solution.

STEP 6 : Compute the x-loading p_j that describes the linear relation between the x-variables
         and the jth component X̃ r_j:
         p_j = X̃' T_j / ( T_j' T_j ) = ( r_j' X̃' X̃ r_j )^{-1} X̃' X̃ r_j = ( r_j' S_x r_j )^{-1} S_x r_j
         where S_x is the empirical variance-covariance matrix of the regressors.

STEP 7 : Step 5 is fulfilled when p_j' r_a = 0 for a > j.

STEP 8 : Compute an orthonormal basis { v_1, v_2, ..., v_{a-1} } of the x-loadings
         p_1, p_2, ..., p_{a-1} for 2 ≤ a ≤ k:
         v_1 = p_1
         v_i = p_i - ( v_1' p_i / v_1' v_1 ) v_1 - ... - ( v_{i-1}' p_i / v_{i-1}' v_{i-1} ) v_{i-1}
         with v_i' v_j = 0 for i ≠ j (giving orthogonality) and v_i' v_i = 1 (giving normalization).

STEP 9 : Compute the deflated cross-covariance matrix
         S_xy^a = S_xy^{a-1} - v_{a-1} v_{a-1}' S_xy^{a-1}

STEP 10 : Compute the following SIMPLS weight vectors r_a and q_a (2 ≤ a ≤ k) as the first
          left and right singular vectors of S_xy^a.

STEP 11 : Compute the next score for 2 ≤ a ≤ k :  T_a = X̃_{n,p} r_a.

STEP 12 : Repeat Step 4 for 2 ≤ a ≤ k.

STEP 13 : Compute the estimates of the SIMPLS algorithm:
          B_pls = ( S_t )^{-1} S_ty = ( R'_{k,p} S_x R_{p,k} )^{-1} R'_{k,p} S_xy
          β̂_0 = ȳ
          where S_y and S_t are the empirical covariance matrices of the y- and t-variables.

STEP 14 : Select the number of components k using
          RMSE_k = √( (1/n) Σ_{i=1}^{n} ( y_i - ŷ_{i,k} )^2 )
          where ŷ_{i,k} is based on the parameter estimates with k components.

STEP 15 : Choose k_opt as the k-value which gives the smallest value of RMSE_k.

STEP 16 : Compute the coefficients of SIMPLS regression for the original variables:
          b_pls = R_{p,k} B_{k,q}
          β̂_0 = ȳ - b_pls' x̄

Figure 3.1 : Steps in the SIMPLS algorithm
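The thesis implemented these steps in S-PLUS (Appendix E). Purely as an illustration, the following is a minimal Python/NumPy sketch of the same SIMPLS steps; the function name simpls and the argument names are illustrative, the scores are not normalized, and the regression on the scores is therefore carried out explicitly at the end.

import numpy as np

def simpls(X, Y, k):
    # Minimal SIMPLS sketch: mean-center, extract k components by deflating the
    # cross-product matrix, then regress the centered response on the scores.
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]
    n, p = X.shape
    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - xbar, Y - ybar
    S = Xc.T @ Yc                               # cross-product matrix (p x q)
    R = np.zeros((p, k))                        # weight vectors r_a
    V = np.zeros((p, k))                        # orthonormal basis of the loadings
    for a in range(k):
        u, s, vt = np.linalg.svd(S, full_matrices=False)
        r = u[:, 0]                             # dominant left singular vector
        t = Xc @ r                              # score t_a
        pa = Xc.T @ t / (t @ t)                 # x-loading p_a
        v = pa - V[:, :a] @ (V[:, :a].T @ pa)   # orthogonalize against previous loadings
        v = v / np.linalg.norm(v)
        S = S - v[:, None] @ (v[None, :] @ S)   # deflate the cross-product matrix
        R[:, a] = r
        V[:, a] = v
    T = Xc @ R                                  # component scores
    A = np.linalg.solve(T.T @ T, T.T @ Yc)      # regress the response on the scores
    B = R @ A                                   # coefficients of the original x-variables
    b0 = ybar - xbar @ B                        # intercept
    return B, b0                                # predictions: y_hat = b0 + X @ B

For the tobacco data below, calling simpls(X, y, k) for k = 1, ..., 4 and evaluating Eq. (3.6) for each k would follow the component-selection procedure just described.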

The following is an example of the SIMPLS algorithm using a classical data set. This tobacco data set is considered as an example in which 30 tobacco blends were made, and the goal was to develop a linear equation that relates the percentage concentrations of four important components to a response that measures the amount of heat given off by the tobacco during the smoking process. Table 3.1 supplies the data and Table 3.2 gives the collinearity diagnostics using variance inflation factors (VIF). All code development was done using S-PLUS 2000 and is shown in Appendix E.

Table 3.1 : Tobacco Data

Sample      y        x1(%)   x2(%)   x3(%)   x4(%)
1         527.91     5.50    4.00    9.55    13.25
2         518.29     6.20    4.30    11.10   15.32
3         549.56     7.70    5.20    12.84   17.41
4         738.06     8.50    5.30    13.32   18.08
5        5704.82    11.00    6.30    17.84   24.16
6         697.94    11.50    6.50    18.57   24.29
7         826.86    13.00    7.20    21.96   27.29
8         998.18    15.00    7.60    25.87   31.32
9        1040.22    16.20    7.80    26.82   34.62
10       1040.46    16.90    8.70    27.89   36.03
11        803.26    14.10    7.20    23.99   28.48
12       1009.51    17.50    8.80    29.61   36.88
13        916.44    15.00    7.50    25.80   31.41
14        394.23     6.30    4.80    11.49   15.59
15        583.20     9.20    5.40    14.68   19.69
16        744.81    11.50    6.50    19.10   25.40
17        825.93    12.00    7.00    19.60   26.39
18       1070.88    16.80    8.40    27.16   36.49
19        840.91    14.20    7.40    23.95   28.84
20        991.58    17.10    8.30    27.94   35.40
21        767.40    11.90    6.90    19.56   26.10
22        807.18    13.10    7.10    22.05   27.40
23        857.15    14.30    7.50    23.91   29.03
24        526.05     7.20    5.00    12.14   16.13
25        495.89     6.70    4.90    11.91   15.98
26        476.38     5.60    4.10    9.56    13.34
27        520.82     7.10    5.00    11.98   16.09
28       1066.99    17.00    8.00    27.77   35.45
29       1020.25    16.10    7.60    26.99   34.16
30        494.59     6.30    4.60    11.26   15.52

Table 3.2 : Variance Inflation Factors for Tobacco Data

Variables    VIF
x1           324.14119
x2           45.17285
x3           173.25772
x4           138.17526

From the VIF values shown in Table 3.2, all four variables included are possibly multicollinear because they exceed the VIF cutoff value for determining whether collinearity is a problem. x1 appears to have the strongest multicollinearity problem, followed by x3, x4 and x2.

Table 3.3 : PLS weight vectors, r_k, for the PLS components

        PLS1          PLS2          PLS3          PLS4
x1      0.3647893     0.1895016    -0.6454277    -0.6370039
x2      0.1274698    -0.3161498     0.6663337     0.6403092
x3      0.5912632    -0.7330871    -0.1434161    -0.1888636
x4      0.7078757     0.5715958     0.3447524     0.3854358

Table 3.3 displays the values of the PLS weight vectors, which are calculated as the first step in constructing the PLS components. Even though the signs may be useful, it is difficult to interpret the magnitudes of the weights.

Table 3.4 : Loadings for the PLS components

        PLS1          PLS2          PLS3         PLS4
x1      0.3640556    -0.7279066     3.584746     3.521529
x2      0.1286938    -0.2494037     1.674861     1.585775
x3      0.5941013    -1.8368707     5.483791     5.346201
x4      0.7056629    -0.5029701     8.655897     8.399696

Using the PLS weights and the original standardized variables, the PLS scores and the loadings for the PLS components are calculated; the loadings are shown in Table 3.4. From the loadings, it is clear that all the explanatory variables load highest on PLS3.

Table 3.5 : PLS components (the four columns of values below list the component scores for k = 1, 2, 3 and 4, respectively, for samples 1 to 30)

-16.9290408 -14.2536866 -11.0835216 -10.0208601 -2.0050229 -1.2734884 3.4909339 9.4360787 12.7970098 14.7978415 5.9348385 16.6481291 9.4456521 -13.7317536 -7.8089627 -0.1743768 1.0681814 14.6171224 6.2279961 14.4034129 0.7900210 3.6457459 6.3880679 -12.6113754 -13.0486889 -16.8101934 -12.7707714 14.2635720 12.5099288 -13.9427894

0.11954775 0.20427231 0.12305366 0.27412737 0.59348057 0.16415531 -0.54327487 -0.85357090 0.50043454 0.37009778 -1.14279084 -0.32286956 -0.71919620 -0.06642552 0.29843436 0.41009051 0.54610274 1.24408020 -0.95197266 0.13769831 0.42232826 -0.49581203 -0.82671079 -0.12694888 -0.10721406 0.15099569 -0.05146894 0.36679768 0.15715546 0.12540276

-0.354449212 -0.115005911 -0.012458575 -0.300022990 0.200595461 -0.048727577 -0.002358789 -0.198085355 0.162105809 0.642652460 -0.593208539 0.368393113 -0.223651885 0.190769065 -0.325183493 0.257937092 0.537986942 0.770574953 -0.394637056 0.022668596 0.441654786 -0.108519616 -0.321306857 -0.163903296 0.073450035 -0.322765057 -0.090204049 -0.071070395 -0.089585002 0.066355345

-0.37306903 -0.12176542 -0.02405477 -0.30203948 0.23554633 -0.04265753 -0.03388704 -0.23692143 0.21925352 0.69100953 -0.65931577 0.37561319 -0.25304267 0.16509965 -0.32021409 0.28507852 0.57388083 0.87778805 -0.44864289 0.05521734 0.46932846 -0.13621813 -0.36752501 -0.19476799 0.04532628 -0.33993791 -0.11626686 -0.02179642 -0.05451521 0.05349594

Table 3.6 : RMSE values for k components

k    RMSE_k
1    49.72228
2    46.84524
3    43.12863
4    43.12410

Table 3.5 displays the PLS components for k = 4, which have been constructed as the first major step in the PLS algorithm. These k components are obtained as linear combinations of the x-variables which have maximum covariance with a certain linear combination of the y-variables. The choice of the number of components k can be made using the Root Mean Squared Error RMSE_k, as shown in Table 3.6, where k_opt is chosen as the k-value which gives the smallest or a sufficiently small value of RMSE_k. Therefore, for this classical data set, k_opt = 4 was chosen as the optimal number of components.

Equation (3.7) can then be used as a transformation to the coefficients of the natural variables, and Equation (3.8) can be used to obtain the constant term. These values were calculated after the response variable was regressed onto the k = 4 components. The results are

b0,pls = 311.78184
b1,pls = 77.62480
b2,pls = -75.13769
b3,pls = -26.53706
b4,pls = 21.83263

The final PLS regression model for the tobacco data is

y = 311.78184 + 77.62480 x1 - 75.13769 x2 - 26.53706 x3 + 21.83263 x4

Table 3.7 : MSE values for k components

k    MSE_k
1    1.477400e+003
2    1.504314e+003
3    3.738533e-001
4    2.677920e-017

The MSE_k values shown in Table 3.7 also show that the minimum value of MSE_k is at k = 4.

3.3 Principal Component Regression

Principal Component Regression was developed by Massy in 1965 to handle the problem of multicollinearity by eliminating model instability and reducing the variances of the regression coefficients. This method is a biased estimation technique for handling multicollinearity. It performs least squares estimation on a set of new variables called the principal components of the correlation matrix. The results in estimation and prediction are superior to ordinary least squares.

According to Barker (1997), Principal Component Regression performs a Principal Component Analysis on the explanatory variables and then runs a regression using the principal component scores as the explanatory variables with the response variable of interest. It is a two-step procedure (Filzmoser and Croux, 2003). In the first step, one computes principal components, which are linear combinations of the explanatory variables, while in the second step the response variable is regressed onto the selected principal components. Combining both steps in a single method will maximize the relation to the response variable.

Principal components are orthogonal to each other, so that it becomes quite easy to attribute a specific amount of variance to each. Assuming the predictors are in standard form, V denotes the orthogonal matrix of eigenvectors of the correlation matrix, and Z = XV.

The matrix V of normalized eigenvectors is associated with the eigenvalues λ_1, λ_2, ..., λ_k of X'X (in correlation form). VV' = I since V is an orthogonal matrix. Hence, the original regression model can be written in the form

y = β_0 1 + XVV'β + ε          ( 3.9 )

y = β_0 1 + Zα + ε          ( 3.10 )

with Z = XV, where Z is an n × k matrix, and α = V'β, where α is a k × 1 vector of new coefficients α_1, α_2, ..., α_k.

3.3.1 The first stage : Construct k components

The columns of Z (typical element z_ij) represent k new variables, called principal components. The components are orthogonal to each other, since

Z'Z = (XV)'(XV) = V'X'XV = diag( λ_1, λ_2, ..., λ_k )

By performing the regression on the z's via the model in Eq. (3.10), the variances of the coefficients, which are the diagonal elements of (Z'Z)^{-1} apart from σ^2, are the reciprocals of the eigenvalues. Thus,

Var( α̂_j ) = σ^2 / λ_j ,   j = 1, 2, ..., k          ( 3.11 )

Note that the α̂_j are least squares estimators. The elimination of at least one principal component, namely that associated with a small eigenvalue, may substantially reduce the total variance in the model and thus produce an appreciably improved prediction equation (Myers, 1986).

According to Hwang and Nettleton (2003), there are a few methods for using the principal components. One popular method is to use the principal components corresponding to the k largest eigenvalues. The problem with this approach is that the magnitude of the eigenvalue depends on X only and has nothing to do with the response variable. Hence it is possible that principal components that are important in relating X to the response are excluded because they may have small eigenvalues. An alternative approach is to use the principal components that have the highest correlations with the response, which makes intuitive sense.
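As an illustrative sketch only of the alternative approach mentioned by Hwang and Nettleton (2003), the following Python/NumPy fragment ranks the principal components of the standardized predictors by their squared correlation with the response instead of by eigenvalue size; all names are illustrative.

import numpy as np

def rank_pcs_by_correlation(X, y):
    # Rank principal components of the standardized predictors by their
    # squared correlation with y (alternative to ranking by eigenvalue).
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    lam, V = np.linalg.eigh(np.corrcoef(Xs, rowvar=False))
    Z = Xs @ V                                            # principal component scores
    r2 = np.array([np.corrcoef(Z[:, j], y)[0, 1] ** 2 for j in range(Z.shape[1])])
    order = np.argsort(r2)[::-1]                          # most correlated with y first
    return order, r2[order], lam[order]

This makes it easy to see whether a component with a small eigenvalue nevertheless has a high correlation with the response, which is precisely the case the largest-eigenvalue rule can miss.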

The Principal Component Regression method depends on another maximization criterion for the construction of regressors from X. The first regressor is the first principal component, formed by letting c_1 be the direction of highest variability in x, that is, the direction that maximizes the variance Var(t_1) for t_1 = Xc_1, ||c_1|| = 1. The successive regressors t_2 = Xc_2, and so on, are obtained by repeating the procedure on the residuals from the preceding step, and each new principal component is uncorrelated with the preceding ones. The number of principal components to be used jointly as regressors will typically be optimized by cross-validation, or by using a separate test set (Sundberg, 2002).

Myers (1986) showed that the principal components z are produced by a rotation of the regressor variables. The principal components allow for a new set of coefficients, the α's, and the variance of an estimator α̂_j can be attributed directly to a specific linear dependency. This is apparent from Eq. (3.11).

The columns of Z are considered as

Z = XV          ( 3.12 )

with a specific column of Z being given by z_j = X v_j. The elements of z_j are the data measured on the z_j axis. The variation in the resulting z_j is given by

z_j' z_j = v_j' (X'X) v_j = λ_j ,   j = 1, 2, ..., k

Thus the regression on the principal components would involve

Z'Z = diag( λ_1, λ_2, ..., λ_k )

where the small eigenvalues characterize the transformation to the z's that reveals the linear dependencies and allows the decision to be made regarding which z's are to be eliminated.

Filzmoser and Croux (2002) proposed an algorithm to find k components z_1, z_2, ..., z_k having the property that the squared multiple correlation (SMC) between y and the components is as high as possible, under the constraints that these components are mutually uncorrelated and have unit variance:

SMC = Corr^2( y, z_1 ) + Corr^2( y, z_2 ) + ... + Corr^2( y, z_k )

Optimizing the SMC is done in a sequential manner: first by selecting z_1 to have maximal squared correlation with the dependent variable, and then by sequentially finding the other components having maximal correlation with y while still satisfying the imposed side restrictions.

3.3.2 The second stage : Regress the response onto the k components

After r components have been eliminated, the response variable is regressed on the retained components. As before, with the selected components we can write α = V'β, and hence

β = Vα          ( 3.13 )

Clearly then, if the last r components are eliminated, the least squares estimators of the regression coefficients for all k parameters (deleting a principal component does not imply deleting any of the original regressors) are given by

b_pc = ( b_{1,pc}, b_{2,pc}, ..., b_{k,pc} )' = [ v_1 v_2 ... v_{k-r} ] ( α̂_1, α̂_2, ..., α̂_{k-r} )'          ( 3.14 )

Thus the elimination of r principal components also causes the elimination of r eigenvectors and, of course, r of the α's. Here the x's are still assumed centered and scaled, so the constant term in the transformed model is ȳ.

The coefficients of the natural variables in Eq. (3.9) are obtained by dividing the ith estimated coefficient of the centered and scaled model by S_i:

b_{i,pc} = b'_{i,pc} / S_i ,   where  S_i = √( Σ_{j=1}^{n} ( x_ij - x̄_i )^2 )          ( 3.15 )

Thus the constant term β_0 is estimated from the estimate b_0' by computing

b_{0,pc} = b'_{0,pc} - b'_{1,pc} x̄_1 / S_1 - b'_{2,pc} x̄_2 / S_2 - ... - b'_{k,pc} x̄_k / S_k          ( 3.16 )

where the b_i''s are estimates from the scaled and centered model

y = β_0' + β_1' ( x_1 - x̄_1 )/S_1 + β_2' ( x_2 - x̄_2 )/S_2 + ... + β_k' ( x_k - x̄_k )/S_k

Furthermore, b_0' = ȳ, the mean of the y's.

3.3.3 Bias in Principal Components Coefficients

This method is also a biased estimation technique, where the bias is in the principal components coefficients. The principal components are considered with r components eliminated and s components retained, where s + r = k. The matrix V = [V_1 V_2 ... V_k] of normalized eigenvectors of X'X is partitioned into V = [ V_r  V_s ], and the diagonal matrix Λ of eigenvalues of X'X is partitioned as

Λ = [ Λ_r   0
      0     Λ_s ]

where Λ_r and Λ_s are diagonal matrices, with Λ_r containing the eigenvalues associated with the eliminated components. Since V'(X'X)V = Z'Z = Λ, the least squares estimates of the α's are written as

α̂ = ( Z'Z )^{-1} Z'y = Λ^{-1} V' X'y

which implies that the estimator for the α's that are retained is given by

α̂_s = Λ_s^{-1} V_s' X'y

The α̂_s are unbiased estimators of α_s, and the principal components estimator is

b_pc = V_s α̂_s

Thus E( b_pc ) = V_s α_s = V_s V_s' β, and the estimators are shown to be biased since VV' = V_r V_r' + V_s V_s' = I, so that

E( b_pc ) = ( I - V_r V_r' ) β = β - V_r V_r' β = β - V_r α_r

Thus the estimators of the p regression coefficients are biased by the quantity V_r α_r, with α_r being the vector of coefficients corresponding to the principal components that have been eliminated (Myers, 1986).

Xie and Kalivas (1997b) agreed that principal component regression is widely used for analytical calibration, and in most applications of PCR the principal components are included in the regression model in sequence according to their respective variances. A suitable number of principal components is determined by the predictive ability obtained through cross-validation within the calibration samples, or by relying on an external validation sample set.

There are a few approaches to principal component selection in PCR. Top-down selection is a strategy of selecting principal components in sequence according to their variances. Another approach is based on correlations with the calibration property, where commonly only a few principal components have high correlations with the calibration property.

Figure 3.2 shows the steps in the PCR algorithm. This algorithm is used in this study.

STEP 1 : Center and scale the data:
         X* :  x*_i = ( x_i - x̄_i ) / √( Σ ( x_i - x̄_i )^2 )

STEP 2 : Compute the correlation matrix of the centered and scaled data, X*'X*.

STEP 3 : Compute the eigenvalues λ_i and eigenvectors V of the correlation matrix.

STEP 4 : Compute the components, Z = X*V.

STEP 5 : Examine the eigenvalues of the components. The component associated with the
         smallest eigenvalue will be deleted.

STEP 6 : Compute the coefficient estimates for the components after deletion:
         α̂ = ( Z'Z )^{-1} Z'y

STEP 7 : Transform the coefficient estimates back to the original standardized variables,
         β = Vα. The least squares estimators of the regression coefficients are
         b_pc = ( b_{1,pc}, ..., b_{k,pc} )' = [ v_1 v_2 ... v_{k-r} ] ( α̂_1, ..., α̂_{k-r} )'
         where r is the number of components eliminated. The estimate of β_0' is b'_{0,pc} = ȳ.

STEP 8 : Compute the coefficients of PCR for the original variables, B = ( PC'PC )^{-1} PC'y.

STEP 9 : Compute the coefficients of PCR for the original (natural) variables:
         b_{i,pc} = b'_{i,pc} / S_i
         The constant term is estimated by
         b_{0,pc} = b'_{0,pc} - b'_{1,pc} x̄_1 / S_1 - b'_{2,pc} x̄_2 / S_2 - ... - b'_{k,pc} x̄_k / S_k

Figure 3.2 : Steps in the Principal Components Regression algorithm
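Purely as an illustration of the steps in Figure 3.2 (the thesis code was written in S-PLUS), the following is a minimal Python/NumPy sketch that drops the components with the smallest eigenvalues and transforms the coefficients back to the natural variables; the function name pcr and the argument n_drop are illustrative.

import numpy as np

def pcr(X, y, n_drop=1):
    # Principal component regression: drop the n_drop smallest-eigenvalue
    # components, regress y on the rest, back-transform to natural variables.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar = X.mean(axis=0)
    S = np.sqrt(((X - xbar) ** 2).sum(axis=0))              # scaling constants S_i
    Xs = (X - xbar) / S                                     # centered and scaled regressors
    lam, V = np.linalg.eigh(Xs.T @ Xs)                      # eigenvalues (ascending), eigenvectors
    keep = np.argsort(lam)[n_drop:]                         # retain all but the smallest
    Z = Xs @ V[:, keep]                                     # retained principal components
    alpha = np.linalg.solve(Z.T @ Z, Z.T @ (y - y.mean()))  # regression on the components
    b_scaled = V[:, keep] @ alpha                           # coefficients of scaled regressors
    b = b_scaled / S                                        # natural-variable coefficients
    b0 = y.mean() - xbar @ b                                # constant term
    return b0, b

Applied to the tobacco data with n_drop = 1 this mirrors the worked example below, where z4 (the component with the smallest eigenvalue) is removed before the regression.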

The following is an example of the PCR algorithm using the tobacco data shown in Table 3.1. Below are the correlation matrix of the independent variables, the eigenvalues of the correlation matrix and the resulting matrix of eigenvectors.

Table 3.8 : The correlation matrix

X*'X* =
  1.0000000   0.9885153   0.9971034   0.9962613
  0.9885153   1.0000000   0.9852665   0.9870504
  0.9971034   0.9852665   1.0000000   0.9935571
  0.9962613   0.9870504   0.9935571   1.0000000

Table 3.9 : The eigenvalues of the correlation matrix

Eigenvalue    Value
λ1            3.973892616
λ2            0.017583962
λ3            0.006381769
λ4            0.002141653

Table 3.10 : Matrix of eigenvectors

V =
  0.5010082   0.2389403   0.08490097   0.82746005
  0.4983506   0.8602570   0.08785135   0.06234301
  0.5002597   0.3867114   0.61031845   0.47718535
  0.5003775   0.2309105   0.78267832   0.28933953

All four regression coefficients, b1, b2, b3 and b4, are affected by collinearity. The simple correlations shown in Table 3.8 also indicate that all four independent variables have very high correlations with each other.

The Z matrix of principal components is found from Z = X*V, where X* is centered and scaled and V is the matrix of eigenvectors shown in Table 3.10. X* is a 30 x 4 matrix without the column of ones. The principal components are therefore the columns of Z shown in Table 3.11.

Table 3.11 : The principal components (the four columns of values below list z1, z2, z3 and z4, respectively, for samples 1 to 30)

-0.573869219 -0.493649883 -0.354729937 -0.315924516 -0.063243308 -0.027757008 0.131967351 0.303286028 0.394543126 0.498210340 0.198727553 0.551723996 0.297072778 -0.451387196 -0.256274985 -0.007509151 0.053696262 0.472339081 0.217214981 0.470907923 0.041212832 0.130457256 0.227428687 -0.403192994 -0.425710106 -0.564129175 -0.408129040 0.447999280 0.376687305 -0.467968261

0.033796058 0.036422331 -0.015383159 -0.008899804 -0.009662142 -0.017508696 -0.025072221 0.016436250 0.035517673 -0.035719113 0.014857489 -0.017180771 0.027037294 -0.011208962 0.010955428 -0.005883578 -0.044160229 -0.009526273 -0.004287518 0.007044954 -0.036358476 -0.011577206 -0.013481031 -0.013395894 -0.011181849 0.024599179 -0.016397339 0.037015376 0.055513709 0.007688519

-0.0023915703 -0.0079597324 -0.0009688274 -0.0009086722 -0.0147952294 -0.0007248410 0.0152834929 0.0198820756 -0.0173975853 -0.0123776599 0.0320374063 0.0045387738 0.0159533677 -0.0003732127 -0.0035199805 -0.0120091596 -0.0141509807 -0.0367722054 0.0274081580 -0.0037323614 -0.0110355935 0.0140702994 0.0247657516 0.0064126883 0.0022438385 -0.0023697947 0.0040614565 -0.0112200957 -0.0087557235 -0.0051940829

0.0009070398 -0.0096950687 0.0021970085 0.0204296537 0.0053530594 0.0119432625 -0.0024619529 -0.0096325457 -0.0011614462 -0.0057181349 0.0038263093 -0.0125032515 -0.0085286572 -0.0168575443 0.0170036121 -0.0025154446 -0.0010470329 -0.0005650755 0.0040868002 0.0084788013 -0.0015119612 0.0001308320 0.0062793545 0.0029548275 -0.0108774014 0.0031140012 0.0016005987 0.0090125241 -0.0024618187 -0.0117803490

The eigenvalues in Table 3.9 show that the fourth column of Z is the principal component associated with the smallest eigenvalue, which means that there is very little variation in z4.

Therefore, the application of principal components regression involves the removal, initially, of z4, with the response y being regressed against the remaining components. This regression gives the following coefficients of the z's:

α̂_1 = 554.5231
α̂_2 = 768.0541
α̂_3 = 1226.8759

Equation (3.13) can then be used to determine the estimates of the coefficients in terms of the centered and scaled regressors. The constant term for this regression is ȳ, and the results are:

b'_0,pc = 761.8583
b'_1,pc = 357.1768
b'_2,pc = -492.1596
b'_3,pc = -174.3641
b'_4,pc = 1415.0718

This is followed by a transformation to the coefficients of the natural variables using Equation (3.15), with Equation (3.16) used to estimate the constant term. The results are

b_0,pc = 242.21304
b_1,pc = 16.10995
b_2,pc = -62.15957
b_3,pc = -4.81548
b_4,pc = 32.93089

The final PC regression model for the tobacco data is

y = 242.21304 + 16.10995 x1 - 62.15957 x2 - 4.81548 x3 + 32.93089 x4

3.4 Ridge Regression

The objective of regression is to explain the variation in one or more response variables by associating this variation with proportional variation in one or more explanatory variables. When multicollinearity occurs, it affects the covariance matrix of β̂, σ^2(X'X)^{-1}: X'X will be almost singular and β̂ highly sensitive to random variations in Y, that is, the estimate will depend very much on the particular way the errors ε_i happen to come out (Bjorkstrom, 2001). In other words, if X'X is not nearly a unit matrix, the least squares estimates are sensitive to a number of errors. Therefore, ridge regression was developed by Hoerl and Kennard (1970) in order to reduce this sensitivity.

Ridge regression is a modification of the least squares method that allows biased estimators of the regression coefficients. Although the estimators are biased, the bias is small and the estimators are substantially more precise than the unbiased estimator. Such an estimator is therefore preferred, since it has a larger probability of being close to the true parameter value. The sampling distributions are illustrated in Figure 3.3 (Neter et al., 1985).

[Figure: sampling distributions of the unbiased estimator b and the biased estimator b_R, showing E(b), E(b_R), the parameter β and the bias of b_R]

Figure 3.3 : The sampling distributions of the biased and unbiased estimators

Figure 3.3 illustrates the estimator b as being unbiased but imprecise, while the estimator b_R is much more precise but has a small bias. Thus, the probability that b_R falls near the true value β is much greater than that for the unbiased estimator b.

The ridge regression estimator of the coefficient β is found by solving for b_R in the equation

( X'X + δI ) b_R = X'y

where δ ≥ 0 is often referred to as a shrinkage parameter. Thus, the ridge estimator is given by

b_R = ( X'X + δI )^{-1} X'y

According to Bjorkstrom (2001), ridge regressors are known to have favourable properties, as shown by Hoerl and Kennard (1970): β̂_R has smaller mean square error than the ordinary least squares estimator, provided δ is small enough and the standard regression model holds. Marquardt (1970) also pointed out that ridge regressors are known as shrinkage estimators.
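As a minimal sketch only (not the thesis code, which was written in S-PLUS), the closed-form ridge estimator above can be computed in Python/NumPy as follows, assuming the regressors have already been centered and scaled; the name ridge_coef is illustrative.

import numpy as np

def ridge_coef(Xs, y, delta):
    # Ridge estimator b_R = (X'X + delta*I)^{-1} X'y for centered, scaled Xs.
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + delta * np.eye(p), Xs.T @ y)

# Example: coefficients shrink smoothly as delta grows.
# for d in (0.0, 0.001, 0.01): print(d, ridge_coef(Xs, y - y.mean(), d))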

The matrix X*'X* is considered to be replaced by ( X*'X* + δI ), where δ is a small positive quantity. Thus the eigenvalues and the elements of the inverse are also changed. Since the matrix V diagonalizes X*'X*, it also diagonalizes ( X*'X* + δI ). Thus

V'( X*'X* + δI )V = diag( λ_1 + δ, λ_2 + δ, ..., λ_k + δ )

The eigenvalues of the new matrix ( X*'X* + δI ) are λ_i + δ for i = 1, 2, ..., k; adding δ to the main diagonal effectively replaces λ_i by λ_i + δ.

From the properties of the ridge estimator, the role of δ is to moderate the variance of the estimators. The impact of the eigenvalues on the variances of the ridge regression coefficients can be illustrated as

Var( b_{i,R} ) = σ^2 λ_i / ( λ_i + δ )^2

Therefore, the δ in ridge regression moderates the damaging impact of the small eigenvalues that result from the collinearity (Myers, 1986). The ridge regression procedure will be successful if δ is chosen so that the variance reduction is greater than the bias term, whose expression is given by Hoerl and Kennard (1970) as

Σ_{i=1}^{k} [ E( b_{i,R} ) - β_i ]^2 = δ^2 β'( X'X + δI )^{-2} β

where Σ_{i=1}^{k} [ E( b_{i,R} ) - β_i ]^2 is the sum of squared biases of the ridge regression coefficients. The bias that results from a selection of δ > 0 is best quantified by examining this expression (Myers, 1986; Hocking, 1996). There are various procedures for choosing the shrinkage parameter δ. According to Myers (1986), the ridge trace is a very pragmatic procedure for choosing the shrinkage parameter: δ is allowed to increase until stability is indicated in all coefficients. A plot of the coefficients against δ that pictorially displays the trace often helps the analyst to make a decision regarding the appropriate value of δ. However, stability does not imply that the regression coefficients have converged. As δ grows, the variances are reduced and the coefficients become more stable. Therefore, the value of δ is chosen at the point at which the coefficients no longer change rapidly.

The use of prediction criteria is an alternative procedure for the choice of δ. These criteria are more directly descriptive of prediction performance (Myers, 1986). A Cp-like statistic based on the same variance-bias trade-off is one of the proposed procedures. The C_δ statistic is used by simply plotting C_δ against δ and choosing the δ-value for which C_δ is minimized. The statistic is given as

C_δ = SS_{Res,δ} / σ̂^2 - n + 2 + 2 tr( H_δ )

where H_δ = X*( X*'X* + δI )^{-1} X*', SS_{Res,δ} is the residual sum of squares using ridge regression and tr( H_δ ) is the trace of H_δ. Notice that H_δ plays the same role as the HAT matrix in ordinary least squares. The statistic σ̂^2 comes from the residual mean square from ordinary least squares estimation.

   ei ,  PR( Ridge)    1 i 1  1  hii ,  n    n



where ei , is only the ith residual for a specific value of  and hii , is the ith diagonal element of H . The use of PR(Ridge) is by a simple plotting of PR(Ridge) against , for which PR(Ridge) is minimized. However, for pragmatic reasons, this expression cannot always be used. It is because, when the data are centered and scaled, the setting aside of a data point changes the centering and scaling constants and thus changes all the regressor observations. Therefore, with ridge regression, this expression can only be used when the observations are deleted one at a time and recomputed from original data each time. The use of PR(Ridge) is also quite good when the sample size is small and sample data contain no high leverage observations; i.e. no large HAT diagonals (Myers, 1986).

In other cases, where the data set is moderate or small in size or where high-leverage observations appear, the PRESS criterion can still be used, but without exploiting the shortcut deletion formula, that is

PRESS_δ = Σ_{i=1}^{n} e^2_{i,-i,δ}

where e_{i,-i,δ} is the ith PRESS residual for ridge regression, which involves ŷ_{i,-i,δ}.

The other criterion that represents a prediction approach is the generalized cross-validation (GCV), given by

GCV_δ = Σ_{i=1}^{n} e^2_{i,δ} / { n - [ 1 + tr( H_δ ) ] }^2 = SS_{Res,δ} / { n - [ 1 + tr( H_δ ) ] }^2

where the value of 1 in 1 + tr( H_δ ) accounts for the fact that the constant term is not involved in H_δ. The procedure is to choose δ so as to minimize GCV, by simply plotting GCV against δ. The use of C_δ, PR(Ridge) and GCV, together with the ridge trace approach for choosing δ, are grouped as stochastic methods, where the chosen δ is a random variable (Myers, 1986). Choosing δ can also be done as a function of the regressor data only, where the choice of δ is determined by the nature of the collinearity itself; in this case δ is not a random variable. One simple nonstochastic choice of δ is to let δ grow until the variance inflation factors of all coefficients are reduced sufficiently. This decision procedure is nonstochastic because the variance inflation factors do not depend on the y-observations. The second method is called the df-trace criterion, which centers around the matrix H_δ and is based on

df_δ = tr( H_δ )

where df_δ denotes the effective regression degrees of freedom. The procedure involves plotting df_δ against δ with a view towards choosing a δ at which df_δ stabilizes (Myers, 1986).
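Returning to the GCV criterion above, a minimal Python/NumPy sketch (illustrative only, not the thesis code) is given below; it assumes the regressors have been centered and scaled and the response centered, and the name gcv is illustrative.

import numpy as np

def gcv(Xs, yc, delta):
    # GCV_delta = SS_Res,delta / {n - [1 + tr(H_delta)]}^2 with
    # H_delta = Xs (Xs'Xs + delta*I)^{-1} Xs' for centered, scaled Xs and yc = y - mean(y).
    n, p = Xs.shape
    H = Xs @ np.linalg.solve(Xs.T @ Xs + delta * np.eye(p), Xs.T)
    e = yc - H @ yc                      # ridge residuals e_{i,delta}
    return (e @ e) / (n - (1.0 + np.trace(H))) ** 2

# The chosen delta is the grid value that minimizes gcv(Xs, yc, delta).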

Figure 3.4 shows the steps in the ridge regression algorithm. This algorithm is used in this study.

STEP 1 : Center and scale the data:
         X* :  x*_i = ( x_i - x̄_i ) / √( Σ ( x_i - x̄_i )^2 )

STEP 2 : Compute the correlation matrix of the centered and scaled data, r_xx = X*'X*.

STEP 3 : Compute the ridge regression estimators for a set of δ values:
         b'_{i,R} = ( X*'X* + δI )^{-1} X*'y

STEP 4 : Compute the estimate of β_0' :  b'_{0,R} = ȳ.

STEP 5 : Compute the coefficients of the natural variables:
         b_{i,R} = b'_{i,R} / S_i ,  where  S_i = √( Σ_i ( x_i - x̄_i )^2 )

STEP 6 : The constant term is estimated by
         b_{0,R} = b'_{0,R} - b'_{1,R} x̄_1 / S_1 - b'_{2,R} x̄_2 / S_2 - ... - b'_{k,R} x̄_k / S_k

STEP 7 : Compute the C_δ statistic
         C_δ = SS_{Res,δ} / σ̂^2 - n + 2 + 2 tr[ H_δ ]
         where SS_{Res,δ} is the residual sum of squares using ridge regression,
         SS_{Res,δ} = ( y - X b_R )'( y - X b_R ),
         σ̂^2 is the residual mean square of OLS,
         σ̂^2 = SSE / ( n - (k + 1) ) = ( y - Xb )'( y - Xb ) / ( n - (k + 1) ),
         and tr[ H_δ ] = Σ_i λ_i / ( λ_i + δ ).

STEP 8 : Plot C_δ against δ to choose the value of δ.

STEP 9 : Select the coefficients of ridge regression for the chosen δ value.

Figure 3.4 : Steps in the Ridge Regression algorithm
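As an illustration only of Steps 1-8 in Figure 3.4 (the thesis code was written in S-PLUS), the following Python/NumPy sketch evaluates the C_δ statistic over a grid of δ values; all names are illustrative.

import numpy as np

def ridge_c_delta(X, y, deltas):
    # C_delta = SS_Res,delta / sigma_hat^2 - n + 2 + 2 tr(H_delta),
    # computed on centered and scaled regressors; sigma_hat^2 is the OLS residual mean square.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.sqrt(((X - xbar) ** 2).sum(axis=0))
    Xs = (X - xbar) / S
    yc = y - y.mean()
    H0 = Xs @ np.linalg.pinv(Xs.T @ Xs) @ Xs.T                  # OLS hat matrix (delta = 0)
    e0 = yc - H0 @ yc
    sigma2 = (e0 @ e0) / (n - (p + 1))                          # OLS residual mean square
    out = []
    for d in deltas:
        H = Xs @ np.linalg.solve(Xs.T @ Xs + d * np.eye(p), Xs.T)
        e = yc - H @ yc
        out.append((e @ e) / sigma2 - n + 2.0 + 2.0 * np.trace(H))
    return np.array(out)

# The chosen delta is the grid value minimizing C_delta; a plot of C_delta against
# delta (as in Figure 3.5) shows where the curve reaches its minimum and levels off.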

The following is an example of the RR algorithm using the tobacco data in Table 3.1. The VIF values showed that the OLS coefficients are obviously unstable for all the regressors x1, x2, x3 and x4. Table 3.12 presents the ridge coefficient values for the natural regressors for δ from 0 to 0.049. The C_δ statistic was computed for each value of δ, and the results are given in the table.

Table 3.12 : Values of C_δ and the regression coefficients for various values of δ (the columns below list, in order, δ, C_δ, b0, b1, b2, b3 and b4)

 0.000 0.001 0.002* 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.010 0.011 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.020 0.021 0.022 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.030 0.031 0.032 0.033 0.034 0.035 0.036 0.037 0.038 0.039 0.040 0.041 0.042 0.043 0.044 0.045 0.046 0.047 0.048 0.049

C 5.000000 4.340973 4.253391 4.327295 4.459283 4.613206 4.773926 4.934437 5.091344 5.243008 5.388700 5.528178 5.661466 5.788734 5.910225 6.026223 6.137023 6.242918 6.344196 6.441132 6.533986 6.623002 6.708409 6.790421 6.869239 6.945048 7.018020 7.088319 7.156092 7.221482 7.284617 7.345620 7.404604 7.461675 7.516932 7.570468 7.622370 7.672720 7.721594 7.769063 6.762663 6.853500 6.939761 7.021843 7.100094 7.174822 7.246300 7.314772 7.380457 8.179556

b0 311.7818 282.1426 263.6954 250.2147 239.5044 230.5734 222.8969 216.1636 210.1722 204.7841 199.8986 195.4398 191.3487 187.5782 184.0897 180.8515 177.8369 175.0231 172.3906 169.9225 167.6042 165.4228 163.3668 161.4261 159.5918 157.8557 156.2108 154.6503 153.1685 151.7600 150.4198 149.1437 147.9276 146.7676 145.6605 144.6031 143.5925 142.6260 141.7012 140.8159 139.9678 139.1550 138.3757 137.6282 136.9109 136.2222 135.5609 134.9257 134.3152 133.7284

b1, b2, b3 and b4 (the following four columns of values, in this order) :

77.62480 58.23229 48.18875 42.01922 37.82433 34.77270 32.44326 30.59996 29.10019 27.85265 26.79614 25.88804 25.09774 24.40264 23.78570 23.23381 22.73668 22.28615 21.87564 21.49978 21.15414 20.83504 20.53939 20.26457 20.00835 19.76881 19.54430 19.33338 19.13479 18.94743 18.77033 18.60264 18.44358 18.29248 18.14872 18.01176 17.88111 17.75631 17.63697 17.52271 17.41321 17.30815 17.20727 17.11029 17.01700 16.92717 16.84060 16.75711 16.67653 16.59870

-75.137689 -64.680974 -57.117712 -51.076426 -46.010298 -41.639878 -37.799159 -34.379279 -31.303804 -28.516354 -25.973795 -23.642184 -21.494215 -19.507528 -17.663543 -15.946632 -14.343516 -12.842808 -11.434668 -10.110537 -8.8629218 -7.6852278 -6.5716237 -5.5169287 -4.5165216 -3.5662645 -2.6624399 -1.8016968 -0.9810054 -0.1976192 0.5509588 1.2670039 1.9525942 2.6096322 3.2398631 3.8448908 4.4261925 4.9851306 5.5229644 6.0408594 6.5398960 7.0210772 7.4853357 7.9335398 8.3664989 8.7849685 9.1896548 9.5812185 9.9602785 10.327415

-26.537060 -17.264891 -11.954895 -8.4518030 -5.9502566 -4.0706073 -2.6068076 -1.4359748 -0.4797187 0.3145583 0.9835631 1.5537338 2.0446123 2.4709603 2.8441369 3.1730236 3.4646627 3.7247069 3.9577440 4.1675341 4.3571866 4.5292937 4.6860320 4.8292420 4.9604896 5.0811146 5.1922695 5.2949509 5.3900243 5.4782447 5.5602733 5.6366918 5.7080131 5.7746916 5.8371309 5.8956906 5.9506919 6.0024229 6.0511421 6.0970825 6.1404545 6.1814483 6.2202364 6.2569756 6.2918088 6.3248661 6.3562665 6.3861189 6.4145234 6.4415716

21.832633 22.114287 21.435068 20.556026 19.679566 18.864069 18.123027 17.454881 16.853237 16.310603 15.819762 15.374228 14.968342 14.597235 14.256733 13.943264 13.653767 13.385610 13.136522 12.904539 12.687951 12.485267 12.295181 12.116545 11.948343 11.789679 11.639754 11.497856 11.363349 11.235661 11.114280 10.998741 10.888627 10.783556 10.683185 10.587199 10.495310 10.407258 10.322802 10.241721 10.163812 10.088888 10.016778 9.947321 9.880370 9.815789 9.753451 9.693236 9.635037 9.578750

[Figure: plot of C_δ (vertical axis, approximately 5 to 8) against δ (horizontal axis, 0.0 to 0.05)]

Figure 3.5 : Plot of C_δ against δ

Figure 3.5 shows that large changes occur in the coefficients between δ = 0 and δ = 0.001. The plot illustrates how the predictive capability of the model improves with increasing δ, with a "leveling off" at a δ of approximately 0.002. The ridge coefficients are given by

b_{0,R} = 263.69540
b_{1,R} = 48.18875
b_{2,R} = -57.11771
b_{3,R} = -11.95490
b_{4,R} = 21.43507

The final ridge regression model for the tobacco data is

y = 263.69540 + 48.18875 x1 - 57.11771 x2 - 11.95490 x3 + 21.43507 x4

CHAPTER 4

SIMULATION AND ANALYSIS

4.1 Introduction

The early stages of the research involved reviewing the literature and discussing how the chosen methods perform in regression analysis to handle multicollinearity problems. This chapter describes the details of the simulation study.

The generation of the simulated data sets is discussed in Section 4.2. The performance measures used for comparing the chosen methods are described in Section 4.3. Section 4.4 discusses the simulation results and the performance of SIMPLS regression, principal component regression and ridge regression in handling the multicollinearity problem. The algorithms were described, and shown to perform well on classical data sets, in Chapter 3. In this chapter, the performance characteristics of these methods are demonstrated and explored by applying the procedures to simulated data sets that have multicollinearity problems with various numbers of regressors.

4.2 Generating Simulated Data Sets

The number of regressors relates to the chance that multicollinearity problems will occur when the regressors are used in a regression: the more regressors involved, the greater the chance of multicollinearity problems in the analysis. The regression condition refers to the number of observations in the data set and the number of regressor variables.

The n regression observations were generated randomly in order to test the performance of the methodologies. The number of response variables considered in this study is q = 1. For every set of n observations used in this research, the numbers of regressor variables considered are p = 2, 4, 6 and 50. The regression conditions for this study are shown in Table 4.1.

Every set of regressor variables was tested using the VIF to see how seriously multicollinearity affected the data. These various numbers of regressors were chosen because they give different levels of multicollinearity, as indicated by the VIF values. Myers (1990) noted that a low level of collinearity exists between regressors when a VIF value is close to 10, while a large VIF value implies that the multicollinearity problem is very serious.
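As an illustration, the VIF of each regressor can be obtained from the R-squared of the regression of that regressor on the remaining ones; the following S-Plus/R-style sketch (the function name and arguments are illustrative, not taken from the thesis appendices) shows the calculation.

# VIF of each column of X: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes
# from regressing x_j on the other regressors.
vif <- function(X) {
  X <- as.data.frame(X)
  out <- numeric(ncol(X))
  for (j in 1:ncol(X)) {
    fit <- lm(X[, j] ~ ., data = X[, -j, drop = FALSE])
    out[j] <- 1 / (1 - summary(fit)$r.squared)
  }
  names(out) <- names(X)
  out
}

A VIF value well above 10 for a regressor is then taken, as in the text, as a sign of a serious multicollinearity problem.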

Table 4.1 : Factors and levels for the simulated data sets

Factors                                      Levels
Number of regressor variables (p)            2, 4, 6, 50
Number of observations in data set (n)       20, 30, 40, 60, 80, 100

The approach to creating data sets for testing the methodology was to randomly generate n regression observations. The values of n were set relative to the number of regressors p, especially for the data sets with a high number of regressors, to test how the methodology works when the difference between the number of regressors and the number of observations is small. This study was done for the case of a univariate response variable (q = 1).

For each simulation, the random errors for every set of n observations were distributed N(0, 1), and for each situation the number of replications was set at m = 100 data sets. The value m = 100 was chosen because it is sufficient to show consistent results for each generated data set.

The n observations were generated according to the model

y_i = β0 + β1 x_i1 + … + βp x_ip + ε_i,    i = 1, 2, …, n

where the x_ip are generated from the N(0, σ²) distributions specified in Table 4.2, ε_i is N(0, 1) and β0 = 0. For the purpose of obtaining collinearity in each set of data, the regressors were generated as x_j = x_1 + η_j for j = 2, …, p, where the columns of the noise matrix η are independently distributed as N(0, 0.1). The response variable y for each set of p regressors was generated as y = X1_p + δ, where 1_p is the p-vector of ones and δ comes from the N(0, 1) distribution. Thus y is a linear combination (here, the simple sum) of the p regressor variables plus an error term, as shown in Table 4.3; a code sketch of this generation scheme is given after Table 4.3.

TABLE 4.2 : The specific values of x_ip for each set of p regressors

p      x_ip
2      x1 ~ N(0, 1);  x2 = x1 + N(0, 0.1)
4      x1 ~ N(0, 1);  xj = x1 + N(0, 0.1),  j = 2, 3, 4
6      x1 ~ N(0, 1);  xj = x1 + N(0, 0.1),  j = 2, 3, …, 6
50     x1 ~ N(0, 1);  xj = x1 + N(0, 0.1),  j = 2, 3, …, 50

TABLE 4.3 : The response variable y for each set of p regressors

p      y
2      y = x1 + x2 + N(0, 1)
4      y = x1 + x2 + … + x4 + N(0, 1)
6      y = x1 + x2 + … + x6 + N(0, 1)
50     y = x1 + x2 + … + x50 + N(0, 1)
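A minimal S-Plus/R-style sketch of this generation scheme is given below; the function and object names are illustrative and the actual simulation code used in the study is the S-Plus code referred to in the appendices.

# One replicate of the simulation design of Tables 4.2 - 4.3.
gen.data <- function(n, p, noise.var = 0.1) {
  x1 <- rnorm(n)                           # x1 ~ N(0, 1)
  X  <- matrix(x1, n, p)                   # every column starts from x1
  if (p > 1)
    X[, 2:p] <- X[, 2:p] + matrix(rnorm(n * (p - 1), sd = sqrt(noise.var)),
                                  n, p - 1)
  y <- drop(X %*% rep(1, p)) + rnorm(n)    # y = x1 + ... + xp + N(0, 1)
  list(X = X, y = y)
}

# Example: one replicate of the p = 4, n = 100 condition
dat <- gen.data(100, 4)

Repeating the call m = 100 times for each (n, p) combination reproduces the replication scheme described above.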

The choice of the value for the variance of the noise matrix η, which comes from the N(0, σ²) distribution, is shown in Table 4.4. Multicollinearity problems occurred at the values 0.1, 0.01 and 0.001, where the VIF values are greater than 10, indicating the presence of multicollinearity. The value 0.1 was therefore chosen because its VIF values lie in the smallest of these ranges.

TABLE 4.9 : Regression model for p = 2 regressors

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept)  -8.4556      0.8675  -9.7474    0.0000
x1            7.7377      2.4157   3.2031    0.0047
x2            1.0125      2.6635   0.3802    0.7080

Residual standard error: 0.9362 on 19 degrees of freedom
Multiple R-Squared: 0.7910
F-statistic: 113.7 on 2 and 19 degrees of freedom, the p-value is 2.681e-011

TABLE 4.10 : Regression model for p = 4 regressors for n = 100

Response: y
Analysis of Variance Table (terms added sequentially, first to last)
            Df  Sum of Sq   Mean Sq   F Value      Pr(F)
x1           1   199.1453  199.1453  275.9209  0.0000000
x2           1     0.1267    0.1267    0.1755  0.6805179
x3           1     4.1195    4.1195    5.7077  0.0287540
x4           1     0.2635    0.2635    0.3651  0.5536773
Residuals   17    19.2697    0.7217

Coefficients:
               Value  Std. Error   t value  Pr(>|t|)
(Intercept) -8.4272      0.8143  -10.3495    0.0000
x1           5.2700      2.8080    1.8768    0.0778
x2          -3.9082      3.3484   -1.1672    0.2592
x3          14.5290      5.9048    2.4605    0.0249
x4           2.5955      4.2956    0.6042    0.5537

Residual standard error: 0.8496 on 17 degrees of freedom
Multiple R-Squared: 0.9136
F-statistic: 70.54 on 4 and 17 degrees of freedom, the p-value is 2.337e-010

TABLE 4.11 : Regression model for p = 6 regressors for n = 100

Response: y
Analysis of Variance Table (terms added sequentially, first to last)
            Df  Sum of Sq   Mean Sq   F Value      Pr(F)
x1           1   3595.962  3595.962  3311.659  0.0000000
x2           1      0.033     0.033     0.030  0.8622861
x3           1      7.034     7.034     6.478  0.0125681
x4           1      0.138     0.138     0.127  0.7225302
x5           1      3.430     3.430     3.159  0.0787664
x6           1      1.357     1.357     1.250  0.2664588
Residuals   93    100.984     1.086

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept) -0.0476      0.1110  -0.4291    0.6688
x1           3.2112      2.2765   1.4106    0.1617
x2          -0.7390      1.1357  -0.6507    0.5168
x3           2.5552      1.0405   2.4557    0.0159
x4           0.2151      1.0361   0.2076    0.8360
x5           2.0613      1.0587   1.9469    0.0546
x6          -1.2567      1.1241  -1.1180    0.2665

Residual standard error: 1.042 on 93 degrees of freedom
Multiple R-Squared: 0.9728
F-statistic: 553.8 on 6 and 93 degrees of freedom, the p-value is 0

Tables 4.9 – 4.11 show the regression output for the original variables for each set of regressors. The R² values will be used to compare the performance of the new models obtained using RR, PCR and PLSR with that of the original models.

4.4.1 Partial Least Squares Regression

The techniques implemented in the PLS procedure have the additional goal of accounting for variation in the predictors, under the assumption that the directions in the predictor space that are well sampled should provide better prediction for new observations when the predictors are highly correlated.

Construct k components

All of the techniques implemented in the PLS procedure work by extracting successive linear combinations of the predictors, called components (also called factors or latent vectors), which optimally address one or both of two goals: explaining response variation and explaining predictor variation. In particular, the method of partial least squares balances the two objectives, seeking components that explain both response and predictor variation.
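To make the construction concrete, the sketch below extracts the components for a single response using the usual weight/score/loading deflation scheme, which for a univariate y yields the same components as SIMPLS; the names are illustrative and the thesis's own S-Plus implementation is in the appendices.

# PLS component extraction for a univariate response (sketch).
# X: centered and scaled n x p predictor matrix; y: centered response; k: number of components.
pls1 <- function(X, y, k) {
  n <- nrow(X); p <- ncol(X)
  W <- matrix(0, p, k)        # weight vectors r_i
  P <- matrix(0, p, k)        # X-loadings p_i
  Tscores <- matrix(0, n, k)  # X-scores t_i
  for (i in 1:k) {
    r <- crossprod(X, y)                  # direction maximizing cov(Xr, y)
    r <- r / sqrt(sum(r^2))
    t <- X %*% r
    p.load <- crossprod(X, t) / sum(t^2)
    q.load <- sum(y * t) / sum(t^2)
    X <- X - t %*% t(p.load)              # deflate X and y
    y <- y - t * q.load
    W[, i] <- r; P[, i] <- p.load; Tscores[, i] <- t
  }
  list(weights = W, loadings = P, scores = Tscores)
}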

The PLS weight vectors r_i are by themselves difficult to interpret, but their signs are useful even though the magnitudes of the weights are hard to interpret. The weights give the directions onto which each PLS component projects and show which predictors are most represented in each component. Predictors with small weights in absolute value are less important than those with large weights. The PLS weight vectors for each set of regressors are shown in Tables 4.12 – 4.14. For p = 2 regressors, x1 is less important on PLS1 while x2 is less important on PLS2. The PLS weight vectors for p = 4 regressors show that x1 is less important on PLS2, x2 on PLS3, x3 on PLS1 and x4 on PLS4. For the p = 6 set of regressors, x1 and x6 are less important on PLS2, x2 on PLS3, and x3, x4 and x5 on PLS4.

Table 4.14 also shows that the PLS weight vectors for components 5 and 6 have the same values for p = 6 regressors. This is a sign that components 5 and 6 can be eliminated. The same applies to components 6 and 7 for p = 50. These results confirm the findings of Tobias (1997). However, eliminating or choosing the number of components should be done using specific methods such as the root mean squared error (RMSE) approach or cross-validation.

Table 4.12 : PLS weights, r_i, for the PLS components for p = 2 regressors

        PLS1      PLS2
x1      0.7047   -0.7094
x2      0.7094    0.7047

Table 4.13 : PLS weights, r_i, for the PLS components for p = 4 regressors

        PLS1      PLS2      PLS3      PLS4
x1      0.5019   -0.3028    0.1595   -0.6432
x2      0.5025   -0.8139   -0.1729    0.4352
x3      0.5088    0.7454   -0.6523    0.1418
x4      0.4864    0.5415    0.5402    0.0249

Table 4.14 : PLS weights, r_i, for the PLS components for p = 6 regressors

        PLS1      PLS2      PLS3      PLS4      PLS5      PLS6
x1      0.4105   -0.2276   -0.5829   -0.7568   -0.7406   -0.7406
x2      0.4133   -0.5050    0.3288    0.5162    0.5066    0.5066
x3      0.4136    0.7128    0.3659   -0.2347   -0.2562   -0.2562
x4      0.3995   -0.3383    0.3104    0.2297    0.3290    0.3290
x5      0.4069    0.2422   -0.5503    0.0423    0.0542    0.0542
x6      0.4052    0.1084    0.1371    0.2260    0.1333    0.1333

Table 4.15 : PLS weights, r_i, for the PLS components for p = 50 regressors. Each of the six rows of numbers below gives the 50 weights (for x1, x2, …, x50, in order) of one component, from PLS1 (first row) to PLS6 (last row).

0.14125 0.14203 0.14253 0.13731 0.14005 0.13941 0.14287 0.14035 0.14060 0.14435 0.14285 0.14092 0.14199 0.14187 0.14324 0.14197 0.14169 0.14139 0.14110 0.14500 0.13946 0.14009 0.14180 0.14441 0.14241 0.14533 0.14147 0.14054 0.14073 0.14029 0.14134 0.14233 0.13999 0.14160 0.13972 0.14319 0.14190 0.14037 0.14073 0.14045 0.14150 0.14208 0.14148 0.14132 0.13864 0.14168 0.14259 0.13975 0.14144 0.13896

0.00636 -0.20648 -0.20294 -0.20248 -0.21674 -0.13540 -0.23712 -0.25706 -0.13337 -0.20387 -0.32808 -0.14089 -0.14932 -0.23921 0.08370 0.10902 0.08179 0.07832 -0.10821 0.08251 0.07843 0.07100 0.01258 -0.01750 0.09483 -0.06334 0.08319 0.09527 0.06471 0.17656 0.14534 0.18933 0.02690 0.13985 0.11330 -0.03668 -0.06318 0.03912 -0.01090 0.09041 0.17345 0.04537 0.00586 -0.00015 0.06184 0.17220 0.14623 0.06173 0.22952 0.21468

0.00189 0.05741 0.02221 0.02519 0.20124 0.15682 0.02006 0.09933 0.20462 0.10236 0.11895 0.02844 0.22788 -0.01543 0.00961 0.04122 -0.11018 -0.00353 -0.23099 -0.07665 0.07809 -0.16516 -0.21115 -0.23848 0.11861 -0.40351 0.04223 -0.17548 -0.10569 0.24661 0.05215 0.26197 0.01032 0.07326 -0.09937 -0.08015 -0.13919 -0.08771 -0.12588 0.01500 0.03714 -0.00131 -0.12406 -0.27804 0.04538 0.05692 -0.01094 -0.03354 0.14113 0.23316

0.00344 0.02695 0.00294 0.04175 -0.05964 0.06893 -0.04484 -0.10741 0.19337 0.08314 -0.19036 0.12783 0.12466 -0.01454 0.12780 0.01141 -0.07955 0.02807 -0.37754 -0.07399 0.01354 -0.12409 -0.13497 -0.07056 0.14974 -0.29328 -0.06692 -0.11078 -0.04465 0.15624 0.02764 0.36313 -0.08073 0.17688 -0.00490 -0.12589 -0.23612 -0.07661 -0.10922 -0.00606 0.16804 -0.00098 -0.01809 -0.13420 -0.06786 0.28566 0.15626 -0.03223 0.31043 0.18352

0.00257 0.00847 -0.00527 0.01702 -0.05820 0.06744 -0.07286 -0.09926 0.18345 0.05739 -0.19313 0.10426 0.12217 -0.04385 0.12150 0.02133 -0.07095 0.04764 -0.39720 -0.06484 0.02656 -0.09326 -0.14869 -0.11330 0.11020 -0.31314 -0.03081 -0.08068 -0.04840 0.19747 0.05742 0.34424 -0.03929 0.16825 -0.00454 -0.15482 -0.23483 -0.05230 -0.11432 0.00110 0.17220 -0.03074 -0.07627 -0.14919 -0.03729 0.21727 0.12747 -0.01112 0.30038 0.23276

0.00259 0.00825 -0.00102 0.01533 -0.05851 0.06764 -0.07240 -0.09966 0.18330 0.05734 -0.19308 0.10393 0.12191 -0.04393 0.12105 0.02153 -0.07589 0.04347 -0.35970 -0.06540 0.02311 -0.09430 -0.14392 -0.11913 0.11170 -0.31329 -0.03106 -0.08071 -0.04824 0.19739 0.05749 0.34454 -0.03964 0.16837 -0.00380 -0.15469 -0.23471 -0.05272 -0.11445 0.00137 0.17206 -0.03050 -0.07648 -0.14939 -0.03691 0.21709 0.12736 -0.01127 0.30030 0.23236


As noted by Tobias (1997) the number of components to extract depends on the data. Basing the model on more extracted components improves the model fit to the observed data, but extracting too many components can cause over-fitting, that is, tailoring the model too much to the current data, to the detriment of future predictions.

Table 4.16 : PLS loadings, p_i, for the PLS components for p = 2 regressors

        PLS1       PLS2
x1      0.70336   -0.21405
x2      0.71083    1.20341

Table 4.17 : PLS loadings, p_i, for the PLS components for p = 4 regressors

        PLS1       PLS2       PLS3       PLS4
x1      0.50077    0.56743    6.19918   -6.96042
x2      0.50458    1.18762    6.53852   -7.18922
x3      0.50636   -0.10134    7.54709   -7.95096
x4      0.48808    1.06268    6.52523   -6.77604

Table 4.18 : PLS loadings, p_i, for the PLS components for p = 6 regressors

        PLS1       PLS2       PLS3       PLS4      PLS5
x1      0.41029    0.17288    0.87037    2.68927   2.76340
x2      0.41281   -0.38626    1.38970    3.87470   3.85721
x3      0.41447    0.82178    1.23730    2.08217   2.19883
x4      0.39917   -0.22100    1.44819    3.15214   3.45298
x5      0.40718    0.63130   -0.04143    2.36473   2.45287
x6      0.40536    0.28313    0.91194    3.09406   2.90041

Table 4.19 : PLS loadings, p_i, for the PLS components for p = 50 regressors. Each of the six rows of numbers below gives the 50 loadings (for x1, x2, …, x50, in order) of one component, from PLS1 (first row) to PLS6 (last row).

0.14124 0.14195 0.14244 0.13723 0.13997 0.13936 0.14278 0.14025 0.14055 0.14427 0.14272 0.14087 0.14193 0.14177 0.14328 0.14201 0.14172 0.14142 0.14106 0.14504 0.13949 0.14012 0.14181 0.14440 0.14244 0.14530 0.14151 0.14057 0.14076 0.14036 0.14140 0.14241 0.14000 0.14166 0.13976 0.14318 0.14188 0.14039 0.14073 0.14049 0.14157 0.14210 0.14148 0.14132 0.13867 0.14175 0.14265 0.13977 0.14154 0.13904

0.15406 -0.02654 -0.03436 -0.03880 0.00964 0.07530 -0.06891 -0.06488 0.09486 -0.00609 -0.12649 0.02803 0.08834 -0.08416 0.24909 0.28368 0.20485 0.23715 -0.02690 0.22067 0.26272 0.17362 0.10151 0.06509 0.29619 -0.03566 0.25762 0.19492 0.18818 0.41894 0.32303 0.43924 0.18879 0.32498 0.23781 0.09812 0.05015 0.16826 0.10566 0.25447 0.34627 0.20570 0.12391 0.06557 0.23412 0.35191 0.30397 0.20853 0.43754 0.45103

-0.52305 -0.58569 -0.62166 -0.62385 -0.31186 -0.41081 -0.61566 -0.45854 -0.42624 -0.56325 -0.42763 -0.65122 -0.36051 -0.68667 -0.54548 -0.39923 -0.58034 -0.49459 -0.67845 -0.54573 -0.35576 -0.63342 -0.73810 -0.85093 -0.38737 -0.98009 -0.35963 -0.64401 -0.60467 -0.14097 -0.36837 -0.25717 -0.42820 -0.44144 -0.58613 -0.58245 -0.61127 -0.56958 -0.64260 -0.43110 -0.46733 -0.47557 -0.66035 -0.83184 -0.37829 -0.46734 -0.53273 -0.52277 -0.36662 -0.15038

-1.05550 -1.14993 -1.21861 -1.11181 -1.10319 -1.03995 -1.19062 -1.23710 -0.97944 -1.10832 -1.28261 -1.08080 -1.01372 -1.20342 -0.95839 -0.95868 -1.07861 -1.02839 -1.39379 -1.09188 -0.93942 -1.19116 -1.18310 -1.13359 -0.77463 -1.39649 -1.05620 -1.18005 -1.05942 -0.81502 -0.96840 -0.62341 -1.13375 -0.84298 -1.01902 -1.09922 -1.23582 -1.15368 -1.12895 -0.98423 -0.87561 -1.02137 -1.10359 -1.27607 -0.94895 -0.86958 -0.93559 -0.98909 -0.72350 -0.79988

-1.11092 -1.21998 -1.26386 -1.18865 -1.14127 -1.09695 -1.28187 -1.25574 -1.04298 -1.19558 -1.31240 -1.16905 -1.05644 -1.29187 -1.02554 -1.00869 -1.13501 -1.05692 -1.41007 -1.14300 -0.98794 -1.21471 -1.25188 -1.26857 -0.89991 -1.47539 -1.04405 -1.20366 -1.12911 -0.80644 -0.98264 -0.71659 -1.11372 -0.91945 -1.08621 -1.18059 -1.28033 -1.16214 -1.19152 -1.03045 -0.92989 -1.07745 -1.17132 -1.33089 -1.00134 -0.92180 -0.99651 -1.06087 -0.79128 -0.77939

-1.11046 -1.21953 -1.26369 -1.18851 -1.14052 -1.09616 -1.28093 -1.25559 -1.04258 -1.19477 -1.31183 -1.16874 -1.05624 -1.29126 -1.02531 -1.00786 -1.13442 -1.05685 -1.40949 -1.14241 -0.98708 -1.21415 -1.25136 -1.26769 -0.89859 -1.47510 -1.04417 -1.20310 -1.12842 -0.80638 -0.98234 -0.71603 -1.11380 -0.91886 -1.08537 -1.18003 -1.27961 -1.16210 -1.19116 -1.02982 -0.92983 -1.07679 -1.17110 -1.33037 -1.00073 -0.92162 -0.99623 -1.06017 -0.79113 -0.77959

The factor loadings shown in Tables 4.16 – 4.19 show how the PLS factors are constructed from the centered and scaled predictors. PLS seeks values for the factor loadings and structural parameters that minimize the residual variance for the factors and for the X- and Y-variables. In this way, a factor is estimated to be the best predictable variable of its X-variables as well as the best predictor of the subsequent dependent variables or Y-variables (Rougoor et al., 2000).

For the p = 2 data set (Table 4.16), the loadings show that the explanatory variable x1 loads highest on PLS1 and x2 on PLS2. From the loadings for p = 4 (Table 4.17) it is clear that all the explanatory variables load highest in absolute value on PLS4, while for p = 6 (Table 4.18), x1, x3, x4 and x5 load highest on PLS5 and x2 and x6 on PLS4. Table 4.19 shows that most of the variables in the p = 50 data set load highest on PLS5.

Table 4.20 : Correlations between each PLS component and y

        p = 2          p = 4          p = 6          p = 50
PLS1    8.9170e-001    9.5927e-001    9.8734e-001    9.997299e-001
PLS2   -2.6238e-012    1.5431e-012    5.6022e-012    1.199658e-010
PLS3                   4.3545e-001    9.2587e-002   -6.600554e-002
PLS4                  -6.1381e-001    2.1289e-001   -1.386592e-001
PLS5                                  2.2726e-001   -1.501399e-001
PLS6                                                 -1.501638e-001

Table 4.20 displays the correlations between each PLS component and the response variable y. Notice that for every set of regressors PLS2 shows a correlation with y that is essentially zero, while PLS1 shows the highest correlation with the response variable; the remaining components for each data set show only moderate correlations.

Figure 4.2 : Correlation between x1 and x2 for p = 2 regressors (scatter plot of x2 against x1)

Figure 4.2 displays the correlation between the regressors for the p = 2 data set. The plot shows a strong relation between the two original variables, x1 and x2, whose points are plotted very close to one another along a line. This confirms the existence of the multicollinearity problem in the p = 2 data set. The idea of PLS regression is that the components constructed should have only a low relation between them.

Table 4.21 : Correlations between the components

p = 2
         T1           T2
T1   1.00000000
T2   0.03740995   1.00000000

p = 4
         T1           T2           T3          T4
T1   1.00000000
T2   0.07101789   1.00000000
T3   0.45856364  -0.27196095   1.0000000
T4  -0.63033261   0.37383220  -0.9415714   1.0000000

p = 6
         T1           T2           T3          T4          T5
T1   1.00000000
T2   0.02480167   1.00000000
T3   0.09389023  -0.35707187   1.0000000
T4   0.20374383  -0.77485365   0.4172865   1.0000000
T5   0.21667116  -0.82401730   0.4437629   0.9753324   1.0000000

p = 50
         T1           T2           T3           T4          T5          T6
T1   1.00000000
T2   0.02131786   1.00000000
T3  -0.06608345   0.35108780   1.00000000
T4  -0.12971870   0.68916884   0.70536149   1.0000000
T5  -0.14001087   0.74384900   0.76132642   0.9839383   1.0000000
T6  -0.14003084   0.74395512   0.76143503   0.9840787   0.9999982   1.0000000

Figure 4.3 : Correlation between the first and second components for p = 2 regressors (scatter plot of T2 against T1)

Figure 4.4 : Correlation between the components for p = 4 regressors (pairwise scatter plots of the X-scores T1 to T4)

Figure 4.5 : Correlation between the components for p = 6 regressors (pairwise scatter plots of the X-scores T1 to T5)

Plots of the X-scores against each other can also be used to look for irregularities in the data; they are displayed in Figures 4.3 – 4.5. In Figure 4.4, the components T3 and T4 are plotted very close to each other, which indicates a strong correlation between them. The plot for components T1 and T4 also shows a strong relation, while the points for T1 and T2, T1 and T3, T2 and T3, and T2 and T4 are scattered randomly. This means that components T3 and T4 are strong candidates for elimination from the analysis.

In Figure 4.5, the plot of components T4 and T5 shows a very strong relation because the points lie very close to each other. The plots for T2 and T5 and for T2 and T4 also show strong relations compared with the plots of T1 and T2, T1 and T3, T1 and T4, T1 and T5, T2 and T3, T3 and T4, and T3 and T5. Therefore components T4 and T5 are also candidates for elimination.

In these figures, randomly scattered points mean that there is no strong correlation between the components, except for components T3 and T4 for p = 4 regressors and components T4 and T5 for p = 6 regressors. Recall that the PLSR method constructs components that should have no strong relation among themselves. Since the plots for these two cases show high correlation between some components, those components are candidates for elimination, and the choice of kopt will be made using the RMSE. The correlation values are shown in Table 4.21, and can also be checked numerically as sketched below. Figures 4.6 – 4.8 display the correlations between each PLS component and the response variable, giving a good illustration of how PLS constructs components that are highly correlated with the response, something that PCR fails to do.
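As a quick numerical check of Table 4.21, the correlation matrix of the extracted scores can be computed directly; a minimal sketch, reusing the illustrative pls1 and gen.data helpers introduced earlier, is:

# Off-diagonal correlations close to 1 flag components that carry
# essentially the same information and are candidates for elimination.
scores <- pls1(scale(dat$X), dat$y - mean(dat$y), k = 4)$scores
round(cor(scores), 4)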

In Figure 4.6, the correlation plot for p = 2 regressors shows a strong relation for T1 but not for T2. The correlation plots for p = 4 displayed in Figure 4.7 also show a strong relation for T1 but no strong relation for T2, T3 and T4. Figure 4.8 shows the correlation plots between each component and the response variable y for p = 6; again T1 shows a very strong relation with y while the other components do not. All the figures show a very high correlation between the X- and Y-scores for the first component but a much lower correlation for the second and subsequent components.

Figure 4.6 : X- and Y-scores for p = 2 regressors (first and second components)

Figure 4.7 : X- and Y-scores for p = 4 regressors (all components)

Figure 4.8 : Correlation between the first five components of the p = 6 regressors data set and the response variable, y

Partial least squares algorithms choose successive orthogonal factors or components that maximize the covariance between each X-score and the corresponding Y-score. For a good PLS model, the first few factors show a high correlation between the X- and Y-scores, and the correlation usually decreases from one factor to the next. No distributional assumptions are made in PLS, so traditional statistical testing methods are not well suited (Rougoor et al., 2000).

Selecting the number of components

A very important issue in building the PLS model is the choice of the optimal number of components, kopt. The most common methods use a test set or cross-validation (CV). A test set should be independent of the training set that is used to estimate the regression parameters in the model, but should still be representative of the population. The root mean squared error, RMSE, is the method used here for this selection of the kopt components. Let n denote the number of observations in the test set; for each k, the root mean squared error RMSEk for this test set can be computed as in Eq. (3.6), where the fitted values ŷ_{i,k} of observation i in the test set are based on the parameter estimates obtained from the training set using a PLS method with k components. Hubert and Branden (2003a) noted that the k-value which gives the smallest, or a sufficiently small, value of RMSEk should be chosen as kopt.
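As a concrete illustration, RMSEk over a held-out test set can be computed as in the following sketch; the training/test split, the helper names and the use of the illustrative pls1 function above are assumptions, since Eq. (3.6) itself is defined in Chapter 3.

# RMSE_k = sqrt( mean( (y_i - yhat_{i,k})^2 ) ) over the test set, for k = 1, ..., kmax.
rmse.by.k <- function(X.train, y.train, X.test, y.test, kmax) {
  xbar <- apply(X.train, 2, mean); xsd <- apply(X.train, 2, sd)
  fit  <- pls1(scale(X.train), y.train - mean(y.train), kmax)
  out  <- numeric(kmax)
  for (k in 1:kmax) {
    Wk <- fit$weights[, 1:k, drop = FALSE]
    Pk <- fit$loadings[, 1:k, drop = FALSE]
    Tk <- fit$scores[, 1:k, drop = FALSE]
    qk <- crossprod(Tk, y.train - mean(y.train)) / apply(Tk^2, 2, sum)
    bk <- Wk %*% solve(crossprod(Pk, Wk), qk)      # coefficients for the scaled x's
    yhat <- mean(y.train) + scale(X.test, center = xbar, scale = xsd) %*% bk
    out[k] <- sqrt(mean((y.test - yhat)^2))
  }
  out
}

The k at which the returned vector is smallest (or sufficiently small) is then taken as kopt.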

Table 4.22 : RMSE values for p = 2 data sets

k       n = 20    n = 40    n = 60    n = 80    n = 100
PLS1    0.99301   1.09204   0.91631   0.88808   1.01668
PLS2    0.99301   1.04031   0.91027   0.88163   1.01391

Table 4.23 : RMSE values for p = 4 data sets

k       n = 20        n = 40        n = 60        n = 80        n = 100
PLS1    3.9720e-016   4.2130e-016   2.8665e-017   1.7625e-015   8.8817e-016
PLS2    2.9790e-016   6.3195e-016   1.1466e-016   2.5073e-015   1.4210e-015
PLS3    1.9860e-016   3.5108e-016   2.2932e-016   1.8494e-015   2.2204e-015
PLS4    6.4545e-016   4.2130e-016   2.8665e-016   2.3494e-015   9.3258e-016

Table 4.24 : RMSE values for p = 6 data sets for the first five PLS components

k       n = 20    n = 40    n = 60    n = 80    n = 100
PLS1    0.97062   1.22661   1.02552   0.99051   0.93580
PLS2    0.89834   1.13257   0.96783   0.93501   0.92457
PLS3    0.88655   1.12777   0.95217   0.93380   0.92283
PLS4    0.88023   1.12738   0.94248   0.93370   0.92069
PLS5    0.87307   1.12718   0.94208   0.93356   0.92036

Table 4.25 : RMSE values for p = 50 data sets for the first five PLS components

k       n = 60    n = 80    n = 100
PLS1    0.29126   0.35394   0.31231
PLS2    0.12832   0.13680   0.12444
PLS3    0.07923   0.08064   0.05992
PLS4    0.05515   0.05003   0.03626
PLS5    0.04063   0.03289   0.02340

Tables 4.22 – 4.25 show the RMSE values of the first few components for each set of regressors. The number of components with the smallest RMSE value is chosen as kopt. RMSE provides a direct estimate of the modelling error expressed in the original measurement units (Kvalheim, 1987).

For the n = 20 case in Table 4.22, the RMSE values for PLS1 and PLS2 are equal, which is a sign that the PLS2 component can be eliminated, while for the remaining sample sizes the RMSE decreases, meaning that the number of components is optimal at k = 2. Thus no component is eliminated for the low number of regressors data sets (p = 2), except for n = 20, which combines a low number of regressors with a small number of observations.

For the moderate number of regressors data sets with p = 4 (Table 4.23), the RMSEk values for n = 20 and 40 are minimal at kopt = 3, while for n = 60, 80 and 100 the number of components is optimal at kopt = 1. For p = 6 (Table 4.24) there is almost no difference across the numbers of observations: the minimum RMSE is always reached at kopt = 5, and the RMSE values decrease for each set of data. For the high number of regressors data sets with p = 50 (Table 4.25), the RMSEk values for n = 60, 80 and 100 are likewise minimal at kopt = 5, and again decrease for each set of data. Garthwaite (1994) noted that PLS reduces the number of terms, as the components are usually far fewer than the number of X variables. PLS also aims to avoid equations with many parameters when constructing components; to achieve this it adopts the principle that, when considering the relationship between Y and some specified X variable, the other X variables are not allowed to influence the estimate of the relationship directly but only through the components Tk.

According to Engelen et al. (2003), high RMSE values indicate low predictive ability of a method. Validation of the models was performed by comparing differences in R² and in the root mean squared error, RMSE (Hansen and Schjoerring, 2003). Tables 4.26 – 4.29 show the regression models using the selected PLS components; the choice of the number of components k is based on the RMSE values in Tables 4.22 – 4.25.

Table 4.26 : Regression model using all the PLS scores for p = 2 and n = 100

Response: y
Analysis of Variance Table (terms added sequentially, first to last)
            Df  Sum of Sq   Mean Sq   F Value      Pr(F)
T1           1   401.1907  401.1907  378.5451  0.0000000
T2           1     0.5623    0.5623    0.5305  0.4681439
Residuals   97   102.8028    0.3279

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept) -0.0374      0.1029  -0.3635    0.7170
T1           1.4721      0.0756  19.4699    0.0000
T2          -1.0347      1.4206  -0.7284    0.4681

Residual standard error: 1.029 on 97 degrees of freedom
Multiple R-Squared: 0.7963
F-statistic: 189.5 on 2 and 97 degrees of freedom, the p-value is 0

Equation (3.7) can then be used to transform the coefficients to those of the natural variables, while Equation (3.8) can be used to obtain the constant term. These values were calculated after the response variable was regressed onto the kopt components. The results are as follows:

b0,pls = 0.04719
b1,pls = 1.77158
b2,pls = 0.31512

The PLS regression model for p = 2, listed in Table 4.26, shows that the t-values decrease and hence the p-values increase. This is another indication that PLS produces the first PLS component as the most important variable in the model, followed by PLS2. The improvement of the model can be seen through the R² value: the R² of the PLS regression model (R² = 0.7963) is larger than that of the original regression model (R² = 0.7910) shown in Table 4.9.

The final PLS regression model for p = 2 is

y = 0.04719 + 1.77158x1 + 0.31512x2

Table 4.27 : Regression model using all the PLS scores for p = 4 and n = 100

Response: y
Analysis of Variance Table (terms added sequentially, first to last)
            Df  Sum of Sq   Mean Sq    F Value      Pr(F)
T1           1   1392.760  1392.760  1178.162  0.0000000
T2           1      7.060     7.060     5.972  0.0163786
T3           1      1.390     1.390     1.176  0.2809917
T4           1      0.011     0.011     0.009  0.9243603
Residuals   95    112.304     0.687

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept) -0.2417      0.1087  -2.2228    0.0286
T1           2.0039      0.1156  17.3302    0.0000
T2          -3.1773      1.6267  -1.9532    0.0537
T3          -1.4899      7.2962  -0.2042    0.8386
T4           0.6956      7.3068   0.0952    0.9244

Residual standard error: 1.087 on 95 degrees of freedom
Multiple R-Squared: 0.9258
F-statistic: 296.3 on 4 and 95 degrees of freedom, the p-value is 0

Table 4.28 : Regression model using the selected PLS scores (kopt = 1) for p = 4

Response: y
Analysis of Variance Table (terms added sequentially, first to last)
            Df  Sum of Sq   Mean Sq    F Value  Pr(F)
T1           1   1392.760  1392.760  1130.221      0
Residuals   98    120.764     1.232

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept) -0.2417      0.1110  -2.1771    0.0319
T1           1.9495      0.0580  33.6188    0.0000

Residual standard error: 1.11 on 98 degrees of freedom
Multiple R-Squared: 0.9202
F-statistic: 1130 on 1 and 98 degrees of freedom, the p-value is 0

The PLS regression coefficients in terms of the original variables for p = 4 regressors are as follows:

b0,pls = -0.07874
b1,pls = 0.97848
b2,pls = 0.97976
b3,pls = 0.99193
b4,pls = 0.94837

The RMSE values for n = 100 in Table 4.23 show that the minimum value is 8.8817e-016, so the number of components is optimal at kopt = 1. Table 4.27 lists the PLS regression model with all components, where again the t-values decrease and hence the p-values increase. However, since the number of components is optimal at kopt = 1, the final PLS regression model with the selected component is presented in Table 4.28. Both R² values for the PLS regression models are larger than that of the original regression model, R² = 0.9136 (Table 4.10), which indicates an improvement of the model; that is, the fit of the model to the data is quite good. The R² value for the selected component (R² = 0.9202) and the value for all components (R² = 0.9258) are both larger than the R² value of the original regression model. What makes the final PLS regression model better than the regression model presented in Table 4.10 is that, although the number of variables is the same, the PLS scores are uncorrelated.

The final PLS regression model for p = 4 is

y = -0.07874 + 0.97848x1 + 0.97976x2 + 0.99193x3 + 0.94837x4

Table 4.29 : Regression model using the selected PLS scores (kopt = 5) for p = 6 and n = 100

Response: y
Analysis of Variance Table (terms added sequentially, first to last)
            Df  Sum of Sq   Mean Sq    F Value      Pr(F)
T1           1   3393.403  3393.403  3765.704  0.0000000
T2           1      2.089     2.089     2.318  0.1312583
T3           1      0.321     0.321     0.357  0.5518180
T4           1      0.396     0.396     0.439  0.5091755
T5           1      0.061     0.061     0.068  0.7952459
Residuals   94     84.707     0.901

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept) -0.2194      0.0949  -2.3111    0.0230
T1           2.5092      0.0451  55.5885    0.0000
T2          -2.5108      1.7228  -1.4574    0.1483
T3          -0.4470      1.1508  -0.3884    0.6986
T4          -0.0079      6.6185  -0.0012    0.9991
T5          -1.9243      7.3944  -0.2602    0.7952

Residual standard error: 0.9493 on 94 degrees of freedom
Multiple R-Squared: 0.9757
F-statistic: 753.8 on 5 and 94 degrees of freedom, the p-value is 0

The PLS regression for p = 6 regressors and n = 100 gives the following coefficients in terms of the original variables:

b0,pls = 0.04989189
b1,pls = 3.29355388
b2,pls = 1.17930365
b3,pls = -0.42048181
b4,pls = 1.07823854
b5,pls = 0.55387702
b6,pls = 0.42487938

The regression with the reduced number of components has resulted in a slight increase in R², from 0.9728 (Table 4.11) to 0.9757 (Table 4.29), which shows an improvement of the PLS regression model.

The PLS regression model in Table 4.29 with the selected components again shows that the t-values decrease and hence the p-values increase. The optimal number of components for this data set, obtained from Table 4.24, is kopt = 5. The final PLS regression model for p = 6 is

y = 0.04989 + 3.29355x1 + 1.17930x2 - 0.42048x3 + 1.07824x4 + 0.55388x5 + 0.42488x6

The final PLS regression model for p = 50 is

y = -0.00994 + 0.30446x1 + 0.97473x2 + 0.96657x3 + 0.95084x4 + 0.94044x5 + 0.84339x6 + 0.94693x7 + 1.01967x8 + 1.00248x9 + 1.03658x10 + 1.11541x11 + 0.88275x12 + 0.97976x13 + 0.99009x14 + 0.17135x15 - 0.04190x16 - 0.11138x17 + 0.09849x18 + 0.12322x19 - 0.08358x20 + 0.06085x21 - 0.07620x22 + 0.00122x23 + 0.09326x24 + 0.08447x25 - 0.01026x26 + 0.00072x27 - 0.13770x28 - 0.02392x29 + 0.03845x30 - 0.08687x31 + 0.14131x32 + 0.18054x33 + 0.04961x34 - 0.11665x35 + 0.14351x36 + 0.14131x37 + 0.09534x38 + 0.12612x39 - 0.01314x40 - 0.04446x41 + 0.08191x42 + 0.12195x43 + 0.04738x44 + 0.01302x45 + 0.03191x46 - 0.02366x47 + 0.03851x48 - 0.04636x49 - 0.03282x50

4.4.2 Principal Component Regression

Principal component analysis is performed in order to simplify the description of a set of interrelated variables (Afifi and Clark, 1984). It allows the transformation of a set of correlated explanatory X-variables into an equal number of uncorrelated variables; these new variables, called principal components (PCs), are all linear combinations of the original correlated X-variables.

Construct k components

In PCR, the scores are obtained by extracting the most relevant information present in the x-variables by performing a principal component analysis on the predictor variables and thus using a variance criterion. No information concerning the response variables is yet taken into account.

Through this transformation, the explanatory variables (the PCs) are constructed in such a way as to minimize the effect of multicollinearity. In this way, dimensionality can be reduced without losing much of the information; in addition, interpretability is increased (Afifi and Clark, 1984).

As mentioned in Chapter III, the matrix of normalized eigenvectors V is associated with the eigenvalues λ1, λ2, …, λk of X'X (in correlation form), and VV' = I since V is an orthogonal matrix; the original regression model can then be written as in Eqs. (3.9) and (3.10).

The correlation matrices of the independent variables for n = 100 are listed in Tables 4.6 – 4.8, where all values are very close to 1. These values are also a sign that all regressors are highly correlated. The following are the eigenvalues of the correlation matrix and the resulting matrix of eigenvectors.
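A minimal S-Plus/R-style sketch of this construction (illustrative names; the actual PCR code used in the study is the S-Plus code in Appendix C) computes the eigen decomposition of the correlation matrix and regresses y on the leading scores.

# Principal component regression sketch.
Xs  <- scale(dat$X)             # standardized regressors (correlation form)
eig <- eigen(cor(dat$X))        # eigenvalues and eigenvectors V, with VV' = I
V   <- eig$vectors
Z   <- Xs %*% V                 # principal component scores
k   <- 1                        # number of PCs retained (see the selection rules below)
summary(lm(dat$y ~ Z[, 1:k, drop = FALSE]))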

Table 4.30 : The eigenvalues of the correlation matrix

p = 2 :  1.994357755   0.005642245
p = 4 :  3.972635017   0.012709687   0.011689290   0.002966005
p = 6 :  5.945611255   0.017048309   0.013993811   0.011982390   0.009811620   0.001552614
p = 50 : the 50 eigenvalues, in decreasing order, are

49.485105701 0.028084048 0.027305565 0.025910774 0.024459707 0.022811857 0.022110055 0.020841065 0.020534373 0.019442290 0.017893127 0.017288210 0.016972939 0.016216231 0.014808067 0.014036127 0.013811076 0.012887435 0.012038311 0.011571342 0.010823000 0.010397830 0.010092432 0.009588935 0.009001732 0.008444328 0.007672532 0.007429646 0.007284003 0.006876453 0.006331135 0.005847514 0.005540818 0.005360067 0.005213225 0.005062848 0.004966798 0.004175408 0.003541477 0.003284346 0.003181683 0.002772169 0.002495972 0.002304964 0.002190667 0.001987910 0.001625861 0.001233357 0.001038149 0.000106440

Table 4.30 shows the eigenvalues of the correlation matrix for each set of regressors with n = 100 observations. The eigenvalues are in decreasing order and are used as one of the indicators for deciding which components to eliminate from the analysis.

Table 4.31 : Matrix of eigenvectors

p = 2
-0.7071068   0.7071068
-0.7071068  -0.7071068

p = 4
 0.5011390   0.0885382  -0.0398539   0.8599025
 0.4996566   0.6230536   0.5018639  -0.3320844
 0.4997593   0.0624413  -0.7964840  -0.3345965
 0.4994433  -0.7746393   0.3348990  -0.1957876

p = 6
 0.4098406  -0.0097300  -0.0322827  -0.0342283   0.0124949   0.9108052
 0.4080210   0.3187762  -0.3527499  -0.7254095   0.1782846  -0.2224041
 0.4083239  -0.0346181   0.4087912  -0.1279717  -0.7885547  -0.1636079
 0.4076733   0.6472369   0.2652471   0.4979750   0.2709391  -0.1521303
 0.4077670  -0.6243987   0.4010963  -0.0617097   0.4948972  -0.1850476
 0.4078598  -0.2971518  -0.6900584   0.4521565  -0.1670586  -0.1918762

The matrices of eigenvectors presented in Table 4.31 for each set of regressors are used to construct the components using Eq. (3.11). All the S-Plus code for principal component regression, and the components that have been constructed, are given in Appendix C.


Selecting the number of components

The PCs are arranged in decreasing order of contribution to variance (Rougoor et al., 2000), so that dimensionality can be reduced by selecting only the PCs with a high contribution to variance. The number of PCs selected can be determined by examining the proportion of total variance explained by each component, or the cumulative proportion of total variance explained.

Afifi and Clark (1984) noted that a rule of thumb adopted by many investigators is to select only the PCs explaining at least 100/P percent of the total variance, with P being the total number of variables. This selection criterion was also used in this study (a sketch of the rule is given below).

Besides the percentage of variance explained, the eigenvalues of the PCs and a root mean squared error criterion can also be used to decide how many PCs to include in the regression of the Y-variable on the PCs. The eigenvalue is the variance of that PC. When an eigenvalue of a PC is close to zero, multicollinearity is present among the original variables, and in that case the PC can be excluded from the regression (Rougoor et al., 2000). The selected PCs were used as uncorrelated explanatory variables in the regression model.
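A sketch of this rule of thumb, continuing the illustrative objects from the PCR sketch above, is:

# Keep PC j if it explains at least (100/P) percent of the total variance.
pct.var <- 100 * eig$values / sum(eig$values)
keep    <- which(pct.var >= 100 / ncol(dat$X))
keep                            # for these simulated data this retains PC1 only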

TABLE 4.32 : Percentage of variance explained and the eigenvalues of the p = 2 principal components

Principal component    % variation    Eigenvalue
PC1                    99.71          1.99435
PC2                     0.28          0.00564

The percentage of variance explained by the 2 PCs and the eigenvalues of these PCs are shown in Table 4.32. These results also show that multicollinearity is present in the data set, because the second component (PC2) has an eigenvalue close to zero (0.00564). Therefore, when the rule of thumb is applied that a PC has to explain at least 100/P % of the variance to be included in the regression, the percentage of variance explained by a PC has to be at least 100/2 = 50%. Only the first of the 2 PCs satisfies this criterion.

TABLE 4.33 : Percentage of variance explained and the eigenvalues of the p = 4 principal components

Principal component    % variation    Eigenvalue
PC1                    99.32          3.97263
PC2                     0.32          0.01270
PC3                     0.29          0.01168
PC4                     0.07          0.00296

Table 4.33 shows the percentage of variance explained by the 4 PCs and their eigenvalues. These results also show that multicollinearity is present in the data set, because components 2, 3 and 4 (PC2, PC3 and PC4) have eigenvalues close to zero (0.01270, 0.01168 and 0.00296). Therefore, the percentage of variance explained by a PC to be included in the regression has to be at least 100/4 = 25%, and only the first of the 4 PCs, PC1, satisfies this criterion.

TABLE 4.34 : Percentage of variance explained and the eigenvalues of the p = 6 principal components

Principal component    % variation    Eigenvalue
PC1                    99.09          5.94561
PC2                     0.28          0.01704
PC3                     0.23          0.01399
PC4                     0.20          0.01198
PC5                     0.16          0.00981
PC6                     0.03          0.00155

TABLE 4.35 : Percentage of variance explained and the eigenvalues of the p = 50 principal components. The first row of numbers below gives the percentage of variance explained by PC1, PC2, …, PC50 (in order); the second row gives the corresponding eigenvalues.

98.9702 0.0562 0.0546 0.0518 0.0489 0.0456 0.0442 0.0417 0.0411 0.0389 0.0358 0.0346 0.0339 0.0324 0.0296 0.0281 0.0276 0.0258 0.0241 0.0231 0.0216 0.0208 0.0202 0.0192 0.0180 0.0169 0.0153 0.0148 0.0146 0.0137 0.0127 0.0117 0.0111 0.0107 0.0104 0.0101 0.0099 0.0083 0.0071 0.0066 0.0064 0.0055 0.0050 0.0046 0.0044 0.0040 0.0032 0.0025 0.0021 0.0002

49.48510 0.02808 0.02730 0.02591 0.02445 0.02281 0.02211 0.02084 0.02053 0.01944 0.01789 0.01728 0.01697 0.01621 0.01480 0.01403 0.01381 0.01288 0.01203 0.01157 0.01082 0.01039 0.01009 0.00958 0.00900 0.00844 0.00767 0.00742 0.00728 0.00687 0.00633 0.00584 0.00554 0.00536 0.00521 0.00506 0.00496 0.00417 0.00354 0.00328 0.00318 0.00277 0.00249 0.00230 0.00219 0.00198 0.00162 0.00123 0.00103 0.00010

The results in Table 4.34 show the percentage of variance explained and the eigenvalues for the p = 6 regressors with n = 100 data set. The percentage of variance explained by a PC to be included in the regression has to be at least 100/6 = 16.67%, and again only the first of the 6 PCs satisfies this criterion; the eigenvalues of PC2 to PC6 are all close to zero. The percentage of variance explained and the eigenvalues for p = 50 are shown in Table 4.35. Here a PC has to explain at least 100/50 = 2.0% of the variance to be included in the regression, and only the first of the 50 PCs satisfies this criterion, while the eigenvalues of the other components are close to zero.

One goal of PCR is to simplify the regression model by taking a reduced number of PCs into the prediction set, so that only the first k < p components are included in the regression model. The selected PCs then have the largest variances (sequential selection), but often PCs with smaller variances are more highly correlated with the response variable (Filzmoser and Croux, 2002).

Another alternative for setting up the PCR model is a stepwise selection of PCs based on an appropriate measure of association with the response variable. The first step is to search for the PC having the largest squared multiple correlation (SMC) with the response variable; the second step searches for an additional PC giving the largest SMC, and so on. Because the principal components are uncorrelated, this comes down to selecting the k components having the largest squared (bivariate) correlations with the dependent variable, as sketched below.
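Because the PC scores are uncorrelated, this stepwise search reduces to ranking the squared bivariate correlations of the scores with y, as in the following sketch (again using the illustrative Z and dat objects from the PCR sketch).

# Rank the PCs by their squared correlation with the response.
r2.with.y <- cor(Z, dat$y)^2
rev(order(r2.with.y))           # PC indices in decreasing order of association with y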

Table 4.36 : Regression model using all the PC scores for p = 2 and n = 100

Response: y
Analysis of Variance Table (terms added sequentially, first to last)
            Df  Sum of Sq   Mean Sq   F Value      Pr(F)
Z1           1   401.2043  401.2043  378.5579  0.0000000
Z2           1     0.5487    0.5487    0.5177  0.4735516
Residuals   97   102.8028    1.0598

Coefficients:
                Value  Std. Error   t value  Pr(>|t|)
(Intercept)   -0.0374      0.1029   -0.3635    0.7170
Z1           -14.1834      0.7290  -19.4566    0.0000
Z2             9.8613     13.7054    0.7195    0.4736

Residual standard error: 1.029 on 97 degrees of freedom
Multiple R-Squared: 0.7963
F-statistic: 189.5 on 2 and 97 degrees of freedom, the p-value is 0

Table 4.37 : Regression model using the selected PC scores (PC1) for p = 2 and n = 100

Response: y
Analysis of Variance Table (terms added sequentially, first to last)
            Df  Sum of Sq   Mean Sq   F Value  Pr(F)
Z1           1   401.2043  401.2043  380.4302      0
Residuals   98   103.3515    1.0546

Coefficients:
                Value  Std. Error   t value  Pr(>|t|)
(Intercept)   -0.0374      0.1027   -0.3644    0.7163
Z1           -14.1834      0.7272  -19.5046    0.0000

Residual standard error: 1.027 on 98 degrees of freedom
Multiple R-Squared: 0.7952
F-statistic: 380.4 on 1 and 98 degrees of freedom, the p-value is 0

A transformation of the coefficients to those of the natural variables for the p = 2 and n = 100 data set, using Eq. (3.13) for the slopes and Eq. (3.14) for the constant term, gives the following results:

b0,pc = 0.05199
b1,pc = 1.04502
b2,pc = 1.03408

All of the independent variables together explain 79.52% of the variation in the dependent variable y. The R² values of the PC regression models with all PCs and with the selected PC (R² = 0.7963 and R² = 0.7952) show an improvement over the original regression model in Table 4.9 (R² = 0.7910).

The final PC regression model for p = 2 is

y = 0.05199 + 1.04502x1 + 1.03408x2

Table 4.38 : Regression model using all the PC scores for p = 4 and n = 100

Response: y
Analysis of Variance Table (terms added sequentially, first to last)
            Df  Sum of Sq   Mean Sq    F Value      Pr(F)
Z1           1   1392.582  1392.582  1178.011  0.0000000
Z2           1      0.059     0.059     0.050  0.8229943
Z3           1      6.068     6.068     5.133  0.0257432
Z4           1      2.511     2.511     2.124  0.1482974
Residuals   95    112.304     1.182

Coefficients:
                Value  Std. Error  t value  Pr(>|t|)
(Intercept)   -0.2417      0.1087  -2.2228    0.0286
Z1            18.7228      0.5455  34.3222    0.0000
Z2             2.1633      9.6442   0.2243    0.8230
Z3           -22.7843     10.0564  -2.2657    0.0257
Z4            29.0961     19.9641   1.4574    0.1483

Residual standard error: 1.087 on 95 degrees of freedom
Multiple R-Squared: 0.9258
F-statistic: 296.3 on 4 and 95 degrees of freedom, the p-value is 0

Table 4.39 : Regression model using the selected PC scores (PC1) for p = 4 and n = 100

Response: y
Analysis of Variance Table (terms added sequentially, first to last)
            Df  Sum of Sq   Mean Sq    F Value  Pr(F)
Z1           1   1392.582  1392.582  1128.412      0
Residuals   98    120.943     1.234

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept) -0.2417      0.1111  -2.1755    0.0320
Z1          18.7228      0.5574  33.5918    0.0000

Residual standard error: 1.111 on 98 degrees of freedom
Multiple R-Squared: 0.9201
F-statistic: 1128 on 1 and 98 degrees of freedom, the p-value is 0

The coefficients in terms of the natural variables for the p = 4 and n = 100 data set are as follows:

b0,pc = -0.07829
b1,pc = 0.97766
b2,pc = 0.96457
b3,pc = 0.96161
b4,pc = 0.99612

From the results in Tables 4.38 – 4.39, the R² values of the PC regression models with all components and with the selected component are R² = 0.9258 and R² = 0.9201, respectively. These are larger than the R² of the original regression model, 0.9136 (Table 4.10), showing that the independent variables of the new regression model together explain 92.01% of the variation in the dependent variable y.

The final PC regression model for p = 4 is

y = -0.07829 + 0.97766x1 + 0.96457x2 + 0.96161x3 + 0.99612x4

Table 4.40 : Regression model using all the PC scores for p = 6 and n = 100

Response: y
Analysis of Variance Table (terms added sequentially, first to last)
            Df  Sum of Sq   Mean Sq    F Value      Pr(F)
Z1           1   3393.458  3393.458  3725.708  0.0000000
Z2           1      0.519     0.519     0.570  0.4522968
Z3           1      0.308     0.308     0.338  0.5623480
Z4           1      0.069     0.069     0.076  0.7834072
Z5           1      1.049     1.049     1.152  0.2858557
Z6           1      0.866     0.866     0.951  0.3319633
Residuals   93     84.706     0.911

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept) -0.2194      0.0954  -2.2987    0.0238
Z1          23.8904      0.3914  61.0386    0.0000
Z2           5.5168      7.3093   0.7548    0.4523
Z3          -4.6909      8.0677  -0.5814    0.5623
Z4          -2.4035      8.7186  -0.2757    0.7834
Z5          10.3424      9.6349   1.0734    0.2859
Z6          23.6213     24.2206   0.9753    0.3320

Residual standard error: 0.9544 on 93 degrees of freedom
Multiple R-Squared: 0.9757
F-statistic: 621.5 on 6 and 93 degrees of freedom, the p-value is 0

Table 4.41 : Regression model using the selected PC scores (PC1) for p = 6 and n = 100

Response: y
Analysis of Variance Table (terms added sequentially, first to last)
            Df  Sum of Sq   Mean Sq    F Value  Pr(F)
Z1           1   3393.458  3393.458  3799.879      0
Residuals   98     87.518     0.893

Coefficients:
               Value  Std. Error  t value  Pr(>|t|)
(Intercept) -0.2194      0.0945  -2.3215    0.0223
Z1          23.8904      0.3876  61.6432    0.0000

Residual standard error: 0.945 on 98 degrees of freedom
Multiple R-Squared: 0.9749
F-statistic: 3800 on 1 and 98 degrees of freedom, the p-value is 0

The improvement of the model can be seen through the R² values: the R² of the PC regression models with all components and with the selected component (R² = 0.9757 and R² = 0.9749) are larger than that of the original regression model (R² = 0.9728) shown in Table 4.11.

These values were calculated after the response variable was regressed onto the k components, and the results are as follows:

b0,pc = 0.05276
b1,pc = 1.02022
b2,pc = 1.00507
b3,pc = 1.00253
b4,pc = 1.03749
b5,pc = 1.01763
b6,pc = 1.02266

The final PC regression model for p = 6 is

y = 0.05276 + 1.02022x1 + 1.00507x2 + 1.00253x3 + 1.03749x4 + 1.01763x5 + 1.02266x6

The final PC regression model for p = 50 is

y = -0.02927 + 0.28284x1 + 0.27834x2 + 0.27749x3 + 0.28722x4 + 0.28170x5 + 0.28313x6 + 0.27726x7 + 0.28139x8 + 0.28206x9 + 0.27362x10 + 0.27684x11 + 0.28116x12 + 0.27903x13 + 0.27855x14 + 0.27652x15 + 0.27818x16 + 0.27988x17 + 0.27973x18 + 0.28051x19 + 0.27313x20 + 0.28385x21 + 0.28261x22 + 0.27877x23 + 0.27344x24 + 0.27715x25 + 0.27164x26 + 0.27915x27 + 0.28120x28 + 0.28069x29 + 0.28159x30 + 0.27949x31 + 0.27749x32 + 0.28178x33 + 0.27975x34 + 0.28312x35 + 0.27653x36 + 0.27882x37 + 0.28163x38 + 0.28160x39 + 0.28224x40 + 0.27981x41 + 0.27836x42 + 0.27921x43 + 0.28013x44 + 0.28509x45 + 0.27845x46 + 0.27686x47 + 0.28223x48 + 0.27932x49 + 0.28389x50

4.4.3 Ridge Regression

A solution to the problem of multicollinearity is to abandon the usual least squares procedure and turn to biased estimation techniques. In using a biased estimation procedure, one is willing to allow a certain amount of bias in the estimates in order to reduce their variances, and the ridge regression method is one of the biased estimation techniques used for this purpose. The ridge regression estimates are computed for increasing values of δ, starting from δ = 0, until a value of δ is reached for which all the regression coefficients appear to have stabilised. Plotting the values of the coefficients against the successive values of δ gives a curve referred to as the ridge trace, which helps the analyst decide on an appropriate value of δ. The value of δ is chosen at the point beyond which the coefficients are no longer changing rapidly.

The purpose of the ridge trace is to indicate, for a given set of data, a set of estimates that are reasonable. Unfortunately, it is sometimes difficult to determine an appropriate value of δ for which the estimates of all the coefficients have stabilised (Ozcelik et al., 2002). Tables 4.42 – 4.44 list series of ridge regression coefficients computed for p = 2, p = 4 and p = 6 with n = 100; the corresponding computation was also carried out for p = 50. The numbers are the coefficients of the natural variables, and the resulting C value is shown for each δ. For the p = 2, 4 and 6 data sets the δ values were generated from 0.00 to 0.49, while for p = 50 they were generated from 0.000 to 0.049.
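The ridge estimates underlying the tables can be computed directly from the standardized data; the following is a minimal S-Plus/R-style sketch (illustrative names, and the standardization convention is an assumption, since the exact form used follows Chapter 3).

# Ridge trace: b(delta) = (X'X + delta I)^{-1} X'y on the standardized scale,
# evaluated over a grid of delta values.
ridge.trace <- function(X, y, deltas) {
  Xs <- scale(X); ys <- y - mean(y)
  p  <- ncol(Xs)
  B  <- matrix(0, length(deltas), p)
  for (i in seq(along = deltas))
    B[i, ] <- solve(crossprod(Xs) + deltas[i] * diag(p), crossprod(Xs, ys))
  B
}

# Example: the grid used for the p = 2, 4 and 6 data sets
trace.p2 <- ridge.trace(dat$X, dat$y, seq(0, 0.49, by = 0.01))

Plotting each column of the result against the delta grid gives the ridge trace described above.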

Figure 4.9 : Plot of C against δ for p = 2 and n = 100

Figure 4.10 : Plot of C against δ for p = 4 and n = 100

Figure 4.11 : Plot of C against δ for p = 6 and n = 100

Figure 4.12 : Plot of C against δ for p = 50 and n = 100

Table 4.42 : Values of δ, C and the coefficient vectors employed for p = 2 and n = 100. Each group of five numbers below gives, in order, δ, C, b0, b1 and b2 for one value of δ.

0.00 3.000000 0.04719663 1.7715801 0.3151197 0.01 1.932441 0.04981990 1.3018779 0.7695895 0.02 1.772478 0.05005357 1.1945112 0.8656169 0.03 1.736875 0.04991263 1.1445459 0.9049445 0.04 1.751886 0.04964610 1.1142856 0.9248727 0.05 1.798369 0.04932395 1.0931325 0.9358870 0.06 1.868941 0.04897348 1.0769463 0.9420828 0.07 1.959989 0.04860751 1.0637758 0.9453897 0.08 2.069417 0.04823285 1.0525808 0.9468355 0.09 2.195841 0.04785345 1.0427557 0.9470182 0.10 2.338246 0.04747172 1.0339241 0.9463088 0.11 2.495833 0.04708924 1.0258397 0.9449499 0.12 2.667935 0.04670706 1.0183341 0.9431067 0.13 2.853969 0.04632590 1.0112886 0.9408955 0.14 3.053413 0.04594628 1.0046170 0.9384005 0.15 3.265791 0.04556856 0.9982548 0.9356840 0.16 3.490657 0.04519302 0.9921533 0.9327933 0.17 3.727594 0.04481985 0.9862746 0.9297646 0.18 3.976207 0.04444920 0.9805887 0.9266265 0.19 4.236117 0.04408117 0.9750720 0.9234013 0.20 4.506964 0.04371583 0.9697050 0.9201070 0.21 4.788401 0.04335324 0.9644719 0.9167582 0.22 5.080094 0.04299345 0.9593597 0.9133669 0.23 5.381723 0.04263646 0.9543574 0.9099428 0.24 5.692975 0.04228231 0.9494558 0.9064941 0.25 6.013550 0.04193098 0.9446469 0.9030275 0.26 6.343157 0.04158248 0.9399241 0.8995487 0.27 6.681513 0.04123681 0.9352814 0.8960627 0.28 7.028346 0.04089395 0.9307139 0.8925733 0.29 7.383391 0.04055389 0.9262170 0.8890842 0.30 7.746388 0.04021662 0.9217868 0.8855982 0.31 8.117090 0.03988210 0.9174199 0.8821180 0.32 8.495253 0.03955033 0.9131131 0.8786456 0.33 8.880641 0.03922128 0.9088638 0.8751830 0.34 9.273026 0.03889492 0.9046693 0.8717317 0.35 9.672185 0.03857124 0.9005275 0.8682933 0.36 10.077901 0.03825021 0.8964362 0.8648687 0.37 10.489964 0.03793180 0.8923937 0.8614592 0.38 10.908170 0.03761598 0.8883982 0.8580656 0.39 11.332320 0.03730274 0.8844482 0.8546886 0.40 6.954224 0.03699204 0.8805422 0.8513289 0.41 7.333125 0.03668386 0.8766788 0.8479872 0.42 7.719332 0.03637816 0.8728568 0.8446638 0.43 8.112618 0.03607493 0.8690751 0.8413592 0.44 8.512764 0.03577414 0.8653325 0.8380737 0.45 8.919554 0.03547576 0.8616282 0.8348076 0.46 9.332778 0.03517977 0.8579610 0.8315612 0.47 9.752231 0.03488613 0.8543301 0.8283347 0.48 10.177714 0.03459483 0.8507348 0.8251281 0.49 15.860674 0.03430583 0.8471741 0.8219417

Table 4.43 : Values of δ, C and the coefficient vectors employed for p = 4 and n = 100. Each group of seven numbers below gives, in order, δ, C, b0, b1, b2, b3 and b4 for one value of δ.

0.00 5.000000 -0.10999387 3.6992296 -1.0717095 1.8399890 -0.6020724 0.01 4.021505 -0.09379091 1.6337205 0.1766168 1.7432330 0.3168188 0.02 4.467049 -0.08926390 1.3521021 0.4501728 1.5209275 0.5435574 0.03 4.829126 -0.08718322 1.2373525 0.5784794 1.3914519 0.6530161 0.04 5.113674 -0.08608227 1.1740840 0.6530642 1.3080723 0.7174305 0.05 5.350461 -0.08547433 1.1334644 0.7015512 1.2498427 0.7595496 0.06 5.559579 -0.08514935 1.1048279 0.7353379 1.2067036 0.7889676 0.07 5.753573 -0.08500156 1.0833051 0.7600150 1.1732996 0.8104550 0.08 5.940399 -0.08497182 1.0663541 0.7786536 1.1465336 0.8266572 0.09 6.125250 -0.08502446 1.0525174 0.7930834 1.1244938 0.8391605 0.10 6.311614 -0.08513669 1.0408993 0.8044628 1.1059379 0.8489748 0.11 6.501893 -0.08529328 1.0309186 0.8135614 1.0900240 0.8567735 0.12 6.697771 -0.08548364 1.0221817 0.8209106 1.0761616 0.8630232 0.13 6.900448 -0.08570019 1.0144127 0.8268890 1.0639245 0.8680574 0.14 7.110785 -0.08593736 1.0074123 0.8317741 1.0529973 0.8721207 0.15 7.329405 -0.08619099 1.0010328 0.8357734 1.0431415 0.8753969 0.16 7.556757 -0.08645786 0.9951626 0.8390456 1.0341737 0.8780263 0.17 7.793160 -0.08673550 0.9897159 0.8417139 1.0259505 0.8801183 0.18 8.038838 -0.08702193 0.9846253 0.8438756 1.0183579 0.8817592 0.19 8.293944 -0.08731559 0.9798373 0.8456083 1.0113045 0.8830181 0.20 8.558571 -0.08761520 0.9753091 0.8469748 1.0047158 0.8839510 0.21 8.832771 -0.08791972 0.9710056 0.8480267 0.9985308 0.8846036 0.22 9.116562 -0.08822830 0.9668982 0.8488062 0.9926988 0.8850137 0.23 9.409935 -0.08854023 0.9629629 0.8493487 0.9871776 0.8852129 0.24 9.712859 -0.08885490 0.9591799 0.8496840 0.9819315 0.8852275 0.25 10.025286 -0.08917181 0.9555324 0.8498370 0.9769303 0.8850801 0.26 10.347156 -0.08949054 0.9520060 0.8498291 0.9721481 0.8847897 0.27 10.678396 -0.08981071 0.9485885 0.8496785 0.9675628 0.8843726 0.28 11.018925 -0.09013203 0.9452695 0.8494011 0.9631551 0.8838429 0.29 11.368656 -0.09045420 0.9420399 0.8490105 0.9589084 0.8832128 0.30 11.727495 -0.09077701 0.9388919 0.8485184 0.9548082 0.8824929 0.31 12.095345 -0.09110024 0.9358184 0.8479354 0.9508417 0.8816925 0.32 12.472103 -0.09142372 0.9328134 0.8472705 0.9469977 0.8808199 0.33 12.857666 -0.09174730 0.9298717 0.8465317 0.9432662 0.8798820 0.34 13.251927 -0.09207083 0.9269885 0.8457261 0.9396386 0.8788854 0.35 13.654778 -0.09239420 0.9241596 0.8448600 0.9361069 0.8778357 0.36 14.066110 -0.09271730 0.9213812 0.8439391 0.9326640 0.8767378 0.37 14.485813 -0.09304003 0.9186500 0.8429684 0.9293039 0.8755964 0.38 14.913776 -0.09336232 0.9159631 0.8419523 0.9260206 0.8744153 0.39 15.349889 -0.09368408 0.9133176 0.8408950 0.9228093 0.8731983 0.40 10.950862 -0.09400526 0.9107112 0.8398000 0.9196652 0.8719486 0.41 11.329139 -0.09432578 0.9081416 0.8386706 0.9165843 0.8706692 0.42 11.715524 -0.09464561 0.9056068 0.8375098 0.9135626 0.8693627 0.43 12.110024 -0.09496470 0.9031051 0.8363204 0.9105967 0.8680315 0.44 12.512627 -0.09528299 0.9006346 0.8351047 0.9076835 0.8666779 0.45 12.923301 -0.09560046 0.8981940 0.8338650 0.9048201 0.8653038 0.46 13.342002 -0.09591707 0.8957817 0.8326033 0.9020038 0.8639111 0.47 13.768675 -0.09623279 0.8933965 0.8313216 0.8992322 0.8625014 0.48 14.203254 -0.09654760 0.8910373 0.8300214 0.8965030 0.8610763 0.49 20.135084 -0.09686147 0.8887028 0.8287045 0.8938143 0.8596372

Table 4.44 : Values of λ, C and coefficient vectors employed for p = 6 and n = 100

λ       C          b0            b1          b2          b3          b4          b5          b6

0.00 7.000000 0.04980847 3.2941874 1.1852239 -0.4178801 1.0735231 0.5548114 0.4192540 0.01 2.942177 0.05645370 1.3368063 1.3365215 0.4222996 1.2250812 0.8863645 0.8947269 0.02 1.980832 0.05492228 1.1898577 1.2462735 0.6163887 1.1773866 0.9166090 0.9432252 0.03 1.548646 0.05379012 1.1341670 1.1915046 0.7106130 1.1462972 0.9333380 0.9625883 0.04 1.333756 0.05289748 1.1042129 1.1554030 0.7662396 1.1249857 0.9441836 0.9726702 0.05 1.235836 0.05214097 1.0851226 1.1297255 0.8027403 1.1093673 0.9516209 0.9785467 0.06 1.212566 0.05146779 1.0716410 1.1103948 0.8283458 1.0973017 0.9568785 0.9821475 0.07 1.242957 0.05084920 1.0614350 1.0952015 0.8471498 1.0875920 0.9606555 0.9843746 0.08 1.315373 0.05026839 1.0533096 1.0828518 0.8614217 1.0795221 0.9633818 0.9857076 0.09 1.422811 0.04971494 1.0465890 1.0725398 0.8725221 1.0726391 0.9653363 0.9864264 0.10 1.560772 0.04918205 1.0408624 1.0637374 0.8813164 1.0666427 0.9667079 0.9867050 0.11 1.726210 0.04866508 1.0358654 1.0560848 0.8883820 1.0613263 0.9676289 0.9866579 0.12 1.916968 0.04816081 1.0314202 1.0493283 0.8941182 1.0565432 0.9681949 0.9863633 0.13 2.131459 0.04766687 1.0274029 1.0432841 0.8988108 1.0521862 0.9684764 0.9858763 0.14 2.368473 0.04718152 1.0237242 1.0378153 0.9026691 1.0481752 0.9685268 0.9852374 0.15 2.627063 0.04670345 1.0203186 1.0328186 0.9058503 1.0444493 0.9683871 0.9844764 0.16 2.906460 0.04623164 1.0171365 1.0282139 0.9084745 1.0409612 0.9680893 0.9836162 0.17 3.206033 0.04576530 1.0141399 1.0239386 0.9106354 1.0376739 0.9676589 0.9826743 0.18 3.525245 0.04530378 1.0112991 1.0199427 0.9124067 1.0345575 0.9671162 0.9816648 0.19 3.863633 0.04484658 1.0085906 1.0161863 0.9138476 1.0315882 0.9664778 0.9805985 0.20 4.220791 0.04439329 1.0059955 1.0126365 0.9150059 1.0287463 0.9657572 0.9794845 0.21 4.596357 0.04394356 1.0034986 1.0092665 0.9159208 1.0260158 0.9649658 0.9783301 0.22 4.990004 0.04349710 1.0010872 1.0060539 0.9166245 1.0233833 0.9641129 0.9771411 0.23 5.401430 0.04305368 0.9987510 1.0029802 0.9171442 1.0208374 0.9632063 0.9759227 0.24 5.830358 0.04261309 0.9964813 1.0000295 0.9175023 1.0183686 0.9622529 0.9746789 0.25 6.276528 0.04217516 0.9942708 0.9971886 0.9177182 1.0159690 0.9612583 0.9734134 0.26 6.739697 0.04173974 0.9921132 0.9944458 0.9178080 1.0136315 0.9602273 0.9721290 0.27 7.219633 0.04130671 0.9900032 0.9917915 0.9177859 1.0113501 0.9591643 0.9708283 0.28 7.716115 0.04087595 0.9879364 0.9892170 0.9176638 1.0091198 0.9580729 0.9695136 0.29 8.228933 0.04044736 0.9859087 0.9867151 0.9174522 1.0069359 0.9569562 0.9681867 0.30 8.757882 0.04002086 0.9839169 0.9842792 0.9171603 1.0047947 0.9558171 0.9668493 0.31 9.302766 0.03959637 0.9819578 0.9819037 0.9167959 1.0026925 0.9546579 0.9655029 0.32 9.863395 0.03917382 0.9800289 0.9795836 0.9163661 1.0006264 0.9534810 0.9641486 0.33 10.439585 0.03875316 0.9781278 0.9773145 0.9158771 0.9985937 0.9522882 0.9627876 0.34 11.031155 0.03833432 0.9762526 0.9750925 0.9153343 0.9965919 0.9510811 0.9614210 0.35 11.637929 0.03791727 0.9744015 0.9729140 0.9147426 0.9946189 0.9498614 0.9600494 0.36 12.259738 0.03750195 0.9725727 0.9707759 0.9141062 0.9926729 0.9486303 0.9586738 0.37 12.896411 0.03708832 0.9707650 0.9686755 0.9134292 0.9907520 0.9473891 0.9572948 0.38 13.547785 0.03667635 0.9689770 0.9666102 0.9127148 0.9888547 0.9461388 0.9559130 0.39 14.213698 0.03626600 0.9672074 0.9645777 0.9119664 0.9869796 0.9448806 0.9545289 0.40 7.670604 0.03585725 0.9654554 0.9625761 0.9111866 0.9851255 0.9436151 0.9531432 0.41 8.225222 0.03545006 0.9637200 0.9606033 0.9103781 
0.9832911 0.9423433 0.9517561 0.42 8.794972 0.03504440 0.9620002 0.9586578 0.9095431 0.9814755 0.9410659 0.9503682 0.43 9.379745 0.03464026 0.9602953 0.9567380 0.9086837 0.9796777 0.9397836 0.9489797 0.44 9.979423 0.03423761 0.9586046 0.9548425 0.9078019 0.9778968 0.9384968 0.9475910 0.45 10.593885 0.03383643 0.9569274 0.9529700 0.9068994 0.9761320 0.9372063 0.9462024 0.46 11.223004 0.03343669 0.9552632 0.9511192 0.9059777 0.9743826 0.9359125 0.9448142 0.47 11.866653 0.03303838 0.9536114 0.9492892 0.9050384 0.9726479 0.9346158 0.9434266 0.48 12.524701 0.03264149 0.9519714 0.9474789 0.9040828 0.9709273 0.9333168 0.9420399 0.49 21.638775 0.03224599 0.9503429 0.9456874 0.9031120 0.9692201 0.9320157 0.9406542


Figures 4.9 – 4.12 and Tables 4.42 – 4.44 clearly illustrate the instability of the OLS coefficients: notice how rapidly several of the coefficients change as λ moves only slightly away from zero. The C values also drop substantially for λ > 0 and then rise again beyond λ = 0.03 for p = 2 (Figure 4.9 and Table 4.42), λ = 0.01 for p = 4 (Figure 4.10 and Table 4.43), λ = 0.06 for p = 6 (Figure 4.11 and Table 4.44) and λ = 0.001 for p = 50 (Figure 4.12). Tables 4.42 – 4.44 list the C values and the coefficients in the natural regressors for λ from 0.00 to 0.49 for p = 2, 4 and 6, and for λ from 0.000 to 0.049 for p = 50, to illustrate where the C values reach stability so that the best λ can be chosen. For p = 2 and n = 100, the best λ value is 0.03, as illustrated in Figure 4.9, and the ridge regression coefficients are as follows:

b0,R = 0.04991
b1,R = 1.14455
b2,R = 0.90495

The improvement of the model can be seen through the R2 value: the R2 for the ridge regression model is 0.7918, which is larger than the R2 value for the original regression model (R2 = 0.7910) shown in Table 4.9.

The final Ridge regression model for p = 2 is

y = 0.04991 + 1.14455x1 + 0.90495x2
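For reference, the kind of ridge trace used to select λ above can be reproduced by evaluating the ridge estimator over a grid of ridge constants and reading off where the coefficients (and the C statistic) settle down. The S-Plus-style sketch below is only an illustration of that idea and is not the thesis's Appendix D code; the names X, y, lambdas and ridge.trace are assumptions.

# Illustrative ridge-trace sketch (not the Appendix D code).
# X: n x p matrix of regressors, y: response vector (assumed names).
ridge.trace <- function(X, y, lambdas = seq(0, 0.49, by = 0.01))
{
  Z  <- scale(X)                      # centre and scale the regressors
  yc <- y - mean(y)                   # centre the response
  p  <- ncol(Z)
  coefs <- sapply(lambdas, function(l)
    solve(t(Z) %*% Z + l * diag(p), t(Z) %*% yc))   # (Z'Z + lambda I)^-1 Z'y
  t(coefs)                            # one row of standardized coefficients per lambda
}

The standardized coefficients would then be transformed back to the natural variables, as was done for the models reported in this section.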


The Cfor p = 4 and n = 100 that is illustrated in Figure 4.10 reached the best

point of the at 0.01. The ridge regression coefficients after transformation back to natural variables are given as follows b0,R = -0.09379 b1,R = 1.63372 b2,R = 0.17662 b3,R = 1.74323 b4,R = 0.31682

This new model also shows an improvement: it gives R2 = 0.9167, which is larger than the R2 of the regression model shown in Table 4.10 (R2 = 0.9136).

The final Ridge regression model for p = 4 is

y = –0.09379 + 1.63372x1 + 0.17662x2 + 1.74323x3 + 0.31682x4

For the p = 6 and n = 100 case, the C values drop substantially for λ > 0 and then rise again at λ = 0.06, which gives the following ridge regression coefficients:

b0,R = 0.05147
b1,R = 1.07164
b2,R = 1.11039
b3,R = 0.82835
b4,R = 1.09730
b5,R = 0.95688
b6,R = 0.98215

The R2 value for this new model is 0.9755, which is larger than that of the regression model shown in Table 4.11 (R2 = 0.9728).


The final Ridge regression model for p = 6 is

y = 0.05147 + 1.07164x1 + 1.11039x2 + 0.82835x3 + 1.09730x4 + 0.95688x5 + 0.98215x6

The final Ridge regression model for p = 50 is

y = -0.00997 + 0.34556x1 + 0.91371x2 + 0.90160x3 + 0.89704x4 + 0.92930x5 + 0.85324x6 + 0.92084x7 + 0.95737x8 + 0.92006x9 + 0.94398x10 + 1.02915x11 + 0.85334x12 + 0.89868x13 + 0.93506x14 + 0.10522x15 + 0.01630x16 – 0.01826x17 + 0.07817x18 + 0.14608x19 + 0.00423x20 + 0.07746x21 + 0.00834x22 + 0.03211x23 + 0.09832x24 + 0.08238x25 + 0.04816x26 + 0.01804x27 – 0.04070x28 + 0.03199x29 + 0.03631x30 – 0.03220x31 + 0.07078x32 + 0.12846x33 + 0.03659x34 – 0.03401x35 + 0.12472x36 + 0.14021x37 + 0.07648x38 + 0.11006x39 + 0.04506x40 – 0.00983x41 + 0.07999x42 + 0.09293x43 + 0.07384x44 + 0.04410x45 + 0.01999x46 – 0.00202x47 + 0.05873x48 – 0.03541x49 – 0.02035x50

The S-Plus codes for Ridge Regression, the C plots, and the values of λ, C and the coefficient vectors employed for p = 2, 4, 6 and 50 and n = 20, 30, 40, 60, 80 and 100 are given in Appendix D.

The discussion and comparison of all three methods is carried out in Chapter 5.

CHAPTER 5

COMPARISONS ANALYSIS & DISCUSSIONS

5.1

Introduction

This chapter discusses the performance of the PLS regression, PC regression and Ridge regression methods in handling multicollinearity problems in regression analysis. The comparisons were made using the results from the simulated data. A summary and discussion of these comparisons are given in Sections 5.4 and 5.5.

5.2

Performance on Classical Data

Each method discussed in this research was tested using the classical tobacco data set in Table 3.1. This data set has been used as an example in many papers and books on multicollinearity. The performances of Partial Least Squares regression, Principal Component regression and Ridge regression on this classical data set are summarized in Table 5.1.

Table 5.1 : Performance of PLS, PC and RR methods on classical data sets

        PLSR       PCR        RR
R2      0.95720    0.95273    0.94834
MSE     9387.32    3716.30    2.6779e-017

Figure 5.1 : Plots of coefficients against number of regressors (series: PLS, PC and RR)

According to the performance on the classical data set (the tobacco data), PLS is better than PC and RR when the number of observations n is 30 and the number of regressors p is 4, followed by RR and then PC, as shown in Table 5.1. This evaluation was done using the MSE value, where the method that produced the smallest value was taken as the best method. The R2 values show the models' capability to fit the present data using PLSR, PCR and RR. From the results, the PLSR model again has the best fit, followed by PCR and RR. The regression coefficients obtained using PLS, PC and RR are plotted against the regressors in Figure 5.1.

5.3

Performance on Simulated Data

From the simulation study, the performance of all three methods in the specified cases is compared. Figures 5.2 – 5.9 and Tables 5.2 – 5.5 show some of the results; Appendix E gives the full results of the simulation study for every method and every case. The cross-validation results are shown in Tables 5.2 – 5.5. For a low number of regressors, p = 2, and small n, that is, n = 20, PLS gave a cross-validated R2 of 0.79982, while PC gave 0.79976 and RR gave 0.76855. For medium n, that is, n = 40, PLS again gave an R2 of 0.79970, followed by PC with 0.79967 and RR with 0.77575, and for large n, that is, n = 100, the R2 of PLS was 0.79745 while PC gave 0.79744 and RR gave 0.77874. Both the PLS and PC methods retained only one component. These results show that, based on the R2 values, the PLS models performed best, followed by PCR and RR, and this conclusion is consistent for each specified number of observations, including n = 30, 60 and 80.
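The cross-validated R2 values reported in Tables 5.2 – 5.5 can be read as predictive R2 values: each observation is predicted from a model fitted without it. A generic leave-one-out version is sketched below purely as an illustration; fit.fun and pred.fun stand for whichever fitting and prediction routine (PLSR, PCR or RR) is being assessed and are assumed names, not the thesis's actual code.

# Generic leave-one-out cross-validated R^2 (illustrative sketch).
cv.r2 <- function(X, y, fit.fun, pred.fun)
{
  n <- length(y)
  yhat <- numeric(n)
  for (i in 1:n) {
    fit <- fit.fun(X[-i, , drop = FALSE], y[-i])     # refit without observation i
    yhat[i] <- pred.fun(fit, X[i, , drop = FALSE])   # predict the held-out point
  }
  1 - sum((y - yhat)^2) / sum((y - mean(y))^2)       # predictive R^2
}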

Table 5.2 : Cross-validation of PLS, PC and RR methods, R2 for p = 2

n      RR         PCR        PLSR
20     0.76855    0.79976    0.79982
30     0.77095    0.79753    0.79758
40     0.77575    0.79967    0.79970
60     0.77499    0.79446    0.79449
80     0.77952    0.79958    0.79959
100    0.77874    0.79744    0.79745

Figure 5.2 : Plot of R2 against m = 100 replications for p = 2 for specified n = 20

Figure 5.3 : Plot of R2 against m = 100 replications for p = 2 for specified n = 100

The R2 values against each of the m = 100 replications are also plotted in Figures 5.2 – 5.9. For p = 2, the plots can be seen in Figures 5.2 and 5.3 for n = 20 and n = 100 respectively. In these figures the PLS and PC curves differ only very slightly in their R2 values and can hardly be distinguished, while the RR curve lies below the other two methods. The cross-validation results for p = 4 and p = 6, shown in Tables 5.3 – 5.4, show that PLS gave a higher R2, followed by PC and RR, for all specified numbers of observations. The R2 values against each of the m = 100 replications for p = 4 and 6 are also plotted in Figures 5.4 – 5.7 for n = 20 and n = 100, where the RR curve lies below the PLS and PC curves. PLS and PC again can hardly be distinguished, since their R2 values differ only slightly. On the other hand, the PLS model gives the best fit to the data compared to PC and RR.

From Table 5.5, the cross-validation results for a high number of regressors, p = 50, show that the R2 values produced by the three methods differ only slightly. PLS again gives a higher cross-validated R2, followed by PC and RR. The plots for n = 60 and n = 100 can be seen in Figures 5.8 and 5.9. The PLS curve in both figures shows very consistent R2 values, lying in an almost straight line close to 1. The overall R2 values for all methods and all cases are good, since they are very close to 1, which means that the models are capable of fitting the present data. However, the small differences among the R2 values indicate the relative performance of each method.

Table 5.3 : Cross-validation of PLS, PC and RR methods, R2 for p = 4

n      RR         PCR         PLSR
20     0.91154    0.943774    0.94681
30     0.92008    0.94559     0.94748
40     0.92056    0.94354     0.94497
60     0.92132    0.94156     0.94267
80     0.92229    0.94129     0.94208
100    0.92395    0.94049     0.94116

Table 5.4 : Cross-validation of PLS, PC and RR methods, R2 for p = 6

n      RR         PCR        PLSR
20     0.95057    0.97649    0.97918
30     0.95659    0.97617    0.97793
40     0.95525    0.97501    0.97622
60     0.95631    0.97343    0.97429
80     0.95922    0.97322    0.97398
100    0.96080    0.97352    0.97416

Table 5.5 : Cross-validation of PLS, PC and RR methods, R2 for p = 50

n      RR          PCR         PLSR
60     0.999293    0.999617    0.999992
80     0.999271    0.999604    0.999994
100    0.999262    0.999589    0.999996

Figure 5.4 : Plot of R2 against m = 100 replications for p = 4 for specified n = 20

Figure 5.5 : Plot of R2 against m = 100 replications for p = 4 for specified n = 100

Figure 5.6 : Plot of R2 against m = 100 replications for p = 6 for specified n = 20

Figure 5.7 : Plot of R2 against m = 100 replications for p = 6 for specified n = 100

Figure 5.8 : Plot of R2 against m = 100 replications for p = 50 for specified n = 60

Figure 5.9 : Plot of R2 against m = 100 replications for p = 50 for specified n = 100

5.4

Comparison Analysis

From the simulation study, the MSE values of the estimated regression parameters β̂ for each specified p and n are calculated to compare the efficiency of the methods considered. These MSE values indicate to what extent the slope and intercept are correctly estimated, so the goal is to obtain an MSE value close to zero. Tables 5.6 – 5.9 show the values for each method, where k denotes the number of components retained for the PLS and PC methods.
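Concretely, with m replications the MSE of the estimated coefficients can be computed as the average squared distance between each estimated coefficient vector and the true vector used to generate the data. The sketch below is one common definition, given only for illustration; beta.hat (an m x (p+1) matrix holding one estimate per replication) and beta.true are assumed names.

# MSE of the estimated coefficients over m replications (illustrative sketch).
mse.beta <- function(beta.hat, beta.true)
{
  d <- sweep(beta.hat, 2, beta.true)   # subtract the true coefficients from each row
  mean(apply(d^2, 1, sum))             # average squared Euclidean distance
}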

Table 5.6 : MSE values for p = 2 and specified n = 20, 30, 40, 60, 80 and 100

MSE     n = 20     n = 30    n = 40    n = 60    n = 80    n = 100
RR      4.5158     4.0621    2.2457    1.3620    0.9296    0.7523
PCR     15.3422    8.6804    5.4349    4.5474    2.2499    2.2356
PLSR    15.2835    8.6362    5.4018    4.5244    2.2354    2.2250

Table 5.7 : MSE values for p = 4 and specified n = 20, 30, 40, 60, 80 and 100

MSE     k       n = 20      n = 30      n = 40      n = 60      n = 80     n = 100
RR      -       24.06891    13.46989    9.30244     6.27415     5.36279    3.50251
PCR     k = 1   38.35023    25.78909    18.17101    12.43210    8.64044    7.35115
        k = 2   34.05852    21.99729    16.02691    11.07563    7.46967    6.35491
        k = 3   25.14823    17.46042    12.11719    8.89374     6.15083    4.86143
PLSR    k = 1   38.2232     25.73677    18.22498    12.40459    8.59149    7.32591
        k = 2   13.18924    9.24830     5.56765     4.54305     2.65812    2.25146
        k = 3   2.62392     1.39834     0.91491     0.49154     0.17670    0.12939

Table 5.8 : MSE values for p = 6 and specified n = 20, 30, 40, 60, 80 and 100

MSE     k       n = 20      n = 30      n = 40      n = 60      n = 80      n = 100
RR      -       53.36859    26.6182     21.92014    13.13541    9.20576     6.37893
PCR     k = 1   82.3958     41.03708    28.78554    18.33398    14.06939    10.07186
        k = 2   78.39923    38.29943    26.95395    17.01371    12.98822    9.273407
        k = 3   73.20789    34.83759    24.42904    15.54941    11.98279    8.272738
        k = 4   65.70394    28.93769    21.59502    13.69208    10.31598    7.130633
PLSR    k = 1   82.20597    40.93727    28.73878    18.29155    14.03621    10.03428
        k = 2   47.50811    18.63663    12.88441    8.56508     5.854887    3.616554
        k = 3   27.06995    8.86398     5.74804     3.32051     2.05498     1.44028
        k = 4   5.35992     2.11927     0.68219     0.37051     0.17128     0.13095

Table 5.9 : MSE values for p = 50 and specified n = 60, 80 and 100

MSE     k       n = 60      n = 80      n = 100
RR      -       1.41069     0.94437     0.73838
PCR     k = 1   10.07760    10.07909    10.07753
        k = 2   9.94546     9.89374     9.87701
        k = 3   9.76919     9.68552     9.67526
        k = 4   9.56081     9.44986     9.50969
        k = 5   9.34888     9.21501     9.31887
PLSR    k = 1   10.07900    10.07647    10.07747
        k = 2   4.98517     4.27791     3.70296
        k = 3   3.29653     2.48634     1.92073
        k = 4   2.44565     1.64511     1.17761
        k = 5   1.91952     1.18943     0.81114

From Table 5.6, where p = 2 and n = 20, 30, 40, 60, 80 and 100, Ridge regression performed best compared to the other two methods, giving MSE = 4.5158 for n = 20 and MSE = 0.7523 for n = 100, followed by PLS regression (MSE = 15.2835 for n = 20 and MSE = 2.2250 for n = 100) and PC regression (MSE = 15.3422 for n = 20 and MSE = 2.2356 for n = 100). Ridge regression is considered the best since it shows the lowest MSE values for all specified n, with a large difference from the other two methods. On the other hand, there is only a slight difference between the MSE values of PLS and PC regression, which are both chosen at kopt = 1. The results are consistent for each specified n. They also show that for a low number of regressors, p = 2, the MSE decreases as the number of observations increases.

Figures 5.10 and 5.11 show the plots of the MSE values for each method against the 100 replications for n = 20 and n = 100, respectively. In both figures most of the RR points lie at the bottom, while the PLS and PC points follow the same pattern, and a few PLS and PC points differ widely from RR. This shows that the RR results are consistent compared with the other methods and that RR performs best in handling the multicollinearity problem for p = 2.

From Table 5.7, where p = 4 and n = 20, 30, 40, 60, 80 and 100, PLS regression performed better than Ridge regression and PC regression when the components are chosen at the optimum kopt = 3. The MSE values for PLS regression differ greatly from those of the other two methods: MSE = 2.62393 for n = 20 and MSE = 0.12939 for n = 100 for PLS, compared with MSE = 24.06891 and MSE = 3.50251 for RR, and MSE = 25.14823 and MSE = 4.86143 for PC regression, respectively. However, when only one component is selected (k = 1), the findings are different, with Ridge regression having the smallest MSE values (MSE = 24.06891 for n = 20 and MSE = 3.50251 for n = 100), which means it is better than the rest, whereas PLS gives MSE = 38.22320 and MSE = 7.32591 and PC gives MSE = 38.35023 and MSE = 7.35115. The optimal number of components for PLS and PC regression should, however, be chosen at the smallest MSE (Engelen et al., 2003).

This shows that both methods, with the optimal number of components, handle multicollinearity well for p = 4 regressors.
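Operationally, the optimal number of components referred to above is simply the k whose MSE is smallest. A minimal sketch, assuming mse.by.k is a vector of MSE values indexed by k = 1, 2, ...:

# Choose the number of components with the smallest MSE (illustrative sketch).
k.opt <- order(mse.by.k)[1]   # index of the smallest MSE
mse.by.k[k.opt]               # the corresponding minimum MSE value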

Figures 5.12 and 5.13 show the plots of the MSE values for p = 4 for each method against the m = 100 replications for n = 20 and n = 100, respectively. Both figures also show that PLS has the lowest MSE values, while some of the MSE points of RR and PC differ widely from PLS.

From Table 5.8, where p = 6 and n = 20, 30, 40, 60, 80 and 100, PLS again performed best, followed by Ridge regression and PC regression, when the components for both PLS and PC regression are chosen at the optimum kopt = 4. The MSE values for PLS are 5.35992 for n = 20 and 0.13095 for n = 100, while for RR the MSE values for the same n are 53.36859 and 6.37893, respectively, and for PC they are 65.70394 and 7.13063. The results also show the decreasing nature of the MSE values from n = 20 to n = 100: when the number of observations becomes larger, the MSE values become smaller. The results are also consistent, with PLS performing better than RR, followed by PC regression, for every specified number of observations.

Figures 5.14 and 5.15 show the plots of the MSE values for p = 6 for each method against the m = 100 replications for n = 20 and n = 100, respectively. Both figures also show that PLS has the lowest MSE values, and most of the RR and PC points differ widely from PLS, especially for n = 100 (Figure 5.15).

From the results given in Table 5.9, Ridge regression performed best, followed by PLS and PC respectively, for a high number of regressors, p = 50. RR gives MSE values of 1.41069, 0.94437 and 0.73838 for n = 60, 80 and 100, respectively, followed by PLS, which gives MSE values of 1.91952, 1.18943 and 0.81114, and PC, which gives 9.34888, 9.21501 and 9.31887.

Figures 5.16 and 5.17 show that both the PLS and RR plots produce small MSE values (lower than 2), while the PC plots differ widely and have large MSE values (close to 10) compared to PLS and RR.

Figure 5.10 : Plot of MSE values against m = 100 replications for p = 2 for n = 20

Figure 5.11 : Plot of MSE values against m = 100 replications for p = 2 for n = 100

Figure 5.12 : Plot of MSE values against m = 100 replications for p = 4 for n = 20

Figure 5.13 : Plot of MSE values against m = 100 replications for p = 4 for n = 100

Figure 5.14 : Plot of MSE values against m = 100 replications for p = 6 for n = 20

Figure 5.15 : Plot of MSE values against m = 100 replications for p = 6 for n = 100

Figure 5.16 : Plot of MSE values against m = 100 replications for p = 50 for n = 60

Figure 5.17 : Plot of MSE values against m = 100 replications for p = 50 for n = 100

5.5

Discussions

From the simulation study, the performance of the methods is determined using an efficiency test to find out which method is best in the various specified cases. In this study, the MSE was used as the efficiency criterion, and the following tables (Tables 5.10 – 5.13) summarize the performance of all three methods. The term 'More Efficient' means that the method is superior because it has a smaller MSE value than the MSE values calculated using the other two methods. This is followed by 'Efficient' for the second best method and 'Less Efficient' for the method with the largest MSE value. These summaries are based on the results given in Tables 5.6 – 5.9.

Table 5.10 : Summary of the performances of the three methods with p = 2

p = 2      RR               PCR              PLSR
n = 20     More Efficient   Less Efficient   Efficient
n = 30     More Efficient   Less Efficient   Efficient
n = 40     More Efficient   Less Efficient   Efficient
n = 60     More Efficient   Less Efficient   Efficient
n = 80     More Efficient   Less Efficient   Efficient
n = 100    More Efficient   Less Efficient   Efficient

Table 5.11 : Summary of the performances of the three methods with p = 4

p = 4      RR          PCR              PLSR
n = 20     Efficient   Less Efficient   More Efficient
n = 30     Efficient   Less Efficient   More Efficient
n = 40     Efficient   Less Efficient   More Efficient
n = 60     Efficient   Less Efficient   More Efficient
n = 80     Efficient   Less Efficient   More Efficient
n = 100    Efficient   Less Efficient   More Efficient

Table 5.12 : Summary of the performances of the three methods with p = 6

p = 6      RR          PCR              PLSR
n = 20     Efficient   Less Efficient   More Efficient
n = 30     Efficient   Less Efficient   More Efficient
n = 40     Efficient   Less Efficient   More Efficient
n = 60     Efficient   Less Efficient   More Efficient
n = 80     Efficient   Less Efficient   More Efficient
n = 100    Efficient   Less Efficient   More Efficient

Table 5.13 : Summary of the performances of the three methods with p = 50

p = 50     RR               PCR              PLSR
n = 60     More Efficient   Less Efficient   Efficient
n = 80     More Efficient   Less Efficient   Efficient
n = 100    More Efficient   Less Efficient   Efficient

From the summary tables, it appears that the RR and PLSR methods are generally effective in handling multicollinearity problems in the specified cases with p = 2, 4 and 6 (low and moderate numbers of regressors) and p = 50 (a high number of regressors). RR is most efficient when p = 2 and p = 50, while PLSR is most efficient when p = 4 and p = 6.

From the tables, both PLSR and RR performed better than PCR in all cases. However, the differences between the performance of PCR and that of PLSR and RR are only slight. This confirms Rougoor et al.'s (2000) finding that no one method dominates the others, and that the difference between the methods is typically small when the number of observations is large.

In all cases, the superior methods performed well when the number of observations n is larger than the number of regressors. The results are also consistent for every specified number of observations n included in the analysis. Generally, RR is effective and efficient for both a small and a high number of regressors.

CHAPTER 6

SUMMARY, CONCLUSIONS AND FUTURE RESEARCH

6.1

Introduction

This chapter summarizes the material presented in the previous chapters and discusses some of the results and findings in further detail. Based on these results and findings, some conclusions and suggestions are presented.

6.2

Summary

For each data set, the methodology that fits best has to be decided. Rougoor et al. (2000) stated that, to decide which methodology to use, the requirements of all the methodologies have to be taken into account, including the complexity of the analysis, the degree of prior information that is available, the number of variables compared to the number of observations, and the number of Y-variables.

Generally, this research provided a review of multicollinearity problems in linear regression and of the use of Partial Least Squares Regression, Principal Component Regression and Ridge Regression to overcome these problems. As pointed out in Chapter 2, researchers have suggested numerous methods that can be used to handle these problems.

The PLSR, PCR and RR methods, being among the most popular procedures in the literature, were chosen for this study. These methods were shown to perform well on the chosen classical multicollinear data set and also on the simulated random data. Both PLSR and PCR construct new variables called components, while in RR a small value is added to X'X in the ordinary least squares estimator in order to reduce the effect of the multicollinearity problem in the regression analysis. For RR, adding this small value before computing (X'X)-1 helps to control and minimize the round-off error problems in the calculation of the least squares regression coefficients that are caused by collinearity. For PCR and PLSR, the regression coefficients for the original regressors are obtained by transforming back from the components, so the explanatory variables (the components) are constructed in such a way as to minimize the effect of multicollinearity. The reconstituted regression coefficients were computed using the constructed factors, that is, combinations of related variables selected from the components. In this way, dimensionality can be reduced without losing much of the information, and interpretability is also increased.

The objective of this research, as stated in Chapter 1, is to determine which method is superior in handling multicollinearity problems in regression analysis. The findings of this research also give an indication of which method is the most superior for various combinations of p and n, as presented in Chapter 5.

6.3

Significant Findings and Conclusions

This research focuses on the three methodologies for handling multicollinearity problems that were discussed in Chapter 3. The first method discussed is Partial Least Squares regression, which extracts new factors called components. These components take into consideration the relation between Y and the specified X variables in order to overcome the multicollinearity problems in the data sets. The second method discussed in this thesis is Principal Component regression. This method also constructs k components, which are orthogonal to each other. Both PLS and PC eliminate the components that are influenced by the collinearity problems. The last method discussed is Ridge regression, a modification of ordinary least squares regression which allows a biased estimator in order to reduce the effect of multicollinearity.
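As a concrete illustration of the component-construction idea for PCR described above, the orthogonal components can be obtained from the singular value decomposition of the centred and scaled regressor matrix and the response regressed on the first k of them. This is only a sketch under those assumptions, not the code used in the thesis; pcr.fit and its arguments are illustrative names.

# Principal component regression on the first k components (illustrative sketch).
pcr.fit <- function(X, y, k)
{
  Z  <- scale(X)                        # centred and scaled regressors
  yc <- y - mean(y)
  V  <- svd(Z)$v[, 1:k, drop = FALSE]   # loadings of the first k components
  T  <- Z %*% V                         # orthogonal component scores
  g  <- solve(t(T) %*% T, t(T) %*% yc)  # regress y on the retained components
  list(gamma = g, beta.standardized = V %*% g, intercept = mean(y))
}

The coefficients in terms of the standardized regressors would then be transformed back to the natural variables, in the same way as for ridge regression.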

The results obtained from the analysis showed that no single method dominates the others. The graphical displays of R2 also show that all the methods are good, and the small differences among the values were used to identify which method gave the best model compared to the others.

The Monte Carlo simulations comparing the PLS, PC and RR methods showed that PLS and RR were superior to, and more effective than, PC in handling multicollinearity problems in all cases. When p (the number of independent variables) is equal to 2, a low number of regressors, RR was superior to PLS and PCR for both low and high numbers of observations, since it gave the smallest MSE values. However, for a moderate number of regressors, that is, p = 4 and 6, PLS became superior to RR in both cases and for both low and high numbers of observations. For a very high number of regressors, p = 50, RR was again superior to PLS for a high number of observations.

The RR method performs better than the other two methods when the number of regressors is very low (p = 2) and when the number of regressors is very high (p = 50). In the RR algorithm, the collinearity problem is reduced by adding a small value to X'X before inversion. When the number of regressors was very low (p = 2), with the regressors highly correlated, RR worked better than the other methods because of the direct computation of the RR estimator through (X'X + λI)-1 rather than (X'X)-1. When the number of regressors became high (p = 50), RR again performed better than PLS, which has a more complicated algorithm for calculating and constructing the k components.

The best performance of the PLSR method could be seen when the number of regressors was moderate (p = 4 and 6). The principle of PLS does not allow the collinearity problems in the X variables to influence the estimation of the relationship directly; they are only allowed to influence it through the components. The PLS algorithm, which constructs a component (or components) that helps to remedy the collinearity problem, worked better for a moderate number of regressors, because the computation involved in constructing the components becomes quite complicated when the number of regressors is higher.
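To illustrate the component construction described above, the sketch below extracts a single PLS component for a univariate response in the usual NIPALS fashion; further components are obtained by repeating the step on the deflated matrices. The function name and arguments are assumptions for illustration only, not the thesis's actual code.

# Extraction of one PLS component for a univariate response (illustrative sketch).
# Z: centred (scaled) regressor matrix, yc: centred response.
pls.comp <- function(Z, yc)
{
  w <- t(Z) %*% yc                      # weights: covariance of each regressor with y
  w <- w / sqrt(sum(w^2))               # normalise the weight vector
  t.sc <- Z %*% w                       # component scores
  p.ld <- t(Z) %*% t.sc / sum(t.sc^2)   # X loadings
  q <- sum(t.sc * yc) / sum(t.sc^2)     # regression of y on the component
  list(weights = w, scores = t.sc, x.loadings = p.ld, y.loading = q,
       Z.deflated = Z - t.sc %*% t(p.ld),   # residual X for the next component
       y.deflated = yc - q * t.sc)          # residual y for the next component
}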

The better performance of PLS regression over RR started at p = 3 (see Appendix E), which is the case of a moderate number of regressors. However, RR started to perform better than PLS regression in the case of a high number of regressors, at p = 15 (Appendix E). For p = 12, 13 and 14 (Appendix E), RR performed better than PLS when n = 20 and n = 30, while PLS performed better than RR when n = 40, and the results were consistent up to n = 100. The ranking of the methods is also consistent with the theory in the literature.

The advantages of the RR method over the bilinear least squares methods (PLSR and PCR) are that it is easier to compute and that it provides a more stable way of moderating the model degrees of freedom than dropping variables. RR also controls the bias-variance tradeoff of the estimated values, so that more variables can be included in a scoring model without the danger of overfitting the data. The advantages of PLS are the optimization towards the Y-variable and the possibility of including more than one Y-variable. The advantages of PCR are that hypothesis testing can be performed and that complete optimisation is used in determining the PCs.

6.4

Future Research

The comparison of the three methods can also be extended to high-dimensional regressors where p > n. It is known that the multicollinearity problem is present in data sets where the number of variables is high compared to the number of observations.

These three methods can also be considered for handling the multiple-outlier problem since, according to Engelen et al. (2003), PLS and PCR are very sensitive to the presence of outliers in the data set. The robust versions of PLS and PCR are known to resist several types of contamination.


REFERENCES

Abdi, H., (2003a). Least Squares. In M. Lewis-Beck, A. Bryman, T. Futing (Eds): Encyclopedia for research methods for the social sciences. Thousand Oaks (CA): Sage.
Abdi, H., (2003b). Partial Least Squares Regression (PLS regression). In M. Lewis-Beck, A. Bryman, T. Futing (Eds): Encyclopedia for research methods for the social sciences. Thousand Oaks (CA): Sage.
Afifi, A.A. and Clark, Y., (1984). Computer-aided multivariate analysis. Lifetime Learning Publications, Belmont, California.
Barker, M., (1997). A Comparison of Principal Component Regression and Partial Least Squares Regression. Multivariate Project.
Björkström, A., (2001). Ridge Regression and Inverse Problem. Research Report. Stockholm University, Sweden.
Björkström, A. and Sundberg, R., (1999). A Generalized View on Continuum Regression. Scandinavian Journal of Statistics. 25 : 17 – 30.
Bowerman, B.L. and O'Connell, R.T., (1990). Linear Statistical Models: An Applied Approach. 2nd Ed., Boston : PWS-KENT Publishing Co.

Boeris, M.S., Luco, J.M. and Olsina, R.A., (2000). Simultaneous Spectrophotometric Determination of Phenobarbital, Phenytoin and Methylphenobarbital in Pharmaceutical Preparations by Using Partial Least Squares and Principal Component Regression Multivariate Calibration. Journal of Pharmaceutical and Biomedical Analysis. 24 : 259 – 271.
De Jong, S., (1993). SIMPLS: An Alternative Approach to Partial Least Squares Regression. Chemometrics and Intelligent Laboratory Systems. 18 : 251 – 263.
De Jong, S. and Farebrother, R.W., (1994). Extending the Relationship Between Ridge Regression and Continuum Regression. Chemometrics and Intelligent Laboratory Systems. 25 : 179 – 181.

Dempster, A.P., Schatzoff, M. and Wermuth, N., (1977). A Simulation Study of Alternatives to Ordinary Least Squares. Journal of the American Statistical Association. 72 : 77 – 91.
Dijkstra, T., (1983). Some Comments on Maximum Likelihood and Partial Least Squares Methods. Journal of Econometrics. 22 : 67 – 90.
Dijkstra, T., (1985). Latent Variables in Linear Stochastic Models: Reflections on Maximum Likelihood and Partial Least Squares Methods. Second Edition, Amsterdam, The Netherlands: Sociometric Research Foundation.
Engelen, S., Hubert, M., Vanden Branden, K. and Verboven, S., (2003). Robust PCR and Robust PLSR: a comparative study. In Theory and Applications of Recent Robust Methods, edited by M. Hubert, G. Pison, A. Struyf and S. Van Aelst, Series: Statistics for Industry and Technology, Birkhäuser, Basel.

Farebrother, R.W., (1999). A Class of Statistical Estimators Related to Principal Components. Linear Algebra and its Applications. 289 : 121 – 126.

Filzmoser, P. and Croux, C., (2002). A Projection Algorithm for Regression with Collinearity. Classification, Clustering and Data Analysis, 227 – 234.
Filzmoser, P. and Croux, C., (2003). Dimension Reduction of the Explanatory Variables in Multiple Linear Regression. Pliska Studia Mathematica Bulgaria. 14 : 59 – 70.
Garthwaite, P.H., (1994). An Interpretation of Partial Least Squares. Journal of the American Statistical Association. 89 : 122 – 127.
Geladi, P. and Kowalski, B.R., (1986a). Partial Least-Squares Regression: A Tutorial. Analytica Chimica Acta. 185 : 1 – 17.
Geladi, P. and Kowalski, B.R., (1986b). An Example of 2-Block Predictive Partial Least Squares Regression with Simulated Data. Analytica Chimica Acta. 185 : 19 – 32.
Gibbons, D.G., (1981). A Simulation Study of Some Ridge Estimators. Journal of the American Statistical Association. 76 : 131 – 139.
Gruber, M.H.J., (1990). Regression Estimators: A Comparative Study. California : Academic Press.
Hair, J.F., Anderson, R.E., Tatham, R.L. and Black, W.C., (1998). Multivariate Data Analysis. 5th Ed., New Jersey : Prentice Hall.
Hansen, P.M. and Schjoerring, J.K., (2003). Reflectance Measurement of Canopy Biomass and Nitrogen Status in Wheat Crops Using Normalized Difference Vegetation Indices and Partial Least Squares. Remote Sensing of Environment. 86 : 542 – 553.
Hawkins, D.M. and Yin, X., (2002). A Faster Algorithm for Ridge Regression of Reduced Rank Data. Journal of Computational Statistics & Data Analysis. 40 : 253 – 262.

Hocking, R.R., (1996). Methods and Applications of Linear Models: Regression and the Analysis of Variance. USA : John Wiley & Sons.
Hoerl, R.W., Schuenemeyer, J.H. and Hoerl, A.E., (1986). A Simulation of Biased Estimation and Subset Selection Regression Techniques. Technometrics. 28(4) : 369 – 380.
Hoerl, A.E. and Kennard, R.W., (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics. 12 : 56 – 67.
Hubert, M. and Vanden Branden, K., (2003a). Robust Methods for Partial Least Squares Regression. Journal of Chemometrics. 17 : 537 – 549.
Hubert, M. and Vanden Branden, K., (2003b). Robustness Properties of a Robust Partial Least Squares Regression. Journal of Chemometrics. 17 : 537 – 549.
Hwang, J.T. and Nettleton, D., (2000). Principal Components Regression With Data-Chosen Components and Related Methods. Technometrics. 45 : 70 – 79.
Jennrich, R.I., (1995). An Introduction to Computational Statistics: Regression Analysis. New Jersey : Prentice Hall.
Kvalheim, O.M., (1987). Latent-structure decompositions (projections) of multivariate data. Chemometrics and Intelligent Laboratory Systems. 2 : 283 – 290.
Kleinbaum, D.G., Kupper, L.L., Muller, K.E. and Nizam, A., (1998). Applied Regression Analysis and Other Multivariable Methods. USA : DUXBURY Press.
Li, B., Martin, E.B. and Morris, A.J., (2001). Box-Tidwell Transformation Based Partial Least Squares Regression. Computers and Chemical Engineering. 25 : 1219 – 1233.
Marquardt, D.W., (1970). Generalized Inverses, Ridge Regression, Biased Linear Estimation and Nonlinear Estimation. Technometrics. 12 : 591 – 612.

Miller, A.J., (1990). Subset Selection in Regression. Chapman & Hall, New York.
Myers, R.H., (1986). Classical and Modern Regression With Applications. 2nd Ed. USA : PWS-KENT Publishing Company.
Neter, J., Wasserman, W. and Kutner, M.H., (1985). Applied Linear Statistical Models. 2nd Ed. USA : IRWIN.
Neter, J., Wasserman, W. and Kutner, M.H., (1990). Applied Linear Regression Models. 3rd Ed. USA : IRWIN Book Team.
Neter, J., Kutner, M.H., Nachtsheim, C.J. and Wasserman, W., (1996). Applied Linear Statistical Models. 4th Ed. USA : Irwin Book Team.
Ozcelik, Y., Kulaksiz, S. and Cetin, M.C., (2002). Assessment of the Wear of Diamond Beads in the Cutting of Different Rock Types by the Ridge Regression. Journal of Materials Processing Technology. 127 : 392 – 400.
Rougoor, C.W., Sundaram, R. and Van Arendonk, J.A.M., (2000). The Relation Between Breeding Management and 305-day Milk Production, Determined via Principal Components Regression and Partial Least Squares. Livestock Production Science. 66 : 71 – 83.
Serneels, S. and Van Espen, P.J., (2003). Sample Specific Prediction Intervals in SIMPLS. In PLS and Related Methods, Ed. Vilares, M., Tenenhaus, M., Coelho, P., Esposito Vinzi, V. and Morineau, A., DECISIA, Levallois Perret (France), 219 – 233.
Stone, M. and Brooks, R.J., (1990). Continuum Regression: Cross-Validated Sequentially Constructed Prediction Embracing OLS, PLS and PCR. Journal of the Royal Statistical Society. 52 : 237 – 269.

Sundberg, R., (1993). Continuum Regression and Ridge Regression. Journal of the Royal Statistical Society. 55 : 653 – 659.
Sundberg, R., (2002). Continuum Regression. Article for 2nd Ed. of Encyclopedia of Statistical Sciences.
Tobias, R.D., (1997). An Introduction to Partial Least Squares Regression. Cary, NC : SAS Institute.
Wannacott, T.H. and Wannacott, R.J., (1981). Regression: A Second Course in Statistics. USA : John Wiley & Sons.
Wesolowsky, G.O., (1976). Multiple Regression and Analysis of Variance. USA : John Wiley & Sons.
Wold, H., (1966). Estimation of Principal Components and Related Models by Iterative Least Squares. In Multivariate Analysis, Ed. Krishnaiah, P.R., New York : Academic Press, 391 – 420.
Wold, S., (1994). PLS for Multivariate Linear Modeling. QSAR: Chemometric Methods in Molecular Design. Methods and Principles in Medicinal Chemistry, ed. H. van de Waterbeemd, Weinheim, Germany: Verlag-Chemie.
Xie, Y.L. and Kalivas, J.H., (1997a). Evaluation of Principal Component Selection Methods to Form a Global Prediction Model by Principal Component Regression. Analytica Chimica Acta. 348 : 19 – 27.
Xie, Y.L. and Kalivas, J.H., (1997b). Local Prediction Models by Principal Component Regression. Analytica Chimica Acta. 348 : 29 – 38.
Younger, M.S., (1979). A Handbook for Linear Regression. USA : DUXBURY Press.


APPENDIX A

S-PLUS CODES FOR DATA GENERATING FUNCTION (CHAPTER IV)

APPENDIX A.1

This is the data generating function for simulation n = 20 and p = 2. For n = 30, 40, 60, 80 and 100, replace the value n with 30, 40, 60, 80 and 100 respectively. Datap2.func