SAMPLING-BASED BAYESIAN LATENT VARIABLE REGRESSION METHODS WITH APPLICATIONS IN PROCESS ENGINEERING
DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy in the Graduate School of The Ohio State University

By

Hongshu Chen, B.E., M.S.

*****

The Ohio State University
2007

Dissertation Committee:

    Bhavik R. Bakshi, Co-Adviser
    Prem K. Goel, Co-Adviser
    James F. Rathman
    Michael E. Paulaitis

Approved by

    Co-Advisers
    Graduate Program in Chemical Engineering
© Copyright by
Hongshu Chen 2007
ABSTRACT
Latent variable methods, such as Principal Component Analysis and Partial Least Squares Regression, can handle collinearity among variables by projecting the original data into a lower dimensional space. They are widely applied to build empirical models of chemical and biological processes. With the development of modern experimental and analytical technology, data sets from those processes are becoming larger and more heterogeneous. Because of this increasing complexity, traditional latent variable methods often fail to provide satisfactory modeling results. Meanwhile, prior information about processes and data usually exists in different sources, such as expert knowledge and historical data. However, traditional latent variable methods are ill-suited to incorporate such information. Bayesian latent variable methods, such as Bayesian Latent Variable Regression (BLVR) and Bayesian Principal Component Analysis (BPCA), can combine prior information and data via a rigorous probabilistic framework. Since they make use of more information, they can provide models of better quality. However, BPCA and BLVR are optimization-based, which prevents them from modeling high dimensional data sets or providing error bars. They also make restrictive assumptions in order to suit the optimization routines. Because of these pitfalls, they have very limited applications in practice.

This dissertation addresses the challenges of making Bayesian latent variable methods practical by developing novel algorithms and a toolbox of sampling-based methods, including a sampling-based BLVR (BLVR-S). BLVR-S is computationally efficient and is able to model high dimensional data sets. It can also readily provide confidence intervals for estimates. An iterative modeling procedure is proposed to deal with hybrid data sets containing both continuous and discrete variables. An extended version of BLVR-S is developed to address the lack of information about measurement noise in modeling. A generalized BLVR-S is developed to relax the restrictive assumptions about prior distributions. These methods tackle some practical challenges in Bayesian modeling. The advantages of these Bayesian latent variable regression methods are illustrated in various case studies. Some practical aspects of applying Bayesian latent variable methods are also explored. Through these efforts, Bayesian latent variable methods are expected to find more practical applications in building empirical models in process engineering.
This is dedicated to my family, teachers and friends.
ACKNOWLEDGMENTS
I am happy that I can finally say some words of gratitude that are long overdue. First, I would like to thank my advisors, Dr. Bhavik R. Bakshi and Dr. Prem K. Goel. They introduced me to my research field and guided me through my Ph.D. study with great passion, patience and encouragement. They are great mentors, not only in academics but also in life, with their dedication to work, pursuit of excellence and foresight in research. I am grateful to Dr. James F. Rathman and Dr. Michael E. Paulaitis for their kind support and help in my research and in the process of completing my Ph.D. study. I would like to thank all my past and current group members during my stay at Ohio State. They created a warm and upbeat environment in our lab. Their generous help and advice have been priceless to me throughout these years. I would like to thank all my teachers from primary school to graduate school. Without their hard work, I could not have reached this stage. I would like to thank all my family and friends. Their support makes me strong, their care keeps me warm and their love completes me as a human being. Finally, financial support from The Ohio State University University Fellowship and the National Science Foundation (CTS-0321911) is also gratefully acknowledged.
VITA
February 6, 1981 . . . . . . . . . Born - Pengan, China
1998-2002 . . . . . . . . . . . .  B.E. Chemical Engineering and Technology, Tsinghua University, China
2002-2003 . . . . . . . . . . . .  University Fellow, The Ohio State University
2002-2006 . . . . . . . . . . . .  M.S. Statistics, The Ohio State University
2003-present . . . . . . . . . . . Graduate Research Associate, The Ohio State University
FIELDS OF STUDY

Major Field: Chemical Engineering

Studies in:
    Process Systems Engineering
    Bayesian Modeling
TABLE OF CONTENTS
                                                                    Page

Abstract . . . ii
Dedication . . . iv
Acknowledgments . . . v
Vita . . . vi
List of Tables . . . x
List of Figures . . . xiii

Chapters:

1.  Introduction . . . 1
    1.1  Latent Variable Modeling in Process Engineering . . . 1
    1.2  Bayesian Methods in Process Engineering . . . 5
    1.3  Dissertation Statement . . . 10
    1.4  Summary of Contributions . . . 11
    1.5  Structure of Dissertation . . . 13

2.  Background . . . 15
    2.1  Elements of Bayesian Modeling . . . 15
    2.2  Bayesian View of Traditional Methods . . . 20
         2.2.1  Ordinary Least Squares Regression & Ridge Regression . . . 20
         2.2.2  PCA & MLPCA . . . 22
         2.2.3  Partial Least-Squares Regression . . . 24
    2.3  Optimization-based Bayesian Latent Variable Methods . . . 27
         2.3.1  Bayesian Principal Component Analysis (BPCA) . . . 27
         2.3.2  Optimization-based Bayesian Latent Variable Regression (BLVR-OPT) . . . 28
    2.4  Monte Carlo Sampling . . . 32
         2.4.1  Markov Chain Monte Carlo . . . 33
         2.4.2  Importance Sampling . . . 39

3.  Sampling-based Bayesian Latent Variable Regression . . . 43
    3.1  Introduction . . . 43
    3.2  Sampling-Based Bayesian Latent Variable Regression . . . 46
         3.2.1  Gaussian Prior . . . 50
         3.2.2  Uniform Prior . . . 52
    3.3  Case Studies and Practical Aspects of BLVR . . . 55
         3.3.1  Case Study 1: Modeling of Simulated High Dimensional Data . . . 57
         3.3.2  Case Study 2: Modeling Near-Infrared Data of Wheat . . . 64
    3.4  Conclusions and Future Work . . . 76

4.  Modeling Hybrid Data Sets with BLVR-S . . . 78
    4.1  Introduction . . . 78
    4.2  Method . . . 82
         4.2.1  Modeling the Continuous Part Data . . . 83
         4.2.2  Modeling the Discrete Part Data . . . 84
         4.2.3  The Iteration Procedure . . . 85
    4.3  Experiments . . . 87
         4.3.1  Simulated Hybrid Data . . . 87
         4.3.2  Simulated High Throughput Screening Data . . . 89
    4.4  Conclusions and Future Work . . . 91

5.  BLVR-S with Noninformative Priors for Parameters in Likelihood Functions . . . 93
    5.1  Introduction . . . 93
    5.2  Method . . . 96
         5.2.1  Setup of Noninformative Priors . . . 96
         5.2.2  BLVR-S with Noninformative Priors for Noise Variances . . . 102
    5.3  Experiments . . . 105
         5.3.1  Simulated High Dimensional Data . . . 105
         5.3.2  Simulated Hybrid Data . . . 123
         5.3.3  Inferential Modeling of a Batch Distillation Process . . . 126
         5.3.4  Inferential Modeling of a Continuous Distillation Process . . . 129
    5.4  Conclusions and Future Work . . . 138

6.  BLVR-S with Non-Gaussian Prior Distributions . . . 142
    6.1  Introduction . . . 142
    6.2  Method . . . 147
         6.2.1  Drawing Samples of Latent Variables . . . 147
         6.2.2  Drawing Samples of Regression Parameters . . . 149
         6.2.3  Estimation of New Observations . . . 151
    6.3  Experiment . . . 153
    6.4  Conclusions and Future Work . . . 156

7.  Conclusions and Future Work . . . 158
    7.1  Concluding Remarks . . . 158
    7.2  Future Work . . . 160

Appendices:

A.  Another Way to Derive Full Conditional Distributions for Uniform Prior Case in Section 3.2.2 . . . 164

B.  Proof of the Type 2 Prior is the Jeffreys Prior for Noise Variances of BLVR-S in Section 5.2.1 . . . 166

Bibliography . . . 168
LIST OF TABLES
Table                                                               Page

2.1   Common loss functions and their corresponding objective functions and Bayes estimates . . . 17
2.2   Metropolis-Hastings algorithm . . . 36
2.3   Gibbs Sampling algorithm . . . 37
2.4   Importance Sampling algorithm . . . 41
3.1   BLVR-S algorithm . . . 49
3.2   Mean MSE and CPU time of the high dimensional modeling example over 50 realizations; MSE are normalized by the variance of the corresponding measurement error . . . 58
3.3   Mean testing MSE of Y for different numbers of training data over 50 realizations; MSE are normalized by the testing MSE of PCR; SNR for X and Y are both 3 . . . 62
3.4   Mean testing MSE of Y for different SNR of Y over 50 realizations; MSE are normalized by the testing MSE of PCR; there are 200 observations in the training set, SNR for X is 3 . . . 65
3.5   Mean testing MSE of Y for different SNR for X over 50 realizations; MSE are normalized by the testing MSE of PCR; there are 200 observations in training, SNR for Y is 3 . . . 66
3.6   Validation MSE of Y1 (moisture) and corresponding ranks . . . 74
3.7   Validation MSE of Y2 (protein) and corresponding ranks . . . 75
4.1   Procedure of Modeling Hybrid Data . . . 87
4.2   Average of MSE normalized by the MSE of PCR in each realization over 100 realizations for the simulated high dimensional hybrid data set . . . 89
4.3   Average of MSE normalized by the MSE of PCR in each realization over 100 realizations for the simulated high throughput screening data set . . . 91
5.1   Summary of different types of noninformative priors for σj² . . . 101
5.2   The extended BLVR-S algorithm with noninformative priors for noise variances . . . 104
5.3   Summary of the different settings in each case of the simulated high dimensional data example in Section 5.3.1 . . . 107
5.4   Case 1 of the simulated high dimensional data example in Section 5.3.1: average of MSE for different methods, normalized by the MSE of PCR in each realization; SNRx = 3, SNRy = 3, l = 2, 50 realizations . . . 110
5.5   Case 2 of the simulated high dimensional data example in Section 5.3.1: average of MSE for different methods, normalized by the MSE of PCR in each realization; SNRx = 1, 2, ..., 15, SNRy = 3, l = 2, 50 realizations . . . 111
5.6   Case 3 of the simulated high dimensional data example in Section 5.3.1: average of MSE for different methods, normalized by the MSE of PCR in each realization; SNRx = 1, 2, ..., 15, SNRy = 3, l = 3, 50 realizations . . . 112
5.7   Case 4 of the simulated high dimensional data example in Section 5.3.1: average of MSE for different methods, normalized by the MSE of PCR in each realization; SNRx = 3, SNRy = 3, l = 0, 50 realizations . . . 113
5.8   Case 5 of the simulated high dimensional data example in Section 5.3.1: average of MSE for different methods, normalized by the MSE of PCR in each realization; SNRx = 1, 2, ..., 15, SNRy = 3, l = 0, 50 realizations . . . 114
5.9   Average of MSE for the simulated high dimensional hybrid data set in Section 5.3.2, normalized by the MSE of PCR in each realization, 50 realizations . . . 127
5.10  Case 1 of the batch distillation example in Section 5.3.3: average of MSE for different methods, normalized by the MSE of PCR in each realization; rank 2, l = 3, 50 realizations . . . 128
5.11  Case 2 of the batch distillation example in Section 5.3.3: average of MSE for different methods, normalized by the MSE of PCR in each realization; rank 2, l = 0, 50 realizations . . . 128
6.1   Typical Maximum Entropy priors with constraints . . . 146
6.2   Importance sampling algorithm to draw samples of latent variables in GBLVR-S . . . 150
6.3   Importance sampling algorithm to draw samples of regression parameters in GBLVR-S . . . 152
6.4   The procedure to estimate the noise-free output for a new observation in GBLVR-S . . . 153
6.5   Average of the normalized MSE and CPU time of all methods for the simulated non-Gaussian data example over 50 realizations; the MSE are normalized by the variances of measurement noise . . . 156
LIST OF FIGURES
Figure                                                              Page

2.1   Procedure of Bayesian Estimation . . . 18
2.2   Illustration of information used in PCA . . . 25
2.3   Illustration of information used in MLPCA . . . 25
2.4   Illustration of information used in BPCA . . . 29
2.5   Illustration of Gibbs sampling . . . 39
3.1   Scatter plot of testing MSE of output of BLVR-OPT against that of BLVR-S, normalized by the variance of measurement error in the output variable . . . 59
3.2   Illustration of uncertainty information provided by BLVR-S . . . 60
3.3   Comparison of mean squared prediction error of the output variable for BLVR-S with true prior under different conditions of SNR, normalized by the mean squared prediction error of the output variable for PCR in each case . . . 64
3.4   Comparison of mean squared prediction error of the output variable for BLVR-S with true prior with different SNRx/SNRy, normalized by the mean squared prediction error of the output variable for PCR in each case . . . 67
3.5   Wheat spectra data . . . 70
3.6   Assumed variance of measurement noise in spectrum . . . 71
4.1   Illustration of a hybrid data set . . . 79
5.1   Probability distributions of flow rate and mean of residence time in the CSTR example . . . 98
5.2   Probability density functions of Inverse-Gamma distributions with different parameters . . . 101
5.3   Case 1 of the simulated high dimensional data example in Section 5.3.1: matrix plot of training MSE of the output, normalized by the variance of noise in the output; SNRx = 3, SNRy = 3, l = 2, 50 realizations . . . 115
5.4   Case 1 of the simulated high dimensional data example in Section 5.3.1: matrix plot of testing MSE of the output, normalized by the variance of noise in the output; SNRx = 3, SNRy = 3, l = 2, 50 realizations . . . 116
5.5   Case 2 of the simulated high dimensional data example in Section 5.3.1: matrix plot of training MSE of the output, normalized by the variance of noise in the output; SNRx = 1, 2, ..., 15, SNRy = 3, l = 2, 50 realizations . . . 117
5.6   Case 2 of the simulated high dimensional data example in Section 5.3.1: matrix plot of testing MSE of the output, normalized by the variance of noise in the output; SNRx = 1, 2, ..., 15, SNRy = 3, l = 2, 50 realizations . . . 118
5.7   Case 3 of the simulated high dimensional data example in Section 5.3.1: matrix plot of training MSE of the output, normalized by the variance of noise in the output; SNRx = 1, 2, ..., 15, SNRy = 3, l = 3, 50 realizations . . . 119
5.8   Case 3 of the simulated high dimensional data example in Section 5.3.1: matrix plot of testing MSE of the output, normalized by the variance of noise in the output; SNRx = 1, 2, ..., 15, SNRy = 3, l = 3, 50 realizations . . . 120
5.9   Case 4 of the simulated high dimensional data example in Section 5.3.1: matrix plot of training MSE of the output, normalized by the variance of noise in the output; SNRx = 3, SNRy = 3, l = 0, 50 realizations . . . 121
5.10  Case 4 of the simulated high dimensional data example in Section 5.3.1: matrix plot of testing MSE of the output, normalized by the variance of noise in the output; SNRx = 3, SNRy = 3, l = 0, 50 realizations . . . 122
5.11  Case 5 of the simulated high dimensional data example in Section 5.3.1: matrix plot of training MSE of the output, normalized by the variance of noise in the output; SNRx = 1, 2, ..., 15, SNRy = 3, l = 0, 50 realizations . . . 124
5.12  Case 5 of the simulated high dimensional data example in Section 5.3.1: matrix plot of testing MSE of the output, normalized by the variance of noise in the output; SNRx = 1, 2, ..., 15, SNRy = 3, l = 0, 50 realizations . . . 125
5.13  Illustration of the batch distillation process . . . 128
5.14  Case 1 of the batch distillation example in Section 5.3.3: matrix plot of training MSE of output, normalized by the variance of noise in the output; l = 3, 50 realizations . . . 130
5.15  Case 1 of the batch distillation example in Section 5.3.3: matrix plot of testing MSE of output, normalized by the variance of noise in the output; l = 3, 50 realizations . . . 131
5.16  Case 2 of the batch distillation example in Section 5.3.3: matrix plot of training MSE of output, normalized by the variance of noise in the output; l = 0, 50 realizations . . . 132
5.17  Case 2 of the batch distillation example in Section 5.3.3: matrix plot of testing MSE of output, normalized by the variance of noise in the output; l = 0, 50 realizations . . . 133
5.18  Case 1 of the continuous distillation column example in Section 5.3.4: median of training MSE of output over 50 realizations, l = 3 . . . 136
5.19  Case 1 of the continuous distillation column example in Section 5.3.4: median of testing MSE of output over 50 realizations, l = 3 . . . 137
5.20  Case 2 of the continuous distillation column example in Section 5.3.4: median of training MSE of output over 50 realizations, l = 0 . . . 138
5.21  Case 2 of the continuous distillation column example in Section 5.3.4: median of testing MSE of output over 50 realizations, l = 0 . . . 139
6.1   Histogram of the temperature measurements of the 10th stage of the continuous distillation column in Section 5.3.4 . . . 144
6.2   Histograms of the non-Gaussian input variables of the simulated data set example, based on 10000 observations . . . 154
CHAPTER 1
INTRODUCTION
1.1  Latent Variable Modeling in Process Engineering
Empirical modeling plays an important role in process systems engineering. The quality of models is an important factor in successful process monitoring, fault detection, process control, etc. Many of the modeling methods in process engineering also fall into the category of chemometric methods because they are used to deal with chemical data. Such data are usually high dimensional and highly collinear. Latent variable methods [1, 2] project data into lower dimensional spaces to eliminate this collinearity. Therefore, they are popular choices for building models in process engineering or chemometrics [3]. The most commonly used latent variable methods are Principal Component Analysis/Regression (PCA/PCR) [4, 5, 6, 7] and Partial Least Squares Regression (PLS) [8, 9]. Over the past several decades, they have been widely used for Statistical Process Control (SPC) [10, 11], process monitoring and fault diagnosis [1, 12, 13, 14, 15, 16, 17], system identification [18], controller design [19, 20], etc. In practice, extended versions of PCA and PLS are often used: multiway and multiblock PCA [21] and PLS [22, 23] can model process data across batches, while dynamic PCA [24] and PLS [20, 24] can model time-lagged process data. Latent variable methods have enjoyed great success not only in process engineering but also in areas related to chemometrics, such as pharmaceutical studies [25], Quantitative Structure-Activity Relationship (QSAR) studies and bioinformatics [3, 26]. Lately, a novel latent variable approach called Independent Component Analysis (ICA) [27] has also been getting attention and has been applied to solve many problems [28, 29, 30, 31, 32, 33] in process engineering.

The success of latent variable methods is in large part due to the fact that they permit the use of extra information, such as the lower dimensionality of the subspace of the input variables, not used in Ordinary Least Squares Regression (OLS). In fact, many of the advances in modeling methods in process engineering or chemometrics are based on the use of additional information in practical problems. Some examples of this trend include the following:

• In PCA, the measurement noise for the variables is implicitly assumed to be independent and identically distributed Gaussian, and the probability density of the variables is assumed to be uniform in the Euclidean space. Maximum Likelihood Principal Component Analysis (MLPCA) [34] relaxes some of these assumptions, in that observations are assumed to have Gaussian measurement noise with possibly different variances. MLPCA performs better than PCA because it can utilize additional information about the measurement errors.

• Model-based Principal Component Analysis (MBPCA) [35] incorporates physical relations among variables to improve upon traditional PCA.
• The recent success of wavelet-based chemometric methods is due to their multiscale nature, which permits better use and extraction of information about the time-frequency localization of measured data [36, 37]. This is in contrast to traditional methods, which are inherently single scale in nature and cannot take advantage of features with different localization in time, space or frequency that are commonly present in most practical data.

• In the estimation of state variables and parameters for nonlinear dynamic systems, the traditional method of Extended Kalman Filtering (EKF) [38] uses an approximate linear model and ignores process constraints, while the relatively recent Moving Horizon Estimation (MHE) [39] uses additional information in the form of the model nonlinearity and constraints, resulting in superior estimates.

The success of these methods reveals the following intuitive but powerful principle: proper use of additional information can improve the quality of models, estimates and predictions. This principle seems to have been the primary driving force behind most of the methodological advances in empirical modeling. Over the last several decades, chemometrics has grown from a nascent field into an active area of research, resulting in many innovative modeling methods and applications. Over the last few years, applications of chemometrics have continued to grow, including diversification into biological problems [3, 26], but growth in new methodological developments seems to have been slower. Evidence of these trends can be seen in programs for conferences such as the CAC series (see http://www.cac2006.iqm.unicamp.br/CAC-2006-statistics.html). Consequently, chemometrics seems to be a maturing science, which leads to the question of where the next phase of growth is likely to come
from. Since latent variable methods are probably the most important class of approaches in chemometrics, they face this question as well. Meanwhile, modern analytical techniques, such as microarrays, can rapidly generate large numbers of measurements. Consequently, process engineers are increasingly facing large and heterogeneous data sets. A possible approach to dealing with the additional challenges in the modeling of such data, and to obtaining further improvement over traditional and maximum likelihood latent variable methods, is to use even more domain-specific information about the system and data than what is used by maximum likelihood methods. Prior information about the data and model is often available, and its proper use can enhance the model. For example, chemists usually have knowledge about the specific bands of a functional group in a Near-Infrared (NIR) spectrum, while chemical engineers can develop a model representing the underlying phenomena. Such information often comes from different sources, each with some knowledge of its uncertainty. This kind of information is very useful in developing calibration models, estimating unknown variables and parameters, and many other tasks. The challenges stem mainly from the fact that mainstream latent variable methods, such as PCR and PLS, are ill-suited for incorporating information about uncertainty, model error and other partial information. Consequently, it seems that the next wave of innovation in latent variable methods may be motivated by the need to combine various sources of information in an optimal manner. Although maximum likelihood methods can exploit information about measurement error, they completely ignore prior information about model parameters. In addition, even within the classical frequentist framework, it is known that in three or more dimensions, the maximum likelihood
estimators are inadmissible: for any convex loss function (e.g., squared error loss), appropriately regularized Stein-type estimators (see, e.g., Berger [40]) have smaller frequentist risk than the maximum likelihood estimators.
1.2  Bayesian Methods in Process Engineering
The so-called gray models [41] can combine measurements with external information about models by setting constraints or by building hybrid models that combine a white-box model with a black-box model. However, these models are not built in a statistically rigorous way. In contrast, Bayes' theorem, long established in statistics, provides a rigorous way to combine prior information with data. Through Bayes' theorem, all the information is contained in the posterior distribution, which provides the estimates. Bayesian methods have been the subject of relatively recent efforts in process engineering and chemometrics. For example, Haim et al. [42] estimated the optimal addition time of a catalyst by combining measurements of different accuracies via Bayes' theorem; Won and Modarres [43] developed a Bayesian method which combines data and expert judgement to diagnose partial equipment failure in process plants; Roussel et al. [44] applied Bayes' theorem to a sensor fusion problem; Braatz et al. [45] suggested that Bayesian estimation can be applied to combine ab initio calculations and experimental data in the design and control of multiscale systems. In addition to the utilization of prior information, Bayesian methods also provide a convenient and efficient way to obtain uncertainty information about the estimates, since interval estimates can easily be constructed based on the posterior distribution. Those intervals have a straightforward meaning in the Bayesian framework: they
represent the most probable region of model parameters based upon all the available information. Although other methods, such as bootstrapping and the jackknife [46], can also be used to construct confidence intervals, they require much more effort. Furthermore, such intervals can only be interpreted from a long-term frequentist perspective, as representing the region occupied by the results of a large number of replicated experiments. However, in most cases, obtaining the uncertainty of estimates based on current information, instead of the variation from one experiment to another, is much more appealing. In this sense, Bayesian methods are more attractive than the traditional methods commonly used in chemometrics. This feature has also been illustrated in many applications. Hauptmanns and Homke [47] used Bayesian estimation to get the distribution of failure rates for components of process plants. Jorgensen and Pedersen [48] used a Bayesian approach to calculate the variability of model parameters. Verdonck et al. [49] compared Bayesian methods with bootstrapping and maximum likelihood methods in determining environmental standards. Park et al. [50] used a Bayesian approach to deal with the uncertainty in a multivariate receptor model to find major pollution sources. Armstrong [51] applied Bayesian analysis to get the distribution of parameters in the band-broadening models for high performance liquid chromatography (HPLC) data. Bayesian methods have been applied to model many chemical processes. Bystritskaya et al. [52] used a Bayesian method to estimate model parameters in order to predict the aging of polymers, and [53] used a sequential Bayesian estimation method in non-linear regression analysis. Pomerantsev [54] applied a Successive Bayesian Estimation (SBE) method to estimate kinetic parameters. Chen et al. [55] applied a sequential Bayesian estimation method to unconstrained nonlinear dynamic systems.
Li and Huang [56] used a Bayesian strategy to get a weighted model from a set of neural network models. Coleman and Block [57] applied a Bayesian approach to estimate model parameters for nonlinear fermentation systems. Nounou [58] applied Bayesian shrinkage for building Finite Impulse Response (FIR) models. Jitjareonchai et al. [59] solved Error-In-Variables (EIV) modeling problems in a Bayesian framework. Bayesian methods are also often seen in solving gross error detection and data rectification problems in process engineering. Tamhane et al. [60, 61] applied a Bayes test for gross error detection in chemical process data. Morad et al. [62] used Bayes' theorem for the data rectification of plant measurements. Ungarala and Bakshi [63] developed a multiscale Bayesian method for data rectification; Bakshi et al. [64] also applied the multiscale Bayesian method to systems without accurate models. Some other applications of Bayesian methods in process monitoring, control and optimization can be found in [65]. Bayesian methods are also applied for feature selection. Brown et al. [66] applied a Bayesian wavelength selection approach for multicomponent analysis of spectra data. Vannucci et al. [67] applied a Bayesian variable selection method for the wavelet-based feature selection of spectra data. Ghosh et al. [68] used Bayesian data analysis to check the significance of principal components for the prediction of trauma outcome. Naes and Indahl [69] showed that many classification methods can be unified in a Bayesian framework. Mallet et al. [70] used Bayesian classifiers for discriminant analysis of high dimensional spectral data. Yamashita [71] applied the Naive-Bayes classifier to classify process operational data. Chen and Wang [72] used Bayesian classification to help develop neural network models. Hancewicz and Wang [73] applied Bayesian discriminant clustering in spectroscopic image analysis. Torabi et al.
[74] applied a Bayesian classification method for in-line particle image monitoring in polymer film manufacturing. Bayesian classifiers are also often used in the subspace formed by latent variables. Kim et al. [75] combined PCA with a Bayesian classifier to classify petroleum products with real-time data. Liu [76] also combined PCA with a Bayesian classifier for the fault detection of operating data. Beyond the applications mentioned above, Bayesian methods can be found in much other chemometric research. Emerenciano et al. [77] applied Bayes' theorem for skeleton identification of chemical compounds. Niedzinski and Morawski [78] used a Bayesian approach to interpret spectrophotometric data. Nolsoe et al. [79] performed conformational analysis of ring molecules in a Bayesian framework. Willis [80] developed an iterative Bayesian approach to evaluate analyzer performance. Moussaoui et al. [81] used a Bayesian estimation method to analyze a spectral mixture data set.

Most of the Bayesian methods listed above were developed for solving specific problems. Without a general Bayesian modeling method, most process engineers will not build models using Bayesian statistics. Hence, the applications of Bayesian methods are still very limited in process engineering. Meanwhile, in the statistics community, general Bayesian regression methods, such as the Least Absolute Shrinkage and Selection Operator (Lasso) [82] and Least Angle Regression (Lars) [83], have already been developed. But they are not latent variable methods, with which process engineers are more familiar and comfortable. Consequently, they have drawn little attention in process engineering. Therefore, it is important to develop general Bayesian latent variable modeling methods for process engineering. For this reason, in the past decade, Bayesian PCA (BPCA) and Bayesian Latent Variable Regression (BLVR) [84, 85] were developed.
They are general latent variable modeling methods, which make use of prior knowledge about the variables and parameters via Bayes' rule. The original implementation of BPCA and BLVR was optimization-based, which has many pitfalls. The methods tend to be very slow and easily trapped in local minima when applied to model high dimensional problems. They can only provide point estimates and cannot obtain the valuable uncertainty information from the posterior distribution. They make restrictive assumptions, such as Gaussian or uniform priors. They are also not suitable for modeling discrete data, which are commonly seen in high throughput screening data sets. For these reasons, they are not widely applied in practice. However, with the dramatic increase in computing power, combined with advances in statistics, practical methods for addressing these challenges are becoming available. Monte Carlo sampling has gained popularity in Bayesian statistics over the last two decades. Nowadays, people often use Markov Chain Monte Carlo (MCMC) [86, 87] or Sequential Monte Carlo (SMC) [55] to solve Bayesian estimation problems. These techniques make Gaussian assumptions unnecessary, and the computation is much more efficient. Sampling-based methods can also easily provide uncertainty information for the estimates. As a result, more modeling practitioners [48, 57, 59, 66, 68, 79, 81, 88] are picking up these methods. In fact, the recent surge of Bayesian methods is tied to the adoption of these sampling techniques. The shortcomings of optimization-based methods can be overcome by sampling-based methods. Hence, developing sampling-based Bayesian latent variable regression methods can potentially make Bayesian modeling much more practical in process engineering.
Other than the methodology aspect, the limited use of Bayesian methods in process engineering may also be due to the perception that they require too much information that is often difficult to obtain, and that they impose a heavier computational load. Hence, practitioners are often reluctant to put in the effort to apply Bayesian methods. Thus, it is also important to explore some practical aspects of Bayesian modeling, such as ways of eliciting prior and likelihood information in practice and situations in which Bayesian methods can greatly outperform traditional methods. Exploration of these practical aspects can provide guidance for the application of Bayesian modeling methods in practice.
1.3  Dissertation Statement
This dissertation focuses on the development of sampling-based Bayesian latent variable regression methods that are computationally affordable and based on proper assumptions about the likelihood and prior functions, and on the application of those methods to the modeling of chemical processes. The overall objective is to develop practical Bayesian latent variable methods for the process engineering and chemometrics community. Challenges associated with this task are:

• Modeling high dimensional data sets.

• Modeling hybrid data sets with both continuous (stochastic) and discrete (deterministic) variables.

• Dealing with the problem of inaccurate or unavailable information about likelihood functions in modeling.

• Dealing with non-Gaussian and non-uniform prior distributions.
This dissertation undertakes these challenges with Monte Carlo sampling and Bayesian statistics that can properly address those issues.
1.4  Summary of Contributions
Major contributions of this research, with the objective of promoting Bayesian latent variable methods in process engineering, include:

• Developed a sampling-based Bayesian Latent Variable Regression (BLVR-S) approach. The original BLVR is optimization-based, limited to modeling small data sets, and unable to provide confidence intervals for estimates. In contrast, BLVR-S is capable of modeling high dimensional data sets and providing confidence intervals for estimates.

• Explored some practical aspects of Bayesian modeling. Prior to this work, there was little research on identifying the practical situations in which Bayesian methods have the largest margin over traditional methods. In this work, the effects of the number of training data and of the SNR in the input and output variables on the relative performance of BLVR-S are examined.

• Demonstrated different ways to elicit prior and likelihood information in practice. Obtaining this information is a prerequisite for applying Bayesian methods, but previous research paid little attention to this important practical issue. In this work, methods for eliciting such information are illustrated in case studies.

• Developed a modeling procedure that can deal with hybrid data sets under more appropriate assumptions by Bayesian methods. Most existing methods do not distinguish discrete data from continuous data and make inappropriate assumptions. The modeling procedure developed in this work models the two parts separately to address the differences in the nature of the data.

• Explored the effect of the distribution of the discrete variables on the performance of BLVR-S and identified the situations in which the modeling procedure for hybrid data sets is most needed.

• Developed a Bayesian method to address the lack of likelihood information in modeling. Previous work usually first estimates the parameters in the likelihood functions and then uses the estimates in modeling. This two-step strategy is inherently vulnerable and not robust. In this research, the problem is tackled by setting noninformative priors for the measurement noise variances. This method is adopted to develop an extended BLVR-S approach, which is robust against inaccurate likelihood information.

• Explored the effect of the SNR and of the quality of information about the measurement noise variances on the performance of BLVR-S, and identified the situations in which the noninformative priors for the measurement noise variances are most helpful.

• Developed a generalized BLVR-S algorithm that relaxes the uniform or Gaussian assumptions on the prior distributions. Previous work makes these restrictive assumptions primarily for convenience; they do not reflect the underlying truth. This generalized BLVR-S allows more appropriate assumptions for the prior distributions in practice.
1.5  Structure of Dissertation
This dissertation is organized as a chapter containing the theoretical background for the rest of this dissertation (Chapter 2), four chapters containing the main body of this research (Chapters 3 to 6), and a chapter containing the general conclusions and future work (Chapter 7). A more detailed description of each chapter is provided below.

Chapter 2 first provides basic knowledge about Bayes' rule and a general procedure for Bayesian estimation, followed by Bayesian interpretations of some of the traditional modeling methods, including OLS, Ridge Regression, PCA, MLPCA and PLS. After that, two Bayesian latent variable methods, BPCA and BLVR, are introduced. In the last section of this chapter, several of the most commonly used Monte Carlo sampling techniques are introduced.

Chapter 3 describes the methodology and applications of sampling-based BLVR (BLVR-S). Two cases of prior distributions, uniform and Gaussian, are considered in BLVR-S. Advantages of BLVR-S over the optimization-based BLVR and other traditional methods are illustrated in various case studies, including the modeling of a real Near Infrared (NIR) data set. The practical issues of applying BLVR-S are also discussed and illustrated in those case studies, including the effect of the number of training data and of the Signal to Noise Ratio (SNR) in the input and output variables, and ways of eliciting likelihood and prior information in practice.

The BLVR-S of Chapter 3 assumes all variables are stochastic, which is often not true for discrete variables because they often correspond to a designed experiment. To solve this problem, Chapter 4 proposes an iterative modeling procedure that separately models the continuous variables by BLVR-S and the discrete variables by other Bayesian regression methods with more appropriate assumptions. The advantages of this modeling procedure are illustrated by case studies.

Chapter 5 addresses another challenge of applying BLVR-S in practice: inaccurate information about the measurement noise variances. In the proposed method, noninformative priors are assumed for the measurement noise variances, which makes the method more robust. The performance of this method is compared with other approaches in several case studies.

In Chapter 6, the assumption of uniform or Gaussian priors in BLVR-S is relaxed by implementing an importance-sampling-within-Gibbs-sampling strategy. This allows more appropriate types of priors to be assumed. The performance of this method is compared with other approaches in a simulated example.

Finally, Chapter 7 summarizes the contributions of this research and suggests future directions for Bayesian latent variable modeling in process engineering and chemometrics.
CHAPTER 2
BACKGROUND

(The content of this chapter is based on Toward Bayesian Chemometrics - A Tutorial on Some Recent Advances, Analytica Chimica Acta, in press, 2007.)
2.1  Elements of Bayesian Modeling
In Bayesian methods, model parameters θ are considered to be stochastic rather than unknown deterministic quantities as in the traditional framework. Before any measurements are obtained, the prior knowledge (and/or beliefs) about the uncertainty in the model parameters is represented by a probability distribution P(θ), known as the prior distribution. Given the parameters θ, the conditional distribution of the measured data D, P(D | θ), is called the likelihood of the data. Prior knowledge about the unknown parameters is combined with the information in the observed data D in a probabilistically rigorous way using Bayes' theorem. The updated information is represented by the posterior distribution,

    P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}                                   (2.1)

The expression P(D) represents the marginal probability distribution of the data D and is also called the evidence. The posterior distribution combines all the available information after the experiment is carried out. It summarizes the current uncertainty about the parameters and is used for all parameter estimation problems.

To find an optimal point estimate of θ, a loss function Loss(θ, θ̂) is specified to measure the consequences of the discrepancy between the true parameter θ and its estimated value θ̂. The Bayes estimate θ̂_b minimizes the expected posterior loss, i.e.,

    \hat{\theta}_b = \arg\min_{\hat{\theta}} E\!\left[\mathrm{Loss}(\theta, \hat{\theta}) \mid D\right]   (2.2)

where the expectation is with respect to the posterior distribution. Table 2.1 shows some commonly used loss functions, their objective functions and their respective Bayes estimates. These estimates often make intuitive sense. For example, with 0-1 loss, the loss function I(θ̂, θ) equals one if the estimate θ̂ equals θ, and zero otherwise. This means that the estimated value is useful only if it is equal to the true value. Such a loss function makes sense in applications such as target recognition, where getting arbitrarily close to the target but missing it has no utility. In contrast, the squared-error loss implies that the utility of the estimate decreases in proportion to the squared distance between the true value and the estimate, implying that even getting close to the true value is worthwhile. A 0-1 loss function is most popular in optimization formulations and results in the maximum a posteriori (MAP) estimate. If P(θ) is uniform over the whole parameter space, then the MAP estimate is the same as the maximum likelihood estimate (MLE).

Loss Function     Objective Function to Be Minimized                                               Bayes Estimate
Squared error     \int (\hat{\theta}-\theta)^T (\hat{\theta}-\theta)\, P(\theta \mid D)\, d\theta   Posterior Mean
Absolute error    \int |\hat{\theta}-\theta|\, P(\theta \mid D)\, d\theta                           Posterior Median*
0-1 error         \int I(\hat{\theta}, \theta)\, P(\theta \mid D)\, d\theta                         Posterior Mode

Table 2.1: Common loss functions and their corresponding objective functions and Bayes estimates. (* Only applicable for univariate distributions, because the median of a multivariate distribution is not uniquely defined.)

The process of Bayesian estimation can be summarized by the three steps below:

1. Specify the prior distribution and the likelihood of the data.

2. Calculate the posterior distribution by Bayes' rule. Usually only the numerator of the right-hand side of Equation (2.1) is of interest because the denominator is just a normalizing constant.

3. Specify a loss function and obtain the optimal estimate based on the loss function and the posterior distribution.

This procedure is illustrated in Figure 2.1.

[Figure 2.1: Procedure of Bayesian Estimation - the prior knowledge P(θ) (current belief) and the likelihood P(D | θ) (new information) are combined into the posterior P(θ | D) (new belief), which, together with a loss function, yields the Bayesian estimate θ̂_b.]

The following simple example illustrates some of the basic principles of Bayesian estimation, and provides insight into combining information from two distinct sources, the data and the prior. In practice, the data and models are often much more challenging than those in this simple example.

Illustrative Example 2.1. A simple example of Bayesian estimation is the Normal-Normal [89] situation. Suppose that for observations x = [x_1, x_2, ..., x_n]^T,

    x_i = \theta + \epsilon_i, \quad i = 1, 2, \ldots, n,                                          (2.3)

and

    \epsilon_i \sim \mathrm{Normal}(0, \sigma^2),                                                  (2.4)

hence,

    x_i \sim \mathrm{Normal}(\theta, \sigma^2).                                                    (2.5)

Also assume that the \epsilon_i's are independently distributed, i.e.,

    x \sim \mathrm{Normal}(\theta \mathbf{1}, \sigma^2 I),                                         (2.6)

and the MLE of θ is the mean of x. The least squares solution is also the MLE due to the Gaussian assumption. Since the observed data are known while the parameter θ is to be estimated, from the Bayesian perspective θ is stochastic. Assuming that its prior distribution is uniform, then given the observations, the distribution of θ is found to be

    \theta \mid x \sim \mathrm{Normal}(\bar{x}, \sigma^2/n).                                       (2.7)

Instead, if an informative prior distribution of θ is given to be

    \theta \sim \mathrm{Normal}(\mu, \eta^2),                                                      (2.8)

then using Bayes' rule, the prior and the likelihood are combined to give the posterior distribution of θ,

    \theta \mid x \sim \mathrm{Normal}\!\left( \frac{\eta^2 \bar{x} + \sigma^2 \mu / n}{\eta^2 + \sigma^2/n}, \; \frac{\sigma^2 \eta^2}{\sigma^2 + n \eta^2} \right).   (2.9)

When the 0-1 loss function is chosen, the Bayesian estimate can be obtained by finding the posterior mode:

    \hat{\theta}_b = \arg\min_{\hat{\theta}} E\!\left[\mathrm{Loss}(\theta, \hat{\theta}) \mid x\right]
                  = \arg\max_{\theta} P(\theta \mid x)
                  = \frac{\eta^2 \bar{x} + \sigma^2 \mu / n}{\eta^2 + \sigma^2/n}.                 (2.10)

Note that this Bayesian estimate is a weighted sum of x̄ and µ, where x̄ contains information about θ from the data, while µ contains information from the prior. Their respective weights depend on the relative values of σ²/n and η², which represent the trust that is put in the data and the prior, or the relative uncertainty of the data and the prior. The larger η² is, the more uncertainty there is in the prior and the smaller the weight for µ; similarly, the larger σ²/n is, the noisier x̄ is and the less weight it receives. Furthermore, the MLE x̄ is equivalent to the MAP estimate when the prior distribution of θ is assumed to be uniform, or when the prior is normal but its variance η² goes to ∞. This shows that MLE can be treated as a special case of Bayesian estimation.

Common challenges include high dimensionality, presence of latent variables, non-Gaussian distributions, missing data, and process dynamics. Approaches for dealing with many of these challenges are described and illustrated in the sequel.
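A minimal numerical sketch of Example 2.1 is given below (Python with NumPy; the values of µ, η², σ² and the simulated data are arbitrary illustrative choices, not taken from the dissertation). It computes the posterior of Equation (2.9) and shows how the MAP estimate of Equation (2.10) interpolates between the prior mean and the sample mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative settings: true theta, known noise variance, and a Normal prior.
theta_true, sigma2 = 2.0, 1.0          # likelihood: x_i ~ Normal(theta, sigma2)
mu, eta2 = 0.0, 0.5                    # prior:      theta ~ Normal(mu, eta2)
n = 20
x = rng.normal(theta_true, np.sqrt(sigma2), size=n)
xbar = x.mean()

# Posterior of Equation (2.9): Normal with the mean and variance below.
post_mean = (eta2 * xbar + sigma2 * mu / n) / (eta2 + sigma2 / n)
post_var = (sigma2 * eta2) / (sigma2 + n * eta2)

# Under 0-1 loss the Bayes estimate is the posterior mode, which for a
# Gaussian posterior equals the posterior mean (Equation (2.10)); the MLE
# is the sample mean, recovered as eta2 -> infinity (flat prior).
print("MLE (sample mean):       ", xbar)
print("MAP / posterior mean:    ", post_mean)
print("posterior std. deviation:", np.sqrt(post_var))
print("95% credible interval:   ",
      (post_mean - 1.96 * np.sqrt(post_var), post_mean + 1.96 * np.sqrt(post_var)))
```

Running the sketch with a smaller η² (a tighter prior) pulls the MAP estimate toward µ, while increasing n or η² moves it toward the sample mean, which is the weighting behavior described above.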
2.2  Bayesian View of Traditional Methods
This section summarizes existing empirical modeling methods from a Bayesian point of view. Such insight is useful for appreciating the new developments and opportunities discussed in subsequent sections.
2.2.1  Ordinary Least Squares Regression & Ridge Regression
A very common problem in chemometrics is to predict an output variable from some predictors with a linear model. The measurements of the output are usually contaminated by noise. Supposing the predictors are noise-free, this linear model can be represented as

    y = X\beta + \epsilon,                                                                         (2.11)

where y (n × 1) is the measurement of the output variable, X (n × m) is the input matrix, β (m × 1) is the regression parameter vector, and ε (n × 1) is the measurement noise. From a Bayesian view, β is stochastic and has a prior distribution P(β). Therefore, the posterior of β is

    P(\beta \mid X, y) \propto P(y \mid X, \beta)\, P(\beta).                                      (2.12)

When a 0-1 loss function is chosen, the Bayesian estimate of β can be obtained by solving the following optimization problem:

    \hat{\beta} = \arg\max_{\beta} P(y \mid X, \beta)\, P(\beta).                                  (2.13)

Equation (2.13) is the generalized Bayesian formulation of a class of linear regression methods. OLS and ridge regression both fall into this class. Although they were not developed from a Bayesian perspective, they have very clear Bayesian interpretations.

Ordinary Least Squares Regression. If the prior distribution P(β) is assumed to be uniform in the m-dimensional space R^m, Equation (2.13) reduces to an MLE problem:

    \hat{\beta} = \arg\max_{\beta} P(y \mid X, \beta).                                             (2.14)

Moreover, assume the elements of ε in Equation (2.11) are zero-mean Gaussian noise from independent identical distributions (i.i.d.), i.e.,

    \epsilon \sim \mathrm{Normal}(0, \sigma^2 I).                                                  (2.15)

Because the negative log of the Gaussian likelihood of β is proportional to the sum of squared errors (SSE), Equation (2.13) may be further reduced to the problem of minimizing the SSE:

    \hat{\beta} = \arg\min_{\beta} \left\{ (y - X\beta)^T (y - X\beta) \right\}.                   (2.16)

The analytical solution of this problem is known to be

    \hat{\beta} = (X^T X)^{-1} X^T y.                                                              (2.17)

This is exactly the familiar analytical solution of OLS. Thus, OLS may be treated as just a special case of the Bayesian linear regression problem represented by Equation (2.13). From a Bayesian point of view, OLS implicitly assumes that β is uniformly distributed over the space R^m, i.e., we have no prior knowledge about β, so the prior probability density of β is a constant. It also assumes i.i.d. Gaussian measurement noise for the output variable. These assumptions are commonly made in many other traditional methods. A different assumption about the prior distribution of β leads to another familiar method: ridge regression.

Ridge Regression [90]. In practice, the uniform prior distribution of β in OLS is not a good assumption. For example, when X and y are all mean-centered and scaled to unit variance, the regression coefficients β should not be too far from zero. Thus, it is more reasonable to assume a normal prior with mean zero for β, i.e.,

    \beta \sim \mathrm{Normal}(0, \psi^2 I).                                                       (2.18)

With this assumption, Equation (2.13) reduces to a penalized least squares problem:

    \hat{\beta} = \arg\min_{\beta} \left\{ \frac{1}{\sigma^2} (y - X\beta)^T (y - X\beta) + \frac{\beta^T \beta}{\psi^2} \right\}.   (2.19)

The analytical solution to this problem is

    \hat{\beta}_b = \left( X^T X + \frac{\sigma^2}{\psi^2} I \right)^{-1} X^T y.                   (2.20)

Note that the solution for ridge regression [91] with its parameter λ is

    \hat{\beta}_{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T y.                                 (2.21)

Thus \hat{\beta}_b and \hat{\beta}_{\mathrm{ridge}} are equivalent when λ = σ²/ψ². The Bayesian interpretation of the parameter λ in ridge regression is much more meaningful: besides helping to make an ill-conditioned matrix invertible, λ represents the ratio of the uncertainty in the data likelihood to that in the prior. The value of λ is an index representing the relative trust in the data.
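To make this correspondence concrete, the short sketch below (Python with NumPy; the simulated data and the values of σ² and ψ² are illustrative assumptions, not taken from the dissertation) computes the OLS estimate of Equation (2.17) and the Bayesian MAP/ridge estimate of Equations (2.20)-(2.21), using λ = σ²/ψ².

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations, m predictors with deliberate collinearity.
n, m = 50, 5
X = rng.normal(size=(n, m))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # nearly collinear columns
beta_true = rng.normal(size=m)
sigma2 = 0.5          # assumed measurement-noise variance (likelihood)
psi2 = 1.0            # assumed prior variance of the coefficients
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# OLS, Equation (2.17): the MAP estimate under a uniform prior on beta.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Bayesian MAP / ridge estimate, Equations (2.20)-(2.21):
# the Normal(0, psi2 * I) prior corresponds to lambda = sigma2 / psi2.
lam = sigma2 / psi2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

print("OLS estimate:                     ", beta_ols)
print("Ridge/MAP (lambda = sigma2/psi2): ", beta_ridge)
```

Because the two leading columns of X are nearly collinear, the OLS coefficients are poorly determined, while the prior-based penalty stabilizes the MAP estimate, which is the role of λ described above.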
2.2.2  PCA & MLPCA
Chemometricians often need to deal with high dimensional data. A popular approach for handling the collinearity of those data is to first project the original input variables X (n × m) into a lower dimensional space, then work with the latent variables instead of the original input variables. By reducing the dimensionality of the data, not only can we achieve data compression, but the measurement noise in the 22
original input variables may also be filtered. In summary, this class of problems estimate the set of orthogonal loading matrix α (m × p), which contains information about the rotation of original axis and the corresponding score matrix Z (n × p), which contains the coordinate values in the reduced space. From a Bayesian perspective, Z and α are stochastic and they have a prior distribution. Combining this prior distribution with the data likelihood, given a 0-1 loss function, the objective function of this Bayesian estimation problem is: ˆ = arg max P (X | α, Z)P (Z, α), ˆ Z} {α, α,Z
(2.22)
s.t. αT α = I. Equation (2.22) is the generalized Bayesian formulation of a class of latent variable methods. Both PCA and MLPCA are just special cases of this Bayesian estimation problem. Principal Component Analysis. If the prior distributions of α and Z are uniform, the prior term in Equation (2.22) drops and it reduces to a MLE problem. Also the score vectors are obtained by multiplying the input matrix and the loading vectors, thus, the objective function becomes: ˆ = arg max P (X | α, Z), ˆ Z} {α, α,Z
(2.23)
s.t. α^T α = I and Z = Xα.

Furthermore, assume the measurement noise terms in the input variables are all i.i.d. Gaussian:

x_i \mid \alpha \sim \mathrm{Normal}(\tilde{x}_i, \sigma^2 I),    (2.24)
where x_i = X(i,:)^T (the transpose of the i-th row of X), \tilde{x}_i = \tilde{X}(i,:)^T (the transpose of the i-th row of \tilde{X}) and z_i = Z(i,:)^T (the transpose of the i-th row of Z). The noise-free input data is modeled as:

\tilde{X} = Z\alpha^T,    (2.25)
This leads to a constrained least squares problem:

\{\hat{\alpha}, \hat{Z}\} = \arg\min_{\alpha, Z} \left\{ \sum_{i=1}^{n} (x_i - \alpha z_i)^T (x_i - \alpha z_i) \right\},    (2.26)

s.t. α^T α = I and Z = Xα.

This is exactly the formulation of PCA.

Maximum Likelihood Principal Component Analysis. As shown above, PCA implicitly assumes i.i.d. Gaussian measurement noise for all input data. However, in most cases, the measurement noise for different variables is not at the same level and might be correlated. Hence, MLPCA [34] was developed based on more practical assumptions about the measurement noise. In MLPCA, the likelihood is assumed to be:

x_i \sim \mathrm{Normal}(\tilde{x}_i, \phi_i),    (2.27)

where φ_i can be any positive definite symmetric matrix. This approach still assumes a uniform prior distribution for α and Z. With these assumptions, MLPCA treats Equation (2.22) as a constrained weighted least squares problem. Figures 2.2 and 2.3 illustrate the implicit assumptions made in PCA and MLPCA. Circles and ellipses represent the likelihood for different data points. The uniform gray background indicates a uniform prior for model parameters.
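To connect Equation (2.26) with the usual PCA computation, the following sketch (not from the original text; the data are synthetic and mean centering is omitted because the simulated data already have zero mean) solves the constrained least squares problem by a singular value decomposition, which gives the closed-form PCA solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 100, 6, 2
Z_true = rng.standard_normal((n, p))
alpha_true, _ = np.linalg.qr(rng.standard_normal((m, p)))       # orthonormal loadings
X = Z_true @ alpha_true.T + 0.1 * rng.standard_normal((n, m))   # i.i.d. Gaussian noise

# The first p right singular vectors of X minimize
# sum_i (x_i - alpha z_i)^T (x_i - alpha z_i) subject to alpha^T alpha = I and Z = X alpha.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
alpha_hat = Vt[:p].T              # m x p loading matrix
Z_hat = X @ alpha_hat             # n x p score matrix
X_hat = Z_hat @ alpha_hat.T       # rank-p reconstruction of the noise-free data

print(np.allclose(alpha_hat.T @ alpha_hat, np.eye(p)))   # orthonormality constraint holds
print(np.mean((X - X_hat) ** 2))                          # residual, i.e. the filtered noise
```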
2.2.3 Partial Least-Squares Regression
Methods discussed in Section 2.2.2 only focus on the input data. To use the latent variables obtained from PCA or MLPCA for prediction, a regression step is needed
[Figure 2.2: Illustration of information used in PCA]
[Figure 2.3: Illustration of information used in MLPCA]
to get the regression parameter vector β (p × 1). If the two steps are separated, the latent variables obtained may not be optimal for prediction. Another class of methods overcomes this drawback by performing the two steps simultaneously. In this type of approach, α, Z and β are model parameters to be estimated. From a Bayesian perspective, they are stochastic and should be estimated based on their posterior distribution. Given a 0-1 loss function, the Bayesian estimation problem is:

\{\hat{\alpha}, \hat{Z}, \hat{\beta}\} = \arg\max_{\alpha, Z, \beta} P(X, Y \mid Z, \alpha, \beta) P(Z, \alpha, \beta),    (2.28)

s.t. α^T α = I.

PLS is a special case of this general Bayesian latent variable regression problem. Assuming the prior distribution to be uniform, the above equation reduces to a MLE problem. To further simplify this problem, PLS implicitly assumes that the measurement noise in the input and output variables is i.i.d. Gaussian, i.e.,

x_i \sim \mathrm{Normal}(\tilde{x}_i, \sigma^2 I),    (2.29)

y_i \sim \mathrm{Normal}(\tilde{y}_i, \sigma^2 I),    (2.30)
where x_i = X(i,:)^T, \tilde{x}_i = \tilde{X}(i,:)^T, y_i = y(i) (the i-th element of y) and \tilde{y}_i = \tilde{y}(i) (the i-th element of \tilde{y}). The noise-free data \tilde{X} and \tilde{y} are modeled as:

\tilde{X} = Z\alpha^T,    (2.31)

\tilde{y} = Z\beta.    (2.32)
Hence, Equation (2.28) becomes a constrained least squares problem:

\{\hat{\alpha}, \hat{Z}, \hat{\beta}\} = \arg\min_{\alpha, Z, \beta} \left\{ \sum_{i=1}^{n} (x_i - \alpha z_i)^T (x_i - \alpha z_i) + (y - Z\beta)^T (y - Z\beta) \right\},    (2.33)

s.t. α^T α = I.

This is one of the representations of the objective function of PLS. All of the traditional methods discussed in Section 2.2 except ridge regression set prior distributions to be uniform, which are non-informative. This leaves much room for improvement if a more accurate or informative prior can be used [84, 85].
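The practical difference between estimating the latent variables with and without the output in mind can be illustrated with a small sketch (not part of the original text; it uses scikit-learn's PCA and PLSRegression on synthetic collinear data, so the numbers are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
n, m, p = 200, 10, 3
T = rng.standard_normal((n, p))                                   # latent factors
X = T @ rng.standard_normal((p, m)) + 0.1 * rng.standard_normal((n, m))
y = T @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# PCR: project X first, then regress y on the scores (two separate steps).
scores = PCA(n_components=p).fit_transform(X)
pcr = LinearRegression().fit(scores, y)

# PLS: the scores are chosen with y in mind, i.e. both steps are done together.
pls = PLSRegression(n_components=p).fit(X, y)

print("PCR R^2:", pcr.score(scores, y))
print("PLS R^2:", pls.score(X, y))
```

PCR projects X and regresses y on the scores afterwards, whereas PLS chooses the scores using both X and y, which corresponds to the simultaneous estimation described above.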
2.3 Optimization-based Bayesian Latent Variable Methods
In this section, an introduction of two optimization-based Bayesian latent variable methods developed by Nounou et al. [84, 85] is given.
2.3.1 Bayesian Principal Component Analysis (BPCA)
PCA and MLPCA do not treat the score matrix Z and the loading matrix α as stochastic. As shown in Section 2.2.2, from a Bayesian point of view, these methods do not use any prior information about the variables. In BPCA [84], an informative prior distribution of the noise-free \tilde{x}_i is assumed to be:

\tilde{x}_i \sim \mathrm{Normal}(\mu_x, Q_x),    (2.34)

where μ_x (m × 1) is the prior mean of the input variables and Q_x (m × m) is the prior covariance matrix of the input variables. Since z_i is a linear transformation of \tilde{x}_i, given the transformation matrix α, the prior distribution for z_i is also Gaussian:

z_i \mid \alpha \sim \mathrm{Normal}(\alpha^T \mu_x, \alpha^T Q_x \alpha).    (2.35)
As for the prior distribution of α, based on Girshick's results [92], each column of α can be given an independent Gaussian prior distribution:

\alpha_j \sim \mathrm{Normal}(\mu_{\alpha_j}, \zeta_{\alpha_j}),    (2.36)

where α_j = α(:, j). Hence, the posterior distribution is:

P(\alpha, Z \mid X) \propto P(X \mid \alpha, Z) P(Z \mid \alpha) P(\alpha),    (2.37)
and the objective function is:

\{\hat{\alpha}, \hat{Z}\}_{BPCA} = \arg\max_{\alpha, Z} P(X \mid \alpha, Z) P(Z \mid \alpha) P(\alpha),    (2.38)

s.t. α^T α = I.

This is also a special case of Equation (2.22). Note that \{\hat{\alpha}, \hat{Z}\}_{BPCA} is the MAP estimate for α and Z. BPCA can be solved as a nonlinear constrained optimization problem by numerical methods. Figure 2.4 shows the information used by BPCA. Comparing it with the information used by PCA in Figure 2.2 and MLPCA in Figure 2.3, we can see that BPCA uses the most information by going beyond the likelihood and utilizing the prior information, which is represented by the shaded oval. This represents knowledge about where the measured data are expected to be, based on knowledge about the system obtained from physical or chemical principles or historical data. Another Bayesian version of PCA was independently developed by Bishop [93]; it is solved by the Expectation-Maximization (EM) algorithm [94].
2.3.2 Optimization-based Bayesian Latent Variable Regression (BLVR-OPT)
In the BLVR algorithm developed by Nounou et al. [85], it is assumed that the input variables X (n × m) and the output variable y (n × 1) are contaminated by
[Figure 2.4: Illustration of information used in BPCA]
independent Gaussian noise. The noise-free input and output variables are modeled as:

\tilde{X} = Z\alpha^T,    (2.39)

\tilde{y} = Z\beta,    (2.40)

where Z is an n × p score matrix, α is an m × p loading matrix, p is the rank of the model, and β (p × 1) is the regression parameter vector of the output variable on the score vectors. The regression parameter vector b (m × 1) of the output variable on the input variables is calculated as:

b = \alpha\beta.    (2.41)

Loading vectors represent an orthogonal basis for the underlying data structure, thus:

\alpha^T \alpha = I.    (2.42)
The measurement noise for every observation is assumed to be independently distributed with an identical normal distribution. Let ε_{x_i} (m × 1) and ε_{y_i} (a scalar) be the measurement noise for the i-th input and output; then their distributions are given by:

\epsilon_{x_i} \sim \mathrm{Normal}(0, R_x),    (2.43)

\epsilon_{y_i} \sim \mathrm{Normal}(0, R_y),    (2.44)

where R_x is an m × m noise covariance matrix of the input variables and R_y (a scalar) is the variance of the noise in the output variable for every observation. Also, the measurement noise in the input variables is assumed to be independent of the noise in the output variable, i.e.,

E(\epsilon_{y_i} \epsilon_{x_j}) = 0 \quad \forall\, i \in \{1, 2, \ldots, n\},\ j \in \{1, 2, \ldots, n\}.    (2.45)
Based on Equations (2.39) and (2.40), an observation of the input and output variables can be written as:

x_i = \alpha z_i + \epsilon_{x_i},    (2.46)

y_i = \beta^T z_i + \epsilon_{y_i},    (2.47)

where x_i = X(i,:)^T (the transpose of the i-th row of X), z_i = Z(i,:)^T (the transpose of the i-th row of Z), and y_i = y(i). The assumption on the measurement noise implies that x_i and y_i are independently distributed with:

x_i \mid \alpha, z_i \sim \mathrm{Normal}(\alpha z_i, R_x),    (2.48)

y_i \mid \beta, z_i \sim \mathrm{Normal}(\beta^T z_i, R_y).    (2.49)
The likelihood of the model parameters (Z, α and β) is proportional to the product of the densities of the distributions of X and y. Prior information for \tilde{X} and b is assumed to take the form of either uniform or Gaussian distributions and is incorporated in BLVR. This permits formulation of BLVR as a least squares minimization problem. In the case of a uniform prior, the maximum a posteriori solution is the same as the maximum likelihood solution. In the case of a Gaussian prior, the prior of \tilde{x}_i is the same as the assumption made in BPCA, as shown in Equation (2.34), and the prior for z_i is the same as shown in Equation (2.35). The Gaussian prior of b is assumed to be:

b \sim \mathrm{Normal}(\mu_b, Q_b),    (2.50)

where μ_b (m × 1) is the prior mean for b and Q_b (m × m) is the prior covariance matrix of b. Again, β is just a linear transformation of b given α; hence the prior for β is also Gaussian:

\beta \sim \mathrm{Normal}(\alpha^T \mu_b, \alpha^T Q_b \alpha).    (2.51)

Given the above likelihood and prior distributions, the posterior distribution can be derived via Bayes theorem. Based on the above assumptions, the posterior distribution of the model parameters is:

P(\alpha, Z, \beta \mid X, y) \propto P(X, y \mid \alpha, Z, \beta) P(\alpha, Z, \beta) \propto P(X \mid \alpha, Z) P(y \mid Z, \beta) P(Z \mid \alpha) P(\beta \mid \alpha).    (2.52)
Equation (2.52) implicitly assumes that the prior distribution of α is uniform on the space of all orthogonal matrices, and that the rank of the model is a known constant. By using the 0-1 loss function in BLVR, the Bayes estimate is the posterior mode of the above
posterior distribution. Hence, one needs to solve the following optimization problem:

\{\hat{\alpha}, \hat{Z}, \hat{\beta}\}_{BLVR} = \arg\max_{\alpha, Z, \beta} P(\alpha, Z, \beta \mid X, y),    (2.53)

s.t. α^T α = I.

With the Gaussian prior distributions specified above for Z and β, this is equivalent to solving the following constrained least squares problem [85]:

\{\hat{\alpha}, \hat{Z}, \hat{\beta}\}_{BLVR} = \arg\min_{\alpha, Z, \beta} \Big\{ \sum_{i=1}^{n} (x_i - \alpha z_i)^T R_x^{-1} (x_i - \alpha z_i) + \sum_{i=1}^{n} (y_i - \beta^T z_i)^T R_y^{-1} (y_i - \beta^T z_i) + \sum_{i=1}^{n} (z_i - \alpha^T \mu_x)^T (\alpha^T Q_x \alpha)^{-1} (z_i - \alpha^T \mu_x) + (\beta - \alpha^T \mu_b)^T (\alpha^T Q_b \alpha)^{-1} (\beta - \alpha^T \mu_b) \Big\},    (2.54)

s.t. α^T α = I.

When uniform prior distributions are assumed for Z and β, the last two terms in the above objective function are dropped. This BLVR-OPT approach is shown to work well with optimization routines such as Sequential Quadratic Programming (SQP), but only for relatively small problems. Even for a moderate number of variables or observations, this approach can be extremely slow. This poses a significant limitation to the use of BLVR for solving practical problems. Difficulties in obtaining uncertainty bounds for the estimates are also encountered with this approach.
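To make the structure of Equation (2.54) concrete, the sketch below (illustrative only and not the authors' implementation; all inputs are assumed to be given, and the orthonormality constraint on α is not handled here because α is held fixed) evaluates the penalized objective that an SQP-type routine would minimize over Z and β:

```python
import numpy as np

def blvr_objective(theta, X, y, alpha, Rx, Ry, mu_x, Qx, mu_b, Qb):
    """Penalized least squares objective of Equation (2.54) with the loadings alpha fixed.

    theta packs the score matrix Z (n x p, row-major) followed by beta (p,).
    """
    n, p = X.shape[0], alpha.shape[1]
    Z = theta[: n * p].reshape(n, p)
    beta = theta[n * p:]

    Rx_inv = np.linalg.inv(Rx)
    Pz_inv = np.linalg.inv(alpha.T @ Qx @ alpha)      # prior precision of z_i
    Pb_inv = np.linalg.inv(alpha.T @ Qb @ alpha)      # prior precision of beta
    mz, mb = alpha.T @ mu_x, alpha.T @ mu_b           # prior means of z_i and beta

    obj = 0.0
    for i in range(n):
        ex = X[i] - alpha @ Z[i]                      # input residual
        ey = y[i] - beta @ Z[i]                       # output residual
        ez = Z[i] - mz                                # deviation from the prior mean of z_i
        obj += ex @ Rx_inv @ ex + ey ** 2 / Ry + ez @ Pz_inv @ ez
    eb = beta - mb                                    # deviation from the prior mean of beta
    return obj + eb @ Pb_inv @ eb
```

In a full BLVR-OPT run this function would be handed to a constrained optimizer (for example scipy.optimize.minimize with method='SLSQP'), which is precisely the step that becomes prohibitively slow for high dimensional problems.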
2.4 Monte Carlo Sampling
Monte Carlo sampling [95] draws samples from the posterior distribution, which can easily provide characteristics of the posterior such as its moments and region of highest posterior density. For example, the posterior mean can be approximated as:

E(\theta \mid D) = \int \theta\, P(\theta \mid D)\, d\theta \approx \frac{1}{M} \sum_{i=1}^{M} \theta_i,    (2.55)
where θ_i are samples from the posterior distribution P(θ | D). This approximation approaches the posterior mean as the number of simulated draws M goes to infinity. The main challenge in Monte Carlo approximation is that sampling from the posterior distribution is not always an easy task, because these distributions are often not defined in an analytic form. Fortunately, a variety of sampling methods have been developed to help in drawing samples from complicated distributions. For data from static systems, the Markov Chain Monte Carlo (MCMC) approach is widely used. For data from dynamic and evolving systems, Sequential Monte Carlo (SMC), or particle filtering, is a popular choice. SMC draws samples using importance sampling techniques. Brief introductions to MCMC and importance sampling are given in the following paragraphs, with more details in [95, 96, 97].
2.4.1 Markov Chain Monte Carlo
The recent popularity of Bayesian statistical modeling is to a large extent due to the introduction of MCMC [98, 86]. MCMC provides the ability to obtain Bayesian estimates for analytically intractable posterior distributions. The idea behind MCMC is quite simple: it creates a Markov chain whose samples come from the target posterior distribution. Therefore, those samples can be used to approximate the posterior distribution and obtain highly accurate approximations of Bayesian estimates. Fortunately, this idea can be realized with the Metropolis-Hastings algorithm. A
sequence of random variables (x0 , x1 , ..., xk−1 , xk , ...) is a Markov chain when the distribution of xi given all the past variables (x0 , x1 , ..., xi−1 ) only depends on xi−1 , i.e.,
P (xi | xi−1 , xi−2 , ..., x1 , x0 ) = Pi (xi | xi−1 ).
(2.56)
The probability function Pi (xi | xi−1 ) is called the one step transition probability distribution at time step i. The Markov Chain is said to have a time-invariant transition probability distribution if Pi (· | ·) is the same for all the time steps. Under certain conditions, the marginal probability distributions for the random variables xi in the Markov chain converges to a probability distribution. This distribution is called the stationary distribution of the Markov chain. In fact, if the initial state x0 of the chain follows the stationary distribution, then the marginal distribution at every time step is the same. In the Metropolis-Hastings algorithm, a transition probability function of the Markov chain is set up so that its stationary distribution is exactly the distribution we want to sample from. The Metropolis-Hastings algorithm requires a proposal distribution, and provides a mechanism for drawing a sample x∗ as the candidate of the i-th draw in the Markov chain based on the previous draw, xi−1 . For example, in a random walk as the proposal distribution, x∗ is drawn as: x∗ = xi−1 + ǫi ,
(2.57)
where ε_i is a random variable with distribution h(ε_i). Thus, the proposal distribution is P(x* | x_{i−1}) = h(x* − x_{i−1}). The algorithm involves a random walk sampling step, followed by an acceptance-rejection step that makes the chain have
the desired stationary distribution. The acceptance probability is calculated as:

\alpha(x^*, x_{i-1}) = \min\left\{ 1, \frac{f(x^*)\, P(x_{i-1} \mid x^*)}{f(x_{i-1})\, P(x^* \mid x_{i-1})} \right\},    (2.58)

where f(x) is the desired stationary distribution. In the above example with a random walk proposal distribution, if h(ε_i) is symmetric about zero, then h(x* − x_{i−1}) = h(x_{i−1} − x*), and the acceptance probability reduces to:

\alpha(x^*, x_{i-1}) = \min\left\{ 1, \frac{f(x^*)}{f(x_{i-1})} \right\}.    (2.59)
The acceptance probability shows that this Markov chain has a larger chance of moving from a low probability density area to a high density area than of moving from a high density area to a low density area. This is similar to simulated annealing [99], if states in the low density area are considered to have higher energy than states in the high density area. In practice, to accept sample x* with this probability, another random number u_i is drawn from a Uniform(0, 1) distribution; since P(u_i < α(x*, x_{i−1})) = α(x*, x_{i−1}), the sample is rejected when u_i > α(x*, x_{i−1}). This guarantees that x* is accepted with probability α(x*, x_{i−1}). Once the proposed sample is accepted, x_i is set equal to x*; otherwise, x_i = x_{i−1}. The Markov chain can get stuck at one point if the acceptance probability is very low, which is not an effective sampling strategy. This situation often happens when sampling from a high dimensional distribution. Another version of Metropolis-Hastings sampling, known as the Gibbs sampler, avoids this problem: in order to sample from an m-dimensional distribution f(x), each element of the vector x is drawn sequentially. This is similar to optimization by univariate search. The proposal function for each element is just its full conditional distribution, conditional on all other elements.
f(x) is the desired PDF of x, P(x | y) is the proposal function
• for i = 1 : M
  – draw x* from P(x | x_{i−1})
  – draw u_i from Uniform(0, 1)
  – if u_i < α(x*, x_{i−1}) = min{1, f(x*)P(x_{i−1} | x*) / [f(x_{i−1})P(x* | x_{i−1})]}
      · x_i = x*
  – else
      · x_i = x_{i−1}
  – end if
• end for
(x_1, x_2, ..., x_M) are samples from f(x)

Table 2.2: Metropolis-Hastings algorithm.
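A compact implementation of Table 2.2 with a symmetric random walk proposal might look as follows (an illustrative sketch, not code from this work; the target density used at the end is an arbitrary unnormalized standard normal, and the step size is a tuning parameter chosen here for illustration):

```python
import numpy as np

def metropolis_hastings(log_f, x0, step, M, rng):
    """Random-walk Metropolis-Hastings (Table 2.2) with a symmetric Gaussian proposal.

    log_f: log of the (possibly unnormalized) target density f(x).
    Because the proposal is symmetric, the acceptance probability reduces to
    min(1, f(x*) / f(x_{i-1})), Equation (2.59).
    """
    x = np.asarray(x0, dtype=float)
    samples = np.empty((M, x.size))
    for i in range(M):
        x_star = x + step * rng.standard_normal(x.size)    # propose a move
        if np.log(rng.uniform()) < log_f(x_star) - log_f(x):
            x = x_star                                      # accept the candidate
        samples[i] = x                                      # otherwise keep x_{i-1}
    return samples

# Example: sample from an unnormalized standard normal density.
rng = np.random.default_rng(0)
draws = metropolis_hastings(lambda x: -0.5 * np.sum(x ** 2), [0.0], 0.8, 20000, rng)
print(draws[5000:].mean(), draws[5000:].std())   # roughly 0 and 1 after burn-in
```

Working with log densities avoids numerical underflow when the target density takes very small values.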
By using this proposal distribution, the acceptance probability is always equal to one. The elements of θ can also be divided into several blocks, and samples of blocks, instead of individual elements, can be drawn sequentially. This can greatly reduce the number of iterations in the Gibbs sampling algorithm. This strategy is very effective for drawing samples from high dimensional distributions whenever one can sample from each of the full conditional distributions. It is widely used in Bayesian computing. Table 2.2 shows the Metropolis-Hastings algorithm and Table 2.3 shows the Gibbs sampling algorithm for the Markov chain with the posterior distribution as its stationary distribution. Note that the samples in the Markov chain are highly correlated. To assess the degree of this correlation, one can calculate the sample autocorrelation function of the generated sequence and find the minimal lag r at which the autocorrelation becomes small.
f(x) is the desired PDF of x, x = (x^(1), x^(2), ..., x^(m))^T is m-dimensional
• for i = 1 : M
  – for j = 1 : m
      · draw x_i^(j) from f(x^(j) | x_i^(1), x_i^(2), ..., x_i^(j−1), x_{i−1}^(j+1), ..., x_{i−1}^(m))
  – end for
• end for
(x_1, x_2, ..., x_M) are samples from f(x)

Table 2.3: Gibbs sampling algorithm.
In order to minimize the correlation among the sampled draws, a thinning step selects every r-th element in the chain as a sample from the posterior. In practice, every 5th or 10th observation can be selected. These thinned samples can be used for Monte Carlo approximations. Some statisticians [100] argue that the thinning step is not necessary unless there is a storage problem, but many practitioners still consider thinning a required step in the MCMC procedure. The thinning step is also adopted in the popular MCMC software BUGS [101, 102]. Theoretical properties of different types of Markov chains in MCMC have been thoroughly discussed [103]. Since the marginal distribution of a Markov chain starting from an arbitrary initial state takes time to converge to the stationary distribution, initial MCMC draws in the burn-in phase are also discarded. For example, if there are n thinned samples of θ in the Markov chain, which has the stationary distribution P(θ | D), and the number of burn-in samples is k, then the effective MCMC sample is the last n − k draws, and
the Monte Carlo approximation in Equation (2.55) should be changed to:

E(\theta \mid D) \approx \frac{1}{n-k} \sum_{i=k+1}^{n} \theta_i.    (2.60)
Illustrative Example 2.2. Assume b ∼ Normal(μ, σ²) and σ² has an Inverse-Gamma distribution, i.e., 1/σ² ∼ Gamma(α, β). Given μ = 0, α = 2, β = 5, we want to find the expected absolute value of b, E(|b| | μ, α, β). This integral can be calculated analytically and results in an answer equal to 0.3162 to the 4th decimal place. Even though this integral has a closed form solution, obtaining it requires a tedious process. Instead of solving this problem analytically, Gibbs sampling can be applied to draw Monte Carlo samples of b from P(b, σ² | μ, α, β), and the expected value of |b| can be approximated by those samples. To implement Gibbs sampling, the respective full conditional distributions of b and σ² must be identified, which is straightforward in this case:

b \mid \mu, \alpha, \beta, \sigma^2 \sim \mathrm{Normal}(0, \sigma^2),    (2.61)

1/\sigma^2 \mid \mu, \alpha, \beta, b \sim \mathrm{Gamma}\!\left( 2.5,\ \frac{1}{1/5 + b^2/2} \right).    (2.62)
Based on the algorithm in Table 2.3, conditional samples of b and σ² can be drawn sequentially from the full conditional distributions given in Equations (2.61) and (2.62). Figure 2.5 illustrates the steps of drawing samples from P(b, σ² | μ, α, β) by Gibbs sampling. In one full cycle of the simulation, the initial starting point of {b, σ²} is chosen arbitrarily as {−0.6, 0.35}. After 10^4 samples from the Markov chain have been drawn, the first 10^3 draws are discarded as burn-in and every 5th draw is selected for estimation. The integral can be approximated by the remaining samples
[Figure 2.5: Illustration of Gibbs sampling, showing successive draws (b_1, σ²_1), (b_2, σ²_2), ... and the full conditional densities P(b | μ, α, β, σ²) and P(σ² | μ, α, β, b).]
as:

E(|b| \mid \mu, \alpha, \beta) \approx \frac{1}{1800} \sum_{i=1}^{1800} |b_{1001 + 5(i-1)}| = 0.3196.    (2.63)
This is very close to the analytical solution.
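A direct implementation of this example could look as follows (a sketch of one possible implementation, not the code used for the dissertation; numpy's Gamma generator is parameterized by shape and scale, matching Equations (2.61)-(2.62)):

```python
import numpy as np

rng = np.random.default_rng(0)
M, burn_in, thin = 10_000, 1_000, 5
b, sigma2 = -0.6, 0.35                     # arbitrary starting point, as in the text
draws = []

for _ in range(M):
    # Draw b from its full conditional, Equation (2.61): b | sigma^2 ~ Normal(0, sigma^2)
    b = rng.normal(0.0, np.sqrt(sigma2))
    # Draw 1/sigma^2 from its full conditional, Equation (2.62):
    # Gamma with shape 2.5 and scale 1 / (1/5 + b^2 / 2)
    sigma2 = 1.0 / rng.gamma(2.5, 1.0 / (1.0 / 5.0 + b ** 2 / 2.0))
    draws.append(b)

kept = np.abs(draws[burn_in::thin])        # discard burn-in, keep every 5th draw
print(kept.mean())                         # close to the analytical value 0.3162
```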
2.4.2 Importance Sampling
Similar to MCMC, importance sampling [55] also needs a distribution h(x) from which one can draw samples conveniently, and whose support contains the support of the desired distribution f(x), i.e., for every x with f(x) > 0, h(x) is also positive.
Random samples are drawn from h(x) and are assigned corresponding weights:

w(x) = \frac{f(x)}{h(x)}.    (2.64)
These samples and weights are used together in Monte Carlo approximation. Note that, by importance sampling, the expectation of a function g(x) under the distribution f(x) is approximated as:

E_f(g(x)) = \int_a^b g(x) f(x)\, dx = \int_a^b g(x) \frac{f(x)}{h(x)} h(x)\, dx \approx \frac{1}{M} \sum_{i=1}^{M} g(x_i) w(x_i),    (2.65)
where (x_1, x_2, ..., x_M) are random samples from h(x). Table 2.4 shows the algorithm for importance sampling. To get a better approximation, the importance sampling distribution h(x) should be as similar to f(x) as possible [104]. When they are identical, the weights are always one and the approach is equivalent to drawing random samples directly from f(x).

Illustrative Example 2.3. It was shown in Section 2.4.1 that Gibbs sampling can be applied to solve an integration problem. The same problem can also be solved by importance sampling in a simpler way. Let the joint importance sampling distribution of {b, σ²} be a product of independent importance sampling distributions of b and σ². The proposal distribution of σ² can be chosen to be half-normal: samples of σ² are the absolute values of random numbers drawn from a normal distribution with mean zero and variance ζ², while the proposal distribution of b can be a normal distribution with mean μ and a fixed variance τ².
f(x) is the desired PDF of x, h(x) is the proposal PDF; wherever f(x) > 0, h(x) > 0
• for i = 1 : n
  – draw x* from h(x)
  – x_i = x*
  – calculate weight: w_i = f(x_i) / h(x_i)
• end for
(x_1, x_2, ..., x_n) and (w_1, w_2, ..., w_n) are samples and corresponding weights

Table 2.4: Importance sampling algorithm.
Then, the weight of the i-th sample is calculated as:

w_i = \frac{P(b_i \mid \mu, \sigma_i^2)\, P(\sigma_i^2 \mid \alpha, \beta)}{P(b_i \mid \mu, \tau^2)\, P(\sigma_i^2 \mid 0, \zeta^2)}.    (2.66)
The values of τ² and ζ² should be set up such that the proposal function is close to the posterior distribution. Hence, the value of τ² is set to 0.16, which is close to the prior mean of σ_i². This makes the samples of b come from a distribution close to the posterior distribution. Also, noticing that the probability density of the prior of σ_i² is very small above 2, we can set the value of ζ² to 4, so that most drawn samples of σ_i² are smaller than 2. This guarantees that most samples have weights that are not too close to zero. With 10,000 independent samples drawn from the proposal distribution P(b_i | μ, τ²)P(σ_i² | 0, ζ²), the integral can be approximated as:

E(|b| \mid \mu, \alpha, \beta) \approx \frac{1}{10000} \sum_{i=1}^{10000} |b_i| w_i = 0.3013.    (2.67)
This result is not as accurate as the one obtained by Gibbs sampling, but this implementation does not require us to sample directly from the full conditional distributions of b and σ². This merit is very valuable in practice. In the latter approach, we only need to pay attention to the choice of proposal distribution, while in Gibbs sampling, full conditional distributions must be derived and the burn-in samples have to be discarded in the Monte Carlo approximation. The differences between importance sampling and Gibbs sampling shown in this example are also typical elsewhere. For high dimensional distributions, if full conditional distributions are easy to derive and sampling from them is not difficult, Gibbs sampling usually provides more accurate results. To achieve the same level of accuracy, importance sampling must have a high quality proposal distribution and often requires more samples. However, in many situations, Gibbs sampling requires sampling from many sophisticated full conditional distributions, and drawing samples from them may become an obstacle itself. Thus, importance sampling is much more versatile in that it can easily handle sampling from distributions when other approaches fail to provide a solution with a reasonable computational load.
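For completeness, the importance sampling calculation of Example 2.3 can be sketched as follows (illustrative code, not from the dissertation; scipy.stats is used for the density evaluations, the Gamma prior on 1/σ² is taken with shape 2 and scale 5 as above so that σ² itself has an inverse-gamma density, and the half-normal proposal density for σ² is written as twice a normal density):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
M, mu, tau2, zeta2 = 10_000, 0.0, 0.16, 4.0

# Proposal: sigma^2 is |Normal(0, zeta^2)| (half-normal), b is Normal(mu, tau^2).
sigma2 = np.abs(rng.normal(0.0, np.sqrt(zeta2), M))
b = rng.normal(mu, np.sqrt(tau2), M)

# Target density P(b | mu, sigma^2) P(sigma^2 | alpha, beta):
# 1/sigma^2 ~ Gamma(shape=2, scale=5) means sigma^2 ~ InvGamma(shape=2, scale=1/5).
target = stats.norm.pdf(b, mu, np.sqrt(sigma2)) * stats.invgamma.pdf(sigma2, 2, scale=0.2)
# Proposal density: Normal(mu, tau^2) for b times the half-normal density for sigma^2.
proposal = stats.norm.pdf(b, mu, np.sqrt(tau2)) * 2.0 * stats.norm.pdf(sigma2, 0.0, np.sqrt(zeta2))

w = target / proposal
print(np.mean(np.abs(b) * w))   # rough approximation of E(|b|) = 0.3162
```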
CHAPTER 3
SAMPLING-BASED BAYESIAN LATENT VARIABLE REGRESSION³
3.1 Introduction
Latent variable modeling methods such as Principal Component Regression (PCR) [7] and Partial Least Squares Regression (PLS) [8] have been popular in chemometrics due to the common occurrence of correlated variables [2]. As described in Section 2.2, a closer look at these methods reveals many implicit assumptions. For example, in both PCR and PLS, the measurement noise terms are assumed to be Gaussian and independent identically distributed. These assumptions, made for convenience, are often violated, and leave much room for improvement. In response to this realization, Wentzell et al. [34] developed maximum likelihood methods that avoid some of these assumptions. Maximum Likelihood Principal Component Analysis (MLPCA) and Maximum Likelihood Principal Component Regression (MLPCR) handle measurement error containing noise of different magnitudes and correlation in input variables. A similar approach, Maximum Likelihood Latent Root Regression (MLLRR) [105], can handle measurement errors in both input and output variables.

³ Content of this chapter is based on Bayesian Latent Variable Regression via Gibbs Sampling: Methodology and Practical Aspects, Journal of Chemometrics, in press, 2007.
These maximum likelihood methods can outperform traditional latent variable methods by making use of additional information about measurement error. As described in Section 2.3, BPCA [84] is a Bayesian version of PCA which only models the input variables, while BLVR [85] is a Bayesian regression approach that combines the data likelihood with prior information about the input variables and the regression coefficients in modeling. Various case studies have illustrated the benefits of BLVR over traditional methods. BLVR was initially formulated as a least-squares problem and its implementation relies on numerical optimization routines, which can be computationally very expensive, particularly when modeling high dimensional data sets. This least-squares formulation of BLVR relies on assumptions, such as Gaussian measurement noise and Gaussian prior distributions for the input variables and regression parameters, which may not hold in many cases. For example, data from a nonlinear dynamic process tends to have non-Gaussian distributions [55] and a Gaussian assumption could lead to unsatisfactory models. Also, the optimization routine often encounters local minima or non-convergence and may not be computationally efficient for non-Gaussian distributions. Moreover, a big advantage of Bayesian approaches is that they can provide uncertainty information about the model parameters from the posterior distribution. Optimization-based approaches can also provide variance estimates when a closed-form solution is available; this usually requires Gaussian assumptions and is exemplified by methods such as Kalman filtering and Ordinary Least Squares regression. Since BLVR lacks a closed-form solution of the posterior variance, BLVR-OPT is unable to provide the posterior variance without additional effort.
As an alternative to optimization, Monte Carlo sampling is gaining popularity among Bayesian practitioners since advances in computing and statistical theory are overcoming many hurdles in its practical use. As discussed in Section 1.2, many of the Bayesian applications in process engineering and chemometrics have used samplingbased approaches. These methods can efficiently execute Bayesian estimation, even for high dimensional problems. Since sampling-based methods do not require the least-squares formulation, assumptions such as Gaussian distribution are not necessary. Furthermore, it is quite straightforward to obtain the point estimate and Bayesian confidence interval or posterior variance from samples representing the posterior generated by sampling-based methods. These apparent advantages make a sampling-based BLVR more appealing than the optimization-based BLVR. In this chapter, a sampling-based BLVR (BLVR-S) is proposed to overcome the shortfalls of optimization-based BLVR (BLVR-OPT)4 . Instead of solving a constrained least-squares minimization problem, BLVR-S draws samples from the posterior distribution by Markov Chain Monte Carlo (MCMC) methods [86]. These posterior samples are summarized to approximate the Bayesian estimates of the model parameters and the Bayesian confidence intervals. BLVR-S is applied to model a simulated high dimensional data set and an experimental NIR data set. These case studies show that BLVR-S is computationally efficient in modeling high dimensional data sets and can provide better models than BLVR-OPT. Despite the generality of the proposed algorithm, Gaussian assumptions are still made in this chapter for simplicity. Extension to non-Gaussian distributions is described in Chapter 6. 4
The notations BLVR and BLVR-S are used interchangeably hereafter, while the optimization-based BLVR is referred to only as BLVR-OPT.
In addition to the BLVR-S algorithm, another contribution of this work is that it provides practical insight into modeling situations that may be appropriate for Bayesian methods. The effects of the size of the data set and of the signal-to-noise ratio in the input and output variables on the relative performance of BLVR are illustrated. These empirical studies indicate that the benefits of Bayesian approaches may be most significant when the amount of measured data is limited and the noise in the outputs is relatively large. The rest of this chapter is organized as follows. The algorithm of BLVR-S is first described. Practical aspects of BLVR, mentioned in the previous paragraph, are then discussed. After that, two case studies are presented, where the advantages of BLVR-S over other methods and its practical aspects are illustrated. The last section provides a summary of this chapter and proposes directions for future work.
3.2 Sampling-Based Bayesian Latent Variable Regression
In this approach, MCMC is used to draw samples from the posterior distribution, as discussed in Section 2.4.1. To implement the Gibbs sampling algorithm, it is necessary to derive the full conditional distributions P(Z | X, y, α, β), P(β | X, y, α, Z) and P(α | X, y, Z, β). Since the conditional distribution of α is complicated due to the orthonormality constraint, it is quite challenging to draw samples of α by MCMC. Alternatively, this problem can be decomposed as follows:

\{\hat{Z}, \hat{\beta}\}_{BLVR} = \arg\max_{Z, \beta} P(Z, \beta \mid X, y, \hat{\alpha}),    (3.1)

s.t. \{\hat{\alpha}\}_{BLVR} = \arg\max_{\alpha} P(\alpha \mid X, y, \hat{Z}, \hat{\beta}),    (3.2)

\alpha^T \alpha = I.
For solving the problem in Equation (3.1), MCMC can be used, while for solving the problem in Equation (3.2), optimization methods that have a closed form solution can be used, such as PCA, MLPCA or PLS. By iterating these two steps, it may be possible to converge to the optimum. Convergence can be detected by calculating the angular deviation [34] between the new α̂ and the previous one from the last iteration step. When the deviation is less than a threshold, convergence can be considered to have been achieved. A simpler way to solve this problem is to fix α and then perform MCMC without the iteration for optimizing α. This leads to suboptimal results, but empirical studies show that this simplification does not cause much deterioration in the quality of the resulting model in terms of prediction error. For example, the results of BLVR-S in the first case study in Section 3.3.1 are all obtained by using this simplified method and the prediction errors are still smaller than those of some other methods. Therefore, this simplified method is applied in both case studies. The BLVR-OPT algorithm could also benefit from using a fixed α to significantly reduce the computational load. Nonetheless, only a point estimate can be obtained, and accessing the uncertainty information contained in the posterior distribution requires additional effort beyond this algorithm. By fixing α, BLVR-S can provide confidence intervals for Z and β from the distribution conditional on α; although they are not the same as the Bayesian confidence intervals from the posterior distribution, they still provide valuable uncertainty information about the estimates. When there is little variation of α in the posterior distribution, which is often true, those confidence intervals are good approximations of the real confidence intervals from the posterior distribution. Although the same notation is used in the BLVR-S algorithm as in the BLVR-OPT algorithm, readers
should be aware that α̂, Ẑ and β̂ are no longer MAP estimates once α is fixed in the BLVR-S algorithm. Deriving the full conditional distributions for Z and β is essential for implementing Gibbs sampling. Depending on the prior assumptions, these conditional distributions differ. For simplicity, only Gaussian and uniform distributions are considered in this chapter, and the derivations of the conditional distributions for the two cases are described in the following subsections. The detailed algorithm of BLVR-S is shown in Table 3.1. In the MCMC step, samples of Z and β are drawn from P(Z, β | X, y, α). These samples, after the burn-in phase, can be used to obtain estimates of Z and β. The posterior mean is commonly used as a point estimate of the variables. This estimate may be different from the point estimate obtained by optimization based methods, since these methods usually aim to find the mode of the distribution. In addition to being more convenient, estimating the mean as a point estimate also matches the common goal of minimizing the mean-squared error, since the posterior mean is the optimum for the squared-error loss function. The mode, on the other hand, is the optimum for the zero-one loss function, which implies that estimates different from the truth have no utility. Such a loss function does not seem appropriate for most chemometric and chemical engineering tasks since, for most users, the utility of the estimate is better captured by its distance from the truth. More details about this are given in [40]. In principle, any prior can be used in the proposed sampling-based approach. However, other prior distributions may lead to full conditional distributions for Z and β that are not convenient to sample from. This is a topic of on-going research. The more general BLVR-S algorithm is provided in Chapter 6.
• get α̂ by applying PCA, MLPCA or PLS to (X, y)
• while not converged
  1. get Ẑ and β̂ by Gibbs sampling
     – for i = 1 : n
         · draw Z^(i) from P(Z | X, y, α̂, β^(i−1)) (Equation (3.9) for Gaussian prior and Equation (3.17) for uniform prior)
         · draw β^(i) from P(β | X, y, α̂, Z^(i)) (Equation (3.12) for Gaussian prior and Equation (3.20) for uniform prior)
     – end for
     – estimate Z and β based on {(Z^(1), β^(1)), (Z^(2), β^(2)), ..., (Z^(n), β^(n))}
  2. get X̂ and ŷ based on α̂, Ẑ and β̂
  3. get α̂ by applying PCA, MLPCA or PLS to (X̂, ŷ)
• end while

Table 3.1: BLVR-S algorithm.
3.2.1 Gaussian Prior
In the case of a Gaussian prior, since the measurement noise is independently distributed, the conditional distribution of z_i only depends on the distributions of x_i and y_i and the prior distribution (Equation (2.35)). Let

d_i \equiv \begin{bmatrix} x_i \\ y_i \end{bmatrix},    (3.3)

\gamma \equiv \begin{bmatrix} \alpha \\ \beta^T \end{bmatrix},    (3.4)

R_d \equiv \begin{bmatrix} R_x & 0 \\ 0 & R_y \end{bmatrix}.    (3.5)
Variable d_i is the i-th observation of both the input and output variables, which contains all the information from the i-th observation of data. R_d is a block diagonal matrix because the measurement errors of the input and output variables are independent. Based on Equations (2.39) and (2.40):

d_i = \gamma z_i + \begin{bmatrix} \epsilon_{x_i} \\ \epsilon_{y_i} \end{bmatrix}.    (3.6)
Since the error term follows a Gaussian distribution with mean 0 and covariance R_d, d_i follows the Gaussian distribution:

d_i \mid \alpha, z_i, \beta \sim \mathrm{Normal}(\gamma z_i, R_d).    (3.7)

The mean of d_i resides in the column space of γ, which is in compliance with the formulation of BLVR. Combining this with the prior distribution of z_i (Equation (2.35)), the joint distribution of d_i and z_i is:

\begin{bmatrix} z_i \\ d_i \end{bmatrix} \Big|\, \alpha, \beta \sim \mathrm{Normal}\!\left( \begin{bmatrix} \alpha^T \mu_x \\ \gamma \alpha^T \mu_x \end{bmatrix},\ \begin{bmatrix} \alpha^T Q_x \alpha & \alpha^T Q_x \alpha \gamma^T \\ \gamma \alpha^T Q_x \alpha & \gamma \alpha^T Q_x \alpha \gamma^T + R_d \end{bmatrix} \right).    (3.8)
This is a multivariate normal distribution, for which the following theorem is well known [106].

Theorem 3.1. If

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim \mathrm{Normal}\!\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\ \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right),

where x is n × 1, x_1 is n_1 × 1 and x_2 is n_2 × 1, then the conditional distribution of x_1 given that x_2 = x_2^* is:

x_1 \mid x_2^* \sim \mathrm{Normal}\!\left( \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2^* - \mu_2),\ \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right).

Based on the above theorem, the full conditional distribution of z_i is:

z_i \mid x_i, y_i, \alpha, \beta \sim \mathrm{Normal}\big( \alpha^T \mu_x + \alpha^T Q_x \alpha \gamma^T (\gamma \alpha^T Q_x \alpha \gamma^T + R_d)^{-1} (d_i - \gamma \alpha^T \mu_x),\ \alpha^T Q_x \alpha (I - \gamma^T (\gamma \alpha^T Q_x \alpha \gamma^T + R_d)^{-1} \gamma \alpha^T Q_x \alpha) \big).    (3.9)
All information about z_i is contained in this conditional distribution, given the data and the model parameters α and β. Since a normal distribution is symmetric around its mean, the conditional mean is also the mode of this distribution; hence it is quite likely that the next sample of z_i will be drawn around the conditional mean. It is also worth noting that the sample of z_i depends on both x_i and y_i. Thus, both the input and output observations affect the estimation of z_i, while in PCA and MLPCA the estimation of the score vectors only depends on the observation of the input variables. In this regard, BLVR is similar to PLS. However, in PLS, the input and output variables are equally weighted in estimating z_i, since they are assumed to have measurement noise of the same magnitude. In contrast, in BLVR, if the measurement noise in the output variable is much larger than that in the input variables, the effect of including the output variable in the estimation of z_i will be negligible. The conditional distribution of β only depends on its prior distribution (Equation (2.51)) and the distribution of y:

y \mid \alpha, Z, \beta \sim \mathrm{Normal}(Z\beta,\ I \otimes R_y),    (3.10)
where ⊗ is the Kronecker product. Combining this with the prior distribution of β, the joint distribution of β and y is:

\begin{bmatrix} \beta \\ y \end{bmatrix} \Big|\, \alpha, Z \sim \mathrm{Normal}\!\left( \begin{bmatrix} \alpha^T \mu_b \\ Z \alpha^T \mu_b \end{bmatrix},\ \begin{bmatrix} \alpha^T Q_b \alpha & \alpha^T Q_b \alpha Z^T \\ Z \alpha^T Q_b \alpha & Z \alpha^T Q_b \alpha Z^T + I \otimes R_y \end{bmatrix} \right).    (3.11)

Again, by applying Theorem 3.1, the full conditional distribution of β is:

\beta \mid y, \alpha, Z \sim \mathrm{Normal}\big( \alpha^T \mu_b + \alpha^T Q_b \alpha Z^T (Z \alpha^T Q_b \alpha Z^T + I \otimes R_y)^{-1} (y - Z \alpha^T \mu_b),\ \alpha^T Q_b \alpha (I - Z^T (Z \alpha^T Q_b \alpha Z^T + I \otimes R_y)^{-1} Z \alpha^T Q_b \alpha) \big).    (3.12)

If Z is considered as the input matrix, then the above distribution is nothing but the posterior distribution of the regression parameters in Bayesian linear regression with a Gaussian prior. If the prior mean α^T μ_b = 0 and the prior covariance α^T Q_b α = I ⊗ k, then the conditional mean is the same as the solution of ridge regression [91] with the parameter λ = R_y / k. With Equations (3.9) and (3.12), the MCMC step of the BLVR-S
algorithm can be easily implemented for the case of a Gaussian prior. This conditional mean is also the MLE of β given y, α and Z, which is also used in BLVR-OPT. However, the conditional variance was not derived or used in the optimization-based approach. With MCMC, in addition to the conditional mean, the conditional variance is also readily obtained, which provides useful information about the posterior distribution along with uncertainty information for the point estimates.
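For reference, a single Gibbs update of z_i and β under the Gaussian prior could be coded directly from Equations (3.9) and (3.12) as below (a schematic sketch in the notation of this section, not the authors' implementation; all inputs are assumed to be given, and R_y is treated as a scalar so that I ⊗ R_y is simply R_y times an identity matrix):

```python
import numpy as np

def draw_zi(xi, yi, alpha, beta, Rx, Ry, mu_x, Qx, rng):
    """One draw of z_i from its full conditional, Equation (3.9)."""
    di = np.append(xi, yi)                               # stacked observation, Eq. (3.3)
    gamma = np.vstack([alpha, beta.reshape(1, -1)])      # [alpha; beta^T], Eq. (3.4)
    Rd = np.block([[Rx, np.zeros((len(xi), 1))],
                   [np.zeros((1, len(xi))), np.array([[Ry]])]])   # Eq. (3.5)
    Pz = alpha.T @ Qx @ alpha                            # prior covariance of z_i
    mz = alpha.T @ mu_x                                  # prior mean of z_i
    S = gamma @ Pz @ gamma.T + Rd
    K = Pz @ gamma.T @ np.linalg.inv(S)
    mean = mz + K @ (di - gamma @ mz)
    cov = Pz - K @ gamma @ Pz
    return rng.multivariate_normal(mean, cov)

def draw_beta(y, Z, alpha, Ry, mu_b, Qb, rng):
    """One draw of beta from its full conditional, Equation (3.12)."""
    Pb = alpha.T @ Qb @ alpha                            # prior covariance of beta
    mb = alpha.T @ mu_b                                  # prior mean of beta
    S = Z @ Pb @ Z.T + Ry * np.eye(len(y))
    K = Pb @ Z.T @ np.linalg.inv(S)
    mean = mb + K @ (y - Z @ mb)
    cov = Pb - K @ Z @ Pb
    return rng.multivariate_normal(mean, cov)
```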
3.2.2 Uniform Prior
Derivation of the full conditional distributions for the uniform prior case is not as straightforward as in the case of a Gaussian prior. The integral of the uniform prior distribution of z_i or β over the space R^m is not finite; thus the uniform prior is statistically an improper prior distribution. One way to deal with an improper prior is to start with a proper prior and make the limiting case equivalent to the improper prior. Since the Gaussian prior is proper and the full conditional distributions have already been derived, it is a good choice for obtaining the limiting case that approaches a uniform prior. Let the covariance matrices of the Gaussian prior distributions (Equations (2.35) and (2.51)) be diagonal matrices with diagonal elements approaching infinity, i.e.,

\alpha^T Q_x \alpha \equiv I \otimes k,    (3.13)

\alpha^T Q_b \alpha \equiv I \otimes k,    (3.14)

with

k \longrightarrow +\infty.    (3.15)
With Equations (3.13)-(3.15), the conditional mean and covariance of z_i and β can be simplified for the uniform prior case based on the following theorem [106].

Theorem 3.2. Let A (p × p) and C (m × m) be nonsingular and let B be a p × m matrix. If |C^{-1} + B^T A^{-1} B| ≠ 0, then (A + BCB^T)^{-1} = A^{-1} - A^{-1} B (C^{-1} + B^T A^{-1} B)^{-1} B^T A^{-1}.

Based on Equation (3.9), the conditional mean and covariance for z_i both contain (γ α^T Q_x α γ^T + R_d)^{-1}, which with the help of (3.13) becomes (γ (I ⊗ k) γ^T + R_d)^{-1}. Both I ⊗ k and R_d are nonsingular. Also, it is easy to prove that (I ⊗ k)^{-1} + γ^T R_d^{-1} γ is positive definite; hence, it satisfies the condition that |(I ⊗ k)^{-1} + γ^T R_d^{-1} γ| ≠ 0. By applying Theorem 3.2, the following result is obtained:

(\gamma (I \otimes k) \gamma^T + R_d)^{-1} = R_d^{-1} - R_d^{-1} \gamma \big( (I \otimes k)^{-1} + \gamma^T R_d^{-1} \gamma \big)^{-1} \gamma^T R_d^{-1}.    (3.16)
Substituting Equation (3.16) into the conditional mean of z_i given by Equation (3.9), expanding, and collecting terms:

\alpha^T \mu_x + (I \otimes k) \gamma^T (\gamma (I \otimes k) \gamma^T + R_d)^{-1} (d_i - \gamma \alpha^T \mu_x) = \big( (I \otimes k)^{-1} + \gamma^T R_d^{-1} \gamma \big)^{-1} \big[ (I \otimes k)^{-1} \alpha^T \mu_x + \gamma^T R_d^{-1} d_i \big] \;\overset{k \to +\infty}{\approx}\; (\gamma^T R_d^{-1} \gamma)^{-1} \gamma^T R_d^{-1} d_i.    (3.17)
The result shown in Equation (3.17) is the same as that of weighted least squares regression of d_i on γ, with R_d^{-1/2} as the weighting matrix. This makes intuitive sense because it is the MLE solution. The conditional covariance matrix is:

(I \otimes k)\big( I - \gamma^T (\gamma (I \otimes k) \gamma^T + R_d)^{-1} \gamma (I \otimes k) \big) = \big( (I \otimes k)^{-1} + \gamma^T R_d^{-1} \gamma \big)^{-1} \;\overset{k \to +\infty}{\approx}\; (\gamma^T R_d^{-1} \gamma)^{-1}.    (3.18)
Similarly, the conditional mean and covariance for β are (Z^T Z)^{-1} Z^T y and (Z^T Z)^{-1} ⊗ R_y. It can be concluded that for the uniform prior case, the full conditional distributions of z_i and β are:

z_i \sim \mathrm{Normal}\big( (\gamma^T R_d^{-1} \gamma)^{-1} \gamma^T R_d^{-1} d_i,\ (\gamma^T R_d^{-1} \gamma)^{-1} \big),    (3.19)

\beta \sim \mathrm{Normal}\big( (Z^T Z)^{-1} Z^T y,\ (Z^T Z)^{-1} \otimes R_y \big).    (3.20)

With these results, the MCMC step of BLVR-S for the uniform prior case can also be implemented with ease. An alternate approach for deriving the full conditional distributions for the uniform prior case, based on the kernel of the joint distribution, is shown in Appendix A.
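Under the uniform prior the draws reduce to the weighted least squares forms above; a minimal sketch of Equations (3.19) and (3.20) follows (again schematic rather than the authors' code, with d_i, γ, R_d, Z and R_y assumed to be given):

```python
import numpy as np

def draw_zi_uniform(di, gamma, Rd, rng):
    """Draw z_i from Equation (3.19): Normal((g'R^-1 g)^-1 g'R^-1 d_i, (g'R^-1 g)^-1)."""
    Rd_inv = np.linalg.inv(Rd)
    cov = np.linalg.inv(gamma.T @ Rd_inv @ gamma)
    mean = cov @ gamma.T @ Rd_inv @ di
    return rng.multivariate_normal(mean, cov)

def draw_beta_uniform(y, Z, Ry, rng):
    """Draw beta from Equation (3.20): Normal((Z'Z)^-1 Z'y, (Z'Z)^-1 Ry)."""
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    mean = ZtZ_inv @ Z.T @ y
    return rng.multivariate_normal(mean, ZtZ_inv * Ry)
```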
3.3 Case Studies and Practical Aspects of BLVR
Since Bayesian methods use more information than traditional methods, it is expected that their results will be more accurate. It is also usually expected that a Bayesian approach will require more computational effort than traditional approaches. As discussed in this chapter and demonstrated via the case studies, sampling-based Bayesian methods such as BLVR-S overcome many of the computational challenges of Bayesian modeling. Clearly, BLVR-S requires more effort than traditional methods, not only in the computations, but also in obtaining prior information. However, the extent of improvement in typical problems and the trade-off between increased accuracy and higher computational costs are not clear and have been the subject of hardly any research. In practice, the extent of the benefits depends on several factors such as the quality of the prior, the signal to noise ratios in the input and output variables, and the quantity of measured data.
This section explores the practical aspects of BLVR via case studies to gain insight into these issues and to determine when applying BLVR may be worth the effort. Although this insight lacks theoretical proof, hopefully it can assist in choosing an appropriate modeling method while avoiding unreasonably high expectations. The case studies illustrate the effect of each of these factors on the modeling by various methods. The importance of an accurate prior is explored by considering informative and non-informative priors. Even with a good prior, when the training data set is large, Bayesian methods may not greatly outperform traditional methods, particularly maximum likelihood methods. This happens mainly because the information in a large amount of measured data is rich enough that the additional knowledge provided by the prior is relatively insignificant. Another finding is that when the SNR of the output variable is much smaller than that of the input variables, that is, when the output is much noisier than the inputs, BLVR tends to perform much better than traditional methods. Case study 1 indicates that what may matter is the ratio of the SNR in the input (SNR_x) and output (SNR_y) variables. When the ratio SNR_x/SNR_y is large, BLVR can provide much better models than traditional methods. It seems that when the output noise is relatively large, the relation between the input and output variables is more difficult to establish based on the data itself, and BLVR has an edge over traditional methods because of the extra information contained in the prior distributions. However, when the input variables are relatively noisy, the prior information is less effective in improving the modeling results. More discussion about the effect of SNR can be found in Section 3.3.1. In summary, BLVR is most appealing when good prior information is available and the measured data set is small, or when the output is very noisy compared with the input variables. In these situations, it seems that the extra effort of applying BLVR could be worthwhile.
3.3.1 Case Study 1: Modeling of Simulated High Dimensional Data
In this simulated study, each data set contains 15 input variables and 1 output variable, and the true rank of the input matrix is 10. All input variables are Gaussian and the output variable is a linear combination of the input variables. Both input and output variables are contaminated by Gaussian measurement noise. The magnitude of the noise is determined by the SNR. Data sets generated by this procedure are used in the following subsections to illustrate the features of BLVR-S and explore the effects of the amount of training data and the SNR on the modeling performance. The loading vectors used for BLVR-S in this case study come from MLPCA. The true rank of 10 of the simulated data is used for all methods in this case study.

Performance Comparison of BLVR-S and BLVR-OPT
The performance of BLVR-S and BLVR-OPT is compared by applying them to model a simulated data set in 50 realizations. There are 200 observations in both the testing and training data sets, and the SNR in the input and output variables is set to 3. A historical prior is used for BLVR. This prior is generated from a historical data set consisting of 600 observations. BLVR with a uniform prior is first applied to model this historical data set, and the modeling results are used to fit the Gaussian prior distributions for the input variables and the regression parameters. Table 3.2 shows the average mean squared error (MSE) for the input and output variables and the mean CPU time over 50 realizations.
               PCR      BLVR-OPT   BLVR-S
Y (training)   0.6877   0.4298     0.3877
Y (testing)    0.7544   0.7270     0.6601
X (training)   0.7200   0.5975     0.5182
X (testing)    0.7100   0.5895     0.5343
CPU time (s)   0.0238   1859.8     108.4

Table 3.2: Mean MSE and CPU time of the high dimensional modeling example over 50 realizations; MSE values are normalized by the variance of the corresponding measurement error.
The MSE is normalized by the variance of the measurement noise in the output. In theory, BLVR-OPT should have nearly identical performance to BLVR-S in terms of estimation accuracy, since they solve the same problem with the same information. In fact, BLVR-OPT allows the latent variable weights α to change with the optimization, while BLVR-S has them fixed to those obtained by MLPCA. However, the scatter plot of the training MSE of the output of BLVR-OPT against that of BLVR-S over the 50 realizations, shown in Figure 3.1 (one outlier is removed in the plot), suggests that BLVR-S tends to have a smaller MSE. Whether this difference between the MSE of the two BLVR methods is statistically significant may be evaluated by a Wilcoxon signed rank test [107] of the hypothesis that MSE_BLVR-S − MSE_BLVR-OPT follows a distribution whose median is zero. The p-value calculated from the 50 realizations is 9.1 × 10^-8, which is very small. Thus this hypothesis can be safely rejected at any reasonable significance level. This indicates that the performance of BLVR-OPT is statistically different from that of BLVR-S. This finding is also indicated by the fact that the mean MSE of BLVR-S is smaller than that of BLVR-OPT in every category. This may be happening because the BLVR-OPT nonlinear optimization problem rarely converges for high dimensional problems.
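The paired comparison described here can be carried out with a standard statistics library; for example (an illustrative sketch with randomly generated placeholder MSE values, not the actual results of the 50 realizations):

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder paired MSE values for 50 realizations (hypothetical numbers).
rng = np.random.default_rng(0)
mse_blvr_s = 0.40 + 0.05 * rng.standard_normal(50)
mse_blvr_opt = mse_blvr_s + 0.03 + 0.02 * rng.standard_normal(50)

# Test whether the median of (MSE_BLVR-S - MSE_BLVR-OPT) is zero.
stat, p_value = wilcoxon(mse_blvr_s - mse_blvr_opt)
print(p_value)   # a small p-value rejects the hypothesis of equal performance
```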
[Figure 3.1: Scatter plot of testing MSE of output of BLVR-OPT against that of BLVR-S, normalized by the variance of measurement error in the output variable.]
[Figure 3.2: Illustration of uncertainty information provided by BLVR-S (posterior density of each regression parameter plotted against the index of the latent variables, with the true parameter values marked).]
This observation is under closer scrutiny in on-going work. As mentioned in Section 3.1, BLVR-S can readily provide uncertainty information about the estimates. Figure 3.2 shows the region of highest posterior density for the regression coefficients of the 10 latent variables obtained by BLVR-S with a uniform prior in one realization. The intensity in each bar is proportional to the marginal posterior density of every regression parameter.
Effect of Size of Training Data Set
The effect of the number of training observations on the performance of BLVR-S has been evaluated by running further simulations. In this case, uniform, historical (generated in the same way as in Section 3.3.1) and true priors are used for the modeling. They are denoted as BLVR(u), BLVR(h) and BLVR(t), and are compared with PCR, PLS and MLPCR. The SNR for the input and output variables is still set to 3. There are 50 realizations in each simulation, and 200 observations in the testing set in all the simulations. However, the number of training observations varies across simulations. The results of these simulations are shown in Table 3.3. The MSE values shown in the table are normalized by the MSE of PCR in each simulation, so that it is easier to compare the relative performance of each method under different conditions. Based on these results, it can be seen that BLVR provides significant improvement over other methods when the number of training observations is small compared with the historical data set. However, when a large number of measurements are available, the benefits of BLVR over traditional methods are much smaller, and it may be best to use traditional techniques. These results seem reasonable because the advantages of BLVR compared with other latent variable methods lie in the incorporation of prior information and the joint handling of measurement noise in the input and output variables. With a large number of measurements, the prior adds relatively little information to what is already in the measurements.

Effect of SNR in Input and Output Variables
As discussed in Section 3.3, the SNR in the input and output variables also affects the performance of BLVR. This effect is studied here in more detail with more simulations
Number of Training Data   PLS      MLPCR    BLVR(u)   BLVR(h)   BLVR(t)   Normalization Factor (MSE of PCR)
200                       1.0090   0.9561   0.9587    0.8939    0.8671    277.7491
100                       1.0884   0.9623   0.9663    0.8014    0.7896    306.5697
50                        1.2237   0.9572   0.9592    0.6415    0.6413    421.8590
25                        1.6583   0.9620   0.9427    0.4835    0.5084    618.3589
10                        1.0000   0.9912   N/A (a)   0.1730    0.1777    2467.5

Table 3.3: Mean testing MSE of Y for different numbers of training data over 50 realizations; MSE values are normalized by the testing MSE of PCR; SNR for X and Y are both 3.
(a) Because of the small number of observations, BLVR with a uniform prior encountered numerical problems and could not provide results in some realizations.
SN Rx SN Ry
tend to have similar normalized
Normalized MSE of Ouput Variable
Sampling−based BLVR, True Prior
1
0.8
0.6
0.4
0.2 4 3
4 3
2
2
1 log3SNRy
1 0
0
log3SNRx
Figure 3.3: Comparison of mean squared prediction error of output variable for BLVRS with true prior under different conditions of SNR, normalized by the mean squared prediction error of output variable for PCR in each case.
To further confirm this observation, Figure 3.4 plots the normalized MSE against SNR_x/SNR_y for the 16 simulations. With a larger SNR_x/SNR_y value, BLVR obtains a smaller relative prediction error. This comes as no surprise because it is just the combined effect of the two trends observed in Tables 3.4 and 3.5. It indicates that BLVR with a Gaussian prior can provide much better prediction results than traditional methods when the output is much noisier than the input variables.
3.3.2 Case Study 2: Modeling Near-Infrared Data of Wheat
This case study focuses on calibration of a near-infrared (NIR) spectra data set of wheat, which was proposed by Kalivas [109] as a standard data set for chemometrics.
SNR of Y   PLS      MLPCR    BLVR(u)   BLVR(h)   BLVR(t)   Normalization Factor (MSE of PCR)
27         0.9868   0.9581   0.9664    0.9201    0.9168    265.1168
9          0.9894   0.9567   0.9602    0.9181    0.9037    257.8655
3          1.0090   0.9561   0.9587    0.8939    0.8671    277.7491
1          1.0677   0.9522   0.9584    0.7798    0.7335    323.3444

Table 3.4: Mean testing MSE of Y for different SNR of Y over 50 realizations; MSE values are normalized by the testing MSE of PCR; there are 200 observations in the training set, and the SNR for X is 3.
SNR of X   PLS      MLPCR    BLVR(u)   BLVR(h)   BLVR(t)   Normalization Factor (MSE of PCR)
27         1.1949   0.9784   0.9772    0.6866    0.5976    53.8212
9          1.0678   0.9633   0.9615    0.8147    0.7573    120.1099
3          1.0090   0.9561   0.9587    0.8939    0.8671    277.7491
1          0.9466   0.9234   0.9428    0.9400    0.8804    594.6353

Table 3.5: Mean testing MSE of Y for different SNR of X over 50 realizations; MSE values are normalized by the testing MSE of PCR; there are 200 observations in the training set, and the SNR for Y is 3.
[Figure 3.4: Comparison of mean squared prediction error of the output variable for BLVR-S with true prior for different SNR_x/SNR_y, normalized by the mean squared prediction error of the output variable for PCR in each case.]
It contains NIR spectra of 100 wheat samples. The diffusion reflectance of the samples is measured from 1100 nm to 2500 nm in 2 nm intervals (Figure 3.5). Following Kalivas [109], the reflectance at every 5th wavelength is used as an input variable, and the samples are divided into one calibration (training) data set and two validation (testing) data sets. Not all samples are used for this study: there are 50 samples in the calibration set and 20 samples in each of the validation sets, as partitioned by Kalivas. The NIR data are used to predict two output variables, the moisture (Y1) and protein (Y2) content of the wheat samples. Measurements of moisture and protein were obtained by reference methods. PCR, PLS, MLPCR and BLVR-S are applied to develop the calibration models for moisture and protein. The loading vectors for BLVR come from PLS because it is widely used for this type of calibration problem and it provides relatively good predictions for both Y1 and Y2, as shown in Tables 3.6 and 3.7. The mean and standard deviation of the input and output of the training data (X_trn, Y_trn) are calculated, and the training data set is normalized by mean-centering and then scaling to unit variance. The latent variable modeling methods are applied to the normalized data (X'_trn, Y'_trn) to estimate the model parameters. As for the testing input data (X_tst),
they are first subtracted by the mean of training input then divided by the standard deviation of the training input. In this way, both training and testing input data are processed following the same procedure. The preprocessed input variables of testing ′ data (Xtst ) are then combined with the estimated model parameters to provide the ′ estimate of preprocessed testing output (Ytst ). The estimate of original testing output ′ (Ytst ) is then calculated by reversing the processing procedure using the estimated Ytst
and the mean and standard deviation of Ytrn . The ten-fold cross validation approach 68
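The preprocessing just described might be implemented as in the following minimal NumPy sketch; it is not part of the original toolbox, and the array names Xtrn, Ytrn and Xtst are hypothetical placeholders.

```python
import numpy as np

def autoscale_train_test(Xtrn, Ytrn, Xtst):
    """Mean-center and scale the training data to unit variance, and apply
    the same training statistics to the testing inputs."""
    x_mean, x_std = Xtrn.mean(axis=0), Xtrn.std(axis=0, ddof=1)
    y_mean, y_std = Ytrn.mean(axis=0), Ytrn.std(axis=0, ddof=1)
    Xtrn_s = (Xtrn - x_mean) / x_std          # X'trn
    Ytrn_s = (Ytrn - y_mean) / y_std          # Y'trn
    Xtst_s = (Xtst - x_mean) / x_std          # X'tst, uses training statistics
    return Xtrn_s, Ytrn_s, Xtst_s, (y_mean, y_std)

def unscale_prediction(Ytst_s_hat, y_stats):
    """Reverse the output scaling to recover predictions in original units."""
    y_mean, y_std = y_stats
    return Ytst_s_hat * y_std + y_mean
```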
The ten-fold cross-validation approach is used to determine the rank for those latent variable methods, and the same modeling procedure described in the previous paragraph is used within cross validation. Each time, 90% of the data are used for training and the remaining 10% are used as the testing set. The rank that gives the smallest mean squared prediction error is chosen as the optimal rank for calibrating the whole data set. Since BLVR requires the measurement noise for the input and output variables, the magnitude of the noise must be properly assessed. Unfortunately, the measurement noise is not reported in the paper by Kalivas, so additional effort is needed to obtain this information. In the book edited by Williams and Norris [110], a noise spectrum of a spectrophotometer is provided. It suggests that the standard deviation of the noise is 1.62 × 10^-4 and that the noise magnitude at the two ends of the spectrum is higher than in the middle. Based on this information, the variance of the measurement noise in the wheat data set is assumed to be piecewise linear over the spectrum, as shown in Figure 3.6, with the standard deviation of the noise over the spectrum equal to 1.62 × 10^-4. In the same book [110], the measurement error of the protein content is also discussed; based on that discussion, 0.06% is chosen as the standard deviation of the measurement error of protein in this data set. In another paper, Centner et al. [111] point out that the measurement precision of moisture is usually better than 0.2%, which makes 0.07% a reasonable guess for the standard deviation of its measurement error. Setting the prior distributions to be uniform (noninformative) is the simplest choice. Gaussian empirical priors can then be fitted based on the results of the uniform prior case. These two types of priors do not require any information beyond the data set itself, and the improvement over traditional methods is very limited. A fair comparison is between BLVR and PLS, since they use the same loadings in this experiment.
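The ten-fold cross-validation rank selection described at the start of this paragraph can be sketched as follows; the callable fit_predict(Xtrn, ytrn, Xtst, rank) is a hypothetical wrapper around any of the latent variable methods (including the preprocessing above), not part of the original work.

```python
import numpy as np

def select_rank_cv(X, y, fit_predict, max_rank, n_folds=10, seed=0):
    """Pick the rank with the smallest mean squared prediction error
    over a 10-fold cross validation."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_folds)
    mse = np.zeros(max_rank)
    for rank in range(1, max_rank + 1):
        errs = []
        for k in range(n_folds):
            tst = folds[k]
            trn = np.hstack([folds[j] for j in range(n_folds) if j != k])
            y_hat = fit_predict(X[trn], y[trn], X[tst], rank)
            errs.append(np.mean((y[tst] - y_hat) ** 2))
        mse[rank - 1] = np.mean(errs)
    return int(np.argmin(mse)) + 1   # optimal rank
```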
[Figure 3.5 plots the diffuse reflectance, log(1/R), of the wheat samples against wavelength (nm).]

Figure 3.5: Wheat spectra data.
[Figure 3.6 plots the assumed variance of the measurement noise (on the order of 10^-8) against wavelength (nm).]

Figure 3.6: Assumed variance of measurement noise in spectrum.
As shown in Tables 3.6 and 3.7, BLVR with a uniform prior (denoted BLVR(u)) performs slightly worse than PLS in predicting both moisture and protein. BLVR with a Gaussian empirical prior (denoted BLVR(e)) does better than PLS in predicting moisture but worse for protein. Hence, little is gained by applying BLVR(u) or BLVR(e). A more informative prior can be obtained by using extra information about the spectrum. Because the data are normalized prior to modeling, the mean of the prior distribution of the input variables is set to zero. The covariance matrix of the input variables minus the covariance matrix of the measurement noise can be used as the covariance matrix of the prior distribution of the input variables. Since not all wavelengths (input variables) are useful for predicting moisture and protein, the regression coefficients of unrelated or less important wavelengths should be close to zero. This insight underlies coefficient shrinkage methods such as Ridge Regression [91]: although the resulting estimates are biased, they have smaller variance, so the mean squared error of estimation can be reduced. In BLVR, coefficient shrinkage can be achieved by setting a zero mean for the prior distribution of the regression coefficients. For this case study, it is well known that water (the hydroxyl O-H group) has specific absorption bands at 1940 nm and 1440 nm [112]. Hence, the variance of the prior on the regression coefficients of wavelengths close to 1940 nm and 1440 nm is set to be much larger than for other wavelengths. By doing this, the input variables corresponding to those bands are more likely to have regression coefficients further away from zero than the others. In this experiment, the variances of the regression coefficients at wavelengths 1420, 1430, 1440, 1450, 1460, 1920, 1930, 1940, 1950 and 1960 nm are set to a, while
the others are set to a^-1, where a > 1. As for protein, the C-H, O-H and N-H groups may all affect the spectrum, so a good prior distribution for the regression coefficients based on expert knowledge is more difficult to obtain. Instead, a semi-empirical approach is used for protein. The results of the uniform prior case give some hints about the relevance of each wavelength to the prediction of the protein concentration. The relevance of a wavelength is measured by the absolute value of its regression coefficient from the uniform prior case. All wavelengths are sorted by the absolute value of their regression coefficients for protein from BLVR(u) in descending order. The top 10% of wavelengths are considered relevant to the prediction of protein and are given a Gaussian prior distribution with variance a. The other wavelengths are considered irrelevant, and their regression coefficients have a Gaussian prior with variance a^-1. The mean of the prior distribution of all regression coefficients is still set to zero. In this study, the parameter a takes two values, 100 and 10. Tables 3.6 and 3.7 show the MSE and optimal ranks of the methods for Y1 and Y2. The MSE is calculated by comparing the estimated output variables with the measurements of the output variables in the original data set. BLVR with the uniform, empirical and more informative priors are denoted BLVR(u), BLVR(e) and BLVR(i), respectively. For BLVR(i) there are two cases: when a = 100 it is denoted BLVR(i)-1, and when a = 10 it is denoted BLVR(i)-2. For Y1, BLVR(i) performs much better than the other methods, and the two cases of BLVR(i) perform similarly. For Y2, PLS performs very well; BLVR(i)-1 still shows a slight improvement over PLS, but BLVR(i)-2 performs worse. The performance of BLVR(i) is also more sensitive to the value of the parameter a when predicting protein than when predicting
            Rank   MSE of Validation Set 1   MSE of Validation Set 2   MSE of All
PCR          23    0.0667                    0.1322                    0.0995
PLS          12    0.0689                    0.1195                    0.0942
MLPCR        16    0.0560                    0.1106                    0.0833
BLVR(u)      12    0.0729                    0.1231                    0.0980
BLVR(e)      10    0.0434                    0.1003                    0.0718
BLVR(i)-1    11    0.0371                    0.0817                    0.0594
BLVR(i)-2    11    0.0370                    0.0801                    0.0586

Table 3.6: Validation MSE of Y1 (moisture) and corresponding ranks.
            Rank   MSE of Validation Set 1   MSE of Validation Set 2   MSE of All
PCR          24    0.1850                    0.1401                    0.1625
PLS          15    0.1503                    0.1350                    0.1427
MLPCR        24    0.1947                    0.1371                    0.1659
BLVR(u)      15    0.1542                    0.1418                    0.1480
BLVR(e)      14    0.1728                    0.1442                    0.1585
BLVR(i)-1    16    0.1621                    0.1177                    0.1399
BLVR(i)-2    11    0.2071                    0.1108                    0.1589

Table 3.7: Validation MSE of Y2 (protein) and corresponding ranks.
moisture. The different performance of BLVR(i) for the prediction of Y1 and Y2 may arise from the difference in the quality of the priors. Since the prior for the regression coefficients of Y1 comes from expert knowledge, it is much more reliable than the semi-empirical prior for the regression coefficients of Y2.
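The construction of the informative prior for the regression coefficients described above can be sketched as follows. This is an illustrative NumPy sketch, not the original implementation; the helper names and the ±20 nm window around the water bands (which covers the ten wavelengths listed in the text for a 10 nm grid) are assumptions.

```python
import numpy as np

def coefficient_prior_moisture(wavelengths, a=100.0, window=20.0):
    """Zero-mean Gaussian prior for the regression coefficients: variance a for
    wavelengths near the water bands (1440 and 1940 nm), a**-1 elsewhere."""
    near_band = (np.abs(wavelengths - 1440.0) <= window) | \
                (np.abs(wavelengths - 1940.0) <= window)
    prior_var = np.where(near_band, a, 1.0 / a)
    return np.zeros(wavelengths.size), np.diag(prior_var)

def coefficient_prior_protein(b_uniform, a=100.0, top_fraction=0.1):
    """Semi-empirical prior: the top 10% of wavelengths ranked by |coefficient|
    from the uniform-prior run get variance a, the rest a**-1."""
    n_keep = max(1, int(np.ceil(top_fraction * b_uniform.size)))
    relevant = np.argsort(-np.abs(b_uniform))[:n_keep]
    prior_var = np.full(b_uniform.size, 1.0 / a)
    prior_var[relevant] = a
    return np.zeros(b_uniform.size), np.diag(prior_var)
```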
3.4 Conclusions and Future Work
This chapter shows that the proposed BLVR-S method can combine prior information and the data likelihood to provide better-quality point and uncertainty estimates with an affordable computational load, even for high-dimensional problems. This approach improves upon BLVR-OPT [85] in its computational efficiency and the ease of obtaining uncertainty information. The sampling-based approach is also expected to be readily generalizable to non-Gaussian situations. The case studies in this chapter show that, with good prior information, BLVR can compensate for a lack of data. In fact, if the amount of training data is large, the benefit of using BLVR diminishes, possibly because the prior then contributes relatively little additional information. Empirical studies also indicate that BLVR performs much better than traditional methods when the output is much noisier than the input variables. Some of the challenges facing the use of BLVR include the need for prior and likelihood information. The second case study demonstrates that, with a little effort, obtaining prior knowledge from existing sources is certainly possible. Methods for eliciting such information from domain experts are also available [113, 114]. Analysis of the sensitivity of the model to available or elicited information via robust Bayes techniques may also be done.
Challenges of using MCMC include the difficulty in detecting the convergence of the Markov Chain to the targeted distribution, especially for high dimensional data. The mean of the samples is easy to calculate, however, obtaining the posterior mode can be a computationally challenging task. These challenges may be tackled with more theoretical progress and the development of practical methods. The approach in this chapter only considers Gaussian or uniform priors due to their computational convenience. Even though Gibbs sampling can be used with other types of prior distributions, they may lead to conditional distributions that are not easy to sample from. This problem is handled by a more sophisticated method described in Chapter 6. With additional research, it is expected that many of these challenges may be overcome to result in methods that continue to bring the benefits of Bayesian methods for solving chemometric and engineering problems.
CHAPTER 4
MODELING HYBRID DATA SETS WITH BLVR-S
4.1 Introduction
Hybrid data sets, containing both continuous and discrete variables, are increasingly common in industry and the laboratory. In those data sets, discrete variables usually represent qualitative characteristics of observations, while continuous variables often represent measurements of physical or chemical properties. For example, in the study of Quantitative Structure Activity Relationships (QSAR) [115], some of the discrete variables are coded as 1 or 0 to represent the presence or absence of molecular fingerprints, while some of the continuous variables represent infrared spectral properties of molecules [116]. In industrial high throughput screening data sets, many discrete variables come from the design of experiments and many continuous variables come from experimental measurements. Modeling these data sets is not an easy task, due to factors such as high dimensionality, collinearity and noise in the measurements. Figure 4.1 shows an illustration of a typical hybrid data set. A review [117] shows that a variety of approaches have been applied to analyze high throughput screening data sets. Many of these approaches are kernel-based
[Figure 4.1 shows a block C of continuous measurements side by side with a block D of discrete variables.]

Figure 4.1: Illustration of a hybrid data set.
methods. For example, the Support Vector Machine (SVM) [118] is used for classification of active and inactive chemical compounds in the drug discovery process [119], and Support Vector Regression (SVR) [118] is used for the prediction of protein retention times in anion-exchange chromatography systems [120]. However, good performance of kernel methods relies on fine tuning [117], and models developed by these methods are not easy to explain due to their nonlinear nature. Multivariate regression is also a popular approach for modeling QSAR data sets [121, 122, 123]. Since there are many predictors in the data sets, variable selection procedures have to be used along with multivariate regression. In some other methods [124, 125, 126], PCA or Singular Value Decomposition (SVD) is applied to reduce the dimensionality of the predictor space instead of doing variable selection. Unfortunately, among so many methods, few are developed within a probabilistic framework. One of the exceptions is the so-called Binary QSAR [124, 126] approach, which uses Bayes rule to classify a binary response, but it is not based on regression and cannot be adapted for the prediction of a continuous response. Even though high throughput screening data are inherently hybrid, virtually none of the methods used for analyzing such data have addressed this issue. Most methods do not distinguish between continuous and discrete variables in modeling, but in the broader statistics community some methods have been developed which treat hybrid data with care. Racine and Li [127] used different kernels for continuous and discrete variables and combined them in nonparametric regression models. Kuha [128] developed a hierarchical model to incorporate noisy continuous and discrete variables in a regression model and solved the problem by data augmentation [129]. However, none of these methods is a latent variable method, which is very commonly used in
modeling high dimensional data sets. Also, none of them is Bayesian, so no prior information can be incorporated in modeling. The BLVR-S approach described in Chapter 3 can handle collinearity of variables, deal with measurement noise in both input and output variables, and incorporate prior knowledge by applying Bayes rule. It is also efficient for modeling high dimensional data sets and can easily provide uncertainty information about model parameters. These features make BLVR-S a promising approach for modeling high throughput screening data. However, in BLVR-S all the variables are assumed to be continuous and noisy, which need not hold for many high throughput screening data sets. These data sets usually contain both continuous and discrete variables, and since the discrete variables are often from designed experiments, they are likely to be noise free and deterministic. Because of the presence of discrete variables, BLVR-S should not be directly applied to model those high throughput screening data sets. To make BLVR-S applicable to high dimensional hybrid data sets, a modeling procedure is developed and described in this chapter. By applying this procedure, the continuous and discrete variables are modeled separately and the two sub-models are combined to predict the output variable. The continuous variables can be modeled by BLVR-S to incorporate prior information; the discrete variables can be modeled by other appropriate methods. The correlation between the continuous part and the discrete part is handled by iterating the two modeling steps. Currently, this method can only be used for predicting a continuous output variable; adapting BLVR-S to model hybrid data for classification is a subject of future work.
The rest of this chapter is organized as follows. In the next section, the iterative modeling procedure for hybrid data is described. It is followed by a variety of case studies, whose results are discussed. Conclusions are drawn in the last section.
4.2 Method
As mentioned in Section 4.1, BLVR-S assumes that the noisy input variables are continuous, which makes it unsuitable for modeling hybrid data sets. The ideal solution is to make appropriate assumptions for all variables and obtain the estimates from the posterior distribution of both continuous and discrete variables. Using C (n × m_C) to denote the continuous input variables and D (n × m_D) to denote the discrete input variables, i.e., X = [C D], this problem can be formulated as:

{α̂, Ẑ, β̂}_BLVR = arg max_{α,Z,β} P(C, D | α, Z) P(y | Z, β) P(α, Z, β),    (4.1)

s.t. α^T α = I.

The different assumptions made for continuous and discrete variables make this problem much more complicated and difficult to solve. Alternatively, assume that the continuous and discrete variables have separate additive effects on the output, so that the continuous variables can be modeled by BLVR-S and the discrete variables by other methods with more appropriate assumptions; the two parts are then combined into the full model. The output variable can be expressed as:

y = ỹ_C + ỹ_D + ǫ,    (4.2)

where ỹ_C (n × 1) and ỹ_D (n × 1) denote the effects of the continuous and discrete variables on the output respectively, and ǫ (n × 1) is the measurement noise vector. Moreover, assume that:

ǫ = ǫ_yC + ǫ_yD,    (4.3)

where ǫ_yC (n × 1) and ǫ_yD (n × 1) are two independent measurement noise vectors for the continuous part and discrete part models:

ǫ_yC ∼ Normal(0, R_yC ⊗ I),    ǫ_yD ∼ Normal(0, R_yD ⊗ I).    (4.4)
Given these assumptions, the detailed modeling procedure is described in the following subsections.
4.2.1 Modeling the Continuous Part Data
Suppose one could observe the noisy output of the continuous part, y_C = ỹ_C + ǫ_yC, with:

y_C ∼ Normal(ỹ_C, R_yC ⊗ I),    (4.5)

where ỹ_C is modeled by the latent variables Z_C (n × p_C) and the regression parameter vector β_C (p_C × 1) of the continuous part as:

ỹ_C = Z_C β_C.    (4.6)

Meanwhile, assume that the continuous input variables are contaminated by Gaussian measurement noise with covariance matrix R_C, so that the i-th observation of the continuous inputs, c_i = C(i,:)^T, is also normally distributed:

c_i ∼ Normal(c̃_i, R_C),    (4.7)

where c̃_i = C̃(i,:)^T, and it is modeled by Z_C and the orthogonal loading vectors α_C (m_C × p_C) of the continuous part as:

C̃ = Z_C α_C^T.    (4.8)
These assumptions for the continuous part data are the same as those made in BLVR-S; therefore, α_C, β_C and Z_C can be estimated by BLVR-S. The priors of β_C and Z_C can be either Gaussian or uniform.
4.2.2 Modeling the Discrete Part Data
As for the discrete part model, suppose again that one could observe the noisy output attributed to the discrete input variables, y_D = ỹ_D + ǫ_yD. Since the discrete variables are assumed to be noise free and deterministic, when there is no need for dimension reduction the noise-free output of the discrete part can simply be modeled as:

ỹ_D = D b_D,    (4.9)

and y_D is normally distributed:

y_D ∼ Normal(ỹ_D, R_yD ⊗ I).    (4.10)
The regression parameter vector b_D (m_D × 1) can easily be estimated by OLS. As discussed in Section 2.2.1, from a Bayesian perspective the OLS estimate is the Bayesian estimate with a uniform prior distribution for b_D. However, when prior information about b_D is available, a more informative prior should be used instead of the uniform prior. In that case, a Gaussian prior can usually be assumed for b_D, and the information contained in this prior distribution should be incorporated through Bayesian regression for the discrete part. Assume that:

b_D ∼ Normal(µ_bD, Q_bD).    (4.11)

The Gaussian likelihood function and the Gaussian prior distribution of b_D result in a Gaussian posterior distribution. With a 0-1 loss function, the Bayesian estimate of b_D is its posterior mode:

b̂_D = µ_bD + Q_bD D^T (D Q_bD D^T + R_yD ⊗ I)^{-1} (y_D − D µ_bD).    (4.12)
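A direct numerical transcription of Equation (4.12) is sketched below, assuming NumPy arrays for D and y_D, a prior mean and covariance for b_D, and a scalar output-noise variance R_yD; the function name is hypothetical.

```python
import numpy as np

def bayes_regression_discrete(D, yD, mu_b, Q_b, Ry):
    """Posterior-mode estimate of b_D (Equation 4.12) for a Gaussian prior
    Normal(mu_b, Q_b) and Gaussian output noise with variance Ry."""
    n = D.shape[0]
    S = D @ Q_b @ D.T + Ry * np.eye(n)      # D Q_bD D^T + R_yD (x) I
    resid = yD - D @ mu_b
    return mu_b + Q_b @ D.T @ np.linalg.solve(S, resid)
```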
When D is singular, or the number of observations n is not large enough compared with the number of discrete variables m_D, dimension reduction is necessary to avoid overfitting. PCA can first be applied to D to find the score vectors Z_D (n × p_D) and loading vectors α_D (m_D × p_D). Similar to the Gaussian mean in Equation (3.12), for a 0-1 loss function the Bayesian estimate of the regression parameter vector β_D of Z_D is:

β̂_D = α_D^T µ_bD + α_D^T Q_bD α_D Z_D^T (Z_D α_D^T Q_bD α_D Z_D^T + R_yD ⊗ I)^{-1} (y_D − Z_D α_D^T µ_bD).    (4.13)

Since Z_D is calculated as

Z_D = D α_D,    (4.14)

the Bayesian estimate of b_D is therefore:

b̂_D = α_D [ α_D^T µ_bD + α_D^T Q_bD α_D Z_D^T (Z_D α_D^T Q_bD α_D Z_D^T + R_yD ⊗ I)^{-1} (y_D − Z_D α_D^T µ_bD) ].    (4.15)
4.2.3 The Iteration Procedure
By separately modeling the continuous and discrete variables as described in Sections 4.2.1 and 4.2.2, the overall estimate of the noise-free output variable can be obtained:

ŷ = ŷ_C + ŷ_D = Ẑ_C β̂_C + D b̂_D.    (4.16)
The assumption that y_C and y_D can be observed is made to derive the two sub-models, since with this assumption fitting the two models for the continuous and discrete variables is quite easy. In practice, however, y_C and y_D can only be estimated after the models have been fitted. It is tempting to use y as both y_C and y_D for building the two sub-models. This strategy works when y_C and y_D lie in two spaces that are orthogonal to each other. Unfortunately, since C and D are usually non-orthogonal, this simplification leads to two overlapping sub-models and, overall, a redundant model for estimating the output variable. Hence, modeling the two parts separately must be done carefully. To solve this problem, an iterative modeling procedure is developed. In this procedure, the original output variable y is used as the initial guess for y_D; by applying the modeling approach described in Section 4.2.2, ŷ_D can be estimated. Then y − ŷ_D is taken as the initial guess for y_C; by applying BLVR-S, ŷ_C can be estimated, and the new y_D is calculated as y − ŷ_C. By repeating these steps until convergence, the final estimates of y_C, y_D and the model parameters are found. In this iteration procedure, the residual of the output is used as the new response in the next modeling step. Similar iteration approaches are widely applied in many other existing algorithms, such as Nonlinear Iterative Partial Least Squares (NIPALS) [130] and Nonlinear Continuum Regression (NLCR) [131]. This iterative procedure ensures that, in the final model, the contributions from the two sub-models are complementary and non-overlapping. Hence the two sub-models are additive, and the output variable can be estimated based on them. The detailed algorithm is shown in Table 4.1. One problem with this procedure is that R_yC and R_yD are usually unknown in practice. A simplification is to use the measurement noise variance of the output variable as an approximation of both R_yC and R_yD. The uncertainty of
Model: ŷ = Ẑ_C β̂_C + D b̂_D

1. Fit y on D via a Bayesian regression to get b̂_D. (ŷ_D = D b̂_D)
2. Fit y − ŷ_D on C by BLVR-S to get α̂_C, β̂_C and Ẑ_C. (ŷ_C = Ẑ_C β̂_C)
3. Until convergence,
   (a) Fit y − ŷ_C on D by Bayesian regression to get b̂_D and ŷ_D.
   (b) Fit y − ŷ_D on C by BLVR-S to get α̂_C, β̂_C, Ẑ_C and ŷ_C.
   End While

b̂_D, α̂_C, β̂_C and Ẑ_C are the estimated model parameters.

Table 4.1: Procedure for modeling hybrid data.
R_yC can be handled by the extended BLVR-S approach described in Chapter 5; see Section 5.3.2 for details. This procedure can effectively fit the two sub-models, and the estimated model parameters can be used directly for predicting the output of new observations. The BLVR-S step in this procedure can also be replaced by other modeling methods.
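The iteration of Table 4.1 can be sketched as follows. The callables fit_continuous (standing in for BLVR-S) and fit_discrete (OLS or the Bayesian regression of Section 4.2.2), together with their .predict interface, are hypothetical placeholders rather than the original code.

```python
import numpy as np

def fit_hybrid(C, D, y, fit_continuous, fit_discrete, n_iter=20, tol=1e-6):
    """Alternately fit the discrete part on the residual of the continuous part
    and vice versa, following the procedure of Table 4.1."""
    y_hat_C = np.zeros_like(y, dtype=float)
    y_hat_D = np.zeros_like(y, dtype=float)
    for _ in range(n_iter):
        # (a) fit the discrete part on the current residual y - y_hat_C
        model_D = fit_discrete(D, y - y_hat_C)
        y_hat_D_new = model_D.predict(D)
        # (b) fit the continuous part (e.g., BLVR-S) on y - y_hat_D
        model_C = fit_continuous(C, y - y_hat_D_new)
        y_hat_C_new = model_C.predict(C)
        converged = (np.max(np.abs(y_hat_C_new - y_hat_C)) < tol and
                     np.max(np.abs(y_hat_D_new - y_hat_D)) < tol)
        y_hat_C, y_hat_D = y_hat_C_new, y_hat_D_new
        if converged:
            break
    return model_C, model_D, y_hat_C + y_hat_D   # additive overall estimate
```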
4.3 Experiments

4.3.1 Simulated Hybrid Data
In this example, sampling-based BLVR is applied to model a simulated high dimensional hybrid data set. The data set contains 10 continuous input variables, 5 discrete input variables, 1 output variable and 100 observations. The first 5 continuous variables (C(:, 1:5)) are generated from a standard normal distribution, and the next 5 continuous variables (C(:, 6:10)) are generated such that C(:, 6:10) = C(:, 1:5)L, where L is a random 5 × 5 matrix. Thus, the next 5 continuous variables are linear combinations of the first 5; hence all of them are Gaussian and the true rank of C is 5. The discrete input variables (D) are from Bernoulli distributions. Two cases are considered for the parameter p of the Bernoulli distributions: p is set to either 0.9 or 0.5. When p = 0.9, the distributions of the discrete variables are highly skewed and the Gaussian assumption for the prior is very inappropriate. The noise-free output variable y is generated as y = [C D]b, where b is a random 15 × 1 vector. The continuous variables and the output variable are contaminated by white noise, while the discrete variables are noise free. The signal to noise ratio (SNR) is set to 16 or 9 for each value of p; thus, overall there are 4 cases in the simulation setup. BLVR-S with a uniform prior and with the true Gaussian prior (the true prior is used only for the continuous variables, and a Gaussian prior with a relatively large variance is fitted for the discrete variables based on the true Bernoulli distribution) is applied to model the continuous variables. It is also compared with PCR and MLPCR. To illustrate how the modeling procedure described in Section 4.2.3 affects the results, each method has two variations: without this procedure (denoted PCR, MLPCR and BLVR) and with this procedure (denoted PCR2, MLPCR2 and BLVR2). For BLVR2 with the Gaussian prior, the Bayesian regression is applied when modeling the discrete variables; for the other methods, OLS is applied. The true rank is used for all methods. Table 4.2 shows the average of the normalized MSE over 100 realizations for each case; the MSE of each method is normalized by the MSE of PCR in every realization.
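A sketch of this simulation setup is given below, assuming the SNR is defined as the ratio of signal variance to noise variance; the seed and function name are arbitrary choices, not specifications from the original work.

```python
import numpy as np

def simulate_hybrid_data(n=100, p_bernoulli=0.9, snr=16.0, seed=0):
    """Simulated hybrid data: 10 continuous inputs of true rank 5,
    5 Bernoulli inputs, and a noisy linear output."""
    rng = np.random.default_rng(seed)
    C_base = rng.standard_normal((n, 5))
    L = rng.standard_normal((5, 5))
    C_clean = np.hstack([C_base, C_base @ L])        # rank-5 continuous block
    D = rng.binomial(1, p_bernoulli, size=(n, 5)).astype(float)
    b = rng.standard_normal(15)
    y_clean = np.hstack([C_clean, D]) @ b

    def add_noise(signal):
        # white noise scaled so that var(signal) / var(noise) = snr
        sigma = np.sqrt(signal.var(axis=0) / snr)
        return signal + rng.standard_normal(signal.shape) * sigma

    return add_noise(C_clean), D, add_noise(y_clean)
```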
                 PCR    PCR2     MLPCR    MLPCR2   BLVR(u)  BLVR2(u)  BLVR(G)  BLVR2(G)
p = 0.9, SNR = 16
Y (training)      1     1.0423   0.9428   0.9430   0.8988   0.8468    1.6122   0.5678
Y (testing)       1     1.0644   0.9653   0.9662   1.0365   0.9674    2.5573   0.5739
p = 0.5, SNR = 16
Y (training)      1     1.0053   0.9069   0.9067   0.8116   0.8114    0.6438   0.5622
Y (testing)       1     1.0152   0.9319   0.9318   0.9325   0.9327    0.6058   0.5755
p = 0.9, SNR = 9
Y (training)      1     1.0958   0.9992   0.9988   1.0054   0.9042    3.0243   0.6135
Y (testing)       1     1.1456   1.0320   1.0322   1.2016   1.0367    5.1045   0.6151
p = 0.5, SNR = 9
Y (training)      1     1.0224   0.9410   0.9407   0.8706   0.8525    0.9598   0.5760
Y (testing)       1     1.0588   0.9558   0.9562   0.9859   0.9561    1.3225   0.5842

(BLVR(u)/BLVR2(u): uniform prior; BLVR(G)/BLVR2(G): true Gaussian prior.)

Table 4.2: Average of MSE normalized by the MSE of PCR in each realization over 100 realizations for the simulated high dimensional hybrid data set.
Based on these results, the two variations of each method, except BLVR with the true Gaussian prior, perform similarly in each case. When p = 0.9, BLVR with the Gaussian prior performs much worse than BLVR2, because the Gaussian prior assumption for the discrete variables is far from the truth. When p = 0.5, BLVR and BLVR2 with the true Gaussian prior perform similarly, but with a smaller signal to noise ratio (larger measurement noise) BLVR2 again outperforms BLVR significantly. In general, BLVR2 has a more stable performance than BLVR, and with a better-quality prior for the continuous variables the prediction error decreases.
4.3.2 Simulated High Throughput Screening Data
Another high dimensional hybrid data set is simulated to emulate a real industrial high throughput screening data set. It contains 15 continuous input variables, 15 discrete input variables, 1 output variable, and 100 observations each in the training and testing sets. The first five discrete variables are similar to design-of-experiment variables: their values are either 0 or 1, and for each observation only one of the five variables equals 1. The next five discrete variables are generated in the same way. The other discrete variables are from independent binomial distributions with different parameters. Five of the continuous variables are from normal distributions, two are from different uniform distributions, and the others are linear combinations of these variables. The true regression parameters for generating the noise-free output are designed so that some arbitrarily chosen input variables have a large effect on the output, while some have no effect. The continuous variables and the output variable are contaminated by white noise of different magnitudes, while the discrete variables are noise free. The methods used in Section 4.3.1 are all applied to model this simulated high throughput screening data set. The true rank is used for all methods. A historical data set with 300 observations is used to generate Gaussian priors for BLVR and BLVR2. Table 4.3 shows the average of the normalized MSE over 100 realizations. The trend in the results is more complicated than in Section 4.3.1. PCR, MLPCR and BLVR with a uniform prior perform better than PCR2, MLPCR2 and BLVR2 with a uniform prior, respectively. This implies that applying the iterative modeling procedure does not benefit PCR, MLPCR or BLVR with a uniform prior in this case. However, BLVR2 with the Gaussian historical prior outperforms BLVR with the Gaussian historical prior. This implies that the assumptions made in BLVR hurt the quality of modeling in the presence of discrete variables in this data set. In fact, the performance of BLVR with the historical prior is the worst, while BLVR2 with the historical prior is the best among all the methods.
                 PCR    PCR2     MLPCR    MLPCR2   BLVR(u)  BLVR2(u)  BLVR(G)  BLVR2(G)
Y (training)      1     1.5681   0.9350   1.1321   0.9691   1.1139    3.3737   0.4859
Y (testing)       1     1.7159   0.9487   1.2032   1.0227   1.2086    5.0050   0.4054

(BLVR(u)/BLVR2(u): uniform prior; BLVR(G)/BLVR2(G): Gaussian historical prior.)

Table 4.3: Average of MSE normalized by the MSE of PCR in each realization over 100 realizations for the simulated high throughput screening data set.
4.4 Conclusions and Future Work
From the above case studies, it can be concluded that, by applying the iterative modeling procedure, BLVR-S can be used to model high dimensional hybrid data sets. By combining high quality prior information with the likelihood of the data, BLVR-S outperforms the other traditional latent variable modeling methods. The iterative modeling procedure for hybrid data successfully avoids the problem of making inappropriate assumptions for the discrete variables while still allowing BLVR to be used for the continuous variables. When a Gaussian prior is used for BLVR and the true distribution of the discrete variables is highly skewed, using this procedure can improve the results significantly. Sampling-based BLVR can easily be replaced by other modeling methods in this procedure. Since most latent variable modeling methods are developed for continuous variables, they are not suitable for discrete variables; however, by applying this iterative procedure, those methods can still be used without major modification. In the experiments described in this chapter, PCA is applied for dimension reduction of the discrete variables. As is well known, the principal components are obtained from the eigen-analysis of the covariance matrix, which is not necessarily the optimal choice for discrete variables. Novel covariance matrices
can be defined for discrete variables, and latent variables can be obtained from the eigen-analysis of those covariance matrices; this is how Multiple Correspondence Analysis (MCA) [132] works. In this modeling procedure, the discrete variables are treated as noise free; if the discrete variables also contain measurement noise, such as mislabeling of experimental conditions or miscounting of molecular fingerprints, then this approach is not applicable and a more sophisticated method is needed. In this chapter, the Gaussian or uniform prior assumptions are still made for BLVR-S; these can be relaxed with the method described in Chapter 6.
CHAPTER 5
BLVR-S WITH NONINFORMATIVE PRIORS FOR PARAMETERS IN LIKELIHOOD FUNCTIONS
5.1 Introduction
As discussed in Chapter 1, one hurdle that discourages people from applying Bayesian methods is the requirement of knowing information about likelihood functions and prior distributions. Although this information can be elicited and quantified through various approaches [113, 114], it is still often unavailable to modelers or available only in very vague forms. As for BLVR-S, whether a uniform or Gaussian prior is assumed, the parameters of the likelihood functions have to be known prior to modeling. Those likelihood functions are determined by the distributions of the measurement noise. For simplicity, Gaussian assumptions are made for the measurement noise in BLVR-S: the means are assumed to be zero, and the covariance matrix has to be known prior to modeling. The noise variances can be elicited from many sources, such as instrument specifications, historical data, the literature, etc. However, it is not uncommon for modelers to have difficulty getting access to such information. In that case, it may be unwise to assign a guessed number as the noise variance of a variable because of the high uncertainty associated with that information, and improper likelihood functions can impair rather than improve the quality of modeling. This is not only a problem for applying BLVR-S; in fact, it is a challenge for many modeling methods which require information about the variances of measurement noise, such as MLPCA. Leger et al. [133] proposed to estimate the error covariance matrices from repeated measurements and use the estimated covariance matrices in MLPCA. This method is applicable only when enough replicated measurements are available, which is often not the case in practice, especially for dynamic systems. In addition, duplicated measurements require much larger experimental costs, so this approach can be economically infeasible. Obtaining the error covariance matrices is also a problem in the data reconciliation field [134]. The error covariance matrices are usually estimated either directly from the measurements, in a manner similar to the method used by Leger et al. [133], or indirectly by incorporating additional process information. Almasy and Mah [135] estimated the error covariance matrices from the constraint residuals of process data. Based on their work, Darouach et al. [136] used a maximum likelihood estimator to estimate the error variances by solving a nonlinear optimization problem that couples the estimation problem with the data reconciliation problem; Keller et al. [137] further extended this approach to estimate both the variances and covariances of the measurement error. Since those methods are sensitive to outliers, Chen et al. [138] developed an M-estimator to estimate the error covariance matrices; in this approach the observations are weighted based on their Mahalanobis distances, hence the M-estimator is more robust. Morad et al. [139] developed another robust M-estimator which directly estimates the covariance matrices from the measurements. Maquin et al. [140] applied a direct method to simultaneously estimate the variances of measurement errors and reconcile the data
with respect to balance equations. Mirabedini and Hodouin [141] used a state-space model to estimate the variance and covariance of sampling errors in complex mineral systems. These methods often rely on known process constraints, such as mass balances; however, such constraints may not exist for a given process data set. Those methods can also be computationally intensive. These hassles limit the applications of such estimation methods. Most of the previous work focuses on the estimation of the parameters in the likelihood functions. However, obtaining estimates of those parameters is not the ultimate goal: eventually, the estimates are used in the modeling methods, and the model quality is what matters. Relying on a single estimate of the error covariance matrix in modeling is inherently vulnerable to the quality of that estimate and is not robust. Hence, this two-step strategy is not ideal for solving the problem. Yet from a Bayesian perspective, this challenge can be tackled in a rigorous way without requiring extra measurements, process constraints or a heavy computational load. Bayesian statistics provides a rigorous way to combine information from different sources and with different uncertainties. Considering a noise covariance matrix as a set of model parameters to be estimated, those parameters are inherently stochastic, and their uncertainties before modeling can be captured by prior distributions. Many existing modeling methods can be reformulated as Bayesian methods with prior distributions for the parameters in the likelihood functions. When little is known about the prior distribution of the noise variance of a variable, a noninformative prior [142, 143] can be assumed; this is a popular way in Bayesian modeling to deal with situations with little prior information about a parameter. This layer of priors for the noise variances avoids the problem of having to assign a fixed number to the parameter. The prior distributions of the model parameters are combined with the data measurements by Bayes rule, resulting in a posterior distribution. This posterior distribution contains all the available information about the model parameters. The noise variances and the other model parameters can be estimated together from the posterior distribution, and uncertainty information about the estimates can also be obtained. BLVR-S implements a Gibbs sampling algorithm to draw samples; the sequential sampling strategy in Gibbs sampling makes it easy to draw samples from the high dimensional posterior distribution. In this chapter, a novel Bayesian modeling method based on BLVR-S is proposed. It adds noninformative priors for the noise variances in BLVR-S, and an extra step is added to the Gibbs sampler of BLVR-S to draw samples of the noise variances of the input and output variables. The introduction of noninformative priors for the noise variances makes this extended BLVR-S more robust to inaccurate information about the likelihood functions. The rest of this chapter is organized as follows. Section 5.2 describes the details of the BLVR-S algorithm with noninformative priors for the noise variances. It is followed by Section 5.3, which presents several case studies in which the performance of different modeling methods is compared and the advantages of the proposed method are illustrated. Finally, Section 5.4 gives a summary of this work and discusses possible future directions.
5.2 Method

5.2.1 Setup of Noninformative Priors
For simplicity, the measurement noise for each variable is assumed to be independent in this work; hence, the covariance matrices are diagonal and there are only m + 1 parameters to be estimated in R_x and R_y, i.e., the diagonal elements of R_x and R_y. An advantage of using noninformative priors for those parameters is that they are relatively more objective and therefore less likely to be biased. Noninformative priors are also usually easy to set up, because there are usually no or very few parameters to be determined in them. Noninformative priors often take the form of uniform distributions, but they need not always be uniform. Even if a prior distribution is uniform under one metric, it can be a different type of distribution under another metric. This is illustrated in the following example.

Illustrative Example 5.1. For a Continuous Stirred Tank Reactor (CSTR) of volume V = 1 m³, assume the inlet flow rate F is stochastic and uniformly distributed between 1 and 1.1 m³/hr. The probability density function (PDF) of F is:

P(F) = 10,    1 ≤ F ≤ 1.1.    (5.1)
Hence the mean residence time τ is also stochastic. However, since τ = V/F, τ is not a linear function of F, and it is not uniformly distributed. In fact, the PDF of τ can be obtained by applying the following theorem:

Theorem 5.1. [89] For a continuous random variable x with PDF P_x(x), x ∈ X, and another random variable y = g(x), where g(·) is a monotone function and g^{-1}(y) has a continuous derivative on Y, y has the PDF:

P_y(y) = P_x(g^{-1}(y)) | d g^{-1}(y) / dy |,    y ∈ Y.    (5.2)

Thus, the PDF of τ is:

P(τ) = 10 × | d(1/τ) / dτ | = 10/τ²,    1/1.1 ≤ τ ≤ 1.    (5.3)
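Equation (5.3) can be checked numerically with a quick Monte Carlo sketch (an illustration only): sample F uniformly, transform to τ = V/F, and compare the empirical density with 10/τ².

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1.0
F = rng.uniform(1.0, 1.1, size=1_000_000)   # F ~ Uniform(1, 1.1) m^3/hr
tau = V / F                                  # mean residence time, 1/1.1 <= tau <= 1

# Compare an empirical histogram with the derived density P(tau) = 10 / tau^2.
edges = np.linspace(1.0 / 1.1, 1.0, 21)
hist, _ = np.histogram(tau, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - 10.0 / centers**2)))   # close to zero up to sampling/binning error
```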
[Figure 5.1 shows two panels: the uniform probability density of the flow rate F (m³/hr) over [1, 1.1], and the resulting non-uniform density of the mean residence time τ (hr) over [1/1.1, 1].]

Figure 5.1: Probability distributions of flow rate and mean of residence time in the CSTR example.
Figure 5.1 shows the distributions of F and τ; clearly, the distribution of τ is far from uniform. If measurements of τ showed that it is actually uniformly distributed, then the assumption of F being uniform would be inappropriate. Example 5.1 illustrates the importance of choosing an appropriate metric for assigning the uniform prior. Once this metric is determined, the prior distribution under another metric can be derived by applying Theorem 5.1. A natural way to come up with a noninformative prior for the noise variances in BLVR is to assume that no information about the standard deviations of the noise is available. Let σ_j denote the standard deviation of the noise of the j-th variable of the data set, where the first m variables are inputs and the (m+1)-th variable is the output. It can be considered uniformly distributed, i.e.,

P(σ_j) ∝ 1,    σ_j > 0.    (5.4)
By applying Theorem 5.1, the prior distribution for σ_j² can be obtained:

P(σ_j²) ∝ | dσ_j / dσ_j² | ∝ | d√(σ_j²) / dσ_j² | ∝ 1 / √(σ_j²).    (5.5)
However, because a different metric can be chosen for the uniform distribution, the above noninformative prior is not the only acceptable one. Another popular choice is to assume that log σ_j is uniformly distributed, i.e.,

P(log σ_j) ∝ 1,    σ_j > 0.    (5.6)
By applying Theorem 5.1, the prior distribution for σ_j² is:

P(σ_j²) ∝ | d log σ_j / dσ_j² | ∝ | d(log σ_j² / 2) / dσ_j² | ∝ 1 / σ_j².    (5.7)
In fact, this noninformative prior satisfies Jeffreys' rule and is called Jeffreys' prior [144]; see Appendix B for details. In addition to the above two noninformative priors, the prior distribution of the variance of Gaussian noise is often assumed to be Inverse-Gamma, which has a PDF of the form:

P(σ_j² | k_j, θ_j) = (θ_j^{k_j} / Γ(k_j)) (σ_j²)^{−k_j−1} exp(−θ_j / σ_j²),    (5.8)
where Γ(·) is the Gamma function. Given a Gaussian likelihood function, the Inverse-Gamma prior of σ_j² results in an Inverse-Gamma posterior distribution for σ_j²; the Inverse-Gamma prior is said to be conjugate to Gaussian likelihood functions. Conjugate priors bring convenience to Bayesian computation, hence the Inverse-Gamma prior is a good candidate for the noninformative prior of σ_j². To make an Inverse-Gamma distribution resemble a uniform prior, a common choice is to set its parameters k_j and θ_j close to zero. In the popular Bayesian modeling software BUGS [101], they are both set equal to 0.001. Figure 5.2 shows the PDFs of several Inverse-Gamma distributions with different sets of parameters. The trend is that, as k_j and θ_j decrease, the Inverse-Gamma distribution becomes flatter. In fact, the two previously discussed noninformative priors can be regarded as two limiting Inverse-Gamma distributions with different parameters: for the noninformative prior in Equation (5.5), k_j → −1/2, θ_j → 0; for the noninformative prior in Equation (5.7), k_j → 0, θ_j → 0. The three noninformative priors for σ_j² are summarized in Table 5.1 and are adopted in the work described in this chapter. They are also the types of priors considered in the work by Gelman [143]. Since they can be unified under the Inverse-Gamma distribution family, only one extended BLVR-S approach, with Inverse-Gamma prior distributions for the noise variances, is needed; this method is described in Section 5.2.2.
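The flattening of the Inverse-Gamma density as k_j and θ_j decrease (Figure 5.2) can be reproduced with scipy.stats.invgamma, whose shape parameter a and scale correspond to k_j and θ_j in Equation (5.8). A minimal sketch:

```python
import numpy as np
from scipy.stats import invgamma

sigma2 = np.linspace(0.01, 3.0, 300)
for k, theta in [(0.001, 0.001), (1.0, 1.0), (2.0, 2.0)]:
    pdf = invgamma.pdf(sigma2, a=k, scale=theta)   # density of Equation (5.8)
    print(f"k={k}, theta={theta}: pdf range [{pdf.min():.3g}, {pdf.max():.3g}]")
```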
[Figure 5.2 plots the probability density of σ_i² for Inverse-Gamma distributions with (k_i, θ_i) = (0.001, 0.001), (1, 1) and (2, 2).]

Figure 5.2: Probability density functions of Inverse-Gamma distributions with different parameters.
Type   k_j      θ_j    Uniform (or Approximately Uniform) in the Space of
1      0.001    0.001  σ_j²
2      0        0      log σ_j
3      −1/2     0      σ_j

Table 5.1: Summary of different types of noninformative priors for σ_j².
5.2.2 BLVR-S with Noninformative Priors for Noise Variances
In the BLVR-S approach described in Chapter 3, the noise variances of the input and output variables are considered deterministic and are fixed throughout the Gibbs sampling procedure. By setting noninformative priors for them, they are treated as stochastic and should be estimated along with the score vectors and regression parameters. Hence, an additional step is needed in the BLVR-S approach to draw samples of those noise variances. In this step, samples of the noise variances are drawn from their full conditional distributions. The full conditional distributions of Z and β remain the same as those described in Section 3.2, except that the noise variance parameters in those distributions are updated at every time step. To sample the noise variances by Gibbs sampling, their full conditional distributions have to be derived. As discussed in Section 5.2.1, the three types of noninformative priors of the noise variances considered in this work can be unified under the Inverse-Gamma family. Since the Inverse-Gamma is a conjugate prior for Gaussian likelihood functions, the full conditional distribution of σ_j² is also an Inverse-Gamma distribution. Without loss of generality, assume j < m + 1, i.e., the j-th variable is an input variable. Because the measurement noise is independent across variables, the conditional posterior distribution of σ_j² depends only on the likelihood of the j-th input variable; the conditional distribution of σ_j² is:

P(σ_j² | X(:,j), Z, α) ∝ P(X(:,j) | Z, α, σ_j²) P(σ_j² | k_j, θ_j)
                       ∝ (1 / ((2π)^{n/2} σ_j^n)) exp(−Σ_{i=1}^n (x_ij − x̃_ij)² / (2σ_j²)) · (θ_j^{k_j} / Γ(k_j)) (σ_j²)^{−k_j−1} exp(−θ_j / σ_j²)
                       ∝ (σ_j²)^{−(n/2 + k_j)−1} exp( −(Σ_{i=1}^n (x_ij − x̃_ij)²/2 + θ_j) / σ_j² ),    (5.9)
where X(:,j) is the j-th column of X, x_ij is the measurement of the j-th input variable in the i-th observation, and x̃_ij is its noise-free value, which depends on the score vectors Z and the loading vectors α. Comparing Equation (5.9) with Equation (5.8), it is clear that the conditional distribution of σ_j² is indeed another Inverse-Gamma distribution. Denoting Σ_{i=1}^n (x_ij − x̃_ij)² as SSE_j, the conditional distribution of σ_j² is:

σ_j² | X(:,j), Z, α ∼ InverseGamma(k'_j = n/2 + k_j, θ'_j = SSE_j/2 + θ_j),    j = 1, 2, . . . , m.    (5.10)
Equation (5.10) provides some insight into the effects of the parameters of the Inverse-Gamma prior. When the number of observations is large (n is large), k'_j ≈ n/2 and k_j has little effect on the posterior distribution of σ_j². When the magnitude of the noise in the j-th variable is large (SSE_j is large), θ'_j ≈ SSE_j/2 and θ_j has little effect on the posterior distribution of σ_j². Similarly, the conditional distribution of σ_{m+1}² (or R_y) is:

σ_{m+1}² | y, Z, β ∼ InverseGamma(k'_{m+1} = n/2 + k_{m+1}, θ'_{m+1} = SSE_{m+1}/2 + θ_{m+1}),    (5.11)

where SSE_{m+1} = Σ_{i=1}^n (y_i − ỹ_i)².
Table 5.2 shows the detailed algorithm of the extended BLVR-S with Inverse-Gamma priors for the noise variances. In the Gibbs sampling steps, SSE_j depends on the most recent samples of Z and β; the SSE_j at the s-th time step is denoted SSE_j^(s) and is calculated with β^(s) and Z^(s).
As described in Section 5.2.1, three different noninformative priors are considered in this work, and the noise variance of each variable can take any one of them as its prior distribution. Hence, the parameters k_j and θ_j in the priors can differ for each σ_j².
• Get α̂ by applying PCA, MLPCA or PLS to (X, y).
• While not converged:
  1. Get Ẑ, β̂, R̂_x and R̂_y by Gibbs sampling with K samples in the Markov chain:
     – for s = 1 : K
       · draw Z^(s) from P(Z | X, y, α̂, β^(s−1))
       · draw β^(s) from P(β | X, y, α̂, Z^(s))
       · draw (σ_j²)^(s) from InverseGamma(n/2 + k_j, SSE_j^(s)/2 + θ_j), j = 1, 2, . . . , m + 1
       · R_x^(s) = diag[(σ_1²)^(s), (σ_2²)^(s), . . . , (σ_m²)^(s)], R_y^(s) = (σ_{m+1}²)^(s)
     – end for
     – estimate Z, β, R_x and R_y based on {(Z^(1), β^(1)), (Z^(2), β^(2)), . . . , (Z^(K), β^(K))}
  2. Get X̂ and ŷ based on α̂, Ẑ and β̂.
  3. Get α̂ by applying PCA, MLPCA or PLS to (X̂, ŷ).
• End while

Table 5.2: The extended BLVR-S algorithm with noninformative priors for noise variances.
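The additional Gibbs step of Table 5.2 amounts to drawing each σ_j² from its Inverse-Gamma full conditional in Equations (5.10) and (5.11). A minimal NumPy sketch of this step is given below; the reconstruction arrays X_hat = Z α^T and y_hat = Z β are assumed to come from the current Gibbs samples, and the function name is hypothetical.

```python
import numpy as np

def draw_noise_variances(X, y, X_hat, y_hat, k, theta, rng):
    """Draw sigma_j^2, j = 1..m+1, from InverseGamma(n/2 + k_j, SSE_j/2 + theta_j)."""
    n = X.shape[0]
    sse = np.concatenate([((X - X_hat) ** 2).sum(axis=0),   # SSE_1..SSE_m
                          [((y - y_hat) ** 2).sum()]])      # SSE_{m+1}
    shape = n / 2.0 + k          # k'_j
    scale = sse / 2.0 + theta    # theta'_j
    # If G ~ Gamma(shape, rate=scale), then 1/G ~ InverseGamma(shape, scale).
    sigma2 = 1.0 / rng.gamma(shape, 1.0 / scale)
    Rx, Ry = np.diag(sigma2[:-1]), sigma2[-1]
    return Rx, Ry
```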
For simplicity, however, the same type of noninformative prior is used here for all the σ_j². BLVR-S with the Type 1, Type 2 or Type 3 prior of Table 5.1 for the noise variances is denoted BLVR-v1, BLVR-v2 and BLVR-v3, respectively. In Section 5.3, the performance of these three approaches is compared with that of other methods in various case studies.
5.3 Experiments

5.3.1 Simulated High Dimensional Data
This example is adapted from the one in Section 3.3.1. There are 50 observations, 15 input variables and 1 output variable in the data set, and the true rank of the input matrix is known to be 10. Both input and output variables are contaminated by Gaussian measurement noise; the variances of the noise are determined by the Signal to Noise Ratio (SNR) of the variables. The loading vectors for BLVR are obtained from PCR. The historical Gaussian priors for the noise-free input variables and the regression parameters of BLVR, BLVR-v1, BLVR-v2 and BLVR-v3 are generated based on the results of applying PCR to a historical data set with 600 observations. The true rank 10 is used for all methods. The performance of PCR, MLPCR, BLVR, BLVR-v1, BLVR-v2 and BLVR-v3 is compared. For BLVR, BLVR-v1, BLVR-v2 and BLVR-v3, both uniform and historical Gaussian priors for the other model parameters are considered. With the uniform prior, they are denoted BLVR(u), BLVR(u)-v1, BLVR(u)-v2 and BLVR(u)-v3; with the historical prior, they are denoted BLVR(h), BLVR(h)-v1, BLVR(h)-v2 and BLVR(h)-v3. Results are based on 50 realizations.

To mimic our ignorance about the measurement error, the initial guess for σ_j² is assumed to be:

(σ_j²)^(0) = σ_j² exp(u_j),    (5.12)

where u_j ∼ Uniform(−l, l) with l ≥ 0. The parameter l quantifies the inaccuracy of the initial guess for σ_j²: the larger l is, the greater the chance of a guess far from the truth. When l = 0, the guess equals the true noise variance, which means accurate information about the likelihood function is available. This initial guess of the noise variance is used in MLPCR and is also used as the first sample in the Markov Chain for σ_j² in BLVR, BLVR-v1, BLVR-v2 and BLVR-v3. Five different cases are studied in this example by varying the SNR and l. Parameter settings for the five cases are summarized in Table 5.3. In the first case, the SNR for all input and output variables is set to 3 with l = 2. In the second case, the other conditions remain the same as in the first case, except that the SNR of the input variables varies from 1 to 15, which means the magnitudes of the measurement noise differ considerably among the variables. The third case is the same as the second, except that l = 3, i.e., there is a larger error in the initial guess of the measurement noise variance. In the fourth case, the settings are the same as in Case 1 except that l = 0, which means the true noise variances are used as the initial guesses. Similarly, in the fifth case, all settings are the same as in Case 2, except that l = 0. Tables 5.4∼5.8 show the average, over 50 realizations, of the normalized Mean Squared Error (MSE) of the input and output variables for the different methods in each of the five cases. The MSE is normalized by the MSE of PCR in each realization.
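A sketch of the perturbed initial guess of Equation (5.12) is given below, assuming sigma2_true holds the true noise variances; the function name is hypothetical.

```python
import numpy as np

def initial_noise_guess(sigma2_true, l, rng):
    """Equation (5.12): multiply each true noise variance by exp(u_j),
    u_j ~ Uniform(-l, l); l = 0 recovers the true variances."""
    u = rng.uniform(-l, l, size=np.shape(sigma2_true))
    return np.asarray(sigma2_true) * np.exp(u)
```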
Case   l   SNR of the Inputs   SNR of the Output
1      2   3                   3
2      2   1∼15                3
3      3   1∼15                3
4      0   3                   3
5      0   1∼15                3

Table 5.3: Summary of the different settings in each case of the simulated high dimensional data example in Section 5.3.1.
Table 5.4 shows that in Case 1 the performance of MLPCR and BLVR(u) is worse than that of PCR. This is not surprising, since they make use of inaccurate error variances while PCR assumes the same error variance for all variables. In this case, the assumption made by PCR is not that bad, since the differences among the actual error variances of the variables are relatively small. Although BLVR(h) also makes use of the inaccurate likelihood information, its performance is slightly better than PCR, due to the utilization of the extra information in the historical prior. As for BLVR(u)-v1, BLVR(u)-v2 and BLVR(u)-v3, they have similar performance, which is slightly worse than PCR but better than BLVR(u), which lacks the noninformative prior for the noise variances. The estimates from BLVR(h)-v1, BLVR(h)-v2 and BLVR(h)-v3, which use the noninformative priors on the noise, outperform all other methods. This is as expected, because these methods make use of the most information, along with noninformative priors for the noise variances. The three types of noninformative priors have similar effects on the performance of BLVR. This makes sense since the number of observations and the measurement noise are relatively large and, as discussed in Section 5.2.2, the choice of the hyper-parameters in the Inverse-Gamma prior distributions then does not affect the performance of BLVR.
A similar trend is observed for Case 2 in Table 5.5, except that MLPCR and BLVR(u) now outperform PCR. This can be explained by the fact that in Case 2 the noise variances of the input variables vary much more than in Case 1, and this is exactly the situation in which MLPCR and BLVR(u) can significantly outperform PCR when accurate likelihood information is available. Since PCR assumes that the noise variances are identical for all input variables, this assumption is inappropriate in Case 2. Even though MLPCR and BLVR(u) use inaccurate noise variances, they still do a better job than PCR. As for Case 3, the trend in Table 5.6 is similar to that in Case 1, and PCR again outperforms MLPCR and BLVR(u). In this case, the variation among the noise variances of the different variables is the same as in Case 2, but the quality of the likelihood information is worse. Hence, applying PCR gives a better model than applying MLPCR or BLVR(u) with very inaccurate likelihood information. In Case 4, since the accurate noise variances are available, Table 5.7 indicates that BLVR(h) has the best performance, but the performance of BLVR(h)-v1, BLVR(h)-v2 and BLVR(h)-v3 is not far behind. The performance of MLPCR, BLVR(u), BLVR(u)-v1, BLVR(u)-v2 and BLVR(u)-v3 is also similar. In Case 5, when the accurate noise variances are again available, Table 5.8 shows that BLVR(h) once more has the best performance. Compared with the results in Case 4, however, there is a larger difference in performance when comparing BLVR(u) with BLVR(u)-v1, BLVR(u)-v2 and BLVR(u)-v3, or BLVR(h) with BLVR(h)-v1, BLVR(h)-v2 and BLVR(h)-v3. This is expected, since the variation among the noise variances is much larger than in Case 4 and having accurate information about the noise variances can greatly improve the model quality.
Figures 5.3∼5.12 show the matrix plots of the normalized training and testing MSE for all the methods over the 50 realizations in the five cases. The MSE is normalized by the variance of the noise in the output variable. These plots confirm that in all cases the three types of noninformative priors have similar performance, because the paired plots of BLVR(u)-v1, BLVR(u)-v2 and BLVR(u)-v3 are almost diagonal; the same can be said for the paired plots of BLVR(h)-v1, BLVR(h)-v2 and BLVR(h)-v3. This is because the hyper-parameter k_j in the three types of Inverse-Gamma prior distributions is in each case much smaller than the number of observations, and the hyper-parameter θ_j in the three types of priors is much smaller than SSE_j. As discussed in Section 5.2.2, in this situation all three types of priors have little effect on the posterior distribution. Hence, the choice of the noninformative prior is not important in this case study. The diagonals of Figures 5.3∼5.12 present box plots of the normalized MSE for each method. It can be observed that the use of noninformative priors reduces the variability of the MSE, which means that BLVR with noninformative priors for the noise variances is more robust.
MSE Y (testing) Y (training) X(testing) X(training)
MLPCR 1.2420 1.1863 1.3454 1.1598
BLVR(u) 1.1801 0.9038 1.3636 1.1869
BLVR(u)-v1 1.0575 0.7924 1.2147 1.0798
BLVR(u)-v2 1.0573 0.7990 1.2454 1.0931
BLVR(u)-v3 1.0064 0.7446 1.0882 1.0104
BLVR(h) 0.9059 0.8096 1.1116 0.9567
BLVR(h)-v1 0.7463 0.7510 0.9207 0.8647
BLVR(h)-v2 0.7559 0.7622 0.9649 0.8988
BLVR(h)-v3 0.7370 0.7333 0.8641 0.8114
Table 5.4: Case 1 of the simulated high dimensional data example in Section 5.3.1: Average of MSE for different methods, normalized by the MSE of PCR in each realization, SNRx = 3, SNRy = 3, l = 2, 50 realizations.
MSE Y (testing) Y (training) X(testing) X(training)
MLPCR 0.8997 0.9299 0.8639 0.7924
BLVR(u) 0.9234 0.9017 1.2600 1.0333
BLVR(u)-v1 0.8846 0.8513 1.2382 1.0579
BLVR(u)-v2 0.9061 0.8219 1.2679 1.0779
BLVR(u)-v3 0.8441 0.7941 1.0921 0.9872
BLVR(h) 0.7303 0.7663 0.9215 0.7780
BLVR(h)-v1 0.7343 0.7733 0.9160 0.8862
BLVR(h)-v2 0.7498 0.7940 0.9646 0.9238
BLVR(h)-v3 0.7158 0.7469 0.8364 0.8149
Table 5.5: Case 2 of the simulated high dimensional data example in Section 5.3.1: Average of MSE for different methods, normalized by the MSE of PCR in each realization, SNRx = 1, 2, . . . , 15, SNRy = 3, l = 2, 50 realizations.
MSE Y (testing) Y (training) X(testing) X(training)
MLPCR 1.2559 1.2190 1.0769 0.9652
BLVR(u) 1.2189 1.1174 1.6500 1.2696
BLVR(u)-v1 0.9494 0.8040 1.2846 1.0854
BLVR(u)-v2 0.9607 0.8040 1.2406 1.0598
BLVR(u)-v3 0.8866 0.7731 1.0912 0.9888
BLVR(h) 0.9274 0.9735 1.0820 0.9572
BLVR(h)-v1 0.8025 0.8563 0.9370 0.9064
BLVR(h)-v2 0.8240 0.8840 1.0062 0.9612
BLVR(h)-v3 0.7843 0.8201 0.8510 0.8289
Table 5.6: Case 3 of the simulated high dimensional data example in Section 5.3.1: Average of MSE for different methods, normalized by the MSE of PCR in each realization, SNRx = 1, 2, . . . , 15, SNRy = 3, l = 3, 50 realizations.
MSE Y (testing) Y (training) X(testing) X(training)
MLPCR 0.9545 0.9622 0.9335 0.9326
BLVR(u) 0.9734 0.7019 0.9724 0.9466
BLVR(u)-v1 1.0696 0.8393 1.2493 1.1077
BLVR(u)-v2 1.0828 0.8461 1.2619 1.1146
BLVR(u)-v3 1.0148 0.7649 1.1066 1.0244
BLVR(h) 0.8141 0.6852 0.7937 0.7326
BLVR(h)-v1 0.8222 0.8121 0.9317 0.8727
BLVR(h)-v2 0.8365 0.8298 0.9866 0.9145
BLVR(h)-v3 0.8118 0.7899 0.8734 0.8195
Table 5.7: Case 4 of the simulated high dimensional data example in Section 5.3.1: Average of MSE for different methods, normalized by the MSE of PCR in each realization, SNRx = 3, SNRy = 3, l = 0, 50 realizations.
MSE Y (testing) Y (training) X(testing) X(training)
MLPCR 0.6919 0.7461 0.5553 0.5390
BLVR(u) 0.7143 0.6933 1.0074 0.8494
BLVR(u)-v1 0.8832 0.7926 1.1763 1.0525
BLVR(u)-v2 0.8868 0.7939 1.1755 1.0493
BLVR(u)-v3 0.8368 0.7756 1.0733 0.9879
BLVR(h) 0.5194 0.5426 0.6633 0.5855
BLVR(h)-v1 0.6813 0.7063 0.9062 0.8852
BLVR(h)-v2 0.6984 0.7349 0.9638 0.9311
BLVR(h)-v3 0.6684 0.6875 0.8343 0.8177
Table 5.8: Case 5 of the simulated high dimensional data example in Section 5.3.1: Average of MSE for different methods, normalized by the MSE of PCR in each realization, SNRx = 1, 2, . . . , 15, SNRy = 3, l = 0, 50 realizations.
Figure 5.3: Case 1 of the simulated high dimensional data example in Section 5.3.1: Matrix plot of training MSE of the output, normalized by the variance of noise in the output, SNRx = 3, SNRy = 3, l = 2, 50 realizations.
Figure 5.4: Case 1 of the simulated high dimensional data example in Section 5.3.1: Matrix plot of testing MSE of the output, normalized by the variance of noise in the output, SNRx = 3, SNRy = 3, l = 2, 50 realizations.
Figure 5.5: Case 2 of the simulated high dimensional data example in Section 5.3.1: Matrix plot of training MSE of the output, normalized by the variance of noise in the output, SNRx = 1, 2, . . . , 15, SNRy = 3, l = 2, 50 realizations.
Figure 5.6: Case 2 of the simulated high dimensional data example in Section 5.3.1: Matrix plot of testing MSE of the output, normalized by the variance of noise in the output, SNRx = 1, 2, . . . , 15, SNRy = 3, l = 2, 50 realizations.
Figure 5.7: Case 3 of the simulated high dimensional data example in Section 5.3.1: Matrix plot of training MSE of the output, normalized by the variance of noise in the output, SNRx = 1, 2, . . . , 15, SNRy = 3, l = 3, 50 realizations.
Figure 5.8: Case 3 of the simulated high dimensional data example in Section 5.3.1: Matrix plot of testing MSE of the output, normalized by the variance of noise in the output, SNRx = 1, 2, . . . , 15, SNRy = 3, l = 3, 50 realizations.
Figure 5.9: Case 4 of the simulated high dimensional data example in Section 5.3.1: Matrix plot of training MSE of the output, normalized by the variance of noise in the output, SNRx = 3, SNRy = 3, l = 0, 50 realizations.
Figure 5.10: Case 4 of the simulated high dimensional data example in Section 5.3.1: Matrix plot of testing MSE of the output, normalized by the variance of noise in the output, SNRx = 3, SNRy = 3, l = 0, 50 realizations.
5.3.2 Simulated Hybrid Data
The example in Section 4.3.1 is adapted here. There are 10 continuous input variables, 5 discrete input variables, 1 output variable and 100 observations in this data set. The first 5 continuous variables (C(:, 1 : 5)) are generated from the standard normal distribution, and the next 5 continuous variables (C(:, 6 : 10)) are generated as C(:, 6 : 10) = C(:, 1 : 5)L, where L is a random 5 × 5 matrix. Thus, the next 5 continuous variables are linear combinations of the first 5; hence all of them are Gaussian and the true rank of C is 5. The discrete input variables (D) are generated from Bernoulli distributions with parameter p = 0.9. The noise-free output variable y is generated as y = [C D]b, where b is a random 15 × 1 vector. The continuous variables and the output variable are contaminated by white noise, while the discrete variables are noise free. The signal-to-noise ratio (SNR) is set to 9. Uniform and true priors are used for BLVR. PCR, MLPCR, BLVR(u), BLVR(u)-v1, BLVR(u)-v2, BLVR(u)-v3, BLVR(t), BLVR(t)-v1, BLVR(t)-v2 and BLVR(t)-v3 are applied to model this data set. Each method has two variations, without and with the modeling procedure described in Chapter 4 (the latter denoted by the original name with a postfix of '2', such as PCR2, BLVR2(u), BLVR2(u)-v1, etc.). The true rank is used for all methods. Table 5.9 shows the average of the normalized MSE over 100 realizations for each case. The initial guess for the noise variance is generated in the same way as in the previous example. The three types of noninformative priors perform similarly in this example. BLVR(u) and BLVR(t) with the noninformative priors for the noise variances tend to outperform their corresponding versions without the noninformative priors. The application of the modeling procedure for hybrid data sets greatly improves the performance of BLVR.
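The following Matlab-style sketch shows how such a hybrid data set could be generated; the mixing matrix L, the coefficient vector b and the noise level are drawn here purely for illustration, and the SNR is treated as a variance ratio, which may differ from the definition used elsewhere in this work.

n = 100;
C = zeros(n, 10);
C(:, 1:5)  = randn(n, 5);                 % first 5 continuous variables
L = randn(5, 5);                          % random mixing matrix (illustrative)
C(:, 6:10) = C(:, 1:5) * L;               % linear combinations, so rank(C) = 5
D = double(rand(n, 5) < 0.9);             % 5 Bernoulli(p = 0.9) discrete variables
b = randn(15, 1);                         % random regression vector (illustrative)
y = [C D] * b;                            % noise-free output

SNR = 9;                                  % white noise on continuous inputs and output only
Cn = C + randn(n, 10) .* repmat(std(C) / sqrt(SNR), n, 1);
yn = y + (std(y) / sqrt(SNR)) * randn(n, 1);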
Figure 5.11: Case 5 of the simulated high dimensional data example in Section 5.3.1: Matrix plot of training MSE of the output, normalized by the variance of noise in the output, SNRx = 1, 2, . . . , 15, SNRy = 3, l = 0, 50 realizations.
Figure 5.12: Case 5 of the simulated high dimensional data example in Section 5.3.1: Matrix plot of training MSE of the output, normalized by the variance of noise in the output, SNRx = 1, 2, . . . , 15, SNRy = 3, l = 0, 50 realizations.
Among all the methods, BLVR2(t)-v1, BLVR2(t)-v2 and BLVR2(t)-v3 have the best performance.
5.3.3 Inferential Modeling of a Batch Distillation Process
Different methods are applied to an inferential modeling problem. The problem originates from a 4-stage batch distillation column described in [145]. Binous [146] coded a program to simulate the separation of a mixture of methanol and water in this column, assuming that the reflux ratio is 10, the pressure is 1 bar, and the initial molar fraction of methanol is 80%. The molar holdup is assumed constant, 20 mol at each stage and 100 mol in the still. The vapor flow rate is 10 mol/min. Figure 5.13 illustrates this batch distillation process. A simulation using Binous' program is run to obtain the temperature of each stage and the still, and the methanol molar fraction of the distillate in the reflux drum. The temperature and molar fraction are sampled with a time step of 1 minute from 1 to 200 minutes. Independent Gaussian noise is added to the temperature and molar fraction. The variances of the measurement noise for the temperatures of the 4 stages and the still are 1^2, 0.8^2, 0.6^2, 0.4^2 and 0.2^2; the variance of the measurement noise for the molar fraction is 9 × 10^-4. The measurements at the 200 time steps are randomized and divided into a training set and a testing set, each containing 100 observations. There are 5 input variables (the temperatures at the 4 stages and the still) and one output variable (the methanol molar fraction of the distillate). Although this is a dynamic system, the dynamics are not considered in this modeling problem, because the current temperature measurements at the stages and the still are already highly correlated with the current molar fraction of the distillate.
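A minimal Matlab-style sketch of the noise addition and the train/test split described above; Ttrue and ytrue stand for the noise-free temperatures and molar fractions from Binous' simulation and are assumed names, not part of the original code.

sigmaT = [1, 0.8, 0.6, 0.4, 0.2];             % noise std. dev. of the 4 stages and the still
X = Ttrue + randn(200, 5) .* repmat(sigmaT, 200, 1);
y = ytrue + sqrt(9e-4) * randn(200, 1);       % molar-fraction measurement noise

idx   = randperm(200);                        % randomize the 200 time steps
train = idx(1:100);   test = idx(101:200);    % 100 observations each
Xtrain = X(train, :); ytrain = y(train);
Xtest  = X(test, :);  ytest  = y(test);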
Y (testing) Y (training)
PCR 1 1
MLPCR 1.5102 1.4275
BLVR(u) 2.1298 1.5674
BLVR(u)-v1 1.0925 1.0061
BLVR(u)-v2 2.2819 1.8832
BLVR(u)-v3 2.0260 1.3053
BLVR(t) 6.6263 5.8607
BLVR(t)-v1 0.7150 0.6530
BLVR(t)-v2 0.7180 0.6592
BLVR(t)-v3 0.6980 0.6475
Y (testing) Y (training)
PCR2 1.1428 1.1031
MLPCR2 1.5109 1.4285
BLVR2(u) 1.5434 1.3863
BLVR2(u)-v1 1.0738 0.9193
BLVR2(u)-v2 1.0793 0.9270
BLVR2(u)-v3 1.0699 0.9187
BLVR2(t) 0.8668 1.2358
BLVR2(t)-v1 0.6333 0.6008
BLVR2(t)-v2 0.6359 0.5972
BLVR2(t)-v3 0.6332 0.5977
Table 5.9: Average of MSE for the simulated high dimensional hybrid data set in Section 5.3.2, normalized by the MSE of PCR in each realization, 50 realizations.
Figure 5.13: Illustration of the batch distillation process.
MSE Y (testing) Y (training) X(testing) X(training)
PLS 0.9960 0.9275 1.0049 0.9875
MLPCR 0.9811 1.0486 1.7999 1.7469
BLVR(u) 0.9596 0.6118 1.8076 1.8171
BLVR(u)-v1 0.5251 0.3578 1.3900 1.3510
BLVR(u)-v2 0.5226 0.3699 1.4014 1.3577
BLVR(u)-v3 0.5272 0.3466 1.3773 1.3442
Table 5.10: Case 1 of the batch distillation example in Section 5.3.3: Average of MSE for different methods of the batch distillation problem, normalized by the MSE of PCR in each realization, rank 2, l = 3, 50 realizations.

MSE Y (testing) Y (training) X(testing) X(training)
PLS 0.9960 0.9256 1.0031 0.9855
MLPCR 0.5306 0.4602 1.1913 1.1166
BLVR(u) 0.5281 0.2267 1.1692 1.2694
BLVR(u)-v1 0.5273 0.3376 1.4003 1.3367
BLVR(u)-v2 0.5236 0.3477 1.4037 1.3384
BLVR(u)-v3 0.5303 0.3252 1.3887 1.3310
Table 5.11: Case 2 of the batch distillation example in Section 5.3.3: Average of MSE for different methods of the batch distillation problem, normalized by the MSE of PCR in each realization, rank 2, l = 0, 50 realizations.
Hence, this system can be treated as if it were static. PCR, PLS, MLPCR, BLVR(u), BLVR(u)-v1, BLVR(u)-v2 and BLVR(u)-v3 are applied to model this process. Rank 2 is used for all methods. In the first case, the initial guess of the variances of the measurement noise is obtained in the same way as in the previous examples, with l = 3; in the second case, the true variances of the measurement noise are used, i.e., l = 0. Tables 5.10 and 5.11 show the average of the normalized MSE of the different methods over 50 realizations in the two cases, respectively. When l = 3, BLVR(u)-v1, BLVR(u)-v2 and BLVR(u)-v3 give significantly better predictions of the output than the other methods; when l = 0, MLPCR, BLVR(u), BLVR(u)-v1, BLVR(u)-v2 and BLVR(u)-v3 have similar performance. Figures 5.14∼5.17 show the matrix plots of the training and testing MSE of the output over the 50 realizations for the two cases, normalized by the variance of the measurement noise. It is clear that in the first case there are many outliers for MLPCR and BLVR(u) over the 50 realizations, and they have a relatively large variance of MSE because they are vulnerable to bad likelihood information, while the variances of MSE for BLVR(u)-v1, BLVR(u)-v2 and BLVR(u)-v3 are much smaller.
5.3.4
Inferential Modeling of a Continuous Distillation Process
This inferential modeling problem originates from a problem that has been thoroughly studied by Skogestad et al. [147, 148, 149, 150]. There are 40 theoretical stages and a total condenser in this distillation column. It is used to separate a binary mixture with a constant relative volatility of 1.5. The difference in the boiling points of the two components is 13.5 °C. The feed flow rate is 1 kmol/min. In this study, the feed composition (of the light component) is set to be stochastic, normally distributed with mean 0.5 and variance 0.01.
Figure 5.14: Case 1 of the batch distillation example in Section 5.3.3: Matrix plot of training MSE of output, normalized by the variance of noise in the output, l = 3, 50 realizations.
Figure 5.15: Case 1 of the batch distillation example in Section 5.3.3: Matrix plot of testing MSE of output, normalized by the variance of noise in the output, l = 3, 50 realizations.
Figure 5.16: Case 2 of the batch distillation example in Section 5.3.3: Matrix plot of training MSE of output, normalized by the variance of noise in the output, l = 0, 50 realizations.
Figure 5.17: Case 2 of the batch distillation example in Section 5.3.3: Matrix plot of testing MSE of output, normalized by the variance of noise in the output, l = 0, 50 realizations.
A nonlinear dynamic simulation of 500 minutes is run using the Matlab and Simulink files provided by Skogestad [150] to obtain the composition of the mixture at each stage and in the condenser with a time step of 1 minute. The temperature at each stage is calculated from the composition data by a linear approximation [150]. Those temperatures are contaminated by Gaussian measurement noise with mean zero and different variances. The variance of the measurement noise of the first and last 10 stages is 9 × 10^-6, the variance for stages 16∼25 is 1 × 10^-4, and the variance for the remaining 10 stages is 2.5 × 10^-5. The composition of the distillate is also contaminated by Gaussian noise with mean zero and variance 4 × 10^-8. The dynamics of this system are ignored in the modeling process. The 500 observations are randomized and divided into a training and a testing set with 250 observations in each. Two cases are considered in this simulation study as well. In the first case, the initial guess of the variances of the measurement noise is obtained in the same way as in the previous examples, with l = 3; in the second case, the true variances of the measurement noise are used, i.e., l = 0. Another simulation of 500 minutes is run to obtain a historical data set, which is used to generate the historical prior distributions by PCR. Then PCR, PLS, MLPCR, BLVR(u), BLVR(u)-v1, BLVR(u)-v2, BLVR(u)-v3, BLVR(h), BLVR(h)-v1, BLVR(h)-v2 and BLVR(h)-v3 are applied to estimate the composition of the distillate from the temperature measurements at the 40 stages. Figures 5.18∼5.21 show the median of the training and testing MSE of the output over 50 realizations for the above methods with different ranks in the two cases.
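As an illustration only, the stage-wise noise specification described above could be assembled as follows in Matlab; Ttrue and ytrue are assumed names for the noise-free simulation outputs, not variables from Skogestad's files.

nStages = 40;
varT = zeros(1, nStages);
varT([1:10, 31:40]) = 9e-6;       % first and last 10 stages
varT(16:25)         = 1e-4;       % stages 16 to 25
varT([11:15, 26:30]) = 2.5e-5;    % remaining 10 stages
varY = 4e-8;                      % distillate composition

T = Ttrue + randn(size(Ttrue)) .* sqrt(repmat(varT, size(Ttrue, 1), 1));
y = ytrue + sqrt(varY) * randn(size(ytrue));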
According to Figure 5.19, in the first case BLVR(h)-v2 and BLVR(h)-v3 have the best performance, while BLVR(h)-v1 performs worse. Also, BLVR(u)-v2 and BLVR(u)-v3 perform better than BLVR(u)-v1. This is different from the trend observed in the previous case studies. The difference can be explained by recalling the effect of θj on the posterior distribution of σj2. As discussed in Section 5.2.2, θj has little effect only when SSEj is much larger than θj. In this example, however, the noise variances of most variables are much smaller than 0.001, which is the value of θj in the Type 1 prior. Hence, the Type 1 prior has a significant effect on the posterior distribution. It is therefore more biased than the Type 2 and Type 3 priors in this problem, which leads to worse performance. Hence, when the measurement noise is very small, the Type 2 and Type 3 priors should be preferred. Figure 5.21 shows that BLVR(h), BLVR(h)-v2 and BLVR(h)-v3 perform equally well and slightly better than MLPCR in Case 2. It also shows that the Type 1 prior performs worse than the Type 2 and Type 3 priors.
Figure 5.18: Case 1 of the continuous distillation column example in Section 5.3.4: Median of training MSE of output over 50 realizations, l = 3.
Figure 5.19: Case 1 of the continuous distillation column example in Section 5.3.4: Median of testing MSE of output over 50 realizations, l = 3.
Figure 5.20: Case 2 of the continuous distillation column example in Section 5.3.4: Median of training MSE of output over 50 realizations, l = 0.
5.4 Conclusions and Future Work
This chapter proposes a general Bayesian approach to address the challenge of a lack of accurate likelihood information in process modeling. The approach is applied to develop an extended BLVR-S method with noninformative priors for the variances in the likelihood functions. Case studies show that it can effectively handle the problem of inaccurate information about the measurement noise.
Figure 5.21: Case 2 of the continuous distillation column example in Section 5.3.4: Median of testing MSE of output over 50 realizations, l = 0.
By setting noninformative priors for the noise variances, the model quality is in general better than when the noise variances are fixed in BLVR-S, especially when the information regarding the noise variances may be far from the underlying truth. Although PCR or PLS may seem attractive in that situation, they are ill-suited for utilizing prior information. Also, the variances of the measurement noise can be estimated from BLVR-S with the noninformative prior approach; this is an additional benefit that PCR or PLS cannot provide. For other modeling methods which require likelihood information, this strategy can also be adopted by putting them in a Bayesian framework: a prior distribution can be set up for the parameters in the likelihood functions, reflecting the uncertainty about the likelihood information, and a noninformative prior can be used when there is little knowledge about the measurement noise. The three types of noninformative priors suggested in this chapter are unified under the Inverse-Gamma distribution family. Since the hyper-parameters of the Inverse-Gamma prior distributions in practice can differ from the priors discussed in this chapter, the choices of priors in this framework are not limited to those three. Although different priors usually have similar performance, in some cases the performance of BLVR-S is sensitive to the choice of the prior hyper-parameters. Thus it is important to choose the prior hyper-parameters according to the number of observations in the data set and the magnitude of the measurement noise. Not only can the parameters in the likelihood functions have their own prior distributions; when there is uncertainty associated with the hyper-parameters of the prior distributions in Bayesian modeling methods, a second layer of prior distributions can be set up for those hyper-parameters. Those hyper-prior distributions result in
a hierarchical structure of the Bayesian methods. This is very common in Bayesian modeling and is called hierarchical Bayes [151]. The development of a hierarchical structure for BLVR-S and other modeling methods could be a future direction of this work.
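Schematically, such a hierarchy would add one more level above the noise-variance priors of this chapter (an illustration only, not the exact structure proposed here):

x_{ij} \mid \tilde{x}_{ij}, \sigma_j^2 \sim \mathcal{N}(\tilde{x}_{ij}, \sigma_j^2), \qquad \sigma_j^2 \mid k_j, \theta_j \sim \mathrm{IG}(k_j, \theta_j), \qquad (k_j, \theta_j) \sim \pi(k_j, \theta_j),

where π is a hyper-prior expressing the remaining uncertainty about the hyper-parameters themselves.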
CHAPTER 6
BLVR-S WITH NON-GAUSSIAN PRIOR DISTRIBUTIONS
6.1 Introduction
One of the advantages of using Monte Carlo sampling instead of optimization in Bayesian computation is that there are few restrictions on the type of prior distributions or likelihood functions. For optimization-based methods, Gaussian or other assumptions are often made so that the optimization problem can be formulated as a least-squares or other type of problem that is suitable for the optimization routines. However, this advantage of Monte Carlo sampling is available only when the sampling procedure is capable of drawing samples from arbitrary distributions. In theory, this is not a problem for Metropolis-Hastings sampling, but efficiently drawing samples from an arbitrary distribution requires fine tuning of the sampling procedure, which is often not practical. Metropolis-Hastings sampling is also not efficient in drawing samples from high dimensional distributions. Gibbs sampling, on the other hand, is efficient for high dimensional distributions, but it requires the ability to draw samples from the full conditional distributions of the parameters. In the BLVR-S described in Chapter 3, the likelihood
functions are assumed to be Gaussian and the prior distributions are either Gaussian or uniform. Those assumptions result in Gaussian full conditional distributions; since sampling strategies for Gaussian distributions have been well studied, they are easy to implement. Other assumptions about the likelihood functions and prior distributions can lead to other types of conditional distributions that are not even defined in the literature; moreover, only the kernel parts of the PDFs of the conditional distributions are available in BLVR-S. Since most existing sampling algorithms require knowledge of the entire PDF, it is much more difficult to draw samples from those distributions in this situation. Because of the difficulty of drawing samples from arbitrary distributions, it is not clear whether relaxing the assumptions made in BLVR-S would be worth the effort. In fact, the assumption of Gaussian likelihood functions is not a bad one, because the Gaussian is the most natural distribution for random noise. It is also by far the most popular assumption about likelihood functions in Bayesian or maximum likelihood modeling methods. However, the Gaussian prior assumption for input variables or regression parameters can often be inappropriate. Chen et al. [55] showed that the distributions of data tend to be non-Gaussian in dynamic systems. Figure 6.1 shows the histogram of the temperature measurements at the 10th stage of the continuous distillation column described in Section 5.3.4. The heavy right tail of the histogram suggests that the distribution is non-Gaussian; it is not uniform either. In such cases, the assumption of a Gaussian or uniform prior may not be appropriate. Ideally, the prior distributions of input variables and regression parameters should depend on the type of information available. For example, if the only prior information available about a regression parameter is its upper and lower bounds, it is probably
Figure 6.1: Histogram of the temperature measurements of the 10th stage of the continuous distillation column in Section 5.3.4.
best to assume a bounded uniform distribution for the regression parameter. This prior distribution also satisfies the Maximum Entropy (ME) principle [152]. The entropy of a distribution is used as a measure of uncertainty in information theory [153]. For a prior distribution P(θ), its entropy is defined as:

H(\theta) = -\int P(\theta) \log P(\theta)\, d\theta .   (6.1)

The idea behind the ME principle is to make the prior distribution as objective as possible by choosing the distribution with the most uncertainty that satisfies the available information. If θ is defined on a subset of the real line, R, and no other prior information is available, the ME prior distribution is a uniform distribution over this set. This makes intuitive sense because the uniform prior provides no additional information about θ, and corresponds to the largest entropy. In the presence of domain-specific constraints, such as upper and lower bounds on the parameter, the ME prior is obtained by solving the constrained optimization problem:

\hat{P}(\theta) = \arg\max_{P(\theta)} H(\theta) \quad \text{s.t.} \quad Q_i(P(\theta)) = c_i, \qquad i = 1, 2, \ldots, k,   (6.2)

where Q_i is some characteristic of the prior distribution that has been elicited. This information is often expressed as the expectation of some function f_i(θ). If all constraints can be expressed as expectations of functions f_1(θ), f_2(θ), . . . , f_k(θ), the ME prior falls into the exponential family:

P(\theta) = \exp\!\left(\lambda_0 + \lambda_1 f_1(\theta) + \lambda_2 f_2(\theta) + \ldots + \lambda_k f_k(\theta)\right),   (6.3)

where λ_0, λ_1, . . . , λ_k are parameters that need to be estimated. Typical ME priors of this form have been derived [154] and are summarized in Table 6.1.
Support of θ     Constraints                            ME Prior
(a, b)           N/A                                    P(θ) = 1/(b − a)                               (Uniform)
(0, +∞)          E(θ) = c1                              P(θ) = (1/c1) exp(−θ/c1)                       (Exponential)
(−∞, +∞)         E(θ) = c1, E((θ − c1)^2) = c2          P(θ) = (2πc2)^(−1/2) exp(−(θ − c1)^2/(2c2))    (Gaussian)
(−∞, +∞)         E(|θ|) = c1                            P(θ) = (1/(2c1)) exp(−|θ|/c1)                  (Laplace)

Table 6.1: Typical Maximum Entropy priors with constraints. The priors shown in this table are univariate, but the ME principle can also be applied to obtain multivariate priors.
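As a brief worked illustration of Equation (6.2), not part of the original derivation: for θ on (0, +∞) with the single constraint E(θ) = c_1, maximizing the entropy with Lagrange multipliers recovers the exponential entry of Table 6.1:

\hat{P}(\theta) = \arg\max_{P} \left\{ -\int_0^\infty P\log P \,d\theta \right\} \quad \text{s.t.} \quad \int_0^\infty P\,d\theta = 1, \quad \int_0^\infty \theta P\,d\theta = c_1 .

Setting the functional derivative of the Lagrangian to zero gives -\log P(\theta) - 1 + \lambda_0 + \lambda_1\theta = 0, so P(\theta) = \exp(\lambda_0 - 1 + \lambda_1\theta), which is of the form of Equation (6.3); imposing the two constraints yields \lambda_1 = -1/c_1 and P(\theta) = \frac{1}{c_1} e^{-\theta/c_1}.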
Table 6.1 shows that under certain circumstances the Gaussian or uniform assumptions are not appropriate according to the ME principle. Hence, it is necessary to relax those assumptions in BLVR-S. Among the different sampling techniques, importance sampling is well known for its ability to handle arbitrary types of priors; in addition, it requires a relatively low computational load. For those reasons, it has been popular in Bayesian estimation methods such as SMC [55]. In this work it is chosen to draw samples from the full conditional distributions within Gibbs sampling, and a generalized sampling-based BLVR (GBLVR-S) approach is developed. In this approach, the assumptions on the prior distributions of the input variables and regression parameters are relaxed and any type of prior distribution can be assumed. The prior distribution can even be nonparametric and given in the form of samples or particles from the prior distribution. In GBLVR-S the Gaussian likelihood assumption is still made; however, the measurements can have heteroscedastic measurement noise, i.e., every observation
could have a different error covariance matrix. Thus, GBLVR-S is much more flexible than BLVR-S. The rest of this chapter is organized as follows: Section 6.2 describes the methodology of GBLVR-S, Section 6.3 presents an example of applying GBLVR-S, and Section 6.4 provides a summary of this work and possible future directions.
6.2 Method
The flowchart of GBLVR-S remains the same as the BLVR-S algorithm shown in Table 3.1; the difference lies in the steps of drawing samples of z_i and β from their full conditional distributions in Gibbs sampling. As shown in Section 3.2, those conditional distributions are Gaussian in BLVR-S, so they can easily be implemented with the built-in sampling functions in Matlab or other packages. In GBLVR-S, importance sampling is used instead to draw samples of z_i and β from the conditional distributions. The detailed algorithms are described in Sections 6.2.1 and 6.2.2 and in Tables 6.2 and 6.3. The procedure for estimating the output of new observations by GBLVR-S is described in Section 6.2.3.
6.2.1 Drawing Samples of Latent Variables
In the t-th step of Gibbs sampling, given the estimated loading matrix α̂ and the most recent sample of β, the full conditional distribution of z_i = Z(i, :)^T is:

P(z_i \mid x_i, y_i, \hat{\alpha}, \beta^{(t-1)}) \propto P(x_i \mid \hat{\alpha}, z_i)\, P(y_i \mid \beta^{(t-1)}, z_i)\, P_z(z_i).   (6.4)

The first two terms on the right hand side of Equation (6.4) are the likelihood functions and the last term is the prior distribution of the latent variables. To draw z_i from Equation (6.4) by importance sampling, a proposal distribution for z_i has to be chosen. A natural choice is to use its prior distribution as the proposal distribution. However, for modelers, the available prior information is usually about the original noise-free input variables x̃_i, not the latent variables z_i. Thus, only the prior distribution of the noise-free input variables, P_x(x̃_i), is available, and there is no way to draw samples of z_i directly from its prior distribution. Fortunately, given r samples x̃_i^[k], k = 1, 2, . . . , r, of x̃_i drawn from P_x(x̃_i), r samples of z_i can be calculated as:

z_i^{[k]} = \tilde{x}_i^{[k]} \hat{\alpha}, \qquad k = 1, 2, \ldots, r.   (6.5)

Hence, samples of z_i can still be drawn indirectly from its prior distribution P_z(z_i). In this way, r samples of z_i are obtained, but those samples come from the prior distribution, not from the conditional distribution in Equation (6.4). In the next step, the weights of those samples are calculated as:

w(z_i^{[k]}) = \frac{P(x_i \mid \hat{\alpha}, z_i^{[k]})\, P(y_i \mid \beta^{(t-1)}, z_i^{[k]})\, P_z(z_i^{[k]})}{P_z(z_i^{[k]})} = P(x_i \mid \hat{\alpha}, z_i^{[k]})\, P(y_i \mid \beta^{(t-1)}, z_i^{[k]}).   (6.6)

The weights are simply the likelihoods of those samples, which makes intuitive sense. In the next step of importance sampling, one of the r samples is drawn according to these weights, and this sample can be approximately treated as a draw from Equation (6.4). To do this, the sample weights are first normalized:

w^*(z_i^{[k]}) = \frac{w(z_i^{[k]})}{\sum_{s=1}^{r} w(z_i^{[s]})}.   (6.7)

The cumulative sums of the normalized weights are then calculated:

cw^*(z_i^{[k]}) = \sum_{s=1}^{k} w^*(z_i^{[s]}).   (6.8)

Therefore, {cw^*(z_i^{[1]}), cw^*(z_i^{[2]}), . . . , cw^*(z_i^{[r]})} is a sequence of numbers monotonically increasing from 0 to 1. Next, a random number u is drawn from Uniform(0, 1) and k is found such that

cw^*(z_i^{[k-1]}) < u \le cw^*(z_i^{[k]}).   (6.9)

This can be done with the binary search algorithm [155]. The k-th sample z_i^[k] is then taken as the sample from the conditional distribution of z_i in the t-th step of Gibbs sampling. Throughout this procedure, the prior distribution on the right hand side of Equation (6.4) is used to draw a number of candidate samples, the likelihood functions on the right hand side of Equation (6.4) are used to calculate weights for those samples, and finally one sample is randomly chosen from the candidates according to their weights. The detailed algorithm is shown in Table 6.2.
6.2.2 Drawing Samples of Regression Parameters
In the t-th step of Gibbs sampling, after Z is drawn, the full conditional distribution of the regression parameters β is:

P(\beta \mid y, \hat{\alpha}, Z^{(t)}) \propto P(y \mid Z^{(t)}, \beta)\, P_\beta(\beta).   (6.10)

The first term on the right hand side of Equation (6.10) is the likelihood function and the second term is the prior distribution of β. Similar to the strategy for drawing samples of z_i, the prior distribution P_β(β) is used as the proposal distribution in importance sampling. Samples of β also cannot be drawn directly; instead, samples of the regression parameters for the original input variables are drawn from the prior distribution P_b(b), and samples of β are calculated as:

\beta^{[k]} = \hat{\alpha}^T b^{[k]}, \qquad k = 1, 2, \ldots, r.   (6.11)
Draw a sample of Z in the t-th step of Gibbs sampling.
• for i = 1 : n
   – z_i = Z(i, :)^T; its full conditional distribution is P(z_i | x_i, y_i, α̂, β^(t−1)).
   – draw r samples from P_z(z_i):
      · draw x̃_i^[k], k = 1, 2, . . . , r, from P_x(x̃_i).
      · set z_i^[k] = x̃_i^[k] α̂, k = 1, 2, . . . , r.
   – draw one sample out of the r samples of z_i according to their weights:
      · calculate the weights: w(z_i^[k]) = P(x_i | α̂, z_i^[k]) P(y_i | β^(t−1), z_i^[k]).
      · normalize the weights: w*(z_i^[k]) = w(z_i^[k]) / Σ_{s=1}^{r} w(z_i^[s]).
      · calculate the cumulative normalized weights: cw*(z_i^[k]) = Σ_{s=1}^{k} w*(z_i^[s]).
      · draw a random number u from Uniform(0, 1).
      · find k such that cw*(z_i^[k−1]) < u ≤ cw*(z_i^[k]) by binary search.
      · take the sample z_i^(t) = z_i^[k].
• end for

Table 6.2: Importance sampling algorithm to draw samples of the latent variables in GBLVR-S.
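A minimal Matlab-style sketch of this importance-resampling step for one observation. It assumes Gaussian likelihoods, the usual reconstruction x ≈ z α̂^T for the input likelihood, and a user-supplied sampler prior_draw_x for P_x(x̃); all names are illustrative, not the toolbox's actual interface.

function zi = draw_latent(xi, yi, alpha_hat, beta_prev, Rx, sigy2, r, prior_draw_x)
    % xi: 1 x m noisy input, yi: scalar output, alpha_hat: m x p, beta_prev: p x 1
    p = size(alpha_hat, 2);
    Z = zeros(r, p);  logw = zeros(r, 1);
    for k = 1:r
        xt = prior_draw_x();                       % candidate noise-free input from Px(x~)
        Z(k, :) = xt * alpha_hat;                  % candidate latent variables, Eq. (6.5)
        rx = xi - Z(k, :) * alpha_hat';            % input residual
        ry = yi - Z(k, :) * beta_prev;             % output residual
        logw(k) = -0.5*(rx / Rx)*rx' - 0.5*ry^2/sigy2;   % Gaussian log-likelihoods, Eq. (6.6)
    end
    w  = exp(logw - max(logw));                    % normalize in a numerically safe way
    cw = cumsum(w) / sum(w);                       % Eqs. (6.7) and (6.8)
    k  = find(cw >= rand, 1, 'first');             % inverse-CDF draw, Eq. (6.9)
    zi = Z(k, :);
end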
The sample weight of β^[k] is calculated from the likelihood function of the output variable:

w(\beta^{[k]}) = P(y \mid Z^{(t)}, \beta^{[k]}).   (6.12)

Based on those sample weights, one of the r samples of β can be randomly drawn with the same approach as in Section 6.2.1. The detailed algorithm is shown in Table 6.3.
6.2.3 Estimation of New Observations
The estimates of the model parameters, α̂ and β̂, obtained by applying GBLVR-S to the training data set can be used to estimate the noise-free output of new observations. Given a new noisy observation x_new, the most straightforward way to estimate its output is to multiply it by α̂ and β̂, i.e.,

\hat{y}_{new} = x_{new}\, \hat{\alpha}\, \hat{\beta}.   (6.13)

But this is not the optimal solution, because x_new is also contaminated by measurement noise. Therefore, the latent variables z_new should be estimated first, given the likelihood function of x_new and the prior distribution of the noise-free input variables. With the estimate ẑ_new, the output can be estimated as:

\hat{y}_{new} = \hat{z}_{new}\, \hat{\beta}.   (6.14)

Fortunately, ẑ_new can be estimated by a procedure similar to the one used for drawing samples of the latent variables in Gibbs sampling, as shown in Table 6.2. There are two differences, though. First, the sample weights are calculated based only on the likelihood function of the input variables, because the output of the new observation is not observed. Second, instead of drawing one out of the r samples of z_new, the weighted average of those r samples is calculated as the estimate of z_new.
Draw a sample of β in the t-th step of Gibbs sampling.
• The full conditional distribution of β is P(β | y, α̂, Z^(t)).
• draw r samples from P_β(β):
   – draw b^[k], k = 1, 2, . . . , r, from P_b(b).
   – set β^[k] = α̂^T b^[k], k = 1, 2, . . . , r.
• draw one sample out of the r samples of β according to their weights:
   – calculate the weights: w(β^[k]) = P(y | Z^(t), β^[k]).
   – normalize the weights: w*(β^[k]) = w(β^[k]) / Σ_{s=1}^{r} w(β^[s]).
   – calculate the cumulative normalized weights: cw*(β^[k]) = Σ_{s=1}^{k} w*(β^[s]).
   – draw a random number u from Uniform(0, 1).
   – find k such that cw*(β^[k−1]) < u ≤ cw*(β^[k]) by binary search.
   – take the sample β^(t) = β^[k].

Table 6.3: Importance sampling algorithm to draw samples of the regression parameters in GBLVR-S.
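An analogous Matlab-style sketch for the regression-parameter step of Table 6.3, under the same illustrative assumptions (prior_draw_b returns an m x 1 draw from P_b(b); sigy2 is the output noise variance; these names are placeholders).

function beta = draw_beta(y, Z, alpha_hat, sigy2, r, prior_draw_b)
    B = zeros(size(alpha_hat, 2), r);  logw = zeros(r, 1);
    for k = 1:r
        B(:, k) = alpha_hat' * prior_draw_b();     % candidate beta, Eq. (6.11)
        res = y - Z * B(:, k);                     % output residuals
        logw(k) = -0.5 * (res' * res) / sigy2;     % Gaussian log-likelihood, Eq. (6.12)
    end
    w  = exp(logw - max(logw));
    cw = cumsum(w) / sum(w);
    beta = B(:, find(cw >= rand, 1, 'first'));     % weighted draw as in Section 6.2.1
end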
Estimate ỹ_new given x_new, α̂ and β̂.
• Estimate z_new:
   – draw r samples from P_z(z_new):
      · draw x̃_new^[k], k = 1, 2, . . . , r, from P_x(x̃_new).
      · set z_new^[k] = x̃_new^[k] α̂, k = 1, 2, . . . , r.
   – calculate the weights: w(z_new^[k]) = P(x_new | α̂, z_new^[k]).
   – calculate the weighted average: ẑ_new = Σ_{s=1}^{r} w(z_new^[s]) z_new^[s] / Σ_{s=1}^{r} w(z_new^[s]).
• Calculate the estimate of the noise-free output: ŷ_new = ẑ_new β̂.

Table 6.4: The procedure to estimate the noise-free output for a new observation in GBLVR-S.
The detailed algorithm is shown in Table 6.4.
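A matching Matlab-style sketch of the prediction step in Table 6.4, with the same illustrative names as above; only the input likelihood enters the weights, and the candidates are averaged rather than resampled.

function [ynew, znew] = predict_new(xnew, alpha_hat, beta_hat, Rx, r, prior_draw_x)
    Z = zeros(r, size(alpha_hat, 2));  logw = zeros(r, 1);
    for k = 1:r
        Z(k, :) = prior_draw_x() * alpha_hat;      % candidate latent variables
        rx = xnew - Z(k, :) * alpha_hat';          % input residual
        logw(k) = -0.5 * (rx / Rx) * rx';          % weight from P(xnew | alpha_hat, z)
    end
    w = exp(logw - max(logw));  w = w / sum(w);
    znew = w' * Z;                                 % weighted average of the candidates
    ynew = znew * beta_hat;                        % Eq. (6.14)
end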
6.3 Experiment
This GBLVR-S approach is applied to model a simulated data set. There are 3 input variables and 1 output variable in this data set. The first input variable X̃(:, 1) comes from a uniform distribution. The second input variable X̃(:, 2) comes from a mixture of two Gaussian distributions, with equal probability for each one. The third input variable X̃(:, 3) is a linear combination of the first and second variables:

\tilde{X}(:,3) = 2\tilde{X}(:,1) + \tilde{X}(:,2).   (6.15)
Figure 6.2: Histograms of the non-Gaussian input variables of the simulated data set example, based on 10000 observations.
The noise-free output variable is a linear combination of the noise-free input variables:

y = 0.8\tilde{X}(:,1) + 0.4\tilde{X}(:,2) + 0.4\tilde{X}(:,3) = a_1\tilde{X}(:,1) + a_2\tilde{X}(:,2),   (6.16)

where a_1 = 1.6 and a_2 = 0.8 are the regression coefficients for the two independent input variables. Both the input and output variables are contaminated by i.i.d. Gaussian measurement noise. There are 100 observations in both the training and testing sets. In the training stage, 100 samples are drawn from the proposal distributions in each importance sampling step; in the testing stage, 10000 samples are drawn for calculating the estimates of the latent variables. The true prior distributions of the input variables are used in GBLVR-S. The prior for the regression coefficients is set to a uniform prior with a lower limit of 0 and an upper limit of 1. The performance of GBLVR-S is compared with that of PCR, PLS, MLPCR and BLVR with a uniform prior. The true rank of 2 is used for all the methods. The loading vectors for BLVR come from MLPCA. The averages of the normalized MSE of those methods over 50 realizations are shown in Table 6.5; the MSE of the input and output variables are normalized by the variances of the measurement noise. The results show that GBLVR-S has the best performance in almost all the categories, which is not unexpected since it makes use of the most information. However, the improvement in model quality comes at the price of a higher computational load: Table 6.5 shows that the average CPU time for GBLVR-S is two orders of magnitude larger than that of BLVR(u). The high computational load could be a problem for applications of GBLVR-S in practice.
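A Matlab-style sketch of how such a data set could be generated; the mixture components, the noise level and the combined sample size are chosen here only for illustration, since the text does not list their exact values.

n  = 200;                                  % 100 training + 100 testing observations
x1 = rand(n, 1);                           % uniform input variable
mix = rand(n, 1) < 0.5;                    % equal-probability Gaussian mixture (assumed components)
x2 = mix .* (5 + 2*randn(n, 1)) + ~mix .* (-5 + 2*randn(n, 1));
x3 = 2*x1 + x2;                            % third input: linear combination, Eq. (6.15)
Xtrue = [x1 x2 x3];
ytrue = 0.8*x1 + 0.4*x2 + 0.4*x3;          % equals 1.6*x1 + 0.8*x2, Eq. (6.16)

sig = 0.1;                                 % i.i.d. Gaussian measurement noise (assumed level)
X = Xtrue + sig*randn(n, 3);
y = ytrue + sig*randn(n, 1);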
Y (training) Y (testing) X (training) X (testing) a1 a2 CPU time (sec)
PCR 0.3467 0.3484 0.9936 1.0138 0.5662 0.0052 0.0002
PLS 0.3447 0.3622 0.9671 1.0117 0.6021 0.0058 0.0018
MLPCR 0.2963 0.2993 0.5196 0.5103 0.3878 0.0115 0.0764
BLVR(u) 0.2341 0.3007 0.4219 0.5103 0.3883 0.0111 5.3972
GBLVR-S 0.2131 0.2741 0.3277 0.4103 0.1397 0.0063 483.5538
Table 6.5: Average of the normalized MSE and CPU time of all methods for the simulated non-Gaussian data example over 50 realizations; the MSE are normalized by the variances of the measurement noise.
6.4 Conclusions and Future Work
A generalized sampling-based BLVR (GBLVR-S) approach is proposed in this chapter. With the importance-sampling-within-Gibbs-sampling strategy, the restrictive assumptions on the prior distributions are relaxed. This allows the use of non-Gaussian and non-uniform prior distributions for the noise-free input variables and regression parameters, which makes the approach much more flexible than the BLVR-S approach described in Chapter 3. The prior distributions can be chosen according to the information available, for example by applying the Maximum Entropy principle; this is more appropriate than forcing the priors to be Gaussian or uniform. Although the assumption of Gaussian likelihood functions is still made, that is just for convenience, and it can easily be dropped without modifying the algorithm described in this chapter. However, with the relaxation of those assumptions about the priors and likelihood functions, this algorithm is computationally more expensive than the original BLVR-S. Reducing the number of samples drawn in the importance sampling steps can reduce the computational load, but a good approximation in importance sampling requires enough samples drawn from the proposal distribution. Hence, it is important to maintain the balance between speed and accuracy by choosing a proper number of samples. Although GBLVR-S has not yet been applied to model more process data, it has great potential because of its flexibility. Since non-Gaussian distributions are often seen in dynamic systems, that could be a possible field to which GBLVR-S can be applied in the future.
CHAPTER 7
CONCLUSIONS AND FUTURE WORK
This dissertation developed a set of sampling-based Bayesian latent variable regression methods, which are capable of solving practical modeling problems in process engineering. It addresses the challenges of the high dimensionality of data, hybrid data types, inaccurate measurement noise variances and non-Gaussian/non-uniform prior assumptions. It illustrates the advantages of Bayesian latent variable methods over traditional approaches in various case studies. Section 7.1 provides some concluding remarks and Section 7.2 recommends possible future directions of this work.
7.1 Concluding Remarks
The beauty of Bayesian methods lies in the fact that they are built on a solid statistical framework. They solve estimation and prediction problems from the information processing and decision making points of view, which is radically different from traditional methods. Different sources of information can be incorporated in this framework and their uncertainties can be properly accounted for. By using more information in modeling, the model quality is expected to improve. However, the precondition for Bayesian methods to work well is that the information is of good quality and that the likelihood functions and prior distributions accurately reflect the available information. Thus, although the steps of eliciting likelihood and prior information may seem trivial, they are essential for the success of Bayesian methods. This point has been demonstrated in many case studies in this dissertation. As to the methodology, the ideal Bayesian modeling methods should be able to handle different types of likelihood functions and prior distributions, and provide ways to deal with information of low quality. Efforts have been made in this dissertation to accommodate those requirements in Bayesian latent variable regression methods. A toolbox of Bayesian latent variable methods has been developed, but this work should be considered the beginning, not the end, of the drive to promote Bayesian modeling methods in process engineering. The argument has been made clear in this dissertation that Bayesian latent variable methods are superior to traditional methods, but unfortunately they have received little attention in process engineering for a long time. There are many reasons for this lack of applications. One of the biggest problems is that most earlier Bayesian methods are optimization-based and require much more computation than traditional methods, often unaffordable in solving practical problems. But with the advances in Monte Carlo sampling, sampling-based methods are becoming the mainstream in Bayesian computation. They are more efficient than optimization-based methods and they can easily provide uncertainty information for the estimates. They are also easier to expand into more sophisticated methods which are more hierarchical and accommodate more prior information. Of course they are still computationally more expensive than traditional methods, but the development of computing technology has loyally followed Moore's law [156] over the past several decades. In fact, the dramatic advancement of computing technology plays an important role in the recent surge of Bayesian methods. In the foreseeable future, computational cost is expected to be a less significant issue, and sampling-based Bayesian modeling methods are going to hit the prime time in process engineering. The advances in latent variable methods have been steady in recent years, but there is still a lot of room for improvement upon traditional methods by renovating them from a Bayesian perspective. With more modeling practitioners picking up Bayesian latent variable methods, we can expect a new wave of research in latent variable modeling methods. To make this happen, it is important to have more applications solving real problems in industry and laboratories. One challenge of applying Bayesian methods in practice is that they require more information and many people do not want to make the effort. This dissertation has demonstrated ways of getting information out of various sources, and some situations in which Bayesian latent variable methods can make a big difference have also been identified. This should provide some guidelines for engineers and chemometricians who are interested in applying Bayesian latent variable methods.
7.2 Future Work
Recommendations for continuing this work can be summarized in two aspects. Possible future directions for the methodology include:

• The loading vectors in the current BLVR-S method are obtained through traditional latent variable methods and are fixed throughout the sampling process. This is due to the difficulty of drawing samples from the conditional distributions of the loading vectors under the orthonormality constraints. Although this seems not to affect the model quality too much, it is apparently not the optimal solution, especially when the estimates of the loading vectors are used for process monitoring. This issue should be addressed more appropriately in the future with advanced sampling techniques.

• In the modeling procedure for dealing with hybrid data sets, PCA is used for the dimension reduction of the discrete input variables, but PCA was developed based on the assumption of continuous data. Hence, methods specifically developed for discrete variables should replace PCA in the future. It is also worth exploring how Bayesian methods can be used to deal with measurement noise in discrete variables.

• Although in the extended BLVR-S approach noninformative priors can be assumed for the noise variances, this is still not a hierarchical method. A second layer of prior distributions can be assumed for the hyper-parameters in the first layer of prior distributions. This hierarchical structure can better deal with the uncertainty about the hyper-parameters in the prior distributions, and could potentially be more useful in modeling practical problems where prior and likelihood information are often unavailable or inaccurate.

The ultimate objective of this work is to promote the applications of Bayesian methods in process engineering. To achieve this goal, there is still a lot of work to be done:

• This work provides a toolbox of Bayesian latent variable methods that can be used by process engineers and chemometricians. The toolbox is developed in Matlab, which is commercial software and not available to everyone. Also, Matlab programs are executed by an interpreter, which has low efficiency. To make those Bayesian methods easier to use, it is necessary to develop this toolbox in open-source software or in more efficient computer languages such as C/C++.

• This work has demonstrated how Bayesian methods can be applied to solve modeling problems in process engineering, but that is far from enough. To convince people of the advantages of Bayesian methods, more practical applications in areas such as process monitoring, fault detection and process control are needed. This can be done through more industrial collaborations. So far, the applications of those Bayesian latent variable methods are limited to modeling static systems, or to treating dynamic systems as if they were static. Since most industrial processes are dynamic, from an application point of view it could be very beneficial if the system dynamics could be properly handled by Bayesian latent variable methods.

• This study has also found some interesting effects of the conditions of the data on the performance of Bayesian methods, such as the number of training observations and the SNR in the input and output variables. Other conditions might also affect the performance of Bayesian methods. The underlying mechanism of some effects, such as that of the SNR, still remains a mystery and is worth exploring. A good understanding of those effects would make Bayesian practitioners better equipped to deal with practical modeling problems.

• There is still a big gap between the process engineering community and the statistics community. Process engineers still have very little exposure to some well-established statistical methods, such as Bayesian modeling, while statisticians often do not have practical engineering problems in mind in their research. This has been one of the biggest challenges for the promotion of Bayesian latent variable methods in practice. This gap has to be filled by introducing the basics and advances of the two fields to each other. This involves a lot of work and is important for the prosperity of both fields in the decades to come.
APPENDIX A
ANOTHER WAY TO DERIVE FULL CONDITIONAL DISTRIBUTIONS FOR UNIFORM PRIOR CASE IN SECTION 3.2.2
To find the full conditional distribution of z_i, it suffices to work on the kernel of the likelihood of d_i. The kernel part of the likelihood which contains z_i is:

L = \exp\!\left[-\tfrac{1}{2}(d_i - \gamma z_i)^T R_d^{-1}(d_i - \gamma z_i)\right].   (A.1)

Hence,

-2\ln L = z_i^T(\gamma^T R_d^{-1}\gamma)z_i - 2 z_i^T\gamma^T R_d^{-1} d_i + d_i^T R_d^{-1} d_i
        = z_i^T(\gamma^T R_d^{-1}\gamma)z_i - 2 z_i^T(\gamma^T R_d^{-1}\gamma)(\gamma^T R_d^{-1}\gamma)^{-1}\gamma^T R_d^{-1} d_i + d_i^T R_d^{-1} d_i
        = \left(z_i - (\gamma^T R_d^{-1}\gamma)^{-1}\gamma^T R_d^{-1} d_i\right)^T(\gamma^T R_d^{-1}\gamma)\left(z_i - (\gamma^T R_d^{-1}\gamma)^{-1}\gamma^T R_d^{-1} d_i\right) + \delta,   (A.2)

where

\delta = d_i^T R_d^{-1} d_i - d_i^T R_d^{-1}\gamma(\gamma^T R_d^{-1}\gamma)^{-1}\gamma^T R_d^{-1} d_i.   (A.3)

Notice that if δ = 0, L is the same as the kernel part of Normal((γ^T R_d^{-1} γ)^{-1} γ^T R_d^{-1} d_i, (γ^T R_d^{-1} γ)^{-1}). δ = 0 holds as long as there exists a p × 1 vector ν_i such that d_i = γν_i, i.e., d_i is in the column space of γ. This is exactly the case when d_i happens to be a noise-free measurement. If the true model is

d_i = \gamma\nu_i + \epsilon_i,   (A.4)

where ε_i ∼ Normal(0, R_d), then

\delta = \epsilon_i^T R_d^{-1}\epsilon_i - \epsilon_i^T R_d^{-1}\gamma(\gamma^T R_d^{-1}\gamma)^{-1}\gamma^T R_d^{-1}\epsilon_i,   (A.5)

and δ ∼ χ²_{m+1−p}. δ is somewhat like the "residual sum" of the normal kernel, so it is reasonable to neglect it. Thus the full conditional distribution of z_i is just Normal((γ^T R_d^{-1} γ)^{-1} γ^T R_d^{-1} d_i, (γ^T R_d^{-1} γ)^{-1}), which is the same as that derived by taking the limit of the conditional distribution in the Gaussian prior case. The full conditional distribution of β can be derived in a similar way by working on the kernel of the likelihood of y, and the same result as in Section 3.2.2 is obtained.
APPENDIX B
PROOF OF THE TYPE 2 PRIOR IS THE JEFFREYS PRIOR FOR NOISE VARIANCES OF BLVR-S IN SECTION 5.2.1
Jeffreys proposed a noninformative prior which is proportional to the square root of the Fisher information [157], i.e.,

P(\theta) \propto \sqrt{I(\theta \mid D)},   (B.1)

where P(θ) is a prior distribution of a model parameter θ and I(θ | D) is the Fisher information. This is called the Jeffreys prior [144]. The Fisher information is calculated as:

I(\theta \mid D) = E\!\left\{\left(\frac{d \log P(D \mid \theta)}{d\theta}\right)^{2} \,\Big|\, \theta\right\}.   (B.2)
Since the Fisher information is based only on the data likelihood, no extra information is used in the Jeffreys prior. In addition, it is invariant under reparameterization of the parameters.
Without loss of generality, let j ∈ {1, 2, . . . , m}. The Fisher information of σ_j² is:

I(\sigma_j^2 \mid X(:,j)) = E\!\left[\frac{\left(\sum_{i=1}^{n}(x_{ij} - \tilde{x}_{ij})^2\right)^2}{4(\sigma_j^2)^4}\right]
= \frac{1}{4(\sigma_j^2)^4}\, E\!\left[\left(\sum_{i=1}^{n}(x_{ij} - \tilde{x}_{ij})^2\right)^2\right]
= \frac{3n^2(\sigma_j^2)^2}{4(\sigma_j^2)^4}
= \frac{3n^2}{4(\sigma_j^2)^2}.   (B.3)

Thus, the Jeffreys prior is

P(\sigma_j^2) \propto \sqrt{\frac{3n^2}{4(\sigma_j^2)^2}} \propto \frac{1}{\sigma_j^2}.   (B.4)
This is the same as in Equation (5.7). Therefore, the Type 2 prior for σj2 in Table 5.1 is the Jeffreys prior.
BIBLIOGRAPHY
[1] T. Kouri and J. F. MacGregor, “Process analysis, monitoring and diagnosis, using multivariate projection methods,” Chemometrics and Intelligent Laboratory Systems 28 (1995) 3–21. [2] A. J. Burnham, J. F. MacGregor, and R. Viveros, “Latent variable multivariate regression modeling,” Chemometrics and Intelligent Laboratory Systems 48 (1999) 167–180. [3] S. Wold and M. Sjostrom, “Chemometrics, present and future success,” Chemometrics and Intelligent Laboratory Systems 44 (1998) 3–14. [4] J. Koskinen and B. Kowalski, “Interactive pattern recognition in the chemical laboratory,” Journal of Chemical Information and Computer Sciences 15 (1975), no. 2, 119–123. [5] D. Duewer and B. Kowalski, “Forensic data analysis by pattern recognition. categorization of white bond papers by elemental composition,” Analytical Chemistry 47 (1975), no. 3, 526–530. [6] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and Intelligent Laboratory Systems 2 (1987) 37–52. [7] J. E. Jackson, A User’s Guide to Principal Components. Wiley, 1991. [8] W. Lindberg, J.-A. Persson, and S. Wold, “Partial least-squares method for spectrofluorimetric analysis of mixtures of humic acid and ligninsulfonate,” Analytical Chemistry 55 (1983) 643–648. [9] P. Geladi and B. R. Kowalski, “Partial least squares regression - a tutorial,” Analytica Chimica Acta 185 (1986) 1–17. [10] J. F. MacGregor and T. Kouri, “Statistical process control of multivariate processes,” Control Engineering Practice 3 (1995) 403–414. [11] P. Nomikos and J. F. MacGregor, “Multivariate spc charts for monitoring batch processes,” Technometrics 37 (1995) 41–59. 168
[12] J. V. Kresta, J. F. MacGregor, and T. E. Marlin, “Multivariate statistical monitoring of process operating performance,” Canadian Journal of Chemical Engineering 69 (1991) 35–47. [13] A. Raich and A. Cinar, “Statistical process monitoring and disturbance diagnosis in multivariable continuous processes,” AIChE Journal 42 (1996) 995–1009. [14] G. Chen and T. J. McAvoy, “Predictive on-line monitoring of continuous processes,” Journal of Process Control 8 (1998) 409–420. [15] L. H. Chiang, E. L. Russell, and R. D. Braatz, “Fault diagnosis in chemical processes using fisher discriminant analysis, discriminant partial least squares, and principal component analysis,” Chemometrics and Intelligent Laboratory Systems 50 (2000) 243–252. [16] S. J. Qin, “Statistical process monitoring: basics and beyond,” Journal of Chemometrics 17 (2003) 480–502. [17] V. Venkatasubramanian, R. Rengaswamy, S. N. Kavuri, and K. Yin, “A review of process fault detection and diagnosis part iii: Process history based methods,” Computers & Chemical Engineering 27 (2003) 327–346. [18] R. J. Shi and J. F. MacGregor, “Modeling of dynamic systems using latent variable and subspace methods,” Journal of Chemometrics 14 (2000) 423–439. [19] M. H. Kaspar and W. H. Ray, “Chemometric methods for process monitoring and high-performance controller-design,” AIChE Journal 38 (1992) 1593–1608. [20] M. H. Kaspar and W. H. Ray, “Dynamic pls modeling or process control,” Chemical Engineering Science 48 (1993) 3447–3461. [21] P. Nomikos and J. F. MacGregor, “Monitoring batch process using multiway principal component analysis,” AIChE Journal 40 (1994) 1361–1375. [22] J. F. MacGregor, C. Jaeckle, C. Kiparissides, and M. Koutoudi, “Process monitoring and diagnosis by multiblock pls methods,” AIChE Journal 40 (1994) 826–838. [23] P. Nomikos and J. F. MacGregor, “Multi-way partial least squares in monitoring batch processes,” Chemometrics and Intelligent Laboratory Systems 30 (1995) 97–108. [24] J. H. Chen and K. C. Liu, “On-line batch process monitoring using dynamic pca and dynamic pls models,” Chemical Engineering Science 57 (2002) 63–75. 169
[25] J. Gabrielsson, N. O. Lindberg, and T. Lundstedt, “Multivariate methods in pharmaceutical applications,” Journal of Chemometrics 16 (2002) 141–160.
[26] L. M. C. Buydens, T. H. Reijmers, M. L. M. Beckers, and R. Wehrens, “Molecular data-mining: a challenge for chemometrics,” Chemometrics and Intelligent Laboratory Systems 49 (1999) 121–133.
[27] L. D. Lathauwer, B. D. Moor, and J. Vandewalle, “An introduction to independent component analysis,” Journal of Chemometrics 14 (2000) 123–149.
[28] R. F. Li and X. Z. Wang, “Dimension reduction of process dynamic trends using independent component analysis,” Computers & Chemical Engineering 26 (2002) 467–473.
[29] M. Kano, S. Tanaka, S. Hasebe, I. Hashimoto, and H. Ohno, “Monitoring independent components for fault detection,” AIChE Journal 49 (2003) 969–976.
[30] M. Kano, S. Hasebe, I. Hashimoto, and H. Ohno, “Evolution of multivariate statistical process control: application of independent component analysis and external analysis,” Computers & Chemical Engineering 28 (2004) 1157–1166.
[31] J. M. Lee, C. K. Yoo, and I. B. Lee, “Statistical process monitoring with independent component analysis,” Journal of Process Control 14 (2004) 467–485.
[32] C. K. Yoo, J. M. Lee, P. A. Vanrolleghem, and I. B. Lee, “On-line monitoring of batch processes using multiway independent component analysis,” Chemometrics and Intelligent Laboratory Systems 71 (2004) 151–163.
[33] H. Albazzaz and X. Z. Wang, “Statistical process control charts for batch operations based on independent component analysis,” Industrial & Engineering Chemistry Research 43 (2004), no. 21, 6731–6741.
[34] P. D. Wentzell, D. T. Andrews, D. C. Hamilton, K. Faber, and B. R. Kowalski, “Maximum likelihood principal component analysis,” Journal of Chemometrics 11 (1997), no. 4, 339–366.
[35] Y. Rotem, A. Wachs, and D. R. Lewin, “Ethylene compressor monitoring using model-based PCA,” AIChE Journal 46 (2000), no. 9, 1825–1836.
[36] B. R. Bakshi, “Multiscale analysis and modeling using wavelets,” Journal of Chemometrics 13 (1999) 415–434.
[37] B. Walczak, ed., Wavelets in Chemistry. Elsevier, 2000.
[38] A. H. Jazwinski, Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.
[39] D. G. Robertson, J. H. Lee, and J. B. Rawlings, “A moving horizon-based approach for least-squares estimation,” AIChE Journal 42 (1996), no. 8, 2209–2224.
[40] J. O. Berger, Statistical Decision Theory and Bayesian Analysis. Springer, New York, 2nd ed., 1993.
[41] E. N. M. V. Sprang, H. J. Ramaker, J. A. Westerhuis, and A. K. Smilde, “Statistical batch process monitoring using gray models,” AIChE Journal 51 (2005) 931–945.
[42] M. Haim, M. Jaeger, and D. Tzidony, “An application of a Bayesian approach to the combination of measurements of different accuracies,” Reliability Engineering and System Safety 56 (1997) 1–4.
[43] J. K. Won and M. Modarres, “Improved Bayesian method for diagnosing equipment partial failures in process plants,” Computers & Chemical Engineering 22 (1998) 1483–1502.
[44] S. Roussel, V. Bellon-Maurel, J. M. Roger, and P. Grenier, “Fusion of aroma, FT-IR and UV sensor data based on the Bayesian inference. Application to the discrimination of white grape varieties,” Chemometrics and Intelligent Laboratory Systems 65 (2003) 209–219.
[45] R. D. Braatz, R. C. Alkire, E. Seebauer, E. Rusli, R. Gunawan, T. O. Drews, X. Li, and Y. He, “Perspectives on the design and control of multiscale systems,” Journal of Process Control 16 (2006) 193–204.
[46] J. Shao and D. Tu, The Jackknife and Bootstrap. Springer, 1995.
[47] U. Hauptmanns and P. Homke, “Bayesian estimation of failure rate distributions for components in process plants,” Industrial & Engineering Chemistry Research 28 (1989), no. 11, 1639–1644.
[48] J. S. Jorgensen and J. B. Pedersen, “Calculation of the variability of model parameters,” Chemometrics and Intelligent Laboratory Systems 22 (1994) 25–35.
[49] F. A. M. Verdonck, J. Jaworska, O. Thas, and P. A. Vanrolleghem, “Determining environmental standards using bootstrapping, Bayesian and maximum likelihood techniques: a comparative study,” Analytica Chimica Acta 446 (2001), no. 1-2, 429–438.
[50] E. S. Park, M. S. Oh, and P. Guttorp, “Multivariate receptor models and model uncertainty,” Chemometrics and Intelligent Laboratory Systems 60 (2002), no. 1-2, 49–67.
[51] N. Armstrong, “Bayesian analysis of band-broadening models used in high performance liquid chromatography,” Chemometrics and Intelligent Laboratory Systems 81 (2006) 188–201.
[52] E. V. Bystritskaya, A. L. Pomerantsev, and O. Y. Rodionova, “Prediction of the aging of polymer materials,” Chemometrics and Intelligent Laboratory Systems 47 (1999), no. 2, 175–178.
[53] E. V. Bystritskaya, A. L. Pomerantsev, and O. Y. Rodionova, “Non-linear regression analysis: new approach to traditional implementations,” Journal of Chemometrics 14 (2000), no. 5-6, 667–692.
[54] A. L. Pomerantsev, “Successive Bayesian estimation of reaction rate constants from spectral data,” Chemometrics and Intelligent Laboratory Systems 66 (2003) 127–139.
[55] W. S. Chen, B. R. Bakshi, P. K. Goel, and S. Ungarala, “Bayesian estimation of unconstrained nonlinear dynamic systems via sequential Monte Carlo sampling,” Industrial & Engineering Chemistry Research 43 (2004), no. 14, 4012–4025.
[56] H. Li and Y. L. Huang, “Bayesian-based on-line applicability evaluation of neural network models in modeling automotive paint spray operations,” Computers & Chemical Engineering 30 (2006) 1392–1399.
[57] M. C. Coleman and D. E. Block, “Bayesian parameter estimation with informative priors for nonlinear systems,” AIChE Journal 52 (2006) 651–667.
[58] M. N. Nounou, “Dealing with collinearity in finite impulse response models using Bayesian shrinkage,” Industrial & Engineering Chemistry Research 45 (2006) 292–298.
[59] J. J. Jitjareonchai, P. M. Reilly, T. A. Duever, and D. B. Chambers, “Parameter estimation in the error-in-variables models using the Gibbs sampler,” Canadian Journal of Chemical Engineering 84 (2006) 125–138.
[60] A. C. Tamhane, C. Iordache, and R. S. H. Mah, “A Bayesian approach to gross error detection in chemical process data. 1. Model development,” Chemometrics and Intelligent Laboratory Systems 4 (1988) 33–45.
[61] A. C. Tamhane, C. Iordache, and R. S. H. Mah, “A Bayesian approach to gross error detection in chemical process data. 2. Simulation results,” Chemometrics and Intelligent Laboratory Systems 4 (1988) 131–146.
[62] K. Morad, B. R. Young, and W. Y. Svrcek, “Rectification of plant measurements using a statistical framework,” Computers & Chemical Engineering 29 (2005) 919–940.
[63] S. Ungarala and B. R. Bakshi, “A multiscale, Bayesian and error-in-variables approach for linear dynamic data rectification,” Computers & Chemical Engineering 24 (2000) 445–451.
[64] B. R. Bakshi, M. N. Nounou, P. K. Goel, and X. T. Shen, “Multiscale Bayesian rectification of data from linear steady-state and dynamic systems without accurate models,” Industrial & Engineering Chemistry Research 40 (2001), no. 1, 261–274.
[65] B. M. Colosimo and E. del Castillo, eds., Bayesian Process Monitoring, Control and Optimization. Chapman & Hall/CRC, 2007.
[66] P. J. Brown, M. Vannucci, and T. Fearn, “Bayesian wavelength selection in multicomponent analysis,” Journal of Chemometrics 12 (1998) 173–182.
[67] M. Vannucci, N. Sha, and P. J. Brown, “NIR and mass spectra classification: Bayesian methods for wavelet-based feature selection,” Chemometrics and Intelligent Laboratory Systems 77 (2005) 139–148.
[68] S. Ghosh, W. H. Dennis, N. M. Petty, R. B. Melchert, B. Luo, D. F. Grant, and D. K. Dey, “Statistical approach to metabonomic analysis of rat urine following surgical trauma,” Journal of Chemometrics 20 (2006) 87–98.
[69] T. Naes and U. Indahl, “A unified description of classical classification methods for multicollinear data,” Journal of Chemometrics 12 (1998) 205–220.
[70] Y. Mallet, D. Coomans, and O. de Vel, “Recent developments in discriminant analysis on high dimensional spectral data,” Chemometrics and Intelligent Laboratory Systems 35 (1996), no. 2, 157–173.
[71] Y. Yamashita, “Supervised learning for the analysis of process operational data,” Computers & Chemical Engineering 24 (2000) 471–474.
[72] F. Z. Chen and X. Z. Wang, “Software sensor design using Bayesian automatic classification and back-propagation neural networks,” Industrial & Engineering Chemistry Research 37 (1998), no. 10, 3985–3991.
[73] T. M. Hancewicz and J. H. Wang, “Discriminant image resolution: a novel multivariate image analysis method utilizing a spatial classification constraint in addition to bilinear nonnegativity,” Chemometrics and Intelligent Laboratory Systems 77 (2005) 18–31.
[74] K. Torabi, S. Sayad, and S. T. Balke, “On-line adaptive Bayesian classification for in-line particle image monitoring in polymer film manufacturing,” Computers & Chemical Engineering 30 (2005) 18–27.
[75] M. Kim, Y. H. Lee, and C. G. Han, “Real-time classification of petroleum products using near-infrared spectra,” Computers & Chemical Engineering 24 (2000) 513–517.
[76] J. L. Liu, “Process monitoring using Bayesian classification on PCA subspace,” Industrial & Engineering Chemistry Research 43 (2004), no. 24, 7815–7825.
[77] V. D. Emerenciano, M. J. P. Ferreira, M. D. Branco, and J. E. Dubois, “The application of Bayes’ theorem in natural products as a guide for skeletons identification,” Chemometrics and Intelligent Laboratory Systems 40 (1998) 83–92.
[78] C. Niedzinski and R. Z. Morawski, “Estimation of low concentrations in the presence of high concentrations using Bayesian algorithms for interpretation of spectrophotometric data,” Journal of Chemometrics 18 (2004) 217–230.
[79] K. Nolsoe, M. Kessler, J. Perez, and H. Madsen, “Bayesian conformational analysis of ring molecules through reversible jump MCMC,” Journal of Chemometrics 19 (2005) 412–426.
[80] A. J. Willis, “A Bayesian online inferential model for evaluation of analyzer performance,” Journal of Chemometrics 19 (2005) 90–96.
[81] S. Moussaoui, C. Carteret, D. Brie, and A. Mohammad-Djafari, “Bayesian analysis of spectral mixture data using Markov chain Monte Carlo methods,” Chemometrics and Intelligent Laboratory Systems 81 (2006) 137–148.
[82] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological) 58 (1996) 267–288.
[83] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” Annals of Statistics 32 (2004) 407–499.
[84] M. N. Nounou, B. R. Bakshi, P. K. Goel, and X. Shen, “Bayesian principal component analysis,” Journal of Chemometrics 16 (2002), no. 11, 576–595.
[85] M. N. Nounou, B. R. Bakshi, P. K. Goel, and X. Shen, “Process modeling by Bayesian latent variable regression,” AIChE Journal 48 (2002), no. 8, 1775–1793.
[86] D. Gamerman, Markov Chain Monte Carlo. Chapman & Hall, 1997.
[87] R. M. Neal, “Probabilistic inference using Markov chain Monte Carlo methods,” technical report, Department of Computer Science, University of Toronto, 1993.
[88] H. L. Shen, G. Nelson, S. Kennedy, D. Nelson, J. Johnson, D. Spiller, M. R. H. White, and D. B. Kell, “Automatic tracking of biological cells and compartments using particle filters and active contours,” Chemometrics and Intelligent Laboratory Systems 82 (2006) 276–282.
[89] G. Casella and R. L. Berger, Statistical Inference. Duxbury Press, 2nd ed., 2001.
[90] T. C. Hsiang, “A Bayesian view on ridge regression,” The Statistician 24 (1975), no. 4, 267–268.
[91] A. E. Hoerl, “Application of ridge analysis to regression problems,” Chemical Engineering Progress 58 (1962) 54–59.
[92] M. A. Girshick, “On the sampling theory of roots of determinantal equations,” Annals of Mathematical Statistics 10 (1939) 203–224.
[93] C. M. Bishop, “Bayesian PCA,” in Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pp. 382–388. MIT Press, Cambridge, MA, USA, 1999.
[94] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society. Series B (Methodological) 39 (1977) 1–38.
[95] H. Chen, “Tutorial on Monte Carlo sampling,” technical report, Department of Chemical & Biomolecular Engineering, The Ohio State University, 2005.
[96] A. Doucet, N. de Freitas, and N. Gordon, eds., Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2001.
[97] J. B. Rawlings and B. R. Bakshi, “Particle filtering and moving horizon estimation,” Computers & Chemical Engineering 30 (2006) 1529–1541.
[98] A. E. Gelfand and A. F. M. Smith, “Sampling-based approaches to calculating marginal densities,” Journal of the American Statistical Association 85 (1990) 399–409.
[99] S. Kirkpatrick, C. Gelatt, and M. Vecchi, “Optimization by simulated annealing,” Science 220 (1983), no. 4598, 671–680.
[100] S. N. MacEachern and L. M. Berliner, “Subsampling the Gibbs sampler,” The American Statistician 48 (1994) 188–190.
[101] D. J. Spiegelhalter, A. Thomas, N. G. Best, W. R. Gilks, and D. Lunn, “BUGS: Bayesian inference using Gibbs sampling.” http://www.mrc-bsu.cam.ac.uk/bugs/, 1994–2003.
[102] D. J. Lunn, A. Thomas, N. Best, and D. Spiegelhalter, “WinBUGS - a Bayesian modelling framework: concepts, structure, and extensibility,” Statistics and Computing 10 (2000), no. 4, 325–337.
[103] L. Tierney, “Markov chains for exploring posterior distributions,” The Annals of Statistics 22 (1994), no. 4, 1701–1762.
[104] W. S. Chen, “Tutorial on sequential Monte Carlo sampling,” technical report, Department of Chemical & Biomolecular Engineering, The Ohio State University, 2002.
[105] P. D. Wentzell, D. T. Andrews, and B. R. Kowalski, “Maximum likelihood multivariate calibration,” Analytical Chemistry 69 (1997) 2299–2311.
[106] C. R. Rao, Linear Statistical Inference and Its Applications. Wiley, 2nd ed., 1973.
[107] J. D. Gibbons, Nonparametric Statistical Inference. M. Dekker, 2nd ed., 1985.
[108] P. D. Wentzell and L. V. Montoto, “Comparison of principal components regression and partial least squares regression through generic simulations of complex mixtures,” Chemometrics and Intelligent Laboratory Systems 48 (1999) 167–180.
[109] J. H. Kalivas, “Two data sets of near infrared spectra,” Chemometrics and Intelligent Laboratory Systems 37 (1997) 255–259.
[110] P. Williams and K. Norris, eds., Near-Infrared Technology in the Agricultural and Food Industries. American Association of Cereal Chemists, Inc., 2nd ed., 2001.
[111] V. Centner, J. Verdu-Andres, B. Walczak, D. Jouan-Rimbaud, F. Despagne, L. Pasti, R. Poppi, D. Massart, and O. E. de Noord, “Comparison of multivariate calibration techniques applied to experimental NIR data sets,” Applied Spectroscopy 54 (2000), no. 4.
[112] H. W. Siesler, Y. Ozaki, S. Kawata, and H. M. Heise, eds., Near-Infrared Spectroscopy: Principles, Instruments, Applications. Wiley-VCH, 2002.
[113] P. H. Garthwaite and J. M. Dickey, “Quantifying and using expert opinion for variable-selection problems in regression,” Chemometrics and Intelligent Laboratory Systems 35 (1996) 1–26.
[114] J. Gill and L. D. Walker, “Elicited priors for Bayesian model specifications in political science research,” Journal of Politics 67 (2005), no. 3, 841–872.
[115] H. V. D. Waterbeemd, Chemometric Methods in Molecular Design. VCH, 1995.
[116] D. J. Livingstone, “The characterization of chemical structures using molecular properties. A survey,” Journal of Chemical Information and Computer Sciences 40 (2000), no. 2.
[117] A. Böcker, G. Schneider, and A. Teckentrup, “Status of HTS data mining approaches,” QSAR & Combinatorial Science 23 (2004), no. 4.
[118] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[119] M. K. Warmuth, J. Liao, G. Rätsch, M. Mathieson, S. Putta, and C. Lemmen, “Active learning with support vector machines in the drug discovery process,” Journal of Chemical Information and Computer Sciences 43 (2003), no. 2.
[120] M. Song, C. M. Breneman, J. Bi, N. Sukumar, K. P. Bennett, S. Cramer, and N. Tugcu, “Prediction of protein retention times in anion-exchange chromatography systems using support vector regression,” Journal of Chemical Information and Computer Sciences 42 (2002), no. 6.
[121] B. Lučić and N. Trinajstić, “A new efficient approach for variable selection based on multiregression: prediction of gas chromatographic retention times and response factors,” Journal of Chemical Information and Computer Sciences 39 (1999), no. 3.
[122] B. Lučić, D. Amić, and N. Trinajstić, “Nonlinear multivariate regression outperforms several concisely designed neural networks on three QSPR data sets,” Journal of Chemical Information and Computer Sciences 40 (2000), no. 2.
[123] B. Lučić, D. Nadramija, I. Bašić, and N. Trinajstić, “Toward generating simpler QSAR models: nonlinear multivariate regression versus several neural network ensembles and some related methods,” Journal of Chemical Information and Computer Sciences 43 (2003), no. 4.
[124] H. Gao, “Application of BCUT metrics and genetic algorithm in binary QSAR analysis,” Journal of Chemical Information and Computer Sciences 41 (2001), no. 2.
[125] D. Xie, A. Tropsha, and T. Schlick, “An efficient projection protocol for chemical databases: singular value decomposition combined with truncated-Newton minimization,” Journal of Chemical Information and Computer Sciences 40 (2000), no. 1.
[126] P. Labute, S. Nilar, and C. Williams, “A probabilistic approach to high throughput drug discovery,” Combinatorial Chemistry & High Throughput Screening 5 (2002), no. 2.
[127] J. Racine and Q. Li, “Nonparametric estimation of regression functions with both categorical and continuous data,” Journal of Econometrics 119 (2004), no. 1, 99–130.
[128] J. Kuha, “Estimation by data augmentation in regression models with continuous and discrete covariates measured with error,” Statistics in Medicine 16 (1997), no. 2, 189–201.
[129] M. A. Tanner and W. H. Wong, “The calculation of posterior distributions by data augmentation,” Journal of the American Statistical Association 82 (1987), no. 398, 528–550.
[130] H. Wold, “Soft modelling by latent variables: the non-linear iterative partial least squares (NIPALS) approach,” in Perspectives in Probability and Statistics, In Honor of M. S. Bartlett, J. Gani, ed., pp. 117–144. Academic Press, 1976.
[131] B. R. Bakshi and U. Utojo, “Unification of neural and statistical modeling methods that combine inputs by linear projection,” Computers & Chemical Engineering 22 (1998) 1859–1878.
[132] J. Bond and G. Michailidis, “Interactive correspondence analysis in a dynamic object-oriented environment,” technical report, University of California, Los Angeles, 1997.
[133] M. N. Leger, L. Vega-Montoto, and P. D. Wentzell, “Methods for systematic investigation of measurement error covariance matrices,” Chemometrics and Intelligent Laboratory Systems 77 (2005) 181–205.
[134] C. M. Crowe, “Data reconciliation - progress and challenges,” Journal of Process Control 6 (1996) 89–98.
[135] G. A. Almasy and R. S. H. Mah, “Estimation of measurement error variances from process data,” Industrial & Engineering Chemistry Research 23 (1984) 779–784.
[136] M. Darouach, R. Ragot, M. Zasadzinski, and G. Krzakala, “Maximum likelihood estimator of measurement error variances in data reconciliation,” in IFAC ALPAC Symposium 2, pp. 135–139, 1992.
[137] J. Y. Keller, M. Zasadzinski, and M. Darouach, “Analytical estimator of measurement error variances in data reconciliation,” Computers & Chemical Engineering 16 (1992) 185–188.
[138] J. Chen, A. Bandoni, and J. A. Romagnoli, “Robust estimation of measurement error variance/covariance from process sampling data,” Computers & Chemical Engineering 21 (1996) 593–600.
[139] K. Morad, W. Y. Svrcek, and I. McKay, “A robust direct approach for calculating measurement error covariance matrix,” Computers & Chemical Engineering 23 (1999) 889–897.
[140] D. Maquin, S. Narasimhan, and J. Ragot, “Data validation with unknown variance matrix,” Computers & Chemical Engineering 23 (1999) S609–S612.
[141] A. Mirabedini and D. Hodouin, “Calculation of variance and covariance of sampling errors in complex mineral processing systems, using state-space dynamic models,” International Journal of Mineral Processing 55 (1998) 1–20.
[142] U. V. Toussaint and V. Dose, “Bayesian inference in surface physics,” Applied Physics A 82 (2006) 403–413.
[143] A. Gelman, “Prior distributions for variance parameters in hierarchical models,” Bayesian Analysis 1 (2006) 515–533.
[144] H. Jeffreys, Theory of Probability. Oxford University Press, 1961.
[145] J. Ingham, I. J. Dunn, E. Heinzle, and J. E. Prenosil, Chemical Engineering Dynamics. Wiley-VCH, 2nd ed., 2000.
[146] H. Binous, “Separation of a water-methanol mixture using a four stage batch distillation column.” http://www.mathworks.com/matlabcentral/fileexchange/, 2006.
[147] S. Skogestad and M. Morari, “Understanding the dynamic behavior of distillation columns,” Industrial & Engineering Chemistry Research 27 (1988) 1848–1862.
[148] T. Mejdell and S. Skogestad, “Estimation of distillation compositions from multiple temperature measurements using partial-least-squares regression,” Industrial & Engineering Chemistry Research 30 (1991) 2543–2555.
[149] S. Skogestad and I. Postlethwaite, Multivariable Feedback Control. Wiley, 1996.
[150] S. Skogestad, “Matlab distillation column model (“Column A”).” http://www.nt.ntnu.no/users/skoge/book/matlab_m/cola/cola.html, 1997.
[151] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis. Chapman & Hall/CRC, 2nd ed., 2003.
[152] E. Jaynes, “Prior probabilities,” IEEE Transactions on Systems Science and Cybernetics 4 (1968), no. 3, 227–241.
[153] C. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal 27 (1948) 379–423, 623–656.
[154] A. Kagan, Y. Linnik, and C. Rao, Characterization Problems in Mathematical Statistics. Wiley, 1973.
[155] D. Knuth, The Art of Computer Programming, vol. 3. Addison-Wesley, 1997.
[156] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics 38 (1965) 114–117.
[157] M. J. Schervish, Theory of Statistics. Springer, 1995.