SOME CONTRIBUTIONS TO BAYESIAN REGULARIZATION METHODS WITH APPLICATIONS TO GENETICS AND CLINICAL TRIALS

by HIMEL MALLICK

NENGJUN YI, PHD, CHAIR
INMACULADA B. ABAN, PHD
GUSTAVO DE LOS CAMPOS, PHD
RAGIB HASAN, PHD
XIAOGANG SU, PHD
HEMANT K. TIWARI, PHD

A DISSERTATION

Submitted to the graduate faculty of The University of Alabama at Birmingham,
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

BIRMINGHAM, ALABAMA

2015


© Copyright by Himel Mallick 2015
All Rights Reserved

SOME CONTRIBUTIONS TO BAYESIAN REGULARIZATION METHODS WITH APPLICATIONS TO GENETICS AND CLINICAL TRIALS

HIMEL MALLICK

BIOSTATISTICS

ABSTRACT

Variable selection refers to the class of problems where one tries to find an optimal subset of relevant variables that can be used to accurately predict the outcome of a certain response variable. Typically, a large number of variables are collected, but only a few of them are relevant for predicting the outcome, so the underlying representation is sparse. To this end, variable selection is fundamental in high-dimensional data analysis, playing a crucial role in important scientific discovery and decision-making, and it has received enormous attention in the literature. Regularization is one attractive approach that has proven successful for dealing with high-dimensional data. In the last two decades, a large amount of effort has gone into the development of regularization methods for high-dimensional variable selection problems arising in diverse scientific disciplines. These methods facilitate automatic variable selection by setting certain coefficients to zero and shrinking the remainder, and they provide useful estimates even if the model includes a large number of highly correlated variables. While great progress has been made in the last two decades, further improvements are still possible. With the dramatic increase in computing power and new and emerging efficient algorithms, Bayesian regularization approaches


to variable selection have become increasingly popular. Bayesian regularization methods often outperform their frequentist analogs by providing smaller prediction errors while selecting the most relevant variables in a parsimonious way. In this dissertation, we develop a set of novel Bayesian regularization methods for linear models. Extensive simulation studies and real data analyses are carried out, which demonstrate the superior performance of the proposed methods compared to their frequentist counterparts. Some of these methods are applied to select subgroups of patients with differential treatment effects from data collected in a historical clinical trial. We also apply our methods to the problem of detecting rare variants in a genetic association study. In addition, some associated theoretical properties are investigated, which give deeper insight into the asymptotic behavior of these approaches. Finally, we discuss possible extensions of our approach and present a unified framework for variable selection in general models beyond linear regression. This dissertation thus introduces new and efficient tools for practitioners and researchers alike for conducting Bayesian variable selection in a variety of real life problems.


DEDICATION

To my wife, Gargi
To my parents, Ajit and Ratna


ACKNOWLEDGEMENTS My PhD journey would not have been possible without the concerted effort of several individuals who in one way or another contributed toward this rewarding experience. I want to take this opportunity to thank them all. First and foremost, I would like to express my heartfelt gratitude to my dissertation advisor Dr. Nengjun Yi for his enthusiasm, supervision, and inspiration over the past years. An outstanding researcher himself, Dr. Yi has guided me since my first year in the department, consistently providing me timely suggestions, encouragement, and support throughout this nerve-wracking process. I feel very fortunate to have him as my mentor and advisor. I am deeply appreciative of my committee members, Dr. Inmaculada (Chichi) Aban, Dr. Gustavo de los Campos, Dr. Ragib Hasan, Dr. Xiaogang Su, and Dr. Hemant Tiwari, for their respective time, effort, and contributions toward this dissertation. I am thankful to Dr. Leslie McClure, Dr. David Allison, and Dr. David Redden for inspiring me on various occasions in and out of the classroom. As an international student, I greatly appreciate the help and support of Ms. Stacey Thompson and Ms. Daizy Walia that made my life easier regarding many practical issues related to immigration and other matters. I would like to thank the rest of the faculty and staff for their contributions to the department. I would like to thank all my collaborators at UAB and numerous great friends I met during these years in Birmingham, Alabama. Above all, I am extremely grateful to Dr. Hemant Tiwari and Dr. George Howard for providing the much-needed financial assistance that helped me smoothly sail through this journey.


I would like to thank Dr. Debasis Sengupta of the Indian Statistical Institute (ISI) (Kolkata, India), Dr. Ming Li of the University of Arkansas for Medical Sciences (UAMS) (Little Rock, AR), Dr. Peter Mesenbrink of Novartis Pharmaceuticals Corporation (East Hanover, NJ), and Dr. Lindsay Renfro of Mayo Clinic (Rochester, MN) for their guidance during my four fruitful summer internships in 2008, 2013, 2014, and 2015, respectively. Section 6.2 of Chapter 6 in this dissertation is joint work with Dr. Mesenbrink. I would like to thank all my teachers who shaped my life from kindergarten through high school. Special thanks go to my English teacher Ramendra Nath Sarkar and my Mathematics teacher Ramendra Nath Dutta, who encouraged me to choose Statistics as my major, a decision that changed my life. Special recognition also goes to my high school teacher Uday Shankar Ghosh, who nurtured my early interest in Mathematics. I am grateful to the professors of Ramakrishna Mission Residential College (Narendrapur, Kolkata, India), my undergraduate alma mater, for encouraging me to pursue a Master's degree at the prestigious Indian Institute of Technology (IIT) Kanpur (Kanpur, India). I would like to thank all my professors from IIT Kanpur for motivating me to pursue a PhD. A few conversations with Professor Debasis Kundu and Professor Amit Mitra of IIT Kanpur and Professor Anil Kumar Ghosh of ISI Kolkata were largely responsible for my coming to UAB, and I will always be indebted to them for that. Finally, I am forever grateful to my parents Ajit and Ratna for their parenting efforts and the sacrifices they have made for me. Without their help, support, and understanding, I would not be who or where I am today. I would


like to thank all my extended family members (sister, brother-in-law, parents-in-law, grandparents, aunts, uncles, cousins, and relatives) and multitudinous friends in India (whom I met during my stay in Siliguri, Kolkata, Kanpur, and various other places) for standing by me all these years in good and trying times, extending unconditional support wherever and whenever required. Last but not least, I thank my wife Gargi for her patience, endless support, and never-ending care throughout the last few years of my PhD. Her love has transformed my entire existence, and I will be forever indebted to her for being such a wonderful part of my life.


TABLE OF CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
1.1 BACKGROUND
1.2 MAJOR CONTRIBUTIONS
1.3 OUTLINE OF THE THESIS
2 LITERATURE REVIEW
2.1 INTRODUCTION
2.1.1 STANDARD NORMAL LINEAR MODEL
2.1.2 BAYESIAN HIERARCHICAL NORMAL LINEAR MODEL
2.2 CLASSICAL MODEL SELECTION METHODS
2.2.1 MALLOWS’S CP
2.2.2 AKAIKE’S INFORMATION CRITERIA (AIC)
2.2.3 BAYESIAN INFORMATION CRITERIA (BIC)
2.2.4 DEVIANCE INFORMATION CRITERIA (DIC)
2.2.5 EXTENDED BIC (EBIC)
2.3 REGULARIZATION METHODS
2.3.1 LASSO AND RELATED REGULARIZATION METHODS
2.3.2 GROUP REGULARIZATION METHODS
2.4 BAYESIAN METHODS
2.4.1 OVERVIEW OF BAYESIAN FRAMEWORK
2.4.2 OVERVIEW OF MCMC METHODOLOGY
2.4.3 BAYESIAN REGULARIZATION METHODS
3 A NEW BAYESIAN LASSO
3.1 INTRODUCTION
3.2 SCALE MIXTURE OF UNIFORM DISTRIBUTION
3.3 THE MODEL
3.3.1 MODEL HIERARCHY AND PRIOR DISTRIBUTIONS
3.3.2 FULL CONDITIONAL POSTERIOR DISTRIBUTIONS
3.3.3 MCMC SAMPLING FOR THE NEW BAYESIAN LASSO
3.4 SIMULATION STUDIES
3.4.1 PREDICTION
3.4.2 VARIABLE SELECTION
3.4.3 SOME COMMENTS
3.5 REAL DATA ANALYSES
3.5.1 THE DIABETES EXAMPLE
3.5.2 THE PROSTATE EXAMPLE
3.6 EXTENSIONS
3.6.1 MCMC FOR GENERAL MODELS
3.6.2 SIMULATION EXAMPLES FOR GENERAL MODELS
3.7 COMPUTING MAP ESTIMATES
3.7.1 ECM ALGORITHM FOR LINEAR MODELS
3.7.2 ECM ALGORITHM FOR GENERAL MODELS
3.8 CONCLUDING REMARKS
4 BAYESIAN BRIDGE REGRESSION
4.1 INTRODUCTION
4.2 THE MODEL
4.2.1 SCALE MIXTURE OF UNIFORM DISTRIBUTION
4.2.2 HIERARCHICAL REPRESENTATION
4.3 MCMC SAMPLING
4.3.1 FULL CONDITIONAL DISTRIBUTIONS
4.3.2 SAMPLING COEFFICIENTS AND LATENT VARIABLES
4.3.3 SAMPLING HYPERPARAMETERS
4.4 POSTERIOR CONSISTENCY UNDER BAYESIAN BRIDGE MODEL
4.5 RESULTS
4.5.1 SIMULATION EXPERIMENTS
4.5.2 REAL DATA ANALYSES
4.6 COMPUTING MAP ESTIMATES
4.7 EXTENSION TO GENERAL MODELS
4.8 CONCLUSION AND DISCUSSION
5 THE BAYESIAN GROUP BRIDGE FOR BI-LEVEL VARIABLE SELECTION
5.1 INTRODUCTION
5.2 THE GROUP BRIDGE PRIOR
5.2.1 FORMULATION
5.2.2 PROPERTIES
5.3 MODEL HIERARCHY AND PRIOR DISTRIBUTIONS
5.3.1 HIERARCHICAL REPRESENTATION
5.3.2 FULL CONDITIONAL DISTRIBUTIONS
5.4 MCMC SAMPLING
5.4.1 SAMPLING COEFFICIENTS AND THE LATENT VARIABLES
5.4.2 SAMPLING HYPERPARAMETERS
5.5 RESULTS
5.5.1 THE BIRTH WEIGHT EXAMPLE
5.5.2 SIMULATION EXPERIMENTS
5.6 EXTENSIONS
5.6.1 EXTENSION TO GENERAL MODELS
5.6.2 EXTENSION TO OTHER GROUP PENALIZED REGRESSION METHODS
5.7 DISCUSSION
6 APPLICATIONS
6.1 INTRODUCTION
6.2 BAYESIAN REGULARIZED SUBGROUP SELECTION FOR DIFFERENTIAL TREATMENT EFFECTS
6.2.1 BACKGROUND
6.2.2 NOTATIONS
6.2.3 GENERAL FRAMEWORK FOR SUBGROUP SELECTION VIA REGULARIZATION
6.2.4 COMPARISON OF DIFFERENT SCORING SYSTEMS
6.2.5 APPLICATION TO THE ACTG 320 STUDY
6.2.6 CONCLUSIONS
6.3 EVALUATION OF THE BAYESIAN GROUP BRIDGE FOR IDENTIFYING ASSOCIATION WITH RARE VARIANTS IN THE DALLAS HEART STUDY (DHS)
6.3.1 POPULATION-BASED RESEQUENCING OF ANGPTL4 AND TRIGLYCERIDES
6.3.2 CONCLUDING REMARKS
7 CONCLUSIONS
7.1 SUMMARY
7.2 FUTURE DIRECTIONS
7.2.1 SCALABLE VARIATIONAL ALGORITHMS FOR HIGH DIMENSIONAL VARIABLE SELECTION
7.2.2 BAYESIAN REGULARIZED MODEL AVERAGING FOR SUBGROUP SELECTION
7.2.3 BAYESIAN GROUP VARIABLE SELECTION FOR SEMIPARAMETRIC PROPORTIONAL HAZARDS MODEL FOR HIGH-DIMENSIONAL SURVIVAL DATA
7.2.4 BAYESIAN ANALYSIS OF OTHER REGULARIZATIONS
7.2.5 GEOMETRIC ERGODICITY OF BAYESIAN REGULARIZATION METHODS
7.2.6 POSTERIOR CONSISTENCY OF GROUP PRIORS
7.2.7 BAYESIAN SHRINKAGE PRIORS FOR ZERO-INFLATED AND MULTIPLE-INFLATED COUNT DATA
7.2.8 VARIABLE SELECTION USING SHRINKAGE PRIORS
7.2.9 R AND C++ INTEGRATION FOR BAYESIAN REGULARIZATION METHODS
7.2.10 STATISTICAL PARALLEL COMPUTING FOR BAYESIAN REGULARIZATION METHODS
APPENDICES
A APPENDIX FOR CHAPTER 3
B APPENDIX FOR CHAPTER 4
C APPENDIX FOR CHAPTER 5
BIOGRAPHICAL SKETCH

LIST OF TABLES

2.1 Various Penalty Functions
3.1 Median mean squared error (MMSE) based on 100 replications for Example 1
3.2 Median mean squared error (MMSE) based on 100 replications for Example 2
3.3 Median mean squared error (MMSE) based on 100 replications for Example 3
3.4 Median mean squared error (MMSE) based on 100 replications for Example 4
3.5 Frequency of correctly-fitted models over 100 replications for Example 5
3.6 Frequency of correctly-fitted models over 100 replications for Example 6
3.7 Frequency of correctly-fitted models over 100 replications for Example 7
3.8 Frequency of correctly-fitted models over 100 replications for Example 8
3.9 Prostate Cancer Data Analysis - Mean squared prediction errors based on 30 observations of the test set for four methods: New Bayesian Lasso (NBLasso), Original Bayesian Lasso (OBLasso), Lasso, and OLS
3.10 Simulation Results for Logistic Regression
3.11 Simulation Results for Cox's Model
4.1 MMSE based on 100 replications for Model 7
4.2 Mean squared prediction errors for Prostate and Pollution data analyses
5.1 Summary of birth weight data analysis results using linear regression. 'MSE' reports the mean squared errors (MSE's) on the test data
5.2 A comparison of GB and BAGB with respect to MSE's for different values of α, based on 38 test set observations. In GB, λ is chosen by BIC and the BAGB estimates are based on 15,000 posterior samples
5.3 Results from Simulation 1. The numbers in parentheses are the corresponding standard errors of the RPE's based on 500 bootstrap resampling of the 100 RPE's. The bold numbers are significantly smaller than others
5.4 Results from Simulation 2. The numbers in parentheses are the corresponding standard errors of the RPE's based on 500 bootstrap resampling of the 100 RPE's. The bold numbers are significantly smaller than others
5.5 Summary of the birth weight data analysis results using logistic regression. 'AUC' reports the area under the curve (AUC) estimates on the test data. 'Test Error' reports the misclassification errors on the test data
6.1 Summary of the ACTG 320 data analysis results (ABC Measures)
6.2 Summary of the DHS data analysis results. Posterior probabilities in the scaled neighborhood, along with corresponding GB, GL and SGL estimates

LIST OF FIGURES

2.1 A Graphical Illustration of the Properties of Three Penalty Functions: Lasso, Ridge and Bridge (α = 0.5)
3.1 Gibbs Sampler for the NBLASSO for linear models
3.2 Posterior mean Bayesian LASSO estimates (computed over a grid of λ values, using 10,000 samples after the burn-in) and corresponding 95% credible intervals (equal-tailed) of Diabetes data (n = 442) covariates. The hyperprior parameters were chosen as a = 1, b = 0.1. The OLS estimates with corresponding 95% confidence intervals are also reported. For the LASSO estimates, the tuning parameter was chosen by 10-fold CV of the LARS algorithm
3.3 Histograms based on posterior samples of Diabetes data covariates
3.4 Trace plots of Diabetes data covariates
3.5 Posterior mean Bayesian LASSO estimates (computed over a grid of λ values, using 10,000 samples after the burn-in) and corresponding 95% credible intervals (equal-tailed) of Prostate data (n = 67) covariates. The hyperprior parameters were chosen as a = 1, b = 0.1. The OLS estimates with corresponding 95% confidence intervals are also reported. For the LASSO estimates, the tuning parameter was chosen by 10-fold CV of the LARS algorithm
3.6 Trace plots of Prostate data covariates
3.7 Histograms based on posterior samples of Prostate data covariates
3.8 Gibbs Sampler for the NBLASSO for general models
4.1 Gibbs Sampler for the BBR for linear models
4.2 Boxplots summarizing the prediction performance for the seven methods for Model 1
4.3 Boxplots summarizing the prediction performance for the seven methods for Model 2
4.4 Boxplots summarizing the prediction performance for the seven methods for Model 3
4.5 Boxplots summarizing the prediction performance for the seven methods for Model 4
4.6 Boxplots summarizing the prediction performance for the seven methods for Model 5
4.7 Boxplots summarizing the prediction performance for the seven methods for Model 6
4.8 Boxplots summarizing the prediction performance for the five methods for Model 8
4.9 For the prostate data, posterior mean estimates and corresponding 95% equal-tailed credible intervals for the Bayesian methods. Overlaid are the LASSO, elastic net, and classical bridge estimates based on cross-validation
4.10 For the pollution data, posterior mean estimates and corresponding 95% equal-tailed credible intervals for the Bayesian methods. Overlaid are the LASSO, elastic net, and classical bridge estimates based on cross-validation
4.11 Gibbs Sampler for the BBR for general models
5.1 Density plots of four bivariate group bridge priors corresponding to α = {0.05, 0.2, 0.5, 0.8} and λ = 1, based on 10,000 samples drawn from the group bridge prior distribution
5.2 Density plots of four bivariate group bridge priors corresponding to α = {1, 2, 5, 10} and λ = 1, based on 10,000 samples drawn from the group bridge prior distribution
5.3 Gibbs Sampler for the BAGB for linear models
5.4 Histogram of α based on 15,000 MCMC samples assuming a Beta(1, 1) prior (left) and the corresponding density plot (right) for the Birthweight data analysis, assuming a linear model. Solid red line: posterior mean, dotted green line: posterior mode, and dashed purple line: posterior median
5.5 Histogram of α based on 15,000 MCMC samples assuming a Beta(10, 10) prior (left) and the corresponding density plot (right) for the Birthweight data analysis, assuming a linear model. Solid red line: posterior mean, dotted green line: posterior mode, and dashed purple line: posterior median
5.6 Marginal posterior densities for the marginal effects of the 16 predictors in the Birthweight data. Solid red line: penalized likelihood (group bridge) solution with λ chosen by BIC. Dashed green line: marginal posterior mean for βj. Dotted purple line: mode of the marginal distribution for βj under the fully Bayes posterior. The BAGB estimates are based on 15,000 posterior samples
5.7 Gibbs Sampler for the BAGB for general models
5.8 For the birth weight data, posterior mean Bayesian group bridge estimates and corresponding 95% equal-tailed credible intervals based on 15,000 Gibbs samples using a logistic regression model. Overlaid are the corresponding GB, GL, and SGL estimates
6.1 ACTG 320 Data Analysis - Two Separate Models (Based on 100 Random Cross-validations)
6.2 ACTG 320 Data Analysis - Single Interaction Model (Based on 100 Random Cross-validations)
6.3 Histogram of the untransformed and log-transformed plasma levels of triglyceride in the DHS dataset
6.4 Marginal posterior densities for the marginal effects of 5 selected predictors in the DHS data. Solid red line: penalized likelihood (group bridge) solution with λ chosen by BIC. Dashed green line: marginal posterior mean for βj. Dotted purple line: mode of the marginal distribution for βj under the fully Bayes posterior (BAGB). The BAGB estimates are based on 25,000 posterior samples
6.5 Histogram of α based on 25,000 MCMC samples assuming a Beta(10, 10) prior (left) and the corresponding density plot (right) for the DHS data analysis. Solid red line: posterior mean, dotted green line: posterior mode, and dashed purple line: posterior median
6.6 Trace plots (left) and ACF plots (right) for the selected variables for the DHS data analysis, based on 25,000 posterior samples


INTRODUCTION

1.1 BACKGROUND

The high-dimensional nature of many modern datasets in diverse disciplines, along with new and emerging scientific problems, has introduced unique statistical and computational challenges for data scientists, which have reshaped statistical thinking and research in recent years. A central assumption of high-dimensional statistical modeling is that the underlying true model is sparse, meaning that the data contain many redundant variables, which provide no additional information beyond that contained in a subset of important variables. In practice, the underlying representations of many biomedical data are often sparse. For instance, in genetic association studies, even though hundreds of thousands of genetic variants and multiple environmental factors are included as potential covariates for modeling the genotype-phenotype relationship, only a few of them are relevant to the phenotype, while the others are likely to be zero or negligible. Since model misspecification can have a significant impact on a scientific result, correctly identifying relevant variables is an important issue in any scientific research. If more predictors are included in the model, a higher proportion of the response variability can be explained. On the other hand, overfitting (inclusion of predictors with null effect) results in a less reliable model with poor predictive performance1.

Various methods have been developed over the years for dealing with variable selection in high-dimensional linear models. Among these, regularization methods have become increasingly popular due to their ability to guard against overfitting, which is a serious issue in both high-dimensional and low-dimensional problems (Subramanian and Simon, 2013). Moreover, as will be discussed in the later chapters, these methods have the attractive feature that variable selection and coefficient estimation can be carried out simultaneously, with higher prediction accuracy and lower computational cost as compared to traditional model selection procedures such as best subset selection and stepwise regression. However, despite their robust performance in model selection, these methods have been criticized for providing unreliable standard errors (Kyung et al., 2010), which is arguably a major drawback of these approaches. Alternatively, much recent work has been devoted to the development of various Bayesian regularization methods for conducting variable selection in linear models. Unlike classical or non-Bayesian regularized regression approaches, Bayesian regularization methods provide, besides an appropriate point estimator, a valid measure of standard error based on a geometrically ergodic Markov chain. In addition, Bayesian analysis significantly reduces the actual complexity involved in the estimation procedure by incorporating any prior information about the parameters into the model estimation technique. With ever-increasing computing power, these methods are becoming increasingly popular and are gaining more and more traction in high-dimensional data analysis.

Some passages in this chapter have been adapted from Mallick and Yi (2013).

In this dissertation, we have proposed a variety of Bayesian regularized regression approaches for high-dimensional linear models, with applications to genetics and clinical trials. In contrast to the existing literature, which is primarily based on scale mixture of normal (SMN) representations of the associated priors, a new data augmentation strategy is introduced by utilizing scale mixture of uniform (SMU) distributions, which in turn facilitates computationally efficient Markov Chain Monte Carlo (MCMC) algorithms. As will be discussed later, the proposed strategy successfully Bayesianises2 many important non-Bayesian regularization methods with strong frequentist properties. Empirical studies and real data analyses reveal that the new strategy performs quite satisfactorily across different practical scenarios and has good mixing properties (to be discussed in Chapter 2), which ensures potential applicability of the proposed methods in a variety of real life problems. In addition, theoretical properties of some of the proposed methods have been investigated, which provide further insight into the asymptotic behavior of these approaches.

1.2 MAJOR CONTRIBUTIONS

The dissertation consists of five research papers. In the first paper, a new Bayesian solution to the LASSO problem (Tibshirani, 1996) is put forward using the SMU representation of the Bayesian LASSO prior (Park and Casella, 2008). In the second paper, this technique has been extended to the Bayesian bridge estimator, a natural extension of the Bayesian LASSO. In the third paper,

Bayesianism refers to the methods in Bayesian statistics, named after Thomas Bayes (1701-1761), a British statistician who invented the famous Bayes' theorem (Fienberg, 2006), which is widely considered the foundation of Bayesian inference.

the methodology is further extended to the Bayesian group bridge estimator, which is intended for variable selection for grouped predictors. In the fourth paper, some of these methods are applied to data from a historical clinical trial for the selection of subgroups with differential treatment effects. Some of these methods are also applied to the problem of rare variant detection in a genetic association study, which constitutes the fifth paper. As will be discussed in more detail in subsequent chapters, the major contributions of this dissertation are as follows: (i) introduction of the bridge and group bridge shrinkage priors for Bayesian variable selection and their connections to other existing sparsity-inducing priors, (ii) introduction of the SMU representations for various emerging shrinkage priors, which are relatively new and unexplored, and which provide a unified framework for Bayesian regularization methods, (iii) a set of simple and easily implementable MCMC algorithms (Gibbs samplers), (iv) adaptive treatment of the tuning parameters and estimation of the hyperparameters from the data, and (v) extension of the methods beyond linear regression (e.g., Cox's model, the Generalized Linear Model (GLM), etc.), providing a flexible framework for variable selection in both low-dimensional and high-dimensional scenarios for modeling a variety of outcomes.

1.3 OUTLINE OF THE THESIS

The remainder of the thesis is organized in the following manner: Chapter 2 consists of a literature review and background information on several existing model and variable selection methods. Some of these methods will be used for comparison purposes in our simulation studies and real data

analyses; some of them will be used as building blocks of the proposed methods. Chapter 3 introduces a new Bayesian solution to the LASSO problem. Chapter 4 presents the Bayesian bridge estimator. Chapter 5 discusses the Bayesian group bridge estimator, which is intended for bi-level variable selection (to be discussed later in Chapter 2). Chapter 6 includes two real life applications: Section 6.2 presents the application of the proposed methods to the problem of selecting subgroups of patients with differential treatment effects from randomized clinical trial (RCT) data, and Section 6.3 presents the application of the proposed methods to the problem of rare variant detection in a genetic association study. Chapter 7 discusses some promising directions for future research along with several concluding remarks on the proposed methods. Related proofs and derivations are included in the appendix.


REFERENCES

Fienberg, S. E. (2006). When did Bayesian inference become "Bayesian"? Bayesian Analysis 1(1), 1–40.

Kyung, M., J. Gill, M. Ghosh, and G. Casella (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis 5(2), 369–412.

Mallick, H. and N. Yi (2013). Bayesian methods for high dimensional linear models. Journal of Biometrics & Biostatistics S1, 005.

Park, T. and G. Casella (2008). The Bayesian lasso. Journal of the American Statistical Association 103(482), 681–686.

Subramanian, J. and R. Simon (2013). Overfitting in prediction models – is it a problem only in high dimensions? Contemporary Clinical Trials 36(2), 636–641.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 58(1), 267–288.


LITERATURE REVIEW

2.1 INTRODUCTION

In this chapter, we present a selective overview of various frequentist and Bayesian model and variable selection methods in the context of linear and hierarchical linear models. The review is structured as follows. First, we give a brief overview of traditional model selection methods (Section 2.2), followed by a discussion of numerous regularization methods for linear models (Section 2.3) and their relations to Bayesian arguments (Section 2.4). Section 2.4.1 gives a short overview of the Bayesian framework. In Section 2.4.2, we present a short review of MCMC algorithms and various other aspects of MCMC computation. Section 2.4.3 gives a very brief introduction to Bayesian regularization approaches. Before doing so, we first describe the standard normal linear model and the standard Bayesian hierarchical normal linear model.

2.1.1 STANDARD NORMAL LINEAR MODEL

Linear models are probably the most widely used statistical models to investigate the influence of a set of predictors on a response variable. In the simplest case of a normal linear regression model, it is assumed that the mean of the

response variable can be described as a linear function of a set of predictors. Mathematically, we have the following model:

y = μ1_n + Xβ + ε,    (2.1)

where 1_n is the vector of size n with all entries equal to 1, y is the n × 1 vector of responses, X is the n × p design matrix of regressors, β is the p × 1 vector of coefficients to be estimated, and ε is the n × 1 vector of independent and identically distributed normal errors with mean 0 and variance σ². Without loss of generality, we assume that y and X are centered so that the intercept μ is zero and can be omitted from the model. The ordinary least squares (OLS) method estimates β by minimizing the residual sum of squares (RSS), i.e.,

\hat{β}_{OLS} = argmin_β (y − Xβ)'(y − Xβ).    (2.2)

When there is no perfect multicollinearity and the errors are homoscedastic and serially uncorrelated with finite variances, the OLS estimator is optimal in the sense that it provides the minimum variance in the class of linear unbiased estimators (Yan and Su, 2009). It also coincides with the maximum likelihood estimator (MLE) for normal linear models.
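As a concrete illustration of (2.2), the following base-R sketch computes the OLS solution from the normal equations on simulated, centered data and checks it against the built-in lm(); the sample size and coefficient values are arbitrary illustrative choices.

```r
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)       # design matrix of regressors
beta_true <- c(2, 0, -1)              # arbitrary illustrative coefficients
y <- X %*% beta_true + rnorm(n)       # i.i.d. normal errors with sd = 1

# Center y and X so the intercept can be omitted, as assumed in (2.1)
y <- y - mean(y)
X <- scale(X, center = TRUE, scale = FALSE)

# OLS estimate from the normal equations: (X'X)^{-1} X'y
beta_ols <- solve(crossprod(X), crossprod(X, y))

# Agreement with lm() fitted without an intercept on the centered data
cbind(normal_equations = drop(beta_ols),
      lm               = coef(lm(y ~ X - 1)))
```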

2.1.2 BAYESIAN HIERARCHICAL NORMAL LINEAR MODEL

In Bayesian hierarchical normal linear models, we assume that the distribution of the dependent variable y is specified conditional on the parameters β and σ² as

y | β, σ² ∼ N(Xβ, σ²I_n),    (2.3)

where I_n is an n × n identity matrix. Then, any prior information on (β, σ²) is incorporated by specifying a suitable prior distribution p(β, σ²) on them. This second-level model has its own parameters known as hyperparameters, which are usually estimated from the data. After observing the data, the prior distribution p(β, σ²) is updated by the corresponding posterior distribution, which is obtained as

p(β, σ² | y) = p(y | β, σ²) p(β, σ²) / ∫ p(y | β, σ²) p(β, σ²) d(β, σ²).    (2.4)

The posterior distribution contains all the current information about the parameters. Ideally one might fully explore the entire posterior distribution by sampling from the distribution using MCMC algorithms (Gelman et al., 2003). Due to its ability to incorporate specific hierarchical structure of the data (correlation among the predictors), hierarchical modeling is often more efficient than the traditional approaches (Gelman and Hill, 2006). In the next section, we briefly review the following: Mallows’s Cp (Mallows, 1973), AIC (Akaike, 1974), BIC (Schwarz, 1978), and EBIC (Chen and Chen, 2008) for standard normal models, and DIC (Spiegelhalter et al., 2002) for Bayesian hierarchical normal linear models, which remain the most popular model selection methods used for linear models and hierarchical linear models respectively.
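To illustrate how the posterior (2.4) can be explored by MCMC, below is a minimal Gibbs-sampler sketch for model (2.3) under simple conjugate priors, namely β | σ² ∼ N(0, σ²τ²I) and an inverse-gamma prior on σ². These prior choices and the constants tau2, a0, and b0 are illustrative assumptions for this sketch only; they are not the shrinkage priors developed later in this dissertation.

```r
gibbs_lm <- function(y, X, n_iter = 5000, tau2 = 100, a0 = 0.01, b0 = 0.01) {
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p); sigma2 <- 1
  draws <- matrix(NA, n_iter, p + 1,
                  dimnames = list(NULL, c(paste0("beta", 1:p), "sigma2")))
  XtX <- crossprod(X); Xty <- crossprod(X, y)
  for (t in 1:n_iter) {
    # beta | sigma2, y ~ N(m, sigma2 * V), with V = (X'X + I/tau2)^{-1}
    V <- solve(XtX + diag(p) / tau2)
    m <- V %*% Xty
    beta <- drop(m + t(chol(sigma2 * V)) %*% rnorm(p))
    # sigma2 | beta, y ~ Inverse-Gamma(a0 + (n+p)/2, b0 + (RSS + beta'beta/tau2)/2)
    rss <- sum((y - X %*% beta)^2)
    sigma2 <- 1 / rgamma(1, a0 + (n + p) / 2,
                         rate = b0 + (rss + sum(beta^2) / tau2) / 2)
    draws[t, ] <- c(beta, sigma2)
  }
  draws
}

# Posterior means after a burn-in, using the simulated (y, X) from the OLS sketch
post <- gibbs_lm(y, X)
colMeans(post[-(1:1000), ])
```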

2.2 CLASSICAL MODEL SELECTION METHODS

In statistical tradition, the most commonly used model selection approaches include backward, forward, and stepwise selection, where at every step predictors are either added to the model or eliminated from the model according to a precisely defined testing rule (Hwang and Hu, 2014). These conventional model selection procedures are often inadequate for high-dimensional statistical modeling, making it extremely difficult to build a parsimonious, meaningful, and interpretable model with a large number of variables. Flom and Cassell (2007) reported several practical scenarios in which the traditional model selection methods performed poorly in recovering the 'true' model even in the simplest of cases. As noted by Steyerberg (2008), stepwise selection is a poor-performing automated strategy and is not recommended (Flom and Cassell, 2007), although the practice seems to remain widespread (Kadane and Lazar, 2004). To overcome the limitations of these mechanical model selection approaches, several information-type methods have been developed, which aim to provide a trade-off between model complexity and goodness of fit (Smyth, 2001). Here we briefly review a few of these methods in the linear and hierarchical linear regression framework1.

Some passages in this chapter have been adapted from Mallick and Yi (2013).

2.2.1 MALLOWS’S CP

For a subset model with k ≤ p explanatory variables, the Cp statistic of Mallows (1973) is defined as

Cp = RSS(k)/s² − n + 2k,    (2.5)

where s² is the MSE for the full model containing p explanatory variables and RSS(k) is the residual sum of squares for the subset model containing k explanatory variables. In practice, Cp is usually plotted against k for a collection of subset models, and models with Cp approximately equal to k are taken as acceptable models in the sense of minimizing the total bias of the predicted values (Yan and Su, 2009). Woodroofe (1982) showed that Cp is a conservative model selector, which tends to overfit. Nishii (1984) showed that Cp is not consistent in selecting the 'true' model, and often tends to select a larger model as n → ∞.
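A small sketch showing how Cp in (2.5) can be computed for a handful of candidate subsets; the simulated data, the three candidate predictors, and the convention of counting k as the number of explanatory variables (excluding the intercept) follow the presentation above and are purely illustrative.

```r
set.seed(2)
n_cp <- 60
dat <- data.frame(x1 = rnorm(n_cp), x2 = rnorm(n_cp), x3 = rnorm(n_cp))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n_cp)

full <- lm(y ~ x1 + x2 + x3, data = dat)
s2 <- summary(full)$sigma^2          # MSE of the full model

# Mallows's Cp = RSS(k)/s^2 - n + 2k, equation (2.5)
mallows_cp <- function(fit, s2, n) {
  k <- length(coef(fit)) - 1         # number of explanatory variables (intercept excluded)
  sum(resid(fit)^2) / s2 - n + 2 * k
}

subsets <- c("y ~ x1", "y ~ x2", "y ~ x1 + x2", "y ~ x1 + x3", "y ~ x1 + x2 + x3")
sapply(subsets, function(f) mallows_cp(lm(as.formula(f), data = dat), s2, n_cp))
```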

2.2.2 AKAIKE’S INFORMATION CRITERIA (AIC)

The AIC of Akaike (1974) is defined as

AIC = −2 log(L) + 2p,    (2.6)

where L is the likelihood function evaluated at the MLE. Given a set of candidate models, the 'best' model is the one with the minimum AIC value. Similar to Mallows's Cp, AIC is not model selection consistent (Nishii, 1984). Here, the consistency of a model selection criterion means that the probability of the selected model being equal to the 'true' model converges to 1. More information about the AIC can be found in Burnham and Anderson (2004). The asymptotic approximation on which the AIC is based is rather poor when n is small (Dziak et al., 2005). As a remedy, Hurvich and Tsai (1989) proposed a small-sample correction, leading to the AICc statistic defined by

AICc = AIC + 2p(p + 1)/(n − p − 1).    (2.7)

AICc converges to AIC as n gets larger, and therefore, it is preferred to AIC regardless of the sample size (Dziak et al., 2005).

2.2.3 BAYESIAN INFORMATION CRITERIA (BIC)

While the AIC is motivated by the Kullback-Leibler discrepancy of the fitted model from the 'true' model, Schwarz (1978) derived the BIC from a Bayesian perspective by evaluating the leading terms of the asymptotic expansion of the Bayes factor. The BIC is defined as

BIC = −2 log(L) + p log(n),    (2.8)

where L is the likelihood function evaluated at the MLE. Similar to AIC, the model with the minimum BIC is chosen as the preferred model from a set of candidate models. It is well known that neither AIC nor BIC performs better all the time. However, unlike AIC, BIC is a consistent model selection technique, which means that as the sample size n grows, the probability that the lowest-BIC model is the 'true' model converges to 1 (Burnham and Anderson, 2004; Dziak et al., 2005). Systematic comparisons of AIC and BIC in the context of linear regression are provided in Kundu and Murali (1996) and Yang (2003).
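The criteria (2.6)-(2.8) can be computed directly from a fitted model's log-likelihood. A minimal sketch, reusing the illustrative data frame dat from the Cp example and taking p to be the number of estimated parameters reported by logLik() (which includes the error variance, matching R's built-in AIC() and BIC()):

```r
info_criteria <- function(fit) {
  ll <- logLik(fit)
  p  <- attr(ll, "df")   # number of estimated parameters (including the error variance)
  n  <- nobs(fit)
  aic  <- -2 * as.numeric(ll) + 2 * p              # equation (2.6)
  aicc <- aic + 2 * p * (p + 1) / (n - p - 1)      # equation (2.7)
  bic  <- -2 * as.numeric(ll) + p * log(n)         # equation (2.8)
  c(AIC = aic, AICc = aicc, BIC = bic)
}

rbind(m1 = info_criteria(lm(y ~ x1,           data = dat)),
      m2 = info_criteria(lm(y ~ x1 + x2,      data = dat)),
      m3 = info_criteria(lm(y ~ x1 + x2 + x3, data = dat)))
# The AIC and BIC columns agree with the built-in AIC() and BIC() functions.
```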

2.2.4 DEVIANCE INFORMATION CRITERIA (DIC)

For model selection in Bayesian hierarchical normal linear models, Spiegelhalter et al. (2002) proposed the Deviance Information Criterion (DIC), a generalization of AIC and BIC, defined as

DIC = D + 2p_D,    (2.9)

where D is the deviance evaluated at the posterior mean of the parameters and p_D is the effective number of parameters, calculated as the difference between the posterior mean deviance and the deviance of the posterior means. Like AIC and BIC, models with smaller DIC are better supported by the data. DIC is particularly useful when the MCMC samples are easily available, and is valid only when the joint distribution of the parameters is approximately multivariate normal (Spiegelhalter et al., 2002). DIC tends to select over-fitted models, which has been addressed by Ando (2007), although very little is known about its performance in high-dimensional models. As noted by Gelman et al. (2014), various other difficulties (apart from overfitting) have been noted with the DIC, but there has been no consensus on an alternative.
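Because DIC only requires posterior draws, it is straightforward to compute from MCMC output. A minimal sketch for the normal linear model, assuming a matrix of posterior draws laid out as in the Gibbs-sampler sketch of Section 2.1.2 (columns beta1, ..., betap followed by sigma2 — that sketch's convention, not a general requirement):

```r
dic_normal <- function(y, X, draws) {
  p <- ncol(X)
  betas  <- draws[, 1:p, drop = FALSE]
  sigma2 <- draws[, p + 1]
  # Deviance of the normal linear model: -2 * log-likelihood
  dev <- function(b, s2) -2 * sum(dnorm(y, mean = X %*% b, sd = sqrt(s2), log = TRUE))
  D_bar <- mean(sapply(seq_len(nrow(draws)), function(i) dev(betas[i, ], sigma2[i])))
  D_hat <- dev(colMeans(betas), mean(sigma2))  # deviance at the posterior means
  p_D   <- D_bar - D_hat                       # effective number of parameters
  c(DIC = D_hat + 2 * p_D, pD = p_D)
}

dic_normal(y, X, post[-(1:1000), ])  # y, X, and 'post' from the Gibbs sketch above
```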

2.2.5 EXTENDED BIC (EBIC)

While the above information-type procedures are widely used (Ward, 2008) and readily available in most statistical software packages (Kadane and Lazar, 2004), they tend to select more variables than necessary in high-dimensional problems, especially when the number of predictors increases with the sample size (Broman and Speed, 2002; Casella et al., 2009). In the setting of a linear regression model, if the number of covariates is of polynomial order or exponential order of the sample size, i.e., p = O(n^κ) or p = O(exp(n^κ)) for κ > 0, then it is called a high-dimensional problem (Luo and Chen, 2013). For such scenarios, as argued by Yang and Barron (1998), the overfitting problem can be substantial, resulting in severe selection bias, which damages the predictive performance of high-dimensional models.

To overcome this, various model selection criteria for high-dimensional models have been introduced recently. Wang et al. (2009) proposed a modified BIC (mBIC), which is consistent when p diverges more slowly than n. Chen and Chen (2008, 2012) developed a family of Extended Bayes Information Criteria (EBIC) for variable selection in high-dimensional problems. The family of EBIC is indexed by a parameter γ in the range [0, 1]. The extended BIC is defined as

EBIC = −2 log(L) + p log(n) + 2γ log(p),    (2.10)

where L is the likelihood function evaluated at the MLE, and 0 ≤ γ ≤ 1 is a tuning parameter. The original BIC is a special case of EBIC with γ = 0. The mBIC is also a special case of EBIC in an asymptotic sense, i.e., it is asymptotically equivalent to EBIC with γ = 1. Chen and Chen (2008) established the model selection consistency of EBIC when p = O(n^κ) and γ > 1 − 1/(2κ) for any κ > 0, where consistency implies that as n → ∞, the minimum EBIC model will converge in probability to the 'true' model. Among other developments, the General Information Criterion (GIC) proposed by Shao (1997) is known to be consistent in high dimensions. Recently, Kim et al. (2012) showed that EBIC is asymptotically equivalent to GIC.

One common concern with these methods is that no matter how theoretically attractive they are, in practice one needs to evaluate them on each of the candidate models being considered, which could be computationally burdensome if the number of candidate models is very large (Kadane and Lazar, 2004). Regularization methods provide an effective and computationally attractive alternative to these mechanical model selection methods. Contrary to the classical model selection approaches, which are discrete in nature, regularization methods are continuous due to their ability to conduct simultaneous variable selection and coefficient estimation. In the next section, we briefly review these methods.

2.3 REGULARIZATION METHODS

It is to be noted that the OLS estimator is highly unstable in the presence of multicollinearity. This is due to the fact that in the presence of strongly correlated predictors, X'X is a nearly singular matrix, and some diagonal elements of (X'X)^{-1} are inflated due to ill-conditioning, leading to very unstable, high-variance estimators (Armagan, 2008). Moreover, when p ≫ n, \hat{β}_{OLS} is no longer identifiable without making additional non-testable assumptions on the design matrix (Huang et al., 2008). Such concerns have led to the prominent development of regularization methods with various penalties to mitigate modeling biases and achieve higher prediction accuracy in linear regression (Kyung et al., 2010). In a sense, these methods regularize (hence the term regularization methods) or adjust the imprecise and volatile estimates of the regression coefficients encountered in ill-posed problems (due to either collinearity of the variables or high dimensionality), by shrinking the estimates towards zero relative to the maximum likelihood estimates (Goeman, 2010). The notion of variable selection through regularization was first introduced by Donoho and Johnstone (1994) in their study of Stein Unbiased Risk Estimation (SURE), and then further developed by Tibshirani (1996) and many other

researchers (Li, 2010). As pointed out by Fan and Li (2001), the basic idea of regularization methods is to introduce sparsity-inducing penalty terms that are non-differentiable at zero when viewed as a function of the model coefficients. The penalty terms serve to control the complexity of the model and provide criteria for variable selection by imposing some constraints on the parameters (Breheny, 2009). Although these methods are motivated by high-dimensional data, they can also be effectively applied to sparse low-to-moderate-dimensional problems (Chand, 2012), facilitating applications in a wide range of scientific problems. Next, we briefly discuss these methods in the context of linear regression along with their relative merits, demerits, and challenges. The problem of interest involves estimating a sparse vector of regression coefficients by minimizing an objective function Q(·) that is composed of a standard loss function such as the RSS (without loss of generality, the most commonly used squared error loss is considered, although other loss functions such as the hinge loss (Wahba, 1998) and the least absolute deviation (LAD) loss are also common) plus a penalty function P(·):

Q(β) = (y − Xβ)'(y − Xβ) + P_λ(β),    (2.11)

where P(·) is a function of the model coefficients indexed by a parameter λ > 0, which controls the degree of penalization. The form of P(·) determines the general behavior of regularization methods. In general, the penalty function P(·) has the following properties (Breheny, 2009):

1. It is symmetric about the origin, i.e., P(0) = 0, and

2. P(·) is non-decreasing in ||β||, where ||β|| denotes the norm of β.

This approach produces a spectrum of solutions depending on the value of λ (Breheny, 2009), which is called the regularization parameter (or tuning parameter or shrinkage parameter). The tuning parameter plays a vital role in variable selection, as it controls the degree of shrinkage that is applied to the estimates. Small values of λ lead to large models with limited bias but potentially high variance; large values of λ lead to the selection of models with fewer predictors but with less variance (Chand, 2012). In practice, the whole regularization path {\hat{β}(λ), λ > 0} can be effectively computed, from which the optimum value of λ (that yields the "correct" amount of shrinkage) should be systematically determined.
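The generic objective (2.11) is easy to write down for any plug-in penalty. The sketch below evaluates Q(β) under the LASSO penalty P_λ(β) = λΣ|β_j| over a small grid of λ values, reusing the simulated y, X, and OLS estimate from the sketch in Section 2.1.1; the grid is arbitrary and only meant to show how the penalty enters the objective.

```r
# Penalized objective Q(beta) = RSS + P_lambda(beta), as in equation (2.11)
q_obj <- function(beta, y, X, P_lambda) {
  sum((y - X %*% beta)^2) + P_lambda(beta)
}

# LASSO penalty P_lambda(beta) = lambda * sum(|beta_j|), evaluated at the OLS fit
sapply(c(0, 1, 10, 100), function(lambda)
  q_obj(drop(beta_ols), y, X, P_lambda = function(b) lambda * sum(abs(b))))
```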

2.3.1 LASSO AND RELATED REGULARIZATION METHODS

RIDGE REGRESSION

Regularization methods date back to the proposal of ridge regression by Hoerl and Kennard (1970), which finds the coefficients by minimizing the RSS subject to an L2 norm constraint on the coefficients. Equivalently, for some λ > 0, the solution \hat{β}_{RIDGE} can be written as follows:

\hat{β}_{RIDGE} = argmin_β (y − Xβ)'(y − Xβ) + λ Σ_{j=1}^{p} β_j².    (2.12)

The ridge estimator can be effortlessly derived as

\hat{β}_{RIDGE} = (X'X + λI)^{-1} X'y.    (2.13)
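A direct implementation of the closed form (2.13), applied to the simulated data from the OLS sketch in Section 2.1.1 with a few arbitrary λ values; note how the coefficients shrink toward zero as λ grows but never become exactly zero.

```r
ridge_fit <- function(y, X, lambda) {
  # Closed-form ridge solution (2.13): (X'X + lambda * I)^{-1} X'y
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

# Coefficients shrink toward zero as lambda grows, but none is set exactly to zero
sapply(c(0, 1, 10, 100), function(l) ridge_fit(y, X, l))
```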

Ridge regression achieves better prediction performance than the OLS estimator through a bias-variance trade-off (biased estimates with lower variance) by shrinking each coefficient based on the variation of the corresponding variable (Park, 2006). In addition, it achieves a stable fit even in the presence of multicollinearity, by perturbing the eigenvalues of the ill-conditioned matrix X'X by a well-behaved diagonal matrix of the form λI (Armagan, 2008). Despite these attractive characteristics, ridge regression cannot produce a parsimonious model as it always keeps all the predictors in the model.

LASSO

In order to carry out variable selection, Tibshirani (1996) proposed the least absolute shrinkage and selection operator (LASSO), which estimates the regression coefficients by minimizing the RSS subject to an L1 norm constraint on the coefficients as follows:

\hat{β}_{LASSO} = argmin_β (y − Xβ)'(y − Xβ) + λ Σ_{j=1}^{p} |β_j|,    (2.14)

where λ is a positive regularization parameter. Compared to ridge regression, a remarkable property of LASSO is that it can shrink some coefficients exactly to zero, and therefore, it can automatically achieve variable selection.

BRIDGE REGRESSION

Frank and Friedman (1993) introduced bridge regression (a general family of penalties), which penalizes the sum of the absolute values of the coefficients raised to the power α, thereby determining the coefficients with the following criterion:

\hat{β}_{BRIDGE} = argmin_β (y − Xβ)'(y − Xβ) + λ Σ_{j=1}^{p} |β_j|^α,    (2.15)

where λ is again a positive regularization parameter, and α > 0 is the concavity parameter which controls the concavity of the penalty function. The bridge estimator includes both LASSO (α = 1) and ridge (α = 2) as special cases. It achieves variable selection when 0 < α < 1, and shrinks the coefficients when α > 1 (Park and Yoon, 2011). A geometrical illustration of why the LASSO and bridge estimators result in sparsity, but the ridge does not, is given by the constraint interpretation of their penalties as described in Figure 2.1, which is extracted from Chen et al. (2010). It shows the general behavior of these three penalty functions in a two-parameter case, β1 and β2. To obtain the regularized estimates, we essentially seek the points at which the objective function contour first "hits" the constraint. The LASSO, ridge, and bridge penalty functions have constraints shaped like a diamond, circle, and star respectively. As a consequence of the different shapes, the LASSO estimator is likely to involve variable selection (β1 = 0 or β2 = 0) as well as parameter estimate shrinkage, and the ridge estimator yields mainly parameter estimate shrinkage; in contrast, the bridge estimator induces an even higher chance of variable selection than the LASSO, because the star shape of bridge makes the contour even more likely to hit one of the points (β1 = 0 or β2 = 0) than does the diamond shape of LASSO.

The following description of Figure 2.1 was extracted from Chen et al. (2010).
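To see the 'exact zero' behavior concretely, here is a minimal cyclic coordinate-descent sketch for the LASSO criterion (2.14), again using the simulated data from Section 2.1.1. It only illustrates the soft-thresholding mechanism and is not a substitute for the optimized LARS and coordinate-descent implementations discussed at the end of this section.

```r
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

lasso_cd <- function(y, X, lambda, n_sweep = 200) {
  p <- ncol(X)
  beta <- rep(0, p)
  xtx <- colSums(X^2)
  for (s in 1:n_sweep) {
    for (j in 1:p) {
      # partial residual excluding predictor j
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      # coordinate-wise minimizer of (2.14): soft-threshold x_j'r_j at lambda/2
      beta[j] <- soft_threshold(sum(X[, j] * r_j), lambda / 2) / xtx[j]
    }
  }
  beta
}

# A moderate lambda already sets some coefficients exactly to zero,
# whereas ridge_fit() from the ridge sketch above zeroes none of them.
round(cbind(lasso = lasso_cd(y, X, lambda = 50),
            ridge = ridge_fit(y, X, lambda = 50)), 3)
```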

ADAPTIVE LASSO

The adaptive LASSO replaces the LASSO penalty with a weighted L1 penalty, λ Σ_{j=1}^{p} w_j|β_j| (see Table 2.1), where the weights w_j are constructed from an initial estimate of the coefficients. The adaptive LASSO heavily depends on the quality of the initial estimates of the penalty parameters. In high-dimensional and correlated settings, it can be challenging to obtain initial estimates with satisfactory consistency properties. As noted by Fan et al. (2009), the requirement of a consistent initial estimate is a drawback of the adaptive LASSO.

ELASTIC NET

Zou and Hastie (2005) proposed the elastic net estimator to achieve improved performance in situations when there is multicollinearity and grouping among the predictors. The penalty term in the elastic net is a convex combination of the LASSO and ridge penalties, i.e.,

\hat{β}_{ENET} = argmin_β (y − Xβ)'(y − Xβ) + λ_1 Σ_{j=1}^{p} |β_j| + λ_2 Σ_{j=1}^{p} β_j²,    (2.17)

where λ_1, λ_2 are positive tuning parameters. The elastic net estimator can be interpreted as a stabilized version of the LASSO estimator, yielding superior performance especially when there are groups of highly correlated predictors. Other related regularization methods include the smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001), the fused LASSO (Tibshirani et al., 2005), the adaptive elastic net (Ghosh, 2011), the minimax concave penalty (MCP) (Zhang, 2010), the adaptive bridge (Park and Yoon, 2011), and the Dantzig selector (Candes and Tao, 2007), among others. In Table 2.1, we summarize some of these popular regularization methods which are most relevant to this dissertation. Various optimization algorithms have been proposed to solve the LASSO and related estimators.

Notably, the least angle regression (LARS) (Efron et al., 2004) and the coordinate descent algorithm (Friedman et al., 2010) are the most computationally efficient ones. Given the tuning parameters, these algorithms are extremely fast, making the regularization approaches extremely popular in high-dimensional data analysis.

Method               Tuning Parameter(s)   Penalty
Ridge                λ > 0                 λ Σ_{j=1}^{p} β_j²
LASSO                λ > 0                 λ Σ_{j=1}^{p} |β_j|
Adaptive LASSO       λ > 0                 λ Σ_{j=1}^{p} w_j |β_j|
Elastic Net          λ_1, λ_2 > 0          λ_1 Σ_{j=1}^{p} |β_j| + λ_2 Σ_{j=1}^{p} β_j²
Bridge               λ, α > 0              λ Σ_{j=1}^{p} |β_j|^α
Group LASSO          λ > 0                 λ Σ_{k=1}^{K} ||β_k||_2
Sparse Group LASSO   λ_1, λ_2 > 0          λ_1 Σ_{k=1}^{K} ||β_k||_2 + λ_2 Σ_{k=1}^{K} ||β_k||_1
Group Bridge         λ, α > 0              λ Σ_{k=1}^{K} w_k ||β_k||_1^α, w_k > 0

Table 2.1: Various Penalty Functions
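The penalties in Table 2.1 are simple functions of β (and, for the group penalties, of a grouping index). The sketch below evaluates a few of them for an illustrative coefficient vector and grouping; the unit group-bridge weights are an arbitrary choice for the example.

```r
penalties <- list(
  ridge        = function(beta, lambda, ...) lambda * sum(beta^2),
  lasso        = function(beta, lambda, ...) lambda * sum(abs(beta)),
  bridge       = function(beta, lambda, alpha, ...) lambda * sum(abs(beta)^alpha),
  group_lasso  = function(beta, lambda, group, ...)
    lambda * sum(tapply(beta, group, function(b) sqrt(sum(b^2)))),
  group_bridge = function(beta, lambda, alpha, group, w = NULL, ...) {
    if (is.null(w)) w <- rep(1, length(unique(group)))  # illustrative unit weights w_k
    lambda * sum(w * tapply(beta, group, function(b) sum(abs(b)))^alpha)
  }
)

beta  <- c(1.2, -0.5, 0, 0, 0.3)   # illustrative coefficient vector
group <- c(1, 1, 2, 2, 2)          # two groups of predictors
sapply(names(penalties), function(nm)
  penalties[[nm]](beta, lambda = 1, alpha = 0.5, group = group))
```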

2.3.2 GROUP REGULARIZATION METHODS

In many scientific applications, covariates are naturally grouped. One such example arises in genetic association studies, where the genetic variants can

23 be divided into multiple groups, with variants in a same group being biologically related or statistically correlated. When there is no grouping structure, all these aforementioned procedures have the attractive feature that individual variable selection and coefficient estimation can be carried out simultaneously with higher prediction accuracy and lower computational cost as compared to traditional model selection methods such as the best subset selection, AIC, and BIC. However, when there is a known grouping structure among the covariates, these methods are not capable of simultaneous group and individual variable selection, and may often lead to poor performance, as they usually do not take into account the grouping information into their estimation procedures. GROUP LASSO AND RELATED ESTIMATORS With the aim of proper incorporation of the grouping information into the estimation procedure, there has been a fair amount of recent work extending these individual variable selection approaches to group variable selection problems. The very first method in this area is the group LASSO (Yuan and Lin, 2006), which penalizes the L2 norm of the coefficients within each group. In the context of linear regression, we assume that there is an underlying grouping structure 

among the predictors, i.e., β = (β_1', . . . , β_K')', where β_k is the m_k-dimensional vector of coefficients corresponding to group k, k = 1, . . . , K, and K is the number of groups, with \sum_{k=1}^{K} m_k = p and K < p. To address the issue of variable selection for grouped variables, Yuan and Lin (2006) proposed the group LASSO estimator, in which the coefficients are determined by the following criterion:

\hat{\beta}_{GLASSO} = \arg\min_{\beta} \, (y - X\beta)'(y - X\beta) + \lambda \sum_{k=1}^{K} ||\beta_k||_2,    (2.18)

where ||\beta_k||_2 is the L2 norm of \beta_k, and for any z = (z_1, . . . , z_q)' ∈ R^q, ||z||_s is defined as ||z||_s = (|z_1|^s + . . . + |z_q|^s)^{1/s}, s > 0. Although the group LASSO yields sparse solutions at the group level, it cannot effectively remove unimportant variables within a selected group; that is, it selects variables in an "all-in-all-out" fashion, meaning that when one variable in a group is selected, all other variables in the same group are selected as well. Another criticism of the group LASSO is that it assumes orthonormality of the design matrix corresponding to each group. Simon and Tibshirani (2012) noted that for non-orthonormal group matrices, implementation of the group LASSO algorithm changes the original problem. For example, a design matrix consisting of categorical predictors coded by dummy variables can be orthonormalized without changing the problem only if the number of observations in each category is the same (Simon and Tibshirani, 2012). Thus, for general problems with correlated features, application of the original group LASSO can be misleading, reducing the practical use of the estimator. Recently, Breheny and Huang (2015) modified the original group LASSO algorithm for general design matrices; they indicated that after orthonormalizing within groups and implementing the group LASSO algorithm, the coefficients must be transformed back to the original basis. Among other developments, Wang and Leng (2008) proposed the adaptive group LASSO, which improves upon the performance of the group LASSO by adaptively shrinking the L2 norms of the group-specific coefficients; however,

similar to the original group LASSO, it fails to perform bi-level variable selection (simultaneously selecting not only the important groups but also the important members within those groups). Recently, Simon et al. (2013) proposed the sparse group LASSO, which is capable of bi-level variable selection by using a convex combination of the LASSO and group LASSO penalties on the coefficients. However, each coefficient in this penalty incurs a double amount of shrinkage, which might lead to a relatively large bias in the estimation.

GROUP BRIDGE

To mitigate the problems associated with the group LASSO and related estimators, Huang et al. (2009) proposed a penalized regression method that uses a specially designed group bridge penalty. They established that, unlike the group LASSO, the group bridge estimator is capable of bi-level variable selection with the desired 'oracle group selection property', i.e., it can correctly select important groups with probability converging to 1. In the context of a linear regression model, the group bridge estimator (Huang et al., 2009) results from the following regularization problem:

\min_{\beta} \, (y - X\beta)'(y - X\beta) + \lambda \sum_{k=1}^{K} w_k ||\beta_k||_1^{\alpha},    (2.19)

where ||\beta_k||_1 is the L1 norm of \beta_k, \lambda > 0 is the tuning parameter which controls the degree of penalization, \alpha > 0 is the concavity parameter which controls the shape of the penalty function, and the weights w_k depend on some preliminary estimates, with w_k > 0, k = 1, . . . , K. In this dissertation, we limit our discussion to the group bridge estimator. A

comprehensive review of other group penalized regression methods can be found in a recent article by Huang et al. (2012), and references therein. Huang et al. (2009) argued that when 0 < α < 1, the group bridge estimator can be used for simultaneous variable selection at the group and individual variable levels, and it can be efficiently solved by locally approximated coordinate descent algorithms (Breheny and Huang, 2009; Huang et al., 2009). Given the tuning parameters, these algorithms are extremely fast and stable, which ensures wide applicability of the method (Breheny and Huang, 2009). Despite being theoretically and computationally attractive, regularization methods have been criticized for providing unreliable standard errors, which is clearly a major drawback of these approaches. To compensate for this shortcoming, various Bayesian regularization methods have been proposed in the literature. As noted by Celeux et al. (2012), Bayesian regularization methods usually dominate frequentist methods in the sense that they provide smaller prediction errors while selecting the most relevant variables in a parsimonious way.

2.4 BAYESIAN METHODS

2.4.1 OVERVIEW OF BAYESIAN FRAMEWORK

There are two main opposing schools of statistical reasoning, namely frequentist and Bayesian approaches (Du and Swamy, 2013). Although the frequentist or classical approaches have dominated scientific research for several decades (Efron, 2005), with increased computing power, Bayesian statistical methods

have reappeared with a significant impact that is starting to change the situation (Vallverdu, 2008). The fundamental difference between the Bayesian and frequentist approaches lies in the definition and interpretation of probability. While frequentists consider the probability of an event to be objective and interpret it as the long-run expected frequency of occurrences, Bayesians believe that the probability of an event is subjective and view it as a 'degree of belief' (Wagner and Steinzor, 2006). Without delving too much into the philosophical aspects, a Bayesian approach to a problem starts by allowing a priori information or knowledge to contribute to the estimation and inference of the parameters. As opposed to a frequentist approach, the model parameters are regarded as random variables in a Bayesian framework, where a priori knowledge about the parameters is quantified probabilistically by assigning probability distributions to the parameters, known as prior distributions, which are then updated by the observed data to yield the posterior distributions. The joint posterior distribution of the parameters is calculated using the well-known Bayes' theorem (named after Thomas Bayes (1701-1761), a British statistician), which is given as

p(\theta|y) = \frac{p(y|\theta) \, p(\theta)}{p(y)} \propto p(y|\theta) \, p(\theta),    (2.20)

where θ is the unknown parameter of interest (without loss of generality we assume that θ is a vector), p(θ) denotes the prior distribution of θ, p(y|θ) is the likelihood function, and p(θ|y) refers to the joint posterior distribution of θ. Any feature of the posterior distribution is a legitimate candidate for Bayesian inference: moments, quantiles, highest posterior density regions, etc.

(Gilks, 2005). As such, the posterior distribution plays a pivotal role in Bayesian statistical modeling, and a variety of methods have been developed for posterior inference based on posterior simulations and posterior summary. In the next subsection, we give an overview of one such widely used method: Markov Chain Monte Carlo (MCMC) for posterior simulation.
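Before moving on, the following is a toy numerical illustration of (2.20); the beta-binomial setting and the specific values of a, b, n, and y below are arbitrary and chosen only to show that "prior times likelihood, normalized" reproduces the familiar conjugate answer.

## Toy illustration of Bayes' theorem (2.20): Binomial likelihood with a Beta prior
a <- 2; b <- 2; n <- 20; y <- 14
theta <- seq(0.001, 0.999, length.out = 999)
unnorm <- dbinom(y, n, theta) * dbeta(theta, a, b)     # p(y | theta) * p(theta)
posterior <- unnorm / sum(unnorm * diff(theta)[1])     # normalize numerically over the grid
## The grid-based posterior is close to the exact conjugate posterior Beta(a + y, b + n - y)
max(abs(posterior - dbeta(theta, a + y, b + n - y)))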

2.4.2 OVERVIEW OF MCMC METHODOLOGY

In the Bayesian paradigm, it is not always straightforward to draw independent samples from the posterior distribution of interest, either because the posterior is analytically intractable or because the model involves a large number of parameters in complex scenarios. In such situations, various alternative techniques have been developed for drawing dependent samples from the posterior distribution, which can be used in much the same way as independent samples. These techniques are collectively called Markov Chain Monte Carlo (MCMC) (some passages in this section have been adapted from Gelman et al. (2003)), which refers to the broad class of stochastic simulation techniques for producing dependent samples from a target distribution that may be difficult to sample by rejection sampling or other classical independence methods. Here, instead of sampling θ directly from p(θ|y), we sample iteratively in such a way that at each step of the process we expect to draw from a distribution that becomes closer and closer to p(θ|y) (Gelman et al., 2003). Starting with some initial values, the approximate distributions are sequentially improved at each step until the sequence converges to the target distribution. Each sample depends on the previous one, hence the notion of a Markov chain. A Markov chain is a sequence of random variables θ^{(1)}, . . . , θ^{(t)}, for which the

random variable θ^{(t)} depends on all previous θ's only through its immediate predecessor θ^{(t−1)}. On the other hand, Monte Carlo, as in Monte Carlo integration, is mainly used to approximate an expectation using the Markov chain samples. In simplest terms, when we have n samples θ^{(1)}, . . . , θ^{(n)} generated from p(θ|y), the posterior expectation of any function f(θ) can be approximated as

E[f(\theta) \mid y] = \int f(\theta) \, p(\theta|y) \, d\theta \approx \frac{1}{n} \sum_{i=1}^{n} f(\theta^{(i)}).    (2.21)

Hence the name Markov Chain Monte Carlo (MCMC). Thus, with the MCMC method, it is possible to generate samples from an arbitrary posterior density p(θ|y) and to use these samples to approximate any posterior quantities of interest. Also, if the simulation algorithm is implemented correctly, the Markov chain is guaranteed to converge to the target distribution p(θ|y) under rather broad conditions, regardless of the initial values of the chain. In practice, one might experience poor mixing (or slow convergence) of the Markov chain. This can happen, for example, when the parameters are highly correlated with each other. Poor mixing means that the Markov chain slowly traverses the parameter space and the chain has high dependence (sample autocorrelation), which can inflate Monte Carlo standard errors. As a remedy, discarding an initial portion of a Markov chain sample, known as burn-in, is often done so that the effect of initial values on the posterior inference is minimized. In theory, if the Markov chain is run for an infinite amount of time, the effect of the initial values should decrease to zero. However, in practice, we do not have

the luxury of infinite samples. In fact, in most practical scenarios, we assume that after a certain number of iterations the chain has reached its target distribution, so one can throw away the early portion and use the remaining samples for posterior inference. Another common strategy, known as thinning, is to thin the Markov chain in order to reduce sample autocorrelation; thinning is done by keeping every kth simulated draw from each sequence of a chain. One can safely use a thinned Markov chain for posterior inference as long as the chain converges (Gelman et al., 2003). In this dissertation, we mainly focus on two widely used MCMC mechanisms that allow us to carry out posterior sampling when a direct approach is not possible, viz. the Metropolis-Hastings algorithm (Hastings, 1970; Metropolis et al., 1953) and the Gibbs sampling algorithm (Gelfand and Smith, 1990). A comprehensive treatment of various other MCMC algorithms can be found in Robert and Casella (2011, 2013), and references therein. Apart from these posterior simulation algorithms, a variety of numerical methods exist for finding the maximum-a-posteriori (MAP) estimates of the parameters, the most popular of them being the Expectation Maximization (EM) algorithm. The EM algorithm was first developed for maximum likelihood parameter estimation from incomplete data (Dempster et al., 1977). A variant of the EM algorithm in which the M-step is replaced by a set of conditional maximizations, or CM-steps, is known as the Expectation Conditional Maximization (ECM) algorithm. A comprehensive review of various EM and related algorithms is provided in McLachlan and Krishnan (2007). Next, we briefly outline the Metropolis-Hastings algorithm and the Gibbs sampling algorithm.
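The following is a minimal R sketch of the post-processing steps just described (burn-in removal, thinning, and the Monte Carlo average in (2.21)); the autocorrelated draws are a simulated stand-in for MCMC output, not draws from any model in this dissertation.

## Post-processing a chain of correlated draws (illustrative stand-in for MCMC output)
set.seed(1)
draws <- as.numeric(arima.sim(model = list(ar = 0.9), n = 12000))  # autocorrelated "chain"
burn_in <- 2000
kept <- draws[(burn_in + 1):length(draws)]          # discard the burn-in portion
thinned <- kept[seq(1, length(kept), by = 10)]      # keep every 10th draw (thinning)
post_mean <- mean(thinned)                          # Monte Carlo average as in (2.21)
acf(thinned, plot = FALSE)                          # check the remaining autocorrelation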


THE METROPOLIS-HASTINGS ALGORITHM

The Metropolis-Hastings algorithm (Hastings, 1970; Metropolis et al., 1953) is a general accept/reject sampling technique which can be used to sample from essentially any density of interest. Given a target posterior density π(θ) with θ = (θ_1, . . . , θ_l)', and a suitably chosen proposal density (also known as transition density) J(θ|θ^{(t)}), t = 1, 2, . . ., the Metropolis-Hastings algorithm proceeds by successively iterating the following steps:

1. Generate a candidate value θ* from the transition density J(θ|θ^{(t−1)}).

2. Calculate the ratio

r = \frac{J(\theta^{(t-1)} \mid \theta^{*}) \, \pi(\theta^{*})}{J(\theta^{*} \mid \theta^{(t-1)}) \, \pi(\theta^{(t-1)})}.

3. Set θ^{(t)} = θ* with probability min(r, 1), and θ^{(t)} = θ^{(t−1)} otherwise.

Clearly, the performance of the MH algorithm is highly dependent on the choice of the proposal density. Depending on the spread of the transition density, the algorithm can have a low or high acceptance rate. The transition density must be chosen so that the chain does not stay at the same point for many iterations and instead traverses the whole support of the target density (Taroni et al., 2010). A commonly cited guideline is that a good acceptance rate lies between 20% and 50% (Taroni et al., 2010). If the candidate-generating density is built to have its mean equal to the current value of the parameter, the corresponding MH algorithm is called the random walk MH algorithm.
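The following is a minimal random walk MH sketch in R for a generic univariate target; the toy log-posterior log_post() and the proposal standard deviation are illustrative choices, not part of any model used later. Because the random walk proposal is symmetric, the J terms in the acceptance ratio cancel.

## Random walk Metropolis-Hastings for a toy univariate target (sketch only)
log_post <- function(theta) dnorm(theta, mean = 2, sd = 1, log = TRUE)  # toy target

rw_mh <- function(n_iter, theta0, prop_sd) {
  theta <- numeric(n_iter)
  theta[1] <- theta0
  for (t in 2:n_iter) {
    cand <- rnorm(1, mean = theta[t - 1], sd = prop_sd)    # symmetric proposal
    log_r <- log_post(cand) - log_post(theta[t - 1])       # proposal terms cancel
    theta[t] <- if (log(runif(1)) < log_r) cand else theta[t - 1]
  }
  theta
}

chain <- rw_mh(5000, theta0 = 0, prop_sd = 1)
mean(chain[-(1:1000)])   # posterior mean after discarding a burn-in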


THE GIBBS SAMPLING ALGORITHM

Suppose we wish to sample from π(θ), where θ = (θ_1, . . . , θ_l)'. Each iteration of the Gibbs sampler proceeds as follows:

1. A collection of d (≤ l) subvectors of θ is chosen.

2. For each subvector in this collection (say, θ_j, j = 1, . . . , d), we sample from the full conditional distribution given all the other components of θ, i.e., θ_j^{(t)} ∼ π(θ_j | θ_{-j}^{(t-1)}), where θ_{-j}^{(t-1)} represents all the components of θ, except for θ_j, at their current values, i.e., θ_{-j}^{(t-1)} = (θ_1^{(t-1)}, . . . , θ_{j-1}^{(t-1)}, θ_{j+1}^{(t-1)}, . . . , θ_d^{(t-1)}).

As an example, we consider a single observation y = (y_1, y_2)' from a bivariate normal distribution with unknown mean vector θ = (θ_1, θ_2)' and known variance-covariance matrix

\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.

With non-informative uniform priors on the components of θ, the posterior distribution is explicitly available as θ|y ∼ N(y, Σ). The full conditional distributions are

\theta_1 \mid \theta_2, y \sim N(y_1 + \rho(\theta_2 - y_2), \, 1 - \rho^2),
\theta_2 \mid \theta_1, y \sim N(y_2 + \rho(\theta_1 - y_1), \, 1 - \rho^2).    (2.22)

We can obtain samples from the above posterior distribution (2.22) by choosing initial values of (θ_1, θ_2) to start the chain. Then, one may sample from π(θ_1|θ_2, y), and this updated value of θ_1 can be used to draw a random value of θ_2 from π(θ_2|θ_1, y). This process is continued for t iterations, giving t draws from the desired bivariate posterior distribution (an R sketch of this sampler, together with the convergence checks described next, is given at the end of this subsection). Gibbs sampling is a special case of the MH algorithm with the proposal distributions equal to the full conditional distributions (Gelman et al., 2003). It is to be noted that the Gibbs sampler requires the availability of the full conditional distributions for all components of the parameter vector θ. Due to its simplicity, it has been extensively applied in high-dimensional problems. However, if the full conditionals are not easily available (or are intractable), then we must resort to rejection sampling or the MH algorithm.

CONVERGENCE DIAGNOSTICS

Once the simulation algorithm has been implemented and the sequences are drawn, it is absolutely necessary to check the convergence of the simulated sequences (Gelman et al., 2003). If convergence is not attained, or is painfully slow, the algorithm should be altered. One popular diagnostic technique to assess convergence is the Gelman-Rubin convergence diagnostic (Gelman and Rubin, 1992), which involves two basic steps: the first step is to run multiple chains from different starting points, and the second step is to estimate the potential scale reduction factor (PSRF) R̂ for each scalar quantity of interest after running the algorithm for a desired number of iterations. When the value of the PSRF is close to 1, one has sufficient evidence to believe that the chain has converged to its stationary distribution. Another popular diagnostic tool is the

Geweke convergence diagnostic (Geweke, 1992), which is based on dividing the chain into two "windows" containing a specified fraction of the earliest and the latest iterations in the sequence. Then, the mean of the sampled values in the first window is compared to the mean of the sampled values in the last window, giving rise to a Z statistic calculated as the difference between the two means divided by the asymptotic standard error of their difference. The Z statistic approaches a standard normal distribution if the chain has converged.
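The following is a short R sketch of the bivariate normal Gibbs sampler in (2.22), followed by the two diagnostics just described; gelman.diag and geweke.diag are provided by the coda package, which is also used for the analyses later in this dissertation. The values of y, ρ, and the starting points are arbitrary.

## Gibbs sampler for the bivariate normal example (2.22), with convergence checks
library(coda)

gibbs_bvn <- function(n_iter, y, rho, theta_init) {
  theta <- matrix(NA_real_, n_iter, 2)
  th <- theta_init
  for (t in 1:n_iter) {
    th[1] <- rnorm(1, y[1] + rho * (th[2] - y[2]), sqrt(1 - rho^2))
    th[2] <- rnorm(1, y[2] + rho * (th[1] - y[1]), sqrt(1 - rho^2))
    theta[t, ] <- th
  }
  theta
}

y <- c(0, 0); rho <- 0.8
chain1 <- mcmc(gibbs_bvn(5000, y, rho, c(-5, 5)))   # two chains from dispersed starts
chain2 <- mcmc(gibbs_bvn(5000, y, rho, c(5, -5)))
gelman.diag(mcmc.list(chain1, chain2))   # PSRF close to 1 suggests convergence
geweke.diag(chain1)                      # Z statistics comparing early and late windows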

2.4.3 BAYESIAN REGULARIZATION METHODS

We finish this chapter with a quick overview of Bayesian regularization approaches. Regularization naturally lends itself to a Bayesian interpretation in which the penalty function is the negative log prior distribution of the coefficient vector. In recent years, there has been a resurgence of interest in Bayesian regularized regression approaches. Most of these approaches are based on hierarchical models, where the basic idea is to introduce a suitable mixture distribution of the corresponding sparsity-inducing prior, which facilitates computation via an efficient data augmentation strategy. While most of the conventional Bayesian variable selection methods in the literature rely on specifying spike and slab priors on the coefficients subject to selection (George and McCulloch, 1993, 1997; O’Hara and Sillanpaa, 2009) (requiring computation of marginal likelihood, which is computationally intensive for high-dimensional models), Bayesian regularization methods specify both the spike and slab components as continuous distributions, which can be written as scale mixtures, leading to simpler MCMC algorithms with no marginal likelihood being computed. In addition, posterior sampling of most traditional Bayesian variable selection

methods often requires a stochastic search over an enormous space of complicated models, leading to slow convergence and poor mixing when the marginal likelihood is not analytically tractable (Johnson and Rossell, 2012). Moreover, unlike conventional Bayesian variable selection methods, Bayesian regularization methods specify shrinkage priors, which enable simultaneous variable selection and coefficient estimation. Furthermore, using MCMC and Gibbs samplers to search the model space for the most probable models, without fitting all possible models, is efficient and avoids time-consuming computation. Bayesian regularization methods also have some advantages over frequentist regularization methods. Firstly, in MCMC-based Bayesian regularization methods, we have a valid measure of standard error obtained from the posterior distribution, and thus we can easily obtain interval estimates of the parameters along with other quantities of interest. Secondly, the Bayesian approach is more flexible in the sense that we can estimate the tuning parameters jointly with the other model parameters, without the need for cross-validation, which can be computationally burdensome for multiple tuning parameters. Lastly, unlike the frequentist framework, it is fairly straightforward to extend a model to incorporate multi-level information inherent in the data in a Bayesian hierarchical framework. Furthermore, statistical inference on the regression coefficients is usually difficult for frequentist regularization methods, whereas a Bayesian approach enables exact inference, even when the sample size is small. Following the path-breaking paper of Park and Casella (2008), which is arguably the first successful (nontrivial) Bayesian hierarchical formulation of the LASSO regularization, there has been an unparalleled development of Bayesian regularization methods in recent years, most of which are primarily based on the

scale mixture of normal (SMN) representations (Andrews and Mallows, 1974; West, 1987) of the associated priors. In this dissertation, following the ideas of Park and Casella (2008) and Kyung et al. (2010), we propose a set of novel Bayesian regularization methods based on the scale mixture of uniform (SMU) representations of the associated priors, and solve the resulting problems using Gibbs samplers.


REFERENCES

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19 (6), 716–723. Ando, T. (2007). Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models. Biometrika 94 (2), 443– 458. Andrews, D. F. and C. L. Mallows (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 36 (1), 99–102. Armagan, A. (2008). Bayesian Shrinkage Estimation and Model Selection. Ph. D. thesis, University of Tennessee. Breheny, P. and J. Huang (2009). Penalized methods for bi-level variable selection. Statistics and its Interface 2 (3), 369–380. Breheny, P. and J. Huang (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing 25 (2), 173–187. Breheny, P. J. (2009). Regularized Methods for High-dimensional and Bi-level Variable Selection. Ph. D. thesis, The University of Iowa.

38 Broman, K. W. and T. P. Speed (2002). A model selection approach for the identification of quantitative trait loci in experimental crosses. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (4), 641–656. Burnham, K. P. and D. R. Anderson (2004). Multimodel inference: understanding AIC and BIC in model selection. Sociological Methods & Research 33 (2), 261–304. Candes, E. and T. Tao (2007). The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics 35 (6), 2313–2351. Casella, G., F. Giron, M. Martinez, and E. Moreno (2009). Consistency of Bayesian procedures for variable selection. The Annals of Statistics 37 (3), 1207–1228. Celeux, G., M. El Anbari, J.-M. Marin, and C. P. Robert (2012). Regularization in regression: comparing Bayesian and frequentist methods in a poorly informative situation. Bayesian Analysis 7 (2), 477–502. Chand, S. (2012). On tuning parameter selection of lasso-type methods - a monte carlo study. In 9th International Bhurban Conference on Applied Sciences and Technology (IBCAST) Proceedings, pp. 120–129. Chen, J. and Z. Chen (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95 (3), 759–771. Chen, J. and Z. Chen (2012). Extended BIC for small-n-large-p sparse GLM. Statistica Sinica 22 (2), 555–574.

39 Chen, L. S., C. M. Hutter, J. D. Potter, Y. Liu, R. L. Prentice, U. Peters, and L. Hsu (2010). Insights into colon cancer etiology via a regularized approach to gene set analysis of GWAS data. The American Journal of Human Genetics 86 (6), 860–871. Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 39 (1), 1–38. Donoho, D. L. and J. M. Johnstone (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 (3), 425–455. Du, K.-L. and M. Swamy (2013). Neural Networks and Statistical Learning. Springer Science & Business Media. Dziak, J., R. Li, and L. Collins (2005). Critical review and comparison of variable selection procedures for linear regression. Technical report, Pennsylvania State University. Efron, B. (2005). Modern science and the Bayesian-frequentist controversy. Technical report, Division of Biostatistics, Stanford University. Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32(2), 407–99. Fan, J., Y. Feng, and Y. Wu (2009). Network exploration via the adaptive LASSO and SCAD penalties. The Annals of Applied Statistics 3 (2), 521– 541.

40 Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 (456), 1348–1360. Flom, P. L. and D. L. Cassell (2007). Stopping stepwise: Why stepwise and similar selection methods are bad, and what you should use. In 20th Annual NorthEast SAS Users Group (NESUG) Conference Proceedings. Frank, I. and J. H. Friedman (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35 (2), 109–135. Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33 (1), 1–22. Gelfand, A. E. and A. F. Smith (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85 (410), 398–409. Gelman, A., J. Carlin, H. Stern, and D. Rubin (2003). Bayesian Data Analysis. Chapman & Hall, London. Gelman, A. and J. Hill (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. Gelman, A., J. Hwang, and A. Vehtari (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing 24 (6), 997– 1016.

41 Gelman, A. and D. B. Rubin (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7 (4), 457–472. George, E. I. and R. E. McCulloch (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88 (423), 881–889. George, E. I. and R. E. McCulloch (1997). Approaches for Bayesian variable selection. Statistica Sinica 7 (2), 339–373. Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments (with discussion). Bayesian Statistics 4, 169–194. Ghosh, S. (2011). On the grouped selection and model complexity of the adaptive elastic net. Statistics and Computing 21(3), 452–461. Gilks, W. R. (2005). Markov Chain Monte Carlo. Wiley Online Library. Goeman, J. J. (2010). l1 penalized estimation in the Cox proportional hazards model. Biometrical Journal 52 (1), 70–84. Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 (1), 97–109. Hoerl, A. E. and R. W. Kennard (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), 55–67. Huang, J., P. Breheny, and S. Ma (2012). A selective review of group selection in high-dimensional models. Statistical Science 27 (4), 481–499.

42 Huang, J., S. Ma, H. Xie, and C. Zhang (2009). A group bridge approach for variable selection. Biometrika 96 (2), 339–355. Huang, J., S. Ma, and C.-H. Zhang (2008). Adaptive lasso for sparse highdimensional regression models. Statistica Sinica 18 (4), 1603–1618. Hurvich, C. M. and C.-L. Tsai (1989). Regression and time series model selection in small samples. Biometrika 76 (2), 297–307. Hwang, J.-S. and T.-H. Hu (2014). A stepwise regression algorithm for highdimensional variable selection. Journal of Statistical Computation and Simulation 85 (9), 1793–1806. Johnson, V. E. and D. Rossell (2012). high-dimensional settings.

Bayesian model selection in

Journal of the American Statistical Associa-

tion 107 (498), 649–660. Kadane, J. B. and N. A. Lazar (2004). Methods and criteria for model selection. Journal of the American Statistical Association 99 (465), 279–290. Kim, Y., S. Kwon, and H. Choi (2012). Consistent model selection criteria on high dimensions. Journal of Machine Learning Research 13 (1), 1037–1057. Kundu, D. and G. Murali (1996). Model selection in linear regression. Computational Statistics & Data Analysis 22 (5), 461–469. Kyung, M., J. Gill, M. Ghosh, and G. Casella (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis 5 (2), 369–412. Li, Q. (2010). On Bayesian Regression Regularization Methods. Ph. D. thesis, Washington University in St. Louis.

43 Luo, S. and Z. Chen (2013). Extended BIC for linear regression models with diverging number of relevant features and high or ultra-high feature spaces. Journal of Statistical Planning and Inference 143 (3), 494–504. Mallick, H. and N. Yi (2013). Bayesian methods for high dimensional linear models. Journal of Biometrics & Biostatistics S1, 005. Mallows, C. L. (1973). Some comments on cp . Technometrics 15 (4), 661–675. McLachlan, G. and T. Krishnan (2007). The EM Algorithm and Extensions, Volume 382. John Wiley & Sons. Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21 (6), 1087–1092. Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics 12 (2), 758–765. O’Hara, R. B. and M. J. Sillanpaa (2009). A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis 4 (1), 85–117. Park, C. and Y. J. Yoon (2011). Bridge regression: adaptivity and group selection. Journal of Statistical Planning and Inference 141 (11), 3506–3519. Park, M. Y. (2006). Generalized Linear Models with Regularization. Ph. D. thesis, Stanford University. Park, T. and G. Casella (2008). The Bayesian lasso. Journal of the American Statistical Association 103 (482), 681–686.

44 Radchenko, P. and G. M. James (2008). Variable inclusion and shrinkage algorithms. Journal of the American Statistical Association 103 (483), 1304–1315. Robert, C. and G. Casella (2011). A short history of Markov Chain Monte Carlo: Subjective recollections from incomplete data. Statistical Science 26 (1), 102– 115. Robert, C. and G. Casella (2013). Monte Carlo Statistical Methods. Springer Science & Business Media. Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6 (2), 461–464. Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica 7 (2), 221–242. Simon, N., J. Friedman, T. Hastie, and R. Tibshirani (2013). A sparse-group lasso. Journal of Computational and Graphical Statistics 22 (2), 231–245. Simon, N. and R. Tibshirani (2012). Standardization and the group lasso penalty. Statistica Sinica 22 (3), 983–1001. Smyth, I. V. C. P. (2001). Model complexity, goodness of fit and diminishing returns. In Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference, Volume 13, pp. 388. MIT Press. Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. Van Der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (4), 583–639.

45 Steyerberg, E. W. (2008). Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer Science & Business Media. Taroni, F., S. Bozza, A. Biedermann, P. Garbolino, and C. Aitken (2010). Data Analysis in Forensic Science: A Bayesian Decision Perspective, Volume 88. John Wiley & Sons. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 58 (1), 267–288. Tibshirani, R., M. Saunders, S. Rosset, J. Zhu, and K. Knight (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (1), 91–108. Vallverdu, J. (2008). The false dilemma: Bayesian vs. frequentist. arXiv preprint arXiv:0804.0486 . Wagner, W. E. and R. Steinzor (2006). Rescuing Science from Politics: Regulation and the Distortion of Scientific Research. Cambridge University Press. Wahba, G. (1998). Support vector machines, reproducing kernel hilbert spaces, and randomized GACV. In B. Schoelkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, Chapter 6, pp. 69–87. Wang, H. and C. Leng (2008). A note on adaptive group lasso. Computational Statistics & Data Analysis 52 (12), 5277–5286.

46 Wang, H., B. Li, and C. Leng (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (3), 671–683. Ward, E. J. (2008). A review and comparison of four commonly used Bayesian and maximum likelihood model selection tools. Ecological Modelling 211 (1), 1–10. West, M. (1987). On scale mixtures of normal distributions. Biometrika 74 (3), 646–648. Woodroofe, M. (1982). On model selection and the arc sine laws. The Annals of Statistics 10 (4), 1182–1194. Yan, X. and X. Su (2009). Linear Regression Analysis: Theory and Computing. World Scientific. Yang, Y. (2003). Regression with multiple candidate models: selecting or mixing? Statistica Sinica 13 (3), 783–810. Yang, Y. and A. R. Barron (1998). An asymptotic property of model selection criteria. IEEE Transactions on Information Theory 44 (1), 95–116. Yuan, M. and Y. Lin (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1), 49–67. Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38 (2), 894–942.

47 Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101 (476), 1418–1429. Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2), 301–320.


A NEW BAYESIAN LASSO

by

HIMEL MALLICK, NENGJUN YI

Published in Statistics and its Interface 7(4): 571-582, 2014
Copyright © International Press of Boston, Inc. Used by permission
Format adapted for dissertation

ABSTRACT

Park and Casella (2008) provided the Bayesian LASSO for linear models by assigning scale mixture of normal (SMN) priors on the parameters and independent exponential priors on their variances. In this paper, we propose an alternative Bayesian analysis of the LASSO problem. A different hierarchical formulation of the Bayesian LASSO is introduced by utilizing the scale mixture of uniform (SMU) representation of the Laplace density. We consider a fully Bayesian treatment that leads to a new Gibbs sampler with tractable full conditional posterior distributions. Empirical results and real data analyses show that the new algorithm has good mixing properties and performs comparably to the existing Bayesian method in terms of both prediction accuracy and variable selection. An Expectation Conditional Maximization (ECM) algorithm is provided to compute the MAP estimates of the parameters. Easy extension to general models is also briefly discussed.

KEYWORDS: LASSO, Bayesian LASSO, Scale Mixture of Uniform, MCMC, Variable Selection.

3.1 INTRODUCTION

Consider the LASSO of Tibshirani (1996) in the context of linear regression, which results from the following regularization problem:

\arg\min_{\beta} \, (y - X\beta)'(y - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j|,    (3.1)

where λ > 0 is the tuning parameter. Tibshirani (1996) suggested that the LASSO estimates can be interpreted as posterior mode estimates when the regression parameters are assigned independent and identical Laplace priors. Motivated by this, different approaches based on the scale mixture of normal (SMN) distributions with independent exponentially distributed variances (Andrews and Mallows, 1974; West, 1987) have been proposed (Bae and Mallick, 2004; Figueiredo, 2003; Yuan and Lin, 2005). Park and Casella (2008) introduced a Gibbs sampling strategy using a conditional Laplace prior specification of the form

\pi(\beta|\sigma^2) = \prod_{j=1}^{p} \frac{\lambda}{2\sqrt{\sigma^2}} \exp\left\{ -\frac{\lambda |\beta_j|}{\sqrt{\sigma^2}} \right\},    (3.2)

and non-informative scale-invariant marginal prior on σ 2 , i.e., π(σ 2 ) ∝ 1/σ 2 . Park and Casella (2008) devoted serious efforts to address the important unimodality issue. They pointed out that conditioning on σ 2 is important for unimodality, and lack of unimodality might slow down the convergence of the Gibbs sampler and make the point estimates less meaningful (Kyung et al., 2010). Other methods based on the Laplace prior include the Bayesian LASSO via reversible-jump MCMC (Chen et al., 2011) and the Bayesian LASSO regression based on the ‘Orthant Normal’ prior (Hans, 2009). Unlike their frequentist counterparts, Bayesian methods usually provide a valid measure of standard error based on a geometrically ergodic Markov chain (Kyung et al., 2010). Moreover, an MCMC-based Bayesian framework provides a flexible way of estimating the tuning parameter along with the other model parameters. In this paper (Mallick and Yi, 2014), along the same lines as Park and

Casella (2008) and Yi and Xu (2008), we propose a new hierarchical representation of the Bayesian LASSO. A new Gibbs sampler is put forward utilizing the scale mixture of uniform (SMU) representation of the Laplace density. Empirical studies and real data analyses show that the new algorithm has good mixing properties and yields satisfactory performance comparable to the existing Bayesian LASSO. The remainder of the paper is organized as follows. In Section 3.2, we briefly review the SMU distribution. The Gibbs sampler is presented in Section 3.3. Some empirical studies and real data analyses are presented in Sections 3.4 and 3.5, respectively. Easy extension to general models is provided in Section 3.6, and an ECM algorithm is described in Section 3.7. In Section 3.8, we provide conclusions and further discussions in this area. Some proofs and related derivations are included in an appendix (Appendix A).
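For reference, the frequentist LASSO in (3.1) can be computed with the LARS algorithm; the following is a minimal R sketch using the lars package, which is also how the frequentist LASSO (with 10-fold cross-validation for the tuning parameter) is fit in Sections 3.4 and 3.5. The simulated data below are illustrative only.

## Frequentist LASSO (3.1) via the LARS algorithm (sketch only)
library(lars)
set.seed(1)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(3, 1.5, 0, 0, 2, 0, 0, 0) + rnorm(n, sd = 3))

fit <- lars(X, y, type = "lasso")                 # full LASSO solution path
cv  <- cv.lars(X, y, K = 10, plot.it = FALSE)     # 10-fold cross-validation
s_opt <- cv$index[which.min(cv$cv)]               # L1-fraction minimizing CV error
coef(fit, s = s_opt, mode = "fraction")           # LASSO coefficients at the selected point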

3.2 SCALE MIXTURE OF UNIFORM DISTRIBUTION

Proposition 1. A Laplace density can be written as a scale mixture of uniform distributions, the mixing density being a particular gamma distribution, i.e.,

\frac{\lambda}{2} e^{-\lambda|x|} = \int_{u > |x|} \frac{1}{2u} \cdot \frac{\lambda^2}{\Gamma(2)} u^{2-1} e^{-\lambda u} \, du, \quad \lambda > 0.    (3.3)

Proof: The proof of this result is straightforward and is included in Appendix A.1. (A short simulation check of this representation is given at the end of this section.)

The SMU distribution has been used for regression modeling on a few occasions in the literature. Walker et al. (1997) used the SMU distribution in normal regression models in a non-Bayesian framework. Qin et al. (1998a,b) considered

autocorrelated heteroscedastic regression models and variance regression models, respectively, in which the SMU representations of the associated priors were utilized to derive the corresponding Gibbs samplers. Choy et al. (2008) used it in a stochastic volatility model by means of a two-stage scale mixture representation of the Student-t distribution. However, its use in the penalized regression framework has been limited. We exploit this fact by observing that the LASSO penalty function corresponds to a scale mixture of uniform distributions, the mixing distribution being a particular gamma distribution. Following Park and Casella (2008), we consider conditional Laplace priors of the form (3.2) on the coefficients and a scale-invariant marginal prior on σ². Rewriting the Laplace priors as scale mixtures of uniform distributions and introducing the gamma mixing densities results in a new hierarchy. Under this new hierarchical representation, the posterior distribution of interest p(β, σ²|y) is exactly the same as in the original Bayesian LASSO model of Park and Casella (2008), and therefore the resulting estimates should, theoretically, be exactly the same for both Bayesian LASSO models. We establish this fact by simulation studies and real data analyses. Conditioning on σ² ensures unimodal full posteriors in both Bayesian LASSO models.
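The following is a quick simulation check of Proposition 1 in R (illustrative only): draws generated as x | u ~ Uniform(-u, u) with u ~ Gamma(2, rate = λ) should follow a Laplace(λ) distribution, so the simulated density can be compared directly with λ/2 · exp(-λ|x|).

## Simulation check of the SMU representation (3.3)
set.seed(1)
lambda <- 2
u <- rgamma(1e5, shape = 2, rate = lambda)     # mixing density: Gamma(2, lambda)
x <- runif(1e5, min = -u, max = u)             # x | u ~ Uniform(-u, u)

grid <- seq(-3, 3, length.out = 200)
plot(density(x), main = "SMU draws vs. Laplace density")
lines(grid, lambda / 2 * exp(-lambda * abs(grid)), lty = 2)   # Laplace(lambda) density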

3.3 THE MODEL

3.3.1 MODEL HIERARCHY AND PRIOR DISTRIBUTIONS

Using (3.2) and (3.3), we formulate our hierarchical representation as follows:

y \mid X, \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I_n),

\beta \mid u, \sigma^2 \sim \prod_{j=1}^{p} \text{Uniform}(-\sqrt{\sigma^2}\, u_j, \, \sqrt{\sigma^2}\, u_j),

u \mid \lambda \sim \prod_{j=1}^{p} \text{Gamma}(2, \lambda),    (3.4)

\sigma^2 \sim \pi(\sigma^2).

3.3.2 FULL CONDITIONAL POSTERIOR DISTRIBUTIONS

Introduction of u = (u_1, u_2, . . . , u_p)' enables us to derive tractable full conditional posterior distributions, which are given as

\beta \mid y, X, u, \sigma^2 \sim N_p\!\left(\hat{\beta}_{OLS}, \, \sigma^2 (X'X)^{-1}\right) \prod_{j=1}^{p} I\!\left\{ \frac{|\beta_j|}{\sqrt{\sigma^2}} < u_j \right\},    (3.5)

u_j \mid \beta_j, \sigma^2, \lambda \sim \text{Exponential}(\lambda) \, I\!\left\{ u_j > \frac{|\beta_j|}{\sqrt{\sigma^2}} \right\}, \quad j = 1, . . . , p,    (3.6)

\sigma^2 \mid y, X, \beta, u \sim \text{Inverse-Gamma}\!\left( \frac{n-1+p}{2}, \, \frac{1}{2}(y - X\beta)'(y - X\beta) \right) I\!\left\{ \sigma^2 > \max_j \left( \frac{\beta_j^2}{u_j^2} \right) \right\},    (3.7)

where I(.) denotes an indicator function. The derivations are included in Appendix A.2.
3.3.3 MCMC SAMPLING FOR THE NEW BAYESIAN LASSO

SAMPLING COEFFICIENTS AND LATENT VARIABLES

Equations (3.5), (3.6), and (3.7) lead us to a Gibbs sampler that starts at initial guesses for β and σ² and iterates the following steps:

1. Generate u_j from the left-truncated exponential distribution (3.6) using the inversion method, which can be done as follows: (a) generate u_j* from an exponential distribution with rate parameter λ; (b) set u_j = u_j* + |β_j|/\sqrt{\sigma^2}.

2. Generate β from the truncated multivariate normal distribution proportional to (3.5). This step can be done by implementing an efficient sampling technique developed by Li and Ghosh (2015).

3. Generate σ² from the left-truncated Inverse Gamma distribution proportional to (3.7). This step can be done by utilizing the fact that the inverse of a left-truncated Inverse Gamma distribution is a right-truncated Gamma distribution. By generating σ²* from the right-truncated gamma distribution proportional to

\text{Gamma}\!\left( \frac{n-1+p}{2}, \, \frac{1}{2}(y - X\beta)'(y - X\beta) \right) I\!\left\{ \sigma^{2*} < \frac{1}{\max_j (\beta_j^2 / u_j^2)} \right\},    (3.8)

and replacing σ² = 1/σ²*, we can mimic sampling from the targeted left-truncated Inverse Gamma distribution (3.7).

Figure 3.1: Gibbs Sampler for the NBLASSO for linear models

SAMPLING HYPERPARAMETERS

To update the tuning parameter λ, we work directly with the Laplace density after marginalizing out the latent variables u_j. From (3.4), we observe that the posterior for λ given β is conditionally independent of y and takes the following form:

\pi(\lambda|\beta) \propto \lambda^{2p} \exp\!\left\{ -\lambda \sum_{j=1}^{p} |\beta_j| \right\} \pi(\lambda).    (3.9)

Therefore, if λ has a Gamma(a, b) prior, its conditional posterior will also be a gamma distribution (due to conjugacy), i.e.,

\lambda \mid \beta \propto \lambda^{a + 2p - 1} \exp\!\left\{ -\lambda \left( b + \sum_{j=1}^{p} |\beta_j| \right) \right\}.    (3.10)

Thus, we update the tuning parameter along with the other model parameters by generating samples from Gamma(a + 2p, \, b + \sum_{j=1}^{p} |\beta_j|).

We summarize the algorithm in Figure 3.1.
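The following is a minimal, illustrative R sketch of one iteration of the sampler summarized in Figure 3.1. For simplicity, it updates each β_j from its univariate truncated normal full conditional rather than using the blockwise truncated multivariate normal sampler of Li and Ghosh (2015) referenced in step 2 above, so it should be read as a sketch of the scheme under that simplification, not as the exact implementation used for the results in Sections 3.4 and 3.5.

## One Gibbs iteration for the new Bayesian LASSO (sketch with a componentwise beta update)
nblasso_step <- function(y, X, beta, sigma2, lambda, a = 1, b = 0.1) {
  n <- length(y); p <- ncol(X)
  ## Step 1: u_j = |beta_j|/sqrt(sigma2) + Exponential(lambda), by inversion
  u <- abs(beta) / sqrt(sigma2) + rexp(p, rate = lambda)
  ## Step 2 (simplified): componentwise truncated normal updates for beta, per (3.5)
  for (j in 1:p) {
    r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
    xtx <- sum(X[, j]^2)
    m   <- sum(X[, j] * r_j) / xtx
    s   <- sqrt(sigma2 / xtx)
    bnd <- u[j] * sqrt(sigma2)
    lo  <- pnorm(-bnd, m, s); hi <- pnorm(bnd, m, s)
    beta[j] <- qnorm(runif(1, lo, hi), m, s)     # inverse-CDF truncated normal draw
  }
  ## Step 3: sigma2 via the right-truncated gamma draw in (3.8)
  sh <- (n - 1 + p) / 2
  rt <- 0.5 * sum((y - X %*% beta)^2)
  ub <- 1 / max(beta^2 / u^2)                    # upper bound for 1/sigma2
  prec <- qgamma(runif(1, 0, pgamma(ub, sh, rate = rt)), sh, rate = rt)
  sigma2 <- 1 / prec
  ## Hyperparameter update: lambda | beta ~ Gamma(a + 2p, b + sum|beta_j|), as in (3.10)
  lambda <- rgamma(1, shape = a + 2 * p, rate = b + sum(abs(beta)))
  list(beta = beta, sigma2 = sigma2, u = u, lambda = lambda)
}

Repeatedly calling nblasso_step, discarding a burn-in, and averaging the retained draws yields posterior mean estimates of the kind reported in the simulation studies and real data analyses that follow.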

3.4 SIMULATION STUDIES

3.4.1 PREDICTION

In this section, we investigate the prediction accuracy of our method (NBLasso) and compare its performance with both original Bayesian LASSO (OBLasso) and frequentist LASSO (Lasso) across varied simulation scenarios. The LARS algorithm of Efron et al. (2004) is used for LASSO, in which 10-fold crossvalidation is used to select the tuning parameter, as implemented in the R package lars. For the Bayesian LASSOs, we estimate the tuning parameter λ by using a gamma prior distribution with shape parameter a = 1 and scale parameter b = 0.1, which is relatively flat, and results in high posterior probability near the MLE (Kyung et al., 2010). The Bayesian estimates are posterior means using 10, 000 samples of the Gibbs sampler after the burn-in. To decide on the burn-in number, we make use of the PSRF (Gelman and Rubin, 1992). Once ˆ < 1.1 for all parameters of interest, we continue to draw 10,000 iterations to R obtain samples from the joint posterior distribution. The response is centered and the predictors are normalized to have zero means and unit variances before applying any model selection method. For the prediction errors, we calculate the median of mean squared errors (MMSE) for the simulated examples based on 100 replications. We simulate data from the true model y = Xβ + ,  ∼ N(0, σ 2 I).

Each simulated sample is partitioned into a training set and a testing set. Models are fitted on the training set and MSE’s are calculated on the testing

57 {nT , nP } σ2 {200, 200} 225 {200, 200} 81 {100, 400} 225 {100, 400} 81

Lasso OBLasso 279.72 249.79 101.2 93.89 354.19 268.74 131.62 104.17

NBLasso 244.4 92.93 259.85 102.32

Table 3.1: Median mean squared error (MMSE) based on 100 replications for Example 1 set. In all examples, detailed comparisons with both ordinary and Bayesian LASSO methods are presented.

EXAMPLE 1 (SIMPLE EXAMPLE - I): Here we consider a simple sparse situation which was also used by Tibshirani (1996) in his original LASSO paper. Here we set β = (0T , 2T , 0T , 2T )T , where 0 and 2 are vectors of length 10 with each entry equal to 0 and 2 respectively. The design matrix X is generated from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between xi and xj equal to 0.5. We experiment with four different scenarios by varying the sample size and σ 2 . We simulate datasets with {nT , nP } = {100, 400} and {200,200} respectively, where nT denotes the size of the training set and nP denotes the size of the testing set. We consider two values of σ : σ ∈ {9, 25}. The simulation results are summarized in Table 3.1, which clearly suggests that NBLasso outperforms both Lasso and OBLasso across all scenarios of this example.

EXAMPLE 2 (DIFFICULT EXAMPLE - I): In this example, we consider a complicated model which exhibits a substantial amount of data collinearity. A similar example was presented in the elastic net paper by Zou and Hastie

58 {nT , nP } σ2 {200, 200} 225 {200, 200} 81 {100, 400} 225 {100, 400} 81

Lasso OBLasso 242.97 240.85 90.01 88.98 250.35 254.71 93.36 95.34

NBLasso 240.35 88.92 253.84 94.49

Table 3.2: Median mean squared error (MMSE) based on 100 replications for Example 2 (2005). Here we simulate Z1 , Z2 , and Z3 independently from N(0,1). Then, we let xi = Z1 + i , i = 1, ..., 5; xi = Z2 + i , i = 6, ..., 10; xi = Z3 + i , i = 11, ..., 15, and xi ∼ N (0, 1), i = 16, ..., 30, where i ∼ N (0, 0.01), i = 1, . . . , 15. We set β = (3T , 3T , 3T , 0T )T , where 3 and 0 are vectors of length 5 and 15 with each entry equal to 3 and 0 respectively. We experiment with the same values of σ 2 and {nT , nP } as in Example 1. The simulation results are presented in Table 3.2. It can be observed that NBLasso is competitive with OBLasso in terms of prediction accuracy in all the scenarios presented in this example.

EXAMPLE 3 (HIGH CORRELATION EXAMPLE - I): Here we consider a sparse model with strong level of correlation. We set β = (3, 1.5, 0, 0, 2 , 0, 0, 0)T and σ 2 = {1, 9}. The design matrix X is generated from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between xi and xj equal to 0.95 ∀ i = j. We simulate datasets with nT = {20, 50, 100, 200} for the training set and nP = 200 for the testing set. Table 3.3 summarizes our experimental results for this example. We can see that both Bayesian LASSOs yield similar performance and usually outperform their frequentist counterpart. As σ decreases and nT increases, all the three methods yield equivalent performance.

59 nT 20 50 100 200 20 50 100 200

σ 3 3 3 3 1 1 1 1

Lasso 11.61 10.03 9.6 9.4 1.79 1.28 1.19 1.1

OBLasso 10.4 9.8 9.55 9.29 1.6 1.27 1.18 1.1

NBLasso 10.4 9.8 9.54 9.29 1.6 1.27 1.18 1.1

Table 3.3: Median mean squared error (MMSE) based on 100 replications for Example 3

EXAMPLE 4 (SMALL n LARGE p EXAMPLE): Here we consider a case where p ≥ n. We let β 1:q = (5, ..., 5)T , β q+1:p = 0, p = 20, q = 10, σ = {1, 3}. The design matrix X is generated from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between xi and xj equal to 0.95 ∀ i = j. We simulate datasets with nT = {10, 20} for the training set and nP = 200 for the testing set. It is evident from the results presented in Table 3.4 that the proposed method performs better than both OBLasso and Lasso in most of the situations. In one situation, OBLasso performs slightly better. Overall, Bayesian LASSOs significantly outperform frequentist LASSO in terms of prediction accuracy.

3.4.2

VARIABLE SELECTION

In this section, we investigate the model selection performance of our method (NBLasso) and compare its performance with both original Bayesian LASSO (OBLasso) and frequentist LASSO (Lasso). It should be noted that LASSO

60 nT 10 10 20 20

σ 3 1 3 1

Lasso OBLasso NBLasso 91.39 77.0 77.4 81.4 69.47 68.81 86.04 41.66 41.59 46.94 30.98 30.71

Table 3.4: Median mean squared error (MMSE) based on 100 replications for Example 4 was originally developed as a variable selection tool. However, in Bayesian framework, this attractive property vanishes as Bayesian LASSOs usually do not set any coefficient to zero. One way to tackle this problem is to use the credible interval criterion, as suggested by Park and Casella (2008) in their seminal paper. However, it brings the problem of threshold selection (Leng et al., 2014). Moreover, credible intervals are not uniquely defined. Therefore, we will seek out an alternative strategy here. In Bayesian paradigm, it is a standard procedure to fully explore the posterior distribution and estimate λ by posterior median or posterior mean. Therefore, we can plug in the posterior estimate of λ in (3.1), and solve (3.1) to carry out variable selection. This Bayesian-frequentist strategy was recently used by Leng et al. (2014). For the optimization problem (3.1), we make use of the LARS algorithm of Efron et al. (2004). For each simulated dataset, we apply three different LASSO methods viz. NBLasso, OBLasso, and Lasso, and record the frequency of correctly-fitted models over 100 replications. For the Bayesian LASSOs, we assign a Gamma (1, 0.1) prior on λ to estimate the tuning parameter. Based on the posterior samples (10, 000 MCMC samples after the burn-in), we calculate two posterior quantities of interest, viz. posterior mean and posterior median. Then we plug-in either

61 nT 20 50 100 200

Lasso 15 11 16 11

OBL - Mean 13 14 21 16

NBL - Mean 22 20 25 17

OBL - Median 11 12 20 16

NBL - Median 17 15 25 16

Table 3.5: Frequency of correctly-fitted models over 100 replications for Example 5 posterior mean or posterior median estimate of λ in (3.1), and solve (3.1) to get the estimates of the coefficients. We refer to these different strategies as NBLMean, OBL-Mean, NBL-Median, and OBL-Median, where NBL-Mean refers to NBLasso coupled with λ estimated by posterior mean, OBL-Median refers to OBLasso coupled with λ estimated by posterior median, and so on. LARS algorithm is used for the frequentist LASSO, in which 10-fold cross-validation is used to select the tuning parameter. The response is centered and the predictors are normalized to have zero means and unit variances before applying any model selection method.

EXAMPLE 5 (SIMPLE EXAMPLE - II): This example was used by Tibshirani (1996) in his original LASSO paper to systematically compare the predictive performance of LASSO and ridge regression. Here we set β = (3, 1.5, 0, 0, 2, 0, 0, 0)T and σ 2 = 9. The design matrix X is generated from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between xi and xj equal to 0.5|i−j| ∀ i = j. We simulate datasets with nT = {20, 50, 100, 200} for the training set and nP = 200 for the testing set. The simulation results are summarized in Table 3.5. It can be observed that NBLasso performs reasonably well outperforming both frequentist and original

62 nT 20 50 100 200

Lasso 32 34 27 20

OBL - Mean 10 13 20 14

NBL - Mean 28 26 31 21

OBL - Median 7 12 19 12

NBL - Median 17 22 27 18

Table 3.6: Frequency of correctly-fitted models over 100 replications for Example 6 Bayesian LASSO.

EXAMPLE 6 (SIMPLE EXAMPLE - III): We consider another simple example from Tibshirani’s original LASSO paper. We set β = (5, 0, 0, 0, 0, 0, 0, 0)T and σ 2 = 9. The design matrix X is generated from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between xi and xj equal to 0.5|i−j| ∀ i = j. We simulate datasets with nT = {20, 50, 100, 200} for the training set and nP = 200 for the testing set. The simulation results are presented in Table 3.6. We see that NBLasso always performs better than OBLasso for this example although outperformed by frequentist LASSO in many situations. The reason might be contributed to the fact that not much variance is explained by introducing the priors which resulted in poor model selection performance for the Bayesian methods.

EXAMPLE 7 (DIFFICULT EXAMPLE - II): Here we consider a situation where the LASSO estimator does not give consistent model selection. This example is taken from the adaptive LASSO paper by Zou (2006). Here we set β = (5.6, 5.6, 5.6, 0)T and the correlation matrix of X is such that Cor(xi , xj ) = −0.39, i < j < 4 and Cor(xi , x4 ) = 0.23, i < 4. Zou (2006) showed

63 nT 120 300 300

σ 5 3 1

Lasso 0 0 0

OBL - Mean 6 10 9

NBL - Mean 6 8 12

OBL - Median 5 12 12

NBL - Median 6 10 12

Table 3.7: Frequency of correctly-fitted models over 100 replications for Example 7 nT 20 50 100 200

Lasso 8 11 8 5

OBL - Mean 19 18 17 16

NBL - Mean 22 20 17 16

OBL - Median 17 18 17 16

NBL - Median 19 19 17 16

Table 3.8: Frequency of correctly-fitted models over 100 replications for Example 8 that for this example, LASSO is inconsistent regardless of the sample size. The experimental results are summarized in Table 3.7. None of the methods perform well for this example. Both Bayesian LASSOs yield similar performance and behave better than frequentist LASSO in selecting the correct model.

EXAMPLE 8 (HIGH CORRELATION EXAMPLE - II): Here we consider a simple model with strong level of correlation. We set β = (5, 0, 0, 0, 0, 0, 0, 0)T and σ 2 = 9. The design matrix X is generated from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between xi and xj equal to 0.95 ∀ i = j. We simulate datasets with nT = {20, 50, 100, 200} for the training set and nP = 200 for the testing set. Table 3.8 summarizes our experimental results for this example, which indicates that both Bayesian LASSOs yield similar performance outperforming frequentist LASSO.

64

3.4.3

SOME COMMENTS

We have considered a variety of experimental situations to investigate the predictive and model selection performance of NBLasso. Most of the simulation examples considered here have previously appeared in other LASSO and related papers. From our extensive simulation experiments, it is evident that NBLasso performs as well as, or better than OBLasso for most of the examples. For the simple examples, NBLasso performs the best whereas for other examples, it provides comparable and slightly better performance in terms of prediction and model selection. It is to be noted that superiority of Bayesian LASSO and related methods is already well-established in the literature (Kyung et al., 2010; Leng et al., 2014; Li and Lin, 2010). We have found similar conclusion in this paper. For all the simulated examples, convergence of the corresponding MCMC chain was assessed by trace plots of the generated samples and calculating the Gelman-Rubin scale reduction factor (Gelman and Rubin, 1992) using the coda package in R. For n ≤ p situation, none of the methods performed well in model selection. Therefore, those results are omitted. In summary, based on our experimental results, it can be concluded that NBLasso is as effective as OBLasso with respect to both model selection and prediction performance.

3.5

REAL DATA ANALYSES

In this section, two real data analyzes are conducted using the proposed and existing LASSO methods. Four different methods are applied to the datasets: original Bayesian LASSO (OBLasso), new Bayesian LASSO (NBLasso), frequentist LASSO (Lasso), and ordinary least squares (OLS). For the Bayesian

65 methods, posterior means are calculated as estimates based on 10, 000 samples after the burn-in. To decide on the burn-in number, we make use of the PSRF ˆ < 1.1 for all parameters of interest, we (Gelman and Rubin, 1992). Once R continue to draw 10,000 iterations to obtain samples from the joint posterior distribution. The tuning parameter λ is estimated as posterior mean with a gamma prior with shape parameter a = 1 and scale parameter b = 0.1 in the MCMC algorithm. The convergence of our MCMC is checked by trace plots of the generated samples and calculating the Gelman-Rubin scale reduction factor (Gelman and Rubin, 1992) using the coda package in R. For the frequentist LASSO, 10-fold cross-validation (CV) is used to select the shrinkage parameter. The response is centered and the predictors are normalized to have zero means and unit variances before applying any model selection method.

3.5.1

THE DIABETES EXAMPLE

We analyze the benchmark diabetes dataset (Efron et al., 2004), which contains measurements from n = 442 diabetes patients. Each measurement has ten baseline predictors: age, sex, body mass index (bmi ), average blood pressure (map), and six blood serum measurements (tc, ldl, hdl, tch, lth, glu). The response variable is a quantity that measures progression of diabetes one year after baseline. Figure 3.2 gives the 95% equal-tailed credible intervals along with the posterior mean Bayesian LASSO estimates of Diabetes data covariates along with the frequentist LASSO estimates, overlaid with the OLS estimates with corresponding 95% confidence intervals. The estimated λ’s are 5.1 (2.5, 9.1) and 4.0 (2.2, 6.4) for NBLasso and OBLasso respectively. Figure 3.2 shows that two Bayesian LASSOs behave similarly for all the coefficients of this dataset. The

Figure 3.2: Posterior mean Bayesian LASSO estimates (computed over a grid of λ values, using 10,000 samples after the burn-in) and corresponding 95% credible intervals (equal-tailed) of Diabetes data (n = 442) covariates. The hyperprior parameters were chosen as a = 1, b = 0.1. The OLS estimates with corresponding 95% confidence intervals are also reported. For the LASSO estimates, the tuning parameter was chosen by 10-fold CV of the LARS algorithm

Figure 3.3: Histograms based on posterior samples of Diabetes data covariates

95% credible intervals are also similar. Any observed differences in parameter estimates can be attributed (up to Monte Carlo error) to the properties of the different Gibbs samplers used to obtain samples from the corresponding posterior distributions. The histograms of the Diabetes data covariates based on posterior samples of 10,000 iterations are illustrated in Figure 3.3. These histograms reveal that the conditional posterior distributions are in fact the desired stationary truncated univariate normals. The mixing of an MCMC chain shows how rapidly the chain converges to the stationary distribution (Gelman et al., 2003). A trace plot is a good visual indicator of the mixing property. This plot is shown in Figure 3.4 for the Diabetes data covariates. It is highly satisfactory to observe that for this benchmark dataset, the samples traverse the posterior space very quickly. We also conduct Geweke's convergence diagnostic test (Geweke, 1992) and all the


Figure 3.4: Trace plots of Diabetes data covariates

individual chains pass the test. All these illustrate that the new Gibbs sampler has good mixing properties.

3.5.2

THE PROSTATE EXAMPLE

The data in this example are taken from a prostate cancer study (Stamey et al., 1989). Following Zou and Hastie (2005), we analyze the data by dividing it into a training set with 67 observations and a test set with 30 observations. Model fitting is carried out on the training data and performance is evaluated with the prediction error (MSE) on the test data. The response of interest is the logarithm of prostate-specific antigen. The predictors are eight clinical measures: the logarithm of cancer volume (lcavol), the logarithm of prostate weight (lweight), age, the logarithm of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), the logarithm of capsular penetration (lcp), the Gleason score (gleason), and the percentage of Gleason scores 4 or 5 (pgg45). Figure 3.5 shows the posterior mean Bayesian LASSO estimates and the corresponding 95% equal-tailed credible intervals for the regression parameters of the Prostate data, together with the point estimates of the frequentist LASSO and the OLS estimates with corresponding 95% confidence intervals. The estimated λ's are 3.5 (1.6, 7.3) and 3.1 (1.5, 5.3) for NBLasso and OBLasso, respectively. The predictors in this dataset are known to be more correlated than those in the Diabetes data. Even for this dataset, the proposed method performs impressively. Figure 3.5 reveals that the two Bayesian LASSO estimates are strikingly similar and the corresponding 95% credible intervals are almost identical for this dataset. Also, it is interesting to note that all the estimates are inside the credible intervals, which indicates that the resulting conclusion will be similar regardless of which method is used.

Moreover, the new method outperforms both OBLasso and Lasso in terms of prediction accuracy (Table 3.9). The trace plots shown in Figure 3.6 demonstrate that the sampler jumps from one remote region of the posterior space to another in relatively few steps. Here also, we conduct Geweke's convergence diagnostic test (Geweke, 1992) and all the individual chains pass the test, which again establishes the good mixing property of the proposed Gibbs sampler. The histograms of the Prostate data covariates based on 10,000 posterior samples (Figure 3.7) reveal that the conditional posterior distributions are the desired stationary distributions, viz. truncated univariate normals, which further validates our findings.

Method     MSE
NBLasso    0.4696
OBLasso    0.4729
Lasso      0.4856
OLS        0.5212

Table 3.9: Prostate Cancer Data Analysis - Mean squared prediction errors based on 30 observations of the test set for four methods: New Bayesian Lasso (NBLasso), Original Bayesian Lasso (OBLasso), Lasso, and OLS

3.6 EXTENSIONS

3.6.1 MCMC FOR GENERAL MODELS

In this section, we briefly discuss how NBLasso can be extended to several other models (e.g., GLMs, Cox's model) beyond linear regression. Our extension is based on the least squares approximation (LSA) of Wang and Leng (2007). Recently, Leng et al. (2014) used this approximation for the original Bayesian LASSO; therefore, here we only describe the algorithm for NBLasso.

Figure 3.5: Posterior mean Bayesian LASSO estimates (computed over a grid of λ values, using 10,000 samples after the burn-in) and corresponding 95% credible intervals (equal-tailed) of Prostate data (n = 67) covariates. The hyperprior parameters were chosen as a = 1, b = 0.1. The OLS estimates with corresponding 95% confidence intervals are also reported. For the LASSO estimates, the tuning parameter was chosen by 10-fold CV of the LARS algorithm

Figure 3.6: Trace plots of Prostate data covariates

Let us denote by L(β) the negative log-likelihood. Following Wang and Leng (2007), L(β) can be approximated by the LSA as follows

  L(β) ≈ (1/2) (β − β̃)′ Σ̂⁻¹ (β − β̃),

where β̃ is the MLE of β and Σ̂⁻¹ = ∂²L(β)/∂β². Therefore, for a general model, the conditional distribution of y is given by

  y | β ∝ exp{ −(1/2) (β − β̃)′ Σ̂⁻¹ (β − β̃) }.

Thus, we can easily extend our method to several other models by approximating the corresponding likelihood by the normal likelihood. Combining the SMU representation of the Laplace density and the LSA approximation of the general likelihood, the hierarchical presentation of NBLasso for general models can be written as

  y | β ∝ exp{ −(1/2) (β − β̃)′ Σ̂⁻¹ (β − β̃) },
  β | u ∼ ∏_{j=1}^{p} Uniform(−u_j, u_j),
  u | λ ∼ ∏_{j=1}^{p} Gamma(2, λ),
  λ ∼ Gamma(a, b).    (3.11)

The full conditional distributions are given as

  β | y, X, u, λ ∼ N_p(β̃, Σ̂) ∏_{j=1}^{p} I{ |β_j| < u_j },    (3.12)

  u | β, λ ∼ ∏_{j=1}^{p} Exponential(λ) I{ u_j > |β_j| },    (3.13)

  λ | β ∼ λ^{a+2p−1} exp{ −λ(b + ∑_{j=1}^{p} |β_j|) }.    (3.14)

As before, an efficient Gibbs sampler can be easily carried out based on these full conditionals. We summarize the algorithm in Figure 3.8.
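For concreteness, the two LSA ingredients β̃ and Σ̂ in (3.11) can be obtained from a standard maximum likelihood fit. The R sketch below does this for a logistic regression; it is our illustration, not the dissertation's own code, and it assumes the columns of X are the predictors.

```r
# Sketch: computing the LSA quantities for a GLM and plugging them into the sampler.
fit        <- glm(y ~ X - 1, family = binomial())
beta_tilde <- coef(fit)   # MLE, plays the role of beta-tilde in (3.11)
Sigma_hat  <- vcov(fit)   # estimated inverse information, plays the role of Sigma-hat

# These two quantities replace the OLS estimate and sigma^2 (X'X)^{-1} in the
# linear-model Gibbs sampler; the remaining steps are unchanged.
```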

3.6.2

SIMULATION EXAMPLES FOR GENERAL MODELS

We now assess the performance of NBLasso in general models by means of two examples. For brevity, we only report the performance of the various methods in terms of prediction accuracy. Three different LASSO methods are applied to the simulated datasets.

Figure 3.7: Histograms based on posterior samples of Prostate data covariates

Figure 3.8: Gibbs Sampler for the NBLASSO for general models

For the frequentist LASSO, we use the R package glmnet, which implements the coordinate descent algorithm of Friedman et al. (2010), in which 10-fold cross-validation is used to select the tuning parameter. We normalize the predictors to have zero means and unit variances before applying any model selection method. For the prediction errors, we calculate the median of mean squared errors (MMSE) for the simulated examples based on 100 replications. The design matrix X is generated from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between x_i and x_j equal to 0.5^{|i−j|} for all i ≠ j. We simulate datasets with n_T = {200, 400} for the training set and n_P = 500 for the testing set.
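A short sketch of the frequentist competitor just described (our illustration; variable names are ours). The binomial family applies to the logistic example below; for the Cox example one would pass `family = "cox"` with a survival response.

```r
# Sketch: frequentist LASSO with 10-fold CV via glmnet, as used for the comparisons.
library(glmnet)

cv_fit   <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 1, nfolds = 10)
beta_hat <- coef(cv_fit, s = "lambda.min")                         # CV-selected coefficients
pred     <- predict(cv_fit, newx = X_test, s = "lambda.min", type = "response")
```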

EXAMPLE 9 (LOGISTIC REGRESSION EXAMPLE): In this example, observations with binary response are independently generated according to the following model (Wang and Leng, 2007)

  P(y_i = 1 | x_i) = exp{x_i^T β} / (1 + exp{x_i^T β}),

where β = (3, 0, 0, 1.5, 0, 0, 2, 0, 0)^T. The experimental results are summarized in Table 3.10, which shows that NBLasso performs comparably with OBLasso. As the size of the training data increases, all three methods yield equivalent performance.
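The following R sketch simulates one replicate of Example 9 under the stated model (our illustration; the seed and object names are ours).

```r
# Sketch: binary responses from the logistic model of Example 9.
library(MASS)

set.seed(2)
beta  <- c(3, 0, 0, 1.5, 0, 0, 2, 0, 0)
p     <- length(beta)
n     <- 200 + 500                                       # training + test
Sigma <- outer(1:p, 1:p, function(i, j) 0.5^abs(i - j))  # correlation 0.5^{|i-j|}
X     <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
prob  <- plogis(drop(X %*% beta))                        # exp(x'b) / (1 + exp(x'b))
y     <- rbinom(n, size = 1, prob = prob)
```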

nT      Lasso    OBLasso    NBLasso
200     0.004    0.006      0.006
400     0.003    0.004      0.003

Table 3.10: Simulation Results for Logistic Regression

EXAMPLE 10 (COX'S MODEL EXAMPLE): In this simulation study,

independent survival data are generated according to the following hazard function (Wang and Leng, 2007)

  h(t_i | x_i) = exp{x_i^T β},

where t_i is the survival time for the ith subject and β = (0.8, 0, 0, 1, 0, 0, 0.6, 0)^T. Also, an independent censoring time is generated from an exponential distribution with mean u exp{x_i^T β}, where u ∼ Uniform(1, 3). The experimental results are summarized in Table 3.11, which shows that both Bayesian LASSOs perform comparably, outperforming the frequentist LASSO. Thus, it is evident from the experimental results that NBLasso is as effective as OBLasso even for general models.

nT      Lasso    OBLasso    NBLasso
200     0.41     0.3        0.3
400     0.15     0.1        0.1

Table 3.11: Simulation Results for Cox's Model
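The sketch below simulates one replicate of Example 10 under the constant-hazard and censoring mechanism just described (our illustration, not the dissertation's code).

```r
# Sketch: survival times with hazard exp(x'beta) and exponential censoring
# with mean u * exp(x'beta), u ~ Uniform(1, 3).
library(MASS)

set.seed(3)
beta  <- c(0.8, 0, 0, 1, 0, 0, 0.6, 0)
p     <- length(beta)
n     <- 200 + 500
Sigma <- outer(1:p, 1:p, function(i, j) 0.5^abs(i - j))
X     <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

lin    <- drop(X %*% beta)
t_ev   <- rexp(n, rate = exp(lin))             # constant hazard exp(x'beta)
u      <- runif(n, 1, 3)
cens   <- rexp(n, rate = 1 / (u * exp(lin)))   # exponential with mean u * exp(x'beta)
time   <- pmin(t_ev, cens)
status <- as.numeric(t_ev <= cens)             # 1 = event observed, 0 = censored
```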


3.7 COMPUTING MAP ESTIMATES

3.7.1 ECM ALGORITHM FOR LINEAR MODELS

It is well-known that the full conditional distributions of a truncated multivariate normal distribution are truncated univariate normal distributions. This fact motivates us to develop an ECM algorithm to estimate the marginal posterior modes of β and σ². At each step, we treat the latent variables u_j's and the tuning parameter λ as the missing parameters and average over them to estimate the β_j's and σ² by maximizing the expected log conditional posterior distributions. The complete data log-likelihood is given by

  log p(β, σ², u, λ | y, X) ∝ ∑_{i=1}^{n} log p(y_i | X_i β, σ²) + ∑_{j=1}^{p} log p(β_j | u_j, σ²) + ∑_{j=1}^{p} log p(u_j | λ) + log p(σ²) + log p(λ)

  ∝ { −(1/(2σ²)) ∑_{i=1}^{n} (y_i − X_i β)² + (a + 2p − 1) log λ − λ(b + ∑_{j=1}^{p} u_j) − ((n − 1 + p)/2) log σ² } I{ u_j > |β_j|/√σ², ∀ j }.    (3.15)

We initialize the algorithm by starting with a guess of β and σ 2 . Then, at each step of the algorithm, we replace u and λ in the log joint posterior (3.15) by their expected values conditional on the current estimates of β and σ 2 . Finally, we update β and σ 2 by maximizing the expected log conditional posterior distributions. The algorithm proceeds as follows:

E-Step:

  (1/λ)^{(t)} = (b + ∑_{j=1}^{p} |β_j^{(t)}|) / (a + 2p − 1),
  u_j^{(t)} = (1/λ)^{(t)} + |β_j^{(t)}| / √(σ²^{(t)}),  j = 1, …, p;

CM-Steps:

  β_j^{(t+1)} = −u_j^{(t)} √(σ²^{(t)})   if β̂_j^{OLS} < −u_j^{(t)} √(σ²^{(t)}),
              = β̂_j^{OLS}               if −u_j^{(t)} √(σ²^{(t)}) < β̂_j^{OLS} < u_j^{(t)} √(σ²^{(t)}),
              = u_j^{(t)} √(σ²^{(t)})    if β̂_j^{OLS} > u_j^{(t)} √(σ²^{(t)}),    j = 1, …, p,

  σ²^{(t+1)} = max{ (y − Xβ^{(t+1)})′(y − Xβ^{(t+1)}) / (n + p + 1),  max_j (β_j^{(t+1)} / u_j^{(t)})² }.

At convergence of the algorithm, we summarize the inferences using the latest estimates of β and σ² and their estimated variances (Johnson et al., 1994).

3.7.2 ECM ALGORITHM FOR GENERAL MODELS

Similarly, an approximate ECM algorithm for general models can be given as follows:

1. E-Step:

  (1/λ)^{(t)} = (b + ∑_{j=1}^{p} |β_j^{(t)}|) / (a + 2p − 1),
  u_j^{(t)} = (1/λ)^{(t)} + |β_j^{(t)}|,  j = 1, …, p.

2. CM-Steps:

  β_j^{(t+1)} = −u_j^{(t)}     if β̂_j^{MLE} < −u_j^{(t)},
              = β̂_j^{MLE}     if −u_j^{(t)} < β̂_j^{MLE} < u_j^{(t)},
              = u_j^{(t)}      if β̂_j^{MLE} > u_j^{(t)},    j = 1, …, p.

3. Repeat 1 & 2 until convergence.

3.8 CONCLUDING REMARKS

In this paper, we have introduced a new hierarchical representation of Bayesian LASSO using the SMU distribution. It is to be noted that the posterior distribution of interest p(β, σ 2 |y) is exactly the same for both original Bayesian LASSO (OBLasso) and new Bayesian LASSO (NBLasso) models. As such, all inference and prediction that is based on the posterior distribution should be exactly the same ‘theoretically’. Any observed differences must be attributed (up to Monte Carlo error) to the properties of the different Gibbs samplers used to obtain samples from the corresponding posterior distributions. We establish this fact by real data analyses and empirical studies. Our results indicate that both Bayesian LASSOs perform comparably in different empirical and real scenarios. In many situations, the new method is competitive in terms of either

prediction accuracy or variable selection. Moreover, the new Bayesian method performs quite satisfactorily for general models beyond linear models. Furthermore, the proposed Gibbs sampler exhibits good mixing properties, as evident from both empirical studies (data not shown due to the large number of predictors) and real data analyses. It is to be noted, however, that we do not have a theoretical result on the convergence of our MCMC algorithm. Therefore, despite our encouraging findings, theoretical research is needed to investigate the geometric ergodicity of the proposed Gibbs sampler.


REFERENCES

Andrews, D. F. and C. L. Mallows (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 36(1), 99–102.

Bae, K. and B. Mallick (2004). Gene selection using a two-level hierarchical Bayesian model. Bioinformatics 20(18), 3423–3430.

Chen, X., J. Z. Wang, and J. McKeown (2011). A Bayesian lasso via reversible-jump MCMC. Signal Processing 91(8), 1920–1932.

Choy, S., W. Wan, and C. Chan (2008). Bayesian student-t stochastic volatility models via scale mixtures. In S. Chib, W. Griffiths, G. Koop, and D. Terrell (Eds.), Bayesian Econometrics (Advances in Econometrics), Volume 23, pp. 595–618. Emerald Group Publishing Limited.

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32(2), 407–499.

Figueiredo, M. (2003). Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9), 1150–1159.

Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1–22.

Gelman, A., J. Carlin, H. Stern, and D. Rubin (2003). Bayesian Data Analysis. Chapman & Hall, London.

Gelman, A. and D. B. Rubin (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7(4), 457–472.

Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments (with discussion). Bayesian Statistics 4, 169–194.

Hans, C. M. (2009). Bayesian lasso regression. Biometrika 96(4), 835–845.

Johnson, N., S. Kotz, and N. Balakrishnan (1994). Continuous Univariate Distributions. John Wiley and Sons, New York.

Kyung, M., J. Gill, M. Ghosh, and G. Casella (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis 5(2), 369–412.

Leng, C., M. Tran, and D. Nott (2014). Bayesian adaptive lasso. Annals of the Institute of Statistical Mathematics 66(2), 221–244.

Li, Q. and N. Lin (2010). The Bayesian elastic net. Bayesian Analysis 5(1), 151–170.

Li, Y. and S. K. Ghosh (2015). Efficient sampling methods for truncated multivariate normal and student-t distributions subject to linear inequality constraints. Journal of Statistical Theory and Practice, upcoming.

Mallick, H. and N. Yi (2014). A new Bayesian lasso. Statistics and its Interface 7(4), 571–582.

Park, T. and G. Casella (2008). The Bayesian lasso. Journal of the American Statistical Association 103(482), 681–686.

Qin, Z., S. Walker, and P. Damien (1998a). Uniform scale mixture models with applications to Bayesian inference. Technical report, University of Michigan Ross School of Business.

Qin, Z., S. Walker, and P. Damien (1998b). Uniform scale mixture models with applications to variance regression. Technical report, University of Michigan Ross School of Business.

Stamey, T., J. Kabalin, J. McNeal, I. Johnstone, F. Frieha, E. Redwine, and N. Yang (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. Journal of Urology 141(5), 1076–1083.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 58(1), 267–288.

Walker, S., P. Damien, and M. Meyer (1997). Uniform scale mixture models with applications to variance regression. Technical report, University of Michigan Ross School of Business.

Wang, H. and C. Leng (2007). Unified lasso estimation by least squares approximation. Journal of the American Statistical Association 102(479), 1039–1048.

West, M. (1987). On scale mixtures of normal distributions. Biometrika 74(3), 646–648.

Yi, N. and S. Xu (2008). Bayesian lasso for quantitative trait loci mapping. Genetics 179(2), 1045–1055.

Yuan, M. and Y. Lin (2005). Efficient empirical Bayes variable selection and estimation in linear models. Journal of the American Statistical Association 100(472), 1215–1225.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476), 1418–1429.

Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2), 301–320.


BAYESIAN BRIDGE REGRESSION

by

HIMEL MALLICK, NENGJUN YI

In preparation for submission Format adapted for dissertation

ABSTRACT

Classical bridge regression is known to possess many desirable statistical properties such as oracle, sparsity, and unbiasedness. In this paper, we propose a Bayesian approach to bridge regression (BBR). A straightforward Gibbs sampler is introduced based on the scale mixture of uniform (SMU) representation of the Bayesian bridge prior. Based on a recent result, we provide sufficient conditions for strong posterior consistency of the Bayesian bridge prior in high-dimensional sparse situations. Through simulation studies and real data analyses, we compare the performance of the proposed method with existing Bayesian and non-Bayesian methods. Results show that the proposed BBR method performs well in comparison to the other approaches.

KEYWORDS: Bridge Regression, Bayesian Bridge Regression, MCMC, Variable Selection.

4.1

INTRODUCTION

Recall the bridge regression (Frank and Friedman, 1993) that results from the following regularization problem:

  min_β (y − Xβ)′(y − Xβ) + λ ∑_{j=1}^{p} |β_j|^α,    (4.1)

where λ > 0 is the tuning parameter and α > 0 is the concavity parameter. Bridge regularization with 0 < α < 1 is known to possess many desirable statistical properties such as oracle, sparsity, and unbiasedness (Xu et al., 2010),

and produces sparser solutions than the LASSO estimator (Tibshirani, 1996). However, despite being theoretically attractive in terms of variable selection and parameter estimation, bridge regression usually cannot produce valid standard errors (Kyung et al., 2010). Both the approximate covariance matrix method and the bootstrap method for calculating standard errors are known to be unstable for bridge regression (Knight and Fu, 2000). Therefore, in practice, the standard error estimates of the bridge estimator might be problematic, reducing the practical use of the estimator. Bayesian analysis naturally overcomes this limitation by providing a valid measure of standard error based on a geometrically ergodic Markov chain, along with an appropriate point estimator. Ideally, a Bayesian solution can be obtained by placing an appropriate prior on the coefficients which will mimic the property of the bridge penalty. Bridge regularization has a natural Bayesian interpretation. Frank and Friedman (1993) suggested that the bridge estimates can be interpreted as posterior mode estimates when the regression coefficients are assigned independent and identical generalized Gaussian (GG) priors. Most of the Bayesian regularization methods developed so far are primarily based on the scale mixture of normal (SMN) representations of the associated priors (Kyung et al., 2010). Unfortunately, such a representation is not explicitly available for the Bayesian bridge prior when 0 < α < 1. See, for example, Armagan (2009), who considered a variational approach to the bridge regression. This fact was also recognized by Park and Casella (2008) in their seminal paper, which mentioned that a simple extension of the Bayesian LASSO to the Bayesian bridge regression would not be straightforward given that an explicit SMN representation is not available. Therefore, to come up with a Bayesian solution, an alternative technique is

required. Recently, Polson et al. (2014) provided a set of Bayesian bridge estimators for linear models based on two distinct scale mixture representations of the generalized Gaussian density. The first one utilizes the SMN representation (Andrews and Mallows, 1974; West, 1987), for which the mixing variable is not explicit, in the sense that it requires simulating draws from an exponentially-tilted stable random variable, which is quite difficult to generate in practice (Devroye, 2009). To avoid the need to deal with exponentially tilted stable random variables, Polson et al. (2014) proposed another Bayesian bridge estimator based on the scale mixture of triangular (SMT) representation of the GG prior. In this paper, along the same lines, we utilize the scale mixture of uniform (SMU) representation of the GG prior, which in turn facilitates a computationally efficient MCMC algorithm. Following Park and Casella (2008), we consider a conditional GG prior specification (a GG distribution with mean 0, shape parameter α, and scale parameter √σ² λ^{−1/α}) of the form

  π(β | σ²) ∝ ∏_{j=1}^{p} exp{ −λ(|β_j|/√σ²)^α },    (4.2)

and a non-informative scale-invariant marginal prior on σ², i.e., π(σ²) ∝ 1/σ². Conditioning on σ² is important as it ensures a unimodal full posterior (Park and Casella, 2008). Rather than minimizing (4.1), we solve the problem by using a Gibbs sampler, which involves constructing a Markov chain having the joint posterior for β as its stationary distribution. Unlike the frequentist approach,

89 statistical inference for the Bayesian bridge is much more straightforward. Moreover, the tuning parameter can be easily estimated as an automatic byproduct of the Markov chain Monte Carlo procedure. In summary, this paper introduces some new aspects of the broader Bayesian treatment of the bridge regression. First, we propose a simple Gibbs sampler based on the SMU representation of the GG prior. Second, based on a recent result, we provide sufficient conditions on strong posterior consistency of the proposed Bayesian bridge prior in high-dimensional sparse situations. In addition, several flexible extensions of the method are presented, which provide a unified framework for modeling a variety of outcomes in varied real life scenarios. The remainder of the paper is organized as follows. In Section 4.2, we describe the hierarchical representation of the Bayesian bridge model. The resulting Gibbs sampler is put forward in Section 4.3. A result on posterior consistency is presented in Section 4.4. Some empirical studies and real data analyses are described in Section 4.5. Section 4.6 describes an ECM algorithm to compute approximate posterior mode estimates. Some extensions are provided in Section 4.7. In Section 4.8, we provide conclusions and further discussions in this area. Some proofs and related derivations are included in an appendix (Appendix B).

4.2 THE MODEL

4.2.1 SCALE MIXTURE OF UNIFORM DISTRIBUTION

Proposition 2 A generalized Gaussian distribution can be written as a scale mixture of uniform distributions, the mixing distribution being a particular gamma distribution, as follows:

  λ^{1/α} / (2Γ(1/α + 1)) · e^{−λ|x|^α} = ∫_{u > |x|^α} 1/(2u^{1/α}) · λ^{1/α + 1} / Γ(1/α + 1) · u^{1/α} e^{−λu} du.    (4.3)

Proof : A straightforward proof of this result is provided in Appendix B.1.
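As a quick numerical sanity check of Proposition 2 (our own illustration, not part of the dissertation), one can simulate from the mixture on the right-hand side of (4.3) and compare the result with the generalized Gaussian density on the left-hand side.

```r
# Sketch: draw u ~ Gamma(1/alpha + 1, rate = lambda), then x | u ~ Uniform(-u^(1/alpha), u^(1/alpha)),
# and overlay the generalized Gaussian density from (4.3).
set.seed(4)
alpha  <- 0.5
lambda <- 1
n      <- 1e5

u <- rgamma(n, shape = 1/alpha + 1, rate = lambda)
x <- runif(n, min = -u^(1/alpha), max = u^(1/alpha))

gg_density <- function(x) {
  lambda^(1/alpha) * exp(-lambda * abs(x)^alpha) / (2 * gamma(1/alpha + 1))
}

hist(x, breaks = 200, freq = FALSE, xlim = c(-10, 10), main = "SMU check")
curve(gg_density(x), add = TRUE, lwd = 2)
```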

4.2.2 HIERARCHICAL REPRESENTATION

Using (4.2) and (4.3), we can formulate our hierarchical representation as

  y_{n×1} | X, β, σ² ∼ N_n(Xβ, σ² I_n),
  β_{p×1} | u, σ², α ∼ ∏_{j=1}^{p} Uniform(−√σ² u_j^{1/α}, √σ² u_j^{1/α}),
  u_{p×1} | λ, α ∼ ∏_{j=1}^{p} Gamma(1/α + 1, λ),
  σ² ∼ π(σ²).    (4.4)

4.3 MCMC SAMPLING

4.3.1 FULL CONDITIONAL DISTRIBUTIONS

Introduction of u = (u_1, u_2, …, u_p) enables us to derive the full conditional distributions, which are given as

  β | y, X, u, σ² ∼ N_p(β̂_OLS, σ²(X′X)⁻¹) ∏_{j=1}^{p} I{ (|β_j|/√σ²)^α < u_j },    (4.5)

  u | y, X, β, σ², λ ∼ ∏_{j=1}^{p} Exp(λ) I{ u_j > (|β_j|/√σ²)^α },    (4.6)

  σ² | y, X, β, u ∼ Inverse Gamma( (n−1+p)/2, (1/2)(y − Xβ)′(y − Xβ) ) I{ σ² > max_j (β_j²/u_j^{2/α}) },    (4.7)

where I(.) denotes an indicator function. The proofs are included in Appendix B.2.

4.3.2

SAMPLING COEFFICIENTS AND LATENT VARIABLES

(4.5), (4.6), and (4.7) lead to a Gibbs sampler that starts at initial guesses of the parameters and iterates the following steps:

1. Generate u_j from the left-truncated exponential distribution Exp(λ) I{ u_j > (|β_j|/√σ²)^α } using the inversion method, which can be done as follows: (a) generate u*_j ∼ Exp(λ); (b) set u_j = u*_j + (|β_j|/√σ²)^α.

2. Generate β from the truncated multivariate normal distribution proportional to the posterior distribution of β. This step can be done by implementing the efficient sampling technique developed by Li and Ghosh (2015).

3. Generate σ² from the left-truncated Inverse Gamma distribution proportional to (4.7), which can be done by setting σ² = 1/σ²*, where σ²* is generated from the right-truncated Gamma distribution (Damien and Walker, 2001; Philippe, 1997) proportional to

  Gamma( (n−1+p)/2, (1/2)(y − Xβ)′(y − Xβ) ) I{ σ²* < 1/max_j (β_j²/u_j^{2/α}) }.

An efficient Gibbs sampler based on these full conditionals proceeds to draw posterior samples from each full conditional posterior distribution, given the current values of all other parameters and the observed data. The process continues until all chains converge (or attain geometric ergodicity).
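The R sketch below illustrates steps 1 and 3 of the sampler for a single iteration (our illustration, with helper names of our own; step 2 would call a truncated multivariate normal sampler such as the one of Li and Ghosh, 2015).

```r
# Sketch of step 1: a left-truncated Exp(lambda) draw is the truncation point plus an
# untruncated Exp(lambda) draw, by the memoryless property of the exponential.
draw_u <- function(beta, sigma2, lambda, alpha) {
  (abs(beta) / sqrt(sigma2))^alpha + rexp(length(beta), rate = lambda)
}

# Sketch of step 3: sample 1/sigma^2 from a right-truncated gamma by inverting its CDF,
# then invert back to obtain the left-truncated inverse gamma draw for sigma^2.
draw_sigma2 <- function(y, X, beta, u, alpha) {
  n     <- length(y); p <- length(beta)
  shape <- (n - 1 + p) / 2
  rate  <- 0.5 * sum((y - X %*% beta)^2)
  upper <- 1 / max(beta^2 / u^(2 / alpha))        # truncation point for 1/sigma^2
  p_up  <- pgamma(upper, shape = shape, rate = rate)
  inv   <- qgamma(runif(1) * p_up, shape = shape, rate = rate)
  1 / inv
}
```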

4.3.3

SAMPLING HYPERPARAMETERS

To update the tuning parameter λ, we work directly with the generalized Gaussian density, marginalizing out the latent variables u_j's. From (4.4), we observe that the posterior for λ, given β, is conditionally independent of y. Therefore, if λ has a Gamma(a, b) prior, we can update the tuning parameter by generating samples from its conditional posterior distribution, which is given by

  π(λ | β, α) ∝ λ^{(a+p+p/α)−1} exp{ −λ(b + ∑_{j=1}^{p} |β_j|^α) }.    (4.8)

The concavity parameter α is usually prefixed beforehand. Xu et al. (2010) argued that α = 0.5 can be taken as a representative of L_α, 0 < α < 1, regularization. We consider α to be prefixed at 0.5 in this article. However, it can be estimated by assigning a suitable prior π(α). Since 0 < α < 1, a natural choice of prior on α is a Beta distribution. Assuming a Beta(c, d) prior on α, the posterior distribution of α is given by

  π(α | β, λ) ∝ α^{c−1} (1 − α)^{d−1} · λ^{p/α} / {Γ(1/α + 1)}^{p} · exp{ −λ ∑_{j=1}^{p} |β_j|^α }.    (4.9)

To sample from (4.9), one can use a random walk MH algorithm with the prior Beta(c, d) as the transition distribution and acceptance probability

  w = min{ 1, π(α′ | β, λ) / π(α | β, λ) },    (4.10)

where α is the present state and α′ is a candidate draw from the prior distribution Beta(c, d). We summarize the algorithm in Figure 4.1.

Figure 4.1: Gibbs Sampler for the BBR for linear models
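A minimal R sketch of this MH update, implementing the acceptance rule (4.10) as stated with candidates drawn from the Beta(c, d) prior (our illustration; function and argument names are ours).

```r
# Sketch: log of the conditional posterior (4.9) and one MH update for alpha.
log_post_alpha <- function(alpha, beta, lambda, c_par = 1, d_par = 1) {
  p <- length(beta)
  (c_par - 1) * log(alpha) + (d_par - 1) * log(1 - alpha) +
    (p / alpha) * log(lambda) - p * lgamma(1 / alpha + 1) -
    lambda * sum(abs(beta)^alpha)
}

update_alpha <- function(alpha, beta, lambda, c_par = 1, d_par = 1) {
  alpha_prop <- rbeta(1, c_par, d_par)                 # candidate from the prior
  log_w <- log_post_alpha(alpha_prop, beta, lambda, c_par, d_par) -
           log_post_alpha(alpha,      beta, lambda, c_par, d_par)
  if (log(runif(1)) < log_w) alpha_prop else alpha     # accept with probability min(1, w)
}
```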


4.4

POSTERIOR CONSISTENCY UNDER BAYESIAN BRIDGE MODEL

Consider the high-dimensional sparse linear regression model y_n = X_n β_{n0} + ε_n, where y_n is an n-dimensional vector of responses, X_n is the n × p_n design matrix, ε_n ∼ N(0, σ² I_n) with known σ² > 0, and β_{n0} is the true coefficient vector with both zero and non-zero components. To justify high-dimensionality, we assume that p_n → ∞ as n → ∞. Let Θ_n = {j : β_{n0j} ≠ 0, j = 1, …, p_n} be the set of nonzero components of β_{n0} and |Θ_n| = q_n be the cardinality of Θ_n. Consider the following assumptions as n → ∞:

(A1) p_n = o(n).

(A2) Let Λ_{n min} and Λ_{n max} be the smallest and largest singular values of X_n, respectively. Then 0 < Λ_min ≤ lim inf_{n→∞} Λ_{n min}/√n ≤ lim inf_{n→∞} Λ_{n max}/√n ≤ Λ_max < ∞.

(A3) sup_{j=1,…,p_n} |β_{n0j}|^α < ∞, 0 < α < 1.

(A4) q_n^α = o{ n^{1−αρ/2} / (p_n^{α/2} (log n)^α) } for ρ ∈ (0, 2) and α ∈ (0, 1).

Armagan et al. (2013) provided sufficient conditions for strong posterior consistency of various shrinkage priors in linear models. Here we derive sufficient conditions on strong posterior consistency for the Bayesian bridge prior using Theorem 1 of Armagan et al. (2013). We re-state the theorem for the sake of completeness.

Theorem 1 Under assumptions (A1) and (A2), the posterior of β_{n0} under prior Π_n is strongly consistent if

  Π_n( β_n : ||β_n − β_{n0}|| < Δ/n^{ρ/2} ) > exp(−dn)

for all 0 < Δ < ε²Λ²_min/(48Λ²_max) and 0 < d < ε²Λ²_min/(32σ²) − 3ΔΛ²_max/(2σ²) and some ρ > 0.

Theorem 2 Consider the GG prior with mean zero, shape parameter α ∈ (0, 1), and scale parameter s_n > 0, given by

  f(β_{nj} | s_n, α) = 1/(2 s_n Γ(1/α + 1)) exp[ −{|β_{nj}|/s_n}^α ].    (4.11)

Under assumptions (A1)-(A4), the Bayesian bridge (GG) prior (4.11) yields a strongly consistent posterior if s_n = C/(√p_n · n^{ρ/2} log n) for finite C > 0.

Proof: Proof of this result is given in Appendix B.3.

4.5

RESULTS

In this section, we investigate the prediction accuracy of the proposed Bayesian Bridge Regression (BBR.U), and compare its performance with six existing Bayesian and non-Bayesian methods. The Bayesian methods include BBR.T (Polson et al., 2014), BBR.N (Polson et al., 2014), and BLASSO (Park and Casella, 2008), where BBR.N corresponds to the Bayesian bridge model of Polson et al. (2014) based on the SMN representation of the GG density, BBR.T corresponds to the Bayesian bridge model of Polson et al. (2014) based on the SMT representation of the GG density, and BLASSO corresponds to the Bayesian LASSO model of Park and Casella (2008) based on the SMN representation of the Laplace density. The non-Bayesian methods include LASSO (Tibshirani, 1996), Elastic Net (Zou and Hastie, 2005), and classical bridge regression (Frank and Friedman, 1993). For the LASSO and elastic net (ENET) solution paths, we use the R package glmnet, which implements the coordinate descent algorithm (Friedman et al., 2010), in which a 10-fold cross-validation is used to select the tuning parameter(s). For the classical bridge (BRIDGE), a locally approximated coordinate descent algorithm (Breheny and Huang, 2009; Huang et al., 2009) is used, in which a generalized cross-validation (GCV) criterion (Golub et al., 1979) is used to select the tuning parameter, as implemented in the R package grpreg. For BBR.U and BLASSO, we set the hyperparameters as a = 1 and b = 0.1, which is relatively flat and results in high posterior probability near the MLE (Kyung et al., 2010). For BBR.T and BBR.N, a default Gamma(2, 2) prior is used for the tuning parameter, as implemented in the R package BayesBridge. The Bayesian estimates are posterior means using 10,000 samples of the Gibbs sampler after the burn-in. To decide on the burn-in number, we use the PSRF (Gelman and Rubin, 1992). Once R̂ < 1.1 for all parameters of interest, we continue to draw 10,000 iterations to obtain samples from the joint posterior distribution. The convergence of our MCMC is checked by trace plots of the generated samples and by calculating the Gelman-Rubin scale reduction factor (Gelman and Rubin, 1992) using the coda package in R. The response is centered and the predictors are normalized to have zero means and unit variances before applying any model selection method.

Figure 4.2: Boxplots summarizing the prediction performance for the seven methods for Model 1

Figure 4.3: Boxplots summarizing the prediction performance for the seven methods for Model 2

Figure 4.4: Boxplots summarizing the prediction performance for the seven methods for Model 3


4.5.1

SIMULATION EXPERIMENTS

For the simulated examples, we calculate the median of mean squared errors (MMSE) based on 100 replications. Each simulated sample is partitioned into a training set and a test set. Models are fitted on the training set and MSEs are calculated on the test set. We simulate data from the true model y = Xβ + ε, ε ∼ N(0, σ² I).

SIMULATION 1 (SIMPLE EXAMPLES) Here we investigate the prediction accuracy of BBR.U using three simple models. These models were used in the original LASSO paper to systematically compare the predictive performance of LASSO and ridge regression. Models 1 and 3 represent two different sparse scenarios whereas Model 2 represents nonsparse situation.

Model 1: Here we set β_{8×1} = (3, 1.5, 0, 0, 2, 0, 0, 0)^T and σ² = 9. The design matrix X is generated from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between x_i and x_j equal to 0.5^{|i−j|} for all i ≠ j.

Model 2: Here we set β 8×1 = (0.85, 0.85, 0.85, 0.85, 0.85,0.85, 0.85, 0.85)T , leaving other setups exactly same as Model 1.

Model 3: We use the same setup as model 1 with β 8×1 = (5, 0, 0, 0, 0, 0, 0, 0)T . For all the three models we experiment with three sample sizes n = {50, 100, 200},

which we refer to as A, B, and C. Prediction error (MSE) was calculated on a test set of 200 observations for each of these cases. The results are presented in Figures 4.2–4.4, which clearly indicate that BBR.U yields better prediction accuracy than the classical bridge estimator. All the Bayesian methods have similar performance.

SIMULATION 2 (HIGH CORRELATION EXAMPLES)

In this simulation study, we investigate the performance of BBR.U in sparse models with a strong level of correlation. We repeat the same models as in Simulation 1 but experiment with a different design matrix X, which is generated from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between x_i and x_j equal to 0.95. To distinguish them from Simulation 1, we refer to the models in Simulation 2 as Models 4, 5, and 6, respectively. The experimental results are presented in Figures 4.5–4.7. Here also, BBR.U performs comparably with the existing Bayesian methods, with better prediction accuracy than the frequentist methods.

SIMULATION 3 (DIFFICULT EXAMPLES)

In this simulation study, we evaluate the performance of BBR.U in fairly complicated models, which exhibit a substantial amount of data collinearity.

Figure 4.5: Boxplots summarizing the prediction performance for the seven methods for Model 4

Figure 4.6: Boxplots summarizing the prediction performance for the seven methods for Model 5

Figure 4.7: Boxplots summarizing the prediction performance for the seven methods for Model 6

Model 7: This example is drawn from the original LASSO paper by Tibshirani (1996). Here we set σ² = 225 and β_{40×1} = (0^T, 2^T, 0^T, 2^T)^T, where 0_{10×1} and 2_{10×1} are vectors of length 10 with each entry equal to 0 and 2, respectively. The design matrix X is generated from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between x_i and x_j equal to 0.5. We simulate datasets with {n_T, n_P} = {100, 400} and {200, 200}, respectively, where n_T denotes the size of the training set and n_P denotes the size of the testing set. We consider two values of σ: σ ∈ {9, 25}. The simulation results are summarized in Table 4.1, where the abbreviations L, EN, and BL denote LASSO, ENET, and BLASSO, respectively. It can be observed that BBR.U is competitive with the other Bayesian methods in terms of prediction accuracy. In some situations, BBR.T or BLASSO performs slightly better. Also, the Bayesian methods are slightly outperformed by the frequentist methods in some scenarios. The reason might be attributed to the fact that not much variance is explained by introducing the priors, which resulted in slightly worse model selection performance for the Bayesian methods. Overall, the Bayesian bridge estimators significantly outperform the frequentist bridge estimator in terms of prediction accuracy.

{nT, nP, σ²}        L       EN      BRIDGE   BBR.U   BBR.T   BBR.N   BL
{200, 200, 225}     252.9   252.0   258.1    250.8   251.8   251.4   251.7
{200, 200, 81}      94.0    93.3    103.7    94.9    95.5    95.1    94.9
{100, 400, 225}     270.9   264.9   272.2    263.3   261.9   264.2   268.9
{100, 400, 81}      106.8   105.0   114.1    105.3   105.4   105.7   104.5

Table 4.1: MMSE based on 100 replications for Model 7

SIMULATION 4 (SMALL n LARGE p EXAMPLE)

Here we consider a case where p ≥ n. We let β_{1:q} = (5, …, 5)^T, β_{q+1:p} = 0, p = 20, q = 10, and σ = {3, 1}. The design matrix X is generated from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between x_i and x_j equal to 0.95 for all i ≠ j. We simulate datasets with n_T = {10, 20} for the training set and n_P = 200 for the testing set, which we refer to as A (n_T = 10, σ = 3), B (n_T = 10, σ = 1), C (n_T = 20, σ = 3), and D (n_T = 20, σ = 1). It is evident from the results presented in Figure 4.8 that the proposed method always performs better than classical bridge regression. In this example, we did not include BBR.T and BBR.N, as the BayesBridge package did not converge. Overall, these simulation examples reveal that all the Bayesian methods have similar prediction accuracy in most situations, and usually outperform their frequentist counterparts in terms of prediction accuracy across a variety of scenarios.

4.5.2

REAL DATA ANALYSES

In this section, we apply our method to two benchmark datasets, viz. the prostate cancer data (Stamey et al., 1989) and the pollution data (McDonald and Schwing, 1973). Both of these datasets have been used for illustration purposes in previous LASSO and bridge papers. For both analyses, we randomly divide the data into a training set and a test set. Model fitting is carried out on the training data and performance is evaluated with the prediction error (MSE) on the test data. For both analyses, we also compute the prediction error for the OLS method. In the prostate cancer dataset, the response variable of interest is the logarithm of prostate-specific antigen. The predictors are eight clinical measures: the logarithm of cancer volume (lcavol), the logarithm of prostate weight (lweight), age, the logarithm of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), the logarithm of capsular penetration (lcp), the Gleason score (gleason), and the percentage of Gleason scores 4 or 5 (pgg45).

Figure 4.8: Boxplots summarizing the prediction performance for the five methods for Model 8

Figure 4.9: For the prostate data, posterior mean estimates and corresponding 95% equal-tailed credible intervals for the Bayesian methods. Overlaid are the LASSO, elastic net, and classical bridge estimates based on cross-validation

We analyze the data by dividing it into a training set with 67 observations and a test set with 30 observations. On the other hand, the pollution dataset consists of 60 observations and 15 predictors. The response variable is the total age-adjusted mortality rate obtained for the years 1959−1961 for 201 Standard Metropolitan Statistical Areas. In order to calculate prediction errors, we randomly select 40 observations for model fitting and use the rest as the test set. We summarize the results for both data analyses in Table 4.2, which shows that BBR.U performs as well as, or better than, existing Bayesian and non-Bayesian methods in terms of prediction accuracy. We repeat the random selection of training and test sets many times and obtain results similar to those in Table 4.2. Figures 4.9 and 4.10 describe the 95% equal-tailed credible intervals for the regression parameters of the prostate cancer and pollution data, respectively, based on the posterior mean Bayesian estimates. To increase the readability of the plots, in each plot we add a slight horizontal shift to the estimators. We can see that our method gives very similar posterior mean estimates compared to the other Bayesian methods. It is interesting to note that for both these datasets, all the estimates are inside the BBR.U credible intervals, which indicates that the resulting conclusion will be similar regardless of which method is used. Hence, the analyses show strong support for the use of the proposed method. BBR.U performed similarly for other values of α (data not shown). Also, the proposed sampler runs very fast, at least for the examples presented here; for the prostate cancer data, BBR.U took around 0.4 minutes on a P4 desktop computer.

Method    Prostate   Pollution
OLS       0.52       7524.75
LASSO     0.49       2110.99
ENET      0.489      2104.75
BRIDGE    0.45       2116.60
BLASSO    0.47       1946.63
BBR.T     0.48       1944.89
BBR.N     0.47       1920.86
BBR.U     0.45       1846.49

Table 4.2: Mean squared prediction errors for Prostate and Pollution data analyses

4.6

COMPUTING MAP ESTIMATES

In this section, we provide an ECM algorithm to find the approximate posterior mode estimates of the proposed Bayesian bridge model.

Figure 4.10: For the pollution data, posterior mean estimates and corresponding 95% equal-tailed credible intervals for the Bayesian methods. Overlaid are the LASSO, elastic net, and classical bridge estimates based on cross-validation

It is well-known that the bridge regularization with 0 < α < 1 leads to a nonconcave optimization problem. To avoid the nonconvexity of the penalty function, various approximations have been suggested (Park and Yoon, 2011). One such approximation is the local linear approximation (LLA) proposed by Zou and Li (2008), which can be used to compute the approximate MAP estimates of the coefficients. Treating β as the parameter of interest and φ = (σ², λ) as the 'missing data', the complete data log-likelihood based on the LLA approximation is given by

  l(β | y, X, φ) = C − RSS/(2σ²) − λα ∑_{j=1}^{p} |β_{j0}|^{α−1} |β_j|,    (4.12)

which we rewrite as

  l(β | y, X, φ) = C − RSS/(2σ²) − ∑_{j=1}^{p} λ_j |β_j|,    (4.13)

where λ_j = λα|β_{j0}|^{α−1}, C is a constant with respect to β, RSS is the residual sum of squares, and the β_{j0}'s are the initial values, usually taken as the OLS estimates (Zou and Li, 2008). We initialize the algorithm by starting with a guess of β, σ², and λ. Then, at each step of the algorithm, we update β by maximizing the expected log conditional posterior distribution. Finally, we replace λ and σ² in the log posterior (4.13) by their expected values conditional on the current estimates of β. Following a similar derivation in Sun et al. (2010), the algorithm proceeds as follows:

CM-Step:

  β_j^{(t+1)} = 0                                    if |β̂_j^{OLS}| ≤ λ_j^{(t)} σ_j^{2(t)},
              = β̂_j^{OLS} − λ_j^{(t)} σ_j^{2(t)}    if β̂_j^{OLS} > λ_j^{(t)} σ_j^{2(t)},
              = β̂_j^{OLS} + λ_j^{(t)} σ_j^{2(t)}    if β̂_j^{OLS} < −λ_j^{(t)} σ_j^{2(t)}.

E-Step:

  λ_j^{(t+1)} = (a + p + p/α) α|β_{j0}|^{α−1} / (b + ∑_{j=1}^{p} |β_j^{(t+1)}|^α),

  σ_j^{2(t+1)} = (RSS^{(t+1)}/n) (X^T X)^{-1}_{jj},

where RSS^{(t+1)} is the residual sum of squares re-calculated at β^{(t+1)}, t = 1, 2, …, and (X^T X)^{-1}_{jj} is the jth diagonal element of the matrix (X^T X)^{-1}. At convergence of the algorithm, we summarize inferences using the latest estimates of β. Unlike the MCMC algorithm described before, this algorithm will shrink some coefficients exactly to zero.
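The soft-thresholding structure of the CM-step means that a single pass of the LLA approximation can also be viewed as a weighted LASSO. The R sketch below illustrates this idea with glmnet's penalty.factor argument; it is only an illustration of the approximation (glmnet scales the loss by 1/n, so the scaling of λ only approximately matches (4.13)), not the dissertation's implementation, and the helper name and eps guard are ours.

```r
# Sketch: one LLA pass as a weighted LASSO with weights proportional to alpha * |beta_j0|^(alpha - 1).
library(glmnet)

lla_step <- function(X, y, beta0, lambda, alpha = 0.5, eps = 1e-6) {
  w <- alpha * pmax(abs(beta0), eps)^(alpha - 1)        # relative penalty weights (eps avoids Inf)
  fit <- glmnet(X, y, alpha = 1, penalty.factor = w,
                lambda = lambda * mean(w),              # compensate glmnet's rescaling of weights
                standardize = FALSE, intercept = FALSE)
  as.numeric(coef(fit)[-1])                             # drop the (zero) intercept
}

beta_ols <- coef(lm(y ~ X - 1))                         # initial values, as in Zou and Li (2008)
beta_lla <- lla_step(X, y, beta_ols, lambda = 0.1)
```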

4.7

EXTENSION TO GENERAL MODELS

In this section, we briefly discuss how BBR.U can be extended to several other models (e.g., GLMs, Cox's model) beyond linear regression. Let us denote by L(β) the negative log-likelihood. Following Wang and Leng (2007), L(β) can be approximated by the LSA (as described before) as follows

  L(β) ≈ (1/2) (β − β̃)′ Σ̂⁻¹ (β − β̃),

where β̃ is the MLE of β and Σ̂⁻¹ = ∂²L(β)/∂β². Therefore, for a general model, the conditional distribution of y is given by

  y | β ∝ exp{ −(1/2) (β − β̃)′ Σ̂⁻¹ (β − β̃) }.

Thus, we can easily extend our method to several other models by approximating the corresponding likelihood by a normal likelihood. Combining the SMU representation of the GG density and the LSA approximation of the general likelihood, the hierarchical presentation of BBR.U (for a fixed α) for general models can be written as

  y | β ∝ exp{ −(1/2) (β − β̃)′ Σ̂⁻¹ (β − β̃) },
  β | u ∼ ∏_{j=1}^{p} Uniform(−u_j^{1/α}, u_j^{1/α}),
  u | λ ∼ ∏_{j=1}^{p} Gamma(1/α + 1, λ),
  λ ∼ Gamma(a, b).    (4.14)

The full conditional distributions are given as

  β | y, X, u ∼ N_p(β̃, Σ̂) ∏_{j=1}^{p} I{ |β_j| < u_j^{1/α} },    (4.15)

  u | β, λ ∼ ∏_{j=1}^{p} Exp(λ) I{ u_j > |β_j|^α },    (4.16)

  λ | β ∼ λ^{(a+p+p/α)−1} exp{ −λ(b + ∑_{j=1}^{p} |β_j|^α) }.    (4.17)


Figure 4.11: Gibbs Sampler for the BBR for general models

As before, an efficient Gibbs sampler can be easily carried out based on these full conditionals. We summarize the algorithm in Figure 4.11.

4.8

CONCLUSION AND DISCUSSION

We have considered a Bayesian analysis of bridge regression utilizing the SMU representation of the Bayesian bridge prior. Based on a recent result by Armagan et al. (2013), we have provided sufficient conditions for posterior consistency under the Bayesian bridge model. Also, we have discussed how BBR.U can be easily extended to several other models (e.g. GLM, Cox’s, etc.), providing a unified framework for modeling a variety of outcomes (e.g. continuous, binary, count, and time-to-event, among others). We have shown that in the absence of an explicit SMN representation of the GG density, SMU representation seems to provide important advantages. Our approach is fundamentally different from

115 the Bayesian bridge regression reported in Polson et al. (2014) in essentially two aspects. Firstly, unlike Polson et al. (2014), we have considered a conditional prior on the coefficients to ensure unimodal posterior. This conditional prior specification is consistent with major recently-developed Bayesian penalized regression methods (see for example Kyung et al. (2010), and references therein). Secondly, our MCMC is much more straightforward. While Polson et al. (2014) introduces essentially two sets of latent variables to carry out their MCMC, our method relies on only one set of latent variables, again consistent with the existing Bayesian regularization methods. However, the question of whether our MCMC leads to simpler calculations, warrants further research. We have evaluated the performance of BBR.U across a variety of models, which reveals that BBR.U yields satisfactory performance in both sparse and non-sparse scenarios. However, BBR.U (and BBR.T) can be computationally intensive due to the need of simulating draws from truncated multivariate normal distribution, which might be difficult to deal with in practice, especially in ultra-high dimensions. For speedy inference, one might consider using the MAP estimates, which we have not evaluated here, to primarily focus on a fully Bayesian approach. Moreover, an OLS or similar estimator may not be available in high dimensions, making it difficult to sample from the desired posterior distribution. However, this step can be taken care of by considering ridge-penalized (Hoerl and Kennard, 1970) or other stable Stein-type shrinkage estimates (Opgen-Rhein and Strimmer, 2007; Schafer and Strimmer, 2005), which are usually available.


REFERENCES

Andrews, D. F. and C. L. Mallows (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 36(1), 99–102.

Armagan, A. (2009). Variational bridge regression. Journal of Machine Learning Research W&CP 5, 17–24.

Armagan, A., D. B. Dunson, J. Lee, W. U. Bajwa, and N. Strawn (2013). Posterior consistency in linear models under shrinkage priors. Biometrika 100(4), 1011–1018.

Breheny, P. and J. Huang (2009). Penalized methods for bi-level variable selection. Statistics and its Interface 2(3), 369.

Damien, P. and S. G. Walker (2001). Sampling truncated normal, beta, and gamma densities. Journal of Computational and Graphical Statistics 10(2), 206–215.

Devroye, L. (2009). On exact simulation algorithms for some distributions related to Jacobi theta functions. Statistics & Probability Letters 79(21), 2251–2259.

Frank, I. and J. H. Friedman (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35(2), 109–135.

Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1–22.

Gelman, A. and D. Rubin (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7(4), 457–472.

Golub, G. H., M. Heath, and G. Wahba (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21(2), 215–223.

Hoerl, A. E. and R. W. Kennard (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67.

Huang, J., S. Ma, H. Xie, and C. Zhang (2009). A group bridge approach for variable selection. Biometrika 96(2), 339–355.

Knight, K. and W. Fu (2000). Asymptotics for lasso-type estimators. The Annals of Statistics 28(5), 1356–1378.

Kyung, M., J. Gill, M. Ghosh, and G. Casella (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis 5(2), 369–412.

Li, Y. and S. K. Ghosh (2015). Efficient sampling methods for truncated multivariate normal and student-t distributions subject to linear inequality constraints. Journal of Statistical Theory and Practice, upcoming.

McDonald, G. C. and R. C. Schwing (1973). Instabilities of regression estimates relating air pollution to mortality. Technometrics 15(3), 463–481.

Opgen-Rhein, R. and K. Strimmer (2007). From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Systems Biology 1, 37.

Park, C. and Y. J. Yoon (2011). Bridge regression: adaptivity and group selection. Journal of Statistical Planning and Inference 141(11), 3506–3519.

Park, T. and G. Casella (2008). The Bayesian lasso. Journal of the American Statistical Association 103(482), 681–686.

Philippe, A. (1997). Simulation of right and left truncated gamma distributions by mixtures. Statistics and Computing 7(3), 173–181.

Polson, N. G., J. G. Scott, and J. Windle (2014). The Bayesian bridge. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(4), 713–733.

Schafer, J. and K. Strimmer (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology 4(1), Article 32.

Stamey, T., J. Kabalin, J. McNeal, I. Johnstone, F. Frieha, E. Redwine, and N. Yang (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. Journal of Urology 141(5), 1076–1083.

Sun, W., J. G. Ibrahim, and F. Zou (2010). Genomewide multiple-loci mapping in experimental crosses by iterative adaptive penalized regression. Genetics 185(1), 349–359.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 58(1), 267–288.

Wang, H. and C. Leng (2007). Unified LASSO estimation by least squares approximation. Journal of the American Statistical Association 102(479), 1039–1048.

West, M. (1987). On scale mixtures of normal distributions. Biometrika 74(3), 646–648.

Xu, Z., H. Zhang, Y. Wang, X. Chang, and Y. Liang (2010). L1/2 regularization. Science China Information Sciences 53(6), 1159–1169.

Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2), 301–320.

Zou, H. and R. Li (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics 36(4), 1509–1533.


THE BAYESIAN GROUP BRIDGE FOR BI-LEVEL VARIABLE SELECTION

by

HIMEL MALLICK, NENGJUN YI

In preparation for submission Format adapted for dissertation

ABSTRACT

In this paper, we consider the problem of variable selection for grouped predictors. In particular, we propose a Bayesian method to solve the group bridge model using a Gibbs sampler. The fundamental difference between the frequentist and Bayesian approaches is that here we can obtain valid standard errors based on a geometrically ergodic Markov chain besides an appropriate point estimator. In addition, the concavity parameter can be estimated from the data along with the other model parameters. The proposed method is adaptive to the signal level by adopting different shrinkage for different groups of predictors. A novel scale mixture of multivariate uniform (SMU) distribution is introduced, which in turn facilitates computationally efficient MCMC algorithms. Empirical evidence of the attractiveness of the method is illustrated by simulations, and by analyzing the well-known birth weight dataset from Hosmer and Lemeshow (1989). We find that the Bayesian group bridge outperforms most existing group variable selection methods in estimation and prediction across a variety of simulation studies and real data analyses. Finally, we discuss possible extensions of this new approach, and present a unified framework for bi-level variable selection in general models with other flexible penalties.

KEYWORDS: Bayesian Regularization, Bayesian Variable Selection, Bilevel Variable Selection, Group Bridge, Group Variable Selection, MCMC.

122

5.1

INTRODUCTION

Recall that in the context of a linear regression model, the group bridge estimator (Huang et al., 2009) results from the following regularization problem:



K 

min (y − Xβ) (y − Xβ) + λ wk ||β k ||α1 , β k=1

(5.1)

where ||β k ||1 is the L1 norm of β k , λ > 0 is the tuning parameter, α > 0 is the concavity parameter, and the weights wk ’s depend on some preliminary estimates, wk > 0, k = 1, . . . , K. Due to the general form of the penalty term, the group bridge estimator naturally fits any situation where it needs variable selection or there exists multicollinearity (Park and Yoon, 2011). However, there are at least three serious drawbacks of the frequentist group bridge estimator: 1. The group bridge algorithm only computes the point estimates of the regression coefficients. In practice, practitioners usually also need to know the level of confidence of the estimates, such as the confidence interval. However, it is not clear how to do statistical inference on the estimated coefficients of the group bridge estimator. 2. The optimization problem (5.1) is nonconvex when 0 < α < 1. This nonconvexity may cause numerical instability in practical computation especially when the number of covariates is large, as it involves traversing a multimodal region in high-dimensional data space. 3. In group bridge, the concavity parameter α is usually pre-fixed (e.g. α = 0.5). However, the choice of α depends on the data, and therefore, it is

123 desirable to optimize α which determines the data-specific shape of the penalty function. The first limitation is common to most frequentist regularized regression methods. Huang et al. (2009) gives a ridge regression approximation of the covariance matrix for the group bridge estimator. However, this approximation leads to standard error estimates of 0 when the corresponding estimated coefficient is 0, which is clearly unsatisfactory. An alternative approach to obtaining standard error estimates is to use the bootstrap. Knight and Fu (2000) studied the bootstrap method for standard error estimation for the bridge estimators, and pointed out that the bootstrap may have some problems in estimating the sampling distribution of the bridge estimators for 0 < α < 1 when the true parameter values are either exactly 0 or close to 0; in such cases, bootstrap sampling introduces a bias that does not vanish asymptotically. It is to be noted that group bridge is a natural extension of the classical bridge estimator. Both classical bridge and group bridge achieve the so-called ‘oracle property’ under certain regularity conditions (Huang et al., 2008, 2009). As noted by Andersen et al. (2009), the ‘oracle property’ is essentially a reincarnation of the superefficiency phenomenon, and it does not necessarily reflect the actual statistical performance. Moreover, superefficiency does not in general help in terms of accuracy of the confidence intervals. Beran (1982) showed that the bootstrap is not consistent for superefficient estimators. Based on this, Kyung et al. (2010) formally established that the bootstrap estimates based on the LASSO estimator are not consistent if the true β = 0. Similarly, it can be formally established, following arguments in Kyung et al. (2010) together with the

124 results of Huang et al. (2009) and Beran (1982), that the bootstrap estimates of the standard errors of the group bridge estimator are unstable and inconsistent if the true β = 0. Therefore, we may not have reliable standard errors for the zero coefficients in the group bridge model. The second limitation is intrinsic to the formulation of the group bridge penalty. However, the impact of multimodality is more severe in the frequentist framework, as it is misleading to summarize a multimodal surface in terms of a single point estimate (Polson et al., 2014). The third limitation can be taken care of by finding the optimal combination of λ and α by cross-validation, as commonly done in the frequentist framework; however, it is computationally intensive, as it involves exploring and validating in a two-dimensional space. To avoid intensive computation, Park and Yoon (2011) suggested grid search for selecting the optimal combination of (λ, α) for the classical bridge and L2 norm group bridge estimators. However, it selects λ and α sequentially instead of simultaneously, which may lead to a problem similar to ‘double shrinkage’, as commonly encountered in frequentist penalized regression methods such as the elastic net (Li and Lin, 2010; Zou and Hastie, 2005). To overcome all these aforementioned limitations, we propose a Bayesian analysis of the group bridge problem. Ideally, a Bayesian solution can be obtained by placing an appropriate prior on the coefficients which will resemble the property of the group bridge penalty. Contrary to the original formulation (5.1), we consider a slightly different version where different penalty parameters are used for different groups of predictors, which gives rise to the following

125 regularization problem: K 



λk ||β k ||α1 . min (y − Xβ) (y − Xβ) + β k=1

(5.2)

The group-specific parameters provide a way to pool the information among variables within groups, and also to induce differential shrinkage for different groups. Naturally, for the unimportant groups, we should put larger penalty parameters λk ’s on their corresponding group-specific L1 norms of the associated coefficients. To come up with our Bayesian solution, we consider the following group bridge prior on the coefficients

π(β|α, λ1 , . . . , λK ) ∝

K 

exp(−λk ||β k ||α1 ).

(5.3)

k=1

Rather than minimizing (5.2), we solve the problem by using a Gibbs sampler which involves constructing a Markov chain having the joint posterior for β as its stationary distribution. Unlike the frequentist approach, statistical inference for the Bayesian group bridge is much more straightforward. For instance, Bayesian analysis naturally overcomes the above-mentioned first limitation by providing a valid measure of standard error based on a geometrically ergodic Markov chain along with an appropriate point estimator. Moreover, the concavity parameter and the tuning parameters can be effortlessly estimated as automatic byproducts of the MCMC procedure. It should be noted, however, that unlike most existing sparsity-inducing priors, the joint posterior distribution in the proposed group bridge model is multimodal. Therefore, both classical and Bayesian approaches to group bridge

126 estimation must encounter practical difficulties related to multimodality. However, as noted in Polson et al. (2014), multimodality is one of the strongest arguments for pursuing a Bayesian approach, as we are able to handle model uncertainty in a systematic manner by fully exploring the posterior distribution, unlike a frequentist approach which forces one to summarize by means of a single point estimate. As it will be clear from the simulation studies and real data analysis, the proposed MCMC algorithm has good mixing property, and seems to be very effective at exploring the whole parameter space, unlike the classical group bridge estimator, which suffers from a number of computational drawbacks due to nonconvexity that limit its applicability in high-dimensional data analysis (Breheny, 2015). The remainder of the paper is organized as follows. In Section 5.2, we describe the group bridge prior, its various properties, and its connection to other sparsity-inducing priors. In Section 5.3, we develop the Bayesian group bridge estimator. A detailed description of the MCMC sampling scheme is given in Section 5.4. Results from the simulation experiments and real data analysis are provided in Section 5.5. A unified framework and some possible extensions are described in Section 5.6. In Section 5.7, we provide conclusions and further discussions in this area. All derivations and proofs are given in an appendix (Appendix C).

127

5.2 5.2.1

THE GROUP BRIDGE PRIOR FORMULATION

With a slight abuse of notation, we drop the dependence on the group index k, and consider a q-dimensional group bridge prior as follows

π(β) = C(λ, α) exp(−λ||β||α1 ),

(5.4)

where C(λ, α) is the normalizing constant. The normalizing constant C(λ, α) can be explicitly calculated (detailed in Appendix C.1) as q

λ α Γ(q + 1) . C(λ, α) = q q 2 Γ( α + 1)

(5.5)

Hence, the normalized group bridge prior is q

λ α Γ(q + 1) exp(−λ||β||α1 ). π(β) = q q 2 Γ( α + 1)

(5.6)

This proves that the group bridge prior is a proper prior, and in any Bayesian analysis with the group bridge prior, the propriety of the posterior will be retained. The posterior mode estimate under a normal likelihood and the group bridge prior matches exactly with the solution of (5.2). It is to be noted that the group bridge prior can be construed as a multivariate extension of the Bayesian bridge prior (introduced in Chapter 4); when q = 1, (5.6) reduces to the univariate generalized Gaussian distribution, which is the essence of the Bayesian bridge estimator, in which independent and identical generalized Gaussian distributions are assigned on the coefficients. In contrast, we assign independent

128 multivariate group bridge priors on each group of coefficients to take into account the grouping structure among the predictors, and also to mimic the property of the group bridge penalty.

5.2.2

PROPERTIES

CONNECTION TO MULTIVARIATE L1 NORM SPHERICALLY SYMMETRIC DISTRIBUTIONS In this section, we study various properties of the proposed group bridge prior in light of its connection to the multivariate L1 norm spherically symmetric distributions (introduced in Fang et al. (1989); also known as simplicially contoured distributions). To this end, we introduce the following notations. Definition 1 A random vector β = (β1 , . . . , βq ) with location parameter μ and scale parameter Σ follows an L1 norm spherically symmetric distribution 1

if its density function f is of the form f (β) = ψ(||Σ− 2 (β − μ)||1 ) for some ψ : R + → R+ . Definition 2 A random vector β = (β1 , . . . , βq ) follows a multivariate uniform distribution in an arbitrary region A ⊂ Rq if its density function f is of the form ⎧ ⎪ ⎪ ⎨ f (β) =

1 , Ω(A)

⎪ ⎪ ⎩0

if β ∈ A,

(5.7)

otherwise,

where Ω(A)=Volume of A. 1

Proposition 3 If f is of the form f (β) = ψ(||Σ− 2 (β − μ)||1 ) for some ψ :

129 R+ → R+ , it can also be expressed as 1

f (β) = |Σ|− 2

Γ(q) g(r(β)) , r(β) > 0, 2q r(β)q−1

(5.8)

1

where r(β) = ||Σ− 2 (β − μ)||1 , g(.) is the density of r(β). Proof : Proof is detailed in Appendix C.2. Corollary 1 Let β ∼ π(β), the group bridge prior defined in (5.6). Then, β has the following stochastic representation D

β = RU , where U =

β

||β ||1

(5.9)

is a random vector uniformly distributed on the unit L1 norm

sphere in Rq , and R = ||β||1 is an absolutely continuous non-negative random variable independent of U , whose density function is given by q

α

qrq−1 λ α e−λr , r > 0, g(r) = Γ( αq + 1) which corresponds to a generalized Gamma distribution. Proof : The proof is an immediate consequence of the fact that the group bridge prior belongs to the family of multivariate L1 norm spherically symmetric distributions defined above. Based on the above corollary, we provide a straightforward algorithm on how to sample directly from the group bridge prior (5.6) (see Appendix C.3). The shape of the group bridge prior for different values of α is illustrated in

130 Figures 5.1 and 5.2. It can be observed that when 0 < α < 1, the density function is strongly peaked at zero, suggesting that priors in this range are suitable for bi-level variable selection by shrinking small coefficients to zero while minimally shrinking the large coefficients at both group and individual levels. As α → 0, the density becomes more and more heavy-tailed while preserving its linear contours and strong peak at zero (Figure 5.1). On the other hand, when α > 1, the sharpness of the density diminishes with increasing α (Figure 5.2), suggesting that priors in this range may not be appropriate if variable selection is desired, as in many practical applications, it is unknown in advance whether variable selection is desired or not. When α = 1, the group bridge prior loses its attractive grouping property as it reduces to the joint distribution of i.i.d Laplace distributions, the essence of the Bayesian LASSO prior (Park and Casella, 2008). Based on these observations, we consider 0 < α < 1 throughout this paper. We develop our Gibbs sampler based on a novel mixture representation of the group bridge prior, which we describe next. SCALE MIXTURE OF MULTIVARIATE UNIFORM REPRESENTATION Proposition 4 The group bridge prior (5.6) can be written as a scale mixture of multivariate uniform (SMU) distribution, the mixing density being a particular Gamma distribution, i.e., β|u ∼ Multivariate Uniform (A1 ), where A1 = {β ∈ Rq : ||β||α1 < u}, u > 0, and u ∼ Gamma ( αq + 1, λ). Proof : Detailed in Appendix C.4.

133

CONNECTION TO OTHER SPARSITY-INDUCING PRIORS In recent years, there has been a great deal of interest in developing Bayesian shrinkage priors for penalized regression which include the Bayesian LASSO (Park and Casella, 2008), the Bayesian elastic net (Kyung et al., 2010; Li and Lin, 2010), the Bayesian adaptive LASSO (Leng et al., 2014), and the Bayesian bridge (Polson et al., 2014), among others. A detailed account of some recently developed Bayesian regularized regression approaches is given in Kyung et al. (2010). All these methods assume independence between variables, and therefore, completely ignores the grouping structure among the predictors. One exception in this class which takes into account the grouping structure among the predictors is the Bayesian group LASSO (Kyung et al., 2010), which is attained by imposing independent multivariate Laplace priors on the groups of coefficients. However, similar to the frequentist group LASSO, the Bayesian group LASSO cannot do bi-level variable selection, and therefore, it is expected to perform poorly for problems that are actually sparse at both group and individual levels. Moreover, the Bayesian group LASSO assumes invariance under within-group orthonormal transformations (Kyung et al., 2010), which may not be a valid assumption in practice. All these lead to the fact that, there is a lack of well-defined Bayesian regularized methods in the literature, which are capable of bi-level variable selection by taking into account the grouping information among the predictors into their estimation procedures. Motivated by this gap, we propose the Bayesian group bridge estimator which is capable of bi-level variable selection, applicable to general design matrices, and encourages simultaneous grouping and within-group sparsity unlike other sparsity-inducing

134 priors. There are considerable challenges, however, to come up with a Bayesian solution of the group bridge estimator. First, most of the Bayesian group regularization methods developed so far are primarily based on the scale mixture of multivariate normal (SMN) representation of the associated priors (Kyung et al., 2010). Unfortunately, such a representation is not explicitly available for the Bayesian group bridge prior, due to the lack of a closed-form expression of the mixing density function. Second, posterior inference for most Bayesian regularized regression approaches are primarily based on unimodal posterior distributions. Unfortunately, for the Bayesian group bridge, the resulting posterior is multimodal. Third, due to the adaptive nature of the group bridge prior, we have many more parameters to estimate (not to mention that one has to choose the concavity parameter as well) as compared to other sparsity-inducing priors. Therefore, the posterior inference for the Bayesian group bridge is much more challenging as compared to the existing Bayesian methods. To overcome these issues, we appeal to the SMU representation (described in Proposition 4), which in turn facilitates computationally efficient MCMC algorithms. With this representation, we are able to formulate a hierarchical model, which alleviates the problem of estimating many parameters. Moreover, the proposed Gibbs sampler is very effective at exploring the multimodal parameter space, ensuring enriched posterior inference.

135

5.3 5.3.1

MODEL HIERARCHY AND PRIOR DISTRIBUTIONS HIERARCHICAL REPRESENTATION

Based on Proposition 4, we formulate our hierarchical representation as follows y|X, β, σ 2 ∼ Nn (Xβ, σ 2 In ), β k |uk , α ∼ Multivariate Uniform(Ωk ), independently, where Ωk = {βk ∈ Rmk : ||βk ||α1 < uk }, k = 1, . . . , K, K  mk + 1, λk ), Gamma( u1 , . . . , uK |λ1 , . . . , λK , α ∼ α k=1

(5.10)

σ 2 ∼ π(σ 2 ), where a non-informative scale-invariant marginal prior is assumed on σ 2 , i.e., π(σ 2 ) ∝ 1/σ 2 . 5.3.2

FULL CONDITIONAL DISTRIBUTIONS

Introduction of u = (u1 , u2 , ..., uK ) enables us to derive the full conditional distributions as follows −1 2  ˆ β|y, X, u, σ , α ∼ Np (β OLS , σ (X X) ) 2

K 

I{||β k ||α1 < uk },

(5.11)

k=1

u|β, λ1 , . . . , λK , α ∼

K 

Exponential(λk )I{uk > ||β k ||α1 },

(5.12)

k=1

σ 2 |y, X, β ∼ Inverse Gamma (

n−1 1 , (y − Xβ) (y − Xβ)), 2 2

(5.13)

136 where I(.) denotes an indicator function. The proofs involve only simple algebra, and are omitted.

5.4

MCMC SAMPLING

5.4.1

SAMPLING COEFFICIENTS AND THE LATENT VARIABLES

(5.11), (5.12), and (5.13) lead us to a Gibbs sampler that starts at initial guesses for β and σ 2 and iterates the following steps:

1. Generate uk from the left-truncated exponential distribution Exponential(λk )I{uk > ||β k ||α1 } using inversion method, which can be done as follows: a) Generate u∗k ∼ Exponential(λk ), b) uk = u∗k + ||β k ||α1 , k = 1, . . . , K.

2. Generate β from a truncated multivariate normal distribution proportional to the posterior distribution of β described in (5.11). This step can be done by implementing an efficient sampling technique described in Appendix C.5.

3. Generate σ 2 from the Inverse Gamma distribution proportional to the posterior distribution of σ 2 described in (5.13).

137

5.4.2

SAMPLING HYPERPARAMETERS

To update the tuning parameters (λ1 , . . . , λK ), we work directly with the group bridge prior, marginalizing out the latent variables uk ’s. From (5.10), we observe that the posterior for λk , given β, is conditionally independent of y. Therefore, if λk has a Gamma(a, b) prior, we can update the tuning parameters by generating samples from their conditional posterior distribution

λ1 , . . . , λK |β, α ∝

K 

a+

λk

mk α

−1

exp{−λk (b + ||βk ||α1 )}.

(5.14)

k=1

The concavity parameter α is usually prefixed beforehand. However, it can also be estimated by assigning a suitable prior π(α). Since 0 < α < 1, a natural choice for prior on α is the Beta distribution. Assuming a Beta (c, d) prior on α, the posterior distribution of α is given by

π(α|β, λ1 , . . . , λK ) ∝α

c−1

(1 − α)

d−1

K 

K m /α  λk k exp{− λk ||βk ||α1 }. mk Γ( + 1) α k=1 k=1

(5.15)

To sample from (5.15), one can use a random walk MH algorithm with the prior Beta (c, d) as the transition distribution and acceptance probability 

w = min{1,

π(α |β, λ1 , . . . , λK ) }, π(α|β, λ1 , . . . , λK )

(5.16)

138

Figure 5.3: Gibbs Sampler for the BAGB for linear models



where α is the present state, and α is a candidate draw from the prior distribution Beta (c, d). One might be tempted to choose a Beta (1, 1) distribution as prior for the hyperparameter α, which is noninformative, and corresponds to a U (0, 1) distribution. However, in many real situations, we might have a prior belief on the concavity parameter α that reflects a particular desired shape of the penalty function. In those situations, other priors from the Beta family can be used. For example, a Beta (10, 10) prior having higher concentration of mass around α = 0.5 can be used if α is believed to be 0.5 beforehand based on prior experience. We summarize the algorithm in Figure 5.3.

139

5.5

RESULTS

In this section, we evaluate the statistical properties of the proposed Bayesian Group Bridge (BAGB) estimator, and compare its performance with several existing group variable selection methods viz. group LASSO (GL), sparse group LASSO (SGL), and group bridge (GB). For group bridge, a locally approximated coordinate descent algorithm (Breheny and Huang, 2009; Huang et al., 2009) is used, as implemented in the R package grpreg. For GL, we make use of the group descent algorithm (Breheny and Huang, 2015), also implemented in the R package grpreg. For SGL, we use a blockwise descent algorithm, which is implemented in the R package SGL. It is to be noted that none of these algorithms assumes invariance under within-group orthonormality of the design matrix. Regarding the selection of the tuning parameters, Huang et al. (2009) indicated that tuning based on BIC in general does better than that based on other criterions such as AIC, Cp , or GCV in terms of selection at the group and individual variable levels. Therefore, the tuning parameters are selected by BIC for all the frequentist methods under consideration, where the BIC is defined as ˆ 22 /n) + log(n).df /n, BIC = log(||y − X β|| where df is the degrees of freedom of the estimated model. For SGL, we define the corresponding df as the number of non-zero coefficients in the model. For others, we calculate df as described in Huang et al. (2009) and Breheny and Huang (2009). For GB, the concavity parameter α is pre-fixed at 0.5 (default in grpreg). We set the hyperparameters corresponding to λj ’s as a = 1 and

140 b = 0.1, which corresponds to a relatively flat distribution with high posterior probability near the MLE (Kyung et al., 2010). The BAGB estimates are posterior means using the post-burn-in samples of the Gibbs sampler. To decide on the burn-in number, we use the PSRF (Gelman and Rubin, 1992). Once ˆ < 1.1 for all parameters of interest, we continue to draw the desired numR ber of iterations to obtain samples from the joint posterior distribution. The convergence of our MCMC is checked by trace plots of the generated samples and calculating the Gelman-Rubin scale reduction factor (Gelman and Rubin, 1992) using the coda package in R. The response is centered and the predictors are normalized to have zero means and unit variances before applying any model selection method. For variable selection with BAGB, we make use of the Scaled Neighbourhood Criterion (SNC) (Li and Lin, 2010). For a particular  variable βj , the criterion is defined as SN C = P {(|βj | > var(βj |Data)|Data}, j = 1, . . . , p. A variable is selected if the corresponding SNC exceeds a certain threshold τ . Following Li and Lin (2010), we set τ = 0.5.

5.5.1

THE BIRTH WEIGHT EXAMPLE

We re-examine the birth weight dataset from Hosmer and Lemeshow (1989). We make use of a reparametrized data frame available from the R package grpreg. The original dataset, which was collected at Baystate Medical Center, Springfield, Massachusetts during 1986, consists of the birth weights of 189 babies and 8 predictors concerning the mother. Among the eight predictors, two are continuous (mother’s age in years and mother’s weight in pounds at the last menstrual period) and six are categorical (mother’s race, smoking status during pregnancy, number of precious premature labors, history of hypertension,

141 presence of uterine irritability, number of physician visits). Since a preliminary analysis suggested that nonlinear effects of both mother’s age and weight might exist, both these effects are modelled by using third-order polynomials. The reparametrized data frame includes p = 16 predictors corresponding to K = 8 groups consisting of 3, 3, 2, 1, 2, 1, 1, and 3 variables respectively. The response variable of interest is the maternal birth weight in kilograms. As in the previous analyses (Park and Yoon, 2011; Yuan and Lin, 2006), we randomly select three quarters of the observations (151 cases) for model fitting, and reserve the rest of the data as the test set. For BAGB, the Markov chain (in this and the simulation studies as well) is run for a total of 30, 000 iterations, of which 15, 000 are discarded as burn-in, and the rest are used as draws from the posterior distribution. To check the sensitivity of our model, we opt for two different hyperparameter settings for the prior on α, (i) Beta (1, 1), and (ii) Beta (10, 10). MSE

BAGB (Beta) 662615.2

BAGB (Uniform) 703928.7

GB 758162.2

GL 744161.4

SGL 868976.3

Table 5.1: Summary of birth weight data analysis results using linear regression.‘MSE’ reports the mean squared errors (MSE’s) on the test data

Table 5.1 summarizes the comparison of the prediction errors of the four estimators. It shows that BAGB outperforms other three methods regardless of which prior distribution is used for α. We repeat the random selection of training and test sets many times, and obtain the similar result as in Table 5.1. We also compare the prediction accuracy of both group bridge estimators for a varying α, which is summarized in Table 5.2. It is clear from Table 5.2 that

142 Method BAGB GB

α = 0.1 705374.0 839626.6

α = 0.3 672173.1 838866.3

α = 0.5 654960.2 758162.2

α = 0.7 640146.8 764094.2

α = 0.9 625812.3 728786.2

Table 5.2: A comparison of GB and BAGB with respect to MSE’s for different values of α, based on 38 test set observations. In GB, λ is chosen by BIC and the BAGB estimates are based on 15, 000 posterior samples

BAGB significantly outperforms the frequentist GB estimator across all values of the concavity parameter. The histograms and the corresponding density plots of the concavity parameter α, assuming two Beta priors are given in Figures 5.4 and 5.5 respectively. It is interesting to note that the posterior mean estimates of α are pulled away from 0.5. The estimated α for BAGB is 0.3 and 0.06 respectively based on the Beta (10, 10) and Beta (1, 1) priors. In Figure 5.6, posterior densities for the marginal effects of the predictors using our method are provided, along with the classical group bridge estimates. It can be observed that the GB solutions do not coincide with the posterior mode BAGB estimates. This difference in estimates can be contributed to the fact that in the Bayesian framework, we are able to account for the uncertainty in λ, σ 2 , and α, which is otherwise ignored in the classical framework. Another intrinsic feature of this diagram is the noticeable multimodality in the joint distribution of the Bayesian group bridge posterior. Clearly, summarizing βj by means of a single point estimate may not be reasonable, which is unfortunately the only available estimate in the frequentist framework. We note that, for this particular dataset, BAGB performs better for fixed α. The reason might be that not much variance is explained by introducing the priors which resulted

Posterior Mean Posterior Median Posterior Mode

Density

40

15

0

0

5

20

10

Density

20

60

25

80

30

143

0.1

0.2

0.3

0.4

alpha

(a)

0.05

0.10

0.15

0.20

alpha

(b)

Figure 5.4: Histogram of α based on 15, 000 MCMC samples assuming a Beta(1, 1) prior (left) and the corresponding density plot (right) for the Birthweight data analysis, assuming a linear model. Solid red line: posterior mean, dotted green line: posterior mode, and dashed purple line: posterior median

in slightly worse performance of the estimator with estimated α as compared to the same with fixed concavity parameter. The mixing of the MCMC chain was reasonably good, confirmed in the trace and ACF plots and other statistical tests (data not shown). Thus, BAGB is able to mimic the property of the frequentist group bridge estimator by conducting bi-level variable selection, with better uncertainty assessment and quantification, by unraveling the full posterior summary of the coefficients.

5.5.2

SIMULATION EXPERIMENTS

In this section, we conduct simulation experiments to evaluate the finite sample performance of BAGB, and compare the results with several existing group variable selection methods described above. For our simulation studies, we prefix α = 0.5 for both group bridge methods. To evaluate the prediction

4

144

2

Density

2

0

0

1

1

Density

3

3

4

Posterior Mean Posterior Median Posterior Mode

0.2

0.3

0.4

0.5 alpha

(a)

0.6

0.7

0.1

0.2

0.3

0.4

0.5

0.6

0.7

alpha

(b)

Figure 5.5: Histogram of α based on 15, 000 MCMC samples assuming a Beta(10, 10) prior (left) and the corresponding density plot (right) for the Birthweight data analysis, assuming a linear model. Solid red line: posterior mean, dotted green line: posterior mode, and dashed purple line: posterior median

performance of the methods, following Zou (2006), we calculate the median of relative prediction errors (MRPE) based on 100 replications. The relative prediction error (RPE) is defined as RPE = E(ˆ y − Xβ ∗ )/σ 2 , where β ∗ is the true coefficient vector. Each simulated sample is partitioned into a training set and a test set. Models are fitted on the training set and RPE’s are calculated on the test set. To evaluate the variable selection performance of the methods, we consider sensitivity (Sen) and specificity (Spe), which are defined as follows

Sen =

Spe =

No. of Selected Important Variables , No. of ‘True’ Important Variables

No. of Selected Unimportant Variables . No. of ‘True’ Unimportant Variables

0.0 0.5

−1.0

0.0

−1.0 0.0

0.2 0.6 ftv1

0.0

1.0

0.6

black

1.0

−2.0

−0.5 0.5 ht

Density

0

4 0

−0.4

1.5 −1.0

ptl2m

Density

4 0

−0.5 ui

0.0

1.0

white

ptl1

Density

1.0 0.0

Density

−1.5

4 0

0.5

8

smoke

2

Density 0.0

Density

0.0 1.0 2.0

2.0 0.0 −1.0

0.4

lwt3

Density

lwt2

0.2

8

−0.2 0.0

0.1 0.3 0.5 lwt1

Density

0.2

−0.2

Density

Density

6 0

0.0

0.2 age3

3

Density

8 4 −0.3

Density

−0.2

age2

0

Density

age1

0.2

0.0

−0.2

4

0.2

0.0 1.0 2.0

−0.1

0.0 1.5 3.0

−0.4

0 4 8

Density

8 0

4

Density

0 4 8

Density

145

−0.5 0.0 ftv2

0.5

−1.0

0.0 ftv3m

Figure 5.6: Marginal posterior densities for the marginal effects of the 16 predictors in the Birthweight data. Solid red line: penalized likelihood (group bridge) solution with λ chosen by BIC. Dashed green line: marginal posterior mean for βj . Dotted purple line: mode of the marginal distribution for βj under the fully Bayes posterior. The BAGB estimates are based on 15, 000 posterior samples

146 For both sensitivity and specificity, a higher value (close to 1) means a better variable selection performance. For the frequentist methods, the variable corresponding to a non-zero estimated coefficient is considered to be important. For variable selection with BAGB, we make use of the Scaled Neighbourhood Criterion (SNC) (Li and Lin, 2010), as defined before. We simulate data from the true model y = Xβ + ,  ∼ N(0, σ 2 I). Following Huang et al. (2009), we consider two scenarios. For the generating models in Example 1, the number of groups is moderately large, the group sizes are equal and relatively large, and within each group the coefficients are either all nonzero or all zero. In Example 2, the group sizes vary and there are coefficients equal to zero in a nonzero group. It is to be noted that most of the simulation results in the literature (including those reported in Huang et al. (2009)) consider considerably high signal-to-noise ratio (SNR), defined as Var(X β ) . As pointed out by Levina and Zhu (2008), when the SNRs SNR = σ2 under consideration are extremely high (≥ 40), any variable selection method should have a reasonably good performance given a moderately large sample size. However, in many practical applications including the genome-wide association studies, the true effect of a genetic variant is expected to be at most moderate, which corresponds to a low SNR scenario. Therefore, for a relatively complete evaluation, we vary the signal-to-noise ratio (SNR) from low to moderate to high for varying sample sizes {nT , nP }, where nT denotes the size of the training set, and nP denotes the size of the testing set.

147 SIMULATION 1 (“ALL-IN-ALL-OUT”): In this simulation example, we assume that there are K = 5 groups and each group consists of eight covariates, resulting in p = 40 predictors. The design matrix X consists of 5 submatrices, i.e., X = (X1 , X2 , X3 , X4 , X5 ), where Xj = (x8(j−1)+1 ), . . . x8(j−1)+8 ) for any j ∈ {1, . . . , 5}. The covariates x1 , . . . , x40 are generated as follows: 1. First simulate R1 , . . . , R40 independently from the standard normal distribution. 2. Simulate Zj ’s from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between xj1 and xj2 equal to 0.5|j1 −j2 | ∀ j1 = j2 , for j1 , j2 = 1, . . . , 5. √ 3. Obtain xj ’s as xj = (Zgj + Rj )/ 2, j = 1, . . . , 40, where gj is the smallest integer greater than (j −1)/8 and the xj ’s with the same value of gj belong to the same group. We consider four different levels of SNR from low (1) to moderate (3, 5) to high (10) and two sample sizes {nT , nP } = {200, 200} and {400, 400}. σ 2 is chosen such that the desired level of SNR is achieved. The coefficients corresponding to the first two groups are given as (β1 , . . . , β8 ) = (0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4) , (β9 , . . . , β16 ) = (2, . . . , 2) . We set the rest of the coefficients to zero. Thus, the coefficients in each group are either all nonzero or all zero, leading to a “All-In-All-Out” scenario.

SIMULATION 2 (“NOT-ALL-IN-ALL-OUT”): In this example, we assume that the coefficients in a group can be either all zero, all nonzero or

148 partly zero, leading to a “Not-All-In-All-Out” scenario. We also assume that the group size differs across groups. There are K = 6 groups made up of three groups each of size 10 and three groups each of size 4, resulting in p = 42 predictors. The design matrix X consists of 6 submatrices, i.e., X = (X1 , X2 , X3 , X4 , X5 , X6 ), where Xj = (x10(j−1)+1 , . . . x10(j−1)+10 ) for any j ∈ {1, 2, 3} and Xj = (x4(j−4)+31 , . . . x4(j−4)+34 ) for any j ∈ {4, 5, 6}. The covariates x1 , . . . , x42 are generated as follows: 1. First simulate R1 , . . . , R42 independently from the standard normal distribution. 2. Simulate Zj ’s from the multivariate normal distribution with mean 0, variance 1, and pairwise correlations between xj1 and xj2 equal to 0.5|j1 −j2 | ∀ j1 = j2 , for j1 , j2 = 1, . . . , 6. 3. For j = 1, . . . , 30, let gj be the largest integer less than j/10 + 1, and for j = 31, . . . , 42, let gj be the largest integer less than (j − 30)/4 + 1. Then √ we obtain xj ’s as xj = (Zgj + Rj )/ 2, j = 1, . . . , 42. The coefficients are given as follows (β1 , . . . , β10 ) = (0.5, −2, 0.5, 2, −1, 1, 2, −1.5, 2, −2) , (β11 , . . . , β20 ) = (−1.5, 2, 1, −2, 1.5, 0, . . . , 0) , (β21 , . . . , β30 ) = (0, . . . , 0) , (β31 , . . . , β34 ) = (2, −2, 1, 1.5) , (β35 , . . . , β38 ) = (−1.5, 1.5, 0, 0) , (β39 , . . . , β42 ) = (0, . . . , 0) .

149 Similar to Simulation 1, we consider four different levels of SNR from low (1) to moderate (3, 5) to high (10) and two sample sizes {nT , nP } = {200, 200} and {400, 400} and choose σ 2 such that the desired level of SNR is achieved.

The simulation results are summarized in Tables 5.3 and 5.4. Several observations are in order. First, in low SNR (high noise) settings, BAGB vastly outperforms existing methods, regardless of the sample size and the sparsity structure in the data generating model. Even for moderate to high SNRs, it consistently produces the smallest prediction error across all simulation examples. This is due to the fact that unlike frequentist methods, BAGB is able to account for the uncertainty in the tuning parameters in a systematic fashion that leads to the better prediction performance of the method. Secondly, none of the methods uniformly dominates the others in variable selection performance. In the presence of within-group sparsity (Simulation 2), BAGB has higher or similar sensitivity as compared to both GB and SGL (lower specificity), which indicates that as far as bi-level variable selection is concerned, BAGB is not as ‘aggressive’ as the frequentist methods in excluding predictors. This holds true even in the absence of within-group sparsity (Simulation 1), when the corresponding SNR is moderate or high, although it maintains good specificity. For low SNR scenarios in Simulation 1, it has slightly worse sensitivity than the frequentist GB estimator, although it maintains better specificity. Thirdly, group LASSO has the lowest specificity in all the simulation examples, regardless of the sample size, which is expected as it tends to select all variables in an unimportant group. However, it has the highest sensitivity across all examples. Fourthly, it should be noted that the variable selection performance of BAGB

150 depends on the threshold parameter τ . In practice, using a higher threshold value τ would result in a higher sensitivity but a lower specificity. Therefore, depending on the importance of sensitivity and specificity of a particular study, the researcher may choose the appropriate threshold.

5.6 5.6.1

EXTENSIONS EXTENSION TO GENERAL MODELS

In this section, we briefly discuss how BAGB can be extended to several other models (e.g. GLM, Cox’s model, etc.) beyond linear regression. Again, we assume that there is a grouping structure among the predictors. As described before, based on the LSA (Wang and Leng, 2007), the conditional distribution of y for a general model is given by   1  −1 ˜ Σ ˜ . ˆ (β − β) y|β ∼ exp − (β − β) 2 Combining the SMU representation of the group bridge prior and the LSA approximation of the general likelihood, the hierarchical presentation of BAGB

151 for general models can be written as y|β ∼ exp



− 12 (β

  ˆ −1 ˜ ˜ − β) Σ (β − β) ,

β k |uk , α ∼ Multivariate Uniform(Ωk ), independently, where Ωk = {βk ∈ Rmk : ||βk ||α1 < uk }, k = 1, ldots, K, K  mk + 1, λk ), Gamma( u1 , . . . , uK |λ1 , . . . , λK , α ∼ α k=1 K  Gamma(a, b), λ1 , . . . , λ K ∼

(5.17)

k=1

α ∼ Beta(c, d).

The full conditional distributions are given as

˜ Σ) ˆ β|y, X, u, λ1 , . . . , λK , α ∼ Np (β,

K 

I{||β k ||α1 < uk },

(5.18)

k=1

u|β, λ1 , . . . , λK , α ∼

K 

Exponential(λk )I{uk > ||β k ||α1 },

(5.19)

k=1

λ1 , . . . , λK |β, α ∝

K 

a+

λk

mk α

−1

exp{−λk (b + ||βk ||α1 )}.

(5.20)

k=1

α|β, λ1 , . . . , λK ∝ α

c−1

(1 − α)

d−1

K 

K m /α  λk k exp{− λk ||βk ||α1 }. (5.21) mk Γ( + 1) α k=1 k=1

As before, an efficient Gibbs sampler can be easily carried out based on these full conditionals. We summarize the algorithm in Figure 5.7.

152

Figure 5.7: Gibbs Sampler for the BAGB for general models

APPLICATION TO THE BIRTHWEIGHT DATA - LOGISTIC REGRESSION ILLUSTRATION As a second example, we re-visit the birth weight data. We devote this example to evaluate the potential of using the proposed method for general models. The binary response variable of interest was whether the maternal birth weight was less than 2.5 kg (Y = 1) or not (Y = 0). As before, we randomly selected 151 observations for model fitting and used the rest as the test set to validate the test errors. Table 5.5 summarizes the comparison of the test errors of the four estimators. It shows that BAGB performs the best and sparse group lasso closely follows next. Figure 6.1 describes the 95% equal-tailed credible intervals for the regression parameters for this dataset corresponding to α = 0.5. We note that, among the methods depicted here, only BAGB can provide valid standard

153 errors, based on a geometrically ergodic Markov chain. It is interesting to note that all the estimates are inside the credible intervals, which indicates that the resulting conclusion will be similar regardless of which method is used.

5.6.2

EXTENSION TO OTHER GROUP PENALIZED REGRESSION METHODS

The proposed SMU technique can be extended to other flexible group penalties. We discuss in detail two novel applications of the SMU representation. GROUP BRIDGE WITH L2 PENALTY Park and Yoon (2011) proposed a variant of the group bridge estimator which results from the following regularization problem: K 



min (y − Xβ) (y − Xβ) + λk ||β k ||α2 , β k=1

(5.22)

where ||β k ||2 is the L2 norm of β k , λk ’s are the tuning parameters, and α > 0 is the concavity parameter with λk > 0, k = 1, . . . , K. It is to be noted that when α = 1, it coincides with the adaptive group LASSO estimator of Wang and Leng (2008). We refer to this method as GB2, to distinguish it from GB. Park and Yoon (2011) showed that under certain regularity conditions, the GB2 estimator achieves the ‘oracle group selection property’. For a Bayesian analysis of the GB2 estimator, one may consider the following prior on the coefficients

π(β|λ1 , . . . , λK , α) ∝

K  k=1

exp(−λk ||β k ||α2 ),

(5.23)

154 which belongs to the family of multivariate generalized Gaussian distributions. Now, assuming a linear model and using the SMU representation of the associated prior (Appendix C.6), the hierarchical representation of the Bayesian GB2 estimator can be formulated as follows y|X, β, σ 2 ∼ Nn (Xβ, σ 2 In ), β k |uk , α ∼ Multivariate Uniform(Ωk ), independently, where Ωk = {βk ∈ Rmk : ||βk ||α2 < uk }, k = 1, . . . , K, K  mk + 1, λk ), Gamma( u1 , . . . , uK |λ1 , . . . , λK , α ∼ α k=1

(5.24)

σ 2 ∼ π(σ 2 ), π(σ 2 ) ∝ 1/σ 2 , K  Gamma(a, b), λ1 , . . . , λ K ∼ k=1

α ∼ Beta(c, d). The full conditional distributions can be derived as ˆ OLS , σ 2 (X  X)−1 ) β|y, X, u, σ 2 , α ∼ Np (β

K 

I{||β k ||α2 < uk },

(5.25)

k=1

u|β, λ1 , . . . , λK , α ∼

K 

Exponential(λk )I{uk > ||β k ||α2 },

(5.26)

k=1

σ 2 |y, X, β ∼ Inverse Gamma ( λ1 , . . . , λK |β, α ∝

K 

n−1 1 , (y − Xβ) (y − Xβ)), 2 2

a+

λk

mk α

−1

exp{−λk (b + ||βk ||α2 )},

(5.27) (5.28)

k=1

α|β, λ1 , . . . , λK ∝ α

c−1

(1 − α)

d−1

K 

K m /α  λk k exp{− λk ||βk ||α2 }. (5.29) mk Γ( + 1) α k=1 k=1

155

BAGB 3

GB GL

2

SGL

1





0

● ●























ftv1

ftv2

ftv3m

−1



age1

age2

age3

lwt1

lwt2

lwt3

white

black

smoke

ptl1

ptl2m

ht

ui

betas

Figure 5.8: For the birth weight data, posterior mean Bayesian group bridge estimates and corresponding 95% equal-tailed credible intervals based on 15, 000 Gibbs samples using a logistic regression model. Overlaid are the corresponding GB, GL, and SGL estimates

COMPOSITE GROUP BRIDGE PENALTY Seetharaman (2013) proposed the composite group bridge estimator which is an extension of the original group bridge penalty, and results from the following regularization problem: mk K   min (y − Xβ) (y − Xβ) + λk { |β kj |γ }α , β j=1 k=1

(5.30)

where λk ’s are the tuning parameters, 0 < α < 1 and 0 < γ < 1 are the concavity parameters, and λk > 0, k = 1, . . . , K. When γ = 1, it reduces to the GB estimator. We refer to this method as CGB. Seetharaman (2013) established that unlike GB, the CGB estimator achieves bi-level oracle properties, meaning it identifies the correct groups and the correct nonzero coefficients within

156 each selected group with probability converging to 1, under certain regularity conditions. As before, assuming a normal likelihood and utilizing the SMU representation of the CGB prior, the hierarchical representation of the composite group bridge estimator can be formulated as follows y|X, β, σ 2 ∼ Nn (Xβ, σ 2 In ), β k |uk , α, γ ∼ Multivariate Uniform(Ωk ), independently, mk  1 mk where Ωk = {βk ∈ R : |β kj |γ < ukα }, k = 1, . . . , K, j=1

u1 , . . . , uK |λ1 , . . . , λK , α, γ ∼

K 

Gamma(

k=1

mk + 1, λk ), α

(5.31)

σ 2 ∼ π(σ 2 ), π(σ 2 ) ∝ 1/σ 2 , K  Gamma(a, b), λ1 , . . . , λ K ∼ k=1

α ∼ Beta(c, d), γ ∼ Beta(c∗ , d∗ ). The full conditional distributions can be derived as −1 2  ˆ β|y, X, u, σ , α, γ ∼ Np (β OLS , σ (X X) ) 2

K  k=1

u|β, λ1 , . . . , λK , α, γ ∼

K 

λ1 , . . . , λK |β, α, γ ∝

k=1

mk 

|β kj |γ )α },

(5.33)

j=1

σ 2 |y, X, β ∼ Inverse Gamma ( a+ λk

(5.32)

j=1

Exponential(λk )I{uk > (

k=1

K 

mk  1 I{ |β kj |γ < ukα },

mk α

n−1 1 , (y − Xβ) (y − Xβ)), 2 2

−1

mk  exp{−λk (b + { |β kj |γ )α }. j=1

(5.34) (5.35)

157 α|β, λ1 , . . . , λK , γ ∝ mk /α mk K K   Γ( mγk + 1) αc−1 (1 − α)d−1  λk exp{− λk ( |β kj |γ )α }, Γ( mαk + 1) {Γ( γ1 + 1)}p k=1 j=1 k=1

(5.36)

γ|β, λ1 , . . . , λK , α ∝ γc

∗ mk /α mk K K   Γ( mγk + 1) (1 − γ)d −1  λk exp{− λ ( |β kj |γ )α }. k mk 1 p Γ( α + 1) {Γ( γ + 1)} j=1 k=1 k=1

∗ −1

(5.37)

When γ = 1, the Bayesian CGB reduces to the BAGB estimator.

5.7

DISCUSSION

We have proposed the Bayesian group bridge estimator which is novel in three aspects. First, BAGB is one of the very few existing Bayesian group feature selection methods capable of simultaneous bi-level variable selection and coefficient estimation. Secondly, it allows adaptive selection of the tuning parameters as well as the concavity parameter from the data, providing a novel and natural treatment of model uncertainty through a systematic quantification of uncertainty associated with the hyperparameters. Thirdly, it provides a valid measure of standard error for the estimated coefficients. Most of the existing group variable selection methods, which are developed from a frequentist’s point of view, are unable to produce valid standard errors if the true coefficients are zero, i.e., they cannot give any confidence statement of these estimates, which makes the statistical inference rather difficult. In contrast, statistical inference for BAGB is much more straightforward. Empirically, we have shown its attractiveness compared to several existing methods. Moreover, we have proposed a unified

158 framework which is applicable to general models beyond linear models. In addition, we have introduced a novel SMU representation of the group bridge penalty, which is particularly interesting, and suggests many generalizations, some of which we have discussed. The resulting MCMC algorithm has good mixing property, which validates potential applicability of the method. A few limitations of our approach should be noted. First, we have described our method for working with non-overlapping groups. However, in practice, multiple features can belong to multiple overlapping groups, for which BAGB may not be applicable although we suggest it as a future work. Second, although in our experience, the proposed Gibbs sampler is efficient with good rates of convergence and mixing, an MCMC-based approach can be overwhelmed for massively large datasets. Therefore, another path for future investigation includes computationally efficient adaptation of our approach to fast algorithms (e.g. EM) for rapid estimation and speedy inference.

159 nT = nP = 200, SNR = 1 Method Sen BAGB 0.175 GB 0.241 GL 0.535 SGL 0.076 nT = nP = 200, SNR = 3 BAGB 0.352 GB 0.352 GL 0.880 SGL 0.376 nT = nP = 200, SNR = 5 BAGB 0.478 GB 0.451 GL 0.955 SGL 0.538 nT = nP = 200, SNR = 10 BAGB 0.682 GB 0.584 GL 1.000 SGL 0.698 nT = nP = 400, SNR = 1 BAGB 0.237 GB 0.292 GL 0.760 SGL 0.172 nT = nP = 400, SNR = 3 BAGB 0.505 GB 0.482 GL 0.980 SGL 0.584 nT = nP = 400, SNR = 5 BAGB 0.669 GB 0.586 GL 0.995 SGL 0.698 nT = nP = 400, SNR = 10 BAGB 0.848 GB 0.795 GL 1.000 SGL 0.841

Spe 0.921 0.881 0.663 0.965

MRPE 0.009(0.019) 0.012(0.013) 0.049(0.011) 0.068(0.014)

0.903 0.937 0.493 0.837

0.011(0.020) 0.036(0.014) 0.088(0.015) 0.076(0.022)

0.892 0.948 0.443 0.768

0.026(0.021) 0.049(0.017) 0.112(0.018) 0.056(0.022)

0.889 0.958 0.393 0.743

0.043(0.020) 0.081(0.018) 0.128(0.016) 0.051(0.017)

0.932 0.934 0.553 0.915

0.016(0.008) 0.023(0.009) 0.047(0.009) 0.067(0.014)

0.912 0.962 0.417 0.770

0.023(0.009) 0.034(0.011) 0.064(0.010) 0.039(0.008)

0.910 0.982 0.387 0.741

0.029(0.009) 0.053(0.008) 0.072(0.010) 0.045(0.008)

0.910 0.998 0.320 0.732

0.036(0.010) 0.053(0.009) 0.085(0.011) 0.046(0.008)

Table 5.3: Results from Simulation 1. The numbers in parentheses are the corresponding standard errors of the RPE’s based on 500 bootstrap resampling of the 100 RPE’s. The bold numbers are significantly smaller than others

160 nT = nP = 200, SNR = 1 Method Sen BAGB 0.916 GB 0.720 GL 0.982 SGL 0.507 nT = nP = 200, SNR = 3 BAGB 0.988 GB 0.978 GL 1.000 SGL 0.878 nT = nP = 200, SNR = 5 BAGB 0.994 GB 0.993 GL 1.000 SGL 0.903 nT = nP = 200, SNR = 10 BAGB 0.999 GB 0.999 GL 1.000 SGL 0.915 nT = nP = 400, SNR = 1 BAGB 0.977 GB 0.928 GL 1.000 SGL 0.858 nT = nP = 400, SNR = 3 BAGB 0.996 GB 0.994 GL 1.000 SGL 0.919 nT = nP = 400, SNR = 5 BAGB 0.999 GB 0.999 GL 1.000 SGL 0.929 nT = nP = 400, SNR = 10 BAGB 1.000 GB 1.000 GL 1.000 SGL 0.934

Spe 0.732 0.925 0.608 0.920

MRPE 0.137(0.015) 0.225(0.021) 0.226(0.018) 0.363(0.033)

0.674 0.851 0.454 0.903

0.146(0.015) 0.185(0.014) 0.212(0.021) 0.568(0.016)

0.653 0.839 0.379 0.938

0.153(0.014) 0.192(0.014) 0.229(0.022) 0.870(0.026)

0.630 0.830 0.282 0.962

0.175(0.018) 0.190(0.021) 0.260(0.029) 1.641(0.049)

0.704 0.895 0.527 0.870

0.078(0.010) 0.115(0.015) 0.114(0.011) 0.192(0.014)

0.658 0.850 0.408 0.963

0.082(0.011) 0.094(0.012) 0.124(0.010) 0.451(0.013)

0.639 0.841 0.291 0.986

0.091(0.011) 0.096(0.014) 0.128(0.011) 0.705(0.016)

0.618 0.837 0.299 0.994

0.109(0.009) 0.109(0.011) 0.148(0.011) 1.360(0.033)

Table 5.4: Results from Simulation 2. The numbers in parentheses are the corresponding standard errors of the RPE’s based on 500 bootstrap resampling of the 100 RPE’s. The bold numbers are significantly smaller than others

161

Test Error AUC

BAGB GB 0.21 0.24 0.78 0.67

GL 0.26 0.71

SGL 0.21 0.64

Table 5.5: Summary of the birth weight data analysis results using logistic regression. ‘AUC’ reports the area under the curve (AUC) estimates on the test data. ‘Test Error’ reports the misclassification errors on the test data

162

REFERENCES

Andersen, T., R. Davis, J.-P. Kreiss, and T. Mikosch (2009). Handbook of Financial Time Series. Springer. Beran, R. (1982). Estimated sampling distributions: the bootstrap and competitors. The Annals of Statistics 10 (1), 212–225. Breheny, P. (2015). The group exponential lasso for bi-level variable selection. Biometrics, Upcoming. Breheny, P. and J. Huang (2009). Penalized methods for bi-level variable selection. Statistics and its Interface 2 (3), 369. Breheny, P. and J. Huang (2015). Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing 25 (2), 173–187. Fang, K. T., S. Kotz, and K. W. Ng (1989). Symmetric Multivariate and Related Distributions. Chapman and Hall, London. Gelman, A. and D. Rubin (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7 (4), 457–472.

163 Hosmer, D. and S. Lemeshow (1989). Applied Logistic Regression. John Wiley & Sons. Huang, J., J. Horowitz, and S. Ma (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics 36 (2), 587–613. Huang, J., S. Ma, H. Xie, and C. Zhang (2009). A group bridge approach for variable selection. Biometrika 96 (2), 339–355. Knight, K. and W. Fu (2000). Asymptotics for lasso-type estimators. The Annals of Statistics 28 (5), 1356–1378. Kyung, M., J. Gill, M. Ghosh, and G. Casella (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis 5 (2), 369–412. Leng, C., M. Tran, and D. Nott (2014). Bayesian adaptive lasso. Annals of the Institute of Mathematical Statistics 66 (2), 221–244. Levina, E. and J. Zhu (2008). Discussion of “Sure independence screening for ultrahigh dimensional feature space”. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (5), 897–898. Li, Q. and N. Lin (2010). The Bayesian elastic net. Bayesian Analysis 5 (1), 151–70. Park, C. and Y. J. Yoon (2011). Bridge regression: adaptivity and group selection. Journal of Statistical Planning and Inference 141 (11), 3506–3519. Park, T. and G. Casella (2008). The Bayesian lasso. Journal of the American Statistical Association 103 (482), 681–686.

164 Polson, N. G., J. G. Scott, and J. Windle (2014). The Bayesian bridge. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(4), 713– 733. Seetharaman, I. (2013). Consistent bi-level variable selection via composite group bridge penalized regression. Master’s thesis, Kansas State University. Wang, H. and C. Leng (2007). Unified LASSO estimation by least squares approximation. Journal of the American Statistical Association 102 (479), 1039–1048. Wang, H. and C. Leng (2008). A note on adaptive group lasso. Computational Statistics & Data Analysis 52 (12), 5277–5286. Yuan, M. and N. Lin (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1), 49–67. Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101 (476), 1418–1429. Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2), 301–320.

165

APPLICATIONS

6.1

INTRODUCTION

This chapter consists of two main sections, each of which aims at evaluating the effectiveness of the existing and proposed Bayesian regularization methods in real life applications. The first utilizes a Bayesian regularized strategy for identifying a subgroup of patients with differential treatment effect in a RCT. The problem of subgroup identification has recently become a popular topic in clinical research, as it contributes to the efforts in discovering personalized medicine (Zhu and Xie, 2015). It is to be noted that subgroup identification can be framed as a special case of a typical model selection problem, as we are not only interested in selecting important variables most predictive of the outcome but also in detecting those with non-negligible interaction with the treatment. The difficulty in coming up with a coherent framework for subgroup identification arises from the fact that each study subject in a RCT is assigned to receive either the new treatment or control, but not both. Therefore, it is not clear how to compare, at the patient level, the observed treatment difference to its predicted counterpart (Zhao et al., 2013). The proposed research in Section

166 6.2 is one of the first to address this issue using a Bayesian regularization technique. In particular, we apply two existing Bayesian regularization approaches, viz. the Bayesian LASSO (Park and Casella, 2008) and the Bayesian adaptive LASSO (Leng et al., 2014) to the ACTG 320 clinical trial data (Hammer et al., 1997; Zhao et al., 2013), and compare these procedures with their frequentist counterparts. A parametric scoring system as a function of multiple baseline covariates is constructed based on the posterior estimates, which is further used to construct a threshold-specific treatment difference curve across a range of score values. In the next section (Section 6.3), we aim to investigate the utility of Bayesian regularization methods for the problem of rare variants detection in a genetic association study. Although rare variants are suggested to be a potential source of ‘missing heritability’ (Bodmer and Bonilla, 2008; McClellan and King, 2010), detecting rare sequence variants associated with complex diseases remains a challenge1 . Due to the low frequencies of the rare variants, it is extremely difficult to make statistical inference based on the traditional single-variant methods (Ayers and Cordell, 2013). Recently, many researchers have proposed methods based on collapsing information across biologically relevant groups so that the combined effect becomes less rare and estimable (Li and Leal, 2008; Madsen and Browning, 2009; Morris and Zeggini, 2010). However, these approaches do not provide any insight into the selection of individual variants. An alternative approach is to use the bi-level variable selection methods. These methods naturally combine information across variants in a group, while avoiding the simplistic assumption that each variant has the same effect on the phenotype 1

Some passages in this chapter have been adapted from Mallick and Li (2013).

167 (Breheny and Huang, 2009). In order to evaluate the real world applicability of our proposal, we apply our Bayesian Group Bridge (BAGB) estimator to the sequencing data from the Dallas Heart Study (DHS) (Romeo et al., 2007), and compare its performance with several existing group variable selection methods viz. group LASSO (GL), sparse group LASSO (SGL) and group bridge (GB).

6.2 BAYESIAN REGULARIZED SUBGROUP SELECTION FOR DIFFERENTIAL TREATMENT EFFECTS

6.2.1 BACKGROUND

In controlled trials, investigators are often interested in assessing the heterogeneity of treatment effects in subgroups of patients. While the overall treatment effect may be modest, it is plausible that there is a subgroup of individuals, defined by their baseline covariate values, that experiences an enhanced treatment effect (Zhu and Xie, 2015). By appropriately choosing a subgroup of patients with an enhanced treatment effect, future investigations can be streamlined for managing future patients and/or designing new trials involving similar comparator treatments (Zhao et al., 2013). As noted by Varadhan et al. (2013), descriptive and exploratory subgroup identification methods using existing clinical trial data are much needed to form new hypotheses and to generate new evidence for clinical decision-making.

In practice, there are two common approaches to subgroup identification. The first is a series of simple subgroup analyses, in which the treatment and control groups are compared in several pre-defined subgroups obtained by categorizing one or more baseline variables. However, such procedures may not be efficient in the presence of a large number of predictive baseline variables. Moreover, these approaches tend to suffer from false positive findings due to multiple comparisons and, therefore, may not necessarily unravel complicated treatment-covariate interactions in subpopulations (Tian et al., 2014). On the other hand, various multivariable parametric (Cai et al., 2011; Gunter et al., 2011; Tian et al., 2014) and non-parametric (Zhu and Xie, 2015) regression-based approaches to subgroup identification have been developed in recent years, in which the products of the binary treatment indicator and a set of baseline covariates, along with their main effects, are included in the regression model. Apart from these model-based approaches, various tree-based machine learning methods (Foster et al., 2011; Lipkovich et al., 2011; Su et al., 2009) built on the classification and regression tree (CART) methodology (Breiman et al., 1984) have also been developed. There is also a literature on Bayesian subgroup identification methods (Berger et al., 2014; Dixon and Simon, 1991; Sivaganesan et al., 2011). However, it is not clear how to utilize these procedures to efficiently identify a group of future patients, based on the baseline covariates, who would have a desired overall treatment benefit (Zhao et al., 2013). Moreover, only a few of these methods are suitable for high-dimensional covariates (Tian et al., 2014). To bridge this gap, Zhao et al. (2013) recently developed an effective way to identify a promising population based on LASSO regularization using data from a historical clinical trial.

Motivated by Zhao et al. (2013), we aim to answer the following questions: 1) How can Bayesian regularization be applied to efficiently identify a subgroup of patients with a differential treatment effect in an RCT with a large number of potential covariates? 2) Does Bayesian regularization have any advantage over its frequentist counterpart in the context of subgroup selection in clinical trials? We seek to answer these questions by bringing the ideas of the Bayesian LASSO (Park and Casella, 2008) and the Bayesian adaptive LASSO (Leng et al., 2014) into Zhao et al. (2013)'s framework, and by advocating a systematic comparison of frequentist and Bayesian regularization approaches with respect to their performance in selecting a desired subgroup from the ACTG 320 data.

6.2.2 NOTATIONS

Following the same notation as in Zhao et al. (2013), we consider an RCT in which each patient receives either an active treatment (denoted by G = 1) or the control (denoted by G = 0) at random. Let π_k = P(G = k) for k = 0, 1. Let X = (X_1, ..., X_p) denote the design matrix consisting of the baseline covariates and y denote the clinical response. In principle, the clinical response y has two components, y^{(1)} and y^{(0)}, where y^{(1)} is the clinical response if a patient receives the active treatment and y^{(0)} is the clinical response if the patient receives the control. In other words, y = G y^{(1)} + (1 - G) y^{(0)}. Therefore, for each subject, only y^{(k)}, k = 0 or 1, can potentially be observed. Let us denote by μ_k(X) = E(y^{(k)} | X) the expected response for patients in group k, conditional on X, and let D(X) = μ_1(X) - μ_0(X) denote the conditional treatment difference. Also, let us denote the sample space of X by χ. A subpopulation with an enhanced treatment effect can be defined as a partition of χ such that D(X) is no less than a pre-defined value c. Formally, the subpopulation can be represented as

S = \{X \in \chi : D(X) \geq c\}.   (6.1)

Suppose that we have a randomized clinical trial dataset \{(y_i, G_i, X_i), i = 1, \ldots, n\} consisting of n independent copies of the triplet (y, G, X). Let \hat{D}(X) be an estimator of D(X), and let X^0 be the covariate vector of a future patient drawn from the same population, with potential response y_{(k)}^{0} if the patient belongs to group k, k = 0, 1. Consider the subgroup of subjects such that \hat{D}(X^0) \geq c. Then, following Zhao et al. (2013), the average treatment difference for this subgroup, defined as

AD(c) = E\left( y_{(1)}^{0} - y_{(0)}^{0} \mid \hat{D}(X^0) \geq c \right),   (6.2)

can be estimated by

\widehat{AD}(c) = \frac{\sum_{i=1}^{n} y_i I\{\hat{D}(X_i) \geq c, G_i = 1\}}{\sum_{i=1}^{n} I\{\hat{D}(X_i) \geq c, G_i = 1\}} - \frac{\sum_{i=1}^{n} y_i I\{\hat{D}(X_i) \geq c, G_i = 0\}}{\sum_{i=1}^{n} I\{\hat{D}(X_i) \geq c, G_i = 0\}},   (6.3)

where I(.) is an indicator function. \widehat{AD}(.), viewed as a function of c, can be used to identify subgroups of patients with enhanced treatment effects corresponding to different values of c.
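To make the estimator in (6.3) concrete, a minimal R sketch is given below; it is only an illustration on simulated data, and the function name estimate_AD and its arguments are ours rather than part of Zhao et al. (2013)'s software.

# Estimated average treatment difference for the subgroup {D_hat(X) >= c}, equation (6.3).
# y: response; G: treatment indicator (0/1); score: estimated scores D_hat(X_i); c: threshold.
estimate_AD <- function(y, G, score, c) {
  in_sub <- score >= c
  mean1 <- sum(y * in_sub * (G == 1)) / sum(in_sub * (G == 1))
  mean0 <- sum(y * in_sub * (G == 0)) / sum(in_sub * (G == 0))
  mean1 - mean0
}

# Toy usage with simulated data:
set.seed(1)
n <- 200
X <- rnorm(n)
G <- rbinom(n, 1, 0.5)
y <- 1 + 0.5 * X + G * (1 + X) + rnorm(n)
score <- 1 + X                        # pretend this is D_hat(X)
estimate_AD(y, G, score, c = 1)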

6.2.3 GENERAL FRAMEWORK FOR SUBGROUP SELECTION VIA REGULARIZATION

A general parametric modeling framework for subgroup selection was introduced in Zhao et al. (2013); we describe it here for the sake of completeness.

TWO SINGLE MODELS

One way to parametrically model the subject-specific treatment difference is to model the mean response for each treatment group:

E(y^{(k)} \mid X) = g_k(\beta_k' h(X)),   (6.4)

where h(X) is a known function consisting of the baseline covariates and the intercept, \beta_k is an unknown vector of parameters to be estimated, g_k(.) is a given link function, and k = 0, 1. Using this framework, D(X) can be estimated as

\hat{D}(X) = g_1(\hat{\beta}_1' h(X)) - g_0(\hat{\beta}_0' h(X)),   (6.5)

where \hat{\beta}_k is the estimate of \beta_k, k = 0, 1.

SINGLE INTERACTION MODEL

Alternatively, one may consider a single model for both treatment groups:

E(y \mid X, G) = g(\beta' h(G, X)),   (6.6)

where g(.) is a given link function, \beta is an unknown vector of parameters to be estimated, and h(.) is a known function consisting of G, X, and the G x X interactions. With this modeling framework, D(X) can be estimated as

\hat{D}(X) = g(\hat{\beta}' h(1, X)) - g(\hat{\beta}' h(-1, X)),   (6.7)

where \hat{\beta} is the estimate of \beta.
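For identity links g_k(u) = g(u) = u, both scoring systems reduce to ordinary linear models, as in the R sketch below; the data frame dat and the covariate names x1 and x2 are purely illustrative, and the interaction model here codes the treatment as 0/1, so \hat{D}(X) is obtained by contrasting predictions at G = 1 and G = 0.

set.seed(1)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), G = rbinom(n, 1, 0.5))
dat$y <- 1 + dat$x1 + dat$G * (0.5 + dat$x2) + rnorm(n)

# Two single models: one linear model per treatment arm (equations 6.4-6.5).
fit1 <- lm(y ~ x1 + x2, data = subset(dat, G == 1))
fit0 <- lm(y ~ x1 + x2, data = subset(dat, G == 0))
D_hat_two <- predict(fit1, newdata = dat) - predict(fit0, newdata = dat)

# Single interaction model: main effects plus treatment-by-covariate interactions (equations 6.6-6.7).
fit_int <- lm(y ~ G * (x1 + x2), data = dat)
D_hat_int <- predict(fit_int, newdata = transform(dat, G = 1)) -
             predict(fit_int, newdata = transform(dat, G = 0))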

6.2.4 COMPARISON OF DIFFERENT SCORING SYSTEMS

Suppose that we have estimated the coefficient vector \beta (or the \beta_k's) as \hat{\beta} (or the \hat{\beta}_k's) using either a frequentist or a Bayesian regularization method, under either of the above modeling frameworks. With the resulting \hat{\beta} (or \hat{\beta}_k's), let \hat{D}(X) be the score. In general, how well a subgroup is defined depends on how precise the estimate \hat{D}(.) is, and therefore different scoring systems \hat{D}(.) will group patients differently (Zhao et al., 2013). Here we briefly outline how to compare different scoring systems, as described in Zhao et al. (2013).

It is to be noted that the curve \widehat{AD}(c) is expected to be increasing for a reasonably good scoring system \hat{D}(.). In particular, the estimated \widehat{AD}(c) can be plotted over a range of values of c, which can then be utilized to select a subgroup with a desired overall treatment benefit. For a suitable comparison of multiple scoring systems, one may transform the conditional event \hat{D}(X^0) \geq c to H(\hat{D}(X^0)) \geq q, where H is the empirical cumulative distribution function of \hat{D}(X^0). The resulting estimate corresponding to equation (6.3) is denoted by \widetilde{AD}(q), which is the same as \widehat{AD}(H^{-1}(q)). Given 0 \leq q \leq 1, \widetilde{AD}(q) is simply the estimated average treatment difference for subjects with scores exceeding the q-th quantile (Zhao et al., 2013).

To compare different scoring systems, Zhao et al. (2013) recommended a cross-training-evaluation strategy that involves randomly splitting the dataset into two parts and carrying out the following steps: (1) use the training set to obtain the scoring system \hat{D}(X), (2) construct the corresponding estimate \widetilde{AD}(.) using the testing set, and (3) repeat the procedure for M > 1 iterations and average over them to obtain the empirical estimate of \widetilde{AD}(.) for each method. A comparison among different methods can then be made by plotting the average \widetilde{AD}(.) (based on the M iterations); in general, the higher the curve \widetilde{AD}(.), the better the scoring system. Another way to compare scoring systems is to compute the partial AUC (obtained by integrating the curve \widetilde{AD}(.) up to a specific constant \eta), as defined in Zhao et al. (2013), as a measure of concordance between the true treatment difference and its empirical estimate; a higher AUC indicates a better fit. A further metric is the area between the curve \widetilde{AD}(.) and the horizontal line y = \widetilde{AD}(0) (the ABC), which can also be used for model evaluation and comparison (higher is better). A theoretical description of these metrics can be found in Zhao et al. (2013).
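The quantile-indexed curve \widetilde{AD}(q) and the ABC measure are straightforward to compute once a test set and its scores are available; the following self-contained R sketch (with simulated data and illustrative names) shows one way to do so.

# AD_tilde(q): average treatment difference among subjects whose score exceeds the q-th quantile.
AD_tilde <- function(y, G, score, q) {
  keep <- score >= quantile(score, probs = q)
  mean(y[keep & G == 1]) - mean(y[keep & G == 0])
}

set.seed(2)
n <- 400
X <- rnorm(n)
G <- rbinom(n, 1, 0.5)
y <- 1 + X + G * (0.5 + X) + rnorm(n)
score <- 0.5 + X                       # test-set scores from some fitted scoring system

q_grid <- seq(0, 0.9, by = 0.05)       # quantile cut-offs, integrating up to eta = 0.90
curve  <- sapply(q_grid, function(q) AD_tilde(y, G, score, q))

# Area between the curve and the horizontal line y = AD_tilde(0) (trapezoidal rule).
gain <- curve - curve[1]
abc  <- sum(diff(q_grid) * (head(gain, -1) + tail(gain, -1)) / 2)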

6.2.5 APPLICATION TO THE ACTG 320 STUDY

We consider the ACTG 320 clinical trial data (Hammer et al., 1997), one of the very first trials to evaluate the added value of a potent protease inhibitor, indinavir, for HIV patients, conducted by the AIDS Clinical Trials Group (ACTG) (Zhao et al., 2013). In this randomized, double-blind study, AIDS patients were randomly assigned to one of two daily regimens: the administration of the protease inhibitor indinavir in addition to zidovudine and lamivudine, or what we will refer to as the placebo regimen, in which patients receive the two nucleosides zidovudine and lamivudine alone. Following Zhao et al. (2013) and Hammer et al. (1997), we analyze the CD4 count change at week 24 as the response, with nine baseline covariates as listed in Table 1 of Hammer et al. (1997). Although 1,156 patients were enrolled in the study, we only consider the subjects with no missing values, giving rise to an analytic sample of n = 838 subjects.

As noted in Zhao et al. (2013), "Although the overall efficacy from the three-drug combination group is highly statistically significant, it is not necessarily true that the new therapy works for all future patients. Moreover, there are nontrivial toxicities and serious concerns about the development of protease inhibitor resistance mutations. Now, suppose that having an expected treatment benefit representing a week 24 CD4 count increase of 100 cells/mm^3 relative to the control would be sufficient to compensate for the costs and risks of using the new therapy. The question, then, is how to identify such a subpopulation efficiently via the 'baseline' covariates." Therefore, following Zhao et al. (2013), we aim to identify a subgroup of patients having an expected treatment benefit representing a week 24 CD4 count increase of 100 cells/mm^3 relative to the control, using the baseline variables.

We consider the above two classes of models to construct various scoring systems. The first uses an additive linear model for each treatment group with all nine covariates. The second uses a single model with the main covariate effects and the interactions between the treatment indicator and the other covariates. For each of the two classes of models, we apply four methods to build candidate scoring systems: the LASSO (Tibshirani, 1996), the adaptive LASSO (Zou, 2006), the Bayesian LASSO (Park and Casella, 2008), and the Bayesian adaptive LASSO (Leng et al., 2014). We use the R package glmnet for the LASSO and adaptive LASSO methods, which implements the coordinate descent algorithm of Friedman et al. (2010) and selects the tuning parameter by means of 10-fold cross-validation. For the Bayesian methods, posterior means are used as estimates, based on 10,000 samples obtained after 10,000 burn-in iterations. This choice of running parameters appears to work satisfactorily based on convergence diagnostics (Gelman et al., 2003). The tuning parameters are estimated as posterior means under gamma priors with shape parameter a = 1 and scale parameter b = 0.1 in the MCMC algorithm.
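For reference, a minimal sketch of the frequentist fits with glmnet and 10-fold cross-validation is shown below; the data are simulated, and the ridge-based weights used for the adaptive LASSO are one common choice rather than necessarily the one used in the reported analysis.

library(glmnet)

set.seed(3)
n <- 200; p <- 9
x <- matrix(rnorm(n * p), n, p)                 # stand-in for the baseline design matrix
y <- drop(x %*% c(2, -1, rep(0, p - 2)) + rnorm(n))

# LASSO with the tuning parameter chosen by 10-fold cross-validation.
cv_lasso   <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
beta_lasso <- coef(cv_lasso, s = "lambda.min")

# Adaptive LASSO: covariate-specific penalty weights from an initial ridge fit.
cv_ridge    <- cv.glmnet(x, y, alpha = 0, nfolds = 10)
w           <- 1 / abs(as.numeric(coef(cv_ridge, s = "lambda.min"))[-1])
cv_alasso   <- cv.glmnet(x, y, alpha = 1, nfolds = 10, penalty.factor = w)
beta_alasso <- coef(cv_alasso, s = "lambda.min")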

Figures 6.1 and 6.2 summarize the treatment difference curves \widetilde{AD}(.) based on averages over M = 100 replications of a cross-validation procedure, where each replication results from a random selection of 4/5 of the data as the training set. On the x-axis, we consider equally spaced quantiles (strata) over the interval [0, 1]. It can be observed that only the Bayesian LASSO is able to select a subgroup of patients with an average CD4 count treatment difference of 100, which corresponds approximately to the patients with scores in the top 5%. We also calculate the ABC measures (with η = 0.90) for the four methods, which are summarized in Table 6.1. Again, it can be observed that the Bayesian LASSO performs the best, with the Bayesian adaptive LASSO following closely. Moreover, the most interesting, and perhaps not surprising, conclusion of this study is that the Bayesian regularization methods vastly outperform their frequentist cousins in identifying the desired subgroup. In other words, Bayesian regularization methods have the potential to effectively select target populations for future clinical trials and thereby provide new evidence for medical decision-making.

6.2.6 CONCLUSIONS

Despite their predominance in other scientific disciplines, Bayesian regularization methods have not yet been widely adopted in the clinical trials community.

[Figure: panel "Two Separate Models" - average treatment difference (y-axis) versus strata 1-5 (x-axis) for the Bayesian Lasso, Bayesian Adaptive Lasso, Adaptive Lasso, and Lasso.]
Figure 6.1: ACTG 320 Data Analysis - Two Separate Models (Based on 100 Random Cross-validations)

[Figure: panel "Single Interaction Model" - average treatment difference (y-axis) versus strata 1-5 (x-axis) for the Bayesian Lasso, Bayesian Adaptive Lasso, Adaptive Lasso, and Lasso.]
Figure 6.2: ACTG 320 Data Analysis - Single Interaction Model (Based on 100 Random Cross-validations)

Table 6.1: Summary of the ACTG 320 data analysis results (ABC Measures)

Method                     Single interaction model    Two separate models
Lasso                      47.4                        55.2
Bayesian Lasso             59.7                        60.6
Adaptive Lasso             50.1                        59.8
Bayesian Adaptive Lasso    58.7                        57.2

The literature on subgroup selection with Bayesian regularization is even less developed. With the current results, we show how Bayesian regularization can help detect a potential subgroup of AIDS patients who may react much more favorably than other patients to the addition of a protease inhibitor to a conventional regimen. Our analysis is, of course, not intended as a full investigation of the ACTG 320 trial, but rather demonstrates how the proposed methods can add value to the existing knowledge. It remains to be seen how these methods perform in other scenarios (e.g., in the presence of high-dimensional covariates or grouped predictors) and for other outcomes (e.g., binary, time-to-event). Therefore, although the results are promising, further research remains necessary.


6.3 EVALUATION OF THE BAYESIAN GROUP BRIDGE FOR IDENTIFYING ASSOCIATION WITH RARE VARIANTS IN THE DALLAS HEART STUDY (DHS)

6.3.1 POPULATION-BASED RESEQUENCING OF ANGPTL4 AND TRIGLYCERIDES

One of the first applications of resequencing to examine the role of sequence variations in the adipokine gene ANGPTL4 in lipid metabolism was conducted by Romeo et al. (2007). This large-scale genetic association study included 3,551 participants from the Dallas Heart Study (DHS), a multi-ethnic sample of Dallas County residents (consisting of 601 Hispanic, 1,830 African American, 1,045 European American, and 75 participants of other ethnicities), from whom fasting venous blood samples were obtained. Romeo et al. (2007) sequenced the seven exons and the intron-exon boundaries of the gene ANGPTL4 and identified a total of 93 sequence variations. Most of these variants were rare: only 12 variants had a minor allele frequency above 1%. As in previous studies (King et al., 2010; Romeo et al., 2007; Yi et al., 2011, 2014), the phenotype analyzed in our study is the log-transformed plasma level of triglyceride. Figure 6.3 gives the histograms of the untransformed and log-transformed plasma levels of triglyceride in the DHS dataset. Two variants were removed as they were not segregating in the sample, reducing the total number of variants to 91. Our analysis included race, age, and gender as covariates in the model. We excluded from the analysis individuals with any missing values in the non-genetic covariates.

[Figure: two histograms of frequency versus plasma triglyceride level - left panel: untransformed response; right panel: log-transformed response.]
Figure 6.3: Histogram of the untransformed and log-transformed plasma levels of triglyceride in the DHS dataset

Following previous analyses (Yi et al., 2011, 2014), the 75 participants from other ethnicities were also excluded, leaving n = 3,456 subjects for the final analysis. For the missing genotypes, we filled in each variable using the mean of the observed values for that marker. We also filled in a single missing value in the response variable by replacing it with the mean of the non-missing responses. Following Yi et al. (2011), we divided the variants into four groups: common non-synonymous, common synonymous, rare non-synonymous, and rare synonymous. We used a minor allele frequency of 1% as the cut-off to distinguish between common and rare variants. The four groups consisted of 2, 10, 30, and 49 variants, respectively. Unlike previous analyses (Yi et al., 2011, 2014), we consider an ungrouped variable as a group consisting of the variable itself. Thus, the covariates age, gender, and race (black, white, other) contributed three groups of size 1, 1, and 3, respectively, making the total number of groups K = 7 and the total number of variables p = 96. By convention, a group is considered significant if at least one variable belonging to that group is found to be significant. We coded the main-effect predictor of each variant using the additive genetic model, i.e., the number of minor alleles in the observed genotype.

It is to be noted that, due to the presence of many rare variants, the matrix X'X is ill-conditioned for this dataset. To resolve this difficulty, we make use of shrinkage estimates, replacing the matrix X'X with a stable Stein-type shrinkage estimator of the covariance matrix (Opgen-Rhein and Strimmer, 2007a; Schafer and Strimmer, 2005), obtained as standard output from the function slm() in the R package care. This simple but reasonable heuristic approach is computationally much more efficient than regularized estimates based on ridge regression (Hoerl and Kennard, 1970), and it has been widely used in genetic studies (Opgen-Rhein and Strimmer, 2007b). Using these estimates, the Markov chain was run for a total of 50,000 iterations, of which 25,000 were discarded as burn-in, and the rest were used as draws from the posterior distribution for making inference. This choice of running parameters appeared to work satisfactorily based on convergence diagnostics (Gelman et al., 2003).

The results obtained from analyzing the DHS data are shown in Table 6.2. For brevity, we only report those variables that are selected by at least one of the methods under comparison, viz. BAGB, GB, GL, and SGL. The variables are sorted in decreasing order of their posterior probabilities in the scaled neighborhood (according to the SNC defined before). It can be immediately observed that the top three variables selected by BAGB, viz. gender, black, and age, are also selected by each of the three competing methods, consistent with the original analysis of Romeo et al. (2007).
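The genotype pre-processing described above (mean imputation of missing genotypes and the 1% minor-allele-frequency cut-off) is simple to reproduce; the following R sketch is illustrative only, with simulated genotypes, and is not the code used for the reported analysis.

# geno: n x m matrix of additively coded genotypes (0/1/2), possibly with missing values.
set.seed(4)
geno <- matrix(rbinom(500 * 20, 2, 0.05), nrow = 500)
geno[sample(length(geno), 50)] <- NA

# Mean imputation: replace missing genotypes by the marker-wise mean of the observed values.
geno <- apply(geno, 2, function(g) { g[is.na(g)] <- mean(g, na.rm = TRUE); g })

# Minor allele frequency and the 1% cut-off separating common and rare variants.
maf  <- colMeans(geno) / 2
rare <- maf < 0.01
table(rare)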

[Figure: marginal posterior density plots for five predictors - gender, age, black, com_syn_8191_R278Q, and rare_non_1313_E40K.]
Figure 6.4: Marginal posterior densities for the marginal effects of 5 selected predictors in the DHS data. Solid red line: penalized likelihood (group bridge) solution with λ chosen by BIC. Dashed green line: marginal posterior mean for β_j. Dotted purple line: mode of the marginal distribution for β_j under the fully Bayes posterior (BAGB). The BAGB estimates are based on 25,000 posterior samples

Table 6.2: Summary of the DHS data analysis results. Posterior probabilities in the scaled neighborhood, along with the corresponding GB, GL, and SGL estimates

Variable                GB        GL        SGL       Posterior Probability
gender                  0.1265    0.1258    0.1456    1.00
black                   -0.254    -0.244    -0.179    1.00
age                     0.1096    0.1085    0.0981    1.00
com_non_8191_R278Q      0.0000    -0.060    -0.013    0.70
rare_non_1313_E40K      0.0000    0.0000    0.0000    0.59
com_syn_10707_P389P     0.0000    0.0000    0.0185    0.48
com_non_8155_T266M      0.0000    0.0038    0.0222    0.42
white                   -0.042    -0.074    0.0000    0.35
other                   0.0000    -0.052    0.0000    0.32

In addition to these non-genetic covariates, previous analyses suggested that two variants in ANGPTL4, viz. 8191 R278Q (common non-synonymous) and 1313 E40K (rare non-synonymous), are significantly associated with the triglyceride level (King et al., 2010; Romeo et al., 2007; Yi et al., 2011, 2014). Although the common non-synonymous variant 8191 R278Q was detected by both GL and SGL, these two methods could not detect an effect of the rare non-synonymous variant 1313 E40K on triglyceride. In contrast, BAGB was not only able to detect these two previously identified variants (the top two genetic variants according to the SNC), but also confirmed additional findings from the previous studies. For example, in addition to the significant non-genetic and genetic covariates, our analysis also revealed that the group effect of the rare synonymous variants was not significant (no variant from this group was selected), although there might be a marginally significant effect of the common synonymous variant 10707 P389P (SNC = 0.48), which is also detected by SGL.

[Figure: (a) histogram and (b) density plot of the MCMC samples of α, with vertical lines marking the posterior mean, median, and mode.]
Figure 6.5: Histogram of α based on 25,000 MCMC samples assuming a Beta(10, 10) prior (left) and the corresponding density plot (right) for the DHS data analysis. Solid red line: posterior mean, dotted green line: posterior mode, and dashed purple line: posterior median

Although this particular variant was not found to be significant in our previous analyses, the group effect of the common synonymous variants remained marginally significant (Yi et al., 2011). Therefore, considering this variant as marginally significant, BAGB reaches a similar conclusion: there might be three significant groups (with one individually significant variant in each group) that affect the triglyceride level, namely the common non-synonymous variants (8191 R278Q), the common synonymous variants (10707 P389P), and the rare non-synonymous variants (1313 E40K), in addition to the non-genetic covariates. It is to be noted that the classical group bridge was unable to detect any of these genetic group effects.

In Figure 6.4, we provide the marginal posterior densities of the marginal effects of the selected predictors using our method, along with the classical group bridge estimates. It can be observed that the GB solution does not coincide with the posterior mode BAGB estimate. This difference in estimates can be attributed to the fact that, in the Bayesian framework, we are able to account for the uncertainty in λ, σ^2, and α, which is usually ignored in the classical framework. Another intrinsic feature of this diagram is the noticeable multimodality in the joint distribution of the Bayesian group bridge posterior. Clearly, summarizing β_j by means of a point estimate may not be reasonable, and yet a point estimate is, unfortunately, the only available summary in the frequentist framework. In Figure 6.5, we draw the histogram of the concavity parameter α based on 25,000 MCMC samples assuming a Beta(10, 10) prior (left), together with the corresponding density plot (right). The posterior mean estimate of α is 0.73, which is clearly shifted away from both 1 and 0.5 (the default in GB). Even though the variables in this dataset are highly correlated, the mixing of the MCMC chain was reasonably good, as confirmed by the trace and ACF plots (Figure 6.6) and by other MCMC diagnostics and statistical tests (data not shown due to the large number of predictors). Thus, BAGB is able to mimic the property of the frequentist group bridge estimator by conducting bi-level variable selection, while offering better uncertainty assessment and better summarization by unveiling the full posterior of the coefficients.

6.3.2 CONCLUDING REMARKS

We have applied our method to detect rare variants in a genetic association study, which not only validates previous findings but also provides insight into the posterior summary of the selected variables. We recognize that most genetic association studies are conducted on much larger scales than the one described here, moving from hundreds of variants to hundreds of thousands of variants, which poses serious computational challenges.

[Figure: trace plots and ACF plots for gender, black, age, com_syn_8191_R278Q, and rare_non_1313_E40K.]
Figure 6.6: Trace plots (left) and ACF plots (right) for the selected variables for the DHS data analysis, based on 25,000 posterior samples

Moreover, there are a number of important practical issues that arise in large-scale genetic association studies that are beyond the scope of this dissertation to address. Therefore, both empirical and theoretical research are still needed to investigate the potential of the BAGB estimator in large-scale genetic association studies.


REFERENCES

Ayers, K. L. and H. J. Cordell (2013). Identification of grouped rare and common variants via penalized logistic regression. Genetic Epidemiology 37(6), 592-602.
Berger, J. O., X. Wang, and L. Shen (2014). A Bayesian approach to subgroup identification. Journal of Biopharmaceutical Statistics 24(1), 110-129.
Bodmer, W. and C. Bonilla (2008). Common and rare variants in multifactorial susceptibility to common diseases. Nature Genetics 40(6), 695-701.
Breheny, P. and J. Huang (2009). Penalized methods for bi-level variable selection. Statistics and Its Interface 2(3), 369.
Breiman, L., J. Friedman, C. J. Stone, and R. A. Olshen (1984). Classification and Regression Trees. CRC Press.
Cai, T., L. Tian, P. H. Wong, and L. Wei (2011). Analysis of randomized comparative clinical trial data for personalized treatment selections. Biostatistics 12(2), 270-282.
Dixon, D. O. and R. Simon (1991). Bayesian subset analysis. Biometrics 47(3), 871-881.
Foster, J. C., J. M. Taylor, and S. J. Ruberg (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine 30(24), 2867-2880.
Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1-22.
Gelman, A., J. Carlin, H. Stern, and D. Rubin (2003). Bayesian Data Analysis. Chapman & Hall, London.
Gunter, L., J. Zhu, and S. Murphy (2011). Variable selection for qualitative interactions in personalized medicine while controlling the family-wise error rate. Journal of Biopharmaceutical Statistics 21(6), 1063-1078.
Hammer, S. M., K. E. Squires, M. D. Hughes, J. M. Grimes, L. M. Demeter, J. S. Currier, J. J. Eron Jr, J. E. Feinberg, H. H. Balfour Jr, and L. R. Deyton (1997). A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and CD4 cell counts of 200 per cubic millimeter or less. New England Journal of Medicine 337(11), 725-733.
Hoerl, A. E. and R. W. Kennard (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55-67.
King, C., P. Rathouz, and D. Nicolae (2010). An evolutionary framework for association testing in resequencing studies. PLoS Genetics 6(11), e1001202.
Leng, C., M. Tran, and D. Nott (2014). Bayesian adaptive lasso. Annals of the Institute of Statistical Mathematics 66(2), 221-244.
Li, B. and S. M. Leal (2008). Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. The American Journal of Human Genetics 83(3), 311-321.
Lipkovich, I., A. Dmitrienko, J. Denne, and G. Enas (2011). Subgroup identification based on differential effect search - a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in Medicine 30(21), 2601-2621.
Madsen, B. E. and S. R. Browning (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics 5(2), e1000384.
Mallick, H. and M. Li (2013). Penalized regression methods in genetic research. OA Genetics 1(1), 7.
McClellan, J. and M.-C. King (2010). Genetic heterogeneity in human disease. Cell 141(2), 210-217.
Morris, A. P. and E. Zeggini (2010). An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic Epidemiology 34(2), 188-193.
Opgen-Rhein, R. and K. Strimmer (2007a). From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data. BMC Systems Biology 1, 37.
Opgen-Rhein, R. and K. Strimmer (2007b). Learning causal networks from systems biology time course data: an effective model selection procedure for the vector autoregressive process. BMC Bioinformatics 8(Suppl 2), S3.
Park, T. and G. Casella (2008). The Bayesian lasso. Journal of the American Statistical Association 103(482), 681-686.
Romeo, S., L. Pennacchio, Y. Fu, E. Boerwinkle, A. Tybjaerg-Hansen, H. Hobbs, and J. Cohen (2007). Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nature Genetics 39(4), 513-516.
Schafer, J. and K. Strimmer (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology 4(1), Article 32.
Sivaganesan, S., P. W. Laud, and P. Müller (2011). A Bayesian subgroup analysis with a zero-enriched Polya urn scheme. Statistics in Medicine 30(4), 312-323.
Su, X., C.-L. Tsai, H. Wang, D. M. Nickerson, and B. Li (2009). Subgroup analysis via recursive partitioning. Journal of Machine Learning Research 10(2), 141-158.
Tian, L., A. Alizadeh, A. Gentles, and R. Tibshirani (2014). A simple method for detecting interactions between a treatment and a large number of covariates. Journal of the American Statistical Association 109(508), 1517-1532.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 58(1), 267-288.
Varadhan, R., J. B. Segal, C. M. Boyd, A. W. Wu, and C. O. Weiss (2013). A framework for the analysis of heterogeneity of treatment effect in patient-centered outcomes research. Journal of Clinical Epidemiology 66(8), 818-825.
Yi, N., N. Liu, D. Zhi, and J. Li (2011). Hierarchical generalized linear models for multiple groups of rare and common variants: jointly estimating group and individual-variant effects. PLoS Genetics 7(12), e1002382.
Yi, N., S. Xu, X.-Y. Lou, and H. Mallick (2014). Multiple comparisons in genetic association studies: a hierarchical modeling approach. Statistical Applications in Genetics and Molecular Biology 13(1), 35-48.
Zhao, L., L. Tian, T. Cai, B. Claggett, and L.-J. Wei (2013). Effectively selecting a target population for a future comparative study. Journal of the American Statistical Association 108(502), 527-539.
Zhu, J. and J. Xie (2015). Nonparametric variable selection for predictive models and subpopulations in clinical trials. Journal of Biopharmaceutical Statistics 25(4), 781-794.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476), 1418-1429.

CONCLUSIONS

7.1 SUMMARY

In this dissertation, we have presented several Bayesian regularization methods in the context of linear regression. Broadly speaking, the main contribution of this dissertation is the development of a set of novel tools for conducting variable selection in both low-dimensional and high-dimensional problems. These methods have been applied to both simulated and real-world datasets and, as has been shown, they all exhibit better or similar performance relative to existing frequentist and Bayesian approaches.

Another important and particularly interesting contribution is the introduction of the SMU distribution in Bayesian regularization, which has taken center stage in this dissertation. As has been shown, with the SMU distribution we are able to Bayesianise many interesting non-Bayesian problems. Moreover, the SMU distribution provides a more general and unified framework for Bayesian regularization methods, in the sense that for some important problems (e.g., the group bridge), coming up with an elegant data augmentation strategy is much more straightforward using the SMU technique than using the SMN distribution, which has been the state-of-the-art technique for the development of Bayesian regularization methods over the past years. With the results presented in this dissertation, researchers are now well served by having both of these techniques in their statistical repertoire.

In addition to the above, we have theoretically investigated the posterior consistency of one of the proposed methods (viz., the Bayesian bridge), which contributes further theoretical insight to the existing literature on Bayesian regularization methods. We have observed that the proposed methods have good mixing properties, as evident from a wide spectrum of simulation studies and real data analyses, ensuring their applicability to varied real-life problems. Moreover, we have extended these methods to more general models beyond linear regression, providing a flexible framework that accommodates a variety of outcomes (e.g., continuous, binary, count, and time-to-event, among others). It should be noted, however, that although we have extended our proposed methods to general models based on the LSA (Wang and Leng, 2007), it is not clear whether this approximation is valid in n ≤ p situations. Therefore, for high-dimensional models, these approximations should be interpreted with considerable caution.

It is to be noted that Bayesian variable or model selection is a much broader topic than what we have described here, and in practice it includes several data pre-processing steps such as variable transformations, coding of variables, and removal of outliers, among others. Therefore, in real-life applications, the general framework of Bayesian variable and model selection should be applied to these methods to ensure accurate results and easy implementation (Chipman et al., 2001). Also, one should be aware that even though both Bayesian and non-Bayesian regularization methods are essentially estimation methods with the common goal of determining the model parameters, a Bayesian approach can often lead to very different results than a frequentist regularization approach (Hans, 2010). As noted by Polson et al. (2014), this does not, however, imply that one conclusion is right and the other wrong. Rather, it suggests that practitioners are well served by having both at hand. In practice, the choice of a model can be based on the specific aspect of the problem. For instance, in genomic prediction of complex traits, the choice of a model is usually based on the closest fit between the observed and the predicted values of the phenotype (Morota and Gianola, 2014; Mutshinda and Sillanpaa, 2010).

Some passages in this chapter have been adapted from Mallick and Yi (2013).

7.2 FUTURE DIRECTIONS

The work in this dissertation also motivates many exciting directions for future research in statistics and machine learning, both fundamental and practical. We briefly describe them as follows:

7.2.1 SCALABLE VARIATIONAL ALGORITHMS FOR HIGH-DIMENSIONAL VARIABLE SELECTION

Most of the MCMC-based Bayesian regularization methods developed here can become computationally prohibitive for large datasets. Alternatively, one may consider variational Bayes (VB) approximation-based algorithms for Bayesian regularization, which can significantly reduce the computational bottlenecks associated with fully Bayesian approaches. The development of VB estimation was motivated by the fact that, in a full Bayesian analysis, many posterior distributions of interest may be intractable (Li and Sillanpaa, 2012). By using a VB approximation, we aim to find a tractable distribution that approximates the target posterior. Recently, there have been a few developments in the area of variational regularization methods. Armagan (2009) considered a variational approach to bridge regression. Li and Sillanpaa (2012) developed a variational Bayes algorithm for the Bayesian LASSO in the context of quantitative trait loci (QTL) mapping. Both of these approaches are based on the SMN representations of the associated priors. These algorithms can be readily extended to the L2 norm group bridge estimator (Park and Yoon, 2011), which is based on the following regularization problem:

\min_{\beta} \; (y - X\beta)'(y - X\beta) + \sum_{k=1}^{K} \lambda_k \|\beta_k\|_2^{\alpha}.   (7.1)

To mimic the property of the L2 norm group bridge (GB2), we consider the following prior on the groups of coefficients:

\pi(\beta \mid \lambda_1, \ldots, \lambda_K, \alpha) \propto \prod_{k=1}^{K} \exp(-\lambda_k \|\beta_k\|_2^{\alpha}),   (7.2)

which belongs to the family of multivariate generalized Gaussian distributions (Gomez-Sanchez-Manzano et al., 2008; Gomez-Villegas et al., 2011). To solve the problem using a Gibbs sampler, one needs to construct a Markov chain having the joint posterior for β as its stationary distribution. However, a closed-form solution is not possible with a normal likelihood, since the kernel of the prior density loses its attractive quadratic structure. Although this family of distributions can be written as scale mixtures of normals for 0 < α < 2 (Gomez-Sanchez-Manzano et al., 2008; Gomez-Villegas et al., 2011), an explicit mixing distribution is not available for α ≠ 1, which poses difficulty in the Bayesian hierarchical modeling. As a result, it would not be possible to analytically derive the marginal likelihood for each parameter. However, it is possible to derive a VB solution to the above group bridge problem. Despite the unavailability of an explicitly stated mixing distribution, we may be able to exploit the mixture formulation under the VB framework by extracting only the required moments. With this formulation, MAP estimation can be carried out for large datasets without fully exploring the joint posterior distribution. Moreover, we can obtain uncertainty measures for the unknown coefficients, which can be utilized for statistical inference and variable selection. To this end, consider the corresponding group-specific prior

p(\beta_k \mid \lambda_k, \alpha) = \frac{\lambda_k^{m_k/\alpha} \, \Gamma(\frac{m_k}{2} + 1)}{\pi^{m_k/2} \, \Gamma(\frac{m_k}{\alpha} + 1)} \exp\{-\lambda_k \|\beta_k\|_2^{\alpha}\}.   (7.3)

The above family of distributions can be written as scale mixtures of normals (Gomez-Sanchez-Manzano et al., 2008; Gomez-Villegas et al., 2011) as follows:

p(\beta_k \mid \lambda_k, \alpha) = \int_0^{\infty} N(0, \, \tau_k^{-1} \lambda_k^{-2/\alpha} I_{m_k}) \, f(\tau_k) \, d\tau_k,   (7.4)

where N(\mu, \Sigma) denotes the multivariate normal distribution with mean vector \mu and variance-covariance matrix \Sigma, I_{m_k} denotes the identity matrix of dimension m_k, and f(\tau_k) \propto \tau_k^{-m_k/2} q(\tau_k), where q(\tau_k) is the density of a stable distribution. Therefore, neither the prior nor the posterior for \tau_k is known in closed form. Based on the above, we formulate the hierarchical representation as follows:

y \mid X, \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I_n),
\beta \mid \tau_1, \ldots, \tau_K, \lambda_1, \ldots, \lambda_K, \alpha \sim \prod_{k=1}^{K} N_{m_k}(0, \, \tau_k^{-1} \lambda_k^{-2/\alpha} I_{m_k}),
\lambda_1, \ldots, \lambda_K \sim \prod_{k=1}^{K} \mathrm{Gamma}(a, b),   (7.5)
\tau_1, \ldots, \tau_K \mid \lambda_1, \ldots, \lambda_K, \alpha \sim \prod_{k=1}^{K} f(\tau_k),
\sigma^2 \propto 1/\sigma^2.

Using arguments similar to those in Armagan (2009), a computational algorithm can be developed by calculating the moments required to derive the corresponding lower bound. In practice, the lower bound can be calculated at each iteration and used as a criterion for convergence. After initial values are assigned to the parameters of each distribution, an iterative algorithm can be used to update them successively until convergence.

7.2.2 BAYESIAN REGULARIZED MODEL AVERAGING FOR SUBGROUP SELECTION

Most existing exploratory subgroup identification methods in the literature are based on building a single model in the frequentist framework and usually do not take the model uncertainty into account in the estimation procedure. However, when model uncertainty is present, making inferences based on a single model can be dangerous. Alternatively, one can use a set of models to account for this uncertainty. Therefore, for improved inference, a Bayesian model averaging (BMA) strategy can be considered to identify subgroups of patients with non-trivial treatment benefits. In the Bayesian framework, BMA is routinely used for prediction, and it generally provides better predictive performance than a single chosen model (Raftery et al., 1997). Recently, Leng et al. (2014) used this idea to improve upon the prediction performance of the Bayesian LASSO estimator. In contrast to the formal Bayesian treatment of model uncertainty, an ensemble of sparse models can be used for prediction, which can be obtained by sampling the posterior distribution of the tuning parameter. By considering the different sparse conditional mode estimates of the regression coefficients corresponding to the estimated tuning parameters, we can further extend this idea to the problem of subgroup selection in clinical trials.

Following the same notation as in Leng et al. (2014), let \Delta = (X^0, y^0) be a future observation and D_0 = (X, y) be the past data. The posterior predictive distribution of \Delta is given by

p(\Delta \mid D_0) = \int p(\Delta \mid \beta) \, p(\beta \mid \lambda, D_0) \, d\beta \, p(\lambda \mid D_0) \, d\lambda.   (7.6)

It can be shown that the prediction based on p(\Delta \mid D_0) is usually superior to the prediction based on a fixed choice of \lambda, i.e., p(\Delta \mid \lambda = \lambda_0, D_0). The hierarchical models developed here offer a natural way to estimate the predictive distribution (7.6), in which the integral is approximated by the average over the Gibbs samples of \lambda. Thus, the future observation y^0 can be predicted as

E(y^0 \mid D_0) = \int X^0 E(\beta \mid \lambda, D_0) \, p(\lambda \mid D_0) \, d\lambda = X^0 E(\beta \mid D_0),   (7.7)

which can be estimated as \frac{1}{s} \sum_{i=1}^{s} X^0 \hat{\beta}_{\lambda}^{(i)}, where \hat{\beta}_{\lambda}^{(i)} denotes the conditional posterior mode estimate of \beta corresponding to \lambda^{(i)}, and \lambda^{(i)}, i = 1, \ldots, s, denote the MCMC samples drawn from the posterior distribution of \lambda. By considering the models selected by the conditional posterior modes for different draws of \lambda, an ensemble of sparse models for predicting future observations can be obtained. Based on these model-averaged posterior estimates, a parametric scoring system as a function of multiple baseline covariates can be constructed as before, which can then be used to construct a threshold-specific treatment difference curve across a range of score values.
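Given the conditional posterior mode estimates corresponding to the MCMC draws of λ, the model-averaged prediction in (7.7) is just an average of per-draw predictions; the R sketch below illustrates this with placeholder inputs (the matrices B and X0 are simulated and their names are ours).

# B: p x s matrix whose i-th column is the conditional posterior mode estimate of beta
#    given the i-th MCMC draw of lambda; X0: covariate matrix of future subjects.
set.seed(5)
p <- 10; s <- 500; n0 <- 5
B  <- matrix(rnorm(p * s, sd = 0.1), p, s)   # stand-in for the draw-specific sparse estimates
X0 <- matrix(rnorm(n0 * p), n0, p)

pred_per_draw <- X0 %*% B                    # n0 x s matrix of draw-specific predictions
y0_hat <- rowMeans(pred_per_draw)            # model-averaged prediction, estimating (7.7)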

7.2.3 BAYESIAN GROUP VARIABLE SELECTION FOR SEMIPARAMETRIC PROPORTIONAL HAZARDS MODEL FOR HIGH-DIMENSIONAL SURVIVAL DATA

In recent years, there has been a huge influx of biomedical ‘big data’ from various fields such as genomics, proteomics, metabolomics, transcriptomics, metagenomics, and lipidomics (Vivar et al., 2013). The development of -omics technologies makes it possible to scan the entire genome and identify variations that may be associated with a disease or condition (Ma et al., 2011). In microarray gene expression and other high-throughput profiling studies, a major objective is to find a subset of genes that are strongly associated with the survival time related to the development and/or progression of a disease. A popular model for analyzing survival data is the proportional hazards model (Cox, 1972). However, due to the high dimensionality of the predictors in the vast majority of modern datasets, the standard maximum Cox partial likelihood method suffers from serious statistical bottlenecks, including computational infeasibility and model non-identifiability, which lead to serious problems in estimation and prediction. To this end, various regularization methods have been developed for Cox's proportional hazards model (Fan and Li, 2002; Gui and Li, 2005; Tibshirani, 1997; Zhang and Lu, 2007). However, none of these methods is suitable for bi-level variable selection. As a remedy, various regularization approaches with grouped predictors have been considered by some authors (Huang et al., 2014; Wu and Wang, 2013). However, due to the complex structure of the likelihood, only limited developments have been made in the Bayesian framework (Lee et al., 2011). In a recent work, Lee et al. (2011) developed the Bayesian LASSO for Cox's proportional hazards model by implementing a fast MH algorithm with an adaptive jumping rule. We can extend this idea further by considering Bayesian group bridge regularization in order to conduct bi-level variable selection in the high-dimensional Cox proportional hazards model.

On the other hand, in most gene profiling studies, markers identified from the analysis of single datasets often suffer from a lack of reproducibility and insufficient power (Ma et al., 2011). To tackle this shortcoming, various cost-effective data integration methods have been proposed to integrate large -omics data and extract useful information through the analysis of multiple datasets. Recently, Ma et al. (2011) developed an L2 norm group bridge regularization approach for biomarker selection in the integrative analysis of data from multiple heterogeneous studies. It would be worthwhile to develop a similar Bayesian approach for the integrative analysis of multiple heterogeneous high-dimensional datasets.
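As a frequentist point of reference for the regularized Cox fits mentioned above, a minimal glmnet sketch is shown below; the data are simulated and the snippet is not tied to any particular study or to the Bayesian extensions proposed here.

library(glmnet)

set.seed(6)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)
stime  <- rexp(n, rate = exp(0.5 * x[, 1] - 0.5 * x[, 2]))
status <- rbinom(n, 1, 0.7)                        # 1 = event, 0 = censored
y <- cbind(time = stime, status = status)          # two-column response for family = "cox"

# LASSO-penalized Cox partial likelihood; tuning parameter chosen by cross-validation.
cv_cox   <- cv.glmnet(x, y, family = "cox", nfolds = 10)
beta_cox <- coef(cv_cox, s = "lambda.min")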

7.2.4 BAYESIAN ANALYSIS OF OTHER REGULARIZATIONS

In this dissertation, we have mostly focused on developing new Gibbs samplers for various regularization problems because of their simplicity and easy implementation. However, more sophisticated algorithms can improve the results and open up new possibilities. One such example is the logistic Bayesian LASSO (Biswas and Lin, 2012; Biswas et al., 2014), which implements MH algorithms for posterior simulation in the context of logistic regression for the problem of detecting rare haplotypes in case-control genetic association studies. It would be interesting to consider future endeavors based on the Bayesian group bridge prior for solving similar problems. On the other hand, besides their applications in linear regression models, regularization methods have also been studied in quantile regression models, which are becoming increasingly popular as they provide richer information than classical mean regression models (Li, 2010). Recently, Li et al. (2010) developed various Bayesian regularization approaches for quantile regression. It would be worthwhile to undertake similar challenges based on the Bayesian bridge and Bayesian group bridge regularizations in the context of quantile regression. Apart from the methods presented in this dissertation, there are many other non-Bayesian regularization approaches with no concrete Bayesian solution, including SCAD (Fan and Li, 2001), MCP (Zhang, 2010), the reciprocal LASSO (Song and Liang, 2015), and subtle uprooting (Su, 2015), among others. Successful Bayesian adaptation of these methods, using either the SMU or the SMN technique, warrants further research.

7.2.5 GEOMETRIC ERGODICITY OF BAYESIAN REGULARIZATION METHODS

A critical issue for Bayesian regularization methods is how to determine whether the corresponding Markov chain has converged to its desired target distribution. Only when the Markov chain has reached its stationary distribution can the generated samples be used to characterize the corresponding posterior distribution. Therefore, it is often crucial to investigate whether these Markov chains are geometrically ergodic, i.e., whether the distribution of each of these Markov chains converges to the corresponding posterior distribution at a geometric rate (Pal and Khare, 2014). Most MCMC users address this convergence problem by applying diagnostic tools to the output produced by running their samplers (Cowles and Carlin, 1996). However, it is not possible to say with certainty that a finite sample from an MCMC algorithm is representative of the underlying stationary distribution (Cowles and Carlin, 1996). Therefore, theoretically establishing the geometric ergodicity of a Markov chain is important for a meaningful statistical analysis of the corresponding model (Pal and Khare, 2014). Although Bayesian regularization methods have evolved rapidly in the past few years, little attention has been given to the development of theoretical convergence bounds for these methods. Recently, a proof of geometric ergodicity of the Markov chain corresponding to the Bayesian LASSO model (Park and Casella, 2008) was provided by Khare and Hobert (2013). A similar future investigation of the geometric ergodicity of the proposed Gibbs samplers would further strengthen our findings.
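Output-based diagnostics of the kind alluded to above are easy to run in R with the coda package; the sketch below uses a simulated chain purely for illustration.

library(coda)

set.seed(7)
beta1 <- as.numeric(arima.sim(list(ar = 0.9), n = 5000))   # slowly mixing component
beta2 <- rnorm(5000)                                       # well-mixed component
draws <- mcmc(cbind(beta1, beta2))

effectiveSize(draws)   # effective sample size per parameter
geweke.diag(draws)     # Geweke convergence diagnostic (z-scores)
autocorr.diag(draws)   # autocorrelations at selected lags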

7.2.6 POSTERIOR CONSISTENCY OF GROUP PRIORS

Unlike that of frequentist methods, the asymptotic behavior of Bayesian regularization methods is less studied and poorly understood (Liang et al., 2013). Recently, Armagan et al. (2013) provided sufficient conditions on prior concentration for strong posterior consistency of various Bayesian regularization methods, including the Bayesian LASSO (Park and Casella, 2008). In this dissertation, we have extended the results of Armagan et al. (2013) to the Bayesian bridge estimator. More recently, Bhattacharya et al. (2015) studied the prior concentrations of various Bayesian shrinkage priors to investigate whether the entire posterior distributions of these methods concentrate at the optimal rate, i.e., whether the posterior probability assigned to a shrinking neighborhood (proportional to the optimal rate) of the true value of the parameter converges to 1. They argued that most Bayesian regularization methods are sub-optimal in terms of posterior contraction, which is a stronger optimality condition than posterior consistency alone. Although the above-mentioned results have provided a better theoretical understanding of various Bayesian regularization methods, it is not clear how to derive similar results for other important shrinkage priors such as the Bayesian group bridge. As noted by Liang et al. (2013), due to the lack of theoretical justification, Bayesian regularization methods are not easily adopted by subjective Bayesians and frequentists. Therefore, much work in this direction is needed to better understand the asymptotic behavior of these approaches in high-dimensional models.

7.2.7 BAYESIAN SHRINKAGE PRIORS FOR ZERO-INFLATED AND MULTIPLE-INFLATED COUNT DATA

In many biomedical applications such as human microbiome studies, count outcomes occur frequently, and often these count data exhibit a preponderance of inflated zeros (zero-inflated) or multiple inflated values (multiple-inflated). Various zero-inflated (Greene, 1994; Lambert, 1992) and multiple-inflated (Su et al., 2013) count models have been developed in the literature to take into account the added variability associated with the extraneous observations. However, most of these approaches lead to unstable estimation in the presence of a large number of covariates. To tackle high-dimensionality, various regularization approaches have been proposed for these models (Buu et al., 2011; Su et al., 2013; Tang et al., 2014; Wang et al., 2014). These methods appear quite promising, although we speculate that they carry forward all the limitations of frequentist regularization methods discussed earlier. Therefore, another important future direction is to Bayesianise these methods for improved inference and more reliable estimation.

7.2.8 VARIABLE SELECTION USING SHRINKAGE PRIORS

One of the major drawbacks of Bayesian regularization methods is the unavailability of exact zeros in the posterior samples of the coefficients (Li and Pati, 2015). Therefore, unlike frequentist methods, Bayesian regularization methods are unable to achieve automatic variable selection. To tackle this problem, several methods based on thresholding the posterior estimates of the coefficients have been developed in the literature, including the credible interval criterion (Li et al., 2011; Park and Casella, 2008), the scaled neighbourhood criterion (SNC) (Li and Lin, 2010), and the hybrid Bayesian-frequentist strategy (Leng et al., 2014), among others. However, most of these methods are heuristic in nature and lack proper theoretical justification, and statistical inference is usually highly sensitive to the choice of the threshold (Li and Pati, 2015). To this end, Li and Pati (2015) recently proposed a novel method based on post-processing of the posterior samples. Their approach first obtains a posterior distribution of the number of signals by clustering the signal and noise coefficients and then estimates the signals using the posterior median. Chakraborty and Guo (2011) used a similar approach based on 2-means clustering of the posterior samples to achieve variable selection in their Bayesian regularized hybrid Huberized support vector machine model. The above-mentioned automatic procedures could be applied to our findings in future research in order to further validate these methods. Although these methods have shown some promise, they are still relatively new and need to be explored further. Moreover, the literature on variable selection with Bayesian shrinkage priors is still in its infancy, and further research is needed to better understand its implications for large-scale data analysis. Therefore, another important direction for future research is to develop efficient algorithms that automatically select important variables based on the posterior samples of the coefficients.
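As a simple illustration of one such thresholding rule, the sketch below applies the credible interval criterion, selecting a coefficient whenever its 95% equal-tailed credible interval excludes zero; the posterior samples here are simulated, and this rule is only one of the criteria mentioned above.

# samples: S x p matrix of posterior draws of the regression coefficients.
set.seed(8)
S <- 5000
true_beta <- c(1, -0.8, 0, 0, 0.5, 0)
samples <- sapply(true_beta, function(b) rnorm(S, mean = b, sd = 0.2))

ci <- apply(samples, 2, quantile, probs = c(0.025, 0.975))
selected <- ci[1, ] > 0 | ci[2, ] < 0      # credible interval excludes zero
which(selected)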

7.2.9 R AND C++ INTEGRATION FOR BAYESIAN REGULARIZATION METHODS

As previously mentioned, Bayesian regularization methods are computationally very intensive for large-scale problems. Scarcity of fast algorithms and user-friendly softwares are the largest obstacles for the routine application of Bayesian regularization methods. To this end, the Stan Development Team (2014) recently created STAN: a new, high-performance, and open-source software for Bayesian inference based on multi-level models2 . Rather than using the traditional Gibbs samplers, STAN uses Hamiltonian Monte Carlo (HMC) algorithms to speed up calculations. STAN supports truncated and/or censored data, and so can be used to fit survival and reliability models with non-standard (or even user-defined) probability distributions. While STAN has a commandline interface, it is most easily used via the R package interface RStan. By automatically converting models to compiled C++ code, STAN can solve complex models with non-standard distributions quickly. Therefore, another direction for future research includes prolific implementation of the proposed Bayesian regularization methods in STAN for quick estimation and rapid inference. The computation time of Bayesian regularization methods can also be significantly reduced by using efficient programming of low level languages like C++. By virtue of the Rcpp package, key functions can be rewritten in C++, which can be seamlessly integrated with R. Based on this, Polson et al. (2012) 2

[2] The following description of STAN has been extracted from http://blog.revolutionanalytics.com/2012/08/rstan-fast-multilevel-bayesian-modeling-in-r.html

Based on this approach, Polson et al. (2012) recently developed the R package BayesBridge, which implements the algorithms from the Bayesian bridge paper by Polson et al. (2014). In the same spirit, an R package consisting of functions from this dissertation research will be developed and released for public use in the near future by efficiently integrating R and C++.
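To illustrate what such an integration might look like, two minimal sketches are given below. Neither reproduces the exact samplers developed in this dissertation; the model, function, and variable names are chosen for illustration only. The first sketch fits a generic Bayesian lasso-type linear model in STAN through RStan, using a double-exponential (Laplace) prior on the coefficients.

```r
library(rstan)  # assumes the rstan package is installed

stan_code <- "
data {
  int<lower=1> n;
  int<lower=1> p;
  matrix[n, p] X;
  vector[n] y;
}
parameters {
  vector[p] beta;
  real<lower=0> sigma;
  real<lower=0> lambda;
}
model {
  lambda ~ cauchy(0, 1);                           // half-Cauchy via the lower bound
  sigma  ~ cauchy(0, 5);
  beta   ~ double_exponential(0, sigma / lambda);  // Laplace (lasso-type) prior
  y      ~ normal(X * beta, sigma);
}
"

set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(X %*% c(2, -1.5, 0, 0, 0) + rnorm(n))
fit <- stan(model_code = stan_code,
            data = list(n = n, p = p, X = X, y = y),
            chains = 2, iter = 1000)
print(fit, pars = "beta")
```

The second sketch shows how a computational bottleneck (here, simply the cross-product X'X that is recomputed inside many Gibbs samplers) can be rewritten in C++ with Rcpp and called from R.

```r
library(Rcpp)  # assumes the Rcpp package is installed

cppFunction('
NumericMatrix crossprod_cpp(NumericMatrix X) {
  int n = X.nrow(), p = X.ncol();
  NumericMatrix out(p, p);
  for (int j = 0; j < p; j++) {
    for (int k = 0; k <= j; k++) {
      double s = 0.0;
      for (int i = 0; i < n; i++) s += X(i, j) * X(i, k);
      out(j, k) = s;   // fill the lower triangle
      out(k, j) = s;   // and mirror to the upper triangle
    }
  }
  return out;
}')

X <- matrix(rnorm(100 * 10), 100, 10)
all.equal(crossprod_cpp(X), crossprod(X))  # should be TRUE
```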

7.2.10

STATISTICAL PARALLEL COMPUTING FOR BAYESIAN REGULARIZATION METHODS

How to scale up Bayesian computation is an open problem in Bayesian research. Recently, graphics processing units (GPUs) have received a great deal of attention in massively parallel computing (Suchard et al., 2010). A GPU is a highly parallel, multithreaded, many-core processor with enormous computational power and very high memory bandwidth (Paun et al., 2010). GPUs could be used to speed up inference for the proposed Bayesian methods in large-scale problems. Efficient GPU implementations could significantly improve the applicability of Bayesian regularization methods to the analysis of massive datasets.


REFERENCES

Armagan, A. (2009). Variational bridge regression. Journal of Machine Learning Research W&CP 5, 17–24.
Armagan, A., D. B. Dunson, J. Lee, W. Bajwa, and N. Strawn (2013). Posterior consistency in linear models under shrinkage priors. Biometrika 100 (4), 1011–1018.
Bhattacharya, A., D. Pati, N. S. Pillai, and D. B. Dunson (2015). Dirichlet-Laplace priors for optimal shrinkage. Journal of the American Statistical Association, Upcoming.
Biswas, S. and S. Lin (2012). Logistic Bayesian LASSO for identifying association with rare haplotypes and application to age-related macular degeneration. Biometrics 68 (2), 587–597.
Biswas, S., S. Xia, and S. Lin (2014). Detecting rare haplotype-environment interaction with logistic Bayesian LASSO. Genetic Epidemiology 38 (1), 31–41.
Buu, A., N. J. Johnson, R. Li, and X. Tan (2011). New variable selection methods for zero-inflated count data with applications to the substance abuse field. Statistics in Medicine 30 (18), 2326–2340.
Chakraborty, S. and R. Guo (2011). A Bayesian hybrid Huberized support vector machine and its applications in high-dimensional medical data. Computational Statistics & Data Analysis 55 (3), 1342–1356.
Chipman, H., E. I. George, and R. E. McCulloch (2001). The Practical Implementation of Bayesian Model Selection, Volume 38 of Lecture Notes-Monograph Series, pp. 65–116. Beachwood, OH: Institute of Mathematical Statistics.
Cowles, M. K. and B. P. Carlin (1996). Markov Chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association 91 (434), 883–904.
Cox, D. R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 34 (2), 187–220.
Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 (456), 1348–1360.
Fan, J. and R. Li (2002). Variable selection for Cox's proportional hazards model and frailty model. The Annals of Statistics 30 (1), 74–99.
Gomez-Sanchez-Manzano, E., M. A. Gomez-Villegas, and J. M. Marin (2008). Multivariate exponential power distributions as mixtures of normal distributions with Bayesian applications. Communications in Statistics - Theory and Methods 37 (6), 972–985.
Gomez-Villegas, M. A., E. Gomez-Sanchez-Manzano, P. Main, and H. Navarro (2011). The effect of non-normality in the power exponential distributions. In L. Pardo, N. Balakrishnan, and M. Gil (Eds.), Understanding Complex Systems, pp. 119–129. Springer Berlin Heidelberg.
Greene, W. H. (1994). Accounting for excess zeros and sample selection in Poisson and negative binomial regression models. Technical report, New York University.
Gui, J. and H. Li (2005). Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics 21 (13), 3001–3008.
Hans, C. M. (2010). Model uncertainty and variable selection in Bayesian lasso regression. Statistics and Computing 20 (2), 221–229.
Huang, J., L. Liu, Y. Liu, and X. Zhao (2014). Group selection in the Cox model with a diverging number of covariates. Statistica Sinica 24 (4), 1787–1810.
Khare, K. and J. P. Hobert (2013). Geometric ergodicity of the Bayesian lasso. Electronic Journal of Statistics 7, 2150–2163.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34 (1), 1–14.
Lee, K. H., S. Chakraborty, and J. Sun (2011). Bayesian variable selection in semiparametric proportional hazards model for high dimensional survival data. The International Journal of Biostatistics 7 (1), 1–32.
Leng, C., M. Tran, and D. Nott (2014). Bayesian adaptive lasso. Annals of the Institute of Statistical Mathematics 66 (2), 221–244.
Li, H. and D. Pati (2015). Variable selection using shrinkage priors. arXiv preprint arXiv:1503.04303.
Li, J., K. Das, G. Fu, R. Li, and R. Wu (2011). The Bayesian lasso for genome-wide association studies. Bioinformatics 27 (4), 516–523.
Li, Q. (2010). On Bayesian Regression Regularization Methods. Ph.D. thesis, Washington University in St. Louis.
Li, Q. and N. Lin (2010). The Bayesian elastic net. Bayesian Analysis 5 (1), 151–170.
Li, Q., R. Xi, and N. Lin (2010). Bayesian regularized quantile regression. Bayesian Analysis 5 (3), 533–556.
Li, Z. and M. J. Sillanpaa (2012). Estimation of quantitative trait locus effects with epistasis by variational Bayes algorithms. Genetics 190 (1), 231–249.
Liang, F., Q. Song, and K. Yu (2013). Bayesian subset modeling for high-dimensional generalized linear models. Journal of the American Statistical Association 108 (502), 589–606.
Ma, S., J. Huang, and X. Song (2011). Integrative analysis and variable selection with multiple high-dimensional data sets. Biostatistics 12 (4), 763–775.
Mallick, H. and N. Yi (2013). Bayesian methods for high dimensional linear models. Journal of Biometrics & Biostatistics S1, 005.
Morota, G. and D. Gianola (2014). Kernel-based whole-genome prediction of complex traits: a review. Frontiers in Genetics 5, 363.
Mutshinda, C. M. and M. J. Sillanpaa (2010). Extended Bayesian LASSO for multiple quantitative trait loci mapping and unobserved phenotype prediction. Genetics 186 (3), 1067–1075.
Pal, S. and K. Khare (2014). Geometric ergodicity for Bayesian shrinkage models. Electronic Journal of Statistics 8, 604–645.
Park, C. and Y. J. Yoon (2011). Bridge regression: adaptivity and group selection. Journal of Statistical Planning and Inference 141 (11), 3506–3519.
Park, T. and G. Casella (2008). The Bayesian lasso. Journal of the American Statistical Association 103 (482), 681–686.
Paun, G., M. J. Perez-Jimenez, A. Riscos-Nunez, G. Rozenberg, and A. Salomaa (2010). Membrane Computing: 10th International Workshop, WMC 2009, Curtea de Arges, Romania, August 24-27, 2009. Revised Selected and Invited Papers, Volume 5957. Springer Science & Business Media.
Polson, N. G., J. G. Scott, and J. Windle (2012). Package 'BayesBridge'.
Polson, N. G., J. G. Scott, and J. Windle (2014). The Bayesian bridge. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (4), 713–733.
Raftery, A. E., D. Madigan, and J. A. Hoeting (1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association 92 (437), 179–191.
Song, Q. and F. Liang (2015). High dimensional variable selection with reciprocal l1-regularization. Journal of the American Statistical Association, Upcoming.
Stan Development Team (2014). RStan: the R interface to Stan, Version 2.5.0.
Su, X. (2015). Variable selection via subtle uprooting. Journal of Computational and Graphical Statistics, Upcoming.
Su, X., J. Fan, R. Levine, X. Tan, and A. Tripathi (2013). Multiple-inflation Poisson model with l1 regularization. Statistica Sinica 23, 1071–1090.
Suchard, M. A., Q. Wang, C. Chan, J. Frelinger, A. Cron, and M. West (2010). Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures. Journal of Computational and Graphical Statistics 19 (2), 419–438.
Tang, Y., L. Xiang, and Z. Zhu (2014). Risk factor selection in rate making: EM adaptive LASSO for zero-inflated Poisson regression models. Risk Analysis 34 (6), 1112–1127.
Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine 16 (4), 385–395.
Vivar, J. C., P. Pemu, R. McPherson, and S. Ghosh (2013). Redundancy control in pathway databases (ReCiPa): an application for improving gene-set enrichment analysis in omics studies and "Big data" biology. Omics: A Journal of Integrative Biology 17 (8), 414–422.
Wang, H. and C. Leng (2007). Unified LASSO estimation by least squares approximation. Journal of the American Statistical Association 102 (479), 1039–1048.
Wang, Z., S. Ma, C. Y. Wang, M. Zappitelli, P. Devarajan, and C. Parikh (2014). EM for regularized zero-inflated regression models with applications to postoperative morbidity after cardiac surgery in children. Statistics in Medicine 33 (29), 5192–5208.
Wu, T. T. and S. Wang (2013). Doubly regularized Cox regression for high-dimensional survival data with group structures. Statistics and its Interface 6 (2), 175–186.
Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38 (2), 894–942.
Zhang, H. H. and W. Lu (2007). Adaptive lasso for Cox's proportional hazards model. Biometrika 94 (3), 691–703.

APPENDICES

APPENDIX A
APPENDIX FOR CHAPTER 3


A.1 PROOF OF PROPOSITION 1

It is well known that
\[
\int_{z > |x|/\sqrt{\sigma^2}} \lambda e^{-\lambda z}\, dz = e^{-\lambda |x|/\sqrt{\sigma^2}} .
\]
Therefore, the pdf of a Laplace distribution with mean 0 and scale parameter \(\sqrt{\sigma^2}/\lambda\) can be written as
\[
\frac{\lambda}{2\sqrt{\sigma^2}}\, e^{-\lambda |x|/\sqrt{\sigma^2}}
= \frac{\lambda}{2\sqrt{\sigma^2}} \int_{z > |x|/\sqrt{\sigma^2}} \lambda e^{-\lambda z}\, dz
= \int_0^{\infty} \frac{1}{2 z \sqrt{\sigma^2}}\, I\{|x| < z\sqrt{\sigma^2}\}\; \lambda^2 z\, e^{-\lambda z}\, dz ,
\]
that is, a scale mixture of uniform distributions with a Gamma\((2, \lambda)\) mixing density; equivalently, \(\beta_j \mid u_j, \sigma^2 \sim \mathrm{Uniform}(-u_j\sqrt{\sigma^2},\, u_j\sqrt{\sigma^2})\) with \(u_j \sim \mathrm{Gamma}(2, \lambda)\) independently for \(j = 1, \ldots, p\).

Similarly,
\[
\pi(\sigma^2 \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\beta}, \mathbf{u})
\;\propto\; \pi(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2)\, \pi(\boldsymbol{\beta} \mid \mathbf{u}, \sigma^2)\, \pi(\sigma^2)
\;\propto\; \left(\frac{1}{\sigma^2}\right)^{\frac{n-1+p}{2}+1} \exp\!\left\{-\frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\right\} I\!\left\{\sigma^2 > \max_j \left(\frac{\beta_j^2}{u_j^2}\right)\right\} .
\]
Therefore,
\[
\sigma^2 \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\beta}, \mathbf{u} \;\sim\; \text{Inverse Gamma}\!\left(\frac{n-1+p}{2},\; \frac{1}{2}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\right) I\!\left\{\sigma^2 > \max_j\left(\frac{\beta_j^2}{u_j^2}\right)\right\} .
\]
This completes the proof. \(\square\)
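For illustration, the truncated inverse-gamma draw above can be implemented in R by working on the scale of the precision \(1/\sigma^2\), whose untruncated distribution is gamma. The helper function rinvgamma_lt below is a hypothetical name, and the sketch assumes the full conditional stated in the proof.

```r
# Draw sigma^2 from Inverse-Gamma(shape, rate) truncated to sigma^2 > lower,
# using the inverse-CDF method on the gamma-distributed precision 1/sigma^2.
rinvgamma_lt <- function(shape, rate, lower) {
  # sigma^2 > lower  <=>  1/sigma^2 < 1/lower
  p_max <- pgamma(1 / lower, shape = shape, rate = rate)
  u <- runif(1, 0, p_max)
  1 / qgamma(u, shape = shape, rate = rate)
}

# Usage inside the Gibbs sampler (beta, u, y, X assumed available):
# shape  <- (n - 1 + p) / 2
# rate   <- 0.5 * sum((y - X %*% beta)^2)
# lower  <- max(beta^2 / u^2)
# sigma2 <- rinvgamma_lt(shape, rate, lower)
```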

APPENDIX B
APPENDIX FOR CHAPTER 4


B.1 PROOF OF PROPOSITION 2

It is well known that
\[
\int_{z > (|x|/\sqrt{\sigma^2})^{\alpha}} \lambda e^{-\lambda z}\, dz = e^{-\lambda (|x|/\sqrt{\sigma^2})^{\alpha}} .
\]
Therefore, the pdf of a generalized Gaussian (GG) distribution with mean 0, shape parameter \(\alpha\), and scale parameter \(\sqrt{\sigma^2}\,\lambda^{-1/\alpha}\) can be written as
\[
\frac{\lambda^{1/\alpha}}{2\sqrt{\sigma^2}\,\Gamma\!\left(\frac{1}{\alpha}+1\right)}\, e^{-\lambda (|x|/\sqrt{\sigma^2})^{\alpha}}
= \frac{\lambda^{1/\alpha}}{2\sqrt{\sigma^2}\,\Gamma\!\left(\frac{1}{\alpha}+1\right)} \int_{z > (|x|/\sqrt{\sigma^2})^{\alpha}} \lambda e^{-\lambda z}\, dz
= \int_0^{\infty} \frac{1}{2 z^{1/\alpha} \sqrt{\sigma^2}}\, I\{|x| < z^{1/\alpha}\sqrt{\sigma^2}\}\; \frac{\lambda^{\frac{1}{\alpha}+1}\, z^{1/\alpha}\, e^{-\lambda z}}{\Gamma\!\left(\frac{1}{\alpha}+1\right)}\, dz ,
\]
that is, a scale mixture of uniform distributions with a Gamma\((1/\alpha + 1, \lambda)\) mixing density; equivalently, \(\beta_j \mid u_j, \sigma^2 \sim \mathrm{Uniform}(-u_j^{1/\alpha}\sqrt{\sigma^2},\, u_j^{1/\alpha}\sqrt{\sigma^2})\) with \(u_j \sim \mathrm{Gamma}(1/\alpha + 1, \lambda)\) independently for \(j = 1, \ldots, p\).

Similarly,
\[
\pi(\sigma^2 \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\beta}, \mathbf{u})
\;\propto\; \pi(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2)\, \pi(\boldsymbol{\beta} \mid \mathbf{u}, \sigma^2)\, \pi(\sigma^2)
\;\propto\; \left(\frac{1}{\sigma^2}\right)^{\frac{n-1+p}{2}+1} \exp\!\left\{-\frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\right\} I\!\left\{\sigma^2 > \max_j \left(\frac{\beta_j^2}{u_j^{2/\alpha}}\right)\right\} .
\]
Therefore,
\[
\sigma^2 \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\beta}, \mathbf{u} \;\sim\; \text{Inverse Gamma}\!\left(\frac{n-1+p}{2},\; \frac{1}{2}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})\right) I\!\left\{\sigma^2 > \max_j\left(\frac{\beta_j^2}{u_j^{2/\alpha}}\right)\right\} .
\]
This completes the proof. \(\square\)
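As an informal check of the scale-mixture-of-uniforms representation used above, the following R sketch draws from the uniform-gamma mixture (taking \(\sigma^2 = 1\) for simplicity) and overlays the generalized Gaussian density. The variable names are illustrative only.

```r
# Draw x from the mixture: u ~ Gamma(1/alpha + 1, lambda),
# x | u ~ Uniform(-u^(1/alpha), u^(1/alpha))   (sigma^2 = 1)
set.seed(1)
alpha <- 1.5; lambda <- 1; m <- 1e5
u <- rgamma(m, shape = 1 / alpha + 1, rate = lambda)
x <- runif(m, -u^(1 / alpha), u^(1 / alpha))

# Target GG density: f(x) = lambda^(1/alpha) / (2 * Gamma(1/alpha + 1)) * exp(-lambda * |x|^alpha)
gg_density <- function(x) {
  lambda^(1 / alpha) / (2 * gamma(1 / alpha + 1)) * exp(-lambda * abs(x)^alpha)
}

hist(x, breaks = 100, freq = FALSE, main = "SMU draws vs. GG density")
curve(gg_density(x), add = TRUE, lwd = 2)
```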


B.3 PROOF OF THEOREM 2

Based on equation (A3) of Armagan et al. (2013), posterior consistency holds under the GG prior provided that
\[
\Pi_n\!\left(\boldsymbol{\beta}_n : \|\boldsymbol{\beta}_n - \boldsymbol{\beta}_{0n}\| < \frac{\Delta}{n^{\rho/2}}\right) \;\geq\; e^{-dn}
\]
for all \(\Delta > 0\), \(d > 0\), and sufficiently large \(n\). Under the GG prior, the prior probability of the above neighbourhood of \(\boldsymbol{\beta}_{0n}\) can be bounded below so that its negative logarithm is \(< dn\) for all \(d > 0\) and sufficiently large \(n\), and hence the condition is satisfied. Hence the proof. \(\square\)

APPENDIX C
APPENDIX FOR CHAPTER 5


C.1 CALCULATING THE NORMALIZING CONSTANT

From equation (5.6), the group bridge prior is given by
\[
\pi(\boldsymbol{\beta}) = C(\lambda, \alpha) \exp\{-\lambda \|\boldsymbol{\beta}\|_1^{\alpha}\},
\qquad \boldsymbol{\beta} \in \mathbb{R}^q,\; \lambda > 0,\; \alpha > 0 .
\]
Consider the following transformation:
\[
z_i = \frac{\beta_i}{\|\boldsymbol{\beta}\|_1}, \quad i = 1, \ldots, q-1,
\qquad r = \|\boldsymbol{\beta}\|_1 .
\]
The Jacobian of the above transformation is \(|J| = r^{q-1}\). Therefore, the joint pdf of \((r, z_1, z_2, \ldots, z_{q-1})\) is given by
\[
f(r, z_1, z_2, \ldots, z_{q-1}) = C(\lambda, \alpha) \exp\{-\lambda r^{\alpha}\}\, r^{q-1},
\]
where \(-1 < z_i < 1\), \(i = 1, \ldots, q-1\), and \(\sum_{i=1}^{q-1} |z_i| < 1\). Note that the volume of the region \(\{(z_1, \ldots, z_{q-1}) : \sum_{i=1}^{q-1} |z_i| < 1\}\) is \(2^{q-1}/\Gamma(q)\), and accounting for the two possible signs of \(\beta_q\) contributes an additional factor of 2, so that integrating out \((z_1, \ldots, z_{q-1})\) yields the factor \(2^q/\Gamma(q)\). Therefore we have
\[
1 = C(\lambda, \alpha)\, \frac{2^q}{\Gamma(q)} \int_0^{\infty} \exp\{-\lambda r^{\alpha}\}\, r^{q-1}\, dr
\;\Longrightarrow\;
C(\lambda, \alpha) = \frac{\lambda^{q/\alpha}\, \Gamma(q+1)}{2^q\, \Gamma\!\left(\frac{q}{\alpha}+1\right)} .
\]
Hence the proof. \(\square\)

C.2 STOCHASTIC REPRESENTATION

Without loss of generality, we assume that the random vector \(\boldsymbol{\beta}^*\) follows an \(L_1\)-norm spherically symmetric distribution with \(\boldsymbol{\Sigma} = \mathbf{I}_q\) and \(\boldsymbol{\mu} = \mathbf{0}\), where \(\mathbf{I}_q\) is the \(q \times q\) identity matrix and \(\mathbf{0}\) is the \(q\)-dimensional vector with all entries equal to 0, and we write \(\psi\) for its density generator, i.e., the pdf of \(\boldsymbol{\beta}^*\) is \(\psi(\|\boldsymbol{\beta}^*\|_1)\). Following the proof of Lemma 1.4 in Fang et al. (1989), we have
\[
\int_{\mathbb{R}^q} \zeta(\|\boldsymbol{\beta}^*\|_1)\, d\boldsymbol{\beta}^* = \frac{2^q}{\Gamma(q)} \int_0^{\infty} \zeta(u)\, u^{q-1}\, du
\]
for any non-negative measurable function \(\zeta\). Therefore, for any non-negative measurable function \(\eta\), we define \(\zeta = \eta \cdot \psi\) and get
\[
E\big(\eta(\|\boldsymbol{\beta}^*\|_1)\big) = \frac{2^q}{\Gamma(q)} \int_0^{\infty} \eta(u)\, \psi(u)\, u^{q-1}\, du .
\]
Therefore, the density of \(r(\boldsymbol{\beta}^*) = \|\boldsymbol{\beta}^*\|_1\) is
\[
g(r) = \frac{2^q}{\Gamma(q)}\, \psi(r)\, r^{q-1}
\;\Longrightarrow\;
\psi\big(r(\boldsymbol{\beta}^*)\big) = \frac{\Gamma(q)\, g\big(r(\boldsymbol{\beta}^*)\big)}{2^q\, r(\boldsymbol{\beta}^*)^{q-1}} .
\]
Now, consider the transformation \(\boldsymbol{\beta} = \boldsymbol{\mu} + \boldsymbol{\Sigma}^{1/2} \boldsymbol{\beta}^*\). We get
\[
f(\boldsymbol{\beta}) = |\boldsymbol{\Sigma}|^{-1/2}\, \frac{\Gamma(q)\, g\big(r(\boldsymbol{\beta})\big)}{2^q\, r(\boldsymbol{\beta})^{q-1}},
\]
where \(r(\boldsymbol{\beta}) = \|\boldsymbol{\Sigma}^{-1/2}(\boldsymbol{\beta} - \boldsymbol{\mu})\|_1\). \(\square\)


C.3 DIRECTLY SAMPLING FROM THE PRIOR

Based on Corollary 1, \(\boldsymbol{\beta}\) can be decomposed as \(\boldsymbol{\beta} = R\,\mathbf{U}\), where \(\mathbf{U} = \boldsymbol{\beta}/\|\boldsymbol{\beta}\|_1\) follows the multivariate '\(L_1\)-norm sphere' uniform distribution and \(R = \|\boldsymbol{\beta}\|_1\) has the distribution of a Gamma\((q/\alpha, \lambda)\) random variable raised to the power \(1/\alpha\). In particular, when \(\alpha = 1\), \(\boldsymbol{\beta}\) has the joint distribution of \(q\) i.i.d. Laplace random variables; therefore, any sample from the joint distribution of i.i.d. Laplace variables, divided by its \(L_1\) norm, can be regarded as a sample from the multivariate '\(L_1\)-norm sphere' uniform distribution. Therefore, given \(\lambda\) and \(\alpha\), the algorithm for generating observations from the \(q\)-dimensional group bridge prior proceeds as follows:

Step I: Generate an observation \(r^*\) from Gamma\((q/\alpha, \lambda)\).

Step II: Calculate \(r = (r^*)^{1/\alpha}\).

Step III: Generate \(\mathbf{z} = (z_1, \ldots, z_q)'\) from the joint distribution of \(q\) i.i.d. Laplace distributions with location parameter 0 and scale parameter 1 (these parameters can be arbitrarily chosen).

Step IV: Calculate \(\mathbf{u} = \mathbf{z}/\|\mathbf{z}\|_1\); \(\mathbf{u}\) can be regarded as a sample generated from the multivariate '\(L_1\)-norm sphere' uniform distribution.

Step V: \(\mathbf{y} = r\,\mathbf{u}\) is a sample generated from the desired group bridge prior (using Corollary 1).
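A direct R implementation of Steps I-V is sketched below. The function name rgroupbridge and the parameter values are illustrative only; rgamma is parameterized by rate, so that Gamma\((q/\alpha, \lambda)\) has rate \(\lambda\).

```r
# Generate one draw from the q-dimensional group bridge prior with given lambda and alpha
rgroupbridge <- function(q, lambda, alpha) {
  r_star <- rgamma(1, shape = q / alpha, rate = lambda)  # Step I
  r <- r_star^(1 / alpha)                                # Step II
  z <- sample(c(-1, 1), q, replace = TRUE) * rexp(q)     # Step III: i.i.d. standard Laplace
  u <- z / sum(abs(z))                                   # Step IV: uniform on the L1-norm sphere
  r * u                                                  # Step V
}

set.seed(1)
rgroupbridge(q = 3, lambda = 1, alpha = 0.5)  # one draw from the prior
```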


C.4 SCALE MIXTURE OF MULTIVARIATE UNIFORM REPRESENTATION

Consider the \(L_1\)-norm open ball \(A_1 = \{\boldsymbol{\beta} \in \mathbb{R}^q : \|\boldsymbol{\beta}\|_1 < u^{1/\alpha}\}\), \(u > 0\), \(\alpha > 0\). The volume of \(A_1\) is given by
\[
V_q(A_1) = \frac{2^q\, u^{q/\alpha}}{\Gamma(q+1)} .
\]
We have
\[
\pi(\boldsymbol{\beta})
= \frac{\lambda^{q/\alpha}\, \Gamma(q+1)}{2^q\, \Gamma\!\left(\frac{q}{\alpha}+1\right)} \exp\{-\lambda \|\boldsymbol{\beta}\|_1^{\alpha}\}
= \frac{\lambda^{\frac{q}{\alpha}+1}\, \Gamma(q+1)}{2^q\, \Gamma\!\left(\frac{q}{\alpha}+1\right)} \int_{u > \|\boldsymbol{\beta}\|_1^{\alpha}} e^{-\lambda u}\, du
= \int_0^{\infty} \frac{\Gamma(q+1)}{2^q\, u^{q/\alpha}}\, I\{\|\boldsymbol{\beta}\|_1 < u^{1/\alpha}\}\; \frac{\lambda^{\frac{q}{\alpha}+1}\, u^{q/\alpha}\, e^{-\lambda u}}{\Gamma\!\left(\frac{q}{\alpha}+1\right)}\, du ,
\]
that is, \(\boldsymbol{\beta} \mid u \sim\) Multivariate Uniform\((A_1)\) with \(u \sim\) Gamma\((q/\alpha + 1, \lambda)\).

To sample from an \(n\)-dimensional normal distribution \(N(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) truncated to the \(L_1\)-norm ball \(\{\tilde{\mathbf{x}} \in \mathbb{R}^n : \|\tilde{\mathbf{x}}\|_1 < \rho\}\), one can use a Gibbs sampler in which, at iteration \(t\), each component is drawn from a truncated univariate normal (TUVN) distribution,
\[
\tilde{x}_i^{(t)} \mid \tilde{\mathbf{x}}_{-i}^{(t)} \sim N\big(\mu_i^*(t), \sigma_i^{*2}\big) \text{ truncated to } (-L_i, L_i), \quad L_i > 0, \; i = 1, 2, \ldots, n .
\]
The expressions \(\mu_i^*(t)\), \(\sigma_i^{*2}\), and \(L_i\) are given as follows:
\[
\mu_i^*(t) = \mu_i + \mathbf{s}_i' \boldsymbol{\Sigma}_{-i}^{-1} \big(\tilde{\mathbf{x}}_{-i}^{(t)} - \boldsymbol{\mu}_{-i}\big),
\qquad
\sigma_i^{*2} = \sigma_{i,i} - \mathbf{s}_i' \boldsymbol{\Sigma}_{-i}^{-1} \mathbf{s}_i,
\qquad
L_i = \rho - \sum_{k \neq i} \big|\tilde{x}_k^{(t)}\big| ,
\]
where \(\boldsymbol{\mu}_{-i} = (\mu_1, \ldots, \mu_{i-1}, \mu_{i+1}, \ldots, \mu_n)'\), \(\mu_i\) is the \(i\)th element of \(\boldsymbol{\mu}\), \(\sigma_{i,i}\) is the \(i\)th diagonal element of \(\boldsymbol{\Sigma}\), \(\mathbf{s}_i\) is the \((n-1) \times 1\) vector obtained from the \(i\)th column of \(\boldsymbol{\Sigma}\) by removing the \(i\)th row, and \(\boldsymbol{\Sigma}_{-i}\) denotes the \((n-1) \times (n-1)\) matrix obtained from \(\boldsymbol{\Sigma}\) by removing the \(i\)th row and \(i\)th column. The sampling according to the above TUVN distributions can be easily done by implementing the efficient algorithm proposed by Li and Ghosh (2015).
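For illustration, a simple inverse-CDF sketch for drawing from a univariate normal truncated to \((-L, L)\) is given below; it is not the algorithm of Li and Ghosh (2015), which is preferable when the truncation interval lies far in the tails. The function name rtuvn is hypothetical.

```r
# Draw from N(mu, sd^2) truncated to the interval (-L, L), via the inverse CDF.
# Numerically fragile when the interval lies far in the tails of the normal;
# the Li and Ghosh (2015) algorithm is preferred in that case.
rtuvn <- function(mu, sd, L) {
  lo <- pnorm(-L, mean = mu, sd = sd)
  hi <- pnorm( L, mean = mu, sd = sd)
  qnorm(runif(1, lo, hi), mean = mu, sd = sd)
}

set.seed(1)
rtuvn(mu = 0.3, sd = 1, L = 0.8)  # one draw constrained to (-0.8, 0.8)
```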

III. SAMPLING COEFFICIENTS IN THE BAYESIAN GROUP BRIDGE REGRESSION

For the BAGB, it is easy to note that the truncation region in (5.11) is group-wise independent. Therefore, at each step we sample from the group-wise conditional distribution (using the algorithm described above), conditional on the other parameters, i.e.,
\[
\boldsymbol{\beta}_k \mid \boldsymbol{\beta}_{-k}, \mathbf{u}, \alpha, \sigma^2, \mathbf{y}, \mathbf{X} \;\sim\; N_{\Omega_k}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\]
where \(N_{\Omega_k}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\) is the \(m_k\)-dimensional TMVN distribution defined on the open ball \(\Omega_k(u_k)\) with mean vector \(\boldsymbol{\mu}_k\) and covariance matrix \(\boldsymbol{\Sigma}_k\), which are easily obtained by partitioning the corresponding full variance-covariance matrix described in (5.11). Here
\[
\Omega_k(u_k) = \{\boldsymbol{\beta}_k \in \mathbb{R}^{m_k} : \|\boldsymbol{\beta}_k\|_1 < u_k^{1/\alpha}\},
\qquad
\boldsymbol{\beta}_{-k} = (\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_{k-1}, \boldsymbol{\beta}_{k+1}, \ldots, \boldsymbol{\beta}_K), \quad k = 1, \ldots, K .
\]

C.6 MULTIVARIATE GENERALIZED GAUSSIAN DISTRIBUTION

Here we derive various properties of the GB2 prior in light of its connection to the multivariate generalized Gaussian (MGG) distribution. First, we derive the normalizing constant for the GB2 prior. Then, we derive an SMU representation of the GB2 prior that leads to the corresponding MCMC algorithm described in Section 5.6.2.

RESULT 1: A random vector \(\mathbf{y}_{q \times 1} = (y_1, y_2, \ldots, y_q)'\) follows a multivariate generalized Gaussian distribution with pdf \(g(\mathbf{y}) = C(\lambda, \alpha) \exp\{-\lambda \|\mathbf{y}\|_2^{\alpha}\}\), where
\[
C(\lambda, \alpha) = \frac{\lambda^{q/\alpha}\, \Gamma\!\left(\frac{q}{2}+1\right)}{\pi^{q/2}\, \Gamma\!\left(\frac{q}{\alpha}+1\right)},
\]
\(\mathbf{y} \in \mathbb{R}^q\), \(\lambda > 0\), and \(\alpha > 0\).

RESULT 2: An MGG distribution can be written as a scale mixture of multivariate uniform (SMU) distributions, the mixing density being a particular gamma distribution, i.e., \(\mathbf{y} \mid u \sim\) Multivariate Uniform\((A_2)\), where \(A_2 = \{\mathbf{y} \in \mathbb{R}^q : \|\mathbf{y}\|_2^{\alpha} < u\}\), \(u > 0\), and \(u \sim\) Gamma\((q/\alpha + 1, \lambda)\).

PROOF OF RESULT 1

Consider the following polar transformation:
\[
y_1 = r \cos\theta_1, \quad
y_2 = r \sin\theta_1 \cos\theta_2, \quad \ldots, \quad
y_{q-1} = r \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{q-2} \cos\theta_{q-1}, \quad
y_q = r \sin\theta_1 \sin\theta_2 \cdots \sin\theta_{q-1} .
\]
Then \(\|\mathbf{y}\|_2^2 = r^2\). The Jacobian of the above transformation is
\[
|J| = r^{q-1} \sin^{q-2}(\theta_1)\, \sin^{q-3}(\theta_2) \cdots \sin(\theta_{q-2}) .
\]
The joint pdf of \((r, \theta_1, \theta_2, \ldots, \theta_{q-1})\) is given by
\[
f(r, \theta_1, \theta_2, \ldots, \theta_{q-1}) = C(\lambda, \alpha) \exp\{-\lambda r^{\alpha}\}\, r^{q-1} \sin^{q-2}(\theta_1)\, \sin^{q-3}(\theta_2) \cdots \sin(\theta_{q-2}),
\]
where \(0 \leq \theta_1, \ldots, \theta_{q-2} \leq \pi\) and \(0 \leq \theta_{q-1} \leq 2\pi\). Note that
\[
\int_0^{2\pi} \int_0^{\pi} \cdots \int_0^{\pi} \sin^{q-2}(\theta_1)\, \sin^{q-3}(\theta_2) \cdots \sin(\theta_{q-2})\, d\theta_1 \cdots d\theta_{q-1} = \frac{2\pi^{q/2}}{\Gamma\!\left(\frac{q}{2}\right)} .
\]
Therefore we have
\[
1 = C(\lambda, \alpha)\, \frac{2\pi^{q/2}}{\Gamma\!\left(\frac{q}{2}\right)} \int_0^{\infty} \exp\{-\lambda r^{\alpha}\}\, r^{q-1}\, dr
\;\Longrightarrow\;
C(\lambda, \alpha) = \frac{\lambda^{q/\alpha}\, \Gamma\!\left(\frac{q}{2}+1\right)}{\pi^{q/2}\, \Gamma\!\left(\frac{q}{\alpha}+1\right)} . \;\square
\]

PROOF OF RESULT 2

Consider the \(L_2\)-norm open ball \(A_2 = \{\mathbf{y} \in \mathbb{R}^q : \|\mathbf{y}\|_2 < u^{1/\alpha}\}\), \(u > 0\), \(\alpha > 0\). The volume of \(A_2\) is given by
\[
V_q(A_2) = \frac{\pi^{q/2}\, u^{q/\alpha}}{\Gamma\!\left(\frac{q}{2}+1\right)} .
\]
Therefore, we have
\[
g(\mathbf{y})
= \frac{\lambda^{q/\alpha}\, \Gamma\!\left(\frac{q}{2}+1\right)}{\pi^{q/2}\, \Gamma\!\left(\frac{q}{\alpha}+1\right)} \exp\{-\lambda \|\mathbf{y}\|_2^{\alpha}\}
= \frac{\lambda^{\frac{q}{\alpha}+1}\, \Gamma\!\left(\frac{q}{2}+1\right)}{\pi^{q/2}\, \Gamma\!\left(\frac{q}{\alpha}+1\right)} \int_{u > \|\mathbf{y}\|_2^{\alpha}} e^{-\lambda u}\, du
= \int_0^{\infty} \frac{\Gamma\!\left(\frac{q}{2}+1\right)}{\pi^{q/2}\, u^{q/\alpha}}\, I\{\|\mathbf{y}\|_2 < u^{1/\alpha}\}\; \frac{\lambda^{\frac{q}{\alpha}+1}\, u^{q/\alpha}\, e^{-\lambda u}}{\Gamma\!\left(\frac{q}{\alpha}+1\right)}\, du ,
\]
that is, \(\mathbf{y} \mid u \sim\) Multivariate Uniform\((A_2)\) with \(u \sim\) Gamma\((q/\alpha + 1, \lambda)\). This completes the proof. \(\square\)
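Result 2 suggests a simple two-step sampler for the MGG distribution, sketched in R below: draw u from the gamma mixing density, then draw a point uniformly inside the L2 ball of radius u^(1/alpha) (a uniform direction scaled by a radius proportional to the q-th root of a uniform variate). The function name rmgg is illustrative only.

```r
# Draw one observation from the q-dimensional MGG distribution using Result 2.
rmgg <- function(q, lambda, alpha) {
  u <- rgamma(1, shape = q / alpha + 1, rate = lambda)  # mixing density
  radius <- u^(1 / alpha) * runif(1)^(1 / q)            # uniform radius inside the L2 ball
  direction <- rnorm(q)
  direction <- direction / sqrt(sum(direction^2))       # uniform direction on the unit sphere
  radius * direction
}

set.seed(1)
rmgg(q = 4, lambda = 1, alpha = 1.5)
```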
