Bayesian Nonparametric Modeling and Theory for

0 downloads 0 Views 6MB Size Report
Alan E. Gelfand ... Bayesian density estimation as well as function estimation are well-justified in ..... dom variables taking values in a complete separable metric space X ... ǫ ą 0, there exists a δ ă ǫ{4, c1,c2 ą 0, β ă ǫ2{8 and subsets Fn Ă F such ... Gaussian if the random variable b˚W is normally distributed for any element ...
Bayesian Nonparametric Modeling and Theory for Complex Data by

Debdeep Pati Department of Statistical Science Duke University Date:

Approved:

David B. Dunson, Advisor

Alan E. Gelfand

Surya T. Tokdar

Lawrence Carin

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Statistical Science in the Graduate School of Duke University 2012

Abstract Bayesian Nonparametric Modeling and Theory for Complex Data by

Debdeep Pati Department of Statistical Science Duke University Date:

Approved:

David B. Dunson, Advisor

Alan E. Gelfand

Surya T. Tokdar

Lawrence Carin

An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Statistical Science in the Graduate School of Duke University 2012

c 2012 by Debdeep Pati Copyright All rights reserved except the rights granted by the Creative Commons Attribution-Noncommercial Licence

Abstract The dissertation focuses on solving some important theoretical and methodological problems associated with Bayesian modeling of infinite dimensional ‘objects’, popularly called nonparametric Bayes. The term ‘infinite dimensional object’ can refer to a density, a conditional density, a regression surface or even a manifold. Although Bayesian density estimation as well as function estimation are well-justified in the existing literature, there has been little or no theory justifying the estimation of more complex objects (e.g. conditional density, manifold, etc.). Part of this dissertation focuses on exploring the structure of the spaces on which the priors for conditional densities and manifolds are supported while studying how the posterior concentrates as increasing amounts of data are collected. With the advent of new acquisition devices, there has been a need to model complex objects associated with complex data-types e.g. millions of genes affecting a bio-marker, 2D pixelated images, a cloud of points in the 3D space, etc. A significant portion of this dissertation has been devoted to developing adaptive nonparametric Bayes approaches for learning low-dimensional structures underlying higher-dimensional objects e.g. a high-dimensional regression function supported on a lower dimensional space, closed curves representing the boundaries of shapes in 2D images and closed surfaces located on or near the point cloud data. Characterizing the distribution of these objects has a tremendous impact in several application areas ranging from tumor tracking for targeted radiation therapy, to classifying cells in the

iv

brain, to model based methods for 3D animation and so on. The first three chapters are devoted to Bayesian nonparametric theory and modeling in unconstrained Euclidean spaces e.g. mean regression and density regression, the next two focus on Bayesian modeling of manifolds e.g. closed curves and surfaces, and the final one on nonparametric Bayes spatial point pattern data modeling when the sampling locations are informative of the outcomes.

v

To my family

vi

Contents Abstract

iv

List of Tables

xiii

List of Figures

xiv

List of Abbreviations and Symbols

xv

Acknowledgements

xvii

1 Introduction

1

1.1

Review of posterior consistency and convergence rate . . . . . . . . .

2

1.2

Review of Gaussian processes . . . . . . . . . . . . . . . . . . . . . .

6

1.3

Research Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

1.4

Our contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

12

2 Adaptive dimension reduction with a Gaussian process prior

17

2.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

17

2.2

Specific notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

2.3

Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

2.3.1

Adaptive estimation of anisotropic functions . . . . . . . . . .

29

2.3.2

Adaptive dimension reduction . . . . . . . . . . . . . . . . . .

30

2.3.3

Connections between cases (i) and (ii) . . . . . . . . . . . . .

32

2.3.4

Rates of convergence in specific settings

. . . . . . . . . . . .

33

Properties of the multi-bandwidth Gaussian process . . . . . . . . . .

35

2.4

vii

2.5

Proof of main results . . . . . . . . . . . . . . . . . . . . . . . . . . .

46

2.5.1

Proof of Theorem 10 . . . . . . . . . . . . . . . . . . . . . . .

46

2.5.2

Proof of Theorem 12 . . . . . . . . . . . . . . . . . . . . . . .

50

2.6

Lower bounds on posterior contraction rates . . . . . . . . . . . . . .

51

2.7

Main result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

60

3 Bayesian nonparametric regression with varying residual density

64

3.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

64

3.2

Nonparametric regression modeling . . . . . . . . . . . . . . . . . . .

69

3.2.1

Data Structure and Model . . . . . . . . . . . . . . . . . . . .

69

3.2.2

Prior on the Mean Regression Function . . . . . . . . . . . . .

70

3.2.3

Priors for Residual Distribution . . . . . . . . . . . . . . . . .

71

3.3

Consistency properties . . . . . . . . . . . . . . . . . . . . . . . . . .

76

3.4

Posterior Computation . . . . . . . . . . . . . . . . . . . . . . . . . .

80

3.4.1

Gaussian process regression with t residuals . . . . . . . . . .

81

3.4.2

Heteroscedastic PSB mixture of normals . . . . . . . . . . . .

82

3.4.3

Heteroscedastic sPSB process location-scale mixture . . . . . .

84

3.5

Measures of Influence . . . . . . . . . . . . . . . . . . . . . . . . . . .

86

3.6

Simulation studies

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

87

3.7

Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

93

3.7.1

Boston housing data Application . . . . . . . . . . . . . . . .

93

3.7.2

Body fat data application . . . . . . . . . . . . . . . . . . . .

93

Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

94

3.8

4 Posterior consistency in conditional density estimation

96

4.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.2

Conditional density estimation . . . . . . . . . . . . . . . . . . . . . . 100

viii

96

4.2.1

Predictor dependent mixtures of Gaussian linear regressions . 101

4.2.2

Gaussian mixtures of fixed-π dependent processes . . . . . . . 102

4.3

Notions of neighborhoods in conditional density estimation . . . . . . 102

4.4

Posterior consistency in MGLRx mixture of Gaussians

4.5

4.6

. . . . . . . . 104

4.4.1

Kullback-Leibler property . . . . . . . . . . . . . . . . . . . . 104

4.4.2

Strong Consistency with the q-integrated L1 neighborhood . . 108

Posterior consistency in mixtures of fixed-π dependent processes . . . 114 4.5.1

Kullback-Leibler property . . . . . . . . . . . . . . . . . . . . 114

4.5.2

Strong consistency with the q-integrated L1 neighborhood . . 114

Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

5 Bayesian shape modeling with closed curves

118

5.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

5.2

Shape-generating random process . . . . . . . . . . . . . . . . . . . . 121

5.3

5.2.1

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

5.2.2

Roth curve

5.2.3

Deforming a Roth curve . . . . . . . . . . . . . . . . . . . . . 122

5.2.4

Vector notation . . . . . . . . . . . . . . . . . . . . . . . . . . 125

5.2.5

Shape-generating Random Process . . . . . . . . . . . . . . . 126

. . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

Properties of the Prior . . . . . . . . . . . . . . . . . . . . . . . . . . 130 5.3.1

Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

5.3.2

Influence of the control points . . . . . . . . . . . . . . . . . . 132

5.4

Inference from Point Cloud Data . . . . . . . . . . . . . . . . . . . . 132

5.5

Inference from Pixelated Image Data . . . . . . . . . . . . . . . . . . 135 5.5.1

5.6

Modeling surface orientation . . . . . . . . . . . . . . . . . . . 136

Fitting a collection of curves . . . . . . . . . . . . . . . . . . . . . . . 136

ix

5.7

Posterior computation . . . . . . . . . . . . . . . . . . . . . . . . . . 138 5.7.1

Conditional posteriors for mk and dprq,k . . . . . . . . . . . . . 139

5.7.2

Derivation of the approximate deformation-orienting matrix . 141

5.7.3

Conditional posteriors for µr and Σr

5.7.4

Gibbs updates for the parameterizations and orientation . . . 143

5.7.5

Likelihood contribution from surface-normals . . . . . . . . . . 143

. . . . . . . . . . . . . . 142

5.8

Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

5.9

Brain tumor segmentation study . . . . . . . . . . . . . . . . . . . . . 147

5.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 6 Bayesian modeling of closed surfaces through tensor products

151

6.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

6.2

Outline of the method . . . . . . . . . . . . . . . . . . . . . . . . . . 155

6.3

6.4

6.5

6.2.1

Review of Terminology . . . . . . . . . . . . . . . . . . . . . . 155

6.2.2

Choice of the parameterization . . . . . . . . . . . . . . . . . 156

6.2.3

Closed surface model . . . . . . . . . . . . . . . . . . . . . . . 157

6.2.4

Construction of the cyclic basis . . . . . . . . . . . . . . . . . 158

6.2.5

Model for the control points . . . . . . . . . . . . . . . . . . . 159

6.2.6

Prior realizations . . . . . . . . . . . . . . . . . . . . . . . . . 160

Support of the prior and posterior convergence rates . . . . . . . . . . 161 6.3.1

Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

6.3.2

Rate of convergence of the posterior . . . . . . . . . . . . . . . 163

Posterior computation . . . . . . . . . . . . . . . . . . . . . . . . . . 164 6.4.1

Gibbs sampler for a fixed truncation level

. . . . . . . . . . . 164

6.4.2

Posterior sampling of n and m . . . . . . . . . . . . . . . . . . 165

Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

x

6.6

Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

7 Bayesian geostatistical modeling with informative sampling

174

7.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

7.2

Model for spatial data with informative sampling . . . . . . . . . . . 175

7.3

Theoretical properties . . . . . . . . . . . . . . . . . . . . . . . . . . 177 7.3.1

Weak posterior consistency . . . . . . . . . . . . . . . . . . . . 177

7.3.2

Posterior propriety of a . . . . . . . . . . . . . . . . . . . . . . 178

7.4

Computational details . . . . . . . . . . . . . . . . . . . . . . . . . . 178

7.5

Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

7.6

Analysis of Eastern United States ozone data

7.7

Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

8 Future works

. . . . . . . . . . . . . 183

188

8.1

Latent variable density regression models . . . . . . . . . . . . . . . . 188

8.2

Nonparametric variable selection

8.3

Bayesian shape modeling . . . . . . . . . . . . . . . . . . . . . . . . . 191

8.4

Spatial point patterns

8.5

Robust Bayesian model based clustering . . . . . . . . . . . . . . . . 192

8.6

Other directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

. . . . . . . . . . . . . . . . . . . . 190

. . . . . . . . . . . . . . . . . . . . . . . . . . 191

A Proofs of some results in Chapter 4

195

A.1 Proof of Lemma 46 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 A.2 A useful lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 A.3 Proof of Theorem 47 . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 A.4 Proof of Theorem 44 . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 A.5 Proof of Theorem 51 . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 A.6 Another useful lemma . . . . . . . . . . . . . . . . . . . . . . . . . . 206

xi

A.7 Proof of Theorem 57 . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 A.8 Proof of Theorem 58 . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 B Proofs of some results in Chapter 3

214

B.1 Proof of Lemma 34 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 B.2 Proof of Lemma 36 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 B.3 Proof of Theorem thm:ghoshal . . . . . . . . . . . . . . . . . . . . . . 216 C Proofs of some results in Chapter 5

223

C.1 Proof of Lemma 63: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 C.2 Proof of Theorem 64: . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 D Proofs of some results in Chapter 6

225

D.1 Proofs of Lemma 71 . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 D.2 Proof of Theorem 72 . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 D.3 Proof of Theorem 74 . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 E Proofs of some results in Chapter 7

233

E.1 Proof of Theorem 77 . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 E.2 Proof of Theorem 78 . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 Bibliography

240

Biography

255

xii

List of Tables 3.1

Simulation results for cases (i) and (ii) . . . . . . . . . . . . . . . . .

89

3.2

Simulation results for cases (iii) and (iv) . . . . . . . . . . . . . . . .

90

3.3

Simulation results for case (v) . . . . . . . . . . . . . . . . . . . . . .

91

3.4

Boston housing data and body fat data results . . . . . . . . . . . . .

91

6.1

Hausdorff distance between true and fitted surface . . . . . . . . . . . 169

6.2

Posterior summaries of σ and n, m . . . . . . . . . . . . . . . . . . . 169

7.1

Simulation study results . . . . . . . . . . . . . . . . . . . . . . . . . 182

7.2

Mean and 95% intervals for the ozone data . . . . . . . . . . . . . . . 186

xiii

List of Figures 2.1

Union of rectangles Crθ “ t0 ď a ď r θ u . . . . . . . . . . . . . . . . .

5.1

Deformation of a Roth curve . . . . . . . . . . . . . . . . . . . . . . . 123

5.2

An illustration of the shape generation process . . . . . . . . . . . . . 127

5.3

Random samples from the shape-generating process . . . . . . . . . . 129

5.4

Influence of the control points . . . . . . . . . . . . . . . . . . . . . . 132

5.5

Borrowing of information . . . . . . . . . . . . . . . . . . . . . . . . . 146

5.6

Brain tumor application . . . . . . . . . . . . . . . . . . . . . . . . . 148

6.1

Surface fitted to a pelvic girdle point cloud . . . . . . . . . . . . . . . 151

6.2

Output triangulation from crust algorithm on a point cloud . . . . . . 154

6.3

Parameterization of the human skull and the Beethoven . . . . . . . . 157

6.4

Control points for different closed surfaces . . . . . . . . . . . . . . . 160

6.5

Prior realizations with increasing n and m . . . . . . . . . . . . . . . 161

6.6

Crust on a sparse non-noisy point cloud

6.7

Tensor-product surface on a sparse non-noisy point cloud . . . . . . . 171

6.8

Crust on a sparse noisy point cloud . . . . . . . . . . . . . . . . . . . 172

6.9

Tensor-product surface on a sparse noisy point cloud . . . . . . . . . 173

7.1

Plots of the ozone data . . . . . . . . . . . . . . . . . . . . . . . . . . 184

7.2

Posterior mean predicted values of ozone . . . . . . . . . . . . . . . . 185

xiv

43

. . . . . . . . . . . . . . . . 170

List of Abbreviations and Symbols Symbols These general symbols apply to all the chapters in this dissertation. R{ℜ

set of real numbers.

N

set of natural numbers.

Z

set of integers.

λ

Lebesgue measure on ℜ or ℜp .

IB

indicator of a set B.

Lp pνq

the space of measurable functions with ν-integrable pth absolute power.

CpX q

the set of continuous functions on X .

C α pX q

|| ¨ ||8 , || ¨ ||1 , || ¨ ||p,ν Bpx0 , r; dq suptdpx, yq : x, y P Mu Npǫ, T, dq

log Npǫ, T, dq ű

the H¨older space of order α, consisting of the functions f P CpX q that have tαu continuous derivatives with the tαuth derivative f tαu being Lipshitz continuous of order α ´ tαu. supremum, L1 and norm of Lp pνq. A ball of radius r with centre x0 relative to the metric d. The diameter of a bounded metric space M relative to a metric d. covering number of a semi-metric space T relative to the semimetric d is the minimal number of balls of radius ǫ needed to cover T . ǫ-entropy of the space T with respect to d. complex line integral. xv

À tap1q , ap2q , . . . , apnq u δ0 supppνq

inequality up to a constant multiple order statistics of the set tai : ai P R, i “ 1, . . . , nu. distribution degenerate at 0 support of a measure ν.

Abbreviations w.r.t RKHS

with respect to Reproductive Kernel Hilbert Space

xvi

Acknowledgements First, I would like to thank my advisor Dr. David B. Dunson for his profound ingenuity and indomitable enthusiasm in nurturing within me the roots of Bayesian nonparametrics. Being trained mostly in Mathematical Statistics, I came to Duke with almost no knowledge in stochastic modeling. It was him who inculcated within me the necessity to learn stochastic modeling to be an all-rounded statistician. By his tremendous work ethics and upfront attitude he has set an example in front of me to be a dynamic and honest academician. I also thank my committee members Alan Gelfand, Surya Tokdar and Lawrence Carin. I admire Alan’s remarkable ability to come up with thought provoking questions - they helped me understand my work better. I would particularly like to thank Surya-da for his mentorship in both academics and beyond. Larry is a source of extraordinary energy, his group meetings provided a platform to vent out any random ideas we have. It was a privilege to work with Brian Reich at NC State - I enjoyed numerous technical conversations with him. Kelvin Gu is a wonderful friend and brilliant colleague, I thank him for our enjoyable discussions ranging from the light hearted ones to the most grueling sessions. My sincere thanks to Alan Gelfand, Jim Berger, Merlise Clyde, Scott Schmidler, Robert Wolpert, and Fan Li - who taught me various courses at Duke and provided helpful feedbacks regarding the work here. Natesh Pillai is a brilliant mentor, I thank him for organizing a wonderful visit to Harvard - it was a productive as well

xvii

as fun-filled experience. I would like to take this opportunity to thank the friendly and helpful staff Karen, Lance, Anne, Nikki and Tameka - I admire your painstaking effort to help me out with every possible solutions. I am obliged to the National Science Foundation, National Institutes of Health, the International Biometric Society, the International Society for Bayesian Analysis, and the American Statistical Association for financing my graduate education and conference travel. It has been my privilege to meet some of the brightest and jubilant young friends while living in Durham. To all of my friends at Duke, it wouldn’t be such an invigorating experience without you. I am especially fortunate to have Anirban as my friend, who has been a tremendous source of encouragement all along this journey right from the undergraduate days - be it during darkest and the happiest of times to long drawn technical conversations or numerous fun sessions. It was hard to imagine this journey without him, I thank him for his guidance and support in every sphere of life. To Avishek, Chiranjit and Anjishnu, life in the United States wouldn’t be so easy without you - you will continue to be my constant source of inspiration in the years to come. Finally, to my parents and my wonderful sister, needless to put an acknowledgement in words, but thank you for all your sacrifices, your dedication and your love. Without you, it would not even be possible to see this day, to you I dedicate this thesis.

xviii

1 Introduction

During the last decade there has been an immense development in Bayesian nonparametric modeling and theory related to density and function estimation. Bayesian modeling offers a full probabilistic framework for inference while a nonparametric approach adds the necessary flexibility. With recent technological developments and advent of huge amount of data acquisition devices, there is a clear need for modeling and theory beyond traditional density estimation and unconstrained function estimation. Estimation of a conditional density is one such area where there has been a significant amount of new methodologies developed during the last decade without being substantiated by suitable theory. Also existing methods for Bayesian conditional density estimation are black-box, and lack computational simplicity and interpretability. This motivates developing new classes of flexible, yet computationally tractable and interpretable prior distributions. The development of next-generation imaging devices enable the scientists to penetrate the deepest parts of anatomical organs producing huge amounts of biologically rich images and point cloud data. This requires fundamentally new approaches to model the complicated shapes and manifolds. While there has been significant devel1

opments for manifold estimation in computer science, most of these methods follow a sequence of multistage procedures involving noise reduction, outlier removal and subsequent manifold estimation. Multistage procedures fail to capture the uncertainty in estimation which is crucial to very sensitive analyses like brain tumor detection. Hence there is a necessity for a fully Bayesian model based approach to manifold estimation which has received little or no attention over the past decade. A coherent Bayesian framework also has a huge possibility of being embedded into a population level analysis with immediate applications to tumor tracking for targeting radiation therapy, modeling a 3-dimensional forest of morphologically diverse neurons changing over time in response to chemical and electrical signals and so on. One of the key interests of this dissertation is understanding and evaluating infinite-dimensional Bayesian procedures from a frequentist perspective. The objective is to ensure that as we get more and more samples from the true ‘parameter’, the posterior distribution concentrates in an arbitrarily small neighborhood of the true ‘parameter’, a phenomenon popularly known as posterior consistency. In the following we provide rigorous definitions of posterior consistency and posterior convergence rate.

1.1 Review of posterior consistency and convergence rate Assume Y1 , . . . , Yn , . . . is a sequence of independent and identically distributed random variables taking values in a complete separable metric space X endowed with the Borel σ-algebra of subsets BpX q with a common density f0 P F , where F is a space of densities f : X Ñ R` . The complete separable space F is also endowed with the Borel σ-algebra of subsets BpF q. It is convenient to think of Y1 , Y2 , . . . as the coordinate random variables defined on Ω the space pX 8 , BpX q8 q and f 8 as the i.i.d. product density defined on Ω. We will denote by Ωn “ pX n , BpX qn q and by f n the n-fold product of f . We will also abbreviate pY1 , Y2 , ..., Yn q by Y n when 2

convenient. A Bayesian nonparametric approach to infer f0 involves placing a prior distribution or a probability measure Πn on pF , BpF qq and computing the posterior distribution of a Borel set B P BpF q according to the Bayes’ rule as ş śn f pYi qdΠn pf q Πn pB | Y n q “ şB śi“1 . n i“1 f pYi qdΠn pf q F

(1.1)

We must ensure that the expressions in (1.1) are well defined. In particular, we assume that the map py, f q ÞÑ f pyq is measurable for the product σ-field on X ˆ F . To define various notions of posterior consistency, we need different notions of neighborhood which we shall formulate below. Definition 1. Wǫ pf0 q is said to be a sub-basic weak neighborhood of f0 if " Wǫ pf0 q “ f P F

ˇ ˇż * ż ˇ ˇ ˇ ˇ : ˇ φf ´ φf0 ˇ ă ǫ

(1.2)

for a bounded continuous function φ : X Ñ R.

Definition 2. Sǫ pf0 q is said to be a strong or L1 neighborhood of f0 if Sǫ pf0 q “ tf P F : }f ´ f0 }1 ă ǫu.

(1.3)

Definition 3. KLǫ pf0 q is said to be a Kullback-Leibler neighborhood of f0 if "

KLǫ pf0 q “ f P F :

ż

* f0 log f0 {f ă ǫ .

(1.4)

Definition 4. A posterior is said to be weakly or strongly consistent at f0 if for any ǫą0 Πn pU | Y n q Ñ 1 rf0 s a.s. where U “ Wǫ pf0 q or Sǫ pf0 q respectively. 3

(1.5)

Clearly strong posterior consistency implies weak posterior consistency whereas the reverse implication is not necessarily true. A special case when this is true is when X is a countable set. (1.5) is only an asymptotic evaluation of the performance of the posterior and one might question the usefulness of such asymptotic justification in actual finite sample examples. There are several reasons which make the study of posterior consistency more interesting in nonparametric Bayesian models. First, posterior consistency is violated more often in the infinite-dimensional models than their parametric counterparts (Diaconis and Freedman, 1986) due to the inability of the data to explain infinitely many parameters. Second, it turns out that posterior consistency is intimately related to the prior flexibility as well as model identifiability which are itself quite interesting to study. Schwartz (1965a) demonstrated that the two key phenomena governing posterior consistency are i) whether the prior has large support on the target space, i.e., realizations from the prior can approximate a variety of objects and ii) the ability of the model space to distinguish two parameters with respect to the topology concerned. The second aspect is often manifested in the form of the complexity of the model space. While having a large prior support enhances the ability to approximate a wider range of truth, it also decreases the chance to concentrate near a particular true density given enough samples. Hence there is always a trade-off between (i) and (ii) and a fine balance is required to achieve consistency. Ghosal et al. (1999) provided two sets of sufficient conditions for strong and weak consistency. For simplicity assume Πn ” Π. Before discussing the sufficient conditions, a notion of prior support is quite important. Definition 5. f0 is said to be in the Kullback-Leibler support of the prior Π (write as f0 P KLpΠq) if ΠpKLǫ pf0 qq ą 0 for any ǫ ą 0. Theorem 6.

1. The posterior is weakly consistent at f0 if f0 P KLpΠq for any 4

ǫ ą 0. 2. The posterior is strongly consistent at f0 with f0 satisfying (1) and for every ǫ ą 0, there exists a δ ă ǫ{4, c1 , c2 ą 0, β ă ǫ2 {8 and subsets Fn Ă F such that (a) ΠpFnc q ă c1 expt´nc2 u and (b) log Npδ, Fn , }¨}1 q ă nβ. The complexity of the model space is measured by the sequence Fn , also referred to as the sieve. Clearly, the weak topology doesn’t require limiting the complexity for achieving consistency, having a large prior support alone suffices. Although the idea of consistency is useful, quantifying the speed of convergence of the posterior is necessary to determine the number of samples required to obtain a desired accuracy upto constants. The speed or the rate of convergence is defined in terms of the smallest shrinking neighborhood around the truth that still contains all the posterior mass asymptotically. It is important to note here that determining the rate of convergence alone cannot satisfactorily foretell exactly the sample size required to obtain the desired accuracy as the constant is quite hard to determine accurately in most cases. Definition 7. A posterior Πn is said to concentrate around f0 with rate at least ǫn in the L1 topology if Πn pSM ǫn pf0 q | Y n q Ñ 1

rf0 s a.s.

(1.6)

for some large constant M ą 0. An accurate calculation of prior concentration and calibration of the model space (Ghosal et al., 2000; Ghosal and van der Vaart, 2001) allow us to compute posterior convergence rates. 5

Theorem 8. Suppose that for a sequence ǫn with ǫn Ñ 0 and nǫ2n Ñ 8, a constant C and sets Fn Ă F , we have

log Npǫn , Fn , }¨}n q ď nǫ2n ,

Πn pFnc q ď expt´nǫ2n pC ` 4qu, ˆ ˙ ż ż 2 2 2 Πn f : ´ f0 log f {f0 ď ǫn , f0 plog f {f0 q ď ǫn ě

(1.7) (1.8) expp´nǫ2 Cq,

(1.9)

then there exists a constant M ą 0 such that Πn pSM ǫn pf0 q | Y n q Ñ 1

rf0 s a.s.

The form of the condition (1.9) can be motivated from entropy considerations. Suppose, we wish to satisfy (1.9) for the minimal ǫn satisfying (1.9) with Fn “ F , i.e., for the optimal rate of convergence for the model. Furthermore, for the sake of argument, assume that all the distances are equivalent. Then a minimal ǫn -cover of F consists of exptnǫ2n u balls. If the prior Πn would spread its mass uniformly over F , then every ball would obtain mass approximately expt´Cnǫ2n u. Hence a rough implication of the conditions of Theorem 8 is that Πn should spread its mass uniformly in order for the posterior to attain the optimal rate of convergence. Ghosal et al. (2000); Ghosal and van der Vaart (2001) also provided several modifications of Theorem 8.

1.2 Review of Gaussian processes Another key to this dissertation is the use of conditionally Gaussian processes which is a powerful tool for function estimation in general. A Borel measurable random element W with values in a separable Banach space pB, }¨}q (e.g., Cr0, 1s) is called Gaussian if the random variable b˚ W is normally distributed for any element b˚ P B˚ , the dual space of B. Note that under this general definition f ptq “

J ÿ

j“1

cj Bj ptq, cj „ Np0, σj2 q, j “ 1, . . . , J, 6

(1.10)

for a fixed set of basis functions tB1 , . . . , BJ u, is also Gaussian process conditional on J and tσj2 , j “ 1, . . . , Ju. Such a representation covers a wide range of models and is flexible enough to approximate a smooth function for sufficiently large J and thus can be suitably used for function estimation. To characterize the support of a Gaussian process, one must study the reproducing kernel Hilbert space (RKHS) associated with a Gaussian process. The RKHS H attached to a zero-mean Gaussian process W taking values in a Banach space B is defined as the completion of the linear space of functions t ÞÑ EW ptqH relative to the inner product xEW p¨qH1 ; EW p¨qH2 yH “ EH1 H2 , where H, H1 and H2 are finite linear combinations of the form

ř

i

ai W psi q with

ai P R and si in the index set of W . It is a well-known fact the support of a Gaussian process is the closure of the RKHS in B. The RKHS plays an important role in determining the concentration properties of the process around a smooth function. Refer to van der Vaart and van Zanten (2008b) for further details.

1.3 Research Problems Our first problem is related to variable selection in Bayesian nonparametric regression. Theoretical study of variable selection in infinite-dimensional models using sparsity favoring priors is an important area of modern research which differs significantly from variable selection in finite-dimensional models. More specifically, if the true regression function is actually lower dimensional, the frequentist minimax rate of estimating the regression function remains unaltered in a parametric model but can be improved by a significant margin in infinite dimensional models (Barron et al., 1999a; Kerkyacharian et al., 2001; Hoffmann and Lepski, 2002; Klutchnikoff, 2005).

7

Consider the non-parametric mean regression model, yi “ µpxi q ` ǫi , xi P r0, 1sd, ǫi „ Np0, σ 2 q,

(1.11)

In a Bayesian context, one would place a Gaussian process prior for µ and a hyper prior on the bandwidth parameter and model-average across different values of the bandwidth through the posterior distribution. The parameter a in the squaredexponential covariance kernel expp´a||s ´ t||2 q plays the role of a scaling or inverse bandwidth. van der Vaart and van Zanten (2009) showed that with a gamma prior on ad , one obtains the minimax rate of posterior contraction n´α{p2α`dq up to a logarithmic factor for α-smooth functions adaptively over all α ą 0. Even with moderate number of dimensions, the assumption of the true function being in an isotropic smoothness class characterized by a single smoothness parameter seems restrictive. Practitioners often use a non-homogeneous variant of the squared exř ponential covariance kernel above given by Cps, tq “ expp´ dj“1 aj |sj ´ tj |2 q. A separate scaling variable aj for the different dimensions incorporates dimension spe-

cific effects in the covariance kernel, intuitively enabling better approximation of functions in anisotropic smoothness classes. In particular, one can let a subset of the covariates to drop out from the covariance kernel by setting some of the scales aj to zero. Such a model was recently studied in Savitsky et al. (2011); Zou et al. (2010), who used a point mass mixture prior on the bandwidth. Although this is an attractive scheme for anisotropic modeling and dimension reduction in non-parametric regression problems with empirical support, there hasn’t been any theoretical study on this class of models in a Bayesian framework to our knowledge. In particular, there is an open question whether the rate of posterior contraction can be improved when the true regression function is supported on a lower dimensional space. We want to develop a fully adaptive Bayesian nonparametric procedure that achieves this. 8

Our second problem is regarding robust Bayesian inference on the mean regression function allowing the residual density to change flexibly with predictors. To simplify inferences and prior elicitation, it is appealing to separate the mean regression function from the residual distribution in the specification, which is accomplished by only a few density regression methods. The general framework of separately modeling the mean function and residual distribution nonparametrically was introduced by Griffin and Steel (2010). They allow the residual distribution to change flexibly with predictors using the order-based Dirichlet process Griffin and Steel (2006b). On the other hand, we want to develop a computationally simpler specification with straightforward prior elicitation. Moreover, existing theory on Gaussian process regression Choi and Schervish (2007a); Choi (2009) ensures consistently estimating the regression function assuming parametric error distribution. One of the key theoretical problems lies in generalizing their theory to the case of nonparametric error distribution. Although there has been a well developed literature in studying posterior consistency and convergence rates in nonparametric Bayes density estimation and mean regression models, there has been a dearth of such results in density regression models, particularly since the post Dirichlet process regime has seen the development of numerous predictor-dependent random measures for modeling conditional distributions e.g., the class of dependent random processes (MacEachern, 1999; De Iorio et al., 2004; Griffin and Steel, 2006b; Rodriguez and Dunson, 2011a). Our third research problem is providing a general theoretical framework for characterizing the support of priors for conditional distributions. In doing so, a fundamental technical problem lies in calibrating a large space of conditional densities. It has been noted by Wu and Ghosal (2010) that the usual method of constructing a sieve by controlling prior probabilities is unable to lead to a consistency theorem in the multivariate case. This is because of the explosion of the entropy (adn ) 9

ş of t Npy; µ, IdqdP pµq : P pp´an , an sd q ą 1 ´ δu with increasing dimension. They

developed a technique specific to the Dirichlet process in the multivariate case for

showing weak and strong posterior consistency. We would like to develop technique for constructing a sieve for high-dimensions suited to general mixture models. We next turn our attention to methodological questions related to Bayesian manifold estimation. With technological advancements, there has been a need to model complex objects associated with complex data-types, e.g. 2D pixelated images, a cloud of points in the 3D space, etc. Traditional function estimation methods seem to be inadequate for this purpose. Boundaries of objects are widely studied across many disciplines, such as biomedical imaging, cytology and computer vision. In describing complex boundaries, one can use a parametric curve (2D) or surface (3D), i.e. Cptq : D1 Ñ R2 or Cptq : D2 Ñ R3 respectively, where D1 Ă R and D2 Ă R2 . Note that this is different from a typical function estimation problem because the independent variable, t, is unknown. Moreover, the curve must be closed to produce a valid boundary. In many applications featuring low-contrast images or sparse and noisy point clouds, there is insufficient data to recover local segments of the boundary in isolation. Thus, it also becomes critical to model the boundary’s global shape. Multiple related objects may share shape similarities that can be leveraged for improved inference of boundaries. However, to the best of our knowledge, there are few curve models which incorporate detailed shape information. Lastly, very few works have considered integrating both curve fitting and shape analysis. We want to develop a model based approach for characterizing a population of 2D closed curves representing the complex boundary of 2D shapes in pixelated images. Our next intention is to develop models for closed surfaces from point cloud data because of their usefulness to represent a variety of 3d shapes including human bones, anatomical organs which are often encountered in practice. In many applications such as in establishing the target for linac-based radiation therapy, it is necessary 10

to characterize the uncertainty in the 3D tumor contour to compute an expanded contour, called a planning target volume (PTV). This PTV is used as the target to which the full radiation dose is delivered. Naturally larger margin expansions increase the likelihood that the tumor is treated effectively, but also increase the dose delivered to healthy normal tissues. Bayesian approaches are ideal in characterizing this uncertainty and calibrating the PTV. The existing literature on closed surface modeling focuses on frequentist point estimation methods that join surface patches along the edges leading to heavy geometric constraints. One of the main motivations is to develop a model for a closed surface which avoids the need for any constraints. This can improve mixing of the MCMC employed for inferring the posterior surface and facilitate interpretation of the coefficients. To the best of our knowledge, there hasn’t been any model-based work on fitting a closed parametric surface to a sparse and noisy point cloud data, particularly from a Bayesian point of view. Also studying the structure of the constrained Euclidean spaces on which the prior distribution for a closed surface is supported is particularly challenging as current Bayesian theory mostly focus on unconstrained surface estimation. Our next question is related to a fundamental problem in spatial point pattern data modeling. Geostatistical models focus on inferring a continuous spatial process based on data observed at finitely many locations, with the locations typically assumed to be noninformative. As noted by Diggle et al. (2010), this assumption is commonly violated for point-referenced spatial data, as it is not unusual to collect data at locations thought to have a large or small value for the outcome. For example, in monitoring of air pollution, one may place more monitors at locations believed to have a high value of ozone or another pollutant, while in studying distribution of animal species one may systematically look in locations thought to commonly contain the species of interest. Diggle et al. (2010) proposed a shared latent process model to adjust for bias due to informative sampling locations. Their analysis 11

was implemented using a Monte Carlo approach for maximum likelihood estimation. However it is not clear whether the data contain information about the informativeness of the sampling locations, and one may wonder to what extent the prior is driving the results even in large samples. We would like to develop a Bayesian model for informative sampling and address these theoretical concerns.

1.4 Our contribution In Chapter 2, we focus on nonparametric mean regression problem involving multiple predictors where the interest is in estimating the multivariate regression surface in the important predictors while discarding the unimportant ones. Our focus is on defining a Bayesian procedure that leads to the minimax optimal rate of posterior contraction (up to a log factor) adapting to the unknown dimension and anisotropic smoothness of the true surface. We propose such an approach based on a Gaussian process prior with dimension-specific scalings, which are assigned carefully-chosen hyperpriors. We obtained fully Bayesian frameworks for the following scenarios. 1. Adaptive estimation over Holder smooth functions that can possibly depend on fewer coordinates and have isotropic smoothness over the remaining coordinates: Consider a joint prior on pa1 , a2 , . . . , ad q induced through the following hierarchical scheme: (i) draw d˜ according to some prior distribution

˜ draw a subset S of size d˜ from (with full support) on t1, . . . , du, (ii) given d, t1, . . . , du following some prior distribution assigning positive prior probability `˘ ˜ (iii) generate a pair of random variables pa, bq with to all dd˜ subsets of size d, ˜

ad „ gamma and b drawn from any compactly supported density, and finally, (iv) let aj “ a for j P S and aj “ b for j R S.

2. Adaptive estimation over anisotropic Holder functions of d arguments: We propose a joint prior on the bandwidths pa1 , a2 , . . . , ad q induced through the 12

following hierarchical specification. Let Θ “ pΘ1 , . . . , Θd q denote a random vector with a density supported on the simplex Sd´1 . Given Θ “ θ, we let the 1{θj

elements of pa1 , . . . , ad q be conditionally independent, with aj

„ g, where g

is a gamma density. This is a novel generalization of the previous case. We also demonstrated the necessity of using multiple bandwidths by proving that the optimal prior choice in the isotropic case leads to a sub-optimal convergence rate if the true function depends on fewer coordinates. In Chapter 3, we consider the problem of robust Bayesian inference on the mean regression function allowing the residual density to change flexibly with predictors. To accomplish this, we propose to place a Gaussian process prior on the regression function and to allow the residual density to be unknown through a probit stick-breaking (PSB) process mixture. Here, we propose four novel variants of PSB mixtures for the residual distribution. The first uses a scale mixture of Gaussians to obtain a prior with large support on unimodal symmetric distributions. The next is based on a symmetrised location-scale PSB (sPSB) mixture, which is more flexible in avoiding the unimodality constraint, while constraining the residual density to be symmetric and have mean zero. In addition, we show that this prior leads to strong posterior consistency in estimating the regression function under weak conditions. To allow the residual density to change flexibly with predictors, we generalize the above priors through incorporating probit transformations of Gaussian processes in the weights. In Chapter 4, defining various topologies on the space of conditional distributions, we provide sufficient conditions for posterior consistency focusing on a broad class of priors formulated as predictor-dependent mixtures of Gaussian kernels. This theory is illustrated by showing that the conditions are satisfied for a class of generalized stick-breaking process mixtures in which the stick-breaking lengths are monotone, 13

differentiable functions of a continuous stochastic process. We also provide a set of sufficient conditions for the case where stick-breaking lengths are predictor independent, such as those arising from a fixed Dirichlet process prior. A key technical contribution of this article is the development of a novel method of constructing a sieve, suited particularly to multivariate and predictor dependent mixture priors. We developed a technique suited to general mixture models based on marginalizing out the random measure P and calibrating the space of infinite mixture models by a finite mixture with increasing number of components subject to a tail condition as follows. Let Fan “ #ż

Npy; µ, 1qdP pµq “

8 ÿ

h“1

πh φpy ´ µh , Id q : ||µh || ď an , h “ 1, . . . , mn ,

8 ÿ

h“mn `1

+

πh ă ǫ .

for an , mn increasing. The proposed sieve alleviates the curse of dimensionality (Wu and Ghosal, 2010) and can be used to show consistency in a large variety of mixture models for multivariate density estimation (Tokdar, 2011b) and can be generalized to accommodate predictor dependent mixture models for conditional density estimation (Pati et al., 2012).

Our sieve construction has also

opened up the possibility of studying posterior consistency and convergence rates in sparse multivariate mixtures of factor analyzers (McLachlan and Peel, 2000; Canale and Dunson, 2011) and probability tensor decomposed models for categorical data analysis (Bhattacharya and Dunson, 2011). In Chapter 5, we proposed a Bayesian hierarchical model for boundaries of 2D shapes contained in pixelated images. The model is based on a novel multiscale deformation process. By relating multiple objects through a hierarchical formulation, we can successfully recover missing boundaries by borrowing shape information from similar objects at the appropriate scale. Furthermore, the models latent parameters help interpret the population, indicating dimensions of significant shape variability 14

and also specifying a central curve that summarizes the collection. Often we have information about surface normals, position of the nucleus and other information in addition to pixel locations in these images. We incorporate these information to obtain a better fit without any compromise in computational efficiency. Theoretical properties of our prior are studied in specific cases and efficient Markov chain Monte Carlo methods are developed, evaluated through simulation examples and applied to a brain tumor contour detection problem. In Chapter 6, we develop a Bayesian model for closed surfaces based on tensor products of a cyclic basis resulting in infinitely smooth surface realizations avoiding heavy geometric constraints required to join the surface patches. Theoretical properties of the support of our proposed prior are studied and it is shown that the posterior achieves the optimal rate of convergence under reasonable assumptions on the prior. Chapter 6 laid the foundation for hierarchical modeling of multiple shapes, both 2d and 3d. It allows the possibility of multitask learning, incorporation of covariates and do hypothesis testing and dynamic models for closed surfaces. To address the question associated with preferential sampling in point-pattern data modeling, we follow a Bayesian approach in which the locations are modeled using a log Gaussian Cox process (Møller et al., 2001), with the intensity function included as a spatially-varying predictor in the outcome model, which also includes spatial random effects drawn from a Gaussian process. We incorporate a sampling bias term a which causes a tendency to take more observations at spatial locations having relatively high outcome values. We empirically showed that a joint model of the responses and the sampling locations can lead to improved prediction and characterization of uncertainty. To our knowledge, we are the first to develop a Bayesian approach to the informative locations problem in geostatistical modeling. A major contribution is studying the theoretical properties of the model. We address this concern by proving that the posterior is proper under a noninformative prior on 15

a. In addition, one can consistently estimate a, the density of the sampling locations and the mean function of the outcome process. Ongoing work focuses on quantifying the increase in the amount of information in the sampling bias term with sample size by studying the posterior convergence rates of the parameter a. In Chapter 8, we outline future directions related to the current threads on density regression models and hierarchical modeling of objects.

16

2 Adaptive dimension reduction with a Gaussian process prior

2.1 Introduction Non-parametric function estimation methods have been immensely popular due to their ability to adapt to a wide variety of function classes with unknown regularities. In Bayesian nonparametrics, Gaussian processes (Rasmussen, 2004; van der Vaart and van Zanten, 2008b) are widely used as priors on functions due to tractable posterior computation and attractive theoretical properties. The law of a mean zero Gaussian process Wt is entirely characterized by its covariance kernel cps, tq “ EpWs Wt q.

A squared exponential covariance kernel given by

cps, tq “ expp´a }s ´ t}2 q is commonly used in the literature. It is well established (Stone, 1982) that given n independent observations, the optimal rate of estimation of a d-variable function that is only known to be α-smooth is n´α{p2α`dq . The quality of estimation thus improves with increasing smoothness of the “true” function while it deteriorates with increase in dimensionality. In practice, the smoothness α is typically unknown and one would thus like to have a unified

17

estimation procedure that automatically adapts to all possible smoothness levels of the true function. Accordingly, a lot of effort has been employed to develop adaptive estimation methods that are rate-optimal for every regularity level of the unknown function. The literature on adaptive estimation in a minimax setting was initiated by Lepski in a series of papers (Lepski, 1990, 1991, 1992); see also Birg´e (2001) for a discussion on this topic. We also refer the reader to Hoffmann and Lepski (2002), which contains an extensive list of developments in the frequentist literature on adaptive estimation. There is a growing literature on Bayesian adaptation over the last decade. Previous works include Belitser and Ghosal (2003); Ghosal et al. (2003, 2008); Huang (2004); Rousseau (2010); Kruijer et al. (2010); De Jonge and van Zanten (2010); Shen and Ghosal (2011). A key idea in frequentist adaptive estimation is to narrow down the search for an “optimal” estimator within a class of estimators indexed by a smoothness or bandwidth parameter, and make a data-driven choice to select the proper bandwidth. In a Bayesian context, one would place a prior on the bandwidth parameter and model-average across different values of the bandwidth through the posterior distribution. The parameter a in the squared-exponential covariance kernel c plays the role of a scaling or inverse bandwidth. van der Vaart and van Zanten (2009) showed that with a gamma prior on ad , one obtains the minimax rate of posterior contraction n´α{p2α`dq up to a logarithmic factor for α-smooth functions adaptively over all α ą 0. In multivariate problems involving even moderate number of dimensions, the assumption of the true function being in an isotropic smoothness class characterized by a single smoothness parameter seems restrictive. Practitioners often use a non-homogeneous variant of the squared exponential covariance kernel given by ř cps, tq “ expp´ dj“1 aj |sj ´ tj |2 q. A separate scaling variable aj for the different di18

mensions incorporates dimension specific effects in the covariance kernel, intuitively enabling better approximation of functions in anisotropic smoothness classes. In particular, one can let a subset of the covariates drop out of the covariance kernel by setting some of the scales aj to zero. Such a model was recently studied in Savitsky et al. (2011), who used a point mass mixture prior on ρj “ ´ log aj P r0, 1s. Zou et al. (2010) also used a similar model for high-dimensional non-parametric variable selection. Although this is an attractive scheme for anisotropic modeling and dimension reduction in non-parametric regression problems with encouraging empirical performance, there hasn’t been any theoretical studies of asymptotic properties in related models in a Bayesian framework. In the frequentist literature, minimax rates of convergence in anisotropic Sobolev, Besov and H¨older spaces have been studied in Ibragimov and Khasminski (1981); Nussbaum (1985); Birg´e (1986), with adaptive estimation procedures developed in Barron et al. (1999a); Kerkyacharian et al. (2001); Hoffmann and Lepski (2002); Klutchnikoff (2005) among others. The traditional way of dealing with anisotropy is to employ a separate bandwidth or scaling parameter for the different dimensions, and choose an optimal combination of scales in a data-driven way. However, the multidimensional nature of the problem makes the optimal bandwidth selection difficult compared to the isotropic case, as there is no natural ordering among the estimators with multiple bandwidths (Lepski and Levit, 1999). It is known (Hoffmann and Lepski, 2002) that the minimax rate of convergence for a function with smoothness αi along the ith dimension is given by n´α0 {p2α0 `1q , ř where α0´1 “ di“1 αi´1 is an exponent of global smoothness (Birg´e, 1986). When αi “ α for all i “ 1, . . . , d, one reduces back to the optimal rate for isotropic classes.

On the contrary, if the true function belongs to an anisotropic class, the assumption of isotropy would lead to loss of efficiency which would be more and more accentuated in higher dimensions. In addition, if the true function depends on a subset of 19

coordinates I “ ti1 , . . . , id0 u Ă t1, . . . , du for some 1 ď d0 ď d, the minimax rate ř ´1 would further improve to n´α0I {p2α0I `1q , with α0I “ jPI αj´1 . The objective of this chapter is to study whether one can fully adapt to this larger

class of functions in a Bayesian framework using dimension specific rescalings of a homogenous Gaussian process, referred to as a multi-bandwidth Gaussian process from now on. We answer the question in the affirmative and develop a class of priors which lead to the optimal rate n´α0I {p2α0I `1q of posterior contraction (up to a log term) for any α and I without prior knowledge of either of them. The general sufficient conditions for obtaining posterior rates of convergence (Ghosal et al., 2000) involve finding a sequence of compact and increasing subsets of the parameter space, usually referred to as sieves, which are “not to large” in the sense of metric entropy and yet capture most of the prior mass. van der Vaart and van Zanten (2008a) developed a general technique for constructing such sieves with Gaussian process priors, which involved subtle manipulations of the reproducing kernel Hilbert space (RKHS) of a Gaussian process (van der Vaart and van Zanten, 2008b).

A key technical advancement in

van der Vaart and van Zanten (2009) was to extend the above theoretical framework to the setting of conditionally Gaussian random fields. In particular, they exploited a containment relation among the unit RKHS balls with different bandwidths to construct the sieves Bn in their framework. Their construction can be conceptually related to the general framework for adaptive estimation developed in Lepski (1990, 1991, 1992), where a natural ordering among kernel estimators with different scalar bandwidths is utilized to compare different estimators and balance the bias-variance trade-off. However, it gets significantly more complicated in situations involving multiple bandwidths to compare kernel estimators with different vectors of bandwidths. In multi-bandwidth Gaussian processes, a similar problem arises in comparing unit RKHS balls of Gaussian processes with different vectors of bandwidths, and the tech20

niques of van der Vaart and van Zanten (2009) cannot be immediately extended to obtain adaptive posterior contraction rates in this case. Our main contribution is to address the above issue by a novel prior specification on the vector of bandwidths and a careful construction of the sieves Bn , which can be used to establish rate adaptiveness of the posterior distribution in a variety of settings involving a multi-bandwidth Gaussian process. For simplicity of exposition, we initially study the problem in two parts: (i) adaptive estimation over anisotropic H¨older functions of d arguments, and (ii) adaptive estimation over functions that can possibly depend on fewer coordinates and have isotropic H¨older smoothness over the remaining coordinates. In each of these cases, we propose a joint prior on the bandwidths induced through a hierarchical Bayesian framework. To avoid the problem of comparing between different vectors of scales, we aggregate over a collection of bandwidth vectors to construct the sets Bn . New results are developed to bound the metric entropy of such collections of unit RKHS balls. Combining these results, we balance the metric entropy of the sieve and the prior probability of its complement. The prior specifications for the two cases above are easy to interpret intuitively and can be easily connected to prescribe a unified prior leading to adaptivity over (i) and (ii) combined. In particular, our proposed prior has interesting connections to a class of multiplicity adjusting priors previously studied by Scott and Berger (2010) in a linear model context. Although our prior specification involving dimension-specific bandwidth parameters leads to adaptivity, a stronger result is required to conclude that a single bandwidth would be inadequate for the above classes of functions. We prove that the optimal prior choice in the isotropic case leads to a sub-optimal convergence rate if the true function depends on fewer coordinates by obtaining a lower bound on the posterior contraction rate. The general sufficient conditions for rates of posterior contraction provide an upper bound on the rate of convergence implying that the 21

posterior contracts at least as fast as the rate obtained. Castillo (2008) studied lower bounds for posterior contraction rate with a class of Gaussian process priors. We extend the results of Castillo (2008) to the setting of rescaled Gaussian process priors. We develop a technique for deriving a sharp lower bound to the concentration function of a rescaled Gaussian process, which can be used for comparing the posterior convergence rates obtained for different prior distributions on the bandwidth parameter. The remaining chapter is organized as follows. In Section 2.2, we introduce relevant notations. Section 2.3 discusses the main developments with applications to anisotropic Gaussian process mean regression and logistic Gaussian process density estimation described in subsection 2.3.4. In Section 2.4, we study various properties of multi-bandwidth Gaussian processes which are crucially used in the proofs of the main theorems in Section 2.5 and should also be of independent interest. Section 2.6 establishes the necessity of the multi-bandwidth Gaussian process (GP) by showing that a single rescaling can lead to sub-optimal rates when the true function is lowerdimensional.

2.2 Specific notations To keep the notation clean, we shall only use boldface for a, b and α to denote vectors. We shall make frequent use of the following multi-index notations. For vectors a, b ś ś ř ¯ “ maxj aj , a “ minj aj , a.{b “ P Rd , let a. “ dj“1 aj , a˚ “ dj“1 aj , a! “ dj“1 aj !, a ś b pa1 {b1 , . . . , ad {bd qT , a ¨ b “ pa1 b1 , . . . , ad bd qT , ab “ dj“1 ajj . Denote a ď b if aj ď bj

for all j “ 1, . . . , d. For n “ pn1 , . . . , nd q, let D n f denote the mixed partial derivatives

of order pn1 , . . . , nd q of f . Let Cr0, 1sd and C β r0, 1sd denote the space of all continuous functions and the 22

H¨older space of β-smooth functions f : r0, 1sd Ñ R respectively, endowed with the supremum norm }f }8 “ suptPr0,1sd |f ptq|. For β ą 0, the H¨older space C β r0, 1sd consists of functions f P Cr0, 1sd that have bounded mixed partial derivatives up to order tβu, with the partial derivatives of order tβu being Lipschitz continuous of order β ´ tβu. Next, we define an anisotropic H¨older class of functions previously used in For a function f P Cr0, 1sd,

Barron et al. (1999a) and Klutchnikoff (2005).

x P r0, 1sd, and 1 ď i ď d, let fi p¨ | xq denote the univariate function y ÞÑ f px1 , . . . , xi´1 , y, xi`1, . . . , xd q. For a vector of positive numbers α “ pα1 , . . . , αd q, the anisotropic H¨older space C α r0, 1sd consists of functions f which satisfy, for some

L ą 0, max sup

1ďiďn xPr0,1sd

tα ÿi u

j“0

› j › ›D fi p¨ | xq› ď L, 8

(2.1)

and, for any y P r0, 1s, h small such that y ` h P r0, 1s and for all 1 ď i ď d, › › sup ›D tαi u fi py ` h | xq ´ D tαi u fi py | xq›8 ď L |h|αi ´tαi u .

(2.2)

xPr0,1sd

For t P Rd and a subset I Ă t1, . . . , du of size |I| “ d˜ with 1 ď d˜ ď d, let tI

denote the vector of size d˜ consisting of the coordinates ptj : j P Iq. Let Cr0, 1sI denote the subset of Cr0, 1sd consisting of functions f such that f ptq “ gptI q for some function g P Cr0, 1sd. Also, let C α r0, 1sI denote the subset of C α r0, 1sd consisting ˜

of functions f such that f ptq “ gptI q for some function g P C αI r0, 1sd. ˜

The ǫ-covering number Npǫ, S, dq of a semi-metric space S relative to the semimetric d is the minimal number of balls of radius ǫ needed to cover S. The logarithm of the covering number is referred to as the entropy. We φpxq

write “

“À”

for

inequality

up

to

a

constant

p2πq´1{2 expp´x2 {2q denote the standard 23

multiple.

normal density,

Let and

let φσ pxq



p1{σqφpx{σq. Let an asterisk denote a convolution, e.g., ş pφσ ˚ f qpyq “ φσ py ´ xqf pxqdx. Let fˆ denote the Fourier transform of a function f whenever it is defined. Denote by Sd´1 the d ´ 1-dimensional simplex ř consisting of points tx P Rd : xi ě 0, 1 ď i ď d, di“1 xi “ 1u.

2.3 Main results

Let W “ tWt : t P r0, 1sdu be a centered homogeneous Gaussian process with covariance function EpWs Wt q “ cps ´ tq. By Bochner’s theorem, there exists a finite positive measure ν on Rd , called the spectral measure of W , such that ż cptq “ e´ipλ,tq νpdλq, Rd

where for u, v

P

Cd , pu, vq denotes the complex inner product.

As in

van der Vaart and van Zanten (2009), we shall restrict ourselves to processes with spectral measure ν having sub-exponential tails, i.e., for some δ ą 0, ż eδ}λ} νpdλq ă 8.

(2.3)

The spectral measure ν of a squared exponential covariance kernel with cptq “ expp´ }t}2 q has a density w.r.t.

the Lebesgue measure given by f pλq “

1{p2d π d{2 q expp´ }λ}2 {4q which clearly satisfies (2.3). Rates of posterior contraction with Gaussian process priors were first studied by van der Vaart and van Zanten (2008a), who gave sufficient conditions in terms of the concentration function of a Gaussian random element for optimal rate of convergence in a variety of statistical problems including density estimation using the logistic Gaussian process (Lenk, 1988, 1991), Gaussian process mean regression, latent Gaussian process regression (e.g., in logit, probit models), binary classification, etc. As indicated in the introduction, one needs to build appropriate sieves in the space of continuous functions to get a handle on the posterior rates of convergence 24

in such models. van der Vaart and van Zanten (2008a) constructed the sieves as a collection of continuous functions within a small (sup-norm) neighborhood of a normbounded subset of the RKHS. Sharp bounds on the complement probability of such sets can be obtained using Borell’s inequality (Borell, 1975), and the metric entropy can also be appropriately controlled exploiting the fact that the RKHS consists of smooth functions if the covariance kernel is smooth. It is important to mention here that a similar strategy involving a subset of continuous functions bounded in sup-norm doesn’t work beyond the uni-dimensional case (Tokdar and Ghosh, 2007). A process W with infinitely smooth sample paths is not suitable for modeling less smooth functions. Rescaling the sample paths of an infinitely smooth Gaussian process is a powerful technique to improve the approximation of α-H¨older functions from the RKHS of the scaled process tWtA “ WAt :

t P r0, 1sdu

with A ą 0. Intuitively, for large values of A, the scaled process traverses the sample path of an unscaled process on the larger interval r0, Asd , thereby incorporating more “roughness”. In the context of univariate function estimation, van der Vaart and van Zanten (2007) had previously shown that a rescaled Gaussian process W an with a deterministic scaling an “ n1{p2α`1q logκ n leads to the minimax optimal rate for α-smooth functions up to a log factor. This specification requires knowledge of the true smoothness to obtain the minimax rate. Since the true smoothness is essentially always unknown, one would ideally employ a random rescaling, i.e., place a prior on the scale. van der Vaart and van Zanten (2009) studied rescaled Gaussian processes W A “ tWAt : t P r0, 1sdu for a real positive random variable A stochastically independent of W , extending the framework of van der Vaart and van Zanten (2008a) to the setting of conditionally Gaussian random elements (see also De Jonge and van Zanten (2010) for a different class of conditionally Gaussian processes). van der Vaart and van Zanten (2009) showed that with a Gamma prior on Ad , one obtains the minimax-optimal rate of convergence 25

n´α{p2α`dq (up to a logarithmic factor) for α-smooth functions. Since their prior specification does not involve the unknown smoothness α, the procedure is fully adaptive. The key result of van der Vaart and van Zanten (2009) was to construct the sieves Bn Ă Cr0, 1sd so that given α ą 0, a function w0 P C α r0, 1sd, and a constant C ą 1, there exists a constant D ą 0 such that, for every sufficiently large n, log Np¯ǫn , Bn , }¨}8 q ď Dn¯ǫ2n , 2

PpW A R Bn q ď e´Cnǫn ,

› › 2 Pp›W A ´ w0 ›8 ď ǫn q ě e´nǫn ,

(2.4) (2.5) (2.6)

with ǫn “ n´α{p2α`1q plog nqκ1 , ¯ǫn “ n´α{p2α`1q plog nqκ2 for constants κ1 , κ2 ą 0. There is a deep connection between the above measure theoretic result involving the concentration probability and complexity of the support of the conditional Gaussian process W A and rates of posterior contraction with Gaussian process priors. van der Vaart and van Zanten (2008a) mention that the conditions (2.4) - (2.6) have a one-to-one correspondence with the general sufficient conditions for rates of posterior contraction (Theorem 2.1 of Ghosal et al. (2000)). In a specific statistical setting involving Gaussian process priors on some function, sieves in the parameter space of interest can be easily obtained by restricting the unknown function to such sets Bn . It only remains to appropriately relate the norm of discrepancy specific to the problem (e.g., Hellinger norm for density estimation) to the Banach space norm (sup-norm in this case) of the Gaussian random element to conclude that maxtǫn , ¯ǫn u is the rate of posterior contraction; refer to the discussion following Theorem 3.1 in van der Vaart and van Zanten (2009). In this chapter, we shall consider two function classes defined in Section 2.2, (i) H¨older class of functions C α r0, 1sd with anisotropic smoothness (α P Rd` ), and (ii) H¨older class of functions C α r0, 1sI with isotropic smoothness that can possibly 26

depend on fewer dimensions (α ą 0 and I Ă t1, . . . , du). We shall study multi-

bandwidth Gaussian processes of the form tWta “ Wa¨t : t P r0, 1sdu for a vector of

rescalings (or inverse-bandwidths) a “ pa1 , . . . , ad qT with aj ą 0 for all j “ 1, . . . , d. For a continuous function in the support of a Gaussian process, the probability assigned to a sup-norm neighborhood of the function is controlled by the centered small ball probability and how well the function can be approximated from the RKHS of the process (Section 5 of van der Vaart and van Zanten (2008b)). With the target class of functions as in (i) or (ii), a single scaling seems inadequate and it is intuitively appealing to introduce multiple bandwidth parameters to enlarge the RKHS and facilitate improved approximation from the RKHS. As in van der Vaart and van Zanten (2007), we shall first consider minimax estimation with deterministic scalings an . van der Vaart and van Zanten (2008a) showed that the rate of posterior contraction with a Gaussian process prior W is determined by the behavior of the concentration function φw0 pǫq for ǫ close to zero, where φw0 pǫq “

inf

h:H:}h´w0 }8 ďǫ

}h}2H ´ log P p}W }8 ď ǫq,

(2.7)

and H is the RKHS of W . (We tacitly assume that there is a given statistical problem where the true parameter f0 is a known function of w0 .) Based on their result, with a multi-bandwidth Gaussian process prior W an , the posterior distribution would asymptotically accumulate all of its mass on an Opǫn q ball around the true parameter, where ǫn is the smallest possible solution to 2 n φa w0 pǫn q À nǫn ,

(2.8)

an . In n with φa w0 pǫn q denoting the concentration function of the scaled process W the following Theorem 9, we state choices of the bandwidth parameters specific to (i) and (ii) that lead to minimax rates of convergence. The proof follows from the 27

properties of multi-bandwidth GPs developed in Lemma 16–19 and hence is not provided separately. Theorem 9. 1. Suppose w0 P C α r0, 1sd for some α P Rd` and let α0´1 “ řd ´1 T i“1 αi . Let an “ pa1n , . . . , adn q , where, “ ‰α0 {αi ajn “ n1{p2α0 `1q . (2.9) 2 n Then, with ǫn “ n´α0 {p2α0 `1q logκ1 n for some constant κ1 , φa w0 pǫn q À nǫn .

2. Suppose w0 P C α r0, 1sI for some α ą 0 and I Ă t1, . . . , du with |I| “ d˚ . Let an “ pa1n , . . . , adn qT , where, ajn “

#“

˚q

n1{p2α`d 1

‰1{d˚

if j P I, if j R I.

(2.10)

˚ 2 n Then, with ǫn “ n´α{p2α0 `d q logκ2 n for some constant κ2 , φa w0 pǫn q À nǫn .

Theorem 9 coupled with van der Vaart and van Zanten (2008a) implies that a multi-bandwidth Gaussian process W an with an as in (2.9) and (2.10) leads to the minimax optimal rate of convergence in cases (i) and (ii) respectively. Theorem 9 requires knowledge of the true smoothness levels or the true dimensionality for minimax estimation. This is clearly unappealing and one would instead like to devise priors on a that lead to minimax rates for all smoothness levels. We propose a novel class of joint priors on the rescaling vector a that leads to adaptation over function classes (i) and (ii) in Section 2.3.1 and 2.3.2 respectively. Connections between the two prior choices are discussed and a unified framework is prescribed ( for the function class C α r0, 1sI : α P Rd` , I Ă t1, . . . , du combining (i) and (ii).

The main technical challenge for adaptation is to find sets Bn so that (2.4)–

(2.6) are satisfied with w0 in the above function classes and ǫn being the optimal rate of convergence for the same. With such sets Bn , one can use standard results to establish adaptive minimax rate of convergence in various statistical settings. Applications to some specific statistical problems are described in Section 2.3.4. 28

2.3.1 Adaptive estimation of anisotropic functions Let A = pA1 , . . . Ad qT be a random vector in Rd with each Aj a non-negative random variable stochastically independent of W . We can then define a scaled process W A “ tWA¨t : t P r0, 1sdu, to be interpreted as a Borel measurable map in Cr0, 1sd equipped with the sup-norm }¨}8 . The basic idea here is to stretch or shrink the different dimensions by different amounts so that the resulting process becomes suitable for approximating functions having differential smoothness along the different coordinate axes. We shall define a joint distribution on A induced through the following hierarchical specification. Let Θ “ pΘ1 , . . . , Θd q denote a random vector with a density supported on the simplex Sd´1 . In the subsequent analysis, we shall assume Θ „ Dirpβ1 , . . . , βd q for some β “ pβ1 , . . . , βd q. Given Θ “ θ, we let the elements of 1{θj

A be conditionally independent, with Aj

„ g, where g is a density on the positive

real line satisfying, C1 xp expp´D1 x logq xq ď gpxq ď C2 xp expp´D2 x logq xq, for positive constants C1 , C2 , D1 , D2 and every sufficiently large x ą 0. 1{θj

In particular, the conditions in the above display are satisfied with q “ 0 if Aj

follows a gamma distribution. For notational simplicity, we shall assume g to be a gamma density from now on, noting that the main results would all hold for the general form of g above. Let πA denote the induced joint prior on A, so that πA paq “

ş śd

j“1 πpaj

|

θj qdπpθq. We now state our main theorem for the anisotropic smoothness class in (i), with a detailed proof provided in Section 2.5. Theorem 10. Let W be a centered homogeneous Gaussian random field on Rd with spectral measure ν that satisfies (2.3) and let W A denote the multi-bandwidth process 29

with A „ πA as above. Let α “ pα1 , . . . , αd q be a vector of positive numbers and ř α0 “ p di“1 αi´1 q´1 . Suppose w0 belongs to the anisotropic H¨older space C α r0, 1sd. Then for every constant C ą 1, there exist Borel measurable subsets Bn of Cr0, 1sd

and a constant D ą 0 such that, for every sufficiently large n, the conditions (2.4)– (2.6) are satisfied by W A with ǫn “ n´α0 {p2α0 `1q plog nqκ1 , ǫ¯n “ n´α0 {p2α0 `1q plog nqκ2 for constants κ1 , κ2 ą 0. 2.3.2 Adaptive dimension reduction We next consider the smoothness class in (ii), namely C α r0, 1sI for I Ă t1, . . . , du and α ą 0. If the true function has isotropic smoothness on the dimensions it depends on, it is intuitively clear that one doesn’t need a separate scaling for each of the dimensions. Indeed, had we known the true coordinates I Ă t1, . . . , du, we could have only scaled the dimensions in I by a positive random variable A, and a slight modification of the results in van der Vaart and van Zanten (2009) would imply that a gamma prior on A|I| would lead to adaptation. Without knowledge of I, it is natural to consider mixture priors of the form Aj „ pA ` p1 ´ pqB, where A and B are positive random variables and 0 ď p ď 1, so that a subset of the dimensions are scaled by A and the remaining by B. Assume a gamma prior on Ad and B any fixed compactly supported density. We first construct a sample n for A through the following deterministic specification for size dependent prior πA

p “ pn assuming knowledge of |I| and the true smoothness level α. Aj „ pn A ` p1 ´ pn qB, j “ 1, . . . , d ˚ {p2α`d˚ q

pdn “ 1 ´ expp´cn q, cn “ n´d

,

where d˚ “ |I|. The following theorem is a result on partial adaptive estimation, n where we can adapt to the positions in I using πA assuming only the knowledge of

|I| and α. 30

Theorem 11. Let W be a centered homogeneous Gaussian random field on Rd with spectral measure ν that satisfies (2.3) and let W A denote the multi-bandwidth process n with A „ πA as above. Suppose w0 P C α r0, 1sI and let I Ă t1, . . . , du with |I| “

d˚ .Then for every constant C ą 1, there exist Borel measurable subsets Bn of Cr0, 1sd and a constant D ą 0 such that, for every sufficiently large n, the conditions (2.4)– (2.6) are satisfied by W A with ǫn “ n´α{p2α`d q plog nqκ1 , ǫ¯n “ n´α{p2α`d q plog nqκ2 ˚

˚

for constants κ1 , κ2 ą 0. As in the previous sub-section, our ultimate aim is to propose a joint prior on A so that the rescaled process W A satisfies conditions (2.4)–(2.6) without the knowledge of α or I. We describe such a prior specification below. Consider a joint prior πA on A induced through the following hierarchical scheme: (i) draw d˜ according to some prior distribution (with full support) on t1, . . . , du, (ii)

˜ draw a subset S of size d˜ from t1, . . . , du following some prior distribution given d, `˘ ˜ (iii) generate a pair assigning positive prior probability to all dd˜ subsets of size d, ˜

of random variables pA, Bq with Ad „ gamma and B drawn from a fixed compactly supported density, and finally, (iv) let Aj “ A for j P S and Aj “ B for j R S.

We next state our main result on adaptive dimension reduction. The proof of the following Theorem 12 has elements in common with the proof of the previous theorem, and hence only a sketch of the proof is provided in Section 2.5. Theorem 11 can be proved along similar lines. Theorem 12. Let W be a centered homogeneous Gaussian random field on Rd with spectral measure ν that satisfies (2.3) and let W A denote the multi-bandwidth process with A „ πA as above. Suppose w0 belongs to the H¨older space C α r0, 1sI for some subset I of t1, . . . , du and α ą 0. Then for every constant C ą 1, there exist Borel measurable subsets Bn of Cr0, 1sd and a constant D ą 0 such that, for every sufficiently large n, the conditions (2.4)–(2.6) are satisfied by W A with 31

ǫn “ n´α{p2α`d0 q plog nqκ1 , ¯ǫn “ n´α{p2α`d0 q plog nqκ2 for constants κ1 , κ2 ą 0 and d0 “ |I|. Remark 13. A salient feature of our hierarchical prior formulation is that the tail heaviness of A is related to the size of the subset S, i.e., the number of dimensions that are scaled by the non-compact random variable A. For larger subsets S, the tails of A get lighter, inducing a bigger penalty for large values of A. In the previous mixture specification Aj „ πn A`p1´πn qB, we believe that we needed the information of α and d0 in the weights πn since the interplay between the size of S and the tail heaviness of A was missing. 2.3.3 Connections between cases (i) and (ii) The joint distributions on A specified in Section 2.3.1 and 2.3.2 are closely connected. To begin with, note that if we set Aj “ A and θj “ 1{d for all j, one obtains a gamma prior on Ad which was previously suggested by van der Vaart and van Zanten (2009). In the general anisotropic case, the joint distribution can be motivated as follows. Recall that the purpose of rescaling is to traverse the sample paths of an infinite smooth stochastic process on a larger domain to make it more suitable for less smooth functions. If the true function has anisotropic smoothness, then we would like to stretch those directions more where the function is less smooth. Now note that for smaller values of θj , the marginal distribution of aj has lighter tails compared to larger values of θj . We would thus like θj to assume smaller values for the directions j where the function is more smooth and larger values corresponding to the less smooth directions. Without further constraints on θ, it is not possible to separate the scale of A from θ. This motivates us to constrain θ to the simplex which serves as a weak identifiability condition. In the limit as θj Ñ 0, the distribution of aj converges to a point mass at zero. Accordingly, if the true function doesn’t depend on a set of pd ´ d˚ q dimensions, we 32

would set θj “ 0 for those dimensions and choose the remaining θj ’s from a d˚ ´ 1 dimensional simplex. In particular, if the function has isotropic smoothness in the remaining d˚ coordinates, one can simply choose θj “ 1{d˚ for those dimensions. ˚

This explains our choice of letting ad follow a gamma distribution in Section 2.3.2. Based on the above discussion, we combine the results in Section 2.3.1 and 2.3.2 to prescribe a unified framework for adaptively estimating functions which possibly depend on fewer coordinates and have anisotropic smoothness in the remaining ones, i.e., functions in C α r0, 1sI for α P Rd` and I Ă t1, . . . , du. 2.3.4 Rates of convergence in specific settings The above two theorems are in the same spirit as Theorem 3.1 of van der Vaart and van Zanten (2009) and Theorem 2.2 of De Jonge and van Zanten (2010) and can be used to derive rates of posterior contraction in a variety of statistical problems involving Gaussian random fields. We shall consider a couple of specific problems with the message that similar results can be obtained for a large class of problems involving rescaled Gaussian random fields. We first consider a regression problem where given independent response variable yi and covariates xi P r0, 1sd, the response is modeled as random perturbations around a smooth regression surface, i.e., yi “ µpxi q ` ǫi . We assume ǫi „ Np0, σ 2 q with a prior on σ supported on some interval ra, bs Ă r0, 8q. As motivated before, the regression surface might depend only on a subset of variables in r0, 1sd and have anisotropic smoothness in the remaining variables. It is thus appealing to place a Gaussian process prior with dimension specific rescalings on µ as follows. Let W denote a Gaussian process with squared exponential covariance kernel cptq “ expp´ }t}2 q and A “ pA1 , . . . , Ad qT be a vector of positive random variables stochastically independent of W . We use the conditionally Gaussian process W A “ tWA¨t : t P r0, 1sdu as a prior for µ, with a joint prior on A induced through 33

the following hierarchical specification: (i) draw d˜ uniformly on t1, . . . , du, (ii) given

˜ draw a subset S “ ti1 , . . . , i ˜u of size d˜ uniformly from t1, . . . , du, (iii) draw d, d 1{θj θ “ pθ1 , . . . , θd˜q from the d˜ ´ 1-dimensional simplex Sd´1 „ gamma for ˜ , (iv) let Aj

j P S, and set the remaining Aj ’s to zero. We denote the posterior distribution by Πp¨ | y1 , . . . , yn q. Let }µ}2n “ ř n´1 ni“1 µ2 pxi q denote the L2 norm corresponding to the empirical distribution of the design points. Let the true value σ0 of σ be contained in the interval ra, bs. The posterior is said to contract at a rate ǫn , if for every sufficiently large M, “ ‰ Eµ0 ,σ0 Π pµ, σq : }µ ´ µ0 }n ` |σ ´ σ0 | ą Mǫn | y1 , . . . , yn Ñ 0. Theorem 14. Let α “ pα1 , . . . , αd q be a vector of positive numbers and I be a

subset of t1, . . . , du. If w0 P C α r0, 1sI , then the posterior contracts at the rate ǫn “ ř ´1 n´α0I {p2α0I `1q logκ n, where α0I “ jPI αj´1. Thus, one obtains the minimax optimal rate up to a log factor adapting to the

unknown dimensionality and anisotropic smoothness. A similar result holds for density estimation using the logistic Gaussian process. Suppose X1 , . . . , Xn are drawn i.i.d. from a continuous, everywhere positive density f0 on the hypercube r0, 1sd. Suppose one uses a multi-bandwidth Gaussian process exponentiated and re-normalized to integrate to one as the prior on the unknown density f , so that f ptq “ ş

A eWt . eWsA ds

r0,1sd

Theorem 15. Let α “ pα1 , . . . , αd q be a vector of positive numbers and I be a subset

of t1, . . . , du. If w0 “ log f0 P C α r0, 1sI , then the posterior contracts at the rate ǫn “ ř ´1 n´α0I {p2α0I `1q logκ n with respect to the Hellinger distance, where α0I “ jPI αj´1 . 34

The proofs of the above Theorems 14 and 15 follow in a straightforward manner from our main results in Theorem 10 and 12. We don’t provide a proof here since the steps are very similar to those in Section 3 of van der Vaart and van Zanten (2008a).

2.4 Properties of the multi-bandwidth Gaussian process We now summarize some properties of the RKHS of the scaled process W a for a fixed vector of scales a, which shall be crucially used to prove our main theorems.

The first five lemmas generalize the results in section 4 of

van der Vaart and van Zanten (2009) from a single scaling to a vector of scales. A key idea in van der Vaart and van Zanten (2009) to construct the sieves Bn was to exploit a containment relation among the unit balls of the RKHS with different amounts of scaling. Such a result sufficed in the single rescaling framework exploiting the ordering in elements of R` . However, the result can only be generalized with respect to the partial order on Rd` which is not sufficient for our purpose. We develop a technique to circumvent this curse of dimensionality by precisely calculating the metric entropy of a collection of unit RKHS balls. Assume that the spectral measure ν of W has a spectral density f . For a P Rd` , the

rescaled proces W a has a spectral measure νa given by νa pBq “ νpB.{aq. Further, νa admits a spectral density fa , with fa pλq “ a´1 f pλ.{aq. For w0 P Cr0, 1sd, define φa pǫq to be the concentration function of the rescaled Gaussian process W a . w0

As

a

straightforward

extension

of

Lemma

4.1

and

4.2

in

van der Vaart and van Zanten (2009), it turns out that the RKHS of the process W a can be characterized as below. Lemma 16. The RKHS Ha of the process tWta : t P r0, 1sdu consists of real parts of the functions t ÞÑ

ż

eipλ,tq gpλqνa pdλq, 35

where g runs over the complex Hilbert space L2 pνa q. Further, the RKHS norm of the element in the above display is given by }g}L2 pν q . a Lemma 4.3 of van der Vaart and van Zanten (2009) shows that for any isotropic H¨older smooth function w, convolutions with an appropriately chosen class of higher order kernels indexed by the scaling parameter a belong to the RKHS. This suggests that driving the bandwidth 1{a to zero, one can obtain improved approximations to any H¨older smooth function. The following Lemma 17 illustrates the usefulness of using separate bandwidths for each dimension for approximating anisotropic H¨older functions from the RKHS. Lemma 17. Assume ν has a density with respect to the Lebesgue measure which is bounded away from zero on a neighborhood of the origin. Let α P Rd` be given. Then, for any subset I of t1, . . . , du and w P C α r0, 1sI , there exists constants C and D depending only on ν and w such that, for a large enough, ÿ i u ď Da˚ . inft}h}2Ha : }h ´ w}8 ď C a´α i iPI

Proof. We shall prove the result for w P C α r0, 1sd and sketch an argument for extending the proof to any w P C α r0, 1sI .

Let ψj , j “ 1, . . . , d, be a set of higher order kernels as in the proof of Lemma 4.3 of ş ş van der Vaart and van Zanten (2009), which satisfy ψj ptj qdtj “ 1, tkj ψj ptj qdtj “ ş 0 for any positive integer k and |tj |αj |ψj ptj q|dtj ď 1. Define ψ : Rd Ñ C by ş ş ψptq “ ψ1 pt1 q . . . ψd ptd q so that one has Rd ψptqdt “ 1, Rd tk ψptqdt “ 0 for any nonˆ ˆ 2 {f are uniformly zero multi-index k “ pk1 , . . . , kd q, and the functions |ψ|{f and |ψ|

bounded, where ψˆ denotes the Fourier transform of ψ.

For a vector of positive numbers a “ pa1 , . . . , ad q, let ψa ptq “ a˚ ψpa ¨ tq, where ś a˚ “ dj“1 aj . By Whitney’s theorem, w can be extended to a function w : Rd Ñ R with compact support and }w}α ă 8. Working with this extension, we shall first 36

show that the convolution ψa ˚ w is contained in the RKHS Ha . To that end, note that, 1 pψa ˚ wqptq “ p2πqd

ż

´ipt,λq

e

wpλq ˆ ψˆa pλqdλ “

ż

eipt,λq

wp´λq ˆ ψˆa pλq νa pdλq. fa pλq

ˆ ψˆa pλq{fa pλq P L2 pνa q to Thus, following Lemma 16, we need to show that wp´λq ˆ conclude that ψa ˚ w belongs to Ha . Since ψˆa pλq “ ψpλ.{aq, one has › ˇ2 › ż ˇˇ › |ψ| ˆ 2 ›› ż ˆ ψˆa pλq ˇˇ ˇ wp´λq › 2 |wpλq| ˆ dλ. ˇ ˇ νa pdλq ď a˚ › › ˇ › f › ˇ fa pλq 8

ˆ 2 {f is uniformly bounded by The above assertion is thus proved by noting that |ψ| ş ş construction and p2πqd |wˆ 2 pλq|dλ “ |wptq|2 dt ă 8. Also, the squared RKHS norm

of ψa ˚ w is bounded by Da˚ , with D depending only on ν and w. Thus, the proof of ř ´α Lemma 17 would be completed if we can show that }ψa ˚ w ´ w}8 ď C dj“1 aj j . We have, for any t P Rd ,

ψa ˚ wptq ´ wptq “

ż

ψpsqtwpt ´ s.{aq ´ wptquds. pjq

For 1 ď j ď d ´ 1, let upjq denote the vector in Rd with ui “ 0 for i “ 1, . . . , j and pjq

ui “ 1 for i “ j ` 1, . . . , d. For any two vectors x, y P Rd , we can navigate from x to y in a piecewise linear fashion traveling parallel to one of the coordinate axes at a time. The vertices of the path will be given by xp0q “ x, xpjq “ upjq ¨ x ` p1 ´ upjq q ¨ y for j “ 1, . . . , d ´ 1 and xpdq “ y. A multivariate Taylor expansion of wpt´s.{aq around wptq cannot take advantage of the anisotropic smoothness of w across different coordinate axes. Letting x “ t, y “ t ´ s.{a and xpjq , j “ 0, 1, . . . , d as above, let us write wpyq ´ wpxq in the following telescoping form, wpyq ´ wpxq “

d ÿ

j“1

pjq

pj´1q

wpx q ´ wpx

q“ 37

d ÿ

j“1

wj ptj ´ sj {aj | xpjq q ´ wj ptj | xpjq q,

where the functions wj are as defined in Section 2.2, with wj pt | xq “ wpx1 , . . . , xj´1 , t, xj`1 , . . . , xd q for any t P R and x P Rd . Thus, wpt ´ s.{aq ´ wptq “

d „ tα ÿ ÿj u

 p´sj {aj qi D wj ptj | x q ` Sj ptj , ´sj {aj q , i! i“1

j“1

α

´αj

where |Sj ptj , ´sj {aj q| ď Ksj j aj

pjq

i

by (2.2), for a constant K depending on ν and w

but not on t and s. Combining the above, we have ˇ ˇ ˇˇ d ż ˇż d ˇ ÿ ˇ ˇÿ ˇ ˇ ´α ˇ ψpsqtwpt ´ s.{aq ´ wptquˇ “ ˇ S pt , ´s {a qdt ď C aj j . ˇ j j j j j ˇ ˇ ˇ ˇ j“1 j“1

˜ so that wptq “ w0 ptI q If, w P C α r0, 1sI for some subset I of t1, . . . , du with |I| “ d,

for some w0 P C αI r0, 1sd, then the conclusion follows trivially follows from the ˜

observation ψa ˚ w “ ψaI ˚ w0 .

We next study the metric entropy of the unit ball of the RKHS and the centered small ball probability of the rescaled process. Let Ha 1 denote the unit ball in the RKHS of W a . Lemma 18. There exists a constant K, depending only on ν and d, such that, for ǫ ă 1{2,

ˆ ˙d`1 1 a ˚ log Npǫ, H1 , }¨}8 q ď Ka log . ǫ

Proof. By Lemma 16, an element of Ha 1 can be written as the real part of the function h : r0, 1sd Ñ C given by hptq “

ż

eipλ,tq gpλqνa pdλq 38

(2.11)

ş for g : Rd Ñ C a function with |gpλq|2 νa pdλq ď 1.

Viewing h as a function of it, we would like to exploit the sub-exponential tails

of ν as in (2.3) to extend h analytically over a larger domain in Cd . For z P Cd , ş we shall continue to denote the function z ÞÑ epλ,zq ψpλqνa pdλq by h. Using the

Cauchy-Schwartz inequality and the change of variable theorem, ż 2 |hpzq| ď epλ,2a¨Repzqq νpdλq,

(2.12)

where Repzq denotes the vector whose jth element is the real part of zj for j “ 1, . . . , d, and a ¨ Repzq “ pa1 Repz1 q, . . . , ad Repzd qqT . From (2.12) and the dominated

d convergence theorem, any h P Ha 1 can be analytically extended to Γ “ tz P C :

}2a ¨ Repzq}2 ă δu. Clearly, Γ contains a strip Ω in Cd given by Ω “ tz P Cd : ? |Repzj q| ď Rj , j “ 1, . . . , du with Rj “ δ{p6aj dq. Also, for every z P Ω, h satisfies ş the uniform bound |hpzq|2 ď eδ}λ} νpdλq “ C 2 .

The analytic extension of h to a strip containing the product of the imaginary

axes allows us to precisely estimate the error term of a k-order Taylor expansion of hptq. For t P r0, 1sd, Let C1 , . . . , Cd denote circles of radius R1 , . . . , Rd in the complex plane around the coordinates it1 , . . . , itd of it respectively. Using the Cauchy integral formula, ˇ ˇ ˇ n ˇ ˇ ˇ ¿ ¿ ˇ D hptq ˇ ˇ 1 ˇ hpzq ˇ ˇ“ˇ ˇď C , dz ¨ ¨ ¨ dz ¨ ¨ ¨ 1 d ˇ n! ˇ ˇ p2πiqd ˇ Rn. n`1 pz ´ tq ˇ ˇ C1

Cd

where D n denotes the partial derivative of order n “ pn1 , . . . , nd q. This suggests

using a net of piecewise polynomials for approximating the elements of Ha 1 . One can discretize the coefficients and centers of the piecewise polynomials to obtain a finite set of functions that approximate the leading terms of a Taylor expansion of a function in Ha 1 and the remainder terms can be controlled using the bound in the above display. 39

To elaborate, let R “ pR1 , . . . , Rd qT .

Partition T “ r0, 1sd into rectangles

Γ1 , . . . , Γm with centers tt1 , t2 , . . . , tm u such that given any z P T , there exists Γj with center tj “ ptj1 , . . . , tjd qT with |zi ´ tji | ď Ri {4, i “ 1, . . . , d. Consider the ř piecewise polynomials P “ m j“1 Pj,γj 1Γj with Pj,γj ptq “

ÿ

n.ďk

γj,n pt ´ tj qn .

We obtain a finite set of functions Pa by discretizing the coefficients γj,n for each j and n over a grid of mesh width ǫ{Rn in the interval r´C{Rn , C{Rn s, with Rn “ R1n1 . . . Rdnd and C defined as above. As in van der Vaart and van Zanten (2009), the log cardinality of the set is bounded above by log

˜

m ź ź

j“1 n:n.ďk

#γj,n

¸

d

ď mk log

ˆ

2C ǫ

˙

.

(2.13)

We can choose m À 1{R˚ . The proof is complete if we show that the resulting set of functions is a Kǫ-net for constants C and K depending on ν and k À logp1{ǫq. The rest of the proof follows exactly as in the proof for Lemma 4.5 in van der Vaart and van Zanten (2009) by showing that

and

ˇ ˇ ˆ ˙k ˇ ÿ D n h pt q ˇ ÿ C 2 ˇ ψ i nˇ n pz ´ ti q ˇ ď pR{2q ď KC ˇ n ˇn.ąk ˇ n.ąk R n! 3 ˇ ˇ ˇ ÿ D n h pt q ˇ ˇ ˇ ψ i n pz ´ ti q ´ Pi,γi pzqˇ ď Kǫ. ˇ ˇn.ďk ˇ n!

(2.14)

(2.15)

The proof is completed by choosing k large enough such that p2{3qk ď Kǫ. Lemma 19. For any a0 positive, there exists constants C and ǫ0 ą 0 such that for a ě a0 and ǫ ă ǫ0 ,

˙d`1 ˆ ` › a› ˘ ¯ a ˚ › › . ´ log P W 8 ď ǫ ď Ca log ǫ 40

Proof. This follows from Theorem 2 in Kuelbs and Li (1993) and Lemma 4.6 in van der Vaart and van Zanten (2009).

Proceeding as in Lemma 4.6 in

van der Vaart and van Zanten (2009) and Lemma 18, we obtain ˆ ˙1`d φa 0 pǫq a ˚ φ pǫq ` log 0.5 ď K1 a log . ǫ

(2.16)

for some constant K1 ą 0. Note that with L “ r0, a1s ˆ ¨ ¨ ¨ ˆ r0, ads, › a› › › ď ǫq “ ´ log P psup |Wt | ď ǫq φa 0 pǫq “ ´ log P p W 8

(2.17)

tPL

ď ´ log P p sup |Wt | ď ǫq ¯ sd tPr0,a ˆ ˙τ ¯ a ď K2 , ǫ

(2.18)

(2.19)

for some constant K2 and τ ą 0, where the last inequality follows from the proof of Lemma 4.6 in van der Vaart and van Zanten (2009). Inserting this bound in (2.16), we obtain ˆ ˙ ` › a› ˘ ¯ d`1 a ˚ › › ´ log P W 8 ď ǫ ď Ca log ǫ

(2.20)

for some constant C ą 0.

a for difWe next state a nesting property of the unit ball Ha 1 of the RKHS of W ferent values of a, generalizing Lemma 4.7 of van der Vaart and van Zanten (2009). Lemma 20. Assume the spectral measure ν satisfies (2.3) and has a density f with respect to the Lebesgue measure on Rd which satisfies f pt.{aq ď f pt.{bq for any a ď b. Then,

a ? a1 . . . ad Ha b1 . . . bd Hb 1 Ă 1 . 41

ş Following Lemma 16, hptq “ eipλ,tq ψpλqνa pdλq. Since ş , it follows that |ψpλq|2 fa pλqdλ ď 1. Now, hptq “ q

Proof. Let h P Ha 1.

}h}2Ha “ }ψ}2L2 pν a ş ipλ,tq e tψpλqfa pλq{fb pλquνb pdλq. The conclusion follows since, 2

}h} b “ H

ż

2

|ψpλq|

"

fa pλq fb pλq

*2

› ż › › fa pλq › a˚ 2 › |ψpλq| ν pdλq ď , νb pdλq ď ›› a fb pλq ›8 b˚

using the fact that fa pλq{fb pλq “ pb˚ {a˚ qf pλ.{aq{f pλ.{bq ď pb˚ {a˚ q by assumption.

van der Vaart and van Zanten (2009) crucially used the above containment relation among the RKHS unit balls in the single bandwidth case to conclude that pr{δqd{2 Hr1 contains Ha1 for all a in the interval rδ, rs. Combining this fact with the observation that for very small values of a, the sample paths of W a behave like a constant function, they could construct the sieves Bn containing MHa1 ` ǫB1 for all a P r0, rs without increasing the entropy from that of MHr1 ` ǫB1 . The complement probability of Bn under the law of the rescaled process could also be appropriately controlled by choosing r large enough so that PpA ą rq is small enough. However, one doesn’t obtain a straightforward generalization of the above scheme to the multibandwidth case since the entropy of the sieve blows up in trying to control the joint probability of the rescaling vector a outside a hyper-rectangle in Rd` . The problem mentioned above is fundamentally due to the curse of dimensionality and one needs a more careful construction of the sieve to avoid this problem. The next three lemmas are crucially used in our treatment of the multi-bandwidth case. In the proof of Lemma 18, a collection of piece-wise polynomials is used to cover the unit RKHS ball Ha 1 . The main idea in the next set of lemmas is to exploit the fact that the same set of piecewise polynomials can also be used to cover Hb 1 for b sufficiently close to a. Further, we shall carefully choose a compact subset Q of Rd`

that balances the metric entropy of the collection of unit RKHS balls Ha 1 with a P Q 42

and the complement probability of Q under the joint prior on a. p0q

Let Sd´1 denote the interior of Sd´1 , i.e., all vectors θ P Rd` with

řd

j“1 θj

“ 1 and

θj ą 0 for all j “ 1, . . . , d. For u P Rd` , let Cu denote the rectangle in the positive quadrant given by a ď u, i.e., 0 ď aj ď uj for all j “ 1, . . . , d. For a fixed r ą 0, p0q

let Q “ Qprq consist of vectors a with aj ď r θj for some θ P Sd´1 . Clearly, Q is a p0q

union of rectangles Crθ over θ P Sd´1 . Clearly, the volume of each such rectangle Crθ is r and the outer boundary of Q consists of points a with aj ď r for all j “ 1, . . . , d and a˚ “ r (figure 2.1). By Lemma 18, for any such a in the outer boundary of

d`1 Q, the metric entropy of Ha p1{ǫq. In 1 is bounded by a constant multiple of r log

the following, we show in Lemma 21 that the metric entropy of the collection of unit RKHS balls with a varying over the outer boundary of Q is still of the order of r logd`1 p1{ǫq. Lemma 22 - 23 establish a stronger result which states that the entropy remains of the same order even if the union is considered over all of Q.

(a) For fixed r ą 1, rectangles Crθ “ t0 ď a ď(b) The region Q (shaded) resulting from the union p0q of all such rectangles rθ u for different values of θ P Sd´1

Figure 2.1: Union of rectangles Crθ “ t0 ď a ď r θ u for different θ

43

p0q

Lemma 21. For a positive number r ą 1 and θ P Sd´1 , let Hr,θ 1 denote the unit ball of the RKHS of W a with aj “ r θj for 1 ď j ď d. Then, there exists a constant K1 ,

depending only on ν and d, such that, for ǫ ă 1{2, ˆ ď ˙ ˆ ˙d`1 1 r,θ log N ǫ, H1 , }¨}8 ď K1 r log . ǫ p0q θPSd´1

Proof. Let Q “ ta P Rd` : 1 ď aj ď r @j “ 1, . . . , d, a˚ “ ru denote the outer boundary of Q defined above. Clearly, ď ď Hr,θ Ha 1 “ 1. p0q aPQ θPS d´1

For a, b P Q, the idea of the proof is to show that the piecewise polynomials Pa that b form a Kǫ-net for Ha 1 in the proof of Lemma 18 are also a Kǫ-net for H1 if b is

“close enough” to a. ? Fix a P Q. Let Ωa “ tz P Cd : |Repzj q| ď Rj , j “ 1, . . . , du with Rj “ δ{p6aj dq

denoting the strip in Cd on which every h P Ha 1 can be analytically extended. Let b P Q satisfy maxj |aj ´ bj | ď 1. We shall show that any h P Hb 1 can also be extended

analytically to the same strip Ωa by showing that }2b ¨ Repzq}2 ă δ on Ωa . To that end, for z P Ωa ,

}2b ¨ Repzq}2 ď }2a ¨ Repzq}2 ` }2pb ´ aq ¨ Repzq}2 ď 2 }2a ¨ Repzq}2 ď 2δ{3. where the penultimate inequality uses |bj ´ aj | ď 1 ď aj for all j “ 1, . . . , d. Clearly, the same tail estimate as in (2.14) works for any h P Hb 1 . From (2.15), it thus follows that the set of functions Pa form a Kǫ-net for Hb 1 . Let A be a set of points in Q such that for any b P Q, there exists a P A such that maxj |aj ´ bj | ď 1. One can clearly find an A with |A| ď r d . The proof is completed by observing that YaPA Pa form a Kǫ net for YθPSd´1 Hr,θ 1 . 44

Lemma 22. For u P Rd` , let Cu denote the subset of Rd` consisting of all vectors a ď u, i.e., aj ď uj for all j “ 1, . . . , d. Then, there exists a constant K2 , depending only on ν and d, such that, for ǫ ă 1{2, ˆ ˙ ˆ ď ˙d`1 1 ˚ a H1 , }¨}8 ď K2 u log log N ǫ, . ǫ aPCu Proof. The idea of the proof is similar to Lemma 21 in that we partition the space Cr into finitely many sets and cover the collection of unit RKHS balls with the scaling vector varying over one of these sets by a single collection of piecewise polynomials. We only sketch the partitioning scheme here and the rest of the proof is similar to Lemma 21. I For a subset I of t1, . . . , du, let Cu denote the subset of Cu consisting of vectors

a ď u with aj ď 1 for all j P I and aj ą 1 for all j R I. Then, clearly Cu can be written as the following disjoint union, Cu “

d ď ď

I . Cu

l“0 I:|I|“l

Fix 0 ď l ď d and a subset I of t1, . . . , du with |I| “ l. It suffices to prove the desired I . We shall slightly modify the complex strip from the proof entropy bound for Cu I of 18 to exploit that for any a P Cu , the values of aj for the coordinates j in I are

smaller than one. ? I Fix a P Cu . Let Ωa “ tz P Cd : |Repzj q| ď Rj , j “ 1, . . . , du with Rj “ δ{p6aj dq ? if j R I and Rj “ δ{p6 dq if j P I. Since }2a ¨ Repzq} ă δ for any z P Ωa , it follows

from the proof of Lemma 18 that any function h P Ha 1 has an analytic extension

I to Ω. Let b P Cu satisfy maxj |aj ´ bj | ď 0.5. Then one can prove along the lines

a of 21 that any h P Hb 1 can also be extended analytically to Ω . The remainder of I the proof follows similarly as Lemma 22, where the net for Cu is constructed as the union of the set of piecewise polynomials P a covering Ha , with a varying over a 1

45

I finite subset of Cu with cardinality Opu˚ q.

The following Lemma 23 follows along similar lines as the previous two lemmas. Lemma 23. Let ν satisfy (2.3). Fix r ě 1. Then, there exists a constant K2 depending on ν and d only, so that, for ǫ ă 1{2, ˙ ˆ ď a H1 , }¨}8 log N ǫ, prq aPQ ˙d`1 ˆ ď ď ˙ ˆ 1 a “ log N ǫ, . H1 , }¨}8 ď K2 r log ǫ p0q aďr θ θPSd´1

2.5 Proof of main results We shall only provide a detailed proof of Theorem 10 and sketch the main steps in the proof of Theorem 12. 2.5.1 Proof of Theorem 10 Let us begin by observing that, ˆ› ˙ ż › › › › › A P ›W ´ w0 › ď 2ǫ “ Pp›W a ´ w0 ›8 ď 2ǫqπA pdaq “

ż "ż

8

* › a › › › Pp W ´ w0 8 ď 2ǫqπpa | θqda πpθqdθ.

As in van der Vaart and van Zanten (2009), we first derive bounds on the noncentered small ball probability for a fixed rescaling a, and then integrate over the distribution of a to derive the same for W A . Given a P Rd` , recall the definition of the centered and non-centered concentration

functions of the process W a ,

› a› › › ď ǫq, φa 0 pǫq “ ´ log Pp W 8 φa w0 pǫq “

inf

hPHa 1 :}h´w0 }8 ďǫ

› › }h}2Ha ´ log Pp›W a ›8 ď ǫq. 46

(2.21)

For a fixed a, the non-centered small ball probability of W a can be bound in terms of the concentration function as follows (van der Vaart and van Zanten, 2008b), › › a Pp›W a ´ w0 ›8 ď 2ǫq ě e´φw0 pǫq .

Now, suppose that w0 P C α r0, 1sd for some α P Rd` . From Lemma 17 and 19, it follows that for every a0 ą 0, there exist positive constants ǫ0 ă 1{2, C, D and E ř i ă ǫ, that depend only on w0 and ν such that, for a ą a0 , ǫ ă ǫ0 and C di“1 a´α i ˆ ˙1`d ˙1`d ˆ ¯ ¯ a a a ˚ ˚ ˚ φw0 pǫq ď Da ` Ea log ď K1 a log , ǫ ǫ

with K1 depending only on a0 , ν and d. Thus, for ǫ ă mintǫ0 , C1a0´α¯ u, by (2.21), for constants K2 , . . . , K6 ą 0 and C2 , . . . , C6 ą 0, ˆ› ˙ › › › A P ›W ´ w0 › ď 2ǫ ě

ě

ż "ż θ

8

´φa w0 pǫq

e

* πpa | θqda πpθqdθ

ż " ż 2pC1 {ǫq1{α1

a1 “pC1 {ǫq1{α1

θ

¨¨¨

ż 2pC1 {ǫq1{αd

e

ad “pC1 {ǫq1{αd

´K2 p1{ǫq1{α0 log1`d p1{ǫq

ě C2 e

¯ {ǫq ´K1 a˚ log1`d pa

ż " ż 2pC1 {ǫq1{α1 θ

a1 “pC1 {ǫq1{α1

¨¨¨

* πpa | θqda πpθqdθ

ż 2pC1 {ǫq1{αd

ad “pC1 {ǫq1{αd

* πpa | θqda πpθqdθ.

Let Γ denote the region in the simplex Sd´1 given by Γ “ tθ P Sd´1 : τ ă θj ´ αα01 ă ř 2τ, j “ 1, . . . , d ´ 1u. Since dj“1 α0 {αj “ 1, we can choose τ ą 0 small enough

to guarantee that any θ satisfying the set of inequalities lies inside the simplex. ř Moreover, with θd “ 1 ´ d´1 j“1 θj , one has pd ´ 1qτ ă θd ă 2pd ´ 1qτ . Choosing ř τ “ C3 { logp1{ǫq, one can show that dj“1p1{ǫq1{pαj θj q ď C4 p1{ǫq1{α0 for any θ P Γ.

47

Now, ż " ż 2pC1 {ǫq1{α1

a1 “pC1 {ǫq1{α1

ě ě ě

¨¨¨

ż " ż 2pC1 {ǫq1{α1

ż

ż

a1 “pC1 {ǫq1{α1

e´K3

ż 2pC1 {ǫq1{αd

ad “pC1 {ǫq1{αd

¨¨¨

řd

1{αj θj j“1 p1{ǫq

1{α0

e´K4 p1{ǫq θPΓ

* πpa | θqda πpθqdθ

ż 2pC1 {ǫq1{αd

e

θ1 “α0 {α1 ´2τ

¨¨¨

1{θj

j“1

ad “pC1 {ǫq1{αd

aj

1{α0

πpθqdθ ě C5 e´K5 p1{ǫq

ż α0 {αd´1 ´τ

β

θd´1 “α0 {αd´1 ´2τ

* da πpθqdθ

πpθqdθ

The last inequality in the above display uses that ż α0 {α1 ´τ

řd

´

ş

θPΓ

´1

.

πpθqdθ “

d´1 p1 ´ θ1β1 ´1 . . . θd´1

d´1 ÿ j“1

θj qβd ´1 dθ1 . . . dθd´1

can be bounded below by a polynomial in τ 91{ logp1{ǫq. Hence, ˆ› ˙ › 1`d 1{α0 › › A P ›W ´ w0 › ď 2ǫ ě C6 e´K6 p1{ǫq log p1{ǫq . 8

(2.22)

Let B1 denote the unit sup-norm ball of Cr0, 1sd. For a vector θ P Sd´1 and

positive constants M, r, ǫ, let B θ “ B θ pM, r, ǫq denote the set, ď Bθ “ pMHa 1 q ` ǫB1 , θ aďr where r θ denotes the vector whose jth element is r θj . We further let, ď ď B“ pMHa 1 q ` ǫB1 . θPSd´1 aďr θ Let us first calculate the probability PpW A R B θ | θq. Note that, ż a θ PpW R B | θq “ PpW θ R B θ qπpa | θqda ď

ż

a

ďr θ

PpW a R B θ qπpa | θqda ` PpA ę r θ | θq, 48

where PpW A ę r | θq is a shorthand notation for Ppat least one Aj ą r θj | θq.

To tackle the first term in the last display, note that B θ contains the set MHa 1 `

ǫB1 for any a ď r θ by definition. Hence, for any a ď r θ , by Borell’s inequality, PpW a R B θ q ď PpW a R MHa 1 ` ǫB1 q ˙* ˆ " ´φa pǫq ´1 e 0 ď1´Φ M `Φ

˙* ˆ " θ rθ ´φr0 pǫq ´1 ď e´φ0 pǫq , e ď1´Φ M `Φ

` rθ ˘ if M ě ´2Φ´1 e´φ0 pǫq , where the penultimate inequality follows from the fact that,

with T “ r0, 1sd,

rθ a e´φ0 pǫq “ Pp sup |Wt | ď ǫq ě Pp sup |Wt | ď ǫq “ e´φ0 pǫq . tPa¨T tPr θ ¨T

By Lemma 4.10 of van der Vaart and van Zanten (2009), Φ´1 puq ě ´t2 logp1{uqu1{2 for u P p0, 1q. Hence, the last inequality in the above display remains valid if we choose M ě4 1{θj

Since Aj

b

θ

φr0 pǫq.

follows a gamma distribution given θj , in view of Lemma 4.9 of

van der Vaart and van Zanten (2009), for r larger than a positive constant depending only on the parameters of the gamma distribution, PpAj ą r θj | θq ď C1 r D1 e´D2 r . Combining the above, since B contains B θ for every θ P Sd´1 , * ż "ż A a PpW R Bq “ PpW R B | θqgpa | θq θ

ď

ż "ż θ

PpW a R B θ | θqgpa | θq d`1

ď C2 r D1 e´D2 r ` e´D3 r logpr{ǫq 49

.

*

(2.23)

From Lemma 23, the entropy of B can be estimated as, ď ď pMHa log Np2ǫ, B, }¨}8 q ď log Npǫ, 1 q, }¨}8 q θ θPSd´1 aďr ď r log

ˆ

M ǫ

˙d`1

(2.24)

.

Thus (2.22), (2.23) and (2.24) can be simultaneously satisfied if we choose, for constants κ, κ1 , κ2 ą 0, ǫn “ n´α0 {p2α0 `1q logκ pnq, rn “ n1{p2α0 `1q logκ1 pnq, Mn “ rn logκ2 pnq. 2.5.2 Proof of Theorem 12 For ease of notation, we shall make the simplifying assumption that the random variable B is degenerate at 1. For a ą 0 and S Ă t1, . . . , du, let Ha,S denote the

RKHS of W a , where aj “ a for j P S and aj “ 1 for j R S.

˜ and given positive constants M, r, ξ, ǫ, For a subset S Ă t1, . . . , du with |S| “ d,

let BS “ BS pM, r, ξ, ǫq  ď„ ď „ ` ˘ ` r ˘d˜ r,S a,S MH1 ` ǫB1 . H1 ` ǫB1 “ M ξ aăξ ˜

Since, given S, Ad „ gamma, it can be shown that, for some constant C1 ą 0, d˚

PpW A R BS | Sq À e´C1 r . The dominating term in the ǫ entropy of BS is bounded by ˆ ˙ C3 M 1`d d˚ . C2 r log ǫ 50

While calculating the concentration probability around w0 P C α r0, 1sI , simply use the fact that prpS “ Iq ą 0. Combining the above, the sieves Bn are constructed as, Bn “

d ď ď

˜ S:|S|“d˜ d“1

BS pMnS , rnS , ξn , ǫn q,

where, for constants κ, κ1 ą 0, ǫn “ n´α{p2α`d0 q logκ n, rnS



ˆ

n

d0 2α`d0

˙1{|S|

logκ1 pnq,

˜

pMnS q2 “ prnS qd logprnS {ǫn q.

2.6 Lower bounds on posterior contraction rates In this section, we will demonstrate that when the true density is dependent on a smaller number of variables, a Gaussian process prior with a single bandwidth leads to a sub-optimal rate of convergence. To illustrate this, we will focus on the example of density estimation using the logistic Gaussian process prior. We will show that the posterior contraction rate using a single bandwidth logistic Gaussian process with respect to the sup-norm topology is bounded below by n´α{p2α`dq when the true density is 1.5

f0 px1 , . . . , xd q “ Ce|x1 ´0.5| , x “ px1 , . . . , xd qT P r0, 1sd.

(2.25)

This shows the necessity of using an inhomogeneous Gaussian process in highdimensional density estimation when the true density is actually lower dimensional. Although lower bounds on the posterior contraction rates in Gaussian process settings have been previously addressed by Castillo (2008), the literature is restricted to series expansion priors and the Riemann-Liouville process priors. In this section, 51

we have extended the results to Gaussian process with exponential covariance kernel having a single bandwidth. In particular, we have derived a lower bound to the concentration function around w0 px1 , . . . , xd q “ |x1 ´ 0.5|1.5 using a single inversegamma bandwidth. In the following, we shall consider a rescaled Gaussian process W A for a positive random variable A stochastically independent of W . Recall that the logistic Gaussian process prior for a density f on r0, 1sd is given by exptW A pxqu , x P r0, 1sd. A exptW ptqudt r0,1sd

f pxq “ ş

(2.26)

˚

We shall consider a prior distribution on A specified by Ad „ g, where g is the gamma density and d˚ P t1, . . . , du. Recall that a gamma prior on Ad results in the minimax rate of contraction adaptively over log f being an isotropic α-H¨older function of d variables for any α ą 0. We shall show below that the above specification involving a single bandwidth leads to sub-optimal rate for any choice of d˚ P t1, . . . , du if log f0 depends on fewer coordinates. We will start with a few auxiliary lemmas which enable us to provide an lower bound to the concentration function of the Gaussian process W A . First we derive a lower bound to the concentration function φa pǫq for a fixed a and then marginalize with respect to the prior for a. The lower bound coupled with the ability of the model (2.26) to identify the Gaussian process term W A from w0 results in a lower bound to the posterior concentration rate. The key to obtaining a lower bound for the concentration function φa pǫq is to find a lower bound to ´ log P p}W a }8 ď ǫq. However, it is important to note here that one can’t just obtain a lower bound to the marginalized concentration function by marginalizing over ´ log P p}W a }8 ď ǫq. It becomes necessary to carefully characterize the domain of a in terms of the ǫ for which there exists an element in Ha in an ǫ-sup-norm neighborhood of w0 . Lemma 52

25-27 serve to find this domain by searching for the best approximator of w0 in Ha . In conjunction with our intuition, the obtained domain is rC0ǫ´1{α , 8q for some global constant C0 . This fact immediately provides a sharp lower bound to the marginalized concentration function which turns out to be of the same order as the upper bound up to a log-factor. Thus it is of no surprise that one can only achieve a sub-optimal rate of posterior convergence using a single bandwidth logistic Gaussian process prior. Denote by Ha the reproducing kernel Hilbert space of the Gaussian process W a.

In the following, we define a Gaussian based higher order kernel as in

Wand and Schucany (1990). For r ě 1, let Q2r´2 be the polynomial given by řr´1 c2i x2i where Q2r´2 pxq “ i“0 c2i “

p´1qi 2i´2r`1 p2rq! . r!p2i ` 1q!pr ´ i ´ 1q!

Wand and Schucany (1990) showed that Q2r´2 is the unique polynomial of degree ď 2r ´ 2 for which G2r ” Q2r´2 φ is a 2r order kernel. It is easy to see that r “ 1 corresponds to the standard Gaussian kernel. For r ą 1 and any 1 ď j ď r ´ 1, ş 2j x G2r pxq “ 0. R For x P Rd , define ψ 2r pxq “ G2r px1 q . . . G2r pxd q and for a ą 0, let ψa2r pxq “

ad ψ 2r paxq. In the following Lemma 24, we calculate the Fourier transform of ψ 2r ptq. Lemma 24. ψˆ2r pλq “ e´}λ}

2

śd {2

j“1



řr´1

λ2s j s“0 2s s!

53

 .

Proof. ψˆ2r pλq “ “

ż

ż

eipλ,tq ψ 2r ptqdt eipλ,tq G2r pt1 q ¨ ¨ ¨ G2r ptd qdt



d ż ź



d ź

j“1

eipλj ,tj q G2r ptj qdtj 2

e´λj {2

j“1

´}λ}2 {2

“ e

r´1 ÿ

λ2s j s s! 2 s“0

d r´1 ź ÿ λ2s j s 2 s! j“1 s“0

where the penultimate identity follows from Wand and Schucany (1990). Lemma 4.1 of van der Vaart and van Zanten (2009) gives a nice characterization of Ha in view of the isometry with the space L2 pνa q. In the following Lemma 25, we express each element of Ha as a convolution of ψa2r with a function in CpRd q for any given r ě 1. In other words, every element of Ha arises as a convolution of a higher order kernel with a function in CpRd q showing that the search for the best approximator of a C α r0, 1s function in the space Ha can be restricted to only convolutions of continuous functions with a higher order kernel. Lemma 25. Given any h P Ha and r ě 1, there exists w P CpRd q such that h “ 2r ψ2a ˚ w.

Proof. By Lemma 4.1 of van der Vaart and van Zanten (2009), we obtain that any h P Ha can be written as tÑ

ż

eipλ,tq gpλqfa pλqdλ, 54

(2.27)

ş where gpλq2fa pλqdλ ă 8. By change of variable,

hptq “

ż

e´ipλ,tq gp´λqfa pλqdλ,

(2.28)

ş ˆ with gp´λq2fa pλqdλ ă 8. Then hpλq “ p2πqd gp´λqfa pλq. Now observe that ψˆ2r pλq 2 2r is real and positive for all values of t and ψˆ2r pλq ą e´}λ} {2 . Also note that ψˆ2a pλq “

ψˆ2r pλ{2aq. Hence setting wpλq ˆ “

ˆ hpλq 2r pλq , ψˆ2a

we obtain

gp´λqπ d{2 expt´ }λ}2 {4a2 u . wpλq ˆ “ ś řr´1 λ2s j expt´ }λ}2 {8a2 u dj“1 s“0 2s s p2aq 2 s!

Thus |wpλq| ˆ ď expt´ }λ}2 {8a2 u |gp´λq| and "ż

|wpλq| ˆ dλ

*2

ď ď

"ż ż

2

2

expt´ }λ} {8a u |gp´λq| dλ

*2

expt´ }λ}2 {4a2 u |gp´λq|2 dλ

ă 8. 2r ˆ “ ψˆ2r w, As w ˆ belongs to L1 , and h 2a ˆ we immediately have h “ ψ2a ˚ w for a continuous

function w given by 1 wptq “ p2πqd 1 “ p2πqd

ż ż

e´ipλ,tq wpλqdλ ˆ e´ipλ,tq

gp´λqπ d{2 expt´ }λ}2 {4a2 u dλ. ś řr´1 λ2s j expt´ }λ}2 {8a2 u dj“1 s“0 p2aq2s 2s s!

The following Lemma 26 says that ψa2r ˚ w0 can better approximate w0 P CpRd q compared to ψa2r ˚ w for any w ‰ w0 . Lemma 26 further restricts the search for 55

the best approximator of a CpRd q function to only convolutions of the higher order kernel ψa2r with the function w0 itself. Lemma 26. Given any w0 P CpRd q compactly supported and r ě 1, › › › › ›w0 ´ ψ 2r ˚ w0 › ď ›w0 ´ ψ 2r ˚ w › a a 8 8

for sufficiently large a ą 0 and for any w P CpRd q compactly supported with }w ´ w0 } ą δ for some δ ą 0. Proof. Note that › 2r › › › ›ψa ˚ w ´ w0 › ě }w ´ w0 } ´ ›φ2r › . ˚ w ´ w a 8 8 8

Since w is compactly supported, there exists a0 ą 0 such that for a ą a0 , }φ2r a ˚ w ´ w}8 ă δ{2. The conclusion of the lemma follows by observing that for a ą a0 , }ψa2r ˚ w ´ w0 }8 ą δ{2. The following Lemma 27 provides a lower bound to the approximation error for w0 px1 , . . . , xd q “ |x1 ´ 0.5|1.5 , px1 , . . . , xd q P r0, 1sd with ψa2 ˚ w0 . Lemma 27. For w0 px1 , . . . , xd q “ |x1 ´ 0.5|1.5 , › › 2 ›w0 ´ ψ2a ˚ w0 ›8 ě C0 a´1.5

(2.29)

for some global constant C0 ą 0.

Proof. Since w0 P C 1.5 r0, 1sd, by Whitney’s theorem we can extend it to Rd so that w0 has a compact support with }w0 }1.5 ă 8. Without loss of generality, assume w0 is non-negative and the support of w0 is r´L, Lsd for some large L. Observe that 2 ψ2a

˚ w0 p1{2q ´ w0 p1{2q “ 56

ż

ψ 2 psqw0 p1{2 ´ s{p2aqqds

Now since w0 p1{2 ´ s{p2aqq “ 0 if |1{2 ´ s{p2aq| ą L, so for a ą 1{2, ts : |1{2 ´ s{p2aq| ď Lu Ą r´2L ` 1, 2L ` 1sd . Thus ż

2

ψ psqw0 p1{2 ´ s{p2aqqds ě

ż

r´2L`1,2L`1sd

“ 1{p2aq

1.5

ż

ψ 2 psqw0 p1{2 ´ s{p2aqqds

r´2L`1,2L`1sd

ψ 2 psq |s1 |1.5 ds.

2 This shows that }w0 ´ ψ2a ˚ w0 }8 ě C0 a´1.5 where

1 C0 “ 1.5 2

ż

r´2L`1,2L`1sd

ψ 2 psq |s1 |1.5 ds.

Also it follows from the last part of Lemma 4.3 of van der Vaart and van Zanten 2 2 2 (2009) that ψ2a ˚ w0 P Ha since pψˆ2a q pλq “ fa pλq.

Note that the lower bound obtained is same as the upper bound to the approxi2 mation error of any C 1.5 r0, 1s function using ψ2a ˚ w upto constants.

The following Lemma 28 is crucial to the derivation of a lower bound to the concentration function φa pw0 q.

Lemma 28 complements Lemma 4.6

of van der Vaart and van Zanten (2009) and is an application of Theorem 2 of Kuelbs and Li (1993). Lemma 28. There exists ǫ0 ą 0, possibly depending on a, such that for all ǫ ă ǫ0 , ´ log P p}W a }8 ă ǫq Á ad log

˜

|log ǫ|1{2 ǫ

¸d`1

.

(2.30)

Proof. Obtaining a lower bound is a simple application of Lemma 4.5 of van der Vaart and van Zanten (2009) and Theorem 2 of Kuelbs and Li (1993). The proof of Lemma 4.3 of van der Vaart and van Zanten (2009) shows that ˆ ˙d`1 1 a d log Npǫ, H1 , }¨}8 q « a log . ǫ 57

` ˘d`1 If we define ga pxq “ ad log x1 , it is easy to observe that g is a slowly varying

function. Then by Theorem 2 of Kuelbs and Li (1993), we obtain

φa0 pǫq

ě C1 g a

ˆ

ǫ a φa0 pǫq

˙

d

“a

ˆ

log

a

φa0 pǫq ǫ

˙d`1

.

(2.31)

Below we show that we only need to find a crude lower bound to φa0 pǫq to obtain the required bound. Observe that ˇ ˇ φa0 pǫq “ ´ log P p}W a }8 ď ǫq ě ´ log P pˇW 0 ˇ ď ǫq.

(2.32)

Note that W 0 „ Np0, 1q and hence P p|W 0 | ď ǫq “ t2Φpǫq ´ 1u « 1 ` |log ǫ| as ǫ Ñ 0. Hence we obtain for sufficiently small ǫ, φa0 pǫq Á |log ǫ| .

(2.33)

Plugging in the bound (2.33) in (2.31), we obtain

φa0 pǫq Á ad log

˜

|log ǫ|1{2 ǫ

¸d`1

.

(2.34)

Note that the lower bound in Lemma 28 differs from the upper bound in Lemma 4.6 of van der Vaart and van Zanten (2009) only by a logarithmic factor suggesting that the lower bound obtained is reasonably tight. Finally, we calculate the tail probability of the supremum of the Gaussian process W A which will be crucially used to derive a lower bound to the posterior concentration rate. Although this is an application of Borell’s Inequality, we will provide an independent proof to carefully identify the role of the prior for the bandwidth.

58

Lemma 29. For r ą 1, › `› ˘ P ›W A ›8 ą M ď



 1 2 1{2 1{2 P pA ą rq ` 2paMq exp ´ M ` Ctplog rq ` plog Mq u 2 d

for some constant C ą 0. Proof. From Theorem 5.2 of Adler (1990) it follows that if X is a centered Gaussian process on a compact set T Ă Rd and σT2 is the maximum variance attained by the Gaussian process on T , then for large M, P p}X}8

where νpMq “ C1



 1 2 ą M q ď 2N p1{M, T, }¨}q exp ´ 2 tM ´ νpM qu , 2σT

ş1{M 0

tlog Np1{M, T, }¨}qu1{2 dp1{Mq for some constant C1 ą 0. Ob-

serve that W a is rescaled to T “ r0, asd and the maximum variance attained by W a is 1. Note that Np1{M, T, }¨}q “ paMqd . Now νpMq ď C2

ż 1{M 0

td logpaMqu1{2 dp1{Mq

ď C3

ż 1{M

ď C3

1 tplog aq1{2 ` plog Mq1{2 u M

0

tplog aq1{2 ` plog Mq1{2 udp1{Mq

for some constants C2 , C3 ą 0. Using W a in place of X, we obtain, „

 1 P p}W a }8 ą Mq ď 2paMqd exp ´ M 2 ` C3 tplog aq1{2 ` plog Mq1{2 u 2 The conclusion of the lemma follows immediately.

59

2.7 Main result Below we state the main theorem on obtaining a lower bound to the posterior concentration rate using a logistic Gaussian process prior when the true density is given by (2.25). Since w0 is a C 1.5 r0, 1sd function, the best obtainable upper bound to the posterior rate of convergence using a single bandwidth logistic Gaussian process prior is n´1.5{p3`dq “ n´3{p6`2dq upto a log factor (van der Vaart and van Zanten, 2009). In the following Theorem 30, we show that the lower bound using the supnorm topology is also of the same order if we use a single bandwidth. In other words, it is impossible for a single bandwidth Gaussian process to optimally learn the lower dimensional density. Theorem 30. If f0 is given by (2.25) and the prior for a density f on r0, 1sd is given as in (2.26) for any d˚ P t1, . . . , du, then P p}f ´ f0 }8 ď n´3{p6`2dq logt0 n | Y1 , . . . , Yn q Ñ 0

(2.35)

a.s. as n Ñ 8 for some constant t0 ą 0. Proof. To obtain the lower bound, we will verify the conditions of Lemma 1 in Castillo (2008) with Bn “ tf : }f ´ f0 }8 ď ξn u for ξn “ n´3{p6`2dq logt0 n for some constant t0 chosen appropriately in the subsequent analysis. From the proof of Lemma 5 in Castillo (2008) it follows that for ck “ kdξn , k “ ´N, . . . , N and N the smallest ? integer larger than C n, ` ˘ P }f ´ f0 }8 ď ξn ď

N ÿ

k“´N

› › `› ˘ `› ? ˘ P ›W A ´ w0 ´ ck ›8 ď 2dξn ` P ›W A ›8 ą C nξn .

60

(2.36)

˚

An application of Lemma 29 with Mn2 , rnd “ Opnξn2 q yields › `› ? ˘ P ›W A ›8 ą C nξn ď P pA ą rn q ` expt´K1 Mn2 u ˚

ď expt´rnd u ` expt´K1 Mn2 u ď expt´K2 nξn2 u,

(2.37)

for some constants K1 , K2 ą 0. Lemma 25-27 and the observation that w0 R C 1.5`δ r0, 1sd for any δ ą 0 together imply that given any ǫ ą 0, there does not exist any element in Ha for a ă C0 ǫ´1{α such that for each k “ ´N, . . . , N, }w0 ´ h ´ ck }8 ă ǫ, where w0 is given by w0 px1 , . . . , xd q “ |x1 ´ 0.5|1.5 . From Lemma 28, if a ą C0 ǫ´1{α , φaw0 `ck pǫq

1 ě inf }h}2H ` ad log hPHa :}h´w0 ´ck }8 ăǫ 2 d

ě a log

ˆ

|log ǫ|1{2 ǫ

˙d`1

ˆ

|log ǫ|1{2 ǫ

˙d`1

.

Hence for k “ ´N, . . . , N, P

ˆ

˙ ż8 › A › ›W ´ w0 ´ ck › ă ǫ ď 8

exp

a“C0 ǫ´1{α

"

d

´ a log

ˆ

|log ǫ|1{2 ǫ

˙d`1 * da.

Using the inequality ż8 v

expt´tr udt ď 2r ´1 v 1´r expt´v r u,

we obtain that P

ˆ

˙ › A › ›W ´ w0 ´ ck › ă ǫ ď C1 expt´C2 ǫ´d{α |log ǫ|d`1 u, 8 61

for some constants C1 , C2 ą 0. Thus, from (2.37) and (2.36), ` ˘ P }f ´ f0 }8 ď ξn ď C3 N expt´C4 ξn´d{α u,

(2.38)

for some constant C3 ą 0. From van der Vaart and van Zanten (2009) it also follows that 2

P pBKL pf0 , ξn qq ě e´C5 nξn ,

(2.39)

for some constant t0 ą 0 and C5 ą 0 where " ż ˙2 * ż ˆ f0 f0 2 2 BKL pf0 , ǫq “ f : f0 log ă ǫ , f0 log ăǫ . f f

(2.40)

By adjusting t0 , C4 and C5 , we have from (2.38) and (2.39) ` ˘ P }f ´ f0 }8 ď ξn ď expt´2nξn2 u, P pBKL pf0 , ξn qq which proves the assertion of the theorem by Lemma 1 of Castillo (2008). Remark 31. Note that the lower bound n´3{p6`2dq logt0 n for d ą 1 is only a suboptimal rate for estimating w0 , the optimal rate being given by n´3{8 which is actually achieved by a multi-bandwidth Gaussian process prior. Refer to Theorem 15 for details. Remark 32. Note that we have derived a lower bound to the posterior contraction rate only for this special choice of f0 given in 2.25. The choice is motivated by the fact that it is easy to find a lower bound to the best approximation error of this function within the class Ha . More generally one might be interested in finding a subset of C α r0, 1sd for a fixed α ą 0 such that we can characterize both the best approximator and a lower bound to the approximation error for each of the elements in the subset. This would require a different version of Lemma 27 in each of the cases. However the general recipe provided in Lemma 25–27 remains the same. 62

Remark 33. One can also obtain a lower bound to the posterior concentration rate in other statistical settings, e.g., the Gaussian process mean regression using the same technique. This would need careful characterization of the upper bound to the concentration probability of the induced density around the truth i.e., P p}f ´ f0 }8 ă ξn q in terms of the concentration probability of the Gaussian process W A around w0 similar to that for the logistic Gaussian process in Theorem 30. Interested readers might find an outline of such an exercise in Section 7.7 of Ghosal and van der Vaart (2007a).

63

3 Bayesian nonparametric regression with varying residual density

3.1 Introduction Nonparametric regression offers a more flexible way of modeling the effect of covariates on the response compared to parametric models having restrictive assumptions on the mean function and the residual distribution.

Here we consider a

fully Bayesian approach. The response y P Y corresponding to a set of covariates x “ px1 , x2 , . . . , xp q1 P X can be expressed as y “ ηpxq ` ǫ

(3.1)

where ηpxq “ Epy | xq is the mean regression function under the assumption that the residual density has mean zero, i.e., Epǫ | xq “ 0 for all x P X . Our focus is on obtaining a robust estimate of η while allowing heavy tails to down-weight influential observations. We propose a class of models that allows the residual density to change nonparametrically with predictors x, with homoscedasticity arising as a special case. There is a substantial literature proposing priors for flexible estimation of the mean function, typically using basis function representations such as splines or 64

wavelets (Denison et al., 2002). Most of this literature assumes a constant residual density, possibly up to a scale factor allowing heteroscedasticity. Yau and Kohn (2003) allow the mean and variance to change with predictors using thin plate splines. In certain applications, this structure may be overly restrictive due to the specific splines used and the normality assumption. Chan et al. (2006) also used splines for heteroscedastic regression, but with locally adaptive estimation of the residual variance and allowance for uncertainty in variable selection. Nott (2006) considered the problem of simultaneous estimation of the mean and variance function by using penalized splines for possibly non Gaussian data. Due to the lack of conjugacy, these methods rely on involved sampling techniques using Metropolis Hastings, requiring proposal distributions to be chosen that may not be efficient in all cases. The residual density is assumed to have a known parametric form and heavy-tailed distributions have not been considered. In addition, since basis function selection for multiple predictors is highly computationally demanding, additive assumptions are typically made that rule out interactions. Gaussian process (GP) regression (Adler, 1990; Ghoshal and Roy, 2006; Neal, 1998) is an increasingly popular choice, which avoids the need to explicitly choose the basis functions, while having many appealing computational and theoretical properties. For articles describing some of these properties, refer to Adler (1990), Cram´er and Leadbetter (1967) and van der Vaart and Wellner (1996). A wide variety of functions can arise as the sample paths of the Gaussian process. GP priors can be chosen that have support on the space of all smooth functions while facilitating Bayes computation through conjugacy properties. In particular, the GP realizations at the data points are simply multivariate Gaussian. As shown by Choi and Schervish (2007b), GP priors also lead to consistent estimation of the regression function under normality assumptions on the residuals. Recently, Choi (2009) extended their results to allow for non-Gaussian symmetric residual distributions (for example, the Laplace 65

distribution) which satisfy certain regularity conditions and the induced conditional density belongs to a location-scale family. Although they require mild assumptions on the parametric scale family, the results depend heavily on parametric assumptions. In particular, their theory of posterior consistency is not applicable to an infinite mixture prior on the residual density. We extend their result allowing a rich class of residual distributions through PSB mixtures of Gaussians in Section 3.3. There is a rich literature on Bayesian methods for density estimation using mixture models of the form yi „ f pθi q,

θi „ P,

P „ Π,

(3.2)

where f p¨q is a parametric density and P is an unknown mixing distribution assigned a prior Π. The most common choice of Π is the Dirichlet process (Ferguson, 1973b, 1974b). Lo (1984) showed that Dirichlet process mixtures of normals have dense support on the space of densities with respect to Lesbesgue measure, while Escobar and West (1995) developed methods for posterior computation and inference. James et al. (2005) considered a broader class of normalized random measures for Π. In order to combine methods for Bayesian nonparametric regression with methods for Bayesian density estimation, one can potentially use mixture model (3.2) for the residual density in (8.1). A number of authors have considered nonparametric priors for the residual distribution in regression. For example, Kottas and Gelfand (2001) proposed mixture models for the error distributions in median regression models. To ensure identifiability of the regression coefficients, the residual distribution is constrained to have median zero. Their approach is very flexible but has the unappealing property of producing a residual density that is discontinuous at zero. In addition, the approach of mixing uniforms leads to blocky looking estimates of the residual density particularly for sparse data. Lavine and Mockus (2005) allow 66

both a regression function for a single predictor and the residual distribution to be unknown subject to a monotonicity constraint. A number of recent papers have focused on generalizing model (3.2) to the density regression setting in which the entire conditional distribution of y given x changes flexibly with predictors. Refer, for example, to M¨ uller et al. (1996); Griffin and Steel (2006b, 2010); Dunson et al. (2007b) and Dunson and Park (2008b) among others. Bush and MacEachern (1996) is contemporary with M¨ uller et al. (1996) and is concerned with nonparametrically estimating the random block effects in an anova-type mean linear-regression model with a t-residual density rather than density regression. Although these approaches are clearly highly flexible, there are several issues that provide motivation for this article. First, to simplify inferences and prior elicitation, it is appealing to separate the mean function ηpxq from the residual distribution in the specification, which is accomplished by only a few density regression methods. The general framework of separately modeling the mean function and residual distribution nonparametrically was introduced by Griffin and Steel (2010). They allow the residual distribution to change flexibly with predictors using the order-based Dirichlet process (Griffin and Steel, 2006b). On the other hand, we want to able to have a computationally simpler specification with straightforward prior elicitation. Chib and Greenberg (2010) develops a nonparametric model jointly for continuous and categorical responses where they model the mean of the link function and residual density separately. The mean is modeled using flexible additive splines and the residual density is modeled using a DP scale mixture of normals. However they didn’t allow the residual distribution to change flexibly with the predictors. Often we have strong prior information regarding the form of the regression function. Most of the current density regression models do not allow the incorporation of prior information without being overparametrized. Second, in many applications, the main interest is in inference on η or in prediction, and the residual distribution can be con67

sidered as a nuisance. Third, we would like to be able to provide a specification with theoretical support. By placing some constraints on the support we may achieve a gain in efficiency in estimating the regression function, since density regression procedures are almost too flexible. In particular, it would be appealing to show strong posterior consistency in estimating η without requiring restrictive assumptions on η or the residual distribution. Current density regression models lack such theoretical support. In addition, computation for density regression can be quite involved, particularly in cases involving more than a few predictors, and one encounters the curse of dimensionality in that the specifications are almost too flexible. Our goal is to obtain a computationally convenient specification that allows consistent estimation of the regression function, while being flexible in the residual distribution specification to obtain robust estimates. To accomplish this, we propose to place a Gaussian process prior on η and to allow the residual density to be unknown through a probit stick-breaking (PSB) process mixture. The basic PSB process specification was proposed by Chung and Dunson (2009) in developing a density regression approach that allows variable selection. On the other hand we are concerned with robust estimation of the mean regression function allowing the residual distribution to change flexibly with predictors. While we want to model the mean regression function nonparametrically, we also want to be able to incorporate our prior knowledge for the regression function quite easily as opposed to density regression models which are often a black box. Here, we propose four novel variants of PSB mixtures for the residual distribution. The first uses a scale mixture of Gaussians to obtain a prior with large support on unimodal symmetric distributions. The next is based on a symmetrized location-scale PSB mixture, which is more flexible in avoiding the unimodality constraint, while constraining the residual density to be symmetric and have mean zero. In addition, we show that this prior leads to strong posterior consistency in estimating η under weak conditions. To allow 68

the residual density to change flexibly with predictors, we generalize the above priors through incorporating probit transformations of Gaussian processes in the weights. The last two prior specifications allow changing residual variances and tail heaviness with predictors, leading to a highly robust specification which is shown to have better performance in simulation studies and out of sample prediction. It will be shown in some small sample simulated examples that the heteroscedastic symmetrized location-scale PSB mixture leads to even more robust inference than the heteroscedastic scale PSB mixture without compromising out of sample predictive performance. Section 3.2 proposes the class of models under consideration. Section 3.3 shows consistency properties. Section 3.4 develops efficient posterior computation through an exact block Gibbs sampler. Section 3.5 describes measures of influence to study robustness properties of our proposed methods. Section 3.6 contains simulation study results, Section 3.7 applies the methods to the Boston housing data and body fat data, and Section 3.8 discusses the results. Proofs are included in the Appendix.

3.2 Nonparametric regression modeling 3.2.1 Data Structure and Model Consider n observations with the ith observation recorded in response to the covariate xi “ pxi1 , xi2 , . . . , xip q1 . Let X “ px1 , . . . , xn q1 be the predictor matrix for all n subjects. The regression model can be expressed as yi “ ηpxi q ` ǫi ,

ǫi „ fxi , i “ 1, . . . , n.

We assume that the response y P Y is continuous and x P X where X Ă Rp is compact. Also, the residuals ǫi are sampled independently, with fx denoting the residual density specific to predictor value xi “ x. We focus initially on the case in which the covariate space X is continuous, with the covariates arising from a fixed, 69

non-random design or consisting of i.i.d realizations of a random variable. We choose a prior on the regression function ηpxq that has support on a large subset of C 8 pX q, the space of smooth real valued X Ñ R functions. The priors proposed for tfx , x P X u will be chosen to have large support so that heavy-tailed distributions and outliers will automatically be accommodated, with influential observations downweighted in estimating η. 3.2.2 Prior on the Mean Regression Function We assume that η P F “ tg : X Ñ R is a continuous functionu, with η assigned a Gaussian process (GP) prior, η „ GP pµ, cq, where µ is the mean function and c is the covariance kernel. A Gaussian process is a stochastic process tηpxq : x P X u such that any finite dimensional distribution is multivariate normal, i.e., for any n and x1 , . . . , xn , ηpXq :“ pηpx1 q, . . . , ηpxn qq1 „ NpµpXq, Ση q, where µpXq “ pµpx1 q, . . . , µpxn qq1 and Σηij “ cpxi , xj q. Naturally the covariance kernel cp¨, ¨q must satisfy, for each n and x1 , . . . , xn , that the matrix Ση is positive definite. The smoothness of the covariance kernel essentially controls the smoothness of the sample paths of tηpxq : x P X u. For an appropriate choice of c, a Gaussian process has large support in the space of all smooth functions. More precisely, the support of a Gaussian process is the closure of the reproducing kernel Hilbert space generated by the covariance kernel with a shift by the mean function (Ghoshal and Roy, 2006). For example, when X Ă R, the eigenfunctions of the univariate covariance kernel, cpx, x1 q “

1 ´κpx´x1 q2 e , τ

span C 8 pX q if κ is allowed to vary freely over R` .

Thus we can see that the Gaussian process prior has a rich class of functions as its support and hence is appealing as a prior on the mean regression function. Refer to Rasmussen and Williams (2005) and van der Vaart and van Zanten (2008b) as an introductory textbook on Gaussian processes. We follow common practice in choosing the mean function in the GP prior to 70

correspond to a linear regression, µpXq “ Xβ, with β denoting unknown regression coefficients. As a commonly used covariance kernel, we took the Gaussian kernel 1 2

cpx, x1 q “ τ1 e´κ||x´x || , where τ and κ are unknown hyperparameters, with κ controlling the local smoothness of the sample paths of ηpxq. Smoother sample paths imply more borrowing of information from neighboring x values. 3.2.3 Priors for Residual Distribution Motivated by the problem of robust estimation of the regression function η, we consider five different types of priors for the residual distributions tfx , x P X u as enumerated below. The first prior corresponds to the t distribution, which is widely used for robust modeling of residual distributions (West, 1984; Lange et al., 1989; Fonseca et al., 2008), while the remaining priors are flexible nonparametric specifications. 1.

Heavy tailed parametric error distribution: Following many previous

authors, we first consider the case in which the residual distributions follow a homoscedastic Student-t distribution with unknown degrees of freedom. As the Student-t with low degrees of freedom is heavy tailed, outliers are allowed. By placing a hyperprior on the degrees of freedom, νσ „ Gapaν , bν q, with Gapa, bq denoting the Gamma distribution with mean a{b, one can obtain a data adaptive approach to down-weighting outliers in estimating the mean regression function. However, note that this specification assumes that the same degrees of freedom and tail-heaviness holds for all x P X . Following West (1987), we express the Student-t distribution as a scale mixture of normals for ease in computation. In addition, we allow an unknown scale parameter, letting ǫi „ Np0, σ 2 {φi q, with φi „ Gapνσ {2, νσ {2q. 2.

Nonparametric error distribution: Let Y “ ℜ be the response space and

X be the covariate space which is a compact subset of ℜp . Let F denotes the space 71

of densities on X ˆ Y w.r.t the Lebesgue measure and Fd denotes the space of all conditional densities subject to mean zero, " * ż ż Fd “ g : X ˆ Y Ñ p0, 8q, gpx, yqdy “ 1, ygpx, yqdy “ 0 @ x P X . Y

Y

We propose to induce a prior on the space of mean zero conditional densities through a prior for a collection of mixing measures tPx , x P X u using the following predictordependent mixture of kernels. Px “

8 ÿ

h“1

πh pxqδtµh pxq,σh u ,

µh „ P0 , σh „ P0,σ

where πh pxq ě 0 are random functions of x such that

ř8

h“1 πh pxq

(3.3)

“ 1 a.s. for each

fixed x P X . tµh pxq, x P Xu8 h“1 are iid realizations of a real valued stochastic process, i.e., P0 is a probability distributions over a function space FX . Here P0,σ is a probability distribution on ℜ` . Hence for each x P X , Px is a random probability measure over the measurable Polish space pℜ ˆ ℜ` , Bpℜ ˆ ℜ` qq. Before proposing the prior, we first review the probit stick breaking process specification and its relationship to the Dirichlet process. Rodriguez and Dunson (2011a) introduce the probit stick-breaking process in greater details and discuss some to its theoretical smoothness and clustering properties. A probability measure P P P on pY, BpYqq follows a probit stick-breaking process with base measure P0 if it has a representation of the form P p¨q “

8 ÿ

h“1

πh δθh p¨q,

θh „ P0 ,

(3.4)

where the atoms tθh u8 h“1 are independent and identically distributed from P0 and ś the random weights are defined as πh “ Φpαh q lăh t1 ´ Φpαl qu, αh „ Npµα , σα2 q, h “ 1, . . . , 8. Here Φp¨q denotes the cumulative distribution function for the standard

normal distribution. Note that expression (3.4) is identical to the stick-breaking 72

representation (Sethuraman, 1994) of the Dirichlet process (DP), but the DP is obtained by replacing the stick-breaking weight Φpαh q with a beta(1, α) distributed random variable. Hence, the PSB process differs from the DP in using probit transformations of Gaussian random variables instead of betas for the stick lengths, with the two specifications being identical in the special case in which µα “ 0, σα “ 1 and the DP precision parameter is α “ 1. Rodriguez and Dunson (2011a) also mentioned the possibility of constructing a variety of predictor dependent models e.g., latent Markov random fields, spatio-temporal processes, etc by using probit transformation latent Gaussian processes. Such latent Gaussian processes can be updated using data augmentation Gibbs sampling as in continuation-ratio probit models for survival analysis (Albert and Chib, 2001). While we follow similar computational strategies as in Rodriguez and Dunson (2011a), they didn’t consider robust regression using predictor dependent residual density. Under the symmetric about zero assumption, we propose two nonparametric priors for the residual density fx for all x P X . The first prior is a predictor dependent PSB scale mixture of Gaussians which enforces symmetry about zero and unimodality, and the next is a symmetrized location-scale PSB mixture of Gaussians, which we develop to satisfy the symmetric about zero assumption while allowing multimodality. 2a. Heteroscedastic scale PSB mixtures: To allow the residual density to change flexibly with predictors, while maintaining the constraint that each of the predictordependent residual distributions is unimodal and symmetric about zero, we propose the following specification

f p¨q “

ż

Np¨ ; 0, τ ´1 qPx pdτ q,

where πh pxq “ Φtαh pxqu

ś

lăh r1

Px “

8 ÿ

h“1

πh pxqδτh ,

τh „ Gapατ , βτ q,

(3.5)

´ Φtαl pxqus is the predictor-dependent probability 73

weight on the hth mixture component, and the αh ’s are drawn independently from 1 2

zero mean Gaussian processes having covariance kernel cα px, x1 q “ τ1α e´κα ||x´x || . ř ´1 This implies fx p¨q “ 8 h“1 πh pxqNp¨ ; 0, τh q and is a highly-flexible specification that enforces smoothly changing mixture weights across the predictor space, so that the

residual densities at x and x1 will tend to be similar if x is located close to x1 , as measured by κα ||x ´ x1 ||2 . Clearly, the specification allows the residual variance to change flexibly with ř ´1 predictors, as we have varpǫ | xq “ 8 h“1 πh pxqτh . However, unlike the previously

proposed methods for heteroscedastic nonlinear regression, we do not just allow the variances to vary, but allow any aspect of the density to vary, including the heaviness of the tails. This allows locally adaptive downweighting of outliers in estimating the mean function. Previous methods, which instead assume a single heavy-tailed residual distribution, such as a t-distribution, can lead to a lack of robustness due to global estimation of a single degree of freedom parameter. In addition, due to the form of our specification, posterior computation becomes very straightforward using a data augmentation Gibbs sampler, which involves simple steps for sampling from conjugate full conditional distributions. Even under the assumption of Gaussian residual distributions, posterior computation for heteroscedastic models tends to be complex, with Gibbs sampling typically not possible due to the lack of conditional conjugacy. 2b. Heteroscedastic symmetric PSB (sPSB) location-scale mixtures: The PSB scale

mixture in (3.5) restricts the residual density to be unimodal. As this is a very restrictive assumption, it is appealing to define a prior with larger support that allows multimodal residual densities, while enforcing the symmetric about zero assumption so that the residual density is constrained to have mean zero. To accomplish this, we propose a novel symmetrized PSB process specification, which is related to the

74

symmetrized Dirichlet process proposed by Tokdar (2006b). We define ş f p¨q “ Np¨ ; µ, τ ´1 qdPxs pµ, τ q,

dPxs pµ, τ q “ 21 dPx p´µ, τ q ` 12 dPx pµ, τ q, (3.6)

where the atoms pµh , τh q are drawn independently from P0 a priori, with P0 chosen as a product of a Npµ0 , σ02 q and Gapατ , βτ q measure. The difference between the sPSB process prior and the PSB process prior is that instead of just placing probability weight πh on atom pµh , τh q, we place probability πh {2 on p´µh , τh q and pµh , τh q. The resulting residual density under (3.6) has the form f p¨q “ ř8 πh pxq tNp¨., ; ´µh , τh´1 q ` Np¨ ; µh , τh´1 qu. Clearly, each of the realizations corh“1 2

responds to a mixture of Gaussians that is constrained to be symmetric about zero.

The same comments made for the heteroscedastic scale PSB mixture apply here, but (3.6) is more flexible in allowing multi-modal residual distributions, with modality changing flexibly with predictors. Posterior computation is again straightforward, as will be shown later. 2c. Homoscedastic scale PSB process mixture of Gaussians. A simpler homoscedastic version of 3.5 is to consider f p¨q “

ż

Np¨ ; 0, τ ´1 qP pdτ q,

P “

8 ÿ

h“1

πh δτh ,

τh „ Gapατ , βτ q,

(3.7)

where the weights tπh u are specified as in πh “ νh

ź lăh

This implies that f p¨q “

p1 ´ νh q, νh “ Φpαh q, αh „ Npµα , σα2 q.

ř8

h“1

(3.8)

πh Np¨ ; 0, τh´1 q, so that the unknown density of the

residuals is expressed as a countable mixture of Gaussians centered at zero but with varying variances. Observations will be automatically allocated to clusters, with outlying clusters corresponding to components having large variance (low τh ). By choosing a hyperprior on µα while letting σα “ 1, we allow the data to inform more 75

strongly about the posterior distribution on the number, sizes and allocation to clusters. 2d. Location-scale symmetrized PSB (sPSB) mixture of Gaussians. A homoscedastic version of 3.6 is the following. ż f p¨q “ Np¨ ; µ, τ ´1qdP s pµ, τ q, P “

8 ÿ

h“1

πh δpµh ,τh q ,

1 1 dP s pµ, τ q “ dP p´µ, τ q ` dP pµ, τ q, 2 2

pµh , τh q „ P0 ,

(3.9)

where the prior on the weights πh are given by (3.8) and the prior for pµh , τh q are exactly as in 2b.

3.3 Consistency properties Let f „ Πu and f „ Πs denote the priors for the unknown residual density defined in expressions (3.7) and (3.9) respectively. It is appealing for Πu and Πs to have support on a large subset of Su and Ss respectively, where Ss denotes the set of densities on R with respect to Lebesgue measure that are symmetric about zero and Su Ă Ss is the subset of Ss corresponding to unimodal densities. We characterize the weak support of Πu , denoted by wkpΠu q Ă Su , in the following lemma. ? Lemma 34. wkpΠu q “ Cm , where Cm “ tf : f P Su , hpxq “ f p xq, x ą 0 is a completely monotone functionu. A function hpxq on p0, 8q is completely monotone in x if it is infinitely differm

d entiable and p´1qm dx m hpxq ě 0 for all x and for all m P t1, 2, . . . , 8u. Chu (1973)

proved that if f is a density on R which is symmetric about zero and unimodal, it can be written as a scale mixture of normals, f pxq “

ż

σ ´1 φpσ ´1 xqgpσqdσ 76

? for some density g on R, if and only if hpxq “ f p xq, x ą 0, is a completely monotone function, where φ is the standard normal pdf. This restriction places a smoothness constraint on f pxq, but still allows a broad variety of densities. Definition 35. Letting f „ Π, f0 is in the Kullback-Leibler(KL) support of Π if ˆ ż ˙ f0 pyq Π f : f0 pyq log dy ă ǫ ą 0, f pyq

@

ǫą0

The set of densities f in the Kullback-Leibler support of Π is denoted by KLpΠq. Let S˜s denote the subset of Ss corresponding to densities satisfying the following regularity conditions. 1. f is nowhere zero and bounded by M ă 8 ˇş ˇ 2. ˇ ℜ f pyq log f pyqdy ˇ ă 8

ˇ ˇş ˇ ă 8, where ψ1 pyq “ inf tPry´1,y`1s f ptq dy 3. ˇ ℜ f pyq log ψf1pyq pyq

4. there exists ψ ą 0 such that Lemma 36. KLpΠs q Ě S˜s .

ş



|y|2p1`ψq f pyqdy ă 8

Remark 37. The above assumptions on f are standard regularity conditions introduced by Tokdar (2006b) and Wu and Ghoshal (2008) to prove that f P KLpΠq, where Π is a general stick breaking prior which has all compactly supported probability distributions as its support. (1) is usually satisfied by common densities arising in practice. (4) imposes a minor tail restriction e.g., t-density with p2 ` δq degrees of freedom for some δ ą 0 satisfies (4). (1)-(4) are satisfied by a finite mixture of t-densities or even by an infinite mixture of t-densites with p2 ` δq degrees of freedom for some δ ą 0 and bounded component specific means and variances. 77

From Lemma 36, it follows that the sPSB location-scale mixture has KL-support on a large subset of the set of densities symmetric about zero. These conditions are important in verifying that the priors are flexible enough to approximate any density subject to the noted restrictions.

We provide fairly general sufficient conditions to ensure strong and weak posterior consistency in estimating the mean regression function and the residual density, respectively. We focus on the case in which a GP prior is chosen for η and an sPSB location-scale mixture of Gaussians is chosen for the residual density as in (3.9). Similar results can be obtained for the homoscedastic scale PSB process mixture under stronger restrictions on the true residual density. Although showing consistency results using predictor dependent mixtures of normals as the prior for the residual density in (3.5) and (3.6) is a challenging task, one can anticipate such results given the theory in Pati et al. (2010) and Norets and Pelenis (2010). Indeed Pelenis and Norets (2011) showed posterior consistency of the regression coefficients in a mean linear regression model with covariate dependent nonparametric residuals using the kernel stick-breaking process Dunson and Park (2008a). However, showing posterior consistency of the mean regression when we have a Gaussian process prior on the regression function and predictor dependent residuals is quite challenging and is a topic of future research. For this section, we assume xi ’s are non random and arising from a fixed design, though the proofs are easily modified for random xi ’s. When the covariate values are fixed in advance, we consider the neighborhood based on the empirical measure of the design points. Let Qn be the empirical probability measure of the design ř points, Qn pxq “ n1 ni“1 Ixi pxq. Based on Qn , we define a strong L1 neighborhood

of radius ∆ ą 0 as in Choi (2005) around the true regression function η0 . Letting 78

||η ´ η0 ||1,n “

ş

xPX

|ηpxq ´ η0 pxq|dQn pxq set, ( Sn pη0 ; ∆q “ η : ||η ´ η0 ||1,n ă ∆

(3.10)

We introduce the following notation. Let f0 denote an arbitrary fixed density in S˜s , η0 denote an arbitrary fixed regression function in F , and f0i “ f0 py ´ η0 pxi qq fηi “ f py ´ ηpxi qq. For any two densities f and g, let Kpf, gq “

ż

R

f pyq logtf pyq{gpyqudy,

V pf, gq “

ż

R

“ ‰2 f pyq log` tf pyq{gpyqu dy,

where log` x “ maxplog x, 0q. Set Ki pf, ηq “ Kpf0i , fηi q and Vi pf, ηq “ V pf0i , fηi q for i “ 1, . . . , n. For technical simplicity assume X “ r0, 1sp, τ “ 1 and µ ” 0. Denote a mean 1 2

zero Gaussian process tWx : x P r0, 1spu with covariance kernel cpx, x1 q “ e´||x´x ||

by W . Rescaling the sample paths of an infinitely smooth Gaussian process is a powerful technique to improve the approximation of α-H¨older functions from the RKHS of the scaled process tWxκ “ W?κx :

x P r0, 1sdu with κ ą 0. Intu-

itively, for large values of κ, the scaled process traverses the sample path of an ? unscaled process on the larger interval r0, κsp , thereby incorporating more “roughness”. van der Vaart and van Zanten (2009) studied rescaled Gaussian processes W κ “ tW?κx : x P r0, 1spu for a positive random variable κ stochastically independent of W and showed that with a Gamma prior on κp{2 , one obtains the minimax-optimal rate of convergence for arbitrary smooth functions. ? Assumption 1: η „ W κ with the density g of κ on the positive real line satisfying C1 xp expp´D1 x logq xq ď gpxq ď C2 xp expp´D2 x logq xq, for positive constants C1 , C2 , D1 , D2 and every sufficiently large x ą 0. Next we state the lemma on prior positivity due to van der Vaart and van Zanten (2009). 79

Lemma 38. If η satisfies Assumption 1 then P p||η ´ η0 ||8 ă ǫq ą 0 @ ǫ ą 0, if η0 is continuous. In order to prove posterior consistency for our proposed model, we rely on a theorem of Amewou-Atisso et al. (2003), which is a modification of the celebrated Schwartz (1965a) theorem to accommodate independent but not identically distributed data. Theorem 39. Suppose η as in Assumption 1 with q ě p ` 2 and f „ Πs , with Πs defined in (3.9). In addition, assume the data are drawn from the true density f0 pyi ´ η0 pxi qq, with txi ufixed and non-random, f0 P S˜s , η0 P F and f0 following the additional regularity conditions, 1. 2.

ş

ş

ş y 4 f0 pyqdy ă 8 and f0 pyq| log f0 pyq|2 dy ă 8. R

ˇ2 ˇ ˇ dy ă 8, where ψ1 pyq “ inf tPry´1,y`1s f0 ptq. f0 pyqˇ log ψf01pyq pyq

Let U be a weak neighborhood of f0 and Wn “ U ˆ Sn pη0 ; ∆q, with Wn Ă S˜s ˆ F . Then the posterior probability κ

ş

śn

κ i“1 fηi pyi qdΠs pf qdW pηq śn κ i“1 fηi pyi qdΠs pf qdW pηq S˜s ˆF

pΠs ˆ W qpWn |y1 , . . . , yn , x1 , . . . , xn q “ ş

Wn

Ñ 1 a.s.

Theorem 39 ensures weak consistency of the posterior of the residual density and strong consistency of the posterior of the regression function η.

3.4 Posterior Computation We provide details for posterior computation separately for the most important models. We first describe the choice of hyperparameters of the prior on the regression function. 80

Choice of hyperpriors: We choose the typical conjugate prior for the regression coefficients in the mean of the GP, β „ Npβ0 , Σ0 q, where β0 “ 0 and Σ0 “ cI is a common choice corresponding to a ridge regression shrinkage prior. The prior on τ is given by τ „ Gap ν2τ , ν2τ q. We let κ „ Gapακ , βκ q with small βκ and large ακ . Normalizing the predictors prior to analysis, we find that the data are quite informative about κ under these priors, so as long as the priors are not overly informative, inferences are robust. The parameter τ controls the heaviness of the tails of the prior for the regression function. In fact, choosing a Gapντ {2, ντ {2q prior induces a heavy tailed t-process with ντ degrees of freedom as a prior for the regression function. In all the examples, we have fixed ντ “ 3. κ controls the correlation of the Gaussian process at two points in the covariate space similar to a spatial decay parameter in a spatial random effects model. Although a discrete uniform prior for κ is computationally efficient in leading to a griddy Gibbs update step, there can be sensitivity to the choice of grid. A gamma prior for κ eliminates such sensitivity at some associated computational price in terms of requiring a Metropolis-Hastings update that tends to mix slowly. Since the responses are normalized and the covariates are scaled to lie in the interval r0, 1s, using a single decay parameter appears to be reasonable. We choose the parameters ακ and βκ so that the mean correlation is 0.1 for two points ? separated by a distance p in the covariate space. νσ controls the tail-heaviness of the prior for the scaling φ. Since we would like to accommodate outliers with the mean being fixed at 1, we assume φi „ Gapνσ {2, νσ {2q with νσ „ Gap3, 1q. a and b are fixed at 3{2 to resemble a t-distribution with 3 degrees of freedom without the scaling φ. 3.4.1 Gaussian process regression with t residuals Let Y “ py1 , . . . , yn q1 , η “ pηpx1 q, ηpx2 q, . . . , ηpxn qq1 and define a matrix T such that 2

Tij “ e´κ||xi´xj || . Hence Ση “

1 T. τ

Assume Ω “ diagp1{φi : i “ 1, . . . , nq and 81

φ “ pφ1 , . . . , φn q1 . Then we have Y|η „ Npη, σ 2Ωq, η|β, τ, κ „ NpXβ, τ ´1 Tq, β „ Npβ0 , Σ0 q ` νσ νσ ˘ , , νσ „ Gapαν , βν q, σ ´2 „ Gapa, bq 2 2 ` ντ ντ ˘ κ „ Gapακ , βκ q, τ „ Ga , . 2 2

φi „ Ga

Next we provide the full conditional distributions needed for Gibbs sampling. Due to conjugacy, η, β, σ ´2, φ and τ have closed form full conditional distributions, while νσ and κ are updated by using Metropolis Hastings steps within the Gibbs sampler. ´1 Let Vη “ pτ T´1 ` σ ´2 Ω´1 q´1 and Vβ “ pτ X1 T´1 X ` Σ´1 0 q . ` ˘ η|Y, β, σ ´2, τ, κ, νσ , φ „ N Vη pτ T´1 Xβ ` σ ´2 Ω´1 Yq, Vη ` ˘ β|Y, η, σ ´2 , τ, κ, νσ , φ „ N Vβ pτ X1 T´1 η ` Σ´1 0 β0 q, Vβ ¸ ˜ n ÿ 1 n φi pyi ´ ηi q2 ` b ` a, σ ´2 |Y, η, β, τ, κ, νσ , φ „ Ga 2 2 i“1

ˆ

˙ ( n ` ντ 1 1 ´1 τ |Y, η, β, σ , κ, νσ , φ „ Ga , pη ´ Xβq T pη ´ Xβq ` ντ 2 2 ˙ ˆ νσ ` 1 1 ´2 2 ´2 , tσ pyi ´ ηi q ` νσ u . φi |Y, η, β, σ , κ, νσ „ Ga 2 2 ´2

3.4.2 Heteroscedastic PSB mixture of normals First we need to describe the choice of hyperparameters in this case. Choice of hyperparameters: We assume κα „ Gapγκ , βκ q and τα „ Gap ν2α , ν2α q. If the data yi are normalized, we can expect the overall variance to be close to one, ř ´1 so the variance of the residuals, V arpǫq “ 8 h“1 πh τh , should be less than one. We set ατ “ 1 and choose a hyperprior on βτ , βτ „ Gap1, k0 q with k0 ą 1 so that the

prior mean of τh is significantly less than one. Different values of k0 are tried out to assess robustness of the posteriors. For posterior computation,

we propose a Markov chain Monte Carlo 82

algorithm,

which

is a hybrid

of

data augmentation,

the

exact

block

Gibbs sampler of Papaspiliopoulos (2008) and Metropolis Hastings sampling. Papaspiliopoulos (2008) proposed the exact block Gibbs sampler as an efficient approach to posterior computation in Dirichlet process mixture models, modifying the block Gibbs sampler of Ishwaran and James (2001) to avoid truncation approximations. The exact block Gibbs sampler combines characteristics of the retrospective sampler (Papaspiliopoulos and Roberts, 2008) and the slice sampler (Walker, 2007; Kalli et al., 2010). We included the label switching moves introduced by Papaspiliopoulos and Roberts (2008) for better mixing. Introduce γ1 , . . . , γn such that πh pxi q “ P pγi “ hq, h “ 1, 2, . . . , 8. Then γi „

8 ÿ

πh pxi qδh “

h“1

8 ÿ

h“1

1pui ă πh pxi qqδh

where ui „ Up0, 1q. The MCMC steps are given below. 1.

Update ui ’s and stick breaking random variables: Generate ui |´ „ Up0, πγi pxi qq

where πh pxi q “ Φtαh pxi qu

ś

lăh r1

´ Φtαl pxi qus. For i “ 1, . . . , n, introduce la-

tent variables Zh pxi q, h “ 1, 2, . . . such that Zh pxi q „ Npαh pxi q, 1q. Thus πh pxi q “ P pZh pxi q ą 0, Zl pxi q ă 0 for l ă hq. Then # Npαh pxi q, 1qIR` , h “ γi Zh pxi q|´ „ Npαh pxi q, 1qIR´ , h ă γi . ` ˘ Let Zh “ pZh px1 q, . . . , Zh pxn qq1 and αh “ pαh px1 q, . . . , αh pxn qq1 . Letting Σα ij “

e´κα ||xi´xj || , Zh „ Npαh , Iq and αh „ Np0, τ1α Σα q,

` ˘ ´1 ´1 ´1 αh |´ „ N pτα Σ´1 α ` In q Zh , pτα Σα ` In q 83

Continue up to h “ 1, . . . , h˚ “ maxth˚1 , . . . , h˚n u, where h˚i is the minimum integer řh˚i satisfying l“1 πl pxi q ą 1 ´ mintu1 , . . . , un u, i “ 1, . . . , n. Now ˆ ˙˙ ˆ h˚ ˘ 1 ÿ 1` ˚ 1 ´1 τα |´ „ Ga α Σ αk ` να , nh ` να , 2 2 l“1 k α

while kα is updated using a Metropolis Hastings step. 2.

Update allocation to atoms: Update pγ1 , . . . , γn q|´ as multinomial random

variables with probabilities P pγi “ hq9Npyi ; ηpxi q, τh´1 qIpui ă πh pxi qq, h “ 1, . . . , h˚ . 3.

Update component-specific locations and precisions: Letting nl “ #ti :

γi “ lu, l “ 1, 2, . . . , h˚ , ˙ ˆ ÿ nl 2 ` ατ , βτ ` pyi ´ ηpxi qq , l “ 1, 2, . . . , h˚ τl |´ „ Ga 2 i:γ “l i

˙ ˆ ÿ k˚ βτ |´ „ Ga 1, τl ` k0 . l“1

4.

q, Update the mean regression function: Letting Λ “ diagpτγ´1 , . . . , τγ´1 n 1 η|´ „ Nppτ T´1 ` Λ´1 q´1 pτ T´1 Xβ ` Λ´1 Yq, pτ T´1 ` Λ´1 q´1 q ` ˘ ´1 1 ´1 ´1 1 ´1 ´1 ´1 β|´ „ N pτ X1 T´1 X ` τ Σ´1 q pτ X T η ` Σ β q, pτ X T X ` Σ q 0 0 0 0 ˆ ˙ ( n ` ντ 1 τ |´ „ Ga , pη ´ Xβq1 T´1 pη ´ Xβq1 ` ντ . 2 2

5.

Update κ in a Metropolis Hastings step.

3.4.3 Heteroscedastic sPSB process location-scale mixture We will need the following changes in the updating steps from the previous case. 1.

Update allocation to atoms: Update pγ1 , . . . , γn q|´ as multinomial random 84

variables with probabilities P pγi “ hq9

( 1 Npyi ; ηpxi q ` µh , τh´1 q ` Npyi ; ηpxi q ´ µh , τh´1 q Ipui ă πh pxi qq, 2

h “ 1, . . . , h˚ . Component-specific locations and precisions: Let nl “ #ti : γi “ lu, l “ ř 1, 2, . . . , h˚ and ml “ i:γi “l pyi ´ ηi q. The atoms of the base measure location is

3.

updated from a mixture of normals as ˆ ˙ ˆ ˙ µ0 σ0´2 ` τl ml 1 1 µ0 σ0´2 ´ τl ml µl |´ „ pl N , ` p1 ´ pl qN , , σ0´2 ` nl τl σ0´2 ` nl τl σ0´2 ` nl τl σ0´2 ` nl τl

where pl 9 exp

" ˆ 1 2

µ0 σ0´2 `τl ml σ0´2 `nl τl

˙* .

˙ ˆ ÿ nl 2 ` ατ , βτ ` tyi ´ ηpxi q ´ µl u ` τl |´ „ pl Ga 2 i:γ “l i

˙ ˆ ÿ nl ` ατ , βτ ` tyi ´ ηpxi q ` µl u2 , p1 ´ pl qGa 2 i:γ “l i

"

* nl `α 2

where pl 9 `

βτ ` 21

ř

1

2 i:γi “l tyi ´ηpxi q´µl u

˘

.

q, µ˚ “ Update the mean regression function: Let Λ “ diagpτγ´1 , . . . , τγ´1 n 1 ` ˘´1 pµγ1 , µγ2 , . . . , µγn q and W “ τ T´1 ` Λ´1 . Hence ˆ ˙ ´1 ´1 ˚ η|´pN η; Wtτ T Xβ ` Λ pY ´ µ qu, W ` 4.

ˆ ˙ ´1 ´1 ˚ p1 ´ pqN η; Wtτ T Xβ ` Λ pY ` µ qu, W

where p9 exp (‰ µ˚ q .

“1 2

pτ T´1 Xβ`Λ´1pY´µ˚ qq1 WXβ`Λ´1pY´µ˚ q´pY´µ˚q1 Λ´1 pY´

85

3.5 Measures of Influence There has been limited work on sensitivity of the posterior distribution to perturbations of the data and outliers. Arellano-Vallea et al. (2000) use deletion diagnostics to assess sensitivity, but their methods are computationally expensive in requiring posterior computation with and without data deleted. Weiss (1996) proposed an alternative that perturbs the posterior instead of the likelihood, and only requires ˜ xi q denote the samples from the full posterior. Following Weiss (1996), let f pyi |Θ, likelihood of the data yi , define ˜ ˜ “ f pyi ` δ|Θ, xi q , δi˚ pΘq ˜ xi q f pyi |Θ, ˜ for some small δ ą 0 and let pi pΘ|Yq denote a new perturbed posterior, ˚ ˜ ˜ ppΘ|Yqδ i pΘq ˜ pi pΘ|Yq “ . ˜ Epδi˚ pΘq|Yq

Denote by Li the influence measure, which is a divergence measure between the ˜ ˜ unperturbed posterior ppΘ|Yq and the perturbed posterior pi pΘ|Yq, 1 Li “ 2

ż

˜ ˜ ˜ |ppΘ|Yq ´ pi pΘ|Yq|d Θ.

˜ ˜ Li is bounded and takes values in r0, 1s. When ppΘ|Yq “ pi pΘ|Yq, Li “ 0 indicating that the perturbation δi˚ has no influence. On the other hand, if Li “ 1, the supports ˜ ˜ of ppΘ|Yq and pi pΘ|Yq are disjoint indicating maximum influence. We can define ř an influence measure as L “ n1 ni“1 Li . Clearly L also takes values in r0, 1s with

L “ 0 ñ Li “ 0 @ i “ 1, 2, . . . , n. Also L “ 1 ñ Li “ 1 @ i “ 1, 2, . . . , n. Weiss

˜ 1, . . . , Θ ˜ M be the (1996) provided a sample version of Li , i “ 1, . . . , n. Letting Θ posterior samples with B the burn-in, ˇ ˚ ˇ M ÿ ˜ kq ˇ δi pΘ ˇ 1 1 ˇ ˇ, ˆi “ ´ 1 L ˇ ˆ ˚ pΘqq ˜ M ´ B k“B`1 2 ˇ Epδ i 86

řM ˚ ˜ ˆ ˚ pΘqu ˜ “ 1 ˆ where Etδ i k“B`1 δi pΘk q. Our estimated influence measure is L “ M ´B řn ˆ 1 i“1 Li . We will calculate the influence measure for our proposed methods and n

compare their sensitivity.

3.6 Simulation studies To assess the performance of our proposed approaches, we consider a number of simulation examples, (i) linear model, homoscedastic error with no outliers, (ii) linear model, homoscedastic error with outliers (iii) linear model, heteroscedastic errors and outliers, (iv) non-linear model (a), heteroscedastic errors and outliers and (v) non-linear model (b), heteroscedastic errors and outliers. We let the heaviness of the tails and error variance change with x in cases (iii), (iv) and (v). We considered the following methods of assessing the performance, namely, mean squared prediction error (MSPE), coverage of 95% prediction intervals, mean integrated squared error (MISE) in estimating the regression function at the points for which we have data, point wise coverage of 95% credible intervals for the regression function and the inˆ as described in Section 3.5. We also consider a variety of sample fluence measure (L) sizes in the simulation, n=30, 60, 80 and simulate 10 covariates independently from Up0, 1q. Let z be 10-dim vector of i.i.d Up0, 1q random variables independent of the covariates. Generation of errors in heteroscedastic case and outliers: Let fxi pǫi q “ pxi Npǫi ; 0, 1q ` qxi Npǫi ; 0, 5q where pxi “ Φpx1i zq. The outliers are simulated from the

model with error distribution fxoi p¨q, which is a mixture of truncated normal distri-

butions as follows. In the heteroscedastic case, fxoi pǫi q “ pxi TNp´8,3qYp3,8q pǫi ; 0, 1q ` qxi TNp´8,´3?5qYp3?5,8q pǫi ; 0, 5q where TNR p¨ ; µ, σ 2 q denotes a truncated normal dis-

tribution with mean µ and standard deviation σ over the region R. We consider the following five cases.

87

1. Case (i): yi “ 2.3 ` 5.7x1i ` ǫi , ǫi „ Np0, 1q with no outliers. 2. Case (ii): yi “ 2.3 ` 5.7x1i ` ǫi , ǫi „ 0.95Np0, 1q ` 0.05Np0, 10q. 3. Case (iii): yi “ 1.2 ` 5.7x1i ` 4.7x2i ` 0.12x3i ´ 8.9x4i ` 2.4x5i ` 3.1x6i ` 0.01x7i ` ǫi , ǫi „ fxi , with 5% outliers generated from fxoi pǫi q.

4. Case (iv): yi “ 1.2 ` 5.7x1i ` 3.4x21i ` 4.7xi2 ` 0.89x2i2 ` 0.12xi3 ´ 8.9xi4 xi8 ` 2.4xi5 xi9 ` 3.1xi6 ` x2i6 ` 0.01xi7 ` ǫi , ǫi „ fxi with 5% outliers generated from

fxoi pǫi q.

5. Case (v): yi “ 1.2 ` 5.7 sin x1i ` 3.4 exppx2i q ` 4.7 log |xi3 | ` ǫi , ǫi „ fxi with 5% outliers generated from fxoi pǫi q.

For each of the cases and for each sample size n, we took the first the training set and the next

n 2

n 2

samples as

samples as the test set. The hyperparameters are

specified as follows. 1. Heavy tailed parametric error distribution: We described the choice of the hyperparameters in Section 3.5. We took β0 “ 0, Σ0 “ 5I2 , αν “ 1, βν “ 1, a “ 0.5, b “ 0.5, ατ “ 5 and βτ “ 1. 2. Heteroscedastic PSB or sPSB process scale mixture on the residual density: β0 “ 0, Σ0 “ 5I2 , αν “ 1, βν “ 1, a “ 0.5, b “ 0.5, ατ “ 5, βτ “ 1, γκ “ 5, βκ “ 1, να “ 1 and k0 “ 10. We also compare the MSPE of the proposed methods with Lasso (Tibshirani, 1996), Bayesian additive regression trees (Chipman et al., 2010), and Treed Gaussian processes (Gramacy and Lee, 2008). The MCMC algorithms described in Section 3.5 are used to obtain samples from the posterior distribution. The results for model 1 given here are based on 20,000 samples obtained after a burn-in period of 3,000. The results for Model 2 and 3 are based on 20,000 samples obtained after a period of 7,000. 88

Table 3.1: Simulation results under homoscedastic residuals (Cases (i) and (ii)) n=40 MSPE

cov(y)a

Method 1 Method 2d Method 3e Lasso BART Treed GP n=60

0.2997 0.2821 0.2798 0.4651 0.3510 0.3042 MSPE

1 0.9980 1

Method 1 Method 2 Method 3 Lasso BART Treed GP n=80

0.2990 0.2769 0.2752 0.4715 0.3314 0.3000 MSPE

Method 1 Method 2 Method 3 Lasso BART Treed GP

0.2913 0.2592 0.2574 0.4318 0.3128 0.2886

c

0.6866 0.9134 cov(y) 1 0.9947 0.9963 0.6753 0.9193 cov(y) 1 0.9940 0.9956 0.6525 0.9301

Case (i) MISE cov(η)b 0.0248 0.0141 0.0144 0.1934 0.0714 0.0256 MISE

1 1 1

0.0246 0.0103 0.0104 0.1974 0.0539 0.0218 MISE

1 1 1

cov(η)

cov(η)

0.0252 1 0.0086 1 0.0069 1 0.1756 0.0437 0.0175

L

MSPE

cov(y)

Case (ii) MISE cov(η)

L

0.0017 0.6043 0.0015 0.5983 0.0015 0.5987 0.6410 0.7051 0.6968 L MSPE

1 0.0232 1 0.9740 0.0173 1 0.9745 0.0169 1 0.1080 0.7845 0.0950 0.9365 0.0803 cov(y) MISE cov(η)

0.0027 0.0019 0.0017

0.0019 0.5776 0.0017 0.5471 0.0016 0.5541 0.6702 0.6725 0.6880 L MSPE

1 0.95 0.95

0.0242 0.0143 0.0141 0.1194 0.7777 0.1098 0.9301 0.1198 cov(y) MISE

1 0.97 0.98

0.0023 0.0016 0.0016

cov(η)

L

0.0021 0.5583 0.0021 0.4989 0.002 0.4898 0.6569 0.6509 0.6532

1 0.97 0.98

0.0172 1 0.0050 1 0.0067 1 0.1150 0.7815 0.1098 0.9224 0.1031

L

0.0022 0.0014 0.0010

a

cov(y) denotes the coverage of the 95% predictive intervals of the test cases

b

cov(η) denotes the coverage of the 95% credible intervals of the mean regression function

c

GP on mean and t residual distribution

d

GP on mean and heteroscedastic PSB process scale mixtures as residual distribution

e

GP on mean and heteroscedastic sPSB process mixtures as residual distribution

Rapid convergence was observed based on diagnostic tests of Geweke (1992) and Raftery and Lewis (1992). In addition, the mixing was very good for model 1. For models 2 and 3, we use the label switching moves by Papaspiliopoulos and Roberts (2008), which lead to adequate mixing. Tables 6.1, 6.2 and 3.3 summarize the performance of all the methods based on 50 replicated datasets. Tables 6.1, 6.2 and 3.3 clearly show that in small samples both of the heteroscedastic methods (2 and 3) have substantially reduced MSPE and MISE relative to the heavy tailed parametric error model in most of the cases, interestingly even in the

89

Table 3.2: Simulation results under heteroscedastic residuals (Cases (iii) and (iv)) n=40

Case (iii) MISE cov(η)

L

MSPE

cov(y)

Case (iv) MISE cov(η)

MSPE

cov(y)

Method 1 Method 2 Method 3 Lasso BART Treed GP n=60

0.4833 0.2570 0.2586 0.3219 0.4639 0.3320 MSPE

1 0.3612 1 0.9990 0.1394 1 0.9990 0.1298 1 0.1970 0.8444 0.3413 0.7834 0.1979 cov(y) MISE cov(η)

0.0027 0.4416 0.0025 0.2783 0.0025 0.2712 0.3140 0.4103 0.3548 L MSPE

1 0.3274 1 0.9923 0.1583 0.98 0.9867 0.1501 0.97 0.1863 0.8833 0.2675 0.8268 0.2108 cov(y) MISE cov(η)

0.0029 0.0023 0.0017

Method 1 Method 2 Method 3 Lasso BART Treed GP n=80

0.2254 0.1744 0.1712 0.2958 0.3429 0.2047 MSPE

1 0.1154 1 0.9973 0.0572 1 0.9878 0.0567 1 0.1830 0.8546 0.2217 0.8349 0.0779 cov(y) MISE cov(η)

0.0023 0.2367 0.0020 0.2178 0.0016 0.2099 0.3025 0.3385 0.2611 L MSPE

1 1 1

1 0.97 0.98

0.0021 0.0019 0.0017

cov(η)

L

Method 1 Method 2 Method 3 Lasso BART Treed GP

0.1636 0.1509 0.1578 0.2592 0.2284 0.1655

1 0.0454 1 0.9976 0.0373 0.95 0.9931 0.0324 1 0.1437 0.9265 0.1098 0.8876 0.0427

0.0018 0.1855 0.0015 0.1653 0.0013 0.1614 0.2798 0.2491 0.2022

1 1 1

0.1067 0.0562 0.0656 0.1543 0.9122 0.1799 0.8867 0.0899 cov(y) MISE

L

L

0.0346 1 0.0019 0.0321 0.9952 0.0014 0.0312 0.9932 0.0010 0.1373 0.9490 0.1083 0.8923 0.0548

homoscedastic cases. This may be because discrete mixture of Gaussians better approximate a single normal than a t-distribution in small samples. Methods 2 and 3 also did a better job than method 1 in allowing uncertainty in estimating the mean regression and predicting the test sample observations. In some cases, the heavy tailed t-residual distribution results in overly conservative predictive and credible intervals. As seen from the value of the influence statistic, the heteroscedatic PSB process mixtures result in more robust inference compared to the parametric error model, the sPSB process mixture of normals being more robust than the symmetric and unimodal version. As the sample size increases, the difference in the predictive performances between the parametric and the nonparametric models is reduced and in some cases the parametric error model performs as well as the nonparametric approaches, which is as expected given the Central Limit Theorem.

90

Table 3.3: Simulation results under heteroscedastic residuals (Case (v)) n=40

MSPE

cov(y)

MISE

cov(η)

L

Method 1 Method 2 Method 3 Lasso BART Treed GP n=60

0.6666 0.5233 0.5231 0.3713 0.4956 0.7224 MSPE

0.9800 0.5856 1 0.0033 0.9770 0.3980 0.9812 0.0025 0.9854 0.3745 0.9765 0.0019 0.2871 0.8980 0.4013 0.8123 0.6132 cov(y) MISE cov(η) L

Method 1 Method 2 Method 3 Lasso BART Treed GP n=80

0.3828 0.3745 0.3767 0.3532 0.3930 0.4225 MSPE

1 0.2911 0.9985 0.0031 0.9832 0.2617 0.9840 0.0022 0.9812 0.2601 0.9867 0.0020 0.2616 0.9313 0.2668 0.9023 0.3217 cov(y) MISE cov(η) L

Method 1 Method 2 Method 3 Lasso BART Treed GP

0.3599 0.3503 0.3519 0.4505 0.3594 0.4489

0.9901 0.2759 0.9998 0.0029 0.9762 0.2582 0.9765 0.0022 0.9712 0.2545 0.9715 0.0019 0.2751 0.9442 0.2867 0.9125 0.3509

Table 3.4: Boston housing data and body fat data results Boston housing data Methods MSPE cov(y)

L

Method 1 Method 2 Method 3 Lasso BART Treed GP

0.0034 0.9894 0.0027 0.9901 0.0020 0.9863 0.9909 0.9836 0.9524

a

0.0012 0.0013 0.0016 0.0015 0.0024 0.0053

0.99 0.99 0.99 0.92 0.91

corr(Ytest, Ypred)a

body fat data MSPE cov(y)

L

0.0055 0.0031 0.0029 0.0184 0.0355 0.1526

0.0020 0.9972 0.0017 0.9984 0.0017 0.9989 0.9909 0.9655 0.9250

1 1 1 0.95 0.98

corr(Ytest, Ypred)

corr(Ytest, Ypred) denotes the sample correlation between the test and predicted y


Table 3.1 shows that, in the simple linear model with normal homoscedastic errors, all the models perform similarly in terms of mean squared prediction error, though methods 2 and 3 are somewhat better than the rest. Also, in estimating the mean regression function in case (i), methods 2 and 3 performed better than all the other methods. In case (ii) (Table 3.1), methods 2 and 3 are most robust in terms of estimation and prediction in the presence of outliers. In cases (iii) and (iv), when the residual distribution is heteroscedastic, our methods 2 and 3 perform significantly better than the parametric model 1 in both estimation and prediction, since the heteroscedastic PSB mixture is very flexible in modeling the residual distribution. This is evident from the MSPE values under cases (iii) and (iv) in Table 3.2. Lasso did a poor job in estimating the mean regression function and in prediction, particularly in cases (iii) and (iv), where the underlying mean function is nonlinear. BART also failed to perform well in estimating the mean function in small samples in these cases. On the other hand, the GP-based approaches estimate the regression function quite well in these cases, with methods 2 and 3 performing better than the rest. Treed GP performed close to method 1 in estimation and prediction, as both methods place GP priors on the mean function and have a parametric error distribution. By not allowing heteroscedastic error variance, BART and Treed GP underestimate uncertainty in prediction, leading to overly narrow predictive intervals. In case (v) (Table 3.3), where the true model is generated using comparatively few true signals, Lasso and BART predicted slightly better than methods 2 and 3 in small samples. This may be because Lasso can pick up the true signals quite efficiently in an overly parsimonious model. However, as the sample size increased, Lasso performed poorly, while the GP prior on the mean can accommodate the nonlinearity, resulting in substantially better predictive performance.

3.7 Applications

3.7.1 Boston housing data application

To compare our proposed approaches to alternatives, we applied the methods to a commonly used data set from the literature, the Boston housing data. The response is the median value of owner-occupied homes (in thousands of dollars) in 506 census tracts in the Boston area, and there are 13 predictors (12 continuous, 1 binary) that might help to explain the variation in the median value across tracts. We predict the median value of the owner-occupied homes, taking the first 253 tracts as the training set and the remaining 253 as the test set. Out-of-sample predictive performance of our three methods is compared to competitors in Table 3.4. The parametric model 1, the heteroscedastic PSB process mixture models 2 and 3, and the Lasso perform very closely to each other in terms of prediction, and did better than BART and Treed GP; methods 1 and 2 even perform slightly better than method 3 and the Lasso. As in the simulation examples, BART and Treed GP underestimate the uncertainty in prediction. On the other hand, the predictive intervals of methods 1, 2 and 3 are more conservative and flexibly accommodate uncertainty in predicting regions with outliers. Also, model 3 appears to be more robust compared to models 1 and 2 in terms of the influence measure.

3.7.2 Body fat data application

With the increasing trend in obesity and concerns about associated adverse health effects, such as heart disease and diabetes, it has become even more important to obtain accurate estimates of body fat percentage. It is well known that body mass index, which is calculated based only on weight and height, can produce a misleading measure of adiposity, as it does not take into account muscle mass or variability in frame size. As a gold standard for measuring percentage of body fat, one can rely


on underwater weighing techniques; age and body circumference measurements have also been widely used as additional predictors. We consider a commonly used data set from Statlib (http://lib.stat.cmu.edu/datasets/bodyfat), which contains the following 15 variables: percentage of body fat (%), body density from underwater weighing (g/cm³), age (years), weight (lbs.), height (inches), and ten body circumferences (neck, chest, abdomen, hip, thigh, knee, ankle, biceps, forearm, wrist, all in cm). Percentage of body fat is given by Siri's (1956) equation:

$$\text{Percentage of body fat} = \frac{495}{\text{Density}} - 450.$$
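As a quick numerical aid, Siri's equation is trivial to evaluate; the one-liner below (a minimal sketch, not code from this dissertation) converts a measured body density to percent body fat.

```python
def siri_body_fat(density_g_cm3: float) -> float:
    """Siri's (1956) equation: percent body fat from body density (g/cm^3)."""
    return 495.0 / density_g_cm3 - 450.0

print(siri_body_fat(1.05))  # a density of 1.05 g/cm^3 gives about 21.4% body fat
```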

We predict the percentage of body fat (%), taking the first 126 records as the training set and the remaining 126 as the test set. We summarize the predictive performances in Table 3.4, which suggests that the nonparametric regression procedures 2 and 3, with their heteroscedastic residual distributions, predict the percentage of body fat better than the parametric model 1, BART, Lasso and Treed GP.

3.8 Discussion

We have developed a novel regression model that can accommodate a wide range of nonlinearity in the mean function, and at the same time can flexibly deal with outliers and heteroscedasticity. Based on preliminary simulation results, it appears that our method can outperform contemporary nonparametric regression methods, such as BART and treed Gaussian processes, with performance also better than the Lasso in certain linear regression settings. We also provide theoretical support for the proposed methodology when both the mean and the residuals are modeled nonparametrically. One possible future direction is to relax the symmetry assumption on the residual distribution and introduce a model for median regression based on conditional PSB

mixtures for allowing possibly asymmetric residual densities constrained to have zero median. Conditional DP mixtures are well known in the literature (Doss, 1985; Burr and Doss, 2005) and it is certainly interesting to extend our approach via a conditional PSB. In that way we can hope to obtain a more robust estimate of the regression function. It is challenging to extend our theoretical results to conditional PSB and develop a fast algorithm for computation. Another possible theoretical direction is to prove posterior consistency using heteroscedastic mixtures. Currently we only have results for the homoscedastic PSBP mixture.


4 Posterior consistency in conditional density estimation

4.1 Introduction

There is a rich literature on Bayesian methods for density estimation using mixture models of the form

$$y_i \sim f(\theta_i), \qquad \theta_i \sim P, \qquad P \sim \Pi, \tag{4.1}$$

where f(·) is a parametric density and P is an unknown mixing distribution assigned a prior Π. The most common choice of Π is the Dirichlet process prior, first introduced by Ferguson (1973a, 1974a). Barron et al. (1999b) and Ghosal et al. (1999) used upper bracketing and L1-metric entropy bounds, respectively, to derive sufficient conditions on the prior on f and the true data-generating f for obtaining strong posterior consistency in Bayesian density estimation. Ghosal et al. (1999) also provided sufficient conditions for posterior consistency in univariate density estimation using Dirichlet process location mixtures of normals. Tokdar (2006a) significantly relaxed their conditions in a Dirichlet process location-scale mixture of normals setting, requiring existence of only weak moments of the true f. Ghosal and van der Vaart

(2001, 2007b) provided rates of convergence for Bayesian univariate density estimation using a Dirichlet process mixture of normals. Bhattacharya and Dunson (2010) provided conditions for strong consistency of kernel mixture priors for densities on compact metric spaces and manifolds. Recent literature has focused on generalizing model (4.1) to the density regression setting, in which the entire conditional distribution of y given x changes flexibly with predictors. Bayesian density regression views the entire conditional density f(y | x) as a function-valued parameter and allows its center, spread, skewness, modality and other such features to vary with x. For data {(y_i, x_i), i = 1, ..., n}, let

$$y_i \mid x_i \sim f(\cdot \mid x_i), \qquad \{f(\cdot \mid x),\, x \in \mathcal{X}\} \sim \Pi_{\mathcal{X}}, \tag{4.2}$$

where X is the predictor space and Π_X is a prior for the class of conditional densities {f_x, x ∈ X} indexed by the predictors. Refer, for example, to Müller et al. (1996); Griffin and Steel (2006a, 2010); Dunson et al. (2007a); Dunson and Park (2008a); Chung and Dunson (2009) and Tokdar et al. (2010a), among others. The primary focus of this recent development has been mixture models of the form

$$f(y \mid x) = \sum_{h=1}^{\infty} \pi_h(x)\, \phi\bigg\{ \frac{y - \mu_h(x)}{\sigma_h} \bigg\}, \tag{4.3}$$

where φ is the standard normal density, {π_h(x), h = 1, 2, ...} are predictor-dependent probability weights that sum to one almost surely for each x ∈ X, and (μ_h, σ_h) ~ G_0 independently, with G_0 a base probability measure on F_X × ℜ⁺, F_X ⊂ ℜ^X, the space of all X → ℜ functions. However, there is a dearth of results on support properties of prior distributions for conditional distributions, and on general theorems providing conditions for weak and strong posterior consistency. To our knowledge, only Barrientos et al. (2011) have considered formalizing the notions of

weak and KL-support for dependent stick-breaking processes. We focus on a broad class of generalized stick-breaking processes, which express the probability weights π_h(x) in stick-breaking form, with the stick lengths constructed through mapping continuous stochastic processes to the unit interval using a monotone differentiable link function. This class includes dependent Dirichlet processes (MacEachern, 1999) as a special case. To our knowledge, only a few papers have considered posterior consistency in conditional density estimation. Tokdar et al. (2010a) consider posterior consistency in estimating conditional distributions, focusing exclusively on logistic Gaussian process priors (Tokdar and Ghosh, 2007). Such priors have beautiful theoretical properties but lack the computational simplicity of the countable mixture priors in (4.3). In addition, (4.3) has the appealing side effect of inducing predictor-dependent clustering, which is often of interest in itself and is an aid to interpretation and inference. Yoon (2009) considers posterior consistency in conditional distribution estimation through a limited-information approach, approximating the likelihood by the quantiles of the true distribution. Tang and Ghosal (2007a,b) provide sufficient conditions for showing posterior consistency in estimating an autoregressive conditional density and a transition density, rather than regression with respect to another covariate. In this chapter, focusing on model (4.3), we initially provide sufficient conditions on the prior and true data-generating model under which the prior leads to weak and various types of strong posterior consistency. In this context, we first define notions of weak and L1-integrated neighborhoods. We then show that the sufficient conditions are satisfied for a novel class of generalized stick-breaking priors that construct the stick-breaking lengths through mapping continuous stochastic processes to the unit interval using a monotone differentiable link function. The theory is illustrated through application to a model relying on probit transformations of Gaussian processes, an approach related to the probit stick-breaking process of Chung and Dunson

(2009) and Rodriguez and Dunson (2011b). We also consider Gaussian mixtures of fixed-π dependent processes (MacEachern, 1999; De Iorio et al., 2004). Norets and Pelenis (2010) showed posterior consistency in conditional density estimation using kernel stick-breaking process mixtures of Gaussians in a very recent unpublished article. They approximated a conditional density by a smooth mixture of linear regressions, as in Norets (2010), to demonstrate the KL property. In this chapter, we show KL support using a more direct approach of approximating the true density by a kernel mixture of a compactly supported conditional measure. The fundamental contribution of this chapter is developing a novel class of prior distributions which has large support in the space of conditional densities and also leads to a consistent posterior. In doing so, a key technical contribution is the development of a novel method of constructing a sieve for the proposed class of priors. It has been noted by Wu and Ghosal (2010) that the usual method of constructing a sieve by controlling prior probabilities is unable to yield a consistency theorem in the multivariate case, because of the explosion of the L1-metric entropy with increasing dimension. They developed a technique specific to the Dirichlet process in the multivariate case for showing weak and strong posterior consistency. The proposed sieve avoids the pitfall mentioned by Wu and Ghosal (2010) in showing consistency using multivariate mixtures. Our sieve construction has been applied to a variety of settings for studying posterior consistency and convergence rates: adaptive Bayesian multivariate density estimation (Shen and Ghosal, 2011; Tokdar, 2011b), sparse multivariate mixtures of factor analyzers (McLachlan and Peel, 2000), and probability tensor decomposed models for categorical data analysis (Bhattacharya and Dunson, 2011).

(A similar sieve appears in Norets and Pelenis (2010), with a citation to an earlier draft of our paper.)


4.2 Conditional density estimation

In this section, we will define the space of conditional densities and construct a prior on this space. It is first necessary to generalize the topologies to allow appropriate neighborhoods to be constructed around an uncountable collection of conditional densities indexed by predictors. With such neighborhoods in place, we then state our main theorems providing sufficient conditions under which various modes of posterior consistency hold for a broad class of predictor-dependent mixtures of Gaussian kernels. Let Y = ℜ be the response space and X be the covariate space, a compact subset of ℜ^p. Unless otherwise stated, we will assume X = [0, 1]^p without loss of generality. Let F denote the space of densities on X × Y w.r.t. the Lebesgue measure, and let F_d denote a subset of the space of conditional densities satisfying

$$\mathcal{F}_d = \bigg\{ g : \mathcal{X}\times\mathcal{Y} \to (0,\infty),\ \int_{\mathcal{Y}} g(x, y)\, dy = 1\ \forall\, x \in \mathcal{X},\ x \mapsto g(x, \cdot) \text{ continuous as a function from } \mathcal{X} \to L_1(\lambda, \mathcal{Y}) \bigg\}.$$

Suppose y_i is observed independently given the covariates x_i, i = 1, 2, ..., which are drawn independently from a probability distribution Q on X. Assume that Q admits a density q with respect to the Lebesgue measure. If we define h(x, y) = q(x) f(y | x) and h_0(x, y) = q(x) f_0(y | x), then h, h_0 ∈ F. Throughout the chapter, h_0 is assumed to be a fixed density in F, which we alternatively refer to as the true data-generating density, and {f_0(· | x), x ∈ X} is referred to as the true conditional density. The density q(x) is needed only for the theoretical investigation; in practice, we do not need to know it or learn it from the data. We propose to induce a prior Π_X on the space of conditional densities through a

prior P_X for a collection of mixing measures G_X = {G_x, x ∈ X}, using the following predictor-dependent mixture of kernels:

$$f(y \mid x) = \int \frac{1}{\sigma}\,\phi\bigg( \frac{y - \mu}{\sigma} \bigg)\, dG_x(\psi), \tag{4.4}$$

where ψ = (μ, σ), and

$$G_x = \sum_{h=1}^{\infty} \pi_h(x)\, \delta_{\{\mu_h(x),\, \sigma_h\}}, \qquad (\mu_h, \sigma_h) \sim G_0, \tag{4.5}$$

where the π_h(x) ≥ 0 are random functions of x such that Σ_{h=1}^∞ π_h(x) = 1 a.s. for each

fixed x ∈ X. The {μ_h(x), x ∈ X}_{h=1}^∞ are i.i.d. realizations of a real-valued stochastic process, i.e., G_0 is a probability distribution over F_X × ℜ⁺, where F_X ⊂ ℜ^X, ℜ^X being the space of functions from X to ℜ. Hence for each x ∈ X, G_x is a random probability measure over the measurable Polish space (ℜ × ℜ⁺, B(ℜ × ℜ⁺)). We are interested in Bayesian posterior consistency for a broad class of predictor-dependent stick-breaking mixtures, including the following two important special cases.

4.2.1 Predictor-dependent mixtures of Gaussian linear regressions

We define the predictor-dependent countable mixtures of Gaussian linear regressions (MGLRx) as

$$f(y \mid x) = \int \frac{1}{\sigma}\,\phi\bigg( \frac{y - x'\beta}{\sigma} \bigg)\, dG_x(\beta, \sigma), \qquad G_x = \sum_{h=1}^{\infty} \pi_h(x)\, \delta_{(\beta_h, \sigma_h)}, \quad (\beta_h, \sigma_h) \sim G_0, \tag{4.6}$$

where the π_h(x) ≥ 0 are random functions of x such that Σ_{h=1}^∞ π_h(x) = 1 a.s. for each

fixed x ∈ X, and G_0 = G_{0,β} × G_{0,σ} is a probability distribution on ℜ^p × ℜ⁺, where G_{0,β} and G_{0,σ} are probability distributions on ℜ^p and ℜ⁺, respectively. For a particular choice of the π_h(x)'s, we obtain the probit stick-breaking mixtures of Gaussians, which have been previously applied by Chung and Dunson (2009), Rodriguez and Dunson (2011b) and Pati and Dunson (2010); the latter two articles considered probit transformations of Gaussian processes in constructing the stick-breaking weights.

4.2.2 Gaussian mixtures of fixed-π dependent processes

In (4.4), set G_x as in (4.5) with π_h(x) ≡ π_h for all x ∈ X, where the π_h ≥ 0 are random probability weights with Σ_{h=1}^∞ π_h = 1 a.s., and the {μ_h(x), x ∈ X}_{h=1}^∞ are as in (4.5). Examples include fixed-π dependent Dirichlet process mixtures of Gaussians (MacEachern, 1999). Versions of the fixed-π DDP have been applied to ANOVA (De Iorio et al., 2004), survival analysis (De Iorio et al., 2009; Jara et al., 2010), spatial modeling (Gelfand et al., 2005), and many more.
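To make the MGLRx construction concrete, the sketch below draws a conditional density from a truncated version of the prior in (4.6), with probit stick-breaking weights built from Gaussian process transforms as used later in Section 4.4. It is illustrative only: the truncation level H, the squared-exponential GP covariance, and the normal/inverse-gamma base measure are assumptions made for the example, not choices fixed by the theory.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sq_exp_cov(x, tau2=1.0, A=1.0):
    # Squared-exponential covariance tau^2 exp(-A |x - x'|^2) on a 1-d grid
    return tau2 * np.exp(-A * (x[:, None] - x[None, :]) ** 2)

def sample_mglrx(x, H=20):
    """One draw of (weights, atoms) from a truncated MGLRx prior with p = 1."""
    K = sq_exp_cov(x) + 1e-8 * np.eye(x.size)
    alpha = rng.multivariate_normal(np.zeros(x.size), K, size=H)  # GP draws alpha_h
    V = norm.cdf(alpha)                                           # probit "sticks"
    stick = np.cumprod(np.vstack([np.ones((1, x.size)), 1 - V[:-1]]), axis=0)
    pi = V * stick                     # pi_h(x); residual mass beyond H is ignored
    beta = rng.normal(0.0, 1.0, (H, 2))            # (intercept, slope) atoms
    sigma = 1.0 / np.sqrt(rng.gamma(2.0, 1.0, H))  # sigma_h from inverse-gamma sigma^2
    return pi, beta, sigma

x = np.linspace(0, 1, 50)
pi, beta, sigma = sample_mglrx(x)
y = np.linspace(-4, 4, 200)
mu = np.column_stack([np.ones_like(x), x]) @ beta.T               # x'beta_h, (50, H)
dens = norm.pdf(y[None, :, None], mu[:, None, :], sigma[None, None, :])
f_y_given_x = np.einsum('hx,xyh->xy', pi, dens)   # f(y | x) on the (x, y) grid
```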

4.3 Notions of neighborhoods in conditional density estimation

We define the weak and ν-integrated L1 neighborhoods of the collection of conditional densities {f_0(· | x), x ∈ X} in the following. A sub-base of a weak neighborhood is defined as

$$W_{\epsilon,g}(f_0) = \bigg\{ f : f \in \mathcal{F}_d,\ \bigg| \int_{\mathcal{X}\times\mathcal{Y}} g\,h - \int_{\mathcal{X}\times\mathcal{Y}} g\,h_0 \bigg| < \epsilon \bigg\}, \tag{4.7}$$

for a bounded continuous function g : Y × X → ℜ. A weak neighborhood base is formed by finite intersections of neighborhoods of the type (4.7). Define a ν-integrated L1 neighborhood

$$S_{\epsilon}(f_0; \nu) = \bigg\{ f : f \in \mathcal{F}_d,\ \int \| f(\cdot \mid x) - f_0(\cdot \mid x) \|_1\, \nu(x)\, dx < \epsilon \bigg\} \tag{4.8}$$

for any measure ν with supp(ν) ⊂ X. Observe that under the topology in (4.8), F_d can be identified with a closed subset of L1(λ × ν, Y × supp(ν)), making it a complete separable metric space. Thus measurability issues will not arise with these topologies. Although the ν-integrated L1 topology might not seem a natural metric for conditional densities at the outset, we refer the reader to two observations in Section 4 of Tokdar (2011a) that explain why it is a reasonable choice. In the following, we define the Kullback-Leibler (KL) property of Π_X at a given f_0 ∈ F_d. Note that we define a KL-type neighborhood around the collection of conditional densities f_0 through a KL neighborhood around the joint density h_0, while keeping Q fixed at its true unknown value.

Definition 40. For any f_0 ∈ F_d such that h_0(x, y) = q(x) f_0(y | x) is the true joint data-generating density, we define an ε-sized KL neighborhood around f_0 as

$$K_\epsilon(f_0) = \{ f : f \in \mathcal{F}_d,\ \mathrm{KL}(h_0, h) < \epsilon,\ h(x, y) = q(x) f(y \mid x)\ \forall\, y \in \mathcal{Y},\, x \in \mathcal{X} \},$$

where KL(h_0, h) = ∫ h_0 log(h_0/h). Then Π_X is said to have the KL property at f_0 ∈ F_d, denoted f_0 ∈ KL(Π_X), if Π_X{K_ε(f_0)} > 0 for any ε > 0.

We recall the definitions of various modes of posterior consistency through y^n = (y_1, ..., y_n) and x^n = (x_1, ..., x_n).

Definition 41. The posterior Π_X(· | y^n, x^n) is consistent weakly, or strongly in the ν-integrated L1 topology, at {f_0(· | x), x ∈ X} if Π_X(U^c | y^n, x^n) → 0 a.s. for any ε > 0, with U = W_ε(f_0) and U = S_ε(f_0; ν), respectively.

Here a.s. consistency at {f_0(· | x), x ∈ X} means that the posterior distribution concentrates around a neighborhood of {f_0(· | x), x ∈ X} for almost every sequence {y_i, x_i}_{i=1}^∞ generated by i.i.d. sampling from the joint density q(x) f_0(y | x). Another definition we require for showing the KL support is the notion of a weak neighborhood of a collection of mixing measures G_X = {G_x, x ∈ X}, where G_x is a probability measure on S × ℜ⁺ for each x ∈ X. Here S = ℜ^p or ℜ, depending on the cases considered above. We formulate the notion of a sub-base of the weak neighborhood of G_X = {G_x, x ∈ X} below.

Definition 42. For a bounded continuous function g : S × ℜ⁺ × X → ℜ and ε > 0, a sub-base of the weak neighborhood of a conditional probability measure {F_x, x ∈ X} is defined as

$$\bigg\{ \{G_x, x \in \mathcal{X}\} : \bigg| \int_{S \times \Re^+ \times \mathcal{X}} g(s, \sigma, x)\, dG_x(s, \sigma)\, q(x)\, dx - \int_{S \times \Re^+ \times \mathcal{X}} g(s, \sigma, x)\, dF_x(s, \sigma)\, q(x)\, dx \bigg| < \epsilon \bigg\}. \tag{4.9}$$

A conditional probability measure {G_x, x ∈ X} lies in the weak support of P_X if P_X assigns positive probability to every basic neighborhood generated by the sub-base of the type (4.9). In the sequel, we will also consider a neighborhood of the form

$$\bigg\{ \{G_x, x \in \mathcal{X}\} : \sup_{x \in \mathcal{X}} \bigg| \int_{S \times \Re^+} \big\{ g(s, \sigma)\, dG_x(s, \sigma) - g(s, \sigma)\, dF_x(s, \sigma) \big\} \bigg| < \epsilon \bigg\} \tag{4.10}$$

for a bounded continuous function g : S × ℜ⁺ → ℜ.

4.4 Posterior consistency in MGLRx mixtures of Gaussians

4.4.1 Kullback-Leibler property

We will work with a specific choice of P_X motivated by the probit stick-breaking process construction in Chung and Dunson (2009), but using Gaussian process transforms instead of Gaussian transforms. Let

$$\pi_h(x) = \Phi\{\alpha_h(x)\} \prod_{l < h} \big[ 1 - \Phi\{\alpha_l(x)\} \big], \tag{4.11}$$

where α_h ~ GP(0, c_h), for h = 1, 2, ..., ∞. Assume the following hold:

S1. c_h is chosen so that α_h ~ GP(0, c_h) has continuous path realizations, and

S2. for any continuous function g : X → ℜ and any ε > 0,
$$\mathcal{P}_{\mathcal{X}}\Big\{ \sup_{x \in \mathcal{X}} |\alpha_h(x) - g(x)| < \epsilon \Big\} > 0, \qquad h = 1, \ldots, \infty.$$

S3. G_0 is absolutely continuous with respect to λ(ℜ^p × ℜ⁺).

Consider the subset F_d* ⊂ F_d satisfying the following conditions:

A1. f is nowhere zero and bounded by M < ∞.

A2. $\big| \int_{\mathcal{X}}\int_{\mathcal{Y}} f(y \mid x) \log f(y \mid x)\, dy\, q(x)\, dx \big| < \infty$.

A3. $\big| \int_{\mathcal{X}}\int_{\mathcal{Y}} f(y \mid x) \log \frac{f(y \mid x)}{\psi_x(y)}\, dy\, q(x)\, dx \big| < \infty$, where $\psi_x(y) = \inf_{t \in [y-1,\, y+1]} f(t \mid x)$.

A4. There exists η > 0 such that $\int_{\mathcal{X}}\int_{\mathcal{Y}} |y|^{2(1+\eta)} f(y \mid x)\, dy\, q(x)\, dx < \infty$.

A5. (x, y) ↦ f(y | x) is jointly continuous.

πh : X Ñ r0, 1s continuous and ψ is Gaussian or t with greater than 2 degrees of

freedom.

105

The following theorem characterizes the subset of Fd for which ΠX has the KL property. The proof of Theorem 44 is provided in Appendix A. Theorem 44. f0 P KLpΠX q for each f0 in Fd˚ if PX satisfies S1-S3. Remark 45. The conditions are satisfied for a class of generalized stick-breaking process mixtures in which the stick-breaking lengths are constructed through mapping continuous stochastic processes to the unit interval using a monotone differentiable link function. To prove Theorem 44, we need several auxiliary results related to the support of the prior PX which might be of independent interest. The key idea for showing that the true f0 satisfies ΠX tKǫ pf0 qu ą 0 for any ǫ ą 0 is to impose certain tail ş ` 1β ˘ ˜ x pβ, σq, conditions on f0 py | xq and approximate it by f˜py | xq “ σ1 φ y´x dG σ

˜ x , x P X u is compactly supported. Observe that, where tG KLph0 , hq “

ż ż X

Y

f0 py | xq log

ż ż X

Y

f0 py | xq dyqpxqdx ` f˜py | xq

f0 py | xq log

(4.12)

f˜py | xq dyqpxqdx. f py | xq

We construct such an f˜ in Theorem 44 which makes the first term in the right hand side of (4.12) sufficiently small.

The following lemma

(which is similar to Lemma 3.1 in Tokdar (2006a) and Theorem 3 in Ghosal et al. (1999)) guarantees that the second term in the right hand side of (4.12) is also sufficiently small if tGx , x P X u lies inside a finite intersection of neighbor-

˜ x , x P X u of the type (4.10). hoods of tG

ş ş Lemma 46. Assume that f0 P Fd satisfies X Y y 2 f0 py | xqdyqpxqdx ă 8. Suppose ş ` 1β ˘ ˜ x pβ, σq, where D a ą 0 and 0 ă σ ă σ such that dG f˜py | xq “ σ1 φ y´x σ ˘ ` ˜ x r´a, asp ˆ pσ, σq “ 1 @ x P X , G 106

(4.13)

˜ x has compact support for each x P X . Then given any ǫ ą 0, D a finite so that G

˜ x , x P X u of the type (4.10) such that for any intersection W of neighborhoods of tG ş ` 1β ˘ conditional density f py | xq “ σ1 φ y´x dGx pβ, σq, x P X , with tGx , x P X u P W , σ ż ż X

Y

f0 py | xq log

f˜py | xq dyqpxqdx ă ǫ. f py | xq

(4.14)

The proof of Lemma 46 is provided in Appendix A. In order to ensure that the weak support of ΠX is sufficiently large to contain all densities f˜ satisfying the assumptions of Lemma 46, we define a collection of fixed conditional probability measures on pℜp ˆ ℜ` , Bpℜp ˆ ℜ` qq denoted by GX˚ satisfying 1. x ÞÑ Fx pBq is a continuous function of x P X @ B P Bpℜp ˆ ℜ` q. 2. For any sequence of sets An Ă ℜp ˆ ℜ` Ó H, supxPX Fx pAn q Ó 0. Next we state the theorem characterizing the weak support of PX which will be proved in Appendix A. Theorem 47. If PX satisfies S1-S3, then any tFx , x P X u P GX˚ lies in the weak support of PX . Corollary 48. Assume S1-S3 hold and assume Fx P GX˚ is compactly supported, i.e., there exists a, σ, σ ą 0 such that Fx pr´a, asp ˆ rσ, σsq “ 1. Then for a bounded uniformly continuous function g : ℜp ˆ ℜ` Ñ r0, 1s satisfying gpβ, σq Ñ 0 as }β} Ñ 8, σ Ñ 8, ˇż " ˇ PX tGx , x P X u : sup ˇˇ xPX

ℜp ˆℜ`

ˇ (ˇ gpβ, σqdGx pβ, σq ´ gpβ, σqdFx pβ, σq ˇˇ

* ă ǫ ą 0.

107

(4.15)

Proof. The proof is similar to Theorem 47 with the L1 convergence in (A.6) replaced by convergence uniformly in x. This is because under the assumptions of Corolř ˜k,n qFx pAk,n q lary 48, the uniformly continuous sequence of functions nk“1 gpβ˜k,n, σ ş on X monotonically decreases to C gpβ, σqdFx pβ, σq as n Ñ 8 where C is given by r´a, asp ˆ rσ, σs.

The proof of the following corollary is along the lines of the proof of Theorem 47 and is omitted here. Corollary 49. Under the assumptions of Corollary 48 for any k0 ě 1, PX

"

0 Xkj“1

Uj

*

ą 0,

(4.16)

where Uj ’s are neighborhoods of the type (4.15). 4.4.2 Strong Consistency with the q-integrated L1 neighborhood To obtain strong consistency in the q-integrated L1 topology, we need a very straightforward extension of Theorem 2 of Ghosal et al. (1999) below. Theorem 50. Suppose f0 P KLpΠX q and there exists subsets Fn Ă Fd with 1. log Npǫ, Fn , }¨}1 q “ opnq, 2. ΠX pFnc q ď c2 e´nβ2 for some c2 , β2 ą 0, then the posterior is strongly consistent with respect to the q-integrated L1 neighborhood. Before stating the main theorem on strong consistency, we consider a hierarchical extension of MGLRx where the bandwidths are taken to be random. We define a sequence of random inverse-bandwidths Ah of the Gaussian process αh , h ě 1 each having ℜ` as its support. Since the first few atoms suffice to explain most 108

of the dependence of y on x, we expect that the variability due to the covariate in the stochastic process Φtαh u decreases as h increases. This is achieved through a carefully chosen prior for the covariance kernel ch of the Gaussian process αh . Let α0 denote the base Gaussian process on r0, 1sp with covariance kernel 1{2

1 2

c0 px, x1 q “ τ 2 e´||x´x || . Then αh pxq “ α0 pAh xq for each x P X . The variability 1{2

of αh with respect to the covariate is shrunk or stretched to the rectangle r0, Ah sp as Ah decreases or increases. Ah ’s are constructed to be stochastically decreasing to δ0 in the following manner. We assume that there exist η, η0 ą 0 and a sequence δn “ Opplog nq2 {n5{2 q such that P pAh ą δn q ď expt´n´η0 hpη0 `2q{η log hu for each h ě 1.

Also assume that there exists a sequence rn Ò 8 such that

rnp nη plog nqp`1 “ opnq and P pAh ą rn q ď e´n . We will discuss how to construct such a sequence of random variables in the Remark 53 following Theorem 51. The following theorem provides sufficient conditions for strong posterior consistency in the q-integrated L1 topology. The proof is provided in Appendix A. Theorem 51. Let πh ’s satisfy (4.11) with αh „ GP p0, ch q where ch px, x1 q “ 1 2

τ 2 e´Ah }x´x } , h ě 1, τ 2 ą 0 fixed. C1. There exists sequences an , hn Ò 8, ln Ó 0 with

an ln

“ Opnq, hlnn “ Open q, and

constants d1 , d2 ą 0 such that G0 tBp0; an q ˆ rln , hn suc ă d1 e´d2 n for some d1 , d2 ą 0. C2. Ah ’s are constructed as in the last paragraph before Lemma 56. then f0 P KLpΠX q implies that ΠX achieves strong posterior consistency in qintegrated L1 topology at f0 . Remark 52. Verification of condition C1 of Theorem 51 is particularly simple. For example, if G0 is a product of multivariate normals on β and an inverse Gamma ? prior on σ 2 , the condition C1 is satisfied with an “ Op nq, hn “ en , ln “ Op ?1n q. It 109

follows from van der Vaart and van Zanten (2009) that f0 P KLpΠX q is still satisfied when we have the additional assumptions C1-C2 together with S1-S3 on the prior ΠX . Remark 53. Since we need rnp nη plog nqp`1 “ opnq, rnp can be chosen to be Opnη1 q for some 0 ă η1 ă 1. Let d be such that dη1 {p ě 1 and set η0 “ 3d. Let Ah “ ch Bh , where Bhd „ Exppλq and ch “ php3d`2q{η log hq´1{d for any 0 ă η ă 1. Then P pAh ą nη1 {p q ď P pBh ą nη1 {p q ď e´n

dη1 {p

ď e´n and P pAh ą plog nq2 n´5{2 q ď

expt´n´3d hp3d`2q{η log hu. Remark 54. The theory of strong posterior consistency can be generalized to an arbitrary monotone differentiable link function L : ℜ ÞÑ r0, 1s which is Lipschitz, i.e., there exists a constant K ą 0 such that |Lpxq ´ Lpx1 q| ď K |x ´ x1 | for all x, x1 P X . Below we will develop several auxiliary results required to prove Theorem 51. They are stated below as some of them might be of independent interest. Let ` 1β ˘ for y P Y and x P X . From Tokdar (2006a), we obtain φβ,σ px, yq :“ σ1 φ y´x σ for σ2 ą σ1 ą ż

σ2 2

and for each x P X ,

ˆ ˙1{2 ? }β2 ´ β1 } p 3pσ2 ´ σ1 q 2 |φβ1 ,σ1 px, yq ´ φβ2 ,σ2 px, yq| dy ď ` π σ2 σ1 Y

Construct a sieve for pβ, σq as ( Θa,h,l “ φβ,σ : }β} ď a, l ď σ ď h .

(4.17)

In the following Lemma, we provide an upper bound to NpΘa,h,l , ǫ, dSS q. The proof is omitted as it follows trivially from Lemma 4.1 in Tokdar (2006a). Lemma 55. There exists constants d1 , d2 ą 0 such that NpΘa,h,l , ǫ, dSS q ď d1 d2 log hl ` 1. 110

` a ˘p l

`

In the proof of Theorem 51, we will verify the sufficient conditions of TheoWe calibrate Fd by a carefully chosen sequence of subsets Fn Ă Fd . ş The fundamental problem with mixture models Npy; µ, σ 2 Ip qdP pµq in estimat-

rem 50.

ing a multivariate density lies in attempting to compactify the model space by ş t Npy; µ, σ 2 Ip qdP pµq : P pp´an , an sp q ą 1 ´ δu for each σ leading to an en-

tropy apn growing exponentially with the dimension p. Here we marginalize P ş ř n 2 in Npy; µ, σ 2Ip qdP pµq to yield the following construction t m h“1 πh Npy; µh , σ Ip q : ř ||µh || ď an , h “ 1, . . . , mn , 8 h“mn `1 πh ă ǫu leading to an entropy mn log an where ř mn is related to the tail-decay of P p 8 h“mn `1 πh ą ǫq. With this idea in place, we extend the construction of Fn for conditional densities below.

Assume ǫ ą 0 is given. Let Ha1 denote a unit ball in the RKHS of the covariance 1 2

kernel τ 2 e´a}x´x } and B1 is a unit ball in Cr0, 1sp. For numbers M, m, r, δ, construct a sequence of subsets tBh , h “ 1, . . . , mu of Cr0, 1sp as follows. Bh “

#`

a ˘ ` ˘ M r{δHr1 ` mǫ2 B1 Y Yaăδ MHa1 ` mǫ2 B1 , if h “ 1, . . . , mη Yaăδn Mn Ha1 ` mǫ2 B1 , if h “ mη ` 1, . . . , m.

The idea is to construct Fn

" ˆ ˙ 8 ÿ 1 y ´ x1 βh n “ f : f py | xq “ πh pxq φ , tφβh ,σh um h“1 σ σ h h h“1 P Θan ,hn ,ln , αh P Bh,n , h “ 1, . . . , mn , sup

ÿ

xPX hěm `1 n

* πh pxq ď ǫ .

(4.18)

for appropriate sequences am , ln , hn , Mn , mn , rn , δn to be chosen in the proof of Theorem 51. The following lemma is also crucial to the proof of Theorem 51 which allows us to calculate the rate of decay of P psupxPX πh pxq ą ǫq with mn . Lemma 56. Let πh ’s satisfy (4.11) with αh „ GP p0, ch q where ch px, x1 q “ 111

1 2

τ 2 e´Ah }x´x } , h ě 1, τ 2 ą 0 fixed. Then for some constant C7 ą 0, ˜› › › ΠX › ›

8 ÿ

h“mn `1

› › › πh › ›

ąǫ 8

¸

m ÿn

ď e´C7 mn log mn `

P pAh ą δn q.

(4.19)

h“mηn `1

Proof. Let Wh “ ´ logr1 ´ Φtαh1 us where αh1 “ inf xPX αh pxq, Zh „ Gap1, γ0 q. We will choose an appropriate value for γ0 in the sequel. Let t0 “ ´ log ǫ ą 0. Observe that › ˜› ¸ ˆ ˙ 8 mn › ÿ › ź › › ΠX › πh › ą ǫ “ ΠX sup r1 ´ Φtαh pxqus ą ǫ ›h“m `1 › xPX h“1 ď ΠX

ˆ ź mn

n

8

t1 ´

Φpαh1 qu

h“mηn `1

˙ ˆ ą ǫ “ ΠX ´

mn ÿ

h“mηn `1

logt1 ´

Φpαh1 qu

˙ ă t0 .

Note that if we had αh pxq ” αh „ Np0, 1q, then the right hand side above equals ΠX

ˆ

´

mn ÿ

h“1

logt1 ´ Φpαh qu ă t0

˙

“ ΠX pΛh ă t0 q

where Λh „ Gapmn , 1q. Then its easy to show that ΠX pΛh ă t0 q À e´mn log mn . However, the calculation gets complicated when αh ’s are i.i.d realizations of a zero mean Gaussian process. The proof relies on the fact that the supremum of Gaussian processes has sub-Gaussian tails. Below we calculate the rate of decay of ΠX

ˆ

˙ › › with mn . We h“mn `1 πh 8 ą ǫ

›ř8 ›

will show that there exists γ0 , depending on ǫ and τ but not depending on n, such that ΠX

ˆ

mn ÿ

h“mηn `1

Wh ă t0

˙

ď ξpδn q `

mn ´mηn

mn ÿ

h“mηn `1

ΠX

ˆ

mn ÿ

h“mηn `1

P pAh ą δn q.

Zh ă t0

˙ (4.20)

where there exists a constant C5 ą 0 such that ξpxq “ C5 xp{2 for x ą 0. Observe that 112

ΠX

ˆ

řmn

řmn

h“mηn `1

Wh ă t0

˙

ď ΠX

ˆ

řmn

h“mηn `1 Wh

ă t0 , Ah ď δn , h “

mηn

˙

` 1, . . . , mn `

P pAh ą δn q. ˆ ˙ ˆ ˙ řmn řmn 1 1 Since ΠX “ ΠX for some h“mηn `1 Wh ă t0 h“mηn `1 pτ {τ qWh ă τ t0 {τ

h“mηn `1

τ 1 ă 1, we can re-parameterize t0 as τ 1 t0 {τ and τ as τ 1 . Hence without loss of generality we assume τ ă 1.

Define g : r0, t0 s Ñ ℜ, t ÞÑ ´Φ´1 p1 ´ e´t q. It holds that g is a continuous 1 2

function on p0, t0 s. Assume α0 „ GPp0, c0q where c0 px, x1 q “ τ 2 e´}x´x } . For h “ mηn ` 1, . . . , mn ,

α0 pxq ě λq. P psup αh pxq ě λ, Ah ď δn q ď P p sup ? xP δn X

xPX

Below we estimate P psupxP?δn X α0 pxq ě λq for large enough λ following Theorem 5.2 of Adler (1990). However extra care is required to identify the role of δn . Since ? ? Npǫ, δn X , }¨}q ď C1 p δn {ǫqp , żǫ a a tlog Npǫ, δn X , }¨}qu1{2 dǫ ď C2 ǫt1 ` logp1{ǫqu. 0

for some constant C2 ą 0. Hence a a P p sup α0 pxq ě λq ď C3 p δn λqp expr´1{2tλ ´ C2 {λp1 ` log λqu2 {τ 2 s xPrn X

ď C3 δnp{2 λp`2 t1 ´ Φpλ{τ 2 qu ď C4 δnp{2 t1 ´ Φpλqu. for constants C3 , C4 ą 0. The last inequality holds for all large λ because τ ă 1. Hence there exists t1 P p0, t0 q sufficiently small and independent of n such that for p{2

all t P p0, t1 q, ΠX tsupxP?δn X α0 pxq ě gptqu ď C4 δn Φt´gptqu. Observe that ΠX t sup α0 pxq ě gptqu ď C4 δnp{2 Φt´gptqu “ C4 δnp{2 p1 ´ e´t q ? xP δn X

ă C5 δnp{2 p1 ´ e´γ0 t q, for any γ0 ą 1. Further choose γ0 large enough such that 2p1´e´γ0 t q ą 1 @ t P rt1 , t0 s. p{2

Hence P pWh ď t, Ah ď δn q ď ξpδn qP pZh ă tq @ t P p0, t0 s where ξpδn q “ C5 δn , with 113

C5 “ maxt2, C4 u. Applying Lemma 80, we conclude (4.20) by induction. Lemma ˆ ˙ řmn řmn 80 is proved in Appendix A. As h“1 Zh „ Gapmn , γ0 q, ΠX ď h“1 Zh ă t0 ´C6 mn log mn

e

for some constant C6 ą 0. Since ξpδn q

mn ´mηn

ΠX

ˆ

řmn

h“1 Zh

ă t0

˙

ď

pe´C7 mn log mn q for some constant C7 ą 0, the result follows immediately.

4.5 Posterior consistency in mixtures of fixed-π dependent processes

4.5.1 Kullback-Leibler property

The following theorem verifies that Π_X has the KL property at f_0 ∈ F_d*. The proof of Theorem 57 is somewhat similar to that of Theorem 44 and can be found in Appendix A.

Theorem 57. f_0 ∈ KL(Π_X) for each f_0 in F_d* if P_X satisfies:

T1. G_0 is specified by μ_h ~ GP(μ, c), σ_h ~ G_{0,σ}, where c is chosen so that GP(0, c) has continuous path realizations and Π_σ is absolutely continuous w.r.t. Lebesgue measure on ℜ⁺.

T2. For every k ≥ 2, (π_1, ..., π_k) is absolutely continuous w.r.t. the Lebesgue measure on S_{k−1}.

T3. For any continuous function g : X → ℜ and any ε > 0,
$$\mathcal{P}_{\mathcal{X}}\Big\{ \sup_{x \in \mathcal{X}} |\mu_h(x) - g(x)| < \epsilon \Big\} > 0, \qquad h = 1, \ldots, \infty.$$

4.5.2 Strong consistency with the q-integrated L1 neighborhood

Next we summarize the consistency theorem with respect to the q-integrated L1 topology. The proof of Theorem 58 is also similar to that of Theorem 51 and is provided in Appendix A.

Theorem 58. Let μ_h(x) = x'β_h + η_h(x), β_h ~ G_β and η_h ~ GP(0, c), h = 1, ..., ∞, where c(x, x') = τ² e^{−A||x−x'||²} and A^{p(1+η_2)/η_2} ~ Ga(a, b) for some η_2 > 0. Assume:

F1. There exist sequences a_n, h_n ↑ ∞, l_n ↓ 0 with a_n/l_n = O(n), h_n/l_n = O(e^n), and constants d_1, d_2, d_3, d_4 > 0 such that G_β{B(0; a_n)}^c < d_1 e^{−d_2 n} and G_{0,σ}{[l_n, h_n]}^c ≤ d_3 e^{−d_4 n}.

F2. P( Σ_{h=n}^∞ π_h > ε ) ≲ O( e^{−n^{1+η_2} (log n)^{p+1}} ).

Then f_0 ∈ KL(Π_X) implies that Π_X achieves strong posterior consistency at f_0 with respect to the q-integrated L1 topology.

Remark 59. F2 is satisfied if the π_h's are made to decay more rapidly than the usual Beta(1, α) stick-breaking random variables; e.g., if π_h = ν_h ∏_{l<h}(1 − ν_l) with ν_h ~ Beta(1, α_h), where α_h = h^{1+η_2} (log h)^{p+1} α_0 for some α_0 > 0, then F2 is satisfied. A large value of α_h for the higher-indexed weights favors a smaller number of components.
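Remark 59's construction can be checked by simulation. The sketch below draws ν_h ~ Beta(1, α_h) with the rapidly growing α_h of the remark (using log(h+1) in place of log h purely to keep α_1 positive, an assumption for the example) and reports how quickly the residual stick mass Σ_{h>n} π_h vanishes.

```python
import numpy as np

rng = np.random.default_rng(3)
p, eta2, alpha0, H, n_rep = 2, 0.5, 1.0, 200, 5000
h = np.arange(1, H + 1)
alpha_h = h ** (1 + eta2) * np.log(h + 1) ** (p + 1) * alpha0
nu = rng.beta(1.0, alpha_h, size=(n_rep, H))                  # nu_h ~ Beta(1, alpha_h)
stick = np.cumprod(np.hstack([np.ones((n_rep, 1)), 1 - nu[:, :-1]]), axis=1)
pi = nu * stick                                               # stick-breaking weights
tail = 1.0 - np.cumsum(pi, axis=1)                            # sum_{h > n} pi_h
print(tail.mean(axis=0)[[4, 9, 19, 49]])                      # tiny already by n = 10
```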

4.6 Discussion

We have provided sufficient conditions for posterior consistency in estimating a conditional density via predictor-dependent mixtures of Gaussians, which include probit stick-breaking mixtures of Gaussians and the fixed-π dependent processes as special cases. The problem is of interest, providing a more flexible and informative alternative to the usual mean regression. For both models, we need the same set of tail conditions (those defining F_d*) on f_0 for KL support. Although the first prior is flexible in the weights and the second in the atoms through their corresponding GP terms, S1, S2, T1 and T3 show that verification of the KL property only requires that both GP terms have continuous path realizations and the desired approximation property. Moreover, for the second prior, any set of weights summing to one a.s. (T2) suffices for showing the KL property. Careful investigations of the prior for the GP kernel for the first model, and of the probability weights for the second, are required for strong consistency. For the first, we need the covariate dependence of the higher-indexed GP terms in the weights to fade off. For the second model, the atoms can be i.i.d. realizations of a GP with Gaussian covariance kernel and inverse-Gamma bandwidth, while the model complexity is limited through a sequence of probability weights which are allowed to decay rapidly. This suggests that full flexibility in the weights should be down-weighted by an appropriately chosen prior, while full flexibility in the atoms should be accompanied by a restriction imposing a smaller number of components.

One alternative possibility is to specify a prior for the joint density h(x, y) = q(x) f(y | x), to induce a prior on the conditional f(y | x), where q(x) denotes the joint density of the covariates. Using such an approach, originally proposed by Müller et al. (1996) using Dirichlet process mixtures of multivariate Gaussians, one can potentially rely on the theory of large support and posterior consistency for i.i.d. realizations from a multivariate distribution; for example, refer to Wu and Ghosal (2010) and Norets and Pelenis (2009). Unfortunately, such an approach has clear disadvantages. When interest focuses on the conditional distribution f(y | x), it is very appealing to avoid modeling the joint density of the predictors, q(x), which will be multivariate in typical applications. In addition, standard models for the joint distribution relying on multivariate Dirichlet process mixtures (refer also to Shahbaba and Neal (2009); Park and Dunson (2009)) can have relatively poor performance, because many mixture components may be introduced primarily to provide a good fit to the marginal q(x), potentially degrading performance in estimating f(y | x) for all x ∈ X. The MGLRx and the Gaussian mixtures of fixed-π dependent processes are examples of priors placed directly on the conditional densities.

The q-integrated L1 topology concerns average accuracy for prediction of future y values when the future x values are drawn from the same covariate distribution Q that generated the observed x's. It would be preferable to use a topology that gives average accuracy guarantees when the future x's are generated from any distribution ν whose support is a subset of the support of Q. To accomplish this, we propose to focus on a topology based on the supremum of L1 neighborhoods of the true density in our future research. Although a more reasonable way of evaluating a Bayes procedure is to study posterior convergence rates, deriving the rates of convergence in our case substantially complicates the analysis and is a topic of future research. Of course, our sieve construction can be used to derive rates, by being more careful in estimating the concentration of the prior around the true density, the rate of decay of the complement of the sieve, and the entropy.


5 Bayesian shape modeling with closed curves

5.1 Introduction

Boundaries of objects are widely studied across many disciplines, such as biomedical imaging, cytology and computer vision. In describing complex boundaries, one can use a parametric curve (2D) or surface (3D), i.e. C(t) : D_1 → R² or C(t) : D_2 → R³ respectively, where D_1 ⊂ R and D_2 ⊂ R². Note that this is different from a typical function estimation problem, because the independent variable, t, is unknown; furthermore, the curve must be closed to produce a valid boundary. A collection of introductory work on curve and surface modeling can be found in Su and Liu (1989), with subsequent developments in Muller (2005). Popular representations include Bezier curves, splines, and principal curves (Hastie and Stuetzle, 1989), the last being a nonlinear generalization of principal components involving smooth curves which pass through the middle of a data cloud. Su et al. (2011) dealt with curve modeling based on stochastic processes when the observations are given as a set of time-indexed points on manifolds. Kurtek et al. (2011) developed an elegant theoretical framework for comparing and analyzing curves once the fitted


curves are obtained. Nonparametric representations of parametric curves and surfaces are widely used (Barnhill, 1985; Lang and Röschel, 1992; Hagen and Santarelli, 1992; Aziz et al., 2002), because they provide a flexible model for a broad range of objects, e.g. cells, pollen grains, protein molecules, machine parts, etc. Although there is a vast literature on estimating curves and surfaces, the majority of this work focuses on estimating unrestricted functions. However, the boundary of a simply-connected object must be a closed curve, which is a restriction on the curve representation. Estimating a closed surface or curve involves a different modeling strategy, and there has been little work in this regime, particularly from a Bayesian point of view. To our knowledge, only Pati and Dunson (2011) developed a Bayesian approach for fitting a closed surface, using tensor products. In many applications featuring low-contrast images or sparse and noisy point clouds, there is insufficient data to recover local segments of the boundary in isolation. Thus, it becomes critical to model the boundary's global shape. Furthermore, multiple related objects may share shape similarities that can be leveraged for improved inference of boundaries. However, to the best of our knowledge, there are few curve models which incorporate detailed shape information. One strategy for analyzing complex curves is to refactor them in a multiscale fashion, as done by Fourier and wavelet descriptors (Whitney, 1937; Zahn and Roskies, 1972; Mortenson, 1985; Persoon and Fu, 1977). These approaches decompose a curve into components of different scales, so that the coarsest-scale components carry the global approximation information while the finer-scale components contain the local detailed information. Mokhtarian and Mackworth (1992), Désidéri and Janka (2004) and Désidéri et al. (2007) also proposed multiscale curves. Such multiscale transforms make it easier to compare objects that share the same coarse shape but differ in finer details, or vice versa. The finer-scale components can also be discarded

to yield a finite and low-dimensional representation. However, none of these methods are model-based. In this chapter, we propose a Bayesian hierarchical model for object boundaries, which addresses all of the aforementioned problems: 1) guaranteeing valid boundaries through closed curves, 2) enabling borrowing of information when fitting multiple similar objects, and 3) employing a multiscale representation suitable for shape analysis. The key innovation in our model is a curve-generating random process which can approximate the whole range of simply connected 2D shapes. It is based on applying a sequence of multiscale deformations to a novel type of closed curve (Róth et al., 2009). Because the model is multiscale, it is able to detect and borrow inter-object similarities at a particular resolution even if similarities are not present at other resolutions. This process also yields a 'central curve' that summarizes multiple objects. Dryden and Mardia (1998) discussed a related concept of mean shape, shape variability, and various methods of estimating them in the context of landmark-based analysis. En route, we solve several important sub-problems that may be generally useful in the study of curve and surface fitting. First, we develop a model-based approach for parameterizing point cloud data. Second, we show how fully Bayesian joint modeling can be used to incorporate several pieces of auxiliary information in the process of curve-fitting, such as when a surface orientation is reported for each point within a point cloud. Lastly, the concept of multiscale deformation can be generalized to 3D surfaces in a straightforward manner.


5.2 Shape-generating random process

5.2.1 Overview

Our shape-generating random process starts with a closed curve and performs a sequence of multiscale deformations to generate a final curve. In §5.2.2, we introduce the Roth curve developed by Róth et al. (2009), which will be used to represent the object boundary. Then, in §5.2.3, we demonstrate how to deform a Roth curve at multiple scales to produce any simply-connected shape. Using the mechanisms developed in §5.2.2 and §5.2.3, we present the full random process in §5.2.5. In §5.4, we use this as a prior distribution for curve-fitting.

5.2.2 Roth curve

A Roth curve is a closed parametric curve, C : [−π, π] → R², defined by a set of 2n + 1 points in R², {c_j, j = 1, ..., 2n + 1} (also known as control points), where n is the degree of the curve and may be chosen to be any positive integer. For convenience, we will refer to the total number of control points as J, where J(n) = 2n + 1. For notational simplicity, we will drop the dependence on n in J(n). As a function of t, the curve can be viewed as the trajectory of a particle over time. At every time t, the particle's location is defined as some convex combination of all control points. The weight accorded to each control point in this convex combination varies with time according to a set of basis functions {B_j^n(t), j = 1, ..., J}, where B_j^n(t) > 0 and Σ_{j=1}^J B_j^n(t) = 1 for all t:

$$C(t) = \sum_{j=1}^{J} c_j B_j^n(t), \qquad t \in [-\pi, \pi], \tag{5.1}$$

$$B_j^n(t) = \frac{h_n}{2^n}\bigg\{ 1 + \cos\bigg( t + \frac{2\pi(j-1)}{2n+1} \bigg) \bigg\}^n, \qquad h_n = \frac{(2^n n!)^2}{(2n+1)!}, \tag{5.2}$$
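The basis (5.2) is straightforward to evaluate numerically. The following sketch (illustrative code, not from the dissertation) computes the Roth basis and a curve from given control points, and checks two of the properties listed next: the basis functions form a partition of unity and the curve is closed.

```python
import numpy as np
from math import factorial

def roth_basis(t, n):
    """Roth basis B_j^n(t), j = 1..2n+1, for t in [-pi, pi]; rows sum to 1."""
    J = 2 * n + 1
    h_n = (2 ** n * factorial(n)) ** 2 / factorial(2 * n + 1)
    j = np.arange(1, J + 1)
    return (h_n / 2 ** n) * (1 + np.cos(t[:, None] + 2 * np.pi * (j - 1) / J)) ** n

def roth_curve(t, control):
    """C(t) = sum_j c_j B_j(t); control has shape (J, 2)."""
    n = (control.shape[0] - 1) // 2
    return roth_basis(t, n) @ control

t = np.linspace(-np.pi, np.pi, 200)
B = roth_basis(t, n=3)
assert np.allclose(B.sum(axis=1), 1.0)   # partition of unity (convex weights)
C = roth_curve(t, np.random.default_rng(4).normal(size=(7, 2)))
assert np.allclose(C[0], C[-1])          # closed: C(-pi) = C(pi)
```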

where c_j = [c_{j,x}, c_{j,y}]' specifies the location of the j-th control point and B_j^n : [−π, π] → [0, 1] is the j-th basis function. For simplicity, we omit the superscript n denoting a basis function's degree, unless it requires special attention. This representation is a type of Bezier curve. The Roth curve has several appealing properties:

1. It is fully defined by a finite set of control points, despite being an infinite-dimensional curve.

2. It is always closed, i.e. C(−π) = C(π). This is necessary to represent the boundary of an object.

3. All basis functions are nonlinear translates of each other, and are evenly spaced over the interval [−π, π]. They can be cyclically permuted without altering the curve. This implies that each control point exerts the same 'influence' over the curve.

4. A degree 1 Roth curve having 3 control points is always a circle or ellipse.

5. Any closed curve can be approximated arbitrarily well by a Roth curve, for some large degree n. This is because the Roth basis, for a given n, spans the vector space of trigonometric polynomials of degree n and, as n → ∞, the basis functions span the vector space of Fourier series. We elaborate on this in §5.3.

6. Roth curves are infinitely differentiable (C^∞).

5.2.3 Deforming a Roth curve

A Roth curve can be deformed simply by translating some of its control points. We now formally define deformation and illustrate it in Figure 5.1.

Definition 60. Suppose we are given two Roth curves,

$$C(t) = \sum_{j=1}^{J} c_j B_j(t), \qquad \tilde{C}(t) = \sum_{j=1}^{J} \tilde{c}_j B_j(t), \tag{5.3}$$

Figure 5.1: Deformation of a Roth curve.

where for each j, c̃_j = c_j + R_j d_j, d_j ∈ R², and R_j is a rotation matrix. Then, we say that C(t) is deformed into C̃(t) by the deformation vectors {d_j, j = 1, ..., J}.

Each Rj orients the deformation vector dj relative to the original curve’s surface.

As a result, positive values of the y-component of d_j always correspond to outward deformation, negative values always correspond to inward deformation, and d_j's x-component corresponds to deformation parallel to the surface. We will call R_j a deformation-orienting matrix. In precise terms,

$$R_j = \begin{bmatrix} \cos(\theta_j) & -\sin(\theta_j) \\ \sin(\theta_j) & \cos(\theta_j) \end{bmatrix}, \tag{5.4}$$

where θ_j is the angle of the curve's tangent line at q_j = −2π(j−1)/(2n+1), the point where the control point c_j has the strongest influence: q_j = arg max_{t∈[−π,π]} B_j(t). θ_j can be obtained

by computing the first-derivative of the Roth curve, also known as its hodograph. Definition 61. The hodograph of a Roth curve is given by:

Hptq “

J ÿ

j“1

123

cj

d Bj ptq, dt

(5.5)

where

d B ptq dt j

is given by

ˆ ˙ J n´1 ÿ ÿ ˆ2n˙ 2pn ´ kqpj ´ 1qπ 2 ` ˘ pn ´ kq sin pn ´ kqt ` cj , (5.6) ´ k 2n ` 1 p2n ` 1q 2n n j“1 k“0 where t P r´π, πs. If we view Cptq as the trajectory of a particle, Hptq intuitively gives the velocity of the particle at point t. We can use simple trigonometry to determine that θj “ arctan

ˆ

Hy pqj q Hx pqj q

˙

(5.7)

.

Note that Rj is ultimately just a function of tcj P R2 , j “ 1, . . . , Ju. Next, we show how to alter the scale of deformation, using an important concept called degree elevation. Definition 62. Given any Roth curve, we can use degree elevation to re-express the same curve using a larger number of control points (a higher degree). More ř2n`1 precisely, if we are given a curve of degree n, Cptq “ j“1 cj Bjn ptq, we can elevate

p its degree by any positive integer v, to obtain a new degree elevated curve: Cptq “ ř2pn`vq`1 p p p cj Bjn`v ptq such that Cptq “ Cptq for all t P r´π, πs. In Cptq, each new j“1 degree-elevated control point, p cj , can be defined in terms of the original control points,

tci , i “ 1, . . . , 2n ` 1u:

2n`1 ÿ 1 ci ` 2n ` 1 i“1

`2pn`vq˘ n´1 `2n˘ 2n`1 ÿ hn ÿ n`v k cospξpk, n, iqqci , ` ˘ 22n´1 k“0 2pn`vq i“1 v`k

´ ´ ¯ ´2pi´1qπ where ξpk, n, iq “ pn ´ kq 2pn`vq`1 `

2pn´kqpi´1qπ 2n`1

¯

.

Although daunting to read, the only crucial points to note about this relationship are that p cj is linear in ci ’s, i “ 1, . . . , 2n`1 and that the ‘influence’ of a single control 124

point shrinks after degree elevation. This is because the curve is now shared by a greater total number of control points. This implies that after degree-elevation, the translation of any single control point will cause a smaller, finer-scale deformation to the curve’s shape. Thus, degree elevation can be used to adjust the scale of deformation. We exploit this strategy in the random process proposed in §5.2.5. To that end, we first rewrite all of the concepts described above in more compact vector notation. Note that the formulas for degree elevation, deformation, the hodograph and the curve itself all simply involve linear operations on the control points. 5.2.4 Vector notation Rewrite the control points in a ‘stacked’ vector of length 2J, c “ pc1,x , c1,y , c2,x , c2,y , . . . , cJ,x , cJ,y q1 .

(5.8)

The formula for a Roth curve given in (5.1) can be rewritten as Cptq “ Xptqc „  B1 ptq 0 B2 ptq 0 ¨ ¨ ¨ BJ ptq 0 Xptq “ . 0 B1 ptq 0 B2 ptq ¨ ¨ ¨ 0 BJ ptq

(5.9) (5.10)

The formula for the hodograph given in (5.5) is rewritten as 9 Hptq “ Xptqc,

d 9 Xptq “ Xptq. dt

(5.11)

Deformation can be written as r c “ c ` T pcqd, d “ pd1,x , d1,y , d2,x , d2,y , . . . , dJ,x , dJ,y q1 ,

T pcq “ blockpR1 , R2 , . . . , RJ q,

(5.12) (5.13)

where blockpA1 , . . . , Aq q is a pq ˆ pq block diagonal matrix using p ˆ p matrices Ai , i “ 1, . . . , q. We call T the stacked deformation-orientating matrix. Note that T 125

is a function of c, because each Rj depends on c. Degree elevation can be written as the linear operator, E: p c “ Ec,

where

Ei,j

1 “ ` 2n ` 1

n`v,n E “ pEi,j qi“1,j“1 .

`2pn`vq˘

hn n`v 2n´1 2

n´1 ÿ k“0

`2n˘

k `2pn`vq ˘ cospξpk, n, iqq. v`k

We will maintain this vector notation throughout the rest of the chapter. 5.2.5 Shape-generating Random Process The random process starts with some initial Roth curve, specified by an initial set of control points, cp0q . From here on, we will refer to all curves by the stacked vector of their control points, c. Then, drawing on the deformation and degree-elevation operations defined earlier, we repeatedly apply the following recursive operation R times: p cpr´1q “ Er cpr´1q ,

dprq „ Npµr , Σr q,

cprq “ p cpr´1q ` Tr pcpr´1q qdprq

(5.14)

resulting in a final curve cpRq . In other words, (i) degree elevate the current curve, (ii) randomly deform it, and repeat a total of R times. This random process specifies a probability distribution over cpRq . We now elaborate on the details of this recursive process. The parameters of the process are 1. R P Z, the number of steps in the process. 2. nr P Z, the degree of the curve cprq , for each r “ 0, . . . , R. The sequence of tnr uR 0 must be strictly monotonically increasing. For convenience, we will denote the number of control points at a certain step r to be Jr “ 2nr ` 1. 126

Figure 5.2: An illustration of the shape generation process. From left to right: 1) initial curve specified by three control points, 2) the same curve after degree elevation, 3) deformation, 4) degree elevation again, 5) deformation again. Dark lines indicate the curve, pale dots indicate the curve’s control points, and pale lines connect the control points in order. 3. µr P R2Jr , the average set of deformations applied at step r “ 0, . . . , R. Note that this vector contains a stack of deformations, not just one. 4. Σr P R2Jr ˆ2Jr , the covariance in the set of deformations applied at step r “ 0, . . . , R. For these parameters, Er is the degree-elevation matrix going from degree nr´1 to nr , Np¨, ¨q is a 2Jr -variate normal distribution and Tr is the stacked deformation orienting matrix. We take special care in defining the initial curve, cp0q . We choose cp0q to be degree n0 “ 1, which guarantees that it is an ellipse. For j “ 1, 2, 3, we define each control

127

point as p0q

cj

p0q

“ p0, 0q1 ` Rθj dj ,

Rθj “ rotation matrix where θj “

(5.15) 2πj , 3

(5.16)

p0q

and where each dj P R2 is a random deformation vector. In words: we start with a curve that is just a point at the origin, Cptq ” p0, 0q, and apply three random deformations which are rotated by a radially symmetric amount: 0˝ , 120˝ and 240˝ (note that the final deformations are not radially symmetric, since each dj is randomly drawn). We will write this in vector notation as dp0q „ Npµ0 , Σ0 q, cp0q “ 0 ` T0 dp0q . The deformations essentially ‘inflate’ the curve into some ellipse. This completes our definition of the random process. We now give some intuition about the process and each of its parameters, and define several additional concepts which make the process easier to interpret. The random process gives a multiscale representation, because each step in the process produces increasingly fine-scale deformations, through degree-elevation. R is then the number of scales or ‘resolutions’ captured by the process. Each nr specifies the number of control points at resolution r. We will use Sr to denote the class of curves that can be exactly represented by a degree nr Roth curve. If tnr uR 1 is monotonically increasing, then S1 Ă S2 Ă . . . Ă SR . Thus, the deformations dprq roughly describe the additional details gained going from Sr´1 to Sr . Modeling multiple resolutions allows better ‘borrowing of information’ between subjects. For example, we may wish to model a human body before and after it has lost weight. The two shapes will differ in their coarse outline, but share the same fine-scale 128

features (a nose, ears, etc.). If one object is missing a large part of its boundary, we may borrow fine-scale features without incorrectly importing the coarse outline. Thus, resolutions should be chosen to reflect the levels at which shapes are similar. It is crucial that we define each resolution relative to the surface orientation of the previous resolution. For example, if two human bodies only differ by the tilted angle of their head, it should be possible to observe that the facial features are identical, once differences in the coarser level head-orientation have been removed. µr is the mean deformation at level r. Based on tµr , r “ 0, . . . , Ru, we define the ‘central curve’ of the random process, cµ as: cµ :“ cpRq µ cprq µ

“ Er cpr´1q ` Tr pcpr´1q qµr µ µ

Note that c˚ is simply the deterministic result of the random process if each dprq “ µr , rather than being drawn from a distribution centered on µr . Thus, all shapes generated by the process tend to be deformed versions of the central curve. We illustrate this in Figure 5.3. If the random process is used to describe a collection of objects, the central curve provides a good summary.

(a) (b) (c) Figure 5.3: Random samples from the shape-generating process (red: the central curve, blue: random samples). (a) A moon-shaped collection, (b) star-shaped collection, (c) high-variance but symmetry-constrained collection.

Σr determines the covariance of the deformations at level r. This naturally con129

trols the variability among shapes generated by the process. If the variance is very small, all curves will be very similar to the central curve. Σr can also be chosen to induce correlation between deformation vectors at the same resolution, in the typical way that correlation is induced between dimensions of a multivariate normal distribution. This allows us to incorporate higher-level assumptions about shape, such as reflected or radial symmetry. For example, if R “ 2, n1 “ 1 and n2 “ 2, we can p2q

p2q

p2q

p2q

specify perfect correlation in Σ2 , such that d1 “ d4 and d2 “ d3 . The resulting curves are guaranteed to be symmetrical along an axis of reflection. In the subsequent sections 5.4 and 5.5, we show how to use our random process to guide curve-fitting for point clouds and image data.

5.3 Properties of the Prior 5.3.1 Support Let the H¨older class of periodic functions on r´π, πs of order α be denoted by C α pr´π, πsq. Define the class of closed parametric curves SC pα1 , α2 q having different smoothness along different coordinates as SC pα1 , α2 q :“ tS “ pS 1 , S 2 q : r´π, πs Ñ R2 , S i P C αi pr´π, πsq, i “ 1, 2u.

(5.17)

Consider for simplicity a single resolution Roth curve with control points tcj , j “ 0, . . . , 2nu. Assume we have independent Gaussian priors on each of the two coordiř n 2 nates of cj for j “ 0, . . . , 2n, i.e., Cptq “ 2n j“0 cj Bj ptq, cj „ N2 p0, σj I2 q, j “ 0, . . . , 2n. Denote the prior for C by ΠC n . ΠC n defines an independent Gaussian process for each of the components of C. Technically speaking, the support of a prior is defined as the smallest closed set with probability one. Intuitively, the support characterizes the variety of prior realizations along with those which are in their limit. We construct a prior distribution to have large support so that the prior realizations are flexible enough to approximate the true underlying target object. As re130

viewed in van der Vaart and van Zanten (2008b), the support of a Gaussian process (in our case ΠC n ) is the closure of the corresponding reproducing kernel Hilbert space (RKHS). The following Lemma 63 describes the RKHS of ΠC n , which is a special case of Lemma 2 in Pati and Dunson (2011). Refer to Appendix C for the proofs. Lemma 63. The RKHS Hn of ΠC n consists of all functions h : r´π, πs Ñ R2 of the form hptq “

2n ÿ

j“0

cj Bjn ptq

(5.18)

where the weights cj range over R2 . The RKHS norm is given by ||h||2Hn



2n ÿ

j“0

||cj ||2 {σj2 .

(5.19)

The following theorem describes how well an arbitrary closed parametric surface S0 P SC pα1 , α2 q can be approximated by the elements of Hn for each n. Theorem 64. For any fixed S0 P SC pα1 , α2 q, there exists h P Hn with ||h||2Hn ď ř 2 K1 2n j“0 1{σj such that ||S0 ´ h||8 ď K2 n´αp1q log n

(5.20)

for some constants K1 , K2 ą 0 independent of n. This shows that the Roth basis expansion is sufficiently flexible to approximate any closed curve arbitrarily well. Although we have only shown large support of the prior under independent Gaussian priors on the control points, the multiscale structure should be even more flexible and hence rich enough to characterize any closed curve. We can also expect minimax optimal posterior contraction rates using the prior ΠC n similar to Theorem 2 in Pati and Dunson (2011) for suitable choices of prior distributions on n. 131

Figure 5.4: Influence of the control points on the Roth curve 5.3.2 Influence of the control points The unique maximum of basis function Bjn ptq defined in (5.1) is at t “ ´2πpj ´ 1q{J, therefore the control point cj has the most significant effect on the shape of the curve in the neighborhood of the point Cp´2πpj ´ 1q{Jq. Note that Bjn ptq vanishes at t “ π ´ 2πpj ´ 1q{J, thus cj has no effect on the corresponding point i.e., the point of the curve is invariant under the modication of cj . The control point cj affects all other points of the curve, i.e. the curve is globally controlled. These properties are illustrated in Figure 5.4. However, we emphasize following Proposition 5 in R´oth et al. (2009) that while control points have a global effect on the shape, this inuence tends to be local and dramatically decreases on further parts of the curve, especially for higher values of n.

5.4 Inference from Point Cloud Data We now demonstrate how our multiscale closed curve process can be used as a prior distribution for fitting a curve to a 2D point cloud. Data examples are given in §5.8. As a byproduct of fitting, we also obtain an intuitive description of the shape in terms of deformation vectors. Assume that the data consist of points tpi P R2 , i “ 1, . . . , Nu concentrated near 132

a 2D closed curve. Since a Roth curve can be thought of as a function expressing the trajectory of a particle over time, we view each data point, pi , as a noisy observation of the particle’s location at a given time ti , ǫi „ N2 p0, σ 2 I2 q.

pi “ Cpti q ` ǫi ,

(5.21)

(5.21) shares a similar form to nonlinear factor models, where ti is the latent factor score. We assume that the noise variance σ 2 is known, but if not, one can easily place a conjugate inverse Gamma prior on it. First, we will rewrite the point cloud model in stacked vector notation. Defining p “ pp1,x , p1,y , . . . , pN,x , pN,y q1 ,

ǫ “ pǫ1,x , ǫ1,y , . . . , ǫN,x , ǫN,y q1

t “ pt1,x , t1,y , . . . , tN,x , tN,y q1 ,

Xptq1 “ rXpt1 q1 Xpt2 q1 . . . XptN q1 s

we have p “ Xptqc ` ǫ,

ǫ „ N2N p0, σ 2 I2N q

(5.22)

where Xpti q is as defined in (5.11). To fit a Roth curve through the data, we want to infer P pc | pq, the posterior distribution over control points c, given the data points p. To compute this, we must specify P pp | cq, the likelihood, and P pcq, the prior distribution over Roth curves. We choose P pcq to be the probability distribution induced by the shape-generating random process specified in §5.2.5. From (5.22), we can specify the likelihood function as, P ptpi uN 1

|

tci uJ1 q



N ź i“1

ˆ ÿ ˙ J 2 N2 pi ; cj Bj pti q, σ I2 ,

(5.23)

j“1

P pp | cq “ N2N pp; Xptqc, σ 2 I2N q.

(5.24)

This completes the Bayesian formulation for inferring c, given p and t. In §5.7, we describe the exact method for performing Bayesian inference. 133

In many applications, ti is not known and can be treated as a latent variable. We propose a prior for ti conditionally on c, which is designed to be uniform over the curve’s arc-length. This prior is motivated by the frequentist literature on arc-length parameterizations Madi (2004), but instead of assigning the values tti P r´π, πsu in a deterministic preliminary step prior to statistical analysis, we use a Bayesian approach to formally accommodate uncertainty in parameterization of the points. Define the arc-length function A : r´π, πs ÞÑ R` Apuq :“ Apu; pc0 , . . . , c2n qq “

żu

´π

||Hptq||dt.

(5.25)

Note that A is monotonically increasing and satisfies Ap´πq “ 0, Apπq “ Lpc0 , . . . , c2n q where Lpc0 , . . . , c2n q is the length of the curve conditional on the control şπ points pc0 , . . . , c2n q and is given by ´π ||Hptq||dt. Given pc0 , . . . , c2n q, we draw li „ Unifp0, Lpc0 , . . . , c2n qq and set ti “ A´1 pli q.

Thus we obtain a prior for the ti ’s which is uniform along the length of the curve and is given by ||Hptq|| . ||Hptq||dt ´π

pptq “ ş π

Thus the high velocity regions on the curve are penalized more and the low velocity regions are penalized less to enable uniform arc-length parameterizations. Uniform arc-length parametrization is extremely important for two reasons. First, it ensures that the control points are well distributed along the entire object boundary. This means that a roughly equal amount of ”detail” is given to describing any given length of the curve. Second, it standardizes parametrization among multiple curves to make them directly comparable. We will discuss a griddy Gibbs algorithm for implementing the arc-length parametrization in a fully Bayesian framework in §5.7. 134

5.5 Inference from Pixelated Image Data In this section, we show how to model image data by converting it to point cloud data. We also show how image data gives a bonus estimate for the object’s surface orientation, ωi at each point pi . We incorporate this extra information into our model to improve fitting, with essentially no sacrifice in computational efficiency. A grayscale image can be treated as a function Z : R2 Ñ R. The gradient of this function, ∇Z : R2 Ñ R2 is a vector field, where ∇Zpx, yq is a vector pointing in the direction of steepest ascent at px, yq. In computer vision, it is well known that the gradient norm of the image, ||∇Z||2 : R2 Ñ R approximates a ‘line-drawing’ of all the high-contrast edges in the image. Our goal is to fit the edges in the image with our model. In practice, an image is discretized into pixels tza,b | a “ 1, . . . , X, b “ 1, . . . , Y u but a discrete version of the gradient can still be computed by taking the difference between neighboring pixels, such that one gradient vector, ga,b is computed at each pixel. The image’s gradient norm is then just another image, where each pixel ma,b “ ||ga,b ||2 . Finally, we extract a point cloud: tpa, bq | ma,b ą M, a “ 1, . . . , X, b “ 1, . . . , Y u where M is some user-specified threshold. Each point pa, bq can still be matched to a gradient vector ga,b . For convenience, we will re-index them as pi and gi . The gradient vector points in the direction of steepest change in contrast, i.e. it points across the edge of the object, approximating the object’s surface normal. The surface g

i,y orientation is then just ωi “ arctanp gi,x q.

In the following, we describe a model relating a Roth curve to each ωi . This model can be used together with the model we specified earlier for the pi .

135

5.5.1 Modeling surface orientation Denote by vi “ pHx pti q, Hy pti qq P R2 the velocity vector of the curve Cptq at the parameterization location ti , i “ 1, . . . , N. Note that vi is always tangent to the curve. Since each ωi points roughly normal to the curve, we can rotate all of them by 90 degrees, θi “ ωi ` π2 , and treat each θi as a noisy estimate of vi ’s orientation. Note that we cannot rotate the vector gi by 90 degrees and directly treat it as a noisy observation of vi . In particular, gi ’s magnitude bears no relationship to the magnitude of vi : ||gi || is the rate of change in image brightness when crossing the edge of the object, while ||vi || describes the speed at which the curve passes through pi . Suppose we did have some noisy observation of vi , denoted ui . Then, we could have specified the following linear model relating the curve tcj , j “ 1, . . . , Ju to the ui ’s: ui “ vi ` δi “

J ÿ

cj

j“1

d Bj pti q ` δi dt

(5.26) (5.27)

for i “ 1, . . . , N where δi „ N2 p0, τ 2 I2 q. Instead, we only know the angle of ui , θi . In §5.7, we show that using this model, we can still write the likelihood for θi , by marginalizing out the unknown magnitude of ui . The resulting likelihood still results in conditional conjugacy of the control points.

5.6 Fitting a collection of curves We can extend our methodology in the previous section to simultaneously fit and characterize a collection of K separate point clouds, via hierarchical modeling. In the previous section, we used the random shape process as a prior with fixed parameters. 136

Now, we will instead treat the random process as a latent mechanism which generated all K objects. The inferred parameters of the latent process then characterize the collection of shapes. As a reminder, the parameters of the random process and their interpretations are defined in §5.2.5. Rather than fixing their values, we will treat µr and Σr for r “ 0, . . . , R as unknowns and place the following priors on them: µr „ N2Jr pµµr , Σµr q, ˆ” ı1 ˙ prq prq prq prq prq prq ´1 , Σr “ diag τ1,x , τ1,y , τ2,x , τ2,y , ¨ ¨ ¨ , τJr ,x , τJr ,y τuprq „ Gamma pατ , βτ q

for u “ t1, . . . Jr u ˆ tx, yu,

where diag pvq takes any vector v P Rd and produces the diagonal matrix V P Rdˆd with elements of v along the diagonal. We will continue to treat R and tnr | r “ 0, . . . , Ru as fixed, although we plan to consider inferring these quantities in future work. Note that the prior on Σr only permits diagonal covariance structure, assuming independence between deformations (and between the x/y components of each deformation). In future work, it will be interesting to remove this simplifying assumption and characterize inter-deformation correlation. Nonetheless, ! ) prq the inferred values of τu | u “ t1, . . . Jr u ˆ tx, yu, r “ 0, . . . , R can still be usefully pRq

interpreted. For example, a high value for τ2,y indicates that there is a high-level

of variability in the second deformation vector at resolution R (implying a fairly fine-scale deformation). Since it is the y-component, this corresponds to variability normal to the object surface. We now formalize the concept that all K objects are generated from a single random process. Denote the k th point cloud as pk for k “ 1, . . . K. Each pk is fit by ( a curve ck , which is composed from the deformations dprq,k | r “ 0, . . . , R and each

dprq,k „ N pµr , Σr q. This hierarchical structure also induces dependence between the 137

K curves, enabling them to borrow information from each other during fitting. An additional challenge that arises from modeling multiple objects is inter-object alignment (also known as registration). This typically involves removing differences in object position, orientation and scale. Here, we only deal with position and orientation. According to (5.15), our random process generates shapes centered at p0, 0q and rotated to a fixed angle. However, in an actual collection of shapes, each object is rotated to a different angle and centered at a different location. We can modify (5.15) to account for this simply by adding latent variables for the position, mk P R2 , and orientation, φk P r´π, πs, of each object k: p0q,k

cj

p0q,k

“ mk ` Rφk Rθj dj

,

where Rφk is a rotation matrix. We place a uniform prior on φk and a normal prior on mk . It can be desirable to put a more sophisticated spatial prior on mk , but our focus here is on modeling object boundaries, not their location. The orientation of level r “ 0 orients all subsequent levels. This is sufficient to align the entire collection and make the deformation vectors of each shape directly comparable. We also note that the definition of φk has an important anchoring effect on tk . The prior for tk is conditional on the curve ck . Since the random process prior for the curve is oriented by φk , the prior for tk favors parametrizations that conform to this orientation.

5.7 Posterior computation In §5.6, we presented a model for characterizing and fitting a collection of K closed curves, with unknown underlying parametrization. We now present an MCMC algorithm for sampling from the joint posterior of this model. This involves deriving the conditional posteriors of mk , dprq,k , µr , Σr and tk for r “ 0, . . . , R and k “ 1, . . . , K. 138

5.7.1 Conditional posteriors for mk and dprq,k The conditional posteriors for mk and dprq,k are the most challenging to sample from, because our model’s likelihood function is nonlinear in these terms, preventing conditional conjugacy. To overcome this, we derive a linear approximation to the true likelihood function, which does yield conditional conjugacy, and then use samples from the approximate conditional posterior as proposals in a Metropolis-Hastings step. We first present the source of nonlinearity in the likelihood function. Recall from §5.4 that: P ppk | cpRq,k q “ N2N ppk ; Xptk qcpRq,k , σ 2 I2N q. From §5.2.5, we note that cpRq,k is the result of combining the deformations tdprq,k | r “ 0, . . . , Ru through the following recursive relation, with mk appearing in the base case: ` ˘ cprq,k “ Er cpr´1q,k ` Tr cpr´1q,k dprq,k

cp0q,k “ mk ` Tφk T0 dp0q,k

(5.28) (5.29)

At each step r of the recursive process, the deformation-orienting matrix Tr is a nonlinear function of the previous cpr´1q,k . As a result, cpRq,k is nonlinear in dprq,k for r “ 0, . . . , R ´ 1. For any given step of the process, we can replace the true recursive relation with a linear approximation. In particular, we will substitute Tr pcpr´1q,k qdprq,k

with Tˆr,k cpr´1q,k , where Tˆr,k will be derived shortly. The new approximate step is then c

prq,k

´ ¯ ˆ « Er ` Tr,k cpr´1q,k .

(5.30)

If we wish to write cpRq,k linearly in terms of cprq,k for any r “ 0, . . . , R, we can replace every recursive step from r to R with the approximate step given in (5.30). 139

We emphasize that steps 0, . . . r ´ 1 follow the original recursive relation. This yields the following approximation:

c

pRq,k

«

prq,k ΩR , r`1 c

Ωba



b ´ ź

ρ“a

¯ Eρ ` Tˆρ,k .

(5.31)

Now, by combining (5.28) and (5.31), we have that:

c

pRq,k

«

#

“ ` ˘ ‰ pr´1q,k ΩR ` Tr cpr´1q,k dprq,k r`1 Er c ` k ˘ p0q,k ΩR 1 m ` Tφk T0 d

rą0 r“0

Thus, the approximation of cpRq,k can be written linearly in terms of any dprq,k or mk . Note that it is still nonlinear in dpρq,k for any ρ ‰ r. However, for MCMC sampling, we only need one dprq,k to be linear at a time, holding all others fixed. Lastly, we note that the approximation becomes increasingly good as r approaches R, because the number of approximate steps (contained in Ωba ) decrease. We are now ready to derive the approximate conditional posteriors for mk and dprq,k . First, we claim that these posteriors can all be written in the following form for generic ‘x’, ‘y’ and ‘z’. P px | ´q 9 N py; Qx, Σy q N px; z, Σx q ´ ¯ ÿ ˆ , Σˆ´1 “ Σ´1 ` Q1 Σ´1 Q, P px | ´q “ N µ ˆ, Σ y x

(5.32) (5.33)

k

˙ ˆ ÿ 1 ´1 ´1 ˆ Σ z` QΣ y . µ ˆ “ Σ y x

(5.34)

k

Note that each approximate conditional posterior is simply a multivariate normal. We now show that each approximate posterior can be rearranged to match the form

140

of (5.32) - (5.34). Papprox pmk | ´q ` ˘ 9 N pk ; Xptk qcpRq,k , σ 2 I2N k Npmk ; µm , Σm q ´ ¯ ` k ˘ 2 R,k k k p0q,k 9 N p ; Xpt qΩ1 m ` T0 d , σ I2N k Npmk ; µm , Σm q

¯ ´ R,k k 2 p0q,k k m , σ I k Npmk ; µm , Σm q T d ; Xpt qΩ 9 N pk ´ Xptk qΩR,k 0 2N 1 1

Papprox pdprq,k | ´q ` ˘ 9 N pk ; Xptk qcpRq,k , σ 2 I2N k Npdprq,k ; µr , Σr q

` “ ` ˘ ‰ ˘ pr´1q,k 9 N pk ; Xptk qΩR ` Tr cpr´1q,k dprq,k , σ 2 I2N k Npdprq,k ; µr , Σr q r`1 Er c

` ` pr´1q,k ˘ prq,k 2 ˘ pr´1q,k 9 N pk ´ Xptk qΩR ; Xptk qΩR d , σ I2N k Npdprq,k ; µr , Σr q r`1 Er c r`1 Tr c

We then use Papprox pmk | ´q and Papprox pdprq,k | ´q as M-H proposal distributions to sample from their true counterparts. Both are multivariate normals and if necessary, their variance parameters may be tuned to improve sampling efficiency. 5.7.2 Derivation of the approximate deformation-orienting matrix For visual clarity of the derivation, we will temporarily drop superscripts denoting

the resolution r and object index k of each variable. First, we recall from §5.2 that T pcq is a block diagonal matrix with blocks consisting of the rotation matrices Rj , for j “ 1, . . . , J (where J is the total number of control points at the particular resolution ´ ¯ H pq q r). Each Rj rotates its corresponding deformation, dj , by θj “ arctan Hyx pqjj q . Now, using the identities

cosparctanpx{yqq “ a

x x2 ` y 2

,

we can write Rj as:

Rj

sinparctanpx{yqq “ a

„  1 Hx pqj q ´Hy pqj q “ sj pcq Hy pqj q Hx pqj q 141

y x2 ` y 2

,

where sj pcq “

a

Hx pqj q2 ` Hy pqj q2 . This term intuitively represents the “speed” of

the curve at parametric position qj . It is reasonable to think that the curve’s speed does not vary greatly among samples in the posterior, because in §5.4 we imposed a prior that encourages arc-length uniform parametrization, and because the total arc-length of the curve is not expected to vary greatly. Therefore, we approximate this term with the fixed constant Sj “ sj pcprev q, where cprev is just the curve sampled in the previous iteration of the M-H sampler. Lastly note that the hodograph, 9 H ptq “ Xptqc, is a linear function of c. So, we can now approximate Rj as a linear function of c: Rj

„  1 X9 x pqj, qc ´X9 y pqj qc « . Sj X9 y pqj qc X9 x pqj qc

Then, we can write Rj dj as: Rj dj « Rˆj c,

»´

¯fi 9 9 X pq qd ´ X pq qd y j j,y 1 – ´ x j, j,x ¯fl , Rˆj “ 9 9 Sj Xy pqj qdj,x ` Xx pqj qdj,y

´ ¯ ˆ1 , . . . , R ˆJ . and finally we define Tˆ “ block R 5.7.3 Conditional posteriors for µr and Σr

¯ ´ ˆ P pµr | ´q “ N µ ˆr , Σµr

ˆ ´1 “ Σ´1 ` KΣ´1 Σ r µr µr ¸ ˜ K ÿ ˆ µr Σ´1 µµr ` Σ´1 dprq,k µ ˆr “ Σ r µr k“1

´

prq prq prq P pτj,x | ´q “ Ga α ˆ j,x , βˆj,x

prq

α ˆ j,x “ ατ ` K,

prq βˆj,x “ βτ `

142

¯

K ´ ¯2 ÿ prq,k dj,x ´ µr,j,x

k“1

2

5.7.4 Gibbs updates for the parameterizations and orientation We discretize the possible values of tki P r´π, πs to obtain a discrete approximation of its conditional posterior: ` ˘ P tki | ´ „ ř

Nppki ; Xpti qcpRq,k , σ 2 I2 qP ptki | cpRq,k q k pRq,k , σ 2 I qP ptk | cpRq,k q 2 i τ Pr´π,πs Nppi ; Xpτ qc

We can make this arbitrarily accurate, by making a finer summation over τ . To achieve quick burn-in, we initialize ti using polar-coordinate parametrization, where

p1i “ pi ´ p¯,

´1

tan ti “ t

´ p1 ¯ i,y

p1i,x





u ´ π.

The point p¯ is the average of tpi , i “ 1, . . . , Nu and φ is the orientation variable defined in §5.6. We discretize the possible value of φk P r´π, πs in a similar manner. 5.7.5 Likelihood contribution from surface-normals Define X9 x pti q “ X9 y pti q “





n1 dB2n pti q dB1n1 pti q dB0n1 pti q 1 , 0, , 0, ¨ ¨ ¨ , ,0 dt dt dt n1 dB2n pti q dB1n1 pti q dB0n1 pti q 1 , 0, , ¨ ¨ ¨ , 0, 0, dt dt dt





(5.35)

(5.36)

Proposition 65. The likelihood contribution of the tangent directions θik , i “ 1, . . . , N k ensures conjugate updates of the control points for a multivariate normal prior. Proof. Recall the noisy tangent direction vectors uki ’s and vik ’s in (5.26). Using a simple reparameterization uki “ peki , eki tan θik q 143

where only θi1 s are observed and ei ’s aren’t. Observe that vik “ pHx pti q, Hy pti qq “ pX9 x ptki qcp3q,k , X9 y ptki qcp3q,k q.

(5.37)

Assuming a uniform prior for the eki ’s on R, the marginal likelihood of the tangent direction θik given τ 2 and the parameterization tki is given by lpθik q

1 “ 2πτ 2

 1 k k p3q,k 2 k k p3q,k 2 9 9 q ` pei tanpθi q ´ Xy pti qc q u deki exp ´ 2 tpei ´ Xx pti qc 2τ ´8

ż8



It turns out the above expression has a closed form given by lpθik q “

1 2πτ 2

?

?

2πτ 2 1`tan2 pθik q

! ” exp ´ 2τ12 pX9 x ptki qcp3q,k q2 ` pX9 y ptki qcp3q,k q2 ´

ˆ

pX9 x ptki qcp3q,k `X9 y ptki qcp3q,k tanpθi qq2 1`tan2 pθik q



.

The likelihood for the tθik , i “ 1, . . . , N k u is given by 1 k Lpθ1k , . . . , θN k q9 N τ k

«

1 exp ´ 2 pcp3q,k q1 2τ

#

N ÿ

pSik q1 Γki Sik i“1

+

c

p3q,k



where ¨

Γki “ ˝ and Sik “ rpX9 x ptki qq1

tan2 pθik q 1`tan2 pθik q ´ tanpθik q 1`tan2 pθik q

´ tanpθik q 1`tan2 pθik q 1 1`tan2 pθik q

˛



pX9 y ptki q1 s is a 2p2n3 ` 1q ˆ 2 matrix. Clearly, an inverse-

Gamma for τ 2 and a multivariate normal prior for the control points are conjugate choices.

5.8 Simulation Study We evaluate our method by defining a true underlying curve, c˚ , and checking to see how accurately this curve can be recovered by our model. In particular, we 144

place the true curve in nine different orientations and positions, sparsely sampling to generate nine point clouds. Then, we use our model to recover a full boundary for each cloud, gaining high accuracy by borrowing information between objects. This scenario is similar to real-world applications where a single object has been observed in multiple poses, or a collection of similar objects have been observed. We compare our model against a simpler version of our model which does not allow borrowing of information, and against principal curves, in which each point cloud is fit separately. It was not apparent how to achieve borrowing of information using principal curves. One strategy would be to align the separate point clouds first, then treat them as a single cloud. However, the sparsity of each cloud makes alignment extremely difficult, as there are no clear features present across all nine clouds. For our method, it is only necessary to initialize the parametrization of each point cloud (via polarcoordinate parametrization) such that the orientation of each cloud is roughly correct. The model is robust to small errors in initial parametrization. We define several concepts to help interpret our results. Given some curve c, let Bpcq “ tXptqc | t P r´π, πsu (the set of all points along the curve), and let Apcq denote the interior region enclosed by the curve. For a given distribution over curves, P pcq, we define its boundary heatmap, MPBpcq : R2 Ñ R, and its region heatmap, MPApcq : R2 Ñ R, as: MPApcq px, yq “ P ppx, yq P A pcqq , M B px, yq dxdy “ P ppx ` dx, y ` dyq X Bpcqq . Given a set of samples from the distribution P pcq, tcs | s “ 1, . . . , Su, we can discretely approximate MPBpcq and MPApcq as: 1ÿ px, yq « 1 rW px, yq X B pcs qs, S s“1 S

MPBpcq

145

MPApcq

S 1 ÿ px, yq « 1 rpx, yq P A pcs qs , S s“1

where W px, yq “ tpx1 , y 1q | px1 , y 1 q P rx div ∆x , x div ∆x ` ∆x s ˆ ry div ∆y , y div ∆y ` ∆y su Here div denotes integer division. The function W simply maps values in R2 to a regular grid of bins with width ∆x and height ∆y . Lastly, the mean can be approximated by cˆ “

1 S

S ÿ

cs .

s“1

The following Figure 5.5 illustrates borrowing of information across 16 different point clouds generated from a single curve having missing chunks in different regions. The hyperparameters were set to: r1 “ 1, r2 “ 4, r3 “ 22, Σ1 “ 100 IJr1 , Σ2 “ “ ‰1 “ ‰1 70 IJr2 , Σ3 “ 70 IJr3 , µµ1 “ 1 1 1 b 0 10 , µµ2 “ 0Jr2 , µµ3 “ 0Jr3 , α “ 1, β “ 1, σp “ 1{100.

Figure 5.5: (left) Using a simplified model fitting each point cloud independently i,e., µr and Σr are fixed rather than inferred. Note the large margin of uncertainty for gaps in the point cloud. (Right) full hierarchical model resulting in much tighter fit. Parametrization was achieved by the arc-length model given in §5.4. For the simplified model, insufficient data was present for arc-length parametrization. Instead, a fixed polar-coordinate parametrization was used, producing artifacts in the fit.

Convergence was monitored using the Raftery & Lewis diagnostic test as well as trace plots of the deviance parameters. Also, we get essentially identical posterior 146

summaries with different MCMC starting points and moderate changes to hyperparameters.

5.9 Brain tumor segmentation study In brain tumor diagnosis and therapy, it is very important to account for uncertainty in the tumor’s outline. This information is crucial for assessing whether a tumor has grown/regressed over time, and even more important if a surgeon must target the tumor for excision or radiation therapy. In that situation, there is a critical tradeoff between false positives (targeting healthy tissue) and extremely undesirable false negatives (missing the tumor). Furthermore, tumor outlines are notoriously hard to determine. Error stems from the poor contrast between tumor and healthy tissue in magnetic resonance imaging (MRI), the prevalent modality for diagnosis. Even seasoned experts differ greatly when tracing an estimate. We use our model to intelligently combine the input traces of multiple experts (Figure 5.6), by treating each trace as a point cloud drawn from the same random process. We can then interpret P pcµ | tpk uq as the posterior distribution of the tumor, fully describing the variability and uncertainty among the experts. One might also run additional tumor segmentation algorithms, and combine their outputs using the same approach. In this setting, the region heatmap of the posterior, MPApcµ |tpk uq (shortened to M A ), is especially informative. For every point x, M A pxq gives the probability that it is part of the tumor. This enables a neurosurgeon to manage the tradeoff between false positives/negatives in a principled manner. Let the true tumor c region be Xtumor Ă R2 and Xtumor its complement. Then, define the loss function

for targeting a region X to be c LpXq “ λ` Area pX X Xtumor q `λ´ Area pX c X Xtumor q .

Depending on the ratio of the penalties λ` and λ´ , the surgeon can minimize L 147

simply by cutting along a level set of M A .

Figure 5.6: (top left) raw MRI image; (bottom) M A discretized into 3 colored regions (red:ą 0.95, orange:ą 0.5, yellow:ą 0.01), the traces provided by 4 experts are overlaid on top; (top right) the raw brain image with M A overlaid, and the trace from one expert overlaid for reference.

We can also allow for experts to express varying confidence in different portions of their trace. This is desirable, because certain boundaries of the tumor will have high contrast with the surrounding tissue while other parts won’t, and the expert should not be forced to make an equal opinion on both. We can achieve this by slightly modifying the point cloud model given in (5.21). There, we assumed that each point pi was generated with fixed variance σp2 . Instead, we can let σp2i “ σp2 {κi , where κi is 148

the expert’s confidence in that point. Furthermore, if the expert has no confidence at all, they can simply leave a gap in their trace. The model automatically closes the gap, as shown in simulation examples. Lastly, it is also easy to compute the posterior distribution for quantities such as the size of the tumor, simply by computing the size of each sample.

5.10 Discussion We have developed a fully Bayesian hierarchical model based on multiscale deformations for modeling a collection of 2D closed curves. Although we have characterized a collection of curves using our model, comparing the shapes of objects in a rigorous Riemannian framework will involve further work, such as defining a loss function involving an appropriate metric between shapes. We propose to address this issue in future. In defining the multiscale process, we would like to have a more automatic way of choosing the different resolutions. It is clear that the highest resolution is obtained by maximizing the fit subject to minimizing the Bayesian penalty for model complexity. In our future research we would like to have a more informed way of selecting the lower resolutions. Our multiscale model differs in its purpose from other multiscale methods such as the wavelet transform. With wavelet methods, the goal is often to compress the data, whereas our goal is to define levels that isolate dimensions of similarity or variability within a collection of shapes. In our current methodology, we have assumed that all K shapes are generated from the same random shape process. However, in future applications, it may be useful to model the collection as a mixture of multiple random shape processes, resulting in a clustering method. For example, in analyzing a blood sample featuring sickle cell anemia, it is useful to assume that the population was generated by two 149

random shape processes: one healthy and one “sickle-shaped”. Finally, we would like to extend our multiscale random shape process to the 3D case using the tensor product approach Pati and Dunson (2011) which has potential applications in modeling animated characters or tracking 3D lung tumors for targeted radiation therapy and so on.

150

6 Bayesian modeling of closed surfaces through tensor products

6.1 Introduction Surface reconstruction can be viewed as an algorithm that takes as an input an unorganized set of points tp1 , . . . , pn u P R3 on or near the unknown manifold M embedded in R3 and produces a surface that approximates M. Free-form surface modeling from massive data points is becoming an important area of research in commercial computer aided design and development of manufacturing software (Barnhill, 1985; Lang and R¨oschel, 1992; Hagen and Santarelli, 1992; Aziz et al., 2002). A collection of introductory works on surface modeling can be found in Su and Liu (1989) and the subsequent developments in Muller (2005).

Figure 6.1: Scattered data from a pelvic girdle and the fitted surface

151

Common surface reconstruction algorithms in the computer science literature usually follow a sequential multistage process which includes scanning, outlier removal, denoising and input normal estimation to generate a simplicial surface. The Poisson surface reconstruction method (Kazhdan et al., 2006) solves for an approximate indicator function of the inferred surface, whose gradient best matches the input normals. The output scalar function, represented in an adaptive octree (Whang et al., 2002), is then iso-contoured using an adaptive marching cubes algorithm (Lorensen and Cline, 1987). An illustration of scattered data from a pelvic girdle and the fitted surface using the Poisson surface reconstruction method is provided in Figure 6.1. Cgal surface mesh generator (Rineau and Yvinec, 2007) implements a variant of this algorithm which solves for a piecewise linear function on a 3D Delaunay triangulation instead of an adaptive octree. Hoppe et al. (1992); Boissonnat and Oudot (2005) developed a two stage surface reconstruction algorithm by first estimating M by the implicit surface Zpf q “ ty : f pyq “ 0u of a suitable function f : R3 Ñ R and then using a contouring algorithm to approximate Zpf q by a simplicial surface. There is a rich literature on estimation of surfaces using tensor products of bases (Fowler, 1992; Goshtasby, 1992; Mann and DeRose, 1995; Johnstone and Sloan, 1995). Tensor product surfaces provide a flexible representation of a surface embedded in an arbitrary Euclidean space. However, there is a limited literature on Bayesian modeling of free-form surfaces (Cunningham et al., 1999) and closed surfaces (Soussen and Mohammad-Djafari, 2002). While frequentist surface estimation using tensor products has been widely studied, Bayesian estimation has received almost no consideration. A notable exception is the approach of Smith and Kohn (1997) for Bayesian estimation of bivariate regression surfaces using tensor products. Modeling of closed surfaces is a primary focus in application areas such as computer vision, as closed surfaces provide an adequate geometric model of a wide range of objects ranging from human faces to brains and other organs. In this field, stan152

dard practice involves restrictive parametric shapes depending on a few parameters (Cinquin et al., 1982; Amenta et al., 1998; Rossi and Willsky, 2003). Although such models can describe many common surfaces, the variety of generated shapes is limited. More flexible models for closed surfaces can be defined through carefully specified linear combinations of basis functions. Soussen and Mohammad-Djafari (2002) developed the notion of global harmonic surfaces, which yield a simple procedure to reconstruct coarse surfaces. Shen and Makedon (2006); Chung et al. (2008) developed a novel method based on general and weighted spherical harmonics to model closed sphere-like objects, such as the cortical surface. However the variety of shapes generated by spherical harmonics are somewhat limited to sphere-like or convex objects although weighted spherical harmonics can capture local features like cortical folds quite well. Amenta et al. (1998) developed a surface reconstruction algorithm called the Crust algorithm based on the three-dimensional Voronoi diagram to model closed surfaces from a data cloud in R3 . The algorithm generates a regular surface and the output mesh interpolates, rather than approximates, the input points. However, the algorithm is not probabilistic and does not allow uncertainty in estimating the surface. Moreover, the algorithm requires a dense collection of data points for a reasonably good reconstruction indicating slow convergence. Some illustrations of the Crust algorithm are provided in Figure 6.2. In computer aided design, closed surface modeling is often aided by combining several B´ezier or spline surface patches by endpoint interpolation (Gordon and Riesenfeld, 1974; Piegl, 1986; Casale, 1987; Szeliski and Tonnesen, 1992; Hoppe et al., 1992; Yang and Lee, 1999; Li et al., 2007). In a frequentist analysis such endpoint restrictions are incorporated through constrained optimization. In the Bayesian paradigm, these restrictions lead to mixing problems in the posterior analysis. Furthermore, these restrictions can make the resulting surface non-differentiable along the edges joining the patches. 153

Points Cloud

Output Triangulation

5

5

0

0

−5

−5

−10

−10

10

10 10

5

10

5

5

0 −5

5

0

0

0

−5

−5

−5

Points Cloud

Output Triangulation

3

3

2

2

1

1

0

0

−1

−1

−2

−2

−3

−3

6

6 4

4 4

2

−4

0

−2 −4

−2 −6

2

0

0

−2

4

2

2

0

−2 −6

−4

−4

Figure 6.2: Output triangulation from crust algorithm on a point cloud Instead we use a cyclic basis developed by R´oth et al. (2009) to accommodate restrictions without parameter constraints and give rise to an infinitely smooth surface. In this chapter we propose a Bayesian hierarchical model of a closed surface embedded in R3 using tensor products of cyclic bases with a carefully-chosen shrinkage prior placed on the tensor of basis coefficients. In particular, motivated by the decreasing impact of the higher indexed basis functions in the B´ezier surface representation, we increase the shrinkage as the index increases. The specification leads to a highly efficient algorithm for posterior computation that allows uncertainty in the number of bases. In addition, the proposed prior is shown to have large support and to lead to a posterior with the optimal rate of convergence up to a log factor.

154

6.2 Outline of the method 6.2.1 Review of Terminology Assume a data cloud tpi P R3 , i “ 1, . . . , Nu is given. Our aim is to obtain a posterior distribution for a smooth closed surface about which these data points are concentrated. Before going into the details of our model, we start with a few definitions. Definition 66. A closed surface is a compact two dimensional closed manifold, which does not have a boundary. Examples are spaces like the sphere, the torus, and the Klein bottle. Definition 67. A parametric surface is a surface in R3 which is defined by a parametric equation with two parameters u and v. Mathematically, a parametric surface is an injective map from R2 to R3 defined by S : ra, bs2 Ñ R3 , pu, vq ÞÑ Spu, vq. See the Purdue University thesis Sederberg (1983) for a detailed description of parametric surfaces. Definition 68. Parametrization is an algorithm to find the coordinate pui , vi q corresponding to the observed data point pi for each i “ 1, . . . , N such that there exists a parametric surface S so that pi is regarded as an error-prone realization of Spui , vi q, i “ 1, . . . , N. The coordinate chart tpui , vi q, i “ 1, . . . , Nu is alternatively termed as the associated parameter values. Definition 69. A tensor product surface is formed by taking a tensor product of bases

Sn,m pu, vq “

km kn ÿ ÿ

j“0 k“0

djk Bjn puqBkm pvq,

(6.1)

where pu, vq P ra, bs2 , S is a parametric surface, tdjk P R3 , j “ 0, . . . , km , k “ 155

0, . . . , kn u are control points and tBlkn puq, u P r0, 1s, l “ 0, . . . , kn u are basis functions. Here kn “ n or 2n depending on whether the bases span the algebraic or the trigonometric polynomials having maximum degree n. An example of a tensor prod` ˘ uct surface is the B´ezier surface (Farin, 2002) in which Bjn puq “ nj uj p1 ´ uqn´j , j “ 1, . . . , n, u P r0, 1s. B´ezier surfaces are an extension of the idea of B´ezier curves, and share many of their properties. 6.2.2 Choice of the parameterization Since we intend to fit a parametric surface, we have to find the coordinate chart tpui , vi q P ra, bs2 u corresponding to the points tpi P R3 , i “ 1, . . . , Nu. Closed surfaces can be achieved by parameterizations on the sphere or the torus. The parameterizations are typically estimated from the data by, for example, projecting the points tpi u onto a suitably chosen plane. Spherical harmonics were originally used as a type of parametric surface representation for radial or steller surfaces Spu, vq, 0 ă u ă 2π, 0 ă v ă π (Brechb¨ uhler et al., 1995; Shen and Makedon, 2006). The idea is to project the data on the sphere by constrained optimization and then recover the surface by fitting Spu, vq to pi , i “ 1, . . . , N. Parameterization with the torus topology has the advantage of encompassing a wider range of closed surfaces compared to spherical harmonic functions which can only model sphere-like or convex surfaces. The torus topology ensures that the cross sections along the axes of the closed surface are closed curves, thus allowing more general closed surfaces. As discussed by Staib and Duncan (1992), the torus can be deformed into a tube by squeezing the torus cross section to a thin ribbon and closed surfaces are obtained by considering tubes whose ends meet up to a point. Brechb¨ uhler et al. (1995) discuss some of the practical disadvantages of their method. Instead, we use the relational perspective map developed by Li (2004) to project the 3-d point cloud onto a torus 156

and then scale down to r´π, πs2. Then we can use a tensor product of cyclic bases on r´π, πs2 devoid of any constraints to develop a flexible model for closed surfaces. Applying the relational perspective map to the point cloud in Figure 2, we obtain the points in the r´π, πs2 square shown in Figure 6.3.

Figure 6.3: Parameterization of the human skull and the Beethoven

6.2.3 Closed surface model We assume that the data tpi “ pp1i , p2i , p3i qT , i “ 1, . . . , Nu arise as a random additive perturbation from the closed parametric surface Spu, vq, pu, vq P r´π, πs2, as follows, pi “ Spui , vi q ` ei ,

ei „ Np0, σ 2 I3 q,

i “ 1, . . . , N,

(6.2)

where pui , vi q are coordinates in r´π, πs2 corresponding to point pi P R3 , Spui , vi q “ tS 1 pui , vi q, S 2 pui , vi q, S 3 pui , vi quT is the fitted surface at coordinates pui , vi q, and ei P R3 is a measurement error. Let P, S and E denote the corresponding N ˆ 3 matrix representations with rows tpTi , i “ 1, . . . , Nu, tSpui , vi qT , i “ 1, . . . , Nu and teTi , i “ 1, . . . , Nu respectively. Assume σ ´2 „ Gapaσ , bσ q. We follow a tensor product surface representation (6.1) to model the closed parametric surface Spu, vq, pu, vq P r´π, πs2.

157

6.2.4 Construction of the cyclic basis Using the tensor product specification in (6.1) for the closed surface Spu, vq, we propose to use the cyclic basis developed by R´oth et al. (2009); R´oth and Juh´asz (2010). These bases have a cyclic symmetry that eliminates the need for constraints on the control points, while also leading to surfaces that are infinitely smooth in the sense that the realizations are infinitely differentiable (C 8 ). Assuming S P C 8 is appealing in avoiding the need for geometric constraints and surfaces in C 8 can approximate any parametric closed surface arbitrarily well preserving local features. In addition, S can be characterized as a single coherent surface dependent on only the positions of the control points. In contrast, most methods characterize S by piecing together local surfaces with heavy constraints needed for continuity along the joints of the patches. R´oth et al. (2009) devised a basis for the vector space Vn “ x1, cospuq, sinpuq, . . . , cospnuq, sinpnuqy of trigonometric polynomials of degree at most n, i.e., of truncated Fourier series. Let Bjn puq

˙*n " ˆ 2πj cn , “ n 1 ` cos u ` 2 2n ` 1

where cn “

p2n n!q2 . p2n`1q!

pj “ 0, 1, . . . , 2nq, u P r´π, πs,

(6.3)

The following lemma from R´oth et al. (2009) demonstrates

that any truncated Fourier series can be expressed as a linear combination of the elements of Vn for some large n. This implies that any reasonable closed curve can be approximated arbitrarily well by the linear combination of the elements of Vn for some n. This concept is formalized in §6.3 in discussing posterior convergence. Lemma 70. The functions tBjn puq, i “ 0, 1, . . . , 2n, u P r´π, πsu form a basis of the vector space Vn . 158

Using basis functions (6.3), we can define the tensor product of surfaces of degree pn, mqpn ě 1, m ě 1q by Sn,m pu, vq with kn “ 2n in (6.1). 6.2.5 Model for the control points Let T2n`1,2m`1 pRp q denote the space of tensors of order p2n ` 1q ˆ p2m ` 1q ˆ p. n,m Define D n,m “ rdjk s2m,2n P T2n`1,2m`1 pR3 q for all m ě 1, n ě 1. j“0,k“0 . Clearly D

R´oth et al. (2009) remarked that although the control points have a global effect on the shape, this influence dramatically decreases on further parts of the surface, especially for higher value of n and m. They provide several test examples to show that the decrease of the influence is fast. This observation is the key to the choice of sparsity favoring priors for D n,m . Because the elements of D n,m are expected to have an increasingly localized influence on the shape of the surface Spu, vq as the index on the control points increases, we choose a shrinkage prior that favors smaller values for djk as j and k increases. Here we use a double shrinkage prior to facilitate a sparseness of the tensors D n,m. djk „ N3 p0, φ´1 jk I3 q, φjk “ τj ξk , τj „ Gapαn , βq, ξk „ Gapαm , βq,

(6.4)

where αn is an increasing sequence of positive integers. The prior for S induced from (6.1), (6.3) and (6.4), denoted S „ ΠS n,m , is defined conditionally on n and m. If n and m are chosen to be too small, the prior ΠS n,m will not support a sizable subset of surfaces in Cp8q. As an alternative to choosing n and m to be extremely large or even infinite to obtain large support, we propose to choose a prior for n and m, which allows one to adaptively learn and model average over the unknown dimensions of the control point tensor D n,m . Let pn, mq „ Πn,m denote this prior, with Πn,m a distribution over t1, . . . , 8u2 , such as independent truncated Poissons, and let S „ ΠS denote the resulting prior for S marginalizing 159

out n and m. This approach is related to the literature on Bayesian adaptive splines (Denison et al., 1998), though we will bypass the need to implement the standard reversible jump Markov chain Monte Carlo and describe a computationally efficient approach in §3. 6.2.6 Prior realizations ř n Since Bjn puq ą 0, j “ 0, . . . , 2n, u P r´π, πs and 2n j“0 Bj puq “ 1, the closed surface ř ř2m n m 2 Sn,m pu, vq “ 2n k“0 djk Bj puqBk pvq, ru, vs P T lies in the convex-hull of its conj“1 trol points D n,m . We can achieve a variety of closed surfaces through specific choices

of the control points as shown below in Figure 6.4. To demonstrate the nature of

(a) A sphere with its control points (b) A closed surface with n “ 7, m “ 9 Figure 6.4: (a) A sphere with its control points (b) A closed surface with n “ 7, m“9 the prior realizations with increase in n and m, consider first the case n “ m “ 1. For a fixed v “ v0 , the v0 -section of the surface S1,1 pu, v0 q, u P r´π, πs is a closed curve of degree p1, 1q. Similarly any u0 -section is also a closed curve of degree p1, 1q. Thus S is a closed surface whose cross-sections parallel to the axes are closed curves. For general n and m, the v0 -section Sn,m pu, v0 q is just a linear combination of closed curves, thus producing a rich class of closed curves. Hence the variety of shapes generated increases with increase in n and m which is shown in Figure 6.5. Figure 6.5 also demonstrates that the influence of the control points is increasingly localized for large values of n and m. 160

Figure 6.5: Prior realizations with increasing n and m

6.3 Support of the prior and posterior convergence rates 6.3.1 Support Let T2 denote the 2-dimensional torus represented by the square r´π, πs2. Let the H¨older class of bivariate periodic functions on T2 of order α be denoted by C α pT2 q. Define a class of closed parametric surfaces SC pα1 , α2 , α3 q having different smoothness along different coordinates as SC pα1 , α2 , α3 q :“ tS “ pS 1 , S 2, S 3 q : T2 Ñ R3 , S i P C αi pT2 q, i “ 1, 2, 3u.

(6.5)

For fixed n and m, define the stochastic process S „ ΠS n,m . To characterize the support of our prior, we first recall the definition of the RKHS of a multivariate Gaussian process prior. van der Vaart and van Zanten (2008b) review facts that are relevant to the present setting. A Borel measurable random element W with values in a separable Banach space pB, || ¨ ||q is called Gaussian if the random variable b˚ W is normally distributed for any element b˚ P B˚ , the dual space of B. In our case, 161

the Banach space B is CpT2 ; R3 q, the space of continuous functions from T2 to R3 . The reproducing kernel Hilbert space (RKHS) H attached to a zero-mean Gaussian process W is defined as the completion of the range MB˚ of the map M : B˚ Ñ B defined by Mb˚ “ EW b˚ pW q relative to the inner product xMb˚1 , Mb˚2 yH “ Eb˚1 pW qb˚2 pW q. The following lemma describes the RKHS of the Gaussian process ΠS n,m given tφjk , j “ 0, . . . , 2n, k “ 0, . . . , 2mu. Refer to Appendix D for a proof. Lemma 71. Given tφjk , j “ 0, . . . , 2n, k “ 0, . . . , 2mu, the RKHS Hn,m of ΠS n,m consists of all functions h : T Ñ R3 of the form hpu, vq “

2n ÿ 2m ÿ

j“0 k“0

cjk Bjn puqBkm pvq,

(6.6)

where the weights cjk range over R3 . The RKHS norm is given by

||h||2Hn,m



2n ÿ 2m ÿ

j“0 k“0

||cjk ||2 φjk .

(6.7)

The following theorem describes how well an arbitrary closed parametric surface S0 P SC pα1 , α2 , α3 q can be approximated by the elements of Hn,m for each n and m given tφjk , j “ 0, . . . , 2n, k “ 0, . . . , 2mu. Refer to Appendix D for a proof. Theorem 72. For any fixed S0 P SC pα1 , α2 , α3 q, there exists h P Hn,m with ř ř2m ||h||2Hn,m ď K1 2n j“0 φjk such that j“0 ||S0 ´ h||8 ď K2 pn ^ mq´αp1q log n log m

for some constants K1 , K2 ą 0 independent of n and m.

162

(6.8)

6.3.2 Rate of convergence of the posterior The parameter space is CpT2 ; R3 q ˆ r0, 8q and ΠS ˆ Πσ is the prior on CpT2 ; R3 q ˆ r0, 8q where Πσ denotes a general prior for σ which is compactly supported on r0, Ls for some L ą 0. Assume that the density of Πσ with respect to the Lebesgue measure on the compact interval is bounded away from zero. The inverse gamma prior truncated to the interval r0, Ls provides an example. Definition 73. For a given sequence ǫN Ó 0, the posterior is said to contract around the true parameter value pS0 , σ0 q P CpT2 ; R3 qˆr0, 8q at a rate ǫN if for L sufficiently large, ˇ " * N ˇ 1 ÿ N 2 2 2 2 ˇ Π pS, σq : ||Spui , vi q ´ S0 pui , vi q|| ` |σ ´ σ0 | ą L ǫN ˇ tpi , pui , vi qui“1 N i“1 PS0 ,σ0

Ñ 0 as N Ñ 8. (6.9)

The proof of the following Theorem 74 is provided in Appendix D. Theorem 74. If pS0 , σ0 q P SC pα1 , α2 , α3 q ˆ r0, Ls, an « Oplog nq3 and expt´pnr ` ´3

msqu ď Πn,m ď pnmq , n, m ě 1 for some r, s ą 0, ǫN « N

αp1q p1q `2

´ 2α

logt N, where t

is a known constant. The assumption on Πn,m ensures that the prior probability is not too small on smaller values of n and m so that the prior favors relatively simple representations of the surface. The assumption is satisfied by a product of independent Poissons. Also the shape parameter of the Gamma distribution for τj and ξk should be increased depending on the values of n and m to guarantee an optimal rate of convergence. The increase in shape parameter with n and m corresponds to a greater shrinkage of the higher indexed control points. To estimate a real valued d-variate function in C α pX q, the minimax optimal rate of convergence is n´α{p2α`dq . One can anticipate that for 163

vector valued functions with smoothness αj , j “ 1, 2, 3 in the coordinates, with the loss function defined by the sum of the individual loss across the coordinates, the rate of convergence cannot be improved beyond n´αp1q {p2αp1q `dq . Theorem 74 ensures that the posterior will converge to the true surface at this rate which is offset slightly by a logarithmic factor as expected for Bayesian procedures (De Jonge and van Zanten, 2010; van der Vaart and van Zanten, 2009).

6.4 Posterior computation 6.4.1 Gibbs sampler for a fixed truncation level For a fixed n and m, the full conditional distributions of all the unknown variables are conjugate and we can do Gibbs sampling. Since we only require αn to grow slowly at Oplog nq3 to achieve the optimal rate of convergence, we will assume αn “ α. The sampler cycles through the following steps. Step 1.

Define X to be the N ˆ p2n ` 1qp2m ` 1q matrix with rows

n m tB0n pui q, B1n pui q, . . . , B2n pui qu b tB0m pvi q, B1m pvi q, . . . , B2m pvi qu, i “ 1, . . . , N. Also let

D be the p2n ` 1qp2m ` 1q ˆ 3 coefficient matrix with rows dTjk , j “ 0, 1, . . . , 2n, k “ 0, 1, . . . , 2m. Recall that the density of a matrix-normal random variable Z „ MNpM, Ω, Σq with mean M having dimension n ˆ p is given by f pz | M, Ω, Σq9 expr´0¨5trtΩ´1 pz ´ MqT Σ´1 pz ´ Mqus,

(6.10)

for positive definite matrices Ω and Σ of order p ˆ p and n ˆ n. Then * " ` T ˘ T ´1 ´1 D | ´ „ MNp2n`1qp2m`1qˆ3 X P, I3 , X X ` Λ "

`

vecpDq | ´ „ N3p2n`1qp2m`1q vecpX P q, I3 b X X ` Λ

164

T

T

˘ ´1 ´1

(6.11) * .

(6.12)

Here Λ´1 “ diagtτj ξk , j “ 0, . . . , 2n, k “ 0, . . . , 2mu Step 2.

σ

´2

˙ ˆ N ÿ 2 | ´ „ Ga aσ ` 3N{2, bσ ` 0¨5 ||pi ´ Spui , vi q|| .

(6.13)

i“1

Step 3. For j “ 0, . . . , 2n and k “ 0, . . . , 2m, ˙ ˆ 2m ÿ 2 ξk ||djk || . τj | ´ „ Ga α ` 3p2m ` 1q{2, β ` 0¨5

(6.14)

˙ ˆ 2n ÿ 2 ξk | ´ „ Ga α ` 3p2n ` 1q{2, β ` 0¨5 τj ||djk || .

(6.15)

k“0

j“0

6.4.2 Posterior sampling of n and m The conditional likelihood of n, m, tpdjk , φjk q, j “ 0, . . . , 2n, k “ 0, . . . , 2mu given tpi , pui , vi q, i “ 1, . . . , Nu is proportional to exp

"

* 2n ź 2m N ź 1 ÿ 2 ´ 2 ppdjk | φjk qppφjk q. ||pi ´ Sn,m pui , vi q|| Πn,m 2σ i“1 j“0 k“0

In this case, we take advantage of the partial analytic structure (Godsill, 2001) in the models as pn, mq changes and rather than proposing an entirely new parameter vector, the form of reversible jump MCMC for n and m becomes relatively straightforward. The common parameters σ ´2 and tdjk , j “ 0, . . . , 2n, k “ 0, . . . , 2mu as the order of the model changes are updated using a within model Gibbs move as in §6.4.1. Consider a proposal qpn, m | n0 , m0 q “ qpn | n0 qqpm | m0 q with qp1 | 0q “ 1 and qpk 1 | kq “ 1{2 for all |k´k 1 | “ 1. Suppose the chain is at pn0 , m0 q and a proposal is made to go to state pn0 ` 1, m0 ` 1q, we employ a step-wise sampler as in (Godsill, 165

2001). We sample pd12n0 `1,2m0 `1 , φ12n0 `1,2m0 `1 q and pd12n0 `2,2m0 `2 , φ12n0 `2,2m0 `2 q from a kernel kerpdjk , φjk q and the move is accepted with probability mint1, αu, where α“

( 2 ||p ´ S pu , v q|| Πn0 `1,m0 `1 i n `1,m `1 i i 0 0 i“1 ˆ ( řN 1 exp ´ 2σ2 i“1 ||pi ´ Sn0 ,m0 pui , vi q||2 Πn0 ,m0

exp ´

1 2σ2

řN

qpn0 ` 1, m0 ` 1 | n0 , m0 q . ś qpn0 , m0 | n0 ` 1, m0 ` 1q 2j“1 ppd12n0 `j,2m0 `j , φ12n0 `j,2m0 `j q We take kerpdjk , φjk q “ ppdjk , φjk q.

(6.16)

The proposal probabilities for the moves

pn0 , m0 q Ñ pn0 , m0 ˘ 1q, pn0 , m0 q Ñ pn0 ˘ 1, m0 q and pn0 , m0 q Ñ pn0 ˘ 1, m0 ˘ 1q can be derived similarly. The shrinkage prior on the φjk ’s gives rise to highly efficient RJMCMC moves which converge to the appropriate values of n and m rapidly in most cases we have observed.

6.5 Applications We analyzed the skull and Beethoven data shown in Figure 6.1 using our proposed method. As all reasonable methods will do a good job at surface estimation based on a large number of points located very close to the surface of interest, we simulated different levels of sparse and noisy data by sampling a subset of the points in the original data sets and adding different levels of Gaussian measurement errors. In many other applications, sparse and noisy data are routinely collected but focusing on two dense, low measurement error data sets allows careful study of the impact of sample size and measurement error on the performance of our proposed Bayesian approach relative to the state-of-the-art Crust algorithm. First we reconstruct the surface from non-noisy sparse data by taking random subsamples of 390 points from the skull and Beethoven point clouds. The results for Crust are shown in Figure 6.6, while the results for our proposed Bayesian approach are shown in Figure 6.7. In each case, we generated 5000 samples and discarded the 166

first 2000 as burn-in. Convergence was monitored using trace plots of the deviance as well as several parameters. Also we get essentially identical posterior modes of n and m with different MCMC starting points and moderate changes to hyperparameters. In many applications, the features of the data acquisition device can dictate the amount of noise incorporated. Choosing an informative prior for the noise variance can help in the ability to pick up local features. The hyperparameters in the priors for τj and ξk play a key role in controlling the smoothness of the surface. An increase in αn corresponds to a decrease in the values of τj and ξk leading to over-smoothing. However one needs to carefully control αn to prevent over-shrinkage leading to oversmoothing. In the applications below αn “ 3{2 and β “ 3{2. Estimation of noise variance and the surface is robust to moderate changes in hyperparameters as the sample size increases. Our method performs closely to Crust for non-noisy data. As we add Gaussian noise to the points, the performance of Crust deteriorates (Figure 6.8) while the tensor product surface (Figure 6.9) is quite robust to the addition of noise as it takes into account the uncertainty in estimating the surface. In Figure 6.8, we notice some parts from the skull and the Beethoven’s head jutting out owing to poor characterization of the noise. To compare the performance of our method with existing competitors, we compute the Hausdorff distance between the true surface and the fitted surface as described below. Let S1 and S2 be two manifolds embedded in R3 . Then the Hausdorff distance is defined by hD pS1 , S2 q “ max

"

sup inf dpx, yq, sup inf dpx, yq

xPS1 yPS2

xPS2 yPS1

*

(6.17)

where d is any distance in R3 . It can be shown that hD pS1 , S2 q “ 0 if and only if S1

ˆ and S2 have the same closure. For the tensor product approach we estimate hD pS, Sq 167

by max sup inf dppi , pˆj q, sup inf dppi , pˆj q i

j

j

i

(

where tˆ pi , i “ 1, . . . , Nu is a Bayes estimate of tpi , i “ 1, . . . , Nu where d is the

ˆ by standard Euclidean distance. For the Crust algorithm, we estimate hD pS, Sq max sup inf dppi , tj q, sup inf dppi , tj q i

j

j

i

(

where tti : i “ 1, . . . , Mu is a dense grid of points on the resulting simplicial surface. We summarize the performances of the Crust algorithm and the tensor product approach in Table 6.1 for a variety of choices of the sample size and noise variance (σ 2 ). We observe that for non-noisy data Crust performs closely and slightly better than the tensor-product surface for large sample sizes while the tensor product outperforms the Crust as the noise variance increases. As the sample size increases, the tensor product surface fit becomes better even when the noise variance is large. However, the performance of the Crust improves with sample size only when the noise variance is very small. Posterior summaries of the noise variance and the basis function truncation levels n and m are provided in Table 6.2. The noise variance is not well-estimated for small sample sizes and smaller value of the true noise variance. However, estimation becomes better for larger sample sizes consistent with the posterior convergence results. Also, one can estimate larger variances well compared to smaller ones for reasons discussed earlier. As the sample size increases, the posterior mode of pn, mq tend to increase slightly when the noise variance is small in order to capture local features. When the noise variance is large, the global features dominate and the posterior modes of n and m remain constant at the smaller values.

168

Table 6.1: Hausdorff distance between true and fitted surface using tensor product method and Crust Skull Beethoven σ N=390 N=690 N=990 N=390 N=690 N=990 0¨05 (2¨123, 2¨045) (2¨008, 1¨971) (1¨981, 1¨791) (1¨528, 1¨557) (1¨510, 1¨527) (1¨411,1¨397) 0¨1 (2¨561, 2¨671) (2¨345, 2¨682) (2¨311, 2¨677) (1¨589, 1¨679) (1¨524, 1¨560) (1¨579, 1¨730) 0¨2 (2¨711, 3¨134) (2¨697, 3¨225) (2¨523, 3¨435) (1¨812, 2¨146) (1¨796 ,1¨874) (1¨657, 2¨334)

Table 6.2: Posterior summaries of σ and n, m (posterior mean of σ, 95% credible intervals for σ, posterior mode of pn, mq) σ

N=390

0¨05 0¨075, [0¨065, 0¨093], (3,4) 0¨1 0¨194, [0¨124, 0¨265], (4,4) 0¨2 0¨220, [0¨127, 0¨314], (3,4) 0¨05 0¨090, [0¨081 0¨109], (5,6) 0¨1 0¨220, [0¨191, 0¨261], (5,6) 0¨2 0¨228, [0¨166, 0¨291], (5,6)

N=690 Skull 0¨064, [0¨056, 0¨077], (4,4) 0¨154 [0¨096, 0¨213], (4,4) 0¨210 [0¨136, 0¨279], (3,4) Beethoven 0¨061, [0¨041, 0¨081], (6,6) 0¨171,[0¨143, 0¨191], (5,6) 0¨214, [0¨161, 0¨267], (5,6)

N=990 0¨056, [0¨047, 0¨062], (4,4) 0¨120, [0¨081, 0¨156], (3,4) 0¨196 [0¨139, 0¨253], (3,4) 0¨537, [0¨039, 0¨067], (6,7) 0¨167, [0¨091, 0¨159], (6,6) 0¨203 [0¨161, 0¨246], (5,6)

6.6 Discussion This chapter develops a novel Bayesian hierarchical model for a closed surface, allowing full posterior inferences via an efficient Markov chain Monte Carlo algorithm. Consistent with our theory results on optimal rates of posterior contraction, we find that the methodology does a good job in reconstructing a closed surface from sparse and noisy 3d point cloud data yielding improved performance over state-of-the-art computer science algorithms. Although modern sensing technology, such as computed tomography or magnetic resonance imaging, enables us to make detailed scans of complex objects generating point cloud data consisting of millions of points, the data acquired is usually distorted by noise arising out of various physical measurement processes and limitations of the acquisition technology. Most of these points are typically discarded after taking into account acquisition effects leading to a sparse noisy point cloud. The resolution specifics of these acquisition devices provide information on the magnitude of the measurement error variance. An appealing feature of 169

Sparse non−noisy points Cloud

Output Triangulation

5

5

0

0

−5

−5

−10

−10

10

10 10

5

10

5

5

0 −5

5

0

0

0

−5

−5

−5

Sparse non−noisy points Cloud

Output Triangulation

4

4

3

3

2

2

1

1

0

0

−1

−1

−2

−2

−3

−3

−4

−4

4

4 2

2 0

0

4 2

−2

0

−4

−2 −6

4 2

−2

0

−4

−2 −6

−4

−4

Figure 6.6: Output triangulation using Crust on a sparse (390 points) non-noisy point cloud our Bayesian approach is that we obtain a full posterior for the surface allowing uncertainty. Visualizing this uncertainty is an interesting challenge for future research, but one can produce interior and exterior pointwise 95% credible surfaces and even movies of surface realizations from the posterior. In addition, when there is interest in surface features, such as the interior volume, surface area, or the number of holes, one can obtain posterior summaries of the feature of interest. Our proposed approach represents an initial step in a line of research related to Bayesian modeling of 3-d closed surfaces. There are several important next steps. It is commonly the case that each subject has their own surface and interest focuses on modeling a collection of dependent surfaces across subjects, while incorporating subject-specific predictors, using the surface to predict a response variable, and testing differences in distributions of surfaces between groups. In such settings, it is necessary to align the surfaces for the different subjects, which can potentially be ac170

Sparse non−noisy Points Cloud

Output Tensor product surface

5

5

0 0 −5 −5 −10 −10 −15 10

10 10

5

10

5

5

0 −5

5

0 0

0

−5

−5

−5

Sparse non−noisy Points Cloud

Output Tensor product surface

3

3

2

2

1

1 0

0

−1

−1

−2

−2

−3

−3

4

4 4

2

−4

0

−2

−2 −6

2

0

0

−2

4

2

2

0

−2

−4 −6

−4

−4

Figure 6.7: Output tensor product surface on a sparse (390 points) non-noisy point cloud complished in a Bayesian probabilistic framework. Another ongoing problem relates to surfaces that change dynamically over time within a subject. In addition, it is common for the data to not consist simply of a 3-d point cloud but instead to have pixelated data in which the surface(s) of interest are embedded in an blurry image containing other objects. As in other functional data modeling settings, the smoothness and local features of the surfaces being estimated can be somewhat sensitive to the basis functions being used. We have focused on tensor products of truncated Fourier series, which lead to obtain rates of posterior contraction and have good practical performance in reconstructing infinitely smooth surfaces that have cross sections that are closed curves. There are settings in which the objects being modeled may have interesting local features, such as spikes, that may be smoothed out with our proposed bases and shrinkage priors in the absence of abundant data. 171

Sparse noisy points Cloud

Output Triangulation

8 6

5

4 2 0

0 −2

−5

−4 −6 −8

−10

−10 10

10 10

5

10

5

5

0

5

0

0

−5

0 −5

−5

−5

Sparse noisy points Cloud

Output Triangulation

3

3 2

2

1

1

0

0

−1 −1 −2 −2 −3

4

4 2

2

4 0

−6

0

−2

0 −4

2

0

2 −2

−2

−4

−2

−6

−4

−4

Figure 6.8: Output triangulation using Crust on a sparse (390 points) noisy (std=0.2) point cloud

172

Sparse noisy Points Cloud

Output Tensor product surface

5

5

0

0

−5

−5

−10 −10

10

10 10

5

5

10

5

0

5

0

0

−5

0

−5

−5

−5

Sparse noisy (std =0.2) Points Cloud

Output Tensor product surface

3

3

2

2

1

1

0

0

−1

−1

−2

−2

−3

−3

5 4 4

2 0

4

2 −2

2

0 0

0 −4

−2

−2 −6

−5

−4

−4

Figure 6.9: Output tensor product surface on a sparse (390 points) noisy (std=0.2) point cloud

173

7 Bayesian geostatistical modeling with informative sampling

7.1 Introduction Geostatistical models focus on inferring a continuous spatial process based on data observed at finitely many locations, with the locations typically assumed to be noninformative. As noted by Diggle et al. (2010), this assumption is commonly violated for point-referenced spatial data, as it is not unusual to collect data at locations thought to have a large or small value for the outcome. For example, in monitoring of air pollution, one may place more monitors at locations believed to have a high value of ozone or another pollutant, while in studying distribution of animal species one may systematically look in locations thought to commonly contain the species of interest. Diggle et al. (2010) proposed a shared latent process model to adjust for bias due to informative sampling locations. Their analysis was implemented using a Monte Carlo approach for maximum likelihood estimation. We follow a Bayesian approach using a model related to those described by Menezes (2005), Ho and Stoyan (2008) and Diggle et al. (2010). The locations are

174

modeled using a log Gaussian Cox process (Møller et al., 2001), with the intensity function included as a spatially-varying predictor in the outcome model, which also includes spatial random effects drawn from a Gaussian process. A parameter a controls the degree of informative sampling, and the sampling locations are ignorable in the special case in which a “ 0, while a ą 0 implies a tendency to take more observations at spatial locations having relatively high outcome values. This model modifies shared random effects models for joint modeling of longitudinal and event time data (Radcliffe et al., 2004) and for accommodating informative missingness (Wu and Follmann, 1999). To our knowledge, we are the first to develop a Bayesian approach to the informative locations problem in geostatistical modeling. However, adapting recently proposed models to the Bayesian paradigm is relatively straightforward, and our primary contribution is studying the theoretical properties of the model. In particular, it is not obvious that the data contain information about the informativeness of the sampling locations, and one may wonder to what extent the prior is driving the results even in large samples. We address this concern by proving that the posterior is proper under a noninformative prior on a. In addition, one can consistently estimate a, the density of the sampling locations and the mean function of the outcome process. This later result extends recent work showing posterior consistency in Gaussian process regression models (Choi and Schervish, 2007b; Choi, 2007).

7.2 Model for spatial data with informative sampling Our objective is to estimate the spatial surface µpsq P R for all s P D Ă R2 based on observations y1 , . . . , yn at locations s1 , . . . , sn P D. We propose the following joint model yi | si „ Ntηpsi q ` aξpsi q, σ 2 u,

exptξpsi qu exptξpsquds D

ppsi q “ ş 175

pi “ 1, . . . , nq,(7.1)

where the observations are independent across locations si given ξpsq and ηpsq, and ppsq is the location density. Assuming the locations are a realization of an inhomogeneous Poisson process with log intensity ξpsq, the mean surface is characterized as µpsq “ ηpsq ` aξpsq, where ηpsq is a baseline surface and aξpsq is an adjustment due to informative sampling. Letting xpsq denote a vector of spatial covariates, ξpsq “ xpsqT βξ ` ξr psq and ηpsq “ xpsqT βη ` ηr psq, where βξ and βη are regression coefficients and ξr psq and ηr psq are mean zero residual processes. The log sampling density is treated as a latent covariate to adjust for informative sampling, with a ą 0 implying that samples are more likely to be taken in areas with a large response. Setting the coefficient in βξ corresponding to the intercept to zero for identifiability, Epyi | si q “ xpsi qT β ˚ ` aξr psi q ` ηr psi q pi “ 1, . . . , nq,

(7.2)

where β ˚ “ aβξ `βη . Therefore, accounting for informative sampling is only necessary when there is an association between the spatial surface of interest and the sampling density that cannot be explained by the shared spatial covariates xpsq. The residuals ξr psq „ Πξr and ηr psq „ Πηr are assigned independent mean zero Gaussian process priors with Mat´ern covariance functions (Stein, 1999), τ2 cph | ψq “ ν´1 2 Γpνq

ˆ

2ν 1{2 h ρ

˙ν



ˆ

2ν 1{2 h ρ

˙

,

h “ ||s ´ s1 ||,

(7.3)

where ψ “ pτ 2 , ρ, νq and K is the modified Bessel function of the second kind. The Mat´ern covariance has three parameters: τ 2 ą 0 controls the variance, ρ ą 0 controls the spatial range of the correlation, and ν ą 0 controls the smoothness of the process. Special cases include the exponential cph | ψq “ τ 2 expp´21{2 h{ρq with ν “ 1{2, and the squared exponential cph | ψq “ τ 2 expp´2h2 {ρ2 q with ν “ 8.

176

7.3 Theoretical properties 7.3.1 Weak posterior consistency In this section we obtain posterior consistency of the parameters of our model with respect to fixed-domain asymptotics. Consider the joint model defined in §2, with D “ r0, 1s2 without loss of generality and Πξr , Πηr Gaussian processes on CpDq, the space of continuous functions on D. Letting cph | ψξ q and cph | ψη q denote the covariance functions for ξr and ηr , respectively, we choose independent bounded hyperpriors for τξ2 , τη2 , νξ and νη while letting ρξ „ πξ and ρη „ πη , where the supports of both πη and πξ are R` . We choose a proper prior on R for a, βξ „ Npβ0ξ , Σ0ξ q, βη „ Npβ0η , Σ0η q and σ 2 „ Inv-Gapασ , βσ q. Assumption 75. The prior ζ „ Π satisfies the prior positivity condition Πpζ : ||ζ ´ ζ0 ||8 ă ǫq ą 0 for all ǫ ą 0 and for any ζ0 P CpDq. van der Vaart and van Zanten (2009) showed that Assumption 1 holds for Gaussian process priors with squared exponential covariance under mild conditions, and Choi (2005) provided a set of sufficient conditions on the Mat´ern covariance kernel for the same. Assumption 76. The covariates are uniformly bounded, so there exists an M ą 0 such that ||xpsq|| ď M for all s P D. Theorem 77. Under model (7.1)–(7.2) with priors chosen as described in §3 and Assumptions 1–2, the posterior distribution Πpξr , ηr , a, βξ , βη , σ | tpyi , si q, i “ 1, . . . , nuq, is weakly consistent. Theorem 77 does not imply that the hyperparameters in the covariance kernel are consistently estimated, though we do take into account uncertainty in these parameters and do not assume that the priors are well specified. It is typically not 177

possible to consistently estimate all the parameters in the Mat´ern covariance (Zhang, 2004). 7.3.2 Posterior propriety of a Under model (7.1)–(7.2), the parameter a controls the degree of informative sampling. The uniform improper prior, πa paq91, provides a noninformative choice. Theorem 78 shows that this prior leads to a proper posterior, implying that the data are informative about a. (T Letting s “ ps1 , s2 , . . . , sn q, y “ py1 , y2 , . . . , yn qT , ξrn “ ξr ps1 q, ξr ps2 q, . . . , ξr psn q (T and ηrn “ ηr ps1 q, ηr ps2 q, . . . , ηr psn q , we have ξrn „ Np0, Σnξ q and ηrn „ Np0, Σnη q, where Σnξ ps, s1q “ cp||s ´ s1 || | ψξ q and Σnη ps, s1 q “ cp||s ´ s1 || | ψη q for s, s1 P D. Let cph | ψξ q “ τξ2 expp´21{2 hp {ρξ q and cph | ψη q “ τη2 expp´21{2 hp {ρη q for 0 ă p ď 2. We assume independent bounded priors on τξ and τη and independent discrete uniform priors on ρξ and ρη . Also, βξ „ Npβ0ξ , Σ0ξ q, βη „ Npβ0η , Σ0η q and σ 2 „ πpσ 2 q. Here we focus on powered exponential covariance functions rather than Mat´ern to simplify calculations. A similar result should hold for Mat´ern covariance functions if the priors on the hyperparameters have a bounded support. Theorem 78. With the above prior specifications, the marginal posterior distribution of a, ppa | y, sq, is proper provided n ě 2 and Eπ pσq ă 8. When the conditions of Theorem 78 are satisfied, the joint posterior is also proper. Proofs are provided in Appendix E.

7.4 Computational details The exact density for the sample locations in (7.1) is not available analytically, so approximation is required. In point process modeling the integral is often approximated as the sum over a fine grid. Letting t1 , . . . , tM P D be a rectangular grid 178

covering D with cell area ∆, ż

D

exptξpsquds « ∆

M ÿ

j“1

exptξptj qu.

(7.4)

This approximation yields a tractable posterior, but requires computationally expensive matrix inversions, which we limit using a kernel convolution approximation to f. Let δpsq be a mean zero Gaussian process with covariance cph | ψq. A process convolution (Higdon, 2002) lets δpsq “

ż

D

Kψ ps ´ uqdW puq,

(7.5)

where W is Brownian motion and Kψ is a kernel with parameters ψ. The kernel corresponding to the Mat´ern covariance is Γpν ` 1q1{2 ν ν{4`1{4 |u|ν{2´1{2 Kν{2`1{2 Kψ puq “ τ 1{2 π Γpν{2 ` 1{2qΓpνq1{2 ρν{2`1{2

ˆ

2ν 1{2 |u| ρ

˙

.

The kernel convolution representation of the Gaussian process in (7.5) is often used to motivate dimension reduction for the spatial process. Let φ1 , . . . , φN be a grid of spatial knots. Then for large N

δpsq «

N ÿ

j“1

Kψ ps ´ φj qwj ,

(7.6)

where wj „ Np0, 1q. Applying kernel convolution to ξpsq and ηpsq yields "

˚

yi | si „ N xpsi q β ` T

N ÿ

j“1

Kψη psi ´ φj quj ` a

N ÿ

j“1

* Kψξ psi ´ φj qvj , σ , (7.7)

) ! ř K ps ´ φ qv exp xpsi qT βξ ` N ψξ i j j j“1 ! ), ppsi q “ ř ř M N T exp xpt q β ` K pt ´ φ qv l ξ ψξ l j j l“1 j“1 179

2

where uj , vj „ Np0, 1q. Selecting the number of grid points M and knots N is discussed in §7.5 & 7.6. We use a combination of Gibbs and Metropolis sampling for posterior computation. Assuming conjugate normal and inverse gamma priors, and reparameterization so that uj „ Np0, τη2 q and vj „ Np0, τξ 2 q, the full conditionals for β ˚, a, τη2 , τξ2 and the vector pu1 , . . . , uN qT are conjugate and we use Gibbs sampling. The correlation parameters ρη and ρξ and the smoothness parameters νη and νξ are updated with Metropolis sampling, tuned to have acceptance ratio near 0¨4. The sampling density parameters vj are updated using blocked Metropolis sampling to account for posterior correlation between coefficients for nearby knots. We used ten blocks, with knots allocated to blocks using k-means clustering implemented by the kmeans package in R. For the simulation study in §7.5 we generated 5,000 samples and discarded the first 1,000 as burn-in. For the analysis of the ozone data in §7.6 we generated 20,000 samples and discarded the first 5,000. Convergence was monitored using trace plots of the deviance as well as several representative parameters.

7.5 Simulation study We conduct a simulation study to illustrate the effect of failing to account for informative sampling on spatial interpolation, and determine the amount of data need to reliably identify informative sampling. We assume D “ r0, 1s2 and no spatial covariates, xpsq “ 1 for all s. We generate data using model (7.7) with an equally-spaced grid of N “ 225 knots on r-0¨2,1¨2s2 and a Mat´ern kernel. We generate S “ 50 data sets from each of four simulation scenarios: (1) n “ 250, a “ 0, ρ “0¨2; (2) n “ 250, a “ 1, ρ “0¨2; (3) n “ 250, a “ 1, ρ “0¨5; and (4) n “ 500, a “ 1, ρ “0¨2, with σ “ 1, Etµpsqu “ 0, ν “2¨0, and τ “0¨1 under all scenarios. For each simulated data set we fit the following three models. The noninformative sampling (NIS) model ˆ to account for informative locations, sets a “ 0, the plug-in model sets ξpsq “ ξpsq 180

and the full model implements the approach of §4. In the plug-in analysis the location density is estimated using kernel density estimation in R’s KernSur function in the GenKern package with default settings. GenKern gives a bivariate kernel density estimate that uses Gaussian kernels with bandwidth chosen using a direct plug-in approach to approximate the asymptotically optimal bandwidth. We use the same grid of N “ 225 knots used to generate the data in the kernel convolution model, and approximate the integral using a square grid of M “ 900 points t1 , . . . , tM covering r0, 1s. Motivated by Rodrigues and Diggle (2010), we used an equally spaced grid of 225 knots on r´0.2, 1.2s2. Simulation study results show that irrespective of the number and position of the sampling locations, the Gaussian process can be well approximated with 225 knots. Following Lee et al. (2005), the grid spacings are chosen to be no larger than the standard deviation of the kernel in the convolution representation. We use diffuse normal priors for β ˚ and a and the covariance parameters have priors σ 2 , τξ2 , τη2 „ Inv-Gap0¨01, 0¨01q, ρ2ξ , ρ2η „ Up0, 2q, and νξ2 , νη2 „ Up0, 30q. Table 1 reports bias, mean squared error (MSE), mean absolute deviation (MAD) and coverage probability (CP), each averaged over the grid of M spatial locations t1 , . . . , tM . The coverage probability is the proportion of the M grid locations for which the posterior 95% interval for µptj q covers the true value. For the plug-in model and the full model we also report the power for a in Table 7.1 which is defined to be the proportion of data sets for which the posterior 95% credible interval for a excludes zero. All three methods perform similarly when sampling is not informative. In this case, the informative sampling methods rarely identify a as significant and reduce to the usual geostatistical model. The noninformative sampling model has high mean squared error and negative bias in the remaining designs with informative sampling. The two methods that allow for informative sampling reduce mean squared 181

Table 7.1: Simulation study results Design

Model

1

NIS Plug-in Full NIS Plug-in Full NIS Plug-in Full NIS Plug-in Full

2

3

4

MSE(ˆ102 ) 33¨1 32¨2 31¨9 49¨4 39¨2 32¨9 13¨2 12¨1 10¨8 25¨6 20¨9 19¨1

(2¨8) (1¨7) (1¨2) (5¨0) (5¨5) (2¨8) (1¨1) (0¨8) (0¨7) (1¨1) (0¨8) (0¨6)

MAD(ˆ102)

Bias(ˆ102 )

CP(ˆ102 )

41¨3 (0¨6) 41¨3 (0¨) 41¨5 (0¨7) 50¨0 (1¨1) 44¨8 (0¨9) 43¨2 (0¨8) 28¨1 (1¨8) 27¨1 (1¨8) 25¨3 (1¨4) 36¨9 (0¨7) 33¨9 (0¨5) 32¨6 (0¨4)

2¨0 (1¨3) 2¨5 (1¨3) 2¨5 (1¨3) ´25¨8 (1¨3) ´13¨9 (1¨3) ´7¨5 (1¨6) ´8¨3 (1¨4) ´3¨1 (1¨4) ´2¨0 (1¨3) ´15¨3 (1¨2) ´7¨2 (1¨1) ´0¨8 (1¨0)

93¨0 93¨0 93¨0 90¨0 91¨0 93¨0 94¨0 94¨0 95¨0 92¨0 92¨0 94¨0

(1¨0) (1¨0) (1¨0) (1¨0) (1¨0) (1¨0) (1¨0) (1¨0) (1¨0) (1¨0) (1¨0) (1¨0)

Power for a(ˆ102 ) – 10¨0 10¨0 – 74¨0 80¨0 – 40¨0 50¨0 – 88¨0 98¨0

error compared to the noninformative sampling model. The informative sampling models also reduce bias, although some bias remains, especially for design 2. In all cases the full model improves on the plug-in approach. The relative mean squared error of the noninformative sampling model to the full model is smaller for design 3 (0¨132/0¨108 = 1¨222) with large spatial range and design 4 (0¨256/0¨190=1¨347) with large sample size than for design 2 (0¨494/0¨329 = 1¨502), so it seems that accounting for informative sampling is most important for small data sets with considerable spatial variation. To analyze sensitivity to the prior for a, we redid simulation design 2 with a “ 1 and ρ=0¨2 and used four different priors for a: Np1, 1q, Np0, 1q, Np0, 102q and an improper prior. In summary, mean squared prediction error and predictive coverage are insensitive to the hyperparameters of the prior on a for n “ 150 and n “ 200. Even for a sample size as small as n “ 50, differences are small for different priors. However, the Np0, 102q prior and the informative prior Np1, 1q lead to a better power for a than the others when n “ 50 and 100. The minimum sample size needed to swamp out the prior for a is around 150 in this example.

182

7.6 Analysis of Eastern United States ozone data With the increasing concern about air pollution and climate change, building predictive models for ozone is an important area. It is often the case that the monitoring locations are informative about the ozone surface and hence it is important to account for informative sampling. We analyze the median daily ozone for June-August 2007 for n “ 631 observations in the Eastern United States. The data are plotted in Fig. 7.1(a). There is a clear association between the sampling density and the response, as there are more monitors placed in areas with high ozone, such as Atlanta and New England, than areas with low ozone, such as Mississippi and West Virginia. We fit a generalized additive model to the median ozone values and the kernel density estimate of the log sampling density using locally weighted scatterplot smoothing in Fig. 7.1(b). The linear fit is entirely contained within the generalized additive model 95% confidence intervals for all values of the log sampling density estimate, supporting the log linear model in (7.1). To apply a stationary spatial model we first project the spatial locations to a two-dimensional surface using the Mercator projection, and then scale them to the unit square coordinate-wise by subtracting the minimum and dividing by the range of the observation locations. We fit the informative sampling model with a 30 ˆ 30 grid of knots on r´0¨2, 1¨2s2 in the kernel convolution approximation in (7.6) and a 50 ˆ 50 grid of points on r0, 1s2 in the integral approximation in the sampling density (7.4). Points outside the convex hull of the observation locations or outside the continental United States were discarded from integral approximation to the sampling density, leaving M “ 1077. Kernel convolution knots not within 0¨1 of an integral approximation knot were discarded, leaving N “ 490. We include a second-order spatial trend as predictors in xpsq, that is, linear and quadratic terms for re-scaled latitude and longitude, and their interaction. We 183

45

100

40

90

35

80

30

70

60

−90

−85

−80

−75

−70

(a) Median ozone

(b) Log sampling density versus median ozone (circles), gamfit with 95% intervals (dashed), linear fit (solid)

Figure 7.1: Plots of the ozone data. Panel (a) plots the ozone data (ppb; color) and monitor locations (points), Panel (b) plots the estimated log sampling density against the response. compare the noninformative sampling, plug-in and full models described in §7.5. The posteriors for several parameters are summarized in Table 7.2. The spatial process for both the mean process and sampling density are fairly smooth. The posterior 95% intervals for νξ and νη exclude the exponential covariance (ν “ 0¨5) for all the three models. The 95% interval of a for both the plug-in model (2¨16, 6¨46) and fully Bayesian model (2¨12, 4¨25) excludes zero, indicating an informative sampling scheme. The scale of a’s posterior is not comparable between the two models since the plug-in density estimate has been standardized to have mean zero and variance one. The effect of accounting for informative sampling is illustrated in Fig. 7.2. The difference in predicted values between the noninformative sampling and full model in Fig. 7.2(c) is the largest in Northern Pennsylvania and West Virginia. These areas have 184

85

−5.0 −5.5

40

40

90

−6.0 −6.5

35

35

80

75

−7.5 30

30

70

−7.0

−8.0

−85

−80

−75

−85

−70

(a) Posterior mean predicted values from full model

−80

−75

−70

(b) Log sampling density from the full model

2

2

0

0

35

35

40

4

40

4

−2 30

30

−2

−4

−85

−80

−75

−70

−4

−85

(c) Posterior mean predicted values (NIS - full)

−80

−75

−70

(d) Posterior mean predicted values (NIS - plug-in)

Figure 7.2: Posterior mean predicted values of ozone relatively few monitors and are near areas with high ozone. The difference between the noninformative sampling and plug-in predictions in Fig. 7.2(d) are also positive in these areas though the differences are not nearly as large in the plug-in analysis. This may be because the plug-in estimates do not appropriately account for uncertainty 185

Table 7.2: Mean and 95% intervals for the ozone data

Parameters a σ τg ρg νg τf ρf νf

NIS 4¨68 0¨17 0¨06 3¨95

– (4¨37, (0¨14, (0¨05, (0¨92, – – –

5¨03) 0¨27) 0¨16) 6¨42)

4¨43 4¨70 0¨15 0¨06 3¨46

Plug-in

Full

(2¨16, (4¨38, (0¨13, (0¨04, (1¨53, – – –

3¨21 (2¨12, 4¨25) 4.78 (4.47, 5.12) 0¨17 (0¨13, 0¨21) 0¨06 (0¨05, 0¨10) 12¨6 (0¨74, 28¨8) 0¨05 (0¨04, 0¨06) 0¨07 (0¨04, 0¨13) 10¨7 (0¨74, 28¨77)

6¨46) 5¨04) 0¨19) 0¨10) 5¨52)

in estimation, and hence may lead to some attenuation of the estimated surface. Finally, we refit the model with different priors and different knot locations to test for sensitivity to these assumptions. We fit the model with 20ˆ20 and 40ˆ40 initial grids of knots in the kernel convolution approximation. After removing knots outside the domain of interest, this gave N “ 206 and N “ 876 knots, respectively. The results were fairly similar to the original 30ˆ30 grid. In all cases the posterior of a was separated from zero, the posterior median being 3¨31 and 2¨85 for N “ 206 and N “ 876 knots, respectively, and the largest difference between the noninformative sampling and full model was in the Northern Pennsylvania and West Virginia.

7.7 Discussion We have focused on a simple model for informative locations, which assumes that the outcomes are conditionally independent of the locations given the mean process µpsq and the spatial location density ppsq. In addition, we include a single parameter a controlling the informativeness of the sampling process. These simplifying assumptions certainly make the theory and computation more tractable. However, to more realistically characterize data from a broader variety of applications, it may

186

be necessary to generalize the models. There are several interesting directions in this regard. First, it is straightforward conceptually to replace the constant a with a spatially-varying coefficient apsq, which is assigned a Gaussian process prior. This generalization allows the informativeness of the sampling locations to vary spatially; for example, in certain regions, say near cities, monitors may be placed without regard to the outcome, while in other regions, say in the rural areas, monitors may be placed at sites likely to have high values of ozone. It is an open question whether one can consistently estimate apsq in this extended model without very restrictive assumptions. However, a simple adjustment for informative sampling may be preferable to more complicated models that require rich datasets for reliable estimation.

187

8 Future works

8.1 Latent variable density regression models Current density regression models focusing on mixture models are often a black-box in terms of realistic applications which require needing to center the model on a simple parametric family without sacrificing computational efficiency. Stephen Walker pointed out in one of the ISBA bulletins “Current density regression models are too big, too non-identifiable and I doubt whether they would survive the test of time.” Lenk (1988, 1991); Tokdar et al. (2010b) proposed a logistic Gaussian process prior which also allows convenient prior centering in density regression models, however computationally quite challenging. Somewhat discontented with the discrete mixture formulation of the existing density estimation and density regression models which doesn’t allow for convenient prior centering, we turned our attention to latent variable models which have become increasingly popular as a dimension reduction tool in machine learning applications. Although latent variable models are widely used in machine learning community, it was only recently realized (Kundu and Dunson, 2011) that they are also suitable for density estimation. Kundu and Dunson (2011)

188

developed a density estimation model where unobserved Up0, 1q latent variables are related to the response variables via a random non-linear regression with an additive error. This allows convenient prior centering, avoids the mixture formulation and enables efficient computation through a griddy Gibbs algorithm Kundu and Dunson (2011). However, there has been little study on theoretical properties of these models. In particular does it share the same appealing support and optimal convergence properties of some of the existing methods? Can it achieve faster convergence rates if the prior centering is appropriate? In an ongoing paper Pati et al. (2011b), we answered these questions in the affirmative by characterizing the space of densities induced by the above model as kernel convolutions with a general class of continuous mixing measures. Our paper Pati et al. (2011b) leads to the following simple density regression formulation. Consider the following non-linear latent variable model, yi “ µpηi , xi q ` ǫi , ηi „ Up0, 1q, ǫi „ Np0, σ 2 q

(8.1)

Integrating out ηi , f py | xq “

ż

φσ py ´ µpt, xqqdt

The above model can approximate a large collection of conditional densities tf0 py | xqu by letting µpt, xq concentrate around the conditional quantile functions F0´1 py | xq by assigning a Gaussian process prior. A couple of advantages of this formulation is the feasibility of an efficient posterior computation based on an uni-dimensional griddy Gibbs algorithm and the ability to center the model on a prior parametric guess which are not both shared by any of the prevalent density regression approaches. Studying rates of convergence in density regression models becomes more challenging as we need to assume mixed smoothness in y and x. Although posterior contraction rates are studied widely in mean regression, logistic regression and density estimation models, results on convergence rates for density regression models 189

are lacking. In the ongoing work Pati et al. (2011a), we study posterior convergence rates of the density regression model (8.1) by assuming the true conditional density has different smoothness across y and x. Assuming f0 py | xq to be compact and twice and thrice continuously differentiable in y and x, we obtain a rate of n´1{3 plog nqt2 using a Gaussian process prior for µ having a single inverse-Gamma bandwidth across different dimensions. The optimal rate in such a mixed smoothness class is n´6{17 . The slight slow rate of n´1{3 is the drawback of using an isotropic Gaussian process used for modeling an anisotropic function. Current research focuses on improving the rate of convergence using an anisotropic Gaussian process with different scaling across different dimensions.

8.2 Nonparametric variable selection Although Pati et al. (2011a) and Chapter 2 deal with the optimal convergence rate in estimating the true regression function, it might be of importance to study the behavior of the marginal posterior inclusion probabilities of individual variables to actually study consistency and rates of convergence of variable selection.

p´jq

Alternatively, let µi “ tµpxi1 q, . . . , µpxip qu and let µi



tµpxi1 q, . . . , µpxij´1q, µpx0j q, µpxij`1q, . . . , µpxip qu, where x0j is some reference point which is fixed across subjects. Then, under the hypothesis H0j that the jth variable p´jq

has no impact on the regression function, it would seem that µi & µi

will be very

close asymptotically. We are interested in providing sufficient conditions for Bayes factor consistency in testing pnq

pnq

H0j : ||µ ´ µp´jq ||2,n ă ǫn , H1j : ||µ ´ µp´jq ||2,n ą ǫn

(8.2)

where || ¨ ||2,n denotes the L2 pPn q norm, Pn being the empirical distribution. Hence ǫn Ó 0 determines the convergence rate of the variable selection. 190

8.3 Bayesian shape modeling Object segmentation and surface fitting from pixelated volume data is widely used in bio-medical applications as a part of pre-therapeutic diagnosis. We plan to develop a novel Bayesian method for inferring the surface of a 3-dimensional object by modeling the intensity of the pixels with a mixture of normals with the weights depending flexibly on the pixel coordinates. We want our approach to yield smooth, closed surface estimates that can prove highly useful in medical diagnosis and other general imaging applications. Lung tumors moving during respiration are particularly challenging. Because of irregular breathing and imaging limitations it can be hard to characterize and predict how a tumor will move during treatment. There has been essentially no work on Bayesian hierarchical modeling of 3d surfaces evolving over time (e.g, tumor image stacks of different individual at several time points). Our objective is to provide a joint framework for registering and modeling multiple 3d shapes. Instead of registering the entire volumetric data, we want to be able to develop a computationally efficient selective registration scheme which takes into account only the tumor region. One of the long-term goals is to develop an algorithm which uses the real-time 3D images acquired during treatment to adapt the radiation beam to optimize the dose being delivered to the target based on the updated images. Finally, working with volumetric data poses several computational bottlenecks. We propose to overcome these computational challenges via an adaptive partitioning scheme, and random projection-based techniques.

8.4 Spatial point patterns The log Gaussian Cox process model which we have exploited in Chapter 7 in the context of preferential sampling is useful in a variety of other spatial settings. More commonly it is used to model spatial point pattern data such as the locations of for191

est fires or earthquakes. Here the parametric gps; θq could capture effects of spatial covariates, e.q., gps; θq “ exppdistance from location s to a fault ˚ θq. Sampling for these types of models is highly challenging, particularly in presence of information on numerous spatially varying covariates where we are also interested in in selecting the important variables. The approach that is usually taken is to approximate the denominator of the likelihood of the log-Gaussian Cox process by a finite Riemann sum. However, since the exponentiated Gaussian process is not an infinitely divisible process, there has been a debate whether the approximated posterior converges to the true posterior. Stephen Walker recently developed an exactly sampling algorithm relying on the introduction of latent variables which removes any integrals associated with the inaccessibility of the normalizing constant. This would open up the possibility of exact sampling for a wide range of models (e.g., the log Gaussian Cox process for the sampling distribution in our informative sampling model) which would circumvent the existing criticisms.

8.5 Robust Bayesian model based clustering Model-based clustering based on mixtures of parametric kernels is a substantially popular tool for separating heterogeneous collection of items into homogeneous subsets. However, accurate estimation of the number of clusters as well as the cluster specific densities is highly sensitive to the choice of the kernels. To address this issue, we propose to develop a novel Bayesian hierarchical clustering model based on mixtures of constrained unimodal kernels where we potentially cluster based on modes without restricting the form of the kernel other than assuming it to be unimodal. We plan to explore theoretical directions like consistency of the number of clusters and the cluster specific densities.

192

8.6 Other directions Another interesting direction is when the true regression function is supported on a smaller dimensional linear subspace and it is of importance to estimate the minimal subspace, popularly termed as sufficient dimension reduction. Although there has been some works on Bayesian sufficient dimension reduction Reich et al. (2010); Tokdar et al. (2010a), accurate calibration of the posterior contraction rate in such settings is still an open area of research. In high dimensional small sample size scenario e.g. gene expression data, it becomes necessary to come up with simple parametric procedures with an accurate calibration for the prior that automatically adjusts for multiplicity. I would particularly like to explore variable selection consistency and convergence rates in Bayesian models where the number of predictors are increasing faster than the sample size. There has been a recent surge of interest in the frequentist literature in working out minimax rates for estimating high-dimensional covariance matrices where the dimensionality increases with the sample size and the truth lies in some sparsity class, but almost no work from a factor model type representation which are more commonly used in Bayesian factor models for learning covariance matrices. A particularly interesting direction is to consider estimation of covariance matrices by assuming a low rank decomposition which arises from factor models. Apart from these specific directions, one of my long-term research goals is to provide a non-asymptotic theoretical framework for comparison of frequentist and Bayesian procedures. Although, in most cases, Bayesian procedures behave as well as the frequentist procedures asymptotically, empirical evidence often suggests that the Bayesian procedures are superior to the frequentist counterparts in specific problems, particularly in a high-dimensional sparse data setting. A rigorous theoretical framework is necessary to validate and promote the use of Bayesian procedures. An193

other interesting direction is to theoretically compare different nonparametric Bayes models and evaluate the coverage probability of the parameter of interest in the light of Knapik et al. (2011).

194

Appendix A Proofs of some results in Chapter 4

A.1 Proof of Lemma 46 The

proof

proceeds

similarly

to

that

for

Theorem 3 in ` ˘ ˜ x , x P X u P G ˚ . Let B “ r´a, asp ˆ pσ, σq . Ghosal et al. (1999). Note that tG X

Choose k ą pa ` σ such that ż ż X

"

|y| ` pa f0 py | xq 2σ 2 |y|ąk

*2

ǫ dyqpxqdx ă . 2

Take V “ ttGx | x P X u : inf xPX Gx pBq ą σσ u. By approximating 1B by a bounded ˜x | x P X u continuous function, we can show that V contains a neighborhood V 1 of tG ş ` 1β ˘ dGx pβ, σq, x P X , of the type (4.15). For any density f P Fd , f py | xq “ σ1 φ y´x σ

195

with tGx | x P X u P V 1 , ż ż X

"˜ * f py | xq f0 py | xq log dyqpxqdx f py | xq |y|ąk

ď

ż ż

ď

ż ż

ď

ż ż

X

X

X

|y|ąk

|y|ąk

f0 py | xq log f0 py | xq log "

ş

`

˘

`

˘

1β 1 ˜ x pβ, σq φ y´x dG B σ σ dyqpxqdx ş 1 ` y´x1 β ˘ φ σ dGx pβ, σq B σ

1 φ |y|´pa σ dyqpxqdx ` |y|`pa ˘σ 1 Gx pBq φ σ σ

|y| ` pa f0 py | xq 2σ 2 |y|ąk

Let inf ty:|y|ďkuˆX inf pβ,σqPB σ1 φ family of functions

` y´x1 β ˘ σ

*2

ǫ dyqpxqdx ă . 2

“ c. Consider the uniformly equi-continuous

" ˙ * ˆ 1 y ´ x1 β gy,x : gy,x : B Ñ ℜ, pβ, σq ÞÑ φ , py, xq P r´k, ks ˆ X . σ σ By the Arzela-Ascoli theorem, given δ ą 0, there exists finitely many points tpyi , xi q P r´k, ks ˆ X , i “ 1, . . . , mu such that for any py, xq P r´k, ks ˆ X , D i such that sup |gy,x pβ, σq ´ gyi ,xi pβ, σq| ă cδ.

pβ,σqPB

Let gyi ,xi ˚ Gx “

ş1ş σ

φ

ˆ

yi ´x1i β σ

˙ dGx pβ, σq.

* " ˇ ˇ ˇ ˇ ˜ x ˇ ă cδ, i “ 1, 2, . . . , m . E “ tGx | x P X u : sup ˇgyi ,xi ˚ Gx ´ gyi ,xi ˚ G xPX

˜ x | x P X u formed by finite intersections of It holds that E is a neighborhood of tG sets of the type (4.15) and for tGx | x P X u P E and py, xq P r´k, ks ˆ X , ˇ ˇş ` 1 ˘ ˇ ˇ 1 φ y´x β dG ˜ x pβ, σq 3δ ˇ ˇ σ σ ´ 1ˇ ă . ˇ ş 1 ` y´x 1β ˘ ˇ 1 ´ 3δ ˇ σ φ σ dGx pβ, σq 196

˜ x | x P X u such for δ ă 31 . Thus given any ǫ ą 0, there exists a neighborhood E of tG ş ` 1β ˘ dGx pβ, σq, that for tGx | x P X u P E with f py | xq “ σ1 φ y´x σ ż ż X

"˜ * f py | xq ǫ f0 py | xq log dyqpxqdx ă . f py | xq 2 y:|y|ďk

(A.1)

˜ x of Taking W “ V 1 X E and since W is a finite intersection of neighborhoods of G the type (4.15), the result follows immediately.

A.2 A useful lemma Lemma 79. If {πh pxq, h “ 1, . . . , 8} constructed as in (4.11) satisfies S1 and S2 then PX

"

sup |π1 pxq ´ Fx pA1 q| ă ǫ1 , . . . , sup |πk pxq ´ Fx pAk q| ă ǫk xPX

xPX

*

ą 0.

(A.2)

for a measurable partition tAi , i “ 1, . . . , ku of ℜp ˆ ℜ` , ǫi ą 0 and a conditional cdf tFx , x P X u. Proof. Without loss of generality, let 0 ă Fx pAi q ă 1, i “ 1, . . . , k @ x P X . We want to show that for any ǫi ą 0, i “ 1, . . . , k, (A.2) holds. Construct continuous functions gi : X ÞÑ ℜ, 0 ă gi pxq ă 1 @x P X , i “ 1, . . . , k ´ 1 such that ź g1 pxq “ Fx pA1 q, gi pxq t1 ´ gl pxqu “ Fx pAi q, 2 ď i ď k ´ 1, gk pxq “ 1 @x. (A.3) lăi

As 0 ă Fx pAi q ă 1, i “ 1, . . . , k @ x P X , it is trivial to find gi , i “ 1, . . . , k satisfying řk (A.3) since one can solve back for the gi ’s from (A.3). i“1 Fx pAi q “ 1 enforces

gk ” 1. Since Φ is a continuous function, for any ǫi ą 0, i “ 1, . . . , k ´ 1, PX

"

sup |Φtαi pxqu ´ gi pxq| ă ǫi xPX

197

*

ą0

(A.4)

and for i “ k, " * " * ´1 PX sup |Φtαk pxqu ´ 1| ă ǫk “ PX inf αk pxq ą Φ p1 ´ ǫk q .

(A.5)

xPX

xPX

Choose M ą Φ´1 p1 ´ ǫk q ` ǫk . We have 0 ă M ă 1 and "

sup |αk pxq ´ M| ă ǫk xPX

Hence by assumption, PX

"

*

Ă

"

* inf αk pxq ą Φ p1 ´ ǫk q . ´1

xPX

´1

inf xPX αk pxq ą Φ p1 ´ ǫk q

*

ą 0. Let Sk´1 denote

the k-dimensional simplex. For notational simplicity let pi pxq “ Φtαi pxqu, gi pxq “ Fx pAi q, i “ 1, . . . , k ´ 1 and gk pxq “ 1. Let z “ pz1 , . . . , zp q1 , fi : Sk´1 Ñ ℜ, z ÞÑ ś zi lăi p1 ´ zl q, i “ 2, . . . , k and f1 pzq “ z1 . Let ppxq “ pp1 pxq, . . . , pk pxqq and

gpxq “ pg1 pxq, . . . , gk pxqq. Then we need to show that

PX t}f1 ppq ´ f1 pgq}8 ă ǫ1 , . . . , }fk´1 ppq ´ fk´1 pgq}8 ă ǫk´1 , }fk ppq ´ 1}8 ă ǫk u ą 0. Note that for 2 ď i ď k,

› › › › ÿ ÿ ( ( › › fl pgq › fl ppq ´ gi 1 ´ }fi ppq ´ fi pgq}8 “ ›pi 1 ´ › › lăi lăi ď pi ´ 1q }pi ´ gi }8 `

ÿ lăi

8

}fl ppq ´ fl pgq}8 .

Thus one can get ǫ˚i ą 0, i “ 1, . . . , k, such that t}pi ´ gi }8 ă ǫ˚i , i “ 1, . . . , ku Ă t}f1 ppq ´ f1 pgq}8 ă ǫ1 , . . . , }fk´1 ppq ´ fk´1pgq}8 ă ǫk´1 , }fk ppq ´ 1}8 ă ǫk u. But since PX t}pi ´ gi }8 ă ǫ˚i , i “ 1, . . . , ku “ follows immediately. 198

śk

i“1

PX t}pi ´ gi }8 ă ǫ˚i u, the result

A.3 Proof of Theorem 47 Fix tFx , x P X u P GX˚ . Without loss of generality it is enough to show that for a uniformly continuous function g : ℜp ˆ ℜ` ˆ X Ñ r0, 1s and ǫ ą 0, ˇż " ˇ PX tGx , x P X u : ˇˇ

ℜp ˆℜ` ˆX

gpβ, σ, xqdGx pβ, σqqpxqdx ´

ˇ * (ˇ ˇ gpβ, σ, xqdFx pβ, σqqpxqdx ˇ ă ǫ ą 0.

Furthermore, it suffices to assume gpβ, σ, xq Ñ 0 uniformly in x P X as }β} Ñ 8, σ Ñ 8. Fix ǫ ą 0, there exists a, σ, σ ą 0 not depending on x such that Fx pr´a, asp ˆ rσ, σsq ą 1 ´ ǫ for all x P X . Let C “ r´a, asp ˆ rσ, σs. ż ( gpβ, σ, xqdGx pβ, σq ´ gpβ, σ, xqdFx pβ, σq qpxqdx ď ℜp ˆℜ` X

ż "ÿ 8 X

h“1

πh pxqgpβh , σh , xq ´

ż

C

* gpβ, σ, xqdFx pβ, σq qpxqdx ` ǫ.

where πh ’s are specified by 4.11 with ch satisfying S1 and S2 and pβh , σh q „ G0 . Now for each x P X , construct a Riemann sum approximation of ż

C

gpβ, σ, xqdFx pβ, σq.

Let tAk,n , k “ 1, . . . , nu be sequence of partitions of C with increasing refinement as

n increases. Assume max1ďkďn diampAk,n q Ñ 0 as n Ò 8. Fix pβ˜k,n, σ ˜k,n q P Ak,n , k “ 1, . . . , n. Then by DCT as n Ñ 8, ż "ÿ n X

k“1

gpβ˜k,n, σ ˜k,n , xqFx pAk,n quqpxqdx Ñ ż ż X

C

gpβ, σ, xqdFx pβ, σqqpxqdx. 199

(A.6)

Hence there exists n1 such that for n ě n1 ˇż ˇ ˇ ˇ ( ˇ ˇď gpβ, σ, xqdG pβ, σq ´ gpβ, σ, xqdF pβ, σq qpxqdx x x ˇ p ` ˇ ℜ ˆℜ X

ˇ ˇż " * n 8 ˇ ˇ ÿ ÿ ˇ ˇ gpβ˜k,n, σ ˜k,n , xqFx pAk,n q qpxqdxˇ ` 2ǫ. πh pxqgpβh, σh , xq ´ ˇ ˇ ˇ X h“1 k“1

Consider the set

"

ǫ ,..., n1 xPX * ǫ sup |πn1 pxq ´ Fx pAn1 ,n1 q| ă . n1 xPX

Ω1 “ pπh , h “ 1, . . . , 8q : sup |π1 pxq ´ Fx pA1,n1 q| ă

By Lemma 79 which is proved in A.2, PX pΩ1 q ą 0. Since

ř8

h“1

πh pxq “ 1 a.s.

there D Ω with PX pΩq “ 1, such that for each ω “ tπh , h “ 1, . . . , 8u P Ω, gn pxq “ řn h“1 πh pxq Ñ 1 as n Ñ 8 for each x in X . Note that this convergence is uniform

since, gn p¨q, n ě 1 are continuous functions defined on a compact set monotonically

increasing to a continuous function identically equal to 1. Hence for each ω “ tπh , h “ 1, . . . , 8u P Ω, gn pxq Ñ 1 uniformly in x. By Egoroff’s theorem, there exists a measurable subset Ω2 of Ω1 with PX pΩ2 q ą 0 such that within this subset gn pxq Ñ 1 uniformly in x and uniformly in ω in Ω2 . Thus there exists a positive ř integer nǫ ě n1 not depending on x and ω, such that 8 h“nǫ `1 πh pxq ă ǫ on Ω2 .

Moreover, one can find a K ą 0 independent of x such that gpβ, σ, xq ă ǫ if }β} ą K

and σ ą K. Let A1 “ tpβ, σq : }β} ą K, σ ą Ku. Let Ω3 “ Ω2 X tpβn1 `1 , σn1 `1 q P A1 , . . . , pβnǫ ´1 , σnǫ ´1 q P A1 u. For ω P Ω3 , ˇż ˇ ˇ ˇ ( ˇ gpβ, σ, xqdGx pβ, σq ´ gpβ, σ, xqdFx pβ, σq qpxqdxˇˇ ď ˇ ℜp ˆℜ` X

ż "ÿ n1 ˇ ˇ* ˇ ˇ ˜ ˜k,n , xqFx pAk,n1 qˇ qpxqdx ` 4ǫ ˇπk pxqgpβk , σk , xq ´ gpβk,n, σ X

k“1

200

and ż "ÿ n1 ˇ ˇ* ˇ ˇ ˜k,n , xqFx pAk,n1 qˇ qpxqdx ˇπk pxqgpβk , σk , xq ´ gpβ˜k,n, σ X

k“1

* n1 ż " ˇ ˇ ÿ ˇ ˇ ˜k,n , xqˇ ` |πk pxq ´ Fx pAk,n1 q| qpxqdx πk pxq ˇgpβk , σk , xq ´ gpβ˜k,n, σ ď k“1 X

ď

n1 ż ÿ

k“1 X

ˇ ˇ ˇ ˇ ˜k,n , xqˇ qpxqdx ` ǫ. πk pxq ˇgpβk , σk , xq ´ gpβ˜k,n, σ

There exists sets Bk , k “ 1, . . . , n1 depending on n1 but independent of x such that if ˇ ˇ ˇ ˇ ˜ ˜k,n1 , xqˇ ă ǫ. So for ω P Ω4 “ Ω3 X tpβ1 , σ1 q P pβk , σk q P Bk , ˇgpβk , σk , xq ´ gpβk,n1 , σ

B1 , . . . , pβn1 , σn1 q P Bn1 u, ˇż ˇ ˇ ˇ ( ˇ ˇ ă 5ǫ. gpβ, σ, xqdG pβ, σq ´ gpβ, σ, xqdF pβ, σq qpxqdx x x ˇ p ` ˇ ℜ ˆℜ X

Now since PX pΩ2 q ą 0 and the sets tpβn1 `1 , σn1 `1 q P A1 , . . . , pβnǫ´1 , σnǫ ´1 q P A1 u

and tpβ1 , σ1 q P B1 , . . . , pβn1 , σn1 q P Bn1 u are independent from Ω2 and have positive probability, it follows that PX pΩ4 q ą 0.

A.4 Proof of Theorem 44 Without loss of generality, assume that the covariate space X is rζ, 1sp for some 0 ă ζ ă 1. The proof is essentially along the lines of Theorem 3.2 of Tokdar (2006a).

The f˜ in (4.12) will be constructed so as to satisfy the assumptions of Lemma 46 and ş ş py|xq such that X Y f0 py | xq log ff˜0py|xq dyqpxqdx ă 2ǫ for any ǫ ą 0. Define a sequence of

conditional densities fn py | xq “

ş

1β 1 ˜ n,x pβ, σq, n φp y´x qdG σ σ

ě 1 where for σn “ n´η ,

ś Iβ1 Pr´n,ns f0 px1 β | xq pj“2 δ0 pβj qδσn pσq şn dGn,x pβ, σq “ . f px β | xqdβ1 ´n 0 1 1 201

(A.7)

Define

fn py | xq “

ş nx1

1 φp y´t qf0 pt | xqdt ´nx1 σn ş nx1 σn . f pt | xqdt ´nx1 0

(A.8)

Proceeding as in Theorem 3.2 of Tokdar (2006a), an application of DCT using the conditions A1-A5 yields ż ż X

Y

f0 py | xq log

f0 py | xq dyqpxqdx Ñ 0 as n Ñ 8. fn py | xq

Therefore one can simply choose f˜ “ fn0 for sufficiently large n0 . fn0 satisfies the assumptions of Lemma 46 since tGn0 ,x , x P X u is compactly supported. Also tGn0 ,x , x P X u P GX˚ as x Ñ Gn0 ,x pAq is continuous. Hence there exists a finite intersection W of neighborhoods of tGn0 ,x , x P X u the type (4.15) such that for any tGx , x P X u P W , the second term of (4.12) is arbitrarily small. The conclusion of the theorem follows immediately from Corollary 49.

A.5 Proof of Theorem 51 Consider the sequence of sieves defined by (4.18) for given ǫ ą 0 and for sequences an , hn , ln , Mn , mn , rn to be chosen later with δn “ K1 ǫ{pMn m2n q for some constant K1 . We will first show that given ξ ą 0, there exists c1 , c2 ą 0 and sequences mn ` ˘ and Mn , such that ΠX Fnc ď c1 e´nc2 and log Npδ, Fn , }¨}q ă nξ.

202

For f1 , f2 P Fn , we have for each x P X , ż ÿ mn ˇ ˇ ˇ ˇ p1q πh pxq ˇφβp1q ,σp1q px, yq ´ φβp2q ,σp2q px, yqˇ dy }f1 p¨ | xq ´ f2 p¨ | xq}1 ď h

Y h“1

h

h

h

ż ÿ mn ˇ ˇ ˇ ˇ p1q p2q ` ˇπh pxq ´ πh pxqˇ φβp2q ,σp2q px, yqdy h

Y h“1

`

ď

8 ÿ

h“mn `1 mn ÿ

h“1

p1q

p2q

πh pxq ` πh pxq

πh pxq

h

(

› › $ › p2q p1q › ? ˆ ˙ 1{2 & 2 ›βh ´ βh › p p2q σh

% π

`

p2q 3pσh

´

p1q σh

mn › › ÿ › p1q p2q › ` ›πh ´ πh › ` 2ǫ.

,

p1q σh q .

-

8

h“1

Let Θπ,n “ tπ mn “ pπ1 , π2 , . . . , πmn q : αh P Bh,n , h “ 1, . . . , mn u. Fix π1mn , π2mn P Θπ,n . Note that since |Φpx1 q ´ Φpx2 q| ă K2 |x1 ´ x2 | for a global constant K2 ą 0, we have }Φpαh,1 q ´ Φpαh,2 q}8 ď K2 }αh,1 ´ αh,2 }8 . The above fact together with the proof of Lemma 79 show that if we can make › řmn ›› p1q p2q › ǫ }αh,1 ´ αh,2}8 ă m2 , h “ 1, . . . , mn , we would have h“1 ›πh ´ πh › ă ǫ. From n

8

the proof of Theorem 3.1 in van der Vaart and van Zanten (2009) it follows that for

h “ 1, . . . , mηn and for sufficiently large Mn , rn , log Np2ǫ{m2n , Bh,n , }¨}8 q ď K3 rnp log 2 log

˜

Mn m2n

a ǫ

K4 Mn m2n . ǫ

rn {δn

¸p`1

` (A.9)

for global constants K3 , K4 ą 0. For Mn2 ą 16K5 rnp plogprn {ǫqq1`p , rn ą 1 we have for h “ 1, . . . , mηn , 2

P pαh R Bh,n q ď P pAh ą rn q ` e´Mn {2 . 203

(A.10)

Hence for sufficiently large Mn , we have for h “ mηn ` 1, . . . , mn , log Np3ǫ{m2n , Bh,n , }¨}8 q ď 2 log For h “ mηn ` 1, . . . , mn , P pαh R Bh,n q ď P pAh ą δn q ` ď P pAh ą δn q `

ż δn

a“0

ż δn

a“0

K4 Mn m2n . ǫ

(A.11)

P pαh R Bh,n | Ah “ aqgAh paqda P pαh R Mn Ha1 ` ǫB1 | Ah “ aqgAh paqda δn pǫ{m2n q

ď P pAh ą δn q ` p1 ´ ΦpΦ´1 pe´φ0

q ` Mn qq.

where φκ0 pǫq denotes the concentration function of the Gaussian process with covari1 2

ance kernel cpx, x1 q “ τ 2 e´κ}x´x } . Now ˇ ˇ φδ0n pǫ{m2n q ď ´ log P p|W0 | ď ǫ{m2n q “ K6 ˇlogpǫ{m2n qˇ ą

for some constant K6 some K7

ą

0,

0.

ě

Hence if Mn

K7 |logpǫ{m2n q| for

then it follows from the proof of Theorem 3.1 in

van der Vaart and van Zanten (2009) that 2

P pαh R Bh,n q ď P pAh ą δn q ` e´Mn {2 .

(A.12)

From (A.9) and (A.11), K4 Mn m2n ` ǫ ¸p`1 ˜ a Mn m2n rn {δn η p . mn rn log ǫ

logpNpǫ, B1,n ˆ ¨ ¨ ¨ ˆ Bmn ,n , }¨}8 q ď2mn log

Also from (A.10) and (A.12), mn ÿ

h“1

η

´Mn2 {2

P pαh R Bh,n q ď mn e

`

mn ÿ

h“1

P pAh ą rn q `

204

mn ÿ

h“mηn `1

P pAh ą δn q.

(A.13)

We will show that with mn “ Op logn n q, ΠX pFnc q ă e´nξ0 for some ξ0 . By assumption C1, we have ΠX pΘcan ,hn ,ln q À mn Ope´n q À Ope´n q. řmηn

h“1 P pAh

With mn “ Opn{ log nq, δn q ď pmn ´ mηn qe´n With mn “

´η0 mη0 `2 n

n , log n

log mn

ą rn q ď mηn e´n À e´n ,

À e´mn log mn .

mn log mn ą

n 2

(A.14) řmn

h“mηn `1

P pAh ą

for large enough n and it follows from Lemma

56 that ΠX

ˆ

sup

8 ÿ

xPX h“m `1 n

˙ πh pxq ą ǫ À Ope´n{2q.

(A.15)

Thus with Mn “ Opn1{2 q, mn ÿ

h“1

P pαh R Bh,n q À e´n .

(A.16)

(A.14), (A.15) and (A.16) together imply that ΠX pFnc q À Ope´n q. ˆ ? ˙p`1 Mn rn {δn η p Also mn rn log “ opnq for the choice of the sequence rn . With ǫ mn “ n{pC log nq for some large C ą 0, one can make logpNpǫ, B1,n ˆ ¨ ¨ ¨ ˆ Bmn ,n , }¨}8 q ă nξ

(A.17)

for any ξ ą 0. Also from Lemma 55,

" ˆ ˙p * an hn mn log NpΘan ,hn ,ln , ǫ, }¨}8 q ď mn log d1 `1 ` d2 log ln ln ă nξ

(A.18)

for any ξ ą 0. Combining (A.17) and (A.18), log NpFn , 4ǫ, }¨}1 q ă nξ for any ξ ą 0. 205

A.6 Another useful lemma Lemma 80. For non-negative r.v.s Ai , Bi , if P pAi ď uq ď Ci P pBi ď uq for u P p0, t0 q, t0 ą 0, i “ 1, 2, P pA1 ` A2 ď t0 q ď C1 C2 P pB1 ` B2 ď t0 q. Proof. Denote by f the corresponding density functions. ż t0 ż t0 P pA1 ` A2 ď t0 q “ fA1 puqP pA2 ď t0 ´ uq ď C2 fA1 puqP pB2 ď t0 ´ uq 0

0

“ C2 P pA1 ` B2 ď t0 q “ C2 ď C1 C2

ż t0 0

ż t0 0

fB2 puqP pA1 ď t0 ´ uq

fB2 puqP pB1 ď t0 ´ uq “ C1 C2 P pB1 ` B2 ď t0 q.

A.7 Proof of Theorem 57 Proof. Once again we approximate f0 py | xq by f˜py | xq “

ş

`

1 φ y´µ σ σ

˘ ˜ x pµ, σq, so dG

that the first term of 4.12 is arbitrarily small. We construct such an f˜ analogous to that in Theorem 44. Lemma 81 is a variant of Lemma 46 which ensures that the

second term in (4.12) is also sufficiently small. Before that we need a different notion of neighborhood of tFx , x P X u which we formulate below. ˇż " ˇ tGx , x P X u : sup ˇˇ xPX

ℜˆℜ`

ˇ (ˇ gpµ, σqdGxpµ, σq ´ gpµ, σqdFx pµ, σq ˇˇ

* ăǫ .

(A.19)

ş ş Lemma 81. Assume that f0 P Fd satisfies X Y y 2 f0 py | xqdyqpxqdx ă 8. Suppose ˘ ş ` ˜ x pµ, σq, where D a ą 0 and 0 ă σ ă σ such that f˜py | xq “ σ1 φ y´µ dG σ ˘ ` ˜ x r´a, as ˆ pσ, σq “ 1 @ x P X , G 206

(A.20)

˜ x has compact support for each x P X . Then given any ǫ ą 0, D a so that G

˜ x , x P X u which is a finite intersection of neighborhoods of the neighborhood W of tG ˘ ş ` type (A.19) such that for any conditional density f py | xq “ σ1 φ y´µ dGx pµ, σq, x P σ X , with tGx , x P X u P W , ż ż X

Y

f0 py | xq log

f˜py | xq dyqpxqdx ă ǫ. f py | xq

(A.21)

The proof of Lemma 81 is similar to that of Lemma 46 and is omitted here. To characterize the support of PX , we define a collection of fixed conditional probability measures tFx , x P X u on pℜ ˆ ℜ` , Bpℜ ˆ ℜ` qq denoted by GX˚˚ satisfying ş x ÞÑ ℜˆℜ` gpµ, σqdFxpµq is a continuous function of x for all bounded uniformly continuous functions g : ℜ ˆ ℜ` Ñ r0, 1s.

Theorem 82. Assume the following holds. T1. G0 is specified by µh „ GPpµ, cq, σh „ G0,σ where c is chosen so that GPp0, cq has continuous path realizations and Πσ is absolutely continuous w.r.t. Lebesgue measure on ℜ` . T2. For every k ě 2, pπ1 , . . . , πk q is absolutely continuous w.r.t. to the Lebesgue measure on Sk´1 . T3. For any continuous function g : X ÞÑ ℜ, PX

"

* sup |µh pxq ´ gpxq| ă ǫ ą 0 xPX

h “ 1, . . . , 8 and for any ǫ ą 0. Then for a bounded uniformly continuous function g : ℜ ˆ ℜ` : r0, 1s satisfying

207

gpµ, σq Ñ 0 as |µ| Ñ 8, σ Ñ 8, ˇż " ˇ PX tGx , x P X u : sup ˇˇ xPX

ℜˆℜ`

ˇ (ˇ gpµ, σqdGx pµ, σq ´ gpµ, σqdFxpµ, σq ˇˇ

* ă ǫ ą 0.

(A.22)

Proof. It suffices to assume that g is is coordinatewise monotonically increasing on ş ℜ ˆ ℜ` . Let ǫ ą 0 be given and ψpxq “ ℜˆℜ` gpµ, σqdFx pµ, σq. Let nǫ be such that ř PX pΩ1 q ą 0 where Ω1 “ t 8 h“nǫ `1 πh ă ǫu. Then in Ω1 , ˇ ˇż nǫ ˇ (ˇ ÿ ˇ ˇ πk |gpµk pxq, σk q ´ ψpxq| ` ǫ. gpµ, σqdG pµ, σq ´ ψpxq ď x ˇ ˇ ℜˆℜ`

Define Ω2

k“1



tsupxPX |gpµk pxq, σk q ´ ψpxq|

ă

ǫ, k



1, . . . , nǫ u.

For a

fixed σk , there exists a δ such that supxPX |gpµk pxq, σk q ´ ψpxq| ă ǫ{2 if ˇ ˇ ˇ ă δ where g ´1 denotes the inverse of gp¨, σk q for fixed ψpxq supxPX ˇµk pxq ´ gσ´1 σk k

Hence there exists a neighborhood Bk of σk such that for σk P Bk and ˇ ˇ ˇ ă δ, we have supxPX |gpµk pxq, σk q ´ ψpxq| ă ǫ. Since for ψpxq supxPX ˇµk pxq ´ gσ´1 k ˇ ˇ ( ˇăδ “ each k “ 1, . . . , nǫ , PX σk P Bk , supxPX ˇµk pxq ´ gσ´1 ψpxq k

σk .

ż

σk PBk

ˇ ˇ ( PX sup ˇµk pxq ´ gσ´1 ψpxqˇ ă δ dG0,σ pσk q ą 0, k xPX

PX pΩ2 q ą 0. The conclusion of the theorem follows from the independence of Ω1 and Ω2 . f˜ in (4.12) will be constructed so as to satisfy the assumptions of Lemma 81 and ş ş py|xq such that X Y f0 py | xq log ff˜0py|xq dyqpxqdx ă 2ǫ for any ǫ ą 0. Define a sequence of

conditional densities fn py | xq “

ş

1 ˜ n,x pµ, σq, n φp y´µ qdG σ σ

dGn,x pµ, σq “

ě 1 where for σn “ n´η ,

IµPr´n,ns f0 pµ | xqδσn pσq şn . f pµ | xq ´n 0 208

(A.23)

As before define the approximator fn py | xq “

şn

1 φp y´t qf0 pt | xqdt ´n σn ş n σn . f pt | xqdt ´n 0

(A.24)

f˜ will be chosen to be fn0 for some large n0 . fn0 satisfies the assumptions of Lemma 81 since tGn0 ,x , x P X u is compactly supported. Moreover tGn0 ,x , x P X u P GX˚˚ as ş x Ñ ℜˆℜ` gpµ, σqdGn0,x pµ, σq is continuous function of x for all bounded uniformly continuous function g. Hence there exists a finite intersection W of neighborhoods of tGn0 ,x , x P X u the type (A.19) such that for any tGx , x P X u P W , the second term of (4.12) is arbitrarily small. The conclusion of the theorem follows immediately from a variant of Corollary 49 applied to neighborhoods of the type (A.19).

A.8 Proof of Theorem 58 Proof. As before we establish q-integrated L1 consistency of Gaussian mixtures of fixed-π dependent processes by verifying the conditions of Theorem 50. Let ` ˘ for y P Y and x P X . From Lemma 4.1 of Tokdar (2006a), φµ,σ px, yq :“ σ1 φ y´µpxq σ σ2 2

and for each x P X , ˆ ˙1{2 ż }µ1 ´ µ2 }8 3pσ2 ´ σ1 q 2 |φµ1 ,σ1 px, yq ´ φµ2 ,σ2 px, yq| dy ď ` π σ2 σ1 Y

we obtain for σ2 ą σ1 ą

Let µh pxq “ x1 βh ` ηh pxq, h “ 1, 2, . . ., βh „ Gβ where Gβ is a probability distribu1 2

tion on ℜp . Let ηh „ GP p0, cq independently where cpx, x1 q “ τ 2 e´A}x´x } , where A is a distributed with support ℜ` and τ 2 is fixed. Assume that σh „ G0,σ where G0,σ is a distribution on ℜ` . Here G0x is a distribution on ℜ ˆ ℜ` induced from the distribution of pµh pxq, σh2 q. For any pair µ1 , µ2 , ˆ ˙1{2 ? }β1 ´ β2 } p ` }η1 ´ η2 }8 2 . }µ1 ´ µ2 }8 ď π σ2 209

1 2

As before, let Ha1 denote a unit ball in the RKHS of the covariance kernel τ 2 e´a}x´x }

and B1 is a unit ball in Cr0, 1sp. For sequences Mn Ò 8, ln Ó 0, rn Ò 8 to be determined later and given ǫ ą 0 construct Bn as ˆ c ˙ ˆ ˙ ? ? ǫln π rn rn ǫln π a Bn “ Mn H ` ? B1 Y Yaăδn Mn H1 ` ? B1 . δn 1 4 2 4 2 with δn “

K1 ǫln Mn

for some constant K1 ą 0. Let ( Θn “ φµ,σ : }β} ď an , η P Bn , ln ď σ ď hn .

In the following Lemma, we provide an upper bound to NpΘn , ǫ, }¨}1 q. Lemma 83. There exists constants d1 , d2 , K2 and K3 ą 0 such that for Mn 2ǫ and for sufficiently large rn log NpΘn , ǫ, dSS q ď

K2 rnp

"

log

ˆ

˙*p`1 ? ? 8 2Mn rn {δn ? ǫ πln

(A.25)

a

rn {δn ą

n ` ` log Kǫl3 M n

* " ˆ ˙p hn an ` d2 log ln ` 1 . log d1 ln

` ( ? ? ‰p Proof. We have Θn Ă φµ,σ | β P ´an p, an p , η P Bn , hn ď σ ď ln . Let κ ă minp 6ǫ , 1q and σm “ ln p1 ` κqm , m ě 0. Let m0 be the smallest integer such

that σm0 “ ln p1 ` κqm0 ą h. This implies m0 ď p1 ` κq´1 log hlnn ` 1. By the choice ´σm´1 q of σm , m ě 1, 3pσmσm´1 ă

construct a

? ǫ πσj´1 ? -covering 4 2

ǫ . 2

Let Nj “ r

` 128 ˘1{2 an p?p π

σj´1 ǫ

s. For each 1 ď j ď m0 ,

tAkj , k “ 1, . . . , Mj u of Bn with

« ff a " ˆ ? ˙*p`1 ? 8 2M r {δ ǫ πσj´1 K M n n n 3 n ? ? , Bn , }¨}8 q ď exp K2 rnp log Mj “ Np log ǫ πσj´1 ǫln 4 2 for some constants K2 , K3 ą 0. For 1 ď i ď Nj , 1 ď k ď Mj & 1 ď j ď m0 , define Eikj

p ˆ 2a1n i 2a1n pi ´ 1q 1 1 , ´an ` ˆ Akj ˆ pσj´1 , σj s “ ´an ` Nj Nj 210

(A.26)

? where a1n “ an p. We have for pβ, η, σq, pβ 1, η 1 , σ 1 q P Eikj and for each x P X , ˆ ˙1{2 ? }β ´ β 1 } p ` }η1 ´ η2 }8 ǫ 2 }φµ,σ px, ¨q ´ φµ1 ,σ1 px, ¨q}1 ď ` π σ2 2 ˆ ˙1{2 ˆ ? ˙ ? 2an p p ǫ π ǫ 2 ` ď ǫ. ď ` ? π σj´1Nj 2 4 2

Thus NpΘn , ǫ, }¨}q ď

˙1{2 ? m0 "ˆ ÿ an p p 128 π

j“1

σj´1 ǫ

`1

*p

ˆ

+ a ˙p`1 ˆ ? 2M r {δ 8 K M n n n 3 n ? exp K2 rnp log log ǫ πσj´1 ǫln #

«

ď exp K2 rnp

"

ff a ˙ * ˆ ? 8 2Mn rn {δn p`1 K3 Mn ? log ˆ log ǫ πln ǫln

* " ˆ ˙p an hn `1 . d1 ` d2 log ln ln

The rest of the proof follows similar to that of Theorem 51. Consider the sequence of sieves defined by ˙ ˆ " 8 ÿ y ´ µh pxq 1 n , tφµh ,σh um Fn “ f : f py | xq “ πh φ h“1 P Θn , σ σ h h h“1 sup

ÿ

xPX hěm `1 n

* πh ď ǫ .

` ˘ We will show that given any ξ ą 0, there exists a c1 , c2 ą 0 such that ΠX Fnc ď

211

c1 e´nc2 and logpδ, Fn , }¨}q ă nξ. For f1 , f2 P Fn , we have

ż ÿ 8 ˇ ˇ ˇ p1q ˇ p2q }f1 p¨ | xq ´ f2 p¨ | xq}1 ď ˇπh φµp1q ,σp1q px, yq ´ πh φµp2q ,σp2q px, yqˇ dy h

Y h“1

ď

mn ÿ

p1q πh

h“1

h

h

h

ż ˇ ˇ ˇ ˇ ˇφµp1q ,σp1q px, yq ´ φµp2q ,σp2q px, yqˇ dy Y

h

h

h

h

mn ˇ ˇ ÿ ˇ p1q p2q ˇ ` ˇπh ´ πh ˇ ` 2ǫ. h“1

Let Θπ,n “ tπ mn “ pπ1 , π2 , . . . , πmn q : νh , h “ 1, . . . , mn P r0, 1su. Fix π1mn , π2mn P Θπ,n . It is easy to see that if we can make |νh,1 ´ νh,2| ă mǫ2 , h “ 1, . . . , mn , we would n ˇ řmn ˇˇ p1q p2q ˇ have h“1 ˇπh ´ πh ˇ ă ǫ. Since νh,1 , νh,2 P r0, 1s, the number of balls required to ˇ ř n ˇˇ p1q p2q ˇ 2 mn π ´ π cover Θπ,n so that m for some constant K4 ą 0. h ˇ ă ǫ is K4 pmn {ǫq h“1 ˇ h Hence

a ˙*p`1 ˆ ? 8 2M rn {δn K4 m2n n p ? log NpFn , 4ǫ, }¨}q ď K2 mn rn log ` mn log ǫ πln ǫ " ˆ ˙p * an hn K3 Mn ` mn log d1 `1 (A.27) ` d2 log mn log ǫln ln ln "

ř c P p}β} ą Note that ΠX pFnc q ď mn P pΘcn q ` P p 8 h“mn πh ą ǫq and P pΘn q ď ( an q ` P pσ P rln , hn sc q ` P pη P Bnc q . It follows from the proof of Theorem 3.1

of van der Vaart and van Zanten (2009) that

2

P pη P Bnc q ď P pA ą rn q ` e´Mn {2 if

Mn2

ą

rnp

"

log

ˆ

˙ ? ? 8 2Mn rn {δn ? . ǫ πln

Since App1`η2 q{η2 „ Gapa, bq, Lemma 4.9 of pp1`η2 q{η2

van der Vaart and van Zanten (2009) indicates that P pA ą rn q À expt´rn

u.

Hence with Mn “ Opn1{2 q, mn “ Otn{plog nqp`1 u1{p1`η2 q and rnp “ Otnη2 {p1`η2 q u, 212

P pΘcn q À e´n and Pp

8 ÿ

h“mn

πh ą ǫq À expt´mn1`η2 plog mn qpp`1q u À e´n .

(A.28)

Also, the first term in the right hand side of (A.27) can be made smaller than nξ since mn rnp “ Opn{plog nqp`1 q. Also by F1, the last two terms of the right hand side of (A.27) can be made to grow at opnq.

213

Appendix B Proofs of some results in Chapter 3

B.1 Proof of Lemma 34 It follows from Chu (1973) that f P Cm ô f pxq “

ż

σ ´1 φpσ ´1 xqgpσqdσ

for some density g on R` . Recall from Ongaro and Cattaneo (2004) that a collection ř8 of random weights tπh u8 h“1 with h“1 πh “ 1 a.s. is said to have a full support if

for any m ě 1, pπ1 , . . . , πm q admits a positive joint density with respect to Lebesgue ř measure on the simplex tpp1 , . . . , pm q : m i“1 pi ď 1u. Ongaro and Cattaneo (2004)

showed that if πh ’s have a full support, the weak support of P “

8 ÿ

h“1

πh δθh , θh „ G0

is the set of all probability measures whose support is contained the support of G0 . Since d

pπ1 , . . . , πm q “

ˆ

Φpα1 q, Φpα2 qt1´Φpα1 qu, . . . , Φpαm q 214

m´1 ź i“1

˙ t1´Φpαi qu , αi „ Npµα , σα2 q,

πh ’s have a full support and hence the weak support of P “

ř8

h“1

πh δτh defined

in (3.7) is all probability measures on R` . It follows that the weak support of the induced prior Πu on Su , denoted by wkpΠu q, is precisely Cm .

B.2 Proof of Lemma 36 It follows from Tokdar (2006b) that if we can show that the weak support of Πs contains all probability measures symmetric about zero and having compact support, then f P S˜s ñ f P KLpΠs q. The argument given in Lemma 34 shows that the weak support of the PSB prior in (3.4) is the set of all probability measures on R ˆ R` . Now we will show that an arbitrary P˜ s is in a weak neighborhood of P s if P˜ is in a weak neighborhood of P . We state a lemma to prove our claim. Lemma 84. Let P˜n be a sequence of probability measures and P˜ be a fixed probability measure. Then pP˜n ñ P˜ q ñ pP˜ns ñ P˜ s q, with P˜ns and P˜ s the symmetrised versions of P˜n and P˜ , respectively, where the symmetrizing operation is as defined in (3.9).

Proof. Assume P˜n ñ P˜ . We have to show that for any bounded function φ on R ˆ R` ,

ż

φpt, τ qdP˜ns pt, τ q

Ñ

ż

φpt, τ qdP˜ s pt, τ q as n Ñ 8.

Now, ż

ż ż 1 1 s ˜ ˜ φpt, τ qdPn pt, τ q ` φpt, τ qdP˜n p´t, τ q φpt, τ qdPn pt, τ q “ 2 2 ż ( 1 “ φpt, τ q ` φp´t, τ q dP˜n pt, τ q. 2

Since ψpt, τ q “

1 2

( φpt, τ q ` φp´t, τ q is also a bounded continuous function and

P˜n ñ P˜ , ż ż ż ( ( 1 1 φpt, τ q ` φp´t, τ q dP˜n pt, τ q Ñ φpt, τ q ` φp´t, τ q dP˜ pt, τ q “ φpt, τ qdP˜ s pt, τ q 2 2 215

as n Ñ 8. This completes the proof of Lemma 84. Lemma 84 in fact shows that the weak support of Πs contains all probability measures symmetric about zero. With an appeal to Tokdar (2006b), f P S˜s ñ f P KLpΠs q.

B.3 Proof of Theorem thm:ghoshal In order to prove the theorem we need the following variant of Theorem 2.1 of Amewou-Atisso et al. (2003) and Theorem 1 of Choi and Schervish (2007b) which we state as Lemma 86. Existence of exponentially consistent tests is a typical tool in showing strong consistency. ` ˘ Definition 85. Let W Ă S˜s ˆ F . A sequence of test functions Φn tyi , xi uni“1 is said

to be exponentially consistent for testing

H0 : pf, ηq “ pf0 , η0 q against H1 : pf, ηq P Wn if there exists constants C1 , C2 , C ą 0 such that 1. Eśni“1 f0i pΦn q ď C1 e´nC , 2. inf pf,ηqPWn Eśni“1 fηi pΦn q ě 1 ´ C2 e´nC . ˜ “ pΠs ˆπq be the prior on S˜s ˆF . Let Un be a sequence of subsets Lemma 86. Let Π

˜ of S˜s ˆ F . Suppose that there exists test functions tΦn u8 n“1 , sets Θn Ă Ss ˆ F , n ě 1 and constants C1 , C2 , c1 , c2 ą 0 such that 1.

ř8

n“1

Eśni“1 f0i Φn ă 8.

2. suppf,ηqPUnc XΘn Eśni“1 fηi p1 ´ Φn q ď C1 e´c1 n . ˜ c q ď C2 e´c2 n . 3. ΠpΘ n 216

4. For all δ ą 0 and for almost every data sequence tyi , xi u8 i“1 , " * 8 ÿ Vi pf, ηq ˜ Π pf, ηq : Ki pf, ηq ă δ @i, ă 8 ą 0. 2 i i“1 ˜ Then Πtpf, ηq P Unc | pY1 , x1 q, . . . , pYn , xn qu Ñ 0 a.s.rPf0 ,η0 s. In this case Un



Wn



U ˆ Sn pf0 , ∆q @ n

ě

1.

As in

van der Vaart and van Zanten (2009), we construct Θn “ F ˆ Θ1n where Θ1n “ Yaărn Mn Ha1 ` ǫB1 where H1 and B1 are unit ball of the RKHS of W a and unit ball of the Banach space of Cr0, 1sp respectively, rn , Mn are increasing sequences to be chosen later. The nth test is constructed by combining a collection of tests one for each of the finitely many elements of Θn . It follows from the proof of Theorem 3.1 in van der Vaart and van Zanten (2009) that under Assumption 1, there exists constants d1 , d2 , K ą 0 such that ˜ cn q ď expt´d1 rnp logq prn qu ` expt´Mn2 {8u. 1. ΠpΘ 2. log Npǫ, Θ1n , || ¨ ||8 q ď

Krnp

ˆ

log

Mn ǫ

˙p`1

.

Choosing Mn “ Opn1{2q, rnp “ Opn{plog nqp`2 q, we observe that ˜ c q ď expt´d2 nu. 1. ΠpΘ n 2. log Npǫ, Θ1n , || ¨ ||8 q “ opnq. for some constant d2 ą 0. In order to verify 1 and 2 of Lemma 86, we will write Wn as a disjoint union of two easily tractable regions. The particular form of Wn that is of interest to us is W1n Y W2n , where for any ∆ ą 0, c

W1n “ U ˆ η : ||η ´ η||1,n

* ( ď∆ W2n “ pf, ηq : ||η ´ η||1,n ą ∆ . 217

We will establish the existence of a consistent sequence of tests for each of these regions by considering the following variants of Proposition 3.1 and Proposition 3.3 of Amewou-Atisso et al. (2003). Proposition 87. There exists an exponentially consistent sequence of tests for H0 : pf, ηq “ pf0 , η0 q against H1 : pf, ηq P W2n X Θn . Proof. Let 0 ă t ă ∆{2 and assume Nt “ Npt, Θ1n , || ¨ ||8 q. Let η 1 , . . . , η Nt P Θ1n be such that for each η P Θ1n there exists j such that ||η ´ η j ||8 ă t. If ||η´η0 ||1,n ą ∆, ||η j ´η0 ||1,n ą ∆{2. It follows from Lemma 3.2 Amewou-Atisso et al. (2003) that there exists a set Aji and a constant C ą 0 depending on f0 such that αij :“ Pf0i pAji q ď 12 ´ C|η j pxi q ´ η0 pxi q|. and γij :“ Pfηj i pAi q ě 21 . If i ď n and i R Kn , set Ai “ R, so that αij “ γij “ 1. Thus

1ÿ j pγi ´ αij q ě C∆{2 lim inf nÑ8 n i“1 n

From Lemma 3.1 and Lemma 3.2 of Amewou-Atisso et al. (2003), it follows that there exist test functions Φjn based on tIAj , i “ 1, . . . , nu such that Eśni“1 f0i Φjn ă e´nC1 and i

Eśn

i“1 fη j i

Φjn q

p1 ´

´nC2

for constants C1 , C2 ą 0 Now define Φn “ max1ďjďNt Φjn .

ăe

Then śn

E

i“1 f0i

Φn ď

Nt ÿ

j“1

E

Φjn i“1 f0i

śn

for some constant C3 ą 0. Clearly

ř8

n“1

ď

Nt ÿ

j“1

e´nC1 ď Nt e´nC1 ď e´nC3 .

Eśni“1 f0i Φn ă 8.

Next we consider the type II error probability. The type II error probability of Φn is no larger than the type II error probability of any of the tΦjn , j “ 1, . . . , Nt u and hence exponentially small. Proposition 88. There exists an exponentially consistent sequence of tests for H0 : pf, ηq “ pf0 , η0 q against H1 : pf, ηq P W1n 218

Proof. Without loss of generality take " ż * ż U “ f : Φpyqf pyqdy ´ Φpyqf0 pyqdy ă ǫ where 0 ď Φ ď 1 and Φ is Lipschitz continuous. Hence there exists M ą 0 such that ˜ i pyq “ Φty ´ η0 pxi qu. Notice that Ef Φ ˜ i “ Ef Φ. |Φpy1 q ´ Φpy2 q| ă M|y1 ´ y2 |. Set Φ 0 0i

Now ˜i “ Efηi Φ ě

ż

˜ i pyqfηi pyqdy “ Φ

ż

Φpyqf ry ´ tηpxi q ´ η0 pxi qus

ż

Φry ´ tηpxi q ´ η0 pxi qusf ry ´ tηpxi q ´ η0 pxi qusdy

ż

Φpyqf pyqdy ´ M|ηpxi q ´ η0 pxi q|

ˇ ż ˇ ˇ ˇ ˇ ´ ˇΦpyq ´ Φry ´ tηpxi q ´ η0 pxi qusˇˇf ry ´ tηpxi q ´ η0 pxi qusdy ě

ě Ef0 Φ ` ǫ ´ M|ηpxi q ´ η0 pxi q| Hence 1{n

řn

i“1

˜ i ě Ef Φ ` ǫ ´ M∆ for any f P U c . Now choosing ∆ ă ǫ{M Efηi Φ 0

and applying Lemma 3.1 of Amewou-Atisso et al. (2003) we complete the proof. It remains to verify the second sufficient condition of Theorem 39. Under the assumptions, it follows from Lemma 36 that f0 P KLpΠs q. We will present an important lemma which is similar to Lemma 5.1 of Tokdar (2006b). It guarantees that Kpf0 , fθ q and V pf0 , fθ q are continuous at θ “ 0. First we state and prove some properties of the prior Πs described in (3.9) which will be used to prove the lemma. Lemma 89. If Πs is the prior described in (3.9) and P0 pt, τ q “ Npt; µ0 , σ02 q ˆ Gapτ ; ατ , βτ q, with ατ ą 0 and βτ ą 0. Then, ż ż s τ dP pt, τ q ă 8 a.s., t2 dP s pt, τ q ă 8 a.s., ż

2

s

τ t dP pt, τ q ă 8 a.s., ´8 ă 219

ż

plog τ qdP s pt, τ q ă 8 a.s.

(B.1)

Proof. ż ż

s

τ ą0,tPR

1 “ 2 ż “

τ dP pt, τ qdP “

ż

τ ą0,tPR

τ ą0

ż

τ τ ą0,tPR

ż

dP s pt, τ qdP

τ Npt; µ0 , σ02 qGapτ ; ατ , βτ qdtdτ

1 ` 2

ż

τ ą0,tPR

τ Npt; ´µ0 , σ02 qGapτ ; ατ , βτ qdtdτ

τ Gapτ ; ατ , βτ qdτ ă 8.

ş ş The proofs of t2 dP s pt, τ q ă 8 a.s. and τ t2 dP s pt, τ q ă 8 a.s. are similar. Since ατ ą 0, choose an integer m large enough such that ατ ą m1 . ż ż ż s plog τ qdP pt, τ qdP “ plog τ qGapτ ; ατ , βτ qdτ τ ą0,tPR

“C

ż

τ ą0

τ ą0

plog τ qτ

ατ ´1 ´βτ τ

e

dτ “ C

ż

1

τ ą0

since τ 1{m log τ is bounded in r0, 1s. ş τ τ ατ ´1 e´βτ τ dτ ă 8. τ ą0

pτ 1{m log τ qτ ατ ´ m ´1 e´βτ τ dτ ą ´8 Also

ş

τ ą0

plog τ qτ ατ ´1 e´βτ τ dτ

ď

ş Lemma 90. Under the conditions of the Theorem 39, if f p¨q “ Np¨ ; t, τ ´1 qdP s pt, τ q and fθ pyq “ f py ´ θq, then

ş ş pyq 1. limθÑ0 f0 pyq log ffθ0 pyq dy “ f0 pyq log ff0pyq dy. pyq ` ş 2. limθÑ0 f0 pyq log`

˘ f0 pyq 2 dy fθ pyq

` ş “ f0 pyq log`

˘ f0 pyq 2 dy. f pyq

( ş Proof. Clearly τ φ τ py ´ θ ´ tqu Ñ τ φ τ py ´ tq as θ Ñ 0. Since τ φ τ py ´ θ ´ ( ş tq dP s pt, τ q ď ?12π τ dP s pt, τ q ă 8, so by DCT fθ pyq Ñ f pyq as θ Ñ 0. Hence log

ˆ

f0 pyq f0 pyq Ñ log as t Ñ 0 ft pyq f pyq

f0 pyq log` ft pyq

˙2

Ñ

ˆ

f0 pyq log` f pyq

220

˙2

as t Ñ 0.

To apply DCT again, we have to bound the function | log fθ pyq|by an integrable function.

ˇ ˇ ż ? ˇ ˇ τ 2 ´ py´t´θq s | log fθ pyq| ď log 2π ` ˇˇ log τ e 2 dP pt, τ qˇˇ.

ş Let c “ τ dP s pt, τ q ă 8. Then ˇ ˇ ˇ ˇ ż ż ˇ ˇ ˇ ˇ ˇ log τ e´ τ2 py´t´θq2 dP s pt, τ qˇ ď | log c| ` ˇ log τ e´ τ2 py´t´θq2 dP s pt, τ qˇ. ˇ ˇ ˇ ˇ c

ˇ ˇ ş ş τ τ 2 2 Now since τ e´ 2 py´t´θq dP spt, τ q ď c, ˇ log τc e´ 2 py´t´θq dP s pt, τ qˇ ş τ 2 “ ´ log τc e´ 2 py´t´θq dP s pt, τ q. Hence, by Jensen’s inequality applied to ´ log x, we get,

´ log

ż

ż ż 1 τ ´ τ py´t´θq2 s s dP pt, τ q ď log c ´ plog τ qdP pt, τ q ` τ py ´ t ´ θq2 dP s pt, τ q. e 2 c 2

Now since θ Ñ 0, w.l.o.g assume |θ| ď 1. Hence ˙ ˆ ż ż ż s 2 s 2 2 s τ dP pt, τ q ` τ t dP pt, τ q ` 1 τ py ´ t ´ θq dP pt, τ q ď 4 y

ˆ ż ż ? s ñ | log fθ pyq| ď log 2π ` | log c| ` log c ´ plog τ qdP pt, τ q ` 2 y 2 τ dP s pt, τ q`

ż

2

s

τ t dP pt, τ q ` 1

˙

which is clearly f0 -integrable according to the assumptions of the lemma and from the properties of Πs proved in Lemma 89. Similarly | log fθ pyq|2 can be bounded by an f0 -integrable function. The conclusion of the lemma follows from a simple application of DCT. " Lemma 36 together with the assumption (2) of the Theorem 39 guarantees Π f : *

Kpf0 , f q ă δ, V pf0 , f q ă 8 ą 0 for all δ ą 0. Since (B.1) holds, we may assume * " ΠpUq ą 0, where U “ f : Kpf0 , f q ă δ, V pf0 , f q ă 8, pB.1q holds . 221

(B.2)

Now for every f p¨q “ that for |θ| ă δf ,

ş

Np¨ ; t, τ ´1 qdP s pt, τ q P U, using Lemma 93, choose δf such

Kpf0 , fθ q ă 2Kpf0 , f q, V pf, fθ q ă 2V pf0 , f q. Now if ||η ´ η0 || ă δf , |ηpxi q ´ η0 pxi q| ă δf , for i “ 1, . . . , n. So if f P U and ||η ´ η0 || ă δf , we have ż ż f0i f0 Ki pf, ηq “ f0i log ă 2Kpf0 , f q, “ f0 log fηi fpη´η0 qi ż ż ` ` f0i ˘2 f0 ˘2 Vi pf, ηq “ f0i log` “ f0 log` ă 2V pf0 , f q. fηi fpη´η0 qi From (B.2) and Lemma 38 we have, " * Π pf, ηq : f P U, ||η ´ η0 || ă δf ą 0. Hence * " 8 ÿ Vi pf, ηq ă 8 ą 0. Π pf, ηq : Ki pf, ηq ă 2δ @ i, i2 i“1 This ensures weak consistency of the posterior of the residual density and strong consistency of the posterior of the regression function η.

222

Appendix C Proofs of some results in Chapter 5

C.1 Proof of Lemma 63: The Gaussian process prior ΠC n given tσj , j “ 0, . . . , 2nu has the following representation. n

C ptq “

2n ÿ

j“0

cj Bjn ptq, cj „ N2 p0, σj2 I2 q, ptq P r´π, πs.

(C.1)

Since cj „ N2 p0, σj2 I2 q, ΠC n can be written as n

C ptq “

2n ÿ

j“0

c˚j σj Bjn ptq.

(C.2)

where c˚j „ N2 p0, I2 q. Hence from Proposition 1 in Pati and Dunson (2011), Hn consists of h : r´π, πs Ñ R2 such that hptq “

2n ÿ

j“0

cj Bjn ptq,

where cj P R2 . The RKHS norm of h in (C.3) is given by ||h||2Hn “ 223

(C.3) ř2n

2 2 j“0 ||cj || {σj .

C.2 Proof of Theorem 64: From Stepanets (1974) and observing that the basis functions tBjn , j “ 0, . . . , 2nu span the vector space of trigonometric polynomials of degree at most n, it follows that ř i n i given any S0i P C αi pr´π, πsq, there exists hi puq “ 2n j“0 cj Bj puq, h : r´π, πs Ñ R with |cij | ď Mi , such that ||hi ´ S0i ||8 ď Ki n´αi log n for some constants Mi , Ki ą ř 1 2 1 n 0, i “ 1, 2. Setting hpuq “ 2n j“0 pcj , cj q Bj puq, we have ||h ´ S0 ||8 ď Mn´αp1q log n

with ||h||2H ď K

ř2n

j“0 φj

where M “ Mp2q , K “ Kp2q .

224

Appendix D Proofs of some results in Chapter 6

D.1 Proofs of Lemma 71 The Gaussian process prior ΠS n,m given tφjk , j “ 0, . . . , 2n, k “ 0, . . . , 2mu has the following representation.

S n,m pu, vq “

2n ÿ 2m ÿ

j“0 k“0

2 djk Bjn puqBkm pvq, djk „ N3 p0, φ´1 jk I3 q, pu, vq P T .

(D.1)

To characterize the RKHS of S n,m pu, vq, we need the following generalization of Theorem 4.2 of van der Vaart and van Zanten (2008b) to the multivariate case. Proposition 91. Let phi q be a sequence of elements in a separable Banach space B ř 3 such that 8 i“1 wi hi “ 0 for a sequence w P ℓ2 pR q, where the convergence is in B,

implying that w “ 0. Let Zi “ pZi1 , Zi2 , Zi3 qT „ N3 p0, I3 q, and assume that the series ř 3 W “ 8 i“1 Zi hi converges almost surely in B . Then the RKHS of W as a map in ř ř8 3 2 B3 is given by H “ t 8 i“1 wi hi : w P ℓ2 pR qu with squared norm || i“1 wi hi ||H “ ř8 2 i“1 ||wi || . Proof. The almost sure convergence of the series W “ 225

ř8

i“1

Zi hi P B3 implies al-

most sure convergence of the series b ˚ W for any b˚ P pB3 q˚ . Now any b˚ P pB3 q˚ can be written as b˚ “ α1 b˚1 ` α2 b˚2 ` α3 b˚3 for αi P R, b˚i P B˚ . Hence b˚ W “ ř8 ř3 ˚ i“1 Zij bj phi q. Since the partial sums of the last series are zero mean Gausj“1 αj sian, the series also converges in L2 pΩ, U, Pq. Hence for b˚ , b˚ P pB3 q˚ , ˚

˚

Eb W b W “

3 ÿ

αj α j

8 ÿ

b˚j hi b˚j hi .

i“1

j“1

For w P ℓ2 pR3 q and natural numbers m ă n, by the Hahn-Banach theorem and the Cauchy-Schwartz inequality, we have

||

ÿ

mďiďn

2

wi hi ||



sup ||

3 ÿ

||b˚ ||ď1

j“1

ď 3 sup

3 ÿ

||b˚ ||ď1

ď 3p

αj

mďiďn

αj2

j“1 mďiďn

ÿ

wij b˚j phi q||2 wij2

mďiďn

j“1

3 ÿ ÿ

8 ÿ

wij2 q

sup

ÿ

b˚j phi q2

αj2

ÿ

mďiďn 3 ÿ

||b˚ ||ď1 j“1

mďiďn

b˚j phi q2

As m, n Ñ 0, the first term on the far right converges to zero as w P ℓ2 pR3 q. By the first paragraph the second factor is bounded by sup||b˚ ||ď1 Epb˚ W q2 ď E||W ||2 . ř Hence the partial sums of the series i wi hi form a Cauchy sequence in B3 and hence

it converges.

ř8 ˚ 2 was seen to converge for each j “ 1, 2, 3, it follows Because i“1 pbj hi q ř ˚ ř8 ˚ ˚ ˚ T ˚ that 8 i“1 pα1 b1 hi , α2 b2 hi , α3 b3 hi q hi “ i“1 pbj hi qhi converge in B, and hence b ř8 ˚ ˚ ř3 ˚ ˚ ˚ P pB3 q˚ . This shows that i“1 bj hi bj hi “ Eb W b W , for any b j“1 αj αj ř8 ˚ ˚ ˚ T Mb˚ “ i“1 pα1 b1 hi , α2 b2 hi , α3 b3 hi q hi and the RKHS is not bigger than this ř ř ˚ 2 space. Also ||Mb˚ ||2H “ 3j“1 αj2 8 i“1 pbj hi q . Thus the RKHS consists of elements ř8 ř8 ř8 ř8 3 2 2 i“1 wi hi “ i“1 wi hi where wi P ℓ2 pR q and || i“1 wi hi ||H “ i“1 ||wi || . 226

The space would have been smaller than claimed if there existed w P ℓ2 pR3 q that is not in the closure of the linear span of the elements pb˚ hi q of ℓ2 pR3 q when b˚ ranges over pB˚q3 . Without loss of generality, we can take this w to be orthogonal to the ř ř later collection, i.e., 3j“1 αj i wij b˚j hi “ 0 for every b˚ P pB˚q3 . This is equivalent ř to i wij hi “ 0 for j “ 1, 2, 3 which implies w “ 0. Since djk „ N3 p0, φ´1 jk I3 q, ΠS n,m can be written as S n,m pu, vq “

2n ÿ 2m ÿ

j“0 k“0

´1{2

b˚jk φjk Bjn puqBkm pvq.

(D.2)

where b˚jk „ N3 p0, I3 q. Hence Hn,m consists of h : T2 Ñ R3 such that hpu, vq “

2n ÿ 2m ÿ

j“0 k“0

cjk Bjn puqBkm pvq,

(D.3)

where cjk P R3 . The RKHS norm of h in (D.3) is given by ||h||2Hn,m “ ř2n ř2m 2 k“0 φjk ||cjk || . j“0

D.2 Proof of Theorem 72

From Stepanets (1974) and observing that the basis functions tBjn , j “ 0, . . . , 2nu span the vector space of trigonometric polynomials of degree at most n, it follows that ř ř2m i n m i 2 given any S0i P C αi pT2 q, there exists hi pu, vq “ 2n k“0 cjk Bj puqBk pvq, h : T Ñ j“0 R with |cijk | ď Mi , such that ||hi ´S0i ||8 ď Ki pn^mq´αi log n log m for some constants ř ř2m 1 2 3 T n m Mi , Ki ą 0, i “ 1, 2, 3. Setting hpu, vq “ 2n k“0 pcjk , cjk , cjk q Bj puqBk pvq, we j“0 have

||h ´ S0 ||8 ď Mpn ^ mq´αp1q log n log m with ||h||2H ď K

ř2n ř2m j“0

k“0 φjk

where M “ Mp3q , K “ Kp3q . 227

D.3 Proof of Theorem 74 . It is enough to verify the following along the lines of De Jonge and van Zanten (2010). We will show that if S0 P SC pα1 , α2 , α3 q there exists for every constant C ą 1 measurable subsets BN of CpT2 ; R3 q such that for N large enough, log Np¯ǫN , BN , || ¨ ||8 q ď DN ǫ¯2N

(D.4)

2

(D.5)

P pS R BN q ď e´CN ǫN 2

P p sup ||Spu, vq ´ S0 pu, vq|| ď ǫN q ě e´N ǫN

(D.6)

pu,vqPT2

with ǫN “ N ´αp1q {p2αp1q `2q logt1 N and ǫ¯N “ N ´αp1q {p2αp1q `2q logt2 N for some global constants t1 , t2 ą 0. To find an upper bound to the metric entropy of the unit ball of Hn,m , we embed it in an appropriate space of functions for which the upper bound is known. The function h is in fact well defined on Ap1q “ tz P C2 : |Impzj q| ď 1, j “ 1, 2u, is analytic on this set and takes real values in R2 . By the Cauchy-Schwartz inequality, it follows that with φ1,n,m “ mintφjk , j “ 0, . . . , 2n, k “ 0, . . . , 2mu, |hpzq|2 ď

ř2n ř2m j“0

2 k“0 ||cjk || φjk

ř2n ř2m j“0

n 2 m 2 k“0 p1{φjk qBj pz1 q Bk pz2 q ,

ď ||h||2Hn,m p1{φ1,n,mq.

(D.7)

for every z P Ap1q. Let Spφ, ψq denote the set of all analytic functions on Apψq, Ă Spφ1,n,m , 1q. uniformly bounded by φ´0¨5 . (D.7) shows that Hn,m 1 Next we characterize the metric entropy of Spφ, ψq for any φ ą 0 in Proposition 92. Proposition 92. There exist ǫ0 , φ0 ą 0 such that 1 logpǫ, Spφ, ψq, || ¨ ||8 q ď K1 2 ψ for φ P p0, φ0 q and ǫ P p0, ǫ0 q. 228

ˆ

K2 log 0¨5 φ ǫ

˙3

(D.8)

Proof. The proof proceeds similarly to van der Vaart and van Zanten (2009). However, extra care is needed to identify the role of φ and ψ. Let M “ φ´0¨5 . h is an analytic function h : C2 Ñ C, |hpzq| ď M for all z P Ω “ tz P C2 : |Repz1 q| ď ψ, |Repz2 q| ď ψu and hence admits a Taylor series expansion on Ω. Let tt1 , . . . , tm u be an ψ{2-net of T2 for sup norm, let T2 “ Ym i“1 Bi be a partition of T2 into sets B1 , . . . , Bm obtained by assigning every t P T to a closest ti P tt1 , . . . , tm u. ř ř n Consider P “ m i“1 Pi,ai IBi for Pi,ai ptq “ n.ďk ai,n pt ´ ti q where the sum ranges

over n “ pn1 , n2 q P pN Y t0uq2 with n. “ n1 ` n2 ď k and xn is defined as xn1 1 xn2 2 . Obtain a finite set of functions by discretizing ai,n for each i and n over a mesh of

ǫ{ψ n. -net of the interval r´M{ψ n. , M{ψ n. s. Then log

ˆź ź i

n:n.ďk

#ai,n

˙

2 2

ď t3{pψ{2qu k log

ˆ

˙ 2M . ǫ

By the Cauchy formula (2 applications of the formula in one dimension suffice), for C1 , C2 circles of radius ψ in the complex plane around the coordinates ti1 , ti2 of ti , and with D n the partial derivative of orders n “ pn1 , n2 q and n! “ n1 !n2 !, ˇ ˇ ˇ n ˇ ¿ ¿ ˇ D hpti q ˇ ˇ 1 ˇ hpzq ˇ ˇ“ˇ ˇď M . dz dz 1 2 ˇ n! ˇ ˇ p2πiq2 ˇ ψn. n`1 pz ´ ti q C1 C2

Consequently for any z P Bi , a universal constant K, an appropriately chosen ai and for k ą log KM , ǫ ˇ ˇ ˆ ˙k 8 ÿ M ÿ ˇ ÿ D n hpti q ˇ l 2 n n. ˇ pz ´ ti q ˇˇ ď pψ{2q ď M ď KM ď ǫ, ˇ n. l n! ψ 2 3 n.ąk n.ąk l“k`1

ˇ ˇ k ÿ ǫ ÿ ˇ ˇ ÿ D n hpti q l n n. ˇ ˇ pz ´ ti q ´ Pi,ai pzqˇ ď pψ{2q ď ǫ ď Kǫ. ˇ n. n! ψ 2l n.ąk l“1 n.ďk

˘3 ` Hence logpǫ, SpM, ψq, || ¨ ||8 q ď K1 ψ12 log K2ǫM . 229

We return to verifying (D.4), (D.5) and (D.6). First we will verify (D.6). By Lemma 5.3 of van der Vaart and van Zanten (2008b), we have for S0 P SC pα1 , α2 , α3 q, the inequality ´ log P p||S ´ S0 ||8 ă ǫq ď ψSn,m pǫq, 0 with ψSn,m the so-called concentration function, defined as follows: 0 ψSn,m pǫq “ 0

inf

hPHn,m :||h´S0 ||8 ăǫ

||h||2Hn,m ´ log P p||S n,m|| ă ǫq.

We can provide a lower bound to ´ log P p||S n,m|| ă ǫq using Proposition 92. Observe that

P p||S ´ S0 || ď ǫN q “

8 ÿ

n,m“1

Πn,m

ż

P p||S

n,m

´ S0 || ď ǫN q

2n,2m ź j,k“0

ppφjk qdφjk .

From Theorem 72 we obtain P p||S ´ S0 || ď ǫN q ě 8 ÿ

n,měp1{ǫN

Πn,m exp 1{α q p1q

"

˙3 * ˆ αm αn K3 ´ Mp2n ` 1qp2m ` 1q 2 ` K1 log β ǫN

for some constant K3 ą 0. Next we will verify (D.5). Define RN to be the region tφjk ě tN , j “ 0, . . . , n, k “ 0, . . . , m; n, m “ 1, . . . , rN u.

Let B1 denote the unit ball in the Banach space

CpT2 ; R3 q. Define BN “ LN SptN , 1q ` ǫN B1 .

230

Then by Borel’s inequality (van der Vaart and van Zanten, 2008b) 8 ÿ 8 ÿ

P pS R BN q “

Πn,m

n“1 m“1 rN ÿ rN ÿ

ď

Πn,m

n“1 m“1

ż ż

P pS n,m R BN q P pS

n,m

R

2n,2m ź j,k“0

LN Hn,m 1

ppφjk qdφjk

` ǫN B1 q

2n,2m ź j,k“0

ppφjk qdφjk

` P pn ą rN , m ą rN q rN ÿ rN ÿ

ď

Πn,m

n“1 m“1

ż

RN

P pS

n,m

n,m

R LN H

` ǫN B1 q

2n,2m ź j,k“0

ppφjk qdφjk

` P pφ1,rN ,rN ď tN q ` P pn ą rN , m ą rN q. From van der Vaart and van Zanten (2009), the first term on the right hand side of the previous inequality is bounded as follows. P pS n,m R LN Hn,m ` ǫN B1 q ď 1 ´ ΦrΦ´1 tP p||S n,m ||8 ď ǫN qu ` LN s. For ǫN small enough and since Φ´1 pyq ě ´tp5{2q logp1{yqu0¨5 for y P p0, 0¨5q, it follows that P pS

n,m

n,m

R LN H

„ " ˆ ˙3 *0¨5  K2 ` ǫN B1 q ď 1 ´ Φ LN ´ p5{2qK1 log 0¨5 . tN ǫN

˙3 *0¨5 " ˆ K2 and for tφjk , j “ 0, . . . , 2rN , k “ 0, . . . , 2rN u in for LN ě p5{2qK1 log t0¨5 ǫN N

RN . Let τN “ mintτi , 0 ď i ď 2rN u, τ˜N „ GatαrN , p2rN ` 1qβu and κN “ mintκi , 0 ď i ď 2rN u, κ ˜ N „ GatαrN , p2rN ` 1qβu, τ˜N and κ ˜ N are independent. Observe that P pφ1,rN ,rN ď tN q ď P pτN κN ď tN q ď P p˜ τN κ ˜ N ď tN q ď

ż e´1 0

P p˜ τN ď tN {yqfκ˜N pyqdy ` P p˜ τN ď etN q 231

Now P p˜ τN ď etN q À exprαrN ` αrN logtp2rN ` 1qtN βu ´ αrN log αrN s and fκ˜N pyq À expp´αrN q for y P p0, e´1 q. Thus P pφ1,rN ,rN ď tN q À exprαrN ` αrN logtp2rN ` 1qtN βu ´ αrN log αrN s ` expp´αrN q. Finally we will verify (D.4). For ¯ǫN ě ǫN , Np2¯ǫN , BN , || ¨ ||8 q ď Np¯ǫN {LN , SptN , 1q, || ¨ ||8 q ď K1

Letting αN N

2 3p2αp1q `2q

*

“ Oplog Nq3 , rN

ˆ

K2 log 0¨5 tN ǫ¯N {LN

˙3



" * „ " 2 3p2αp1q `2q “ O exp N , tN “ O exp ´

such that p2rN ` 1qtN β is a global constant, LN “ N 2{p2αp1q `2q , we can

verify that (D.4), (D.5) and (D.6) are satisfied with ǫN “ N ´αp1q {p2αp1q `2q logt1 N and ǫ¯N “ N ´αp1q {p2αp1q `2q logt2 N for some global constants t1 , t2 ą 0. P pn ą rN , m ą rN q * „ " 2 2αp1q `2 is guaranteed to be O exp ´ N from the tail condition in the assumption.

232

Appendix E Proofs of some results in Chapter 7

E.1 Proof of Theorem 77 Let φ “ pξr , ηr , βξ , βη , a, σq and φ0 “ pξ0r , η0r , βξ0 , βη0 , a0 , σ0 q be a fixed set of parameters in CpDq ˆ CpDq ˆ R ˆ R` . Clearly pyi , si q „ f py, s | φq, where f py, s | φq “  „ 1 exptxpsqT βξ ` ξr psqu ty ´ µpsqu2 ş f py | s, φqpps | φq “ ? . exp ´ 2σ 2 exptxpsqT βξ ` ξr psquds 2πσ 2 D Here µpsq “ xpsqT paβξ `βη q`aξr psq`ηr psq. Let µ0 psq “ xpsqT pa0 βξ0 `βη0 q`a0 ξ0r psq` η0r psq. Define Λpφ0 , φq “ log f py, s | φ0 q{f py, s | φqu and Kpφ0 , φq “ Eφ0 tΛpφ0 , φqu. Then following Schwartz (1965b), its enough to show that for all ǫ ą 0, ( pΠξr ˆ Πηr ˆ πβξ ˆ πβη ˆ πσ ˆ πa q φ : Kpφ0 , φq ă ǫ ą 0.

233

We calculate Kpφ0 , φq below. "

* f py, s | φ0 q Kpφ0 , φq “ Eφ0 tΛpφ0 , φqu “ Eφ0 log f py, s | φq „ „   1 ty ´ µ0 psqu2 ty ´ µpsqu2 σ2 “ ´ Eφ0 ´ ´ log 2 ` Eφ0 ´ 2 σ0 2σ02 2σ 2 ( Eφ0 xpsqT pβξ ´ βξ0 q ` ξr psq ´ ξ0r psq (  „ ş exp xpsqT βξ ` ξr psq ds D ( ` log ş T exp xpsq β ` ξ psq ds ξ0 0r D ˆ ˙ ż σ02 σ2 1 1 1 1 ´ 2 ` 2 tµ0 psq ´ µpsqu2 ppsqds log 2 ´ “ 2 σ0 2 σ 2σ D ż ` txpsqT pβξ ´ βξ0 q ` ξr psq ´ ξ0r psquppsqds D

(  „ ş T exp xpsq β ` ξ psq ds ξ r ( ` log ş D . exp xpsqT βξ0 ` ξ0r psq ds D

For each δ ą 0, define Bδ “

φ : ||ξr ´ ξ0r ||8 ă δ, ||ηr ´ η0r ||8 ă δ, ||βξ ´ βf 0 || ă δ, ||βg ´ βg0 || ă δ, |a ´ a0 | ă δ, ( |σ{σ0 ´ 1| ă δ .

Take b1 “ ||µ0 ´µ||8 and b2 “ σ{σ0 . Let g1 pb1 , b2 q “ log b2 ´pb22 ´1q{p2b22 q`b21 {p2σ02 b22 q. Clearly g1 pb1 , b2 q is continuous at b1 “ 0 and b2 “ 1 and g1 p0, 1q “ 0. We have b1 ď M||paβξ ` βη q ´ pa0 βξ0 ` βη0 q|| ` ||taξr psq ` ηr psqu ´ ta0 ξ0r psq ` η0r psqu|| and Kpφ0 , φq ď g1 pb1 , b2 q `

ż

D

( xpsqT pβξ ´ βξ0 q ` ξr psq ´ ξ0r psq ppsqds

(  „ ş T exp xpsq β ` ξ psq ds ξ r ( . ` log ş D T exp xpsq βξ0 ` ξ0r psq ds D 234

For ǫ ą 0, there exists a δ1 ą 0 such that for all φ P Bδ1 , ˆ ˙ ż 1 σ02 1 ǫ σ2 1 1 ´ 2 ` 2 tµ0 psq ´ µpsqu2 ppsqds ă . log 2 ´ 2 σ0 2 σ 2σ D 3 ( Also there exists δ2 ą 0 such that for all φ P Bδ2 , xpsqT pβξ ´ βξ0q ` ξr psq ´ ξ0r psq ă ( ǫ{3 uniformly for all s P D. If we define hφ psq “ exp xpsqT βξ ` ξr psq , then ( ş ş φ ÞÑ D hφ psqds is a continuous function and hence φ ÞÑ log D hφ psqds is also

a continuous function. So there exists a δ3 ą 0 such that φ P Bδ3 ñ log



D

*

hφ psqds ´ log



*

ǫ hφ0 psqds ă . 3 D

Choosing δ “ mintδ1 , δ2 , δ3 u, φ P Bδ implies Kpφ0 , φq ă ǫ. From Choi (2005), it follows that with the priors specified in §7.3.1 pΠξr ˆ Πηr ˆ πβξ ˆ πβη ˆ πσ ˆ πa qpBδ q ą 0. Hence, ( pΠξr ˆ Πηr ˆ πβξ ˆ πβη ˆ πσ ˆ πa q φ : Kpφ0 , φq ă ǫ ą 0.

E.2 Proof of Theorem 78 The prior specifications on ρξ , ρη , τξ and τη enable one to bound any quadratic forms and determinants involving Σnξ and Σnη by fixed quantities. Hence, in showing that the posterior ppa | y, sq is proper, its enough to treat ρξ , ρη , τξ and τη as constants. Without loss of generality we can work with D “ r0, 1s2 by the projection argument described in §7.6. Following Benes et al. (2003), we consider the grid approximation of the infinite dimensional Gaussian process tξr psq : s P Du, denoted by ξr . Let Ť D “ Jj“1 Ij , with tIj u denoting a segmentation of D into contiguous regions of 235

equal area ∆ “ J ´1

ş

D

ds. Choose J sufficiently large such that at most one si lies

within any Ij . The infinite-dimensional Gaussian process, ξr , can be approximated by a finite dimensional vector ξrJ “ pξr˚1, . . . , ξr˚J qT , corresponding to the choice of arbitrary points s˚1 , . . . , s˚J within I1 , . . . , IJ , respectively such that ξr psi q “ ξr˚j if ˚J ˚ ˚ si P Ij . Thus ξrJ „ Np0, Σ˚J ξ q, where pΣξ qij “ cp||si ´ sj || | ψq. Define the

true posterior ptrue pξrn | sq and the approximated posterior pJ pξrn | sq as follows. ptrue pξrn | sq9ptrue pξrn , sq “ E



( exp xpsq βξ ` ξr psq ds | ξrn T

*´n

` ˘T ` ˘´1 ` n ˘T ( exp ´ 0¨5 ξrn Σnξ ξr

and pJ pξrn | sq9pJ pξrn , sq “  „ ÿ J ( ´n ` ˘T ` ˘´1 ` J ˘T ( ˚ T ˚ ∆ exp xpsj q βξ ` ξr psj q exp ´ 0¨5 ξrJ Σ˚J ξr . ξ j“1

Marginalizing out ηrn , we have y | s, ξr , a, σ 2 , βη , βξ „ NpXβ ˚ ` aξrn , σ 2 In ` Σnη q, where X T “ txps1 q ¨ ¨ ¨ xpsn qu. The true posterior of (ξrn , a, σ 2 , βξ , βη ) is ptrue pξrn , a, βξ , βη , σ 2 | y, sq9ppy | s, ξr , a, σ 2 , βξ , βη qptrue pξrn , sqπpσ 2 qπpβξ qπpβη q. Benes et al. (2003) showed that, under these assumptions, for a fixed s P D n , the expectation of any bounded function with respect to pJ pξrn | sq converges to the corresponding expectation with respect to ptrue pξrn | sq as J tends to infinity. Hence there exists a J such that the expectation of the bounded function with respect to pJ pξrn | sq is greater than the corresponding expectation with respect to p1{2qptrue pξrn | sq. Thus, in order to show propriety of the true posterior of (ξrn , a, σ 2 , βξ , βg ), which involves ptrue pξrn | sq, its enough to show the propriety of the approximated posterior pJ pξrn , a, βξ , βη , σ 2 | y, sq. The approximated posterior of (ξrn , a, σ 2 , βξ , βη ) is pJ pξrn , a, βξ , βη , σ 2 | y, sq “ ` ˘T ` ˘´1 ` ˘( C exp ´ 0¨5 Y ´ Xβ ˚ ´ aξrn σ 2 In ` Σnη Y ´ Xβ ˚ ´ aξrn ˆ 236

( řn T ` J ˘T ` ˚J ˘´1 ` J ˘( exp i“1 xpsi q βξ ` ξr psi q 2 exp ´ 0¨5 ξr Σξ ξr πpβξ qπpβη q ˆ πpσ q “ řJ (‰n , ˚j ˚ T ∆n j“1 exp xpsj q βξ ` ξr

( ř ( where C is a constant. Since exp xpsi qT βξ ` ξr psi q ă Jj“1 exp xps˚j qT βξ ` ξr˚j for all i “ 1, . . . , n,

( xpsi qT βξ ` ξr psi q (‰n ă 1. ˚j ˚ T j“1 exp xpsj q βξ ` ξr

exp “ řJ

řn

i“1

After integrating out ξrJ excluding ξrn we are left with ppξrn , a, βξ , βη , σ 2 | Y, sq ď ` ˘T ` ˘´1 ` ˘( C1 exp ´ 0¨5 Y ´ Xβ ˚ ´ aξrn σ 2 In ` Σnη Y ´ Xβ ˚ ´ aξrn ˆ ` ˘T ` ˘´1 ` n ˘( exp ´ 0¨5 ξrn Σnf ξr πpβξ qπpβη qπpσ 2 q,

where C1 ą 0 is a constant and Σnξ is the variance-covariance matrix of ξrn constructed ` ˘ ` n 2 ˘ 2 ( ˚ n ´1 ´1 ´1 out of Σ˚J . Setting Z “ y´Xβ {a, Σ “ Σ `σ I {a and Ω “ pΣ q `Σ n η η ξ ξ and completing quadratic forms yield

` ˘T ` n ˘( ppξrn , a, βξ , βη , σ 2 | y, sq ď C2 exp ´ 0¨5 ξrn ´ Ωη Σ´1 Z Ω´1 ξr ´ Ωη Σ´1 Z ˆ η ` ˘( exp ´ 0¨5 Z T Σ´1 Z ´ Z T Σ´1 Ωη Σ´1 Z πpβξ qπpβη qπpσ 2 q,

where C2 ą 0 is another constant. Next we state a useful lemma from matrix algebra. Lemma 93. If A and B are positive definite square matrices so is A´ ApA` Bq´1A. Proof. We have A ´ ApA ` Bq´1 A “ ApA ` Bq´1 B “ tB ´1 pA ` BqA´1 u´1 “ pB ´1 ` A´1 q´1 . The conclusion follows from the fact that the sum and inverses of positive definite matrices of the same dimension are also positive definite. 237

` ˘ From Lemma 93, we have Z T Σ´1 Z ´ Z T Σ´1 Ωη Σ´1 Z ě 0, so that

` ˘T ` n ˘( ppξrn , a, βξ , βη , σ 2 | y, sq ď C2 exp ´ 0¨5 ξrn ´ Ωη Σ´1 Z Ω´1 ξr ´ Ωη Σ´1 Z ˆ η πpβξ qπpβη qπpσ 2 q. Integrating out ξrn first and then βξ and βη , ˇ` ˘´1 ` ˘´1 ˇ´p1{2q ppa, σ 2 | y, sq ď C3 ˇ Σnξ ` a2 Σnη ` σ 2 In ˇ .

Call Σnξ “ A and Σnη “ B. Hence

2 2 ´1 2 2 ˇ ´1 ˇ ˇA ` a2 pB ` σ 2 Iq´1 ˇ “ |I ` a ApB ` σ Iq | “ |a A ` σ I ` B| . |A| |σ 2 I ` B||A|

Now we state a useful result from matrix algebra.

ˇ ˇ Proposition 94. If A and B and non-negative definite matrices, then ˇA ` B ˇ ě ˇ ˇ ˇ ˇ ˇAˇ ` ˇB ˇ with strict inequality holding in case of positive definite matrices. Using Proposition 94, we get

ˆ

|a2 A ` σ 2 I ` B| |σ 2 I ` B|

˙´p1{2q

ď

ˆ

|a2 A| ` |σ 2 I ` B| |σ 2 I ` B|

˙´p1{2q

" *´p1{2q a2n |A| “ 1 ` śn 2 i“1 pσ ` bi q

" ď 1`

a2n |A| pσ 2 ` bn qn

*´p1{2q

,

where 0 ă b1 ď b2 ď ¨ ¨ ¨ ď bn are the eigen values of B. By Minkowski’s inequality we get " 1`

a2n |A| pσ 2 ` bn qn

*´p1{2q

ď

pσ 2 ` bn qn{2

cn pa2 |A|p1{nq ` σ 2 ` bn q

238

n{2

.

Set |A|1{n “ k1 and bn “ k2 . We assume n ě 2. Then ignoring constants ż8

´8

`

pσ 2 ` bn qn{2

a2 |A|1{n ` σ 2 ` bn

˘n{2 da “ ď

ż8

´8

ż8

´8

1 1 ` pa2 k1 q{pσ 2 ` k2 q

1 ( da “ π 2 1 ` pa k1 q{pσ 2 ` k2 q

Now since Eπ pσq ă 8, ż8ˆ 0

σ 2 ` k2 k1

(n{2 da

˙p1{2q

πpdσ 2 q ă 8.

By Fubini’s Theorem, ppa | Y, sq is integrable.

239

ˆ

σ 2 ` k2 k1

˙p1{2q

.

Bibliography Adler, R. (1990), An introduction to continuity, extrema, and related topics for general Gaussian processes, vol. 12, Institute of Mathematical Statistics. Albert, J. and Chib, S. (2001), “Sequential ordinal modeling with applications to survival data,” Biometrics, 57, 829–836. Amenta, N., Bern, M., and Kamvysselis, M. (1998), “A new Voronoi-based surface reconstruction algorithm,” in Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pp. 415–421, ACM. Amewou-Atisso, M., Ghoshal, S., Ghosh, J. K., and Ramamoorthi, R. V. (2003), “Posterior consistency for semi-parametric regression problems,” Bernoulli, 9, 291– 312. Arellano-Vallea, R. B., Galea-Rojasb, M., and Zuazola, P. I. (2000), “Bayesian sensitivity analysis in elliptical linear regression models,” Journal of Statistical Planning and Inference, 86, 175–199. Aziz, N., Bata, R., and Bhat, S. (2002), “Bezier surface/surface intersection,” Computer Graphics and Applications, IEEE, 10, 50–58. Barnhill, R. (1985), “Surfaces in computer aided geometric design: A survey with new results,” Computer Aided Geometric Design, 2, 1–17. Barrientos, F., Jara, A., and Quintana, F. (2011), “On the support of MacEacherns dependent Dirichlet processes,” Unpublished manuscript, University of Chile. Barron, A., Birg´e, L., and Massart, P. (1999a), “Risk bounds for model selection via penalization,” Probability theory and related fields, 113, 301–413. Barron, A., Schervish, M., and Wasserman, L. (1999b), “The consistency of posterior distributions in nonparametric problems,” The Annals of Statistics, 27, 536–561. Belitser, E. and Ghosal, S. (2003), “Adaptive Bayesian inference on the mean of an infinite-dimensional normal distribution,” The Annals of Statistics, 31, 536–559.

240

Benes, V., Bodl´ak, K., Møller, J., and Waagepetersen, R. P. (2003), “Bayesian analysis of log Gaussian Cox process models for disease mapping,” in The ISI International Conference on Environmental Statistics and Health, Univ Santiago de Compostela, pp. 95–105. Bhattacharya, A. and Dunson, D. (2010), “Strong consistency of nonparametric Bayes density estimation on complex Watson kernels,” Duke University, DSS discussion series. Bhattacharya, A. and Dunson, D. (2011), “Posterior rates of contraction in probability tensor decomposed latent variable models,” (in progress). Birg´e, L. (1986), “On estimating a density using Hellinger distance and some other strange facts,” Probability theory and related fields, 71, 271–291. Birg´e, L. (2001), “An alternative point of view on Lepski’s method,” Lecture NotesMonograph Series, pp. 113–133. Boissonnat, J. and Oudot, S. (2005), “Provably good sampling and meshing of surfaces,” Graphical Models, 67, 405–451. Borell, C. (1975), “The Brunn-Minkowski inequality in gauss space,” Inventiones Mathematicae, 30, 207–216. Brechb¨ uhler, C., Gerig, G., and K¨ ubler, O. (1995), “Parametrization of closed surfaces for 3-D shape description,” Computer vision and image understanding, 61, 154–170. Burr, D. and Doss, H. (2005), “A Bayesian Semiparametric Model for RandomEffects Meta-Analysis,” Journal of the American Statistical Association, 100, 242– 251. Bush, C. and MacEachern, S. (1996), “A semiparametric Bayesian model for randomised block designs,” Biometrika, 83, 275. Canale, A. and Dunson, D. (2011), “Bayesian multivariate mixed scale density estimation,” Arxiv preprint arXiv:1110.1265. Casale, M. (1987), “Free-form solid modeling with trimmed surface patches,” IEEE Computer Graphics and Applications, pp. 33–43. Castillo, I. (2008), “Lower bounds for posterior rates with Gaussian process priors,” Electronic Journal of Statistics, 2, 1281–1299. Chan, D., Kohn, R., Nott, D., and Kirby, C. (2006), “Locally adaptive semiparametric estimation of the mean and variance functions in regression models,” Journal of Computational and Graphical Statistics, 15, 915–936. 241

Chib, S. and Greenberg, E. (2010), “Additive cubic spline regression with Dirichlet process mixture errors,” Journal of Econometrics, 156, 322–336. Chipman, H. A., George, E. I., and Mcculloch, R. E. (2010), “BART: Bayesian additive regression trees,” The Annals of Applied Statistics, 4, 266–298. Choi, T. (2005), “Posterior Consistency in Nonparametric Regression problems in Gaussian Process Priors,” Ph.D. thesis, Department of Statistics, Carnegie Mellon University. Choi, T. (2007), “Alternative posterior consistency results in nonparametric binary regression using Gaussian process priors,” Journal of Statistical Planning and Inference, 137, 2975–2983. Choi, T. (2009), “Asymptotic properties of posterior distributions in nonparametric regression with non-Gaussian errors,” Annals of the Institute of Statistical Mathematics, 61, 835–859. Choi, T. and Schervish, M. (2007a), “On posterior consistency in nonparametric regression problems,” Journal of Multivariate Analysis, 98, 1969–1987. Choi, T. and Schervish, M. (2007b), “On posterior consistency in nonparametric regression problems,” Journal of Multivariate Analysis, 98, 1969–1987. Chu, K. C. (1973), “Estimation and detection in linear systems with elliptical errors.” IEEE Trans. Auto. Control, 18, 499–505. Chung, M., Dalton, K., and Davidson, R. (2008), “Tensor-based cortical surface morphometry via weighted spherical harmonic representation,” IEEE Transactions on Medical Imaging, 27, 1143–1151. Chung, Y. and Dunson, D. (2009), “Nonparametric Bayes conditional distribution modeling with variable selection,” Journal of the American Statistical Association, 104, 1646–1660. Cinquin, P., Chalmond, B., and Berard, D. (1982), “Hip prosthesis design,” Lecture Notes in Medical Informatics, 16, 195–200. Cram´er, H. and Leadbetter, M. R. (1967), Stationary and related stochastic processes, sample function properties and their applications, John Wiley & Sons, New York. Cunningham, G., Lehovich, A., and Hanson, K. (1999), “Bayesian estimation of regularization parameters for deformable surface models (Proceedings Paper),” . De Iorio, M., Mueller, P., Rosner, G., and MacEachern, S. (2004), “An ANOVA model for dependent random measures,” Journal of the American Statistical Association, 99, 205–215. 242

De Iorio, M., Johnson, W., M¨ uller, P., and Rosner, G. (2009), “Bayesian nonparametric nonproportional hazards survival modeling,” Biometrics, 65, 762–771. De Jonge, R. and van Zanten, J. (2010), “Adaptive nonparametric bayesian inference using location-scale mixture priors,” The Annals of Statistics, 38, 3300–3320. Denison, D., Mallick, B., and Smith, A. (1998), “Bayesian MARS,” Statistics and Computing, 8, 337–346. Denison, D., Holmes, C., Mallick, B., and Smith, A. F. M. (2002), Bayesian methods for nonlinear classification and regression, Wiley & Sons, London. D´esid´eri, J. and Janka, A. (2004), “Multilevel shape parameterization for aerodynamic optimization–application to drag and noise reduction of transonic/supersonic business jet,” in European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2004), E. Heikkola et al eds., Jyv¨askyla, pp. 24–28. D´esid´eri, J., Abou El Majd, B., and Janka, A. (2007), “Nested and self-adaptive B´ezier parameterizations for shape optimization,” Journal of Computational Physics, 224, 117–131. Diaconis, P. and Freedman, D. (1986), “On the consistency of Bayes estimates,” The Annals of Statistics, pp. 1–26. Diggle, P., Menezes, R., and Su, T. (2010), “Geostatistical inference under preferential sampling (with discussion),” Journal of the Royal Statistical Society: Series C (Applied Statistics), 59, 191–232. Doss, H. (1985), “Bayesian nonparametric Estimation of the median; Part I: Computation of the estimates,” The Annals of Statistics, 13, 1432–1444. Dryden, I. and Mardia, K. (1998), Statistical shape analysis, vol. 4, John Wiley & Sons New York. Dunson, D. and Park, J. (2008a), “Kernel stick-breaking processes,” Biometrika, 95, 307–323. Dunson, D., Pillai, N., and Park, J. (2007a), “Bayesian density regression,” Journal of the Royal Statistical Society, Series B, 69, 163–183. Dunson, D. B. and Park, J.-H. (2008b), “Kernel stick-breaking processes,” Biometrika, 95, 307–323. Dunson, D. B., Pillai, N., and Park, J.-H. (2007b), “Bayesian density regression,” Journal of the Royal Statistical Society, Series B, 69, 163–183.

243

Escobar, M. D. and West, M. (1995), “Bayesian Density Estimation and Inference Using Mixtures,” Journal of the American Statistical Association, 90, 577–588. Farin, G. (2002), Curves and surfaces for CAGD: a practical guide, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 5th edn. Ferguson, T. (1973a), “A Bayesian analysis of some nonparametric problems,” The Annals of Statistics, 1, 209–230. Ferguson, T. (1974a), “Prior distributions on spaces of probability measures,” The Annals of Statistics, 2, 615–629. Ferguson, T. S. (1973b), “A Bayesian Analysis of Some Nonparametric Problems,” The Annals of Statistics, 1, 209–230. Ferguson, T. S. (1974b), “Prior Distributions on Spaces of Probability Measures,” The Annals of Statistics, 2, 615–629. Fonseca, T. C. O., Ferreira, M. A. R., and Migon, H. S. (2008), “Objective Bayesian analysis for the Student-t regression model,” Biometrika, 95, 325–333. Fowler, B. (1992), “Geometric manipulation of tensor product surfaces,” in Proceedings of the 1992 symposium on Interactive 3D graphics, pp. 101–108, ACM. Gelfand, A., Kottas, A., and MacEachern, S. (2005), “Bayesian nonparametric spatial modeling with Dirichlet process mixing,” Journal of the American Statistical Association, 100, 1021–1035. Geweke, J. (1992), “Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments,” Bayesian Statistics, 4, 169–194. Ghosal, S. and van der Vaart, A. (2001), “Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities,” The Annals of Statistics, 29, 1233–1263. Ghosal, S. and van der Vaart, A. (2007a), “Convergence rates of posterior distributions for noniid observations,” The Annals of Statistics, 35, 192–223. Ghosal, S. and van der Vaart, A. (2007b), “Posterior convergence rates of Dirichlet mixtures at smooth densities,” The Annals of Statistics, 35, 697–723. Ghosal, S., Ghosh, J., and Ramamoorthi, R. (1999), “Posterior consistency of Dirichlet mixtures in density estimation,” The Annals of Statistics, 27, 143–158. Ghosal, S., Ghosh, J., and van der Vaart, A. (2000), “Convergence rates of posterior distributions,” Annals of Statistics, 28, 500–531.

244

Ghosal, S., Lember, J., and Van Der Vaart, A. (2003), “On Bayesian adaptation,” Acta Applicandae Mathematicae, 79, 165–175. Ghosal, S., Lember, J., and Van Der Vaart, A. (2008), “Nonparametric Bayesian model selection and averaging,” Electronic Journal of Statistics, 2, 63–89. Ghoshal, S. and Roy, A. (2006), “Posterior consistency of Gaussian process prior in nonparametric binary regression,” The Annals of Statistics, 34, 2413–2429. Godsill, S. (2001), “On the relationship between Markov chain Monte Carlo methods for model uncertainty,” Journal of Computational and Graphical Statistics, 10, 230–248. Gordon, W. and Riesenfeld, R. (1974), “Bernstein-B´ezier methods for the computeraided design of free-form curves and surfaces,” Journal of the ACM (JACM), 21, 293–310. Goshtasby, A. (1992), “Surface reconstruction from scattered measurements,” SPIE. Gramacy, R. and Lee, H. (2008), “Bayesian treed Gaussian process models with an application to computer modeling,” Journal of the American Statistical Association, 103, 1119–1130. Griffin, J. and Steel, M. (2006a), “Order-based dependent Dirichlet processes,” Journal of The American Statistical Association, 101, 179–194. Griffin, J. and Steel, M. (2010), “Bayesian nonparametric modelling with the Dirichlet process regression smoother,” Statistica Sinica, 20, 1507–1527. Griffin, J. and Steel, M. F. J. (2006b), “Order-Based Dependent Dirichlet Processes,” Journal of the American Statistical Association, Theory and Methods, 101, 179– 194. Hagen, H. and Santarelli, P. (1992), “Variational design of smooth B-spline surfaces,” in Topics in surface modeling, pp. 85–92, Society for Industrial and Applied Mathematics. Hastie, T. and Stuetzle, W. (1989), “Principal curves,” Journal of the American Statistical Association, pp. 502–516. Higdon, D. (2002), “Space and space-time modeling using process convolutions,” Quantitative methods for current environmental issues, pp. 37–56. Ho, L. and Stoyan, D. (2008), “Modeling marked point patterns by intensity-marked Cox processes,” Statistics and Probability Letters, 78, 1194–1199. Hoffmann, M. and Lepski, O. (2002), “Random rates in anisotropic regression,” Annals of statistics, pp. 325–358. 245

Hoppe, H., DeRose, T., Duchamp, T., McDonald, J., and Stuetzle, W. (1992), “Surface reconstruction from unorganized points,” Computer Graphics, 26, 71–71. Huang, T. (2004), “Convergence rates for posterior distributions and adaptive estimation,” The Annals of Statistics, 32, 1556–1593. Ibragimov, I. and Khasminski, R. (1981), Statistical estimation–asymptotic theory, vol. 16, Springer. Ishwaran, H. and James, L. (2001), “Gibbs Sampling Methods for Stick-Breaking Priors,” Journal of the American Statistical Association, 96, 161–173. James, L. F., Lijoi, A., and Pr¨ unster, I. (2005), “Bayesian nonparametric inference via classes of normalized random measures,” Tech. rep., ICER Applied Mathematics Working Papers Series 5/2005. Jara, A., Lesaffre, E., De Iorio, M., and Quintana, F. (2010), “Bayesian semiparametric inference for multivariate doubly-interval-censored data,” The Annals of Applied Statistics, 4, 2126–2149. Johnstone, J. and Sloan, K. (1995), “Tensor product surfaces guided by minimal surface area triangulations,” in Visualization, p. 254, Published by the IEEE Computer Society. Kalli, M., Griffin, J., and Walker, S. (2010), “Slice sampling mixture models,” Statistics and computing, pp. 1–13. Kazhdan, M., Bolitho, M., and Hoppe, H. (2006), “Poisson surface reconstruction,” in Proceedings of the fourth Eurographics symposium on Geometry processing, pp. 61–70, Eurographics Association. Kerkyacharian, G., Lepski, O., and Picard, D. (2001), “Nonlinear estimation in anisotropic multi-index denoising,” Probability theory and related fields, 121, 137– 170. Klutchnikoff, N. (2005), “On the adaptive estimation of anisotropic functions,” Ph.D. thesis, Ph. D. thesis, Univ. Aix–Marseille I. Knapik, B., van der Vaart, A., and van Zanten, J. (2011), “Bayesian inverse problems,” Arxiv preprint arXiv:1103.2692. Kottas, A. and Gelfand, A. E. (2001), “Bayesian Semiparametric Median Regression Modeling,” Journal of the American Statistical Association, 96, 1458–1468. Kruijer, W., Rousseau, J., and van der Vaart, A. (2010), “Adaptive Bayesian density estimation with location-scale mixtures,” Electronic Journal of Statistics, 4, 1225– 1257. 246

Kuelbs, J. and Li, W. (1993), “Metric entropy and the small ball problem for Gaussian measures,” J. Funct. Anal, 116, 133–157. Kundu, S. and Dunson, D. (2011), “Single Factor Transformation Priors for Density Regression,” DSS Discussion Series. Kurtek, S., Srivastava, A., Klassen, E., and Ding, Z. (2011), “Statistical Modeling of Curves Using Shapes and Related Features,” Journal of American Statistical Association, (in revision). Lang, J. and R¨oschel, O. (1992), “Developable (1, n)-B´ezier surfaces,” Computer Aided Geometric Design, 9, 291–298. Lange, K., Little, R. J. A., and Taylor, J. M. G. (1989), “Robust statistical modelling using the T distribution.” Journal of the American Statistical Association, 84, 881– 896. Lavine, M. and Mockus, A. (2005), “A nonparametric Bayes method for isotonic regression,” Journal of Statistical Planning and Inference, 46, 235–248. Lee, H., Higdon, D., Calder, C., and Holloman, C. (2005), “Efficient models for correlated data via convolutions of intrinsic processes,” Statistical Modelling, 5, 53–74. Lenk, P. (1988), “The logistic normal distribution for Bayesian, nonparametric, predictive densities,” Journal of the American Statistical Association, 83, 509–516. Lenk, P. (1991), “Towards a practicable Bayesian nonparametric density estimator,” Biometrika, 78, 531. Lepski, O. (1990), “A problem of adaptive estimation in Gaussian white noise,” Teoriya Veroyatnostei i ee Primeneniya, 35, 459–470. Lepski, O. (1991), “Asymptotic minimax adaptive estimation. —. Upper bounds.” Theory Probab. Appl, 36, 645–659. Lepski, O. (1992), “Asymptotic minimax adaptive estimation. 2.— Statistical model without optimal adaptation. Adaptive estimators,” Theory Probab. Appl, 37, 468– 481. Lepski, O. and Levit, B. (1999), “Adaptive nonparametric estimation of smooth multivariate functions,” Mathematical Methods of Statistics, 8, 344–370. Li, J. (2004), “Visualization of high-dimensional data with relational perspective map,” Information Visualization, 3, 49–59. Li, R., Li, G., and Wang, Y. (2007), “Closed surface modeling with helical line measurement data,” Frontiers of Mechanical Engineering in China, 2, 72–76. 247

Lo, A. Y. (1984), “On a class of Bayesian nonparametric estimates. I: Density estimates,” The Annals of Statistics, 12, 351–357. Lorensen, W. and Cline, H. (1987), “Marching cubes: A high resolution 3D surface construction algorithm,” in Proceedings of the 14th annual conference on Computer graphics and interactive techniques, pp. 163–169, ACM. MacEachern, S. (1999), “Dependent nonparametric processes,” in Proceedings of the Section on Bayesian Statistical Science, pp. 50–55. Madi, M. (2004), “Closed-form expressions for the approximation of arclength parameterization for Bezier curves,” International journal of applied mathematics and computer science, 14, 33–42. Mann, S. and DeRose, T. (1995), “Computing values and derivatives of B´ezier and B-spline tensor products,” Computer Aided Geometric Design, 12, 107–110. McLachlan, G. and Peel, D. (2000), “Mixtures of factor analyzers,” in In Proceedings of the Seventeenth International Conference on Machine Learning, Citeseer. Menezes, R. (2005), “Assessing spatial dependency under non-standard sampling,” Ph.D. thesis, Universidad de Santiago de Compostela, Santiago de Compostela, Spain. Mokhtarian, F. and Mackworth, A. (1992), “A theory of multiscale, curvature-based shape representation for planar curves,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 14, 789–805. Møller, J., Syversveen, A., and Waagepetersen, R. (2001), “Log Gaussian Cox processes,” Scandinavian Journal of Statistics, 25, 451–482. Mortenson, M. (1985), Geometrie modeling, John Wiley, New York. Muller, H. (2005), “Surface reconstruction-an introduction,” in Scientific Visualization Conference, 1997, p. 239, IEEE. M¨ uller, P., Erkanli, A., and West, M. (1996), “Bayesian curve fitting using multivariate normal mixtures,” Biometrika, 83, 67–79. M¨ uller, P., Erkanli, A., and West, M. (1996), “Bayesian curve fitting using multivariate normal mixtures,” Biometrika, 83, 67–79. Neal, R. J. (1998), “Regression and Classification using Gaussian process Priors,” Bayesian Statistics, 6, 475–501. Norets, A. (2010), “Approximation of conditional densities by smooth mixtures of regressions,” The Annals of Statistics, 38, 1733–1766. 248

Norets, A. and Pelenis, J. (2009), "Bayesian modeling of joint and conditional distributions," unpublished manuscript, Princeton University.
Norets, A. and Pelenis, J. (2010), "Posterior consistency in conditional distribution estimation by covariate dependent mixtures," unpublished manuscript, Princeton University.
Nott, D. (2006), "Semiparametric estimation of mean and variance functions for non-Gaussian data," Computational Statistics, 21, 603–620.
Nussbaum, M. (1985), "Spline smoothing in regression models and asymptotic efficiency in L2," The Annals of Statistics, 13, 984–997.
Ongaro, A. and Cattaneo, C. (2004), "Discrete random probability measures: a general framework for nonparametric Bayesian inference," Statistics & Probability Letters, 67, 33–45.
Papaspiliopoulos, O. (2008), "A note on posterior sampling from Dirichlet mixture models," technical report.
Papaspiliopoulos, O. and Roberts, G. (2008), "Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models," Biometrika, 95, 169–183.
Park, B. and Dunson, D. (2009), "Bayesian generalized product partition model," Statistica Sinica, (to appear).
Pati, D. and Dunson, D. (2010), "Bayesian nonparametric regression with varying residual density," Annals of the Institute of Statistical Mathematics.
Pati, D. and Dunson, D. (2011), "Bayesian modeling of closed surfaces through tensor products," (submitted to Biometrika).
Pati, D., Dunson, D., and Tokdar, S. (2010), "Posterior consistency in conditional distribution estimation," technical report, Department of Statistical Science, Duke University.
Pati, D., Bhattacharya, A., and Dunson, D. (2011a), "Posterior convergence rates in latent variable density regression models," (in progress).
Pati, D., Bhattacharya, A., and Dunson, D. (2011b), "Posterior convergence rates in non-linear latent variable models," arXiv preprint arXiv:1109.5000, (submitted).
Pati, D., Dunson, D., and Tokdar, S. (2012), "Posterior consistency in conditional distribution estimation," Journal of Multivariate Analysis, (submitted).
Pelenis, J. and Norets, A. (2011), "Bayesian semi-parametric regression," technical report, Princeton University, Economics Department.


Persoon, E. and Fu, K. (1977), "Shape discrimination using Fourier descriptors," IEEE Transactions on Systems, Man and Cybernetics, 7, 170–179.
Piegl, L. (1986), "The sphere as a rational Bézier surface," Computer Aided Geometric Design, 3, 45–52.
Ratcliffe, S. J., Guo, W., and Ten Have, T. (2004), "Joint modeling of longitudinal and survival data via a common frailty," Biometrics, 60, 892–899.
Raftery, A. E. and Lewis, S. (1992), "How Many Iterations in the Gibbs Sampler?" Bayesian Statistics, 4, 763–773.
Rasmussen, C. (2004), "Gaussian processes in machine learning," Advanced Lectures on Machine Learning, pp. 63–71.
Rasmussen, C. and Williams, C. (2005), Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA.
Reich, B., Bondell, H., and Li, L. (2010), "Sufficient Dimension Reduction via Bayesian Mixture Modeling," Biometrics.
Rineau, L. and Yvinec, M. (2007), "A generic software design for Delaunay refinement meshing," Computational Geometry, 38, 100–110.
Rodrigues, A. and Diggle, P. (2010), "A Class of Convolution-Based Models for Spatio-Temporal Processes with Non-Separable Covariance Structure," Scandinavian Journal of Statistics, (to appear).
Rodriguez, A. and Dunson, D. (2011), "Nonparametric Bayesian models through probit stick-breaking processes," Bayesian Analysis, 6, 145–178.
Rossi, D. and Willsky, A. (1984), "Reconstruction from projections based on detection and estimation of objects, Parts I and II: Performance analysis and robustness analysis," IEEE Transactions on Acoustics, Speech and Signal Processing, 32, 886–906.
Róth, Á. and Juhász, I. (2010), "Control point based exact description of a class of closed curves and surfaces," Computer Aided Geometric Design, 27, 179–201.
Róth, Á., Juhász, I., Schicho, J., and Hoffmann, M. (2009), "A cyclic basis for closed curve and surface modeling," Computer Aided Geometric Design, 26, 528–546.
Rousseau, J. (2010), "Rates of convergence for the posterior distributions of mixtures of betas and adaptive nonparametric estimation of the density," The Annals of Statistics, 38, 146–180.

Savitsky, T., Vannucci, M., and Sha, N. (2011), "Variable selection for nonparametric Gaussian process priors: Models and computational strategies," Statistical Science, 26, 130–149.
Schwartz, L. (1965), "On Bayes procedures," Z. Wahrsch. Verw. Gebiete, 4, 10–26.
Scott, J. and Berger, J. (2010), "Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem," The Annals of Statistics, 38, 2587–2619.
Sederberg, T. (1983), "Implicit and parametric curves and surfaces for computer aided geometric design," Ph.D. thesis, Purdue University.
Sethuraman, J. (1994), "A Constructive Definition of Dirichlet Priors," Statistica Sinica, 4, 639–650.
Shahbaba, B. and Neal, R. (2009), "Nonlinear models using Dirichlet process mixtures," Journal of Machine Learning Research, 10, 1829–1850.
Shen, L. and Makedon, F. (2006), "Spherical mapping for processing of 3D closed surfaces," Image and Vision Computing, 24, 743–761.
Shen, W. and Ghosal, S. (2011), "Adaptive Bayesian multivariate density estimation with Dirichlet mixtures," arXiv preprint arXiv:1109.6406.
Smith, M. and Kohn, R. (1997), "A Bayesian approach to nonparametric bivariate regression," Journal of the American Statistical Association, 92, 1522–1535.
Soussen, C. and Mohammad-Djafari, A. (2001), "Closed surface reconstruction in X-ray tomography," in Proceedings of the 2001 International Conference on Image Processing, vol. 1, pp. 718–721, IEEE.
Staib, L. and Duncan, J. (1992), "Deformable Fourier models for surface finding in 3-D images," in Proceedings of SPIE, International Society for Optical Engineering.
Stein, M. L. (1999), Interpolation of Spatial Data: Some Theory for Kriging, Springer Series in Statistics, Springer, New York.
Stepanets, A. (1974), "The approximation of certain classes of differentiable periodic functions of two variables by Fourier sums," Ukrainian Mathematical Journal, 25, 498–506.
Stone, C. (1982), "Optimal global rates of convergence for nonparametric regression," The Annals of Statistics, 10, 1040–1053.


Su, B. and Liu, D. (1989), Computational Geometry: Curve and Surface Modeling, Academic Press, San Diego, CA.
Su, J., Dryden, I., Klassen, E., Le, H., and Srivastava, A. (2011), "Fitting Optimal Curves to Time-Indexed, Noisy Observations of Stochastic Processes on Nonlinear Manifolds," Image and Vision Computing.
Szeliski, R. and Tonnesen, D. (1992), "Surface modeling with oriented particle systems," ACM SIGGRAPH Computer Graphics, 26, 185–194.
Tang, Y. and Ghosal, S. (2007a), "A consistent nonparametric Bayesian procedure for estimating autoregressive conditional densities," Computational Statistics & Data Analysis, 51, 4424–4437.
Tang, Y. and Ghosal, S. (2007b), "Posterior consistency of Dirichlet mixtures for estimating a transition density," Journal of Statistical Planning and Inference, 137, 1711–1726.
Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society Series B, 58, 267–288.
Tokdar, S. T. (2006), "Posterior consistency of Dirichlet location-scale mixture of normals in density estimation and regression," Sankhyā: The Indian Journal of Statistics, 68, 90–110.
Tokdar, S. (2011a), "Dimension adaptability of Gaussian process models with variable selection and projection," arXiv preprint arXiv:1112.0716.
Tokdar, S. (2011b), "Posterior rates of contraction in Dirichlet process mixtures of multivariate normals," arXiv preprint arXiv:1111.4148.
Tokdar, S. and Ghosh, J. (2007), "Posterior consistency of logistic Gaussian process priors in density estimation," Journal of Statistical Planning and Inference, 137, 34–42.
Tokdar, S., Zhu, Y., and Ghosh, J. (2010), "Bayesian Density Regression with Logistic Gaussian Process and Subspace Projection," Bayesian Analysis, 5, 1–26.
van der Vaart, A. and van Zanten, J. (2007), "Bayesian inference with rescaled Gaussian process priors," Electronic Journal of Statistics, 1, 433–448.


van der Vaart, A. and van Zanten, J. (2008a), "Rates of contraction of posterior distributions based on Gaussian process priors," The Annals of Statistics, 36, 1435–1463.
van der Vaart, A. and van Zanten, J. (2008b), "Reproducing kernel Hilbert spaces of Gaussian priors," IMS Collections, 3, 200–222.
van der Vaart, A. and van Zanten, J. (2009), "Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth," The Annals of Statistics, 37, 2655–2675.
van der Vaart, A. W. and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, Springer-Verlag, New York.
Walker, S. G. (2007), "Sampling the Dirichlet Mixture Model with Slices," Communications in Statistics – Simulation and Computation, 36, 45–54.
Wand, M. and Schucany, W. (1990), "Gaussian-based kernels," Canadian Journal of Statistics, 18, 197–204.
Weiss, R. (1996), "An Approach to Bayesian Sensitivity Analysis," Journal of the Royal Statistical Society Series B, 58, 739–750.
West, M. (1984), "Outlier models and prior distributions in Bayesian linear regression," Journal of the Royal Statistical Society Series B, 46, 431–439.
West, M. (1987), "On scale mixtures of normal distributions," Biometrika, 74, 646–648.
Whang, K., Song, J., Chang, J., Kim, J., Cho, W., Park, C., and Song, I. (1995), "Octree-R: An adaptive octree for efficient ray tracing," IEEE Transactions on Visualization and Computer Graphics, 1, 343–349.
Whitney, H. (1937), "On regular closed curves in the plane," Compositio Mathematica, 4, 276–284.
Wu, M. and Follmann, D. (1999), "Use of summary measures to adjust for informative missingness in repeated measures data with random effects," Biometrics, 55, 75–84.
Wu, Y. and Ghosal, S. (2008), "Kullback Leibler property of kernel mixture priors in Bayesian density estimation," Electronic Journal of Statistics, 2, 298–331.
Wu, Y. and Ghosal, S. (2010), "L1-Consistency of Dirichlet Mixtures in Multivariate Bayesian Density Estimation," Journal of Multivariate Analysis, (to appear).
Yang, M. and Lee, E. (1999), "Segmentation of measured point data using a parametric quadric surface approximation," Computer Aided Design, 31, 449–457.

Yau, P. and Kohn, R. (2003), "Estimation and variable selection in nonparametric heteroscedastic regression," Statistics and Computing, 13, 191–208.
Yoon, J. (2009), "Bayesian analysis of conditional density functions: a limited information approach," unpublished manuscript, Claremont McKenna College.
Zahn, C. and Roskies, R. (1972), "Fourier descriptors for plane closed curves," IEEE Transactions on Computers, C-21, 269–281.
Zhang, H. (2004), "Inconsistent Estimation and Asymptotically Equal Interpolations in Model-Based Geostatistics," Journal of the American Statistical Association, 99, 250–261.
Zou, F., Huang, H., Lee, S., and Hoeschele, I. (2010), "Nonparametric Bayesian Variable Selection With Applications to Multiple Quantitative Trait Loci Mapping With Epistasis and Gene–Environment Interaction," Genetics, 186, 385.


Biography

Debdeep Pati was born on March 12, 1985 in Kolkata, India. He received his Bachelor's degree in Statistics from the Indian Statistical Institute in 2006. He continued there for a Master's degree and graduated in 2008, specializing in Mathematical Statistics and Probability. In August 2008, Debdeep moved to the United States to pursue a Ph.D. in Statistical Science at Duke University, Durham, NC. In 2010, he earned a Master's degree en route to his Ph.D. He graduated with a Doctor of Philosophy under the supervision of Professor David B. Dunson in May 2012. From Fall 2012, he will be an Assistant Professor in the Department of Statistics at Florida State University.

His research interests center around nonparametric Bayesian foundational theory and methodology in a broad range of areas, including density estimation, high-dimensional density regression and variable selection, shape reconstruction, imaging, and hierarchical modeling of shapes.
