Bayesian evidence computation for model selection in non-linear geoacoustic inference problems

Jan Dettmer a) and Stan E. Dosso
School of Earth and Ocean Sciences, University of Victoria, Victoria, British Columbia V8W 3P6, Canada
John C. Osler
Defence Research and Development Canada Atlantic, Dartmouth, Nova Scotia B2Y 3Z7, Canada

a) Author to whom correspondence should be addressed. Electronic mail: [email protected]
(Received 7 July 2010; revised 27 September 2010; accepted 2 October 2010)

This paper applies a general Bayesian inference approach, based on Bayesian evidence computation, to geoacoustic inversion of interface-wave dispersion data. Quantitative model selection is carried out by computing the evidence (normalizing constants) for several model parameterizations using annealed importance sampling. The resulting posterior probability density estimate is compared to estimates obtained from Metropolis–Hastings sampling to ensure consistent results. The approach is applied to invert interface-wave dispersion data collected on the Scotian Shelf, off the east coast of Canada, for the sediment shear-wave velocity profile. Results are consistent with previous work on these data but extend the analysis to a rigorous approach including model selection and uncertainty analysis. The results are also consistent with core samples and seismic reflection measurements carried out in the area. © 2010 Acoustical Society of America. [DOI: 10.1121/1.3506345]

PACS number(s): 43.30.Pc, 43.60.Pt, 43.30.Ma [AIT]
I. INTRODUCTION
Bayesian inference has been widely applied to geoacoustic inference problems in recent years1–7 since knowledge of geoacoustic properties and their uncertainties is important for acoustic/sonar shallow-water applications. However, Bayesian model selection, a fundamental component of Bayesian inference, has seen only limited application in acoustics.8–12 Both parameter inference and model selection are intrinsic parts of estimating parameter uncertainty. Model selection is particularly important for geoscientific inverse problems, where complex unknown environments often result in non-uniqueness and poor understanding of data error processes. Here, a model is considered in the most general sense to be any particular choice of physical theory, its appropriate parameterization, and a statistical representation for the data errors that are used to explain the physical system under investigation. The goal is to determine the simplest model that sufficiently explains the data by applying a parsimony criterion. In this paper, parameterization of the model is considered in terms of the layering structure of the seabed sediment. Parameterizations that differ in the number and type of layers are considered, including layers with homogeneous properties, and linear and power-law gradients (approximated by large numbers of homogeneous layers). While under-parameterizing a model yields simplicity, it can also cause unrealistically small parameter uncertainty estimates and large theory error, since the model can reach high likelihood levels only for specific parameter values and can therefore only access a limited part of the data space.13 Furthermore, the theory error introduced by under-parameterization can lead to biases in parameter estimates. Hence,
appropriate parameterizations need to be selected using objective criteria, and the ability to compute reliable estimates of normalizing constants is an important aspect of accounting for the uncertainty due to the choice of parameterization. Bayes’ theorem addresses model selection with natural parsimony in terms of the normalizing constant of the theorem (the evidence). Evidence is challenging to compute but particularly powerful in addressing model selection in a quantitative manner.13 In particular, evidence is essential for model selection when addressing complicated posterior distributions, such as multi-modal distributions, as it is an integral property of the posterior probability density (PPD) rather than a point estimate. In effect, evidence is a measure of the change in multi-dimensional volume from the prior to the posterior. Complicated posterior distributions are particularly common for strongly non-linear problems, where multiple modes, separated by extensive low-likelihood regions, can occur. Computing Bayesian evidence involves solving a difficult integral, requiring effective sampling methods that can integrate over all parts of the parameter space according to their posterior mass. Annealed importance sampling14 (AIS) is a powerful but computationally intensive approach that uses an annealing heuristic to define an importance sampler that asymptotically converges to the PPD and gives an unbiased estimate of the evidence at convergence. Importance weights are computed along many cooling trajectories, and the end points of these trajectories in conjunction with the importance weights provide an estimate of the PPD. This paper illustrates Bayesian evidence computation using AIS with an application to the inversion of interface-wave dispersion data to infer the shear-wave velocity structure of seabed sediments. Inverting dispersion data for the shear-wave velocity structure in sediments is of significant interest to many communities, including geophysics/earthquake
hazards,15 seismology,16 and ocean acoustics.17–19 The data considered here were collected on the Scotian Shelf, off the east coast of Nova Scotia, Canada, using an ocean bottom seismometer and impulsive sources on the seabed. The data are inverted for shear-wave velocity-depth profiles; compressional sound velocity and density are also included in the inverse problem as unknowns but are not of primary interest. The class of models considered here is multi-layered sediment models defined by layer thickness, shear-wave velocity, compressional-wave velocity, and density. The shear-wave velocity in a layer can be homogeneous or can be described by linear or power-law gradients. Results are evaluated in terms of marginal shear-wave velocity profiles as well as one-dimensional (1-D) parameter marginal distributions. AIS parameter estimates are compared to Metropolis–Hastings sampling (MHS) results to ensure consistency.

II. BAYESIAN INFERENCE
This section gives a brief overview of a Bayesian formulation of inverse problems; more complete treatments can be found elsewhere.13,20–24 Let d ∈ R^N be a random variable of N observed data containing information about a physical system. Further, let I denote the model specifying a particular choice of physical theory, model parameterization, and error statistics to explain that physical system. Let m ∈ R^M be a random variable of the M free parameters representing a realization of model I. Bayes' rule can then be written as

P(\mathbf{m}|\mathbf{d}, I) = \frac{P(\mathbf{d}|\mathbf{m}, I)\, P(\mathbf{m}|I)}{P(\mathbf{d}|I)},    (1)
where the conditional probability P(m|d, I) represents the PPD of the unknown model parameters given the observed data, prior information, and choice of model I. The conditional probability P(d|m, I) describes the data error statistics. Since data errors include both measurement and theory errors (which cannot generally be separated), the specific form of this distribution is often not known. To interpret Eq. (1) quantitatively, some particular form that describes the data error statistics reasonably well must be assumed for this distribution. In practice, mathematically simple distributions, such as multivariate Gaussian distributions, are commonly used; the validity of such assumptions should be checked a posteriori using statistical tests.2,4 The general multivariate Gaussian distribution for real data is given by

P(\mathbf{d}|\mathbf{m}, I) = \frac{1}{(2\pi)^{N/2} |\mathbf{C}_d|^{1/2}} \exp\left[ -\frac{1}{2} \left( \mathbf{d} - \mathbf{d}(\mathbf{m}) \right)^{T} \mathbf{C}_d^{-1} \left( \mathbf{d} - \mathbf{d}(\mathbf{m}) \right) \right],    (2)

where C_d is the data covariance matrix and d(m) are the modeled data. The covariance matrix C_d is often unknown since the source of errors may be poorly understood. In some cases, data error statistics can be parameterized (e.g., as variances or as a covariance matrix based on an assumed form such as an auto-regressive moving average25) and included in the inversion, either implicitly26 or explicitly as unknown parameters with assigned priors.26–28 Data error covariance matrices can also be estimated non-parametrically from data residuals.2

In inverse theory, P(d|m, I) is interpreted as the likelihood function L(m, I) of m for fixed (observed) data d. Note that given a Gaussian data error distribution, the likelihood function is not Gaussian distributed for non-linear inverse problems.

The term P(m|I) in Eq. (1) gives the model prior distribution. In this paper, prior distributions are considered to be bounded, uniform distributions of the form

P(\mathbf{m}|I) = \begin{cases} \prod_{i=1}^{M(I)} \dfrac{1}{m_i^{+} - m_i^{-}}, & m_i^{-} \le m_i \le m_i^{+}, \; i = 1, \ldots, M(I) \\ 0, & \text{else}. \end{cases}    (3)
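As a concrete illustration of Eqs. (2) and (3), the following minimal Python sketch evaluates the Gaussian log-likelihood and the bounded uniform log-prior for a candidate parameter vector. The forward function `forward` and the prior bounds are hypothetical placeholders, not quantities from this paper.

```python
import numpy as np

def log_likelihood(m, d_obs, forward, Cd_inv, logdet_Cd):
    """Gaussian log-likelihood of Eq. (2) for fixed observed data d_obs."""
    r = d_obs - forward(m)                     # data residuals d - d(m)
    N = d_obs.size
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet_Cd + r @ Cd_inv @ r)

def log_prior(m, lower, upper):
    """Bounded uniform log-prior of Eq. (3); -inf outside the bounds."""
    if np.all((m >= lower) & (m <= upper)):
        return -np.sum(np.log(upper - lower))
    return -np.inf
```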
The conditional probability P(d|I) in Eq. (1) is commonly referred to as the normalizing constant, evidence, or marginal likelihood of I. It describes how likely a particular choice of model parameterization I is given the observed data and prior. Since the evidence P(d|I) normalizes Eq. (1), it can be written as

Z(I) = P(\mathbf{d}|I) = \int_{\mathcal{M}} P(\mathbf{d}|\mathbf{m}, I)\, P(\mathbf{m}|I)\, d\mathbf{m},    (4)

where the integration is over the full model space M.
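A brute-force way to approximate the integral in Eq. (4) is simple Monte Carlo over the prior: the evidence is the prior-averaged likelihood. The sketch below, using the hypothetical `log_likelihood` and bounds from the previous snippet, illustrates the idea (and why it becomes inefficient when the likelihood occupies only a tiny fraction of the prior volume).

```python
import numpy as np
from scipy.special import logsumexp

def log_evidence_prior_mc(d_obs, forward, Cd_inv, logdet_Cd, lower, upper,
                          n_samples=100000, rng=None):
    """Crude Monte Carlo estimate of log Z(I), Eq. (4): average the likelihood
    over samples drawn from the bounded uniform prior."""
    rng = np.random.default_rng() if rng is None else rng
    m_draws = rng.uniform(lower, upper, size=(n_samples, lower.size))
    log_L = np.array([log_likelihood(m, d_obs, forward, Cd_inv, logdet_Cd)
                      for m in m_draws])
    return logsumexp(log_L) - np.log(n_samples)   # log of the mean likelihood
```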
Solving Eq. (4) for non-linear geoacoustic inference problems is the focus of this paper and will be addressed in Sec. III.

III. MODEL SELECTION
The normalizing constant in Bayes’ theorem (evidence), Eq. (4), provides an intrinsic approach to model selection, constituting a natural parsimony criterion which is often referred to as the Bayesian razor. The concept of natural parsimony is illustrated in Fig. 1. Let d_obs be an observed data vector given as one point in the data space D. Further, let I_1 be a model parameterization (hypothesis) and I_2 be a model parameterization that is more complex than I_1. Since model I_1 has a limited number of parameters, it can only access a certain subspace D_1 of the total data space D. Since I_2 is a more complex model with more parameters, it can access a larger subspace D_2 which includes D_1. As long as d_obs falls into a region of the data space that can be accessed by both models, Bayes’ theorem favors the simpler model I_1.

FIG. 1. The concept of natural parsimony as given by Bayes’ theorem [after MacKay (Ref. 13)]. The simpler parameterization I_1 is favored as long as the observed data d_obs fall into D_1 (the part of the data space that can be accessed by I_1).

AIS and similar thermodynamic integration approaches are considered benchmark methods to estimate evidence. Other methods exist, including the recently proposed method of nested sampling29 which has been applied to a variety of problems, including one application to acoustics.9 Further, several methods to approximate evidence by predictive distributions and by analytical examination of asymptotic behavior can be found in the literature.30,31 Several other approaches discussed in the literature can suffer from large or even infinite variance of the evidence estimate.32 Model selection can also be addressed by treating the problem as trans-dimensional33 and applying reversible-jump Markov chains that can jump between parameter spaces of different dimensionality.12,28,34

Asymptotic point estimates, such as the Bayesian information criterion (BIC),35,36 are commonly used to carry out model selection.10,11 The BIC accounts for the number of data and is given by
\mathrm{BIC} = -2 \log_e \mathcal{L}(\hat{\mathbf{m}}) + M \log_e N.    (5)
Since the BIC is based on the negative log-likelihood, the model with the smallest BIC is selected as the preferred model. The value of the BIC cannot be directly associated with a probability and cannot yield the significance of the selection. Additionally, a general problem of point estimates is that they cannot properly represent complicated (non-Gaussian) PPDs, particularly PPDs with multiple modes which are a common occurrence for non-linear problems.
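To make the two selection criteria concrete, the short sketch below computes the BIC of Eq. (5) from a maximum-likelihood value and a log Bayes factor from two log-evidence estimates. The numbers in the usage comments are placeholders only, not results from this paper.

```python
import numpy as np

def bic(log_L_max, n_params, n_data):
    """Bayesian information criterion, Eq. (5): smaller values are preferred."""
    return -2.0 * log_L_max + n_params * np.log(n_data)

def log_bayes_factor(log_Z_a, log_Z_b):
    """log_e of the Bayes factor Z_a / Z_b; positive values favor model a."""
    return log_Z_a - log_Z_b

# Example with placeholder numbers:
# print(bic(log_L_max=12.0, n_params=10, n_data=20))
# print(log_bayes_factor(log_Z_a=-27.0, log_Z_b=-31.0))
```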
IV. ANNEALED IMPORTANCE SAMPLING

AIS was developed independently by Neal14 and Jarzynski.37,38 This paper follows Neal14 in explaining the algorithm, where an extensive derivation can be found. AIS yields an estimate for the PPD as well as for Z by combining importance sampling, Markov chain Monte Carlo (MCMC) sampling, and simulated annealing (SA). In the following paragraphs, a short overview of these methods is presented before AIS is addressed.

Importance sampling is a common method to estimate expectations of functions f(m) that are difficult to sample directly. Instead of drawing samples from f(m), a function g(m) is used that approximates f(m) and from which independent samples can be drawn in a straightforward manner. In addition, for each sample m^(i), an importance weight ω^(i) is computed to correct for the difference between the sampling distribution and the distribution of interest,
\omega^{(i)} = f(\mathbf{m}^{(i)}) / g(\mathbf{m}^{(i)}).    (6)
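A minimal importance-sampling sketch, assuming hypothetical target and proposal log-densities, shows how the weights of Eq. (6) correct expectations computed from samples of g(m):

```python
import numpy as np

def importance_expectation(h, log_f, log_g, sample_g, n_samples=10000, rng=None):
    """Estimate E_f[h(m)] using samples from g and weights w = f/g, Eq. (6)."""
    rng = np.random.default_rng() if rng is None else rng
    samples = [sample_g(rng) for _ in range(n_samples)]
    log_w = np.array([log_f(m) - log_g(m) for m in samples])
    w = np.exp(log_w - log_w.max())            # stabilize before normalizing
    w /= w.sum()
    return sum(wi * h(m) for wi, m in zip(w, samples))
```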
Importance sampling faces the substantial challenge that effective sampling distributions are difficult to formulate for general non-linear problems with high dimensionality, rendering the method prohibitively expensive for many applications. MCMC methods are widely and successfully applied to inference problems and accumulate a series of dependent samples during a random walk. For example, the
Metropolis–Hastings algorithm39,40 is commonly used to carry out the sampling, based on probabilistically accepting a perturbation to the state variables m according to (assuming a symmetric proposal distribution)

\xi \le \exp\left\{ \left[ \log_e P(\mathbf{m}_{\mathrm{new}}) - \log_e P(\mathbf{m}_{\mathrm{old}}) \right] + \beta \left[ \log_e \mathcal{L}(\mathbf{m}_{\mathrm{new}}) - \log_e \mathcal{L}(\mathbf{m}_{\mathrm{old}}) \right] \right\},    (7)

where ξ is a uniform random variable over [0, 1] and β = 1.
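The acceptance rule of Eq. (7) translates directly into a log-space Metropolis–Hastings update. The sketch below assumes a symmetric proposal and the hypothetical log-prior and log-likelihood helpers from the earlier snippets.

```python
import numpy as np

def mh_step(m, log_prior, log_like, propose, beta=1.0, rng=None):
    """One Metropolis-Hastings update following the log-space rule of Eq. (7)."""
    rng = np.random.default_rng() if rng is None else rng
    m_new = propose(m, rng)                              # symmetric proposal
    lp_new = log_prior(m_new)
    if not np.isfinite(lp_new):                          # outside prior bounds: reject
        return m
    log_ratio = (lp_new - log_prior(m)) + beta * (log_like(m_new) - log_like(m))
    if np.log(rng.uniform()) <= log_ratio:               # accept with prob min(1, ratio)
        return m_new
    return m
```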
Challenges arise due to the requirement of detailed balance13 and proper mixing of the Markov chain. MCMC methods generally require relatively small changes to the state variables to ensure a reasonable acceptance rate to move to a new state. In addition, step size cannot be adapted or changed during the run of a chain as this violates detailed balance. Due to the dependency of samples, convergence of MCMC algorithms is often difficult to assess and the autocorrelation length within a chain can be large. In addition, proper mixing of the chain requires that all modes of the PPD are sampled according to their posterior mass. If multiple modes exist which are separated by extensive low-likelihood regions, proper mixing requires large perturbations to jump between modes. However, this generally results in very low acceptance rates. Poor mixing of the chain can result in prohibitive computational expense for some problems, since the algorithm can get “stuck” in local modes for extended periods of time. Poor mixing is fairly straightforward to observe for uni-modal distributions but can go unobserved for multimodal problems if the chain gets stuck in a mode without finding other important posterior modes.

SA (Ref. 41) was developed as a heuristic for optimization problems with many posterior modes. SA defines a sequence of probabilities P_j(\mathbf{m}) \propto P_0(\mathbf{m})^{\beta_j}, where β_j ∈ [0, 1] is often referred to as the inverse temperature. During sampling, the algorithm “anneals” by moving through the sequence of distributions, increasing β from 0 to 1. Acceptance of moving from one state to the next is decided according to Eq. (7). At low β values, the SA algorithm can move freely between isolated modes, but becomes increasingly confined with increasing β. However, SA cannot be used for Bayesian inference since the algorithm does not provide an unbiased sample of the PPD. The algorithm converges to the highest probability region (assuming an appropriately large number of intermediate distributions) but does not sample all modes according to their posterior mass.

AIS combines the above approaches to assign importance weights to many SA runs (Fig. 2). As the number of runs increases, the end points of the annealing trajectories weighted by importance weights provide an unbiased sample of the PPD as well as an estimate of the normalizing constant Z. Initially, an annealing schedule is defined through a sequence of distributions that start at the prior and end at the posterior, such as

f_l(\mathbf{m}) = \left[ \mathcal{L}(\mathbf{m}) P(\mathbf{m}) \right]^{\beta_l} P(\mathbf{m})^{1-\beta_l},    (8)
where the algorithm proceeds from β_L to β_0 with 0 = β_L < β_{L-1} < ⋯ < β_0 = 1. Defining the sequence of distributions as in Eq. (8) is advantageous since sampling
starts at the prior and common priors are often trivial to sample from independently. Drawing independent samples from the prior results in independent samples from the PPD which makes convergence tests simpler.

In addition to an annealing sequence [Eq. (8)], an MCMC operator is needed that leaves each intermediate distribution in the sequence invariant. A cooling trajectory is then initiated by drawing an initial point m_{L-1} from the prior. Moving along the trajectory from m_{L-1} to m_{L-2} is achieved with an appropriate MCMC operator T_{L-1}, and so on, until the last move from m_1 to m_0 is performed with MCMC operator T_1. Point m_0 then becomes one sample m^(i) of the PPD estimate. The MCMC operator can be freely chosen (e.g., MHS, Gibbs sampling, slice sampling42) according to the problem and can also consist of several MCMC steps (e.g., applying a series of perturbations with different step sizes or applying sweeps over the parameters). The associated weight ω^(i) for point m^(i) is computed along the trajectory as

\omega^{(i)} = \prod_{l=0}^{L-1} \frac{f_l(\mathbf{m}_l)}{f_{l+1}(\mathbf{m}_l)}.    (9)
Hence, this scheme uses the sequence of distributions moving from the prior to the posterior to define an importance sampler. Parameter estimates and other moments of the PPD (such as marginal distributions) can then be estimated from the final points m^(i) of all trajectories weighted by their associated importance weight ω^(i). The mean of the importance weights converges to the normalizing constant42 for a large number of trajectories K,

Z = \lim_{K \to \infty} \frac{1}{K} \sum_{i=1}^{K} \omega^{(i)}.    (10)
For computer implementation, it is generally useful to work with log-likelihoods and log-weights to avoid floating-point underflow due to the exponential dependence. The multidimensional PPD sample is generally interpreted in terms of its moments such as the maximum a posteriori (MAP) model vector estimate m̂ and marginal probability distributions P(m_i|d).

The β sequence used here consists of three segments, with β being geometrically spaced in each segment as shown in Fig. 3.

FIG. 3. The β sequence used for AIS shown as log_e(β) for illustrative purposes.

The segments are chosen empirically so that the algorithm spends more time close to β = 1, where the updated point is expected to be in high-likelihood regions of the parameter space. In general, the number of distributions L in the annealing process (schedule) and the number of β segments can be chosen depending on the problem. The more complicated the posterior, the more distributions are needed. Hence, the number increases with the number of parameters, with non-linearity, and with strong and complicated parameter correlations. The MCMC operators used here involve several updates at each intermediate distribution in the sequence, using Metropolis–Hastings updates according to Eq. (7) with perturbations drawn from a Cauchy proposal to allow for occasional large jumps (due to heavier tails compared to a Gaussian proposal). The perturbations are carried out for principal components of the parameter space to address correlated parameters. To obtain the principal component decomposition, parameter covariances are estimated from MHS at β = 1 prior to the AIS run. The Cauchy proposal is scaled by the eigenvalues of the eigenvalue decomposition of the parameter covariance matrix and perturbation sizes are scaled according to the β sequence. The order of principal components in which perturbations are applied is randomized, and for each β_l, several perturbations (empirically chosen depending on the number of parameters) are applied.

Convergence of the algorithm is assessed by monitoring the convergence of the mean of the estimate of Z. Since Z is estimated by the mean of the importance weights, the algorithm is considered to have converged once this mean value does not change significantly over some pre-determined number of trajectories. As with most Monte Carlo approaches, there is no guaranteed method of determining convergence. For example, a new mode with high probability mass may be found at any time during the sampling which could significantly change the Z estimate. The advantage of AIS is that the annealing heuristic effectively addresses multiple modes that may be separated by extensive low-likelihood regions. In general, the more complicated the inference problem, the more trajectories and the larger the number of distributions in the annealing sequence should be used to ensure proper search of the space.
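The following sketch condenses the AIS procedure described above, under simplifying assumptions: a single Metropolis–Hastings update per intermediate distribution (rather than the several principal-component Cauchy perturbations used in the paper) and a single geometric β segment. It accumulates the log-weights of Eq. (9) and estimates log_e Z from Eq. (10) with a log-sum-exp to avoid the underflow mentioned above. All function arguments are the hypothetical helpers defined in the earlier snippets.

```python
import numpy as np
from scipy.special import logsumexp

def ais(log_prior, log_like, sample_prior, propose, n_traj=1000, n_temps=1000, rng=None):
    """Annealed importance sampling: returns posterior samples, log-weights, log Z."""
    rng = np.random.default_rng() if rng is None else rng
    betas = np.concatenate(([0.0], np.geomspace(1e-6, 1.0, n_temps - 1)))  # prior -> posterior
    samples, log_w = [], []
    for _ in range(n_traj):
        m = sample_prior(rng)                    # independent start from the prior
        lw = 0.0
        for b_prev, b in zip(betas[:-1], betas[1:]):
            # log-weight increment f_b(m) / f_{b_prev}(m), cf. Eq. (9)
            lw += (b - b_prev) * log_like(m)
            # one MCMC update that leaves the tempered distribution invariant
            m = mh_step(m, log_prior, log_like, propose, beta=b, rng=rng)
        samples.append(m)
        log_w.append(lw)
    log_w = np.array(log_w)
    log_Z = logsumexp(log_w) - np.log(n_traj)    # Eq. (10) evaluated in log space
    return np.array(samples), log_w, log_Z
```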
FIG. 2. Schematic of the AIS algorithm. Many independent annealing trajectories converge to the posterior P(m|d) when properly computed weights are assigned to each trajectory.
A significant advantage of AIS is that the algorithm scales linearly with the number of processors. Individual annealing trajectories are entirely independent of each other so that the algorithm can be implemented in an auto load-balancing fashion,43 which allows for linear scaling on massively parallel machines to arbitrarily large numbers of central processing units (examples in this paper were run on 170 cores of a parallel computer, speeding up the algorithm close to 170 times).
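Because trajectories are independent, they map directly onto a worker pool. A minimal sketch (the trajectory body is a stand-in placeholder, not the paper's MPI implementation):

```python
from multiprocessing import Pool
import numpy as np

def one_trajectory(seed):
    """Placeholder for a single AIS trajectory: returns (end point, log-weight).
    In practice this would perform the annealing loop of the previous sketch."""
    rng = np.random.default_rng(seed)
    m = rng.uniform(0.0, 1.0, size=3)   # stand-in for a prior draw
    log_w = float(rng.normal())         # stand-in for the accumulated log-weight
    return m, log_w

if __name__ == "__main__":
    with Pool() as pool:                # one worker process per available core
        results = pool.map(one_trajectory, range(10_000))
```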
V. INVERSION EXAMPLE

A. Experiment and data

The experiment17 considered in this work was carried out in 1996 by Defence Research and Development Canada on the Scotian Shelf (Emerald Basin, 44° N 63° 10′ W), NS, Canada. An ocean bottom seismometer was deployed in 219 m of water and impulsive sound sources on the seabed were used at 100 m range to excite seafloor-interface (Scholte) waves. The recorded time series were processed to obtain time–frequency–intensity decompositions using the S transform.44 The data used here (Fig. 4) were obtained by picking the maximum energy in the time–frequency–intensity plot to provide group velocity as a function of frequency (i.e., a dispersion curve). A detailed description of the experiment and data processing can be found in Osler and Chapman.17

FIG. 4. Measured dispersion curve data (crosses) and best fit (dashed line) for the Emerald Basin site.

B. Model selection

Model selection was carried out by computing Z for five different model parameterizations. Layered models were considered with one, two, and three homogeneous layers over a sediment half-space. Each homogeneous layer consists of the four parameters layer thickness h, shear-wave velocity v_s, compressional-wave velocity v_p, and density ρ (e.g., this results in 16 unknowns for the three-layer model). In addition, two other models were considered, one with a linear shear-wave velocity gradient layer over a half-space and another with a power-law shear-wave velocity gradient over a half-space. Densities and compressional velocities are treated as homogeneous within layers containing shear-wave velocity gradients.

The forward model used to compute replica dispersion curves from the environmental model is Wathelet's gpdc,15 which uses discrete, homogeneous layers as input. Therefore, models with linear and power-law gradients in shear-wave velocity are represented as discrete layered models. In all cases, 20 sub-layers are used to represent gradients, a number sufficiently large to ensure negligible theory error due to this choice. In the case of the linear gradient, the sub-layers are uniformly spaced over the layer thickness and the shear-wave velocity is given by
t_s(z) = t_s(z_0) + \frac{t_s(z_0 + h) - t_s(z_0)}{h} (z - z_0),    (11)

where t_s(z_0) is the velocity at the top of the layer (depth z_0), h is the layer thickness, and t_s(h) is the velocity at the bottom of the layer. Hence, the shear-wave velocity profile for a linear gradient is represented by three parameters. The power-law sub-layers are spaced logarithmically with depth and the shear-wave velocity–depth dependence is given by

t_s(z) = t_s(z_0) + a (z - z_0)^{b},    (12)
where

a = t_s(z_0 + 1) - t_s(z_0) \quad \text{and} \quad b = \frac{\log_e\left[ t_s(z_0 + h) - t_s(z_0) \right] - \log_e\left[ t_s(z_0 + 1) - t_s(z_0) \right]}{\log_e h},    (13)

for h > 1. The power-law gradient within a layer is therefore given by four parameters: layer thickness h and shear-wave velocities at z_0, z_0 + 1, and h.
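To illustrate how such gradient layers are turned into the stacks of homogeneous sub-layers required by the forward model, the sketch below builds 20-sub-layer approximations of the linear profile of Eq. (11) and the power-law profile of Eqs. (12) and (13). The exact sub-layer spacing and the form of Eq. (13) assumed here are reconstructions for illustration, not code from the paper; vs_top, vs_1m, and vs_bottom denote the shear-wave velocities at the layer top, 1 m below the top, and the layer bottom.

```python
import numpy as np

def linear_gradient_sublayers(vs_top, vs_bottom, h, n_sub=20):
    """Uniformly spaced sub-layers for the linear profile of Eq. (11)."""
    z = np.linspace(0.0, h, n_sub + 1)                 # interfaces within the layer
    z_mid = 0.5 * (z[:-1] + z[1:])                     # evaluate at sub-layer centers
    vs = vs_top + (vs_bottom - vs_top) / h * z_mid
    return np.diff(z), vs                              # (thicknesses, velocities)

def power_law_sublayers(vs_top, vs_1m, vs_bottom, h, n_sub=20):
    """Logarithmically spaced sub-layers for the power-law profile of Eqs. (12)-(13)."""
    a = vs_1m - vs_top                                 # velocity increase over the first metre
    b = (np.log(vs_bottom - vs_top) - np.log(vs_1m - vs_top)) / np.log(h)
    z = np.geomspace(1e-2, h, n_sub + 1)               # log spacing with depth (assumed)
    z_mid = 0.5 * (z[:-1] + z[1:])
    vs = vs_top + a * z_mid**b
    return np.diff(z), vs
```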
Data errors are assumed to be uncorrelated across frequency and Gaussian distributed. The data errors are addressed by a parametric approach where the standard deviation (constant across frequency) is included as an unknown parameter in the model. This adds one dimension to the parameter search space for each model considered. AIS was typically run with a sequence of 10^5 intermediate distributions during the annealing from the prior to the posterior. A complete AIS run includes 10^4 trajectories resulting in the same number of samples and weights to estimate the PPD. Uniform priors were chosen wide to be largely uninformative. Densities were constrained to within 1.3 and 2.2 g/cm^3, compressional-wave velocities to within 1450 and 2200 m/s, and shear-wave velocities to within 0 and 400 m/s. Layer thicknesses were constrained to within 1 and 80 m for the gradient cases and to within 1 and 60 m for the layered models.

Figure 5 shows a histogram of the importance weights from AIS for the power-law gradient model. For the importance sampling to be effective, the variance of the importance weights should be relatively small to ensure that the evidence and parameter estimates are based on many effective points. Figure 5 also shows the mean of the importance weights throughout the AIS run. The mean of the importance weights converges to the evidence Z, and the run is considered to have converged when the log_e(Z) estimate changes by less than 10^{-3} for 100 new trajectories.

FIG. 5. Importance weights (a) and convergence (b) of an AIS run for an environmental model including a power-law shear-wave velocity gradient in a layer overlying a sediment half-space.

Figure 6 shows the result of the model selection study, with the models sorted according to ascending number of parameters. The numerical values are given in Table I. The four panels show the log-evidence log_e(Z) [Fig. 6(a)], the log-likelihood log_e(L) for the MAP parameter estimate [Fig. 6(b)], the BIC value evaluated for the MAP parameter estimate [Fig. 6(c)], and the number of parameters [Fig. 6(d)]. The model with the largest evidence, log_e(Z) = −26.96, is the shear-wave velocity power-law gradient model that is characterized by ten parameters. One of the advantages of computing the evidence is that quantitative conclusions can be drawn in terms of support for a model. The Bayes factor (BF, the ratio of evidences for two models) for the power-law and the linear models is log_e(BF) = 3.5, which indicates very strong45 evidence in favor of the power-law model over the linear model. The three-layer model has the third highest value in evidence and the BF of log_e(BF) = 7.97 indicates decisive evidence against the three-layer model. Note that the likelihood of the model does not necessarily have to increase with additional parameters if the models are not nested. In this case, the power-law and linear-gradient models are not subsets of the other models. However, the likelihood for layered models always increases with added layers since these are nested. Figure 6(c) shows the BIC values for this study. Note that the BIC prefers the linear-gradient model over the power-law model but the significance of the preference cannot be evaluated from the BIC values.

FIG. 6. (Color online) Model selection study. The evidence study selects the power-law model [marked by a circle in panel (a)].

TABLE I. Numerical values for the evidence study shown in Fig. 6.

Model         log_e(Z)     log_e(L)     BIC        No. parameters
One-layer     -50.9715     -11.9720     47.9099     8
Linear        -30.9643      15.0830     -3.2045     9
Power-law     -26.9645      12.0884      5.7804    10
Two-layer     -45.3162       0.4989     34.9509    12
Three-layer   -38.9380      14.8981     18.1355    16

C. Posterior parameter inference

This section considers information that can be inferred from the PPD in terms of parameter estimates and associated uncertainties which are of interest in practical applications. Here, we consider 1-D marginal probability distributions from AIS and compare the results to marginal distributions from MHS. The MHS sampling was carried out according to Eq. (7). The MCMC operator used for the MHS is similar to the one used in AIS, using a Cauchy proposal in principal components. The sampling was carried out at several β-values to ensure appropriate sampling of the multi-modal distributions. The results shown here were sampled with β = 0.5. Figure 7 shows marginals for the power-law gradient model and indicates that the AIS and MHS results are in excellent agreement with virtually identical results. The main difference is due to the sample size which is significantly larger for the MHS sample, causing a smoother look. Note, however, that the large sample size in MHS can be misleading since the samples are dependent. The effective (independent) sample size is often not obvious and can be much smaller, even when chain thinning is applied (chain thinning by a factor of 10 was applied here). Figure 7 also shows excellent agreement of the 1-D marginals for the standard deviation σ of the data errors which was included as a parameter in the problem.
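As an illustration of how such 1-D marginals can be formed from the two kinds of output, the sketch below builds an importance-weighted histogram from AIS trajectory end points and an unweighted histogram from (thinned) MHS samples; the binning and thinning factor are illustrative choices only.

```python
import numpy as np

def weighted_marginal(samples, log_w, index, bins=50):
    """1-D marginal for parameter `index` from AIS end points and log-weights."""
    w = np.exp(log_w - np.max(log_w))           # normalize weights in log space first
    w /= w.sum()
    hist, edges = np.histogram(samples[:, index], bins=bins, weights=w, density=True)
    return hist, edges

def mhs_marginal(chain, index, thin=10, bins=50):
    """1-D marginal from a Metropolis-Hastings chain, thinned to reduce dependence."""
    hist, edges = np.histogram(chain[::thin, index], bins=bins, density=True)
    return hist, edges
```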
FIG. 7. (Color online) Comparison of 1-D marginal distributions from AIS (solid line) and MHS (shaded) outputs for the power-law gradient model. The top row represents the power-law layer parameters while the bottom row represents the half-space parameters.
One challenge of model selection studies with point estimates based on a maximum likelihood model, such as the BIC, is multiple modes in the posterior. In this case, point estimates may misrepresent the distribution of posterior mass. Since model selection studies commonly consider model parameterizations of complexity varying from severely under-parameterized to grossly over-parameterized, multiple modes are a common occurrence in many problems. Figure 8 clearly shows that the 1-D marginals of the one-layer model considered in this study have multiple modes in layer thickness and shear-wave velocity. The multiple modes in Fig. 8 are not separated by extensive low-likelihood regions; however, this can occur in some problems, and accounting for the probability mass of such modes is important for model selection. Some of the advantages of AIS are that it collects independent samples and the annealing heuristic is capable of finding isolated modes in complex search spaces.

FIG. 8. (Color online) Comparison of 1-D marginal distributions from AIS (solid line) and MHS (shaded) outputs for the one-layer model showing multiple modes.

In addition to 1-D marginal distributions for the model parameters, it is useful to consider shear-wave velocity marginal-profile distributions that marginalize the PPD to show velocity uncertainty as a function of depth. Figures 9–11 show shear-wave velocity marginal profiles for the three most likely models.
The three-layer model in Fig. 9 shows a severely multi-modal posterior, often observed for over-parameterized models.
FIG. 9. (Color online) Shear-wave velocity profile marginal distributions for the three-layer model.
The marginal-profile distribution for the linear-gradient model in Fig. 10 shows a layer thickness of approximately 10 m. The shear-wave velocity in this layer is low, increasing from 2 to 80 m/s. The sensitivity of the data decreases with depth but it is generally not straightforward to quantify this decrease since the probability mass as a function of depth is closely tied to the model parameterization. The uncertainty indicated for the basement shear-wave velocity is fairly peaked around 50 m/s. This peak likely underestimates the uncertainty and is an effect of the parameterization not accessing a large enough subspace of the data space.
The marginal-profile distribution for the model selected by evidence in this study, the power-law gradient parameterization, is shown in Fig. 11. The layer thickness in this case has a much larger uncertainty of 10–40 m (Fig. 7). However, both gradient parameterizations show very close agreement over the upper 10 m. Osler and Chapman17 quote a layered structure for this site determined from a nearby large-diameter core and seismic reflection surveys which indicate a 9 m layer of clay overlying a 29 m layer of silt. It appears likely that the linear-gradient model only captures the first clay layer which shows a strong shear-wave velocity gradient. Below this, the data are likely relatively insensitive to the environment, but the inversion supports a relatively low velocity in the half-space.
FIG. 10. (Color online) Shear-wave velocity profile marginal distributions for the linear-gradient model.
FIG. 11. (Color online) Shear-wave velocity profile marginal distributions for the power-law model.
In the case of the power-law model, the results show that the parameterization captures more structure, reaching to within the silt layer. The largest layer thickness indicated by the marginal distribution in Fig. 7 approximates the combination of the clay and silt layers found in the core and seismic surveys, while the smallest thickness in the marginal distribution approximates that of the clay layer. However, the data do not appear to be sensitive to the lower interface of the silt layer since the layer thickness uncertainty spreads over the entire layer with no increase in probability near either 9 or 38 m (layer boundaries as indicated by the core and seismics). This appears to be a sensible result, given the low shear-wave velocities in the uppermost part of the sediment causing limited bottom penetration of the interface waves and the integrating nature of interface-wave dispersion data. The actual shear-wave velocity structure in the sediment likely involves a strong gradient in the uppermost clay layer and a weaker gradient in the silt layer. The dispersion data are most sensitive to the shear-wave velocity structure in the uppermost layer. Hence, the linear gradient is strongly constrained by the structure in the clay layer, and since this parameterization cannot accommodate a gradient decrease in the silt layer, the structure in the silt is not represented. The power-law shape, however, can capture structure similar to the linear gradient in the clay layer, but, in addition, has a limited ability to decrease the gradient in the silt layer. Therefore, the very strong support for the power-law model from the data when compared to the linear gradient appears plausible. However, the high shear-wave velocity toward the bottom of the silt layer (Fig. 11) likely results from gradient curvature being largely constrained by the clay layer, with limited sensitivity to the silt layer. The over-estimate of the velocity at the base of the silt layer is then compensated for by a slightly increased probability of a lower-velocity half-space (although data sensitivity at this depth is very low).
The assumption for using Eq. (2) in this work is that the data errors represent a Gaussian random process uncorrelated across frequency, which allows for parameterizing the data errors by a single unknown standard deviation. Substantial violations of these assumptions can result in an unreliable PPD estimate. Challenges are that the number of data (N = 20) is low and that the data are non-uniformly spaced. Figure 12 shows the autocorrelation function for the non-uniform data, which was computed by binning the sorted lags with a sliding window of increasing size from 0.2 to 0.8 Hz. The center peak falls off to 0.1 Hz. This indicates that the assumption of uncorrelated data errors is reasonable. In addition, the Shapiro–Wilk test46 was applied to the data residuals, which examines the null hypothesis that the residuals (of unknown mean and variance) originated from a Gaussian distribution. The test did not reject the null hypothesis at the 0.05 level of significance, indicating no significant evidence against the null hypothesis.

FIG. 12. Autocorrelation function of the data residuals.
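The two residual checks described above can be sketched as follows, using scipy's Shapiro–Wilk implementation; the fixed-width lag bins are a simplification of the sliding-window binning described in the text and are an assumption of this sketch.

```python
import numpy as np
from scipy.stats import shapiro

def binned_autocorrelation(freqs, residuals, max_lag=2.0, n_bins=20):
    """Approximate autocorrelation of non-uniformly sampled residuals by
    averaging products of residual pairs within bins of sorted lags."""
    r = residuals - residuals.mean()
    i, j = np.triu_indices(len(r), k=1)
    lags = np.abs(freqs[i] - freqs[j])
    prods = r[i] * r[j]
    edges = np.linspace(0.0, max_lag, n_bins + 1)
    acf = np.full(n_bins, np.nan)
    for k in range(n_bins):
        sel = (lags >= edges[k]) & (lags < edges[k + 1])
        if np.any(sel):
            acf[k] = prods[sel].mean() / r.var()
    return 0.5 * (edges[:-1] + edges[1:]), acf

def gaussianity_check(residuals, alpha=0.05):
    """Shapiro-Wilk test of the null hypothesis that residuals are Gaussian."""
    stat, p_value = shapiro(residuals)
    return p_value, p_value >= alpha            # True: no evidence against normality
```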
VI. SUMMARY

This paper applied AIS to a strongly non-linear geoacoustic inverse problem. Rigorous evidence-based model selection was carried out using interface-wave dispersion data to recover the shear-wave velocity structure and associated uncertainty of the seabed. The model selection study identified a power-law parameterization of the shear-wave velocity as the parameterization with the strongest support from the data. The BF between the power-law model and the next most likely model (a linear-gradient model) indicates very strong support in favor of the power-law model. Independent analysis of the same data17 also resulted in a power-law shear-wave velocity profile; however, the analysis in this work focused on rigorous, quantitative parameter and uncertainty estimates for all seabed parameters and considered quantitative selection of the best model parameterization. The Bayesian analysis showed strong multi-modal posteriors for some parameterizations which are difficult to address with point estimates such as the BIC. To ensure stable results from the AIS algorithm, comparisons with MHS results (in terms of 1-D marginal probability distributions) showed excellent agreement. Data errors were addressed by including the standard deviation as an unknown in the problem, assuming an uncorrelated Gaussian distribution. The assumptions were examined a posteriori to provide confidence in the results.
ACKNOWLEDGMENTS
The authors gratefully acknowledge the support of the Office of Naval Research (code 322OA, Grant No. N00014-09-1-0394). The computational work was carried out on a parallel high-performance computing cluster operated by the authors at the University of Victoria and funded by the Natural Sciences and Engineering Research Council of Canada and the Office of Naval Research. The authors would like to thank Dr. Mark Wathelet [Laboratoire de Géophysique Interne et Tectonophysique (LGIT), France] for consultation on the forward modeling software and two anonymous reviewers for improving the manuscript.
S. E. Dosso, “Quantifying uncertainty in geoacoustic inversion. I. A fast Gibbs sampler approach,” J. Acoust. Soc. Am. 111, 129–142 (2002). 2 C. W. Holland, J. Dettmer, and S. E. Dosso, “Remote sensing of sediment density and velocity gradients in the transition layer,” J. Acoust. Soc. Am. 118, 163–177 (2005). 3 Y. Jiang, N. R. Chapman, and H. A. DeFerrari, “Geoacoustic inversion of broadband data by matched beam processing,” J. Acoust. Soc. Am. 119, 3707–3716 (2006). 4 J. Dettmer, S. E. Dosso, and C. W. Holland, “Joint time/frequency-domain inversion of reflection data for seabed geoacoustic profiles,” J. Acoust. Soc. Am. 123, 1306–1317 (2008). 5 D. Tollefsen and S. E. Dosso, “Bayesian geoacoustic inversion of ship noise on a horizontal array,” J. Acoust. Soc. Am. 124, 788–795 (2008). 6 C.-F. Huang, P. Gerstoft, and W. S. Hodgkiss, “Effect of ocean sound speed uncertainty on matched-field geoacoustic inversion,” J. Acoust. Soc. Am. 123, EL162–EL168 (2008). 7 C. Yardim, P. Gerstoft, and W. S. Hodgkiss, “Tracking of geoacoustic parameters using Kalman and particle filters,” J. Acoust. Soc. Am. 125, 746–760 (2009). 8 D. J. Battle, P. Gerstoft, W. S. Hodgkiss, W. A. Kuperman, and P. L. Nielsen, “Bayesian model selection applied to self-noise geoacoustic inversion,” J. Acoust. Soc. Am. 116, 2043–2056 (2004). 9 T. Jasa and N. Xiang, “Using nested sampling in the analysis of multi-rate sound energy decay in acoustically coupled rooms,” AIP Conf. Proc. 803, 189–196 (2005). 10 J. Dettmer, S. E. Dosso, and C. W. Holland, “Model selection and Bayesian inference for high resolution seabed reflection inversion,” J. Acoust. Soc. Am. 125, 706–716 (2009). 11 J. Dettmer, C. W. Holland, and S. E. Dosso, “Analyzing lateral seabed variability with Bayesian inference of seabed reflection inversions,” J. Acoust. Soc. Am. 126, 56–69 (2009). 12 J. Dettmer, S. E. Dosso, and C. W. Holland, “Trans-dimensional geoacoustic inversion,” J. Acoust. Soc. Am. 128, 3393–3405 (2010). Dettmer et al.: Bayesian evidence computation
D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms (Cambridge University Press, Cambridge, 2003), pp. 343–386. 14 R. M. Neal, “Annealed importance sampling,” Stat. Comput. 11, 125–139 (2001). 15 M. Wathelet, “An improved neighborhood algorithm: Parameter conditions and dynamic scaling,” Geophys. Res. Lett. 35, 543–559 (2008). 16 X. N. Nguyen, T. Dahm, and I. Grevemeyer, “Inversion of Scholte wave dispersion and waveform modeling for shallow structure of the Ninetyeast Ridge,” J. Seismol. 13, 543–559 (2009). 17 J. C. Osler and D. M. F. Chapman, “Seismo-acoustic determination of the shear-wave speed of surficial clay and silt sediments on the Scotian Shelf,” Can. Acoust. 24, 11–22 (1996). 18 S. E. Dosso and G. H. Brooke, “Measurement of seismo-acoustic oceanbottom properties in the high Arctic,” J. Acoust. Soc. Am. 98, 1657–1666 (1995). 19 O. A. Godin and D. M. F. Chapman, “Dispersion of interface waves in sediments with power-law shear speed profiles. I. Exact and approximate analytical results,” J. Acoust. Soc. Am. 110, 1890–1907 (2001). 20 A. F. M. Smith, “Bayesian computational methods,” Philos. Trans. R. Soc. London 337, 369–386 (1991). 21 A. F. M. Smith and G. O. Roberts, “Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods,” J. R. Stat. Soc. 55, 3–23 (1993). 22 W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors, Markov Chain Monte Carlo In Practice, Interdisciplinary Statistics (Chapman & Hall/ CRC, Boca Raton, FL, 1996), pp. 273–304. 23 M. Sambridge and K. Mosegaard, “Monte Carlo methods in geophysical inverse problems,” Rev. Geophys. 40, 3–1–3–29 (2002). 24 A. Tarantola, Inverse Problem Theory and Methods for Model Parameter Estimation (SIAM, Philadelphia, 2005), pp. 1–57. 25 D. C. Montgomery and E. A. Peck, Introduction to Linear Regression Analysis (Wiley, New York, 1992), pp. 115–136. 26 S. E. Dosso and M. J. Wilmut, “Data uncertainty estimation in matchedfield geoacoustic inversion,” IEEE J. Ocean. Eng. 31, 470–479 (2005). 27 A. Malinverno and V. A. Briggs, “Expanded uncertainty quantification in inverse problems: Hierarchichal Bayes and empirical Bayes,” Geophysics 69, 1005–1016 (2004). 28 M. Sambridge, K. Gallagher, A. Jackson, and P. Rickwood, “Trans-dimensional inverse problems, model comparison and the evidence,” Geophys. J. Int. 167, 528–542 (2006).
29
J. Skilling, Bayesian Statistics 8 (Oxford University Press, Oxford, 2007), pp. 491–524. 30 A. E. Gelfand and D. K. Dey, “Bayesian model choice: Asymptotics and exact calculations,” J. R. Stat. Soc. 56, 501–514 (1994). 31 A. E. Gelfand, D. K. Dey, and H. Chang, Bayesian Statistics 4 (Oxford University Press, Oxford, 1992), pp. 147–167. 32 S. Chib, “Marginal likelihood from the Gibbs output,” J. Am. Stat. Assoc. 90, 1313–1321 (1995). 33 P. J. Green, “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination,” Biometrika 82, 711–732 (1995). 34 A. Malinverno and W. S. Leaney, “Monte-Carlo Bayesian look-ahead inversion of walkaway vertical seismic profiles,” Geophys. Prospect. 53, 689–703 (2005). 35 G. Schwartz, “Estimating the dimension of a model,” Ann. Stat. 6, 461– 464 (1978). 36 R. E. Kass and A. E. Raftery, “Bayes factors,” J. Am. Stat. Assoc. 90, 773–795 (1995). 37 C. Jarzynski, “Nonequilibrium equality for free energy differences,” Phys. Rev. Lett. 78, 2690–2693 (1997). 38 C. Jarzynski, “Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach,” Phys. Rev. E 56, 5018–5035 (1997). 39 N. Metropolis, A. Rosenbluth, M. Rosenbluth, and A. T. A. E. Teller, “Equations of state calculations by fast computing machines,” J. Chem. Phys. 21, 1087–1092 (1953). 40 W. K. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika 57, 97–109 (1970). 41 S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science 220, 671–679 (1983). 42 R. M. Neal, “Slice sampling,” Ann. Stat. 31, 705–741 (2003). 43 W. Gropp, E. Lusk, and A. Skjellum, Using MPI, Portable Parallel Programming With the Message-Passing Interface (MIT Press, Cambridge, 1999), pp. 171–185. 44 R. G. Stockwell, L. Mansinha, and R. P. Lowe, “Localization of the complex spectrum: The s transform,” IEEE Signal Process. 44, 998–1001 (1996). 45 H. Jeffreys, Theory of Probability (Cambridge University Press, Oxford, 1961), pp. 53–54. 46 S. S. Shapiro and M. B. Wilk, “An analysis of variance test for normality (complete samples),” Biometrika 52, 591–611 (1965).