Efficient MCMC for Binomial Logit Models
AGNES FUSSL, Johannes Kepler University
SYLVIA FRÜHWIRTH-SCHNATTER, University of Economics and Business
RUDOLF FRÜHWIRTH, Austrian Academy of Sciences
This article deals with binomial logit models where the parameters are estimated within a Bayesian framework. Such models arise, for instance, when repeated measurements are available for identical covariate patterns. To perform MCMC sampling, we rewrite the binomial logit model as an augmented model which involves some latent variables called random utilities. It is straightforward, but inefficient, to use the individual random utility model representation based on the binary observations reconstructed from each binomial observation. Alternatively, we present in this article a new method to aggregate the random utilities for each binomial observation. Based on this aggregated representation, we have implemented an independence Metropolis-Hastings sampler, an auxiliary mixture sampler, and a novel hybrid auxiliary mixture sampler. A comparative study on five binomial datasets shows that the new aggregation method leads to a superior sampler in terms of efficiency compared to previously published data augmentation samplers.

Categories and Subject Descriptors: G.3 [Probability and Statistics]: Statistical computing; Markov processes; probabilistic algorithms (including Monte Carlo)

General Terms: Algorithms, Performance, Theory

Additional Key Words and Phrases: Binomial data, logit model, data augmentation, Markov chain Monte Carlo, random utility model

ACM Reference Format: Fussl, A., Frühwirth-Schnatter, S., and Frühwirth, R. 2013. Efficient MCMC for binomial logit models. ACM Trans. Model. Comput. Simul. 23, 1, Article 3 (January 2013), 21 pages. DOI: http://dx.doi.org/10.1145/2414416.2414419
1. INTRODUCTION
Models involving binomial outcome variables are widely used in statistical and econometric data analysis. Such data typically arise when experiments with binary outcome variables are aggregated into binomial outcome variables. They often take the form of two-way or three-way contingency tables, or arise if repeated measurements are taken for each factor in the design matrix in a planned experiment (see, e.g., Hilbe [2007]). Finally, large cross-sectional datasets with a binary variable of interest and only a few distinct covariate patterns are usually aggregated to form binomial regression data.
Authors' addresses: A. Fussl, Department of Applied Statistics, Johannes Kepler University, Altenbergerstr. 69, 4040 Linz, Austria; email: [email protected]; S. Frühwirth-Schnatter, Institute for Statistics and Mathematics, University of Economics and Business, Augasse 2-6, 1090 Vienna, Austria; email: [email protected]; R. Frühwirth, Institute of High Energy Physics, Austrian Academy of Sciences, Nikolsdorfer Gasse 18, 1050 Vienna, Austria; email: [email protected].
© 2013 ACM 1049-3301/2013/01-ART3 $15.00. DOI: http://dx.doi.org/10.1145/2414416.2414419
Statistical and econometric models for such data are typically based on modeling each observation yi as a realization from a binomial distribution with known repetition parameter Ni:

$$y_i \mid \pi_i \sim \operatorname{Binom}(N_i, \pi_i), \qquad \operatorname{logit} \pi_i = \log \frac{\pi_i}{1 - \pi_i} = \log \lambda_i, \qquad (1)$$

where log λi is an observation-specific predictor depending on unknown model parameters to be estimated from the data. The model structure underlying log λi may be a simple linear predictor as in the binomial regression model, but frequently more complex models involving additional latent variables are applied. Examples include ANOVA for proportions using random-effects models [Crowder 1978], modeling of portfolio credit risk using generalized linear mixed models [McNeil and Wendin 2007], and discrete-valued time series using binomial mixed state space models [Czado and Song 2008].

The present work contributes to Bayesian inference for binomial logit models. Following the seminal paper by Albert and Chib [1993], many authors have proposed Markov chain Monte Carlo (MCMC) estimation for discrete-valued data using data augmentation based on a latent variable representation of the underlying discrete-valued distribution; see, for example, Holmes and Held [2006], Frühwirth-Schnatter and Frühwirth [2007], and Gramacy and Polson [2012]. For binary logit models, for instance, such a representation reads [McFadden 1974]:

$$y_i^* = \log \lambda_i + \varepsilon_i, \qquad y_i = I\{y_i^* > 0\}, \qquad (2)$$

where yi* is a latent variable and εi follows a logistic (LO) distribution. Also for binomial models, it is helpful for deriving MCMC samplers to regard the distribution of yi | πi ∼ Binom(Ni, πi) as the marginal distribution of an augmented model involving latent variables. A rather straightforward solution is to consider each binomial observation yi as the aggregated number of successes among Ni independent binary outcomes z1i, ..., zNi,i, following the binary logit model Pr(zni = 1 | πi) = πi. By introducing a latent variable as in (2) for each zni, an individual latent variable representation of the binomial distribution is obtained [Frühwirth-Schnatter and Frühwirth 2007]. However, the artificial enlargement of the sample leads to an unnecessarily high-dimensional latent variable, in particular if Ni is large.

For this reason, an aggregated representation of a binomial logit model is preferable, where only a single latent variable is introduced for each binomial observation yi. Aggregation yields a considerable reduction of computing time compared to the individual representation, even if it does not necessarily improve mixing. Two aggregated representations of a binomial logit model have been considered so far. Frühwirth-Schnatter et al. [2009] derive a representation involving an aggregated latent variable yi* similar to (2), where εi follows the distribution of the negative logarithm of a Gamma random variable with shape parameter Ni and unit scale. To perform efficient MCMC estimation, this distribution is approximated by a very accurate discrete location-scale mixture of normals. Gramacy and Polson [2012] use a representation involving z-distributions, which in turn are represented as continuous location-scale mixtures of normals with the mixing distribution being an infinite sum of exponential distributions. The need to choose a value at which to truncate the sum is a complication that, in the current implementation, is solved by a rather time-consuming mechanism (see the comparison in Section 4.2).

In this article, we derive a new representation of the binomial distribution involving an aggregated latent variable yi* similar to (2), in which, however, εi follows the Type III generalized logistic distribution GL(Ni). The representation is derived by writing
the logit model for each individual binary observation zni as a random utility model (RUM), then aggregating the latent utilities separately for each of the two categories, and finally defining the aggregated latent variable yi* as the difference of these two variables. This representation is generic in the sense that the conditional distribution of yi* | yi, λi is independent of the specific model structure and is easily simulated from three independent Gamma random variables, depending only on Ni, yi, and λi. This makes this representation very useful for MCMC estimation based on data augmentation. Conditional on yi*, the unknown model parameters appearing in the definition of λi may be sampled using three different MCMC algorithms: a data-augmented Metropolis-Hastings (MH) sampler, an auxiliary mixture (AM) sampler, and a novel hybrid auxiliary mixture (HAM) sampler. We illustrate how to use these approaches to efficiently estimate the regression parameters of a binomial logit regression model. However, AM sampling in particular may be applied to arbitrarily complex models.

AM sampling is based on approximating the distribution εi ∼ GL(Ni), which is symmetric around 0, by a discrete scale mixture of normals. The moments of this discrete scale mixture of normals depend on Ni and have been computed for all values 1 ≤ Ni ≤ 600 by minimizing the Kolmogorov-Smirnov distance to the true density. In the data augmentation scheme, a component indicator is introduced for each binomial observation yi as a second latent variable, conditional on which the latent variable yi* follows a normal distribution. This facilitates MCMC estimation considerably.

A comparative case study on five binomial datasets confirms that data augmentation based on the new aggregated representation is superior to the aggregated representation considered by Frühwirth-Schnatter et al. [2009] in terms of the effective sampling size. It is also superior to the individual representation and to Gramacy and Polson [2012] in terms of the effective sampling rate.

2. A NEW AGGREGATED LATENT VARIABLE REPRESENTATION OF THE BINOMIAL DISTRIBUTION
A generic latent variable representation of the binomial distribution yi | πi ∼ Binom(Ni, πi), where logit πi = log λi, involving an aggregated latent variable yi*, should have the following desirable properties.

(A) The latent equation takes the form of a "regression-type" model as in (2):

$$y_i^* = \log \lambda_i + \varepsilon_i. \qquad (3)$$
(B) The error εi in the latent Eq. (3) follows a distribution with a pdf that is known explicitly.
(C) It is easy to simulate from the conditional distribution of yi* | yi, λi.

These properties make it possible to apply this representation conditional on λi, regardless of the more specific model structure describing λi. In this section, a new aggregated representation satisfying these properties is proposed. This representation is derived from the individual representation mentioned in Section 1 by first writing the logit model for each individual binary observation zni as a RUM, as in McFadden [1974]:

$$u_{0,ni} = \varepsilon_{0,ni}, \qquad \varepsilon_{0,ni} \sim \mathrm{EV}, \qquad (4)$$
$$u_{1,ni} = \log \lambda_i + \varepsilon_{1,ni}, \qquad \varepsilon_{1,ni} \sim \mathrm{EV}, \qquad (5)$$
$$z_{ni} = I\{u_{1,ni} > u_{0,ni}\},$$
where ε0,ni and ε1,ni are independent and identically distributed errors following a type I extreme value (EV) distribution. The latent variables u0,ni and u1,ni are the utilities of choosing, respectively, category 0 or 1. Based on observing that a priori exp(−u0,ni) ∼ E(1) and exp(−u1,ni) ∼ E(λi) are independently exponentially (E) distributed, we aggregate the individual utilities for each category by taking the sum over all n = 1, ..., Ni transformed utilities:

$$\exp(-y_{0i}^*) = \sum_{n=1}^{N_i} \exp(-u_{0,ni}), \qquad \exp(-y_{1i}^*) = \sum_{n=1}^{N_i} \exp(-u_{1,ni}). \qquad (6)$$
Evidently, the aggregated utilities exp(−y0i*) and exp(−y1i*), being both sums of independent exponential random variables, are independent a priori with Gamma distributions (G):

$$\exp(-y_{0i}^*) \sim G(N_i, 1), \qquad \exp(-y_{1i}^*) \mid \lambda_i \sim G(N_i, \lambda_i). \qquad (7)$$

By taking the negative logarithm of both expressions in (7), we obtain:

$$y_{0i}^* = \varepsilon_{0i}, \qquad (8)$$
$$y_{1i}^* = \log \lambda_i + \varepsilon_{1i}, \qquad (9)$$

where εki = −log ξki, with ξki ∼ G(Ni, 1) for k = 0, 1, follows the negative log-Gamma distribution with integer shape parameter Ni. Finally, by taking the difference yi* = y1i* − y0i* of (9) and (8) to generate a single aggregated latent variable for each binomial observation, we obtain the aggregated dRUM representation of the binomial logit model.

It is easy to verify that the aggregated dRUM representation satisfies the desired properties (A) to (C). First of all, the latent equation takes the form of a "regression-type" model with a closed-form distribution of the error εi = ε1i − ε0i ∼ log G(Ni, 1) − log G(Ni, 1) [Cutler 1992]:

$$y_i^* = \log \lambda_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathrm{GL}(N_i), \qquad (10)$$

where GL(ν) is the Type III generalized logistic distribution with parameter ν, having the density

$$f_{GL}(\varepsilon; \nu) = \frac{\Gamma(2\nu)}{\Gamma(\nu)^2} \, \frac{\exp(-\nu \varepsilon)}{(1 + \exp(-\varepsilon))^{2\nu}}, \qquad (11)$$
which is symmetric around 0. The first two central moments are given by E(ε) = 0 and V(ε) = 2ψ′(ν), where ψ′(·) denotes the trigamma function; see, for example, Zelterman and Balakrishnan [1992]. The distribution of ε can also be regarded as a transformed Beta (B) distribution, since ε ∼ logit P, where P ∼ B(ν, ν). Due to its symmetry around 0, the GL-distributed error in (10) is particularly suited for an approximation by Gaussian mixtures. The classical logistic distribution results for the case ν = 1.
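Since ε ∼ logit P with P ∼ B(ν, ν), the GL(ν) distribution can be simulated in one line, which also gives a quick numerical check of the variance formula above. A minimal sketch in R (the function name rgl is our own choice, not part of any package mentioned in this article):

```r
# Sample GL(nu) via eps = logit(P) with P ~ Beta(nu, nu), and check that
# the sample variance is close to 2 * psi'(nu) (psi' = trigamma in R).
rgl <- function(n, nu) qlogis(rbeta(n, nu, nu))
eps <- rgl(1e5, 5)
c(var(eps), 2 * trigamma(5))   # both approximately 0.443
```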
Hence, for Ni = 1, that is, for binary data, representation (10) corresponds to the standard latent variable representation of the binary logit model given in (2), which has been termed the differenced random utility model (dRUM) by Frühwirth-Schnatter and Frühwirth [2010]. We use the term aggregated dRUM representation in connection with representation (10) to indicate that it may be seen as an extension of representation (2) to binomial data.

The representation considered in Frühwirth-Schnatter et al. [2009] is based on introducing the aggregated variable y1i* appearing in (9) as latent variable. For Ni = 1, that is, for binary data, this representation corresponds to the RUM representation of the logit model given in (4). Hence, we refer to this representation as the aggregated RUM representation.

The aggregated dRUM representation also satisfies the important property (C). As outlined in Lemma 1, the conditional distribution of yi* | λi, yi has a representation in terms of three independent Gamma random variables following, respectively, G(Ni, 1 + λi), G(yi, 1), and G(Ni − yi, λi). A proof of this lemma is given in Appendix A. Hence, it is extremely easy to simulate from yi* | λi, yi within a data augmentation scheme, regardless of the specific model structure underlying λi.

LEMMA 1. The conditional distribution of yi* | λi, yi is equal in distribution to

$$y_i^* \mid \lambda_i, y_i \sim \log \frac{U_i + I\{y_i > 0\}\, V_i}{U_i + I\{y_i < N_i\}\, W_i}, \qquad (12)$$

where Ui ∼ G(Ni, 1 + λi), Vi ∼ G(yi, 1), and Wi ∼ G(Ni − yi, λi), independently.
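Lemma 1 translates directly into a vectorized simulation step. The following R sketch (the function name draw_ystar is our own choice) draws yi* for all observations at once; the pmax() guards merely avoid shape-zero warnings, since the corresponding terms are switched off by the indicators:

```r
# Draw y_i* | lambda_i, y_i from (12) using three independent Gamma variates.
draw_ystar <- function(y, N, lambda) {
  n <- length(y)
  U <- rgamma(n, shape = N, rate = 1 + lambda)                    # U_i ~ G(N_i, 1 + lambda_i)
  V <- (y > 0) * rgamma(n, shape = pmax(y, 1), rate = 1)          # V_i ~ G(y_i, 1)
  W <- (y < N) * rgamma(n, shape = pmax(N - y, 1), rate = lambda) # W_i ~ G(N_i - y_i, lambda_i)
  log(U + V) - log(U + W)
}
```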
3. MCMC ESTIMATION BASED ON THE AGGREGATED dRUM REPRESENTATION OF THE BINOMIAL DISTRIBUTION
The aggregated dRUM representation introduced in Section 2 can be used for Bayesian inference with binomial models. The representation is generic in the sense that the conditional distribution of the latent variable yi* | yi, λi is independent of the specific model structure and is easily simulated from three independent Gamma random variables, depending only on Ni, yi, and λi. This makes the representation especially useful for MCMC estimation based on data augmentation. Conditional on yi*, the unknown model parameters appearing in the definition of λi may be sampled using latent equation (10).

In this section, we discuss three different, highly generic MCMC algorithms that are able to handle the non-normality of the error term appearing in equation (10). The two samplers in Section 3.1 and Section 3.2 extend, respectively, data-augmented independence MH sampling and AM sampling to this new representation of the binomial model. The HAM sampler introduced in Section 3.3 is a novel sampler based on the idea of combining data-augmented independence MH sampling with AM sampling. We illustrate how to use these approaches to estimate the regression parameters of a binomial logit regression model where y = (y1, ..., yN) are conditionally independent data from a binomial distribution, that is, yi ∼ Binom(Ni, πi) with

$$\pi_i = \frac{\exp(x_i \beta)}{1 + \exp(x_i \beta)},$$

or equivalently, logit πi = log λi = xiβ. Here, xi is a row vector of regressors, including 1 for the intercept, and β is an unknown regression parameter of dimension d. To carry out Bayesian inference, we assume that the prior distribution of β is a normal distribution β ∼ Nd(b0, B0) with known mean vector b0 and covariance matrix B0.

3.1. Data-Augmented Independence Metropolis-Hastings Sampling
Data-augmented independence MH sampling was introduced by Scott [2011] for MCMC estimation of logit and multinomial logit models using the RUM representation. The method is, however, useful for more general models and provides a fully automatic, generic algorithm for posterior simulation based on an independence MH algorithm. Data-augmented independence MH sampling is easily implemented for any latent variable representation having properties (A)-(C). For each sweep of the sampler, the latent variables yi* are easily sampled, according to assumption (C). Conditional on
all latent variables, an independence proposal is constructed for the unknown parameters from model (3) by approximating the distribution of the error εi by a normal distribution with the same expectation and variance. As log λi is typically linear in the unknown parameters, this usually yields a simple proposal density for the unknown parameters. The true error density has to be evaluated for all i at εi = yi* − log λi to compute the acceptance rate. As we are dealing with an independence sampler, the efficiency of the sampler increases with increasing acceptance rate. It is obvious that the acceptance rate will strongly depend on how close the true density of εi is to the normal distribution. Frühwirth-Schnatter and Frühwirth [2010], for instance, apply data-augmented independence MH sampling to MCMC estimation of logit and multinomial logit models under the dRUM representation, and achieve considerably higher acceptance rates than Scott [2011], simply because the logistic distribution underlying the dRUM representation is symmetric around 0 while the extreme value distribution underlying the RUM representation is skewed.

In the context of binomial logit models, data-augmented independence MH sampling can be applied both to the aggregated RUM representation considered in Frühwirth-Schnatter et al. [2009] and to the new aggregated dRUM representation introduced in Section 2. Since the error distribution in the aggregated dRUM representation (10) is symmetric around 0, we expect a higher acceptance rate under this representation. To construct an independence proposal for the unknown parameters, the GL(Ni) distribution of the error εi appearing in (10) is approximated by a normal distribution with the same expectation and variance:

$$y_i^* \approx \log \lambda_i + \tilde\varepsilon_i, \qquad \tilde\varepsilon_i \sim N(0,\, 2\psi'(N_i)) \approx \mathrm{GL}(N_i). \qquad (13)$$
This approximate model serves as a basis for constructing a proposal for the model parameters appearing in log λi.

Application to the Binomial Logistic Regression Model. For the binomial logistic regression model, where log λi = xiβ, the approximate model (13) is a linear regression model in β with heteroscedastic normal errors. We take the posterior of this approximate regression model as independence proposal density q(β_new | z) for β. Since the distribution of εi is approximately normal for large Ni, the acceptance rate of this proposal should be fairly high. This leads directly to Algorithm 1. To save computing time, the posterior covariance matrix BN as well as the vector mN and the matrix KN can be precalculated as follows, before starting Algorithm 1:

$$B_N = \left( B_0^{-1} + \sum_{i=1}^{N} \frac{1}{2\psi'(N_i)}\, x_i' x_i \right)^{-1} = \left( B_0^{-1} + X'\tilde{X} \right)^{-1}, \qquad m_N = B_N B_0^{-1} b_0, \qquad K_N = B_N \tilde{X}', \qquad (14)$$

where the ith row of $\tilde{X}$ is given by $x_i / (2\psi'(N_i))$.

For the purpose of comparison, we apply data-augmented independence MH sampling also to the aggregated RUM representation, by approximating the negative log-Gamma distributed error ε1i appearing in (9) by a normally distributed error with the
same moments, that is, E(ε1i | Ni) = −ψ(Ni) and V(ε1i | Ni) = ψ′(Ni), where ψ(·) denotes the digamma function. The posterior of this approximate model is used as proposal for β, that is, q(β_new | z) = Nd(bN, BN) with:

$$b_N = B_N \left( B_0^{-1} b_0 + \tilde{X}'\psi + \tilde{X}' z \right), \qquad B_N = \left( B_0^{-1} + X'\tilde{X} \right)^{-1},$$

where ψ = (ψ(N1), ..., ψ(NN))', z = (y11*, ..., y1N*)', and the ith row of $\tilde{X}$ is $x_i / \psi'(N_i)$. The sampling scheme itself is identical with Algorithm 1; only the likelihood of the model and the posterior of the aggregated utilities have to be adjusted according to the aggregated RUM representation. Since the error distribution in the aggregated RUM representation (9) is skewed, we expect a lower acceptance rate for this representation compared to the aggregated dRUM representation.
ALGORITHM 1: Data-augmented independence MH sampling for a binomial regression model
Choose starting values for the regression coefficients β and the latent variables z = (y1*, ..., yN*) and repeat the following steps:
(a) Sample β conditional on z:
  (a-1) Propose β_new from the proposal q(β_new | z) = Nd(bN, BN), where BN, mN, and KN are precomputed as in (14) and bN depends on z through

  $$b_N = B_N \left( B_0^{-1} b_0 + \sum_{i=1}^{N} \frac{1}{2\psi'(N_i)}\, x_i' y_i^* \right) = m_N + K_N z.$$

  (a-2) Accept β_new with probability min(α, 1), where the acceptance rate α is defined as

  $$\alpha = \frac{p(z \mid \beta_{new})\, p(\beta_{new})\, q(\beta \mid z)}{p(z \mid \beta)\, p(\beta)\, q(\beta_{new} \mid z)}, \qquad (A1.1)$$

  and p(z | β) is the likelihood of model (10):

  $$p(z \mid \beta) = \prod_{i=1}^{N} f_{GL}(y_i^* - x_i\beta;\, N_i),$$

  where fGL(yi* − xiβ; Ni) is the pdf of the Type III generalized logistic distribution given in (11).
(b) Sample the aggregated latent variables yi* simultaneously from yi* | λi, yi for i = 1, ..., N as in (12) with λi = exp(xiβ).
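As an illustration, the following R sketch runs Algorithm 1 on simulated data; it relies on draw_ystar() from the Lemma 1 sketch above and on the mvtnorm package for multivariate normal densities. All other names, and the simulated data themselves, are our own assumptions, and the sketch omits the burn-in bookkeeping and the "acceptance phase" discussed in Section 4.2:

```r
library(mvtnorm)
set.seed(1)
N <- 100; d <- 3
X  <- cbind(1, matrix(runif(N * (d - 1)), N, d - 1))  # rows are x_i
Ni <- rpois(N, 20) + 1                                # repetition parameters
y  <- rbinom(N, Ni, plogis(X %*% c(-0.5, 1, -1)))     # binomial logit data
b0 <- rep(0, d); B0 <- diag(10, d)                    # prior beta ~ N_d(b0, B0)

dgl_log <- function(eps, nu)                          # log-density of GL(nu), eq. (11)
  lgamma(2 * nu) - 2 * lgamma(nu) - nu * eps - 2 * nu * log1p(exp(-eps))

v  <- 2 * trigamma(Ni)                                # proposal error variances
BN <- solve(solve(B0) + crossprod(X, X / v))          # precomputed as in (14)
mN <- drop(BN %*% solve(B0, b0))
KN <- BN %*% t(X / v)

beta <- rep(0, d)
z <- draw_ystar(y, Ni, exp(drop(X %*% beta)))
for (it in 1:12000) {
  bN <- drop(mN + KN %*% z)                           # step (a-1)
  beta_new <- drop(rmvnorm(1, bN, BN))
  log_alpha <-                                        # step (a-2), cf. (A1.1)
    sum(dgl_log(z - drop(X %*% beta_new), Ni)) -
    sum(dgl_log(z - drop(X %*% beta), Ni)) +
    dmvnorm(beta_new, b0, B0, log = TRUE) - dmvnorm(beta, b0, B0, log = TRUE) +
    dmvnorm(beta, bN, BN, log = TRUE) - dmvnorm(beta_new, bN, BN, log = TRUE)
  if (log(runif(1)) < log_alpha) beta <- beta_new
  z <- draw_ystar(y, Ni, exp(drop(X %*% beta)))       # step (b)
}
```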
3.2. Auxiliary Mixture Sampling
AM sampling is another fully automatic, generic algorithm for posterior simulation of any latent variable representation having properties (A)-(C), provided that an extremely accurate, precalculated normal mixture approximation to the density of the error εi appearing in (3) is available. For each sweep of the sampler, the latent variables yi* are easily sampled, according to assumption (C). Conditional on all latent variables, a proposal for the unknown model parameters appearing in λi is constructed from model (3) by approximating the distribution of the error εi by the accurate mixture distribution. In the data augmentation scheme, a component indicator is introduced for each i as a second latent variable, conditional on which the latent variable yi* follows a normal distribution. As log λi
is typically linear in the unknown parameters, this usually yields a straightforward Gibbs sampling scheme.

AM sampling was introduced by Shephard [1994] for stochastic volatility models, where εi follows a log χ1² distribution. It was extended to modeling count data from the Poisson distribution [Frühwirth-Schnatter and Wagner 2006] and to logit and multinomial models using the RUM representation [Frühwirth-Schnatter and Frühwirth 2007], where εi follows an EV distribution. Frühwirth-Schnatter and Frühwirth [2010] were able to increase the efficiency of AM sampling for logit and multinomial models considerably by using the dRUM representation, where εi follows a logistic distribution. The resulting sampler is much faster than the computationally intensive sampler by Holmes and Held [2006], which is also based on the dRUM representation, but uses an exact continuous scale mixture representation of the logistic distribution instead of a finite mixture approximation. In all of these cases, the distribution of εi is independent of any parameters, making it easily possible to precalculate a very accurate normal mixture approximation.

Even if the density of εi depends on an integer parameter ν, it is possible to precalculate accurate normal mixture approximations with component-specific parameters for each ν. Frühwirth-Schnatter et al. [2009], for instance, increase the efficiency of AM sampling for count data models considerably by using a latent variable representation where εi follows a negative log-Gamma distribution, the number of degrees of freedom being equal to the observed counts. The same error distribution appears in the context of binomial regression if the aggregated RUM representation (9) is used. However, since the individual dRUM representation of the binomial regression model is more efficient than the individual RUM representation according to Frühwirth-Schnatter and Frühwirth [2010], we subsequently introduce AM sampling for the aggregated dRUM representation instead of the aggregated RUM representation.

To this end, the density of the Type III generalized logistic distribution GL(ν) of the error term appearing in the aggregated dRUM representation (10) has to be approximated by a mixture of normal distributions for arbitrary integer values ν = 1, 2, .... The symmetry of fGL(ε; ν) around 0 suggests using a scale mixture of normals, where all component means are equal to 0:

$$f_{GL}(\varepsilon; \nu) = \frac{\Gamma(2\nu)}{\Gamma(\nu)^2} \, \frac{\exp(-\nu\varepsilon)}{(1 + \exp(-\varepsilon))^{2\nu}} \approx q_\nu(\varepsilon) = \sum_{r=1}^{R(\nu)} w_r(\nu)\, \varphi(\varepsilon;\, 0,\, s_r^2(\nu)), \qquad (15)$$
where ϕ(ε; 0, s²) denotes a normal density with mean 0 and variance s². The number of components R(ν), the weights wr(ν), and the variances sr²(ν) depend on the individual group sizes ν = Ni. Details of the approximation are given in Appendix B.

In addition to the aggregated latent variable yi*, this sampler also introduces the component indicator ri of this finite mixture for each binomial observation yi as missing data. Conditional on z = (y1*, ..., yN*) and r = (r1, ..., rN), the non-Gaussian model in (10) reduces to a regression-type model with a normally distributed error term:

$$y_i^* = \log\lambda_i + \tilde\varepsilon_i, \qquad \tilde\varepsilon_i \mid r_i \sim N(0,\, s_{r_i}^2(N_i)). \qquad (16)$$

If log λi is linear in the unknown model parameters, then sampling of these parameters from well-known densities is often possible. The sampling scheme in Algorithm 2 illustrates how auxiliary mixture sampling is used to estimate the regression parameters β in a binomial logistic regression model. However, AM sampling may be applied to more complex models such as state space or
random-effects models. Step (a) in Algorithm 2 is generic, being only based on knowing λi rather than the entire model structure. Step (b) is the model-specific part and might involve further blocking in order to estimate all model parameters.

ALGORITHM 2: Auxiliary mixture sampling for a binomial regression model
Choose starting values for β and repeat the following steps:
(a) Sample the aggregated variables z = (y1*, ..., yN*) and the component indicators r = (r1, ..., rN) conditional on λi = exp(xiβ):
  (a-1) Sample yi* simultaneously from yi* | λi, yi for i = 1, ..., N as in (12).
  (a-2) Sample the component indicators ri conditional on yi* and λi from the discrete density

  $$\Pr(r_i = j \mid y_i^*, \lambda_i) \propto \frac{w_j(N_i)}{s_j(N_i)} \exp\left( -\frac{1}{2} \left( \frac{y_i^* - \log\lambda_i}{s_j(N_i)} \right)^2 \right), \qquad j = 1, \ldots, R(N_i).$$

(b) Sample the regression coefficients β conditional on z and r based on the normal regression model (16) from Nd(bN, BN) with moments

  $$b_N = B_N \left( B_0^{-1} b_0 + \sum_{i=1}^{N} \frac{1}{s_{r_i}^2(N_i)}\, x_i' y_i^* \right), \qquad B_N = \left( B_0^{-1} + \sum_{i=1}^{N} \frac{1}{s_{r_i}^2(N_i)}\, x_i' x_i \right)^{-1}.$$
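A single sweep of Algorithm 2 can be sketched as follows, reusing the setup and draw_ystar() from the previous sketches. We assume the precalculated mixture tables are available as lists mix_w and mix_s2 (weights wr(ν) and variances sr²(ν) indexed by ν; Appendix B tabulates them for small ν) — these container names are our own assumption, not the binomlogit interface:

```r
am_sweep <- function(beta, y, Ni, X, b0, B0) {
  lambda <- exp(drop(X %*% beta))
  z  <- draw_ystar(y, Ni, lambda)                # step (a-1), Lemma 1
  s2 <- numeric(length(y))
  for (i in seq_along(y)) {                      # step (a-2): indicators r_i
    w  <- mix_w[[Ni[i]]]; v <- mix_s2[[Ni[i]]]
    lp <- log(w) - 0.5 * log(v) - 0.5 * (z[i] - log(lambda[i]))^2 / v
    s2[i] <- v[sample(length(w), 1, prob = exp(lp - max(lp)))]
  }
  BN <- solve(solve(B0) + crossprod(X, X / s2))  # step (b): beta | z, r
  bN <- BN %*% (solve(B0, b0) + crossprod(X, z / s2))
  drop(bN + t(chol(BN)) %*% rnorm(ncol(X)))
}
```

One MCMC iteration is then simply beta <- am_sweep(beta, y, Ni, X, b0, B0).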
3.3. Hybrid Auxiliary Mixture Sampler
It is easy to verify [Frühwirth-Schnatter and Frühwirth 2012] that the acceptance rate of data-augmented independence MH sampling equals min(1, α), with α given by:

$$\alpha = \frac{\prod_{i=1}^{N} p(y_i \mid \beta_{new})\, p(\beta_{new}) \prod_{i=1}^{N} \tilde{p}(y_i \mid \beta_{old})\, p(\beta_{old})}{\prod_{i=1}^{N} p(y_i \mid \beta_{old})\, p(\beta_{old}) \prod_{i=1}^{N} \tilde{p}(y_i \mid \beta_{new})\, p(\beta_{new})} = \prod_{i=1}^{N} \frac{p(y_i \mid \beta_{new})\, \tilde{p}(y_i \mid \beta_{old})}{p(y_i \mid \beta_{old})\, \tilde{p}(y_i \mid \beta_{new})} =: \prod_{i=1}^{N} \alpha_i. \qquad (17)$$

Hence, the closer the approximate likelihood p̃(yi | β) based on the normal distribution is to the true likelihood p(yi | β) over a wide range of β, the higher the acceptance rate will be. Note that for AM sampling, no rejection step appears in Algorithm 2, since the true and the approximate likelihood are virtually the same; see also Appendix B.

The degree to which the normal approximation p̃(yi | β) used by the data-augmented MH sampler is a good approximation to the true likelihood p(yi | β) may vary considerably over the observations. For instance, the two likelihoods will be close to each other whenever Ni is large and the success rate yi/Ni is neither close to 0 nor close to 1. In this case, the normal approximation gives a contribution αi to the acceptance rate close to 1, and the data-augmented MH sampler works well. However, for extreme ratios yi/Ni ≤ c_low or yi/Ni ≥ c_up, αi is considerably smaller, and in the worst case the MH algorithm completely fails to work.

The basic idea of HAM sampling is to combine the data-augmented MH sampler and the AM sampler, that is, to use the precise mixture approximation for problematic
observations, where we expect a poor performance of the normal approximation, and to use the normal approximation of the MH sampler otherwise. Since αi is virtually 1 for the problematic observations, this hybrid algorithm will have reasonable acceptance rates.

Application to the Binomial Logistic Regression Model. We provide details of the HAM sampler for the binomial regression model using the aggregated dRUM representation. First, the N binomial observations are partitioned into two sets IS and IA in the following way. For all i where c_low < yi/Ni < c_up, we approximate the distribution of the error εi appearing in (10) by the normal distribution N(0, 2ψ′(Ni)) and denote the set containing these indices by IS. For all i where yi/Ni ≤ c_low or yi/Ni ≥ c_up, we approximate the distribution of the error εi appearing in (10) by the mixture distribution qNi(εi) given by (15) and denote the set containing these indices by IA. For a binomial logistic regression model this leads to Algorithm 3.
ALGORITHM 3: Hybrid auxiliary mixture sampler
Choose starting values for β and z = (y1*, ..., yN*) and repeat the following steps:
(a) Sample the component indicators r = {ri}, i ∈ IA, conditional on λi = exp(xiβ) and z by sampling ri from q(ri | λi), which is a discrete density over ri ∈ {1, ..., R(Ni)} given by:

  $$q(r_i \mid \lambda_i) \propto \varphi(y_i^*;\, \log\lambda_i,\, s_{r_i}^2(N_i))\, w_{r_i}(N_i), \qquad (A3.1)$$

  where (sri²(Ni), wri(Ni)) are given by (15).
(b) Sample the regression coefficients β conditional on z and r:
  (b-1) Propose β_new | r from the proposal q(β_new | r) = Nd(BN mN, BN), where BN and mN are computed as

  $$B_N = \left( B_S^{-1} + \sum_{i\in I_A} \frac{1}{s_{r_i}^2(N_i)}\, x_i' x_i \right)^{-1}, \qquad (A3.2)$$

  $$m_N = B_0^{-1} b_0 + \sum_{i\in I_S} \frac{1}{2\psi'(N_i)}\, x_i' y_i^* + \sum_{i\in I_A} \frac{1}{s_{r_i}^2(N_i)}\, x_i' y_i^*, \qquad (A3.3)$$

  and BS⁻¹ is precomputed as in (18).
  (b-2) Accept β_new with probability min(1, α), where:

  $$\alpha = \prod_{i\in I_S} \frac{p(y_i \mid \beta_{new})\, \varphi(y_i^*;\, x_i\beta_{old},\, 2\psi'(N_i))}{p(y_i \mid \beta_{old})\, \varphi(y_i^*;\, x_i\beta_{new},\, 2\psi'(N_i))} = \prod_{i\in I_S} \frac{p(y_i \mid \beta_{new})}{p(y_i \mid \beta_{old})} \cdot \frac{\varphi(\beta_{old};\, B_S m_S,\, B_S)}{\varphi(\beta_{new};\, B_S m_S,\, B_S)} \cdot \frac{p(\beta_{new})}{p(\beta_{old})}. \qquad (A3.4)$$

(c) Sample the aggregated utilities z conditional on λi = exp(xiβ) by sampling yi* simultaneously from yi* | λi, yi for i = 1, ..., N as in (12).
The proposal density in Step (b) of Algorithm 3 is derived from

$$q(\beta_{new} \mid \beta_{old}) \propto p(\beta_{new}) \prod_{i\in I_S} \varphi(y_i^*;\, x_i\beta_{new},\, 2\psi'(N_i)) \prod_{i\in I_A} \varphi(y_i^*;\, x_i\beta_{new},\, s_{r_i}^2(N_i))\, q(r_i \mid \beta_{old}).$$
The first two terms define a partial independence proposal density ϕ(β_new; BS mS, BS) based only on the prior and all observations i ∈ IS, with moments given by:

$$B_S^{-1} = B_0^{-1} + \sum_{i\in I_S} \frac{1}{2\psi'(N_i)}\, x_i' x_i, \qquad (18)$$

$$m_S = B_0^{-1} b_0 + \sum_{i\in I_S} \frac{1}{2\psi'(N_i)}\, x_i' y_i^*. \qquad (19)$$
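The partition underlying the HAM sampler and the precomputable part of (18) look as follows in R, continuing the earlier sketches; c_low and c_up are the dataset-specific boundaries reported in the headers of Tables I to V:

```r
c_low <- 0.05; c_up <- 0.95
ratio <- y / Ni
I_A <- which(ratio <= c_low | ratio >= c_up)  # extreme ratios: mixture approximation
I_S <- setdiff(seq_along(y), I_A)             # moderate ratios: normal approximation
XS  <- X[I_S, , drop = FALSE]
vS  <- 2 * trigamma(Ni[I_S])
BS_inv <- solve(B0) + crossprod(XS, XS / vS)  # eq. (18); computable once
# m_S in (19) depends on the latent utilities z and is updated in each sweep.
```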
Note that the matrices BS⁻¹ as well as BS may be precomputed because they do not depend on any latent variable, whereas the vector mS depends on the latent aggregated utilities yi*. Combining this proposal with the information contained in all observations i ∈ IA, given r, yields the proposal density with the moments given in (A3.2) and (A3.3). The acceptance rate (A3.4) follows from (17). Note that αi is equal to 1 for all i ∈ IA, because p̃(yi | λi) is a perfect approximation to p(yi | λi).

4. COMPARISON OF THE DIFFERENT MCMC ALGORITHMS
In order to compare the various latent variable representations of Section 2 and the different MCMC algorithms of Section 3, we use two simulated datasets and three well-known observed datasets.

4.1. The Datasets
The Titanic passenger dataset [Hilbe 2007, Table 6.11] is organized as a contingency table, where the passengers are categorized by age (child/adult), gender (female/male), and the class in which they traveled (first/second/third class). Ni denotes the number of exposures in each group, and yi is the corresponding number of surviving passengers in each category. In all groups, we observe survivors as well as nonsurvivors, except for children in the first and second class, where all children survived. We perform an ANOVA for the survival rates of the N = 8 groups having nonsurvivors and fit a saturated model. Adult males in the first class are taken as the baseline category.

The Caesarean birth dataset [Fahrmeir and Tutz 2001, Table 1.1] contains information on infection from births by Caesarean section, originating from a 3-way contingency table. 251 mothers are categorized by the variables "Caesarean planned" (yes/no), "antibiotics given" (yes/no), and "risk factors present" (yes/no). In order to obtain a binomial regression model, we ignore the two types of infection and just observe whether an infection occurred or not. Thus, yi denotes the number of women with infection and Ni denotes the number of observed women in each group. We fit a binomial logit model with main effects only.

The Beetle mortality dataset [Czado and Raftery 2006, Table 5] serves as a typical example for the special case of binomial data where repeated measurements are available for identical covariate patterns. For N = 8 different concentrations of gaseous carbon disulphide, the number yi of dead insects in each group is observed after an exposure time to the gas of five hours. Ni denotes the number of exposed insects within each group.

The Gramacy/Polson dataset was simulated in a similar way as the binomial data in Gramacy and Polson [2012, Section 4.2], but without intercept. The true parameter vector of dimension d = 9 is equal to β = {2, −3, 2, −4, 0, 0, 0, 0, 0}, and the elements of the design matrix xi are all drawn from a uniform distribution U[0, 1]. Each of the N = 100 covariate patterns is observed 20 times, that is, all group sizes Ni = 20. The responses yi were simulated according to the binomial logit model in (1), where πi was calculated as λi/(1 + λi) with λi = exp(xiβ).
Finally, we constructed a very extreme and unbalanced Simulated dataset. The pattern of this dataset is typical of models for rare events, for example, rare diseases or financial defaults. Based on a fixed number of d = 10 covariates consisting of nine binary variables and the intercept, we built the design matrix xi by computing all 2⁹ possible 0/1 combinations. As true parameter vector, we defined β = {0.05, 2, 1.5, −3, −0.01, −1.3, 2.9, −2.1, 0.5, −0.2}. The data yi were simulated according to the binomial logit model in (1), where Ni and πi are dependent. The group sizes Ni were simulated from the Poisson distribution Ni ∼ P(1 + πi · 100). As a consequence, the group sizes Ni in this simulated dataset are small for πi near 0 and large for πi near 1.

For all five datasets, the number of binomial observations (N), the number of the reproduced binary observations (Σ Ni), the minimal and the maximal group sizes (min(Ni), max(Ni)), and the number of covariates (d) are reported in the corresponding Tables I to V. The regression coefficients β are all assumed to be independent and normally distributed N(0, 10) a priori.

4.2. Performance Evaluation of the MCMC Algorithms
For all datasets introduced in Section 4.1, we considered the three samplers derived from the new aggregated dRUM representation of a binomial regression model presented in Section 3, namely data-augmented independence MH sampling (Agg. dRUM-MH) as in Algorithm 1, AM sampling (Agg. dRUM-AM) as in Algorithm 2, and HAM sampling (Agg. dRUM-HAM) as in Algorithm 3. The algorithms were implemented in the R package binomlogit, which is available on CRAN (http://cran.r-project.org/src/contrib/binomlogit_1.0.tar.gz). The boundaries c_low and c_up used for HAM sampling are reported for each dataset in the header of Tables I to V. For the sake of comparison, we considered data-augmented independence MH sampling for the aggregated RUM representation (Agg. RUM-MH), as described at the end of Section 3.1. In addition, we investigated two samplers based on the individual dRUM representation, namely data-augmented independence MH sampling (Indiv. dRUM-MH) and AM sampling (Indiv. dRUM-AM). These samplers were implemented by running the corresponding R functions with repetition parameter equal to 1 for individual binary observations reconstructed from the binomial data. Finally, we estimated the binomial regression model as in Gramacy and Polson [2012] using their R package reglogit, also available on CRAN. As the main purpose of this package is to perform simulation-based regularized logistic regression by Gibbs sampling, we chose some arguments differently from the default values of the function reglogit, that is, nu=-1, kappa=1, nup=NULL, and method="MH". Specifying the parameter zzero as TRUE (the default value) or FALSE yields the results for the pdf representation in the first case and for the cdf representation in the latter case. The results for both representations are stated in Tables I to V (Reglogit-pdf/Reglogit-cdf).

For each sampler, we generated M = 10000 draws from the posterior distribution after a burn-in of 2000 draws. The starting value of β for all algorithms was set to 0. For the MH and the HAM sampler, starting values for the latent utilities z were required as well. For the aggregated dRUM representation, starting values for the latent utilities z were sampled from (12) with λi = π̂i/(1 − π̂i), where π̂i = min(max(yi/Ni, 0.05), 0.95). Starting values for z were determined for the aggregated RUM representation as in Frühwirth-Schnatter et al. [2009] and for the individual dRUM representation as in Frühwirth-Schnatter and Frühwirth [2010].

A certain difficulty with the data-augmented independence MH sampler is that it might get stuck at the starting values. This actually happened with the unbalanced simulated dataset in Table V for all three samplers based on the individual or aggregated dRUM or RUM representation. To overcome this problem and to start the sampler successfully, we inserted an "acceptance phase" before the burn-in period of all data-augmented independence MH samplers. This means that for an additional small number of MCMC steps at the beginning of the algorithm, typically 50 draws, each proposed parameter vector is accepted with probability 1 rather than according to the MH acceptance rule. Although we do not draw from the correct posterior distribution during the acceptance phase, we generate reasonable starting values for our MH samplers, once the real burn-in phase sets in. In this way we were able to run the data-augmented independence MH samplers successfully on all datasets, including the unbalanced simulated dataset in Table V.

The performance of the different samplers is compared using three criteria. The CPU time T_CPU quantifies the running time in seconds needed for the M MCMC draws without burn-in time, using R (version 2.14.0) on a PC with a 3.16-GHz processor. The effective sampling size [Kass et al. 1998] is computed for each regression coefficient βk, k = 1, ..., d, according to ESS = M/τ, where τ = 1 + 2 Σ_{h=1}^{K} ρ(h) is the inefficiency factor, ρ(h) denotes the empirical autocorrelation of the MCMC draws at lag h, and K is determined by the initial monotone sequence estimator [Geyer 1992]. Finally, we calculate for all regression coefficients the effective sampling rate per second as ESR = ESS/T_CPU. ESR can be used to compare a fast but inefficient sampler with a slow but efficient one. Since both a high ESS and a small runtime are desirable, a higher ESR indicates a better sampler. Tables I to V report the median value of ESS and ESR over all regression coefficients, as well as the minimum and the maximum values. In addition, the acceptance rate α for all MH samplers and the HAM sampler is shown.

Table I. Comparison of the MCMC Samplers for the Titanic Passenger Data (N = 8, Σ Ni = 1286, min(Ni) = 31, max(Ni) = 462, d = 8, c_low = 0.05, c_up = 0.95); Based on M = 10000 Draws after Burn-In of 2000 Draws

Sampler           α (%)   T_CPU (s)   ESS min   ESS med   ESS max   ESR min   ESR med   ESR max
Agg. dRUM-MH       99.5        3.0      295.9    2459.4    3197.3      97.7     811.7    1055.2
Agg. dRUM-AM          –        4.2      330.6    2412.9    3223.2      79.1     577.3     771.1
Agg. dRUM-HAM      99.7        5.0      305.9    2292.7    3001.6      61.7     462.2     605.2
Agg. RUM-MH        94.0        2.9      147.5    1040.0    1311.6      50.2     353.8     446.1
Indiv. dRUM-AM        –       26.3      413.2    2854.5    4184.7      15.7     108.5     159.1
Indiv. dRUM-MH     56.2       14.9      261.7    1883.3    2664.2      17.6     126.6     179.0
Reglogit-pdf          –      335.8      712.5    5887.1    9574.8       2.1      17.5      28.5
Reglogit-cdf          –      335.0      751.1    3707.9    9013.9       2.2      11.1      26.9
(ESS in total draws; ESR in draws per second.)
Table II. Comparison of the MCMC Samplers for the Caesarean Birth Data (N = 7, Σ Ni = 251, min(Ni) = 2, max(Ni) = 98, d = 4, c_low = 0.01, c_up = 0.99); Based on M = 10000 Draws after Burn-In of 2000 Draws

Sampler           α (%)   T_CPU (s)   ESS min   ESS med   ESS max   ESR min   ESR med   ESR max
Agg. dRUM-MH       97.0        3.5     1086.2    1390.1    1540.4     307.7     393.8     436.4
Agg. dRUM-AM          –        4.1     1306.9    1465.0    1930.3     318.0     356.5     469.7
Agg. dRUM-HAM      99.3        4.8     1241.6    1453.7    1700.3     258.1     302.2     353.5
Agg. RUM-MH        86.1        3.0      529.9     598.5     650.4     179.0     202.2     219.7
Indiv. dRUM-AM        –        7.9     1419.2    1761.5    1909.1     179.6     223.0     241.7
Indiv. dRUM-MH     68.1        5.1      978.3    1388.0    1560.2     191.5     271.6     305.3
Reglogit-pdf          –       71.4     1396.9    3241.7    3685.2      19.6      45.4      51.6
Reglogit-cdf          –       74.9     1451.7    3008.4    3244.7      19.4      40.2      43.3
(ESS in total draws; ESR in draws per second.)

Table III. Comparison of the MCMC Samplers for the Beetle Mortality Data (N = 8, Σ Ni = 481, min(Ni) = 56, max(Ni) = 63, d = 2, c_low = 0.05, c_up = 0.95); Based on M = 10000 Draws after Burn-In of 2000 Draws

Sampler           α (%)   T_CPU (s)   ESS min   ESS med   ESS max   ESR min   ESR med   ESR max
Agg. dRUM-MH       97.4        3.0     4008.8    4058.7    4108.6    1349.8    1366.6    1383.4
Agg. dRUM-AM          –        4.2     4013.3    4058.9    4104.6     944.3     955.0     965.8
Agg. dRUM-HAM      97.4        4.7     3660.0    3679.3    3698.6     773.8     777.9     781.9
Agg. RUM-MH        80.0        2.8     1535.2    1544.2    1553.2     538.7     541.8     545.0
Indiv. dRUM-AM        –       11.1     4085.6    4119.2    4152.8     369.4     372.4     375.5
Indiv. dRUM-MH     64.5        6.0     2388.5    2395.3    2402.1     400.1     401.2     402.4
Reglogit-pdf          –      129.9     1151.1    1158.0    1164.9       8.9       8.9       9.0
Reglogit-cdf          –      129.2     1438.7    1441.0    1443.2      11.1      11.1      11.2
(ESS in total draws; ESR in draws per second.)

Table IV. Comparison of the MCMC Samplers for the Gramacy/Polson Data (N = 100, Σ Ni = 2000, min(Ni) = 20, max(Ni) = 20, d = 9, c_low = 0.01, c_up = 0.99); Based on M = 10000 Draws after Burn-In of 2000 Draws

Sampler           α (%)   T_CPU (s)   ESS min   ESS med   ESS max   ESR min   ESR med   ESR max
Agg. dRUM-MH       96.7        4.3      957.5    1587.8    1834.1     222.2     368.4     425.5
Agg. dRUM-AM          –        5.8      993.5    1706.9    1844.5     171.6     294.8     318.6
Agg. dRUM-HAM      96.8        6.5      874.6    1468.7    1933.1     134.8     226.3     297.9
Agg. RUM-MH        74.0        4.0      408.3     573.4     620.9     102.3     143.7     155.6
Indiv. dRUM-AM        –       40.4     1022.4    1753.3    2313.3      25.3      43.3      57.2
Indiv. dRUM-MH     52.6       18.7      716.3    1160.5    1317.8      38.3      62.0      70.4
Reglogit-pdf          –      523.2     3070.8    4615.7    5484.5       5.9       8.8      10.5
Reglogit-cdf          –      527.5     2468.4    4170.9    4918.9       4.7       7.9       9.3
(ESS in total draws; ESR in draws per second.)
Efficient MCMC for Binomial Logit Models
3:15
Table V. Comparison of the MCMC Samplers for the Simulated Data (N = 490, N i = 25803, min(N i ) = 1, max(N i ) = 126, d = 10, c low = 0.15, c up = 0.85); Based on M = 10000 Draws after Burn-In of 2000 Draws Sampler Agg. dRUM-MH Agg. dRUM-AM Agg. dRUM-HAM Agg. RUM-MH Indiv. dRUM-AM Indiv. dRUM-MH Reglogit-pdf Reglogit-cdf
a (%)
TCPU (s)
97.2
11.4 13.0 15.0 9.5 504.2 264.8 6639.5 6640.9
98.6 82.2 50.3
ESS (total draws) min med max 533.4 540.4 556.6 208.7 596.7 447.3 1326.9 1317.9
947.3 974.4 888.2 409.8 1141.8 704.4 1591.2 1653.2
1447.1 1599.3 1520.9 675.7 1726.3 1203.0 2661.4 3047.1
ESR (draws/s) min med max 46.9 41.6 37.1 22.0 1.2 1.7 0.2 0.2
83.2 75.1 59.2 43.2 2.3 2.7 0.2 0.2
127.2 123.2 101.5 71.3 3.4 4.5 0.4 0.5
terms of ESS as AM sampling. The HAM sampler is similar to the AM sampler in terms of ESS, but is worse in terms of ESR. Although the number of component indicators sampled for the mixture approximation is considerably smaller for the HAM sampler than for the AM sampler, the algorithm has to compute the acceptance rate as well in each MCMC step of the HAM sampler. This leads to a slightly larger computing time and, as a consequence, to a smaller ESR for the HAM sampler. Since the aggregated MH sampler is faster than the AM sampler and the HAM sampler in all cases, by a factor between 1.1 and 1.7, it has the highest ESR among these three samplers. Interestingly, aggregation in the dRUM representation does not improve mixing compared to the two samplers based on the individual dRUM representation. In the case of AM sampling, ESS is even slightly larger in the individual dRUM representation as compared to the aggregated dRUM representation. The smaller ESS of the data-augmented MH sampler is due to MH rejection rates in order of 30 to 50%. Hence, the main efficiency gain of aggregated sampling in the dRUM representation is achieved by avoiding the computationally wasteful drawing of the individual latent variables. This is contrast to aggregation in the RUM representation, which has been ¨ shown for example, by Fruhwirth-Schnatter et al. [2009, Table 2] not only to reduce computing time, but also the inefficiency factor for the regression parameters compared to the individual RUM representation. Comparing our new samplers to the samplers by Gramacy and Polson [2012] shows that their sampling procedures are outstanding concerning ESS, except for the Beetle data. Nevertheless, both implementations Reglogit-pdf and Reglogit-cdf are extremely slow. As a consequence, our samplers are clearly superior in terms of ESR. For the simulated dataset in Table V they are even better by a factor up to 416. To sum up, in terms of ESR, the aggregated dRUM MH sampler emerges as a clear winner in all real data examples and for both simulated datasets, because of its superior speed. 5. CONCLUDING REMARKS
We have described various efficient MCMC methods for Bayesian inference of binomial regression models which are based on introducing a single latent variable for each binomial observation, using an aggregated dRUM representation. The resulting MCMC samplers are much more efficient in terms of effective sampling size than sampling ¨ based on the aggregated RUM representation [Fruhwirth-Schnatter et al. 2009]. Sampling in the aggregated dRUM representation is clearly superior in terms of computing time to sampling in the individual dRUM representation, in particular, if Ni is large, albeit not in terms of effective sampling size. The effective sampling ACM Transactions on Modeling and Computer Simulation, Vol. 23, No. 1, Article 3, Publication date: January 2013.
3:16
A. Fussl et al.
size turned out to be smaller than for the samplers of Gramacy and Polson [2012]; however, our samplers are considerably faster, which more than compensates for the slightly slower mixing. Based on our experiences with fitting binomial regression models to five different datasets, we recommend to make the data augmented independence MH sampler based on the aggregated dRUM the first choice. Its implementation is extremely simple, it is fast, and the acceptance rate is expected to be high due to the symmetry of the generalized logistic error distribution. The acceptance rate turned out to be higher than 95% when fitting binomial regression models to the datasets investigated in Section 4.2. However, for more complex models involving additional latent variables such as panel data analysis using random-effects models [Crowder 1978], modeling portfolio credit risk using generalized linear mixed models [McNeil and Wendin 2007], or discrete-valued time series using binomial mixed state space models [Czado and Song 2008] joint sampling of all unknown variables using the data-augmented independence MH sampler might lead to much smaller acceptance rates. For such models, we propose to use the AM sampler which was only about 50% slower for the binomial regression models investigated in Section 4.2. Also HAM sampling might be useful, by applying the accurate mixture approximation only to those observations which are responsible for the low acceptance rate. Furthermore, AM sampling based on the aggregated dRUM representation is easily combined with any shrinkage or regularization prior which possesses a representation as a scale-mixture of normals, such as the double exponential or Laplace prior [Park and Casella 2008], which makes Bayesian LASSO type estimation for binomial regression models straightforward. Similarly, spike-and-slab priors may be used for variable selection in binomial regression models similarly as in Holmes and Held [2006], ¨ Tuchler [2008] or Wagner and Duller [2012]. Moreover, handling repeated measurements with more than two response categories {0, 1, . . . , m} is straightforward. Such data typically arise in analyzing market share ¨ ¨ data. Fruhwirth-Schnatter and Fruhwirth [2010] showed that the standard multinomial regression models may be easily estimated if one category, typically category 0, is selected as baseline and a category-specific regression coefficient β k is introduced for k = 1, . . . , m. The regression coefficients β k are then sampled in turn conditional on the remaining coefficients from a regression with a binary outcome variable taking the value 1, iff yi = k, the so-called partial dRUM representation. The aggregation techniques outlined in this article are easily applied to this partial dRUM representation to yield an aggregated partial dRUM representation for repeated multinomial data. Appendix A
P ROOF OF L EMMA 1. Proof of Lemma 1. To derive the conditional distribution of yi |λi , yi we first derive the conditional distribution of exp(–yi )|λi , yi . From (6), we obtain that exp(–yi ) is related to the latent utilities u0,ni and u1,ni appearing in Eqs. (4) and (5) of the individual representation of the binomial distribution: Ni exp(–u1,ni ) exp(–yi ) = exp(–(y1i – y0i )) = n=1 . (20) Ni n=1 exp(–u0,ni ) Given yi , individual binary observations z1i , . . . , zNi ,i are defined such that yi observations are equal to 1 whereas the remaining Ni – yi observations are equal to 0. The conditional distribution of exp(–yi )|λi , yi is then equal to the distribution of ACM Transactions on Modeling and Computer Simulation, Vol. 23, No. 1, Article 3, Publication date: January 2013.
Efficient MCMC for Binomial Logit Models
3:17
Table VI. Weights of the Mixture Components for 1 ≤ ν ≤ 60 ν 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
w1 0.044334 0.063586 0.206397 0.204072 0.204101 0.204665 0.380650 0.376986 0.379651 0.379554 0.401391 0.401546 0.382557 0.384717 0.381464 0.380527 0.380767 0.379757 0.378070 0.380199 0.537661 0.537773 0.538551 0.540635 0.542054 0.544090 0.542922 0.541641 0.528778 0.528906
w2 0.294977 0.450637 0.518711 0.522425 0.523180 0.522745 0.550368 0.554703 0.547164 0.546832 0.543566 0.535463 0.537188 0.541301 0.547412 0.548631 0.550416 0.549739 0.547679 0.549532 0.462339 0.462227 0.461449 0.459365 0.457946 0.455910 0.457078 0.458359 0.471222 0.471094
w3 0.429806 0.405544 0.251858 0.251401 0.250535 0.250057 0.068982 0.068311 0.073185 0.073614 0.055043 0.062991 0.080255 0.073982 0.071124 0.070843 0.068817 0.070504 0.074251 0.070269 — — — — — — — — — —
w4 0.207597 0.076979 0.023034 0.022102 0.022184 0.022533 — — — — — — — — — — — — — — — — — — — — — — — —
w5 0.023286 0.003254 — — — — — — — — — — — — — — — — — — — — — — — — — — — —
ν 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
w1 0.044334 0.063586 0.206397 0.204072 0.204101 0.204665 0.380650 0.376986 0.379651 0.379554 0.401391 0.401546 0.382557 0.384717 0.381464 0.380527 0.380767 0.379757 0.378070 0.380199 0.537661 0.537773 0.538551 0.540635 0.542054 0.544090 0.542922 0.541641 0.528778 0.528906
w2 0.294977 0.450637 0.518711 0.522425 0.523180 0.522745 0.550368 0.554703 0.547164 0.546832 0.543566 0.535463 0.537188 0.541301 0.547412 0.548631 0.550416 0.549739 0.547679 0.549532 0.462339 0.462227 0.461449 0.459365 0.457946 0.455910 0.457078 0.458359 0.471222 0.471094
exp(–yi )|λi , z1i , . . . , zNi ,i . Given z1i , . . . , zNi ,i , the sums appearing in (20) are split into two parts corresponding to the cases zni = 1 and zni = 0, respectively: exp(–u1,ni ) n:zni =1 exp(–u1,ni ) + n:zni =0 . (21) exp(–yi ) = n:zni =1 exp(–u0,ni ) + n:zni =0 exp(–u0,ni ) We now derive a representation of the distribution of all four sums appearing in (21) conditional on z1i , . . . , zNi ,i . While exp(–u0,ni ) ∼ E (1) and exp(–u1,ni )|λi ∼ E λi are independent a priori, these variables are dependent given zni . If zni = 1, then u1,ni > u0,ni , or equivalently, exp(–u1,ni ) < exp(–u0,ni ). Thus, the joint distribution of exp(–u0,ni ), exp(–u1,ni ) conditional on zni = 1 is given by exp(–u1,ni )|λi , zni = 1 ∼ E 1 + λi , exp(–u0,ni )|λi , exp(–u1,ni ), zni = 1 ∼ E (1) + exp(–u1,ni ).
(22) (23)
On the other hand, if zni = 0, then u0,ni > u1,ni , or equivalently, exp(–u0,ni ) < exp(–u1,ni ). Thus, the joint distribution of exp(–u0,ni ), exp(–u1,ni ) conditional on zni = 0 ACM Transactions on Modeling and Computer Simulation, Vol. 23, No. 1, Article 3, Publication date: January 2013.
Table VII. Variances of the Mixture Components for 1 ≤ ν ≤ 60, Scaled to the Actual Variance of the Generalized Logistic Distribution

  ν      σ1²        σ2²        σ3²        σ4²        σ5²
  1   0.793359   1.547433   3.012084   5.922640  11.771158
  2   0.506928   0.887047   1.571432   2.650426   5.097358
  3   0.448036   0.719848   1.119148   1.829246      —
  4   0.348852   0.528010   0.772515   1.194362      —
  5   0.287496   0.417200   0.584942   0.863166      —
  6   0.245301   0.344862   0.468502   0.666312      —
  7   0.231837   0.337453   0.480100      —          —
  8   0.204708   0.290932   0.405807      —          —
  9   0.183854   0.255688   0.345980      —          —
 10   0.166768   0.228056   0.303297      —          —
 11   0.153599   0.208743   0.276401      —          —
 12   0.141696   0.189540   0.244711      —          —
 13   0.130847   0.171908   0.218195      —          —
 14   0.122141   0.159255   0.201209      —          —
 15   0.114388   0.147957   0.186269      —          —
 16   0.107654   0.138141   0.172695      —          —
 17   0.101705   0.129638   0.161216      —          —
 18   0.096376   0.121957   0.150499      —          —
 19   0.091578   0.115044   0.140764      —          —
 20   0.087298   0.109172   0.133168      —          —
 21   0.086093   0.110856      —          —          —
 22   0.082343   0.105413      —          —          —
 23   0.078923   0.100488      —          —          —
 24   0.075801   0.096030      —          —          —
 25   0.072911   0.091932      —          —          —
 26   0.070246   0.088178      —          —          —
 27   0.067726   0.084652      —          —          —
 28   0.065382   0.081390      —          —          —
 29   0.063047   0.078160      —          —          —
 30   0.061028   0.075382      —          —          —
  ⋮   (rows for ν = 31–60 omitted)
Thus, the joint distribution of $\exp(-u_{0,ni})$ and $\exp(-u_{1,ni})$, conditional on $z_{ni} = 0$, is given by
$$\exp(-u_{0,ni}) \mid \lambda_i, z_{ni} = 0 \;\sim\; \mathcal{E}(1 + \lambda_i), \quad (24)$$
$$\exp(-u_{1,ni}) \mid \lambda_i, \exp(-u_{0,ni}), z_{ni} = 0 \;\sim\; \mathcal{E}(\lambda_i) + \exp(-u_{0,ni}). \quad (25)$$
Using (25), the second sum in the numerator of (21) is equal in distribution to the sum of two independent random variables,
$$\sum_{n: z_{ni}=0} \exp(-u_{1,ni}) \,\Big|\, \lambda_i, z_{1i}, \ldots, z_{N_i,i} \;\sim\; \sum_{n: z_{ni}=0} \exp(-u_{0,ni}) \,\Big|\, \lambda_i, z_{1i}, \ldots, z_{N_i,i} \;+\; I\{y_i < N_i\}\, W_i,$$
where $W_i \sim \mathcal{G}(N_i - y_i, \lambda_i)$. Similarly, it follows from (23) that the first sum in the denominator of (21) is equal in distribution to the sum of two independent random variables:
$$\sum_{n: z_{ni}=1} \exp(-u_{0,ni}) \,\Big|\, \lambda_i, z_{1i}, \ldots, z_{N_i,i} \;\sim\; \sum_{n: z_{ni}=1} \exp(-u_{1,ni}) \,\Big|\, \lambda_i, z_{1i}, \ldots, z_{N_i,i} \;+\; I\{y_i > 0\}\, V_i,$$
where $V_i \sim \mathcal{G}(y_i, 1)$. This yields the following representation of $\exp(-\tilde y_i) \mid \lambda_i, z_{1i}, \ldots, z_{N_i,i}$:
$$\exp(-\tilde y_i) \mid \lambda_i, z_{1i}, \ldots, z_{N_i,i} \;\sim\; \frac{U_i + I\{y_i < N_i\}\, W_i}{U_i + I\{y_i > 0\}\, V_i}, \quad (26)$$
where
$$U_i = \sum_{n: z_{ni}=1} \exp(-u_{1,ni}) + \sum_{n: z_{ni}=0} \exp(-u_{0,ni}).$$
From (22) and (24), we obtain that, conditional on $z_{1i}, \ldots, z_{N_i,i}$, the random variable $U_i$ follows a $\mathcal{G}(N_i, 1 + \lambda_i)$ distribution. Evidently, representation (26) is independent of the specific sequence of individual observations $z_{1i}, \ldots, z_{N_i,i}$. By taking the negative logarithm of (26), we obtain that $\tilde y_i \mid \lambda_i, y_i$ is equal in distribution to the random variable in (12). This proves Lemma 1.
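This is the practical payoff of Lemma 1: a single draw of $\tilde y_i$ requires at most three Gamma variates rather than $N_i$ individual utilities. A minimal R sketch of a draw via (26) follows, assuming $\mathcal{G}(a, b)$ denotes the Gamma distribution with shape $a$ and rate $b$ (so that $\mathcal{E}(b) = \mathcal{G}(1, b)$); the function name is ours, not from the binomlogit package:

    ## Draw tilde_y_i | lambda_i, y_i via representation (26):
    ## U ~ G(N, 1 + lambda), W ~ G(N - y, lambda), V ~ G(y, 1).
    draw_agg_utility <- function(y, N, lambda) {
      U <- rgamma(1, shape = N, rate = 1 + lambda)
      W <- if (y < N) rgamma(1, shape = N - y, rate = lambda) else 0  # I{y < N} W
      V <- if (y > 0) rgamma(1, shape = y, rate = 1) else 0           # I{y > 0} V
      -log((U + W) / (U + V))   # negative logarithm of (26)
    }

For the boundary cases $y_i = 0$ and $y_i = N_i$, the indicators in (26) simply drop the corresponding Gamma term.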
APPENDIX B

The Type III generalized logistic distribution with parameter ν, denoted by GL(ν), has the probability density function
$$f(x \mid \nu) = \frac{\Gamma(2\nu)}{\Gamma(\nu)^2} \, \frac{\exp(-\nu x)}{(1 + \exp(-x))^{2\nu}}. \quad (27)$$
It is symmetric around 0, and its variance is equal to
$$\sigma^2(\nu) = 2\,\psi'(\nu) = \frac{\pi^2}{3} - 2 \sum_{k=1}^{\nu-1} \frac{1}{k^2}, \quad (28)$$
where $\psi'(\cdot)$ denotes the trigamma function.
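For integer ν, GL(ν) can be represented as the distribution of $\log(Y_1) - \log(Y_2)$ with $Y_1, Y_2$ independent $\mathcal{G}(\nu, 1)$ variates, which gives a quick empirical check of (28). The following R snippet is an illustration of ours under this representation, not part of the paper's software:

    ## Empirical check of (28): GL(nu) as the log-ratio of two
    ## independent Gamma(nu, 1) variates (integer nu assumed).
    set.seed(2)
    nu <- 5
    x <- log(rgamma(1e6, nu)) - log(rgamma(1e6, nu))
    var(x)                                   # empirical variance
    2 * trigamma(nu)                         # 2 * psi'(nu), Eq. (28)
    pi^2 / 3 - 2 * sum(1 / (1:(nu - 1))^2)   # closed form in (28)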
If the distribution is standardized, it converges to the standard normal distribution for ν → ∞. It is therefore simpler to compute the approximating mixtures in standard measure and to rescale the mixtures afterwards.
For ν = 1, the approximation in (15) is identical to the mixture approximation to the logistic distribution applied in Frühwirth-Schnatter and Frühwirth [2010]; this mixture has five components. With increasing ν, fewer components are required to achieve a satisfactory approximation; for ν > 20, two components are sufficient. We have estimated the mixture parameters (weights and variances) by minimizing the maximal absolute difference $d_{KS}$ between the cumulative distribution function (cdf) of GL(ν) and the mixture cdf. The minimization was carried out with the simplex algorithm, starting from the parameters of the previous mixture. For ν ≤ 60, the mixture parameters have been stored individually for each ν; they are listed in Tables VI and VII. The variances have been scaled to the variance of GL(ν) in Eq. (28). For 61 ≤ ν ≤ 600, the two variances can be parameterized by rational functions:
$$\sigma_1^2(\nu) = \sigma^2(\nu)\, \frac{-0.4973255 \cdot 10^{-7}\, \nu^2 + 0.024036101167\, \nu + 1}{0.024403981536\, \nu + 1.165357272312}, \quad (29)$$
$$\sigma_2^2(\nu) = \sigma^2(\nu)\, \frac{-0.42821294 \cdot 10^{-6}\, \nu^2 + 0.027883528610\, \nu + 1}{0.027266080794\, \nu + 0.843603549826}. \quad (30)$$
The weights can then be computed from the variances via the constraint that the total variance is equal to $\sigma^2(\nu)$:
$$w_1(\nu) = \frac{\sigma^2(\nu) - \sigma_2^2(\nu)}{\sigma_1^2(\nu) - \sigma_2^2(\nu)}, \qquad w_2(\nu) = 1 - w_1(\nu). \quad (31)$$
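As an illustration, the parameterization (29)–(31) can be evaluated directly. The helper below is a sketch of ours (the binomlogit package ships its own implementation) and relies on $\sigma^2(\nu) = 2\psi'(\nu)$ from (28):

    ## Weights and variances of the 2-component mixture for 61 <= nu <= 600,
    ## evaluated from Eqs. (28)-(31).
    gl_mixture_params <- function(nu) {
      stopifnot(nu >= 61, nu <= 600)
      vtot <- 2 * trigamma(nu)                                  # sigma^2(nu), Eq. (28)
      v1 <- vtot * (-0.4973255e-7 * nu^2 + 0.024036101167 * nu + 1) /
                   (0.024403981536 * nu + 1.165357272312)       # Eq. (29)
      v2 <- vtot * (-0.42821294e-6 * nu^2 + 0.027883528610 * nu + 1) /
                   (0.027266080794 * nu + 0.843603549826)       # Eq. (30)
      w1 <- (vtot - v2) / (v1 - v2)                             # Eq. (31)
      list(weights = c(w1, 1 - w1), variances = c(v1, v2))
    }

By construction, the two weights sum to 1 and the mixture variance equals $\sigma^2(\nu)$.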
The resulting distance $d_{KS}$ is smaller than $10^{-5}$ throughout the range 1 ≤ ν ≤ 600 (see Figure 1).

[Figure 1. The distance $d_{KS}$ between the Gaussian mixture approximation and the generalized logistic distribution GL(ν) for 1 ≤ ν ≤ 600.]

For ν > 600, the GL(ν) distribution is approximated by a single Gaussian. The resulting deviation is maximal at ν = 601 ($d_{KS} \approx 4 \cdot 10^{-5}$) and converges to 0 for ν → ∞. The function that computes the mixture parameters for a given ν is part of the R package binomlogit. An equivalent MATLAB function is available from the authors on request.

REFERENCES

Albert, J. H. and Chib, S. 1993. Bayesian analysis of binary and polychotomous response data. J. Amer. Statis. Assn. 88, 669–679.
Crowder, M. J. 1978. Beta-binomial ANOVA for proportions. Appl. Stat. 27, 34–37.
Cutler, C. D. 1992. kth nearest neighbors and the generalized logistic distribution. In Handbook of the Logistic Distribution, N. Balakrishnan Ed., Marcel Dekker, New York, 512–522.
Czado, C. and Raftery, A. E. 2006. Choosing the link function and accounting for link uncertainty in generalized linear models using Bayes factors. Statis. Papers 47, 419–442.
Czado, C. and Song, P. X.-K. 2008. State space mixed models for longitudinal observations with binary and binomial responses. Statis. Papers 49, 691–714.
Fahrmeir, L. and Tutz, G. 2001. Multivariate Statistical Modelling Based on Generalized Linear Models 2nd Ed. Springer Series in Statistics. Springer, Berlin.
Frühwirth-Schnatter, S. and Frühwirth, R. 2007. Auxiliary mixture sampling with applications to logistic models. Computat. Statis. Data Anal. 51, 3509–3528.
Frühwirth-Schnatter, S. and Frühwirth, R. 2010. Data augmentation and MCMC for binary and multinomial logit models. In Statistical Modelling and Regression Structures – Festschrift in Honour of Ludwig Fahrmeir, T. Kneib and G. Tutz Eds., Physica-Verlag, Heidelberg, 111–132.
Frühwirth-Schnatter, S. and Frühwirth, R. 2012. Bayesian analysis of the multinomial model. Austrian J. Statis. 41, 27–43.
Frühwirth-Schnatter, S., Frühwirth, R., Held, L., and Rue, H. 2009. Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data. Statis. Comput. 19, 479–492.
Frühwirth-Schnatter, S. and Wagner, H. 2006. Auxiliary mixture sampling for parameter-driven models of time series of counts with applications to state space modelling. Biometrika 93, 827–841.
Geyer, C. 1992. Practical Markov chain Monte Carlo. Statis. Sci. 7, 473–511.
Gramacy, R. B. and Polson, N. G. 2012. Simulation-based regularized logistic regression. Bayesian Anal. 7, 3, 567–590.
Hilbe, J. M. 2007. Negative Binomial Regression. Cambridge University Press, Cambridge.
Holmes, C. C. and Held, L. 2006. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Anal. 1, 145–168.
Kass, R. E., Carlin, B., Gelman, A., and Neal, R. 1998. Markov chain Monte Carlo in practice: A roundtable discussion. Amer. Statist. 52, 93–100.
McFadden, D. 1974. Conditional logit analysis of qualitative choice behaviour. In Frontiers of Econometrics, P. Zarembka Ed., Academic, New York, 105–142.
McNeil, A. J. and Wendin, J. 2007. Bayesian inference for generalized linear mixed models of portfolio credit risk. J. Empir. Finan. 14, 131–149.
Park, T. and Casella, G. 2008. The Bayesian Lasso. J. Amer. Statis. Assn. 103, 681–686.
Scott, S. L. 2011. Data augmentation, frequentist estimation, and the Bayesian analysis of multinomial logit models. Statis. Papers 52, 87–109.
Shephard, N. 1994. Partial non-Gaussian state space. Biometrika 81, 115–131.
Tüchler, R. 2008. Bayesian variable selection for logistic models using auxiliary mixture sampling. J. Computat. Graphic. Statis. 17, 76–94.
Wagner, H. and Duller, C. 2012. Bayesian model selection for logistic regression models with random intercept. Computat. Statis. Data Anal. 56, 1256–1274.
Zelterman, D. and Balakrishnan, N. 1992. Univariate generalized logistic distributions. In Handbook of the Logistic Distribution, N. Balakrishnan Ed., Marcel Dekker, New York, 209–221.

Received October 2011; revised April 2012; accepted June 2012