Efficient Bounds for the Softmax Function and Applications to Approximate Inference in Hybrid Models

Guillaume Bouchard
Xerox Research Center Europe
6, chemin de Maupertuis, 38240 Meylan

May 31, 2008

Abstract

The softmax link is used in many probabilistic models dealing with both discrete and continuous data. However, efficient Bayesian inference for this type of model is still an open problem, due to the lack of an efficient upper bound for the sum of exponentials. We propose three different bounds for this function and study their approximation properties. We give a direct application to the Bayesian treatment of multiclass logistic regression and discuss its generalization to deterministic approximate inference in hybrid probabilistic graphical models.
The softmax function is the extension of the sigmoid function to more than two values. Its role is of central importance in many non-linear probabilistic models, in particular models that deal jointly with discrete and continuous data. Variational approximations based on the minimization of the Kullback-Leibler divergence are one of the most popular tools in large-scale Bayesian inference. In recent years, generic tools such as VIBES [1] have been proposed for inference and learning in graphical models using mean-field approximations. For graphs having discrete nodes with continuous parents, the direct mean-field update cannot be applied, since there is no conjugate family for the multinomial logistic model. Local variational approximations have been proposed in the case of binary variables [2]. They are based on a quadratic lower bound for the log of the sigmoid function, and they are used in directed probabilistic graphical models having binary variables with continuous parents [3, 1]. However, for more than two categories, the inference problem remains unsolved. In Section 1 we give a detailed overview of the techniques used by different authors to deal with Bayesian inference for models involving softmax links. In Section 2 we derive three simple bounds for the log-sum of exponentials and discuss their approximation properties. In Section 3 we derive the corresponding upper bounds on expectations and compare them numerically. Finally, in Section 4 we treat a typical application, the variational approximation of Bayesian multiclass logistic regression, and we discuss its extension to more general deterministic approximate inference algorithms in graphical models.
1 Previous work
To the author's knowledge, insights towards a general solution can be found in the following works:

• Gaussian process classification [4]. The author defines a technique to design a quadratic bound to the log-sum of exponentials, but does not prove that this approximation is actually an upper bound.

• The reverse-Jensen bound of T. Jebara [5] provides an upper bound for the log-sum of exponential densities. Although this could be applied in our setting, we focus on much simpler approaches that are not limited to exponential distributions.

• An interesting discussion about the possibility of obtaining tree-based approximations of the softmax link is given by C. Bishop [6]. No experiment or implementation detail has been provided, so we assume that this research direction is still open.

• A local sampling loop is used for multiclass probit regression [7] inside a variational algorithm. The idea of having a local importance sampler inside a variational algorithm was first formalized in [8].

• Detecting trends in textual data is a typical application where continuous variables (the topic frequencies) evolve over time. Standard models based on the Dirichlet distribution do not allow an easy formulation of the temporal dynamics. In the variational inference algorithm for correlated topic models, Blei and Lafferty [9] make use of a lower bound on the likelihood based on a simple linear approximation of the logarithm function. We think that this trick should be advertised and studied in more detail. In fact, we will see in the following that this idea can be extended to the multidimensional setting and to any hybrid graphical model.

• The majorization-maximization (MM) algorithm for sparse multinomial logistic regression [10] is also based on a bound on the likelihood. The paper describes an iterative algorithm that converges to the MAP solution of a multinomial logistic regression. The majorization step requires an upper bound on the Hessian of the likelihood, which turns out to be directly applicable to the fully Bayesian setting described below.
2 Bounds for the log-sum of exponentials
We propose several approaches to bound the sum of exponentiated variables.

Concavity of the log. For any $x \in \mathbb{R}^K$ and any $\phi > 0$,

$\log \sum_{k=1}^K e^{x_k} \;\le\; \phi \sum_{k=1}^K e^{x_k} - \log \phi - 1$,    (1)

where the equality holds iff $\phi = \left(\sum_{k=1}^K e^{x_k}\right)^{-1}$. The main advantage of this bound is that it is possible to compute the expectation of the right-hand side for many classical distributions, since $\mathbb{E}_Q[e^{x_k}]$ is the moment generating function of the distribution $Q$. This will be applied to the multivariate Gaussian distribution in the next section. This approximation, although very simple, has been used efficiently in the correlated topic model of Blei and Lafferty [9] to link the continuous dynamics (the topic trends) to the discrete variables corresponding to topics.

Quadratic upper bound. To find a quadratic upper bound on the likelihood of a multiclass logistic regression model, several authors [11, 10] use the fact that if $H$ is the Hessian of $\log \sum_{k=1}^K e^{x_k}$, then $I_K - \frac{1}{K}\mathbf{1}\mathbf{1}^T - H$ is a positive semi-definite matrix. This leads to the following quadratic approximation. For any $x \in \mathbb{R}^K$ and any $\chi \in \mathbb{R}^K$,

$\log \sum_{k=1}^K e^{x_k} \;\le\; \frac{1}{2}\left[\sum_{k=1}^K (x_k - \chi_k)^2 - \frac{1}{K}\Big(\sum_{k=1}^K (x_k - \chi_k)\Big)^2\right] + \sum_{k=1}^K (x_k - \chi_k)\frac{e^{\chi_k}}{\sum_{k'=1}^K e^{\chi_{k'}}} + \log \sum_{k=1}^K e^{\chi_k}$.    (2)
This bound is tight on the one-dimensional affine space $\{\chi + \rho\mathbf{1},\ \rho \in \mathbb{R}\}$. However, since the worst curvature over the whole space is used, the approximation can be inefficient when this bound is integrated. This will be shown in the numerical experiments.

Product of sigmoids. To obtain a better local quadratic approximation, we propose a new bound for the log-sum of exponentials obtained in two simple stages: first, the sum of exponentials is upper bounded by a product of sigmoids; then, the standard quadratic bound on $\log(1 + e^x)$ is used to obtain the final majorant. For any $x \in \mathbb{R}^K$ and any $\alpha \in \mathbb{R}$,

$\log \sum_{k=1}^K e^{x_k} \;\le\; \alpha + \sum_{k=1}^K \log\left(1 + e^{x_k - \alpha}\right)$.    (3)
This is easily shown from the fact that $\prod_{k=1}^K (1 + e^{x_k - \alpha}) \ge \sum_{k=1}^K e^{x_k - \alpha} = e^{-\alpha}\sum_{k=1}^K e^{x_k}$. The key property of this bound is that its asymptotes are parallel to those of the log-sum of exponentials in most directions.¹ This will be relevant in the next section, when one needs to compute the expectation of this function for a multivariate random variable $x$ with high variance. We are now able to apply the standard quadratic bound [2, 12] for $\log(1 + e^x)$:

$\log(1 + e^x) \;\le\; \lambda(\xi)(x^2 - \xi^2) + \frac{x - \xi}{2} + \log(1 + e^{\xi})$,    (4)
for all $\xi \in \mathbb{R}$. It is applied inside each term of the sum in Equation (3). For any $x \in \mathbb{R}^K$, any $\alpha \in \mathbb{R}$, and any $\xi \in [0, \infty)^K$:

$\log \sum_{k=1}^K e^{x_k} \;\le\; \alpha + \sum_{k=1}^K \left[\frac{x_k - \alpha - \xi_k}{2} + \lambda(\xi_k)\left((x_k - \alpha)^2 - \xi_k^2\right) + \log(1 + e^{\xi_k})\right]$,    (5)

where $\lambda(\xi) = \frac{1}{2\xi}\left[\frac{1}{1 + e^{-\xi}} - \frac{1}{2}\right]$.
¹ More exactly, applying the bound to $ax$ with $a \to \infty$, the difference between the right-hand and left-hand sides of Equation (3) tends to a constant, provided that at least one $x_k$ is positive and $x_k \ne x_{k'}$ for all $k \ne k'$.
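To make the three bounds concrete, the following short Python sketch (not part of the original paper; all function and variable names are ours) evaluates the log-sum of exponentials and the upper bounds (1), (2) and (5) on a random vector, so that the inequalities can be checked numerically.

    import numpy as np

    def lse(x):
        # log-sum-exp, computed stably
        m = x.max()
        return m + np.log(np.exp(x - m).sum())

    def bound_log(x, phi):
        # bound (1): concavity of the log, valid for any phi > 0
        return phi * np.exp(x).sum() - np.log(phi) - 1.0

    def bound_quadratic(x, chi):
        # bound (2): quadratic bound built from the worst curvature, expanded around chi
        p = np.exp(chi - lse(chi))                      # softmax at chi
        d = x - chi
        quad = 0.5 * ((d ** 2).sum() - d.sum() ** 2 / len(x))
        return quad + d @ p + lse(chi)

    def lam(xi):
        # lambda(xi) = (1 / (2 xi)) * (sigmoid(xi) - 1/2)
        return (1.0 / (1.0 + np.exp(-xi)) - 0.5) / (2.0 * xi)

    def bound_double(x, alpha, xi):
        # bound (5): product of sigmoids followed by the quadratic bound (4)
        return alpha + np.sum((x - alpha - xi) / 2.0
                              + lam(xi) * ((x - alpha) ** 2 - xi ** 2)
                              + np.log1p(np.exp(xi)))

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)
    print(lse(x))
    print(bound_log(x, 1.0 / np.exp(x).sum()))          # tight at the optimal phi
    print(bound_quadratic(x, chi=x + 0.3))              # tight on the space chi + rho*1
    print(bound_double(x, alpha=x.mean(), xi=np.abs(x - x.mean())))

With these choices of the variational parameters, the first two bounds match the log-sum of exponentials up to numerical precision, while bound (5) reduces to the product-of-sigmoids bound (3).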
bound (2):
  $A = \frac{1}{2}\left(I_K - \frac{1}{K}\mathbf{1}\mathbf{1}^T\right)$
  $b = \left(\frac{e^{\chi_k}}{\sum_{k'=1}^K e^{\chi_{k'}}} + \frac{\chi^T\mathbf{1}}{K} - \chi_k\right)_{k=1}^K$
  $c = \frac{1}{2}\sum_{k=1}^K \chi_k^2 - \frac{(\chi^T\mathbf{1})^2}{2K} - \sum_{k=1}^K \chi_k \frac{e^{\chi_k}}{\sum_{k'=1}^K e^{\chi_{k'}}} + \log \sum_{k=1}^K e^{\chi_k}$

bound (5):
  $A = \mathrm{diag}\left(\lambda(\xi_k)\right)_{k=1}^K$
  $b = \left(\frac{1}{2} - 2\alpha\lambda(\xi_k)\right)_{k=1}^K$
  $c = \alpha + \sum_{k=1}^K \left[-\frac{\alpha + \xi_k}{2} + \lambda(\xi_k)(\alpha^2 - \xi_k^2) + \log(1 + e^{\xi_k})\right]$

Table 1: Summary of the quadratic upper bounds, written in the form $\log \sum_{k=1}^K e^{x_k} \le x^T A x + b^T x + c$.
3 Upper bounds for the expectation
Let $Q(\beta)$ be the probability density function of a given multidimensional distribution on $\mathbb{R}^{d \times K}$ and let $x$ be a vector of $\mathbb{R}^d$. Evaluating $\gamma = \mathbb{E}_Q\left[\log \sum_{k=1}^K e^{\beta_k^T x}\right]$ is a central task for many inference problems, such as Gaussian process classification [4], Bayesian multiclass logistic regression, and more generally any deterministic approximation of probabilistic models dealing with discrete variables conditioned on continuous ones. A standard approach is the Laplace method, which uses a second-order Taylor expansion of $\log \sum_{k=1}^K e^{x_k}$ at the mode of the distribution $Q$. However, this technique is known to give highly skewed results when the variance of $Q$ is large [2]. A variational technique based on the minimization of an upper bound of $\gamma$ can be used to obtain better approximations. For simplicity, we consider in the following that $Q(\beta_k)$ is a multivariate normal distribution with mean $\mu_k$ and covariance $\Sigma_k$. The results can easily be extended to other multivariate distributions with bounded variances. Using the moment generating function of the multivariate normal distribution, $\mathbb{E}_Q\left[e^{\beta_k^T x}\right] = e^{\mu_k^T x + \frac{1}{2} x^T \Sigma_k x}$, we obtain:
$\gamma \;\le\; \phi \sum_{k=1}^K e^{\mu_k^T x + \frac{1}{2} x^T \Sigma_k x} - \log \phi - 1$    (6)
for every $\phi > 0$. Minimizing the right-hand side with respect to $\phi$ leads to the upper bound $\log \sum_{k=1}^K e^{\mu_k^T x + \frac{1}{2} x^T \Sigma_k x}$, which shows that the bound becomes accurate when the variance of $Q$ is small. In the variational inference of the correlated topic model, Blei and Lafferty [9] make use of this bound for univariate Gaussians only. Taking the expectation of the quadratic bound (2) gives:

$\gamma \;\le\; \frac{K-1}{2}\, x^T \bar{\Sigma} x + \frac{1}{2}\left[\sum_{k=1}^K (\mu_k^T x - \chi_k)^2 - K(\bar{\mu}^T x - \bar{\chi})^2\right] + \sum_{k=1}^K (\mu_k^T x - \chi_k)\frac{e^{\chi_k}}{\sum_{k'=1}^K e^{\chi_{k'}}} + \log \sum_{k=1}^K e^{\chi_k}$,    (7)

where $\bar{\Sigma} = \frac{1}{K}\sum_{k=1}^K \Sigma_k$, $\bar{\mu} = \frac{1}{K}\sum_{k=1}^K \mu_k$ and $\bar{\chi} = \frac{1}{K}\sum_{k=1}^K \chi_k$. A minimization of the right-hand side with respect to $\chi$ gives the solution $\chi_k = \mu_k^T x$. At this point, the difference between $\gamma$ and its upper bound is at most $\frac{K-1}{2}\, x^T \bar{\Sigma} x$. This means that the bound is tight if the distribution lies in a manifold orthogonal to $x$.
For equation (5) we have:

$\gamma \;\le\; \alpha\left(1 - \frac{K}{2}\right) + \sum_{k=1}^K \left[\lambda(\xi_k)\left(x^T \Sigma_k x + (\mu_k^T x)^2\right) + \left(\frac{1}{2} - 2\alpha\lambda(\xi_k)\right)\mu_k^T x - \frac{\xi_k}{2} + \lambda(\xi_k)(\alpha^2 - \xi_k^2) + \log(1 + e^{\xi_k})\right]$.    (8)

The minimization of the upper bound with respect to $\xi$ gives

$\xi_k^2 = x^T \Sigma_k x + (\mu_k^T x)^2 + \alpha^2 - 2\alpha\,\mu_k^T x$,    (9)

for $k = 1, \cdots, K$ (see [2, 12] for a similar derivation). The minimization with respect to $\alpha$ gives

$\alpha = \dfrac{\frac{1}{2}\left(\frac{K}{2} - 1\right) + \sum_{k=1}^K \lambda(\xi_k)\,\mu_k^T x}{\sum_{k=1}^K \lambda(\xi_k)}$.    (10)
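The following Python sketch (our own illustration, not the paper's code; the helper names are hypothetical) puts these formulas together: it estimates $\gamma$ by Monte Carlo, evaluates the log-based bound (6) at its optimal $\phi$, the quadratic bound (7) at the stationary point $\chi_k = \mu_k^T x$, and the double-majorization bound (8) optimized by iterating the updates (9) and (10).

    import numpy as np

    def lam(xi):
        return (1.0 / (1.0 + np.exp(-xi)) - 0.5) / (2.0 * xi)

    def gamma_mc(x, mus, Sigmas, n_samples=100_000, seed=0):
        # Monte-Carlo estimate of gamma = E_Q[log sum_k exp(beta_k^T x)]; only the
        # scalar projections beta_k^T x ~ N(mu_k^T x, x^T Sigma_k x) are needed
        rng = np.random.default_rng(seed)
        m = np.array([mu @ x for mu in mus])
        s = np.array([np.sqrt(x @ S @ x) for S in Sigmas])
        z = m + rng.normal(size=(n_samples, len(mus))) * s
        zmax = z.max(axis=1, keepdims=True)
        return float(np.mean(zmax + np.log(np.exp(z - zmax).sum(axis=1))))

    def bound_log(x, mus, Sigmas):
        # optimal value of bound (6): log sum_k exp(mu_k^T x + x^T Sigma_k x / 2)
        v = np.array([mu @ x + 0.5 * (x @ S @ x) for mu, S in zip(mus, Sigmas)])
        vmax = v.max()
        return float(vmax + np.log(np.exp(v - vmax).sum()))

    def bound_quad(x, mus, Sigmas):
        # bound (7) evaluated at its stationary point chi_k = mu_k^T x
        K = len(mus)
        Sigma_bar = sum(Sigmas) / K
        m = np.array([mu @ x for mu in mus])
        mmax = m.max()
        return float(0.5 * (K - 1) * (x @ Sigma_bar @ x)
                     + mmax + np.log(np.exp(m - mmax).sum()))

    def bound_double(x, mus, Sigmas, n_iter=50):
        # bound (8), with (alpha, xi) optimized by the fixed-point updates (9)-(10)
        K = len(mus)
        m = np.array([mu @ x for mu in mus])
        v = np.array([x @ S @ x for S in Sigmas])
        alpha = 0.0
        for _ in range(n_iter):
            xi = np.sqrt(v + m ** 2 + alpha ** 2 - 2.0 * alpha * m)    # update (9)
            l = lam(xi)
            alpha = (0.5 * (K / 2.0 - 1.0) + (l * m).sum()) / l.sum()  # update (10)
        return float(alpha + np.sum((m - alpha - xi) / 2.0
                                    + l * (v + (m - alpha) ** 2 - xi ** 2)
                                    + np.log1p(np.exp(xi))))

On any test problem, each of the three bound functions should return a value no smaller than the Monte-Carlo estimate of $\gamma$.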
To get a general idea of the quality of the approximations that can be obtained with these bounds, we ran several experiments comparing the approximations based on the linear bound of the log (6), on the quadratic upper bound for the softmax (7), and on the proposed upper bound based on a double majorization (8). The results are summarized in Table 2. For every experiment, we randomly sampled a vector $x \in \mathbb{R}^d$, means $\mu_k$ according to a standardized normal distribution, and covariance matrices from a Wishart distribution with identity scale matrix ($10 I_d$ for the experiments labelled var=10). The dimension $d$ of the vector $x$ varied from 2 to 100, and the number of classes ranged from $K = 2$ (the classical binary case) to $K = 1000$. The Wishart distribution had 100 degrees of freedom ($d + 1$ in the highly correlated case, high corr). In the high discrepancy case (high disc), the vector $x$ has many extreme values: it was generated using a Student t distribution with 10 degrees of freedom. In the experiments labelled samecov, the covariance of every $Q(\beta_k)$ was the same. We generated one hundred independent replications of each experiment. Every bound was compared to a Monte-Carlo estimate based on 100 000 samples (its variance was negligible compared to the error of the bounds). We first notice that the variance of $Q$ directly influences the choice of the bounding method: for small variances, the linear approximation of the log seems to give the best results. For large variances, the proposed bound is much more accurate than the other methods. This seems to be mainly due to the fact that the approximation is asymptotically correct, so that it is robust to cases where the distribution is not precisely localized. For truly Bayesian models, having a bound that is accurate in the high-variance cases is crucial, since it enables a correct (i.e. not too biased) estimation of the uncertainty. It is worth noting that the quality of the approximations does not degrade too much when dealing with a large number of classes or high-dimensional data. Another fact revealed by our experiments is that the quadratic upper bound based on the worst curvature of the Hessian (which is often used to find the MAP solution of multiclass logistic regression) gives poor results in most cases.
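As an illustration of this protocol (a sketch under our own simplifying choices, not the exact code used to produce Table 2), one replication can be generated as follows and fed to the functions of the previous sketch.

    import numpy as np

    def sample_problem(d, K, var_scale=1.0, df=100, seed=0):
        # x and the means are standard normal; each covariance is drawn from a
        # Wishart distribution with scale var_scale * I_d and df degrees of freedom
        rng = np.random.default_rng(seed)
        x = rng.normal(size=d)
        mus = [rng.normal(size=d) for _ in range(K)]
        Sigmas = []
        for _ in range(K):
            G = np.sqrt(var_scale) * rng.normal(size=(df, d))
            Sigmas.append(G.T @ G)                      # Wishart(df, var_scale * I_d)
        return x, mus, Sigmas

    x, mus, Sigmas = sample_problem(d=10, K=10, var_scale=10.0)
    # gamma_mc, bound_log, bound_quad and bound_double from the previous sketch can
    # now be compared, e.g. via the relative absolute error against gamma_mc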
method                                        log               local quadratic        double maj.
d=2, K=2                                      1.69852 (7)       4.2913 (2e+01)         4.17173 (3e+01)
d=2, K=10                                     0.0909539 (0.1)   4.00055 (3)            0.544315 (0.1)
d=10, K=10                                    0.523273 (0.2)    13.9242 (4)            0.742303 (0.2)
d=10, K=100                                   0.189525 (0.1)    84.4871 (2e+01)        2.25458 (0.2)
d=10, K=1000                                  0.208771 (0.1)    88.7137 (2e+01)        2.28424 (0.2)
d=100, K=1000                                 0.220016 (0.1)    91.2663 (2e+01)        2.30886 (0.2)
d=2, K=2, var=10                              11.097 (7)        21.6069 (1e+01)        2.89464 (2)
d=2, K=10, var=10                             4.12899 (2)       73.147 (3e+01)         1.94252 (0.5)
d=10, K=10, var=10                            11.9611 (3)       185.402 (4e+01)        2.80779 (0.2)
d=10, K=10, var=10, samecov                   10.0631 (3)       196.382 (5e+01)        2.98899 (0.2)
d=10, K=10, var=10, samecov, high disc        28689.3 (8e+04)   516423 (1e+06)         3.57032 (0.4)
d=10, K=10, var=10, samecov, high corr        623176 (4e+06)    1.12172e+07 (8e+07)    3.594 (0.1)

Table 2: Relative absolute error of the various bounds on $\mathbb{E}_Q\left[\log \sum_k e^{\beta_k^T x}\right]$, where $Q$ is a normal distribution. Standard deviations are given in parentheses.
4 Bayesian multinomial logistic regression
We consider a multinomial logistic model with discrete observations $y = (y_1, \cdots, y_n) \in \{1, \cdots, K\}^n$ conditioned on continuous variables $x = (x_1, \cdots, x_n) \in \mathbb{R}^{n \times d}$ via a softmax function:

$P(y = k \mid x; \beta) = \dfrac{e^{\beta_k^T x}}{\sum_{k'=1}^K e^{\beta_{k'}^T x}}$.    (11)
For any choice of the prior, there is no closed-form expression for the posterior distribution $P(\beta \mid x, y)$. Variational approximations are based on the minimization of $\mathrm{KL}(Q(\beta) \,\|\, P(\beta \mid y, x))$ over a class of distributions $\mathcal{F}_Q$ for which the computation is tractable. This problem is equivalent to maximizing the following lower bound $\mathcal{L}$ on $\log P(y \mid x)$:

$\log P(y \mid x) \;\ge\; \mathbb{E}_Q[\log P(y \mid x, \beta)] - \mathrm{KL}(Q(\beta) \,\|\, P(\beta)) \;=:\; \mathcal{L}(\mu, \Sigma)$    (12)
$\;=\; \sum_{k=1}^K \sum_{i: y_i = k} \mathbb{E}_Q\left[\beta_k^T x_i\right] - \sum_{i=1}^n \mathbb{E}_Q\left[\log \sum_{k'=1}^K e^{\beta_{k'}^T x_i}\right] - \mathrm{KL}(Q(\beta) \,\|\, P(\beta))$.
Using a quadratic bound for the second expectation, we obtain a lower bound $\mathcal{F}(\mu, \Sigma, \xi)$ of $\mathcal{L}(\mu, \Sigma)$:

$\mathcal{F}(\mu, \Sigma, \xi) = -\frac{1}{2}\sum_{k=1}^K \mathrm{tr}\left(A_k\, \mathbb{E}_Q\left[\beta_k \beta_k^T\right]\right) + \sum_{k=1}^K b_k^T\, \mathbb{E}_Q[\beta_k] - c - \mathrm{KL}(Q(\beta) \,\|\, P(\beta))$,
where $A_k$, $b_k$ and $c$ are obtained from the quadratic bounds of Table 1, summed over the data points (the label term $y_{ik}$, equal to 1 if $y_i = k$ and 0 otherwise, is absorbed into $b_k$). For the proposed bound based on the double majorization, we have:

$A_k = 2 \sum_i \lambda(\xi_{ik})\, x_i x_i^T$,
$b_k = \sum_i \left(y_{ik} - \frac{1}{2} + 2\alpha_i \lambda(\xi_{ik})\right) x_i$,
$c = \sum_i \alpha_i\left(1 - \frac{K}{2}\right) + \sum_{i,k} \left[-\frac{\xi_{ik}}{2} + \lambda(\xi_{ik})(\alpha_i^2 - \xi_{ik}^2) + \log(1 + e^{\xi_{ik}})\right]$.
Assuming independent Gaussian priors $P(\beta_k) = \mathcal{N}(\bar{\mu}_k, \bar{\Sigma}_k)$ for the parameters $\beta_k$, $k = 1, \cdots, K$, the KL divergence between the Gaussians is

$\mathrm{KL}(Q(\beta) \,\|\, P(\beta)) = \frac{1}{2}\sum_{k=1}^K \left[\log \frac{|\bar{\Sigma}_k|}{|\Sigma_k|} + \mathrm{tr}\left(\bar{\Sigma}_k^{-1}\Sigma_k\right) + (\mu_k - \bar{\mu}_k)^T \bar{\Sigma}_k^{-1} (\mu_k - \bar{\mu}_k) - d\right]$,
so that we have a closed-form solution for the maximum of $\mathcal{F}(\mu, \Sigma, \xi)$ with respect to $\mu$ and $\Sigma$:

$\hat{\Sigma}_k = \left(A_k + \bar{\Sigma}_k^{-1}\right)^{-1}$,    (13)
$\hat{\mu}_k = \hat{\Sigma}_k\left(b_k + \bar{\Sigma}_k^{-1}\bar{\mu}_k\right)$.    (14)

In the case of the proposed quadratic upper bound to the log-sum of exponentials, the objective is

$\mathcal{F}(\mu, \Sigma, \xi, \alpha) = \sum_i \alpha_i\left(\frac{K}{2} - 1\right) + \sum_{i,k}\left[\mu_k^T x_i\left(y_{ik} - \frac{1}{2} + 2\alpha_i\lambda(\xi_{ik})\right) - \lambda(\xi_{ik})\left(x_i^T\Sigma_k x_i + (\mu_k^T x_i)^2\right) + \frac{\xi_{ik}}{2} - \lambda(\xi_{ik})(\alpha_i^2 - \xi_{ik}^2) - \log(1 + e^{\xi_{ik}})\right] + \frac{1}{2}\sum_{k=1}^K\left[\log\frac{|\Sigma_k|}{|\bar{\Sigma}_k|} - \mathrm{tr}\left(\bar{\Sigma}_k^{-1}\Sigma_k\right) - (\mu_k - \bar{\mu}_k)^T\bar{\Sigma}_k^{-1}(\mu_k - \bar{\mu}_k) + d\right]$,    (15)
which gives the updates

$\hat{\Sigma}_k^{-1} = \bar{\Sigma}_k^{-1} + 2\sum_i \lambda(\xi_{ik})\, x_i x_i^T$,    (16)

$\hat{\mu}_k = \hat{\Sigma}_k\left[\bar{\Sigma}_k^{-1}\bar{\mu}_k + \sum_i\left(y_{ik} - \frac{1}{2} + 2\alpha_i\lambda(\xi_{ik})\right) x_i\right]$.    (17)
Note that these variational updates are very similar to those of the binary case reported by Jaakkola and Jordan, where the updates are

$\hat{\Sigma}^{-1} = \bar{\Sigma}^{-1} + 2\sum_i \lambda(\xi_i)\, x_i x_i^T$,    (18)

$\hat{\mu} = \hat{\Sigma}\left[\bar{\Sigma}^{-1}\bar{\mu} + \sum_i\left(y_i - \frac{1}{2}\right) x_i\right]$.    (19)

In fact, when $K = 2$ the optimal value of the $\alpha_i$'s is zero and we recover exactly Equations (18) and (19).
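As an illustration, here is a minimal Python sketch of the resulting variational algorithm, assuming for simplicity a single shared Gaussian prior N(prior_mean, prior_cov) for all the $\beta_k$ and a one-hot encoded label matrix Y; all names are ours and this is not the author's reference implementation. It alternates the per-point updates (9)-(10) with the Gaussian updates (16)-(17).

    import numpy as np

    def lam(xi):
        return (1.0 / (1.0 + np.exp(-xi)) - 0.5) / (2.0 * xi)

    def fit_variational_softmax(X, Y, prior_mean, prior_cov, n_iter=20):
        # X: (n, d) inputs, Y: (n, K) one-hot labels
        n, d = X.shape
        K = Y.shape[1]
        prior_prec = np.linalg.inv(prior_cov)
        mu = np.tile(prior_mean, (K, 1))            # variational means, (K, d)
        Sigma = np.tile(prior_cov, (K, 1, 1))       # variational covariances, (K, d, d)
        alpha = np.zeros(n)
        for _ in range(n_iter):
            # variational parameters: updates (9) and (10), one per data point
            m = X @ mu.T                                       # mu_k^T x_i, (n, K)
            v = np.einsum('id,kde,ie->ik', X, Sigma, X)        # x_i^T Sigma_k x_i, (n, K)
            xi = np.sqrt(v + m ** 2 + alpha[:, None] ** 2 - 2.0 * alpha[:, None] * m)
            l = lam(xi)
            alpha = (0.5 * (K / 2.0 - 1.0) + (l * m).sum(axis=1)) / l.sum(axis=1)
            # Gaussian posterior factors: updates (16) and (17)
            for k in range(K):
                prec = prior_prec + 2.0 * np.einsum('i,id,ie->de', l[:, k], X, X)
                Sigma[k] = np.linalg.inv(prec)
                rhs = prior_prec @ prior_mean + X.T @ (Y[:, k] - 0.5 + 2.0 * alpha * l[:, k])
                mu[k] = Sigma[k] @ rhs
        return mu, Sigma

A call such as mu, Sigma = fit_variational_softmax(X, Y, np.zeros(d), np.eye(d)) returns the means and covariances of the Gaussian variational posterior over the $\beta_k$.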
In the case of the bound based on the moment generating function of $Q$, we have:

$\mathcal{F}(\mu, \Sigma, \phi) = \sum_{k=1}^K \mu_k^T s_k - \sum_{i=1}^n\left[\phi_i \sum_{k=1}^K e^{\mu_k^T x_i + \frac{1}{2} x_i^T\Sigma_k x_i} - \log \phi_i - 1\right] - \mathrm{KL}(Q(\beta) \,\|\, P(\beta))$,    (20)

where $s_k = \sum_{i: y_i = k} x_i$. The maximization with respect to $\phi$ is straightforward and has a closed-form solution. For a fixed $\phi$, the objective can be decomposed into a sum of independent functions of $(\mu_k, \Sigma_k)$ that can be maximized independently for $k = 1, \cdots, K$. Since the gradient can be computed easily and $\mathcal{F}$ is concave with respect to $\mu$ and $\Sigma$, the maximization can be done using a standard optimization package (we use the reparameterizations $\phi = a^2$ and $\Sigma = R^{\frac{1}{2}}(R^{\frac{1}{2}})^T$ to obtain an unconstrained maximization problem). Alternatively, it is possible to find $\mu_k$ and $\Sigma_k$ using a fixed-point equation. In the case of the quadratic bounds, the maximization of the objective is done by iteratively maximizing with respect to the variational parameters and to $(\mu, \Sigma)$; every computation is analytical in this case.
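For instance, the closed-form update of the $\phi_i$ in (20) can be written as follows (a sketch with our own naming; the gradient steps in $(\mu, \Sigma)$ are left to the optimization package).

    import numpy as np

    def optimal_phi(X, mu, Sigma):
        # maximizer of (20) with respect to phi:
        # phi_i = 1 / sum_k exp(mu_k^T x_i + x_i^T Sigma_k x_i / 2)
        m = X @ mu.T                                     # (n, K)
        v = np.einsum('id,kde,ie->ik', X, Sigma, X)      # (n, K)
        return 1.0 / np.exp(m + 0.5 * v).sum(axis=1)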
5 Conclusion
We proposed and compared three different bounds for the softmax function that enable analytical expressions for the expectation of the log-softmax of a multivariate random variable with finite variance. The proposed approximations have pros and cons depending on the type of distribution that is integrated. The methods are complementary in terms of accuracy, and they lead to an efficient Bayesian treatment of the multiclass logistic regression model. One of these bounds is completely novel and leads to very robust upper bounds of the log-sum-exp function. A direct implication of these results is an effective solution to deterministic inference problems in hybrid Bayesian networks. We hope that this work will help remove current limitations in the development of general-purpose inference engines for probabilistic graphical models. As with any bounding method, it is difficult to quantify the effect of the approximation loss. We are currently running experiments comparing the approximation of the marginal likelihood to the stochastic (but unbiased) approximation based on Gibbs sampling. Finally, we would like to raise the question of the interest of bounds rather than direct approximations (such as quadratures or Monte-Carlo estimates).
References

[1] J. Winn, D. Spiegelhalter, and C. Bishop. VIBES: A variational inference engine for Bayesian networks. In Advances in Neural Information Processing Systems, volume 15, pages 793-800, 2002.

[2] T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, 1996.
[3] K. P. Murphy. Inference and learning in hybrid Bayesian networks. Technical Report UCB/CSD-98-990, EECS Department, University of California, Berkeley, June 1998.

[4] M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, 1997.

[5] T. Jebara and A. Pentland. On reversing Jensen's inequality. In Advances in Neural Information Processing Systems 13, pages 231-237, 2000.

[6] C. M. Bishop. Discussion of 'Bayesian treed generalized linear models' by H. A. Chipman, E. I. George and R. E. McCulloch. In Proceedings of the Seventh Valencia International Meeting on Bayesian Statistics, volume 7, pages 98-101. Oxford University Press, 2002.

[7] M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790-1817, 2006.

[8] N. D. Lawrence, M. Milo, M. Niranjan, P. Rashbass, and S. Soullier. Reducing the variability in cDNA microarray image processing by Bayesian inference. Bioinformatics, 20(4):518-526, 2004.

[9] D. Blei and J. Lafferty. A correlated topic model of Science. Annals of Applied Statistics, 1:17-35, 2007.

[10] B. Krishnapuram, L. Carin, M. A. Figueiredo, and A. J. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957-968, 2005.

[11] D. Böhning. Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44(9):197-200, 1992.

[12] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.