Bayesian Estimation of Beta Mixture Models with Variational Inference

© 2011 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

ZHANYU MA, ARNE LEIJON

Stockholm 2011

Reference number in EES database IR-EE-SIP 2011:016


Bayesian Estimation of Beta Mixture Models with Variational Inference

Zhanyu Ma, Student Member, IEEE, and Arne Leijon

Abstract—Bayesian estimation of the parameters in beta mixture models (BMM) is analytically intractable. Numerical solutions that simulate the posterior distribution are available, but they incur high computational cost. In this paper, we introduce an approximation to the prior/posterior distribution of the parameters in the beta distribution and propose an analytically tractable (closed-form) Bayesian approach to the parameter estimation. The approach is based on the variational inference (VI) framework. Following the principles of the VI framework and utilizing the relative convexity bound, the extended factorized approximation method is applied to approximate the distribution of the parameters in BMM. In a fully Bayesian model, where all the parameters of the BMM are treated as random variables and assigned proper distributions, our approach can asymptotically find the optimal estimate of the parameters' posterior distribution. In addition, the model complexity can be determined from the data. The solution is given in closed form, so no iterative numerical calculation is required. Meanwhile, our approach avoids the overfitting problem of the conventional expectation maximization algorithm. The good performance of this approach is verified by experiments with both synthetic and real data.

Index Terms—Bayesian Estimation, Maximum Likelihood Estimation, Beta Distribution, Mixture Modeling, Variational Inference, Factorized Approximation



The authors are with the KTH - Royal Institute of Technology, School of Electrical Engineering, Sound and Image Processing Lab, SE-100 44 Stockholm, Sweden. E-mail: [email protected], [email protected]

1 INTRODUCTION

In the field of pattern recognition, the statistical approach is popular and intensively studied [1]–[4]. Given a set of observations and assuming they are generated from an underlying source, the main goal of statistical modeling is to establish a probabilistic model that can characterize the patterns of the observations, capture their underlying distributions, and describe the statistical properties of the source. The Gaussian distribution and the corresponding Gaussian mixture models (GMM) are popular tools, widely used for modeling the distribution of observed data. The Gaussian distribution is symmetric and unbounded (with support range (−∞, +∞)). However, the data we observe in practical problems are not always symmetric or unbounded. In signal processing applications, for example, the most frequently used feature, the power spectrum or energy, is semi-bounded (non-negative, and of course asymmetric). In image processing and computer vision, the pixels in a gray-scale image or a color image in RGB space are digitized and have bounded support in the range [0, N], where N is an integer determined by the number of bits spent on each pixel. The distribution of the pixels is usually not symmetric. For speech transmission, the line spectral frequency (LSF) representation of the linear predictive coding parameters is bounded in
the range [0, π]. Although the GMM can model arbitrary distributions with a proper number of mixture components, a large number of these components are spent on describing the boundary when modeling semi-bounded or bounded support data. Compared to the Gaussian distribution, the beta distribution [5]–[9] has a more flexible shape. It has a support range of [0, 1] and can easily be generalized to any compact range [a, b], a, b ∈ R ([6], [10]). It is usually applied to model events that take place in a limited interval and is widely used in financial model building, project control systems, and some business applications. Also, as a special case of the Dirichlet distribution with two parameters, the beta distribution is extensively used as the conjugate prior distribution of the parameter of the binomial distribution [9], [11], [12]. Some maximum likelihood (ML) estimation methods for the parameters were proposed in [6], [13].

The mixture model [14], [15] is a flexible and powerful probabilistic tool for analyzing both univariate and multivariate data. With the bounded support of the beta distribution, data with bounded support can be efficiently modeled by beta mixture models (BMM) with lower model complexity than a GMM. For example, Bouguila et al. [10] applied the BMM to model the enzymatic activity distribution in the blood, to describe the distribution of the acidity index for lakes, and to estimate the parameters of the SAR image histogram. All the applications in [10] showed that the BMM performs better than the GMM when modeling the data distribution. In the area of bioinformatics, Ji et al. [16] applied the BMM to solve a variety of problems related to the correlations of gene-expression levels. For modeling the distribution of pixels in gray-scale images, we applied the BMM to the task
of handwritten digit classification [17]. The BMM-based classifier performed better than a GMM-based classifier. To detect human skin color, a three-dimensional BMM was used to model the skin and non-skin color distributions [18]. Compared to previous statistical model-based methods (most of which were based on the GMM), the BMM detector showed better performance. For modeling the LSF parameters in speech quantization, Lindblom et al. [19], [20] pointed out the drawback of modeling the speech spectra with the GMM and proposed a so-called bounded support GMM to compensate for the unbounded support of the Gaussian distribution. To obtain a larger improvement, we proposed a BMM-based vector quantizer (VQ) [21], which was shown to be superior to the GMM-based VQ. The underlying distribution of the 16-dimensional LSF parameters was modeled with a BMM and the corresponding BMM-based VQ was introduced. Generally speaking, for bounded support data, the BMM can model the underlying distribution better than the GMM, with the same level of model complexity. Furthermore, the subsequent processing based on BMM modeling (e.g., a classifier based on a parametric statistical model, or VQ design based on a trained probability density function (PDF)) also showed better performance.

The most central task in modeling the data with a BMM is parameter estimation. Since the normalization constant (the beta function) in the beta distribution is defined as a fraction of integrals, it is difficult to obtain a closed-form expression for estimating the parameters. For ML estimation of the BMM parameters, [16] and [17] proposed the expectation maximization (EM) algorithm [22] (with iterative numerical calculation in the maximization step). The EM algorithm for the BMM has some disadvantages: 1) it can lead to overfitting when the mixture model is excessively complex; 2) the iterative numerical calculation in the maximization step (e.g., with the Newton-Raphson method) causes high computational cost. For Bayesian estimation, we can formally find the prior distribution and the conjugate posterior distribution of the parameters of the beta distribution. However, this posterior distribution is still defined with an integration expression in the denominator, so a closed form of the posterior distribution is analytically intractable (see Section 3.1 for details). Bouguila et al. [10] proposed a practical Bayesian estimation algorithm based on the Gibbs sampling method, which simulates the posterior distribution approximately rather than computing it. The method proposed in [10] can prevent the overfitting problem but still suffers from high computational cost because of the Gibbs sampling, especially when the data lie in a high-dimensional space.

In this paper, we approximate the posterior distribution with variational inference (VI) [4], [23]–[26]. The VI framework can incur unknown bias, whereas Gibbs sampling gives samples from the exact posterior distribution [27], [28]. However, obtaining correct samples by Gibbs sampling crucially depends on Markov chain convergence. This can be a slow process, and assessing whether the chain has converged is difficult [28]. Within the VI framework, we approximate the true two-dimensional correlated prior distribution of the beta distribution parameters with a product of two uncorrelated gamma distributions. Both the true prior distribution and the approximating distribution are unimodal. Even though we factorize the prior distribution into a product of two gamma distributions, the optimal solution is still intractable in the expectation part. To make the inference tractable, we obtain the relative convexity bounds [29] by Taylor expansion and maximize the lower bound of the objective function instead of the objective function itself to reach the optimum. With these approximations, the posterior distribution of the beta distribution parameters is again approximated by a product of two gamma distributions. This forces the approximating prior and posterior distributions to be a conjugate pair, so that the Bayesian estimation can be carried out iteratively. Furthermore, at each iteration, no iterative numerical calculation is required, which is another advantage of the proposed Bayesian approach. To facilitate the usage of the obtained approximating posterior distribution, the posterior mean of the approximating distribution of the parameters is calculated and is verified to be very close to the true parameters, especially as the number of observations increases. Compared to the EM algorithm for the BMM, the Bayesian algorithm proposed here can prevent the overfitting problem, automatically determine the complexity of the mixture model, and avoid iterative numerical calculation in each update step. The algorithm is verified with both synthetic and real data.

The rest of the paper is organized as follows: in Section 2, we review the beta distribution and some previous work on BMM. In Section 3, we extend the factorized approximation method, apply it to the Bayesian estimation of BMM parameters, and propose the algorithm. The experimental results with both synthetic and real data are shown in Section 4. Some conclusions are drawn in Section 5.

2 BETA MIXTURE MODELS AND MAXIMUM LIKELIHOOD ESTIMATION

2.1 The Mixture Models

The probability density function (PDF) of the beta distribution is

Beta(x; u, v) = \frac{1}{\mathrm{beta}(u, v)} x^{u-1} (1 - x)^{v-1}, \quad u, v > 0,    (1)

where beta(u, v) is the beta function, \mathrm{beta}(u, v) = \Gamma(u)\Gamma(v)/\Gamma(u + v), and \Gamma(\cdot) is the gamma function defined as \Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt. The shape of the beta distribution depends on the two shape parameters u, v; it can be symmetric or highly skewed depending on the parameters. Fig. 1 shows some typical cases. If one (or both) of the parameters is smaller than 1, the probability is concentrated at x-values near 0 and 1, which is less interesting in practical problems and seldom happens in practical applications. If the estimated beta distribution has parameters smaller than 1, we can choose the Bernoulli distribution as an alternative way to model the data. Furthermore, we only discuss the case where the parameters are greater than 1 in the following sections.

Fig. 1. Beta distributions for different pairs of parameters: (a) u = 5, v = 5; (b) u = 2, v = 5; (c) u = 0.1, v = 2; (d) u = 0.2, v = 0.8.

The approach of mixture models [14], [15] assumes that the observed data are drawn from a mixture of parametric distributions. Multivariate data are in most cases statistically dependent. However, for any random vector x consisting of L elements, the dependencies among the elements x_1, ..., x_L can be represented by a mixture model, even if each specific mixture component can only generate vectors with statistically independent elements. Therefore, we define the multivariate BMM as

f(x; \Pi, U, V) = \sum_{i=1}^{I} \pi_i \mathrm{Beta}(x; u_i, v_i) = \sum_{i=1}^{I} \pi_i \prod_{l=1}^{L} \mathrm{Beta}(x_l; u_{li}, v_{li}),

where x = \{x_1, ..., x_L\}, \Pi = \{\pi_1, ..., \pi_I\}, U = \{u_1, ..., u_I\}, and V = \{v_1, ..., v_I\}. Here \{u_i, v_i\} denote the parameter vectors of the ith mixture component, and u_{li}, v_{li} are the (scalar) parameters of the beta distribution for element x_l.

2.2 Maximum Likelihood Estimation

With a set of i.i.d. observations X = \{x_1, ..., x_N\}, the likelihood is given by

f(X; \Pi, U, V) = \prod_{n=1}^{N} f(x_n; \Pi, U, V).    (2)
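As a concrete illustration of (1) and (2), the sketch below evaluates the log-likelihood of a multivariate BMM on bounded data. It is only a minimal numerical rendering of the definitions above; the component parameters and the data are hypothetical, and SciPy's betaln and log-sum-exp utilities are used for numerical stability.

```python
import numpy as np
from scipy.special import betaln, logsumexp

def log_beta_pdf(x, u, v):
    """Log of Beta(x; u, v) in (1): -ln beta(u, v) + (u-1) ln x + (v-1) ln(1-x)."""
    return -betaln(u, v) + (u - 1.0) * np.log(x) + (v - 1.0) * np.log(1.0 - x)

def bmm_log_likelihood(X, pi, U, V):
    """Log-likelihood (2) of an I-component BMM with independent elements per component.

    X : (N, L) data in (0, 1);  pi : (I,) weights;  U, V : (I, L) shape parameters.
    """
    # log of each component density: sum over the L independent beta factors
    comp = np.stack([log_beta_pdf(X, U[i], V[i]).sum(axis=1) for i in range(len(pi))],
                    axis=1)                        # shape (N, I)
    return logsumexp(comp + np.log(pi), axis=1).sum()

# hypothetical two-component, one-dimensional example
rng = np.random.default_rng(0)
X = rng.beta(2.0, 8.0, size=(1000, 1))
pi = np.array([0.3, 0.7])
U = np.array([[2.0], [15.0]]); V = np.array([[8.0], [4.0]])
print(bmm_log_likelihood(X, pi, U, V))
```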

Ma et al. [17] used the EM algorithm to obtain the ML estimates of the parameters \Pi, U, and V. However, in the maximization step, an iterative numerical calculation method was used. The EM algorithm may converge to a local maximum, possibly different from the global maximum, depending on the initialization. Also, given an excessively complex model (a large number of components) or when the amount of training data is small, the EM algorithm can lead to overfitting. More details about the EM algorithm for the multivariate BMM can be found in [17].

3 BAYESIAN ESTIMATION WITH VARIATIONAL INFERENCE FRAMEWORK

To avoid the overfitting problem when estimating the model parameters from the data, a Bayesian estimation approach for the BMM is required. Although the conjugate prior distribution of the beta distribution can be found by following the principles of finding conjugate priors for the exponential family [4], [27], [30], the prior distribution and the corresponding posterior distribution are analytically intractable. Thus, some approximations are needed. Within the VI framework, we propose a Bayesian approach to estimate the distribution of the parameters and derive a closed-form solution, which is free of iterative numerical calculation during each update round.

3.1 Conjugate Prior of Beta Distribution

In the field of Bayesian analysis, if the posterior density function f(Z|X) for all the variables Z, given X, has the same form as the prior density function f(Z), then the probability density function f(Z) is said to be the conjugate prior of the likelihood f(X|Z). As an important step in the Bayesian approach, we need to find a conjugate prior distribution to the beta distribution. The conjugate prior density function can be written as [4], [30]

f(u, v) = \frac{1}{C(\alpha_0, \beta_0, \nu_0)} \left[ \frac{\Gamma(u + v)}{\Gamma(u)\Gamma(v)} \right]^{\nu_0} e^{-\alpha_0 (u - 1)} e^{-\beta_0 (v - 1)},    (3)

where \alpha_0, \beta_0, \nu_0 are free positive parameters and C(\alpha_0, \beta_0, \nu_0) is a normalization factor such that \int_0^{\infty}\!\int_0^{\infty} f(u, v)\, du\, dv = 1. Then we obtain the posterior distribution of u, v (with N i.i.d. scalar observations X = \{x_1, ..., x_N\}) as

f(u, v|X) = \frac{f(X|u, v) f(u, v)}{\int_0^{\infty}\!\int_0^{\infty} f(X|u, v) f(u, v)\, du\, dv} = \frac{1}{C(\alpha_N, \beta_N, \nu_N)} \left[ \frac{\Gamma(u + v)}{\Gamma(u)\Gamma(v)} \right]^{\nu_N} e^{-\alpha_N (u - 1)} e^{-\beta_N (v - 1)},    (4)

where^1 \nu_N = \nu_0 + N, \alpha_N = \alpha_0 - \sum_{n=1}^{N} \ln x_n, and \beta_N = \beta_0 - \sum_{n=1}^{N} \ln(1 - x_n). We have formally obtained the conjugate prior of the beta distribution. However, it is not applicable in practical problems due to the analytically intractable integration expression. Some stochastic techniques (e.g., Gibbs sampling [10]) could be utilized to calculate the posterior distribution numerically. In this paper, we propose a method based on the VI framework to approximate the posterior distribution.

1. To prevent infinite quantities in the practical implementation, we assign \epsilon_1 to x_n when x_n = 0 and 1 - \epsilon_2 to x_n when x_n = 1. Both \epsilon_1 and \epsilon_2 are small positive real numbers.
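The hyperparameter updates below (4) are simple sufficient-statistics sums, even though the normalizing constant C(\alpha_N, \beta_N, \nu_N) has no closed form. The sketch below, a minimal illustration with hypothetical data, computes \nu_N, \alpha_N, \beta_N and the unnormalized log posterior of (4); normalization would still require numerical integration.

```python
import numpy as np
from scipy.special import gammaln

def posterior_hyperparameters(x, alpha0=0.001, beta0=0.001, nu0=1.0):
    """Sufficient-statistics updates below Eq. (4): nu_N, alpha_N, beta_N."""
    x = np.clip(x, 1e-10, 1.0 - 1e-10)     # footnote 1: avoid log(0)
    nu_N = nu0 + x.size
    alpha_N = alpha0 - np.sum(np.log(x))
    beta_N = beta0 - np.sum(np.log1p(-x))
    return nu_N, alpha_N, beta_N

def log_posterior_unnormalized(u, v, nu_N, alpha_N, beta_N):
    """Log of the numerator of (4); the constant C(alpha_N, beta_N, nu_N) is omitted."""
    return (nu_N * (gammaln(u + v) - gammaln(u) - gammaln(v))
            - alpha_N * (u - 1.0) - beta_N * (v - 1.0))

# hypothetical data drawn from Beta(5, 8)
x = np.random.default_rng(1).beta(5.0, 8.0, size=100)
nu_N, alpha_N, beta_N = posterior_hyperparameters(x)
print(log_posterior_unnormalized(5.0, 8.0, nu_N, alpha_N, beta_N))
```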


3.2 Factorized Approximation to the Parameter Distributions of BMM

Variational methods originated in the 18th century and have recently been used in probabilistic inference [4], [23]–[26], [28], [31]. As described in Section 3.1, we have found a conjugate prior of the beta distribution, and the corresponding posterior distribution can be calculated by combining the prior and the likelihood. However, the true posterior distribution is analytically intractable because of the integral expression, which makes the Bayesian estimation task impractical. In principle, we could start from the original conjugate prior introduced in (3) and then, following the factorized approximation, assume u and v to be separate variables in the posterior distribution. But this approach would lead to a mismatch between the true prior (which is a joint distribution of u and v) and the approximating posterior (which is a product of uncorrelated distributions of u and v). Furthermore, if we consider the currently estimated posterior distribution as the prior distribution for new upcoming data (e.g., in online Bayesian learning), the prior and posterior distributions must have the same form. Therefore, it is reasonable to force the prior distribution and the posterior distribution to be factorized in the same way, so that the Bayesian estimation can be carried out iteratively. As stated in [32], this conjugate match permits easy updates of the posterior distribution using variational distributions of the same families as the prior distributions.

We introduce an approximation to both the conjugate prior and the posterior distributions of the beta distribution and attempt to solve the Bayesian estimation problem via the factorized approximation method. Under the principles of the factorized approximation method, a distribution can be used as the factorized distribution of the true posterior distribution if the optimal solution to this factorized distribution has exactly the same form as the initialization. This requirement guarantees that the estimation works iteratively. With the non-negativity of the parameters in the beta distribution and assuming that u and v are statistically independent, we can use some well-defined distribution with support domain (0, +∞) to approximate the conjugate prior. One possible way is to assign gamma distributions to u and v as

f(u; \mu, \alpha) = \frac{\alpha^{\mu}}{\Gamma(\mu)} u^{\mu - 1} e^{-\alpha u}; \quad f(v; \nu, \beta) = \frac{\beta^{\nu}}{\Gamma(\nu)} v^{\nu - 1} e^{-\beta v}.    (5)

The conjugate prior is then approximated as

f(u, v) \approx f(u) f(v).    (6)

The same form of approximation applies to the posterior distribution as

f(u, v|X) \approx f(u|X) f(v|X).    (7)

With the approximation mentioned above and recalling the BMM described in Section 2, we can build a hierarchical model for the Bayesian estimation, which is shown in Fig. 2 following the principles of graphical models [33]. For each observation x_n, the corresponding z_n = [z_{n1}, ..., z_{nI}]^T is the indication vector with one element equal to 1 and the rest equal to 0. Denoting Z = \{z_1, ..., z_N\} and assuming the indication vectors are independent given the mixing coefficients, the conditional distribution of Z given \Pi is

f(Z|\Pi) = \prod_{n=1}^{N} \prod_{i=1}^{I} \pi_i^{z_{ni}}.    (8)

Fig. 2. Graphical representation of the variables' relationship in the Bayesian approach (\pi_i, u_i, v_i, z_n, x_n). All the circles represent variables. Arrows show the relationships between variables. The variables in the box are the i.i.d. observations.

Introducing the Dirichlet distribution as the prior distribution of the mixing coefficients, the probability function of \Pi can be written as

f(\Pi) = \mathrm{Dir}(\pi|c) = C(c) \prod_{i=1}^{I} \pi_i^{c_i - 1},    (9)

where C(c) = \frac{\Gamma(\hat{c})}{\Gamma(c_1) \cdots \Gamma(c_I)} and \hat{c} = \sum_{i=1}^{I} c_i. We consider the observation x_n and the unobserved indication vector z_n as the complete data. The conditional distribution of X = \{x_1, ..., x_N\} and Z = \{z_1, ..., z_N\} given the latent variables \{U, V, \Pi\} is^2

f(X, Z|U, V, \Pi) = f(X|U, V, \Pi, Z) f(Z|\Pi) = f(X|U, V, Z) f(Z|\Pi) = \prod_{n=1}^{N} \prod_{i=1}^{I} [\pi_i \mathrm{Beta}(x_n|u_i, v_i)]^{z_{ni}}.    (10)

2. By the principles of graphical models and from Fig. 2, we know that X is conditionally independent of \Pi, given Z.
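To make the hierarchical model of Fig. 2 and (8)–(10) concrete, the following sketch draws data from it: mixing weights from the Dirichlet prior, shape parameters from the gamma priors, an indicator z_n per observation, and then the observation itself. It is a hypothetical forward-sampling illustration, not part of the estimation algorithm.

```python
import numpy as np

def sample_bmm_hierarchy(N, I, L, c0=1.0, mu0=2.0, alpha0=1.0, nu0=2.0, beta0=1.0, seed=0):
    """Forward-sample the graphical model of Fig. 2 with hypothetical hyperparameters."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.full(I, c0))              # Pi ~ Dir(c), Eq. (9)
    U = rng.gamma(mu0, 1.0 / alpha0, size=(I, L))   # u_li ~ Gamma(mu0, alpha0), Eq. (5)
    V = rng.gamma(nu0, 1.0 / beta0, size=(I, L))    # v_li ~ Gamma(nu0, beta0), Eq. (5)
    z = rng.choice(I, size=N, p=pi)                 # indicators, Eq. (8)
    X = rng.beta(U[z], V[z])                        # x_n | z_n: product of betas, Eq. (10)
    return X, z, pi, U, V

X, z, pi, U, V = sample_bmm_hierarchy(N=500, I=3, L=2)
print(X.shape, np.bincount(z), pi)
```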

With the Bayes rules and by combining (5), (8), (9), and (10), the joint density function of the observations X and all the latent variables \{U, V, \Pi, Z\} is given by

f(X, Z) = f(X, U, V, \Pi, Z) = f(X|U, V, Z) f(Z|\Pi) f(\Pi) f(U) f(V)
  = \prod_{n=1}^{N} \prod_{i=1}^{I} [\pi_i \mathrm{Beta}(x_n|u_i, v_i)]^{z_{ni}} \cdot C(c) \prod_{i=1}^{I} \pi_i^{c_i - 1} \cdot \prod_{l=1}^{L} \prod_{i=1}^{I} \frac{\alpha_{li}^{\mu_{li}}}{\Gamma(\mu_{li})} u_{li}^{\mu_{li} - 1} e^{-\alpha_{li} u_{li}} \cdot \prod_{l=1}^{L} \prod_{i=1}^{I} \frac{\beta_{li}^{\nu_{li}}}{\Gamma(\nu_{li})} v_{li}^{\nu_{li} - 1} e^{-\beta_{li} v_{li}},    (11)

and the logarithm of (11) is

L(X, Z) = \ln f(X, Z, U, V, \Pi)
  = \mathrm{con.} + \sum_{n=1}^{N} \sum_{i=1}^{I} z_{ni} \left\{ \ln \pi_i + \sum_{l=1}^{L} \ln \frac{\Gamma(u_{li} + v_{li})}{\Gamma(u_{li})\Gamma(v_{li})} + \sum_{l=1}^{L} [(u_{li} - 1) \ln x_{ln} + (v_{li} - 1) \ln(1 - x_{ln})] \right\}
  + \sum_{l=1}^{L} \sum_{i=1}^{I} [(\mu_{li} - 1) \ln u_{li} - \alpha_{li} u_{li}] + \sum_{l=1}^{L} \sum_{i=1}^{I} [(\nu_{li} - 1) \ln v_{li} - \beta_{li} v_{li}] + \sum_{i=1}^{I} (c_i - 1) \ln \pi_i.    (12)

The latent variables we have now are U, V, \Pi, and Z, with the hyperparameters \alpha, \beta, \mu, \nu, and c. The updating scheme of variational inference can be used to estimate these hyperparameters of the latent variables. Following the principles of the VI framework [4], [23]–[26], we need to calculate the expected value of L(X, Z, U, V, \Pi) in (12) with respect to all the other variables except for the current one, iteratively. Considering U as the variable and taking the expected value of L(X, Z, U, V, \Pi) with respect to V, \Pi, and Z, we have the optimal solution to f(u_{li}; \mu_{li}, \alpha_{li}) as (element-wise)

\ln f^*(u_{li}; \mu_{li}, \alpha_{li}) = E_{Z \setminus u_{li}}[L(X, Z)]
  = \sum_{n=1}^{N} E_{Z \setminus u_{li}}\!\left[ z_{ni} \ln \frac{\Gamma(u_{li} + v_{li})}{\Gamma(u_{li})\Gamma(v_{li})} \right] + \sum_{n=1}^{N} E_{Z \setminus u_{li}}\{ z_{ni} [(u_{li} - 1) \ln x_{ln}] \} + E_{Z \setminus u_{li}}[(\mu_{li} - 1) \ln u_{li} - \alpha_{li} u_{li}] + \mathrm{con.}
  = \sum_{n=1}^{N} E[z_{ni}] Q(u_{li}) + u_{li} \sum_{n=1}^{N} E[z_{ni}] \ln x_{ln} + (\mu_{li} - 1) \ln u_{li} - \alpha_{li} u_{li} + \mathrm{con.},    (13)

where Q(u_{li}) = E_{Z \setminus u_{li}}\!\left[ \ln \frac{\Gamma(u_{li} + v_{li})}{\Gamma(u_{li})\Gamma(v_{li})} \right] = E_{v_{li}}\!\left[ \ln \frac{\Gamma(u_{li} + v_{li})}{\Gamma(u_{li})\Gamma(v_{li})} \right]. Symmetrically, we obtain the optimal solution to f(v_{li}; \nu_{li}, \beta_{li}) as

\ln f^*(v_{li}; \nu_{li}, \beta_{li}) = E_{Z \setminus v_{li}}[L(X, Z)] = \sum_{n=1}^{N} E[z_{ni}] Q(v_{li}) + v_{li} \sum_{n=1}^{N} E[z_{ni}] \ln(1 - x_{ln}) + (\nu_{li} - 1) \ln v_{li} - \beta_{li} v_{li} + \mathrm{con.},    (14)

where Q(v_{li}) = E_{u_{li}}\!\left[ \ln \frac{\Gamma(u_{li} + v_{li})}{\Gamma(u_{li})\Gamma(v_{li})} \right]. The above derivations use the factorized approximation that U, V, \Pi, and Z are independent of each other and absorb the terms without the current variable into the constant part. We recognize that (13) is a function of u_{li} and wish it to have the same functional form as the logarithm of a gamma distribution. This problem can be solved if we have an approximation to Q(u_{li}) that is a linear function of \ln u_{li}. More details about the approximation and the updating scheme of the optimal hyperparameters \alpha, \beta, \mu, and \nu will be presented in Sections 3.4 and 3.5.

First, we derive the updating scheme for the hyperparameters of \Pi and Z. Considering \Pi as the variable, we have

\ln f^*(\pi_i; c) = E_{Z \setminus \pi_i}[L(X, Z, U, V, \Pi)] = \sum_{n=1}^{N} E_{Z \setminus \pi_i}[z_{ni} \ln \pi_i] + E_{Z \setminus \pi_i}[(c_i - 1) \ln \pi_i] + \mathrm{con.}
  = \ln \pi_i \left( \sum_{n=1}^{N} E[z_{ni}] + c_i - 1 \right) + \mathrm{con.}.    (15)

Apparently, (15) has the same form as the logarithm of the Dirichlet distribution. For the variable Z, we obtain

\ln f^*(z_{ni}|\Pi) = E_{Z \setminus z_{ni}}[L(X, Z, U, V, \Pi)]
  = E_{Z \setminus z_{ni}}[z_{ni} \ln \pi_i] + \sum_{l=1}^{L} E_{Z \setminus z_{ni}}\!\left[ z_{ni} \ln \frac{\Gamma(u_{li} + v_{li})}{\Gamma(u_{li})\Gamma(v_{li})} \right] + \sum_{l=1}^{L} E_{Z \setminus z_{ni}}\{ z_{ni} [(u_{li} - 1) \ln x_{ln} + (v_{li} - 1) \ln(1 - x_{ln})] \} + \mathrm{con.}
  = z_{ni} E[\ln \pi_i] + z_{ni} \sum_{l=1}^{L} P_{li} + z_{ni} \sum_{l=1}^{L} [(\bar{u}_{li} - 1) \ln x_{ln} + (\bar{v}_{li} - 1) \ln(1 - x_{ln})] + \mathrm{con.},    (16)

where P_{li} = E_{u_{li}, v_{li}}\!\left[ \ln \frac{\Gamma(u_{li} + v_{li})}{\Gamma(u_{li})\Gamma(v_{li})} \right] and \bar{u}_{li} = E[u_{li}], \bar{v}_{li} = E[v_{li}]. The quantities Q(u_{li}), Q(v_{li}), and P_{li} introduced above are helper functions. The optimal solution to Z has the logarithmic form of (8) except for the normalization constant. Noting that, for each value of n, the quantities z_{ni} are binary and sum to 1, we obtain

f^*(Z|\Pi) = \prod_{n=1}^{N} \prod_{i=1}^{I} r_{ni}^{z_{ni}}, \quad r_{ni} = \frac{\rho_{ni}}{\sum_{k=1}^{I} \rho_{nk}},    (17)

where

\ln \rho_{ni} = E[\ln \pi_i] + \sum_{l=1}^{L} P_{li} + \sum_{l=1}^{L} [(\bar{u}_{li} - 1) \ln x_{ln} + (\bar{v}_{li} - 1) \ln(1 - x_{ln})].    (18)

For f^*(Z|\Pi), we have the standard result E[z_{ni}] = r_{ni}.
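The responsibilities in (17) are a normalized exponential of (18), so they can be computed stably with a log-sum-exp ("softmax") step. The sketch below assumes the required expectations (E[\ln \pi_i], P_{li}, \bar{u}_{li}, \bar{v}_{li}) have already been evaluated (see (28) and (36)); it is only a hedged illustration of how (17)–(18) map to code.

```python
import numpy as np
from scipy.special import softmax

def responsibilities(X, log_pi_exp, P, u_bar, v_bar):
    """r_ni of Eq. (17) from the log rho_ni of Eq. (18).

    X          : (N, L) observations in (0, 1)
    log_pi_exp : (I,)   E[ln pi_i]
    P          : (I, L) helper values P_li (lower-bounded as in (28))
    u_bar, v_bar : (I, L) expected shape parameters
    """
    log_rho = (log_pi_exp[None, :]
               + P.sum(axis=1)[None, :]
               + np.log(X) @ (u_bar - 1.0).T
               + np.log1p(-X) @ (v_bar - 1.0).T)      # (N, I), Eq. (18)
    return softmax(log_rho, axis=1)                   # Eq. (17)
```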

3.3 Extended Factorized Approximation Method

In the above section, (13), (14), and (16) are the standard solutions under the principles of factorized approximation. For two reasons, these standard solutions do not work in our case: 1) the calculations of Q(u), Q(v), and P are analytically intractable, while the factorized approximation method needs a closed-form expression; 2) for the Bayesian estimation in our case, as we assumed that both the prior and the posterior distribution are factorized in the same way (a product of two gamma distributions, as stated in Section 3.2), the factorized approximation method requires that \ln f^*(u_{li}; \mu_{li}, \alpha_{li}) and \ln f^*(v_{li}; \nu_{li}, \beta_{li}) have the same form as the logarithm of a gamma density function. Based on the standard factorized approximation method, some useful properties can be noted:

Property 3.1: If we can find an (unnormalized) likelihood function \tilde{g}(X, Z_n) which satisfies \int g_n \ln \tilde{f}(X, Z_n)\, dZ_n \ge \int g_n \ln \tilde{g}(X, Z_n)\, dZ_n, then the lower bound (see [4], pp. 465) we want to maximize can be expressed as

L(\tilde{g}) = \int g_n(Z_n) \ln \tilde{f}(X, Z_n)\, dZ_n - \int g_n(Z_n) \ln g_n(Z_n)\, dZ_n + \mathrm{con.}
  \ge \int g_n(Z_n) \ln \tilde{g}(X, Z_n)\, dZ_n - \int g_n(Z_n) \ln g_n(Z_n)\, dZ_n + \mathrm{con.}
  = -\mathrm{KL}(g_n(Z_n) \,\|\, \tilde{g}(X, Z_n)) + \mathrm{con.}.    (19)

By setting g_n(Z_n) equal to the normalized version of \tilde{g}(X, Z_n), the lower bound of L(\tilde{g}) in (19) is maximized. Even though we cannot maximize L(\tilde{g}) directly, by maximizing its lower bound we can still reach the maximum value of L(\tilde{g}) asymptotically.

Property 3.2: If we have \tilde{f}(X, Z_n) \ge \tilde{g}(X, Z_n), then \int g_n \ln \tilde{f}(X, Z_n)\, dZ_n \ge \int g_n \ln \tilde{g}(X, Z_n)\, dZ_n always holds.

If we can find lower bounds of Q(u), Q(v), and P that satisfy: 1) the lower bounds can be calculated analytically; 2) by replacing Q(u) and Q(v) with the corresponding lower bounds, \ln f^*(u_{li}; \mu_{li}, \alpha_{li}) and \ln f^*(v_{li}; \nu_{li}, \beta_{li}) have the logarithmic form of a gamma distribution, then, by using Property 3.2, the problem mentioned at the beginning of this section can be solved with this extended factorized approximation method.

3.4 Lower Bound Approximation

The remaining unsolved part in the Bayesian estimation via the variational inference framework is to find lower bounds of Q(u), Q(v), and P. Usually, some approximations are applied to the variational objective function. Braun et al. [34] considered the zeroth-order and first-order delta method for moments [35] to derive an alternative to the objective function that simplifies the calculation. Using a first-order Taylor approximation has become commonplace in variational inference. Blei et al. [36], [37] proposed the correlated topic model (CTM) and used a first-order Taylor expansion to preserve a bound such that an intractable expectation was avoided. These ideas of deriving an alternative objective function that simplifies the calculation are also utilized in our paper. Furthermore, since the helper functions are not convex functions of the variables u, v directly, the first-order Taylor expansion is not a lower bound. By utilizing the relative convexity introduced in [29], we combine the first- and second-order Taylor expansions with the relative convexity and derive the lower bounds used to approximate the helper functions Q(u), Q(v), and P.

3.4.1 Useful Properties

The following properties are stated first and used for the approximations of the expectation parts in Q(u), Q(v), and P. Since the scenario where the parameters of the beta distribution are smaller than 1 is less interesting in practice, we only consider the case where the parameters are greater than 1. The following properties only hold under this assumption.^3

Property 3.3: (Relative Convexity of the Log-Inverse-Beta (LIB) Function) Define

F(x) = \ln \frac{1}{\mathrm{beta}(x, y)}, \quad x, y \in \mathbb{R}^+.    (20)

If y is greater than 1, then F(x) is convex relative to \ln x; otherwise, F(x) is concave relative to \ln x.

Property 3.4: (Approximation of the LIB Function) With the relative convexity of the LIB function, the bound

F(x) = \ln \frac{1}{\mathrm{beta}(x, y)} \ge \ln \frac{1}{\mathrm{beta}(x_0, y)} + [\psi(x_0 + y) - \psi(x_0)]\, x_0 (\ln x - \ln x_0), \quad y > 1,    (21)

holds, and equality applies if and only if x = x_0 \in \mathbb{R}^+. The digamma function \psi(x) is defined as \psi(x) = \partial \ln \Gamma(x) / \partial x.

Property 3.5: (Relative Convexity of the Pseudo Digamma Function) The pseudo digamma function G(x) = \partial \ln \Gamma(x + y) / \partial x, x \in \mathbb{R}^+, y > 1, is convex relative to \ln x.

Property 3.6: (Approximation of the Pseudo Digamma Function) With the relative convexity of the pseudo digamma function, the bound

G(x) = \psi(x + y) \ge \psi(x_0 + y) + \psi'(x_0 + y)\, x_0 (\ln x - \ln x_0)    (22)

holds when y > 1, and equality applies if and only if x = x_0 \in \mathbb{R}^+. Here \psi'(x) is the derivative of the digamma function with respect to x.

Property 3.7: (Approximation of the Bivariate Log-Inverse-Beta Function) The bivariate LIB function H(x, y) = \ln \frac{1}{\mathrm{beta}(x, y)}, x, y > 1, is always greater than or equal to its pseudo second-order Taylor expansion for (\ln x, \ln y) at (\ln x_0, \ln y_0), which is

H(x, y) \ge H(x_0, y_0) + \frac{\partial H(x, y)}{\partial \ln x}\Big|_{x_0, y_0} (\ln x - \ln x_0) + \frac{\partial H(x, y)}{\partial \ln y}\Big|_{x_0, y_0} (\ln y - \ln y_0)
  + \frac{1}{2} \frac{\partial^2 H(x, y)}{\partial x^2}\Big|_{x_0, y_0} \left( \frac{\partial x}{\partial \ln x} \right)^2 (\ln x - \ln x_0)^2 + \frac{1}{2} \frac{\partial^2 H(x, y)}{\partial y^2}\Big|_{x_0, y_0} \left( \frac{\partial y}{\partial \ln y} \right)^2 (\ln y - \ln y_0)^2
  + \frac{\partial^2 H(x, y)}{\partial \ln x\, \partial \ln y}\Big|_{x_0, y_0} (\ln x - \ln x_0)(\ln y - \ln y_0).    (23)

We call it a "pseudo second-order Taylor expansion" because the terms with the second derivatives of H(x, y) with respect to \ln x and \ln y are not the same as those of the true second-order Taylor expansion (which would be \frac{\partial^2 H(x, y)}{\partial (\ln x)^2} and \frac{\partial^2 H(x, y)}{\partial (\ln y)^2}). Equality applies when (x, y) = (x_0, y_0). The proofs of Properties 3.3, 3.4, 3.5, 3.6, and 3.7 can be found in the Appendices.

3. For the case where the parameters are smaller than 1, the proposed algorithm still works fine, even though we cannot prove the validity of the approximation theoretically.

3.4.2 Approximation of the Helper Functions

With Properties 3.3 and 3.4, we can obtain an approximation of Q(u) around the point \bar{u} as

Q(u) = E_v\!\left[ \ln \frac{\Gamma(u + v)}{\Gamma(u)\Gamma(v)} \right] \ge E_v\!\left[ \ln \frac{\Gamma(\bar{u} + v)}{\Gamma(\bar{u})\Gamma(v)} + [\psi(\bar{u} + v) - \psi(\bar{u})]\, \bar{u} (\ln u - \ln \bar{u}) \right]
  = \ln u \cdot \{ E_v[\psi(\bar{u} + v)] - \psi(\bar{u}) \} \cdot \bar{u} + \mathrm{con.}.    (24)

The expectation in the last line of (24) is with respect to v. Again, for the convenience of practical application, we apply a first-order Taylor approximation to this part with Properties 3.5 and 3.6:

E_v[\psi(\bar{u} + v)] \ge E_v[\psi(\bar{u} + \bar{v}) + \bar{v}\, \psi'(\bar{u} + \bar{v})(\ln v - \ln \bar{v})] = \psi(\bar{u} + \bar{v}) + \bar{v}\, \psi'(\bar{u} + \bar{v}) (E_v[\ln v] - \ln \bar{v}).    (25)

Substituting (25) into (24), we finally obtain the lower bound of Q(u) as

Q(u) \ge \ln u \left[ \psi(\bar{u} + \bar{v}) - \psi(\bar{u}) + \bar{v} \cdot \psi'(\bar{u} + \bar{v})(E_v[\ln v] - \ln \bar{v}) \right] \bar{u} + \mathrm{con.}, \quad u, v > 1.    (26)

The RHS of (26) is a lower bound. By the same reasoning, we can also obtain the approximation of Q(v) as

Q(v) \ge \ln v \left[ \psi(\bar{u} + \bar{v}) - \psi(\bar{v}) + \bar{u} \cdot \psi'(\bar{u} + \bar{v})(E_u[\ln u] - \ln \bar{u}) \right] \bar{v} + \mathrm{con.}, \quad u, v > 1.    (27)

Furthermore, Property 3.7 leads to the approximation of P as

P = E_{u,v}\!\left[ \ln \frac{\Gamma(u + v)}{\Gamma(u)\Gamma(v)} \right] \ge \ln \frac{\Gamma(\bar{u} + \bar{v})}{\Gamma(\bar{u})\Gamma(\bar{v})} + \bar{u} [\psi(\bar{u} + \bar{v}) - \psi(\bar{u})](E[\ln u] - \ln \bar{u}) + \bar{v} [\psi(\bar{u} + \bar{v}) - \psi(\bar{v})](E[\ln v] - \ln \bar{v})
  + 0.5\, \bar{u}^2 [\psi'(\bar{u} + \bar{v}) - \psi'(\bar{u})]\, E[(\ln u - \ln \bar{u})^2] + 0.5\, \bar{v}^2 [\psi'(\bar{u} + \bar{v}) - \psi'(\bar{v})]\, E[(\ln v - \ln \bar{v})^2]
  + \bar{u}\, \bar{v}\, \psi'(\bar{u} + \bar{v})(E[\ln u] - \ln \bar{u})(E[\ln v] - \ln \bar{v}).    (28)

Are (26), (27), and (28) reasonable approximations? These approximations are lower bounds of the helper functions Q(u), Q(v), and P. With Property 3.2, we set the LHS of (13), (14), and (16) equal to the corresponding lower bound of the optimal solution instead of the optimal solution itself. At each iteration, we maximize the lower bound, and the optimal solution can be reached asymptotically. From another point of view, the re-estimation of the optimal solution with the other distributions fixed is equivalent to the maximization step in the conventional EM algorithm. So what we are maximizing now is the lower bound of L(\tilde{g}). More discussion about the behavior of these approximations appears in Section 3.5.
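The bounds (21)–(22) are easy to sanity-check numerically. The sketch below evaluates the log-inverse-beta function and its relative-convexity bound (21) on a grid; under the stated assumption y > 1, the bound should never exceed the true value and should touch it at x = x_0. This is only a verification aid, not part of the estimation algorithm.

```python
import numpy as np
from scipy.special import betaln, digamma

def lib(x, y):
    """F(x) = ln 1/beta(x, y), Eq. (20)."""
    return -betaln(x, y)

def lib_bound(x, x0, y):
    """Right-hand side of the bound (21), valid for y > 1."""
    return lib(x0, y) + (digamma(x0 + y) - digamma(x0)) * x0 * (np.log(x) - np.log(x0))

x = np.linspace(1.1, 20.0, 200)
x0, y = 4.0, 8.0
gap = lib(x, y) - lib_bound(x, x0, y)
print(gap.min() >= -1e-12, abs(lib(x0, y) - lib_bound(x0, x0, y)) < 1e-12)
```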

3.5 Algorithm of Bayesian Estimation

In Section 3.2, we factorized the prior distribution of the parameters in the beta distribution into a product of two gamma distributions; then, with the VI framework and some approximations, we approximated the posterior distributions of the parameters again by a product of two gamma distributions. This means that, during each iteration, the hyperparameters of the gamma distributions can be obtained directly from the posterior distribution, without any numerical calculation (e.g., a gradient method). Thus the solution is in closed form. The speed of convergence is shown in Fig. 4.

3.5.1 Update of the Posterior Hyperparameters

The updating equation for c is simple. By identifying the last line in (15) as the logarithm of the Dirichlet distribution, we have

c_i^* = c_{i0} + \sum_{n=1}^{N} E[z_{ni}] = c_{i0} + \sum_{n=1}^{N} r_{ni},    (31)

where c_{i0} denotes the initial value of c_i. With (18) and (28), we have the approximation of \ln \rho_{ni} in (29), and consequently the update scheme for the expected value r_{ni} is found in (17). Substituting the approximation (26) back into (13), the optimal solution to f(u_{li}; \mu_{li}, \alpha_{li}) is found by (30). Identifying the RHS of (30) as a logarithmic gamma distribution, we obtain the closed-form updating scheme for the hyperparameters \alpha_{li} and \mu_{li}:

\ln \rho_{ni} \approx E[\ln \pi_i] + \sum_{l=1}^{L} [(\bar{u}_{li} - 1) \ln x_{ln} + (\bar{v}_{li} - 1) \ln(1 - x_{ln})]
  + \sum_{l=1}^{L} \Big\{ \ln \frac{\Gamma(\bar{u}_{li} + \bar{v}_{li})}{\Gamma(\bar{u}_{li})\Gamma(\bar{v}_{li})} + \bar{u}_{li} [\psi(\bar{u}_{li} + \bar{v}_{li}) - \psi(\bar{u}_{li})](E[\ln u_{li}] - \ln \bar{u}_{li}) + \bar{v}_{li} [\psi(\bar{u}_{li} + \bar{v}_{li}) - \psi(\bar{v}_{li})](E[\ln v_{li}] - \ln \bar{v}_{li})
  + 0.5\, \bar{u}_{li}^2 [\psi'(\bar{u}_{li} + \bar{v}_{li}) - \psi'(\bar{u}_{li})]\, E[(\ln u_{li} - \ln \bar{u}_{li})^2] + 0.5\, \bar{v}_{li}^2 [\psi'(\bar{u}_{li} + \bar{v}_{li}) - \psi'(\bar{v}_{li})]\, E[(\ln v_{li} - \ln \bar{v}_{li})^2]
  + \bar{u}_{li}\, \bar{v}_{li}\, \psi'(\bar{u}_{li} + \bar{v}_{li})(E[\ln u_{li}] - \ln \bar{u}_{li})(E[\ln v_{li}] - \ln \bar{v}_{li}) \Big\},    (29)

\ln f^*(u_{li}; \mu_{li}^*, \alpha_{li}^*) \approx \ln u_{li} \sum_{n=1}^{N} E[z_{ni}]\, \bar{u}_{li} \left[ \psi(\bar{u}_{li} + \bar{v}_{li}) - \psi(\bar{u}_{li}) + \bar{v}_{li} \cdot \psi'(\bar{u}_{li} + \bar{v}_{li})(E_v[\ln v_{li}] - \ln \bar{v}_{li}) \right]
  + u_{li} \sum_{n=1}^{N} E[z_{ni}] \ln x_{ln} + (\mu_{li} - 1) \ln u_{li} - \alpha_{li} u_{li} + \mathrm{con.},    (30)

\mu_{li}^* = \mu_{li0} + \sum_{n=1}^{N} E[z_{ni}]\, \bar{u}_{li} \left[ \psi(\bar{u}_{li} + \bar{v}_{li}) - \psi(\bar{u}_{li}) + \bar{v}_{li} \cdot \psi'(\bar{u}_{li} + \bar{v}_{li})(E_v[\ln v_{li}] - \ln \bar{v}_{li}) \right],    (32)

\alpha_{li}^* = \alpha_{li0} - \sum_{n=1}^{N} E[z_{ni}] \ln x_{ln}.    (33)

For the same reasons, the update schemes for the hyperparameters \beta_{li} and \nu_{li} are

\nu_{li}^* = \nu_{li0} + \sum_{n=1}^{N} E[z_{ni}]\, \bar{v}_{li} \left[ \psi(\bar{u}_{li} + \bar{v}_{li}) - \psi(\bar{v}_{li}) + \bar{u}_{li} \cdot \psi'(\bar{u}_{li} + \bar{v}_{li})(E_u[\ln u_{li}] - \ln \bar{u}_{li}) \right],    (34)

\beta_{li}^* = \beta_{li0} - \sum_{n=1}^{N} E[z_{ni}] \ln(1 - x_{ln}).    (35)

The above update equations are calculated through the following expectations:

\bar{u} = \frac{\mu}{\alpha}, \quad \bar{v} = \frac{\nu}{\beta}, \quad E[z_{ni}] = r_{ni}, \quad E[\ln \pi_i] = \psi(c_i) - \psi(\hat{c}),
E_u[\ln u] = \psi(\mu) - \ln \alpha, \quad E_v[\ln v] = \psi(\nu) - \ln \beta,
E_u[(\ln u - \ln \bar{u})^2] = [\psi(\mu) - \ln \mu]^2 + \psi'(\mu), \quad E_v[(\ln v - \ln \bar{v})^2] = [\psi(\nu) - \ln \nu]^2 + \psi'(\nu).    (36)

With (31), (32), (33), (34), and (35), we calculate the new values of the hyperparameters from the observations and the currently estimated expected values of the other hyperparameters, iteratively. During each iteration, all the hyperparameters are updated only once.

3.5.2 Choice of the Prior Hyperparameters

Since we have no prior knowledge about the hyperparameters, one constraint for choosing the initial values of the hyperparameters \alpha, \beta, \mu, and \nu is that we should assign rather broad distributions to U and V. Another constraint on the initial values comes from the behavior of these hyperparameters during the iterations, which is worth studying for a moment. With the property that x \in (0, 1) and with (33), (35), the new estimated hyperparameters \alpha^* and \beta^* are always greater than 0. But how can we guarantee that \mu^* and \nu^* in (32) and (34) are always greater than 0? The first-order condition for a concave function [38] states that f(y) \le f(x) + f'(x)(y - x) if f(x) is a concave function of x. Since \psi(x) is a concave function of x, then with the variable change y = \bar{u} and x = \bar{u} + \bar{v}, we have

\psi(\bar{u} + \bar{v}) - \psi(\bar{u}) + \psi'(\bar{u} + \bar{v}) \cdot (-\bar{v}) \ge 0.    (37)

Because E_v[\ln v] - \ln \bar{v} = \psi(\nu) - \ln \nu is a monotonically non-decreasing function of \nu, if we can find a threshold \hat{\nu} that satisfies \psi(\hat{\nu}) - \ln \hat{\nu} = -1, then any value greater than \hat{\nu} makes the multiplying factor of E[z_{ni}]\, \bar{u}_{li} in (32) positive, i.e.,

\psi(\bar{u} + \bar{v}) - \psi(\bar{u}) + \psi'(\bar{u} + \bar{v}) \cdot \bar{v}\, (E_v[\ln v] - \ln \bar{v}) > \psi(\bar{u} + \bar{v}) - \psi(\bar{u}) + \psi'(\bar{u} + \bar{v}) \cdot (-\bar{v}) \ge 0.

So the positivity of \mu^* is guaranteed when \nu > \hat{\nu}. The value of \mu^* will always be greater than \mu_0 once the iteration starts. The same holds for \nu^* if \mu > \hat{\mu} (with \psi(\hat{\mu}) - \ln \hat{\mu} = -1), because \mu and \nu are coupled. It is not difficult to solve for \hat{\mu} and \hat{\nu} numerically and obtain \hat{\mu} = \hat{\nu} \approx 0.6156. If we calculate the initial guess of \bar{u} and \bar{v} from \alpha_0, \beta_0, \mu_0, and \nu_0, we should choose \mu_0 and \nu_0 to be greater than 0.6156. Also, since the expected values of u and v should be greater than 1 (see (26), (27)), the constraints \mu_0/\alpha_0 > 1 and \nu_0/\beta_0 > 1 should be satisfied. The algorithm of the proposed Bayesian estimation is summarized in Table 1.

3.6 Discussion

In the previous sections, we proposed a Bayesian estimation method for the BMM with variational inference. We approximated the PDF of the parameters in the beta distribution by a product of two gamma densities. To make the expectation calculation in the optimal solution tractable, we approximated the objective function with a lower bound, using the relative convexity [29] together with first- and second-order Taylor expansions [36], [37]. A conjugate match between the prior and posterior distributions was established by the above procedure. Compared to the EM algorithm [16], [17] for the BMM, the proposed Bayesian approach prevents the overfitting problem and avoids the iterative numerical search in each maximization step. The convergence is guaranteed because of the convexity of the bounds [4], [38]. However, the variational objective function is convex in each parameter, not globally, so the algorithm may find a local optimum. Unlike Gibbs sampling [10] or other Monte Carlo methods, which give samples from the exact posterior distribution, the VI incurs an unknown bias [27] in the approximation of the posterior distribution. However, the VI is deterministic and is guaranteed to converge because of the convexity of the objective function [28]. Since sampling can be slow to converge, we prefer the VI to the sampling methods, even though some bias may be incurred. The parameters of the beta distribution are u and v, which control the mean and variance of the distribution. As we assigned hyperparameters to the distributions of u and v, respectively, the hyperparameters can control the expected values of u and v, as well as their variances.
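To show how the closed-form updates (29)–(36) look in practice, the sketch below performs a single VI iteration for a one-dimensional BMM (L = 1, scalar observations x of shape (N,)). All expectations come from (36), the responsibilities from (17)–(18) with the bound (28), and the hyperparameter updates from (31)–(35). It is a simplified, hypothetical rendering of Step 2 of Table 1, not the authors' reference implementation.

```python
import numpy as np
from scipy.special import digamma, polygamma, gammaln, softmax

def expectations(mu, alpha, nu, beta):
    """Eq. (36): expected values of the gamma factors for u and v (per component)."""
    u_bar, v_bar = mu / alpha, nu / beta
    Elog_u, Elog_v = digamma(mu) - np.log(alpha), digamma(nu) - np.log(beta)
    sq_lu = (digamma(mu) - np.log(mu)) ** 2 + polygamma(1, mu)   # E[(ln u - ln u_bar)^2]
    sq_lv = (digamma(nu) - np.log(nu)) ** 2 + polygamma(1, nu)   # E[(ln v - ln v_bar)^2]
    return u_bar, v_bar, Elog_u, Elog_v, sq_lu, sq_lv

def vi_iteration(x, c, mu, alpha, nu, beta, c0, mu0, alpha0, nu0, beta0):
    """One pass of the closed-form updates (31)-(35) for L = 1 and I components."""
    u_bar, v_bar, Elog_u, Elog_v, sq_lu, sq_lv = expectations(mu, alpha, nu, beta)
    psi_uv, dpsi_uv = digamma(u_bar + v_bar), polygamma(1, u_bar + v_bar)
    Elog_pi = digamma(c) - digamma(c.sum())
    # lower bound (28) of P, evaluated at the current expected values
    P = (gammaln(u_bar + v_bar) - gammaln(u_bar) - gammaln(v_bar)
         + u_bar * (psi_uv - digamma(u_bar)) * (Elog_u - np.log(u_bar))
         + v_bar * (psi_uv - digamma(v_bar)) * (Elog_v - np.log(v_bar))
         + 0.5 * u_bar**2 * (dpsi_uv - polygamma(1, u_bar)) * sq_lu
         + 0.5 * v_bar**2 * (dpsi_uv - polygamma(1, v_bar)) * sq_lv
         + u_bar * v_bar * dpsi_uv * (Elog_u - np.log(u_bar)) * (Elog_v - np.log(v_bar)))
    # responsibilities, Eqs. (17)-(18) with (29)
    log_rho = (Elog_pi + P
               + np.log(x)[:, None] * (u_bar - 1.0)
               + np.log1p(-x)[:, None] * (v_bar - 1.0))
    r = softmax(log_rho, axis=1)                     # (N, I)
    Nk = r.sum(axis=0)
    # hyperparameter updates, Eqs. (31)-(35)
    c_new = c0 + Nk
    mu_new = mu0 + Nk * u_bar * (psi_uv - digamma(u_bar)
                                 + v_bar * dpsi_uv * (Elog_v - np.log(v_bar)))
    alpha_new = alpha0 - r.T @ np.log(x)
    nu_new = nu0 + Nk * v_bar * (psi_uv - digamma(v_bar)
                                 + u_bar * dpsi_uv * (Elog_u - np.log(u_bar)))
    beta_new = beta0 - r.T @ np.log1p(-x)
    return c_new, mu_new, alpha_new, nu_new, beta_new, r
```

Iterating vi_iteration until the hyperparameters stop changing corresponds to Steps 2 and 3 of Table 1; the posterior means u_bar and v_bar then serve as point estimates of the shape parameters.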

TABLE 1
The algorithm of the proposed Bayesian estimation for BMM

Step 1. Initialization
  a. Choose the number of components.
  b. Choose the initial parameters for the Dirichlet distribution.
  c. Initialize r_ni by the K-means algorithm.
  d. Choose the initial parameters (element-wise) as \alpha_0 > 0, \beta_0 > 0, \mu_0 > 0.6156, and \nu_0 > 0.6156. Furthermore, \mu_0/\alpha_0 and \nu_0/\beta_0 should be greater than 1.
  e. Calculate the initial guess of \bar{u} and \bar{v} from \alpha_0, \beta_0, \mu_0, \nu_0.
Step 2. Hyperparameter update
  With (31), (32), (33), (34), and (35), update the hyperparameters c_i^*, \mu_{li}^*, \alpha_{li}^*, \nu_{li}^*, and \beta_{li}^* iteratively. The order of updating does not matter, but each hyperparameter should be updated once and only once during each iteration. Intermediate quantities are computed using (36) and (29).
Step 3. Run until convergence (e.g., the currently estimated model and the previously estimated model are sufficiently close).
Step 4. Return the current estimated hyperparameters to obtain the approximating posterior distribution. The joint posterior distribution of u_{li}, v_{li} (see (4)) is approximated by the product of two gamma distributions with parameters \mu_{li}^*, \alpha_{li}^* and \nu_{li}^*, \beta_{li}^* (see (7) and (5)).

4 EXPERIMENTAL RESULTS

We verify the proposed algorithm for the Bayesian estimation with both synthetic and real data sets. For the synthetic data validation, we generate artificial data from known BMMs and evaluate our algorithm by comparing the estimated PDF with the true one. To show the bias incurred by the variational inference, we compare the approximating posterior distribution with the true one in (4). The validation with real data is based on image processing, skin color detection, and speech coding applications. For image processing, we model Synthetic Aperture Radar (SAR) images with the BMM and the GMM. For the application of skin color detection in RGB space, the performance of BMM- and GMM-based classifiers is compared. To quantize the line spectral frequency (LSF) parameters efficiently, we model the distribution of the LSF parameters with the BMM. Experimental results show that the high rate performance of the BMM based vector quantizer is superior to that of the GMM based vector quantizer. Throughout the experiments, we select the initial settings for the prior distribution as \alpha_0 = \beta_0 = 0.001 and \mu_0 = \nu_0 = 1, which give broad non-informative prior distributions. The initial setting for the Dirichlet distribution is c_{i0} = 0.001, i = 1, ..., I, which gives fair opportunity to all the mixture components and makes the component probabilities controlled mainly by the data. We take the posterior mean as the point estimate of the parameters.

Fig. 4. Comparison of accuracy and training time (log-likelihood per sample versus iteration, for N = 10, 100, 1000). The underlying model used to generate data is Beta(x; 3, 8). We take the true parameters u = 3, v = 8 to evaluate the baseline.

4.1 Synthetic Data Evaluation

The first experiment was carried out with a single beta distribution to show both the accuracy and the bias of the proposed method. The second experiment compared the proposed Bayesian estimation algorithm with the EM algorithm for different data sizes. The proposed method performs better than the EM algorithm, especially when the amount of data is small.

4.1.1 Accuracy/Bias of Variational Inference

Since we have introduced some approximations in both the factorized approximations and the lower bounds, there exists some bias in estimating the posterior distribution of the parameters, even though the algorithm converges. To illustrate the bias, we compare the true posterior distribution obtained in (4) and the approximating posterior distribution obtained with the proposed method. We generated data from an underlying beta distribution with different amounts of observations. Then we calculated the true posterior distribution and estimated the approximating posterior distribution. The comparison is shown in Fig. 3.
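The comparison in Fig. 3 can be reproduced in outline as follows: evaluate the true posterior (4) on a (u, v) grid by normalizing it numerically, and compare it with the product of the two fitted gamma densities (7). The sketch below assumes the helper functions posterior_hyperparameters and log_posterior_unnormalized from the earlier sketch in Section 3.1, and hypothetical gamma hyperparameters returned by the VI algorithm.

```python
import numpy as np
from scipy.stats import gamma

def true_posterior_grid(x, u_grid, v_grid, alpha0=0.001, beta0=0.001, nu0=1.0):
    """Normalize the conjugate posterior (4) numerically on a rectangular grid."""
    nu_N, alpha_N, beta_N = posterior_hyperparameters(x, alpha0, beta0, nu0)
    U, V = np.meshgrid(u_grid, v_grid, indexing="ij")
    logp = log_posterior_unnormalized(U, V, nu_N, alpha_N, beta_N)
    p = np.exp(logp - logp.max())
    du, dv = u_grid[1] - u_grid[0], v_grid[1] - v_grid[0]
    return p / (p.sum() * du * dv)

def vi_posterior_grid(u_grid, v_grid, mu, alpha, nu, beta):
    """Product of the two gamma factors (7) on the same grid."""
    return np.outer(gamma.pdf(u_grid, a=mu, scale=1.0 / alpha),
                    gamma.pdf(v_grid, a=nu, scale=1.0 / beta))
```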

Fig. 3. Comparison of the true posterior distribution and the approximation obtained with the proposed method, for N = 10, 20, 50, 100 observations. The underlying model used to generate data is Beta(x; 5, 8). We ran 10 rounds of such simulations. The average Euclidean distances between the obtained posterior mean and the true parameters are 5.73, 3.82, 1.57, 1.09, while the systematic biases of the posterior mean estimation (measured in Euclidean distance) are 5.23, 2.88, 1.12, 0.32 for N = 10, 20, 50, 100, respectively.

TABLE 2
Comparison of the KL divergences.

                  N=10   N=20   N=50   N=100
Beta(x; 5, 8)     1.26   1.11   1.04   1.02
Beta(x; 3, 3)     0.90   0.76   0.74   0.71
Beta(x; 10, 2)    0.94   0.86   0.82   0.80

We can observe that 1) there is a mismatch between the approximation and the true posterior; 2) however, as the number of observations increases, both the approximation and the true posterior distribution concentrate in a narrower area, and the difference becomes smaller; 3) the posterior mean of the approximating model can be used as a point estimate of the true parameters, with increasing accuracy for larger amounts of data. This experiment has been repeated several times with different settings of the underlying beta distributions, and similar results were obtained. Due to space limitations, we show only one example here. Furthermore, we calculated the KL divergence of the true posterior distribution from the approximation as

\mathrm{KL}(f(u|X) f(v|X) \,\|\, f(u, v|X)) = \int_0^{\infty}\!\int_0^{\infty} f(u|X) f(v|X) \ln \frac{f(u|X) f(v|X)}{f(u, v|X)}\, du\, dv.    (38)

The KL divergences for different sample sizes are shown in Tab. 2. The true posterior distribution was calculated numerically by the importance sampling method. We ran 10 rounds of such simulations for each beta distribution and report the mean value as the KL divergence. As expected, the approximation violates the correlation of the true posterior distribution, but it is still efficient at capturing the true parameters through the mean. With more observations, this bias becomes smaller. To illustrate the training time, we compare the accuracy and training time in Fig. 4. It is observed that, for the different data sizes, the algorithm always converges after 30 to 40 iterations.
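The KL divergence (38) has no closed form because the true posterior (4) is only known up to its normalizing constant. A common workaround, and the one used above, is importance sampling; the hedged sketch below uses the factorized VI posterior itself as the proposal, which turns both the normalizing constant and the KL integral into Monte Carlo estimates. It assumes the log_posterior_unnormalized helper from the Section 3.1 sketch.

```python
import numpy as np
from scipy.stats import gamma
from scipy.special import logsumexp

def kl_vi_from_true(mu, alpha, nu, beta, nu_N, alpha_N, beta_N, S=200000, seed=0):
    """Monte Carlo estimate of KL( f(u|X) f(v|X) || f(u,v|X) ) in Eq. (38)."""
    rng = np.random.default_rng(seed)
    u = rng.gamma(mu, 1.0 / alpha, size=S)           # samples from the gamma factor of u
    v = rng.gamma(nu, 1.0 / beta, size=S)            # samples from the gamma factor of v
    log_q = (gamma.logpdf(u, mu, scale=1.0 / alpha)
             + gamma.logpdf(v, nu, scale=1.0 / beta))
    log_p_unnorm = log_posterior_unnormalized(u, v, nu_N, alpha_N, beta_N)
    # log normalizing constant of (4), estimated with q as the importance proposal
    log_C = logsumexp(log_p_unnorm - log_q) - np.log(S)
    return np.mean(log_q - (log_p_unnorm - log_C))
```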

4.1.2 Modelling the Underlying Distribution

Generally speaking, the Bayesian estimation can still perform well even when there are few samples. We compared the proposed Bayesian method with the EM algorithm [17] by modelling the underlying distribution of some synthetic data. For a known BMM, we generated data sets of different sizes and estimated the PDF of the data. We initialized the number of mixture components at two times that of the true BMM (i.e., if the BMM has 2 mixture components, we initialize the algorithm with 4 mixture components), both for the Bayesian method and for the EM algorithm. Fig. 5 shows the comparisons among the estimated PDFs, the data histogram, and the underlying PDF. When the amount of data is small (e.g., N = 10, 50), the histogram of the generated data cannot describe the true underlying PDF properly, or is even far from the underlying PDF. Thus we cannot expect the PDFs estimated from the data to have the same shape as the underlying PDF. In these cases, the Bayesian method can adjust the model complexity according to the data and capture the modality of the data better than the EM algorithm. The EM algorithm can lead to overfitting by assigning some components to small amounts of data or even to one single data point (see Fig. 5(a), 5(b), 5(e), and 5(f)). As the amount of data increases, the histogram of the generated data approaches the underlying PDF. Then both methods can estimate the PDF of the underlying data efficiently. However, the EM algorithm cannot discard the unnecessary mixture components. The estimated values of the weighting factors \Pi are listed in Tab. 3. The Bayesian method models the data better than the EM algorithm in all the cases, and the efficiency of the proposed Bayesian method is verified.

Fig. 5. Estimating the PDF with the proposed Bayesian method and EM, for N = 10, 50, 100, 1000 observations generated from the underlying BMM. The first row: model A, \pi_1 = 0.3, u_1 = 2, v_1 = 8 and \pi_2 = 0.7, u_2 = 15, v_2 = 4. The second row: model B, \pi_1 = 0.3, u_1 = 10, v_1 = 2; \pi_2 = 0.4, u_2 = 2, v_2 = 12; and \pi_3 = 0.3, u_3 = 10, v_3 = 10.

TABLE 3
The true and estimated mixture weighting factors.

                                       Model A                        Model B
True values of \pi                     0.3, 0.7                       0.3, 0.4, 0.3
Estimated \pi, N = 10     Bayesian     0.300, 0.000, 0.303, 0.397     0.300, 0.199, 0.000, 0.201, 0.300, 0.000
                          EM           0.200, 0.300, 0.300, 0.200     0.100, 0.100, 0.300, 0.200, 0.200, 0.100
Estimated \pi, N = 50     Bayesian     0.717, 0.000, 0.283, 0.000     0.295, 0.182, 0.259, 0.125, 0.139, 0.000
                          EM           0.143, 0.291, 0.404, 0.162     0.076, 0.203, 0.053, 0.268, 0.096, 0.304
Estimated \pi, N = 100    Bayesian     0.327, 0.673, 0.000, 0.000     0.252, 0.325, 0.000, 0.000, 0.423, 0.000
                          EM           0.254, 0.425, 0.235, 0.086     0.079, 0.038, 0.331, 0.231, 0.055, 0.266
Estimated \pi, N = 1000   Bayesian     0.702, 0.000, 0.298, 0.000     0.399, 0.306, 0.000, 0.000, 0.295, 0.000
                          EM           0.250, 0.164, 0.475, 0.111     0.373, 0.274, 0.247, 0.064, 0.009, 0.033

4.2 Real Data Evaluation

In most signal processing applications, the GMM is widely used for modelling the underlying distribution of the data. However, for data with bounded support, the BMM can model the data better than the GMM. To evaluate the modelling performance, we use the Bayesian Information Criterion (BIC) and the Bayes factor to decide the preference between the models. The BIC is defined as

\mathrm{BIC} \triangleq -2 \ln L(X) + K \ln N,    (39)

where L(X) is the likelihood, K is the number of parameters used to represent the model, and N is the number

of observations in X. The K ln N part is a penalty for model complexity. The smaller the BIC is, the more preferred the model is. When the models to be compared have the same number of parameters, comparing the BIC is equivalent to comparing the likelihood. Each mixture component in the BMM has 2L parameters,^4 which is the same as for a GMM with diagonal covariance matrix. Thus comparing the BIC is equivalent to comparing the likelihood in our case, as we use the diagonal covariance matrix for the GMM. In [16], the authors argued that the BIC is not suitable for beta mixtures and introduced another criterion. We think this might be due to the statistical properties of the bioinformatics data. The BIC is a well-established criterion and should be suitable for model selection, both for mixture models and for single models. We also use another criterion, the Bayes factor (BF) [39], which is equivalent to the posterior odds when the two hypotheses are equally probable a priori. The BF for hypothesis H_1 over hypothesis H_2 is defined as

\mathrm{BF} \triangleq \frac{f(X|H_1)}{f(X|H_2)},    (40)

where f(X|H_i), i = 1, 2, is either the marginal likelihood (f(X|H_i) = \int f(X|\theta_i, H_i) f(\theta_i|H_i)\, d\theta_i) when the parameters in H_i are uncertain, or the likelihood when the parameters in H_i are known. In the majority of cases, the integration in the marginal likelihood is not analytically tractable, but numerical calculation can be used to evaluate it. We use the importance sampling method.

4. We estimated 4L parameters in the Bayesian BMM but only use 2L of them.

4.2.1 SAR Image Processing

Synthetic Aperture Radar (SAR) images are widely used in civil and military applications such as terrain detection and resource exploration. The estimated parameters of the unimodal (or multimodal) SAR image histogram can be used for pattern recognition and classification. The main reasons for choosing a BMM to estimate the histograms are: 1) most of the SAR image histograms are asymmetrical (shown in Fig. 6(b)); 2) the pixel value of the 8-bit gray image is in the interval [0, 255] and can be linearly normalized into [0, 1].

Fig. 6. SAR image of the Yukon River (© Canadian Space Agency 1997): (a) SAR image of size 311 × 200; (b) estimated and real PDF; (c) component PDFs of BMM; (d) component PDFs of GMM. The estimated and real histograms are shown.

The Bayesian approach derived above can be utilized to estimate the distribution that models the SAR image data histogram. Also, we apply the GMM to estimate the histogram, with the same number of mixture components as found by the Bayesian BMM. Fig. 6(b) shows the histograms estimated by the BMM and the GMM. The Bayesian BMM models the multimodal feature of the histogram and keeps three components. The weighted components for the BMM and the GMM are shown in Fig. 6(c) and 6(d), respectively. Although the BMM and the GMM give almost the same result, the BIC (per sample) for the BMM is BIC_B = −2.70 and the BIC for the GMM is BIC_G = −2.68. The BF (per sample) of the BMM over the GMM is 1.01. The marginal likelihood was calculated as the weighted summation of the component marginal likelihoods. The comparison of the BICs indicates that the BMM is preferred over the GMM.
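The model comparison in (39)–(40) reduces to a few lines of code once per-sample log-likelihoods are available. The sketch below is a hypothetical illustration: bmm_log_likelihood is the helper from the Section 2 sketch, and gmm_log_likelihood stands for any equivalent routine for a diagonal-covariance GMM.

```python
import numpy as np

def bic(log_likelihood, num_params, num_samples):
    """Eq. (39): BIC = -2 ln L(X) + K ln N."""
    return -2.0 * log_likelihood + num_params * np.log(num_samples)

def bayes_factor(log_marginal_h1, log_marginal_h2):
    """Eq. (40) on the log scale: BF = f(X|H1) / f(X|H2)."""
    return np.exp(log_marginal_h1 - log_marginal_h2)

# hypothetical comparison of a BMM and a diagonal GMM, each with I components in L dimensions
# N, I, L = X.shape[0], 3, 1
# bic_bmm = bic(bmm_log_likelihood(X, pi, U, V), 2 * L * I, N)
# bic_gmm = bic(gmm_log_likelihood(X, pi_g, means, vars_), 2 * L * I, N)
```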

Fig. 7. Recognition score comparison of different methods [18].

8

7

0

0.8

1

0

0

0.2

0.4

0.6

0.8

1

(d) Component PDFs of GMM.

Fig. 6. SAR image of the Yukon River. The estimated and real c Canadian Space Agency 1997. histograms are shown.  histogram with the same number of mixture components found by Bayesian BMM. Fig. 6(b) shows the estimated histograms by BMM and GMM. The Bayesian BMM models the multimodal feature of the histogram and keeps three components. The weighted components for BMM and GMM are shown in Fig. 6(c) and 6(d), respectively. Although both BMM and GMM give almost the same result, the BIC (per sample) for the BMM is BICB = −2.70 and the BIC for the GMM is BICG = −2.68. The BF (per sample) of BMM over GMM is 1.01. The marginal likelihood was calculated as the weighted summation of the component marginal likelihood. The comparison of BICs indicates that BMM is more preferred than GMM. 4.2.2 Human Skin Color Detection For the purpose of human skin color detection in RGB space5 , we applied [18] the Bayesian BMM based classifier to the well-known Compaq database [40]. Each pixel value from the RGB image was considered as a three dimensional vector. The database was partitioned randomly into a training sub-database and a test subdatabase. Each sub-database consists of skin and nonskin images. Two Bayesian BMM, one for the skin pixels and the other for the non-skin pixels, were trained for the task of skin/non-skin color classification. The pixel value 5. This work has been published in [18].

in the test sub-database was classified to be skin or nonskin based on the obtained model. All the pixels were labelled so that we can calculate the correct decision of classifying the skin pixels as skin (True Positive Rate (TPR)) and the false decision of classifying non-skin pixels as skin (False Positive Rate (FPR)). To evaluate the performance of our Bayesian BMM based classifier, we used the ROC analysis [46]. Our Bayesian BMM based classifier reported 80% true positive rate (TPR) with 7.2% false positive rate (FPR) and 90% TPR with 11.8% FPR. The accuracy rate (the total number of correct decision out of the total input) of the classifier was 88.9% when the TPR is equal to one minus the FPR. To prevent the effect of randomness, we executed 60 rounds of the train-test procedures mentioned above. The mean values are reported. Some other classifiers based on pixel probabilistic models were also analyzed with the Compaq database in the previous literature [40]–[45]. In Fig. 7, all the results obtained in the previous literature are all under the ROC curve. Thus, for the statistical model based pixel classification, the Bayesian BMM based classifier outperform all the other methods. 4.2.3 Quantization of Line Spectral Frequencies The line spectral frequencies (LSF) are known to be the most efficient representation of the linear predictive coding (LPC) parameters. The LSF parameters are normalized angular frequencies between [0, π]. The proposed Bayesian BMM was applied to model the distribution of the LSF parameters. The LSF parameters were extracted from the TIMIT database [47], where the speech is sampled at 16 kHz. The dimension of the LSF parameters is L = 16 for the wide-band data. For the purpose of applying the BMM, the LSF parameters were divided by π to fit the definition interval. We modelled the LSF parameters by the proposed Bayesian BMM and the GMM with diagonal covariance matrix, respectively. The 1st dimension and 2nd dimension of the LSF parameters are shown in Fig. 8. We can observe that both Bayesian

4.2.3 Quantization of Line Spectral Frequencies

The line spectral frequencies (LSF) are known to be the most efficient representation of the linear predictive coding (LPC) parameters. The LSF parameters are normalized angular frequencies in [0, π]. The proposed Bayesian BMM was applied to model the distribution of the LSF parameters. The LSF parameters were extracted from the TIMIT database [47], where the speech is sampled at 16 kHz. The dimension of the LSF parameters is L = 16 for the wide-band data. For the purpose of applying the BMM, the LSF parameters were divided by π to fit the definition interval. We modelled the LSF parameters by the proposed Bayesian BMM and by a GMM with diagonal covariance matrices, respectively. The 1st and 2nd dimensions of the LSF parameters are shown in Fig. 8.

[Fig. 8 panels: (a) Distribution of LSF; (b) Modelled by Bayesian BMM; (c) Modelled by GMM (axes: 1st dimension vs. 2nd dimension); (d) Theoretical D-R performance (mean square error per dimension, ×10−4, versus rate in bits; curves for BVQ and GVQ).]

Fig. 8. Comparison of BVQ and GVQ. The ellipses in Fig. 8(b) and 8(c) indicate where the GMM violated the bounded support of the LSF parameters.

We can observe that both the Bayesian BMM and the GMM could model the data distribution. However, the bounded support of the LSF parameters (they lie in [0, 1] after normalization) is violated by the GMM, while the Bayesian BMM describes it properly. With the obtained model, the theoretical high-rate distortion-rate (D-R) performance of the PDF-optimized vector quantization (VQ) in the entropy-constrained case can be approximated as

$$D(R) = C(2, L) \sum_{i=1}^{I} \pi_i\, 2^{-\frac{2}{L}\left(R_i - h_i(X)\right)}. \qquad (41)$$

The relation between the total number of bits (R) and the number of bits assigned to the ith component (R_i) is

$$\sum_{i=1}^{I} \pi_i R_i = R - \log_2 I. \qquad (42)$$

Here log_2 I is the number of bits assigned to identify the index of the component, and C(2, L) is a constant which depends only on the dimensionality (L) of the LSF parameters. h_i(X) is the differential entropy of the ith mixture component, calculated as

$$h_i(X) = -\sum_{l=1}^{L} \int_0^1 f(x_{li}) \log_2 f(x_{li})\, dx_{li}. \qquad (43)$$

The purpose of bit allocation is to minimize the mismatch between the source and the quantized data. The theoretical D-R performance and the bit allocation strategy of the BMM based VQ were derived in [21]. The theoretical high-rate D-R performances of the BMM based VQ (BVQ) and the GMM based VQ (GVQ) were evaluated and are shown in Fig. 8(d). It can be observed that the theoretical distortion of BVQ is smaller than that of GVQ, which indicates that the Bayesian BMM obtained by the proposed method is a more promising model for designing a practical PDF-optimized VQ.
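For illustration, the sketch below evaluates the high-rate approximation (41) under the bit constraint (42). The equal-exponent allocation used here (R_i − h_i constant across components) is a common high-rate heuristic and an assumption of this sketch, not necessarily the allocation strategy derived in [21], and the value of C(2, L) is left as an input.

```python
import numpy as np

def dr_performance(pi, h, R_total, L, C):
    """Evaluate D(R) of Eq. (41) for an I-component mixture.

    pi      : component weights pi_i (summing to 1)
    h       : per-component differential entropies h_i(X), Eq. (43)
    R_total : total bit budget R
    L       : dimensionality of the vectors
    C       : the constant C(2, L)
    """
    pi, h = np.asarray(pi, float), np.asarray(h, float)
    I = len(pi)
    R_eff = R_total - np.log2(I)          # bits left after the component index, Eq. (42)
    R_i = h + (R_eff - np.dot(pi, h))     # equal-exponent allocation (assumption)
    return C * np.sum(pi * 2.0 ** (-(2.0 / L) * (R_i - h)))

# Hypothetical example: three components, L = 16, 45 bits in total
print(dr_performance([0.5, 0.3, 0.2], [-10.0, -12.0, -11.0], R_total=45, L=16, C=1.0))
```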

CONCLUSION

Due to the analytically intractable integration, estimating the distribution of the parameters in the BMM is difficult. The main contribution of this paper is to propose an algorithm based on the variational inference framework to approximate the posterior distribution of the parameters. By factorizing the true posterior distribution into a product of factor distributions, we update the factorized distributions iteratively to minimize the Kullback-Leibler divergence of the approximating posterior distribution from the true posterior distribution. By using certain approximations as lower bounds to the analytically intractable integration, this iteration is guaranteed to work properly. With a proper initialization, the distributions of the parameters in the BMM and the complexity of the mixture model can be determined automatically. Also, the proposed Bayesian approach circumvents the requirement of sampling, which is the computational bottleneck in Bayesian estimation with Gibbs sampling. The efficiency and correctness of the proposed method were verified with synthetic data. Compared to the conventional EM algorithm, the proposed Bayesian method avoids the drawbacks of overfitting and of iterative numerical calculation in the maximization step. Furthermore, the proposed Bayesian method estimates a more accurate distribution when the amount of data is small. As the number of samples increases, the Bayesian method converges to the same result as the EM algorithm. For the evaluation on real data, we applied the proposed Bayesian BMM to the tasks of SAR image modeling, skin color detection, and quantization of LSF parameters. The proposed Bayesian BMM was superior to the GMM, due to the bounded support of the data and the flexibility of the Bayesian framework.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their fruitful suggestions. We would also like to thank Dr. Guoqiang Zhang for his kind discussions.

REFERENCES

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, 1990.
[2] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 4–37, 2000.
[3] A. R. Webb, Statistical Pattern Recognition, Second Edition. Wiley, 2002.
[4] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[5] R. Gnanadesikan, R. S. Pinkham, and L. P. Hughes, "Maximum likelihood estimation of the parameters of the beta distribution from smallest order statistics," Technometrics, vol. 9, pp. 607–620, 1967.
[6] R. J. Beckman and G. L. Tietjen, "Maximum likelihood estimation for the beta distribution," Journal of Statistical Computation and Simulation, vol. 7, pp. 253–258, 1978.
[7] B. Wagle, "Multivariate beta distribution and a test for multivariate normality," Journal of the Royal Statistical Society, Series B (Methodological), vol. 30, pp. 511–516, 1968.
[8] I. Olkin and R. Liu, "A bivariate beta distribution," Statistics & Probability Letters, vol. 62, pp. 407–412, 2003.


[9] A. K. Gupta and S. Nadarajah, Eds., Handbook of Beta Distribution and Its Applications. Marcel Dekker, 2004.
[10] N. Bouguila, D. Ziou, and E. Monga, "Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications," Statistics and Computing, vol. 16, pp. 215–225, 2006.
[11] V. P. Savchuk and H. F. Martz, "Bayes reliability estimation using multiple sources of prior information: binomial sampling," IEEE Transactions on Reliability, vol. 43, pp. 138–144, 1994.
[12] J. C. Lee and Y. L. Lio, "A note on Bayesian estimation and prediction for the beta-binomial model," Journal of Statistical Computation and Simulation, vol. 63, pp. 73–91, 1999.
[13] F. Cribari-Neto and K. L. P. Vasconcellos, "Nearly unbiased maximum likelihood estimation for the beta distribution," Journal of Statistical Computation and Simulation, vol. 72, pp. 107–118, 2002.
[14] G. J. McLachlan and D. Peel, Finite Mixture Models. Wiley, 2000.
[15] M. A. T. Figueiredo and A. K. Jain, "Unsupervised learning of finite mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 381–396, 2002.
[16] Y. Ji, C. Wu, P. Liu, J. Wang, and K. R. Coombes, "Application of beta-mixture models in bioinformatics," Bioinformatics, vol. 21, pp. 2118–2122, 2005.
[17] Z. Ma and A. Leijon, "Beta mixture models and the application to image classification," in Proceedings of the International Conference on Image Processing, 2009.
[18] ——, "Human skin color detection in RGB space with Bayesian estimation of beta mixture models," in Proceedings of the 18th European Signal Processing Conference (EUSIPCO 2010), 2010.
[19] P. Hedelin and J. Skoglund, "Vector quantization based on Gaussian mixture models," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 385–401, Jul. 2000.
[20] J. Lindblom and J. Samuelsson, "Bounded support Gaussian mixture modeling of speech spectra," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 1, pp. 88–99, Jan. 2003.
[21] Z. Ma and A. Leijon, "PDF-optimized LSF vector quantization based on beta mixture models," in Proceedings of Interspeech, 2010.
[22] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New York: Wiley, 1997.
[23] H. Attias, "A variational Bayesian framework for graphical models," in Advances in Neural Information Processing Systems 12. MIT Press, 2000, pp. 209–215.
[24] N. Ueda and Z. Ghahramani, "Bayesian model search for mixture models based on optimizing variational bounds," Neural Networks, vol. 15, pp. 1223–1241, 2002.
[25] T. S. Jaakkola and M. I. Jordan, "Bayesian parameter estimation via variational methods," Statistics and Computing, vol. 10, pp. 25–37, 2000.
[26] T. S. Jaakkola, "Tutorial on variational approximation methods," in Advances in Mean Field Methods, M. Opper and D. Saad, Eds. MIT Press, 2001, pp. 129–159.
[27] J. M. Bernardo and A. F. M. Smith, Bayesian Theory. John Wiley & Sons, Ltd, 1994.
[28] D. M. Blei, "Probabilistic models of text and images," Ph.D. dissertation, University of California, Berkeley, 2004.
[29] J. A. Palmer, "Relative convexity," ECE Dept., UCSD, Tech. Rep., 2003.
[30] P. Diaconis and D. Ylvisaker, "Conjugate priors for exponential families," The Annals of Statistics, vol. 7, pp. 269–281, 1979.
[31] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
[32] M. D. Hoffman, D. M. Blei, and P. R. Cook, "Bayesian nonparametric matrix factorization for recorded music," in Proceedings of the 27th International Conference on Machine Learning, 2010.
[33] M. I. Jordan, Learning in Graphical Models. MIT Press, 1999.
[34] M. Braun and J. McAuliffe, "Variational inference for large-scale models of discrete choice," Journal of the American Statistical Association, vol. 105, pp. 324–335, 2010.
[35] P. J. Bickel and K. A. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics. Pearson Prentice Hall, 2007.
[36] D. M. Blei and J. D. Lafferty, "Correlated topic models," in Advances in Neural Information Processing Systems, 2006.
[37] ——, "A correlated topic model of Science," The Annals of Applied Statistics, vol. 1, pp. 17–35, 2007.
[38] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[39] R. E. Kass and A. E. Raftery, "Bayes factors," Journal of the American Statistical Association, vol. 90, no. 430, pp. 773–795, 1995. [Online]. Available: http://www.jstor.org/stable/2291091
[40] M. J. Jones and J. M. Rehg, "Statistical color models with application to skin detection," International Journal of Computer Vision, vol. 46, no. 1, pp. 81–96, 2002.
[41] J. Brand and J. S. Mason, "A comparative assessment of three approaches to pixel-level human skin-detection," in Proceedings of IEEE International Conference on Pattern Recognition, vol. 1, 2000, pp. 1056–1059.
[42] J. Y. Lee and S. I. Yoo, "An elliptical boundary model for skin color detection," in Proceedings of the International Conference on Imaging Science, Systems, and Technology, 2002.
[43] B. Jedynak, H. Zheng, M. Daoudi, and D. Barret, "Maximum entropy models for skin detection," in Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2002, pp. 276–281.
[44] D. A. Brown, I. Craw, and J. Lewthwaite, "A SOM based approach to skin detection with application in real time systems," in Proceedings of the British Machine Vision Conference, 2001.
[45] M. M. Aznaveh, H. Mirzaei, E. Roshan, and M. Saraee, "A new and improved skin detection method using RGB vector space," in Proceedings of IEEE International Multi-Conference on Systems, Signals and Devices, Jul. 2008, pp. 1–5.
[46] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[47] "DARPA-TIMIT," Acoustic-phonetic continuous speech corpus, NIST Speech Disc 1.1-1, 1990.

Zhanyu Ma received his B.Eng. and M.Eng. degrees in signal and information processing from BUPT (Beijing University of Posts and Telecommunications), Beijing, China, in 2004 and 2007, respectively. In 2007, he joined the Sound and Image Processing (SIP) lab at KTH (Royal Institute of Technology), Stockholm, Sweden, where he is now pursuing the Ph.D. degree. His research area includes Bayesian estimation of statistical models and their applications in sound and image processing.

Arne Leijon has been a Professor in Hearing Technology at the KTH (Royal Institute of Technology) Sound and Image Processing Lab, Stockholm, Sweden, since 1994. His main research interest concerns applied signal processing in aids for people with hearing impairment, and methods for individual fitting of these aids, based on psychoacoustic modelling of sensory information transmission and subjective sound quality. He received the M.S. degree in Engineering Physics in 1971 and the Ph.D. degree in Information Theory in 1989, both from Chalmers University of Technology, Gothenburg, Sweden.


Appendices to Bayesian Estimation of Beta Mixture Models with Variational Inference
Zhanyu Ma, Student Member, IEEE, and Arne Leijon



APPENDIX A
PROOF OF THE RELATIVE CONVEXITY OF LOG INVERSE-BETA FUNCTION (PROPERTY 3.3)

A convex function is defined as

$$f[tx + (1-t)y] \le t f(x) + (1-t) f(y), \qquad (1)$$

where x, y ∈ dom f and 0 ≤ t ≤ 1. According to the second-order condition of the convex function in [1], f(x) is a convex function of x if and only if dom f is convex and f''(x) ≥ 0. By changing the variable x = e^z, z ∈ R+, the second derivative of the LIB function with respect to z = ln x in (20) is

$$\frac{\partial^2 F(e^z)}{\partial z^2} = e^z\left\{\left[\psi(e^z + y) - \psi(e^z)\right] + e^z\left[\psi'(e^z + y) - \psi'(e^z)\right]\right\} = e^z \int_0^{\infty} \frac{1 - e^{-yt}}{1 - e^{-t}}\, e^{-e^z t}\left(1 - e^z t\right) dt, \qquad (2)$$

where we use the integral representation of the digamma function, $\psi(x) = \int_0^{\infty} \left(\frac{e^{-t}}{t} - \frac{e^{-xt}}{1 - e^{-t}}\right) dt$, and the integral representation of its derivative, $\psi'(x) = \int_0^{\infty} \frac{t e^{-xt}}{1 - e^{-t}}\, dt$. When y = 1, (2) simplifies to

$$\frac{\partial^2 F(e^z)}{\partial z^2} = e^z \int_0^{\infty} e^{-e^z t}\left(1 - e^z t\right) dt = e^z \cdot \lim_{t \to \infty} t e^{-e^z t} = 0. \qquad (3)$$

Denote p(t) = e^{-e^z t}(1 - e^z t) and q(t) = (1 - e^{-yt})/(1 - e^{-t}). From Fig. 1(a), we recognize that p(t) has the property that p(t) < 0 if t > 1/e^z and p(t) > 0 if t < 1/e^z. Clearly, q(t) is monotonically decreasing and greater than 1 when y > 1, as shown in Fig. 1(b). The curve of p(t)q(t) as a function of t is shown in Fig. 1(c). Let S1 and S2 denote the areas of the two regions indicated there, respectively:

$$S_1 = \int_0^{1/e^z} p(t)q(t)\, dt, \qquad S_2 = -\int_{1/e^z}^{\infty} p(t)q(t)\, dt.$$

Since q(t) > q(1/e^z) when t < 1/e^z and q(t) < q(1/e^z) when t > 1/e^z, we know that

$$S_1 = \int_0^{1/e^z} p(t)q(t)\, dt > \int_0^{1/e^z} p(t)\, q\!\left(\tfrac{1}{e^z}\right) dt, \qquad S_2 = -\int_{1/e^z}^{\infty} p(t)q(t)\, dt < -\int_{1/e^z}^{\infty} p(t)\, q\!\left(\tfrac{1}{e^z}\right) dt. \qquad (4)$$

Substituting (4) into (2), we obtain

$$\frac{\partial^2 F(e^z)}{\partial z^2} = e^z \left(S_1 - S_2\right) > e^z\, q\!\left(\tfrac{1}{e^z}\right) \int_0^{\infty} p(t)\, dt = 0, \qquad (5)$$

which means that the LIB function is a convex function of ln x. Symmetrically, if y < 1, the LIB function is a concave function of ln x.
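The statement above can be spot-checked numerically. Below is a minimal sketch, assuming SciPy is available, that verifies the sign of the second difference of F(e^z) = ln[1/beta(e^z, y)] with respect to z; it is an illustration, not part of the proof.

```python
import numpy as np
from scipy.special import betaln

def second_diff_log_domain(y, z, eps=1e-4):
    # F(x) = ln(1/beta(x, y)) = -betaln(x, y), evaluated at x = exp(z)
    F = lambda zz: -betaln(np.exp(zz), y)
    return (F(z + eps) - 2.0 * F(z) + F(z - eps)) / eps**2

zs = np.linspace(0.1, 3.0, 10)
print(all(second_diff_log_domain(5.0, z) >= 0 for z in zs))   # convex in ln x for y > 1
print(all(second_diff_log_domain(0.3, z) <= 0 for z in zs))   # concave in ln x for y < 1
```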

APPENDIX B
RELATIVE CONVEXITY OF PSEUDO DIGAMMA FUNCTION (PROPERTY 3.5)

The integral representation of the derivative of ψ'(x) with respect to x is

$$\psi''(x) = -\int_0^{\infty} \frac{t^2 e^{-xt}}{1 - e^{-t}}\, dt. \qquad (6)$$

The second derivative of G(x + y) in property 3.5 with respect to ln x is then (with the change of variable x = e^z, z ∈ R+) expressed as

$$\frac{\partial^2 G(e^z + y)}{\partial z^2} = e^z\left[\psi'(e^z + y) + e^z \psi''(e^z + y)\right] = e^z\left[\int_0^{\infty} \frac{t e^{-(e^z + y)t}}{1 - e^{-t}}\, dt - e^z \int_0^{\infty} \frac{t^2 e^{-(e^z + y)t}}{1 - e^{-t}}\, dt\right] = e^z \int_0^{\infty} \frac{t e^{-yt}}{1 - e^{-t}} \left(1 - e^z t\right) e^{-e^z t}\, dt. \qquad (7)$$

Since t e^{-yt}/(1 - e^{-t}) is a monotonically decreasing function when y ≥ 1, the same method as in Appendix A gives ∂²G(e^z + y)/∂z² > 0 when y ≥ 1. So the pseudo digamma function is a convex function of ln x if y ≥ 1.
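Again, a quick numerical spot check of (7), assuming SciPy's polygamma function (illustrative only):

```python
import numpy as np
from scipy.special import polygamma

def d2G_dz2(z, y):
    x = np.exp(z)
    # e^z * psi'(e^z + y) + e^{2z} * psi''(e^z + y), cf. Eq. (7)
    return x * polygamma(1, x + y) + x ** 2 * polygamma(2, x + y)

print(all(d2G_dz2(z, 1.5) > 0 for z in np.linspace(0.1, 3.0, 10)))
```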

APPENDIX C
APPROXIMATIONS OF THE LIB FUNCTION AND THE PSEUDO DIGAMMA FUNCTION (PROPERTIES 3.4 AND 3.6)

The first-order Taylor expansion of the LIB function F(x) in (21) for ln x at ln x0 is

$$F(x) \approx F(x_0) + \frac{\partial F(x)}{\partial \ln x}\bigg|_{x = x_0} (\ln x - \ln x_0) = F(x_0) + \frac{\partial F(x)}{\partial x}\frac{\partial x}{\partial \ln x}\bigg|_{x = x_0} (\ln x - \ln x_0) = \ln\frac{1}{\mathrm{beta}(x_0, y)} + \left[\psi(x_0 + y) - \psi(x_0)\right] x_0 (\ln x - \ln x_0). \qquad (8)$$

Since the first-order Taylor expansion of F(x) for ln x is a linear function of ln x and is also a tangent line to F(x) in the ln x domain, the convexity of the LIB function gives F(x) ≥ ln[1/beta(x_0, y)] + [ψ(x_0 + y) − ψ(x_0)] x_0 (ln x − ln x_0) if y > 1, with equality when x = x_0, which is exactly (21) in property 3.4. Similarly, property 3.6 can be proved in the same way. These two corollaries give the approximations of the LIB function and the pseudo digamma function.
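The lower bound in property 3.4 can likewise be spot-checked numerically; a minimal sketch assuming SciPy:

```python
import numpy as np
from scipy.special import betaln, digamma

def lib(x, y):
    return -betaln(x, y)  # F(x) = ln(1 / beta(x, y))

def tangent_bound(x, x0, y):
    # First-order Taylor expansion of F(x) in ln x around ln x0, Eq. (8)
    return lib(x0, y) + (digamma(x0 + y) - digamma(x0)) * x0 * (np.log(x) - np.log(x0))

x0, y = 2.0, 3.0
xs = np.linspace(1.1, 10.0, 50)
print(all(lib(x, y) >= tangent_bound(x, x0, y) - 1e-12 for x in xs))
```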


Fig. 1. Illustration of the proof of the convexity of the LIB function. (a) Curve of p(t). (b) Curve of q(t). (c) Curve of p(t)q(t).

APPENDIX D
APPROXIMATION OF THE BIVARIATE LIG FUNCTION (PROPERTY 3.7)

Let H̃(x, y) be the result of subtracting the pseudo second-order Taylor expansion of H(x, y) from H(x, y). By setting the first derivative of H̃(x, y) with respect to (ln x, ln y) equal to 0, we obtain one stationary point (ln x0, ln y0). The Hessian of H̃(x, y) with respect to (ln x, ln y) is

$$\nabla^2 \widetilde{H}(x, y) =
\begin{bmatrix}
x^2\left[\psi'(x+y) - \psi'(x)\right] + x\left[\psi(x+y) - \psi(x)\right] - x_0^2\left[\psi'(x_0+y_0) - \psi'(x_0)\right] & xy\,\psi'(x+y) - x_0 y_0\,\psi'(x_0+y_0) \\
xy\,\psi'(x+y) - x_0 y_0\,\psi'(x_0+y_0) & y^2\left[\psi'(x+y) - \psi'(y)\right] + y\left[\psi(x+y) - \psi(y)\right] - y_0^2\left[\psi'(x_0+y_0) - \psi'(y_0)\right]
\end{bmatrix}.$$

Substituting (ln x, ln y) with (ln x0, ln y0), the Hessian with respect to (ln x, ln y) reduces to the diagonal matrix

$$\nabla^2 \widetilde{H}(x, y)\Big|_{x = x_0,\, y = y_0} =
\begin{bmatrix}
x_0\left[\psi(x_0+y_0) - \psi(x_0)\right] & 0 \\
0 & y_0\left[\psi(x_0+y_0) - \psi(y_0)\right]
\end{bmatrix}.$$

Since ψ(·) is a monotonically increasing function, all the diagonal elements are positive and this Hessian matrix is positive definite at (ln x0, ln y0). So H̃(x, y) attains a local minimum at (ln x0, ln y0). Since (ln x0, ln y0) is the only stationary point of H̃(x, y) and H̃(x, y) is continuous and differentiable for all (x, y) ∈ {x > 1, y > 1}, this local minimum is also a global minimum. H̃(x, y) equals its global minimum value (which is 0) when (ln x, ln y) = (ln x0, ln y0). We can then conclude that (23) holds.

REFERENCES
[1] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

