LETTER

Communicated by Włodzisław Duch

A Maximum Likelihood Approach to Density Estimation with Semidefinite Programming

Tadayoshi Fushiki
[email protected]
Institute of Statistical Mathematics, Minato-ku, Tokyo 106-8569, Japan

Shingo Horiuchi
[email protected]
Access Network Service Systems Laboratories, NTT Corp., Makuhari, Chiba 261-0023, Japan

Takashi Tsuchiya
[email protected]
Institute of Statistical Mathematics, Minato-ku, Tokyo 106-8569, Japan

Density estimation plays an important and fundamental role in pattern recognition, machine learning, and statistics. In this article, we develop a parametric approach to univariate (or low-dimensional) density estimation based on semidefinite programming (SDP). Our density model is expressed as the product of a nonnegative polynomial and a base density such as normal distribution, exponential distribution, and uniform distribution. When the base density is specified, the maximum likelihood estimation of the polynomial is formulated as a variant of SDP that is solved in polynomial time with the interior point methods. Since the base density typically contains just one or two parameters, computation of the maximum likelihood estimate reduces to a one- or two-dimensional easy optimization problem with this use of SDP. Thus, the rigorous maximum likelihood estimate can be computed in our approach. Furthermore, such conditions as symmetry and unimodality of the density function can be easily handled within this framework. AIC is used to choose the best model. Through applications to several instances, we demonstrate flexibility of the model and performance of the proposed procedure. Combination with a mixture approach is also presented. The proposed approach has possible other applications beyond density estimation. This point is clarified through an application to the maximum likelihood estimation of the intensity function of a nonstationary Poisson process.

Neural Computation 18, 2777–2812 (2006)

© 2006 Massachusetts Institute of Technology


1 Introduction

Density estimation plays an important and fundamental role in constructing flexible learning models in pattern recognition, machine learning, and statistics. In this article, we develop a new computational approach to density estimation based on the maximum likelihood estimation of parametric models and model selection by AIC. The rigorous maximum likelihood estimate of the parameters is computed by employing semidefinite programming (SDP) as a main tool. SDP is a convex programming technology developed in the field of optimization over the past decade. We deal with one-dimensional density estimation, though the idea can be extended to the multidimensional case in a reasonable manner, as is explained at the end of section 2.

We model a density function as the product of a univariate nonnegative polynomial and a base density such as a normal distribution, exponential distribution, or uniform distribution. In other words, our model amounts to expressing the density functions with nonnegative Hermite, Laguerre, and Legendre polynomials (multiplied by their respective weight functions). The support of the density function can be any of (−∞, ∞), [0, ∞), and [a, b], where a, b ∈ R, and the models are capable of approximating density functions (with the same support) precisely by taking a polynomial of sufficiently large degree.

Our main tool for the maximum likelihood estimation of this model is SDP. We formulate the maximum likelihood estimation as a bilevel optimization problem, separating optimization of the nonnegative polynomial part and the base density part. At the lower level, the likelihood function is maximized with respect to the nonnegative polynomial part with the base density part fixed. Such a partially maximized likelihood is referred to as the profile likelihood. At the higher level, global optimization of the likelihood function is done by maximizing the profile likelihood function with respect to the base density parameters. As will be explained later, the lower-level optimization of the nonnegative polynomial part reduces to a simple variant of SDP. This variant can be solved rigorously and efficiently with interior point algorithms developed recently. Taking advantage of this fact, the higher-level optimization of the base density part is not difficult, since the number of parameters involved in the base density is quite low (typically one or two).

SDP is a class of convex programming that has developed rapidly in the past decade (Alizadeh, 1995; Ben-Tal & Nemirovski, 2001; Nesterov & Nemirovskii, 1994; Vandenberghe & Boyd, 1996; Wolkowicz, Saigal, & Vandenberghe, 2000). The main purpose of SDP is to solve a semidefinite programming problem, the problem of minimizing a linear function over the intersection of an affine space and the cone of symmetric positive semidefinite matrices. SDP is an extension of classical linear programming and has wide applications in control, combinatorial optimization, signal processing, communication systems design, optimal shape design,


and machine learning, among others. A semidefinite programming problem can be solved efficiently in both theory and practice with interior point methods (Monteiro & Todd, 2000). We will show that the maximum likelihood estimate of our model can be computed efficiently and rigorously with the techniques of SDP. Recently, several interesting applications of convex programming have been developed in machine learning (e.g., Bhattacharyya, 2004; Graepel & Herbrich, 2004; Lanckriet, El Ghaoui, Bhattacharyya, & Jordan, 2002; Lanckriet, Cristianini, Bartlett, El Ghaoui, & Jordan), in particular after the success of Vapnik’s support vector machine (Vapnik, 1995). Bennett and Mangasarian (1992) is another pioneering work in the same direction. A necessary and sufficient condition for a univariate polynomial to be nonnegative over the support sets can be expressed as semidefinite constraints (Nesterov, 2000). With this formulation, when the base density is specified, the maximum likelihood estimation can be formulated as a variant of SDP, namely, maximizing the log determinant function under a semidefinite constraint. The formulated optimization problem is not exactly a semidefinite program, but due to its special form, we can solve it as efficiently as an ordinary semidefinite program with the interior point method (Vandenberghe, Boyd, & Wu, 1998; Toh, 1999; Tsuchiya & Xia, 2006). Specifically, we use the primal-dual interior point method to solve the variant of SDP. Once the maximum likelihood estimate is obtained, AIC (Akaike, 1973, 1974) is used to choose the best model. We demonstrate that the model is flexible and that the MAIC (minimum AIC) procedure gives reasonable estimates of the densities. There has been much research on one-dimensional density estimation, in particular in statistics, where gaussian mixture approaches and nonparametric approaches are preferred due to their flexibility (Akaike, 1977; Eggermont & LaRiccia, 2001; Good & Gaskins, 1971, 1980; Hjort & Glad, 1995; Ishiguro & Sakamoto, 1984; McLachlan & Peel, 2000; Roeder, 1990; Scott, 1992; Silverman, 1986; Tanabe & Sagae, 1999; Tapia & Thompson, 1978; Wand & Jones, 1995). The advantages of our approach compared with these approaches are as follows. First, our approach is based on a parametric model taking the Kullback-Leibler divergence as a loss function, where the rigorous maximum likelihood estimate can be computed without any trouble in principle. Furthermore, basically it fits into a general framework of model selection with information criteria. In other parametric approaches, it often occurs that there are many local maxima of the objective function and the optimization does not work well. In nonparametric approaches, the selection of the bandwidth becomes an issue and sometimes causes an unsatisfactory result. Since our approach is based on the likelihood, we can use this density model as a building block in assembling general learning models in the context of multivariate regression, pattern discrimination, time series analysis, independent component analysis (ICA), and so on. In particular in ICA, one-dimensional density estimation


plays a crucial role. Our approach would provide a reasonable alternative to the existing density estimation methods.

Another important feature is flexibility in modeling densities. It is sometimes known in advance that the estimated density function should satisfy conditions such as symmetry and unimodality. In our approach, these conditions are naturally treated as linear and semidefinite constraints in the formulation, although it is not easy to incorporate such conditions into models in other approaches. The proposed approach also has other possible applications beyond density estimation. This point is clarified through an application to the maximum likelihood estimation of the intensity function of a nonstationary Poisson process later in this article.

This article is organized as follows. In section 2, we explain our model and formulate the maximum likelihood estimation as a variant of a semidefinite program. In section 3, we briefly explain SDP and introduce interior point methods to solve this problem. In section 4, the performance of our method is demonstrated through application to several instances. In section 5, we propose a mixture model in which the proposed density model is employed as a component. In section 6, we discuss possible applications of our approach to other problems, taking up as an example the estimation of the intensity function of a nonstationary Poisson process. Section 7 concludes the discussion.

2 Problem Formulation

Let {x_1, . . . , x_N} be data independently drawn from an unknown density g(x) over the support S ⊆ R. Our problem is to estimate g(x) based on {x_1, . . . , x_N}. If some prior information on g(x) is available, we can use an appropriate statistical model to estimate g(x). A flexible model is necessary to estimate g(x) when such prior information is not available. In this article, we develop a computational approach to estimate g(x) based on the following statistical model,

f(x; α, β) = p(x; α) K(x; β),    (2.1)

where p(x; α) is a univariate polynomial with parameter α and K(x; β) is a density function specified with parameter β over the support S. The function K(x; β) is referred to as a base density (function). We call α and β a polynomial parameter and a base density parameter, respectively. The polynomial p(x; α) is nonnegative over S. In the following, we consider the models where the base density is a normal distribution, an exponential distribution, or a uniform distribution. These models will be referred to as the normal-based model, the exponential-based model, and the pure polynomial model, respectively.


We associate a univariate polynomial of even degree with a matrix as follows. Given x ∈ R, we define x_d = (1, x, x², . . . , x^{d−1})ᵀ ∈ R^d. In the following, we drop the subscript d of x_d when there is no ambiguity. A polynomial of degree n (= 2d − 2) can be written as xᵀQx by choosing an appropriate Q ∈ R^{d×d}. If a polynomial q(x) = Σ_{i=0}^n q_i x^i is represented as q(x) = xᵀQx with Q ∈ R^{d×d}, then we can recover the coefficient q_k by q_k = Tr(E_k Q), where the (i, j)th element of E_k is

(E_k)_{ij} = 1 if i + j − 2 = k, and 0 otherwise,   k = 1, . . . , n.

Let q′(x) = Σ_{i=0}^{n−1} q′_i x^i be the derivative of q(x). The coefficient q′_k is represented as

q′_k = (k + 1) Tr(E_{k+1} Q),   k = 1, . . . , n − 1.

The main theorem used in this article together with SDP is the following.

Theorem 1 (e.g., Nesterov, 2000). Let p(x) be a univariate polynomial of degree n. Then,

i. p(x) ≥ 0 over S = (−∞, ∞) iff p(x) = x_{n/2+1}ᵀ Q x_{n/2+1} holds for a symmetric positive semidefinite matrix Q ∈ R^{(n/2+1)×(n/2+1)}.

ii. p(x) ≥ 0 over S = [a, ∞) iff
  (a) p(x) = x_{(n+1)/2}ᵀ Q₁ x_{(n+1)/2} + (x − a) x_{(n+1)/2}ᵀ Q₂ x_{(n+1)/2} for symmetric positive semidefinite matrices Q₁, Q₂ ∈ R^{((n+1)/2)×((n+1)/2)} (the case when n is odd);
  (b) p(x) = x_{n/2+1}ᵀ Q₁ x_{n/2+1} + (x − a) x_{n/2}ᵀ Q₂ x_{n/2} for symmetric positive semidefinite matrices Q₁ ∈ R^{(n/2+1)×(n/2+1)} and Q₂ ∈ R^{(n/2)×(n/2)} (the case when n is even).

iii. p(x) ≥ 0 over S = [a, b] iff
  (a) p(x) = (x − a) x_{(n+1)/2}ᵀ Q₁ x_{(n+1)/2} + (b − x) x_{(n+1)/2}ᵀ Q₂ x_{(n+1)/2}    (2.2)
      for symmetric positive semidefinite matrices Q₁, Q₂ ∈ R^{((n+1)/2)×((n+1)/2)} (the case when n is odd);
  (b) p(x) = x_{n/2+1}ᵀ Q₁ x_{n/2+1} + (b − x)(x − a) x_{n/2}ᵀ Q₂ x_{n/2}    (2.3)
      for symmetric positive semidefinite matrices Q₁ ∈ R^{(n/2+1)×(n/2+1)} and Q₂ ∈ R^{(n/2)×(n/2)} (the case when n is even).
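The coefficient-recovery identity q_k = Tr(E_k Q) is easy to check numerically. The following is a minimal sketch in Python (an illustration of the identity, not the authors' Matlab/C code); it builds the E_k matrices, reads off the coefficients of xᵀQx for a random positive semidefinite Q, and verifies them against direct evaluation.

```python
import numpy as np

def basis_matrix(k, d):
    """(E_k)_{ij} = 1 if i + j - 2 == k with 1-based i, j (i + j == k with 0-based indices)."""
    E = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            if i + j == k:
                E[i, j] = 1.0
    return E

d = 3                                 # monomial vector (1, x, x^2), so degree n = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
Q = A @ A.T                           # a random symmetric positive semidefinite Q

# coefficients q_k = Tr(E_k Q) for k = 0, ..., n
coeffs = [np.trace(basis_matrix(k, d) @ Q) for k in range(2 * d - 1)]

# check against direct evaluation of x^T Q x at a test point
x0 = 1.7
x_vec = np.array([x0 ** i for i in range(d)])
assert np.isclose(x_vec @ Q @ x_vec, sum(c * x0 ** k for k, c in enumerate(coeffs)))
```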


The class of polynomials that admit a representation xᵀQx with Q ⪰ 0 and x being a vector (function) consisting of (possibly multivariate) monomials is referred to as the sum-of-squares polynomials (Nesterov, 2000; Waki, Kim, Kojima, & Muramatsu, 2005), since it exactly coincides with the polynomials representable as the sum of several squared polynomials.

To explain our approach, we focus on the normal-based model, where S = (−∞, ∞) and K(x; β) is a normal distribution with parameter (µ, σ). We assume that the parameter β = (µ, σ) is given. Under this condition, as will be explained below, the maximum likelihood estimation is formulated as a tractable convex program that can be solved easily with the technique of SDP. Readers will readily see that the exponential-based model and the pure polynomial model can be treated in exactly the same manner. In the following, C ⪰ (≻) 0 means that a matrix C is symmetric positive semidefinite (definite).

The maximum likelihood estimate of our model is an optimal solution of the following problem:

max_{α,β}  Σ_{i=1}^N log p(x_i; α) + Σ_{i=1}^N log K(x_i; β)
s.t.       ∫ p(x; α) K(x; β) dx = 1,   p(x; α) ≥ 0 for all x.    (2.4)

We represent p(x; α) as xᵀQx with some Q ⪰ 0. The condition for f(x; α, β) to be a density function is written as

∫ xᵀQx K(x; β) dx = 1,   Q ⪰ 0.

It is easy to see that this condition is written as the following linear equality constraint with a semidefinite constraint,

Tr(M(β) Q) = 1,   Q ⪰ 0,   where   M(β) = ∫ x xᵀ K(x; β) dx.

Note that M(β) is a matrix that can be obtained in closed form as a function of β when K is a normal, exponential, or uniform distribution.
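For concreteness, here is a small sketch (one convenient way to build M(β) for the normal base, not taken from the paper): since M(β)_{jk} = E[x^{j+k−2}] under N(µ, σ²), it suffices to tabulate the non-central normal moments, which obey the recursion m_p = µ m_{p−1} + (p − 1) σ² m_{p−2}.

```python
import numpy as np

def normal_moments(mu, sigma, max_order):
    """Non-central moments E[x^p], p = 0..max_order, of N(mu, sigma^2)."""
    m = np.zeros(max_order + 1)
    m[0] = 1.0
    if max_order >= 1:
        m[1] = mu
    for p in range(2, max_order + 1):
        m[p] = mu * m[p - 1] + (p - 1) * sigma ** 2 * m[p - 2]
    return m

def moment_matrix(mu, sigma, d):
    """M(beta) with entries E[x^{i+j}] for 0-based i, j = 0..d-1."""
    m = normal_moments(mu, sigma, 2 * (d - 1))
    return np.array([[m[i + j] for j in range(d)] for i in range(d)])
```

With this matrix, Tr(M(β)Q) equals ∫ xᵀQx K(x; β) dx, which is exactly the normalization constraint above.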


On the other hand, the log likelihood of model 2.1 becomes

Σ_{i=1}^N log f(x_i; α, β) = Σ_{i=1}^N [ log p(x_i; α) + log K(x_i; β) ]
                           = Σ_{i=1}^N log( x^{(i)ᵀ} Q x^{(i)} ) + Σ_{i=1}^N log K(x_i; β)
                           = Σ_{i=1}^N log Tr( x^{(i)} x^{(i)ᵀ} Q ) + Σ_{i=1}^N log K(x_i; β),

where x^{(i)} = (1, x_i, x_i², . . . , x_i^{d−1})ᵀ. Note that the term inside the log is linear in Q. Therefore, the maximum likelihood estimation is formulated as follows:

max_{Q,β}  Σ_{i=1}^N log Tr(X^{(i)} Q) + Σ_{i=1}^N log K(x_i; β)
s.t.       Tr(M(β) Q) = 1,   Q ⪰ 0,    (2.5)

where X^{(i)} = x^{(i)} x^{(i)ᵀ} for i = 1, . . . , N. If we fix β and regard Q as the decision variable, this problem is a convex program closely related to SDP and can be solved efficiently in both theory and practice with the interior point method (Ben-Tal & Nemirovski, 2001; Nesterov & Nemirovskii, 1994; Vandenberghe & Boyd, 1996; Wolkowicz et al., 2000). Let g(β) be the optimal value of problem 2.5 when β is fixed. Then we maximize g(β) to obtain the maximum likelihood estimator. Since β typically has only one or two dimensions (e.g., location and scale parameters), maximization of g can be done easily by grid search or nonlinear programming techniques (Fletcher, 1989; Nocedal & Wright, 1999).

In the following, we show that many properties of the density function, such as symmetry and monotonicity, can be expressed by adding linear equality constraints or semidefinite constraints to the above problem. Recall that we are dealing with the normal-based model with the base density parameter (µ, σ). For simplicity, we also assume that µ = 0. If we consider a density function symmetric with respect to x = 0, we add the linear constraints Tr(E_i Q) = 0 for all odd i such that 1 ≤ i ≤ n, because all coefficients of the odd degrees in the polynomial p(x; α) must be zero. If we consider a density that is unimodal with the maximum at x = x̂, the problem is formulated as follows. This condition is equivalent to the two conditions that f(x) is monotone increasing on the interval (−∞, x̂] and monotone decreasing on the interval [x̂, ∞). The first monotone-increasing condition can be formulated as follows. Since


f′(x; β) = { p′(x) − x p(x)/σ } K(x; β), nonnegativity of f′(x; β) on the interval (−∞, x̂] is equivalent to nonnegativity of { p′(x) − x p(x)/σ } on the interval (−∞, x̂]. In view of the second statement of theorem 1, we introduce symmetric positive semidefinite matrices Q₁ ∈ R^{(n/2+1)×(n/2+1)} and Q₂ ∈ R^{(n/2+1)×(n/2+1)} to represent

p′(x) − (x/σ) p(x) = xᵀ Q₁ x − (x − x̂) xᵀ Q₂ x.

Note that the degree of p(x) is always even. The formulation is completed by writing down the conditions that associate Q with Q₁ and Q₂. This amounts to the following n linear equality constraints,

(k + 1) Tr(E_{k+1} Q) − (1/σ) Tr(E_{k−1} Q) = Tr(E_k Q₁) − Tr(E_{k−1} Q₂) + x̂ Tr(E_k Q₂),   k = 1, . . . , n,

where E_l = 0 for l = −1 and l = n + 1. The other monotone-decreasing condition on the interval [x̂, ∞) can be treated in a similar manner. Thus, the approach is capable of handling various conditions on the density function in a flexible way by adding semidefinite and linear constraints to problem 2.5. From the optimization point of view, the problem to be solved still remains in the same tractable class. This point will be explained in more detail in the next section.

We also point out that estimation of a density from a histogram based on the maximum likelihood estimation is formulated as a slight variation of problem 2.5, where each log term in the latter summation in the objective function is replaced by a weight proportional to the number of samples in a bin. This problem belongs to the same tractable variant of SDP (Tsuchiya & Xia, 2006).

Here we discuss a few possible extensions to multivariate density estimation problems. In the k-variate case, we replace x = (1, x, . . . , x^{d−1}) in the univariate case with a vector consisting of monomials in k variables. For example, if k = 2 and we want to express a polynomial of (up to) the fourth order, we take x = (1, x, y, x², y², xy) and consider a polynomial of the form xᵀQx with Q ⪰ 0. As was remarked earlier, a multivariate polynomial that admits this type of representation is a member of the sum-of-squares polynomials. Unfortunately, a multivariate nonnegative polynomial does not always admit a sum-of-squares representation. Therefore, the sum-of-squares polynomials are just a subclass of the multivariate nonnegative polynomials; that is, there is a gap. In this respect, one may not be satisfied with this straightforward extension. However, the use of the sum-of-squares polynomials to represent a density function would be justified as long as it serves as a flexible and useful model in expressing a density function. Furthermore, the recent attempts to solve polynomial programming problems with the sum-of-squares polynomials relaxation


suggest that the gap between nonnegative multivariate polynomials and the sum-of-squares polynomials is not large (see, e.g., Waki et al., 2005). The model also has the feature that it admits a reasonable and efficient way of computing the maximum likelihood estimate based on SDP.

Another reasonable approach to multivariate cases is to apply our model in the context of ICA. We consider a multivariate density of the form Π_{i=1}^k f_i((A^{−1}x)_i) / det(A), where A is a k × k nonsingular mixing matrix and f_i is a one-dimensional density. This is a parametric ICA model. We can use our density model for f_i and apply our method to estimate f_i. It would be possible to apply the standard techniques in ICA for estimating A (e.g., Hyvärinen, Karhunen, & Oja, 2001). In view of ICA, the one-dimensional densities f_i could be estimated with nonparametric methods, but estimation of A would be much easier if a reasonable parametric model is available. This ICA model is one of our future research topics for incorporating the proposed model into various machine learning models.
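As a small illustration of the multivariate extension discussed above (illustrative only, not from the paper), the following sketch builds the k = 2 monomial vector mentioned earlier and checks that xᵀQx is nonnegative for any positive semidefinite Q, which is exactly the sum-of-squares property.

```python
import numpy as np

def monomial_vector(x, y):
    # monomials up to total degree 2, so x^T Q x has degree up to 4 in (x, y)
    return np.array([1.0, x, y, x ** 2, y ** 2, x * y])

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
Q = A @ A.T                          # any PSD Q yields a nonnegative polynomial

p = lambda x, y: monomial_vector(x, y) @ Q @ monomial_vector(x, y)
assert p(0.3, -1.2) >= 0.0           # nonnegativity holds for every (x, y)
```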

3 Semidefinite Programming and Interior Point Methods

In this section, we introduce SDP and the interior point method for SDP and explain how the method can be used in the maximum likelihood estimation formulated in the previous section. SDP (Ben-Tal & Nemirovski, 2001; Nesterov & Nemirovskii, 1994; Vandenberghe & Boyd, 1996; Wolkowicz et al., 2000) is an extension of LP (linear programming) in the space of matrices, where a linear objective function is optimized over the intersection of an affine space and the cone of symmetric positive semidefinite matrices. SDP is a tractable class of convex programming and has a number of applications in combinatorial optimization, control theory, signal processing, statistics, machine learning, and structure design, among others (Ben-Tal & Nemirovski, 2001; Vandenberghe & Boyd, 1996; Wolkowicz et al., 2000). A nice property of SDP is duality. As will be shown later, the dual problem of a semidefinite program becomes another semidefinite program, and under mild assumptions, they have the same optimal value. The original problem is referred to as the primal problem in relation to the dual problem. The interior point method solves SDP by generating a sequence in the interior of the feasible region. There are two types of interior point methods: the primal interior point method and the primal-dual interior point method. The former generates iterates in the space of the primal problem, while the latter generates iterates in both spaces of the primal and dual problems. We illustrate the basic ideas of interior point methods with the primal method since it is more intuitive and easier to understand. In the numerical experiments conducted in the later sections, we adopted the primal-dual method because it is more flexible and numerically stable. (See the appendix for the basic ideas and an outline of the primal-dual methods we implemented.)


Let A_{ij} (i = 1, . . . , m and j = 1, . . . , n̄) and C_j (j = 1, . . . , n̄) be real symmetric matrices. The sizes of the matrices A_{ij} and C_j are assumed to be n_j × n_j. A standard form of SDP is the following optimization problem with respect to the n_j × n_j real symmetric matrices X_j, j = 1, . . . , n̄:

(P)  min_X  Σ_{j=1}^{n̄} Tr(C_j X_j),
     s.t.   Σ_{j=1}^{n̄} Tr(A_{ij} X_j) = b_i,  i = 1, . . . , m,   X_j ⪰ 0,  j = 1, . . . , n̄.    (3.1)

Here we denote (X_1, . . . , X_{n̄}) by X, and X ⪰ (≻) 0 means that each X_j is symmetric positive semidefinite (definite). A feasible solution X is called an interior feasible solution if X ≻ 0 holds. Since the cone of positive semidefinite matrices is convex, SDP is a convex program. Although the problem is highly nonlinear, it can be solved efficiently in both a theoretical and a practical sense with the interior point method. The interior point method is a polynomial time method for SDP, and in reality, it can solve SDPs involving matrices whose dimension is several thousands. To date, the interior point method is the only practical method for SDP.

Let Ω be a subset of {1, . . . , n̄}, and consider the following problem, where the convex function −Σ_{j∈Ω} log det X_j is added to the objective function in problem 3.1:

(P̃)  min_X  Σ_j Tr(C_j X_j) − Σ_{j∈Ω} log det X_j,
      s.t.   Σ_j Tr(A_{ij} X_j) = b_i,  i = 1, . . . , m,   X_j ⪰ 0,  j = 1, . . . , n̄.    (3.2)

It is not difficult to see that the maximum likelihood estimation, problem 2.5, can be cast into this problem as follows:

(ML)  min  −Σ_{j=1}^N log det Y_j,
      s.t.  Y_j − Tr(X^{(j)} Q) = 0,  j = 1, . . . , N,   Y_j ⪰ 0,  j = 1, . . . , N,
            Tr(M Q) = 1,   Q ⪰ 0,    (3.3)

where Y_j, j = 1, . . . , N, are new 1 × 1 matrix variables introduced to convert problem 2.5 to the form of problem 3.2. Thus, there are n̄ = N + 1 decision variables Y_j (j = 1, . . . , N) and Q in this problem.
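The inner optimization (problem 2.5, equivalently problem 3.3 with β fixed) can also be prototyped with a generic conic modeling tool instead of a specialized interior point code. The following hedged sketch is such an illustration, not the authors' Matlab/C implementation: it uses CVXPY, reuses the moment_matrix helper sketched in section 2, and returns the profile log likelihood g(β) for the normal-based model.

```python
import numpy as np
import cvxpy as cp

def profile_loglik(data, mu, sigma, d):
    """Solve problem 2.5 for fixed beta = (mu, sigma): maximize
    sum_i log(x^(i)T Q x^(i)) subject to Tr(M(beta) Q) = 1 and Q PSD,
    then add the (Q-independent) normal base-density log likelihood."""
    data = np.asarray(data, dtype=float)
    M = moment_matrix(mu, sigma, d)                       # moment matrix sketched earlier
    V = np.vander(data, d, increasing=True)               # rows are x^(i) = (1, x_i, ..., x_i^{d-1})
    Q = cp.Variable((d, d), PSD=True)
    polys = cp.diag(V @ Q @ V.T)                          # values x^(i)T Q x^(i)
    prob = cp.Problem(cp.Maximize(cp.sum(cp.log(polys))),
                      [cp.trace(M @ Q) == 1])
    prob.solve()                                          # SCS handles the PSD and log terms
    base = np.sum(-0.5 * ((data - mu) / sigma) ** 2
                  - np.log(np.sqrt(2.0 * np.pi) * sigma))
    return prob.value + base
```

The outer maximization over β can then be done exactly as described in section 2, for example by evaluating profile_loglik on a coarse grid of (µ, σ) values and refining around the best point.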


At a glance, problem 3.2 looks more difficult than problem 3.1 because of the additional convex term in the objective function; however, due to its special structure, we can solve problem 3.2 as efficiently as problem 3.1 by slightly modifying the interior point method for problem 3.1 without losing its advantages (Toh, 1999; Vandenberghe et al., 1998; Tsuchiya & Xia, 2006). In Vandenberghe et al. (1998), problem 3.2 is studied in detail from the viewpoint of applications and of solution by the primal interior point method. For the time being, we continue explaining the interior point method for problem 3.1 to illustrate its main idea. Since a main difficulty of SDP comes from the highly nonlinear shape of the feasible region (even though it is convex), it is reasonable to provide machinery to keep iterates away from the boundary of the feasible region in order to develop an efficient iterative method. For this purpose, the interior point method makes use of the logarithmic barrier function:

−Σ_{j=1}^{n̄} log det X_j.

The logarithmic barrier function is a convex function whose value diverges as X approaches the boundary of the feasible region, where one of the X_j becomes singular. Incorporating this barrier function, let us consider the following optimization problem with a positive parameter ν:

(P_ν)  min_X  Σ_j Tr(C_j X_j) − ν Σ_j log det X_j,
       s.t.   Σ_j Tr(A_{ij} X_j) = b_i,  i = 1, . . . , m,   X_j ⪰ 0,  j = 1, . . . , n̄,    (3.4)

where ν is referred to as the barrier parameter. Since the log barrier function is strictly convex, (P_ν) has a unique optimal solution. We denote by X̂(ν) = (X̂_1(ν), . . . , X̂_{n̄}(ν)) the optimal solution of (P_ν). By using the method of Lagrange multipliers, we see that X̂(ν) is the unique symmetric positive definite X satisfying the following system of nonlinear equations in the unknowns X and (the Lagrange multiplier) y:

ν X_j^{−1} = C_j − Σ_i A_{ij} y_i,   j = 1, . . . , n̄,
Σ_j Tr(A_{ij} X_j) = b_i,   i = 1, . . . , m,
X_j ≻ 0,   j = 1, . . . , n̄.    (3.5)


The set

C_P ≡ { X̂(ν) : 0 < ν < ∞ }

defines a smooth path that approaches an optimal solution of (P) as ν → 0. This path is called the central trajectory of problem 3.1. The main idea of the interior point method is to solve the SDP with the following procedure to trace the central trajectory. Starting from a point close to the central trajectory C_P, we solve problem 3.1 by repeated application of the Newton method to problem 3.4, reducing ν gradually to zero. A relevant part of this method is solving problem 3.4 for each fixed ν, where the Newton method is applied. The Newton method is basically a method for unconstrained optimization, but the problem contains the nontrivial constraints X ⪰ 0. However, there is no difficulty in applying the Newton method here, because the problem is a minimization problem and the term −Σ_j log det X_j diverges whenever X approaches the boundary of the feasible region. Therefore, the Newton method is not bothered by the constraint X ⪰ 0. Another important issue here is initialization. We need an interior feasible solution to start. Usually this problem is resolved by a so-called two-phase method or big-M method, which are analogies of the techniques developed in classical linear programming. In the primal-dual method we introduce later, this point is resolved in a more elegant manner.

Now we extend the idea of the interior point method to solve problem 3.2. We consider the following problem with a positive parameter η:

(P̃_η)  min_X  Σ_j Tr(C_j X_j) − Σ_{j∈Ω} log det X_j − η Σ_{j∉Ω} log det X_j,
        s.t.   Σ_j Tr(A_{ij} X_j) = b_i,  i = 1, . . . , m,   X_j ⪰ 0,  j = 1, . . . , n̄.    (3.6)

We denote by X̃(η) the optimal solution of problem 3.6 and define the central trajectory for problem 3.6 as D_P ≡ { X̃(η) : 0 < η < ∞ }. Note that X̃(η) approaches the optimal set of problem 3.2 as η → 0. Observe that the central trajectories C_P and D_P intersect at ν = 1 and η = 1, that is, X̂(1) = X̃(1). Therefore, we consider an interior point method to solve problem 3.2 consisting of two stages. We first obtain a point X* close to X̂(1) at stage 1 with the ordinary interior point method. In stage 2, starting from X*, a point close to the central trajectory D_P for problem 3.2,


we solve problem 3.6 with the Newton method, repeatedly reducing η gradually to zero. As was mentioned at the beginning of this section, this idea is further combined with the primal-dual interior point method and adopted in our implementation. See the appendix for details.

4 Numerical Results

4.1 Outline. We conducted numerical experiments with our method on the following five models:

i. Normal-based model
ii. Exponential-based model
iii. Pure polynomial model, where the density function and its first derivative are assumed to be zero on the boundary of the support
iv. Normal-based model with the additional condition that the estimated density function is unimodal
v. Exponential-based model with the additional condition that the estimated density function is monotone decreasing

All the results on simulated data are compared with the results obtained by the kernel density estimation method, where the bandwidth is determined by the procedure bw.ucv (unbiased cross-validation) in the statistical software R. Some of the data sets used in this experiment are generated by simulation from assumed distributions, and others are taken from real data sets that have often been used as benchmarks. The algorithms are coded in Matlab and C, and all the numerical experiments are conducted under the Matlab 6.5 environment with the Windows OS. We used several platforms, but a typical one is a Pentium IV 2.4 GHz with 1 GB memory. The code runs without trouble on a notebook computer equipped with a Pentium III 650 MHz CPU and 256 MB memory.

The maximum likelihood estimation is computed in two steps: we optimize the polynomial parameter α with SDP, and at the higher level we optimize the parameter β of the base density. We have β = (µ, σ) for i, β = λ for ii, and β = [a_min, a_max] for iii; in iv, we have β = (µ, σ, γ), where γ denotes the peak of the distribution; and in v, we have β = λ. Assuming that α is optimized with SDP, we just need to perform at most three-dimensional optimization (basically one- or two-dimensional) to accomplish global optimization of the likelihood function. According to the level of difficulty of the SDP to be solved, we employed nonlinear optimization (for i, ii, and v) and grid search (for i, ii, iii, and iv) for the optimization of β.

We provided two versions of the primal-dual interior point method: the basic algorithm and the predictor-corrector algorithm. The former is faster, but the latter is more sophisticated and robust. (See the appendix and


Fushiki, Horiuchi, & Tsuchiya, 2003, for an explanation of the two algorithms.) Generally we observed that the SDPs for i, ii, and v are fairly easy, while the ones for iii and iv are more difficult. The basic algorithm is robust and stable enough to solve i, ii, and v without trouble, but it got into trouble when we tried to solve iii and iv. In that case, we need to use the more sophisticated predictor-corrector algorithm. The typical number of iterations of the basic algorithm is around 50, and that of the predictor-corrector algorithm is between 100 and 200. Concerning the higher-level optimization to determine β, we adopted nonlinear optimization and a grid search procedure.

We compare the models with AIC (Akaike, 1973, 1974). Here we note that the number of parameters should be reduced by one for each equality constraint added. Therefore, if we define AIC = −(log likelihood) + k, where k denotes the number of parameters, k becomes as follows for the five cases i to v:

i. Normal-based model: k = n + 2 (dim(α) = n + 1, dim(β) = 2, number of linear equalities = 1)
ii. Exponential-based model: k = n + 1 (dim(α) = n + 1, dim(β) = 1, number of linear equalities = 1)
iii. Pure polynomial model: k = n − 2 (dim(α) = n + 1, dim(β) = 2, number of linear equalities = 5)
iv. Normal-based model with unimodality: k = n + 2 (dim(α) = n + 1, dim(β) = 3, number of linear equalities = 2)
v. Exponential-based model with monotonicity: k = n + 1 (dim(α) = n + 1, dim(β) = 1, number of linear equalities = 1)

Here n is the degree of the polynomial p(x; α) in the model. We define AIC as one-half of the usual definition of AIC. All the models contain the linear equality constraint that the integral of the estimated density over the support is one. The pure polynomial model iii contains the additional equality constraints that the values of the density function and its derivative at both ends of the support are zero. In iv and v, we did not give any “penalty term” for unimodality or monotonicity. In iv, we introduce a new parameter to specify the peak of the density, but we also have an additional linear equality constraint that the derivative of the density is zero at the peak. Therefore, the penalty term is the same as in i.
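As an illustration of the MAIC procedure, the following sketch (hypothetical helper names, not the authors' code) scans candidate polynomial degrees, computes AIC = −(log likelihood) + k with the parameter counts listed above, and keeps the minimizer. The fitting routine is passed in as a callable, for example one built from the profile-likelihood sketch in section 3.

```python
def select_degree_by_aic(data, fit, degrees=(2, 4, 6, 8, 10), extra_params=2):
    """fit(data, n) is a stand-in for the SDP-based maximum likelihood fit and
    must return the maximized log likelihood for polynomial degree n.
    extra_params follows the table above (2 for the plain normal-based model)."""
    best_aic, best_degree = float("inf"), None
    for n in degrees:
        loglik = fit(data, n)
        k = n + extra_params          # e.g., k = n + 2 for the normal-based model
        aic = -loglik + k             # the paper's convention: half the usual AIC
        if aic < best_aic:
            best_aic, best_degree = aic, n
    return best_degree, best_aic
```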


The data analyzed here are as follows:

i. Normal-based model
   a. Simulated data 1, generated from a bimodal normal mixture distribution (N = 100)
   b. Simulated data 2, generated from an asymmetric unimodal normal mixture distribution (N = 250)
   c. Buffalo snowfall data (N = 62) (Carmichael, 1976; Parzen, 1979; Scott, 1992)
   d. Old Faithful geyser data (N = 107) (Weisberg, 1985; Silverman, 1986)
ii. Exponential-based model
   a. Simulated data 3, generated from a mixture of an exponential distribution and a gamma distribution (N = 200)
   b. Coal-mining disasters data (N = 109) (Cox & Lewis, 1966)
iii. Pure polynomial model
   a. Old Faithful geyser data
   b. Galaxy data (N = 82) (Roeder, 1990)
iv. Normal-based model with unimodality condition
   a. The normal mixture distribution data set treated in i-b
v. Exponential-based model with monotonicity condition
   a. Coal-mining disasters data treated in ii-b

4.2 Normal-Based Model. In this section, we show the results of the density estimation with the normal-based model.

4.2.1 Simulated Data 1: Bimodal Normal Mixture Distribution. We generated 200 observations from the bimodal normal mixture distribution:



(0.3 / √(2π · 0.5²)) exp( −(x + 1)² / (2 · 0.5²) ) + (0.7 / √(2π · 0.5²)) exp( −(x − 1)² / (2 · 0.5²) ).

Figures 1a to 1c show the estimated densities for n = 2, 4, 6. The MAIC procedure picks up the model of degree 4 as the best model. It is seen that the estimated density (solid line) is close to the true density (broken line). In Figure 1d, the kernel density estimation result is shown with the broken line. In this case, the result with the kernel density estimation also appears close to the true density. In the following, we adopt the Kullback-Leibler divergence as the loss function to measure the goodness of fit of the estimated densities. The Kullback-Leibler divergence between the true density and the best model with the degree 4 polynomial is 9.1693 × 10⁻³, whereas the Kullback-Leibler divergence between the true density and the kernel density estimation result is 1.8359 × 10⁻². Thus, the estimation with our model is closer to the true density in terms of the Kullback-Leibler divergence in this case.
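For readers who want to reproduce a data set of this kind, a minimal simulation sketch (not the authors' script) is:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 200
component = rng.random(N) < 0.3            # weight 0.3 on the component at -1
data = np.where(component,
                rng.normal(-1.0, 0.5, N),  # N(-1, 0.5^2)
                rng.normal(1.0, 0.5, N))   # N(+1, 0.5^2), weight 0.7
```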


Figure 1: Estimated density function from simulated data 1 with different degrees of polynomials (normal-based model) and the kernel method: (a) n = 2 (AIC = 248.19), (b) n = 4 (AIC = 247.26), (c) n = 6 (AIC = 247.72), (d) kernel method. Each data point is shown by a circle; the solid line is the estimated density with our method in a–c and with the kernel method in d. The broken line is the true density. The bars are the histogram density estimation (the number of bins is 20).

4.2.2 Simulated Data 2: Asymmetric Unimodal Normal Mixture Distribution. Here we generated a simulated data set of 250 observations from a distribution proportional to

exp(−x²/2) + 5 exp(−(x − 1)²/0.2) + 3 exp(−(x − 1)²/0.5).

This is an asymmetric distribution with a sharp peak around x = 1. The estimated density obtained by the MAIC procedure is shown in Figure 2a. The MAIC procedure chooses the model with degree 8. The values of the log likelihood (LL) and AIC are LL = −215.1 and AIC = 223.1. The Kullback-Leibler divergence between the true density and the estimated density is 3.9499 × 10⁻². For comparison, we show the estimated density with the kernel method


Figure 2: Estimated density function from simulated data 2 (normal-based model). Each data point is shown by a circle. The solid line is the estimated density with our method in a and with the kernel method in b. The broken line is the true density; the bars are the histogram density estimation (the number of bins is 20).

in Figure 2b. The Kullback-Leibler divergence between the true density and the estimated density is 6.5548 × 10−2 . Both estimated densities have bumps on the left-hand tail. Thus, we see that the estimation of the density function is more difficult on the left-hand tail as long as we estimate the


distribution just from the data. Later we will show how the estimation is stabilized if we assume prior knowledge of unimodality of the distribution.

4.2.3 Buffalo Snowfall Data. This is the set of 63 values of annual snowfall in Buffalo, New York, from 1910 to 1972, in inches (Carmichael, 1976; Parzen, 1979; Scott, 1992). In Figures 3a to 3c, we show profiles of the distribution obtained with the maximum likelihood estimate when the degree of the polynomial is decreased or increased. The MAIC procedure chooses the model of degree 6 and seems to give a reasonable result. For comparison, we also plot the estimated density by the kernel method with a broken line.

4.2.4 Old Faithful Geyser Data. This set contains duration times of 107 nearly consecutive eruptions of the Old Faithful geyser in minutes (Weisberg, 1985; Silverman, 1986). The estimated density is shown in Figure 4, where n = 10, LL = −105.6, and AIC = 117.6. It has longer tails at both ends, reflecting the nature of the normal distribution. It seems that the tail of the distribution should be shorter. We also plot the estimated density by the kernel method with a broken line. Later we apply the pure polynomial model, which seems to give a better fit with shorter tails.

4.3 Exponential-Based Model. In this section, we show the results of the density estimation with the exponential-based model.

4.3.1 Simulated Data 3: Mixture of an Exponential Distribution and a Gamma Distribution. Here we generated simulated data of 200 observations from a mixture of an exponential distribution and a gamma distribution with shape parameter 4:

0.2 · 2 exp(−2x) + 0.8 · (x³/3!) exp(−x).

The MAIC procedure picks up the model with degree 2. As is seen in Figure 5a, the estimated distribution obtained by the MAIC procedure recovers the original distribution fairly well. For comparison, the density estimation result based on the kernel method is also plotted in Figure 5b. The estimated density is truncated for negative values. The density estimated with our method is closer to the true density. Indeed, the Kullback-Leibler divergence between the true density and the estimated density with our method is 8.6162 × 10⁻³, whereas the Kullback-Leibler divergence between the true density and the estimated density with the kernel method is 2.1609 × 10⁻¹.


Figure 3: Estimated density function from the Buffalo snowfall data with different degrees of polynomials (normal-based model): (a) n = 4 (AIC = 291.88), (b) n = 6 (AIC = 289.99), (c) n = 8 (AIC = 291.58). Each data point is shown by a circle. The solid line is the estimated density with our method. The broken line is the estimated density with the kernel method. The bars are the histogram density estimation (the number of bins is 20).

Figure 4: Estimated density function from the Old Faithful geyser data (normal-based model). Each data point is shown by a circle. The bars are the histogram density estimation (the number of bins is 20). The solid line is the estimated density with our method, and the broken line is the estimated density with the kernel method.

4.3.2 Coal-Mining Disasters Data. Coal mining disasters in Great Britain from 1875 to 1951 are recorded in days (Cox & Lewis, 1966). (See Figure 12 for the original sequence of disasters, where each accident is drawn as a circle on the horizontal axis.) Here we model the data as a renewal process and estimate the interval density function with the exponential-based model. The MAIC procedure picks up the model with degree 4, where LL = −699.0 and AIC = 704.0. The estimated distribution is shown in Figure 6. It is seen that the distribution is considerably different from the exponential distribution. The estimated density seems to consist of three slopes, with a small bump around x = 1200. We also plot the estimated density with the kernel method with a broken line for comparison in the same figure. Later we show how the estimated density changes if the density function is assumed to be monotonically decreasing.

4.4 Pure Polynomial Model. In this section, we show the profiles of the density functions estimated with the pure polynomial model.

4.4.1 Galaxy Data. These data are obtained from measurements of the speed of galaxies in a region of the universe (Roeder, 1990). We applied the pure polynomial model to estimate the density from the galaxy data, since the model with the normal distribution base did not fit well. The MAIC procedure chooses the model with n = 13, where LL = 609.8 and AIC = −598.89.


Figure 5: Estimated density function from simulated data 3 (exponential-based model): (a) estimated density with our method, (b) estimated density with the kernel method. Each data point is shown by a circle. The solid line is the estimated density (our method in a and the kernel method in b). The broken line is the true density. The bars are the histogram density estimation (the number of bins is 30).

As is seen from Figure 7 (solid line), the model suggests that there are three clusters in distribution, which seems to be a bit too simple. Observe that the estimated density takes a value close to zero in the interval between 10,000 and 20,000, reflecting the fact that there are no data. This


Figure 6: Estimated density function from the coal-mining disasters data (exponential-based model). Each data point is shown by a circle. The bars are the histogram density estimation (the number of bins is 30). The solid line is the estimated density with our method, and the broken line is the estimated density with the kernel method.

seems to impose a strong constraint on the shape of the estimated density. We plot the estimated density with the kernel method with a broken line for comparison in the same figure. We return to this instance in section 5, where a better fit is obtained by combination with a mixture approach.

4.4.2 Old Faithful Geyser Data. We applied the pure polynomial model to estimate the density function of the Old Faithful geyser data. The MAIC procedure chooses the model with degree 14, where LL = −96.4 and AIC = 108.4. The estimated density is shown in Figure 8 (solid line). This model fits much better than the normal-based model in terms of both likelihood and AIC. The model captures the structure of the distribution well, suggesting that the distribution has a very short tail and that the left peak is higher than the right one. We plot the estimated density with the kernel method with a broken line for comparison in the same figure.

4.5 Normal-Based Model with Unimodality Condition. As demonstrated so far, our approach gives reasonable estimates in many cases. When we analyze real data, we sometimes have prior knowledge about the distribution, such as symmetry or unimodality. It is difficult to incorporate


Figure 7: Estimated density function from the galaxy data (pure polynomial model). Each data point is shown by a circle. The solid line is the estimated density with our method. The broken line is the estimated density with the kernel method. The bars are the histogram density estimation (the number of bins is 20).

Figure 8: Estimated density function from the Old Faithful geyser data (pure polynomial model). Each data point is shown by a circle. The solid line is the estimated density with our method. The broken line is the estimated density with the kernel method. The bars are the histogram density estimation (the number of bins is 20).


Figure 9: Estimated density function from simulated data 2 (normal-based model with unimodality constraint). Each data point is shown by a circle. The solid line is the estimated density. The broken line is the true density. The bars are the histogram density estimation (the number of bins is 20).

such prior knowledge into the model in nonparametric approaches. In the following, we pick up simulated data 2 as an example and compute the maximum likelihood estimate with the unimodality condition in our approach. Specifically, we observe how the addition of the unimodality condition changes the estimated density function. The distribution obtained by the MAIC procedure with a unimodality constraint added for this data set is shown in Figure 9. LL and AIC are −216.4 and 224.5, respectively. We see that the bumps in the estimated density without the unimodality condition have disappeared and the shape of the estimated density looks much closer to the true one. LL and AIC get worse by about 1.5, where no “penalty” is added to AIC for the unimodality constraint. In our model with the unimodality condition, the estimated density is unimodal and much closer to the true density in shape. The Kullback-Leibler divergence between the true density and the estimated density is 1.8422 × 10⁻². This is the best among the three methods we tested (see section 4.2).

4.6 Exponential-Based Model with Monotonicity Condition. In section 4.3, we estimated the interval density function of the coal-mining disasters data and observed that the density function with the best AIC value has a bump around x = 1200. Here we estimate the density based on the same data with the exponential-based model with the monotonicity condition. In Figure 10 (solid line), we show the best model we found, where

Figure 10: Estimated density function from the coal-mining disasters data (exponential-based model with monotonicity). Each data point is shown by a circle. The solid line is the estimated density. The bars are the histogram density estimation (the number of bins is 30).

the degree of the polynomial is 4, LL = −699.57, and AIC = 704.57. We did not impose any “penalty” for monotonicity in calculating AIC. These values of LL and AIC are almost the same as the ones we obtained in section 4.3. We plot the estimated density with the kernel method with a broken line for comparison in the same figure.

4.7 Discussion. We discuss a few issues before concluding this section.

4.7.1 Comparison with the Kernel Method. In this section, we compare the estimated densities obtained with our method and the ones obtained with the kernel method. We summarize in Table 1 the Kullback-Leibler divergence between the true density and the estimated densities with both methods (only for the cases where the true densities are known). In all three instances, the estimated densities with our method happened to be closer to the true density than those with the kernel method. As to the profiles of the estimated densities, the ones with our method tend to be smoother, while the estimated densities with the kernel method tend to be shakier.

4.7.2 Computational Time. In Table 2, we summarize timing data for optimizing the polynomial part for a fixed base density for several typical


Table 1: Kullback-Leibler Divergence Between True Densities and Estimated Densities: Summary.

                                               SDP               Kernel            SDP (with Unimodality)
Simulated data 1 (bimodal gaussian mixture)    *9.1693 × 10⁻³    1.8359 × 10⁻²     —
Simulated data 2 (unimodal gaussian mixture)    3.9499 × 10⁻²    6.5548 × 10⁻²     *1.8422 × 10⁻²
Simulated data 3 (exponential + gamma)         *8.6162 × 10⁻³    2.1609 × 10⁻¹     —

Note: * The best model in terms of the Kullback-Leibler divergence.

Table 2: Timing Data.

Instance              Model Type         Number of Samples   Time (sec)   Number of Iterations   Algorithm
Buffalo snowfall      Normal             63                  33           34                     Basic
Coal mining           Exponential        109                 61           29                     Basic
Simulated data 1      Normal             200                 202          33                     Basic
Old Faithful geyser   Pure polynomial    107                 623          162                    Predictor-corrector


we deal with in this article. The use of such SDP software could be a reasonable option for readers who want to implement the density estimation method proposed here.

5 Combination with a Finite Mixture Model

When the data set has a long interval without any data, the polynomial estimated with our method tends to take values close to zero in that interval in order to achieve a better likelihood value. This implicitly imposes a strong restriction on the possible shape of the density. We encountered such a situation in analyzing the galaxy data in section 4.4. In order to construct a more flexible model in such a situation, here we consider a model where a density function is represented as a finite mixture of several normal-based models. We assume that the data set has one or more long intervals without data. Therefore, the data set may well be grouped into several clusters easily. In the case when clustering is difficult, the original model, model 2.1, should work fine, since the data set does not have a sparse interval. Let M be the number of clusters, and suppose that the density is represented as

f(x; θ, w) = Σ_{i=1}^M w_i f_i(x; θ_i),   Σ_{i=1}^M w_i = 1,   w_i ≥ 0 (i = 1, . . . , M),    (5.1)

where f_i is the density function of a normal-based model with θ_i as the parameter, θ = (θ_1, . . . , θ_M), and w = (w_1, . . . , w_M). Let I_j, j = 1, . . . , M, be the index set of the data points belonging to cluster j, and let D_j = {x_i : i ∈ I_j}. D_j is the set of data assigned to the jth cluster, which is associated with the density function f_j(·; θ_j). If we assume that f_j is almost zero for those data that do not belong to D_j, the log likelihood of equation 5.1 is simply written as

log f(x; θ, w) ≈ Σ_{i=1}^M Σ_{j∈I_i} log f_i(x_j; θ_i) + Σ_{i=1}^M |D_i| log w_i.    (5.2)
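A schematic sketch of this computation (illustrative only; fit_component is a hypothetical stand-in for the SDP-based fit of one normal-based component) is the following; it evaluates the right-hand side of equation 5.2 with the weights set to their maximum likelihood values ŵ_i = |D_i| / N.

```python
import numpy as np

def mixture_loglik(clusters, fit_component):
    """clusters: list of 1-D arrays D_1, ..., D_M obtained by splitting the data
    at the empty gaps; fit_component(D) returns the maximized log likelihood of
    one normal-based component fitted to D."""
    N = sum(len(D) for D in clusters)
    total = 0.0
    for D in clusters:
        w = len(D) / N                      # ML estimate of the mixture weight
        total += fit_component(np.asarray(D)) + len(D) * np.log(w)
    return total                            # the approximation in equation 5.2
```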

From equation 5.2, it is easy to see that the maximum likelihood estimation reduces to the maximum likelihood estimation of the model f i based on the data set Di . Furthermore, minimization of AIC for model 5.1 can be done by minimizing AIC in estimation of each component f i . We implemented this procedure and applied it to the galaxy data. The best model we found in terms of AIC consists of four components, as shown in Figure 11a, where LL = 623.029, AIC = −606.029 and the number of parameters is 17. The second-best model is with three components and is slightly worse in terms of AIC (LL = 623.952, AIC = −605.952; the


Figure 11: Estimated density function from the galaxy data (mixture normal-based model): (a) four components, (b) three components. Each data point is shown by a circle. The solid line is the estimated density with our method. The broken line is the estimated density with the kernel method. The bars are the histogram density estimation (the number of bins is 20).


number of parameters is 18). The profile is shown in Figure 11b. In the latter model, the two middle clusters in the former are treated as one cluster. We checked that the assumption on the behavior of f_j in the other clusters D_i (i ≠ j) is satisfied by the obtained density function. The estimated density is better than the one obtained in the previous section in terms of AIC, and its shape is more similar to those obtained in other work (e.g., Roeder, 1990).

6 Other Applications

The approach proposed here can be applied to other areas of statistics, such as point processes and survival data analysis. In order to clarify this point, here we pick up estimation of the intensity function of a nonstationary Poisson process as an example. Suppose we have a nonstationary Poisson process whose intensity function is given by λ(t), and let t_1, . . . , t_N = T be the sequence of times at which events were observed. We estimate λ(t) as a nonnegative polynomial function on the interval (0, T]. The log likelihood is

Σ_{i=1}^N log λ(t_i) − ∫₀ᵀ λ(t) dt.
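This likelihood can be prototyped with the same generic modeling tool used earlier. The following hedged sketch (not the authors' code) anticipates the formulation derived just below; for simplicity it represents λ(t) with a single positive semidefinite block, which enforces nonnegativity on the whole real line rather than only on (0, T].

```python
import numpy as np
import cvxpy as cp

def fit_intensity(event_times, T, d):
    """Maximize sum_i log lambda(t_i) - integral_0^T lambda(t) dt with
    lambda(t) = v(t)^T Q v(t), v(t) = (1, t, ..., t^{d-1}), Q PSD."""
    t = np.asarray(event_times, dtype=float)
    V = np.vander(t, d, increasing=True)                   # rows v(t_i)
    # integral of v(t) v(t)^T over (0, T]: entries T^{i+j+1} / (i+j+1)
    M = np.array([[T ** (i + j + 1) / (i + j + 1) for j in range(d)]
                  for i in range(d)])
    Q = cp.Variable((d, d), PSD=True)
    loglik = cp.sum(cp.log(cp.diag(V @ Q @ V.T))) - cp.trace(M @ Q)
    prob = cp.Problem(cp.Maximize(loglik))
    prob.solve()
    return Q.value, prob.value
```

The event times of the coal-mining data, for example, can be passed in as an array together with the observation horizon T.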

If T is fixed, then we can apply exactly the same technique as the density estimation developed in this article. If we represent λ(t) as in equation 2.2 or 2.3 of theorem 1, the term

∫₀ᵀ λ(t) dt

is represented as

∫₀ᵀ λ(t) dt = Tr(M₁ Q) + Tr(M₂ Q₁),

where M₁ and M₂ are appropriate symmetric matrices. Therefore, the maximum likelihood estimation is formulated as the following problem:

max  Σ_{i=1}^N log( Tr(X₁^{(i)} Q) + Tr(X₂^{(i)} Q₁) ) − ( Tr(M₁ Q) + Tr(M₂ Q₁) ),
s.t.  Q ⪰ 0,   Q₁ ⪰ 0,

where X₁^{(i)} and X₂^{(i)} (i = 1, . . . , N) are matrices determined from the data. Thus, the problem just becomes problem 3.2 in this case. In Figure 12, we

Figure 12: Estimated intensity function from the coal mining disasters data. Each disaster is shown by a circle.

Figure 12 shows the estimated intensity function λ(t) for the coal mining data obtained with the MAIC procedure. The procedure picks the polynomial of degree 7, where LL = −690.8 and AIC = 698.8. In the previous section, we analyzed these data as a renewal process, and the AIC of the estimated model is around 704.0 in both cases, whether or not we require the monotonicity condition. Thus, the nonstationary Poisson model seems to fit these data better than the renewal model. A similar technique can be applied to other statistical problems such as estimation of a survival function for medical data. This is another interesting topic for further study.

7 Conclusion

In this article, we proposed a novel approach to the density estimation problem by means of semidefinite programming. We adapted standard interior point methods for SDP to this problem and demonstrated through various numerical experiments that the method gives a reasonable estimate of the density function with the MAIC procedure. We also demonstrated that such conditions as unimodality and monotonicity of the density function, which are usually difficult to handle with other approaches, can easily be incorporated within this framework. Combination with the mixture model and possible applications to other areas were also discussed.

Here we comment on model selection criteria. We used AIC as the model selection criterion, and it worked well in our experiments, but other approaches to model selection can be considered. For example, cross-validation is


a reasonable candidate. While cross-validation applies in a more general context, it is computationally more expensive than AIC. It would be an interesting topic to develop a suitable criterion for model selection in our context. Finally, development of new learning models based on the proposed density estimation approach will be a main long-term direction of our research.

Appendix: Dual Problem, Primal-Dual Formulation, and Primal-Dual Interior-point Method

This appendix outlines the idea of the primal-dual interior point method. First, we introduce the dual problem of problem 3.1 and its associated central trajectory. Then we explain the primal-dual central trajectory and outline the primal-dual interior point method. The dual problem of problem 3.1 is defined as follows:

(D) \quad \max_{y,\, Z_j} \ \sum_i b_i y_i,
\quad \text{s.t.} \quad C_j - \sum_i A_{ij} y_i = Z_j, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n},   (A.1)

where Z_j, j = 1, . . . , n̄, is an n_j × n_j real symmetric matrix and y = (y_1, . . . , y_m) is an m-dimensional real vector. We denote (Z_1, . . . , Z_{n̄}) by Z. Under mild conditions, problems 3.1 and A.1 have optimal solutions with the same optimal value (the duality theorem) (Ben-Tal & Nemirovski, 2001; Monteiro & Todd, 2000; Nesterov & Nemirovskii, 1994; Vandenberghe & Boyd, 1996). Analogous to the case of problem 3.1, the central trajectory of equation A.1 is defined as the set of the unique optimal solutions of the following problem as the parameter ν is varied:

(D_\nu) \quad \max_{y,\, Z_j} \ \sum_i b_i y_i + \nu \sum_j \log\det Z_j,
\quad \text{s.t.} \quad C_j - \sum_i A_{ij} y_i = Z_j, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n}.   (A.2)

We denote by (Ẑ(ν), ŷ(ν)) the optimal solution of problem A.2. Substituting the constraint into the objective and differentiating with respect to y shows that (Ẑ(ν), ŷ(ν)) is the unique solution of the following system of nonlinear equations:

\nu \sum_j \mathrm{Tr}\bigl(A_{ij} Z_j^{-1}\bigr) = b_i, \quad i = 1, \ldots, m,
C_j - \sum_i A_{ij} y_i = Z_j, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n}.   (A.3)


The set C_D ≡ {(Ẑ(ν), ŷ(ν)) : 0 < ν < ∞} defines a smooth path that approaches an optimal solution of (D) as ν → 0. This path is called the central trajectory for problem A.1. Comparing problems A.3 and 3.5, we see that (X̂(ν), Ẑ(ν), ŷ(ν)) is the unique solution of the following bilinear system of equations:

X_j Z_j = \nu I, \quad j = 1, \ldots, \bar{n},
\sum_j \mathrm{Tr}(A_{ij} X_j) = b_i, \quad i = 1, \ldots, m,
C_j - \sum_i A_{ij} y_i = Z_j, \quad j = 1, \ldots, \bar{n},
X_j \succeq 0, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n}.   (A.4)
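To make the phrase "solving equation A.4 for fixed ν" more tangible, the following sketch (not from the paper; the tiny one-block instance at the bottom is made up) evaluates the residuals of the equality blocks of A.4 at a candidate point (X, Z, y). These residuals are exactly what a Newton-type corrector step drives to zero.

import numpy as np

def central_path_residuals(X, Z, y, A, C, b, nu):
    """Residual norms of the equality blocks of equation A.4.
    X, Z : lists of symmetric blocks X_j, Z_j;  y, b : vectors of length m;
    A[i][j] : data block A_ij;  C[j] : block C_j."""
    m, nbar = len(b), len(C)
    comp = sum(np.linalg.norm(X[j] @ Z[j] - nu * np.eye(len(X[j]))) for j in range(nbar))
    primal = np.linalg.norm([sum(np.trace(A[i][j] @ X[j]) for j in range(nbar)) - b[i]
                             for i in range(m)])
    dual = sum(np.linalg.norm(C[j] - sum(y[i] * A[i][j] for i in range(m)) - Z[j])
               for j in range(nbar))
    return comp, primal, dual

# Tiny made-up instance: one 2x2 block and one constraint Tr(X) = 1.
A = [[np.eye(2)]]
C = [np.eye(2)]
b = np.array([1.0])
y = np.array([0.0])
Z = [C[0] - y[0] * A[0][0]]       # dual slack (equals the identity here)
X = [0.5 * np.eye(2)]             # primal feasible: Tr(X) = 1
print(central_path_residuals(X, Z, y, A, C, b, nu=0.5))   # (0.0, 0.0, 0.0): on the trajectory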

Note that we also require X_j = X_j^T and Z_j = Z_j^T for each X_j and Z_j, since ⪰ means that a matrix is symmetric positive semidefinite. We define the central trajectory of problems 3.1 and A.1 as C = {W(ν) : 0 < ν < ∞}, where W(ν) = (X̂(ν), Ẑ(ν), ŷ(ν)). The primal-dual interior point method solves (P) and (D) simultaneously by following the central trajectory C based on the formulation A.4. Namely, we solve equation A.4 repeatedly, reducing ν gradually to zero. As in the primal method, a crucial part of the primal-dual method is the solution procedure of equation A.4 for fixed ν. There are several efficient iterative algorithms for this problem based on the Newton method (see, e.g., Tsuchiya & Xia, 2006). Now we introduce the dual counterpart of problem 3.2 as follows:

(\tilde{D}) \quad \max_{y,\, Z_j} \ \sum_i b_i y_i + \sum_{j \in \Lambda} \log\det Z_j + |\Lambda|,
\quad \text{s.t.} \quad C_j - \sum_i A_{ij} y_i = Z_j, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n}.   (A.5)

It is known that the optimal values of problems A.5 and 3.2 coincide. In order to solve this problem, we consider the following optimization problem with


parameter η > 0:

(\tilde{D}_\eta) \quad \max_{y,\, Z_j} \ \sum_i b_i y_i + \sum_{j \in \Lambda} \log\det Z_j + \eta \sum_{j \notin \Lambda} \log\det Z_j + |\Lambda|,
\quad \text{s.t.} \quad C_j - \sum_i A_{ij} y_i = Z_j, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n}.   (A.6)

We denote by (Z̃(η), ỹ(η)) the unique optimal solution of this problem and define D_D ≡ {(Z̃(η), ỹ(η)) : 0 < η < ∞}, which is referred to as the central trajectory for equation A.5. Note that (Z̃(η), ỹ(η)) approaches the optimal set of problem 3.6 as η → 0. Analogous to equation A.4, we have the following primal-dual formulation of (X̃(η), Z̃(η), ỹ(η)):

X_j Z_j = I, \quad j \in \Lambda,
X_j Z_j = \eta I, \quad j \notin \Lambda,
\sum_j \mathrm{Tr}(A_{ij} X_j) - b_i = 0, \quad i = 1, \ldots, m,
C_j - \sum_i A_{ij} y_i - Z_j = 0, \quad j = 1, \ldots, \bar{n},
X_j \succeq 0, \quad Z_j \succeq 0, \quad j = 1, \ldots, \bar{n}.   (A.7)

Let W̃(η) = (X̃(η), Z̃(η), ỹ(η)), and define D = {W̃(η) : 0 < η < ∞} as the primal-dual central trajectory of problems 3.2 and A.5. Equation A.7 can be solved efficiently with the same iterative methods as equation A.4. Now we are ready to describe a primal-dual interior point method for problems 3.2 and A.5. As in the case of the primal method, the primal-dual central trajectories C and D intersect at ν = η = 1, that is, W(1) = W̃(1). Therefore, we can solve problems 3.2 and A.5 in two stages as follows. We first apply the ordinary primal-dual interior point method for problems 3.1 and A.1 to find a point W* = (X*, Z*, y*) close to W(1). Then, starting from W*, a point close to the central trajectory D for problems 3.2 and A.5, we solve problem 3.2 by solving equation A.7 approximately and repeatedly, reducing η gradually to zero.

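In code form, the two-stage strategy just described might be organized as in the structural sketch below. It is only a schematic, not the paper's implementation: centering_step stands for a user-supplied Newton-type routine that approximately solves the centering system (A.4 in stage 1, A.7 in stage 2) at the current parameter value, and the geometric reduction factor is illustrative.

def two_stage_path_following(W0, centering_step, shrink=0.5, tol=1e-8):
    """Schematic two-stage primal-dual path following.
    W0             : starting point (X, Z, y) with X, Z positive definite.
    centering_step : hypothetical routine (W, param, system) -> W' that
                     approximately solves A.4 or A.7 for the given parameter."""
    # Stage 1: follow the ordinary central trajectory C down to nu = 1,
    # producing a point close to W(1), the junction of C and D.
    W, nu = W0, 1.0e2
    while nu > 1.0:
        nu = max(1.0, shrink * nu)
        W = centering_step(W, nu, system="A.4")
    # Stage 2: switch to the trajectory D and drive eta toward zero,
    # solving A.7 approximately at each step.
    eta = 1.0
    while eta > tol:
        eta = shrink * eta
        W = centering_step(W, eta, system="A.7")
    return W

Practical implementations adapt the reduction of the parameter to the proximity to the trajectory, which is precisely the difference between the basic and predictor-corrector variants discussed next.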

A remarkable advantage of the primal-dual method is its flexibility concerning initialization. In the primal formulation in section 3, the method needs an initial feasible interior point, but obtaining such a point is already a nontrivial problem. In the primal-dual formulation, we can get around this difficulty because the search directions can be computed for any (X, Z, y) such that X ≻ 0 and Z ≻ 0. In general, such a point does not satisfy the linear equalities in equation A.7, but since these equalities are linear, they can be made to hold by the end of the iterations. In that case, we approach the central trajectory from outside the feasible region.

We provided two versions of the primal-dual method in our implementation: the basic algorithm and the predictor-corrector algorithm. The first follows the central trajectory loosely. It is simple and efficient but often encounters difficulty for ill-conditioned problems. The predictor-corrector algorithm follows the central trajectory more precisely. This method is slower but robust and steady, suitable for ill-conditioned and difficult problems. (See Fushiki et al., 2003, for details.)

When we reasonably exploit the structure of this problem, the number of arithmetic operations required per iteration of the primal-dual method becomes O(N^3), where N is the number of data points. This suggests that our approach is not computationally expensive, since the number of iterations is typically up to 50 in interior point methods for SDP. In view of the state of the art of current SDP software (Sturm, 1999; Tütüncü et al., 2003; Yamashita, Fujisawa, & Kojima, 2003), a sophisticated implementation would be capable of handling N up to a few thousand.

Acknowledgments

This research was supported in part by a Grant-in-Aid for Young Scientists (B), 2005, 17700286, and a Grant-in-Aid for Scientific Research (C), 2003, 15510144, from the Japan Society for the Promotion of Science.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), Proceedings of the Second International Symposium on Information Theory (pp. 267–281). Budapest: Akademiai Kiado.

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

Akaike, H. (1977). On entropy maximization principle. In P. R. Krishnaiah (Ed.), Applications of statistics (pp. 27–41). Amsterdam: North-Holland.

Alizadeh, F. (1995). Interior point methods in semidefinite programming with applications to combinatorial optimization. SIAM Journal on Optimization, 5, 13–51.


Bennett, K. P., & Mangasarian, O. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23–34.

Ben-Tal, A., & Nemirovski, A. (2001). Lectures on modern convex optimization: Analysis, algorithms, and engineering applications. Philadelphia: SIAM.

Bhattacharyya, C. (2004). Second order cone programming formulations for feature selection. Journal of Machine Learning Research, 5, 1417–1433.

Carmichael, J.-P. (1976). The autoregressive method: A method for approximating and estimating positive functions. Unpublished doctoral dissertation, SUNY, Buffalo.

Cox, D. R., & Lewis, P. A. W. (1966). The statistical analysis of series of events. New York: Wiley.

Eggermont, P. P. B., & LaRiccia, V. N. (2001). Maximum penalized likelihood estimation, Vol. 1: Density estimation. New York: Springer.

Fletcher, R. (1989). Practical methods of optimization (2nd ed.). New York: Wiley.

Fushiki, T., Horiuchi, S., & Tsuchiya, T. (2003). A new computational approach to density estimation with semidefinite programming (Research Memorandum No. 898). Tokyo: Institute of Statistical Mathematics.

Good, I. J., & Gaskins, R. A. (1971). Nonparametric roughness penalties for probability densities. Biometrika, 58, 255–277.

Good, I. J., & Gaskins, R. A. (1980). Density estimation and bump-hunting by the penalized likelihood method exemplified by scattering and meteorite data. Journal of the American Statistical Association, 75, 42–56.

Graepel, T., & Herbrich, R. (2004). Invariant pattern recognition by semidefinite programming machines. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.

Hjort, N. L., & Glad, I. K. (1995). Nonparametric density estimation with a parametric start. Annals of Statistics, 23, 882–904.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Ishiguro, M., & Sakamoto, Y. (1984). A Bayesian approach to the probability density estimation. Annals of the Institute of Statistical Mathematics, 36, 523–538.

Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.

Lanckriet, G. R. G., El Ghaoui, L., Bhattacharyya, C., & Jordan, M. I. (2002). A robust minimax approach to classification. Journal of Machine Learning Research, 3, 555–582.

McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley.

Monteiro, R. D. C., & Todd, M. J. (2000). Path following methods. In H. Wolkowicz, R. Saigal, & L. Vandenberghe (Eds.), Handbook of semidefinite programming: Theory, algorithms, and applications (pp. 267–306). Boston: Kluwer.

Nesterov, Y. (2000). Squared functional systems and optimization problems. In H. Frenk, K. Roos, T. Terlaky, & S. Zhang (Eds.), High performance optimization (pp. 405–440). Dordrecht: Kluwer.

Nesterov, Y., & Nemirovskii, A. (1994). Interior-point methods for convex programming. Philadelphia: SIAM.

Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer.


Parzen, E. (1979). Nonparametric statistical data modeling. Journal of the American Statistical Association, 74, 105–131.

Roeder, K. (1990). Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association, 85, 617–624.

Scott, D. W. (1992). Multivariate density estimation. New York: Wiley.

Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.

Sturm, J. F. (1999). Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11/12, 625–653.

Tanabe, K., & Sagae, M. (1999). An empirical Bayes method for nonparametric density estimation. Cooperative Research Report of the Institute of Statistical Mathematics, 118, 11–37.

Tapia, R. A., & Thompson, J. R. (1978). Nonparametric probability density estimation. Baltimore: Johns Hopkins University Press.

Toh, K.-C. (1999). Primal-dual path-following algorithms for determinant maximization problems with linear matrix inequalities. Computational Optimization and Applications, 14, 309–330.

Tsuchiya, T., & Xia, Y. (2006). An extension of the standard polynomial-time primal-dual path-following algorithm to the weighted determinant maximization problem with semidefinite constraints (Research Memorandum No. 980). Tokyo: Institute of Statistical Mathematics.

Tütüncü, R. H., Toh, K. C., & Todd, M. J. (2003). Solving semidefinite-quadratic-linear programs using SDPT3. Mathematical Programming, 95, 189–217.

Vandenberghe, L., & Boyd, S. (1996). Semidefinite programming. SIAM Review, 38, 49–95.

Vandenberghe, L., Boyd, S., & Wu, S.-P. (1998). Determinant maximization with linear matrix inequality constraints. SIAM Journal on Matrix Analysis and Applications, 19, 499–533.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Waki, H., Kim, S., Kojima, M., & Muramatsu, M. (2005). Sums of squares and semidefinite programming relaxations for polynomial optimization problems with structured sparsity (Tech. Rep. No. B-411). Tokyo: Department of Mathematical and Computing Sciences, Tokyo Institute of Technology.

Wand, M. P., & Jones, M. C. (1995). Kernel smoothing. London: Chapman & Hall.

Weisberg, S. (1985). Applied linear regression. New York: Wiley.

Wolkowicz, H., Saigal, R., & Vandenberghe, L. (2000). Handbook of semidefinite programming: Theory, algorithms, and applications. Boston: Kluwer.

Yamashita, M., Fujisawa, K., & Kojima, M. (2003). Implementation and evaluation of SDPA 6.0 (SemiDefinite Programming Algorithm 6.0). Optimization Methods and Software, 18, 491–505.

Received June 14, 2005; accepted April 20, 2006.