Computing maximum likelihood estimators of convex density functions Report 95-49
T. Terlaky J.-Ph. Vial
Faculty of Technical Mathematics and Informatics, Delft University of Technology
ISSN 0922-5641
Copyright © 1995 by the Faculty of Technical Mathematics and Informatics, Delft, The Netherlands. No part of this Journal may be reproduced in any form, by print, photoprint, microfilm, or any other means without permission from the Faculty of Technical Mathematics and Informatics, Delft University of Technology, The Netherlands. Copies of these reports may be obtained from the bureau of the Faculty of Technical Mathematics and Informatics, Julianalaan 132, 2628 BL Delft, phone +31 15 2784568. A selection of these reports is available in PostScript form at the Faculty's anonymous ftp-site. They are located in the directory /pub/publications/tech-reports at ftp.twi.tudelft.nl
DELFT UNIVERSITY OF TECHNOLOGY

REPORT Nr. 95-49
Computing Maximum Likelihood Estimators of Convex Density Functions

T. Terlaky, J.-Ph. Vial

ISSN 0922-5641
Reports of the Faculty of Technical Mathematics and Informatics Nr. 95-49
Delft, May 1995
T. Terlaky, Faculty of Technical Mathematics and Informatics, Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands. e-mail: [email protected]
J.-Ph. Vial, HEC-Management Studies, Faculté des Sciences Économiques et Sociales, University of Geneva, 102 Bd Carl Vogt, CH-1211 Geneva 4, Switzerland. e-mail: [email protected]
The first author is on leave from the Eötvös University, Budapest, and partially supported by OTKA No. 2116. The second author completed this work under the support of research grant # 12-34002.92 of the Fonds National Suisse de la Recherche Scientifique.
Contents

1 Introduction
2 Problem formulation
3 Central path
4 A primal path-following method
  4.1 The Newton step
  4.2 The primal logarithmic barrier algorithm
    4.2.1 Implementation of the primal method
5 A primal-dual infeasible start algorithm
  5.1 The general scheme
    5.1.1 The primal-dual infeasible start algorithm
    5.1.2 Implementation
6 Numerical results
  6.1 Algorithmic performances
    6.1.1 Results with the primal algorithm
    6.1.2 Results with the primal-dual algorithm
  6.2 Shape of the solution
7 A clustering scheme
Abstract

We consider the problem of estimating a density function that is known in advance to be convex. The maximum likelihood estimator is then the solution of a linearly constrained convex minimization problem. This problem turns out to be numerically difficult. We show that interior point algorithms perform well on this class of optimization problems, though for large samples numerical difficulties are still encountered. To eliminate those difficulties, we propose a clustering scheme that is reasonable from a statistical point of view. We display results for problems with up to 40000 observations. We also give a typical picture of the estimated density: a piecewise linear function with only a few pieces.
Key words: interior-point method, convex estimation, maximum likelihood estimation, logarithmic-barrier method, primal-dual method.
1 Introduction

Finding a good statistical estimator can often be formulated as an unconstrained optimization problem whose objective function may be either a loss function to be minimized or a likelihood to be maximized. There exist many commercial or public domain optimization packages to solve this simple optimization problem. It is sometimes necessary to add constraints to the problem to get a meaningful estimator. The optimization problem then becomes more difficult. Several publications in the field of statistics consider such constrained optimization problems [3, 4, 5, 6, 7, 12, 19]. For a recent survey of the constrained optimization problems occurring in statistical estimation see [15]. A typical example, which has recently drawn some attention [12, 6, 11, 15], concerns the estimation of a density function that is known in advance to be convex. This convexity requirement results in a set of linear inequality constraints. Though seemingly simple, these constraints introduce considerable numerical difficulties. Standard constrained optimization packages are likely to fail when estimating convex densities from samples of medium to large sizes.

The research field of convex optimization has been extremely active in recent years, with the noticeable revival of interior point methods first introduced by [9]. The theory has progressed much, now providing complexity estimates on the number of iterations. The most remarkable fact for the practitioner is that these new methods turn out to be tremendously effective, being capable of handling very large scale problems on a microcomputer in a reasonable time. Moreover these algorithms are easy to program, either in a standard programming language or in a higher-level language such as MATLAB.

Our main purpose is to show that the convex density estimation problem is well solved by interior point methods. Toward this end, we shall give a short description of two interior point methods, a primal and a primal-dual one. The primal logarithmic barrier method under consideration is presented in [1, 14]. Due to its special structure (maximizing the sum of the logarithms of the variables under linear inequality constraints) the problem satisfies the smoothness conditions (self-concordance [18, 13] or the relative Lipschitz condition [14]) which are sufficient to ensure that the primal logarithmic barrier algorithm is polynomial. Primal-dual methods were first introduced for linear programming and linear complementarity problems. They are widely recognized to be the most efficient interior point methods. They have been successfully extended to the nonlinear case, see e.g. [21], [17], [20] and [16]. Those algorithms can be made globally convergent [17, 21, 2], but contrary to primal interior point algorithms there is no known complexity result. The algorithm we shall present here appeared in [20]. It applies to general problems with nonlinear convex constraints, but we shall restrict our presentation to linearly constrained problems only.

The two approaches, primal and primal-dual, have been extensively tested on problems generated by random sampling from four different distribution laws: the negative exponential, a quadratic, the arcsine and the inverse law. The sample sizes range from 500 to 2000. The codes were both written in MATLAB, using the option of sparse matrix algebra. Numerical difficulties were encountered for the larger sample sizes, due to a very poor condition number of the constraint matrix. On large samples, some sample points are so close to one another that the convexity requirement, which is nothing but the monotonicity of the slopes of the estimated density, becomes very difficult to enforce. Extreme closeness of sample points does not have much meaning in practice. It is statistically reasonable to aggregate them in clusters. We propose such a clustering scheme. We show that by aggregating data that are closer to one another than 10^{-5}, a distance in the range of the precision of the solution, less than 1% of the data are aggregated in the larger samples of 2000 observations. With this clustering scheme, the problem is then much easier to solve. The primal-dual algorithm successfully solves problems with much larger sample sizes: from 6000 up to 40000 observations. The increase in the total number of iterations is very mild.
Notations. The derivation of the formulas for interior point methods requires some recurrent and specific algebraic manipulations. The most common manipulations are forming a diagonal matrix whose diagonal elements are equal to the coordinates of some vector, or constructing a vector by performing coordinate-wise operations, such as multiplication or exponentiation. It has become common practice in the literature on interior point methods to use the following compact notation. Let v and s be vectors (with positive coefficients). The capitalized letters V and S designate the diagonal matrices whose diagonal terms are the coordinates of v and s respectively. The symbol e denotes the vector of all ones of appropriate dimension. Finally we shall use the short-hand notation

    s^{-1} = S^{-1} e

and

    vs = V s = S v.

The latter is the component-wise product between two vectors, known as the Hadamard product. It is interesting to note that this convention is very close to the programming style of MATLAB.
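As an illustration only (not part of the original report), these conventions translate directly into array operations; the short Python/NumPy sketch below mirrors them, with all numerical values chosen arbitrarily.

import numpy as np

v = np.array([1.0, 2.0, 4.0])      # a vector with positive coefficients
s = np.array([0.5, 0.25, 2.0])

V = np.diag(v)                     # capital letter: diagonal matrix built from v
S = np.diag(s)
e = np.ones_like(v)                # e: the all-ones vector of appropriate dimension

s_inv = np.linalg.solve(S, e)      # s^{-1} = S^{-1} e, the component-wise inverse of s
vs = v * s                         # vs = V s = S v, the Hadamard (component-wise) product

assert np.allclose(s_inv, 1.0 / s)
assert np.allclose(vs, V @ s) and np.allclose(vs, S @ v)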
2 Problem formulation

A common feature of all the statistical estimation problems collected in [15] is that they all have linear equality and inequality constraints and a convex objective. Hence our prototype optimization problem is of the form

    min  f(x)
    s.t. Ax = c,                                                  (1)
         Bx ≤ d,

where x ∈ R^n is the variable in which the minimization is done, f : R^n → R is a twice continuously differentiable convex function, A and B are p × n and m × n matrices respectively, while c ∈ R^p and d ∈ R^m. The inequality constraints may include simple bounds on x such as a nonnegativity requirement. We shall denote by ∇f(x) and ∇²f(x) the gradient and the Hessian of f, respectively.

The formulation is general enough to fit many statistical problems. The specific problem we have in mind is the estimation of an unknown convex density function g : R₊ → R₊. Let Y be the real-valued random variable with density function g. Let {y_1, ..., y_n} be an ordered sample of n outcomes of Y. We shall assume, for the time being, that y_1 < y_2 < ... < y_n. The estimator of g is a piecewise linear function g̃ : [y_1, y_n] → R₊ with break points at (y_i, g̃(y_i)), i = 1, ..., n. Let x_i > 0, i = 1, ..., n, be the estimator of g(y_i). The objective is to maximize the log-likelihood function

    f(x) = Σ_{i=1}^n ln x_i.
To match the convexity requirement for the one-dimensional density function we add the constraint that the slope of g̃ is nondecreasing. This is written as

    Δx_i / Δy_i ≤ Δx_{i+1} / Δy_{i+1},   i = 2, ..., n−1,

where Δx_i = x_i − x_{i−1} and Δy_i = y_i − y_{i−1}. We also have the requirement that g̃ is a density function: the area below g̃ must sum up to one, i.e.,

    Σ_{i=1}^{n−1} Δy_{i+1} (x_i + x_{i+1}) / 2 = 1.

The resulting optimization problem is

    min  −Σ_{i=1}^n ln x_i
    s.t. −Δy_{i+1} x_{i−1} + (Δy_i + Δy_{i+1}) x_i − Δy_i x_{i+1} ≤ 0,   i = 2, ..., n−1,      (2)
         Σ_{i=1}^{n−1} Δy_{i+1} (x_i + x_{i+1}) / 2 = 1,
         x ≥ 0.
We did not include the nonnegativity constraint on x in the matrix B. Actually, the nonnegativity constraint may be neglected altogether since the objective tends to +∞ as some components of x tend to zero. The above problem can easily be identified with the general formulation (1), with m = n − 2 and p = 1. The matrix B is tridiagonal, while A is a row vector, say a^T.

In the particular case of the log-likelihood function it turns out that the equality constraint can be dispensed with and replaced by a linear term in the objective. To see this, let us introduce the dual variables v ≥ 0 and u associated with Bx ≤ 0 and a^T x = 1, respectively. The necessary and sufficient optimality conditions for the optimization problem are

    ∇f(x) + u a + B^T v = 0,
    a^T x = 1,
    v^T Bx = 0,                                                   (3)
    Bx ≤ 0 and v ≥ 0.

Multiplying the first equation by x^T, one gets

    0 = x^T ∇f(x) + x^T (B^T v + u a) = x^T ∇f(x) + u.

Replacing x^T ∇f(x) by its value −Σ_{i=1}^n x_i x_i^{-1} = −n, we get u = n. Hence the optimal value of the dual variable u is known. One can thus replace problem (2) by the equivalent problem

    min  −Σ_{i=1}^n ln x_i + n Σ_{i=1}^{n−1} Δy_{i+1} (x_i + x_{i+1}) / 2                      (4)
    s.t. −Δy_{i+1} x_{i−1} + (Δy_i + Δy_{i+1}) x_i − Δy_i x_{i+1} ≤ 0,   i = 2, ..., n−1.
We check that problems (2) and (4) have the same set of necessary and sufficient optimality conditions. For the sake of completeness, we present the dual of problem (4). This dual form, after some simple transformations, directly follows from the Lagrange dual of (4):

    min  −Σ_i ln(a + B^T v)_i + ln(n)
    s.t. −B^T v ≤ a,                                              (5)
         v ≥ 0.

It is interesting to observe that the dual of the convex maximum likelihood estimation problem has essentially the same structure as the original primal problem. We do not know of a statistical interpretation of this dual problem.
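To make the data of problems (2) and (4) concrete, the following Python sketch builds the tridiagonal matrix B and the row vector a from an ordered sample. It is our own illustration, not code from the report (whose implementation was in MATLAB), and the helper name build_problem_data is hypothetical.

import numpy as np
from scipy.sparse import lil_matrix

def build_problem_data(y):
    """Given an ordered sample y_1 < ... < y_n, return (B, a) so that the
    convexity constraints read B x <= 0 and the area constraint reads a @ x = 1."""
    y = np.asarray(y, dtype=float)
    n = y.size
    dy = np.diff(y)                               # dy[k] = y_{k+2} - y_{k+1} in the paper's numbering
    B = lil_matrix((n - 2, n))
    for k, i in enumerate(range(1, n - 1)):       # Python index i corresponds to the paper's x_{i+1}
        B[k, i - 1] = -dy[i]                      # coefficient of x_{i-1}: -Delta y_{i+1}
        B[k, i]     = dy[i - 1] + dy[i]           # coefficient of x_i:     Delta y_i + Delta y_{i+1}
        B[k, i + 1] = -dy[i - 1]                  # coefficient of x_{i+1}: -Delta y_i
    a = np.zeros(n)                               # trapezoidal weights: a @ x = sum Delta y_{i+1}(x_i + x_{i+1})/2
    a[:-1] += dy / 2.0
    a[1:]  += dy / 2.0
    return B.tocsr(), a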
3 Central path

Interior point methods were first presented [9] as sequential unconstrained optimization. The key idea of these methods is to penalize the inequality constraints with a barrier function μ P(·), where μ > 0 is the barrier parameter and P is a one-dimensional convex real-valued function with the property that P(t) → ∞ as t → 0. Although different barrier functions were proposed in the last decades, the standard choice for P is the logarithm. Practically all the polynomially convergent interior point methods are explicitly or implicitly based on the use of the logarithmic barrier (potential) function. To the best of our knowledge no polynomial convergence result is known for other barrier functions. One of the reasons for the success of the logarithmic barrier function is probably that it is self-dual, i.e., by dualizing the logarithmic barrier problem one obtains the logarithmic barrier problem of the dual problem.

To keep our discussion more general and applicable to all the statistical estimation problems surveyed in [15], we will present the primal logarithmic barrier and the primal-dual infeasible start algorithms in the context of problem (1). We thus obtain from (1) a barrier problem of the form

    min  f(x) − μ Σ_{j=1}^m ln(d − Bx)_j
    s.t. Ax = c,                                                  (6)
         Bx ≤ d.
Because of the barrier term, the optimum of the barrier problem (6) is such that Bx < d. The inequality constraint can be dropped and we can consider that problem (6) is essentially unconstrained. Denoting by x(μ) the unique optimal point of problem (6), we can trace the set {x : x = x(μ)} as μ tends to zero. The basic result is that x(μ) tends to an optimal solution x* of the original problem (1). The set of solutions x(μ) is named the central path of problem (1). Interior point methods approximately follow that path to reach an optimal solution. To give a more formal presentation of the central path, we first state two assumptions that guarantee the existence of an optimal solution x(μ) to (6).
Assumption 3.1 Problem (1) has an interior feasible solution, i.e., a point x̄ such that Ax̄ = c and Bx̄ < d.

Assumption 3.2 The level sets {x : Ax = c, Bx ≤ d, f(x) ≤ γ} of problem (1) are bounded.
It is proved in [9] that under Assumptions 3.1 and 3.2, problem (6) has a solution for all μ > 0 and this solution is unique. Hence the necessary and sufficient first order optimality conditions

    ∇f(x) + μ B^T (d − Bx)^{-1} + A^T u = 0,                      (7)
    Ax = c,

have a unique solution (x(μ), u(μ)). The set

    {x : x = x(μ), μ > 0}

is called the primal central path. Introducing the slack variable s(μ) = d − Bx(μ) and the dual variable v(μ) = μ s(μ)^{-1}, we can rewrite the conditions as

    ∇f(x) + A^T u + B^T v = 0,                                    (8)
    Ax = c,                                                       (9)
    Bx + s = d,                                                   (10)
    vs = μ e,                                                     (11)
where e is the vector of all ones of appropriate dimension. Let F : R^n × R^m × R^p × R^m → R^n × R^p × R^m × R^m be the function defined by

    F(z) = [ ∇f(x) + A^T u + B^T v ]
           [ Ax − c                ]
           [ Bx + s − d            ]
           [ vs                    ]

with z = (x, s, u, v). The first order optimality conditions (8)-(11) are conveniently written in the compact form

    F(z) = (0, 0, 0, μe).                                         (12)

To guarantee the uniqueness of the dual variables we need one more assumption.

Assumption 3.3 The matrix A has full row rank.

The implicit function theorem guarantees that (12) has a unique, continuously differentiable solution z(μ) provided that the Jacobian matrix

    ∂F/∂z = [ ∇²f(x)   0   A^T   B^T ]
            [   A      0    0     0  ]
            [   B      I    0     0  ]
            [   0      V    0     S  ]

is regular. The latter is satisfied if, in addition to Assumption 3.3, the following conditions hold: v > 0, s > 0 and ∇²f(x) is positive semi-definite. The central path is then a smooth curve. The curve

    {z : z = (x(μ), s(μ), u(μ), v(μ)), μ > 0}

in the extended primal-dual space is named the primal-dual central path. The two algorithms that we shall present now try to approximate the central path.
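For concreteness, here is a minimal Python sketch (ours, not from the report) of the residual map behind (12), written for the linearly constrained case of problem (1); grad_f is a placeholder for a routine returning ∇f(x).

import numpy as np

def central_path_residual(x, s, u, v, mu, grad_f, A, B, c, d):
    """Return the four blocks of F(z) - (0, 0, 0, mu*e) for z = (x, s, u, v)."""
    F1 = grad_f(x) + A.T @ u + B.T @ v      # dual feasibility (8)
    F2 = A @ x - c                          # primal equality constraints (9)
    F3 = B @ x + s - d                      # primal inequality constraints with slack s (10)
    F4 = v * s - mu * np.ones_like(s)       # perturbed complementarity (11)
    return F1, F2, F3, F4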
4 A primal path-following method

Primal path-following methods (also frequently referred to as logarithmic barrier methods) basically follow the central path by repeatedly taking damped Newton steps in order to minimize the barrier function (6). In the meantime the parameter μ is decreased in a controlled way. By Newton step we mean here the step that minimizes the local quadratic approximation of the objective function in (6). This step is further constrained to lie in a linear subspace to keep feasibility of the constraints Ax = c.
4.1 The Newton step

Let us denote the Newton step by p = p(x, μ). It can easily be proved that p is the solution of

    (1/μ) ∇²f(x) p + B^T S^{-2} B p + A^T v = −(1/μ) ∇f(x) − B^T s^{-1},      (13)
    Ap = 0,

for some vector v. Let us introduce the notation

    Q(x) = (1/μ) ∇²f(x) + B^T S^{-2} B   and   g(x) = (1/μ) ∇f(x) + B^T s^{-1}.

From the first equation we have

    p = −Q(x)^{-1} (A^T v + g(x)).

By the second equation in (13) we have Ap = 0, hence

    v = −[A Q(x)^{-1} A^T]^{-1} A Q(x)^{-1} g(x).

By substituting this value in the formula for p above one obtains the Newton direction

    p = Q(x)^{-1} A^T [A Q(x)^{-1} A^T]^{-1} A Q(x)^{-1} g(x) − Q(x)^{-1} g(x).      (14)

We have that x = x(μ) if and only if p = 0. Moreover the length ||p||_H of p in the norm generated by the Hessian of the logarithmic barrier function measures the proximity of the current iterate to the central path. Denoting the logarithmic barrier function (6) by φ(x, μ), the primal path-following algorithm can formally be stated as follows.
4.2 The primal logarithmic barrier algorithm

Input:
    ε is the accuracy parameter, ε > 0;
    τ is the proximity parameter, 0 < τ < 1;
    θ is the reduction parameter, 0 < θ < 1;
    μ⁰ is the initial barrier value;
    x⁰ is a given interior feasible point such that ||p(x⁰, μ⁰)||_H ≤ τ;

begin
    x := x⁰; μ := μ⁰;
    while μ > ε/(4n) do
    begin (outer step)
        μ := (1 − θ) μ;
        while ||p||_H ≥ τ do
        begin (inner step)
            ᾱ := arg min_{α>0} { φ(x + αp, μ) : x + αp ∈ F⁰ }   (F⁰ denotes the interior of the feasible region);
            x := x + ᾱ p;
        end (inner step)
    end (outer step)
end.
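The listing above can be transcribed almost literally; the Python skeleton below is only a sketch of the control flow (our own; newton_direction, proximity and line_search are placeholders for the computations of Sections 4.1 and 4.2.1).

def primal_path_following(x0, mu0, theta, tau, eps, n,
                          newton_direction, proximity, line_search):
    """Sketch of the primal logarithmic barrier algorithm.
    theta: reduction parameter, tau: proximity parameter, eps: accuracy."""
    x, mu = x0, mu0
    while mu > eps / (4.0 * n):              # outer loop: stop once the barrier parameter is small
        mu *= (1.0 - theta)                  # reduce the barrier parameter
        p = newton_direction(x, mu)
        while proximity(x, p, mu) >= tau:    # inner loop: recenter for the new mu
            alpha = line_search(x, p, mu)    # damped step keeping x strictly feasible
            x = x + alpha * p
            p = newton_direction(x, mu)
    return x, mu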
Depending on the value of θ in the update of the barrier parameter, we have long-, medium-, and short-step path-following methods. An algorithm is called a

  - long-step algorithm if θ is a constant (0 < θ < 1), independent of ε and n;
  - medium-step algorithm if θ = ν/√n, where ν > 0 is an arbitrary constant, possibly large, independent of n and ε;
  - short-step algorithm if θ = ν/√n, and ν is so small that after a reduction of μ one unit Newton step is sufficient to reach the vicinity of the new μ-center.
As is known, the logarithmic barrier function is smooth (self-concordant, see e.g. [18]); our objective function in the maximum likelihood estimation problem is also a logarithmic function, hence the above-mentioned path-following algorithms are polynomial. To obtain an ε-optimal solution the algorithm needs at most

  - O(n ln(nμ⁰/ε)) Newton steps for the long-step variant;
  - O(√n ln(nμ⁰/ε)) Newton steps for the medium- and short-step versions.

Note that for general convex programming problems a smoothness condition, the so-called self-concordance or some equivalent of it, is needed for the polynomial complexity of the algorithm. Since our objective is to maximize the sum of the logarithms of the variables this condition is trivially satisfied. For more details on smoothness conditions see e.g. [13, 14, 18].
4.2.1 Implementation of the primal method

Our problem of interest (4) has no equality constraint, so the computation of the search direction (14) is greatly simplified. We also have d = 0 in (4). The nonnegativity constraints x ≥ 0 could be incorporated into Bx ≤ 0, but one observes that the objective (f(x) = −Σ_{i=1}^n ln x_i + n a^T x) is in fact a barrier function, hence one can simply leave out the nonnegativity constraints. In that case, the Newton direction is given by

    p = −( (1/μ) X^{-2} + B^T S^{-2} B )^{-1} ( (1/μ)(−x^{-1} + n a) + B^T s^{-1} ).      (15)

The main computational effort of the primal path-following method is in computing the Newton step (15). This is done by a Cholesky factorization of the matrix (1/μ)X^{-2} + B^T S^{-2} B and back substitution. Observe that in our problem the matrix B is tridiagonal, hence B^T S^{-2} B is 5-diagonal: the Cholesky factor has a band of at most 5 elements. As a result, the computational cost of an iteration is linear in the dimension of the problem.

It is well known [9, 13, 14] that the primal objective is monotone along the central path. This is not necessarily so for the Newton iterates when μ is fixed. However, by appropriately decreasing the barrier parameter μ, the Newton direction can be made a descent direction of the original objective function as well. This procedure ensures the monotonicity of the sequence of iterates; it also accelerates convergence.

Finally, the line search to minimize the barrier function in the Newton direction is done by a crude bisection algorithm. As a safeguard we do not make steps larger than 95% of the maximal feasible step size; we also do not make more than six bisection steps per iteration. Contrary to the general scenario, in which the computational cost of the line search is negligible in comparison with the direction finding problem, here line searching is not so cheap (or rather, solving the Newton system is cheap in this case): due to the 5-band structure of the normal matrix, computing the search direction p takes O(n) arithmetic operations, which is of the same order as a single function evaluation. One therefore has to be careful not to put too much effort into the line search.

Note that when the current iterate is close to the central path, i.e., ||p||_H < 1 (see e.g. [18, 13, 14]), then the duality gap is bounded by 2nμ. This bound can be derived without explicitly producing dual variables at nearly centered points. If we have a point on the central path then the duality gap is nμ, and the difference of the objective values at the nearly centered point and at the center can again be bounded by the same amount. These results can be proved by using the local quadratic convergence property of the Newton process applied to the logarithmic barrier function (for details see [13, 14]).

Since the primal algorithm maintains primal feasibility throughout, our only concern is to achieve a small enough duality gap. Actually we test convergence on the relative duality gap: in this measure, the optimal value of the objective is replaced by the value of the objective at the current iterate. Since we get an evaluation of the duality gap only at points where ||p||_H < 1, the algorithm terminates at nearly centered points. More precisely, the algorithm terminates when it produces an approximately centered point with

    μ ≤ ε (1 + |f(x)|) / n.

As the objective contains the only equality constraint with its exact Lagrange multiplier (n a^T x) we do not need to impose an extra stopping criterion for primal feasibility. As will also be seen from our computational results, the error in the equality constraint is always smaller than the relative duality gap (for further explanation see also Section 5.1.2).

In our computational experience we observed that the Hessian of the barrier function is sometimes severely ill-conditioned: it is then impossible to obtain the Cholesky factorization of the Hessian. In that case, we rely on the solution method for sparse systems of equations that MATLAB automatically provides. The embedded method is based on an LU decomposition; this still produces a descent direction, and the primal algorithm may then proceed safely.
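For the specific problem (4) (no equality constraints, d = 0), the direction (15) can be computed with sparse banded linear algebra. The sketch below is ours (the report's code was in MATLAB); it uses SciPy's sparse LU as a stand-in for a banded Cholesky solve and takes B, a from the earlier build_problem_data sketch.

import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import splu

def primal_newton_direction(x, mu, B, a, n):
    """Newton direction (15) for min -sum(log x) + n*a'x  s.t.  B x <= 0."""
    s = -(B @ x)                                             # slacks of Bx <= 0 (d = 0); must be > 0
    grad = -1.0 / x + n * a                                  # gradient of the objective of (4)
    Sinv = diags(1.0 / s)
    Q = diags(1.0 / (mu * x**2)) + B.T @ (Sinv @ Sinv) @ B   # (1/mu) X^{-2} + B' S^{-2} B, 5-diagonal
    rhs = -(grad / mu) - B.T @ (1.0 / s)                     # -(1/mu) grad f(x) - B' s^{-1}
    p = splu(Q.tocsc()).solve(rhs)                           # sparse factorization and back substitution
    return p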
5 A primal-dual infeasible start algorithm

5.1 The general scheme

In the primal approach of the previous section, the algorithm aims to get near the central path by taking damped projected Newton steps relative to the logarithmic barrier function. The iterates are computed in the primal space; dual variables appear as a by-product of the computations. Together with the current primal iterate, they approximately solve the system (12).

The primal-dual algorithm simultaneously operates on the primal and dual variables and takes steps that directly aim to solve (12). More precisely, the step is a Newton step with respect to the equation (12); it is given by the solution dz of the linear system

    (∂F/∂z) dz + F(z) = 0.                                        (16)

The parameter μ is gradually decreased. The goal is to obtain a solution of (12) with μ as small as possible.

The primal-dual algorithm departs from the primal barrier method in several ways. First, it is not possible to maintain primal and dual feasibility along a Newton step. The system (12) contains linear equations, corresponding to the primal constraints, but nonlinear ones corresponding to the dual constraints ∇f(x) + A^T u + B^T v = 0 (and the complementarity equations vs = μe). Once satisfied, the primal constraints remain satisfied after the Newton step, but nonlinearity destroys this property for the dual constraints. The second difference is that there is no barrier function to determine the step size. (This is partly a consequence of the dual infeasibility.) It is still possible to enforce global convergence using some kind of merit function, see [21] and [2]. However, it is shown in [20] that a very crude strategy of taking a fixed fraction, e.g. 0.99, of the maximal step size that maintains positivity of the variables turns out to be very efficient. The last difference concerns the stopping criteria, which now explicitly involve the dual variables.
5.1.1 The primal-dual infeasible start algorithm

The basic iteration of the algorithm is made of three steps.

Basic iteration
  Step 1  Compute a target value for μ.
  Step 2  Compute dz by (16).
  Step 3  Compute a step length α and update z by z := z + α dz.

We shall now briefly discuss each step of the algorithm. To this end we first introduce some notation:
    F1(z) = ∇f(x) + A^T u + B^T v,
    F2(z) = Ax − c,
    F3(z) = Bx + s − d,
    F4(z) = vs − μe.
The complementarity condition F4(z) = 0 plays a conspicuous role since it is possible to control to some extent the constraint violation by a proper choice of μ. This is in contrast with the other conditions, which are parameter-free and solely depend on the current iterate.
Choice of μ: (γ, 0 < γ < 1, is a parameter.)
  1. Compute E = m min_i{v_i s_i} / (v^T s) and the norm η = ||(F1, F2, F3)||.
  2. If E ≥ γ, set
         μ = γ (v^T s + η) / m,
     else set
         μ = γ v^T s / m.
Stepsize: (ρ > 1 is a parameter.) Compute the largest α ≥ 0 such that

    x + α dx ≥ 0,   s + α ds ≥ 0,   v + α dv ≥ 0,

and take the step α/ρ. In the reported implementations we chose the parameter values γ = 10^{-2} and ρ = 1.01 (so that the step taken is about 0.99 of the maximal feasible step size).

To compute the search direction one uses the closed form solution

    [ ∇²f(x) + B^T V S^{-1} B   A^T ] [ dx ]     [ F1 − B^T S^{-1} F4 + B^T V S^{-1} F3 ]
    [            A               0  ] [ du ] = − [ F2                                   ]      (17)

    ds = −F3 − B dx,                                              (18)
    dv = −S^{-1} F4 − V S^{-1} ds.                                (19)

The computational effort in one iteration entirely lies in solving (17), since all other operations are simple matrix-vector multiplications or simple ratio tests on vectors. To solve (17) one may resort to a direct factorization of the left-hand side coefficient matrix. If A has very few rows, i.e., p ≪ n, [...].

5.1.2 Implementation

The variables x, s and v remain strictly positive in the algorithm. It is likely that when the above optimality conditions are nearly satisfied for a fixed μ, we have Bx ≤ 0 and ∇f(x) + B^T v ≥ 0. Let us call those conditions primal and dual feasibility, respectively. It is not difficult to show, using Lagrangian duality, that whenever one has simultaneously primal and dual feasibility, the duality gap is given by
    x^T (∇f(x) + B^T v) + v^T (−Bx) = x^T ∇f(x).

In practice it is not difficult to achieve primal feasibility; if μ is not too small, dual feasibility is also achieved. If μ becomes really small, dual feasibility may become an issue. It is thus natural to relax the feasibility requirements to

    Bx < ε₁ e,
    ∇f(x) + B^T v > −ε₁ e.

The termination criterion is

    x^T ∇f(x) < ε₂ (1 + |f̂|),

where f̂ is an estimate of the optimal value f*. In our experiments, we set ε₁ = 10^{-8} and ε₂ = 10^{-6}. The same conditions with ε₁ = 0 allow the stronger statement that x is a feasible ε₂-optimal solution.

It is interesting to relate the duality gap to the barrier term μ. Assume that we have an exact solution of the optimality conditions for a fixed μ. Then

    x^T ∇f(x) = x^T (∇f(x) + B^T v) + v^T (−Bx) = x^T t + v^T s = μ (n + m),

where t = ∇f(x) + B^T v. Since there is no particular reason to look for a solution with a much smaller duality gap than ε₂ (otherwise one should simply set a different stopping criterion), we need not give too small values to μ. In our experiments we use the lower bound

    μ̲ = 0.8 ε₂ / (n + m).

Failure to converge always occurs in the following way: the parameter μ has reached its target value μ̲, the primal constraints are satisfied, but the test on the dual constraints cannot be met. This seems to correspond to errors in the computation of the search direction that are too large compared to the current precision on the iterates. Still it may happen that by letting the algorithm run, convergence is achieved. This result is the fortunate consequence of a series of more or less random moves; it is not really meaningful for our tests. Thus we introduced the following additional stopping rule: the algorithm is stopped at most ten iterations after the estimate of the relative duality gap has reached its target value ε₂.

The last issue concerns the area constraint a^T x = 1 in (2) that has been incorporated in the objective of (4). Since ∇f = −x^{-1} + n a we get

    x^T ∇f = n a^T x − n.

Hence, at termination

    1 ≤ a^T x ≤ 1 + ε₂/n.
The primal-dual algorithm is simple enough. The computations essentially involve the solution of system (16), which, after elimination, reduces to (17). We use a Cholesky factorization. Due to ill-conditioning the solution may be very poor. We improve it, if necessary, by an iterative refinement scheme [10]. In theory, this scheme requires that the residuals be computed in quadruple precision. Although we cannot meet this requirement with our programming language, we still apply the procedure: instead of doubling the number of accurate digits, as expected, we get a much more modest improvement. However, a repeated use of the scheme yields, most of the time, a solution with residuals less than 10^{-10}. In very few cases the accuracy is bad: the resulting step then probably looks more like a random move. Without the iterative refinement scheme, it is often impossible to get a solution to the required accuracy.
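To close this section, here is a sketch (ours, in Python rather than the MATLAB of the report) of one primal-dual direction computation for the specific problem (4), i.e., with d = 0 and no equality constraint, following (17)-(19), together with the crude step-length rule described above. Names and the cap of one full Newton step are illustrative assumptions.

import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import splu

def primal_dual_direction(x, s, v, mu, B, a, n):
    """Newton direction for F(z) = 0 with F4 = v*s - mu*e, specialized to problem (4)."""
    F1 = (-1.0 / x + n * a) + B.T @ v                        # dual residual (no A^T u term here)
    F3 = B @ x + s                                           # primal residual (d = 0)
    F4 = v * s - mu
    M = diags(1.0 / x**2) + B.T @ diags(v / s) @ B           # Hessian + B' V S^{-1} B, 5-diagonal
    rhs = -(F1 - B.T @ (F4 / s) + B.T @ ((v / s) * F3))      # right-hand side of (17)
    dx = splu(M.tocsc()).solve(rhs)
    ds = -F3 - B @ dx                                        # (18)
    dv = -F4 / s - (v / s) * ds                              # (19)
    return dx, ds, dv

def step_length(x, dx, s, ds, v, dv, fraction=0.99):
    """Fraction of the largest step keeping (x, s, v) positive, capped at a full Newton step."""
    alpha = 1.0
    for w, dw in ((x, dx), (s, ds), (v, dv)):
        neg = dw < 0
        if np.any(neg):
            alpha = min(alpha, np.min(-w[neg] / dw[neg]))
    return fraction * alpha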
6 Numerical results

To test the algorithms we sampled points from four distribution laws.

    law           distribution          density              range
    exponential   1 − e^{−z}            e^{−z}               z ≥ 0
    arcsine       (2/π) arcsin √z       1/(π √(z(1−z)))      0 < z < 1
    quadratic     z − z²/4              1 − z/2              0 ≤ z ≤ 2
    inverse       1 − 1/z               1/z²                 z ≥ 1

The arcsine law plays a role in the study of fluctuations; see [8]. The inverse law has no finite expectation. We considered three problem sizes defined by the sample sizes, namely 500, 1000 and 2000 observations. For each problem size, we drew ten representatives. We used the random number generator of MATLAB.
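The report used MATLAB's random number generator; the Python sketch below (ours, for illustration only) shows how the four laws in the table can be sampled by inverse-transform sampling of the listed distribution functions.

import numpy as np

def sample(law, size, rng=None):
    """Draw an ordered sample by applying the inverse distribution function to uniforms."""
    rng = np.random.default_rng(0) if rng is None else rng
    u = rng.uniform(size=size)
    if law == "exponential":      # F(z) = 1 - exp(-z),           z >= 0
        z = -np.log(1.0 - u)
    elif law == "arcsine":        # F(z) = (2/pi) arcsin(sqrt z), 0 < z < 1
        z = np.sin(np.pi * u / 2.0) ** 2
    elif law == "quadratic":      # F(z) = z - z^2/4,             0 <= z <= 2
        z = 2.0 * (1.0 - np.sqrt(1.0 - u))
    elif law == "inverse":        # F(z) = 1 - 1/z,               z >= 1
        z = 1.0 / (1.0 - u)
    else:
        raise ValueError(law)
    return np.sort(z)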
6.1 Algorithmic performances

The primal and primal-dual codes were written in MATLAB. We do not report CPU times, a measurement which is machine and environment dependent. However, MATLAB allows a count of the total number of flops; we report these numbers.
Each iteration of the algorithm requires the solution of an n by n linear system. In our case the system is very sparse and structured: the matrix is 5-diagonal, being the sum of a diagonal matrix (the Hessian of the objective) and the product of a tridiagonal matrix (B) by its transpose. Computations were made using the sparse linear algebra option of MATLAB. The number of flops is determined by those computations. It is thus closely dependent on the number of iterations. However, the number of flops is not proportional to the number of iterations, mainly because the number of iterative refinement steps varies from one iteration of the algorithm to another.

The results are reported in Tables 1 and 2. The gap is given in relative value. "Primal" (resp. "dual") is the maximum value of (Bx)_j (resp. the minimum value of (∇f(x) + B^T v)_i). A negative value (resp. positive value) indicates that there is no constraint violation. "Area" gives the error on a^T x = 1. In the case of the primal algorithm dual variables are not generated explicitly, hence in Table 1 the proximity ||p||_H is reported as the measure of centrality. Further, we recall that the primal algorithm operates in the interior of the primal feasible set, hence all the iterates and the final solution are strictly primal feasible. For each problem size we give the minimum, average and maximum value of each of the above measurements over the sample of 10 experiments.
6.1.1 Results with the primal algorithm

The results with the primal logarithmic-barrier algorithm are summarized in Table 1. Sometimes the algorithm has difficulty in centering, as numerical difficulties result in poor search directions. Hence a bound of 50 on the number of centering steps between two subsequent decreases of the barrier parameter is imposed in the algorithm, while the maximal total number of steps is set to 100. The centering parameter is set to 0.9, while ε = 10^{-6} is the termination criterion for the relative duality gap for the problems with n = 500 and n = 1000. For the larger problems ε = 10^{-5} is used. As noted at Table 1, in some cases the method failed to center sufficiently. It is also remarkable that imposing monotonicity of the objective during the centering steps sometimes results in a smaller relative duality gap than requested.

An appropriate problem-specific starting point could improve the numerical performance significantly. Here we uniformly used just a perturbation of a quadratic function. This starting point is usually far from being centered for the arcsine and the inverse law, which explains why the solution of those problems is more costly. Without perturbation the problems with the quadratic law are usually solved in less than ten iterations, but the perturbation sometimes causes difficulties here as well.
6.1.2 Results with the primal-dual algorithm

For better results we chose to scale B so that its second diagonal equals 1. The convergence parameters are ε₁ = 10^{-8} for primal and dual feasibility and ε₂ = 10^{-6} for the relative duality gap. Primal feasibility was achieved in all cases. Exact dual feasibility was obtained in most cases; in the other cases, dual feasibility was satisfied up to the ε₁ tolerance level. We did not encounter failures with the inverse law for samples of up to 2000 observations. With the exponential law, the primal-dual algorithm could solve 8 out of 10 problems of size 2000; by relaxing the criteria to ε₁ = 10^{-7} and ε₂ = 10^{-5}, all problems were solved. The arcsine law and, surprisingly enough, the quadratic law yield problems that are more difficult to solve: convergence could not be met for samples of size 2000. In the case of the arcsine law, the primal-dual algorithm ran into trouble even for problems of size 1000; in 40% of the cases we stopped the algorithm because dual feasibility was not reached within 10 iterations after the duality gap had fallen below the target ε₂ = 10^{-6}. The results are summarized in Table 2.
6.2 Shape of the solution

To illustrate the nature of the solution on a problem instance with a sample size equal to 1000, we drew two plots (see Figure 1). The first plot represents the dual variables vs. the sample points. We observe the rather chaotic behavior of the variables around an average value of 10, with a few sharp drops in well-located places. The second plot represents the estimated density. We observe that it looks like a piecewise linear curve with very few pieces. The break points correspond to the low values of the dual variables. This remarkable shape seems to prevail for any problem size and instance. The same behavior is observed when the sample points are drawn from the arcsine law. (With little surprise the same is true for the quadratic law.)
[Figure 1: Solution for n = 1000. Top: dual variables vs. sample points (logarithmic scale). Bottom: estimated density.]
7 A clustering scheme

For large sample sizes the problem becomes difficult, mostly due to very poor condition numbers of the direction-finding linear systems, often in the early and late stages of the algorithm. Typically one observes numerical indefiniteness of theoretically positive definite matrices. In such cases, instead of a Cholesky factorization one may use an LU factorization to get a usable search direction.

The numerical difficulties are probably due for a good part to observation points that are too close to one another. For instance, in the sample of our previous example with 1000 observations from the exponential distribution 1 − e^{−z}, we found that 362 points were at a distance less than 10^{-3} from their right neighbor, and that 2 pairs were closer than 10^{-5}. We may invoke two potential sources of numerical difficulties. The first one is the extreme closeness of some adjacent observation points. A small error in the estimated density at two such neighboring points may have a dramatic effect on the slopes and thus entail a severe violation of the primal constraints. This situation is likely to occur on our test problems since our samples are drawn from the exponential distribution. It is well known that this distribution generates clusters of points near the origin. The other difficulty, in case of a distribution function with an infinite range, may come from the tail of the sample: one gets very small values for the density estimates, a potential source of numerical trouble. However, we observed that those values are not small enough to induce numerical difficulties.

It is possible to remedy the first source of numerical trouble by partial aggregation of the observations. We believe that this clustering scheme should not endanger the statistical significance of the solution. Let us describe the scheme. We first choose an arbitrary resolution number. To construct the clusters we order the sample by increasing values. Then we start by enumerating the leftmost points. At some stage, data on the left are all in clusters which are at a larger distance from one another than the resolution number; data on the right are unclassified. If the immediate right neighbor of the current cluster is closer than the resolution number, the point enters the cluster and the mean of the cluster is adjusted accordingly. Otherwise, the point originates a new cluster and the process is repeated. The procedure yields a weighted sample with fewer points which are reasonably distant from one another. The above description would exactly fit the case of a physical experiment with a minimum resolution between measurements. (A sketch of this procedure is given at the end of this section.)

In the estimation problem, the likelihood function is replaced by a weighted likelihood. The optimization problem becomes

    min  −Σ_{i=1}^n w_i ln x_i
    s.t. −Δy_{i+1} x_{i−1} + (Δy_i + Δy_{i+1}) x_i − Δy_i x_{i+1} ≤ 0,   i = 2, ..., n−1,     (20)
         Σ_{i=1}^{n−1} Δy_{i+1} (x_i + x_{i+1}) / 2 = 1,
         x ≥ 0.

The primal or primal-dual algorithm needs only a minor modification. Problem (20) is better conditioned and easier to solve, by far. We applied the modified scheme to the same sample for which we pictured the maximum likelihood density estimate. The clustering procedure with a resolution number equal to 10^{-3} generated 683 clusters. (Note that this number is larger than the 638 pairs separated by more than the resolution number.) Figure 2 pictures the solution. It does not seem possible to distinguish between the density estimates in the clustered and unclustered cases. We also observe a less chaotic behavior of the dual variables.

Finally, to show the benefit of clustering, we ran the same sample of 10 test problems with n = 2000 for the four distribution laws. To check the behavior of the interior-point algorithms on very large problems, we conducted a study on samples from the exponential law. We thus drew a sample of 10 problems with n = 6000 and a single problem for each of the sizes 10000, 20000 and 40000. The resolution between clusters was chosen to be at least 10^{-5}. Note that this number is close to the tolerance levels for a solution. We give the average (min and max) number of clusters obtained this way. We see that only few data are clustered (at most 1% for n = 2000 and 3% for n = 6000). The clustering scheme drastically eliminates the numerical difficulties. We note that the computational effort grows almost linearly with the size of the problem. This is due to the nearly constant number of iterations, as is usually the case with interior point algorithms, and to the band structure of the matrix to be inverted.
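The clustering procedure described above is simple to implement. The sketch below is our own Python illustration (the function name cluster is ours); it interprets "closer than the resolution number" as the distance of the new point to the current cluster mean, and returns the cluster means together with the weights w_i for the weighted problem (20).

import numpy as np

def cluster(y, resolution):
    """Left-to-right aggregation of an ordered sample.
    A point joins the current cluster if it is within `resolution` of the
    cluster mean; otherwise it starts a new cluster.  Returns (means, weights)."""
    y = np.sort(np.asarray(y, dtype=float))
    means, weights = [y[0]], [1]
    for point in y[1:]:
        if point - means[-1] < resolution:                       # close to the running mean: merge
            weights[-1] += 1
            means[-1] += (point - means[-1]) / weights[-1]       # update the cluster mean
        else:                                                    # far enough: open a new cluster
            means.append(point)
            weights.append(1)
    return np.array(means), np.array(weights)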
[Figure 2: Solution for n = 1000 with clustered data. Top: dual variables vs. sample points (logarithmic scale). Bottom: estimated density.]
References

[1] K.M. Anstreicher, D. den Hertog, C. Roos and T. Terlaky (1993), A Long Step Barrier Method for Convex Quadratic Programming, Algorithmica 10, 365-382.
[2] K.M. Anstreicher and J.-Ph. Vial (1994), On the convergence of an infeasible primal-dual interior-point method for convex programming, Optimization Methods and Software 3, 273-283.
[3] R.E. Barlow and H.D. Brunk (1972), The Isotonic Regression Problem and Its Dual, Journal of the American Statistical Association 67, 140-147.
[4] R.L. Dykstra (1983), An Algorithm for Restricted Least Squares Regression, Journal of the American Statistical Association 78, 837-842.
[5] P. Groeneboom and G. Jongbloed (1993), Isotonic Estimation and Rates of Convergence in Wicksell's Problem, Report 93-39, Faculty of Technical Mathematics and Informatics, Delft University of Technology.
[6] P. Groeneboom and G. Jongbloed (1994), Maximum Likelihood Estimation of a Convex Decreasing Density, to appear.
[7] P. Groeneboom and J.A. Wellner (1992), Information Bounds and Nonparametric Maximum Likelihood Estimation, Birkhäuser, Basel, Switzerland.
[8] W. Feller (1966), An Introduction to Probability Theory and Its Applications, vol. II, John Wiley & Sons, New York.
[9] A.V. Fiacco and G.P. McCormick (1968), Nonlinear Programming: Sequential Unconstrained Minimization Techniques, John Wiley & Sons, New York.
[10] G.H. Golub and C.F. Van Loan (1989), Matrix Computations, 2nd edition, The Johns Hopkins University Press, Baltimore.
[11] M. Hage (1994), Estimation of Convex Decreasing Densities and Convex Regression Functions, M.Sc. Thesis, Faculty of Technical Mathematics and Informatics, Delft University of Technology.
[12] F.R. Hampel (1987), Design, Data and Analysis by Some Friends of Cuthbert Daniel, John Wiley, New York.
[13] D. den Hertog (1994), Interior Point Approach to Linear, Quadratic and Convex Programming: Algorithms and Complexity, Kluwer Academic Publishers, Dordrecht, The Netherlands.
[14] D. den Hertog, C. Roos and T. Terlaky (1992), On the Classical Logarithmic Barrier Function Method for a Class of Smooth Convex Programming Problems, Journal of Optimization Theory and Applications 73, 1-25.
[15] G. Jongbloed (1995), Some Algorithms for Minimising a Convex Function Over a Closed Convex Cone, Report 95-??, Faculty of Technical Mathematics and Informatics, Delft University of Technology.
[16] K.O. Kortanek, X. Xu and Y. Ye (1995), An infeasible interior-point algorithm for primal and dual geometric programs, Technical Report, Department of Management Science, The University of Iowa, Iowa City, USA.
[17] G.P. McCormick (1991), The superlinear convergence of a primal-dual algorithm, Technical Report GWT/ORT-550/91, Department of Operations Research, The George Washington University, Washington DC, USA.
[18] Y.E. Nesterov and A.S. Nemirovsky (1994), Interior-Point Polynomial Algorithms in Convex Programming, SIAM, Philadelphia, USA.
[19] T. Robertson, F.T. Wright and R.L. Dykstra (1988), Order Restricted Statistical Inference, John Wiley & Sons, New York.
[20] J.-Ph. Vial (1994), Computational Experience with a Primal-Dual Interior-Point Method for Smooth Convex Programming, Optimization Methods and Software 3, 285-316.
[21] H. Yamashita (1992), A globally convergent primal-dual interior-point method for constrained optimization, Technical Report, Mathematical Systems Institute Inc., Tokyo, Japan.
exponential law
  n       stat      gap       ||p||_H   area      iter   Mflops
  500     min       1.9E-08   5.3E-02   9.5E-09   34     4.1
  500     average   2.0E-08   1.3E-01   1.0E-08   36.8   4.6
  500     max       2.2E-07   2.4E-01   1.1E-08   44     5.3
  1000    min       1.9E-08   0.0E-00   9.3E-09   34     8.6
  1000    average   2.0E-08   9.9E-02   1.8E-08   39.1   10.5
  1000    max       2.1E-08   2.2E-01   9.1E-08   43     11.9
  2000    min       1.9E-07   7.8E-02   1.0E-07   31     17.4
  2000    average   3.8E-07   1.4E-01   1.9E-07   42.2   24.1
  2000    max       1.9E-06   2.4E-01   9.8E-07   66     38.1

arcsine law (a)
  500     min       5.2E-08   2.5E-02   2.5E-08   48     5.6
  500     average   7.2E-08   1.3E-01   3.5E-08   58.9   7.1
  500     max       8.8E-08   2.1E-01   4.4E-08   74     9.1
  1000    min       6.7E-08   5.9E-02   3.3E-08   52     12.5
  1000    average   8.0E-08   1.3E-01   4.0E-08   66.3   16.5
  1000    max       9.1E-08   2.3E-01   4.5E-07   77     18.7
  2000    min       7.4E-07   1.5E-03   3.7E-07   67     33.9
  2000    average   1.6E-06   9.9E-02   8.0E-07   71.7   36.3
  2000    max       7.8E-06   2.1E-01   3.9E-06   78     39.2

quadratic law (b)
  500     min       3.8E-08   1.0E-01   1.9E-08   33     4.5
  500     average   4.1E-08   1.7E-01   2.4E-08   39.4   5.0
  500     max       4.6E-08   2.4E-01   5.7E-08   46     5.8
  1000    min       4.1E-09   0.0E-00   2.0E-08   34     8.9
  1000    average   3.7E-08   1.0E-01   3.0E-07   41.4   11.3
  1000    max       4.3E-08   2.0E-01   2.6E-06   52     15.1
  2000    min       3.9E-08   0.0E-00   1.3E-07   41     22.3
  2000    average   1.2E-06   1.4E-01   1.2E-06   48.5   26.7
  2000    max       4.2E-06   8.7E-01   4.1E-06   62     35.5

inverse law (c)
  500     min       9.5E-09   9.4E-02   4.8E-09   35     4.3
  500     average   2.7E-08   2.1E-01   1.3E-08   47.1   5.8
  500     max       1.0E-07   2.4E-01   5.0E-08   75     9.4
  1000    min       1.0E-08   7.6E-02   5.1E-09   33     8.7
  1000    average   1.3E-07   2.5E-01   7.1E-08   54.5   14.2
  1000    max       9.7E-07   6.3E-01   5.3E-07   76     19.8
  2000    min       1.0E-07   1.7E-01   5.1E-08   22     12.9
  2000    average   1.4E-06   4.6E-01   7.5E-07   39.7   21.8
  2000    max       9.9E-06   9.6E-01   4.9E-06   71     37.1

(a) For n = 2000 the method failed in one case after 100 iterations; the relative gap and the area constraint were about 10^{-4}, while the proximity was 279.5. The outlier has been removed from the table.
(b) For n = 2000 the method failed in one case after 100 iterations; the area constraint was about 10^{-4}, while the proximity was 1183.1. The outlier has been removed from the table.
(c) For n = 2000 the method failed to center sufficiently in one case. The proximity was 300 when the method stopped after 100 iterations with a duality gap of about 10^{-5}. The outlier has been removed from the table.

Table 1: Performances with the primal algorithm
exponential law (a)
  n       stat      gap       primal     dual       area      iter   Mflops
  500     min       7.8E-07   -2.5E-11   4.2E-12    1.6E-06   23     3.0
  500     average   8.2E-07   -8.3E-12   3.8E-10    1.6E-06   26     3.6
  500     max       9.7E-07   -1.0E-12   1.5E-09    1.9E-06   30     4.5
  1000    min       8E-07     -7E-12     -4E-09     2E-06     25     8.6
  1000    average   8E-07     -2E-12     -4E-10     2E-06     32     12.4
  1000    max       8E-07     -3E-14     8.3E-10    2E-06     41     22.0
  2000    min       7.9E-06   -9.9E-12   -2.9E-10   1.6E-05   23     19.6
  2000    average   8.1E-06   -3.8E-12   1.1E-09    1.6E-05   30     25.5
  2000    max       8.5E-06   -1.7E-13   3.3E-09    1.6E-05   36     36.9

arcsine law (b)
  500     min       5.7E-07   -4E-11     -5.8E-11   6E-07     28     3.6
  500     average   6.1E-07   -1E-11     -6.0E-12   6E-07     31     4.0
  500     max       6.7E-07   -4E-13     4.3E-12    7E-07     34     5.2
  1000    min       5.5E-07   -6E-12     -3E-08     6E-07     33     9.7
  1000    average   6.1E-07   -2E-12     -3E-09     6E-07     38     12.4
  1000    max       6.6E-07   -2E-13     2E-12      7E-07     46     16.1

quadratic law (c)
  500     min       8E-07     -1E-11     -4E-09     1.1E-06   22     2.8
  500     average   8E-07     -3E-12     -1E-10     1.2E-06   26     4.0
  500     max       8E-07     -2E-13     1E-09      1.2E-06   32     6.3
  1000    min       8E-07     -1E-12     -7.0E-05   1E-06     18     5.6
  1000    average   8E-07     -3E-13     -1.6E-05   1E-05     38     12.9
  1000    max       8E-07     -5E-14     5.9E-10    4E-05     59     18.7

inverse law
  500     min       8.0E-07   -3.7E-11   -3.6E-10   9.1E-07   23     3.2
  500     average   8.1E-07   -1.3E-11   7.2E-11    1.4E-06   27.8   3.7
  500     max       8.5E-07   -9.1E-13   3.3E-10    4.8E-06   34     4.9
  1000    min       8.0E-07   -5.8E-12   -3.2E-08   9.4E-07   26     8.0
  1000    average   8.0E-07   -1.8E-12   -3.8E-09   1.1E-06   30.8   10.6
  1000    max       8.0E-07   -4.7E-13   1.4E-10    2.0E-06   45     17.1
  2000    min       8.0E-06   -1.1E-11   -5.7E-10   9.5E-06   29     25.0
  2000    average   8.1E-06   -4.7E-12   1.7E-10    1.0E-05   33     29.5
  2000    max       9.1E-06   -1.3E-12   7.8E-10    1.1E-05   39     41.6

(a) For n = 2000 the tolerances were relaxed, otherwise the algorithm failed on 2 problems.
(b) For n = 2000 convergence could not be met within the given tolerances. For n = 1000, convergence could not be met on 40% of the problems within 10 iterations after the precision on the gap was achieved.
(c) For n = 2000 convergence could not be met within the given tolerances.

Table 2: Performances with the primal-dual algorithm
exponential law
  sample points   stat          cluster points   gap       primal     dual      area      iter   Mflops
  2000            min           1975             7.8E-07   -2E-12     9E-12     1.6E-06   23     14.0
  2000            average       1979             8.2E-07   -9E-13     7E-11     1.6E-06   30     17.7
  2000            max           1985             9.6E-07   -8E-14     1E-10     1.9E-06   38     23.5
  6000            min           5799             7.7E-07   -2E-12     9E-12     1.5E-06   30     59.0
  6000            average       5818             7.8E-07   -9E-13     7E-11     1.6E-06   36     71.0
  6000            max           5839             7.9E-07   -8E-14     1E-10     1.6E-06   41     80.7
  10000           (1 problem)   9506             7.7E-07   -6E-13     3E-11     1.5E-06   42     157
  20000           (1 problem)   18193            7.3E-07   -2E-13     3E-11     1.5E-06   42     341
  40000           (1 problem)   33400            6.7E-07   -7E-14     7E-12     1.3E-06   54     928

arcsine law
  2000            min           1890             6.1E-07   -1E-11     2E-13     5.7E-07   38     19.1
  2000            average       1913             6.3E-07   -7E-12     5E-13     6.0E-07   40     20.2
  2000            max           1922             6.5E-07   -3E-12     9E-13     6.2E-07   43     21.8

quadratic law
  2000            min           1962             8.0E-07   -2E-12     7E-12     1.2E-06   23     14.5
  2000            average       1973             8.1E-07   -8E-13     1E-10     1.2E-06   29     17.8
  2000            max           1981             8.6E-07   -4E-13     3E-10     1.3E-06   39     22.9

inverse law
  2000            min           1941             8.0E-07   -7.3E-12   3.9E-12   9.5E-07   28     13.1
  2000            average       1952.5           8.2E-07   -2.4E-12   2.9E-11   1.0E-06   32.5   18.1
  2000            max           1961             9.7E-07   -2.5E-13   1.0E-10   1.2E-06   36     21.5

Table 3: Primal-dual algorithm with clustered data (separation: 10^{-5})
exponential law
  n       stat          gap       ||p||_H   area      iter   Mflops
  2000    min           1.9E-09   3.6E-02   1.4E-06   42     21.6
  2000    average       2.0E-09   1.3E-01   2.6E-06   50.9   25.8
  2000    max           2.1E-09   2.0E-01   3.8E-06   56     27.7
  6000    min           1.9E-10   2.8E-02   2.2E-06   57     88.6
  6000    average       1.9E-10   1.3E-01   2.5E-06   59.3   91.5
  6000    max           1.9E-10   2.0E-01   2.8E-06   65     98.3
  10000   (1 problem)   1.9E-10   1.9E-01   2.5E-06   56     144.0
  20000   (1 problem)   1.8E-10   1.1E-01   2.3E-06   69     336.8
  40000   (1 problem)   3.9E-08   2.8E-01   2.1E-06   101    908.4

arcsine law (a)
  2000    min           1.1E-08   1.4E-01   4.5E-05   66     30.9
  2000    average       1.2E-08   2.2E-01   6.2E-05   80.2   36.7
  2000    max           1.3E-08   2.4E-01   7.9E-05   95     43.1

quadratic law
  2000    min           3.8E-09   8.2E-03   4.8E-06   46     24.0
  2000    average       4.0E-09   9.3E-02   6.8E-06   52.9   27.1
  2000    max           4.1E-09   2.3E-01   9.5E-06   64     32.0

inverse law (b)
  2000    min           9.8E-08   2.0E-01   3.9E-07   26     13.4
  2000    average       2.4E-07   3.1E-01   6.0E-07   46.8   23.2
  2000    max           9.6E-07   7.1E-01   8.6E-07   79     37.2

(a) In three cases the algorithm failed; the outliers have been removed from the table.
(b) In four cases the method failed; the outliers have been removed from the table.

Table 4: Primal algorithm with clustered data (separation: 10^{-5})