MARCEL DEKKER, INC. • 270 MADISON AVENUE • NEW YORK, NY 10016 ©2003 Marcel Dekker, Inc. All rights reserved. This material may not be used or reproduced in any form without the express written permission of Marcel Dekker, Inc.
COMMUNICATIONS IN STATISTICS Theory and Methods Vol. 32, No. 2, pp. 327–345, 2003
Maximum Entropy Density Estimation from Fractional Moments

P. L. Novi Inverardi* and A. Tagliani

Department of Computer and Management Sciences, Faculty of Economics, Trento University, Trento, Italy
ABSTRACT  A procedure for the estimation of the probability density function of a positive random variable from its fractional moments is presented. When all the available information is provided by population fractional moments, a criterion for choosing the fractional moments themselves is given. When only a sample is known, Jaynes' maximum entropy procedure and Akaike's estimation procedure are combined to determine, respectively, which and how many sample fractional moments have to be used in the estimation of the density. Some numerical experiments are provided.
*Correspondence: P. L. Novi Inverardi, Department of Computer and Management Sciences, Faculty of Economics, Trento University, 38100 Trento, Italy; E-mail: [email protected].
DOI: 10.1081/STA-120018189
0361-0926 (Print); 1532-415X (Online)
www.dekker.com
Key Words: Entropy; Fractional moments; Hankel matrix; Maximum entropy; Hausdorff moment problem.
1. INTRODUCTION

In this paper we consider the use of fractional moments to recover a probability density function when the underlying random variable takes positive values and the corresponding distribution exhibits fat tails. This kind of situation frequently characterizes, for example, the pricing of financial derivatives, where the underlying distribution is usually assumed to be lognormal. If the distribution has infinite support and shows fat tails, it is a well-known fact that either the power moment problem is indeterminate, or the distribution does not admit any power moment, or it admits only a finite number of power moments. In the latter case, usually, the Laplace transform $L(p)$ is available, whose abscissa of convergence is equal to zero. By fractional calculus, fractional moments may be obtained as (Cressie and Borkent, 1986)
$$ E(X^{\alpha}) = \frac{\alpha}{\Gamma(1-\alpha)} \int_0^{\infty} \frac{1 - L(p)}{p^{\alpha+1}} \, dp, \qquad 0 < \alpha < 1 \tag{1.1} $$
where $\Gamma(\cdot)$ denotes the Gamma function. Higher-order fractional moments involve higher derivatives of $L(p)$. In the above framework, a viable way of reconstructing the underlying unknown density function can therefore be obtained through fractional moments; this paper is focused on that idea.

In practical cases, only a finite set of fractional moments is given. In a wide range of applications where the goal consists in choosing an approximation of the true but unknown distribution, the maximum entropy (ME) principle (Jaynes, 1978) is probably the most popular strategy. Assuming the availability of some known fractional moments, the maximum entropy principle suggests selecting, among the distributions consistent with such partial information, the one having maximum entropy, or equivalently the most uncertain one. The use of fractional moments in the framework of maximum entropy rests on two recent theoretical results.

Theorem 1.1 (Lin, 1992). A positive r.v. $X$ is uniquely characterized by an infinite sequence of positive fractional moments $\{E(X^{\alpha_j})\}_{j=1}^{\infty}$ with distinct exponents $\alpha_j \in (0, \alpha^*)$, provided $E(X^{\alpha^*}) < \infty$ for some $\alpha^* > 0$.
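Formula (1.1) can be checked numerically. The following sketch (our illustration, not from the paper) uses the Exp(1) distribution, whose Laplace transform is $L(p) = 1/(1+p)$ and whose fractional moments are $E(X^{\alpha}) = \Gamma(1+\alpha)$; the substitution $p = e^u$ makes the integrand smooth and fast-decaying.

```python
import math

def fractional_moment_from_laplace(L, alpha, lo=-40.0, hi=60.0, n=100_000):
    """Eq. (1.1): E(X^alpha) = alpha/Gamma(1-alpha) * int_0^inf (1-L(p))/p^(alpha+1) dp,
    for 0 < alpha < 1, evaluated via the substitution p = exp(u) and a trapezoidal rule."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        u = lo + i * h
        # (1 - L(p)) / p^(alpha+1) dp  becomes  (1 - L(e^u)) e^(-alpha*u) du
        g = (1.0 - L(math.exp(u))) * math.exp(-alpha * u)
        total += g if 0 < i < n else 0.5 * g
    return alpha / math.gamma(1.0 - alpha) * total * h

# Check against Exp(1): L(p) = 1/(1+p), so E(X^0.5) should be Gamma(1.5)
m = fractional_moment_from_laplace(lambda p: 1.0 / (1.0 + p), 0.5)
```

The wide integration window in $u$ is needed because the integrand decays only like $e^{-\alpha u}$ in the upper tail.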
Theorem 1.2 (see Appendix). Let $X$ be a random variable having density $f(x)$, let $\{E(X^{\alpha_j})\}$, $j = 1, \ldots, M$, with $\alpha_j = j\alpha/M$ for some $\alpha > 0$, be its $M$ fractional moments, and let $f_M(x)$ be the ME density constrained by the same $M$ fractional moments. Then $f_M(x)$ converges in entropy to the underlying unknown density $f(x)$ as $M \to \infty$.

It is also worth recalling the considerable body of results on the convergence of generalized moments based on the ME technique given by Borwein and Lewis (1991, 1993), although they concern finite intervals. If $\{g_j(x)\}_{j=1}^{\infty}$ is a sequence of (real-valued) continuous functions whose linear span is dense in a Banach space of continuous functions with the supremum norm, namely $C([0,1])$, it first follows that $f(x)$ is uniquely determined by the moment sequence $\int_0^1 g_j(x) f(x)\,dx$, $j = 1, 2, \ldots$. Next, Borwein and Lewis prove that some properties of the Shannon entropy, such as strict convexity, essential smoothness and coercivity, lead to various distinct types of convergence: weak-star, weak, in measure, in norm and uniform. Nevertheless, these striking convergence results are difficult to extend to infinite intervals. In that case, and for power moments, only entropy convergence (and, as a consequence, convergence in directed divergence and in $L_1$ norm) is available in the literature (Frontini and Tagliani, 1997). This is a simple consequence of the fact that, in the finite interval case, a countable sequence of power moments characterizes a distribution, while this does not hold in the infinite interval case. For this reason, generalized moment problems on finite and infinite intervals are structurally different, so that the results obtained by Borwein and Lewis for the finite interval case are not easily extensible to the infinite interval case.
2. FORMULATION OF THE PROBLEM AND MAIN RESULTS

Let $X$ be a positive r.v. with density $f(x)$, let
$$ \mu_{\alpha_j} := E(X^{\alpha_j}) = \int_0^{\infty} x^{\alpha_j} f(x)\,dx, \qquad j = 0, \ldots, M, \quad \alpha_0 = 0, $$
be its positive fractional moments, and let $H[f] = -\int_0^{\infty} f(x) \ln f(x)\,dx$ be the Shannon entropy of $f(x)$. From Kesavan and Kapur (1992) we know that the density function which has the same fractional moments $\mu_{\alpha_j}$, $j = 0, \ldots, M$, as the unknown density $f(x)$ and which maximizes the Shannon entropy is
given by
$$ f_M(x) = \exp\Bigl(-\sum_{j=0}^{M} \lambda_j x^{\alpha_j}\Bigr) \tag{2.1} $$
where $(\lambda_0, \ldots, \lambda_M)$ are the Lagrange multipliers; the Shannon entropy of $f_M$ is equal to
$$ H[f_M] = -\int_0^{\infty} f_M(x) \ln f_M(x)\,dx = \sum_{j=0}^{M} \lambda_j \mu_{\alpha_j}. \tag{2.2} $$
The density $f_M(x)$ given by Eq. (2.1) is the unique density that maximizes the entropy $H[f]$ under the constraint that it has the same $M$ fractional moments as $f(x)$, i.e.,
$$ \mu_{\alpha_j} := E(X^{\alpha_j}) = \int_0^{\infty} x^{\alpha_j} f_M(x)\,dx, \qquad j = 0, \ldots, M, \quad \mu_{\alpha_0} = 1. \tag{2.3} $$
Indeed, it can be proved (see Kesavan and Kapur, 1992) that the maximum value of $H[f]$ subject to Eq. (2.3) is equal to the minimum value of the potential
$$ \Phi(\lambda_1, \ldots, \lambda_M) := \sum_{j=1}^{M} \lambda_j E(X^{\alpha_j}) + \ln\Bigl[\int_0^{\infty} \exp\Bigl(-\sum_{j=1}^{M} \lambda_j x^{\alpha_j}\Bigr) dx\Bigr] \tag{2.4} $$
where $\Phi(\lambda_1, \ldots, \lambda_M)$ is a convex function of $\lambda_1, \ldots, \lambda_M$ (Kesavan and Kapur, 1992, p. 60). In formula,
$$ \max_f H[f] = \min_{\lambda_1, \ldots, \lambda_M} \Phi(\lambda_1, \ldots, \lambda_M). $$
The equivalence between Eq. (2.3) and the minimization of Eq. (2.4) is proved by observing that the stationary points of the potential $\Phi(\lambda_1, \ldots, \lambda_M)$ are solutions of the equations
$$ \frac{\partial \Phi}{\partial \lambda_j} = 0 \;\Rightarrow\; E(X^{\alpha_j}) = \frac{\int_0^{\infty} x^{\alpha_j} \exp\bigl(-\sum_{k=1}^{M} \lambda_k x^{\alpha_k}\bigr)\,dx}{\int_0^{\infty} \exp\bigl(-\sum_{k=1}^{M} \lambda_k x^{\alpha_k}\bigr)\,dx}, \qquad j = 1, 2, \ldots, M. $$
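The dual minimization above can be sketched numerically. The following illustration is ours, not the authors' implementation: it assumes exponents $\alpha = (0.5, 1)$, target moments taken from Exp(1) (so the recovered multipliers should be $\lambda \approx (0, 1)$), truncation of $[0, \infty)$ to a grid on $[0, 60]$, and Newton iteration on the convex potential, whose gradient and Hessian are moment mismatches and a covariance matrix, respectively.

```python
import math

XS = [i * 0.01 for i in range(6001)]   # grid truncating [0, inf) to [0, 60]

def normalized_moments(l1, l2):
    """E[x^(k/2)], k = 1..4, under the density proportional to exp(-l1*sqrt(x) - l2*x)."""
    sums = [0.0] * 5
    for x in XS:
        w = math.exp(-l1 * math.sqrt(x) - l2 * x)
        r, p = math.sqrt(x), 1.0
        for k in range(5):
            sums[k] += w * p
            p *= r
    return [s / sums[0] for s in sums[1:]]

# Target moments E(X^0.5), E(X^1) of Exp(1), computed on the same grid so the
# exact minimizer of the discretized potential is (l1, l2) = (0, 1).
mu1, mu2 = normalized_moments(0.0, 1.0)[:2]

l1, l2 = 0.5, 0.5                        # starting guess
for _ in range(40):                       # Newton steps on the convex dual (2.4)
    e1, e2, e3, e4 = normalized_moments(l1, l2)
    g1, g2 = mu1 - e1, mu2 - e2           # gradient of the potential
    h11, h12, h22 = e2 - e1 * e1, e3 - e1 * e2, e4 - e2 * e2   # Hessian = covariance
    det = h11 * h22 - h12 * h12
    l1 -= (h22 * g1 - h12 * g2) / det
    l2 -= (h11 * g2 - h12 * g1) / det
    l1 = min(max(l1, -20.0), 50.0)        # keep exp() well behaved on the grid
    l2 = min(max(l2, 0.05), 50.0)
```

Computing the target moments with the same quadrature rule as the iteration removes discretization bias from the recovered multipliers.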
The latter is equivalent to Eq. (2.3). We thus have two problems: the first concerns the constrained maximization of $H[f]$ over all continuous functions satisfying the given constraints, and the second concerns the unconstrained minimization of $\Phi(\lambda_1, \ldots, \lambda_M)$ over all $M$-tuples of real numbers. These can be called the primal and the dual problem, respectively, and we have shown that the maximum value of the primal problem equals the minimum value of the dual problem. More details on this equivalence for the finite interval case may be found in Borwein and Lewis (1991, 1993) and rely on a convex duality analysis of the ME program. Formula (2.2) is what in optimization is called strong duality, or zero duality gap. Zero duality gap, under a so-called constraint qualification, is proved in Borwein and Lewis (1991, 1993) in a fairly general context.

Given two probability densities $f(x)$ and $f_M(x)$, there are two well-known measures of the distance between them: the divergence measure (also called cross-entropy) $I(f, f_M) = \int_0^{\infty} f(x) \ln\bigl(f(x)/f_M(x)\bigr)\,dx$ and the variation measure $V(f, f_M) = \int_0^{\infty} |f_M(x) - f(x)|\,dx$. If $f(x)$ and $f_M(x)$ have the same fractional moments $\mu_{\alpha_j} := E(X^{\alpha_j})$, $j = 1, \ldots, M$, then
$$ I(f, f_M) = H[f_M] - H[f] \tag{2.5} $$
holds. In fact, $I(f, f_M) = \int_0^{\infty} f(x)\ln\bigl(f(x)/f_M(x)\bigr)\,dx = -H[f] + \sum_{j=0}^{M} \lambda_j \int_0^{\infty} x^{\alpha_j} f(x)\,dx = -H[f] + \sum_{j=0}^{M} \lambda_j \mu_{\alpha_j} = H[f_M] - H[f]$.

In the literature several lower bounds for the divergence measure $I$ based on $V$ are available. We shall however use the following not too restrictive bound (Kullback, 1967)
$$ I \ge \frac{V^2}{2}. \tag{2.6} $$
As far as the calculation of expected values is concerned, if $g(x)$ denotes a bounded function such that $|g(x)| \le K$, $K > 0$, and err is an upper bound for the error on the expected value, then taking into account Eqs. (2.5) and (2.6) we have
$$ |E_f(g) - E_{f_M}(g)| \le \int_0^{\infty} |g(x)|\,|f(x) - f_M(x)|\,dx \le K\sqrt{2\bigl(H[f_M] - H[f]\bigr)} < \mathrm{err}. \tag{2.7} $$
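The bound (2.6) underlying Eq. (2.7) can be illustrated numerically. The pair of densities below, Exp(1) and Exp(1/2), is our arbitrary choice for illustration, not an example from the paper; closed forms are $I = \ln 2 - 1/2$ and $V = 1/2$.

```python
import math

# Numerically check Kullback's bound I >= V^2/2 for f = Exp(1), g = Exp(1/2) on [0, inf)
h, n = 0.001, 80_000                 # grid step and size, truncating at x = 80
I = V = 0.0
for i in range(1, n):                # both integrands vanish fast at the upper end
    x = i * h
    f = math.exp(-x)
    g = 0.5 * math.exp(-0.5 * x)
    I += f * math.log(f / g) * h     # divergence (cross-entropy) I(f, g)
    V += abs(f - g) * h              # variation measure V(f, g)
```

Here the two densities cross at $x = 2\ln 2$, which gives the closed-form variation $V = 2(2^{-1} - 2^{-2}) = 0.5$, while $I = \ln 2 - 1/2 \approx 0.19315 \ge V^2/2 = 0.125$.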
According to Eq. (2.7) and Theorem 1.2 on the entropy convergence of $f_M(x)$ to $f(x)$, we are able to formulate the choice criterion for $\alpha_1, \ldots, \alpha_M$. The optimal exponents $\alpha_j$ are obtained as
$$ \{\alpha_j\}_{j=1}^{M} : \; H[f_M] = \text{minimum} \tag{2.8} $$
and, from a practical point of view, as
$$ \{\alpha_j\}_{j=1}^{M} : \; \min_{\alpha_1, \ldots, \alpha_M} \Bigl[ \min_{\lambda_1, \ldots, \lambda_M} \Phi(\lambda_1, \ldots, \lambda_M) \Bigr] \tag{2.9} $$
from which $\lambda_0$, a normalizing constant, follows by imposing that the density integrates to 1. The sequence $\alpha_1, \ldots, \alpha_M$ is optimal in the sense that it accelerates the convergence of $H[f_M]$ to $H[f]$; equivalently, it uses a minimum number of fractional moments to reach a pre-fixed (even if unknown) gap $H[f_M] - H[f]$. Formula (2.7) suggests how many moments $M$ have to be taken into account: $M$ coincides with the minimum value such that
$$ K\sqrt{2\bigl(H[f_M] - H[f]\bigr)} < \mathrm{err} \tag{2.10} $$
is satisfied. For every err value, Eq. (2.10) is satisfied if equispaced exponents $\alpha_j = (j/M)\alpha$, $j = 1, \ldots, M$, for some $\alpha > 0$, are used (see Appendix), since $H[f_M] \to H[f]$ as $M$ increases. Then exponents $\alpha_j$ satisfying Eq. (2.8) guarantee the fulfillment of inequality (2.10) too. Equation (2.8) reflects a parsimony principle, selecting the model with the lowest number of parameters.

The ME approximant density $f_M(x)$ shows a high efficiency in evaluating expected values; indeed, when the underlying Stieltjes power moment problem is determinate, ME approximants constrained by power moments converge in entropy to $f(x)$ (see Frontini and Tagliani, 1997). Then, for high $M$ values, fractional and power moments provide similar results (if the directed divergence is adopted as a measure) in the computation of expected values. Obviously power moments require a higher-order model than fractional moments to attain the same accuracy.

In many practical situations $H[f]$ is unknown. The following heuristic reasoning, in analogy with the case of power moments, allows us to estimate $H[f_M] - H[f]$ from $H[f_j]$, $j \le M$, as follows. For every distribution other than the maximum entropy distribution, the information content is shared by the whole sequence of fractional moments, and the addition of a further fractional moment causes an
entropy decrease. The size of the decrease is not the same for every fractional moment: it is reasonable to argue that the first few fractional moments generate a larger decrease than the later ones in the list. Consequently it seems reasonable to expect that the sequence of points $(j, H[f_j])$, $j = 1, 2, \ldots$, lies on a convex curve. Therefore the second order differences
$$ \Delta^2 H[f_j] = H[f_j] - 2H[f_{j-1}] + H[f_{j-2}], \qquad j > 2, $$
are positive. Next, using Aitken's $\Delta^2$-procedure, we define a new accelerated sequence
$$ H^{acc}[f_j] = H[f_j] - \frac{\bigl(H[f_j] - H[f_{j-1}]\bigr)^2}{\Delta^2 H[f_j]} $$
which converges to $H[f]$ faster than the initial sequence. Letting $H[f] \simeq H^{acc}[f_M]$, we obtain the estimate
$$ H[f_M] - H[f] \simeq \frac{\bigl(H[f_M] - H[f_{M-1}]\bigr)^2}{\Delta^2 H[f_M]} \tag{2.11} $$
to be used in Eq. (2.7). Indeed, for any prefixed error bound as given in Eq. (2.7), we can determine the minimum value of $M$ by substituting Eq. (2.11) into (2.10) and solving.
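Aitken's $\Delta^2$ step above is a one-line computation. The sketch below uses an illustrative, geometrically converging entropy sequence $H_j = H_\infty + c\,r^j$ (hypothetical numbers, not from the paper), for which the accelerated value recovers the limit exactly.

```python
def aitken_accelerated(H):
    """Aitken Delta^2 acceleration applied at the last index of the sequence H:
    H[-1] - (H[-1] - H[-2])^2 / (H[-1] - 2*H[-2] + H[-3])."""
    d1 = H[-1] - H[-2]
    d2 = H[-1] - 2.0 * H[-2] + H[-3]
    return H[-1] - d1 * d1 / d2

# Illustrative geometric entropy sequence H_j = H_inf + c*r^j, j = 1..5
H_inf, c, r = -0.1447, 0.8, 0.3
H = [H_inf + c * r ** j for j in range(1, 6)]
est = aitken_accelerated(H)      # recovers H_inf exactly for a geometric sequence
gap = H[-1] - est                # estimate (2.11) of the entropy gap H[f_M] - H[f]
```

For sequences that are only asymptotically geometric, the accelerated value is an estimate rather than the exact limit, which is why (2.11) is a heuristic.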
3. RECONSTRUCTION OF A DENSITY FROM A SAMPLE

The classical formulation of Jaynes' ME strategy presumes the availability of a set of $M$ given population (power or fractional) moments: in this case it provides a complete solution of the density estimation problem. Unfortunately, in most real problems of interest the population moments are unknown; in such cases Jaynes' ME approach defines only an infinite hierarchy of models specified by Eq. (2.1). Baker (1990) treated the reconstruction of a probability density function from a sample when the unknown distribution has finite support, involving sample integral moments as constraints in the ME density estimation procedure. He proposed a solution which combines Jaynes' ME formalism, for producing an infinite hierarchy of ME models, with Akaike's approach, for selecting the optimal member of a
hierarchy of models (Akaike, 1973). In Baker's solution, selecting the optimal member of a hierarchy of models coincides with selecting the optimal order of the model or, in other words, the optimal number of moments to be considered in the ME density reconstruction procedure. The fact that both Jaynes' and Akaike's procedures can be presented as two particular cases of minimization of the Kullback–Leibler information $I(f, f_M)$ adds a sense of unity and consistency to this approach. There are two aspects of Baker's procedure that need to be carefully reconsidered:
(a) It is a well-known fact that the sampling variability of power moments increases with the order of the moment; for this reason, in classical moment-based methods, lower order moments are used. Given that in the ME setup the information in the sample about $f(x)$ enters only via sample moments, a simple question arises: is the restriction to lower order moments always compatible with the goal of extracting all the information about the unknown population (or the most relevant part of it) contained in the data? In general, the answer is no.

(b) In general, the existence and the knowledge of all integral moments do not guarantee the unique identification of the distribution: what happens with the lognormal distribution is instructive in this sense. Indeed, unique identification follows from the existence and knowledge of the moments only if the support of the distribution is finite. This requirement seems too limiting for many real-life applications; for example, many applications in finance involve prices or other related quantities defined on the positive real axis, and the restriction to a finite interval is unacceptable.
To overcome difficulties such as (a) and (b), we propose to consider as a plug-in estimator of $E(X^{\alpha_j})$ the sample fractional moments
$$ m_{\alpha_j} = \frac{1}{n} \sum_{i=1}^{n} X_i^{\alpha_j}, \qquad \alpha_j \in \mathbb{R}^{+}. \tag{3.1} $$
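The plug-in estimator (3.1) is straightforward to compute; a minimal sketch with a toy sample of our own choosing:

```python
def sample_fractional_moment(xs, alpha):
    """Plug-in estimator (3.1): m_alpha = (1/n) * sum_i x_i^alpha."""
    return sum(x ** alpha for x in xs) / len(xs)

xs = [1.0, 4.0, 9.0]                         # toy positive sample
m_half = sample_fractional_moment(xs, 0.5)   # (1 + 2 + 3) / 3 = 2.0
```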
Indeed, the use of fractional moments with exponents $\alpha_j$ in the interval $(0, 1)$ permits us to overcome the serious weakness in (a). Further, the restriction in (b) on the support of the density can be easily bypassed using
the fractional moments: Lin (1992) proved that the existence and the knowledge of an infinite sequence of fractional moments with distinct exponents in some interval $(0, a)$, $a \in \mathbb{R}^{+}$, guarantee the unique identification of the distribution of a positive r.v. $X$ with support $[0, \infty)$. This last result is extremely important and, together with the reduced sampling variability, gives a strong motivation for involving fractional moments in ME density reconstruction from a sample.

We will extend Baker's ME density estimation procedure by involving sample fractional moments with exponents in $(0, 1)$, instead of power moments, as constraints to summarize the information contained in a given sample from an unknown distribution with support $[0, \infty)$. With respect to Baker's procedure, we have here an additional problem to solve: since the exponents $\alpha_j$ of the fractional moments are new variables to take into account, we have to decide not only how many but also which fractional moments to choose, in such a way that the estimated density properly reflects the information contained in a given sample about the unknown probability distribution. In other words, we have to choose in some "optimal" manner both the value of $M$ and the exponents $\alpha_j$ of the fractional moments.

Let $f(x)$ represent the true but unknown density of $X$ and let $f_M(x; \lambda, \alpha)$, $\lambda = (\lambda_1, \ldots, \lambda_M)$, $\alpha = (\alpha_1, \ldots, \alpha_M)$ (sometimes simply $f_M$, for notational simplicity), be the $M$-th order model. A measure of the distance between $f_M$ and $f(x)$ is given by the Kullback–Leibler measure $I(f, f_M)$ defined above; hence the best choice of the vectors $\lambda$, $\alpha$ and the scalar $M$ is the one which minimizes the distance between the $M$-th order model $f_M$ and the "reality" $f(x)$. We cannot evaluate $I(f, f_M)$ because $f(x)$ is unknown, but we can rewrite $I(f, f_M)$ as
$$ I(f, f_M) = \int_0^{\infty} f(x) \ln f(x)\,dx - \int_0^{\infty} f(x) \ln f_M(x; \lambda, \alpha)\,dx = -H[f] - E_f[\ln f_M]. \tag{3.2} $$
The term $L(\lambda, \alpha; M) = E_f[\ln f_M(x; \lambda, \alpha)]$ can be naturally estimated from a sample of $n$ observations $(X_1, X_2, \ldots, X_n)$ through the quantity
$$ \hat{L}(\lambda, \alpha, M; x) = \frac{1}{n} \sum_{i=1}^{n} \ln f_M(x_i; \lambda, \alpha) = -\Bigl(\lambda_0 + \sum_{j=1}^{M} \lambda_j m_{\alpha_j}\Bigr), \tag{3.3} $$
where
$$ \lambda_0 = \ln\Bigl[\int_0^{\infty} \exp\Bigl(-\sum_{j=1}^{M} \lambda_j x^{\alpha_j}\Bigr) dx\Bigr] \tag{3.4} $$
represents the Lagrange multiplier associated with the normalization constraint imposed so that the density integrates to 1. Then the corresponding estimate of $I(f, f_M)$ is given by
$$ \hat{I}(f, f_M) = -H[f] - \hat{L}(\lambda, \alpha, M; x). \tag{3.5} $$
We omit the constant $H[f]$ in Eq. (3.5) because it does not depend on $\lambda$, $\alpha$ or $M$; hence, for the purpose of minimizing Eq. (3.2) with respect to $\lambda$, $\alpha$, $M$, this term is irrelevant. In other words, we rescale the values of the entropy starting from $H[f]$, by defining $\hat{\Phi}(\lambda, \alpha, M; x) = \hat{I}(f, f_M) + H[f]$, so that
$$ \hat{\Phi}(\lambda, \alpha, M; x) = -\hat{L}(\lambda, \alpha, M; x) = -\frac{1}{n} \sum_{i=1}^{n} \ln f_M(x_i; \lambda, \alpha) \tag{3.6} $$
represents an estimate of the potential function (2.4) or, in other terms, an estimate of the entropy of the $M$-th order model. Now, the solution of the problem of determining which and how many fractional moments have to be considered for a proper estimation of $f(x)$ requires the solution of the following optimization problem:
$$ \min_{M} \Bigl\{ \min_{\alpha} \bigl\{ \min_{\lambda} \hat{\Phi}(\lambda, \alpha, M; x) \bigr\} \Bigr\}. \tag{3.7} $$
Following Akaike's information criterion, we reformulate the optimization problem in Eq. (3.7) as
$$ \min_{M} \Bigl\{ \min_{\alpha} \bigl\{ \min_{\lambda} \tilde{\Phi}(\lambda, \alpha, M; x) \bigr\} \Bigr\} \tag{3.8} $$
where
$$ \tilde{\Phi}(\lambda, \alpha, M; x) = \hat{\Phi}(\lambda, \alpha, M; x) + \frac{M}{n} \tag{3.9} $$
represents the sample differential entropy.

The term $M/n$ is proportional to the model order $M$, i.e., to the number of parameters we try to estimate from a given sample, and inversely proportional to the sample size $n$, and can be
interpreted in Akaike's philosophy (Akaike, 1973) as a "penalty term" which prevents us from establishing "too elaborate" models which cannot be justified by the given data. We live in a world saturated by an explosion of information; consequently, the parsimony principle becomes an important criterion whereby we attempt to retain only the relevant and useful information and discard the redundant part of it. The Akaike estimation procedure involved in finding a solution of Eq. (3.8) may be summarized as follows:

(i) For a given value of $M$, obtain the optimal value of the parameter vector $\lambda$, say $\hat{\lambda}_M = (\hat{\lambda}_1, \ldots, \hat{\lambda}_M)$, by solving
$$ \min_{\lambda} \tilde{\Phi}(\lambda, \alpha, M; x) = \min_{\lambda} \Bigl\{ -\frac{1}{n} \sum_{i=1}^{n} \ln f_M(x_i; \lambda, \alpha) + \frac{M}{n} \Bigr\}. \tag{3.10} $$
But
$$ \min_{\lambda} \Bigl\{ -\frac{1}{n} \sum_{i=1}^{n} \ln f_M(x_i; \lambda, \alpha) \Bigr\} = -\max_{\lambda} \Bigl\{ \frac{1}{n} \sum_{i=1}^{n} \ln f_M(x_i; \lambda, \alpha) \Bigr\} \tag{3.11} $$
and the term $\sum_{i=1}^{n} \ln f_M(x_i; \lambda)$ is the log-likelihood function. Hence Eq. (3.11) shows that the parameters $\hat{\lambda}_M$ which minimize $\tilde{\Phi}(\lambda, \alpha, M; x)$ with respect to $\lambda$ are the maximum likelihood estimates of $\lambda$. We use the notation $\hat{\lambda}_M$ to stress the fact that the optimal parameters for the $M$-th order model are obtained by minimization of Eq. (3.9).

(ii) Using $\hat{\lambda}_M$ from (i), the optimal values of the exponents, $\hat{\alpha}_M = (\hat{\alpha}_1, \ldots, \hat{\alpha}_M)$, are obtained by solving
$$ \min_{\alpha} \Bigl\{ -\frac{1}{n} \sum_{i=1}^{n} \ln f_M(x_i; \hat{\lambda}_M, \alpha) \Bigr\}, $$
or, in other words, by choosing the $M$ exponents $\alpha_j$, $j = 1, 2, \ldots, M$, which minimize the sample entropy of the $M$-th order model.

(iii) Using $\hat{\lambda}_M$ and $\hat{\alpha}_M$, calculate the differential entropy associated with the "best" $M$-th order model using Eq. (3.9). At this stage $\hat{\lambda}_M$, $\hat{\alpha}_M$ are known, so that the differential entropy is a function of $M$ only; for this reason we will write $\tilde{\Phi}(M)$ instead of $\tilde{\Phi}(\hat{\lambda}_M, \hat{\alpha}_M, M; x)$.

(iv) Repeating steps (i)–(iii) for a sequence of $M$ values, obtain $\tilde{\Phi}(M)$ as a function of $M$ only.
(v) Locate the optimal order of approximation, i.e., the value $M_{opt}$ which minimizes $\tilde{\Phi}(M)$ as a function of $M$:
$$ \min_{M} \bigl\{ \tilde{\Phi}(M) \bigr\} \;\to\; M_{opt}. \tag{3.12} $$

The existence of such a minimum rests on the fact that $f_M(x; \lambda, \alpha)$ is the density function corresponding to the family of ME distributions, and it is possible to approximate any other distribution with density $f(x)$ by $f_M(x; \lambda, \alpha)$ with an arbitrary degree of accuracy depending on the order $M$. It follows that it is possible to make the entropy $\hat{\Phi}(M)$ as small as we want by increasing the order $M$ of the approximant $f_M$; in other words, the function $\hat{\Phi}(M)$ decreases as a function of $M$, approaching its limit asymptotically. Now, for fixed $n$, the term $M/n$ increases linearly as a function of $M$, so that the differential entropy
$$ \tilde{\Phi}(M) = \hat{\Phi}(M) + \frac{M}{n} $$
must have a minimum for some value of $M$.
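Once the sample entropies $\hat{\Phi}(M)$ of steps (i)–(iv) have been computed, step (v) reduces to adding the penalty $M/n$ and taking the argmin. A minimal sketch, using hypothetical sample-entropy values of our own invention (not from the paper):

```python
def select_order(phi_hat, n):
    """Step (v): add the Akaike-style penalty M/n to the sample entropy phi_hat[M]
    and return the order M minimizing the differential entropy (3.9)."""
    scores = {M: p + M / n for M, p in phi_hat.items()}
    return min(scores, key=scores.get)

# Hypothetical sample-entropy values for orders M = 1..5 (illustrative only)
phi_hat = {1: 0.40, 2: 0.25, 3: 0.245, 4: 0.243, 5: 0.242}
M_opt = select_order(phi_hat, n=100)   # penalty M/n trades fit against complexity
```

With $n = 100$ the penalty outweighs the small entropy gains beyond order 2, while for a very large sample the penalty becomes negligible and higher orders win, mirroring the trade-off described above.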
4. NUMERICAL RESULTS

Example 1. To appreciate the size of the entropy bound (2.11) we choose, for illustrative purposes, a density function which does not admit power moments, having the analytical form
$$ f(x) = \frac{2}{\pi}\,\frac{1}{1+x^2}, \qquad x \ge 0, $$
with $E(X^{\alpha_j}) = [\cos(\pi\alpha_j/2)]^{-1}$, $0 < \alpha_j < 1$, and $H[f] = \ln(2\pi) \simeq 1.83787707$.
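The closed-form fractional moments and entropy quoted for this half-Cauchy density can be verified by quadrature (our sketch; the substitution $x = e^u$ keeps the integrands smooth):

```python
import math

def half_cauchy_checks(alpha, lo=-40.0, hi=40.0, n=80_000):
    """For f(x) = (2/pi)/(1+x^2) on [0, inf), return (E[X^alpha], H[f])
    by trapezoidal integration under the substitution x = exp(u)."""
    h = (hi - lo) / n
    mom = ent = 0.0
    for i in range(n + 1):
        u = lo + i * h
        x = math.exp(u)
        fx = (2.0 / math.pi) / (1.0 + x * x)
        w = h if 0 < i < n else 0.5 * h
        mom += x ** alpha * fx * x * w        # dx = x du
        ent -= fx * math.log(fx) * x * w
    return mom, ent

m, H = half_cauchy_checks(0.5)
# Expected: E(X^0.5) = 1/cos(pi/4) = sqrt(2) and H[f] = ln(2*pi)
```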
Next, we consider a sequence of approximants $f_M(x)$, $M = 1, 2, \ldots$, with fractional moments chosen in accordance with Eq. (2.8). In Table 1 we report the optimal fractional moments, the entropy difference $H[f_M] - H[f]$ and the entropy bound estimate $\Delta_M H = (H[f_M] - H[f_{M-1}])^2 / \Delta^2 H[f_M]$ given by the right-hand side of Eq. (2.11), which provides a theoretical bound for the error on expected values.

Example 2. Here our goal is to compare the performance of fractional and power moments in the reconstruction of a density. We choose the
Table 1. Optimal fractional moments and entropy difference of distributions having an increasing number of common fractional moments.

M   {α_j}_{j=1}^M                          H[f_M] − H[f]   Δ_M H
1   0.33151                                0.8956E−1
2   0.08215  0.11200                       0.1491E−2
3   0.03443  0.12660  0.45174              0.1110E−2       0.1819E−2
4   0.00098  0.06965  0.11689  0.25947     0.4169E−3
following density function
$$ f(x) = \begin{cases} \dfrac{\pi}{2}\sin(\pi x), & 0 \le x \le 1 \\ 0, & x > 1 \end{cases} $$
with $H[f] \simeq -0.144729886$. The fractional moments $E(X^{\alpha_j})$ are obtained numerically. Now $f(x)$ admits a determinate Stieltjes moment problem (see Shohat and Tamarkin, 1943). The approximate density $f_M(x)$, $x \ge 0$, is given by Eq. (2.1). In Table 2 we report the number of moments $M$ involved in the approximant and the entropy difference $H[f_M] - H[f]$ for (a) the fractional moment ME approach and (b) the power moment ME approach. Further, for case (a) the exponents $\alpha_j$ satisfying Eq. (2.8) and the quantity $\Delta_M H$ are also given. Inspection of Table 2 allows us to conclude that:

1. With optimal fractional moments satisfying Eq. (2.8) the entropy decrease is fast, so that 3–4 fractional moments practically determine $f(x)$.
2. Conversely, a higher number of power moments is required for a satisfactory characterization of $f(x)$.
3. Approximately 9 power moments have an effect comparable with 3 fractional moments (9 is the highest number of available power moments before incurring the numerical instability due to the ill-conditioning of the moment problem).
4. The high values of the exponents $\alpha_j$ reflect the fact that $f(x) = 0$ for $x > 1$.

Table 2. Optimal fractional moments and entropy difference of distributions having an increasing number of common (a) fractional moments and (b) power moments.

(a)
M   {α_j}_{j=1}^M                          H[f_M] − H[f]   Δ_M H
1   14.2464                                0.8774E−1
2   0.06618  5.48126                       0.6380E−2
3   0.05128  2.89585  33.5184              0.4668E−3       0.3044E−1
4   0.05833  2.47930  9.33220  122.269     0.2911E−3       0.3280E−2

(b)
M   H[f_M] − H[f]
1   0.4516E0
2   0.2577E−1
3   0.2136E−1
4   0.5051E−2
5   0.4526E−2
6   0.1627E−2
7   0.1503E−2
8   0.1457E−2
9   0.7288E−3
Example 3. Let $F(x)$ be an absolutely continuous distribution having density function
$$ f(x) = A \exp\bigl(3x^{1/4} - x^{3/4}\bigr) $$
with $A \simeq 0.02582851$ and $H[f] \simeq 2.36739581$. Now $f(x)$ is an ME density with characterizing moments $E(X^{1/4})$ and $E(X^{3/4})$. Next, several samples (1000 in our tests) of sizes $n = 100(100)500$ are drawn from the distribution $F(x)$. For each sample the differential entropy given by Eqs. (3.8)–(3.9) is calculated, from which $M_{opt}$ is obtained. The distribution of $M_{opt}$ is reported in Table 3 for the different values of $n$. Inspection of Table 3 shows that in most cases the optimal model, having the lowest differential entropy, is of 2nd order, as expected.
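The constants of Example 3 can be recovered numerically (our sketch; the substitution $x = t^4$, $dx = 4t^3\,dt$, makes every integrand smooth and fast-decaying, and $H[f] = -\ln A - 3E(X^{1/4}) + E(X^{3/4})$ follows from taking $-E[\ln f]$):

```python
import math

# Verify Example 3: f(x) = A * exp(3*x**0.25 - x**0.75) on [0, inf), via x = t^4
h, n = 0.001, 6_000                 # t-grid on [0, 6]; exp(3t - t^3) is negligible beyond
Z = m14 = m34 = 0.0
for i in range(n + 1):
    t = i * h
    w = (h if 0 < i < n else 0.5 * h) * 4.0 * t**3 * math.exp(3.0 * t - t**3)
    Z += w                           # normalization integral
    m14 += t * w                     # x^(1/4) = t
    m34 += t**3 * w                  # x^(3/4) = t^3
A = 1.0 / Z
H = -math.log(A) - 3.0 * m14 / Z + m34 / Z   # entropy of the ME density
```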
Table 3. Distribution of M_opt for samples having different sizes n.

n     M_opt = 1   2       3       4
100   0.213       0.632   0.118   0.037
200   0.048       0.778   0.148   0.026
300   0.018       0.788   0.164   0.030
400   0.006       0.846   0.122   0.026
500   0.002       0.778   0.190   0.030

5. SOME OPEN QUESTIONS

1. An interesting problem is related to the study of the properties of the estimator $m_{\alpha}$ of $E(X^{\alpha})$. We are currently working on it; in particular, we are using a Monte Carlo simulation setup to obtain information about the finite-sample and asymptotic properties of the proposed estimator.
2. In analogy with the convergence results of Borwein and Lewis (1991, 1993) for finite intervals, the optimal fractional moments of Eq. (2.8) should be promising for a stronger type of convergence than the entropy convergence provided in the Appendix.
3. In physical applications, usually the power moments of an unknown probability density $f(x)$ are known. An accurate recovery of $f(x)$ demands a high number of power moments, which gives rise to instability. In analogy with the finite interval case, it should be interesting to find fractional moments $\{E(X^{\alpha_j})\}$ from the knowledge of a finite (or infinite) sequence of power moments; $f(x)$ could then be recovered by the few fractional moments satisfying Eq. (2.8).
APPENDIX: ENTROPY CONVERGENCE

A.1. Some Background

Let us consider a sequence of equispaced points $\alpha_j = (\alpha/M)j$, $j = 0, \ldots, M$, for some finite $\alpha > 0$, and
$$ \mu_j = E(X^{\alpha_j}) = \int_0^{\infty} t^{\alpha_j} f_M(t)\,dt, \qquad j = 0, \ldots, M \tag{A.1} $$
with fM ðtÞ ¼ expð j ¼ EðX j Þ ¼
PM
j¼0
Z
1 0
j tj Þ. If we put x ¼ t
=M
from (A.1) we have
X M h i M M x j exp 0 ln j x j 1 ln x dx, j¼1
j ¼ 0, . . . , M
ðA:2Þ
which is a reduced Stieltjes moment problem for each fixed M value and a determinate Stieltjes moment problem when M ! 1. Referring to Eq. (A.2) the following symmetric definite positive Hankel matrices are considered 0 ¼ 0 ,
2 ¼
1 ¼ 1 ,
3 ¼
0 1
1 2
2
1 , . . . , 2M 2
0 6 .. ¼4 . M 2
2 , . . . , 2Mþ1 3
1 6 .. ¼4 . Mþ1
3 M .. 7 . 5 2M
3 Mþ1 .. 7 . 5 2Mþ1 ðA:3Þ
whose ði, jÞ-th entry i, j ¼ 0, 1, . . . holds Z1 iþj ¼ xiþj fM ðxÞ dx 0
P j where fM ðxÞ ¼ exp½ð0 lnðM= ÞÞ M j¼1 j x ð1 ðM= ÞÞ ln x. Stieltjes moment problem is determinate and the underlying distribution has a continuous distribution function FðxÞ, with density f ðxÞ. Then the maximal mass ðxÞ which can be concentrated at any real point x is equal to zero (Shohat and Tamarkin, 1943; Corollary 2.8). In particular at x ¼ 0 we have 0 ¼ ð0Þ ¼ lim ð0Þ ¼: i!1 i 2 .. .
iþ1
2i ¼ lim ð ðiÞ 0 Þ iþ1 i!1 0 .. .
ðA:4Þ
2i
where ð0Þ i indicates the largest mass which can be concentrated at a given point x ¼ 0 by any solution of a reduced moment problem of order i
and $\mu_0^{(i)}$ indicates the minimum value of $\mu_0$ once the first $2i$ moments are assigned (Frontini and Tagliani, 1997).

Let us fix $\{\mu_0, \ldots, \mu_{i-1}, \mu_{i+1}, \ldots, \mu_M\}$ while only $\mu_i$, $i = 0, \ldots, M$, varies continuously. From Eq. (A.2) we have
$$ \Delta_M \begin{bmatrix} d\lambda_0/d\mu_i \\ \vdots \\ d\lambda_M/d\mu_i \end{bmatrix} = -e_{i+1} \tag{A.5} $$
where $e_{i+1}$ is the canonical unit vector of $\mathbb{R}^{M+1}$, from which
$$ 0 < \Bigl[\frac{d\lambda_0}{d\mu_i}, \ldots, \frac{d\lambda_M}{d\mu_i}\Bigr] \, \Delta_M \begin{bmatrix} d\lambda_0/d\mu_i \\ \vdots \\ d\lambda_M/d\mu_i \end{bmatrix} = -\Bigl[\frac{d\lambda_0}{d\mu_i}, \ldots, \frac{d\lambda_M}{d\mu_i}\Bigr] e_{i+1} = -\frac{d\lambda_i}{d\mu_i} \qquad \forall i. \tag{A.6} $$
A.2. Entropy Convergence

The following theorem holds.

Theorem A.1. If $\alpha_j = (\alpha/M)j$, $j = 0, \ldots, M$, and $f_M(x) = \exp\bigl(-\sum_{j=0}^{M} \lambda_j x^{\alpha_j}\bigr)$, then
$$ \lim_{M \to \infty} H[f_M] =: \lim_{M \to \infty} -\int_0^{\infty} f_M(x) \ln f_M(x)\,dx = H[f] := -\int_0^{\infty} f(x) \ln f(x)\,dx. \tag{A.7} $$

Proof. From Eqs. (A.1) and (A.7) we have
$$ H[f_M] = \sum_{j=0}^{M} \lambda_j \mu_j. \tag{A.8} $$
Let us consider Eq. (A.8). When only $\mu_0$ varies continuously ($f_M(x)$ is then still a positive function, only losing its characteristic of being a probability density), taking into account Eqs. (A.3), (A.5) and (A.8) we have
$$ \frac{d}{d\mu_0} H[f_M] = \sum_{j=0}^{M} \mu_j \frac{d\lambda_j}{d\mu_0} + \lambda_0 = \lambda_0 - 1 $$
$$ \frac{d^2}{d\mu_0^2} H[f_M] = \frac{d\lambda_0}{d\mu_0} = -\frac{\begin{vmatrix} \mu_2 & \cdots & \mu_{M+1} \\ \vdots & & \vdots \\ \mu_{M+1} & \cdots & \mu_{2M} \end{vmatrix}}{|\Delta_M|} = -\frac{1}{\mu_0 - \mu_0^{(M)}} < 0. $$
Thus $H[f_M]$ is a differentiable concave function of $\mu_0$. When $\mu_0 \to \mu_0^{(M)}$ then $H[f_M] \to -\infty$, whilst at $\mu_0$ it holds that $H[f_M] > H[f]$, $f_M(x)$ being the maximum entropy density once $(\mu_0, \ldots, \mu_M)$ are assigned. Besides, when $M \to \infty$ then $\mu_0^{(M)} \to \mu_0$. So the theorem is proved.

Remark. Entropy convergence is guaranteed and accelerated whenever the equispaced nodes $\alpha_j = (\alpha/M)j$ are replaced by the optimal nodes (2.8).
ACKNOWLEDGMENTS

We are deeply indebted to two anonymous referees, whose criticism made possible an improvement of the first version of the article.
REFERENCES

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In: Petrov, B. N., Csaki, F., eds. Second International Symposium on Information Theory. Budapest: Akademiai Kiado, 267–281.
Baker, R. (1990). Probability estimation and information principles. Structural Safety 9:97–116.
Borwein, J. M., Lewis, A. S. (1991). Convergence of best entropy estimates. SIAM J. Optimization 1:191–205.
Borwein, J. M., Lewis, A. S. (1993). A survey of convergence results for maximum entropy methods. In: Djafari, M., Demoments, G., eds. Maximum Entropy and Bayesian Methods. Kluwer Academic Publishers, 39–48.
Cressie, N., Borkent, M. (1986). The moment generating function has its moments. Journal of Statistical Planning and Inference 13:337–344.
Frontini, M., Tagliani, A. (1997). Entropy-convergence in Stieltjes and Hamburger moment problem. Applied Mathematics and Computation 88:39–51.
Jaynes, E. T. (1978). Where do we stand on maximum entropy? In: Levine, R. D., Tribus, M., eds. The Maximum Entropy Formalism. Cambridge, MA: MIT Press, 15–118.
Kesavan, H. K., Kapur, J. N. (1992). Entropy Optimization Principles with Applications. Academic Press.
Kullback, S. (1967). A lower bound for discrimination information in terms of variation. IEEE Transactions on Information Theory IT-13:126–127.
Lin, G. D. (1992). Characterizations of distributions via moments. Sankhya: The Indian Journal of Statistics, Series A 54:128–132.
Shohat, J. A., Tamarkin, J. D. (1943). The Problem of Moments. Providence, RI: AMS Mathematical Surveys, 1.