
Adaptive Estimation of Shannon Entropy

Yanjun Han, Student Member, IEEE, Jiantao Jiao, Student Member, IEEE, and Tsachy Weissman, Fellow, IEEE

arXiv:1502.00326v1 [cs.IT] 1 Feb 2015

Abstract—We consider estimating the Shannon entropy of a discrete distribution P from n i.i.d. samples. Recently, Jiao, Venkat, Han, and Weissman, and Wu and Yang constructed approximation-theoretic estimators that achieve the minimax L2 rates in estimating entropy. Their estimators are consistent given n ≫ S/ln S samples, where S is the alphabet size, and this is the best possible sample complexity. In contrast, the Maximum Likelihood Estimator (MLE), which is the empirical entropy, requires n ≫ S samples. In the present paper we significantly refine the minimax results of existing work. To alleviate the pessimism of minimaxity, we adopt the adaptive estimation framework, and show that the minimax rate-optimal estimator in Jiao, Venkat, Han, and Weissman achieves the minimax rates simultaneously over a nested sequence of subsets of distributions P, without knowing the alphabet size S or which subset P lies in. In other words, their estimator is adaptive with respect to this nested sequence of the parameter space, which is characterized by the entropy of the distribution. We also characterize the maximum risk of the MLE over this nested sequence, and show, for every subset in the sequence, that the performance of the minimax rate-optimal estimator with n samples is essentially that of the MLE with n ln n samples, thereby further substantiating the generality of the phenomenon discovered by Jiao, Venkat, Han, and Weissman.

Index Terms—adaptive estimation, entropy estimation, best polynomial approximation, high dimensional statistics, large alphabet, minimax optimality

I. INTRODUCTION

Shannon entropy H(P), defined as

H(P) \triangleq \sum_{i=1}^{S} p_i \ln \frac{1}{p_i},    (1)

is one of the most fundamental quantities of information theory and statistics, which emerged in Shannon's 1948 masterpiece [1] as the answer to foundational questions of compression and communication. Consider the problem of estimating Shannon entropy H(P) from n i.i.d. samples. Classical theory is mainly concerned with the case where the number of samples n → ∞ while the alphabet size S is fixed. In that scenario, the maximum likelihood estimator (MLE), H(P_n), which plugs the empirical distribution into the definition of entropy, is asymptotically efficient [2, Thm. 8.11, Lemma 8.14] in the sense of the Hájek convolution theorem [3] and the Hájek–Le Cam local asymptotic minimax theorem [4]. It is therefore not surprising to encounter the following quote from the introduction of Wyner and Foster [5], who considered entropy estimation:

"The plug-in estimate is universal and optimal not only for finite alphabet i.i.d. sources but also for finite alphabet, finite memory sources. On the other hand, practically as well as theoretically, these problems are of little interest."

In contrast, various modern data-analytic applications deal with datasets which do not fall into the regime of fixed alphabet and n → ∞. In fact, in many applications the alphabet size S is comparable to, or even larger than, the number of samples n. For example:
• Corpus linguistics: about half of the words in the Shakespearean canon appeared only once [6].
• Network traffic analysis: many customers or website users are seen a small number of times [7].
• Analyzing neural spike trains: natural stimuli generate neural responses of high timing precision, resulting in a massive space of meaningful responses [8]–[10].

A. Existing literature

The problem of entropy estimation in the large alphabet regime (or non-asymptotic analysis) has been investigated extensively in various disciplines, for which we refer to [11] for a detailed review. One recent breakthrough in this direction came from Valiant and Valiant [12], who constructed the first explicit entropy estimator whose sample complexity is n ≍ S/ln S samples, which they also proved to be necessary. It was also shown in [13], [14] that the MLE requires n ≫ S samples, implying that the MLE is strictly sub-optimal in terms of sample complexity. However, the aforementioned estimators have not been shown to achieve the minimax L2 rates. In light of this, Jiao et al. [11] and Wu and Yang [15] independently developed schemes based on approximation theory, and obtained the minimax L2 convergence rates for the entropy. Furthermore, Jiao et al. [11] proposed a general methodology for estimating functionals, and showed that for a wide class of functionals (including entropy, mutual information, and Rényi entropy), their methodology


can construct minimax rate-optimal estimators whose performance with n samples is essentially that of the MLE with n ln n samples. They also obtained minimax L2 rates for estimating a large class of functionals. On the practical side, Jiao et al. [16] showed that the minimax rate-optimal estimators introduced in [11] can lead to consistent and substantial performance boosts in various machine learning algorithms.

Recall that the minimax risk of estimating a functional F(P) is defined via

\inf_{\hat{F}} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left( \hat{F} - F(P) \right)^2,

where M_S denotes all distributions with alphabet size S, and the infimum is taken with respect to all estimators F̂. Correspondingly, the maximum risk of the MLE F(P_n), which evaluates the functional F(·) at the empirical distribution P_n, is defined via

\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left( F(P_n) - F(P) \right)^2.

Table I, taken from Jiao et al. [11], summarizes the minimax L2 rates and the L2 rates of the MLE in estimating H(P) and F_α(P) ≜ Σ_{i=1}^S p_i^α. Whenever there are two terms, the first term corresponds to the squared bias, and the second term corresponds to the variance. It is evident that one can obtain the minimax rates from the L2 rates of the MLE by replacing n with n ln n in the dominating (bias) terms. We adopt the following notation: a_n ≲ b_n means sup_n a_n/b_n < ∞, a_n ≳ b_n means b_n ≲ a_n, and a_n ≍ b_n means a_n ≲ b_n and a_n ≳ b_n, or equivalently, there exist two universal constants c, C such that

0 < c < \liminf_{n \to \infty} \frac{a_n}{b_n} \le \limsup_{n \to \infty} \frac{a_n}{b_n} < C < \infty.    (2)

| Functional | Minimax L2 rates | Conditions | L2 rates of MLE | Conditions |
| H(P) | S²/(n ln n)² + ln²S/n | n ≳ S/ln S ([11], [15]) | S²/n² + ln²S/n | n ≳ S [14] |
| F_α(P), 0 < α ≤ 1/2 | S²/(n ln n)^{2α} | n ≳ S^{1/α}/ln S, ln n ≲ ln S ([11]) | S²/n^{2α} | n ≳ S^{1/α} [14] |
| F_α(P), 1/2 < α < 1 | S²/(n ln n)^{2α} + S^{2−2α}/n | n ≳ S^{1/α}/ln S ([11]) | S²/n^{2α} + S^{2−2α}/n | n ≳ S^{1/α} [14] |
| F_α(P), 1 < α < 3/2 | (n ln n)^{−2(α−1)} | S ≳ n ln n ([11]) | n^{−2(α−1)} | S ≳ n [14] |
| F_α(P), α ≥ 3/2 | n^{−1} [14] | | n^{−1} [14] | |

TABLE I: Comparison of the minimax L2 rates and the L2 rates of the MLE in estimating H(P) and F_α(P) ≜ Σ_{i=1}^S p_i^α. Whenever there are two terms, the first term corresponds to the squared bias, and the second term corresponds to the variance. It is evident that one can obtain the minimax rates from the L2 rates of the MLE by replacing n with n ln n in the dominating (bias) terms.
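To make the "effective sample size enlargement" in Table I concrete, the following small Python snippet (our own illustration, not part of the paper) evaluates the dominating squared-bias terms S²/n² and S²/(n ln n)² for a few alphabet sizes; universal constants and the variance terms are ignored.

```python
import math

def mle_bias_sq(S, n):
    # Dominating squared-bias term of the plug-in (MLE) estimator: S^2 / n^2.
    return (S / n) ** 2

def minimax_bias_sq(S, n):
    # Same term with n replaced by n*ln(n): S^2 / (n ln n)^2.
    return (S / (n * math.log(n))) ** 2

for S, n in [(10**4, 10**3), (10**5, 10**4), (10**6, 10**5)]:
    print(f"S={S:>8d}, n={n:>7d}:  MLE {mle_bias_sq(S, n):.3e}   minimax {minimax_bias_sq(S, n):.3e}")
```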

B. Refined minimaxity: adaptive estimation

One concern the reader may have about results on minimax rates is that they are too pessimistic. Indeed, in the definition

\inf_{\hat{F}} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left( \hat{F} - F(P) \right)^2,

we have considered the worst-case distribution P over all possible distributions supported on S elements, and it would be disappointing if the estimator in Jiao et al. [11] turned out to behave sub-optimally when we consider distributions lying in subsets of M_S. A usual approach to alleviate this concern is the adaptive estimation framework, which we briefly review below.

The primary approach to alleviating the pessimism of minimaxity in statistics is the construction of adaptive procedures, which has gained particular prominence in nonparametric statistics [17]. The goal of adaptive inference is to construct a single procedure that achieves optimality simultaneously over a collection of parameter spaces. Informally, an adaptive procedure automatically adjusts to the unknown parameter, and acts as if it knows that the parameter lies in a more restricted subset of the whole parameter space. A common way to evaluate such a procedure is to compare its maximum risk over each subset of the parameter space in the collection with the corresponding minimax risk. If they are nearly equal, then we say such a procedure is adaptive with respect to that collection of subsets of the parameter space.

The primary results of this paper are twofold.
1) First, we show that the minimax rate-optimal entropy estimator in Jiao et al. [11] is adaptive with respect to the collection of parameter spaces M_S(H), where M_S(H) ≜ {P : H(P) ≤ H, P ∈ M_S}. Moreover, the estimator does not need to know S or H, which is an advantage in practice since usually neither the alphabet size S nor an a priori upper bound on the true entropy H(P) is known.
2) Second, we show that the sample size enlargement effect still holds in this adaptive estimation scenario. Table I demonstrates that in estimating various functionals, the performance of the minimax rate-optimal estimator with n samples is nearly that of the MLE with n ln n samples, which the authors termed "effective sample size enlargement" in [11]. We compute the maximum risk of the MLE over each M_S(H), and show that for every H, the performance of the estimator in [11] with n samples is still nearly that of the MLE with n ln n samples.

These facts suggest that the estimator in Jiao et al. [11] is near optimal in a very strong sense, and we refer the readers to [11] for a detailed discussion of the methodology behind their estimator, a literature survey, and experimental results.


C. Mathematical framework and estimator construction

Before we discuss the main results, we would like to recall the construction of the entropy estimator in [11]. The approach is to tackle the estimation problem separately for the cases of "small p" and "large p" in estimating H(P), corresponding to treating the regions where the functional is "nonsmooth" and "smooth" in different ways. Specifically, after we obtain the empirical distribution P_n, for each coordinate P_n(i): if P_n(i) ≲ ln n/n, we (i) compute the best polynomial approximation of −p_i ln p_i in the regime 0 ≤ p_i ≲ ln n/n, (ii) use the unbiased estimators for the integer powers p_i^k to estimate the corresponding terms in the polynomial approximation of −p_i ln p_i up to order K_n ∼ ln n, and (iii) use that polynomial as an estimate for −p_i ln p_i. If P_n(i) ≳ ln n/n, we use the estimator −P_n(i) ln P_n(i) + 1/(2n) to estimate −p_i ln p_i. Then, we add up the estimators corresponding to each coordinate.

We define the minimax risk under the Multinomial model with n observations on an alphabet of size S for estimating H(P), P ∈ M_S(H), as

R(S, n, H) \triangleq \inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_{\mathrm{Multinomial}} \left( \hat{H} - H(P) \right)^2,    (3)

which is the quantity we will characterize in this paper. To simplify the analysis, we also utilize the Poisson sampling model, i.e., we first draw a random variable N ∼ Poi(n), and then obtain N samples from the distribution P. This is equivalent to having an S-dimensional random vector Z such that each component Z_i of Z has distribution Poi(np_i), and all coordinates of Z are independent. The counterpart of the minimax risk in the Poissonized model is defined as

R_P(S, n, H) \triangleq \inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_{\mathrm{Poisson}} \left( \hat{H} - H(P) \right)^2.    (4)

The following lemma, which follows from [11], [15], shows that the minimax risks under the Multinomial model and the Poissonized model are essentially equivalent.

Lemma 1. The minimax risks under the Poissonized model and the Multinomial model are related via the following inequalities:

R_P(S, 2n, H) - e^{-n/4} H^2 \le R(S, n, H) \le 2 R_P(S, n/2, H).    (5)
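To make the Poissonized sampling model and the splitting step described next concrete, here is a minimal Python simulation sketch (our own, with our own function names): it draws the independent counts Z_i ∼ Poi(np_i) and thins each one with a Binomial(Z_i, 1/2), which by Poisson thinning yields two independent vectors whose coordinates are Poi(np_i/2).

```python
import numpy as np

rng = np.random.default_rng(0)

def poissonized_counts(p, n):
    """Poissonized model: each coordinate Z_i ~ Poi(n * p_i), independently."""
    return rng.poisson(n * np.asarray(p))

def split_counts(Z):
    """Binomial(Z_i, 1/2) thinning: X_i and Y_i = Z_i - X_i are independent Poi(n p_i / 2)."""
    X = rng.binomial(Z, 0.5)
    return X, Z - X

# Toy example: a near-uniform distribution on S symbols.
S, n = 1000, 5000
p = np.ones(S) / S
Z = poissonized_counts(p, n)
X, Y = split_counts(Z)
print(Z[:5], X[:5], Y[:5])
```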

For simplicity of analysis, we conduct the classical "splitting" operation [18] on the Poisson random vector Z, and obtain two independent, identically distributed random vectors X = [X_1, X_2, ..., X_S]^T and Y = [Y_1, Y_2, ..., Y_S]^T, such that each component X_i of X has distribution Poi(np_i/2), and all coordinates of X are independent. For each coordinate i, the splitting process generates a random variable T_i such that T_i | Z ∼ B(Z_i, 1/2), and assigns X_i = T_i, Y_i = Z_i − T_i. All the random variables {T_i : 1 ≤ i ≤ S} are conditionally independent given our observation Z. We also note that for a random variable X such that nX ∼ Poi(np),

\mathbb{E} \left[ \prod_{r=0}^{k-1} \left( X - \frac{r}{n} \right) \right] = p^k    (6)

for any k ∈ ℕ_+. For simplicity, we re-define n/2 as n, and denote

\hat{p}_{i,1} = \frac{X_i}{n}, \quad \hat{p}_{i,2} = \frac{Y_i}{n}, \quad \Delta = \frac{c_1 \ln n}{n}, \quad K = c_2 \ln n, \quad t = \frac{\Delta}{4},    (7)

where c_1, c_2 are positive parameters to be specified later. Note that ∆, K, t are functions of n, where we omit the subscript n for brevity. The estimator Ĥ in Jiao et al. [11] is constructed as follows.

\hat{H} \triangleq \sum_{i=1}^{S} \left[ L_H(\hat{p}_{i,1}) \mathbb{1}(\hat{p}_{i,2} \le 2\Delta) + U_H(\hat{p}_{i,1}) \mathbb{1}(\hat{p}_{i,2} > 2\Delta) \right],    (8)

where

S_{K,H}(x) \triangleq \sum_{k=1}^{K} g_{k,H} (4\Delta)^{-k+1} \prod_{r=0}^{k-1} \left( x - \frac{r}{n} \right),    (9)

L_H(x) \triangleq \min \{ S_{K,H}(x), 1 \},    (10)

U_H(x) \triangleq I_n(x) \left[ -x \ln x + \frac{1}{2n} \right].    (11)

We explain each equation in detail as follows.

1) Equation (8): Note that p̂_{i,1} and p̂_{i,2} are i.i.d. random variables such that n p̂_{i,1} ∼ Poi(np_i). We use p̂_{i,2} to determine whether we are operating in the "nonsmooth" regime or not. If p̂_{i,2} ≤ 2∆, we declare that we are in the "nonsmooth" regime and plug p̂_{i,1} into the function L_H(·). If p̂_{i,2} > 2∆, we declare that we are in the "smooth" regime and plug p̂_{i,1} into U_H(·).

2) Equation (9): The coefficients r_{k,H}, 0 ≤ k ≤ K, are the coefficients of the best polynomial approximation of −x ln x over [0, 1] up to degree K, i.e.,

\sum_{k=0}^{K} r_{k,H} x^k = \arg\min_{y(x) \in \mathrm{poly}_K} \sup_{x \in [0,1]} |y(x) - (-x \ln x)|,    (12)

where poly_K denotes the set of algebraic polynomials up to order K. Note that in general these coefficients depend on K, which we do not make explicit for brevity. Then we define {g_{k,H}}_{1 ≤ k ≤ K} via

g_{k,H} = r_{k,H}, \ 2 \le k \le K, \qquad g_{1,H} = r_{1,H} - \ln(4\Delta).    (13)

Lemma 9 shows that for nX ∼ Poi(np),

\mathbb{E} \, S_{K,H}(X) = \sum_{k=1}^{K} g_{k,H} (4\Delta)^{-k+1} p^k    (14)

is a near-best polynomial approximation of −p ln p on [0, 4∆]. Thus, we can understand S_{K,H}(X), nX ∼ Poi(np), as a random variable whose expectation is nearly¹ the best approximation of the function −x ln x over [0, 4∆].

3) Equation (10): Any reasonable estimator for −p ln p should be upper bounded by the value one. We cut off S_{K,H}(x) at the upper bound 1, and define the function L_H(x), which means "lower part".

4) Equation (11): The function U_H(x) (which means "upper part") is nothing but the product of an interpolation function I_n(x) and the bias-corrected MLE. The interpolation function I_n(x) is defined as follows:

I_n(x) = \begin{cases} 0 & x \le t \\ g(x - t; t) & t < x < 2t \\ 1 & x \ge 2t \end{cases}    (15)

The following lemma characterizes the properties of the function g(x; a) appearing in the definition of I_n(x). In particular, it shows that I_n(x) ∈ C^4[0, 1].

Lemma 2. For the function g(x; a) on [0, a] defined as

g(x; a) \triangleq 126 \left( \frac{x}{a} \right)^5 - 420 \left( \frac{x}{a} \right)^6 + 540 \left( \frac{x}{a} \right)^7 - 315 \left( \frac{x}{a} \right)^8 + 70 \left( \frac{x}{a} \right)^9,    (16)

we have the following properties:

g(0; a) = 0, \quad g^{(i)}(0; a) = 0, \ 1 \le i \le 4,    (17)

g(a; a) = 1, \quad g^{(i)}(a; a) = 0, \ 1 \le i \le 4.    (18)

The function g(x; 1) is depicted in Figure 1.

¹Note that we have removed the constant term from the best polynomial approximation. It is to ensure that we assign zero to symbols we do not see.
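The best uniform approximation in (12) is typically computed offline, e.g., with the Remez algorithm. As a hedged sketch of this step (ours, not the paper's implementation), the following Python code uses a least-squares fit on a Chebyshev-spaced grid as a stand-in for the true minimax polynomial and then applies the coefficient modification (13); the function names and the choice c_1 = c_2 = 1 are ours.

```python
import numpy as np

def approx_poly_coeffs(K, num_grid=2000):
    """Coefficients r_0, ..., r_K (increasing order) of a degree-K polynomial
    approximation of -x ln x on [0, 1].

    This is a least-squares proxy (on a Chebyshev-spaced grid) for the best
    uniform approximation used in the paper, which would be computed with the
    Remez algorithm; it only illustrates the pipeline.
    """
    x = (1 - np.cos(np.pi * (np.arange(num_grid) + 0.5) / num_grid)) / 2  # Chebyshev nodes mapped to [0, 1]
    y = np.where(x > 0, -x * np.log(x), 0.0)
    return np.polynomial.polynomial.polyfit(x, y, K)

def modified_coeffs(r, Delta):
    """Apply (13): keep g_k = r_k for k >= 2, set g_1 = r_1 - ln(4*Delta),
    and drop the constant term (so unseen symbols are estimated as zero)."""
    g = np.array(r, dtype=float)
    g[0] = 0.0
    g[1] = r[1] - np.log(4 * Delta)
    return g

# Example with n = 10**5 and c1 = c2 = 1 (placeholder values for the paper's tuning constants).
n, c1, c2 = 10**5, 1.0, 1.0
K = int(np.ceil(c2 * np.log(n)))
Delta = c1 * np.log(n) / n
g = modified_coeffs(approx_poly_coeffs(K), Delta)
print(K, g[:4])
```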

Fig. 1: The function g(x; 1) over the interval [0, 1].
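Putting the pieces together, the following self-contained Python sketch (our own illustration) assembles an estimator of the form (8) from Poissonized counts: it performs the splitting step, uses a least-squares polynomial proxy in place of the true best uniform approximation for the nonsmooth regime (9)–(10), and uses the interpolated bias-corrected plug-in U_H of (11), with I_n and g as in (15)–(16), for the smooth regime. The constants c_1 = c_2 = 1 and the approximation step are placeholders rather than the tuned choices of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def poly_coeffs(K):
    # Least-squares proxy (Chebyshev-spaced grid) for the best uniform
    # approximation of -x ln x on [0, 1]; the paper uses the true minimax polynomial.
    x = (1 - np.cos(np.pi * (np.arange(2000) + 0.5) / 2000)) / 2
    y = np.where(x > 0, -x * np.log(x), 0.0)
    return np.polynomial.polynomial.polyfit(x, y, K)

def g_interp(x, a):
    # Smooth interpolation g(x; a) of (16), used to define I_n in (15).
    t = x / a
    return 126 * t**5 - 420 * t**6 + 540 * t**7 - 315 * t**8 + 70 * t**9

def I_n(x, t):
    # I_n(x) = 0 for x <= t, g(x - t; t) for t < x < 2t, and 1 for x >= 2t.
    return np.where(x <= t, 0.0, np.where(x >= 2 * t, 1.0, g_interp(x - t, t)))

def entropy_estimate(Z, n, c1=1.0, c2=1.0):
    """Sketch of the estimator (8) from Poissonized counts Z_i ~ Poi(n p_i)."""
    # Split Z into two independent Poisson vectors and, as in the paper, re-define n/2 as n.
    X = rng.binomial(Z, 0.5)
    Y = Z - X
    n = n / 2
    p1, p2 = X / n, Y / n
    Delta = c1 * np.log(n) / n
    K = int(np.ceil(c2 * np.log(n)))
    t = Delta / 4

    # Nonsmooth regime: S_{K,H} of (9) via unbiased falling-factorial estimates of p^k,
    # with coefficients modified as in (13); then L_H = min(S_{K,H}, 1) as in (10).
    r = poly_coeffs(K)
    g = r.copy(); g[0] = 0.0; g[1] = r[1] - np.log(4 * Delta)
    S_KH = np.zeros_like(p1)
    prod = np.ones_like(p1)
    for k in range(1, K + 1):
        prod = prod * (p1 - (k - 1) / n)          # prod_{r=0}^{k-1} (X/n - r/n)
        S_KH += g[k] * (4 * Delta) ** (-k + 1) * prod
    L_H = np.minimum(S_KH, 1.0)

    # Smooth regime: interpolated bias-corrected plug-in, U_H of (11).
    safe = np.where(p1 > 0, p1, 1.0)
    plug = -safe * np.log(safe) + 1 / (2 * n)
    U_H = I_n(p1, t) * plug

    return np.sum(np.where(p2 <= 2 * Delta, L_H, U_H))

# Quick demo on a Zipf-like distribution.
S, n = 2000, 10**5
p = 1.0 / np.arange(1, S + 1); p /= p.sum()
Z = rng.poisson(n * p)
print("true H:", float(-(p * np.log(p)).sum()), " estimate:", float(entropy_estimate(Z, n)))
```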

II. MAIN RESULTS

Since sup_{P ∈ M_S} H(P) = ln S, we assume throughout this paper that 0 < H ≤ ln S. Denote by M_S(H) the set of all discrete probability distributions P with support size |supp(P)| = S and entropy H(P) ≤ H. We say an estimator Ĥ ≡ Ĥ(Z) is within accuracy ε > 0 if and only if

\left( \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P |\hat{H} - H(P)|^2 \right)^{1/2} \le \epsilon.    (19)

For the plug-in estimator H(P_n), the following theorem presents non-asymptotic upper and lower bounds for the L2 risk.

Theorem 1. If H ≥ H_0 > 0, where H_0 is a universal positive constant, then for the plug-in estimator H(P_n) we have

\sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P |H(P_n) - H(P)|^2 \asymp \begin{cases} \dfrac{S^2}{n^2} + \dfrac{H \ln S}{n} & \text{if } S \ln S \le enH, \\[2mm] \left( \dfrac{H}{\ln S} \ln \dfrac{S \ln S}{nH} \right)^2 & \text{otherwise.} \end{cases}    (20)

Note that the only assumption in Theorem 1 is that the upper bound H should be no smaller than a constant, which is a reasonable assumption to avoid the subtle case where the naive zero estimator Ĥ ≡ 0 has satisfactory performance. The minimum sample complexity of the plug-in approach can be immediately obtained from Theorem 1.

Corollary 1. If H ≥ H_0 > 0, where H_0 is a universal positive constant, the plug-in estimator H(P_n) is within accuracy ε if and only if n ≳ S^{1−ε/H} · (ln S)/H.

Recall that it requires n ≳ S samples for the MLE to achieve accuracy ε when there is no constraint on the entropy [11]. Hence, when the upper bound on the entropy is loose, i.e., H ≍ ln S, the minimum sample complexity in the bounded entropy case is exactly the same, i.e., we cannot essentially improve the estimation performance. On the other hand, when the upper bound is tight, i.e., H ≪ ln S, the required sample complexity enjoys a significant reduction, i.e., we only need a sublinear number of samples for accurate entropy estimation. When it comes to the maximum L2 risk, we conclude from Theorem 1 that the bounded entropy property helps only at the boundary, i.e., when n is close to S and H is small. Moreover, this help vanishes quickly as S increases: when n = S^{1−δ}, the maximum L2 risk is of order (δH)², which is the same risk achieved by the naive zero estimator when δ is not close to zero.

Is the plug-in estimator H(P_n) optimal in the minimax sense? It has been shown in [11], [12], [15] that when there is no constraint on H(P), i.e., H = ln S, the answer is negative. What about subsets of M_S, such as M_S(H)? The following theorem characterizes the minimax L2 rates over M_S(H).

Theorem 2. If H ≥ H_0 > 0, where H_0 is a universal positive constant, then

\inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P |\hat{H} - H(P)|^2 \asymp \begin{cases} \dfrac{S^2}{(n \ln n)^2} + \dfrac{H \ln S}{n} & \text{if } S \ln S \le enH \ln n, \\[2mm] \left( \dfrac{H}{\ln S} \ln \dfrac{S \ln S}{nH \ln n} \right)^2 & \text{otherwise,} \end{cases}    (21)


where the infimum is taken over all possible estimators. Moreover, the upper bound is achieved by the estimator in [11] under the Poissonized model without knowledge of H or S.

An immediate result on the sample complexity is as follows.

Corollary 2. If H ≥ H_0 > 0, where H_0 is a universal positive constant, the minimax rate-optimal estimator in [11] is within accuracy ε if and only if n ≳ (1/H) S^{1−ε/H}.

For the minimum sample complexity, we again distinguish two cases for H. Firstly, when H ≍ ln S, the required sample complexity is n ≍ S/(ε ln S), which recovers the minimax results with no constraint on the entropy in [11]. Secondly, when H ≪ ln S, there is a significant improvement. We also conclude from Theorem 2 that the bounded entropy constraint again helps only at the boundary, and this help vanishes quickly as S increases: when n = S^{1−δ}, we do not have sufficient information to make inference, and the naive zero estimator is near-minimax.

To sum up, we have obtained the following conclusions.
1) The minimax rate-optimal entropy estimator in Jiao et al. [11] is adaptive with respect to the collection of parameter spaces M_S(H), where M_S(H) ≜ {P : H(P) ≤ H, P ∈ M_S}. Moreover, the estimator does not need to know S or H, which is an advantage in practice since usually neither the alphabet size S nor an a priori upper bound on the true entropy H(P) is known.
2) The sample size enlargement effect still holds in this adaptive estimation scenario. Table I demonstrates that in estimating various functionals, the performance of the minimax rate-optimal estimator with n samples is essentially that of the MLE with n ln n samples, which the authors termed "sample size enlargement" in [11]. Theorems 1 and 2 show that over every M_S(H), the performance of the estimator in [11] with n samples is still essentially that of the MLE with n ln n samples.

III. PROOF OF UPPER BOUNDS IN THEOREM 1

First we consider the case where S ln S ≤ enH. For the bias, it has been shown in [13] that

\mathrm{Bias}(H(P_n)) \le \ln \left( 1 + \frac{S-1}{n} \right) \le \frac{S}{n}.    (22)

As for the variance, [11] shows via the Efron–Stein inequality that

\mathrm{Var}(H(P_n)) \le \frac{2}{n} \sum_{i=1}^{S} p_i (\ln p_i - 2)^2 \le \frac{2}{n} \left( \sum_{i=1}^{S} p_i (\ln p_i)^2 + 4H + 4 \right).    (23)

Lemma 3. For any discrete distribution P = (p_1, p_2, ..., p_S) with alphabet size S ≥ 2, we have

\sum_{i=1}^{S} p_i (\ln p_i)^2 \le 2 \ln S \cdot \left( \sum_{i=1}^{S} -p_i \ln p_i \right) + 3.    (24)

In light of Lemma 3, we conclude that

\mathrm{Var}(H(P_n)) \le \frac{2}{n} \left( \sum_{i=1}^{S} p_i (\ln p_i)^2 + 4H + 4 \right) \le \frac{2}{n} (2H \ln S + 4H + 7) \lesssim \frac{H \ln S}{n},    (25)

where we have used the assumption H ≥ H_0 > 0 in the last step. Hence, when S ln S ≤ enH, we have

\mathbb{E}_P (H(P_n) - H(P))^2 = (\mathrm{Bias}(H(P_n)))^2 + \mathrm{Var}(H(P_n)) \lesssim \frac{S^2}{n^2} + \frac{H \ln S}{n},    (26)

which completes the proof of the first part. For the second part, we introduce a lemma first.

Lemma 4. For p ≤ c/(2en) and n p̂ ∼ B(n, p), where c is a positive integer, we have

0 \le -p \ln p - \mathbb{E}[-\hat{p} \ln \hat{p}] \le -p \ln(np) + p \ln c + \frac{c \ln c}{n} \left( \frac{enp}{c} \right)^c + \sum_{k=c+1}^{n} \frac{\ln k + 1}{n} \left( \frac{enp}{k} \right)^k.    (27)


Define ξ(p̂_i) = −p̂_i ln p̂_i. In light of Lemma 4, for any positive integer c, we have

\sum_{i: p_i \le \frac{c}{2en}} |\mathrm{Bias}(\xi(\hat{p}_i))| \le \sum_{i: p_i \le \frac{c}{2en}} \left[ -p_i \ln(np_i) + p_i \ln c + \frac{c \ln c}{n} \left( \frac{enp_i}{c} \right)^c + \sum_{k=c+1}^{n} \frac{\ln k + 1}{n} \left( \frac{enp_i}{k} \right)^k \right]    (28)

\le \sum_{i: p_i \le \frac{c}{2en}} \left[ -p_i \ln \left( \frac{np_i}{c} \right) + \frac{c \ln c}{n} \left( \frac{enp_i}{c} \right)^c + \sum_{k=c+1}^{n} \frac{\ln k + 1}{n} \left( \frac{enp_i}{k} \right)^k \right]    (29)

\le \sum_{i: p_i \le \frac{c}{2en}} \left[ -p_i \ln \left( \frac{np_i}{c} \right) \right] + \frac{c \ln c}{n} \cdot \frac{2en}{c} \cdot 2^{-c} + \sum_{k=c+1}^{n} \frac{\ln k + 1}{n} \cdot \frac{2en}{c} \left( \frac{c}{2k} \right)^k    (30)

\le \sum_{i: p_i \le \frac{c}{2en}} \left[ -p_i \ln \left( \frac{np_i}{c} \right) \right] + 2e \cdot 2^{-c} \ln c + 2e \sum_{k=c+1}^{n} \frac{\ln k + 1}{k 2^k}    (31)

\le \sum_{i: p_i \le \frac{c}{2en}} \left[ -p_i \ln \left( \frac{np_i}{c} \right) \right] + 2e \cdot 2^{-c} (\ln c + 1),    (32)

where we have used the convexity of x^k, 0 ≤ x ≤ 1, for any k ≥ 1 in (30). We consider the following optimization problem:

\text{maximize} \ \sum_{i: p_i \le \frac{c}{2en}} -p_i \ln \left( \frac{np_i}{c} \right) \quad \text{subject to} \ \sum_{i: p_i \le \frac{c}{2en}} -p_i \ln p_i \le H, \quad A_1 \equiv \left| \left\{ i : p_i \le \frac{c}{2en} \right\} \right| \le S.    (33)

It is straightforward to show that in the solution to (33), all p_i ≤ c/(2en) should be equal, say, to p_0. Then (33) reduces to

\text{maximize} \ A_1 p_0 \ln \left( \frac{c}{np_0} \right) \quad \text{subject to} \ 0 \le p_0 \le \frac{c}{2en}, \ A_1 \le S, \ -A_1 p_0 \ln p_0 \le H,    (34)

whose optimal value is no larger than that of

\text{maximize} \ A_1 p_0 \ln \left( \frac{c}{np_0} \right) \quad \text{subject to} \ A_1 \le S, \ -A_1 p_0 \ln p_0 \le H.    (35)

It is easy to check that the solution to (35) is A_1 = S and −A_1 p_0 ln p_0 = H. Then we have p_0 ≍ H/(S ln S), and

\sum_{i: p_i \le \frac{c}{2en}} |\mathrm{Bias}(\xi(\hat{p}_i))| \le S p_0 \ln \left( \frac{c}{np_0} \right) + 2e \cdot 2^{-c} (\ln c + 1)    (36)

\lesssim \frac{H}{\ln S} \ln \left( \frac{cS \ln S}{nH} \right) + 2^{-c} \ln c.    (37)

Now we set c = n^{ε/H} with

\epsilon \triangleq \frac{H}{\ln S} \ln \left( \frac{S \ln S}{nH} \right) > 0,    (38)

and we assume without loss of generality that c is an integer. We can easily check that

2^{-c} \ln c \lesssim \frac{H \ln c}{\ln n} = \frac{H}{\ln S} \ln \left( \frac{S \ln S}{nH} \right),    (39)

which leads to the desired result

\sum_{i: p_i \le \frac{c}{2en}} |\mathrm{Bias}(\xi(\hat{p}_i))| \lesssim \frac{H}{\ln S} \ln \left( \frac{S \ln S}{nH} \right).    (40)

As for the second part of the bias, it has been shown in [11] that |Bias(ξ(p̂_i))| ≤ (5 ln 2)/n holds for all i, hence

\sum_{i: p_i > \frac{c}{2en}} |\mathrm{Bias}(\xi(\hat{p}_i))| \le \frac{5 \ln 2}{n} \cdot \left| \left\{ i : p_i > \frac{c}{2en} \right\} \right| \equiv \frac{5 \ln 2}{n} \cdot A_2.    (41)

We use the bounded entropy property to bound A_2. Due to the concavity of −x ln x, 0 ≤ x ≤ 1, the minimum of Σ_{i: p_i > c/(2en)} −p_i ln p_i is attained when all but one of these p_i are at the boundary p_i = c/(2en), hence

H \ge \sum_{i=1}^{S} -p_i \ln p_i \ge \sum_{i: p_i > \frac{c}{2en}} -p_i \ln p_i \ge (A_2 - 1) \cdot \frac{c}{2en} \ln \left( \frac{2en}{c} \right).    (42)


As a result, we have A_2 ≲ nH/ln n, and

\sum_{i: p_i > \frac{c}{2en}} |\mathrm{Bias}(\xi(\hat{p}_i))| \lesssim \frac{H}{\ln n} \lesssim \frac{H}{\ln S} \ln \left( \frac{S \ln S}{nH} \right)    (43)

when S ln S ≥ enH (the last inequality can be shown by considering the two cases S ≥ n² and S < n² separately). Hence,

|\mathrm{Bias}(H(P_n))| \le \sum_{i: p_i \le \frac{c}{2en}} |\mathrm{Bias}(\xi(\hat{p}_i))| + \sum_{i: p_i > \frac{c}{2en}} |\mathrm{Bias}(\xi(\hat{p}_i))| \lesssim \frac{H}{\ln S} \ln \left( \frac{S \ln S}{nH} \right),    (44)

and the squared bias is the dominating term since

\left( \frac{H}{\ln S} \ln \left( \frac{S \ln S}{nH} \right) \right)^2 \gtrsim \frac{(\ln n)^2}{n},    (45)

where (ln n)²/n is an upper bound on the variance Var(H(P_n)) [11].

IV. PROOF OF LOWER BOUNDS IN THEOREM 1

We first derive a lower bound for the bias term. When S ln S ≤ enH, we recall the following result from [11].

Lemma 5. For p ≥ 15/n, p ∈ [0, 1], we have

-p \ln p - \mathbb{E}[-\hat{p} \ln \hat{p}] \ge \frac{1-p}{2n} + \frac{1}{20n^2 p} - \frac{p}{12n^2}.    (46)

If we choose P = (15/n, 15/n, ..., 15/n, 1 − 15K/n, 0, ..., 0), where

K = \min \left\{ \left\lfloor \frac{n}{15} \right\rfloor, \ S - 1, \ \max \left\{ N \in \mathbb{N} : -\frac{15N}{n} \ln \frac{15}{n} - \left( 1 - \frac{15N}{n} \right) \ln \left( 1 - \frac{15N}{n} \right) \le H \right\} \right\},    (47)

we have

|\mathrm{Bias}(H(P_n))| \ge K \cdot \left( \frac{n-15}{2n^2} + \frac{1}{300n} - \frac{5}{4n^3} \right) \gtrsim \frac{S}{n} \min \left\{ 1, \frac{H}{\ln n} \right\} \asymp \frac{S}{n},    (48)

where we have used the assumptions S ln S ≤ enH and H ≥ H_0 > 0. Hence, we have proved that

\sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P |H(P_n) - H(P)|^2 \gtrsim \frac{S^2}{n^2}.    (49)
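As a quick numerical sanity check (ours, not from the paper), the following Python sketch estimates the bias of the plug-in estimator on a distribution of the form used in (47)–(48) and compares it with the S/n scale; for simplicity we take K = S − 1, which assumes the entropy constraint is not binding for these parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

def plugin_entropy(counts):
    # Empirical (plug-in / MLE) entropy H(P_n).
    q = counts[counts > 0] / counts.sum()
    return float(-(q * np.log(q)).sum())

n, S = 10_000, 300
K = S - 1                        # K = S - 1 <= n/15 here; entropy bound assumed non-binding
p = np.full(S, 15.0 / n)
p[-1] = 1.0 - 15.0 * K / n
H_true = float(-(p * np.log(p)).sum())

est = [plugin_entropy(rng.multinomial(n, p)) for _ in range(2000)]
bias = float(np.mean(est) - H_true)
print(f"empirical bias ~ {bias:.4f} (negative), reference scale S/n = {S/n:.4f}")
```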

For the case where S ln S > enH, we establish a lemma first.

Lemma 6. For n p̂ ∼ B(n, p), we have

-p \ln p - \mathbb{E}[-\hat{p} \ln \hat{p}] \ge -p \ln(np).    (50)

Proof: Note that n p̂ is an integer; we have

\mathbb{E}[-\hat{p} \ln \hat{p}] \le \mathbb{E}[\hat{p} \ln n] = p \ln n,    (51)

so that −p ln p − E[−p̂ ln p̂] ≥ −p ln p − p ln n = −p ln(np).

Consider a distribution P = (A/(S−1), ..., A/(S−1), 1−A) ∈ M_S, where A ∈ (0, 1) is the solution to

-A \ln \left( \frac{A}{S-1} \right) - (1-A) \ln(1-A) = H;    (52)

then Lemma 6 tells us that

|\mathrm{Bias}(H(P_n))| \ge \sum_{i=1}^{S-1} -p_i \ln(np_i) = -A \ln \left( \frac{nA}{S-1} \right) \gtrsim \frac{H}{\ln S} \ln \left( \frac{S \ln S}{nH} \right),    (53)

where in the last step we have used the relationship A ≍ H/ln S from (52).

We now turn to the lower bound for the variance. We will actually prove a stronger result: a minimax lower bound over all estimators for the L2 risk, which is naturally also a lower bound for the maximum risk of the MLE. We use Le Cam's two-point method here. Suppose we observe a random vector Z taking values in a measurable space (𝒵, 𝒜) and having distribution P_θ, where θ ∈ Θ. Let θ_0 and θ_1 be two elements of Θ, and let T̂ = T̂(Z) be an arbitrary estimator of a function T(θ) based on Z. We have the following general minimax lower bound.


Lemma 7. [19, Sec. 2.4.2] Denoting the Kullback–Leibler divergence between P and Q by

D(P \| Q) \triangleq \begin{cases} \int \ln \frac{dP}{dQ} \, dP, & \text{if } P \ll Q, \\ +\infty, & \text{otherwise,} \end{cases}    (54)

we have

\inf_{\hat{T}} \sup_{\theta \in \Theta} P_\theta \left( |\hat{T} - T(\theta)| \ge \frac{|T(\theta_1) - T(\theta_0)|}{2} \right) \ge \frac{1}{4} \exp \left( -D(P_{\theta_1} \| P_{\theta_0}) \right).    (55)

Applying this lemma to the Poissonized model n p̂_i ∼ Poi(np_i), 1 ≤ i ≤ S, we know that for θ_1 = (p_1, p_2, ..., p_S) and θ_0 = (q_1, q_2, ..., q_S),

D(P_{\theta_1} \| P_{\theta_0}) = \sum_{i=1}^{S} D(\mathrm{Poi}(np_i) \| \mathrm{Poi}(nq_i)) = \sum_{i=1}^{S} \sum_{k=0}^{\infty} P(\mathrm{Poi}(np_i) = k) \cdot k \ln \frac{p_i}{q_i} = \sum_{i=1}^{S} np_i \ln \frac{p_i}{q_i} = n D(\theta_1 \| \theta_0);    (56)

then Markov's inequality yields

\inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P \left( \hat{H} - H(P) \right)^2 \ge \frac{|H(\theta_1) - H(\theta_0)|^2}{4} \cdot \inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} P \left( |\hat{H} - H(P)| \ge \frac{|H(\theta_1) - H(\theta_0)|}{2} \right)    (57)

\ge \frac{|H(\theta_1) - H(\theta_0)|^2}{16} \exp \left( -n D(\theta_1 \| \theta_0) \right).    (58)

Fix ε ∈ (0, 1) to be specified later, and let

\theta_1 = \left( \frac{A}{S-1}, \cdots, \frac{A}{S-1}, 1 - A \right), \quad \theta_0 = \left( \frac{A(1-\epsilon)}{S-1}, \cdots, \frac{A(1-\epsilon)}{S-1}, 1 - A + A\epsilon \right),    (59)

where A is the solution to (52). Direct computation yields

D(\theta_1 \| \theta_0) = A \ln \frac{1}{1-\epsilon} + (1-A) \ln \frac{1-A}{1-A+A\epsilon} \equiv h(\epsilon);    (60)

we have h(0) = h'(0) = 0 and |h''(0)| = A/(1−A) > 0. Hence, for ε small enough we have D(θ_1‖θ_0) ≲ ε²A/(1−A). By choosing ε = (nA)^{−1/2} ≪ 1, we have

|H(\theta_1) - H(\theta_0)| = \left| -A \ln \frac{A}{S-1} + A(1-\epsilon) \ln \frac{A(1-\epsilon)}{S-1} - (1-A) \ln(1-A) + (1-A+A\epsilon) \ln(1-A+A\epsilon) \right|    (61)

\asymp \epsilon A \ln \frac{S-1}{A}.    (62)

Hence, by Lemma 7 we know that

\inf_{\hat{H}} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P |\hat{H} - H(P)|^2 \gtrsim \epsilon^2 \left( A \ln \frac{S-1}{A} \right)^2 \asymp \frac{H}{n \ln S} \ln^2 \left( \frac{S \ln S}{H} \right) \asymp \frac{H \ln S}{n},    (63)

and the lower bound in the Multinomial model follows from Lemma 1.

V. PROOF OF UPPER BOUNDS IN THEOREM 2

Define

\xi \triangleq \xi(X, Y) = L_H(X) \mathbb{1}(Y \le 2\Delta) + U_H(X) \mathbb{1}(Y > 2\Delta),    (64)

where nX \stackrel{D}{=} nY \sim \mathrm{Poi}(np), and X, Y are independent. We first recall the following lemma from [11].

Lemma 8. Suppose 0 < c_1 = 16(1 + δ), 0 < 8c_2 ln 2 = ε < 1, δ > 0. Then the bias and variance of ξ(X, Y) satisfy

|\mathrm{Bias}(\xi)| \lesssim \frac{1}{n \ln n},    (65)

\mathrm{Var}(\xi) \lesssim \frac{(\ln n)^4}{n^{2-\epsilon}} + \frac{p (\ln p)^2}{n}.    (66)


In light of Lemma 10, we have

|\mathrm{Bias}(\hat{H})| \le \sum_{i=1}^{S} |\mathrm{Bias}(\xi(\hat{p}_{i,1}, \hat{p}_{i,2}))| \lesssim S \cdot \frac{1}{n \ln n} = \frac{S}{n \ln n},    (67)

\mathrm{Var}(\hat{H}) = \sum_{i=1}^{S} \mathrm{Var}(\xi(\hat{p}_{i,1}, \hat{p}_{i,2})) \le \sum_{i=1}^{S} \left( \frac{(\ln n)^4}{n^{2-\epsilon}} + \frac{p_i (\ln p_i)^2}{n} \right) \lesssim \frac{H \ln S}{n} + \frac{S (\ln n)^4}{n^{2-\epsilon}},    (68)

where we have used Lemma 3 in the last step. Hence,

\mathbb{E}_P \left( \hat{H} - H(P) \right)^2 = |\mathrm{Bias}(\hat{H})|^2 + \mathrm{Var}(\hat{H}) \lesssim \frac{S^2}{(n \ln n)^2} + \frac{S (\ln n)^4}{n^{2-\epsilon}} + \frac{H \ln S}{n}.    (69)

When S ln S ≤ enH ln n, for ε small enough, say ε < 1/2, we have

\frac{S (\ln n)^4}{n^{2-\epsilon}} \lesssim \sqrt{ \frac{S^2}{(n \ln n)^2} \cdot \frac{H \ln S}{n} } \le \frac{S^2}{(n \ln n)^2} + \frac{H \ln S}{n},    (70)

where we have used the assumption that H ≥ H_0 > 0. Hence, the term S(ln n)^4/n^{2−ε} is negligible compared with the others, and we have completed the proof for the case S ln S ≤ enH ln n. For the case where S ln S ≥ enH ln n, we need stronger results for the bias and variance in the regime where p < 1/(en ln n). The results are summarized in the following lemma.

Lemma 9. If 0 < c2 ≤ 1 ≤ c1 , for nX ∼ Poi(np), 0 < p
0. Then for 0 < p


1 en ln n ,

i:pi < en ln n

similar arguments in the MLE analysis yield   1 i : pi ≥  nH ln n  nH en ln n ln(n ln n)

(77)

hence X i:pi ≥ en 1ln n

1 |Bias(ξ)|  n ln n

    1  H  H ln S ln S · i : pi ≥ en ln n ln n ln S nH ln n

when S ln S ≥ enH ln n. Summing up the bias yields X ˆ ≤ |Bias(H)| |Bias(ξ(ˆ pi,1 , pˆi,2 ))| + i:pi ≥ en 1ln n

X i:pi < en 1ln n

|Bias(ξ(ˆ pi,1 , pˆi,2 ))| 

H ln ln S



S ln S nH ln n

(78)

 .

(79)

11

For the total variance, we have X

ˆ = Var(H)

Var(ξ(ˆ pi,1 , pˆi,2 )) +

X

Var(ξ(ˆ pi,1 , pˆi,2 ))

(80)

i:pi < en 1ln n

i:pi ≥ en 1ln n

 X p(ln p)2 (ln n)4 (ln n)4 p  + + n2− n n1− i:pi ≥ en 1ln n i:pi < en 1ln n   (ln n)5 (ln n)2 (ln n)4  + + n1− n n1−   2 H S ln S  ln ln S nH ln n 

X

(81)

(82) (83)

where in the last step we have used the assumption H ≥ H0 > 0 again. Combining the total bias and variance constitutes a complete proof of the upper bounds in Theorem 2. VI. P ROOF OF L OWER B OUNDS IN T HEOREM 2 2

S When S ln S ≤ enH ln n, the lower bound for the squared bias, i.e., the (n ln n)2 term, can be obtained using a similar N argument in [15]. Specifically, we can assign two product measures µ0 and µN 1 to the first N (≤ S) components in the distribution vector P , where   1 a2 ln n supp(µi ) = {0} ∪ , , i = 0, 1 (84) a1 n ln n n

for some constants a1 , a2 > 0, and Z

1

tµi (dt) = 0

1 , a1 n ln n

i = 0, 1.

(85)

In particular, Z

1

Z −t ln tµ1 (dt) −

0

1

−t ln tµ0 (dt)  0

1 n ln n

(86)

and 2  Z  ˆ − H(P )  N inf sup EP H ˆ P ∈MS H

0

1

Z −t ln tµ1 (dt) −

2

1

−t ln tµ0 (dt) 0



N2 . (n ln n)2

In [15], N = S. However, in our case, we have an additional constraint that H(P ) ≤ H. Since Z 1 Z 1 1 a1 ln(a1 n ln n)  Eµi [−p ln p] = −t ln tµi (dt) ≤ ln(a1 n ln n) tµi (dt) = n ln n n 0 0

(87)

(88)

we have EµN H(P ) = N Eµi [−p ln p]  i

N . n

(89)

One can show that the measures µN i , i = 0, 1 are highly concentrated around their expectations [15]. Hence, in order to ensure H(P ) ≤ H with overwhelming probability, we can set N  min{nH, S}, and the condition S ln S ≤ enH ln n and H ≥ H0 > 0 yield that N  S. Hence,  2 N2 S2 ˆ − H(P )  inf sup EP H  . (90) ˆ P ∈MS (H) (n ln n)2 (n ln n)2 H The variance bound H nln S has been given in (63), and so far we have completed the proof of the first part. As for the second part, the key lemma we will employ is the so-called method of two fuzzy hypotheses presented in Tsybakov [19]. Below we briefly review this general minimax lower bound. Suppose we observe a random vector Z ∈ (Z, A) which has distribution Pθ where θ ∈ Θ. Let σ0 and σ1 be two prior distributions supported on Θ. Write Fi for the marginal distribution of Z when the prior is σi for i = 0, 1. For any function g we shall write EFi g(Z) for the expectation of g(Z) with respect to the marginal distribution of Z when the prior on θ is σi . We shall write Eθ g(Z) for the expectation of g(Z) under Pθ . Let Tˆ = Tˆ(Z) be an arbitrary estimator of a function T (θ) based on Z. We have the following general minimax lower bound.

12

Lemma 11. [19, Thm. 2.15] Given the setting above, suppose there exist ζ ∈ R, s > 0, 0 ≤ β0 , β1 < 1 such that

If V (F1 , F0 ) ≤ η < 1, then

σ0 (θ : T (θ) ≤ ζ − s) ≥ 1 − β0

(91)

σ1 (θ : T (θ) ≥ ζ + s) ≥ 1 − β1 .

(92)

  1−η−β −β 0 1 inf sup Pθ |Tˆ − T (θ)| ≥ s ≥ , 2 Tˆ θ∈Θ

(93)

where Fi , i = 0, 1 are the marginal distributions of Z when the priors are σi , i = 0, 1, respectively. Here V (P, Q) is the total variation distance between two probability measures P, Q on the measurable space (Z, A). Concretely, we have Z 1 |p − q|dν, (94) V (P, Q) , sup |P (A) − Q(A)| = 2 A∈A dQ where p = dP dν , q = dν , and ν 3is a dominating measure so that P  ν, Q  ν. First we assume that S  n 2 . In light of Lemma 11, we construct two measures as follows.

Lemma 12. For any 0 < η < 1 and positive integer L > 0, there exist two probability measures ν0 and ν1 on [η, 1] such that R R 1) R tl ν1 (dt) = tl νR0 (dt), for all l = 0, 1, 2, · · · , L; 2) − ln tν1 (dt) − − ln tν0 (dt) = 2EL [− ln x][η,1] , where EL [− ln x][η,1] is the distance in the uniform norm on [η, 1] from the function f (x) = − ln x to the space spanned by {1, x, · · · , xL }. Based on Lemma 12, two new measures ν˜0 , ν˜1 can be constructed as follows: for i = 0, 1, the restriction of ν˜i on [η, 1] is absolutely continuous with respect to νi , with the Radon-Nikodym derivative given by d˜ νi η (t) = , dνi t

t ∈ [η, 1],

(95)

and ν˜i ({0}) = 1 − ν˜i ([η, 1]) ≥ 0. Hence, ν˜0 , ν˜1 are both probability measures on [0, 1], with the following properties R R 1) R t1 ν˜1 (dt) = R t1 ν˜0 (dt) = η; 2) R tl ν˜1 (dt) = tl ν˜R0 (dt), for all l = 2, · · · , L + 1; 3) −t ln t˜ ν1 (dt) − −t ln t˜ ν0 (dt) = 2ηEL [− ln x][η,1] . The construction of measures ν˜0 , ν˜1 are inspired by Wu and Yang [15]. The following lemma characterizes the properties of EL [− ln x][η,1] . Lemma 13. If K ≥ eL2 , there exists a universal constant D0 ≥ 1 such that   K . EL [− ln x][(D0 K)−1 ,1]  ln L2

(96)

Define L = d2 ln n,

η=

nH , ln S ln n

d22 D0 S

M=

H d1 d1 d22 D0 ln n · = , ln S Sη n

(97)

with universal positive constants d1 ∈ (0, e−1 ], d2 > 2 to be determined later. Without loss of generality we assume that d2 ln n is always a positive integer. Due to S ln S ≥ enH ln n, we have (D0 η)−1 ≥ eL2 , thus Lemma 13 yields     S ln S 1 EL [− ln x][η,1]  ln  ln . (98) D0 ηL2 Hn ln n Let g(x) = M x and let µi be the measures on [0, M ] defined by µi (A) = ν˜i (g −1 (A)) for i = 0, 1. It then follows that R R 1) R t1 µ1 (dt) = R t1 µ0 (dt) = d1 H/(S ln S); 2) R tl µ1 (dt) = tl µR0 (dt), for all l = 2, · · · , L + 1; 3) −t ln tµ1 (dt) − −t ln tµ0 (dt) = 2ηM EL [− ln x][η,1] . Let µS−1 and µS−1 be product priors which we assign to the length-(S − 1) vector (p1 , p2 , · · · , pS−1 ), and we set pS = 0 1 d1 (1 − H/ ln S). With a little abuse of notation, we still denote the overall product measure by µS0 and µS1 . Note that P may not be a probability distribution, we consider the set of approximate probability vectors S ( ) X MS (, H) , P : pi − d1 ≤ , H(P ) ≤ H, pi ≥ 0(1 ≤ i ≤ S) , (99) i=1

13

with parameter  > 0 to be specified later, and further define under the Poissonized model, RP (S, n, H, ) , inf

sup

Fˆ P ∈MS (,H)

ˆ − H(P )|2 . EP |H

(100)

Lemma 14. For any S, n ∈ N and 0 <  < d1 , we have   2n 1 n 2 sup ln2 (ex). R(S, n, H) ≥ 2 RP S, , ln(d1 − ) (H − ln(d1 + )) ,  − (ln S)2 exp(− ) − 2 · 2d1 d1 4 d1 x∈[d1 −,d1 +] In light of Lemma 14, it suffices to consider RP (S, n, H, ) to give a lower bound of R(S, n, H). Denote   2d1 H H S ln S · EL [− ln x][η,1]  · ln , χ , EµS1 H(P ) − EµS0 H(P ) = 2ηM EL [− ln x][η,1] · S = ln S ln S nH ln n

(101)

(102)

and \n χo P : |H(P ) − EµSi H(P )| ≤ , 4 Denote by πi the conditional distribution defined as Ei , MS (, H)

πi (A) =

µSi (Ei ∩ A) , µSi (Ei )

i = 0, 1.

i = 0, 1.

(103)

(104)

Now consider π0 , π1 as two priors. By setting χ χ 1 , s= , = , 2 4 ln n we have β0 = β1 = 0 in Lemma 11. Applying union bound yields that   X S h χi pj − d1 >  + µSi |H(P ) − EµSi H(P )| > µSi [(Ei )c ] ≤ µSi  + µSi [H(P ) > H] 4 j=1 ζ = EµS0 H(P ) +

and the Chebychev inequality tells us that   S S X 1 X SM 2 S(ln n)4 (ln n)4 pj − d1 >  ≤ 2 µSi  VarµSi (pj ) ≤ 2   →0 1  j=1  n2 n2 j=1 S h χ i 16 X 16S(M ln M )2 S(ln S)2 (ln n)4 (ln n)6 →0 µSi |H(P ) − EµSi H(P )| > ≤ 2 VarµSi (−pj ln pj ) ≤   1 4 χ j=1 χ2 n2 n2

(105)

(106)

(107)

(108)

3

where we have used our assumption that S  n 2 . For bounding µSi [H(P ) > H], we first remark that for d1 ≤ e−1 , Z EµSi H(P ) ≤ −d1 ln d1 + (S − 1) −t ln tµi (dt) (109) Z ≤ −d1 ln d1 − S ln(ηM ) tµi (dt) (110)   d1 H S ln S = −d1 ln d1 + ln (111) ln S d1 H   d1 H d1 H = −d1 ln d1 + d1 H − ln (112) ln S ln S ≤ d1 H − 2d1 ln d1 (113)  hence, for d1 sufficiently small, say, d1 ≤ min{ 41 , f −1 min{ H80 , 1e } }, where f (x) = −x ln x is defined in [0, e−1 ] and f −1 (·) denotes the inverse function of f (·), we have   H H0 1 H0 + H H EµSi H(P ) ≤ d1 H − 2d1 ln d1 ≤ + 2 · min , ≤ ≤ . (114) 4 8 e 4 2 Hence, similar to (108), we have   H S(M ln M )2 S(ln n)4 (ln n)4 µSi [H(P ) > H] ≤ µSi |H(P ) − EµSi H(P )| > ≤   → 0. 1 2 2 2 (H/2) n n2

(115)

Denote by Fi , Gi the marginal probability under prior πi and µSi , respectively, for all i = 0, 1. In light of (106), (107), (108)

14

and (115), we have V (Fi , Gi ) ≤ µSi [(Ei )c ] → 0.

(116)

Moreover, by setting  d1 = min

    1 −1 H0 1 1 ,f min , , 2 , 4 8 e d2 D0

d2 = 10e

(117)

it was shown in [11, Lem. 11] that 1 S  9 → 0. 6 n n2 Hence, the total variational distance is then upper bounded by V (G0 , G1 ) ≤

(118)

V (F0 , F1 ) ≤ V (F0 , G0 ) + V (G0 , G1 ) + V (G1 , F1 ) → 0

(119)

where we have used the triangle inequality of the total variation distance. The idea of converting approximate priors µSi into priors πi via conditioning comes from Wu and Yang [15]. Now it follows from Lemma 11 and Markov’s inequality that   2   ˆ − H(P )| ≥ s  χ2  H ln S ln S RP (S, n, H, ) ≥ s2 inf sup P |H (120) ˆ P ∈MS (,H) ln S nH ln n H 3

and the desired result follows directly from Lemma 14. Hence we have obtained the desired lower bound in the case S  n 2 . 3 For general S  n 2 , the non-decreasing property of R(n, S, H) with respect to S shows that  2  3 S ln S H 2  H2 (121) R(n, S, H) ≥ R(n, n , H)  ln ln S nH ln n 3 S=n 2

which exactly equals to the desired lower bound



H ln S

ln

S ln S nH ln n

2

.

VII. F UTURE WORK This paper studies the adaptive estimation framework to strengthen the optimality properties of the approximation theoretic entropy estimator proposed in Jiao et al. [11]. We remark that the techniques in this P paper are by no means constrained to S entropy, and we believe analogous results are also true for the estimators of Fα (P ) = i=1 pα i in [11]. Furthermore, we find the fact that the sample size enlargement effect still holds in the adaptive estimation setting very intriguing, and we believe there is a larger picture surrounding this theme to be explored. VIII. ACKNOWLEDGMENTS The authors would like to express their most sincere gratitude to Dany Leviatan for valuable advice on the literature of approximation theory, in particular, for suggesting the result in Lemma 16. A PPENDIX A AUXILIARY L EMMAS The following lemma characterizes the performance of the best uniform approximation polynomial for −x ln x, x ∈ [0, 1]. PK Lemma 15. Denote by k=0 gK,k xk the K-th order best uniform approximation polynomial for −x ln x, x ∈ [0, 1], then for PK pK (x) = k=1 gK,k xk , we have the norm bound sup |pK (x) − (−x ln x)| ≤ x∈[0,1]

Dn K2

(122)

where Dn > 0 is a universal constant for the norm bound. In fact, the following inequality holds: lim sup K 2 · sup |pK (x) − (−x ln x)| ≤ ν1 (2) ≈ 0.453, K→∞

(123)

x∈[0,1]

where the function ν1 (p) is was introduced by Ibragimov [20] as the following limit for p positive even integer and m positive integer np lim En [|x|p lnm |x|][−1,1] = ν1 (p). (124) n→∞ (ln n)m−1

15

Furthermore, we also have the pointwise bound: there exists a universal constant Dp > 0 such that for any C ≥ 1,   C (125) |pK (x) − 2 ln K · x| ≤ Dp Cx, ∀x ∈ 0, 2 . K Lemma 16. [21, Thm. 8.4.8] There exists some universal constant M > 0 such that for any order-n polynomial p(x) in [0, 1], we have sup |p(x)| ≤ M ·

|p(x)|.

sup

(126)

x∈[n−2 ,1−n−2 ]

x∈[0,1]

The following lemma gives some tails bounds for Poisson and Binomial random variables. Lemma 17. [22, Exercise 4.7] If X ∼ Poi(λ), or X ∼ B(n, nλ ), then for any δ > 0, we have λ  eδ P(X ≥ (1 + δ)λ) ≤ (1 + δ)1+δ  λ 2 e−δ P(X ≤ (1 − δ)λ) ≤ ≤ e−δ λ/2 . (1 − δ)1−δ

(127) (128)

A PPENDIX B P ROOF OF L EMMAS A. Proof of Lemma 3 PS Denote H(P ) = i=1 −pi ln pi by H, we construct the lagrangian: L=

S X

2

pi (ln pi ) + λ

i=1

S X

! −pi ln pi − H



i=1

S X

! pi − 1 .

(129)

i=1

By taking the derivative with respect to pi , we obtain that ∂L = (ln pi )2 + 2 ln pi − λ(1 + ln pi ) + µ ∂pi

(130)

∂L = 0 has at most two solutions. is a quadratic form of ln pi , so the equation ∂p i Hence, we conclude that components of the maximum achieving distribution can only take two values pi ∈ {q1 , q2 }, and suppose q1 appears m times. Then our objective function becomes ! S ! S S X X X pi (ln pi )2 = pi (ln pi )2 pi (131) i=1

i=1

=

S X

i=1

!2 −pi ln pi

+

i=1 2

X

pi pj (ln pi − ln pj )2

(132)

1≤i 1 to be specified later, then given N = (10D)2 M ≥ (10DL)2 , the non-increasing property of En [fN ][−1,1] with respect to n yields EL [fN ][−1,1] ≥

DL X 1 En [fN ][−1,1] DL − L

(160)

n=L+1

1 ≥ DL

DL X

En [fN ][−1,1] − E0 [fN ][−1,1] −

n=0

L X

! En [fN ][−1,1]

(161)

n=1

L ln N M1 X − τ1 (fN , ∆n ) DL DL n=1     L N M1 X 2N ln N ≥ M2 ln − ln − 2(DL)2 DL DL n=1 n2    Z L  N ln N M1 2N − − dx ≥ M2 ln ln 2(DL)2 DL DL 1 x2     50K 200e2 D2 K ln K + 2 ln(10D) M1 ≥ M2 ln − ln − . L2 DL D L2

≥ M2 τ1 (fN , ∆DL ) −

(162)

(163) (164) (165)

Hence, there exists a sufficiently large constant D > 0 such that  EL [− ln x][(100D2 K)−1 ,1] = EL [fN ][−1,1]  ln

K L2

 (166)

and this lemma is proved by setting D0 = max{100D2 , 1}. E. Proof of Lemma 14 ˆ ˆ Fix δ > 0. Let H(Z) be a near-minimax estimator of H(P ) under the Multinomial model. The estimator H(Z) obtains the number of samples n from observation Z. By definition, we have sup P ∈MS (H)

ˆ EMultinomial |H(Z) − H(P )|2 < R(S, n, H) + δ,

(167)

18

where R(S, n, H) is the minimax L2 risk under the Multinomial model. Note that for any vector P ∈ MS (, H) (P is not necessarily a probability distribution), we have ! ! S X H(P ) H(P ) P = PS + ln pi ≤ + ln(d1 + ) (168) H PS d 1− i=1 pi i=1 pi i=1 P S where by definition we have i=1 pi − d1 ≤ . Hence, given P ∈ MS (, H), let Z = [Z1 , · · · , ZS ]T with Zi ∼ Poi(npi )   PS PS ˆ and let n0 = i=1 Zi ∼ Poi(n i=1 pi ), (168) suggests to use the estimator d1 H(Z) − ln d1 to estimate H(P ). Note that !! ! ! S S   X X P ˆ ˆ + pi ln pi − d1 ln d1 (169) d1 H(Z) − ln d1 − H(P ) = d1 H(Z) − H PS i=1 pi i=1 i=1 the triangle inequality gives (define A = supx∈[d1 −,d1 +] ln2 (ex)) 2 ! ! 2 S ! S X 2   X P 1 2 ˆ ˆ pi ln pi − d1 ln d1 (170) − H PS − ln d1 − H(P ) ≤ d1 EP H(Z) EP d1 H(Z) + 2 i=1 pi i=1 i=1   ! 2 ∞ X P ˆ EP  H(Z) − H PS ≤ d21 (171) n0 = m P(n0 = m) + 2 A m=0 i=1 pi   ∞ X H ≤ d21 R S, m, + ln(d1 + ) P(n0 = m) + δ + 2 A (172) d1 −  m=0   d1 n H d1 n d1 n 2 ≤ d1 R S, , + ln(d1 + ) P(n0 ≥ ) + (d1 ln S)2 P(n0 ≤ ) + δ + 2 A 2 d1 −  2 2 (173)   d1 n H d1 n ≤ d21 R S, , + ln(d1 + ) + (d1 ln S)2 exp(− ) + δ + 2 A, (174) 2 d1 −  8 2 where we have used the fact that conditioned on n0 = m, Z ∼ Multinomial(m, PP pi ), and R(S, n, H) ≤ supP ∈MS H(P ) = i (ln S)2 . Moreover, the last step follows from Lemma 17. The proof is completed by the arbitrariness of δ and Lemma 1. F. Proof of Lemma 15 It has been shown in [11, Lemma 18] that K X ν (2) 1 k lim K · sup gK,k x − (−x ln x) = , K→∞ 2 x∈[0,1] 2

(175)

k=0

then plugging in x = 0 yields lim sup K 2 · |gK,0 | ≤ K→∞

ν1 (2) . 2

Hence, it follows from the triangle inequality that K X ν (2) ν (2) 1 1 + = ν1 (2), lim sup K 2 · sup gK,k xk − (−x ln x) ≤ 2 2 K→∞ x∈[0,1]

(176)

(177)

k=1

which completes the proof of the norm bound. For the pointwise bound, [21, Thm. 7.3.1] asserts that there exists a universal positive constant M1 such that sup |(ϕ(x))2 p00K (x)| ≤ M1 K 2 ωϕ2 (−x ln x, K −1 ),

(178)

x∈[0,1]

where ϕ(x) =

p

x(1 − x), and ωϕ2 (f, t) is the second-order Ditzian-Totik modulus of smoothness [21] defined by      u + v u+v ωϕ2 (f, t) , sup f (u) + f (v) − 2f : u, v ∈ [0, 1], |u − v| ≤ 2tϕ . 2 2

(179)

19

Direct computation yields ωϕ2 (−x ln x, t) =

2t2 ln 2 , 1 + t2

(180)

we have sup |x(1 − x)p00K (x)| ≤ 2M1 ln 2.

(181)

x∈[0,1]

According to Lemma 16, since p00K (x) is a polynomial with order K − 2 < 2K, there exists some positive constant M2 such that sup |p00K (x)| ≤ M2 x∈[0,1]

≤ ≤ ≤

sup |p00K (x)| x∈[(2K)−2 ,1−(2K)−2 ] M2 (2K)4 sup |x(1 (2K)2 − 1 x∈[(2K)−2 ,1−(2K)−2 ] 16M2 K 4 sup |x(1 − x)p00K (x)| 4K 2 − 1 x∈[0,1] 32M1 M2 K 4 ln 2 4K 2 − 1

(182) − x)p00K (x)|

≤ 16M1 M2 K 2 ln 2,

(183) (184) (185) (186)

2

hence for any x, y ∈ [0, C/K ], we have Z max{x,y} 0 0 |pK (x) − pK (y)| ≤ |p00K (t)|dt ≤ 16M1 M2 ln 2 · K 2 |x − y| ≤ 16M1 M2 C ln 2.

(187)

min{x,y}

As a result, we know that for any C ≥ 1 and x ∈ [0, C/K 2 ], Z C K 2 K2 0 16M1 M2 C ln 2 ≥ |pK (x) − p0K (t)| dt C 0 Z C K 2 K2 0 0 ≥ pK (x) − pK (t) dt C 0   2 K C = p0K (x) − pK C K2       0 K2 C K 2 C C C C pK ≥ pK (x) − − 2 ln 2 − − − 2 ln 2 C K K C K2 K K 0 2 ≥ |pK (x) − 2 ln K| − ln C − K sup |pK (t) − (−t ln t)|

(188) (189) (190) (191) (192)

t∈[0,1]

≥ |p0K (x) − 2 ln K| − ln C − Dn ,

(193)

where Dn is the coefficient of the norm bound in (122). Hence, the universal positive constant Dp , 16M1 M2 ln 2 + 1 + Dn satisfies   C 0 |pK (x) − 2 ln K| ≤ Dp C, ∀x ∈ 0, 2 , (194) K and it follows that Z |pK (x) − 2 ln K · x| ≤

x

|p0K (t)

Z − 2 ln K| dt ≤

0

x

Dp Cdt = Dp Cx.

(195)

0

REFERENCES
[1] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.
[2] A. W. Van der Vaart, Asymptotic Statistics. Cambridge University Press, 2000, vol. 3.
[3] J. Hájek, "A characterization of limiting distributions of regular estimates," Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 14, no. 4, pp. 323–330, 1970.
[4] ——, "Local asymptotic minimax and admissibility in estimation," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1972, pp. 175–194.
[5] A. J. Wyner and D. Foster, "On the lower limits of entropy estimation," IEEE Transactions on Information Theory, submitted for publication, 2003.
[6] B. Efron and R. Thisted, "Estimating the number of unseen species: How many words did Shakespeare know?" Biometrika, vol. 63, no. 3, pp. 435–447, 1976. [Online]. Available: http://www.jstor.org/stable/2335721


[7] F. Benevenuto, T. Rodrigues, M. Cha, and V. Almeida, "Characterizing user behavior in online social networks," in Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement. ACM, 2009, pp. 49–62.
[8] M. J. Berry, D. K. Warland, and M. Meister, "The structure and precision of retinal spike trains," Proceedings of the National Academy of Sciences, vol. 94, no. 10, pp. 5411–5416, 1997.
[9] Z. F. Mainen and T. J. Sejnowski, "Reliability of spike timing in neocortical neurons," Science, vol. 268, no. 5216, pp. 1503–1506, 1995.
[10] R. R. d. R. van Steveninck, G. D. Lewen, S. P. Strong, R. Koberle, and W. Bialek, "Reproducibility and variability in neural spike trains," Science, vol. 275, no. 5307, pp. 1805–1808, 1997.
[11] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Minimax estimation of functionals of discrete distributions," submitted to IEEE Trans. Inf. Theory, 2014.
[12] G. Valiant and P. Valiant, "Estimating the unseen: an n/log n-sample estimator for entropy and support size, shown optimal via new CLTs," in Proceedings of the 43rd Annual ACM Symposium on Theory of Computing. ACM, 2011, pp. 685–694.
[13] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[14] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Non-asymptotic theory for the plug-in rule in functional estimation," arXiv preprint arXiv:1406.6959, 2014.
[15] Y. Wu and P. Yang, "Minimax rates of entropy estimation on large alphabets via best polynomial approximation," available on arXiv, 2014.
[16] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Beyond maximum likelihood: from theory to practice," arXiv preprint arXiv:1409.7458, 2014.
[17] T. T. Cai et al., "Minimax and adaptive inference in nonparametric function estimation," Statistical Science, vol. 27, no. 1, pp. 31–50, 2012.
[18] A. B. Tsybakov, "Aggregation and high-dimensional statistics," Lecture notes for the course given at the École d'été de Probabilités in Saint-Flour, URL: http://www.crest.fr/ckfinder/userfiles/files/Pageperso/ATsybakov/Lecture notes SFlour.pdf, vol. 16, p. 20, 2013.
[19] A. Tsybakov, Introduction to Nonparametric Estimation. Springer-Verlag, 2008.
[20] I. I. Ibragimov, "Sur la valeur asymptotique de la meilleure approximation d'une fonction ayant un point singulier réel," Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, vol. 10, no. 5, pp. 429–460, 1946.
[21] Z. Ditzian and V. Totik, Moduli of Smoothness. Springer, 1987.
[22] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[23] P. P. Petrushev and V. A. Popov, Rational Approximation of Real Functions. Cambridge University Press, 2011.
