Asymptotically Optimal Function Estimation by Minimum Complexity Criteria
Andrew Barron¹, Yuhong Yang¹, and Bin Yu²
¹Yale University, Dept. Statistics, New Haven, CT
²University of California, Dept. Statistics, Berkeley, CA
Abstract: The minimum description length principle applied to function estimation can yield a criterion of the form $-\log(\text{likelihood}) + \text{const} \cdot m$ instead of the familiar $-\log(\text{likelihood}) + (m/2)\log n$, where $m$ is the number of parameters and $n$ is the sample size. The improved criterion yields minimax optimal rates for redundancy and statistical risk. The analysis suggests an information-theoretic reconciliation of criteria proposed by Rissanen, Schwarz, and Akaike.
I. SUMMARY

In the minimum description length principle formulated in ([3],[2]), the density estimate $\hat{p}_n(x)$ minimizes the total length of two-stage descriptions, $\hat{p}_n = \arg\min_{q \in Q} \{ L(q) + \log 1/\prod_{i=1}^n q(X_i) \}$, where $L(q)$ is the length of a uniquely decodable code for $q \in Q$ satisfying $\sum_{q \in Q} 2^{-L(q)} \le 1$, and $Q$ is a given collection of probability densities. If $X_1, \ldots, X_n$ are independently distributed with density $p(x) = p(x|\theta^*)$ in a smooth parametric family $\{p(x|\theta) : \theta \in R^m\}$, then an asymptotically optimal two-stage code assigns a penalty term of order $(m/2)\log n$, corresponding to $(1/2)\log n$ bits for each coordinate of the parameter vector.

Here we consider densities in nonparametric classes. Sequences of families $p_m(x|\theta)$, $\theta \in R^m$, $m = 1, 2, \ldots$ provide approximations to the unknown density. A description length criterion selects a sequence of appropriate model sizes $\hat{m}_n$. Cases are formulated for which the best description length criteria have the form $\log 1/\text{likelihood} + \text{const} \cdot m$, where the likelihood maximization is constrained to suitable quantizations of the parameter spaces. The description length for the constrained parameters is of order $m$ instead of $(m/2)\log n$.

We give several results in this direction. One involves the following setting. Let $W_{r,s}$ be the set of densities on $[0,1]$ such that $f(x) = \log p(x)$ has a representation $\sum_k \theta_k \phi_k(x)$ in terms of an orthonormal basis for which the coefficients lie in an ellipse of the form $\sum_k k^{2s}\theta_k^2 \le r^2$. If $\int (d^s f(x)/dx^s)^2 \, dx$ is bounded, the ellipse constraint holds using either polynomial or trigonometric bases. The best choices of $r$, $s$, and $m$ are unknown, and estimates are provided by the minimum description length criterion.
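As an illustration of the two-stage criterion defined at the start of this section, the following minimal sketch selects, from a small finite collection of densities on (0, 1], the one minimizing the code length for the density plus log 1/likelihood. Everything specific in it is an assumption made for the example and is not taken from the paper: the power-law family `a * x**(a - 1)`, the eight-point set of shapes, the uniform 3-bit code, and the deterministic quantile sample. It does not implement the constrained exponential-family construction discussed in the text.

```python
import math


def two_stage_length(code_len_bits, density, xs):
    """Two-stage description length in bits: L(q) + log2 1/prod_i q(x_i)."""
    return code_len_bits - sum(math.log2(density(x)) for x in xs)


def mdl_select(candidates, xs):
    """Return (label, total_bits) of the candidate with the shortest description.

    `candidates` maps a label to a pair (code_len_bits, density).
    """
    # The code lengths must satisfy the Kraft inequality sum 2^{-L(q)} <= 1.
    kraft = sum(2.0 ** -bits for bits, _ in candidates.values())
    if kraft > 1.0 + 1e-12:
        raise ValueError("code lengths violate the Kraft inequality")
    totals = {label: two_stage_length(bits, q, xs)
              for label, (bits, q) in candidates.items()}
    best = min(totals, key=totals.get)
    return best, totals[best]


def power_density(a):
    """Density a * x**(a - 1) on (0, 1]; a hypothetical stand-in for q in Q."""
    return lambda x: a * x ** (a - 1.0)


# Hypothetical candidate set Q: eight shapes, each described with log2(8) = 3 bits.
shapes = [0.5 * k for k in range(1, 9)]
candidates = {a: (math.log2(len(shapes)), power_density(a)) for a in shapes}

# Deterministic pseudo-sample: quantiles of the a = 2 density (inverse CDF is sqrt(u)).
n = 100
xs = [((i - 0.5) / n) ** 0.5 for i in range(1, n + 1)]

best_shape, best_bits = mdl_select(candidates, xs)
```

Because the candidates are densities rather than probabilities, the data term is a code length only up to a discretization constant shared by all candidates, so the total can be negative without affecting the minimizer. With a uniform code the selection reduces to maximum likelihood over the candidate set; a non-uniform `code_len_bits` is where a penalty such as const · m would enter.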
Let $p_m(x|\theta) = \exp\{\sum_{k=1}^m \theta_k \phi_k(x)\}/c_{\theta,m}$ be approximating families, let $\Theta_{m,r,s,\epsilon}$ be the intersection of the ellipse $\{\theta \in R^m : \sum_{k=1}^m k^{2s}\theta_k^2 \le r^2\}$ with a regular grid spaced at width $\epsilon$ in each coordinate, and let $Q$ be the union of the densities parameterized by $\theta \in \Theta_{m,r,s,\epsilon}$ for positive integers $m$, $r$, and $s$. The choice $\epsilon = (1/m)^{s+1/2}$ permits the error in the approximation of $f$ to be of the optimal order. With this choice the log-cardinality of grid points in the ellipse satisfies $\log \mathrm{Card}(\Theta_{m,r,s,\epsilon}) \le c_{r,s}\, m$, where $c_{r,s}$ is independent of $m$. Let $\hat{\theta}_{m,r,s}$ maximize the likelihood over grid points in the ellipse. Then $\hat{m}$, $\hat{r}$, $\hat{s}$ are