B Parameter estimation
B.1 Parameter estimation

B.1.1 Properties of estimators
Perhaps before we begin to discuss some of the desirable properties of estimators, we ought to define what an estimator is. For example, in a measurement experiment we may assume that the observations are normally distributed but with unknown mean, $\mu$, and variance, $\sigma^2$. The problem then is to estimate the values of these two parameters from the set of observations. Therefore, an estimator $\hat{\theta}$ of $\theta$ is defined as any function of the sample values which is calculated to be close in some sense to the true value of the unknown parameter $\theta$. This is a problem in point estimation, by which is meant deriving a single-valued function of a set of observations to represent the unknown parameter (or a function of the unknown parameters), without explicitly stating the precision of the estimate. The estimation of the confidence interval (the limits within which we expect the parameter to lie) is an exercise in interval estimation. For a detailed treatment of estimation we refer to Stuart and Ord (1991).

Unbiased estimate
The estimator $\hat{\theta}$ of the parameter $\theta$ is unbiased if the expectation over the sampling distribution is equal to $\theta$, i.e.
\[
E[\hat{\theta}] = \int \hat{\theta}\, p(x_1, \ldots, x_n)\, dx_1 \ldots dx_n = \theta
\]
where $\hat{\theta}$ is a function of the sample vectors $x_1, \ldots, x_n$ drawn from the distribution $p(x_1, \ldots, x_n)$. We might reasonably want estimators to be approximately unbiased, but there is no reason why we should insist on exact unbiasedness.

Consistent estimate
An estimator $\hat{\theta}$ of a parameter $\theta$ is consistent if it converges in probability (or converges stochastically) to $\theta$ as the number of observations $n \rightarrow \infty$. That is, for all $\delta, \epsilon > 0$,
\[
p(\|\hat{\theta} - \theta\| < \epsilon) > 1 - \delta \quad \text{for } n > n_0
\]
or
\[
\lim_{n \rightarrow \infty} p(\|\hat{\theta} - \theta\| > \epsilon) = 0
\]
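Both properties can be checked empirically by simulation. The sketch below is not part of the original text; it assumes NumPy is available and uses the sample mean of normal data purely as an illustration: its Monte Carlo bias stays near zero for every sample size, while the spread of the estimates shrinks as $n$ grows.

```python
import numpy as np

def sampling_summary(estimator, sampler, theta_true, n, trials=20000, seed=0):
    """Monte Carlo bias and spread of an estimator at sample size n."""
    rng = np.random.default_rng(seed)
    estimates = np.array([estimator(sampler(rng, n)) for _ in range(trials)])
    return estimates.mean() - theta_true, estimates.std()

# Illustration: the sample mean of N(mu, sigma^2) data has bias ~ 0 for every n
# (unbiasedness) and a spread that shrinks as n grows (consistency).
mu, sigma = 2.0, 1.5
sampler = lambda rng, n: rng.normal(mu, sigma, size=n)
for n in (10, 100, 1000):
    bias, spread = sampling_summary(np.mean, sampler, mu, n)
    print(f"n = {n:>4}: bias ~ {bias:+.4f}, std of estimates ~ {spread:.4f}")
```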
Efficient estimate
The efficiency, $\eta$, of one estimator $\hat{\theta}_2$ relative to another $\hat{\theta}_1$ is defined as the ratio of the variances of the estimators,
\[
\eta = \frac{E[\|\hat{\theta}_1 - \theta\|^2]}{E[\|\hat{\theta}_2 - \theta\|^2]}
\]
$\hat{\theta}_1$ is an efficient estimator if it has the smallest variance (in large samples) compared to all other estimators, i.e. $\eta \le 1$ for all $\hat{\theta}_2$.

Sufficient estimate
A statistic $\hat{\theta}_1 = \hat{\theta}_1(x_1, \ldots, x_n)$ is termed a sufficient statistic if, for any other statistic $\hat{\theta}_2$,
\[
p(\theta \mid \hat{\theta}_1, \hat{\theta}_2) = p(\theta \mid \hat{\theta}_1) \tag{B.1}
\]
that is, all the relevant information for the estimation of $\theta$ is contained in $\hat{\theta}_1$ and the additional knowledge of $\hat{\theta}_2$ makes no contribution. An equivalent condition for a distribution to possess a sufficient statistic is the factorability of the likelihood function (Stuart and Ord, 1991; Young and Calvert, 1974):
\[
p(x_1, \ldots, x_n \mid \theta) = g(\hat{\theta} \mid \theta)\, h(x_1, \ldots, x_n) \tag{B.2}
\]
where $h$ is a function of $x_1, \ldots, x_n$ and is essentially $p(x_1, \ldots, x_n \mid \hat{\theta})$ and does not depend on $\theta$, and $g$ is a function of the statistic $\hat{\theta}$ and $\theta$. Equation (B.2) is also the condition for reproducing densities (Spragins, 1976; Young and Calvert, 1974) or conjugate priors (Lindgren, 1976): a probability density of $\theta$, $p(\theta)$, is a reproducing density with respect to the conditional density $p(x_1, \ldots, x_n \mid \theta)$ if the posterior density $p(\theta \mid x_1, \ldots, x_n)$ and the prior density $p(\theta)$ are of the same functional form. The family is called closed under sampling or conjugate with respect to $p(x_1, \ldots, x_n \mid \theta)$. Conditional densities that admit sufficient statistics of fixed dimension for any sample size (and hence reproducing densities) include the normal, binomial and Poisson density functions (Spragins, 1976).

Example 1
The sample mean $\bar{x}_n = \frac{1}{n}\sum_{i=1}^n x_i$ is an unbiased estimator of the population mean, $\mu$, since
\[
E[\bar{x}_n] = E\left[\frac{1}{n}\sum_{i=1}^n x_i\right] = \frac{1}{n}\sum_{i=1}^n E[x_i] = \mu
\]
but the sample variance
\[
\begin{aligned}
E\left[\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x}_n)^2\right]
&= E\left[\frac{1}{n}\sum_{i=1}^n \left(x_i - \frac{1}{n}\sum_{j=1}^n x_j\right)^2\right] \\
&= \frac{1}{n} E\left[\frac{n-1}{n}\sum_{j=1}^n x_j^2 - \frac{1}{n}\sum_j \sum_{k, k \neq j} x_j x_k\right] \\
&= \frac{n-1}{n}\sigma^2
\end{aligned}
\]
is not an unbiased estimator of the variance $\sigma^2$. Therefore, the unbiased estimator
\[
s = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x}_n)^2
\]
is usually preferred.
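A quick numerical check of this result (not part of the original text; it assumes NumPy, whose `ddof` argument switches between the $1/n$ and $1/(n-1)$ forms) is sketched below.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, trials = 0.0, 2.0, 10, 200000

samples = rng.normal(mu, sigma, size=(trials, n))
var_biased = samples.var(axis=1, ddof=0)     # divides by n
var_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1

print("true variance        :", sigma**2)            # 4.0
print("mean of 1/n form     :", var_biased.mean())   # approx (n-1)/n * sigma^2 = 3.6
print("mean of 1/(n-1) form :", var_unbiased.mean()) # approx 4.0
```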
Example 2
The sample mean is a consistent estimator of the mean of a normal population. The sample mean is normally distributed as
\[
p(\bar{x}_n) = \left(\frac{n}{2\pi}\right)^{\frac{1}{2}} \exp\left(-\frac{1}{2} n (\bar{x}_n - \mu)^2\right)
\]
with mean $\mu$ (the mean of the population, which has unit variance) and variance $1/n$. That is, $(\bar{x}_n - \mu)n^{\frac{1}{2}}$ is normally distributed with zero mean and unit variance. Thus, the probability that $|(\bar{x}_n - \mu)n^{\frac{1}{2}}| \le \epsilon n^{\frac{1}{2}}$ (i.e. $|\bar{x}_n - \mu| \le \epsilon$) is the value of the normal integral between the limits $\pm\epsilon n^{\frac{1}{2}}$. By choosing $n$ sufficiently large, this can always be made larger than $1 - \delta$ for any given $\delta$.
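The convergence in probability can also be seen numerically. The following sketch is illustrative only (not from the text, and it assumes NumPy): the fraction of trials in which $|\bar{x}_n - \mu| \le \epsilon$ approaches 1 as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, eps, trials = 0.5, 0.1, 5000

for n in (10, 100, 1000):
    # Sample means of n draws from a unit-variance normal population.
    x_bar = rng.normal(mu, 1.0, size=(trials, n)).mean(axis=1)
    coverage = np.mean(np.abs(x_bar - mu) <= eps)
    print(f"n = {n:>4}: P(|x_bar - mu| <= {eps}) ~ {coverage:.3f}")
```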
B.1.2 Maximum likelihood
The likelihood function is the joint density of a set of samples $x_1, \ldots, x_n$ from a distribution $p(x_i \mid \theta)$, i.e.
\[
L(\theta) = p(x_1, \ldots, x_n \mid \theta)
\]
regarded as a function of the parameters, $\theta$, rather than the data samples. The method of maximum likelihood is a general method of point estimation in which the estimate of the parameter $\theta$ is taken to be that value for which $L(\theta)$ is a maximum. That is, we are choosing the value of $\theta$ which is 'most likely to give rise to the observed data'. Thus, we seek a solution to the equation
\[
\frac{\partial L}{\partial \theta} = 0
\]
or, equivalently,
\[
\frac{\partial \log(L)}{\partial \theta} = 0
\]
since any monotonic function of the likelihood, $L$, will have its maximum at the same value of $\theta$ as the function $L$. Under very general conditions (Stuart and Ord, 1991, Chapter 18), the maximum likelihood estimator is consistent, asymptotically normal and asymptotically efficient. The estimator is not, in general, unbiased, though it will be asymptotically unbiased if it is consistent and if the asymptotic distribution has finite mean. However, for an unbiased estimator, $\hat{\theta}$, of a parameter $\theta$, the lower bound on the variance is given by the Cramér–Rao bound
\[
E[(\hat{\theta} - \theta)^2] \ge \frac{1}{\sigma_n^2}
\]
where
\[
\sigma_n^2 = E\left[\left(\frac{\partial \log[p(x_1, \ldots, x_n \mid \theta)]}{\partial \theta}\right)^2\right]
\]
is called the Fisher information in the sample. It follows from the definition of the efficient estimator that any unbiased estimator that satisfies this bound is efficient.
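For example, for a normal sample with known variance $\sigma^2$, the Fisher information is $\sigma_n^2 = n/\sigma^2$, so the bound is $\sigma^2/n$, which the sample mean attains. The sketch below is illustrative only (not from the text; it assumes NumPy) and checks this numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, trials = 1.0, 2.0, 50, 100000

# Monte Carlo variance of the maximum likelihood estimator (the sample mean).
x_bar = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
print("empirical variance of x_bar:", x_bar.var())   # approx sigma^2 / n = 0.08
print("Cramer-Rao lower bound     :", sigma**2 / n)  # 1 / (n / sigma^2)
```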
B.1.3 Problems with maximum likelihood
The main difficulty with maximum likelihood parameter estimation is obtaining a solution of the equations
\[
\frac{\partial L}{\partial \boldsymbol{\theta}} = 0
\]
for a vector of parameters, $\boldsymbol{\theta}$. Unlike the normal case, these equations are not always tractable and iterative techniques must be employed. These may be nonlinear optimisation schemes using gradient information, such as conjugate gradient or quasi-Newton methods, or, for certain parametric forms of the likelihood function, expectation–maximisation (EM) methods. The problem with the former approach is that maximisation of the likelihood function is not simply an exercise in unconstrained optimisation. That is, the quantities being estimated are often required to satisfy some (inequality) constraint. For example, in estimating the covariance matrix of a normal distribution, the elements of the matrix are constrained so that they satisfy the requirements of a covariance matrix (positive definiteness, symmetry). The latter methods have been shown to converge for particular likelihood functions.

Maximum likelihood is the most extensively used statistical estimation technique. It may be regarded as an approximation to the Bayesian approach, described below, in which the prior probability, $p(\theta)$, is assumed uniform, and is arguably more appealing since it has no subjective or uncertain element represented by $p(\theta)$. However, the Bayesian approach is in closer agreement with the foundations of probability theory. A detailed treatment of maximum likelihood estimation may be found in Stuart and Ord (1991).
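As a concrete illustration of the iterative approach (not from the text; it assumes NumPy and SciPy, and uses a univariate normal purely for simplicity), the negative log-likelihood can be minimised with a quasi-Newton method, with the positivity constraint on the standard deviation handled by optimising $\log\sigma$ rather than $\sigma$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=1.5, size=500)

def neg_log_likelihood(params, x):
    # Parameterise as (mu, log sigma) so that sigma > 0 is enforced implicitly.
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu)**2 / sigma**2)

# Quasi-Newton (BFGS) minimisation of the negative log-likelihood.
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]),
                  args=(data,), method="BFGS")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print("mu_hat    :", mu_hat)     # close to 3.0 (also the sample mean)
print("sigma_hat :", sigma_hat)  # close to 1.5 (the 1/n form, not 1/(n-1))
```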
B.1.4 Bayesian estimates
The maximum likelihood method is a method of point estimation of an unknown parameter $\theta$. It may be considered to be an approximation to the Bayesian method described in this section and is used when one has no prior knowledge concerning the distribution of $\theta$. Given a set of observations $x_1, \ldots, x_n$, the probability of obtaining $\{x_i\}$ under the assumption that the probability density is $p(x \mid \theta)$ is
\[
p(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n p(x_i \mid \theta)
\]
if the $x_i$ are independent. By Bayes' theorem, we may write the distribution of the parameter $\theta$ as
\[
p(\theta \mid x_1, \ldots, x_n) = \frac{p(x_1, \ldots, x_n \mid \theta)\, p(\theta)}{\int p(x_1, \ldots, x_n \mid \theta)\, p(\theta)\, d\theta}
\]
where $p(\theta)$ is the prior probability density of $\theta$. The quantity above is the probability density of the parameter, given the data samples, and is termed the posterior density. Given this quantity, how can we choose a single estimate for the parameter (assuming that we wish to)? There are obviously many ways in which the distribution could be used to generate a single value. For example, we could take the median or the mean as estimates for the parameter. Alternatively, we could select that value of $\theta$ for which the distribution is a maximum:
\[
\hat{\theta} = \arg\max_{\theta}\, p(\theta \mid x_1, \ldots, x_n) \tag{B.3}
\]
the value of $\theta$ which occurs with greatest probability. This is termed the maximum a posteriori (MAP) estimate or the Bayesian estimate. One of the problems with this estimate is that it assumes knowledge of $p(\theta)$, the prior probability of $\theta$. Several reasons may be invoked for neglecting this term. For example, $p(\theta)$ may be assumed to be uniform (though why should $p(\theta)$ be uniform on a scale of $\theta$ rather than, say, a scale of $\theta^2$?). Nevertheless, if it can be neglected, then the value of $\theta$ which maximises $p(\theta \mid x_1, \ldots, x_n)$ is equal to the value of $\theta$ which maximises $p(x_1, \ldots, x_n \mid \theta)$. This is the maximum likelihood estimate described earlier.

More generally, the Bayes estimate is that value of the parameter that minimises the Bayes risk or average risk, $R$, defined by
\[
R = E[C(\theta, \hat{\theta})] = \int C(\theta, \hat{\theta})\, p(x_1, \ldots, x_n, \theta)\, dx_1 \ldots dx_n\, d\theta \tag{B.4}
\]
where $C(\theta, \hat{\theta})$ is a loss function which depends on the true value of a parameter $\theta$ and its estimate, $\hat{\theta}$. Two particular forms for $C(\theta, \hat{\theta})$ are the quadratic and the uniform loss functions. The Bayes estimate for the quadratic loss function (the minimum mean square estimate)
\[
C(\theta, \hat{\theta}) = \|\theta - \hat{\theta}\|^2
\]
is the conditional expectation (the expected value of the a posteriori density) (Young and Calvert, 1974)
\[
\hat{\theta} = E_{\theta \mid x_1, \ldots, x_n}[\theta] = \int \theta\, p(\theta \mid x_1, \ldots, x_n)\, d\theta \tag{B.5}
\]
This is also true for cost functions that are symmetric functions of $\theta - \hat{\theta}$ and convex, and if the posterior density is symmetric about its mean. Choosing a uniform loss function
\[
C(\theta, \hat{\theta}) = \begin{cases} 0 & |\theta - \hat{\theta}| \le \delta \\ 1 & |\theta - \hat{\theta}| > \delta \end{cases}
\]
leads to the maximum a posteriori estimate derived above as $\delta \rightarrow 0$. Minimising the Bayes risk (B.4) yields a single estimate for $\theta$, and if we are only interested in obtaining a single estimate for $\theta$, then the Bayes estimate is one we might consider.
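As a worked illustration (not from the text; it assumes NumPy and uses a normal likelihood with known variance together with a conjugate normal prior, for which the posterior is again normal), the posterior mean, the MAP estimate and the maximum likelihood estimate can be compared directly.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 1.0               # known data standard deviation
m0, s0 = 0.0, 2.0         # prior: theta ~ N(m0, s0^2)
theta_true = 1.2

x = rng.normal(theta_true, sigma, size=20)
n, x_bar = len(x), x.mean()

# Conjugate update: the posterior is N(m_n, s_n^2).
post_prec = 1.0 / s0**2 + n / sigma**2
m_n = (m0 / s0**2 + n * x_bar / sigma**2) / post_prec
s_n = np.sqrt(1.0 / post_prec)

print("ML estimate (x_bar)        :", x_bar)
print("posterior mean = MAP (m_n) :", m_n)  # for a Gaussian posterior the mode equals the mean
print("posterior std (s_n)        :", s_n)
```

With a broad prior (large $s_0$) the posterior mean approaches the maximum likelihood estimate, which is the sense in which maximum likelihood approximates the Bayesian approach with a (near-)uniform prior.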