A Model-Based Approach for Variable Bandwidth Selection in Kernel Density Estimation

Mark J Brewer
Department of MSOR, University of Exeter, UK

May 27, 1998

Abstract

A new procedure is proposed for deriving variable bandwidths in univariate kernel density estimation. Rather than concentrate on minimising some criterion based upon the mean integrated square error (MISE) or a related quantity, we build models for the data and use exact calculation or sampling methods to make inferences about the bandwidths as appropriate. These Bayesian models are based on a cross-validated representation of the data; it is noted that they allow for bandwidth selection which is flexible in terms of the amount of smoothing required and can be tailored to suit specific applications. One model in particular, which introduces direct dependencies between bandwidths of neighbouring points, is seen to produce adaptive estimates which are reliably smooth in low density areas while still allowing for different levels of smoothing in high density areas.
Key words: kernel density estimation, Markov chain Monte Carlo, cross-validation, variable bandwidth.
1 Introduction

This paper considers a new procedure for the selection of bandwidths, both global and variable, in univariate kernel density estimation. The procedure differs from most previous bandwidth selection methods in that it does not set out directly to minimise some criterion related to the mean integrated square error (MISE); see Jones et al. (1996), Sain and Scott (1996), Silverman (1986) and Wand and Jones (1995) for example. Instead, the procedure formulates a model which represents a cross-validated likelihood function. The bandwidth is then treated as a parameter of the model to be estimated.

There have been many methods proposed for deriving global bandwidths, but relatively few for deriving variable bandwidths. This is apparently due to "unsettled issues concerning performance and practical implementation" (Wand and Jones, 1995, Section 2.10); we shall attempt to address these issues. The methods developed in this paper will be shown to be more straightforward to apply than those in Sain and Scott (1996), say, and yet give equivalent or better results, representing a flexible framework for gearing density estimation toward specific applications. While Sain and Scott (1996) show that their method works extremely well for reconstructing some Normal mixture densities, we shall demonstrate that this performance does not necessarily carry across into practical kernel density estimation with real data. Furthermore, a final model in this paper, which introduces direct dependencies between the variable bandwidths, is shown via practical application to be more robust to changes in pilot density estimate than methods such as that of Abramson (1982), meaning that it is most suitable for automatic bandwidth selection.

Let f(x) be the unknown true density function. Then the fixed kernel density estimator based on a sample {x_1, x_2, ..., x_n} can be written

\[ \hat{f}_0(x) = \frac{1}{n} \sum_{i=1}^{n} K_\lambda(x - x_i), \qquad (1) \]
where the kernel function K is a symmetric density function, K_λ(u) = λ⁻¹ K(u/λ), and λ is a bandwidth which controls the smoothness of the estimate. This bandwidth is global in the sense that it is constant over i. The choice of λ is well known to be crucial, and of more importance than the choice of kernel K. Choosing λ usually involves a trade-off between smoothness and bias of the density estimate.

Generally speaking, a fixed kernel estimator is likely simultaneously to under- and over-smooth f(x) in different parts of the function (Izenman, 1991). The adaptive kernel density estimator attempts to resolve this problem by allowing different levels of smoothing to occur in different regions of the estimate. A separate bandwidth λ_i is associated with each data point x_i, and the resulting estimator is

\[ \hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_{\lambda_i}(x - x_i). \qquad (2) \]
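To make the notation concrete, here is a minimal numpy sketch of the estimators (1) and (2), assuming a Normal kernel (as adopted later in this paper); the function names are ours and reappear in later sketches.

```python
import numpy as np

def fixed_kde(x_grid, data, lam):
    """Fixed kernel estimator (1): Normal kernel, single global bandwidth lam."""
    u = (x_grid[:, None] - data[None, :]) / lam
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(data) * lam * np.sqrt(2 * np.pi))

def adaptive_kde(x_grid, data, lams):
    """Adaptive kernel estimator (2): one bandwidth lams[i] per data point."""
    u = (x_grid[:, None] - data[None, :]) / lams[None, :]
    k = np.exp(-0.5 * u ** 2) / (lams[None, :] * np.sqrt(2 * np.pi))
    return k.sum(axis=1) / len(data)
```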
Abramson (1982) suggests taking λ_i ∝ f(x_i)^{-1/2}; implementation thus requires use of a pilot estimate of f(x), usually taken to be a fixed kernel estimator. A further type of estimator which allows variation in the bandwidths is the so-called balloon estimator, given by

\[ \hat{f}_b(x) = \frac{1}{n} \sum_{i=1}^{n} K_{\lambda_x}(x - x_i), \qquad (3) \]

so that the bandwidth λ_x is a function of the evaluation point x rather than of the data points, and for a given x the kernel density is effectively a fixed estimator. Terrell and Scott (1992) highlighted problems with the balloon estimator in univariate settings, but there has been further work on the topic recently, notably Hazelton (1996a, 1996b) and Sain and Scott (1998). This estimator is mentioned here primarily for reference from Section 3.1.

This paper describes a variety of models; those of Section 2 concern the derivation of bandwidths for the fixed kernel estimator, and are similar to those in Brewer (1998). We consider automatic bandwidth selection since, as Wand and Jones (1995, Chapter 3) point out, although the most suitable bandwidth for a given problem may best be chosen through experimentation, there are "many circumstances where it is very beneficial to have the bandwidth automatically selected from the data." Some potential methods of estimating the global bandwidths are discussed in detail.

In Section 3 we consider models for the selection of variable bandwidths; like most other variable bandwidth selection methods, the variable bandwidths here are derived from a "pilot" (global) bandwidth. A basic version is extended to a model incorporating dependencies between bandwidths of neighbouring data values, and this latter model is shown
to be relatively robust to changes in parameters, especially in regions of low density, and hence to be most suitable for automatic bandwidth selection. While this paper concentrates on the practical performance of methods applied to real data sets, Section 4 looks at data generated from a particular mixture density. We show that the Sain and Scott (1996) method performs extremely poorly for this density, and that both the Abramson and our robust model-based methods give better results.
2 A Global Bandwidth Selector

Consider the expression for the fixed kernel density estimator at (1). We would like to be able to use this expression to formulate a model for the bandwidth λ. It is tempting to think of the kernel estimator as a mixture of (say) Normals, with λ² as the constant variance over all mixture components. In mixture modelling, however, the usual aims are to estimate the number of components and to allocate data points to those components, and in kernel estimation these quantities are fixed (but see, for example, Jones et al., 1994). In addition, only one point of data is assigned to each mixture component, occurring at the mode, and clearly trying to estimate the variance (or λ itself) by methods such as those of Diebolt and Robert (1994), say, is not possible, as the relevant integral will have infinite value.

In kernel density estimation, one common way around this problem is to use cross-validation. Each point of data is assumed to have originated from the kernel density based on all the other observations. We build a model based around this representation of the data: assume we have univariate data x_i, i = 1, 2, ..., n and equivalently y_j, j = 1, 2, ..., n, so that x_i = y_j when i = j. Having two occurrences of the data in this way allows us to define a sensible graphical model, with conditional independence graph as in Figure 1 and joint density

\[ f(y, \tau \mid x, \alpha, \beta) = f(\tau \mid \alpha, \beta) \prod_{j=1}^{n} f(y_j \mid \{x_{-j}\}, \tau) = f(\tau \mid \alpha, \beta) \prod_{j=1}^{n} \frac{1}{n-1} \sum_{\substack{i=1 \\ i \neq j}}^{n} K_\tau(y_j - x_i), \qquad (4) \]
Figure 1: Conditional independence graph for the global bandwidth model.

where {x_{-j}} is the set of observations excluding x_j, the subscript on K now indexes the kernel by its precision, and we supply the precision τ = 1/λ² with a Gamma prior, hence

\[ f(\tau \mid \alpha, \beta) = \frac{\beta^\alpha \tau^{\alpha - 1} \exp(-\beta\tau)}{\Gamma(\alpha)}. \]
Initially we shall consider the use of a non-informative prior; Section 2.4 discusses the possible use of informative versions. In this paper, the kernel functions K will be taken to be Normal, which seems sensible given the modelling context.

Historically, the function at (4) without the prior has been maximised to give a pseudo-maximum likelihood estimate (which we shall call LCV) of the bandwidth (see Silverman, 1986, Section 3.4.4). In our Bayesian modelling context, we shall use the expected value of λ; the difference between the LCV and expected values of λ will depend on the prior distribution for τ. Since we have a non-informative prior and the cross-validated likelihood will be positively skewed, the model-based bandwidth will be slightly higher than the LCV bandwidth. Inferences would follow from the posterior distribution of τ, given by

\[ f(\tau \mid y, x, \alpha, \beta) = c_2 \, f(\tau \mid \alpha, \beta) \prod_{j=1}^{n} f(y_j \mid \{x_{-j}\}, \tau), \qquad (5) \]

where the constant c_2 is the reciprocal of expression (4) integrated over τ. We now consider various ways of estimating τ given these models.
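All of the estimation methods below need the posterior (5) only up to a constant. A sketch of the log of this unnormalised posterior, working in log space (via logsumexp) to avoid the numerical underflow discussed in Section 2.2; the function names are ours.

```python
import numpy as np
from scipy.special import logsumexp

def log_cv_likelihood(tau, x):
    """Log of the product over j in (4): Normal kernel with precision tau,
    each y_j evaluated under the kernel estimate built from {x_-j}."""
    n = len(x)
    sq = (x[:, None] - x[None, :]) ** 2            # (y_j - x_i)^2, with y = x
    log_k = 0.5 * np.log(tau / (2 * np.pi)) - 0.5 * tau * sq
    np.fill_diagonal(log_k, -np.inf)               # cross-validation: drop i = j
    return np.sum(logsumexp(log_k, axis=1) - np.log(n - 1))

def log_posterior(tau, x, alpha, beta):
    """Unnormalised log of the posterior (5), with the Ga(alpha, beta) prior."""
    if tau <= 0:
        return -np.inf
    return (alpha - 1) * np.log(tau) - beta * tau + log_cv_likelihood(tau, x)
```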
2.1 Estimation by Integration

We would like to be able simply to evaluate

\[ E[\lambda] = E\left[\frac{1}{\sqrt{\tau}}\right] = \int_0^\infty \frac{1}{\sqrt{\tau}} \, f(\tau \mid y, x, \alpha, \beta) \, d\tau \qquad (6) \]
by analytical means, but in practice this is forbiddingly complex. The symbolic computation package Maple, for example, was unable to compute E[λ] for more than a handful of data points. The same problem occurred when Maple was forced to use numerical integration. In fact, by rearranging the order of the product and the summation at (4) we can see that there are (n-1)^n functions of Gamma form, and hence there are no real benefits of conjugacy to be had, since the conditional distribution for y_j is a sum of Normal densities, not just a single Normal. However, since we are primarily interested in the expectation of λ, we now study the use of sampling methods to perform inference.
2.2 Estimation by Sampling

Here we consider three potential sampling methods: rejection sampling (see Ripley, 1987); Metropolis-Hastings (M-H; see Smith and Roberts, 1993); and an auxiliary variables procedure (AV) which is a more efficient version of that in Brewer et al. (1996). The latter two of these are MCMC methods, while rejection sampling is straight Monte Carlo and hence is guaranteed to generate independent observations.
2.2.1 Rejection Sampling

For rejection sampling from f(τ | y, x, α, β) we must find a suitable envelope function g(τ) such that

\[ \frac{f(\tau \mid y, x, \alpha, \beta)}{g(\tau)} \le M \qquad (7) \]

for some constant M, for all τ. We then generate proposed values from g(τ) and accept with probability f/(gM). However, in this application of rejection sampling we need to take logs of (7) in order to avoid numerical underflow. This gives

\[ \log f(\tau \mid y, x, \alpha, \beta) - \log g(\tau) \le \log M, \]

so that rejection sampling is now implemented by selecting τ from g, then selecting u from U(0, 1), and accepting the new value if

\[ \log u \le \log f(\tau \mid y, x, \alpha, \beta) - \log g(\tau) - \log M. \qquad (8) \]

We need to specify the envelope function g(τ) at (8). Experience with likelihood functions in this context suggests a Gamma envelope would be suitable.
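A sketch of the scheme, reusing log_posterior from the sketch above and assuming a Ga(a_env, b_env) envelope whose parameters the user supplies; obtaining log M by bounded numerical maximisation is our choice, not part of the original recipe.

```python
import numpy as np
from scipy import optimize, stats

def br_sample(x, alpha, beta, a_env, b_env, n_draws, rng):
    """Rejection sampler for tau with a Ga(a_env, b_env) envelope, using
    the log-space acceptance test (8)."""
    log_g = lambda t: stats.gamma.logpdf(t, a_env, scale=1.0 / b_env)
    # maximise log f - log g numerically over tau > 0 to obtain log M
    res = optimize.minimize_scalar(
        lambda t: -(log_posterior(t, x, alpha, beta) - log_g(t)),
        bounds=(1e-8, 1e4), method="bounded")
    log_m = -res.fun
    draws = []
    while len(draws) < n_draws:
        t = rng.gamma(a_env, 1.0 / b_env)          # propose from the envelope
        if np.log(rng.uniform()) <= log_posterior(t, x, alpha, beta) - log_g(t) - log_m:
            draws.append(t)
    return np.array(draws)
```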
2.2.2 Metropolis-Hastings Sampling

We can use Metropolis-Hastings sampling to create a Markov chain of (correlated) observations. To generate a new value τ[t+1] in the chain given τ[t], we sample τ′ from a proposal distribution q(τ[t], τ′) and calculate

\[ a' = \begin{cases} \min\left\{ \dfrac{f(\tau' \mid y, x, \alpha, \beta)\, q(\tau', \tau[t])}{f(\tau[t] \mid y, x, \alpha, \beta)\, q(\tau[t], \tau')},\; 1 \right\} & \text{if } f(\tau[t] \mid y, x, \alpha, \beta)\, q(\tau[t], \tau') > 0, \\ 1 & \text{if } f(\tau[t] \mid y, x, \alpha, \beta)\, q(\tau[t], \tau') = 0. \end{cases} \]

With probability a′ we set τ[t+1] = τ′; otherwise we set τ[t+1] = τ[t]. Note this requires us to supply a starting value τ[0] for the chain, and that we only need f(τ | y, x, α, β) up to a constant here, not the actual density, since the constant will cancel in the ratio above. In practice we will have severe difficulties with numerical underflow unless we use logs, as with rejection sampling. This gives

\[ a = \begin{cases} \min\{R, 0\} & \text{if } f(\tau[t] \mid y, x, \alpha, \beta)\, q(\tau[t], \tau') > 0, \\ 0 & \text{if } f(\tau[t] \mid y, x, \alpha, \beta)\, q(\tau[t], \tau') = 0, \end{cases} \qquad (9) \]

where

\[ R = \log f(\tau' \mid y, x, \alpha, \beta) + \log q(\tau', \tau[t]) - \log f(\tau[t] \mid y, x, \alpha, \beta) - \log q(\tau[t], \tau'), \qquad (10) \]

and we set τ[t+1] = τ′ if, for u ~ U(0, 1), log u ≤ a; otherwise we set τ[t+1] = τ[t]. Some care is needed in defining the proposal distribution q. In line with Brewer et al. (1996) we suggest taking q to be Gamma, with the mean of q placed at the current value τ[t].
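A sketch under the same assumptions, again reusing log_posterior; the proposal shape parameter is a user-chosen tuning constant, and the Gamma proposal has its mean at the current value, as suggested above.

```python
import numpy as np
from scipy import stats

def bmh_sample(x, alpha, beta, shape, n_iter, tau0, rng):
    """Metropolis-Hastings chain for tau: Gamma proposal with mean at the
    current value, acceptance computed in log space as at (9) and (10)."""
    log_q = lambda frm, to: stats.gamma.logpdf(to, shape, scale=frm / shape)
    chain = [tau0]
    for _ in range(n_iter):
        t = chain[-1]
        t_new = rng.gamma(shape, t / shape)        # proposal q(t, .), mean t
        r = (log_posterior(t_new, x, alpha, beta) + log_q(t_new, t)
             - log_posterior(t, x, alpha, beta) - log_q(t, t_new))
        chain.append(t_new if np.log(rng.uniform()) <= min(r, 0.0) else t)
    return np.array(chain)
```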
2.2.3 Auxiliary Variable Sampling

The auxiliary variable sampling method here is adapted from Besag and Green (1993), and the version below is a more efficient implementation than that in Brewer et al. (1996). Given the representation of the posterior distribution for τ at (5) and a current state τ[t], we sample a proposed new value τ′ from the prior distribution on τ, with p.d.f. f(τ | α, β). Then, for each of the n remaining product terms from (5) in turn, we generate a value u_j for an auxiliary variable U_j from U(0, f(y_j | {x_{-j}}, τ[t])), until the condition

\[ f(y_j \mid \{x_{-j}\}, \tau') \ge u_j \qquad (11) \]

fails. Upon failure, we start again with a new τ′; but if the condition (11) holds for all j = 1, 2, ..., n, the proposal τ′ is accepted and we set τ[t+1] = τ′.

The above version of the algorithm is likely to be more efficient than that in Brewer et al. (1996) since it does not require us to perform the sampling and tests for all j; we can bail out once we hit the first failure. In the current context the number of auxiliary variables will be large (equal to the sample size), and hence we will avoid much unnecessary computation. In practice, since we are assuming for now that the prior distribution for τ will be non-informative, a large proportion of the proposed τ′ will be rejected. While this will slow down the algorithm considerably, it does have the benefit of encouraging the Markov chain to explore the sample space, thus reducing autocorrelations in the chain.
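A sketch of the procedure; for simplicity the per-term bail-out described above is replaced here by a vectorised all-terms check, which accepts and rejects identically but forgoes the early-exit saving.

```python
import numpy as np
from scipy.special import logsumexp

def bav_sample(x, alpha, beta, n_iter, tau0, rng):
    """Auxiliary variable sampler for tau: propose from the Ga(alpha, beta)
    prior and accept only if condition (11) holds for every j."""
    def log_terms(tau):                            # log f(y_j | {x_-j}, tau)
        sq = (x[:, None] - x[None, :]) ** 2
        log_k = 0.5 * np.log(tau / (2 * np.pi)) - 0.5 * tau * sq
        np.fill_diagonal(log_k, -np.inf)
        return logsumexp(log_k, axis=1) - np.log(len(x) - 1)

    chain = [tau0]
    for _ in range(n_iter):
        cur = log_terms(chain[-1])
        while True:                                # retry until a proposal survives
            t_new = rng.gamma(alpha, 1.0 / beta)
            # u_j ~ U(0, f_j(tau[t])) <= f_j(tau'), checked for all j in log space
            if np.all(np.log(rng.uniform(size=len(x))) + cur <= log_terms(t_new)):
                break
        chain.append(t_new)
    return np.array(chain)
```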
2.3 Application

Our model-based approach is now applied to four data sets and the results are compared with those from existing methods. We consider the three estimation methods for our approach, and refer to results from the rejection algorithm as BR, from the M-H algorithm as BMH, and from the auxiliary variables algorithm as BAV; estimates are calculated from 1000 observations, after a suitable burn-in and taking every 20th simulated value for the MCMC procedures. In addition we consider: the rule-of-thumb bandwidth (ROT), which is given by Silverman (1986) as
\[ \lambda_{ROT} = \left( \frac{4}{3} \right)^{\frac{1}{5}} n^{-\frac{1}{5}} s, \]

where s is (usually) the sample standard deviation; the "plug-in" (PI) selector of Sheather and Jones (1991), which is based upon minimising the asymptotic MISE; and the likelihood cross-validation (LCV) method mentioned earlier. It is beyond the scope of this paper to consider all possible global bandwidth selectors, so the interested reader is referred to the review articles by Jones et al. (1996) and Izenman (1991); the simulation studies by Cao et al. (1994), Sheather (1992) and Park and Turlach (1992); and the books by Bowman and Azzalini (1997) and Simonoff (1996). The main lesson from all of these references seems to be that no one bandwidth selector can be regarded as infallible, although on the whole the Sheather and Jones (1991) plug-in method is felt to be as reliable as any and is straightforward to calculate.

                ROT       PI        LCV       BR        BMH       BAV
  LBM           4.789     3.711     4.077     4.097     4.175     4.145
    s.e.        -         -         -         0.029     0.030     0.030
  Old Faithful  0.438     0.209     0.102     0.110     0.108     0.114
    s.e.        -         -         -         0.000787  0.000798  0.000850
  Buffalo       10.971    9.683     9.496     10.361    10.194    10.672
    s.e.        -         -         -         0.098     0.092     0.100
  Suicide       63.774    19.401    33.296    34.161    32.335    34.391
    s.e.        -         -         -         0.179     0.133     0.187

Table 1: Bandwidth estimates for the four data sets (standard errors are shown for the three sampling-based selectors).

The results are summarised in Table 1, and plots of kernel density estimates from the BR, PI and ROT bandwidths are shown in Figure 2. Only these three are plotted, since the model-based approaches and LCV give curves indistinguishable from one another for the given data sets. The first plot concerns the "Lean Body Mass" variable of the Australian Institute of Sport data from Exercise 2.4 of Cook and Weisberg (1994). The three curves shown are barely distinguishable, though as Table 1 confirms, the PI bandwidth is the smallest and leads to a density estimate which highlights a "kink" at Mass=75.
Figure 2: Fixed kernel density estimates for the four data sets (Lean Body Mass, Old Faithful, Buffalo Snowfall, Suicide). Line types distinguish the ROT, PI and BR bandwidths.
Jones et al. (1996) also study this data set and conclude that the PI bandwidth is the most appropriate.

The second plot shows density estimates for the Old Faithful data (Silverman, 1986). Here we notice the PI bandwidth gives a much smoother curve than the BR bandwidth, which is clearly undersmoothing. It might be felt that the PI density estimate is too smooth, and ignores important structure in the right-hand mode. The third plot considers the Buffalo Snowfall data (Sain and Scott, 1996); this time we see that the PI bandwidth gives a density estimate with more structure than the ROT and BR selectors, which are virtually indistinguishable. The PI curve suggests a "shoulder" around Snowfall=100. Finally, the fourth plot shows estimates for the suicide data of Silverman (1986). As Table 1 shows, the PI bandwidth is much smaller than the ROT and BR bandwidths, giving a curve with more definition; the PI density seems undersmoothed, and perhaps for these data the BR (and hence LCV) selector is performing best.

As has been discussed elsewhere in the literature, there is no one global bandwidth selection method which performs best for every possible data set. Since the model-based bandwidths here will be slightly larger than the LCV bandwidths, their performance will be similar. The theoretical properties should also be comparable, noting that the difference between the LCV and model-based bandwidths will increase with greater positive skewness of the LCV profile likelihood in τ. While the PI selector is currently regarded as the best performing in general, it may be that in the Suicide example here the BR selector has performed better, at least by the criteria in Sain and Scott (1996).

The different sampling methods used to estimate τ from the model all give similar results, but at differing costs. For rejection sampling, one has to find a suitable envelope, perform function maximisation to obtain M at (8), and then obtain samples. While this is guaranteed to produce independent observations, and hence rejection sampling will be the preferred method of a competent statistician, the tasks of deciding on the envelope and of function maximisation will be non-trivial to a non-statistician. For Metropolis-Hastings sampling, one has to ascertain a sensible proposal distribution q at (10), and this may not be an automatic choice.
              BR       BMH      BAV
  LBM         3.538    3.520    3.600
    s.e.      0.025    0.017    0.020
  Buffalo     8.688    8.685    8.628
    s.e.      0.078    0.106    0.087

Table 2: Bandwidth estimates given a Ga(2,5) prior on τ for the Lean Body Mass data and a Ga(1,5) prior for the Buffalo Snowfall data.

The process of generating observations for these two methods will be much quicker than for the auxiliary variable procedure, but note that this latter method requires merely the data as input. For this reason, a non-statistician requiring an automatic choice of bandwidth may prefer the AV selector. The Metropolis-Hastings and auxiliary variable procedures will produce correlated observations, and hence one will probably need to generate a longer run of iterations in these cases. The samples of size 1000 giving the bandwidths in Table 1 for the two MCMC methods were derived by discarding a "burn-in" period of 1000 iterations, and then taking the value from every 20th iteration after that as a selected observation; this is erring on the side of caution.
2.4 Informative Priors on τ

Noting that averaging over the likelihood functions will produce bandwidths larger than the (pseudo-)maximum likelihood estimates, it may be desirable to use the prior specification on τ to influence the choice of bandwidth. Suppose we wish to obtain a kernel density estimate which has (in MISE terms) less bias than those resulting from the bandwidth choices of Table 1, i.e. we would like the bandwidth to be smaller. For the Lean Body Mass data, consider using the informative prior Ga(2,5) on τ, and for the Buffalo Snowfall data a Ga(1,5) prior, for example. The resulting bandwidth estimates are shown in Table 2. The resulting curve for the Lean Body Mass data, shown in Figure 3, suggests more strongly than those from Figure 2 the existence of a mode around Mass=75. However, attempting to highlight this mode by reducing the bandwidth further will also affect the rest of the function; in this case, a (spurious?) mode around Mass=105 would appear.
Figure 3: Kernel density estimates (BR) of the Lean Body Mass and Buffalo Snowfall data with informative priors on τ.

With the Buffalo data, the density estimate hints at trimodality. It should be clear from this that systematic alterations to the parameters of the model priors can be made in an attempt to identify (for example) possible modes in the underlying density.
3 Variable Bandwidth Selectors

In this Section we modify the model of Section 2 to consider variable bandwidths, such as would be used in adaptive density estimation at (2). We study two models: one is a straightforward extension of the fixed bandwidth case, and the other involves the introduction of direct dependencies between the variable bandwidths. We shall compare this model-based approach with previous methods, particularly the Silverman (1986) implementation of Abramson's (1982) method and the method of Sain and Scott (1996).
3.1 A Variable Bandwidth Model

This model is Bayesian in nature, and hence we concentrate on modelling variable precisions τ_j = 1/λ_j². Each τ_j is considered to be the product of a fixed precision τ and a local bandwidth factor δ_j, such as that in the Silverman (1986) implementation of the Abramson (1982) estimator. The graph for the model is shown in Figure 4.
Figure 4: Conditional independence graph for the variable bandwidth model.

The local bandwidth factors are given Ga(d1, d2) priors, and so the conditional distributions are:

\[ f(y_j \mid \{x_{-j}\}, \tau, \delta_j) = \frac{1}{n-1} \sum_{\substack{i=1 \\ i \neq j}}^{n} K_{\tau\delta_j}(y_j - x_i), \qquad (12) \]

\[ f(\delta_j \mid d_1, d_2) = \frac{d_2^{d_1}}{\Gamma(d_1)} \, \delta_j^{d_1 - 1} \exp(-\delta_j d_2), \qquad (13) \]

where K_{τδ_j} denotes the Normal kernel with precision τδ_j.
Given the conditional distributions at (12) and (13), the joint density of the variable bandwidth model is

\[ f(y, \delta \mid x, \tau, d_1, d_2) = \prod_{j=1}^{n} \left[ \frac{d_2^{d_1}}{\Gamma(d_1)} \, \delta_j^{d_1 - 1} \exp(-\delta_j d_2) \, \frac{1}{n-1} \sum_{\substack{i=1 \\ i \neq j}}^{n} K_{\tau\delta_j}(y_j - x_i) \right]. \qquad (14) \]
To gain an estimate for each variable bandwidth we need to find the expectation of the conditional density of δ_j given all data and other parameters. Section A.1 of the Appendix shows that the expected value of λ_j = 1/√(τδ_j) is

\[ E[\lambda_j] = \frac{\Gamma(d_1) \displaystyle\sum_{\substack{i=1 \\ i \neq j}}^{n} \left\{ d_2 + \frac{\tau (y_j - x_i)^2}{2} \right\}^{-d_1}}{\sqrt{\tau}\, \Gamma\!\left(d_1 + \tfrac{1}{2}\right) \displaystyle\sum_{\substack{i=1 \\ i \neq j}}^{n} \left\{ d_2 + \frac{\tau (y_j - x_i)^2}{2} \right\}^{-(d_1 + \frac{1}{2})}}. \qquad (15) \]

This expression uses τ, but in practice we would supply the bandwidth λ = 1/√τ. Also, the parameterisation of the prior is important. When choosing d1 and d2 we would like the expected value of 1/√δ_j under the resulting prior distribution to be 1 (since λ_j ∝ 1/√δ_j); again this is similar in spirit to Silverman's version of the Abramson estimator. Section A.2 of the Appendix shows that this results in the following relationship between the prior parameters:

\[ d_2 = \left[ \frac{\Gamma(d_1)}{\Gamma\!\left(d_1 - \tfrac{1}{2}\right)} \right]^2 \quad \text{for } d_1 > \tfrac{1}{2}. \qquad (16) \]

In effect then, we have two user inputs with this method: the global bandwidth λ = 1/√τ, and the first parameter d1 of the Gamma prior on the δ_j. The value chosen for λ will affect the resulting density estimate in the obvious way. The parameter d1 effectively controls the amount of variability among the δ_j (and hence the λ_j): larger values of d1 give less variability in terms of λ_j, and smaller values of d1 give more variability. As a result, choosing too large a value for d1 may give an adaptive density estimate which is indistinguishable from the fixed density estimate; conversely, choosing too small a d1 will introduce too much variation, which will cause (broadly speaking) points in less dense areas to have enormous bandwidths and points in very dense areas to have infinitesimal bandwidth values.

Note that this model implicitly defines a balloon kernel estimator like that at (3) for each data point y_j; however, since we obtain a local bandwidth for each sample point, it is valid to use these bandwidths in adaptive kernel density estimation using the formula at (2). This is similar in spirit to Sain and Scott (1996), who use a binned kernel criterion for selecting variable bandwidths. In the next two subsections we investigate the choice of the parameters λ and d1 for the four data sets.
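Since (15) and (16) are exact, the variable bandwidths here need no sampling at all; a sketch of the calculation (gammaln is used to keep the Gamma-function ratios numerically stable, and the function names are ours):

```python
import numpy as np
from scipy.special import gammaln

def d2_from_d1(d1):
    """Prior parameter relationship (16); requires d1 > 1/2."""
    return np.exp(2.0 * (gammaln(d1) - gammaln(d1 - 0.5)))

def variable_bandwidths(x, lam, d1):
    """Exact posterior means E[lambda_j] from (15), given global bandwidth lam."""
    tau, d2 = 1.0 / lam ** 2, d2_from_d1(d1)
    sq = (x[:, None] - x[None, :]) ** 2
    np.fill_diagonal(sq, np.inf)                   # drop the i = j terms
    base = d2 + 0.5 * tau * sq
    num = (base ** -d1).sum(axis=1)
    den = (base ** -(d1 + 0.5)).sum(axis=1)
    return np.exp(gammaln(d1) - gammaln(d1 + 0.5)) * num / (np.sqrt(tau) * den)
```

The returned values are the λ_j to be plugged into the adaptive estimator (2).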
3.1.1 Effect of Varying λ

Figure 5 shows density estimates with varying global bandwidths. In order to get a good spread of bandwidths, the three selectors ROT, PI and LCV were chosen: ROT will guarantee a large λ, while one of the other two selectors should give something considerably smaller. LCV was chosen over BR since it is noticeably smaller, for the Buffalo data especially. The parameter d1 was fixed at d1 = 1 here; note that this implies d2 = 1/π ≈ 0.318.
Figure 5: Variable bandwidth kernel density estimates for the four data sets, with d1 = 1 and λ set to the values obtained from the methods of Section 2; line types distinguish the PI, ROT and LCV choices.
The ranges of values for the global bandwidths chosen by the selectors for the Lean Body Mass data and the Buffalo Snowfall data were relatively small, and hence the three curves for these data sets in Figure 5 are very similar to each other; however, there are clear differences from the global bandwidth curves of Figure 2 for these data sets. For the Lean Body Mass data, the existence of a mode around Mass=75 is emphasised very clearly by the PI and LCV curves, whereas this was merely hinted at in Figure 2; in addition, the shelf at Mass=105 has now been smoothed out. For the Buffalo data, there is now evidence of the existence of "shoulders" in the density estimates, even in the ROT case, suggesting possible trimodality.

The wider ranges of pilot values used for the two other data sets result in density estimates with markedly different appearances in Figure 5. The curves for the Old Faithful data resemble very clearly those from Figure 2, but with less smoothing around the two main modes and more smoothing elsewhere. For the PI global bandwidth, we note that the derived adaptive density has performed more smoothing in the trough between the main modes, but now suggests more structure in the right-hand mode; it may be felt that the original PI density estimate was oversmoothed in this area. With the Suicide data, the adaptive density estimates show much more smoothing in the long right-hand tail than the fixed estimates. One might argue that the PI curve of Figure 5 is undersmoothed, particularly in relation to the LCV curve.
3.1.2 Effect of Varying d1

Now we study the effect of different values of d1. For the first three data sets, λ was fixed to be the PI estimate; for the Suicide data the LCV estimate was used, for the reasons outlined at the end of Section 3.1.1. The values 1, 2/3 and 2 were used for d1, and the adaptive kernel density estimates are shown in Figure 6. Use of d1 = 1 seems to give a reasonable density estimate for all four data sets, in that the curves seem neither too smooth nor not smooth enough. Interestingly, the choice of d1 = 2/3 leads to density estimates which appear to emphasise certain modes, but without introducing too much variation. For example, with the Old Faithful data, the dashed curve in Figure 6 looks rather like the BR curve of Figure 2 in the two main modes, but has a much smoother appearance in the trough between these modes.
Figure 6: Variable bandwidth kernel density estimates with λ equal to the PI estimates (except for the Suicide data, where LCV is used) and d1 set to 1, 2/3 and 2; line types distinguish the three values of d1.
Similarly, the dashed curve for the Lean Body Mass data in Figure 6 highlights three peaks but does not introduce a shelf at Mass=105. This may suggest that for certain applications, for example classification, d1 = 2/3 would be the preferred choice. In this way, the model-based approach for deriving variable bandwidths can be tailored to suit specific applications. By the time d1 is as large as 2, the density estimates begin to look very much like the original fixed estimates. It seems then that in general, to obtain a sensible adaptive density estimate with this procedure, we can take d1 to be 1 and use as the global bandwidth a suitable value from one of the selectors; the plug-in estimate is perhaps the best option overall should we require automatic selection.
3.2 Comparisons with Previous Methods

Here we compare the above procedure with two existing methods. The Abramson (1982) method involves taking a (fixed) kernel density as a pilot estimate and then letting λ_i ∝ f(x_i)^{-1/2} in (2). Silverman (1986) presents an implementation of this method using local bandwidth factors as discussed above. The adaptive density estimates for the Old Faithful data resulting from the PI, ROT and LCV pilot bandwidths are shown in the left-hand plot of Figure 7. Here we see variations in the curves similar to those in the corresponding plot of Figure 5, but note that the model-based estimate for the PI pilot seems more appropriate, since it gives more definition in both main modes while still appearing acceptably smooth between the modes and in the tails. Also, the model-based estimate for the ROT pilot seems preferable, as it has "recovered" from the poor pilot bandwidth choice rather better. In fairness, the Abramson estimate from the LCV pilot seems to have done a better job of smoothing in the trough than the model-based estimate, although there are still several kinks visible.

Sain and Scott (1996) describe a procedure which uses a binned kernel estimator and an unbiased cross-validation (UCV) criterion to obtain an "optimal" number of bins, and subsequently to derive bandwidths for the bins. In practice, one finds the minimum value of UCV for each of a set of numbers of bins, and a global minimum then provides the optimal single number of bins.
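For comparison, a sketch of Silverman's implementation of the Abramson selector as described at the start of this subsection, reusing fixed_kde from the Introduction; normalising by the geometric mean g of the pilot density values makes the local factors average to one on the log scale.

```python
import numpy as np

def abramson_bandwidths(x, pilot_lam):
    """Silverman's (1986) implementation of the Abramson square-root law:
    lambda_i proportional to f_pilot(x_i)^(-1/2), normalised by the
    geometric mean of the pilot density values."""
    f_pilot = fixed_kde(x, x, pilot_lam)           # pilot fixed kernel estimate
    g = np.exp(np.mean(np.log(f_pilot)))           # geometric mean
    return pilot_lam * (f_pilot / g) ** -0.5       # per-point bandwidths lambda_i
```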
Figure 7: Left: Abramson adaptive kernel density estimates for the Old Faithful data with PI, ROT and LCV pilot bandwidths. Right: Sain and Scott adaptive density estimate (solid line) and model-based estimate from the PI pilot bandwidth with d1 = 1 (dotted line).

The method of Sain and Scott (1996) seems to be something of a compromise between kernel and mixture density estimation. By using a binning procedure to obtain maximum likelihood estimates of bandwidths for particular bins, the method mimics the mixture approach of modelling sets of points as coming from a particular component. The difference is that although the bandwidth (variance) is constant within a particular bin (or component), the points are not assumed to have constant mean. This is why the Sain and Scott procedure clearly works so well when applied to the parametric examples in their Section 3 (a Normal and a two-component Normal mixture).

The right-hand plot of Figure 7 shows the Sain and Scott adaptive density estimate for the Old Faithful data (solid line) and the model-based estimate with PI pilot bandwidth and d1 = 1 (dotted line). The sample points have also been included in this plot, to highlight some unfortunate features of the Sain and Scott estimate. Firstly, the tail on the far right seems too heavy; secondly, the estimate reaches a local minimum around 2.25, where there is a fair amount of data, not at 2.75, where virtually all other density estimates have a minimum and where there is a clear gap in sample points.
The model-based estimate seems to describe the trough between the main modes and the right-hand tail much better than the Sain and Scott version, although the same could also be said of the Abramson estimate, for example. The variable bandwidths from the model-based approach are calculated very easily, and this is true also of the Abramson bandwidths (note that alternative versions of the Abramson estimator have been studied; see for example Terrell and Scott, 1992). The Sain and Scott method, however, is comparatively difficult to apply: the method requires the sample space to be divided into bins, and each such division with, say, b bins requires a function minimisation over the b bandwidths. As is suggested by Sain and Scott (1998), the function to be minimised contains many local minima, and it may in any case be that a particular local minimum will give better results than the overall minimum (over all feasible b). In addition, the division of the range of sample values into bins does not always produce "improved" density estimates; Sain and Scott (1996) describe cases where it was necessary either to shift bins slightly or to subdivide particular bins even further. Due both to these difficulties and to the question marks over performance and implementation, it is difficult to envisage the Sain and Scott method being used practically, especially when automatic bandwidth selection is desired or necessary.
3.3 A Robust Variable Bandwidth Model

One of the main problems with fixed kernel density estimation is that the estimator attempts to apply the same degree of smoothing over the whole range of the data. Of course, the adaptive density estimator seeks to provide variable amounts of smoothing, and this represents a notable improvement, as we have seen. However, we would like very much to have a variable bandwidth selection method that gives density estimates which are robust to changes in pilot bandwidth, especially in areas of low density, but which allow for different degrees of smoothing where there are a lot of data.

We need to be careful what we mean by "robust" in this context; consider, then, the density estimates of Figure 5 for the Old Faithful data set. The three curves here are noticeably different, but the importance of these differences at different points along the x-axis depends upon the sparseness of data in those areas. For example, the fact that
the curves are dissimilar within the two main modes is not particularly worrying; there are a lot of data in these areas, and hence it is a moot point whether further structure within the main modes should be revealed or not. None of the three estimates seems desperately awry in this case. Now instead look at the trough between the main modes: two of the curves give a smooth U-shape, while the third (the LCV) has two bumps within the trough. Given that the data are sparse in this region, it seems reasonable to assume we have undersmoothing here. It would therefore be preferable to have a variable bandwidth selector which allows for differences of smoothing in areas of high density, but which is robust to changes in pilot bandwidth in areas of low density. This Section describes such a selection method.
3.3.1 Dependencies Between the Local Bandwidth Factors

The new model is similar to that of Section 3.1, but we now introduce dependencies between the local bandwidth factors δ_j. Here we must ensure that the data y_j (and hence x_i) are ordered. The dependencies take the following form:

\[ \delta_j \sim \mathrm{Ga}\!\left( \frac{1}{5} \left[ d_1 + d_2 (\delta_{j-2} + \delta_{j-1} + \delta_{j+1} + \delta_{j+2}) \right],\; d_2 \right), \qquad (17) \]

where d1 and d2 are as in Section 3.1 and the obvious adjustments are made for j ∈ {1, 2, n-1, n-2}. All other parts of the model are defined as before, and the corresponding graph is shown in Figure 8.

At this point, some explanation is needed for this choice of dependency. We can see that for given {δ_{j-2}, δ_{j-1}, δ_{j+1}, δ_{j+2}}, the mean for δ_j without considering the likelihood from y_j is

\[ \frac{1}{5} \left( \frac{d_1}{d_2} + \delta_{j-2} + \delta_{j-1} + \delta_{j+1} + \delta_{j+2} \right). \qquad (18) \]

Recall that the corresponding mean for δ_j previously (from the prior; note our prior is now only implicit) was d1/d2, and that since we had decided upon d1 = 1 and d2 = 1/π as sensible values, this ratio had value π. If the δ's at (18) therefore all have values around π, then δ_j from this new model will be much the same as before. Suppose however that one or more of the neighbouring δ's have values which are much smaller than π, representing a much larger bandwidth at that point. It should be clear then that the value of (18) will be lower as a result, forcing down the value of δ_j.
Figure 8: Conditional independence graph for the robust variable bandwidth model. Note that for clarity, the directed arcs from d1 and d2 to δ_{j-2} etc. have been suppressed.

A similar effect occurs if one or more of the δ's have values higher than π. The net effect of introducing the dependencies between the δ's will hence be to emphasise differences between them, and because of the relationship between τ_i and λ_i, it is the higher bandwidth values which will increase the most, and this is precisely the effect desired.

The dependencies between the δ's defined at (17) show that each δ_j is related to the two factors on either side; we choose two here since taking one factor either side produced relatively little change in the density estimates from those derived using the model of Section 3.1, and it was felt that taking more than two neighbours each side produced too little improvement given the increasing complexity of the model. To see why we need to look as far as two factors in each direction, consider the conditional density for y_j at (12); since this is a balloon estimator based upon the set {x_{-j}} (equivalently {y_{-j}}), the bandwidth λ_j needs to be large enough that (12) "covers" y_j, i.e. that there is non-negligible density at that point. Especially in areas of low density (such as when y_j is relatively isolated), the odds are that it will be the kernel component of the nearest point which covers y_j. If y_j is in an area of low density, then the nearest point to y_j is more likely to be in such an area too, and hence neighbouring δ's are already correlated.

The model now defined here, with the conditional density for y_j as at (12), is effectively a graphical chain model (see Wermuth and Lauritzen, 1990, for example) by virtue of the undirected arcs linking the δ's.
Consequently, we cannot easily write down an expression for the joint density of the model, since we cannot simply multiply together the Gamma terms for each δ_j; this is discussed by Mollie (1996) and in Sections 10 and 11 of Spiegelhalter et al. (1995). However, as the latter reference describes, this would only be a problem if we were not regarding d1 and d2 as constant. We are thus able to make inferences about the δ_j of this model through sampling methods.
3.3.2 Making Inferences via Sampling

While we cannot easily write down the density for our robust model, we can still express the conditional density for an individual δ_j:

\[ \begin{aligned} f(\delta_j \mid y, x, \tau, \{\delta_{-j}\}, d_1, d_2) &\propto f(\delta_j \mid \delta_{j-2}, \delta_{j-1}, \delta_{j+1}, \delta_{j+2}, d_1, d_2)\, f(y_j \mid \{x_{-j}\}, \tau, \delta_j) \\ &\propto \delta_j^{\nu - 1} \exp(-\delta_j d_2) \, \frac{1}{n-1} \sum_{\substack{i=1 \\ i \neq j}}^{n} K_{\tau\delta_j}(y_j - x_i), \end{aligned} \]

where

\[ \nu = \frac{1}{5} \left[ d_1 + d_2 (\delta_{j-2} + \delta_{j-1} + \delta_{j+1} + \delta_{j+2}) \right]. \]

We can therefore make inferences about the δ_j by MCMC analysis; we choose an auxiliary variable algorithm here, which proceeds as follows (a code sketch is given after this list):

1. Sample a proposed new δ_j (called δ_j′) from Ga(ν, d2).
2. Sample u_j from U(0, (1/(n-1)) Σ_{i≠j} K_{τδ_j}(y_j - x_i)).
3. Accept δ_j′ if u_j ≤ (1/(n-1)) Σ_{i≠j} K_{τδ_j′}(y_j - x_i); else return to step 1.
4. Repeat steps 1 to 3 for all j = 1, 2, ..., n.
5. Repeat steps 1 to 4 until the MCMC estimates are "stable" (including any burn-in period).
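A sketch of one sweep (steps 1 to 4), assuming x is sorted; averaging ν over however many of the four neighbours actually exist is our reading of the "obvious adjustments" at the boundaries mentioned in Section 3.3.1.

```python
import numpy as np

def cv_density(j, delta_j, x, tau):
    """The balloon-type conditional density (12) at y_j, precision tau * delta_j."""
    prec = tau * delta_j
    sq = np.delete(x - x[j], j) ** 2               # (y_j - x_i)^2 for i != j
    return np.sqrt(prec / (2 * np.pi)) * np.mean(np.exp(-0.5 * prec * sq))

def robust_av_sweep(delta, x, tau, d1, d2, rng):
    """One sweep of the auxiliary variable sampler for the robust model."""
    n = len(x)
    for j in range(n):
        nbrs = [delta[k] for k in (j - 2, j - 1, j + 1, j + 2) if 0 <= k < n]
        nu = (d1 + d2 * sum(nbrs)) / (len(nbrs) + 1.0)
        f_cur = cv_density(j, delta[j], x, tau)
        while True:
            prop = rng.gamma(nu, 1.0 / d2)         # step 1: propose delta_j'
            u = rng.uniform(0.0, f_cur)            # step 2: auxiliary variable
            if u <= cv_density(j, prop, x, tau):   # step 3: accept, or retry
                delta[j] = prop
                break
    return delta
```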
Figure 9: Robust variable bandwidth kernel density estimates for the Old Faithful and Suicide data, with d1 = 1; line types distinguish the PI, ROT and LCV pilot bandwidths.

In practice, 1000 iterations of the algorithm after a burn-in period of 50 iterations were felt to be sufficient. Note that unlike the sampling of Section 2 for the global bandwidth model, the AV procedure here will be very fast, since we now have effectively an informative prior on the δ_j, and hence a much larger proportion of proposed values will be accepted. Consequently, the algorithm takes only a few seconds to generate 1000 observations of the set of δ's.
3.3.3 Application

We now study the resulting density estimates for this new model for the various choices of pilot bandwidth. We do not consider the Lean Body Mass and Buffalo data sets here, since we can see from Figure 5 that the density estimates from the original variable bandwidth model were barely distinguishable for the different pilots. Instead, we concentrate on the Old Faithful and Suicide data, and Figure 9 shows the robust adaptive density estimates for the PI, ROT and LCV pilot bandwidths. There are two major improvements here over the corresponding plots of Figure 5. Firstly, with the Old Faithful data, the density estimates in the trough are very similar for the PI and LCV pilots; in particular, the two bumps within the trough for the LCV estimate in Figure 5 have successfully been smoothed out, giving more of a U-shape than the corresponding Abramson estimate. Secondly, with the Suicide data, the long right-hand tail has been smoothed much more effectively, especially for that of the PI pilot, which was previously too variable in the tail. Earlier we noted a preference for the LCV pilot over the PI pilot with this data set, but it may be felt that the results from the robust model reverse this and the PI is more suitable. This has important consequences for automatic bandwidth selection, since we can now say that the PI pilot bandwidth gives very satisfactory density estimates for all four of our data sets when used with the robust model.
4 Study of a Known Mixture Density

Sain and Scott (1996) study simulated data sets from a Normal density and a two-component Normal mixture density. In the mixture density the two components are fairly well separated and of equivalent height. Sain and Scott demonstrate that their method gives density estimates which look far more like the original densities than the estimates given by use of either the fixed kernel or Abramson methods. In this Section we study a more complicated mixture density for which the Sain and Scott method appears to fail, and for which the Abramson and robust model-based methods perform equally well.

The density we study is a mixture of four Normal components and a scale-shifted Gamma density (to provide a suitably long tail). The mixture is illustrated in Figure 10 and has the form
\[ \begin{aligned} f(x) ={}& \frac{1}{4} \cdot \frac{1}{2\sqrt{2\pi}} \exp\left\{ -\frac{1}{2} \left( \frac{x - 6.5}{2} \right)^2 \right\} + \frac{3}{8} \cdot \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{1}{2} (x - 8)^2 \right\} \\ &+ \frac{1}{8} \cdot \frac{1}{1.5\sqrt{2\pi}} \exp\left\{ -\frac{1}{2} \left( \frac{x - 14}{1.5} \right)^2 \right\} + \frac{1}{8} \cdot \frac{1}{1.5\sqrt{2\pi}} \exp\left\{ -\frac{1}{2} \left( \frac{x - 18.5}{1.5} \right)^2 \right\} \\ &+ \frac{1}{8} \cdot \frac{(x - 20)^2}{\Gamma(3)} \exp\{ -(x - 20) \} \, I(x > 20), \end{aligned} \]

where I(·) is the indicator function. We take 5 samples of size 200 from this density, and apply kernel density estimation to each of the samples in turn.
Figure 10: The known mixture density for Section 4.

We study a fixed kernel with PI bandwidth; an Abramson estimate with PI pilot; a robust estimate with PI pilot; and a Sain and Scott estimate. Figure 11 shows the 5 density estimates for each method as dotted curves. It is difficult to distinguish between the Abramson and robust results, but these methods have both performed better than the fixed method, especially around the main peak and in the tails. The Sain and Scott method, however, has severely undersmoothed over the whole range; it does not even begin to pick up the three minor modes. Since the density is quite complex, the Sain and Scott method requires the data to be split into large numbers of bins (between 9 and 12 for the 5 samples), which then contain relatively few data points; this is presumably why the Sain and Scott method returns bandwidth values which are too large.

Since we know the true density here, we can obtain a numerical estimate of performance by calculating the integrated squared error (ISE) for each sample and taking the mean to give an estimate of the expected ISE (EISE), in the spirit of Park and Turlach (1992). The values here are: for the Abramson method, 0.001214; for the robust method, 0.001210; and for the Sain and Scott method, 0.004721. This tells much the same story as Figure 11, namely that with data simulated from the given density there is little to choose between the Abramson and robust selection methods, but that the Sain and Scott method is failing.
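For reference, a sketch of the target density above and of a grid-based ISE calculation in its spirit; the simple Riemann sum is our choice of quadrature, and the function names are ours.

```python
import numpy as np
from scipy import stats

def mixture_pdf(x):
    """The Section 4 target: four Normal components plus a Gamma(3) tail shifted to 20."""
    return (0.250 * stats.norm.pdf(x, 6.5, 2.0)
            + 0.375 * stats.norm.pdf(x, 8.0, 1.0)
            + 0.125 * stats.norm.pdf(x, 14.0, 1.5)
            + 0.125 * stats.norm.pdf(x, 18.5, 1.5)
            + 0.125 * stats.gamma.pdf(x, 3.0, loc=20.0))

def ise(f_hat, grid):
    """Integrated squared error of an estimate evaluated on a regular grid."""
    dx = grid[1] - grid[0]                         # regular grid spacing
    return float(np.sum((f_hat - mixture_pdf(grid)) ** 2) * dx)
```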
Figure 11: Density estimates (dotted lines) for the 5 samples from the known mixture density (solid line), for four different bandwidth selectors: fixed, Abramson, robust and Sain & Scott kernel estimates.
5 Discussion

It has been shown that the variable bandwidth selection method of Section 3.3 is robust to changes in pilot estimate, in terms of producing an adaptive density estimate which can be flexible in areas of high density but reliably smooth in areas of low density. Note that although the method of Sain and Scott (1996) performs extremely well in reconstructing simple mixtures, its practical performance when applied to real data sets or more complex mixtures is disappointing. We have concentrated on data sets which have appeared before in the density estimation literature, and note that the robust model in particular leads to estimates which are either as good as or better than those of previous methods.

In terms of automatic bandwidth selection, it seems that the combination of the robust model and the PI pilot produces reliable density estimates, at least for the four real data sets covered so far and the one to come. Even if one uses a different pilot, which may not be appropriate in reality, the robust model gives estimates which attempt to play safe in low density areas, while still allowing for the fact that one might desire to influence the shape of the density estimate within major modes, for example.

To further illustrate this point, consider again the Lean Body Mass data, and suppose we were to use the least squares cross-validation (LSCV) bandwidth (Bowman and Azzalini, 1997) as our (global) pilot bandwidth. Jones et al. (1996) claim that this selector produces a bandwidth (λ = 1.151) which is highly inappropriate. The dashed line of Figure 12 shows the fixed kernel estimate with the LSCV bandwidth, and it is indeed extremely variable. The dotted line of the same Figure shows the Abramson estimate; while there is notable improvement in the tails, the within-mode variation is even worse than with the fixed estimate. The solid line shows the robust model estimate, and this has clearly performed the greatest "correction" to the inappropriate pilot: the behaviour in the tails is better than with the Abramson curve, and there is even some reduction in variability in the main body of the density.
Figure 12: Kernel density estimates for the Lean Body Mass data: robust model with LSCV pilot (solid line), fixed estimate using LSCV (dashed line) and Abramson selector with LSCV pilot (dotted line).
Finally, we discuss the 1872 Hidalgo stamp issue data, which represent the paper thicknesses of 485 stamps and which have been studied by Sheather (1992) among others. We note first that the method of Sain and Scott (1996) fails in this situation, as shown by the short-dashed line in Figure 13, despite using a number of bins (eight) which was found to minimise the appropriate criterion and which agrees with the maximum likelihood conclusions of Basford et al. (1997). In fairness, Sain and Scott report that using equal-sized bins "does not [always] yield improvement over the fixed bandwidth estimator" and suggest a method for choosing bins to subdivide further. Note that the study of Section 4 provides further evidence that it is the Sain and Scott method (rather than all the others) performing badly.

The long-dashed line of Figure 13 shows the fixed kernel estimate using the PI bandwidth, which Sheather (1992) felt to be the best. The dotted line and the solid line display the Abramson and robust model estimates respectively, both using PI pilots. The Abramson estimate contains greater smoothing in the tails relative to the fixed estimate, but rather oddly introduces a new bump between the two largest peaks. The robust curve smooths out entirely the bumps at 0.12 and 0.13, leaving five visible peaks in the density estimate. This is especially interesting since Sheather reports on an analysis of "extensive historical data" which found "plausible reasons" for these five modes alone; only the robust model found these five modes without finding others in addition.

We have thus shown our models, especially the robust model, to be extremely successful in selecting variable bandwidths for adaptive density estimation. The approach is novel, and it is hoped that further developments will be made using these techniques, for example extensions to multivariate density estimation, studies of classification with adaptive estimates, and the use of boundary kernels.
Figure 13: Kernel density estimates for the 1872 Hidalgo stamp issue data: robust model with PI pilot (solid line), Sain and Scott estimate with 8 bins (short-dashed line), Abramson selector with PI pilot (dotted line) and fixed estimate using PI (long-dashed line).
Appendix

A Exact Calculations for Section 3.1

A.1 Estimation of Variable Bandwidths

The conditional density for a variable bandwidth δ_j at (14) is found thus:

\[ \begin{aligned} f(\delta_j \mid y, x, \{\delta_{-j}\}, \tau, d_1, d_2) &= f(\delta_j \mid y_j, \{x_{-j}\}, \tau, d_1, d_2) \\ &= c_1 f(\delta_j \mid d_1, d_2)\, f(y_j \mid \{x_{-j}\}, \tau, \delta_j) \\ &= c_1 \frac{d_2^{d_1}}{\Gamma(d_1)} \delta_j^{d_1 - 1} \exp(-\delta_j d_2) \frac{1}{n-1} \sum_{\substack{i=1 \\ i \neq j}}^{n} \frac{\sqrt{\tau\delta_j}}{\sqrt{2\pi}} \exp\left\{ -\frac{\tau\delta_j (y_j - x_i)^2}{2} \right\} \\ &= c_2 \, \delta_j^{(d_1 + \frac{1}{2}) - 1} \sum_{\substack{i=1 \\ i \neq j}}^{n} \exp\left\{ -\delta_j \left[ d_2 + \frac{\tau (y_j - x_i)^2}{2} \right] \right\}, \end{aligned} \]

using the conditional independence structure of the graph of Figure 4, and where c_1 and c_2 are appropriate constants. In order to calculate E(λ_j) = E(1/√(τδ_j)), we need to find c_2 from

\[ \int_0^\infty f(\delta_j \mid y, x, \{\delta_{-j}\}, \tau, d_1, d_2) \, d\delta_j = 1, \]

so

\[ \begin{aligned} \int_0^\infty f(\delta_j \mid y, x, \{\delta_{-j}\}, \tau, d_1, d_2) \, d\delta_j &= c_2 \sum_{\substack{i=1 \\ i \neq j}}^{n} \int_0^\infty \delta_j^{(d_1 + \frac{1}{2}) - 1} \exp\left\{ -\delta_j \left[ d_2 + \frac{\tau (y_j - x_i)^2}{2} \right] \right\} d\delta_j \\ &= c_2 \sum_{\substack{i=1 \\ i \neq j}}^{n} \frac{\Gamma\!\left(d_1 + \frac{1}{2}\right)}{\left\{ d_2 + \frac{\tau (y_j - x_i)^2}{2} \right\}^{d_1 + \frac{1}{2}}} \\ &= c_2 \, \Gamma\!\left(d_1 + \tfrac{1}{2}\right) \sum_{\substack{i=1 \\ i \neq j}}^{n} \left\{ d_2 + \frac{\tau (y_j - x_i)^2}{2} \right\}^{-(d_1 + \frac{1}{2})}, \end{aligned} \]

and hence

\[ c_2 = \left( \Gamma\!\left(d_1 + \tfrac{1}{2}\right) \sum_{\substack{i=1 \\ i \neq j}}^{n} \left\{ d_2 + \frac{\tau (y_j - x_i)^2}{2} \right\}^{-(d_1 + \frac{1}{2})} \right)^{-1}. \]

Finally, we can evaluate the expectation:

\[ \begin{aligned} E\left[ \frac{1}{\sqrt{\tau\delta_j}} \right] &= \frac{c_2}{\sqrt{\tau}} \sum_{\substack{i=1 \\ i \neq j}}^{n} \int_0^\infty \delta_j^{d_1 - 1} \exp\left\{ -\delta_j \left[ d_2 + \frac{\tau (y_j - x_i)^2}{2} \right] \right\} d\delta_j \\ &= \frac{c_2}{\sqrt{\tau}} \, \Gamma(d_1) \sum_{\substack{i=1 \\ i \neq j}}^{n} \left\{ d_2 + \frac{\tau (y_j - x_i)^2}{2} \right\}^{-d_1} \\ &= \frac{\Gamma(d_1) \displaystyle\sum_{\substack{i=1 \\ i \neq j}}^{n} \left\{ d_2 + \frac{\tau (y_j - x_i)^2}{2} \right\}^{-d_1}}{\sqrt{\tau}\, \Gamma\!\left(d_1 + \frac{1}{2}\right) \displaystyle\sum_{\substack{i=1 \\ i \neq j}}^{n} \left\{ d_2 + \frac{\tau (y_j - x_i)^2}{2} \right\}^{-(d_1 + \frac{1}{2})}}, \end{aligned} \]

which gives expression (15) as required.

A.2 Relationship between Prior Parameters

We require that the expectation of 1/√δ_j is equal to 1 when δ_j has a Ga(d1, d2) prior distribution as at (13). So:

\[ \begin{aligned} E\left[ \frac{1}{\sqrt{\delta_j}} \right] &= \int_0^\infty \frac{1}{\sqrt{\delta_j}} \frac{d_2^{d_1}}{\Gamma(d_1)} \delta_j^{d_1 - 1} \exp(-\delta_j d_2) \, d\delta_j \\ &= \frac{d_2^{d_1}}{\Gamma(d_1)} \int_0^\infty \delta_j^{d_0 - 1} \exp(-\delta_j d_2) \, d\delta_j \qquad \text{(where } d_0 = d_1 - \tfrac{1}{2}\text{)} \\ &= \frac{d_2^{d_1}}{\Gamma(d_1)} \cdot \frac{\Gamma\!\left(d_1 - \frac{1}{2}\right)}{d_2^{d_1 - \frac{1}{2}}} = \sqrt{d_2} \, \frac{\Gamma\!\left(d_1 - \frac{1}{2}\right)}{\Gamma(d_1)} = 1, \end{aligned} \]

and hence

\[ d_2 = \left[ \frac{\Gamma(d_1)}{\Gamma\!\left(d_1 - \frac{1}{2}\right)} \right]^2 \quad \text{for } d_1 > \tfrac{1}{2}, \]

which gives expression (16) as required.
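As a quick numerical sanity check of this relationship, the following sketch draws from the Ga(d1, d2) prior with d2 given by (16) and confirms that E[1/√δ] is approximately 1; the function name and sample size are ours.

```python
import numpy as np
from scipy.special import gammaln

def check_relation_16(d1, n=1_000_000, seed=0):
    """Monte Carlo check of (16): with d2 = [Gamma(d1)/Gamma(d1 - 1/2)]^2,
    E[1/sqrt(delta)] under a Ga(d1, d2) prior should equal 1."""
    d2 = np.exp(2.0 * (gammaln(d1) - gammaln(d1 - 0.5)))
    delta = np.random.default_rng(seed).gamma(d1, 1.0 / d2, size=n)
    return (1.0 / np.sqrt(delta)).mean()           # approximately 1.0 for d1 > 1/2

# e.g. check_relation_16(1.0) gives a value close to 1 (here d2 = 1/pi)
```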
References

Abramson, I (1982) "On Bandwidth Variation in Kernel Estimates - A Square Root Law," The Annals of Statistics, 10, 1217-1223.

Basford, KE, McLachlan, GJ, and York, MG (1997) "Modelling the Distribution of Stamp Paper Thickness via Finite Normal Mixtures: The 1872 Hidalgo Stamp Issue of Mexico Revisited," Journal of Applied Statistics, 24, 169-179.

Besag, J, and Green, PJ (1993) "Spatial Statistics and Bayesian Computation," Journal of the Royal Statistical Society, B 55, 25-38.

Bowman, AW, and Azzalini, A (1997) Applied Smoothing Techniques for Data Analysis, Oxford, England: Oxford University Press.

Brewer, MJ (1998) "A Modelling Approach for Bandwidth Selection in Kernel Density Estimation," in Proceedings of COMPSTAT 1998, Physica Verlag (to appear).

Brewer, MJ, Aitken, CGG, and Talbot, M (1996) "A Comparison of Hybrid Strategies for Gibbs Sampling in Mixed Graphical Models," Computational Statistics and Data Analysis, 21, 343-365.

Cao, R, Cuevas, A, and Mantiega, WG (1994) "A Comparative Study of Several Smoothing Methods in Density Estimation," Computational Statistics and Data Analysis, 17, 153-176.

Cook, RD, and Weisberg, S (1994) An Introduction to Regression Graphics, New York: Wiley.

Diebolt, J, and Robert, CP (1994) "Estimation of Finite Mixture Distributions through Bayesian Sampling," Journal of the Royal Statistical Society, B 56, 363-375.

Hazelton, ML (1996a) "Optimal Rates for Local Bandwidth Selection," Nonparametric Statistics, 7, 57-66.

Hazelton, M (1996b) "Bandwidth Selection for Local Density Estimation," Scandinavian Journal of Statistics, 23, 221-232.

Izenman, AJ (1991) "Recent Developments in Nonparametric Density Estimation," Journal of the American Statistical Association, 86, 205-224.

Jones, MC, McKay, IJ, and Hu, T-C (1994) "Variable Location and Scale Kernel Density Estimation," Annals of the Institute of Statistical Mathematics, 46, 521-535.

Jones, MC, Marron, JS, and Sheather, SJ (1996) "A Brief Survey of Bandwidth Selection for Density Estimation," Journal of the American Statistical Association, 91, 401-407.

Mollie, A (1996) "Bayesian Mapping of Disease," in Gilks, WR, Richardson, S, and Spiegelhalter, DJ (eds.), Markov Chain Monte Carlo in Practice, London: Chapman and Hall.

Park, B-U, and Turlach, BA (1992) "Practical Performance of Several Data-Driven Bandwidth Selectors," Computational Statistics, 7, 251-285.

Ripley, BD (1987) Stochastic Simulation, New York: Wiley.

Sain, SR, and Scott, DW (1996) "On Locally Adaptive Density Estimation," Journal of the American Statistical Association, 91, 1525-1534.

Sain, SR, and Scott, DW (1998) "Zero-Bias Locally Adaptive Density Estimators," Journal of the Royal Statistical Society, B (submitted).

Sheather, SJ (1992) "The Performance of Six Popular Bandwidth Selection Methods on Some Real Data Sets," Computational Statistics, 7, 225-250.

Sheather, SJ, and Jones, MC (1991) "A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation," Journal of the Royal Statistical Society, B 53, 683-690.

Silverman, BW (1986) Density Estimation for Statistics and Data Analysis, London: Chapman and Hall.

Simonoff, JS (1996) Smoothing Methods in Statistics, New York: Springer-Verlag.

Smith, AFM, and Roberts, GO (1993) "Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods," Journal of the Royal Statistical Society, B 55, 3-23.

Spiegelhalter, D, Thomas, A, Best, N, and Gilks, W (1995) "BUGS 0.5 Examples Volume 2 (version ii)," MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK.

Terrell, GR, and Scott, DW (1992) "Variable Kernel Density Estimation," Annals of Statistics, 20, 1236-1265.

Wand, MP, and Jones, MC (1995) Kernel Smoothing, London: Chapman and Hall.

Wermuth, N, and Lauritzen, SL (1990) "On Substantive Research Hypotheses, Conditional Independence Graphs and Graphical Chain Models," Journal of the Royal Statistical Society, B 52, 21-50.