Mathematical Geology, Vol. 12, No. 2, 1980
Estimating an Upper Bound to Particle Size in Sediment Populations¹

W. E. Bardsley²

A sediment-size measure based on an estimate of an upper-size bound is suggested as a useful alternative to the largest sample observation. Two estimation procedures are described which are compatible with the usual sediment sampling techniques that yield very large samples of unknown and variable size. One method requires only the first two order statistics from a single sample, while the other uses the maxima from a number of samples and estimates the upper bound as the location parameter of a Weibull distribution. In the latter case it is shown that the effect of random sample size can be overcome provided the expected sample size is sufficiently large.
KEY WORDS: Particle size analysis, order statistics, Weibull distribution.

INTRODUCTION

The description and analysis of spatial trends in grain size is a recurrent theme in sediment studies. In such investigations, emphasis is often placed on the coarse extreme of the grain size distribution, where the effect of transport is most evident. One of the most commonly used measures in this respect is the size of the largest particle observed at each exposure. This measure has proved particularly useful in the determination of paleocurrent directions in coarse sediments.

Despite its intuitive appeal, however, the use of the largest observed particle presents some fundamental problems with respect to statistical inference and hypothesis testing. In this paper the nature of these problems is briefly discussed and an alternative approach is proposed through the use of sample estimates of an assumed upper size bound in the sediment population. Two types of estimation procedure are outlined which are compatible with the usual methods of sediment sampling. In the discussion below, it is assumed that all samples are drawn at random from some large sediment population and "sample size" refers to the total number of sediment particles in a given sample.
¹Manuscript received 4 May 1979; revised 21 September 1979.
²Department of Geography, University of Otago, Dunedin, New Zealand.
0020-5958/80/0400-0127$03.00/0 © 1980 Plenum Publishing Corporation
In considering an upper bound, it is necessary to define what is meant by the sediment population. It may seem reasonable to assume that all sediment populations are finite and hence that the upper bound must represent the largest particle deposited. However, the set of sediment particles deposited in any area of interest is itself only a sample of an unobservable infinite population that would result if the same sequence of events which produced that particle set could be repeated again and again. The largest of the set of deposited particles therefore may or may not be a close approximation to the bound, depending on the nature of the transporting and depositional processes involved.

The essential difficulty with the use of the largest sampled particle as a size indicator is that without additional assumptions, it is impossible to relate the measure back to any parameter of the population concerned. For example, the sample particle size maxima may show a clear decrease away from some source area but it would be difficult to infer from this just what it was in the underlying population that was changing with distance. This arises from the fact that the parameters of a distribution of extremes may depend on the location, scale, and shape parameters of the parent distribution. A change in the distribution of sample extremes could therefore result from changes in any number of possible parameter combinations in the parent distribution. In this situation, there will obviously be considerable difficulties in using sample maxima to make inferences or test statistical hypotheses related to the sediment population.

A further drawback to the use of the largest particle arises from the need to avoid unequal sample sizes, since large samples tend to produce greater maxima than small samples. Potter and Pettijohn (1977, p. 282) recommend sampling in such a way that the sample size is constant from one locality to the next.
Any such procedure must inevitably depend on laborious counting, since the usual area, volume, or weight sampling methods lead to variable sample sizes.

The above difficulties can be avoided if it can be assumed that there exists some upper grain size bound e in the sediment population. The largest sample observation can then be replaced by the sample estimate of e. Any observed systematic change in the estimate $\hat{e}$ can then be interpreted as reflecting a corresponding change in the underlying parameter. Further, if the estimator is unbiased, the expected value of $\hat{e}$ must by definition remain independent of sample size.

One particularly useful feature of $\hat{e}$ is that it is applicable to measures of apparent size, such as that obtained from thin sections or partially buried boulders. For example, if e is the upper bound to the distribution of apparent intermediate diameters measured from thin sections, then e must also be the upper bound to the distribution of the true intermediate diameters. The only assumption required is that there is a nonzero probability that any arbitrary particle will yield a "true" value of the variable concerned.

It is not suggested, of course, that the estimation of upper bounds will be a universally applicable procedure. It is clear that for $\hat{e}$ to be near the bound, e
must be reasonably well defined with respect to the deposited particles. This condition is unlikely to hold for deposits associated with very high competence fluids such as viscous mudflows. In such situations, even if the bound is a physical reality, it is likely to be so far removed from the main body of deposited sediment that any sample estimate would be highly unreliable.

DEFINITIONS

The variable X is defined as some actual or apparent linear particle measure relating to size. The distribution of X in the particle population is assumed to be governed by an unknown probability density function h(x), which may or may not possess some finite upper bound. The order statistics of a random sample drawn from h(x) are defined as

$$X_1 \geq X_2 \geq X_3 \geq \cdots \geq X_N \tag{1}$$
where N is the sample size. Depending on the sampling procedure, N may be either a constant or a variable governed by some probability function k(n).

ESTIMATING e FROM THE FIRST TWO ORDER STATISTICS

The largest observation from a single sample of size N can itself be regarded as an estimator of e. However, the obvious bias of such estimates leads to the sample size difficulties mentioned earlier. Robson and Whitlock (1964) applied a bias-reducing jackknife technique to the sample maximum, obtaining the estimator

$$\hat{e} = 2X_1 - X_2 \tag{2}$$

which is mean-unbiased to order $N^{-2}$. An estimate of the variance of (2) is given by

$$(X_1 - X_2)^2 \tag{3}$$
which is mean-unbiased to order N - a . The estimator (2) can be applied to any h (x) which is bounded, provided h (e) > 0. The mean squared error of (2) increases with decreasing h (e), but this presents no real problem in most sediment sampling since a sufficiently large sample size will counterbalance the error contributed by any arbitrarily small
h(e).

ESTIMATING e AS THE LOCATION PARAMETER OF A WEIBULL DISTRIBUTION
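As a point of reference for the single-sample method, the estimator (2) and the variance estimate (3) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `upper_bound_estimate` and the example measurements are hypothetical and do not come from the paper. The input is taken to be a list of linear size measurements from one sample.

```python
from typing import Sequence, Tuple


def upper_bound_estimate(sizes: Sequence[float]) -> Tuple[float, float]:
    """Return (e_hat, var_hat) from the two largest observations.

    e_hat   = 2*X1 - X2       (estimator (2))
    var_hat = (X1 - X2)**2    (variance estimate (3))
    """
    if len(sizes) < 2:
        raise ValueError("need at least two particle measurements")
    # Only the two largest values are required, so a full sort is avoided.
    x1 = x2 = float("-inf")
    for x in sizes:
        if x > x1:
            x1, x2 = x, x1
        elif x > x2:
            x2 = x
    return 2.0 * x1 - x2, (x1 - x2) ** 2


# Hypothetical measurements in millimeters, for illustration only.
e_hat, var_hat = upper_bound_estimate([12.0, 31.5, 28.0, 30.0])
print(e_hat, var_hat)
```

Because only X1 and X2 enter the calculation, the remaining particles in the sample never need to be measured or counted, which is what makes the method compatible with samples of unknown size.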
In using (2), the existence of e is always an a priori assumption with respect to the deposit concerned. An alternative approach is to sample in such a way
that the data can be used to test the validity of e as a meaningful finite bound, prior to any estimation. It is assumed that the distribution h(x) gives rise to sample maxima following one of the three asymptotic extreme value distributions, listed in Bardsley (1978). This implies that for sufficiently large samples, X1 will follow a Type III extreme value distribution if an upper bound is present, and a Type I or II distribution otherwise. It follows that a test for the presence of an upper bound can be achieved by collecting several large samples and then testing whether the X1 values from each are best described by Type I, II, or III extreme value distributions. If a Type III distribution is indicated, its location parameter is e, and -X1 follows a three-parameter Weibull distribution with location parameter -e (Johnson and Kotz, 1970, p. 272). Estimation of e can then be carried out by using any of the variety of methods which have been developed for estimating the Weibull location parameter (Mann, Schafer, and Singpurwalla, 1974, Ch. 5).

A difficulty with the above approach is that it is based on the assumption that the samples are not only large, but of constant size. The usual sediment sampling procedures meet the first requirement but not the second. The section below is concerned with demonstrating that the fixed sample size extreme value theory can also be applied to maxima from sufficiently large sediment samples of variable size with respect to (i) constant weight samples, (ii) constant volume samples, and (iii) samples produced from unbiased subdivision of any large sample.

A useful result which can be applied to the three sampling types was obtained by Bardsley and Manly (1979). It was shown that for variable sample size N, a sufficient condition for X1 to tend toward a corresponding fixed sample size asymptotic distribution is given by
$$\lim_{E(N) \to \infty} D/E(N) = 0 \tag{4}$$

where E(N) and D denote the expected value and mean deviation of k(n), the probability function governing the distribution of N.
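To make condition (4) concrete, the ratio D/E(N) can be evaluated numerically for a specific k(n). The sketch below assumes a Poisson sample size purely as an example (the Poisson case is the one treated under constant volume samples); the function name and the truncation of the summation range are illustrative choices, not part of the original analysis.

```python
import math


def poisson_mean_deviation_ratio(lam: float) -> float:
    """D / E(N) for N ~ Poisson(lam), by direct summation of |n - lam| * k(n)."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    # Sum over a window wide enough to hold essentially all the probability.
    spread = 12.0 * math.sqrt(lam) + 30.0
    lo = max(0, int(math.floor(lam - spread)))
    hi = int(math.ceil(lam + spread))
    total = 0.0
    for n in range(lo, hi + 1):
        # Work with the log of the Poisson probability to avoid overflow.
        log_pmf = -lam + n * math.log(lam) - math.lgamma(n + 1)
        total += abs(n - lam) * math.exp(log_pmf)
    return total / lam  # E(N) = lam for the Poisson distribution


for lam in (10.0, 100.0, 1000.0, 10000.0):
    print(lam, poisson_mean_deviation_ratio(lam))
```

The printed ratio shrinks roughly in proportion to the inverse square root of the expected sample size, which is the behavior condition (4) requires.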
Constant Weight Samples

It was noted by Bardsley (this issue) that if sediment samples are of constant weight and E(N) is large, then N is approximately distributed as an inverse Gaussian random variable. Following Johnson and Kotz (1970, Ch. 15), the D/E(N) ratio of an inverse Gaussian distribution can be written

$$4 \exp(2\phi)\,\Phi(-2\phi^{1/2}) \tag{5}$$

where $\Phi$ is the standard normal integral. The parameter $\phi$ determines the shape of the distribution and is related to sample weight by

$$\phi = W E(Z)/\sigma_Z^2 \tag{6}$$
where W is the sample weight, and E(Z) and $\sigma_Z^2$ are the mean and variance of the distribution of particle weights. Since E(Z) and $\sigma_Z^2$ are constant for a given sediment population, $E(N) \to \infty$ implies $W \to \infty$ and $\phi \to \infty$. When $\phi$ is large, (5) can be approximated by the expression

$$(2\pi^{-1}\phi^{-1})^{1/2}\,(1 - \tfrac{1}{4}\phi^{-1}) \tag{7}$$

and it is evident that $E(N) \to \infty$ implies (7) $\to 0$ and thus condition (4) holds. Because the inverse Gaussian distribution is used as an approximation holding for large E(N), it is not possible in this case to determine how large E(N) should be in order to overcome the effect of random sample size on the distribution of X1.
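The decay of the inverse Gaussian ratio can be checked numerically. The following sketch evaluates expression (5) directly and compares it with the large-value approximation (7). The function names are hypothetical, and the correction term in (7) is taken here as 1/(4φ), which is what the standard expansion of the normal tail gives. Direct evaluation of exp(2φ) overflows in double precision for shape values above roughly 350, so the sketch is meant for moderate values only.

```python
import math


def inverse_gaussian_ratio(phi: float) -> float:
    """Expression (5): 4 * exp(2*phi) * Phi(-2*sqrt(phi))."""
    z = 2.0 * math.sqrt(phi)
    # Standard normal integral at -z, written via the complementary error function.
    normal_cdf_at_minus_z = 0.5 * math.erfc(z / math.sqrt(2.0))
    return 4.0 * math.exp(2.0 * phi) * normal_cdf_at_minus_z


def large_phi_approximation(phi: float) -> float:
    """Expression (7): sqrt(2 / (pi * phi)) * (1 - 1 / (4 * phi))."""
    return math.sqrt(2.0 / (math.pi * phi)) * (1.0 - 1.0 / (4.0 * phi))


for phi in (1.0, 10.0, 100.0):
    print(phi, inverse_gaussian_ratio(phi), large_phi_approximation(phi))
```

Both columns fall toward zero as the shape parameter grows, and they agree closely once it is large, consistent with condition (4) holding for constant weight samples.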
Constant Volume Samples

The distribution of N is unknown in this case and it is necessary to redefine the population as consisting only of those X values greater than some lower bound $\theta > 0$. Any upper bound in this new population will still be given by e as before. The lower bound need not be specified explicitly, but must be assumed to be sufficiently high so that those particles with X values greater than $\theta$ can be considered to be located independently of each other in space. This spatial randomness implies that k(n) will be a Poisson distribution with some expectation $\lambda$. It is readily verified that condition (4) holds for the Poisson distribution, and therefore the fixed sample size extreme value distributions can be applied to particle maxima drawn from sufficiently large constant volume samples.

Because k(n) is known exactly, it is possible to obtain information concerning the magnitude of $\lambda$ required to overcome the effect of random sample size. An investigation along these lines can be developed from eq. 13 of Bardsley and Manly (1979), written in terms of the symbolism of this paper as

$$R(H) = \exp[-\lambda(1 - H)] - H^{\lambda} \tag{8}$$
where H is an abbreviation for H(x), the distribution function of X. It was shown that the maximum value of the function R(H) occurs at some value 0 < H < 1, corresponding to R'(H) = 0. Provided that this maximum value is small compared to unity, the X1 values from random sample sizes will be distributed as if they were drawn from samples of constant size $\lambda$. $R_{\max}$ cannot be shown to be a decreasing function of $\lambda$, but a sharp upper bound to $R_{\max}$ will be derived which does possess this property.

From the derivative of (8), R(H) must be at a maximum when R'(H) = 0, giving the equality

$$\exp[-\lambda(1 - H)] = H^{\lambda-1} \tag{9}$$

therefore

$$R_{\max} = H^{\lambda-1} - H^{\lambda} \tag{10}$$
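The size of the maximum of R(H) can also be examined numerically. The sketch below evaluates expression (8) on a grid to locate its maximum for a given Poisson expectation, and compares the result with the closed-form bound (12) that is derived in the paragraph that follows. The function names, the grid resolution, and the particular expectations printed are illustrative assumptions rather than anything specified in the paper.

```python
import math


def r_of_h(h: float, lam: float) -> float:
    """Expression (8): exp[-lam * (1 - h)] - h**lam."""
    return math.exp(-lam * (1.0 - h)) - h ** lam


def r_max_numeric(lam: float, steps: int = 20000) -> float:
    """Largest value of R(H) found on a uniform grid over 0 < H < 1."""
    return max(r_of_h(i / steps, lam) for i in range(1, steps))


def r_max_bound(lam: float) -> float:
    """Expression (12): (lam - 1)**(lam - 1) / lam**lam, computed via logarithms."""
    return math.exp((lam - 1.0) * math.log(lam - 1.0) - lam * math.log(lam))


for lam in (5.0, 37.0, 40.0, 200.0):
    print(lam, r_max_numeric(lam), r_max_bound(lam))
```

For every expectation tried the grid maximum stays below the closed-form bound, and the bound itself drops under 0.01 a little before an expectation of 40.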
An upper bound to $R_{\max}$ can now be obtained by assuming the value of H to be such that (10) is at its maximum. The derivative of (10) with respect to H is given by

$$(\lambda - 1)H^{\lambda-2}\{1 - H[\lambda/(\lambda - 1)]\} \tag{11}$$

and it is apparent that for $\lambda > 1$, (10) has a single maximum on 0 < H < 1, at $H = (\lambda - 1)/\lambda$. Substituting for H in (10) and simplifying gives the upper bound to $R_{\max}$

$$(\lambda - 1)^{\lambda-1}/\lambda^{\lambda} \tag{12}$$
which is a decreasing function of $\lambda$. If 0.01 is taken to be a "small" value of $R_{\max}$, then it can be shown from calculation of (12) that this corresponds to $\lambda \approx 40$, a very small expected value for most sediment sampling. Of course, the fact that the effect of random sample size can be ignored for quite small values of $\lambda$ does not imply that the sample maxima will converge rapidly to an asymptotic extreme value distribution.

Subsamples

Suppose that J subsamples are created by splitting a single large sample in such a way that there is a constant probability $p = J^{-1}$ that any particle will be placed into any one subsample. The subsample size N will therefore follow a binomial distribution with parameters M and p, where M is the size of the original sample. Since p is constant, E(N) increases with M, and condition (4) can be readily verified from the expression for the mean deviation of the binomial distribution.

A practical point arises here concerning the sample splitting procedure. Since only the largest particle in each subsample will be measured, it is not necessary to physically split the entire sample. For example, if the sediment consists predominantly of silt with a small proportion of sand, the silt could be removed first by sieving and only the sand fraction split. The maxima can then be identified visually, avoiding the need to sieve the individual subsamples.

An alternative approach is to avoid any mechanical splitting and simply measure the T largest particles in the sieved coarse fraction. These measurements can then be "split" into subsamples using a random number generator and the maxima directly obtained. Given that it is desired to obtain J maxima from J subsamples, it is possible to calculate, for a given level of probability, the magnitude of T which will be required for the allocation of at least one measurement into each subsample. Assuming that J is large, T can be obtained from

$$T = -J \ln[-\ln(r)\,J^{-1}] \tag{13}$$
where r is the specified probability that all subsamples contain at least one of the T measures. The relation (13) is a particular case of a more general formula given by Feller (1957, p. 94).
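The random "splitting" of the T largest measurements can be sketched as follows. The code computes T from (13), checks the resulting probability against the exact occupancy probability obtained by inclusion-exclusion, and allocates measurements to subsamples with a seeded random number generator. All function names, the seed, and the example values J = 20 and r = 0.95 are hypothetical choices made for illustration. The floor of J on the returned T is an added safeguard for small J, where the large-J formula is not intended to apply.

```python
import math
import random
from typing import Dict, List, Sequence


def required_measurements(j: int, r: float) -> int:
    """Expression (13), T = -J ln[-ln(r) / J], rounded up; never fewer than J."""
    if j < 1 or not 0.0 < r < 1.0:
        raise ValueError("need j >= 1 and 0 < r < 1")
    return max(j, math.ceil(-j * math.log(-math.log(r) / j)))


def prob_all_subsamples_filled(t: int, j: int) -> float:
    """Exact probability that t random allocations leave none of j subsamples empty."""
    return sum((-1) ** k * math.comb(j, k) * (1.0 - k / j) ** t for k in range(j + 1))


def split_maxima(sizes: Sequence[float], j: int, seed: int = 0) -> List[float]:
    """Randomly allocate measurements to j subsamples; return maxima of the non-empty ones."""
    rng = random.Random(seed)
    maxima: Dict[int, float] = {}
    for x in sizes:
        cell = rng.randrange(j)  # each subsample is equally likely, p = 1/J
        if cell not in maxima or x > maxima[cell]:
            maxima[cell] = x
    return [maxima[cell] for cell in sorted(maxima)]


t = required_measurements(20, 0.95)
print(t, prob_all_subsamples_filled(t, 20))
```

If an allocation does leave a subsample empty, `split_maxima` simply returns fewer than J maxima, so the caller can detect the shortfall and either measure more particles or repeat the allocation.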
The desired estimate of -e is achieved by transforming back the estimate of -g(e). It should be emphasized that this estimation technique is based on an implicit transformation of h(x). The method is therefore not applicable to data sets other than sample extremes since a normalizing transformation applied to a three-parameter Weibull distribution will not in general result in a Weibull distribution.

THE PHI TRANSFORMATION

The previous discussion has been concerned with the estimation of e, where e represents an upper bound to some untransformed size measure X. An alternative approach is to estimate e', where e' is the lower bound of a new variable Y, obtained from the phi transformation

$$Y = -\log_2(X) \tag{14}$$
where X is measured in millimeters. An estimate of e' could be obtained from (2) using transformed order statistics. Alternatively e' could be estimated as the location parameter of a Weibull distribution based on the phi transform of a number of X1 values. In the latter case no sign change is required since this is included in (14).

It is not possible to make any general statement concerning whether estimation should best be carried out using Y or X. For Weibull-based estimation, it is known that if h(x) is a Type III extreme value distribution then for constant sample size, -X1 will be distributed exactly as a Weibull random variable. In this case, the linear measure would be the most appropriate since for random sample size, the convergence of X1 to a fixed sample size distribution also represents convergence to the asymptotic distribution. Equivalently, the phi measure would be more appropriate if the distribution of X was such that Y followed a Weibull distribution. It follows that $-\log_2(X_1)$ would follow a Weibull distribution exactly for fixed sample size and the same comments with respect to random sample size apply as before.

The distribution of X in this situation could be termed "phi-Weibull" since h(x) is related to the Weibull through (14) in the same way that the lognormal is related to the normal through the log transformation. Making the appropriate change of variable yields the phi-Weibull density function of X

$$c\gamma^{-1}x^{-1}\left[\ln(e/x)\gamma^{-1}\right]^{c-1}\exp\left\{-\left[\ln(e/x)\gamma^{-1}\right]^{c}\right\},$$
O