Efficiency of composite sampling for estimating a lognormal distribution

Environmental and Ecological Statistics 11, 283–294, 2004

ABEER EL-BAZ and TAPAN K. NAYAK
Department of Statistics, George Washington University, Washington, DC 20052, USA

Received January 2002; Revised July 2003

In many environmental studies, measuring the amount of a contaminant in a sampling unit is expensive. In such cases, composite sampling is often used to reduce data collection cost. However, composite sampling is known to be beneficial for estimating the mean of a population, but not necessarily for estimating the variance or other parameters. As some applications, for example, Monte Carlo risk assessment, require an estimate of the entire distribution, and as the lognormal model is commonly used in environmental risk assessment, in this paper we investigate the efficiency of composite sampling for estimating a lognormal distribution. In particular, using simulation and asymptotic calculations, we examine the magnitude of savings in the number of measurements over simple random sampling, and the nature of its dependence on composite size and the parameters of the distribution.

Keywords: asymptotic efficiency, Kullback–Leibler divergence, mean squared error, method of moments, simulation

1352-8505 © 2004 Kluwer Academic Publishers

1. Introduction

As health risks to individuals from environmental hazards can vary substantially due to differences in physical characteristics, susceptibility, exposure, and other factors, many researchers (e.g., Thompson et al., 1992; Finley and Paustenbach, 1994; Keenan et al., 1994) have recently suggested that risk assessment and policy decisions for risk management should be based on estimates of the risk distribution rather than on estimates of its mean or certain percentiles. Thus, there is now more emphasis on estimating the entire risk distribution, frequently using Monte Carlo methods, which require distributions of relevant risk factors as inputs. The accuracy of the input distributions naturally affects the precision of the estimated risk distribution. In many cases, an input distribution is assumed to be lognormal (e.g., Roseberry and Burmaster, 1992; Finley et al., 1994) and its parameters are then estimated from a sample. Estimation of a lognormal distribution also arises in other environmental applications (see Ott, 1995).

Due to the high cost of measuring the amount of many chemical and biological contaminants in samples of air, water, soil, or tissue, it is often difficult to collect data from each of a large number of sampling units. Thus, large samples are unavailable for estimating distributions of many environmental risk factors. As estimates based on small samples are likely to be unreliable, composite sampling is often used in environmental data collection. In composite sampling, groups of original sampling units are first mixed physically to form composited units, and then measurements are made on each composited unit. In effect, composite sampling compiles information from a larger number of sampling units than the number of measurements. When the number of measurements is fixed, a composite sample typically contains more information than a simple random sample. Composite sampling is advantageous when the cost of measurement is high but the cost of gathering physical sampling units is small. Confidentiality is another reason for using composite sampling (e.g., Gastwirth and Hammick, 1989). Various uses of composite sampling in environmental studies can be found in Mack and Robinson (1985), Gore and Patil (1994), Lovison et al. (1994), Patil et al. (1994), and Gore et al. (1996).

The gain in information from composite sampling is substantial for estimating the mean of a distribution, but not necessarily for estimating the variance or other parameters (see Edland and van Belle, 1994). Although the mean is a parameter of common interest, an estimate of the whole distribution is required in some applications, including Monte Carlo risk assessment. Thus, it is useful to investigate the benefits of composite sampling for estimating a distribution. As the lognormal model is common in environmental applications, in this paper we study the efficiency of composite sampling, relative to simple random sampling (SRS), for estimating a lognormal distribution with density

$$ f(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(\ln(x) - \mu)^2}{2\sigma^2} \right\}, \quad x > 0,\ \sigma > 0, \tag{1} $$

where both $\mu$ and $\sigma$ are unknown parameters. In this paper, we consider only the case of fixed composite size. Thus, each composite unit is formed by mixing a fixed number, say $k$, of the original sampling units.
We also assume that the composite samples are formed at random, and that the effects of mixing and measurement errors are negligible. Then the measured value for a composite unit is equivalent to the average of the values of the $k$ SRS units that formed the composite unit. Thus, if $X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, k$, is a sample of original units, the composited values are $Y_i = \frac{1}{k} \sum_{j=1}^{k} X_{ij}$, $i = 1, \ldots, n$.

In the next section we present estimators of $\mu$ and $\sigma^2$ derived using the method of moments and discuss their asymptotic properties. The theoretical results are used to compare the large sample efficiency of composite sampling relative to simple random sampling. In Section 3, we present results from a simulation study investigating the efficiency of composite sampling in small samples. There, we use the mean squared error (MSE) to examine efficiencies of the estimators of $\mu$ and $\sigma^2$, and the Kullback–Leibler (KL) divergence measure for assessing efficiency of the resulting estimators of the distribution. Our results show that composite sampling is quite beneficial for estimating $\mu$ and $\sigma^2$ as well as the entire distribution.
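To make the compositing model concrete, the following minimal Python sketch simulates composited values under the assumptions stated above (random grouping, no mixing or measurement error). The function name `composite_sample`, the seed, and the parameter values are illustrative choices and do not come from the paper.

```python
import random

def composite_sample(n, k, mu, sigma, rng):
    """Return n composite values, each the mean of k lognormal draws.

    mu and sigma are the mean and standard deviation of ln(X).
    """
    composites = []
    for _ in range(n):
        # k original sampling units that are physically mixed together
        units = [rng.lognormvariate(mu, sigma) for _ in range(k)]
        # an ideal measurement on the mixture equals the arithmetic mean
        composites.append(sum(units) / k)
    return composites

rng = random.Random(2004)  # arbitrary seed, for reproducibility only
y = composite_sample(n=10, k=3, mu=0.0, sigma=0.3 ** 0.5, rng=rng)
print(len(y), all(v > 0 for v in y))
```

Here 30 original units are drawn but only 10 values are "measured", which is the trade-off the paper studies.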

2. The estimators and their properties

The assumption that $\{X_{ij}\}$ is a random sample from (1) implies that the composited values $Y_i$, $i = 1, \ldots, n$, form a random sample from the sampling distribution of the mean of a random sample of size $k$ from the lognormal distribution in (1). The exact analytical form of the pdf of $Y$ is not convenient for deriving the maximum likelihood estimators, so we shall use the method of moments for estimating $\mu$ and $\sigma^2$. The formulas for the $r$th raw and central moments of $X$ are (see Johnson and Kotz, 1970):

$$ \mu'_r = e^{r\mu + \frac{1}{2} r^2 \sigma^2}, $$

$$ \mu_r = e^{r\mu + r\sigma^2/2} \sum_{j=0}^{r} (-1)^j \binom{r}{j} e^{\frac{1}{2}\sigma^2 (r-j)(r-j-1)}. $$

In particular, $E(X) = \mu'_1 = e^{\mu + \sigma^2/2}$ and $V(X) = \mu_2 = e^{2\mu + \sigma^2}(e^{\sigma^2} - 1)$. For further information on the lognormal distribution see Johnson and Kotz (1970) and Crow and Shimizu (1988). The mean and variance of $Y$ are:

$$ E(Y) = E(X) = \mu'_1 = e^{\mu + \sigma^2/2}, $$

$$ \sigma_y^2 = V(Y) = \frac{V(X)}{k} = \frac{1}{k} e^{2\mu + \sigma^2}(e^{\sigma^2} - 1). $$
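As a quick numerical illustration of the moment formulas above, the sketch below evaluates the raw and central moments directly and compares the second central moment with the closed form for $V(X)$. The function names and the parameter values are assumptions made for this example only.

```python
import math

def raw_moment(r, mu, sigma2):
    """r-th raw moment of a lognormal variable: exp(r*mu + r^2*sigma2/2)."""
    return math.exp(r * mu + 0.5 * r * r * sigma2)

def central_moment(r, mu, sigma2):
    """r-th central moment, using the finite-sum formula quoted above."""
    total = sum(
        (-1) ** j * math.comb(r, j)
        * math.exp(0.5 * sigma2 * (r - j) * (r - j - 1))
        for j in range(r + 1)
    )
    return math.exp(r * mu + r * sigma2 / 2) * total

mu, sigma2 = 0.0, 0.5  # illustrative values
variance_closed_form = math.exp(2 * mu + sigma2) * (math.exp(sigma2) - 1)
print(central_moment(2, mu, sigma2), variance_closed_form)
```

The two printed numbers should agree up to floating-point rounding, and `central_moment(1, ...)` should be zero.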

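By equating these two theoretical moments to $\bar{Y}$ and $s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ and solving for $\mu$ and $\sigma^2$, we get the following estimators:

$$ \hat{\mu} = \log\left( \frac{\bar{Y}^2}{(\bar{Y}^2 + k s_y^2)^{1/2}} \right), \tag{2} $$

$$ \hat{\sigma}^2 = \log\left( 1 + \frac{k s_y^2}{\bar{Y}^2} \right). \tag{3} $$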
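Equations (2) and (3) translate almost directly into code. The following is a minimal sketch, assuming the sample variance uses the divisor $n - 1$; the function name `mom_estimates`, the seed, and the simulated data are hypothetical and serve only to exercise the formulas.

```python
import math
import random
import statistics

def mom_estimates(y, k):
    """Method of moments estimates (mu_hat, sigma2_hat) from composite values.

    y : measurements on composites, each formed from k original units
    k : composite size (k = 1 corresponds to simple random sampling)
    """
    ybar = statistics.fmean(y)
    s2y = statistics.variance(y)  # sample variance, divisor n - 1
    sigma2_hat = math.log(1.0 + k * s2y / ybar ** 2)                # Equation (3)
    mu_hat = math.log(ybar ** 2 / math.sqrt(ybar ** 2 + k * s2y))  # Equation (2)
    return mu_hat, sigma2_hat

# Illustrative check on a large simulated composite sample (assumed values).
rng = random.Random(1)
k, mu, sigma2 = 3, 0.0, 0.5
y = [sum(rng.lognormvariate(mu, math.sqrt(sigma2)) for _ in range(k)) / k
     for _ in range(20000)]
mu_hat, sigma2_hat = mom_estimates(y, k)
print(round(mu_hat, 3), round(sigma2_hat, 3))
```

A useful identity for checking an implementation is $\hat{\mu} + \hat{\sigma}^2/2 = \log \bar{Y}$, which follows from (2) and (3).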

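Obviously, for $k = 1$, Equations (2) and (3) reduce to the moment estimators of $\mu$ and $\sigma^2$ from SRS data. The exact sampling distribution, or even the moments, of these estimators are difficult to derive, but we have the following theorem, which explains the behavior of the estimators for large $n$.

Theorem 1 The joint asymptotic distribution of $\hat{\mu}$ and $\hat{\sigma}^2$ in (2) and (3), as $n \to \infty$, is normal. Specifically,

$$ \sqrt{n} \left[ \begin{pmatrix} \hat{\mu} \\ \hat{\sigma}^2 \end{pmatrix} - \begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix} \right] \xrightarrow{L} N\left( 0, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} \right), $$

where

$$ \sigma_{11} = \frac{e^{-2\sigma^2}(e^{\sigma^2} - 1)}{4k} \left[ e^{5\sigma^2} + e^{4\sigma^2} - 7e^{3\sigma^2} + 9e^{2\sigma^2} - 2e^{\sigma^2} + 2 + 2k(e^{\sigma^2} - 1) \right], \tag{4} $$

$$ \sigma_{22} = \frac{e^{-2\sigma^2}(e^{\sigma^2} - 1)^2}{k} \left[ e^{4\sigma^2} + 2e^{3\sigma^2} - e^{2\sigma^2} - 2 + 2k \right], \tag{5} $$

and

$$ \sigma_{12} = \sigma_{21} = \frac{e^{-2\sigma^2}(e^{\sigma^2} - 1)^2}{2k} \left[ -e^{4\sigma^2} - 2e^{3\sigma^2} + 3e^{2\sigma^2} + 2 - 2k \right]. \tag{6} $$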

The proof of the theorem is provided in the Appendix. The theorem shows that the estimators are asymptotically unbiased. Also, the expressions for $\sigma_{11}$, $\sigma_{22}$, and $\sigma_{12}$ in (4)–(6) show that the asymptotic efficiencies of the estimators depend only on $k$ and $\sigma^2$, but not on $\mu$. The theorem is useful for calculating approximate confidence intervals for $\mu$ and $\sigma^2$, testing hypotheses about them, and determining the necessary sample size in a study. The asymptotic variance formulas for method of moments estimators based on SRS data can be obtained by putting $k = 1$ in (4)–(6). Let $\hat{\mu}_{SRS}$ and $\hat{\sigma}^2_{SRS}$ denote the estimators of $\mu$ and $\sigma^2$ based on an SRS sample of size $m$. Also let

$$ a = e^{\sigma^2}, \quad b = a^5 + a^4 - 7a^3 + 9a^2 - 2a + 2, \quad \text{and} \quad c = a^4 + 2a^3 - a^2 - 2. $$

Then, for large $m$, approximate variances of $\hat{\mu}_{SRS}$ and $\hat{\sigma}^2_{SRS}$ are:

$$ V(\hat{\mu}_{SRS}) \approx \frac{a - 1}{4 m a^2} \left[ b + 2(a - 1) \right], \tag{7} $$

$$ V(\hat{\sigma}^2_{SRS}) \approx \frac{(a - 1)^2}{m a^2} \left[ c + 2 \right]. \tag{8} $$

Approximate variances of $\hat{\mu}$ and $\hat{\sigma}^2$ based on $n$ composited values are:

$$ V(\hat{\mu}) \approx \frac{a - 1}{4 n k a^2} \left[ b + 2k(a - 1) \right], \tag{9} $$

$$ V(\hat{\sigma}^2) \approx \frac{(a - 1)^2}{n k a^2} \left[ c + 2k \right]. \tag{10} $$
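The asymptotic quantities in (4)–(6) are straightforward to evaluate numerically. The sketch below computes them in terms of $a$, $b$, and $c$, and then forms an approximate interval for $\mu$. The Wald-type interval, the plug-in of an estimate for $\sigma^2$, and the value $z = 1.96$ are assumptions of this example, since the paper only notes that the theorem can be used for such intervals. All function names and input values are hypothetical.

```python
import math

def asymptotic_cov(sigma2, k):
    """Return (s11, s22, s12) from Equations (4)-(6) for composite size k."""
    a = math.exp(sigma2)
    b = a**5 + a**4 - 7 * a**3 + 9 * a**2 - 2 * a + 2
    c = a**4 + 2 * a**3 - a**2 - 2
    s11 = (a - 1) / (4 * k * a**2) * (b + 2 * k * (a - 1))
    s22 = (a - 1) ** 2 / (k * a**2) * (c + 2 * k)
    s12 = (a - 1) ** 2 / (2 * k * a**2) * (-a**4 - 2 * a**3 + 3 * a**2 + 2 - 2 * k)
    return s11, s22, s12

def wald_interval(estimate, asymptotic_variance, n, z=1.96):
    """Approximate interval: estimate +/- z * sqrt(asymptotic_variance / n)."""
    half_width = z * math.sqrt(asymptotic_variance / n)
    return estimate - half_width, estimate + half_width

# Hypothetical estimates from n = 13 composites of size k = 3.
mu_hat, sigma2_hat, n, k = 0.0, 0.5, 13, 3
s11, s22, s12 = asymptotic_cov(sigma2_hat, k)  # plug-in of sigma2_hat (assumption)
low, high = wald_interval(mu_hat, s11, n)
print(round(s11, 4), round(s22, 4), round(s12, 4), (round(low, 3), round(high, 3)))
```

Setting `k = 1` in `asymptotic_cov` gives the SRS variances in (7) and (8) after dividing by $m$. For small $n$ such intervals may be inaccurate, as the simulation results in Section 3 suggest for the asymptotic formulas in general.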

To investigate the magnitude of cost savings from using composite sampling, we shall examine the ratio $n_m/m$, where $n_m$ is such that the efficiency of an estimator based on $n_m$ composited observations equals the efficiency of an estimator based on an SRS sample of size $m$. Logically, this ratio may depend on $m$, the composite size $k$, the parameters of the true distribution, and the choice of the estimators and efficiency measures. Let us first consider estimation of $\mu$, where for large $m$ the approximate variances in (7) and (9) are reasonable efficiency measures of the two estimators. Equating the right sides of (7) and (9), we get

$$ \frac{n_m}{m} \approx \frac{b + 2k(a - 1)}{bk + 2k(a - 1)}. \tag{11} $$

Interestingly, the ratio in (11) depends only on $k$ and $\sigma^2$, but not on $m$ and $\mu$. It can be shown analytically that the ratio is less than 1; decreasing in $k$; $n_m/m \to 1/k$ as $a \to \infty$ or $a \to 1$; and $n_m/m \to [1 + b/2(a - 1)]^{-1}$ as $k \to \infty$. Fig. 1 plots the ratio in (11) against $a$ for $k = 2, 3, 4$, and 5. Note that the transformation $\sigma^2 \to a$ is one-to-one (and monotonically increasing), and $a$ has a more convenient interpretation than $\sigma^2$. Specifically, $a$ equals 1 plus the square of the coefficient of variation (CV) of the distribution in (1).

We may note some useful information conveyed by Fig. 1. For each $k$, as $a$ increases, the ratio $n_m/m$ first increases and then decreases. Also, the range of the ratio increases as $k$ increases. However, the value of $a$ at which the ratio attains its maximum (i.e., composite sampling is least beneficial) does not seem to depend on the value of $k$. For each $k$, the maximum is attained at a value of $a$ close to 1.5 (CV = 0.707). It may also be noted that for $a < 1.1$ (CV < 0.316) and for $a > 3$ (CV > 1.414) the ratio is close to $1/k$. For $k = 2$, $n_m/m$ is between 0.5 and 0.56. Thus, compared to SRS, composite sampling with $k = 2$ requires between 50% and 56% of the measurements for estimating $\mu$ equally efficiently. Analogous features can be seen for other values of $k$.
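The ratio in (11) can be evaluated directly, which gives a numerical check of the features described above. This is a minimal sketch; the function name `ratio_mu` and the chosen inputs are illustrative.

```python
import math

def ratio_mu(a, k):
    """Asymptotic ratio n_m / m from Equation (11), with a = exp(sigma^2)."""
    b = a**5 + a**4 - 7 * a**3 + 9 * a**2 - 2 * a + 2
    return (b + 2 * k * (a - 1)) / (b * k + 2 * k * (a - 1))

# Ratio near the reported maximum (a close to 1.5) for k = 2.
print(round(ratio_mu(1.5, 2), 4))

# Composited measurements needed to match an SRS of size m = 20
# when sigma^2 = 0.3 and k = 2.
m = 20
n_needed = m * ratio_mu(math.exp(0.3), 2)
print(round(n_needed, 4))
```

With $m = 20$, $\sigma^2 = 0.3$, and $k = 2$ the sketch gives about 11.0 composited measurements, which agrees with the asymptotic entry 11.0019 reported for $\mu$ in Table 1.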


Figure 1. Relative composite sample size for estimating $\mu$.

Similarly, for estimating $\sigma^2$, equating (8) and (10), we get

$$ \frac{n_m}{m} \approx \frac{c + 2k}{ck + 2k}. \tag{12} $$

Fig. 2 plots the ratio in (12) against $a$ for $k = 2, 3, 4$, and 5. Consistent with the figure, we can show analytically that (12) is less than 1; decreasing in $k$ and $a$; $n_m/m \to 1/k$ as $a \to \infty$; and $n_m/m \to [1 + c/2]^{-1}$ as $k \to \infty$. The plots show that composite sampling provides substantial savings in the number of measurements for estimating $\sigma^2$ unless $a$ is close to 1 (i.e., CV is close to 0).
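The analogous calculation for $\sigma^2$ uses (12). The sketch below evaluates it for the composite sizes considered in the paper; the function name `ratio_sigma2` and the loop over $k$ are illustrative choices.

```python
import math

def ratio_sigma2(a, k):
    """Asymptotic ratio n_m / m from Equation (12), with a = exp(sigma^2)."""
    c = a**4 + 2 * a**3 - a**2 - 2
    return (c + 2 * k) / (c * k + 2 * k)

# Composited measurements needed to match an SRS of size m = 20
# for sigma^2 = 0.3 and k = 2, ..., 5.
m = 20
a = math.exp(0.3)
needed = {k: m * ratio_sigma2(a, k) for k in (2, 3, 4, 5)}
for k, value in needed.items():
    print(k, round(value, 4))
```

For $k = 2$ the sketch gives about 13.12, in agreement with the asymptotic entry 13.1166 reported for $\sigma^2$ in Table 1.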

Figure 2. Relative composite sample size for estimating $\sigma^2$.


3. Simulation results

As efficiency comparisons based on asymptotic results are valid for large samples, in this section we present results from a simulation study investigating the efficiency of composite sampling in small samples. As our estimators are not unbiased, we shall use the MSE for judging and comparing estimators of $\mu$ and $\sigma^2$. For measuring the efficiency of a procedure for estimating the distribution, we shall use the average of the KL divergence between the true and the estimated distributions. The KL divergence between two distributions with pdfs $f(x)$ and $g(x)$ is defined as $\int f(x) \log\{f(x)/g(x)\}\,dx$. It can be seen that when $f(x)$ and $g(x)$ are the lognormal distributions with parameters $(\mu, \sigma^2)$ and $(\hat{\mu}, \hat{\sigma}^2)$, respectively, the KL divergence reduces to

$$ kl = \ln\hat{\sigma} - \ln\sigma - \frac{1}{2} + \frac{\sigma^2 + (\mu - \hat{\mu})^2}{2\hat{\sigma}^2}. \tag{13} $$
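Equation (13) is simple to compute once the estimates are available. The following minimal sketch implements it; the function name `kl_lognormal` and the example inputs are assumptions made for illustration.

```python
import math

def kl_lognormal(mu, sigma2, mu_hat, sigma2_hat):
    """KL divergence in Equation (13) between lognormal(mu, sigma2),
    the true distribution, and lognormal(mu_hat, sigma2_hat), the estimate."""
    # ln(sigma_hat) - ln(sigma) equals half the log of the variance ratio
    log_ratio = 0.5 * math.log(sigma2_hat / sigma2)
    return log_ratio - 0.5 + (sigma2 + (mu - mu_hat) ** 2) / (2.0 * sigma2_hat)

print(kl_lognormal(0.0, 1.0, 0.0, 1.0))   # identical distributions
print(kl_lognormal(0.0, 1.0, 0.2, 0.8))   # a hypothetical estimate
```

The divergence is zero when the estimated parameters equal the true ones and positive otherwise, so smaller averages indicate a better estimate of the distribution.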

Note that in our context, the estimated distribution is the lognormal distribution with the estimated parameter values. We shall use the average of (13) for measuring the efficiency of an estimator of the distribution in (1). In the following, we report simulation results for $\mu = 0$; $\sigma^2 = 0.3, 0.5, 1$, and 2 (the corresponding CVs are 0.59, 0.81, 1.31, and 2.53); and $k = 2, 3, 4$, and 5. We did not consider larger values of $k$ as they cause dilution, mixing, and other problems and hence are not suitable in many practical situations. We also used other values of $\mu$, but as expected from our discussion in the previous section, the results did not vary much with $\mu$.

For each value of $\sigma^2$, we first calculated (based on simulation) the MSEs of $\hat{\mu}_{SRS}$ and $\hat{\sigma}^2_{SRS}$, and the average of the KL measure in (13), when the estimates are obtained from an SRS of size 20. The summary measures for the SRS estimators are based on 10,000 simulated samples. Then, for each combination of $\sigma^2$, $k$, and $n$ values, we calculated the MSEs of $\hat{\mu}$ and $\hat{\sigma}^2$, and the average KL divergence for composite sample estimators, based on 10,000 simulated samples. Thus, we compare efficiencies of various composite sampling procedures with efficiencies of estimators based on an SRS of size 20. We also considered other SRS sample sizes, but do not report them here to save space.

Tables 1–4 report efficiency ratios of composite sample procedures relative to corresponding SRS procedures. Thus, larger (smaller) ratios indicate larger (smaller) benefits from using composite sampling. For example, in Table 1, the entries for $k = 3$ and $n = 10$ show that for $\sigma^2 = 0.3$, the relative efficiencies of composite sampling, compared to an SRS of size 20, are about 71.5% for estimating the distribution, 123.41% for estimating $\mu$, and 73.32% for estimating $\sigma^2$. Note that, in this case, the SRS sample requires 20 original sampling units and 20 measurements, whereas the composite sample requires 30 original sampling units but only 10 measurements.
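A simulation of this kind can be sketched in a few lines of Python. The version below is only an approximation of the study design described above: it assumes that each relative efficiency is the SRS summary (MSE or average KL) divided by the corresponding composite summary, uses fewer replications than the 10,000 in the paper to keep the run short, and relies on hypothetical helper names and an arbitrary seed. Its output will therefore differ somewhat from the tabulated values.

```python
import math
import random
import statistics

def mom_estimates(y, k):
    """Method of moments estimates; mu_hat is algebraically equal to (2)."""
    ybar = statistics.fmean(y)
    s2y = statistics.variance(y)  # divisor n - 1
    sigma2_hat = math.log(1.0 + k * s2y / ybar ** 2)
    mu_hat = math.log(ybar) - 0.5 * sigma2_hat
    return mu_hat, sigma2_hat

def kl_lognormal(mu, sigma2, mu_hat, sigma2_hat):
    """KL divergence of Equation (13)."""
    return (0.5 * math.log(sigma2_hat / sigma2) - 0.5
            + (sigma2 + (mu - mu_hat) ** 2) / (2.0 * sigma2_hat))

def performance(n, k, mu, sigma2, reps, rng):
    """Return (MSE of mu_hat, MSE of sigma2_hat, average KL) over reps samples."""
    sigma = math.sqrt(sigma2)
    se_mu = se_s2 = kl_sum = 0.0
    for _ in range(reps):
        y = [sum(rng.lognormvariate(mu, sigma) for _ in range(k)) / k
             for _ in range(n)]
        mu_hat, s2_hat = mom_estimates(y, k)
        se_mu += (mu_hat - mu) ** 2
        se_s2 += (s2_hat - sigma2) ** 2
        kl_sum += kl_lognormal(mu, sigma2, mu_hat, s2_hat)
    return se_mu / reps, se_s2 / reps, kl_sum / reps

rng = random.Random(20)           # arbitrary seed
mu, sigma2, reps = 0.0, 0.3, 4000  # reps reduced from 10,000 for speed
srs = performance(20, 1, mu, sigma2, reps, rng)   # SRS of size 20
comp = performance(10, 3, mu, sigma2, reps, rng)  # n = 10 composites, k = 3
re_mu, re_sigma2, re_kl = (s / c for s, c in zip(srs, comp))
print(round(re_kl, 3), round(re_mu, 3), round(re_sigma2, 3))
```

Under these assumptions, values above 1 indicate that the composite design is more efficient than the SRS of size 20 for that criterion.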
The same table also shows that when $\sigma^2 = 0.3$ and $k = 3$, all relative efficiencies are about 100% or larger for $n = 13$. Thus, from a statistical efficiency viewpoint, measuring 13 composited units (each a composite of three sampling units) is better than measuring each of 20 SRS units. This reduces the number of measurements by 35%. The simulation results show that composite sampling is more beneficial for estimating $\mu$ than for estimating $\sigma^2$ or the whole distribution. The relative efficiencies based on KL divergence (i.e., for estimating the whole distribution) are similar to the relative efficiencies of $\hat{\sigma}^2$. The cost savings from composite sampling for estimating either $\sigma^2$ or


Table 1. Relative efficiency of composite sampling compared to SRS of size 20 ($\mu = 0$, $\sigma^2 = 0.3$).

k   Measure          n = 8    n = 9    n = 10   n = 11   n = 12   n = 13   $n^*$
2   KL-RE            0.4507   0.5493   0.6261   0.7121   0.7981   0.8783   -
2   $\mu$-RE         0.7252   0.7997   0.895    0.9945   1.0863   1.1513   11.0019
2   $\sigma^2$-RE    0.6129   0.6489   0.6984   0.7639   0.7978   0.8597   13.1166
3   KL-RE            0.5087   0.6026   0.715    0.8284   0.9285   1.0279   -
3   $\mu$-RE         0.9728   1.1139   1.2341   1.3566   1.4598   1.5552   8.0026
3   $\sigma^2$-RE    0.6015   0.668    0.7332   0.809    0.8837   0.9227   10.8222
4   KL-RE            0.5333   0.6843   0.7858   0.8875   1.0223   1.1275   -
4   $\mu$-RE         1.1991   1.367    1.493    1.6446   1.7488   1.9332   6.5029
4   $\sigma^2$-RE    0.6335   0.7044   0.7624   0.8384   0.8691   0.9643   9.6749
5   KL-RE            0.5363   0.697    0.795    0.9327   1.036    1.1568   -
5   $\mu$-RE         1.3947   1.5417   1.7013   1.9088   2.0139   2.1689   5.6031
5   $\sigma^2$-RE    0.6486   0.6912   0.7814   0.8578   0.9013   0.9946   8.9866

Note: $\mu$ has very little effect on the relative efficiencies.

Table 2. Relative efficiency of composite sampling compared to SRS of size 20 ($\mu = 0$, $\sigma^2 = 0.5$).

k   Measure          n = 8    n = 9    n = 10   n = 11   n = 12   n = 13   $n^*$
2   KL-RE            0.4828   0.5661   0.6706   0.743    0.8571   0.9509   -
2   $\mu$-RE         0.7469   0.8263   0.928    0.9999   1.1007   1.1551   11.0245
2   $\sigma^2$-RE    0.6905   0.7426   0.7675   0.8194   0.8713   0.912    11.4669
3   KL-RE            0.5628   0.6807   0.8168   0.9261   1.0245   1.1283   -
3   $\mu$-RE         1.0019   1.1295   1.2378   1.3427   1.4039   1.587    8.0327
3   $\sigma^2$-RE    0.7061   0.7725   0.8422   0.8615   0.9222   0.9861   8.6225
4   KL-RE            0.5988   0.7485   0.8753   1.0321   1.1388   1.245    -
4   $\mu$-RE         1.1824   1.3591   1.4362   1.5913   1.7203   1.8647   6.5367
4   $\sigma^2$-RE    0.7185   0.8079   0.8406   0.9344   0.9642   1.0501   7.2004
5   KL-RE            0.6909   0.8231   0.9589   1.0804   1.2131   1.3686   -
5   $\mu$-RE         1.3292   1.4582   1.6665   1.805    1.9458   2.0963   5.6392
5   $\sigma^2$-RE    0.7491   0.803    0.8829   0.9444   1.0077   1.063    6.3471

Note: $\mu$ has very little effect on the relative efficiencies.

Table 3. Relative efficiency of composite sampling compared to SRS of size 20 ($\mu = 0$, $\sigma^2 = 1.0$).

k   Measure          n = 8    n = 9    n = 10   n = 11   n = 12   n = 13   $n^*$
2   KL-RE            0.556    0.6682   0.7597   0.8355   0.9153   1.0261   -
2   $\mu$-RE         0.7401   0.8557   0.8882   0.9662   1.0711   1.1721   10.2666
2   $\sigma^2$-RE    0.8013   0.8549   0.9038   0.9403   0.9784   1.0022   10.2289
3   KL-RE            0.68     0.8169   0.9279   1.0537   1.1766   1.2882   -
3   $\mu$-RE         0.9805   1.0988   1.1914   1.3045   1.3578   1.4754   7.0221
3   $\sigma^2$-RE    0.8642   0.9226   0.9695   1.0391   1.0781   1.1273   6.9719
4   KL-RE            0.7779   0.9187   1.0951   1.2111   1.3656   1.4837   -
4   $\mu$-RE         1.1716   1.2981   1.4384   1.5198   1.6428   1.7376   5.3999
4   $\sigma^2$-RE    0.9055   0.9678   1.0189   1.0818   1.1524   1.1902   5.3433
5   KL-RE            0.9004   1.0303   1.1775   1.3196   1.5097   1.6572   -
5   $\mu$-RE         1.3214   1.4306   1.561    1.6906   1.762    1.9152   4.4265
5   $\sigma^2$-RE    0.9354   1.0202   1.0861   1.115    1.2024   1.2206   4.3662

Note: $\mu$ has very little effect on the relative efficiencies.

Table 4. Relative efficiency of composite sampling compared to SRS of size 20 ($\mu = 0$, $\sigma^2 = 2.0$).

k   Measure          n = 8    n = 9    n = 10   n = 11   n = 12   n = 13   $n^*$
2   KL-RE            0.6601   0.7593   0.8311   0.9099   1.0023   1.0786   -
2   $\mu$-RE         0.8076   0.906    0.9455   1.0201   1.0645   1.1418   10.0056
2   $\sigma^2$-RE    0.849    0.9053   0.9346   0.9819   1.0363   1.0682   10.0054
3   KL-RE            0.8425   0.9643   1.054    1.19     1.3129   1.403    -
3   $\mu$-RE         1.0469   1.1268   1.2096   1.2781   1.393    1.4222   6.6742
3   $\sigma^2$-RE    0.9716   1.035    1.0843   1.1687   1.2181   1.2573   6.6738
4   KL-RE            0.9858   1.1331   1.2846   1.398    1.5664   1.6692   -
4   $\mu$-RE         1.2077   1.3221   1.4285   1.5265   1.6448   1.7101   5.0085
4   $\sigma^2$-RE    1.0838   1.1504   1.2246   1.272    1.3576   1.4031   5.0080
5   KL-RE            1.1508   1.235    1.4564   1.5876   1.7312   1.8884   -
5   $\mu$-RE         1.3514   1.4653   1.5852   1.7158   1.7937   1.9377   4.0090
5   $\sigma^2$-RE    1.1903   1.2276   1.3195   1.3838   1.4508   1.5149   4.0086

Note: $\mu$ has very little effect on the relative efficiencies.


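Table 5. Accuracy of simulation results (based on 100 repetitions).

Case                                           Measure          5th percentile   95th percentile   Standard deviation
$\mu = 0$, $\sigma^2 = 0.3$, n = 12, k = 2     KL-RE            0.7677           0.8090            0.0141
$\mu = 0$, $\sigma^2 = 0.3$, n = 12, k = 2     $\mu$-RE         1.0470           1.0956            0.0159
$\mu = 0$, $\sigma^2 = 0.3$, n = 12, k = 2     $\sigma^2$-RE    0.7776           0.8297            0.0154
$\mu = 0$, $\sigma^2 = 2$, n = 13, k = 4       KL-RE            1.6200           1.6899            0.0209
$\mu = 0$, $\sigma^2 = 2$, n = 13, k = 4       $\mu$-RE         1.6590           1.7484            0.0250
$\mu = 0$, $\sigma^2 = 2$, n = 13, k = 4       $\sigma^2$-RE    1.3738           1.4124            0.0125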

the whole distribution are also substantial, especially for populations with large CV. For example, for $\sigma^2 = 1.0$ (CV = 1.31) and $k = 4$, all relative efficiencies are more than 100% for $n = 10$. Also, as expected, the relative efficiency of composite sampling increases with $k$. It may be noted that any cost savings calculated based only on relative efficiencies are potential and depend on the relative costs of measurement, gathering of sampling units, and composite preparation. Limitations of composite sampling are further discussed in Section 4.

We also calculated the values of $n_m$ based on (11) and (12). Those values are reported in the last columns ($n^*$) of Tables 1–4. Thus, the values of $n^*$ are the numbers of composited measurements needed, based on the asymptotic formulas in (11) and (12), for composite sampling to be as efficient as an SRS of size 20. The tables show that for $m = 20$, the $n^*$ values are overly optimistic, especially for small CV and for estimating $\sigma^2$. They are expected to be more accurate for larger $m$. Thus, for smaller sample sizes the simulation results provide a better guide than the asymptotic results.

Finally, in order to get an idea about the precision of our simulation results, we repeated the simulation program 100 times in two selected cases. Thus, in each of the two cases, and for each of the three relative efficiency measures, we obtained 100 simulated values. Table 5 summarizes the results by providing the 5th and the 95th percentiles and the standard deviations. The margin of error (twice the standard deviation) is 0.05 or less in all cases. The table also shows that our simulation results are more accurate for smaller CV.

4. Discussion

It is well known that composite sampling is cost effective for estimating the mean of a population, but not necessarily for estimating other parameters. This fact may curtail uses of composite sampling in environmental data gathering activities, especially when the data are expected to be used by various users for different inferential purposes. However, our investigations show that for lognormal population distributions, composite sampling is also beneficial for estimating $\sigma^2$ as well as the entire distribution. We believe this is significant as it provides additional justification for using composite sampling in environmental studies, where the lognormal distribution serves as a common model. Our results also confirm that composite sampling provides higher cost savings for estimating $\mu$ than for estimating $\sigma^2$ or the whole distribution. Our analytical and numerical results, and the related methodology, may be useful in determining optimal composite sampling schemes, that is, for determining the composite size ($k$) and the number of measurements ($n$) when the total budget for data gathering and the costs of sampling and measurement are given. Theorem 1 may also be useful for testing hypotheses about $\mu$ and $\sigma^2$ and constructing confidence intervals for them when $m$ is large. Derivation of small sample methods is a future research topic.

Composite sampling has some limitations, which we review briefly in the following. Most measuring instruments cannot measure the actual level of a contaminant below their detection limits. So, when one or two sample units with detectable values are combined with several sample units with fairly small or non-detectable values, the actual amount of the contaminant in the composite sample might fall below the detection limit, resulting in a loss of information. This effect of dilution becomes more severe as the composite size increases. Thus, the result that the efficiency of composite sampling increases with composite size, derived in an ideal setting, may not apply in practice. In real applications, composite size is usually five or less. For this reason, in our simulation study we did not consider composite sizes greater than five. Another potential problem with composite sampling is that the level of some contaminants, especially volatile chemicals, may change during mixing (see Cline and Severin, 1989). Also, composite sampling may not be appropriate if interactions among the sample units occur during mixing. In compositing, we forfeit information on individual sampling units. This information loss may be significant when the goal is to identify extreme sampling units (Gore and Patil, 1994). Thus, one should carefully examine the suitability of composite sampling for the specific inferential objectives of a study.

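Appendix

Proof of Theorem 1: The following fact can be found in Ferguson (1996, p. 49):

$$ \sqrt{n} \left[ \begin{pmatrix} \bar{y} \\ s_y^2 \end{pmatrix} - \begin{pmatrix} \mu_y \\ \sigma_y^2 \end{pmatrix} \right] \xrightarrow{L} N(0, \Sigma_1), $$

where

$$ \Sigma_1 = \begin{pmatrix} \sigma_y^2 & \mu_{y3} \\ \mu_{y3} & \mu_{y4} - \sigma_y^4 \end{pmatrix}, $$

where $\mu_{yj}$ is the $j$th central moment of $Y$. We shall prove Theorem 1 by using the preceding fact and the multivariate $\delta$-method. To apply the $\delta$-method, we first calculate the following derivatives:

$$ \frac{\partial \hat{\mu}}{\partial \bar{y}} = \frac{\bar{y}^2 + 2k s_y^2}{\bar{y}(\bar{y}^2 + k s_y^2)}, \qquad \frac{\partial \hat{\mu}}{\partial s_y^2} = -\frac{k}{2(\bar{y}^2 + k s_y^2)}, $$

$$ \frac{\partial \hat{\sigma}^2}{\partial \bar{y}} = -\frac{2k s_y^2}{\bar{y}(\bar{y}^2 + k s_y^2)}, \qquad \frac{\partial \hat{\sigma}^2}{\partial s_y^2} = \frac{k}{\bar{y}^2 + k s_y^2}. $$

Note that

$$ \hat{\mu}\,\big|_{\bar{y} = \mu_y,\ s_y^2 = \sigma_y^2} = \mu, \qquad \hat{\sigma}^2\,\big|_{\bar{y} = \mu_y,\ s_y^2 = \sigma_y^2} = \sigma^2, $$

and

$$ D = \begin{pmatrix} \dfrac{\partial \hat{\mu}}{\partial \bar{y}} & \dfrac{\partial \hat{\mu}}{\partial s_y^2} \\ \dfrac{\partial \hat{\sigma}^2}{\partial \bar{y}} & \dfrac{\partial \hat{\sigma}^2}{\partial s_y^2} \end{pmatrix} \Bigg|_{\bar{y} = \mu_y,\ s_y^2 = \sigma_y^2} = \begin{pmatrix} \dfrac{\mu_x^2 + 2\sigma_x^2}{\mu_x(\mu_x^2 + \sigma_x^2)} & -\dfrac{k}{2(\mu_x^2 + \sigma_x^2)} \\ -\dfrac{2\sigma_x^2}{\mu_x(\mu_x^2 + \sigma_x^2)} & \dfrac{k}{\mu_x^2 + \sigma_x^2} \end{pmatrix}. $$

By the $\delta$-method, $\sqrt{n}\,[(\hat{\mu} - \mu), (\hat{\sigma}^2 - \sigma^2)]'$ converges to a bivariate normal distribution with mean vector 0 and variance-covariance matrix

$$ \Sigma = D \Sigma_1 D' = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}, $$

where

$$ \sigma_{11} = \frac{(\mu_x^2 + 2\sigma_x^2)^2 \sigma_x^2}{k \mu_x^2 (\mu_x^2 + \sigma_x^2)^2} - \frac{2k\mu_{y3}(\mu_x^2 + 2\sigma_x^2)}{2\mu_x(\mu_x^2 + \sigma_x^2)^2} + \frac{k^2\mu_{y4} - \sigma_x^4}{4(\mu_x^2 + \sigma_x^2)^2}, $$

$$ \sigma_{22} = \frac{4\sigma_x^6}{k \mu_x^2 (\mu_x^2 + \sigma_x^2)^2} - \frac{4k\mu_{y3}\sigma_x^2}{\mu_x(\mu_x^2 + \sigma_x^2)^2} + \frac{k^2\mu_{y4} - \sigma_x^4}{(\mu_x^2 + \sigma_x^2)^2}, $$

$$ \sigma_{12} = \sigma_{21} = \frac{k\mu_{y3}\sigma_x^2}{\mu_x(\mu_x^2 + \sigma_x^2)^2} + \frac{k\mu_{y3}(\mu_x^2 + 2\sigma_x^2)}{\mu_x(\mu_x^2 + \sigma_x^2)^2} - \frac{2\sigma_x^4(\mu_x^2 + 2\sigma_x^2)}{k \mu_x^2 (\mu_x^2 + \sigma_x^2)^2} - \frac{k^2\mu_{y4} - \sigma_x^4}{2(\mu_x^2 + \sigma_x^2)^2}. $$

The proof of the theorem can now be completed by simplifying the elements of $\Sigma$ using the following relations:

$$ \mu_{y3} = \frac{\mu_{x3}}{k^2} = \frac{e^{3\mu + 3\sigma^2/2}(e^{\sigma^2} - 1)^2 (e^{\sigma^2} + 2)}{k^2}, $$

$$ \mu_{y4} = \frac{\mu_{x4}}{k^3} + \frac{3(k - 1)\sigma_x^4}{k^3} = \frac{e^{4\mu + 2\sigma^2}(e^{\sigma^2} - 1)^2}{k^3} \left[ e^{4\sigma^2} + 2e^{3\sigma^2} + 3e^{2\sigma^2} + 3k - 6 \right], $$

$$ \mu_x^2 + 2\sigma_x^2 = e^{2\mu + \sigma^2}(2e^{\sigma^2} - 1), $$

and

$$ (\mu_x^2 + \sigma_x^2)^2 = e^{4\mu + 4\sigma^2}. \qquad \square $$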
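The simplification step at the end of the proof can be checked numerically. The sketch below forms $D \Sigma_1 D'$ from the moment relations given in the proof and compares the result with the closed forms (4)–(6). It is a verification aid rather than part of the proof; the function names and the test values of $\mu$, $\sigma^2$, and $k$ are arbitrary choices for this example.

```python
import math

def delta_method_cov(mu, sigma2, k):
    """Compute D * Sigma_1 * D' numerically from the relations in the proof."""
    a = math.exp(sigma2)
    mx = math.exp(mu + sigma2 / 2)                 # mean of X
    vx = math.exp(2 * mu + sigma2) * (a - 1)       # variance of X
    vy = vx / k                                    # variance of Y
    my3 = math.exp(3 * mu + 1.5 * sigma2) * (a - 1) ** 2 * (a + 2) / k**2
    my4 = (math.exp(4 * mu + 2 * sigma2) * (a - 1) ** 2
           * (a**4 + 2 * a**3 + 3 * a**2 + 3 * k - 6) / k**3)
    s1 = [[vy, my3], [my3, my4 - vy**2]]
    q = mx**2 + vx
    d = [[(mx**2 + 2 * vx) / (mx * q), -k / (2 * q)],
         [-2 * vx / (mx * q), k / q]]
    ds = [[sum(d[i][l] * s1[l][j] for l in range(2)) for j in range(2)]
          for i in range(2)]
    # multiply by the transpose of D
    return [[sum(ds[i][l] * d[j][l] for l in range(2)) for j in range(2)]
            for i in range(2)]

def closed_form_cov(sigma2, k):
    """Closed forms (4), (5), and (6), written in terms of a = exp(sigma^2)."""
    a = math.exp(sigma2)
    s11 = (a - 1) / (4 * k * a**2) * (
        a**5 + a**4 - 7 * a**3 + 9 * a**2 - 2 * a + 2 + 2 * k * (a - 1))
    s22 = (a - 1) ** 2 / (k * a**2) * (a**4 + 2 * a**3 - a**2 - 2 + 2 * k)
    s12 = (a - 1) ** 2 / (2 * k * a**2) * (
        -a**4 - 2 * a**3 + 3 * a**2 + 2 - 2 * k)
    return s11, s22, s12

S = delta_method_cov(mu=0.7, sigma2=0.5, k=3)
s11, s22, s12 = closed_form_cov(0.5, 3)
max_diff = max(abs(S[0][0] - s11), abs(S[1][1] - s22), abs(S[0][1] - s12))
print(max_diff)
```

The printed difference should be at the level of floating-point rounding, and it should not change with the value of `mu`, in line with the remark that the asymptotic efficiencies do not depend on $\mu$.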

Acknowledgment

This research was supported in part by a Cooperative Agreement between the US Environmental Protection Agency and The George Washington University. The authors thank a referee for some helpful comments.

References

Cline, S.M. and Severin, B.F. (1989) Volatile organic losses from a composite water sampler. Water Research, 23(4), 407–12.
Crow, E.L. and Shimizu, K. (1988) Lognormal Distributions: Theory and Applications, Marcel Dekker, New York.
Edland, S.D. and van Belle, G. (1994) Decreased sampling costs and improved accuracy with composite sampling, in Environmental Statistics, Assessment, and Forecasting, C.R. Cothern and N.P. Ross (eds), Lewis Publishers, Ann Arbor, pp. 29–55.
Ferguson, T.S. (1996) A Course in Large Sample Theory, Chapman and Hall, New York.
Finley, B. and Paustenbach, D. (1994) The benefits of probabilistic exposure assessment: Three case studies involving contaminated air, water, and soil. Risk Analysis, 14, 533–54.
Finley, B., Proctor, D., Scott, P., Harrington, N., Paustenbach, D., and Price, P. (1994) Recommended distributions for exposure factors frequently used in health risk assessment. Risk Analysis, 14, 533–54.
Gastwirth, J.L. and Hammick, P.A. (1989) Estimation of the prevalence of a rare disease, preserving the anonymity of the subjects by group testing: Application to estimating the prevalence of AIDS antibodies in blood donors. J. Statist. Plann. Inference, 22, 15–27.
Gore, S.D. and Patil, G.P. (1994) Identifying extremely large values using composite sample data (with discussion). Environmental and Ecological Statistics, 1, 227–45.
Gore, S.D., Patil, G.P., and Taillie, C. (1996) Identification of the largest individual sample using composite sample data and certain modifications of the sweep-out method. Environmental and Ecological Statistics, 3, 219–34.
Johnson, N.L. and Kotz, S. (1970) Continuous Univariate Distributions-I, Houghton Mifflin Company, Boston.
Keenan, R.E., Finley, B.L., and Price, P.S. (1994) Exposure assessment: Then, now, and quantum leaps in the future. Risk Analysis, 14, 225–30.
Lovison, G., Gore, S.D., and Patil, G.P. (1994) Design and analysis of composite sampling procedures: A review, in Handbook of Statistics, Volume 12, G.P. Patil and C.R. Rao (eds), Elsevier, New York, pp. 103–66.
Mack, G.A. and Robinson, P.E. (1985) Use of composited samples to increase the precision and probability of detection of toxic chemicals, in Environmental Applications of Chemometrics, J.J. Breen and P.E. Robinson (eds), American Chemical Society, Washington, DC, pp. 174–83.
Ott, W.R. (1995) Environmental Statistics and Data Analysis, CRC Press, New York.
Patil, G.P., Gore, S.D., and Sinha, A.K. (1994) Environmental chemistry, statistical modeling, and observational economy, in Environmental Statistics, Assessment, and Forecasting, C.R. Cothern and N.P. Ross (eds), Lewis Publishers, Ann Arbor, pp. 57–97.
Roseberry, A.M. and Burmaster, D.E. (1992) Lognormal distributions for water intake by children and adults. Risk Analysis, 12, 99–104.
Thompson, K.M., Burmaster, D.E., and Crouch, E.A.C. (1992) Monte Carlo techniques for quantitative uncertainty analysis in public health risk assessment. Risk Analysis, 12, 53–63.

Biographical sketches

Abeer El-Baz is a doctoral candidate in the Department of Statistics at the George Washington University. Her research interests include statistical prediction and prediction intervals, specifically in life testing and environmental applications.

Tapan K. Nayak is Professor, Department of Statistics, George Washington University, Washington, DC 20052. He is an elected member of the International Statistical Institute. His main research interests are in parametric inference and prediction, software reliability, and environmental risk assessment.
