Journal of Probability and Statistical Science 5(2), 137-150, Aug. 2007
Using General Inverse Sampling Design to Avoid Undefined Estimator

Mohammad Moradi, Mohammad Salehi M.
Isfahan University of Technology

Paul S. Levy
Research Triangle Park

ABSTRACT

We consider the problem of estimating a ratio for which the denominator estimator can take the value zero. Under a simple random sampling (SRS) design, if all observations of the denominator variable are zero, the ratio estimator is undefined. A natural solution is to use an inverse sampling design, in which one continues sampling until at least a predetermined number of nonzero values is observed for the denominator variable. General inverse sampling, proposed by Salehi and Seber [13], is a more practicable version of inverse sampling. Using Taylor expansion, we derive an asymptotically unbiased estimator of the ratio and an approximate variance estimator for a general inverse sampling design. We use simulation, based on a real population from the Statistical Center of Iran and an artificial population, to evaluate the efficiency of the developed estimator relative to the usual SRS ratio estimator. Using general inverse sampling, we not only control the problem of an undefined estimator but also show that the developed estimator is more efficient than its SRS counterpart.

Keywords: Murthy's estimator; Rare events; Taylor expansion.
Received November 2006, revised February/March 2007, in final form March 2007.
Mohammad Moradi and Mohammad Salehi M. (corresponding author; email: [email protected]) are affiliated with the School of Mathematical Science at the Isfahan University of Technology, Isfahan, Iran. Paul S. Levy is affiliated with the Statistics Research Division, RTI International, Research Triangle Park, NC 27709, USA; email: [email protected]. Dr. Levy is a Fellow of both the American Statistical Association and the American College of Epidemiology.
© 2007 Susan Rivers' Cultural Institute, Hsinchu, Taiwan, Republic of China. ISSN 1726-3328

1. Introduction

Inverse sampling was first proposed by Haldane [5]: one continues sampling until a pre-determined number of rare events of interest is observed. It is generally a more appropriate sampling design than SRS when the event of interest is rare and when the estimator of the parameter of interest is likely to be undefined. In the past decade it has received considerable attention from Lui [9], Chang et al. [1, 2], Christman and Lan [3], and Salehi and Seber [12], among many others. One deficiency, however, is that the final sample size is not fixed, which makes it difficult to plan budgets and survey logistics. As a result, surveys with an inverse sampling design are rarely used in practice.

To deal with this problem, Salehi and Seber [13] proposed the following design. Suppose that we can select at least $n_0$ and at most $n_1$ units, based on a minimum and a maximum budget. We first take an initial sample of size $n_0$. If the sample contains the pre-determined number of events, we stop sampling. Otherwise we keep sampling until we either achieve the pre-determined number of events or reach the sample size $n_1$. This sample design is called general inverse sampling (GIS). In the same article, they derived an unbiased estimator as well as its variance estimator using the estimator developed by Murthy [10].

Motivated by a need to estimate the amount of honey produced per family in Kurdistan Province, Iran, we develop a ratio estimator for general inverse sampling, since the usual SRS estimator of a ratio is undefined when the estimator of the denominator is zero. In section 2, we use Taylor series approximation to derive, for any sampling design, an estimator of the ratio as well as its variance and its variance estimator based on the Murthy estimator. The variance estimator is derived from the Murthy estimator in the same way as in Salehi and Seber [13] and Salehi and Chang [11]; Särndal et al. [14] have done so for the Horvitz-Thompson [8] estimator. In section 3, we derive the formulations for general inverse sampling. In section 4, we simulate two populations: (1) the honey production population from the Statistical Center of Iran and (2) an artificial population. We gain efficiency using GIS, and the gain grows as the event of interest becomes rarer.
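The GIS stopping rule described in the introduction can be illustrated with a minimal Python sketch. It assumes that units are drawn by simple random sampling without replacement; the function name `general_inverse_sample` and the predicate `in_c` (which flags a unit as an event) are hypothetical names introduced only for this illustration.

```python
import random


def general_inverse_sample(population, in_c, n0, n1, m, rng=None):
    """Draw one general inverse sample (illustrative sketch).

    population : list of units, e.g. (y, x) pairs
    in_c       : function returning True when a unit counts as an event
    n0, n1     : minimum and maximum sample sizes
    m          : pre-determined number of events
    """
    rng = rng or random.Random()
    if not 0 < n0 <= n1 <= len(population):
        raise ValueError("need 0 < n0 <= n1 <= population size")
    # A random ordering of n1 units; taking them one by one is equivalent
    # to sequential simple random sampling without replacement (assumption).
    order = rng.sample(population, n1)
    sample = order[:n0]                       # initial sample of size n0
    events = sum(1 for u in sample if in_c(u))
    for unit in order[n0:]:
        if events >= m:                       # enough events: stop sampling
            break
        sample.append(unit)                   # otherwise draw one more unit
        events += 1 if in_c(unit) else 0
    return sample                             # size is between n0 and n1


if __name__ == "__main__":
    # Toy population: 170 units with x = 0 and 30 units with x > 0.
    pop = [(1.0, 0.0)] * 170 + [(5.0, 2.0)] * 30
    s = general_inverse_sample(pop, lambda u: u[1] > 0, 20, 100, 5,
                               random.Random(1))
    print(len(s), sum(1 for u in s if u[1] > 0))
```

When $n_0 = m$ and $n_1$ equals the population size, this rule coincides with sampling one unit at a time until $m$ events are observed.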
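2. Murthy Estimator of a Ratio

Consider a population $P = \{u_1, u_2, \ldots, u_N\}$ of $N$ units. Let $(y_i, x_i)$ be the y-value and x-value associated with unit $u_i$, for $i = 1, 2, \ldots, N$. Suppose that we want to estimate the ratio $R = \tau_y / \tau_x$, where $\tau_y = \sum_{i=1}^{N} y_i$ and $\tau_x = \sum_{i=1}^{N} x_i$. For simple random sampling (SRS), the estimator of the ratio is the ratio of sample means,
\[
\hat{R} = \frac{\bar{y}}{\bar{x}}.
\]
Using Taylor expansion, the approximate variance of $\hat{R}$ (Wolter [15]) is given by
\[
\mathrm{Var}(\hat{R}) = R^2 \left[ \frac{\mathrm{Var}(\bar{y})}{\mu_y^2} + \frac{\mathrm{Var}(\bar{x})}{\mu_x^2} - \frac{2\,\mathrm{Cov}(\bar{y}, \bar{x})}{\mu_y \mu_x} \right], \tag{1}
\]
where $\mu_y = \tau_y / N$ and $\mu_x = \tau_x / N$ denote the population means,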
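and its estimator is
\[
\widehat{\mathrm{Var}}(\hat{R}) = \hat{R}^2 \left[ \frac{\widehat{\mathrm{Var}}(\bar{y})}{\bar{y}^2} + \frac{\widehat{\mathrm{Var}}(\bar{x})}{\bar{x}^2} - \frac{2\,\widehat{\mathrm{Cov}}(\bar{y}, \bar{x})}{\bar{y}\,\bar{x}} \right], \tag{2}
\]
where $\widehat{\mathrm{Var}}(\bar{y}) = \left(\frac{1}{n} - \frac{1}{N}\right) s_y^2$, $\widehat{\mathrm{Var}}(\bar{x}) = \left(\frac{1}{n} - \frac{1}{N}\right) s_x^2$, $\widehat{\mathrm{Cov}}(\bar{y}, \bar{x}) = \left(\frac{1}{n} - \frac{1}{N}\right) s_{yx}$, $s_y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$, $s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $s_{yx} = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})$.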
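As an illustration of the SRS ratio estimator and the variance estimator (2), the following Python sketch computes both from a sample of paired values. The function name `srs_ratio` is hypothetical. The variance is evaluated in the form $(s_y^2 - 2\hat{R}s_{yx} + \hat{R}^2 s_x^2)(1/n - 1/N)/\bar{x}^2$, which is algebraically equal to (2) whenever $\bar{y} \neq 0$ and avoids dividing by $\bar{y}$. Returning `None` when $\bar{x} = 0$ is a choice made here to flag the undefined case.

```python
from statistics import mean


def srs_ratio(y, x, N):
    """Ratio of sample means and its estimated variance under SRS.

    y, x : sampled y-values and x-values (same length n >= 2)
    N    : population size
    Returns (R_hat, var_hat), or None when the sample mean of x is zero
    and the ratio estimator is therefore undefined.
    """
    n = len(y)
    if n != len(x) or n < 2:
        raise ValueError("need paired samples with n >= 2")
    ybar, xbar = mean(y), mean(x)
    if xbar == 0:
        return None                      # undefined ratio estimator
    r_hat = ybar / xbar
    fpc = 1.0 / n - 1.0 / N              # the factor (1/n - 1/N)
    s_y2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    s_x2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    s_yx = sum((yi - ybar) * (xi - xbar) for yi, xi in zip(y, x)) / (n - 1)
    # Equivalent to equation (2) when ybar != 0.
    var_hat = fpc * (s_y2 - 2 * r_hat * s_yx + r_hat ** 2 * s_x2) / xbar ** 2
    return r_hat, var_hat


if __name__ == "__main__":
    print(srs_ratio([1.0, 2.0, 3.0], [1.0, 1.0, 2.0], N=10))
    print(srs_ratio([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], N=10))
```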
Suppose that a sampling design of size $n$ is applied for which $p_i$ is the selection probability of unit $i$ in the first draw. In order to develop a consistent estimator of the ratio based on the Murthy estimator, we need the Murthy estimators $\hat{\tau}_x$ and $\hat{\tau}_y$, namely
\[
\hat{\tau}_y = \sum_{i=1}^{n} \frac{P(S \mid i)}{P(S)}\, y_i, \qquad \hat{\tau}_x = \sum_{i=1}^{n} \frac{P(S \mid i)}{P(S)}\, x_i, \tag{3}
\]
where $P(S)$ is the probability of getting sample $S$ and $P(S \mid i)$ is the conditional probability of getting the sample $S$ given that the $i$th unit was selected first. The variance
\[
\mathrm{Var}(\hat{\tau}_y) = \sum_{i=1}^{N} \sum_{j<i} \left( 1 - \sum_{S \ni i,j} \frac{P(S \mid i)\, P(S \mid j)}{P(S)} \right) \left( \frac{y_i}{p_i} - \frac{y_j}{p_j} \right)^2 p_i p_j
\]
[...]

[...] $x > 0$, is 1440 and the number of more active units, $x > 5$, is 618. The more active subpopulation is rarer than the active subpopulation, and the simulation shows that the relative efficiency of $\hat{R}_m$ is larger for this rarer subpopulation. Tables 1 and 2 show that $\hat{R}_m$ is more efficient for $m \in \{2, 3, 4, 5\}$ and that, as $m$ increases to 10, 50 and 100, the relative efficiencies tend to one.

We also simulated an artificial population in which the rare event occurs in about 5 percent of the units. In this simulation we frequently encountered an undefined $\hat{R}$ under the SRS design, and the efficiency gains from using GIS are larger than for the populations discussed above.

Table 2. The relative efficiency of the ratio estimator under GIS for condition $C = \{x_i > 5\}$ for the honey production population.

    n0     n1     m      efficiency
    2      500    2      1.23
    3      500    3      1.38
    4      500    4      1.34
    5      500    5      1.21
    10     500    10     1.10
    50     500    50     1.03
    100    500    100    1
    2      100    2      1.25
    3      100    3      1.39
    4      100    4      1.24
    5      100    5      1.23
    10     100    10     1.09
    50     100    50     1.01
    100    100    100    1
    2      50     2      1.31
    3      50     3      1.25
    4      50     4      1.17
    5      50     5      1.10
    10     50     10     1.07
    50     50     50     1
We should note that we do not know the nonzero x-values prior to the survey. Even if we were able to carry out a complete census of honey farms, we could not use it as a frame for future sample surveys, because such family businesses are not very stable: many families close their business in one season and reopen it in the next, since it is not the main source of income for most of them.

The referee noted that when we end up with only (0, 0) pairs in the sample set, the sample contributes no information to $\hat{R}_m$. Based on the referee's suggestion, we simulated another population in which most units have values $(0, y_i)$ with $y_i > 0$ rather than (0, 0). We consider a population of size 200 in which 170 units have a zero x-value and 30 have a non-zero x-value. The 30 non-zero x-values are observations of a Normal random variable, and the y-values are generated from a simple linear regression model with intercept. The population values are given in Table 4, and selected results are summarized in Table 3. When $n_0 = m$ and $n_1 = 200$, GIS reduces to Haldane's inverse sampling. The results are consistent with those for the honey production population, with greater efficiency as the non-zero units become rarer.
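To give a sense of how often the SRS ratio estimator is undefined in a population like this one, the sketch below computes the probability that an SRS of size $n$ contains no unit with a non-zero x-value. It assumes sampling without replacement, so the count of non-zero units in the sample is hypergeometric; the function name `prob_all_zero_x` is hypothetical, while the values $N = 200$ and 30 non-zero units are taken from the description above.

```python
from math import comb


def prob_all_zero_x(N, K, n):
    """P(an SRS of size n, drawn without replacement from N units of which
    K have non-zero x, contains only zero x-values)."""
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= N")
    # Number of samples made up of zero-x units only, over all samples.
    return comb(N - K, n) / comb(N, n)


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        print(n, prob_all_zero_x(200, 30, n))
```

The probability decreases as $n$ grows but remains positive for every $n \le 170$, which is the situation an inverse sampling design is meant to avoid.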
Using GIS, we not only reduced the chance of obtaining only $(0, y_i)$ pairs in the sample set but also gained efficiency. Increasing $n_1$ does not necessarily increase efficiency. For example, for $n_0 = 7$, $n_1 = 100$ and $m = 7$ the efficiency was 69, but when $n_1$ increases to 200 (Haldane's inverse sampling) the efficiency decreases to 50. In many cases the sample size required by Haldane's inverse sampling would be prohibitively large; GIS avoids sampling an infeasibly large number of units. Taking an initial sample of size $n_0$, rather than taking units one by one from scratch, also helps the researcher to design the sampling strategy.

Table 3. The relative efficiency of the ratio estimator in the artificial population under GIS for condition $C = \{x_i > 0\}$.

    n0    n1     m     efficiency
    2     30     2     1.51
    2     50     2     4.24
    2     100    2     6.63
    2     200    2     5.36
    5     30     2     1.23
    5     50     2     3.49
    5     100    2     6.58
    5     200    2     6.00
    20    30     2     1.56
    20    50     2     5.73
    20    100    2     7.73
    20    200    2     8.13
    3     30     3     1.90
    3     50     3     25.23
    3     100    3     52.67
    3     200    3     51.92
    20    30     3     1.71
    20    50     3     13.57
    20    100    3     57.33
    20    200    3     49.21
    5     30     5     1.58
    5     50     5     6.20
    5     100    5     113.55
    5     200    5     98.00
    20    30     5     1.24
    20    50     5     3.33
    20    100    5     115.27
    20    200    5     88.82
    7     30     7     1.34
    7     50     7     15.16
    7     100    7     69.00
    7     200    7     50.00
    20    30     7     1.12
    20    50     7     2.36
    20    100    7     66.40
    20    200    7     55.60
    10    30     10    1.13
    10    50     10    1.64
    10    100    10    49.50
    10    200    10    33.50
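The remark on sample sizes can be made concrete with a short calculation of the expected final sample size under GIS. The sketch below is an illustration under the assumption of simple random sampling without replacement; the function names are hypothetical. It uses the fact that, for $n_0 \le t < n_1$, sampling continues past $t$ draws exactly when fewer than $m$ events occur among the first $t$ units, so that $E(n) = n_0 + \sum_{t=n_0}^{n_1-1} P(\text{fewer than } m \text{ events in } t \text{ draws})$.

```python
from math import comb


def prob_fewer_than_m(N, K, t, m):
    """P(fewer than m events among the first t draws) when K of the N
    population units are events (hypergeometric, without replacement)."""
    total = comb(N, t)
    hits = sum(comb(K, k) * comb(N - K, t - k) for k in range(min(m, t + 1)))
    return hits / total


def expected_gis_sample_size(N, K, n0, n1, m):
    """Expected final sample size of a GIS design with limits n0 <= n1."""
    if not 0 < n0 <= n1 <= N:
        raise ValueError("need 0 < n0 <= n1 <= N")
    return n0 + sum(prob_fewer_than_m(N, K, t, m) for t in range(n0, n1))


if __name__ == "__main__":
    # Artificial population: N = 200 units, 30 of them with x > 0.
    for n1 in (30, 50, 100, 200):
        print(n1, expected_gis_sample_size(200, 30, 7, n1, 7))
```

With $n_0 = m$ and $n_1 = N$ the result is the expected sample size of Haldane's inverse sampling, so the two designs can be compared for any choice of $n_1$.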
When GIS achieves the predetermined number $m$ with probability 1 (i.e., $m$ is achieved in every replication) and, for SRS with the same effective sample size, the number of units satisfying condition $C$ is scattered over an interval close to zero, the efficiency gain is considerable. For example, when $n_0 = 20$, $n_1 = 100$ and $m = 5$, every replicated GIS sample contains 5 units satisfying condition $C$, whereas for the equivalent SRS the number of units satisfying condition $C$ ranges from 0 to 10, which makes $\bar{x}$, $\bar{y}$ and $\hat{R}$ unstable. Hence GIS attains a high relative efficiency of 115.27 over SRS. The distribution of the number of units satisfying condition $C$ for SRS is given in Figure 1.

Table 4. Artificial population. The 200 units are listed in five blocks of 40; within each block the k-th x-value is paired with the k-th y-value.

Block 1
x: 0 for all 40 units
y: 3.8 2.2 3.1 2.3 2.9 2.6 2.7 2.6 3.7 4.5 3 4 2.9 3.6 2.1 2.2 2.1 2.6 3.3 4.1 3.5 2.8 3.2 3 2.1 2.4 2.5 3 3 2.9 3.3 2.2 2.7 2.2 2.7 3 2.6 2.3 3.3 3.3

Block 2
x: 0 for all 40 units
y: 2.6 2.1 3 3.2 2.8 3.7 3 3.1 3.7 2.1 2.4 3.8 3.6 3 3.4 2.7 2.7 2.5 2.6 3.7 2.2 2.1 3.2 4 2.1 2.8 2.9 3 2.3 2.2 3.5 2.6 2.5 2.5 2.8 2.7 3.3 3.2 2.1 2.6

Block 3
x: 0 for all 40 units
y: 3.3 2.6 2.3 3.4 3.5 2.8 3.6 4.1 3.1 2.2 2 2.5 2.8 2.1 3.4 4.8 2.2 2.1 2 3.5 2.5 3.5 3.2 2.3 2.8 2.6 2.3 2.6 2.5 2.8 3.6 2.1 2.6 3.4 2.3 4 2.3 2.3 3 2.9

Block 4
x: 0 for all 40 units
y: 2.5 3.3 3 2.2 3.4 2.8 2.9 4.4 2.9 2.2 4.1 3.2 2.6 2.8 3.3 2 4 2.8 2.1 2.1 2.4 2.9 2.4 2.2 5 2 4 2.7 2.2 2.8 2.3 2.1 2.3 2.9 2.3 2.5 2.1 2.4 2.8 2.2

Block 5
x: 0 0 0 0 0 0 0 0 0 0 3.1 4.2 0.5 0.3 3.2 10.5 2.5 4.2 11.5 1.8 5 15.1 4.2 6.5 7.3 13.9 1.3 0.1 2.8 1.5 5.1 3.6 1.9 6.4 2.1 6.6 7.6 0.5 4.7 11.3
y: 3 2.3 3.1 3.1 2.8 2.3 2.7 2.2 2.5 2.4 8.4 10.8 3.5 3 9.2 23.3 8.1 10.7 25.9 5.6 12.7 33.9 11.2 16.3 17.9 30.1 5.1 2.7 9.1 5.3 12.6 11 8 16.2 7.5 16.1 19.1 3.8 12.7 24.9
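Appendix

\[
\begin{aligned}
\mathrm{Cov}(\hat{\tau}_y, \hat{\tau}_x) &= E(\hat{\tau}_y \hat{\tau}_x) - E(\hat{\tau}_y) E(\hat{\tau}_x) = \sum_{S} \hat{\tau}_y \hat{\tau}_x P(S) - \tau_y \tau_x \\
&= \sum_{S} \left( \sum_{i \in S} \frac{P(S \mid i)}{P(S)}\, y_i \right) \left( \sum_{j \in S} \frac{P(S \mid j)}{P(S)}\, x_j \right) P(S) - \tau_y \tau_x \\
&= \sum_{S} \sum_{i \in S} \sum_{j \in S} \frac{P(S \mid i) P(S \mid j)}{P(S)}\, y_i x_j - \tau_y \tau_x \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{S \ni i,j} \frac{P(S \mid i) P(S \mid j)}{P(S)}\, y_i x_j - \tau_y \tau_x \\
&= \sum_{i=1}^{N} \sum_{S \ni i} \frac{P(S \mid i)^2}{P(S)}\, y_i x_i + \sum_{i=1}^{N} \sum_{j \neq i} \sum_{S \ni i,j} \frac{P(S \mid i) P(S \mid j)}{P(S)}\, y_i x_j - \tau_y \tau_x .
\end{aligned}
\]
We know that $\sum_{S \ni i} P(S \mid i) = 1$ and $\sum_{S \ni i,j} P(S \mid i, j) = 1$; then
\[
\begin{aligned}
\tau_y \tau_x &= \sum_{i=1}^{N} \sum_{j=1}^{N} y_i x_j = \sum_{i=1}^{N} y_i x_i + \sum_{i=1}^{N} \sum_{j \neq i} y_i x_j \\
&= \sum_{i=1}^{N} y_i x_i \sum_{S \ni i} P(S \mid i) + \sum_{i=1}^{N} \sum_{j \neq i} y_i x_j \sum_{S \ni i,j} P(S \mid i, j).
\end{aligned}
\]
By replacing $\tau_y \tau_x$ in the previous equation, we have
\[
\begin{aligned}
\mathrm{Cov}(\hat{\tau}_y, \hat{\tau}_x) ={}& \sum_{i=1}^{N} \sum_{S \ni i} \frac{P(S \mid i)^2}{P(S)}\, y_i x_i + \sum_{i=1}^{N} \sum_{j \neq i} \sum_{S \ni i,j} \frac{P(S \mid i) P(S \mid j)}{P(S)}\, y_i x_j \\
&- \sum_{i=1}^{N} y_i x_i \sum_{S \ni i} P(S \mid i) - \sum_{i=1}^{N} \sum_{j \neq i} y_i x_j \sum_{S \ni i,j} P(S \mid i, j).
\end{aligned}
\]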
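The covariance expression in this appendix can be checked numerically by enumerating every sample of a small design. The sketch below is only an illustration and does not use the GIS design: it assumes a toy design with two draws without replacement, where unit $i$ has first-draw probability $p_i$ and the second unit is drawn with probability proportional to the remaining $p_j$. For this design $P(S \mid i) = p_j/(1 - p_i)$ for $S = \{i, j\}$, $P(S) = p_i P(S \mid i) + p_j P(S \mid j)$ and $P(S \mid i, j) = 1$. The function name `murthy_check` is hypothetical.

```python
from itertools import combinations


def murthy_check(y, x, p):
    """Enumerate all samples of size 2 and return
    (E[tau_hat_y], Cov computed directly, Cov from the appendix formula)."""
    tau_y, tau_x = sum(y), sum(x)
    e_tau_y = 0.0        # expectation of the Murthy estimator of tau_y
    e_cross = 0.0        # expectation of tau_hat_y * tau_hat_x
    formula = 0.0        # right-hand side of the appendix expression
    for i, j in combinations(range(len(y)), 2):
        ps_i = p[j] / (1.0 - p[i])           # P(S | i): j drawn second
        ps_j = p[i] / (1.0 - p[j])           # P(S | j): i drawn second
        ps = p[i] * ps_i + p[j] * ps_j       # P(S) for S = {i, j}
        t_y = (ps_i * y[i] + ps_j * y[j]) / ps   # Murthy estimator, eq. (3)
        t_x = (ps_i * x[i] + ps_j * x[j]) / ps
        e_tau_y += ps * t_y
        e_cross += ps * t_y * t_x
        # Terms with i = j in the formula, one for each unit in S.
        formula += (ps_i ** 2 / ps - ps_i) * y[i] * x[i]
        formula += (ps_j ** 2 / ps - ps_j) * y[j] * x[j]
        # Terms with i != j; P(S | i, j) = 1 because n = 2.
        formula += (ps_i * ps_j / ps - 1.0) * (y[i] * x[j] + y[j] * x[i])
    return e_tau_y, e_cross - tau_y * tau_x, formula


if __name__ == "__main__":
    y = [3.0, 1.0, 4.0, 1.5]
    x = [0.0, 2.0, 0.0, 5.0]
    p = [0.1, 0.2, 0.3, 0.4]     # first-draw probabilities, summing to 1
    print(murthy_check(y, x, p))
```

The first returned value should equal $\tau_y$, reflecting the unbiasedness of the Murthy estimator, and the last two values should agree up to rounding error.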
n n X X P .Sji/P .Sjj / P .Sji/ yi xi C P .S/
P .S/
i P .Sji; j / yi xj
X yi xj
S3i
8 n 2 X