31
Model Assisted Statistics and Applications 1 (2005,2006) 31–36 IOS Press
On stratified randomized response sampling Jea-Bok Ryua,∗ , Jong-Min Kimb , Tae-Young Heo c and Chun Gun Park d a
Statistics, Division of Life Science and Genetic Engineering and Statistics, Cheongju University, Cheongju, Chungbuk, 360-764, Republic of Korea b Statistics, Division of Science and Mathematics, University of Minnesota, Morris, MN, 56267, USA c Department of Statistics, North Carolina State University, Raleigh, NC, 27695, USA d National Cancer Center, Ilsan, Goyang-si, Gyeonggi-do, 411-769, Republic of Korea
Abstract. In this paper, we propose a new quantitative randomized response model based on Mangat and Singh [7] two-stage randomized response model. We derive the estimator of the sensitive variable mean, and show that our method is more efficient than other randomized response models suggested by Greenberg et al. [3] and Gupta et al. [4] estimators. Keywords: Quantitative randomized response technique, sensitive characteristics, stratified sampling
1. Introduction The randomized response technique is a procedure for collecting the information on sensitive characteristics without exposing the identity of the respondent. It was first introduced by Warner [8] as an alternative survey technique for socially undesirable or incriminating behavior questions. Greenberg et al. [3] have proposed and developed the unrelated question randomized response design for estimating the mean and the variance of the distribution of a quantitative variable. Gupta et al. [4], and Arnab [1] have showed that optional randomized response model is more accurate while being less intrusive. Hong et al. [5] suggest a stratified randomized response model using a proportional allocation. However their model may have a high costs due to the difficulty in obtaining a proportional sample from each stratum. To rectify this problem, Kim and Warde [6] suggest a stratified randomized response model using an optimal allocation which is more efficient than that of using the proportional allocation.
∗ Corresponding
author: Jea-Bok Ryu, Statistics, Division of Life Science and Genetic Engineering and Statistics, Cheongju University, Cheongju, Chungbuk, 360-764, Republic of Korea. E-mail:
[email protected].
2. A review of quantitative randomized response methods Unrelated question randomized response method proposed by Greenberg et al. [3] is a survey procedure that a respondent could be asked one of two questions depending on the outcome of a randomization device. For example, an interviewee performs a randomization device with two outcomes each with pre-assigned probabilities P and 1 − P which will answer one of the following questions: S : How many abortions have you had during your lifetime? N : How many magazines do you subscribe to? where we denotes S as the sensitive question and N as the non-sensitive question. Two independent, nonoverlapping samples of sizes n 1 and n2 are used (size n1 need not be equal to size n 2 ). Let the population mean of both the sensitive and non-sensitive distributions be µA , µY , respectively. Let the population variance of both the sensitive and non-sensitive distribu2 tions be σA , σY2 , respectively. Unbiased estimators for the means of the sensitive and non-sensitive random variables, µA and µY , are µ ˆ1 =
(1 − P2 )T¯1 − (1 − P1 )T¯2 and P1 − P2
ISSN 1574-1699/05/06/$17.00 © 2005/2006 – IOS Press and the authors. All rights reserved
(1)
32
J.-B. Ryu et al. / On stratified randomized response sampling
µ ˆ2 =
P2 T¯1 − P1 T¯2 , P2 − P1
(2)
where T¯i is total sample mean computed from the responses in the ith samples and Pi is the selection probability for the sensitive question in the i th sample, for i = 1, 2 (P1 = P2 ). The variance of µ ˆ 1 is given by Var(ˆ µ1 ) =
(3)
(1 − P2 )2 Var(T¯1 ) + (1 − P1 )2 Var(T¯2 ) , (P2 − P1 )2
2 − σY2 ) + Pj (1 − where Var(T¯j ) = n1j (σY2 + Pj (σA 2 Pj )(µA − µY ) ). In this method, if µ Y and σY2 are known in advance, only one sample is needed. So we define P 1 = P and T¯1 = T¯, then Eqs (1) and (3) are simplified as
T¯ − (1 − P )µY µ ˆ1 = P
(4)
and Var(T¯) Var(ˆ µ1 ) = P2 1 2 = (σ 2 + P (σA − σY2 ) nP 2 Y
(5)
1 Zi , n i=1
(9)
with variance 1 2 2 2 σ + W CB (σA + µ2A ) , (10) n A where CB = σB /µB denotes the known coefficient of variation of the scrambling variable B. Var(ˆ µG ) =
3. Two-stage quantitative randomized response model
(6)
(7)
(i) “Report the correct response A of a sensitive question” and (ii) “Go to the randomization device R 2 in the second stage”
with variance 1 2 2 2 σ + CB (σA + µ2A ) , n A
If W denotes the probability that a person will report the scrambled response, then I is a Bernoulli random variable with E(I) = W , where W can be called the sensitivity of the question. They showed an unbiased estimator of population mean, µ A , is given by
In this section, we propose a two-stage quantitative randomized response model. We assume that a sample of size n is selected by SRSWR. The method is described as follows. Stage 1 An individual respondent in the sample is instructed to use the randomization device R 1 which consists of two statements:
n
Var(ˆ µE ) =
(8)
where I is an indicator random variable defined as 1 if the response is scrambled I= 0 otherwise.
µ ˆG =
Eichhorn and Hayre [2] introduce a scrambled randomized response method for estimating the mean µ A 2 and the variance σ A of the sensitive question A. According to them, each respondent selected in the sample is instructed to use a randomization device and generate a random number, say B, from some pre-assigned distribution. The distribution of the random variable B, also called a scrambling variable, is assumed to be 2 known. The mean µ B and the variance σ B of the scrambling variable are also assumed to be known. The i th respondent selected in the sample of size n, drawn by using simple random sampling with replacement (SRSWR), is requested to report the value Z i = Bi Ai /µB as a scrambled response on the sensitive variable, A. They show that an unbiased estimator of the population mean, µA , is given by 1 Zi n i=1
Z = B I A,
n
+P (1 − P )(µA − µY )2 ).
µ ˆE =
where σB is the standard deviation of the scrambling variable B, and C B = σB /µB denotes the known coefficient of variation of the scrambling variable B. Gupta et al. [4] propose an optional randomized response technique, which is more efficient than the scrambled randomized response technique suggested by Eichhorn and Hayre [2]. In the optional randomized response technique, where each respondent selected by SRSWR, can choose one of the following two options: (a) The respondent can report the correct response A, or (b) The respondent can report the scrambled response BA, where B denotes the independent scrambling variable. In optional procedure, they assumed that both B and A are positive random variables and µ B = 1 . The optional randomized response model can be written as
J.-B. Ryu et al. / On stratified randomized response sampling
represented with probabilities P and 1 − P . Stage 2 The randomization device R 2 consists of two statements: (i) “Report the correct response A of a sensitive question” and (ii) “Report the scrambled response AB of a sensitive question” represented with probabilities T and 1 − T . The respondent should not report to an interviewer which steps are taken to protect the respondent’s privacy. We assumed that both B and A are positive ran2 = ψ 2 . Similar to dom variables, µ B = 1, and σB Eichhorn and Hayre [2] approach, the distribution of 2 random variable B, the mean µ B and the variance σ B of the scrambling variable are all assumed to be known. Based on two-stage procedures, the i th respondent selected in the sample of size n, drawn by using SRSWR, is requested to report the value, U = αA + (1 − α) (βA + (1 − β)AB) ,
⎧ ⎪ ⎪ 1 if a respondent chooses a statement 1 ⎨ in R1 α= 0 if a respondent chooses a statement 2 ⎪ ⎪ ⎩ in R1 ⎧ ⎪ ⎪ 1 if a respondent chooses a statement 1 ⎨ in R2 β= 0 if a respondent chooses a statement 2 ⎪ ⎪ ⎩ in R2
The expected value of the observed response is, E(U ) = E (αA + (1 − α)(βA + (1 − β)AB)) =P µA + (1 − P ) (T µA + (1 − T )µA µB ) = µA ,
(14) Var(ˆ µA )
1 2 2 σ + (1 − P )(1 − T )ψ 2 (µ2A + σA = ) . n A If 0 < P < 1 in Eq. (14), then, obtain the relaˆ G , we compare tive efficiency of µ ˆ A with respect to µ Var(ˆ µG ) and Var(ˆ µA ) as follows: Var(ˆ µG ) − Var(ˆ µA ) 1 2 2 σA + W ψ 2 (σA = + µ2A ) n
1 2 2 − σA + (1 − P )(1 − T )ψ 2 (µ2A + σA ) n 1 2 2 = (µA + σA )(1 − T )P ψ 2 0. n We have shown that the proposed estimator µ ˆ A is more efficient than the estimator µ ˆ G suggested by Gupta et al. [4].
(11)
where
and
33
(12)
where α is a Bernoulli random variable with E(α) = P , Var(α) = P (1 − P ) and β is Bernoulli with E(β) = T , Var(β) = T (1 − T ). Theorem 3.1. An unbiased estimator, µ ˆ A , of the population mean µA is given by,
4. Stratified two-stage quantitative randomized response model In this section, we newly propose a two-stage quantitative randomized response technique in stratified sampling. The main advantage of the stratified approach is that the technique overcome the limitation of the loss of individual characteristics of the respondents. We assume that the population is partitioned into strata, and a sample nh is selected by the SRSWR from each stratum. We assume that the number of units in each stratum is known. Let n h denote the number of units in the sample from stratum h and n denote the total number of units in the samples from all strata so that k n = h=1 nh . Stage 1 An individual respondent in the sample is instructed to use the randomization device R 1h which consists of two statements: (i) “Report the correct response A of a sensitive question” and (ii) “Go to the randomization device R 2h in the second stage”
(13)
represented with probabilities P h and 1 − Ph . Stage 2 The randomization device R 2h consists of two statements:
Theorem 3.2. The variance of the proposed estimator µ ˆA is given by
(i) “Report the correct response A of a sensitive question” and (ii) “Report the scrambled response AB of a sensitive question”
µ ˆA =
1 n
n
Ui
i=1
34
J.-B. Ryu et al. / On stratified randomized response sampling
2 ) (P + (1 − P )T + (1 − P )(1 − T )(1 + ψ 2 )) − µ2 wh (µ2Ah + σA h h h h h h Ah nh h = . k n w (µ2 + σ 2 ) (P + (1 − P )T + (1 − P )(1 − T )(1 + ψ 2 )) − µ2 h=1
1 n
k
h
Ah
Ah
h
h
h
h
h
h
Ah
2 ) (P + (1 − P )T + (1 − P )(1 − T )(1 + ψ 2 )) − µ2 wh (µ2Ah + σA h h h h h Ah h h
2 .
(20)
h=1
represented with probabilities T h and 1 − Th . Under the assumption that respondent reports truthfully and Ph and Th are set by the researcher, the distribution of random variable B h , the mean µBh and the 2 of the scrambling variable are all assumed variance σB h 2 = ψh2 to be known. We assume that µ Bh = 1 and σB h for all h = 1, 2, · · · , k. The i th respondent selected in the sample of size n h in stratum h, drawn by using SRSWR, is requested to report the value, Uh = αh Ah + (1 − αh ) (βh Ah + (1 − βh )Ah Bh ) ,
⎧ ⎪ ⎪ 1 if a respondent chooses a statement 1 ⎨ in R2h βh = 0 if a respondent chooses a statement 2 ⎪ ⎪ ⎩ in R2h Similar to Eq. (12), the expected value of the observed response is given by,
(16)
+(1 − Th )µAh µBh ) = µAh where αh is a Bernoulli random variable with E(α h ) = Ph , Var(αh ) = Ph (1 − Ph ) and βh is a Bernoulli random variable with E(β h ) = Th , Var(βh ) = Th (1 − Th ). By Theorem 3.1, an unbiased estimator of the population mean µ Ah in stratum h is, n 1 Uhi nh i=1
and its variance is
+(1 − Ph )(1 − Th )(1 + ψh2 )) − µ2Ah ). Since the selections in different strata are made independently, the mean estimators for individual strata can be added together to obtain a mean estimator for the whole population. The mean estimator of µ A for stratified sampling scheme is: k h=1
wh µ ˆ Ah =
k n wh Uhi , nh i=1
(18)
h=1
where wh = (Nh /N ) for h = 1, 2, · · · , k, so that w = kh=1 wh = 1, N is the number of units in the whole population and N h is the total number of units in stratum h. It is easily shown that the proposed mean estimator µ ˆsA is an unbiased estimate for the population mean µA . The variance of the mean estimator µ ˆ sA is: Var(ˆ µsA ) =
(19)
k wh2 2 ((µ2Ah + σA )(Ph + (1 − Ph )Th h nh
h=1
+(1 − Ph )(1 − Th )(1 + ψh2 )) − µ2Ah ).
E(Uh ) = E(αh Ah + (1 − αh )(βh Ah
µ ˆ Ah =
1 2 ((µ2Ah + σA )(Ph + (1 − Ph )Th h nh
µ ˆsA =
⎧ ⎪ ⎪ 1 if a respondent chooses a statement 1 ⎨ in R1h αh = 0 if a respondent chooses a statement 2 ⎪ ⎪ ⎩ in R1h
= Ph µAh + (1 − Ph )(Th µAh
=
(15)
where,
+(1 − βh )Ah Bh ))
Var(ˆ µ Ah )
(17)
2 is usually unavailable, Information on µ Ah and σA h 2 however if prior information on µ Ah and σA is availh able from past experience, then we may derive the following optimal allocation formula. Using the optimal-allocation approach based on Kim and Warde [6], one can show that the variance in Eq. (19) is minimized when n 1 , n2 , . . . , nk are chosen such that (the first equation on the top of the page). Under this optimal-allocation assumption, the variance in Eq. (19) becomes in Eq. (20).
Theorem 4.1. Assuming optimal allocation, when w1 = w2 = 1/2 and D = (D1 + D2 )/2, the stratified estimator µ ˆsA is more efficient than the proposed model
0.2
80 20
40
60
T=0.8 T=0.6 T=0.4 T=0.2
0.4
0.6
0.8
0.2
0.4
0.6
0.8
80 20
40
60
T=0.8 T=0.6 T=0.4 T=0.2
0
0
20
40
60
T=0.8 T=0.6 T=0.4 T=0.2
Relative Efficiency
P
80
P
Relative Efficiency
35
0
Relative Efficiency
80 20
40
60
T=0.8 T=0.6 T=0.4 T=0.2
0
Relative Efficiency
J.-B. Ryu et al. / On stratified randomized response sampling
0.2
0.4
0.6
0.8
0.2
P
0.4
0.6
0.8
P
Fig. 1. The relative efficiency of µ ˆA with respect to µ ˆ1 as a function of P = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and T = 0.2, 0.4, 0.6, 0.8 2 = 1, σ 2 = σ 2 = 0.5 and µ = 2 (top left); µ = 4 (top right); µ = 6 (bottom left); µ = 8 (bottom right). with σA A A A A Y B
estimator µ ˆ A , where, 2 D = ((µ2A + σA )(P + (1 − P )T
+(1 − P )(1 − T )(1 + ψ 2 )) − µ2A ), 2 )(Ph + (1 − Ph )Th Dh = (µ2Ah + σA h
µ ˆ1 =
Var(T¯ ) P2 1 2 = (σ 2 + P (σA − σY2 ) nP 2 Y
Var(ˆ µ1 ) =
+(1 − Ph )(1 − Th )(1 + ψh2 )) − µ2Ah , for h = 1 and 2. For other cases, we can easily check the relative efficiency by a variance comparison with various settings 2 of µAh , σA , Ph , and Th . Theorem 4.1 guarantees that h when optimal-allocation is used, the stratified estimator, µ ˆ sA , is more efficient than the estimator, µ ˆ A , which ignores the stratification.
T¯ − (1 − P )µY , P
+P (1 − P )(µA − µY )2 ). Under the assumptions with µ Y = µB = 1 and 2 σY2 = σB = ψ2 , µ ˆ1 =
T¯ − (1 − P ) , P Var(T¯ ) P2 1 2 = (ψ 2 + P (σA − σY2 ) nP 2
Var(ˆ µ1 ) = 5. Comparison and discussion In this section, we present a numerical study of the two-stage quantitative randomized response model. The purpose of the simulation is to confirm that the proposed technique is more efficient. We compare the original quantitative randomized response model proposed by Greenberg et al. [3] (ˆ µ 1 ) with the proposed model (ˆ µA ) in terms of variance. From Eqs (4) and (5), the mean and the variance of Greenberg et al. [3]’s sensitive mean estimator (when µ Y is known) are
+P (1 − P )(µA − 1)2 ). The relative efficiency of µ ˆ A with respect to µ ˆ 1 is as follows: Var(ˆ µ1 ) RE2 = Var(ˆ µA ) 2 = [ψ 2 + P (σA − ψ2 )
36
J.-B. Ryu et al. / On stratified randomized response sampling
+P (1 − P )(µA − 1)2 ]/ [P
2
((µ2A
+
2 σA )(P
References
+ (1 − P )T
+(1 − P )(1 − T )(1 + ψ )) − 2
µ2A )].
Figure 1 shows that the proposed estimator, µ ˆ A , is more efficient than the Greenberg et al. [3] estimator, 2 = 1. We can show that the proposed µ ˆ1 , with σA method is more efficient than the Greenberg et al. [3] method if the coefficient of variation, C B = σB /µB = σB 1.0. Our newly proposed two-stage quantitative randomized response model improves the performance by taking advantage of randomized response information provided by second stage. We have shown that our model is much more efficient than other models (Greenberg et al. [3] and Gupta et al. [4]). Additionally, we have provided a comprehensive description of the two-stage quantitative stratified randomized response model and its statistical properties. The use of stratified quantitative randomized response model can overcome the limitations of randomized response model which can lose the individual characteristics of the respondents.
[1] R. Arnab, Optional randomized response techniques for complex survey designs, Biometrical Journal 46(1) (2004), 114– 124. [2] B.H. Eichhorn and L.S. Hayre, Scrambled randomized response methods for obtaining sensitive quantitative data, Journal of Statistical Planning and Inference 7 (1983), 307–316. [3] B.G. Greenberg, R.R. Kuebler Jr., J.R. Abernathy and D.G. Horvitz, Application of the randomized response technique in obtaining quantitative data, Journal of the American Statistical Association 66 (1971), 243–250. [4] S. Gupta, B. Gupta and S. Singh, Estimation of sensitivity level of personal interview survey questions, Journal of Statistical Planning and Inference 100 (2002), 239–247. [5] K. Hong, J. Yum and H. Lee, A stratified randomized response technique, Korean Journal of Applied Statististics 7 (1994), 141–147. [6] J.-M. Kim and W.D. Warde, A stratified Warner’s randomized response model, Journal of Statistical Planning and Inference 120(1–2) (2004), 155–165. [7] N.S. Mangat and R. Singh, An alternative randomized response procedure, Biometrika 77 (1990), 439–442. [8] S.L. Warner, Randomized response: a survey technique for eliminating evasive answer bias, Journal of the American Statistical Association 60 (1965), 63–69.