METRON - International Journal of Statistics 2007, vol. LXV, n. 1, pp. 59-66
AMITAVA SAHA
A simple randomized response technique in complex surveys

Summary - Warner (1965) proposed the randomized response (RR) device as a tool for eliminating evasive answer bias while collecting information on sensitive issues. Following Warner, several other RR procedures have emerged in the literature. Eichhorn and Hayre (1983) pioneered the scrambled RR (SRR) technique for situations where the reply to a sensitive question is a quantitative variable rather than a dichotomous one. The RR devices currently available depend on the nature of the response variable, i.e., whether it is qualitative or quantitative, so no single device can be applied in both cases. In this paper, we propose a response-independent RR device that can be used when the reply to a sensitive question is either dichotomous or quantitative in nature. A numerical study for efficiency comparison under alternative sampling designs is also presented.

Key Words - Randomized response; Scrambled response; Sensitive variable; Unequal probability sampling.
Received July 2006 and revised April 2007.

1. Introduction

Warner (1965) proposed the randomized response (RR) device for estimating the proportion of persons in a community bearing a socially disapproved characteristic, say, A. The main purpose of introducing this method was to collect truthful responses to sensitive questions while protecting the respondents' privacy. Horvitz et al. (1967), Greenberg et al. (1969), Kuk (1990), Mangat and Singh (1990, 1991), Mangat (1994), Mangat et al. (1995a, 1995b) and Christofides (2003) developed several other RR devices for obtaining information on sensitive issues. Eichhorn and Hayre (1983) pioneered a device called the 'scrambled randomized response' (SRR) technique for the case where the response to a sensitive question is a quantitative variable instead of a dichotomous one assuming only two values, '0' and '1' or 'yes' and 'no'. The RR devices referred to above were developed assuming that the sample is selected by simple random sampling with replacement (SRSWR). Chaudhuri (2001a, 2001b, 2002, 2004) extended some of these RR devices to more general complex survey situations where varying probability sampling designs are adopted. Arnab and Singh (2002) developed a procedure for estimating the proportion of individuals belonging to a sensitive group such as criminals, freedom fighters etc., together with the mean of a stigmatizing or sensitive quantitative character associated with such hidden groups.

Often it is required not only to estimate the proportion of persons in a community bearing a stigmatizing characteristic but also to estimate the population mean or total of another sensitive variable which is quantitative in nature and related to the stigmatizing attribute. For example, it may be of interest to know simultaneously the proportion of government servants evading income tax and the amount of tax evaded; the percentage of individuals encountering accidents due to habits of drunken driving and the total number of such accidents; or the proportion of women undergoing induced abortions in a given community and the average number of such abortions a woman in that community has undergone.

The current RR literature contains numerous devices that are applicable either to qualitative response variables or to quantitative response variables alone. None of the RR devices available so far can cater to both situations. That is, if one has to gather information simultaneously on two or more sensitive questions, some of which result in simple 'yes' or 'no' responses and the others in quantitative variables, one has to adopt more than one RR device depending on the nature of the response variables. Here we attempt to develop an RR device that is independent of the type of the response variable and that applies when individuals are sampled with unequal selection probabilities.
The proposed procedure is described in Section 2 and a numerical example showing the performance of the proposed procedure under three alternative sampling designs is presented in Section 3.
2. The proposed procedure

Suppose that U is a finite population of N individuals and $x_i$, $y_i$ are the values, for the ith individual, of sensitive variables x and y related to a sensitive feature, say, A. We assume that x is a dichotomous variable assuming only the two values '0' and '1', while y is a quantitative one. Our problem is to estimate simultaneously

$$\pi_A = \Big(\sum_{i=1}^{N} x_i\Big)\Big/N \quad \text{and} \quad Y_A = \sum_{i=1}^{N} y_i$$

on choosing a sample, say, s of size n according to any arbitrary sampling design p. Let $\{z_i > 0 : i = 1, \ldots, M\}$ and $\{u_i > 0 : i = 1, \ldots, L\}$ be two independent sets of random numbers, both also independent of x and y, with known means and variances. Then our proposed RR procedure is as below.
Each sampled person is presented with the two sets of independent random numbers and is first requested to choose at random a number, say, u from the set $\{u_i > 0 : i = 1, \ldots, L\}$ and to add the selected number to his/her true x- or y-value. In the next step he/she is advised to draw another number, say, z at random from $\{z_i > 0 : i = 1, \ldots, M\}$ and to report the scrambled response obtained by multiplying (x + u) or (y + u) by z. Here the interviewer is totally unaware of the random numbers z and u used for scrambling the true responses x or y, but has complete knowledge of the distributions that generated the two sets of numbers. That is, if we write $E_R$ and $V_R$ to denote the expectation and variance operators, respectively, with respect to any arbitrary RR device, then

$$E_R(z_i) = \bar{Z}, \quad V_R(z_i) = S_z^2, \quad E_R(u_i) = \bar{U}, \quad V_R(u_i) = S_u^2$$

are known. We first consider the problem of estimation of $Y_A$. The procedure for estimation of $\pi_A$ can then be developed in a similar manner. Let $w_i$ be the scrambled randomized response received from the ith selected individual. Then we have

$$w_i = z_i (y_i + u_i), \quad i = 1, \ldots, n. \qquad (1)$$
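The two-step scrambling in (1) and its reversal by the interviewer can be sketched as follows. This is a minimal illustration, not the author's implementation: the sets `Z_SET` and `U_SET`, the seed and the helper names `scramble` and `unscramble` are assumptions introduced here, and only the means of the two sets are used on the interviewer's side.

```python
import random
from statistics import mean

# Hypothetical scrambling sets; the procedure only requires positive numbers
# whose means and variances are known to the interviewer.
Z_SET = [0.2, 0.4, 0.6, 0.8]
U_SET = [1, 2, 3, 4, 5]

Z_BAR = mean(Z_SET)  # plays the role of E_R(z)
U_BAR = mean(U_SET)  # plays the role of E_R(u)


def scramble(true_value, rng):
    """Respondent side: draw u and z privately, report w = z * (value + u)."""
    u = rng.choice(U_SET)
    z = rng.choice(Z_SET)
    return z * (true_value + u)


def unscramble(w):
    """Interviewer side: r = w / Z_bar - U_bar, using only the known means."""
    return w / Z_BAR - U_BAR


rng = random.Random(2007)
y_true = 3
reports = [scramble(y_true, rng) for _ in range(20000)]
r_values = [unscramble(w) for w in reports]
print(round(mean(r_values), 2))  # close to y_true when averaged over many reports
```

A single unscrambled value can be far from the true value (it may even be negative); only its average over the randomization is tied to the true response, which is what protects the individual respondent.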
Here one may argue that, for the respondents' convenience, a single randomization could be used instead of two. Of course, one can always use a single randomization to scramble the true response. But we feel that a single randomization, when the response to the sensitive question is a 'qualitative' one, i.e., '1' or '0', might not induce sufficient confidence among the respondents about their privacy protection. For example, if a respondent's true x-value is '0' and we use a single randomization technique, the scrambled response from the respondent will be z 2 . In such a situation, the respondent may feel that the RR device is unable to provide sufficient protection to his/her privacy, and this may ultimately reduce the level of cooperation from the respondents.

Note that

$$E_R(w_i) = E_R(z_i y_i) + E_R(z_i u_i) = y_i \bar{Z} + \bar{Z}\bar{U}$$

so that for $r_i = (w_i/\bar{Z}) - \bar{U}$ we have $E_R(r_i) = y_i$ ∀ i ∈ U, implying that $r_i$ is an unbiased estimator for $y_i$, ∀ i ∈ U. Again, $V_R(r_i) = V_R(w_i)/\bar{Z}^2$ and, writing $C_R$ to denote the covariance operator with respect to the RR device, we have

$$\begin{aligned}
V_R(w_i) &= V_R(z_i y_i) + V_R(z_i u_i) + 2\,C_R(z_i y_i, z_i u_i) \\
&= y_i^2 S_z^2 + E_R(z_i^2)E_R(u_i^2) - \bar{Z}^2\bar{U}^2 + 2\big[y_i \bar{U} E_R(z_i^2) - y_i \bar{U} \bar{Z}^2\big] \\
&= y_i^2 S_z^2 + \big(S_z^2 S_u^2 + S_z^2 \bar{U}^2 + S_u^2 \bar{Z}^2\big) + 2 y_i S_z^2 \bar{U}
\end{aligned}$$
so that $V_R(r_i) = a y_i^2 + b y_i + c = V_i$, say, where $a = S_z^2/\bar{Z}^2$, $b = 2 S_z^2 \bar{U}/\bar{Z}^2$ and $c = (S_z^2/\bar{Z}^2)(S_u^2 + \bar{U}^2) + S_u^2$, and an unbiased estimator for $V_i$ is given by $v_i = a r_i^2 + b r_i + c$.

Now let $E_p$ and $V_p$ be the operators for expectation and variance, respectively, with respect to the sampling design p, and suppose that $t_A = \sum_{i=1}^{N} b_{si} I_{si} y_i$, where $I_{si} = 1\,(0)$ if $i \in s$ ($i \notin s$) and the $b_{si}$'s are constants free of $Y = (y_1, \ldots, y_N)$ such that $E_p(b_{si} I_{si}) = 1$. Then $t_A$ is a homogeneous linear unbiased estimator for $Y_A$. We also write

$$V_p(t_A) = \sum_{i=1}^{N} y_i^2 c_i + \sum_{i=1}^{N} \sum_{j \neq i} y_i y_j c_{ij}$$
where $c_i = E_p(b_{si}^2 I_{si}) - 1$ and $c_{ij} = E_p\big[(b_{si} I_{si} - 1)(b_{sj} I_{sj} - 1)\big]$. Then an unbiased estimator of $V_p(t_A)$ is given by

$$v_p(t_A) = \sum_{i=1}^{N} y_i^2 c_{si} I_{si} + \sum_{i=1}^{N} \sum_{j \neq i} y_i y_j c_{sij} I_{sij}$$
where $I_{sij} = I_{si} I_{sj}$, and the $c_{si}$'s and $c_{sij}$'s are constants free of Y and $R = (r_1, \ldots, r_i, \ldots, r_N)$ such that $E_p(c_{si} I_{si}) = c_i$ and $E_p(c_{sij} I_{sij}) = c_{ij}$ ∀ i, j ∈ U. Now since the $y_i$'s are unobservable, one cannot use $t_A$ for estimating $Y_A$, and as the $r_i$'s are unbiased for the $y_i$'s, an unbiased estimator for $Y_A$ is obtained as

$$e_A = \sum_{i=1}^{N} r_i b_{si} I_{si} \qquad (2)$$
because $E(e_A) = E_p E_R(e_A) = E_p(t_A) = Y_A$ and, equivalently, $E(e_A) = E_R E_p(e_A) = E_R\big(\sum_{i=1}^{N} r_i\big) = Y_A$. Now following Raj (1968) and Rao (1975), two unbiased estimators for $V(e_A)$ are given by

$$v_1(e_A) = v_p(t_A)\big|_{Y=R} + \sum_{i=1}^{N} b_{si} I_{si} v_i \qquad (3)$$

$$v_2(e_A) = v_p(t_A)\big|_{Y=R} + \sum_{i=1}^{N} (b_{si}^2 - c_{si}) I_{si} v_i. \qquad (4)$$
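As an illustration of (2)-(4), the following sketch specializes the general estimator to SRSWOR with the expansion weights $b_{si} = N/n$. This choice is an assumption made here for concreteness, not a design singled out in the text. Under it, $c_{si} = N(N-n)/n^2$ satisfies $E_p(c_{si} I_{si}) = c_i$, and the general expression for $v_p(t_A)$ reduces to the familiar $N^2(1/n - 1/N)s^2$, with $s^2$ the sample variance. The scrambling sets, seed and function name are hypothetical.

```python
import random
from statistics import mean, pvariance, variance

# Hypothetical scrambling sets (not from the paper); the interviewer uses
# only their means and variances.
Z_SET = [0.2, 0.4, 0.6, 0.8]
U_SET = [1, 2, 3, 4, 5]
Z_BAR, U_BAR = mean(Z_SET), mean(U_SET)
SZ2, SU2 = pvariance(Z_SET), pvariance(U_SET)

# Constants a, b, c of V_i = a*y_i^2 + b*y_i + c as defined in Section 2.
A = SZ2 / Z_BAR**2
B = 2 * SZ2 * U_BAR / Z_BAR**2
C = A * (SU2 + U_BAR**2) + SU2


def estimate_total_srswor(r, N):
    """Return (e_A, v1, v2) from the unscrambled values r of an SRSWOR sample."""
    n = len(r)
    b_si = N / n                       # assumed weights: inverse inclusion probability
    c_si = N * (N - n) / n**2          # chosen so that E_p(c_si * I_si) = c_i
    e_A = b_si * sum(r)                # equation (2)
    v_i = [A * ri**2 + B * ri + C for ri in r]      # v_i as defined in the text
    # v_p(t_A) evaluated at Y = R; for SRSWOR it equals N^2 (1/n - 1/N) s_r^2
    v_p = N**2 * (1 / n - 1 / N) * variance(r)
    v1 = v_p + b_si * sum(v_i)                      # equation (3)
    v2 = v_p + (b_si**2 - c_si) * sum(v_i)          # equation (4)
    return e_A, v1, v2


rng = random.Random(1)
N, n = 50, 10
y_pop = [rng.randint(0, 6) for _ in range(N)]       # artificial true values
sample = rng.sample(range(N), n)                    # SRSWOR draw of unit labels
r = [rng.choice(Z_SET) * (y_pop[i] + rng.choice(U_SET)) / Z_BAR - U_BAR
     for i in sample]
e_A, v1, v2 = estimate_total_srswor(r, N)
print(e_A, v1, v2)
```

With these weights $b_{si}^2 - c_{si} = N/n = b_{si}$, so (3) and (4) coincide for SRSWOR; the two forms differ only under designs where that identity fails.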
Now defining $w_i' = z_i(x_i + u_i)$ and proceeding in a similar fashion as above, it can be shown that for $r_i' = (w_i'/\bar{Z}) - \bar{U}$, $E_R(r_i') = x_i$ ∀ i ∈ U, so that $r_i'$ is an unbiased estimator for $x_i$. Also, since $x_i^2 = x_i$ for a dichotomous variable, we have $V_R(r_i') = b' x_i + c' = V_i'$, say, where

$$b' = \frac{(1 + 2\bar{U})\,S_z^2}{\bar{Z}^2} \quad \text{and} \quad c' = (S_z^2/\bar{Z}^2)(S_u^2 + \bar{U}^2) + S_u^2.$$

So an unbiased estimator for $V_i'$ is obtained as $v_i' = b' r_i' + c'$. Hence an unbiased estimator of $\pi_A$ may be found on employing the estimator $\hat{\pi}_A = \big(\sum_{i=1}^{N} r_i' b_{si} I_{si}\big)/N$. Also following Raj (1968) and Rao (1975), two unbiased estimators for $V(\hat{\pi}_A)$ are given by

$$v_1(\hat{\pi}_A) = \Big[v_p(t_A)\big|_{Y=R'} + \sum_{i=1}^{N} b_{si} I_{si} v_i'\Big]\Big/N^2 \qquad (5)$$

$$v_2(\hat{\pi}_A) = \Big[v_p(t_A)\big|_{Y=R'} + \sum_{i=1}^{N} (b_{si}^2 - c_{si}) I_{si} v_i'\Big]\Big/N^2 \qquad (6)$$

where $R' = (r_1', \ldots, r_i', \ldots, r_N')$.
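The RR-level identities of this section can be checked exactly, without simulation, by averaging over every equally likely (z, u) pair. The sketch below assumes two small hypothetical scrambling sets and confirms that the unscrambled value is unbiased, that its variance equals $a y^2 + b y + c$ in the quantitative case and $b' x + c'$ in the dichotomous case, and that the linear plug-in $b' r' + c'$ has expectation $V_i'$. It also evaluates the expectation of the quadratic plug-in $a r^2 + b r + c$: because $E_R(r^2) = y^2 + V_R(r)$, the enumeration returns $(1 + a)V_i$ for it, which suggests that a rescaling by $1/(1 + a)$ would be needed if exact unbiasedness of that plug-in is wanted.

```python
from itertools import product
from statistics import mean, pvariance

# Hypothetical scrambling sets (assumptions for illustration only).
Z_SET = [0.2, 0.4, 0.6, 0.8]
U_SET = [1, 2, 3, 4, 5]
Z_BAR, U_BAR = mean(Z_SET), mean(U_SET)
SZ2, SU2 = pvariance(Z_SET), pvariance(U_SET)

a = SZ2 / Z_BAR**2
b = 2 * SZ2 * U_BAR / Z_BAR**2
c = a * (SU2 + U_BAR**2) + SU2
b_prime = (1 + 2 * U_BAR) * SZ2 / Z_BAR**2
c_prime = c


def rr_expectation(func, true_value):
    """Exact E_R[func(r)], averaging over all equally likely (z, u) pairs."""
    return mean(func(z * (true_value + u) / Z_BAR - U_BAR)
                for z, u in product(Z_SET, U_SET))


def rr_moments(true_value):
    """Exact (E_R(r), V_R(r)) of the unscrambled value for a given true value."""
    m1 = rr_expectation(lambda r: r, true_value)
    m2 = rr_expectation(lambda r: r * r, true_value)
    return m1, m2 - m1**2


# Quantitative case: mean, variance, closed form, and quadratic plug-in.
for y in (0, 1, 4, 10):
    m, v = rr_moments(y)
    closed_form = a * y**2 + b * y + c
    plug_in = rr_expectation(lambda r: a * r * r + b * r + c, y)
    print(y, round(m, 6), round(v, 6), round(closed_form, 6), round(plug_in, 6))

# Dichotomous case: variance is linear in x, and so is its plug-in estimator.
for x in (0, 1):
    m, v = rr_moments(x)
    linear_plug_in = rr_expectation(lambda r: b_prime * r + c_prime, x)
    print(x, round(m, 6), round(v, 6), round(b_prime * x + c_prime, 6),
          round(linear_plug_in, 6))
```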
3. Numerical illustrations

We consider an artificial population of N = 319 individuals in a particular community. Our problem is to estimate the proportion, $\pi_A$, say, of people in the said community encountering accidents due to habits of drunken driving and also the total number of such accidents encountered, say, $Y_A$, on choosing a sample of n = 87 individuals from the N individuals. We draw the n = 87 persons by three alternative sampling designs, namely, (i) simple random sampling with replacement (SRSWR), (ii) simple random sampling without replacement (SRSWOR) and (iii) the sampling scheme due to Rao, Hartley and Cochran (RHC, 1962) as a representative of unequal probability sampling designs. The total expenditure incurred during the last month by the household to which an individual belongs is taken as the size-measure for selecting the persons with the RHC design. The SRSWR and SRSWOR schemes, being well known and simple, do not require any elaboration, and we discuss below the RHC scheme in brief.

In the RHC scheme, first the N units of the population are randomly divided into n groups, the gth group having $N_g$ units such that $\sum_n N_g = N$, where $\sum_n$ denotes the sum over the n random groups. Let $p_i$ be the value of the normed size-measure for the ith unit of the population and $y_i$ be the value of the study variable for the corresponding population unit. Then, writing $Q_g$ for the total of the normed size-measures of the units in the gth group, one unit is selected from the gth group with probability equal to its p-value divided by $Q_g$, and this process is repeated independently for all the n groups formed. Now writing $(y_g, p_g)$ for the (y, p) value of the unit selected from the gth group, an unbiased estimator for $Y_A = \sum_{i=1}^{N} y_i$ as proposed by RHC is

$$t_A = \sum_n \frac{Q_g}{p_g}\, y_g$$

along with an unbiased variance estimator

$$v_p(t_A) = B \sum_n Q_g \Big(\frac{y_g}{p_g} - t_A\Big)^2$$
where $B = \big(\sum_n N_g^2 - N\big)\big/\big(N^2 - \sum_n N_g^2\big)$. Note that here the $y_g$'s are unknown and the $r_g$'s are unbiased estimators for the $y_g$'s. Thus an unbiased estimator for $Y_A$ is obtained as $e_A = \sum_n (Q_g/p_g)\, r_g$, where the $r_g$'s are as defined in Section 2. Now following Chaudhuri, Adhikary and Dihidar (2000), an unbiased estimator for $V(e_A)$ is given by

$$v(e_A) = v_p(t_A)\big|_{Y=R} + \sum_n \frac{Q_g}{p_g}\, v_g$$
where $v_g = a r_g^2 + b r_g + c$. An unbiased estimator for $\pi_A$, i.e., the proportion of persons encountering accidents due to habits of drunken driving, is given by $\hat{\pi}_A = \big(\sum_n (Q_g/p_g)\, r_g'\big)/N = e_A'/N$, say, where $e_A' = \sum_n (Q_g/p_g)\, r_g'$ and the $r_g'$'s are as defined in Section 2. Also, an unbiased estimator for $V(\hat{\pi}_A)$ is given by $v(\hat{\pi}_A) = v(e_A')/N^2$, where $v(e_A') = v_p(t_A)\big|_{Y=R'} + \sum_n (Q_g/p_g)\, v_g'$ and $v_g' = b' r_g' + c'$.

In our present example, we have taken the numbers between 0 and 1 (excluding 0) and the numbers 1 to 1000 as the two sets of random numbers for scrambling the true responses x and y. In a practical situation, several mechanisms are available for generating the scrambling distributions. A possible way of implementation might be to provide the respondents with a random number table having random numbers of specified digits between 0 and 1 (excluding 0), and a box containing cards or balls marked 1 to 1000. We chose these sets because a larger set of numbers is expected to induce more randomness in the scrambling distributions, but one may very well use a smaller set for the convenience of the interviewees. Eichhorn and Hayre (1983) discussed in detail how the scrambling distributions are to be chosen.

Note that although two randomizers are used for scrambling the true responses, for the convenience of the interviewees the interviewer might instruct the respondents to choose one number randomly from each of the two sets only once and to use the same random numbers for scrambling both the true x- and y-values. That is, if $z_{0i}$ is a positive random number of specified digits between 0 and 1 and $u_{0i}$ is another number randomly chosen from the set (1, 2, . . . , 1000) by the respondent, the respondent might be asked to report $z_{0i}(y_i + u_{0i})$ and $z_{0i}(x_i + u_{0i})$, respectively, as the scrambled randomized responses.

Let θ be any point estimator for Y with an unbiased variance estimator v. Then, assuming $\tau = (\theta - Y)/\sqrt{v}$ to be a standard normal deviate, we consider the following three criteria for comparing the efficacy of the estimator:

(a) the average coefficient of variation (ACV), which is the average of the coefficient of variation $CV = 100 \times (\sqrt{v}/\theta)$ over T = 1000 replicated samples;
(b) the actual coverage percentage (ACP), defined as the percentage of the 1000 replicated samples for which the 95% confidence interval $(\theta - 1.96\sqrt{v},\ \theta + 1.96\sqrt{v})$ covers the true value Y; and
(c) the average length (AL) of the confidence intervals, that is, the average of the length $2 \times 1.96\sqrt{v}$ over the 1000 replicated samples.
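The RHC selection and the estimator pair $(t_A, v_p(t_A))$ described earlier in this section can be sketched as follows. The function names are hypothetical, and the group sizes are taken as equal as possible, which is an assumption of this sketch; the scheme as described only requires a random split into n groups with $\sum_n N_g = N$.

```python
import random


def rhc_sample(p, n, rng):
    """Rao-Hartley-Cochran draw; p holds normed size measures summing to 1.

    Returns one (unit, Q_g, N_g) triple per random group.
    """
    N = len(p)
    units = list(range(N))
    rng.shuffle(units)                      # random partition of the population
    base, extra = divmod(N, n)              # assumed: sizes differ by at most one
    picks, start = [], 0
    for g in range(n):
        size = base + (1 if g < extra else 0)
        group = units[start:start + size]
        start += size
        weights = [p[i] for i in group]
        Q_g = sum(weights)
        # one unit per group, selected with probability p_i / Q_g
        unit = rng.choices(group, weights=weights, k=1)[0]
        picks.append((unit, Q_g, size))
    return picks


def rhc_estimate(values, picks, p, N):
    """t = sum_g (Q_g/p_g) value_g and v_p = B sum_g Q_g (value_g/p_g - t)^2."""
    t = sum(Q_g * values[k] / p[unit]
            for k, (unit, Q_g, _) in enumerate(picks))
    sum_Ng2 = sum(N_g**2 for _, _, N_g in picks)
    B = (sum_Ng2 - N) / (N**2 - sum_Ng2)
    v_p = B * sum(Q_g * (values[k] / p[unit] - t)**2
                  for k, (unit, Q_g, _) in enumerate(picks))
    return t, v_p


rng = random.Random(7)
sizes = [rng.uniform(1, 10) for _ in range(30)]     # artificial size measures
p = [s / sum(sizes) for s in sizes]
y = [100 * p_i for p_i in p]                        # y exactly proportional to size
picks = rhc_sample(p, 6, rng)
t, v_p = rhc_estimate([y[unit] for unit, _, _ in picks], picks, p, len(p))
print(t, v_p)  # with y proportional to p, t reproduces the total 100 and v_p is ~0
```

Passing the unscrambled values $r_g$ in place of the true values gives $e_A$ and $v_p(t_A)|_{Y=R}$; adding $\sum_n (Q_g/p_g) v_g$ to the latter then gives $v(e_A)$.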
Clearly, the smaller the ACV and AL and the higher the ACP, the better is the performance of the estimator θ. We present below, in Table 1, the comparative performance of the estimators under the three sampling designs.

Table 1: Performance of the proposed procedure under three alternative sampling schemes.

Sampling      Estimation of $\pi_A$         Estimation of $Y_A$
scheme        ACV     ACP     AL            ACV     ACP     AL
SRSWR         28.0    96.8    1.008         10.9    85.0    194.227
SRSWOR        24.2    98.4    0.863          8.2    87.5    166.412
RHC           26.3    96.3    1.216         14.5    90.5    269.057
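The three criteria reported in Table 1 can be computed from replicated point and variance estimates along the following lines. This is a minimal sketch: the function name and the toy inputs are hypothetical, the variance estimates are assumed non-negative, and the interval is treated as closed at its endpoints.

```python
from math import sqrt
from statistics import mean


def performance(estimates, variances, true_value, z=1.96):
    """Return (ACV, ACP, AL) over replicated (theta, v) pairs."""
    cvs, lengths, hits = [], [], 0
    for theta, v in zip(estimates, variances):
        half_width = z * sqrt(v)            # assumes v >= 0
        cvs.append(100 * sqrt(v) / theta)   # coefficient of variation, in percent
        lengths.append(2 * half_width)      # length of the confidence interval
        if theta - half_width <= true_value <= theta + half_width:
            hits += 1                       # interval covers the true value
    acp = 100 * hits / len(estimates)
    return mean(cvs), acp, mean(lengths)


# Toy replicated estimates of a total whose true value is 100.
acv, acp, al = performance([98, 102, 110], [4, 4, 4], true_value=100)
print(round(acv, 3), round(acp, 2), round(al, 2))
```

In the study above the same computation would run over T = 1000 replicated samples for each design, once for $\hat{\pi}_A$ and once for $e_A$.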
Discussions. From Table 1, it is observed that SRSWOR turns out to be the best sampling procedure in the context of our numerical example for estimating π A as well as Y A if one goes by the two criteria ACV and AL. SRSWOR outperforms the RHC scheme and this may be due to the fact that the auxiliary character considered here as the size-measure for selection of the sample individuals by the RHC design is not well correlated with the study variables. Here it may be noted that our proposed procedure can always be implemented in a stratified unequal probability sampling set-up by considering the entire development for a single stratum.
REFERENCES

Arnab and Singh (2002) Estimation of the size and mean value of stigmatized characteristic of a hidden gang in a finite population: a unified approach, Ann. Inst. Math. Stat., 54 (3), 659–666.
Chaudhuri, A. (2001a) Using randomized response from a complex survey to estimate a sensitive proportion in a dichotomous finite population, Journal of Statistical Planning and Inference, 94, 37–42.
Chaudhuri, A. (2001b) Estimating sensitive proportions from unequal probability sample using randomized responses, Pakistan Journal of Statistics, 17 (3), 259–270.
Chaudhuri, A. (2002) Estimating sensitive proportions from randomized responses in unequal probability sampling, Calcutta Statistical Association Bulletin, 52 (205-208), 315–322.
Chaudhuri, A. (2004) Christofides' randomized response technique in complex sample surveys, Metrika, 60 (3), 223–228.
Chaudhuri, A., Adhikary, A. K., and Dihidar, S. (2000) Mean square error estimation in multistage sampling, Metrika, 52 (2), 115–131.
Christofides, T. C. (2003) A generalized randomized response technique, Metrika, 57, 195–200.
Eichhorn, B. H. and Hayre, L. S. (1983) Scrambled randomized response methods for obtaining sensitive quantitative data, Journal of Statistical Planning and Inference, 7, 307–316.
Greenberg, B. G., Abul-Ela, Simmons, W. R., and Horvitz, D. G. (1969) The unrelated question randomized response model: theoretical framework, Jour. Amer. Stat. Assoc., 64, 520–539.
Horvitz, D. G., Shah, B. V., and Simmons, W. R. (1967) The unrelated question randomized response model, Proc. Soc. Sec. Amer. Stat. Assoc., 65–72.
Kuk, A. Y. C. (1990) Asking sensitive question indirectly, Biometrika, 77, 436–438.
Mangat, N. S. (1994) An improved randomized response strategy, Journal of the Royal Statistical Society, Series B, 56, 93–95.
Mangat, N. S. and Singh, R. (1990) An alternative randomized response procedure, Biometrika, 77, 439–442.
Mangat, N. S. and Singh, R. (1991) An alternative approach to randomized response survey, Statistica, 51 (3), 327–332.
Mangat, N. S., Singh, R., and Singh, S. (1995a) Unrelated question randomized response model without randomization device, Estadistica, 47, 59–68.
Mangat, N. S., Singh, R., and Singh, S. (1995b) On use of a modified randomization device of Warner's model, Journal of Indian Society of Statistics and Operations Research, 16, 65–69.
Raj, Des (1968) Sampling Theory, McGraw-Hill, N.Y.
Rao, J. N. K. (1975) Unbiased variance estimation for multi-stage designs, Sankhya C, 37, 133–139.
Rao, J. N. K., Hartley, H. O., and Cochran, W. G. (1962) On a simple procedure of unequal probability sampling without replacement, Journal of the Royal Statistical Society, Series B, 24, 482–491.
Warner, S. L. (1965) Randomized response: a survey technique for eliminating evasive answer bias, Journal of American Statistical Association, 60, 63–69.
AMITAVA SAHA
Directorate General of Commercial Intelligence & Statistics
1, Council House Street
Kolkata - 700001 (India)