This article was downloaded by: [Univ Studi Della Calabria] On: 03 February 2012, At: 08:52 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK
Journal of Applied Statistics Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/cjas20
A calibration-based approach to sensitive data: a simulation study
Giancarlo Diana (a) & Pier Francesco Perri (b)
(a) Department of Statistical Sciences, University of Padova, Via Cesare Battisti 241, 35121, Padova, Italy
(b) Department of Economics and Statistics, University of Calabria, Via P. Bucci, Cubo 0C, 87036, Arcavacata di Rende, Italy
Available online: 21 Jun 2011
To cite this article: Giancarlo Diana & Pier Francesco Perri (2012): A calibration-based approach to sensitive data: a simulation study, Journal of Applied Statistics, 39:1, 53-65 To link to this article: http://dx.doi.org/10.1080/02664763.2011.578615
Journal of Applied Statistics Vol. 39, No. 1, January 2012, 53–65
A calibration-based approach to sensitive data: a simulation study
Giancarlo Diana (a) and Pier Francesco Perri (b)*
(a) Department of Statistical Sciences, University of Padova, Via Cesare Battisti 241, 35121 Padova, Italy; (b) Department of Economics and Statistics, University of Calabria, Via P. Bucci, Cubo 0C, 87036 Arcavacata di Rende, Italy
(Received 14 October 2009; final version received 29 March 2011)
In this paper, we discuss the use of auxiliary information to estimate the population mean of a sensitive variable when data are perturbed by means of three scrambled response devices, namely the additive, the multiplicative and the mixed model. Emphasis is given to the calibration approach, and the behavior of different estimators is investigated through simulated and real data. It is shown that the use of auxiliary information can considerably improve the efficiency of the estimates without jeopardizing respondent privacy.
Keywords: calibration estimators; sensitive variable; scrambled responses; auxiliary information
1. Introduction
Since people are known to be inclined to respond untruthfully when asked personal or socially sensitive questions, survey statisticians have developed many different strategies to ensure respondent privacy and reduce under-reporting when interviewees are asked to respond directly to sensitive topics such as sexual abuse, drug addiction, noncompliance with rules and regulations and so on. One possibility is to limit the influence of the interviewer by means of self-administered survey modes. Alternatively, respondents can be instructed to give partially misclassified (or perturbed) responses using the randomized response (RR) technique introduced by Warner [23]. The RR mechanisms can significantly outperform conventional data collection methods and procure more reliable answers as discussed, among others, in [3,11,12]. This paper addresses the question of how to obtain more accurate estimates for the population mean of a quantitative sensitive variable without jeopardizing respondent privacy. The analysis is performed through Monte Carlo simulations and a real data application. The delivered results highlight the usefulness of the RR techniques.
*Corresponding author. Email:
[email protected] ISSN 0266-4763 print/ISSN 1360-0532 online © 2012 Taylor & Francis http://dx.doi.org/10.1080/02664763.2011.578615 http://www.tandfonline.com
Let us first introduce the notation and contextualize the problem. Let $U = \{1, \ldots, N\}$ be a finite population of $N$ identifiable units and $(y_i, x_i)$ the values of the study variable ($Y$) and the auxiliary one ($X$) associated with the $i$th individual. Assume that the auxiliary mean, $\bar X = N^{-1}\sum_{i\in U} x_i$, is known in advance. In order to estimate the mean $\bar Y = N^{-1}\sum_{i\in U} y_i$, a sample $s$ of size $n$ is drawn without replacement from $U$ according to a generic sampling design with first-order inclusion probabilities $\pi_i = P(i \in s)$. Let the values $d_i = \pi_i^{-1}$ denote the basic design weights. Then, the Horvitz–Thompson estimator, $\hat{\bar Y}_{HT} = N^{-1}\sum_{i\in s} d_i y_i$, can be used to obtain an unbiased estimate of $\bar Y$.
Let us suppose that the survey variable is sensitive so that the $y_i$'s are not directly observable. To preserve privacy, therefore, each respondent is asked to algebraically perturb the true value of $Y$ through one or more random numbers generated from known scrambling distributions. Let $z_i$ be the scrambled value for $y_i$ reported by the $i$th respondent once the randomization procedure has been performed. We assume that this reporting is done truthfully. The coded values $z_i$ are used to derive unbiased estimates of $y_i$ with respect to the RR device, say $r_i$, for which $E_R(r_i) = y_i$, $\forall i \in U$.
Many different RR models have been suggested in the literature. For the purpose of this article, we focus on three well-known scrambling devices: the additive model, the multiplicative model and the mixed model. Let $T$ and $S$ be two positive independent random variables, also independent of $Y$, whose distributions are completely known and, hence, so are their means and variances $\mu_t$, $\mu_s$, $\sigma_t^2$ and $\sigma_s^2$. Under the additive model [14], the $i$th respondent is asked to add his/her sensitive value $y_i$ to a scrambling value $t_i$ drawn from $T$. The interviewer receives the scrambled response $z_i = y_i + t_i$, from which $E_R(z_i) = y_i + \mu_t$ and $r_i = z_i - \mu_t$.
The multiplicative model, investigated in depth by Eichhorn and Hayre [10], is used when the scrambled response is given as $z_i = y_i t_i$ and, thus, $r_i = z_i \mu_t^{-1}$. Finally, under the mixed model analyzed by Saha [15], the $i$th respondent provides the perturbed response $z_i = t_i (y_i + s_i)$, from which it follows that $r_i = z_i \mu_t^{-1} - \mu_s$. Given a sample of the unbiased coded values $r_1, \ldots, r_n$, the estimator of $\bar Y$ is obtained from the Horvitz–Thompson estimator by replacing $y_i$ with the corresponding expression for $r_i$ under the specific RR model. For our purposes, let us consider a simple random sample without replacement (SRSWOR) for which $\pi_i = n/N$, so that $\hat{\bar Y}_{HT}$ becomes the sample mean $\bar r = n^{-1}\sum_{i=1}^{n} r_i$.
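The three scrambling devices can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that $T$ and $S$ both follow an $F(d_1, d_2)$ distribution with mean $d_2/(d_2 - 2)$; the function name `scramble` and its arguments are hypothetical and not part of the original study.

```python
import numpy as np

def scramble(y, model, rng, d1=5, d2=50):
    """Return (z, r): scrambled responses z_i and unbiased coded values r_i.

    Assumption for this sketch: T and S are i.i.d. F(d1, d2), so that
    mu_t = mu_s = d2 / (d2 - 2) (valid for d2 > 2).
    """
    mu = d2 / (d2 - 2)
    t = rng.f(d1, d2, size=y.shape)
    if model == "additive":
        z = y + t                  # z_i = y_i + t_i
        r = z - mu                 # r_i = z_i - mu_t
    elif model == "multiplicative":
        z = y * t                  # z_i = y_i * t_i
        r = z / mu                 # r_i = z_i / mu_t
    elif model == "mixed":
        s = rng.f(d1, d2, size=y.shape)
        z = t * (y + s)            # z_i = t_i * (y_i + s_i)
        r = z / mu - mu            # r_i = z_i / mu_t - mu_s
    else:
        raise ValueError(f"unknown RR model: {model}")
    return z, r
```

Averaging `r` over many respondents sharing the same true value should return that value approximately, which is the unbiasedness property $E_R(r_i) = y_i$ stated above.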
2. Calibration for RR models
The idea of using available auxiliary information to increase the efficiency of the estimates is not new in RR theory; see, among others, [4,8,9,17,19,21]. What is surprising is the limited use of the calibration approach in spite of its widespread use in survey practice (see [16] for a review). To the best of our knowledge, very few works link RR models and calibration techniques together [18,20,22]. It may be interesting, therefore, to find out whether different calibration methods proposed in the literature for nonsensitive surveys can improve the performance of RR methods. The remainder of the paper is devoted to this aim. Technical aspects are omitted for brevity and the reader is referred to the original papers for the details.
The seminal idea underlying calibration is to define a new system of weights, say $w_i$, obtained by modifying the basic ones $d_i$ as little as possible through the knowledge of $\bar X$. In order to get a calibrated estimator of $\bar Y$, say $\hat{\bar Y}_c = N^{-1}\sum_{i\in s} w_i y_i$, Deville and Särndal [7] looked for calibration weights $w_i$ which minimize the distance measure $\sum_{i\in s} (w_i - d_i)^2/(d_i q_i)$, the $q_i$'s being known positive weights unrelated to $d_i$. Minimization of this distance with respect to $w_i$ under the calibration constraint $N^{-1}\sum_{i\in s} w_i x_i = \bar X$ leads to the calibration estimator
$$\hat{\bar Y}_c = \hat{\bar Y}_{HT} + \hat\beta\,(\bar X - \hat{\bar X}_{HT}). \tag{1}$$
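As a hedged illustration of Equation (1), the sketch below solves the single-constraint minimization in closed form: a Lagrange-multiplier argument gives $w_i = d_i(1 + q_i x_i \lambda)$ with $\lambda = N(\bar X - \hat{\bar X}_{HT})/\sum_{i\in s} d_i q_i x_i^2$, so that $\hat\beta = \sum_{i\in s} d_i q_i x_i r_i / \sum_{i\in s} d_i q_i x_i^2$. The function names are hypothetical, and the coded values $r_i$ take the place of $y_i$.

```python
import numpy as np

def calibration_weights(d, x, x_bar, N, q=None):
    """Weights minimizing sum (w - d)^2 / (d q) subject to sum(w x) / N = x_bar."""
    q = np.ones_like(d) if q is None else q
    x_ht = np.sum(d * x) / N                        # Horvitz-Thompson mean of x
    lam = N * (x_bar - x_ht) / np.sum(d * q * x**2)  # Lagrange multiplier
    return d * (1.0 + q * x * lam)

def calibration_estimate(d, x, r, x_bar, N, q=None):
    """Calibrated mean N^{-1} sum(w_i r_i) computed on the coded values r_i."""
    w = calibration_weights(d, x, x_bar, N, q)
    return np.sum(w * r) / N
```

By construction the returned weights reproduce the known auxiliary mean exactly, and the resulting estimate coincides with the right-hand side of Equation (1) for the $\hat\beta$ given in the lead-in.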
Wu and Sitter [24] generalize Deville and Särndal's procedure by means of the model calibration approach, which yields
$$\hat{\bar Y}_{mc} = \hat{\bar Y}_{HT} + N^{-1}\hat\beta_{mc}\left(\sum_{i\in U}\hat\mu_i - \sum_{i\in s} d_i\hat\mu_i\right), \tag{2}$$
where the $\hat\mu_i$'s denote the fitted values derived under an adopted working model $\xi$. The two calibration approaches described above can produce nonpositive weights. To overcome the problem, the pseudo-empirical maximum likelihood approach [5,24] can be adopted. This leads to
$$\hat{\bar Y}_{el} = N^{-1}\sum_{i\in s} w_i y_i, \tag{3}$$
where weights $w_i$ are obtained by maximizing $\hat\ell(w) = \sum_{i\in s} d_i \ln w_i$ under constraints.
Following a nonparametric approach, Montanari and Ranalli [13] proposed the local polynomial model calibration estimator
$$\hat{\bar Y}^{mc}_{lp} = \hat{\bar Y}_{HT} + N^{-1}\hat\beta_{lp}\left(\sum_{i\in U}\hat m_i - \sum_{i\in s} d_i\hat m_i\right) \tag{4}$$
by adding a calibration step to the local polynomial regression estimator [2]
$$\hat{\bar Y}_{lp} = \hat{\bar Y}_{HT} + N^{-1}\left(\sum_{i\in U}\hat m_i - \sum_{i\in s} d_i\hat m_i\right). \tag{5}$$
The $\hat m_i$'s denote the fitted values obtained from Equation (5) in Montanari and Ranalli. Finally, Montanari and Ranalli [13] proposed the neural network model calibration estimator
$$\hat{\bar Y}^{mc}_{nn} = \hat{\bar Y}_{HT} + N^{-1}\hat\beta_{nn}\left(\sum_{i\in U}\hat f_i - \sum_{i\in s} d_i\hat f_i\right), \tag{6}$$
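The structure shared by Equations (2), (4) and (6) can be sketched in Python for a polynomial working model. This is only an illustration under stated assumptions: the working model is fitted by design-weighted least squares, and $\hat\beta_{mc}$ is taken as the design-weighted regression slope of the responses on the fitted values, a form commonly associated with model calibration but not spelled out in the text. The function name is hypothetical.

```python
import numpy as np

def model_calibration_mean(x_pop, x_s, r_s, d, degree=2):
    """Model calibration estimate of the mean with a polynomial working model.

    x_pop : auxiliary values for all N population units
    x_s, r_s, d : auxiliary values, coded responses and design weights in the sample
    """
    N = x_pop.size
    # np.polyfit minimizes sum (w_i * residual_i)^2, hence sqrt(d) as weights.
    coef = np.polyfit(x_s, r_s, degree, w=np.sqrt(d))
    mu_pop = np.polyval(coef, x_pop)   # fitted values on the population
    mu_s = np.polyval(coef, x_s)       # fitted values on the sample
    y_ht = np.sum(d * r_s) / N
    # Assumed form of beta_mc: weighted slope of r on the fitted values.
    mu_w = np.sum(d * mu_s) / np.sum(d)
    r_w = np.sum(d * r_s) / np.sum(d)
    beta_mc = np.sum(d * (mu_s - mu_w) * (r_s - r_w)) / np.sum(d * (mu_s - mu_w) ** 2)
    return y_ht + beta_mc * (mu_pop.sum() - np.sum(d * mu_s)) / N
```

When the working model fits the sampled responses exactly, the slope equals one and the estimate reduces to the population mean of the fitted values, which is a convenient sanity check.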
where the $\hat f_i$'s denote the fitted values for a regression function defined in a neural network framework.
The calibration methods presented for a nonsensitive variable pose no particular difficulty when applied to the case where the values of the study variable are assumed to be sensitive and, thus, collected by RR models. In fact, in our notation, it is sufficient to replace $y_i$ with $r_i$. The calibration estimators based on the observed values $r_i$ inherit the same asymptotic properties as the traditional calibration estimators performed on the $y_i$'s, usually with a slower convergence rate due to the fact that the responses $r_i$ show higher variance than the latent values $y_i$.
3. A simulation study
In order to study the efficiency of the additive, multiplicative and mixed RR models in the estimation of the unknown mean $\bar Y$ by means of the calibration approach, we carried out some Monte Carlo simulations.
3.1 Simulation design
Following Montanari and Ranalli [13], eight different artificial populations for $(y_i, x_i)$ of $N = 1000$ units are generated as i.i.d. samples from the following superpopulation models:
• $\xi_1$: linear, $y = 1 + 2(x - 0.5) + \varepsilon$,
• $\xi_2$: quadratic, $y = 1 + 2(x - 0.5)^2 + \varepsilon$,
• $\xi_3$: bump, $y = 1 + 2(x - 0.5) + \exp\{-200(x - 0.5)^2\} + \varepsilon$,
• $\xi_4$: jump, $y = 1 + 2(x - 0.5)I(x \le 0.65) + 0.65\,I(x > 0.65) + \varepsilon$,
• $\xi_5$: cdf, $y = \Phi((0.5 - 2x)/0.02) + \varepsilon$, where $\Phi$ is the standard normal cdf,
• $\xi_6$: exponential, $y = \exp\{-8x\} + \varepsilon$,
• $\xi_7$: cycle1, $y = 2 + \sin(2\pi x) + \varepsilon$,
• $\xi_8$: cycle4, $y = 2 + \sin(8\pi x) + \varepsilon$,
where $\varepsilon \sim N(0, \sigma^2)$ with variance $\sigma^2$ chosen such that approximately 20% of the variance for the $y_i$'s is due to the noise component in order to maintain the main relationship hypothesized through the regression function. For the values of the auxiliary variable, we assume $X \sim U(0, 1)$ and $X \sim \mathrm{Beta}(66/49, 165/49)$. The values $y_i$ are perturbed in $r_i$ in accordance with the additive, multiplicative and mixed models by assuming that the scrambling variables $T$ and $S$ are Fisher distributed; $T, S \sim F(5, 50)$. For each fixed population, an SRSWOR of size $n = 100$ is drawn to estimate the unknown sensitive mean $\bar Y$ by means of the following estimators:
• $\bar r$, sample mean based on RR,
• $\hat{\bar Y}_{lr}$, linear regression estimator [6],
• $\hat{\bar Y}_{cr}$, cubic regression estimator [2, p. 16],
• $\hat{\bar Y}_{lp}$, local polynomial regression estimator (5),
• $\hat{\bar Y}_{mc}$, model calibration estimator (2),
• $\hat{\bar Y}_{el}$, empirical-likelihood calibration estimator (3),
• $\hat{\bar Y}^{mc}_{lp}$, local polynomial model calibration estimator (4),
• $\hat{\bar Y}^{mc}_{nn}$, neural network model calibration estimator (6).
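The population-generation step can be sketched as follows. The sketch assumes that the 20% noise share is imposed relative to the variance of the realized regression function, i.e. $\sigma^2 = 0.25\,\mathrm{Var}(m(x))$; this calibration rule, the dictionary `MODELS` and the function `make_population` are illustrative assumptions rather than the authors' code.

```python
import math
import numpy as np

# Standard normal cdf built from math.erf, vectorized over arrays.
_phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

MODELS = {
    "linear": lambda x: 1 + 2 * (x - 0.5),
    "quadratic": lambda x: 1 + 2 * (x - 0.5) ** 2,
    "bump": lambda x: 1 + 2 * (x - 0.5) + np.exp(-200 * (x - 0.5) ** 2),
    "jump": lambda x: 1 + 2 * (x - 0.5) * (x <= 0.65) + 0.65 * (x > 0.65),
    "cdf": lambda x: _phi((0.5 - 2 * x) / 0.02),
    "exponential": lambda x: np.exp(-8 * x),
    "cycle1": lambda x: 2 + np.sin(2 * np.pi * x),
    "cycle4": lambda x: 2 + np.sin(8 * np.pi * x),
}

def make_population(name, N=1000, x_dist="uniform", noise_share=0.2, rng=None):
    """Generate (x, y) for one superpopulation model."""
    rng = np.random.default_rng() if rng is None else rng
    if x_dist == "uniform":
        x = rng.uniform(size=N)
    else:
        x = rng.beta(66 / 49, 165 / 49, size=N)
    m = MODELS[name](x)
    # Assumption: noise variance set so that noise_share of Var(y) is noise.
    sigma2 = noise_share / (1.0 - noise_share) * m.var()
    y = m + rng.normal(0.0, math.sqrt(sigma2), size=N)
    return x, y
```

For example, `make_population("bump", x_dist="beta", rng=np.random.default_rng(1))` would return one artificial population of 1000 units under the Beta design.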
For calibration estimators, we set $q_i = 1$. Estimators $\hat{\bar Y}_{lr}$ and $\hat{\bar Y}_{cr}$ are parametric estimators in that they assume a linear and a cubic regression function, respectively, of the coded response on the auxiliary variable. In simple random sampling without replacement, $\hat{\bar Y}_{lr}$ coincides with $\hat{\bar Y}_c$ defined in Equation (1) with $d_i = N/n$ (obviously $r_i$ is used instead of $y_i$). As regards $\hat{\bar Y}_{cr}$, it is here used as $\hat{\bar Y}_{cr} = \bar r + N^{-1}\left(\sum_{i\in U}\hat r_i - N\sum_{i\in s}\hat r_i/n\right)$, where $\hat r_i = \sum_{k=0}^{3}\hat\beta_k x_i^k$ and the $\hat\beta_k$'s denote the least-squares estimates of the regression coefficients. The estimator $\hat{\bar Y}_{lp}$ is a nonparametric estimator based on a local linear fitting; $\hat{\bar Y}^{mc}_{lp}$ and $\hat{\bar Y}^{mc}_{nn}$ are nonparametric model calibration estimators, while $\hat{\bar Y}_{mc}$ and $\hat{\bar Y}_{el}$ are parametric model calibration estimators.
To evaluate the performance of the model calibration estimators, we assumed a linear and a quadratic working model, $E_\xi(r_i|x_i) = \beta_0 + \beta_1 x_i$ and $E_\xi(r_i|x_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2$, respectively, with estimated parameters $\hat\beta$. For the estimator $\hat{\bar Y}^{mc}_{nn}$, we considered $M = 2$ neurons, $\lambda = 25 \times 10^{-5}$ as decay factor and logistic activation function $\phi$. Neural network estimates were obtained by means of an available R software code. Finally, the choices concerning $\hat{\bar Y}^{mc}_{lp}$ were $p = 1$ and $h = 0.25$ for the degree of the polynomial kernel estimator and the bandwidth, respectively; the Epanechnikov kernel function was also assumed. The process was repeated $B = 1000$ times.
Furthermore, in order to widen the analysis, we also perturbed the auxiliary variable with the same coding mechanism used for the sensitive variable. Our reason for doing so is twofold. First, scrambling both the variables is of practical concern in data management, for instance, in statistical disclosure control. Second, for the considered models, the correlation between the coded sensitive variable and the auxiliary one is always smaller, in absolute value, than the correlation between the original variables. On the contrary, coding both the variables may increase the correlation depending on the choice of the scrambling variables (see the appendix for details). Thus, we have decided to explore the effect induced by different correlation structures on the efficiency of the estimators. In some cases, as already shown in Diana and Perri [9], the double coding can increase the efficiency of the estimates.
Under the considered scrambled models, the performance of each estimator, say $\hat{\bar Y}_\cdot$, is compared with the baseline estimator $\bar y = n^{-1}\sum_{i=1}^{n} y_i$ under direct questioning by the simulated relative bias (RB) and the simulated relative efficiency (RE)
$$\mathrm{RB}(\hat{\bar Y}_\cdot) = \frac{\sum_{j=1}^{B}(\hat{\bar Y}_\cdot^{(j)} - \bar Y)/B}{\bar Y} \quad\text{and}\quad \mathrm{RE}(\hat{\bar Y}_\cdot) = \frac{\mathrm{mse}(\hat{\bar Y}_\cdot)}{\mathrm{mse}(\bar y)}, \tag{7}$$
where $\mathrm{mse}(\hat{\bar Y}_\cdot) = \sum_{j=1}^{B}(\hat{\bar Y}_\cdot^{(j)} - \bar Y)^2/B$ denotes the Monte Carlo estimate of the mean square error. Moreover, to evaluate the behavior of the estimators relative to the case when only the sensitive variable is coded, we calculate the percent gain of efficiency (PGE)
$$\mathrm{PGE}(\hat{\bar Y}_\cdot \mid \tilde{\bar Y}_\cdot) = 100 \times \left(\frac{\mathrm{mse}(\tilde{\bar Y}_\cdot)}{\mathrm{mse}(\hat{\bar Y}_\cdot)} - 1\right),$$
where $\tilde{\bar Y}_\cdot$ denotes estimator $\hat{\bar Y}_\cdot$ computed by scrambling the sensitive variable only.
For the sake of brevity, we discuss only the simulation results obtained under the quadratic working model: the results for the linear model show negligible differences, which concern the model calibration estimators alone, while the efficiency of the other estimators does not depend on the underlying working model. However, complete simulation outcomes are available from the authors upon request.
Table 1. RE of $\hat{\bar Y}_\cdot$ w.r.t. $\bar y$ under direct questioning and $X \sim U(0, 1)$, for the estimators $\bar r$, $\hat{\bar Y}_{lr}$, $\hat{\bar Y}_{cr}$, $\hat{\bar Y}_{lp}$, $\hat{\bar Y}_{mc}$, $\hat{\bar Y}_{el}$, $\hat{\bar Y}^{mc}_{lp}$ and $\hat{\bar Y}^{mc}_{nn}$ (rows, grouped by additive, multiplicative and mixed RR model) and the superpopulation models ξ1 to ξ8 (columns).
Table 2. RE of $\hat{\bar Y}_\cdot$ w.r.t. $\bar y$ under direct questioning and $X \sim \mathrm{Beta}(66/49, 165/49)$, for the same estimators (rows, grouped by additive, multiplicative and mixed RR model) and the superpopulation models ξ1 to ξ8 (columns).
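The Monte Carlo criteria RB, RE and PGE can be sketched in Python for the simplest case, the scrambled sample mean $\bar r$ under the additive model with $T \sim F(5, 50)$. The function names `simulate_rb_re` and `pge` are hypothetical; other estimators would be plugged in at the line that computes the RR-based estimate.

```python
import numpy as np

def simulate_rb_re(y, B=1000, n=100, seed=0):
    """Simulated RB and RE of the additive-model mean r-bar w.r.t. y-bar."""
    rng = np.random.default_rng(seed)
    N = y.size
    mu_t = 50 / 48                      # mean of F(5, 50)
    y_bar = y.mean()
    est_direct = np.empty(B)
    est_rr = np.empty(B)
    for j in range(B):
        s = rng.choice(N, size=n, replace=False)   # SRSWOR
        t = rng.f(5, 50, size=n)                   # scrambling values
        est_direct[j] = y[s].mean()                # direct questioning
        est_rr[j] = (y[s] + t).mean() - mu_t       # mean of r_i = z_i - mu_t
    mse_direct = np.mean((est_direct - y_bar) ** 2)
    mse_rr = np.mean((est_rr - y_bar) ** 2)
    rb = np.mean(est_rr - y_bar) / y_bar
    return rb, mse_rr / mse_direct

def pge(mse_only_y_coded, mse_both_coded):
    """Percent gain of efficiency from coding both variables."""
    return 100.0 * (mse_only_y_coded / mse_both_coded - 1.0)
```

An RE above one means that the scrambled estimator is less efficient than the sample mean under direct questioning, while a positive PGE means that coding both variables improves on coding the sensitive variable alone.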
Table 3. PGE w.r.t. the case where only Y is coded, with $X \sim U(0, 1)$.
Estimator                        ξ1       ξ2       ξ3       ξ4       ξ5       ξ6       ξ7       ξ8
$\rho_{xy}$                   0.886   -0.014    0.817    0.869   -0.677   -0.665   -0.712   -0.140
Additive RR model
$\rho_{xz}$                   0.600   -0.003    0.577    0.583   -0.389   -0.219   -0.538   -0.105
$\rho_{vz}$                   0.908    0.898    0.874    0.908    0.608    0.789    0.400    0.570
$\hat{\bar Y}_{lr}$           287.3    411.4    179.3    299.3     37.2    145.6    -15.2     44.9
$\hat{\bar Y}_{cr}$             7.7     -6.0    -25.6     10.4    -73.6    -52.4    -85.0    -43.8
$\hat{\bar Y}_{lp}$            -4.9    -17.8     -5.1     -3.8    -38.6    -28.8    -46.7    -14.6
$\hat{\bar Y}_{mc}$           292.4    382.7    176.0    301.8     24.9    138.2    -13.2     44.7
$\hat{\bar Y}_{el}$           292.2    383.5    175.7    301.9     25.6    138.3    -13.0     43.5
$\hat{\bar Y}^{mc}_{lp}$        0.3     -4.8      0.6      0.9    -12.4     -5.1     -9.4     -1.7
$\hat{\bar Y}^{mc}_{nn}$      -54.3    -55.6    -77.4    -56.8    -52.6    -39.2    -21.9     32.3
Multiplicative RR model
$\rho_{xz}$                   0.561   -0.003    0.511    0.545   -0.540   -0.532   -0.350   -0.068
$\rho_{vz}$                   0.935    0.666    0.898    0.927   -0.281   -0.277    0.252    0.502
$\hat{\bar Y}_{lr}$           463.4     74.9    306.2    422.6    -23.2    -23.1    -10.8     31.5
$\hat{\bar Y}_{cr}$           352.0    372.3    149.3    335.4    -63.5    -74.6    -42.4     13.4
$\hat{\bar Y}_{lp}$             0.7    253.4     -1.0      0.1    -19.2    -19.0    -26.6    -12.6
$\hat{\bar Y}_{mc}$           418.6     67.5    250.9    381.1    -40.9    -49.2    -10.2     29.9
$\hat{\bar Y}_{el}$           418.7     67.8    250.7    380.7    -40.8    -48.8    -10.7     29.7
$\hat{\bar Y}^{mc}_{lp}$        0.5    300.0      0.3      0.1     -8.5     -4.5     -5.1     -0.6
$\hat{\bar Y}^{mc}_{nn}$      -63.4    133.0    -73.2    -65.2    -56.2    -48.2    -10.8     31.1
Mixed RR model
$\rho_{xz}$                   0.324   -0.001    0.308    0.312   -0.249   -0.138   -0.242   -0.047
$\rho_{vz}$                   0.962    0.948    0.947    0.962    0.791    0.887    0.774    0.835
$\hat{\bar Y}_{lr}$           879.2    837.1    614.1    877.5    128.8    308.8    121.6    188.8
$\hat{\bar Y}_{cr}$           304.7     98.2    203.1    291.2     -3.6     71.1    -49.1    -47.5
$\hat{\bar Y}_{lp}$             0.7     -3.5      0.7      0.6    -16.2     -9.1    -14.9     -5.0
$\hat{\bar Y}_{mc}$           676.0    756.5    467.3    682.6    105.8    258.4    107.9    171.2
$\hat{\bar Y}_{el}$           675.1    752.7    466.2    681.3    103.6    257.5    106.3    170.2
$\hat{\bar Y}^{mc}_{lp}$        1.7      1.2      1.2      1.7     -1.9     -0.1     -1.0      1.0
$\hat{\bar Y}^{mc}_{nn}$      -50.5    -70.8    -71.5    -61.9    -67.9    -50.3    -80.7    -84.5
Table 4. PGE w.r.t. the case where only Y is coded, with $X \sim \mathrm{Beta}(66/49, 165/49)$.
Estimator                        ξ1       ξ2       ξ3       ξ4       ξ5       ξ6       ξ7       ξ8
$\rho_{xy}$                   0.890   -0.742    0.803    0.884   -0.750   -0.746   -0.566   -0.156
Additive RR model
$\rho_{xz}$                   0.459   -0.166    0.529    0.470   -0.466   -0.270   -0.368   -0.118
$\rho_{vz}$                   0.949    0.898    0.865    0.940    0.634    0.830    0.638    0.640
$\hat{\bar Y}_{lr}$           695.1    377.0    167.5    579.7     22.2    186.3     36.5     59.0
$\hat{\bar Y}_{cr}$           110.1    -23.6    -25.6     77.0    -77.7    -53.2    -68.7    -41.8
$\hat{\bar Y}_{lp}$             1.1    -17.3     -6.7      0.2    -39.5    -25.3    -40.2    -10.3
$\hat{\bar Y}_{mc}$           682.3    392.9    164.9    570.1     21.9    189.3     19.1     55.0
$\hat{\bar Y}_{el}$           681.7    392.2    164.9    569.5     21.5    188.5     18.1     55.3
$\hat{\bar Y}^{mc}_{lp}$        1.8      0.5      0.4      1.4     -4.1     -1.8    -12.8     -1.9
$\hat{\bar Y}^{mc}_{nn}$      -11.1    -85.9    -56.8    -22.4    -76.9    -65.6    -15.1     30.9
Multiplicative RR model
$\rho_{xz}$                   0.587   -0.150    0.565    0.591   -0.561   -0.564   -0.186   -0.075
$\rho_{vz}$                   0.933    0.485    0.864    0.927   -0.230   -0.246    0.417    0.458
$\hat{\bar Y}_{lr}$           448.1     18.3    162.5    397.3    -31.6    -29.8      3.6     21.6
$\hat{\bar Y}_{cr}$           312.2      1.7    112.1    265.6    -51.1    -62.7    -45.3     12.5
$\hat{\bar Y}_{lp}$             2.7    -21.9     -0.5      2.7    -25.6    -23.0    -31.0    -14.4
$\hat{\bar Y}_{mc}$           298.9     17.7    120.7    258.5    -34.4    -45.2      4.1     14.6
$\hat{\bar Y}_{el}$           300.7     17.4    120.3    260.1    -35.0    -45.1      3.8     13.6
$\hat{\bar Y}^{mc}_{lp}$        3.4      1.1      0.3      3.0    -17.1     -6.2     -5.2     -0.1
$\hat{\bar Y}^{mc}_{nn}$      162.6    -15.9     17.6    143.3    -51.7    -45.2      3.0     21.1
Mixed RR model
$\rho_{xz}$                   0.258   -0.071    0.310    0.268   -0.286   -0.166   -0.132   -0.052
$\rho_{vz}$                   0.977    0.937    0.937    0.973    0.820    0.916    0.842    0.835
$\hat{\bar Y}_{lr}$          1884.6    782.8    654.5   1615.4    208.1    542.1    268.1    256.1
$\hat{\bar Y}_{cr}$           345.7    147.5     93.0    301.8      9.3     89.8    -22.7     -5.8
$\hat{\bar Y}_{lp}$             1.5     -4.2      0.2      1.3    -15.1     -8.8     -8.0     -3.1
$\hat{\bar Y}_{mc}$          1512.3    768.4    568.5   1295.3    170.4    458.2    248.8    238.5
$\hat{\bar Y}_{el}$          1508.9    769.4    569.5   1292.7    171.6    460.8    248.8    238.6
$\hat{\bar Y}^{mc}_{lp}$        0.5      0.5      0.1      0.4     -1.7     -0.4     -2.4      0.2
$\hat{\bar Y}^{mc}_{nn}$      -22.9    -82.2    -77.0    -33.2    -83.5    -64.0    -82.9    -85.8
3.2 Simulation results
The results of the simulations, reported in Tables 1–4, show a number of revealing aspects, which are discussed below. The results for the RB have been omitted since they appear to be of little significance even if some striking values appear for $\hat{\bar Y}_{cr}$ and $\hat{\bar Y}^{mc}_{nn}$.
1. The first relevant aspect that stands out from Tables 1 and 2 is the profitable use of the auxiliary variable in the estimation of $\bar Y$. This is particularly marked for estimators $\hat{\bar Y}_{lr}$, $\hat{\bar Y}_{cr}$, $\hat{\bar Y}_{mc}$ and $\hat{\bar Y}_{el}$ which, under superpopulation models $\xi_1$, $\xi_3$ and $\xi_4$, depending on the scrambling model, can be more efficient than the sample mean $\bar y$. This seems an important finding since, in RR theory, it is known that the perturbation of the data introduces a source of variability that makes estimators less efficient than $\bar y$. This aspect, criticized by many researchers, has sometimes limited the use of the RR approach. Yet, the present results strike a blow for the RR technique and may lead to refinements of the scrambled models. On this point, moreover, confirmatory theoretical results can also be found in a noncalibration setting in [8,9].
2. Estimator $\bar r$, which does not employ any auxiliary information, is generally outperformed by the other estimators even if there are a number of exceptions; for instance, $\hat{\bar Y}_{cr}$ and $\hat{\bar Y}_{lp}$ under the additive model for $\xi_2$, $\xi_5$–$\xi_8$.
3. Regression estimator $\hat{\bar Y}_{lr}$ performs better than its competitors in many situations. When it fails, $\hat{\bar Y}_{mc}$ and $\hat{\bar Y}^{mc}_{lp}$ seem to offer the best solution.
4. Passing from the additive model to the mixed one, a general increase in the RE is detected. This is hardly surprising since the latter mechanism ensures greater confidentiality. The cost of increasing participant cooperation is the high variability induced in the observed data by the scrambling mechanism, which leads to less efficient estimates.
5. Tables 3 and 4 concern the PGE which derives from coding both the sensitive and the auxiliary variables. We also report the correlation coefficients between the variables in the analysis, i.e. $\rho_{xy}$, which refers to the starting variables $X$ and $Y$, and $\rho_{vz}$ and $\rho_{xz}$, where $V$ and $Z$ are the coded versions of $X$ and $Y$, respectively. Since in many cases $|\rho_{vz}| > |\rho_{xz}|$, we might expect a better performance of $\hat{\bar Y}_\cdot$ (both variables coded) over its counterpart $\tilde{\bar Y}_\cdot$ (only $Y$ coded). This is confirmed for estimators $\hat{\bar Y}_{lr}$, $\hat{\bar Y}_{mc}$ and $\hat{\bar Y}_{el}$ (PGE > 0), whereas the PGE can be negative for the remaining estimators even when $|\rho_{vz}| > |\rho_{xz}|$.
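The correlation pattern described in point 5 can be reproduced with a small Python sketch for the additive model. It assumes, as an illustration only, that coding both variables means adding the same scrambling draw $t_i$ to $x_i$ and $y_i$; the text does not state this, and the function name is hypothetical.

```python
import numpy as np

def coded_correlations(x, y, rng):
    """Return (rho_xy, rho_xz, rho_vz) under additive scrambling with T ~ F(5, 50)."""
    t = rng.f(5, 50, size=x.size)
    z = y + t          # coded sensitive variable
    v = x + t          # assumption: auxiliary variable coded with the same draw t_i
    corr = lambda a, b: np.corrcoef(a, b)[0, 1]
    return corr(x, y), corr(x, z), corr(v, z)
```

Under this assumption, scrambling $Y$ alone attenuates the correlation because independent noise is added to one variable only, whereas the shared scrambling component can raise $|\rho_{vz}|$ above $|\rho_{xz}|$.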
4. A real data application
We now consider an application to real data taken from the Survey of Household Income and Wealth 2008 conducted in Italy by the Bank of Italy [1]. The survey covers 7977 Italian households composed of 19,907 individuals and 13,266 income-earners. In the analysis, we assume the 7977 households as the target population on which the variable Household Net Disposable Income ($Y$) is studied. For our purposes, we assume this variable to be sensitive in nature even if it is collected by means of questionnaires without resorting to any scrambled response models. In the study, we also consider separately the auxiliary variables Household Consumption and Number of Household Components, which exhibit a different degree of correlation with $Y$. The values of the three variables are known in advance for each household from the data set.
Let us assume that $\bar Y = 32{,}344.33$ (euros) is unknown and we want to estimate it by means of the scrambled response models previously discussed. To investigate whether the calibration approach could be useful in this context, we performed a simulation study using the same settings as in Section 3.1. In particular, from the population of 7977 Italian households, we selected 1000 SRSWOR samples of $n = 100$ households. For each sample, the values $y_i$, as well as the values of the auxiliary variables, were perturbed as in the previous section. Then, we evaluated the performance of the same estimators as in Section 3.1 by means of the RE defined in Equation (7). The case when only $Y$ is coded is also reported. For model calibration purposes, a linear regression model was considered an appropriate choice for describing the underlying relationship between the study and the auxiliary variables.
The results in Table 5 confirm the main outcome on simulated data. The use of the coded auxiliary variable Household Consumption may considerably increase the efficiency of the estimates over the sample mean $\bar y$ obtained on the true values $y_i$. This is true across all three RR models considered. The practice of coding only the sensitive variable does not produce the same results: improvements can be achieved over the estimator $\bar r$ induced by the scrambling model, but not over $\bar y$. Different outcomes are produced by the use of the auxiliary variable Number of Household Components, which may lead to improvements only over $\bar r$ (Table 6). The use of this variable does not seem to be worthwhile if compared with the gain in efficiency procured by Household Consumption. Perhaps this is due to the different level of correlation with the study variable, namely 0.717 versus 0.292.
Table 5. RE for the estimators of $\bar Y$ when the auxiliary variable Household Consumption is used; $\rho_{xy} = 0.717$.
RR model                        Additive            Multiplicative             Mixed
Coded variable                X, Y        Y        X, Y        Y        X, Y        Y
$\rho_{\cdot z}$             0.972    0.214       0.851    0.466       0.989    0.104
$\bar r$                     9.399    9.399       2.069    2.069      26.614   26.614
$\hat{\bar Y}_{lr}$          0.524    8.979       0.629    1.607       0.597   27.064
$\hat{\bar Y}_{cr}$          0.992   45.827       1.089   15.221       1.399  167.025
$\hat{\bar Y}_{lp}$          9.023   58.365       6.059   43.295      26.678   38.958
$\hat{\bar Y}_{mc}$          0.524    8.979       0.620    1.607       0.694   27.064
$\hat{\bar Y}_{el}$          0.528    8.940       0.642    1.606       0.696   27.047
$\hat{\bar Y}^{mc}_{lp}$     9.052   58.137       1.992   40.842      27.052   51.295
$\hat{\bar Y}^{mc}_{nn}$    38.820    9.255       0.820    1.715      68.206   27.467
Table 6. RE for the estimators of $\bar Y$ when the auxiliary variable Number of Household Members is used; $\rho_{xy} = 0.292$.
RR model                        Additive            Multiplicative             Mixed
Coded variable                X, Y        Y        X, Y        Y        X, Y        Y
$\rho_{xy}$                  0.564     0.13       0.643    0.227       0.814    0.081
$\bar r$                     8.518    8.518       1.818    1.818      24.856   24.856
$\hat{\bar Y}_{lr}$          5.786    8.460       1.058    1.719       9.128   24.805
$\hat{\bar Y}_{cr}$          6.428    9.167       1.800    1.795      14.777   26.234
$\hat{\bar Y}_{lp}$        492.192   25.355      13.618    8.182     214.917  143.693
$\hat{\bar Y}_{mc}$          5.786    8.460       1.078    1.719       9.244   24.805
$\hat{\bar Y}_{el}$          5.822    8.496       1.090    1.722       9.516   24.827
$\hat{\bar Y}^{mc}_{lp}$     8.605   25.467       1.866   25.467      25.466  143.181
$\hat{\bar Y}^{mc}_{nn}$     8.135    8.751       2.225    8.751      27.988   25.268
Overall, the nonparametric estimators give rise to unusual results despite their ascertained good performance in a nonsensitive set-up. One possible reason for this may be the choice of the required parameters, which need more fine-tuning. Attempts have been made in this direction. Nonetheless, a full investigation is somewhat computer-intensive and may sometimes be impractical.
5. Conclusions
The question addressed in this paper is: can auxiliary information improve the estimation of a sensitive mean through the calibration approach when scrambled response models are used?
Through a simulation study and an application to real data, we have considered three models and some common calibration estimators both where the auxiliary variable is scrambled and where it is not. Noncalibration estimators have been also considered in the analysis. Overall, the estimators under study outperform the basic estimator induced by the scrambling mechanism and, in many cases, they may even be more efficient than the sample mean under direct questioning. This outcome is particularly important since it provides evidence that improvements in the estimates are possible without jeopardizing respondent privacy. This means that a trade-off between privacy protection and efficiency can be achieved. Moreover, the relevance of the result is further strengthened by the fact that it has been obtained with a sample size n = 100, which is certainly unusual in practical RR applications. In fact, since the efficiency of RR estimators is generally low, larger sample sizes are required to obtain estimates with a precision comparable to that from the sample mean under direct questioning of sensitive issues. This makes RR surveys rather expensive. On the contrary, the use of estimators based on auxiliary information makes it possible to achieve comparable and, sometimes, even more efficient estimates than with the sample mean. Keeping in mind that one of the limitations of the use of RR theory lies in the excessive cost due to the sample size, we hope that the results described in this article can represent a response to some of the criticisms directed at the RR models and can contribute toward a reappraisal of RR theory as a valid tool for data collection.
Acknowledgements
The authors wish to thank two anonymous referees for their careful reading and constructive suggestions which led to improvement over earlier versions of the paper. Work supported by the Italian Ministry of University and Research, MIUR-PRIN 2007: “Efficient use of auxiliary information at the design and at the estimation stage of complex surveys: Methodological aspects and applications for producing official statistics”.
References
[1] Bank of Italy. Survey of Household Income and Wealth, 2008. Available at http://www.bancaditalia.it/statistiche/indcamp/bilfait/dismicro.
[2] F.J. Breidt and J.D. Opsomer, Local polynomial regression estimators in survey sampling, Ann. Statist. 28 (2000), pp. 1026–1053.
[3] A. Chaudhuri and R. Mukerjee, Randomized Response: Theory and Techniques, Marcel Dekker, Inc., New York, NY, 1988.
[4] A. Chaudhuri and D. Roy, Model assisted survey sampling strategies with randomized response, J. Statist. Plann. Inference 60 (1997), pp. 61–68.
[5] J. Chen and R.R. Sitter, A pseudo empirical likelihood approach to the effective use of auxiliary information in complex surveys, Statist. Sinica 9 (1999), pp. 385–406.
[6] W.G. Cochran, Sampling Techniques, John Wiley & Sons, New York, NY, 1977.
[7] J.-C. Deville and C.-E. Särndal, Calibration estimators in survey sampling, J. Amer. Statist. Assoc. 87 (1992), pp. 376–382.
[8] G. Diana and P.F. Perri, A class of estimators for quantitative sensitive data, Statist. Papers (2009), doi:10.1007/s00362-009-0273-1.
[9] G. Diana and P.F. Perri, New scrambled response models for estimating the mean of a sensitive quantitative character, J. Appl. Stat. 37 (2010), pp. 1875–1890.
[10] B. Eichhorn and L.S. Hayre, Scrambled randomized response methods for obtaining sensitive quantitative data, J. Statist. Plann. Inference 7 (1983), pp. 307–316.
[11] J.A. Fox and P.E. Tracy, Randomized Response: A Method for Sensitive Survey, Sage Publication, Inc., Newbury Park, CA, 1986.
[12] G.J.L.M. Lensvelt-Mulders, J.J. Hox, P.G.M. Heijden, and C.J.M. Maas, Meta-analysis of randomized response research: Thirty-five years of validation, Sociol. Methods Res. 33 (2005), pp. 319–348.
[13] G.E. Montanari and M.G. Ranalli, Nonparametric model calibration estimation in survey sampling, J. Amer. Statist. Assoc. 100 (2005), pp. 1429–1442.
[14] K.H. Pollock and Y. Bek, A comparison of three randomized response models for quantitative data, J. Amer. Statist. Assoc. 71 (1976), pp. 884–886.
[15] A. Saha, A simple randomized response technique in complex surveys, Metron LXV (2007), pp. 59–66.
[16] C.-E. Särndal, The calibration approach in survey theory and practice, Surv. Methodol. 33 (2007), pp. 99–119.
[17] S. Singh, A.H. Joarder, and M.L. King, Regression analysis using scrambled responses, Aust. J. Stat. 38 (1996), pp. 201–211.
[18] S. Singh and J.-M. Kim, A pseudo-empirical log-likelihood estimator using scrambled responses, Statist. Probab. Lett. 81 (2011), pp. 345–351.
[19] S. Singh and D.S. Tracy, Ridge regression using scrambled responses, Metron LVII (1999), pp. 147–157.
[20] C.-K. Son, K.-H. Hong, G.-S. Lee, and J.-M. Kim, The calibration for stratified randomized response estimators, Commun. Korean Statist. Soc. 14 (2008), pp. 597–603.
[21] R. Strachan, M. King, and S. Singh, Likelihood-based estimation of the regression model with scrambled responses, Aust. N. Z. J. Stat. 40 (1998), pp. 279–290.
[22] D.S. Tracy and S. Singh, Calibration estimators in randomized response survey, Metron LVII (1999), pp. 47–68.
[23] S.L. Warner, Randomized response: A survey technique for eliminating evasive answer bias, J. Amer. Statist. Assoc. 60 (1965), pp. 63–69.
[24] C. Wu and R.R. Sitter, A model-calibration approach to using complete auxiliary information from survey data, J. Amer. Statist. Assoc. 96 (2001), pp. 185–193.
Appendix
We report some theoretical considerations regarding the correlation between the sensitive and the auxiliary variables, which may or may not be coded. First, let us assume that only Y is scrambled and consider the additive model Z = Y + T. Hence, we have
\[
\rho_{xz}^2 = \frac{\sigma_{xz}^2}{\sigma_x^2 \sigma_z^2} = \frac{\sigma_{xy}^2}{\sigma_x^2 (\sigma_y^2 + \sigma_t^2)} < \frac{\sigma_{xy}^2}{\sigma_x^2 \sigma_y^2} = \rho_{xy}^2 .
\]
Analogously, for the multiplicative model Z = TY, we get
\[
\rho_{xz}^2 = \frac{\sigma_{xy}^2 \mu_t^2}{\sigma_x^2 (\sigma_t^2 \mu_{2,y} + \sigma_y^2 \mu_t^2)},
\]
where $\mu_{2,\cdot}$ denotes the second moment of the variable indicated in the subscript. It follows that $\rho_{xz}^2 < \rho_{xy}^2$ if
\[
\frac{\sigma_y^2 \mu_t^2}{\sigma_t^2 \mu_{2,y} + \sigma_y^2 \mu_t^2} < 1,
\]
which is always true. Finally, for the mixed model Z = T(Y + S), we have
\[
\rho_{xz}^2 = \frac{\sigma_{xy}^2 \mu_t^2}{\sigma_x^2 \left[ (\sigma_t^2 \mu_{2,y} + \sigma_y^2 \mu_t^2) + (\sigma_t^2 \mu_{2,s} + \sigma_s^2 \mu_t^2) + 2 \mu_y \mu_s \sigma_t^2 \right]}
\]
and $\rho_{xz}^2 < \rho_{xy}^2$ if
\[
\frac{\sigma_y^2 \mu_t^2}{(\sigma_t^2 \mu_{2,y} + \sigma_y^2 \mu_t^2) + (\sigma_t^2 \mu_{2,s} + \sigma_s^2 \mu_t^2) + 2 \mu_y \mu_s \sigma_t^2} < 1.
\]
The latter is always true since the scrambling variable is assumed to be positive. In doing so, we have proved that the correlation between the coded sensitive variable and the auxiliary one is smaller than the correlation between the original variables, i.e. $|\rho_{xy}| > |\rho_{xz}|$, whatever the scrambling model considered.
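As a numerical illustration of the three inequalities above, the closed-form squared correlations can be evaluated directly. The following is a minimal Python sketch and not part of the original study: the function names and all numerical values are hypothetical, and it assumes, as the formulas above implicitly do, that the scrambling variables T and S are independent of each other and of (X, Y).

```python
# Squared correlations between X and the coded response Z under the three
# scrambling models of the Appendix. Hypothetical helper names; T and S are
# assumed independent of each other and of (X, Y).

def rho2_xy(s_xy, var_x, var_y):
    """Squared correlation between the original variables X and Y."""
    return s_xy ** 2 / (var_x * var_y)

def rho2_xz_additive(s_xy, var_x, var_y, var_t):
    """Additive model Z = Y + T: Var(Z) = var_y + var_t, Cov(X, Z) = s_xy."""
    return s_xy ** 2 / (var_x * (var_y + var_t))

def rho2_xz_multiplicative(s_xy, var_x, var_y, mu_y, var_t, mu_t):
    """Multiplicative model Z = T * Y: Cov(X, Z) = mu_t * s_xy."""
    m2_y = var_y + mu_y ** 2                      # second moment of Y
    var_z = var_t * m2_y + var_y * mu_t ** 2
    return (s_xy * mu_t) ** 2 / (var_x * var_z)

def rho2_xz_mixed(s_xy, var_x, var_y, mu_y, var_t, mu_t, var_s, mu_s):
    """Mixed model Z = T * (Y + S): Cov(X, Z) = mu_t * s_xy."""
    m2_y = var_y + mu_y ** 2                      # second moment of Y
    m2_s = var_s + mu_s ** 2                      # second moment of S
    var_z = ((var_t * m2_y + var_y * mu_t ** 2)
             + (var_t * m2_s + var_s * mu_t ** 2)
             + 2.0 * mu_y * mu_s * var_t)
    return (s_xy * mu_t) ** 2 / (var_x * var_z)

# Illustrative (made-up) population values.
s_xy, var_x, var_y, mu_y = 4.8, 4.0, 9.0, 10.0
var_t, mu_t = 0.25, 1.0
var_s, mu_s = 1.0, 2.0

print("rho2_xy                 :", rho2_xy(s_xy, var_x, var_y))
print("rho2_xz (additive)      :", rho2_xz_additive(s_xy, var_x, var_y, var_t))
print("rho2_xz (multiplicative):",
      rho2_xz_multiplicative(s_xy, var_x, var_y, mu_y, var_t, mu_t))
print("rho2_xz (mixed)         :",
      rho2_xz_mixed(s_xy, var_x, var_y, mu_y, var_t, mu_t, var_s, mu_s))
```

With these illustrative values each coded correlation falls below the uncoded one, and the loss is larger for the multiplicative and mixed models because the second moment of Y enters the variance of Z.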
Let us now examine the case when both Y and X are scrambled by using the same model and the same scrambling variables. Under the additive model, we have
\[
\rho_{vz}^2 = \frac{(\sigma_{xy} + \sigma_t^2)^2}{(\sigma_x^2 + \sigma_t^2)(\sigma_y^2 + \sigma_t^2)} .
\]
After some algebra, we get $\rho_{vz}^2 > \rho_{xz}^2$ if
\[
\sigma_t^2 > \frac{\sigma_{xy} (\sigma_{xy} - 2\sigma_x^2)}{\sigma_x^2} \tag{A1}
\]
and $\rho_{vz}^2 > \rho_{xy}^2$ if
\[
\sigma_t^2 > \frac{\sigma_{xy} \left[ \sigma_{xy} (\sigma_x^2 + \sigma_y^2) - 2 \sigma_x^2 \sigma_y^2 \right]}{\sigma_x^2 \sigma_y^2 - \sigma_{xy}^2} . \tag{A2}
\]
Conditions (A1) and (A2) make explicit that coding both variables may increase the correlation between Y and X, depending on the choice of the scrambling variable T. In particular, if T is chosen such that condition (A2) holds, then it follows that $|\rho_{vz}| > |\rho_{xy}| > |\rho_{xz}|$. We observe that $\sigma_{xy}$ and $\sigma_y^2$ are generally unknown in practice. Usually, a good guess may be available on the basis of previous data, a pilot survey or past experience. If there is reason to believe that the guess is not reliable, then Conditions (A1) and (A2) are to be checked on the basis of accurate estimates of $\sigma_{xy}$ and $\sigma_y^2$. Similar considerations can be drawn for the multiplicative and mixed models. Nonetheless, the conditions for comparing the correlation coefficients in those cases cannot be expressed in a closed form such as Conditions (A1) and (A2) and are therefore of little practical use; for this reason, they have been omitted.
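To show how Conditions (A1) and (A2) might be checked for a candidate scrambling variance, the following minimal Python sketch computes the two thresholds and compares them with the squared correlations directly. It is an illustration under stated assumptions rather than the authors' procedure: the function names and the numerical values are hypothetical, V = X + T and Z = Y + T share the same scrambling variable T, and T is assumed independent of (X, Y). In practice the covariance and variances would be replaced by guesses or estimates, as discussed above.

```python
# Thresholds for the scrambling variance sigma_t^2 under the additive model
# when both X and Y are coded with the same T (V = X + T, Z = Y + T).
# Hypothetical helper names; T is assumed independent of (X, Y).

def rho2_xz_additive(s_xy, var_x, var_y, var_t):
    """Squared correlation between X and Z = Y + T (only Y coded)."""
    return s_xy ** 2 / (var_x * (var_y + var_t))

def rho2_vz_additive(s_xy, var_x, var_y, var_t):
    """Squared correlation between V = X + T and Z = Y + T (both coded)."""
    return (s_xy + var_t) ** 2 / ((var_x + var_t) * (var_y + var_t))

def threshold_a1(s_xy, var_x):
    """Right-hand side of (A1): rho2_vz > rho2_xz iff var_t exceeds it."""
    return s_xy * (s_xy - 2.0 * var_x) / var_x

def threshold_a2(s_xy, var_x, var_y):
    """Right-hand side of (A2): rho2_vz > rho2_xy iff var_t exceeds it."""
    numerator = s_xy * (s_xy * (var_x + var_y) - 2.0 * var_x * var_y)
    return numerator / (var_x * var_y - s_xy ** 2)

# Illustrative (made-up) population values with rho_xy = 0.8.
s_xy, var_x, var_y = 2.4, 1.0, 9.0
rho2_original = s_xy ** 2 / (var_x * var_y)

for var_t in (0.5, 2.0, 6.0):
    r_xz = rho2_xz_additive(s_xy, var_x, var_y, var_t)
    r_vz = rho2_vz_additive(s_xy, var_x, var_y, var_t)
    print(f"var_t={var_t}: "
          f"A1 holds={var_t > threshold_a1(s_xy, var_x)}, "
          f"A2 holds={var_t > threshold_a2(s_xy, var_x, var_y)}, "
          f"rho2_vz>rho2_xz={r_vz > r_xz}, "
          f"rho2_vz>rho2_xy={r_vz > rho2_original}")
```

A negative threshold means that the corresponding condition holds for any positive scrambling variance, so only positive thresholds actually restrict the choice of T.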