Metrika (2001) 54: 59–76

© Springer-Verlag 2001

On estimating the median from survey data using multiple auxiliary information

M. Rueda García, A. Arcos Cebrián

Departamento de Estadística e Investigación Operativa, Universidad de Granada, 18071 Granada, Spain (e-mail: [email protected], [email protected])

Abstract. A new method to derive confidence intervals for medians in a finite population is presented. This method uses multi-auxiliary information through a multivariate regression type estimator of the population distribution function. A simulation study based on four real populations shows its behaviour versus other known methods.

Key words and phrases: regression type estimator, finite population median, confidence interval, calibration estimator.

1 Introduction

In this paper we present results pertaining to investigations concerning median estimation. Sample medians have long been recognized as simple robust alternatives to sample means, for estimating the location of heavy-tailed or markedly skewed populations from simple random samples. Their simplicity relative to other robust estimates makes them a suitable choice for investigation in designs other than simple random sampling.

The literature relating to the estimation of medians and other quantiles which use an auxiliary variable is, however, considerably less extensive than in the case of means and totals. Relevant references are Chambers and Dunstan (1986), Kuk and Mak (1989), Rao et al. (1990), Mak and Kuk (1993), Kuk (1993) and Rueda et al. (1998).

Recently, several estimators of a population distribution function have also been proposed, using auxiliary information at the estimation stage. Rao (1994) carried out a study of the different estimators of the total and of the distribution function in a finite population, making use of auxiliary information. In that paper different approaches are taken into consideration: the model-based approximation, the conditional probability sampling approach and the calibration theory; an alternative calibration estimator to the usual one is proposed, which is asymptotically more efficient than the generalized regression estimator and which coincides with the latter in simple random sampling. A further advantage of the calibration estimator proposed by Rao is that it is the preferred estimator from a conditional point of view.

Making use of this estimator, we propose a confidence interval for the population median which is valid for any sampling design. We study its application in the specific case of simple random sampling, testing its performance by means of different simulation studies and comparing it with the confidence intervals based on other known methods.

2 Construction of confidence intervals for the median

As usual, let $y_1, \ldots, y_N$ be the values of the population elements $U_1, \ldots, U_N$ for the variable of interest y. For any y ($-\infty < y < \infty$), the population distribution function $F_Y(y)$ is defined as the proportion of elements in the population that are less than or equal to y. The population median is $M_Y = \inf\{y \mid F_Y(y) \ge 0.5\} = F_Y^{-1}(0.5)$. The problem is to estimate the population median $M_Y$, using data $y_k$ for $k \in s$, where s is a random sample.

An initial way of obtaining a confidence interval for the median is to take the estimator $\hat M_Y = \hat F_Y^{-1}(0.5)$ (where the inverse $\hat F_Y^{-1}$ is to be understood in the same way as $F_Y^{-1}$ above) as a pivotal statistic and to construct the interval

$$\hat M_Y \mp z_{\alpha/2} \sqrt{\hat V(\hat M_Y)}, \tag{1}$$

where $\hat V(\hat M_Y)$ is a consistent estimator of the variance of $\hat M_Y$ and $z_{\alpha/2}$ denotes the upper $\alpha/2$ percentage point of the standard normal distribution. This interval can be used as long as the distribution of the said estimator is asymptotically normal under the sampling design (as is the case with simple random sampling).

Woodruff (1952) described a large sample procedure for determining confidence intervals for a finite population median: for any two constants $d_1$ and $d_2$, and for any value of $M_Y$,

$$P\{d_1 \le \hat F_Y(M_Y) \le d_2\} \approx P\{\hat F_Y^{-1}(d_1) \le M_Y \le \hat F_Y^{-1}(d_2)\}.$$
Hence, for any constants $d_1$ and $d_2$ such that $P\{d_1 \le \hat F_Y(M_Y) \le d_2\} = 1 - \alpha$, the interval $[\hat F_Y^{-1}(d_1), \hat F_Y^{-1}(d_2)]$ is a $100(1-\alpha)\%$ approximate confidence interval for $M_Y$. In simple random sampling, if the sample size n is sufficiently large, $\hat F_Y(M_Y)$ is approximately normal with variance

$$V(\hat F_Y(M_Y)) = \frac{1-f}{n}\,\beta(1-\beta) \tag{2}$$

and we would prefer to choose as a confidence interval

$$\left[\hat F_Y^{-1}\!\left(\beta - z_{\alpha/2}\{V(\hat F_Y(M_Y))\}^{1/2}\right),\ \hat F_Y^{-1}\!\left(\beta + z_{\alpha/2}\{V(\hat F_Y(M_Y))\}^{1/2}\right)\right], \tag{3}$$

where $f = n/N$ is the sampling fraction and $\beta = 1/2$.
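As an illustration of interval (3), the following is a minimal Python sketch under simple random sampling without replacement. The function names `quantile_inf` and `woodruff_interval` are hypothetical and are not taken from the paper; the inverse of the estimated distribution function is taken as $\inf\{y : \hat F_Y(y) \ge p\}$ over the sorted sample, in line with the definition of $M_Y$ above.

```python
import math
from statistics import NormalDist


def quantile_inf(values, p):
    """inf{y : F_hat(y) >= p}, where F_hat is the empirical distribution function."""
    ordered = sorted(values)
    n = len(ordered)
    # Smallest k with k/n >= p, clamped to 1..n (small epsilon guards rounding).
    k = min(max(math.ceil(p * n - 1e-12), 1), n)
    return ordered[k - 1]


def woodruff_interval(sample, N, alpha=0.05, beta=0.5):
    """Interval (3): invert F_hat at beta -/+ z * sqrt(variance (2))."""
    n = len(sample)
    f = n / N  # sampling fraction
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt((1 - f) / n * beta * (1 - beta))
    return quantile_inf(sample, beta - half), quantile_inf(sample, beta + half)


# Illustrative data only: a sample of n = 100 from a population of N = 1000.
lower, upper = woodruff_interval(list(range(1, 101)), N=1000)
```

Note that the width of the interval on the probability scale depends only on n, N and $\alpha$; the data enter only through the inversion step.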


There has been recent work on the use of auxiliary information to improve the estimation of a distribution function or the finite population quantiles. When the population median of the auxiliary variable, $M_X$, is known, the ratio method of estimation of the population median proposed by Kuk and Mak (1989) is based on the estimator

$$\hat M_{YR} = \hat M_Y \frac{M_X}{\hat M_X},$$

and their position method is based on the estimator $\hat M_{YP}$, which, in turn, is based on an estimate of the position of $M_Y$ in the ordered sample values. Both estimators are (under certain conditions) asymptotically normal and can be used as pivotal statistics to obtain confidence intervals in the same way as $\hat M_Y$ in (1).

Kuk (1993) suggests another median estimator based on the kernel method, the construction of which requires a knowledge of the values of the auxiliary variable for all the individuals in the population.

Woodruff's method, together with the ratio method of estimation for estimating finite population distribution functions, is used in Rueda et al. (1998), first with an estimator which employs only one auxiliary variable,

$$\hat F_R(M_Y) = \hat F_Y(M_Y)\,\frac{F_X(M_X)}{\hat F_X(M_X)}, \tag{4}$$

and further with an estimator which makes use of more than one auxiliary variable $X_1, \ldots, X_l$, where the population medians $M_{X_1}, \ldots, M_{X_l}$ are known,

$$\hat F_{RM}(M_Y) = \hat F_Y(M_Y) \sum_{i=1}^{l} w_i\,\frac{F_{X_i}(M_{X_i})}{\hat F_{X_i}(M_{X_i})}. \tag{5}$$

3 Confidence intervals based on the regression estimator

Let y and $x_i$ ($i = 1, \ldots, l$) respectively be the survey variable and the auxiliary variables related to y. We assume that the population medians $M_{X_i}$ of $x_i$ ($i = 1, \ldots, l$) are known. From the sample of n units from a population of size N we observe $(x_{ik}, y_k)$, where $k \in s$, $i = 1, \ldots, l$.

We obtain an alternative confidence interval which is based on the construction of a multiple estimator of the population distribution function. Consider the regression-type estimator:

$$\hat F^M_{RE}(M_Y) = \hat F_Y(M_Y) + \sum_{i=1}^{l} b_i\,(\beta - \hat F_{X_i}(M_{X_i})). \tag{6}$$

Let $c_1$ and $c_2$ be constants such that $P\{c_1 \le \hat F^M_{RE}(M_Y) \le c_2\} = 1 - \alpha$. Then

$$\left[\hat F_Y^{-1}\!\left(c_1 - \sum_{i=1}^{l} b_i\,(\beta - \hat F_{X_i}(M_{X_i}))\right),\ \hat F_Y^{-1}\!\left(c_2 - \sum_{i=1}^{l} b_i\,(\beta - \hat F_{X_i}(M_{X_i}))\right)\right]$$

(if $\hat F_Y$ is a continuous and strictly increasing function) is approximately a $100(1-\alpha)\%$ confidence interval for the population median $M_Y$.
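The step from the probability statement to the interval can be spelled out as follows. This is a short sketch of the reasoning, writing $A = \sum_{i=1}^{l} b_i(\beta - \hat F_{X_i}(M_{X_i}))$ as shorthand for the correction term and assuming, as stated, that $\hat F_Y$ is continuous and strictly increasing so that its inverse preserves inequalities:

```latex
\begin{aligned}
c_1 \le \hat F^M_{RE}(M_Y) \le c_2
  &\iff c_1 \le \hat F_Y(M_Y) + A \le c_2 \\
  &\iff c_1 - A \le \hat F_Y(M_Y) \le c_2 - A \\
  &\iff \hat F_Y^{-1}(c_1 - A) \le M_Y \le \hat F_Y^{-1}(c_2 - A).
\end{aligned}
```

The three events therefore have the same probability, $1 - \alpha$; with a step-function $\hat F_Y$ the last equivalence holds only approximately, as in Woodruff's argument.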


As $\hat F_Y(M_Y)$ and $\hat F_{X_i}(M_{X_i})$ ($i = 1, \ldots, l$) are approximately normally distributed (see Kuk and Mak, 1989), the multiple regression estimator $\hat F^M_{RE}(M_Y)$ is asymptotically normal with expected value $F_Y(M_Y) = \beta$, and therefore we would choose

$$c_1 = \beta - z_{\alpha/2}\{V(\hat F^M_{RE}(M_Y))\}^{1/2}, \qquad c_2 = \beta + z_{\alpha/2}\{V(\hat F^M_{RE}(M_Y))\}^{1/2}.$$

Estimating this unknown approximate variance is not so simple, but a consistent estimator $\hat V(\hat F^M_{RE}(M_Y))$ can be constructed by using Taylor's linearization technique. Denoting $A = \sum_{i=1}^{l} b_i\,(\beta - \hat F_{X_i}(M_{X_i}))$, the confidence interval then becomes

$$\left[\hat F_Y^{-1}\!\left(\beta - z_{\alpha/2}\{\hat V(\hat F^M_{RE}(M_Y))\}^{1/2} - A\right),\ \hat F_Y^{-1}\!\left(\beta + z_{\alpha/2}\{\hat V(\hat F^M_{RE}(M_Y))\}^{1/2} - A\right)\right]. \tag{7}$$

Consider now the problem of choosing the most appropriate coefficients. The solution is simple: we select the coefficients $b_i$ ($i = 1, \ldots, l$) which increase the accuracy of the multiple regression estimator. If we define $B = (b_1, \ldots, b_l)'$, $Q = \beta(1, \ldots, 1)'$ and $\hat F_X = (\hat F_{X_1}(M_{X_1}), \ldots, \hat F_{X_l}(M_{X_l}))'$, then we have

$$\hat F^M_{RE}(M_Y) = \hat F_Y(M_Y) + (Q - \hat F_X)' B.$$

The criterion for optimality of the coefficients $B = (b_1, \ldots, b_l)'$ is to minimize the variance. This leads (see Rao 1994) to the optimum value $B_{opt}$, and the resulting estimator

$$\hat F^M_{RE}(M_Y) = \hat F_Y(M_Y) + (Q - \hat F_X)' B_{opt},$$

where $B_{opt} = \Sigma^{-1}\sigma$, $\Sigma = (a_{ij})$, $a_{ij} = \mathrm{Cov}(\hat F_{X_i}(M_{X_i}), \hat F_{X_j}(M_{X_j}))$, $i \ne j = 1, \ldots, l$, $a_{ii} = V(\hat F_{X_i}(M_{X_i}))$, $i = 1, \ldots, l$, and $\sigma = (\mathrm{Cov}(\hat F_Y(M_Y), \hat F_{X_1}(M_{X_1})), \ldots, \mathrm{Cov}(\hat F_Y(M_Y), \hat F_{X_l}(M_{X_l})))'$. Insertion of $B_{opt}$ in the variance yields

$$V_{min}(\hat F^M_{RE}(M_Y)) = V(\hat F_Y(M_Y)) - \sigma'\Sigma^{-1}\sigma. \tag{8}$$

By replacing $\sigma$ and $\Sigma$ by their unbiased estimators $\hat\sigma$ and $\hat\Sigma$, we get the estimator

$$\hat F_{opt}(M_Y) = \hat F_Y(M_Y) + (Q - \hat F_X)'\,\hat\Sigma^{-1}\hat\sigma. \tag{9}$$
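For readers who want the intermediate step behind $B_{opt}$ and (8), the minimization can be sketched as follows. It treats B as a vector of fixed constants and assumes that $\Sigma$ is positive definite; both are assumptions made here for the sketch, not statements taken from Rao (1994).

```latex
\begin{aligned}
V\bigl(\hat F_Y(M_Y) + (Q - \hat F_X)'B\bigr)
  &= V\bigl(\hat F_Y(M_Y)\bigr) - 2\,B'\sigma + B'\Sigma B, \\
\frac{\partial}{\partial B}\bigl(-2\,B'\sigma + B'\Sigma B\bigr)
  &= -2\,\sigma + 2\,\Sigma B = 0
  \;\Longrightarrow\; B_{opt} = \Sigma^{-1}\sigma, \\
V_{min}
  &= V\bigl(\hat F_Y(M_Y)\bigr) - 2\,\sigma'\Sigma^{-1}\sigma
     + \sigma'\Sigma^{-1}\Sigma\,\Sigma^{-1}\sigma
   = V\bigl(\hat F_Y(M_Y)\bigr) - \sigma'\Sigma^{-1}\sigma .
\end{aligned}
```

Since $\sigma'\Sigma^{-1}\sigma \ge 0$ under this assumption, the optimal regression-type estimator is never less precise than $\hat F_Y(M_Y)$ when the true $\sigma$ and $\Sigma$ are used.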

By using, as in (6), coefficients $b_i$ that are independent of the y value, it is possible to construct the function $\hat F^M_{RE}(y) = \hat F_Y(y) + \sum_{i=1}^{l} b_i\,(\beta - \hat F_{X_i}(M_{X_i}))$, which is non-decreasing in y whenever $\hat F_Y$ is non-decreasing in y. This enables us to use $\hat F^M_{RE}$ as an estimator of the distribution function in y and $(\hat F^M_{RE})^{-1}(0.5)$ as a point estimator of the median. Unfortunately, the cumulative distribution function estimator with optimal estimated $b_i$, (9), is not necessarily monotone.

One way to overcome this problem is to make it monotone, as done in Rao et al. (1990), by defining

$$\tilde F_{opt}(y_{(1)}) = \hat F_{opt}(y_{(1)}), \qquad \tilde F_{opt}(y_{(i)}) = \max\{\tilde F_{opt}(y_{(i-1)}),\ \hat F_{opt}(y_{(i)})\},$$

where the $y_{(i)}$'s are the order statistics of the sample $\{y_i;\ i \in s\}$. Then, inverting them, $\hat F_{opt}^{-1}(0.5)$ is obtained as an estimator of the population median, $M_Y$. This method can be applied to constructing a confidence interval for $M_Y$, using $\hat F_{opt}^{-1}(0.5)$ as pivotal statistic.

The generalization to any design (17) and the estimation of any quantile (18), particularly when the order of the quantile does not necessarily coincide with the orders of the quantiles of the auxiliary variables, also enables us to define estimators of the cumulative distribution function. The properties of these estimators as point cumulative distribution function estimators and their performance with respect to others are currently being investigated.

Note that, following the precepts underlying Woodruff's idea, one might apply the inverse of the estimated cumulative distribution function as follows:

$$\left[(\hat F^M_{RE})^{-1}\!\left(\beta - z_{\alpha/2}\{\hat V(\hat F^M_{RE}(M_Y))\}^{1/2}\right),\ (\hat F^M_{RE})^{-1}\!\left(\beta + z_{\alpha/2}\{\hat V(\hat F^M_{RE}(M_Y))\}^{1/2}\right)\right],$$

which yields the same confidence interval as (7) with the $b_i$'s in A fixed at their optimal estimates. However, we have not pursued this option here, since a simulation study would require a considerable amount of time to implement $(\hat F^M_{RE})^{-1}$ in the univariate and multivariate cases, whereas using (7) only requires us to implement $\hat F_Y^{-1}$.

3.1. Simple random sampling without replacement

If the sampling design used to select the sample is without replacement and with equal probabilities, simple expressions can be obtained for the variance of $\hat F_{opt}(M_Y)$ and for the proposed confidence interval. With this in mind, for each auxiliary variable $X_i$ we take Cramér's V coefficient, $\phi_i$, based on the two-way classification

$$
\begin{array}{c|cc|c}
 & x_i \le M_{X_i} & x_i > M_{X_i} & \\ \hline
y \le M_Y & N_{11} & N_{12} & N_{1\cdot} \\
y > M_Y & N_{21} & N_{22} & N_{2\cdot} \\ \hline
 & N_{\cdot 1} & N_{\cdot 2} &
\end{array}
$$

$$\phi_i = \frac{N_{11}N_{22} - N_{12}N_{21}}{\sqrt{N_{1\cdot}\,N_{2\cdot}\,N_{\cdot 1}\,N_{\cdot 2}}}, \qquad i = 1, \ldots, l; \tag{10}$$

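where $N_{11}$ is the number of units in the population with $x \le M_X$ and $y \le M_Y$. In an analogous way, we can consider Cramér's V coefficient $\hat\phi_i$, based on the following two-way classification, as an estimator of $\phi_i$:

$$
\begin{array}{c|cc|c}
 & x_i \le \hat M_{X_i} & x_i > \hat M_{X_i} & \\ \hline
y \le \hat M_Y & n_{11} & n_{12} & n_{1\cdot} \\
y > \hat M_Y & n_{21} & n_{22} & n_{2\cdot} \\ \hline
 & n_{\cdot 1} & n_{\cdot 2} &
\end{array}
$$

$$\hat\phi_i = \frac{n_{11}n_{22} - n_{12}n_{21}}{\sqrt{n_{1\cdot}\,n_{2\cdot}\,n_{\cdot 1}\,n_{\cdot 2}}}, \qquad i = 1, \ldots, l; \tag{11}$$

where $n_{11}$ is the number of units in the sample with $x \le \hat M_X$ and $y \le \hat M_Y$.

We write $\Phi_{yx} = (\phi_1, \ldots, \phi_l)'$ and $\hat\Phi_{yx} = (\hat\phi_1, \ldots, \hat\phi_l)'$. As in SRSWOR (see Rueda et al. 1998)

$$\mathrm{Cov}(\hat F_Y(M_Y), \hat F_{X_i}(M_{X_i})) = \frac{1-f}{n}\,\beta(1-\beta)\,\phi_i,$$

$\sigma$ can be estimated by $\hat\sigma = \frac{1-f}{n}\,\beta(1-\beta)\,\hat\Phi_{yx}$.

Taking into account that

$$\mathrm{Cov}(\hat F_{X_i}(M_{X_i}), \hat F_{X_j}(M_{X_j})) = \frac{1-f}{n}\,\beta(1-\beta)\,\phi_{ij},$$

where $\phi_{ij}$ is Cramér's V coefficient based on a two-way classification similar to (10) but crossing the variables $X_i$ and $X_j$, we have

$$\widehat{\mathrm{Cov}}(\hat F_{X_i}(M_{X_i}), \hat F_{X_j}(M_{X_j})) = \frac{1-f}{n}\,\beta(1-\beta)\,\hat\phi_{ij},$$

where the $\hat\phi_{ij}$ are estimators of $\phi_{ij}$ obtained using a cross-classification as in (11) crossing $X_i$ and $X_j$.

As $V(\hat F_{X_i}(M_{X_i})) = \frac{1-f}{n}\,\beta(1-\beta)$, writing $\Phi_{xx} = (\phi_{ij})$ with $\phi_{ii} = 1$, we have $\Sigma = \frac{1-f}{n}\,\beta(1-\beta)\,\Phi_{xx}$, and then $\Sigma$ can be estimated by $\hat\Sigma = \frac{1-f}{n}\,\beta(1-\beta)\,\hat\Phi_{xx}$, where $\hat\Phi_{xx} = (\hat\phi_{ij})$ with $\hat\phi_{ii} = 1$.

Thus, if the sample is chosen by means of simple random sampling, the variance can be expressed in the following way:

$$V(\hat F_{opt}(M_Y)) = \frac{1-f}{n}\,\beta(1-\beta)\,\bigl[1 - \Phi_{yx}'\,\Phi_{xx}^{-1}\,\Phi_{yx}\bigr],$$

and therefore the interval

$$\left[\hat F_Y^{-1}\!\left(\hat r_1 - (Q - \hat F_X)'\,\hat\Phi_{xx}^{-1}\hat\Phi_{yx}\right),\ \hat F_Y^{-1}\!\left(\hat r_2 - (Q - \hat F_X)'\,\hat\Phi_{xx}^{-1}\hat\Phi_{yx}\right)\right]$$

is a $100(1-\alpha)\%$ confidence interval for $M_Y$, where

$$\hat r_k = \beta + (-1)^k z_{\alpha/2}\left\{\frac{1-f}{n}\,\beta(1-\beta)\,\bigl[1 - \hat\Phi_{yx}'\,\hat\Phi_{xx}^{-1}\,\hat\Phi_{yx}\bigr]\right\}^{1/2}, \qquad k = 1, 2.$$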
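The coefficient in (11) and the covariance estimates built from it can be sketched in a few lines of Python. The names `median_indicators`, `phi_coefficient` and `estimated_covariance` are hypothetical helpers introduced for illustration; the sample median is taken as $\inf\{y : \hat F(y) \ge 0.5\}$, which is an assumption about how ties and even sample sizes are handled.

```python
import math


def median_indicators(values):
    """0/1 flags for values at or below the sample median inf{y : F_hat(y) >= 0.5}."""
    ordered = sorted(values)
    med = ordered[math.ceil(len(ordered) / 2) - 1]
    return [1 if v <= med else 0 for v in values]


def phi_coefficient(u, v):
    """Coefficient (11) from two 0/1 indicator sequences of equal length."""
    n11 = sum(1 for a, b in zip(u, v) if a and b)
    n12 = sum(1 for a, b in zip(u, v) if a and not b)
    n21 = sum(1 for a, b in zip(u, v) if not a and b)
    n22 = sum(1 for a, b in zip(u, v) if not a and not b)
    denom = math.sqrt((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22))
    if denom == 0:
        raise ValueError("a margin of the two-way table is empty")
    return (n11 * n22 - n12 * n21) / denom


def estimated_covariance(phi, n, N, beta=0.5):
    """((1 - f) / n) * beta * (1 - beta) * phi, the SRSWOR covariance estimate."""
    return (1 - n / N) / n * beta * (1 - beta) * phi


# Illustrative data only.
y = [1, 2, 3, 4]
x = [10, 20, 30, 40]
phi = phi_coefficient(median_indicators(y), median_indicators(x))
```

The same `phi_coefficient` call applied to the indicators of two auxiliary variables gives the entries $\hat\phi_{ij}$ of $\hat\Phi_{xx}$.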

3.2. SRSWOR using only one auxiliary variable

In this simple case, (9) reduces to

$$\hat F_{RE}(M_Y) = \hat F_Y(M_Y) + \hat\phi\,(\beta - \hat F_X) \tag{12}$$

and the variance obtained in (8) can be estimated by

$$\hat V_{min}(\hat F_{RE}(M_Y)) = \hat V(\hat F_Y(M_Y))\,(1 - \hat\phi^2). \tag{13}$$
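Combining (12) and (13) with the interval of Section 3.1 for a single auxiliary variable gives a procedure that can be sketched as below. This is only an illustrative sketch: the function names are hypothetical, $\hat V(\hat F_Y(M_Y))$ is taken from (2), and the empirical distribution function is inverted as $\inf\{y : \hat F_Y(y) \ge p\}$.

```python
import math
from statistics import NormalDist


def quantile_inf(values, p):
    """inf{y : F_hat(y) >= p} over the sorted sample, clamped to the sample range."""
    ordered = sorted(values)
    n = len(ordered)
    k = min(max(math.ceil(p * n - 1e-12), 1), n)
    return ordered[k - 1]


def phi_hat(y, x):
    """Coefficient (11): cross-classify the sample at its own medians."""
    my, mx = quantile_inf(y, 0.5), quantile_inf(x, 0.5)
    n11 = sum(1 for a, b in zip(y, x) if a <= my and b <= mx)
    n12 = sum(1 for a, b in zip(y, x) if a <= my and b > mx)
    n21 = sum(1 for a, b in zip(y, x) if a > my and b <= mx)
    n22 = sum(1 for a, b in zip(y, x) if a > my and b > mx)
    denom = math.sqrt((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22))
    return (n11 * n22 - n12 * n21) / denom


def regression_interval(y, x, MX, N, alpha=0.05, beta=0.5):
    """Univariate regression-type interval; MX is the known population median of x."""
    n = len(y)
    f = n / N
    z = NormalDist().inv_cdf(1 - alpha / 2)
    phi = phi_hat(y, x)
    FX = sum(1 for v in x if v <= MX) / n       # F_hat_X(M_X)
    shift = phi * (beta - FX)                   # correction term A
    half = z * math.sqrt((1 - f) / n * beta * (1 - beta) * (1 - phi ** 2))
    return quantile_inf(y, beta - half - shift), quantile_inf(y, beta + half - shift)
```

When the sample contains more than half of its x values below the known $M_X$, the correction term is negative and both end points move up the ordered y values, which is the intended adjustment for a sample that sits low in the population.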

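3.3. SRSWOR using l = 2 auxiliary variables

In this bivariate case, where $X_1$ and $X_2$ are the auxiliary variables, (9) reduces to

$$\hat F^M_{RE}(M_Y) = \hat F_Y(M_Y) + \hat b_1\,(\beta - \hat F_{X_1}) + \hat b_2\,(\beta - \hat F_{X_2}), \tag{14}$$

where the optimal $\hat b_i$ are given by

$$\hat b_1 = \frac{\hat\phi_1 - \hat\phi_2\hat\phi_{12}}{1 - \hat\phi_{12}^2}, \qquad \hat b_2 = \frac{\hat\phi_2 - \hat\phi_1\hat\phi_{12}}{1 - \hat\phi_{12}^2}, \tag{15}$$

and the variance obtained in (8) can be estimated by

$$\hat V_{min}(\hat F^M_{RE}(M_Y)) = \hat V(\hat F_Y(M_Y))\left(1 - \frac{\hat\phi_1^2 + \hat\phi_2^2 - 2\hat\phi_1\hat\phi_2\hat\phi_{12}}{1 - \hat\phi_{12}^2}\right). \tag{16}$$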
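Formulas (15) and (16) are plain arithmetic on the three coefficients, and a small Python sketch makes it easy to check them against the normal equations $\hat\Phi_{xx}\hat B = \hat\Phi_{yx}$. The function name `bivariate_coefficients` and the numerical inputs are illustrative assumptions, not values from the paper.

```python
def bivariate_coefficients(phi1, phi2, phi12):
    """Return (b1, b2, factor) from (15) and the bracketed factor in (16)."""
    det = 1.0 - phi12 ** 2          # determinant of the 2x2 matrix Phi_xx
    if det == 0:
        raise ValueError("phi12**2 == 1: the two auxiliary variables are redundant")
    b1 = (phi1 - phi2 * phi12) / det
    b2 = (phi2 - phi1 * phi12) / det
    factor = 1.0 - (phi1 ** 2 + phi2 ** 2 - 2 * phi1 * phi2 * phi12) / det
    return b1, b2, factor


# Illustrative coefficients only.
b1, b2, factor = bivariate_coefficients(0.6, 0.5, 0.4)
# Normal equations: b1 + phi12 * b2 = phi1 and phi12 * b1 + b2 = phi2.
```

The variance factor equals $1 - (\hat b_1\hat\phi_1 + \hat b_2\hat\phi_2)$, so adding a second auxiliary variable can only reduce the estimated variance relative to (13) when $\hat\phi_{12}^2 < 1$.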

From (15) and (16) it is evident that the condition $\det(\hat\Sigma) \ne 0$ is non-restrictive: it only requires $\hat\phi_{12}^2 \ne 1$.

4 Generalizations

4.1. Generalization for any design

If the samples are obtained according to a specified sampling design d, which is non-ordered and quantifiable, with design matrix $\Pi = (\pi_{km})$, it would appear logical to construct the distribution function estimator from its Horvitz-Thompson estimates,

$$\hat F_{HTy}(M_Y) = \frac{1}{N}\sum_{k \in s}\frac{1}{\pi_k}\,\delta(M_Y - y_k), \qquad \hat F_{HTi}(M_{X_i}) = \frac{1}{N}\sum_{k \in s}\frac{1}{\pi_k}\,\delta(M_{X_i} - x_{ik}),$$

where $\delta(t) = 1$ if $t > 0$ and $\delta(t) = 0$ if $t \le 0$, since these are admissible in the set of unbiased estimators, giving rise to the following estimate:

$$\hat F^d_{opt}(M_Y) = \hat F_{HTy}(M_Y) + (Q - \hat F_{HTx})'\,\hat\Sigma_{HT}^{-1}\hat\sigma_{HT}, \tag{17}$$

where $\hat F_{HTx} = (\hat F_{HT1}(M_{X_1}), \hat F_{HT2}(M_{X_2}), \ldots, \hat F_{HTl}(M_{X_l}))'$, $\hat\Sigma_{HT} = (a_{ij})$, $a_{ij} = \widehat{\mathrm{Cov}}(\hat F_{HTi}(M_{X_i}), \hat F_{HTj}(M_{X_j}))$, $a_{ii} = \hat V(\hat F_{HTi}(M_{X_i}))$, $i \ne j = 1, \ldots, l$, and $\hat\sigma_{HT} = (\widehat{\mathrm{Cov}}(\hat F_{HTy}(M_Y), \hat F_{HT1}(M_{X_1})), \ldots, \widehat{\mathrm{Cov}}(\hat F_{HTy}(M_Y), \hat F_{HTl}(M_{X_l})))'$.
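The Horvitz-Thompson estimate of the distribution function is a weighted count, as the following minimal Python sketch shows. The function names are hypothetical, the indicator follows the definition of $\delta$ exactly as written in the text (strict inequality), and the two-stratum sample is invented for illustration.

```python
def delta(t):
    """Indicator as defined in the text: 1 if t > 0, otherwise 0."""
    return 1 if t > 0 else 0


def ht_distribution(values, incl_probs, N, point):
    """(1 / N) * sum over the sample of delta(point - y_k) / pi_k."""
    if len(values) != len(incl_probs):
        raise ValueError("one inclusion probability per sampled unit is required")
    return sum(delta(point - y) / p for y, p in zip(values, incl_probs)) / N


# Illustrative stratified sample: stratum 1 has N1 = 6, n1 = 3 (pi = 0.5),
# stratum 2 has N2 = 4, n2 = 1 (pi = 0.25), so N = 10.
sample_y = [1, 2, 3, 10]
sample_pi = [0.5, 0.5, 0.5, 0.25]
F_at_5 = ht_distribution(sample_y, sample_pi, N=10, point=5)
```

Each sampled unit contributes its design weight $1/\pi_k$, so a unit drawn from a lightly sampled stratum moves the estimated distribution function by a larger step.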


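The estimated variances and covariances would be given by

$$\hat V(\hat F_{HTi}(M_{X_i})) = \frac{1}{N^2}\sum_{k,m \in s}\frac{\pi_{km} - \pi_k\pi_m}{\pi_{km}}\,\frac{\delta(M_{X_i} - x_{ik})}{\pi_k}\,\frac{\delta(M_{X_i} - x_{im})}{\pi_m},$$

$$\widehat{\mathrm{Cov}}(\hat F_{HTi}(M_{X_i}), \hat F_{HTj}(M_{X_j})) = \frac{1}{N^2}\sum_{k,m \in s}\frac{\pi_{km} - \pi_k\pi_m}{\pi_{km}}\,\frac{\delta(M_{X_i} - x_{ik})}{\pi_k}\,\frac{\delta(M_{X_j} - x_{jm})}{\pi_m},$$

$$\widehat{\mathrm{Cov}}(\hat F_{HTy}(M_Y), \hat F_{HTj}(M_{X_j})) = \frac{1}{N^2}\sum_{k,m \in s}\frac{\pi_{km} - \pi_k\pi_m}{\pi_{km}}\,\frac{\delta(\hat M_Y - y_k)}{\pi_k}\,\frac{\delta(M_{X_j} - x_{jm})}{\pi_m},$$

and therefore the interval

$$\left[\hat F_{HTy}^{-1}\!\left(\hat r_1 - (Q - \hat F_{HTx})'\,\hat\Sigma_{HT}^{-1}\hat\sigma_{HT}\right),\ \hat F_{HTy}^{-1}\!\left(\hat r_2 - (Q - \hat F_{HTx})'\,\hat\Sigma_{HT}^{-1}\hat\sigma_{HT}\right)\right]$$

is a $100(1-\alpha)\%$ confidence interval for $M_Y$, where

$$\hat r_k = \beta + (-1)^k z_{\alpha/2}\,\bigl\{\hat V(\hat F_{HTy}(M_Y)) - \hat\sigma_{HT}'\,\hat\Sigma_{HT}^{-1}\hat\sigma_{HT}\bigr\}^{1/2}, \qquad k = 1, 2.$$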
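The double sums above can be evaluated directly once the first- and second-order inclusion probabilities are available. The sketch below does this for a vector of 0/1 indicators and checks the result under SRSWOR against the familiar closed form $(1-f)\,p(1-p)/(n-1)$, where p is the sample proportion. The function names are hypothetical, and the diagonal terms $k = m$ are included with $\pi_{kk} = \pi_k$, which is an assumption about how the sums are to be read.

```python
def ht_variance(z, pi, joint, N):
    """(1 / N^2) * sum_{k,m} (pi_km - pi_k pi_m) / pi_km * (z_k / pi_k) * (z_m / pi_m).

    z: 0/1 indicators for the sampled units; pi: first-order inclusion probabilities;
    joint(k, m): second-order inclusion probability for k != m.
    """
    n = len(z)
    total = 0.0
    for k in range(n):
        for m in range(n):
            pkm = pi[k] if k == m else joint(k, m)
            total += (pkm - pi[k] * pi[m]) / pkm * (z[k] / pi[k]) * (z[m] / pi[m])
    return total / N ** 2


def srswor_comparison(z, N):
    """Return the general estimate and the SRSWOR closed form for the same data."""
    n = len(z)
    pi = [n / N] * n
    pkm = n * (n - 1) / (N * (N - 1))
    general = ht_variance(z, pi, lambda k, m: pkm, N)
    p = sum(z) / n
    closed = (1 - n / N) * p * (1 - p) / (n - 1)
    return general, closed


# Illustrative indicators only: n = 5 units sampled from N = 20.
general, closed = srswor_comparison([1, 1, 0, 1, 0], N=20)
```

Replacing one of the two indicator vectors by the indicators of a second variable in the product gives the corresponding covariance estimate.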

If design d is of fixed size, as is the case of the most common designs such as simple random sampling, sampling with replacement, stratified sampling, etc., we could consider as an alternative the variance estimator given by Yates and Grundy, which, in general, presents better properties.

The $\hat F^d_{opt}(M_Y)$ estimator is a particular case of the calibration estimator proposed by Rao (1994) (estimating the population distribution function) and is an alternative that is more efficient than the generalized regression estimator proposed by Särndal (1980). These two estimators agree for simple random sampling, but in general they differ. Fuller and Isaki (1981) and Montanari (1987) have also studied this optimal estimator in the context of unistage designs and the Horvitz-Thompson estimator with basic weights.

4.2. Generalization for any quantile

In this paper, we have considered the estimation of a population median $M_Y$, which is $Q_Y(\beta)$ with $\beta = 1/2$. The extension of the procedures to the case of estimating an arbitrary quantile $Q_Y(\beta)$ when the quantiles of the same order $Q_{X_i}(\beta)$ of the auxiliary variables are known can be obtained by simply considering the corresponding $\beta$ value. From

$$\hat F^M_{RE}(Q_Y(\beta)) = \hat F_Y(Q_Y(\beta)) + (Q - \hat F_X)'B_{opt}, \tag{18}$$

and denoting $A = (Q - \hat F_X)'\hat B_{opt}$,

$$\left[\hat F_Y^{-1}\!\left(\beta - z_{\alpha/2}\sqrt{\hat V(\hat F^M_{RE}(Q_Y(\beta)))} - A\right),\ \hat F_Y^{-1}\!\left(\beta + z_{\alpha/2}\sqrt{\hat V(\hat F^M_{RE}(Q_Y(\beta)))} - A\right)\right]$$

is a $100(1-\alpha)\%$ confidence interval for $Q_Y(\beta)$.


A more general and complex problem is the estimation of a quantile $Q_Y(\beta)$ when the quantiles of (not necessarily equal) orders $\beta_1, \ldots, \beta_k$ of the auxiliary variables $X_1, \ldots, X_k$ are known. By constructing the estimator

$$\hat F^M_{RE}(Q_Y(\beta)) = \hat F_Y(Q_Y(\beta)) + \sum_{i=1}^{k} b_i\,(\beta_i - \hat F_{X_i}(Q_{X_i}(\beta_i))),$$

and following a procedure analogous to that described in Section 3, the corresponding intervals can be constructed.

5 Simulation studies

5.1. Populations

The first population consists of 430 farms with 50 or more beef cattle, surveyed in the 1988 Australian Agricultural and Grazing Industries Survey carried out by the Australian Bureau of Agricultural and Resource Economics. This population was originally used by Chambers et al. (1993) and subsequently by Kuk (1993) for their respective simulation studies. In this case, Y is income from beef and X is the number of beef cattle in each farm.

The second population comprises 338 sugar cane farms surveyed in 1982 in Queensland, Australia, as used originally by Chambers and Dunstan (1986), and later by Rao et al. (1990) and Kuk (1993). We took income from cane as Y, area assigned for growing cane as $X_1$, and the cost as $X_2$.

The MU284 population consists of the 284 municipalities in Sweden. The variable of interest is RMT85, the revenues from 1985 municipal taxation (in millions of kronor), and the auxiliary variables are CS82, the number of Conservative seats in the municipal councils, and SS82, the number of Social-Democratic seats in the municipal councils. Data for these variables have been taken from Särndal et al. (1992). In this population, the municipalities which have extremely large values have been maintained, since the parameters which we wish to estimate are less sensitive to extreme values.

The FAM1500 population consists of 1500 families living in an Andalusian province, as used originally by Rueda et al. (1998). The variable of interest, Y, denotes the cost of food, and two auxiliary variables $X_1$ and $X_2$ denote family income and other costs, respectively.

5.2. Methods

We shall compare different methods of constructing confidence intervals for the median of a variable y, using the medians of one or more auxiliary variables, under simple random sampling and stratified random sampling (with different allocations).

The methods which are compared in this simulation study are those which result from applying Woodruff's method to the ratio estimators, univariate (4) and multiple (5), obtained by Rueda et al. (1998), together with those which we propose in this paper, the univariate regression (12) and multiple regression (14) methods. In addition, we have included the confidence intervals constructed by using the pivotal technique with the ratio and position estimators obtained by Kuk and Mak, $\hat M_{YR}$ and $\hat M_{YP}$.


The estimation methods compared under simple random sampling were developed by their authors for this sampling design and, furthermore, share the characteristic that no model was used to arrive at the estimators on which they are based. An additional point is that we selected these methods for comparison with those proposed in this paper because, in order to estimate the median of the variable of interest by means of a confidence interval, they merely require a knowledge of the median of the auxiliary variable. The same is not true of the kernel method of Kuk (1993), in which it is assumed that the values of the auxiliary variable are known for all elements in the population. Moreover, calculating the kernel estimator proposed by Kuk (1993) requires double summations over both the sample and the population, and hence more intensive computation, even for small sample sizes.

Woodruff's method (3), which does not use auxiliary information and which we shall denote $\hat F_W$, has been chosen as a basis for comparing the remaining methods.

A large number of samples (500) is drawn from the given population according to simple random sampling. For every sample obtained, and for every estimator $\hat t$ ($\hat t = \hat M_{YR}, \hat M_{YP}$), we compute the estimate $\hat t$, the variance estimate $\hat V(\hat t)$ and the confidence interval at the approximate 95% level,

$$\hat t \mp 1.96\,(\hat V(\hat t))^{1/2}.$$

For the univariate ratio ($\hat F_R$) and regression ($\hat F_{RE}$) estimators, and for the multiple ratio ($\hat F_{RM}$) and regression ($\hat F^M_{RE}$) estimators, the confidence interval is obtained using Woodruff's method, as for the classical Woodruff method ($\hat F_W$).

Then we count the number of intervals, C, that contain the true value of the median, $M_Y$, and we use $cove = C/500$ as an estimate of the actual confidence level. Taking $l_j$ as the length of the confidence interval obtained for the jth sample, we can calculate

$$\bar l = \frac{1}{500}\sum_{j=1}^{500} l_j,$$

which is an estimate of the expected value $E(l)$. As an estimate of efficiency we take the ratio, r, of the average length obtained with each method to the average length obtained with the Woodruff method based on the unadjusted estimate of the cumulative distribution function.

The asymptotic variances of $\hat M_{YR}$ and $\hat M_{YP}$ (see Kuk and Mak, 1989) are given by

$$V(\hat M_{YR}) = \frac{1-f}{n}\left(\frac{1}{4 f_y(M_Y)^2} + \left(\frac{M_Y}{M_X}\right)^{2}\frac{1}{4 f_x(M_X)^2} - 2\,\frac{M_Y}{M_X}\,\frac{P_{11} - \frac{1}{4}}{f_x(M_X)\,f_y(M_Y)}\right),$$

$$V(\hat M_{YP}) = \frac{1-f}{n}\,\frac{2}{f_y(M_Y)^2}\,P_{11}(1 - 2P_{11}),$$

where $P_{11}$ is the proportion of units in the population with $x \le M_X$ and $y \le M_Y$. These variances are estimated by replacing $P_{11}$ with its consistent estimator $p_{11}$ (the proportion of units in the sample with $x \le \hat M_X$ and $y \le \hat M_Y$), and $f_x(M_X)$ and $f_y(M_Y)$ with the estimates obtained by applying the kth nearest neighbour method, choosing $O(N^{1/2})$ as the smoothing parameter.

The variance of the univariate regression estimators, $\hat F_{RE}$, is estimated using (13). The variance of the multiple regression estimator, $\hat F^M_{RE}$, is estimated using (16), and the confidence interval is constructed with the $b_i$ estimated as in (15). The variance of the univariate ratio estimators, $\hat F_R$ (see Rueda et al. 1998), is estimated with

$$\hat V(\hat F_R(M_Y)) = 2\,\hat V(\hat F_Y(M_Y))\,(1 - \tilde\phi),$$

where $\tilde\phi$ is based on a two-way classification similar to (11), but replacing the sample median $\hat M_X$ by the population median $M_X$, while the variance of the multiple ratio estimator, $\hat F_{RM}$, is estimated with

$$\hat V(\hat F_{RM}(M_Y)) = 2\,\hat V(\hat F_Y(M_Y))\,\bigl(w_1^2(1 - \tilde\phi_1) + w_2^2(1 - \tilde\phi_2) + w_1 w_2(1 - \tilde\phi_1 - \tilde\phi_2 + \tilde\phi_{12})\bigr),$$

where the $w_i$, $i = 1, 2$, are estimated with

$$\hat w_1 = \frac{1 + \tilde\phi_1 - \tilde\phi_2 - \tilde\phi_{12}}{2(1 - \tilde\phi_{12})}, \qquad \hat w_2 = 1 - \hat w_1.$$
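The Monte Carlo summaries used in this subsection, the observed coverage cove and the average interval length from which r is formed, can be sketched as follows for the classical Woodruff interval (3). This is an illustrative sketch only: the function names, the seed and the synthetic lognormal population are assumptions made here and do not reproduce the populations or the results of the study.

```python
import math
import random
from statistics import NormalDist


def quantile_inf(values, p):
    """inf{y : F_hat(y) >= p} over the sorted values, clamped to their range."""
    ordered = sorted(values)
    n = len(ordered)
    k = min(max(math.ceil(p * n - 1e-12), 1), n)
    return ordered[k - 1]


def woodruff_interval(sample, N, alpha=0.05, beta=0.5):
    """Interval (3) under SRSWOR with variance (2)."""
    n = len(sample)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt((1 - n / N) / n * beta * (1 - beta))
    return quantile_inf(sample, beta - half), quantile_inf(sample, beta + half)


def summarize(intervals, true_median):
    """Return (cove, average length) for a list of (lower, upper) intervals."""
    hits = sum(1 for lo, hi in intervals if lo <= true_median <= hi)
    mean_length = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return hits / len(intervals), mean_length


def simulate(population, n, reps=500, seed=0):
    """Draw reps SRSWOR samples of size n and summarize the Woodruff intervals."""
    rng = random.Random(seed)
    N = len(population)
    true_median = quantile_inf(population, 0.5)
    intervals = [woodruff_interval(rng.sample(population, n), N) for _ in range(reps)]
    return summarize(intervals, true_median)


# Synthetic skewed population, for illustration only.
pop_rng = random.Random(1)
population = [pop_rng.lognormvariate(0, 1) for _ in range(1000)]
cove, mean_length = simulate(population, n=50)
```

For any other method, r would be obtained by dividing its average length by the `mean_length` returned here for the same set of samples.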

The following table summarizes the di¨erent methods compared and the notation used for each, in simple random sampling. Based on the same considerations for the estimators to be compared as those applied to simple random sampling, in strati®ed sampling we have only compared the proposed method, based on estimator (17), with the adaptation of Woodru¨ 's method to this sampling design, as illustrated in SaÈrndal et al. st , respectively. (1992, Chap. 5), which are denoted F^opt and F^W Three strati®ed sampling regimes were investigated in 338 sugar cane farms with 2 strata (population details for the variables X1 , Y and X2 can be seen in Table 1. Methods and notation in SRSWOR Notation

Method

Auxiliary variables

F^W F^R1 , F^R2 , 1 2 F^RE , F^RE , 1 ^ ^2 , MYR , M YR ^1 , M ^2 , M YP YP F^RM F^ M

Woodru¨ ratio (Rueda et al.) regression ratio (Kuk and Mak) position (Kuk and Mak) ratio (Rueda et al.) regression

± X1 , X2 X1 , X2 X1 , X2 X1 , X2 X1 and X2 X1 and X2

RE

70

M. Rueda, A. Arcos

Chambers and Dunstan, 1986): proportional allocation (here termed propor.), optimal allocation (termed optimal) and optimal allocation based on X1 (xoptimal). The strati®cation of the 284 municipalities in Sweden was carried out using the same text as was used for the population. Let the MU284 population be partitioned into two strata in such a way that the municipalities in regions 4 and 5 belong to stratum 1, whereas the remaining municipalities belong to stratum 2. Within this population, we investigated the same allocations as for the previous one. For the two strati®ed populations, X2 was taken as the auxiliary variable to build the interval with the proposed estimator. In both cases, a Yates-Grundy variance type estimator was used to estimate the variance of (17). 5.3. Results Tables 2, 3, 4, and 5 show the results for di¨erent estimators based on 500 samples of sizes 30, 35, 40, 45, 50 and 100 obtained with simple random sampling without replacement, for all populations, except for the FAM1500 population, in which sample sizes 30 and 35 are excluded in order to apply the kth nearest neighbour method, and sample sizes 150 and 200 have been included. All tables show the ratio, r, of the average length for all methods of estimation over the average length by the classical Woodru¨ method of estimation and cove, as an estimate of the actual con®dence level. Table 2 presents the results of our simulation for the beef farms data set. For the Kuk and Mak estimators, the e½ciency with respect to the classical Woodru¨ method of estimation is from 0.41 to 0.73, which is very good. The coverage ranges from 0.73 to 0.898, which is unsatisfactory. However, the ratio estimator reduces the average length of the Woodru¨ simple estimator from 0.78 to 0.91 and that of the regression estimator from 0.59 to 0.75, but the coverage has values that are reasonably close to the nominal level, (from Table 2. 
Ratio of the average length of con®dence intervals to the average length of the Woodru¨ interval (r), and observed coverage rates (cove) for the beef farms population (500 simulated samples, nominal coverage level is 95%) Method

r

F^W : F^R1 : 1 F^RE : 1 ^ MYR : ^1 : M YP

1.00 0.78 0.59 0.73 0.56

F^W : F^R1 : 1 F^RE : 1 ^ MYR : ^1 : M YP

1.00 0.81 0.70 0.50 0.48

n ˆ 30

n ˆ 45

cove

r

0.982 0.976 0.958 0.874 0.898

1.00 0.82 0.61 0.41 0.39

0.952 0.962 0.960 0.840 0.776

1.00 0.83 0.70 0.52 0.50

n ˆ 35

n ˆ 50

cove

r

0.978 0.970 0.960 0.794 0.730

1.00 0.91 0.75 0.53 0.51

0.962 0.972 0.960 0.894 0.802

1.00 0.79 0.72 0.60 0.57

cove n ˆ 40

n ˆ 100

0.950 0.960 0.964 0.862 0.800 0.966 0.962 0.960 0.898 0.858

On estimating the median from survey data using multiple auxiliary information

71

Table 3. Ratio of the average length of con®dence intervals to the average length of the Woodru¨ interval (r), and observed coverage rates (cove) for the sugar cane population (500 simulated samples, nominal coverage level is 95%) Method

r

F^W : F^R1 : F^R2 : 1 F^RE : 2 F^RE : ^1 : M YR ^2 : M YR ^1 : M YP ^2 : M YP F^RM : F^ M : RE

1.00 0.89 0.81 0.71 0.67 0.78 0.75 0.70 0.67 0.71 0.62

F^W : F^R1 : F^R2 : 1 F^RE : 2 : F^RE ^1 : M YR ^2 : M YR ^1 : M YP ^2 : M YP F^RM : F^ M :

1.00 0.94 0.85 0.78 0.74 0.79 0.74 0.70 0.65 0.77 0.69

RE

n ˆ 30

n ˆ 45

cove

r

0.991 0.966 0.964 0.968 0.966 0.968 0.968 0.916 0.954 0.944 0.936

1.00 1.08 0.96 0.84 0.77 0.86 0.79 0.76 0.71 0.85 0.72

0.954 0.948 0.942 0.938 0.928 0.866 0.902 0.844 0.872 0.932 0.912

1.00 0.88 0.81 0.76 0.71 0.81 0.76 0.71 0.66 0.71 0.66

n ˆ 35

n ˆ 50

cove

r

0.958 0.954 0.934 0.930 0.932 0.904 0.912 0.862 0.896 0.930 0.904

1.00 0.97 0.88 0.81 0.76 0.88 0.82 0.77 0.72 0.79 0.70

0.970 0.944 0.940 0.944 0.928 0.900 0.928 0.900 0.914 0.908 0.926

1.00 0.87 0.78 0.77 0.71 0.83 0.74 0.74 0.69 0.72 0.68

cove n ˆ 40

n ˆ 100

0.958 0.954 0.940 0.952 0.954 0.940 0.958 0.896 0.932 0.936 0.934 0.988 0.936 0.940 0.946 0.954 0.884 0.940 0.902 0.924 0.940 0.928

0.96 to 0.976 for the ratio estimator and from 0.958 to 0.964 for the regression estimator). For the sugar cane population, in terms of e½ciency, the regression and 2 ^ 2 , respectively, and M position estimator with X2 as auxiliary variable, F^RE YP M , produce similar results (Table 3). and the multiple regression estimator, F^RE In this population, the empirical evidence shows that all indirect methods are more e½cient than the classical method. Univariate and multivariate ratio and regression methods have similar coverage. The ratio and position method proposed by Kuk and Mak yields a coverage that is reasonably close to the nominal level. It is worth pointing out that the coverage of the estimators is improved by using X2 as an auxiliary variable. ^ 1 (obtained using X 1 as auxiliary From Table 4, we observe that M YP ^ 1 and then variable) produces the smallest average length, followed by M YR 2 ^ MYP (except n ˆ 100). We can see that the ratio estimators produce values of r which are unsatisfactory. M 1 2 , F^RE and F^RE , the reported Turning to the coverage of estimators F^RE values are reasonably close to the nominal level of 0.95. It is interesting to point out that the coverages of the regression estimators are the best. This M ^1 2 , FRE and F^RE , besides being asymptotically normal, have, for suggests that F^RE

72

M. Rueda, A. Arcos

Table 4. Ratio of the average length of con®dence intervals to the average length of the Woodru¨ interval (r), and observed coverage rates (cove) for the MU284 population (500 simulated samples, nominal coverage level is 95%) Method

r

F^W : F^R1 : F^R2 : 1 F^RE : 2 F^RE : ^1 : M YR ^2 : M YR ^1 : M YP ^2 : M YP F^RM : F^ M : RE

1.00 1.26 1.41 0.80 0.82 0.59 0.62 0.56 0.60 0.75 0.72

F^W : F^R1 : F^R2 : 1 F^RE : 2 : F^RE ^1 : M YR ^2 : M YR ^1 : M YP ^2 : M YP F^RM : F^ M :

1.00 1.17 1.18 0.88 0.91 0.66 0.72 0.63 0.70 0.80 0.78

RE

n ˆ 30

n ˆ 45

cove

r

0.988 0.960 0.946 0.958 0.964 0.850 0.846 0.814 0.836 0.946 0.958

1.00 1.38 1.38 0.96 0.98 0.67 0.72 0.64 0.69 0.90 0.85

0.950 0.946 0.932 0.928 0.946 0.798 0.846 0.798 0.824 0.940 0.936

1.00 1.13 1.13 0.84 0.88 0.67 0.70 0.63 0.69 0.77 0.75

n ˆ 35

n ˆ 50

cove

r

0.950 0.944 0.946 0.936 0.948 0.768 0.826 0.776 0.808 0.924 0.938

1.00 1.30 1.27 0.93 0.94 0.69 0.72 0.66 0.70 0.84 0.81

0.960 0.956 0.966 0.942 0.956 0.828 0.862 0.810 0.852 0.944 0.948

1.00 1.25 1.27 0.93 0.97 0.85 0.92 0.83 0.90 0.86 0.82

cove n ˆ 40

n ˆ 100

0.948 0.946 0.940 0.952 0.954 0.784 0.830 0.796 0.836 0.930 0.930 0.962 0.948 0.966 0.946 0.976 0.872 0.922 0.934 0.940 0.948 0.936

moderate sample sizes, good coverage properties. Nevertheless, the coverage for Kuk and Mak's ratio and position estimators is well below the nominal coverage. The results relating to simulation of the FAM1500 population are recorded in Table 5. Note how, for every sample size, the ratio method proposed by Rueda et al. and the ratio method proposed by Kuk and Mak, using X2 as auxiliary variable, do not improve on the classical method. However, the regression method using the same variable always improves on the classical ^ 1 improves on the classical method if the sample size is greater than method. M YP ^ 1 improve on it if the sample size is greater ^ 2 and M or equal to 45, and M YP YR than or equal to 100. In terms of e½ciency, if we observe n ˆ 150 for example, the multiple regression estimator is the best, followed by the univariate regression and position methods (built using X1 ) and then the multivariate ratio method. The associated coverages are, respectively, 0.952, 0.96, 0.94 and 0.956. Thus the improved performance of the regression method is evident as regards all the remaining methods in both aspects: e½ciency and coverage. Of the two ways of constructing a con®dence interval for the median without using auxiliary information, that given in (1) or by following Woodru¨ 's method (3), we have chosen the latter, since the methods examined in


Table 5. Ratio of the average length of confidence intervals to the average length of the Woodruff interval (r), and observed coverage rates (cove) for the FAM1500 population (500 simulated samples, nominal coverage level is 95%)

                      n = 40         n = 45         n = 50
Method                r     cove     r     cove     r     cove
$\hat{F}_{W}$         1.00  0.968    1.00  0.940    1.00  0.938
$\hat{F}_{R1}$        0.79  0.928    0.89  0.922    0.89  0.950
$\hat{F}_{R2}$        1.08  0.922    1.23  0.934    1.20  0.930
$\hat{F}_{RE}^{1}$    0.68  0.936    0.80  0.928    0.78  0.950
$\hat{F}_{RE}^{2}$    0.85  0.934    0.98  0.942    0.95  0.944
$\hat{M}_{YR}^{1}$    1.17  0.994    1.01  0.970    1.00  0.966
$\hat{M}_{YR}^{2}$    2.05  1.000    1.84  0.984    1.71  0.978
$\hat{M}_{YP}^{1}$    1.12  0.992    0.97  0.966    0.94  0.972
$\hat{M}_{YP}^{2}$    1.39  0.992    1.26  0.986    1.17  0.984
$\hat{F}_{RM}$        0.73  0.930    0.82  0.916    0.81  0.938
$\hat{F}_{RE}^{M}$    0.66  0.928    0.77  0.912    0.75  0.954

                      n = 100        n = 150        n = 200
Method                r     cove     r     cove     r     cove
$\hat{F}_{W}$         1.00  0.946    1.00  0.958    1.00  0.938
$\hat{F}_{R1}$        0.81  0.960    0.81  0.952    0.84  0.924
$\hat{F}_{R2}$        1.11  0.938    1.10  0.950    1.14  0.936
$\hat{F}_{RE}^{1}$    0.73  0.950    0.73  0.960    0.76  0.940
$\hat{F}_{RE}^{2}$    0.91  0.940    0.90  0.950    0.94  0.946
$\hat{M}_{YR}^{1}$    0.81  0.954    0.80  0.950    0.82  0.956
$\hat{M}_{YR}^{2}$    1.36  0.938    1.34  0.954    1.39  0.928
$\hat{M}_{YP}^{1}$    0.76  0.964    0.74  0.940    0.76  0.944
$\hat{M}_{YP}^{2}$    0.95  0.964    0.93  0.952    0.95  0.936
$\hat{F}_{RM}$        0.76  0.938    0.76  0.956    0.76  0.930
$\hat{F}_{RE}^{M}$    0.71  0.942    0.72  0.952    0.75  0.930

this article for constructing confidence intervals are based on this technique. However, Table 6 shows the results of a simulation study in which the average length of 500 confidence intervals for the population median and their coverage are compared. In this study, the variance of the direct estimator (which we denote $\hat{M}_Y$), $V(\hat{M}_Y)$, is estimated by means of

$$\hat{V}(\hat{M}_Y) = \frac{1-f}{4n}\,\frac{1}{f_y(M_Y)^{2}}$$

and we replace the unknown value $f_y(M_Y)$ by the estimate $\hat{f}(M_Y)$ obtained by applying the kth nearest neighbour method specifically at the sample median $\hat{M}_Y$, whereas $V(\hat{F}_Y(M_Y))$ is estimated by using (2) in Woodruff's method. These results show that the efficiency is similar for both methods when the sample size is large, except for the beef farms population. As for the coverage, Woodruff's method attains the most appropriate values in almost all the populations, while in the beef farms population the pivotal method does not even achieve 90 per cent coverage for the largest sample sizes.
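The pivotal interval described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it assumes simple random sampling without replacement with sampling fraction f = n/N, uses the common one-dimensional kth nearest neighbour density estimate k / (2 n d_k), picks k by a hypothetical square-root rule of thumb, and uses the normal critical value 1.96. The function names are illustrative.

```python
import math
from statistics import median


def knn_density(sample, x, k):
    """kth nearest neighbour density estimate at x: k / (2 * n * d_k(x))."""
    distances = sorted(abs(y - x) for y in sample)
    d_k = distances[k - 1]  # distance from x to its kth nearest sample point
    if d_k == 0:
        raise ValueError("too many ties at x; increase k")
    return k / (2 * len(sample) * d_k)


def pivotal_median_interval(sample, N, k=None, z=1.96):
    """Return (median, estimated variance, (lower, upper)) under the assumptions above."""
    n = len(sample)
    if k is None:
        k = max(2, round(math.sqrt(n)))  # hypothetical rule of thumb, not from the paper
    m_hat = median(sample)
    f_hat = knn_density(sample, m_hat, k)  # density estimate at the sample median
    # (1 - f) / (4 n f_y(M_Y)^2), with f = n / N assumed to be the sampling fraction
    var_hat = (1 - n / N) / (4 * n * f_hat ** 2)
    half_width = z * math.sqrt(var_hat)
    return m_hat, var_hat, (m_hat - half_width, m_hat + half_width)


# Toy data only: 11 equally spaced values drawn from a population of size 110.
m_hat, var_hat, (lower, upper) = pivotal_median_interval(list(range(1, 12)), N=110)
```

The interval is symmetric about the sample median, so its length depends entirely on the density estimate; a poor choice of k is one plausible way for such an interval to become too short and lose coverage.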


Table 6. Woodruff ($\hat{F}_{W}$) versus pivotal ($\hat{M}_{Y}$) method

                        MU284          BEEF           SUGAR          FAM1500
n     Method            r     cove     r     cove     r     cove     r     cove
30    $\hat{F}_{W}$     1.00  0.988    1.00  0.982    1.00  0.991    1.00  0.968
      $\hat{M}_{Y}$     0.70  0.872    0.83  0.954    0.94  0.976    1.54  0.982
50    $\hat{F}_{W}$     1.00  0.960    1.00  0.962    1.00  0.970    1.00  0.938
      $\hat{M}_{Y}$     0.79  0.888    0.72  0.866    0.95  0.930    1.29  0.976
100   $\hat{F}_{W}$     1.00  0.962    1.00  0.966    1.00  0.988    1.00  0.946
      $\hat{M}_{Y}$     1.02  0.914    0.83  0.888    0.96  0.954    1.04  0.960

For the FAM1500 population, a sample size of n = 40 is used.

Table 7. Ratio of the average length of confidence intervals to the average length of the Woodruff interval (r), and observed coverage rates (cove) for the stratified MU284 population (500 simulated samples, nominal coverage level is 95%)

                          proportional   optimal        x-optimal
n     Method              r     cove     r     cove     r     cove
30    $\hat{F}_{Wst}$     1.00  0.956    1.00  0.908    1.00  0.956
      $\hat{F}_{opt}$     0.77  0.938    0.86  0.930    0.83  0.958
35    $\hat{F}_{Wst}$     1.00  0.922    1.00  0.946    1.00  0.916
      $\hat{F}_{opt}$     0.92  0.960    0.87  0.942    0.89  0.940
40    $\hat{F}_{Wst}$     1.00  0.924    1.00  0.938    1.00  0.924
      $\hat{F}_{opt}$     0.92  0.938    0.87  0.948    0.89  0.944
45    $\hat{F}_{Wst}$     1.00  0.946    1.00  0.940    1.00  0.944
      $\hat{F}_{opt}$     0.89  0.942    0.88  0.938    0.87  0.948
50    $\hat{F}_{Wst}$     1.00  0.938    1.00  0.946    1.00  0.942
      $\hat{F}_{opt}$     0.85  0.942    0.89  0.950    0.86  0.930
100   $\hat{F}_{Wst}$     1.00  0.920    1.00  0.936    1.00  0.936
      $\hat{F}_{opt}$     0.91  0.950    0.86  0.936    0.85  0.936

A similar problem in the empirical coverages of some estimators is present in the simulation study of Särndal et al.; they solve it by replacing the z-value with the corresponding value of Student's t distribution with n − 1 degrees of freedom. We have also calculated the intervals using both critical values. Using the t-value slightly increases the average length of the intervals, but the rise in empirical coverage is also small, and thus the same coverage problems in the estimators used by Kuk and Mak appear for the first three populations. In short, the simulation studies carried out reveal that substituting the t distribution for the normal distribution has a very limited effect. Tables 7 and 8 show the results of the simulations for the stratified MU284 and sugar cane populations, respectively. In both cases, the interval was constructed on the basis of the $\hat{F}_{opt}$ estimator, from the auxiliary variable $X_2$.
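The limited effect of swapping the critical value can be illustrated with a sketch of a Woodruff-type interval. The code below is not the paper's equations (2) and (3), which are not reproduced here; it assumes simple random sampling without replacement and takes the standard error of the estimated distribution function at the median to be the square root of (1 − f)/(4n), with f = n/N. The t critical value 2.045 for 29 degrees of freedom is taken from standard tables, and the data are a toy sample.

```python
import math
from statistics import NormalDist


def woodruff_interval(sample, N, crit):
    """Woodruff-type interval: invert the empirical c.d.f. at 0.5 -/+ crit * s.e."""
    y = sorted(sample)
    n = len(y)
    # Assumed s.e. of the estimated distribution function at the median (p = 0.5).
    se = math.sqrt((1 - n / N) * 0.25 / n)

    def quantile(p):
        p = min(max(p, 0.0), 1.0)
        # Smallest order statistic whose empirical c.d.f. value is at least p.
        return y[max(math.ceil(n * p) - 1, 0)]

    return quantile(0.5 - crit * se), quantile(0.5 + crit * se)


sample = list(range(1, 31))            # toy data, n = 30
z_crit = NormalDist().inv_cdf(0.975)   # normal critical value, about 1.96
t_crit = 2.045                         # 0.975 quantile of Student's t, 29 d.f. (table value)
z_interval = woodruff_interval(sample, N=300, crit=z_crit)
t_interval = woodruff_interval(sample, N=300, crit=t_crit)
```

On this toy sample the two critical values lead to the same pair of order statistics, which is in line with the small changes in length and coverage reported above.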


Table 8. Ratio of the average length of confidence intervals to the average length of the Woodruff interval (r), and observed coverage rates (cove) for the stratified sugar cane population (500 simulated samples, nominal coverage level is 95%)

                          proportional   optimal        x-optimal
n     Method              r     cove     r     cove     r     cove
30    $\hat{F}_{Wst}$     1.00  0.958    1.00  0.930    1.00  0.886
      $\hat{F}_{opt}$     0.83  0.922    0.90  0.900    0.92  0.900
35    $\hat{F}_{Wst}$     1.00  0.936    1.00  0.946    1.00  0.938
      $\hat{F}_{opt}$     0.84  0.938    0.85  0.896    0.85  0.900
40    $\hat{F}_{Wst}$     1.00  0.928    1.00  0.958    1.00  0.920
      $\hat{F}_{opt}$     0.92  0.932    0.87  0.938    0.94  0.924
45    $\hat{F}_{Wst}$     1.00  0.946    1.00  0.946    1.00  0.936
      $\hat{F}_{opt}$     0.88  0.938    0.85  0.928    0.87  0.920
50    $\hat{F}_{Wst}$     1.00  0.942    1.00  0.950    1.00  0.920
      $\hat{F}_{opt}$     0.89  0.940    0.89  0.934    0.86  0.918
100   $\hat{F}_{Wst}$     1.00  0.962    1.00  0.910    1.00  0.930
      $\hat{F}_{opt}$     0.91  0.968    0.90  0.940    0.90  0.956
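The two summary measures reported in these tables can be computed from a set of simulated intervals as sketched below. The function name and the toy intervals are illustrative assumptions, not data from the study: r is the average interval length of a method divided by the average length of the reference (Woodruff) intervals, and cove is the proportion of intervals that contain the true population median.

```python
def interval_summary(intervals, reference_intervals, true_median):
    """Return (r, cove) for one method from simulated (lower, upper) intervals."""
    mean_length = sum(u - l for l, u in intervals) / len(intervals)
    reference_length = sum(u - l for l, u in reference_intervals) / len(reference_intervals)
    # Proportion of the method's intervals that cover the true median.
    cove = sum(l <= true_median <= u for l, u in intervals) / len(intervals)
    return mean_length / reference_length, cove


# Toy intervals for illustration only (the study uses 500 simulated samples).
method = [(4, 8), (5, 9), (7, 10), (3, 6)]
woodruff = [(3, 9), (4, 10), (5, 11), (2, 8)]
r, cove = interval_summary(method, woodruff, true_median=6.5)
```

A value of r below 1 means shorter intervals than the reference method, which is only an improvement when cove stays close to the nominal level.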

The $X_1$ variable was used to perform the x-optimal allocation and to stratify the sugar cane data. Results were similar for all allocations: there was a considerable reduction in the length of the interval when the $\hat{F}_{opt}$ estimator was used, and this reduction was of similar magnitude in both populations. In the MU284 population, with simple random sampling, the regression estimator with the $X_2$ variable did not produce any great reduction in interval lengths; a greater reduction was obtained for stratified sampling. Such was not the case in the sugar cane population, in which the regression estimator with the $X_2$ variable produced a greater reduction in simple random sampling than in stratified sampling. In both populations, coverage was seen to approach the nominal value as the sample size increased.

5.4. Conclusions

In simple random sampling, our simulation studies suggest that little efficiency can be gained (and only on occasions) by applying ratio estimators, and that their coverage is satisfactorily close to the nominal level. The position estimators proposed by Kuk and Mak perform well as regards efficiency, but present serious problems with respect to coverage. The intervals constructed from the regression estimators are satisfactory both from the point of view of efficiency and of coverage. Indeed, the length of the intervals constructed with the regression estimators with a single auxiliary variable is, on average, less than that of those constructed using Woodruff's method. Even when the linear regression between the auxiliary variable and the main one is not very strong, as occurs with the FAM1500 population with the second auxiliary variable ($R^2 = 0.298$), these intervals have a very high real coverage. If we consider the two auxiliary variables to construct the regression estimator, the resulting intervals produce an even greater reduction in length with respect to the univariate models, although real coverage is sometimes slightly reduced (for example, in population MU284). In stratified random sampling, the intervals constructed from the optimal estimator are satisfactory both from the point of view of efficiency and of coverage. In conclusion, we found evidence that greater efficiency can be obtained by using the proposed estimators. These estimators can be extended to the case of estimating other quantiles, and it is expected that the satisfactory results obtained in the case of the median can also be attained in the general case.

Acknowledgements. The authors are grateful to Dr. R. L. Chambers for providing the sugar cane and beef farm data and to Dr. J. A. Mayor Gallego for providing the FAM1500 data. The authors would like to thank the referee for his many helpful comments and suggestions.

References

Chambers RL, Dunstan R (1986) Estimating distribution functions from survey data. Biometrika 73:597–604
Fernández García FR, Mayor Gallego JA (1994) Muestreo en Poblaciones Finitas: Curso Básico. P. P. U., Barcelona
Fuller WA, Isaki CT (1981) Survey design under superpopulation models. In: Current Topics in Survey Sampling. Academic Press, New York
Kuk AYC (1993) A kernel method for estimating finite population distribution functions using auxiliary information. Biometrika 80:385–392
Kuk AYC, Mak TK (1989) Median estimation in presence of auxiliary information. Journal of the Royal Statistical Society Series B 51:261–269
Mak TK, Kuk AYC (1993) A new method for estimating finite-population quantiles using auxiliary information. The Canadian Journal of Statistics 25:29–38
Montanari GE (1987) Post-sampling efficient QR-prediction in large-sample surveys. International Statistical Review 55:191–202
Rao JNK (1994) Estimating totals and distribution functions using auxiliary information at the estimation stage. Journal of Official Statistics 10:153–165
Rao JNK, Kovar JG, Mantel HJ (1990) On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika 77:365–375
Rueda M, Arcos A, Artés E (1998) Quantile interval estimation in finite population using a multivariate ratio estimator. Metrika 47:203–213
Särndal CE, Swensson B, Wretman J (1992) Model Assisted Survey Sampling. Springer-Verlag, New York
Särndal CE (1980) On π-inverse weighting versus best linear unbiased weighting in probability sampling. Biometrika 67:527–537
Woodruff RS (1952) Confidence intervals for medians and other position measures. Journal of the American Statistical Association 47:635–646
