An investigation of triple system estimators in censuses - IOS Press


Statistical Journal of the IAOS 29 (2013) 53–68 DOI 10.3233/SJI-130760 IOS Press

An investigation of triple system estimators in censuses

Bernard Baffour (a,∗), James J. Brown (b) and Peter W.F. Smith (b)

(a) Institute for Social Science Research, University of Queensland, Brisbane, Australia
(b) Southampton Statistical Sciences Research Institute, University of Southampton, Southampton, UK

Abstract. The value of a census cannot be overstated, given that no other data resource provides such detailed information about the population. Further, censuses are often the only historical data source to map out change over time due to the consistency of questions asked. However, a census is often the most expensive undertaking – other than going to war – that a country embarks on. Countries are thus seeking more cost-effective alternatives. This paper details some exploratory research into one such alternative, based on capture-recapture methods. Capture-recapture methods have been used for population estimation for decades, but the focus has been on dual system estimation. Dual system measurement of the population has been criticized for its reliance on the independence assumption between the two systems. This assumption is untestable, and its failure introduces bias into the estimates of the population. The most logical improvement of dual system estimation is triple system estimation. In this paper, a simulation study is carried out to compare the performance of different dual and triple system estimators of the population size under various dependency scenarios. Performance is explored through both the bias and variability. The study shows that the dual system estimator copes well with dependence, provided the coverage of both lists is reasonably high. In addition, although the triple system estimators yield less biased estimates of the population, the dual system estimator is shown to be robust enough to cope with low levels of dependence.

Keywords: Census estimation, capture-recapture, non-response, population coverage

1. Introduction

Census taking has had to evolve rapidly in the past few years, and a number of countries have been exploring new, alternative ways of producing suitably detailed census information, particularly at smaller levels of geography and for special population groups, such as ethnic minorities. In addition, an increasingly mobile population, living in more complex households, means that the standard definition of a census as a count of the population and households at their usual residence is becoming difficult to accomplish. Further, there is an increased demand for more frequent information to allow for better planning of services and to account for changes in society. This means that the standard decennial or quinquennial census periodicity can lead to less up-to-date data than is required by census users. Changes in technology, as well as the widespread availability of additional data sources, provide an opportunity to consider whether there is a more cost-effective and efficient way of producing census information. Census participation is declining, and therefore innovations in the census methodology which harness the current explosion of ancillary data, while reducing public response burden, are being sought.

∗ Corresponding author: Bernard Baffour, Institute for Social Science Research, University of Queensland, Brisbane, Australia. E-mail: [email protected].

According to the United Nations, as at 1 October 2012, 191 countries had completed the arduous task of taking an accurate head count of the number of people living within their geographical borders. In addition, 37 other countries are scheduled to undertake censuses of their population by the end of 2014, enumerating roughly 98% of the world’s population [26]. Obtaining counts of the population resident in a country is of immense benefit for planning and apportionment of resources. It therefore follows that the strength

© 2013 – IOS Press and the authors. All rights reserved. 1874-7655/13/$27.50


B. Baffour et al. / An investigation of triple system estimators in censuses

of any census rests on its near-complete coverage of the population. However, the experiences of the 2000 round of international censuses suggest that the challenge of achieving high coverage was, and is going to be, substantial. One of the main issues that arose from these censuses is that modern societal changes do have an impact on the census methodology. It is undeniable that the ‘traditional family’ household that was the norm in the 1970s is very different from that of the new millennium. Currently in the UK – and indeed in much of Western Europe, North America, Australia and New Zealand – the household structure comprises a large proportion of cohabiting couples, families with part-resident children and multiple occupancy households. In addition, owing in part to education and migration, some sections of the population are very transient.

Despite the best efforts of the national statistical office, the census will not achieve universal coverage of the population of interest. Therefore, in order to ensure that everyone is accounted for, national statistical offices undertake a coverage assessment as part of their census operations. The most widely used tool for this is a post-enumeration survey that evaluates the coverage of the census. The post-enumeration survey is often a specially designed survey that comprises an intensive re-enumeration of a sample of the population. The post-enumeration survey, like the census, will not have perfect coverage of the sampled areas. However, by assuming that the inclusion of an individual in the survey is independent of the individual’s inclusion in the census, it is possible to estimate the population size under dual system estimation (see [11]).

The main criticism of dual system estimation is that population estimates derived under the independence assumption may be wrong. This is because, firstly, the same individuals could be omitted from both the census and post-enumeration survey.
Secondly, an individual’s inclusion in the census could have a direct causal effect on their inclusion in the survey. The first type of dependence can easily be accommodated through post-stratification. This is done by dividing the population into sub-groups such that within each sub-group the probability of being on a particular list does not vary from individual to individual. These sub-groups are chosen based on various characteristics, such as geography, age, sex and other sociodemographic factors. In contrast, the second form of dependence is difficult to correct for, and most national statistical institutes implementing dual system estimation ensure that there is operational independence between the census and the post-enumeration survey. Nevertheless, if the census and survey processes are independent and the population is divided into homogeneous groups of similar individuals, then dual system estimation can be applied to each of these subgroups to provide unbiased estimates of the population. Clearly, it is relatively simple to ensure operational independence but far more difficult to ensure behavioural independence. Even if the census and survey are truly independent operationally, the act of participation in the census will, for some individuals, affect their participation in the survey. In practice, most national statistics institutes add an adjustment term to the dual system estimates (see [8] for the UK, and [23] for the US).

The most feasible extension of the dual system estimation methodology in the presence of dependence is to include a third list to correct for this dependence. A third list has the obvious advantage of making it possible to check whether or not there truly is independence between the census and post-enumeration survey, through triple system estimation (see [18,29]). The third list can be matched to the census and survey, and subsequently any dependence between them can be controlled for. To achieve this, the third list has to comprise a source of names and addresses which can be cross-referenced with the census and survey lists. In theory, an administrative data register is the most viable choice for this third list. However, in most countries such a single register does not exist, so it is created by combining several administrative lists, for instance from birth, health, education and employment records held by different government departments. The key to accomplishing this is being able to link records across the different registers, most likely through the existence of a number that is unique to each individual in the population.
In practice, the availability of an administrative register that fully covers the population makes the census redundant. This is certainly the case for several western European countries, such as Sweden, Denmark and Norway, which rely solely on administrative registers as their main source for population estimation. In some countries, however, such as the US [5], France [17], UK [16] and Brazil [14], the spiralling cost of undertaking the census (which is often the largest government peace-time mobilization) has meant that it has become imperative to consider alternative approaches, and specifically how to use administrative data. Although the ideas behind triple system estimation are not new, there has not been a real world application of triple system estimation within a census. The US trialled a version of the approach during the dress rehearsal for the 1990 census in parts of St Louis, Missouri [27]. However, a 1999 US Supreme Court judgement ruled that sampling was not consistent with the US Census Act for providing apportionment for legislative representation [5]. Therefore, the focus has been more on the feasibility of an administrative system based census (along similar lines to that adopted in the Scandinavian countries), see [20].

The outline of the paper is as follows. In the next section, the theory behind underenumeration measurement for dual and triple lists is given. The general theory of underenumeration measurement lies within the broader area of capture-recapture methods, and the loglinear modelling framework provides a convenient representation. Section 3 presents the simulation study undertaken to compare the dual system and triple system estimators, together with its results. The estimators’ performance is compared using the bias and variability of the estimators, under different scenarios. The final sections consist of some concluding remarks, followed by a discussion.

2. Underenumeration measurement

2.1. Dual system estimation

In general, when there are two lists, it is possible to estimate the number missed by both lists, $n_{00}$, by using dual system estimation and considering the number observed in
– both the First and Second Lists ($n_{11}$);
– the First List but not the Second ($n_{10}$); and
– the Second List but not the First ($n_{01}$).

The estimate of those missed by both lists (i.e. the underenumeration) is given by the Lincoln-Petersen estimator,

$$\hat{n}_{00} = \frac{n_{01}\, n_{10}}{n_{11}}. \qquad (1)$$
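As a minimal sketch of Eq. (1) in Python, the missing cell and the implied population total could be computed as below. The function name and the example counts are illustrative assumptions and do not come from the paper.

```python
def dual_system_estimate(n11, n10, n01):
    """Lincoln-Petersen estimate of the missing cell and of the population size.

    n11: counted on both lists; n10: first list only; n01: second list only.
    """
    if n11 <= 0:
        raise ValueError("n11 must be positive for the estimator to exist")
    n00_hat = n01 * n10 / n11           # Eq. (1)
    n_hat = n11 + n10 + n01 + n00_hat   # observed cells plus estimated missing cell
    return n00_hat, n_hat


# Hypothetical counts, chosen so that the two lists are exactly independent
n00_hat, n_hat = dual_system_estimate(n11=630, n10=270, n01=70)
```

With these illustrative counts the estimated missing cell is 30 and the estimated population size is 1000.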

Dual system estimation relies on five key (but untestable) assumptions:
– the population is closed,
– individuals can be matched from capture to recapture,
– capture in the second sample is independent of capture in the first,
– capture probabilities are homogeneous across all individuals, and
– there are no erroneous captures in either the first or second list.

These assumptions can be grouped into two broad assumptions, independence and homogeneity. Firstly, independence means that the processes yielding the first and second list counts are not related. Secondly, homogeneity means that the probability of an individual being in the first or second list is assumed to be the same across individuals. Under these conditions, the estimate of the missing cell, $n_{00}$, is ultimately unbiased. In most cases, failure of these assumptions leads to biased population estimates. This is because the ‘true’ estimate of the missing cell should be

$$\hat{n}^{*}_{00} = \gamma\, \frac{n_{01}\, n_{10}}{n_{11}}, \qquad (2)$$

where $\gamma$ is the dependence.
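To see what Eq. (2) implies, the short Python sketch below works through a hypothetical complete 2×2 table. The counts are invented for illustration, and $\gamma$ is interpreted as the odds ratio between the two lists, which is how dependence is parameterised in the simulation study of Section 3.

```python
# Hypothetical complete table, including the normally unobserved cell n00
n11, n10, n01, n00 = 600, 200, 150, 100

gamma = (n11 * n00) / (n10 * n01)    # odds ratio between the two lists
n00_naive = n01 * n10 / n11          # Eq. (1), which implicitly assumes gamma = 1
n00_adjusted = gamma * n00_naive     # Eq. (2)

true_total = n11 + n10 + n01 + n00
naive_total = n11 + n10 + n01 + n00_naive
relative_bias = naive_total / true_total - 1
```

Here the odds ratio is 2, so Eq. (1) recovers only 50 of the 100 people missed by both lists, and the population total is under-estimated by just under 5%. Multiplying by $\gamma$ restores the true missing cell, but in practice $\gamma$ is unknown without additional information.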

This bias is termed correlation bias and can be due to two types of dependencies:
1. List dependence: the act of being included in the first list makes an individual more or less likely to be included in the second list, i.e. inclusion in the first sample has a direct causal effect on inclusion in the second. This is sometimes referred to as causal dependence.
2. Heterogeneity: even if the two lists are independent within individuals, the lists may become dependent if the capture probabilities are heterogeneous among individuals. This is similar to the Yule association paradox where an aggregation of two independent 2×2 tables may result in a dependent table [22]. This is sometimes referred to as apparent dependence.

In practice, these two types of dependencies are confounded and cannot be separated unless additional information is provided. On the one hand, the direction of correlation bias due to a failure of the homogeneity assumption is often straightforward. When there is heterogeneity then there are different capture probabilities, and the same individuals are likely to be missed from both the first and second lists. In this case, the estimates of the population will be biased downwards, and lead to an underestimate of the population. On the other hand, the direction of the correlation bias due to the failure of the list independence assumption is less certain: the effect that inclusion on the first list has on the individual’s propensity to be included in the second list can be either negative or positive. An individual included in the first list may be more aware of


the census process, and hence more likely to participate in the second list. In this case the correlation bias is positive and leads to an under-estimation of the population. Alternatively, the individual could feel that they have already taken part in the first list and so actively avoid being included in the second list. This type of dependence will lead to an over-estimation. It is difficult to ascertain which type of causal dependence is more likely.

In reality, the homogeneity assumption can be relaxed such that individuals’ capture probabilities are required to be homogeneous on only one list [4,8]. In terms of logistics in a large scale operation, such as a population census, this relaxation of the homogeneity assumption can be very important; often the homogeneity assumption is assumed to hold strictly on the second list (i.e. the post-enumeration survey), which makes sense because it is a more intensive re-enumeration of (a subgroup of) the population, and more effort can be spent on ensuring that similar individuals are grouped together. Since the effectiveness of any coverage adjustment is hugely reliant on how well the first enumeration of the population has been conducted, there can be some flexibility as to how homogeneous the individuals in the post-strata are, and effort can instead be spent on ensuring that the entire population has been enumerated. A proof of this is shown in the Appendix.

However, the difficulty lies in the fact that the list independence and homogeneity assumptions are untestable in practice. Thus, several authors have made significant contributions to relaxing these assumptions. In the context of census underenumeration, work has been undertaken that looks into relaxing the independence assumption. [21] proposed several alternative dual system estimates to incorporate the correlation bias due to list dependence. [6,28] suggest using the population totals and sex ratios as additional demographic information to assess the dependence.
[2,4] modelled heterogeneity using a logistic model containing several explanatory variables under the assumption of independence. If there are no covariates that can explain the heterogeneity, then individuals can be thought of as having some random effects that determine their catchability in each sample. So under the assumption of independence across individuals, [1,15] allow for heterogeneous capture probabilities by using a logit model with random effects. This model is the same as that suggested by [24] in an application to educational testing, with individuals differing on a continuous scale. More relevantly, [29] proposed using an administrative list as a third system. In this paper, triple system estimation will be investigated where additional information in the form of a third list will be used to relax the assumption of independence between the first and second lists.

2.2. Triple system estimation

For the case when there are three lists, the interrelationships between the various lists can now be measured; in particular, the (in)dependence between the first and second lists can be investigated. In comparison with dual system estimation, triple system estimation seemingly relies on fewer assumptions:
– the population is closed;
– individuals can be matched from capture to recapture;
– capture probabilities are homogeneous across all individuals; and
– there are no erroneous captures on any of the lists.

Under triple system estimation, the cell counts are given by $\{n_{ijk}\}$, with the missing cell denoted by $n_{000}$, which can be estimated through loglinear modelling (see [18,19]). Let $\mu_{ijk}$ be the expected number of individuals in the $(i, j, k)$th cell. Also denote the marginal counts

$$n_{+jk} = \sum_i n_{ijk}, \quad n_{i+k} = \sum_j n_{ijk}, \quad n_{ij+} = \sum_k n_{ijk},$$
$$n_{i++} = \sum_j \sum_k n_{ijk}, \quad n_{+j+} = \sum_i \sum_k n_{ijk}, \quad n_{++k} = \sum_i \sum_j n_{ijk},$$
$$n_{+++} = \sum_i \sum_j \sum_k n_{ijk} = N,$$

with the expected counts $\mu_{ij+}$, $\mu_{+jk}$, $\mu_{i+k}$ etc. similarly defined. Then the saturated model is

$$\log \mu_{ijk} = \lambda + \lambda^{(1)}_{i} + \lambda^{(2)}_{j} + \lambda^{(3)}_{k} + \lambda^{(12)}_{ij} + \lambda^{(13)}_{ik} + \lambda^{(23)}_{jk} + \lambda^{(123)}_{ijk}. \qquad (3)$$

However, since $n_{000}$ is missing, the log-linear model in Eq. (3) is not identified, so for triple system estimation the assumption is made that there is no three-way association. This model, also referred to as the homogeneous association model, is the ‘saturated’ model in capture-recapture estimation. In all there are seven other non-saturated hierarchical models: mutual independence between all three lists, three partial independence models with one interaction term and three conditional independence models with two interaction terms. In fact the eight hierarchical models can be summarized into four different models:

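1. If the First, Second and Third lists are mutually independent then

$$\log \mu_{ijk} = \lambda + \lambda^{(1)}_{i} + \lambda^{(2)}_{j} + \lambda^{(3)}_{k}. \qquad \text{Model I}$$

There is no direct estimator available to find the missing cell estimate, $\hat{\mu}_{000}$, but the population estimate, $\hat{N}$, can be found by solving the quadratic (the $\hat{N}^3$ terms cancel)

$$(\hat{N} - n_{1++})(\hat{N} - n_{+1+})(\hat{N} - n_{++1}) = \hat{N}^{2}(\hat{N} - n), \qquad (4)$$

where $n$ is the total number of observed individuals.

2. If the First and Second lists are partially independent of the Third list then

$$\log \mu_{ijk} = \lambda + \lambda^{(1)}_{i} + \lambda^{(2)}_{j} + \lambda^{(3)}_{k} + \lambda^{(12)}_{ij}. \qquad \text{Model II}$$

Here, the missing cell estimate is given by

$$\hat{\mu}_{000} = \frac{n^{*}_{++0}\, n_{001}}{n_{++1} - n_{001}}, \qquad (5)$$

where $n^{*}_{++0} = n_{110} + n_{010} + n_{100}$ is the number of observed individuals who are not on the Third list.

3. If the First and Second lists are conditionally independent of each other given the Third list then

$$\log \mu_{ijk} = \lambda + \lambda^{(1)}_{i} + \lambda^{(2)}_{j} + \lambda^{(3)}_{k} + \lambda^{(13)}_{ik} + \lambda^{(23)}_{jk}. \qquad \text{Model III}$$

Here, the missing cell estimate is given by

$$\hat{\mu}_{000} = \frac{n_{010}\, n_{100}}{n_{110}}. \qquad (6)$$

4. If the First, Second and Third lists show pairwise dependence, in other words there is homogeneous association, then

$$\log \mu_{ijk} = \lambda + \lambda^{(1)}_{i} + \lambda^{(2)}_{j} + \lambda^{(3)}_{k} + \lambda^{(12)}_{ij} + \lambda^{(13)}_{ik} + \lambda^{(23)}_{jk}. \qquad \text{Model IV}$$

The missing cell estimate is

$$\hat{\mu}_{000} = \frac{n_{111}\, n_{100}\, n_{010}\, n_{001}}{n_{110}\, n_{101}\, n_{011}}. \qquad (7)$$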
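The four estimators can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name, the dictionary keyed by strings such as "110", and the example counts are all hypothetical. Two readings are assumed and flagged in the comments: $n^{*}_{++0}$ in Model II is taken as the sum of the three observed cells that are not on the Third list, and Model III divides by $n_{110}$, which is what conditional independence of the First and Second lists given the Third implies. For Model I the larger root of the quadratic is taken, which is also an assumption of the sketch.

```python
import math


def tse_missing_cell(n, model):
    """Estimate the missing cell mu_000 from the seven observed cells.

    n maps strings such as "110" (on Lists 1 and 2, not on List 3) to counts
    and must not contain "000"; model is one of "I", "II", "III", "IV".
    """
    if model == "I":
        # Margins n_{1++}, n_{+1+}, n_{++1} and the observed total n
        a = sum(v for k, v in n.items() if k[0] == "1")
        b = sum(v for k, v in n.items() if k[1] == "1")
        c = sum(v for k, v in n.items() if k[2] == "1")
        total = sum(n.values())
        # Eq. (4) rearranged: (a + b + c - n) N^2 - (ab + ac + bc) N + abc = 0
        qa = a + b + c - total
        qb = -(a * b + a * c + b * c)
        qc = a * b * c
        disc = qb * qb - 4 * qa * qc
        if qa <= 0 or disc < 0:
            raise ValueError("Eq. (4) has no usable root for these counts")
        n_hat = (-qb + math.sqrt(disc)) / (2 * qa)    # larger root (assumption)
        return n_hat - total
    if model == "II":
        star = n["110"] + n["100"] + n["010"]         # observed, not on List 3 (assumed reading)
        on_three = n["111"] + n["101"] + n["011"]     # n_{++1} - n_{001}
        return star * n["001"] / on_three             # Eq. (5)
    if model == "III":
        return n["100"] * n["010"] / n["110"]         # Eq. (6), with n_110 (assumed reading)
    if model == "IV":
        num = n["111"] * n["100"] * n["010"] * n["001"]
        return num / (n["110"] * n["101"] * n["011"])  # Eq. (7)
    raise ValueError("model must be 'I', 'II', 'III' or 'IV'")


# Hypothetical expected counts for N = 1000 with three independent lists
# and coverage 0.9, 0.7 and 0.5, so the true missing cell is 15
observed = {"111": 315, "110": 315, "101": 135, "100": 135,
            "011": 35, "010": 35, "001": 15}
estimates = {m: tse_missing_cell(observed, m) for m in ("I", "II", "III", "IV")}
```

Because the example table is exactly independent, all four models return a missing cell of 15. They would differ once dependence is present, which is the situation the simulation study in Section 3 explores.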

The above equations have focused on list dependence, and assumed that heterogeneity has been accounted for by post-stratification. However, the loglinear modelling framework makes it straightforward to consider the possibility of both list dependence and heterogeneity by including additional terms in the model.

The variance of these functions can be difficult to compute analytically, and hence the delta method is used by finding a linear approximation. Since $\hat{N}$ is a function of the sum of observed, $n$ (which is a constant), and the missing estimate, $\hat{\mu}_{000}$, the asymptotic variance of the population estimate, $\hat{V}(\hat{N})$, can be derived using the delta method. [18] derived the variance estimators for the different log-linear models, and these results are summarized below.

When the three samples are mutually independent (i.e. Model I), the variance is estimated as

$$\hat{V}(\hat{N}) = \frac{\hat{N}\, \hat{\mu}_{000}}{n_{011} + n_{101} + n_{110} + n_{111}}. \qquad (8)$$

The asymptotic variance for the model with one pair-wise interaction (i.e. Model II) is estimated as

$$\hat{V}(\hat{N}) = (\hat{\mu}_{000})^{2} \left( \frac{1}{n_{++1} - n_{001}} + \frac{1}{\hat{n}_{++0} - \hat{n}_{000}} + \frac{1}{n_{001}} + \frac{1}{\hat{\mu}_{000}} \right). \qquad (9)$$

When there is conditional independence, and assuming Model III holds, i.e. the First and Second Lists are conditionally independent given the Third List, then the asymptotic variance is estimated as

$$\hat{V}(\hat{N}) = (\hat{\mu}_{000})^{2} \left( \frac{1}{n_{100}} + \frac{1}{n_{010}} + \frac{1}{n_{110}} + \frac{n_{110}}{n_{100}\, n_{010}} \right). \qquad (10)$$

Finally for the saturated model, the asymptotic variance is estimated as

$$\hat{V}(\hat{N}) = (\hat{\mu}_{000})^{2} \left( \frac{1}{n_{111}} + \frac{1}{n_{110}} + \frac{1}{n_{101}} + \frac{1}{n_{100}} + \frac{1}{n_{011}} + \frac{1}{n_{010}} + \frac{1}{n_{001}} + \frac{1}{\hat{\mu}_{000}} \right). \qquad (11)$$

In addition, the asymptotic variance of the dual system estimate was derived by [11], also using the delta method, and is given by

$$\hat{V}(\hat{N}) = \frac{n_{+1}\, n_{1+}\, n_{01}\, n_{10}}{(n_{11})^{3}}. \qquad (12)$$

In analysing capture-recapture three-sample data the aim is to fit the incomplete 2×2×2 table by a loglinear model with the fewest possible parameters. It becomes readily apparent that the fewer the parameters in the ‘most suitable’ model for estimating $\mu_{000}$, the smaller the variance. Thus it is not ideal simply to use the ‘saturated model’ given in Model IV. On the other hand, if a model with too few parameters is used, bias may be introduced into the resulting estimate of the population size, and the risk is that the variance formulae (8)–(11) become meaningless. Essentially, the variance formulae only hold under the assumption that the correct model has been chosen.
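As a rough illustration of how these asymptotic formulae might be evaluated, the Python sketch below implements Eq. (12) for the dual system estimator and Eq. (11) for the homogeneous association model. The function names and the example counts are hypothetical, and the formulae are transcribed as stated in the text rather than re-derived.

```python
def dse_variance(n11, n10, n01):
    """Asymptotic variance of the dual system population estimate, Eq. (12)."""
    n1_plus = n11 + n10     # total on the first list
    n_plus1 = n11 + n01     # total on the second list
    return n_plus1 * n1_plus * n01 * n10 / n11 ** 3


def model_iv_variance(n):
    """Asymptotic variance under the homogeneous association model, Eq. (11).

    n maps "ijk" strings to the seven observed cell counts (no "000" key).
    """
    mu000 = (n["111"] * n["100"] * n["010"] * n["001"]
             / (n["110"] * n["101"] * n["011"]))          # Eq. (7)
    reciprocals = sum(1.0 / count for count in n.values())
    return mu000 ** 2 * (reciprocals + 1.0 / mu000)


# Hypothetical counts for illustration only
se_dse = dse_variance(n11=630, n10=270, n01=70) ** 0.5
observed = {"111": 315, "110": 315, "101": 135, "100": 135,
            "011": 35, "010": 35, "001": 15}
se_tse = model_iv_variance(observed) ** 0.5
```

Taking the square root gives the asymptotic standard error. As the text cautions, these values are only meaningful if the model behind the chosen formula is the correct one.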


3. Simulation study investigating dependence

3.1. Introduction

In dual system estimation, the independence assumption is heavily relied upon; it actually underpins the estimation of the population size. This is especially true for population censuses. Nevertheless, as mentioned above, this assumption can often fail, and it is not possible to directly estimate the level of dependence within the dual system estimation framework. In fact, to estimate the dependence there needs to be some ancillary information available, and in a population census context this can be demographic data, such as the age and sex ratios [6,8]. In fact, [8] found that in most cases when coverage in both population counts is reasonably high, the dual system estimation methodology was robust enough to cope with low levels of dependency between the first and second lists. However, an actual grasp of what constitutes ‘low’ and ‘high’ levels of dependency and ‘low’ and ‘high’ levels of coverage is something that has not been fully realised. Further, the previous work looked at just estimating the dependence parameter, and then applying this dependence adjustment to the dual system population estimate to correct for bias. An alternative way is to increase the number of lists, and this is the strict definition of triple system estimation. This has the advantage of not just looking at the dependence between the first and second list, but also of investigating whether or not the third list is related to either the first or second list, or both.

Therefore, based on some coverage rates achieved from the 2001 census in the UK, a simulation exercise was undertaken to look at the effect dependence will have on the population estimate under dual system estimation. In addition, coverage on a fictional Third List, assumed to be some kind of administrative register, was included, and different triple system estimators were used to obtain the population estimate.
These results were compared to the dual system estimates. For simplicity, it was assumed that the only dependence under consideration arises from the fact that the probability a person is counted or missed by a particular list is related to the probability that the same person is counted or missed on a different list, i.e. list dependence. Further, it is assumed that the saturated model here is the one with no three-way interaction, i.e. the homogeneous association model. This is done for two reasons. Firstly, like the assumption of independence in a dual system setting, this assumption is required in

order to be able to estimate the missing cell. Secondly, and more pertinently, by design the simulations were on the basis that the Third List was independent of the First and Second Lists, implying that the observed cells contain sufficient information for the estimation of the population total. Thus it is possible to posit more complicated models (as will be shown); however, these models would be expected to over-fit the observed data.

3.2. Simulation study

In our simulation study we were interested in finding out how estimates of the population size varied in different dependency scenarios. We chose these dependency scenarios based on practical considerations regarding the behavioural dependence that may exist. We were also primarily concerned with the failure to account for an existing association between the First and Second List. We further assumed that the Third List had been assembled independently of both the First and Second Lists. As such, there is no association between being found (or missed) in the Third List and being found (or missed) in either of the two lists. In reality the second list will be a survey, and so in practice there is an additional, fairly important, step of ‘weighting up’ the sample to the non-sampled individuals. [3] gives a general framework for the estimation of population size from samples while [9,12] show how this was done for the 2011 UK census and 2010 US census, respectively. Furthermore, the Third List will be an administrative population register, which could realistically be assumed to independently cover different population subgroups [25].

The simulation study generated a population of 1000 individuals. The size of the population was chosen so that it would be representative of the typical post-stratum size such that there is homogeneity of capture within each post-stratum.
The simulation starts by assigning each individual a probability of being counted in the First List or Census ($p_{\mathrm{cen}}$), the Second List or Survey ($p_{\mathrm{sur}}$) and the Third List ($p_{\mathrm{adm}}$). The individuals are then cross-classified into a 2×2×2 table according to their absence or presence on the three lists. Since the object is to find the individuals who fall into the (0,0,0)-cell corresponding to those missed in all three lists, attention is restricted to the incomplete table representing the individuals who are observed. Different coverage probabilities are considered, and these take values 30%, 50%, 70% and 90%.

Now, since dual system estimation makes the assumption that there is no systematic relationship between the probability of an individual being counted in the Census and the same individual being counted in the Survey, the objective is to determine how robust a method it is for estimating the population size when some dependence is introduced. It has been assumed that the population has been suitably post-stratified, so that the resulting dependence is only through the association between the Census and Survey. The dependence is represented by the odds ratio ($\gamma$) and took values in {1, 1.2, 1.4, 1.6, 1.8, 2}. An odds ratio of 2 implies that the odds of being counted in the Survey are twice as high for people who are counted in the Census as for those who are not, and this was set to be the maximum dependence in line with the general perceptions of the existing dependence [8]. We did investigate larger dependence structures, and the results are presented. Additionally, in order to investigate the performance of the population estimators at different odds ratios, the reciprocals of the dependence were considered, i.e. in {1/1.2, 1/1.4, 1/1.6, 1/1.8, 1/2}.

Therefore, for given coverage and dependency levels, a 2×2×2 contingency table is simulated on the basis of whether or not an individual is counted or missed in the Census, Survey and the Third List. Obviously, since in reality the people missed by all three lists are unknown, the $n_{000}$ cell count is discarded, and the remaining seven cells are taken to be the ‘observed’ table of counts in the simulated population. Based on the posited log-linear model, the estimate of the missing cell can be obtained. In the exercise three triple system estimators and the dual system estimator were considered. For each simulated data set, these four estimators were used to obtain the missing cell count, $n_{000}$, and the total population size, $N$. The first triple system estimator considered is the mutual independence model (TSE1) which assumes that all three lists are independent of each other.
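The simulation design described so far can be sketched in Python as below. This is a hedged illustration, not the authors' code: the function names, the use of the standard-library `random` module, the number of replications and the seed are all assumptions. The joint Census-Survey probability is obtained by solving the odds-ratio equation for given margins, which is one way of inducing the stated dependence, and the triple system estimator shown is the Model II form of Section 2.2 (Census-Survey interaction, Third List independent).

```python
import math
import random


def joint_probability(p1, p2, gamma):
    """P(on both lists) for margins p1, p2 and odds ratio gamma."""
    if gamma == 1:
        return p1 * p2
    s = 1 + (gamma - 1) * (p1 + p2)
    disc = s * s - 4 * gamma * (gamma - 1) * p1 * p2
    return (s - math.sqrt(disc)) / (2 * (gamma - 1))


def simulate(p_cen, p_sur, p_adm, gamma, size=1000, reps=500, seed=1):
    """Mean relative bias of the DSE and of the Model II triple system estimator."""
    rng = random.Random(seed)
    p11 = joint_probability(p_cen, p_sur, gamma)
    pair = {"11": p11, "10": p_cen - p11, "01": p_sur - p11,
            "00": 1 - p_cen - p_sur + p11}
    # The Third List is generated independently of the Census/Survey pair
    cells = [cs + a for cs in pair for a in "10"]
    weights = [pair[c[:2]] * (p_adm if c[2] == "1" else 1 - p_adm) for c in cells]
    bias_dse = bias_tse = 0.0
    for _ in range(reps):
        draws = rng.choices(cells, weights=weights, k=size)
        n = {c: draws.count(c) for c in cells if c != "000"}  # 000 is unobserved
        # The DSE uses the Census and Survey only (collapse over the Third List)
        n11 = n["110"] + n["111"]
        n10 = n["100"] + n["101"]
        n01 = n["010"] + n["011"]
        dse = n11 + n10 + n01 + n01 * n10 / n11
        # Model II: missing cell from the cells off and on the Third List
        star = n["110"] + n["100"] + n["010"]
        mu000 = star * n["001"] / (n["111"] + n["101"] + n["011"])
        tse = sum(n.values()) + mu000
        bias_dse += (dse / size - 1) / reps
        bias_tse += (tse / size - 1) / reps
    return bias_dse, bias_tse


bias_dse, bias_tse = simulate(p_cen=0.9, p_sur=0.7, p_adm=0.5, gamma=2.0)
```

In this set-up the dual system estimator is expected to under-estimate the population when $\gamma > 1$, while the Model II estimator matches the way the data were generated and should be close to unbiased. The empirical standard error could be obtained in the same loop by storing the individual estimates.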
It is important to see how this mutual independence model fares in comparison to the dual system estimator when there is some dependency. The second triple system estimator was the pairwise dependence model (TSE2). Here the model assumes that the Census and Survey are independent of the Third List. The third triple system model (TSE3) considered was the ‘saturated’ model, i.e. the homogeneous association model with all pairwise relationships between the Census, Survey and Third List present. Finally all three models were compared to the dual system estimator (DSE). Of interest was to determine if all the triple system estimators always outperformed the dual system estimator, regardless of the amount of dependence or the coverage probabilities. In order to assess the performance


of each of these estimators, the bias and the standard error were calculated. The process was repeated multiple times to yield the mean bias and standard error. The standard error computed this way was the ‘empirical standard error’. There was another standard error computed using the asymptotic formulae in Eqs (8)–(12). This was referred to as the ‘asymptotic standard error’.

On a cautionary note, the data has been simulated under dependence between the Census and Survey, and an assumption is made that bringing the Third List into the frame does not introduce additional dependence. This assumption seems fairly reasonable in a UK context because the only feasible individual-level administrative list under consideration in a triple system scenario is the health records list. Further, the mechanism used to collate health data is sufficiently different to that used in the Census or Survey for it to be reasonably assumed that the Third List is independent of the Census or Survey. In other words, the coverage of an individual on the health register (or Third List) does not depend on the individual’s coverage in the Census or Survey. This assumption will not strictly hold in other countries. For example, in the US, the administrative list used by [27] was put together to better count those sub-populations who were difficult to enumerate in the Census and Survey. As such, in this context there is not only dependence between the Census and Survey, but the administrative list could be related to the Census, the Survey or both. However, the log-linear modelling framework proposed here is flexible enough to include additional dependence terms, if required.

3.3. Results

As anticipated, the dual system estimator (DSE) is the most biased in all cases when there is dependence and TSE2 and TSE3 are the least biased. However, TSE3 has larger standard errors and in some cases seems to over-estimate the population size.
This is intuitive given that TSE3 is fitting the saturated model when a simpler model (with only the pairwise dependence between the Census and Survey) will suffice. It follows that any of the conditional independence models (i.e. the model with pairwise dependence terms between the Census and Survey and Census and List or the one with Census and Survey and Survey and List terms) may be unbiased, but will suffer from poorer precision when compared to the simpler model. Given that TSE2 and TSE3 are virtually unbiased, by definition, the evaluation of the simulation exercise will be concentrating on TSE1 and how it performs


B. Baffour et al. / An investigation of triple system estimators in censuses

Fig. 1. Performance of TSE1 for varying levels of Third List coverage.

Fig. 2. Performance of TSE1 for varying levels of Third List coverage.

for different levels of coverage and dependence. This is because it is imagined that the introduction of the Third List will improve the population estimates, but it is difficult to quantify how beneficial the Third List actually is. It becomes clear, however, that when there is high enough coverage on the Census and Survey, the

Third List does not improve on the DSE a great deal, as shown in Figs 1 and 2. Figures 1 and 2 show how different coverage levels on the administrative list affect the simple triple system estimator that assumes independence between all three lists, TSE1. It is obvious that TSE1 is expected


to be biased when there is some simulated dependence between the Census and Survey. This bias is negative when the dependence γ > 1 and positive when γ < 1. In other words, when the bias is negative a person who is missed by the Census is more likely to be missed by the Survey. On the other hand, when the bias is positive a person who is missed by the Census is more likely to be counted by the Survey. It is difficult to say which of the two is more likely to happen. Nevertheless, Figs 1 and 2 show that this bias is relatively small (all the estimators have biases smaller in magnitude than 2.5%) when coverage is suitably high for the chosen range, {1/2 < γ < 2}. On both graphs the bias under the DSE is plotted as well to give some idea as to the rewards of using the Third List. As intuitively expected, when there is independence all the estimators are unbiased, but if the simulated dependence is 2 then the DSE will under-estimate the size of the missing population by a factor of 2. This follows because the estimate of the missing cell under the DSE is given by $\hat{n}_{00} = n_{01} n_{10} / n_{11}$, when it should actually be $\hat{n}_{00} = \gamma\, n_{01} n_{10} / n_{11}$. Moreover, Fig. 1 shows that the benefits of an administrative list when the Census achieves a population coverage of 90% while the Survey achieves 70% coverage (which is roughly what was achieved in the 2001 UK census) are minimal, in that the DSE in the most extreme case of dependence is actually relatively unbiased, with an absolute bias of 2.2%. However, the benefits of triple system estimation in the presence of dependence become clear in Fig. 2, which shows a substantial reduction in the bias, even when the Third List only covers 30% of the population. That said, an administrative list with poor coverage becomes less useful once it is recognised that, as is often the case in administrative records, there are erroneous enumerations.
So the advantages of bringing in an administrative list that achieves poor population coverage are outweighed by the disadvantages due to the requirement to remove erroneous enumerations from the population estimate. For Fig. 1 it is assumed that the coverage probability in the Census is 0.9 and the Survey coverage probability is 0.7. Here given γ = 2 the relative bias for the simulations when the Third List coverage probabilities are 0.3, 0.5, 0.7 and 0.9 are found to be −1.35%, −0.86%, −0.46% and −0.11%. By comparison the DSE has a relative bias of −2.17%. For Fig. 2 on the other hand, the Census and Survey coverage probabilities are both taken to be 0.5, and the relative bias when the simulated dependence is 2 for administrative list coverage


levels of 0.3, 0.5, 0.7 and 0.9 are −7.02%, −4.07%, −2.14% and −0.57%. Apart from when the Third List has a ‘poor’ population coverage of 0.3, the absolute relative bias in all remaining cases is beneath 5%. This compares favourably to the relative bias for the DSE of −14.62%. The presence of the administrative list does improve on the population estimate; moreover, this improvement is particularly marked when the Census and Survey fail to achieve reasonable population coverage. When there is 50% coverage in the Census and Survey, the DSE bias is −14.62% for γ = 2 and 20.93% for γ = 1/2. However, the bias for an administrative list with coverage of 30% is −7.02% and 8.00%, respectively, which is roughly a one-half reduction in bias when γ = 2 and nearly a two-thirds reduction for γ = 1/2. Furthermore, increasing the administrative list coverage to 50% leads to a bias of −4.07% and 4.39% (for γ = 2 and 1/2), which is almost a 50% improvement on the TSE bias at 30% coverage and roughly 70% to 80% on the DSE bias.

Another observation from the simulation results concerns the symmetry. One of the reasons behind choosing reciprocals was to look at the behaviour about dependence of 1 (i.e. independence), as there is no bias when γ = 1 but the bias is positive for γ in (0, 1) and negative for γ in (1, ∞). The relative bias for γ = 1/2 in Fig. 1 when the Third List coverage was 0.3, 0.5, 0.7 and 0.9 was respectively 1.10%, 0.75%, 0.35% and 0.11%. The DSE bias was 1.79%. On the other hand, the relative bias for dependence, γ, of 2 is −1.35%, −0.86%, −0.46% and −0.11%, with a DSE bias of −2.17%. This shows that there is approximate symmetry for high coverage probabilities, but this symmetry diminishes as the coverage probability drops. This asymmetry is more evident in the dual system estimator, as shown in Fig.
2 where the relative biases under TSE1 at γ = 2 and γ = 1/2 are −0.59% and 0.59% when the administrative List coverage is 0.9; −2.14% and 2.21% when the List coverage is 0.7; −4.04% and 4.39% when the List coverage is 0.5; and −7.02% and 8.00% when the List coverage is 0.3. For the DSE the relative bias is 20.93% when the dependence is 1/2 compared to −14.62% when the dependence is 2. The lack of symmetry is more apparent when looking at higher levels of dependence. Table 1 shows the results of the relative biases when there is a simulated dependence between the Census and Survey of 1/8 and 8, and some asymptotic properties of the bias of the DSE are presented below. Trivially, it may be observed that the TSE biases are bounded by the DSE. The asymptotic behaviour of the dual system estimator can also be investigated to give some indication


of how the triple system estimators behave since Figs 1 and 2 show that the triple system estimator biases lie within the dual system estimator bias. This is reasonable in view of the fact that the dual system estimator is broadly not as efficient as the triple system estimators. Given that the triple system estimators are complicated functions of γ, it is not easy to ascertain how the different TSEs change with varying dependencies and coverage probabilities. However, simple expressions can be found for the DSE, at varying levels of dependence and coverage. Furthermore, since it has been shown that the DSE bounds the TSEs, obtaining expressions of how the DSE behaves as γ tends to zero and infinity does provide some information as to the asymptotic behaviour of the TSEs.

Table 1
Relative bias at simulated dependence of 1/8 and 8

Dependence                  γ = 1/8       γ = 8
pcen = 0.5, psur = 0.5
  padm = 0.3                25.440%     −17.387%
  padm = 0.5                13.217%     −10.997%
  padm = 0.7                 6.287%      −5.760%
  padm = 0.9                 1.735%      −1.691%
  DSE                       91.718%     −32.233%
pcen = 0.9, psur = 0.7
  padm = 0.3                 2.366%      −3.981%
  padm = 0.5                 1.513%      −2.580%
  padm = 0.7                 0.811%      −1.484%
  padm = 0.9                 0.214%      −0.409%
  DSE                        3.976%      −6.389%

Firstly, define the Census and Survey coverage probabilities to be π1+ and π+1, respectively. Then the dependence γ between the two lists, given by $\gamma = n_{11} n_{00} / (n_{10} n_{01})$, can be re-expressed as

$$\gamma = \frac{\pi_{11}\,\bigl(1 - (\pi_{1+} + \pi_{+1} - \pi_{11})\bigr)}{(\pi_{1+} - \pi_{11})\,(\pi_{+1} - \pi_{11})}. \tag{13}$$

Now, Eq. (13) can be written in terms of a quadratic function of π11,

$$\pi_{11}^{2}\,(1 - \gamma) + \pi_{11}\,\bigl(1 - \pi_{+1} - \pi_{1+} + \gamma\,(\pi_{+1} + \pi_{1+})\bigr) - \gamma\,\pi_{1+}\pi_{+1} = 0.$$

Therefore, supposing the coverage probabilities π1+ and π+1 and dependence, γ, are known,

$$\pi_{11} = \frac{-\bigl(1 - \pi_{+1} - \pi_{1+} + \gamma\,(\pi_{+1} + \pi_{1+})\bigr) \pm \sqrt{\bigl(1 - \pi_{+1} - \pi_{1+} + \gamma\,(\pi_{+1} + \pi_{1+})\bigr)^{2} + 4\,(1 - \gamma)\,\gamma\,\pi_{1+}\pi_{+1}}}{2\,(1 - \gamma)}. \tag{14}$$

After obtaining the value of π11, the rest of the probabilities in the contingency table, Table 2, can be found since the marginal probabilities, π1+ and π+1, are given.

Table 2
Probabilities from 2×2 contingency table

π11        π10          π1+
π01        π00          (1−π1+)
π+1        (1−π+1)      1

In view of the fact that π11 can be expressed as a function of the dependence, γ, it becomes possible to ascertain the limiting behaviour of the relative bias as γ tends to zero and infinity. This allows us to look at how the various dual and triple system estimators behave as γ tends to its limits. Now, for the case when the Census (π1+) and Survey (π+1) coverage are respectively 90% and 70%, Eq. (14) simplifies to

$$\pi_{11} = \frac{0.6 - 1.6\gamma \pm \sqrt{(1.6\gamma - 0.6)^{2} + 4\,(0.63\gamma)\,(1 - \gamma)}}{2\,(1 - \gamma)}. \tag{15}$$

Similarly when the Census and Survey coverage probabilities are both 50%, then

$$\pi_{11} = \frac{-\gamma \pm \sqrt{\gamma^{2} + 4\,(0.25\gamma)\,(1 - \gamma)}}{2\,(1 - \gamma)}. \tag{16}$$

A. What happens to π11 as γ tends to zero?

As the dependence between the Census and Survey becomes smaller,

$$\lim_{\gamma \to 0} \pi_{11} = 0.6 \ \text{ for } \pi_{1+} = 0.9 \text{ and } \pi_{+1} = 0.7, \quad \text{or } = 0 \ \text{ for } \pi_{1+} = 0.5 \text{ and } \pi_{+1} = 0.5.$$

Generally, for any set of marginal probabilities {π1+, π+1}, as γ tends to zero the algebraic limit of (14) simplifies to

$$\lim_{\gamma \to 0} \pi_{11} = \frac{(1 - \pi_{1+} - \pi_{+1}) \pm (1 - \pi_{1+} - \pi_{+1})}{-2} = \max\{0,\ (\pi_{1+} + \pi_{+1} - 1)\}$$

since $\pi_{11} \ge \max\{0,\ (\pi_{1+} + \pi_{+1} - 1)\}$.

B. What happens to π11 as γ tends to infinity?

As the dependence increases, it can be shown that

$$\lim_{\gamma \to \infty} \pi_{11} = 0.7 \ \text{ for } \pi_{1+} = 0.9 \text{ and } \pi_{+1} = 0.7, \quad \text{or } = 0.5 \ \text{ for } \pi_{1+} = 0.5 \text{ and } \pi_{+1} = 0.5.$$

In general, for any given marginal probabilities π1+ and π+1, as γ tends to infinity the expression (14) simplifies to

$$\lim_{\gamma \to \infty} \pi_{11} = \frac{(\pi_{1+} + \pi_{+1}) \pm (\pi_{1+} - \pi_{+1})}{2} = \min\{\pi_{1+},\ \pi_{+1}\}$$

since $\pi_{11} \le \min\{\pi_{1+},\ \pi_{+1}\}$.

The preceding discussion has investigated the bias of the different population estimators. However, the bias is concerned with how accurate the estimator is in measuring the quantity of interest, and is just one measure of an estimator’s performance. The variance is another measure which looks at how precise this estimator is. Clearly, an estimator may be precise but inaccurate and vice versa. Thus to determine which of the estimators was the best a measure of consistency was used. The consistency of an estimator is a function of its accuracy (i.e. bias) and precision (i.e. variance), and it is normally expressed in terms of the mean squared error. In essence, the mean squared error rewards small biases but penalises larger standard errors. It is therefore a useful tool in comparing the performance of the dual and triple system estimators. In an ideal world, the best estimator will have the lowest bias and the lowest variance. The simulation exercise showed that, on the one hand, though TSE3 is unbiased it comes with large standard errors, while on the other hand TSE1 has some bias, but the standard errors may be small for some cases. So the objective is to compare which of these estimators performs the best, under different scenarios.

Figures 3 and 4 compare the mean squared error for the different triple system estimators, under the scenarios detailed above when the Census and Survey have coverage probabilities of 0.9 and 0.7, and 0.5 and 0.5. The first thing of note is that the mean squared error for TSE3 is larger than the respective mean squared errors for TSE1 and TSE2, which supports the assertion that TSE3 is an inefficient estimator. Although TSE3, like TSE2, is relatively unbiased, the associated large variance of the estimator has the effect of inducing a high mean squared error, in comparison with the biased but low variability TSE1.
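The relationship between γ and π11 in Eqs (13) and (14) can be checked numerically. The sketch below is illustrative only and is not the authors' code: the function names `pi11_from_gamma` and `dse_asymptotics` are hypothetical, the admissible root is selected using the bounds max{0, π1+ + π+1 − 1} ≤ π11 ≤ min{π1+, π+1}, and the DSE quantities are large-sample (expected count) values rather than simulation averages.

```python
import math


def pi11_from_gamma(p_cen, p_sur, gamma):
    """Return the admissible root of the quadratic behind Eq. (14)."""
    if math.isclose(gamma, 1.0):
        return p_cen * p_sur  # independence: the quadratic degenerates
    a = 1.0 - gamma
    b = 1.0 - p_cen - p_sur + gamma * (p_cen + p_sur)
    c = -gamma * p_cen * p_sur
    root_term = math.sqrt(b * b - 4.0 * a * c)
    lower = max(0.0, p_cen + p_sur - 1.0)
    upper = min(p_cen, p_sur)
    for root in ((-b + root_term) / (2.0 * a), (-b - root_term) / (2.0 * a)):
        if lower - 1e-12 <= root <= upper + 1e-12:
            return root
    raise ValueError("no admissible root for these inputs")


def dse_asymptotics(p_cen, p_sur, gamma, population=1000):
    """Expected-count view of the DSE under Census-Survey dependence gamma."""
    p11 = pi11_from_gamma(p_cen, p_sur, gamma)
    p10, p01 = p_cen - p11, p_sur - p11
    p00 = 1.0 - p_cen - p_sur + p11
    missed_true = population * p00
    # DSE estimate of the missing cell, n01 * n10 / n11, at expected counts.
    missed_dse = population * p01 * p10 / p11
    n_hat = population * (1.0 - p00) + missed_dse
    return {
        "pi11": p11,
        "gamma_check": p11 * p00 / (p10 * p01),
        "missed_true": missed_true,
        "missed_dse": missed_dse,
        "rel_bias": n_hat / population - 1.0,
    }


high = dse_asymptotics(0.9, 0.7, 2.0)
low = dse_asymptotics(0.5, 0.5, 2.0)
```

With γ = 2 and a population of 1000, `high["missed_true"]` rounds to 44 and `low["missed_true"]` rounds to 293, and in both cases the DSE missing-cell estimate is half the true value. The large-sample relative biases are about −2.2% and −14.6%, close to the simulated DSE values of −2.17% and −14.62% reported earlier.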
When the coverage in the Census, Survey and Third List are high then there is very little to distinguish between the three estimators, in terms of their mean squared error. It can also be noticed that as the Third List coverage probability increases there seems to be very little difference between the mean squared error plots for TSE1 and TSE2. Additionally, as the Census and Survey coverage get higher, it appears that TSE3 becomes progressively less efficient when compared to TSE1 and TSE2.

In both figures the dual system performance is included, and from Fig. 3 it can be seen that when the Census and the Survey respectively cover 90% and 70% of the population but the Third List is poor (at 30%), then the DSE copes well with failures of the independence assumption. Here the DSE, although slightly biased, has a lower RMSE than the two unbiased TSEs. Indeed, for 1/1.2 < γ < 1.2 the DSE is a better estimator, and for 1/1.4 < γ < 1.4 the biased triple system estimator, TSE1, is the most efficient. Even when the Census and the Survey do not achieve a decent coverage of the population (i.e. both have 50% coverage) and the Third List achieves 30%, TSE1 is the most efficient estimator of the population for 1/1.2 < γ < 1.2. From both figures, it can be inferred that only at lower than 50% coverage of the Third List is the DSE less efficient than all the TSEs at all levels of dependence. Indeed, the bottom right plots in both figures, representing an administrative list coverage of 90%, show that the mean squared error of all the TSEs is much lower than that of the DSE.

There is some degree of reasonableness to these results. For the case when both the Census and Survey achieve high coverage of the population, the administrative list does not have that many people to find: for a population of 1000 people, with 90% Census coverage and 70% Survey coverage, even with dependence of 2, there are only 44 people to find, whereas with 50% coverage there are now 293 people missing. Furthermore, what the mean squared error plots are crudely saying is that the simplest triple system estimator (i.e. TSE1) performs reasonably well as it has the lowest root mean squared error for all dependency levels: although TSE1 is biased, its variance is comparatively smaller than that of the other, less biased estimators.
Further, even though the best model is TSE2, TSE1 is consistent enough as an estimator of the population size to merit consideration. There could be an argument to always fit TSE3 to the data since it has an easy closed-form expression, i.e. $\hat{n}_{000} = \dfrac{n_{111}\, n_{100}\, n_{010}\, n_{001}}{n_{110}\, n_{101}\, n_{011}}$. However, the above simulation exercise has shown that even when there is relatively high dependence between the Census and Survey, doing this is not the most efficient way of determining the missing cell. Finally, it was mentioned that the mean squared error is a function of the bias and variance, and though the bias can be found fairly easily, the estimation of the variance is complicated. Therefore, it was shown



Fig. 3. Comparison of the root mean square error for different triple system estimators and the dual system estimator (for pcen =0.9 and psur =0.7).

Fig. 4. Comparison of the root mean square error for different triple system estimators and the dual system estimator (for pcen =0.5 and psur =0.5).

in Section 2.2 that asymptotic variance expressions exist for all the triple system estimators found using the delta method. The delta method, however, assumes that the multinomial counts nijk are asymptotically normal. In capture-recapture population estimation, there has been some concern raised as to the reasonableness of the assumption of normality since there is a relatively flat likelihood surface which in turn leads

to some positive skewness [13]. Therefore, the simulation exercise compared the asymptotic standard errors to the empirical standard errors for the triple system estimators. The results appear in Fig. 5. The results demonstrate that the asymptotic and empirical standard errors are similar for all the three estimators. This goes to show that, in the simulations, the assumption of normality is realistic. Put differently, although



Fig. 5. Comparison of the empirical and asymptotic standard errors of different triple system estimators.

the normal approximation may have some limitations, it was found that for the simulation exercise, this approximation did not impact negatively on the calculation of the variance. In addition, using the results in Fig. 5 in conjunction with the mean squared error plots in Figs 3 and 4, we can begin to understand why TSE1 appears to fare better than the others. Despite TSE1 being relatively more biased, its variance is sufficiently smaller than that of both TSE2 and TSE3, which is why TSE1 has the smallest mean squared error in most cases. Indeed, even when the Third List has poor coverage (i.e. 30%), the standard errors of the (biased) TSE1 are much lower than those of the unbiased TSE2 and TSE3. This implies that, on average, the estimates of the population under TSE1 are closer to the true population total than the estimates found under TSE2 or TSE3, even when there is dependence. Hence, the mutual independence estimator, TSE1, appears to be the most efficient. A likely explanation is that TSE2 and TSE3, albeit unbiased, are susceptible to high variability.
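The trade-off between bias and variability described above can be illustrated with a small Monte Carlo sketch. It is not the authors' simulation code: the function names, the population size of 1000, the number of replicates and the seed are all assumptions made for illustration. It draws individuals into the 2×2×2 table with Census-Survey dependence γ and an independent Third List, then compares the DSE with the closed-form saturated estimator TSE3. TSE1 and TSE2 are omitted because their estimators are not restated in this section.

```python
import math
import random
from collections import Counter


def census_survey_probs(p_cen, p_sur, gamma):
    """Joint Census x Survey probabilities with odds ratio gamma (Eq. (14))."""
    if math.isclose(gamma, 1.0):
        p11 = p_cen * p_sur
    else:
        a = 1.0 - gamma
        b = 1.0 - p_cen - p_sur + gamma * (p_cen + p_sur)
        disc = math.sqrt(b * b + 4.0 * a * gamma * p_cen * p_sur)
        lo, hi = max(0.0, p_cen + p_sur - 1.0), min(p_cen, p_sur)
        p11 = next(r for r in ((-b + disc) / (2 * a), (-b - disc) / (2 * a))
                   if lo - 1e-12 <= r <= hi + 1e-12)
    return {(1, 1): p11, (1, 0): p_cen - p11,
            (0, 1): p_sur - p11, (0, 0): 1.0 - p_cen - p_sur + p11}


def simulate(p_cen, p_sur, p_adm, gamma, population=1000, reps=500, seed=1):
    """Empirical relative bias, SE and RMSE of the DSE and TSE3."""
    rng = random.Random(seed)
    cs = census_survey_probs(p_cen, p_sur, gamma)
    # Cell index order is (census, survey, list); the list is independent.
    cells = [(c, s, l) for c in (0, 1) for s in (0, 1) for l in (0, 1)]
    weights = [cs[(c, s)] * (p_adm if l else 1.0 - p_adm) for c, s, l in cells]
    estimates = {"DSE": [], "TSE3": []}
    for _ in range(reps):
        n = Counter(rng.choices(cells, weights=weights, k=population))
        n11 = n[1, 1, 0] + n[1, 1, 1]
        n10 = n[1, 0, 0] + n[1, 0, 1]
        n01 = n[0, 1, 0] + n[0, 1, 1]
        denom = n[1, 1, 0] * n[1, 0, 1] * n[0, 1, 1]
        if n11 == 0 or denom == 0:
            continue  # estimator undefined for this replicate
        # DSE uses the Census and Survey only.
        estimates["DSE"].append((n11 + n10) * (n11 + n01) / n11)
        # TSE3: observed count plus the closed-form missing-cell estimate.
        observed = population - n[0, 0, 0]
        n000_hat = n[1, 1, 1] * n[1, 0, 0] * n[0, 1, 0] * n[0, 0, 1] / denom
        estimates["TSE3"].append(observed + n000_hat)
    summary = {}
    for name, values in estimates.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        bias = mean - population
        summary[name] = {"rel_bias": bias / population,
                         "rel_se": math.sqrt(var) / population,
                         "rel_rmse": math.sqrt(bias ** 2 + var) / population}
    return summary


result = simulate(0.5, 0.5, 0.5, gamma=2.0, reps=300)
```

In this configuration (all three coverages at 50%, γ = 2) the DSE should show a relative bias in the region of −15% with a small standard error, while TSE3 should be close to unbiased but noticeably more variable. Exact figures depend on the seed and the number of replicates and will not reproduce the paper's plots.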

4. Conclusion

It is true that independence, be it in dual system estimation or triple system estimation, is unlikely to hold in an actual census environment. This is because there is definite evidence of dependence between the Census and Survey (albeit there is less of a dependence between these two and the Third List). Nonetheless, the independence model, which assumes no relationship between the different counts of the population, does have some benefits. The single most important reason for its choice is that of model parsimony: the independence model is simple, and in most cases does approximate the true cell probabilities well, especially when the coverage probabilities are high. The mean squared error of the independence model is lower than that of the partial dependence and ‘saturated’ models, for all the simulations considered. This is because, although the independence model is the most biased, it also has the smallest variance as it is based on estimating fewer parameters. In essence, the mean squared error is smaller because the bias does not dominate the variance. In the same vein, although the ‘saturated’ model gives unbiased estimates of the population, it has been shown to be inefficient when the data has been simulated under the Census and Survey pairwise dependence model.

The motivation of the simulation work was the desire to have some idea as to what constitutes ‘low’ and ‘high’ levels of coverage and to consider if there is the need for triple system estimation. The simulations considered permutations of four probabilities of 0.3, 0.5, 0.7 and 0.9 for the Census, Survey and Third List coverage. It was found that if the Census manages to enumerate roughly 90% of the population and the Survey achieves 70% then the dual system estimate is fairly unbiased, and bringing a Third List into the frame does not significantly improve upon the population estimates. However, it has been shown that in the presence of dependence, and for a Census or Survey that only counts 50% of the population, there are definite advantages of using data from an administrative source to obtain population size estimates that have been adjusted for underenumeration. Further, the simple triple system estimator that assumes independence between the Census, Survey and Third List is found to be very efficient, even in the presence of some dependence.

5. Discussion

The census has to produce accurate and reliable estimates of the population. There is evidence to suggest that population estimates derived under dual system estimation are susceptible to bias due to dependence. This dependence could be adjusted for post hoc, as done in the UK census [8] or the US census [23]. However, this paper has described how dependence could directly be taken into account through triple system estimation. Triple system estimation is reliant on the availability of an administrative list that encompasses the full spectrum of the population. Another important assumption is that there are no erroneous enumerations. The assumption of no erroneous enumerations in dual system estimation is fairly plausible because of the matching and classification procedures in place to ensure that each individual has one, and only one, record. However, under triple system estimation this is not necessarily true, mainly because the third, administrative, system is particularly prone to erroneous enumerations (through duplication or invalidation of records). Nonetheless, this simplifying assumption allows closed-form maximum likelihood estimates to be found.

In general, the coverage error in a population census includes both missing enumerations and erroneous enumerations, although the main focus has been on underenumeration. In fact, the quality of the coverage assessment relies heavily on the ability of the post-enumeration survey to accurately classify the erroneous enumerations. In the US, the post-enumeration survey has a specific component that measures erroneous enumerations, and adjusts the final population estimates accordingly [23]. However, in previous UK censuses, the number of erroneous enumerations has been negligible, in comparison with the underenumeration [10]. When there are erroneous enumerations, it is still possible to determine the population size estimates, adjusted for both underenumeration and overenumeration. This is accomplished through latent class analysis, where the belief is that an individual’s true enumeration status is unobserved, or latent. What is observed is whether they appear on the three systems, i.e. the Census, Survey or Third List. The basic premise of latent class analysis in application to census enumeration is that the observed covariation between the Census, Survey and Third List is actually better represented by each of the Census, Survey and Third List’s relationship with an unobserved (latent) variable. In our application the belief is that the latent variable characterizes the unobserved heterogeneity and this is attributable to capture error. Note that this unobserved heterogeneity is there even after poststratification. This latent class model can be parameterized as a log-linear model. A similar approach has been suggested by [7].

It is clear that the general theory on capture-recapture is fairly extensive, but the loglinear modelling framework provides a convenient representation for the population estimation problem, particularly for the case where there are three or more lists. The paper has illustrated how loglinear models can allow for dependence among the different lists, underenumeration and overenumeration, as well as account for heterogeneity in the population. The work undertaken here can also be extended to general capture-recapture settings, such as in epidemiology where dual lists are known to be dependent. There is tremendous value attached to the census, since there is often no other data resource collected about a country’s population in such great detail. However, in order to know the accuracy of the census information, many countries use post-enumeration surveys, which rely on independence.
Further, this independence assumption is typically untestable, and can lead to biased census results. Administrative records have been identified as the future of censuses [20]. Therefore, the value of this simulation exercise is to demonstrably show how best to use information from a third, administrative records, list in census coverage. It has also identified the requisite properties of a good triple system estimator in coping with dependence.

References

[1] A. Agresti, Simple capture-recapture models permitting unequal catchability and variable sampling effort, Biometrics 50(2) (1994), 494–500.
[2] J.M. Alho, Logistic regression in capture-recapture models, Biometrics 46(3) (1990), 623–635.
[3] J.M. Alho, Analysis of Sampled Based Capture-Recapture Experiments, Journal of Official Statistics 10 (1994), 245–245.
[4] J.M. Alho, M.H. Mulry, K. Wurdeman and J. Kim, Estimating heterogeneity in the probabilities of enumeration for dual-system estimation, Journal of the American Statistical Association 88(423) (1993), 1–130.
[5] M.J. Anderson and S.E. Fienberg, Who counts? The politics of census-taking in contemporary America, Second Edition, Russell Sage, New York, 2001.
[6] W.R. Bell, Using Information from Demographic Analysis in Post-Enumeration Survey Estimation, Journal of the American Statistical Association 88(423) (1993), 1106–1118.
[7] P.P. Biemer, H. Woltmann, D. Raglin and J. Hill, Enumeration Accuracy in a Population Census: An Evaluation Using Latent Class Analysis, Journal of Official Statistics 17(1) (2001), 129–148.
[8] J. Brown, O. Abbott and I. Diamond, Dependence in the 2001 one-number census project, Journal of the Royal Statistical Society: Series A (Statistics in Society) 169(4) (2006), 883–902.
[9] J. Brown, O. Abbott and P.A. Smith, Design of the 2001 and 2011 Census coverage survey for England and Wales, 174(4) (2011), 881–906.
[10] J.J. Brown, I.D. Diamond, R.L. Chambers, L.J. Buckner and A.D. Teague, A methodological strategy for a one-number census in the UK, Journal of the Royal Statistical Society: Series A (Statistics in Society) 162(2) (1999), 247–267.
[11] C. Chandrasekar and W.E. Deming, On a method of estimating birth and death rates and the extent of registration, Journal of the American Statistical Association 44(245) (1949), 101–115.
[12] S.X. Chen, C.Y. Tang and V.T. Mule, Jr., Local poststratification in dual system accuracy and coverage evaluation for the US census, Journal of the American Statistical Association 105(489) (2010), 105–119.
[13] B.A. Coull and A. Agresti, The Use of Mixed Logit Models to Reflect Heterogeneity in Capture-Recapture Studies, Biometrics 55(1) (1999), 294–301.
[14] A.D. da Silva, 2010 Brazilian Census Post Enumeration Survey, in: Proceedings of the 58th World Congress of the International Statistical Institute (ISI), Dublin, Ireland, 2011, pp. 27–31.
[15] J.N. Darroch, S.E. Fienberg, G.F.V. Glonek and B.W. Junker, A three-sample multiple-recapture approach to census population estimation with heterogeneous catchability, Journal of the American Statistical Association 88(423) (1993), 1137–1148.
[16] K. Dugmore, P. Furness, B. Leventhal and C. Moy, Beyond the 2011 Census in the United Kingdom – with an international perspective, International Journal of Market Research 53(5) (2005), 619–650.
[17] J.M. Durr, The French new rolling census, Statistical Journal of the United Nations Economic Commission for Europe 22(1) (2005), 3–12.
[18] S.E. Fienberg, The multiple recapture census for closed populations and incomplete 2^k contingency tables, Biometrika 59(3) (1972), 591–603.
[19] S.E. Fienberg, Bibliography on capture-recapture modelling with application to census undercount adjustment, Survey Methodology 18(1) (1992), 143–154.
[20] S.E. Fienberg and D. Manrique-Vallier, Integrated methodology for multiple systems estimation and record linkage using a missing data formulation, Advances in Statistical Analysis 93(1) (2009), 49–60.
[21] C.T. Isaki and L.K. Schultz, Dual system estimation using demographic analysis data, Journal of Official Statistics 2(2) (1986), 169–179.
[22] J.B. Kadane, M.M. Meyer and J.W. Tukey, Yule’s association paradox and ignored stratum heterogeneity in capture-recapture studies, Journal of the American Statistical Association 94(447) (1999), 855–859.
[23] M.H. Mulry, Summary of accuracy and coverage evaluation for the US Census 2000, Journal of Official Statistics 23(3) (2007), 345–370.
[24] G. Rasch, Probabilistic models for some intelligence and attainment tests (Expanded edition), Copenhagen: Denmark’s Paedagogiske Institut (1960).
[25] E. Stuart and A.M. Zaslavsky, Using administrative records to predict census day residency, Case Studies in Bayesian Statistics 6 (2002), 335–349.
[26] United Nations Statistics Division (UNSD), 2010 World Population and Housing Census Programme. Department of Economic and Social Affairs, 2012, http://unstats.un.org/unsd/demographic/sources/census/2010_PHC/default.htm.
[27] G.S. Wolfgang, Using administrative lists to supplement coverage in hard-to-count areas of the post-enumeration survey for the 1988 Census of St Louis, in: Proceedings of the American Statistical Association, Section on Survey Research Methods, 1989.
[28] K.M. Wolter, Capture-recapture estimation in the presence of a known sex ratio, Biometrics 46(1) (1990), 157–162.
[29] A.M. Zaslavsky, Combining census, dual-system, and evaluation study data to estimate population shares, Journal of the American Statistical Association 88(423) (1993), 1092–1105.

Appendix – Proof of homogeneity under one list

Under dual system estimation, the assumption is that there is independence between the first and second lists. This implies that the probability of being in the (i, j)th cell, πij, is the product of the marginal probabilities πi+ and π+j, where $\pi_{i+} = \sum_j \pi_{ij}$ and $\pi_{+j} = \sum_i \pi_{ij}$. In addition to assuming independence across individuals, dual system estimation also assumes independence within individuals. So let π11l be the probability of an individual, l, being observed in both the first and second samples, π1+l the probability of individual l being in the first sample and π+1l the probability of being in the second sample, respectively. Suppose we assume independence between the two samples after summing across all individuals, l; then $\sum_l \pi_{11l} = \sum_l \pi_{1+l}\,\pi_{+1l}$ since πij = πi+ π+j under independence. Under homogeneity of capture, we want the probability of capture on both lists

$$\pi_{11} = \sum_l \pi_{11l} = \sum_l \pi_{1+l}\,\pi_{+1l} = \sum_l \pi_{1+l} \sum_l \pi_{+1l}.$$

In words, taking the product of an individual l’s probability of being in the first list by the probability of the same individual being found on the second list and then summing across all individuals is the same as taking sums across individuals in the first list and sums of individuals on the second list, then taking the product.

We want to show that homogeneity needs to hold on only one list, and that heterogeneity of capture on the other list does not invalidate the assumption of independence. Suppose that there is heterogeneity of capture on the first list, such that the capture probabilities are not identical, but on the second list there is homogeneity of capture. That is,

for the first list: $\pi_{1+l} \ne \pi_{1+r}$ for individuals l and r,
for the second list: $\pi_{+1l} = \pi_{+1r}$ for individuals l and r.

Thus, the probability of being counted in the first list is a function dependent on the individual, so π1+l = f(l; π), and

$$\sum_l \pi_{1+l}\,\pi_{+1l} = \sum_l f(l;\pi)\,\pi_{+1l} = \pi_{+1} \sum_l f(l;\pi).$$

On the other hand, since individuals on the second list are homogeneously captured, then π+1l = π+1 and

$$\sum_l \pi_{1+l} \sum_l \pi_{+1l} = \sum_l f(l;\pi)\,\pi_{+1l} = \pi_{+1} \sum_l f(l;\pi).$$

Therefore it follows that

$$\sum_l \pi_{1+l}\,\pi_{+1l} = \sum_l \pi_{1+l} \sum_l \pi_{+1l} = \pi_{+1} \sum_l f(l;\pi),$$

and capture probabilities are only required to be homogeneous across one list.
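The appendix argument can be illustrated numerically. The sketch below is an assumption-laden illustration rather than part of the original proof: it reads the sums over individuals as averages (so that both sides are on the probability scale), and the individual capture probabilities are hypothetical values drawn at random. The helper `independence_gap` is a made-up name for the difference between the average joint capture probability and the product of the average marginal probabilities.

```python
import random


def mean(values):
    return sum(values) / len(values)


def independence_gap(p_first, p_second):
    """Average joint capture probability minus the product of the averages."""
    joint = mean([a * b for a, b in zip(p_first, p_second)])
    return joint - mean(p_first) * mean(p_second)


rng = random.Random(0)
# Hypothetical heterogeneous capture probabilities on the first list.
p_first = [rng.uniform(0.3, 0.95) for _ in range(1000)]
# Homogeneous second list: every individual has the same probability.
p_second_homogeneous = [0.7] * len(p_first)
# Heterogeneous second list, positively related to the first (illustrative).
p_second_heterogeneous = [0.2 + 0.8 * p for p in p_first]

gap_homogeneous = independence_gap(p_first, p_second_homogeneous)
gap_heterogeneous = independence_gap(p_first, p_second_heterogeneous)
```

Under these assumptions `gap_homogeneous` is zero up to floating-point error, so heterogeneity on the first list alone leaves the independence relation intact. `gap_heterogeneous` is positive, showing how correlated heterogeneity on both lists induces apparent dependence.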
