Methodological Challenges in the Development of Statistics Canada’s New Integrated Business Statistics Program Claude Turmelle1, Serge Godbout1, Keven Bosa1
Statistics Canada, 100 Tunney’s Pasture Driveway, Ottawa, Ontario, K1A 0T6
Abstract

In 2009, Statistics Canada launched the Corporate Business Architecture (CBA) review initiative. Growing financial pressures prompted this review of Statistics Canada's business model and systems, aimed at achieving further efficiencies, enhancing quality and improving responsiveness in the delivery of new statistical programs. To meet the objectives of the CBA, Statistics Canada is currently undertaking a major integration project for its business statistics surveys: the Integrated Business Statistics Program (IBSP). The IBSP will provide a common survey framework for the various business surveys conducted at Statistics Canada. By 2016, over 100 surveys will be integrated into this new harmonized framework. The surveys under the umbrella of the IBSP will use a common frame; adopt common sampling, collection and processing methodologies driven by a common metadata framework; and share common tools to analyse, edit and correct data. The development of this new harmonized framework brings many methodological challenges at every survey step. This paper discusses some of these challenges along with some of the proposed solutions.
Key Words: Business surveys, harmonized framework, sampling design, adaptive design
1. Introduction

In the late 1990s, Statistics Canada developed the Unified Enterprise Survey (UES), which gradually brought together, over a period of 10 years, the sampling, collection and processing of over 60 annual business surveys. Except for integrating new surveys or adjusting to external changes (e.g. the redesign of the Canadian Business Register in 2007-2008), the UES has been running more or less as is, i.e. without any significant improvements, since the mid-2000s. In 2009, Statistics Canada launched the Corporate Business Architecture (CBA) review initiative. Growing financial pressures prompted this review of Statistics Canada's business model and systems, aimed at achieving further efficiencies, enhancing quality and improving responsiveness in the delivery of new statistical programs. For business surveys, this corporate review was quite timely, as the UES was in need of its own review. Further stimulated by the CBA objectives, this review quickly evolved into a full redesign to define a new model in which even more business surveys would be conducted using shared processing systems flexible enough to answer most of the needs of the surveys. In order to achieve these ambitious objectives, the IBSP quickly decided to build this new model around the following two principles:
• Increased standardisation of methods, systems, questionnaires and processes, developed around the extensive use of administrative data;
• Greater automation to achieve similar, if not better, quality at lower processing costs.
To give an idea of the scope and timeline of the IBSP, the current UES surveys and the Capital Expenditure Survey will be run using the IBSP model starting in 2014, for survey reference year 2013. In 2015, the IBSP will have integrated transportation, energy and research and development (R&D) surveys, some of which are sub-annual. Finally, in 2016, the IBSP will be running the Quarterly and Annual Finance Surveys and several agriculture surveys. The current UES consists of annual business surveys collecting financial and characteristics data; these industry-based surveys are divided by industry classification. The IBSP will run industry-based as well as activity-based surveys spanning multiple industries. It will also run some sub-annual surveys and some surveys, such as most agriculture surveys, that collect solely commodity data. Obviously, the surveys to be integrated all come with their specific needs and design features. Consequently, given its above-mentioned principles, the IBSP relies on strong governance to ensure surveys adapt to and adopt global solutions rather than maintaining local solutions and infrastructure. Generalized systems will play an even more important role. Currently, our generalized sampling and estimation systems are also being redesigned to:
• add new functionalities required to satisfy the needs of all the surveys under the IBSP, as well as most surveys outside of the IBSP;
• bring the systems to a platform that will facilitate adding new functionalities in the future.
Finally, systems development will be centered on the use of metadata to ensure maximum flexibility in implementing and managing the various processes.
The main challenge for the IBSP is to come up with a survey design strategy and infrastructure that:
• will integrate all these surveys, as well as potential ad hoc surveys;
• will focus on improving the quality of commodity and characteristics estimates, given the extensive use of tax sources for financial data;
• will offer the flexibility to add extra sample, new questions or new estimates, as needed;
• will easily adapt to new auxiliary data, classifications, etc.
Section 2 of this paper describes the IBSP sample design. The proposed adaptive processes are presented in section 3, and the paper then concludes and discusses some future work.
2. Sample Design

2.1 Survey Frames

As in the UES, the IBSP survey frame will be extracted from Statistics Canada's Business Register (BR). The current UES survey frame is extracted from the BR. The Capital Expenditure Survey and the Energy Statistics Program surveys are presently completing their full integration with the BR. The transfer of Farm Register and 2011 Census of Agriculture data to the BR is nearing completion, and soon our first agriculture commodity survey will be extracting its frame directly from the BR. All IBSP surveys will target establishments or enterprises. Single-establishment enterprises comprise the vast majority of the businesses on the BR and are easy to deal with. However, complex multi-establishment enterprises, often operating in different industries or provinces, are typically large, contribute a lot to the overall economy and are more difficult to deal with from a sampling and processing perspective. In the current UES approach, the sample selection is first carried out at the establishment level, and then network sampling is used to bring into the sample other in-scope establishments, within the same complex enterprise, that might not have originally been selected. Doing so complicates sampling and brings significant challenges when estimating variances. More detail on the UES sample design and on its network sampling strategy can be found in Girard and Simard (2000) and Simard et al. (2001). For the IBSP, it was decided to create only one sampling unit (SU) per enterprise, covering all the establishments/operations of the enterprise that are in scope for the survey. Once selected, the enterprises will have to report on all their in-scope operations. The way the IBSP will take into account the contribution of complex enterprises to multiple geographic and industrial domains is described in the following sections.
2.2 Sample design's objective

Say the survey's objectives are to measure a set of parameters

t_{yjd} = \sum_{k=1}^{N} y_{jdk},

with $j = 1, \ldots, J$ referring to the variables of interest and $d = 1, \ldots, D$ referring to the domains of interest (e.g. geography x industry), for a population $U$ of size $N$. Let's assume that the variables $j$ and the domains $d$ are of different importance to the survey and that $\lambda_{jd}$ is a relative measure of that importance. Let's also assume that the population will be divided into $H$ strata $U_h$ of size $N_h$, for $h = 1, \ldots, H$, and that a Bernoulli sample will be selected in each stratum $h$ with sampling fraction $\pi_h$. The strata $U_h$ will be created by first assigning the SUs to one and only one domain $d$ (also called the primary stratum), and then by stratifying the SUs according to a certain size measure within the domain. Obviously, complex SUs may have operations in multiple domains, and thus decisions have to be made as to which domain they are stratified into and what measure of size should be used. Note that the domains of interest $d = 1, \ldots, D$ are assumed here to be the same at sampling and at estimation.
The main objective of the sample design is to allocate and select a sample that will, as efficiently as possible, produce the required set of estimates. In other words, for a fixed cost the sample design should on average maximize the overall quality of the estimates. This objective could be formulated as the following optimization problem:
\min_{\pi_h} \sum_{d=1}^{D} \sum_{j=1}^{J} \lambda_{jd} \, CV(\hat{t}_{yjd})^2    (2.1)

under the constraint

\sum_{h=1}^{H} \pi_h N_h c_h \le C    (2.2)
where $\hat{t}_{yjd}$, $CV(\hat{t}_{yjd})$, $C$ and $c_h$ are, respectively, the estimate of the total $t_{yjd}$, the coefficient of variation of $\hat{t}_{yjd}$, the target total cost (or sample size) and the stratum unit cost. Note that if we set $\lambda_{jd} = (t_{xjd})^p = (\sum_{k=1}^{N} x_{jdk})^p$, with $p \in [0,1]$ and where $x_{jdk}$ is the value of an auxiliary variable $x_j$ for unit $k$ in domain $d$, then the minimization problem above is simply a multivariate version of the loss function used in Bankier (1988) to define the well-known power allocation.
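To make the objective concrete, here is a minimal Python sketch (illustrative only, not G-SAM code) of the quantity appearing inside (2.1): the CV of a Horvitz-Thompson total under stratified Bernoulli sampling, for which $Var(\hat{t}) = \sum_h \frac{1-\pi_h}{\pi_h} \sum_{k \in U_h} x_k^2$. The stratum labels and values below are hypothetical.

```python
import math

def bernoulli_cv(strata, pi):
    """CV of the Horvitz-Thompson total under stratified Bernoulli sampling.

    strata: dict mapping stratum id h -> list of x-values for the units in U_h
    pi:     dict mapping stratum id h -> sampling fraction pi_h
    Var(t_hat) = sum_h (1 - pi_h) / pi_h * sum_{k in U_h} x_k^2
    """
    total = sum(x for xs in strata.values() for x in xs)
    var = sum((1.0 - pi[h]) / pi[h] * sum(x * x for x in xs)
              for h, xs in strata.items())
    return math.sqrt(var) / total

# Hypothetical two-stratum population: a take-all stratum adds no variance.
strata = {"small": [1.0, 2.0, 3.0], "large": [50.0, 60.0]}
print(bernoulli_cv(strata, {"small": 0.5, "large": 1.0}))  # only "small" contributes
print(bernoulli_cv(strata, {"small": 1.0, "large": 1.0}))  # census: CV = 0.0
```

An allocation algorithm would search over the $\pi_h$ to drive a weighted sum of such CVs down, subject to the cost constraint (2.2).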
2.3 Sample allocation

In the IBSP, the minimization problem above was used to define the sampling strategy. Even though stratification chronologically comes before sample allocation, for the sake of clarity the allocation methodology will be presented first. Let's assume for now that the population of SUs has already been stratified into the strata $U_h$. Assume also that for every SU $k$ in the population, a single vector of auxiliary variables $\mathbf{x}_k = (x_{1k}, \ldots, x_{Dk})$ is available and correlated with all $J$ variables of interest. Finally, let's define $t_{xd} = \sum_{k=1}^{N} x_{dk}$, the total for $x$ in domain $d$, for $d = 1, \ldots, D$.
In this context, the sample allocation algorithm that will be used for the IBSP, in line with the objective of the previous section, will be:
\min_{\pi_h} \sum_{d=1}^{D} \lambda_d^2 \, CV(\hat{t}_{xd})^2    (2.3)

under the constraints

\sum_{h=1}^{H} \pi_h N_h c_h \le C    (2.4a)

CV(\hat{t}_{xd}) \le MaxCV_d    (2.4b)

P[n_h \ge a_h] \ge 1 - \alpha    (2.4c)
where $\hat{t}_{xd}$, $C$, $c_h$, $MaxCV_d$, $a_h$ and $\alpha$ are the estimate of the total $t_{xd}$, the target total cost (or sample size), the stratum unit cost, an upper bound on the CV in domain $d$, a lower bound on the expected sample size in stratum $h$ and the level of confidence used for the probabilities, respectively. Note that some of the constraints are formulated in terms of expected sample sizes and probabilities since a Bernoulli design is assumed. It is also worth mentioning that this is a simplified version of the complete optimization problem that will be solved and offered in Statistics Canada's new Generalized Sampling system (G-SAM). For example, the complete optimization problem will take into account, and directly adjust for, expected nonresponse and out-of-scope rates. As mentioned before, the IBSP will use the $\pi_h$ determined above to select a Bernoulli sample in each stratum $h$. If we assume for now a Horvitz-Thompson estimator, i.e. $\hat{t}_{xd} = \sum_h \sum_{k \in s_h} \pi_h^{-1} x_{dk}$, we find that
\min_{\pi_h} \sum_{d=1}^{D} \lambda_d^2 \, CV(\hat{t}_{xd})^2 = \min_{\pi_h} \sum_{d=1}^{D} \frac{\lambda_d^2}{t_{xd}^2} \sum_{h=1}^{H} Var_{hd}(\hat{t}_{xd})

= \min_{\pi_h} \sum_{d=1}^{D} \frac{\lambda_d^2}{t_{xd}^2} \sum_{h=1}^{H} \frac{1-\pi_h}{\pi_h} \sum_{k \in U_h} x_{dk}^2

= \min_{\pi_h} \sum_{h=1}^{H} \frac{1-\pi_h}{\pi_h} \sum_{k \in U_h} \sum_{d=1}^{D} \frac{\lambda_d^2}{t_{xd}^2} x_{dk}^2

= \min_{\pi_h} \sum_{h=1}^{H} \frac{1-\pi_h}{\pi_h} \sum_{k \in U_h} z_k^2    (2.5)
where $z_k = \sqrt{\sum_d (\lambda_d x_{dk} t_{xd}^{-1})^2}$. Written this way, the minimization problem becomes univariate, in the sense that every unit $k$ now has only one auxiliary variable, i.e. $z_k$. It can also be rewritten as

\min_{\pi_h} \sum_{h=1}^{H} \frac{1-\pi_h}{\pi_h} \sum_{k \in U_h} z_k^2 = \min_{\pi_h} \sum_{h=1}^{H} Var(\hat{t}_{zh})    (2.6)

where $\hat{t}_{zh} = \sum_{k \in s_h} \pi_h^{-1} z_k$. It then becomes clear that $z_k$ is a good choice for the univariate size measure attached to each SU $k$.
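The size measure $z_k$ can be computed directly from a unit's domain values, the importance factors and the domain totals. A small illustrative sketch (the $\lambda_d$ and $t_{xd}$ values below are hypothetical; in practice they would come from the frame):

```python
import math

def size_measure(x_k, lam, t_x):
    """Univariate size measure z_k = sqrt(sum_d (lambda_d * x_dk / t_xd)^2).

    x_k: list of domain values x_dk for one SU k
    lam: importance factors lambda_d
    t_x: known domain totals t_xd
    """
    return math.sqrt(sum((l * x / t) ** 2 for x, l, t in zip(x_k, lam, t_x)))

lam = [1.0, 1.0]      # equal importance for both domains (illustrative)
t_x = [100.0, 200.0]  # domain totals t_xd
print(size_measure([10.0, 0.0], lam, t_x))   # single-domain SU
print(size_measure([10.0, 20.0], lam, t_x))  # multi-domain SU gets a larger z_k
```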
2.4 Stratification

2.4.1 Primary stratification

As mentioned before, a decision has to be made as to which primary stratum a complex SU is assigned to. To do so, let's start from the fact that once the SUs have been stratified into one domain, they will then be further stratified by size into small, medium and large strata. Typically, $\pi_h$ increases as we move from the strata composed of small SUs to the strata composed of large SUs, and often $\pi_h = 1$ for the strata made up of the largest SUs. The primary stratum $d$ for a unit $k$ will be determined according to the following formula:

\arg\max_d \, z_{dk}, \text{ for } d = 1, \ldots, D,    (2.7)

where $z_{dk} = \lambda_d x_{dk} t_{xd}^{-1}$ are standardized variables. Formula (2.7) means that the unit $k$ is classified in the primary stratum $d$ where its value $z_{dk}$ is maximum over all domains $d$. According to our framework, the larger $z_{dk}$ is for domain $d$, the more important unit $k$ should be for $d$, which intuitively makes sense since $z_{dk}$ is the weighted contribution of unit $k$ to $t_{xd}$. Formula (2.7) can also be seen as the minimization of the distance between $z_k$ and $z_{dk}$, which in the end should help create more homogeneous primary strata in terms of the $z_k$, and reduce the risk of creating extremely skewed distributions of $z_k$ in the domains $d$.
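Assignment rule (2.7) is a simple argmax over the standardized contributions $z_{dk}$. A minimal sketch with hypothetical values:

```python
def primary_stratum(x_k, lam, t_x):
    """Assign SU k to the domain d maximizing z_dk = lambda_d * x_dk / t_xd (formula 2.7)."""
    z = [l * x / t for x, l, t in zip(x_k, lam, t_x)]
    return max(range(len(z)), key=z.__getitem__)

lam = [1.0, 1.0]
t_x = [100.0, 1000.0]
# This SU sells more in domain 1 in absolute terms, but contributes
# relatively more to domain 0's total, so it is assigned to domain 0.
print(primary_stratum([10.0, 50.0], lam, t_x))  # -> 0
```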
2.4.2 Size stratification

As in the UES, in order to reduce the response burden of small businesses, the IBSP will, in each domain, exclude from its surveyed population the smallest businesses that together contribute at most 10% of the domain total $t_{xd}$. Multi-domain SUs will only be excluded if they fall in the bottom 10% of all domains to which they contribute. Once the excluded SUs have been identified, the remaining SUs will be stratified by size. As demonstrated in section 2.3, a good choice for the size measure is $z_k$. These $z_k$ will therefore be used to stratify the SUs, using the geometric approach (see Gunning and Horgan 2004), into L size strata within each primary stratum; L will vary between 1 and 3 depending on the number of SUs inside the primary stratum. To define L size strata, the boundaries $b_0, b_1, \ldots, b_L$ are set as

b_h = a r^h, \text{ with } a = b_0 \text{ and } r = (b_L / b_0)^{1/L}    (2.8)
where $b_0$ and $b_L$ are respectively the lowest and highest values of the $z_k$ in the primary stratum. This algorithm is very simple to implement and, according to Gunning and Horgan (2004), it performs fairly well for asymmetric populations, which is often the situation in business surveys. On the other hand, a preliminary evaluation in the context of the IBSP showed that in the presence of extreme outliers (i.e. extremely small or extremely large units), the algorithm sometimes gives inappropriate results, such as empty strata or highly unstable stratum boundaries from one survey cycle to the next. Therefore, a method to treat extreme values of $z_k$ before running the size stratification will be developed.
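Boundary formula (2.8) can be sketched in a few lines (illustrative values; as noted above, in production the $z_k$ would first be screened for extreme values):

```python
def geometric_boundaries(z_min, z_max, L):
    """Geometric stratum boundaries b_h = a * r**h (formula 2.8),
    with a = b_0 = z_min and r = (b_L / b_0)**(1/L)."""
    r = (z_max / z_min) ** (1.0 / L)
    return [z_min * r ** h for h in range(L + 1)]

# Three size strata between z_k values of 1 and 1000:
print(geometric_boundaries(1.0, 1000.0, 3))  # approximately [1, 10, 100, 1000]
```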
2.5 Two-phase approach

Another key feature of the IBSP will be to allow the selection of two-phase samples. This will enable surveys to efficiently produce good-quality estimates for variables or subpopulations that are not available or not easily identified on the BR. This will be particularly useful for surveys interested, for example, in:
• commodities (pens, screws, wood tables, etc.);
• characteristics (presence of technical staff, method of sale, number of rooms, etc.);
• activities (hog breeding, capital and repair expenses, R&D, etc.).
Even though not all IBSP surveys will make use of the two-phase feature, most annual surveys will take part in a global two-phase strategy. The idea is to select one large first-phase sample covering all industries, and then to make use of web-based collection to: a. verify and update each unit's activity status, industrial classifications and contact information; b. collect additional information, such as the items mentioned above. The global first-phase sample would then be used as the sampling frame for the survey-specific second-phase samples. The methodology described above for stratification and sample allocation will be used at both phases, with some minor adjustments when dealing with the second-phase samples.
3. Adaptive Design

As mentioned before, one of the main goals of the IBSP is to achieve greater efficiencies in processing its survey data, while producing estimates of similar, if not better, quality. To do this, a new adaptive design has been developed to manage data collection activities as well as data analysis (Godbout, Beaucage and Turmelle, 2011; Godbout, 2011).
3.1 Processing model

Currently, the processing model of most of our business surveys is typically linear (Figure 1). As the data come in through collection, they are partially edited and some businesses are followed up for non-response or failed edits. The complete set of data is then fully edited and imputed, and then analyzed by subject matter analysts. Finally, estimates are produced, analyzed and disseminated. This is carried out through a long series of processes that require considerable manual intervention, even to run the automated steps. The steps from processing to dissemination can take up to six months, and quality indicators for the estimates only become available during final analysis.
Figure 1: Current survey processing model

While the current process model produces good quality estimates, it is very lengthy and still heavily focussed on micro-data. Follow-up is prioritized based on frame information, not on the estimates themselves or their quality. In the IBSP, we want to make optimal use of the resources available by limiting manual processing to the more influential units. To achieve this, we must take into account the estimates and their quality during collection and processing, rather than only near the end. While the current collection management based on weighted response rates is good, for the IBSP we want to actively manage collection efforts based more on key estimates and their Quality Indicators (QI). To do this, the IBSP will implement a circular approach (Figure 2) called the Rolling Estimates (RE) model. In this model, once enough data from administrative sources and collection have been received, a series of automated processes will be run, right through
to producing estimates and their quality indicators. This information will then be analyzed using a top-down approach: macro estimates first, microdata second. The current plan is to produce these "rolling estimates" about once a month during a 4- to 5-month period. The RE will produce key estimates and related QIs. Using these results, decisions will be made on whether to stop active collection. Basically, when the QIs have reached pre-specified targets for a given geography-industry domain, active collection can stop in that domain and resources can be redirected towards other domains, as required.
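The stop rule described above amounts to a per-domain comparison of a decreasing QI to its target. A minimal sketch (domain names, QI values and targets are hypothetical):

```python
def collection_status(qi, targets):
    """Decide per domain whether active collection can stop: stop once the
    (decreasing) quality indicator has reached its pre-specified target."""
    return {d: "stop" if qi[d] <= targets[d] else "continue" for d in qi}

qi = {"retail-ON": 0.04, "retail-QC": 0.09}        # e.g. current CVs by domain
targets = {"retail-ON": 0.05, "retail-QC": 0.05}   # pre-specified CV targets
print(collection_status(qi, targets))  # retail-ON stops, retail-QC continues
```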
Figure 2: Rolling Estimates Model

The RE will also produce a Measure of Impact (MI) score for each unit at each iteration. Active collection will be based on lists of non-responding units or units that failed edits, prioritized by their MI scores. Active analysis (also called selective editing) will mainly focus on respondents significantly influencing key estimates and their quality.
3.2 Methodology

This section describes the methodology related to quality indicators, measure of impact scores and active management for the IBSP.
3.2.1 Notation and definition

Let's assume again that we are interested in measuring a set of parameters $t_{yjd} = \sum_{k \in U} y_{jdk}$, with $j = 1, \ldots, J$ referring to the variables of interest and $d = 1, \ldots, D$ referring to the domains of interest, for a population $U$ of size $N$. As described in section 2, a sample $s$ of size $n$ is drawn using a 2-phase design with stratified Bernoulli sampling at both phases. To handle unit or item non-response, the processing and estimation model will use composite linear imputation methods, under the assumption that the multivariate distribution function of the variables of interest $Y$, given the set of covariates $X^{obs}$, is independent from the sample $s$, the respondent subsample $s_r$, the calibration information and the domain information. An imputation method is said to be linear if the imputed value for a non-respondent $k \in s_m$, where $s_m$ is the subsample of non-respondents, can be written in the linear form

y_{jk}^{IMP} = \beta_{j0k} + \sum_{k' \in s_r} \beta_{jk'k} \, y_{jk'},

where $\beta_{j0k}$ and $\beta_{jk'k}$ are quantities that do not depend on the y-values. Examples of linear imputation methods include auxiliary value imputation, linear regression and donor imputation. For more details, see Beaumont and Bissonnette (2011). The general imputation model for unit $k$ is the following:

E_m(y_{jk} \mid \mathbf{x}_k^{obs}) = \mu_{jk}
var_m(y_{jk} \mid \mathbf{x}_k^{obs}) = \sigma_{jk}^2    (3.1)
cov_m(y_{jk}, y_{jk'} \mid \mathbf{x}_k^{obs}) = 0.

The processed variable of interest $j$ for unit $k$, allocated to domain $d$, is:
y_{jdk}^{\hat{\mu}} = \begin{cases} y_{jk} \, a_{dk} & \text{if } k \in s_r \\ y_{jk}^{IMP} \, a_{dk} = \hat{\mu}_{jk} \, a_{dk} & \text{if } k \in s_m \end{cases}    (3.2)

where $\hat{\mu}_{jk}$ is the estimate of the parameter $\mu_{jk}$ in the imputation model (3.1) and $a_{dk} \in [0,1]$ is an allocation factor, constant with respect to the j-th variable of interest. The estimator for the totals $t_{yjd}$ is given by:

\hat{t}_{yjd}^{IMP} = \sum_{k \in s} w_k^{(E)} \, y_{jdk}^{\hat{\mu}}    (3.3)
The estimation weight $w_k^{(E)} = \pi_k^{-1} g_{dks}$ is the inverse of the sampling probability $\pi_k$ resulting from the 2-phase design, calibrated (through the factor $g_{dks}$) to the stratum counts and known domain totals. The collection and analysis activities are called treatments. Some of them, like fax or email follow-up, have a relatively low unit cost. The adaptive design aims at managing the treatments with a significant marginal cost in order to get the best impact on the quality of the estimates. The treatments covered by this paper are telephone follow-up for non-response or failed edits, and manual editing (Claveau et al., 2012). Let's define the local MI score of a QI subject to a treatment T applied to unit k on variable j and in domain d, conditional on the realized sets $s$ and $s_r$, as
\hat{MI}_{Tk}(\hat{QI}_{jd}) = \hat{MI}_{Tk}(\hat{QI}_{jd} \mid s, s_r) = \hat{QI}_{jd} - E_{Tk}[\hat{QI}_{jd} \mid s, s_r].    (3.4)

Defined this way, the MI score will be positive for decreasing QIs (i.e. QIs that decrease as the quality improves), like the ones presented in section 3.2.2. To get positive MI scores for increasing QIs, (3.4) would be reversed and become $\hat{MI}_{Tk}(\hat{QI}_{jd}) = E_{Tk}[\hat{QI}_{jd} \mid s, s_r] - \hat{QI}_{jd}$. The expectation of the QI under a treatment T on unit k is the combination of its predicted value $\tilde{QI}_{jd|T}$ and the success probability of the treatment T on unit k:

E_{Tk}[\hat{QI}_{jd} \mid s, s_r] = \tilde{QI}_{jd|T} \, P[T \text{ is successful on } k] + \hat{QI}_{jd} \, (1 - P[T \text{ is successful on } k]).    (3.5)
Defining $s_T \subseteq s$ as the set of all units assigned to treatment T, the following property can be observed:

\tilde{QI}_{jd|T} \approx \hat{QI}_{jd} - \sum_{k \in s_T} \hat{MI}_{Tk}(\hat{QI}_{jd}).    (3.6)

The predicted QI can be approximated from the current estimated QI by subtracting the MI scores of all units assigned to treatment T, regardless of the treatment's results. For many QIs, the approximation becomes an equality. Since no propensity models will be implemented in the first few years of the IBSP, we assume that all treatments will be successful, i.e. $P[T \text{ is successful on } k] = 1$, so the expected quality indicator of (3.4), given that unit k is assigned to treatment T, can be simplified and written as:

E_{Tk}[\hat{QI}_{jd} \mid s, s_r] = \tilde{QI}_{jd|Tk}.    (3.7)
The formulas for the local MI scores depend on the quality indicators as shown in the following section.
3.2.2 Quality Indicators and Measure of Impact Scores

The IBSP's adaptive design will first rely on three basic QIs: the total coefficient of variation (CV), the collection weighted non-response rate and the relative absolute predicted error. They are well known and understood, and are described in more detail below.
3.2.2.1 Total coefficient of variation

The CV is a common indicator used at Statistics Canada to assess the quality of an estimate. The CV is based on the sampling variance, but it also often encompasses the effect of non-response through reweighting, or is complemented with an imputation rate. For the IBSP, the main driver of active management will be the total CV, which combines the sampling variance $\hat{V}_{Sam}$, the non-response variance $\hat{V}_{NR}$ and the sampling and non-response covariance (also called the mixed variance $\hat{V}_{Mix}$). In the case of the estimator under imputation (3.3), the CV is given by:
\hat{CV}(\hat{t}_{yjd}^{IMP}) = (\hat{t}_{yjd}^{IMP})^{-1} \left[ \hat{V}_{Sam}(\hat{t}_{yjd}^{IMP}) + \hat{V}_{NR}(\hat{t}_{yjd}^{IMP}) + \hat{V}_{Mix}(\hat{t}_{yjd}^{IMP}) \right]^{1/2}    (3.8)

where the total CV components are given by the following:

\hat{V}_{Sam}(\hat{t}_{yjd}^{IMP}) = \sum_{k \in s} \sum_{k' \in s} \frac{\pi_{kk'} - \pi_k \pi_{k'}}{\pi_{kk'}} \, w_k^{(E)} e_{jdk} \, w_{k'}^{(E)} e_{jdk'} - \sum_{k \in s_m} w_k^{(E)} (w_k^{(E)} - 1) \, a_{dk}^2 \hat{\sigma}_{jk}^2    (3.9a)

\hat{V}_{NR}(\hat{t}_{yjd}^{IMP}) = \sum_{k \in s_r} W_{jdk}^2 \hat{\sigma}_{jk}^2 + \sum_{k \in s_m} (w_k^{(E)})^2 a_{dk}^2 \hat{\sigma}_{jk}^2    (3.9b)

\hat{V}_{Mix}(\hat{t}_{yjd}^{IMP}) = 2 \sum_{k \in s_r} W_{jdk} (w_k^{(E)} - 1) \, a_{dk} \hat{\sigma}_{jk}^2 + 2 \sum_{k \in s_m} w_k^{(E)} (w_k^{(E)} - 1) \, a_{dk}^2 \hat{\sigma}_{jk}^2    (3.9c)
In the above formulas, $w_k^{(E)}$ is the estimation weight, $\hat{\sigma}_{jk}^2$ is the estimate of the parameter $\sigma_{jk}^2$ in the imputation model (3.1), and $e_{jdk}$ is the residual of $y_{jdk}^{\hat{\mu}}$ calibrated to a vector of auxiliary variables $\mathbf{z}_k$, given by

e_{jdk} = y_{jdk}^{\hat{\mu}} - \mathbf{z}'_k \left( \sum_{k' \in s} \pi_{k'}^{-1} \mathbf{z}_{k'} \mathbf{z}'_{k'} \right)^{-1} \sum_{k' \in s} \pi_{k'}^{-1} \mathbf{z}_{k'} \, y_{jdk'}^{\hat{\mu}}.

The quantity $W_{jdk} = \sum_{k' \in s_m} w_{k'}^{(E)} a_{dk'} \beta_{jkk'}$ could be seen as the extra weight carried by a responding unit k to compensate for the set of non-responding units according to the imputation model. More information about how those three components of variance are derived can be found in Beaumont and Bissonnette (2011) and in Beaumont, Bissonnette and Bocci (2010). The active management intends to reduce the total CV to a pre-specified target by efficiently converting non-responding units to respondent status through non-response follow-up. In order to predict the impact on the CV components of converting non-responding unit k to a respondent, the assumption is that k moves from $s_m$ to $s_r$ without changing the estimated imputation model and the estimated total $\hat{t}_{yjd}$, since they are both unbiased estimates. Thus, the predicted effects of unit k following treatment T on each CV component are given by:

\delta_{jk}(\hat{V}_{Sam}) = \hat{V}_{Sam} - E_{Tk}[\hat{V}_{Sam} \mid s, s_r] = (1 - \pi_k) \, w_k^{(E)} a_{dk}^2 \hat{\sigma}_{jk}^2    (3.10a)
\delta_{jk}(\hat{V}_{NR}) = \hat{V}_{NR} - E_{Tk}[\hat{V}_{NR} \mid s, s_r] = (w_k^{(E)})^2 a_{dk}^2 \hat{\sigma}_{jk}^2 + \sum_{k' \in s_r} \left( 2 W_{jdk'} - w_k^{(E)} a_{dk} \beta_{jk'k} \right) w_k^{(E)} a_{dk} \beta_{jk'k} \, \hat{\sigma}_{jk'}^2    (3.10b)

\delta_{jk}(\hat{V}_{Mix}) = \hat{V}_{Mix} - E_{Tk}[\hat{V}_{Mix} \mid s, s_r] = 2 w_k^{(E)} (w_k^{(E)} - 1) \, a_{dk}^2 \hat{\sigma}_{jk}^2 + 2 \sum_{k' \in s_r} w_k^{(E)} a_{dk} \beta_{jk'k} (w_{k'}^{(E)} - 1) \, a_{dk'} \hat{\sigma}_{jk'}^2.    (3.10c)
The three expressions above are fairly easy to verify by noting that, under treatment T on unit k, $E(W_{jdk'} \mid s, s_r, s_T \supseteq \{k\}) = \sum_{l \in s_m} w_l^{(E)} a_{dl} \beta_{jk'l} - w_k^{(E)} a_{dk} \beta_{jk'k}$. The derivation of those expressions can be found in Picard and Bosa (2012). The associated MI score is:

\hat{MI}_{Tk}(\hat{CV}) = (\hat{t}_{yjd}^{IMP})^{-1} \left[ \delta_{jk}(\hat{V}_{Sam}) + \delta_{jk}(\hat{V}_{NR}) + \delta_{jk}(\hat{V}_{Mix}) \right]^{1/2}.    (3.11)
3.2.2.2 Non-response rates

Some of the most common quality measures come from the family of rates (e.g. collection return rates, collection response rates, out-of-scope rates, etc.). The IBSP's active management model will use an economically-weighted non-response rate at collection (Statistics Canada, 2001). A rate can be defined as a ratio of weighted counts/totals for classes specific to the type of rate, as follows:

\hat{QI} = \hat{R}_{xd,cnum,cdenom} = \frac{\sum_{k \in s} w_k x_{dk} \, i_{cnum,k}}{\sum_{k \in s} w_k x_{dk} \, i_{cdenom,k}} = \hat{t}_{xd,cdenom}^{-1} \, \hat{t}_{xd,cnum}.    (3.12)
The class ownership indicator $i_{\cdot,k}$ is set to 1 if the unit k belongs to the specific class and 0 otherwise. Generally, the numerator's class is embedded in the denominator's class, so the rate is in the range [0,1]. The local MI score for a unit k under a given treatment T attached to a specific rate $\hat{R}_{xd}$ is the following:

MI_{Tk}(\hat{R}_{xd}) = \hat{t}_{xd,cdenom}^{-1} \, w_k x_{dk} \, (i_{cnum,k} - E_{Tk}[i_{cnum,k} \mid s, s_r]).    (3.13)

Without propensity models, we assume $E_{Tk}[i_{cnum,k} \mid s, s_r] = 0$ if unit k is assigned to the treatment T, so formula (3.13) becomes

MI_{Tk}(\hat{R}_{xd}) = \hat{t}_{xd,cdenom}^{-1} \, w_k x_{dk} \, i_{cnum,k}.    (3.14)
It is easily shown that the local MI score of a unit k equals 0 if the unit already has the numerator's expected status ($i_{cnum,k} = 0$) or if the unit does not belong to the denominator's class ($i_{cdenom,k} = 0$). For some specific IBSP surveys for which the total variance is deemed inappropriate as a QI, the active collection management model will try to reduce the economically-weighted non-response rate at collection to a pre-specified target. This rate is defined as the ratio of the economically-weighted count of non-respondents over the count of in-scope units assigned to survey collection, where the weight $w_k$ and the economic size $x_{dk}$ are respectively the sampling weight $w_k^{(S)} = \pi_k^{-1}$ resulting from the 2-phase design and the value of the auxiliary variable used at sampling for unit k in domain d.
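The rate (3.12) and its local MI scores (3.14) can be sketched with a toy respondent file (weights, sizes and statuses below are hypothetical). The last line illustrates property (3.6): subtracting the MI scores of all followed-up units predicts the rate after treatment.

```python
def nr_rate_and_mi(units):
    """Economically-weighted non-response rate (3.12) and local MI scores (3.14).

    units: list of dicts with sampling weight w, economic size x, and
           indicators i_num (non-respondent) and i_denom (in collection).
    """
    denom = sum(u["w"] * u["x"] * u["i_denom"] for u in units)
    rate = sum(u["w"] * u["x"] * u["i_num"] for u in units) / denom
    mi = [u["w"] * u["x"] * u["i_num"] / denom for u in units]
    return rate, mi

units = [
    {"w": 2.0, "x": 10.0, "i_num": 1, "i_denom": 1},  # non-respondent
    {"w": 2.0, "x": 30.0, "i_num": 0, "i_denom": 1},  # respondent: MI = 0
    {"w": 1.0, "x": 20.0, "i_num": 1, "i_denom": 1},  # non-respondent
]
rate, mi = nr_rate_and_mi(units)
print(rate)            # (20 + 20) / 100 = 0.4
print(rate - sum(mi))  # 0.0: following up every unit predicts a zero rate
```

Sorting units by their MI score (largest first) yields the follow-up prioritization list described in section 3.1.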
3.2.2.3 Relative absolute predicted error

The QI driving the active analysis depends on predicted values, according to the selective editing methodology (Brundell, 2011). The predicted value $\tilde{y}_{jk}$ for the key variable of interest j is available for all the collection units k in a static file created before the collection phase. So, the absolute predicted error allocated to domain d is given by $\tilde{y}_{jdk}^{err} = | y_{jk}^{\hat{\mu}} - \tilde{y}_{jk} | \, a_{dk}$.

The QI attached to the active analysis is the total relative absolute predicted error, and it can be viewed as a special case of the rate presented above:

\hat{QI} = \hat{R}_{err,jd} = \frac{\sum_{k \in s} w_k^{(E)} \tilde{y}_{jdk}^{err} \, i_{edit,k}}{\sum_{k \in s} w_k^{(E)} y_{jdk}^{\hat{\mu}}} = (\hat{t}_{yjd}^{IMP})^{-1} \, \hat{t}_{err,jd}    (3.15)

where the eligibility status indicator for manual editing $i_{edit,k}$ is set to 1 when unit k has not been reviewed yet, and to 0 when the value $y_{jdk}^{\hat{\mu}}$ is confirmed or manually edited. The active analysis model will identify the units with the largest impact on the total relative absolute predicted error to bring it down to a pre-specified target:

MI_{Tk}(\hat{R}_{err,jd}) = (\hat{t}_{yjd}^{IMP})^{-1} \, w_k^{(E)} \tilde{y}_{jdk}^{err} \, (i_{edit,k} - E_{Tk}[i_{edit,k} \mid s, s_r]).    (3.16)

We suppose the editing operations on unit k will validate the discrepancy between the processed and the predicted values, or correct any errors, so $E_{Tk}[i_{edit,k} \mid s, s_r] = 0$ if unit k is assigned to the treatment T, and formula (3.16) becomes:

MI_{Tk}(\hat{R}_{err,jd}) = (\hat{t}_{yjd}^{IMP})^{-1} \, w_k^{(E)} \tilde{y}_{jdk}^{err} \, i_{edit,k}.    (3.17)

It can be shown that the local MI score equals 0 if the allocated absolute predicted error is 0 (i.e. $\tilde{y}_{jk} = y_{jk}^{\hat{\mu}}$) or if the unit has been manually reviewed (i.e. $i_{edit,k} = 0$).
3.2.3 Global Quality Indicator

In order to prioritize units for active collection and analysis, a single measure of impact score per unit is required. The use of a distance measure to aggregate the local MI scores into a global MI score is described by Hedlin (2008). In addition, in order to dynamically take into account how far (or close) we are to the QI targets, let's introduce the distance measure $d(\hat{QI}_{jd}, QI_{jd}^{Target})$ between the current estimate and its target for a given variable j and domain d. The distance measure can be any monotonic distance function mapping to 1 (or close to it) when $\hat{QI}_{jd}$ is very far from its target, and to 0 when $\hat{QI}_{jd}$ has reached its target (i.e. $\hat{QI}_{jd} = QI_{jd}^{Target}$). The quality targets are set before the collection period based on the survey objectives, the quality achieved in previous cycles and the resources available. To determine the global MI score used in the IBSP, let's formulate the adaptive design problem using the multivariate sampling objectives from section 2.3 with the importance factors $\lambda_{jd}$. Using property (3.6), the dynamic minimization function in the IBSP's adaptive design, conditional on the realized sets $s$ and $s_r$, and under constraints on the resources available for a specific treatment T, becomes:

\arg\min_{s_T} F(s_T \mid s, s_r) = \arg\min_{s_T} \left[ \frac{\sum_{j,d} \left( \lambda_{jd} \, d(\hat{QI}_{jd}, QI_{jd}^{Target}) \, \tilde{QI}_{jd|T} \right)^2}{\sum_{j,d} \lambda_{jd}^2} \right]^{1/2}

= \arg\max_{s_T} \sum_{k \in s_T} \frac{\sum_{j=1}^{J} \sum_{d=1}^{D} \left( \lambda_{jd} \, d(\hat{QI}_{jd}, QI_{jd}^{Target}) \, \hat{MI}_{Tk}(\hat{QI}_{jd}) \right)^2}{\sum_{j=1}^{J} \sum_{d=1}^{D} \lambda_{jd}^2}    (3.18)

= \arg\max_{s_T} \sum_{k \in s_T} \left( \hat{MI}_{Tk}^{Global}(\hat{QI}) \right)^2.
Thus, the global MI score for unit k under treatment T for a specific QI is defined by:

\hat{MI}_{Tk}^{Global}(\hat{QI}) = \left[ \frac{\sum_{j=1}^{J} \sum_{d=1}^{D} \left( \lambda_{jd} \, d(\hat{QI}_{jd}, QI_{jd}^{Target}) \, \hat{MI}_{Tk}(\hat{QI}_{jd}) \right)^2}{\sum_{j=1}^{J} \sum_{d=1}^{D} \lambda_{jd}^2} \right]^{1/2}.    (3.19)
One can observe that this is the Euclidean distance of the local MI scores, each adjusted by the distance of the estimated QI to its target and weighted by the importance factors. A unit will have a large global MI score if it has large MI scores for important combinations of variables j and domains d for which the current estimated quality is far from the target. All combinations of variables and domains that have reached their targets are dynamically zeroed out. In the first few years of the IBSP, since no random subsampling of units will be used, this minimization problem reduces simply to sorting the global MI scores of the units eligible for a specific treatment in descending order, and creating prioritization lists for collection and analysis operations that fit the resources. In the future, more sophisticated subsampling methods using global MI scores can be added to the IBSP's adaptive design, which would help reduce the risk of non-response and editing biases. In both scenarios, the minimization problem shown in (3.18) would be very similar; the minimization on $s_T$ would be replaced by sampling probabilities.

In order to start testing some of these innovations, a parallel run was performed during the summer of 2011 using the UES environment. This parallel run was mostly used as a proof of concept. Only basic QIs and MIs were computed and made available to analysts, to get them used to the idea and also to get some feedback. Overall, even though the basic QIs and MIs had very limited use, the outcome was quite favorable, and feedback for improvement was gathered and taken into account during the preparation of the second parallel run that occurred during the summer of 2012. This second parallel run included more complex QIs and MIs, as well as the computation of global MI scores, and was used to try to assess the potential reduction in the volume of collection follow-ups and microdata analysis.
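Formula (3.19) is straightforward to compute once the local MI scores, importance factors and distances are available. A minimal sketch with hypothetical variable-domain pairs:

```python
import math

def global_mi(local_mi, importance, distance):
    """Global MI score (3.19): Euclidean aggregation of the local MI scores,
    each weighted by its importance factor lambda_jd and by the distance of
    the current QI estimate to its target (0 once the target is reached).

    All three arguments are dicts keyed by (variable, domain) pairs.
    """
    num = sum((importance[jd] * distance[jd] * local_mi[jd]) ** 2 for jd in local_mi)
    den = sum(l ** 2 for l in importance.values())
    return math.sqrt(num / den)

importance = {("rev", "ON"): 1.0, ("rev", "QC"): 1.0}
local_mi = {("rev", "ON"): 0.03, ("rev", "QC"): 0.04}
# QC has already met its target, so its contribution is dynamically zeroed out.
distance = {("rev", "ON"): 1.0, ("rev", "QC"): 0.0}
print(global_mi(local_mi, importance, distance))
```

Sorting units in descending order of this score yields the prioritization lists used for active collection and analysis.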
4. Conclusion and future work

As discussed here, the development of the IBSP is a major undertaking that presents many methodological challenges. Two of the main functionalities that will enable the IBSP to integrate over 100 surveys into a harmonized, flexible and efficient survey model were presented here: the two-phase sampling strategy and the adaptive design. The first official execution of the IBSP is less than two years away. Still, a lot of work and many challenges lie ahead. Beyond testing and fine-tuning the stratification and allocation parameters and building a flexible system to run them, the two-phase sampling strategy has to be finalized, and its operational details still have to be clearly defined. The need for two-phase sampling for the IBSP sub-annual surveys also has to be evaluated. Approaches and system functionalities to introduce more coordination between and within surveys will also be developed soon. Once the sampling methodology is finalized, the calibration, estimation and variance estimation strategy also has to be determined. On the processing side, results from the 2012 Rolling Estimates parallel run will be analysed and simulations will be run; both will be essential to finalize the methodology and pin down its main parameters. This work will be needed soon, because this crucial feature of the IBSP will also be one of the more difficult to program and implement in production.
Finally, once the development and implementation pressures subside, some of our effort will be redirected toward research on non-response bias indicators through the use of paradata and response propensity modeling.
Acknowledgements

The authors would like to thank Claude Julien for presenting this paper at the conference and for all his suggestions and comments, which helped write and improve this paper. The authors would also like to thank the two reviewers whose comments, despite the very tight deadline imposed on them, were extremely useful in improving the quality of this paper.
References

Bankier, M. (1988). Power allocations: Determining sample sizes for subnational areas. The American Statistician, Vol. 42, 174-177.

Beaumont, J.-F. and Bissonnette, J. (2011). Variance estimation under composite imputation: The methodology behind SEVANI. Survey Methodology, Vol. 37, 171-180.

Beaumont, J.-F., Bissonnette, J. and Bocci, C. (2010). SEVANI, Version 2.3, Methodology Guide. Internal report, Methodology Branch, Statistics Canada.

Brundell, P. (2011). Selective Data Editing and Its Implementation at Statistics Sweden. 2011 International Methodology Symposium, Ottawa, Canada.

Claveau, J., Leung, J. and Turmelle, C. (2012). Embedded Experiment for Non-Response Follow-Up Methods of Electronic Questionnaire Collection. Proceedings of the ICES IV Conference.

Girard, C. and Simard, M. (2000). Network Sampling: An Application in a Major Survey. Proceedings of the ICES II Conference, 1577-1582.

Godbout, S., Beaucage, Y. and Turmelle, C. (2011). Achieving Quality and Efficiency Using a Top-Down Approach in the Canadian Integrated Business Statistics Program. Conference of European Statisticians, Ljubljana, Slovenia.

Godbout, S. (2011). Standardization of Post-Collection Processing in Business Surveys at Statistics Canada. 2011 International Methodology Symposium, Ottawa, Canada.

Gunning, P. and Horgan, J.M. (2004). A New Algorithm for the Construction of Stratum Boundaries in Skewed Populations. Survey Methodology, Vol. 30, 159-166.

Hedlin, D. (2008). Local and Global Score Functions in Selective Editing. Conference of European Statisticians, Work Session on Statistical Data Editing, Vienna, Austria.

Picard, F. and Bosa, K. (2012). Estimation of the Variance Due to Non-Response in the Presence of Partial Domains. Technical report, Methodology Branch, Statistics Canada.

Simard, M., Girard, C., Parent, M.-N. and Smith, J. (2001). Sampling Designs for the Unified Enterprise Surveys: The Early Years. Methodology Branch Working Paper, Statistics Canada, BSMD-2001-003E.

Statistics Canada (2001). Standards and Guidelines for Reporting of Nonresponse Rates: Definitions, Framework and Detailed Guidelines. Internal document, Statistics Canada, Ottawa, Canada.