
Statistical Matching: a tool for integrating data in National Statistical Institutes

Marcello D’ORAZIO, Marco DI ZIO and Mauro SCANU
Italian National Statistical Institute (ISTAT)
Via Cesare Balbo, 16 – 00184 Roma, ITALY
e-mail: [email protected], [email protected], [email protected]

Abstract: Statistical matching procedures are methods that aim at integrating sources. A description of real contexts in the Italian National Statistical Institute, in which these methods appear to be useful, is given. The necessary steps for their application together with a number of open problems are discussed.

Keywords: Integration, Survey Data Fusion, Quality Evaluation, Imputation.

1. Introduction

The need for timely, reliable and consistent statistical information makes the integration of data from different sources a hot topic for National Statistical Institutes (NSIs). The Italian National Statistical Institute (ISTAT) is also increasingly interested in this topic [7] [18], and numerous applications of integration methodologies have been carried out. Generally speaking, data from different sources can be integrated by means of three different methodologies: merging, record linkage and statistical matching. While the first two methodologies aim at linking the same units in two or more different files, the third one addresses the problem of integration when the files lack unit identifiers or do not contain the same units. This paper is devoted to the use of statistical matching methodologies in NSIs. Par. 2 shows why and where these methodologies are usefully applied in ISTAT. Two broad groups of applications are described: they refer respectively to what we call macro and micro objectives. In par. 3 we make clear that the choice of the statistical matching methodology depends on the combination of the objective of the integration process and the available information. In par. 4 we underline that this choice should also depend on the evaluation of the efficiency of the statistical procedure and on the overall quality of the input and output. A number of open problems are finally discussed.


2. Statistical matching in National Statistical Institutes

The main target of statistical matching (e.g. see [11]) is to give joint information on variables observed in different sources. As already remarked, statistical matching is a procedure for integration when such sources lack unit identifiers or consist of different sets of units. As a consequence, the integration process is carried out through suitable statistical methods and the result consists of synthetic, i.e. not actually observed, quantities.

It is possible to identify some advantages and drawbacks of the statistical matching approach when compared with setting up a new survey (from now on, complete survey) that meets the informative objectives of the researcher (i.e. the respondents are asked about all the variables of interest). Among the advantages, the use of already available statistical or administrative sources makes it possible to obtain timely results and to cut costs. Further, a reduction of the response burden on the statistical units is expected. In fact, it has been shown [33] that the longer the questionnaire, the higher the non-response rate and the lower the accuracy of the answers. Hence, each source to be integrated may be considered more reliable and complete than the corresponding subset of the complete data-set. In addition, the use of synthetic results might protect against disclosure problems. Regarding drawbacks, we underline that statistical matching procedures are inferential procedures based on suitable statistical models (e.g. regarding the relationship among the variables to be fused). The correct specification of a model requires a certain amount of knowledge. Lack of such information undermines the overall quality of the final result (this point will be further discussed in paragraphs 3 and 4).

Once it has been decided that a complete survey is not the most convenient procedure for satisfying the informative needs, statistical matching can be considered a privileged procedure in NSIs. In fact, most of the available sources come from sample surveys and from administrative databases that either lack unit identifiers because of privacy constraints or do not refer to the same units.

ISTAT experiences are various. Among them, one important field of application is the integrated analysis of two economic variables: consumer expenditures and income. Even if many surveys observe these variables jointly, there is no source that describes both of them with high quality and a high level of detail. More often, each survey focuses on either consumer expenditures or income. As far as income is concerned, the Household Balance survey (HB) managed by the Bank of Italy is considered the most detailed and complete. Different sources can be used for expenditures: among others, the Household Expenditure survey (HE) and the Household Multipurpose survey (HM), both managed by ISTAT. Examples of fusion of such data-sets in ISTAT are the following.

1) The construction of the Social Accounting Matrices. The Social Accounting Matrix (SAM) is a system of statistical information containing economic and social variables in a matrix-formatted data framework. The matrix includes economic indicators such as per capita income and economic growth (for a more detailed definition: Chapter XX of [31]). In Italy such an archive is not available; hence, SAMs are built by fusing HB, HE and the National Accounts [8] [9].


2) The analysis of the relation between income and health expenditures. For the 2000 Annual Report of ISTAT [12], the problem of evaluating the relation between income and health expenditures arose. Owing to the lack of time and of additional funds for an ad hoc survey, the only feasible way to reach this goal was to fuse information coming from the 1994 HM and the 1995 HB. The objective was to estimate a parameter representing the relation between health expenditures and income [30].

3) The construction of comprehensive data-sets for flexible statistical analysis. For instance, Rosati [23] states that HE and HB can be integrated so that a complete data-set of units becomes available in order to: (a) analyse family saving; (b) analyse the decisions concerning groups of non-durable (or durable) goods; (c) implement microsimulation models for the analysis of public policies (e.g. [6]); (d) supply a multidimensional analysis of poverty.

Another important application, not connected to the joint analysis of income and expenditures, is:

4) The construction of an integrated database of the sample surveys on households. For instance, Fortunato and Morrone [10] have discussed an application of statistical matching methodologies for the integration of HM and the Labour Force survey.

The previous instances suggest the definition of two possible broad groups of objectives for statistical matching: we call them micro and macro objectives. The micro objective of statistical matching consists in transforming the distinct data-sets of records that refer to a particular definition of unit (individual, household, company, etc.) into an integrated data-set whose records refer to the same definition of unit (as in instances 3 and 4). The macro objective transforms the distinct data-sets of records into aggregated results (contingency tables or parameters that describe the relation among variables observed in the distinct data-sets; instances 1 and 2). In the next paragraph we underline that the statistical matching methodology should be chosen according to these two objectives.

3. General overview of statistical matching methods

Once it is clear which sources should be integrated and what the characteristics of the integrated data-set are, some actions are needed in order to make the sources logically and physically compatible. At the beginning, the researcher decides whether the elements of the problem (unit, variable, population, record, etc.) satisfy the set of definitions and rules for the integrated data-set. If they do not, a preliminary harmonisation step is required. The actions to be performed, well discussed in the literature (e.g. [32]), primarily consist of: (1) unit harmonisation: it is necessary that records of the different sources refer to the same definition of unit; (2) target population harmonisation: if the data-sets refer to different target populations, it is important to select just those records that refer to the population of interest; (3) variable harmonisation: the common variables should be defined in the same way. Finally, we also assume that the observations to be fused are reliable.
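As a purely illustrative sketch of steps (2) and (3), the following Python fragment harmonises two small files before matching. Everything in it is an assumption made for the example: the record layouts, the variable names, the age classes, the area coding and the choice of adults as target population do not come from any ISTAT or Bank of Italy survey.

```python
# Hypothetical records: file A observes (X, Y), file B observes (X, Z).
file_a = [
    {"age": 34, "region": "North", "income": 28000},
    {"age": 71, "region": "South", "income": 15000},
    {"age": 16, "region": "Centre", "income": 0},
]
file_b = [
    {"age_class": "30-49", "area": "N", "expenditure": 21000},
    {"age_class": "65+", "area": "S", "expenditure": 12000},
]

# Assumed correspondence between the two codings of the same variable.
AREA_TO_REGION = {"N": "North", "C": "Centre", "S": "South"}


def age_to_class(age):
    """Recode exact age into the (assumed) classes already used by file B."""
    if age < 18:
        return "under 18"
    if age < 30:
        return "18-29"
    if age < 50:
        return "30-49"
    if age < 65:
        return "50-64"
    return "65+"


def harmonise_a(record):
    """Variable harmonisation for file A: age is coarsened to age classes."""
    return {
        "age_class": age_to_class(record["age"]),
        "region": record["region"],
        "income": record["income"],
    }


def harmonise_b(record):
    """Variable harmonisation for file B: area codes become region labels."""
    return {
        "age_class": record["age_class"],
        "region": AREA_TO_REGION[record["area"]],
        "expenditure": record["expenditure"],
    }


def in_target_population(record):
    """Target population harmonisation: keep adults only (an assumption)."""
    return record["age_class"] != "under 18"


harmonised_a = [r for r in map(harmonise_a, file_a) if in_target_population(r)]
harmonised_b = [r for r in map(harmonise_b, file_b) if in_target_population(r)]
```

After this step both files describe the common variables (age_class, region) with the same definitions and categories, which is the precondition for using them as matching variables.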


Once the process of harmonisation has been completed, we can represent the situation handled by statistical matching methods by means of Fig. 1, where columns (X, Y, Z, V) are vectors of variables; rows are the statistical units inspected; white colour represents situations where information is not available.

[Figure: statistical units (rows) by the variable vectors X, Y, Z, V (columns)]

Fig. 1 – General framework for statistical matching.

At this stage we can closely relate this framework to a general statistical problem in the presence of missing data. Most of the literature has been developed with respect to the critical setting expressed in Fig. 2. In this case the variables have not been jointly observed.

[Figure: statistical units (rows) by the variable vectors Y, X, Z (columns)]

Fig. 2 – Typical situation for statistical matching

A traditional way to tackle this situation is to look at it as an imputation problem. However, we believe that the approach should be more general and, in particular, related to the objectives. Hence, among all the potential classifications of methods, we use the one based on the scope of integration, as previously discussed. Thus we have a micro approach when we are essentially interested in integrating the database at unit level, and a macro approach when we are mostly interested in the aggregates. According to this classification, methods designed for imputation fall principally into the first category. In particular, several techniques have been introduced: imputation based on linear regression, hot-deck [17] [28], log-linear models [27], non-parametric regression [19], imputation based on population modelling [13], multiple imputation [14] [24] [25]; many other techniques can be derived from the vast experience gained in imputation problems. It is worth noting that for donor-based techniques a further distinction between constrained and unconstrained matches can be made [21]. Regarding the macro approach, among others, methods based on calibration [20] and on EM, Data Augmentation or Iterative Proportional Fitting can be used [1] [16] [26].

It must be remarked that all methods, in both the micro and the macro approach, are explicitly or implicitly based on a statistical model. In this context, this means that we are assuming a structure for the relationships among the variables. Thus a wrong specification of the model leads to seriously biased final results. Moreover, the possibility of obtaining reliable estimates of the model from the available information must be taken into account. In fact, since the typical situation of statistical matching is characterised by the lack of simultaneous information on the variables (X, Y, Z) (Fig. 2), the only model we are able to reasonably estimate is the one based on the hypothesis that, roughly speaking, information on the variable X is sufficient to determine Y and Z. More formally, this means that Y and Z are statistically independent conditionally on X, i.e. P(Y, Z | X) = P(Y | X) P(Z | X). This hypothesis is known as the Conditional Independence Assumption (CIA). Whenever it is not possible to justify this assumption, the use of auxiliary information is needed. Some experiments have been carried out to analyse this issue (see, among others, [20] [21] [27] [28]). Finally, even if many classifications of the methods are possible, the main aspects that should be stressed are the micro or macro approach and the possibility of using auxiliary information in order to better estimate the model, e.g. reducing the effect of the CIA.
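To make the micro approach concrete, the following Python fragment is a minimal sketch of one donor-based technique, an unconstrained distance hot-deck. It is not the procedure of any of the cited papers: the function names, the Euclidean distance and the toy records are assumptions made for illustration. Since the donor is chosen using the common variables X only, the synthetic Z value carries no information about Y beyond what X conveys, so the fused file implicitly relies on the CIA.

```python
def distance(a, b, x_keys):
    """Euclidean distance between two records on the common variables X."""
    return sum((a[k] - b[k]) ** 2 for k in x_keys) ** 0.5


def nearest_neighbour_match(recipients, donors, x_keys, z_keys):
    """Unconstrained distance hot-deck.

    Each recipient record (observing X and Y) receives the Z values of the
    donor record (observing X and Z) that is closest on X. The match is
    unconstrained: the same donor may be used any number of times.
    """
    fused = []
    for rec in recipients:
        donor = min(donors, key=lambda d: distance(rec, d, x_keys))
        synthetic = dict(rec)
        for k in z_keys:
            synthetic[k] = donor[k]  # synthetic, not actually observed
        fused.append(synthetic)
    return fused


# Toy data (hypothetical): file A observes (x, y), file B observes (x, z).
file_a = [{"x": 1.0, "y": 20.0}, {"x": 4.0, "y": 35.0}]
file_b = [{"x": 0.8, "z": 5.0}, {"x": 4.5, "z": 9.0}, {"x": 9.0, "z": 14.0}]

fused = nearest_neighbour_match(file_a, file_b, x_keys=["x"], z_keys=["z"])
```

In a constrained match, by contrast, the number of times each donor can be used would be limited; that variant requires solving an assignment problem and is not shown here.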

4. Critical issues in applying statistical matching: quality evaluation

A rigorous definition of the quality of the entire matching process depends mainly on: (1) the quality of the observations of the source data-sets; (2) the properties of the matching algorithm; (3) the methods used to compute the quantities of interest (correlation and/or regression coefficients, contingency tables and so on) from the fused data-set. Hence, evaluating quality means understanding how these factors interact and affect the final results.

The quality of the data-sets to be fused might seriously affect the final results. This means that statistical matching should be applied only in those circumstances where the original data-sets are characterised by an acceptable level of quality.

Much attention has been devoted to the matching algorithms by considering a series of diagnostic measures strictly connected with their characteristics. For instance, distributions of matching distances are analysed (for algorithms that use distances); summaries of them through the usual indexes (averages, min, max and so on) are given; ad hoc measures, such as the weighting effect [4], are used to give an idea of the amount of variation in donor usage (for unconstrained matching algorithms). Diagnostic measures represent a useful, although not sufficient, component of the process of evaluating the overall quality of a matching application.

In the context of the evaluation of the overall quality, an important role is played by the check of the distributions of the variables under study, considered singly or jointly. In particular, it has to be tested whether these distributions are preserved by the matching application. Usually, the distributions analysed are those of the common variables (X) and the joint one X-Z (e.g. [19]).
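One simple way to look at whether a distribution is preserved is to compare relative frequencies before and after matching. The Python sketch below uses the total variation distance between two empirical distributions; this descriptive index is our choice for illustration, not a measure prescribed by the cited literature, and it is not a formal test. The category labels and the two samples are hypothetical.

```python
from collections import Counter


def relative_frequencies(values):
    """Empirical distribution of a list of (hashable) observations."""
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}


def total_variation_distance(values_1, values_2):
    """Half the sum of absolute differences between relative frequencies.

    0 means identical empirical distributions, 1 means disjoint supports.
    """
    p = relative_frequencies(values_1)
    q = relative_frequencies(values_2)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Marginal check on Z: donor file versus fused file (hypothetical values).
donor_z = ["low", "low", "mid", "high", "mid", "low"]
fused_z = ["low", "mid", "mid", "high", "mid", "low"]
tvd_z = total_variation_distance(donor_z, fused_z)

# Joint check on X-Z: the same function works on (x, z) pairs.
donor_xz = [("N", "low"), ("N", "mid"), ("S", "low"), ("S", "high")]
fused_xz = [("N", "low"), ("N", "low"), ("S", "low"), ("S", "high")]
tvd_xz = total_variation_distance(donor_xz, fused_xz)
```

Small values suggest that the fused file reproduces the distribution observed in the donor file; how small is small enough remains a judgement that depends on the sample sizes and on the use of the fused data.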
Little has been done to evaluate the accuracy of the estimators of parameters (correlation coefficients, regression coefficients, general measures of association) when applied to fused data and, in particular, to estimate their Mean Square Error (MSE = Bias² + Var). The problem of bias has been studied focusing particularly on the misuse of the CIA and on some specific matching algorithms [3] [29]. Barry [3] derived a theoretical result for "random draw" matching that showed how the correlation coefficient between Y and Z is seriously underestimated unless some assumptions hold (e.g. the CIA). The same findings were empirically derived for general measures of association between Y and Z and, generally, when "random draw" is replaced with matching based on distance functions [2] [3] [22].

A heuristic tool useful for getting an idea of the bias in the association among the fused variables introduced by any statistical matching technique is the folded database procedure (see e.g. [19]). The variables of one of the original databases (usually the larger one) are partitioned into three distinct sets, say G', G'' and G'''. Then the chosen database is randomly partitioned into two sub-samples: G' and G''' are hidden in the first and second sub-sample respectively, reproducing the typical statistical matching set-up. These two sub-samples are matched according to the chosen procedure: the result is the folded database. The estimates obtained from the folded database are compared with those derived from the original one so as to compute the bias. The estimate of bias given by this method relies heavily on the assumption that the variables G', G'' and G''' behave approximately as the target variables X, Y and Z.

When all the variables (X, Y and Z) are observed on a group of units, even a small one, this common information can be used to estimate the bias indirectly. If, for example, the interest lies in some kind of measure that involves the relation Y-Z, its bias could be evaluated by comparing simple measures of association/correlation computed from fused and unfused data. Cannon and Seamons [5] carried out this comparison after grouping units into suitable homogeneous classes. The natural extension of this approach is to regress these measures computed on fused data on the same measures derived from the unfused data-set.

Rubin [24] [25] suggests carrying out a sensitivity analysis. His idea consists essentially in analysing all the final fused data-sets obtained by applying multiple imputation.
This approach also has the advantage of allowing the variance of the estimators applied to fused data to be estimated. However, real applications of this method are not yet available, given some difficulties in managing the problem in a multiple imputation context. The problem of estimating the variance of estimators applied to statistically matched data has received little attention compared with that of bias. Nevertheless, we believe that this issue should be further investigated. In particular, it should be verified whether the theories developed for estimating the variance due to imputing missing values in a single data set also apply in this context.
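The folded database procedure described in this paragraph can be sketched in a few lines of Python. The fragment below is an illustration under explicit assumptions, not the implementation of [19]: the simulated variables g1, g2 and g3 (standing for G', G'' and G'''), the nearest-neighbour matching on g2 and the use of the correlation coefficient as the quantity of interest are all choices made for the example. The data are generated so that g1 and g3 share a component not explained by g2, i.e. so that conditional independence given the matching variable does not hold.

```python
import random


def pearson(u, v):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def folded_database_bias(database, rng):
    """Estimate the bias of corr(g1, g3) induced by matching on g2."""
    shuffled = database[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # First sub-sample: g1 is hidden, so it keeps (g2, g3) and acts as donor.
    donors = [{"g2": r["g2"], "g3": r["g3"]} for r in shuffled[:half]]
    # Second sub-sample: g3 is hidden, so it keeps (g1, g2) as recipient.
    recipients = [{"g1": r["g1"], "g2": r["g2"]} for r in shuffled[half:]]
    # Match the two sub-samples on g2: the result is the folded database.
    folded = []
    for rec in recipients:
        donor = min(donors, key=lambda d: abs(d["g2"] - rec["g2"]))
        folded.append({"g1": rec["g1"], "g2": rec["g2"], "g3": donor["g3"]})
    original = pearson([r["g1"] for r in database], [r["g3"] for r in database])
    fused = pearson([r["g1"] for r in folded], [r["g3"] for r in folded])
    return fused - original, original, fused


rng = random.Random(42)
database = []
for _ in range(1000):
    g2 = rng.gauss(0, 1)
    shared = rng.gauss(0, 1)  # link between g1 and g3 not explained by g2
    database.append({
        "g1": g2 + shared + rng.gauss(0, 0.5),
        "g2": g2,
        "g3": g2 + shared + rng.gauss(0, 0.5),
    })

bias, original_corr, folded_corr = folded_database_bias(database, rng)
```

In this set-up the correlation computed on the folded database is lower than the one computed on the original database, in line with the underestimation discussed above. Repeating the random split several times would also give an idea of the variability of the estimate, and hence of both components of the MSE, always under the assumption that G', G'' and G''' behave like the target variables.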

5. Conclusions

In recent years data integration has received increasing attention, given the growing number of available data sources and the growing need to study the relation between phenomena of different types. In this paper, we stress that integration is not only an information technology problem but also a statistical one. We focused on statistical matching, which represents a fast, inexpensive and flexible tool for integrating different data sources. However, many problems arise and they need to be further studied for a successful application of statistical matching.


A first major problem, usually recurring in NSIs, is that of harmonising the different data sources. This step may prove somewhat expensive and time consuming, especially if the sources, sample or total surveys, use different scales of measurement for the common variables and different approaches to deal with partial and total non-response, measurement errors, etc. (see e.g. [23]). The second problem is that of model specification according to the available information. As already mentioned, whenever auxiliary information is available it should be used in the matching process. In our opinion auxiliary information should include any prior knowledge about the phenomena investigated. Thus, for example, wherever possible logical constraints should be introduced in the matching step so as to prevent synthetic records from presenting conflicting values for variables not jointly observed. This enhancement makes it possible to relax the CIA and to produce fused records of higher quality. The third major open problem is that of quality evaluation. As already mentioned in par. 4, a series of methods should be developed in order to estimate both the bias and the variance, i.e. the MSE, of each estimator that has to be applied to fused data. We think that some progress can be made in the field of variance estimation by borrowing some of the methods introduced to estimate the variance due to imputing missing values. Finally, it has to be remembered that most of the problems that arise when applying statistical matching can be overcome in the ideal situation of nested surveys [15]. In such a case, the data sources derive from surveys designed firstly to study a particular phenomenon and secondly to build up a comprehensive database. In this context, the harmonisation step can be avoided by using the same definitions for the common variables and the same methodology to deal with the usual survey problems.
Additionally, by focusing on a particular phenomenon it is possible to improve the global quality of the observed data by means of shorter questionnaires and hence less total and partial non-response. Finally, the nested survey can be planned so as to collect the amount of information needed to reasonably estimate the chosen model. For instance, observing all variables on a small sub-sample can reduce the bias due to the CIA.

Acknowledgements

We are grateful to A. Coli, G. Proto, A. Solipaca, F. Tartamella for their helpful comments.

References

[1] Agresti, A., Categorical Data Analysis. J. Wiley & Sons, New York, 1990.
[2] Barr, R. S., Stewart, W. H. and Turner, J. S., An Empirical Evaluation of Statistical Matching Methodologies. School of Business, Southern Methodist University, Dallas, 1981.
[3] Barry, J. T., An Investigation of statistical matching. Journal of Applied Statistics, 1988, 15, pp. 275-283.
[4] Brown, M., Enhancing media survey value through data fusion. Proceedings of the ESOMAR Congress, Luxembourg, 1991, pp. 573-592.
[5] Cannon, H. M. and Seamons, B. L., Simulating single source data: how it fails us just when we need it most. Journal of Advertising Research, 1995, 35, pp. 53-63.
[6] Citoni, G., Di Nicola, F., Lugaresi, S. and Proto, G., Statistical matching for tax-benefit microsimulation modelling: a project for Italy. Workshop on Statistical Matching, ISPE, 26 September 1991, Roma, Italy, 1991, pp. .
[7] Coccia, G., Gabrielli, D. and Sorvillo, M. P., Perspectives of using record linkage techniques in demographic context (Italian). Quaderni di Ricerca ISTAT, 7, 1993.
[8] Coli, A. and Tartamella, F., The link between National Accounts and households micro data. Meeting of the Siena Group on Social Statistics, 22–24 May 2000, Maastricht, the Netherlands, 2000.
[9] Coli, A. and Tartamella, F., A pilot social accounting matrix for Italy with a focus on households. 26th General Conference of the International Association for Research in Income and Wealth, 27 August–2 September 2000, Cracow, Poland, 2000.
[10] Fortunato, E. and Morrone, A., Micro and macro approaches to the linkage of data coming from ISTAT sample surveys on households (Italian). Proceedings of the Conference of Italian Statistical Society, 7-9 June 1999, Udine, Italy, 2000, Vol. 2, pp. 446-456.
[11] Goel, P. K. and Ramalingam, T., The matching methodology: some statistical properties. J. Berger, S. Fienberg, J. Gani, K. Krickeberg and B. Singer (eds.) Lecture notes in statistics, Springer Verlag, New York, 1989.
[12] ISTAT, Annual Report, The Country Situation in 1998 (Italian). ISTAT, Rome, Italy, 1999.
[13] Kadane, J. B., Some statistical problems in merging data files. Compendium of Tax Research, Office of Tax Analysis, Department of Treasury. US Government Printing Office, Washington D.C., 1978, pp. 159-179.
[14] Kamakura, W. A. and Wedel, M., Statistical data fusion for cross-tabulation. Journal of Marketing Research, 1997, 34, pp. 485-498.
[15] Kroese, A. H. and Renssen, R. H., New applications of old weighting techniques; constructing a consistent set of estimates based on data from different sources. Technical Report, Statistics Netherlands, 2000.
[16] Little, R. J. A. and Rubin, D. B., Statistical Analysis with Missing Data. J. Wiley & Sons, New York, 1987.
[17] Liu, T. P. and Kovacevic, M. S., Statistical Matching of survey data files. Technical Report, Statistics Canada, 1996.
[18] Masselli, M. and Venturi, M., Integration between Statistical Sources. A strategy for an Integrated System of Censuses and Current Surveys (Italian). Proceedings of the Conference of Italian Statistical Society, Padova, Italy, 1990.
[19] Paass, G., Statistical match: Evaluation of existing procedures and improvements by using additional information. G. H. Orcutt and H. Quinke (eds.) Microanalytic Simulation Models to Support Social and Financial Policy. Elsevier Science, Amsterdam, 1986, pp. 401-422.
[20] Renssen, R. H., Use of statistical matching techniques in calibration estimation. Survey Methodology, 1998, 24, pp. 171-183.
[21] Rodgers, W. L., An evaluation of statistical matching. Journal of Business and Economic Statistics, 1984, 2, pp. 91-102.
[22] Rodgers, W. L. and DeVol, E., An evaluation of statistical matching. Proceedings of the American Statistical Association, Section on Survey Research Methods, Washington D.C., 1981, pp. 128-132.
[23] Rosati, N., Statistical Matching of ISTAT expenditures data and BankItalia income data for 1995 (Italian). Technical Report, Dept. Statistics, University of Padova, Padova, Italy, 1998.
[24] Rubin, D. B., Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics, 1986, 4, pp. 87-94.
[25] Rubin, D. B., Multiple Imputation for Nonresponse in Surveys. John Wiley, New York, 1987.
[26] Schafer, J. L., Analysis of Incomplete Multivariate Data. Chapman & Hall, 1997.
[27] Singh, A. C., Log-linear imputation. Proceedings of the 5th Annual Research Conference. U.S. Bureau of the Census, Washington D.C., 1989, pp. 118-132.
[28] Singh, A. C., Mantel, H., Kinack, M. and Rowe, G., On methods of statistical matching with and without auxiliary information. Technical Report, DDMD-90-016E, 1990.
[29] Singh, A. C., Mantel, H., Kinack, M. and Rowe, G., Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption. Survey Methodology, 1993, 19, pp. 59-79.
[30] Solipaca, A. and Proto, G., A statistical matching application to income and health expenditures (Italian). Technical Report, ISTAT, Roma, 2001.
[31] United Nations, System of National Accounts. United Nations, New York, 1993.
[32] Van der Laan, P., Integrating administrative registers and household surveys. Netherlands Official Statistics, 2000, 15, pp. 7-15.
[33] Vousten, R. and de Heer, W., Reducing non-response: the POLS fieldwork design. Netherlands Official Statistics, 1998, 13, pp. 16-19.
