2003 EDITION

COPYRIGHT

Luxembourg: Office for Official Publications of the European Communities, 2004 ISBN 92-894-5766-X ISSN 1725-5406 Cat. No. KS-CR-03-004-EN-N © European Communities, 2004

Monographs of official statistics
Work session on statistical data confidentiality
Luxembourg, 7 to 9 April 2003
Part 1

EUROPEAN COMMISSION

THEME 1
General statistics

Europe Direct is a service to help you find answers to your questions about the European Union New freephone number:

00 800 6 7 8 9 10 11

A great deal of additional information on the European Union is available on the Internet. It can be accessed through the Europa server (http://europa.eu.int).

Acknowledgements

The Work Session on Statistical Data Confidentiality was jointly organized by Mr. Håkan Linden from the Eurostat Unit for Research and Mr. Juraj Riecan from the UN Economic Commission for Europe. UNECE and Eurostat gratefully acknowledge the valuable contributions of all the participants, and particularly the session organizers and discussants, who were the following: Mr. David Brown (United Kingdom), Mr. Lawrence Cox (United States), Ms. Luisa Franconi (Italy), Mr. Ramesh A. Dandekar (United States), Mr. Josep Domingo Ferrer (Spain), Ms. Sarah Giessing (Germany), Mr. Anco Hundepool (Netherlands), Mr. John King (Eurostat) and Mr. Julian Stander (United Kingdom). The proceedings have been produced by Ms. Liv Belsby from the Eurostat Unit for Research.

Foreword

Statistical confidentiality has become more important as the amount of and the demand for data have increased. The principles of Official Statistics adopted at the annual session in 1991 by the United Nations Economic Commission for Europe also emphasize the importance of this topic. Furthermore, the importance of statistical disclosure control increases with the growing use of the Internet for remote access to statistical data by a community of users, as well as with the increasing availability of other data sets.

The workshop covered a wide range of aspects of statistical confidentiality. New theories and emerging methods for statistical disclosure limitation may be based on a deterministic or a probabilistic approach. Some of these methods aim to control additivity in tables, while others strive to preserve expected totals or sufficient statistics in the associated models.

Over the years, the technology for data release has evolved rapidly. The Luxembourg Income Study and Statistics Denmark solutions, where the data can be analyzed by submitting programs written in SAS, SPSS, GAUSS or STATA by e-mail, illustrate feasible solutions. Statistical disclosure control of the output remains a challenge. Methods to prevent the release of confidential data and to evaluate the risk of disclosure were presented at the workshop, while other presentations focused on disclosure risk from multiple database mining.

Emerging legal and regulatory issues are important aspects of confidentiality. The papers illustrate the very wide range of backgrounds – legal, regulatory and pragmatic – that pertain in both the Member States and other countries. One contribution describes the background to, and the implementation at Eurostat of, Regulation 831/2002. This regulation concerns access to confidential microdata from the Member States. Furthermore, some papers are concerned with data research centres, scientific purposes and control of the research work.

Small area statistics represent a particular challenge for statistical disclosure protection. The contributions concentrate mostly on social and demographic data. Various geo-coding methods and related risks are also discussed.

Requests from the research community and Regulation 831/2002 have created a need to decide whether a data file is safe or not, and thus a need for measures of disclosure risk. File-level risk and per-record risk are discussed, together with their estimation.

The papers on software tools for statistical disclosure control focus mainly on practical applications, but some methodological issues and algorithms are also considered. Argus is an important software package used to evaluate and reduce disclosure risk, both for microdata and for tabular data. The Argus software is developed under the CASC project, partly funded by the 5th EU Framework Programme.

Pedro Díaz Muñoz

Juraj Riecan

Table of Contents

Acknowledgements
Foreword

DISCUSSION PAPER FOR TOPIC (I): NEW THEORIES AND EMERGING METHODS
– Balancing Quality and Confidentiality for Tabular Data
– A Query-Overlap Restriction for Statistical Database Security
– Different Methods in a Common Framework to Protect Tables
– An Algorithm for Computing Full Rank Minimal Sufficient Statistics with Applications to Confidentiality Protection
– Microdata Disclosure by Resampling – Empirical Findings for Business Survey Data
– The Determination of Intervals of Suppressed Cells in an n-dimensional Table
– Rounding as a Confidentiality Measure for Frequency Tables in StatBank Norway
– Recent Research Results on the Conditional Distribution Approach for Data Perturbation
– The Noise Method for Tables – Research and Applications at Statistics New Zealand

DISCUSSION PAPER FOR TOPIC (II): ON NEW DATA RELEASE TECHNIQUES
– Accessing Microdata via the Internet
– LISSY: A System for Providing Restricted Access to Survey Microdata from Remote Sites
– Data Mining Methods for Linking Data Coming from Several Sources
– Providing Remote Access to Data: The Academic Perspective
– From On-Site to Remote Data Access – The Revolution of the Danish System for Access to Micro Data
– Implementing Statistical Disclosure Control for Aggregated Data Released via Remote Access

DISCUSSION PAPER FOR TOPIC (III): EMERGING LEGAL/REGULATORY ISSUES
– New Ways of Access to Microdata of the German Official Statistics
– Contexts for the Development of a Data Access and Confidentiality Protocol for UK National Statistics
– Developments at Eurostat for Research Access to Confidential Data
– Report on Application of the Principle of Statistical Confidentiality in Kyrgyzstan
– Questions Relating to the Confidentiality of Statistical Information at the National Statistical Service of the Republic of Armenia
– Demand of Data and Options of Analysing Data: The Research Data Centre of the Statistical Offices of the Länder

DISCUSSION PAPER FOR TOPIC (IV): CONFIDENTIALITY ISSUES FOR SMALL AREAS
– Disclosure Limitation for Census 2000 Tabular Data
– Neighbourhood Statistics in England and Wales: Disclosure Control Problems and Solutions
– Different Approaches to Disclosure Control Problem Associated with Geography
– Zip Code Tabulation Area and Confidentiality

DISCUSSION PAPER FOR TOPIC (V): RISK ASSESSMENT
– On Models for Statistical Disclosure Risk Estimation
– Assessing Individual Risk of Disclosure: An Experiment
– Some Remarks on the Individual Risk Methodology
– Assessing Disclosure Risk and Data Utility: A Multiple Objectives Decision Problem
– A Graph Theoretical Approach to Record Linkage

DISCUSSION PAPER FOR TOPIC (VI): SOFTWARE TOOLS FOR STATISTICAL DISCLOSURE CONTROL
– Using DIS to Modify the Classification of Special Uniques
– The ARGUS Software
– Cell Suppression in Eurostat on Structural Business Statistics: An Example of Statistical Disclosure Control on Tabular Data
– MASSC: A New Data Mask for Limiting Statistical Information Loss and Disclosure
– Co-ordination of Cell Suppressions: Strategies for Use of GHMITER
– SAFE – A Method for Statistical Disclosure Limitation of Microdata
– The Statistical Protecting of the European Structure of Earnings Survey Data
– Bureau of Transportation Statistics' Prototype Disclosure Limitation Software for Complex Tabular Data
– Cost Effective Implementation of Synthetic Tabulation (a.k.a. Controlled Tabular Adjustments) in Legacy and New Statistical Data Publication Systems

Discussion Paper for Topic (i): New theories and emerging methods

Lawrence H. Cox (USA)

1. The four invited and five contributed papers on this topic draw upon a diversity of ideas and methods in the mathematical sciences, including mathematical statistics, graph theory, optimization theory and mathematical programming, and focus on an important area of official statistics. The authors are to be commended for interesting and thought-provoking contributions.

2. I will not repeat what was presented to summarize the papers individually. Instead, I will present some thoughts and impressions I formed from reading the set of papers as a whole, and frame these as questions that work session participants may find useful as a starting point or framework for general discussion on this topic.

Statistical Versus Mathematical Approaches to SDL

3. Statistical disclosure limitation (SDL) for microdata relies substantially upon statistical techniques, as illustrated by Working Papers 24 and 27. SDL for tabular data has traditionally drawn upon mathematical ideas from optimization and graph theory (Working Papers 2, 3, 25, 26), while new approaches based on statistics (Working Papers 5 and 28) and hybrid approaches (Working Paper 4) are emerging. This, I think, is very healthy for SDL, official statistics and the mathematical sciences. There are, however, points of divergence. If a protection method relies on deterministic techniques, then how effective is it (how effective can it be) against a probabilistic attack, and conversely?

4. The methods described in Working Papers 2, 3, 4, 25 and 26 apply deterministic methods from graph theory and linear programming to ensure that a protection interval is not breached by deterministic means. However, emerging empirical evidence indicates that probabilistic attacks, such as via iterative proportional fitting (IPF), tend to come uncomfortably close to recovering original cell values, despite the fact that the conditions for applying IPF, viz. a missing-at-random assumption, are not met. Conversely, protective methods based on probability theory do not inherently guarantee minimal levels of protection.
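To make the kind of probabilistic attack mentioned in paragraph 4 concrete, the following is a minimal sketch (not taken from any of the Working Papers) of iterative proportional fitting applied by an intruder who knows only the published row and column totals of a two-way table; the interior values and the uniform starting seed are invented.

```python
# Sketch of an IPF-based attack: rescale a seed table until its margins match the
# published row and column totals.  All numbers here are invented for illustration.
import numpy as np

def ipf(seed, row_totals, col_totals, iters=100):
    x = seed.astype(float).copy()
    for _ in range(iters):
        x *= (row_totals / x.sum(axis=1))[:, None]   # match row margins
        x *= (col_totals / x.sum(axis=0))[None, :]   # match column margins
    return x

true_interior = np.array([[40.0, 2.0],
                          [3.0, 5.0]])               # suppressed interior cells
rows, cols = true_interior.sum(axis=1), true_interior.sum(axis=0)

estimate = ipf(np.ones_like(true_interior), rows, cols)
print(np.round(estimate, 1))                         # the intruder's estimate of the interior
```

Whether such an estimate lands close to the true values depends on the particular table; the empirical point of paragraph 4 is that it often does, even though the missing-at-random assumption behind IPF does not hold.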


Is it preferable to assure that totals and other statistics are preserved exactly or approximately, e.g., by using deterministic methods, or to preserve expected values of important quantities, e.g., by using randomization?

5. Working Papers 3 and 4 control important properties of tabular data, such as additivity, directly using linear constraints, whereas Working Paper 28 preserves expected values of totals. Working Paper 4 aims at maintaining control while preserving statistical properties; Working Paper 5 aims to preserve sufficient statistics for associated statistical models. The central issue is not to compare approaches but how to combine the desirable features of both.

Data Quality, Utility and Analyzability Issues

6. Methods based on resampling (Working Paper 24) and random perturbation (Working Paper 27) assure that important statistical properties of microdata are preserved. Working Paper 28 perturbs tabular data by applying random perturbation to the underlying microdata. SDL methods for microdata that assure consistency with corresponding totals or estimates at important levels of aggregation are a logical next step for microdata SDL research. Methods such as that of Working Paper 2, which appear focused on protection only, can be applied to optimize information loss, subject to meeting protection constraints, like those described in Working Paper 3. Methods based on rounding (Working Paper 26) produce useable data constrained to lie near the original data.

Is it preferable to control changes to individual data (e.g., cell values) or to control overall measures of data quality (e.g., mean values)?

7. Users of official statistics data cover the spectrum of analytical interests, from interest in one number (with or without concern for its accuracy or precision) to interest in using all available data to model and predict complex socio-economic phenomena. In general, a clear choice between the importance of an individual datum and broad-gauge statistical measures is not available. Working Paper 4 raises this issue, demonstrates for tabular data that these objectives are in competition, and offers a compromise solution. Methods that balance competing quality objectives with confidentiality protection are the next step in research on SDL for tabular data.

What are appropriate, usable measures or criteria for judging the effects of SDL on data quality?

8. For microdata, the extent to which resampling and perturbative methods can preserve distributional properties of the original microdata is quantifiable. For tabular data, traditional approaches have relied on measures based on minimizing familiar (Euclidean) metrics, as in Working Paper 3. A notable exception is rounding, which can be performed relative to traditional norms or to expected value. Working Paper 4 argues that these measures alone are insufficient for statistical purposes.


How Good Is Good Enough?

9. Modern methods for complementary cell suppression in tabular data are based on mathematical optimization (Working Paper 2). The optimization is performed over any of several standard figures of merit (number of cells suppressed, total value suppressed, total percentage suppressed, etc.) that can be expressed as integer linear functions of appropriate variables. The same can be said for controlled tabular adjustment. These objective functions are convenient but do not necessarily coincide with figures of merit related to statistical properties and the analytical utility of the data (Working Paper 4). In view of the tension between "local" and "global" criteria for analyzability discussed in paragraph 7, it is not clear that mathematically optimal solutions are necessarily optimal for data analytic purposes.

Is it preferable to represent data quality and analyzability criteria as constraints rather than mathematical objectives, which is to say that two solutions meeting the quality constraints are for most purposes interchangeable?

Perturbation Strategy

10. Most of the papers on this topic, and others in this work session, involve perturbative methods. For tabular data, perturbation can be applied either to the tabulations or to the underlying microdata (if available). If applied in a randomized manner to the microdata, then any and all tabulations created by summing from these microdata can be released. Conversely, if tabulations are first perturbed and disclosure-limited, then, using, e.g., techniques similar to those of Working Paper 4, one can work backwards to produce microdata consistent with the tabular cell values. It is time for SDL researchers to investigate the interplay between tabular data and microdata.

Is it preferable to perturb tabular data or to perturb its underlying microdata?


BALANCING QUALITY AND CONFIDENTIALITY FOR TABULAR DATA

Lawrence H. Cox (1) and James P. Kelly (2)

(1) National Center for Health Statistics, Centers for Disease Control and Prevention, Hyattsville, MD 20782 USA
(2) OptTek Systems, Inc., Boulder, CO 80302 USA

Keywords. Controlled tabular adjustment, linear programming, regression

1 Introduction

Since their inception, national statistical offices (NSOs) have provided users with data in tabular form, and tabular data remain a staple of official statistics. Examples include count data such as age-race-sex and other demographic data, concentration or percentage data such as in financial or energy statistics, and magnitude data such as retail sales or daily air pollution. Confidentiality problems were first investigated for tabular data (Fellegi 1972). For magnitude data, the most studied and used disclosure limitation method has been complementary cell suppression (Cox 1980). Additive relationships defining the tabular system are represented in a linear system of equations TX = 0, where X represents the tabular cells and T the tabular equations, viz., the entries of T belong to the set {-1, 0, +1} with precisely one -1 in each row.

Dandekar and Cox (2002) proposed a disclosure limitation method for tabular data, originally called synthetic tabular data and recently known as controlled tabular adjustment (CTA). This methodology is motivated by user dissatisfaction with complementary cell suppression, especially the removal of useful information and the difficulties in analyzing tabular systems with cell values missing not at random. CTA replaces the values of certain tabulation cells, called sensitive cells, that cannot be published due to confidentiality concerns, with safe values, viz., values sufficiently far from the true value. Because the adjustments almost certainly throw the additive tabular system out of kilter, CTA adjusts some or all of the nonsensitive cells by small amounts to restore additivity. In terms of ease of use, controlled tabular adjustment is unquestionably an improvement over complementary cell suppression.

Because CTA changes cell values, especially sensitive values, the question arises: are the effects of CTA on data analytical outcomes acceptable or not? Cox and Dandekar (2003) describe how the basic methodology can be implemented with an eye towards preserving analytical outcomes. We examine this problem further, focusing on preserving mean, variance, correlation and regression between original and adjusted data. We offer two methodologies for preserving these


quantities. The first is an approximate method based on linear programming. Its benefits include simplicity and flexibility, easy implementation by a wide class of users, and insight into the extent to which distributional parameters can be preserved. The second method is a direct search strategy based on Tabu Search. Its benefits include achieving optimal results across many types, sizes and complexities of problems and the ability to optimize other nonlinear statistics, e.g., chi-square for count data.

Section 2 provides a summary of the Dandekar-Cox CTA methodology. Section 3 provides linear programming formulations for preserving mean, variance, correlation and regression slope between original and adjusted data, and introduces one solution that works well for each of these quantities and is easy to implement. The methods are applied to a two-dimensional table of magnitude data based on real data and a hypothetical three-dimensional table with complex structure. In Section 4, we provide a method based on Tabu Search and apply it to the same two examples. Section 5 provides a discussion of issues and concluding comments. Our presentation has been shortened to meet a proceedings page limit.

2 Controlled tabular adjustment

CTA is applicable to tabular data in any form, but for convenience we focus on magnitude data, where the greatest benefits are to be found. A simple paradigm for statistical disclosure in magnitude data is as follows. A tabulation cell, denoted i, comprises k respondents (e.g., retail clothing stores in a county) and their data (e.g., retail sales and employment data). The NSO assumes that any respondent is aware of the identity of the other respondents. The cell value is the total value of a statistic of interest (e.g., total retail sales), summed over the (nonnegative) contributions of each respondent in the cell i to this statistic. Denote the cell value v(i) and the respondent contributions v_j(i), ordered from largest to smallest. It is possible for any respondent J to compute v(i) - v_J(i), which yields an upper estimate of the contribution of any other respondent. This estimate is closest, in percentage terms, when J = 2 and j = 1. A standard disclosure rule, the p-percent rule, declares that the cell value represents disclosure if this estimate is closer than p-percent of the largest contribution. This condition defines the sensitive cells.

The NSO may also assume that any respondent can use public knowledge to estimate the contribution of any other respondent to within q-percent (q > p, e.g., q = 50%). This additional information allows the second largest respondent to estimate v(i) - v_1(i) - v_2(i), the sum of all contributions excluding itself and the largest, to within q-percent. This upper estimate provides the second largest respondent with a lower estimate of v_1(i). The lower and upper protection limits for the cell value equal, respectively, the minimum amount that must be subtracted


from (added to) the cell value so that these lower (upper) estimates are at least p-percent away from the true value v_1(i). Numeric values below the lower or above the upper protection limit are safe values for the cell. A common NSO practice assumes that these protection limits are equal, to p_i.

Complementary cell suppression suppresses all sensitive cells from publication, replacing sensitive values by variables in the tabular system TX = 0. Because, almost surely, one or more suppressed sensitive cell values can be estimated via linear programming to within p-percent of their true values, it is necessary to suppress some nonsensitive cells until no sensitive estimates are within p-percent. This yields a mixed integer linear programming (MILP) problem (Fischetti and Salazar 2000).

The controlled tabular adjustment methodology (Dandekar and Cox 2002) first replaces each sensitive value with a safe value. This is an improvement over complementary cell suppression as it replaces a suppression symbol by an actual value. However, safe values are not necessarily unbiased estimates of the true values. To minimize bias, Dandekar and Cox (2002) replace the true value with one of its protection limits, v(i) - p_i or v(i) + p_i. Because these assignments almost surely throw the tabular system out of kilter, CTA adjusts nonsensitive values to restore additivity. Because the choices to adjust each sensitive value down or up are binary, combined these steps define a MILP (Cox 2000). Dandekar-Cox present heuristics for the binary choices. This relaxed linear program is easily solved.

Mathematical programming and statistics are different disciplines. A (mixed integer) linear program in itself will not assure that the analytical properties of original and adjusted data are comparable. Cox and Dandekar (2003) address these issues in three ways. First, sensitive values are replaced by the closest possible safe values. Second, capacities are imposed on changes to nonsensitive values to ensure that adjustments to individual data values are acceptable. Statistically sensible capacities would, e.g., be based on an estimated measurement error e_i for each cell. Third, the linear program is optimized with respect to an overall measure of data distortion such as minimum sum of absolute adjustments or minimum sum of percent absolute adjustments.

Assume that there are n tabulation cells, of which s are sensitive cells, and that cells i = 1, ..., s are sensitive. The original data are the n x 1 vector a, the adjusted data are a + y^+ - y^-, and y = y^+ - y^-. The MILP of Cox (2000) for minimizing the sum of absolute adjustments is:

   min ∑_{i=1}^{n} (y_i^+ + y_i^-)

   subject to:
   T(y) = 0
   y_i^- = p_i (1 - I_i),   y_i^+ = p_i I_i,   I_i binary,   i = 1, ..., s
   0 ≤ y_i^-, y_i^+ ≤ e_i,   i = s+1, ..., n

To ensure feasibility, capacities on nonsensitive cells may be increased. A companion


strategy, albeit controversial, allows sensitive cell adjustments smaller than p_i in well-defined situations. This is justified mathematically because the intruder does not know whether the adjusted value lies above or below the original value (Cox and Dandekar 2003). The Cox-Dandekar constraints are useful. Unfortunately, the choices for the optimizing measure are limited to linear functions. In the next two sections, we extend this paradigm in two separate directions, focusing on approaches to preserving mean, variance, correlation and regression between original and adjusted data.
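As an illustration of the Cox (2000) formulation displayed above, the following is a minimal sketch, not the authors' implementation, that builds the tabular system T for a small two-way table with row, column and grand totals and solves the MILP with SciPy's bundled mixed-integer solver. The example table, the choice of sensitive cells, the protection limits p_i and the 50-percent capacities e_i are all invented for illustration.

```python
# Minimal sketch of the Cox (2000) CTA MILP for a 2x3 table with marginal totals.
# All data, sensitive cells, protection limits and capacities are hypothetical.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Cell layout: 6 interior cells (row-major), 2 row totals, 3 column totals, 1 grand total.
interior = np.array([[20., 50., 30.],
                     [10., 40., 25.]])
a = np.concatenate([interior.ravel(),
                    interior.sum(axis=1),          # row totals: cells 6, 7
                    interior.sum(axis=0),          # column totals: cells 8, 9, 10
                    [interior.sum()]])             # grand total: cell 11
n = a.size

# Tabular equations T a = 0: +1 for each contributing cell, exactly one -1 per row.
T = np.zeros((6, n))
for r in range(2):                                 # row-sum equations
    T[r, [3 * r, 3 * r + 1, 3 * r + 2]] = 1
    T[r, 6 + r] = -1
for c in range(3):                                 # column-sum equations
    T[2 + c, [c, 3 + c]] = 1
    T[2 + c, 8 + c] = -1
T[5, [6, 7]] = 1                                   # grand-total equation
T[5, 11] = -1

sensitive = [0, 4]                                 # hypothetical sensitive cells
p = {0: 5.0, 4: 8.0}                               # hypothetical protection limits p_i
e = 0.5 * a                                        # hypothetical capacities e_i (50% of value)
s = len(sensitive)

# Decision vector z = [y_plus (n), y_minus (n), I (one binary per sensitive cell)].
c_obj = np.concatenate([np.ones(2 * n), np.zeros(s)])        # min sum(y+ + y-)

rows, lo, hi = [np.hstack([T, -T, np.zeros((6, s))])], [0.0] * 6, [0.0] * 6   # T(y+ - y-) = 0
for k, i in enumerate(sensitive):
    r1 = np.zeros(2 * n + s); r1[n + i] = 1; r1[2 * n + k] = p[i]    # y-_i + p_i I_i = p_i
    r2 = np.zeros(2 * n + s); r2[i] = 1; r2[2 * n + k] = -p[i]       # y+_i - p_i I_i = 0
    rows += [r1[None, :], r2[None, :]]; lo += [p[i], 0.0]; hi += [p[i], 0.0]
cons = LinearConstraint(np.vstack(rows), lo, hi)

cap = np.where(np.isin(np.arange(n), sensitive), np.inf, e)  # sensitive y's fixed by equalities
bounds = Bounds(np.zeros(2 * n + s), np.concatenate([cap, cap, np.ones(s)]))
integrality = np.concatenate([np.zeros(2 * n), np.ones(s)])  # I_i in {0, 1}

res = milp(c_obj, constraints=cons, integrality=integrality, bounds=bounds)
adjusted = a + res.x[:n] - res.x[n:2 * n]
print(np.round(adjusted, 1))
```

The relaxed linear program mentioned in the text corresponds to fixing each I_i in advance by a heuristic and solving the remaining problem as an ordinary linear program.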

3 Using linear programming to preserve distributional properties

Here we present linear programming formulations for preserving, exactly or approximately, mean, variance, correlation and regression slope between original and adjusted data. Numerical results are presented for a two-dimensional table of size 4x9 based on actual magnitude data and a hypothetical three-dimensional table of size 13x13x13 exhibiting nonhierarchical structure among the variables along each dimension.

Two observations are important. First, a putative weakness of CTA is that it may change sensitive values by a significant amount. The change is no worse than the effective change created by complementary cell suppression. Moreover, CTA provides reliable estimates not available to unsophisticated users under suppression. Still, the issue of change to sensitive cells deserves examination, and so we focus on the sensitive cells in the remainder of this paper even though for analytical purposes this is artificial. This focus is not limiting, however, as our formulations extend to all cells or to any subset(s) of cells of importance. Second, the methods described here can be applied in conjunction with a mixed integer linear program or after feasible choices for the directions of change (down/up) have been made via an appropriate heuristic.

3.1 Preserving mean values

Adjusted cell values can change analytical outcomes. Restricting adjustments to nonsensitive cell values to within, e.g., measurement error is a step in the right direction. However, adjustments to sensitive cells must be safe and are likely to be larger. The effects of adjustments on data analysis, particularly using linear models, are mitigated by preserving mean values. If the grand total is fixed (viz., its adjustment is capacitated to zero), then the grand mean is preserved. Similarly, any mean is preserved by capacitating its net total adjustment to zero. The mean of the adjusted sensitive values will equal the mean of the original sensitive values if and only if ∑(y_i^+ - y_i^-) = 0. The corresponding mathematical problem, namely, selecting two subsets of a set of (positive) values whose difference is minimal, is known as a partitioning problem. With fixed values for the downward and upward adjustments, achieving a minimum difference at or near zero may


be infeasible, so we allow the sensitive adjustments to vary sufficiently to achieve a zero partition. Assume for convenience that the sensitive adjustments may vary up to the limits q_i associated with public knowledge. The MILP is:

   min c(y)

   subject to:
   T(y) = 0
   ∑_{i=1}^{s} (y_i^+ - y_i^-) = 0
   p_i (1 - I_i) ≤ y_i^- ≤ q_i (1 - I_i),   p_i I_i ≤ y_i^+ ≤ q_i I_i,   I_i binary,   i = 1, ..., s
   0 ≤ y_i^-, y_i^+ ≤ e_i,   i = s+1, ..., n

c(y) is used to keep the adjustments close to their lower limits, e.g., c(y) = ∑ (y_i^+ + y_i^-).

3.2 Preserving variances

For any subset of cells of size t with the mean of y equal to zero (ȳ = 0), e.g., the sensitive cells:

   Var(a + y) = (1/t) ∑ (a_i + y_i - (ā + ȳ))^2 = Var(a) + (2/t) ∑ (a_i - ā) y_i + Var(y)

Define L(y) = Cov(a, y)/Var(a). As ȳ = 0, then L(y) = (1/(t Var(a))) ∑_{i=1}^{t} (a_i - ā) y_i, so

   Var(a + y)/Var(a) = 2 L(y) + (1 + Var(y)/Var(a))

and

   | Var(a + y)/Var(a) - 1 | = | 2 L(y) + Var(y)/Var(a) |

Thus, the relative change in variance can be minimized by minimizing the right-hand side. This can be done via approximation (not presented here), but as Var(y)/Var(a) is typically small, it often suffices to minimize |L(y)|, as follows: a) incorporate two new linear constraints into the system, w ≥ L(y) and w ≥ -L(y), and b) minimize w.

3.3 Maximizing correlation

We seek high positive correlation between original and adjusted values. Recall ȳ = 0.

   Corr(a, a + y) = ∑_{i=1}^{s} (a_i - ā)(a_i + y_i - ā - ȳ) / sqrt( ∑_{i=1}^{s} (a_i - ā)^2  ∑_{i=1}^{s} (a_i + y_i - ā - ȳ)^2 )

                  = s Var(a) (1 + L(y)) / sqrt( s Var(a) [ s Var(a) + 2 s Var(a) L(y) + ∑ y_i^2 ] )

The right-hand function is maximized by maximizing L(y) subject to the constraints. When Var(y)/Var(a) is small, min |L(y)| provides a good approximation to the optimum.


3.4 Preserving regression slope

We seek to preserve the ordinary least squares regression Y = β1 X + β0 of the adjusted data Y = a + y on the original data X = a, viz., we want β1 near one and β0 near zero.

   β1 = Cov(a + y, a) / Var(a) = 1 + L(y),    β0 = (ā + ȳ) - β1 ā

As ȳ = 0, then β0 = 0 and β1 = 1 if L(y) = 0 is feasible, corresponding to min |L(y)|.

3.5 A compromise solution to balance competing objectives

Variance is preserved by minimizing L(y); correlation by maximizing L(y); and regression by L(y) = 0 (if feasible), all subject to ȳ = 0. When Var(y)/Var(a) is small, typically the case, min |L(y)| will assure good results for all three. A shortcut is to impose L(y) = 0 and verify feasibility. L(y) near zero is motivated statistically as it implies no correlation between the values a and the adjustments y, which is plausible, e.g., because solutions y and -y are in most situations interchangeable.
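Since the derivations in Sections 3.2-3.4 all reduce to statements about L(y), a quick numerical check of the two identities they rely on may be helpful; the vectors below are arbitrary illustrative values and merely need the adjustments y to have mean zero.

```python
# Numerical check (illustrative values only) of the identities used above:
#   Var(a+y)/Var(a) = 1 + 2 L(y) + Var(y)/Var(a)   and   slope of (a+y) on a = 1 + L(y),
# both under the assumption that the mean of the adjustments y is zero.
import numpy as np

a = np.array([12.0, 7.0, 30.0, 5.0, 18.0, 9.0, 25.0])
y = np.array([3.0, -1.0, -4.0, 2.0, 0.5, -2.0, 1.0])
y = y - y.mean()                                    # enforce zero-mean adjustments

L = np.mean((a - a.mean()) * y) / np.var(a)         # L(y) = Cov(a, y) / Var(a)
var_ratio = np.var(a + y) / np.var(a)
slope = np.mean((a - a.mean()) * (a + y - (a + y).mean())) / np.var(a)

print(np.isclose(var_ratio, 1 + 2 * L + np.var(y) / np.var(a)))   # True
print(np.isclose(slope, 1 + L))                                   # True
```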

3.6 Results from numerical simulations

Here we report results of numerical simulations for a two-dimensional table and for a hypothetical three-dimensional table due to R. Dandekar. The two-dimensional table of size 4x9 was constructed from actual magnitude data and contains s = 7 sensitive cells (identified in Table 1 by their nonzero protection limits). Disclosure was defined by a (1 contributor, 70%)-dominance rule, viz., a cell is sensitive if the largest contribution exceeds 70% of the cell value, yielding protection levels p_i = v_1(i)/0.7 - v(i). The three-dimensional table is of size 13x13x13, contains approximately 100 sensitive cells, and involves nonhierarchical tabular constraints, viz., in the k-direction: (13) = (4) + (5) + (6); (12) = (1) + (6); (11) = (3) + (4); (10) = (3) + (6); (9) = (4) + (5); (8) = (1) + (2); and (7) = (3) + (8) + (13).

The 4x9 two-dimensional example is presented in Table 1. The seven protection limits p_i are provided below the cell data. Absolute adjustments to sensitive cell values are capacitated to lie between p_i and 50% of the original cell value; absolute nonsensitive cell adjustments are capacitated to lie between zero and 50% of the original cell value. This enforces zero adjustment to zero cells, a common practice. The 50% upper bounds are much broader than in practice, but facilitate computation and presentation for a table of this small size (36 cells) and high sensitivity.
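The dominance rule just described is easy to state in code; the following sketch (not the authors' code) flags a cell as sensitive when the largest contribution exceeds 70% of the cell value and returns the corresponding protection limit. The respondent contributions are invented; they are chosen only so that the cell value and protection limit match one of the sensitive cells of Table 1.

```python
# Sketch of the (1 contributor, 70%)-dominance rule and its protection limit
# p_i = v1(i)/0.7 - v(i).  The contributions below are hypothetical.
def dominance_rule(contributions, share=0.70):
    v = sum(contributions)                 # cell value v(i)
    v1 = max(contributions)                # largest contribution v1(i)
    if v1 > share * v:                     # sensitive cell
        return True, v1 / share - v        # protection limit p_i
    return False, 0.0

print(dominance_rule([63700.0, 4000.0, 2300.0]))   # (True, 21000.0): cell value v(i) = 70000
```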


Table 1. 4x9 table of magnitude data and protection limits for the seven sensitive cells (the cells with nonzero protection limits)

Original table (row totals in the last column, column totals in the last row):

    167500   317501  1283751   587501  4490751  3981001  2442001  1150000    70000  14490006
     56250  1487000   172500   667503  1006253   327500  1683000  1138250    46000   6584256
    616752   202750  1899502  1098751  2172251  3825251  4372753   300000   787500  15275510
         0    35000        0    16250        0        0    65000        0   140000    256250
    840502  2042251  3355753  2370005  7669255  8133752  8562754  2588250  1043500  36606022

Protection limits (+/-):

         0        0        0        0        0        0        0        0    21000
       625        0        0        0        0        0        0        0     7800
         0        0        0        0        0        0        0    40000        0
         0    10500        0     4875        0        0        0        0    42000

Table 2. Original table after various controlled tabular adjustments using linear programming

∑|y_i|:

    166875   307001  1283751   587501  4490751  3981001  2442001  1150000    91000  14499881
     56875  1487000   172500   667503  1006253   327500  1683000  1141875    38200   6580706
    616752   202750  1899502  1103626  2172251  3825251  4372753   260000   816300  15269185
         0    45500        0    11375        0        0    65000    36375    98000    256250
    840502  2042251  3355753  2370005  7669255  8133752  8562754  2588250  1043500  36606022

Var.:

    167500   317501  1283751   587501  4490751  3981001  2442001  1150000    91003  14511009
     55625  1487000   172500   667503  1006253   327500  1683000  1146675    38200   6584256
    616752   202750  1899502  1098751  2172251  3825251  4372753   260000   787498  15235508
         0    18791        0     8125        0        0    65000        0   191756    283672
    839877  2026042  3355753  2361880  7669255  8133752  8562754  2556675  1108457  36614445

Corr.:

    167500   317501  1283751   587501  4490751  3981001  2442001  1129000    91000  14490006
     55313  1499637   172500   667503  1006253   327500  1683000  1138250    34300   6584256
    616752   202750  1899502  1098751  2172251  3825251  4372753   359884   787500  15335394
       937    19250        0     8938        0        0    65000        0    94815    188940
    840502  2039138  3355753  2362693  7669255  8133752  8562754  2627134  1007615  36598596

Slope:

    167500   317501  1276439   587501  4490751  3981001  2442001  1150000    91000  14503694
     55625  1487000   172500   667503  1006253   327500  1683000  1138250    34420   6572051
    616752   202750  1899502  1106063  2172251  3825251  4372753   260000   787500  15242822
         0    19250        0     8938        0        0    65000        0   194267    287455
    839877  2026501  3348441  2370005  7669255  8133752  8562754  2548250  1107187  36606022

The first adjusted table minimizes total absolute deviation over all cells per Cox and Dandekar (2003). The next preserves the mean of the sensitive cells. The last three approximately preserve variance, correlation and regression slope for the sensitive cells. The set of sensitive cells is of no particular interest analytically, but we optimize over this set for several reasons. First, they provide a convenient means to demonstrate how important statistical properties can be preserved over subsets of the data. Second, doing so refutes the notion that CTA cannot mitigate the effects of (large) changes to sensitive values. Third, this represents a worst-case scenario and provides a basis of comparison for analyses based on the full table. Subject to optimizing the statistic of interest, each table also minimizes total absolute deviation. The last solution corresponds to L(y) near zero, the compromise solution. The statistics of interest are summarized in Table 3. The same simulations were performed for the three-dimensional table (not shown) and are summarized in Table 4.

Table 3. Summary of results of numeric simulations on 4x9 table using linear programming

                           Correlation   Regress. Slope   New Var. / Original Var.
  Sensitive Cells
    min |y_i|                 0.98            0.82              0.70
    Variance                  0.95            0.93              0.94
    Correlation               0.97            1.20              1.52
    Slope                     0.95            0.93              0.95
  All Cells
    All 4 Functions           1.00            1.00              1.00

Table 4. Summary of results of numeric simulations on 13x13x13 table using linear programming

                           Correlation   Regress. Slope   New Var. / Original Var.
  Sensitive Cells
    min |y_i|                 0.995           0.96              0.94
    Variance                  0.995           1.00              1.00
    Correlation               0.995           1.00              1.21
    Slope                     0.995           1.00              1.01
  All Cells
    All 4 Functions           1.00            1.00              1.00

4 Controlled tabular adjustment using Tabu Search

The linear programming methods of the previous section address nonlinear functions such as variance, correlation and regression slope only approximately. They are easy to use, but a direct approach would be useful for other problems, e.g., minimum chi-square. A heuristic method for CTA subject to nonlinear constraints has been developed by OptTek Systems, Inc. for the U.S. Bureau of Transportation Statistics (OptTek 2003). This approach does not guarantee optimality but does provide general nonlinear capabilities, and can be used to process extremely large, high-dimensional or complex tables.

18

4.1 Heuristic algorithm

The algorithm contains three phases. In the first phase, a feasible solution is obtained. In the second phase, the solution is improved relative to the quality measure, such as correlation, slope, variance or any other appropriate measure of information loss. The final phase uses Tabu Search (Glover and Laguna 1997) to further improve the best solution. Tabu Search applied to CTA provides the opportunity to exploit underlying structures via adaptive memory and responsive exploration; adaptive memory contrasts with "rigid memory" designs, such as branch and bound and the associated processes that lie at the core of exact methods. Responsive exploration affords the ability to guide the solution process in ways that are not accessible to exact methods.

The basis for implementing Tabu Search in the CTA context is described as follows. Consider the problem as that of optimizing a function f(x) over a set X. Tabu Search begins by proceeding iteratively from one solution to another until a chosen termination criterion is satisfied. Each x ∈ X has an associated neighborhood N(x), and each solution x' in N(x) is reached from x by an operation called a move. Tabu Search employs a strategy of modifying N(x) as the search progresses, replacing it by another neighborhood N*(x), based on the use of adaptive memory structures. The solutions admitted to N*(x) by these structures are determined in several ways. The one that gives Tabu Search its name identifies solutions encountered over a specified horizon (and, implicitly, additional related solutions) and forbids them to belong to N*(x) by classifying them tabu. The implementation of this mechanism allows the search process to overcome local optimality in the quest for the globally optimal solution.

The objective functions employed for CTA can be summarized in the following expression, where the coefficients a, b, c, d are selected to provide the desired results and to scale the sum of absolute deviations:

   Min { a (Sum Abs. Dev.) + b (1 - Corr.^2) + c |1 - Slope| + d |New Variance / Orig. Var. - 1| }

The heuristic algorithm changes sensitive and non-sensitive cells, first seeking to obtain a feasible solution; once feasibility is obtained, it moves on to optimize the quality measure. The algorithm only changes one cell or sum at a time. Eventually, no single change can improve the solution. At this point, the algorithm utilizes Tabu Search to move beyond the locally optimal solution towards a globally optimal solution. In this phase, non-improving changes are forced into the solution to allow the search to move to better solutions. Forced changes are maintained in the solution for a fixed number of iterations. The best feasible solution found during the entire search is returned to the user.

Both tables and all measures examined in Section 3 were processed using Tabu Search. The results for the 4x9 table are shown in Table 5, and the statistics of interest for both tables are presented in Tables 6 and 7, respectively.
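A compact way to see what the composite objective above measures is to evaluate it directly for a candidate adjustment. The sketch below is not OptTek's code; it uses population moments, and the weights a_w, b_w, c_w, d_w are placeholders, as are the example vectors.

```python
# Evaluate the composite Tabu Search objective
#   a*(sum of absolute deviations) + b*(1 - Corr^2) + c*|1 - Slope| + d*|Var ratio - 1|
# for original cell values a and a candidate adjustment y.  Weights are placeholders.
import numpy as np

def cta_objective(a, y, a_w=1e-6, b_w=1.0, c_w=1.0, d_w=1.0):
    adj = a + y
    corr = np.corrcoef(a, adj)[0, 1]
    slope = np.mean((a - a.mean()) * (adj - adj.mean())) / np.var(a)   # OLS slope of adj on a
    var_ratio = np.var(adj) / np.var(a)
    return (a_w * np.abs(y).sum()
            + b_w * (1.0 - corr ** 2)
            + c_w * abs(1.0 - slope)
            + d_w * abs(var_ratio - 1.0))

a = np.array([20.0, 50.0, 30.0, 10.0, 40.0, 25.0])   # invented cell values
y = np.array([5.0, -2.0, -3.0, 0.0, -8.0, 8.0])      # invented candidate adjustments
print(cta_objective(a, y))
```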


Table 5. Original 4x9 table after controlled tabular adjustments using Tabu Search

Var:

    167500   317501  1283751   587501  4490751  3981001  2442001  1150000    34900  14454906
     56875  1487000   172500   667503  1006253   327500  1683000  1138250    53800   6592681
    616752   202750  1899502  1098751  2172251  3825251  4372753   260000   787500  15235510
         0    45500        0     8125        0        0    65000        0   204300    322925
    841127  2052751  3355753  2361880  7669255  8133752  8562754  2548250  1080500  36606022

Corr:

    167500   317501  1283751   587501  4490751  3981001  2442001  1150000    92184  14512190
     58058  1487000   172500   667503  1006253   327500  1683000  1138250    38200   6578264
    616752   202750  1899502  1098751  2172251  3825251  4372753   341183   787500  15316693
         0    24500        0    11375        0        0    65000        0    98000    198875
    842310  2031751  3355753  2365130  7669255  8133752  8562754  2629433  1015884  36606022

Slope:

    167500   317501  1283751   587501  4490751  3981001  2442001  1150000    34900  14454906
     56875  1487000   172500   667503  1006253   327500  1683000  1138250    53800   6592681
    616752   202750  1899502  1098751  2172251  3825251  4372753   260000   787500  15235510
         0    45500        0     8125        0        0    65000        0   204300    322925
    841127  2052751  3355753  2361880  7669255  8133752  8562754  2548250  1080500  36606022

Table 6. Summary of results of numeric simulations on 4x9 table using Tabu Search

                           Correlation   Regress. Slope   New Var. / Original Var.
  Sensitive Cells
    Min Abs Dev               0.98            0.82              0.70
    Variance                  0.94            0.92              0.96
    Correlation               0.98            1.13              1.32
    Regression                0.94            0.92              0.96
  All Cells
    Min Abs Dev               1.00            1.00              1.00
    Variance                  1.00            1.00              1.00
    Correlation               1.00            1.00              1.00
    Regression                1.00            1.00              1.00


Table 7. Summary of results of numeric simulations on 13x13x13 table using Tabu Search

                           Correlation   Regress. Slope   New Var. / Original Var.
  Sensitive Cells
    Min Abs Dev               0.995           0.96              0.94
    Variance                  0.995           1.00              1.00
    Correlation               0.995           1.00              1.02
    Regression                0.995           1.00              1.02
  All 4 Functions             1.00            1.00              1.00

5 Concluding comments

We examined the issue of preserving the quality and analytic utility of data subjected to controlled tabular adjustment for confidentiality purposes. We provided linear programming formulations to preserve mean values exactly, and variances, correlation and regression slope approximately, between original and adjusted cell values. We provided an alternative algorithm based on direct search that can provide optimal solutions to many problems and near-optimal solutions for large problems. Results based on limited computational experience are encouraging.

Variance, correlation and slope are not harmonious, in the sense that optimizing one typically degrades another. We provided a single linear programming formulation for a compromise solution that strikes an acceptable balance among these when variation in the adjustments is small relative to variation in the data. This condition is typical and likely to be met, e.g., when sensitive cells are few relative to all cells. We provided a strategy based on Tabu Search that can be driven by any statistic of interest (linear or nonlinear) and which produces good to optimal results in a wide variety of settings. In addition, the compromise solution is likely to speed the search.

Preserving data quality and analytical utility under CTA is two-faceted. First, adjustments to individual cell values need to be as small as possible. Sensitive cells typically require larger adjustments. Differential adjustments are required, expressible in CTA by linear capacity constraints: nonsensitive adjustments are capacitated within a small percentage or measurement error, and sensitive cell adjustments are derived from the protection limits. CTA opens the possibility of assigning protection below the protection limits in most cases, thereby reducing bias to sensitive cell values. This is controversial but has a mathematical foundation and merits consideration by NSOs. Capacities ensure that original and adjusted data are close locally. The second facet is to ensure that they are close globally, viz., that important statistical properties of the data set


or relevant subsets are approximately preserved. Cox and Dandekar (2003) focused on minimizing deterministic measures of global change such as total (percent) absolute change. This helps to preserve analytical utility, but does not address statistical properties directly. Here we provide direct methods for preserving important distributional parameters (mean, variance, correlation and regression slope), closing the gap between confidentiality protection and data quality and utility.

Preserving local and global properties of the data are competing objectives, as observed in Section 3. If sensitive values can be adjusted only to their protection limits, then it is unlikely that the mean of the sensitive values will be preserved. This necessitates broadening the capacities on sensitive cells. If L(y) cannot be forced near to zero, then the corresponding approximations to preserve variance, correlation and regression slope will be poor, again necessitating relaxation of capacities. Conversely, many users are interested primarily in individual values or sets of values. The more capacities are relaxed for these values, the less useful and reliable they become for the user. Rules of thumb are that it is easier to accommodate these competing objectives either if the data set is large or if the number of sensitive cells and the protection required do not overwhelm the nonsensitive cells: sufficient mathematical "elbow room" enables acceptable solutions.

How and to what extent should capacities be relaxed? A balance can be struck by basing capacities on a small number of percentages applied to cell values, treating these percentages as variables, and optimizing statistics of interest and percentages jointly. We are investigating this and are extending these formulations to the multivariate case, viz., to ensure that correlation and regression slope between variables exhibited in the original data are preserved in the adjusted data.

References

Cox, L.H. (1980). Suppression methodology and statistical disclosure control. Journal of the American Statistical Association 75, 377-385.

Cox, L.H. (2000). Discussion. ICES II: The Second International Conference on Establishment Surveys: Survey Methods for Businesses, Farms and Institutions. Alexandria, VA: American Statistical Association, 905-907.

Cox, L.H. & Dandekar, R.A. (2003). A new disclosure limitation method for tabular data that preserves data accuracy and ease of use. Proceedings of the 2002 FCSM Statistical Policy Seminar. Washington, DC: U.S. Office of Management and Budget (in press).

Dandekar, R.A. & Cox, L.H. (2002). Synthetic tabular data – an alternative to complementary cell suppression (manuscript).

Fellegi, I.P. (1972). On the question of statistical confidentiality. Journal of the American Statistical Association 67, 7-18.

OptTek Systems, Inc. (2003). Disclosure limitation methods. Report to the U.S. Bureau of Transportation Statistics.


Fischetti, M. & Salazar-Gonzalez, J.J. (2000). Models and algorithms for optimizing cell suppression in tabular data with linear constraints. Journal of the American Statistical Association 95, 916-928. Glover, F. & Laguna, M. (1997). Tabu Search. Amsterdam: Kluwer Academic.


A Query-Overlap Restriction for Statistical Database Security

Francesco M. Malvestuto
Dipartimento di Informatica, Università «La Sapienza», Via Salaria 113, 00198 Roma, Italy

Abstract. In a statistical database, the query-answering system should prevent answers to sum-queries from leading to disclosure of confidential data. We give a general framework for controlling the amount of information released when sumqueries are answered, both from the viewpoint of the user and from the viewpoint of the query-answering system. Moreover, we show that, under a suitable query-overlap restriction, an auditing procedure can be efficiently worked out using flow-network algorithms. Keywords. Statistical database security, sum map, flow network

1 Introduction

A statistical database (Adam & Wortmann, 1989) is an ordinary database which contains information on individuals (persons, companies, organisations etc.), but its users are allowed to access only summary statistics over «categories» of individuals. For example, consider a statistical database containing a file R with scheme {NAME, SSN, SEX, AGE, DEPARTMENT, SALARY}. The users can ask for totals of statistics on SALARY over groups of individuals, but these groups cannot be selected using the attributes NAME and SSN, which are private. In this paper, we focus on sum-queries such as q:

«What is the sum of salaries of employees with AGE ≥ 40 and DEPARTMENT ≠ Direction or GENDER = Male?».

Here, SALARY is the summary attribute of q, and the three attributes AGE, DEPARTMENT and GENDER are its category attributes. By the value of q we mean the total of the statistic (i.e., the total sum) of the summary attribute over the category of employees selected by q. Answering such sum-queries (and, more generally, statistical queries) raises concerns about the compromise of individual privacy,


and protection of confidential data should be afforded. We call intrusive those sum-queries asking for the totals of sensitive statistics (Cox, 1980; Willenborg & de Waal, 1996; Willenborg & de Waal, 2000). In our example, if SALARY is a confidential attribute and the statistic of SALARY over the category of employees selected by q is sensitive (e.g., according to the domination criterion), then the sum-query q is intrusive. When an intrusive sum-query q is asked, the query-answering system (QAS) should not issue the value of q but should give a noninformative answer, for example, the interval-estimate [l, u] of the value of q, where l and u are, respectively, the tightest lower bound and the tightest upper bound on the value of q consistent with the responses to previously answered sum-queries.

The statistical security of a database can also be attacked by a nonintrusive sum-query q; this is the case if the total of the statistic requested by q, when combined with the responses to previously answered sum-queries, leads to the disclosure of the total of some sensitive statistic. We call such sum-queries tricky, and they should be answered in the same manner as intrusive sum-queries, that is, by issuing interval-estimates. Finally, if a sum-query is neither intrusive nor tricky, the QAS can safely answer it by releasing its value.

The situation can be depicted as a competitive game played by the QAS, which has as its opponent a hypothetical user, henceforth referred to as the snooper, who at all times is well informed of all answered sum-queries and attempts to pry sensitive statistics out of their responses. The key point of a winning strategy for both the QAS and the snooper is the ability to recognise intrusive and tricky sum-queries, which requires a model of the amount of information that is (implicitly and explicitly) released by the QAS when sum-queries are answered. Previous models (Adam & Wortmann, 1989; Malvestuto & Moscarini, 1999) are neither realistic nor efficient, since they are based on the snooper's knowledge of the set of records selected by each answered sum-query, so that their computational complexity increases with the size of the underlying database. In this paper, we present a model which works with categories and whose computational complexity is therefore independent of the size of the underlying database. Using such a model, we address the following two problems for a summary attribute of nonnegative-real type:

(The estimation problem) Given a set of answered sum-queries and a statistic of interest, find an optimal estimate of its total.

(The safety problem) Given the set of answered sum-queries and the set of sensitive statistics, decide if a new sum-query can be safely answered, that is, if it can be answered without running the risk that some sensitive statistic is disclosed.

We show that both problems can be efficiently solved when a suitable query restriction is introduced.

25

2 Basic Definitions

Let R be the file of a statistical database. Let β be a summary attribute (of additive type) in the scheme of R, and let ℵ = {α1, …, αk} be the set of category attributes in the scheme of R that are used to ask for summary statistics on β. The category attributes may be either independent or dependent; they are independent if every tuple a = (a1, …, ak) on ℵ is meaningful. An example of dependent category attributes is given by GENDER and DIVISION in a hospital database: since there cannot be any male patient in the gynaecological division, the couple (Male, Gynaecology) is meaningless. By A we denote the set of meaningful tuples on ℵ. Tuples in A and subsets of A will be referred to as cells and categories, respectively; moreover, cells are assumed to be mutually exclusive and globally exhaustive. If K is an arbitrary category, by the statistic β(K) we mean the collection of the values of β over the set of records in R that fall into the category K. Typically, a sum-query q on β asks for the total of some statistic β(K), K ⊆ A, written q = β(K).

In order to speed up the processing of sum-queries on β, the QAS will make use of a table, referred to as the summary table on β, which is created by the QAS once and for all and reports, for each cell a, the total b(a) of the statistic β({a}). Thus, the value of the sum-query q = β(K) is computed as ∑a∈K b(a) without accessing the file R. When a sum-query q on β is answered by the QAS, we shall see that the snooper is always able to infer, from the value of q and the values of previously answered sum-queries on β, the tightest lower bound l and the tightest upper bound u on the total of every statistic β(K). The pair [l, u] will be referred to as the interval-estimate of the total of β(K). Since the summary attribute β is of nonnegative real type, one always has

u ≤ +∞.

and

Loosely speaking, if β(K) is a sensitive statistic and the interval [l, u] is narrow, then the total of β(K) is not protected. More precisely, we assume that the security policy adopted by the QAS to avoid the disclosure of sensitive statistics requires that sumquery can be safely answered if, for each sensitive statistic β(K), the width of the interval [l, u] is greater than a threshold value ∆, we call the protection level of β(K); that is, if u – l > ∆. It is understood that all sensitive statistics are initially identified and the protection level of each of them (if any) is fixed. Example 1. Consider a file with scheme {NAME, DEPARTMENT, SALARY}, where SALARY is the summary attribute and DEPARTMENT is the category attribute. Henceforth, we assume that: the attribute SALARY is of nonnegative-real type, the domain of DEPARTMENT is {a, b, c, d, e, f, g}, the summary table on SALARY contains the following data

26

   DEPARTMENT   SALARY
   a              15.0
   b               9.0
   c               7.5
   d               6.5
   e               1.5
   f               5.5
   g               1.0

Table 1. A summary table

and the three statistics of SALARY over the categories {a}, {a, f} and {a, g} are the only sensitive ones and have protection levels 3.0, 3.3 and 3.2, respectively.
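As a small illustration of the summary-table mechanism described above (a sketch, not the paper's system), the per-cell totals b(a) of Table 1 can be held in a dictionary and any sum-query answered from it without touching the underlying file R.

```python
# Summary table on SALARY (the per-cell totals b(a) of Table 1), and a sum-query
# answered from it without accessing the file R.
summary = {'a': 15.0, 'b': 9.0, 'c': 7.5, 'd': 6.5, 'e': 1.5, 'f': 5.5, 'g': 1.0}

def answer(K):
    """Value of the sum-query q = SALARY(K) for a category K of departments."""
    return sum(summary[cell] for cell in K)

print(answer({'a', 'f'}))    # 20.5, the total of SALARY over the category {a, f}
```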

3 The snooper at work

Assume that, at a certain time, n sum-queries q1, …, qn, where qi = β(Ki), 1 ≤ i ≤ n, have been answered by the QAS. Without loss of generality, we assume that each Ki is not empty; however, it may happen that Ki = Ki' even if i ≠ i'. Let An = ∪i=1,…,n Ki and K = {K1, …, Kn}. Then the coarsest of the partitions of An such that each Ki can be recovered by taking the union of one or more classes of the partition is uniquely determined. This partition will be referred to as the categorisation scheme induced by K.

Example 2. Let K1 = {a, b, f, g}, K2 = {b, c, d, g} and K3 = {d, e, f, g}. The categorisation scheme induced by K = {K1, K2, K3} is formed by seven categories, each of which is a singleton: {a}, {b}, {c}, {d}, {e}, {f} and {g}.

Let C = {C1, …, Cm} be the categorisation scheme induced by K, and let M = {1, …, m} and N = {1, …, n}. For each i ∈ N, let bi be the value of qi and let M(i) = {j ∈ M: Cj ⊆ Ki}. The amount of information conveyed by the responses to q1, …, qn can be described by the following system of linear equations:

   ∑j∈M(i) xj = bi   (i ∈ N)                                            (1)

where variable xj stands for the (unknown) total of the statistic β(Cj). Let X be the set of nonnegative (real-valued) solutions of equation system (1). Suppose now that the snooper wants to get the interval-estimate [l, u] of the total of a sensitive statistic β(S). The following three cases will be examined separately:

Case 1. S is covered by C, that is, either S = Ø or there is a nonempty subset J of M such that S = ∪j∈J Cj. Then, l and u are taken to be

    l = min {∑j∈J xj: x ∈ X}        u = max {∑j∈J xj: x ∈ X}.

Case 2. S is a subset of An but is not covered by C. Let J = {j ∈ M: Cj ⊆ S} and J' = {j ∈ M: Cj ∩ S ≠ Ø}. Then, l and u are taken to be

    l = min {∑j∈J xj: x ∈ X}        u = max {∑j∈J' xj: x ∈ X}.

Case 3. S is not a subset of An. Let J = {j ∈ M: Cj ⊆ S}. Then, l and u are taken to be

    l = min {∑j∈J xj: x ∈ X}        u = +∞.

Note that the total of the sensitive statistic β(S) runs the risk of being disclosed (that is, u – l ≤ ∆, where ∆ is the protection level of β(S)) only if S is a subset of An (see Cases 1 and 2 above) and is definitely disclosed (independently of ∆) if the total of β(S) can be exactly evaluated (that is, l = u).

Example 1 (continued). Consider four sum-queries q1 = SALARY(K1), …, q4 = SALARY(K4), where K1 = {a, b}, K2 = {a, c, d}, K3 = {b, c, f} and K4 = {d, e}. The categorisation scheme C induced by K = {K1, K2, K3, K4} is formed by the following six categories: C1 = {a}, C2 = {b}, C3 = {c}, C4 = {d}, C5 = {e} and C6 = {f}. From the summary table (see Table 1), the QAS gets that the values of q1, …, q4 are b1 = 24, b2 = 29, b3 = 18 and b4 = 12, respectively. Equation system (1) reads

    x1 + x2 = 24
    x1 + x3 + x4 = 29
    x2 + x3 + x6 = 18
    x4 + x5 = 12

Suppose that the snooper attempts to pry the total of the sensitive statistic SALARY(S) where S = {a, f}. Since S = C1 ∪ C6, S is covered by C and he can get the interval-estimate [11.5, 42.0] using standard linear-programming methods. Since u – l (= 30.5) > ∆ (= 3.3), the total of the sensitive statistic SALARY(S) is protected.

From a computational point of view, it should be noted that the number (m) of variables in equation system (1) may be exponential in the number (n) of its equations, that is, in the number of answered sum-queries (see Example 2). Therefore, computing interval-estimates can be very expensive. However, sometimes it is possible to reduce the amount of computation as follows. Suppose that there is i ∈ N such that, for each i' ≠ i, either Ki ∩ Ki' = Ø or Ki ⊆ Ki'; then, Ki itself belongs to the categorisation scheme C, say Ki = Cj, and the i-th equation in constraint system (1) is simply xj = bi. (Note that, if there is another i' such that Ki' = Ki, then the corresponding equation xj = bi' is redundant and can be deleted.) Moreover, each occurrence of xj in the remaining equations can be deleted since it can be assigned the value bi. If this procedure is repeated, ultimately one obtains an equivalent equation system, which we write as

    G x = v        (2)

and call the canonical equation system associated with the set of sum-queries {q1, …, qn}. Equation system (2) will be formed by a set of equations of the simple form

    xj = const        (j ∈ M*)

where M* is a subset of M, and by an equation system of the form

    ∑j∈M(i)–M* xj = vi        (i ∈ N*)

where N* is a subset of N and vi is the ‘revised’ value of sum-query qi. Note that, if M* = M, then the total of every statistic β(K) can be exactly and directly evaluated whenever K is covered by C.
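Returning to Example 1, the interval-estimate of SALARY({a, f}) = x1 + x6 can be obtained with any linear-programming solver. The following Python sketch is an illustration added here (it is not code from the paper) and assumes SciPy is available; it minimises and maximises x1 + x6 subject to equation system (1) and reproduces the bounds 11.5 and 42.0 quoted above.

# Sketch: interval-estimate of SALARY({a, f}) = x1 + x6 in Example 1.
# Variables x1..x6 are nonnegative (linprog's default bounds).
import numpy as np
from scipy.optimize import linprog

# Equation system (1) for K1={a,b}, K2={a,c,d}, K3={b,c,f}, K4={d,e}
A_eq = np.array([
    [1, 1, 0, 0, 0, 0],   # x1 + x2           = 24
    [1, 0, 1, 1, 0, 0],   # x1 + x3 + x4      = 29
    [0, 1, 1, 0, 0, 1],   # x2 + x3      + x6 = 18
    [0, 0, 0, 1, 1, 0],   # x4 + x5           = 12
])
b_eq = [24, 29, 18, 12]

c = np.array([1, 0, 0, 0, 0, 1])           # objective x1 + x6

lo = linprog(c,  A_eq=A_eq, b_eq=b_eq)     # lower bound l
hi = linprog(-c, A_eq=A_eq, b_eq=b_eq)     # upper bound u (maximise by negating c)

print("interval-estimate:", lo.fun, -hi.fun)   # expected output: 11.5 42.0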

4 How to repel the attacks of the snooper

Suppose that the QAS has answered sum-queries q1, …, qn, where qi = β(Ki), 1 ≤ i ≤ n, when a new sum-query q = β(K) arrives. Then, the QAS must decide if q can be safely answered, that is, whether each sensitive statistic will remain protected when the value of the sum-query is released. As we said, if q is intrusive, that is, if β(K) is a sensitive statistic, then the response to q will consist of an interval-estimate. We now consider the case that q is not intrusive. Let C = {C1, …, Cm} be the categorisation scheme induced by K = {K1, …, Kn} and let equation system (2) be the canonical equation system associated with {q1, …, qn}. Of course, if the value of q can be exactly evaluated from equation system (2), then q can be safely answered and the QAS will issue the value of q. Otherwise, the QAS will compute the categorisation scheme C' = {C'1, …, C'm'} induced by C ∪ {K} and then build up the canonical equation system associated with the set of sum-queries {q1, …, qn, q}. After doing that, the QAS will test each sensitive statistic to see if it is protected, and only if this is the case will the value of q be released. Indeed, it is sufficient to test sensitive categories that are subsets of A' = C'1 ∪ … ∪ C'm'. We call such categories the sensitive targets. So, if each sensitive target is protected, then q can be safely answered. Otherwise, q is tricky and the QAS will issue the interval-estimate of the total of the statistic β(K) using equation system (2). What remains to clarify is how to compute C'. It is easy to see that C' can be obtained from C by replacing each Cj having K∩Cj ≠ Ø by the two categories Cj–K and K∩Cj, and by adding the category K – (∪j=1,…,m Cj). More precisely, C' is obtained from the set family (∪j=1,…,m {Cj–K, K∩Cj}) ∪ {K – (∪j=1,…,m Cj)} by discarding the empty set, if present (a small sketch of this refinement step is given below). As noticed above, the number of variables in equation system (2) may be exponential in the number of its equations, so that the auditing procedure may be time-consuming. To overcome this difficulty, we introduce a query-overlap restriction which requires that, if the current sum-query q cannot be exactly evaluated, q will be answered in the same manner as an intrusive (or tricky) sum-query if q overlaps «too much» with previously answered sum-queries, more precisely, if the number of occurrences of some variable in the canonical equation system associated with {q1, …, qn, q} is greater than r, for a fixed positive integer r. Accordingly, q does not overlap too much with q1, …, qn if and only if either K ∈ C or, for each j such that K∩Cj ≠ Ø, the number of occurrences of xj in the canonical equation system associated with {q1, …, qn} is less than r. The simplest nontrivial case is r = 2, which despite its simplicity is powerful enough to deal with the security problem for two-dimensional tables as shown in (Gusfield, 1988; Malvestuto & Mezzini, 2002). In the next section, we address the problem of computing interval-estimates under the query-overlap restriction with r = 2. To this end, we now introduce a graphical representation of a canonical equation system such as equation system (2), based on the fact that G can be viewed as being the (vertex-edge) incidence matrix of a graph, say G = (N, E). We call the pair (G, v) the sum map associated with the set of sum-queries {q1, …, qn}; it is understood that each vertex i of G is weighted by vi and each edge ej of G is labelled by Cj.
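The refinement of C by a new category K described above is straightforward to implement. The following Python sketch is an illustration added here, not code from the paper; cells are assumed to be plain hashable labels.

# Sketch: refining the categorisation scheme C with a new category K.
def refine(C, K):
    """Return the categorisation scheme induced by C and the new category K."""
    K = frozenset(K)
    C_new, covered = [], set()
    for Cj in C:
        Cj = frozenset(Cj)
        inside, outside = Cj & K, Cj - K        # split Cj into K∩Cj and Cj–K
        C_new.extend(part for part in (inside, outside) if part)
        covered |= Cj
    leftover = K - covered                      # the category K – (C1 ∪ … ∪ Cm)
    if leftover:
        C_new.append(leftover)
    return C_new

# Example 2 of the paper: the scheme induced by K1, K2, K3 is all singletons.
C = [{'a', 'b', 'f', 'g'}]
for K in [{'b', 'c', 'd', 'g'}, {'d', 'e', 'f', 'g'}]:
    C = refine(C, K)
print(sorted(map(sorted, C)))   # [['a'], ['b'], ['c'], ['d'], ['e'], ['f'], ['g']]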


Example 1 (continued). The sum maps G1, …, G5 associated with the five sets of sum-queries {q1, …, qi}, i = 1, …, 5, are shown in Fig. 1.

[Figure 1. Sum maps. The drawings of G1, …, G5 are not reproduced here; only the vertex weights (24, 29, 18, 12, …) and edge labels such as {a}, {b}, {c}, {c,d}, {e} survive from the original figure.]

5 Computing interval-estimates

Let (G, v) be the sum map associated with a set of sum-queries {q1, …, qn}, where G = (N, E), N = {1, …, n} and E = {e1, …, em}. Without loss of generality, we assume that G is connected. The problem we deal with is how to find the minimum and the maximum of a linear function ∑j∈J xj, where J is a (nonempty) subset of {1, …, m}. We shall show that the problem of minimising or maximising such a function can be converted into a bipartite transportation problem (Ahuja et al., 1993), that is, into the form

    minimise ∑j uj xj    subject to    A x = d,  x ∈ ℜm,

where A is the incidence matrix of a bipartite digraph D. Recall that: if (U, V) is the bipartition of D, each vertex i ∈ U of D is a source with supply –di and each vertex i ∈ V of D is a sink with demand di; the vector d is the demand vector. Finally, it is well known that every transportation problem can be efficiently solved with the network simplex method (Ahuja et al., 1993). We now separately discuss two cases depending on whether G is or is not bipartite.

Case 1. G is bipartite. Let (U, V) be the bipartition of G. Direct each edge of G from U to V; thus, if (h, k) is an edge of G with h ∈ U and k ∈ V, then h is the tail and k is the head of the directed edge, which we denote by h→k. Let D be the resulting bipartite digraph and let A be the incidence matrix of D; thus, each directed edge h→k of D corresponds to a column a of A with ah = –1, ak = +1, and ai = 0 for all i ∉ {h, k}. Let d be defined by

    di = –vi  (i ∈ U)        di = vi  (i ∈ V).

At this point, the canonical equation system G x = v is like the equation system A x = d of a bipartite transportation problem. By taking u as the incidence vector of the edge set {ej: j ∈ J} (as its opposite, respectively), we have converted the problem of minimising (maximising, respectively) into a bipartite transportation problem and we can solve it efficiently.

Case 2. G is not bipartite. The edges of G that are not loops will be referred to as links. If G contains p links, it is convenient to order the edges of G as e1, …, ep, ep+1, …, em where e1, …, ep are all links. Moreover, we always write an edge of G as (i, j) where i ≤ j. We now transform G into a bipartite graph H = (P, F) with 2n vertices and m+p edges. The graph H is constructed as follows (Malvestuto & Mezzini, 2002). The vertex set of H is taken to be P = {1, …, 2n}. Vertex n + i is meant to be a ‘copy’ of i. The edge set F of H is defined as follows. Arbitrarily choose a spanning tree T of G, and let G' be the bipartite graph obtained from T by adding all non-tree edges of G that create even cycles. The edges f1, …, fm+p of H are defined as follows:

— if ej = (i, k) is an edge of G' then fj = ej and fm+j = (n+i, n+k), 1 ≤ j ≤ p;
— if ej = (i, k) is a link of G but not an edge of G' then fj = (i, n+k) and fm+j = (k, n+i), 1 ≤ j ≤ p;
— if ej = (i, i) is a loop of G then fj = (i, n+i), p < j ≤ m.

Let (U, V) be the bipartition of G' and let U' = {n+i: i ∈ U} and V' = {n+i: i ∈ V}. Note that, since G is a nonbipartite, connected graph, H is a bipartite, connected graph with bipartition (R, S) where R = U ∪ V' and S = U' ∪ V. Finally, the weights wp of the vertices of H are taken to be

    wp = vp  (p = 1, …, n)        wp = vp–n  (p = n + 1, …, 2n).
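The construction of H can be coded in a few lines. The Python sketch below is an illustration added here (not code from the paper); it assumes G is given as an edge list on vertices 1, …, n, with the links listed before the loops, and it uses the fact that an edge belongs to G' exactly when its endpoints receive different colours in a 2-colouring of the spanning tree.

# Sketch: the Case 2 construction of H from a nonbipartite connected graph G.
from collections import deque

def bipartite_transform(edges, n):
    links = [e for e in edges if e[0] != e[1]]
    loops = [e for e in edges if e[0] == e[1]]
    p, m = len(links), len(edges)

    # BFS spanning tree of G (loops ignored) and its 2-colouring.
    adj = {v: [] for v in range(1, n + 1)}
    for i, k in links:
        adj[i].append(k); adj[k].append(i)
    colour, queue = {1: 0}, deque([1])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in colour:
                colour[w] = 1 - colour[v]
                queue.append(w)

    F = {}
    for j, (i, k) in enumerate(links, start=1):
        if colour[i] != colour[k]:          # edge of G' (tree edge or even-cycle edge)
            F[j], F[m + j] = (i, k), (n + i, n + k)
        else:                               # link creating an odd cycle
            F[j], F[m + j] = (i, n + k), (k, n + i)
    for j, (i, _) in enumerate(loops, start=p + 1):
        F[j] = (i, n + i)                   # a loop becomes a link to the copy vertex
    return F                                # edge f_j of H, for j = 1, ..., m+p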

Example 2. Consider the sum map G3 of Fig. 1. Fig. 2 shows one of the possible bipartite transforms of G3.

[Figure 2. The bipartite transform of a nonbipartite sum map. The drawing, with vertex classes R and S, vertex weights and edges f1, …, f8, is not reproduced here.]

Let H be the incidence matrix of H and let w be the vector of weights of the vertices of H. Consider the constraint system

    H y = w        (3)

The following obvious facts show that the set X of nonnegative solutions of equation system (2) and the set Y of nonnegative solutions of equation system (3) are closely related to each other.

Fact 1. For every nonnegative solution x of equation system (2), the vector y with

    yj = xj  (if j ≤ m)        yj = xj–m  (if m < j ≤ m + p)

is a nonnegative solution of equation system (3).

Fact 2. For every nonnegative solution y of equation system (3), the vector x with

    xj = (yj + ym+j)/2  (if j ≤ p)        xj = yj  (if p < j ≤ m)

is a nonnegative solution of equation system (2).

Consider now an arbitrary subset J of {1, …, m}. Let J' = {j ∈ J: j ≤ p} and J" = J – J'. By Facts 1 and 2, the function ∑j∈J xj over X has the same range as the function

    (1/2) ∑j∈J' (yj + ym+j) + ∑j∈J" yj        (4)

over Y. So, the minimum (maximum, respectively) of the function ∑j∈J xj over X is equal to the minimum (maximum, respectively) of the function (4) over Y. Finally, as we saw above, each of these two problems can be converted into a bipartite transportation problem and, hence, can be solved in an efficient way.

6 Future research

We discussed the security issues connected with answering sum-queries when the summary attribute is of nonnegative-real type, and gave efficient solutions to the estimation problems and the safety problem under the query-overlap restriction of order two. To achieve this, we exploited a graphical representation (a sum map) of the amount of information released by the QAS. An open problem under such a query-overlap restriction is given by sum-queries where the summary attribute is of nonnegative-integer type. However, by virtue of the total unimodularity of the incidence matrix of a bipartite graph, the results proved in Section 5 for a bipartite sum map also apply to the case that the summary attribute is of nonnegative-integer type.


References

Adam, N.R. & Wortmann, J.C. (1989). Security control methods for statistical databases: a comparative study. ACM Computing Surveys 21, 515-556.
Ahuja, R.K., Magnanti, T.L. & Orlin, J.B. (1993). Network flows. Prentice Hall, Englewood Cliffs.
Cox, L.H. (1980). Suppression methodology and statistical disclosure control. J. American Statistical Association 75, 377-385.
Gusfield, D. (1988). A graph-theoretic approach to statistical data security. SIAM J. Computing 17, 552-571.
Malvestuto, F.M. & Mezzini, M. (2002). A linear algorithm for finding the invariant edges of an edge-weighted graph. SIAM J. on Computing 31, 1438-1455.
Malvestuto, F.M. & Moscarini, M. (1999). An audit expert for large statistical databases. In Statistical Data Protection, EUROSTAT, 29-43.
Willenborg, L. & de Waal, T. (1996). Statistical Disclosure Control in Practice. Lecture Notes in Statistics, Vol. 111, Springer-Verlag, New York.
Willenborg, L. & de Waal, T. (2000). Elements of Statistical Disclosure. Lecture Notes in Statistics, Vol. 155, Springer-Verlag, New York.


Different methods in a common framework to protect tables

Juan-José Salazar-González
Department of Statistics, Operations Research and Computer Science, University of La Laguna, Tenerife, Spain ([email protected])

Abstract. This paper concerns Statistical Disclosure Control methods to minimize the information loss while keeping the disclosure risk from different data snoopers small. This issue is of primary importance in practice for statistical agencies when publishing data. It is assumed that the sensitive data have been identified by practitioners in the Statistical Offices, and the paper addresses the complementary problem of protecting these data with different methods, all defined in a unified mathematical framework. A common definition of protection is introduced and used in four different methodologies. In particular, two Integer Linear Programming models are described for the well-known Cell Suppression and Controlled Rounding techniques. Also two relaxed techniques are presented by means of two associated Linear Programming models, called Interval Publication and Cell Perturbation, respectively. A final discussion shows how to combine the four methods and how to implement a cutting-plane approach for the exact and heuristic resolution of the combinatorial problems in practice. All the presented methodologies inherently guarantee protection levels on all cells and against a set of different intruders (possibly respondents and/or coalitions of respondents), thus the standard post-phase to test the protection requirements (and typically called Disclosure Auditing) is unnecessary. These mathematical models protecting a table against several intruders also provide the novelty of controlling the risk of disclosure from potential coalitions of respondents. Keywords: Statistical Disclosure Control, Mathematical Programming.

1 Introduction

Let us consider the following table to be published:

              Region A   Region B   Region C   Total
Activity I        20         50         10       80
Activity II        8         19         22       49
Activity III      17         32         12       61
Total             45        101         44      190

Investment of enterprises by activity and region.


If the cell in Activity II and Region C is classified as sensitive information (perhaps because there is only one contributor to this cell), then a method to protect this private information must be applied to the table before publication. In the area of Statistical Disclosure Limitation there are several approaches to address this problem, and this paper introduces general concepts to merge four different methodologies in a common mathematical framework. The advantage of this common framework is that it allows the same measure of protection (based on protection levels) to be used for the different methodologies. The models introduced in this paper can be applied to all kinds of tabular data, whether hierarchical, linked or k-dimensional tables, containing magnitude or frequency data with positive, negative or unsigned values. The main aim in all cases will be to keep the risk of disclosure controlled while the loss of information is minimized. The basic concepts are the following:

1. What should an output pattern be? It should be a new table with some uncertainty, in the sense that it determines a set of different tables that could be the original table from an attacker’s point of view. This set must be “large enough” and always contains the original table.

2. What is the “loss of information” of a pattern? It should be a measure of the size of the set determined by the pattern. For example, if the set contains only the original table, then the loss of information is zero. The precise definition of “loss of information” depends on the selected methodology.

3. When is the “risk of disclosure” controlled? We assume that a subset of sensitive cells is given. Each sensitive cell k must admit a “wide” interval of possible values in accordance with the set of potential tables for each attacker p. Attacker p typically has some knowledge of the unknown value in cell k, represented by the “external bounds” [lb_k^p, ub_k^p]. To define precisely how “wide” an interval must be, three protection levels are given for each cell k and each attacker p: upl_k^p (upper), lpl_k^p (lower) and spl_k^p (sliding). A cell with original value ak is considered protected against an attacker when the minimum value y_k^p and the maximum value ȳ_k^p that he/she can compute satisfy

    y_k^p ≤ ak – lpl_k^p        ȳ_k^p ≥ ak + upl_k^p        ȳ_k^p – y_k^p ≥ spl_k^p

If this is ensured for each cell k and each attacker, then the protection is guaranteed and no auditing phase is needed afterwards. For example, a statistical agency can be interested in protecting the sensitive cell in the above table against one attacker (p = 1) with external bounds defining [0, 1000] for all unknown cell values, and the required protection levels can be


defined by upl_k^p = 3, lpl_k^p = 2 and spl_k^p = 6. We next illustrate four different output patterns, each one corresponding to a different methodology. We will describe the four methodologies in the forthcoming sections through precise mathematical models, but let us give here the basic idea behind each one:

Cell Suppression Methodology: The sensitive information is simply left unpublished (these cells are called primary suppressions) and, because of the marginal cells in the table, some additional cells (called secondary suppressions) usually have to be left unpublished as well. Even though for each cell a user will observe either a number or a missing value, given the external bounds he/she knows, the missing values can simply be replaced by intervals. Each value in such an interval is compatible with values for the other missing cells such that there exists a congruent table that could be the original one from the intruder’s point of view. The loss of information is defined by considering fixed costs associated with all the unpublished cells.

Interval Publication Methodology: This is a relaxed version where, instead of first locating the missing values and then auditing the intervals, a single step computes the intervals. This new methodology saves computational effort and reduces the loss of information, because it will typically publish narrower intervals than the cell suppression method.

Controlled Rounding Methodology: This is a perturbation method where each value is rounded up or down to a multiple of a given base number. The rounded table should be congruent, i.e., the rounded marginal cell values should be the sum of the internal rounded cell values.

Cell Perturbation Methodology: Since the classical Controlled Rounding Methodology can be infeasible, Cell Perturbation is a new methodology which is more likely to admit solutions. The idea is to allow perturbation of the original values inside the interval between the extremes defined by rounding the original value up and down. Additional advantages are the saving of computational effort when finding optimal solutions which, in turn, tend to have a smaller loss of information than the optimal solution of the controlled rounding method.

Examples of protected patterns for the above table according to these methodologies are the following (an illustrative audit of the interval pattern is sketched after the tables):

          A         B     C         Total
A. I     20        50    10          80
A. II     *        19     *          49
A. III    *        32     *          61
Total    45       101    44         190
Cell Suppression Methodology

          A         B     C         Total
A. I     [18…24]   50    [6…12]      80
A. II    [4…10]    19    [20…26]     49
A. III   17        32    12          61
Total    45       101    44         190
Interval Publication Methodology

          A         B     C         Total
A. I     20        50    10          80
A. II    10        20    20          50
A. III   15        30    15          60
Total    45       100    45         190
Controlled Rounding Methodology (base number = 5)

          A         B     C         Total
A. I     20        50    10          80
A. II     7        16    26          49
A. III   18        35     8          61
Total    45       101    44         190
Cell Perturbation Methodology (perturbed interval [-5, +5])

We next give a mathematical description of each methodology. These models can be used to derive efficient algorithms to find optimal patterns for each methodology. We will not go into the technical details of the implementations here; only some general remarks will be made at the end of this paper. We refer the interested reader to the references.
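The audit that these models make unnecessary can still be carried out by hand to see why a pattern is protected. The following Python sketch is an illustration added here (not code from the paper); assuming SciPy, it solves the attacker’s linear programs for the Interval Publication pattern above and checks the sensitive cell (Activity II, Region C) against the protection levels upl = 3, lpl = 2, spl = 6.

# Illustrative audit of the Interval Publication pattern with external bounds [0, 1000].
import numpy as np
from scipy.optimize import linprog

rows_tot = [80, 49, 61]          # published row totals (Activities I-III)
cols_tot = [45, 101, 44]         # published column totals (Regions A-C)

# Published information per internal cell: an exact value is a degenerate interval.
bounds = [(18, 24), (50, 50), (6, 12),     # Activity I
          (4, 10),  (19, 19), (20, 26),    # Activity II
          (17, 17), (32, 32), (12, 12)]    # Activity III
bounds = [(max(l, 0), min(u, 1000)) for l, u in bounds]   # intersect with [0, 1000]

A_eq, b_eq = [], []
for r in range(3):                                   # row additivity
    A_eq.append([1 if i // 3 == r else 0 for i in range(9)]); b_eq.append(rows_tot[r])
for c in range(3):                                   # column additivity
    A_eq.append([1 if i % 3 == c else 0 for i in range(9)]); b_eq.append(cols_tot[c])

obj = np.zeros(9); obj[5] = 1                        # sensitive cell (Activity II, Region C)
y_min = linprog(obj,  A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
y_max = -linprog(-obj, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
a_k, lpl, upl, spl = 22, 2, 3, 6
print(y_min, y_max,
      y_min <= a_k - lpl and y_max >= a_k + upl and y_max - y_min >= spl)

For this pattern the attacker’s interval should come out as [20, 26], which meets all three protection-level conditions.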

2 Cell Suppression Methodology

For each cell i ∈ {1, …, n}, let

    xi = 1 if cell i must be suppressed, and xi = 0 otherwise.

Then the optimization problem can be formulated as

    min ∑i=1..n wi xi

subject to:

    xi ∈ {0, 1}        for all i = 1, …, n

and, for each sensitive cell k ∈ S and each attacker p ∈ P,

    y_k^p ≤ ak – lpl_k^p        ȳ_k^p ≥ ak + upl_k^p        ȳ_k^p – y_k^p ≥ spl_k^p,

where y_k^p := min yk and ȳ_k^p := max yk subject to

    ∑i=1..n mij yi = bj                                       for all j = 1, …, m
    ai – (ai – lb_i^p) xi ≤ yi ≤ ai + (ub_i^p – ai) xi        for all i = 1, …, n.



3 Interval Publication Methodology

For each cell i ∈ {1, …, n}, let zi– and zi+ be nonnegative numbers such that the output is the interval [ai – zi– … ai + zi+]. Then the optimization problem can be formulated as

    min ∑i=1..n (wi– zi– + wi+ zi+)

subject to:

    zi–, zi+ ≥ 0        for all i = 1, …, n

and, for each sensitive cell k ∈ S and each attacker p ∈ P,

    y_k^p ≤ ak – lpl_k^p        ȳ_k^p ≥ ak + upl_k^p        ȳ_k^p – y_k^p ≥ spl_k^p,

where y_k^p := min yk and ȳ_k^p := max yk subject to

    ∑i=1..n mij yi = bj              for all j = 1, …, m
    ai – zi– ≤ yi ≤ ai + zi+         for all i = 1, …, n
    lb_i^p ≤ yi ≤ ub_i^p             for all i = 1, …, n.

4 Controlled Rounding Methodology

For each cell i ∈ {1, …, n}, let

    xi = 0 if the output is ⌊ai⌋, and xi = 1 if the output is ⌈ai⌉, i.e., vi := ⌊ai⌋ + ri xi,

where ⌊ai⌋ denotes ai rounded down to a multiple of the base ri (so that ⌈ai⌉ = ⌊ai⌋ + ri). Then the optimization problem can be formulated as

    min ∑i=1..n wi xi

subject to:

    ∑i=1..n mij (⌊ai⌋ + ri xi) = bj        for all j = 1, …, m
    xi ∈ {0, 1}                            for all i = 1, …, n

and, for each sensitive cell k ∈ S and each attacker p ∈ P,

    y_k^p ≤ ak – lpl_k^p        ȳ_k^p ≥ ak + upl_k^p        ȳ_k^p – y_k^p ≥ spl_k^p,

where y_k^p := min yk and ȳ_k^p := max yk subject to

    ∑i=1..n mij yi = bj                             for all j = 1, …, m
    ⌊ai⌋ + ri xi – ri ≤ yi ≤ ⌊ai⌋ + ri xi + ri       for all i = 1, …, n
    lb_i^p ≤ yi ≤ ub_i^p                             for all i = 1, …, n.

5 Cell Perturbation Methodology

For each cell i ∈ {1, …, n}, let zi– and zi+ be nonnegative numbers such that the output is vi := ai + zi+ – zi–. Then the optimization problem can be formulated as

    min ∑i=1..n (wi– zi– + wi+ zi+)

subject to:

    ∑i=1..n mij (ai + zi+ – zi–) = bj        for all j = 1, …, m
    zi+, zi– ≥ 0                             for all i = 1, …, n

and, for each sensitive cell k ∈ S and each attacker p ∈ P,

    y_k^p ≤ ak – lpl_k^p        ȳ_k^p ≥ ak + upl_k^p        ȳ_k^p – y_k^p ≥ spl_k^p,

where y_k^p := min yk and ȳ_k^p := max yk subject to

    ∑i=1..n mij yi = bj                                     for all j = 1, …, m
    ai + zi+ – zi– – ri ≤ yi ≤ ai + zi+ – zi– + ri           for all i = 1, …, n
    lb_i^p ≤ yi ≤ ub_i^p                                     for all i = 1, …, n.

6 General discussion on the algorithms

The above mathematical models are the basic tools for creating a computer code that finds (optimal or near-optimal) solutions for each methodology. We will not give the technical details here, but only mention that in practice it is important to eliminate the continuous variables yi by applying Linear Programming duality theory. These continuous variables are then replaced by cutting-plane inequalities for the mathematical models (called master problems). There is a very large number of such inequalities, but the good news is that not all of them are necessary, and the relevant ones can be found by solving a linear program (called the subproblem) inside an iterative procedure known as a branch-and-cut algorithm in modern Mathematical Programming terminology. The approach then follows a sequence of iterations, and at each one a lower bound and an upper bound for the (initially unknown) optimal loss of information are computed. When both bounds coincide, the optimal loss of information is known and an optimal pattern is given. Still, in practice the statistical office may not be interested in the final optimal solution and can interrupt the iterative procedure before completion. In this case, a feasible (and therefore protected) solution is obtained with a worst-case analysis given by the gap between the last upper and lower bounds. This idea is illustrated by the following figure.

[Figure: lower and upper bounds on the loss of information versus computational time; interrupting the procedure leaves a gap between the bounds, while running it to completion closes the gap and proves optimality.]

We have conducted experiments protecting different benchmark tables with the above methodologies. We next give the features of some of them, which can be downloaded from the webpage: http://webpages.ull.es/users/casc/

By Ramesh Dandekar:
• 13x7x7, 637 cells, 525 equations (519 cells, 497 equations, 50 sensitive cells)
• 13x13x7, 1183 cells, 1443 equations (1040 cells, 1388 equations, 75 sensitive cells)
• 13x13x13, 2197 cells, 3549 equations (2020 cells, 3458 equations, 112 sensitive cells)
• 16x16x16, 4096 cells, 5376 equations (3564 cells, 5201 equations, 224 sensitive cells)

By Sarah Giessing:
• linked table, 2890 cells, 1649 equations (727 cells, 717 equations, 376 sensitive cells)
• linked table, 8280 cells, 4168 equations (5540 cells, 3245 equations, 829 sensitive cells)
• linked table, 18900 cells, 11040 equations (11963 cells, 8159 equations, 2114 sensitive cells)
• linked table, 58320 cells, 27792 equations (31470 cells, 19748 equations, 7003 sensitive cells)
• linked table, 148140 cells, 81184 equations (82324 cells, 75432 equations, 17321 sensitive cells)

By Anco Hundepool:
• 6x6x6x6 (1296 cells)
• 6x8x8x13 (4992 cells)
• linked table, 5586 cells, 4972 equations
• linked table, 15120 cells, 2892 equations

By David Brown:
• linked table, 572 cells, 274 equations
• linked table, 10963 cells, about 5000 equations

For example, applying the Cell Suppression Methodology to the instance hier13x7x7, where the “loss of information” is defined as the sum of penalties over all secondary suppressions with

    wi := 100 + log(original value of cell i)   if cell i is NOT sensitive
    wi := 0                                     otherwise,

then the performance of our branch-and-cut procedure is summarized in the following table:

time (sec.)   lower bound   upper bound   primary + secondary
        12          3650         14145                   181
       186          9121         13608                   176
       315          9192         13510                   175
       407          9233         13070                   171
       830          9318         12960                   170
      2183          9489         12655                   167
      2438          9499         12427                   165
      3900          9572         12340                   164
      4993          9664         11919                   160
     10832          9707         11681                   158
     11990          9720         10589                   148
     28020          9822         10381                   146
     39929          9867         10163                   144
     43044          9876         10161                   144
     46836          9913          9940                   142
     50461          9940          9940                   142

This means that the personal computer (Pentium at 833 MHz) running our code needs about 14 hours to prove optimality of a generated solution. Nevertheless, if the procedure is aborted after 1 hour, then a protected output pattern with 165 suppressions


is provided, together with the guarantee that it is not far from the minimum loss of information, within a gap of 23%. Therefore, the procedure can also be used as a heuristic approach to find near-optimal solutions. In all cases, the generated solution (optimal or near-optimal) is always a feasible pattern, which means that all the protection levels are satisfied. The mathematical theory underlying the models guarantees that no auditing phase is required to confirm this feasibility. The features of the final optimal pattern for this benchmark table are the following:

• 50 primary suppressions, with ∑i ai = 46374 and ∑i log ai = 302
• 92 secondary suppressions, with ∑i ai = 664495 and ∑i log ai = 740

7 Acknowledgement

This work was partially supported by the “Ministerio de Ciencia y Tecnología” (TIC2002-00895), and by the European Research project IST2000-25063 entitled “Computational Aspects of Statistical Confidentiality” (CASC).

References

Fischetti, M. and Salazar, J. J. (1998). “Computational Experience with the Controlled Rounding Problem in Statistical Disclosure Control”, Journal of Official Statistics, 14/4, 553–565.
Fischetti, M. and Salazar, J. J. (1999). “Models and Algorithms for the 2-Dimensional Cell Suppression Problem in Statistical Disclosure Control”, Mathematical Programming, 84, 283–312.
Fischetti, M. and Salazar, J. J. (2000). “Solving the Cell Suppression Problem on Tabular Data with Linear Constraints”, Management Science, 47, 1008–1026.
Fischetti, M. and Salazar, J. J. (2000). “Models and Algorithms for Optimizing Cell Suppression Problem in Tabular Data with Linear Constraints”, Journal of the American Statistical Association, 95, 916–928.
Fischetti, M. and Salazar, J. J. (2002). “Partial Cell Suppression: a New Methodology for Statistical Disclosure Control”, Statistics and Computing, 13/1, 13–21.


An Algorithm for Computing Full Rank Minimal Sufficient Statistics with Applications to Confidentiality Protection

Yves Thibaudeau
U.S. Census Bureau

1. Introduction

A popular minimal sufficient statistic for the parameters of a hierarchical log-linear model (Bishop, Fienberg and Holland 1975, page 67) is the collection of margins of the orders of the model. This statistic is useful, as it can serve as input for iterative proportional fitting, for example, but it is typically dimensionally redundant and does not decompose the spectrum of information on the parameters of the model available from the data proportionally to its dimensionality. For example, the set of two-way margins is minimal sufficient for a four-dimensional hypercube, given a second-order hierarchical log-linear model (HLLM), but there are twenty-four individual marginal counts, while the model has only eleven degrees of freedom (d.f.). We define a full-rank minimal sufficient statistic (FRMSS) to be a minimal sufficient statistic whose dimensionality is equal to the number of d.f.’s of the HLLM. The paper presents an algorithm for deriving a FRMSS for any HLLM. A FRMSS gives the most compact representation of the sufficient information on the parameters of a HLLM. Moreover, a FRMSS can be augmented so that the augmented information has rank equal to the dimensionality of the multinomial counts, and consequently all the information ancillary with respect to the HLLM is also represented in the augmented FRMSS. The decomposition of the counts into a FRMSS and its complement is useful as it naturally lends itself to the simulation (or partial simulation) of either the minimal sufficient and/or the complementary information for the purpose of disclosure limitation. Dobra, Tebaldi and West (2003) propose a technique for modifying the information borne by multinomial counts in order to limit disclosure, without altering the minimal sufficient information. They propose a probabilistic algorithm that prescribes “moves” that change the entries of a contingency table, scrambling only ancillary atoms of information. Their algorithm extends the work of Diaconis and Sturmfels (1998), who suggested an algebraic algorithm to retrieve all possible “moves”. The general approach of these authors implicitly assumes there is no need to identify and modify a FRMSS to ensure satisfactory disclosure protection. This paper starts from the premise that perturbing the minimal sufficient information may be desirable in order to provide adequate disclosure limitation for some contingency tables. In this context we show that the identification and computation of an explicit FRMSS puts the data releaser in an advantageous position. We submit that, because the derivation of a FRMSS effectively decouples the minimal sufficient information from any residual information, it provides a desirable level of control for the manipulation of both types of information when the goal is to simultaneously limit disclosure and ensure data quality. The next three sections present the principles of our algorithm, which decomposes a contingency table into a FRMSS and its complement. Then, section 5 gives an example, and section 6 shows how our algorithm can be used in the context of a simulation of the minimal sufficient information, along with the simulation of residual information, to construct simulated contingency tables that embed some confidentiality protection.

2. Factorization of the Likelihood of a HLLM

We recall an important fact regarding HLLM’s, which is at the basis of our algorithm. This fact is best stated in the Bayesian context: if the multinomial probabilities corresponding to a contingency table are constrained by a HLLM, then the joint marginal probabilities corresponding to a full hierarchy of categorical variables are statistically independent of the complementary joint conditional probabilities corresponding to the remaining categorical variables. The important consequence of this fact for us is that, to identify a FRMSS for the joint cell probabilities, it is sufficient to identify one FRMSS for the joint marginal cell probabilities of the variables in a hierarchy, and another FRMSS for the joint conditional probabilities of the corresponding conditional cells. Recall that a FRMSS for the marginal probabilities of a hierarchy is just the set of joint marginal counts for the variables involved. The next few sections focus on the task of identifying a FRMSS for the conditional probabilities. We proceed by constructing a non-singular parameterization for the conditional probabilities. For simplicity, we assume the HLLM is of second order. The generalization to a HLLM of higher order is easily done.

3. The Layering Procedure

3.1 Layering the Top Layer


We show that by proceeding recursively we can get a complete set of conditional probabilities based only on a subset of “free” parameters. The number of free parameters shall match exactly the number of degrees of freedom allocated by the model to characterize the interactions and main effects corresponding to the categorical variable that is the most rapidly varying, that is, the variable corresponding to the top layer of conditional probabilities.

Consider first the simplest non-trivial situation, an I × J × K table. Let sgn(i, j, k) = 1 if the triple (i, j, k) requires an even number of changes of the indices i, j, k relative to (1, 1, 1), and sgn(i, j, k) = –1 if it requires an odd number of changes. For example, if we let α, β, χ > 1, then we get sgn(1, 1, 1) = sgn(1, β, χ) = 1 and sgn(1, β, 1) = sgn(α, β, χ) = –1.
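The parity rule behind sgn can be made explicit with a one-line function. The Python sketch below is an illustration added here, not code from the paper.

# Illustrative sketch: sgn(i, j, k, ...) = +1 for an even number of indices
# differing from 1, and -1 for an odd number.
def sgn(*indices):
    changes = sum(1 for x in indices if x != 1)
    return 1 if changes % 2 == 0 else -1

assert sgn(1, 1, 1) == 1 and sgn(1, 2, 2) == 1
assert sgn(1, 2, 1) == -1 and sgn(2, 2, 2) == -1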

Given a second-order hierarchical model constraining the probability structure of the contingency table, the following holds:

    ∏_{i∈{1,α}, j∈{1,β}, k∈{1,χ}} ( π_{i,j,k} )^{sgn(i,j,k)} = 1        (1)

We use “π” to mean “probability of”. In terms of conditional probabilities we can write

    ∏_{i∈{1,α}, j∈{1,β}, k∈{1,χ}} ( π_{i|j,k} )^{sgn(i,j,k)} = 1        (2)

In general, for given values of β > 1 and χ > 1, (2) leads to a system of I equations that are linear in the I unknowns π(i = α | j = β, k = χ), where α = 1, …, I. These equations are

    ∑_{α=1}^{I} π(i = α | j = β, k = χ) = 1        (3)

and

    π(i = α | j = β, k = χ) = π(i = 1 | j = β, k = χ) ×
        [ π(i = 1 | j = 1, k = 1) π(i = α | j = β, k = 1) π(i = α | j = 1, k = χ) ] /
        [ π(i = α | j = 1, k = 1) π(i = 1 | j = β, k = 1) π(i = 1 | j = 1, k = χ) ],
        α = 2, …, I.        (4)

Solving (3) and (4) simultaneously yields

    π(i = α | j = β, k = χ) =
        [ π(i = α | j = β, k = 1) π(i = α | j = 1, k = χ) ∏_{δ=1}^{I} π(i = δ | j = 1, k = 1) ] /
        [ π(i = α | j = 1, k = 1) D(β, χ) ]        (5)

where

    D(β, χ) = ∑_{α=1}^{I} [ π(i = α | j = β, k = 1) π(i = α | j = 1, k = χ) ∏_{δ=1}^{I} π(i = δ | j = 1, k = 1) ] / π(i = α | j = 1, k = 1).        (6)

RHS of (5). Thus there are J ( I −1) + ( I −1)( K −1) free parameters. These correspond to the d.f’s of the marginal levels of variable i , and the d.f.’s generated by the interactions terms in the log-linear subspace between the levels of variables i and j , as well the interactions between the levels of variables i and k . We can extend the top layer to account for additional conditioning categorical variables. Consider a I ×J ×K ×L contingency table, and let the top-layer conditional probabilities are represented by π ( i = α j = β, k = χ , l = λ ) . So far we have shown that the layering procedure allows us to represent the conditional probabilities of the form π ( i = α j = β, k = χ, l = 1) in terms only of the free parameters on the RHS of (5). We must extend this set of free parameters in order to represent the additional conditional probabilities generated by a third conditioning categorical variables. Assuming a second-order model we write

48



(π }



(π }



(π }

i ∈{1, α j ∈{1, β } k ∈{1, χ } l =1

i j , k ,l

)

sgn( i, j, k )

)

sgn( i, j, k )

)

sgn ( i, j, k )

=1

(7)

=1

(8)

= 1

(9)

and by symmetry

i ∈{1,α j ∈{1, β } k =1 l = {1, λ }

and

i ∈{1, α j =1 k = {1, χ } l ∈{1, λ }

i j , k ,l

i j, k, l

where λ = 2, … , L . In defining additional free parameters we consider the case where k = 1 , l = λ > 1 , and the case where k = χ > 1 , l = λ > 1 separately. To resolve the first case we treat the variables i, j , l in (8) in the same manner that we treated variables i, j , k in (2) in order to obtain (5), while keeping k = 1 . We obtain the following system of equations: I

∑ π (i =α

α =1

j = β , k = 1, l = λ ) = 1

( 10 )

and

π (i = α j = β , k = 1, l = λ ) = π ( i = 1 j = β , k =1, l = λ )

×

π (i = 1 j = 1, k = 1, l = 1)π ( i = α j = β , k =1, l = 1) π ( i = α j= 1, k= 1, l = λ ) π ( i = α j= 1, k= 1, l = 1) π ( i =1 j = β , k = 1, l = 1)π (i =1 j =1, k = 1, l = λ )

α = 2, …, I

49

( 11 )

Note that (10) and (11) can be solved entirely in terms of the free parameters defined previously, and in terms of π ( i =α j= 1, k= 1, l = λ ) , for α =1,…, I , which define

( I −1)( L −1) additional free parameters.

In addition, we claim that the free parameters defined so far are sufficient to represent any conditional probability of the form π ( i = α j = β, k = χ , l = λ ) . In other words, the updated set of free parameters also takes care of the second case, k = χ > 1 , l = λ > 1. To show this we continue our layering procedure based on (9). We have

π (i = α j = 1, k = χ , l = λ ) = π ( i =1 j =1, k = χ , l = λ )

×

π (i = 1 j = 1, k = 1, l = 1)π ( i = α j = 1, k = χ , l = 1)π ( i = α j= 1, k= 1, l = λ ) π ( i = α j= 1, k= 1, l = 1) π ( i = 1 j =1, k = χ , l = 1) π ( i = 1 j = 1, k = 1, l = λ )

( 12 )

α = 2, …, I (12) is solved for the π ( i = α j = 1, k =χ , l = λ ) ’s and the solution is expressed entirely in terms of the current free parameters. Next, the π ( i = α j = β , k = 1, l = λ ) ’s and the π ( i = α j = 1, k =χ , l = λ ) ’s can be substituted back in (5), for each l = λ , to resolve π ( i = α j = β , k = χ , l = λ ) in terms of the current free parameters, which proves our claim. We now give a broad picture of our layering procedure for the top layer: assume that the last two conditioning categorical variables to be involved in the layering are m and n , with M and N categories respectively. Then extending the layering to account for n entails defining ( I − 1)( N − 1) additional free parameters, namely the conditional probabilities π ( i = α j= 1, k = 1, l = 1,… , m = 1, n = ν ) , where α =1, …, I −1

and ν = 2, …, N .

3.2 Layering the Lower Layers With a second order model, lower layers of conditional probabilities and free parameters must be defined to produce a FRMSS with our algorithm, if there are four or more variables. We first consider the case where there are exactly four variables and then show how it generalizes. We have

50



j =1,2 k =1 , 2 l =1,2



)

sgn ( j, k, l )

j i = 1, k , l

= 1.

( 13 )

Based on (13), we define the free parameters on the second layer just as before, but setting i = 1 , and putting i in the background. So we are back to the three variables set-up in (5). Proceeding this way, the layering is easily extended to build any number of lower layers of conditional probabilities in terms of a set of free parameters. For example, if there are five variables, we define the free parameters of the third layer by working from



k =1,2 l = 1,2 m =1,2



sgn ( k , l, m )

k i = 1, j =1, l , m

)

= 1.

( 14 )

The reader can verify that the correct number of d.f.’s is being generated.

4. The Assembling Procedure With a second order model if there are four or more variables there will be more than one layer of conditional probabilities and these layers must then be assembled together to obtain comprehensive expressions for the joint conditional probabilities in terms of the free parameters. We assemble the layers recursively, from the top down with an “assembling” procedure that can handle any number of layers. To explain our procedure, it is convenient to assume first that there are only two layers of conditional probabilities, and so there are only four categorical variables. The generalization to the case where more than two layers are defined will be by recursion. Given a I ×J ×K ×L contingency table, let k and l represent the conditioning categorical variables, and thus their marginal probabilities serve as free parameters, and let pi, j be the joint conditional probability of the variables i and j , conditional on k = 1 and l = 1 . Let q i, j be the conditional probability for a given value of variable i conditional on the value of j (also conditional on k = 1 and l = 1 ), and let r1, r2 , … , rJ be the conditional probabilities for the corresponding value of variable j given i = 1 (also conditional on k = 1 and l = 1 ). With no loss of generality we write:

51

p •,1 × q1,1

=

p1, • × r1

p•, 2 × q1,2

=

p1, • × r2

#

( 15 )

p•, J −1 × q1, J −1

=

p1, • × rJ −1

 J −1  1 − ∑ p•, j  × q1, J j=1  

=

 J −1  p1, • × 1 − ∑ r j   i=1 

(15) is equivalent to 0 q1,1 0 " − r1  0 q   0 − r 1,2 2    #  %    − 0 0 q r 1, J −1 J −1     J −1   q1, J q1, J " q1, J  1 − ∑ rj    j=1  

p•,1   0    0  p•, 2    #  =  #     p•, J −1   0    q1,J  p1,•   

( 16 )

Solving for p•, j in the first J − 1 rows of and substituting in the last row we get  p1, • ×  q1, J  

 J −1 rj ∑  j=1 q1, j 

J −1    + 1 − ∑ rj  = q1,J  j=1   

( 17 )

Thus J

p1,• =

∏ q1, j j =1

C (1 ) 

( 18 )

and  J   ∏ q1, j  r1 j =1  p1,1 = p1, • r1 =  C (1 ) 

where

52

( 19 )

 J   J −1  J − 1    C (1 ) = ∑  ∏ q1, k  r j +  ∏ q1, j  rj   1 − ∑  j =1  k = 1 j =1 j =1       j≠ k  J −1

( 20 )

To simplify the algebra we let J −1

rJ = 1 − ∑ rj

( 21 )

j =1

But we keep in mind that only the parameters on the RHS of (21) are free. The expression on the LHS serves only as a short hand for algebraic reduction. We obtain

p1, j

 J   ∏ q1, k  rj k =1  = C (1 ) 

Multiplying the LHS and RHS of (22) by

qi, j q1, j

( 22 )

we obtain the following general

formula for the joint conditional probability for the values of i and j , conditional on k = 1 , and l = 1 :

pi , j

 J    qi, j  ∏ q1, z  r j  zz =≠ 1j    = C (1 ) 

( 23 )

Note that the formula in (23) can be used to solve for the joint conditional probabilities of i and j , conditional on any values of k and l by substituting for qi ,1 , qi,2 , …, qi, J and r1, r2 , … , rJ , in terms of the free parameters of the model. Moreover, (23) can be applied recursively from the top down, treating previous layers of joint conditional probabilities that were assembled together as a new toplayer, now to be assembled with the layer directly under. In the next section we illustrate an application of (23), which produces expressions for the joint conditional probabilities of i and j , conditional on any values of k and l .

53

5. Example – The Second-Order Model on a 2 x 2 x 2 x 2 Table The title of this section tells all about the structure of the contingency table and the model of this example. According to our algorithm we first define the layers of conditional probabilities, along with the base marginal probabilities. Let q a,b,c = π (i = 1 j =a, k = b, l = c ) ra, b = π ( j =1 i = 1, k = a, l = b )

( 24 )

sa ,b = π ( k = a, l = b )

Our layering procedure delineates the following free parameters. q1,1,1, q2,1,1, q1,2,1 , q1,1,2

( 25 )

r1,1, r2,1, r1,2 s1,1 , s2,1 , s2,1 , s2,2

We compute the probabilities of i and j , conditional on any values of k and l . Define

ua ,b,c, d = π (i = a, j = b k = c, l = d )

( 26 )

We now exhibit how our algorithm generates an explicit non-degenerate parameterization that will serve to define a FRMSS. After the assembling procedure we get:

54

u1,1,1,1 = q1,1,1 q2,1,11,1 r D1,1 r D1,1 (1 − q1,1,1 ) q2,1,11,1 u1,2,1,1 = q1,1,1q2,1,1 (1 − r1,1 ) D1,1 u2,2,1,1 = q1,1,1 (1 − q2,1,1 )(1 − r1,1 ) D1,1 u1,1,2,1 = (1 − q1,1,1 ) q2,1,1 q1,2,1 r2,1D2,1 u2,1,2,1 = ( 1 − q1,1,1 ) q2,1,1 (1 − q1,2,1 ) r2,1 D2,1 u1,2,2,1 = ( 1 − q1,1,1 ) q2,1,1q1,2,1 (1 − r2,1 ) D2,1 u2,2,2,1 = q1,1,1 ( 1 − q2,1,1 )(1 − q1,2,1 )(1 − r2,1 ) D2,1 u1,1,1,2 = (1 − q1,1,1 ) q2,1,1 q1,1,2 r1,2 D1,2 u2,1,1,2 = ( 1 − q1,1,1 ) q2,1,1 (1 − q1,1,2 ) r1,2 D1,2 u1,2,1,2 = ( 1 − q1,1,1 ) q2,1,1q1,1,2 (1 − r1,2 ) D1,2 u2,2,1,2 = q1,1,1 ( 1 − q2,1,1 )(1 − q1,1,2 )(1 − r1,2 ) D1,2 2 u1,1,2,2 = ( 1 − q1,1,1 ) q2,1,1 q1,2,1 q1,1,2 ( 1 − r1,1 ) r2,1 r1,2 D2,2 u2,1,2,2 = q1,1,1 ( 1 − q1,1,1 ) q2,1,1 (1 − q1,2,1 )(1 − q1,1,2 )(1 − r1,1 ) r2,11,2 r D2,2 2 u1,2,2,2 = (1 − q1,1,1 ) q2,1,1q1,2,1 q1,1,2 r1,1 (1 − r2,1 ) (1 − r1,2 ) D2,2 2 u2,2,2,2 = q1,1,1 (1 − q2,1,1 )(1 − q1,2,1 )(1 − q1,1,2 ) r1,1 (1 − r2,1 ) (1 − r1,2 ) D2,2 u2,1,1,1 =

( 27 )

The Dk ,l ’s are inverses of polynomials in the free parameters. We compute the FRMSS by expressing the likelihood in terms of the free parameters. Table 1 exhibits the linear combinations corresponding to each free parameter, in terms of the coefficients for each elementary count. These linear combinations define the components of the FRMSS. We embed these components in the following vector:

[T1 , T2 , T3 , T4 , T5 , T6 , T7 , T8 , T9 , T10 , T11 ]

t

.

55

Table 1 - Matrix Defining a Full Rank Minimal Sufficient Statistic

q1,1,1

q2,1,1

q1,2,1

q1,1,2

r1,1

r2,1

r1,2

s1,1

s2,1

s1,2

s2,2

X 1,1,1,1

1

1

0

0

1

0

0

1

0

0

0

X 2,1,1,1 X1,2,1,1 X 2,2,1,1 X1,1,2,1 X 2,1,2,1 X1,2,2,1 X 2,2,2,1 X1,1,1,2 X 2,1,1,2 X1,2,1,2 X 2,2,1,2 X1,1,2,2 X 2,1,2,2 X1,2,2,2 X 2,2,2,2

0

1

0

0

1

0

0

1

0

0

0

1

1

0

0

0

0

0

1

0

0

0

1

0

0

0

0

0

0

1

0

0

0

0

1

1

0

0

1

0

0

1

0

0

0

1

0

0

0

1

0

0

1

0

0

0

1

1

0

0

0

0

0

1

0

0

1

0

0

0

0

0

0

0

1

0

0

0

1

0

1

0

0

1

0

0

1

0

0

1

0

0

0

0

1

0

0

1

0

0

1

0

1

0

0

0

0

0

1

0

1

0

0

0

0

0

0

0

0

1

0

0

1

1

1

0

1

1

0

0

0

1

1

1

0

0

0

1

1

0

0

0

1

0

1

1

1

1

0

0

0

0

0

1

2

0

0

0

1

0

0

0

0

0

1

6. Application to Confidentiality Protection and Conclusion We use our example to show how our layering-assembling algorithm can be useful for confidentiality protection. Let A be the matrix defined by table 1, and X the  vector of counts. The FRMSS produced by our algorithm for a 2 x 2 x 2 x 2 table is

T = [T1, T2, T3 , T4 , T5 , T6 , T7 , T8 , T9 , T10 , T11 ] = A t X . T is full rank in the sense that    A has full column rank. The statistical discrimination power of T covers any  contrast that can be moulded in the parameter space of the HLLM, and there does not exists a smaller dimension FRMSS for which this is true. We now show how to simulate the value of T , conditional on the residual information in the counts X ,   t

56

after observing T . The residual information is appended to T by augmenting T so    that the entire probabilistic space of X is covered. To do so we augment A with  column vectors so that A augmented is a non-singular matrix. The following linear statistics implicitly provide such column: Y = X 1,2,2,2 + X 2,2,2,2 , Z = X 1,2,2,2 , U = X1,1,2,2 , V = X 1,2,1,2 , W = X1,2,2,1 . Let A* be the augmented matrix and let

T * = [T1, T2, T3, T4 , T5 , T6 , T7 , T8 , T9 , T10, T11 , Y , Z ,U , V , W ] = A* t X   t

( 28 )

Then T * is indirectly restricted by the original constraints for the vector of counts  X . In particular we have 

(A ) *t

−1

T* ≥ 0 

( 29 )

(29), along with the original count total, define the range of T and Y , Z , U , V , W  jointly. Based on this range, we can simulate T conditional on Y , Z , U , V , W with  the probability function implicit in (27) with the free parameters substituted with the values of their MLE’s. To simulate Y , Z , U , V , W conditional on T we use (29) to  establish the boundaries of the range of Y , Z , U , V , W conditional on T , and then we  retrieve the combinatorial coefficients defining the conditional probabilities of T *  given T from the original multinomial distribution of X . Of course, we can also   operate a joint simulation, conditioning on Y , Z , U , V , W , and on T in turn, in a  Gibbs sampling fashion. To summarize, our algorithm positions the prospective data releaser so that he/she can readily take advantage of three general disclosure limitation strategies: 1. The simulation of the minimal sufficient information only. 2. The simulation of the information complementary to the minimal sufficient information only. 3. The joint simulation of the minimal sufficient information and of its complementary information. The computations involved in each case can be expected to increase with the size of the table, but they remain reasonable for a wide variety of situations. Disclaimer This paper reports the results of research and analysis undertaken by Census Bureau staff. It has undergone a more limited review than official Census Bureau publications. This report is released to inform interested parties of research and to encourage discussion. References

57

BISHOP, Y. M. M., FIENBERG, S. E., HOLLAND, P. W. (1975). Discrete Multivariate Analysis. MIT Press. DIACONIS, P., STURMFELS, B., (1998). Algebraic Algorithms for Sampling form Conditional Distributions. The Annals of Statistics, 26, 1. DOBRA, A., TEBALDI, C., WEST, M., (2003). Bayesian Inference in Incomplete Multi-Way Tables. Technical Report 2003-2, Statistical and Applied Mathematical Sciences Institute.

58

Microdata Disclosure by Resampling: Empirical Findings for Business Survey Data

Sandra Gottschalk
Centre for European Economic Research (ZEW)

Abstract. A problem which statistical offices and research institutes face when releasing microdata is the preservation of confidentiality. Traditional methods to avoid disclosure often destroy the structure of the data, i.e., information loss is more or less high. In this paper I discuss an alternative technique for creating scientific-use-files which reproduces the characteristics of the original data quite well. It is based on an idea of Fienberg (1997 and 1994) to estimate and resample from the empirical multivariate cumulative distribution function of the data to obtain synthetic data. The procedure creates datasets - the resample - which have the same characteristics as the original survey data. In this paper I present some applications of this method with (a) simulated data and (b) innovation survey data, the Mannheim Innovation Panel (MIP), and compare resampling with a usual method of disclosure control, disturbance with multiplicative error, concerning confidentiality on the one hand and the usability of the disturbed data for different kinds of analyses on the other hand. The results show that univariate distributions can be reproduced better by unweighted resampling. Parameter estimates can be reproduced quite well if (a) the resampling procedure uses the correlation structure of the original data as a scale and (b), with multiplicatively perturbed data, when a correction term is used. On average, anonymized data with multiplicatively perturbed values protect better against re-identification than the various resampling methods used.

Keywords. resampling, multiplicative data perturbation, Monte Carlo studies, business survey data

1

Introduction

Empirical research in economic and social science needs information about households and firms, which are collected by statistical offices and official or private research institutes in form of microdata. As computer capability and availability of statistical software increased in the last past years, empirical analyses and thus demand for microdata have dynamically advanced. German law provides, that microdata of government statistics are only allowed to pass on for scientific use and if disclosure limitation is guaranteed in effect1 . The same holds for survey data, conducted by private or official research institutes, if confidentiality is promised to the respondents. Hence, a problem 1

Disclosure should not be possible without unusually high costs and waste of time and energy.

59

which statistical offices and research institutes are faced with by releasing micro-data is the preservation of confidentiality. Even business survey data are at risk, because disclosure is more likely than for personal data as additional information are easier obtainable and population size is smaller (see e.g. Brand, 2000). Traditional methods to avoid disclosure often destroy the structure of data, i.e., information loss is more or less high. In this paper I discuss an alternative technique of creating scientific-use-files2 , resampling, which generates a synthetic microdata file with nearly the same characteristics as the original survey data. It is based on an idea of Fienberg (1997 und 1994) to estimate and resample from the empirical multivariate cumulative distribution function of data. As elements of the resample are only replicates and do not necessarily correspond to any of those individuals in the original sample survey, an identification of true values is not possible. Nevertheless one cannot rule out the possibility of disclosure, as synthetic datasets could be very similar to real characteristics of observations. Especially, extreme values are at risk. The paper is structured as follows: in the first section I describe the idea of resampling and an easily constructed algorithm to create synthetic data, attributed to Devroye and Gy¨ orfi (1985) and Silverman (1986). Afterwards applications with simulated data (Section 3) and business innovation survey data (Section 4) points out the properties of resamples. Confidentiality and utilizability are examined. In a second step I compare particularly resampling with an usual method of disclosure control, disturbance with multiplicative error (see e.g. Hwang, 1986), concerning confidentiality on the one hand and the usage of the disturbed data for different kinds of analyses on the other hand.

2

The Idea of Resampling

To generate synthetic data with the same characteristics as an original survey data file, one has to estimate the density function and then sample from it. It will be impossible to exactly achieve it in practice, as full information about the true density of data is not available. One could apply a parametric approach in assuming a theoretical density function with unknown parameters, like the normal distribution. The parameters have to be estimated with the data, e.g. means and variances. To sample from a theoretical distribution function is then quite easily done. But in reality survey data will rarely follow a specific theoretical distribution. Fienberg (1997) proposes non-parametric and semi-parametric estimation methods, like kernel density estimators or a Bayesian approach (see also Fienberg et al., 1996). The estimated cumulative distribution function will differ more or less from the real distribution of the data. Most of the survey data, official statistics as well as surveys from research institutes, undercover the real population. Therefore the sample distribution is merely an estimation of reality. Another source of bias is introduced by measurement errors. Furthermore, techniques to estimate multivariate cumulative 2

² In contrast to public-use files, which should be totally anonymized, i.e. disclosure is not possible under any circumstances, scientific-use files guarantee only disclosure limitation in effect (see above); such files are therefore still exploitable for scientific use.


Even three-dimensional relations are difficult to describe, and only if the sample size is large enough (e.g. XploRe, Härdle et al., 1991). One possibility to improve the estimation is to use a Bayesian method: one estimates the empirical distribution function and generates the full posterior distribution (dependent distribution). This approach takes into account regression-like relationships within the sample. "It provides a way of formalising the process of learning from data to update beliefs in accord with recent notions of knowledge synthesis" (Congdon, 2001, 1). A sample is then drawn from the posterior distribution. Fienberg et al. (1996) propose to sample from it using Rubin's multiple imputation technique (Rubin, 1993, 1987), which includes bootstrap sampling. Devroye and Györfi (1985) and Silverman (1986), who deal with nonparametric density estimation and simulation from density estimates, show how to draw from a density without the need to estimate it explicitly. The procedure can be used to create samples that have the underlying characteristics and structure of the real data, while spurious details that have arisen from random effects are suppressed. The algorithm for the univariate case is as follows: suppose a continuous variable X = X1, X2, ..., Xn (n observations) and a kernel density function K with bandwidth h. The bandwidth specifies the half-width of the kernel, i.e. the width of the density window around each point.

1. Draw observations XZ from the data file X with replacement.
2. Compute k to have probability density function K.
3. Generate Z = XZ + hk.

The kernel can be the Epanechnikov kernel, for example³:

K(x) = (3/4)(1 − x²) for |x| ≤ 1.   (1)

A simple procedure to simulate from the rescaled Epanechnikov kernel is given by Devroye and Györfi (1985):

1. Compute three univariate random numbers ZV1, ZV2, ZV3 within [−1, 1].
2. Set k = ZV2 if |ZV3| ≥ |ZV2| and |ZV3| ≥ |ZV1|, otherwise k = ZV3.

The procedure resamples with replacement⁴ from the data and disturbs the information in such a manner that the distribution of each variable is retained. The sample size of Z has to be large enough to approximate the distribution of the original data X. The choice of the smoothing parameter, the bandwidth of the kernel, is often discussed in the literature (see e.g. Parzen, 1962, Tapia and Thompson, 1978). The appropriate choice of the smoothing parameter will always be influenced by the purpose for which the density estimate is to be used.

³ One can also think of the normal density. ⁴ Resampling with replacement increases confidentiality, as some of the initial observations appear several times in the anonymized data; hence extreme values rarely arise.


An optimal bandwidth minimizes the mean square error between the real and the estimated kernel density; unfortunately, it depends on the (unknown) density being estimated. A common approach is to choose the bandwidth with reference to a standard family of densities. Silverman (1986), for instance, derives a bandwidth that minimizes the mean integrated square error if the data were Gaussian and a Gaussian kernel were used; it is therefore not optimal in any global sense. With regard to the confidentiality problem, the choice of the bandwidth h of the kernel is rather difficult, because it influences the goodness of fit to the original distribution on the one hand and the probability of disclosure on the other. A narrow bandwidth gives a better approximation of the distribution but raises the probability of re-identification, as the resampled values, though synthetic, can be very similar to the original ones. One should also consider whether a data intruder is interested in disturbed values at all.

A higher-dimensional version of the algorithm can be constructed by using directional information in the data, such as the covariance matrix of X, so that the multivariate distribution can be approximately reproduced. Devroye and Györfi (1985) modify the third step of the algorithm for I dimensions:

Z = XZ + hκA,  with VCV = AA′ and κ = [k1 k2 ... kI],   (2)

where VCV is the covariance matrix of the original variables X1, X2, ..., XI, which is used to weight the different kernels κ. To give the resample the same first and second moment properties as the data, the procedure can be transformed (Silverman, 1986):

Z = X̄ + (XZ − X̄ + hk) / (1 + h²σk²/σx²)^(1/2),   (3)

where X̄ is the sample mean of X and σx² and σk² are the variances of X and k respectively. This correction prevents an overestimation of the variances. In the multivariate case Silverman (1986) proposes to scale the kernel to have the same variance matrix as the data. Devroye and Györfi give modified versions of the above algorithms for simulating from density estimates of various kinds, e.g. for variables that concentrate their mass on an interval, such as positive numbers. The procedure presented above is only applicable to continuous variables. As most survey data contain discrete variables too, additional masking methods are needed for them, since they raise confidentiality problems of their own. In particular, regional information and classifications of economic sectors could be used for re-identification of individual firms.
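The univariate procedure, together with the Epanechnikov draw and the moment correction of equation (3), can be sketched as follows. This is a minimal illustration only; the example data, the Gaussian-reference bandwidth rule and all names are assumptions of the sketch, not part of the paper.

import numpy as np

def epanechnikov_draws(size, rng):
    # Devroye/Györfi device: take three U(-1, 1) numbers; return ZV2 if |ZV3| is
    # largest in absolute value, otherwise return ZV3.
    zv1, zv2, zv3 = rng.uniform(-1.0, 1.0, size=(3, size))
    take_zv2 = (np.abs(zv3) >= np.abs(zv2)) & (np.abs(zv3) >= np.abs(zv1))
    return np.where(take_zv2, zv2, zv3)

def resample(x, h, m, rng):
    # Steps 1-3 of the univariate algorithm with the variance correction of eq. (3).
    xz = rng.choice(x, size=m, replace=True)          # step 1: bootstrap draw
    k = epanechnikov_draws(m, rng)                    # step 2: kernel noise
    xbar, sx2 = x.mean(), x.var()
    sk2 = 1.0 / 5.0                                   # variance of the Epanechnikov kernel on [-1, 1]
    return xbar + (xz - xbar + h * k) / np.sqrt(1.0 + h**2 * sk2 / sx2)

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=2000)             # stand-in for one original variable
h = 1.06 * x.std() * len(x) ** (-1 / 5)               # Gaussian-reference bandwidth (assumption)
z = resample(x, h, m=len(x), rng=rng)
print(x.mean(), z.mean(), x.var(), z.var())           # means and variances should roughly agree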

3 Simulations

To demonstrate the effects of the resampling procedure, Monte Carlo simulations are useful because regularities can be revealed (see e.g. Robert and Casella, 2002). Here, each simulated dataset contains 2000 observations, and the procedure is repeated 100 times.


Six variables are simulated by drawing from four theoretical distribution functions: 1. normal, 2. the logarithm of variable 1, 3. exponential and 4. chi-square. The fifth variable is a linear combination of the first and fourth variables, and the sixth (Y) is a linear combination of 1. (X1), 3. (X2), 4. (X3) and an error term (u). Hence, a linear regression model is constructed (see below). To find an optimal way of constructing a resample, eight different kinds of resamples are constructed, which differ in the bandwidth of the kernel and in whether the covariance or the correlation matrix of the unmasked data is used as weights.

1. A1: The resample is constructed as described above, but step three is replaced by equation (3) to better reproduce the variances of the variables. A different bandwidth is used for each variable. The bandwidth is expected to lead to a good approximation of the kernel density, but therefore also to a higher disclosure risk. As the chosen bandwidth is only optimal for normally distributed variables, the error will increase for variables in the datasets which follow other distributions.
2. A2: The bandwidth of A1 is multiplied by a factor of 1.5. The density estimation will not be as accurate as in case A1, but confidentiality will increase.
3. B1: The same as A1, but additionally the kernels are weighted with the covariance matrix as shown in equation (2).
4. B2: The same as A2, but additionally the kernels are weighted with the covariance matrix.
5. C1: A1 is computed, but the kernel is scaled to have the same covariance matrix as the data (instead of only using the variance correction, see above).
6. C2: C1 generated with the bandwidth of A2.
7. D1: Similar to C1 but using the correlation matrix. The idea is to reproduce the correlation structure of the data; an improvement of the regression results is therefore expected.
8. D2: D1 generated with the bandwidth of A2.

For comparison, an anonymized version of the data is constructed by multiplying each variable with random numbers drawn from a univariate distribution on the interval [0.5; 1.5]. Hence, means remain the same, but variances and covariances are biased. A measure of how much confidentiality is provided by the masking techniques can be defined as follows (see e.g. Spruill, 1983):

1. Find the observation in the anonymized file that minimizes the sum of absolute or squared deviations over all common variables (I chose three variables).
2. If the observation found in 1. is the same as the one on which the masked record is based and differs by no more than 20% from the original⁵, a link is made.
3. The confidentiality criterion is then defined as the percentage of observations for which such a link cannot be made.

⁵ If the values differ by more than 20%, I presume that confidentiality is still satisfied, as the uncertainty of a re-identification is too high.


This procedure should be distinguished from an estimation of the re-identification probability, which, in addition to the applied anonymization techniques, should take into account (a) the probability that observations in the anonymized microdata file are also contained in the additional database used for disclosure, and (b) the possibility of measurement errors in both data files (see e.g. Brand, 2000 for a discussion of the re-identification risk of business survey data). The simulation shows that confidentiality increases if the bandwidths of the kernels are inflated. The possibility of re-identification is less than 30% for every method. The resampling procedure C2 and data perturbation with multiplicative errors perform best, with shares of identified observations of only 2 to 3%. The extent of information loss is measured, for each statistic, by the average relative absolute deviation from the original value. Means are better reproduced by multiplicative perturbation, but this method distorts the variances of the variables by around 12%, whereas resampling distorts them by 7% on average. The covariance matrix is biased if the variables are multiplied with errors (15%). Resampling seriously distorts covariances (about 35-42% bias), even if the kernel matrix is scaled to have the covariance matrix of the original data. Correlations and rank correlations of the masked data do not differ from the original by more than 3-4% in most cases, except for C1 and C2. Resamplings D1 and D2 reproduce the correlation structure best; even a wider window in computing the kernel matrix does not additionally destroy multivariate relationships, and multivariate distributions are only slightly more biased when a wider kernel window is used. In general, the use of inflated bandwidths does not remarkably worsen performance, and the deterioration of the variances even decreases. To get an impression of the effects of the different kinds of anonymization on econometric parameter estimates, a linear model is constructed as follows:

Y = 0.7 + 0.5 X1 + X2 + 0.2 X3 + u,   u ~ N(0, 1).   (4)
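Returning to the record-linkage criterion described in steps 1-3 above, the following sketch computes the share of observations for which no link can be made. It is an illustration under stated assumptions: records are assumed to keep their order (masked record i stems from original record i), and the "20%" rule is read as every link variable lying within 20% of the original value.

import numpy as np

def share_not_linked(original, masked, link_vars, tol=0.20):
    # For each masked record, find the original record minimising the sum of
    # absolute deviations on the common (link) variables. A link is made if that
    # record is the true donor and all link variables are within `tol` of it.
    x = original[:, link_vars]
    z = masked[:, link_vars]
    linked = 0
    for i in range(len(z)):
        dist = np.abs(x - z[i]).sum(axis=1)          # distance to every original record
        j = int(dist.argmin())
        close = np.all(np.abs(z[i] - x[j]) <= tol * np.abs(x[j]))
        if j == i and close:                          # masked record i was built from original i
            linked += 1
    return 1.0 - linked / len(z)

# toy illustration with multiplicative noise on the interval [0.5, 1.5] (assumed setup)
rng = np.random.default_rng(2)
orig = rng.exponential(scale=3.0, size=(2000, 3))
masked = orig * rng.uniform(0.5, 1.5, size=orig.shape)
print(share_not_linked(orig, masked, link_vars=[0, 1, 2]))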

In the simulations, the model parameters are estimated by OLS using, in turn, the perturbed versions of the original variables Y and X1, X2, X3. The expectation is that estimation with resamples leads to unbiased parameter estimates if the multivariate relationships are retained. In the case of multiplicative perturbation, the model parameters cannot be estimated consistently. Hwang (1986) shows how to correct the biased estimates to obtain consistent results if the distribution of the multiplicative random numbers is known⁶. The covariance matrix of the errors Ei (i = 1, 2, ..., I, where I is the number of quantitative variables) can then be computed. As the errors are independently distributed, all covariances are zero and Var(Ei) = 1/12 (i = 1, 2, ..., I).

⁶ In the case of the MIP, external users are informed about the anonymization techniques. The multiplicative error is a univariate random number on the interval [0.5; 1.5].


A consistent estimator based on the perturbed data matrix Z and the perturbed endogenous variable Yz, with weight U = diag[E(Ei²)], where E(Ei²) = Var(Ei) + E(Ei)², can easily be constructed⁷ ⁸:

β̂ = [(Z′Z) ÷ U]⁻¹ Z′Yz .   (5)
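A minimal sketch of a moment-corrected estimator in the spirit of equation (5) follows. Because the Hadamard division in (5) requires a weight in every position, the sketch uses the full matrix of second moments E(Ek El) (1 off the diagonal, 13/12 on the diagonal of the perturbed columns, 1 for the unperturbed intercept column); this is an interpretation of the correction for illustration, not the paper's exact implementation, and all names are assumptions.

import numpy as np

rng = np.random.default_rng(3)
n = 20000

# simulate the model of eq. (4)
x1, x2, x3 = rng.normal(size=n), rng.exponential(size=n), rng.chisquare(3, size=n)
y = 0.7 + 0.5 * x1 + x2 + 0.2 * x3 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2, x3])
E = rng.uniform(0.5, 1.5, size=(n, 3))                 # multiplicative errors for X1..X3
Z = np.column_stack([np.ones(n), X[:, 1:] * E])        # intercept column left untouched
yz = y * rng.uniform(0.5, 1.5, size=n)                 # perturbed endogenous variable

beta_naive = np.linalg.solve(Z.T @ Z, Z.T @ yz)

# second-moment matrix of the errors: E(Ek El) = 1 off-diagonal,
# 1 + Var(E) = 13/12 on the diagonal of the perturbed columns, 1 for the intercept
U = np.ones((4, 4))
U[1:, 1:] += np.eye(3) / 12.0
beta_corr = np.linalg.solve((Z.T @ Z) / U, Z.T @ yz)   # elementwise (Hadamard) division by U

print("naive:    ", beta_naive.round(3))
print("corrected:", beta_corr.round(3))                # should be close to [0.7, 0.5, 1.0, 0.2]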

In all cases the coefficients remain significantly different from zero and the signs do not change. The resampling procedures D1 and D2, as well as the corrected estimates in the case of multiplicative perturbation, produce the best results, as none of the coefficients differs significantly from the true values. When the regression analysis is carried out without a correction term, the parameter estimates with multiplicatively perturbed data differ slightly from the original and the goodness of fit R² clearly decreases as well. Weighted or scaled resamples with the covariance matrix (B and C versions) produce some remarkably different regression coefficients, for example the constant in B and the parameter of X3 in C. In summary, univariate distributions are best reproduced by resamples with unweighted kernels, whereas multivariate structures are better retained by using directional information in the form of the correlation structure of the data when constructing the resamples. Multiplicative perturbation reproduces univariate and multivariate distribution parameters quite well, but in linear regression analysis a correction term should be applied to obtain accurate parameter estimates. On the other hand, this method has a comparatively low re-identification risk.

4 Empirical Application - An Example

In a second step, anonymized data are constructed using real data, the Mannheim Innovation Panel (MIP)⁹ for the manufacturing sector in 1999. Five quantitative variables are chosen: "sales", "number of employees", "research and development (R&D) expenditure per sales" (R&D intensity), "innovation expenditure" and the "number of highly qualified personnel". These variables are censored to the left, i.e. they have only non-negative values. The resampling procedure used here does not take these restrictions into account, so a few of the synthetic observations in the resample have negative signs. Tests have shown that this does not matter for many descriptive and regression analyses. Variables with a notable number of zero values, such as R&D expenditure, are however difficult to reproduce, as the share of zeros cannot be maintained. Some modifications of the resampling procedure are necessary; this will be the subject of further work.

⁷ See an application with simulated data and with the Mannheim Innovation Panel in manufacturing for the year 1997 in Gottschalk, 2002. ⁸ A ÷ B denotes the Hadamard division of the matrices A and B, i.e. elementwise division. ⁹ The survey design and methodology of the Mannheim Innovation Panel are described in Janz et al. (2001). The scientific-use files of the MIP are freely available for purely non-commercial basic research. The applied anonymization methods are described in Gottschalk, 2002. Here, each continuous variable is multiplied by a different random number. In the scientific-use files of the MIP, firm-specific random numbers are used and only "sales" and "number of employees" are perturbed with multiplicative error. Hence, productivity (sales per number of employees) remains constant and the scenarios in this paper are not completely transferable to the scientific-use files of the MIP.


"Sales" and "number of employees" are chosen as link variables to measure confidentiality¹⁰ (similar to the simulation above). The procedure additionally divides the data into industry classes (two-digit NACE level) and into East and West German firms, so a link can only be made within strata. In contrast to the Monte Carlo simulations above, the anonymized datasets involve, on average, higher re-identification risks for the individual firms. Whereas the D resamplings protect the data best, with a confidentiality level of nearly 100%, the B versions have re-identification risks of nearly 50%. Multiplicative data perturbation preserves confidentiality at a high level (85%) compared with the average of the resampling procedures. The information loss due to the different anonymization methods is measured as the average relative absolute deviation from the original statistics of the variables "sales", "number of employees" and "innovation expenditure".¹¹ In contrast to the Monte Carlo simulations, resampling, except for the C versions, performs very well in reproducing univariate and multivariate distribution parameters, whereas similar errors occur in every statistic when multiplicative perturbation is applied to the data. These findings seem to indicate a dependence between the disturbance technique used and the data generating process; different results could occur for other datasets. The effects on parameter estimation of a linear model are demonstrated by OLS estimates of an exemplary model explaining "R&D intensity", defined as "R&D expenditure per sales" of innovative firms¹², with the independent variables "ln(firm size)" (logarithm of the "number of employees") and its square ("ln(firm size)²"), a measure of market concentration within an industry (two-digit NACE level), the Herfindahl index ("herfindahl")¹³, a dummy variable indicating an expected positive growth rate of sales ("demand"), an indicator of East German firms ("East"), the share of highly qualified personnel within the enterprise ("qualified pers."), a binary variable taking the value one if the firm introduced at least one innovation essentially based on new developments from scientific institutions ("science"), and twelve sector dummies (two-digit NACE level: 10, 15, 17-36) to control for heterogeneity. All variable combinations are constructed before the anonymization process. This is necessary, as variables within the resample cannot be combined meaningfully: the densities of functions of variables in resamples differ strikingly from the original ones. In the case of multiplicative perturbation, ratios of two perturbed values are more difficult to handle; therefore, shares of sales are also computed before the masking procedure starts. As mentioned before, only quantitative values are perturbed, for resampling as well as for multiplicative perturbation. All indicator variables and the Herfindahl index (a firm-level variable) are kept unchanged.

¹⁰ In a realistic scenario, one would also presume "sales" and "number of employees" to be the common variables in the anonymized microdata and the additional data. ¹¹ These variables are used because they are highly correlated. ¹² Innovators are firms that have successfully completed at least one innovative project within a three-year period. ¹³ I calculate the Herfindahl index from estimated market shares of firms in the Mannheim Enterprise Panel (MUP), which includes 12,000 firms (see Almus et al., 2000, for more detailed information on the MUP).


The regression results for resamples D1 and D2 do not differ strikingly from the original values, which confirms the conclusions of the Monte Carlo simulations. The A versions of resampling perform quite well, too: significance levels and signs remain unchanged in most cases and the values do not differ remarkably; only the coefficients of the Herfindahl index are no longer significant. Resamples B and C do not produce satisfactory results, as was also seen in the Monte Carlo study. Multiplicative errors reverse the sign of the parameter of ln(firm size)². This mistake can be eliminated by the correction term.

5 Resume

The Monte Carlo studies and the application to real data, the Mannheim Innovation Panel, show the effects of resampling, in comparison with multiplicative data perturbation, on different kinds of analyses¹⁴. Univariate distributions are best reproduced by resampling with unweighted kernels. When the correlation structure of the original data is used in constructing the kernel matrix, resampling retains linear regression results, but univariate distributions, especially of skewed variables, are biased. Multiplicative perturbation reproduces descriptive statistics quite well, and in linear regression analysis a correction term can be applied to obtain accurate parameter estimates. It remains to examine the effects on non-linear and semi-parametric model estimation as well as on different kinds of model specifications with the MIP. The latter is quite important, as potential mis-specifications will surely influence the regression analysis, and the deterioration of regression results due to anonymization may not be independent of wrong model specifications. Although resamples consist of synthetic values, confidentiality problems remain; anonymized data with multiplicatively perturbed values perform better on average in this respect. Confidentiality measures and estimates of re-identification risks should be computed in realistic scenarios, where additional databases are matched against the perturbed microdata, in order to finally assess the various anonymization techniques.

References

Almus, M., D. Engel and S. Prantl (2000). The ZEW Foundation Panels and the Mannheim Enterprise Panel (MUP) of the Centre for European Economic Research (ZEW). Schmollers Jahrbuch, 120, 301-308.
Brand, R. (2000). Anonymität von Betriebsdaten. Beiträge zur Arbeitsmarkt- und Berufsforschung: BeitrAB, 237. Nürnberg.
Congdon, P. (2001). Bayesian Statistical Modelling, Wiley Series in Probability and Statistics. New York.
Devroye, L. and L. Györfi (1985). Nonparametric Density Estimation. New York.

¹⁴ For a more detailed description of the results see Gottschalk, 2003.


Fienberg, S.E. (1997). Confidentiality and Disclosure Limitation Methodology: Challenges for National Statistics and Statistical Research. Technical Report, Carnegie Mellon University, 161. Pittsburgh.
Fienberg, S.E. (1994). A Radical Proposal for the Provision of Micro-Data Samples and the Preservation of Confidentiality. Technical Report, Carnegie Mellon University, 611. Pittsburgh.
Fienberg, S.E., R.J. Steele and U. Makov (1996). Statistical Notions of Data Disclosure Avoidance and their Relationship to Traditional Statistical Methodology: Data Swapping and Loglinear Models. Proceedings of US Bureau of the Census 1996 Annual Research Conference, 87-105.
Gottschalk, S. (2003). Microdata Disclosure by Resampling - Empirical Findings for Business Survey Data. ZEW Discussion Paper, 03-55. Mannheim.
Gottschalk, S. (2002). Anonymisierung von Unternehmensdaten - Ein Überblick und beispielhafte Darstellung anhand des Mannheimer Innovationspanels. ZEW Discussion Paper, 02-23. Mannheim.
Härdle, W., S. Klinke and M. Müller (1991). XploRe - Learning Guide. Berlin.
Hwang, J.T. (1986). Multiplicative Errors-in-Variables Models with Application to Recent Data Released by U.S. Department of Energy. Journal of the American Statistical Association, 81, 395, 680-688.
Janz, N., G. Ebling, S. Gottschalk and H. Niggemann (2001). The Mannheim Innovation Panels (MIP and MIP-S) of the Centre for European Economic Research (ZEW). Schmollers Jahrbuch, 121, 123-129.
Robert, C.P. and G. Casella (2002). Monte Carlo Statistical Methods. New York.
Parzen, E. (1962). On Estimation of a Probability Density Function and Mode. Annals of Mathematical Statistics, 32, 1065-1076.
Rubin, D. (1993). Discussion - Statistical Disclosure Limitation. Journal of Official Statistics, 9, 461-468.
Rubin, D. (1987). Multiple Imputation for Nonresponse in Surveys: Wiley Series in Probability and Mathematical Statistics. USA: John Wiley & Sons.
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis: Monographs on Statistics and Applied Probability, 26. London.
Spruill, N.L. (1983). The Confidentiality and Analytic Usefulness of Masked Business Microdata, American Statistical Association. Proceedings of the Section on Survey Research Methods 1983, 602-610.
Tapia, R.A. and J.R. Thompson (1978). Nonparametric Probability Density Estimation. Baltimore: Johns Hopkins University Press.


The Determination of Intervals of Suppressed Cells in an n-dimensional Table
Karl Luhn
Fachgebiet Quantitative Methoden der Wirtschaftswissenschaften, Technische Universitaet Ilmenau, Germany
Keywords: sensitive cells, cell suppression, linear programming

1 The Problem of the Determination of Intervals

We consider statistical tables with sensitive cells. These cells are suppressed, either primarily or secondarily. The difference between the two is well known but irrelevant for the attacker problem; the attacker only needs to know which cells are suppressed. The attacker problem will be solved by means of linear programming. The suppressed cells are not included in the published table. In the linear programming model, one equation in these variables belongs to each total or subtotal of the table, and there are twice as many objective functions as suppressed cells (each primary and secondary suppressed cell is both minimised and maximised). This demonstrates the possibility, but also the considerable effort, of solving the attacker problem by linear programming.

2 The Solution of the Problem

2.1 The Generation of a Feasible Solution

The first idea for making the problem easier was to solve it with parameters belonging to the primary suppressed cells. But it is impossible to identify these primary suppressed cells. Therefore we solve the problem in two steps. First we find a solution of the system of linear equations (stage 1 of linear programming):

z = Σ_{i=1}^{m} Σ_{j=1}^{n} a_ij x_j → max!

Σ_{j=1}^{n} a_ij x_j = b_i ;  i = 1, 2, ..., m

x_j ≥ 0 ;  j = 1, 2, ..., n

with

x_j - variable, suppressed cell (primary or secondary)
b_i - sum of the not suppressed cells (row or column of the table)
a_ij ∈ {−1, 0, 1} - coefficients of the linear equations
m - number of equations, i.e. rows and columns, subdivided according to subtotals and totals of the table
n - number of primary and secondary suppressed cells

This solution delivers a system depending on only as many variables as there are primary suppressed cells:

x_i = b_i* − Σ_{k∈J1} a_ik* x_k ;  i ∈ J2 ,  J = J1 ∪ J2 ,  J1 ∩ J2 = ∅

The lower bound of the independent variables xk is zero.

Let us consider an example as follows:

          I      II     III    IV     V      ΣI..V
A        105    594    396    764    213     2072
B        487    561    351    716    762     2877
C        726    383    160     94    568     1931
ΣA..C   1318   1538    907   1574   1543     6880

And after suppression:

          I      II     III    IV     V      ΣI..V
A         x1     x2    396    764    213     2072
B        487     x3    351     x4    762     2877
C         x5    383    160     x6    568     1931
ΣA..C   1318   1538    907   1574   1543     6880


The first step is in accordance with the following simplex tableau:

        x2   x1   x3   x4   x5   x6 |    bi
u1       1    1    0    0    0    0 |   699
u2       0    0    1    1    0    0 |  1277
u3       0    0    0    0    1    1 |   820
u4       0    1    0    0    1    0 |   831
u5       1    0    1    0    0    0 |  1155
u6       0    0    0    1    0    1 |   810
z       -2   -2   -2   -2   -2   -2 | -5592

We receive at the first stage a feasible solution:

              x2 |   bi
x1             1 |  699
x3             1 | 1155
x4            -1 |  122
x5            -1 |  132
x6             1 |  688
z = x2,min    -1 |    0

That means that x2 = x2,min = 0 and

x1 = 699 − x2 , x3 = 1155 − x2 , x4 = 122 + x2 , x5 = 132 + x2 and x6 = 688 − x2 .

The maximum of x2 is 688:

              x6 |   bi
x1            -1 |   11
x3            -1 |  467
x4             1 |  810
x5             1 |  820
x2             1 |  688
z = x2,max     1 |  688

2.2 The Determination of the Intervals of the Suppressed Cells

The second step is to solve the remaining system by means of linear programming (stage 2):

z_k = x_k → max! ,  k ∈ J1

Σ_{k∈J1} a_ik* x_k ≤ b_i* ;  i ∈ J2

x_k ≥ 0

or, after introduction of the variables x_i:

z_k = x_k → max! ,  k ∈ J1

x_i + Σ_{k∈J1} a_ik* x_k = b_i* ;  i ∈ J2

x_i ≥ 0 ,  x_k ≥ 0

We receive intervals for the remaining variables x_k and with them determine the intervals of all variables (primary and secondary suppressed cells) of the original system.


Then the intervals of all variables are:

x1 = [11, 699]     x2 = [0, 688]
x3 = [467, 1155]   x4 = [122, 810]
x5 = [132, 820]    x6 = [0, 688]

If the value of x2 is 500 (for instance), then the other values are x1 = 699 − x2 = 199, x3 = 1155 − x2 = 655, x4 = 122 + x2 = 622, x5 = 132 + x2 = 632 and x6 = 688 − x2 = 188. In practice there is more than one independent variable (like x2); the determination of the intervals must then be extended to the solution of linear programming problems for each of the remaining variables (x1, x3, ..., x6).
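To make the attacker computation concrete, the following sketch sets up the equation system of the example table and determines the interval of every suppressed cell with a standard LP solver. It is an illustration, not the author's implementation; it reproduces the intervals listed above.

import numpy as np
from scipy.optimize import linprog

# equality constraints derived from the published margins of the example table:
# rows A, B, C and columns I, II, IV each give one equation in x1..x6
A_eq = np.array([
    [1, 1, 0, 0, 0, 0],   # row A:     x1 + x2 = 2072 - (396 + 764 + 213) = 699
    [0, 0, 1, 1, 0, 0],   # row B:     x3 + x4 = 2877 - (487 + 351 + 762) = 1277
    [0, 0, 0, 0, 1, 1],   # row C:     x5 + x6 = 1931 - (383 + 160 + 568) = 820
    [1, 0, 0, 0, 1, 0],   # column I:  x1 + x5 = 1318 - 487 = 831
    [0, 1, 1, 0, 0, 0],   # column II: x2 + x3 = 1538 - 383 = 1155
    [0, 0, 0, 1, 0, 1],   # column IV: x4 + x6 = 1574 - 764 = 810
])
b_eq = np.array([699, 1277, 820, 831, 1155, 810])

for j in range(6):
    c = np.zeros(6)
    c[j] = 1.0
    lo = linprog(c,  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 6)   # minimise x_j
    hi = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 6)   # maximise x_j
    print(f"x{j+1} in [{lo.fun:.0f}, {-hi.fun:.0f}]")
# expected: x1 in [11, 699], x2 in [0, 688], x3 in [467, 1155],
#           x4 in [122, 810], x5 in [132, 820], x6 in [0, 688]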


ROUNDING AS A CONFIDENTIALITY MEASURE FOR FREQUENCY TABLES IN STATBANK NORWAY
Johan Heldal
Statistics Norway

Abstract: Dissemination of statistics on the web raises opportunities as well as challenges for statistical confidentiality. Last year Statistics Norway (SN) opened its Statistics Bank (SBN) on the web, giving users the opportunity to make their own tables by aggregating from detailed base tables at a disaggregated geographical level. The architecture of such web publication structures puts restrictions on which disclosure protection methods are relevant. In 2002, SN experimentally developed a rounding method to make it possible to disseminate frequency count tables from the Census 2001 in SBN in a safe manner and in accordance with the rules of the Norwegian Statistics Code. The experiment was successful enough to be applied to the census publication, and the results inspire further development of the method. The paper describes the method and why existing methods and software were not used. Ideas for improvements are outlined.
Keywords: Frequency counts, additivity, linked tables, search methods, minimum distance.

I. INTRODUCTION

1. Since July 1st 2002 Statistics Norway (SN) has offered statistical tables on the Internet through StatBank Norway (SBN), and from March 1st 2003 virtually every official statistic published by SN is disseminated through SBN. 2. The idea of SBN is to publish a large number of high-dimensional table matrices at a detailed geographical level for public use, allowing users to aggregate numbers from them according to their own needs. However, such detail raises confidentiality problems. It is considered undesirable that frequency table matrices reveal combinations of discrete variables that are (almost) population unique, or situations where knowledge of some variable values associated with a statistical unit in a table makes it possible for someone to derive its values on other variables. More specifically this means that counts like 1 and 2 should be avoided in the tables, and uncertainty about whether a zero is actually a zero should be introduced. This problem, which has been encountered by many National Statistical Institutes (NSIs), has been handled in different ways in different countries, rounding being one of them.


3. In particular, the tables from the Norwegian 2001 Census of Population and Housing are being published in SBN. The Census tables are spanned by two to five variables and consist of up to 450 cells each. These tables have been published for 434 municipalities, 52 urban districts and 19 counties, ranging in size from a population of 250 to 500 000. Another set of tables has been published for the nearly 14000 basic units. These tables have at most two variables and less detail than the municipality tables. The customers on the web are free to order any aggregations of the released tables or to look at them in full detail. 4. To meet its intention with the present technology (which is based on PC-Axis), the tables released in SBN must have the same variables for every geographical unit and the same level of detail for all categorical variables. All tables are accessible at their most detailed level. They are released without margins; the margins are calculated in the process of meeting the users' orders. This has consequences for what kind of table protection methods can be applied. The English language version of SBN is found at http://www.ssb.no/english/. 5. The decision to go for rounding as the method of confidentiality protection in the Census was taken in June 2002, only two months before the Census was to start publishing, and programming work started immediately. The method uses a search algorithm to find a sufficiently good solution. Because of the tight time schedule, only a very primitive search algorithm was employed; intensive use of computers had to do the rest. It was an experiment with uncertain outcome. Nevertheless, allowing sufficient computer time, the results of the experiment were good enough to be accepted for use in publication. The Census tables released in SBN have since been rounded with this method, and the results have encouraged us to go ahead with more efficient search methods. The programming so far has been in SAS and SAS macro. Ultimately, when a sufficiently efficient search method has been developed and tested, we hope to implement it in SuperCross, which is the standard tool chosen by SN for future table production.

6. The next chapter discusses the kinds of confidentiality problems that occur with frequency tables. Chapter III gives a short review of existing methods and why we chose to make something of our own rather than using methods and software already available. Chapter IV describes the goals we set for the performance of our procedure and what we chose to sacrifice. Chapter V explains the method through an example. Chapter VI tells about the application of the method to the census. Chapter VII explains our ideas for improvement.


II DISCLOSURE IN FREQUENCY TABLES

7. There are essentially two kinds of confidentiality problems associated with frequency tables.
a. When, for some variable(s), only one category has a positive count for some combination of characteristics on all the others.
b. When the counts for some combinations of characteristics are very small, say one or two.
Case a constitutes an attribute disclosure (Willenborg and de Waal 1996, 2000). It makes it possible for an intruder to disclose the value of some variable(s) if the values of the other variables are known. Case b reveals combinations that represent population uniques and prepares the ground for an identification disclosure. The case where b occurs but not a does not automatically imply an attribute disclosure, since all attribute values in the table must be known to disclose the identity. However, if the count in case a is one (or two), both a and b occur. Population uniqueness (b) on a given set of variables is nevertheless often seen as a potential threat to privacy. Identification disclosure in combination with external information, for instance a sample survey, can be the key to attribute disclosure and potential harm. This has led NSIs to take action to remove uniques from population tables. The gravity with which this point is considered is underlined by the fact that numerous papers have been written in search of methods to estimate the probability that a unique in a survey sample (microdata) is also a population unique. The (expected) number of population uniques is used as a measure of the overall disclosure risk in survey microdata (Bethlehem et al. 1990, Fienberg and Makov 1996, 1998, Chen and Keller-McNulty 1998, Skinner and Holmes 1998, Carlson 2002).
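The two situations can be checked mechanically. The following is a small illustration with made-up variable names, not code from Statistics Norway: it scans a two-way frequency table for attribute disclosure (case a) and for small counts (case b).

import numpy as np

def disclosure_checks(table):
    # table: 2-D array of counts; rows = known characteristics,
    # columns = categories of the variable whose value could be disclosed
    attribute_rows = np.where((table > 0).sum(axis=1) == 1)[0]   # case (a): one positive category
    small_cells = np.argwhere((table == 1) | (table == 2))       # case (b): counts of 1 or 2
    return attribute_rows, small_cells

counts = np.array([[4, 0, 0],    # row 0: every unit falls in the first column -> case (a)
                   [2, 5, 1],    # row 1: two small cells -> case (b)
                   [3, 6, 9]])
print(disclosure_checks(counts))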

III METHODS FOR PROTECTION OF FREQUENCY TABLES

8. National Statistical Institutes (NSIs) attack the problem with small counts in different ways. Some countries restrict the detail of the tables by using crude categorisations and few variables in published cross-classifications. This method often means that the users' demands cannot be adequately met and therefore counters the very idea of StatBank Norway. Another very simple solution in use is to round all ones to 0 and all twos to 3. This method can lead to bias. As pointed out in several of the papers mentioned at the end of paragraph 7, the distribution of the number of cells with a given count has a decreasing shape, so that in sparse tables typically #(0) > #(1) > #(2) > ... . Thus, this simple rounding method may lead to a downward bias which can be serious when a large number of cells are being aggregated. A third method being used is Unbiased Uncontrolled Random Rounding (UUCR). UUCR is typically an option in tabulation programs like SuperCross and Beyond2020 and therefore easy to use. It rounds every count, regardless of size, stochastically to a nearest multiple of a base value (usually 3) in such a way that the expected value of the rounded count is equal to the original count.


The table marginals are rounded independently of the internal cells, so that the rounded tables cease to be additive. UUCR is useful and often satisfactory for tables that are published in a final form on paper or on the web. But since marginals in SBN are calculated in the process of meeting the users' orders, UUCR of the elementary internal cells would lead to too large random rounding errors in aggregations. Cell suppression is a much used method which is also not compatible with the architecture of SBN. Moreover, cell suppression is primarily a method for magnitude tables; it is not a suitable method for frequency tables although it is sometimes used (Willenborg and de Waal 1996). Swapping of information for similar units in neighbouring areas is used by the American FactFinder. More details about this can be found on the FactFinder website. 9. There are other methods, some of which are available in software. The well known ARGUS software offers Controlled Rounding (CR) for two- and three-way tables. Like UUCR, CR essentially rounds every cell. CR could in principle be applied to two- and three-way tables in StatBank Norway, but falls short for the tables of higher dimension. Furthermore, ARGUS does not run on UNIX, which is the production platform for SN. Fienberg, Makov and Steele (1998) propose an advanced method of count swapping within frequency tables keeping margins intact. This is one of the most advanced methods that exist, but to our knowledge software for the method is not generally accessible. A drawback of the method is that it does not remove a 1 from a marginal but only moves it along its row or column within the table. The PRAM method proposed by Gouweleeuw et al. (1998) is interesting. However, it requires some variables to be declared sensitive and a Markov transition matrix for the perturbation of the values of these variables to be decided upon. There are several problems with using this method which are also mentioned in the paper of Gouweleeuw et al. and which will not be repeated here. PRAM and many other methods require extra information to be communicated to the users to guide them to a correct understanding of how the tables should be interpreted. This is undesirable, although not completely avoidable, in the public use context of SBN.
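A minimal sketch of unbiased uncontrolled random rounding as described above (an illustration, not the SuperCross or Beyond2020 implementation): each count is rounded to one of the two neighbouring multiples of the base, with probabilities chosen so that its expected value is preserved.

import numpy as np

def uucr(counts, base=3, rng=np.random.default_rng(0)):
    # round every count, regardless of size, up or down to a multiple of `base`
    counts = np.asarray(counts)
    lower = (counts // base) * base
    remainder = counts - lower                  # in {0, ..., base-1}
    go_up = rng.random(counts.shape) < remainder / base
    return lower + go_up * base                 # expectation equals the original count

print(uucr(np.array([0, 1, 2, 3, 4, 7, 10])))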

IV DEMANDS AND SACRIFICES

10. To meet the needs of SBN and for the rounded tables to be suitable for publication they should meet some goals and standards.
a. The rounded tables must be additive. This is automatically satisfied in SBN since marginal tables are calculated from the base tables in the process of answering a request.
b. No more cells should be rounded than are necessary to make the tables confidentially safe. This means that only counts less than an integer base b (usually 3) should be rounded to 0 or b, not all cells as is common for the rounding procedures mentioned in chapter III. The purpose of this requirement is to reduce the size of the problem and to facilitate goal c.


c. Search for solutions minimising a distance between the rounded and the original tables. This means that roundings should cancel each other as much as possible at aggregate levels. This is important in CR, but not in UUCR. A metric to measure the distance must be chosen. Our metric of choice is the maximum absolute difference between the rounded and original cell counts in a set of marginals specified to the program. This measure is rather crude. Therefore, among solutions with the same maximum absolute difference, the program will prefer solutions that minimise the number of occurrences.
d. Common marginals in linked tables should be rounded consistently. The program can handle this, but often at the price of highly increased computing time and larger distances between rounded and original tables. For most of the census tables this goal had to be abandoned in order to run through them in time without unacceptably large deviations. The occurrence of diverging counts for the same marginals calls for an explanation to the users and implies somewhat reduced disclosure protection where a count has visibly been rounded to zero in one table but to three in another.
e. No counts less than the rounding base b should occur in the rounded tables. At aggregate levels, the rounding method affects other counts than those actually in need of rounding. We call these secondary perturbations. With base 3, this means that a cell count of 3 or 4 may well be perturbed to 1 or 2. However, a 1 or 2 in the rounded table will never occur in cells that were 1 or 2 in the original table. Such 1's and 2's therefore do not compromise confidentiality if the user is aware that 1 and 2 actually mean "at least 3". A user who is not aware of this can be misled into believing that the counts are real and draw wrong conclusions concerning population uniqueness and the confidentiality measures used. It is to avoid such misconceptions that this goal is desirable. In the version of the program used for the census, this goal was not given priority. However, such secondary small counts can be removed either by running the rounded table through the rounding program once more or by altering some specifications for the program.
11. Traditionally, disclosure control methods for tables have focused upon preserving the table marginals unless they themselves present a disclosure risk. To avoid back-tracing of the original cell values, methods for cell suppression use secondary suppression. Rounding methods cannot avoid affecting marginals. Therefore, to avoid excessive differences between rounded and original marginals, uncontrolled rounding rounds marginals independently of the internal cells and abandons additivity. Controlled rounding, the method of ARGUS, is not guaranteed to exist for more than two-way tables, although it often works for three-way tables as well. The method presented here relaxes these tight restrictions somewhat by replacing them with item c above.

V DESCRIPTION OF THE METHOD

12. The description will be given by an example. The example is real, but not from the Census 2001. The original table and a rounded version are presented in the appendix. Assume we have six categorical variables which we for short call V1, …, V6. They span a six-way table and we want to publish six two-way marginals from


this table and round them in a consistent manner. In SAS macro language the six variables and the rounding base are specified by specifying their joint table:

%LET BASETAB=V1*V2*V3*V4*V5*V6;
%LET BASE=3;

The linking pattern of the six two-way tables to be published, and from which we want to remove 1s and 2s, is shown in Table 1. The pattern is specified to the program with

%LET TABLES=V2*V1 V3*V1 V4*V1 V6*V1 V6*V3 V6*V5;

There were 450 cells in the six two-way marginals and 56 cells in the one-way marginals. The six-way table contained 559872 cells, of which 2591 cells were filled with 7491 households. Among the 450 two-way cells, 44 were zeroes, 30 were ones and 23 were twos, making up t = 76 households.

          V1(8)  V2(12)  V3(6)  V4(9)  V5(12)
V2(12)      x
V3(6)       x
V4(9)       x
V5(12)
V6(9)       x              x             x

Table 1: A linking pattern for 6 two-way tables based on 6 variables. The number of categories is in parentheses.
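Outside the SAS setting of the paper, the two-way marginals named in the %LET TABLES statement can be obtained from a six-way count array by summing out the other variables. The sketch below assumes the axis order V1..V6 and uses a toy stand-in for the base table; only the dimensions match the example.

import numpy as np

rng = np.random.default_rng(4)
base = rng.poisson(0.013, size=(8, 12, 6, 9, 12, 9))   # toy six-way table, 559872 cells

published = [("V2*V1", (1, 0)), ("V3*V1", (2, 0)), ("V4*V1", (3, 0)),
             ("V6*V1", (5, 0)), ("V6*V3", (5, 2)), ("V6*V5", (5, 4))]

marginals = {}
for name, keep in published:
    drop = tuple(ax for ax in range(base.ndim) if ax not in keep)
    marginals[name] = base.sum(axis=drop)               # sum out all other variables

print({name: m.shape for name, m in marginals.items()})
print("cells in the six two-way marginals:", sum(m.size for m in marginals.values()))  # 450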

13. In order to meet requirement d, consistency, the rounding had to take place inside the six-way table. To minimise the amount of rounding necessary, the table was reduced. The 30 ones and 23 twos in the six two-way tables were aggregates of altogether 60 cells in the basic six-way table, containing 44 ones and 16 twos. Only these 60 cells were rounded, using base b = 3. In order to control the rounding error induced on the table total, exactly [t/b] = 25 cells were rounded to 3; 35 cells were rounded to zero. Thus, for the rounded total, t* = [t/b]·b = 75. The 25 cells to be rounded up were drawn as a probability sample without replacement where, in each sample, the twos had twice the probability of the ones of being rounded up.


14. The distance between the original table X and the rounded table Y was calculated as the maximum difference between the rounded and unrounded cell counts in a specified set S of marginals. The marginals of choice for this example are those to be published: all six two-way marginals and the six one-way marginals. S was specified to the program with

%LET CONTROL=V1 V2 V3 V4 V5 V6 V2*V1 V3*V1 V4*V1 V6*V1 V6*V3 V6*V5;

Let xc, yc be the cell counts of cell c in the original and the rounded table respectively. The distance between the two tables is then defined as

d(X, Y) = max_{c∈S} |yc − xc| .

It is sufficient to calculate the marginals in S and d(X, Y) from the reduced six-way table (60 cells) rather than from all the populated cells (2591). This significantly reduces the amount of computing.
15. The algorithm continues by repeating the procedures described in paragraphs 13 and 14 until a given stopping criterion is met. This generates a sequence of solutions Y1, Y2, Y3, ..., and the solution with the smallest d is preferred. To distinguish between solutions with the same d, the number of occurrences of d in the table, nd = #(d), is counted. The solution with the smallest nd is chosen.
16. This random search is primitive, but given enough computer time it can reach any possible solution. In this particular example there are C(60, 25) ≈ 5.19 · 10^16 possible solutions (the number of ways of choosing the 25 cells to round up among the 60), and for this example we ran through a sample of 10000 of them. Finding an optimal solution is an NP-hard problem (Fischetti and Salazar-Gonzalez 1998, Kelley et al. 1990), meaning that there is no short cut to the best solution. For problems with two and three variables the methods of ARGUS would offer the best available solutions. Systematic search algorithms that can be applied to higher-dimensional tables exist and will be implemented; see Chapter VII.
17. The random search method can be reasonably efficient in the beginning, but as improved solutions are found, the expected number of new iterations needed to find a further improvement increases rapidly. The waiting time from one improvement to the next is negative binomial with a decreasing success probability at each improvement. In a sample run of 10000 iterations, the improvements came as shown in Table 2.

Iteration    1      4      7      16     32     43     111    188
(d, nd)    (9,1)  (8,1)  (7,2)  (6,4)  (6,2)  (6,1)  (5,3)  (5,1)

Table 2: Improvements in d and #(d) in a sample run of 10000.
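The search loop of paragraphs 13-15 can be sketched as follows. This is a simplified illustration in Python rather than the SAS macro actually used; the array layout, the stopping constants (in the spirit of MINIT/MAXDIFF in Chapter VI) and the helper names are assumptions of the sketch.

import numpy as np

def marginal(t, keep_axes):
    # sum the table over all axes not in keep_axes
    drop = tuple(ax for ax in range(t.ndim) if ax not in keep_axes)
    return t.sum(axis=drop)

def distance(x, y, control):
    # d(X, Y) = max over the control marginals of |y_c - x_c|, plus its count n_d
    devs = np.concatenate([np.abs(marginal(y, s) - marginal(x, s)).ravel() for s in control])
    d = int(devs.max())
    return d, int((devs == d).sum())

def random_round(x, base, rng):
    # round only cells with 0 < count < base; exactly floor(t/base) of them go up,
    # with twos given twice the selection weight of ones (sampling without replacement)
    y = x.copy()
    idx = np.argwhere((x > 0) & (x < base))
    counts = x[tuple(idx.T)]
    n_up = counts.sum() // base
    up = rng.choice(len(idx), size=n_up, replace=False, p=counts / counts.sum())
    y[tuple(idx.T)] = 0
    y[tuple(idx[up].T)] = base
    return y

def search(x, base, control, max_iter=10000, max_diff=4, min_iter=50, rng=np.random.default_rng(0)):
    best, best_score = None, (np.inf, np.inf)
    for it in range(max_iter):
        y = random_round(x, base, rng)
        score = distance(x, y, control)                # (d, n_d), compared lexicographically
        if score < best_score:
            best, best_score = y, score
        if best_score[0] <= max_diff and it >= min_iter:
            break                                       # MINIT/MAXDIFF-style stopping rule
    return best, best_score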

The realised distribution of deviations from this sample run in the one- and two-way marginals is given in Table 3 (Table 3: Distribution of differences in one- and two-way cells), which shows, for each deviation yc − xc from −5 to +4, the number of one-way and of two-way cells with that deviation. Table 3 implies a standard rounding deviation of approximately 2 across all one-way cells and 1 across all two-way cells. Such average measures of deviation can be supplied to the users along with the rounded tables to describe the uncertainty imposed by the rounding. Perturbation of counts 3 and 4 to counts 2 occurred in three cells.

VI ROUNDING IN THE CENSUS

18. The application of the rounding procedure in the census was seen as an experiment. The method had never been tested on any data of similar size and detail. The rounded figures would be used for publication if the method proved to be successful in the sense that the rounded tables did not deviate too much from the original counts. 19. The dissemination of tables from the housing part of the census consisted of 26 tables ranging in size from two to five variables and 12 to 450 cells. All cells with one and two at the most detailed level were rounded. In this case no new (secondary) ones and twos can occur. As already mentioned in paragraph 3, these tables were published for 434 municipalities, 52 urban districts and 19 counties, ranging in size from a population of 250 to 500 000. For the smallest tables there were few counts that had to be dealt with. Table 4 shows an overview of the tables. Tables with the same background colour have the same combination of units. Red crosses mark variables that link tables. There are several links. We had hoped to round the linked tables jointly, as was done in the example in Chapter V, but the available computing time did not allow it.


For each table a minimum and a maximum number of iterations (searches) permitted for the rounding procedure, (MINIT, MAXIT), was set for each matrix. If, for a municipality, a maximum deviation less than or equal to a prescribed value MAXDIFF was achieved within MINIT iterations, the procedure stopped and went on to the next municipality; otherwise it continued with new iterations until MAXDIFF was achieved or until MAXIT iterations had been used. MINIT enforces a minimum number of iterations even if MAXDIFF is achieved early, in the hope of finding even better solutions. MAXDIFF was set to 4 for all tables and municipalities. (MINIT, MAXIT) ranged from (50, 200) for the smallest tables to (400, 2000) for the largest.
20. The largest deviation between rounded and original cell counts in any aggregated cell, at any aggregation level, in any of the 434 municipalities ranged from 2 in ten of the 26 tables, which is effectively as good as controlled rounding, to 6 in tables 1 and 2. The deviation 6 occurred in one cell in one municipality for each of the two tables.

VII IMPROVEMENTS OF THE PROCEDURE

21. The rounding procedure can be improved significantly by using systematic instead of only random search. Some main features of the procedure, like the controlled selection with a fixed number of up-roundings (to b) and down-roundings (to 0), will be retained. The present random search algorithm, or some restriction of it thought to be close to good solutions, can be applied to find start values for a systematic search. To keep the number of up- and down-roundings controlled, the search will be based on swapping of zeroes and bs in the rounded reduced elementary table with all variables included. There are then two goals for this search procedure, as sketched below.
a. Remove secondary perturbations that produce new counts less than b. There exist situations where it is not possible within the swapping framework to remove all counts less than b without creating new ones.
b. Minimise the largest absolute deviations while controlling for a. In each swap, an up-rounding contributing to the largest positive deviation will be swapped with a down-rounding contributing to the largest negative deviation. Among candidate couples, one that, when swapped, does not generate equally large or larger extremes in other cells belonging to S (see paragraph 14) is chosen. The process in b is iterated until no further improvement can be reached. If the maximum deviation in S, d, is small enough, the solution is accepted and the search stops. If not, a new random start value can be selected for a new search.
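A minimal sketch of the swap-based improvement in goal b follows, under assumed data structures. The pairing rule is simplified relative to the description above: instead of pairing the cells contributing to the largest positive and negative deviations, the sketch greedily tries candidate swaps and accepts the first that lowers the maximum deviation.

import numpy as np

def swap_improve(x, y, base, control, max_passes=100):
    # Exchange one up-rounded cell (value `base`) with one down-rounded cell (value 0)
    # among the originally small cells, keeping the number of up/down roundings fixed,
    # as long as the maximum absolute deviation over the control marginals decreases.
    def max_dev(t):
        return max(np.abs(t.sum(axis=tuple(a for a in range(t.ndim) if a not in s))
                          - x.sum(axis=tuple(a for a in range(x.ndim) if a not in s))).max()
                   for s in control)

    small = (x > 0) & (x < base)
    best = max_dev(y)
    for _ in range(max_passes):
        ups = np.argwhere(small & (y == base))
        downs = np.argwhere(small & (y == 0))
        improved = False
        for u in ups:
            for d in downs:
                cand = y.copy()
                cand[tuple(u)], cand[tuple(d)] = 0, base   # swap an up- and a down-rounding
                dev = max_dev(cand)
                if dev < best:
                    y, best, improved = cand, dev, True
                    break
            if improved:
                break
        if not improved:
            break
    return y, best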


22. Before a search is initiated, stopping criteria (MAXDIFF, MAXIT) must be set. For this purpose it would be desirable to know in advance how small a maximal deviation can realistically be obtained. A theorem allowing one to calculate how small the deviations can at best be would be useful.

VIII DISCUSSION

23. Statistics Norway has found the described rounding method promising as a confidentiality measure for frequency tables and has applied the method in the publication of the Census 2001 in SBN. SN will follow up the method with implementation and testing of the proposed improvements. If successful, the method will be approved as the standard for frequency count tables in SBN and will, as far as possible, be integrated into our standard software for table production. Proposals for improvement of the suggested procedure are welcome.


Table 4. Summary of tables published for the Housing Census in SBN.

The table lists the 26 published tables (columns 1-26, grouped by their combination of statistical units: dwelling, household and person) against the spanning variables, with crosses marking which variables span each table and which variables link tables. The spanning variables (number of categories in parentheses) are: construction year short (3) and long (9); category of building (5); number of rooms (5) and number of rooms/kitchen (9); utility floor space long (12) and short (5); tenure status short (2), medium (3) and long (7); heating systems (10); wheelchair accessibility of the dwelling (2) and of the rooms (2); floors/lift (6); sanitary (3); garden (3); balcony, veranda, terrace (2); garage/carport/parking space (3); number of cars (3); occupants in dwelling (5); age of oldest occupant (6); kind of household I (6); elderly household (4); children/not children (2); kind of household II (9); and age (7). The final rows give the number of cells (from 12 to 450) and the number of variables (from 2 to 5) in each table.

References

Bethlehem, J.G., Keller, W.J. and Pannekoek, J. (1990): Disclosure Control for Microdata. J. American Statistical Association, 55, pp. 38-45.
Carlson, M. (2002): Assessing Microdata Disclosure Risk Using the Poisson-Inverse Gaussian Distribution. To appear in Statistics in Transition.
Chen, G. and Keller-McNulty, S. (1998): Estimation of Identification Disclosure Risk in Microdata. J. Official Statistics, Vol. 14, no. 1, pp. 79-95.
Cox, L. H. and Ernst, L. R. (1982): Controlled Rounding. INFOR, 20, pp. 423-432.
Cox, L. H. (1987): A Constructive Procedure for Unbiased Controlled Rounding. Journal of the American Statistical Association, 82, pp. 520-524.
Fellegi, I. P. (1975): Controlled Random Rounding. Survey Methodology, 1, pp. 123-133.
Fouskakis, D. and Draper, D. (2002): Stochastic Optimization: a Review. International Statistical Review, Vol. 70, no. 3, pp. 315-349.
Fischetti, M. and Salazar-Gonzalez, J-J. (1998): Experiments with Controlled Rounding for Statistical Disclosure Control in Tabular Data with Linear Constraints. J. Official Statistics, Vol. 14, no. 4, pp. 553-565.
Fienberg, S.E. and Makov, U.E. (1998): Confidentiality, Uniqueness and Disclosure Limitation for Categorical Data. J. Official Statistics, Vol. 14, no. 4, pp. 385-397.
Fienberg, S.E., Makov, U.E. and Steele, R.J. (1998): Disclosure Limitation Using Perturbation and Related Methods for Categorical Data. J. Official Statistics, Vol. 14, no. 4, pp. 485-502.
Gouweleeuw, J.M., Kooiman, P., Willenborg, L.C.R.J. and de Wolf, P.-P. (1998): Post Randomisation for Statistical Disclosure Control: Theory and Implementation. J. Official Statistics, Vol. 14, no. 4, pp. 463-478.
Kelley, J.P., Golden, B.L. and Assad, A.A. (1990): The Controlled Rounding Problem: Relaxation and Complexity Issues. OR Spectrum, 12, pp. 129-138.
Skinner, C. J. and Holmes, D. J. (1993): Modelling Population Uniqueness. Proceedings of the International Seminar on Confidentiality, Dublin, pp. 175-199.
Skinner, C. J. and Holmes, D. J. (1998): Estimating Re-identification Risk Per Record in Microdata. J. Official Statistics, Vol. 14, no. 4, pp. 361-372.
Willenborg, L. and de Waal, T. (1996): Statistical Disclosure Control in Practice. Springer.
Willenborg, L. and de Waal, T. (2000): Elements of Statistical Disclosure Control. Springer.


APPENDIX: UNROUNDED AND ROUNDED EXAMPLE TABLES FROM CHAPTER V

Tables A.1-A.6 give the six published two-way marginals of the 7 491 households in the example, showing the original counts together with the rounded counts: utility floor space by tenure status (Table A.1), type of building by tenure status (Table A.2), type of household by tenure status (Table A.3), type of building by year of construction (Table A.4), tenure status by year of construction (Table A.5) and source of heating by year of construction (Table A.6). In the original publication the original counts are printed in black and the rounded counts in red; deviations of ±4 have a yellow background and a deviation of −5 a green background. A grey background shows roundings to less than three. The years 1961-80 were rounded as two categories, 1961-70 and 1971-80; one 0, one 1 and two 2s are hidden.

19461960 1 414 •1 415

19614 1980 2 633

1 194

2 151 •2 153

-

99

19811990 832 •831 641 •640 42 •41

19912000 845 •846

Not reported

530

27

43

305

2 •3 151 •152

122

48

3

133

135 •137

42

47 •46

1 •-

1 •3

-

6 •8

15 •12

6

11

21 •18 22 •21

4 •5 3 •2

-

10 •9

75

40 •39

-

5

27

8

163 •164 2 •3

15 •14

75

17

1 •-

50

96

37

32

38

27 •26

3 •2

16

35 •34

16 •18

17

14

1 018 •1 014 34 •35

391 •392

17

3

43 •42

14

24 1 •2 •417 •416 66 •68 18 147 •145 65 •64 184 •185

-

1 414 •1 415

2 633

43

199

28 •30

98

5

53

59

13

1 •-

-

-

42 •40 1 •3

-

-

4

3

1 •2 •3

159

20

488 •487

733 •735

1 •1 •3 2 •458 •460

6

118

265

21

10

8

51

27

4

-

325

503

70

27

23 •24

6 •5 1 •-

86

207 •205

28

49

3 13 •14 12 •13

52

11

199

496

91

36

23 •24 8 94

3

19

845 •846 340 •342 14 •13

47

53 •54 1 •3 2 •3 4 •3 2 •3

14

832 •831 138 •136

11 •10

305 112 25

5

6

-

5

361

1 •97 •96

23

Grey background shows roundings to less than three. Years 1961-80 was rounded as two categories, 1961-70 and 1971-80. One 0, one 1 and two 2s are hidden.

4

88

Recent Research Results on the Conditional Distribution Approach for Data Perturbation

Krishnamurty Muralidhar¹ and Rathindra Sarathy²

¹ Gatton College of Business & Economics, University of Kentucky, Lexington KY 40506-0034, USA*
² College of Business Administration, Oklahoma State University, Stillwater OK 74078, USA

Keywords: Masking techniques, Perturbation, Swapping, Data Shuffling, Copula, Confidentiality, Privacy

1. Introduction

In this paper, we provide a summary of our recent research on developing a theoretical basis for perturbation methods. We propose that, theoretically, the perturbed values of the confidential variables should be generated from the conditional distribution of the confidential variables given the non-confidential variables, independently of the original confidential variables. We show that if the perturbed values are generated from this approach, the resulting perturbed values have the same statistical characteristics as the original confidential variables, and all relationships among the variables remain the same after perturbation as before perturbation. Furthermore, since, given the non-confidential variables, the perturbed variables are independent of the original confidential variables, this method also provides intruders with no knowledge gain. For a complete description, please see Muralidhar and Sarathy (2003). In the following sections, we describe our efforts in developing techniques based on this theoretical approach for numerical, confidential variables. Our initial effort focused on the desire to improve the performance of existing additive noise techniques for numerical, confidential variables. One of the key aspects of noise addition techniques was that all techniques assumed the basic noise addition model of the form

Y = X + ε.    (1)

* Dr. Muralidhar's work on this paper was supported in part by a grant from the Kentucky Science and Engineering Foundation as per Grant Agreement #KSEF-148-502-02-22 with the Kentucky Science and Technology Corporation.


Starting with the original proposal by Traub et al. (1984), researchers have provided several improvements to this procedure (Kim 1986, Sullivan 1989, Sullivan and Fuller 1989, Tendick 1991, Tendick and Matloff 1994). Most of these procedures use different statistical characteristics for ε, in an attempt to satisfy disclosure risk and data utility requirements. The procedures suggested by Kim (1986) and Tendick and Matloff (1994) ensured that, when the underlying distribution was normal, the marginal distribution and covariance matrix of the perturbed variables were the same as those of the original, confidential variables. Other researchers (such as Fuller 1993, Tendick 1992) have also addressed the performance of this procedure in terms of disclosure risk. Yet there were considerable deficiencies in the noise-addition approach shown in equation (1).

2. General Additive Data Perturbation Method

At the time we started our research on this topic, one key component that was missing in prior studies was the treatment of the non-confidential variables. From equation (1) it is easy to see that the non-confidential variables play no role in this approach. Most prior studies assumed that all variables were confidential and hence had to be perturbed, and that there were no non-confidential variables present. In our opinion, this was a rather restrictive assumption. By their nature, non-confidential variables may be available in their original form through other sources. The releasing agency, because of concerns about re-identification, may choose to perturb non-confidential numerical variables. However, when these values are available from other sources in original form, they can be used to compromise the values of confidential variables, resulting in predictive disclosure (as opposed to identity disclosure). Even if we assume that, because of re-identification risk, all numerical variables were to be perturbed, this leaves open the issue of categorical non-confidential variables. The noise addition approach is not suited for categorical variables and hence could not be used to perturb such variables. Thus, if there were non-confidential variables (either numerical or categorical) present, prior approaches essentially ignored the categorical variables in performing the perturbation.

Ignoring non-confidential variables during perturbation has obvious disadvantages. It is easy to show that, if ρ represents the correlation between a confidential and a non-confidential variable and the variance of ε is d², then after perturbation the correlation between the perturbed and the non-confidential variable is ρ/(1 + d²)^0.5. In other words, regardless of the level of noise added, analysis of the released data for this relationship would provide different results compared to the original data. Thus, even if the perturbed confidential variables maintained the same covariance structure as the original confidential variables, their relationship with the non-confidential variables would be biased.
Our initial approach was to develop a technique that ensures this does not occur. Our original study (Muralidhar et al. 1999) considers the entire set of variables (confidential (X), non-confidential (S), and perturbed (Y)) as having a joint (multivariate normal) distribution. Note that, prior to generating the perturbed variables, while we do not know the individual values of Y, we know the desired joint distribution of X, S, and Y. In order to maximize utility, it is desirable that the distribution of Y should be the same as that of X, and that the joint distribution of (Y and S) should be the same as that of (X and S). Further, in order to reduce disclosure risk, we specified that the covariance between the original and perturbed variables should be (λ × the covariance of the original confidential variables), where λ represents the square of the first canonical correlation between X and S. This ensured that, for any linear combination of the confidential variables, the proportion of variability explained was no more than λ. We also showed that all previously proposed methods of noise addition were special cases of this approach. Hence, we called this approach the General Additive Data Perturbation (GADP) method. For complete details of GADP and illustrations, please refer to Muralidhar et al. (1999). The paper also shows that even when the underlying distribution of the variables is not multivariate normal, GADP is capable of maintaining the covariance matrix of Y to be the same as that of X, and the covariance between (Y and S) to be the same as that between (X and S). However, the marginal distribution of Y was considerably different from that of X. GADP is also very effective when the non-confidential variables are categorical.

Our research was further spurred by a desire to improve the disclosure risk characteristics of GADP. While GADP performed better than the other methods in terms of disclosure risk, it did not satisfy the strict requirements described in Dalenius (1977) and Duncan and Lambert (1986). Both these papers evaluate risk by comparing the information that is available prior to "data" release and post "data" release, and contend that disclosure risk occurs when the intruder "gains" information from the released data. The specific objective of perturbation methods is to provide access to microdata values of the confidential variables. Hence, these definitions essentially dictate that there should be no information and knowledge gain (or, correspondingly, no decrease in uncertainty) when microdata is released. One of the difficulties with these definitions is that it is extremely difficult to assess the extent of knowledge that an intruder possesses prior to microdata release. However, we can model the knowledge of the intruder based on the assumption that the intruder is intelligent and specifically intends to compromise the data, and that the intruder has verifiable, accurate information of her/his own.
Further, since agencies have to at least consider releasing aggregate data¹, it is reasonable to include this knowledge as part of the snooper's prior knowledge in compromising the data. This model of the intruder can be considered as the "worst case" scenario, and has been used by several respected authors in the area (see, for example, Fuller 1993, Fienberg 1997, Willenborg and de Waal 2001, Yancey et al. 2002). From an agency perspective, modeling the intruder in this manner has the significant advantage that if it can be assured that the risk of disclosure is very low for such an intruder, it will be even lower for other (not so intelligent) intruders. Thus, we adopt the "worst case" scenario or the "determined intruder" in evaluating disclosure risk. If the objective of perturbation is to defeat such an informed intruder, it is necessary that providing access to microdata does not result in providing the intruder with additional information. Further, since we assume that the intruder could have their own data and/or aggregate data has been provided, we can assume, from the perspective of value disclosure, that prior to microdata release the intruder has information on R²_{X|S}. In order to provide no additional information when microdata is released, it is necessary that:

R²_{X|S,Y} = R²_{X|S}.    (2)

¹ If the release of data, even in aggregate form, leads to unacceptable disclosure risks, the agency may not even release the data, and therefore would suffer no liability from an intruder's knowledge gain. Hence, the concept of an intruder, in this case, is irrelevant.

In Muralidhar et al. (2001), we derived the conditions that are necessary in order to satisfy this (disclosure risk) requirement and simultaneously satisfy the previously defined data utility requirements (namely, distribution of Y is the same as that of X, and the joint distribution of (Y and S) is the same as that of (X and S)) for the multivariate normal distribution. The resulting expression was of the form:

Y = μ_X + Σ_XS Σ_SS⁻¹ (s_i − μ_S) + ε,    (3)

where μ_X and μ_S are the mean vectors of X and S, and Σ_XX, Σ_SS, and Σ_XS represent the covariance of X, S, and between (X and S), respectively. Finally, ε has a multivariate normal distribution with mean vector 0 and covariance matrix Σ_εε = Σ_XX − Σ_XS Σ_SS⁻¹ Σ_SX. From these specifications, it is easy to see that:


y_i ~ f(X | S = s_i).    (4)

In other words, in order to satisfy the dual requirements of disclosure risk and data utility in the multivariate normal case, it is necessary to first derive f(X|S) using the available data and then to generate the values of Y using this conditional distribution and the values of S, independently of X. Since, given S, the values of X and Y are independent, equation (2) is satisfied. This implies that this approach does not provide additional information to an intruder. In addition, it can be easily shown that the distribution of Y is the same as that of X and that the joint distribution of (Y and S) is the same as that of (X and S). Note that, in the absence of S, this procedure also suggests that the perturbed values should be generated from a multivariate normal distribution that has the same characteristics as X, but independent of X. Thus, for the multivariate normal distribution, GADP provides an "optimal" solution to the perturbation problem by minimizing disclosure risk and maximizing data utility. Note that, although we originally started out defining disclosure risk by using the R² measure, the perturbed values resulting from GADP prevent not only value disclosure but also identity disclosure. Theoretical proof for this statement can be found in Muralidhar and Sarathy (2003). However, this result can also be explained in simple terms as follows. The predictive distribution of an intruder is based on f(X|S). We have shown above and in Muralidhar and Sarathy (2003) that f(X|S,Y) = f(X|S). Consequently, regardless of the predictive measure, disclosure risk will be minimized.

An extension of the simple noise addition approach given in equation (1) is the more recent "model-based" approach. While Muralidhar et al. (1999, 2001) do not directly address the model-based approach, it is easy to show its relationship with GADP. In general, model-based approaches suggest that the perturbed values Y should be generated from a general model of the form:

Y = β₀ + β₁S + β₂X + ε.    (5)

When β₀ = 0, β₁ = 0, and β₂ = 1, equation (5) reduces to the simple noise-addition approach in equation (1). We can also easily show that, in order to achieve the disclosure risk and data utility requirements, it is necessary that β₂ = 0, β₀ + β₁S = μ_X + Σ_XS Σ_SS⁻¹ (S − μ_S), and

ε ~ N(0, Σ_XX − Σ_XS Σ_SS⁻¹ Σ_SX).    (6)


These are the same specifications used in GADP. Thus, GADP represents the "optimal" model-based approach, and the only "model-based" approach that will satisfy both the data utility and disclosure risk requirements. It is important to note that we use this notion of prior information only to define the important concept of minimum disclosure risk. Some critics have argued that we should compare the disclosure risk resulting from our method to those of other techniques. In our opinion, this is relatively straightforward. For multivariate normal distributions, the perturbed values are independent realizations from the true conditional distribution of X|S. Hence, in these cases, given S, X and Y are statistically independent. Statistical independence automatically implies that for any method of prediction, given S, the perturbed values Y provide no information regarding X. None of the other methods of additive perturbation or model-based approaches can guarantee this result. As an illustration, consider the simple case where there is a single confidential variable (X) and a single non-confidential variable (S) with a joint standard multivariate normal distribution with correlation ρ. Let the perturbed variable be Y. Let disclosure risk be measured by R², the proportion of variability in X that is explained by S and Y. We can easily show that if additive noise is used, the resulting risk of disclosure is:

R²_{X|S,Y} = ρ² + (1 − ρ²)² / ((1 − ρ²) + d²),    (7)

where d² represents the variance of the noise term ε. When a model-based approach of the form Y = a + bX + cS + ε is used, the resulting risk of disclosure is:

R²_{X|S,Y} = ρ² + b²(1 − ρ²)² / (b²(1 − ρ²) + d²).    (8)

When GADP is used to perturb X, the resulting risk of disclosure is:

R²_{X|S,Y} = ρ².    (9)

It is easy to verify that GADP provides the lowest risk of disclosure and is equal to an intruder's prior estimate of X using S alone. Our use of prior information possessed by an intruder essentially negates the need to measure disclosure risk in every case for every situation and every measure of disclosure. Finally, even in cases where the joint distribution of the variables is not multivariate normal, GADP can be used effectively in certain situations. One such situation occurs when all the non-confidential variables are categorical (and hence can be represented by binary variables) and the impact of these variables is a simple mean shift in the confidential variables. In such situations, GADP provides the same level of data utility and disclosure risk as it does for variables with a multivariate normal distribution. Even in other cases, if the statistic of interest is the covariance matrix and if it can be assumed that all predictions will be based on the covariance matrix, GADP can be used effectively. However, in these cases, GADP would not provide the same level of data utility since the marginal distribution of the perturbed variables will be different from that of the original variables and any non-linear relationships will be distorted. Finally, in such situations, if a snooper employs non-linear models to estimate the confidential variables, it is possible that additional information is provided.
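As a quick numerical check of equations (7) and (9), the following sketch simulates the bivariate standard normal illustration above and compares the R² obtained by regressing X on S and the masked variable, first under additive noise and then under a GADP-style draw from f(X|S). The values of ρ and d, and all variable names, are our own illustrative choices.

```python
# Minimal simulation of the bivariate normal illustration above (assumed values:
# rho = 0.6, noise standard deviation d = 0.5). It compares the disclosure risk
# R^2 of regressing X on (S, Y) for additive noise and for a GADP-style draw.
import numpy as np

rng = np.random.default_rng(0)
n, rho, d = 100_000, 0.6, 0.5

S = rng.standard_normal(n)
X = rho * S + np.sqrt(1 - rho**2) * rng.standard_normal(n)       # corr(X, S) = rho

Y_noise = X + d * rng.standard_normal(n)                          # equation (1): Y = X + eps
Y_gadp = rho * S + np.sqrt(1 - rho**2) * rng.standard_normal(n)   # draw from f(X | S), independent of X

def r_squared(target, predictors):
    """R^2 from an ordinary least squares fit of target on the predictors plus an intercept."""
    Z = np.column_stack([np.ones(len(target))] + predictors)
    beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
    resid = target - Z @ beta
    return 1 - resid.var() / target.var()

print("additive noise:", r_squared(X, [S, Y_noise]))   # ~ rho^2 + (1-rho^2)^2/((1-rho^2)+d^2) = 0.82
print("GADP:          ", r_squared(X, [S, Y_gadp]))    # ~ rho^2 = 0.36
```

Up to sampling error, the two printed values should agree with equations (7) and (9) respectively.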

3. Copula-Based GADP Method

One of the key concerns with GADP is the fact that when the individual variables are not normally distributed, the marginal distribution of the perturbed values is different from that of the original values. While the conditional distribution approach is theoretically sound, implementing it for non-normal multivariate distributions is significantly more complicated than it is for the multivariate normal distribution. Even characterizing the joint distribution when the individual variables can take on arbitrary marginal distributions is very difficult. Deriving the conditional distribution in these cases may not even be mathematically feasible. This required us to consider alternatives that would provide us with the ability to approximate the joint distribution when the individual marginal distributions are not normal. Copulas offer just such an approximation. Copulas are often used in the statistical literature as a method for joining distributions with arbitrary marginal distributions. A variety of copulas have been proposed in the literature. One such copula is the multivariate normal copula. We employed the multivariate normal copula to generate conditional realizations for non-normal datasets (referred to as C-GADP). For a complete description of this method, please refer to Sarathy et al. (2002). The copula method can be described as follows (a small illustrative sketch of these steps is given at the end of this section):

1. Identify the marginal distribution of the attributes X₁, …, X_n, S₁, …, S_m.
2. Compute the pair-wise rank order correlation matrix (R) of the original database.
3. Compute the product moment correlation matrix ρ from R using the transformation ρ_ij = 2 × sin(π × r_ij / 6) (see Kruskal 1958).
4. Compute the new variables X* and S*.
5. Apply GADP to the variables X* and S* to generate Y*.
6. Compute Y from Y*.

Here X*, S*, and Y* are defined as follows:

x_i* = Φ⁻¹(F_i(x_i)), i = 1, …, n,
s_j* = Φ⁻¹(F_j(s_j)), j = 1, …, m, and
y_k* = Φ⁻¹(F_k(y_k)), k = 1, …, n,

where F represents the cumulative distribution function of the individual marginal variables. Note that this approach only requires that the marginal distribution of the individual variables be identified. The joint distribution is then approximated by the multivariate normal copula. However, since the multivariate normal copula is only an approximation of the true joint distribution, C-GADP does not provide complete data utility. More specifically, using pair-wise rank order correlations is adequate to preserve all monotonic (both linear and non-linear) relationships. However, if the data contain non-monotonic relationships (∪-shaped, ∩-shaped or sine-wave), then the C-GADP perturbed variables will not maintain such relationships. While this represents one problem with C-GADP, considering the ability of other approaches for perturbing non-normal variables, C-GADP represents a significant improvement. Sarathy et al. (2002) also provide an example application of the C-GADP approach. Interested readers can also visit the first author's web site (http://gatton.uky.edu/faculty/muralidhar/maskingpapers/) for the example data set that was used in the study. Thus, in terms of data utility, C-GADP is capable of:

(1) Maintaining the marginal distribution of the perturbed variables to be the same as that of the (original) confidential variables, even if the original variables have marginal distributions that are not normal, and
(2) Maintaining monotonic dependence (linear or non-linear) between variables that can be captured through pair-wise rank order correlation.


(3) In terms of disclosure risk, like GADP, C-GADP also assures that, given S, X and Y are independent. Hence, C-GADP perturbed variables do not provide an intruder with any additional information.

Enhancing and extending the copula-based approach for perturbation represents an important direction for future research.
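As a rough, self-contained illustration of the six C-GADP steps listed above, the sketch below perturbs one confidential variable X given one non-confidential variable S using the normal copula. The marginal distributions are handled through empirical CDFs and quantiles, which is our own simplification of step 1, and all data and variable names are invented for the example.

```python
# Sketch of the C-GADP steps for one confidential variable X and one
# non-confidential variable S. Marginals are treated via empirical CDFs and
# quantiles (a simplification of step 1); all data here are simulated.
import numpy as np
from scipy.stats import norm, rankdata

def c_gadp_sketch(X, S, rng):
    n = len(X)
    # Steps 1-2: normal scores (normal copula with empirical marginals) and
    # the pair-wise rank order (Spearman) correlation.
    X_star = norm.ppf(rankdata(X) / (n + 1))
    S_star = norm.ppf(rankdata(S) / (n + 1))
    r = np.corrcoef(rankdata(X), rankdata(S))[0, 1]
    # Step 3: convert the rank correlation to a product-moment correlation.
    rho = 2 * np.sin(np.pi * r / 6)
    # Steps 4-5: GADP on the (approximately standard normal) scores:
    # draw Y* from f(X* | S*), independently of X*.
    Y_star = rho * S_star + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    # Step 6: map Y* back to the data scale via the empirical quantiles of X.
    return np.quantile(X, norm.cdf(Y_star))

rng = np.random.default_rng(1)
S = rng.gamma(2.0, 1.0, size=5000)                                  # skewed non-confidential variable
X = np.exp(0.5 * np.log(S + 1) + 0.3 * rng.standard_normal(5000))   # confidential, monotonically related to S
Y = c_gadp_sketch(X, S, rng)

print(np.corrcoef(rankdata(X), rankdata(S))[0, 1])   # original rank order correlation
print(np.corrcoef(rankdata(Y), rankdata(S))[0, 1])   # approximately preserved after perturbation
```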

4. The Data Shuffle

One of the complaints against all "perturbation" approaches is that the perturbed values are "not the same" as the original values. In other words, the process of masking "changes" the original values. Although from a statistical perspective this makes little difference, it appears to make a big difference from a practical perspective. The Wall Street Journal (February 14, 2001) quotes a Census Bureau researcher as stating, "… users have found this extremely irritating and unacceptable". One technique that facilitates releasing microdata without "modifying" the original values is data swapping. For numerical variables, Moore (1996) suggests the use of the "rank-based proximity swap" and provides valuable insights into the swapping process. However, compared to perturbation methods, swapping provides lower data utility and, perhaps more importantly, far higher risk of disclosure. For illustration purposes, consider Figures 1 and 2 below, which plot the original values against the swapped values and against the C-GADP perturbed values for a particular data set.

Figure 1. Original and Swapped Data (scatter plot of the original values against the swapped values; both axes run from 0 to 1000)

Figure 2. Original and Copula Perturbed Data (scatter plot of the original values against the C-GADP perturbed values; both axes run from 0 to 1000)

Figure 1 shows that there is a very strong relationship between the original and swapped values, especially in the middle sections of the data compared to the ends (and hence the "bow tie" effect). By contrast, Figure 2 shows that there is practically no relationship between the original and C-GADP perturbed values. Even a lay person would conclude from the above figures that it is far more likely that an intruder would be able to predict the value of the original variable with far higher accuracy using the swapped values than the C-GADP perturbed values.
It is also likely that, because of such a relationship, swapped values result in a far higher risk of identity disclosure. We have verified that when swapping is used for the data set illustrated above, it is possible to re-identify 1035 of the 1500 observations, or approximately 69% of the observations. By contrast, when C-GADP is used, only 2 of the 1500 observations (or 0.13%) are re-identified. Further, swapping also results in modifying relationships between variables (Moore 1996). Finally, unlike other masking procedures, there is very little theoretical support for swapping. Although Moore (1996) has derived some results with respect to swapping, swapping lacks the strong theoretical basis that exists for perturbation methods. Thus, other than user acceptance, there is very little in favor of swapping. However, user acceptance is an important consideration in microdata release. Thus, there is a need to develop a new technique that is capable of combining the benefits of both perturbation and swapping. Such an approach must provide the same data utility and disclosure risk characteristics as GADP or C-GADP, but must use only the original (unmodified) data. We have developed such a method, which we refer to as "The Data Shuffle". Data shuffling was developed on the same strong theoretical foundations as GADP and C-GADP and hence possesses the same data utility and disclosure risk characteristics as the perturbation techniques. However, unlike the perturbation methods, it does not modify the original values; the original values are used directly in masking the data. Data shuffling and data swapping differ in one important respect: unlike data swapping, data shuffling does not exchange values between pairs of records. Values are "truly" shuffled and, consequently, the value of the ith record could be assigned to the jth record, that of the jth record to the kth record, and so on. Thus, the actual values in the shuffled data are indeed the original values, just shuffled in such a manner as to:

(1) Maintain the marginal distribution of the individual variables exactly (benefit of swapping),
(2) Maintain all monotonic relationships between confidential variables to be the same before and after shuffling (benefit of perturbation),
(3) Maintain all monotonic relationships between confidential and non-confidential variables to be the same before and after shuffling (benefit of perturbation), and
(4) Maintain disclosure risk at a minimum (benefit of perturbation).

We are currently in the process of developing a manuscript that describes the theoretical foundation for data shuffling, provides an illustration of data shuffling (both for simulated and real data), evaluates data utility and disclosure risk characteristics, compares its performance to other masking techniques, and addresses implementation issues relating to small data sets.


References

Dalenius, T. 1977. Towards a Methodology for Statistical Disclosure Control. Statistisk Tidskrift 5, 429-444.
Duncan, G.T. and D. Lambert 1986. Disclosure-Limited Data Dissemination. Journal of the American Statistical Association 81, 10-18.
Fienberg, S.E., U.E. Makov, and A.P. Sanil 1997. A Bayesian Approach to Data Disclosure: Optimal Intruder Behavior for Continuous Data. Journal of Official Statistics 13, 75-89.
Fuller, W.A. 1993. Masking Procedures for Microdata Disclosure Limitation. Journal of Official Statistics 9, 383-406.
Kim, J. 1986. A Method for Limiting Disclosure in Microdata Based on Random Noise and Transformation. Proceedings of the American Statistical Association, Survey Research Methods Section, ASA, Washington D.C., 370-374.
Kruskal, W.H. 1958. Ordinal Measures of Association. Journal of the American Statistical Association 53, 814-861.
Moore, R. 1996. Controlled Data Swapping Techniques for Masking Public Use Data Sets. U.S. Bureau of the Census, Statistical Research Division Report (rr96/04).
Muralidhar, K., R. Parsa, and R. Sarathy 1999. A General Additive Data Perturbation Method for Database Security. Management Science 45, 1399-1415.
Muralidhar, K., R. Sarathy, and R. Parsa 2001. An Improved Security Specification for Data Perturbation with Implications for E-Commerce. Decision Sciences 32, 683-698.
Muralidhar, K. and R. Sarathy 2003. A Theoretical Basis for Perturbation Methods. Statistics and Computing (forthcoming).
Sarathy, R., K. Muralidhar, and R. Parsa 2002. Perturbing Non-Normal Confidential Attributes: The Copula Approach. Management Science 48, 1613-1627.
Sullivan, G. 1989. The Use of Added Error to Avoid Disclosure in Microdata Releases. Unpublished Ph.D. Dissertation, Iowa State University, Ames, Iowa.
Sullivan, G. and W.A. Fuller 1989. The Use of Measurement Error to Avoid Disclosure. Proceedings of the American Statistical Association, Survey Research Methods Section, 802-807.
Tendick, P. 1991. Optimal Noise Addition for Preserving Confidentiality in Multivariate Data. Journal of Statistical Planning and Inference 27, 341-353.
Tendick, P. 1992. Assessing the Effectiveness of the Noise Addition Method of Preserving Confidentiality in the Multivariate Normal Case. Journal of Statistical Planning and Inference 31, 273-282.
Tendick, P. and N. Matloff 1994. A Modified Random Perturbation Method for Database Security. ACM Transactions on Database Systems 19, 47-63.
Traub, J.F., Y. Yemini, and H. Wozniakowski 1984. The Statistical Security of a Statistical Database. ACM Transactions on Database Systems 9, 672-679.
Willenborg, L. and T. de Waal 2001. Elements of Statistical Disclosure Control. Springer, New York.
Yancey, W.E., W.E. Winkler, and R.H. Creecy 2002. Disclosure Risk Assessment in Perturbative Microdata Protection. In J. Domingo-Ferrer (ed.), Inference Control in Statistical Databases, Springer, New York.


THE NOISE METHOD FOR TABLES - RESEARCH AND APPLICATIONS AT STATISTICS NEW ZEALAND

Mike Camden, Katrina Daish and Frances Krsinich
Statistics New Zealand

Introduction

Tables of business magnitude data, which are the main output from business surveys, have a relatively high disclosure risk. Business populations tend to be skewed, with many small and medium-sized businesses and just a few very dominant businesses. This means that, for sampling efficiency, the large businesses are usually in full-coverage strata, so there is no confidentiality protection from sampling for these units. There is generally good public knowledge about the industry, size and region of businesses, and their approximate market share. This information can enable close approximations of confidential information for those cells dominated by just a few large businesses. In particular, businesses can use their own data to deduce the characteristics of other businesses in the same cell.

Cell suppression is a common method for protecting tables of business magnitude data. Cell suppression is a two-stage process. A dominance rule, such as the (n,k) rule, is used to identify sensitive cells for suppression. Suppression of these sensitive cells gives the 'primary suppressions'. To protect against derivation of the sensitive cells by subtraction from the marginal totals of the table, 'secondary suppressions' of non-sensitive cells are usually also required. Determination of the secondary suppression patterns is non-trivial, particularly for large and complex tabulations. Upper and lower bounds for the suppressed cell values can be derived by solving the equations implied by the interior and marginal cell values, in addition to non-negative constraints on the cell value (Cox, 1995). The intervals within these bounds are referred to as 'feasibility intervals' for the suppressed value. An optimal secondary suppression pattern minimises the information loss due to cell suppression while ensuring that feasibility intervals around sensitive values are sufficiently large to preserve respondent confidentiality.
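The (n,k) dominance rule mentioned above is commonly stated as: a cell is sensitive if its n largest contributors account for more than k per cent of the cell total. A minimal sketch of such a check follows; the values of n and k are illustrative and are not Statistics New Zealand's actual parameters.

```python
# Minimal sketch of an (n,k) dominance rule: a cell is sensitive if its n largest
# contributions exceed k per cent of the cell total. n and k are illustrative.
def is_sensitive(contributions, n=2, k=85.0):
    total = sum(contributions)
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return total > 0 and 100.0 * top_n / total > k

print(is_sensitive([700, 200, 300, 400]))   # False: top two are 1100/1600 = 69%
print(is_sensitive([1500, 40, 30, 30]))     # True: top two are 1540/1600 = 96%
```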


The Noise Method as an Alternative

In 2000, analysts of the Workplace Disputes Survey, a survey run jointly between Statistics New Zealand and the New Zealand Department of Labour, had difficulty applying random rounding and the (n,k) rule with their output software. They suggested adding unbiased random noise at the unit record level instead. After consideration, Statistics New Zealand approved this as an adequate confidentiality measure, particularly since the sample design was non-standard (for a business survey), in that not all large businesses were full-coverage - it therefore had less risk associated with it than a standard business survey. Sampling weights were 'disturbed' by an amount inversely proportional to the sampling weights, to adjust for the protection already offered by sampling.

Later in 2000, Laura Zayatz from the US Census Bureau gave a paper titled "Using Noise for Disclosure Limitation of Establishment Tabular Data" at the Second International Conference on Establishment Surveys (ICES2) in Buffalo, New York (Zayatz et al 2000). This outlined a similar method. Rather than disturbing the weights directly, a 'multiplier' is generated. The multiplier is only applied to the sampled unit (and not to those other units in the population which the sampled unit represents), which results in the level of disturbance being inversely proportional to the weight. The method was experimentally applied to the US Census Bureau's Research and Development Survey, and various summary statistics were computed to empirically test the properties of the method. We replicated this work using our own Annual Enterprise Survey (AES) data, and the results are presented in a Statistics New Zealand research report (Krsinich and Piesse, 2002). Our results were very similar to Zayatz et al (2000). In addition, we defined and computed some information loss measures to compare cell suppression and the noise method for the tables we were working with.

How the Noise Method Works

For each observation, or unit, in the data, a multiplier is randomly generated from some distribution centered around 1. A bimodal distribution with all values at least, say, 10% away from 1 ensures that each unit's value is perturbed by at least 10%. Before tabulation, values are multiplied by (multiplier + (weight - 1)), rather than by their original sampling weight. The method is unbiased (Zayatz et al 2000; Evans, Zayatz and Slanta 1998). That is, the expected value of the 'noised' cell is equal to the original cell value. The method is illustrated below in Tables 1 to 4 with a simple example:


Table 1. Fictional microdata and multipliers

id  Industry  Region  Turnover ($000)  Weight  Weighted value  Multiplier  'Noised' weighted value
1   A         a       50               1       50              1.12        56 (= 50 × 1.12)
2   A         b       30               1       30              1.09        32.7
3   A         b       40               1       40              1.11        44.4
4   B         a       12               5       60              0.91        58.92 (= 12 × 4.91)
5   B         a       14               5       70              1.10        71.4
6   B         b       7                100     700             0.88        699.16 (= 7 × 99.88)
7   B         b       2                100     200             0.93        199.86
8   B         b       3                100     300             1.11        300.33
9   B         b       4                100     400             0.90        399.6

Table 2. Original table - Turnover ($000)

             Region a   Region b   Total
Industry A   50         70         120
Industry B   130        1600       1730
Total        180        1670       1850

Table 3. Noised table - Turnover ($000)

             Region a   Region b   Total
Industry A   56         77.1       133.1
Industry B   130.32     1598.95    1729.27
Total        186.32     1676.05    1862.37

Table 4. Percentage difference between noised and original table

             Region a   Region b   Total
Industry A   12         10         11.0
Industry B   0.2        -0.07      -0.04
Total        3.5        0.3        0.6

The percentage difference (i.e. the 'noise') is greater for full-coverage cells. In our example, cells corresponding to Industry A both consist solely of full-coverage businesses (i.e. weights = 1), with one business in Region a and two in Region b. These have 12% and 10% noise respectively. On the other hand, Industry B, Region a has 2 medium-sized businesses (weights = 5) and receives 0.2% noise. Industry B, Region b has 4 small businesses (weights = 100) and receives only 0.07% noise. It is important to note that, although it is applied at the microdata level, the noise method protects the table, not the microdata.
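The arithmetic of Tables 1 to 3 can be reproduced with a few lines of plain Python. The sketch below uses the microdata and multipliers from Table 1 and prints the original and noised cell totals; the structure and names are our own, not Statistics New Zealand code.

```python
# Reproduce the noised table (Table 3) from the fictional microdata in Table 1:
# each unit's value is multiplied by (multiplier + weight - 1) and then tabulated.
from collections import defaultdict

units = [  # (industry, region, turnover, weight, multiplier) as in Table 1
    ("A", "a", 50, 1, 1.12), ("A", "b", 30, 1, 1.09), ("A", "b", 40, 1, 1.11),
    ("B", "a", 12, 5, 0.91), ("B", "a", 14, 5, 1.10), ("B", "b", 7, 100, 0.88),
    ("B", "b", 2, 100, 0.93), ("B", "b", 3, 100, 1.11), ("B", "b", 4, 100, 0.90),
]

original, noised = defaultdict(float), defaultdict(float)
for ind, reg, value, weight, mult in units:
    original[(ind, reg)] += value * weight
    noised[(ind, reg)] += value * (mult + weight - 1)

for cell in sorted(noised):
    print(cell, round(original[cell], 2), round(noised[cell], 2))
# ('A', 'a') 50 56.0   ('A', 'b') 70 77.1   ('B', 'a') 130 130.32   ('B', 'b') 1600 1598.95
```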


The Advantages of the Noise Method

For any dataset, the noise method is applied only once. From then on, all tables produced from the dataset will be consistent, both internally (i.e. tables will be additive) and externally (i.e. related tables will be consistent with each other). A consequence of this is that there are no disclosure risks posed by multiple production of the same table, or production of related tables. Also, the size and/or complexity of tables doesn't affect the application of the method. Publication of noised sensitive cells gives the users an approximate value, rather than complete suppression. In general, more noise is added to the sensitive cells and less noise is added to the non-sensitive cells, so the information loss is targeted to those cells which pose the most risk.

Simulation Study Following the Approach of Zayatz et al (2000)

The Annual Enterprise Survey (AES) is Statistics New Zealand's largest financial survey. It was recently redesigned (Krsinich 2000) and now makes extensive use of tax data for small and simple businesses. The post-out sample is around 20,000, with full coverage of approximately another 200,000 units whose data are derived directly from tax data. Despite the large sample sizes resulting from full coverage of small businesses via the use of tax data, sensitive cells still occur in the standard published AES tables. Together with the secondary suppressions, there can be a significant loss of information resulting from cell suppression in some of the standard published AES tables.

We used AES 1999 data as an example for trialling the noise method and followed the same approach as Zayatz et al (2000) to ensure direct comparability with their work. In addition to this, we discussed and defined some measures of information loss to enable a comparison of information loss between cell suppression as currently performed in AES, and the noise method. Krsinich and Piesse (2002) give a more detailed account of this work than we have space for here. Three industries with suppressions in the standard published tables for the variable Total Income were chosen. Note that these are examples of AES tables with relatively high levels of suppressions. Primary suppressions are indicated by a 'p' and secondary suppressions by an 's' in Tables 5 to 7 below. Each of the tables forms part of what is effectively a 4-dimensional table, with relationships across time, and all-industry totals, as well as the more obvious 1-digit industry totals and total income marginals corresponding to the two-dimensional tables below.

Table 5. Industry F (Wholesale): 3-digit level. Rows F011-F017 and Total F; columns sales - nfp, sales - other, interest, govt fund, non-op and total inc. The table contains one primary suppression (p) and five secondary suppressions (s).

Table 6. Industry K (Finance and Insurance): 2-digit level. Rows K01-K03 and Total K; columns sales, interest, govt fund, non-op and total inc. Row K01 contains two secondary suppressions, and row K02 contains one primary and one secondary suppression.

Table 7. Industry P (Cultural and Recreational Services): 3-digit level. Rows P011-P013 and Total P; columns sales, interest, govt fund, non-op and total inc. Row P011 contains one primary and one secondary suppression, and rows P012 and P013 each contain two secondary suppressions.

We applied approximately 10% noise to each unit's value. There is some hierarchical structure to the sampled units. More than one unit can belong to the same 'group'. To ensure protection at the group level, two stages of randomisation are used. The first assigns a 'direction' at the group level - that is, each group of units has a 0.5 probability of having a multiplier close to 0.9 and a 0.5 probability of having a multiplier close to 1.1. In the second stage, the units within each group are assigned a multiplier from the distribution around whichever of 1.1 or 0.9 has been assigned to its group. Then, for every unit, the values of interest are multiplied by (multiplier + (weight - 1)) before tabulation.


We ran 1000 replications of the three noised AES tables, and computed summary statistics to describe the behaviour of the cells across the replications. We computed the absolute percentage noise in each cell for each replication. We then averaged these absolute percentages across the 1000 replications. The distribution of this 'average absolute percentage noise' across cells of different types is shown below in Graph 1.

Graph 1. Average absolute percentage noise
Primary suppressions (3 cells): Av 10.7, Med 10.6, Max 10.8, Min 10.5
Secondary suppressions (13 cells): Av 3.5, Med 3.3, Max 7.3, Min 0.87
Unsuppressed cells (72 cells): Av 3.0, Med 2.6, Max 9.0, Min 0.46

Primary suppressions are those cells which are defined as sensitive by Statistics New Zealand's version of the (n,k) rule. As expected, these receive significantly more noise than the non-sensitive cells. We want these cells to receive more noise, because these are the cells we want to protect. Conversely, the presence of significantly less noise in the non-sensitive cells (i.e. both the 'secondary suppressions' and the 'unsuppressed cells' in Graph 1) is a desirable result, as these are the cells we don't need to protect against disclosure.
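In the same spirit as the simulation described above, a toy replication study can be run on the fictional microdata of Table 1. The bimodal multiplier distribution used here (uniform on [0.85, 0.90] or [1.10, 1.15]) is our own illustrative choice rather than the Beta distribution of Zayatz et al (2000).

```python
# Toy replication study: repeatedly apply the noise method to the Table 1
# microdata and average the absolute percentage noise per cell. The bimodal
# multiplier distribution used here is an illustrative choice.
import numpy as np
from collections import defaultdict

units = [("A", "a", 50, 1), ("A", "b", 30, 1), ("A", "b", 40, 1),
         ("B", "a", 12, 5), ("B", "a", 14, 5), ("B", "b", 7, 100),
         ("B", "b", 2, 100), ("B", "b", 3, 100), ("B", "b", 4, 100)]
original = defaultdict(float)
for ind, reg, value, weight in units:
    original[(ind, reg)] += value * weight

rng = np.random.default_rng(3)
abs_pct = defaultdict(list)
for _ in range(1000):
    noised = defaultdict(float)
    for ind, reg, value, weight in units:
        direction = rng.choice([-1, 1])                  # below or above 1
        mult = 1 + direction * rng.uniform(0.10, 0.15)   # at least 10% away from 1
        noised[(ind, reg)] += value * (mult + weight - 1)
    for cell, total in original.items():
        abs_pct[cell].append(abs(noised[cell] - total) / total * 100)

for cell, values in sorted(abs_pct.items()):
    print(cell, round(float(np.mean(values)), 2))   # full-coverage cells receive the most noise
```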

Information Loss Comparisons

The noise method results in the addition of at least some noise to every cell. The cell suppression method results in complete suppression of primary and secondary suppressed cells, but other cells are left unchanged. We discuss, define and compute some measures of information loss due to cell suppression, and we then compare these to the average absolute percentage of noise due to the noise method.


When considering the protection offered by a method such as cell suppression, we assume that an intruder can, and will, combine the equations implied by the remaining values in the tables, to derive feasibility intervals for each suppressed value. While this worst case scenario might be a necessary assumption for guaranteeing a specified level of disclosure limitation, it is perhaps more useful to consider a ‘nonintruding user’ when trying to quantify information loss. We therefore compute two different information loss measures, corresponding to both the ‘intruder’ and ‘nonintruder’ scenarios. For the ‘intruder scenario’ we calculate the information loss corresponding to the feasibility intervals resulting from the particular cell-suppression pattern that was used for AES99. But most users won't be able and/or willing to put in the work necessary to derive feasibility intervals, particularly for large or complex linked or hierarchical tables. For these users, we assume that the information lost due to cell-suppression is the full value of the cell.

The Intruder's Information Loss from Cell Suppression

We calculated the intruder's information loss as the half-width of the feasibility interval divided by the midpoint of the feasibility interval. See Krsinich and Piesse (2002) for a discussion of why we adopted this particular measure. We calculated the 'intruder's information loss' for the three AES99 tables being considered and, from these, we calculated the average intruder's information loss for each type of cell, to compare to the average absolute percentage noise using the noise method. The results are shown in Table 8.

Table 8. Information loss comparison – Intruder scenario (%)

Type of cell                    Cell suppression   Noise method
Primary suppression (3 cells)   100                11
Secondary suppression (13)      61                 3.5
Interior cell (59)              19                 3.7
Unsuppressed cells (72)         0                  3.0
All cells (88)                  12                 3.3
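To make the feasibility-interval idea concrete, the sketch below takes the margins of the fictional Table 2 from earlier, treats all four interior cells as suppressed, derives each cell's feasibility interval by linear programming, and reports the intruder's information loss measure (half-width divided by midpoint). This is our own illustration; it does not use the AES suppression patterns.

```python
# Feasibility intervals for the four suppressed interior cells of a 2x2 table
# with published margins (those of the fictional Table 2), via linear programming,
# together with the intruder's information loss (half-width / midpoint).
from scipy.optimize import linprog

row_totals, col_totals = [120, 1730], [180, 1670]   # published margins
A_eq = [[1, 1, 0, 0],    # row 1:    c11 + c12 = 120
        [0, 0, 1, 1],    # row 2:    c21 + c22 = 1730
        [1, 0, 1, 0],    # column 1: c11 + c21 = 180
        [0, 1, 0, 1]]    # column 2: c12 + c22 = 1670
b_eq = row_totals + col_totals
bounds = [(0, None)] * 4                             # cells are non-negative

for j, name in enumerate(["c11", "c12", "c21", "c22"]):
    cost = [0.0] * 4
    cost[j] = 1.0
    lower = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
    cost[j] = -1.0
    upper = -linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun
    half_width, midpoint = (upper - lower) / 2, (upper + lower) / 2
    print(f"{name}: [{lower:.0f}, {upper:.0f}]  intruder info loss = {half_width / midpoint:.0%}")
```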

The User's Information Loss from Cell Suppression

As discussed above, it could be argued that, for most users, a suppressed cell represents a complete loss of information for that cell. So, using a simpler information loss measure which counts each suppressed cell as a 100% loss of information, we can derive the average 'user's information loss' for each different type of cell.

Table 9. Information loss comparison – User scenario (%)

Type of cell                    Cell suppression   Noise method
Primary suppression (3 cells)   100                11
Secondary suppression (13)      100                3.5
Interior cell (59)              27                 3.7
Unsuppressed cells (72)         0                  3.0
All cells (88)                  18                 3.3

Using either information loss measure, it can be seen that the noise method compares very favourably in terms of the average amount of information lost for these tables. The trade-off is that cells which would have been unchanged under the cell-suppression method are now 'noised'. Graph 2 shows the comparison visually.

Graph 2. Information loss comparison (bar chart of % information loss by cell type - primary, secondary, interior, unsuppressed and all cells - for the intruder scenario, the user scenario and the noise method)

Extensions to the Method for Implementation for AES 2003

Given the promising results from the research outlined above, we hope to implement the noise method for our Annual Enterprise Survey in 2003. Further testing for this planned implementation has used AES 2000 data, and a multiplier derived from a truncated half-normal distribution 10% away from 1. With a standard deviation of 0.02, the distribution is truncated at 0.1. This means that no business, or unit, in the survey has its response 'noised' by greater than 20% or less than 10%, and 95% of the businesses have somewhere between 10% and 15% noise added to, or subtracted from, their original value. We ran tests on the AES 2000 data using this distribution and the results reflected those of the earlier simulation study, which used a Beta distribution as in Zayatz et al (2000).

The main issue that has arisen for implementation is the potential effect of the noise method on estimates of movements, which are important survey outputs. The noise method would cause too much volatility in movements if it was applied independently each year. Therefore we have split the noise into two parts. The 'base noise' is the direction of the noise - that is, whether we add or subtract approximately 10%. A unit will retain its base noise for its life in the survey. We also apply 'extra noise' (from the corresponding truncated half-normal distribution) independently from year to year (a sketch of one way such multipliers could be generated is given after Table 10). We have formulated the relative 'sampling plus noise' error for estimates of levels (see the Appendix). This remains to be done for movements. This information will be important for the ultimate decision on whether to adopt the method for AES 2003.

For a range of industries, and for three different survey variables (Total Income, Total Expenditure and Total Assets), we calculated the relative sampling-plus-noise error for the AES 2000 data and compared this to the relative sampling-only error. This is shown in Table 10 below, for the following industries:

• Horticulture and Fruit Growing (HFG) has 12,671 units in the AES 2000 sample, and has no cell suppression.
• Other Food Manufacturing (OFM) has 352 units, and no cell suppression.
• Basic Metals (BM) has 131 units in the sample, and has some confidential cells.
• Telecommunication Services (TS) has 67 units, and is an industry that is dominated by a single large unit, and therefore has many confidential cells.

Table 10. Relative sampling error and relative sampling-plus-noise error (%)

Industry:                 HFG    OFM    BM     TS
Total Income
  Sampling                0.0    5.3    3.8    1.4
  Sampling plus noise     0.8    6.5    11.0   9.4
Total Expenditure
  Sampling                0.0    5.3    3.5    1.8
  Sampling plus noise     0.4    6.5    11.3   17.1
Total Assets
  Sampling                0.0    26.2   13.6   3.1
  Sampling plus noise     0.4    26.8   18.4   19.6
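The sketch below shows one way multipliers of the kind described above could be generated: a persistent 'base noise' direction per unit, plus 'extra noise' drawn each year from a half-normal distribution with standard deviation 0.02 truncated at 0.1, so the total noise magnitude lies between 10% and 20%. The details are our reading of the description above, not Statistics New Zealand's production specification.

```python
# Sketch of multiplier generation: each unit keeps a 'base noise' direction for
# its life in the survey, and 'extra noise' is drawn each year from a half-normal
# (sd 0.02) truncated at 0.1, so the total noise magnitude is between 10% and 20%.
import numpy as np

rng = np.random.default_rng(4)

def base_direction(n_units):
    """Assigned once per unit and retained across years."""
    return rng.choice([-1.0, 1.0], size=n_units)

def yearly_multipliers(direction, sd=0.02, floor=0.10, cap=0.10):
    extra = np.abs(rng.normal(0.0, sd, size=direction.shape))   # half-normal extra noise
    extra = np.minimum(extra, cap)                              # truncate at 0.1
    return 1.0 + direction * (floor + extra)                    # 10%-20% away from 1

direction = base_direction(5)
print(yearly_multipliers(direction))   # year 1
print(yearly_multipliers(direction))   # year 2: same directions, new extra noise
```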


Extending the Noise Method to Tables of Counts

Count data can be considered to be magnitude data where every respondent, or unit, contributes a magnitude of one. Tables of counts from household survey data are generally protected by uniformly small sampling fractions, so we are only interested in the case of tables of counts from a census, where the implicit weight is one for every unit. By randomly perturbing the weight of 1 either up (to 2) or down (to 0), we can introduce confidentiality protection while preserving the statistical properties of the data. This is a very simple and elegant approach which could be useful in situations where random rounding is not appropriate, such as when it is possible for an intruder to obtain many repetitions of the same, independently rounded, counts. This could arise in a remote access situation, or if a full suite of many-dimensional tables with shared, independently rounded, marginals was produced. That is, the noise method for counts would avoid problems of undoing protection via comparison of related tables.

A potential problem with the method is the level of variance introduced for large-valued cells. If the probability of perturbing the weight from 1 to 0 is p, and the probability of perturbing the weight from 1 to 2 is also p (i.e. the method is unbiased), then the variance of the introduced noise is 2pc, where c is the unperturbed cell count. So, for example, with a p of 0.2 and a cell value of 1000, the variance introduced is 400, which translates to a 'relative noise error' of 1.96 × √400 / 1000 = 3.9%. This has led us to consider various alternatives based on using the 'noised' cell under a certain cell size threshold, and the original cell over that threshold. These alternatives are not as simple or elegant, and suffer from non-additivity and some small potential for 'unlocking' - similar to the small number of unusual examples that can be 'unlocked' under random rounding. However, depending on the average size of the cells, whether there is potential for many related tables to be produced, and the relative strengths and weaknesses of the alternative methods available, the noise method for counts may be a useful approach to consider.
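A minimal sketch of the counts variant described above: each census unit's implicit weight of 1 is perturbed to 0 with probability p and to 2 with probability p, leaving the cell count unbiased with introduced variance 2pc. The value of p, the cell size and the number of replications are illustrative.

```python
# Noise method for census counts: each unit's weight of 1 becomes 0 with
# probability p, 2 with probability p, and stays 1 otherwise. The introduced
# variance for a cell of c units is 2*p*c.
import numpy as np

rng = np.random.default_rng(5)
p, c, reps = 0.2, 1000, 5000

weights = rng.choice([0, 1, 2], size=(reps, c), p=[p, 1 - 2 * p, p])
noised_counts = weights.sum(axis=1)

print(noised_counts.mean())                    # ~1000, i.e. unbiased
print(noised_counts.var())                     # ~2*p*c = 400
print(1.96 * np.sqrt(2 * p * c) / c * 100)     # 'relative noise error' ~3.9%
```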

Conclusion

The noise method is very promising for tables of magnitudes from business surveys. We will soon decide whether the method will be officially used for the 2003 AES survey results. Our experiences in operationalising the method may prove useful to other agencies considering adopting the noise method in the future.


There appears to be good potential for extending the method to tables of counts, and we hope this will be explored further, either by ourselves or others. Acknowledgements to Mike Doherty and James Enright.

References

Cox, L H (1995), 'Protecting confidentiality in business surveys', in Business Survey Methods, B G Cox et al (eds), 443-476, Wiley, New York.
Evans T, Zayatz L and Slanta T (1998), Using Noise for Disclosure Limitation of Establishment Surveys, Journal of Official Statistics, Vol 4, No. 4.
Krsinich F (2000), Tax Data in Statistics New Zealand's Main Economic Survey: A Two-Phased Redesign, Proceedings of the Second International Conference on Establishment Surveys, Buffalo, New York.
Krsinich F and Piesse A (2002), Multiplicative Microdata Noise for Confidentialising Tables of Business Data, Research Report #19, Statistics New Zealand. Available online at www.stats.govt.nz via 'publications' then 'technical publications'.
Zayatz L, Evans T and Slanta J (2000), Using Noise for Disclosure Limitation of Establishment Tabular Data, Proceedings of the Second International Conference on Establishment Surveys, Buffalo, New York.

Appendix. Relative Sampling-Plus-Noise Error - Formulation and Example

v_{s,n} = v_s + v_n ∑_{i=1}^{N} π_i y_i²    (1)

where v_{s,n} is the variance due to both sampling and noise, v_s is the variance due to sampling and v_n is the variance due to noise. For the truncated half-normal distribution described in this paper, v_n = 0.0146. π_i is the probability that business i was selected into the sample. Note that π_i = 1/w_i, where w_i is the weight for business i. y_i is whatever is being estimated (e.g. 'Total Income') for business i. The sum is across the population (i.e. i = 1 to N).

We only have sample data, so the formula needs to be restated in terms of the sample. Note that the sum across the population can be estimated by the weighted sum across the sample. That is,

∑_{i=1}^{N} π_i y_i² ≈ ∑_{i=1}^{n} w_i π_i y_i² = ∑_{i=1}^{n} w_i (1/w_i) y_i² = ∑_{i=1}^{n} y_i²

so we can restate formula (1) as

v_{s,n} = v_s + v_n ∑_{i=1}^{n} y_i²    (2)

From the relative sampling errors (RSE) produced for the AES estimates we can derive

v_s = (RSE × estimate / 1.96)²

For example, we have an estimate of 5.3% for the relative sampling error of the Total Income estimate for the industry Other Food Manufacturing (OFM). The Total Income estimate for industry OFM is 5,510,151 ($000), so, for this example,

v_s = (0.053 × 5,510,151 / 1.96)² = 2.22 × 10¹⁰

From the sample data we can calculate ∑_{i=1}^{n} y_i² = 79.44 × 10¹⁰, and we already have that v_n = 0.0146. So, substituting this into (2), we get

v_{s,n} = 2.22 × 10¹⁰ + 0.0146 × 79.44 × 10¹⁰ = 3.38 × 10¹⁰

Stating this in terms of relative sampling-plus-noise error, for comparison with the sampling error:

RSNE = 1.96 × √(3.38 × 10¹⁰) / 5,510,151 = 0.065 = 6.5%


So, for the variable Total Income for industry OFM (Other Food Manufacturing) in AES 2000, our use of the noise method means that the relative sampling-plus-noise error is 6.5%, compared to the relative sampling error of 5.3%.
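The worked example above can be verified with a few lines of arithmetic; all figures below are those quoted in the appendix.

```python
# Check of the appendix example: relative sampling-plus-noise error for
# Total Income, industry OFM, using the figures quoted above.
import math

rse, estimate = 0.053, 5_510_151          # relative sampling error and estimate ($000)
v_s = (rse * estimate / 1.96) ** 2        # sampling variance, ~2.22e10
v_n = 0.0146                              # noise variance for the truncated half-normal
sum_y_squared = 79.44e10                  # sum of y_i^2 over the sample
v_sn = v_s + v_n * sum_y_squared          # ~3.38e10
rsne = 1.96 * math.sqrt(v_sn) / estimate
print(round(v_sn / 1e10, 2), round(rsne, 3))   # 3.38, 0.065 (i.e. 6.5%)
```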


Discussion on New Data Release Techniques

Josep Domingo-Ferrer
Dept. of Computer Engineering and Maths (ETSE), Universitat Rovira i Virgili
Av. Països Catalans 26, E-43007 Tarragona, Catalonia
e-mail [email protected], http://www.etse.urv.es/~jdomingo

Abstract. An overview of papers submitted to the 3rd Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality on Topic 2 (New data release techniques) is given. A list of key issues that were at the core of the discussion for those papers is also given.

Keywords: Statistical disclosure control (SDC), Remote access systems, SDC for remote access, SDC threats resulting from mass e-access.

1 Introduction

There were four invited papers and two contributed papers presented on Topic 2 (New data release techniques). Authors of those papers came from seven different countries. According to the presented contents, three groups of papers can be distinguished in this topic:

– Technology for remote access systems. This includes the paper by T. Desai, the paper by J. Coder and M. Cigrang, and the paper by O. Andersen.
– SDC for remote access. Papers in this group are by N. Shlomo and by L. Franconi and G. Merola.
– SDC threats resulting from mass e-access. The paper by V. Torra, J. Domingo-Ferrer and A. Torres is the only one in this topic.

Section 2 below summarizes the contents of papers on technology for remote access systems and lists the related key issues for discussion. Section 3 tours the contents of papers on SDC for remote access and lists related discussion issues. Section 4 does the same for the paper on SDC threats resulting from mass e-access. Some conclusions are listed in Section 5.


2 Technology for remote access systems

Desai in her paper identifies the criteria for choosing a remote access system (speed, familiarity, flexibility, graphics, cost, security). The paper also lists the issues related to supporting a remote access system (human resources, networking and partnerships). Coder and Cigrang describe the technology of the LISSY Remote Access System, originally developed for the Luxembourg Income Study project. LISSY provides easy remote access to datasets by accepting requests submitted as e-mail messages. SAS, SPSS and STATA code can be used. Andersen gives an account of the (r)evolution of the Danish system for access to microdata, from on-site to remote data access. The paper gives some detail on the technological aspects of both the on-site and the remote access systems in use.

2.1 Key questions on technology for remote access systems

Desai's paper contains a very interesting overview of the evolution of remote access systems. Flexibility is indeed a difficult requirement to meet. The key issue is how to prevent confidential analyses, because the assumption that academics have neither the inclination nor the time to identify individuals may indeed be too optimistic. In addition, there are users who are pseudo- or non-academic, and these may have interests other than science. Another issue that appeared during the discussion of that paper was related to the internal operation of the technique for preventing confidential analyses based on "blocking at source". Coder and Cigrang's paper describes the LISSY system for remote access from the technological standpoint. A question that was raised during the discussion was whether manual prevention of confidential analyses (such as the one offered by LISSY) was sufficient, practical and safe. Similar remarks to those reported for the LISSY paper were made in connection with the Danish system described in Andersen's paper.

3 SDC for remote access systems

The paper by Shlomo reports on work done at CBS-Israel on SDC for remote access to microdata. R-U maps are used to compare SDC methods. Different methods are proposed to measure the disclosure risk and the information loss. Rather than using record linkage, analytical measures are used to estimate the expected number of correct matches. Franconi and Merola give in their paper a thorough account of SDC issues related to the release of tabular data through the Web. Problems peculiar to SDC in Web-based Systems for Data Dissemination (WSDD) are identified. WSDD are understood as sites allowing users to query tables at their choice. The approaches examined are:
– Source data perturbation
– Output perturbation
– Query set restriction
Each approach above can be applied before the query is submitted (PRE) or after it (POST).

3.1 Key questions on SDC for remote access

In Shlomo's paper, an estimate of global risk is computed for microdata protected using non-perturbative methods (sampling, collapsing, etc.). An interesting line of work would be to investigate analytical risk measures for perturbative masking, in order to avoid as far as possible the burden of empirical record linkage. Some work has already been accomplished for particular perturbative methods (e.g. the MASSC method for categorical data) but a generalization is not straightforward. Franconi and Merola offer in their paper a very detailed and comprehensive analysis of SDC for remote access. Some issues about it that emerged during the discussion were:
– The authors conclude that the choice of the release policy depends on the data to be released. A general recommendation (rule of thumb) about when to choose a PRE or a POST approach was felt to be very useful.
– A necessary assumption for POST SDC to be safe seems to be that users/intruders do not co-operate. The question arises as to when such an assumption is reasonable. The very fact that POST SDC requires that kind of assumption would seem to suggest that PRE SDC is to be preferred (?).

4 SDC threats resulting from mass e-access

Torra, Domingo-Ferrer and Torres analyze in their paper the SDC issues associated with the mass release of data in electronic format. Indeed, multiple database data mining is within reach of an increasing number of intruders thanks to electronic dissemination of both administrative and statistical datasets: – The steps of data mining are discussed, with a focus on data pre-processing and model estimation. – Data mining in SDC is discussed, with a focus on record linkage across databases. Record linkage aims at re-identification and can be carried out even if the databases do not share any variables.


4.1 Key questions on SDC threats resulting from mass e-access

Mass e-access results in users being able to link several data sources. This raises the following related issues for discussion:
– Given the enormous amount of information an intruder can access (on her own via the Web or through co-operation with other intruders), guidance is definitely needed for deciding which disclosure scenarios should be considered when empirically computing disclosure risk via record linkage.
– Another hot topic is how to include disclosure risk computation in a general SDC package such as µ-Argus. In particular, identification is needed of the input on disclosure scenarios to be considered that can reasonably be requested from a standard user.

5 Conclusions

Remote access systems should evolve towards automating the prevention of confidential analyses (SDC for remote access). Manual prevention is hard and qualified labor for it is scarce. At the same time, best practice recommendations on SDC for remote access would be very useful if available. Regarding disclosure risk for microdata, record linkage is still the only general tool for computing such a risk for a broad range of protection techniques. However, record linkage is costly in terms of computation and requires skilled users that can specify realistic disclosure scenarios. Therefore, in a similar way as analytical disclosure risk estimation has been developed for non-perturbative methods, it would be most interesting to be able to assess disclosure risk for perturbative methods without resorting to record linkage. Holy Grail?

References

O. Andersen, "From on-site to remote data access – The revolution of the Danish system for access to microdata", in 3rd Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Luxembourg, April 2003. See this volume.
J. Coder and M. Cigrang, "LISSY remote access system", ibidem.
T. Desai, "Providing remote access to data: the academic perspective", ibidem.
L. Franconi and G. Merola, "Implementing statistical disclosure control for aggregated data released via remote access", ibidem.
N. Shlomo, "Accessing microdata via the Internet", ibidem.
A. C. Singh, F. Yu and G. H. Dunteman, "MASSC: A new data mask for limiting statistical information loss and disclosure", ibidem.
V. Torra, J. Domingo-Ferrer and À. Torres, "Data mining methods for linking data coming from several sources", ibidem.


Accessing Microdata via the Internet
Natalie Shlomo, Israel Central Bureau of Statistics, Jerusalem

Abstract. With the implementation of new internet systems for providing remote access to users, the Israel Central Bureau of Statistics recognized the need to develop methodologies to reassess the disclosure risks in microdata that it releases and to set thresholds depending on the accepted levels of risk. Using a simple theoretical model, different disclosure control methods are compared against the information loss incurred. Real microdata sets that have been released as public-use files are evaluated in order to calculate the probability of obtaining correct matches for the sample uniques. The first method is based on a global risk measure estimated from the sample and the second method is based on linking the files to the National Population Register to simulate resources which could be in the possession of potential attackers. Keywords. microdata and internet access, risk assessment, global disclosure risk measures, information loss measures

1. Introduction

Microdata files released at the Israel Central Bureau of Statistics currently undergo a series of ad-hoc decisions for protecting the confidentiality of statistical entities. In general, there are three levels of microdata files: PUF, which is released upon request to legitimate researchers after they outline the details of their research; MUC, which is available to researchers at universities and approved research institutions that have a special contract with the Israel CBS; and files available at an on-site research facility at the CBS, where researchers can work and access original files (without direct identifiers) for data analysis. Because of the current lack of a systematic way of evaluating the risks of the microdata files, mistakes can be made. One example is the release of multiple PUF files for the same survey or census, each having detailed coding for different sets of variables. These files can then be linked using common variables and the entire original file can be recovered. The fact that until now PUF files were generally not given to the wide public and were granted only to researchers with specific requests gave a false sense of security as to the level of protection needed for the PUF file.


In recent years, the Israel CBS has faced increasing demands for microdata files and especially for providing remote access for users over the internet. Some of the microdata files of the Israel CBS are in the process of being placed on the internet for this purpose. This is being implemented in two parallel systems:

1. A multi-facet system for internet access is being developed at the Israel CBS that allows users to build custom-made tables and/or download files, depending on the level of security of the file. With this new application, a decision has to be made as to what type of file should be in the background in order to produce the tables and whether downloading all or parts of the file will be allowed. A file defined as PUF from which tables can be generated will a priori produce tables that are safe and there is no need for any further restriction or licensing. However, the data generated under such restrictions might have insufficient detail to have any real value to researchers. Another possibility is to have the original file in the background of the system from which tables can be generated, but this means placing strict restrictions on the tables themselves, such as a minimum number of cases to a cell, a maximum number of dimensions to the table, disclosure control software to evaluate and detect unsafe combinations, and prior registration of the users.

2. The Israel CBS has an agreement to place microdata files in a depository managed by the Hebrew University of Jerusalem. Some of the current PUF microdata files have been put on an internet website of the depository to provide remote access to registered researchers at universities and research institutions of Israel for producing custom-made tables and carrying out data analysis. There are currently no restrictions on the tables, and files can be downloaded as well.

With the implementation of the new internet systems, the Israel CBS recognized the need to develop methodologies to reassess the level of security of the microdata files that it releases. A working group was set up at the Israel CBS, together with Prof. Yosi Rinott of the Hebrew University of Jerusalem, to address these issues. The purpose of the research is to provide better tools for defining the levels of security of microdata according to its intended usage. The ultimate goals are to develop methodologies for evaluating the risks and categorizing the files according to risk measures; to determine thresholds and accepted levels of risk; and to develop methods of disclosure control that best suit the growing needs and policies of the Israel CBS. In addition, there is an immediate need to take disclosure control action during the interim period, in particular for the files that will be placed on the internet. In section 2 of the paper we compare two methods of disclosure control for a simple theoretical example using the R-U confidentiality map developed by Duncan et al. (2001). To build an empirical R-U confidentiality map for a real data set, we examine in section 3 different methods for measuring the disclosure risk based on a realistic attack scenario of linking the data set to the National Population Register (NPR). We also calculate estimated global risk measures based on the sampling design and the individual risk methodology (Benedetti et al. (2003), Rinott (2003)). In section 4 information loss measures for the different levels of disclosure risk are calculated, and in section 5 we discuss how to determine safe microdata files, especially for use with remote access via the internet.

2. R-U Confidentiality Map

Duncan et al. (2001) develop the means for examining the balance between information loss and disclosure risk through the use of an R-U confidentiality map. R is a risk measure for the data file and U is the utility of the data file. The map is a function of the parameters used in the disclosure control method. For example, adding random noise to a variable affects its variance. As the variance of the random noise increases, the risk of disclosing statistical entities decreases but the analytic properties of the variable can be seriously compromised. With the use of the R-U confidentiality map, a decision theory can be developed based on a given risk threshold, and optimal parameters of the disclosure control methods can be determined that maximize the data utility and minimize the disclosure risk. Duncan et al. (2001) presented a simple theoretical example based on the perturbative method of additive random noise applied to normally distributed variables to show the use of the R-U confidentiality map. We will elaborate and compare this method to global recoding of the variable, which is a non-perturbative method of disclosure control. Let $X_1, X_2, \ldots, X_n \sim N(\theta, 1)$ and let $Y_1, Y_2, \ldots, Y_n$ be the masked data after applying a disclosure control method. In general, a user will estimate a parameter $\theta$ of the distribution based on the released data, and will most likely estimate the parameter while ignoring the disclosure control method. A more sophisticated user who knows that masking of the data has taken place will want to estimate the true parameter $\theta$ of the distribution. This can be carried out by moment or maximum likelihood estimates that take into account that the released data has been coarsened or perturbed. The use of the E-M algorithm, for example, is a general method for finding the maximum likelihood estimate of the parameter of the underlying distribution from the given data which has been altered. The two disclosure control methods analyzed are:

1. Additive random noise to the variable, $Y_i = X_i + \varepsilon_i$, $\varepsilon_i \sim \text{iid}(0, \lambda^2)$, $i = 1, \ldots, n$.


2. Global recoding or coarsening the variable by publishing categorized groupings designated by cut-off values of the distribution, $(a_i, a_{i+1})$, $i = 1, \ldots, g$, where $a_0 = -\infty$ and $a_{g+1} = \infty$.

Assuming that the data user is interested in estimating the population mean $\theta$, the MLE of the parameter for additive random noise is the sample mean, since no bias was introduced. The MLE of the parameter for global recoding can be estimated by the E-M algorithm, though this will not be shown here. For the two disclosure control methods, we can apply the R-U confidentiality map. The utility of the data will be measured as the reciprocal of the variance of the sample mean after the application of the disclosure control method. For the first method of additive random noise, the variance of $\hat\theta$ will be equal to

$$\operatorname{var}(\hat\theta) = \operatorname{var}(\bar{Y}) = \frac{\sigma_y^2}{n} = \frac{1+\lambda^2}{n}.$$

For the second method of global recoding, the utility of the MLE of $\theta$ can be calculated from the Fisher information $I_n(\theta)$. Defining $\varphi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$, we obtain

$$I_1(\theta) = E_\theta\!\left(\frac{d}{d\theta}\log p_\theta(Y=y_i)\right)^{\!2} = \sum_{i=1}^{g}\left(\frac{\frac{d}{d\theta}p_\theta(Y=y_i)}{p_\theta(Y=y_i)}\right)^{\!2} p_\theta(Y=y_i) = \sum_{i=1}^{g}\frac{\left(\frac{d}{d\theta}p_\theta(Y=y_i)\right)^{2}}{p_\theta(Y=y_i)}.$$

For element $i$:

$$\frac{d}{d\theta}p_\theta(Y=y_i) = \frac{d}{d\theta}\int_{a_i}^{a_{i+1}}\varphi(x-\theta)\,dx = \int_{a_i}^{a_{i+1}}(x-\theta)\varphi(x-\theta)\,dx = \varphi(a_i-\theta) - \varphi(a_{i+1}-\theta).$$

And the final result is

$$I_1(\theta) = \sum_{i=1}^{g}\frac{[\varphi(a_i-\theta) - \varphi(a_{i+1}-\theta)]^2}{\Phi(a_{i+1}-\theta) - \Phi(a_i-\theta)}.$$

For $n$ variables, $I_n(\theta) = n I_1(\theta)$.

The disclosure risk will be measured as the reciprocal of the MSE between a specific target variable $X$ and the released variable obtained from the masked sample $Y$. A small MSE between the target variable and the released variable will increase the likelihood of re-identification. For the first method of additive random noise, the MSE is $E(Y-X)^2 = E(\varepsilon^2) = \lambda^2$ and the disclosure risk function is $R^a = \frac{1}{\lambda^2}$.

For the second method of global recoding, we can assume that the sophisticated attacker will try to identify the target variable by

$$Y_i = E(X_i \mid X_i \in (a_i, a_{i+1})) = \frac{\int_{a_i}^{a_{i+1}} x\,\varphi(x-\theta)\,dx}{p(X_i \in (a_i, a_{i+1}))} = \theta + \frac{\int_{a_i}^{a_{i+1}} (x-\theta)\,\varphi(x-\theta)\,dx}{p(X_i \in (a_i, a_{i+1}))}.$$

From here, we calculate the mean square error:

$$E(Y-X)^2 = \sum_{i=1}^{g} E\big((Y_i - X_i)^2 \mid X_i \in (a_i, a_{i+1})\big)\, p(X_i \in (a_i, a_{i+1})) = \sum_{i=1}^{g}\int_{a_i}^{a_{i+1}} (y_i - x)^2 \varphi(x-\theta)\,dx.$$

By replacing $Y$ with the above formula, and by noticing that $\sum_{i=1}^{g}\big[a_i\varphi(a_i-\theta) - a_{i+1}\varphi(a_{i+1}-\theta)\big] = 0$ and $\sum_{i=1}^{g}\int_{a_i}^{a_{i+1}}\varphi(x-\theta)\,dx = 1$, we obtain the MSE

$$E(Y-X)^2 = 1 - \sum_{i=1}^{g}\frac{[\varphi(a_i-\theta) - \varphi(a_{i+1}-\theta)]^2}{\Phi(a_{i+1}-\theta) - \Phi(a_i-\theta)} = 1 - I_1(\theta),$$

and the disclosure risk function $R^c = \frac{1}{1 - I_1(\theta)}$.

For any parameter $\theta$, we compare the risk measures of the two methods by setting the utilities to be equal, i.e. $I_1(\theta) = \frac{1}{1+\lambda^2}$, and obtain

$$\frac{1}{R^c} = 1 - I_1(\theta) = 1 - \frac{1}{1+\lambda^2} = 1 - \frac{1}{1 + \frac{1}{R^a}} \;\Rightarrow\; R^c = R^a + 1.$$

Thus, for any $\theta$ and all parameters of the disclosure control methods, if the utility is equal, $R^c > R^a$ and the risk is always greater for global recoding than for additive noise. On the basis of this simple model, this is an interesting result and shows that further theory is called for. The Israel CBS has maintained a policy of releasing unperturbed data, and adding random noise or other perturbative methods was rejected (Kamen (2001)). The current method for providing disclosure control is global recoding of the categorical demographic and geographic identifying variables, i.e. year of birth, country of birth, locality code, etc., or the elimination of identifying variables altogether from the file. We will continue with more experiments and theory on random noise and other perturbative methods to see if these are preferable and whether the policies at the Israel CBS should be changed.

The following sections of the paper describe disclosure risk and information loss measures for a real data set using the non-perturbative method of global recoding in practice at the Israel CBS. An empirical R-U confidentiality map can be built to determine optimal parameters for disclosure control depending on the different levels of security needed for the file.

Figure 1: R-U confidentiality map for variables distributed N(0,1) – a theoretical example comparing global recoding and random additive noise (risk plotted against utility in the range 0.93–0.98).

3. Measuring Disclosure Risk

In determining the risks of the microdata file we will use two methods, both based on the probability of obtaining a correct match in the population for a given sample unit. The underlying and realistic disclosure scenario is that a potential data snooper has access to the NPR, which includes demographic and geographic information for all citizens in the State of Israel. We assume also that the data snooper is interested in the uniques of a sample defined under a key made up of variables common to both the NPR and the sample. By focusing on the sample uniques, the data snooper increases his chances of a successful link. The following evaluations are performed on a real data set, the Israel Income Survey 2000 (IS). The sample file contains 32,869 records that were sampled at about 1:126. The disclosure risks of the data set were evaluated in two ways:

1. Estimation of a global risk measure, defined as the expected number of correct matches to the population for the sample uniques, using the sampling design and the individual risk measure methodology (Benedetti et al. (2003), Seri et al. (2003) and Rinott (2003)). The individual risk measure methodology assumes that the population frequency $F_k$ of a key value $k$, given the sample frequency $f_k$, follows a negative binomial distribution with success probability $p_k$ and number of successes $f_k$, i.e. $F_k \mid f_k \sim NB(f_k, p_k)$. The individual risk measure $r_k$ is calculated for every cell $k$ of the key and represents the probability that any sample unit in cell $k$ can be correctly matched to the total population, i.e. $r_k = \frac{1}{F_k}$. The estimate of the risk measure is $\hat r_k = E_p\!\left(\frac{1}{F_k} \mid f_k\right)$ under the negative binomial distribution. The estimate depends on $p_k$, which is estimated by $\hat p_k = \frac{f_k}{\hat F_k} = \frac{f_k}{\sum_{i:\, i \in k} w_i}$. Note that $\hat p_k$ is based on the weights that were assigned to each unit in the sample at the estimation stage of the survey processing. Weights are typically calculated using a calibration method by benchmarking the inflated sample to known population totals, such as geographical areas and age and sex distributions. Thus, by utilizing the weights of the survey, the population frequencies $F_k$ can be estimated. Since $r_k$ is the probability of a correct match to the population for any unit in cell $k$, we can derive a global measure, which is the expected number of correct matches for the entire file, $\tau = \sum_{k=1}^{K} f_k r_k$, estimated by $\hat\tau = \sum_{k=1}^{K} f_k \hat r_k$. In our scenario, we are interested in the sample uniques, so the global measure is $\tau = \sum_{k=1}^{K} I(f_k = 1)\, r_k$ and its estimate is $\hat\tau = \sum_{k=1}^{K} I(f_k = 1)\, \hat r_k$ (a small illustrative sketch of this computation is given after this list).

2. Linking the sample uniques in the dataset to the NPR using keys with common variables and calculating the average number of correct matches for the sample uniques under the above scenario. This method is evaluated by comparing names for those with a one-to-one matching status. The original variables in the key common to both the NPR and the IS were: district (24 categories), type of locality, i.e. urban according to population size and rural according to type (16 categories), locality code (215 categories), religion (2 categories), gender (2 categories), year of birth (85 categories), marital status (5 categories), country of birth including country of birth of father for those born in Israel (130 categories), ethnic group (2 categories) and year of immigration (85 categories). Six other keys were developed for the evaluation, each key being more coarse than the previous one. The variables that were recoded in the building of the keys were: type of locality bottom-coded for up to 50,000 persons in the locality, districts collapsed to regions, elimination of locality code, groupings for year of immigration and year of birth, and country of birth collapsed to continent of birth. The keys are defined as Key1 to Key6, and it should be noted that the current definition of the PUF file for the IS is Key1.
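As an illustration of method 1, the sketch below computes $\hat r_k$ and $\hat\tau$ over the sample uniques. For $f_k = 1$ the negative binomial model stated above reduces to a geometric distribution for $F_k$, for which $E(1/F_k \mid f_k)$ has the closed form $-\hat p_k \ln \hat p_k / (1 - \hat p_k)$; the code, data layout and names are our own and are not the CBS implementation.

```python
import math
from collections import defaultdict

def estimate_global_risk(records, key_vars):
    """records: iterable of dicts holding the key variables and a sampling weight 'w'.
    Returns tau_hat, the estimated expected number of correct matches for sample uniques."""
    cells = defaultdict(lambda: [0, 0.0])          # key value -> [f_k, sum of weights]
    for rec in records:
        k = tuple(rec[v] for v in key_vars)
        cells[k][0] += 1
        cells[k][1] += rec["w"]

    tau_hat = 0.0
    for f_k, w_sum in cells.values():
        if f_k != 1:
            continue                               # only sample uniques contribute here
        p_k = min(f_k / w_sum, 1.0)                # p_hat_k = f_k / F_hat_k
        if p_k >= 1.0:
            r_k = 1.0                              # cell estimated to be population unique
        else:
            # Closed form of E(1/F_k | f_k = 1) under the geometric special case.
            r_k = -p_k * math.log(p_k) / (1.0 - p_k)
        tau_hat += f_k * r_k
    return tau_hat

# Hypothetical usage with a toy sample (invented records and weights):
sample = [{"district": 1, "yob": 1950, "w": 126.0},
          {"district": 1, "yob": 1950, "w": 126.0},
          {"district": 2, "yob": 1980, "w": 126.0}]
print(estimate_global_risk(sample, key_vars=["district", "yob"]))
```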

3.1 Estimating the Global Risk Measure

The following are the results obtained for the estimated global risk measure, i.e. the expected number of correct matches of the sample uniques to the population, and its percentage out of the total sample size:

Table 1: Results of the estimated expected number of correct matches for the IS

                                                  Full Key   Key1     Key2     Key3     Key4     Key5    Key6
Number in key                                     28,658     23,264   17,469   16,694   12,627   6,669   4,726
Number of uniques in sample                       26,121     19,049   12,523   11,07    8,128    3,319   2,083
Estimated expected number of correct matches (τ̂)  1,219.3    884.4    590.2    526.6    388.5    158.0   98.2
Max r_k                                           22.3%      22.3%    22.3%    22.3%    22.3%    15.5%   22.3%
Percentage out of the total sample size           3.7%       2.7%     1.8%     1.6%     1.2%     0.5%    0.3%

Figure 2: Number of records in groups according to individual risk measures for the original key – IS (bar chart: number of records per individual risk group, groups 1–16).

Note that some local suppressions have to be undertaken to reduce the high individual risk measures $\hat r_k$ for cells $k$ of the key. The following bar charts show the differences in the distributions of the individual risk measures according to the full key and the most collapsed key, Key6. Groups of individual risk measures were defined, where the first group has the smallest risk (up to 0.03%) and the last group has the highest risk (over 13.5%).

Figure 3: Number of records in groups according to individual risk measures for Key6 – IS (bar chart: number of records per individual risk group, groups 1–16).

3.2 Linking Sample Uniques to the NPR

The sample uniques of the IS for three different sets of keys were linked to the NPR. The keys used were the full key, Key1, which defines the current PUF data file, and Key4. The keys were slightly modified to accommodate the NPR. The results of the three linkages were as follows:

Table 2: Results of linking sample uniques of the IS to the NPR

                                      Full Key     Key1       Key4
Number of keys in sample              24,028       19,843     9,838
Number of uniques in the sample       20,528       15,411     6,031
Estimated global risk measure (τ̂)     973.1        726.5      291.2
Number of keys in NPR                 1,348,870    467,167    107,663
Number of uniques in the NPR          942,875      213,678    32,004
Percentage of uniques in the NPR      69.9%        45.7%      29.7%
Number of sample uniques matched      14,359       13,762     5,669
Number of sample uniques unmatched    6,169        1,649      362
Average number of correct matches     4,600.3      2,118.5    411.5
Number of one-to-one matches          2,829        848        101

In Table 2, the estimated global risk measure $\hat\tau$ is largely underestimated compared to the average number of correct matches to the NPR according to this scenario, though it does maintain its monotonic property. The comparison of the two measures, however, is problematic for the following reasons:

1. Some of the values of the keys for sample uniques were not linked at all to the NPR, from 6% for the most collapsed Key4 to 30% for the full key. This is most likely due to measurement errors and other factors in both the NPR and the sample data, which increase as the key in use becomes more detailed. It is important to note that the sample was drawn by sampling addresses from municipal tax records and thus is not a direct sub-file of the NPR. In addition, the NPR suffers from serious coverage problems, since about 20% of the population do not reside at the address listed in the NPR. It also includes population groups not covered in the target population of the survey, such as institutionalized persons.

2. For those matching one-to-one, the names on the frame which was used to draw the sample were compared to the names on the NPR in order to get some idea as to the true matching status of the keys. This comparison is also problematic, not only because of the above-mentioned problems with the NPR, but because the names listed on the frame are those of the persons who pay the municipal tax bill and not necessarily the persons living in the dwelling or other family members. Nevertheless, with all these inconsistencies between the data sources, it was found that among those with one-to-one matching status, about 40%-50% were identified as the same person.

In spite of the discrepancies between the data sources, it is clear that the estimated global risk measures are underestimated under this scenario and that the sample weights used to estimate the population frequencies do not capture the variability of the keys in the population. From the point of view of the data snooper, re-identification is highly likely and any attempt to misuse the data for commercial purposes would probably be very successful.

4. Measuring Information Loss

The information loss due to collapsing the categorical variables which make up the different keys can be measured by several methods. Since the variables that define the keys are demographic and geographic identifiers and are used by researchers mostly as explanatory variables in regression models, we will assess the information loss using two methods:

1. The loss in the "between" variance of the main variable of interest, income, calculated for the different groupings of the keys as they get coarser. In other words, the loss of the predictive power of a regression model, as expressed by the R-square, where the dependent variable is income and the independent variables are the demographic and geographic variables that are collapsed.

2. The loss in information as the keys get coarser and the categorical variables are collapsed, as measured by the entropy (Willenborg and de Waal (2001)); a small sketch of this computation follows Table 3.

The following results are obtained:

Table 3: Information loss measures for the IS

Key        Percentage of the "between" variance   Percentage of the entropy
           out of the full key                     out of the full key
Full key   100.000                                 100.000
Key1        96.065                                  96.318
Key2        90.642                                  90.710
Key3        91.199                                  85.992
Key4        84.472                                  83.801
Key5        75.180                                  70.083
Key6        70.240                                  64.349
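As an illustration of the entropy measure, the following sketch compares the entropy of a key variable before and after recoding and reports the percentage retained, in the spirit of Table 3; the toy data and recoding rule are invented and this is not the code used for the IS.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_retained(original, recode):
    """Percentage of the original key's entropy retained after applying `recode`."""
    h_full = entropy(original)
    h_coarse = entropy([recode(v) for v in original])
    return 100.0 * h_coarse / h_full if h_full > 0 else 100.0

# Hypothetical example: collapse single years of birth into five-year bands.
years = [1948, 1952, 1952, 1960, 1961, 1975, 1975, 1980, 1981, 1990]
print(entropy_retained(years, recode=lambda y: (y // 5) * 5))
```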

An empirical R-U confidentiality map for the IS can be built using the disclosure risks from the previous section and the loss-of-information measures in Table 3 to express the data utility:

Figure 4: Empirical R-U confidentiality map for global recoding of key variables in the IS (Income Survey 2000; risk as a function of the global risk measure plotted against utility as a function of entropy).

On the basis of this chart and the setting of thresholds, we can determine which of the files according to the different keys should be PUF, MUC or accessible to the public via remote access to the internet.

5. Remote Accessing of Data via the Internet

As seen from the analysis of the IS, since the current PUF files defined by Key1 had an estimated global risk of about 2.7% of the total sample size, the files placed on the web site of the depository managed by the Hebrew University for remote accessing and downloading were deemed unsafe for internet access. This is true for other microdata files that were examined and are currently on the depository's web site, including the Israel Labour Force Survey and the Family Expenditure Survey. New files are being prepared based on the recoding of demographic and geographic variables. For the interim period, since the depository still has restricted access to researchers in universities and research institutions, the threshold for the estimated global risk measure out of the total sample size was set at about 1%. At this threshold, the information loss based on the entropy with respect to the full key is about 80.3% for the IS. With respect to the system that is currently being developed at the CBS for building custom-made tables defined by the users over the internet, the question remains as to what file should be in the background of the system. A pure and safe PUF file with no risk would allow users to access any table and allow downloading of the file. This would also greatly simplify the amount of software needed to develop the system. However, as shown in the example of the IS, a file with no risk would have an information loss based on the entropy with respect to the full key of about 55.0% and would probably have little value to researchers. More useful tables can be produced by putting a more detailed file in the background of the system, but this would mean some or all of the following restrictions, depending on the initial risks in the file (a minimal sketch of how restrictions 1 and 2 might be enforced is given at the end of this section):

1. A minimum number of units in a cell. This is also necessary to assure the estimate's reliability in the table with respect to sampling errors.
2. A maximum number of dimensions to the table.
3. Prior registering and tracking of users.
4. Sophisticated disclosure control software for calculating the disclosure risk prior to the release of the table.

For example, if we put the file for the IS with Key1 in the background of the internet system, and allowed users to access tables of up to three dimensions, a table defined, for example, by total income according to continent of birth × year of birth × locality code would have an estimated global risk measure of 264.1 expected correct matches to the population for the sample uniques. The amount of protection needed for providing disclosure control of the table would be very high, and after collapsing the variables and redesigning the table it would probably result in the same table that would have been obtained had a safer file been in the background of the system. By putting the file with Key2 in the background of the system, the estimated global risk measure would be about 30 expected correct matches for the most elaborate three-dimensional table. This would simplify some of the restrictions necessary for protecting the confidentiality of sample entities, and still allow as little information loss as possible. Thus a compromise has to be found between selecting a file for the internet system that will allow users remote access to data with high utility for building customized tables, and not having to define complicated restrictions to the system or develop the elaborate software applications that would be necessary to maintain disclosure control.
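Restrictions 1 and 2 above lend themselves to a simple programmatic gate. The sketch below (with invented thresholds) shows the kind of check a table server could apply before releasing a user-defined table.

```python
def table_is_releasable(cell_counts, n_dimensions, min_cell_count=5, max_dimensions=3):
    """cell_counts: unweighted number of sample units in each interior cell of the
    requested table. Returns True only if the table satisfies both release rules;
    empty cells are allowed, but any populated cell must reach the minimum count."""
    if n_dimensions > max_dimensions:
        return False
    return all(count == 0 or count >= min_cell_count for count in cell_counts)

# Hypothetical three-way table with one sparse cell:
print(table_is_releasable([12, 0, 7, 3], n_dimensions=3))   # False: a cell has only 3 units
```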

6. Acknowledgements

I wish to thank Prof. Yosi Rinott for his assistance with the paper, especially for the theoretical framework in the second section.

References

Benedetti, R., Capobianchi, A. and Franconi, L. (2003) "Individual risk of disclosure using sampling design information" (forthcoming).

Domingo-Ferrer, J., Mateo-Sanz, J. and Torra, V. (2001) "Comparing SDC methods for microdata on the basis of information loss and disclosure risk", ETK-NTTS Pre-Proceedings of the Conference, Crete, June 2001.

Duncan, G., Keller-McNulty, S. and Stokes, S. (2001) "Disclosure risk vs. data utility: the R-U confidentiality map", Technical Report LA-UR-01-6428, Statistical Sciences Group, Los Alamos, N.M.: Los Alamos National Laboratory.

Kamen, C. (2001) "Control of statistical disclosure versus needs of data users in Israel: a delicate balance", Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Skopje, March 14-16, www.unece.org/stats/documents/2001/03/confidentiality/29.e.pdf

Polettini, S. and Seri, G. (2003) "Guidelines for the protection of social micro-data using individual risk methodology – Application within mu-argus version 3.2", CASC Project Deliverable No. 1.2-D3, www.neon.vb.cbs.nl/casc

Rinott, Y. (2003) "On models for statistical disclosure risk estimation", Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Luxembourg, April 7-9, www.unece.org/stats/documents/2003/04/confidentiality/wp.16.e.pdf

Willenborg, L. and de Waal, T. (2001) Elements of Statistical Disclosure Control, Lecture Notes in Statistics, 155, Springer Verlag, New York.


LISSY: A System for Providing Restricted Access to Survey Microdata from Remote Sites
John Coder (1) and Marc Cigrang (2)
(1) Sentier Research LLC and the Luxembourg Income Study, Cents, Luxembourg
(2) HAL s.a.r.l., Larochette, Luxembourg

Abstract. The Luxembourg Income Study (LIS) has provided researchers with access to household survey microdata since 1988. The LISSY system permits users at remote sites to access demographic and economic information collected in household surveys. Users access the survey microdata through selected statistical packages. Output is restricted to summaries generated by these packages based on users’ job submissions via email. Results are also returned via email. The LISSY system is now being used to provide data access by the Pay Inequalities and Economic Performance Project (London School of Economics and Eurostat) and the German Socio-Economic Panel at the German Institute for Economic Research. Keywords. microdata, remote access, household survey, income, demographics

1. Introduction

A major goal of the Luxembourg Income Study (LIS) has been to provide researchers with access to information about the social and economic characteristics of households and families for countries throughout the world. Since this information is in the form of data files containing actual observations from household surveys, the researcher does not need to rely on statistical summaries that others have previously created. They can create tabulations and statistical analyses from the survey data that precisely fit their needs and update them easily when new data become available. Further, a system to access these data has been developed which permits researchers to generate statistics from remote locations using email as a mechanism to transmit requests and forward results. Access to these data is, in fact, restricted to this system of remote access, because much of the data available through the project have been provided by the member countries with the restriction that they not be redistributed or otherwise copied. This introduction describes the evolution of the techniques and methods used to maintain the databases of household microdata and to provide remote access services


to users throughout the world. It covers a period that began in 1988 when the first successful remote access system was put in place. While technology has changed dramatically since then, the basic principles that govern use of and access to the data have remained the same. The discussion begins with an overview of the system. Topics covered include:
• evolution of the access system
• elements and computing resources of the system
• survey microdata
• annual workshops

2. Evolution of the LISSY System

More than 15 years have passed since the first LIS system was launched for providing remote access to the databases maintained by the project. The design of the first access system was built around the emergence of email (in those days known as EARN/BITNET). Email provided a fast and easy mechanism for any individual at any location to send a request for services to the project (although the performance of email in its early stages was not always fast and reliable). The statistical tool chosen at that time for generating tabulations, statistics, and other analyses based on the survey data was SPSS (Statistical Package for the Social Sciences), which at that time was available for use on mainframe computers. SPSS was chosen because it was widely used, especially in Europe, and it was provided without cost by the Centre Informatique de l'Etat of Luxembourg, whose computing resources and email connection provided both of these essential resources. Requests for statistics were made by embedding the SPSS code needed to generate the statistics within the body of the email message. The system required that the SPSS code include several additional comment lines (using SPSS comment line syntax) that provided the identification of the requestor, a password, and the databases on which the code should be executed. The email messages were received at the LIS headquarters by manually checking and retrieving the email several times a day. Each email was examined and the SPSS code extracted, again manually. This code was then executed per the user request. The results were then repackaged within the body of an email and returned to the requestor. Staff intervention was required at every step in the earliest versions of the system in order to process each request. It was not unusual to require several days to complete the request and return the results to the user. Since manual intervention was needed, access was not available on weekends and holidays.


By the early part of 1989 it was clear that the manual method for receiving requests and returning results required more human resources than were available to the project. Additional programming resources were made available to attempt development of a more automated method for handling the increasing number of statistical requests. This effort resulted in the implementation of a second mainframe-based processing system using IBM's REXX command language. In this system the survey data were stored in IBM's DB2 relational database system. Each time a request was processed, the variables needed for the request were unloaded from the database to create an ASCII data file. This file was then transformed into an SPSS system file, and the request was executed. Also, for the first time, information on the registered users of the system was stored in a relational database, and checks of user identification and passwords were made automatically. More importantly, the email requests were received in an automated rather than a manual way. Once received, the contents were analysed by the program, the requests were processed, and the results mailed back with very little human intervention. This system was designed to meet our goal of being operational 24 hours a day. This was far from true, however, as scheduled maintenance, backups, and other problems on the mainframe computer frequently interrupted service. We had not gone far into the decade of the 1990s before we began to look for yet other ways of improving service to the growing number of users. Our inability to control the computing resources was a major concern, as we were totally dependent on a large government organization where our project was of little importance. Also of concern was the growing need for more storage and faster turnaround. Efforts to add the data for new countries had paid off, and pressure to provide updates of the original data sets grew. By the end of 1990, the number of country data sets had increased to 16. The number of requests for statistics was growing as well, as more and more users learned about the access through the workshops and the number of persons using email grew. In addition, a significant number of requests for data access using the SAS statistical system were being made by the expanding user community, a service that was not available from the provider of computing resources. The onset of these events was accompanied by the introduction of faster and faster personal computers and the expansion of both the SPSS and SAS packages to the personal computing environment. For the first time, we were presented with a viable alternative to a mainframe system that was becoming increasingly difficult to use.


A decision was made toward the end of 1990 to migrate the entire access service and data to an environment based on personal computers. The IBM OS/2 operating system was chosen as it was the first true multitasking system available running on personal computers. Both SPSS and SAS were available under this operating system, so this choice permitted the addition of SAS as an option for our users. The migration to the personal computing environment also required that a network be established in order to provide enough resources to handle the workload and to provide access using both SPSS and SAS. Access to email through EARN/BITNET continued to be a problem. A dedicated telephone line was established to provide communication to the Centre Informatique de l'Etat, which remained our only connection to the EARN/BITNET. This connection was made via a dialup telephone modem. The modem connection, as such connections were at that time, was difficult to maintain, and frequent problems resulted in continued untimely disruptions of service. There were very few significant changes to the system for processing user requests during the period between 1991 and 1995. The personal computing environment was stable and working well. It was easy and relatively inexpensive to add storage as the number of data sets expanded during this period. The speed of access improved as the stability of the connection on the dialup telephone line improved. The end of this quiet period in terms of system operations came in February 1996, when the offices of the LIS project were moved to a new location. A re-evaluation of the system at that time resulted in a decision to migrate from the OS/2 operating system to Windows NT. While the OS/2 system performed very well, SPSS and SAS development on that platform was halted due to the increased popularity of Windows NT. At this same moment, a direct email connection was established, which removed the need for the telephone connection to the Centre Informatique. For the first time since the project began, it was no longer dependent in any way on the government of Luxembourg's computing resources. A number of revisions to the system have been implemented since the major overhaul undertaken in 1996 in the move to Windows NT. First, the computing infrastructure has been redesigned to permit additional computers to be added to the network to process user requests. In this design, a new powerful computer can be set up to accept user requests within one hour. More details on this design are covered later. Second, a third statistical package, STATA, was installed to provide users with another option for data analysis. Third, a relational database table was added to provide details on each and every job that was processed. This database provides information concerning the workload overall, by user, by statistical package, etc. Fourth, a website was established for the project which provides extensive technical documentation regarding the data and access requirements.

3. Elements and Computing Resources of the Access System (LISSY)

The LISSY operating system consists of a series of PC computers that work together on a network to receive, process, and return statistical requests. The system currently employs a total of seven personal computers linked across a Windows NT network. These computers communicate with each other and use shared system resources (system disks, etc.) to provide an automated processing system. The computers are listed by function in Figure 1 below.

Figure 1. System Resources and Functions

Computer            Functions Provided
Mail Server         Receives user requests, submitted as email to a specified email address
System Job Control  Retrieves email requests, prepares requests for processing, sends requests to batch processors, returns statistical results, maintains critical databases, houses critical system parameters, houses the batch processor executable, houses computational routines
Data Server         Houses all microdata files as system files applicable to each statistical package
Batch Processors    Process statistical requests and return output to the System Post Office
Web Server          Provides technical documentation

3.1 Mail Server

The mail server is the connection to the internet. It receives all email addressed to the mail account specified as the repository for requests to access the system. The current email address used is [email protected]. Users wishing to access the system must send their emails, containing the program code needed to access the data, to this address. As is normal, mail arriving on the mail server waits for the user, in this case the "system job control" computer, to open and read it.

3.2 Managing User Requests for Services

The heart of the access system is the system job control computer. It runs a Windows-based program written in the JAVA programming language. It is, if you will, the "traffic cop" which manages the entire LIS access mechanism. Functions include retrieval of the email requests, application of security checks, preparation of the requests for processing, distribution of access requests to the batch processor computers, returning statistical results to the proper user email addresses, maintaining critical databases, and housing critical system parameters. Once this program is started, it repeats a series of tasks at five-second intervals. Start-up of the program begins with the establishment of password-protected linkages to the user and job databases. User identification codes, passwords, and email addresses are read from the database and stored in memory to speed processing. Start-up also includes reading of an initialisation file that contains information concerning the location of critical resources and files on the network.

3.21 Receipt and Security Check-in of User Requests

Step one in the sequence of the system control computer's routine is a query of the mail server to determine if any requests have been received. Once received, the request is scanned for the mandatory information located at the beginning of the request. The mandatory information includes the requestor's user identification code, made up of a series of alphanumeric characters chosen by the user or the LIS staff (usually the initials of the name or part of the name), the requestor's password (up to 20 characters), and the statistical package that is being used. The statistical package can be one of three: SAS, SPSS, or STATA. The user identification and password are checked against the list of registered users. If the identification and password pair matches a registered user, the request is processed further. If a user corresponding to the identification/password cannot be found, an email is prepared noting that an error has occurred in the request. Since an error in the required syntax prevents retrieval of the appropriate email address from the LIS user database (information provided by that user at registration), the email address contained in the header of the mail message is used to return notification of the error to the sender.
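A minimal sketch of this check-in step is given below. The header format, field names, data structures and the example job text are our own invention used to illustrate the idea; they are not the actual LISSY syntax.

```python
import re

# Registered users as loaded at start-up: user id -> (password, return email, active?)
USERS = {
    "jdoe": ("s3cret", "jdoe@example.org", True),
}

HEADER_RE = re.compile(
    r"^\*\s*user\s*=\s*(\w+)\s*,\s*password\s*=\s*(\S+)\s*,\s*package\s*=\s*(sas|spss|stata)\s*$",
    re.IGNORECASE | re.MULTILINE)

def check_in(mail_body):
    """Return (user_id, package, program_code) for a valid request,
    or raise ValueError with a message to be mailed back to the sender."""
    m = HEADER_RE.search(mail_body)
    if not m:
        raise ValueError("Mandatory job header not found at the top of the message.")
    user, password, package = m.group(1), m.group(2), m.group(3).upper()
    record = USERS.get(user)
    if record is None or record[0] != password:
        raise ValueError("Unknown user identification / password pair.")
    if not record[2]:
        raise ValueError("Access for this user has been disabled.")
    program_code = mail_body[m.end():].lstrip("\n")
    return user, package, program_code

body = """* user = jdoe, password = s3cret, package = spss
GET FILE = 'us94'.
FREQUENCIES VARIABLES = dpi.
"""
print(check_in(body))
```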

After establishing that the request has been received from a registered user, a check is made to determine if this user's access to the database is still active. This check reflects the option available to the LIS staff to disable or prevent access to the database for a specific registered user. To deny access, the LIS staff must enter the user database and change the access code. Denial of access would only occur if it had been determined that that particular user had failed to comply with the LIS rules of operation. The database also contains a mechanism for assigning priority to requests once they have been checked in and are waiting for processing by the batch processing computers. The priority code for a user is stored in the database and can be changed from the default to something higher. If a higher priority code has been set, those requests with the highest priority will be processed first.

Following the access check, the syntax of the request is examined to determine what types of statistical procedures are being used. Any requests that appear to be printing or copying individual records are routed to a review area where they can be examined by the LIS staff. Upon examination, the LIS staff will then permit the request to continue or contact the sender and discuss the problem. The system control program provides the graphical interfaces needed to perform these checking operations in an efficient, "point and click" manner. The system for checking the syntax of user requests permits the establishment of complex checks on the existence and sequence of key words, phrases, etc. Access and use of specific variables or combinations of variables can be controlled, as well as the use of certain functions within the statistical packages. After all security checks have been completed, the request is assigned a unique job identification number and a copy of the request is saved to the archive, so that a complete list of all jobs ever submitted is maintained. This archive not only provides a backup in case a request has been lost but also provides evidence of misuse of the database should such activity be uncovered. This archive of job requests is reviewed periodically, using software independent of the access system security, to assure that no job requests contain potential breaches of the current syntax checks. This software also permits a flexible mechanism for applying ad hoc checks of job submissions in order to test new security checks or to examine the frequency of current checks.
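The syntax screen can be thought of as a small rule engine applied to the submitted program text. The sketch below shows the flavour of such a check; the patterns and the example job strings are made up for illustration, and the real LISSY rules are more elaborate.

```python
import re

# Patterns that suggest an attempt to print or copy individual records (illustrative only).
SUSPICIOUS = [
    re.compile(r"\bproc\s+print\b", re.IGNORECASE),      # SAS listing procedure
    re.compile(r"\blist\b", re.IGNORECASE),              # SPSS / STATA listing commands
    re.compile(r"\bouttable\b|\boutfile\b", re.IGNORECASE),
]

def screen_request(program_code):
    """Return 'ok' if no rule fires, otherwise 'review' so staff can inspect the job."""
    for rule in SUSPICIOUS:
        if rule.search(program_code):
            return "review"
    return "ok"

print(screen_request("proc means data=lis.us94; var dpi; run;"))   # ok
print(screen_request("proc print data=lis.us94 (obs=100); run;"))  # review
```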

3.22 Batch Processing of Statistical Requests

The initialisation file is a text file that can be modified as necessary by the staff. By using an initialisation file, the operation of the batch machine can easily be modified without changing the program itself. Because the program is driven by the initialisation file, it is not necessary to place the executables for the statistical packages in the same location on all batch machines. Computers with very different configurations regarding their hard drives, drive letters, etc. can be configured as a batch processor very easily. It is also easy to limit the types of requests a machine can process by changing the parameters that control execution of each statistical package. If a computer does not have SAS installed or there is some problem with the program, for example, the parameters can be changed to exclude processing of SAS requests. After initialisation the batch program begins to look for user requests. At 5-second intervals, the batch program looks into the directory where requests are placed by the system control computer. The highest-priority request is selected from those requests that are permitted for execution on that batch processor (based on statistical package parameter values). The file containing the selected request is removed from the queue of requests waiting for processing and placed in the local batch machine execution directory. Once in the execution directory, the execution of the request can take place. Upon completion of the execution, the file(s) containing the statistical summaries and the log file containing details of the execution are combined and sent to a directory on the network where they can be captured by the system control mechanism.

3.23 Returning Statistical Output to Users

Requests are processed by the batch processing computers and the results (text files) are packaged and returned to a common directory accessible to the system control computer. The system control queries that directory at specified intervals to see if any result files are waiting to be returned to users. If it finds a result file waiting to be sent back to a user, it initiates an examination of the file size and contents. If the size of the file in bytes is larger than the threshold specified in the initialisation parameters, the file is automatically moved to the directory reserved for output that must be examined by the LIS staff. In addition, the contents of the output are examined in order to reveal any operations, listings, etc. that indicate the user is attempting to copy records or subsets of variables from individual records from the microdata file. If any occurrences indicating possible file copying are found, the file is moved directly to the review area where the contents are examined by the staff. Output that is judged to be acceptable under the LIS rules is returned to the sender automatically following the check. The results are packaged up as the body of an email message and sent back using the email address stored in the user database for that user. Note that the source email address (the email address from which the request was made) is not used to return results. If a registered user is working at a different email address than the one they provided when they registered, they must contact the LIS staff and have the new email address entered into the database. This mechanism provides added security and helps discourage a registered user from giving their user identification and password to a non-registered user, since the results can only be returned to the registered email address. Return of the job is accompanied by the addition of a record to the job database. This database includes the following information for each job:
• user identification of the user who submitted the job
• number of bytes (characters) of output
• execution time in seconds required to complete the statistical operations
• statistical package used
• date and time the request was received
• date and time the request was sent back
• job number
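The size and content checks on finished output described in section 3.23 can be summarised in a few lines. The threshold and marker strings below are placeholders chosen for illustration, not the production values.

```python
def route_output(output_text, max_bytes=200_000, banned_markers=("Obs ", "RECORD DUMP")):
    """Decide whether a finished job's output is mailed back automatically
    or parked in the staff review area."""
    if len(output_text.encode("utf-8")) > max_bytes:
        return "review"                      # unusually large output: possible record listing
    if any(marker in output_text for marker in banned_markers):
        return "review"                      # output looks like individual records
    return "send"

print(route_output("Table of means\nIncome  12,345\n"))   # send
```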

3.3 Managing User Information
The overall database system employed in the LIS system is Oracle. Oracle was chosen because it is an extremely reliable system capable of handling very large and diverse data tables and because it can be fully integrated into C/Java programs, a feature that was absolutely required for this system. As noted earlier, access to the database of survey microdata is restricted to persons registering with the LIS project. Each user must submit a registration form that provides vital information about themselves and how they plan to use the data. In addition, the user must pledge to follow the LIS rules regarding data use. This includes a promise not to attempt to identify individuals based on the survey data contained in the database. Data from this registration are entered into the user database. Items included in the database are:
• Title
• Name
• Address
• Institute or organization
• Telephone number
• Country of residence (code)
• User identification used in submission of requests
• Password used in submission of requests
• Email address to which output is returned
• Priority (default is normal; can be set higher by the LIS staff)
• User number (sequential number used as primary key)
• User status (enable or disable access)
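As a minimal sketch, the user record above could be written as the following data structure; the field names are paraphrased from the list, and the real schema lives in the Oracle database and is maintained through the custom LIS application:

```python
# Minimal sketch of the user record described above (not the actual LIS schema).
from dataclasses import dataclass

@dataclass
class LisUser:
    user_number: int          # sequential number used as primary key
    user_id: str              # identification used in job submissions
    password: str             # stored by the database system, shown here only for illustration
    title: str
    name: str
    address: str
    institute: str
    telephone: str
    country_code: str
    email: str                # the only address output is ever returned to
    priority: str = "normal"  # LIS staff can raise this
    enabled: bool = True      # disabling blocks access without deleting the record
```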

A custom application designed specifically for LIS permits data to be entered for new users as well as modifications to existing information. This database interface is another Windows program that maintains programmatic linkages to the database. It is necessary to “log on” to the database through this program in order to add a user or make modifications. Passwords are not changed unless the user makes a request to the staff, and at no time do users have any access to make changes themselves.
3.4 The Data Server
The data server computer is simply a computer that acts as the repository for all of the datasets available for access. Separate directories are maintained for each of the three file formats required for the three statistical packages now available. In addition, the SAS data directory includes one SAS format file for each dataset; this format file is copied to a directory on the batch processor computer at execution time. The centralization of datasets on the server ensures that all requests access the same data. While it would be possible to maintain separate copies of the data on each batch processor, managing that process would be difficult, as it would likely be a manual operation that would be tedious to monitor. To date we have not experienced any serious network bottlenecks or slowdowns using this centralized approach. The data server manages a very large number of data files. The directories containing these data files are write-protected so that a user cannot accidentally change them as part of their job submission. Without this kind of protection it is very likely that some unintended error in the program code would result in destruction or modification of the data.
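A minimal sketch, assuming a POSIX file system and illustrative directory names, of the kind of write protection described here; the actual LIS setup may enforce this differently, for example through operating-system accounts:

```python
# Minimal sketch: strip write permission from the dataset files so that a
# submitted job cannot modify or delete them. Paths are illustrative.
import os
import stat

DATA_ROOT = "/data/lis"                    # hypothetical location of the data server share
PACKAGE_DIRS = ["sas", "spss", "stata"]    # one directory per statistical package format

def write_protect_datasets(root: str = DATA_ROOT) -> None:
    for pkg in PACKAGE_DIRS:
        pkg_dir = os.path.join(root, pkg)
        for name in os.listdir(pkg_dir):
            path = os.path.join(pkg_dir, name)
            mode = os.stat(path).st_mode
            # clear the write bits for owner, group and others
            os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```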

4. Survey Microdata
The heart of the LIS project is the large amount of survey data that has been made available to the research community. These data have been donated by participating countries and are based on household surveys conducted independently by either government or research institutes. As the surveys are country- and institution-specific, the original data collected in each of these surveys must be evaluated and, if necessary, standardized to improve their comparability with the data from other countries. This standardization is carried out by the LIS staff prior to placing the data into the database that is accessible by users. In this process, the original data are transformed to fit a set of fixed variables that is consistent over all of the data sets. It is important to note that none of the data provided by the member countries contain any unique identifiers such as names, addresses, or IDs issued by the government or other institutions. It is also important to note that, in some countries, the data sets made available to the LIS project have been modified or restricted to reduce the chances of disclosure. Typically, these restrictions include removal of geographic detail, top-coding of income amounts, top-coding of age, and limiting detail regarding occupation and industry.
When the first data sets became available in the late 1980s, all datasets were at the household (family) level. As the second round of survey data became available, a decision was made to expand the database to include individual-level data for the members of the households. In this revised scheme separate datasets were added for adult members (age 15 and over) and child members (under age 15). For the most part, then, there are three datasets for each country observation for each year: one for households, one for adults, and one for children (the separation of adult and child datasets was not possible in a small number of surveys). These datasets can be linked together in user requests by referencing a unique household identification number present on each. Since users have a choice of accessing the data using SPSS, SAS, and STATA, each of the datasets must be made available in each of the formats required by these statistical packages. This makes a total of nine files for each survey.
As of 2003 there were a total of approximately 130 country datasets covering surveys from 29 different countries. The coverage of the data sets varies by country; for some countries datasets span the range from the mid-1970s to 2000. The plan for the future includes updating data at 3-5 year intervals and expanding the number of participating countries when feasible. In addition, if resources permit we will extend coverage to include earlier years. Countries currently providing data are: Australia, Austria, Belgium, Canada, the Czech Republic, Denmark, Estonia, Finland, France, Germany, Hungary, Ireland, Israel, Italy, Luxembourg, Mexico, the Netherlands, Norway, Poland, R.O.C. Taiwan, Romania, Russia, Slovakia, Slovenia, Spain, Sweden, Switzerland, the United Kingdom and the United States.
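As an illustration of the household/person linkage described above, the following is a minimal sketch using Python and pandas; the file and column names (e.g. hhid) are hypothetical, and in practice LIS users express this merge inside their SAS, SPSS or STATA job:

```python
# Minimal sketch of linking person records to household records via the
# unique household identification number present on each dataset.
import pandas as pd

households = pd.read_csv("hh_file.csv")    # one record per household
adults = pd.read_csv("adult_file.csv")     # one record per household member aged 15+

# Each dataset carries the same household identifier, so person records can be
# enriched with household-level variables by a straightforward merge.
linked = adults.merge(households, on="hhid", how="inner", suffixes=("_p", "_h"))
```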

5. Annual Workshops
An important aspect of the overall project is the annual workshop. These workshops serve three important functions. First, they instruct researchers on how to access the LIS system and on the content of the LIS data holdings. Second, they inform researchers about the research techniques and principles that are most applicable to the data and provide a picture of previous analyses based on these data. Finally, the workshops help build a growing family of loyal LIS users who understand the project, who know the staff personally, and who respect the rules and traditions that govern use of the data. Since the beginning of the project more than 400 researchers have attended workshops, both in Luxembourg and on-site in other member countries.


Data Mining Methods for Linking Data Coming from Several Sources
Vicenç Torra (1), Josep Domingo-Ferrer (2) and Àngel Torres (2)

(1) Institut d'Investigació en Intel·ligència Artificial - CSIC, Campus UAB s/n, E-08193 Bellaterra, Catalonia; e-mail [email protected]
(2) Dept. Computer Engineering and Maths (ETSE), Universitat Rovira i Virgili, Av. Països Catalans 26, E-43007 Tarragona, Catalonia; e-mail {jdomingo,atorres}@etse.urv.es

Abstract. Statistical offices are faced with the problem of multiple-database data mining for at least two reasons. On one side, there is a trend to avoid direct collection of data from respondents and to use administrative data sources instead to build statistical data; such administrative sources are typically diverse and scattered across several administration levels. On the other side, intruders may attempt disclosure of confidential statistical data by using the same approach, i.e. by linking whatever databases they can obtain. This paper discusses issues related to multiple-database data mining, with a special focus on a method for linking records across databases which do not share any variables.
Keywords. Statistical disclosure control, Re-identification, Data mining, Artificial intelligence.

1 Introduction

Statistical offices are faced with the problem of multiple-database data mining for at least two reasons:
– On the good side, there is a trend to avoid direct collection of data from respondents and to use administrative data sources instead to build statistical data; such administrative sources are typically diverse and scattered across several administration levels. Linking administrative information held by municipalities with information held at higher administration levels can yield information that is more accurate and cheaper than information that would be collected directly from respondents.


– On the bad side, statistical offices must realize that intruders may attempt disclosure of confidential statistical data by using exactly the same approach, i.e. by linking whatever databases they can obtain. This is the relevant side for statistical disclosure control (SDC).
This paper discusses issues related to multiple-database data mining, with a special focus on a method for linking records across databases which do not share any variables. Section 2 is about general concepts of data mining and knowledge discovery in databases. Section 3 discusses the use of data mining in SDC, that is, how data mining can increase disclosure risk.

2 Data mining and knowledge discovery in databases

Several definitions are currently being used for both data mining and knowledge discovery in databases. While in some situations they are used as equivalent terms, data mining is often considered as one of the steps in the knowledge discovery process. Here, following Fayyad et al. (1996b), we use the latter approach, which is better suited for describing the relationship between this field and statistical disclosure control. According to Fayyad et al. (1996a) and Frawley et al. (1991), knowledge discovery in databases (KDD) is defined as follows: knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. This process encompasses several steps. According to Kantardzic (2003), these are:
(i) Problem statement and hypothesis formulation
(ii) Data collection
(iii) Data pre-processing
(iv) Model estimation
(v) Model interpretation
We will focus on steps (iii) and (iv). Data pre-processing includes all those mechanisms used to improve data quality. Model estimation is also known as data mining, i.e. the application of computational methods for building models from data.

The next two sections describe data pre-processing and model estimation in greater detail. The last section reviews current trends in data mining and highlights their relation with Statistical Disclosure Control.
2.1 Data pre-processing
In realistic databases, data are not free from errors or inaccuracies, which can be due to accidental or intentional distortion. The pre-processing step is used to improve the quality of the data. Several methods can be used to help in this task; some of them are recalled next (see Kantardzic (2003)):
Simple transformations: Transformations that do not need major analysis of data and can be applied considering a single value at a time. These transformations include outlier detection and removal, scaling and data re-coding.
Cleansing and scrubbing: Transformations of moderate complexity, e.g. involving name and address formatting.
Integration: This is applied to process data coming from various sources. Integration techniques are especially relevant for data mining in heterogeneous databases. Relevant tools include re-identification methods (in particular, record linkage algorithms) and tools for identifying attribute correspondences.
Aggregation and summarization: These transformations aim at reducing the number of records or variables in the database.
2.2 Model estimation
The main step of the knowledge discovery process is the actual construction of the model from the data. This is the data mining step, defined in Fayyad et al. (1996a), page 9, as follows: the data mining component of the KDD process is mainly concerned with means by which patterns are extracted and enumerated from the data. The patterns extracted by the data mining algorithm are supposed to constitute knowledge. In this setting, knowledge is understood as an interesting and certain enough pattern (Dzeroski, 2001). Of course, the terms interesting and certain enough are user- and application-dependent. There is a large collection of data mining methods and tools available in the literature. A common classification, inherited from the machine learning field, is to divide data mining methods into two groups, one corresponding to supervised learning methods and the other to unsupervised learning methods. This classification is detailed below; data are considered as a flat table comprising variables and records.


Supervised learning: For one of the variables (the modeled variable), a functional model is built that relates this variable to the rest of the variables. Depending on the type of variable being modeled, two classes of methods are usually considered:
Classification: The variable for which the model is built is categorical. This categorical variable is called the class. Descriptions are built to infer the class of a record given the values of the other variables.
Regression: This is similar to the classification problem, but the variable being modeled is continuous.
Unsupervised learning: Some knowledge about the variables in the database is extracted from the data. Unlike supervised learning, there is no distinguished variable being modeled; instead, relationships between variables are of interest. Common unsupervised learning methods include clustering methods and association rules; the literature often includes in this group other (statistical) tools like principal components and dimensionality reduction methods.
Clustering: Clusters (groups) of similar objects are detected. Conceptual clustering is a subgroup of methods whose goal is to derive a symbolic representation from clusters.
Association rules: These specify tuples of values that appear very often in a database. They are commonly used in databases related to commercial transactions, e.g. to link purchases of product A with purchases of product B.
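A minimal illustration of the two families of methods on a flat table of records and variables, using synthetic data; the libraries and parameters are illustrative choices, not tools discussed by the authors:

```python
# Minimal sketch: a classifier for supervised learning and k-means clustering
# for unsupervised learning, applied to the same synthetic flat table.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                 # 200 records described by 4 variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # a categorical "class" variable

# Supervised: build a model of the class from the other variables.
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Unsupervised: no modeled variable; groups of similar records are detected.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```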

2.3 New trends in data mining
While data mining was in the past focused on the case of single flat files, currently there is a need to consider more complex data structures. In fact, two main situations arise, corresponding to two subfields:
Relational data mining: In this situation, there is a single (relational) database consisting of multiple tables. Relational data mining looks for patterns that involve multiple relations in a relational database. It does so directly, without first transforming the data into a single table and then looking for patterns in such an engineered table. The relations in the database can be defined extensionally, as lists of tuples, or intensionally, as database views or sets of rules. The latter allows relational data mining to take into account generally valid domain knowledge, referred to as background knowledge (Dzeroski, 2001).

Multi-database data mining: In this case, there are different databases whose records must be linked before applying data mining techniques (Zhong, 2003). In this type of data mining the pre-processing step is especially critical, as the data must reach a quality level that allows records across databases to be linked.

3 Data mining in SDC

Several aspects of data mining are of interest in SDC. For example, most supervised learning methods and some unsupervised methods (e.g., association rules) can be used to attack SDC because they allow relationships to be established between variables, which can lead to disclosure. We will concentrate here on record linkage across databases. A current assumption in SDC is that record linkage can use variables shared across the databases. This assumption will be relaxed here and no set of shared variables will be assumed. What is needed in our approach is just a set of shared individuals or entities across the files; without such a set of shared individuals, record linkage does not make sense.
Example 1. A typical scenario where our relaxation is especially relevant is when data files with similar information (e.g. financial variables) are available for different time periods (e.g. two consecutive years) that relate to nearly the same individuals (e.g. the companies of a certain region). In this case, even though the variables are not the same ("2000 turnover" is not the same as "2001 turnover"), re-identification via record linkage is possible.
Record linkage without shared variables is a subject of interest for both statistical disclosure control and data mining, because it highlights relationships between individuals that would otherwise remain implicit and undiscovered in the files to be linked. Our approach to re-identification via record linkage without shared variables is rooted in the techniques of knowledge elicitation from groups of experts described in Torra and Cortés (1995) and Gaines and Shaw (1993). In these two references, a common conceptual structure is built from the information/opinions supplied by the group of experts, which should synthesize the information/opinions obtained from the individual experts. In re-identification without shared variables, we assume that this common structure exists, so that it makes sense to look for links between individuals in the different files. Note that, both in re-identification without shared variables and in knowledge elicitation from groups of experts, the initial information is similar: it consists of sets of records corresponding to roughly the same objects and evaluated according to a set of different variables in each file


(in knowledge elicitation, the opinions of expert A on an object are different variables from the opinions of expert B on the same object). Using the jargon in Gaines and Shaw (1993) for knowledge elicitation from groups, four cases can be distinguished depending on the coincidence or non-coincidence of variables and terminology (terminology is the domain of the variables, i.e. the terms used to evaluate the individuals):
Consensus: same variables and same terminology.
Correspondence: same variables but different terminology.
Contrast: different variables and different terminology.
Conflict: different variables and same terminology.
Classical record linkage falls into the case of consensus or correspondence, although in the latter case only small terminology differences are allowed (small inconsistencies among names, missing values and the like). However, based on the above classification, other types of record linkage are conceivable: correspondence where the non-coincidence of terminology is not limited to small variations of names (e.g. completely different terms due, for example, to the use of different granularities), contrast and conflict. We study in this paper the case of contrast, that is, record linkage when neither variables nor terminology are the same across the files to be linked. The only assumption of our approach is that a common structure underlies the files to be linked. In the context of Example 1, this assumption means that companies which are deemed similar according to some financial variables for the first year will also stay similar for the corresponding second-year financial variables.
3.1 Re-identification without shared variables
As explained above, re-identification without common variables requires some assumptions, which are summarized next:
Hypothesis 1. A large set of common individuals is shared by both files.
Hypothesis 2. Data in both files contain, implicitly, similar structural information. In other words, even though there are no common variables, there is substantial correlation between some variables in both files.
Structural information of data files stands in our case for any organization of the data that allows explicit representation of the relationships between individuals.

This structural information is obtained from the data files through manipulation of the data (e.g. using clustering techniques or any other data analysis or data mining technique). Comparison of the structural information implicit in both files is what allows two records that correspond to the same individual to be linked by the system.
Hypothesis 3. Structural information can be expressed by means of partitions.
In our approach, structural information is represented by means of partitions. Partitions obtained from the data through clustering techniques make explicit the relation between individuals according to the variables that describe them. Common partitions in both files reflect the common structural information. We prefer partitions over other (more sophisticated) structures also obtainable with clustering methods, like dendrograms, because the former are more robust to changes in the data, as shown in Neumann and Norton (1986). Although the main interest of our research is re-identification of individuals, the approach described below is not directly targeted at the re-identification of particular individuals. Instead, we try to re-identify groups of them. For this reason, we use the term group-level re-identification; record-level re-identification is a particular case of group-level re-identification where one or more groups contain a single record. See Domingo-Ferrer and Torra (2003) for further details.
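The following is a minimal sketch of group-level linkage under Hypotheses 1 to 3, not the authors' algorithm: each file is clustered on its own variables, and the two partitions are then aligned using a structural feature that requires no shared variables (here, crudely, cluster size). The library choices, the parameters and the size-based matching cost are all illustrative assumptions.

```python
# Minimal sketch of group-level re-identification via partition matching.
# file_a and file_b describe (roughly) the same individuals with disjoint
# sets of numeric variables; no record correspondence is assumed.
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def group_level_link(file_a: np.ndarray, file_b: np.ndarray, k: int = 5):
    """Return a mapping from clusters of file_a to clusters of file_b."""
    labels_a = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(file_a)
    labels_b = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(file_b)

    sizes_a = np.bincount(labels_a, minlength=k)
    sizes_b = np.bincount(labels_b, minlength=k)

    # Cost of pairing cluster i of A with cluster j of B: difference in size.
    # The Hungarian algorithm finds the pairing with minimal total cost.
    cost = np.abs(sizes_a[:, None] - sizes_b[None, :])
    rows, cols = linear_sum_assignment(cost)

    # Every record in cluster i of A is now linked, at group level, to the
    # records in cluster cols[i] of B.
    return dict(zip(rows.tolist(), cols.tolist())), labels_a, labels_b
```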

4 Conclusions

Data mining across different data sources has been discussed in this paper. Specifically, a method for linking records across databases which do not share any variable has been sketched. While such new data mining approaches can be cost-effective and useful to build statistical data from several administrative sources, they are rather a threat from the viewpoint of SDC. For that very reason, data protectors should use those methods if they wish to obtain realistic estimates of disclosure risk.
Acknowledgments
This work is partially supported by the European Commission through “CASC” (IST-2000-25069) and by the Spanish MCyT and the FEDER fund through project “STREAMOBILE” (TIC-2001-0633-C03-01/02).
References
Dzeroski, S. (2001), Data Mining in a Nutshell, in S. Dzeroski, N. Lavrac (eds.), Relational Data Mining, Springer, 4-27.
Domingo-Ferrer, J., Torra, V. (2003), Disclosure risk assessment in statistical disclosure control via advanced record linkage, Statistics and Computing, to appear.
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. (1996a), From Data Mining to Knowledge Discovery: An Overview, in U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, MIT Press, 1-34.
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (1996b), Advances in Knowledge Discovery and Data Mining, MIT Press.
Frawley, W. J., Piatetsky-Shapiro, G., Matheus, C. J. (1991), Knowledge Discovery in Databases: An Overview, in G. Piatetsky-Shapiro, W. J. Frawley (eds.), Knowledge Discovery in Databases, Cambridge: AAAI/MIT Press, 1-27.
Gaines, B. R., Shaw, M. L. G. (1993), Knowledge acquisition tools based on personal construct psychology, The Knowledge Engineering Review, 8, 49-85.
Kantardzic, M. (2003), Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-Interscience.
Neumann, D. A., Norton, V. T. Jr. (1986), Clustering and isolation in the consensus problem for partitions, Journal of Classification, 3, 281-297.
Torra, V., Cortés, U. (1995), Towards an automatic consensus generator tool: EGAC, IEEE Transactions on Systems, Man and Cybernetics, 25, 888-894.
Zhong, N. (2003), Mining Interesting Patterns in Multiple Data Sources, in V. Torra (ed.), Information Fusion in Data Mining, Springer, 61-77.


Providing Remote Access to Data: The Academic Perspective
Tanvi Desai
Research Laboratory, London School of Economics, Houghton Street, London WC2A 2AE

1. Introduction: Trends in Data Provision

As technological advances in hardware, software, data documentation and the Web make access to, and analysis of, micro data more and more practical and desirable, researchers and students now expect a wide range of micro data to support their research. In particular, the growing interest in cross-national comparative research, with its increasing demand for international data, means that bodies such as Eurostat have a vital role to play in the future of data provision.
Recently we have seen trends in data provision coming full circle. Ten years ago, the only practical way to access data was via a mainframe, often via remote access to the data providers. This was necessary as very few data users had the processing power to deal with large datasets on their PCs. With advances in desktop computing, researchers became able to manipulate and analyse large micro data sets locally, and the trend was to supply CD-ROMs containing the data. This is the most practical format for academics as it gives them the most control over data manipulation. However, with the advent of the Web we now have a move towards on-line data provision and, following from this, more centralised control of data and a renewed interest in remote access. Data providers are becoming reluctant to allow researchers to hold copies of micro datasets. This is a major concern from an academic point of view. There are now many more easily accessible data sources available through the Web, but the value of these data is far lower, because providing access on-line necessarily limits what users can do with the data in terms of analyses and exploration. This is a major reason why it is vital, from an academic point of view, that data providers consider a remote access strategy rather than restricting their innovations to the Web.


2. Choosing a remote access system

2.1. The ideal for every remote access system should be to provide an environment that allows a user to feel as much as possible as if they are working on their own PC. The three major factors that contribute to this are:
- Speed: results of analyses must be returned as quickly as possible
- Familiarity: it should not be necessary for researchers to learn new software and new programming techniques to access data
- Flexibility: restrictions on data manipulation must be kept to a minimum

Many would also include access to metadata in this list. However, while access to high quality metadata is vital to good data use, researchers are still accustomed to having to spend time examining documentation, so metadata provision should not be the major focus of a remote access system. In fact, it is often more practical to present metadata through a website.
Speed: The speed with which output is returned is obviously very important when trying to recreate the feeling of working locally. Delays are a particular disadvantage when introducing new users. If the relatively straightforward exploratory programs that are necessary to get to know the data take a long time to run, researchers are likely to be discouraged from using the system.
Familiarity: If researchers have to invest a lot of time learning new software, they are less likely to use the data. With familiar software they can start working as soon as they gain access, thus strengthening the impression that the data is local. Using established software also reduces costs, as it means that there is no need to develop new software.
Flexibility: Of the three factors mentioned above, flexibility is the most important. The balance between securing the data and allowing meaningful analyses is delicate. If the security restrictions mean that the data cannot be examined properly, or the analyses allowed are too few to provide meaningful results, then the data become useless. One of the primary reasons for restricting analyses when providing access to data is to prevent the identification of individual cases, either by repeated refining of a selection, or by reporting cells with too few observations. From an academic point of view these restrictions are almost always unnecessary. Academics have neither the inclination nor the time to identify individuals. Cells with a low number of observations are not statistically significant, and to report them would only invite the censure of colleagues for poor research. In addition, since all users will have signed a legal document in the form of a license agreement stating that they will not perform any of these restricted actions, they are unlikely to endanger their data access, and thus their research project, by breaking this agreement for no academic return.
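As an illustration of the kind of restriction discussed above, here is a minimal sketch of a minimum-cell-count check on a one-way frequency table; the threshold value is illustrative, and actual rules vary between data providers:

```python
# Minimal sketch: a frequency table is only released if every cell reaches a
# minimum number of observations.
from collections import Counter

MIN_CELL_COUNT = 10   # illustrative threshold

def safe_to_release(categories) -> bool:
    """Return True if no cell of the one-way frequency table is below threshold."""
    counts = Counter(categories)
    return all(n >= MIN_CELL_COUNT for n in counts.values())

# Example: a categorical variable with one sparsely populated cell.
print(safe_to_release(["a"] * 40 + ["b"] * 25 + ["c"] * 3))   # False: cell "c" has 3 observations
```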

2.2. Graphics: The production of graphics is a common problem faced by users of remote access systems, as very few have the provision to return graphics (for instance, part of the LISSY* security prevents any file that is not in text format from entering the system, which means that graphics cannot be returned). LIS gets around this by allowing users to submit jobs, which are then run locally by support staff and returned as attachments to email. There are some remote access packages available commercially that produce beautiful graphics, but these tend to be aimed at the private sector, and at users whose priority is producing high quality graphics from relatively simple analyses to form part of commercial presentations. Therefore the range of analysis techniques can be too limited for academic use (this software also tends to be very expensive).

2.3. Cost: Cost is also an important factor when choosing a system. The main areas where costs need to be considered are:
- Hardware: how much space is needed for storage, and to allow users to run jobs remotely? Also, would a PC or UNIX based operating system be most suitable for the software?
- Software: is it necessary to write new software to provide the utilities required, or is it possible to use established software that not only reduces development costs, but also provides users with a familiar interface? For established software, what are the licensing costs?
- Data preparation: is it necessary to do a lot of work preparing the data for mounting on the system, causing delays to data release and using up staffing resources?

* For an outline of the PiEP and LIS projects and the LISSY software mentioned in this paper, please see the APPENDIX.


3. Security

3.1. Prevention of unauthorised users: The most common form of security for remote access systems is a password authorisation system supported by a signed license agreement. License agreements are usually between the individual user and the data provider; however, some data providers insist on an agreement with the institution where the data will be held. This is a serious concern, as a corollary is that only employees of the institution can have access to the data. This marginalises students, as they are rarely employed by the institution at which they are studying. Considering the shortage of European researchers with high-level methodological skills, any system that prevents students, in particular PhD students, from accessing international data is going to contribute to the problem.
An additional security measure that is often considered is a system of “trusted” computers, whereby IP addresses are registered with the access system, and only requests originating from known IP addresses are processed. As it is impossible for every researcher to have a unique IP address, the only way to implement this is to have a “trusted” server. Remote Desktop Access software would allow a number of researchers to link to this server from their desktops, thus providing them with a “virtual trusted” computer. However, this method needs a dedicated server, an expense which individual users and even many institutions would be unable to manage. Another effective measure to support password authorisation is to restrict the delivery of output. For instance, the LISSY system will only return output to the user’s registered email address. This means that no unauthorised user can access output: even if they have managed to get hold of a username and password, the results are still returned to the registered user.
Hacking: Another concern is unauthorised users hacking the data server or “sniffing” the data in transfer. To prevent “sniffing” it is possible to encrypt data as it is being transferred. The LISSY system does not use encryption, as neither micro data nor the results of any confidential analyses are transmitted; it was therefore decided that encryption would only slow the system down unnecessarily, and speed was one of our priorities. As is standard practice, data files should always be stored with non-descriptive names, providing additional protection against anyone who gains direct access to the data server through a network security breach. Users are given aliases with which to access the data.


3.2. Prevention of confidential analyses: There are a number of ways of going about this, some more practical than others.
Checking output: Checking the output generated before it is returned to the user is very impractical for a number of reasons, primarily because it is very time-consuming and prevents output from being returned promptly. There is also the problem of finding staff who are qualified to check output. The person responsible for checking output has to be a statistician of equal skill to the most sophisticated user of the system. In addition, they must have an understanding of the data users’ fields of research, so that they can see how the results will be used and whether this will be sensitive, and a good knowledge of the security needs of the individual countries, since confidentiality concerns vary across nations. What chance is there that someone who has this range of knowledge will be content just checking other users’ output? There is also the difficulty of understanding the structure of other people’s programs. Therefore, if it were decided that checking output was necessary, it would be advisable to provide users with a template to encourage them to submit annotated jobs in a standard format.
Blocking at source: This is the most effective method we have found for preventing sensitive analyses. In this method a string search is run on the text of all programs submitted. If any strings are identified that might represent confidential analyses, the program is not sent to the data server but returned to the user with an error message. As any combination of strings can be specified, this system offers a lot of flexibility. It allows data providers to define and block problems particular to their national data. Different blocks can also be set depending on the user. Thus, we have a method that can tailor security to the individual case and block sensitive analyses before they gain access to the data.
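A minimal sketch of such a blocking step; the package names, patterns and messages are illustrative and are not the actual PiEP-LISSY block list:

```python
# Minimal sketch of "blocking at source": submitted program text is scanned for
# strings that could produce confidential output before it ever reaches the
# data server.
import re

BLOCKED_PATTERNS = {
    "stata": [r"^\s*list\b", r"^\s*browse\b"],
    "sas":   [r"\bproc\s+print\b"],
    "spss":  [r"\bLIST\b", r"\bPRINT\b"],
}

def screen_job(program_text: str, package: str):
    """Return (accepted, message); rejected jobs never reach the data server."""
    for pattern in BLOCKED_PATTERNS.get(package.lower(), []):
        if re.search(pattern, program_text, re.IGNORECASE | re.MULTILINE):
            return False, f"Request rejected: statement matching '{pattern}' is not permitted."
    return True, "Request forwarded to the data server."
```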

3.3. Deposit papers: A commonly stated condition of data access is that papers produced using the data are deposited with the providers. Very few data providers manage to put this into practice successfully. The only data provider I have come across who makes this work is the Luxembourg Income Study. This is primarily due to close links with their user community and the involvement of senior academics who know the publications in their field and soon become aware of any rare breach. I will discuss ways of maintaining these links below.


The best form of security is a good relationship with your users: if they feel they have a stake, rather than being in a supplicant position, they are more likely to act responsibly.

4. Supporting a remote access system

High-quality support is crucial to any successful remote access system. A close relationship between users and providers not only ensures that any investment in developing the system is relevant but, as mentioned above, also helps immeasurably with security, as users feel that they have a responsibility to the project. A good line of communication between the data collectors and distributors is also a vital component of support, in order to ensure data quality and fast response to user queries.

4.1 Human Resources
Dedicated personnel are necessary to support a remote access system efficiently. Their assistance is vital in the following areas.
Data preparation: Support staff are needed to prepare data before they are mounted on the system. Ideally the minimum of data preparation should be necessary, partly to preserve as much of the original structure of the data as possible, but also to minimise the use of staff resources. However, it is almost always necessary to do some preparation, even if it is just naming and labelling variables.
Technical support: IT staff are needed to maintain the systems, hardware and software, and to ensure that downtime is kept to a minimum.
Research support: It is exceedingly rare for a dataset to be perfect. Often inaccuracies are not noticed until the data are being used, when analysis produces unexpected or suspicious results. There are also likely to be methodological queries about the exact definition of variables, how value ranges are selected, etc. Support staff can only answer these questions if they are familiar with the data. They must also be in close contact with representatives of the data providers, so that they can quickly obtain answers to queries that cannot be solved without reference to the original dataset. If researchers are to work effectively, it is vital that any queries are answered promptly. Again, if users' questions are answered promptly and intelligently by people who seem interested in their research, they become far better disposed towards the data providers and feel more of a responsibility to safeguard the data. Long delays cause resentment and encourage researchers to find alternative data sources.
Derived Variables: One of the major drawbacks of a remote access system is that researchers are not able to create their own subsets of derived variables. This is especially important for cross-national data, where recoding is almost always necessary for comparability. Therefore, support for remote access systems must take into account the necessity of providing users with space to store constructed subsets, or a system for adding derived variables to the core dataset (or preferably both). Decisions on derived and comparable variables can only be made in conjunction with the user community; this is another reason why close links between data providers and users are vital.

4.2 Networking and Partnerships
Mail Lists: All active users should be included in a mail list so that they can be informed of any changes to the data, technical problems with the system and other issues that arise. It is also a good idea to have a user group mail list where researchers can discuss issues amongst themselves, as researchers are often best equipped to answer each other's queries. This also reduces the workload on support staff, as many researchers will query their colleagues before contacting support staff.
Website: A good website is vital to attract new users, reduce the pressure on support staff, facilitate data use and enhance security. These days many users become aware of new data sources through the web. A website acts as an introduction to the service, attracting new users and allowing them to decide whether the data are likely to be useful. Documentation, high quality metadata and answers to frequently asked questions can be mounted on the website, reducing the number of individual questions support staff have to deal with. Access to metadata such as details of variable coverage and the question texts can save a researcher time by minimising the need to examine the data interactively before starting analysis, and good methodological information minimises errors, producing a higher standard of data use. Sample exercises and programs can also be provided as training aids for new users. Another very important function of a website is to provide information on registration procedures. A straightforward, quickly implemented registration procedure is vital, as it makes it significantly less likely that researchers will try to find illegal methods of gaining access to the data.


Conferences: One of the key ways in which the Luxembourg Income Study develops and maintains such a good relationship with its users, apart from the superb level of support offered, is through a regular program of workshops and conferences. LIS runs an annual workshop for new users. Here they are introduced to the data and access system in a supported environment, where problems can be discussed with the technical team. Experienced users of the data are also invited to give seminars on their work and to guide new users in the possibilities offered by the dataset. This is not only a great way of forming relationships between data users and providers, but also a way to reduce the problems experienced by new users, and thus the time needed to support them remotely. In addition to the annual workshop, LIS also organises a conference once every two years; recent and future topics include Child Poverty and Immigration. Conferences have an important part to play in developing research networks and encouraging good data use. They also provide support staff with an opportunity to gain a more in-depth knowledge of what the data are being used for.
Steering Committee: An active steering committee made up of data providers and data users is vital to provide a forum for discussion of topics such as security, data quality and system development. In addition, individual researchers often do not have the breadth of international knowledge necessary to create comparable cross-national variables, so a steering committee is invaluable when taking decisions on derived variables. Members of the steering committee will not only be aware of which derived variables it would be useful to add to a dataset, but will also be able to decide on the methodology for constructing these variables, and can pool their national expertise to decide how to derive each variable accurately for each country.

5. Conclusion

The development of remote access systems is vital to the future of academic research if data providers continue to become increasingly reluctant to allow users to hold data locally. Working with a remote access system should resemble working on a local PC as closely as possible. This is affected by the speed at which output is returned, the familiarity of the statistical software, and the restrictions placed on analyses. Security measures that are too severe render the data useless. Blocking confidential analyses before the program is delivered to the data server is an effective way of preventing confidential data from being released.


It is crucial that the system administrators provide a high quality support network to enable users to make effective use of the data. A well designed website makes a vital contribution to remote access support. Finally, close links between all the stakeholders (users, collectors, distributors) are vital. Managers of remote systems should not see themselves purely as data providers, but as members of the research community. The sense of community can be encouraged through steering committees, user groups, seminar series and workshops. Good communication between all parties has a positive impact on effective allocation of resources, research quality, levels of use, and data security.

APPENDIX:
PiEP: The Pay Inequalities and Economic Performance Project is conducted by an international team of academic researchers with support from the European Commission, and in close collaboration with Eurostat and the national statistical institutes. The project makes use of the 1995 Structure of Earnings Survey microdata for 6 countries (Belgium, Denmark, Ireland, Italy, Spain, UK). These data, which are held at Eurostat in Luxembourg, are accessed via a remote system managed by the London School of Economics in the UK. The access system is an adaptation of the LISSY software, commonly referred to as PiEP-LISSY. http://cep.lse.ac.uk/piep/
Tanvi Desai is the Data Manager and System Administrator of the PiEP project, as well as being Data Manager for the LSE Research Laboratory. http://rlab.lse.ac.uk/
LIS: The Luxembourg Income Study provides remote access to a collection of household income surveys for 25 countries on 4 continents. The LISSY remote access software was originally developed for this project and has been running successfully for 20 years. http://www.lisproject.org/
LISSY: The LISSY system is a remote data access system developed by HAL Consulting. It provides secure access to micro data through email, allowing users to send programs in any of three commonly used statistical software packages (SPSS, SAS, STATA). The PiEP version of LISSY has the added ability to block any strings or combination of strings that might provide access to confidential information.


FROM ON-SITE TO REMOTE DATA ACCESS – THE REVOLUTION OF THE DANISH SYSTEM FOR ACCESS TO MICRO DATA
Otto Andersen
Statistics Denmark

Summary
Statistics Denmark has altered its scheme for giving researchers access to de-identified micro data from on-site to remote access through the Internet. This is part of the general vision that Denmark should work hard to be one of the world's leading countries within register-based research. Through the new scheme, Danish researchers have experienced a breakthrough in the methods of access to micro data.

1. From surveys to register-based statistics
Denmark introduced the Person Number (the Personal Identification Number) in 1968, and it was used in a census for the first time at the Population and Housing Census in 1970. Accordingly, this became the first Danish register that uses the Person Number as an identification key. During the 1970s the first attempts were made to base the production of statistics on registers. In 1976 a register-based population census was conducted as a pilot project, but the registers were not sufficiently comprehensive and well-established until 1981, when a proper register-based population census was conducted containing most of the conventional population and housing census information. As in the other Nordic countries, the person and business registers in Denmark today cover a very substantial part of the production of statistics. The contents of the registers also cover many fields of research, such as labour market research, sociology, epidemiology and business economics. The strength of the system is that the identification keys (person number, address, central business register number and property title number) make it possible to link the data both within a specific year and longitudinally across several years.

2. Increased interest in micro data
In the mid-1980s, Statistics Denmark experienced an emerging interest among various research environments and ministerial analysis divisions in applying micro data (individual data) for research and analysis purposes. One reason was that the development in computer technology made it technically possible to process large amounts of data according to advanced statistical models, such as multivariate models. These environments put pressure on Statistics Denmark to disclose micro data, a request that Statistics Denmark was unable to grant because of the rules of confidentiality laid down by the Management and Board of Statistics Denmark. On the other hand, it was evident already at that time that not only were the registers of enormous importance to the production of statistics by Statistics Denmark, but their research potential was so great that it would be very valuable to actually utilise them for research purposes. Therefore, Statistics Denmark had to find a solution to the problem of access which complied with the existing legislation on registers while taking into account Statistics Denmark's own confidentiality principles. During 2001, negotiations between Statistics Denmark, the Ministry of Research and the research environment resulted in the signing of a contract on the establishment of a special unit (the Research Service Unit) in Statistics Denmark with the special duty of improving researchers' access to micro data through a better infrastructure and of lowering the costs of using the data. The budget for the Research Service Unit is 6 million D.kr. per year (approx. 800,000 Euro). Some of the money is used to upgrade the special UNIX computers, cf. below.

3. Legislation
With the introduction of two acts on registers in 1979, Denmark saw the first statutory regulation concerning, inter alia, disclosure of micro data to researchers. As at 1 July 2000 these acts were replaced by the Act on Processing of Personal Data (lov om behandling af personoplysninger). The Act implements Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and the free movement of such data within the European Union. The former acts primarily governed registration and disclosure of data in registers, while the new Act applies to all forms of processing of personal data. The new term, “processing”, covers all types of processing of personal data, including registration, storing, disclosure, merging, changes, deletion, etc. Previously, the setting up of a register was subject to the so-called register provisions, involving a rather time-consuming and laborious process. These provisions have been abolished, and now the individual authority makes decisions in concrete cases on processing; for example, the authority decides issues of disclosure of data for scientific purposes based directly on the provisions of the Act on the lawfulness of such disclosure.


The new Act introduced a duty of notification to the Danish Data Protection Agency. The purpose is to enable the Agency to supervise the processing of sensitive information that is carried out. Accordingly, a scientific project involving processing of sensitive personal data is subject to notification to, and approval by, the Danish Data Protection Agency before such processing can commence. This applies to all surveys, whether they are conducted by a public administration, individuals or enterprises. The Agency has laid down special provisions on security in connection with the processing of sensitive data. All in all, the introduction of the Act on Processing of Personal Data has provided potentially more favourable conditions for register-based research in Denmark. In particular: public authorities' basis for disclosing administrative data for research purposes has been enhanced and simplified in terms of administration, as they no longer need to consult the Danish Data Protection Agency; personal data applied for statistical purposes may be disclosed and reused with the permission of the Agency; data from one private research project may be disclosed to another project; there is full access to filing of data in the State archives; both private individuals and public authorities may process data on Person Numbers for scientific or statistical purposes; furthermore, the Act now explicitly stipulates that the data subject's right of access to personal data shall not apply where data are processed solely for scientific purposes.
In addition to the Act on Processing of Personal Data, the Danish Public Administration Act (Forvaltningsloven) is of relevance. Under this Act, a public authority may impose a duty of non-disclosure on persons outside the public administration concerning the data disclosed. Statistics Denmark has applied this provision in connection with researchers' access to micro data under the scheme for the on-site arrangement for external researchers at Statistics Denmark (cf. below), although no disclosure in a formal sense is made. Data, even anonymised data, must be treated as confidential. Breach of the duty of non-disclosure is punishable by simple detention or imprisonment.

4. Confidentiality principles of Statistics Denmark
As appears from the above, current legislation permits disclosure, to a wide extent, of personal data for scientific purposes. However, the authority in question ultimately decides whether disclosure may take place, meaning that the authority may take other issues into consideration even if the Danish Data Protection Agency has approved the disclosure of data. That is what Statistics Denmark has decided to do. This decision has been made so that the individual citizen or enterprise can be certain that the data supplied directly or indirectly to Statistics Denmark do not fall into the hands of any unauthorised persons. In the opinion of Statistics Denmark, the risk of irreparable damage to the production of statistics outweighs the consideration of more or less convenient access to data for the individual researcher. Thus, the fundamental principle is that data must not be disclosed where there is an imminent risk that an individual person or individual enterprise can be identified. This applies not only to identified data, such as Person Numbers, but also to de-identified data, since such data are usually so detailed that identification can be made. Since Statistics Denmark also considers it important that data can be applied for scientific purposes, special schemes for researchers have been set up.

5. Scheme for the on-site arrangement for external researchers at Statistics Denmark
Since its overriding principle is not to disclose individual data, Statistics Denmark set up a scheme in 1986 for the on-site arrangement for external researchers at Statistics Denmark. Under this scheme, researchers can get access to anonymised register data from a workstation at the premises of Statistics Denmark. Statistics Denmark creates the relevant datasets on the basis of the researcher's project description, the general principle being that the dataset should not be more comprehensive than necessary for carrying out the project (the "need to know" principle). The researcher signs an agreement which stipulates that data are confidential and that individual data must not be removed from the premises of Statistics Denmark.

6. Organisational framework
The scheme is administered centrally by the Research Service Unit as part of the office of Research and Methods. The staff of this unit also create a substantial part of the interdisciplinary datasets and have general (authorised) access to all relevant data in Statistics Denmark in order to reduce the administrative and bureaucratic work. The scheme requires close cooperation between the Research Service Unit and the individual divisions. The advantage of such central organisation is that the individual researcher is fully aware of whom to negotiate with and who is responsible for the dataset supplied. In 1996, Statistics Denmark opened a small branch in Århus, Jutland, to grant researchers west of the Great Belt an opportunity to use the scheme on equal terms with researchers in Copenhagen. After initial funding from the Danish National Research Foundation (Danmarks Grundforskningsfond), Statistics Denmark has taken over the costs as part of the above-mentioned Research Service Unit.


7. Research databases
As researchers almost invariably request datasets linking information from several individual registers in terms of both contents and time, the creation of specific datasets for a project often involves considerable work by Statistics Denmark and considerable costs for the researcher. To reduce the cost of datasets for research purposes and to solve special data problems, Statistics Denmark has set up a number of research databases. These databases are hardly ever used in the actual production of statistics, but are first and foremost a kind of intermediate product for the benefit of the research process. The most frequently applied research database is the Integrated Database for Labour Market Research (IDA). One reason for creating the database was to solve a difficult problem of definition, the identity of enterprises over time, a task that individual researchers were unable to handle for reasons of both time and funding. Nine to ten man-years were spent on the task, which was funded by the Danish Social Science Research Council (Statens Samfundsvidenskabelige Forskningsråd) and Statistics Denmark. Since the establishment of IDA in 1991, Statistics Denmark has handled the updating of the database against user charges. Other research databases include the Demographic Database, the Fertility Database, the Prevention Register (health data), the Social Research Register, etc. As the names imply, the databases cover many specialist fields: economy, labour market research, social research, epidemiology, etc. The latest development is the Prescription Database, holding information on doctors' prescriptions of medicine sold by the pharmacies in Denmark. A number of research institutes have paid for the creation of major research databases for the purpose of their own research.

8. Considerable growth
From the modest beginnings in 1986, the use of micro data has increased markedly under the scheme for the on-site arrangement for researchers at Statistics Denmark. In 1997, 71 researchers used the on-site arrangement, while in 2003 the figure had risen to around 200.

9. Model and study datasets

Statistics Denmark has departed only to a very limited extent from the rule not to disclose micro data to researchers. To enable researchers to develop computer programs at their own workplaces, they have been granted an opportunity to borrow micro data, upon request, from very small populations (e.g., 1000 records). Only very few such model datasets have been created in recent years. However, Statistics Denmark has prepared some study datasets, so far based on the IDA database, for study programmes in economics/labour market policy, as well as interdisciplinary data material for sociology studies. These datasets follow a few thousand persons over time on a number of variables. Where possible, the data are scrambled: the actual register data are changed by a simple mathematical function that preserves their ascending or descending order, so the fundamental characteristics of the data are preserved. In this way, students get an opportunity to try out statistical models on realistic data. Apart from the above, Statistics Denmark has not applied scrambling procedures or special grouping techniques to the data made available to researchers under the on-site arrangement; the data appear as in the basic registers.
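The paper does not state which "simple mathematical function" is used for scrambling; purely as an illustration of the general idea, the sketch below applies an order-preserving affine transformation, so the ranking of the values is kept while the actual amounts are changed. The constants and variable names are invented for the example.

```python
def scramble(values, scale=1.15, shift=-2500):
    """Illustrative order-preserving scramble: an affine map with a positive
    scale changes every register value but keeps the original ascending or
    descending order, so rankings and broad distributional shape survive.

    The constants are arbitrary; the actual function used by Statistics
    Denmark is not described in the paper.
    """
    return [scale * v + shift for v in values]


incomes = [180000, 240000, 255000, 410000]   # hypothetical register values
print(scramble(incomes))                      # same ordering, different values
```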

10. The UNIX solution

Until 1996, researchers under the on-site arrangement were restricted to making batch runs on Statistics Denmark's mainframe. This meant that the only software available was SAS. Furthermore, most researchers were used to other platforms, such as UNIX, and were therefore unfamiliar with the run and editing procedures. In 1996, the Danish National Research Foundation funded the acquisition of a UNIX system, which has been used exclusively for projects under the on-site arrangement. The advantages were obvious: the researchers got access to familiar technology and the choice of software became more varied. Besides SAS, researchers now have access to SPSS, STATA, GAUSS, etc. Statistics Denmark has repeatedly upgraded the technical solution since 1996, partly by acquiring additional UNIX systems, partly by increasing the disk capacity. The latest and biggest upgrade was carried out in January 2003 as a result of the contact with the Ministry of Research concerning the Research Service Unit.

11. Remote access

In the autumn of 2000, the Director General of Statistics Denmark instructed a committee to examine whether to grant the users of Statistics Denmark's researcher schemes access to datasets from their own workplaces. The result of the committee's work was a proposal to grant specially authorised research and analysis environments access to making batch runs on approved datasets held by Statistics Denmark. The Board of Governors of Statistics Denmark approved the scheme, which entered into force on 1 March 2001 following the completion of a pilot project.


A research or analysis environment can apply to Statistics Denmark for authorisation. As at 1 October 2003, 60 environments had been granted authorisation. Until now, remote access has not been granted for all datasets: particularly sensitive data (e.g., data on crime) have been excluded from the scheme, and data on enterprises are assessed carefully to avoid any problems of confidentiality. It has been emphasised that the data consist of samples; if researchers request access to total populations, the content of the variables must be limited. The individual cases have been assessed by a steering group consisting of the Directors of Statistics Denmark. Researchers not granted remote access have instead been allowed to use the on-site scheme. However, in December 2002 the Board of Governors of Statistics Denmark accepted a proposal to consider the on-site scheme and the remote access scheme as equivalent in terms of data security; as a consequence of this decision, all datasets that can be accessed on-site can also be accessed remotely. Following this decision, it has become very important to revise the rules for granting authorisation to micro data.

12. New rules for access to micro data

It has been proposed to the Board of Governors (at its meeting on 2 April 2003) that access to micro data can only be granted to researchers and analysts in authorised environments. Authorisations can be granted to public research and analysis environments (e.g. in universities, sector research institutes, ministries, etc.) and to research organisations that are part of a charitable organisation. Within the private sector, the following user groups can be granted authorisation if they have a stable research or analysis environment (with a responsible manager and a group of researchers/analysts):

1. Non-governmental organisations
2. Consultancy firms
3. Enterprises; however, single enterprises cannot have access to micro data containing enterprise data.

Before granting an authorisation, Statistics Denmark will evaluate the proposed organisation carefully; especially when it is an organisation or firm within the private sector, Statistics Denmark will look at the credibility of the applicant (its ownership, the educational standard of its staff and the research it has done for others).


Statistics Denmark will not grant authorisation to individual persons. Furthermore, media organisations are excluded from the scheme. The "need to know" principle is still in force: researchers can have access to relevant business data in accordance with this principle. Only very few business data are excluded from remote access, and the whole question concerning these data is under evaluation.

13. Foreign researchers?

Only Danish research environments are granted authorisation, as Statistics Denmark is not able to enforce a contract effectively abroad. Foreign researchers from well-established research centres can have access to Danish micro data under the on-site arrangement in Copenhagen or Århus. Visiting researchers can have remote access from a workplace in the Danish research institution during their stay in Denmark, under the Danish authorisation.

14. The remote access will take over

As a consequence of the decisions mentioned above, the on-site arrangement will be closed down gradually and remote access will become the only route to micro data.

15. The technical solution

The technical solution is based on the use of the Internet (cf. the flow chart at the end of this paper). The relevant micro data are produced by the staff of Statistics Denmark, and the de-identified micro data are transferred to disk storage connected to special Unix servers. These Unix servers are used only by researchers and are separated from the production network. Communication via the Internet is encrypted, and a so-called RSA SecurID card secures it against unauthorised access. In practice, the researcher rents a password key (a token) from Statistics Denmark; the token ensures that only the authorised person obtains access to the computer system. A farm of Citrix servers allows researchers to "see" the Unix environment in Statistics Denmark from their own workplace. All data processing is actually done in Statistics Denmark, and data cannot be transferred from Statistics Denmark to the researcher's computer. The researcher can work with the data quite freely and can make new datasets from the original datasets; the limit is of course the amount of disk space, which Statistics Denmark has just increased considerably. All results from the researcher's computer work can be stored in a special file, and such printouts are sent to the researcher by e-mail. This is a continuous process (every five minutes) and has proved to be quite effective. The advantage for Statistics Denmark is that all e-mails are logged at Statistics Denmark and checked by the Research Service Unit. If the unit finds printouts with too detailed data, it contacts the researcher in order to agree on the level of detail of the output. No severe violation of the rules established in the authorisation has taken place.
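The output channel just described (results written to a special file, forwarded to the researcher by e-mail roughly every five minutes, and logged for checking by the Research Service Unit) could be sketched roughly as follows. The directory names, the interval and the e-mail stub are assumptions for illustration, not details of Statistics Denmark's actual implementation.

```python
import time
from pathlib import Path

OUTBOX = Path("/srv/research/outbox")            # hypothetical directory for printouts
RELEASE_LOG = Path("/srv/research/release.log")  # hypothetical release log


def send_by_email(printout: Path) -> None:
    # Stub: in the real system the printout is e-mailed to the researcher.
    print(f"mailing {printout.name} to the researcher")


def forward_outputs(interval_seconds: int = 300) -> None:
    """Periodically forward new printouts and log them, so that the
    Research Service Unit can later check for overly detailed output."""
    seen = set()
    while True:
        if OUTBOX.is_dir():
            for printout in sorted(OUTBOX.glob("*.txt")):
                if printout not in seen:
                    send_by_email(printout)
                    with RELEASE_LOG.open("a") as log:
                        log.write(f"{time.ctime()}\t{printout.name}\n")
                    seen.add(printout)
        time.sleep(interval_seconds)
```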


Implementing Statistical Disclosure Control for Aggregated Data Released via Remote Access

Luisa FRANCONI and Giovanni MEROLA
ISTAT, Servizio MPS, Via C. Balbo, 16, 00184 Roma, Italy
e-mail: {franconi, merola}@istat.it

Abstract: In this paper we give an overview of various approaches to the implementation of statistical disclosure control for tabular data released through the Web. We consider three generic groups of statistical disclosure control methods: source data perturbation, output perturbation and query-set restriction. Considering different types of Web-sites and implementation approaches, we discuss the appropriateness and effectiveness of such statistical disclosure control methods.

Keywords: Remote access, tabular data, statistical disclosure control, on-line database

1 Introduction

Dissemination is one of the missions of National Statistical Institutes (NSIs); it is a way of giving society useful information and also a way of motivating respondents to answer surveys. While NSIs have the legal standing to collect and release information, they also have the obligation to protect the confidentiality of respondents, which sets a constraint on the amount of information that can be released. This trade-off is addressed by statistical disclosure control (SDC), which consists of a collection of techniques that make it difficult to match confidential information with the identity of respondents from a set of data. SDC methods employ a variety of techniques that either alter or suppress some of the data released. We classify the data released into three categories:

• Microdata: files with individual observations;

• Tables: total values carried by individuals that fall in given classifications;

• Other statistics: summaries of different types, for example regression coefficients, relative indices, correlation coefficients, etc.


In this paper we do not consider the release of microdata files, but only that of aggregated values. In particular, we focus on the release of tables, although some of the conclusions that we draw can be applied to the release of other statistics as well. NSIs have a long tradition of periodically publishing printed reports of the data they collect, which are sold or distributed freely. However, the Internet is becoming a standard channel through which public institutions communicate with the general public: institutions expect people to look for information on the Web, and Internet navigators expect to find it there. Furthermore, Web-sites are an ideal tool for disseminating data, as they are cheap, flexible, easy to update and accessible by most users (Blakemore, 2001). In fact, most NSIs provide on-line data in tabular form. Such Web-sites, to which we refer as Web-based Systems for Data Dissemination (WSDDs), require automated systems that release data upon request. WSDDs can be designed in many different ways, either giving access to a predefined set of tables or allowing users to query any table by choosing from a set of available variables. Therefore, SDC methods must be applied to WSDDs according to their structure and flexibility, as well as to the type, quality and level of detail of the information released. In many cases SDC must also be combined with electronic access control. We argue that SDC methods can be applied to WSDDs in two ways: a priori, that is, before releasing the tables, or a posteriori, that is, after the user has made his/her particular query. We refer to the former as PRE SDC and to the latter as POST SDC. According to this classification, we comment on the benefits and drawbacks of different SDC approaches. In Section 2 we present the essential framework of SDC methods for the release of tabular data. In Section 3 we discuss the problems peculiar to the application of SDC to WSDDs, and in Section 4 we classify the different SDC methods according to the way they are implemented in WSDDs and analyse the consequences of their application from the point of view of both the producer and the user. Finally, in Section 5 we present our conclusions, casting the various WSDDs in the perspective of the general framework of lack of access versus information loss.

2 Statistical Disclosure Control for Tables

Tables have always been the essential data products of any NSI or statistical agency. Tables are built from the source data file containing records on individuals, called the microfile, by aggregating the values of the responses for the units falling in the categories of the chosen classifying variables. When the response is equal to one for all observations, the resulting table is a frequency table that gives the number of units in each cross-classification. As is customary in SDC theory, we restrict attention to tables for non-negative responses, thus including frequency tables. SDC theory deals with releasing data without confidential information being traced back to respondents. To provide context for our discussion, we briefly describe the framework of SDC.


A cell of a table is considered at risk of disclosure, or sensitive, if pre-defined rules are not satisfied (Willenborg and de Waal, 2001). The most used rules are a minimum number of units in each cell, called the threshold rule, and a maximum percentage of concentration (applicable to continuous variables), called the dominance rule. These rules are based on the idea of preventing an intruder from estimating confidential values too closely. Tables containing cells at risk can be protected by different techniques, which either inhibit access to part of the information or distort the information released. Clearly, there exists a trade-off between the level of protection achieved for the data and the quality of the information released. Duncan et al. (2001) propose a model to evaluate this trade-off, also in a graphical way. SDC methods for tabular data are based either on data transformation (input data masking, cell perturbation) or on data suppression (see Willenborg and de Waal, 2001). The former may suffer from bias in the information released, whereas the latter may prevent the release of information. The considerable effort that has gone into developing better and more targeted SDC methods, which effectively protect the confidentiality of respondents, has increased the quantity of high-standard products that can be safely released, to the benefit of legitimate users. Effort has also gone into developing software for the application of methods that require intense computation; for example, the CASC project (http://neon.vb.cbs.nl/rsm/casc/), funded by the European Union, developed two programs, τ-Argus and µ-Argus (Hundepool, 2001), that use sophisticated routines for the application of SDC to tabular data and individual data, respectively.

Duncan (2000) classifies SDC methods in terms of disclosure limiting masks; we regroup this classification into the following three broader basic approaches, which can be used in combination or alone:

Perturbing the source data: records in the source data are perturbed before any data product is released. Such perturbations can be of different, non-exclusive, kinds: suppressing records, swapping some values between similar records (Dalenius and Reiss, 1982), applying Markov perturbation (Gouweleeuw et al., 1998) or model-based perturbation (Franconi and Stander, 2002), adding random noise (Brand, 2002), (sub-)sampling from the entire source data file, or assigning wider categories to classifying variables. All these techniques give the protector a large degree of freedom, but it is often difficult to evaluate the level of protection achieved. Furthermore, several studies have shown (e.g., Winkler, 1998 and Brand, 2002) that data protected by these techniques are often severely distorted.
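As a concrete illustration of the threshold and dominance rules just described, the following sketch checks a single cell; the particular parameter values (a threshold of 3 and an (n,k) dominance rule with n = 2, k = 0.85) are arbitrary choices for the example, not values prescribed by any NSI.

```python
def cell_is_sensitive(contributions, threshold=3, n=2, k=0.85):
    """Return True if a cell fails the threshold rule or the (n,k)-dominance rule.

    contributions: non-negative values contributed by the individual
    respondents falling in the cell.
    """
    # Threshold rule: the cell must contain at least `threshold` units.
    if len(contributions) < threshold:
        return True
    total = sum(contributions)
    if total == 0:
        return False
    # (n,k)-dominance rule: the n largest contributors must not account
    # for more than a fraction k of the cell total.
    largest_n = sorted(contributions, reverse=True)[:n]
    return sum(largest_n) / total > k


# One respondent contributes 90 out of a total of 100, so the cell is
# dominated and must be protected.
print(cell_is_sensitive([90, 6, 4]))        # True
print(cell_is_sensitive([30, 25, 25, 20]))  # False
```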


Perturbing cell values: some or all values to be released are perturbed, either by adding random noise or by rounding them. Perturbative methods present the same drawbacks as those perturbing the source data. Moreover, the resulting tables may be non-additive (i.e. with marginal values not congruent with the inner cell values).

Suppressing cell values: sensitive cell values are not released. Together with these values, other non-sensitive cells must also be suppressed in order to avoid the recovery of sensitive values by differencing; this is so-called complementary suppression. Cell suppression is the most popular protection method in SDC, partly because, unlike other methods, it can easily be customised with respect to a given loss function. The algorithms for choosing the cells to be suppressed in an optimal way are complex and slow. There is an extensive literature in this area treating both heuristic and exact solutions (see, for example, Willenborg and de Waal, 2001 and references therein). The drawback of cell suppression is the reduction in information released. Conceptually, the extreme case of cell suppression is the suppression of the whole table; while this practice leads to a great loss of information, it can be convenient when the number of tables to be protected is large, because it avoids the demanding computations for finding the complementary suppressions.

The above approaches can also be applied to protect releases of other statistics; however, their application to specific quantities is still under study. For a more comprehensive review of SDC methods for tabular data see, for example, Duncan et al. (2001), while for a detailed account see Willenborg and de Waal (2001) and references therein. Different methods can be adopted for the protection of the data to be released, and there will not be agreement on which methods are better. In the next section we consider the application of the different SDC methods to WSDDs, without discussing the merits of the methods themselves, but having in mind the peculiar problems posed by their application to automated release systems.
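Before turning to WSDDs, here is a minimal sketch of one of the cell-perturbation techniques mentioned above, unbiased random rounding to a constant base; the base of 3 and the example counts are arbitrary.

```python
import random

def random_round(value, base=3):
    """Unbiased random rounding of a non-negative count to a multiple of base:
    a value with remainder r is rounded up with probability r/base and down
    otherwise, so its expected value is unchanged."""
    remainder = value % base
    if remainder == 0:
        return value
    if random.random() < remainder / base:
        return value + (base - remainder)   # round up
    return value - remainder                # round down


# Rounding every inner cell independently preserves expectations but, as
# noted above, the rounded table is generally no longer additive.
table = [1, 4, 7, 12]
print([random_round(v) for v in table])
```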

3 Problems peculiar to the application of Statistical Disclosure Control to WSDDs

WSDDs are Web-sites, accessible through the Internet, in which navigators can query tables automatically built from non-accessible source data. Usually, WSDDs are set up for records measuring several classifying variables and response variables. Therefore, for each response there exists a high-dimensional table formed using all the classifying variables, of which lower-dimensional marginal tables are released. The most informative WSDDs allow users to query tables of their choice, choosing among combinations of classifying and response variables contained in the source data. We call these sites dynamic, as opposed to static ones, which link users only to a pre-established subset of all possible marginal tables. More sophisticated WSDDs that also offer other statistics, such as, for example, correlation or regression coefficients, will be referred to as Virtual Laboratories (V-Labs).

From the point of view of SDC, static WSDDs are not really different from printed releases, therefore standard SDC techniques can be applied for their protection. Methods that require intensive computation (i.e. partial suppression), however, may not be applicable when the number of tables to be protected is large. SDC for dynamic sites is more difficult for two main reasons: the information retrievable consists of a large number of linked tables (assuming that p classifying variables and q response variables are offered for tabulation, the number of possible marginal tables is q·2^p), and the total information released to each user is different, cumulative and not known in advance. V-Labs are a hard challenge for SDC, and both their design and protection are still under study. In the following we focus mainly on the application of SDC to WSDDs that release tables, but some of the results are applicable to V-Labs, too.

The application of SDC techniques to WSDDs is a difficult task, mainly for four reasons: methods and criteria must be standardised and implemented in an automatic system; there is usually a large number of high-dimensional tables to be protected; some of the tables are linked, that is, they share some spanning variables; and, when users can choose freely which tables to see, the total information queried by a user is not known beforehand. As regards the choice of the SDC method for protecting the data, it is evident that the more information is granted to users, the harder it is to control disclosure of confidential information. Thus, SDC techniques must be tailored for each site with respect to the nature of the data and the information offered. Often, summaries of the data released in a WSDD have already been published (for example in print); in this case WSDDs provide additional information, and SDC techniques must be consistent with what has already been released. Furthermore, some WSDDs release data from databases that are updated with new data over time; in this case extra care must be taken, because SDC must also be consistent through time.

As mentioned above, protecting the tables released by a WSDD by suppressing a few inner cells is not usually feasible, because of the demanding computations needed for each table. Rather, more realistically, the suppression of complete tables can be applied. We will refer to this approach as restricting the allowed query-set. Query-set restrictions are rules that evaluate whether or not a table can be released because it would be too risky. We reserve a special category for this approach because it is typical of WSDDs. Restrictions require auditing (Malvestuto and Moscarini, 1999) and can be of different types: on the number of tables already released, on the dimension of the tables queried, on the goodness of fit of the unreleased tables that they allow, etc. (see, e.g., Adam and Wortmann, 1989 and Fienberg, 2000, for a general review of different approaches). We consider the following generic approaches to protecting the releases of a WSDD: (1) perturbing the source data (sampling, swapping, adding noise, etc.); (2) perturbing the output (adding noise, suppressing values, rounding, etc.); (3) restricting the allowed query-set (denying tables).

One way of limiting the possibility of breaching confidentiality through WSDDs is to restrict access to registered users and release the output directly to them. In this way, WSDD administrators reduce the probability of malicious intruders accessing the data, regardless of the type of data. In sites with restricted access, users must register, and login is allowed only to those fulfilling given requirements. Requirements can be of different kinds: the most generic is "having a valid email address"; another common, stricter, one is "belonging to a research institution". Registration may or may not include the electronic signature of a confidentiality agreement. In any case, tight access restrictions severely reduce the publicness of the site, while looser restrictions are often not effective because of possible phoney email addresses. Another limitation on breaches of confidentiality can be obtained by releasing the output by email; in this case, users must provide a valid email address and site administrators have records of what has been released and to whom. Another possibility for keeping records is to log the IPs connecting to the site, although this method is usually hidden and therefore harder to use as evidence. These restrictions, however, pertain to the security of the site and not to statistical disclosure control, and hence will not be analysed here, although they are used in combination with SDC. An interesting analysis of access restrictions from the SDC point of view can be found in David (1998) and Blakemore (2001).
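To make the combinatorial burden mentioned above concrete, the following sketch enumerates the q·2^p marginal tables that a fully dynamic WSDD would, in principle, have to protect; the variable names are invented for the example.

```python
from itertools import combinations

def marginal_tables(classifying_vars, response_vars):
    """Yield every (response, subset-of-classifying-variables) pair,
    i.e. every marginal table a dynamic WSDD could be asked for."""
    p = len(classifying_vars)
    for response in response_vars:
        for size in range(p + 1):
            for subset in combinations(classifying_vars, size):
                yield response, subset


classifying = ["region", "sex", "age_class", "activity"]   # p = 4
responses = ["turnover", "employment"]                      # q = 2

tables = list(marginal_tables(classifying, responses))
print(len(tables))   # q * 2**p = 2 * 16 = 32 linked tables
```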

4 Strategies for the application of SDC to WSDDs

SDC for WSDDs can be applied following two strategies: before queries are submitted (that is, before putting the data on-line) or after. We will refer to the former as PRE SDC and to the latter as POST SDC. PRE SDC is applied off-line; POST SDC can be carried out on-line, on-the-fly, but it can also be applied off-line, delaying the release of the output. The advantage of POST SDC is that it can be designed to be adaptive to the information previously released. If it is reasonable to assume that users do not co-operate among themselves, POST SDC can be adapted to the information released to each user; in this case, non-anonymous access, auditing of users' activity and, sometimes, logging of all releases are required. Several sites that apply PRE SDC can be found on the Web, for instance the Italian Foreign Trade statistics at ISTAT (COEWEB, http://www.coeweb.istat.it/). POST SDC is still not as common as PRE SDC; a site that applies POST aggregation of geographical units is the NASS site (developed at NISS, http://niss.cndir.org/; more information at http://www.niss.org/dg/nasssystem.html), which releases statistics on the usage of pesticides in the US. Some sites use both PRE and POST SDC, such as, for example, the American Fact Finder (AFF, http://factfinder.census.gov/servlet/BasicFactsServlet). Next, we briefly discuss the PRE and POST application of SDC to WSDDs.

WSDDs can be designed in many different ways, allowing for different amounts of information to be released. Dynamic WSDDs require software that builds the tables on-the-fly, possibly with embedded SDC. In any case, SDC techniques must be tailored to each WSDD in order to release the maximum possible information while preserving privacy. In fact, researchers are developing fully automated expert systems for applying SDC to WSDDs (see, for example, Keller-McNulty and Unger, 1998, Malvestuto and Moscarini, 1999 and Fienberg, 2000, for theoretical treatment, and Karr et al., 2001, Zayatz, 2002 and Karr et al., 2002, for examples of applications). Next, we discuss the PRE and POST application of the classes of SDC methods given above.

Perturbation of the source data
The perturbation of the source data is usually applied PRE, so that data are perturbed off-line and the tables queried are then built on these perturbed data. This approach is often applied to large datasets, in which the number of units is large enough to allow for a consistent reduction or for the law of large numbers to apply. A common practice is to build the tables from a small sub-sample of the complete dataset, as done, for example, at the AFF and for the 1996 Brazilian Census Data (http://sda.berkeley.edu:7502/IBGE). POST perturbation of the source data is also possible, for example by extracting a new sub-sample for each query. Such a practice seems reasonable because it can be implemented to be adaptive to queries, for example by adapting the size of the sample to the risk of the required output. However, if it leads to different outputs for equal queries, the effectiveness of the SDC method is weakened because an intruder could estimate the true values precisely by repeating the same query many times. Algorithms for the POST aggregation of geographical areas (counties) have been developed for data on the on-farm use of agricultural chemicals on various crops from the National Agricultural Statistics Service (Karr et al., 2001). Schemes of the PRE and POST application of source data perturbation are shown in Figure 1.


Figure 1 Scheme of SDC perturbing the source data. PRE application on the left and POST on the right.
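As an illustration of the PRE approach on the left of Figure 1, the sketch below draws a single sub-sample of the microdata off-line and builds all subsequent tables from the perturbed file; the sampling fraction, variable names and hypothetical microfile are assumptions, not the procedure of any particular site.

```python
import random

def pre_subsample(microdata, fraction=0.10, seed=2003):
    """PRE source-data perturbation: draw one sub-sample off-line and
    answer every subsequent query from it."""
    rng = random.Random(seed)
    return [record for record in microdata if rng.random() < fraction]

def tabulate(records, classifier, response):
    """Build a one-way table of totals from the (perturbed) records."""
    table = {}
    for r in records:
        table[r[classifier]] = table.get(r[classifier], 0) + r[response]
    return table


# Hypothetical microfile: the real source data never leave the NSI.
microdata = [{"region": i % 3, "turnover": 10 + i} for i in range(1000)]
perturbed = pre_subsample(microdata)          # done once, before release
print(tabulate(perturbed, "region", "turnover"))
```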

Perturbation of the output
Perturbation of the output can be applied either PRE (for static sites) or POST (for dynamic sites). The POST addition of random noise to cell values suffers from the drawback of repeated querying, just like the POST addition of random noise to the source data. POST cell suppression must be audited and repeated consistently; for example, if a cell has been released, then it cannot be suppressed in a subsequent release, and if it is suppressed in one release, then it must be suppressed in all releases. This implies that a record of all releases must be kept. On-the-fly suppression does not seem practical because it requires complex and lengthy computations. Rounding cell values to a constant base does not have particular drawbacks, but it does not generally give valid protection. More effective rounding techniques, such as controlled rounding, do have the drawback of not being consistent across different tables, and hence allow for disclosure using overlapping queries. To our knowledge, release from simulated data is still under study (Fienberg and Makov, 2001) but seems worthy of interest, especially for the protection of V-Labs. Schemes of the PRE and POST application of output perturbation are shown in Figure 2.

177


Figure 2 Scheme of SDC perturbing the output. PRE application on the left and POST application on the right.
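The consistency requirement for POST cell suppression stated above (a cell released once must never later be suppressed, and vice versa) could be enforced with a simple release record along the following lines; the class and its interface are hypothetical.

```python
class SuppressionAudit:
    """Keep POST cell suppression consistent across releases: a cell released
    once must stay released, a cell suppressed once must stay suppressed
    (otherwise sensitive values could be recovered by differencing)."""

    def __init__(self):
        self.released = set()
        self.suppressed = set()

    def decide(self, cell_id, is_sensitive):
        # Honour earlier decisions first.
        if cell_id in self.released:
            return "release"
        if cell_id in self.suppressed:
            return "suppress"
        # First time this cell is queried: apply the sensitivity rule.
        if is_sensitive:
            self.suppressed.add(cell_id)
            return "suppress"
        self.released.add(cell_id)
        return "release"


audit = SuppressionAudit()
print(audit.decide(("region=1", "sex=F"), is_sensitive=True))   # suppress
print(audit.decide(("region=1", "sex=F"), is_sensitive=False))  # still suppress
```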

Restriction of the query-set allowed
The restriction of the allowed query-set (Hoffman, 1977) is useful for reducing the disclosure risk connected with cumulative knowledge and linked tables. The simplest restriction gives either an approval or a denial to a query. A more elaborate type of restriction may offer a third possibility: "protect and release" or "release a simulated table" (Fienberg et al., 1998). Automatic SDC systems geared towards preventing disclosure from cumulative knowledge, applying SDC sequentially in response to a series of queries, were recently proposed (Keller-McNulty and Unger, 1998 and Fienberg et al., 1998). Generic restrictions can be set for all users, limiting the maximum dimensions of releasable tables or excluding tables with given combinations of spanning variables. Such restrictions should really be considered PRE SDC, because certain tables are banned (i.e. suppressed completely) before queries are submitted, even though the restrictions take effect only after a query is submitted. Methods for evaluating the disclosure risk for unreleased cells, given that some marginals have been released, have not yet been developed for all cases. Methods for tables of counts have been developed (Buzzigoli and Giusti, 1999, Dobra, 2000 and Dobra and Fienberg, 2000); these methods require intense computation and, so far, have only been implemented in prototypes. Optimal Tabular Release (OTR) (Karr et al., 2002) is a prototype system for selecting an optimal subset of releasable marginal count tables. The implementation exploits heuristics for faster solution, but it still seems complex and very computationally demanding, even for problems of average size. Specific restrictions are more elaborate restrictions that can also be set, for example restricting the maximum number of queries per user or banning the overlap of certain queries for each user. These restrictions are applicable to dynamic WSDDs and must be applied as POST SDC. Specific restrictions applied to each user require login and real-time monitoring of users. They may not be effective against coalitions of intruders but, if enforced in a reliable way, such measures can effectively reduce the need for perturbing the data. Table server is a prototype of POST query restriction developed within the DG project (Karr et al., 2002). Table server exploits the same risk-evaluation procedures as OTR, but the risk is evaluated with respect to all the information that has been released before a query is submitted. It seems important to note that such an approach needs careful planning in order to avoid the WSDD being driven to release only partial information because of peculiar requests coming from early users. Schemes of the PRE and POST implementation of query restrictions are shown in Figure 3.


Figure 3 Scheme of query restriction. PRE application on the left and POST application on the right.
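A minimal sketch of the POST query-restriction idea of Figure 3: each authenticated user has an audit log, and a query is denied if the requested table exceeds a maximum dimension or the user has exhausted a query budget. The limits and the representation of a query as a set of spanning variables are assumptions for illustration only.

```python
class QueryRestrictor:
    """POST query-set restriction with per-user auditing (sketch).

    A query is represented simply as the set of classifying variables it spans.
    """

    def __init__(self, max_dimension=3, max_queries=50):
        self.max_dimension = max_dimension
        self.max_queries = max_queries
        self.audit_log = {}          # user -> list of answered queries

    def submit(self, user, spanning_vars):
        history = self.audit_log.setdefault(user, [])
        # Generic restriction: limit the dimension of releasable tables.
        if len(spanning_vars) > self.max_dimension:
            return "denied: table dimension too high"
        # Specific restriction: limit the cumulative number of queries.
        if len(history) >= self.max_queries:
            return "denied: query budget exhausted"
        history.append(frozenset(spanning_vars))
        return "approved"


restrictor = QueryRestrictor(max_dimension=2)
print(restrictor.submit("user42", {"region", "sex"}))           # approved
print(restrictor.submit("user42", {"region", "sex", "age"}))    # denied
```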

5 Conclusions

Web dissemination will certainly become the most used way of releasing data in the future. However, the task of keeping the confidentiality of respondents safe will unavoidably become harder as the amount of information offered on the Web increases. The design of the methodology for the protection of a WSDD requires different choices. The first is the choice of the release policy. Depending on the type of data to be released, two opposite approaches can be followed: the first privileging the quality of the released data over the amount of data that can be accessed; the second preferring the dissemination of a higher amount of less accurate data. The first case implies the adoption of methods that suppress few data or deny access, whereas the second requires techniques that perturb the data (source data or cell values). In some cases, it is possible to combine both strategies at the same time. The second choice consists of deciding whether to apply SDC methods PRE or POST. Certainly, PRE SDC methods are more easily implemented, do not require secure access and produce safe data; the drawback is the risk of over-protection and therefore lack of data access. POST SDC methods are computationally demanding but more flexible and customisable to the needs of users; however, they need access restrictions and may not prevent disclosure when there is collusion among users. Several researchers fear the over-protection of Web sites and advocate a policy that broadens access to data by ensuring the proper use of such data (Trivellato, 2000 and Duncan et al., 2001). In order to increase the amount of data released, new strategies for applying SDC sequentially in response to a series of queries from each user, possibly through fully automated expert systems, are under study. Since secure identification methods are not yet available, a dissemination policy that distinguishes among user types might be adopted. Data products designed for public Web sites are necessarily general purpose; for this reason extreme care should be taken to avoid any possible risk of breaches of confidentiality. However, looser SDC rules could be applied to sites devoted to qualified researchers. This adaptive policy is already in use in several NSIs, giving research far more opportunities, especially for microdata files.

Acknowledgements

This work was partially supported by the European Union project IST-2000-25069 CASC on "Computational Aspects of Statistical Confidentiality".

References

Adam, N.R. and Wortmann, J.C., 1989. "Security-control methods for statistical databases: a comparative study". ACM Computing Surveys, 21, pp. 515-556.

Blakemore, M., 2001. "The potential and perils of remote access". In Doyle, P., Lane, J.I., Theeuwes, J.J.M. and Zayatz, L. (Eds), Confidentiality, Disclosure and Data Access: Theory and Practical Application for Statistical Agencies. Elsevier Science, pp. 315-337.

Brand, R., 2002. "Microdata protection through noise addition". In Inference Control in Statistical Databases, Lecture Notes in Artificial Intelligence. Springer-Verlag.

Buzzigoli, L. and Giusti, A., 1999. "An algorithm to calculate the lower and upper bounds of the elements of an array given its marginals". In Statistical Data Protection, Proceedings of the Conference. Luxembourg, pp. 131-147.

Cox, L.H., 1999. "Some remarks on research directions in statistical data protection". In Statistical Data Protection, Proceedings of the Conference, Lisbon. Luxembourg: Eurostat, pp. 163-176.


Dalenius, T. and Reiss, S.P., 1982. "Data-swapping: a technique for disclosure control". Journal of Statistical Planning and Inference, 6, pp. 73-85.

David, M.H., 1998. "Killing with kindness: The attack on public use data". Proceedings of the Section on Government Statistics, American Statistical Association, pp. 3-7.

Dobra, A., 2000. "Measuring the Disclosure Risk in Multiway Tables with Fixed Marginals Corresponding to Decomposable Loglinear Models". Technical Report, Department of Statistics, Carnegie Mellon University.

Dobra, A. and Fienberg, S.E., 2000. "Bounds for cell entries in contingency tables given marginal totals and decomposable graphs". Inaugural Article, PNAS 97: 11885-11892. Also available at http://www.pnas.org/

Duncan, G.T. and Mukherjee, S., 2000. "Optimal disclosure limitation strategy in statistical databases: deterring tracker attacks through additive noise". Journal of the American Statistical Association, 95, pp. 720-729.

Duncan, G.T., Fienberg, S.E., Krishnan, R., Padman, R. and Roehrig, S.E., 2001. "Disclosure limitation methods and information loss for tabular data". In Doyle, P., Lane, J.I., Theeuwes, J.J.M. and Zayatz, L. (Eds), Confidentiality, Disclosure and Data Access: Theory and Practical Application for Statistical Agencies. Elsevier Science.

Duncan, G.T., 2001. "Confidentiality and statistical disclosure limitation". In N. Smelser and P. Baltes (Eds.), International Encyclopedia of the Social and Behavioral Sciences. New York: Elsevier.

Fienberg, S.E., 2000. "Confidentiality and Data Protection Through Disclosure Limitation: Evolving Principles and Technical Advances". The Philippine Statistician, 49, pp. 1-12.

Fienberg, S.E., Makov, E.U. and Steele, R.J., 1998. "Disclosure limitation using perturbation and related methods for categorical data". Journal of Official Statistics, 14, pp. 485-502.

Fienberg, S.E. and Makov, U.E., 2001. "Uniqueness and disclosure risk: Urn models and simulation". In ISBA 2000 Proceedings. Luxembourg: Eurostat.

Franconi, L. and Stander, J., 2002. "A model based method for disclosure limitation of business microdata". The Statistician, 51:1, pp. 1-11.

Gouweleeuw, J.M., Kooiman, P., Willenborg, L.C.R.J. and de Wolf, P.P., 1998. "Post randomization for statistical disclosure control: theory and implementation". Journal of Official Statistics, 14, pp. 463-478.

Hoffman, L.J., 1977. Modern Methods for Computer Security and Privacy. Prentice-Hall, Englewood Cliffs, N.J.

Hundepool, A., 2001. "Computational Aspects of Statistical Confidentiality: the CASC Project". Statistical Journal of the United Nations ECE, 18, pp. 315-320.

Karr, A.F., Lee, J., Sanil, A.P., Hernandez, J., Karimi, S. and Litwin, K., 2001. "Web-based systems that disseminate information but protect confidentiality". In Elmagarmid, A.K. and McIver, W.M. (Eds), Advances in Digital Government. Kluwer, Amsterdam.

Karr, A.F., Dobra, A., Sanil, A.P. and Fienberg, S.E., 2003. "Software Systems for Tabular Data Releases". To appear in International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems. Downloadable at http://www.niss.org/dg/index.html

Keller-McNulty, S. and Unger, E.A., 1998. "A database system prototype for remote access to information based on confidential data". Journal of Official Statistics, 14, pp. 347-360.

Malvestuto, F. and Moscarini, M., 1999. "An audit expert for large statistical databases". In Statistical Data Protection, Proceedings of the Conference, Lisbon. Luxembourg: Eurostat, pp. 29-43.

Matloff, N.S., 1988. "Inference Control Via Query Restriction Vs. Data Modification: A Perspective". In Carl E. Landwehr (Ed.), Database Security: Status and Prospects. Results of the IFIP WG 11.3 Initial Meeting, Annapolis, Maryland, October 1987. North-Holland, pp. 159-166.

Trivellato, U., 2000. "Data access versus privacy: an analytical user's perspective". Statistica, LX, pp. 669-689.

Willenborg, L. and de Waal, T., 2001. Elements of Statistical Disclosure Control. Lecture Notes in Statistics, 155, Springer-Verlag: New York.

Winkler, W.E., 1998. "Re-identification methods for evaluating the confidentiality of analytically valid data". Research in Official Statistics, 1, pp. 87-104.

Zayatz, L., 2002. "SDC in the 2000 U.S. decennial census". In Inference Control in Statistical Databases, Lecture Notes in Artificial Intelligence. Springer-Verlag.


Discussion Paper for Topic (iii): Emerging legal/regulatory issues

John King (UK)

There are three invited papers and four contributed papers for this topic. Together they illustrate the very wide range of backgrounds – legal, regulatory and pragmatic – that pertain in both the Member States and other countries.

Research Data Centres: The future of co-operation between empirical science and official statistics (Working Paper 10). This paper sets out current developments on access to confidential data in Germany and provides a very useful description of their history. Terminology differs: it would appear that "completely anonymised data" or "absolutely anonymised data" are what others refer to as "anonymised data", and that Public Use Files are created from these safe data. On the other hand, "factually anonymised data" appear to be data with direct identifiers removed (and possibly some further anonymisation treatments), but which still permit indirect identification. Scientific Use Files are created from these data, thus permitting researchers access to data which are confidential, with distinct risks of re-identification. A Commission for the improvement of the information infrastructure between science and statistics reported recently. This included the recommendation to set up Research Data Centres. These provide the following types of access:

• Scientific and Public Use Files
• Visiting Researcher Desktop (previously the "One Dollar Man")
• Controlled Remote Data Processing
• Special Data Processing.

A recognised problem is the anonymisation of business or company data, but this is being investigated. Current possibilities being considered include loss of (small) regional information, global recoding of some variables, like turnover, and top-coding. All of these are thought to result in a high loss of information. The "One Dollar Man" had a terminable employment contract with the statistics office, thus permitting access to confidential datasets (cf. the US Census Bureau's SSS approach). This has been "regularised" by introducing the Visiting Researcher Desktop. Here the researcher remains an employee of his institution but accesses confidential data – of a SUF type but with a lower level of anonymisation treatment – on the premises of the RDC. The researcher receives only aggregated data, in the form of tables, from his research.


An issue is that access seems to be limited to German nationals. Another issue is the federal structure of Germany, with the Länder responsible for data collection (and processing?). This means that the creation of Germany-wide datasets is not easy. This also seems to be reflected in the different RDCs being set up – one (or two?) by the Federal Statistical Office and the other by the statistical offices of the Länder (with centres also in each Land?). Several different datasets are described, each with a different level of anonymisation, for different purposes and for access by different categories or types of researchers. Are all these necessary? Are they articulated or consistent?

Data sharing for Statistics in the UK (previously Contexts for the development of a data access and confidentiality protocol for UK National Statistics) (Working Paper 11). This paper sets out some of the background to the development of Protocols – clear statements on statistical practice in the areas of confidentiality and disclosure control. Some of the difficulties in statistical work arise from the lack of clear legal provisions and requirements (a non-statutory environment). Issues are administrative and case law (requiring legally established precedents) and the devolved nature of the constituent parts of the UK and of the "statistical service" of government. Early practice on the release of datasets was ad hoc and pragmatic. The National Statistics Code of Practice and, now, the draft Protocol are providing a framework for the systematic treatment of confidential data and for data access. The draft Protocol sets out a series of principles and conditions. These clarify and make transparent the proposed arrangements, but in many cases there is no explicit legal backing.

Developments at Eurostat for research access to confidential data (Working Paper 12). This paper summarises the context of a recent regulation concerning access to confidential data for scientific purposes (Regulation 831/2002). The way it is intended to provide greater access to confidential data is described. Implications for NSIs in MSs, for researchers and for the data subject are indicated. Some questions that arise in its implementation are also discussed. There is a clear contrast here between the non-statutory environment of the UK, described above, and the legal environment of the EU – the laws set out the agreements of the MSs on what can be done, and, in some instances, how, and what cannot be done.


Use of the principle of statistical data confidentiality in Kyrgyzstan (Working Paper 31). The paper mentions the underlying basic legal provisions. It goes on to indicate that some of the most important issues currently have to do with educating both the statistical staff and would-be users (of microdata) on the application of these legal provisions. Also, the protection of confidential data during storage and transmission needs to be developed.

Report of the CEIES conference (Working Paper 32). This gives a summary of the conclusions of a seminar organised by CEIES (European Advisory Committee on Statistical Information in the Economic and Social Spheres). The seminar (the 19th) was held in Lisbon last October on "Innovative solutions in providing access to micro-data". The research community was strongly represented at the seminar. Broadly, the conclusions were that much significant research in the social and economic spheres, both fundamental and of relevance to the formulation and evaluation of public policies, can only be undertaken with microdata; it cannot be done using published statistics or aggregate records. As a consequence, there is an urgent need to improve research access to microdata produced by National Statistical Institutes. But protection of confidentiality must remain paramount in all statistical records. There are several different routes to providing access to microdata. First, it is important to stress that academic researchers and their institutes are willing to sign contracts and undertake to protect the data. Their livelihoods depend on doing so, and the experience of the past decade reveals few if any cases where data have been misused. Second, it is clear that anonymised microdata files are the first and most efficient route of access. Evidence from Canada shows that the availability of anonymised microdata files stimulates research and, once produced, the data files have no additional costs. Where this is not feasible – either because the data are too confidential or because too much detail is needed – alternative modes of access need to be found. Traditionally, safe settings are established. However, these are expensive to install and run, and it is often difficult for researchers to stay for extended periods at a location distant from their place of work. It seems, though, that we are now moving to a situation where there are technological solutions which can produce a 'virtual safe setting' over the Internet. We feel that this is an area that needs to be explored as a cheaper and much preferred alternative to safe settings. It is important to stress that, in this virtual setting, the researcher needs to be able to conduct both exploratory and confirmatory analysis.

Questions relating to the confidentiality of statistical information at the National Statistical Service of the Republic of Armenia (Working Paper 33).


The paper sets out the basic legal framework and the current actions to harmonise the basic law with other acts and with evolving practices. In particular, a systematic approach is being followed, under which various aspects of operations and practices are being specified, for example:

• the obligation on NSS officials not to disclose statistical information;
• instructions for work on confidential statistical information;
• computer and IT system security measures for protecting confidential data during all stages of collection, processing, storage and transmission.

Training is seen as a key component, in addition to the development of the documents.

Demand for data and access to data: the Research Data Centre of the statistical offices of the Länder (Working Paper 41). This complements Working Paper 10 by describing developments in the statistical offices of the German Länder. Enterprise survey data are seen to carry a high risk of de-anonymisation, so approaches permitting access to confidential data are proposed. This will be through a Visiting Researcher working in the statistics office with a contract (contractually agreed co-operation), and through controlled remote data processing, where results are returned to the researcher after checking. A survey of potential users of the RDCs showed high interest. There was a high preference for the SUF option, with the facility of access and use on one's own desktop/PC. Another issue is the proliferation of other datasets, which may increase the risk of de-anonymisation. This will be studied.

Refer also to the papers of Andersen (Working Paper 29) and Shlomo (Working Paper 6). Working Paper 6 shows the origins of the work to be a need to improve the procedures and regulatory background to disclosure control, in particular a need to reassess the level of security of confidential files released for research. The earlier situation is characterised as one of "ad hoc" decisions with no systematic way of evaluating the risks associated with datasets or files, the main procedures used being global recoding and suppression of sensitive variables. (A situation similar, it seems, to that pertaining in the UK.) Andersen concludes, from some experience with both forms of access and control, that remote access is effectively equivalent to on-site access.


From the CEIES seminar, an interesting paper from INE, Portugal, concludes that: "It is possible to use the new technological environment to provide micro data access with analytical online services. But it is better to solve confidentiality problems (legal and procedural) first."

Common threads

Several papers, speakers and MSs now have various types of access for researchers or scientists from their own country, with no plans for other researchers to be included. Other MSs make provision for this, or at least make no distinction as to where the researchers are. But this is explicitly provided for in the regulation 831/2002, subject to an opinion of the Committee on Statistical Confidentiality.

There is a need to keep abreast of developing legal issues, e.g. data protection of the individual, and of methodology and the availability of other datasets or information.

There is a need to be clear about terms like "anonymised" – what do they mean precisely? Some now have legal definitions, and so these should be respected and used. There is also a need to be clear on risks and on how the datasets should be treated – are anonymised data confidential? Should there be a glossary of terms in order to remove doubt and ambiguity and to facilitate the debate across NSIs and with researchers?

Various different types of dataset, with differing associated risks, are developed to meet the needs of researchers. In some NSIs, individual datasets are created limited to just the specific variables the researcher needs. All of this is labour-intensive for the NSI and is a large claim on scarce resources. Can researchers' needs be met more simply and cheaply with standard datasets without increasing the risks?

It is better to solve confidentiality problems (legal and procedural) before providing various types of access and raising the expectations of the research community. There is a need to review activities in relation to access by researchers in the MSs and other NSIs, in order to develop best practice and a common approach.

Some questions

• What do we mean by "scientific"?

• Is "remote access" the right solution?

• Should this be together with anonymised micro-datasets / PUF / SUF? (CEIES said that access should be first through anonymised microdatasets.)

• To what extent is trust the most important element?

• Is there a partnership? What does the NSI get in return for access?


NEW WAYS OF ACCESS TO MICRODATA OF THE GERMAN OFFICIAL STATISTICS

Tom Wende (empirical researcher in the Research Data Centre (RDC) of the German Federal Statistical Office) and Markus Zwick (head of the Research Data Centre (RDC) of the German Federal Statistical Office)

History. Beginning this essay about the development of a better informational infrastructure and the work of Research Data Centres, it may be useful to take a short glance at the history of microdata use in Germany. In the past, it was seen as sufficient for data users to work with aggregated data, such as tables and indexes, published by the statistical offices. But the accelerating change of society, and the increasing number of new societal questions resulting from it, changed scientific interest, and aggregated data were no longer enough. The first requests for official statistics microdata by the scientific community came in the early 1970s. A group of researchers at the Universities of Mannheim and Frankfurt founded a research project called SPES (Sozialpolitisches Entscheidungs- und Indikatorensystem für die BRD), which tried to create a socio-political decision and indicator system for the Federal Republic of Germany using official microdata. From this project evolved the so-called Special Research Sector 3, "Microanalytic Basics of Society Politics" (Sonderforschungsbereich 3 – SFB 3 – Mikroanalytische Grundlagen der Gesellschaftspolitik), which dealt with matters of social policy and econometrics. This pioneering work, which showed the urge to use microdata for societal research, paved the way for still ongoing changes in law and for the development of an informational infrastructure for the empirical use of microdata bases. At almost the same time, a project called VASMA (Vergleichende Analysen der Sozialstruktur mit Massendaten) dealt with the comparative analysis of the social structure using population data.

Legal Basics. The first legal regulation for the use of official microdata was made in the federal law on statistics in its version of 1981. It allowed the passing on of completely anonymised microdata in §11 (5) BStatG. This of course left a lot of restrictions, as complete anonymisation always goes along with an enormous loss of information. Nevertheless, this law brought an epoch-making change, as it offered the first legal opportunity for official statistics to give out so-called Public Use Files (see the section on SCIENTIFIC USE FILES AND PUBLIC USE FILES), which are absolutely anonymised datasets of official statistics. It also showed a way forward.


A more satisfying solution for empirical researchers was the next legal improvement: The federal law on statistics in it’s version of 1987 –especially in § 16(6) BStatG brought up the so called "Privilege of Science", which means, that from that point on, scientists were allowed to receive factually anonymised microdata4. Excursus 1: The Development of Anonymisation Criteria. Between 1988 and 1991 a large-scale research-project aiming for anonymisation of selected microdata has been performed. Representatives of the german statistical offices worked alongside with representatives of the data protection registrars and of the empirical science, like the University of Mannheim and ZUMA – the Centre for Survey Research and Methodology. In the course of this project some measures have been developed for a specific factual anonymisation of the Microcensus and the Sample Survey of Income and Expenditure. The results of this research project culminated in two reports: "Textbook for the building of factually anonymised data regarding the Microcensus” and "Textbook for the building of factually anonymised data regarding the Sample Survey of Income and Expenditure”. End of Excursus National and international data request. In the past chapter the way to so called Public (PUF) and Scientific Use Files (SUF) was described. You may ask yourself, why that is such an important development. The example of international data access will show you. Before PUF and SUF the access to official microdata was hardly possible at all. After 1981 – with the implementation of Public Use Files - data access was possible for everyone, but with a lot of restricted and non-accessible information. After 1987 the researcher´s need for less restricted data was answered with the invention of Scientific Use Files, which are only provided to german appliers by now. But what if a researcher from a foreign country wants access to german official data? By now this is very difficult according to the lack of international data access regulations. First solutions are given by the EC-Regulations 322/97 and 831/2002, which give access possibilities to common micordata. The 831/2002 commision act specifies this access for four european surveys at least for members of the EC: The CVTS (Continuing Vocal Training Survey), CIS (Community Innovation Survey), LFS (Labour Force Survey) and ECHP (European Community Household Panel). The offered possibilities are for example the controlled remote data processing5 or the possibility for a visiting researcher to work in the safe area of the german statistical offices6.

[4] Factual anonymisation means that the data are not absolutely anonymised, so there is a chance of disclosure for a potential intruder, but the cost of disclosure is much higher than its benefit to the intruder; also see SCIENTIFIC USE FILES AND PUBLIC USE FILES.
[5] See CONTROLLED REMOTE DATA PROCESSING AND SPECIAL DATA PROCESSING.
[6] See SAFE SCIENTIFIC WORKSTATION.
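The cost-benefit idea in footnote [4] can be summarised as an informal inequality; this is only a sketch of the criterion, not the legal wording of §16 (6) BStatG:

\[
C_{\text{re-identification}} \;\gg\; B_{\text{intruder}}
\]

In words: a file counts as factually anonymised when the effort (time, money, expertise) a rational intruder would have to invest to re-identify a single unit clearly exceeds any benefit the intruder could expect from doing so.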

Research Data Centres (RDC) of the official statistics. In 1999 the Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung - BMBF) set up a Commission for the Improvement of the Informational Infrastructure (Kommission zur Verbesserung der Informationellen Infrastruktur - KVI). The task of this commission was to review the informational infrastructure of the Federal Republic of Germany (BRD) and to work out new concepts for the exchange of data between the scientific community and data producers. The KVI worked out a number of recommendations, which are described in detail in its final report [7]. One of the first and most fundamental recommendations of the KVI was the establishment of so-called Research Data Centres (RDC). Implementation of this recommendation started almost immediately. On 1 October 2001 the Research Data Centre of the Federal Statistical Office of Germany (Statistisches Bundesamt) was established in Wiesbaden. On 1 April 2002 the RDCs of the Statistical Offices of the Federal States (Statistische Ämter der Länder), with one location in each federal state, were founded. The Research Data Centres offer a lot of opportunities for microdata access and thus an extraordinary improvement of the informational infrastructure between the official statistics and empirical science. They provide a well-balanced range of services for users. The RDC of the Federal Statistical Office and the RDCs of the Statistical Offices of the Federal States are independent but co-operate closely with each other. The main focus of the Federal States' Research Data Centres is centralised data storage, a widespread network of Safe Scientific Workstations [8] and the supply of metadata for decentralised surveys. The focus of the Research Data Centre of the Federal Statistical Office is the development of Scientific and Public Use Files [9], the improvement of Controlled Remote Data Processing [10] and the supply of metadata for centralised surveys. Together, all Research Data Centres are keen on developing a high-quality metadata system, consult data users and push forward the further improvement of the informational infrastructure. The main functions of the RDCs are:
a) carrying on the further development and implementation of the recommendations given by the KVI;
b) serving as an interface between the official statistics and the scientific community;
c) providing consulting and service for the use of official microdata;
d) creating and providing possibilities for access to microdata with a lower level of anonymisation.

[7] KVI (Hrsg.) 2001: Wege zu einer besseren informationellen Infrastruktur. Baden-Baden: Nomos Verlagsgesellschaft.
[8] See SAFE SCIENTIFIC WORKSTATION.
[9] See SCIENTIFIC USE FILES AND PUBLIC USE FILES.
[10] See CONTROLLED REMOTE DATA PROCESSING AND SPECIAL DATA PROCESSING.

The creation of the RDCs is a great improvement of the informational infrastructure because, for the first time, there is a single, well-developed point of access for the use of official microdata. Several ways of access to official microdata, such as Controlled Remote Data Processing and Safe Scientific Workstations [11], are already in place. The RDCs also offer consulting and service for the use of official microdata. Let us now turn to the Research Data Centres' work in practice. As already mentioned, the RDCs offer different ways of microdata access:
• Scientific and Public Use Files
• Safe Scientific Workstations
• Controlled Remote Data Processing
• Special Data Processing

Scientific Use Files and Public Use Files. One possibility for microdata use is the purchase of a Scientific or Public Use File. Different surveys are already available in that format. For example, different waves of the Microcensus, the Sample Survey of Income and Expenditure or the Statistics of Road and Traffic Accidents, and many more, can be obtained as SUF. Available as Public Use Files are, for example, different waves of the Time Use Survey or the Social Welfare Statistics. One important aim of the Research Data Centres is to broaden the range of PUF and SUF considerably in the near future. Scientific Use Files and Public Use Files are anonymised to different degrees. The Public Use Files offer no way to draw conclusions about single cases in the surveyed population. The Scientific Use Files do theoretically offer that possibility, but the expense is much higher than the benefit of disclosing the factually anonymised data [12]. The right to use Scientific Use Files is reserved by the German statistics law, as the name implies, to the scientific community. That is another confidentiality safeguard of these files, because in case of a breach of confidentiality the German researcher can be prosecuted by law. The advantage of giving out anonymised files is that a researcher is able to work with his own software on his own PC; the disadvantage is the loss of information resulting from anonymisation. The Research Data Centres offer some further possibilities of data access which, in combination with the supply of Public and Scientific Use Files, close the circle of the informational infrastructure and together provide a good balance between empirical research interests and data confidentiality. In particular, these are the Safe Scientific Workstations and the options of Controlled Remote Data Processing and Special Data Processing, which will be described in the following sections.

[11] See CONTROLLED REMOTE DATA PROCESSING AND SPECIAL DATA PROCESSING.
[12] §16 (6) BStatG.

Controlled Remote Data Processing (CRDP) and Special Data Processing (SDP). If a researcher needs more information than a Public or Scientific Use File can offer, or if no standardised SUF or PUF is yet available for a certain survey, there are ways to work with less anonymised or even non-anonymised data via the Research Data Centres. One way is to work, in a first step, with an anonymised dataset, for example a standardised Scientific or Public Use File or, if a SUF is not available, with a so-called structural dataset (which corresponds to the original dataset in all structural attributes but not in content attributes), and, in a second step, to send the resulting syntax for software such as SAS, SPSS or STATA to the RDC, where it is run on the original data under internal control. This is called controlled remote data processing. A special form of controlled remote data processing is special data processing. In that form the applicant describes his research interest to a representative of the Statistical Office, and the representative does the empirical work. One advantage of controlled remote and special data processing for data confidentiality is that the computing process remains under control and the representatives of the Research Data Centre know exactly what information is given to the researcher. Another advantage is that the output is not microdata but aggregated data in the form of tables, which are much easier to protect. The advantage for the researchers is that they can make precise statements about the whole population, with a lower standard error and in general a low error variance. Further advantages are that the consulting function of the Research Data Centres can be used and that it becomes possible to work with company data, which was not possible before. The disadvantages are that these processes mean much more work and cost for both the researchers and the representatives of the official statistics and, as a result, need much more time.
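The division of labour in controlled remote data processing can be illustrated with a small sketch. The following Python fragment is purely illustrative and is not software used by the RDCs; the file names, the variables, the minimum cell frequency of three and the helper functions are assumptions made for the example. It mimics the two key steps: the submitted job, developed by the researcher against a structural dataset, is re-run inside the statistical office on the original data, and only aggregated output that passes a simple frequency check leaves the safe environment.

```python
import pandas as pd

MIN_CELL_FREQUENCY = 3  # assumed release rule: suppress cells based on fewer than 3 units


def run_submitted_job(original_data: pd.DataFrame) -> pd.DataFrame:
    """The researcher's job, written against a structural (dummy) dataset,
    executed here by RDC staff on the original microdata."""
    return (
        original_data
        .groupby(["region", "household_type"])
        .agg(n_units=("income", "size"), mean_income=("income", "mean"))
        .reset_index()
    )


def confidentiality_check(table: pd.DataFrame) -> pd.DataFrame:
    """Release only cells whose underlying number of units meets the threshold;
    smaller cells are suppressed before the table is sent back (primary
    suppression only - real checks are more extensive)."""
    cleared = table.copy()
    too_small = cleared["n_units"] < MIN_CELL_FREQUENCY
    cleared.loc[too_small, ["n_units", "mean_income"]] = float("nan")
    return cleared


if __name__ == "__main__":
    # Inside the RDC: load the confidential microdata (path is illustrative).
    microdata = pd.read_csv("confidential_microdata.csv")
    released = confidentiality_check(run_submitted_job(microdata))
    released.to_csv("output_for_researcher.csv", index=False)
```

In practice the checks applied by the RDC staff go further (for instance dominance rules and secondary suppression), but the principle is the same: the researcher never touches the original file, and only checked aggregates are returned.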

Safe Scientific Workstation (SSW). The RDCs provide another new way of data access in the protected area of the German statistical offices. A visiting researcher gets the possibility to access microdata on sealed-off computers at specially equipped safe scientific workstations in the statistical offices. By working at a safe scientific workstation the researcher gets on-site access to factually anonymised data. The difference in anonymisation between an On-Site Scientific Use File and the standardised SUF that is given out, which is also factually anonymised, is that the anonymisation criteria in the on-site case are weaker, because other means of confidentiality control apply, such as the fact that the visiting researcher is given no way of data transfer except for his aggregated output in the form of tables.
For the further improvement of this way of microdata access, the production of many more On-Site SUFs is planned. It became clear in the past sections that the new ways of data access offered by the RDCs guarantee confidentiality by means of well-balanced anonymisation. There is only one way of access to non-anonymised microdata: the co-operation of a statistical office with a researcher in a single research project initiated by the statistical office, also known as a "One-Dollar-Man Contract". Such a contract can be concluded if there is a research interest that is primarily useful for the official statistics and only secondarily for an external researcher. In that case the researcher signs a fixed-term employment contract (with the symbolic payment of one dollar) with a statistical office and works with microdata as an employee of that statistical office, and is therefore bound to confidentiality like every employee of the statistical offices. But that way of microdata access is an exception, which only comes about if a statistical office needs external help in a single project. It is mentioned here just to complete the menu of microdata access possibilities in Germany. At the end of this essay, let us take a short glance at the projects planned for the further improvement of German microdata access.

Projects in the future. In the future the Research Data Centres will work on expanding low-cost microdata access in the form of standardised Scientific and Public Use Files. Controlled Remote Data Processing will be simplified and improved. Furthermore, the consultancy capacity for visiting researchers at safe scientific workstations and for researchers who use CRDP or SDP will be improved. The RDCs are also working on the central availability of all official microdata and on the elaboration of a comprehensive metadata system for all official data.

Contexts for the Development of a Data Access and Confidentiality Protocol for UK National Statistics

Paul J Jackson

Office for National Statistics, London, United Kingdom.

Abstract. The background to any arrangements for allowing access to government data and the protection of confidentiality in the UK is complex. For example, the UK has devolved political and statistical systems. It has a Statistics Commission (that provides assurance to Government that National Statistics are trustworthy and responsive to public need), but no Statistics Act. It has a legal system founded on Common Law, but government departments must also act within Administrative Law. Like other European countries, the UK also has a Data Protection Act and an independent regulator for information - an information 'watchdog'. There is also a powerful privacy lobby, within the libertarian tradition. The UK has a National Statistics Code of Practice that provides the key principles and standards which official statisticians in the UK are expected to follow and uphold. It will be supported by 13 protocols that describe how these principles and standards are to be implemented in practice, one of which is the Protocol on Data Access and Confidentiality. This paper describes the main features of this protocol.

Keywords: Confidentiality; data access; law; data-sharing; protocols

1. UK background to confidentiality in official statistics
To understand the context of a National Statistics Protocol for Data Access and Confidentiality, one needs to be aware of the background political, legal, regulatory and statistical systems in the UK.
1.1 UK has a devolved political system.
Special and distinct arrangements for the production of statistics exist in England, Scotland, Wales and Northern Ireland.
The National Statistician is the UK Government's chief adviser on statistical matters, is the Registrar General for England and Wales, and is Head of the Government Statistical Service. There is an independent Registrar General for Scotland and for Northern Ireland. The National Statistician is responsible for statistics at the UK level and reports to the Minister for National Statistics (the Economic Secretary to the Treasury) in the UK Parliament. The Chief Statistician for Scotland reports to the Minister for Finance within the Scottish Executive. The Head of the Statistical Directorate for Wales reports to the Finance Secretary in the Welsh Assembly Government. The Chief Executive of the Northern Ireland Statistics and Research Agency reports to the Minister for the Department of Finance and Personnel in the Northern Ireland Executive.
1.2 UK has a devolved statistical system, under National Statistics and Government Statistical Service umbrellas.
National Statistics are a number of pre-defined outputs from the offices of the Registrars General, statistical divisions of government departments and devolved administrations, and all the products of the Office for National Statistics. The Office for National Statistics is the UK government's main survey-taking body. It has a responsibility to develop and maintain, at an operational level, both the integration and integrity of Government Statistical Service (GSS) outputs. The GSS comprises the statistical divisions of government departments, each of which has a Departmental Head of Profession for Statistics. Statisticians in these departments are members of the GSS, which sets the professional standards for its members and provides training and qualifications. In Northern Ireland, the Northern Ireland Statistics and Research Agency (NISRA) is responsible for the statistical products of the Northern Ireland administration's departments and agencies. In Scotland, the Chief Statistician is responsible for statistics produced by or for the Scottish Executive. The National Assembly for Wales has its own statistical directorate. A code of good practice for the production of National Statistics has been developed, entitled the National Statistics Code of Practice. This sets out the key principles and standards which official statisticians in the UK are expected to follow and uphold. It will be supported by 13 protocols that describe how these principles and standards are to be implemented in practice.
1.3 UK has a Statistics Commission.
The Statistics Commission was established as part of the creation of National Statistics. It concerns itself with quality, priorities and procedures to provide assurance to Government Ministers that National Statistics are trustworthy and responsive to public need. It gives independent, public advice on National Statistics.
1.4 UK does not have a Statistics Act.
The UK instead has explicit and implied powers for certain public bodies to collect and process information to produce statistics, found in enactments that are numerous, diverse and not necessarily drafted with the intention that they be complementary or coherent. Examples are:
- the Census Act (1920), a power to conduct the census in England and Wales, given to the Registrar General,
- the Population and Statistics Act (1958), requiring certain information to be provided to the Registrar General for statistical purposes at the time of registering a life event,
- the Statistics of Trade Act (1947), requiring traders to provide the appropriate department with certain information relating to their business, only for the purposes of statistics.
The Registrar General, or any other department given powers by these statistical enactments, is usually subject to the common law duty of confidentiality or to specific sections of the legislation. This prevents, for example, the transfer of identifiable Census data to other government departments for their statistical purposes. Many National Statistics Institutions arrange access to confidential data for statistical researchers by asking them to sign their Statistics Act. The researcher subjects themselves to the conditions and legally enforceable penalties found in the Act. There is no equivalent to this available to the National Statistician, the Registrars General in the devolved administrations, or Heads of Department for Statistics, in the UK statistical system.
1.5 UK has a legal system founded on Common Law.
UK law is not a collection of decrees or directives, but a developing set of answers to real problems discovered by an independent judiciary in the courts. Legal order in the UK has the particular case, and not 'directives', as its paradigm. Cases build into Torts - civil 'wrongs' which may provide individuals with a cause of action for damages for breach of legal duty. Most important for the production of statistics is the common law tort of 'breach of confidence', which provides the public with a reasonable expectation that the information they provide to a government department is confidential between them and that government department. A government department must not disclose this information unless required to by statute, or there is an overriding public duty to disclose, or consent to the disclosure has been gained.
1.6 UK government departments must act within Administrative Law.
Administrative law governs the actions of all public bodies. A public body must have the lawful authority to carry out its intended functions. If not, its action will be
ultra vires - beyond its administrative powers. Public bodies must therefore have statutory authority for sharing confidential data, and must use these powers only for the purposes for which they were given. This authority may be explicit or in some cases implied. Each government department, indeed each local authority, is a separate and distinct administration, and there is no presumption that information given to one department is available to any other. In the UK, you do not give your data to 'the government', you give it to one particular department, agency or authority. In the absence of either consent or enabling legislation (a 'gateway'), a public body which discloses confidential information to another would be acting ultra vires. As an example, this means that confidential personal benefits data collected by the Department for Work and Pensions cannot be disclosed in that form to the Office for National Statistics for the production of statistics, unless there is consent for such a disclosure or a suitable 'gateway'. Consent for this disclosure has not been gained, and a suitable gateway has yet to be found.
1.7 UK has a Data Protection Act.
In common with other Member States, the UK has data protection legislation that determines how personal data should be processed. The Act is complex, but in its fundamentals it can be expressed as 8 principles:
1. Personal data shall be processed fairly and lawfully and, in particular, shall not be processed unless conditions [listed in the Act] are met.
2. Personal data shall be obtained only for one or more specified and lawful purposes, and shall not be further processed in any manner incompatible with that purpose or those purposes.
3. Personal data shall be adequate, relevant and not excessive in relation to the purpose or purposes for which they are processed.
4. Personal data shall be accurate and, where necessary, kept up to date.
5. Personal data processed for any purpose or purposes shall not be kept for longer than is necessary for that purpose or those purposes.
6. Personal data shall be processed in accordance with the rights of data subjects under this Act.
7. Appropriate technical and organisational measures shall be taken against unauthorised or unlawful processing of personal data and against accidental loss or destruction of, or damage to, personal data.
8. Personal data shall not be transferred to a country or territory outside the European Economic Area unless that country or territory ensures an adequate level of protection for the rights and freedoms of data subjects in relation to the processing of personal data.
When processing data for the purposes of statistics or research, there is an exemption from the second part of the second principle, and from section 7 of the Act, which gives data subjects the right of access to their data. Provided that they are being kept for a specified purpose, the data may be kept indefinitely.
1.8 UK has an independent regulator for information - an information 'watchdog'.
The Office of the Information Commissioner regulates public and private bodies for compliance with the Data Protection Act (1998) and the Freedom of Information Act (2000). The Commissioner is the UK's independent supervisory authority, reporting directly to the UK Parliament. The Commissioner has a range of duties including the promotion of good information handling and the encouragement of codes of practice. The Commissioner's mission is: "We shall develop respect for the private lives of individuals and encourage the openness and accountability of public authorities
- by promoting good information handling practice and enforcing data protection and freedom of information legislation; and
- by seeking to influence national and international thinking on privacy and information access issues."
1.9 UK government policy is to seek improvement in public services through sharing data and enhancing privacy.
The Lord Chancellor's Department (LCD) is the department that has responsibility for data protection, privacy and data-sharing law and policy. It is in the process of preparing guidance on the interpretation of administrative powers and the key principles within the Data Protection Act 1998 with regard to how data sharing can and should operate within the existing legal framework. The Strategic Unit of the Cabinet Office published 'Privacy and Data Sharing - The Way Forward For Public Services' in April 2002. It made a large number of recommendations, which the LCD has been taking forward since their adoption as Government policy in June 2002. The LCD is now investigating the need for new primary legislation to achieve this government policy, and must also encourage and conduct a public debate about privacy and the potential benefits of shared information.

1.10 UK does not have a population register.
The UK has a register of births, marriages, deaths and adoptions maintained by the Registrars General, but there is no requirement to maintain changes of address on the registers between these events. The uses of the registers are limited by statute, common law, and the limited administrative powers of the Registrars General. Several government departments have very large databases of personal information, often using common unique identifiers, for example the National Health Service Number, the National Insurance Number or the Driver's License Number. The Office for National Statistics (with its General Register Office hat on) maintains the National Health Service Central Register for England and Wales, which it does through the implied powers of the Registrar General, for the purposes of the National Health Service. The NHS number is not used as a sampling frame by ONS (the Postcode Address File, maintained by the Post Office, is the one most used), but it is used to estimate migration and to inform longitudinal studies. The NHSCR contains the NHS number, name, sex, date of birth and current Health Authority of the patient, including previous values. No medical information is associated with this register. The mid-year population estimate for England and Wales (1998) was 2,900,000 less than the number of entries on the NHSCR. There is also an NHSCR for Scotland, maintained by the General Register Office for Scotland.

1.11 The UK has a powerful privacy lobby, in the libertarian tradition.
Organisations such as 'Liberty' and 'Privacy International' are an effective lobby, and will campaign against anything the lobby sees as unwarranted state interference in everyday life. The lobby is particularly wary of purpose drift, where an enactment created for one purpose becomes used for others. In recent data-sharing conferences organised by the LCD, both Liberty and Privacy International questioned whether 'building trust' in public bodies is a legitimate aim at all, suggesting rather that scepticism and scrutiny are a sounder basis for the relationship between citizens and state. Both organisations are profoundly troubled by the prospect of a population register.

2. Imperatives for producers of official statistics
Therefore, producers of UK National Statistics need:
- To build public trust in statistical confidentiality, particularly because they depend to a large extent on consensual information gathering,
- To demonstrate that statistics is a cul-de-sac for confidential information and that processing data for National Statistics is not a threat to confidentiality or privacy,
- To make best use of existing information sharing gateways to improve statistical quality and accuracy,
- To investigate how the statistics departments of public bodies can take advantage of any new primary legislation to improve outputs, and to be involved in the drafting of such legislation,
- To co-operate with regulatory bodies to show full compliance with the word and the spirit of the law,
- To promote the National Statistics Code of Practice, and show that they can meet the challenges within it,
- To underpin the Code with protocols that provide standards and guidance for meeting its statements. One of the most important is the Protocol for Data Access and Confidentiality.

3. The National Statistics Protocol for Data Access and Confidentiality

The protocol that underpins the NS Code of Practice (see section 1.2) is intended to bring the commitment to confidentiality and the provision of access to confidential data into the same context. An organisation with good data management practices, good standards of disclosure control, good risk management, and an ability to audit the use of all its information, will be able to maintain the confidentiality of the information. These same qualities are those that need to be present for safe, fair and lawful third-party access to data. Throughout the protocol, it was necessary to refer to 'data used for the production of National Statistics' because the term 'National Statistics' defines some specific outputs of our devolved statistical system. There is no such thing as 'data owned by National Statistics', or even 'data used by National Statistics'. The public can provide data for National Statistics, but not to National Statistics. The Data Access and Confidentiality Protocol was recently out for public consultation on the National Statistics website. It contains a glossary of terms. The protocol consists of 10 principles:

Principle 1. The National Statistician will set standards for protecting confidentiality, including a guarantee that no statistics will be produced that are likely to identify an individual unless specifically agreed with them. This provides a statement that will build public trust that their confidentiality will not be breached. It also satisfies the 'relevant conditions' of the Data Protection Act for the use of the statistics and research exemptions. 'Individual' here means an individual statistical unit for which there is an obligation of statistical confidentiality.

Principle 2. Data ownership is non-transferable. The public know that information they give to a statistics department will always 'belong' to that department, and the conditions under which they agreed to give their information will be honoured by that department and no other.

Principle 3. Respondents will be informed, as far as practicable, of the main intended uses and access limitations applying to the information they provide for National Statistics. Again, to satisfy the aims of building openness, transparency and trust. Also, for personal data, to satisfy the fairness conditions in the first principle of the Data Protection Act.

Principle 4. The same confidentiality standards will apply to data derived from administrative sources as apply to those collected specifically for statistical purposes. A statement to ensure that all data used to produce National Statistics are treated equally.

Principle 5. Data provided for National Statistics will only be used for statistical purposes. The 'cul-de-sac' principle. Once data have entered the statistical system, under normal circumstances they will not leave as anything other than non-disclosive, non-confidential statistical information. This principle ensures that respondents can trust the producers of National Statistics not to inform the tax authorities, for example, of information that may be of interest to them.

Principle 6. Where information identifying individuals must be given up by law, it will be released only under the explicit direction and on the personal responsibility of the National Statistician. There are some very exceptional circumstances where statute or a court order requires the release of confidential statistical information. This principle reassures the public that this will be done only on the explicit direction of the highest authority in UK statistics.

Principle 7. Everyone involved in the production of National Statistics will be made aware of the obligation to protect respondent confidentiality and of the legal penalties likely to apply to wrongful disclosure. For example, UK civil servants are individually liable to prosecution for breaches of the Data Protection Act.

Principle 8. For anyone involved in the production of National Statistics, obligations to confidentiality will continue to apply after the completion of their service. The privileged information to which statisticians have access must not be used to breach confidentiality, whether or not they continue to be employed in a department that produces National Statistics.

Principle 9. Data identifying individuals will be kept physically secure. An absolute standard for physical security. This also satisfies the 7th principle of the Data Protection Act.

Principle 10. Access to identifying data will require authorisation. This will be available only to people who have signed an undertaking to protect confidentiality. The Head of Profession for Statistics must further be satisfied that the data will be used exclusively for justifiable research and that the information is not reasonably obtainable elsewhere. This is perhaps the most comprehensive principle in the protocol.

In the guide to putting the principles into practice, the conditions given in the next section - Principles of access - need to be met for third-party access:

4. Principles of access
(a) Access to identified or identifiable statistical sources will only be granted where it will result in a significant statistical benefit to a product or service that could reasonably be seen as a part of National Statistics.
i. Identified micro-data will only be made available to organisations professionally responsible to or contracted to the National Statistician, or a Head of Profession – Chief Statistician in the devolved administrations – for purposes consistent with the aims of National Statistics, subject to this protocol.
ii. Identifiable, but not identified, data will, subject to the conditions in this protocol, be made available to organisations professionally responsible to the National Statistician, or a Head of Profession – Chief Statistician in the devolved administrations – for purposes consistent with the aims of National Statistics, and to other organisations whose work is consistent with the aims of National Statistics, and
- who have a demonstrable need to access individually identifiable records to fulfil a stated statistical research purpose;
- who comply with access arrangements as agreed by the responsible statistician and specified in this protocol;
- who are able to satisfy the responsible statistician that all conditions pertaining to its use can be effectively maintained and fully audited.

(b) Non-disclosive micro-data will be made available for purposes consistent with the aims of National Statistics and any dissemination of aggregated statistics should include information on the availability of underlying, non-disclosive micro-data. However, the responsible statisticians will be expected to monitor changes of use, new technology and other factors, which may require statistical sources to be reclassified as appropriate. In addition to these principles, there are certain arrangements to be in place: Arrangements for access (c) The responsible statisticians will keep an exact, up-to-date inventory of statistical sources and record the details of any access to micro-data provided to a third party. Records will include the information required by this protocol and will be subject to audit as required by the National Statistician. (d) Where access to confidential data is granted to anyone employed by or directly contracted to the responsible statistician, it will be restricted to those who need access to produce non-disclosive results and analyses, and who have specific permission from the responsible statistician to do so. The responsible statistician will maintain a register of those who have such permission. Access will be lawful and in compliance with the National Statistics Code of Practice, including requirements for physical security. (e) Where access to confidential data is granted to anyone not employed by or directly contracted to the responsible statistician, there will be a direct, written data access agreement between the responsible statistician and the access beneficiary for every statistical source accessed and for every different purpose. i. Where access is provided to those who are not employed by or under direct contract to the responsible statistician, but who are under the professional responsibility of the National Statistician, the signatory will be the Head of Profession. ii. When the agreement includes those who are outside the professional responsibility of the National Statistician, the signatory will be an individual authorised to enter the organisation into a legally binding contract. (f) As part of the access arrangements, the responsible statistician will determine which parts of each statistical source are needed to achieve the stated statistical
purpose for which the access is required. Remaining information will be removed, aggregated or coded prior to access.
(g) In some cases, access to a statistical source may be sought by an organisation that already has information relating to the target population of this statistical source. In this case it must be clear whether any matching is part of the agreed access. If the individual records are to be matched, this must be explicit in the agreement, and the appropriate authority for such matching already obtained; otherwise the responsible statistician must be satisfied that matching will not occur, despite any apparent ability in the beneficiary organisation to undertake it.
(h) Responsible statisticians retain responsibility for the protection of statistical sources under their control, wherever they may be processed. To fulfil this role, the responsible statistician must be satisfied that the following conditions are met for each micro-data access request, with details recorded as part of the access agreement:
i. The purpose of the access and any outputs resulting from it are lawful, compliant with the National Statistics Code of Practice and compatible with the aims of National Statistics.
ii. It will be possible for both sides to maintain the agreement.
iii. There will be no misuse of the statistical source, including unauthorised duplication, and the agreed access will not erode the responsible statistician's guarantee of confidentiality or any other undertakings to the survey respondent or to the person or organisation to whom the information relates, either during the research or once the research has been completed.
iv. It is clearly stated when the access is to begin and end.
v. The access is proportional to the needs and objectives of the research.
vi. The status of the statistical source in law and according to the National Statistics Code of Practice is clearly understood by all who will have access to it.
vii. Access procedures are appropriate to the person or organisation being given access, and to the type of records – for example, where the information is personal and subject to the Data Protection Act (1998), or where it is 'sensitive' as defined in that Act.

viii. Access, and any subsequent processing, will be lawful and fully compliant with the National Statistics Code of Practice.
ix. Any outputs deriving from the access will satisfy the standards of disclosure control for the statistical source, as specified by the responsible statistician.
(i) Those granted access may be charged to cover the cost of making the statistical sources non-disclosive, or for any other functions the responsible statistician needs to undertake to provide access, in accordance with the Protocol on Data Presentation, Dissemination and Pricing.

5. The future for confidentiality and data access for statistics in the UK
The Statistics Commission has asked whether statistics legislation would provide an even stronger commitment to the public for the confidentiality of their data when used for statistics. Perhaps such legislation would balance this extra commitment to privacy with the more widespread use of administrative data for statistics through new statutory gateways. Where any statistics legislation might fit within devolution, common law, administrative law and data protection statutes is currently being researched. The Code of Practice and the Protocols were drafted with the thought that they could be a basis for any new statistical legislation. Proposals for a population register, whether consensual or non-consensual, might depend on new primary legislation, for many reasons common to the discussion above. Such legislation might allow the use of the register for statistical purposes such as constructing samples, weighting, etc. Our immediate aim is to use the Code and protocols to demonstrate our worthiness of public trust as producers of National Statistics in a non-statutory environment.

Developments at Eurostat for research access to confidential data

John KING and Jean-Louis MERCY [1]

[1] Jean-Louis MERCY (email: [email protected]), European Commission, L-2920 Luxembourg. John KING (email: [email protected]).

Abstract. The background to a new Regulation (Regulation 831/2002 concerning access to confidential data for scientific purposes) and its main provisions are described. The implications of the Regulation for Eurostat, the Member States, the research community and even data subjects are considered. Eurostat's activities in implementing the Regulation are outlined, together with an indication of some of the outstanding issues. Implementation of the Regulation has also raised some further questions on matters as diverse as the possibility of remote access to confidential data and the meaning of "scientific purposes".

About a year ago the European Commission adopted Regulation 831/2002 concerning access to confidential data for scientific purposes. This was a significant step in providing better access to confidential data for research. This paper describes some of the background to the regulation; outlines its provisions and the steps Eurostat is taking to implement it; discusses some of the implications of this work; and indicates some further questions arising from it.

1. Background to Regulation 831/2002

Micro-datasets are becoming important because of increasing interest in accessing them by researchers. This interest has two related drivers. The first is an aspect of modern life: accountable government and transparency. This is reflected in an increasing interest in and demand for evidence-based policy, policy analysis, and the monitoring of policies and their impact. This kind of activity requires timely, detailed information and frequently requires more detailed analyses than are presently published by statistical organisations. Sometimes these analyses are seen as being outside the remit of national statistical organisations (NSIs), or even as activities that could compromise the perceived independence of NSIs. Indeed, these analyses are often performed by academic institutions or independent research institutions. Detailed data are needed for these types of analyses. The obvious and most relevant source is often identified as the data collected and held by NSIs. Hence there is increasing pressure on NSIs and other statistical organisations to provide detailed data on a wide range of topics. In particular, for the European Union (EU), pan-EU analyses and research are becoming more and more important. The same could also be said for the Euro-zone. So the need is for access to pan-EU datasets for this research. Eurostat holds many such datasets, and so it is seen, by analogy with the national situation, as the natural, simple and direct potential source for these datasets.

The second driver is the changing nature of research itself. Much modern research cannot be satisfied with aggregate data: micro-data are needed for fine analysis and model building. Hand in hand with this there has been an evolution (perhaps revolution would be a more appropriate description) in research computing capacity, both hardware and software tools, and in the number of researchers and research institutions. These factors have considerably increased the demand for access to micro-data records for computing correlation matrices, estimating models and other analyses, depending on the context of the research topic. Examples of the micro-data needs of researchers were given in papers by, for example, Westergaard-Nielsen and Blundell at the 19th CEIES (European Advisory Committee on Statistical Information in the Economic and Social Spheres) seminar on "Innovative solutions in providing access to micro-data" last September in Lisbon. Other examples were given by several of the speakers, including Dilnot, Vickers and Blundell, at the inaugural conference in December 2001 of the cemmap (Centre for microdata methods and practice) research centre in London.

At the same time, statistical organisations, both NSIs and supra-national and international institutions, increasingly see making more use of the data they hold as an important contribution to society and as part of an obligation to make better use of their resources (data). But there are constraints on what statistical organisations, particularly NSIs, can do and on how they can do it. The role of researchers and research organisations is thus an important one, and it is an increasing one too. Because of its role of producing statistical information for the European Union, Eurostat collects data from the Member States (MSs) on many aspects of economic and social life. These datasets are, broadly, comparable across the MSs and use harmonised definitions. So the datasets held by Eurostat represent a rich and valuable resource for the Commission, the MSs and, potentially, researchers. The data collected and held by Eurostat are the subject of regulations. The regulations represent agreements between the Commission and the MSs on the purposes for which data are provided and the conditions under which the data are provided: in essence, statements of what can and cannot be done with the data.
The data are held subject to the requirements and conditions imposed by the MSs; this is stated explicitly in some of the regulations. The principle of statistical confidentiality is effectively the contract connecting the statistician with all those providing their individual data, either voluntarily, as is frequently the case, or by legal obligation, with a view to producing the statistical data essential for society as a whole. From the formal legal point of view, most European countries established legal provisions for statistical confidentiality long ago. At the European level, the principle has been enshrined in Article 285 of the Treaty establishing the European Community as a fundamental principle for Community statistics. Article 285 provides that the production of Community statistics shall conform to the principles of impartiality, reliability, objectivity, scientific independence, cost-effectiveness and statistical confidentiality. The confidentiality principle is therefore part of the European basic charter and has thus acquired the highest status in legal terms. The principle has been further specified, and data received, held, used and disseminated by Eurostat are controlled by a body of legislation that has developed since the Treaty founding the European Communities. In 1990, Council Regulation 1588/90 on the transmission of data subject to statistical confidentiality to the Statistical Office of the European Communities set out basic rules and safeguards for the handling of confidential data. Subsequently, in 1997, the "Statistical Law" (EU Regulation 322/1997 on Community Statistics) expanded on these basic rules. In particular, a legal definition of statistical disclosure was introduced. Article 13 states:
"1. Data used by the national authorities and the Community authority for the production of Community statistics shall be considered confidential when they allow statistical units to be identified, either directly or indirectly, thereby disclosing individual information. To determine whether a statistical unit is identifiable, account shall be taken of all the means that might reasonably be used by a third party to identify the said statistical unit."
This definition has replaced the former definition laid down in Regulation 1588/90, where confidential data were defined as "data declared confidential by the Member States in line with national legislation or practices governing statistical confidentiality." The notion of confidential data has consequently become an objective notion with a clear Community dimension. Article 13 goes on to state:
"2. By derogation from paragraph 1, data taken from sources which are available to the public and remain available to the public at the national
authorities according to national legislation, shall not be considered confidential.” The Statistical Law also states that confidential data must be used exclusively for statistical purposes unless the respondents have unambiguously given their consent to the use for any other purposes (article 15). The law also makes provision for access to confidential data for scientific purposes (article 17). With the agreement of all the MSs, the latter provision was used to provide simple access to data of the European Community Household Panel (ECHP). An anonymised micro-dataset was developed (by Eurostat in collaboration with the MSs) and made available under certain conditions to researchers. The provision has also been used by several enterprising researchers who have wished to use pan-EU microdata for their research. The researchers have had to contact the national statistical authority in each MS to request permission to access the data of that MS from a particular survey. Eurostat is then authorised to provide access to data of the MSs so agreeing. There has been mixed success with this approach, depending on the type of survey or data requested—sometimes MSs deny access to their data.

2. What Regulation 831/2002 sets out to do

Regulation 831/2002 implements certain provisions of the Statistical Law (Regulation 322/97), particularly articles 17(2) and 20(1). Essentially, Regulation 831/2002 sets out simplified procedures under which access to confidential data for scientific purposes may be granted. For many researchers it attempts to remove some of the access burden implicit in the Statistical Law, although access is still subject to comment by the national statistical authority of each MS and to various conditions. The regulation refers to four important sources:
• European Community Household Panel (ECHP);
• Labour Force Survey (LFS);
• Community Innovation Survey (CIS);
• Continuing Vocational Training Survey (CVTS).

In summary, researchers must belong to research institutions and organisations within the MSs (other researchers or organisations have to go through a more lengthy approval process). A detailed proposal must be prepared stating the purpose of the research and details of the data to be used. Safeguards for the secure holding of the datasets will be necessary and controls on access by individuals will be required. Agreement to conditions and safeguards will be through a contract with the researchers' institution. There is no right of access to confidential data under the Regulation. In addition, MSs can withhold the data of their country from any particular research request. Access to confidential datasets can be on the premises of Eurostat, with checks on the output and results to maintain confidentiality; or access can be through the distribution of anonymised micro-datasets. Incidentally, the new Regulation 831/2002 now provides a legal definition of anonymised micro-datasets: "anonymised microdata" shall mean individual statistical records which have been modified in order to minimise, in accordance with current best practice, the risk of identification of the statistical units to which they relate.
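What such a modification can look like in practice is illustrated by the following deliberately simple sketch. It is not the procedure applied to the ECHP or any other Eurostat dataset; the variable names, the age bands and the top-coding limit are invented for the example. It shows two classical disclosure-limitation steps often used when building anonymised files: global recoding of a quasi-identifier and top-coding of a sensitive numerical variable.

```python
import pandas as pd


def anonymise(microdata: pd.DataFrame) -> pd.DataFrame:
    """Illustrative anonymisation: coarsen exact age into broad bands and
    cap extreme incomes so that outlying units are less recognisable."""
    out = microdata.copy()
    # Global recoding: exact age is replaced by a broad age band.
    out["age_band"] = pd.cut(
        out["age"], bins=[0, 25, 45, 65, 120], labels=["<=25", "26-45", "46-65", "66+"]
    )
    out = out.drop(columns=["age"])
    # Top-coding: all incomes above the cap are reported as the cap value.
    income_cap = out["income"].quantile(0.95)
    out["income"] = out["income"].clip(upper=income_cap)
    return out


# Toy data to show the effect of the two steps.
sample = pd.DataFrame({"age": [23, 37, 68, 81], "income": [18000, 52000, 31000, 250000]})
print(anonymise(sample))
```

Whether a file protected in this way meets the Regulation's "current best practice" standard is, of course, a separate and survey-specific judgement.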

3. Implementing Regulation 831/2002 at Eurostat

For Eurostat, the implications of the Regulation and putting it into practice are considerable. But there are precedents and experiences to build on. For example, the European Community Household Panel survey (ECHP) has already paved the way, initially by providing some controlled access to confidential microdata and, more recently, by creating and making available anonymised micro-datasets. Similar approaches are being developed and extended to the other surveys mentioned in the regulation and to a wider range of researchers. New procedures are being developed for receiving research requests, evaluating the researchers and their requests, and setting up contracts. Procedures for consulting the national statistical authorities of the MSs, as required by the regulation, are being developed. New contracts have been developed and "confidentiality undertakings" have been drafted. The contracts will be between Eurostat and the researcher's institution or organisation. This means that there must also be a contractual relationship between the researcher and his or her organisation. The regulation does not permit access to confidential data by individuals as individuals. At the end of the day, the facilities to be provided under Regulation 831/2002 have to be user-friendly and have to provide a service to the research community. Eurostat sees consultation with the research community on their requirements, in terms of both data and facilities, as very important. Equally, Eurostat must explain the constraints to the research community and attempt to develop both appreciation and acceptance of them. Close interaction with the research community, to understand its needs and interests and to explain the constraints, is a relatively new activity for Eurostat. However, the dialogue has started with recent contacts with CEIES, the ESF (European Science Foundation) and other international research bodies. But this is not entirely new territory. Researchers' expectations and needs have been referred to above. There are examples in several MSs and elsewhere of facilities available to researchers. The Luxembourg Income Study provides an example, close to home, of internationally comparable datasets with remote access by recognised researchers.
Some MSs, for example the United Kingdom, have lengthy experience of developing anonymised micro-datasets for research use by academics and research institutions. In the United States, access to confidential data is provided through the Research Data Centres of the Census Bureau. But this kind of access is not common to all countries; there are differences in practice, expectations, culture and legal frameworks. Regulation 831/2002 foresees (article 3) a fairly straightforward and simple request process for researchers from two categories of organisations: 1(a), i.e. universities and other higher education organisations established by Community law or by the law of a Member State; or 1(b), i.e. organisations or institutions for scientific research established under Community law or under the law of a Member State. For "other bodies", article 3 of Regulation 831/2002 lays down the condition that they must first be approved by the Committee on Statistical Confidentiality if they wish to make requests to access confidential data for scientific purposes. "Other bodies" are those specified in article 3.1(c) of the regulation. Essentially, these bodies are organisations that do not fall under the categories of 1(a) and 1(b) above and which have not been commissioned by departments of the Commission or of the administrations of the Member States to undertake specific research. Regulation 831/2002 does not itself state criteria that should be taken into account by the Committee in forming its opinion. But there are some requirements in the Regulation and in Regulation 322/97 which indicate factors for consideration. Specifically, these are:
• prevention of non-statistical use (Regulation 322/97 arts. 10 and 18 and Regulation 831/2002 art. 8(1));
• access for scientific purposes (Regulation 322/97 art. 17 and Regulation 831/2002 art. 1); and
• protection of the data (Regulation 831/2002 art. 8(1)).

In addition, the principles of transparency and fairness mean that criteria should be clear and known. The Committee on Statistical Confidentiality decided that the following factors should be taken into account when forming its opinion:
• the primary purpose of the organisation;
• the organisational arrangements for research in the organisation;
• the safeguards in place in the organisation;
• the arrangements for dissemination of results of research.

Eurostat is now translating these conditions and factors into operational procedures. For example, the prior question of "admissibility" of an organisation to have the standing to make a request (regardless of the merits of the research request itself) has been specified in a series of questions (a questionnaire) covering:
• Identification and primary purpose of the organisation
• Brief description of the research project(s)
• Organisational and financial arrangements for research within the organisation
• Security in place in the organisation
• Arrangements for dissemination of results of research

This information will be passed to the national statistical authority of each MS for it to express an opinion. This will probably be done through a written procedure, in order to make the process reasonably fast. The regulation provides for access by researchers to confidential data on the premises of Eurostat. There is also provision for similar access on the premises of the national statistical authorities of the MSs if the level of security and the checking facilities are the same as those at Eurostat. Access of this type is often referred to as controlled access or access through a "Safe Centre".

4. Implications for Member States and national statistical institutions

The Regulation encourages the NSIs of MSs and Eurostat to work closely together in developing a system for providing access to confidential data for scientific purposes. This is a very wide-ranging set of activities—from agreeing ways of checking and protecting the outputs of research; agreeing on safeguards and controls for the data and ways of creating anonymised micro-datasets, to procedures for handling research requests and consulting each other. These processes are currently being designed and will be discussed with the MSs. The safeguards, controls and methods will build on existing approaches and methods. These will reflect existing national practices, but may require some adaptation. For example, one MS has an established procedure for considering research requests a few times a year. Yet the regulation requires that each MS must respond to a notification of a research access request within six weeks. Again, one MS has an established process for approving access requests by researchers and institutions of that country. But the regulation allows access by researchers—not only of other MSs, but also by researchers and organisations outside the EU.


Although there is a requirement that each MS be informed of each research request, there is a presumption in the Regulation that MSs will agree to give access to their data, provided that all the conditions and requirements specified have been met by the researchers. There may be implications for NSIs in the way data are collected. In particular, if the uses to which the data may be put have to be specified to the respondent, then the research usage envisaged under the Regulation may have to be included. This is discussed further below. Procedures for anonymising data and for protecting outputs from direct access to confidential data must also be developed by each NSI and agreed with Eurostat. In practice, a common approach by all NSIs will provide better protection and more useful datasets. There are some areas that will require further research and consideration, as they are little developed or understood at present. These include problems of the disclosure potential of results from modelling. We understand well the problems of disclosiveness in tabular data, for which methodology exists and is still developing, but we have a less clear idea of the problems, let alone the solutions, arising from modelling. An intuitive restriction is to suppress information about residuals, even though they are of great statistical interest to researchers, because they give information about outliers, which are often the rare data subjects. But we also need to know about the disclosive potential of parameter estimates, particularly when a series of similar models is run and compared.
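To make the point about residuals concrete, the following minimal sketch (in Python, with invented firm data; it is not based on any dataset discussed here) shows that once the fitted parameters of a model are published, releasing a residual amounts to releasing the observed value itself, which matters most for outlying, and hence rare, data subjects:

    # Invented data: seven ordinary firms and one large outlier.
    import numpy as np

    employment = np.array([10, 12, 15, 20, 22, 25, 30, 400], dtype=float)
    turnover = 3.0 * employment + np.array([1, -2, 0, 2, -1, 1, 0, 5000], dtype=float)

    slope, intercept = np.polyfit(employment, turnover, 1)  # published parameter estimates
    fitted = slope * employment + intercept
    residuals = turnover - fitted                           # the quantity one might suppress

    i = int(np.argmax(np.abs(residuals)))                   # the outlying firm stands out
    print(f"outlier firm {i}: observed turnover {turnover[i]:.1f}, "
          f"reconstructed from fit + residual {fitted[i] + residuals[i]:.1f}")

The reconstruction is exact by construction (observed value = fitted value + residual), which is precisely why suppressing residuals is an intuitive first restriction; the harder open question raised above, the disclosure risk from comparing parameter estimates across a series of similar models, is not addressed by this sketch.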

5. Implications of Reg. 831/2002 for the Research Community

The implications of the new Regulation for the research community illustrate the nature of the partnership between statistical organisations and the research community. On the one hand the Regulation opens up new opportunities; on the other hand it imposes tight disciplines and limitations as the price of those opportunities. First, the research community must accept that it has no right of access. Researchers will then have to accept responsibility for maintaining and upholding the confidentiality of the data they access. The limitations and safeguards may be more restrictive than those prevailing in the researchers' universities and those they have encountered with other datasets, but they must be adhered to. The documentation of the Research Data Centres (RDC) of the US Census Bureau is voluminous but thorough. In particular, the sections on the different cultures of the RDCs and of universities make interesting reading. They also provide a warning that there should be no presumption of a common culture or purpose. Researchers will also have to accept that yet another body will have the right to ask detailed questions, not only about the research and its purposes but also, in the case of anonymised micro-datasets, about how the data will be held and access controlled.


Researchers must also accept that their responses will be passed to the NSI of each MS for consideration. In addition, following access to confidential datasets, prospective results must be provided for checking before publication or other release. In return, most researchers will have simpler access to datasets spanning the MSs. Hitherto, under the provisions of the Statistical Law, gaining access to data for each of the MSs has involved a lengthy process of making requests to each MS individually. The Regulation will give researchers opportunities for pan-European Union research and analyses. It covers four important datasets; it is expected that, in time, access to other datasets will also be provided.

6. Implications of Reg. 831/2002 for data subjects

Although the purpose of Regulation 831/2002 is to improve access to data for researchers, there are implications for the data subjects who provided the original information. This information was given to statistical organisations in their own countries as part of a voluntary or compulsory statistical enquiry, or it may have been taken from existing administrative registers as part of a statistical enquiry. In turn, the statistical organisation passed the data to Eurostat after removing information allowing direct identification of the data subjects. In this connection there are also additional implications for those statistical organisations. The principle underlying statistical data collection is that of informed consent: the data subject has a right to know what the information will be used for and who will see it. The argument here is that if there is a new dimension to the use of the information (new users, new uses), then the data subject should be made aware of it. In some MSs it may be necessary to change the laws under which data are collected in order to specify the uses to which the data can now be put. As part of the statistical enquiry, the data subject should be informed that the information provided will be used for statistical purposes, and that this may include research undertaken by external researchers in addition to the routine direct purposes of the statistical organisation. Under the Regulation, researchers may be from institutions within the Member States, or indeed from institutions outside the EU, not just from institutions within the data subject's own country. At present, practices vary among the Member States in this regard, so it is not easy to indicate what will have to change. In practice little may need to change: the existing forms of consent may well cover, implicitly, access by researchers from another country for statistical research. It is a question of degree, balancing how implicit or explicit the consent is against how well informed it is. This may require some field research, including qualitative research, among data subjects. It is an important part of the contract between the data subject and the statistical organisation and will be seen by the latter as a factor affecting response rates to voluntary enquiries.


7. Some questions arising

What do we mean by "scientific purposes"? This is a question that has already arisen. For some this is synonymous with "academic". But even so, what is to be included? Recognised, post-doctoral researchers are presumably undertaking scientific research (even if some of it may also be "commercial"). But below this apparently clear-cut category, distinctions are more difficult to make. And the focus of the debate tends to centre on the qualifications or status of the researcher rather than on the actual "scientific" nature of the research proposed. What, then, of Ph.D. students? Much of the work undertaken by doctoral students is at the forefront of scientific knowledge, so is presumably scientific. And much research undertaken for a Masters degree will be supervised by a recognised researcher or scientist and may form part of a larger project with a clear scientific purpose. Should undergraduates have access "for scientific purposes" for projects and for familiarisation with large, complex datasets? After all, training may be regarded as a scientific purpose, and the same training argument could be made at higher levels. It may be difficult to draw this line. Pragmatically, the line may well be drawn on legal rather than scientific grounds: does the person desiring access have a contractual relationship with the institution, so that penalties for non-compliance with the conditions for access can be invoked? Remote access. For some, the concept of a physical "safe centre" has already been overtaken by events, namely new technology. Attending a physical "safe centre" has several drawbacks: cost, difficulty of access (Luxembourg is neither the easiest nor the cheapest place to visit), the probable time lapse between running programs and receiving results, and the lack of spontaneity and convenience in performing analyses. For this reason many researchers seem willing to trade access to "real" confidential data through "safe centres" for anonymised micro-datasets, for the convenience of access on their desks. But there have been several successful "remote access" facilities for confidential data. The leading example of this has been the Luxembourg Income Study (LIS). Using the approach and software designed for that, another research consortium, the pay inequalities and productivity (PiEP) project, has developed procedures for remote access to the Structure of Earnings Survey data at Eurostat. In both cases there are trade-offs in order to obtain access. The LIS uses a highly reduced dataset of relatively few key variables, some of which are reduced to categorical variables. The PiEP accepts some reductions in the dataset and some restrictions on outputs: no tabulations, no information on residuals, and so on. These restrictions are designed, with the agreement of the NSIs of the MSs, to reduce the risk of identification or of disclosure to an acceptable minimum.
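As a small illustration of the kind of reduction used by the LIS, where continuous variables are collapsed into categories before release, the following sketch (in Python, with invented income values and invented band boundaries; the actual LIS recodings are not reproduced here) shows how a rare, extreme value becomes indistinguishable within a broad top category:

    # Hypothetical banding of a continuous income variable; boundaries are invented.
    def income_band(income):
        bands = [(20000, "under 20 000"), (40000, "20 000-40 000"), (60000, "40 000-60 000")]
        for upper, label in bands:
            if income < upper:
                return label
        return "60 000 and over"

    incomes = [12500, 38000, 59999, 250000]   # invented values; the last is a rare high income
    print([income_band(x) for x in incomes])  # the outlier is hidden in the open-ended top band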


The challenge is to provide remote access to researchers. A non-technological issue is whether the regulation can be interpreted as permitting this type of access. Under such an arrangement the processing and analysis of the data would be performed on the premises of Eurostat. The controls (on individuals, on authorised access and on outputs) that would be used in the case of a traditional "safe centre" could be the same, but access would be much easier. However, this issue has to be investigated further before putting it on the agenda of the Committee on Statistical Confidentiality. Value of the research to the data provider. The US Census Bureau has an explicit prerequisite that the research proposed should be of value to the Bureau; indeed, the research must, as a legal requirement, provide a benefit to Census Bureau programs. The draft Protocol being discussed in the United Kingdom includes similar wording, but the legal basis for this is not clear. Regulation 831/2002 has no such requirement. Are anonymised data confidential data? If the anonymisation process reduces to a minimum the risk of identification or of disclosure, is the anonymised dataset "confidential"? This question brings together, for a healthy debate, the classificatory legal approach and the pragmatic statistical approach. If direct identification is not possible and the risk of indirect identification is negligible (minimised in accordance with current best practice), then such anonymised data are not disclosive (or potentially disclosive) and so not confidential. Or are such data always confidential, no matter how much they have been modified?

8. The future

One aspect of the legal requirement on NSIs and Eurostat that needs further consideration is indicated in article 13(2) of the Statistical Law. This states that, "By derogation from paragraph 1, data taken from sources which are available to the public and remain available to the public at the national authorities according to national legislation, shall not be considered confidential." This points to a conundrum that pervades statistical confidentiality: information obtained by an NSI through a statistical enquiry is treated as confidential even if the information is publicly available and even if the data subject itself proclaims the information. In some countries there is a let-out for the NSI, if the data subject releases the NSI from the confidentiality requirement. But without this, much effort goes into protecting (mainly economic) statistical data that are publicly available. Perhaps the answer is for statistical enquiries to be in two parts: the publicly, and often statutorily, available information about a company; and the information to be protected as confidential. Some of the requirements and targets specified in laws are not fixed but move over time. There is thus a requirement on NSIs and on Eurostat to review practices and methods from time to time. For example, "anonymised microdata" are defined in Regulation 831/2002 as data which "…have been modified in order to minimise in accordance with current best practice the risk of identification of the statistical units to which they relate". Clearly, current best practice changes over time, and so must the procedures used. Again, the Statistical Law requires that "account shall be taken of all the means that might reasonably be used by a third party to identify the said statistical unit". The means available to third parties will also change over time: greater access to other databases and more powerful computers and software. A further example is provided by Regulation 1588/90: Eurostat must offer the same confidentiality guarantees as the NSIs for the transfer of data to Eurostat from MSs. Developments in the MSs in this regard will need to be reflected in Eurostat's procedures.

9. Conclusions

Developing the draft regulation and getting it approved by MSs through the Committee on Statistical Confidentiality and by the Commission entailed considerable effort by many people. But that approval is only part of a larger process of implementation: creating the processes and facilities, to say nothing of the datasets themselves, in order to provide research access to confidential data. The process of implementation has raised further questions, both statistical and legal, that need consideration.


Report on application of the principle of statistical confidentiality in Kyrgyzstan

Confidentiality of primary data is one of the basic principles of official statistics adopted by the United Nations Statistical Commission in 1994. In a democratising society, the confidentiality of statistical data becomes an underlying principle in relations between statisticians and their information providers. The confidentiality of primary information serves a two-fold purpose: firstly, to uphold the inviolability of personal privacy and the non-disclosure of state and commercial secrets, and secondly, to foster users' trust in official statistics. Every country implements the principle of data confidentiality in its own way. This report sets out the main contours of how that principle is implemented in the Kyrgyz Republic, and describes the associated problems. It goes without saying that putting the principle of confidentiality into practice requires a legislative underpinning. On the basis of legislation, the State Statistical Service of Kyrgyzstan is building a relationship with both the suppliers and users of statistical data. The State Statistics Act of the Kyrgyz Republic lays down the rights and obligations not only of persons who collect, process and publish statistical data, but also of those who supply and use the data. The Act guarantees that commercial secrets contained in statistical data made available by legal and natural persons are respected. In turn, statistical services at all levels of the Republic bear responsibility under the Act for protecting state and commercial secrets and information with a bearing on the private life of citizens. At the same time, the Act lays down the rights and obligations of persons who provide the statistical services with primary information. Moreover, responsibility for upholding the legal rights and obligations of both sides involved in the process of producing statistical information is laid down in another legislative act: the Administrative Practices Act. Maintaining data confidentiality is one of the fundamental principles of the Code of Professional Ethics for State Employees, which has been adopted by the Republic's state statistical system. The principle of confidentiality is also upheld in other legislative acts relating to official statistics (the Population Census Act and the Agricultural Census Act). These laws guarantee the confidentiality of individual data obtained from censuses. The fact that final data are only published in aggregate form protects citizens from infringement of their constitutional rights and freedoms. With a view to boosting confidence in state statistics, information providers are kept informed of the confidentiality with which their primary data are treated. To this end, all statistical forms and questionnaires contain a reference to the relevant Articles of the State Statistics Act. In order to ensure that households participate in the surveys, and to increase their interest in providing reliable information, the National Statistical Committee uses financial incentives and guarantees the confidentiality of the information which it receives. At the same time, however, it endeavours to explain the aims and importance of such surveys to the information providers. This ensures a satisfactory response rate (the non-response rate is about 1-1.5% per annum). Statistical data on citizens are used only in aggregated or depersonalised form, without any information that would enable the individuals to be identified.
Primary data on private individuals or families may not be divulged without their consent. Applying the principle of confidentiality also entails certain problems.


One of these concerns user access to statistical data. Many users, foremost among them state and political structures, assume that "access to information" means access to individual data. This is because users have only a vague notion of what the principle of statistical confidentiality entails, and why it has to be observed. In order to perform their tasks, therefore, they ask the statistical services for lists of businesses together with a number of statistical indicators. It is important for users and, above all, government employees to have a clear, informed understanding of the concept of "confidentiality". An awareness campaign on issues relating to the confidentiality of statistics, involving various seminars and conferences, might be a major step in the right direction. No less serious is the problem associated with the transmission and storage of data. Under present conditions, with the expanding use of networks and of the Internet, the problem of protecting data from unauthorised access and guaranteeing data confidentiality is becoming highly topical. This problem is felt particularly keenly in the regions of the Republic, where the available buildings, premises and technical wherewithal do not allow confidential data to be stored in an appropriate manner. This also has a bearing on electronic data transmission from regional services to the Central Statistical Office. As part of its technology policy, the National Statistical Committee is working on ensuring that statistical data are protected from unauthorised access. Access to data placed on the intranet is controlled by means of user passwords, and proxy server software has been installed to protect the local (internal) network from external users.


Questions relating to the confidentiality of statistical information at the National Statistical Service of the Republic of Armenia

Document presented by the National Statistical Service of Armenia

Introduction

The confidentiality of statistical information at the National Statistical Service of the Republic of Armenia (hereafter: the NSS) is governed by a series of legislative and regulatory acts and technical programming measures which guarantee the security of the statistical information system described earlier [1]. The basic principle is that published statistical data must not cause injury (damage) to information providers: hence the existence of a series of methodological measures covering the processing of statistical data. The measures prevent the identity of any individual unit being disclosed, either directly or indirectly, in published data. In complex cases, where non-publication could compromise the integrity of statistical information and the information provider has not given approval for publication, the final decision on whether or not to publish is taken by the State Council on Statistics of the Republic of Armenia [2], which adopts regulatory acts having legal force in the statistical field. In order to arrive at its decision, the State Council has to weigh the fact that incomplete statistical data can impinge on important socio-economic development programmes against the interests of the information provider and the principle of confidentiality, which is one of the fundamental principles of official statistics ratified by the UN.

Current status

Although the confidentiality of statistical information is guaranteed under the Law on State Statistics of 10 May 2000, matters relating to the harmonisation of that Law with other legislative acts require constant attention. In an addendum to the Law, the State Council drafted and adopted the Statistical Secrecy Order, pursuant to which the NSS has set up a Committee on Confidentiality Issues. The Committee devises appropriate checks on work involving confidential statistical information. The following documents have been drafted and adopted (or are currently being drafted) by the Committee:
- a list of categories of access to confidential statistical information by NSS officials;
- instructions concerning the limits placed on NSS officials' access to confidential statistical information;
- the obligation on NSS officials not to disclose confidential statistical information;
- users' obligations with regard to the NSS computer network;
- instructions for work involving confidential statistical information;
- instructions for the safe-keeping and use of passwords;
- instructions for creating back-ups and storing them in electronic archives;
- the responsibilities of the network manager;
- the responsibilities of database administrators;
- the uses which are made of data output (on any medium);
- monitoring of the computing environment at the NSS;
- measures for protecting confidential data during the collection, processing, storage and transmission of statistical information;
- training and instruction for all NSS officials involved in work with confidential statistical information.

The issues involved in the confidentiality of statistical information are dealt with in the recently adopted "Blueprint for a Single Statistical Information System for the National Statistical Service of the Republic of Armenia". The intention is to devise and implement a series of measures for protecting electronic information in the NSS's statistical information system and to resolve questions relating to the reliability of the information system, with a view to preventing the unauthorised copying of information. In order to create a reliable and secure system, particular significance attaches to the level of training of network and database administrators. Their training should preferably be tackled as part of a single international programme, which will contribute to finding a common approach to these questions.

Conclusion

Confidentiality is one of the underlying principles of official statistics. It strengthens information providers' confidence in the system and is conducive to improvements in the quality of statistical information. The training and instruction of NSS officials on questions relating to the confidentiality of statistical information is crucial to achieving those aims.

Literature

[1] An approach to the confidentiality of statistical information at the NSS. Joint ECE/Eurostat meeting on the confidentiality of statistical data, Working document no. 33, Skopje, 14-16 March 2001.
[2] Confidentiality and harmonisation of the law: the Armenian experience. 7th TACIS Seminar for the leaders of statistical departments in the New Independent States and Mongolia, Document HLS/2002/04, Baku, 16-17 May 2002.


Demand of Data and Options of Analysing Data: The Research Data Centre of the Statistical Offices of the Länder

Sylvia Zühlke
Geschäftsstelle des Forschungsdatenzentrums der Statistischen Landesämter
c/o Landesamt für Datenverarbeitung und Statistik Nordrhein-Westfalen
Postfach 10 11 05
40002 Düsseldorf
[email protected]

History

Over the past few years in Germany, there has been a keen debate about access to official microdata for the purposes of scientific research. A commission for the improvement of the information infrastructure, set up by the Government, came up with suggestions on how to improve the interaction between scientific research and statistics. Suggestions included the involvement of data users in setting up survey and processing programmes, plans to update basic and advanced training in the statistical field, and various options concerning access by scientific researchers to the microdata produced by public data producers. One of the Commission's central recommendations was to set up a data producers' research data centre as soon as possible. The Statistical Offices of the Länder met this request when they set up a joint Research Data Centre with 16 regional offices. The centre is run on an associative basis, with each office forming a regional site. It is headed by a steering committee, whilst administration and coordination are the responsibility of a unit set up within the North-Rhine-Westphalia Office for Data Processing and Statistics. This unit is the official contact point for the Research Data Centre and is thus authorised to disclose binding information about the data and services it provides.

The centre's primary purpose

The aim of the centre is to make more data available to researchers and scientists. In Germany, the majority of surveys are carried out on a decentralised basis by the Land offices, which therefore hold the bulk of the resultant statistics. Before scientific researchers can be provided with improved access to data, the Land offices must first set up an infrastructure for centralising the storage of data. This will allow official data to be accessed centrally from the various regional sites of the Research Data Centre. In addition, an information system will be set up to provide comprehensive information about official statistics and the various ways in which they can be used. Furthermore, a 'visiting researcher' scheme will be introduced. Under this scheme, researchers may work with anonymised microdata that cannot be transferred as Scientific Use Files for use outside the statistical office.

Access to microdata

From the scientific research point of view, the ideal situation is one in which as many data sets as possible are available as 'Scientific Use Files'. In the field of household and personal surveys, data sets which have been rendered anonymous, and which can only be de-anonymised by expending a disproportionate amount of time and energy, have existed for a number of years now. As enterprise statistics and regional statistics are generally quite different from household and personal data in terms of the risk of de-anonymisation, the Research Data Centre of the Länder is developing, within the applicable legal rules, an infrastructure which facilitates the use of microdata and offers three different options:

(1) Under the 'visiting researcher' scheme, researchers may work with anonymised microdata that cannot be transferred as SUFs for use outside the statistical offices, since the automatic input of additional information must be prevented if the factual anonymity criterion is to be complied with.

(2) Visiting researchers can also work on research projects involving contractually agreed cooperation between one or more of the statistical offices, on the one hand, and outside experts, on the other. For these projects, access can be given to individual data that have been formally anonymised. However, this is an option only if the analysis is an official statistical project for which outside help must be brought in owing to a lack of capacity or of know-how. The results of such analyses are the property of official statistics and are checked for confidentiality prior to publication. Nevertheless, the researchers involved also have user rights.


(3) Furthermore, researchers can use controlled remote data processing. This means that experts write their analysis programs at their workstations using standard statistical software and then send them to the research data centres. There, the programs are run against the original, non-anonymised data, and the findings are checked for statistical confidentiality before being transmitted to the researcher.
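As an illustration of the checking step in this workflow, the following minimal sketch (in Python, with invented output and a hypothetical minimum cell-frequency rule; the paper does not specify which checks the research data centres actually apply) shows how tabular findings produced by a submitted program might be screened before being sent back to the researcher:

    # Hypothetical confidentiality screen for tabular output from a submitted program.
    # The minimum cell-frequency rule and the threshold value are assumptions for
    # illustration, not the centres' actual checking rules.
    from collections import Counter

    MIN_CELL_COUNT = 3  # assumed threshold

    def check_table_for_release(cell_counts):
        """Return (ok, offending): ok is False if any non-empty cell is below the threshold."""
        offending = {cell: n for cell, n in cell_counts.items() if 0 < n < MIN_CELL_COUNT}
        return (len(offending) == 0, offending)

    # Invented output produced by a researcher's program on the original microdata.
    table = Counter({("region A", "large firms"): 2,
                     ("region A", "small firms"): 57,
                     ("region B", "large firms"): 14})

    ok, offending = check_table_for_release(table)
    if ok:
        print("output cleared for transmission to the researcher")
    else:
        print("output withheld; cells below threshold:", offending)

In practice such a screen would be applied by statistical office staff or automated tooling at the research data centre and would cover more than cell frequencies (dominance rules, model output, and so on); the sketch only shows where a check of this kind sits in the remote-processing workflow.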

Demand of data

In order to take account of the scientific demand for data, a special survey was carried out in summer 2002. Roughly 600 of the 700 scientists surveyed indicated that they regularly use microdata in their work. The results show the researchers' strong interest in using microdata from official statistics and point to a manifold need for data. As regards the different options for using data, there is a clear preference for Scientific Use Files. Because Scientific Use Files involve a compromise in information content, and because producing such files requires a great deal of time and effort, this option is not suitable for every research project. The survey thus points to a conflict between the manifold demand for data and the preferences among the different options for using data. A solution to this problem is offered by the new decentralised infrastructure for the visiting researcher scheme, combined with the other options of data use, which together facilitate access to microdata. In addition to its work relating directly to improving the use by scientific researchers of the microdata produced by official statistical bodies, the centre will also address basic questions of data access. These include, for instance, the question of statistical confidentiality in the face of an ever-increasing volume of readily available additional information, and the expansion of cooperative links with other national and international research data centres.
