Proceedings of the 29th Annual Hawaii International Conference on System Sciences - 1996
Logical vs Numerical Inference on Statistical Databases*
Sumit Dutta Chowdhury, George T. Duncan, Ramayya Krishnan, Sumitra Mukherjee, Stephen Roehrig
The H. John Heinz III School of Public Policy and Management Carnegie Mellon University Pittsburgh, PA 15213
Abstract
As computer databases have become more prevalent and comprehensive, they have prompted concerns about the confidentiality of sensitive information. Database administrators require effective policies to guard against disclosure of confidential information while at the same time providing reasonable access to legitimate users. A general method is available for determining if disclosure of sensitive information may result from inference over multiple database tables. This method relies on the logical structure of the tables, and is thus independent of the actual contents. After reviewing this method, we present new numerical techniques which in some cases allow inference of sensitive data even in instances pronounced safe by the logical method. These new techniques exploit the contents of the tables, and one makes use of a pair of new matrix operators. A real-world example of a multi-table inference process is given.
1 Introduction
Because of growing concerns about data security and confidentiality, database administrators (the brokers between data providers and data users) must implement policies and technologies to prevent disclosure of information to unauthorized users while providing adequate access to legitimate users. The administrator has to ensure that an unauthorized user may not obtain, directly or through inference, the linkage between an identifier and certain sensitive records referring to that individual. Direct disclosure occurs when there is unauthorized access, as through password breaking or communication eavesdropping. Password protection, encryption and logic-based methods such as multi-level authorization control have been used to avoid this problem. Inferential disclosure is harder to control than direct disclosure because it may occur in spite of access restrictions. Query size restriction (Fellegi 1972; Schlörer 1975) and response modification (Denning 1980; Adam & Wortman 1989) are the two major approaches that have been proposed to limit inferential disclosure. Along these lines, Nargundkar and Saveland (1972) and Kelly et al. (1990) have presented rounding techniques to prevent individual table values from being known exactly. Carvalho et al. (1994) have proposed heuristic methods of cell suppression to achieve the same goals. These methods minimize the value of information lost due to direct and complementary suppression.
All these works have addressed the case of disclosure detection and protection in two-dimensional tables. In this paper we focus on N-dimensional categorical databases, which will likely become more prevalent and more important in the future, given the growing availability of online databases. In subsequent sections of this paper we show how an a priori logical approach to protection, i.e. one based on database structure alone, can lead to a false sense of security. As proof, we develop three methods of detecting disclosure based on the actual contents of the database: the Fréchet bound approach, the Matrix Comparative-Assignment (MCA) approach and the Linear Programming (LP) approach.
*Email: roehrig+@andrew.cmu.edu, Phone/FAX: (412) 268-8783/7036
Proceedings of the 1996 Hawaii International Conference on System Sciences (HICSS-29) 1060-3425/96 $10.00 © 1996 IEEE
1.1 An example from health care management
The thrust of our work can be explained in terms of the three-dimensional table of medical data depicted graphically in Figure 1. This table records instances of patients visiting physicians to receive treatments. The tables Patient-Doctor and Doctor-Treatment, obtained by projection, are not sensitive and are publicly accessible, but the projection Patient-Treatment is sensitive and therefore confidential. The disclosure problem is to determine if it is possible to infer confidential values from information in the accessible tables.
[Figure 1: Information Structure of Health Care Example. Doctor axis labels: Dr. Phillips, Dr. Hill, Dr. Christie.]
2 A logical approach for disclosure detection and prevention
A logical approach is one in which design-time actions are taken to limit disclosure. Such methods must be valid irrespective of the contents of the database. A general method for determining whether a published table (or set of tables) will admit inferential disclosure was given by Fellegi (1972). The basic idea is to derive a set of equations which, if solvable, yield the value of a confidential datum in terms of the data contained in the table(s). We are interested in the case where a data snooper attempts to infer values of a confidential projection of an N-dimensional table, given related but non-confidential projections. Following Fellegi's arguments, we can answer this question as follows.
Theorem 1 Let T be a table with three attributes X, Y and Z taking on values $x_1, x_2, \ldots, x_l$, $y_1, y_2, \ldots, y_m$ and $z_1, z_2, \ldots, z_n$ respectively. It is not possible to determine counts for the projection X × Y given only counts for the projections X × Z and Y × Z.
Proof: We denote by $(x_i, y_j)$ the i, jth element of the projection X × Y, and use similar notation for elements of the other projections. Without loss of generality, assume we are interested in the count of
$$(x_1, y_1) = \sum_{k=1}^{n} (x_1, y_1, z_k).$$
It is obvious that if $(x_1, y_1)$ can be calculated at all, it must be as a linear combination of the projection data. Briefly, we begin by assuming that $(x_1, y_1)$ is such a linear combination, then rewrite the equation in terms of the cell values of the underlying table T. The resulting equation, since it must hold for all instances of T, implies a simultaneous set of equations in the coefficients of the expression for $(x_1, y_1)$. We show this set has no solution, proving the theorem. Suppose then that
$$(x_1, y_1) = \sum_{k=1}^{n} a_k (x_1, z_k) + \cdots + \sum_{k=1}^{n} a_{(l-1)n+k} (x_l, z_k) + \sum_{k=1}^{n} a_{ln+k} (y_1, z_k) + \cdots + \sum_{k=1}^{n} a_{ln+(m-1)n+k} (y_m, z_k).$$
The term $(x_1, y_1)$, along with all of the $(x_i, z_k)$, $(y_j, z_k)$, can be expanded as sums of $(x_i, y_j, z_k)$ over the appropriate index. Collecting terms over the $(x_i, y_j, z_k)$ we get
$$\begin{aligned}
&(a_1 + a_{ln+1} - 1)(x_1, y_1, z_1) + \cdots + (a_n + a_{(l+1)n} - 1)(x_1, y_1, z_n) \\
&+ (a_1 + a_{(l+1)n+1})(x_1, y_2, z_1) + \cdots + (a_n + a_{(l+2)n})(x_1, y_2, z_n) + \cdots \\
&+ (a_1 + a_{(l+(m-1))n+1})(x_1, y_m, z_1) + \cdots + (a_n + a_{(l+m)n})(x_1, y_m, z_n) \\
&+ (a_{n+1} + a_{ln+1})(x_2, y_1, z_1) + \cdots \\
&+ (a_{(l-1)n+1} + a_{(l+(m-1))n+1})(x_l, y_m, z_1) + \cdots + (a_{ln} + a_{(l+m)n})(x_l, y_m, z_n) = 0.
\end{aligned}$$
Since this equation must hold regardless of the numerical content of T, the coefficient of each of the $(x_i, y_j, z_k)$ must be zero. This leads to the following simultaneous set in the coefficient vector $a = (a_1, \ldots, a_{(l+m)n})^T$, with one block row of n equations for each pair $(x_i, y_j)$:
$$\begin{bmatrix}
I^n & 0^n & \cdots & 0^n & I^n & 0^n & \cdots & 0^n \\
I^n & 0^n & \cdots & 0^n & 0^n & I^n & \cdots & 0^n \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
I^n & 0^n & \cdots & 0^n & 0^n & 0^n & \cdots & I^n \\
0^n & I^n & \cdots & 0^n & I^n & 0^n & \cdots & 0^n \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
0^n & 0^n & \cdots & I^n & 0^n & 0^n & \cdots & I^n
\end{bmatrix} a =
\begin{bmatrix} \mathbf{1} \\ \mathbf{0} \\ \vdots \\ \mathbf{0} \\ \mathbf{0} \\ \vdots \\ \mathbf{0} \end{bmatrix}$$
where $I^r$ denotes an identity matrix of dimension r, and $0^r$ denotes a square matrix of zeros of dimension r. The first l block columns multiply the coefficients of the $(x_i, z_k)$ and the last m block columns those of the $(y_j, z_k)$. On the right-hand side, a $\mathbf{1}$ (resp. $\mathbf{0}$) denotes a column of 1s (resp. 0s) of height n. Multiply the first mn rows by -1 and add to the second set of mn rows. The result for the second set is the following, which clearly has two sets of n rows with identical coefficients but different right-hand sides.
$$\begin{bmatrix}
-I^n & I^n & 0^n & \cdots & 0^n \\
-I^n & I^n & 0^n & \cdots & 0^n \\
\vdots & \vdots & \vdots & & \vdots \\
-I^n & I^n & 0^n & \cdots & 0^n
\end{bmatrix} a =
\begin{bmatrix} -\mathbf{1} \\ \mathbf{0} \\ \vdots \\ \mathbf{0} \end{bmatrix}$$
The set of equations is inconsistent, and thus no solution exists. □
Corollary 1 It is not possible to determine exact counts in any N-way table given the counts in N - 1 projections.
These arguments show that it is not possible in general to get exact values of confidential data from related projections. We later show, however, that methods exist by which a data snooper can get exact or bounded information about restricted views given the contents of accessible views of the database. In many situations the bounds obtained might amount to disclosure of sensitive information.
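The non-identifiability claimed by Theorem 1 can be illustrated with a small counterexample. The sketch below is not from the paper: the two 2 × 2 × 2 tables and the helper names `project` and `make_table` are hypothetical, chosen only to show two tables that agree on the X × Z and Y × Z projections yet differ on X × Y.

```python
from itertools import product

def project(table, keep):
    """Sum a {(x, y, z): count} table over every axis not listed in `keep`."""
    out = {}
    for cell, count in table.items():
        key = tuple(cell[axis] for axis in keep)
        out[key] = out.get(key, 0) + count
    return out

def make_table(slice_z0):
    """Build a 2x2x2 table whose z=0 slice is `slice_z0` and whose z=1 slice is all ones."""
    table = {}
    for x, y in product(range(2), repeat=2):
        table[(x, y, 0)] = slice_z0[x][y]
        table[(x, y, 1)] = 1
    return table

# Two hypothetical tables that differ only by swapping counts inside the z=0 slice.
t_a = make_table([[1, 0], [0, 1]])
t_b = make_table([[0, 1], [1, 0]])

X, Y, Z = 0, 1, 2
same_xz = project(t_a, (X, Z)) == project(t_b, (X, Z))
same_yz = project(t_a, (Y, Z)) == project(t_b, (Y, Z))
same_xy = project(t_a, (X, Y)) == project(t_b, (X, Y))
print(same_xz, same_yz, same_xy)
```

Because both accessible projections coincide while the confidential one does not, no function of the accessible projections alone can return the X × Y counts for every table, which is the content of the theorem.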
3 Content-based approaches for disclosure detection
Content-based (numerical) approaches are characterized by their dependency on actual data values. Given the tables generated by aggregation over some attribute, we model how disclosure can take place by combining data from a group of tables. To do this we find the maximum and minimum values that each cell of the restricted view can take individually (i.e., without taking into account what other cells might contain). We consider three such approaches, one a recasting of a result due to Fréchet, one we call matrix comparative-assignment, and the last based on linear programming.
3.1 Fréchet bounds
Fréchet (1940) developed bounds on certain values within dependent probability systems. Although these bounds have been extended by Kwerel (1988), Fréchet's results are still the best of this class for our problem. Fréchet bounds can be used to obtain a fairly weak description of a joint probability distribution (either continuous or discrete) given values of the marginal distributions. In the context of discovering sensitive information from non-sensitive projections, these bounds are an improvement over the set-theoretic method of Fellegi, because they make use of the contents of the tables, not just their structure.
Let T(i, j, k), 1 ≤ i ≤ l, 1 ≤ j ≤ m, 1 ≤ k ≤ n, [...] a "dependent probability system" with atomic events $E_{i,j,k}$. If event $E_{i,j,k}$ occurs, this fact is recorded in T(i, j, k), so that T(i, j, k) gives a complete history of the occurrence of all the elementary events. We can think of each cell in T(i, j, k) as an "event counter." T(i, j, k) can be normalized by dividing each cell by $\sum_{i,j,k} T(i, j, k)$, and the resulting table, call it P(i, j, k), can be thought of as a probability distribution over all atomic events. Now define $P_i$ [...]
      Patient 1   Patient 2   Patient 3
[...]
T2    (..., 23)   (0, 10)     (0, 11)
T3    (0, 4)      (0, 4)      (0, 4)
Note that in one instance (Patient 1, Treatment 2) we have confirmed that a patient must have received a specific treatment. If this knowledge is sensitive, e.g., if that treatment were typical for a sexually transmitted disease, disclosure has indeed occurred. These bounds are rather weak because they only use information contained in the marginals $p_i$, $p_j$. As we show next, better bounds are achieved when the full two-dimensional tables X × Z, Y × Z are used.
3.2 The matrix comparative-assignment approach
Suppose again that a data snooper is interested in bounds on the number of times patient $P_1$ received treatment $T_1$. These bounds might be derived by the following reasoning. First of all, the upper bound is the sum of the maximum number of times $P_1$ received $T_1$ from Doctors $D_1$, $D_2$ and $D_3$. Each of the above terms, e.g. the maximum number of times $P_1$ got $T_1$ from $D_1$, is either the number of times $P_1$ visited $D_1$ (obtained from the DP table) or the number of times $D_1$ administered $T_1$ (obtained from the DT table), whichever is smaller. Similarly, the maximum number of times $P_1$ received treatment $T_1$ from $D_2$ and $D_3$ can be derived.
The insight for the minimum number of times $P_1$ could possibly have received $T_1$ is as follows. If Doctor $D_1$ gave treatments $T_2, T_3, \ldots, T_N$ (all treatments other than $T_1$) fewer than the number of times $P_1$ visited $D_1$, some of the visits of $P_1$ must have resulted in $P_1$ getting treatment $T_1$. In our example, if David visited Dr. Hill 10 times and Dr. Hill administered Compoz and Fungicide 8 times in total, then during two of his visits to Dr. Hill he must have received AZT. Calculating this for other doctors and other treatments would give a lower bound for all the cells of the restricted table.
This insight motivates the development of the Matrix Comparative-Assignment approach for calculating bounds on all the cells of the confidential table. Interestingly, we can specify the algorithm by defining two new operators: the Cell-Maxima Operator $\overline{\otimes}$ and the Cell-Minima Operator $\underline{\otimes}$.
Definition 1 Let $A = [a_{ij}]$ be an L × M matrix and $B = [b_{jk}]$ be an M × N matrix satisfying the condition $\sum_i a_{ij} = \sum_k b_{jk}\ \forall j$, i.e. the column totals of A equal the row totals of B. The Cell-Maxima-Operator $\overline{\otimes}$ is a binary operator on (A, B) that yields an L × N matrix $C^U$ defined by
$$c^U_{ik} = \sum_{j=1}^{M} \min(a_{ij}, b_{jk}) \quad \text{for } i = 1, \ldots, L,\ k = 1, \ldots, N.$$
Note the analogy with ordinary matrix multiplication. Here we use the sum of the minima rather than the sum of the products. We next define the Cell-Minima-Operator $\underline{\otimes}$.
Definition 2 With the same conditions as in Definition 1, the Cell-Minima-Operator $\underline{\otimes}$ is a binary operator on (A, B) that yields a matrix $C^L$ of dimension L × N defined by
$$c^L_{ik} = \sum_{j=1}^{M} \max\Bigl(0,\ a_{ij} - \sum_{p \neq k} b_{jp}\Bigr) \quad \text{for } i = 1, \ldots, L,\ k = 1, \ldots, N.$$
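The two operators can be sketched in a few lines of plain Python. The function names `cell_maxima` and `cell_minima` and the second patient's counts are assumptions made for illustration; only the figures of 10 visits and 8 other treatments come from the David and Dr. Hill example in the text. The formulas follow the prose description (sum of minima for the upper bound, visits in excess of all other treatments for the lower bound).

```python
def cell_maxima(a, b):
    """Upper bounds: c[i][k] = sum over j of min(a[i][j], b[j][k])."""
    return [[sum(min(a[i][j], b[j][k]) for j in range(len(b)))
             for k in range(len(b[0]))]
            for i in range(len(a))]

def cell_minima(a, b):
    """Lower bounds: c[i][k] = sum over j of max(0, a[i][j] - sum of b[j][p] for p != k)."""
    return [[sum(max(0, a[i][j] - (sum(b[j]) - b[j][k])) for j in range(len(b)))
             for k in range(len(b[0]))]
            for i in range(len(a))]

# Hypothetical single-doctor data. Rows of `visits` are patients, columns are doctors.
visits = [[10],   # David visited Dr. Hill 10 times
          [3]]    # an assumed second patient visited him 3 times
# Rows of `given` are doctors, columns are (AZT, all other treatments).
given = [[5, 8]]  # 5 + 8 = 13 matches the 10 + 3 visits, as Definition 1 requires

upper = cell_maxima(visits, given)
lower = cell_minima(visits, given)
print(lower[0][0], upper[0][0])  # bounds on the number of times David received AZT
```

With these numbers David's AZT count is bounded between 2 and 5: the lower bound of 2 reproduces the reasoning in the text (10 visits minus 8 other treatments).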
Using these operators we develop the MCA algorithm to find the bound matrices for a confidential table.
Algorithm MCA
1. Identify two jointly confidential attributes, say $i_1, i_2$. Identify all those disclosed tables which have one or the other of these attributes. From that set, choose a pair of tables which have a nonconfidential attribute in common, say $X(i_1, i_k)$ and $X(i_k, i_2)$.
2. Find $X^U_{i_k}(i_1, i_2) = X(i_1, i_k)\ \overline{\otimes}\ X(i_k, i_2)$ and $X^L_{i_k}(i_1, i_2) = X(i_1, i_k)\ \underline{\otimes}\ X(i_k, i_2)$, applying the Cell-Maxima and Cell-Minima Operators respectively. These are the upper and lower bounds for $X(i_1, i_2)$ obtained through $i_k$.
3. Repeat Step 2 for each $i_k$ different from $i_1$ or $i_2$.
4. The tightest bounds for $X(i_1, i_2)$, denoted $X^U(i_1, i_2)$ and $X^L(i_1, i_2)$, are given by
$$X^U(i_1, i_2) = \min_{i_k}\bigl[X^U_{i_k}(i_1, i_2)\bigr], \qquad X^L(i_1, i_2) = \max_{i_k}\bigl[X^L_{i_k}(i_1, i_2)\bigr].$$
3.3 A linear programming formulation
In this method we use optimization techniques to calculate the maximum and minimum values that the entries in the confidential table can assume. We view each cell in each accessible table as a mathematical programming constraint. Such a cell is a linear combination of cells in the underlying, full-dimensional table. Thus the LHS of the corresponding constraint is just that linear combination, while the RHS is the cell value returned by the query. Similarly, a cell in the confidential table is a linear combination of base table cells. We take this combination to be an objective function, which we then both maximize and minimize subject to the constraints generated by the authorized queries. This procedure produces bounds for each cell in the confidential table. If the maximum and minimum values of some entries in the confidential table are found to be the same, then we have identified a unique value for that table entry. Such a solution constitutes a direct disclosure of restricted information. In certain applications even without an exact solution, i.e., when we get different upper and lower bounds, the bounds might provide enough information in themselves for unacceptable disclosure to have taken place.
3.4 A network representation
The LP formulation above can be represented as a multi-terminal network flow problem. Nodes consist of cells in the two-dimensional projections, while arcs represent cells in the underlying three-dimensional table. Figure 2 shows the resulting network for the 3 × 3 × 3 case. Cells of either of Doctor-Patient or Doctor-Treatment can be chosen as sources (the figure shows Doctor-Patient), the others becoming sinks. The internal nodes are the confidential cells. The problem is then to separately maximize and minimize flows through the confidential nodes. Since all data are integral, integer solutions result.
[Figure 2: Network Representation. Source nodes D1P1, ..., D3P3 (Doctor-Patient cells); internal nodes P1T1, ..., P3T3 (confidential Patient-Treatment cells); sink nodes D1T1, ..., D3T3 (Doctor-Treatment cells).]
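The optimization described in Section 3.3 would normally be handed to an LP or network-flow solver. As a dependency-free sketch, the block below instead enumerates every feasible integer table for a tiny hypothetical 2 × 2 × 2 example and records the minimum and maximum of each confidential Patient-Treatment count. It is a brute-force stand-in, not an LP solver, and it relies on one observation that is an assumption of this sketch rather than a statement from the paper: when only Doctor-Patient and Doctor-Treatment are accessible, the constraints couple cells within a single doctor's slice only, so the bounds can be computed per doctor and summed. All names and numbers are illustrative.

```python
from itertools import product

def compositions(total, parts):
    """Yield every tuple of `parts` nonnegative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def doctor_slices(visits_row, treatments_row):
    """Yield every Patient x Treatment table for one doctor that matches both accessible rows."""
    per_patient = [list(compositions(v, len(treatments_row))) for v in visits_row]
    for rows in product(*per_patient):
        col_sums = [sum(col) for col in zip(*rows)]
        if col_sums == list(treatments_row):
            yield rows

def exact_bounds(dp, dt):
    """Min and max of each Patient-Treatment count over all feasible 3-way tables."""
    n_pat, n_trt = len(dp[0]), len(dt[0])
    lo = [[0] * n_trt for _ in range(n_pat)]
    hi = [[0] * n_trt for _ in range(n_pat)]
    for visits_row, treatments_row in zip(dp, dt):
        slices = list(doctor_slices(visits_row, treatments_row))
        for p in range(n_pat):
            for t in range(n_trt):
                values = [s[p][t] for s in slices]
                lo[p][t] += min(values)  # slices are independent across doctors
                hi[p][t] += max(values)
    return lo, hi

# Hypothetical accessible tables: rows are doctors.
dp = [[3, 1], [2, 2]]  # Doctor x Patient visit counts
dt = [[2, 2], [1, 3]]  # Doctor x Treatment counts (same row totals as dp)
lo, hi = exact_bounds(dp, dt)
print(lo, hi)
```

Enumeration grows quickly with the counts, so this is only practical for very small tables; its value here is that it makes the "maximize and minimize each confidential cell subject to the query constraints" step concrete.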
4 Examples
In this section we present two examples where our content-based approaches are used for disclosure detection. In the first example we show how information about a confidential table can be obtained given other accessible tables. The second example shows that bounds on linear combinations of sensitive cells can be obtained which are tighter than those obtained using single cell optimizations.
4.1 Disclosure detection for tabular data
In this example we illustrate the LP and the MCA approaches, obtaining bounds on a confidential table given related but freely accessible tables. We use a slight extension of our hypothetical medical database example, this time with four attributes. Let T(i, j, k, l) record the number of times Doctor i saw Patient j and gave Treatment k for Condition l. In this example, any table (including T itself) which contains Patient-Treatment information is considered confidential. Accessible tables, perhaps used by senior management, billing, physician review, and so forth, include those shown in Figure 3.
DT(i, k)        AZT   Compoz   Fungicide
Dr. Phillips      8       12           1
Dr. Hill          0        9           1
Dr. Christi       4        7           2
DP(i, j)        David   Isabel   John
Dr. Phillips       14        2      5
Dr. Hill            1        7      2
Dr. Christi         8        1      4
Figure 3: Accessible Tables
PT(j, k) cannot be determined exactly since there are fewer equations than variables. Therefore we formulate this problem as a linear program as suggested in the previous section. Maximizing and minimizing each confidential cell value yields the results in Figure 4 (min and max are separated by a comma).
PT(j, k)    AZT     Compoz   Fungicide
David       1, 12   7, 20    0, 4
Isabel      0, 3    6, 10    0, 3
John        0, 9    1, 11    0, 4
Figure 4: Results of LP Analysis
In the PT table, for four out of the nine entries a disclosure has taken place, since the minimum is greater than 0. A snooper might get considerable information from these tables, depending on his objectives. If additional information is available, for example the actual contents of one or more cells, then the LP formulation narrows down the range of values that each cell can take, thereby increasing the possibility of disclosure.
Solving this example using the MCA approach yields the same bounds on the confidential tables. Applying the two operators defined above, we find $PT^L = DP\ \underline{\otimes}\ DT$ [...]
Note that the MCA approach is not limited to inference over a single pair of tables. It is possible to use sequences of pairs of tables to discover bounds. For example, if the Doctor-Patient table is not available, but Doctor-Condition and Condition-Patient are, a snooper can use the latter two to arrive at bounds for Doctor-Patient. This, coupled with Doctor-Treatment, would give bounds for Patient-Treatment, although experience shows that the resultant bounds are considerably looser.
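The statement that the MCA approach reproduces the LP bounds for this example can be checked directly. The sketch below applies the sum-of-minima and excess-visits formulas described in Section 3.2 to the Figure 3 counts, as best they can be read from the source; the helper names are assumptions, and the Doctor-Patient table is transposed so that patients index the rows, as the operator definitions require.

```python
def cell_maxima(a, b):
    """Upper bounds: c[i][k] = sum over j of min(a[i][j], b[j][k])."""
    return [[sum(min(a[i][j], b[j][k]) for j in range(len(b)))
             for k in range(len(b[0]))]
            for i in range(len(a))]

def cell_minima(a, b):
    """Lower bounds: c[i][k] = sum over j of max(0, a[i][j] - sum of b[j][p] for p != k)."""
    return [[sum(max(0, a[i][j] - (sum(b[j]) - b[j][k])) for j in range(len(b)))
             for k in range(len(b[0]))]
            for i in range(len(a))]

patients = ["David", "Isabel", "John"]
# Rows are Dr. Phillips, Dr. Hill, Dr. Christi (Figure 3).
dp = [[14, 2, 5], [1, 7, 2], [8, 1, 4]]    # columns: David, Isabel, John
dt = [[8, 12, 1], [0, 9, 1], [4, 7, 2]]    # columns: AZT, Compoz, Fungicide

pd = [list(col) for col in zip(*dp)]       # transpose to Patient x Doctor
lower = cell_minima(pd, dt)
upper = cell_maxima(pd, dt)

# A cell counts as disclosed here when its lower bound is positive.
disclosed = sum(1 for row in lower for value in row if value > 0)

for name, lo_row, hi_row in zip(patients, lower, upper):
    print(name, list(zip(lo_row, hi_row)))
print("cells with a positive minimum:", disclosed)
```

The resulting (min, max) pairs agree with every entry of Figure 4, including the four cells whose minimum is greater than 0.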
4.2 Disclosure detection for linear combinations of count data
In §4.1, optimizations were done univariately, i.e., by considering only one confidential cell at a time. Intuitively it seems plausible that because row and column sums in the confidential table are fixed (and known), tighter bounds might be achieved by somehow exploiting these sums. In this example we show that this is indeed possible, by finding bounds on linear combinations of sensitive data which are tighter than those obtainable from a univariate analysis.
The data for this example were obtained from a Carnegie Mellon University student database. The data set was much larger than the distilled fragment we present here, and it is important to note that the reduction of the original data was done using aggregation operations which would typically be considered non-revealing of sensitive data about individual students' grades. Thus it is a practical example of what a serious snooper might be able to achieve. Two tables were generated: a Professor-Student (PS) table showing the number of times a student had taken courses with a professor, and a Professor-Grade (PG) table showing the grading patterns of the professors. Both tables give data only for those courses required of IS majors, resulting in a fairly small number of rows and columns. These tables were public information.
[Figure 5: Professor-Student and Professor-Grade Tables. Recoverable PG column headers: B, B+, A-, A, A+.]
We calculated univariate upper and lower bounds on the grades each student could have received. From this, we used a counting scheme to obtain upper and lower bounds on students' grade-point averages (GPA). Specifically, a student's univariate maximum GPA was calculated by assuming that she actually received the number of A+ grades equal to the univariate maximum for A+, the number of A grades equal to that maximum, and so forth until her grade total equaled her course total. The univariate minimum was computed analogously, starting instead from B. A multivariate estimation of GPA was then calculated using the LP approach. All the cell values from Figure 5 were used as constraints, but this time an expression representing student GPA was used as the objective function. A comparison of the bounds using both techniques (Figure 6) shows that the multivariate approach often gives tighter bounds.
[Figure 6: Grade Point Average Ranges. Recoverable header: Univariate GPA. Range values in extracted order: 3.40-3.85, 2.66-3.44, 3.40-3.85, 2.66-3.44, 2.55-3.44, 2.55-3.44.]
5 Comparison of the MCA and linear programming approaches
It can be shown that the order of complexity of the LP and MCA approaches is the same when the objective is to find first-cut univariate bounds on a confidential table. The MCA approach has the limitation of operating only on two-dimensional projections of the original N-way table. Working through a maximum of N - 2 such procedures, it converges on the tightest bounds. The LP approach, on the other hand, works directly on the N-dimensional data and generates the tightest bounds in one pass. Moreover, if there is ready knowledge of one or more of the entries in the confidential table (e.g. the data snooper knows about one of the entries), then this knowledge can be captured effectively in the LP approach by adding a new constraint to the problem. Incorporating this information into the MCA algorithm increases the computation required substantially.
The concept of multivariate optimization (as shown in §4.2) is easily realized using the LP approach, since it requires only a simple change in the objective function. Adapting the MCA approach would require building in additional checks (e.g., on the number of courses taken by each student), and would be ad hoc at best. Nonetheless, the MCA algorithm appears to be a simple means for inferring sensitive information from tables that have passed traditional tests for confidentiality.
The LP approach can also be applied in situations where accessible data has been protected by conventional disclosure limitation methods. Such methods include rounding, random rounding (Nargundkar and Saveland 1972), controlled rounding (Fellegi 1977; Kelly, Golden & Assad 1990) and cell suppression (Carvalho et al. 1994). For example, if systematic rounding to base b is used (that is, if every cell entry is rounded to the nearest integer multiple of b), then instead of equality constraints, inequalities would be introduced. These inequalities reflect the imprecision in the snooper's knowledge of the actual values. Naturally, bounds on confidential data obtained by the LP approach will be wider in this case, but the potential for disclosure still exists. Tables subjected to random rounding and controlled rounding can be attacked similarly.
In the case of cell suppression, the LP method can still proceed, albeit with a reduced set of constraints. Since cell suppression techniques are designed to hide as little data as possible (while still protecting against direct disclosure), sufficient information remains to be potentially damaging.
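The effect of systematic rounding on the snooper's constraints can be sketched as follows. Each published value r, rounded to base b, only tells the snooper that the true count lies in an interval around r, so equalities become pairs of inequalities. The helper names, the choice of base 5, and the conservative treatment of ties (both endpoints kept) are assumptions of this sketch; the counts 10 and 8 are those of the David and Dr. Hill example.

```python
import math

def round_to_base(value, base):
    """Round a nonnegative count to the nearest multiple of `base` (halves round up)."""
    return base * math.floor(value / base + 0.5)

def rounding_interval(published, base):
    """Integer range of true counts consistent with a published rounded value."""
    lo = max(0, math.ceil(published - base / 2))
    hi = math.floor(published + base / 2)
    return lo, hi

def min_target_treatments(visits_lo, others_hi):
    """Fewest visits that must have ended in the target treatment."""
    return max(0, visits_lo - others_hi)

# Exact data: David visited Dr. Hill 10 times; Dr. Hill gave other treatments 8 times.
exact = min_target_treatments(10, 8)

# The same data after rounding both published counts to base 5.
base = 5
visits_lo, visits_hi = rounding_interval(round_to_base(10, base), base)
others_lo, others_hi = rounding_interval(round_to_base(8, base), base)
rounded = min_target_treatments(visits_lo, others_hi)

print(exact, rounded, (visits_lo, visits_hi), (others_lo, others_hi))
```

In this instance the lower bound drops from 2 to 0, illustrating the remark that bounds become wider under rounding; whether disclosure survives rounding in a given table depends on the data.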
6 Conclusions
It is broadly understood that there can be no quick and easy solution to confidentiality and data access problems. As researchers devise methods to limit disclosure, other equally bright data snoopers find ways to circumvent these disclosure limitation techniques. Although design-time disclosure limitation methods do apply to direct disclosure, we have shown that there is no design-time or structural approach which comprehensively addresses the problem of residual disclosure. As we have further shown, new methods allow the data snooper to infer substantial information using responses to authorized queries. Content-based real-time methods appear to be the only way to tackle this problem, since we cannot know the contents of the database at design-time.
References
[1] Adam, N.R. and Wortman, J.C. (1989) "Security-Control Methods for Statistical Databases: A Comparative Study," ACM Computing Surveys 21, 515-556.
[2] Cox, L.H. (1980) "Suppression Methodology and Statistical Disclosure Control," JASA 75, 377-385.
[3] Carvalho F.D., Dellaert N., Osorio M.S. (1994) "Statistical Disclosure in Two-dimensional Tables: Positive Tables," JASA Theory and Methods, 89, 1547-57.
[4] Denning, D.E. (1980) "Secure Statistical Databases with Random Sample Queries," ACM Trans. on Database Systems 5, 291-315.
[5] Duncan, G.T. and Lambert, D. (1986) "Disclosure-Limited Data Dissemination," JASA 81, 10-28 (with discussion by L. Cox, O. Frank, J. Gastwirth and H. Roberts).
[6] Duncan, G.T. and Lambert, D. (1989) "The Risk of Disclosure of Microdata," Journal of Business and Economic Statistics 7, 207-217.
[7] Duncan, G.T. and Pearson, R.W. (1991) "Enhancing Access to Data While Protecting Confidentiality: Prospects for the Future," Statistical Science 6, 219-239.
[8] Fellegi, I.P. (1972) "On the Question of Statistical Confidentiality," JASA, 67, 7-18.
[9] Fréchet, M. (1940) Les Probabilités associées à un système d'événements compatibles et dépendants, Première Partie. Hermann & Cie, Paris.
[10] Kelly, T., Golden, S. and Assad, P. (1990) "Controlled Rounding of Tabular Data," Operations Research 38, 760-772.
[11] Kwerel, S.M. (1988) "Fréchet Bounds," Encyclopedia of Statistics, ???.
[12] Nargundkar and Saveland (1972) "Random Rounding of Tables to Prevent Statistical Disclosure," Proceedings of the American Statistical Association, 382-387.
[13] Schlörer, J. (1975) "Identification and Retrieval of Personal Records From a Statistical Data Bank," Methods Info. Med. 15, 7-13.