Solving the Cell Suppression Problem on Tabular Data with Linear Constraints

Matteo Fischetti • DEI, University of Padova, Italy
Juan José Salazar • DEIOC, University of La Laguna, Spain

Cell suppression is a widely used technique for protecting sensitive information in statistical data presented in tabular form. Previous works on the subject mainly concentrate on 2- and 3-dimensional tables whose entries are subject to marginal totals. In this paper we address the problem of protecting sensitive data in a statistical table whose entries are linked by a generic system of linear constraints. This very general setting covers, among others, k-dimensional tables with marginals as well as the so-called hierarchical and linked tables that are very often used nowadays for disseminating statistical data. In particular, we address the optimization problem known in the literature as the (secondary) Cell Suppression Problem, in which the information loss due to suppression has to be minimized. We introduce a new integer linear programming model and outline an enumerative algorithm for its exact solution. The algorithm can also be used as a heuristic procedure to find near-optimal solutions. Extensive computational results on a test-bed of 1,160 real-world and randomly generated instances are presented, showing the effectiveness of the approach. In particular, we were able to solve to proven optimality 4-dimensional tables with marginals as well as linked tables of reasonable size (to our knowledge, tables of this kind were never solved optimally by previous authors).

(Statistical Disclosure Control; Confidentiality; Cell Suppression; Integer Linear Programming; Tabular Data; Branch-and-Cut Algorithms)

1. Introduction

A statistical agency collects data obtained from individual respondents. This data is usually obtained under a pledge of confidentiality, i.e., statistical agencies cannot release any data or data summaries from which individual respondent information can be revealed (sensitive data). On the other hand, statistical agencies aim at publishing as much information as possible, which results in a trade-off between privacy rights and information loss. This is an issue of primary importance in practice; see, e.g., Willenborg and De Wall (1996) for an in-depth analysis of statistical disclosure control methodologies.

Management Science, Vol. 47, No. 7, July 2001, pp. 1008–1027. © 2001 INFORMS

Cell suppression is a widely used technique for disclosure avoidance. We will introduce the basic cell suppression problem with the help of a simple example taken from Willenborg and De Wall (1996). Figure 1(a) exhibits a 2-dimensional statistical table giving the investment of enterprises (per millions of guilders), classified by activity and region. Let us assume that the information in cell (2,3)—the one corresponding to Activity II and Region C—is considered confidential by the statistical office, according to a certain criterion (as discussed, e.g., in Willenborg and De Wall 1996); hence it is viewed as a sensitive cell to be suppressed (primary suppression). But that is not enough: by using the marginal totals, an attacker interested in the disclosure of the sensitive cell can easily recompute its missing value. Then other table entries cannot be published as well (complementary suppression).

Figure 1: Investment of Enterprises by Activity and Region

(a) Original table

                  A      B      C   Total
    Activity I   20     50     10      80
    Activity II   8     19     22      49
    Activity III 17     32     12      61
    Total        45    101     44     190

(b) Published table

                  A      B      C   Total
    Activity I   20     50     10      80
    Activity II   *     19      *      49
    Activity III  *     32      *      61
    Total        45    101     44     190

For example, with the missing entries in Figure 1(b), an attacker cannot disclose the nominal value of the sensitive cell exactly, although he/she can still compute a range for the values of this cell which are consistent with the published entries. Indeed, the minimum value y̲_23 for the sensitive cell can be computed by solving a linear program in which the values y_ij for the missing cells (i, j) are treated as unknowns, namely

    y̲_23 = min y_23
    subject to
        y_21 + y_23 = 30,
        y_31 + y_33 = 29,
        y_21 + y_31 = 25,
        y_23 + y_33 = 34,
        y_21, y_23, y_31, y_33 ≥ 0.

Notice that the right-hand side values are known to the attacker, as they can be obtained as the difference between the marginal and the published values in a row/column. The maximum value ȳ_23 for the sensitive cell can be computed in a perfectly analogous way, by solving the linear program of maximizing y_23 subject to the same constraints as before. In the example, y̲_23 = 5 and ȳ_23 = 30, i.e., the sensitive information is "protected" within the protection interval [5, 30]. If this interval is considered


sufficiently wide by the statistical office, the sensitive cell is called protected; otherwise new suppressions are needed. (Notice that the extreme values of the interval [5, 30] are only attained if the cell corresponding to Activity II and Region A takes the quite unreasonable values of 0 and 25; bounding the cell variation to ±50% of the nominal value (say) results in the more realistic protection interval [18, 26].)

The Cell Suppression Problem (CSP) consists of finding a set of cells whose suppression guarantees the protection of all the sensitive cells against the attacker, with minimum loss of information associated with the suppressed entries. This problem belongs to the class of the strongly NP-hard problems (see, e.g., Kelly et al. 1992, Geurts 1992, Kao 1996), meaning that it is very unlikely that an algorithm for the exact solution of CSP exists which guarantees an efficient (polynomial-time) performance for all possible input instances.

Previous works on CSP mainly concentrate on heuristic algorithms for 2-dimensional tables with marginals; see Cox (1980, 1995), Sande (1984), Kelly et al. (1992), and Carvalho et al. (1994), among others. Kelly (1990) proposed a mixed-integer linear programming formulation for 2- and 3-dimensional tables with marginals, which requires a very large number of variables and constraints. Geurts (1992) refined the 2-dimensional model slightly, and reported computational experience on small-size instances, the largest instance solved to optimality being a table with 20 rows, 6 columns, and 17 sensitive cells. Gusfield (1988) gave a polynomial-time algorithm for a special case of the problem in 2-dimensional tables. Recently, we presented in Fischetti and Salazar (1999) a new method capable of solving to proven optimality 2-dimensional instances with up to 250,000 cells and 10,000 sensitive cells on a personal computer.
Heuristics for 3-dimensional tables with marginals have been proposed in Robertson (1995), Sande (1984), and Dellaert and Luijten (1996).

In this paper we address the problem of protecting sensitive data in a statistical table whose entries are linked by a generic system of linear equations. This very general setting covers, among others, k-dimensional tables with marginals, as well as the so-called hierarchical and linked tables.


Hierarchical and linked tables consist of a set of k-dimensional tables derived from a common dataset. These structures have become increasingly important in recent years, as today's technology allows for the electronic dissemination of large collections of statistical datasets. As discussed, e.g., in Willenborg and de Wall (1996), the intrinsic complexity of hierarchical and linked tables calls for updated disclosure control methodologies. Indeed, the individual protection of each table belonging to a hierarchical/linked set is not guaranteed to produce safe results. For example, Sande (1998) showed how it is possible to disclose confidential information by means of linear programming methods applied to statistical surveys recently published by accredited statistical offices. This gave us the motivation to improve the current understanding of the cell suppression problem for complex data structures.

Unfortunately, the extension from 2-dimensional tables to hierarchical/linked tables is far from trivial. In particular, the nice network structure we exploited in Fischetti and Salazar (1999) for addressing 2-dimensional tables does not extend to the general case, hence the study of the general setting needs more sophisticated mathematical and algorithmic tools (e.g., Benders' decomposition instead of the classical max-flow/min-cut theorem).

The paper is organized as follows. A formal description of the cell suppression problem is given in §2. Section 3 introduces and discusses new mathematical models for the problem. In particular, a new integer linear programming model is proposed, having a 0-1 decision variable for each potential suppression and an exponential number of linear constraints enforcing the protection level requirements. Section 4 addresses efficient methods for solving the proposed model within the so-called branch-and-cut framework. Section 5 illustrates our solution method through a simple example.
Computational results are given in §6, where nine real-world instances are optimally solved on a PC within acceptable computing time. In particular, we were able to solve to proven optimality a 4-dimensional table with marginals and four linked tables. Extensive computational results on 1,160 randomly generated 3- and 4-dimensional tables are also reported. Some conclusions are finally drawn in §7.

2. The Cell Suppression Problem

We next give a formal definition of the cell suppression problem we address in this paper. A table is a vector y = (y_1, …, y_n) whose entries satisfy a given set of linear constraints known to a possible attacker, namely

    My = b,
    lb_i ≤ y_i ≤ ub_i    for all i = 1, …, n.    (1)

In other words, system (1) models the whole a priori information on the table known to an attacker. Typically, each equation in (1) corresponds to a marginal entry, whereas inequalities enforce the "external bounds" known to the attacker. In the case of k-dimensional tables with marginals, each equation in (1) is of the type Σ_{j∈Q_i} y_j − y_i = 0, where index i corresponds to a marginal entry and index set Q_i to the associated internal table entries. Therefore, in this case M is a {0, ±1}-matrix and b = 0. Moreover, in case k = 2, the linear system (1) can be represented in a natural way as a network, a property having important theoretical and practical implications; see, e.g., Cox (1980, 1995), Kelly et al. (1992), and Fischetti and Salazar (1999). Unfortunately, this nice structure is not preserved for k ≥ 3, unless the table decomposes into a set of independent 2-dimensional subtables.

A cell is an index corresponding to an entry of the table. Given a nominal table a, let PS = {i_1, …, i_p} be the set of sensitive cells to be protected, as identified by the statistical office according to some criteria. For each sensitive cell i_k, k = 1, …, p, the statistical office provides three nonnegative values, namely LPL_k, UPL_k, and SPL_k, called Lower Protection Level, Upper Protection Level, and Sliding Protection Level, respectively, whose role will be discussed next. In typical applications, these values are computed as a certain percentage of the nominal value a_{i_k}.

A suppression pattern is a subset of cells SUP ⊆ {1, …, n} corresponding to the unpublished cells. A consistent table with respect to a given suppression


pattern SUP and to a given nominal table a is a vector y = (y_1, …, y_n) satisfying

    My = b,
    lb_i ≤ y_i ≤ ub_i    for all i ∈ SUP,
    y_i = a_i            for all i ∉ SUP,    (2)

where the latter equations impose that the components of y associated with the published entries coincide with the nominal ones. In other words, any consistent table gives a feasible way the attacker can fill the missing entries of the published table. A suppression pattern is considered feasible by the statistical office if it guarantees the required protection intervals against an attacker, in the sense that, for each sensitive cell i_k, k = 1, …, p, there exist two consistent tables, say f^k and g^k, such that

    f^k_{i_k} ≤ a_{i_k} − LPL_k,
    g^k_{i_k} ≥ a_{i_k} + UPL_k,  and
    g^k_{i_k} − f^k_{i_k} ≥ SPL_k.    (3)

In other words, it is required that y̲_{i_k} ≤ a_{i_k} − LPL_k, ȳ_{i_k} ≥ a_{i_k} + UPL_k, and ȳ_{i_k} − y̲_{i_k} ≥ SPL_k, where

    ȳ_{i_k} = max{ y_{i_k} : (2) holds }   and   y̲_{i_k} = min{ y_{i_k} : (2) holds }.

Note that each nonzero sliding protection level SPL_k allows the statistical office to control the length of the uncertainty range for cell i_k without forcing specific upper and lower bounds UPL_k and LPL_k (these latter bounds being typically set to zero in case SPL_k > 0), a situation which is sometimes preferred to reduce the potential correlation of the unpublished "true" value a_{i_k} with the attacker's "middle-point" estimate (ȳ_{i_k} + y̲_{i_k})/2.

As already mentioned, the statistical office is interested in selecting, among all feasible suppression patterns, one with minimum information loss. This issue can be modeled by associating a weight w_i ≥ 0 with each entry of the table, and by requiring the minimization of the overall weight of the suppressed cells, namely Σ_{i∈SUP} w_i. In typical applications, the weights w_i provided by the statistical office are proportional to a_i or to log(a_i). The resulting combinatorial problem is known in the literature as the (complementary or secondary) Cell Suppression Problem, or CSP for short.
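The definitions above can be made concrete on the Figure 1 example. The sketch below is an illustration (not the authors' code): it solves the two attacker problems ȳ and y̲ for the sensitive cell over the reduced system of the four suppressed cells, and then checks the feasibility conditions (3) under illustrative protection levels LPL_k = UPL_k = 5 and SPL_k = 0; scipy is assumed to be available.

```python
from scipy.optimize import linprog

# Consistent-table system (2) for the pattern of Figure 1(b), with the
# published cells substituted out; unknowns are y21, y23, y31, y33.
A_eq = [[1, 1, 0, 0],   # y21 + y23 = 30  (Activity II row)
        [0, 0, 1, 1],   # y31 + y33 = 29  (Activity III row)
        [1, 0, 1, 0],   # y21 + y31 = 25  (Region A column)
        [0, 1, 0, 1]]   # y23 + y33 = 34  (Region C column)
b_eq = [30, 29, 25, 34]
bounds = [(0, None)] * 4           # external bounds: nonnegativity only

a_ik = 22                          # nominal value of the sensitive cell y23
LPL, UPL, SPL = 5, 5, 0            # illustrative protection levels

c = [0, 1, 0, 0]                   # objective: the sensitive cell y23
y_lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun              # min
y_hi = -linprog([-v for v in c], A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun  # max

# Feasibility of the suppression pattern for this cell, as required by (3):
protected = (y_lo <= a_ik - LPL and y_hi >= a_ik + UPL
             and y_hi - y_lo >= SPL)
print(y_lo, y_hi, protected)       # 5.0 30.0 True
```

The computed interval [5, 30] matches the one discussed after Figure 1, so the pattern of Figure 1(b) is feasible for these levels.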


3. A New Integer Linear Programming Model

In the sequel, for notational convenience we define the relative external bounds

    LB_i = a_i − lb_i ≥ 0   and   UB_i = ub_i − a_i ≥ 0,

i.e., the range of feasible values for cell i known to the attacker is [a_i − LB_i, a_i + UB_i]. To obtain a Mixed-Integer Linear Programming (MILP) model for CSP, we introduce a binary variable x_i for each cell i, where x_i = 1 if i ∈ SUP (suppressed), and x_i = 0 otherwise (published). Clearly, we can fix x_i = 0 for all cells that have to be published (if any), and x_i = 1 for all cells that have to be suppressed (sensitive cells). Using this set of variables, the model is of the form

    min Σ_{i=1}^n w_i x_i    (4)

subject to x ∈ {0, 1}^n and, for each sensitive cell i_k, k = 1, …, p:

    the suppression pattern associated with x satisfies the lower
    protection level requirement with respect to cell i_k;    (5)

    the suppression pattern associated with x satisfies the upper
    protection level requirement with respect to cell i_k;    (6)

    the suppression pattern associated with x satisfies the sliding
    protection level requirement with respect to cell i_k.    (7)

3.1. The Classical Model
A possible way to express Conditions (5)–(7) through linear constraints requires the introduction, for each k = 1, …, p, of auxiliary continuous variables f^k = (f^k_i : i = 1, …, n) and g^k = (g^k_i : i = 1, …, n) defining tables that are consistent with respect to the suppression pattern associated with x and satisfy (3). This is in the spirit of the MILP model proposed by Kelly (1990) for 2-dimensional tables with marginals. The resulting MILP model then reads:

    min Σ_{i=1}^n w_i x_i    (8)


subject to x ∈ {0, 1}^n and, for each sensitive cell i_k, k = 1, …, p:

    Mf^k = b,
    a_i − LB_i x_i ≤ f^k_i ≤ a_i + UB_i x_i    for i = 1, …, n,    (9)

    Mg^k = b,
    a_i − LB_i x_i ≤ g^k_i ≤ a_i + UB_i x_i    for i = 1, …, n,    (10)

    f^k_{i_k} ≤ a_{i_k} − LPL_k,    (11)
    g^k_{i_k} ≥ a_{i_k} + UPL_k,    (12)
    g^k_{i_k} − f^k_{i_k} ≥ SPL_k.    (13)

Notice that the lower/upper bounds on the variables f^k_i and g^k_i in (9) and (10) depend on x_i so as to enforce f^k_i = g^k_i = a_i whenever x_i = 0 (cell i is not suppressed), and lb_i ≤ f^k_i ≤ ub_i and lb_i ≤ g^k_i ≤ ub_i otherwise (cell i is suppressed). Therefore, (9) and (10) stipulate the consistency of f^k and g^k, respectively, with the published table, whereas (11), (12), and (13) translate the protection level requirements (5), (6), and (7), respectively.

Standard MILP solution techniques such as branch-and-bound or cutting-plane methods (see, e.g., Nemhauser and Wolsey 1988) require the solution of the Linear Programming (LP) relaxation of the model in hand, obtained by relaxing conditions x_i ∈ {0, 1} into 0 ≤ x_i ≤ 1 for all i. However, even the LP relaxation of Model (8)–(13) is very difficult to solve, in that it involves a really huge number of auxiliary variables f^k_i and g^k_i and linking constraints between the x and the auxiliary variables. For example, for a 100 × 100 table with marginals having 5% sensitive cells, the model needs more than 10,000,000 variables and 20,000,000 constraints—a size that cannot be handled explicitly by today's LP technology.

We next propose a new model based on Benders' decomposition (see, e.g., Nemhauser and Wolsey 1988). The idea is to use standard LP duality theory to avoid the introduction of the auxiliary variables f^k and g^k, k = 1, …, p, along with the associated linking constraints. In the new model, the protection level requirements are in fact imposed by means of a family of linear constraints in the space of the x-variables only. Before formulating the new model, we need a

characterization of the vectors x for which Systems (9)–(13) admit feasible f^k and g^k solutions, which is obtained as follows.

3.2. Imposing the Upper Protection Level Requirements
Assume that x is a given (arbitrary but fixed) parameter, and consider any given sensitive cell i_k, k = 1, …, p, along with the associated upper protection level requirement. Clearly, the linear system (10) and (12) admits a feasible solution g^k if and only if a_{i_k} + UPL_k ≤ ȳ_{i_k}, where ȳ_{i_k} is the optimal value of the linear problem

    ȳ_{i_k} = max y_{i_k}    (14)
    subject to
        My = b,    (15)
        y_i ≤ a_i + UB_i x_i       for all i = 1, …, n,    (16)
        −y_i ≤ −a_i + LB_i x_i     for all i = 1, …, n.    (17)

This is a parametric LP problem in the y-variables only, with variable upper/lower bounds depending on the given parameter x. We call (14)–(17) the attacker subproblem associated with the upper protection of sensitive cell i_k, with respect to parameter x. By LP duality, this subproblem is equivalent to the linear problem

    ȳ_{i_k} = min γ^t b + Σ_{i=1}^n [ α_i (a_i + UB_i x_i) − β_i (a_i − LB_i x_i) ]    (18)
    subject to
        α^t − β^t + γ^t M = e^t_{i_k},
        α ≥ 0,  β ≥ 0,  γ unrestricted in sign,    (19)

where e_{i_k} denotes the i_k-th column of the identity matrix of order n, and γ, α, and β are the dual vectors associated with constraints (15), (16), and (17), respectively. It then follows that the linear system (10) and (12) has a feasible solution if and only if

    a_{i_k} + UPL_k ≤ ȳ_{i_k} = min{ γ^t b + Σ_{i=1}^n [ α_i (a_i + UB_i x_i) − β_i (a_i − LB_i x_i) ] : (19) holds },


i.e., if and only if

    a_{i_k} + UPL_k ≤ γ^t b + Σ_{i=1}^n [ α_i (a_i + UB_i x_i) − β_i (a_i − LB_i x_i) ]
        for all (α, β, γ) satisfying (19).

Because of (19) and Ma = b, we have γ^t b + Σ_{i=1}^n (α_i a_i − β_i a_i) = γ^t Ma + (α^t − β^t) a = e^t_{i_k} a = a_{i_k}. Hence the above system can be rewritten as

    Σ_{i=1}^n (α_i UB_i + β_i LB_i) x_i ≥ UPL_k    for all (α, β, γ) satisfying (19).    (20)

In other words, System (20) defines a set of constraints, in the x variables only, which is equivalent to Condition (6) concerning the upper protection level requirement for sensitive cell i_k. Notice that (20) contains in principle an infinite number of constraints, each associated with a different point (α, β, γ) of the polyhedron defined by (19). However, it is well known (see, e.g., Nemhauser and Wolsey 1988) that only the extreme points (and rays) of such a polyhedron can lead to nondominated constraints (20), i.e., a finite number of such constraints is sufficient to impose the upper protection level requirement for a given sensitive cell i_k.

3.3. Imposing the Lower Protection Level Requirements
Analogously, the lower protection level requirement for a given cell i_k is equivalent to imposing y̲_{i_k} ≤ a_{i_k} − LPL_k, where

    y̲_{i_k} = min{ y_{i_k} : (15)–(17) hold } ≡ −max{ −y_{i_k} : (15)–(17) hold }.    (21)

This is called the attacker subproblem associated with the lower protection of sensitive cell i_k, with respect to parameter x. By LP duality, this subproblem is equivalent to the linear problem

    y̲_{i_k} = −min γ^t b + Σ_{i=1}^n [ α_i (a_i + UB_i x_i) − β_i (a_i − LB_i x_i) ]    (22)
    subject to
        α^t − β^t + γ^t M = −e^t_{i_k},
        α ≥ 0,  β ≥ 0,  γ unrestricted in sign.    (23)

Hence the lower protection level requirement (5) for cell i_k can be formulated as

    a_{i_k} − LPL_k ≥ −( γ^t b + Σ_{i=1}^n [ α_i (a_i + UB_i x_i) − β_i (a_i − LB_i x_i) ] )
        for all (α, β, γ) satisfying (23),

or, equivalently,

    Σ_{i=1}^n (α_i UB_i + β_i LB_i) x_i ≥ LPL_k    for all (α, β, γ) satisfying (23).
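The sign conventions in (21)–(23) are easy to check numerically. The sketch below is an illustration with assumed data (the reduced four-cell system from the Figure 1 example, external bounds 0 ≤ y_i ≤ 100, and all four cells suppressed, i.e., x = 1): it solves the dual problem (22)–(23) with scipy and recovers y̲_{i_k} = −min{…} = 5, the same value as the primal minimum.

```python
import numpy as np
from scipy.optimize import linprog

M = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]])
b = np.array([30.0, 29.0, 25.0, 34.0])
a = np.array([8.0, 22.0, 17.0, 12.0])   # nominal values of y21, y23, y31, y33
LB, UB = a - 0.0, 100.0 - a             # assumed external bounds 0 <= y_i <= 100
x = np.ones(4)                          # all four cells suppressed
k = 1                                   # sensitive cell y23
n = m = 4

# Dual variables z = (alpha, beta, gamma); alpha, beta >= 0, gamma free.
c = np.concatenate([a + UB * x, -(a - LB * x), b])   # objective of (22)
A_eq = np.hstack([np.eye(n), -np.eye(n), M.T])       # alpha - beta + M^T gamma
b_eq = -np.eye(n)[k]                                 # ... = -e_{i_k}, as in (23)
res = linprog(c, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (2 * n) + [(None, None)] * m)

y_lo = -res.fun                         # y_lower = -min{...}, per (22)
print(y_lo)                             # 5.0, the attacker's minimum for y23
```

By strong LP duality the dual minimum equals max{−y_23} = −5, so the leading minus sign in (22) is what restores the primal value 5.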

3.4. Imposing the Sliding Protection Level Requirements
As to the sliding protection level for sensitive cell i_k, the requirement is that

    SPL_k ≤ ȳ_{i_k} − y̲_{i_k} = max{ y_{i_k} : (15)–(17) hold } + max{ −y_{i_k} : (15)–(17) hold }.

Again, by LP duality, this condition is equivalent to

    SPL_k ≤ min{ γ^t b + Σ_{i=1}^n [ α_i (a_i + UB_i x_i) − β_i (a_i − LB_i x_i) ] : (19) holds }
          + min{ γ′^t b + Σ_{i=1}^n [ α′_i (a_i + UB_i x_i) − β′_i (a_i − LB_i x_i) ] : (23) holds }.

Therefore, the feasibility condition can now be formulated by requiring

    SPL_k ≤ (γ + γ′)^t b + Σ_{i=1}^n [ (α_i + α′_i)(a_i + UB_i x_i) − (β_i + β′_i)(a_i − LB_i x_i) ]
        for all (α, β, γ) satisfying (19) and for all (α′, β′, γ′) satisfying (23),

or, equivalently,

    Σ_{i=1}^n [ (α_i + α′_i) UB_i + (β_i + β′_i) LB_i ] x_i ≥ SPL_k
        for all (α, β, γ) satisfying (19) and for all (α′, β′, γ′) satisfying (23).

3.5. The New Model
The above characterization of the feasible vectors x leads to the following new integer linear model for CSP:

    min Σ_{i=1}^n w_i x_i    (24)

subject to x ∈ {0, 1}^n and, for each sensitive cell i_k, k = 1, …, p:

    Σ_{i=1}^n (α_i UB_i + β_i LB_i) x_i ≥ UPL_k
        for all extreme points (α, β, γ) satisfying (19),    (25)

    Σ_{i=1}^n (α_i UB_i + β_i LB_i) x_i ≥ LPL_k
        for all extreme points (α, β, γ) satisfying (23),    (26)

    Σ_{i=1}^n [ (α_i + α′_i) UB_i + (β_i + β′_i) LB_i ] x_i ≥ SPL_k
        for all extreme points (α, β, γ) satisfying (19)
        and for all extreme points (α′, β′, γ′) satisfying (23).    (27)

Notice that all the left-hand-side coefficients of the variables x_i are nonnegative. As a consequence, all the constraints with zero right-hand-side value need not be included in the model, as they do not correspond to a proper protection level requirement. We call (25)–(27) the capacity constraints, in analogy with similar constraints we introduced in Fischetti and Salazar (1999) for 2-dimensional tables with marginals for enforcing a sufficient "capacity" of certain cuts in the network representation of the problem. Intuitively, the capacity constraints force the suppression (i.e., setting x_i = 1) of a sufficient number of cells, whose positions within the table and contributions to the overall protection are determined by the dual variables (α, β, γ) and (α′, β′, γ′) of the attacker subproblems.
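To illustrate how a capacity constraint in family (25) arises from an attacker subproblem, the sketch below uses assumed data again (the reduced four-cell system of Figure 1, external bounds 0 ≤ y_i ≤ 100, UPL_k = 5): it takes a candidate solution x* that suppresses only the sensitive cell, solves the dual attacker subproblem (18)–(19) with scipy, and reads a violated cut off the optimal duals.

```python
import numpy as np
from scipy.optimize import linprog

M = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]])
b = np.array([30.0, 29.0, 25.0, 34.0])
a = np.array([8.0, 22.0, 17.0, 12.0])     # nominal values of y21, y23, y31, y33
LB, UB = a.copy(), 100.0 - a              # assumed external bounds 0 <= y_i <= 100
k, UPL_k = 1, 5.0                         # sensitive cell y23, upper level
x_star = np.array([0.0, 1.0, 0.0, 0.0])   # only the sensitive cell is suppressed
n = 4

# Dual attacker subproblem (18)-(19), variables (alpha, beta, gamma):
c = np.concatenate([a + UB * x_star, -(a - LB * x_star), b])
A_eq = np.hstack([np.eye(n), -np.eye(n), M.T])  # alpha - beta + M^T gamma = e_{i_k}
res = linprog(c, A_eq=A_eq, b_eq=np.eye(n)[k],
              bounds=[(0, None)] * (2 * n) + [(None, None)] * n)

y_bar = res.fun                            # attacker's maximum for y23 under x_star
alpha, beta = res.x[:n], res.x[n:2 * n]
coef = alpha * UB + beta * LB              # coefficients of a cut in family (25)

# y_bar = 22 < a_k + UPL_k = 27, so x_star gives no protection, and the cut
# sum_i coef[i] * x[i] >= UPL_k cuts it off (its left-hand side at x_star is 0).
print(y_bar, float(coef @ x_star))
```

Adding this cut to the model forces the suppression of at least one additional cell, which is exactly the intuition behind the capacity constraints.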

4. Solving the New Model

The solution of Model (24)–(27) can be achieved through an enumerative scheme commonly known as branch-and-cut, as introduced by Padberg and Rinaldi (1991) (see Caprara and Fischetti 1997 for a recent annotated bibliography on the subject). The main ingredients of the scheme are described next.

4.1. Solving the LP Relaxation
The solution of the LP relaxation of Model (24)–(27) is approached through the following cutting-plane scheme. We start by solving the so-called master LP

    min{ Σ_{i=1}^n w_i x_i : x_{i_1} = · · · = x_{i_p} = 1, x ∈ [0, 1]^n },

in which we only impose the suppression of the sensitive cells. Let x* be the optimal solution found. Our order of business is to check whether the vector x* (viewed as a given parameter) guarantees the required protection levels. In geometrical terms, this is equivalent to finding a hyperplane in the x-space that separates x* from the polyhedron defined by the capacity constraints. This is called the separation problem associated with the family of capacity constraints (25)–(27), and can be solved efficiently as follows. For each sensitive cell i_k, in turn, we apply the following steps:

1. We first solve the attacker subproblem (14)–(17) for x = x* and check whether a_{i_k} + UPL_k ≤ ȳ_{i_k}. If this is the case, then x* satisfies the upper protection level requirement for the given i_k, hence all the capacity constraints (25) are certainly fulfilled. Otherwise, the optimal dual solution (ᾱ, β̄, γ̄) of the attacker subproblem satisfies (19) and

    γ̄^t b + Σ_{i=1}^n [ ᾱ_i (a_i + UB_i x*_i) − β̄_i (a_i − LB_i x*_i) ] = ȳ_{i_k} < a_{i_k} + UPL_k,

hence it induces a capacity constraint Σ_{i=1}^n (ᾱ_i UB_i + β̄_i LB_i) x_i ≥ UPL_k in family (25) that is violated by x*. This constraint is then added to the master LP.

2. We then check whether x* satisfies the lower protection level requirement for i_k, which requires the solution of the attacker subproblem (21) associated with the lower protection level of cell i_k, and possibly add to the master LP a violated capacity constraint in family (26).


3. Finally, we check whether x* satisfies the sliding protection level for i_k. This simply requires checking whether the values ȳ_{i_k} and y̲_{i_k} computed in the previous steps satisfy ȳ_{i_k} − y̲_{i_k} ≥ SPL_k. If this is not the case, setting (α, β, γ) = (ᾱ, β̄, γ̄) and (α′, β′, γ′) = (ᾱ′, β̄′, γ̄′), where the barred vectors are the optimal dual solutions computed in Steps 1 and 2, leads to a violated capacity cut (27).

Clearly, Steps 1 and 3 (respectively, Steps 2 and 3) can be avoided if LPL_k = SPL_k = 0 (respectively, UPL_k = SPL_k = 0).

After having considered all sensitive cells i_k we have two possible cases. If no capacity constraint has been added to the master LP, then all of them are satisfied by x*, which is therefore an optimal solution of the LP relaxation of Model (24)–(27). Otherwise, the master LP amended by the new capacity constraints is reoptimized and the approach is iterated on the (possibly fractional) optimal solution x* of the new master LP. By using the above cutting-plane scheme one can solve efficiently the overall LP relaxation of our model, since the above-described separation procedure for capacity constraints (25)–(27) can be implemented to run in polynomial time.

4.2. Strengthening the LP Relaxation
The effectiveness of the branch-and-cut approach greatly depends on how tightly the LP relaxation approximates the integer solution set. In this respect, adding to the model new classes of linear constraints can be greatly beneficial, in that the additional constraints (which are redundant when the integrality condition on the variables is active) can produce tightened LP relaxations, and hence a significant speed-up in the overall problem resolution. We next outline some families of additional constraints that we found effective in our computational experience. As in the case of capacity constraints, these new inequalities are added to the LP relaxation on the fly, when they are violated by the solution x* of the current master LP. This requires the exact/heuristic solution of the separation problem associated with each new family of constraints.

4.2.1. Strengthened Capacity Constraints. Capacity constraints have been derived without using the information on the integrality of the x variables. Indeed, let

    Σ_{i=1}^n s_i x_i ≥ s_0    (28)

represent any capacity inequality (25)–(27), whose coefficients s_1, …, s_n are all nonnegative. We claim that any integer vector x ≥ 0 satisfying (28) must also satisfy

    Σ_{i=1}^n min{s_i, s_0} x_i ≥ s_0.    (29)

Indeed, let T = { i ∈ {1, …, n} : s_i > s_0 }. Given any integer x ≥ 0 satisfying (28), if x_i = 0 for all i ∈ T then Σ_{i=1}^n min{s_i, s_0} x_i = Σ_{i=1}^n s_i x_i ≥ s_0. Otherwise, we have Σ_{i=1}^n min{s_i, s_0} x_i ≥ s_0 x_j ≥ s_0, where j is any index in T such that x_j ≠ 0 (hence x_j ≥ 1). Notice that Condition (29) is stronger than (28) when x is not integer, a case of interest when solving the LP relaxation of our model.

As already discussed, the use of the strengthened capacity constraints requires addressing the associated separation problem. In our implementation we use a simple separation heuristic in which we apply the strengthening procedure only to the capacity constraints associated with the optimal dual solutions of the attacker subproblems, computed as described in §4.1. Although very simple, the above improvement is quite effective in practice, mainly when UPL_k, LPL_k, and/or SPL_k are small in comparison with the given external relative bounds UB_i and LB_i. Indeed, our computational experience shows that deactivating this improvement has a dramatic impact on the model quality, and hence on the convergence properties of our code.

4.2.2. Cover Inequalities. Following the seminal work of Crowder et al. (1983) on the solution of general integer programming models, we can observe that each single capacity inequality implies a number of "more combinatorial" restrictions. To be specific, let Σ_i s_i x_i ≥ s_0 again represent any strengthened capacity constraint in (29), whose coefficients s_i are all nonnegative. Clearly, one has to suppress at least one cell for any subset Q ⊆ {1, …, n} such that Σ_{i∈Q} s_i < s_0, a


condition that can be expressed by means of the following cover inequalities:

    Σ_{i∈Q} x_i ≥ 1    for each cell subset Q with Σ_{i∈Q} s_i < s_0.    (30)

These inequalities can easily be improved to their lifted form:

    Σ_{i∈EXT(Q)} x_i ≥ |EXT(Q)| − |Q| + 1    for each cell subset Q with Σ_{i∈Q} s_i < s_0,    (31)

where EXT(Q) = Q ∪ { i : s_i ≥ max{s_j : j ∈ Q} }. We refer the interested reader to, e.g., Nemhauser and Wolsey (1988) for a discussion on the validity of lifted cover inequalities, and for possible procedures to solve the associated separation problem.

4.2.3. Bridgeless Inequalities. As the weights w_i are assumed to be nonnegative, every CSP instance has an optimal solution in which no suppression is redundant. Therefore, one can require that the value of each cell h with x_h = 1 cannot be recomputed exactly. This is equivalent to associating a very small fictitious sliding protection level ε > 0 with each suppressed nonsensitive cell, and to setting up the associated attacker subproblems with the requirement that

    ȳ_h − y̲_h ≥ ε x_h.

Notice that this condition is only active for suppressed cells h, and vanishes when x_h = 0. As already discussed, the above condition on the optimal values ȳ_h and y̲_h of the attacker subproblems can be enforced by the following class of capacity constraints:

    Σ_{i=1}^n [ (α_i + α′_i) UB_i + (β_i + β′_i) LB_i ] x_i ≥ ε x_h,    (32)

valid for all extreme points (α, β, γ) satisfying (19) and for all extreme points (α′, β′, γ′) satisfying (23), with cell h playing the role of i_k. These constraints are of the same nature as Constraints (27), but have a zero right-hand-side value when x_h = 0. As stated, Conditions (32) can be very weak, in that the right-hand-side value ε is very close to zero.


However, they become effective in their strengthened form:

    Σ_{i∈Q} x_i ≥ x_h,    (33)

where Q = { i ∈ {1, …, n} : (α_i + α′_i) UB_i + (β_i + β′_i) LB_i > 0 }, and (α, β, γ) and (α′, β′, γ′) are as before. We call (33) the bridgeless inequalities, as in the 2-dimensional case they forbid the presence of "bridges" in a certain network structure associated with the suppressed cells; see Fischetti and Salazar (1999) for details. The separation problem for bridgeless inequalities is perfectly analogous to the one described for the strengthened capacity constraints (sliding case), and requires the solution of the two attacker subproblems associated with any nonsensitive cell h with x*_h > 0.

4.3. Branching
Whenever the solution of the LP relaxation of our strengthened model (say x*) is noninteger and has an objective value smaller than the value of the current best feasible solution, we branch on a fractional variable x_b chosen according to the following "strong branching" criterion (see Applegate et al. 1995). We first identify the 10 fractional variables x*_i that are as close as possible to 0.5. For each such variable, in turn, we solve our current LP model amended by the new condition x_i = 0 or x_i = 1, so as to estimate the effect of branching on x_i. The actual branching variable x_b is then chosen as the one maximizing the average subproblem lower bound (z0_i + z1_i)/2, where z0_i and z1_i denote the optimal solution values of the two LP problems associated with conditions x_i = 0 and x_i = 1, respectively.

4.4. Problem Reduction
The computing time spent in the solution of a given instance depends on its size, expressed in terms of both the number of decision variables involved and the number of nonzero protection levels (recall that zero protection levels do not induce capacity constraints). We next outline simple criteria to reduce the size of a given CSP instance.

A (typically substantial) reduction in the number of nonzero protection levels is achieved in a preprocessing phase, to be applied before entering the branch-and-cut scheme. This is based on the observation that

Management Science/Vol. 47, No. 7, July 2001

FISCHETTI AND SALAZAR The Cell Suppression Problem on Tabular Data with Linear Constraints

primary suppressions alone may be sufficient to protect some of the sensitive cells, which therefore do not need to be considered sensitive anymore. To be specific, we consider the suppression pattern SUP = {i_1, …, i_p} and, for each sensitive cell i_k, we solve the attacker subproblem (14)–(17). In case ȳ_ik ≥ a_ik + UPL_k, one can clearly set UPL_k = 0, thus deactivating the upper protection level requirement for i_k. Otherwise, the (strengthened) capacity constraint associated with the dual optimal solution of the attacker subproblem qualifies as a relevant constraint, hence it is stored in the branch-and-cut constraint pool. A similar reasoning applies to lower and sliding protection levels.

A naive implementation of the above idea may be excessively time consuming for large instances, in that it may require even more computing time than the whole branch-and-cut algorithm applied to the original instance. Hence a parametric resolution of the several attacker subproblems involved in preprocessing is needed. We suggest the following approach. We introduce two p-dimensional arrays HIGH and LOW, whose entries HIGH_k and LOW_k give, respectively, a lower bound on the value ȳ_ik of the maximization attacker subproblem and an upper bound on the value y̲_ik of the minimization attacker subproblem associated with i_k, with respect to the suppression pattern SUP = {i_1, …, i_p}. We initialize HIGH_k = LOW_k = a_ik for all k = 1, …, p, and then consider the sensitive cells i_k according to a nonincreasing sequence of the associated values max{SPL_k, UPL_k + LPL_k}. For each i_k, we first check whether HIGH_k < a_ik + UPL_k or HIGH_k − LOW_k < SPL_k, in which case we solve the attacker subproblem (14)–(17) and obtain a consistent table ȳ maximizing y_ik. The entries of ȳ are then used to update all the entries of HIGH and LOW by setting HIGH_h = max{HIGH_h, ȳ_ih} and LOW_h = min{LOW_h, ȳ_ih} for all h = 1, …, p.

We then check whether LOW_k > a_ik − LPL_k or HIGH_k − LOW_k < SPL_k, in which case we solve the attacker subproblem (21) and obtain a consistent table y̲ minimizing y_ik. As before, the entries of y̲ are used to update all the entries of HIGH and LOW. Finally, we use the updated values HIGH_k and LOW_k to set UPL_k = 0 (if HIGH_k ≥ a_ik + UPL_k), LPL_k = 0 (if LOW_k ≤ a_ik − LPL_k), and SPL_k = 0 (if HIGH_k − LOW_k ≥ SPL_k). In this way we avoid a (typically substantial) number of attacker subproblem resolutions.
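The bookkeeping above can be sketched as follows. The function name and the two oracles `attacker_max` and `attacker_min` are our own placeholders: they stand for exact solvers of subproblems (14)–(17) and (21), returning the values taken by all sensitive cells in the optimizing consistent table.

```python
def preprocess_protection_levels(a, UPL, LPL, SPL, attacker_max, attacker_min):
    """Deactivate protection levels already guaranteed by the primary
    suppressions alone, solving as few attacker subproblems as possible.

    a[k] is the nominal value of sensitive cell i_k; attacker_max(k) and
    attacker_min(k) return the list of values taken by all p sensitive
    cells in a consistent table maximizing / minimizing cell i_k."""
    p = len(a)
    HIGH = list(a)  # HIGH[k]: best known lower bound on max y_{i_k}
    LOW = list(a)   # LOW[k]:  best known upper bound on min y_{i_k}
    solved = 0
    # hardest cells first: nonincreasing max{SPL_k, UPL_k + LPL_k}
    for k in sorted(range(p), key=lambda j: max(SPL[j], UPL[j] + LPL[j]),
                    reverse=True):
        if HIGH[k] < a[k] + UPL[k] or HIGH[k] - LOW[k] < SPL[k]:
            y = attacker_max(k)
            solved += 1
            for h in range(p):  # one attack updates the bounds of ALL cells
                HIGH[h] = max(HIGH[h], y[h])
                LOW[h] = min(LOW[h], y[h])
        if LOW[k] > a[k] - LPL[k] or HIGH[k] - LOW[k] < SPL[k]:
            y = attacker_min(k)
            solved += 1
            for h in range(p):
                HIGH[h] = max(HIGH[h], y[h])
                LOW[h] = min(LOW[h], y[h])
        if HIGH[k] >= a[k] + UPL[k]:
            UPL[k] = 0
        if LOW[k] <= a[k] - LPL[k]:
            LPL[k] = 0
        if HIGH[k] - LOW[k] >= SPL[k]:
            SPL[k] = 0
    return UPL, LPL, SPL, solved

# Toy run (illustrative data): one attack on either side of the first cell
# already protects both sensitive cells, so cell 2 needs no LP at all.
UPL, LPL, SPL, solved = preprocess_protection_levels(
    a=[10, 20], UPL=[5, 5], LPL=[5, 5], SPL=[0, 0],
    attacker_max=lambda k: [18, 28], attacker_min=lambda k: [2, 12])
```

In the toy run only two attacker subproblems are solved instead of four, which is the whole point of the parametric scheme.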


Whenever a protection level associated with i_k is not satisfied, we have at hand a capacity constraint associated with the dual optimal solution of the corresponding attacker subproblem, which can be used to initialize the constraint pool. In this way, with no extra computing time, we both perform the preprocessing phase and initialize the constraint pool with a number of relevant constraints.

We now address the reduction of the number of variables in the LP programs to be solved within our branch-and-cut algorithm. Our approach is to fix to 0 or 1 some decision variables during the processing of the current node of the branch-decision tree. We use the classical criteria based on LP reduced costs; see, e.g., Nemhauser and Wolsey (1988). According to our computational experience, these criteria allow one to fix a large percentage of the variables very early during the computation. Moreover, we have implemented a variable-pricing technique to speed up the overall computation and to drastically reduce memory requirements when instances with more than 10,000 variables are considered; see, e.g., Nemhauser and Wolsey (1988) for details.

4.5. Heuristic
The convergence of the overall branch-and-bound scheme can be sped up if a near-optimal CSP solution is found very early during the computation. Therefore one is interested in an efficient heuristic algorithm, to be applied (possibly several times) at each node of the branch-decision tree. The availability of a good heuristic solution is also important when the convergence of the branch-and-cut scheme requires a large computing time, and one has to stop the algorithm before convergence. We have implemented a heuristic procedure in the spirit of the ones proposed by Kelly et al. (1992) and Robertson (1995). Our procedure works in stages, in each of which one heuristically finds a set of new suppressions needed to guarantee the required protection levels for a certain sensitive cell i_k.
To be more specific, we start by defining the current set of suppressions, SUP = {i_1, …, i_p}, and set c_i = 0 for all i ∈ SUP and c_i = w_i for all i ∉ SUP. We then consider all the sensitive cells i_k according to some heuristically defined sequence.



For each such i_k, in turn, we first consider the following incremental attacker subproblem associated with the upper protection level UPL_k (if different from 0):

    min  Σ_{i=1..n} c_i (y_i^+ + y_i^-)                (34)

subject to

    M (y^+ − y^-) = 0,                                 (35)
    0 ≤ y_i^+ ≤ UB_i    for all i = 1, …, n,           (36)
    0 ≤ y_i^- ≤ LB_i    for all i = 1, …, n,           (37)
    y_ik^- = 0  and  y_ik^+ = UPL_k.                   (38)
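Model (34)–(38) is an ordinary LP. A minimal sketch, under the assumption that SciPy is available, for a toy table with a single link y_1 + y_2 = y_3 (all numbers below are illustrative, not taken from the paper):

```python
import numpy as np
from scipy.optimize import linprog

# Link y1 + y2 = y3, written in deviation space d = y+ - y- as M d = 0.
M = np.array([[1.0, 1.0, -1.0]])
c_cells = [0.0, 3.0, 1.0]   # c_i = 0 for cells already in SUP (here cell 0)
UB = [10.0, 10.0, 20.0]     # bounds on upward deviations, constraint (36)
LB = [5.0, 7.0, 12.0]       # bounds on downward deviations, constraint (37)
UPL = 5.0                   # upper protection level of sensitive cell i_k = 0

n = 3
# variable order: [y1+, y2+, y3+, y1-, y2-, y3-]
c = np.array(c_cells + c_cells)       # objective (34): sum c_i (y_i^+ + y_i^-)
A_eq = np.hstack([M, -M])             # constraint (35): M (y+ - y-) = 0
b_eq = np.zeros(M.shape[0])
bounds = [(0.0, UB[i]) for i in range(n)] + [(0.0, LB[i]) for i in range(n)]
bounds[0] = (UPL, UPL)                # constraint (38): y_{i_k}^+ = UPL_k
bounds[n] = (0.0, 0.0)                #                  y_{i_k}^- = 0

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
# cells to add to SUP: unsuppressed cells that deviate in the optimal table
new_suppressions = [i for i in range(n)
                    if res.x[i] + res.x[n + i] > 1e-6 and c_cells[i] > 0]
```

In this instance the cheapest way to push cell 0 up by 5 is to move cell 3 (cost 1 per unit) rather than cell 2 (cost 3 per unit), so the single new suppression is cell 2 of the 0-indexed table, i.e. `new_suppressions == [2]`.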

Variables y_i^+ and y_i^- correspond to possible increments and decrements of the value a_i in a consistent table y = a + y^+ − y^-. Constraints (35)–(37) stipulate the consistency of table y, whereas (38) imposes y_ik = a_ik + UPL_k. The objective function (34) gives an estimate of the additional weight associated with the suppression of the entries i ∉ SUP with y_i ≠ a_i (i.e., with y_i^+ + y_i^- > 0). We solve Problem (34)–(38) and insert into SUP all the cells i ∉ SUP having y_i^+ + y_i^- > 0 in the optimal solution. This guarantees the fulfillment of the upper protection level requirement for i_k with respect to the new set SUP of suppressions. Afterwards, we set c_i = 0 for all i ∈ SUP, and apply a similar technique to extend SUP so as to guarantee the fulfillment of the lower and sliding protection levels. This requires solving Model (34)–(38) two additional times: a first time with (38) replaced by y_ik^+ = 0 and y_ik^- = LPL_k, and a second time with (38) replaced by y_ik^+ + y_ik^- = SPL_k. As in the problem reduction, a parametric resolution of the incremental attacker subproblems typically reduces considerably the computational effort spent within the heuristic.

In some cases, the above heuristic can introduce redundant suppressions. Hence it may be worth applying a clean-up procedure to detect and remove such redundancies; see, e.g., Kelly et al. (1992). To this end, let SUP denote the feasible suppression pattern found by the heuristic. The clean-up procedure considers, in sequence, all the complementary suppressions h ∈ SUP\{i_1, …, i_p}, according to decreasing


weights w_h, and checks whether SUP\{h} is a feasible suppression pattern as well, in which case SUP is replaced by SUP\{h}. Clean-up can be very time consuming, as it requires the solution, for each h ∈ SUP\{i_1, …, i_p}, of the 2p attacker subproblems associated with the sensitive cells. A considerable speed-up is obtained by using the "dual information" associated with the capacity constraints stored in the current pool. Indeed, at each iteration of the clean-up procedure let x* be defined by x*_i = 1 if i ∈ SUP\{h} and x*_i = 0 otherwise. We have to check whether x* defines a feasible suppression pattern. Clearly, a sufficient condition for pattern infeasibility is that x* violates some constraint in the pool. Therefore, before solving the time-consuming attacker subproblems, one can very quickly scan the constraints stored in the pool and check them for violation: In case a violated constraint is found, the computation can be stopped immediately, as we have a proof that SUP\{h} is not a feasible pattern, and we can proceed with the next candidate suppression h. Otherwise, we need to check SUP\{h} for feasibility by solving parametrically a sequence of attacker subproblems, as discussed in the problem reduction subsection.

The above heuristic is applied at the very beginning of our branch-and-cut code, right after the preprocessing phase for reducing the number of nonzero protection levels and the constraint pool initialization. In addition, we have implemented a modified version of the heuristic which exploits the information associated with the fractional optimal solution x* of the master LP problems solved during the branch-and-cut execution. In this version, the cell costs in (34) are defined as c_i = (1 − x*_i) w_i, so as to encourage the suppression of cells i with x*_i close to 1, which are likely to be part of the optimal CSP solution. The modified heuristic is applied right after the processing of each branch-decision node.
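The clean-up loop with its quick pool screen can be sketched as follows. The function name is ours, and `is_feasible` is a stub standing for the expensive parametric attacker-subproblem check described above.

```python
def clean_up(SUP, sensitive, weights, pool, is_feasible):
    """Remove redundant complementary suppressions from pattern SUP.

    pool: list of (coeffs, rhs) pairs, each meaning sum_i coeffs[i]*x_i >= rhs
    for the 0/1 suppression vector x.  is_feasible(pattern) stands for the
    expensive check solving the attacker subproblems parametrically."""
    for h in sorted(SUP - sensitive, key=lambda i: weights[i], reverse=True):
        trial = SUP - {h}
        x = {i: 1 for i in trial}
        # quick screen: a violated pool cut proves SUP \ {h} is infeasible,
        # so no attacker subproblem needs to be solved for this candidate
        if any(sum(c * x.get(i, 0) for i, c in coeffs.items()) < rhs
               for coeffs, rhs in pool):
            continue
        if is_feasible(trial):
            SUP = trial
    return SUP

# Toy run (illustrative data): the pool cut x1 + x2 >= 1 rules out dropping
# cell 2 without any expensive feasibility check.
reduced = clean_up(SUP={0, 1, 2, 3}, sensitive={0},
                   weights={0: 9, 1: 5, 2: 3, 3: 1},
                   pool=[({1: 1, 2: 1}, 1)],
                   is_feasible=lambda s: 2 in s)
```

In the toy run cells 1 and 3 are removed as redundant, while cell 2 survives, first via the pool screen and implicitly via the stub feasibility check.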

5. Example

Let us consider the 2-dimensional statistical table of Figure 1(a). Each cell will be denoted by a pair of indices (i, j), the first one representing the row and the second the column. We assume LB_ij = UB_ij = w_ij = a_ij for each cell in row i ∈ {1, …, 4} and column j ∈ {1, …, 4} (including marginals). The required protection levels for the sensitive cell (2, 3) are LPL_23 = 5, UPL_23 = 8, and SPL_23 = 0.

Initial Heuristic. Our initial heuristic finds the solution x′ of value 59 represented in Figure 1(b), whose nonzero components are x′_21 = x′_23 = x′_31 = x′_33 = 1. The heuristic also initializes the branch-and-cut constraint pool with the following two strengthened capacity constraints: x_13 + x_33 + x_43 ≥ 1 and x_21 + x_22 + x_24 ≥ 1.

Initial Master LP. Our initial master LP consists of the x_ij variables associated with each table entry (including marginals), with x_23 fixed to 1, and of the two cuts currently stored in the constraint pool. Its optimal solution is given by x*_13 = x*_21 = x*_23 = 1, which corresponds to a lower bound of 40. Reduction tests based on LP reduced costs fix to 0 (and remove from the master LP) variables x_11, x_12, x_14, x_24, x_32, x_34, x_41, x_42, x_43, x_44.

Cut Generation. To find capacity constraints (25) that are violated by the current master LP solution x* (if any), we have to solve the attacker subproblem (14)–(17) for x = x* and check whether ȳ_23 ≥ a_23 + UPL_23. In the example, we obtain ȳ_23 = 22 < a_23 + UPL_23 = 22 + 8, hence a violated capacity constraint can easily be obtained from any optimal dual solution of the attacker subproblem, e.g., the one whose nonzero components are given by:

– value 1 for the dual variable associated with the equation y_21 + y_22 + y_23 − y_24 = 0;
– value −1 for the dual variable associated with the equation y_11 + y_21 + y_31 − y_41 = 0;
– value 1 for the dual variables associated with the bounds y_11 ≤ 20, y_24 ≤ 49, y_31 ≤ 17, −y_22 ≤ −19, and −y_41 ≤ −45.

A violated capacity constraint (25) is therefore 20x_11 + 19x_22 + 49x_24 + 17x_31 + 45x_41 ≥ 8, whose associated strengthened version reads 8x_11 + 8x_22 + 8x_24 + 8x_31 + 8x_41 ≥ 8, i.e., x_11 + x_22 + x_24 + x_31 + x_41 ≥ 1. Similarly, a violated capacity constraint (26) can be found by solving the attacker subproblem (21) for x = x* and checking whether y̲_23 ≤ a_23 − LPL_23. In the example, we obtain y̲_23 = 22 > a_23 − LPL_23 = 22 − 5, but the associated strengthened capacity constraint coincides with the one generated in the previous step. Afterwards, the following two bridgeless inequalities are generated: x_11 + x_31 + x_41 ≥ x_21 and x_11 + x_22 + x_24 + x_31 + x_33 + x_41 + x_43 ≥ x_13. Notice that capacity constraints (27) need not be checked for violation, as SPL_23 = 0.

Reoptimizing the master LP amended by the above three cuts yields a new optimal LP solution given by x*_13 = x*_22 = x*_23 = 1, which improves the current lower bound to 51. In this case, no new variable can be fixed by using the LP reduced costs. A new round of separation for the new LP solution x* produces the following violated cuts: x_12 + x_21 + x_24 + x_32 + x_42 ≥ 1, x_12 + x_32 + x_42 ≥ x_22, and x_12 + x_21 + x_24 + x_32 + x_33 + x_42 + x_43 ≥ x_13. After reoptimization, we obtain the master LP solution x*_13 = x*_21 = x*_23 = x*_31 = 1, leading to a lower bound of 57. Our separation procedures then find the cuts x_11 + x_22 + x_24 + x_32 + x_33 + x_34 + x_41 ≥ 1, x_32 + x_33 + x_34 ≥ x_31, x_11 + x_32 + x_33 + x_34 + x_41 ≥ x_21, and x_11 + x_22 + x_24 + x_32 + x_34 + x_41 + x_43 ≥ x_13, leading to a new master LP solution x*_21 = x*_23 = x*_31 = x*_33 = 1, whose value (59) meets the current upper bound, thus certifying the optimality of the current heuristic solution x′. Notice that, in this simple example, all the solutions of our master LPs are integer (of course, this is not always the case). Moreover, no cover inequality is generated, and no branching is needed to reach integrality.
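On the numbers of this example, the strengthening step amounts to clipping every coefficient of the capacity cut at its right-hand side; a sketch (the function name is ours, and the general validity argument is the one of §4.2.1, not the clipping itself):

```python
def strengthen(coeffs, rhs):
    # clip each coefficient at the right-hand side; for the capacity cuts of
    # this example the clipped inequality remains valid (cf. section 4.2.1)
    return {i: min(c, rhs) for i, c in coeffs.items()}

# the violated capacity constraint of the example: 20 x11 + 19 x22 + 49 x24
# + 17 x31 + 45 x41 >= 8
cut = {"x11": 20, "x22": 19, "x24": 49, "x31": 17, "x41": 45}
strong = strengthen(cut, 8)
# every coefficient becomes 8; dividing through by 8 yields the cardinality
# cut  x11 + x22 + x24 + x31 + x41 >= 1
```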

6. Computational Results

The algorithm described in the previous section was implemented in ANSI C. We evaluated the performance of the code on a set of real-world (but no longer confidential) statistical tables. The software was compiled with Watcom C/C++ 10.6 and run, under Windows 95, on a PC Pentium II/266 with 32 MB of RAM. As to the LP solver, we used the commercial package CPLEX 3.0. Our test bed consists



of 10 real-world instances provided by people from different national statistical offices. It includes three 2-dimensional tables, two 3-dimensional tables, one 4-dimensional table, and four linked tables. The first linked table (USDE1) corresponds to a 2-section of a 6-dimensional 6 × 4 × 16 × 4 × 4 × 4 table, whereas the second linked table (USDE2) corresponds to a 4-section of a 9-dimensional 4 × 29 × 3 × 4 × 5 × 6 × 5 × 4 × 5 table; for both instances, UPL_i = LPL_i holds for each cell i. The third linked table (USDE1a) is identical to USDE1, but we set UPL_i = 2 LPL_i for each cell i. The fourth linked table (USDE1b) was obtained from USDE1a by dividing by 1,000 and rounding up to the nearest integer all cell weights w_i. For all instances in our test bed, the external bounds are lb_i = 0 and ub_i = +∞ for all i = 1, …, n, whereas the sliding protection levels SPL_k are zero for all sensitive cells. Table 1 reports information about the test bed and the performance of our initial heuristic when applied before entering the branch-and-cut computation. For each instance, the table gives:

name: name of the instance;
type: size (for k-dimensional tables) or structure of the table;
cells: number of cells in the table (= n);
links: number of equations in the table (= number of rows of matrix M);
p: number of sensitive cells (primary suppressions);
pl: number of nonzero protection levels before problem reduction;
pl0: number of nonzero protection levels after problem reduction;
t0: Pentium II/266 wall clock seconds spent in the preprocessing phase for reducing the number of nonzero protection levels;
HEU1: percentage ratio 100 × (HEU1′ − optimal solution value)/(optimal solution value), where HEU1′ is the upper bound value computed by our initial heuristic before entering the branch-and-cut computation;
t1: Pentium II/266 wall clock seconds spent by our initial heuristic.

Table 1  Statistics on Real-World Instances (Run on a PC Pentium II/266)

name     type       cells  links  p     pl    pl0  t0     HEU1   t1
CBS1     41×31      1271   72     3     6     6    0.3    73.79  0.4
CBS2     183×61     11163  244    2467  4934  2    6.1    0.00   0.2
CCSR     359×46     16514  405    4923  9846  54   36.3   0.00   24.6
CBS3     6×8×13     624    230    17    34    26   0.1    5.91   0.3
CBS4     6×33×8     1584   510    146   292   201  1.3    1.68   7.5
CBS5     6×8×8×13   4992   2464   517   1034  947  119.9  90.38  1538.8
USDE1    linked     1254   1148   165   330   320  0.5    30.21  16.8
USDE2    linked     1141   1000   310   620   572  8.9    33.18  29.8
USDE1a   linked     1254   1148   165   330   322  0.8    26.43  17.1
USDE1b   linked     1254   1148   165   330   322  0.8    27.31  17.3

A comparison of columns pl and pl0 shows that pl0 is often significantly smaller than pl, meaning that our preprocessing procedure was effective in detecting redundant protection levels. This is particularly true in the case of 2-dimensional tables, whose simple structure often leads to large patterns of "self-protected" sensitive cells. The quality of our initial heuristic solution appears rather poor when compared with the optimal solution, in that column HEU1 exhibits significant percentage errors. In our opinion, however, the performance of our initial heuristic is at least comparable to (and often significantly better than) that of the suppression procedures commonly used by practitioners. In other words, we believe that commonly used suppression methodologies are likely to produce suppression patterns with excessively high information loss. This behavior was probably underestimated in the past, since no technique was available to solve complex instances to proven optimality, nor to compute reliable lower bounds on the optimal solution value.



The capability of benchmarking known heuristics is therefore another important feature of our exact solution methodology. Table 2 reports the following information on the overall branch-and-cut algorithm:

r-HEU: percentage ratio 100 × (r-HEU′ − optimal solution value)/(optimal solution value), where r-HEU′ is the upper bound value computed by our heuristic at the end of the root node of the branch-decision tree;
r-LB: percentage ratio 100 × (optimal solution value − r-LB′)/(optimal solution value), where r-LB′ is the lower bound value available at the end of the root node of the branch-decision tree;
r-time: Pentium II/266 wall clock seconds spent at the root node, including the preprocessing time t0 and the heuristic time t1 reported in Table 1;
optimum: optimal solution value;
sup: number of complementary (nonsensitive) suppressions in the optimal solution found; note that this number is not necessarily minimized, i.e., it is possible that other solutions require a larger information loss but fewer suppressions;
node: number of elaborated nodes in the branch-decision tree;
iter: overall number of cutting-plane iterations;
time: Pentium II/266 wall clock seconds for the overall branch-and-cut algorithm.

Table 2  Branch-and-Cut Statistics (Run on a Pentium II/266)

name     r-HEU  r-LB  r-time  optimum   sup  node  iter  time
CBS1     29.13  2.91  5.1     103       5    5     75    9.4
CBS2     0.00   0.00  6.4     403       2    1     1     6.4
CCSR     0.00   0.00  61.3    256       27   1     1     61.3
CBS3     1.88   1.00  8.4     22590362  27   32    416   66.0
CBS4     1.25   0.20  40.9    186433    51   19    70    123.7
CBS5     0.00   0.00  4924.1  6312      261  1     76    4924.1
USDE1    3.42   0.55  626.6   2228523   254  22    202   1187.0
USDE2    1.38   2.68  702.0   4643198   181  46    238   2397.2
USDE1a   1.43   1.92  689.1   2325788   273  97    473   2614.6
USDE1b   1.25   1.21  670.6   2157      274  16    240   1311.5

As shown in Table 2, our branch-and-cut code was able to solve all the instances of our test bed within acceptable computing time, even on a slow personal computer with a limited amount of RAM. The 2-dimensional instances were solved easily by our code. This confirms the findings reported in Fischetti and Salazar (1999), where tables of size up to 500 × 500 were solved to optimality on a PC. The 3-dimensional instances were also solved within short computing time. The 4-dimensional instance, on the other hand, appears much more difficult to solve. This is of course due to the large number of table links (equations) to be considered. In addition, the number of nonzero protection levels after preprocessing (as reported in column pl0) is significantly larger than for the other instances. This results in a large number of time-consuming attacker subproblems that need to be solved for capacity cut separation, and in a large number of capacity constraints to be inserted explicitly into the master LP. Moreover, we have observed that the optimal solutions of the master LPs tend to have more fractional components than those arising for 2-dimensional tables with about the same number of cells. In other words, increasing the table dimension seems to have a much larger impact on the number of fractional components than just increasing the size of the table. As a consequence, 4-dimensional tables tend to require a larger number of branchings to enforce the integrality of the variables. In addition, the heuristic procedures become much more time consuming, as they work explicitly with all the nonzero variables of the current fractional LP solution. As to linked tables, their exact solution can be obtained within reasonable computing time. As



expected, instance USDE1a requires significantly more computing time than instance USDE1, due to the larger upper protection levels imposed, whereas an optimal solution of instance USDE1b can be found more easily due to the reduced weight range. A comparison of columns HEU1 and r-HEU shows the effectiveness of our heuristic when driven by the LP solution available at the end of the root node. In particular, stopping the branch-and-cut execution right after the root node would produce a heuristic procedure comparing very favorably with the initial heuristic, while also returning a reliable optimistic estimate (lower bound) on the optimal solution value. Column r-LB shows that very tight lower bounds on the optimal solution value are available already at the root node of the branch-decision tree. Quite surprisingly, this is mainly due to the LP-relaxation tightening introduced in §4.2, and in particular to the simple capacity constraint strengthening described in §4.2.1. Indeed, deactivating the model improvements introduced in §4.2 results in a dramatic lower bound deterioration.

Table 3 gives the following statistics on the cuts generated by the branch-and-cut scheme:

cap0: number of constraints saved in the pool structure during the preprocessing and initial heuristic procedures;
cap: overall number of capacity constraints generated;
bri: overall number of bridgeless inequalities generated;
cov: overall number of cover inequalities generated;
pool: overall number of constraints recovered from the pool structure;
LProws: maximum number of rows in the master LP.

Table 3  Statistics on the Generated Cuts

name     cap0  cap   bri  cov  pool  LProws
CBS1     10    70    184  92   109   168
CBS2     2     0     0    0    2     2
CCSR     27    0     0    0    27    27
CBS3     25    226   504  523  3744  255
CBS4     125   90    52   69   153   166
CBS5     639   1500  0    0    418   502
USDE1    217   978   781  86   1760  937
USDE2    301   965   364  96   1311  535
USDE1a   226   1291  923  196  3569  923
USDE1b   226   1371  849  137  2256  993

According to the table, the number of capacity constraints that need to be generated explicitly is rather small (recall that, in theory, the family of capacity constraints contains an exponential number of members). Moreover, the pool add/drop mechanism allows us to keep the master LPs at a manageable size; see column LProws of the table. Finally, we observe that a significant number of bridgeless and cover inequalities are generated during the branch-and-cut execution to reinforce the quality of the LP relaxation of the several master problems to be solved.

To better understand the practical behavior of our method, we carried out an additional computational experiment on randomly generated instances. To this end, we generated a test bed containing 1,160 synthetic 3- and 4-dimensional tables with different sizes and structures, according to the following adaptation of the scheme described in Fischetti and Salazar (1999). The structure of each random table is controlled by two parameters, nz and sen, which determine the density of nonzeros and of sensitive cells, respectively. Every internal cell i of the table has nominal value a_i = 0 with probability 1 − nz/100. Nonzero cells have an integer random value a_i > 0 belonging to the range {1, …, 5} with probability sen/100, and to the range {6, …, 500} with probability 1 − sen/100. Cells with 0 nominal value cannot be suppressed, whereas all cells with nominal value in {1, …, 5} are classified as sensitive. For every sensitive cell, both the lower and upper protection levels are set to the nominal value, while the sliding protection level is zero. The feasible range known to the attacker for suppressed cells is [0, +∞) in all cases. All the generated random instances are available for benchmarking purposes from the second author, along with the associated optimal (or best-known) solution values.
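The generation scheme for the internal cells can be sketched as follows; the function name is ours, and the computation of marginals and the instance file format are omitted for brevity.

```python
import random

def generate_table(dims, nz, sen, seed=0):
    """Random k-dimensional internal table following the scheme in the text.

    dims: tuple of dimension sizes (internal cells only).  nz, sen:
    percentages controlling the density of nonzeros and of sensitive cells.
    Returns nominal values, sensitivity flags, and protection levels."""
    rng = random.Random(seed)
    cells = 1
    for d in dims:
        cells *= d
    a, sensitive, UPL, LPL, SPL = [], [], [], [], []
    for _ in range(cells):
        if rng.random() >= nz / 100.0:
            v = 0                   # zero cell: cannot be suppressed
        elif rng.random() < sen / 100.0:
            v = rng.randint(1, 5)   # small value: classified as sensitive
        else:
            v = rng.randint(6, 500)
        a.append(v)
        s = 1 <= v <= 5
        sensitive.append(s)
        UPL.append(v if s else 0)   # lower/upper protection = nominal value
        LPL.append(v if s else 0)
        SPL.append(0)               # sliding protection level is zero
    return a, sensitive, UPL, LPL, SPL

# e.g. a 4 x 4 x 2 instance of the class with nz = 75 and sen = 15
a, sens, UPL, LPL, SPL = generate_table((4, 4, 2), nz=75, sen=15, seed=1)
```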
Tables 4 and 5 report average values, computed over 10 instances, for various classes of 3- and 4-dimensional tables, respectively. Column succ reports the number of instances solved to proven optimality within a time limit of three hours; statistics



Table 4

3-Dimensional Random Instances (Time Limit of Three Hours on a PC Pentium II/400) sen

nz

p

pl0

r -HEU

r -LB

r -time

sup

2×2×2 2×2×2 2×2×2 2×2×2

5 15 5 15

100 100 75 75

07 17 18 25

14 00 34 00 36 00 50 00

000 00 142 00 000 00 112 01

000 000 000 000

000 083 274 082

02 03 02 02

41 73 35 58

05 12 15 20

023 040 036 046

44 86 60 84

170 235 97 155

09 31 17 23

10 10 10 10

4×2×2 4×2×2 4×2×2 4×2×2

5 15 5 15

100 100 75 75

10 28 16 33

20 00 56 00 32 00 66 00

359 00 623 01 000 00 510 01

000 229 000 000

000 379 271 665

02 05 02 05

42 103 43 99

05 37 11 26

021 107 038 082

52 182 82 167

168 404 221 430

10 71 23 53

10 10 10 10

4×4×2 4×4×2 4×4×2 4×4×2

5 15 5 15

100 100 75 75

15 30 00 586 01 52 104 00 1279 01 18 36 00 722 01 51 102 00 1697 01

177 465 023 414

454 635 404 775

04 08 05 08

63 128 74 143

22 40 20 79

084 152 076 249

136 302 140 406

453 632 528 879

45 153 32 129

10 10 10 10

4×4×4 4×4×4 4×4×4 4×4×4

5 15 5 15

100 100 75 75

35 70 00 1748 01 104 208 00 2566 01 31 62 00 870 01 80 160 00 2471 01

586 1142 272 931

1076 1532 517 1202

07 11 12 16

120 192 135 203

173 731 83 216

470 1668 317 827

393 991 396 785

1254 1187 1493 1820

161 560 107 363

10 10 10 10

6×2×2 6×2×2 6×2×2 6×2×2

5 15 5 15

100 100 75 75

11 37 18 44

548 01 997 01 386 01 846 01

000 123 000 140

000 204 306 736

03 07 04 07

49 118 51 110

06 25 18 46

033 101 056 150

71 210 90 256

264 455 352 622

11 95 21 84

10 10 10 10

6×4×2 6×4×2 6×4×2 6×4×2

5 15 5 15

100 100 75 75

25 50 00 901 01 84 168 00 1376 01 25 50 00 997 01 76 152 00 1764 01

281 704 166 664

492 1066 569 1002

07 11 08 11

97 170 99 179

38 193 49 161

164 560 181 491

234 636 300 727

791 1175 1150 1264

73 371 70 331

10 10 10 10

6×4×4 6×4×4 6×4×4 6×4×4

5 15 5 15

100 100 75 75

49 98 00 2451 01 151 302 00 2211 02 37 74 00 1705 01 112 224 00 3246 01

1310 915 815 531

1421 1513 988 1653

11 16 16 21

172 256 174 277

504 3819 246 1128

1495 710 9072 1801 1031 539 4007 1576

2111 330 1578 1274 2681 187 3391 928

10 10 10 10

6×6×2 6×6×2 6×6×2 6×6×2

5 15 5 15

100 100 75 75

36 72 00 1625 01 124 248 00 1737 01 40 80 00 1275 01 113 226 00 2671 01

001 494 558 888

230 700 859 1269

10 16 11 17

104 201 121 208

38 263 174 499

223 1105 719 1464

1039 1652 2056 1984

75 449 234 540

10 10 10 10

6×6×4 6×6×4 6×6×4 6×6×4

5 15 5 15

100 100 75 75

71 142 00 3106 01 210 420 00 2801 03 54 108 00 1324 01 157 314 00 2431 02

1647 1356 713 1335

1609 1551 1478 1583

18 31 24 37

219 2408 335 18016 218 5287 335 5272

4362 1098 3092 2625 7944 1800 5440 1840

10 10 10 10

6×6×6 6×6×6 6×6×6 6×6×6

5 15 5 15

100 100 75 75

94 188 00 3024 02 339 677 01 3509 06 70 140 00 2776 01 236 471 00 3343 03

1849 702 988 1287

1694 1401 1297 1482

38 85 36 69

294 19457 151207 4933 13408 3444 436 63289 436771 7259 5479 5223 249 2529 17756 1916 9262 1093 431 19696 127130 5257 7333 3347

9 7 10 9

8×2×2 8×2×2 8×2×2 8×2×2

5 15 5 15

100 100 75 75

15 30 00 679 01 52 104 00 1602 01 23 46 00 731 01 63 126 00 1481 01

000 184 343 582

000 220 320 934

03 06 06 08

57 132 65 125

10 10 10 10

size

t0

22 00 74 00 36 00 88 00

HEU1

t1

node

06 23 13 35

time

cap

251 755 613 1069

9576 1785 72855 3667 36212 2919 23224 2905

029 103 073 143

67 230 142 313

bri

230 404 466 779

cov

14 106 39 111

succ

(Continued)




Table 4

r -HEU

r -LB

r -time

sup

35 70 00 1281 01 110 220 00 1701 01 38 76 00 1407 01 101 202 00 2063 01

306 248 117 683

571 707 575 878

08 15 11 15

121 206 116 203

54 160 75 200

100 100 75 75

65 130 00 2960 01 191 382 00 2661 02 49 98 00 3042 01 141 282 00 2717 02

972 840 1201 981

1166 1312 1222 1310

13 18 19 28

196 317 189 299

1024 3221 693 1460

5 15 5 15

100 100 75 75

50 100 00 1616 01 158 316 00 1542 02 48 96 00 1424 01 144 288 00 2238 02

107 518 732 521

467 519 1052 1223

19 21 16 25

145 249 147 249

104 152 234 771

8×6×4 8×6×4 8×6×4 8×6×4

5 15 5 15

100 100 75 75

86 172 00 2783 02 274 548 01 2980 04 67 134 00 2081 01 206 412 00 3289 03

1287 1111 1717 1299

1701 1240 1548 1495

25 45 33 55

273 6944 415 17850 247 5147 410 19130

8×6×6 8×6×6 8×6×6 8×6×6

5 15 5 15

100 100 75 75

122 243 01 4339 02 445 890 01 3485 10 98 195 00 2723 02 332 664 01 3863 06

2132 855 1751 1387

1747 1167 1783 1381

53 147 57 126

8×8×2 8×8×2 8×8×2 8×8×2

5 15 5 15

100 100 75 75

66 132 00 2160 01 198 396 00 2397 02 64 128 00 1863 01 179 351 00 2761 02

716 656 824 1199

764 447 1142 1456

16 36 29 31

8×8×4 8×8×4 8×8×4 8×8×4

5 15 5 15

100 100 75 75

116 231 00 3879 02 379 758 01 3751 07 95 190 00 3515 02 298 596 01 3904 05

2136 926 1142 1318

1844 938 1526 1289

47 88 50 129

size


Continued sen

nz

8×4×2 8×4×2 8×4×2 8×4×2

5 15 5 15

100 100 75 75

8×4×4 8×4×4 8×4×4 8×4×4

5 15 5 15

8×6×2 8×6×2 8×6×2 8×6×2

p

pl0

t0

HEU1

t1

node

cap

bri

cov

succ

286 641 458 819

947 1023 1503 1379

104 324 140 305

10 10 10 10

3142 1005 7927 1585 3586 959 5710 1830

2352 1339 3798 3509

568 924 478 1087

10 10 10 10

650 1346 1079 2671

2116 2006 3180 3175

182 479 263 825

10 10 10 10

32267 2872 6941 1922 76876 3464 2586 2177 42915 2848 10593 1751 111019 5279 7906 3667

10 10 10 10

322 17568 164962 4933 17295 3485 510 53091 387814 5981 3561 3738 341 15351 167227 4143 18258 2780 534 39806 299947 8003 6736 5049

6 8 8 10

176 298 166 286

260 620 402 1476

10 10 10 10

338 46001 330948 8701 16958 6694 481 10378 52573 3790 2071 2033 318 12466 162478 4374 20461 2732 500 22115 181838 6614 11956 3683

9 10 10 10

247 399 439 2634

time 219 571 309 636

461 837 728 1548

1403 684 2427 1068 2257 938 11581 2705

3151 3029 4312 6110



Table 5    4-Dimensional Random Instances (Time Limit of Three Hours on a PC Pentium II/400)

[Table entries are not recoverable from the extraction. Columns: size, sen, nz, p, pl0, t0, HEU1, t1, r-HEU, r-LB, r-time, sup, node, time, cap, bri, cov, succ. Instance sizes range from 2×2×2×2 to 4×4×4×2 and 3×3×3×3, with 5 or 15 sensitive cells and 100% or 75% nonzero entries.]

Table 6    Fixed-Size Random Instances (Run on a PC Pentium II/400 with no Time Limit)

[Table entries are not recoverable from the extraction. Columns: size, sen, nz, p, pl0, t0, HEU1, t1, r-HEU, r-LB, r-time, sup, node, time, cap, bri, cov. All instances are 8×6×4 tables with 5, 15, 25, or 35 sensitive cells and 100%, 75%, 50%, or 25% nonzero entries.]
refer to the successfully solved instances only. Table 6 reports similar statistics for 8 × 6 × 4 tables of different structures. In all cases, computing times are expressed in wall-clock seconds on a PC Pentium II/400 with 64 MByte RAM. Notice that the random instances appear harder to solve than the real-world ones, due to the lack of a strong structure in the table entries. Nevertheless, we could solve most of them to proven optimality within short computing time. In addition, for all instances the quality of the heuristic solution (r-HEU) found at the root node after a few seconds of computation (r-time) is significantly better than that found by the initial heuristic (HEU1).
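As a toy illustration of the disclosure risk that motivates these experiments (and not of the authors' branch-and-cut algorithm, which solves the attacker subproblems by linear programming), the sketch below brute-forces the attacker problem on a 2 × 2 table with published marginals: suppressing only the sensitive cell leaves its value exactly inferable, whereas adding complementary suppressions widens the feasible interval. The table values, the search bound, and the function name are invented for the example.

```python
from itertools import product

# Toy 2x2 table with published row and column marginals:
#        c1  c2 | row
#   r1    3   7 |  10
#   r2    5   9 |  14
#   col    8  16
ROW = [10, 14]
COL = [8, 16]
TRUE = {(0, 0): 3, (0, 1): 7, (1, 0): 5, (1, 1): 9}

def attacker_range(suppressed, target, bound=25):
    """Attacker problem by enumeration: over all non-negative integer
    tables consistent with the marginals and the published entries,
    return the (min, max) feasible value of the target cell."""
    feasible = []
    for vals in product(range(bound), repeat=len(suppressed)):
        t = dict(TRUE)                      # published entries
        t.update(zip(suppressed, vals))     # candidate suppressed values
        if all(t[(i, 0)] + t[(i, 1)] == ROW[i] for i in range(2)) and \
           all(t[(0, j)] + t[(1, j)] == COL[j] for j in range(2)):
            feasible.append(t[target])
    return min(feasible), max(feasible)

# Suppressing only the sensitive cell: the marginals pin it down.
print(attacker_range([(0, 0)], (0, 0)))  # -> (3, 3): no protection
# Complementary suppressions widen the attacker's interval.
print(attacker_range([(0, 0), (0, 1), (1, 0), (1, 1)], (0, 0)))  # -> (0, 8)
```

Enumeration is exponential in the number of suppressed cells; replacing it with two linear programs per sensitive cell (minimize and maximize the cell value subject to the table constraints) is what makes the approach scale.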

7. Conclusions

Cell suppression is a widely used methodology in Statistical Disclosure Control. In this paper we have introduced a new integer linear programming model for the cell suppression problem, in the very general context of tables whose entries are subject to a generic system of linear constraints. Our model thus covers k-dimensional tables with marginals as well as hierarchical and linked tables. To our knowledge, this is the first attempt to model and solve the cell suppression problem in such a general context. We have also outlined a possible solution procedure in the branch-and-cut framework. Computational results on real-world instances have been reported. In particular, we were able to solve to proven optimality, for the first time, real-world 4-dimensional tables with marginals as well as linked tables. Extensive computational results on a test-bed containing 1,160 randomly generated 3- and 4-dimensional tables have also been given.

Acknowledgment
Work partially supported by the European Union project IST-2000-25069, Computational Aspects of Statistical Confidentiality (CASC), coordinated by Anco Hundepool (Central Bureau of Statistics, Voorburg, The Netherlands). The first author was supported by M.U.R.S.T. ("Ministero della Ricerca Scientifica e Tecnologica") and by C.N.R. ("Consiglio Nazionale delle Ricerche"), Italy, while the second author was supported by the "Ministerio de Educación, Cultura y Deporte," Spain.


Accepted by Thomas M. Liebling; received October 1999. This paper has been with the authors 2 months for 1 revision.
