Inference Control in Data Integration Systems

Mokhtar Sellami¹, Mohand-Said Hacid², and Mohamed Mohsen Gammoudi³

¹ Higher Institute of Technological Studies of Kef / Riadi Laboratory, ENSI Tunisia, Manouba, Tunisia, [email protected]
² Université de Lyon / LIRIS UCBL, Lyon, France, [email protected]
³ Higher Institute of Multimedia Arts of Manouba / Riadi Laboratory, ENSI Tunisia, Manouba, Tunisia, [email protected]

Abstract. Specifying a global policy in a data integration system in a traditional way would not necessarily offer a sound and efficient solution to deal with the inference problem [8]. This is mainly due to the fact that data dependencies (between distributed data sets) are not taken into account when local policies (attached to local sources) are defined. In this paper, by using formal concept analysis, we propose a methodology, together with a set of algorithms, that can help detect security breaches by reasoning about semantic constraints. Given a set of local policies, an initial global policy and data dependencies, we propose an approach that allows the security administrator to derive the sets of queries such that, when their results are combined, they could lead to security breaches. We detect the set of additional rules which will be used to extend the policy of the mediator in order to block such breaches. We also discuss a set of experiments we conducted.

Keywords: Access control · Data integration · Inference problem · Security and privacy (database security) · Privacy · Access controls

1 Introduction

Data integration aims at providing a unique entry point to a set of data sources. In this paper, we focus on the security challenge that mainly arises in data integration systems. In such systems, a mediator is defined. This mediator aims at providing a unique entry point to several heterogeneous sources. In this kind of architecture, security aspects, and access control in particular, represent a major challenge. Indeed, every source, designed independently of the others, defines its own access control policy, and it is important to comply with the local policies in the context of data integration. Complying with the sources' policies means that an access prohibited at the source level should also be prohibited at the mediator level. The policy of the mediator also needs to protect data against indirect accesses. An indirect access occurs when one can synthesize sensitive information from the combination of non-sensitive information (and semantic constraints) collected from different sources and subject to different access control policies. Detecting all indirect accesses in a given system is referred to as the inference problem [4].

We propose a formal concept analysis based approach to derive the global policy of the mediator. The proposed approach preserves the local (source) policies and provides us with information about the inference problem. From the schemas and the security policies of the local sources, we derive a set of security policies (called the global policy) that must be attached to the mediator. We then exploit interactions between elements of the global schema in the presence of the global policy to infer implicit combinations of queries that could lead to the violation of local security policies. Another issue is the dynamic change of the local policies and schemas. By resorting to incremental algorithms [5] for lattice construction, it is possible to avoid regenerating the global schema and policies from scratch.

The remainder of the paper is organized as follows: Section 2 gives an overview of research efforts related to our work. Section 3 introduces an example. In Section 4, we describe our approach. Section 5 describes the experiments. We conclude in Section 6. Please note that, because of lack of space, a research report containing more details regarding the state of the art, formal definitions of the concepts used in our approach, complete examples and the theoretical complexity of the algorithms can be downloaded from http://bit.ly/1Iok4Fd.

2 Related Work

We discuss the different approaches that are connected in some way to our problem, as well as the different types of inferences that have been investigated so far.

2.1 Formal Concept Analysis and Access Control

The authors of [21] proposed a lattice-based solution for controlling inferences caused by unprotected aggregations. Instead of detecting inferences, they first prevent the adversary from combining multiple aggregations for inferences by restricting queries. Then, they remove the remaining inferences caused by individual aggregations. In [10], the authors proposed to use attribute exploration from formal concept analysis on the dyadic formal context derived from the triadic security context to design role-based access control. In [19], a method based on formal concept analysis is proposed, which facilitates discovering roles from the existing permission-to-user assignments. From a technical point of view, our work extends the traditional approaches by resorting to formal concept analysis as a tool for reasoning about security policies in a distributed setting. From an application point of view, we consider a data integration scenario where it is mandatory to accommodate data dependencies, security policies and inference problems.

2.2 Views, Access Control and Inferences

Rosenthal and Sciore [15,16] considered the problem of how to automatically coordinate the access rights of a warehouse with those of its sources. The authors proposed a theory that allows automated inference of many permissions for the warehouse by a natural extension of the standard SQL grant/revoke model (see http://www.techonthenet.com/oracle/grant_revoke.php) to systems with redundant and derived data. The authors also defined the witness notion by including the use of views. The framework proposed by the authors determines only whether a user has the right to access a derived table (based on explicit permissions), whereas our proposal goes further by determining which part of the derived table the user has the right to access. In [1], the authors built on [12] to provide a way to select access control rules to be attached to materialized view definitions, based on access control rules over base relations. They resort to the basic form of the bucket algorithm, which does not allow deriving all relevant access control rules. In our work, we synthesize new rules from existing ones, where the body of the new rules makes reference to materialized views. In [3], the authors proposed, instead of using authorization views, a graph-based model to define access control rules and query profiles. The latter capture the information content of a query through the use of graphs. The major drawback of this approach is the impossibility of defining permissions on a subset of tuples (selection).

2.3 Inference Problem

A number of methods have been proposed to deal with statistical attacks. These methods can be classified into three categories: query restriction [2], data perturbation [17] and output perturbation [9]. Query restriction can be performed by constraining the number of tuples that are used to construct query results. Data perturbation modifies data in such a way that data inferences are limited. Output perturbation modifies query results in order to prevent sensitive information disclosure. Approaches dealing with semantic attacks tackle issues that arise in traditional access control mechanisms. Semantic attacks were identified as a serious threat to data security at a time when multilevel security policies were popular [6,20]. In [7,8], the authors proposed a methodology that allows controlling the access to a data integration system. The methodology deals with both direct and indirect access, and includes two main phases: (1) propagation and combination of source policies and (2) detection of threats. They also proposed solutions to remedy flaws identified in the previous phases. Our approach is inspired by [8]. However, we take another look at the problem of data integration in the presence of security policies. First, we generate a global schema and a global policy simultaneously. We capture the relevant elements of the data integration system in a single framework. The underlying lattice construction algorithms implicitly accommodate schema and policy changes.


3 Example

First, we need to introduce two definitions.

Definition 1 (Functional Dependency) [13]. A functional dependency (FD) over a schema R is a statement of the form R: X → Y, where X, Y ⊆ schema(R). We refer to X as the left-hand side (LHS) and to Y as the right-hand side (RHS) of the functional dependency X → Y.

Definition 2 (Authorization View) [14]. A set of authorization views specifies what information a user is allowed to access. The user writes the query in terms of the database relations, and the system tests (by considering the authorization views) the query for validity by determining whether it can be evaluated over the database.

We consider a University Record System (URS) scenario where multiple systems need to share data records for cooperation purposes. Each university has full control (e.g., creation, management, etc.) over its own data records with respect to its own access control policies. To share data between the systems, a secure mediator is needed in order to facilitate access to the shared data via a global schema, while ensuring data confidentiality. This allows different users, including administrators, professors, students and researchers, to access multiple faculty or student data. In this paper, we assume that (1) the mediator is built from three local sources, (2) the access control model is a constraint-based, role-based access control model in which constraints are associated with resources, and (3) a user is described by a set of features defining her/his role. The enforcement of this model is done as follows: a user is allowed to access a resource if her/his profile satisfies the access constraints of that resource; otherwise, the access is denied. We use authorization views to provide fine-grained access control.

Local Source 1
Relation: Supervisors(IDFaculty, Name, ResearchTeam, Affiliation)
Functional dependencies: IDFaculty, ResearchTeam → Name and IDFaculty, Affiliation → ResearchTeam
Authorization views:
– V1_1: Authorization(IDFaculty) ← Supervisors(IDFaculty, Name, ResearchTeam, Affiliation), $Role=DC ∨ $Role=AD ∨ $Role=FO ∨ $Role=PR.
– V1_2: Authorization(Name) ← Supervisors(IDFaculty, Name, ResearchTeam, Affiliation), $Role=AD ∨ $Role=DC ∨ $Role=FO ∨ $Role=PR.
– V1_3: Authorization(ResearchTeam) ← Supervisors(IDFaculty, Name, ResearchTeam, Affiliation), $Role=DC ∨ $Role=AD ∨ $Role=PR.

Local Source 2
Relation: PHDStudents(StudentID, IDFaculty, ThesisTitle, ResearchTeam)
Functional dependencies: StudentID → ThesisTitle; StudentID → IDFaculty; IDFaculty → ResearchTeam
Authorization views:
– V2_1: Authorization(StudentID) ← PHDStudents(StudentID, IDFaculty, ThesisTitle, ResearchTeam), $Role=PR ∨ $Role=DC ∨ $Role=AD.
– V2_2: Authorization(IDFaculty) ← PHDStudents(StudentID, IDFaculty, ThesisTitle, ResearchTeam), $Role=DC ∨ $Role=PR ∨ $Role=AD ∨ $Role=FO.
– V2_3: Authorization(ThesisTitle) ← PHDStudents(StudentID, IDFaculty, ThesisTitle, ResearchTeam), $Role=PR ∨ $Role=DC.

Local Source 3
Relation: Faculty(IDFaculty, SSN, Salary, Insurance)
Functional dependencies: IDFaculty → SSN and SSN, Insurance → Salary
Authorization views:
– V3_1: Authorization(SSN) ← Faculty(SSN, Salary, Insurance, IDFaculty), $Role=FO ∨ $Role=AD.
– V3_2: Authorization(IDFaculty) ← Faculty(SSN, Salary, Insurance, IDFaculty), $Role=AD ∨ $Role=DC.
– V3_3: Authorization(Salary) ← Faculty(SSN, Salary, Insurance, IDFaculty), $Role=FO ∨ $Role=AD.
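For concreteness in what follows, the example's functional dependencies and authorization views can be written down as plain data structures. The Python listing below is purely illustrative; the dictionary layout and the names LOCAL_FDS and LOCAL_POLICIES are our own choices, not notation from the paper.

# Illustrative Python encoding of the running example.
# Each FD is a pair (set of LHS attributes, RHS attribute); each policy maps
# an attribute to the set of roles ($Role values) allowed to access it.

LOCAL_FDS = {
    "Supervisors": [({"IDFaculty", "ResearchTeam"}, "Name"),
                    ({"IDFaculty", "Affiliation"}, "ResearchTeam")],
    "PHDStudents": [({"StudentID"}, "ThesisTitle"),
                    ({"StudentID"}, "IDFaculty"),
                    ({"IDFaculty"}, "ResearchTeam")],
    "Faculty": [({"IDFaculty"}, "SSN"),
                ({"SSN", "Insurance"}, "Salary")],
}

LOCAL_POLICIES = {
    "Supervisors": {"IDFaculty": {"DC", "AD", "FO", "PR"},
                    "Name": {"AD", "DC", "FO", "PR"},
                    "ResearchTeam": {"DC", "AD", "PR"}},
    "PHDStudents": {"StudentID": {"PR", "DC", "AD"},
                    "IDFaculty": {"DC", "PR", "AD", "FO"},
                    "ThesisTitle": {"PR", "DC"}},
    "Faculty": {"SSN": {"FO", "AD"},
                "IDFaculty": {"AD", "DC"},
                "Salary": {"FO", "AD"}},
}

if __name__ == "__main__":
    for source, policy in LOCAL_POLICIES.items():
        print(source, "->", policy)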

4 Inference Detection: A Formal Concept Analysis Based Approach

We propose an approach for determining all the possible disclosure transactions that could appear at the mediator level by exploiting the semantic constraints, and for healing the global policy in order to prevent the completion of such disclosure transactions. A disclosure transaction (DT) is a sequence of queries such that, if they are evaluated and their results are combined, they lead to a security breach, thus violating an access control policy. The semantic constraints include the hidden associations between the attributes of the data sources. Our approach, inspired by [8], is centered around three phases: (1) the generation of a global schema, global functional dependencies and a global policy from the local sources and their underlying policies, (2) disclosure transaction discovery and (3) policy healing.

– Phase 1: It consists in synthesizing the global policy, the global schema and the global functional dependencies. These elements are then exploited in the next phases.
– Phase 2: This phase is devoted to the detection of all sequences of queries, called disclosure transactions, which can be used to defeat the access control mechanism by exploiting the semantics of data dependencies.
– Phase 3: In this phase, we proceed to policy reconfiguration to avoid security breaches. This can be accomplished either at design time (by adding new authorization views) or at run time (by controlling the execution of user queries).

4.1 Synthesizing the Global Schema, the Global Policy and the Global Functional Dependencies

To synthesize the global policy and the global schema from the sources, we resort to the preliminary approach we described in [18]. Here, we briefly recall the principle of the approach. It takes as input a set of source schemas together with their access control policies and performs the following steps. First, it translates the schemas and policies into formal contexts. Second, for each attribute, it identifies the set of rules which are preserved at the level of the sources. This is done by computing the supremum (aka the least common superconcept or least upper bound). Finally, it builds the global schema by combining relevant attributes. When this step is applied to our example, it derives the following global schema (we give an extract of the results; for the complete result, please refer to http://bit.ly/1Iok4Fd), which is composed of three virtual relations:

1. VR1(IDFaculty, Salary, SSN, Insurance, Name) ← Supervisors(IDFaculty, Name, ResearchTeam, Affiliation), Faculty(IDFaculty, SSN, Salary, Insurance).
2. VR2(IDFaculty, ResearchTeam, Affiliation) ← Supervisors(IDFaculty, Name, ResearchTeam, Affiliation), PHDStudents(StudentID, IDFaculty, ThesisTitle, ResearchTeam).
3. VR3(StudentID, IDFaculty, ThesisTitle) ← PHDStudents(StudentID, IDFaculty, ThesisTitle, ResearchTeam), Faculty(IDFaculty, SSN, Salary, Insurance), Supervisors(IDFaculty, Name, ResearchTeam, Affiliation).

It also produces the following global policy, which is associated with the global schema:

– GV1: Authorization(SSN) ← VR1(IDFaculty, Salary, SSN, Insurance, Name), $Role=AD ∨ $Role=FO.
– GV2: Authorization(Salary) ← VR1(IDFaculty, Salary, SSN, Insurance, Name), $Role=AD ∨ $Role=FO.
– GV3: Authorization(IDFaculty) ← VR1(IDFaculty, Salary, SSN, Insurance, Name), VR2(IDFaculty, ResearchTeam, Affiliation), VR3(StudentID, IDFaculty, ThesisTitle), $Role=AD ∨ $Role=DC.
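A simple consistency check on this output: since an access prohibited at a source must stay prohibited at the mediator, the role set granted for an attribute at the mediator can be at most the intersection of the role sets granted by the sources exposing that attribute. The sketch below implements only this intersection reading on the example's grants; it is a simplification for intuition, not the supremum computation of [18].

from functools import reduce

# Role sets granted per attribute by the sources exposing it (extract of the
# running example; SSN and Salary come from source 3 only).
SOURCE_GRANTS = {
    "IDFaculty": [{"DC", "AD", "FO", "PR"},   # Supervisors
                  {"DC", "PR", "AD", "FO"},   # PHDStudents
                  {"AD", "DC"}],              # Faculty
    "SSN": [{"FO", "AD"}],                    # Faculty
    "Salary": [{"FO", "AD"}],                 # Faculty
}

def global_grants(source_grants):
    """Intersect the role sets of all sources exposing an attribute, so that
    an access prohibited at some source stays prohibited at the mediator."""
    return {attr: reduce(set.__and__, role_sets)
            for attr, role_sets in source_grants.items()}

if __name__ == "__main__":
    for attr, roles in global_grants(SOURCE_GRANTS).items():
        print(attr, "->", sorted(roles))
    # IDFaculty -> [AD, DC], SSN -> [AD, FO], Salary -> [AD, FO],
    # matching GV1-GV3 above.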

Similarly to the generation of the global policy and the global schema, making the functional dependencies explicit at the mediator level is a way to anticipate and deal with the inference problem. To do this, we propose to generate a global lattice using the functional dependencies associated with each local source. By exploiting the properties of the lattice, one notices that, by construction, the lattice leads to the identification of the overall functional dependencies. That is, the construction of the lattice (for our experiments, we use the Galicia tools to build the lattices: http://www.iro.umontreal.ca/~galicia/) highlights the global functional dependencies. The construction process relies on an algorithm (Algorithm 1) that transforms all the local functional dependencies into formal contexts according to Definition 3 and, by looping on the lattice, derives a global functional dependency for each concept of the lattice.

Definition 3 (A Functional Dependency as a Formal Context (O, A, R)). Given a functional dependency X → Y, a formal context is obtained from X → Y (where γ^FD is a transformation function) as follows:

γ^FD(X → Y):  O = X,  A = Y,  R = 1

Fig. 1. Global functional dependencies organized in lattice

By considering our running example, Algorithm 1 produces the following functional dependencies, which are organized in a lattice (see Figure 1; throughout the paper, the intent of a formal concept is denoted by I, whereas the extent is denoted by E):

StudentID → ThesisTitle; StudentID → IDFaculty; SSN, Insurance → Salary;
IDFaculty, Affiliation → ResearchTeam; IDFaculty, ResearchTeam → Name;
IDFaculty → SSN; IDFaculty → ResearchTeam and IDFaculty → Name

Please note that each source defines its policy separately and does not take into account the possible associations that can appear when combining its data with those of other sources. In addition, new semantic constraints can appear at the mediator level. These additional constraints can be used by a given user to infer sensitive information from non-sensitive one.


Algorithm 1. Generation of global functional dependencies
Input: Functional dependencies FDs of a local source Si
Output: Global functional dependencies FDG
Begin:
1: FDG ← ∅
2: KFD ← γ^FD(FDs)   // transform the FDs into a formal context (see Definition 3)
3: L^FD_G ← ComputeLattice(KFD)   // compute the lattice of the FD context
4: for each Cj in L^FD_G do   // decomposition: if X → YZ then X → Y and X → Z
5:   for each Y in Cj.Extent do
6:     FDk ← Cj.Intent → Y   // translate the concept into an FD
7:     if FDk ∉ FDG then add FDk to FDG
8:   end for
9: end for
10: return FDG
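As an illustration, the pipeline of Algorithm 1 can be sketched in Python. The concept computation below is a naive closure under intersection rather than the Galicia implementation used in the paper, concepts are oriented in the standard FCA way (extents over LHS attributes, intents over RHS attributes), and all function names are our own; run on the running example, the sketch reproduces the eight global dependencies listed above.

# Sketch of Algorithm 1: local FDs -> combined formal context -> concepts
# -> global FDs. Naive and illustrative only.

FDS = [  # local FDs of the running example: (LHS set, RHS set)
    ({"IDFaculty", "ResearchTeam"}, {"Name"}),
    ({"IDFaculty", "Affiliation"}, {"ResearchTeam"}),
    ({"StudentID"}, {"ThesisTitle"}),
    ({"StudentID"}, {"IDFaculty"}),
    ({"IDFaculty"}, {"ResearchTeam"}),
    ({"IDFaculty"}, {"SSN"}),
    ({"SSN", "Insurance"}, {"Salary"}),
]

def gamma_fd(fds):
    """Definition 3: each FD X -> Y contributes a full block X x Y; the rows
    of the combined context map each LHS attribute to the RHS attributes it
    determines somewhere."""
    rows = {}
    for lhs, rhs in fds:
        for obj in lhs:
            rows.setdefault(obj, set()).update(rhs)
    return {obj: frozenset(attrs) for obj, attrs in rows.items()}

def intents(rows):
    """Every concept intent is an intersection of object rows, so closing
    the rows under pairwise intersection enumerates all intents."""
    all_attrs = frozenset().union(*rows.values())
    closed = {all_attrs}
    for row in rows.values():
        closed |= {intent & row for intent in closed}
    return closed

def global_fds(fds):
    """For each concept (extent, intent), emit extent -> y for every y in
    the intent (the decomposed FDs collected by Algorithm 1)."""
    rows = gamma_fd(fds)
    result = set()
    for intent in intents(rows):
        extent = frozenset(o for o, row in rows.items() if intent <= row)
        if extent and intent:
            result.update((extent, y) for y in intent)
    return result

if __name__ == "__main__":
    for lhs, rhs in sorted(global_fds(FDS), key=lambda fd: (sorted(fd[0]), fd[1])):
        print(", ".join(sorted(lhs)), "->", rhs)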

4.2 Discovery Phase

This phase includes two main steps. First, by resorting to formal concept analysis as a tool to characterize the global policy, we identify the profiles which should be prohibited from accessing some sensitive data at the mediator level. Second, we identify threatening transactions by considering the impact of semantic constraints (e.g., functional dependencies) at the mediator level. First, we introduce two relevant definitions.

Definition 4 (Authorization View Policy as a Formal Context) [18]. Let P be a set of authorization views Vk that govern access to the source (relation/schema) S, and let σ^VBAC be a transformation function. A formal context K(O, A, R) is obtained from P as follows:

σ^VBAC(P):
  ∀ ai ∈ S, ∀ cj ∈ P:  A = ai ∪ cj
  ∀ Vk ∈ P:  O = Rulek
  ∀ ai ∈ Vk, ∀ cj ∈ Vk:  R = 1, otherwise R = 0

where A is the set of formal attributes (composed of the union of all source attributes ai and all constraints (roles) cj), O is the set of formal objects (the authorization views), and R indicates that the authorization view Vk has the query part ai and the constraint/role cj.

Definition 5 (Prohibition Rules). Given an access control constraint context K^PG(Rulek, ci, R) obtained from the global policy using the transformation function σ^VBAC, the complementary relation R̄ corresponds to the prohibition rules: Rulek R̄ ci holds iff K^PG(Rulek, ci) = 0.


Computing Prohibition Rules. Prohibition rules are rules which can be used to deny access to some data at the mediator level in order to comply with the local policies. In order to detect such rules, we propose to extract the whole set of rules based on the characterization of the authorization views with formal concept analysis. Basically, the authorization views are transformed into formal contexts, and then we identify the prohibition rules by using the notion of prohibition rule introduced in Definition 5. In order to represent the global policy as a formal context, we use the function σ^VBAC to generate the corresponding global policy context. After the extraction of the formal contexts, we split them into two formal contexts: a formal context for access constraints and a formal context for attributes. A relationship (Rule, A ∪ C) ∈ R means that each profile (role) r ∈ C can access the attribute ai. The context (Rule, A ∪ C, R) grants access to attributes by roles, and we can read from the context which attributes are granted to a role. Based on Definition 5, we obtain, from the above context, four prohibition rules which are used in the disclosure discovery process:

– Rule 1: PR, DC → SSN, Salary, Insurance
– Rule 2: PR, DC → SSN, Name, Salary
– Rule 3: PR, DC → Salary, Name
– Rule 4: PR, DC → IDFaculty, ResearchTeam, Salary
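A minimal sketch of the complement computation of Definition 5, assuming the global policy context is summarized as a map from attributes to granted roles (here only the extract GV1–GV3; the paper derives its four rules from the full context). The names prohibitions and GLOBAL_GRANTS are ours.

# Complement of the grant context (Definition 5) on the GV1-GV3 extract.
ROLES = {"AD", "DC", "FO", "PR"}

GLOBAL_GRANTS = {  # attribute -> roles granted at the mediator
    "SSN": {"AD", "FO"},
    "Salary": {"AD", "FO"},
    "IDFaculty": {"AD", "DC"},
}

def prohibitions(grants, roles):
    """Complementary relation R-bar: a role is prohibited on an attribute
    iff the grant context holds a 0 at that (attribute, role) cell."""
    denied = {r: set() for r in roles}
    for attr, granted in grants.items():
        for role in roles - granted:
            denied[role].add(attr)
    return denied

if __name__ == "__main__":
    for role, attrs in sorted(prohibitions(GLOBAL_GRANTS, ROLES).items()):
        print(role, "may not access:", sorted(attrs))
    # On this extract, DC and PR are both denied Salary, in line with
    # prohibition rules such as Rule 3 (PR, DC -> Salary, Name).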

Identification of Disclosure Transactions. By exploiting the semantics of functional dependencies, we present a naive method for automatically retrieving the sets of queries which display the following property: when all the queries of a given set are executed, the combination of their results could lead to the disclosure of sensitive information. The principle of our approach is the following: first, we extract the functional dependencies. Then, we improve this process by taking into account the presence of prohibition rules. Finally, we show that it is sufficient to compute only the disclosure transactions by generating the lattice of the functional dependencies involved in each prohibition rule. For example, the following functional dependencies (1) and (2) can be used to overcome the prohibition rule PR, DC → Salary, Name in order to access Salary and Name:

SSN, Insurance → Salary (1)
IDFaculty, ResearchTeam → Name (2)

To detect the sequences of violating queries, as a first step, the formal context (see Table 1) is generated from the above functional dependencies using Definition 3. Then, using the FD context (see Table 1), we generate a lattice of formal concepts (see Figure 2) which is used by Algorithm 2 to detect the sequences of violating queries. The lattice shows how we can exhaustively compute the disclosure transactions which can be exploited by a malicious user to infer Salary and Name by issuing the sequences of queries Q1, Q2 and Q3.

Table 1. K^FD(Xi, Yi, R): Functional dependencies formal context (the columns Name and Salary form Y (RHS))

X (LHS)       | Name | Salary
--------------+------+-------
IDFaculty     |  1   |  0
ResearchTeam  |  1   |  0
SSN           |  0   |  1
Insurance     |  0   |  1

Fig. 2. Disclosure Transaction Lattice of Salary, Name

The example of Figure 2 shows the lattice with four formal concepts: the top concept C1, the FD concepts C2 and C3, and the bottom concept C4. Our algorithm exploits this lattice: for each parent concept Ci of a given concept having the attributes of the prohibition rule as intent (e.g., the bottom concept C4), it derives a disclosure transaction DT containing a sequence of queries. Considering our example, Algorithm 2 generates the disclosure transaction set SetDT = {DT1(QFD1, Q1), DT2(QFD2, Q2), DT3(QFD1, QFD2, Q3)}. So, if all the queries of any transaction DTi are issued and evaluated, then the prohibition rule Rule 3 (Salary, Name) is violated. Hence, to deal with this issue and to avoid disclosure transaction completion, the next step consists in reconfiguring the global policy with additional rules in such a way that no transaction can be completed.


Algorithm 2. Disclosure transaction discovery
Input: Functional dependencies FDSet, prohibition rule policy P
Output: Disclosure transaction set SetDT
Begin:
1: for each rule Ri ∈ P do
2:   FDRi ← ExtractFDs(FDSet, Ri)
3:   K^FD_Ri ← γ^FD(FDRi)
4:   L^FD_Ri ← ComputeLattice(K^FD_Ri)
5:   for each Cj in L^FD_Ri do
6:     if Ri.Att ⊆ Cj.Intent then
7:       for each parent Pk of Cj in L^FD_Ri do
8:         QFDs ← Q({Pk.Intent ∪ Pk.Extent})
9:         Qk ← Q({Cj.Intent ∪ Pk.Extent} \ {Pk.Intent})
10:        add DTi{QFDs, Qk} to SetDT
11:        DTn ← DTn ∪ {QFDs}
12:      end for
13:    end if
14:    Qk+1 ← Q(L^FD_Ri.TopConcept.Extent)
15:    add DTn ← {QFDs, Qk+1} to SetDT
16:  end for
17: end for
18: return SetDT
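The Python sketch below mimics Algorithm 2 for a single prohibition rule in the special case of Figure 2, where the lattice has exactly one concept per relevant functional dependency plus a top and a bottom concept. Queries are modeled as attribute sets; the expressions for QFDs and Qk follow lines 8-9 and 14 of the algorithm, but the flattened lattice and all names are our own simplifying assumptions.

# Simplified disclosure transaction discovery for one prohibition rule.
from itertools import chain

FDS = [  # FDs whose right-hand sides hit the prohibited attributes
    (frozenset({"IDFaculty", "ResearchTeam"}), "Name"),
    (frozenset({"SSN", "Insurance"}), "Salary"),
]
TARGET = {"Name", "Salary"}  # attributes of prohibition rule 3 (PR, DC)

def disclosure_transactions(fds, target):
    """Each FD concept (LHS, {rhs}) is a parent of the bottom concept whose
    intent is the whole target; pairing its query with the complementary
    query yields one transaction (DT1, DT2), and combining all FD queries
    with the top concept's extent yields the last one (DT3)."""
    dts = []
    for lhs, rhs in fds:
        q_fd = frozenset(lhs | {rhs})                # query exploiting the FD
        q_rest = frozenset((target | lhs) - {rhs})   # remaining data needed
        dts.append({q_fd, q_rest})
    top_extent = frozenset(chain.from_iterable(lhs for lhs, _ in fds))
    dts.append({frozenset(lhs | {rhs}) for lhs, rhs in fds} | {top_extent})
    return dts

if __name__ == "__main__":
    for i, dt in enumerate(disclosure_transactions(FDS, TARGET), start=1):
        print(f"DT{i}:", [sorted(q) for q in sorted(dt, key=sorted)])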

4.3 Policy Healing

Policy healing can be applied at two levels. First, at design time, with policy completion, by adding a new set of authorization rules. Such rules are applied to make sure that no transaction can be completed. Second, at run time. This can be accomplished by means of a monitoring process which requires storing the previous queries.

Global Policy Completion. Policy completion is the step which can be used by an administrator to repair the global policy by adding new rules which prevent the completion of transactions. The main issue is then how to identify the (minimal) set Q of queries from which to build the new authorization rules. The new authorization rules should display the following property: for any disclosure transaction DTi, at least one Qi ∈ Q is denied. The (minimal) set of queries used to generate the new authorization rules must also ensure that there are no redundant access control rules, and it should cover all the disclosure transactions. The principle of the underlying algorithm (see Algorithm 3) is the following: it starts by initializing the set of revoked queries, then it runs through the Galois sub-hierarchy lattice of disclosure transactions (see Figure 3). For each pair of concepts (Ci, Cj), it checks whether the hierarchy order (Ci ≤ Cj) holds between the two concepts. If the subconcept Ci is not marked, then it is used in the query revocation process; otherwise, the superconcept Cj is used. Finally, it generates the additional authorization views based on each revoked query we obtained.

Fig. 3. Galois Sub-Hierarchy Lattice of Disclosure Transactions

Algorithm 3. Minimal queries revocation
Input: Disclosure transaction Galois sub-hierarchy lattice GSH_DT, PG: global policy
Output: P'G, the new global policy with additional rules
Begin:
1: for each pair of concepts (Ci, Cj) in GSH_DT, i ≠ j do
2:   if Ci ≤ Cj then
3:     if Ci not marked then Cr ← Cj; mark Cj (*)
4:     else Cr ← Ci; mark Cj; end if
5:     if |Cr.Intent| = 1 then Qr ← Q1
6:     else Qr ← Q1, Q1 ∈ Cr.Intent end if
7:     for each query Qk+1 in Cr.Intent do
8:       if Qk+1 ⊂ Qr then Qr ← Qk+1 end if
9:     end for
10:  end if
11:  add new authorization view GV(Qr) to the global policy P'G
12: end for
13: return P'G
(*) Note that all concepts of the lattice carry a label that indicates whether the concept has been marked by itself or by its super-concepts.
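Algorithm 3 obtains non-redundant revocations by traversing the Galois sub-hierarchy. As a point of comparison only, the same covering property (every disclosure transaction must contain at least one denied query) can be approximated by a greedy hitting-set heuristic; the sketch below implements that heuristic, not the GSH traversal of Algorithm 3.

# Greedy hitting-set baseline: deny queries until every disclosure
# transaction contains at least one denied query.

def revoke_queries(transactions):
    """Greedy approximation of the minimal revocation set: repeatedly deny
    the query that breaks the largest number of still-open transactions."""
    uncovered = [set(dt) for dt in transactions]
    revoked = set()
    while uncovered:
        counts = {}
        for dt in uncovered:
            for q in dt:
                counts[q] = counts.get(q, 0) + 1
        best = max(counts, key=counts.get)
        revoked.add(best)
        uncovered = [dt for dt in uncovered if best not in dt]
    return revoked

if __name__ == "__main__":
    q_fd1 = frozenset({"IDFaculty", "ResearchTeam", "Name"})
    q_fd2 = frozenset({"SSN", "Insurance", "Salary"})
    q1 = frozenset({"IDFaculty", "ResearchTeam", "Salary"})
    q2 = frozenset({"SSN", "Insurance", "Name"})
    q3 = frozenset({"IDFaculty", "ResearchTeam", "SSN", "Insurance"})
    set_dt = [{q_fd1, q1}, {q_fd2, q2}, {q_fd1, q_fd2, q3}]
    for q in revoke_queries(set_dt):
        print("add a denial rule for the query over:", sorted(q))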

Stop that Query. In order to ensure maximal availability of data at the mediator level while ensuring the non-disclosure of sensitive information, one can choose a run-time approach which consists in monitoring the execution of queries and revoking those queries that could lead to the violation of policies. The main problem with run-time query revocation is that it requires storing the past queries and computing the correlation between the current query and the past ones. By resorting to formal concept analysis, one can enhance the monitoring process.
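A minimal sketch of this run-time check, assuming the disclosure transactions have been precomputed as sets of queries. The paper organizes the correlation test in a Query Labeling Lattice; the per-user history and the linear scan below are a naive stand-in for it.

# Naive "Stop that Query" monitor: per-user history plus precomputed
# disclosure transactions; a query is refused if answering it would
# complete one of them.

class QueryMonitor:
    def __init__(self, transactions):
        self.transactions = [frozenset(dt) for dt in transactions]
        self.history = {}  # user -> set of queries already answered

    def allow(self, user, query):
        seen = self.history.setdefault(user, set())
        for dt in self.transactions:
            if query in dt and dt <= seen | {query}:
                return False  # would complete a disclosure transaction
        seen.add(query)
        return True

if __name__ == "__main__":
    q_fd1 = frozenset({"IDFaculty", "ResearchTeam", "Name"})
    q1 = frozenset({"IDFaculty", "ResearchTeam", "Salary"})
    monitor = QueryMonitor([{q_fd1, q1}])
    print(monitor.allow("u1", q_fd1))  # True: harmless on its own
    print(monitor.allow("u1", q1))     # False: would expose (Name, Salary)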

5 Experiments

We provide an experimental evaluation of our proposed methodology and an analysis of the proposed algorithms. We designed a synthetic scenario by using three generators. The scenario includes data sources, authorization views and functional dependencies. The synthetic scenario (including the generators) was implemented in Java, and we used the Galicia API. Experiments were carried out on a Linux PC with Ubuntu 14.04 LTS, an Intel Core 2 Duo CPU at 2.00 GHz and 4 GB of RAM.

(a) Data Sources Generator: Given a number (N) of relations, an average number (A) of sub-elements of each relation and an average number (J) of join attributes, the source generator randomly produces a set R of relations defined in such a way that J corresponds to the number of joins that can be built from the attributes shared by the generated relations. Each join attribute is randomly chosen from the set of attributes of each relation. In the experiments, we considered N ranging from 5 to 20, A ranging from 10 to 20, and J ranging from 5 to 10.

(b) Authorization Views Generator: Given a source schema R, we consider three numbers: P (the number of authorization views), Q (the number of attributes used in the project clause of queries), and C (the number of profiles). In our experiments we considered P = 80 rules, Q = 3, and C = 10.

(c) Functional Dependencies Generator: Given a relational schema R and a natural number m, the functional dependencies generator randomly produces a set Σ_FD of functional dependencies on R, such that the average number of functional dependencies for each relation in R is m. The generator also takes two other parameters as inputs, namely LHS and RHS: LHS is the maximum number of attributes in each left-hand side of the generated functional dependencies, and RHS is the maximum number of attributes in each right-hand side. The experiments were conducted on various Σ_FD ranging from 10 to 80 functional dependencies per source, with the LHS containing from 3 to 10 attributes and the RHS containing from 2 to 5 attributes.
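For concreteness, the functional dependencies generator (c) can be sketched as follows; the function signature, parameter names and relation layout are our own assumptions and only approximate the Java generators described above.

# Sketch of the FD generator (c): for each relation, draw m random FDs with
# bounded LHS/RHS sizes (names and layout are assumptions).
import random

def generate_fds(relations, m, max_lhs, max_rhs, seed=None):
    """relations: dict relation name -> list of attribute names.
    Returns dict relation name -> list of (LHS tuple, RHS tuple)."""
    rng = random.Random(seed)
    fds = {}
    for name, attrs in relations.items():
        rel_fds = []
        for _ in range(m):
            lhs_size = rng.randint(1, min(max_lhs, len(attrs) - 1))
            lhs = rng.sample(attrs, lhs_size)
            remaining = [a for a in attrs if a not in lhs]
            rhs_size = rng.randint(1, min(max_rhs, len(remaining)))
            rhs = rng.sample(remaining, rhs_size)
            rel_fds.append((tuple(lhs), tuple(rhs)))
        fds[name] = rel_fds
    return fds

if __name__ == "__main__":
    relations = {f"R{i}": [f"a{i}_{j}" for j in range(10)] for i in range(5)}
    for rel, fds in generate_fds(relations, m=3, max_lhs=3, max_rhs=2, seed=7).items():
        print(rel, fds)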

We conducted our experimental evaluation as follows. We performed a large number of trials so that, for each run, we generated a random number of sources ranging from 5 to 20, a number of attributes ranging from 10 to 20, a number of common attributes (joins) ranging from 5 to 10, a number of functional dependencies (of the form X → Y) ranging from 10 to 80 for each local source with X and Y involving 5 to 10 attributes, and 80 access control rules with 10 roles and 3 attributes. For each set of generated tests, we measured the size of the lattices built by our algorithms and the time required to build them, for each step of our methodology, including:

• |GFDTime| is the time necessary to infer the global functional dependencies, and |GFDLattice| is the size of the obtained lattice (number of concepts).
• |DTDTime| is the time required to detect all the suspicious transactions, and |DTSize| is the number of such transactions.
• |QueriesSize| is the number of queries, and |MQRTime| is the time required to extract a minimal query set.
• |NewP| is the number of access control rules added to the policy to prevent the completion of prohibited transactions.
• |LQQTime| and |LQQ| correspond to the time required to generate the Query Labeling Lattice and its size, respectively.

Fig. 4. Computational time for different steps of our approach

The time required to compute the global functional dependencies and the suspicious transactions: Figure 4 shows the time required to compute the disclosure transactions. It significantly exceeds both the time required to build the lattice and the time to compute the minimal query revocation. We can also see that the time to compute the suspicious transactions (DTD) with 80 functional dependencies is higher than the time required with 20 functional dependencies. The results show that the global time to compute the lattice of functional dependencies is between 32 and 81 seconds for data sets containing 10 to 80 functional dependencies per source.


Fig. 5. Impact of functional dependencies on the global policy

Impact of functional dependencies on the global policy: Figure 5 shows the relationship between the number of disclosure transactions and the number of additional rules. At each run, we pick, as a decision metric, the new rule that appears most often in the disclosure transaction discovery. The number of queries also increases at the same rate as that of the functional dependencies. We also noticed that the number of access control rules to be added increases with the number of inferred dependencies as well as with the number of identified transactions. This means that functional dependencies have a significant impact on defeating access control mechanisms in order to infer sensitive information from non-sensitive one.

Query Labeling Lattice: To analyze the impact of large databases with a large number of functional dependencies on the Query Labeling Lattice (QLL), we generated a dataset composed of 20 sources with 10 attributes, 80 rules and 160 functional dependencies per source. Figure 6 shows the time required to construct such a QLL and its size. We note that the more functional dependencies we infer, the more queries are identified. The results show that the size of the QLL and the time required to compute it increase with the number of newly discovered queries.

Fig. 6. Impact of identified queries on the size of the query labeling lattice and its computation time

Figure 7 shows the impact of functional dependencies on the time required to lock a query. We note that the time required to run through a QLL and to evaluate the policy decision for a given query increases with the number of illicit transactions as well as with the number of identified queries. This clearly indicates that the number of queries and transactions has a significant impact on the time required to lock a query at run time. So the main problem of query tracking is the enormous number of queries that needs to be managed.

Fig. 7. Time required to lock a query of a transaction

Further tests will be necessary to fully understand the behavior of the Stop That Query method. However, the current track seems promising, especially if combined with efficient and fast algorithms [11] for lattice construction and navigation. The flexibility of Stop That Query can be improved through approaches that adapt incremental algorithms for lattice construction and lattice reduction. In our evaluation, we mainly focused on the analysis of the performance and the scalability of our proposal, for the following reason: the use of the lattice as a common representation framework to model the global schema, the policies and the functional dependencies would require comparing each of our steps to each related approach separately. Moreover, it is difficult to compare our approach to the related work because no implementation or experimental data sets (policies, functional dependencies and local schemas) are available.

6 Conclusion

In this work we have investigated the problem of illicit inferences that result from combining semantic constraints with authorized information in a data integration context. We proposed an approach which exploits formal concept analysis to capture security policies and data dependencies. By resorting to formal concept analysis, we were able to build both a global schema and the underlying global policy. The global policy is refined in such a way that violating queries are blocked in advance (at design time or at run time). There are many research directions to pursue: when a new source joins the system or an existing source leaves the system, it is necessary to revise the global schema and the global policy. In this case, an incremental approach should be designed. Another interesting issue concerns data dependencies. Indeed, there are other semantic constraints that could play an important role in inference problems; examples are inclusion dependencies, multivalued dependencies and partial functional dependencies. We are investigating these issues. In future work, we intend to integrate our approach as a plug-in for Talend Data Integration Studio (http://fr.talend.com/index.php).

Acknowledgments. This work is supported by Thomson Reuters in the framework of the Partner University Fund project "Cybersecurity Collaboratory: Cyberspace Threat Identification, Analysis and Proactive Response". The Partner University Fund is a program of the French Embassy in the United States and the FACE Foundation and is supported by American donors and the French government.

References

1. Cuzzocrea, A., Hacid, M.-S., Grillo, N.: Effectively and efficiently selecting access control rules on materialized views over relational databases. In: International Database Engineering and Applications Symposium (IDEAS), pp. 225–235 (2010)
2. Denning, D.E., Schlörer, J.: Inference controls for statistical databases. Computer 16(7), 69–82 (1983)
3. De Capitani di Vimercati, S., Foresti, S., Jajodia, S., Paraboschi, S., Samarati, P.: Assessing query privileges via safe and efficient permission composition. In: ACM Conference on Computer and Communications Security, pp. 311–322 (2008)
4. Farkas, C., Jajodia, S.: The inference problem: a survey. SIGKDD Explor. Newsl. 4(2), 6–11 (2002)
5. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations, 1st edn. Springer-Verlag New York Inc., Secaucus (1997)
6. Goguen, J.A., Meseguer, J.: Unwinding and inference control. In: Proceedings of the 1984 IEEE Symposium on Security and Privacy, pp. 75–86. IEEE Computer Society (1984)
7. Haddad, M., Hacid, M.-S., Laurini, R.: Data integration in presence of authorization policies. In: 11th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2012, Liverpool, United Kingdom, June 25–27, pp. 92–99 (2012)
8. Haddad, M., Stevovic, J., Chiasera, A., Velegrakis, Y., Hacid, M.-S.: Access control for data integration in presence of data dependencies. In: Bhowmick, S.S., Dyreson, C.E., Jensen, C.S., Lee, M.L., Muliantara, A., Thalheim, B. (eds.) DASFAA 2014, Part II. LNCS, vol. 8422, pp. 203–217. Springer, Heidelberg (2014)
9. Fellegi, I.P., Phillips, J.L.: Statistical confidentiality: some theory and application to data dissemination. In: Annals of Economic and Social Measurement, vol. 3(2), pp. 101–112. National Bureau of Economic Research, Inc. (1974)



10. Kumar, C., et al.: Designing role-based access control using formal concept analysis. Security and Communication Networks 6(3), 373–383 (2013)
11. Mouliswaran, S.C., Kumar, Ch., Chandrasekar, C., et al.: Modeling Chinese wall access control using formal concept analysis. In: 2014 International Conference on Contemporary Computing and Informatics (IC3I), pp. 811–816. IEEE (2014)
12. Nait-Bahloul, S.: Inference of security policies on materialized views. Master 2 research report (2009). http://liris.cnrs.fr/~snaitbah/wiki
13. Özsu, M.T., Valduriez, P.: Principles of Distributed Database Systems, 3rd edn. Springer (2011)
14. Rizvi, S., Mendelzon, A.O., Sudarshan, S., Roy, P.: Extending query rewriting techniques for fine-grained access control. In: Weikum, G., König, A.C., Deßloch, S. (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13–18, pp. 551–562. ACM (2004)
15. Rosenthal, A., Sciore, E.: View security as the basis for data warehouse security. In: CAiSE Workshop on Design and Management of Data Warehouses, pp. 5–6 (2000)
16. Rosenthal, A., Sciore, E.: Administering permissions for distributed data: factoring and automated inference. In: Proc. of IFIP WG 11.3 Conf. (2001)
17. Schlörer, J.: Security of statistical databases: multidimensional transformation. ACM Trans. Database Syst. 6(1), 95–112 (1981)
18. Sellami, M., Gammoudi, M.M., Hacid, M.S.: Secure data integration: a formal concept analysis based approach. In: Decker, H., Lhotská, L., Link, S., Spies, M., Wagner, R.R. (eds.) DEXA 2014, Part II. LNCS, vol. 8645, pp. 326–333. Springer, Heidelberg (2014)
19. Sobieski, Ś., Zieliński, B.: Modelling role hierarchy structure using the formal concept analysis. Annales UMCS Sectio AI Informatica 10, 143–159 (2015)
20. Su, T.-A., Özsoyoglu, G.: Data dependencies and inference control in multilevel relational database systems. In: Proceedings of the 1987 IEEE Symposium on Security and Privacy, Oakland, California, USA, April 27–29, pp. 202–211 (1987)
21. Wang, L., Jajodia, S., Wijesekera, D.: Lattice-based inference control in data cubes. In: Preserving Privacy in On-Line Analytical Processing (OLAP). AIS, pp. 119–145. Springer US (2007)