Inference control in data integration systems

Mokhtar Sellami · Mohand-Said Hacid · Mohamed Mohsen Gammoudi

Higher Institute of Technological Studies of Kef, Tunisia – E-mail: [email protected] · Université de Lyon, France – E-mail: [email protected] · Higher Institute of Multimedia Arts of Manouba, Tunisia – E-mail: [email protected]

Received: date / Accepted: date

Abstract Specifying a global policy in a data integration system in the traditional way would not necessarily offer a sound and efficient solution to deal with the inference problem (an inference is a means by which one can infer sensitive data from non-sensitive data [10]). This is due to the fact that data dependencies (between distributed data sets) are not taken into account when local policies are defined. In this paper, we propose a methodology, together with a set of algorithms, that helps discover inferences by considering semantic constraints. Given a set of local policies, an initial global policy and data dependencies, the methodology allows the security administrator to derive the sets of queries whose combined results could disclose sensitive information. First, we use formal concept analysis to identify these illegal query sets, called disclosure transactions, and then we derive a set of additional rules which are used to extend the policy of the mediator in such a way that inferences are blocked. We also discuss the set of experiments we conducted.

Keywords Access Control, Security, Data Integration, Inference Problem, Security and Privacy – database security, privacy, access controls

Acknowledgements This work is supported by Thomson Reuters in the framework of the Partner University Fund project "Cybersecurity Collaboratory: Cyberspace Threat Identification, Analysis and Proactive Response". The Partner University Fund is a program of the French Embassy in the United States and the FACE Foundation and is supported by American donors and the French government.

1 Introduction

Nowadays, database systems [30] are concerned with the management of health data, insurance data, scientific data, legislation data, military data, human communications data, emails and tweets, among other kinds of data (see "Challenges and Opportunities with Big Data", http://bit.ly/1LPHXES). This demonstrates the importance databases play in our everyday life. These systems should also provide services able to control access to the information they handle. Indeed, even though these systems offer great benefits, users are reluctant to use them if their privacy is not preserved [1].

Data integration aims at providing a unique entry point to a set of data sources. This is achieved by defining a user view between the user and the sources. Hence, users do not have to query each source separately and then combine the results; they only need to query the global schema of the user view. The data integration system is then in charge of identifying and accessing the relevant sources and collecting the partial results. Finally, it combines the partial results into a complete answer that is returned to the user.

In the general security-aware integration setting, the source databases are equipped with their own local security policies. Each source is autonomous, since it is designed independently from the others, and local applications should continue running after the integration. Moreover, none of the sources is able to anticipate the data inferences that can arise when data is integrated.

In this paper, we focus on the security challenge that mainly arises in data integration systems: controlling access to a data integration system. In this kind of architecture, security aspects, and access control in particular, represent a major challenge. Indeed, every source, designed independently of the others, defines its own access control policy. The problem is the following: how to define a representative policy at the mediator level that preserves the policies of the sources? Preserving the sources' policies means that an access prohibited at the source level should also be prohibited at the mediator level. The policy of the mediator also needs to protect data against indirect accesses. An indirect access occurs when one can synthesize sensitive information from a combination of non-sensitive information (and semantic constraints). Detecting all indirect accesses in a given system is referred to as the inference problem [10].

We propose a formal concept analysis based approach to derive the global policy of the mediator. The proposed approach preserves the local (source) policies and provides information about the inference problem, in that it allows us to detect transactions that may (indirectly) lead to the violation of security policies. First, from the schemas and the security policies of the local sources, we derive a set of security policies (called the global policy) that is attached to the mediator. We then exploit interactions between elements of the global schema, in the presence of the global policy, to infer implicit combinations of queries that could lead to the violation of local security policies. Our approach also accommodates dynamic changes of the local policies and schemas.

The remainder of the paper is organized as follows: Section 2 gives an overview of research efforts related to security modeling and inferences in database contexts. Section 3 provides definitions of the main (technical) concepts we use in this paper. Section 4 introduces a running example. In Section 5 we describe our approach. Section 6 analyzes the complexity of our algorithms. Section 7 describes the experiments. We conclude in Section 8.


2 Related Work

We discuss the different approaches that are related to our problem, and then the different types of inferences that have been studied in the literature.

2.1 Formal Concept Analysis (FCA) and Access Control

In this section, we present the approaches that use formal concept analysis or concept languages as a formal tool for reasoning about access control policies.

The authors of [45] proposed a lattice-based solution for controlling inferences caused by unprotected aggregations. Instead of detecting inferences, they first prevent the adversary from combining multiple aggregations for inferences by restricting queries; they then remove the remaining inferences caused by individual aggregations. The proposed method applies to any combination of aggregation functions, external knowledge and sensitivity criteria, provided they meet some clearly stated properties.

In [21], the authors proposed to use attribute exploration from formal concept analysis on the dyadic formal context derived from a triadic security context to design RBAC. The lattice structure resulting from the analysis handles separation of duties and role hierarchies.

In [27], the authors investigated a formal context for the Chinese wall security policy. The lattice structure derived from the formal context makes it possible to implement the access permissions of the Chinese wall security policy and its access rules, such as the read and write rules.

In [39], a method based on formal concept analysis is proposed that facilitates discovering roles from the existing permission-to-user assignments. This approach is relevant when a system contains a large number of users and objects.

In [6], a Description Logics (DLs) based framework was proposed to formalize RBAC. The benefit of using Description Logics in the RBAC setting is the ability to express more of the intended constraints than can be expressed in common RBAC. For deriving additional constraints, the authors introduced a strict methodology based on the attribute exploration method known from formal concept analysis. Attribute exploration allows one to systematically find unintended implications, derive constraints and make them explicit.

From a technical point of view, our work extends the traditional approaches by resorting to FCA as a tool for reasoning about security policies in a distributed setting. From an application point of view, we consider a data integration scenario where it is mandatory to accommodate data dependencies, security policies, and inference problems.

2.2 Views, access control and inferences

Rosenthal and Sciore [33, 34] considered the problem of how to automatically coordinate the access rights of a warehouse with those of the sources. The authors proposed a theory that allows the automated inference of many permissions for the warehouse by a natural extension of the standard SQL grant/revoke model (see http://www.techonthenet.com/oracle/grant_revoke.php) to systems with redundant and derived data. The authors also defined the witness notion, which includes the use of views. In other words, a user may access only a part of a table T, represented by a view. Hence, the user has clearance to view the values of T that contribute to the view (information permission), and (s)he is allowed to execute a query that physically accesses T, but only for the purpose of computing the view (physical permission). Note that the notion of equivalence is very important in that paper. Now, assume we define a materialized view mv1 based on the join of two tables T1 and T2, and we specify two views v1 and v2 which define the data that one has the right to access in T1 and T2, respectively. To determine whether a user has the right to access mv1, one must find a query q equivalent to mv1 that uses only T1 and T2 and contributes to v1 and v2, respectively. The proposed framework concludes that the user has no right to access mv1 if the inference mechanism does not find an equivalent query, even if the user has the right to access a part of mv1. In our approach, we propose a more flexible model: by inferring the set of authorization views that control access to the materialized view, we allow users to access even a part of the materialized view. To summarize, the framework proposed by the authors only determines whether a user has the right to access a derived table (based on explicit permissions), whereas our proposal goes further by determining which part of the derived table the user has the right to access. Also, the authors of [33, 34] stated the inference rules at a high level; the properties of the underlying inference system and the efficiency of the proposed algorithm were not investigated and remain an open research issue.

In [5], the authors built on [28] to provide a way to select the access control rules to be attached to materialized view definitions, based on the access control rules over the base relations. They resort to the basic form of the bucket algorithm, which does not allow deriving all the relevant access control rules. Another limitation of this work is that, since they only deal with the selection of rules, the framework remains strongly dependent on the base relations: the body of the derived rules involves base relations only. In the present work, we synthesize new rules from existing rules, where the body of the new rules can make reference to materialized views.

In [9], the authors propose, instead of authorization views, a graph-based model to define the access control rules and query profiles, the latter capturing the information content of a query through the use of graphs. The authors resort to the coloring, composition and traversal of graph paths to define an efficient and effective access control model. A permission p is a rule of the form [A, R] → s, which states that subject s can view the sub-tuples over the set of attributes A belonging to the join among the relations R. The major drawback of this approach is the impossibility of defining permissions on a subset of tuples (selection). Indeed, the model only determines access to a subset of attributes, and therefore there is no way to model content-based access control.

2.3 Inference problem

The inference problem refers to the ability of a malicious user to synthesize sensitive information from a combination of non-sensitive information [11]. This problem is motivated by some limitations inherent to traditional security enforcement mechanisms: traditional models do not adequately consider indirect access to information. An indirect access occurs when a malicious user is able, by combining a set of non-sensitive accesses, to infer sensitive information.

Different types of inferences have been studied in the literature, and specific approaches have been proposed to deal with each of them. Usually, each solution deals with a particular attack, an attack being the method used to infer some prohibited information. This is due to the fact that, depending on the configuration of a system, different attacks can occur. A simple example of such a configuration parameter is the query language provided to users: allowing users to issue aggregate queries, and not only conjunctive queries, widens the space of attacks. Another parameter is the kind of policy that is established. Depending on the nature of the attack, appropriate solutions have been proposed.

Historically, the first inference attacks appeared on statistical databases [2]. In those systems, users are allowed to obtain information about data trends over populations but are not allowed to access data about individuals. The idea of a statistical attack is to combine a set of statistical information to infer information about a particular individual. The second kind of attack rests on the exploitation of semantic constraints [40, 42]: malicious users take advantage of semantic relations between datasets to infer prohibited information. There has been an extensive amount of work devoted to this issue from different perspectives. With the emergence of data mining approaches, new inference threats were identified [47]; this led to the design of methods that balance the utility of data mining approaches with data confidentiality. More recently, de-identification attacks have been recognized [35]. Indeed, with the emergence of data publishing approaches, new threats to data confidentiality appeared. In data publishing approaches, data is anonymized before it is released. The issue that arises is that classical anonymization techniques have some limitations: they remove the attributes which can be used for data identification before releasing the data, but it has been shown [41] that inferring individual identity can be achieved even when such attributes are removed. We shortly discuss each of these attacks separately, together with the solutions that have been proposed to deal with them.

2.3.1 Statistical attacks

Statistical attacks have been extensively studied (see [2] for a survey). These attacks occur when a policy is designed to protect data about individuals (e.g., the salary of a particular employee) while providing access to aggregate information (e.g., the average salary within a company). This kind of attack mainly results from the overlap between query results. A number of methods have been proposed to deal with statistical attacks; they can be classified into three categories: query restriction [7], data perturbation [36] and output perturbation [20].

Query restriction [7, 12, 13, 19] can be performed by constraining the number of tuples used to construct the answer to a query; this induces the definition of an arbitrary threshold on the number of tuples involved in the answer. Another way to achieve query restriction is to check for overlap between queries: if two queries share a large number of tuples, their results could induce prohibited information disclosure. Yet another approach is cell suppression, which removes some sensitive cells from the database before query evaluation. A last method is query auditing, which requires recording all user queries: when a new query is issued, the combination of the new query with past queries is checked before the new query is evaluated.

Data perturbation [31, 36, 24, 22, 46, 43] modifies the data in such a way that data inferences are limited. There are two methods devoted to data perturbation. The first one, called probability distribution, aims at changing the data while keeping the same distribution; the goal is to preserve the results of aggregate queries even though the data is changed. The second approach is data perturbation proper, in which the data is modified by adding noise to it; for example, numerical values can be modified by adding a constant parameter to them. These approaches induce a bias problem, which means that the data provided to users can induce some errors: the means or frequencies of the modified data can differ from the real values.

Output perturbation [8, 4, 20, 29] modifies the query results in order to prevent sensitive information disclosure. An example of output perturbation is the use of sample queries, where a subset of the data is selected randomly and queries are evaluated against this sample. The accuracy of the results is the main issue in such approaches.
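The noise-addition idea behind data (and output) perturbation can be made concrete with a minimal sketch. The following Python fragment is ours, for illustration only; the column values and the noise scale are invented. It shifts each value by zero-mean noise, so that aggregates remain approximately correct while individual values are hidden:

import random

def perturb(values, scale=100.0):
    # Add zero-mean uniform noise to each value: sums and averages stay
    # approximately correct, individual values no longer reveal the data.
    return [v + random.uniform(-scale, scale) for v in values]

salaries = [3000, 3200, 4100, 5600]       # original (sensitive) values
noisy = perturb(salaries)
print(sum(salaries) / len(salaries))      # true mean
print(sum(noisy) / len(noisy))            # approximately the same mean

The bias problem mentioned above is visible here: the perturbed mean only approaches the true mean, and the error grows with the noise scale.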

2.3.2 Semantic attacks

Approaches dealing with semantic attacks tackle issues that arise in traditional access control mechanisms. In particular, they consider indirect accesses in addition to direct accesses. The idea in such approaches is to detect the indirect accesses that could induce prohibited information disclosure; such an indirect access is referred to as an inference channel. These approaches aim at achieving two goals: the first is inference channel detection; the second is to provide mechanisms able to deal with these inference channels.

Semantic attacks were identified as a serious threat to data security at a time when multilevel security policies were very popular [15, 40]. Many approaches have been designed to support this kind of policy; we note that the rationale behind those approaches applies to other models as well. In semantic attacks, data dependencies can be used to infer sensitive information; these dependencies exist regardless of the access control model that is used. First attempts to study the inference induced by semantic constraints were reported in [15, 26]. For instance, [26] defines a function, based on information theory, to characterize the inference between two objects X and Y; this function describes the amount of information that one would know about Y given the knowledge of X. This approach has a limitation: it has been shown that such a function is impossible to define in most cases. Other approaches have been introduced to tackle specific attacks. We can classify semantic attacks into three categories: inference by applying constraints on queries, inference using metadata and inference using value constraints.

Inference from constraints on queries occurs when a malicious user can infer some prohibited information using specific constraints on a query. This attack assumes that data labeling is performed at the attribute level, which means that each attribute can have a specific security level. To illustrate this attack, we present the example given in [25]. In this example there are two relations: an unclassified one, EP, and a secret one, PT. EP contains the Employee-Name and Project-Name attributes, while PT contains the Project-Name and Project-Type attributes. Assume a non-authorized user issues the following query:

SELECT EP.Employee-Name
FROM EP, PT
WHERE EP.Project-Name = PT.Project-Name
AND PT.Project-Type = 'SDI'

Although this query only returns results about unclassified information, it does induce prohibited information leakage: information about the secret relation has been indirectly accessed. This kind of inference is quite easy to identify and to deal with; one only needs to rewrite the user query into another query using only authorized attributes. Nevertheless, this example pointed out the importance of paying particular attention to inferences in order to preserve data confidentiality.

Inference resulting from the combination of data with metadata occurs when malicious users take advantage of metadata to infer prohibited information. Early work on such attacks considered the role played by key integrity (key integrity [38] is a constraint stating that a set of attributes constitutes a key; a key value must be unique in a given relation) in disclosing sensitive information. It was generalized to consider functional dependencies [40]. To illustrate this kind of inference, we present the example given in [40]. In this example, a relation Employee is defined with three attributes: Name, Rank and Salary. The attributes Name and Rank are considered unclassified, while Salary is considered secret. It is assumed that employees having the same rank have the same salary; in other words, a functional dependency exists between Rank and Salary, where Rank determines Salary. Now, assume a user issues a query on Name and Rank. This query does not access any secret information and is thus evaluated. The malicious user can then infer the salary of each employee. This is an interesting example showing that functional dependencies can be used to infer sensitive information. Even though in this case the functional dependency can be used to infer prohibited information, this is only because there is a function that, for each rank, computes the corresponding salary. The authors of [40] provide algorithms to propagate the security level of attributes that can be inferred from other attributes; in the previous example, the idea is to associate the level secret with the attribute Rank.

Please note that, in the general case, accessing the left-hand side of a functional dependency does not give access to the value of the right-hand side. To illustrate this, consider the following example: assume we have a relation that stores the social security number (SSN), the AdmissionDate and the Disease of patients, and assume further that SSN and AdmissionDate determine the attribute Disease; in other words, there is a functional dependency whose left-hand side is {SSN, AdmissionDate} and whose right-hand side is Disease. Although this functional dependency exists, a user accessing the attributes SSN and AdmissionDate cannot infer anything about Disease. This is due to the fact that there is no function that takes SSN and AdmissionDate as input and returns the corresponding Disease.

In [17, 16], the authors proposed a methodology to control access to a data integration system. The methodology provides support for both direct and indirect accesses, and includes the following phases:

– Propagation and combination of source policies: This phase aims at transferring the source authorization rules to the mediator. This propagation preserves the policy composition principle: it ensures that every source authorization is preserved at the mediator level. This phase protects the data sources from direct accesses.

– Detection phase: This phase characterizes the role that semantic constraints between data play in inferring confidential information. In particular, the authors considered semantic constraints expressed by functional dependencies. They introduced in this phase a graph-based approach that allows identifying the flaws that could remain after the propagation phase is achieved; indeed, the propagation phase considers only direct accesses. The detection phase identifies, a priori, a set of transactions that could allow malicious users to obtain prohibited information; every transaction is a set of queries such that, if their results are combined, they might lead to the disclosure of sensitive information. The authors also proposed solutions to remedy the flaws identified in the previous phases: a first solution that can be implemented at design time, and a second one that can be implemented at runtime.

The main features of our approach, compared to others, are the following: (1) we generate the global schema and the policies simultaneously, (2) we use the lattice as a common representation framework to model the global schema, the policies and the functional dependencies, and (3) we derive the global schema by taking the global policy into account. This reduces the risk of inference, since all the attributes, or combinations of attributes, included in the schema are either subject to access control (in this case by the global policy) or not concerned by access control.

3 Preliminaries

In this section, we recall some basic material used in our approach, namely notions related to formal concept analysis, authorization views and functional dependencies.

Definition 1. (Formal Context) [14]. A formal context K = (O, A, R) is composed of a set O, whose elements are called objects, a set A, whose elements are called attributes, and a binary relation R ⊆ O × A, where R(oi, aj), with oi ∈ O and aj ∈ A, means that the object oi has the attribute aj. For a subset X ⊆ O of objects, their common set of attributes is defined by f(X) = {a ∈ A | ∀x ∈ X (xRa)}. For a subset Y ⊆ A of attributes, the set of objects which have all the attributes in Y is defined by g(Y) = {o ∈ O | ∀y ∈ Y (oRy)}. The pair of derivation operators (f, g) forms a Galois connection.

Definition 2. (Formal Concept) [14]. A formal concept is a pair (X, Y), where X ⊆ O and Y ⊆ A, that satisfies f(X) = Y and g(Y) = X. For a concept C = (X, Y), X is called the extent, denoted by Ext(C), and Y is called the intent, denoted by Int(C).

Definition 3. (Hierarchical order) [14]. Let C1 = (X1, Y1) and C2 = (X2, Y2) be concepts of the context K = (O, A, R). C1 is a superconcept of C2 (respectively, C2 is a subconcept of C1) if and only if X2 ⊆ X1.

Definition 4. (Concept lattice) [14]. By introducing a hierarchical order, the so-called order relation, denoted by ≤, we can write C2 ≤ C1. The set of all concepts β(O, A, R) ordered by the order relation ≤ is called a concept lattice and is denoted by (β(O, A, R), ≤).
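The derivation operators of Definitions 1 and 2 are easy to experiment with. The following Python sketch (a toy context of ours, not part of the paper) implements f and g and checks that a pair (X, Y) is a formal concept:

objects = {"o1", "o2", "o3"}
attributes = {"a", "b", "c"}
R = {("o1", "a"), ("o1", "b"), ("o2", "a"), ("o2", "b"), ("o3", "c")}

def f(X):
    # Common attributes of a set of objects X (Definition 1).
    return {a for a in attributes if all((x, a) in R for x in X)}

def g(Y):
    # Objects having all the attributes in Y (Definition 1).
    return {o for o in objects if all((o, y) in R for y in Y)}

X, Y = {"o1", "o2"}, {"a", "b"}
print(f(X) == Y and g(Y) == X)   # True: (X, Y) is a formal concept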


Definition 5. (Infimum and Supremum) [14]. In a concept lattice (β(O, A, R), ≤) there is always an infimum (the meet, also known as the greatest common subconcept or the greatest lower bound), the so-called bottom concept, denoted by ⊥, and a supremum (the join, also known as the least common superconcept or the least upper bound), the so-called top concept, denoted by ⊤. As defined in [14], for an arbitrary set {(X_r, Y_r) | r ∈ R} ⊆ β(O, A, R) of formal concepts, the supremum is given by

$$\bigvee_{r \in R} (X_r, Y_r) = \left( g\Big(f\Big(\bigcup_{r \in R} X_r\Big)\Big),\ \bigcap_{r \in R} Y_r \right)$$

and the infimum is given by

$$\bigwedge_{r \in R} (X_r, Y_r) = \left( \bigcap_{r \in R} X_r,\ f\Big(g\Big(\bigcup_{r \in R} Y_r\Big)\Big) \right)$$

Definition 6. (Galois sub-hierarchy) [14]. The Galois sub-hierarchy (GSH) of a concept lattice is the partially ordered set of elements X × Y such that X ∪ Y ≠ ∅, where X is the set of objects introduced by the concept and Y is the set of properties introduced by the concept. The ordering of the elements in the GSH is the same as in the lattice.

Definition 7. (Functional Dependency) [30]. A functional dependency (FD) over a schema R is a statement of the form R: X → Y, where X, Y ⊆ schema(R). We refer to X as the left-hand side (LHS) and to Y as the right-hand side (RHS) of the functional dependency X → Y.

Definition 8. (Satisfaction of a functional dependency) [23]. A functional dependency R: X → Y is satisfied in a relation r over R, denoted by r ⊨ R: X → Y, iff ∀ t1, t2 ∈ r, if t1[X] = t2[X] then t1[Y] = t2[Y].

Definition 9. (Disclosure Transaction) [18]. A disclosure transaction (DT) is a sequence of queries such that, if they are evaluated and their results are combined, they lead to the disclosure of sensitive information, thus violating an access control policy.

Definition 10. (Authorization View) [32]. A set of authorization views specifies what information a user is allowed to access. The user specifies the query in terms of the database relations, and the system tests the query for validity (by considering the authorization views) by determining whether it can be evaluated over the database.
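Definition 8 translates directly into a pairwise check. The sketch below is ours: tuples are represented as Python dicts, and the sample relation instantiates the Employee example of Section 2.3.2:

from itertools import combinations

def satisfies(r, X, Y):
    # r |= X -> Y (Definition 8): whenever two tuples agree on X,
    # they must also agree on Y.
    proj = lambda t, attrs: tuple(t[a] for a in attrs)
    return all(proj(t1, X) != proj(t2, X) or proj(t1, Y) == proj(t2, Y)
               for t1, t2 in combinations(r, 2))

r = [{"Name": "Ann", "Rank": "A", "Salary": 100},
     {"Name": "Bob", "Rank": "A", "Salary": 100},
     {"Name": "Eve", "Rank": "B", "Salary": 200}]
print(satisfies(r, ["Rank"], ["Salary"]))   # True
print(satisfies(r, ["Rank"], ["Name"]))     # False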

4 Example

We consider a University Record System (URS) scenario where multiple URSs need to share data records for cooperation purposes. Each university has full control (e.g., creation, management, etc.) over its own data records with respect to its own access control policies. To share data between the URSs, a secure mediator is needed in order to facilitate access to the shared data via a global schema, while

ensuring data confidentiality. This allows different users, including administrators, professors, students and researchers, to access multiple faculty or student data. In this paper, we assume that the mediator is built from three local sources which use a constraint-based access control model in RBAC. A user is described by a set of features defining her/his role. The enforcement of this model is done as follows: a user is allowed to access a resource if her/his profile satisfies the access constraints of that resource; otherwise, the access is denied. We use authorization views to provide fine-grained access control.

Local Source 1

Relation:
Supervisors(IDFaculty, Name, ResearchTeam, Affiliation)

Functional dependencies:
– IDFaculty, ResearchTeam → Name
– IDFaculty, Affiliation → ResearchTeam

Authorization views:
– V1_1 Authorization(IDFaculty)← Supervisors(IDFaculty,Name,ResearchTeam,Affiliation), $Role=DC∨$Role=AD∨$Role=FO∨$Role=PR
This rule states that only users holding one of the following roles: Professor (PR), AdministrativeOfficer (AD), DoctoralCoordinator (DC) and FinanceOfficer (FO), are allowed to access the attribute IDFaculty.
– V1_2 Authorization(Name)← Supervisors(IDFaculty,Name,ResearchTeam,Affiliation), $Role=AD∨$Role=DC∨$Role=FO∨$Role=PR
This authorization view states that a user assigned one of the profiles/roles PR, AD, DC or FO can access the attribute Name.
– V1_3 Authorization(ResearchTeam)← Supervisors(IDFaculty,Name,ResearchTeam,Affiliation), $Role=DC∨$Role=AD∨$Role=PR

Local Source 2

Relation:
PHDStudents(StudentID, IDFaculty, ThesisTitle, ResearchTeam)

Functional dependencies:
– StudentID → ThesisTitle
– StudentID → IDFaculty
– IDFaculty → ResearchTeam

Authorization views:
– V2_1 Authorization(StudentID)← PHDStudents(StudentID,IDFaculty,ThesisTitle,ResearchTeam), $Role=PR∨$Role=DC∨$Role=AD
– V2_2 Authorization(IDFaculty)← PHDStudents(StudentID,IDFaculty,ThesisTitle,ResearchTeam), $Role=DC∨$Role=PR∨$Role=AD∨$Role=FO
– V2_3 Authorization(ThesisTitle)← PHDStudents(StudentID,IDFaculty,ThesisTitle,ResearchTeam), $Role=PR∨$Role=DC

Local Source 3

Relation:
Faculty(IDFaculty, SSN, Salary, Insurance)

Functional dependencies:
– IDFaculty → SSN
– SSN, Insurance → Salary

Authorization views:
– V3_1 Authorization(SSN)← Faculty(SSN,Salary,Insurance,IDFaculty), $Role=FO∨$Role=AD
– V3_2 Authorization(IDFaculty)← Faculty(SSN,Salary,Insurance,IDFaculty), $Role=AD∨$Role=DC
– V3_3 Authorization(Salary)← Faculty(SSN,Salary,Insurance,IDFaculty), $Role=FO∨$Role=AD

5 Description of the approach

We propose an FCA-based approach which consists in determining all the possible disclosure transactions that could appear at the mediator level, by exploiting the semantic constraints (and hence the hidden associations between attributes), and in healing the global policy in order to prevent the completion of such disclosure transactions.

Our approach (see Figure 1) is centered around three phases: (1) the synthesis of a global schema, the global functional dependencies and a global policy from the local sources (and their policies), (2) disclosure transaction discovery, and (3) policy healing.

– Phase 1: It consists in synthesizing the global policy, the global schema and the global functional dependencies. These elements are then exploited in the next phases.
– Phase 2: This phase is devoted to the detection of all the sequences of queries, called disclosure transactions, which can be used to bypass the access control mechanism by exploiting the semantics of data dependencies.
– Phase 3: In this phase, we proceed to policy reconfiguration to avoid the completion of disclosure transactions. This can be accomplished either at design time (by adding new authorization views) or at runtime (by controlling the execution of user queries).

Fig. 1: Overview of our approach

5.1 Synthesizing the Global Schema, the Global Policy and the Global Functional Dependencies

To synthesize the global policy and the global schema from the sources, we resort to the basic approach we described in [37]. Here, we briefly recall its principle. It takes as input a set of source schemas together with their access control policies and performs the following steps: first, it translates the schemas and policies into formal contexts; second, for each attribute, it identifies the set of rules which are preserved at the level of the sources, which is done by computing the supremum (see Definition 5); finally, it builds the global schema by combining the relevant attributes. By applying these steps to our example, one derives the following global schema, composed of three virtual relations:

1. VR1(IDFaculty,Salary,SSN,Insurance,Name)← Supervisors(IDFaculty,Name,ResearchTeam,Affiliation), Faculty(SSN,Salary,Insurance,IDFaculty)
2. VR2(IDFaculty,ResearchTeam,Affiliation)← Supervisors(IDFaculty,Name,ResearchTeam,Affiliation), PHDStudents(StudentID,IDFaculty,ThesisTitle,ResearchTeam)
3. VR3(StudentID,IDFaculty,ThesisTitle)← PHDStudents(StudentID,IDFaculty,ThesisTitle,ResearchTeam), Faculty(SSN,Salary,Insurance,IDFaculty), Supervisors(IDFaculty,Name,ResearchTeam,Affiliation)


In addition, seven authorization policies are associated with the global schema:

– GV1 Authorization(SSN)← VR1(IDFaculty,Salary,SSN,Insurance,Name), $Role=AD∨$Role=FO
– GV2 Authorization(Salary)← VR1(IDFaculty,Salary,SSN,Insurance,Name), $Role=AD∨$Role=FO
– GV4 Authorization(IDFaculty)← VR1(IDFaculty,Salary,SSN,Insurance,Name), VR2(IDFaculty,ResearchTeam,Affiliation), VR3(StudentID,IDFaculty), $Role=AD∨$Role=DC
– GV5 Authorization(StudentID)← VR3(StudentID,IDFaculty), $Role=PR∨$Role=AD∨$Role=DC
– GV6 Authorization(ThesisTitle)← VR3(StudentID,IDFaculty,ThesisTitle), $Role=PR∨$Role=DC
– GV7 Authorization(ResearchTeam)← VR2(IDFaculty,ResearchTeam,Affiliation), $Role=PR∨$Role=AD∨$Role=DC
– GV8 Authorization(Name)← VR1(IDFaculty,Salary,SSN,Insurance,Name), $Role=PR∨$Role=AD∨$Role=FO∨$Role=DC

Similarly to the generation of the global policy and the global schema, making the functional dependencies at the mediator level explicit, by reasoning, is a way to anticipate and deal with the inference problem. To do this, we propose to generate a global lattice using the functional dependencies associated with each local source. By exploiting the properties of the lattice, we noticed that the lattice construction process itself leads to the identification of the overall functional dependencies; that is, the construction of the lattice (for our experiments, we used the Galicia tools to build the lattices: http://www.iro.umontreal.ca/~galicia/) highlights the global functional dependencies. The construction process relies on an algorithm (Algorithm 1) that transforms all the local functional dependencies into formal contexts according to Definition 11 and, by looping over the lattice, derives a global functional dependency for each concept of the lattice.

Definition 11. (A Functional Dependency as a Formal Context (A, B, I)). Given a functional dependency X → Y, a formal context is obtained from X → Y (where γ_FD is a transformation function) as follows: γ_FD(X → Y) yields A = X, B = Y and I = 1, i.e., every object in X is related to every attribute in Y.
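A minimal reading of the γ_FD transformation, for several FDs at once, is sketched below in Python. This is our own sketch, under the assumption that the contexts of the individual FDs are merged into one incidence relation, as in Table 1 of Section 5.2.2:

def gamma_fd(fds):
    # Definition 11: objects are LHS attributes, formal attributes are
    # RHS attributes, and (a, b) is incident whenever some FD X -> Y
    # has a in X and b in Y.
    objects, attrs, incidence = set(), set(), set()
    for lhs, rhs in fds:
        objects |= set(lhs)
        attrs |= set(rhs)
        incidence |= {(a, b) for a in lhs for b in rhs}
    return objects, attrs, incidence

# The two FDs used in Section 5.2.2.
fds = [({"SSN", "Insurance"}, {"Salary"}),
       ({"IDFaculty", "ResearchTeam"}, {"Name"})]
print(gamma_fd(fds)[2])   # the incidence pairs, cf. Table 1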



Fig. 2: Global Functional Dependencies Lattice

By considering our running example, Algorithm 1 produces the following functional dependencies, which are organized in the lattice given in Figure 2:

StudentID → ThesisTitle
StudentID → IDFaculty
SSN, Insurance → Salary
IDFaculty, Affiliation → ResearchTeam
IDFaculty, ResearchTeam → Name
IDFaculty → SSN
IDFaculty → ResearchTeam
IDFaculty → Name

Algorithm 1 Global Functional Dependencies Synthesis
Input: functional dependencies FDs of the local sources Si
Output: global functional dependencies FD_G
1: FD_G ← ∅
2: K^FD ← γ_FD(FDs)   // transform the FDs into a formal context (see Definition 11)
3: L_G^FD ← ComputeLattice(K^FD)   // compute the lattice of the FD context
4: for each Cj in L_G^FD do   // decomposition: if X → YZ then X → Y and X → Z
5:   for each y in Cj.Intent do
6:     FDk ← Cj.Extent → y   // translate the concept into an FD
7:     if FDk ∉ FD_G then add FDk to FD_G
8:   end for
9: end for
10: return FD_G
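The essence of Algorithm 1 can be sketched in Python as follows. This is our naive enumeration of concepts, adequate only for small contexts; the paper uses Galicia's incremental algorithms instead:

from itertools import chain, combinations

def concepts(objects, attrs, incidence):
    # Close every subset of objects with the derivation operators of
    # Definitions 1-2; exponential, adequate only for small contexts.
    f = lambda X: frozenset(a for a in attrs
                            if all((x, a) in incidence for x in X))
    g = lambda Y: frozenset(o for o in objects
                            if all((o, y) in incidence for y in Y))
    subsets = chain.from_iterable(combinations(objects, k)
                                  for k in range(len(objects) + 1))
    return {(g(f(frozenset(s))), f(frozenset(s))) for s in subsets}

def global_fds(objects, attrs, incidence):
    # Decompose each concept (X, Y) into the FDs X -> y, one per
    # attribute y of the intent, and deduplicate (Algorithm 1, lines 4-9).
    return {(extent, y)
            for extent, intent in concepts(objects, attrs, incidence)
            for y in intent if extent}

Applied to the context of Table 1 (Section 5.2.2), global_fds returns exactly the FDs SSN, Insurance → Salary and IDFaculty, ResearchTeam → Name.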


Please note that each source defines its policy separately and does not take into account the possible associations that can appear when its data are combined with those of other sources. In addition, new semantic constraints can arise at the mediator level and can be used to infer sensitive information from non-sensitive information.

5.2 Discovery Phase

This phase includes two main steps: first, by resorting to FCA as a tool for characterizing the global policy, we identify the profiles which should be prohibited from accessing some sensitive data at the mediator level; second, we identify the threatening transactions by considering the impact of the semantic constraints (e.g., functional dependencies) at the mediator level. First, we introduce two relevant definitions.

Definition 12. (Authorization View Policy as a Formal Context) [37]. Let P be a set of authorization views which govern access to the source (relation/schema) S, and let σ_VBAC be a transformation function. A formal context K(O, A, R) is obtained from P as follows: the formal attributes are the attributes ai of S together with the access constraints cj occurring in P (A = {ai} ∪ {cj}); each view Vk ∈ P yields a formal object Rulek; and R(Rulek, x) = 1 if x (an attribute ai or a constraint cj) occurs in Vk, and R(Rulek, x) = 0 otherwise.

Definition 13. (Prohibition Rules). Given an access control constraint context K_G^P(Rulei, Ci, I) obtained from the global policy using the transformation function σ_VBAC, the complementary relation, Ī, corresponds to the prohibition rules: Rulei Ī Ci holds iff K^P(Rulei, Ci) = 0.

5.2.1 Computing Prohibition Rules

Prohibition rules are rules which can be used to deny access to some data at the mediator level. In order to detect such rules, we extract all the rules based on the characterization of the authorization views with formal concept analysis. Basically, the authorization views are transformed into formal contexts, and then we identify the prohibition rules (see Definition 13). In order to represent the global policy as a formal context, we use the σ_VBAC function, which generates the corresponding global policy context. After the extraction of the formal contexts, we split them into two formal contexts: one for the access constraints and one for the attributes. We assume that a relationship (Rule, A∪C) ∈ I means that each profile (role) r ∈ C can access the attributes in A. The context (Rule, A∪C, I) grants access to attributes by roles, and we can read from the context which attributes are granted to a given role. Based on Definition 13, we obtain, from the above context, four prohibition rules which are used in the disclosure discovery process:

Rule 1: PR, DC → SSN, Salary, Insurance
Rule 2: PR, DC → SSN, Name, Salary
Rule 3: PR, DC → Salary, Name
Rule 4: PR, DC → IDFaculty, ResearchTeam, Salary
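In the simplest, attribute-level reading, prohibition rules are the complement of the granted accesses. The Python sketch below is ours and is a simplification: the paper derives denials over attribute combinations, per authorization view, whereas this sketch works attribute by attribute. The role and attribute sets are taken from the running example:

def prohibitions(granted, all_attrs):
    # Definition 13 in miniature: the complement of the access relation.
    # 'granted' maps each role to the attributes it may access; the
    # result maps each role to the attributes it is denied.
    return {role: all_attrs - attrs for role, attrs in granted.items()}

all_attrs = {"SSN", "Salary", "Insurance", "IDFaculty", "ResearchTeam", "Name"}
granted = {"PR": {"Name", "ResearchTeam"},
           "DC": {"Name", "ResearchTeam", "IDFaculty"},
           "AD": all_attrs}
print(prohibitions(granted, all_attrs)["PR"])
# {'SSN', 'Salary', 'Insurance', 'IDFaculty'} (order may vary);
# cf. Rules 1-4, which deny combinations of these attributes to PR and DC.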


5.2.2 Identification of Disclosure Transactions

By making use of the semantics of functional dependencies, we present a naive method for automatically retrieving the sets of queries having the following property: when all the queries of a given set are executed, they lead to the disclosure of some sensitive information. The approach proceeds as follows: first, we extract the functional dependencies; then, we refine this process by taking into account the presence of the prohibition rules; finally, we show that it is sufficient to compute only the disclosure transactions, by generating the lattice of the functional dependencies involved in each prohibition rule. For example, the following functional dependencies (1) and (2) can be used to overcome the prohibition rule PR, DC → Salary, Name in order to access Salary and Name:

SSN, Insurance → Salary (1)
IDFaculty, ResearchTeam → Name (2)

To detect the sequences of violating queries, as a first step, the formal context (see Table 1) is generated from the above functional dependencies using Definition 11.

X (LHS) \ Y (RHS)   Name   Salary
IDFaculty             1      0
ResearchTeam          1      0
SSN                   0      1
Insurance             0      1

Table 1: K^FD(Xi, Yi, R): formal context denoting the functional dependencies

Then, using this functional dependency context, we generate a lattice of formal concepts (see Figure 3), which is used by Algorithm 2 to detect the sequences of violating queries. The lattice shows how we can exhaustively compute the disclosure transactions which can be exploited to infer Salary and Name by issuing the sequences of queries Q1, Q2 and Q3. The example of Figure 3 shows the lattice with four formal concepts: the top concept C1, the FD concepts C2 and C3, and the bottom concept C4. Our algorithm exploits this lattice: for each parent concept Ci of a concept having the attributes of the prohibition rule as intent (e.g., the bottom concept C4), it derives a disclosure transaction DT containing a sequence of queries. Considering our example, Algorithm 2 generates the disclosure transaction set:

Set_DT = {DT1(Q_FD1, Q1), DT2(Q_FD2, Q2), DT3(Q_FD1, Q_FD2, Q3)}

So, if one issues all the queries of any transaction DTi, then the prohibition rule Rule 3 (Salary, Name) is violated. Hence, to deal with this issue and to avoid disclosure transaction completion, the next step consists in reconfiguring the global policy with additional rules in such a way that no transaction can be completed.


Fig. 3: Disclosure Transaction Lattice of Salary, Name

5.3 Policy Healing

Policy healing can be applied at two levels. First, at design time, with policy completion: a new set of authorization rules is added to make sure that no transaction can be completed. Second, policy healing can be applied at runtime, by means of a monitoring process which requires storing the previous queries.

5.3.1 Global Policy Completion

Policy completion is the step an administrator can use to repair the global policy by adding new rules which prevent the completion of transactions. The main issue is then how to identify the minimal set of queries Qi from which to build the new authorization rules. The new authorization rules should display the following property: for any DTi, at least one Qi is denied. The minimal set of queries used to generate the new authorization rules must also ensure that there are no redundant access control rules, and it should cover all the disclosure transactions. A greedy approximation of this selection is sketched below; our exact procedure, based on the Galois sub-hierarchy of the disclosure transactions, is Algorithm 3.
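Selecting the queries to revoke is a covering (hitting-set) problem. The following Python sketch is ours and is only a greedy approximation; unlike Algorithm 3, it does not exploit the lattice structure and does not guarantee the minimality of each revoked query:

def greedy_revocation(transactions):
    # Repeatedly revoke the query occurring in the largest number of
    # still-uncovered disclosure transactions, until every transaction
    # contains at least one revoked query.
    remaining = [set(dt) for dt in transactions]
    revoked = set()
    while remaining:
        queries = set().union(*remaining)
        best = max(queries, key=lambda q: sum(q in dt for dt in remaining))
        revoked.add(best)
        remaining = [dt for dt in remaining if best not in dt]
    return revoked

# On the transaction set of Section 5.2.2, two revocations suffice
# (ties are broken arbitrarily, e.g. {'Q_FD1', 'Q_FD2'}):
print(greedy_revocation([{"Q_FD1", "Q1"},
                         {"Q_FD2", "Q2"},
                         {"Q_FD1", "Q_FD2", "Q3"}]))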

Algorithm 3 starts by initializing the set of revoked queries; it then runs through the Galois sub-hierarchy lattice of the disclosure transactions (see Figure 4). For each pair of concepts (Ci, Cj) (lines 2-13), it checks whether the hierarchical order Ci ≤ Cj holds between the two concepts (line 3). If the subconcept Ci is not marked, then it is the concept to be used in the revocation process; otherwise, the superconcept Cj must be used. Since the intent of a concept may contain more than one query, the algorithm must ensure that the revoked query is a minimal one (lines 6-10).


Algorithm 2 Disclosure Transaction Discovery
Input: functional dependencies FD_Set; P: prohibition rule policy
Output: disclosure transaction set Set_DT
1: Set_DT ← ∅
2: for each rule Ri in P do
3:   FD_Ri ← ExtractFDs(FD_Set, Ri)   // extract the FDs associated with the attributes of the rule
4:   K_Ri^FD ← γ_FD(FD_Ri)   // transform the FDs into a formal context (see Definition 11)
5:   L_Ri^FD ← ComputeLattice(K_Ri^FD)   // compute the lattice of the FD context of the rule
6:   for each Cj in L_Ri^FD do
7:     if Ri.Att ⊆ Cj.Intent then
8:       for each parent Pk of Cj in L_Ri^FD do
9:         Q_FDs ← Q(Pk.Intent ∪ Pk.Extent)   // translate the FD into a query
10:        Qk ← Q((Cj.Intent ∪ Pk.Extent) \ Pk.Intent)   // build a query from the attributes in the intent of the concept and in the extent of its parent, minus the attributes in the intent of its parent
11:        DTi ← {Q_FDs, Qk}   // build a transaction with the FD query Q_FDs and the query Qk
12:        DTn ← DTn ∪ {Q_FDs}   // save each FD query Q_FDs to be used in the last transaction
13:        if DTi ∉ Set_DT then add DTi to Set_DT
14:      end for
15:      Qk+1 ← Q(L_Ri^FD.TopConcept.Extent)   // build a final query using the extent of the top concept (the supremum of the lattice)
16:      DTn ← DTn ∪ {Qk+1}   // the final transaction is composed of the FD queries Q_FDs and the last query Qk+1, built on the top concept
17:      if DTn ∉ Set_DT then add DTn to Set_DT end if
18:    end if
19:  end for
20: end for
21: return Set_DT
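For the case where each prohibited attribute is the right-hand side of a distinct FD, the output of Algorithm 2 can be reproduced without building the lattice. The following Python sketch is ours, a lattice-free simplification of Algorithm 2; on Rule 3 of the running example it yields exactly Set_DT:

def disclosure_transactions(prohibited, fds):
    # For every FD X -> y with y prohibited: pair the 'FD query'
    # (X plus y) with a query returning the remaining prohibited
    # attributes together with X (joinable on X); finally, combine all
    # FD queries with a query over the union of their left-hand sides
    # (the 'top concept' transaction).
    dts, fd_queries, lhs_union = [], [], set()
    for lhs, y in fds:
        if y not in prohibited:
            continue
        q_fd = lhs | {y}
        q = (prohibited | lhs) - {y}
        dts.append([q_fd, q])
        fd_queries.append(q_fd)
        lhs_union |= lhs
    dts.append(fd_queries + [lhs_union])
    return dts

rule3 = {"Salary", "Name"}
fds = [({"SSN", "Insurance"}, "Salary"),
       ({"IDFaculty", "ResearchTeam"}, "Name")]
for dt in disclosure_transactions(rule3, fds):
    print(dt)   # DT1(Q_FD1, Q1), DT2(Q_FD2, Q2), DT3(Q_FD1, Q_FD2, Q3)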

Coming back to Algorithm 3: a query Qr is minimal if all its attributes are relevant, that is, for every Q ⊂ Qr, Q cannot be used instead of Qr in the query revocation process. The last step (lines 14-17) of the algorithm is the generation of the additional authorization views based on each revoked query obtained in the previous steps. Continuing with our example, the following new authorization views over the minimal queries are delivered by our methodology to avoid the completion of the disclosure transactions.

Minimal queries:
– Q24(ResearchTeam,IDFaculty,Name)
– Q4(IDFaculty,Affiliation,ResearchTeam)
– Q2(StudentID,IDFaculty)
– Q7(SSN,Insurance,Salary)
– Q19(IDFaculty,Name,ResearchTeam)
– Q9(IDFaculty,Affiliation,Salary)
– Q23(ResearchTeam,IDFaculty,Salary)
– Q34(StudentID,ResearchTeam,IDFaculty)
– Q42(ResearchTeam,IDFaculty,ThesisTitle)
– Q29(ResearchTeam,IDFaculty,SSN)
– Q15(IDFaculty,SSN,ResearchTeam)
– Q33(IDFaculty,SSN,Name)
– Q37(StudentID,ThesisTitle)


Fig. 4: Galois Sub-Hierarchy Lattice of Disclosure Transactions

Algorithm 3 Minimal Queries Revocation
Input: Galois sub-hierarchy lattice GSH_DT of the disclosure transactions; P: prohibition rules; G: global schema; P_G: global policy
Output: MQR, the minimal set of queries that must be revoked
1: MQR ← ∅
2: for each pair of concepts (Ci, Cj) in GSH_DT, i ≠ j, do
3:   if Ci ≤ Cj then
4:     if Ci not marked then Cr ← Cj; mark Cj (*)
5:     else Cr ← Ci; mark Cj end if
6:     if |Cr.Intent| = 1 then Qr ← Q1 else
7:       Qr ← Q1, Q1 ∈ Cr.Intent
8:       for each query Qk+1 in Cr.Intent do
9:         if Qk+1 ⊂ Qr then Qr ← Qk+1 end if
10:      end for
11:    end if
12:    if Qr ∉ MQR then add Qr to MQR
13: end for
// generation of the new authorization rules
14: for each Qr in MQR do
15:   VR ← ⋃ VRi such that Qr ⊆ VRi   // extract the constraints C used in the new rule and the virtual relations which cover the query
16:   create the authorization view GV(Qr) ← VR, C
17:   if GV(Qr) ∉ P_G then add GV(Qr) to P_G end if
18: end for
19: return MQR
(*) Each concept of the lattice carries a label indicating whether it has been marked by itself or by one of its super-concepts.

Authorization views:

– NGV1 Authorization(IDFaculty,Affiliation,ResearchTeam)← VR2(IDFaculty,ResearchTeam,Affiliation), $Role=AD
– NGV2 Authorization(SSN,Insurance,Salary)← VR1(IDFaculty,Salary,SSN,Insurance,Name), $Role=AD
– NGV3 Authorization(IDFaculty,Affiliation,Salary)← VR1(IDFaculty,Salary,SSN,Insurance,Name), VR2(IDFaculty,ResearchTeam,Affiliation), $Role=AD
– NGV4 Authorization(IDFaculty,SSN,ResearchTeam)← VR1(IDFaculty,Salary,SSN,Insurance,Name), VR2(IDFaculty,ResearchTeam,Affiliation), $Role=AD
– NGV5 Authorization(ResearchTeam,IDFaculty,Salary)← VR1(IDFaculty,Salary,SSN,Insurance,Name), VR2(IDFaculty,ResearchTeam,Affiliation), $Role=AD
– NGV6 Authorization(ResearchTeam,IDFaculty,SSN)← VR1(IDFaculty,Salary,SSN,Insurance,Name), VR2(IDFaculty,ResearchTeam,Affiliation), $Role=AD
– NGV7 Authorization(StudentID,ThesisTitle)← VR3(StudentID,IDFaculty,ThesisTitle), $Role=PR∨$Role=DC
– NGV8 Authorization(ResearchTeam,IDFaculty,ThesisTitle)← VR2(IDFaculty,ResearchTeam,Affiliation), VR3(StudentID,IDFaculty,ThesisTitle), $Role=DC
– NGV9 Authorization(IDFaculty,SSN,Name)← VR1(IDFaculty,Salary,SSN,Insurance,Name), $Role=AD
– NGV10 Authorization(StudentID,IDFaculty)← VR3(StudentID,IDFaculty,ThesisTitle), $Role=AD

5.3.2 Stop that Query

In order to ensure maximal availability of data at the mediator level while ensuring the non-disclosure of sensitive information, we propose a runtime approach which consists in monitoring the execution of queries and revoking those queries that could lead to the violation of policies. The main problem with runtime query revocation is that it requires storing the past queries and computing the correlation between the current query and the past queries. By resorting to FCA, one can enhance the monitoring process. To efficiently record the past queries, we use a labeling lattice (see Figure 5) that keeps the history of user queries. For each query submitted by a user, we update the lattice while ensuring that no disclosure transaction is completed. The main idea is to assign a label u, containing the user, to Qi in the lattice if the user is allowed to execute the query, while verifying that the underlying transactions cannot be completed. Hence, if the label assignment permits the execution of the user query, then it must also ensure that no transaction can be completed.

Fig. 5: Query Labeling Lattice (QLL)

Definition 14. (Query Labeling Lattice). Let Q be the set of queries {Q1, ..., Qn} of the transaction set DTi, and let A be the set of attributes {a1, ..., an} of the global schema. Let QLL (a labeling lattice) be the lattice (L_Q, ≤) on the formal context K(Q, A, I), where Q is the set of formal objects (queries), A is the set of formal attributes (global schema attributes) and I indicates that the query Qi ∈ Q has the attribute ai ∈ A. Each concept Ci ∈ L_Q, including the top concept (⊤), is labeled with u{φ}, and the bottom concept (⊥) is labeled with u{R}, where R is the set of roles which are not allowed to complete the transaction DTi.

Definition 15. (Policy Decision). Let QLL = (L_Q, ≤). A role r ∈ R is allowed to execute a query Qi if the label u{r} can be assigned to Qi, that is, if the label u assigned to the top concept (⊤) of the QLL does not contain the role r and the accumulation of the past queries executed under u{r}, together with Qi, remains strictly smaller than the transactions; otherwise, the query is rejected.

Algorithm 4 Stop That Query
Input: disclosure transactions DT; user query Q; r: the role assigned to the user; query labeling lattice QLL
Output: Q is revoked or allowed
1: initialize State ← "revoked"
2: if u{r} is assigned to the top concept (⊤) of the QLL then return "revoked" end if   // the user is locked
3: if the queries already labeled with u{r}, together with Q, remain smaller than every transaction in DT then
4:   State ← "allowed"
5:   assign or update the label u{r} on Q
6:   assign or update the label u{r} on the superconcepts of Q whose subconcepts are all labeled
7: else
8:   State ← "revoked"
9: end if
10: return State

Algorithm 4 processes incoming queries Qi one at a time. It takes as input the query labeling lattice QLL (Figure 5), the disclosure transactions DTi and the role r of the user who issued the query. It first checks whether the label of the role r coincides with the top concept (⊤). If so, the query is revoked; otherwise, it verifies whether the accumulation of the queries tagged with the label u{r}, including the query Qi, is still smaller than the transaction. If this condition holds, then the role r is permitted to execute the query Qi; otherwise, the query is revoked (e.g., node (c) in Figure 5). To keep the history of the executed queries, the QLL is updated: the label u{r} is assigned to Qi, and the tagging is propagated to the superconcepts of Qi (e.g., nodes (a) and (b) in Figure 5) whenever all their subconcepts are also tagged with the label u{r}. If the query is refused, then the label u{r} is assigned to the top concept (⊤) in order to avoid unnecessary lattice exploration.
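The runtime defence can be approximated with a per-role history, without maintaining the full labeling lattice. The Python sketch below is ours: instead of labels on the QLL, it simply flattens each transaction to the set of attributes it exposes when completed:

class QueryMonitor:
    def __init__(self, transactions):
        # Flatten each disclosure transaction to the set of attributes
        # it exposes when completed.
        self.transactions = [set().union(*dt) for dt in transactions]
        self.history = {}   # role -> attributes obtained so far

    def submit(self, role, query_attrs):
        seen = self.history.get(role, set()) | set(query_attrs)
        if any(dt <= seen for dt in self.transactions):
            return "revoked"          # the query would complete a DT
        self.history[role] = seen
        return "allowed"

m = QueryMonitor([[{"SSN", "Insurance", "Salary"},
                   {"Name", "SSN", "Insurance"}]])
print(m.submit("PR", {"SSN", "Insurance", "Salary"}))   # allowed
print(m.submit("PR", {"Name"}))                         # revoked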

6 Complexity Study

In this section, we analyze the theoretical complexity of the proposed algorithms. Our approach strongly depends on the complexity of the algorithms [44] used by the Galicia API to build the lattices. Table 2 shows the basic complexities of the algorithms used in the estimation of the worst-case complexity of our proposed algorithms.

Operation              Complexity                  Description
Lattice navigation     O(ml)                       The lattice traversal takes O(|L|), where |L| is the number of concepts of the lattice L and is bounded by ml.
Lattice construction   O(k(ml + Δ(l)·k·d(L)))      k: maximum number of attributes owned by an object; m: number of objects; l = |C|: size of a concept in the lattice L; Δ(l): the actual number of new concepts; d(L): the maximal number of upper (or lower) covers of a node in the lattice L.

Table 2: Complexity of the incremental lattice construction algorithms in Galicia [44]

Hence, the complexity of Algorithm 1 is evaluated as O((|A|² · l + Δ(l)) · |A| · d(L^FD)) + |L^FD|, where A is the set of attributes involved in the functional dependencies and L^FD is the lattice associated with the FDs. The complexity of Algorithm 2 is evaluated as O(|P| · (|R|² · l + Δ(l)) · |R| · d(L_R^FD)) + |L_R^FD|, where P is the set of global policy rules, R is the set of attributes used in the query part of each global authorization view and L_R^FD is the lattice of the FDs associated with the authorization view attributes R. The algorithm runs through the lattice of concepts while taking into account the list of parents of each concept, which is bounded by |L| − 1 concepts.

The complexity of Algorithm 3 is closely related to the complexity of the algorithm for building the Galois sub-hierarchy lattice, namely Pluton. Pluton is composed of three algorithms: TomThumb, ToLinext and ToGSH [3]. TomThumb produces an ordered list of the simplified extents and intents; ToLinext then searches the ordered list to merge pairs consisting of a simplified extent and a simplified intent pertaining to the same concept, in order to reconstruct the elements of the Galois sub-hierarchy; finally, ToGSH computes the edges of the Hasse diagram (transitive reduction) of the Galois sub-hierarchy. The theoretical complexity of TomThumb is O(|DT| · |Q|), that of ToLinext is O((|DT| + |Q|)³), and that of ToGSH is O((|DT| + |Q|)² · max(|DT|, |Q|)²). So, the complexity of Algorithm 3 is evaluated as O(|DT| · |Q| + (|DT| + |Q|)³ + (|DT| + |Q|)² · max(|DT|, |Q|)²), where DT is the set of identified disclosure transactions and Q the set of identified queries. The number of disclosure transactions increases with the number of functional dependencies and policy rules. This complexity is polynomial in the number of transactions and the number of queries (or attributes).

7 Experiments

We provide an experimental evaluation of our proposed methodology and algorithms. In order to study the performance and the scalability of our approach, and the impact of the access control rules, the global relation and the data dependencies, we conducted an evaluation with different configurations. We designed a synthetic scenario using three generators; the scenario includes data sources, authorization views and functional dependencies. The synthetic scenario (including the generators) was implemented in Java, and we used the Galicia API. Experiments were carried out on a Linux PC with Ubuntu TLS 14.04, an Intel Core 2 Duo CPU at 2.00 GHz and 4 GB of RAM.

(a) Data sources generator. Given a number N of relations, an average number A of sub-elements of each relation and an average number J of join attributes, the source generator randomly produces a set R of relations R(A) defined in such a way that J corresponds to the number of joins that can be built from the attributes shared by the generated relations. Each join attribute is randomly chosen from the set of attributes of each relation. In the experiments, we considered N ranging from 5 to 20, A ranging from 10 to 20, and J ranging from 5 to 10.

(b) Authorization view generator. Given a source schema R, this generator takes three numbers: P (the number of authorization views), Q (the number of attributes used in the project clause of the queries) and C (the number of profiles). In our experiments we considered P = 80 rules, Q = 3 and C = 10.

(c) Functional dependencies generator. Given a relation schema R and a natural number m, the functional dependencies generator randomly produces a set Σ_FD of functional dependencies on R, such that the average number of functional dependencies for each relation in R is m. The generator also takes two other parameters as input, namely LHS and RHS: LHS is the maximum number of attributes in each left-hand side of the generated functional dependencies, and RHS is the maximum number of attributes in each right-hand side. The experiments were conducted on various Σ_FD ranging from 10 to 80 functional dependencies per source, with the LHS containing from 3 to 10 attributes and the RHS containing from 2 to 5 attributes. A sketch of such a generator is given below.
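The following Python sketch gives a possible reading of the functional dependencies generator; the function and parameter names are ours, for illustration:

import random

def generate_fds(attributes, m, max_lhs, max_rhs):
    # Produce m random FDs over the attributes of one relation, with
    # bounded LHS and RHS sizes, keeping LHS and RHS disjoint.
    fds = []
    for _ in range(m):
        lhs = random.sample(attributes, random.randint(1, max_lhs))
        rest = [a for a in attributes if a not in lhs]
        rhs = random.sample(rest, random.randint(1, min(max_rhs, len(rest))))
        fds.append((set(lhs), set(rhs)))
    return fds

attrs = ["a%d" % i for i in range(15)]
print(generate_fds(attrs, m=10, max_lhs=3, max_rhs=2)[:2])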

Inference control in data integration systems

25

or joins ranging from 5 to 10, the number of functional dependencies (of the form X→Y) ranging from 10 to 80 for each local source with X and Y involving 5 to 10 attributes, and 80 access control rules with 10 roles and 3 attributes. For each set of generated tests, we measured the size and the time required to build the lattices by our algorithms for each step of our methodology including: ˆ |GF DT ime| is the necessary time to infer the global functional dependencies and |GF DLattice| is the size of the obtained lattice (number of concepts). ˆ |DT DT ime| is the time required to detect all the suspicious transactions and |DT Size| is the number of such transactions. ˆ |QueriesSize| is the number of queries and |M QRT ime| is the time required to extract a minimal query set. ˆ |N ewP | : the number of added access control rules to the policy to prevent completion of prohibited transactions. ˆ |LQQT ime| and |LQQ| correspond to the time required to generate the Query Labeling Lattice and its size respectively.
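The following Java sketch illustrates the functional dependencies generator of item (c). It is a minimal sketch under stated assumptions: the class and method names (FDGenerator, FunctionalDependency, generate, sample) and the uniform sampling strategy are ours for illustration, not the actual implementation or part of the Galicia API.

import java.util.*;

// Hypothetical sketch of the FD generator described in (c); names and
// sampling strategy are illustrative assumptions, not the actual code.
public class FDGenerator {

    public record FunctionalDependency(Set<String> lhs, Set<String> rhs) {
        @Override public String toString() { return lhs + " -> " + rhs; }
    }

    private final Random rnd = new Random();

    /** Randomly draws up to k distinct attributes from the relation's attributes. */
    private Set<String> sample(List<String> attrs, int k) {
        List<String> copy = new ArrayList<>(attrs);
        Collections.shuffle(copy, rnd);
        return new HashSet<>(copy.subList(0, Math.min(k, copy.size())));
    }

    /**
     * Generates on average m FDs per relation, with at most maxLhs attributes
     * in the left-hand side and at most maxRhs in the right-hand side.
     * The schema maps each relation name to its list of attributes.
     */
    public List<FunctionalDependency> generate(Map<String, List<String>> schema,
                                               int m, int maxLhs, int maxRhs) {
        List<FunctionalDependency> sigma = new ArrayList<>();
        for (List<String> attrs : schema.values()) {
            for (int i = 0; i < m; i++) {
                Set<String> lhs = sample(attrs, 1 + rnd.nextInt(maxLhs));
                Set<String> rhs = sample(attrs, 1 + rnd.nextInt(maxRhs));
                rhs.removeAll(lhs);              // avoid trivial dependencies X -> X
                if (!rhs.isEmpty()) sigma.add(new FunctionalDependency(lhs, rhs));
            }
        }
        return sigma;
    }
}

Because the right-hand side may become empty after removing left-hand-side attributes, such a generator may retain slightly fewer than m dependencies per relation; the averages reported above would then be over the retained ones.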

Fig. 6: Computational time for different steps of our approach

Time required to compute the global functional dependencies and the suspicious transactions: Figure 6 shows the time required to compute the disclosure transactions. It significantly exceeds both the time required to build the lattice and the time to compute the minimal query revocation. We can also observe that the time (DTD) to compute the suspicious transactions with 80 functional dependencies is higher than the time required to compute the disclosure transactions with 20 functional dependencies. The results show that the global time to compute the lattice of functional dependencies is between 32 and 81 seconds for data sets containing 10 to 80 functional dependencies per source. We also noticed that the more functional dependencies there are, the more time is required to detect the suspicious transactions (and, respectively, the minimal set of queries).


Fig. 7: Impact of FDs on the global policy

Impact of functional dependencies on the global policy: Figure 7 shows the relationship between the number of disclosure transactions and the number of additional rules. At each run, we pick, as a decision metric, the new rule that appears most often in the DTD. The number of queries also increases at the same rate as the number of functional dependencies. We also noticed that the number of access control rules to be added increases with the number of inferred dependencies as well as with the number of identified transactions. This means that functional dependencies have a significant impact on bypassing access control mechanisms in order to infer sensitive information from non-sensitive information.

Query Labeling Lattice: To analyze the impact of large databases with a large number of functional dependencies on the Query Labeling Lattice (QLL), we generated a dataset composed of 20 sources with 10 attributes, 80 rules and 160 functional dependencies per source.

Figure 8 shows the time required to construct such a QLL and its size. We note that the more functional dependencies we infer, the more queries are identified. The results show that the size of the QLL and the time required to compute it increase with the number of newly discovered queries. Dealing with a large lattice at run-time is an open issue that we will investigate in the future.

Figure 9 shows the impact of functional dependencies on the time required to lock a query. We note that the time required to traverse a QLL and to evaluate the policy decision for a given query increases with the number of illicit transactions as well as with the number of identified queries. The results clearly indicate that the number of queries and transactions has a significant impact on the time required to lock a query at run time.


Fig. 8: Impact of identified queries on the size of the QLL and its computation time

Fig. 9: Time required to lock a query of a transaction

So, the main problem of query tracking is the large number of queries that needs to be managed. Further tests will be necessary to fully understand the behavior of the Stop That Query method. However, this line of work seems promising, especially if combined with efficient and fast algorithms for lattice construction and navigation [27]. The flexibility of Stop That Query can be improved by adapting incremental algorithms for lattice construction and lattice reduction. In our approach, we mainly focused on analyzing the performance and the scalability of our proposal. The use of the lattice as a common representation framework to model the global schema, the policies and the functional dependencies would require comparing each of our steps to each related approach separately. Because of this, and because no implementation or experimental data sets (policies, functional dependencies and local schemas) are available for the related work, it is difficult to compare our approach with it.
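To illustrate how a query can be locked at run time, the following sketch checks whether an incoming query would complete a disclosure transaction for a given profile. It is a minimal illustration under strong assumptions: disclosure transactions are represented as flat sets of query identifiers rather than as a lattice, and all names (QueryTracker, DisclosureTransaction, mustLock, record) are hypothetical, not the actual Stop That Query implementation.

import java.util.*;

// Illustrative run-time check: block a query if it would complete a
// disclosure transaction. All names and data structures are assumptions.
public class QueryTracker {

    /** A disclosure transaction: a set of query identifiers that, once all
     *  answered for the same profile, would disclose sensitive data. */
    public record DisclosureTransaction(Set<String> queryIds) {}

    private final List<DisclosureTransaction> transactions;
    private final Map<String, Set<String>> history = new HashMap<>(); // profile -> answered queries

    public QueryTracker(List<DisclosureTransaction> transactions) {
        this.transactions = transactions;
    }

    /** Returns true if answering queryId for this profile would complete
     *  some disclosure transaction; in that case the query must be locked. */
    public boolean mustLock(String profile, String queryId) {
        Set<String> answered = history.getOrDefault(profile, Set.of());
        for (DisclosureTransaction dt : transactions) {
            if (!dt.queryIds().contains(queryId)) continue;
            Set<String> remaining = new HashSet<>(dt.queryIds());
            remaining.removeAll(answered);
            // queryId is the only missing piece of this transaction: block it.
            if (remaining.size() == 1 && remaining.contains(queryId)) return true;
        }
        return false;
    }

    /** Records that the query was answered for the given profile. */
    public void record(String profile, String queryId) {
        history.computeIfAbsent(profile, p -> new HashSet<>()).add(queryId);
    }
}

The linear scan over transactions in this sketch is consistent with the trend observed in Figure 9, where locking time grows with the number of transactions and queries; navigating the QLL instead of a flat list is precisely where faster lattice algorithms would help.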


8 Conclusion

In this work we have investigated the problem of illicit inferences that result from combining semantic constraints with authorized information in a data integration context. We proposed an approach which exploits FCA to capture security policies and data dependencies. By resorting to FCA, we were able to build both a global schema and the underlying global policy. The global policy is refined in such a way that violating queries are blocked in advance (at design time or at run-time). There are many research directions to pursue: when a new source joins the system or an existing source leaves it, it is necessary to revise the global schema and the global policy; in this case, an incremental approach should be designed. Another interesting issue concerns data dependencies: indeed, there are other semantic constraints that could play an important role in inference problems, such as inclusion dependencies, multivalued dependencies and partial FDs. We are investigating these issues. In future work, we intend to integrate our approach as a plug-in for Talend Data Integration Studio9.

9 http://fr.talend.com/index.php

References

1. Alessandro Acquisti and Jens Grossklags. Privacy and rationality in individual decision making. IEEE Security and Privacy, 3(1):26–33, January 2005.
2. Nabil R. Adam and John C. Worthmann. Security-control methods for statistical databases: A comparative study. ACM Comput. Surv., 21(4):515–556, December 1989.
3. Gabriela Arevalo, Anne Berry, Marianne Huchard, Guillaume Perrot, and Alain Sigayret. Performances of Galois sub-hierarchy-building algorithms. In Sergei O. Kuznetsov and Stefan Schmidt, editors, Formal Concept Analysis, volume 4390 of Lecture Notes in Computer Science, pages 166–180. Springer Berlin Heidelberg, 2007.
4. Leland L. Beck. A security mechanism for statistical databases. ACM Trans. Database Syst., 5(3):316–338, September 1980.
5. Alfredo Cuzzocrea, Mohand-Saïd Hacid, and Nicola Grillo. Effectively and efficiently selecting access control rules on materialized views over relational databases. In International Database Engineering and Applications Symposium (IDEAS), pages 225–235, 2010.
6. Frithjof Dau and Martin Knechtel. Access policy design supported by FCA methods. In Conceptual Structures: Leveraging Semantic Technologies, pages 141–154. Springer, 2009.
7. D. E. Denning and J. Schlörer. Inference controls for statistical databases. Computer, 16(7):69–82, July 1983.
8. Dorothy E. Denning. Secure statistical databases with random sample queries. ACM Trans. Database Syst., 5(3):291–315, 1980.
9. Sabrina De Capitani di Vimercati, Sara Foresti, Sushil Jajodia, Stefano Paraboschi, and Pierangela Samarati. Assessing query privileges via safe and efficient permission composition. In ACM Conference on Computer and Communications Security, pages 311–322, 2008.
10. Csilla Farkas and Sushil Jajodia. The inference problem: A survey. SIGKDD Explor. Newsl., 4(2):6–11, December 2002.
11. Csilla Farkas and Sushil Jajodia. The inference problem: A survey. SIGKDD Explor. Newsl., 4(2):6–11, December 2002.
12. I. P. Fellegi. On the question of statistical confidentiality. Journal of the American Statistical Association, 67(337):7–18, 1972.
13. Theodore D. Friedman and Lance J. Hoffman. In IEEE Symposium on Security and Privacy.


14. Bernhard Ganter and Rudolf Wille. Formal Concept Analysis: Mathematical Foundations. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1st edition, 1997.
15. Joseph A. Goguen and José Meseguer. Unwinding and inference control. In Proceedings of the 1984 IEEE Symposium on Security and Privacy, pages 75–86. IEEE Computer Society, 1984.
16. Mehdi Haddad, Mohand-Said Hacid, and Robert Laurini. Data integration in presence of authorization policies. In 11th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2012, Liverpool, United Kingdom, June 25-27, 2012, pages 92–99, 2012.
17. Mehdi Haddad, Jovan Stevovic, Annamaria Chiasera, Yannis Velegrakis, and Mohand-Said Hacid. Access control for data integration in presence of data dependencies. In Database Systems for Advanced Applications - 19th International Conference, DASFAA 2014, Bali, Indonesia, April 21-24, 2014. Proceedings, Part II, pages 203–217, 2014.
18. Mehdi Haddad, Jovan Stevovic, Annamaria Chiasera, Yannis Velegrakis, and Mohand-Said Hacid. Access control for data integration in presence of data dependencies. In Sourav S. Bhowmick, Curtis E. Dyreson, Christian S. Jensen, Mong-Li Lee, Agus Muliantara, and Bernhard Thalheim, editors, Database Systems for Advanced Applications - 19th International Conference, DASFAA 2014, Bali, Indonesia, April 21-24, 2014. Proceedings, Part II, volume 8422 of Lecture Notes in Computer Science, pages 203–217. Springer, 2014.
19. Lance J. Hoffman and William F. Miller. Getting a personal dossier from a statistical data bank. In Lance J. Hoffman, editor, Security and Privacy in Computer Systems, pages 289–293. Melville Publishing Company, Los Angeles, 1973. Reprinted from Datamation, 1970.
20. I. P. Fellegi and J. Phillips. Statistical confidentiality: Some theory and application to data dissemination. In Annals of Economic and Social Measurement, Volume 3, Number 2, pages 101–112. National Bureau of Economic Research, Inc., 1974.
21. Ch. Kumar et al. Designing role-based access control using formal concept analysis. Security and Communication Networks, 6(3):373–383, 2013.
22. Ezio Lefons, Alberto Silvestri, and Filippo Tangorra. An analytic approach to statistical databases. In Proceedings of the 9th International Conference on Very Large Data Bases, VLDB '83, pages 260–274, San Francisco, CA, USA, 1983. Morgan Kaufmann Publishers Inc.
23. Mark Levene and George Loizou. A Guided Tour of Relational Databases and Beyond. Springer, 1999.
24. Chong K. Liew, Uinam J. Choi, and Chung J. Liew. A data distortion by probability distribution. ACM Trans. Database Syst., 10(3):395–411, 1985.
25. Catherine Meadows and Sushil Jajodia. Integrity versus security in multi-level secure databases. In DBSec, pages 89–101, 1987.
26. Matthew Morgenstern. In IEEE Symposium on Security and Privacy.
27. S. Chandra Mouliswaran, Ch. Kumar, C. Chandrasekar, et al. Modeling Chinese wall access control using formal concept analysis. In Contemporary Computing and Informatics (IC3I), 2014 International Conference on, pages 811–816. IEEE, 2014.
28. Sarah Nait-Bahloul. Inference of security policies on materialized views. Master's research report. http://liris.cnrs.fr/~snaitbah/wiki, 2009.
29. Gultekin Özsoyoglu and Tzong-An Su. Rounding and inference control in conceptual models for statistical databases. In Proceedings of the 1985 IEEE Symposium on Security and Privacy, page 160, 1985.
30. M. Tamer Özsu and Patrick Valduriez. Principles of Distributed Database Systems, Third Edition. Springer, 2011.
31. Steven P. Reiss. Practical data-swapping: The first steps. ACM Trans. Database Syst., 9(1):20–37, March 1984.
32. Shariq Rizvi, Alberto O. Mendelzon, S. Sudarshan, and Prasan Roy. Extending query rewriting techniques for fine-grained access control. In Gerhard Weikum, Arnd Christian König, and Stefan Deßloch, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004, pages 551–562. ACM, 2004.
33. Arnon Rosenthal and Edward Sciore. View security as the basis for data warehouse security. In CAiSE Workshop on Design and Management of Data Warehouses, pages 5–6, 2000.


34. Arnon Rosenthal and Edward Sciore. Administering permissions for distributed data: Factoring and automated inference. In Proc. of IFIP WG 11.3 Conf., 2001.
35. Pierangela Samarati and Latanya Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, 1998.
36. Jan Schlörer. Security of statistical databases: Multidimensional transformation. ACM Trans. Database Syst., 6(1):95–112, 1981.
37. Mokhtar Sellami, Mohamed Mohsen Gammoudi, and Mohand Said Hacid. Secure data integration: A formal concept analysis based approach. In Database and Expert Systems Applications, pages 326–333. Springer, 2014.
38. Peter Pin-Shan Chen. The entity-relationship model: Toward a unified view of data. ACM Transactions on Database Systems, 1:9–36, 1976.
39. Ścibor Sobieski and Bartosz Zieliński. Modelling role hierarchy structure using the formal concept analysis. Annales UMCS Sectio AI Informatica, 10:143–159, 2015.
40. Tzong-An Su and Gultekin Özsoyoglu. Data dependencies and inference control in multilevel relational database systems. In Proceedings of the 1987 IEEE Symposium on Security and Privacy, Oakland, California, USA, April 27-29, 1987, pages 202–211, 1987.
41. Latanya Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5):571–588, October 2002.
42. M. B. Thuraisingham. Security checking in relational database management systems augmented with inference engines. Computers and Security, 6(6):479–492, 1987.
43. Joseph F. Traub, Yechiam Yemini, and Henryk Wozniakowski. The statistical security of a statistical database. ACM Trans. Database Syst., 9(4):672–679, 1984.
44. Petko Valtchev and Rokia Missaoui. Building concept (Galois) lattices from parts: Generalizing the incremental methods. In Conceptual Structures: Broadening the Base, 9th International Conference on Conceptual Structures, ICCS 2001, Stanford, CA, USA, July 30-August 3, 2001, Proceedings, pages 290–303, 2001.
45. L. Wang, S. Jajodia, and D. Wijesekera. Lattice-based inference control in data cubes. In Preserving Privacy in On-Line Analytical Processing (OLAP), volume 29 of Advances in Information Security, pages 119–145. Springer US, 2007.
46. Stanley L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, March 1965.
47. Lei Xu, Chunxiao Jiang, Jian Wang, Jian Yuan, and Yong Ren. Information security in big data: Privacy and data mining. IEEE Access, 2:1149–1176, 2014.
