A Dynamic Knowledge-Based Approach to the Problem of Deduction in a Non-Statistical Multilevel Secure Database

by

Michael Anderson
Computer Science and Engineering Department, University of Connecticut
260 Glenbrook Road, Storrs, Connecticut 06269-3155
[email protected]
Abstract

Of all research problems associated with the security of data in a multilevel secure database, perhaps the most difficult and least understood is the problem of controlling indirect inference of sensitive data. The phenomenon of inferring sensitive data through the combination of non-sensitive data and real-world knowledge has been termed deduction. We do not agree with the prevailing attitude that all such deduction is unpreventable and believe that our dynamic knowledge-based approach to this problem provides a high degree of security with less restrictive information flow than previously achieved. Our dynamic knowledge-based system is instrumental in the prevention of deduction, given sufficient knowledge of a domain. As a query is presented to the database, our system attempts to use the knowledge in the KB, combined with the data that would be released if the query were permitted, to deduce sensitive data as a user moderately knowledgeable in the domain might. If such data is found to be so deducible, the system has detected a possible undesirable channel for sensitive data disclosure.

[Keywords: security, inference detection, knowledge-based interface]

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. CIKM '93 - 11/93, D.C., USA. © 1993 ACM 0-89791-626-3/93/0011...$1.50

1 Introduction

Of all research problems associated with data security, perhaps the most difficult and least understood is the problem of controlling indirect disclosures of sensitive information [Denning83]. The inference problem, as this form of disclosure is termed, has proved difficult to prevent, in that the controls that have been postulated are either too restrictive, easy to circumvent, or impossible to implement.

To fully appreciate the difficulty of this problem, it is important to note that the bulk of inference research has focused on the problem in a statistical database [Denning79, Denning83, Adam89]. The inference problem is, in fact, of a much broader magnitude: any database that contains data of different classification levels (that is, any multilevel secure database) is vulnerable to such an attack, and controlling it with a severely constrained information flow compromises the usefulness of the database.

Data contained in any database, even if they are not themselves sensitive, will often provide a means of inferring sensitive data of different classifications. Security controls that merely hide the sensitive data cannot prevent this, because real-world knowledge can relate these non-sensitive data to sensitive data, even sensitive data not contained in the database. The phenomenon of inferring sensitive data by combining non-sensitive data in a database with such real-world knowledge has been termed deduction [Wiseman90]. This distinguishes it from more traditional inference control research that does not concern itself with such external knowledge. We believe that a dynamic knowledge-based approach applied to the problem of deduction can provide a high degree of security with less restrictive information flow than previously achieved.

2 Proposed Solution
The deduction problem, with respect to multilevel secure databases, seems amenable to solution by the integration of a knowledge base (KB) of real-world information with a database of facts. The KB should contain knowledge about the facts in the database, including relevant properties and relationships of not only the attributes of each relation but each entity in the domain of these attributes. Previous work in this area has concentrated on Entity-Relationship Graphs that only deal with relationships between attributes [Hinke89, Bonyun90]. It is important to realize that this approach to deduction prevention is too restrictive; finer grain semantics are required to achieve minimally restrictive information flow.

For example, consider the PATIENT relation presented in Table 1. If the obvious protection scheme were implemented, all (and only) values of the DISORDER attribute would be unavailable at all times to any user not privy to that information. The deduction problem arises in this case because of the implicit relationship between PRESCRIPTION and DISORDER. In some cases, a reasonably knowledgeable medical worker who is denied access to the DISORDER attribute might still be able to deduce the value of a DISORDER given the corresponding value of PRESCRIPTION. For example, given the PRESCRIPTION value "theophyllin", this user might reasonably deduce that the patient in question has asthma.

Traditionally [Bonyun90, Haigh90, Su90, Henning89, Hinke89], if the possibility of such deductions were deemed undesirable, control was accomplished by statically analyzing the DB to find these channels and then permanently raising the classification level of the attributes from which sensitive data could be deduced to the level of this sensitive data. In the example, deduction would be remedied by the classification of the entire PRESCRIPTION attribute at the same level as the DISORDER attribute. Although this does prevent such deductions, over-classifying non-sensitive data in this manner restricts access to such data and compromises the usefulness of the database. If, however, the database was somehow given the knowledge that "theophyllin is used in the treatment of asthma", then the example deduction could be prevented by simply not allowing that particular value of PRESCRIPTION to be divulged while still allowing some of its other values to be released. What distinguishes this particular value of PRESCRIPTION from others is that its use can often be linked to a particular disorder. This opens the possibility of deducing the value of DISORDER from the value of PRESCRIPTION. Such would not be the case if the value of PRESCRIPTION were, say, "darvon", a drug used to alleviate pain in any number of disorders.

Our dynamic knowledge-based system is instrumental in the prevention of such deductions, given sufficient knowledge of a domain, via its ability to make inferences from such knowledge. As a query is presented to the database, this inference mechanism attempts to use the knowledge in the KB, combined with the fact (or facts) that would be released if this query were permitted, to deduce sensitive data as a moderately knowledgeable user might. If such data is found to be so deducible, the system has detected a possible undesirable channel for sensitive data release. The goal of the current system is the detection of single queries that could be instrumental, in concert with real-world knowledge, in the deduction of sensitive information in a relational database. Issues concerning problems of queries made over time and actions to take upon detection are presented as future research topics.

NAME          DISORDER  PRESCRIPTION   THERAPY
DOE, JOHN     (hidden)  THEOPHYLLIN
SMITH, PETER  (hidden)  ACE INHIBITOR
JONES, ANN    (hidden)                 DIETARY
KELLY, ALEX   (hidden)
DAVIS, TOM    (hidden)  INSULIN

Table 1: PATIENT Relation (DISORDER attribute sensitive)

3 Problem Dimensions

Central to the current work are the various dimensions of the problem and their subsequent constraint: the type of database to be protected, the nature of the KB and its internal representation, the character of the inference mechanism and its algorithms, and the integration of the system with its database.

3.1 Database Dimension

Currently, our work focuses on the relational database model as it seems to be the emerging de facto standard. Although research in semantic and object-oriented database models is attempting to infuse semantics in the data, we believe that the richness of the semantics required by our approach and the fact that we are dealing with knowledge that is external to the database precludes their use. This does not mean that our system will not work with such models, but that the knowledge required is itself external to the database and resides in a separate knowledge base. In actuality, our goal is database model independence. That is, we hope to provide a general system that, with simply a change of database interface, is usable by all database systems. The current work expands the traditional domain of inference research to include not only statistical databases but any multilevel secure database. We will restrict our focus to medical information systems that are considered "at high risk" with respect to the privacy and confidentiality of their data [Biskup90]. Again, this is not to imply that our system will only be applicable in this domain. Although the KB will necessarily be domain dependent, the inference mechanism will not be and, therefore, a system in a new domain can be readily created by changing the contents of the knowledge base. An example relation from a medical database (see Table 1) will be used to illustrate our approach.

3.2 Classification Scheme Categories

Ideally, we seek a system that is minimally restrictive of all its data and that incurs a reasonably small amount of inferencing overhead to achieve this minimal restriction. The finer grain semantics proposed generate two promising classification schemes, the first termed Sensitive Attribute Classification and the second termed No Attribute Classification.

By Sensitive Attribute Classification we mean that only the attribute that is considered sensitive is hidden. Left at this rather large grain level, this scheme offers no protection from deduction but protects all sensitive data from direct inspection: sensitive data is maximally restricted while non-sensitive data is maximally unrestricted. Adding the proposed protection from deduction to this scheme incurs the cost of making access to non-sensitive data somewhat more restrictive but, unlike the classification of the entire attribute, only minimally so. This is so because only certain values of the non-sensitive deductive attribute must be hidden, namely those from which sensitive data can be inferred.

By No Attribute Classification we mean that no attributes are entirely hidden, that is, classification takes place solely at the value level. On its own, this scheme obviously affords no protection to any of the data but is maximally unrestrictive of both sensitive and non-sensitive data. Under the proposed approach, the only data values that are hidden are those which are deemed most sensitive and data from which these most sensitive values can be inferred. This is minimally restrictive of both sensitive and non-sensitive data and assumes that not all data under a sensitive attribute is equally sensitive.

In comparison, both classification schemes are minimally restrictive of non-sensitive data. Sensitive Attribute Classification has the disadvantage that it is maximally restrictive of sensitive data, while No Attribute Classification will require a larger KB in order to denote which data is deemed sensitive and could exact a greater inferencing penalty. These trade-offs must be considered when choosing a dynamic classification scheme. We will consider both schemes in further discussion of our approach.
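The difference between the two schemes can be sketched as a filtering step over a returned row. This is our illustrative Python sketch, not the paper's implementation; the attribute and value sets are hypothetical examples.

```python
# Hypothetical sketch of the two classification schemes.

SENSITIVE_ATTRS = {"DISORDER"}                    # Sensitive Attribute Classification
SENSITIVE_VALUES = {("DISORDER", "DEPRESSION")}   # No Attribute Classification

def filter_sensitive_attr(row: dict) -> dict:
    """Hide every value of a sensitive attribute."""
    return {a: v for a, v in row.items() if a not in SENSITIVE_ATTRS}

def filter_sensitive_values(row: dict) -> dict:
    """Hide only the individual values explicitly marked sensitive."""
    return {a: v for a, v in row.items() if (a, v) not in SENSITIVE_VALUES}

row = {"NAME": "DOE, JOHN", "DISORDER": "ASTHMA", "PRESCRIPTION": "THEOPHYLLIN"}
print(filter_sensitive_attr(row))    # DISORDER removed entirely
print(filter_sensitive_values(row))  # ASTHMA kept: only listed values are hidden
```

Either filter would run before deduction detection; detection then decides whether any of the surviving non-sensitive values must also be withheld.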
3.3 Query Dimension

The richness of SQL allows a wide range of queries to be presented to a database. We will focus on single, straightforward requests to the database for non-sensitive data (and sensitive data when considering the No Attribute Classification scheme) from the PATIENT relation, in the hope that the extension to more complicated queries will be easier once these are better understood. Although it will be a necessary component of a complete deduction prevention system, no attempt will be made to incorporate a temporal dimension to the problem. Our first concern is to show the feasibility of our approach; queries presented over time will be considered in future work.

3.4 User Dimension

It is necessary to specify the role of the user in the current work. This is so because we need to create a KB that models the real-world knowledge that this user possesses in this domain. The level of classification and domain chosen to explore the problem of deduction point to a user that is moderately well-versed in medical knowledge but is not of the level of an MD. Further, this user must have a need-to-know that corresponds to the classification level specified. Such a user can be considered as a high level medical assistant that manages prescription medicines and therapies.

It is important that the KB reflect the knowledge of this user with some fidelity. If it is the case that the KB contains only a subset of this user's knowledge in the domain, then it is not clear that deduction could be prevented sufficiently. On the other hand, if the KB is "all-knowing" in its domain, its knowledge is greater than that of the user it is attempting to model, and it is bound to be more restrictive of its data than is warranted: a system with a more knowledgeable KB will be more likely to make deductions than a less knowledgeable one, with the result that data will be protected more frequently than necessary. It should also be noted that the more knowledgeable the user, the greater the possibility of deduction and the less preventable deduction will be in a system that does not model that knowledge correctly.

3.5 Uncertainty Dimension

Two types of uncertainty need to be addressed in the current work: uncertainty of knowledge and uncertainty of deduction. Uncertainty of knowledge pertains to the level of confidence that can be attributed to facts in the KB by the detection system. Not all facts are known with 100% confidence and, further, many facts only pertain a certain percentage of the time. The medical domain is particularly subject to such uncertain knowledge. Uncertainty of deduction pertains to the level of confidence that can be attributed to a deduction drawn by a user. Obviously, conclusions drawn through deduction are not all equal in degree of validity.

The current work considers the certainty of facts in the system knowledge base to be absolute. This constraint eventually needs to be removed but does not reflect on the viability of the system proposed. It is a knowledge representation and inference complication that can be ignored for the moment without undermining the main thrust of the approach. Deduction uncertainty, on the other hand, is more germane to this thrust and is modeled by ambiguity in the inferencing process. That is, the more ambiguous the inferences made by the system are, the more uncertain any deduction made by a user, based on these inferences, will be. For example, given the PRESCRIPTION value "darvon", the inferencing process of the system will deduce many possible disorders, so many that a user making the same deduction will not be able to choose between them and, therefore, will have high uncertainty associated with any deduction he/she dares to make. This will not be the case if the PRESCRIPTION value was "theophyllin" because it is used in relatively few disorders and especially in asthma. Low uncertainty in this deduction would be reflected in the few disorders that the system would be able to infer given the PRESCRIPTION value "theophyllin". (Note that the bias of "theophyllin" towards "asthma" is a function of KB certainty and is not represented in the current work.)

3.6 Knowledge Base Dimension

The knowledge base is comprised of the relationships between various values in the domain of each attribute in a given database as well as the relationships between each attribute. Such knowledge is, by necessity, domain-dependent. That is, only the relationships required by the particular set of attributes of the database and their range of values are included. This knowledge is stored separately from the data in the database. This is the case because, by nature, such knowledge will not fit neatly into a relational database model. A more flexible structure is required to contain the declarative information denoted. Although intuitively we might expect the KB to contain facts that pertain specifically to the deducibility of one fact from another, a more general KB structure has been decided upon because of its potential for use beyond the current one postulated.

It should be noted that, even though the domain of a database can be quite constrained, the construction of a useful KB will put great demands upon available resources. It is important that the costs involved in such an undertaking be amortized across as large a range of uses as possible. Such uses could include more intelligent interfacing with a user [Anderson90, Anderson91], a broader range of integrity and consistency enforcement, and, in general, support for a new generation of intelligent DB applications [Morgenstern87]. Many knowledge representation issues must be addressed.
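The separate knowledge store of Section 3.6 and the ambiguity-based uncertainty model of Section 3.5 can be sketched together: domain knowledge lives outside the relational data, and a deduction counts as masked when it links to too many candidates. The treatment links and the limit below are hypothetical illustrations, not the paper's KB.

```python
# Illustrative sketch (not the paper's Lk formalism): domain knowledge kept
# apart from the relational data, with deduction uncertainty modeled by the
# ambiguity (count) of candidate inferences.

# Hypothetical links: a prescription value -> disorders it may indicate.
KB_TREATS = {
    "THEOPHYLLIN": {"ASTHMA"},
    "DARVON": {"MIGRAINE", "ARTHRITIS", "BACK_INJURY", "POST_SURGICAL_PAIN"},
}

AMBIGUITY_LIMIT = 3  # system-defined: more candidates than this masks the deduction

def deduction_is_masked(prescription: str) -> bool:
    """True when the KB links the released value to enough disorders that a
    user cannot choose between them (high deduction uncertainty)."""
    return len(KB_TREATS.get(prescription, set())) > AMBIGUITY_LIMIT

print(deduction_is_masked("THEOPHYLLIN"))  # False: one candidate, low uncertainty
print(deduction_is_masked("DARVON"))       # True: many candidates, high uncertainty
```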
The exact structure of each piece of information, how these pieces are linked, and the overall structure of the KB must all be specified.

3.6.1 Lk Knowledge Representation

To alleviate the ambiguities of natural language and facilitate the symbolic manipulation of knowledge, a more standardized internal representation is required by the system. To this end, Lk [Anderson90, Shin91, Anderson91] has been devised. Lk is a knowledge representation formalism that attempts to combine the rigor of formal logic with the structural flexibility of frames [Schank82]. The main structure of the language used in this proposal is the concept term or c-term. A c-term is a representation of some entity, event, or state. It has a concept label and can be followed by a restriction, which is itself a series of concept-connector: c-term pairs (each pair called a restrictor). For example, the fact that "Asthma displays the symptom of swollen bronchi" could be represented as the c-term DISORDER[val: ASTHMA, symptoms: SYMPTOM[val: SWOLLEN_BRONCHI]]. Here DISORDER functions as the concept label and everything contained in the uppermost square brackets is its restriction. val: ASTHMA and symptoms: SYMPTOM[val: SWOLLEN_BRONCHI] are restrictors, with val and symptoms functioning as concept-connectors that connect the c-terms ASTHMA and SYMPTOM[val: SWOLLEN_BRONCHI] to DISORDER. A fragment of the KB for the example domain is presented in Figure 1 with its underlying structure presented in Figure 2. The example KB contains isa relations that represent inheritance links, c-term restrictor constraints that relate attributes of the DB, and facts that relate values of these attributes.
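The c-term structure described above can be modeled as a small recursive data type. This is our sketch of the idea, not the Lk implementation; only the printed notation follows the paper.

```python
# Minimal sketch of an Lk-style c-term: a concept label plus restrictors
# (concept-connector : c-term pairs), rendered in the paper's notation.
from dataclasses import dataclass, field

@dataclass
class CTerm:
    label: str
    restrictors: list = field(default_factory=list)  # list of (connector, CTerm)

    def __str__(self):
        if not self.restrictors:
            return self.label
        inner = ", ".join(f"{c}: {t}" for c, t in self.restrictors)
        return f"{self.label}[{inner}]"

# "Asthma displays the symptom of swollen bronchi"
fact = CTerm("DISORDER", [
    ("val", CTerm("ASTHMA")),
    ("symptoms", CTerm("SYMPTOM", [("val", CTerm("SWOLLEN_BRONCHI"))])),
])
print(fact)  # DISORDER[val: ASTHMA, symptoms: SYMPTOM[val: SWOLLEN_BRONCHI]]
```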
3.7 Inference Mechanism Dimension

The inference mechanism is completely dependent on the choice of knowledge representation paradigm. Given that a frame-based representation has been chosen, a heuristic search mechanism proves useful. Issues of search space constraint, choice of heuristic, inferencing method, and search paradigm all must be addressed. It is important to note that the deduction problem as defined lends itself to goal-directed inference in that the value(s) of the sensitive data that needs protection will be known to the system as it proceeds with its inferencing. This information can be used to guide the inferencing process and to keep extraneous inferencing down to a minimum. In the current system, a search algorithm based on the A* algorithm [Nilsson80] is used in an attempt to find a linkage between the non-sensitive data that would be released if a query were allowed to run to its completion and the sensitive data that is being protected from deduction. The system, starting with the yet-to-be-released data, iteratively makes a number of plausible transformations of this data into related facts using knowledge in the KB. These new facts are compared with the sensitive data and the new fact that is conceptually closer to it is chosen for the next round of transformation. If, at some point, there is a match between a new fact and the sensitive data, a linkage has been found between the non-sensitive and sensitive data and a possible channel for deduction has been detected.

3.7.1 Inference in Lk

The process of inferencing in Lk is one of syntactically manipulating c-terms and is therefore domain independent. Restrictors can be introduced or inherited and the c-terms they contain can themselves be manipulated, concept labels can be substituted for more specific or more general ones, c-terms can be imbedded inside other c-terms that have the appropriate constraints, and, furthermore, any c-term that is contained in the restriction of another c-term can be released to act as an autonomous c-term. These transformations are accomplished by the application of a number of inferencing rules that are an integral component of Lk itself.

Downward Label Substitution allows a concept label of a c-term to be more specifically denoted. This is a plausible inference that traverses an isa link in a backward direction. In the example, DRUG_THERAPY can be substituted for THERAPY.

Upward Label Substitution allows a concept label of a c-term to be more generally denoted. This is a valid inference that traverses an isa link in a forward direction. In the example, DRUG can be substituted for PRESCRIPTION.

Membership Identification allows a c-term to be identified as a member of a set. This is a valid inference that combines related c-terms. For example, if it is known that THEOPHYLLIN is a member of the set DRUG, we can infer DRUG[val: THEOPHYLLIN].

Concept Connection Identification allows a matching c-term to fill a restrictor of another c-term. This is a plausible inference that attempts to fill slots of c-terms. For example, DRUG[val: THEOPHYLLIN] can fill the drug slot of DRUG_THERAPY.

Restrictor Introduction allows the free exchange of restrictors between c-terms with identical concept labels. This is a plausible inference that introduces filled slots into a c-term. For example, given DEFECT[val: ASTHMA] and DEFECT[symptoms: SYMPTOM[val: SHORTNESS_OF_BREATH]], we can infer DEFECT[val: ASTHMA, symptoms: SYMPTOM[val: SHORTNESS_OF_BREATH]].

Restrictor Release allows the promotion of a c-term that is imbedded in the restriction of another c-term. For example, the restrictor DRUG[val: THEOPHYLLIN] can be released from the c-term THERAPY[drugs: DRUG[val: THEOPHYLLIN]].

These transformations are performed by the inferencing process associated with the Lk knowledge representation language.

3.7.2 Conceptual Distance Heuristic

We propose the notion of conceptual distance [Anderson90, Anderson91, Shin93] as a heuristic means of focusing the attention of the inference mechanism. The conceptual distance between two c-terms is defined as the ranking of the set of similarities shared between them combined with the cost incurred by the transformation rules applied to produce the first. Similarity metrics include matching concept labels, subset and superset relationships between concept labels, concept labels matching restrictors, and matching restrictions. These similarities are combined into a set and this set is ranked in accordance with the strength and numbers of similarities it contains. Transformational cost is determined by considering the validity or plausibility of each rule used in the transformation of a c-term and summing up the cost of these. This heuristic provides a measure of how closely related two c-terms are, that is, their conceptual closeness. We are most interested, in the case of finding a linkage between sensitive and non-sensitive data, in the conceptual distance between the fact(s) that would be released by a given query and the sensitive data that we are attempting to protect from possible deduction.

A heuristic ranking of a given transformation can thus be formed by combining a rating of the similarity of that transformation with respect to the sensitive data and the accumulated cost of deriving that transformation. This heuristic can then be used to drive the modified A* search algorithm in its search for a linkage between non-sensitive and sensitive data.
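The goal-directed search described above can be sketched as a best-first search over fact transformations. The successor map, rule costs, and distance function below are simplified stand-ins of our own devising: real Lk rules rewrite c-terms, and conceptual distance combines ranked similarities with rule cost.

```python
# Sketch of the A*-style linkage search (illustrative, not the Lk mechanism).
import heapq

# Hypothetical one-step transformations with rule costs, standing in for
# upward label substitution, concept connection identification, etc.
SUCCESSORS = {
    "PRESCRIPTION:THEOPHYLLIN": [("DRUG:THEOPHYLLIN", 1)],
    "DRUG:THEOPHYLLIN": [("THERAPY:THEOPHYLLIN", 1)],
    "THERAPY:THEOPHYLLIN": [("SYMPTOM:SWOLLEN_BRONCHI", 2)],
    "SYMPTOM:SWOLLEN_BRONCHI": [("DISORDER:ASTHMA", 1)],
}

def distance_estimate(fact: str, goal: str) -> int:
    """Crude stand-in for conceptual distance: 0 on a label match, else 1."""
    return 0 if fact.split(":")[0] == goal.split(":")[0] else 1

def find_linkage(start: str, sensitive: str, limit: int = 50):
    """Return a chain of transformations linking start to sensitive, if any."""
    frontier = [(distance_estimate(start, sensitive), 0, start, [start])]
    seen = set()
    while frontier and limit > 0:
        limit -= 1
        _, cost, fact, path = heapq.heappop(frontier)
        if fact == sensitive:
            return path                       # possible deduction channel found
        if fact in seen:
            continue
        seen.add(fact)
        for nxt, step in SUCCESSORS.get(fact, []):
            heapq.heappush(frontier,
                           (cost + step + distance_estimate(nxt, sensitive),
                            cost + step, nxt, path + [nxt]))
    return None                               # no linkage within the search limit

print(find_linkage("PRESCRIPTION:THEOPHYLLIN", "DISORDER:ASTHMA"))
```

A returned chain corresponds to a detected deduction channel; a None result within the system-defined limit means the non-sensitive data may be released.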
DEFECT isa DISORDER
INJURY isa DISORDER
DISEASE isa DISORDER
PHYSICAL_THERAPY isa THERAPY
DRUG_THERAPY isa THERAPY
PRESCRIPTION isa DRUG

DEFECT[symptoms: SYMPTOM]
SYMPTOM[treatment: THERAPY]
DRUG_THERAPY[drugs: DRUG, dose: DOSAGE]
DEFECT[val: ASTHMA, symptoms: SYMPTOM[val: SHORTNESS_OF_BREATH], SYMPTOM[val: SWOLLEN_BRONCHI]]
SYMPTOM[val: SWOLLEN_BRONCHI, treatments: THERAPY[drugs: DRUG[val: THEOPHYLLIN], dose: DOSAGE[mg: 300, freq: DAILY]]]

Figure 1: Knowledge Base Fragment
Figure 2: Knowledge Base Fragment Structure (a graph of the fragment in Figure 1: isa links connect DISEASE, INJURY, and DEFECT to DISORDER, PHYSICAL_THERAPY and DRUG_THERAPY to THERAPY, and PRESCRIPTION to DRUG; has/is links connect DEFECT to SYMPTOM, SYMPTOM to THERAPY, and DRUG_THERAPY to DRUG and DOSAGE)
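The isa hierarchy shown in Figures 1 and 2 can be encoded directly as data; upward label substitution then amounts to walking these links toward more general labels. A minimal sketch, assuming only the fragment shown:

```python
# The isa links of the example KB fragment.
ISA = {
    "DEFECT": "DISORDER", "INJURY": "DISORDER", "DISEASE": "DISORDER",
    "PHYSICAL_THERAPY": "THERAPY", "DRUG_THERAPY": "THERAPY",
    "PRESCRIPTION": "DRUG",
}

def generalizations(label: str):
    """All labels reachable by following isa links upward."""
    out = []
    while label in ISA:
        label = ISA[label]
        out.append(label)
    return out

print(generalizations("PRESCRIPTION"))  # ['DRUG']
print(generalizations("DRUG_THERAPY"))  # ['THERAPY']
```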
4 System Integration

The KB and its inference mechanism must be integrated with the user interface and database. Since it is currently assumed that the relational database model will be used in conjunction with this system, the main juncture between the deduction detection system and the database will necessarily be a communication link between the database's SQL dialect and the detection system's knowledge representation. The system's architecture is presented in Figure 3. A user first presents a query to the system. The Deduction Detector (DD) intercepts this query before it is presented to the SQL interface. (It should be noted that, although the DD is currently only concerned with the detection of deduction, all matters of data security should eventually be handled by this module.) The query is modified to request any sensitive data in the relation and this modified query is presented to the SQL interface. This module formulates the data request, gathers the appropriate tuples from the database, and returns these to the DD. The DD translates the data in the tuple into Lk representation using information in the original query. These new facts will correspond to all the non-sensitive data that would be released by the user's query and the sensitive data that is being protected. The DD then calls on the inferencing mechanism to attempt to link the non-sensitive and sensitive data. If such a linkage is found, within some system-defined time and ambiguity limits, the possibility of deduction has been found and controls for such should be invoked. The following section presents an example of the system's operation.

4.1 Example Query

The following example query is based on all previous assumptions. Further, it assumes that Sensitive Attribute Classification is being used and therefore the entire attribute DISORDER is deemed sensitive and all its values need to be protected from deduction. A discussion of the differences encountered when No Attribute Classification is used follows. The user first issues an SQL query:

    SELECT PRESCRIPTION FROM PATIENT WHERE NAME = "doe,john"

The Deduction Detector (DD) intercepts the query and modifies it to include sensitive data by attaching the sensitive attributes to the SELECT statement:

    SELECT PRESCRIPTION, DISORDER FROM PATIENT WHERE NAME = "doe,john"

The DD then submits this modified query to the SQL interface, which in turn submits a request for the data to the DB and gathers the returned data into the tuple

    (THEOPHYLLIN, ASTHMA)

The SQL interface returns this tuple to the DD, and the new about-to-be-released fact and the sensitive fact are generated:

    PRESCRIPTION[val: THEOPHYLLIN] & DISORDER[val: ASTHMA]

The DD then fills in details it knows about the sensitive fact in general:

    DISORDER[val: ASTHMA, symptoms: SYMPTOM[val: SHORTNESS_OF_BREATH], SYMPTOM[val: SWOLLEN_BRONCHI]]

This provides more information that can be used by the heuristic in choosing the most likely candidate c-terms for further transformation.
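The DD's query-modification step can be sketched as a rewrite of the SELECT list. The string manipulation below is a naive illustration of our own; a real DD would operate on a parsed query in the database's SQL dialect.

```python
# Sketch of the Deduction Detector's query modification (illustrative only).
SENSITIVE_ATTRS = {"PATIENT": ["DISORDER"]}  # hypothetical per-relation policy

def add_sensitive_attrs(query: str, relation: str) -> str:
    """Attach the relation's sensitive attributes to the SELECT list."""
    head, _, tail = query.partition(" FROM ")
    attrs = head[len("SELECT "):].split(", ")
    for a in SENSITIVE_ATTRS.get(relation, []):
        if a not in attrs:
            attrs.append(a)
    return "SELECT " + ", ".join(attrs) + " FROM " + tail

q = 'SELECT PRESCRIPTION FROM PATIENT WHERE NAME = "doe,john"'
print(add_sensitive_attrs(q, "PATIENT"))
# SELECT PRESCRIPTION, DISORDER FROM PATIENT WHERE NAME = "doe,john"
```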
Figure 3: System Architecture (the user's Query is intercepted by the Deduction Detector [Security], which sends a Modified Query to the SQL interface and database and returns a Controlled Response to the user)
PRESCRIPTION[val: THEOPHYLLIN]

PRESCRIPTION isa DRUG and Upward Label Substitution application produces
DRUG[val: THEOPHYLLIN]

DRUG_THERAPY constraints and Concept Connection Identification application produces
DRUG_THERAPY[drugs: DRUG[val: THEOPHYLLIN]]

DRUG_THERAPY isa THERAPY and Upward Label Substitution application produces
THERAPY[drugs: DRUG[val: THEOPHYLLIN]]

SYMPTOM constraints and Concept Connection Identification application produces
SYMPTOM[treatments: THERAPY[drugs: DRUG[val: THEOPHYLLIN]]]

SWOLLEN_BRONCHI symptom and Restrictor Introduction application produces
SYMPTOM[val: SWOLLEN_BRONCHI, treatments: THERAPY[drugs: DRUG[val: THEOPHYLLIN]], dose: DOSAGE[mg: 300, freq: DAILY]]

ASTHMA defect and Concept Connection Identification application produces
DEFECT[val: ASTHMA, symptoms: SYMPTOM[val: SWOLLEN_BRONCHI], SYMPTOM[val: SHORTNESS_OF_BREATH]]

DEFECT isa DISORDER and Upward Label Substitution application produces
DISORDER[val: ASTHMA]

Figure 4: Linkage between PRESCRIPTION and DISORDER
The DD then calls on its inference mechanism to attempt to link these facts. This linking process is accomplished by applications of a number of the transformation rules previously discussed and is detailed in Figure 4. The effect of the heuristic can be seen in Figure 4. The transformation of the PRESCRIPTION c-term to the first SYMPTOM c-term is produced by blind search since these c-terms are all significantly conceptually distant from the sensitive DISORDER c-term. From this point on, each new transformation in the path is conceptually closer to the sensitive c-term due to applications of the heuristic. Since an unambiguous path can be found from the non-sensitive c-term to the sensitive c-term, a linkage is found between the non-sensitive and sensitive data and the possibility of deduction has been detected, warranting the invocation of deduction controls. If such a link had not been found within a system-defined limit, the data returned from the SQL interface (less the sensitive data) could simply be returned to the user as requested. If a set of disorders, including the patient's actual disorder, could be linked with the non-sensitive data, it would have to be determined if "enough" ambiguity (system defined) was present to mask the sensitive data.

It should be noted that this linkage is not found directly except in the best of cases. At each step of the transformation process, each rule is applied to the transformation currently deemed conceptually closest to the sensitive fact and a set of single-step transformations is produced. This new set is combined with all unexplored transformations, the transformation of this set that is deemed conceptually closest to the sensitive data is heuristically chosen, and the process is reiterated.

As stated previously, if No Attribute Classification was used, the system would follow a slightly different path of operations. When the sensitive data value was returned, the system would need to check how sensitive this value was. This sensitivity value would have to be stored in the KB. If the sensitivity of the value was above some system-defined limit, the inference mechanism would be called; otherwise, if the sensitivity of the value was below this limit, it could simply be released to the user. This procedure is based on the intuition that some of the values of the sensitive attribute are less sensitive than others, say that the patient is suffering from influenza versus depression.
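The No Attribute Classification flow just described can be sketched as a threshold check. The sensitivity ratings and limit below are hypothetical values of our own; the paper only specifies that they would be stored in the KB and be system-defined.

```python
# Sketch of the sensitivity check under No Attribute Classification.
SENSITIVITY = {"INFLUENZA": 0.2, "ASTHMA": 0.5, "DEPRESSION": 0.9}  # illustrative
SENSITIVITY_LIMIT = 0.4  # system-defined threshold

def handle_sensitive_value(value: str) -> str:
    """Release low-sensitivity values directly; otherwise invoke the
    inference mechanism to look for deduction channels."""
    if SENSITIVITY.get(value, 1.0) <= SENSITIVITY_LIMIT:
        return "release"
    return "run_inference_check"

print(handle_sensitive_value("INFLUENZA"))   # release
print(handle_sensitive_value("DEPRESSION"))  # run_inference_check
```

Unknown values default to the maximum sensitivity here, a conservative choice consistent with the paper's goal of detecting channels before release.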
5 System Status

A prototype system, based on the Sensitive Attribute Classification scheme, has been implemented on a Sun workstation in Common Lisp and interfaced with the Oracle relational database system. A small KB and DB pertaining to the medical domain have been provided, of which the examples in the current work are subsets. Future testing will be based on more realistically sized KBs and DBs. A number of queries have been tested, including the example presented in this work. These tests include queries that directly ask for sensitive data, queries that ask for non-sensitive data in which no linkage to sensitive data can be found, queries that ask for non-sensitive data in which a linkage to sensitive data can be found, and queries that find a linkage from the non-sensitive data to a set of data that includes the sensitive data. In all appropriate cases, deduction is detected. Such detection does contribute to query response time but not unduly when this increase is weighed against the measure of security provided.
6 Deduction Detection and Control

6.1 Deduction Detection

Notice in the previous example, a given value of PRESCRIPTION will most often link with its corresponding value for DISORDER. In fact, the only times such a linkage could not be produced would be when the link is not known by the KB or when the value of DISORDER is *unknown*. The latter might occur if the attending physician was not sure of the disorder but was simply prescribing something for a symptom that the patient suffered. If we are to invoke deduction controls whenever such a link occurs, we will seemingly be controlling almost every value of PRESCRIPTION and, therefore, circumventing the minimal restriction of non-sensitive data policy we desire! What actually occurred in the example was that a single value of DISORDER, asthma, linked to the given value of PRESCRIPTION, theophyllin. If the value of PRESCRIPTION had been, say, darvon, then it would have linked to any value of DISORDER that involves pain. To be sure, the actual value of the sensitive attribute DISORDER would be among those linked (if known to the system) but the certainty of the user's deduction would be severely curtailed if there was a sufficient number of other, more or less equally likely, possible values of DISORDER that linked with this value of PRESCRIPTION. Assuming Sensitive Attribute Classification, this leads to the conclusion that deduction controls need only be invoked when

1. there is an insufficient number of values of the sensitive attribute that link with the current value(s) for the non-sensitive attribute(s), or

2. some value(s) of the non-sensitive attribute that link with the current value(s) of the sensitive attribute are substantially more likely to be associated with it.

(If No Attribute Classification was used, the sensitivity level of the sensitive value would have to be beyond a system-defined level before any of these conditions were checked.) The first condition is considered in the current work; this inferencing ambiguity directly influences the detection of deduction. Deduction detection is only indicated if the number of values linked is below some system-defined limit. The second condition requires that each value have an associated probability that corresponds to the likelihood that this non-sensitive value will actually be found in conjunction with the given sensitive value in the relation. Such uncertainty of knowledge is not modeled in the current KB and therefore this condition is not considered. The inclusion of such uncertainty is an obvious next step in our research.

So far, we have considered how to handle deduction detection when a query releases a single new fact. A simple extension to this method is postulated to handle cases where more than one fact is released by a single query (e.g., PRESCRIPTION and THERAPY). It is based on the intuition that the sensitive values that could be linked to both simultaneously are exactly those values that are in the intersection of the values that can be linked to each individually. First, each new fact is individually linked with values from the sensitive attributes and each sensitive value linked to it is collected into a set. The intersection of each of these sets produces the final set that contains sensitive values that are consistent with all the non-sensitive values. This set is then checked for sufficient numbers to mask the sensitive value. If so, the non-sensitive values are released; otherwise deduction control is invoked.

6.2 Deduction Controls

The current work is strictly concerned with the detection of the possibility of deduction. This is the case because deduction controls are domain dependent whereas the system proposed can be used in any domain with a change of KB. The typical types of controls that could be invoked on the detection of possible deduction are:

1. withhold the non-sensitive attribute's value and state that the value is sensitive,

2. withhold the non-sensitive attribute's value and state that there is no value, or

3. provide some innocuous, but incorrect, value instead.

In the example domain of the current work, it is not clear which is the best choice. The first leads to the partial deduction that the DISORDER value is sensitive. Release of this simple fact could cause the patient harm. Further, the second and third choices could lead to dangerous circumstances in relation to the patient's well-being. What is clear is that the solution to the problem of what types of controls to invoke upon the detection of possible deduction needs to weigh a large number of domain-dependent factors against the sensitivity of the data being protected before any decision is made concerning their implementation.

6.3 Temporal Deduction Detection and Control

Constraining the deduction problem to single queries is a strategy to help bring the main issues into sharper focus and show that detection of deduction is possible. It is obvious that this work will need to be extended to include queries presented over time. Since the facts released to users are persistent, it would be possible for users to ask a series of queries that would individually not be particularly useful in deduction but in combination could result in the deduction of sensitive information. Research in maintaining structured audit trails [Jajodia90] seems to point to a possible solution to the problem. Another possible solution lies in the maintenance of individual KBs for each user. These could contain the facts released to the user and could be used in tandem with the main KB when attempting to model the deductions possible by that particular user. In any case, this problem is left to future research.
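The intersection-based extension for multi-fact queries described above can be sketched as follows. Here `link_sensitive_values` is a hypothetical stand-in for the single-fact linkage mechanism, and `mask_limit` stands for the system-defined ambiguity threshold; neither name appears in the original.

```python
def detect_deduction(new_facts, link_sensitive_values, mask_limit=5):
    """Multi-fact deduction check: intersect the sets of sensitive
    values linkable to each newly released fact.  If too few values
    survive the intersection, the released facts over-constrain the
    sensitive value and deduction control must be invoked."""
    linked_sets = [set(link_sensitive_values(f)) for f in new_facts]
    consistent = set.intersection(*linked_sets) if linked_sets else set()
    # Enough surviving candidates mask the actual sensitive value.
    if len(consistent) >= mask_limit:
        return "release non-sensitive values"
    return "invoke deduction control"
```

For example, if a prescription links to two disorders and a therapy links to three, only the disorders in both sets remain consistent with the released facts; whether they suffice depends on the threshold chosen.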
7 Related Work

Traditionally, the topic of inference control in data security has focused on the problem of inferring confidential data via aggregation and cross-reference in a statistical database [Denning79, Denning83, Adam89, Ieong89, Matloff89]. Such a focus is natural in that the problem, constrained in this manner, seems to be more understandable and therefore solutions more forthcoming. It should be noted that this problem is only a subset of the problem of deduction in general. That is, inferences made from the cross-referencing of statistical summaries can be made only if the user has some previous knowledge about an object in the database (e.g., inferring a classified fact about someone given there is only one person in the database that has some number of unclassified attributes) and understands the semantics of the statistical functions used to produce these summaries. These assumptions, although implicit in all such research, have not been explicitly stated. We seek to explore the problem of inference beyond the tight limits previously imposed upon it. Specifically, the removal of the constraint imposed that requires the database in question to be statistical and the inclusion of real-world knowledge possessed by a user into the inference problem considerably extends the traditional boundaries.

The use of expert system technology has been explored in relation to the problem of inference. Notably, the Database Inference Controller [Buczkowski90] is an expert system tool for the static inspection of a multilevel secure database that attempts to determine the probability of inferring classified data from unclassified data. Although it does succeed in reducing the risk of inference, it does so at the cost of increased restriction of data in the database stemming from static classification at the level of the attribute. [Morgenstern87] is an early paper that deals with the problem of inference in a non-statistical database. It attempts to bring the rigor of the Bell and LaPadula [Bell73] model of security into the realm of inference control by introducing a series of formal notions to fully delineate the problem. A set of general algorithms is offered that, although interesting, are largely abstractions whose implementation is impractical. Other approaches designed to alleviate the inference problem recognize the need for some artificial intelligence techniques in its solution but at best these only provide for static detection of undesirable paths from unclassified data to classified data [Bonyun90, Haigh90, Su90, Henning89, Hinke89]. None provide the dynamic, query-time detection required for maximum data access and none consider a classification granularity finer than the attribute.
8 Conclusion

Previous research that has considered the deduction problem in a multilevel secure database has, at best, statically detected the threat. The current work has produced a system that, by the integration of real-world knowledge about the domain of a database into a KB and an inference process that can simulate the deduction processes of a user, can dynamically detect that the release of non-sensitive data could lead to the deduction of sensitive data. Such a system can provide a high degree of security with less restrictive information flow than previously achieved. Given sufficient knowledge of a domain, the approach can be generalized to include any non-statistical secure database.

References

[Adam89] Adam, N.R. and Wortmann, J.C., "Security-Control Methods for Statistical Databases: A Comparative Study", ACM Computing Surveys, Vol. 21, No. 4, December 1989.

[Anderson90] Anderson, M.E. and Shin, D.G., "Integrating an Intelligent Interface with a Relational Database for Two Way Man-Machine Communication", Proceedings of the IEEE/ACM International Conference on Developing and Managing Expert System Programs, Washington, D.C., 1990.

[Anderson91] Anderson, M.E., "Response Evaluation in a Dialogue Manager via the Computation of Conceptual Distance", Masters Thesis, The University of Connecticut, 1991.

[Bell73] Bell, D. and LaPadula, L.J., "Secure Computer Systems: Mathematical Foundations & A Mathematical Model", Vols. 1 & 2, Technical Report ESD-TR-73-278, Electronic Systems Division, USAF, 1973.

[Biskup90] Biskup, J., "Protection of Privacy and Confidentiality in Medical Information Systems: Problems and Guidelines", Database Security, III: Status and Prospects, Spooner, D.L. and Landwehr, C.E., editors, North-Holland, 1990.

[Bonyun90] Bonyun, D.A., "Using EXCESS as a Framework for Secure Databases", Database Security, III: Status and Prospects, Spooner, D.L. and Landwehr, C.E., editors, North-Holland, 1990.

[Buczkowski90] Buczkowski, L.J., "Database Inference Controller", Database Security, III: Status and Prospects, Spooner, D.L. and Landwehr, C.E., editors, North-Holland, 1990.

[Denning79] Denning, D.E. and Denning, P.J., "Data Security", ACM Computing Surveys, Vol. 11, No. 3, September 1979.

[Denning83] Denning, D.E. and Schlorer, J., "Inference Controls for Statistical Databases", Computer, July 1983.

[Haigh90] Haigh, J.T. et al., "The LDV Approach to Database Security", Database Security, III: Status and Prospects, Spooner, D.L. and Landwehr, C.E., editors, North-Holland, 1990.

[Henning89] Henning, R.R. and Simonian, R.P., "Security Analysis of Database Schema Information", Database Security, II: Status and Prospects, Landwehr, C.E., editor, North-Holland, 1989.

[Hinke89] Hinke, T.H., "Database Inference Engine Design Approach", Database Security, II: Status and Prospects, Landwehr, C.E., editor, North-Holland, 1989.

[Ieong89]