A Security Punctuation Framework for Enforcing Access Control on Streaming Data
Rimma V. Nehme #1, Elke A. Rundensteiner ∗2, Elisa Bertino #3
# Department of Computer Science and CERIAS, Purdue University, W. Lafayette, IN 47906 USA
1 [email protected] 3 [email protected]
∗ Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA 01609 USA
2 [email protected]
Abstract— The management of privacy and security in the context of data stream management systems (DSMS) remains largely an unaddressed problem to date. Unlike in traditional DBMSs where access control policies are persistently stored on the server and tend to remain stable, in streaming applications the contexts and with them the access control policies on the real-time data may rapidly change. A person entering a casino may want to immediately block others from knowing his current whereabouts. We thus propose a novel “stream-centric” approach, where security restrictions are not persistently stored on the DSMS server, but rather streamed together with the data. Here, the access control policies are expressed via security constraints (called security punctuations, or short, sps) and are embedded into data streams. The advantages of the sp model include flexibility, dynamicity and speed of enforcement. DSMSs can adapt to not only data-related but also security-related selectivities, which helps reduce the waste of resources, when few subjects have access to data. We propose a security-aware query algebra and new equivalence rules together with cost estimations to guide the security-aware query plan optimization. We have implemented the sp framework in a real DSMS. Our experimental results show the validity and the performance advantages of our sp model as compared to alternative access control enforcement solutions for DSMSs.
I. INTRODUCTION
A. Security in Data Streaming Environments
The need for people to protect themselves and their assets is as old as humankind. The increasing use of electronic, sensor and GPS devices means that individuals today have an ever-growing range of electronic (data) assets that may potentially be at risk. When computing devices are integrated with people, various personal information is expressed in digital form. Devices can communicate this information over networks, and users have no control over who may query their data or for what purpose. Some users, knowing that their personal information (e.g., location, health condition) is not safeguarded, may hesitate to use such devices because of the risk of data being misused.
Traditional access control schemes, which typically assume finite persistent datasets and static (or rarely changing) access control policies, become largely inapplicable in this new stream paradigm. This inapplicability is due to the fact that stream environments tend to be highly dynamic. Data is continuously generated and may have different security sensitivities depending on the context, on personal preferences or on the streaming values, all of which may change frequently and at a possibly very fine granularity.
B. Motivating Examples
Example 1: Protection against context-aware spam. People may want to block unwanted businesses from sending them advertisements based on their location or any other information. As a person is driving or walking, the device may adapt security constraints based on the proximity of the businesses and his/her preferences, limiting who is allowed to “see” the person. This helps to impede focused marketing efforts and to avoid receiving “context-aware spam”, i.e., services or information people do not know of or agree to.
Example 2: Privacy protection of personal health data. A patient may be living at home with a health monitoring device attached to him which can detect early health abnormalities and transmit alert signals to relevant personnel. However, the patient may prefer that only certain users, such as his doctor or a nurse, have access to his streaming data, and may want to prevent access by any third parties (e.g., insurance companies or other hospitals). Only if his vital signs go far above the norm and he is in imminent danger needing urgent care should the closest hospital or ER gain access to his streaming data.
We envision that individual devices transmitting streaming data will be able to inject their respective security restrictions together with the data. The policies, as will be illustrated, can be encoded into a compact format, and in most cases can be included in the same network message as the data. Thus little demand for additional network communication is expected.
In this paper, we assume that streaming data is transmitted securely from a data source to the streaming database. That is, the possibility of the data being intercepted and compromised on the network is beyond the scope of this paper.
C. Alternative Access Control Mechanisms
To motivate our approach, we sketch and compare alternative methods to enforce access control on streaming data.
Non-streaming: Store-and-probe approach. The policies on the streaming data are collected in one place and stored in a persistent table. Whenever access to the data is requested, the policy table is probed to see if the access should be granted or denied. The advantage of this approach is its simplicity. All policies are stored and updated in one place. The main disadvantage is its inability to cope with frequent policy changes. Every change in a policy would require an update to the policy table, and every request for data would require a lookup in this central place. A large number of data sources, fine granularity of policies, frequent policy changes and continuous lookups may create a bottleneck in the performance of a streaming system.
Streaming: Tuple-embedded approach. An alternative mechanism is to stream security restrictions embedded inside data tuples. This approach is similar to other works in the literature, where extra tuple fields are added for meta-data, e.g., tuple lineage in Eddies [1]. Different attributes in a tuple, however, may each have a distinct policy, which may lead to an explosion in tuple sizes, potentially seriously impacting the system performance. Moreover, tuples that arrive adjacent to each other in the stream are quite likely to be generated based on the same context (e.g., location, time), and thus may frequently have similar access control policies. Using this approach, tuples with identical policies would still carry their own (redundant) copy, and the query processor would still have to process every tuple individually to guarantee that no unauthorized access is granted. One possible improvement to minimize per-tuple storage overhead could be to encode policies as bitmaps, and then abstract policy-based filtering using bitmap operations (as suggested for query processing status in Eddies [2]). A bitmap representation is highly compressible, so the storage overhead would be reduced. However, even with compression, this approach still suffers from redundant storage and unnecessary per-tuple processing overhead.
Streaming: Punctuation-based approach. The third alternative, which we adopt in our work, is a punctuation-based approach, where security meta-data tuples are interleaved with the data tuples in the streams. The punctuation-based solution has several advantages over the above-mentioned approaches. First, the access control is dynamic and the speed of enforcement is fast, because security restrictions are streamed together with the data. Second, the security punctuations may be shared by multiple tuples that have similar policies. Thus no redundant copies of policies are stored, memory overhead is minimized and the security-related processing is shared. Policies in security punctuations can also be encoded in a bitmap format for compactness, thus further reducing security-related processing. For ease of readability, in the rest of the paper we present security punctuations in the alphanumeric format.
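To make the redundancy argument concrete, the following minimal sketch counts how many policy copies a run of tuples carries under the tuple-embedded approach versus the punctuation-based approach. The function names and the toy stream are illustrative assumptions and are not part of the framework; the sketch also assumes that a new sp is emitted only when the policy of adjacent tuples changes.

```python
from itertools import groupby

def tuple_embedded_copies(policies):
    """Tuple-embedded approach: every tuple carries its own policy copy."""
    return len(policies)

def punctuation_copies(policies):
    """Punctuation-based approach (assumed): one sp per run of adjacent
    tuples that share the same policy."""
    return sum(1 for _ in groupby(policies))

# Hypothetical stream: the policy (a set of roles) of each arriving tuple.
stream_policies = (
    [frozenset({"D", "ND"})] * 4
    + [frozenset({"C"})] * 3
    + [frozenset({"D", "ND"})] * 2
)

print(tuple_embedded_copies(stream_policies), punctuation_copies(stream_policies))
```

The longer the runs of tuples with identical policies, the larger the gap between the two counts.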
Fig. 1. System architecture. [Figure labels: Data Providers; Data Tuples; Security Punctuations (sp); Data Stream Management System (DSMS); SP Analyzer; queries Q1, Q2, ..., Qn-1, Qn; Query Results.]
D. Our Proposed Solution: SP Framework
We propose to stream security constraints called security punctuations (or sps for short)1, interleaved with the actual stream data, each describing an access control policy on the upcoming portion of the stream. An sp is a predicate that informs the stream processor of who has access to which streaming data, and when. A conceptual view of a stream with security punctuations is shown in Figure 1. Data sources emit sps based on the user specifications.
In our work, we distinguish between two types of users: (1) users providing the streaming data, termed data providers (DPs for short), and (2) users querying the streaming data, termed query specifiers (QSs for short). When query specifiers register continuous queries for execution, each query inherits the security restriction(s) associated with the query specifier registered to receive the results. When the stream data arrives at the server, the database engine examines the streaming data tuples’ policy stored in the sps and checks if the queries conform to the policy, discarding the data that no query has access rights to.
We would like to emphasize that we do not define a new access control model in this paper, such as RBAC [3] or MAC [4]. Instead, we propose an access control enforcement mechanism suitable for streaming data and investigate its interaction with query processing and optimization. As in most real environments [4], [3], we assume that data providers and users querying the data use the same access control model.
E. Our Contributions
• We introduce a novel sp model that supports declarative access control specification and enforcement on real-time streaming data.
• We propose a “security-aware” stream algebra, by enhancing the traditional algebra with security-aware extensions and new algebraic equivalence rules.
• We present a pipelined execution model enabled by the security-aware algebra and describe security-aware query optimization employing the new algebraic rules and cost estimations in query plan rewriting.
• Our experimental analysis on a real DSMS, CAPE [5], shows that the sp framework is superior to alternative access control mechanisms for streaming data in terms of both processing and memory.
The rest of the paper is organized as follows: We introduce our framework and describe our assumptions in Section 2. Section 3 presents the concept of security punctuations. In Section 4 we discuss the techniques for query processing and optimization with sps embedded into streams. Physical implementation and query optimization are presented in Sections 5 and 6. Our experimental evaluation of the sp framework is presented in Section 7. Section 8 reviews related work, and Section 9 concludes the paper.
1 We chose the name “security punctuations” because, by introducing sps into data streams, we subdivide, i.e., punctuate, infinite data streams into finite partitions with associated access control policies.
II. PRELIMINARIES
A. Subjects, Objects and Rights
The subject, object and right concepts are well known in access control [4]. An object is an entity that contains information. Access to an object implies the right to use the information it contains. Examples of objects in streaming systems are: streams, tuples, and tuple attributes. A subject may invoke a request to access an object, e.g., a request to read data. The subjects in our model are a set of users who specify continuous queries in the DSMS. We use the flat role-based access control (RBAC) model [3] as an example, as it is one of the most widely used access control models, and show how it can be implemented using sps. However, our framework is general and any other access control model, e.g., DAC, MAC [4], etc., can be implemented using sps. Query specifiers, i.e., subjects, activate their roles when they sign into the DSMS. We require that each query specifier belongs to at least one role. This assignment cannot be changed while he/she is registered to receive the results of any of the currently executing queries.
Subjects acquire rights, which are the set of privileges that they can hold and execute on an object. In this work, we consider a read right only. Since nearly all stream systems are currently read-only, this is a natural focus. Our model could be extended to support other rights as well, such as update, delete, etc. An access control policy corresponds to a set of rules indicating which objects the subjects are allowed to access. Authorization is the granting of such rights.
B. Streaming Model
We consider a centralized DSMS processing long-running queries on a set of data streams. A continuous data stream s is a potentially unbounded sequence of tuples that arrive over time. Tuples in the stream are of the form t = [sid, tid, A, ts], where sid is the stream identifier, tid is the tuple identifier2, A is a set of attribute values, and ts is the timestamp of the tuple. As commonly assumed in other streaming systems [6], [7], the timestamps of the stream elements are ordered. Similarly, sps also arrive in order [8]. Out-of-order sp arrival can be handled similarly to prior works [8], [9].
We consider a set of continuous queries {q1,...qp} executing over the data streams. Each query qi has an associated set of roles of the subjects registered to receive the results of that query. Queries are represented by query plans composed of operators op1,..., opk, where each operator inherits the roles associated with the queries for which it processes the data. Operators may be shared, hence an operator may acquire access rights from multiple queries.
2 This may be similar to a primary key in relational tables, or it may be a unique set of attributes used to identify a particular data provider, e.g., a patient id.
Fig. 2. Applicability of sec. punctuations. [Figure labels: a stream ordered by time, … sp1 sp2 sp3 tuplea tupleb tuplec sp4 tupleb …; annotations: sp-batch, s-punctuated segment.]
A DSMS server has a security punctuation analyzer component (Figure 1) which serves two purposes: (1) to combine the security punctuations with similar policies to reduce memory and processing overhead, and (2) to allow server-side specification of additional policies. In the latter case, the server policies are translated into the format of security punctuations and combined with the arriving data provider sps. This design allows organizations to enforce their own policies in addition to the ones specified by the data providers. For example, a hospital may add its own policies in addition to the policies specified by a patient on her streaming health data. We assume that server-specified policies may not override, but may further “refine” the data provider policies by adding constraints.
III. SECURITY PUNCTUATIONS (SPS)
A. Applicability of SPs
Security punctuations (sps) are meta-data introduced into a data stream to specify who has access rights to which streaming data, and when. Sps always precede the tuples for which they describe the access control policy (Figure 2). A policy may apply to: a (sub)stream, a tuple, or an attribute of a tuple. Generally, we refer to them as objects. The tuples between two consecutive punctuations form an s-punctuated segment, which defines the applicability scope of the preceding sp. Policies may be expressed by one or more sps and may apply to zero or more tuples. A set of consecutive sps are assumed to belong to the same sp-batch and are interpreted as a single access control policy. All sps of the same policy have the same timestamp ts, the time when the policy goes into effect. A policy Pj applicable to an object o at a time tsj overrides an earlier policy Pi (tsi < tsj) which was applicable to o. Tuples are completely unaware of sps in the stream.
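The way sps partition a stream can be illustrated with a short sketch. It groups consecutive sps into an sp-batch and pairs each batch with the tuples of the s-punctuated segment that follows it. The class and function names are hypothetical, and the stream is modeled as a plain Python iterable, which is a simplifying assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SP:
    roles: frozenset   # roles authorized by this sp
    ts: int            # time at which the policy goes into effect

@dataclass(frozen=True)
class DataTuple:
    tid: str
    ts: int

def segments(stream):
    """Yield (sp_batch, tuples) pairs: consecutive sps form one sp-batch,
    and the tuples up to the next sp form its s-punctuated segment."""
    batch, tuples = [], []
    for item in stream:
        if isinstance(item, SP):
            if tuples:                 # an sp after tuples closes the segment
                yield batch, tuples
                batch, tuples = [], []
            batch.append(item)
        else:
            tuples.append(item)        # tuples carry no policy of their own
    if batch or tuples:
        yield batch, tuples            # an empty batch means no sp covers the tuples

# Layout loosely following Fig. 2: three sps, three tuples, one sp, one tuple.
example = [
    SP(frozenset({"r1"}), 1), SP(frozenset({"r2"}), 1), SP(frozenset({"r3"}), 1),
    DataTuple("a", 2), DataTuple("b", 3), DataTuple("c", 4),
    SP(frozenset({"r1"}), 5), DataTuple("d", 6),
]
for sp_batch, seg in segments(example):
    print([sorted(sp.roles) for sp in sp_batch], [t.tid for t in seg])
```

A batch may also be followed by no tuples at all, in which case the segment list is simply empty.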
If there is no sp authorizing access to an object, denial-by-default is enforced; any request to access that object will be denied.
B. Security Punctuation Structure
Figure 3 illustrates a security punctuation structure. The Data Description Part (DDP) specifies which object(s) the access control policy applies to. The Security Restriction Part (SRP) denotes both the access control model type and the subjects authorized by the policy. Since we use role-based access control in this work, the SRP part of sps specifies RBAC as the model type and a set of role(s) that are authorized by the sp3. The Sign specifies if the authorization is positive or negative [10]. Finally, the Immutable field indicates if the sp can be combined with other (e.g., server-specified) policies.
3 We omit the access control model specification in the sps in the rest of the paper, since all sps are assumed to use the RBAC model.
Fig. 3. Security punctuation structure. [Figure content: < Data Description Part (DDP): Stream(s), Tuple(s), Attribute(s) | Security Restriction Part (SRP): Access Control Model Type & Value | Sign: + / - | Immutable: T / F | Timestamp: ts >]
[Sample data streams shown in Fig. 4(a):
s1: HeartRate Stream (Patient_id | Beats_per_min | Timestamp), e.g., 120 | 70 | Sep-12-05 9:17:00 ...
s2: BodyTemperature Stream (Patient_id | Temperature | Timestamp), e.g., 120 | 98.6 | Sep-12-05 9:21:00 ...]
Policies can be specified at the granularity of a stream, a tuple, or an attribute. Since many objects may share similar policies, we use regular expressions to describe objects and their policies inside sps. Regular expressions facilitate compact policy representation.
We now formally define security punctuations. Let eval(N, e) be a function that, given a set of values N and a regular expression e, returns a subset N_e ⊆ N that matches e. Let S = {s_1, ..., s_m} be the set of all streams, let T = {t_{i,1}, ..., t_{i,n}} be the set of all possible tuple identifiers in a stream s_i ∈ S, let A = {a_{i,j,1}, ..., a_{i,j,k}} be the set of attributes in a tuple t_{i,j} ∈ T and let R = {r_1, ..., r_l} be the set of all roles in the system. Let e_s, e_t, e_a, and e_r denote regular expressions specified against S, T, A and R, respectively. Let O = {o_s, o_t, o_a} be the set of objects, where o_s = (s̄), o_t = (s̄, t̄), and o_a = (s̄, t̄, ā), such that s̄ ∈ eval(S, e_s), t̄ ∈ eval(T, e_t) and ā ∈ eval(A, e_a). A security punctuation is then defined as follows.
Definition 3.1: A security punctuation sp is meta-data embedded into a stream that defines an access control policy P on a set of objects O and has the following form:
< DDP | SRP | Sign | Immutable | ts >
where sp.DDP = (e_s, e_t, e_a), sp.SRP = e_r and ts is the timestamp of the policy P. The semantics of the sp is the following:
• if Sign = ‘+’: a subject with role r̄ ∈ eval(R, e_r) may access any object o ∈ O at any time ts_access ≥ sp.ts.
• if Sign = ‘-’: a subject with role r̄ ∈ eval(R, e_r) is denied access to any object o ∈ O at any time ts_access ≥ sp.ts.
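As an illustration of Definition 3.1, the sketch below models an sp as a small record and evaluates its semantics for a single access request. The field and method names are hypothetical, and using Python's `re.fullmatch` as the regular-expression dialect is an assumption, since the definition does not fix a concrete syntax. The result `None` stands for "this sp does not apply".

```python
import re
from dataclasses import dataclass

def eval_regex(values, pattern):
    """eval(N, e): the subset of `values` that fully matches `pattern`."""
    return {v for v in values if re.fullmatch(pattern, v)}

@dataclass(frozen=True)
class SecurityPunctuation:
    e_s: str                 # DDP: regex over stream identifiers
    e_t: str                 # DDP: regex over tuple identifiers
    e_a: str                 # DDP: regex over attribute names
    e_r: str                 # SRP: regex over role names
    sign: str = "+"          # '+' grants access, '-' denies it
    immutable: bool = False
    ts: int = 0              # time at which the policy goes into effect

    def covers(self, sid, tid, attr):
        """True if the object (sid, tid, attr) is described by the DDP."""
        pairs = ((self.e_s, sid), (self.e_t, tid), (self.e_a, attr))
        return all(re.fullmatch(p, v) for p, v in pairs)

    def decision(self, role, sid, tid, attr, ts_access):
        """True = granted, False = denied, None = sp not applicable."""
        if ts_access < self.ts or not self.covers(sid, tid, attr):
            return None
        if not re.fullmatch(self.e_r, role):
            return None
        return self.sign == "+"

sp = SecurityPunctuation(e_s="s1", e_t=".*", e_a=".*", e_r="C|D", ts=10)
print(sp.decision("C", "s1", "120", "Beats_per_min", ts_access=12))
```

When no sp applies to a request, the surrounding system falls back to denial-by-default, which is why the sketch returns `None` rather than `False` in that case.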
If the field Immutable = false, the security punctuation may be combined with the server-specified policies applicable to the same objects. Otherwise, the sp is immutable, and the server-side policies are ignored4.
C. Security Punctuation Examples
Consider three data streams (Figure 4): HeartRate, BodyTemperature and BreathingRate and the set of roles R = {C,D,DM,E,GP,ND}. The following sps may be specified in these streams:
• Stream level policy: Only queries registered by a cardiologist (C) can query the stream HeartRate (s1).
• Tuple level policy: Only queries registered by a general physician (GP) can access data tuples (from any data stream) of patients with ids between 120 and 133.
• Attribute level policy: Only a doctor (D) or a nurse-on-duty (ND) can query the temperature and the heart beat from streams s1, s2.
4 For simplicity of presentation, we assume positive and mutable sps in the rest of our discussion. We omit the Immutable field. Unless noted otherwise, all sps are assumed to be mutable.
Fig. 4. Example of a stream environment. [(a) Sample Data Streams: s3: BreathingRate Stream (Patient_id | Frequency | Depth | Timestamp), e.g., 120 | 8 | 38 | Sep-12-05 9:22:00 ... (b) Sample Roles: Cardiologist (C), General Physician (GP), Doctor (D), Dermatologist (DM), Nurse-on-Duty (ND), Hospital Employee (E).]
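One possible way to write the data description parts of the three example policies as regular expressions is sketched below. The concrete patterns, the dictionary layout and the helper `applicable` are illustrative assumptions (Python regex syntax, patient id used as tuple identifier); they are not the notation used in the paper.

```python
import re

# name -> (stream regex, tuple-id regex, attribute regex, authorized roles)
POLICIES = {
    "stream-level": (r"s1", r".*", r".*", {"C"}),
    "tuple-level": (r".*", r"12[0-9]|13[0-3]", r".*", {"GP"}),
    "attribute-level": (r"s1|s2", r".*", r"Temperature|Beats_per_min", {"D", "ND"}),
}

def applicable(sid, tid, attr):
    """Names of the example policies whose DDP matches the given object."""
    return [
        name
        for name, (e_s, e_t, e_a, _roles) in POLICIES.items()
        if re.fullmatch(e_s, sid) and re.fullmatch(e_t, tid) and re.fullmatch(e_a, attr)
    ]

print(applicable("s2", "133", "Temperature"))
print(applicable("s3", "134", "Depth"))
```

The tuple-level pattern shows how a numeric range such as 120 to 133 can be expressed compactly with alternation and character classes.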
D. CQL Extensions to Support SPs
We have extended the Continuous Query Language (CQL) [11] to support the specification of security punctuations. Since the roles of queries are inferred from the subjects specifying the queries, the CQL query syntax has been left unchanged. The proposed sp declaration syntax is:
INSERT SP [[AS] sp name ] INTO STREAM [stream name | stream id ]
LET [sp name.]DDP = ,
[sp name.]SRP = ,
[[sp name.]SIGN = { positive | negative },]
[[sp name.]IMMUTABLE = { true | false }]
E. Preserving Correct Security Semantics
For manipulating sps on the server, the following four operations are used: match(), union(), intersect() and override(). match() identifies which tuples are related to an sp based on the regular expressions in sp.DDP. union() performs the union and intersect() the intersection of the policies, respectively. With intersection semantics, the access to the data decreases as additional sps are applied. Conversely, with union semantics, access to data increases as additional sps are applied. override() replaces a policy with a new policy. When multiple sps are applicable to the same tuples, the following three design choices are used to preserve the correct security semantics.
• union() is used when multiple sps arrive from the same data provider and the sps have the same timestamp. Here the sps represent a single policy and thus are union-ed together.
• intersect() is used when combining the data provider and the server-specified sps. This is done to disallow the server policies from increasing the access on the data. Alternatively, a data provider can also set the Immutable field of the sps to true, thus preventing any modification to his/her policies on the server side.
• override() is used when multiple sps arrive from the same data provider and the sps have different timestamps. The sp with the more recent timestamp overrides any earlier sps applicable to the same object(s).
IV. SECURITY-AWARE QUERY PROCESSING
A. Alternative Approaches
With security punctuations embedded in data streams, query results may be produced as follows:
Pre-filtering. Each query may have its own access control filter installed, which pre-filters arriving tuples based on the access rights of the query before they enter the query plan. This pre-filtering discards the sps, since the streams would only contain tuples that the query is authorized to access. Thus, query plans can simply consist of traditional query operators. This approach, however, forces each query to be executed separately, even if queries share subexpressions but have different access rights. Fixing the placement of the access control filtering at the beginning of the plan may also add a significant cost compared to performing it later.
Post-filtering. Conversely, queries can be executed first, and the results are then filtered afterwards based on the access rights of the queries. This approach is advantageous when the selectivity of the query operators is high and the access rights are loose. Using post-filtering, query execution can be shared just as in regular query processing. But again, the access control filtering being fixed (this time at the end of the query plan) may introduce unnecessary processing overhead in some circumstances. A lot of work may be done by expensive operators, only for the results to be discarded later because of access rights limitations.
Intermediate filtering. The pre- and post-filtering approaches can in many cases be prohibitively expensive. Thus, it may be beneficial to place security predicates in the middle of the query plan. Such interleaving may significantly reduce the cardinality of intermediate results.
To make access control filtering flexible, we isolate this functionality into a special-purpose operator that can be placed anywhere in the query plan. The goal of this operator is to discard data tuples that a query has no access rights to. This filtering is based on the roles associated with the queries and the tuples’ policies represented by the streaming sps. We introduce a novel Security Shield (SS) operator for this purpose. In addition to flexible placement, SS operators facilitate efficient sharing of query plans. Figure 5 illustrates an example of a shared query plan for three queries, Q1, Q2, and Q3, with embedded SS operators.
Fig. 5. Security-enhanced query plan. [Figure labels: input streams s1, s2, s3, s4; queries Q1, Q2, Q3; SS operators; legend: shared subplan, SS operator.]
B. Security-Aware Query Algebra
To enable security-aware query processing and optimization, we extend stream algebra [12] to become security-aware. Table I summarizes the definitions of the operators in the algebra5. “Security Shield” (ψ) is a new operator designed to support access control filtering based on streaming sps and the security predicates defined by the queries’ access rights. SS checks the streaming sps, and if an access control policy does not satisfy the predicate of SS, the tuples and their sps are discarded, thus preventing unauthorized access. The SS operator can be viewed as a “select operator” that filters tuples based on the streaming metadata, i.e., sps.
5 To keep the presentation concise, we do not describe security-aware set operations in this paper.
TABLE I
SECURITY-AWARE ALGEBRA.
Notation: Let
• t ∈ T be a tuple in a data stream T
• P_t be the access control policy of the tuple t represented by sps
• p be a security predicate, i.e., a set of roles associated with a query
Security Shield (SS): ψ, with security predicate p
• (t, P_t) ∈ ψ_p(T) iff P_t ∩ p ≠ ∅
Projection: π, with an attribute a_i
• (t, P_t) ∈ π_{a_1...a_n}(T) iff t consists of a_i and P_t ≠ ∅
Selection: σ, with a condition c
• (t, P_t) ∈ σ_c(T) iff t satisfies c and P_t ≠ ∅
Join: ⋈, with a join condition c
• (t, P_t) ∈ T ⋈_c E iff t ∈ T × E and satisfies c and P_t^T ∩ P_t^E ≠ ∅, where P_t^T and P_t^E are the policies of the base tuples from T and E
Duplicate Elimination: δ
• (t, P_t) ∈ δ(T*) iff t ∈ T, T* ⊆ T, P_t ≠ ∅ and there is no t′ ∈ T* such that t = t′
Group-by: G_A^agg, with agg aggregate function and attribute A
• (t, P_t) ∈ G_A^agg(T) iff t ∈ T, A ∈ t and P_t ≠ ∅
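Two rows of Table I lend themselves to a short sketch: the Security Shield filter and the policy of a join result. The code below is only an approximation under stated assumptions: each policy is carried by a single sp (sp-batches are not merged), policies are plain sets of role names, and the names `SP`, `security_shield` and `join_policy` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SP:
    roles: frozenset   # the policy P_t of the tuples that follow
    ts: int

def security_shield(stream, predicate):
    """Approximates ψ_p: pass an sp and its tuples only if the policy
    shares at least one role with the security predicate p."""
    allowed = False                  # nothing passes before an authorizing sp
    for item in stream:
        if isinstance(item, SP):
            allowed = bool(item.roles & predicate)
            if allowed:
                yield item           # the sp is propagated ahead of its tuples
        elif allowed:
            yield item               # tuple covered by an authorizing sp

def join_policy(p_left, p_right):
    """Policy of a join result: the intersection of the base tuples'
    policies, or None when they are incompatible (empty intersection)."""
    common = p_left & p_right
    return common or None

stream = [
    SP(frozenset({"r1", "r2"}), 1), "a", "b",
    SP(frozenset({"r3"}), 2), "c",
    SP(frozenset({"r2", "r3"}), 3), "d",
]
print([x for x in security_shield(stream, {"r2"}) if not isinstance(x, SP)])
```

Because the shield is an ordinary stream-to-stream function, it can be composed before, between or after other operators, which is the flexibility the intermediate-filtering approach relies on.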
Projection (π) is a unary operator that processes new tuples by discarding unwanted attributes on the fly. This operator simply propagates the streaming sps and thereafter the projected tuples. If an sp describes a policy for only the attributes that are projected out, the sp is discarded from the stream by the project operator as well.
Selection (σ) is a unary operator that drops tuples that do not satisfy the query selection condition. A select operator “delays” sp propagation until at least one of the tuples with the policy described by the sp satisfies the select condition. Otherwise, if all tuples with the same policy are filtered, their sp(s) are discarded as well.
Join (⋈) is a binary operator that joins the tuples of its input streams. The following modifications are made to the join operator to make it sp-aware: (1) the policies represented by sps are stored together with tuples in the stream window, (2) if a tuple joins with another tuple, the policies of the base tuples are intersected. If the intersection is empty, the join results are discarded, because the base tuples’ policies are incompatible. Otherwise, the results are sent to the output stream, preceded by the sp(s) depicting the intersection of the base policies.
Duplicate elimination (δ) over a sliding window stores both its input and its current output tuples, and at all times the output contains exactly one tuple with each distinct value v present in the input. The policies represented by sps are stored together with the tuples in the output state. When a new tuple with a duplicate value arrives, its policy is intersected with the policy of the tuple in the output state. Let Pold be the policy of the tuple with value v in the output state and let Pnew be the policy of the new tuple with the same value v. There are three cases to consider: (1) Pold ∩ Pnew = ∅. If the policy intersection is empty, the previously outputted result was not accessible by the queries that may access the new tuple with value v.
Thus, we send v to the output stream, preceded by sp(s) describing the policy Pnew. We also store Pnew in the output state. (2) Pold ∩ Pnew = Pnew. If the policy intersection is not empty and equals Pnew, we do not send anything to the output stream, because the previously outputted tuple was accessible. (3) Pold ∩ Pnew ≠ Pnew and Pold ∩ Pnew ≠ ∅. If the policy intersection is not empty and it is also not equal to Pnew, we output a result with the policy that describes the roles that are in Pnew but not in the intersection of the policies Pold and Pnew, i.e., Pnew - (Pold ∩ Pnew).
Group-by (G_A^agg) incrementally updates the value of a given aggregate for each group. Similar to [12], we do not consider aggregation as a separate operator, as it can be represented as a group-by with a single group. For each new input, we determine which group it belongs to and return an updated result for the group, which is understood to replace a previously reported answer for this group. In an sp-aware group-by operator, each attribute group (AG) is partitioned into attribute subgroups (ASGs), where each ASG contains tuples with the same attribute value A and policies that do not intersect with those of other ASGs with the same attribute. A result is calculated for each ASG and then sent to the output stream preceded by the subgroup’s policy.
V. PHYSICAL IMPLEMENTATION
In this section, we describe the physical implementation of two operators, namely the SS and the SAJoin. The security-aware extensions of other operators are similar and are omitted for conciseness.
A. ‘Security Shield’ (SS) Operator
Security Shield (SS) is a stateful filter operator. The state contains a set of security predicates denoting the roles of the upstream operators in the query plan. When a new security punctuation sp arrives, SS determines if the sp corresponds to the policy that is buffered in the state, or if it marks the beginning of a new policy. As mentioned in Section III, all sps that belong to the same policy have the same timestamp ts. If the newly arrived sp has a more recent timestamp than the policy currently buffered in the SS state, it replaces the old policy. When a new tuple t arrives, SS interprets the policy in the state as the policy for this tuple. SS proceeds to check if t’s access control policy matches any of the roles in the SS state. If there is no match (the intersection is empty), the tuples following the sp are discarded to prevent unauthorized access. To speed up the processing by the SS operator, we can use a predicate index on the roles in the SS state, similar to the grouped filter in CACQ [1] and PSOUP [13].
B. Security-Aware Join (SAJoin) Operator
SAJoin is a sliding window equijoin algorithm. We first describe the nested-loop SAJoin, and then introduce the optimized index version of the operator. SAJoin maintains a time-based sliding window. We employ a list structure to link all tuples and sps in chronological order (most recent at the tail). Security punctuations are interleaved with tuples in the window, and thus the tuple list is “partitioned” by the sps into s-punctuated segments, where the tuples in each segment share the same policy.
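The window that SAJoin maintains, a chronological list in which sps partition the tuples into s-punctuated segments, can be sketched as follows. This is a rough illustration only: the class names are hypothetical, one sp per policy is assumed, and the exact expiry condition (a tuple expires once it is at least `size` time units old) is an assumption. Purging an sp once all tuples of its segment have expired follows the invalidation step of the nested-loop SAJoin in this section.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Segment:
    roles: frozenset                               # policy shared by the tuples
    tuples: deque = field(default_factory=deque)   # (timestamp, value) pairs

class SPWindow:
    def __init__(self, size):
        self.size = size
        self.segments = deque()    # window head at the left, tail at the right

    def add_sp(self, roles):
        """A newly arrived sp opens a new s-punctuated segment at the tail."""
        self.segments.append(Segment(frozenset(roles)))

    def add_tuple(self, ts, value):
        """Append a tuple to the most recent segment; drop it if no sp has
        arrived yet, since no query could be authorized to read it."""
        if not self.segments:
            return False
        self.segments[-1].tuples.append((ts, value))
        return True

    def expire(self, now):
        """Remove expired tuples from the head, and purge an sp once its
        segment is empty and a newer segment exists."""
        while self.segments:
            head = self.segments[0]
            while head.tuples and head.tuples[0][0] <= now - self.size:
                head.tuples.popleft()
            if head.tuples or len(self.segments) == 1:
                break              # keep live tuples and the current policy
            self.segments.popleft()

w = SPWindow(size=10)
w.add_sp({"r1"}); w.add_tuple(1, "a"); w.add_tuple(2, "b")
w.add_sp({"r2"}); w.add_tuple(8, "c")
w.expire(now=12)
print([(sorted(s.roles), list(s.tuples)) for s in w.segments])
```

The most recent segment is kept even when it is empty, because tuples that arrive next still fall under its policy.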
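Fig. 6. Index SAJoin. [Figure labels: sliding windows W[s1] and W[s2] with window head and window tail along the time axis; SP-Index for s1 and for s2; r-node array with r-head and r-tail pointers (e.g., r3.r-head, r3.r-tail); sps labeled with role sets such as r1; r2; r3; r1,r2; r1,r3; r1,r2,r3.]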
1) Nested-loop SAJoin: We describe the SAJoin algorithm regarding processing tuples and sps from an input stream s1. Processing for the opposite stream s2 is symmetric:
1) Policy Collection. As sps are arriving, they are stored in the sliding window. They represent the policy for the upcoming tuples.
2) Invalidation. When a new tuple t1 is retrieved from the stream s1, it is used to invalidate the expired tuples from the head of the window of the stream s2. If all tuples from an s-punctuated segment have been invalidated, their corresponding sps (describing their policy) are purged from the head of the window as well.
3) Join. After invalidation is done, the join value of t1 is used to probe the window of the stream s2. If t1 joins with another tuple, the base tuples’ policies are intersected. If the intersection is empty (policies are incompatible), the join result is discarded. We term this approach the probe-and-filter (PF) method. Alternatively, we can first use the policy of the tuple t1 to find all the policy-wise compatible tuples and then probe only those tuples against the join value of t1. This approach is denoted as the filter-and-probe (FP) method.
The nested-loop SAJoin has the weakness that it scans the entire window of the opposite stream to determine with which tuples the new tuple can join.
2) Index SAJoin: The optimized SAJoin employs an index structure for sps, termed the Security Punctuation Index (or SPIndex, for short), designed for efficient lookup of policy-wise compatible tuples from the opposite stream. SPIndex consists of two components: the r-node array representing all possible roles in the system and the index entries for the sps currently in the window (see Figure 6). An index entry for an sp with multiple roles is depicted by an edge with vertices on it corresponding to every role in the sp. For example, the second sp in the SPIndexs1 has two associated roles, r1 and r3.
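As a rough illustration of the two SPIndex components, the sketch below keeps an r-node array keyed by role and a single index entry per sp that is reachable from every r-node of its roles. The probe anticipates the skipping rule (Lemma 5.1) stated later in this section. All names are hypothetical, removal of expired entries is omitted, and one point is an assumption about how the lemma is meant to be read: the skipping test is applied to the smallest role that the entry shares with the probing policy, so that an entry is skipped only if it was really reached through an earlier r-node.

```python
from collections import deque

class SPIndex:
    def __init__(self, all_roles):
        # r-node array ordered by role id; each r-node lists its index entries
        self.rnodes = {r: deque() for r in sorted(all_roles)}

    def add(self, sp_id, sp_roles):
        """Create one index entry for the sp and append it at the r-tail of
        every r-node whose role the sp contains."""
        entry = (sp_id, tuple(sorted(sp_roles)))
        for r in entry[1]:
            self.rnodes[r].append(entry)

    def probe(self, policy):
        """Ids of indexed sps that share a role with `policy`, each once."""
        policy = set(policy)
        hits = []
        for r in sorted(policy):                    # visit r-nodes in role order
            for sp_id, sp_roles in self.rnodes.get(r, ()):
                first_common = next(x for x in sp_roles if x in policy)
                if first_common < r:
                    continue                        # already seen via an earlier r-node
                hits.append(sp_id)
        return hits

idx = SPIndex({"r1", "r2", "r3"})
idx.add("sp_a", {"r1"})
idx.add("sp_b", {"r1", "r3"})    # one entry, reachable from r1 and from r3
idx.add("sp_c", {"r2"})
print(idx.probe({"r1", "r3"}))
```

Probing with the policy {r3} alone still reaches sp_b, because r1 is not a role the entry shares with that policy.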
SPIndex properties are summarized as follows:
(1) An r-node represents a role and points to the linked list of sp index entries containing that role.
(2) Each r-node has an r-head and an r-tail pointer to the start and the end of the list of its index entries, respectively.
(3) New index entries are always added at the r-tail. An expired index entry is always removed from the r-head.
(4) The r-node array is ordered by the role id.
(5) A single index entry per sp is created. If an sp has multiple roles, multiple r-nodes have a pointer to its index entry.
(6) Each index entry has a pointer to its physical sp in the sliding window.
We describe the index SAJoin algorithm for processing of a
stream s1. We skip the Policy Collection and the Invalidation steps as they are the same as in the nested-loop SAJoin. When a new sp arrives, it is inserted into the window; a new index entry is created and linked to the sp in the window. The r-node entries for the roles in the sp are updated to point to the new index entry. When a new tuple t1 arrives, it is inserted into the window. Then the policy of the tuple t1 is used to probe the SPIndex of the opposite stream s2 as follows: for each role $r_i$ in the policy of t1, the r-node for that role is accessed from the opposite stream’s sp index, $\text{r-node} = \mathrm{SPIndex}_{s_2}[r_i]$. All index entries from the r-head to the r-tail of the r-node for role $r_i$ are scanned. The tuple t1 is joined with the tuples governed by the sps whose index entries are encountered during the scan. When the scan reaches the end (as indicated by the r-tail), the next role $r_j$ from the policy of the tuple t1 is extracted. Its r-node entry is accessed, and the process is repeated. To eliminate duplicate join processing for tuples that may have more than one role in common, we introduce the following lemma:
Lemma 5.1 (Skipping Rule): Let $r$ be the role of the r-node being processed. Let role $\bar{r}$ be the first role in an sp pointed to by an index entry. If $\bar{r} < r$, then the index entry containing $\bar{r}$ must have already been processed by the SAJoin, and thus should be skipped.
The proof of correctness of Lemma 5.1 is based on the ordering of roles in the r-node array of the SPIndex and in security punctuations. If the first role in an sp has a smaller order than the current r-node role, then we have already processed the tuples whose policies contain that role, and hence can skip the entry. Thus, the skipping rule prevents duplicate join processing when tuples’ policies have more than one role in common.
VI. SECURITY-AWARE QUERY OPTIMIZATION
A. Security-Aware Cost Model
Each candidate query plan is associated with a per-unit-time cost, similar to [12], [14].
For each operator, we define $\lambda_1$ (and $\lambda_2$) to be its input tuple rates and $\lambda_{sp_1}$ (and $\lambda_{sp_2}$) its input sp rates. If the operator is unary, then $\lambda_2 = 0$ and $\lambda_{sp_2} = 0$. Let $N_1$ and $N_2$ denote the expected number of tuples in the input streams’ windows. We can compute $N_1$ and $N_2$ from the window size $W$ and the stream arrival rates as $N_1 = W \lambda_1$ and $N_2 = W \lambda_2$. Similarly for sps, $N_{sp_1} = W \lambda_{sp_1}$ and $N_{sp_2} = W \lambda_{sp_2}$. $N_o$ denotes the expected output size and $N_{sp_o}$ the size of the output sps.
The SS operator processes each tuple in constant time, whereas each sp must scan the entire SS state. Therefore, the SS cost is $\sum_i (\lambda_i + \lambda_{sp_i}(N_{R_{sp}} + N_R))$, where $i$ ranges over the input streams of the SS operator, $N_R$ is the expected size (in number of roles) of an SS state, and $N_{R_{sp}}$ is the expected size (in number of roles) of an sp. Selection and projection process each tuple and sp in constant time; therefore their cost is $\sum_i (\lambda_i + \lambda_{sp_i})$.
The nested-loop SAJoin cost is $\lambda_1 (N_2 + N_{sp_2}) + \lambda_2 (N_1 + N_{sp_1})$ per unit time. For the index SAJoin, the join cost is $\lambda_1 \sigma_{sp} (N_2 + N_{sp_2}) + \lambda_2 \sigma_{sp} (N_1 + N_{sp_1})$, where $\sigma_{sp} \in [0, 1]$ is the sp selectivity. If $\sigma_{sp}$ equals 1 (all tuples have compatible policies), then a new tuple is probed against all tuples from the opposite window, which is equivalent to the nested-loop join. The sp maintenance cost in the index SAJoin (insertion/deletion of an sp from the SPIndex) can be described by $N_{R_{sp}} (\lambda_{sp_1} + \lambda_{sp_2})$.
The cost of duplicate elimination $\delta$ is $\lambda_1 (N_o + N_{sp_o})$, as every new tuple scans the output, which is sorted by expiration time, and checks the tuples’ policies. For the group-by cost, we denote the cost of re-computing an aggregate by $C$. This cost depends on the number of groups, which is determined by the distribution of values and sps, and on the complexity of the aggregate. The cost of group-by is $2C(\lambda_1 + \lambda_{sp_1})$, because every tuple changes the value of an aggregate twice: once when it arrives and once when it expires.
B. New Algebraic Rules
Here, we introduce security-aware algebraic rules, which facilitate security-aware query optimization (see Table II).
SS splitting/merging, described by Rule 1, allows us to split and merge SS operators. SS splitting means that the roles in the SS state are split, and a separate SS operator is created for each split state. Merging of SSs is the reverse. This operation is useful in several cases: (1) splitting large SS states can optimize sp processing, if one split state is more selective than the other, and (2) merging/splitting SSs before and after a shared query subplan allows sharing query subplans for queries with different access rights.
SS interleaving allows us to push SS through (up or down) other operators. Rules 2–5 together assert that SS operations can be swapped with other operators, such as select or join, thus achieving the interleaving. In this way, the most efficient placement of access control-based filtering in a query plan can be realized.
TABLE II
EQUIVALENCE RULES
Rule 1: Splitting/merging rule for SS ($\psi$)
• $\psi_{p_1 \wedge p_2 \wedge \ldots \wedge p_n}(T) \equiv \psi_{p_1}(\psi_{p_2}(\ldots(\psi_{p_n}(T))))$
Rule 2: Commutative rules for SS ($\psi$)
Commute several SS operators:
• $\psi_{p_1}(\psi_{p_2}(T)) \equiv \psi_{p_2}(\psi_{p_1}(T))$
Commute projection and SS:
• $\pi_{attr}(\psi_p(T)) \equiv \pi_{attr}(\psi_p(\pi_{attr'}(T)))$, if $attr' = attr \cup attr''$, where $attr'' = tid$
  $\pi_{attr}(\psi_p(T)) \equiv \psi_p(\pi_{attr}(T))$, otherwise
Commute selection and SS:
• $\sigma_c(\psi_p(T)) \equiv \psi_p(\sigma_c(T))$
Commute duplicate elimination and SS:
• $\delta(\psi_p(T)) \equiv \psi_p(\delta(T))$
Commute group-by and SS:
• $G_A^{agg}(\psi_p(T)) \equiv \psi_p(G_A^{agg}(T))$
Rule 3: Pushing SS ($\psi$) over binary operators
• $\psi_p(T \,\Theta\, E) \equiv \psi_p(T) \,\Theta\, E$, if only $T$ streams policies
  $\psi_p(T \,\Theta\, E) \equiv \psi_p(T) \,\Theta\, \psi_p(E)$, if both $T$ and $E$ stream policies, $\forall\, \Theta \in \{\bowtie, \times, \cup, \cap\}$
Rule 4: Commutative rule for binary operators
• $\psi_p(T \,\Theta\, E) \equiv \psi_p(E \,\Theta\, T)$, $\forall\, \Theta \in \{\bowtie, \times, \cup, \cap\}$
Rule 5: Associative rule
• $\psi_p((T \,\Theta\, E) \,\Theta\, K) \equiv \psi_p(T \,\Theta\, (E \,\Theta\, K))$, $\forall\, \Theta \in \{\bowtie, \times, \cup, \cap\}$
C. Query Optimization: Putting It All Together
Security-aware query plans can be optimized similarly to traditional relational plans (e.g., selection pushdown and join
Fig. 7. Comparison of access control enforcement mechanisms (store-and-probe, tuple-embedded policies, security punctuations). (a) Output rate vs. sp to tuple ratio. (b) Processing cost per tuple vs. sp to tuple ratio. (c) Memory cost vs. policy size. (d) Processing cost per tuple vs. policy size.
enumeration). Beyond that, we now employ the new security-aware rules. An SS operator is split when there is a subset of roles in the SS state that is more selective than the rest, and merged when the cardinality of the merged SS does not increase with the conjunction of the separate SSs. In addition, SS splitting/merging can be used for multi-query optimization. SS operators can be merged when queries can share a query subplan and the individual processing costs are high. SS operators are merged at the “beginning” and then split at the “end” of the shared query subplan.
SS interleaving rules push down (or up) SS operators to minimize the sizes of the intermediate states and the number of streaming sps. This particularly affects join, duplicate elimination and group-by, whose processing costs increase with the number of sps they have to process (e.g., policy compatibility checking, sp indexing). Moreover, the selectivity per SS can be maintained. If the selectivity of some roles in the SS state is low and that of others is high, the SS state can be split and the SS with lower selectivity pushed up.
These rules are consistent with relational optimization rules. For example, pushing down SS operators is analogous to predicate push-down, and pushing SS below joins typically decreases the cardinality of intermediate results. These similarities suggest that sp-awareness can be easily incorporated into existing cost models and optimizers.
VII. EXPERIMENTAL ANALYSIS
A. Experimental Setup
We have implemented our proposed security framework in a real DSMS, CAPE [5]. We run CAPE on an Intel Pentium IV 2.4 GHz CPU with 1 GB RAM, running Windows XP and Java SDK 1.5.0.06. We use the Network-based Moving Objects Generator [15] to generate 110K moving objects (e.g., cars, pedestrians with GPS devices) travelling in the city of Worcester, MA, USA. Objects continuously and selectively restrict access to their current location.
The security punctuations in the data streams describe tuple-granularity access control policies on the location updates of the moving objects and are sent to the DSMS together with those updates. We chose tuple-level policies because this is probably the most common granularity of security in such mobile environments. All tuple policies are described by single security punctuations; thus for any s-punctuated segment, a single sp describes the access control policy for the objects in that segment. Unless mentioned otherwise, we execute a select-project query on the location of moving objects, e.g., a store issuing
Fig. 8. SS operator overhead, compared with the project and select operators. (a) SS cost (in msec) vs. sp to tuple ratio. (b) SS cost (in msec) vs. role count.
a query: Continuously retrieve all moving objects in the two mile region around the store (to send sale advertisements to their cell phones). We chose such a query, as it allows us to compare the cost of the access control enforcement mechanism with respect to the cheapest possible operators in a query plan. Roles of query specifiers are associated with continuous queries at compile time and do not change throughout the runtime query execution. Roles r1, r2, ..., rn represent roles of subjects encountered in real-life scenarios; for example, r1 represents a family member, r2 a manager from work, and r3 a retail store.
B. Effectiveness of Security Punctuations
Our first set of experiments compares the performance of the three alternative access control enforcement mechanisms on streaming data as introduced in Section I-C. Figure 7a depicts the three approaches with respect to their output rates. The x-axis shows the sp to tuple ratio: the 1/1 ratio means every tuple has a unique policy, whereas 1/100 means that 100 tuples share the same policy. The results show that the policy distribution has almost no effect on the tuple-embedded approach, because policies are stored in their entirety regardless of whether they can be shared or not. The other two methods exploit policy sharing, which results in a higher output rate.
Figure 7b shows the processing time for the three alternatives. The store-and-probe method has the highest processing cost until the sp to tuple ratio reaches 1/25. This is somewhat expected, because with frequent unique policies in the stream, more processing must be done by this method: policies must be extracted and stored, and when access control is checked, the policy table must be probed. The sp model significantly outperforms the other two models, especially when more tuples share the same sp.
Figures 7c and 7d illustrate the memory consumption and the processing costs, respectively, when the sizes of policies are varied.
Here we consider policies with a lot of individual role authorizations, such that regular expressions cannot help minimize the policy definition. We use the number of roles
Fig. 9. SAJoin with varying sp selectivity. Processing time (in msec) of the nested-loop SAJoin and the index SAJoin for $\sigma_{sp}$ = 0, 0.1, 0.5 and 1, broken down into total, join, sp maintenance and tuple maintenance.
in an sp (denoted by |R|) as the measure of policy size. For this experiment, the sp to tuple ratio was set to 10. The tuple-embedded approach is greatly penalized when the sizes of the policies increase. The store-and-probe method begins to outperform the sp model in memory utilization when policies exceed 25 role authorizations (|R| > 25). This is because in the store-and-probe case a single copy of a large policy is maintained for all tuples, whereas in the sp model, several “large” sps (with possible policy overlaps) may stream concurrently in the system. Figure 7d illustrates the processing costs per 100 tuples with varied policy sizes. As policies become larger, the tuple-embedded approach has the highest processing cost compared to the other two alternatives.
C. Cost of Security Shield Operator
Figures 8a and 8b show the SS processing cost as compared to the select and project operators. The SS cost is highly sensitive to the policy distribution, as indicated by varying the sp to tuple ratio. When each tuple has its own sp (sp/tuple = 1/1), SS has almost the same cost as the select operator, which is not surprising, because it executes a selection on sps, one per tuple. But as more tuples share sps, the overhead decreases dramatically. This is because once an sp has been processed, the decision to propagate or discard applies to all tuples that follow it. Thus the more tuples share a policy, the smaller the SS overhead becomes.
Figure 8b illustrates the effect of the distribution of roles of query specifiers who want to access the query results on the cost of SS. The larger the SS state, the larger its overhead. Even so, the cost does not go beyond 20% of the total query cost. To reduce the overhead with large SS states, an indexing technique that facilitates faster role lookup can be exploited, or the SS can be split and the lower-selectivity SS pushed up.
D.
SAJoin Performance
Here we present our experimental results comparing the nested-loop and index versions of SAJoin. We measure the total time (per 100 tuples) and all separate costs that contribute to the overall processing cost, namely join time, sp maintenance, and tuple maintenance.
The graph in Figure 9 shows the SAJoin performance with varying sp selectivities. $\sigma_{sp} = 0$ means no tuples have compatible policies, and thus none should be joined. $\sigma_{sp} = 1$ means all tuples have compatible policies, and thus all should be joined. In all cases, the index SAJoin outperformed the nested-loop SAJoin. The sp maintenance cost remains relatively low. When $\sigma_{sp} = 0$, although the join time of the index SAJoin is significantly smaller than in the nested-loop algorithm, the high sp maintenance cost keeps the overall processing cost high. However, it still outperforms the alternative by 7% (in total cost) and by 75% (in join cost). When $\sigma_{sp} = 1$, the index SAJoin outperforms the nested-loop SAJoin by 2% in total cost and by 28% in join cost. The skipping rule contributes to the improved performance of the index SAJoin. Tuples are joined only with policy-wise compatible tuples, and the processing is done only once, no matter how many roles their policies may have in common.
E. Summary of Experimental Results
1) The sp-based approach is more effective in enforcing access control on real-time streaming data than the alternative approaches.
2) The SS overhead is minimal, and gets smaller when policies among tuples tend to be similar, so that more sps can be shared.
3) The join processing can be significantly optimized by SPIndex. This is because tuples are joined only with the policy-wise compatible tuples.
VIII. RELATED WORK
Streaming Databases. In the past few years, streaming databases have become a hot topic [6], [7], [9], [11], [13], [16]. Punctuations as substream delimiters inside data streams were first presented in [17]. PJoin [18] and PWJoin [19] apply punctuations to achieve join optimizations on streaming data. [20] uses punctuation-like annotations to inject dynamic schema knowledge into XML streams to facilitate query optimization and out-of-order processing. [21] uses punctuations for execution safety checking of continuous join queries (CJQs). Our work is the first to employ security-related semantics in the form of security punctuations.
Security and Privacy Preservation. Agrawal et al.
coined the concept of Hippocratic databases [22] to incorporate privacy protection within an RDBMS. The authors propose using privacy metadata. This work, however, does not address the dynamic changes characteristic of streaming environments. Preserving privacy by ensuring limited disclosure of data in an RDBMS was explored by LeFevre et al. [23]. The implementation is based on query modification techniques, which have several limitations in the context of streaming systems. First, queries in a DSMS are typically long-running, thus policies may change many times during the execution of a query. Modifying a query plan at runtime for every policy modification in highly dynamic environments is unacceptably expensive. Recently, [24], [25], [26] proposed to add security features to streaming databases. However, the proposed approach is static and built on top of the query engine. Our solution is integrated into query processing and optimization,
which allows more efficient and security-aware query execution.
IX. CONCLUSIONS
We have proposed a novel approach to enforce access control in streaming environments using security punctuations. This work makes three important contributions: (1) a scheme for defining security semantics on streaming data; (2) a query processing and optimization framework that is aware of and compliant with the security restrictions; (3) an implementation and investigation of the security mechanism and its effect on query processing. The significance of our experimental results is that they show that our sp framework, integrated as a part of query processing, has very low overhead and outperforms alternative approaches. In the future, we plan to explore incremental access control policies and runtime changes in subjects’ role assignments and their effect on query processing.
ACKNOWLEDGMENT
This work was supported by the National Science Foundation under Grant No. 0430274 and the sponsors of CERIAS.
REFERENCES
[1] S. Madden et al., “Continuously adaptive continuous queries over streams,” in SIGMOD, 2002, pp. 49–60.
[2] R. Avnur and J. M. Hellerstein, “Eddies: Continuously adaptive query processing,” in SIGMOD, 2000, pp. 261–272.
[3] R. S. Sandhu, E. J. Coyne, et al., “Role-based access control models,” IEEE Computer, vol. 29, no. 2, 1996.
[4] M. Bishop, Computer Security: Art and Science. Addison Wesley, 2003.
[5] E. A. Rundensteiner et al., “CAPE: Continuous query engine with heterogeneous-grained adaptivity,” in VLDB, 2004, pp. 1353–1356.
[6] D. J. Abadi, D. Carney, et al., “Aurora: A data stream management system,” in SIGMOD, 2003, p. 666.
[7] M. Hammad et al., “Efficient execution of sliding-window queries over streams,” Purdue University, Tech. Rep., 2003.
[8] J. Li et al., “Semantics and evaluation techniques for window aggregates in data streams,” in SIGMOD, 2005, pp. 311–322.
[9] B. Babcock, S. Babu, et al., “Models and issues in data stream systems,” in PODS, 2002, pp. 1–16.
[10] E. Bertino et al., “An extended authorization model for RDBMS,” IEEE Trans. on Knowl. and Data Eng., vol. 9, no. 1, 1997.
[11] A. Arasu et al., “The CQL: semantic foundations and query execution,” The VLDB Journal, vol. 15, no. 2, pp. 121–142, 2006.
[12] L. Golab et al., “Update-pattern-aware modeling and processing of cont. queries,” in SIGMOD, 2005, pp. 658–669.
[13] S. Chandrasekaran, O. Cooper, et al., “TelegraphCQ: Continuous dataflow processing,” in SIGMOD, 2003, p. 668.
[14] J. Kang et al., “Evaluating window joins over unbounded streams,” in ICDE, 2003, pp. 341–352.
[15] T. Brinkhoff, “A framework for generating network-based moving objects,” Geoinformatica, vol. 6, no. 2, pp. 153–180, 2002.
[16] S. Krishnamurthy, M. J. Franklin, J. M. Hellerstein, and G. Jacobson, “The case for precision sharing,” in VLDB, 2004.
[17] P. A. Tucker et al., “Applying punctuation schemes to queries over data streams,” IEEE Data Eng. Bull., vol. 26, no. 1, 2003.
[18] L. Ding, N. Mehta, E. A. Rundensteiner, et al., “Joining punctuated streams,” in EDBT, 2004, pp. 587–604.
[19] L. Ding and E. A. Rundensteiner, “Evaluating window joins over punctuated streams,” in CIKM, 2004, pp. 98–107.
[20] L. Fegaras, D. Levine, et al., “Query processing of streamed XML data,” in CIKM, 2002, pp. 126–133.
[21] H. G. Li et al., “Safety guarantee of continuous join queries over punctuated data streams,” in VLDB, 2006, pp. 19–30.
[22] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, “Hippocratic databases,” in VLDB, 2002, pp. 143–154.
[23] K. LeFevre, R. Agrawal, et al., “Limiting disclosure in Hippocratic databases,” in VLDB, 2004, pp. 447–452.
[24] W. Lindner and J. Meier, “Towards a secure data stream management system,” in TEAA 2005, 2005, pp. 114–128.
[25] W. Lindner et al., “Securing the Borealis data stream engine,” in IDEAS, 2006, pp. 137–147.
[26] B. Carminati, E. Ferrari, and K. L. Tan, “Enforcing access control over data streams,” in SACMAT, 2007, pp. 21–30.