Query Processing Techniques for Compliance with Data Confidence Policies

Chenyun Dai1, Dan Lin2, Murat Kantarcioglu3, Elisa Bertino1, Ebru Celikel3, and Bhavani Thuraisingham3

1 Department of Computer Science, Purdue University — {daic,bertino}@cs.purdue.edu
2 Department of Computer Science, Missouri University of Science and Technology — [email protected]
3 Department of Computer Science, The University of Texas at Dallas — {muratk,ebru.celikel,bhavani.thuraisingham}@utdallas.edu
Abstract. Data integrity and quality are critical issues in many data-intensive decision-making applications. In such applications, decision makers need to be provided with high-quality data on which they can rely with high confidence. A key issue is that obtaining high-quality data may be very expensive. We thus need flexible solutions to the problem of data integrity and quality. This paper proposes one such solution based on four key elements. The first element is the association of a confidence value with each data item in the database. The second element is the computation of the confidence values of query results by using lineage propagation. The third element is the notion of confidence policies. Such a policy restricts access to query results by specifying the minimum confidence level required for use of the data in a certain task by a certain subject. The fourth element is an approach to dynamically increment data confidence levels so as to return query results that satisfy the stated confidence policies. In particular, we propose several algorithms for incrementing data confidence levels while minimizing the additional cost. Our experimental results demonstrate the efficiency and effectiveness of our approach.
W. Jonker and M. Petković (Eds.): SDM 2009, LNCS 5776, pp. 49–67, 2009. © Springer-Verlag Berlin Heidelberg 2009

1 Introduction

Nowadays, it is estimated that more than 90% of the business records being created are electronic [1]. These electronic records are commonly used by companies and organizations to profile customer behavior, to improve business services, and to make tactical and strategic decisions. As such, the quality of these records is crucial [2]. Approaches like data validation and record matching [18] have been proposed to obtain high-quality data and maintain data integrity. However, improving data quality may incur additional, non-negligible costs. For example, to verify a customer address, a company may need to compare its records about the customer with other available sources, which may charge fees. To verify the financial status of a startup company, a venture capital company may have to acquire reports from a certified organization or even send auditors to the startup, which adds time and financial costs. In health care, cancer registry and administrative data are often readily
available at reasonable cost; patient and physician survey data are more expensive, while medical record data are often the most expensive to collect and are typically quite accurate [11]. In other words, obtaining accurate data can be very expensive or even unaffordable for some companies or organizations. It is also important to notice that the required level of data quality depends on the purpose for which the data are to be used. For example, for tasks which are not critical to an organization, like computing a statistical summary, data with a medium confidence level may be sufficient, whereas when an individual in an organization has to make a critical decision, data with high confidence are required. As an example, Malin et al. [11] give some interesting guidelines: for the purpose of hypothesis generation and identifying areas for further research, data about cancer patients' disease and primary treatment need not be highly accurate, as treatment decisions are not likely to be made on the basis of these results alone; however, for evaluating the effectiveness of a treatment outside the controlled environment of a research study, accurate data are desired. While identifying the purposes of data use is the task of field experts, the question for computer scientists is how to design a system that can take such input and provide data meeting the confidence level required for each data use. In particular, how can one specify which tasks require high-confidence data? In situations where we do not have enough high-confidence data to allow a user to complete a task, how can we improve the confidence of the data to the desired level at minimum cost? Yet another question arises with huge data volumes: which portion of the data should be selected for quality improvement? When dealing with large data volumes, it is hard for a human to quickly find an optimal solution that meets the decision requirement at minimal cost.
As we will see, the problem is NP-hard. To solve the above problems, we propose a comprehensive framework based on four key elements (see Figure 1). The first element is the association of confidence values with data in the database. A confidence value is a numeric value ranging from 0 to 1 that indicates the trustworthiness of the data. Confidence values can be obtained by using techniques like those proposed by Dai et al. [5], which determine the confidence value of a data item based on various factors, such as the trustworthiness of data providers and the way in which the data have been collected. The second element is the computation of the confidence values of the query results based on the confidence values of each data item and lineage propagation techniques [6]. The third and fourth elements, which are the novel contributions of this paper, deal respectively with the notion of confidence policy and with strategies for incrementing the confidence of query results at query processing time. A confidence policy specifies the minimum confidence level that is required for use of a given data item in a certain task by a certain subject. As a complement to the traditional access control mechanism, which applies to base tuples in the database before any operation, the confidence policy restricts access to the query results based on their confidence level. Such an access control mechanism can be viewed as a natural extension of Role-Based Access Control (RBAC) [7], which has been widely adopted in commercial database systems. Therefore, our approach can be easily integrated into existing database systems.
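To make the second element concrete, lineage propagation under the usual independence assumption can be sketched as follows: a conjunctive derivation (e.g., a join) multiplies confidences, while a disjunctive derivation (a result obtainable from several alternative lineages) combines them by inclusion-exclusion. This is a minimal sketch, not the cited systems' API; `conf_and` and `conf_or` are illustrative names.

```python
# Lineage-based confidence propagation, assuming independent base tuples:
# a conjunction (join) multiplies confidences; a disjunction (a result
# derivable from several alternative lineages) uses inclusion-exclusion.

def conf_and(*probs):
    """Confidence of a result that needs ALL of the given lineages."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def conf_or(*probs):
    """Confidence of a result derivable from ANY of the given lineages."""
    p_none = 1.0
    for p in probs:
        p_none *= 1.0 - p          # probability that this lineage fails
    return 1.0 - p_none

# Shape of the running example of Section 3: (tuple02 OR tuple03) AND tuple13
p38 = conf_and(conf_or(0.3, 0.4), 0.1)
print(round(p38, 3))               # -> 0.058
```

The value 0.058 is exactly the confidence of the result tuple in the illustrative example of Section 3.1.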
[Fig. 1. System Framework. Components: Query Evaluation, Confidence Assignment, Policy Evaluation, Strategy Finding, Data Quality Improvement, and the Database. Numbered flows: (1) query; (2) query and (3) data exchanged with the database; (4) intermediate results to policy evaluation; (5) request more results; (6) cost; (7) request more results; (8) request improvement; (9) increase confidence; (10) results.]
Since some query results will be filtered out by the confidence policy, a user may not receive enough data to make a decision, and he may want to improve the data quality. To meet the user's need, we propose an approach for dynamically incrementing the data confidence level; such an approach is the fourth element of our solution. In particular, our approach selects an optimal strategy that determines which data should be selected and by how much their confidence should be increased to satisfy the confidence level stated by the confidence policies. We assume that each data item in the database is associated with a cost function that indicates the cost of improving the confidence value of this data item. Such a cost function may depend on various factors, like time and money. We develop several algorithms to compute the minimum cost for such confidence increments.

It is important to compare our solution to the well-known Biba Integrity Model [4], which represents the reference integrity model in the context of computer security. The Biba model is based on associating an integrity level with each user1 and data item. The set of levels is a partially ordered set. Access to a data item by a user is permitted only if the integrity level of the data is "higher" than the integrity level of the user. Despite its theoretical interest, the Biba Integrity Model is rigid in that it does not distinguish among different tasks that are to be executed by users, nor does it address how integrity levels are assigned to users and data. Our solution has some major differences with respect to the Biba Integrity Model. First, it replaces "integrity levels" with confidence values and provides an approach to determine those values [5]. Second, it provides policies with which one can specify the confidence required for use of certain data in certain tasks. As such, our solution supports fine-grained integrity tailored to specific data and tasks.
Third, it provides an approach to dynamically adjust the data confidence level so as to provide users with query replies that comply with the confidence policies.

Our contributions are summarized as follows.

– We propose the first systematic approach to data use based on confidence values of data items.
1 We use the term 'user' for simplicity in the presentation; however, the discussion also applies to the more general notion of 'subject'.
– We introduce the notion of confidence policy and of confidence-policy-compliant query evaluation, based on which we propose a framework for query evaluation.
– We develop algorithms that minimize the cost of adjusting confidence values of data in order to meet the requirements specified in confidence policies.
– We have carried out performance studies which demonstrate the efficiency of our system.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 introduces the notion of policy-compliant query evaluation and presents the related architectural framework. Section 4 provides detailed algorithms, whereas Section 5 reports experimental results. Finally, Section 6 outlines some conclusions and future work.
2 Related Work

Work related to our approach falls into two categories: (i) access control policies and (ii) lineage calculation. For access control in a relational DBMS, most existing access control models, like RBAC [7] and privacy-aware RBAC [14], perform authorization checking before every data access. Our confidence policy is complementary to such conventional access control enforcement and applies to query results.

Many efforts [16,3,8,9,15] have been devoted to tracking the provenance of query results, i.e., recording the sequence of steps taken in a workflow system to derive a dataset, and to computing the confidence values of query results. For example, Widom et al. have developed a database management system, Trio [15], which combines data, accuracy, and lineage (provenance). However, none of those systems provides a comprehensive, policy-based solution for addressing the use of data based on confidence values for different tasks and roles. Perhaps the most closely related work is by Missier et al. [13], who propose a framework, called quality views, for the specification of users' quality processing requirements. These views can be compiled and embedded within the data processing environment. Their function is, to some extent, similar to that of our confidence policies. However, such a system is not flexible, since it does not include a data quality increment component, which is, instead, a key component of our system.
3 Policy Compliant Query Evaluation

In this section, we first introduce an illustrative example and then present our policy-compliant query evaluation framework.

3.1 An Illustrative Example

To illustrate our approach, we consider a scenario in a venture capital company that offers a wide range of asset finance programs to meet the funding requirements
of startup companies. Suppose that in this venture capital company there is a database having two relations with the following schemas:

Proposal(Company:string, Proposal:string, Funding:real);
CompanyInfo(Company:string, Income:real).

An instantiation of the tables is given with sample tuples and their confidence values in Table 1 and Table 2. In the example, the variable pNo denotes the confidence of the tuple with numeric identifier equal to No. Assume that the venture capital company has a certain amount of funds available and is looking for financial information about a company with a proposal that requires less than one million dollars. Such a query can be expressed by the following relational algebra expression:

Candidate = ΠCompany(σFunding<1,000,000(Proposal)); Result = Candidate ⋈ CompanyInfo.

Suppose that a confidence policy P1 states that the data used by a secretary for an analysis task must have a confidence value higher than 0.05. Another policy P2 states that the data used by a manager who has to make an investment decision must have a confidence value higher than 0.06. The confidence threshold in P2 is higher than that in P1 since the data usage in P2, i.e., investment, is more critical than the data usage in P1, i.e., analysis. According to policy P2, a user under the role of manager will not be able to access the query result, because the calculated confidence level, that is, p38 = 0.058, is smaller than the minimum confidence level 0.06 required for such a role when performing an investment decision task. In our example, no result is returned to the manager.

In order to let the manager obtain some useful information from his query, one solution is to improve the confidence level of the base tuples, which may however introduce some cost. Thus, our goal is to find an optimal strategy that has minimum cost. Assume that the costs of incrementing the confidence level by 0.1 (10%) for each of the tuples 02 and 03 are 100 and 10, respectively.
If we increase the confidence level of the base tuple 02 from 0.3 to 0.4, we have p25 = p02∨03 = p02 + p03 − p02·p03 = 0.64, and as a result p38 = p25∧13 = p25·p13 = 0.064, which is above the threshold. Alternatively, if we increase the confidence level of the base tuple 03 from 0.4 to 0.5, we obtain p25 = 0.65 and p38 = 0.065, which is also above the threshold. However, the first solution is more expensive, because acquiring 10% more confidence for tuple 02 is 10 times more costly than for tuple 03. Therefore, among the two alternatives, we choose the second one. The increment cost and the data whose confidence needs to be improved are reported to the manager. If the manager agrees with the suggestion given by the system, actions are taken to improve the data quality and new query results are returned to the manager.

3.2 PCQE Framework

The PCQE framework consists of five main components: confidence assignment, query evaluation, policy evaluation, strategy finding, and data quality improvement. We now elaborate on the data flow within our framework. Initially, each base tuple is assigned a confidence value by the confidence assignment component, which corresponds to the first element of our approach as mentioned in the introduction. A user inputs query information in the form ⟨Q, pu, perc⟩, where Q is a normal SQL query, pu is the purpose for issuing the query, and perc is the percentage of results that the user expects to
receive after policy enforcement. Then, the query evaluation component computes the query Q and the confidence level of each query result based on the confidence values of base tuples. This component corresponds to the second element. The intermediate results are sent to the policy evaluation component. The policy evaluation component first selects the confidence policy associated with the role of user U, his query purpose, and the data U wants to access, and then checks each query result against the selected confidence policy. Only the results with confidence values higher than the threshold specified in the confidence policy are immediately returned to the user. If fewer than perc results satisfy the confidence policy, the policy evaluation component sends a request message to the strategy finding component. The strategy finding component then computes an optimal strategy for increasing the confidence values of the base tuples and reports the cost to the user. If the user agrees with the cost, the strategy finding component informs the data quality improvement component, which takes actions to improve the data quality and then updates the database. The strategy finding and data quality improvement components correspond to the fourth element. Finally, new results are returned to the user.

Confidence Policy. A confidence policy specifies the minimum confidence that has to be assured for certain data, depending on the user accessing the data and the purpose of the data access. In its essence, a confidence policy contains three components: a subject specification, denoting a subject or set of subjects to whom the policy applies; a purpose specification, denoting why certain data are accessed; and a confidence level, denoting the minimum level of confidence that has to be assured by the data covered by the policy when the subject (or set of subjects) to whom the policy applies requires access to the data for the purpose specified in the policy.
Correspondingly, we have the following three sets: R, Pu, and R+. R is a set of roles used for subject specification. In our system, a user is a human being, and a role represents a job function or job title within the organization to which the user belongs. Pu is a set of data usage purposes identified in the system. R+ denotes the non-negative real numbers. The definition of a confidence policy is then the following.

Definition 1 [Confidence Policy]. Let r ∈ R, pu ∈ Pu, and β ∈ R+. A confidence policy is a tuple ⟨r, pu, β⟩, specifying that when a user under a role r issues a database query q for purpose pu, the user is allowed to access the results of q only if these results have confidence values higher than β.

Policies P1 and P2 from our running example are expressed as follows.
- P1: ⟨Secretary, analysis, 0.05⟩.
- P2: ⟨Manager, investment, 0.06⟩.

Confidence Increment. In some situations, the policy evaluation component may filter out all intermediate results, if the confidence levels of these results are lower than the threshold specified in the confidence policy. To increase the amount of useful information returned to users, our system allows users to specify a minimum percentage (denoted as θ) of results they want to receive. The strategy finding component then computes the cost of increasing the confidence values of the tuples in the base tables
so that at least θ percent of the query results have a confidence value above the threshold. The problem is formalized as follows.

Let Q be a query, and let λ1, λ2, ..., λn be the results for Q before policy checking; such results are referred to as intermediate results hereafter. Each λi (1 ≤ i ≤ n) is computed from a set of base tuples denoted as Λ0i = {λ0i1, ..., λ0iki}. The confidence value of λi is represented as a function Fλi(pλ0i1, pλ0i2, ..., pλ0iki), where pλ0ij is the confidence level of base tuple λ0ij (1 ≤ j ≤ ki). In our running example, the function F is F(p02, p03, p13) = (p02 + p03 − p02·p03)·p13.

Suppose that the minimum percentage given by a user is θ and the percentage of current results with confidence values higher than the threshold β is θ′ (θ′ < θ). To meet the user's requirements, we need to increase the confidence of at least (θ − θ′)·n results. Let Λ denote the set of results whose confidence values need to be increased. We then formalize the confidence increment problem as the following constraint optimization problem:

  minimize cost = Σλ0x∈Λ0 cλ0x(p∗λ0x − pλ0x)

  subject to
    |Λ| ≥ (θ − θ′)·n
    Fλi(p∗λ0i1, p∗λ0i2, ..., p∗λ0iki) ≥ β for λi ∈ Λ
    p∗λ0ij ∈ [pλ0ij, 1] for j = 1, ..., ki

where Λ0 = ∪λi∈Λ Λ0i is the union of the base tuples for the query results in Λ, and cλ0x(p∗λ0x − pλ0x) is the cost of increasing the confidence value of base tuple λ0x from pλ0x to p∗λ0x. Fλi is usually a nonlinear function, and the problem of solving nonlinear constraints over integers or reals is known to be NP-hard [12]. The above definition can be easily extended to a more general scenario in which a user issues multiple queries within a short time period.
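For small instances, the optimization problem above can be solved exactly by exhaustive search over a discretized confidence grid. The sketch below instantiates it for the running example, under the assumption of linear cost functions (100 and 10 per 0.1 increment, as in Section 3.1) and a granularity of 0.1; the tuple identifiers are illustrative.

```python
# Brute-force grid search for the confidence-increment problem on the
# running example, assuming linear costs and a granularity of 0.1.
import itertools

BETA = 0.06                                      # threshold from policy P2
GRID = [round(0.1 * i, 1) for i in range(11)]    # candidate confidences

def F(p02, p03, p13):
    # Confidence of the example result: (tuple02 OR tuple03) AND tuple13
    return (p02 + p03 - p02 * p03) * p13

# {tuple id: (initial confidence, cost per 0.1 increment)}
base = {"t02": (0.3, 100), "t03": (0.4, 10)}
P13 = 0.1                                        # tuple 13 is left unchanged

def cost(assign):
    return sum(base[t][1] * round((assign[t] - base[t][0]) / 0.1)
               for t in assign)

best = None
for p02, p03 in itertools.product(GRID, GRID):
    if p02 < base["t02"][0] or p03 < base["t03"][0]:
        continue                                 # confidences only increase
    if F(p02, p03, P13) >= BETA:
        c = cost({"t02": p02, "t03": p03})
        if best is None or c < best[0]:
            best = (c, p02, p03)

print(best)   # -> (10, 0.3, 0.5): raise tuple 03 to 0.5 at cost 10
```

The search recovers exactly the choice made in Section 3.1: raising tuple 03 from 0.4 to 0.5 (cost 10) is cheaper than raising tuple 02 (cost 100).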
4 Algorithms

In this section, we present three algorithms to determine suitable base tuples for which an increase in confidence values leads to the minimum cost. The input of our problem is a set of intermediate query results, denoted as Λinter = {λ1, ..., λn}, which have confidence values below the threshold, and a set of base tuples, denoted as Λ0 = {λ01, ..., λ0k}, associated with the query results. The output consists of a subset of Λinter, denoted as Λ, and the total cost costmin of increasing the confidence values.

4.1 Heuristic Algorithm

We first introduce the basic search process and then present a series of domain-specific heuristic functions derived from our knowledge of the problem. We adopt a depth-first search algorithm which chooses a value for one variable at a time and backtracks when a variable has no legal value left to assign. Figure 3 shows part of a search tree. At the root node of the search tree, we assign the confidence value
[Fig. 3. Search Tree — each level of the tree assigns a confidence value to one base tuple: the root branches over the values pλ01, pλ01+0.1, ..., 1.0 for λ01; each of those nodes branches over the values for λ02; and so on.]
of the first base tuple λ01. The values we can select for λ01 range from pλ01 (its initial confidence level) to 1 (or its maximum possible confidence level). The minimum distance between two values, i.e., the granularity, depends on the application requirements. In this example, the granularity is set to 0.1. After we assign a confidence value to λ01, we generate its successors by considering the second base tuple λ02. Similarly, we assign a confidence value to λ02 and generate its successors by considering the third base tuple λ03. After each assignment step, we compute the confidence value of each intermediate query result and the cost. If more than (θ − θ′)·n intermediate query results have confidence values higher than the threshold, the assignment is successful and the corresponding cost is used as an upper bound during the subsequent search. Later on, at each node of the search tree, we compare the current cost with the upper bound. If the current cost is higher, we do not need to consider the successors of this node. If a new successful assignment with lower cost is found, the upper bound is replaced with this lower cost. In the worst case, the computational complexity is O(d^k), where k is the number of base tuples and d is the number of values that can be selected for each base tuple.

As mentioned, our problem is NP-hard, and we therefore aim at finding heuristics that help reduce the search space in most cases. We first consider the base tuple ordering, since various studies [17] have shown that the search order of the tuples largely affects performance. For our problem, we need an ordering that can quickly lead to a solution with minimum cost. We know that a query result is usually associated with multiple base tuples, and different base tuples are associated with different cost functions. Intuitively, base tuples with lower cost are more likely to be included in the final solution. Therefore, we would like to increase the confidence of such base tuples first. The ordering is obtained by sorting the base tuples in descending order of their minimum cost (denoted as costβ) for enabling at least one intermediate result to satisfy the requirement. In some cases, even when the confidence value of a base tuple has been increased to 1 (or its maximum possible confidence level), none of the query results reaches β, the required confidence value. For such base tuples, we adjust costβ to costβ/(Fmax/β), where Fmax is the maximum confidence value that the query result obtains when the confidence value of this base tuple is 1. We summarize our first heuristics as follows.
Heuristics 1. Let λ0i and λ0j be two base tuples. If costβi > costβj, then λ0i will be the ancestor of λ0j in the search tree.

The next heuristics takes advantage of the monotonically non-decreasing property of the confidence functions of intermediate results. When increasing the confidence value of a base tuple only benefits intermediate results whose confidence values are already above the threshold, we can prune its right siblings. It is easy to prove that the optimal solution does not exist in the pruned branches.

Heuristics 2. Let λ0c(p∗λ0c) be the current node in the search tree, and λ1, ..., λj be the intermediate results associated with λ0c. If ∀i ∈ {1, ..., j}, Fλi ≥ β, then prune the right siblings of λ0c(p∗λ0c).
There is another useful heuristics that can quickly detect whether it is necessary to continue searching: if increasing the confidence values of all remaining base tuples to 1 still cannot yield a solution, there is no need to check the values of the remaining base tuples.

Heuristics 3. Let λ01(p∗λ01), ..., λ0c(p∗λ0c) be the nodes on the current path of the search tree. Let λ0c+1, ..., λ0j be the base tuples after λ0c, and let their confidence values be 1. If |{Fλi | Fλi(p∗λ01, ..., p∗λ0c, 1, ..., 1) > β}| < (θ − θ′)·n, then prune all branches below the node λ0c(p∗λ0c).
Similar to Heuristics 3, we can check whether any confidence increment of the remaining base tuples would result in a higher cost than the current minimum cost. If so, there is also no need to continue searching this branch.

Heuristics 4. Let λ0c(p∗λ0c) be the current node in the search tree. Let costc and costmin be the current cost and the cost of the current optimal solution, respectively. If costc + min{costλ0j(δ)} > costmin (j > c), then prune all branches below node λ0c(p∗λ0c).
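The basic search can be sketched as a small branch-and-bound skeleton: assign each base tuple a confidence value on a δ-grid, keep the cheapest assignment that satisfies enough results, and cut any branch whose accumulated cost already reaches the best known cost (a simplified form of Heuristics 4). The tuple ordering and the pruning of Heuristics 1–3 are omitted for brevity, and all names are illustrative.

```python
# Branch-and-bound sketch of the depth-first search with cost-bound
# pruning only; Heuristics 1-3 would be layered on top of this skeleton.

def search(tuples, F, beta, need, delta=0.1):
    """tuples: list of (initial confidence, cost per delta step).
    F: maps a full assignment (list of confidences) to result confidences.
    Returns (minimum cost, best assignment)."""
    best = [float("inf"), None]

    def dfs(i, assign, cost):
        if cost >= best[0]:
            return                                  # cost-bound pruning
        if i == len(tuples):
            if sum(1 for f in F(assign) if f >= beta) >= need:
                best[0], best[1] = cost, list(assign)
            return
        p0, step_cost = tuples[i]
        steps = 0
        while p0 + steps * delta <= 1.0 + 1e-9:     # legal values for tuple i
            assign.append(min(p0 + steps * delta, 1.0))
            dfs(i + 1, assign, cost + step_cost * steps)
            assign.pop()
            steps += 1

    dfs(0, [], 0)
    return best[0], best[1]

# Running example: one result, F = (p02 OR p03) AND p13 with p13 = 0.1
F = lambda a: [(a[0] + a[1] - a[0] * a[1]) * 0.1]
print(search([(0.3, 100), (0.4, 10)], F, beta=0.06, need=1))
# -> (10, [0.3, 0.5])
```

On the running example the search finds the same minimum-cost strategy as Section 3.1: leave tuple 02 at 0.3 and raise tuple 03 to 0.5, for a total cost of 10.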
4.2 Greedy Algorithm

When dealing with large datasets, the heuristic algorithm may not be able to provide an answer within a reasonable execution time. Therefore, we seek approximate solutions and develop a two-phase greedy algorithm. The first phase keeps increasing the confidence values of base tuples, while the second phase reduces unnecessary increments.

We first elaborate on the procedure of the first phase. The basic idea is to iteratively compute the gain of each base tuple when its confidence value is increased by δ, and then select the one with the maximum gain value. If there is only one intermediate result λ, the gain is defined for each base tuple as shown in equation (1), where ΔFλ is the increase of the confidence value of λ when the confidence value of the base tuple λ0 is increased by δ, and cλ0 is the corresponding cost:

  gain = ΔFλ / cλ0    (1)
[Fig. 4. Gain — gain curves of three base tuples λ01, λ02, λ03 as a function of confidence p, for an increment of δ.]
A simple example is shown in Figure 4. Among the three base tuples λ01, λ02, and λ03, λ01 yields the maximum gain when its confidence level is increased by δ, and therefore λ01 will be selected at this step. It is worth noting that once the confidence value of a base tuple is changed, the confidence function of the corresponding intermediate result also changes, so we need to recompute the gain at each step. When there are multiple intermediate results, we generalize the gain function as follows:

  gain∗ = (Σλ∈Λ ΔFλ) / cλ0    (2)
As shown in equation (2), gain∗ takes into account the overall increment of the confidence levels of the query results. The selection procedure continues until there are more than (θ − θ′)·n intermediate results (denoted as Λ) with confidence values above the threshold. The set of base tuples whose confidence levels have been increased is denoted as Λ0in.

The first phase is an aggressive increasing phase, and it may sometimes raise the confidence too much for some base tuples. For example, it may increase the confidence of a base tuple which has the maximum gain value at some step but does not contribute to any result tuple in the final answer set Λ. As a remedy, the second phase tries to find such base tuples and reduce the increment of their confidence values, and hence the overall cost. The second phase can be seen as the reverse of the first phase. In particular, we first sort the base tuples in Λ0in in ascending order of their latest gain∗ values. The intuition behind this sorting is that the base tuple with the minimum gain∗ costs the most for the same amount of increment on the intermediate result tuples, and hence we reduce its confidence value first. Then, for each base tuple, we keep reducing its confidence value by δ until it reaches its original confidence value or the reduction decreases the number of satisfied result tuples.

Step 1: λ00 (+δ) − λ1 (0.55), λ2 (0.3), λ3 (0.1)
Step 2: λ01 (+δ) − λ2 (0.4), λ3 (0.2), λ4 (0.3)
Step 3: λ02 (+δ) − λ3 (0.45), λ5 (0.3), λ6 (0.35)
Step 4: λ01 (+δ) − λ2 (0.6), λ3 (0.55), λ4 (0.4)
Step 5: λ02 (−δ) − λ3 (0.5), λ5 (0.25), λ6 (0.2)

Fig. 5. Example for the Greedy Algorithm
Procedure Greedy(Λ0, num, β)
Input: Λ0 is a set of base tuples, num is the number of required query results, and β is the confidence threshold
//- - - - - - - - - - - 1st Phase - - - - - - - - - - -
1.  success ← NULL; L ← NULL
2.  while (|success| < num) do
3.      max ← 0
4.      for each tuple λ0i in Λ0 do
5.          compute gain∗i
6.          if gain∗i > max then
7.              pick ← i; max ← gain∗i
8.      L ← L ∪ {λ0pick}
9.      increase confidence of λ0pick by δ
10.     compute confidence of affected result tuples
11.     success ← result tuples with confidence value above β
//- - - - - - - - - - - 2nd Phase - - - - - - - - - - -
12. C ← L
13. sort C based on gain∗ in an ascending order
14. for each tuple λ0i in C do
15.     while (|success| ≥ num) do
16.         if (p∗λ0i > pλ0i) then
17.             decrease λ0i's confidence by δ
18.     if (|success| < num) then
19.         increase λ0i's confidence by δ

Fig. 6. The Two-Phase Greedy Algorithm
To exemplify, we step through the example shown in Figure 5. Suppose that we need to increase the confidence values of at least (θ − θ′)·n = 3 intermediate results, and that the threshold is 0.5. At each step we compute the gain values obtained by increasing the confidence values of the base tuples by δ. The first step selects a base tuple λ00, which has the maximum gain. The change of the confidence value of λ00 results in changes of the confidence values of three intermediate result tuples λ1, λ2, and λ3; the number in brackets denotes the new confidence value. The second step selects another base tuple λ01, which affects the intermediate result tuples λ2, λ3, and λ4. By the fourth step, we have three results λ1, λ2, and λ3 with confidence values above the threshold 0.5. Then the second phase starts. As shown by Step 5, decreasing the confidence value of λ02 by δ still keeps the confidence values of λ1, λ2, and λ3 above the threshold. In the end, the algorithm suggests increasing the confidence value of λ00 by δ and that of λ01 by 2δ.

Figure 6 outlines the entire algorithm. Let l1 be the number of iterations of the outer loop in the first phase. The second phase uses the quicksort algorithm. The time complexity of the algorithm is O(k(l1 + log k)), where k is the total number of base tuples.
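The two-phase procedure of Fig. 6 can be sketched in runnable form as follows, assuming linear cost functions; `greedy`, `num_ok`, and the tuple identifiers are illustrative names, not part of the paper's implementation.

```python
# Runnable sketch of the two-phase greedy procedure, with linear costs.

def greedy(base, results, num, beta, delta=0.1):
    """base: {tid: (initial confidence, cost per delta step)};
    results: {rid: function mapping the confidence dict to a confidence}.
    Returns (total cost, delta-steps added per base tuple)."""
    conf = {t: p for t, (p, _) in base.items()}
    steps = {t: 0 for t in base}

    def num_ok():                          # results currently above beta
        return sum(1 for f in results.values() if f(conf) >= beta)

    def gain(t):                           # gain* of one more delta on t
        before = sum(f(conf) for f in results.values())
        conf[t] += delta
        after = sum(f(conf) for f in results.values())
        conf[t] -= delta
        return (after - before) / base[t][1]

    latest_gain = {}
    while num_ok() < num:                  # phase 1: aggressive increase
        cand = [t for t in base if conf[t] + delta <= 1.0 + 1e-9]
        t = max(cand, key=gain)
        latest_gain[t] = gain(t)
        conf[t] += delta
        steps[t] += 1

    # phase 2: undo unneeded increments, smallest latest gain* first
    for t in sorted(latest_gain, key=latest_gain.get):
        while steps[t] > 0:
            conf[t] -= delta
            if num_ok() < num:             # the undo broke the requirement
                conf[t] += delta
                break
            steps[t] -= 1

    total = sum(steps[t] * base[t][1] for t in base)
    return total, steps

# Running example: one result (tuple02 OR tuple03) AND tuple13, p13 = 0.1
base = {"t02": (0.3, 100), "t03": (0.4, 10)}
res = {"r38": lambda c: (c["t02"] + c["t03"] - c["t02"] * c["t03"]) * 0.1}
print(greedy(base, res, num=1, beta=0.06))  # -> (10, {'t02': 0, 't03': 1})
```

On this instance phase 1 picks tuple 03 (gain 0.0007 per unit cost versus 0.00006 for tuple 02) and phase 2 finds nothing to undo, matching the choice made in Section 3.1.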
4.3 Divide-and-Conquer Algorithm

The divide-and-conquer (D&C) algorithm is proposed to address scalability concerns. Its key idea is to divide the problem into small pieces, search for the optimal solution for each small piece, and then combine the results in a greedy way. We expect the D&C algorithm to combine the advantages of both the heuristic and the greedy algorithms.

We proceed to present the details of the D&C algorithm. The first task is to partition the problem into sub-problems, for which we need a partitioning criterion. Observe that some base tuples are independent of each other, in the sense that they do not contribute to the same set of intermediate query results. Such base tuples form a natural group. From the following example, we can see that concentrating the confidence increment on a group of base tuples may lead to a solution more quickly than increasing the confidence values of independent base tuples. In the example, there are three intermediate results λ1, λ2, and λ3 with confidence values below the threshold 0.5, and it is required that at least two results be reported. Result tuples λ1 and λ2 are associated with the same base tuples λ01, λ02, and λ03, while λ3 is associated with the base tuple λ04. Suppose that an ordering in a heuristic or greedy algorithm is λ03, λ04, λ02, λ01. Figure 7 shows the first three steps of the confidence increment, where the number in brackets indicates the new confidence value of a result tuple after the change of the confidence value of the base tuple. Observe that if we exchange the order of λ04 and λ02, we obtain an answer more quickly. This indicates the benefit of concentrating the confidence increment on base tuples in the same group.

Ideally, all base tuples are partitioned into a set of almost equal-size natural groups, and the search can then be carried out in each independent group. However, such a situation rarely happens.
A more common situation is that most base tuples are related to each other due to the overlap among their corresponding intermediate result sets. The question is how to determine which base tuples are more related, so that they can be placed in the same group. This is essentially a graph partitioning problem: each intermediate result tuple is a node, and two nodes are connected by an edge if the corresponding result tuples share at least one base tuple. Figure 8 shows an example graph of seven result tuples. For instance, λ1 and λ2 have three common base tuples, while λ2 and λ3 share only one base tuple. Our goal is to partition the graph into disjoint subgraphs that satisfy two requirements. First, the number of base tuples associated with the result tuples in the same group should not exceed a threshold; this ensures that each sub-problem is solvable in reasonable (or user-specified) time. Second, the sum of the weights on the edges connecting any two subgraphs should be minimized; this reduces the duplicate search over base tuples that belong to two groups.

Step 1: λ03 − λ1 (0.3), λ2 (0.4)
Step 2: λ04 − λ3 (0.4)
Step 3: λ02 − λ1 (0.5), λ2 (0.6)

Fig. 7. An Example of Partitioning Effect
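As a concrete sketch (illustrative Python, not code from the paper), the weighted graph over result tuples can be built directly from the base-tuple sets:

```python
from itertools import combinations

def build_result_graph(results):
    """Build the weighted graph over intermediate result tuples.
    `results` maps a result-tuple id to the set of its base tuples.
    Two results are connected iff they share at least one base tuple;
    the edge weight is the number of shared base tuples."""
    edges = {}
    for (i, bi), (j, bj) in combinations(sorted(results.items()), 2):
        shared = len(bi & bj)
        if shared > 0:
            edges[(i, j)] = shared
    return edges

# Hypothetical example: results 1 and 2 share three base tuples,
# results 2 and 3 share one, results 1 and 3 share none.
edges = build_result_graph({1: {"a", "b", "c"},
                            2: {"a", "b", "c", "d"},
                            3: {"d"}})
# edges == {(1, 2): 3, (2, 3): 1}
```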
Fig. 8. An Example of Partitioning (a weighted graph over the result tuples λ1–λ7; edge weights are the numbers of shared base tuples)
Unfortunately, finding an optimal graph partitioning is also an NP-complete problem. Extensive studies have been carried out and a variety of heuristic and greedy algorithms have been proposed [10]. Since in our case the partitioning is just the first phase, most existing approaches are too expensive and would introduce too much overhead. Therefore, we propose a lightweight yet effective approach specific to our problem. Initially, each node is considered a group. We repeatedly merge the two nodes connected by the edge with the maximum weight. After each merge, the weight on the edge between a node and the new group is the sum of the weights on the edges between that node and all nodes included in the group. The process stops when the maximum weight is less than a given threshold γ. For example, the graph in Figure 8 can be partitioned into two groups when γ = 2 (see Figure 9).

Step 1: Merge λ4 and λ6 (maximum weight = 5)
Step 2: Merge λ1 and λ5 (maximum weight = 4)
Step 3: Merge λ4, λ6 and λ7 (maximum weight = 4)
Step 4: Merge λ1, λ5 and λ2 (maximum weight = 3)
Step 5: Merge λ4, λ6, λ7 and λ3 (maximum weight = 2)

Fig. 9. A Graph Partitioning Example
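The merging procedure above can be sketched as follows (a minimal Python sketch under illustrative assumptions; the data representation and tie-breaking are not specified by the paper):

```python
def partition(edges, gamma):
    """Lightweight partitioning sketch: start with singleton groups, then
    repeatedly merge the two groups joined by the heaviest edge, summing
    parallel edge weights after each merge, until the heaviest remaining
    edge weighs less than gamma. `edges` maps node pairs to weights
    (numbers of shared base tuples)."""
    nodes = {n for pair in edges for n in pair}
    groups = {frozenset([n]) for n in nodes}
    # Inter-group edges, keyed by the unordered pair of groups.
    w = {frozenset([frozenset([a]), frozenset([b])]): wt
         for (a, b), wt in edges.items()}
    while w:
        pair, wmax = max(w.items(), key=lambda kv: kv[1])
        if wmax < gamma:
            break
        merged = frozenset().union(*pair)
        groups = (groups - set(pair)) | {merged}
        neww = {}
        for p, wt in w.items():
            if p == pair:
                continue  # the contracted edge becomes internal
            # Redirect edges that touched either merged group, summing
            # weights of edges that now connect the same pair of groups.
            p2 = frozenset(merged if g in pair else g for g in p)
            neww[p2] = neww.get(p2, 0) + wt
        w = neww
    return groups
```

For example, on a path a–b (weight 3), b–c (weight 1), c–d (weight 2) with γ = 2, the sketch merges {a, b} and {c, d} and stops, since the remaining edge between the two groups has weight 1 < γ.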
After the partitioning, we apply the greedy algorithm to each group. Let x be the number of result tuples associated with a group, and y be the required number of result tuples for the entire query. If x is smaller than y, the greedy algorithm finds a solution for these x result tuples; if x is larger than y, the greedy algorithm stops when y result tuples have confidence values above the threshold. Next, we carry out a heuristic search in each group that contains fewer than τ base tuples; the parameter τ is determined by the performance of the heuristic algorithm, and the results obtained from the greedy algorithm serve as initial cost upper bounds. The last step is result combination and refinement. A subtlety during combination is the handling of base tuples that overlap between groups: when we combine answers from such groups, we select the maximum confidence value for each overlapping base tuple. This guarantees that the combined answer will not reduce the confidence values of result tuples in the answer set of any individual group. After the combination, the total number of satisfied result tuples may exceed the requirement, or the confidence values of the result tuples may be much higher than the threshold; both cases introduce additional cost. Therefore, we carry out a refinement process similar
Procedure D&C(Λ0, num, β, γ)
Input: Λ0 is a set of base tuples, num is the number of required query results,
       β is the confidence threshold, γ is the graph partitioning threshold
1.  for each intermediate result tuple λi do
2.      group Gi ← the set of base tuples associated with λi
3.  for each pair of intermediate result tuples λi, λj (i ≠ j) do
4.      wij ← |Gi ∩ Gj|
5.  select the two groups with maximum weight wmax
6.  while wmax > γ do
7.      merge the selected two groups
8.      adjust the weights on the affected edges
9.      select the two groups with maximum weight wmax
10. for each group Gi do
11.     invoke Greedy()
12.     if |Gi| < τ then
13.         invoke Heuristic Algorithm()
14. result combination and refinement
Fig. 10. The Divide-and-Conquer Algorithm
to the second phase of the greedy algorithm. It starts from the base tuple with the minimum gain∗ and stops when any further confidence reduction would leave fewer satisfied result tuples than required. An overview of the entire algorithm is shown in Figure 10. The complexity of our graph partitioning algorithm is O(n²), where n is the total number of intermediate result tuples. The complexity of the remaining part of the D&C algorithm is the same as that of the greedy and heuristic algorithms, with the size of the entire dataset replaced by that of each group. The complexity of the result combination and refinement step is O(k log k).

Finally, we note that it is easy to extend the three algorithms, i.e., the heuristic, greedy, and divide-and-conquer algorithms, to support multiple queries. Two aspects are important for such an extension. First, the search space has to be extended to include all distinct base tuples associated with all queries. Second, instead of checking whether a solution has been found for a single query, we need to check whether a solution has been found for all queries.
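The combination step above, which takes the maximum confidence value for each base tuple shared by two groups, can be sketched as follows (illustrative Python; the answer representation is an assumption):

```python
def combine_answers(group_answers):
    """Combination-step sketch: each group's answer maps a base tuple to
    its suggested new confidence value. For base tuples appearing in
    several groups we keep the maximum value, so the combined answer
    never lowers the confidence of any group's result tuples."""
    combined = {}
    for answer in group_answers:
        for base, c in answer.items():
            combined[base] = max(c, combined.get(base, 0.0))
    return combined

# Hypothetical example: base tuple "b2" overlaps two groups.
combined = combine_answers([{"b1": 0.4, "b2": 0.6},
                            {"b2": 0.5, "b3": 0.7}])
# combined == {"b1": 0.4, "b2": 0.6, "b3": 0.7}
```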
5 Performance Study

5.1 Experimental Settings

Our experiments are conducted on an Intel Core 2 Duo (2.66 GHz) Dell machine with 4 GB of main memory. We use synthetic datasets in order to cover all general scenarios. First, we generate a set of base tuples and assign to each tuple a randomly generated confidence value around 0.1 and a cost function. The types of cost
Table 4. Parameters and Their Settings

Parameter                        | Setting
Data size                        | 10, 1K, 10K, ..., 100K
No. of base tuples per result    | 5, 10, 25, 50, 100
Confidence increment step δ      | 0.1
Percentage of required results θ | 50%
Confidence level β               | 0.6
functions include the binomial, exponential and logarithmic functions. Then we associate a certain number of base tuples with each result tuple. Since our focus is on the policy evaluation and strategy finding components, we use randomly generated DAGs to represent queries. Table 4 gives an overview of the parameters used in the experiments, where values in bold are default values. "Data size" is the total number of distinct base tuples associated with the results of a single query. "No. of base tuples per result" is the average number of base tuples associated with each result tuple. "Confidence increment step" is the confidence value by which the chosen base tuple is increased at each step. "Percentage of required results" is a user input parameter perc (θ), the percentage of results that a user expects to receive after the policy checking. Unless specified otherwise, we use a 10K dataset where each result tuple is associated with 5 base tuples and the percentage of required results is 50%.

5.2 Algorithm Analysis

Heuristic Algorithm. These experiments assess the impact of the four heuristics on the search performance, using a small dataset with 10 base tuples. Each query requires at least three results with a confidence value above 0.6, and each result is linked to 5 base tuples. Figures 11(a) and (d) show the performance when different heuristics are used: H1 (Heuristic 1), H2 (Heuristic 2), H3 (Heuristic 3), H4 (Heuristic 4); "Naive" means that only the current optimal cost is used as an upper bound, and "All" means that all heuristics are applied. From Figure 11(a), we observe that the response time when applying any one of the four heuristics is lower than that of "Naive". When all heuristics are applied, the performance improves by a factor of about 60. This behavior can be explained as follows. Compared to an arbitrary ordering, H1 provides a much better base tuple ordering that quickly leads to the optimal solution.
H2, H3 and H4 reduce unnecessary searches. In Figure 11(d), we use the minimum cost computed by the greedy algorithm as the initial upper bound for the heuristic algorithm. The search performance improves in all cases: the upper bound provided by the greedy algorithm helps prune the search space from the very beginning of the search, and since it corresponds to a nearly optimal solution, it is tighter than most upper bounds found during the search.

Two-Phase Greedy Algorithm. The second phase of the greedy algorithm refines the result. It may reduce the minimum cost but requires additional processing
Fig. 11. Experimental Results: (a) response time of the heuristic algorithm without the greedy bound (Naïve, H1–H4, All); (b) response time of the one-phase vs. two-phase greedy algorithm; (c) response time of the heuristic, greedy, and divide-and-conquer algorithms; (d) response time of the heuristic algorithm using the greedy bound; (e) cost of the one-phase vs. two-phase greedy algorithm; (f) cost of the heuristic, greedy, and divide-and-conquer algorithms
time. This set of experiments aims to determine whether the second phase is beneficial. We compare the performance of the greedy algorithm with and without the second phase. Figures 11(b) and (e) show the results when varying the data size from 1K to 10K. From Figure 11(b), we observe that both versions of the greedy algorithm have similar response times, which means the overhead introduced by the second phase is negligible; this conforms to the complexity of the second phase. As for the minimum cost (Figure 11(e)), the two-phase algorithm clearly outperforms the one-phase algorithm: using the second phase reduces the minimum cost by more than 30%. These results confirm the effectiveness of the second phase. In the subsequent experiments, the greedy algorithm refers only to the two-phase version.

5.3 Overall Performance Comparison

In this section, we compare the three algorithms in terms of both response time and minimum cost, and evaluate their scalability. The data size is varied from 10 to 100K. The number of base tuples per result is set to 5 for data sizes below 5K; for data sizes from 10K to 100K, it is set to 1/1000 of the data size. Figure 11(c) reports the performance. It is not surprising that the heuristic algorithm can handle only very small datasets (fewer than one hundred tuples) within reasonable time, because its complexity is exponential in the worst case. The greedy algorithm has the shortest response time when the dataset is small, but is then overtaken by the D&C algorithm, and the gap between the two widens as the data size increases; in particular, the greedy algorithm takes hours for datasets larger than 50K. The reason is that the graph partitioning phase of the D&C algorithm introduces some overhead on small datasets, and hence it requires more
time than the greedy algorithm. However, as the dataset grows, the advantage of the partitioning becomes more and more significant; thus, the D&C algorithm scales best among the three. Another interesting observation is that the response time decreases when the data size changes from 5K to 10K. A possible reason is that the groups are relatively larger in the 10K dataset than in the 5K dataset, so fewer heuristic searches and more greedy searches are involved, which results in a shorter response time. Figure 11(f) compares the minimum cost computed by all algorithms. The minimum cost increases with the data size, since more result tuples need to be reported and more base tuples need to be considered in a larger dataset. The heuristic algorithm yields the optimal solution, as it is based on an exhaustive search. The other two algorithms perform very similarly and have slightly higher cost than the optimal. This demonstrates the accuracy of the other two algorithms.
6 Conclusion

This paper proposes the first systematic approach to the use of data based on confidence values associated with the data. We introduce the notion of confidence policy compliant query evaluation, based on which we develop a framework for query evaluation. We propose three algorithms for dynamically incrementing data confidence values in order to return query results that satisfy the stated confidence policies while minimizing the additional cost. Experiments have been carried out to evaluate both the efficiency and the effectiveness of our approach. Since actually improving data quality may take some time, a user can submit a query in advance of the expected time of data use, and statistics can be used to tell the user how much time in advance the query needs to be issued. We will investigate this topic in future work.
References

1. http://www.arma.org/erecords/index.cfm
2. Ballou, D., Madnick, S.E., Wang, R.Y.: Assuring information quality. Journal of Management Information Systems 20(3), 9–11 (2004)
3. Barbará, D., Garcia-Molina, H., Porter, D.: The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering 4(5), 487–502 (1992)
4. Bishop, M.: Computer Security: Art and Science, ch. 6. Addison-Wesley Professional, Reading (2003)
5. Dai, C., Lin, D., Bertino, E., Kantarcioglu, M.: An approach to evaluate data trustworthiness based on data provenance. In: Jonker, W., Petković, M. (eds.) SDM 2008. LNCS, vol. 5159, pp. 82–98. Springer, Heidelberg (2008)
6. Dalvi, N., Suciu, D.: Efficient query evaluation on probabilistic databases. In: Proc. VLDB, pp. 864–875 (2004)
7. Ferraiolo, D.F., Sandhu, R., Gavrila, S., Kuhn, D.R., Chandramouli, R.: Proposed NIST standard for role-based access control. ACM Trans. Inf. Syst. Secur. 4(3), 224–274 (2001)
8. Fuhr, N., Rölleke, T.: A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems 15(1), 32–66 (1997)
9. Green, T.J., Tannen, V.: Models for incomplete and probabilistic information. In: Grust, T., Höpfner, H., Illarramendi, A., Jablonski, S., Mesiti, M., Müller, S., Patranjan, P.-L., Sattler, K.-U., Spiliopoulou, M., Wijsen, J. (eds.) EDBT 2006. LNCS, vol. 4254, pp. 278–296. Springer, Heidelberg (2006)
10. Hendrickson, B., Leland, R.: A multilevel algorithm for partitioning graphs. In: Supercomputing (1995)
11. Malin, J.L., Keating, N.L.: The cost-quality trade-off: Need for data quality standards for studies that impact clinical practice and health policy. Journal of Clinical Oncology 23(21), 4581–4584 (2005)
12. McAllester, D.: The rise of nonlinear mathematical programming. ACM Computing Surveys, 68 (1996)
13. Missier, P., Embury, S., Greenwood, M., Preece, A., Jin, B.: Quality views: Capturing and exploiting the user perspective on data quality. In: VLDB, pp. 977–988 (2006)
14. Ni, Q., Trombetta, A., Bertino, E., Lobo, J.: Privacy aware role based access control. In: Proceedings of the 12th ACM Symposium on Access Control Models and Technologies (2007)
15. Sarma, A.D., Theobald, M., Widom, J.: Exploiting lineage for confidence computation in uncertain and probabilistic databases. Technical Report, Stanford InfoLab (2007)
16. Simmhan, Y.L., Plale, B., Gannon, D.: A survey of data provenance in e-science. SIGMOD Record 34(3), 31–36 (2005)
17. Tsang, E.: Foundations of Constraint Satisfaction. Academic Press, London (1993)
18. Wang, R.Y., Storey, V.C., Firth, C.P.: A framework for analysis of data quality research. TKDE 7(4), 623–640 (1995)