Research Article

Data Inconsistency Evaluation for Cyberphysical System

Hao Wang, Jianzhong Li, and Hong Gao

School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

Correspondence should be addressed to Hao Wang; [email protected]

Received 14 April 2016; Accepted 26 May 2016

Academic Editor: Xiuzhen Cheng

Copyright © 2016 Hao Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cyberphysical systems (CPSs) have been widely applied in a variety of applications to collect data, while data is often dirty in reality. We pay attention to evaluating data inconsistency, which is a major concern when assessing the quality of data and of its source. This paper is the first study of the data inconsistency evaluation problem for CPS based on conditional functional dependencies (CFDs). Given a database instance D of n tuples and a CFD set Σ of r CFDs, data inconsistency is defined as the ratio of the size of the minimum culprit in D to |D|, where a culprit is a set of tuples whose removal eliminates all integrity errors. Firstly, we analyze the complexity and inapproximability of the minimum culprit problem. Then, we provide a practical algorithm that gives a 2-approximation of the data dirtiness in O(rn log n) time based on independent residual subgraphs. To deal with large dynamic data, we provide a compact B-tree-based structure for storing independent residual subgraphs so that the inconsistency can be updated efficiently. At last, we test our algorithm on both synthetic and real-life datasets; the experimental results show the scalability of our algorithm and the quality of the evaluation results.
1. Introduction

Cyberphysical systems (CPSs) have been widely applied in a variety of applications to collect data, such as temperature, heart rate, and speed, from the physical world and to make decisions based on the analysis of these data, thereby controlling and optimizing physical objects in the real world; they have a great influence on the way we observe and change the world [1]. A CPS obtains information about the physical world through many sensors and affects the environment through actuators. Data sensed and sampled by sensors usually contains valuable information about the physical world, and its volume keeps growing. For a better understanding of the physical environment, data collection and analysis are essential [2]. The knowledge extracted from the data also guides the behavior of the actuators in a CPS; for instance, sensors and actuators cooperate with each other to monitor an area [3] and react when a certain event is detected [4, 5]. Data gathered by sensors is not simply thrown away after it has been transmitted to the processors; it is also stored for further analysis.

Unfortunately, not all the information gathered by CPSs is reliable, due to hardware and communication limits [6]. Many deployment experiences have shown that low data quality is the most serious problem impacting CPS performance. Tolle et al. pointed out that faulty data can occur in various unexpected ways and that less than 69% of their data could be used for meaningful interpretation [7]. Szewczyk et al. also found that about 30% of the data in their deployment was faulty [8]. What makes the situation worse is that the quality of data is not easily judged. It is therefore important to find a way to assess the quality of data gathered by CPSs in order to estimate the usability of the data; meanwhile, the data quality also reflects the reliability of the system. In this paper, we use data inconsistency to measure data quality, and we store all the data in a relational database. As these systems become pervasive and ubiquitously available, large amounts of data, possibly including faulty or faked information, will be collected. This makes the quality of the data crucial for the success of decision-making systems and other CPS applications: without high-quality data, no high-quality service based on the right decisions can be provided, for instance, aggregation and routing services [9–15].

1.1. Motivation. In CPSs, data is collected mostly from the physical world. However, the usability of data is reduced
by faulty data that does not report the real values of the monitored objects. The idea of this paper is that database techniques for data inconsistency can be utilized to model and manage data quality for CPS, in order to evaluate data source quality, CPS data quality, and so on. Based on this, we propose a new measurement and techniques for its efficient computation.

In database theory, data consistency is one of the most important aspects of data quality; it is usually defined based on integrity constraints, which are semantic conditions that a database should satisfy in order to be an appropriate model of the external reality. In practice, a database may not satisfy those integrity constraints, and for that reason it is said to be inconsistent or dirty. As a type of integrity constraint, conditional functional dependency (CFD) [16] has been proposed to capture inconsistency in data; it is a generalization of functional dependency (FD) [17] with more powerful expressiveness. Based on CFDs, many works on data quality have appeared; for example, [18–20] focus on the inconsistency detection problem, while [21–26] focus on the data repairing problem. Besides inconsistency detection and data repairing, an important problem is data inconsistency evaluation, which aims to quantify how dirty the data is. Traditional evaluation methods for data sources are mostly based on statistics; compared with them, the logical method proposed in this paper is more flexible and fundamental and has a higher capability of expression. To the best of our knowledge, there is no existing work providing a specific formula quantifying data inconsistency based on CFDs. We now give an example of modeling CPS data using CFDs.

Example 1. A CPS group has maintained a relation of sensing data for its laboratory for several years:

CM (sid, loc, time, week, date, temp, vibrate).   (1)
Each climate monitor tuple t records a unique sensor id sid, the location of the sensor loc, time information about the report (time, week, and date), the temperature, and the vibrate status. A sampled fragment D of the data is shown in Table 1.

Table 1: Sampled data.

      sid   loc    Time   Week  Date   Temp.  Vibrate
t1    s816  6:8.1  14:10  0079  11-04  48     0
t2    s816  6:8.1  14:10  0079  11-05  24     1
t3    s816  7:4.2  14:10  0079  11-06  22     1
t4    s817  6:8.1  14:10  0079  11-04  24     1
t5    s817  6:8    14:10  0080  11-09  24     0
t6    s817  6:8    14:10  0080  11-10  22     0

Two CFDs defined over such sampled data are as follows:

φ1: (loc, time, date → vibrate, {tp(_, _, _, _)}),
φ2: (sid, week → loc, {tp(_, _, _), tp(s817, _, 6:8)}).   (2)

Intuitively, φ1 states that the vibrate status reported for the same location at the same time must be identical, while φ2 states that the location of a sensor cannot change within the same week; moreover, for the special sensor "s817", its position must always be "6:8". According to these two CFDs, D is inconsistent, since it contains the following violations:

(i) Tuple pair (t1, t4) is a violation with respect to φ1, because the two tuples report different vibrate statuses for the same location, time, and date.

(ii) Tuple t4 is a violation with respect to φ2, because "s817" reports location "6:8.1" while its location must be "6:8".

(iii) Tuple pairs (t1, t3) and (t2, t3) are violations with respect to φ2, because "s816" reports different locations in the same week.

It is easy to see that the size of the minimum culprit is 2, because no culprit can have fewer than 2 tuples; for example, the subset {t3, t4} is a minimum culprit. That is to say, the data we sampled is not very reliable, and, to make it clean, at least 33.3% of the data must be repaired.

Motivated by this, we consider in this paper how to efficiently compute this inconsistency measurement when the integrity constraints are conditional functional dependencies. Technically, to the best of our knowledge, there is no existing work considering this aspect. There are some detection techniques [18–20, 27], but they cannot directly reveal how dirty the data is. Regarding confidence computation [28], our problem generalizes the confidence of a single CFD; in fact, our measurement is the confidence of a set of CFDs. Regarding repairing techniques, our problem can be seen as a special case of [29], because the complement of a minimum culprit can be seen as a C-repair (cardinality repair) of an inconsistent database; however, using the techniques proposed by [29] directly is much more expensive, especially for dynamic data, and the algorithm given in this paper is more efficient and seems optimal.

Briefly, there are three challenges. (1) The first challenge is how to evaluate the inconsistency efficiently. We prove that the inconsistency evaluation problem is NP-complete even if there are only two CFDs in the rule set and only three attributes in the relation schema; therefore, an efficient approximation algorithm is needed. (2) Because most existing repair algorithms face a huge search space when the data is large and thus pay a high performance cost, the approximation algorithm is also expected to be more efficient than the C-repair algorithms, to guarantee the approximation ratio, and to be able to evaluate data larger than memory. (3) For dynamic data, an external-memory data structure is necessary so that the algorithm can handle tuple updates efficiently rather than recomputing the inconsistency from scratch.
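Returning to Example 1, the following self-contained Python sketch (written by us for illustration; the tuple encoding and all function names are our own, not from the paper) brute-forces the minimum culprit of the six sampled tuples in Table 1 under φ1 and φ2:

```python
from itertools import combinations

# Sample fragment D from Table 1: (sid, loc, time, week, date, temp, vibrate)
D = {
    "t1": ("s816", "6:8.1", "14:10", "0079", "11-04", 48, 0),
    "t2": ("s816", "6:8.1", "14:10", "0079", "11-05", 24, 1),
    "t3": ("s816", "7:4.2", "14:10", "0079", "11-06", 22, 1),
    "t4": ("s817", "6:8.1", "14:10", "0079", "11-04", 24, 1),
    "t5": ("s817", "6:8",   "14:10", "0080", "11-09", 24, 0),
    "t6": ("s817", "6:8",   "14:10", "0080", "11-10", 22, 0),
}

def violates_phi1(u, v):
    # phi1: (loc, time, date -> vibrate): same loc/time/date must agree on vibrate
    return u[1:3] == v[1:3] and u[4] == v[4] and u[6] != v[6]

def violates_phi2_pair(u, v):
    # phi2, pattern 1: (sid, week -> loc): one location per sensor per week
    return u[0] == v[0] and u[3] == v[3] and u[1] != v[1]

def violates_phi2_const(u):
    # phi2, pattern 2: sensor s817 must always be at location 6:8
    return u[0] == "s817" and u[1] != "6:8"

def consistent(keys):
    ts = [D[k] for k in keys]
    if any(violates_phi2_const(t) for t in ts):
        return False
    return not any(violates_phi1(u, v) or violates_phi2_pair(u, v)
                   for u, v in combinations(ts, 2))

# Minimum culprit: a smallest C such that D - C is consistent.
for k in range(len(D) + 1):
    culprits = [set(c) for c in combinations(D, k)
                if consistent(set(D) - set(c))]
    if culprits:
        print(k, culprits)           # k = 2, with {'t3', 't4'}
        print("dirt =", k / len(D))  # 2/6 = 0.333...
        break
```

Running it confirms the discussion above: the unique minimum culprit is {t3, t4} of size 2, giving a dirtiness of 33.3%. Of course, such exhaustive search is only feasible for tiny examples, which is exactly why the rest of the paper develops an approximation algorithm.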
1.2. Contributions. This paper is the first to study how to efficiently compute the data inconsistency of CPS data with respect to CFDs; the main contributions are as follows:

(a) We formally define the inconsistency evaluation problem. The inconsistency of a given database instance D is defined based on the minimum culprit, a minimum subset of D whose complement in D is consistent with respect to all the given CFDs, and we use the proportion of the minimum culprit to quantify the inconsistency of a database. We show that this measurement is monotonic and insensitive to small changes of the database. And we prove that the minimum culprit problem remains NP-complete even if (1) Σ has only two variable CFDs, (2) the relation has only 3 attributes, and (3) the number of violations involving each tuple is at most 6.

(b) Based on the conflict graph model, we transform the inconsistency evaluation problem into the minimum vertex cover problem. By finding maximal matchings of independent residual subgraphs, we give a 2-approximation algorithm with O(rn log n) time complexity, where r is the number of given CFDs and is usually a small constant. To deal with large dynamic data, we design a compact structure indexing all tuples and give a method for its maintenance. Useful properties of independent residual subgraphs allow us to avoid storing edges in the compact structure, so that the storage cost of the graph is O(rn) and the update cost is only O(r log n).

(c) Using TPC-H for large-scale data and IMDB and DBLP for real-life data, we conduct experiments on a PC. We find that the adjusted counterpart outperforms the basic evaluation algorithm when several CFDs have small confidence while the others do not. In addition, our algorithms scale well with both the size of the data and the number of CFDs.
2. Background

An l-ary relation schema can be represented by R(A1, A2, ..., Al), where R is the relation name and the Aj's (1 ≤ j ≤ l) are the attributes of R. Let attr(R) be {A1, ..., Al}, and let dom(Aj) be the domain of attribute Aj. An instance D of relation R is a set of l-ary tuples, denoted by D = {t1, t2, ..., tn}, where each tuple ti (1 ≤ i ≤ n) belongs to dom(A1) × dom(A2) × ⋅⋅⋅ × dom(Al). Let ti[Aj] be the value of ti on attribute Aj.

Conditional functional dependency (CFD for short) is a class of integrity constraints capturing the consistency of data; its formal definition can be found in [24]. Next, the syntax and semantics of CFDs are reviewed briefly.

(i) Syntax. A CFD rule φ defined over relation R is a pair (X → Y, Tp), where X and Y are two disjoint attribute lists satisfying X ∪ Y ⊆ attr(R), X → Y is a standard FD, and Tp is a pattern tableau over the attributes X ∪ Y. For each pattern tuple tp ∈ Tp and each attribute A ∈ X ∪ Y, the value tp[A] is either a constant "a" in dom(A) or a wild card "_". For a rule φ, we use LHS(φ) to denote X and RHS(φ) to denote Y.

(ii) Semantics. Given a tuple t and a pattern tuple tp, t is said to match tp, denoted by t ≍ tp, if either t[A] = tp[A] or tp[A] = "_" holds for each attribute A. Two tuples t1 and t2 satisfy φ, denoted by (t1, t2) ⊨ φ, if whenever t1[X] = t2[X] ≍ tp[X], we also have t1[Y] = t2[Y] ≍ tp[Y]; if t1[X] = t2[X] ≍ tp[X] but t1[Y] ≠ t2[Y], then the tuple pair (t1, t2) is a violation. In particular, a single tuple t satisfies φ, denoted t ⊨ φ, if whenever t[X] ≍ tp[X], we also have t[Y] ≍ tp[Y]. Given a relational instance D and a CFD rule φ, D satisfies φ (i.e., D ⊨ φ) iff (a) for each tuple t ∈ D, t ⊨ φ, and (b) for any two tuples t1 and t2 in D, (t1, t2) ⊨ φ. Given a CFD set Σ, D is consistent with respect to Σ if it satisfies all rules in Σ; otherwise, it is inconsistent or dirty, denoted by D ⊭ Σ. For example, in Table 1, tuple pair (t1, t4) is a violation with respect to φ1 of Example 1 (i.e., (t1, t4) ⊭ φ1), because t1[loc, time, date] = t4[loc, time, date] ≍ tp[loc, time, date] but t1[vibrate] = "0" ≠ t4[vibrate] = "1"; therefore, D is inconsistent or dirty because a violation exists.

A CFD is said to be simple if there is only one row in its pattern tableau, such as the CFDs of Example 1 after splitting the tableau of φ2. Two special fragments of simple CFDs can be defined as follows:

(i) A simple CFD φ = (X → Y, Tp) is said to be a variable CFD if, for each A ∈ Y, tp[A] = "_"; for example, φ1 is a variable CFD.

(ii) A simple CFD φ = (X → Y, Tp) is said to be a constant CFD if, for each A ∈ Y, tp[A] is a constant; for example, the second pattern of φ2 yields a constant CFD.

Intuitively, a constant CFD captures inconsistencies in a single tuple, while a variable CFD captures inconsistencies between two tuples. In fact, any CFD can be rewritten into simple CFDs naïvely by splitting its tableau horizontally, and a simple CFD can be rewritten into at most one constant CFD and one variable CFD. Therefore, without loss of generality, only simple CFDs are considered in this paper.
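The match and satisfaction semantics above can be summarized by a short sketch. The encoding below (attribute dictionaries, the underscore wildcard, and all names) is ours, not from the paper, and it covers only the pairwise case for a single pattern tuple:

```python
# A minimal sketch (our own encoding) of CFD pair semantics.
WILDCARD = "_"

def matches(t, tp, attrs):
    """t matches tp on attrs (t ≍ tp): each attribute agrees or tp is a wildcard."""
    return all(tp[a] == WILDCARD or t[a] == tp[a] for a in attrs)

def pair_satisfies(t1, t2, cfd):
    """(t1, t2) ⊨ (X → Y, tp): if both tuples agree on X and match tp on X,
    they must also agree on Y and match tp on Y."""
    X, Y, tp = cfd
    if all(t1[a] == t2[a] for a in X) and matches(t1, tp, X) and matches(t2, tp, X):
        return all(t1[a] == t2[a] for a in Y) and matches(t1, tp, Y)
    return True

# phi1 from Example 1, with an all-wildcard pattern tuple.
phi1 = (("loc", "time", "date"), ("vibrate",),
        {"loc": "_", "time": "_", "date": "_", "vibrate": "_"})
t1 = {"loc": "6:8.1", "time": "14:10", "date": "11-04", "vibrate": "0"}
t4 = {"loc": "6:8.1", "time": "14:10", "date": "11-04", "vibrate": "1"}
print(pair_satisfies(t1, t4, phi1))  # False: (t1, t4) is a violation
```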
3. Problem Definition

This section first formally defines the data inconsistency evaluation problem and then proves that it is NP-complete. Given a CFD set Σ and a database instance D such that D ⊭ Σ, intuitively, the dirty part of D is a subset D′ ⊆ D whose deletion makes the data clean. We formalize this idea as follows.

Definition 2 (culprit). Given a database instance D and a set of CFD rules Σ, a culprit C(D) is a subset of D satisfying D − C(D) ⊨ Σ.
Obviously, for fixed Σ and D, there may be many culprits. In this paper, to measure the data dirtiness, we only care about the minimum culprit: Cmin(D) is a minimum culprit if, for any culprit C(D), |Cmin(D)| ≤ |C(D)|.

Definition 3 (data dirtiness evaluation problem). Given a database instance D and a set of CFD rules Σ, compute the dirtiness of D, which is dirt(D, Σ) = |Cmin(D)|/|D|.

Property 1 (minimality). Given any instance D and any CFD set Σ, dirt(D, Σ) is the minimum portion of tuples that must be edited by any exact repair algorithm.

The measurement dirt(D, Σ) is also monotonic and insensitive to a small change Δ (i.e., a set of tuples) of instance D, as the following property states.

Property 2 (monotonic and insensitive). Given an instance D, a set of tuples Δ, and a CFD set Σ, we have 0 ≤ |Cmin(D ∪ Δ)| − |Cmin(D)| ≤ |Δ|. This implies 0 ≤ |dirt(D ∪ Δ, Σ) − dirt(D, Σ)| ≤ |Δ|/(|D| + |Δ|).

Remark 4. That is to say, the inconsistency of data measured by Definition 3 changes gently under small updates. Usually, this trend agrees with reality, because "(1) most of the data is often correct, especially for large data, and (2) a small update has a tiny impact on the dirtiness of the entire dataset."

Similar to [30], we next establish the complexity of the minimum culprit problem under rather restricted conditions on the input. Here, the decision version of the minimum culprit problem (the k-culprit problem for short) is as follows: given a database instance D and a CFD set Σ, decide whether there is a culprit C of D with respect to Σ such that |C| ≤ k.

Theorem 5. Given an instance D of relation R and a CFD set Σ, the k-culprit problem is NP-complete, even if (1) there are only 2 variable CFDs in Σ, (2) R is a 3-ary relation, and (3) for each tuple t ∈ D there are at most 6 violations involving t.

Proof.
NP. There is an NP algorithm as follows: (1) guess a subset C of D of size k; (2) check whether D − C ⊨ Σ, that is, check whether every tuple and every tuple pair in D − C satisfies all CFDs in Σ; (3) output "yes" when D − C ⊨ Σ and "no" otherwise. For each guess, the checking step can be done in polynomial time; thus, the problem is in NP.

NP-Hardness. The lower bound is established by a reduction from the 3-SAT problem to the k-culprit problem. An instance of the 3-SAT problem includes a set U of n variables x0, ..., x(n−1) and a collection S of m clauses s0, ..., s(m−1), where each clause sj = αj1 ∨ αj2 ∨ αj3 and each αjq (1 ≤ q ≤ 3) is the qth literal of sj. Given an instance of the 3-SAT problem, it is to decide whether there is a satisfying truth assignment for S. The 3-SAT problem is NP-complete, and it remains NP-complete even if, for each xi ∈ U, there are at most 5 clauses in S that contain either xi or ¬xi.

A polynomial reduction from 3-SAT to the k-culprit problem can be constructed as follows. (1) Given an instance of 3-SAT, introduce a 3-ary relation R(A, B, C) and a variable CFD set Σ including φ1: (A → B, {tp(_, _)}) and φ2: (C → B, {tp(_, _)}). (2) Build an instance D over R. For each variable xi, insert two tuples t2i(xi, xi, xi) and t2i+1(xi, ¬xi, ¬xi) into D. For each literal αjq in clause sj, if it is a positive literal of variable xi, add tuple t2n+3j+q(sj, xi, ¬xi) to D; if it is a negative literal of variable xi, add tuple t2n+3j+q(sj, ¬xi, xi) to D. (3) At last, let k = n + 2m. Note that (1) the instance D can be constructed in O(n + m) time; (2) there are 3 attributes in R and two variable CFDs in Σ; and (3) for each tuple t in D, the number of violations involving t is at most 6, since each variable appears in at most 5 clauses.

Suppose that the 3-SAT instance is satisfiable; that is, there is a satisfying truth assignment ρ: U → {0, 1} for S. Then there is a culprit C of D of size at most n + 2m, computed as follows: for each variable xi, if ρ(xi) = 1, delete tuple t2i+1 from D, and if ρ(xi) = 0, delete tuple t2i; then, for each clause sj, pick one literal αjq made true by ρ (at least one exists) and delete the other two tuples of {t2n+3j+1, t2n+3j+2, t2n+3j+3} from D. Hence, for each i, exactly one of t2i and t2i+1 is deleted, and, for each j, exactly two of {t2n+3j+1, t2n+3j+2, t2n+3j+3} are deleted; the remaining data satisfies Σ, because the kept literal tuple of each clause shares its C-value only with variable tuples carrying the same B-value. Therefore, the deleted set C satisfies D − C ⊨ Σ and |C| ≤ n + 2m = k.

To see the converse, let C be a culprit such that |C| ≤ k = n + 2m. CFD φ1 forces either t2i or t2i+1 to be deleted for each 0 ≤ i ≤ n − 1, and it forces at least two tuples of {t2n+3j+1, t2n+3j+2, t2n+3j+3} to be deleted for each 0 ≤ j ≤ m − 1. That is, |C| is at least n + 2m, and therefore |C| is exactly n + 2m. Moreover, CFD φ2 forces the kept literal tuple of each clause to be consistent with the kept tuple of the corresponding variable gadget. Then, a satisfying truth assignment ρ for S is obtained by setting, for each 0 ≤ i ≤ n − 1,

ρ(xi) = 0 if t2i ∈ C, and ρ(xi) = 1 if t2i+1 ∈ C.   (3)

4. Evaluation Algorithm

For any database instance D, let Ds be the subset of D in which each tuple violates at least one constant CFD in Σ; then, we have

Cmin(D) = Ds ∪ Cmin(D − Ds);   (4)
this is because any culprit C of D must contain Ds, and C − Ds is a culprit of D − Ds, so it can be no smaller than Cmin(D − Ds). Therefore, the data dirtiness can be computed as

dirt(D, Σ) = (|Ds| + |Cmin(D − Ds)|) / |D|,   (5)
where Ds can be detected by scanning the database once. Without loss of generality, we assume from now on that Σ contains no constant CFD.

Definition 6 (conflict graph [31]). Given an instance D and a CFD set Σ with r CFDs, the conflict graph G(D, Σ) is an undirected graph ⟨V, E⟩, where V is the vertex set and E is the edge set; each vertex vi ∈ V refers to the tuple ti ∈ D, and edge vivj ∈ E iff ∃φ ∈ Σ, (ti, tj) ⊭ φ.

Example 7. Consider an instance D with tuples t1 ∼ t9 and Σ = {φ1: (A → B, {tp(_, _)}), φ2: (C → D, {tp(_, _)})}. Instance D is shown as follows:

      A  B  C  D
t1:   a  a  a  a
t2:   a  b  b  a
t3:   a  c  a  b
t4:   a  b  b  b
t5:   a  c  a  c
t6:   b  a  b  c
t7:   b  a  b  b
t8:   b  a  b  c
t9:   b  a  a  b     (6)

The conflict graphs G(D, Σ), G(D, {φ1}), and G(D, {φ2}) are shown in Figures 1, 2, and 3. For example, v1 is adjacent to v2 because t1 conflicts with t2 with respect to φ1: they have the same value on A but different values on B. Obviously, both G(D, {φ1}) and G(D, {φ2}) are subgraphs of G(D, Σ).

Figure 1: Conflict graph G(D, Σ).
Figure 2: G(D, {φ1}).
Figure 3: G(D, {φ2}).

There is a naïve 2-approximation algorithm: the minimum culprit problem can be transformed into the minimum vertex cover problem on the conflict graph built from the input database and the CFDs, and it is easy to see that the size of the minimum vertex cover of the conflict graph G(D, Σ) equals the size of the minimum culprit of D. Therefore, a naïve 2-approximation algorithm (Algorithm 1) is obtained immediately. Algorithm 1 works as follows: lines (1)–(7) build a conflict graph G for the given database instance D and CFD set Σ; line (8) computes the minimum vertex cover approximately. One can call the classic approximation algorithm [32] to find a maximal matching M(G(D, Σ)) greedily, where a matching in G is a set of pairwise nonadjacent edges and a matching is maximal if it is not a proper subset of any other matching in G. For any maximal matching M, the number of vertices in M (i.e., 2|M|) is at most twice the size of the minimum vertex cover, hence a 2-approximation of |Cmin(D)|.
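The following Python sketch (our illustration; CFDs are simplified to all-wildcard patterns, i.e., plain FDs over attribute positions) mirrors Algorithm 1 below: it materializes the conflict-graph edges by pairwise comparison and then greedily computes a maximal matching, whose vertex count over |D| is the returned 2-approximate dirtiness:

```python
from itertools import combinations

# Quadratic sketch of Algorithm 1 (ours; CFDs given as plain FDs X -> Y).
def naive_dirtiness(D, cfds):
    # Lines (1)-(7): build the conflict graph as an edge list.
    edges = [(i, j) for (i, ti), (j, tj) in combinations(enumerate(D), 2)
             if any(all(ti[a] == tj[a] for a in X) and
                    any(ti[b] != tj[b] for b in Y) for X, Y in cfds)]
    # Line (8): greedy maximal matching; its vertex set is a vertex cover
    # of size 2|M|, at most twice the minimum culprit size.
    covered = set()
    for i, j in edges:
        if i not in covered and j not in covered:
            covered.update((i, j))
    # Line (9)
    return len(covered) / len(D)

D = [("a","a","a","a"), ("a","b","b","a"), ("a","c","a","b"),
     ("a","b","b","b"), ("a","c","a","c"), ("b","a","b","c"),
     ("b","a","b","b"), ("b","a","b","c"), ("b","a","a","b")]
cfds = [((0,), (1,)), ((2,), (3,))]  # phi1: A -> B, phi2: C -> D
print(naive_dirtiness(D, cfds))      # 8/9 on the instance of Example 7
```

On the instance of Example 7 this returns 8/9, matching the maximal matching of four edges in Figure 6; however, both the running time and the potential number of edges are quadratic in n, which motivates Section 5.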
Input: Database instance D = {t1, ..., tn}, CFD set Σ = {φ1, ..., φr}.
Output: dirt(D, Σ), the dirtiness of database D w.r.t. Σ.
(1) G(V, E) ← (∅, ∅), M ← ∅;
(2) for all tuple pairs (ti, tj) in D do
(3)   for all φk ∈ Σ such that 1 ≤ k ≤ r do
(4)     if (ti, tj) ⊭ φk then
(5)       add vertices vi and vj into V;
(6)       add edge vivj into E;
(7)       GOTO line (2);
(8) M ← MaximalMatching(G(V, E));
(9) return 2|M|/|D|;

Algorithm 1: The naïve algorithm.

5. Reducing the Quadratic Cost for Large Dynamic Data

We propose another 2-approximation dirtiness evaluation algorithm, DDEva, to overcome the shortcomings stated above. A B-tree-based index is designed to enable an O(rn log n) time and O(rn) space implementation of DDEva over large data, where r is the number of CFDs given in general form rather than as simple CFDs; it avoids the potential quadratic storage of edges. Finally, an O(r log n) update method based on the efficient update of the maximal matching and the conflict graph is proposed to deal with dynamic data. Generally, the number of general CFDs, r, is a small constant, so the proposed algorithm works efficiently, as shown in the experiments. We still use simple CFDs to simplify the description below, but note that our algorithm can process general CFDs natively.

5.1. Some Notations and Observations. For clarity, we first fix the following notation. (A) Given a CFD set Σ with r variable CFDs, the jth CFD is φj, and the conflict graph G(D, {φj}) is denoted by Gj for short; recall Definition 6: Gj is a subgraph of G(D, Σ). (B) For any matching M, let VM be the set of all vertices in M. The size of M, denoted by |M|, is the number of edges in it; obviously, |VM| = 2 × |M|. For any graph G, let M(G) be a maximal matching of G. (C) Given a graph G(V, E) and a matching M, let G − M be the graph G′(V′, E′), where V′ = V \ VM and E′ is obtained by removing all edges covered by VM from E. (D) Let Kω1,...,ωl represent a complete multipartite graph with l vertex equiv-classes, such that any pair of vertices in the same equiv-class ω is nonadjacent while any pair of vertices in different equiv-classes is adjacent; for each equiv-class ωi, let |ωi| be the number of vertices in it.

Interestingly, we have the following useful observation about the conflict graph of a single CFD.

Observation 1. Each conflict graph Gj is a forest of complete multipartite graphs; that is, it is composed of several nonoverlapping connected components, and each component is a complete multipartite graph.

It is easy to find a maximal matching for each complete multipartite connected component of each Gj. However, the sum of the sizes of the Gj's maximal matchings is not a 2-approximation of the minimum vertex cover, due to overlaps among those matchings. In order to remove these overlaps, we next define a series of independent residual subgraphs ⟨Δ1, ..., Δr⟩ for G(D, Σ), in which each Δj is a counterpart of the conflict graph Gj.

Definition 8 (independent residual subgraph). Given a database instance D and a CFD set Σ = {φj | 1 ≤ j ≤ r}, the independent residual subgraphs (ir-subgraphs for short) are the subgraphs Δ1, ..., Δr of G(D, Σ) such that
Δj = G1 if j = 1, and Δj = Gj − M1,j−1 if 1 < j ≤ r,   (7)

where M1,j−1 = ⋃_{i=1}^{j−1} M(Δi).

Following Example 7, Figure 4 shows Δ1 = G1, and Figure 5 shows Δ2 = G2 − M(Δ1); Δ2 is obtained by removing the vertices v1, v2, v3, and v4 and their adjacent edges from G2, since these four vertices all belong to the maximal matching of Δ1 (drawn with dashed edges).

Figure 4: Δ1 and M(Δ1).
Figure 5: Δ2 and M(Δ2).

Observation 2. Each ir-subgraph Δj is also a forest of complete multipartite graphs.

This observation holds because Δj remains a complete multipartite graph when any vertex v and its adjacent edges are removed from Gj; for example, in Figure 5, Δ2 is still a forest of complete multipartite graphs. Interestingly, the union of the maximal matchings M(Δ1) (dashed edges in Figure 4) and M(Δ2) (dashed edges in Figure 5) is exactly a maximal matching of G(D, Σ) (dashed edges in Figure 6). This inspires the following proposition for computing a maximal matching of the conflict graph G(D, Σ).

Proposition 9. M1,r is a maximal matching of G(D, Σ), and |M1,r| = Σ_{j=1}^{r} |M(Δj)|.

5.2. Algorithm for Dirtiness Evaluation. In contrast to the naïve algorithm, we propose Algorithm 2 (DDEva), which computes the data dirtiness in O(rn log n) time rather than at quadratic cost, where r is generally a small constant. It works as follows: the r ir-subgraphs are built first, instead of the conflict graph G(D, Σ); the maximal matching of each ir-subgraph is then computed independently to obtain |VM1,r|, which is a 2-approximation of the size of the minimum vertex cover.
Input: Database instance D, CFD set Σ = {φj | 1 ≤ j ≤ r}.
Output: dirt(D, Σ), the dirtiness of database D with respect to Σ.
(1) M ← ∅;
(2) for all j such that 1 ≤ j ≤ r do
(3)   build Gj for D with respect to φj;
(4)   Δj ← Gj;
(5)   for all vi ∈ Δj do
(6)     if vi ∈ M then
(7)       remove vi and its adjacent edges from Δj;
(8)   M(Δj) ← GreedyMaximalMatching(Δj);
(9)   M ← M ∪ M(Δj);
(10) return |VM|/|D|;

Algorithm 2: DDEva(D, Σ).
Figure 6: Maximal matching M(G).

Input: ir-subgraph Δ.
Output: a maximal matching M(Δ).
(1) M ← ∅;
(2) for each component Pi ∈ Δ do
(3)   L, R ← ∅;
(4)   for all i such that 1 ≤ i ≤ l do
(5)     if |L| ≤ |R| then
(6)       put ωi into L;
(7)     else
(8)       put ωi into R;
(9)   M ← M ∪ M⟨L, R⟩;
(10) return M;

Algorithm 3: MaximalMatching(Δ).
Proposition 9 guarantees the correctness of this algorithm. Briefly, there are two key points to reducing the quadratic cost: each ir-subgraph Δj can be built in O(n log n) time, and the maximal matching of each ir-subgraph can be specified quickly without scanning any of its edges.
5.2.1. Building ir-Subgraphs. Let the jth CFD be φj: (X → Y, Tp); to get Δj, we build the conflict graph Gj first and then remove all vertices of VM1,j−1 and their adjacent edges. Recall Observation 1: each connected component of Gj is a complete multipartite graph, which can be built by partitioning, without pairwise comparison. Concretely, we partition all tuples of D according to their attribute values on X and Y: each connected component consists of the tuples with the same attribute values on X, and each equiv-class in a component consists of the tuples with the same attribute values on Y. Because we do not need to store the edges of a complete multipartite graph, only the vertices in VM1,j−1 need to be removed from Gj. As the size of VM1,j−1 never exceeds n, it takes at most O(log n) time to check whether a vertex v ∈ VM1,j−1 using a lookup data structure. Therefore, all ir-subgraphs can be built in O(rn log n) time.

5.2.2. Finding Maximal Matching Greedily. By Observation 2, each connected component of an ir-subgraph is also a complete multipartite graph. We can therefore specify a maximal matching of each connected component without scanning any edge, so that the time cost of computing the maximal matching depends only on the number of vertices rather than the number of edges. Algorithm 3 finds such a maximal matching quickly. Concretely, the maximal matching of each ir-subgraph is obtained as the union of the maximal matchings of its components. For each component Pi containing l equiv-classes ω1, ..., ωl, we group the l equiv-classes into two groups ⟨L, R⟩. During the scan of Pi, each equiv-class is greedily added to the group with the smaller cardinality, where the cardinality of a group S = {ω1, ..., ωs} is |S| = Σ_{i=1}^{s} |ωi|. Then, for a grouping ⟨L, R⟩, we build the maximal matching M⟨L, R⟩ by matching each vertex in group L with a vertex of R in order. The time cost of Algorithm 3 therefore depends only on the number of vertices, not the number of edges.
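Putting Sections 5.2.1 and 5.2.2 together, the following in-memory Python sketch (ours; it again treats CFDs as plain FDs over attribute positions and replaces the O(log n) external index with hash maps) builds each ir-subgraph by partitioning and matches it with the greedy L/R grouping:

```python
from collections import defaultdict

def component_classes(D, X, Y, removed):
    """5.2.1: one component per X-value, one equiv-class per Y-value;
    vertices matched in earlier ir-subgraphs ('removed') are dropped."""
    comps = defaultdict(lambda: defaultdict(list))
    for tid, t in D.items():
        if tid not in removed:
            comps[tuple(t[a] for a in X)][tuple(t[a] for a in Y)].append(tid)
    return comps

def greedy_maximal_matching(comps):
    """5.2.2: split the equiv-classes of each component into groups L and R
    greedily by cardinality, then pair vertices of L with vertices of R."""
    matching = []
    for classes in comps.values():
        L, R = [], []
        for cls in classes.values():
            (L if sum(map(len, L)) <= sum(map(len, R)) else R).append(cls)
        left = [v for cls in L for v in cls]
        right = [v for cls in R for v in cls]
        matching += list(zip(left, right))  # cross-class pairs are valid edges
    return matching

def dd_eva(D, cfds):
    """Algorithm 2 (DDEva): matched vertices are excluded from later
    ir-subgraphs; 2|M| matched vertices over |D| is the dirtiness estimate."""
    matched = set()
    for X, Y in cfds:
        for u, v in greedy_maximal_matching(component_classes(D, X, Y, matched)):
            matched.update((u, v))
    return len(matched) / len(D)

# Example 7: phi1: A -> B, phi2: C -> D (attribute positions 0..3).
D = {i + 1: t for i, t in enumerate([
    ("a","a","a","a"), ("a","b","b","a"), ("a","c","a","b"),
    ("a","b","b","b"), ("a","c","a","c"), ("b","a","b","c"),
    ("b","a","b","b"), ("b","a","b","c"), ("b","a","a","b")])}
print(dd_eva(D, [((0,), (1,)), ((2,), (3,))]))  # 8/9, as in Figure 6
```

On Example 7 this reproduces the matchings {(v1, v2), (v3, v4)} for Δ1 and {(v6, v7), (v5, v9)} for Δ2 described in Example 10 below, and returns 8/9.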
Figure 7: Computing maximal matching independently. (a) Grouping of Δ1 built greedily; (b) grouping of Δ2 built greedily.

Example 10. Following Example 7, the groupings of Δ1 and Δ2 are shown in Figures 7(a) and 7(b), where maximal matchings are represented by dashed edges. Recall Example 7; there are two
components P1 and P2 in the ir-subgraph Δ1. In P1, equiv-class ω1 contains only v1, equiv-class ω2 contains v2 and v4, and equiv-class ω3 contains v3 and v5. Algorithm 3 first adds ω1 to group L. Since |R| = 0 < |L|, ω2 is then added to group R. At last, ω3 is added to group L because |R| > |L|. The maximal matching of P1 is obtained by matching the vertices between L and R one by one. In P2, however, there is only one equiv-class, so group R is empty; thus, the maximal matching of P2 is the empty set. Therefore, we get the maximal matching {(v1, v2), (v3, v4)} for Δ1. In a similar way, we find the maximal matching {(v6, v7), (v5, v9)} for Δ2. Finally, the union of both matchings is exactly a maximal matching of G(D, Σ), shown by dashed edges in Figure 6. Obviously, maximal matching finding can be done in O(n) time for each ir-subgraph.

Additionally, we observe that, in each component Pi, all of the unmatched vertices belong to a single equiv-class ωj (1 ≤ j ≤ l), which is called the tail class τ(Pi); for example, in Δ1, τ(P1) is ω3 and τ(P2) is ω1, as shown by the dashed rectangles in Figure 7. Obviously, there is at most one tail class in a component.

5.3. Update. According to Definition 8, all the ir-subgraphs are updated based on updates of the maximal matchings. We next show how to maintain the maximal matching, followed by an efficient ir-subgraph update method.

5.3.1. Update Maximal Matching. Given an ir-subgraph Δ, when a vertex update (v, op) arrives, subroutine UpdateGrouping(Δ, v, op) updates the grouping of the component P that v is involved in. Concretely, suppose that v belongs to some equiv-class ω of P; then, UpdateGrouping updates grouping ⟨L, R⟩ according to the following two cases (E1) and (E2):

(E1) Vertex Deletion (op = delete). Without loss of generality, let |L| > |R|; clearly, the tail class satisfies τ(P) ∈ L currently. Grouping ⟨L, R⟩ needs to be updated iff (a) ω ∈ R and (b) |L − {τ(P)}| = |R|. If (a) and (b) are satisfied, UpdateGrouping deletes vertex v from ω and switches τ(P) from L into R; otherwise, it only deletes v from ω. If |L| < |R|, the opposite occurs.

(E2) Vertex Insertion (op = insert). Without loss of generality, let |L| > |R|. Grouping ⟨L, R⟩ needs to be updated iff (a) |L − {τ(P)}| = |R| and (b) ω ∈ L (ω ≠ τ(P)). If (a) and (b) are satisfied, UpdateGrouping inserts v into ω and switches τ(P) from L into R; otherwise, it only inserts v into ω. If |L| < |R|, the opposite occurs.

After updating ⟨L, R⟩, a new maximal matching M′ is obtained in greedy order.

Observation 3. Let M be the maximal matching of component P before the grouping update, and let M′ be the new maximal matching obtained in greedy order after the grouping update. If M′ ≠ M, there is one and only one vertex v′ (v′ ≠ v) such that either (a) v′ ∈ M′(Δ) but v′ ∉ M(Δ) or (b) v′ ∈ M(Δ) but v′ ∉ M′(Δ).

Input: tuple t to be processed, operator parameter op.
Output: updated ir-subgraphs.
(1) for i ← 1 to r do
(2)   M′(Δi), v′ ← UpdateGrouping(Δi, v, op);
(3)   if v′ ∈ M′(Δi) and v′ ∉ M(Δi) then
(4)     v ← v′, op ← delete;
(5)   if v′ ∈ M(Δi) and v′ ∉ M′(Δi) then
(6)     v ← v′, op ← insert;

Algorithm 4: UpdateSubgraph(t, op).

5.3.2. Update ir-Subgraph. Algorithm 4 shows how to update the ir-subgraphs. When a tuple update arrives, all ir-subgraphs are updated one by one. Specifically, starting from Δ1, UpdateSubgraph updates each ir-subgraph according to the parameter op by calling UpdateGrouping; for a vertex deletion, op is set to "delete", and for a vertex insertion, it is set to "insert". Algorithm 4 ends when all r ir-subgraphs have been processed.

This algorithm is correct. Indeed, for each Δj, if the update of v does not change the maximal matching M(Δj), then v does not belong to M(Δj), and it should still be inserted into or deleted from the following ir-subgraphs. If the update of v does change M(Δj), then, by Observation 3, there is one and only one vertex v′ that "was matched but becomes unmatched" or "was unmatched but becomes matched"; according to the definition of ir-subgraph, v′ should then be inserted into or deleted from the following ir-subgraphs accordingly. Therefore, an important consequence is that each ir-subgraph needs to be processed only once by UpdateSubgraph. In the next section, each ir-subgraph is organized as a compact structure in which the costs of vertex operations, including insertion, deletion, and lookup, are all O(log n).
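The cascade of Algorithm 4 can be summarized by the control-flow sketch below (ours; update_grouping is a hypothetical callback standing in for the (E1)/(E2) logic, assumed to return the single flipped vertex of Observation 3 together with its new status, or None when the matching is unchanged):

```python
from typing import Callable, Optional, Tuple

Flip = Optional[Tuple[str, bool]]  # (flipped vertex id, is_now_matched)

def update_subgraphs(subgraphs, v: str, op: str,
                     update_grouping: Callable[[object, str, str], Flip]) -> None:
    """Propagate one tuple update through Delta_1..Delta_r (Algorithm 4),
    touching each ir-subgraph exactly once."""
    for delta in subgraphs:
        flip = update_grouping(delta, v, op)
        if flip is None:
            continue                 # v stays unmatched: same (v, op) cascades
        v, now_matched = flip
        # Lines (3)-(6): a newly matched vertex must be deleted from the
        # following ir-subgraphs; a newly unmatched one must be inserted.
        op = "delete" if now_matched else "insert"
```

With the B-tree layout of Section 5.4, each iteration of this loop costs O(log n), giving the O(r log n) update bound.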
In the processing procedure of each ir-subgraph, UpdateGrouping (line (2)) and vertex checking (lines (3) and (5)) can each be done in O(log n) time; therefore, at most O(r log n) time is needed to update the r ir-subgraphs.

Example 11. Following Example 10, consider two tuple updates:

(U1): insert t10: ⟨a, a, a, a⟩ into D;
(U2): insert t11: ⟨a, a, a, b⟩ into D.   (8)

The grouping after each insertion is shown in Figures 8(a) and 8(b).

Figure 8: Example for update processing. (a) Inserting v10 with respect to tuple t10; (b) inserting v11 with respect to tuple t11.

We describe the processing of (U1) and (U2) as follows:

(U1) In Δ1, according to its attribute values on X and Y, vertex v10 belongs to ω1 of P1. In P1, ω1 belongs to L; vertex v10 does not change the grouping, and ω3 is still the tail class since |L| > |R|. However, vertex v3 is no longer matched; intuitively, it is squeezed out of the matching, so it should be inserted into Δ2. After the update of Δ2, class ω2 has become the tail class of component P2 in Δ2.

(U2) After the insertion of vertex v11, UpdateSubgraph switches the tail class ω3 from L into R according to case (E2). Because v3 becomes matched after the switch, it is deleted from Δ2, and v9 is matched with v5 in Δ2.

5.4. Implementation. In this subsection, a compact structure is given to support the following efficient operations: (1) answering the membership query of whether a vertex belongs to the maximal matching in O(log n) time and (2) updating each ir-subgraph and its maximal matching in O(log n) time once a tuple update arrives.

5.4.1. Compact Structure for ir-Subgraph. As shown in Figure 9, we store each ir-subgraph Δi (1 ≤ i ≤ r) as an index Ii over database D (implemented as a B-tree in this paper). Concretely, given D and a variable rule φi: ⟨X → Y, Tp⟩, Ii indexes only those tuples satisfying ∃tp ∈ Tp, t[X] ≍ tp[X];
the index key is (X, Y, id) of each tuple. In the B-tree implementation, (a) each entry in an index node refers to a vertex of Δi, and (b) all the vertices of each equiv-class and all the equiv-classes of each component are organized as double linked lists, so that they can be updated in constant time once the maximal matching changes. Additionally, two kinds of header entries are kept in the index.

K-Header. For each component P, a K-header keeps the following information about the component:

y(P): the attribute value on Y corresponding to the tail class τ(P). Each equiv-class in a component is identified uniquely by its attribute value on Y.

e(P): the id of the last matched vertex in τ(P). It is less than the ids of all unmatched vertices, because vertices are sorted by id inside each equiv-class. For example, in Δ1 shown in Figure 7, e(P1) = "3", while we let e(P2) = −∞ as no vertex is matched in component P2.

e(L): points to the tail of the double linked list of L.

e(R): points to the tail of the double linked list of R.

W-Header. For each equiv-class ω, a W-header keeps the following information:

e(ω): points to the tail of the double linked list of ω.

g(ω): indicates which group ω belongs to.

5.4.2. Supporting Membership Query. Given a vertex v and an ir-subgraph Δi, the membership query of whether v is in the maximal matching M(Δi) is answered in O(log n). Let v refer to tuple t, and find the K-header of its component P by key value (t[X], −∞, −∞); then, v ∈ M(Δi) iff either (a) t[Y] ≠ y(P), which means v ∉ τ(P) and hence v is matched, or (b) t[id] ≤ e(P).

5.4.3. Supporting Update on ir-Subgraphs. In the B-tree implementation, it takes only O(log n) time on average to insert or delete a vertex in an ir-subgraph. Once a tuple update results in a change of a maximal matching, UpdateGrouping
only updates the corresponding K-header and W-header in constant time, after finding both headers in O(log n) time.

Figure 9: Compact structure for storing ir-subgraph Δ1.

Example 12. Following Example 10, Figure 9 shows the storage implementation of ir-subgraph Δ1 for D with respect to φ1: A → B. For component P1, its K-header is found by key "(A: a, B: −∞, id: −∞)". The K-header of P1 records the size of the maximal matching of P1 as |M| = 2 with respect to the grouping after inserting t10, where |L| = 4 and |R| = 2. The K-header also records y(P1) = "c" for the tail class ω3, which is identified uniquely in P1 by the value "c" (its attribute value on B), while e(P1) is set to 10, referring to the end vertex v10. The pointer e(L) (resp., e(R)) in the K-header points to the W-header of the last class ω3 (resp., ω2) of group L (resp., R). In the index, the equiv-classes ω1 and ω3 (resp., ω2) of group L (resp., R) are organized as a double linked list via pointers placed in the corresponding W-headers; the vertices inside each class are also organized as a double linked list via pointers placed in the corresponding entries. Considering the insertion of t11, UpdateSubgraph updates Δ1 in the first iteration by the following three steps.

Step 1 (update the double linked list of vertices in each equiv-class). Insert v11 as the new tail of the double linked list of ω1; that is, find the last vertex before the insertion of v11, reset its pointer, and then change the last-vertex field in the W-header of ω1 to "11", the id of vertex v11. Here, the key of the last vertex of ω1 can be fetched from the W-header of ω1.

Step 2 (update the double linked lists of groups L and R). Because |L| − |R| > |ω3| after the insertion of v11, ω3 needs to be switched from L into R. Concretely, (1) delete the W-header of ω3 from the double linked list of group L (its key ⟨A: a, B: c, id: −∞⟩ is obtained from the value of attribute A of t11 and y(P1)): this is implemented by setting the pointer e(ω1) to null and changing the pointer e(L) to point to ω1; (2) insert a W-header for ω3 into the double linked list of group R: this is implemented by setting the pointer e(ω2) to point to ω3 and changing the pointer e(R) to point to ω3, meanwhile changing g(ω3) to "R".

Step 3 (update the information recorded in the K-header). Since |R| > |L| now, ω3 remains the tail class of this component, but e(P1) is updated to 3 because the new vertex v11 matches v3. At last, a membership query of whether v3 is matched is necessary to decide how to update the following
ir-subgraph Δ2 in the next iteration of UpdateSubgraph: τ(P1) and e(P1) are fetched from the K-header of component P1, and it is checked that v3 ∈ ω3 and v3.id ≤ e(P1) = 3, so v3 is matched. It is easy to see that, in each iteration of UpdateSubgraph, these steps are achieved by querying the B-tree index in O(log n) time and updating the linked lists in O(1) time. After at most r iterations, the update finishes in O(r log n) time.
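For concreteness, the membership test of Section 5.4.2 can be sketched as follows (our stand-in: a plain dictionary of K-headers replaces the B-tree lookup, and the data mirrors Δ1 of Example 10 before the updates; per the tail-class observation, a vertex is matched iff it lies outside the tail class or its id is at most e(P)):

```python
class KHeader:
    """Per-component header (Section 5.4.1): y(P) identifies the tail class
    by its Y-value; e(P) is the id of its last matched vertex."""
    def __init__(self, y_tail, e_last):
        self.y, self.e = y_tail, e_last

# Delta_1 of Example 10 before the updates: component P1 (X-value "a"),
# tail class omega_3 (Y-value "c"), last matched tail vertex v3.
headers = {"a": KHeader(y_tail="c", e_last=3)}

def is_matched(x_val, y_val, tid):
    # In the real index the K-header is found by key (x_val, -inf, -inf).
    P = headers[x_val]
    # Every vertex outside the tail class is matched; inside the tail class,
    # only the prefix of ids up to e(P) is matched.
    return y_val != P.y or tid <= P.e

print(is_matched("a", "c", 3))  # True: v3 is matched
print(is_matched("a", "c", 5))  # False: v5 is unmatched
print(is_matched("a", "b", 4))  # True: v4 (class omega_2) is matched
```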
6. Optimizations and Extensions

6.1. Key Value Compression. With respect to a CFD (X → Y, Tp), the sort key of each tuple consists of its values on X and Y and the tuple id. Reducing the size of the key improves the efficiency of finding a vertex in the index. To this end, we build two prefix trees, one for X and one for Y, and assign each leaf a unique id. Each string of arbitrary size is then transformed into an integer; that is, each key value is compressed into a triple of integers. As the string size of each attribute is bounded by a fixed constant, each prefix tree has a fixed height; thus, this transformation can be done in constant time.

6.2. The Number of Indexes. In practice, a CFD is usually given in a general form and can be transformed into many simple rules. That is, r may be very large, many indexes would need to be built, and many copies of isolated vertices would be stored. However, our algorithm can process each CFD in general form natively, because the conflict graph with respect to a general CFD is also a forest of complete multipartite graphs. Therefore, only one index needs to be built per general CFD; in practice, the number of indexes to be built equals the number of general CFDs.

6.3. Minimum Space Cost. By the definition of ir-subgraph, each ir-subgraph has to store copies of the vertices that are unmatched in the previous ir-subgraphs. Reducing the size of each index improves its efficiency, and the total size of all indexes depends on the processing order of the r indexes. To reduce the space cost caused by this redundancy, we should choose a good processing order of the CFDs. However, the order that minimizes the overall space cost cannot be precomputed, and it also changes as the data is updated. Therefore, we choose the processing order of the CFDs as the decreasing order of the factor supp(φ)/conf(φ), where supp(φ) and conf(φ) are the support and confidence of a given CFD φ, respectively; both values can be obtained by a sampling
method [28]. Intuitively, the bigger the ratio supp(φ)/conf(φ) of an index is, the more tuples are matched as early as possible, so that (1) fewer tuple copies are stored in the indexes of the following ir-subgraphs and (2) the number of queries performed when building and updating the indexes of all ir-subgraphs is also reduced.
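The two optimizations of Sections 6.1 and 6.3 can be sketched as follows (ours; a dictionary of interned ids stands in for the fixed-height prefix trees, preserving equality of X/Y values, which is the only property the partition needs, and a sort stands in for choosing the index processing order):

```python
class Interner:
    """Maps each distinct X- or Y-value to a small integer (Section 6.1)."""
    def __init__(self):
        self.ids = {}
    def id_of(self, value):
        return self.ids.setdefault(value, len(self.ids))

x_ids, y_ids = Interner(), Interner()

def compressed_key(x_value, y_value, tid):
    """Index key (X, Y, id) as a triple of integers instead of strings."""
    return (x_ids.id_of(x_value), y_ids.id_of(y_value), tid)

def processing_order(cfds, supp, conf):
    """Section 6.3: process CFDs in decreasing order of supp/conf, both
    estimated by sampling [28]; supp and conf are dicts keyed by CFD."""
    return sorted(cfds, key=lambda phi: supp[phi] / conf[phi], reverse=True)
```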
7. Experiments

We next present an experimental study of the data dirtiness evaluation algorithms, measuring the elapsed time and the quality of the evaluation results. Using both the synthetic data TPC-H and real-life data including DBLP and IMDB, we focus on scalability by varying the following three parameters: (1) |D|, the size of the original database; (2) |ΔD|, the size of the updates; (3) |Σ|, the number of CFDs.

7.1. Experimental Settings. We used synthetic and real-life data.

7.1.1. Datasets. (a) TPC-H [33]: we built a wide table by joining all 8 tables. The data ranges from 2 million tuples (2 M) to 10 million tuples (10 M); note that 10 M tuples occupy almost 10 GB. (b) IMDB [34]: we extracted a 1.12 GB relation from its XML data; the data scales from 1 M to 4 M tuples, where 4 M tuples occupy almost 1.12 GB. (c) DBLP [35]: we extracted a 1.4 GB relation from its XML data; the data scales from 500 K to 3.6 M tuples, where 3.6 M tuples occupy almost 1.4 GB.

7.1.2. CFDs. We designed the CFDs manually and varied them by modifying patterns. (a) TPC-H: the number |Σ| of variable CFDs ranges from 20 to 100, including 5% FDs, with 40 by default. (b) IMDB: |Σ| scales from 5 to 20 variable CFDs, including 3 FDs, with 10 by default. (c) DBLP: |Σ| scales from 5 to 20, including 3 FDs, with 10 by default.

7.1.3. Updates. Updates contain 90% insertions and 10% deletions. The size of the updates is up to 10 GB (about 10 M tuples) for TPC-H and up to 3 M tuples each for DBLP and IMDB.

7.2. Implementation. We denote by DDEva the straightforward implementation of our evaluation algorithm, while adjusted-DDEva refers to the variant with the order-adjusting method based on sampling. We compare our algorithms with the naïve algorithm. In the implementation of the naïve algorithm, we use adjacency lists to store the conflict graph G(D, Σ) and build an index over all vertices by their ids so that each vertex can be found efficiently. To lower the cost of finding all violations as much as possible, for each CFD (X → Y, Tp), we partition the database into blocks according to the values on X and check the tuple pairs for violations within each block, rather than checking all possible tuple pairs naïvely.

All code was written in C/C++ and compiled with Visual Studio 2005 and the QT4 library. We ran our algorithms on a
Windows 7 platform on a Dell OptiPlex 790 PC with a 3.10 GHz Intel Core i5 CPU, 4 GB of memory, and a 5400 rpm hard disk. In the following, each algorithm is run five times under each setting and the average time is reported. In each run, we read a large amount of random data to flush the I/O cache.

7.3. Experimental Results for the Evaluation Algorithm

Exp-1: Impact of |D|. In the first set of experiments, we show the impact of the size of the database D on the performance of the inconsistency evaluation algorithms on static data. Fixing |Σ| = 40 (including 5% FDs), |D| is varied from 2 M to 10 M tuples (10 GB) for TPC-H. For IMDB and DBLP, |D| is varied from 500 K to 3 M tuples while fixing |Σ| = 10 (including 2 FDs) for both datasets. The elapsed time in seconds when varying |D| is shown in Figure 10. The results first show that the naïve algorithm takes too much time; we manually terminated it when the elapsed time exceeded the top boundary of the plot. They also show that adjusted-DDEva outperforms DDEva on both real-life datasets, while sometimes it does not on TPC-H; this is because both real-life datasets are much less dirty with respect to many CFDs, so that most tuples are matched early, avoiding redundancy in the subsequent indexes, and the factor supp(φ)/conf(φ) captures exactly this. Figure 10 also shows that both DDEva and adjusted-DDEva scale well with |D|. Our algorithms work well on the synthetic data TPC-H as well as the real-life data DBLP and IMDB, demonstrating that DDEva can deal with large datasets efficiently.

Exp-2: Impact of |ΔD|. In the second set of experiments, we show how the size |ΔD| of the changes to the database affects the performance of the inconsistency evaluation algorithms. Fixing |Σ| = 50 and |D| = 2 M, |ΔD| is varied from 2 M to 10 M tuples for TPC-H; |ΔD| is varied from 500 K to 3000 K tuples for DBLP and IMDB while fixing |D| = 500 K and |Σ| = 16. The elapsed times in seconds when varying |ΔD| for TPC-H (resp., DBLP and IMDB) are shown in Figure 11(a) (resp., Figures 11(b) and 11(c)). As shown there, the elapsed time of adjusted-DDEva scales well with |ΔD|, for example, 55 seconds when |ΔD| grows from 2 M to 4 M and 110 seconds when it grows from 8 M to 10 M in Figure 11(a). Moreover, adjusted-DDEva updates the result much more efficiently than DDEva on both real-life datasets and exhibits slower growth than DDEva.

Exp-3: Impact of |Σ|. In this set of experiments, we study the impact of the number of variable CFDs on data dirtiness evaluation. Fixing |D| = 2 M and |ΔD| = 10 M for TPC-H, we varied |Σ| from 20 to 100, including 5% FDs. Moreover, fixing |D| = 500 K and |ΔD| = 2000 K for DBLP and IMDB, we varied |Σ| from 8 to 20, including 3 FDs. The elapsed times when varying |Σ| from 20 to 100 for
Figure 10: Elapsed time of DDEva and adjusted-DDEva. (a) TPC-H; (b) DBLP; (c) IMDB. Each panel plots time (s) against the number of tuples for DDEva, adjusted-DDEva, and the naïve algorithm.
TPC-H (resp., from 8 to 20 for DBLP and IMDB) are shown in Figure 12(a) (resp., Figures 12(b) and 12(c)). Both DDEva and adjusted-DDEva evaluate the data dirtiness with good scalability when varying |Σ|. As the number of indexes increases with |Σ|, the elapsed time of DDEva increases with |Σ|. However, adjusted-DDEva performs better than DDEva, since the sizes of the higher-ranked indexes are very small after adjusting the processing order of the CFDs in Σ. The results demonstrate that adjusted-DDEva has good scalability with |Σ| and works well with a larger number of CFDs. Note that, in Figures 12(b) and 12(c), the increase of |Σ| does not lead to fast growth of adjusted-DDEva's elapsed time; this is because the number of FDs included in the CFD sets for DBLP and IMDB is fixed and captures most conflicts, so that a large number of random I/Os in the following indexes is avoided in practice.

Exp-4: Space Cost. In this set of experiments, we study the sizes of the indexes that our algorithm builds. Fixing the number of general CFDs |Σ| = 5, each with 200 randomly generated pattern tuples, and setting |D| to 10 M for TPC-H (about 10 GB) and 4 M for DBLP and IMDB (about 1 GB), we record the size of each index for the three datasets. The results are shown in Figure 13. First, for each dataset, the index size decreases with the processing order; this is consistent with the definition of
Figure 11: Elapsed time of DDEva and adjusted-DDEva under updates. (a) TPC-H; (b) DBLP; (c) IMDB. Each panel plots elapsed time (s) against the number of tuples in ΔD for DDEva and adjusted-DDEva.
ir-subgraph. Second, the total space our algorithm takes does not depend on the width of the dataset, thanks to key compression; in practice, a pair of 32-bit or 64-bit integers is enough to partition the dataset according to the values on X and Y without errors. Third, as shown in the results of this experiment, adjusted-DDEva takes less total space than its counterpart, and the first few indexes cover most of the matched tuples.

Exp-5: Quality of Evaluation Result. In the last set of experiments, we study the quality of the evaluation results of our method based on the minimum culprit with respect to the CFD set Σ (MC), in contrast with a naïve method based on conflict counting (CC). Here, we introduce a variable σ_{i,i−1} representing the difference in data inconsistency between the ith update and the (i−1)th update; concretely, (1) σ^MC_{i,i−1} is the difference of the minimum culprit size estimated with respect to the CFD set Σ, and (2) σ^CC_{i,i−1} is the difference of the number of conflicts detected. To measure the result quality of the two evaluation methods under the assumption stated in Remark 4, we compute the standard deviation of σ_{i,i−1} over 5 samples of 100 tuples that lead to inconsistency. Figure 14 shows the standard deviation of σ_{i,i−1} computed for TPC-H (|D| = 2 M, |Σ| = 100, and |ΔD| varied from 1 M to 5 M, including only insert operations) and for DBLP and IMDB (|D| = 500 K, |Σ| = 20, and |ΔD| varied from 500 K to 2000 K, including only insert operations). The figure shows that, for each dataset, σ^MC_{i,i−1} is insensitive to a single update operation, whereas σ^CC_{i,i−1} is very sensitive to
Figure 12: Elapsed time of DDEva and adjusted-DDEva when varying |Σ|. (a) TPC-H; (b) DBLP; (c) IMDB. Each panel plots elapsed time (s) against the number of variable CFDs for DDEva and adjusted-DDEva.
That is to say, the evaluation method MC studied in this paper gives a much smoother monitoring curve than the naïve method CC.
7.4. Summary. We draw the following conclusions from the experiments conducted on the synthetic dataset TPC-H and the real-life datasets DBLP and IMDB. (1) Our evaluation algorithms scale well with respect to the size of the database |D|, the size of the changes to the database |ΔD|, and the number of CFDs |Σ| for large data (Exp-1 to Exp-3). (2) The modified algorithm adjusted-DDEva substantially outperforms its counterpart DDEva, especially on the real-life datasets and for larger |Σ|, because a few CFDs in Σ have small confidence while the others do not. (3) The evaluation method based on the minimum culprit with respect to CFD set Σ reveals how dirty the data is under the typical assumption that "(1) most of the data is correct, especially for large data, and (2) an update of a tuple with an error has a very tiny impact on the dirtiness of the entire dataset" (Exp-5).
8. Related Works
Conditional functional dependencies (CFDs) were first proposed in [16]; the SQL techniques provided there have been broadly applied in data cleaning and can be used to detect inconsistencies in databases. However, no existing work focuses on efficiently computing the inconsistency of a database based on CFDs.
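To keep the discussion self-contained, here is a minimal sketch of CFD violation detection over a single relation (the attribute names and data are illustrative; [18] implements this with SQL queries): a CFD (X → Y, tp) is violated by a single tuple that matches the pattern on X but contradicts a constant that tp fixes for Y, or by two tuples that match the pattern, agree on X, and disagree on Y.

    from collections import defaultdict

    def cfd_violations(tuples, X, Y, tp):
        """Detect violations of a CFD (X -> Y, tp).  tp maps attributes to
        a constant or to '_' (wildcard).  Returns indices of violators."""
        bad = set()
        groups = defaultdict(list)
        for i, t in enumerate(tuples):
            # The tuple must match the pattern on the X attributes.
            if any(tp[a] != "_" and t[a] != tp[a] for a in X):
                continue
            # Constant pattern on Y: a single tuple can violate on its own.
            if tp[Y] != "_" and t[Y] != tp[Y]:
                bad.add(i)
            groups[tuple(t[a] for a in X)].append(i)
        # Variable pattern on Y: matching tuples agreeing on X must agree on Y.
        if tp[Y] == "_":
            for idxs in groups.values():
                if len({tuples[i][Y] for i in idxs}) > 1:
                    bad.update(idxs)
        return bad

    rows = [{"cc": "44", "zip": "EH4", "city": "EDI"},
            {"cc": "44", "zip": "EH4", "city": "GLA"}]
    # CFD ([cc, zip] -> city, (44, _ || _)): in the UK, zip determines city.
    print(cfd_violations(rows, X=("cc", "zip"), Y="city",
                         tp={"cc": "44", "zip": "_", "city": "_"}))  # {0, 1}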
Figure 13: Space cost of DDEva and adjusted-DDEva on (a) TPC-H, (b) DBLP, and (c) IMDB; x-axis: index number; y-axis: index size (MB).
The works most relevant to this paper fall into inconsistency detection and inconsistency resolution. For inconsistency detection, several techniques detect errors efficiently: SQL techniques for detecting CFD violations were given in [18], practical algorithms for detecting violations of CFDs in fragmented and distributed relations were provided in [19], and an incremental detection algorithm was proposed in [20]. In contrast to inconsistency detection, inconsistency evaluation must compute a quantized dirtiness value of the data rather than find all violations. For data repair, there are two kinds of work based on FDs/CFDs, both aiming to directly resolve the inconsistency of a database. One kind repairs data by minimizing the repair cost, for example, [22, 24, 29, 36, 37]: given the data edit operations (tuple-level and cell-level), a minimum-cost repair outputs repaired data that minimizes the difference from the original. Our problem can be seen as a special case of [29], because the complement of a minimum culprit can be seen as a C-repair (cardinality repair) of an inconsistent database; however, using the techniques of [27] directly is much more expensive, especially for dynamic data, and the algorithm given in this paper is more efficient and appears optimal. There are other repair definitions, such as minimum description length (MDL) [23] and relative trust [21]. To the best of our knowledge, there is almost no polynomial approximation algorithm with a good ratio bound for repairing inconsistent data based on CFDs, apart from a few constant-ratio approximation algorithms such as [25], whose ratio does not reach 2; such repair algorithms only require an approximate minimum vertex cover of a conflict graph as a starting point, which is why they do not consider how to compute it efficiently.
Figure 14: Standard deviation of σ_{i,i−1} for MC and CC on the five samples; x-axis: sample number; y-axis: standard deviation (log scale).
Moreover, these methods cannot handle large dynamic data well, because they start by finding all FD violations, and a conflict hypergraph with respect to all FDs/CFDs must be built first, which may take quadratic time and space. Another kind of method is consistent query answering (CQA) [29, 36, 38–43]: for a fixed Boolean query q, CQA(q) is the following problem: given a database D, decide whether q evaluates to true on every repair of D. Such methods do not edit the data but compute query answers that hold in all possible repairs of the original database, and, unfortunately, no CQA technique can be used directly in this paper. Some works also study theoretical results for CQA under C-repair [38, 44–46], but they do not provide techniques able to solve the problem studied in this paper. Additionally, if the minimality assumption fails or there are multiple optimal repairs of the data, the output of a repair approximation algorithm can be meaningless even when it comes with an accuracy guarantee. In contrast to data repair, this paper aims to output the value of dirtiness for data quality evaluation, monitoring, and so on; a constant-factor approximation of the dirtiness also serves as a lower bound on the repairing cost.
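For intuition on the vertex-cover connection: tuples are vertices, each conflicting pair is an edge, and a minimum culprit is a minimum vertex cover of this conflict graph. The textbook maximal-matching heuristic below illustrates the 2-approximation bound; it is our illustration, not the ir-subgraph-based algorithm of this paper, which additionally avoids materializing the whole graph.

    def vertex_cover_2approx(edges):
        """Take both endpoints of every edge not yet covered (a maximal
        matching); the result is at most twice a minimum vertex cover,
        hence at most twice the minimum culprit size."""
        cover = set()
        for u, v in edges:
            if u not in cover and v not in cover:
                cover.update((u, v))  # both endpoints of a matched edge
        return cover

    # Conflict graph on four tuples: t1-t2, t2-t3, t3-t4.
    print(vertex_cover_2approx([("t1", "t2"), ("t2", "t3"), ("t3", "t4")]))
    # {'t1', 't2', 't3', 't4'}: twice the optimum cover {t2, t3}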
9. Conclusions
This paper studied data inconsistency evaluation based on CFDs, in order to give users a quantized quality value. We proved that dirtiness evaluation is NP-complete even when the condition is simple, and that, for any ε > 0, it is hard to approximate within 2 − ε in polynomial time. The time complexity of our 2-approximation algorithm is O(rn log n), and it scales well. To deal with larger data and its updates, the compact structure reduces the storage of the conflict graph to O(rn) and the update time to O(r log n). The experiments show that our algorithm scales well with the data size and that the quality of its evaluation result is good.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work is supported in part by the National Basic Research Program of China (973 Program) under Grant no. 2012CB316200, the National Natural Science Foundation of China (NSFC) under Grants nos. 61190115 and 61370217, the Fundamental Research Funds for the Central Universities under Grant no. HIT.KISTP201415, the National Science Foundation (NSF) under Grants nos. CNS-1152001 and CNS-1252292, the Research Fund for the Doctoral Program of Higher Education of China under Grant no. 20132302120045, and the Natural Scientific Research Innovation Foundation at Harbin Institute of Technology under Grant no. HIT.NSRIF.2014070.
References
[1] R. R. Rajkumar, I. Lee, L. Sha, and J. Stankovic, "Cyber-physical systems: the next computing revolution," in Proceedings of the 47th Design Automation Conference (DAC '10), pp. 731–736, ACM, Anaheim, Calif, USA, June 2010.
[2] J. Li, S. Cheng, H. Gao, and Z. Cai, "Approximate physical world reconstruction algorithms in sensor networks," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 12, pp. 3099–3110, 2014.
[3] X. Cheng, A. Thaeler, G. Xue, and D. Chen, "TPS: a time-based positioning scheme for outdoor wireless sensor networks," in Proceedings of the 23rd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '04), vol. 4, pp. 2685–2696, March 2004.
[4] M. Ding, D. Chen, K. Xing, and X. Cheng, "Localized fault-tolerant event boundary detection in sensor networks," in Proceedings of the 24th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '05), vol. 2, pp. 902–913, Miami, Fla, USA, March 2005.
[5] H. Li, Q. S. Hua, C. Wu, and F. C. M. Lau, "Minimum-latency aggregation scheduling in wireless sensor networks under physical interference model," in Proceedings of the 13th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM '10), pp. 360–367, ACM, Bodrum, Turkey, October 2010.
[6] K. Sha and S. Zeadally, "Data quality challenges in cyber-physical systems," Journal of Data and Information Quality, vol. 6, no. 2-3, article 8, 2015.
[7] G. Tolle, J. Polastre, R. Szewczyk et al., "A macroscope in the redwoods," in Proceedings of the 3rd ACM International Conference on Embedded Networked Sensor Systems (SenSys '05), pp. 51–63, ACM, November 2005.
[8] R. Szewczyk, J. Polastre, A. Mainwaring, and D. Culler, "Lessons from a sensor network expedition," in Proceedings of the 1st European Workshop on Wireless Sensor Networks (EWSN '04), pp. 307–322, Berlin, Germany, 2004.
[9] Z. Cai, Z.-Z. Chen, and G. Lin, "A 3.4713-approximation algorithm for the capacitated multicast tree routing problem," Theoretical Computer Science, vol. 410, no. 52, pp. 5415–5424, 2009.
[10] Z. Cai, Z.-Z. Chen, G. Lin, and L. Wang, "An improved approximation algorithm for the capacitated multicast tree routing problem," in Combinatorial Optimization and Applications: Second International Conference, COCOA 2008, St. John's, NL, Canada, August 21–24, 2008, Proceedings, vol. 5165 of Lecture Notes in Computer Science, pp. 286–295, Springer, Berlin, Germany, 2008.
[11] Z. Cai, R. Goebel, and G. Lin, "Size-constrained tree partitioning: approximating the multicast k-tree routing problem," Theoretical Computer Science, vol. 412, no. 3, pp. 240–245, 2011.
[12] Z. Cai, G. Lin, and G. Xue, "Improved approximation algorithms for the capacitated multicast routing problem," in Computing and Combinatorics: 11th Annual International Conference, COCOON 2005, Kunming, China, August 16–19, 2005, Proceedings, vol. 3595 of Lecture Notes in Computer Science, pp. 136–145, Springer, Berlin, Germany, 2005.
[13] L. Guo, Y. Li, and Z. Cai, "Minimum-latency aggregation scheduling in wireless sensor network," Journal of Combinatorial Optimization, vol. 31, no. 1, pp. 279–310, 2016.
[14] Z. He, Z. Cai, S. Cheng, and X. Wang, "Approximate aggregation for tracking quantiles in wireless sensor networks," in Proceedings of the 8th International Conference on Combinatorial Optimization and Applications (COCOA '14), pp. 161–172, Maui, Hawaii, USA, December 2014.
[15] Z. He, Z. Cai, S. Cheng, and X. Wang, "Approximate aggregation for tracking quantiles and range countings in wireless sensor networks," Theoretical Computer Science, vol. 607, pp. 381–390, 2015.
[16] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, "Conditional functional dependencies for data cleaning," in Proceedings of the 23rd International Conference on Data Engineering (ICDE '07), pp. 746–755, Istanbul, Turkey, April 2007.
[17] S. Abiteboul, R. Hull, and V. Vianu, Foundations of Databases, Addison-Wesley, 1995.
[18] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, "Conditional functional dependencies for capturing data inconsistencies," ACM Transactions on Database Systems, vol. 33, no. 2, Article ID 1366103, 2008.
[19] W. Fan, F. Geerts, S. Ma, and H. Müller, "Detecting inconsistencies in distributed data," in Proceedings of the 26th IEEE International Conference on Data Engineering (ICDE '10), pp. 64–75, Long Beach, Calif, USA, March 2010.
[20] W. Fan, J. Li, N. Tang, and W. Yu, "Incremental detection of inconsistencies in distributed data," in Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE '12), pp. 318–329, IEEE, Washington, DC, USA, April 2012.
[21] G. Beskales, I. F. Ilyas, L. Golab, and A. Galiullin, "On the relative trust between inconsistent data and inaccurate constraints," in Proceedings of the 29th International Conference on Data Engineering (ICDE '13), pp. 541–552, Brisbane, Australia, April 2013.
[22] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi, "A cost-based model and effective heuristic for repairing constraints by value modification," in Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD '05), pp. 143–154, ACM, Baltimore, Md, USA, June 2005.
[23] F. Chiang and R. J. Miller, "A unified model for data and constraint repair," in Proceedings of the IEEE 27th International Conference on Data Engineering (ICDE '11), pp. 446–457, IEEE, Hannover, Germany, April 2011.
[24] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, "Improving data quality: consistency and accuracy," in Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07), pp. 315–326, 2007.
[25] S. Kolahi and L. V. S. Lakshmanan, "On approximating optimum repairs for functional dependency violations," in Proceedings of the 12th International Conference on Database Theory (ICDT '09), pp. 53–62, ACM, Saint Petersburg, Russia, March 2009.
[26] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas, "Guided data repair," Proceedings of the VLDB Endowment, vol. 4, no. 5, pp. 279–289, 2011.
[27] D. Miao, X. Liu, and J. Li, "On the complexity of sampling query feedback restricted database repair of functional dependency violations," Theoretical Computer Science, vol. 609, no. 3, pp. 594–605, 2016.
[28] G. Cormode, L. Golab, F. Korn, A. McGregor, D. Srivastava, and X. Zhang, "Estimating the confidence of conditional functional dependencies," in Proceedings of the 35th ACM SIGMOD International Conference on Management of Data, pp. 469–482, Providence, RI, USA, July 2009.
[29] A. Lopatenko and L. Bravo, "Efficient approximation algorithms for repairing inconsistent databases," in Proceedings of the 23rd International Conference on Data Engineering (ICDE '07), pp. 216–225, Istanbul, Turkey, April 2007.
[30] D. Miao, J. Li, X. Liu, and H. Gao, "Vertex cover in conflict graphs: complexity and a near optimal approximation," in Combinatorial Optimization and Applications: 9th International Conference, COCOA 2015, Houston, TX, USA, December 18–20, 2015, Proceedings, vol. 9486 of Lecture Notes in Computer Science, pp. 395–408, Springer, Berlin, Germany, 2015.
[31] M. Arenas, L. E. Bertossi, and J. Chomicki, "Scalar aggregation in FD-inconsistent databases," in Proceedings of the 8th International Conference on Database Theory (ICDT '01), pp. 39–53, Springer, London, UK, 2001.
[32] F. Gavril, "Algorithms for minimum coloring, maximum clique, minimum covering by cliques, and maximum independent set of a chordal graph," SIAM Journal on Computing, vol. 1, no. 2, pp. 180–187, 1972.
[33] TPC-H benchmark, http://www.tpc.org.
[34] IMDB, ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/.
[35] DBLP, http://dblp.uni-trier.de/xml/.
[36] M. Arenas, L. Bertossi, and J. Chomicki, "Consistent query answers in inconsistent databases," in Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '99), pp. 68–79, ACM, June 1999.
[37] W. E. Winkler, "Methods for evaluating and creating data quality," Information Systems, vol. 29, no. 7, pp. 531–550, 2004.
[38] J. Chomicki and J. Marcinkowski, "Minimal-change integrity maintenance using tuple deletions," Information and Computation, vol. 197, no. 1-2, pp. 90–121, 2005.
[39] L. Bertossi, L. Bravo, E. Franconi, and A. Lopatenko, "Complexity and approximation of fixing numerical attributes in databases under integrity constraints," in Database Programming Languages, pp. 262–278, Springer, Berlin, Germany, 2005.
[40] A. Calì, D. Lembo, and R. Rosati, "On the decidability and complexity of query answering over inconsistent and incomplete databases," in Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '03), pp. 260–271, June 2003.
[41] A. Fuxman, E. Fazli, and R. J. Miller, "ConQuer: efficient management of inconsistent databases," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '05), pp. 155–166, Baltimore, Md, USA, June 2005.
[42] A. Fuxman and R. J. Miller, "First-order query rewriting for inconsistent databases," Journal of Computer and System Sciences, vol. 73, no. 4, pp. 610–635, 2007.
[43] J. Wijsen, "Database repairing using updates," ACM Transactions on Database Systems, vol. 30, no. 3, pp. 722–768, 2005.
[44] M. Arenas, L. Bertossi, and J. Chomicki, "Answer sets for consistent query answering in inconsistent databases," Theory and Practice of Logic Programming, vol. 3, no. 4-5, pp. 393–424, 2003.
[45] S. Kolahi and L. Libkin, "An information-theoretic analysis of worst-case redundancy in database design," ACM Transactions on Database Systems, vol. 35, no. 1, article 5, 2010.
[46] A. Lopatenko and L. Bertossi, "Complexity of consistent query answering in databases under cardinality-based and incremental repair semantics," in Proceedings of the 11th International Conference on Database Theory (ICDT '07), pp. 179–193, Springer, Berlin, Germany, 2006.