Wang HZ, Qi ZX, Shi RX et al. COSSET+: Crowdsourced missing value imputation optimized by knowledge base. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 32(5): 845-857 Sept. 2017. DOI 10.1007/s11390-017-1768-1
COSSET+: Crowdsourced Missing Value Imputation Optimized by Knowledge Base

Hong-Zhi Wang, Member, CCF, ACM, IEEE, Zhi-Xin Qi, Ruo-Xi Shi, Jian-Zhong Li, Fellow, CCF, Member, ACM, and Hong Gao, Senior Member, CCF, Member, ACM

School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

E-mail: [email protected]; [email protected]; {shiruoxi, lijzh, honggao}@hit.edu.cn

Received April 1, 2017; revised August 15, 2017.

Abstract: Missing value imputation with crowdsourcing is a novel method in data cleaning to capture missing values that can hardly be filled with automatic approaches. However, the time cost and overhead of crowdsourcing are high. Therefore, we have to reduce cost while guaranteeing the accuracy of crowdsourced imputation. To achieve this optimization goal, we present COSSET+, a crowdsourced framework optimized by knowledge base. We combine the advantages of both a knowledge-based filter and a crowdsourcing platform to capture missing values. Since the number of crowd values affects the cost of COSSET+, we aim to select only part of the missing values to be crowdsourced. We prove that the crowd value selection problem is NP-hard and develop an approximation algorithm for it. Extensive experimental results demonstrate the efficiency and effectiveness of the proposed approaches.

Keywords: crowdsourcing, missing value, imputation, knowledge base, optimization

Regular Paper. Special Section on Crowdsourced Data Management. This work was supported by the National Natural Science Foundation of China under Grant Nos. U1509216 and 61472099, the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2015BAH10F01, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province of China under Grant No. LC2016026, and MOE-Microsoft Key Laboratory of Natural Language Processing and Speech of Harbin Institute of Technology. ©2017 Springer Science + Business Media, LLC & Science Press, China

1 Introduction
Missing values have become an inevitable problem: they affect not only decision-making in many fields but also our daily life. Incomplete data arise from many factors, such as human negligence[1], wrong measurements[2], rule violations[3], and limitations of data collection[4], which makes them difficult to prevent. Therefore, missing value imputation has drawn considerable attention, and various imputation approaches have been proposed. They can be classified into three lines. The first is traditional statistical approaches, such as regression[5-6], approximate Bayesian bootstrap[7], and multiple imputation[8-9]. These methods compute fitting values according to correlations among values. However, they depend heavily on known values and are only suitable for numerical data.
Another class of approaches is machine learning. Proposed methods include decision trees[10], Naive Bayes[11], Bayesian networks[12-13], clustering[14-15], and neural networks[16]. These imputations take the complete portion of the data as the training set and fill missing values based on classification results. However, when the training data are insufficient, such approaches may be inaccurate.

The third line focuses on capturing missing values with the help of extra knowledge, such as missing value imputation based on the Web[17] and on knowledge bases[18-20]. Even though these methods can fill missing values that cannot be imputed from the known values and dependencies in the table, they have the following drawbacks. Web-based methods solve the problem by means of knowledge from the Web. However, the Web contains various sources with unknown credibility, and the information cannot be fully trusted. Therefore, Web-based imputations may not be accurate in some cases. Knowledge-based approaches capture missing values with a public knowledge base (e.g., Yago). However, a knowledge base only provides ground truths, and there is often not enough information for imputation. For example, we can obtain neither today's weather nor the departure time of a flight from a knowledge base.

To address these problems, crowdsourcing could be adopted to fill, with human intelligence, the missing values that can hardly be imputed with the Web or a knowledge base. However, the cost of crowdsourcing is high: imputation may take long since the time of crowdsourced work is uncontrolled, and extra money is required for the salary of workers. Thus, crowdsourcing-based imputation approaches require optimization in these two aspects to reduce the cost. Even though crowdsourcing-based imputation approaches[20-22] have been proposed, optimization strategies for reducing crowdsourced cost have not been considered.

In this paper, we study optimization strategies for reducing the cost of crowdsourcing-based imputation. Since a knowledge base contains massive knowledge, it is natural to make sufficient use of it to reduce the number of crowd questions and accelerate crowdsourcing-based imputation.

It is non-trivial to apply a knowledge base to optimize crowdsourcing. Straightforwardly, we could select a share of missing values for crowdsourcing and use the knowledge base, together with the filled values, to fill other blanks. Unfortunately, not all other blanks can be filled this way, since the knowledge base cannot contain all knowledge. The remaining blanks are then imputed by the cooperative work of crowdsourcing and the knowledge base, and such iteration continues until all the missing values are imputed. To save money, as many missing values as possible should be imputed from the knowledge base, while only the missing values that cannot be imputed this way are crowdsourced. However, this may involve multiple alternating rounds of crowdsourcing and knowledge base accessing, and multiple rounds of crowdsourcing take a long time and lower the efficiency. To strike a proper balance between money and time savings, we formalize crowd value selection as a combinatorial optimization problem, prove that it is NP-hard, and develop an approximation algorithm to solve it efficiently.

In summary, the main contributions of this paper are listed as follows.
• We propose a novel crowdsourced framework optimized by knowledge base, named COSSET+ (Crowdsourced Missing Value Imputation Optimized by Knowledge Base), which combines knowledge-based and crowdsourced imputation to capture missing values.
• For the efficiency and low price of COSSET+, we design an approximation algorithm with approximation ratio ln Δ + 2 for crowd value selection. This algorithm minimizes the time cost and overhead of crowdsourced missing value imputation.
• To demonstrate the effectiveness of the proposed approaches, we conduct extensive experiments. Experimental results show that the proposed framework and algorithms perform well.

The rest of the paper is organized as follows. Section 2 describes our framework as a whole. Section 3 introduces knowledge-based imputation in detail. Section 4 discusses an approximation algorithm for crowd value selection. Section 5 presents our experiments. Finally, Section 6 concludes the paper.

2 Framework of COSSET+
In this section, we introduce the proposed framework, COSSET+. The design of this framework has two goals. One is to reduce the time cost of crowdsourcing. The other is to minimize the expense of crowdsourced imputation. For the first goal, we introduce a knowledge-based filter and an iterative process in COSSET+. In each iteration, values captured from crowdsourcing can directly infer some of the other missing values via the knowledge-based filter. For the second goal, we add a crowd value selection module to select a portion of the values for crowdsourcing.

Based on the above discussion, our framework has five phases: missing value detection, pattern mining, knowledge-based filter, crowd value selection, and crowdsourcing. Note that patterns in this paper refer to the relationship pattern between two attributes in a table. The workflow of COSSET+ is sketched in Fig.1. In missing value detection, we detect missing values in dirty databases. Since table patterns are often unknown or vague in semantics, we discover relationship patterns in the table via the knowledge base. After pattern mining, we begin to capture missing values. In order to combine knowledge-based and crowdsourced methods, we design an iterative process in COSSET+. In each iteration, we first make use of the knowledge base to obtain missing values from known values and patterns. Missing values which cannot be imputed by the knowledge base are sent to crowd value selection. In order to minimize crowd cost, we select partial values and send them to the crowdsourcing platform. Since values captured from the crowdsourcing platform can directly infer other missing values via the knowledge base, the iterative process plays an essential role in the optimization of crowdsourced imputation.
Fig.1. Workflow of COSSET+. [Figure: the database feeds missing value detection; the detected missing values and the table patterns mined with the trusted knowledge base flow into the knowledge-based filter, then crowd value selection, and finally the crowdsourcing platform.]
In the remaining part of this section, we introduce the components of COSSET+ with an example.

2.1 Missing Value Detection

This phase aims to detect which values are missing from an existing dirty database. Consider a dataset D(t1, t2, ..., tn), where ti is a tuple in D. For each tuple ti(a1, a2, ..., am), where aj is an attribute in ti, if aj = null, we view ti(aj) as a missing value.

Example 1. Considering Table 1, in order to detect missing values in the table, we scan it to judge whether each value is a null value. As a result, we detect t2(a2), t2(a3), t2(a4), t3(a2), t3(a4), t4(a2), t4(a3), t4(a4), t5(a3), and t5(a4) as missing values.

Table 1. Information of Organization Members
ID | a1     | a2       | a3     | a4
t1 | Alice  | New York | U.S.A. | English
t2 | Betty  |          |        |
t3 | Jane   |          | France |
t4 | Jack   |          |        |
t5 | Steven | Rome     |        |

2.2 Pattern Mining

The goal of this phase is to discover relationship patterns in the database. How ai and aj are related is reflected through a directed binary relationship: ai is called the subject of the relationship, and aj is called the object of the relationship[23]. For instance, in the triple (Tom, wrote, novel), we view the predicate wrote as the relationship between the subject Tom and the object novel. When the subject and the predicate are known, the knowledge base can tell us the object of the triple. Thus, it is necessary to discover relationship patterns in the dirty table. Since there are a great number of triples (ai, Pij, aj) in the knowledge base, we query the knowledge base with known ai and aj, and take the predicate Pij with the maximum frequency as the relationship pattern.

Example 2. Considering Table 1, we query the knowledge base with Alice as the subject and New York as the object, and obtain (Alice, wasBornIn, New York), which means that the relationship pattern between a1 and a2 is wasBornIn. Similarly, we discover all relationships in Table 1, and the results are shown in Table 2.

Table 2. Pattern Mining Results
ai | aj | Pij
a1 | a2 | wasBornIn
a1 | a3 | wasBornIn
a2 | a3 | isLocatedIn
a2 | a4 | hasOfficialLanguage
a3 | a4 | hasOfficialLanguage
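To make the pattern-mining step concrete, the following is a minimal Java sketch of the frequency-based predicate selection described above. The KnowledgeBase interface is a hypothetical stand-in for a Yago-style triple store; the paper does not prescribe a concrete API.

import java.util.*;

// Hypothetical triple-store interface: predicates(s, o) returns all
// predicates P such that the triple (s, P, o) exists in the knowledge base.
interface KnowledgeBase {
    List<String> predicates(String subject, String object);
}

class PatternMiner {
    // Returns the relationship pattern P_ij between attributes a_i and a_j:
    // the predicate with the maximum frequency over all known value pairs.
    static String minePattern(KnowledgeBase kb, List<String> colI, List<String> colJ) {
        Map<String, Integer> freq = new HashMap<>();
        for (int r = 0; r < colI.size(); r++) {
            String s = colI.get(r), o = colJ.get(r);
            if (s == null || o == null) continue;          // skip missing cells
            for (String p : kb.predicates(s, o))
                freq.merge(p, 1, Integer::sum);
        }
        // For (a1, a2) in Table 1 this would return wasBornIn.
        return freq.entrySet().stream()
                   .max(Map.Entry.comparingByValue())
                   .map(Map.Entry::getKey).orElse(null);
    }
}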
2.3 Knowledge-Based Filter

Our goal in this phase is to filter out those missing values that can be captured directly with the knowledge base.
Since the relationship patterns in the table have been discovered, missing values can be obtained from known values via the knowledge base. We discuss the details of the knowledge-based filter in Section 3. Note that in this phase, the accuracy of missing value imputation relies on the quality of the knowledge base. However, the quality of the knowledge base is not considered in this paper; we assume that the knowledge base provides correct values. When there are data-quality problems in the knowledge base, we use existing techniques to solve them, such as knowledge base completion[24-26] and knowledge base integration[27-28].

2.4 Crowd Value Selection

Since the knowledge-based filter has captured some missing values, the goal of this phase is to select part of the remaining missing values for crowdsourcing. In order to reduce the time cost and expense of crowdsourcing, our optimization goal is to minimize the number of crowd values as well as the number of crowdsourcing rounds. Thus, how to select crowd values from the missing values is the challenge in this part. We discuss the details of this issue in Section 4.

2.5 Crowdsourcing

Our goal in this phase is to fill missing values on the crowdsourcing platform. Crowdsourcing is a new computing paradigm that harnesses human effort to solve computer-hard problems. Since some missing values are hardly captured by the knowledge base, we utilize human wisdom to conduct imputation. Missing values selected by crowd value selection are sent to the crowdsourcing platform, and the captured values are returned to the database. In our crowdsourcing process, we first divide crowd values into different domains[29]. Then, crowd values are assigned to workers according to the quality of the workers[30-31]. When imputed values return, we evaluate their quality[32-34]. Meanwhile, we adopt other existing crowdsourced techniques, such as domain-aware crowdsourcing[29] and incremental quality inference[33], to ensure the quality of crowdsourcing, and we treat the results from the crowdsourcing platform as correct values.

3 Knowledge-Based Filter

In this section, we focus on the knowledge-based filter in the third phase of COSSET+.

In this step, to make sufficient use of the knowledge base, we fill all values available from it. Given an existing value set S0, a set of values, denoted by S1, is filled according to the knowledge base with the approach discussed in Subsection 2.2. Similarly, value set S2 is obtained from S1 according to the knowledge base. Such iteration continues until Sk = ∅ for some k.

The pseudo-code of the knowledge-based filter is shown in Algorithm 1. The input is a dirty table, a trusted knowledge base, and the pattern set discovered by pattern mining in Subsection 2.2. We first capture the known values in the dirty table (line 1). Then, we find the relationship patterns adjacent to each known value (lines 2~6). With the known value as the subject and the found relationship pattern as the predicate, we query the knowledge base and obtain the object as the captured value (lines 7~8). Captured values are filled into the table and regarded as new known values (line 9). When the number of captured values is 0, the iterative process halts (line 10). The algorithm outputs the table with values filled with the knowledge base (line 11).

Algorithm 1. Knowledge-Based Filter
Input: a table Tm×n, a knowledge base KB, a pattern set P
Output: a table Tm×n
1: S0 ← {all existing values in Tm×n}
2: k ← 1
3: while Sk−1 ≠ ∅ do
4:   Sk ← ∅
5:   for each value s ∈ Sk−1 do
6:     Ps ← {all patterns adjacent to s in P}
7:     for each p ∈ Ps do
8:       m ← getObject(s, p) in KB
9:       Add m to Tm×n and Sk
10:  k ← k + 1
11: return Tm×n
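As a companion to the pseudo-code, here is a minimal Java sketch of Algorithm 1. The KB.getObject lookup mirrors the getObject(s, p) call in line 8 and is assumed to return null when the knowledge base has no matching triple; a worklist replaces the explicit sets S0, S1, ...

import java.util.*;

class KnowledgeBasedFilter {
    // Hypothetical lookup: the object of (subject, predicate, ?), or null.
    interface KB { String getObject(String subject, String predicate); }

    // A pattern (i, j, predicate): column j is the object of
    // (column i, predicate, ?) in every row.
    record Pattern(int i, int j, String predicate) {}

    static void fill(String[][] table, List<Pattern> patterns, KB kb) {
        Deque<int[]> known = new ArrayDeque<>();      // worklist, starts as S_0
        for (int r = 0; r < table.length; r++)
            for (int c = 0; c < table[r].length; c++)
                if (table[r][c] != null) known.add(new int[]{r, c});
        while (!known.isEmpty()) {                    // until S_k is empty
            int[] cell = known.poll();
            for (Pattern p : patterns) {
                if (p.i() != cell[1]) continue;       // pattern not adjacent to s
                int r = cell[0];
                if (table[r][p.j()] != null) continue;
                String obj = kb.getObject(table[r][p.i()], p.predicate());
                if (obj != null) {
                    table[r][p.j()] = obj;            // fill captured value
                    known.add(new int[]{r, p.j()});   // becomes a new known value
                }
            }
        }
    }
}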
Example 3. Considering Table 1, we query the knowledge base with all known values and their corresponding relationship patterns. In the first iteration, we obtain Beijing as the value of t2(a2) and Paris as t3(a2), since Betty wasBornIn Beijing and Jane wasBornIn Paris. Thus, S1 = {Beijing, Paris}. Then, we go to the second iteration. When querying the knowledge base with Beijing as the subject and isLocatedIn as the predicate, we gain China as the value of t2(a3). Thus, S2 = {China}. Afterwards, we execute the iterative process for the third time. The knowledge base tells us that China hasOfficialLanguage Chinese. Therefore, we fill Chinese as t2(a4) in the table, and S3 = {Chinese}. When the iteration is repeated again, we get S4 = ∅ and halt the iterative process. The results are shown in Table 3.

Table 3. Knowledge-Based Filter Results
ID | a1     | a2       | a3     | a4
t1 | Alice  | New York | U.S.A. | English
t2 | Betty  | Beijing  | China  | Chinese
t3 | Jane   | Paris    | France |
t4 | Jack   |          |        |
t5 | Steven | Rome     |        |
Time Complexity Analysis. The time complexity of Algorithm 1 is determined by the number of values in the input table (denoted by n), the average number of patterns adjacent to each value (denoted by p), and the size of the knowledge base (denoted by k). The cost of obtaining a missing value from a known value and its pattern is O(k). There are np pairs of known values and relationships to be dealt with. Thus, the complexity of Algorithm 1 is O(npk).

4 Crowd Value Selection
In order to balance the trade-off among accuracy, time, and cost in crowdsourcing, we focus on the crowd value selection problem in this section. Our main idea is to introduce a greedy iterative process for crowd value selection. The number of crowd values is determined in each iteration and decreases as the process proceeds. We prove that the crowd value selection problem is NP-hard and give an approximation algorithm with approximation ratio ln Δ + 2, where Δ is the maximal out-degree of a graph constructed from the values in the database. The proposed crowd value selection algorithm tells us which missing values should be sent to the crowdsourcing platform.

4.1 Problem Definition

Since the optimization goal of this phase is to reduce the number of values sent to crowdsourcing, we examine the factors that affect the number of crowd values. In COSSET+, missing values are first filtered by the knowledge base. Thus, the number of crowd values is related to the ability of knowledge-based imputation, i.e., the possibility of capturing missing values. Regarding the iterative process, the upper bound of acceptable crowd values on the crowdsourcing platform also matters. Thus, we define the problem with these two factors in consideration.
We define the crowd value selection problem as follows.

Definition 1. Suppose there are y0 missing values in the database. In the i-th round of the iteration, the knowledge-based filter has captured xi values, and yi missing values are left. Crowd value selection selects a value subset Ri which contains ri values for crowdsourcing. The possibility of capturing a missing value by the knowledge-based filter successfully is F(S), where S denotes the missing value set. The cost of crowdsourcing is defined as follows:

T = Σ_{k=1}^{i} T(r_k),  (1)
W = Σ_{k=1}^{i} W(r_k),  (2)
C = T + W.  (3)

Here T in (1) denotes the crowdsourced time cost, W in (2) denotes the crowdsourced overhead, C in (3) denotes the total cost of crowdsourcing, and i in (1)~(3) represents the round which satisfies C(r_i + Σ_{k=i+1}^{n} y_k) < Σ_{k=i}^{n} C(r_k).

Since both the time cost and the expense of crowdsourcing depend on the number of crowd values, we define the total time and expense of i iterations as the time cost and the expense, respectively. The whole cost of crowdsourced imputation is the sum of the time cost and the expense. When the cost of capturing all the missing values left is less than that of capturing the values filtered by the knowledge base in each iteration, we send all ri missing values to the crowdsourcing platform.

Since we aim to reduce the cost of crowdsourcing, our optimization goal is to minimize both the time cost and the expense of crowdsourcing. To combine these two factors in one optimization goal, we use two parameters α and β to represent their importance. Thus, given a knowledge base KB, the existing value set Te, and the missing value set Tb, the crowd value selection (CVS) problem is defined as follows:

min αT + βW, where α + β = 1
s.t. R0 = Te, |Ri| = ri,
     ∪_{k=1}^{i} (Rk ∪ F(Rk)) = Tb,
     ∀i ≠ j, Ri ∩ Rj = ∅,
     ∀i ≠ j, F(Ri) ∩ F(Rj) = ∅.
Then we discuss the definitions of the functions in the problem definition. F is estimated from historical records, since the accuracy of knowledge-based imputation can reflect the possibility of capturing missing values with the knowledge base. For each HIT, the completion time is assumed to follow a Gaussian distribution N(μt, σt²)[35-37]. Thus, k HITs finish in time k × μt in expectation. The number of HITs taken by workers is also assumed to be uniformly distributed with expectation μw; for r HITs and w workers, μw = r/w. As a result, T(r) = (r/w)μt, where μt can be estimated from historical records. We consider the expense of each HIT to be c; thus W(r) = c × r. α and β are parameters that control the importance of the time cost and the expense. They can be determined by users, or learned from historical records.

For ease of understanding, we model this problem as an equivalent graph-based problem. For an instance (KB, Te, Tb) of the above problem, a probabilistic graph G = (V, E, p) is constructed as an induced graph of the knowledge graph GK, where V = Ve ∪ Vb. Ve is the set of nodes corresponding to existing values, and Vb is the set of nodes corresponding to missing values. p: E → [0, 1] is the probabilistic function. For each edge e = (u, v) ∈ E, p(e) is the probability that v can be filled from u via the knowledge base. An example of the graph model is shown in Fig.2. This graph model corresponds to Table 1. As depicted in the graph, we take the missing values in the dirty table as nodes, and the relationship patterns discovered in Subsection 2.2 as directed edges between nodes. Our goal is to select a set of nodes in the constructed graph model.
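A small Java sketch of this cost model follows; μw = r/w and T(r) = (r/w)μt are our reading of the typographically damaged fractions above, and the field names are illustrative.

// Cost model sketch: T(r) = (r / w) * mu_t, W(r) = c * r,
// objective = alpha * T + beta * W with alpha + beta = 1.
class CrowdCost {
    final double muT;   // expected time per HIT, estimated from history
    final double c;     // expense per HIT
    final int w;        // number of workers

    CrowdCost(double muT, double c, int w) { this.muT = muT; this.c = c; this.w = w; }

    double time(int r)    { return (double) r / w * muT; }   // T(r)
    double expense(int r) { return c * r; }                  // W(r)

    double objective(int r, double alpha) {
        return alpha * time(r) + (1 - alpha) * expense(r);
    }
}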
Fig.2. Crowd value selection instance. [Figure: a directed graph whose nodes are the missing values of Table 1, e.g., t3(a4), t4(a2), t4(a3), t4(a4), t5(a3), and t5(a4), connected by the discovered relationship patterns.] P24 means the predicate in a triple of the knowledge base where the subject is t4(a2) and the object is t4(a4).
Theorem 1 shows the hardness of this problem.

Theorem 1. Crowd value selection is an NP-hard problem.

Proof. We reduce the minimum dominating set (MDS) problem[38] to a special case of this problem. Given an instance of MDS G = (V, E), where V is the set of vertices and E is the set of edges, the goal is to find S ⊆ V such that for each vertex v ∈ V, either v ∈ S or ∃u ∈ S, (u, v) ∈ E. We construct an instance of our problem to solve the MDS problem. We set p(e) = 1 for every edge and let V in G be Vb. Thus, the probabilistic graph G = (V, E, p) is converted to an equal-weight graph denoted as G = (Vb, E). Since E is the discovered pattern set in the knowledge graph GK, E ⊆ GK. The goal of this problem is to find S ⊆ Vb such that for each vertex v ∈ Vb, either v ∈ S or ∃u ∈ S, (u, v) ∈ GK. Therefore, a subset S obtained for CVS corresponds to a well-determined subset S of MDS. As MDS is well known to be NP-hard, CVS is also NP-hard.

4.2 Our Solution
In this subsection, we develop an approximation algorithm for the CVS problem. Our basic idea is to select the values with the maximum possibility of capturing other missing values, i.e., the values related to the largest number of relationship patterns. Since values captured by crowdsourced imputation are sent back to the knowledge-based filter, the more patterns a value is related to, the larger the possibility of obtaining other values via the knowledge base. Note that in practice, before querying the knowledge base, we cannot obtain the probabilities of capturing missing values, i.e., p(e). Therefore, we treat each edge in G equally. Thus, the proposed CVS algorithm selects optimal nodes with a greedy strategy[39]. This strategy finds the node related to the most patterns in each iteration. Based on this, we select the nodes with the highest possibility of capturing other missing values.

The pseudo-code of crowd value selection is shown in Algorithm 2. The input is a directed graph in the i-th iteration, with missing values as nodes and their relations as edges. The algorithm outputs the missing value set sent to crowdsourcing in this iteration. We first use the greedy strategy to find the node with the largest out-degree in the graph (lines 1~3). When a node is selected, it is added to the crowd value set (line 4). Then, the selected node and the nodes dominated by it are deleted from the graph (lines 5~6). Finally, we return the crowd value set of this iteration as the output (line 7).

Algorithm 2. Crowd Value Selection
Input: a graph Gyi
Output: selected crowd value set Ri
1: Ri ← ∅
2: while ∃ node nj ∈ Gyi do
3:   n∗j ← argmax_{nj} out-degree(nj)
4:   Ri ← Ri ∪ {n∗j}
5:   Delete nodes dominated by n∗j from Gyi
6:   Delete n∗j from Gyi
7: return Ri
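The following is a minimal Java sketch of Algorithm 2 over an adjacency-list representation of Gyi; it is illustrative only and assumes every node of the graph appears as a key of the map.

import java.util.*;

class CrowdValueSelection {
    // Greedy selection (Algorithm 2): repeatedly pick the node with the
    // largest out-degree, then delete it and the nodes it dominates.
    static <N> Set<N> select(Map<N, Set<N>> adj) {
        Map<N, Set<N>> g = new HashMap<>();
        adj.forEach((n, out) -> g.put(n, new HashSet<>(out)));  // mutable copy
        Set<N> selected = new LinkedHashSet<>();                // R_i
        while (!g.isEmpty()) {
            N best = Collections.max(g.keySet(),                // line 3: argmax
                         Comparator.comparingInt((N n) -> g.get(n).size()));
            selected.add(best);                                 // line 4
            Set<N> removed = new HashSet<>(g.get(best));        // dominated nodes
            removed.add(best);
            g.keySet().removeAll(removed);                      // lines 5-6
            g.values().forEach(out -> out.removeAll(removed));
        }
        return selected;
    }
}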
Example 4. Consider the graph Gy1 (Fig.2) corresponding to Table 3 as the input. Since the node of maximal out-degree is t4(a2), we add it to R1. Then, we delete t4(a2) and all nodes dominated by it. We find that t5(a3) now has the maximal out-degree and add it to R1. After deleting the nodes dominated by t5(a3), only t3(a4) is left, and we add it to R1. Thus, the selected crowd values in Table 3 are t4(a2), t5(a3), and t3(a4).

Time Complexity Analysis. The time complexity of Algorithm 2 is determined by the number of nodes in each graph Gyi (denoted by n). The cost of sorting all nodes by out-degree in each graph is O(n log n). Clearly, the complexity of Algorithm 2 is O(n log n).

We show the approximation ratio of Algorithm 2 in Theorem 2.

Theorem 2. Algorithm 2 is a (ln Δ + 2)-approximation. That is, for the computed crowd value set Ri and an optimal crowd value set Ri∗, we have

|Ri| / |Ri∗| ≤ ln Δ + 2.

This approximation ratio is optimal.

Proof. To simplify the presentation, we color nodes according to their states during the execution of Algorithm 2: nodes in Ri are black, nodes which are covered (dominated by nodes in Ri) are gray, and all uncovered nodes are white. By W(nj), we denote the set of white nodes dominated by nj, including nj itself. We call w(nj) = |W(nj)| the span of nj. In each greedy step of Algorithm 2, we choose a new node n for the dominating set at cost 1. Instead of letting the selected node n pay the whole cost, we distribute the cost equally among all newly covered nodes. Assume that the node n∗j chosen in line 3 of the algorithm is white itself and the nodes dominated by it are n1, n2, n3, and n4. In this case, each of the five nodes n∗j and n1, ..., n4 gets cost 1/5. If n∗j is chosen while gray, only the nodes n1, ..., n4 get cost (they all get 1/4).

Now, assume that we know an optimal dominating set Ri∗. According to the definition of dominating sets, each node not in Ri∗ can be assigned to a node in Ri∗ that dominates it. By assigning each node nx to exactly one node in Ri∗ which dominates nx, the graph is decomposed into stars. A star in this paper has a dominator (a node in Ri∗) as the center and non-dominators as leaves. Clearly, an optimal dominating set pays cost 1 for each such star. In the following, we show that the amortized (distributed) cost of the greedy algorithm is at most ln Δ + 2 for each star.

Consider a single star with center n∗ ∈ Ri∗ before a new node n is chosen by the greedy algorithm. The number of nodes that become dominated when n is added to the dominating set is w(n). Thus, if some white node n∗j in the star of n∗ becomes gray or black, it gets cost 1/w(n). By the greedy condition, n is a node with the maximal span, and therefore w(n) ≥ w(n∗). Thus, n∗j is charged at most 1/w(n∗). After becoming gray, nodes are not charged any more. Therefore, the first node that is covered in the star of n∗ gets cost at most 1/(d(n∗) + 1). Since w(n∗) ≥ d(n∗) when the second node is covered, the second node gets cost at most 1/d(n∗). In general, the j-th node that is covered in the star of n∗ gets cost at most 1/(d(n∗) − j + 2). Thus, the total amortized cost in the star of n∗ is at most

1/(d(n∗) + 1) + 1/d(n∗) + ... + 1/2 + 1/1 = H(d(n∗) + 1) ≤ H(Δ + 1) < ln Δ + 2,

where Δ is the maximal out-degree of Gyi, and H(x) = Σ_{i=1}^{x} 1/i is the x-th harmonic number.

By L-reductions between the set cover problem and the dominating set problem similar to those given in [40], together with the inapproximability results for the set cover problem[41], it follows that unless NP ⊆ DTIME(n^{O(log log n)}), no polynomial-time algorithm can approximate the selection problem in each step better than (1 − o(1)) ln Δ, where (1 − o(1)) means any number smaller than 1. Thus, unless P = NP, the approximation ratio of the proposed greedy CVS algorithm in each step is optimal (up to lower order terms).

Crowdsourcing Task. After executing Algorithm 2, we send the selected crowd values to the crowdsourcing platform. In the crowdsourcing phase, we query each crowd value with its corresponding existing value and pattern in one crowdsourcing task. For instance, given an existing value A and the pattern B (discovered in Subsection 2.2) between A and the crowd value C, the crowdsourced question to obtain C is: "A B what/where/who?" The following example illustrates this.

Example 5. As discussed in Example 4, the selected crowd values in Table 3 are t4(a2), t5(a3), and t3(a4). For t4(a2), t4(a1) is Jack and P12 is wasBornIn. Thus, the crowdsourced question q1 to gain t4(a2) is "Jack wasBornIn where?" For t5(a3), t5(a2) is Rome and P23 is isLocatedIn. Therefore, the crowdsourced question q2 to obtain t5(a3) is "Rome isLocatedIn where?" For t3(a4), t3(a3) is France, and P34 is hasOfficialLanguage. Thus, the crowdsourced question q3 to obtain t3(a4) is "France hasOfficialLanguage what?" After these three crowdsourcing tasks, we gain Washington, Italy, and French as the returned results of q1, q2, and q3, respectively.
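To illustrate how such questions could be generated automatically, here is a small sketch; the paper only gives the "A B what/where/who?" template, so the predicate-to-question-word mapping below is an assumption for illustration.

import java.util.Map;

class QuestionBuilder {
    // Assumed mapping from predicate to question word (illustrative only).
    static final Map<String, String> QUESTION_WORD = Map.of(
        "wasBornIn", "where",
        "isLocatedIn", "where",
        "hasOfficialLanguage", "what");

    // e.g., buildQuestion("Jack", "wasBornIn") -> "Jack wasBornIn where?"
    static String buildQuestion(String knownValue, String predicate) {
        return knownValue + " " + predicate + " "
             + QUESTION_WORD.getOrDefault(predicate, "what") + "?";
    }
}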
5 Experimental Results

To verify the performance of the proposed methods, we conduct extensive experiments using real-life data along six dimensions: 1) the number of crowd values in COSSET+ compared with crowdsourced imputation without optimization; 2) the accuracy of COSSET+ compared with crowdsourced and knowledge-based methods; 3) the accuracy of COSSET+ compared with other state-of-the-art missing value imputation approaches; 4) the number of rounds of COSSET+; 5) the efficiency of the knowledge-based filter; 6) the parameter adjustments of the CVS problem.

Datasets. We use the Soccer and Movie datasets as test datasets. Both datasets are scraped from the Web: 1) Soccer has 12 attributes and 812 instances about soccer players and their basic information; 2) Movie has 7 attributes and 2 035 instances about movie information and persons related to movies. We conduct experiments on these datasets since they are real datasets of various types.

Setup. All experiments are conducted on a PC with an Intel i7 CPU, 8 GB of memory, and a 1 TB hard disk. All algorithms are implemented in Java.

5.1 Reduction of Crowd Values

To verify the performance of our optimization, we compare COSSET+ with a crowdsourced approach without the knowledge-based filter (Crowdsourcing). Since the cost of crowdsourced imputation is determined by the number of crowd values, we test how many crowd values our method saves. On the crowdsourcing platform, we use an expert crowd of 20 students. Experimental results are shown in Fig.3. As depicted in Figs.3(a) and 3(b), we observe that as mis_rate rises, the gap between COSSET+ and crowdsourced imputation becomes larger and larger. This is because, as the amount of missing values increases, the number of values imputed by the knowledge base gets larger. In addition, since the number of crowd values in the CVS phase is much smaller than that in the general crowdsourced method, the upper bound of acceptable crowd values is not needed in our framework.

Fig.3. Reduction of crowd values. (a) Number of crowd values on Soccer. (b) Number of crowd values on Movie. mis_rate = number of missing values / number of values.

Thus, the reduction of crowd values shows the effectiveness of our optimization method and also proves that COSSET+ is suitable for optimizing crowdsourced missing value imputation.

5.2 Accuracy of COSSET+
To certify the effectiveness of the proposed framework, we compare COSSET+ with crowdsourced missing value imputation[21] and knowledge-based (KB) imputation[20]. We use the standard F-measure to evaluate the accuracy of the algorithms. Since the workers on our crowdsourcing platform are experts in the domains, we assume that the answers from crowdsourcing are correct. Fig.4 shows our results on Soccer and Movie.

As illustrated in Figs.4(a) and 4(b), we observe that the F-measure of COSSET+ is considerably higher than that of knowledge-based approaches. This is because crowdsourcing in COSSET+ compensates for the lack of information in the knowledge base. However, the F-measure of COSSET+ is a little lower than that of crowdsourced imputation. The errors in COSSET+ are caused by the quality problems in the knowledge base.

Fig.4. Accuracy of COSSET+. (a) F-measure on Soccer. (b) F-measure on Movie.
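For reference, here is a minimal sketch of the standard F-measure computation used throughout the evaluation, under the usual reading that precision is taken over the imputed cells and recall over all missing cells:

class FMeasure {
    // correct: correctly imputed cells; imputed: all imputed cells;
    // missing: all missing cells in the ground truth.
    static double of(int correct, int imputed, int missing) {
        if (imputed == 0 || missing == 0) return 0.0;
        double p = (double) correct / imputed;   // precision
        double r = (double) correct / missing;   // recall
        return (p + r == 0) ? 0.0 : 2 * p * r / (p + r);
    }
}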
5.3 Comparison with the State-of-the-Art Missing Value Imputation Methods

For comparison with existing methods, we choose the Decision Tree (DT) algorithm[42], the Bayesian Network (BN) algorithm[43-44], and a Web-based approach[45]. These three algorithms are universally acknowledged in missing value imputation. We use the standard F-measure to evaluate the effectiveness of these methods. Experimental results are shown in Fig.5.

Fig.5. Comparison experiments. (a) Comparison on Soccer. (b) Comparison on Movie.
As depicted in Figs.5(a) and 5(b), we observe that the F-measure of COSSET+ outperforms that of the other approaches on both the Soccer and Movie datasets. This is because machine learning approaches, such as Decision Tree and Bayesian Network, depend heavily on the existing values in the database, and the Web-based method relies on the public information on the Web, while COSSET+ captures missing values with the extra knowledge of both human intelligence and the knowledge base. Thus, COSSET+ has a significant advantage in missing value imputation. Therefore, we conclude that COSSET+ is more effective than the Decision Tree algorithm, the Bayesian Network algorithm, and Web-based missing value imputation.

5.4 Round of COSSET+

Since there is an iterative process in COSSET+, the number of rounds of iteration determines the running time. Therefore, we test the number of rounds of COSSET+ to show the efficiency of the whole system. Experimental results are shown in Fig.6.

Fig.6. Round of COSSET+.

As illustrated in Fig.6, we observe that as mis_rate rises, the number of rounds has an ascending trend. Besides, although the data size of Movie is larger than that of Soccer, the number of rounds on Soccer is higher than that on Movie. This is because the number of rounds depends on the number of discovered relationship patterns, and the number of relationships is determined by the number of attributes. Since the number of attributes in Soccer is larger than that in Movie, there are more relations in Soccer and the number of rounds is higher. Thus, we conclude that the number of rounds of COSSET+ is related to mis_rate and the number of attributes in the dataset.

5.5 Efficiency of Knowledge-Based Filter

In order to optimize the efficiency of crowdsourced missing value imputation, we introduce the knowledge-based filter into COSSET+. Thus, the efficiency of the proposed filter needs to be validated. Results on the total running time of the filter, in seconds, are shown in Fig.7. We run each test five times and report the average time.

Fig.7. Efficiency of the knowledge-based filter.

In Fig.7, it is observed that as mis_rate rises, the processing time increases. This is due to the larger number of missing values. According to the data in Fig.7, the average time of capturing a missing value with the filter is around 0.15 s. This reduces the time cost of crowdsourcing to a great extent. Therefore, we conclude that the filter in COSSET+ achieves high efficiency. Optimized by the knowledge-based filter, the time cost of crowdsourcing is greatly reduced.

5.6 Parameters of CVS Problem

As discussed in Subsection 4.1, the optimization goal of the CVS problem is defined as min αT + βW, where α + β = 1. Thus, in order to verify the optimization of CVS, the parameters α and β need to be validated. Considering the three conditions T > W, T = W, and T < W, we test the value of αT + βW varying α from 0.0 to 1.0. The results are shown in Fig.8.

Fig.8. Value of αT + βW as α varies under T > W, T = W, and T < W.

As Fig.8 shows, when T > W, αT + βW increases as the value of α rises. When T < W, αT + βW decreases as α goes up. In the case of T = W, the value of αT + βW is unaffected by the variation of α. Hence, when T > W, we can set the value of α smaller to obtain a better optimization of the CVS problem. Similarly, when T < W, the larger α is, the better the optimization we can achieve. The value of α does not matter when T = W. According to the above discussion, we conclude that COSSET+ balances the accuracy and the cost well, which makes it suitable for various scenarios.

6 Conclusions
In this paper, we proposed a crowdsourced framework named COSSET+ for capturing missing values. Based on the framework, we presented optimization solutions using a knowledge base. The approach achieves low time cost and overhead in crowdsourced missing value imputation. Experimental results demonstrated the effectiveness and efficiency of COSSET+. An important piece of future work is to take the quality of the workers on the crowdsourcing platform into consideration. By evaluating workers' credibility, crowdsourced data cleaning can be made more effective. Another line of work is to combine our approach with other missing value imputations, such as Web-based methods and Bayesian networks. In this way, we can capture missing values in various scenarios.
References
[1] Weinberg J B, Biswas G, Koller G R. Conceptual clustering with systematic missing values. In Proc. the 9th Int. Workshop on Machine Learning, July 1992, pp.464-469.
[2] Silva L O, Zárate L E. A brief review of the main approaches for treatment of missing data. Intelligent Data Analysis, 2014, 18(6): 1177-1198.
[3] Hua M, Pei J. DiMaC: A system for cleaning disguised missing data. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 2008, pp.1263-1266.
[4] Himmelspach L, Conrad S. Clustering approaches for data with missing values: Comparison and evaluation. In Proc. the 5th Int. Conf. Digital Information Management, July 2010, pp.19-28.
[5] Shan Y, Deng G. Kernel PCA regression for missing data estimation in DNA microarray analysis. In Proc. IEEE Int. Symp. Circuits and Systems, May 2009, pp.1477-1480.
[6] Yang K, Li J Z, Wang C K. Missing values estimation in microarray data with partial least squares regression. In Proc. the 6th Int. Conf. Computational Science, May 2006, pp.662-669.
[7] Siddique J, Belin T R. Using an approximate Bayesian bootstrap to multiply impute nonignorable missing data. Computational Statistics & Data Analysis, 2008, 53(2): 405-415.
[8] Rubin D B. Multiple imputation after 18+ years. Journal of the American Statistical Association, 1996, 91(434): 473-489.
[9] Patrician P A. Multiple imputation for missing data. Research in Nursing & Health, 2002, 25(1): 76-84.
[10] Lakshminarayan K, Harp S A, Goldman R, Samad T. Imputation of missing data using machine learning techniques. In Proc. the 2nd Int. Conf. Knowledge Discovery and Data Mining, August 1996, pp.140-145.
[11] Li X B. A Bayesian approach for estimating and replacing missing categorical data. Journal of Data and Information Quality (JDIQ), 2009, 1(1): Article No. 3.
[12] Di Zio M, Scanu M, Coppola L, Luzi O, Ponti A. Bayesian networks for imputation. Journal of the Royal Statistical Society Series A (Statistics in Society), 2004, 167(2): 309-322.
[13] Mayfield C, Neville J, Prabhakar S. ERACER: A database approach for statistical inference and data cleaning. In Proc. the ACM SIGMOD Int. Conf. Management of Data, June 2010, pp.75-86.
[14] Zhang S C. Shell-neighbor method and its application in missing data imputation. Applied Intelligence, 2011, 35(1): 123-133.
[15] Zhang C Q, Zhu X F, Zhang J L, Qin Y S, Zhang S C. GBKII: An imputation method for missing values. In Proc. the 11th Pacific-Asia Conf. Knowledge Discovery and Data Mining, May 2007, pp.1080-1087.
[16] Setiawan N A, Venkatachalam P A, Hani A F M. Missing attribute value prediction based on artificial neural network and rough set theory. In Proc. Int. Conf. Biomedical Engineering and Informatics, May 2008, pp.306-310.
[17] Tang N, Vemuri V R. Web-based knowledge acquisition to impute missing values for classification. In Proc. the IEEE/WIC/ACM Int. Conf. Web Intelligence, September 2004, pp.124-130.
[18] Hao S, Tang N, Li G L, Li J. Cleaning relations using knowledge bases. In Proc. the 33rd Int. Conf. Data Engineering, April 2017, pp.933-944.
[19] Chu X, Morcos J, Ilyas I F, Ouzzani M, Papotti P, Tang N, Ye Y. KATARA: Reliable data cleaning with knowledge bases and crowdsourcing. Proceedings of the VLDB Endowment, 2015, 8(12): 1952-1955.
[20] Qi Z X, Wang H Z, Meng F S, Li J Z, Gao H. Capture missing values with inference on knowledge base. In Proc. the Int. Conf. Database Systems for Advanced Applications, March 2017, pp.185-194.
[21] Ye C, Wang H Z. Capture missing values based on crowdsourcing. In Proc. the 9th Int. Conf. Wireless Algorithms, Systems, and Applications, June 2014, pp.783-792.
[22] Ye C, Wang H Z, Li J Z, Gao H, Cheng S Y. Crowdsourcing-enhanced missing values imputation based on Bayesian network. In Proc. the 21st Int. Conf. Database Systems for Advanced Applications, April 2016, pp.67-81.
[23] Chu X, Morcos J, Ilyas I F, Ouzzani M, Papotti P, Tang N, Ye Y. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In Proc. the ACM SIGMOD Int. Conf. Management of Data, May 31-June 4, 2015, pp.1247-1261.
[24] Wang Q, Wang B, Guo L. Knowledge base completion using embeddings and rules. In Proc. the 24th Int. Conf. Artificial Intelligence, July 2015, pp.1859-1865.
[25] Neelakantan A, Chang M W. Inferring missing entity type instances for knowledge base completion: New dataset and methods. In Proc. Human Language Technologies: The 2015 Annual Conf. the North American Chapter of the ACL, May 2015, pp.515-525.
[26] Neelakantan A, Roth B, McCallum A. Compositional vector space models for knowledge base completion. In Proc. the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Int. Joint Conf. Natural Language Processing, July 2015, pp.156-166.
[27] Guo H Z, Chen Q C, Wang X L, Cui L. Tolerance rough set based attribute extraction approach for multiple semantic knowledge base integration. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2011, 19(4): 659-684.
[28] Marinos L, Lee J. Using structural and procedural knowledge in database and knowledge base integration. In Proc. IEEE Int. Workshop on Tools for Artificial Intelligence, Architectures, Languages and Algorithms, October 1989, pp.407-417.
[29] Zheng Y D, Li G L, Cheng R. DOCS: A domain-aware crowdsourcing system using knowledge bases. Proceedings of the VLDB Endowment, 2016, 10(4): 361-372.
[30] Li H W, Zhao B, Fuxman A. The wisdom of minority: Discovering and targeting the right group of workers for crowdsourcing. In Proc. the 23rd Int. Conf. World Wide Web, April 2014, pp.165-176.
[31] Wang J, Ipeirotis P G, Provost F. Quality-based pricing for crowdsourced workers. NYU Working Paper No. 2451/31833, Social Science Electronic Publishing, 2013. https://ssrn.com/abstract=2283000, June 2017.
[32] Fan J, Li G L, Ooi B C, Tan K L, Feng J H. iCrowd: An adaptive crowdsourcing framework. In Proc. the ACM SIGMOD Int. Conf. Management of Data, May 31-June 4, 2015, pp.1015-1030.
[33] Feng J H, Li G L, Wang H N, Feng J H. Incremental quality inference in crowdsourcing. In Proc. the 19th Int. Conf. Database Systems for Advanced Applications, April 2014, pp.453-467.
[34] Zheng Y D, Wang J N, Li G L, Cheng R, Feng J H. QASCA: A quality-aware task assignment system for crowdsourcing applications. In Proc. the ACM SIGMOD Int. Conf. Management of Data, May 31-June 4, 2015, pp.1031-1046.
[35] Raykar V C, Yu S P. Eliminating spammers and ranking annotators for crowdsourced labeling tasks. The Journal of Machine Learning Research, 2012, 13(1): 491-518.
[36] Cavallo R, Jain S. Efficient crowdsourcing contests. In Proc. the 11th Int. Conf. Autonomous Agents and Multiagent Systems, June 2012, pp.677-686.
[37] Roy S B, Lykourentzou I, Thirumuruganathan S, Amer-Yahia S, Das G. Task assignment optimization in knowledge-intensive crowdsourcing. The VLDB Journal, 2015, 24(4): 467-491.
[38] Fomin F V, Grandoni F, Pyatkin A V, Stepanov A A. Bounding the number of minimal dominating sets: A measure and conquer approach. In Proc. the 16th Int. Symp. Algorithms and Computation, December 2005, pp.573-582.
[39] DeVore R A, Temlyakov V N. Some remarks on greedy algorithms. Advances in Computational Mathematics, 1996, 5(1): 173-187.
[40] Kann V. On the approximability of the maximum common subgraph problem. In Proc. the 9th Annual Symp. Theoretical Aspects of Computer Science, February 1992, pp.375-388.
[41] Feige U. A threshold of ln n for approximating set cover. Journal of the ACM, 1998, 45(4): 634-652.
[42] Rahman G, Islam Z. A decision tree-based missing value imputation technique for data pre-processing. In Proc. the 9th Australasian Data Mining Conf., December 2011, pp.41-50.
[43] Li H, Emmanuel A, Li P, Wu M. Imputation algorithm of missing values based on EM and Bayesian network. Computer Engineering and Applications, 2010, 46(5): 123-125.
[44] Miyakoshi Y, Kato S. A missing value imputation method using a Bayesian network with weighted learning. Electronics and Communications in Japan, 2012, 95(12): 1-9.
[45] Li Z X, Sharaf M A, Sitbon L, Sadiq S, Indulska M, Zhou X F. A web-based approach to data imputation. World Wide Web, 2014, 17(5): 873-897.
Hong-Zhi Wang is a professor and doctoral supervisor of Harbin Institute of Technology, Harbin. He received his Ph.D. degree in computer science and technology from Harbin Institute of Technology, Harbin, in 2008. He was awarded Microsoft Fellowship, Chinese Excellent Database Engineer, and IBM Ph.D. Fellowship. His research interests include big data management, data quality, and graph data management.

Zhi-Xin Qi is an M.S. student in School of Computer Science and Technology, Harbin Institute of Technology, Harbin. She received her B.S. degree from Harbin Engineering University, Harbin. Her research interests include data quality, data cleaning, and big data management.

Ruo-Xi Shi is an undergraduate in School of Computer Science and Technology, Harbin Institute of Technology, Harbin. Her research interests include query processing and data quality.

Jian-Zhong Li is a professor and doctoral supervisor at Harbin Institute of Technology, Harbin. He is a fellow of CCF. In the past, he worked as a visiting scholar with the University of California at Berkeley, and as a visiting professor with the University of Minnesota. His research interests include database, parallel computing, wireless sensor networks, etc.

Hong Gao is a professor and doctoral supervisor at Harbin Institute of Technology, Harbin. She is a senior member of CCF. She received her Ph.D. degree in computer science and technology from Harbin Institute of Technology, Harbin, in 2004. Her research interests include database, parallel computing, wireless sensor networks, etc.