Handling ER-topk Query on Uncertain Streams

Cheqing Jin1, Ming Gao2, and Aoying Zhou1

1 Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, China
{cqjin, ayzhou}@sei.ecnu.edu.cn
2 Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, China
[email protected]
Abstract. Data uncertainty widely exists in many applications. In this paper, we aim at handling top-k queries on uncertain data streams. Since the volume of a data stream is unbounded whereas the memory resource is limited, it is critical to devise one-pass solutions that are both time- and space-efficient. We use two structures to handle this issue: the domGraph stores all tuples that are potential query results, and the probTree, a binary tree, supports computing the function q(·) efficiently. Theoretical analysis and extensive experimental results show the effectiveness and efficiency of the proposed solution.
1 Introduction
Uncertain data management has become more and more important in recent years, since data uncertainty widely exists in lots of applications, such as financial applications, sensor networks, and so on. In general, there are two kinds of uncertainties, namely attribute-level uncertainty and existential uncertainty; the latter is also called tuple-level uncertainty in some literature [1]. The attribute-level uncertainty, commonly described by discrete probability distribution functions or probability density functions, illustrates the imprecision of a tuple's attributes. The existential uncertainty describes the confidence of a tuple. Recently, several prototype systems have been produced to manage uncertain data with explicit probabilistic models of uncertainty, such as MayBMS [4], MystiQ [10], and Trio [3].

For example, radars are often used in traffic monitoring applications to detect car speeds. It is better to describe a reading record by a discrete probability distribution function rather than a single value, since the readings may contain errors caused by various factors, such as nearby high-voltage lines, interference from close cars, human operator mistakes, etc. Table 1 illustrates a small data set consisting of four reading records described by the x-relation model introduced in Trio [3]. For instance, the 1st record observes a Buick car (No. Z-333) running through the monitoring area at AM 10:33, with the speed estimated as 50 (miles per hour) with probability 0.6 and 70 with probability 0.4. In addition, a range is used to test the validity of a speed
LATEX style file for Lecture Notes in Computer Science – documentation
reading (e.g., [0, 150]). Once a reading exceeds the range, we remove that part of the information from the tuple description, which makes the total confidence of a record smaller than 1. For example, the 3rd record estimates the speed as 80 with probability 0.4, and as invalid with probability 0.6 (= 1 − 0.4).

ID  Reading Info             (Speed, prob.)
1   AM 10:33, Buick, Z-333   (50, 0.6), (70, 0.4)
2   AM 10:35, BMW, X-215     (60, 1.0)
3   AM 10:37, Benz, X-511    (80, 0.4)
4   AM 10:38, Mazda, Y-123   (20, 0.4), (30, 0.5)
Table 1. A radar reading database in x-relation model
The possible world semantics is widely adopted by many uncertain data models. The possible world space contains a huge number of possible world instances, each consisting of a set of certain values from the uncertain tuples. A possible world instance is also affiliated with a probability value, computed as the product of the occurrence probabilities of all tuples within the instance and the non-occurrence probabilities of all tuples outside of the instance. Table 2 illustrates the possible world space, in total 12 possible world instances, for the data set in Table 1. Each column is a possible world instance with its probability listed below. For example, tuples t1, t2 and t3 occur in w10 at the same time, so the probability of this possible world is 0.016 (= 0.4 × 1.0 × 0.4 × (1 − 0.4 − 0.5)).

PW    w1    w2    w3    w4    w5    w6    w7    w8    w9    w10   w11   w12
t1    50    70    50    70    50    70    50    70    50    70    50    70
t2    60    60    60    60    60    60    60    60    60    60    60    60
t3    80    80    ⊥     ⊥     80    80    ⊥     ⊥     80    80    ⊥     ⊥
t4    20    20    20    20    30    30    30    30    ⊥     ⊥     ⊥     ⊥
Prob. 0.096 0.064 0.144 0.096 0.120 0.080 0.180 0.120 0.024 0.016 0.036 0.024
Table 2. Possible worlds for Table 1
Uncertain data streams are quite common in many fields, such as the radar readings in traffic-control applications. Tuples arrive rapidly, and the volume of the data stream is considered unbounded. It is necessary to devise space- and time-efficient one-pass solutions to handle uncertain data streams; such solutions are also helpful for traditional issues over massive data sets.

Our focus in this paper is the uncertain top-k query. A top-k query aims at getting a small set of the most important tuples from a massive data set. Generally, a ranking function assigns a score to each tuple, and the k tuples with the maximum scores are returned as the query result. Although the semantics of a top-k query is explicit for deterministic data, several different top-k definitions have been proposed for distinct purposes on uncertain data, where the ranking score can be based on attribute values, confidences, or a combination of the two, including U-Topk [18], U-kRanks [18], PT-k [12], Global-topk
[20], ER-topk [8], c-Typical-Topk [11] and so on. Cormode et al. have listed a set of properties to describe the semantics of ranking queries on uncertain data, namely exact-k, containment, unique-rank, value-invariance and stability [8]. Moreover, different from other uncertain top-k semantics like U-topk, U-kRanks, PT-k and Global-topk, the ER-topk query satisfies all of these properties.
1.1 Our contribution
It is trivial to process a certain top-k query over high-speed data streams, since we only need to maintain a buffer containing the k tuples with the highest scores: the lowest-ranked tuple is replaced when a new tuple scores higher. However, processing an uncertain top-k query over data streams is not equally trivial, because the semantics of an uncertain top-k query, stemming from the integration of attribute values and probability information, is much more complex than that of a certain top-k query.

In this paper, we propose an efficient exact streaming approach to answer the ER-topk query. The A-ERank and T-ERank approaches proposed in [8] handle static uncertain data sets and require all tuples to be fetched in a special order; obviously, they are not suitable for the streaming environment. In our new solution, all tuples in the data stream are divided into two groups. One group contains the candidate top-k tuples, i.e., the tuples that have a chance of belonging to the query result, and the other contains the rest. We construct and maintain two structures, namely the domGraph and the probTree, to describe the two groups efficiently.

The rest of the paper is organized as follows. We define the data models and the query in Section 2. In Section 3, we describe a novel solution to handle the ER-topk query over uncertain data streams. Extensive experiments are reported in Section 4. We review the related work in Section 5, and conclude the paper briefly in the last section.
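For contrast, the trivial certain-data case described above can be written in a few lines. The following is a minimal sketch (function name and language are ours, not from the paper), keeping the size-k buffer as a min-heap whose root is the lowest-ranked buffered score:

```python
import heapq

def stream_topk(scores, k):
    """Keep the k highest scores of a stream in a size-k min-heap.

    The heap root is the lowest-ranked buffered score, so an arriving
    score replaces it only when it ranks higher, as described above.
    """
    heap = []
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)  # evict the lowest-ranked score
    return sorted(heap, reverse=True)
```

For example, stream_topk([5, 1, 9, 3, 7, 2], 3) returns [9, 7, 5] after a single pass.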
2 Data models and query definition
In this paper, we consider a discrete base domain D, and a special symbol ⊥ representing a value out of D. Let S be an uncertain data stream that contains a sequence of tuples, t1, t2, ..., tN. The i-th tuple in the stream, ti, is described as a probability distribution {(v_{i,1}, p_{i,1}), ..., (v_{i,si}, p_{i,si})}. For each l, 1 ≤ l ≤ si, we have v_{i,l} ∈ D and p_{i,l} ∈ (0, 1]. For simplicity, we also assume that v_{i,1} < v_{i,2} < ... < v_{i,si}. In addition, Σ_{l=1}^{si} p_{i,l} ≤ 1. The tuple ti can also be treated as a random variable Xi over D ∪ {⊥}, such that ∀l, Pr[Xi = v_{i,l}] = p_{i,l}, and Pr[Xi = ⊥] = 1 − Σ_{l=1}^{si} p_{i,l}.

This data model captures both kinds of uncertainties. If ∀i, Σ_{l=1}^{si} p_{i,l} = 1, the stream only has attribute-level uncertainty. If ∀i, si = 1, the stream only has existential uncertainty. Otherwise, the stream contains both kinds of uncertainties.
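The data model can be mirrored directly in code. Below is a minimal sketch (class and method names are ours) that stores one tuple's option list and derives Pr[Xi = ⊥] as the residual mass:

```python
from dataclasses import dataclass

@dataclass
class UTuple:
    """An uncertain tuple: sorted (value, probability) options over D.

    The residual mass 1 - sum of probabilities is Pr[X = bottom], i.e.
    the tuple does not exist, so one structure captures attribute-level
    and existential uncertainty at once.
    """
    options: list  # [(v_1, p_1), ..., (v_s, p_s)] with v_1 < ... < v_s

    def p_bot(self):
        """Pr[X = bottom]: probability that the tuple is absent."""
        return 1.0 - sum(p for _, p in self.options)
```

For the 3rd record of Table 1, UTuple([(80, 0.4)]).p_bot() gives 0.6 (up to floating-point rounding), matching the invalid-reading mass discussed earlier.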
Definition 1 (Expected Rank Top-k, ER-Topk in abbr.). [8] The ER-topk query returns k tuples with the smallest values of r(t), defined below.

    r(t) = Σ_{W ∈ 𝒲} Pr[W] · rank_W(t)    (1)

where 𝒲 is the possible world space, Pr[W] is the probability of a possible world instance W, and rank_W(t) returns the rank of t in W, i.e., it returns the number of tuples ranked higher than t if t ∈ W, or the number of tuples in W (|W|) otherwise.

By the definition and the linearity of expectation, the expected rank of the tuple ti, r(ti), is computed as follows.

    r(ti) = Σ_{l=1}^{si} p_{i,l} (q(v_{i,l}) − q_i(v_{i,l})) + (1 − Σ_{l=1}^{si} p_{i,l}) (E[|W|] − Σ_{l=1}^{si} p_{i,l})    (2)

where q(v) is the sum of probabilities of all tuples greater than v, and q_i(v) is the sum of probabilities of the tuple ti greater than v, i.e., q_i(v) = Σ_{l: v_{i,l} > v} p_{i,l}, and q(v) = Σ_i q_i(v). Let |W| denote the number of tuples in the possible world W, so that E[|W|] = Σ_{i,l} p_{i,l}.

Example 1. Consider the data set in Table 1. The expected size of all possible worlds is E[|W|] = Σ_{i,l} p_{i,l} = 3.3. Then r(t1) = 0.6 × ((1.0 + 0.4 + 0.4) − 0.4) + 0.4 × (0.4 − 0) = 1.0, and r(t3) = 0.4 × 0 + (1 − 0.4) × (3.3 − 0.4) = 1.74. Similarly, r(t2) = 0.8 and r(t4) = 2.4. So the query returns {t2, t1} when k = 2.
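Equation (2) can be checked directly against Example 1. The sketch below (function names ours) computes q, qi and the expected rank for the data set of Table 1:

```python
# Tuples of Table 1 as (value, probability) option lists.
tuples = [
    [(50, 0.6), (70, 0.4)],   # t1
    [(60, 1.0)],              # t2
    [(80, 0.4)],              # t3
    [(20, 0.4), (30, 0.5)],   # t4
]

def qi(t, v):
    """q_i(v): probability mass of tuple t strictly above v."""
    return sum(p for x, p in t if x > v)

def q(v, data):
    """q(v): total probability mass above v over all tuples."""
    return sum(qi(t, v) for t in data)

def expected_rank(t, data):
    """r(t) following Equation (2)."""
    ew = sum(p for u in data for _, p in u)   # E[|W|]
    exist = sum(p for _, p in t)              # existence probability of t
    return (sum(p * (q(v, data) - qi(t, v)) for v, p in t)
            + (1 - exist) * (ew - exist))

ranks = [expected_rank(t, tuples) for t in tuples]
# matches Example 1: approximately [1.0, 0.8, 1.74, 2.4]
```

Sorting by expected rank then yields {t2, t1} for k = 2, as in Example 1.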
3 Our solution
In this section, we show how to compute the exact answer of the ER-topk query over uncertain data streams. Equation (2) illustrates how to compute the expected rank of a tuple t. Moreover, it implies that the value of the expected rank may change as time goes on, since the function q(v) is based on all tuples seen so far. For example, at time 4, r(t2) = 0.8 and r(t1) = 1.0, so r(t2) < r(t1) (see Example 1). Assume the next tuple t5 is ⟨(65, 1.0)⟩. Then r(t1) = 1.6 and r(t2) = 1.8, so the relative order of t1 and t2 is reversed. This simple example implies that r(ti) > r(tj) at some time point does not mean r(ti) > r(tj) forever. Fortunately, we can find some pairs of tuples, ti and tj, such that r(ti) < r(tj) or r(ti) > r(tj) holds at any time point. For convenience, we use ti ≺ tj to denote the situation that r(ti) < r(tj) holds forever, and ti ≻ tj if r(ti) > r(tj) holds forever. We also say that ti dominates tj if ti ≺ tj, and that ti is dominated by tj if ti ≻ tj.

Theorem 1. Consider two tuples ti and tj. ti ≺ tj only when (i) ∀v, q_i(v) ≥ q_j(v), and (ii) ∃v, q_i(v) > q_j(v). Remember that q_i(v) = Σ_{l: v_{i,l} > v} p_{i,l}.
Proof. Let P denote a set of cut points for ti and tj, P = {Σ_{l=1}^{zi} p_{i,l}} ∪ {Σ_{l=1}^{zj} p_{j,l}} ∪ {1}, where 1 ≤ zi ≤ si and 1 ≤ zj ≤ sj. Let m denote the number of distinct items in P, i.e., m = |P| ≤ si + sj + 1. Let P1, ..., Pm denote the m items in P, with P1 ≤ P2 ≤ ... ≤ Pm = 1 and P0 = 0. Moreover, assume the g-th and h-th items in P satisfy Pg = Σ_{l=1}^{si} p_{i,l} and Ph = Σ_{l=1}^{sj} p_{j,l}. According to condition (i), we have g ≥ h. Let the function vi(z) be defined as vi(z) = v_{i,x}, where x = argmin_y (Σ_{l=1}^{y} p_{i,l} ≥ Pz). Equation (2) can then be rewritten as follows.

    r(ti) = Σ_{l=1}^{g} (Pl − P_{l−1})(q(vi(l)) − qi(vi(l))) + Σ_{l=g+1}^{m} (Pl − P_{l−1})(E[|W|] − Pg)    (3)

Symmetrically, we can also compute the expected rank of tj like Equation (3). Now, we compute r(ti) − r(tj) based on Equation (3). Since the RHS of Equation (3) is the sum of m items, we analyze the per-item difference ∆l through three cases: l ∈ [1, h], l ∈ (h, g], and l ∈ (g, m].

Case 1: l ∈ [1, h].

    ∆l = (Pl − P_{l−1})((q(vi(l)) − qi(vi(l))) − (q(vj(l)) − qj(vj(l))))    (4)

If vi(l) = vj(l), then q(vi(l)) = q(vj(l)); according to condition (i), ∆l ≤ 0. Otherwise, if vi(l) > vj(l), then q(vj(l)) − q(vi(l)) is at least the sum of probabilities of tuple values equal to vi(l), according to the definition of the function q(·). Moreover, since the tuple ti has at least Pl − qi(vi(l)) probability to take the value vi(l), we have q(vj(l)) − q(vi(l)) ≥ Pl − qi(vi(l)). Finally, because qj(vj(l)) < Pl, we have ∆l < 0. It is worth noting that vi(l) < vj(l) will never occur, because it would violate condition (i).

Case 2: l ∈ (h, g].

    ∆l = (Pl − P_{l−1})((q(vi(l)) − qi(vi(l))) − (E[|W|] − Ph))    (5)

This case occurs only when Σ_{l=1}^{si} p_{i,l} > Σ_{l=1}^{sj} p_{j,l}. Similarly, we have E[|W|] − q(vi(l)) ≥ Pl − qi(vi(l)). Thus, ∆l < 0.

Case 3: l ∈ (g, m].

    ∆l = (Pl − P_{l−1})((E[|W|] − Pg) − (E[|W|] − Ph))    (6)

This case occurs only when Σ_{l=1}^{si} p_{i,l} < 1 and Σ_{l=1}^{sj} p_{j,l} < 1. Since Pg > Ph, we have ∆l < 0.

In conclusion, r(ti) − r(tj) = Σ_{l=1}^{m} ∆l < 0 under the two conditions. Finally, we show that if either condition is violated, ti ≺ tj need not hold. First, without condition (ii), ti and tj could have identical distributions, so that r(ti) = r(tj). Second, without condition (i), there exists v̂ such that qi(v̂) < qj(v̂). When a new tuple ⟨(v̂, 1.0)⟩ arrives, the values of r(ti) and r(tj) increase by 1 − qi(v̂) and 1 − qj(v̂) respectively. Obviously, r(ti) > r(tj) will hold after a number of such tuples have been inserted, because 1 − qi(v̂) > 1 − qj(v̂).
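Since each qi is a step function that only changes at the option values, the two conditions of Theorem 1 can be checked at finitely many probe points. A small sketch (names ours), replayed on the tuples of Table 1:

```python
def q_of(t, v):
    """q_i(v) for an option list t = [(value, prob), ...]."""
    return sum(p for x, p in t if x > v)

def dominates(ti, tj):
    """Theorem 1 check: ti dominates tj iff q_i(v) >= q_j(v) for
    every v, with strict inequality for some v.

    Both step functions are constant between consecutive option values,
    so probing each value of either tuple, plus one point below all of
    them, covers the whole domain.
    """
    vals = sorted({x for x, _ in ti} | {x for x, _ in tj})
    probes = [vals[0] - 1] + vals
    ge = all(q_of(ti, v) >= q_of(tj, v) for v in probes)
    gt = any(q_of(ti, v) > q_of(tj, v) for v in probes)
    return ge and gt

t1 = [(50, 0.6), (70, 0.4)]
t2 = [(60, 1.0)]
t4 = [(20, 0.4), (30, 0.5)]
# dominates(t1, t4) and dominates(t2, t4) hold; t1 and t2 are incomparable.
```

This matches Example 2: t1 ≺ t4 and t2 ≺ t4, while neither t1 ≺ t2 nor t2 ≺ t1 holds.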
Fig. 1. The functions qi(v), for each 1 ≤ i ≤ 4
Example 2. Figure 1 illustrates the functions qi(v) for all tuples in Table 1. Obviously, t1 ≺ t4 and t2 ≺ t4. For the pair of tuples t1 and t2, neither t1 ≺ t2 nor t2 ≺ t1 holds.

Lemma 1. A tuple t cannot belong to the query result if there exist at least k tuples t′ such that t′ ≺ t.

Proof. The correctness stems from Theorem 1.

Lemma 1 is capable of checking whether a tuple can potentially belong to the query result in the future. Such candidate tuples must be stored in the system. In addition, it is worth noting that under some situations a tuple cannot belong to the query result even if it is not dominated by k tuples. See the example below.

Example 3. Consider a situation where k = 1. There are three tuples, t1 = ⟨(9, 0.6), (8, 0.2), (7, 0.1), (6, 0.1)⟩, t2 = ⟨(11, 0.4), (6, 0.5), (5, 0.1)⟩, and t3 = ⟨(10, 0.1), (9, 0.4), (4, 0.5)⟩. Obviously, no tuple dominates another. But the tuple t3 will never be output, since its expected rank r(t3) will be greater than that of t1 or t2 no matter what tuples come later. In other words, we can verify that r(t1) + r(t2) − 2r(t3) < 0 always holds because the function q(·) is monotonic (Equation (2)).

However, discovering all candidate tuples as in Example 3 is quite expensive, since it requires checking a huge number of tuples. Consequently, in this paper we mainly use Lemma 1 to evaluate candidates. Even though a few redundant tuples are stored, it is efficient in computation. Algorithm 1 is the main framework of our exact solution, which invokes maintainDomGraph and maintainProbTree repeatedly to maintain two novel structures, namely the domGraph and the probTree.

3.1 domGraph
The domGraph, with each node described in the form of (t, T≺, T≻, state, c), is a graph that stores all candidate tuples. The entry t refers to a tuple in the stream.
Algorithm 1 processStream()
1: Empty a domGraph G and a probTree T;
2: for each (arriving tuple t)
3:   maintainDomGraph(t, G);
4:   maintainProbTree(t, T);
Algorithm 2 maintainDomGraph(t, G)
1: Empty FIFO queues Q, Q≺, Q≻; b ← 0; set(G, TV);
2: foreach (node n in G)
3:   if (∄n′ ∈ G, n.t ≺ n′.t) then push(n, Q);
4: while ((n ← pop(Q)) ≠ NULL)
5:   set(n, VD); b ← b + 1;
6:   if (n.t ≺ t) then
7:     if (n.c ≥ k − 1) then return;  // t isn't a candidate
8:     push(n, Q≺); set(n.T≺, NV);
9:   else
10:    if (t ≺ n.t) then
11:      push(n, Q≻); Q≻ ← Q≻ − n.T≻;
12:    pushDominated(n, Q);
13: create a new node nnew(t, Q≺, Q≻, |G| + |Q≺| − b, TV);
14: Remove old references between nnew.T≻ and nnew.T≺;
15: foreach (node n in G, n.t ≻ nnew.t)
16:   n.c ← n.c + 1;
17:   if (n.c ≥ k − 1) then
18:     Remove all nodes in n.T≻ in cascade style;
T≺ represents the set of tuples ranking immediately higher than t, i.e., (i) ∀t′ ∈ T≺, t′ ≺ t, and (ii) ∄t′, t″ ∈ T≺ such that t′ ≺ t″. T≻ represents the set of tuples ranking immediately lower than t, i.e., (i) ∀t′ ∈ T≻, t′ ≻ t, and (ii) ∄t′, t″ ∈ T≻ such that t′ ≻ t″. The entry state illustrates the state of the node. During processing, a node can be in one of three states: TV (To Visit), VD (Visited), or NV (No Visit). The entry c is the total number of tuples in the domGraph ranking higher than t.

Algorithm maintainDomGraph (Algorithm 2) illustrates how to maintain a domGraph when a new tuple t arrives. Initially, three FIFO (First In First Out) queues, Q, Q≺ and Q≻, which store the nodes to be visited, the nodes dominating t, and the nodes dominated by t respectively, are emptied. In general, pop(Q) and push(n, Q) are basic operators supported by any FIFO queue. The operator pop(Q) returns the item at the front of a non-empty queue Q and then removes it from Q; it returns NULL if Q is empty. The operator push(n, Q) inserts the item n at the back of the queue Q. A subroutine set is defined to update the state entry for a set of nodes. For example, at Line 1, set(G, TV) means that the states of all nodes in G are set to TV. The variable b, initialized to zero, represents the number of nodes that have been visited.

At first, the states of all nodes in G are initialized to TV, meaning that these nodes are ready to be visited. All nodes that do not dominate any other
Algorithm 3 pushDominated(n, Q)
1: for each node n′ in n.T≺ do
2:   if (n′.state = TV) then
3:     if (∀n″ ∈ n′.T≻, n″.state = VD) then push(n′, Q);
Fig. 2. The evolution of the domGraph based on Table 4 ((a)-(f): after t1 through t6)
node in G are pushed into Q (at Lines 1-3). Subsequently, the two FIFO queues Q≺ and Q≻ are constructed by processing all nodes in Q (at Lines 4-12). The queue Q≺ represents the nodes that directly dominate t, and Q≻ represents the nodes that are directly dominated by t. The state of n, a node popped from Q, is updated to VD (Visited), meaning that this node has been visited. Obviously, any node that dominates n.t also dominates t if n.t ≺ t, which means that it is unnecessary to visit those nodes later. In that situation, we compare n.c with k: the new tuple t cannot be a candidate if n.c ≥ k − 1, so the processing of t can be terminated (Lemma 1). Otherwise, we push n into Q≺ and set the states of all nodes in n.T≺ to NV (No Visit). A node with a state of NV will never be pushed into the queue Q. Subsequently, if t ≺ n.t, we push n into Q≻, after which Q≻ is updated so that it only contains nodes directly dominated by t. It is worth noting that even if n.t does not dominate t, the nodes dominating n must still be checked. To this end, the subroutine pushDominated (Algorithm 3) is invoked to push into Q those nodes in n.T≺ satisfying the following two conditions simultaneously: (i) the state is TV, and (ii) all nodes dominated by the node have been visited. Condition (i) states that a node to be pushed into Q must not have been visited yet. Condition (ii) ensures that a node is pushed into Q only after all nodes dominated by it.

Then, a new node for t is inserted into G if necessary. The queues Q≺ and Q≻ keep all nodes directly dominating t and directly dominated by t respectively. We then compute the value of the entry c, which represents the number of nodes in G dominating t. Recall that all nodes dominating any node in Q≺ have not been visited (labeled NV), and b represents the number of nodes that have been visited. So there are |G| + |Q≺| − b nodes in G dominating t.

Subsequently, we remove all references between nnew.T≻ and nnew.T≺ to make G consistent (at Lines 13-14). Finally, the entry c of each node dominated by t increases by 1. Additionally, if we find a node n such that n.c ≥ k − 1, all nodes dominated by n can be removed safely (at Lines 15-18).
Analysis. We maintain a domGraph for the set of candidate tuples for two reasons. First, we remove tuples which are definitely not candidates for a query. Since an arriving tuple may also contain other large attributes, such as text information, this saves space. Second, the domGraph is efficient to maintain. Without this directed acyclic graph, it would not be easy to decide whether a new tuple is a candidate or not. In our algorithm, we compare the new tuple with a set of low-ranked tuples in G first, so the processing can be terminated as early as possible (at Line 7).

Example 4. Figure 2 illustrates the evolution of a domGraph based on the small set of uncertain data in Table 4, with k = 2. Each node is affiliated with the tuple t and the entry c. A directed link from ni to nj means that ni.t ≺ nj.t. Obviously, a domGraph is in fact a directed acyclic graph. After time 2, both t1 and t2 stay in the domGraph and t1 ≻ t2. At time 3, since t1 ≺ t3 and the node n1 has n1.c = 1 ≥ k − 1, t3 will not be inserted into the domGraph. The following three tuples are all inserted into the domGraph.

3.2 probTree
Another indispensable task is to maintain the function q(v) over all tuples in the stream. [8] provides a simple solution for a static data set: when a query request arrives, it sorts the data set in O(n log n) time, where n is the size of the data set, and then constructs the function q(v) in a single linear scan over the ordered tuples. This approach is not suitable for streaming scenarios, since it is expensive at request-processing time. Our goal is to devise a novel approach that is efficient in both the tuple-maintenance and the request-processing phases.

Our solution is a binary search tree, called the probTree, with each node in the form of (v, p, l, r, par). The entry v, describing an attribute value, is also the key of the tree. The entry p represents the probability sum of some tuples. The remaining three entries, l, r and par, are references to the node's left child, right child and parent respectively.

Algorithm maintainProbTree (Algorithm 4) illustrates how to maintain a probTree T continuously when a new tuple t, described as ⟨(v1, p1), ..., (v_{st}, p_{st})⟩, arrives. Initially, if T is empty, we insert a new node as the root node. In general, for each pair (vj, pj), the algorithm seeks a target node with key (the entry v) equal to vj. If such a node is found, the value of its entry p increases by pj. Otherwise, we insert a new node (vj, pj) into T. Moreover, for each node w along the path from the root to the destination (a node with entry v equal to vj), the entry p is updated as w.p ← w.p + pj if w.v < vj.

Algorithm getq (Algorithm 5) illustrates how to compute the value of q(υ) using a probTree T. It visits the nodes along the path from the root node toward the node whose key is minw∈T,w.v>υ(w.v). A variable sum, representing the result value, is initialized to zero. For any node w on the path, the value of sum is updated as sum ← sum + w.p if w.v > υ.
Algorithm 4 maintainProbTree(t, T)
1: for j = 1 to st
2:   w ← T.root;
3:   if (w = NULL) then
4:     T.root ← new Node(vj, pj); continue;
5:   while (w ≠ NULL)
6:     if (w.v > vj) then
7:       if (w.l ≠ NULL) then w ← w.l;
8:       else w.l ← new Node(vj, pj); break;
9:     else if (w.v < vj) then
10:      w.p ← w.p + pj;
11:      if (w.r ≠ NULL) then w ← w.r;
12:      else w.r ← new Node(vj, pj); break;
13:    else w.p ← w.p + pj; break;
Algorithm 5 getq(υ, T)
1: sum ← 0;
2: w ← T.root;
3: while (w ≠ NULL)
4:   if (w.v > υ) then
5:     sum ← sum + w.p;
6:     w ← w.l;
7:   else w ← w.r;
8: return sum;
The correctness of Algorithm 5 stems from the construction of a probTree. Let (v_{i,l}, p_{i,l}) be an arbitrary attribute-probability pair in the uncertain data set, and let RT(w) denote the tree consisting of the node w and its right sub-tree. For each node w in the probTree, the value of the entry p is the sum of probabilities of all attribute-probability pairs with keys in RT(w), i.e., w.p = Σ_{v_{i,l} ∈ RT(w)} p_{i,l}. In Algorithm 5, whenever a node w with w.v > υ is visited, its p is accumulated and the left child is visited next (at Lines 5-6). In this way, the correctness is ensured.

Example 5. Figure 3 illustrates the probTree at times 5 and 6 respectively. Each node is affiliated with the information (v, p). When the tuple t6 = ⟨(10, 0.4), (5, 0.6)⟩ arrives, the algorithm finds the node n3 with n3.v = 10. The entry n3.p is updated to 1.1 (= 0.7 + 0.4). Since the parent node n1 has n1.v = 9 < 10, its entry n1.p is also updated to 1.9 (= 1.5 + 0.4). Subsequently, a new node n7 for the pair (5, 0.6) is inserted into the probTree, since no node with v = 5 exists yet. Similarly, the entry p of the node n6 is updated to 1.3 (= 0.7 + 0.6) since n6.v < n7.v.

It is easy to compute the value of q(υ) based on a probTree. Assume υ = 8. At first, the root node n1 is visited, setting the variable sum to 1.9 since n1.v = 9 > 8. Next, the left child n2 is visited, and nothing is done since n2.v = 7 < 8. Finally, the right child n4 is visited, and n4.v = 8 is not greater than 8. As a result, it returns 1.9.
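Algorithms 4 and 5 translate almost line-for-line into code. The sketch below (class names ours) replays the six tuples of Table 4 and reproduces q(8) = 1.9 from Example 5:

```python
class Node:
    def __init__(self, v, p):
        self.v, self.p = v, p
        self.l = self.r = None

class ProbTree:
    """Binary search tree in which w.p accumulates the probability mass
    of all keys in {w} plus w's right sub-tree (Algorithm 4)."""

    def __init__(self):
        self.root = None

    def insert(self, v, p):
        if self.root is None:
            self.root = Node(v, p)
            return
        w = self.root
        while True:
            if w.v > v:            # v lies left of w: no update to w.p
                if w.l is None:
                    w.l = Node(v, p); return
                w = w.l
            elif w.v < v:          # v lies on w's right side: add mass
                w.p += p
                if w.r is None:
                    w.r = Node(v, p); return
                w = w.r
            else:                  # same key: merge the mass
                w.p += p; return

    def getq(self, u):
        """q(u): total mass of keys strictly above u (Algorithm 5)."""
        s, w = 0.0, self.root
        while w is not None:
            if w.v > u:
                s += w.p
                w = w.l
            else:
                w = w.r
        return s

T = ProbTree()
for t in [[(9, 0.3), (7, 0.7)], [(10, 0.4), (8, 0.6)], [(7, 1.0)],
          [(9, 0.5), (8, 0.4), (7, 0.1)], [(11, 0.3), (4, 0.7)],
          [(10, 0.4), (5, 0.6)]]:
    for v, p in t:
        T.insert(v, p)
# T.getq(8) is now 1.9 (up to rounding), as computed in Example 5.
```

The par entry of the paper's node layout is omitted in this sketch, since neither algorithm needs an upward traversal.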
Fig. 3. An example of the probTree upon Table 4: (a) after processing t5; (b) after processing t6

Tuple ID  Attribute Value
t1        ⟨(9, 0.3), (7, 0.7)⟩
t2        ⟨(10, 0.4), (8, 0.6)⟩
t3        ⟨(7, 1.0)⟩
t4        ⟨(9, 0.5), (8, 0.4), (7, 0.1)⟩
t5        ⟨(11, 0.3), (4, 0.7)⟩
t6        ⟨(10, 0.4), (5, 0.6)⟩
Table 4. A small data set
Theorem 2. Let N denote the size of the data stream, and let s denote the maximum number of options of a tuple, i.e., s = max_{i=1}^{N} si. The size of a probTree is O(sN), the per-tuple processing cost is O(s log(sN)), and computing q(·) from a probTree costs O(log(sN)).

Proof. We assume the tuples in the stream arrive in random order. Obviously, the number of distinct items in the stream is O(sN). The cost of inserting a tuple or computing q(·) depends on the height of the probTree. When the items are inserted in sorted order, the height is O(sN) in the worst case. In contrast, the expected height of a randomly built binary search tree on sN keys is O(log(sN)) [5]. Straightforwardly, the amortized cost of inserting a tuple is O(s log(sN)), and the cost of computing q(·) is O(log(sN)).

3.3 Handle a request
We can efficiently handle an ER-topk request by using the domGraph and the probTree. Initially, a FIFO queue Q and a result set R are emptied. Let rmax(R) denote the maximum expected rank over all nodes in a set R, i.e., rmax(R) = maxn∈R(r(n)). At first, all nodes not dominated by any other node are pushed into Q, since only these nodes can possibly rank first. Subsequently, nodes are popped out of Q for evaluation repeatedly until Q is empty. The expected rank r(n) of a node n can be computed by Equation (2). If R contains fewer than
k nodes, the node n is added into R immediately. Otherwise, if r(n) < rmax(R), we insert n into R and remove the lowest-ranked tuple from R. The next task is to check, for each node n′ in n.T≻, whether it should be pushed into Q. In general, n′ is pushed into Q under two conditions. First, all nodes dominating n′ are located in R: if some of those nodes are still in Q, there is no need to push n′ into Q so early; and if some of them have been processed but could not be kept in R, then n′ cannot belong to the result set. Second, the result set R is not yet full, or rmax(n′.T≺) < rmax(R).
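If one ignores the dominance-based pruning of the traversal above, the final selection reduces to scoring each surviving candidate with Equation (2) and keeping the k smallest. A deliberately simplified sketch (names ours, not the paper's traversal):

```python
import heapq

def er_topk(candidates, k, expected_rank):
    """Return the k candidates with the smallest expected rank.

    expected_rank is any callable implementing Equation (2); the
    domGraph traversal described above only serves to avoid scoring
    candidates that cannot enter the result.
    """
    return heapq.nsmallest(k, candidates, key=expected_rank)
```

heapq.nsmallest keeps a bounded heap internally, so the selection costs roughly O(|candidates| log k) score comparisons.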
4 Experiments
In this section, we present an experimental study on synthetic and real data. All the algorithms are implemented in C++, and the experiments are performed on a system with an Intel Core 2 CPU (2.4GHz) and 4GB of memory. We also compare the performance with the static method in [8], under the assumption that all tuples are stored in memory.

4.1 Data set description
We use two synthetic data sets and one real data set in our tests. Each data set contains 20 million tuples.

syn-uni The syn-uni data set only has existential uncertainty. The rank of each tuple is randomly selected from 1 to 20 million without replacement, and the probability is uniformly distributed in (0, 1).

syn-nor The syn-nor data set has both kinds of uncertainties. The existential confidence pi of each tuple is randomly generated from a normal distribution N(0.6, 0.3)³. We set the maximum number of options of any tuple to no more than 10, i.e., smax = 10. For the i-th tuple, the number of attribute options si is uniformly selected from [smax/2, smax]. Subsequently, we construct a normal distribution N(µi, σi), where µi is uniformly selected from [0, 1,000] and σi is uniformly selected from [0, 1]. We randomly select si values from the distribution N(µi, σi), denoted v1, ..., v_{si}, and construct ti as ⟨(v1, pi/si), ..., (v_{si}, pi/si)⟩.

IIP The IIP Iceberg Sightings Database⁴ collects information on iceberg activity in the North Atlantic near the Grand Banks of Newfoundland. Each sighting record contains the date, location, shape, size, number of days drifted, etc. Since it also contains a confidence-level attribute based on the source of the sighting, we converted the confidence levels, including R/V, VIS, RAD, SAT-LOW, SAT-MED, SAT-HIGH and EST, to the probabilities 0.8, 0.7, 0.6, 0.5, 0.4, 0.3 and 0.4 respectively. We created a 20,000,000-record data stream by repeatedly selecting records at random from the set of all records gathered from 1998 to 2007.

³ N(µ, σ) is a normal distribution with mean value µ and standard deviation σ.
⁴ http://nsidc.org/data/g00807.html
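The syn-nor generation procedure can be summarized in a few lines. The sketch below (function name ours) follows the description above; clamping out-of-range confidence draws to (0, 1] is our assumption, as the paper does not state how such draws are handled:

```python
import random

def gen_syn_nor_tuple(smax=10):
    """Generate one syn-nor tuple as a sorted (value, prob) option list."""
    p = min(max(random.gauss(0.6, 0.3), 1e-6), 1.0)  # confidence ~ N(0.6, 0.3)
    s = random.randint(smax // 2, smax)              # number of options s_i
    mu = random.uniform(0, 1000)                     # per-tuple mean
    sigma = random.uniform(0, 1)                     # per-tuple deviation
    vs = sorted(random.gauss(mu, sigma) for _ in range(s))
    return [(v, p / s) for v in vs]                  # each option gets p/s
```

Each generated tuple carries both attribute-level uncertainty (si values) and existential uncertainty (total mass p ≤ 1).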
Fig. 6. Space consumption upon uncertain data sets: (a) upon syn-uni; (b) upon syn-nor; (c) upon IIP
4.2 Space efficiency
We first evaluate the space efficiency of the proposed solution. Our domGraph efficiently prunes the tuples which have no chance of being output, whereas in the static method all information must be kept in memory; when the information part of a tuple is large, the overall space consumption of the static method is therefore also very large. Figure 6 compares the space consumption of the static method and the proposed method on the three data sets. The space consumption depends heavily on the information part of a tuple. In Figures 6(a) and (b), we vary the amortized tuple size from 10 to 1000 bytes. When the size is small, the space consumption is comparable, because all the probability information must be kept in the system. As the size increases, the difference becomes larger and larger. Figure 6(c) illustrates the difference between the two methods on the IIP data set. As tested, the amortized tuple size is 100 bytes, so the space consumption of the static method is considerably greater than that of the proposed method.

4.3 Time efficiency
We next evaluate the time efficiency of the proposed solution. The cost has two parts: the per-tuple processing cost and the cost of handling a request. The per-tuple processing cost of the static method is negligible, since each tuple is simply appended to the data set. The proposed method, however, must maintain the two structures, domGraph and probTree, incrementally, so that a request can be answered efficiently at any time. Figure 7 illustrates the per-tuple processing cost on the three data sets; the x-axis represents the parameter k, and the y-axis the time. The cost grows as k increases, but remains at a low level. Figure 8 illustrates the cost of handling a request on the three data sets, in comparison with the static method. Our method answers a request much faster than the static method.
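For intuition about why the static method answers requests slowly, recall that it stores every tuple and computes expected ranks from scratch at query time. The following is a minimal brute-force sketch for the special case of existential uncertainty only (one score per tuple), following the expected-rank semantics of Cormode et al. [8] in simplified form: if a tuple appears, its rank is the expected number of other appearing tuples with a higher score; if it is absent, its rank is the expected size of the possible world. The exact tie and absence handling in [8] differs in details, so treat this as an illustration rather than the paper's algorithm.

```python
def expected_rank(tuples, i):
    """Expected rank of tuple i under existential uncertainty only.

    tuples: list of (score, existential probability).
    """
    vi, pi = tuples[i]
    # expected number of other tuples that appear and outrank tuple i
    higher = sum(pj for j, (vj, pj) in enumerate(tuples) if j != i and vj > vi)
    # expected number of other tuples that appear at all
    present = sum(pj for j, (_, pj) in enumerate(tuples) if j != i)
    return pi * higher + (1 - pi) * present

def er_topk(tuples, k):
    """Static baseline: sort all tuples by expected rank (smaller is better)."""
    order = sorted(range(len(tuples)), key=lambda i: expected_rank(tuples, i))
    return order[:k]

# tiny example: three tuples as (score, existential probability)
data = [(100, 0.9), (90, 0.5), (80, 0.8)]
print(er_topk(data, 2))  # → [0, 1]
```

This brute force is quadratic in the number of tuples per request, which is consistent with the observation above that the static method handles requests far more slowly than the incrementally maintained domGraph and probTree.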
[Figures omitted: the panels plot precision against time under the settings 10k, 20k, 40k and 80k, for (a) k = 5 and (b) k = 15.]

Fig. 7. Per-tuple processing cost

Fig. 8. Time cost on handling a request

5 Related work
Uncertain data management has attracted much attention in recent years, because uncertainty widely exists in many applications, such as the Web, sensor networks, financial applications, and so on [1]. In general, there are two kinds of uncertainty, namely existential uncertainty and attribute-level uncertainty. More recently, several prototype systems have been developed to manage such uncertain data, including MayBMS [4], MystiQ [10], and Trio [3].

Uncertain top-k queries have been studied extensively in recent years. Although the semantics is unambiguous for deterministic data, several different top-k definitions have been proposed for distinct purposes, including U-topk [18], U-kRanks [18], PT-k [12], global top-k [20], ER-topk [8], ES-topk [8], UTop-Setk [17], c-Typical-Topk [11], and unified top-k [16] queries. However, most previous work studies only the "one-shot" top-k query over a static uncertain data set, with the exception of [15], which studies how to handle sliding-window top-k queries (U-topk, U-kRanks, PT-k and global top-k) on uncertain streams. In [15], synopsis data structures are proposed to summarize the tuples in the sliding window efficiently and effectively. The focus of this paper is streaming algorithms for top-k queries. After examining the main uncertain top-k semantics, we find that most of them have the shrinkability property, with the exception of the ER-topk query. In other words, for an ER-topk query, each tuple in the stream is either a candidate result tuple or may influence the query result. Hence, we devise an exact streaming solution and an approximate streaming solution for the ER-topk query.

Recently, there has also been much effort in extending query processing techniques for static uncertain data to uncertain data streams [2, 6, 7, 9, 13, 14, 19]. In [2], a method is proposed for clustering uncertain streams; in [19], mining frequent items over probabilistic data is studied.
6 Conclusion
In this paper, we aim at handling top-k queries on uncertain data streams. Since the volume of a data stream is unbounded whereas the memory resource is limited, we hoped to find heuristic rules that remove redundant tuples so that the remaining tuples suffice for the final results. Although this assumption holds for most typical uncertain top-k semantics, we find that no tuple is redundant under the ER-topk semantics: each tuple either (i) belongs to the result set now or later, or (ii) may influence the query results. We have devised two efficient and effective solutions for the ER-topk query, namely an exact solution and an approximate solution. A possible direction for future work is to devise solutions for the sliding-window model.
References

1. C. C. Aggarwal. Managing and mining uncertain data. Springer, 2009.
2. C. C. Aggarwal and P. S. Yu. A framework for clustering uncertain data streams. In Proc. of ICDE, 2008.
3. P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A system for data, uncertainty, and lineage. In Proc. of VLDB, 2006.
4. L. Antova, C. Koch, and D. Olteanu. From complete to incomplete information and back. In Proc. of SIGMOD, 2007.
5. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms. The MIT Press, pages 265–268, 2001.
6. G. Cormode and M. Garofalakis. Sketching probabilistic data streams. In Proc. of ACM SIGMOD, 2007.
7. G. Cormode, F. Korn, and S. Tirthapura. Exponentially decayed aggregates on data streams. In Proc. of ICDE, 2008.
8. G. Cormode, F. Li, and K. Yi. Semantics of ranking queries for probabilistic data and expected ranks. In Proc. of ICDE, 2009.
9. G. Cormode, S. Tirthapura, and B. Xu. Time-decaying sketches for sensor data aggregation. In Proc. of PODC, 2007.
10. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDB Journal, 16(4):523–544, 2007.
11. T. Ge, S. Zdonik, and S. Madden. Top-k queries on uncertain data: On score distribution and typical answers. In Proc. of SIGMOD, 2009.
12. M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking queries on uncertain data: A probabilistic threshold approach. In Proc. of SIGMOD, 2008.
13. T. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic data. In Proc. of SODA, 2007.
14. T. Jayram, A. McGregor, S. Muthukrishnan, and E. Vee. Estimating statistical aggregates on probabilistic data streams. In Proc. of PODS, 2007.
15. C. Jin, K. Yi, L. Chen, J. X. Yu, and X. Lin. Sliding-window top-k queries on uncertain streams. Proc. of the VLDB Endowment, 1(1):301–312, 2008.
16. J. Li, B. Saha, and A. Deshpande. A unified approach to ranking in probabilistic databases. In Proc. of VLDB, 2009.
17. M. A. Soliman and I. F. Ilyas. Ranking with uncertain scores. In Proc. of ICDE, 2009.
18. M. A. Soliman, I. F. Ilyas, and K. C.-C. Chang. Top-k query processing in uncertain databases. In Proc. of ICDE, 2007.
19. Q. Zhang, F. Li, and K. Yi. Finding frequent items in probabilistic data. In Proc. of SIGMOD, 2008.
20. X. Zhang and J. Chomicki. On the semantics and evaluation of top-k queries in probabilistic databases. In Proc. of DBRank, 2008.