On Cardinality Estimation Protocols for Wireless Sensor Networks

Jacek Cichoń, Jakub Lemiesz, and Marcin Zawada

Institute of Mathematics and Computer Science, Wrocław University of Technology, Poland
{Jacek.Cichon, Jakub.Lemiesz, Marcin.Zawada}@pwr.wroc.pl

Abstract. We consider the problem of estimating the size of a wireless sensor network (WSN). The problem arises in tasks where sensors have to quickly obtain an approximate size of the network in order to run algorithms, e.g. leader election or the initialization problem, that require this information to work efficiently. Another application is an efficient counter which returns the number of distinct people in a given area during mass events and sends an alert message [1] if this number exceeds some threshold, thus increasing the security of such events. In this paper we present three algorithms, based on order statistics and the balls-and-bins model, which effectively estimate the size of a WSN.

Keywords: cardinality estimation, sensor networks

1 First Algorithm

Our first algorithm is based on the Probabilistic Counting algorithm [2], designed to count the number of unique values in the presence of duplicates in database systems. Unfortunately, in such systems the data are not truly random, so the authors had to assume the existence of a hash function with ideal properties. In our algorithm, however, each station can generate random values itself, thus we do not need a special hash function and the estimates hold exactly.

1.1 Description of the Algorithm

This algorithm is based on the balls-and-bins model. At first each sensor allocates a bitmap of size m, which represents m bins, and initializes all entries to "0". Then it generates a random value x, sets the x-th bit to "1" and broadcasts x to all neighbours. At this point the sensor goes to sleep and wakes up upon receiving a message x. It then checks whether bin T[x] is empty; if so, it sets the x-th bit to "1" and forwards the message to its neighbours. Otherwise, the sensor does nothing and goes back to sleep. Notice that if a sensor receives the same message twice, the proper bit is already set, so it does not forward the message further. Since we usually have many more sensors than bins, it can also happen that two stations generate the same value, which further decreases the number of transmitted messages. At any time we can obtain the estimated number of sensors by calculating log(x̂/m)/log(1 − 1/m), where x̂ is the number of empty bins. As the hash function h we can even take the identity function h(x) = x. However, notice that in our algorithm sensors may use bitmaps of different sizes. For example, we can generate 32-bit random values; then each sensor can use the simple hash function h(x) = x mod m, where m < 2^32.

Algorithm 1
Initialization
  set bitmap T[i] ← 0 for all i = 0, ..., m − 1
  x ← random(0, m − 1)
  T[x] ← 1
  broadcast ⟨x⟩ to neighbours
Upon receiving a message
  receive ⟨x⟩
  if T[h(x)] = 0 then
    T[h(x)] ← 1
    broadcast ⟨x⟩ to neighbours
  end if
Get estimated number of sensors
  x̂ ← m − Σ_{i=0}^{m−1} T[i]
  return log(x̂/m) / log(1 − 1/m)
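To make the estimator concrete, here is a minimal Python sketch (ours, not part of the original protocol; the function name estimate_size is our own). It simulates all n sensors writing into one shared bitmap, which is what the network converges to, and then applies the estimate from the last step of the pseudocode:

```python
import math
import random

def estimate_size(n_sensors: int, m: int = 512) -> float:
    """Simulate Algorithm 1: every sensor throws one ball into m bins."""
    T = [0] * m
    for _ in range(n_sensors):
        x = random.randrange(m)  # each sensor generates a random bin index
        T[x] = 1                 # duplicates simply hit an already-set bit
    empty = m - sum(T)           # x_hat: the number of empty bins
    if empty == 0:
        return float('inf')      # all bins full: the estimate saturates
    return math.log(empty / m) / math.log(1 - 1 / m)

# Example: average the estimator over a few runs.
runs = [estimate_size(1000) for _ in range(100)]
print(sum(runs) / len(runs))     # should be close to 1000
```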

1.2 Analysis

Theorem 1 ([2]). Let us assume that n balls have been thrown into m bins. Let X be a random variable denoting the number of empty bins. Then

E[X] = m(1 − 1/m)^n    (1)

and

Var[X] = m((1 − 1/m)^n − (1 − 2/m)^n + m((1 − 2/m)^n − (1 − 1/m)^{2n})).    (2)

Thus, we can derive the estimator n̂ for the number of balls that have been thrown into m bins:

n̂ = log(X/m) / log(1 − 1/m)    (3)
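As a quick sanity check (our own illustration, not part of the original analysis), a short Monte Carlo experiment can be compared against formulas (1) and (2):

```python
import random

def empty_bins(n: int, m: int) -> int:
    """Throw n balls into m bins and count the empty bins."""
    occupied = {random.randrange(m) for _ in range(n)}
    return m - len(occupied)

n, m, trials = 1000, 512, 5000
samples = [empty_bins(n, m) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials

mean_1 = m * (1 - 1 / m) ** n                       # formula (1)
var_2 = m * ((1 - 1 / m) ** n - (1 - 2 / m) ** n    # formula (2)
         + m * ((1 - 2 / m) ** n - (1 - 1 / m) ** (2 * n)))
print(mean, mean_1)  # both around 72.6
print(var, var_2)    # both around 44
```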

2 Second Algorithm

Our second algorithm is based on the idea of the Power of Two Random Choices [3], designed mainly to obtain better load balancing in the balls-and-bins model. Suppose that instead of throwing n balls uniformly at random into n bins, we place the balls sequentially, each in the least loaded of d ≥ 2 bins chosen independently and uniformly at random. In our algorithm, however, we proceed quite the opposite way: whenever possible we place each ball in a bin which is already occupied. Notice that this algorithm in fact generalizes the previous one, which corresponds to the case d = 1.

[Figure: two panels, E[n̂]/n and std[n̂]/n versus n.]
Fig. 1. Simulation results for n = 1, ..., 2000 and m = 512

[Figure: four panels, E[n̂]/n, E[M] (×10⁵), std[n̂]/n and std[M] versus n.]
Fig. 2. Simulation results for grid n = k × k for k = 1, ..., 24 and m = 512

2.1 Description of the Algorithm

In this algorithm, similarly to the previous one, each sensor first allocates a bitmap T of size m, which represents m bins, and initializes all entries to "0". Then it generates a random value x_1 and sets T[x_1] = 1. Moreover, it generates random values x_2, ..., x_d and broadcasts x_1, ..., x_d to all neighbours. The sensor then goes to sleep and wakes up upon receiving a message x_1, ..., x_d from one of its neighbours. It then checks whether all bins T[x_1], T[x_2], ..., T[x_d] are empty; if so, it sets the x_1-th bin to "1" and forwards the message to its neighbours. Otherwise, the sensor does nothing and goes back to sleep. The hash function h can be chosen as in the previous algorithm.

Algorithm 2
Initialization
  set bitmap T[i] ← 0 for all i = 0, ..., m − 1
  x_1 ← random(0, m − 1)
  T[x_1] ← 1
  broadcast ⟨x_1, random(0, m − 1), ..., random(0, m − 1)⟩ to neighbours
Upon receiving a message
  receive ⟨x_1, x_2, ..., x_d⟩
  if T[h(x_1)] = 0 ∧ T[h(x_2)] = 0 ∧ ... ∧ T[h(x_d)] = 0 then
    T[h(x_1)] ← 1
    broadcast ⟨x_1, x_2, ..., x_d⟩ to neighbours
  end if
Get estimated number of sensors
  x̂ ← (1/m) · Σ_{i=0}^{m−1} T[i]
  return m · ((1 − x̂)^{1−d} − 1)/(d − 1)
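A minimal Python sketch (ours) of the resulting balls-and-bins process and of the estimator returned by Algorithm 2; as in the analysis below, we simulate a single global bitmap, and the function names are our own:

```python
import random

def simulate_d_choices(n_sensors: int, m: int = 512, d: int = 10) -> int:
    """Each sensor draws d bins; a new bin is marked only if all d are empty."""
    T = [0] * m
    for _ in range(n_sensors):
        xs = [random.randrange(m) for _ in range(d)]
        if all(T[x] == 0 for x in xs):
            T[xs[0]] = 1
    return sum(T)  # X: the number of nonempty bins

def estimate(X: int, m: int, d: int) -> float:
    """Estimator (7): n_hat = m * ((1 - X/m)^(1-d) - 1) / (d - 1); assumes X < m."""
    return m * ((1 - X / m) ** (1 - d) - 1) / (d - 1)

m, d, n = 512, 10, 1000
X = simulate_d_choices(n, m, d)
print(estimate(X, m, d))  # should be close to n = 1000
```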

2.2 Analysis

Let m be the number of bins and let X(n) be the number of nonempty bins after n balls have been thrown. Then we can define a process

X(0) = 0, X(i + 1) = X(i) + ξ(X(i)),    (4)

where ξ is a binary random variable and ξ(k) = 1 denotes the event that all d bins, chosen independently and uniformly at random, are empty when k bins are already occupied; thus Pr[ξ(k) = 1] = (1 − k/m)^d.

At first, let us consider the case of two balls and m ≥ 2 bins. Then the probability p^2_i that i bins are nonempty equals p^2_1 = 1 − (1 − 1/m)^d and p^2_2 = (1 − 1/m)^d (see Fig. 3). Next, consider three balls; notice that p^3_1 = p^2_1 (1 − (1 − 1/m)^d), p^3_2 = p^2_1 (1 − 1/m)^d + p^2_2 (1 − (1 − 2/m)^d), and p^3_3 = p^2_2 (1 − 2/m)^d. Now, in the general case of n balls, the probability p^n_i that i bins are nonempty is given by the recursive equation

p^n_i = 1                                                               if n = 1 ∧ i = 1
p^n_i = p^{n−1}_{i−1} (1 − (i−1)/m)^d + p^{n−1}_i (1 − (1 − i/m)^d)     if n > 1 ∧ 1 ≤ i ≤ min(n, m)
p^n_i = 0                                                               otherwise

where m ≥ d ≥ 2. On the other hand, notice that {X(i)}^n_{i=0} is a Markov chain with transition probability

Pr[X(i + 1) = k + 1 | X(i) = k] = (1 − k/m)^d.

Moreover, we can represent the process (see [3]) as follows:

E[X(n + 1) − X(n)] = (1 − X(n)/m)^d.
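The recursion for p^n_i translates directly into a dynamic program. The following sketch (ours; the function name occupancy_distribution is our own) tabulates the distribution of the number of nonempty bins:

```python
def occupancy_distribution(n: int, m: int, d: int) -> list:
    """p[i] = Pr[i bins are nonempty after n balls], via the recursion above."""
    p = [0.0] * (m + 1)
    p[1] = 1.0  # the first ball always occupies a bin: p^1_1 = 1
    for _ in range(n - 1):
        q = [0.0] * (m + 1)
        for i in range(1, m + 1):
            up = (1 - (i - 1) / m) ** d    # all d choices hit empty bins
            stay = 1 - (1 - i / m) ** d    # some choice hits an occupied bin
            q[i] = p[i - 1] * up + p[i] * stay
        p = q
    return p

p = occupancy_distribution(n=100, m=512, d=10)
print(sum(i * pi for i, pi in enumerate(p)))  # E[X(n)], compare with (6)
```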

[Figure: the state diagram of the process, with states p^n_i arranged in levels n = 1, 2, 3, 4; p^1_1 = 1 at the bottom and p^4_4 = 0 at the top.]
Fig. 3. States of the model with m = 3 bins and n = 1, 2, 3, 4 balls

Now, we can scale the above equation by a factor of m. If t is the time at which exactly n = m·t balls have been thrown and x(t) is the fraction of nonempty bins, then we obtain

E[x(t + 1/m) − x(t)] / (1/m) = (1 − x(t))^d.

The above random process can then be well approximated by the trajectory of the differential equation (see [3])

dx(t)/dt = (1 − x(t))^d    (5)

with initial condition x(0) = 0. Equation (5) can be solved in closed form (separating variables and integrating gives (1 − x)^{1−d} = (d − 1)t + 1), and we obtain

x(t) = 1 − 1/((d − 1)t + 1)^{1/(d−1)}.

Theorem 2. Let X(n) be the number of nonempty bins after n balls have been thrown. Then

E[X(n)/m] ≈ x(n/m) = 1 − 1/((d − 1)(n/m) + 1)^{1/(d−1)}    (6)

Thus, we can derive the estimator n̂ for the number of balls that have been thrown into m bins:

n̂ = ((1 − X/m)^{1−d} − 1)/(d − 1) · m    (7)

We can also evaluate the estimator n̂ as follows:

E[n̂] = Σ_{i=1}^{n} ((1 − i/m)^{1−d} − 1)/(d − 1) · m · p^n_i    (8)

E[n̂²] = Σ_{i=1}^{n} (((1 − i/m)^{1−d} − 1)/(d − 1) · m)² · p^n_i    (9)
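Formulas (8) and (9) can be evaluated numerically from the same dynamic program as above; a self-contained sketch (ours), which skips the negligible all-bins-full state where (7) degenerates:

```python
def estimator_moments(n: int, m: int, d: int):
    """Evaluate E[n_hat] and Var[n_hat] via formulas (8) and (9)."""
    # Tabulate p^n_i with the recursion from Sec. 2.2.
    p = [0.0] * (m + 1)
    p[1] = 1.0
    for _ in range(n - 1):
        q = [0.0] * (m + 1)
        for i in range(1, m + 1):
            q[i] = (p[i - 1] * (1 - (i - 1) / m) ** d
                    + p[i] * (1 - (1 - i / m) ** d))
        p = q
    est = lambda i: m * ((1 - i / m) ** (1 - d) - 1) / (d - 1)
    mean = sum(est(i) * p[i] for i in range(1, m) if p[i] > 0)
    second = sum(est(i) ** 2 * p[i] for i in range(1, m) if p[i] > 0)
    return mean, second - mean ** 2

mean, var = estimator_moments(n=1000, m=512, d=10)
print(mean / 1000, (var ** 0.5) / 1000)  # compare with Fig. 4
```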

Theorem 3. Let m be the number of bins and let d ≥ 2 be the number of choices in Algorithm 2. Then the expected number of balls after which all m − d + 1 possible bins are occupied is at least m^d/(d − 1)^d and no more than π²·m^d/6.

Proof. The probability q_k that a ball occupies an empty bin when k bins are already occupied is

q_k = (1 − k/m)^d.

Thus the expected number of rounds after which all m − d + 1 bins are occupied is

Σ_{k=0}^{m−d+1} 1/q_k = Σ_{k=0}^{m−d+1} m^d/(m − k)^d = m^d Σ_{k=d−1}^{m} 1/k^d.

Since Σ_{k=d−1}^{m} 1/k^d ≤ Σ_{k=1}^{∞} 1/k^d = ζ(d) ≤ ζ(2) = π²/6, the upper bound follows. On the other hand, Σ_{k=d−1}^{m} 1/k^d ≥ 1/(d − 1)^d, which gives the lower bound. ⊓⊔
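A quick numerical check (our own illustration) that the exact expectation m^d Σ_{k=d−1}^{m} 1/k^d from the proof indeed lies between the two bounds of Theorem 3:

```python
import math

m, d = 512, 10
exact = m ** d * sum(1 / k ** d for k in range(d - 1, m + 1))
lower = m ** d / (d - 1) ** d           # first term of the sum alone
upper = math.pi ** 2 / 6 * m ** d       # via zeta(d) <= zeta(2)
print(lower <= exact <= upper)          # True
```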

3 A Proposal for a New Estimator

Assume that i among m bins are already non-empty. Then the probability that the number of non-empty bins will increase after throwing a ball is p_i = (1 − i/m)^d. So, the expected number of balls we need to throw to fill some empty bin is

E[n̂_i] = 1/p_i.

Assume that when the process ends, X among m bins are non-empty. Then we may estimate the total number of balls by n̂ = n̂_0 + ... + n̂_X. Then we have

E[n̂ | X = k] = Σ_{i=0}^{k} E[n̂_i] = Σ_{i=0}^{k} 1/p_i = m^d Σ_{i=m−k}^{m} 1/i^d

and (by independence of the random variables n̂_i)

Var[n̂ | X = k] = Σ_{i=0}^{k} Var[n̂_i] = Σ_{i=0}^{k} (1 − p_i)/p_i² = m^{2d} Σ_{i=m−k}^{m} 1/i^{2d} − m^d Σ_{i=m−k}^{m} 1/i^d.

However, to obtain E[n̂] and Var[n̂] we need the values p^n_k, for which we have the following formula (in the graph in Figure 3 we have to go k steps up and n − k steps right):

p^n_k = Σ_{l_1+...+l_k = n−k} Π_{i=1}^{k−1} (1 − i/m)^d · Π_{i=1}^{k} (1 − (1 − i/m)^d)^{l_i},

where l_i ∈ {0, 1, ..., n − k}.
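The proposed estimator and its conditional variance are straightforward to compute once X = k is observed; a sketch (ours, with our own function names):

```python
def new_estimator(k: int, m: int, d: int) -> float:
    """n_hat = sum_{i=0}^{k} 1/p_i with p_i = (1 - i/m)^d; requires k < m."""
    return sum(1 / (1 - i / m) ** d for i in range(k + 1))

def new_estimator_var(k: int, m: int, d: int) -> float:
    """Var[n_hat | X = k] = sum_{i=0}^{k} (1 - p_i)/p_i^2; requires k < m."""
    ps = [(1 - i / m) ** d for i in range(k + 1)]
    return sum((1 - p) / p ** 2 for p in ps)

m, d = 512, 10
print(new_estimator(142, m, d))            # about 1000 when X = 142 bins are nonempty
print(new_estimator_var(142, m, d) ** 0.5) # standard deviation of the estimate
```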

[Figure: two panels, E[n̂]/n and std[n̂]/n versus n.]
Fig. 4. Simulation results for n = 1, ..., 2000, m = 512 and d = 10

[Figure: two panels, E[n̂]/n and std[n̂]/n versus n.]
Fig. 5. Simulation results for n = 1, ..., 2000, m = 1024 and d = 10

4 Solution Based on Order Statistics

4.1 Order statistics

Let X_1, ..., X_n be independent random variables with the uniform density on the interval [0, 1). The order statistics X_{1:n}, ..., X_{n:n} are the random variables obtained from X_1, ..., X_n by sorting each of their realizations in increasing order. The probability density f_{k:n}(x) of the variable X_{k:n} equals

f_{k:n}(x) = x^{k−1} (1 − x)^{n−k} / B(k, n − k + 1)    (10)

(see e.g. [4]), where B(a, b) = Γ(a)Γ(b)/Γ(a + b). It is well known that E[X_{k:n}] = k/(n + 1). Let

Z_{k,n} = (k − 1)/X_{k:n}.

Directly from the definition of the density f_{k:n} we deduce that for k ≥ 2 we have

E[Z_{k,n}] = ∫₀¹ ((k − 1)/x) · f_{k:n}(x) dx = n,

[Figure: four panels, E[n̂]/n, E[M] (×10⁴), std[n̂]/n and std[M] versus n.]
Fig. 6. Simulation results for grid n = k × k for k = 1, ..., 24, m = 512 and d = 5

and for k ≥ 3 we have

var[Z_{k,n}] = n(n − k + 1)/(k − 2).

Therefore, we have proved the following theorem.

Theorem 4. Let 3 ≤ k < n. Then the random variable Z_{k,n} is an unbiased estimator of the number n and

σ(Z_{k,n})/n = (1/√(k − 2)) · √(1 − (k − 1)/n) = 1/√(k − 2) + O(1/n)    (11)

For the estimator Z* from [5] (see also Sec. ??) we have E[Z*] = n(1 + o(1)) and σ(Z*)/n ∼ 1/√(k − 2). Hence the estimator Z_{k,n} has better statistical properties (it is unbiased) than the estimator Z*.

Equation (11) together with Chebyshev's inequality gives some information about the precision of the estimator Z_{k,n}. However, this approach is not precise. We will prove better estimates by reducing properties of order statistics to the Bernoulli distribution and using the classical Chernoff inequalities. Let B_{p,n} denote a random variable with binomial distribution with parameters p and n, i.e. Pr[B_{p,n} = k] = (n choose k) p^k (1 − p)^{n−k}.

Lemma 1. Suppose that k ≤ n and α ∈ (0, 1).
1. If αn < k then Pr[X_{k:n} ≤ α] ≤ exp(−(1/3) · (k − αn)²/(αn)).
2. If αn > k then Pr[X_{k:n} ≥ α] ≤ exp(−(1/2) · (k − αn)²/(αn)).
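To illustrate Theorem 4, here is a small sketch of ours that abstracts away the network and simulates only the order statistic itself: each of n stations draws a uniform value, and the estimator Z_{k,n} = (k − 1)/X_{k:n} is formed from the k-th smallest value.

```python
import random
import statistics

def z_estimator(n: int, k: int) -> float:
    """Z_{k,n} = (k - 1) / X_{k:n}, where X_{k:n} is the k-th smallest of n uniforms."""
    xs = sorted(random.random() for _ in range(n))
    return (k - 1) / xs[k - 1]   # xs[k-1] is the k-th order statistic

n, k, trials = 10000, 400, 500
samples = [z_estimator(n, k) for _ in range(trials)]
print(statistics.mean(samples) / n)   # close to 1: Z_{k,n} is unbiased
print(statistics.stdev(samples) / n)  # close to (1/sqrt(k-2)) * sqrt(1-(k-1)/n) ~ 0.049
```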

[Figure: four panels, E[n̂]/n, E[M] (×10⁴), std[n̂]/n and std[M] versus n.]
Fig. 7. Simulation results for grid n = k × k for k = 1, ..., 24, m = 512 and d = 10

Proof. We shall use in the proof the following well-known form of the Chernoff bounds for the binomial distribution:

Pr[B_{p,n} ≥ (1 + δ)np] < exp(−(1/3) · npδ²)    (12)
Pr[B_{p,n} ≤ (1 − δ)np] < exp(−(1/2) · npδ²)    (13)

(see e.g. [6]). Let X_1, ..., X_n be a sequence of independent random variables uniformly distributed in the interval (0, 1). Let Y_i = 1 if X_i ≤ α and Y_i = 0 otherwise. The random variable B_{α,n} = Y_1 + ... + Y_n has binomial distribution with parameters n and α. Observe that the statement X_{k:n} ≤ α means that |{i : X_i ≤ α}| ≥ k, i.e. that B_{α,n} ≥ k. Notice that k = αn(1 + (k − αn)/(αn)). Hence if αn < k, then from inequality (12) we get

Pr[X_{k:n} ≤ α] < exp(−(1/3) · nα · ((k − αn)/(αn))²) = exp(−(1/3) · (k − αn)²/(αn)).

Suppose now that αn > k. Observe that X_{k:n} ≥ α is equivalent to B_{α,n} ≤ k. So we may use inequality (13) and in a similar way we get the result. ⊓⊔

Theorem 5. Suppose that η > 0, 0 < ε < 1 and 3 ≤ k ≤ n. Then

Pr[ (n/(1 + η)) · ((k − 1)/k) < (k − 1)/X_{k:n} < (n/(1 − ε)) · ((k − 1)/k) ] > 1 − (e^{−kη²/(2(1+η))} + e^{−kε²/(3(1−ε))})    (14)

Proof. Let 0 < ε < 1. From the first part of the last lemma we obtain

Pr[X_{k:n} ≤ (1 − ε) · (k/n)] ≤ exp(−(1/3) · kε²/(1 − ε)).

Observe next that X_{k:n} ≤ (1 − ε) · (k/n) if and only if (n/(1 − ε)) · ((k − 1)/k) ≤ (k − 1)/X_{k:n}. Hence

Pr[ (n/(1 − ε)) · ((k − 1)/k) ≤ (k − 1)/X_{k:n} ] ≤ exp(−(1/3) · kε²/(1 − ε)).

In a similar way, using the second part of the lemma, we show that

Pr[ (k − 1)/X_{k:n} ≤ (n/(1 + η)) · ((k − 1)/k) ] ≤ exp(−(1/2) · kη²/(1 + η)). ⊓⊔

After putting η = (1/20)(1 + √41), ε = (1/40)(−3 + √249) and k = 400 into the last formula from the last theorem we get the following bound:

Corollary 1. Suppose that n ≥ 400. Then Pr[0.728n