Statistics, August 2004, Vol. 38(4), pp. 357–370
SEQUENTIAL CONFIDENCE INTERVALS FOR THE NUMBER OF CELLS IN A MULTINOMIAL POPULATION Z. GOVINDARAJULUa and HOKWON A. CHOb,∗ a
b
Department of Statistics, University of Kentucky, Lexington, KY 40506, USA; Department of Mathematical Sciences, University of Nevada, Las Vegas, NV 89154-4020, USA (Received 8 November 2002; Revised 1 October 2003; In final form 25 February 2004)
A two-sided sequential confidence interval is suggested for the number of equally probable cells in a given multinomial population with prescribed width and confidence coefficient. We establish large-sample properties of the fixed-width confidence interval procedure using a normal approximation, and some comparisons are made. In addition, a simulation study is carried out in order to investigate the finite sample behaviour of the suggested sequential interval estimation procedure. Keywords: Two-sided sequential confidence intervals; Number of cells; Multinomial population; Normal approximation
1
INTRODUCTION
Suppose we have a multinomial population with equally probable but an unknown number of distinct cells (or categories) k(0 < k < ∞). Based on a random sample of size n drawn from the population, we wish to construct a two-sided sequential confidence interval for the true number of cells. Several practical examples fitting this model have been given in the literature. We mention a few in the following: (a) A problem of some archaeological interest is determining how many days there were in the calendar of some ancient civilization. By noting the days marked on gravestones and treating these days as random samples from the total population of k days in the annual calendar, we might estimate k by visiting more graves and making note of days which occur more than once (see Goodman, 1953, p. 57). (b) A company has offered to give away free samples of its product. A large number of requests, say N(N = 100,000) come in and these requests have been filled. It is known that the same people often sent more than one request. From a sample of the requests, say n, we wish to estimate the distinct number of people k, who have sent in the requests. It is reasonable to assume that a request made by a certain individual is 1/k (see Mosteller, 1949, p. 12). ∗
Corresponding author. E-mail:
[email protected]
c 2004 Taylor & Francis Ltd ISSN 0233-1888 print; ISSN 1029-4910 online DOI: 10.1080/0233188042000216154
358
Z. GOVINDARAJULU AND H. A. CHO
(c) We wish to estimate the total number of dies used to manufacture coins based on a sample (a coin hoard) of size n, which has been classified according to the die used to produce each coin. Suppose we assume that each die has been used to strike approximately the same number of coins. We can view this as drawing a random sample of size n from a population of k (the number of dies) equally likely classes (see Arnold and Beaver, 1988, p. 413). In fact, estimating the number of cells in a given multinomial population can be thought of as the classical occupancy problem in which we draw a sample of size n with replacement from a population of k distinct objects. Therefore, the basic problem turns out to be equivalent to the problem of distributing n balls into k distinct bins. Confidence interval estimation of k has been considered by Ivechenko and Timonina (1983), Arnold and Beaver (1988), and Finkelstein et al. (1998) based on a fixed sample size. Here, we will consider fixed-width sequential confidence interval estimation. 1.1
Formulation
Let X1 , X2 , . . . , be a sequence of independent observations from a multinomial population with unknown k equally probable distinct cells. With a sample (X1 , X2 , . . . , Xn ) of size n, we want to construct a confidence interval for the true unknown number of cells, k(0) and α. 1.3
Probability Distribution of k − Kn using Stirling Numbers
The distribution of the number of unseen cells, k − Kn , using Stirling numbers of the second kind was given by Jordan (1950). By the generalized Bernoulli theorem of repeated trials, k P (k − Kn = j ) = k −n j s
1 +s2 +···+sk−j
n! , s !s ! · · · sk−j ! 1 2 =n
where the si s denote the ith cell frequencies. Snm , the Stirling numbers of the second kind are defined by Snm =
n! m! s !s !···s 1 2
m
1 , s !s ! · · · sm ! ! 1 2
where the summation is all over si such that s1 + s2 + · · · sm = n.
0 ≤ j ≤ k − 1,
360
Z. GOVINDARAJULU AND H. A. CHO
Then, we have k k −n (k − j )!Snk−j P (k − Kn = j ) = j k! −n k−j k Sn . = j! For large n, Jordan (1950, p. 173) has shown that Snm ∼
mn , m!
which is valid for m quite small relative to n.
2 APPROXIMATIONS TO THE DISTRIBUTION OF THE NUMBER OF UNSEEN CELLS 2.1 Asymptotic Normality of the Distribution of Un = k − Kn Weiss (1958) established the asymptotic normality of the number of unseen cells, Un = k − Kn , by showing that all the central moments of Un converge to the moments of the standard normal distribution. In particular, he has shown that if n and k are sufficiently large (independently of one another) such that n/k → a(a > 0), then Un is asymptotically normal with mean 1 n (3) µn = k 1 − k and variance σn2
1 2n 1 n 2 n 2 =k 1− −k 1− + k(k − 1) 1 − . k k k
(4)
Note that µn and σn2 are the exact mean and variance of Un . See for instance Basu (1930) or Govindarajulu (1999, p. 44). Let 1 . 1 = −1 − α1 = k log 1 − k 2k and
2 . 1 = −2 − . α2 = k log 1 − k k
Then, using the approximate values for α1 and α2 , Eqs. (3) and (4) can be written as 2 . µn = E(Un ) = ke−α1 n/k = ke−n/k e−n/2k n
. = ke−n/k 1 − 2 2k y
= ky + ln y, 2
where y = e−n/k .
(5)
ESTIMATION OF NUMBER OF CELLS
361
Similarly, we obtain σn2 = var(Un ) = keα1 n/k [1 − keα1 n/k + (k − 1)e(α2 −α1 )n/k ] 2 2 . = ky(e−n/2k − ye−n/k ) n n . 2 = ky 1 − − ky 1 − 2k 2 k2 n y
ln y − y 2 ln y. = ky(1 − y) + 2 2.2
(6)
Normal Approximation to Un = k − Kn
Recall that the goal is to determine n such that P {|kˆn − k| ≤ d} ≥ 1 − α, for specified d(>0) and α. First, we establish the asymptotic distribution of kˆn − k, where kˆn − Kn (1 + e−n/Kn ), which is approximately the best unbiased estimator for k. Letting f (x) = x(1 + e−n/x ), we have f (k) = 1 + e−n/k +
n
e−n/k k = 1 + y − y log y > 0,
where y = e−n/k . Then, we are led to the following result. RESULT 2.1 Let f (x) = x(1 + e−n/x ) for x > 0 and let Kn denote the number of distinct cells that have been observed in a random sample of size n from a multinomial distribution having k equally probable cells. Let kˆn = f (Kn ) and Un = k − Kn . Then d k − kˆn N (B, [f (k)σn ]2 ),
where B = µn f (k) − ke−n/k . Proof
(7)
One can write k = kˆn = k − f (k) + [f (k) − f (Kn )] . = −ke−n/k + Un f (k) = (Un − µn )f (k) + µn f (k) − ke−n/k .
It follows from Eq. (8) that k − kˆn Un − µn B = + . f (k)σn σn f (k)σn
(8)
362
Z. GOVINDARAJULU AND H. A. CHO
That is, k − kˆn − B Un − µn = . f (k)σn σn Hence, we have
P
k − kˆn − B ≤t f (k)σn
=P
Un − µn . ≤ t = (t) σn
after using the result of Weiss (1958) pertaining to the asymptotic normality of Un when suitably standardized, and denotes the standard normal distribution function. Thus, from Result 2.1 we have d kˆn N (k − B, [f (k)σn ]2 ).
Now, it follows from Eq. (7) Bn = µn f (k) − ke−n/k y
= ky + ln y (1 + y − ln y) − ky 2 y
2 = ky (1 − ln y) + ln y (1 + y − ln y) 2 y
. 2 ln y = ky (1 − ln y) + 2 . = ky 2 (1 − ln y). Note that the bias in kˆn is −B, which is small when y(= e−n/k ) is small. Further, assuming that y is small, the variance of kˆn is y
[f (k)σn ]2 = ky(1 − y) + ln y − y 2 ln y (1 + y − y ln y)2 2 y
. = (1 + y − y ln y)2 ky(1 − y) + ln y 2 ln y . = ky(1 + 2y − 2y ln y) 1 − y + 2k 2 ln y (ln y) = ky 1 + y + −y 2k k after ignoring inferior terms. Using Result (2.1), we can write P kˆn − k ≤ d = P −d ≤ kˆn − k ≤ d = P −d + B ≤ (kˆn − k) + B ≤ d + B
−d + B (kˆn − k) + B d +B =P ≤ ≤ f (k)σn f (k)σn f (k)σn
(9)
ESTIMATION OF NUMBER OF CELLS
d +B −d + B − f (k)σn f (k)σn d −B ≥ 2 −1 f (k)σn
. =
≥ 1 − α, which implies
d −B α
−1 ≥ z = 1 − , α/2 f (k)σn 2
where zα/2 is the upper (α/2)th quantile of . Consider the inequality (where we write z = zα/2 ) d − B > zf (k)σn , which implies (d − B)2 > z2 [f (k)σn ]2
(since d > B).
Thus ln y ln y . d 2 − 2dB > z2 ky 1 + y + ⇔ d 2 > z2 ky 1 + y + + 2dky 2 (1 − ln y). 2k 2k First, solve the inequality d 2 > z2 ky, i.e., y< Let
d2 ≡ y0 (say). z2 k
ln y g(y) = d 2 − z2 ky + z2 ky y + − 2dky 2 (1 − ln y). k
Then, the Newton–Raphson method gives the next approximation y1 as y1 = y0 − where
g(y0 ) = z ky0 2
and
ln y0 y0 + − 2dky02 (1 − ln y0 ) k
ln y 1 g (y) = −z k + z k 2y + + − 2dky(1 − 2 ln y). k k
In addition,
g(y0 ) , g (y0 )
2
2
2 d d . ln y0 = ln 2 − ln k = 2 ln − ln k z z
363
364
Z. GOVINDARAJULU AND H. A. CHO
and
y0 ln y0 =
d2 z2 k
2 d ln 2 − ln k . z
Hence
2 2 d2 d d 2 ln(d/z) ln k 1 − 2 ln − − 2dk 2 + ln k + g(y0 ) = d z2 k k k z k z d2 d2 d 2d 5 d = + 2 ln − ln k − 2 1 − 2 ln + ln k k z2 z z k z d2 d4 2d 2 2d 3 2d 3 d = 2 (1 − 2d) + ln 1 + 2 − (ln k) 1 + 2 z k k z z k z 2
and d 2d 2 1 + 2 ln(d/z) − ln k d2 1 − 4 ln + − 2d + 2 ln k z2 k z2 k z 3 d d 2d − ln k − 2 1 − 4 ln + 2 ln k = 2d 2 + z2 k 2 ln z z z d d 4d 3 . 2 2 = z k 2 ln − ln k + 2d + 2 2 ln − ln k z z z d 4d 3 . 2 2 ln = z k+ 2 − ln k + 2d 2 z z d . = z2 k 2 ln − ln k + 2d 2 . z
g (y0 ) = −z2 k + z2 k
Thus
2 2d 2 d d ln − ln k k z k 2 d d = 2 ln − ln k . k z
. g(y0 ) =
Hence y1 = y0 − =
g(y0 ) g (y1 )
d2 (d 2 /k)[2 ln(d/z) − ln k] − 2 2 z k z k[2 ln(d/z) − ln k] + 2d 2
d2 . d2 = 2 − 2 2 z k z k 2 d = (k − 1). zk
ESTIMATION OF NUMBER OF CELLS
Since y1 = e−n/k , e−n/k ≤
d zk
365
2 (k − 1).
(10)
Taking logarithms on both sides of Eq. (10), we have n ≥ k[2 ln
zk d
− ln(k − 1)].
(11)
Hence, we take the optimal fixed-sample size n∗ to be the smallest positive integer satisfying Eq. (11). Since the true number of cells k is unknown, no fixed-sample size procedure is available. Therefore the adaptive sequential rule is; stop at N , where
zkˆn ˆ ˆ N = inf n ≥ n0 : n ≥ kn 2 ln , − ln kn − 1 d
(12)
where n0 (n0 ≥ 2) is the initial sample size, kˆn is given by Eq. (2). Example 2.1 (a) A computer algorithm was used to generate multinomial data for k = 9. We wish to construct a 95% confidence interval for the true number of cells, k with d = 1.0. The observations are 7, 1, 4, 9, 7, 5, 6, 1, 9, 5, 7, 7, 1, 6, 1, 7, 2, 9, 3, 8, 4, 1, 3, 6, 3, 9, 6, 9, 9, 4, 1, 4, 6, 8, where the first number 7, for instance, indicates that it belonged to the seventh cell. The proposed sequential procedure (starts with n0 = 3) stops at n = 34 yielding Kn = 9, the estimated number of cells, kˆn = 9.1842; so the 95% confidence interval for k is given by (8.1842, 10.1842). (b) We also generated data for k = 20, and suppose we wish to set up a 90% confidence interval with d = 2.0. Our sequential procedure stops at n = 50 yielding Kn = 18 and kˆ = 19.1192. Therefore, the resultant 90% confidence interval is (17.1192, 21.1192).
3 ASYMPTOTICS FOR THE PROCEDURE In this section, we investigate the asymptotic behaviour of the proposed sequential procedure.
3.1
Finite Sure Termination
We now show the fundamental property that the proposed stopping rule terminates finitely almost surely.
366
Z. GOVINDARAJULU AND H. A. CHO
THEOREM 3.1 Let N denote the stopping time associated with the sequential procedure, then limn→∞ P (N = ∞) = 0. Proof For the stopping rule given by Eq. (12) we have, after using Eq. (2), (1 + e−(n+1)/Kn ) P (N > n) ≤ P n ≤ Kn (1 + e−(n+1)/Kn ) 2 ln zKn d − ln(Kn (1 + e−(n+1)/Kn ) − 1) .
(13)
Taking limit on both sides in Eq. (13), we obtain (1 + e−(n+1)/Kn ) lim P (N > n) ≤ lim P n ≤ Kn (1 + e−(n+1)/Kn ) 2 ln zKn n−→∞ n−→∞ d − ln Kn (1 + e−(n+1)/Kn ) − 1 = 0, due to the fact that the sequence of estimators Kn is bounded above by k (k < ∞) as n → ∞. This establishes the finite sure termination of the proposed sequential procedure. 3.2
First-order Asymptotics
The stopping rule given in Eq. (12) can be rewritten as
n kˆn zkˆn ˆ 2 ln − ln(kn − 1) . N = inf n ≥ n0 : ≥ k k d
(14)
Now, rule (14) is of the form f (n) , N = N (t) = min n ≥ n0 : Yn ≤ t where f (n) = n, zk t = k 2 ln − ln(k − 1) , d and Yn =
kˆn [2 ln(zkˆn /d) − ln(kˆn − 1)] . k[2 ln(zk/d) − ln(k − 1)]
Then clearly, Yn is a sequence of random variables such that Yn > 0, limd→0 Yn = 1 almost surely (a.s.) since kˆn /k → 1 as n → ∞. In addition, limn→∞ f (n) = ∞, limn→∞ f (n)/f (n − 1) = 1. Since the stopping rule N is well defined and non-decreasing as a function of t, by applying the results of Chow and Robbins (1965) we obtain the following first-order asymptotics for the proposed sequential procedure.
ESTIMATION OF NUMBER OF CELLS
367
RESULT 3.1 (i) limd→0 N = ∞ a.s., limd→0 E(N) = ∞, (ii) limd→0 N/n∗ = 1 a.s., limd→0 E(N)/n∗ = 1, (iii) limd→0 P (kˆN − d ≤ k ≤ kˆN + d) = 1 − α, where n∗ is the optimal fixed-sample size given by Eq. (11). Proof (i) is easily verified. By the definition of the stopping rule in Eq. (12) and N (d) increases a.s. in d, we obtain N = N (d) → ∞ as d → 0. Then by the monotone convergence theorem, we get E(N ) → ∞ as d → 0. (ii) follows from the facts that for n∗ given by Eq. (11), limd→0 N/n∗ = 1 a.s., and limd→0 E(N )/n∗ = 1, since kˆn = Kn (1 + e−(n+1)/Kn ), supn kˆn ≤ k(1 + e−1/Kn ) < 2k, and E[supn kˆn ] ≤ E[Kn (1 + e−1/Kn )] ≤ 2k < ∞. (iii) follows from Anscombe’s (1952) theorem, provided we can verify his condition on uniform continuity in probability of kˆn . Towards this consider, with Yn = k − kˆn and Un = k − Kn , P {Yn − Yn ≤ for all n < n ≤ (1 + c)n} = P {Yn = Yn for all n < n ≤ (1 + c)n}, for sufficiently small ε = P {f (Kn ) = f (Kn ) for all n < n ≤ (1 + c)n} = P {Kn = Kn for all n < n ≤ (1 + c)n} = P {no new cells are discovered in trials n = n + 1, (1 + c)n} = P {Un = Un for all n < n ≤ (1 + c)n} =
k−1
P {Kn = j for all n < n ≤ (1 + c)n|Un = j }P (Un = j )
j =0
=
k−1 k − j [cn] k
j =0
≥
k−1 j =0
=
k−1
j 1− k
exp
j =0
P (Un = j )
cn P (Un = j )
−j cn P (Un = j ), for sufficiently large k k
−cn = E exp Un k −cn ≥ exp E(Un ) , by Jensen’s inequality k
= exp[−cn{e−n/k + o(1)}]
368
Z. GOVINDARAJULU AND H. A. CHO
≥ 1 − cn exp
−n k
= 1 − η, where η is small when c is small, and n = k 1+δ for some 0 < δ < 1. An analogous inequality holds for (1 − c)n ≤ n < n. Thus Anscombe’s condition holds.
4
RELATION BETWEEN EQUALLY PROBABLE MULTINOMIAL CELL MODEL AND THE EFRON AND THISTED MODEL
Efron and Thisted (1976) considered the problem of estimating the number of words in Shakespeare’s vocabulary. They realized that this was equivalent to estimating the number of unseen species. They assumed that members of each species enter a trap according to a Poisson process, the process for species s having expectation λs per unit of time. Assuming that (λ1 , λ2 , . . . , λk ) is a random sample from an unknown distribution G(λ), they obtained an empirical Bayes estimate of the number of unseen species or, equivalently, the number of words Shakespeare knew. In the following, we will show that the model used by Efron and Thisted (1976) is equivalent to an equally probable number of cells model when the number of cells k is large. It is well known that if Y1 , Y2 , . . . , Yn are independent Poisson random variables with means λ1 , λ2 , . . . , λk , then the conditional of Y1 , Y2 , . . . , Yn for given Y1 + Y2 + · · · + Yn distribution is multinomial distribution, M n, pi = λi / ki=1 λi , i = 1, 2, . . . , k, and conversely. Further, if (λ1 , λ2 , . . . , λk ) is a random sample from G(λ), having mean µ, variance σ 2 , and a finite third moment, then E
λi
k
j =1 λj
µ + (λi − µ) =E kµ + jk=1 (λj − µ) −1 k 1 λ − µ (λi − µ) i 1 + + =E k kµ kµ j =1 1 λi − µ = E 1+ 1+ k µ 1 λi − µ = E 1+ 1− k µ
−1 k 1 (λi − µ) k j =1 µ
k (λj − µ) 1 + o(k −1 ) k j =1 µ
1 1 (λi − µ)2 1 σ2 −1 E 1− E + o(k ) + k k µ k µ2 2 2 1 1σ σ = + + o(k −1 ) 1− 2 k kµ kµ2 1 = − O(k −2 ). k =
ESTIMATION OF NUMBER OF CELLS
5 5.1
369
SIMULATION STUDIES Monte Carlo Experimentation
Monte Carlo experimentation is carried out in order to illustrate the behaviour and the performance of the stopping rule in the proposed sequential procedure. The results of the Monte ˆ Carlo simulation are summarized in the following tables, which show the average estimate k, specified size of the error d, the average stopping time E(N ), and the average observed coverage probability (CP) in the experiments. Each row in the table corresponds to 5000 independent experiments with the initial sample size n0 = 3 used. From Tables I–IV, we infer that the estimates kˆ approach k and achieve the required level of confidence (1 − α) as d decreases. It is also noticeable that the expected stopping time, E(N ) increases consistently as the error size d decreases.
TABLE I
k = 5. 1−α
90% d 0.1 0.2 0.5 1.0
95%
99%
kˆ
E(N )
CP
kˆ
E(N )
CP
kˆ
E(N )
CP
4.99 4.98 4.89 4.60
37.96 30.85 21.21 13.12
0.99 0.98 0.88 0.84
4.99 4.98 4.92 4.67
38.96 32.86 23.34 15.19
0.99 0.98 0.91 0.89
5.00 4.99 4.96 4.84
41.97 34.94 25.65 18.89
0.99 0.99 0.95 0.94
TABLE II
k = 10. 1−α
90% d 0.1 0.2 0.5 1.0
95%
99%
kˆ
E(N )
CP
kˆ
E(N )
CP
kˆ
E(N)
CP
9.99 9.98 9.94 9.65
80.93 66.78 48.35 33.25
0.99 0.97 0.90 0.88
9.99 9.99 9.96 9.80
83.96 69.86 51.50 37.18
0.99 0.98 0.92 0.92
10.00 9.99 9.97 9.91
89.97 75.93 57.64 43.03
0.99 0.99 0.95 0.96
TABLE III
k = 15. 1−α
90% d 0.1 0.2 0.5 1.0
95%
99%
kˆ
E(N )
CP
kˆ
E(N)
CP
kˆ
E(N)
CP
15.00 14.99 14.96 14.84
125.98 104.87 77.32 56.13
0.99 0.98 0.90 0.91
15.00 14.99 14.98 14.88
130.97 110.87 82.55 61.35
0.99 0.98 0.93 0.94
15.00 15.00 14.98 14.95
139.98 118.92 91.69 70.14
0.99 0.99 0.96 0.97
370
Z. GOVINDARAJULU AND H. A. CHO TABLE IV
k = 20. 1−α
90% d 0.1 0.2 1.5 1.0
95%
99%
kˆ
E(N )
CP
kˆ
E(N)
CP
kˆ
E(N)
CP
19.99 19.99 19.97 19.86
172.97 145.84 109.25 80.82
0.99 0.98 0.90 0.91
20.00 19.99 19.98 19.94
179.98 152.91 115.59 88.38
0.99 0.99 0.93 0.94
20.00 20.00 19.99 19.95
190.98 163.95 126.70 98.95
0.99 0.99 0.96 0.97
Thus, the numerical results indicate the small sample behaviour and lend support to the asymptotic behaviour of the proposed sequential procedure as the size of error, d goes to zero.
Acknowledgements We thank the referee and Professor Olaf Bunke, Editor, for their helpful comments.
References Anscombe, F. (1952). Large sample theory of sequential estimation. Proc. Cambridge Philos. Soc., 48, 600–607. Arnold, B. and Beaver, R. (1988). Estimation of the number of classes in a population. Biometrical Journal, 30, 413–424. Basu, D. (1950). On sampling with and without replacement. Sankhy¯a, 20, 287–294. Chow, Y. and Robbins, H. (1965). On the asymptotic theory of fixed width sequential confidence intervals for the mean. The Annals of Mathematical Statistics, 36, 457–462. Efron, B. and Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63, 435–447. Feller, W. (1968). An Introduction to Probability Theory and its Applications, 3rd ed., Vol. 1, John Wiley, New York. Finkelstein, M., Tucker, H. and Veeh, J. (1998). Confidence intervals for the number of unseen types. Statistics and Probability Letters, 37, 423–430. Goodman, L. (1953). Sequential sampling tagging for population size problems. The Annals of Mathematical Statistics, 24, 56–69. Govindarajulu, Z. (1999). The Elements of Sampling Theory and Methods. Prentice-Hall, Inc., New Jersey. Harris, B. (1968). Statistical inference in the classical occupancy problem unbiased estimation of the number of classes. Journal of the American Statistical Association, 63, 837–847. Ivechenko, G. and Timonina, E. (1983). Estimating the size of a finite population. Theory of Probability and Its Applications, 27, 403–406. Jordan, C. (1950). Calculus of Finite Differences, 2nd Ed., Chelsea Publishing Co., New York. Mosteller, F. (1949). Questions and answers–number of different kinds of elements in a population. The American Statistician, 3, 12. Weiss, I. (1958). Limiting distributions in some occupancy problems. The Annals of Mathematical Statistics, 29, 878–884.