Some results on the Birthday problem in a Bayesian perspective∗† Sandra Fortini‡ and Annalisa Cerquetti Istituto di Metodi Quantitativi, Università Bocconi, Milano, Italy
Abstract We consider the following version of the Birthday problem: What is the probability that two persons, that know each other, have the same birthday? Approximations are known for the probability distribution of the number of matches, under the hypothesis that the birthdays are independent. Here we extend the results to the Bayesian formulation of the problem, assuming that the probability of being born on day i is not known and a general prior distribution is assigned to it. Keywords Birthday problem, Chen-Stein methods, Exchangeability, Poisson approximations, Random colored graphs.
1 Introduction
One of the most renowned probability problems is the Birthday problem (Von Mises [1939]): if $n$ balls are randomly dropped into $k$ boxes, what is the chance of a match, that is, that two or more balls fall in the same box? The classical answer is given under the assumption that the balls are dropped independently and uniformly into the boxes. If $k = 365$, the answer gives the probability of two or more coincident birthdays in a group of $n$ individuals. In a recent paper, Diaconis and Holmes [2002] revisit the Birthday problem from a new perspective: “Consider a practical demonstration of the Birthday problem on the first day of a class. Are the birthdays of the students reasonably considered as balls dropped uniformly into 365 categories? After all there are well known weekend/weekdays effects, and perhaps seasonal and lunar trends for birth rates. On reflection, the instructor may conclude that the chance $q_i$ of a student being born on day $i$ is not uniform and in fact is not known.”

∗ Research partially supported by MIUR 2002, Research program: Bayesian nonparametric methods and their applications.
† AMS (2000) subject classification. Primary: 60F05. Secondary: 05C80, 62F99.
‡ Corresponding author. Istituto di Metodi Quantitativi, Viale Isonzo, 25, 20135 Milano, Italy. E-mail: [email protected]
Some results on the Birthday problem with non-uniform occurrence probabilities can be found in the literature (see Joag-Dev and Proschan [1992], Mase [1992], Henze [1998], Camarri and Pitman [2000]). The main innovation in Diaconis and Holmes [2002] consists in putting prior probabilities on the $q_i$'s, in a Bayesian perspective. Exact calculations for the chance of a match are presented under the uniform prior and under symmetric Dirichlet priors. Moreover, a Poisson approximation for the law of the number of boxes containing two or more balls is proved under general Dirichlet priors. This last result relies on the Stein-Chen approximation methods, applied to negatively associated random variables (see Chen [1975], Barbour et al. [1992]). In a remark, Diaconis and Holmes [2002] claim that the limiting law of the number of boxes containing two or more balls is a mixture of Poisson distributions, whatever the prior is. This paper deals with the following variation of the classical Birthday problem, first proposed by Persi Diaconis in a personal communication to Svante Janson: What is the probability that two persons who know each other have the same birthday? Janson [1986] shows that this problem can be formulated as a counting problem for colored graphs, and proves that the law of the number $W$ of matches can be approximated by a Poisson law. Here we adopt a Bayesian perspective and assume that a prior distribution is assigned to the $q_i$'s. We prove that, under suitable assumptions, the law of $W$ can be well approximated by a mixture of Poisson distributions. This result has potential applications in Bayesian statistics. In fact, it allows one to decide, on the basis of prior information, whether to assume a Poisson model for the number of matches. The paper is organized as follows. Section 2 contains some asymptotic results for the above mentioned problem in the classical setting of independent birthdays. The main result for the Bayesian version is presented in Section 3. 
An application to the surname problem in Japan, as introduced in Mase [1992], is presented in Section 4, together with some remarks and suggestions for further developments.
2 Random graph representation of the Birthday problem
Let $G$ be a graph with vertex set $V$ and edge set $E$. Let $|G|$ denote the number of vertices in $G$. If $|G| = n$, we can take $V = \{1, \dots, n\}$. Let $N$ denote the number of edges in $G$. The degree $d_i$ of vertex $i$ is defined as the number of edges at vertex $i$. The notation $i \sim j$ means that vertices $i$ and $j$ are connected in $G$. For basic definitions in graph theory see e.g. Diestel [2000]. A random subgraph of $G$ with hidden colors is defined as follows. Let each vertex $i$ receive a random color $X_i$ from a list of $k < \infty$ colors. The random subgraph $\Gamma$ of $G$ with hidden colors $X_1, \dots, X_n$ is the graph with vertex set $V(G)$ whose edges are those edges $ij$ of $G$ for which $X_i = X_j$. Now, let the vertices of the graph $G$ denote people, the colors $X_i$ their birthdays, and the edges in $\Gamma$ pairs of people who are acquainted and have the same birthday. It is usually assumed that the $X_i$'s are independent and identically distributed (i.i.d.), according to some probability distribution. Let $q_j = P(X_i = j)$ and let $W = \sum_{i \sim j} 1(X_i = X_j)$ be the number of edges in $\Gamma$. The following approximation is proved in Barbour et al. [1992].

Theorem 1. With the above definitions and notations,
\[
d_{TV}(\mathcal{L}(W), Po_\lambda) \le (1 - e^{-\lambda}) \left( \sum_{j=1}^{k} q_j^2 + N^{-1} \sum_{i=1}^{n} d_i(d_i - 1) \left( \sum_{j=1}^{k} q_j^2 + \frac{\sum_{j=1}^{k} q_j^3}{\sum_{j=1}^{k} q_j^2} \right) \right),
\]
where $d_{TV}$ is the total variation distance, $\mathcal{L}(W)$ is the law of $W$, $\lambda = N \sum_{j=1}^{k} q_j^2$ and $Po_\lambda$ is the Poisson distribution with parameter $\lambda$.
If $N$ is large and $\sum_i q_i^3 / \sum_i q_i^2$ is small compared to $\bigl(N^{-1} \sum_i d_i(d_i - 1)\bigr)^{-1}$, so that the product of the two quantities is small, then the probability distribution of $W$ is well approximated by a Poisson distribution. The above result can be restated as a limit theorem. Let $G_n$ be a sequence of graphs, with $|G_n| = n$. Let $d_{in}$ be the degree of vertex $i$ in $G_n$, and $N_n$ be the number of edges in $G_n$. Suppose that $N_n \to \infty$, as $n \to \infty$. Let each vertex $i$ receive a color $X_{in}$ from a list of $k_n$ colors and let $q_{jn}$ be the probability of color $j$ $(j = 1, \dots, k_n)$. If $\Gamma_n$ is the sequence of random graphs with hidden colors $X_{1n}, \dots, X_{nn}$, $W_n = \sum_{i \sim j} 1(X_{in} = X_{jn})$ and $m_n = N_n^{-1} \sum_{i=1}^{n} d_{in}(d_{in} - 1)$, then the following result holds.

Corollary 1. If
\[
N_n \sum_{j=1}^{k_n} q_{jn}^2 \to \lambda \tag{1}
\]
and
\[
m_n \frac{\sum_{j=1}^{k_n} q_{jn}^3}{\sum_{j=1}^{k_n} q_{jn}^2} \to 0, \tag{2}
\]
as $n \to \infty$, then the probability distribution of $W_n$ converges to the $Po_\lambda$ distribution.

Example 1. In the classical Birthday problem, $G_n$ is a complete graph with $n$ vertices. Hence $N_n = \binom{n}{2}$ and $d_{in} = n - 1$, for every $i$, so that $m_n = 2(n - 2)$. Moreover $X_{1n}, \dots, X_{nn}$ are i.i.d. with uniform distribution over $\{1, \dots, k_n\}$. Since
\[
N_n \sum_{j=1}^{k_n} q_{jn}^2 = \frac{n^2 - n}{2} \, \frac{1}{k_n}
\]
and
\[
m_n \frac{\sum_{j=1}^{k_n} q_{jn}^3}{\sum_{j=1}^{k_n} q_{jn}^2} = (2n - 4) \, \frac{1}{k_n},
\]
$k_n$ must be of the order of $n^2$ for the law of $W_n$ to converge to a Poisson distribution. This is, in fact, the classical answer to the Birthday problem.
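The quantities in Theorem 1 and Example 1 are easy to evaluate numerically. The following Python sketch is only an illustration: it assumes that the bound of Theorem 1 has the form $(1 - e^{-\lambda})\bigl[\sum_j q_j^2 + m\,(\sum_j q_j^2 + \sum_j q_j^3 / \sum_j q_j^2)\bigr]$ with $m = N^{-1}\sum_i d_i(d_i - 1)$, and the function names `poisson_bound` and `exact_no_match` are hypothetical. It compares the exact probability of no match in the classical problem with the Poisson value $e^{-\lambda}$.

```python
from math import comb, exp

def poisson_bound(degrees, n_edges, q):
    """Return (lambda, bound), where bound is the assumed right-hand side
    of the total variation bound of Theorem 1."""
    s2 = sum(p ** 2 for p in q)                       # sum of q_j^2
    s3 = sum(p ** 3 for p in q)                       # sum of q_j^3
    lam = n_edges * s2                                # lambda = N * sum q_j^2
    m = sum(d * (d - 1) for d in degrees) / n_edges   # N^{-1} sum d_i (d_i - 1)
    return lam, (1 - exp(-lam)) * (s2 + m * (s2 + s3 / s2))

def exact_no_match(n, k):
    """P(W = 0) for n people with k equally likely birthdays."""
    prob = 1.0
    for i in range(n):
        prob *= (k - i) / k
    return prob

# Classical problem: complete graph on n vertices, uniform colors.
n, k = 23, 365
lam, bound = poisson_bound([n - 1] * n, comb(n, 2), [1 / k] * k)
exact = exact_no_match(n, k)
print(f"lambda = {lam:.4f}, bound = {bound:.4f}")
print(f"exact P(W=0) = {exact:.4f}, Poisson P(W=0) = {exp(-lam):.4f}")
```

Since $|P(W = 0) - e^{-\lambda}|$ is at most the total variation distance, the gap between the two printed probabilities should not exceed the printed bound.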
The next example shows that, if (2) does not hold, then the law of $W_n$ may fail to converge to a Poisson distribution, even under (1).

Example 2. Let $G_n$ be a complete graph with $n$ vertices and let $X_{1n}, \dots, X_{nn}$ be i.i.d. random variables, taking values $1, \dots, n^2$ with probabilities $1/n, 1/(n^2 + n), 1/(n^2 + n), \dots, 1/(n^2 + n)$, respectively. Since $\binom{n}{2} \sum_{j=1}^{n^2} q_{jn}^2 \to 1$, (1) holds with $\lambda = 1$. On the other hand, the law of $W_n$ does not converge to the $Po_1$ distribution, as $n \to \infty$. In fact, it can be proved that
\[
P(W_n = 0) \to 2e^{-3/2} \tag{3}
\]
as $n \to \infty$ (see Cerquetti and Fortini [2003], Example 2). Suppose that $W_n$ converges in distribution to a random variable $W$ with $Po_\lambda$ distribution. It follows from (3) that $e^{-\lambda} = 2e^{-3/2}$, that is, $\lambda = \frac{3}{2} - \log 2 < 1$. On the other hand, $E(W_n) = N_n \sum_{j=1}^{n^2} q_{jn}^2 \to 1$ and $E(W_n^2)$ is bounded, so that the $W_n$ are uniformly integrable and $\lambda = E(W) = \lim_n E(W_n) = 1$, which is impossible. Hence the limiting distribution of $W_n$, if it exists, is not a Poisson distribution. Notice that (2) does not hold in this case, since $m_n \sum_{j=1}^{n^2} q_{jn}^3 / \sum_{j=1}^{n^2} q_{jn}^2 \to 1$, as $n \to \infty$.
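The limit in (3) can be checked numerically without simulation. In the setting of Example 2, $W_n = 0$ means that all $n$ colors are distinct, which happens either when nobody receives color 1 or when exactly one person does, with the remaining colors pairwise distinct. The Python sketch below computes this probability exactly; the function name `prob_no_match` is hypothetical and the decomposition is our own elementary argument, not the proof in Cerquetti and Fortini [2003].

```python
from math import exp

def prob_no_match(n):
    """Exact P(W_n = 0) in Example 2: complete graph on n vertices,
    color 1 has probability 1/n, the other n^2 - 1 colors 1/(n^2 + n) each."""
    a = 1.0 / n               # probability of color 1
    b = 1.0 / (n * n + n)     # probability of each remaining color
    k_rest = n * n - 1        # number of remaining colors

    def all_distinct(r):
        # P(r given people avoid color 1 and get pairwise distinct colors)
        prob = 1.0
        for i in range(r):
            prob *= (k_rest - i) * b
        return prob

    # Nobody has color 1, or exactly one of the n people has it.
    return all_distinct(n) + n * a * all_distinct(n - 1)

for n in (10, 100, 1000):
    print(n, prob_no_match(n))
print("limit in (3):", 2 * exp(-1.5), "  Po_1 value:", exp(-1.0))
```

For large $n$ the computed values approach $2e^{-3/2}$, which differs from the value $e^{-1}$ that a $Po_1$ limit would require.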
3 The Bayesian formulation of the Birthday problem
Let us return to the variation of the Birthday problem introduced in Section 1. Now suppose that the probability of being born on day $i$ is not known. In a Bayesian perspective, we treat this probability as a random variable $Q_i$. Hence, the birthdays $X_1, \dots, X_n$ are conditionally i.i.d. given $Q = (Q_1, \dots, Q_k)$, with
\[
P(X_1 = j \mid Q) = Q_j, \qquad j = 1, \dots, k.
\]
By de Finetti's Theorem, this is a little more than saying that the birthdays of individuals $1, \dots, n$ are exchangeable:
\[
P(X_1 = x_1, \dots, X_n = x_n) = P(X_1 = x_{\pi(1)}, \dots, X_n = x_{\pi(n)})
\]
for all $x_i$'s and all permutations $\pi$. In fact, the finite sequence $X_1, \dots, X_n$ is exchangeable and can be extended to an infinite exchangeable sequence. Represent the Birthday problem as a random colored subgraph $\Gamma$ of a graph $G$, as before. Now the colors $X_1, \dots, X_n$ are exchangeable and, due to the lack of independence, Theorem 1 does not hold. Nevertheless, we can expect that the probability distribution of $W$ can be well approximated by a Poisson distribution with unknown, and therefore random, parameter $\Lambda$. In fact, this is proved in Theorem 2, under suitable assumptions. More precisely, it is proved that the asymptotic distribution of the number of edges in $\Gamma$ is a mixture of Poisson laws. We state the result as a limit theorem. Let $G_n$ be a sequence of graphs, and let $m_n$ and $N_n$ be defined as in Section 2, with $N_n \to \infty$, as $n \to \infty$.
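The heuristic above can be explored by simulation. The Python sketch below is a minimal illustration under assumptions of our own: a symmetric Dirichlet prior with parameter `alpha` on $Q$, a complete acquaintance graph, and hypothetical function names `sample_dirichlet` and `simulate`. For each draw of $Q$ it records the random parameter $\Lambda = N \sum_j Q_j^2$, which equals $E(W \mid Q)$, together with one realization of $W$.

```python
import random
from itertools import combinations

def sample_dirichlet(k, alpha, rng):
    """Draw Q from a symmetric Dirichlet(alpha) on k categories
    by normalizing independent Gamma(alpha, 1) variables."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    return [x / total for x in g]

def simulate(edges, n, k, alpha, n_sims, seed=0):
    """Return paired samples of Lambda = N * sum_j Q_j^2 and of W,
    the number of acquainted pairs sharing a color, drawn given Q."""
    rng = random.Random(seed)
    lambdas, matches = [], []
    for _ in range(n_sims):
        q = sample_dirichlet(k, alpha, rng)
        colors = rng.choices(range(k), weights=q, k=n)   # conditionally i.i.d. given Q
        lambdas.append(len(edges) * sum(p * p for p in q))
        matches.append(sum(colors[i] == colors[j] for i, j in edges))
    return lambdas, matches

n, k = 23, 365
edges = list(combinations(range(n), 2))   # assumption: everyone knows everyone
lambdas, matches = simulate(edges, n, k, alpha=1.0, n_sims=2000)
mean_lambda = sum(lambdas) / len(lambdas)
mean_w = sum(matches) / len(matches)
print("mean of Lambda:", mean_lambda, " mean of W:", mean_w)
```

Under the uniform prior (`alpha = 1`) one has $E(\sum_j Q_j^2) = 2/(k + 1)$, so both sample means should be close to $2N/(k+1)$. Replacing `edges` with any other list of acquainted pairs gives the general graph version.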
Theorem 2. For every $n$, let $\Gamma_n$ be a random subgraph of $G_n$ with hidden colors $X_{1n}, \dots, X_{nn}$ and let $W_n$ be the number of edges in $\Gamma_n$. Suppose that, for every $n$, there exists a random vector $Q_n$ such that $X_{1n}, \dots, X_{nn}$ are conditionally i.i.d., given $Q_n$, with
\[
P(X_{1n} = j \mid Q_n) = Q_{jn}, \qquad j = 1, \dots, k_n.
\]
Let
\[
E\Bigl( \max_{1 \le j \le k_n} Q_{jn} \Bigr) = o\Bigl( \frac{1}{m_n + 1} \Bigr) \qquad \text{as } n \to \infty. \tag{4}
\]
a) If $W_n$ converges in distribution to a random variable $W$, as $n \to \infty$, then
\[
P(W \in A) = \int_{[0, +\infty)} Po_\lambda(A) \, d\mu(\lambda), \tag{5}
\]
where $Po_0(\{0\}) = 1$ and, for $\lambda > 0$, $Po_\lambda$ denotes the Poisson distribution with parameter $\lambda$. Moreover, the mixing measure $\mu$ in (5) is unique.
b) The sequence $W_n$ converges in distribution if and only if there exists a random variable $\Lambda$ such that
\[
N_n \sum_{j=1}^{k_n} Q_{jn}^2 \xrightarrow{d} \Lambda \qquad \text{as } n \to \infty. \tag{6}
\]
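Moreover, the mixing measure $\mu$ in (5) is the law of $\Lambda$.

Proof. The proof is based on Cerquetti and Fortini [2003], Theorem 3, which gives an asymptotic result for the number of copies of a graph $H$ in random subgraphs with exchangeable hidden colors. With the same notation, let $H$ be an edge; hence $A_{Hn}$ is the set of the edges in $G_n$, $|A_{Hn}| = N_n$ and
\[
m_{1n} = \frac{1}{|A_{Hn}|} \sum_{\alpha \in A_{Hn}} |B_{\alpha 1 n}|,
\]
where $B_{\alpha 1 n} = \{\beta \in A_{Hn} : |\alpha \cap \beta| = 1\} = \{\beta \in A_{Hn} : \beta \ne \alpha, \ \alpha \cap \beta \ne \emptyset\}$. Then
\[
m_{1n} = \frac{\sum_{i=1}^{n} d_{in}(d_{in} - 1)}{N_n}.
\]
In fact, since each vertex $i$ belongs to exactly $d_{in}$ edges,
\[
\frac{1}{|A_{Hn}|} \sum_{\alpha \in A_{Hn}} |B_{\alpha 1 n}| = \frac{1}{N_n} \sum_{i < j, \, j \sim i} (d_{in} - 1 + d_{jn} - 1) = \frac{1}{N_n} \sum_{i=1}^{n} d_{in}(d_{in} - 1).
\]
[...]
\[
Q_{in} = \begin{cases} Y_i \big/ \sum_{j=1}^{n} Y_j & \text{if } \sum_{j=1}^{n} Y_j > 0 \\ 1/n & \text{if } \sum_{j=1}^{n} Y_j = 0 \end{cases} \qquad (i = 1, \dots, n).
\]
By de Finetti's theorem, there exists a random cumulative distribution function $F$ such that $Y_1, Y_2, \dots$ are conditionally independent, given $F$, with $P(Y_j \le y \mid F) = F(y)$ $P$-a.s.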
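Moreover, $F$ has $P$-a.s. finite second moment $M_2$. In fact,
\[
E(Y_1^2) = E(E(Y_1^2 \mid F)) = E\Bigl( \int y^2 \, dF(y) \Bigr) = E(M_2) < \infty.
\]
Analogously,
\[
M_1 = \int y \, dF(y)
\]
is $P$-a.s. finite and strictly positive. It is easy to verify that
\[
E\Bigl( \max_{1 \le j \le n} Q_{jn} \Bigr) \to 0,
\]
as $n \to \infty$. In fact, by the law of large numbers for exchangeable random variables, $\sum_{j=1}^{n} Y_j / n$ converges in distribution to $M_1$. Moreover, if $P_F$ is a regular version of the conditional probability given $F$, then, for every $\epsilon > 0$,
\[
P\Bigl( \frac{\max_{1 \le j \le n} Y_j}{n} < \epsilon \Bigr) = E\bigl( (P_F\{Y_1 < n\epsilon\})^n \bigr) \ge E\Bigl( \Bigl(1 - \frac{M_2}{\epsilon^2 n^2}\Bigr)^n \Bigr) \to 1,
\]
as $n \to \infty$. It follows that
\[
\max_{1 \le j \le n} Q_{jn} = \frac{\max_{1 \le j \le n} Y_j / n}{\sum_{j=1}^{n} Y_j / n} \xrightarrow{P} 0,
\]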
as $n \to \infty$. Since $m_n = 2(n - 2)/(n - 1)$, by Theorem 2, the probability distribution of $W_n$ converges to a mixture of Poisson laws if $(n - 1) \sum_{j=1}^{n} Q_{jn}^2$ converges in distribution. By the law of large numbers for exchangeable summands,
\[
(n - 1) \sum_{i=1}^{n} Q_{in}^2 = (n - 1) \frac{\sum_{i=1}^{n} Y_i^2}{\bigl( \sum_{i=1}^{n} Y_i \bigr)^2} = \Bigl(1 - \frac{1}{n}\Bigr) \frac{\sum_{i=1}^{n} Y_i^2}{n} \Bigl( \frac{\sum_{i=1}^{n} Y_i}{n} \Bigr)^{-2} \xrightarrow{d} \frac{M_2}{M_1^2},
\]
as $n \to \infty$. By Theorem 2, the asymptotic distribution of $W_n$ is a mixture of Poisson laws and the mixing measure is the law of $M_2 / M_1^2$.

Theorem 2 can be applied to find the asymptotic distribution of the number of coincidences among the first $n$ random variables of an exchangeable sequence. To state the result as a limit theorem, we introduce an array of random variables and assume that the random variables in each row are exchangeable. More precisely, for every $n$, let $X_{1n}, X_{2n}, \dots$ be an exchangeable sequence of random variables, taking values $1, \dots, k_n$, with $k_n < \infty$; exchangeability here means that
\[
P(X_{1n} = x_1, \dots, X_{rn} = x_r) = P(X_{1n} = x_{\pi(1)}, \dots, X_{rn} = x_{\pi(r)})
\]
for every $r$, for every $x_i$ and every permutation $\pi$. By de Finetti's Theorem, there exists a random vector $Q_n = (Q_{1n}, \dots, Q_{k_n n})$ such that the $X_{in}$'s are conditionally i.i.d., given $Q_n$, with
\[
P(X_{in} = j \mid Q_n) = Q_{jn}, \qquad j = 1, \dots, k_n.
\]
Corollary 2. Let $W_n = \sum_{i < j} 1(X_{in} = X_{jn})$ be the number of coincidences among $X_{1n}, \dots, X_{nn}$. Suppose that