A Poisson approximation for colored graphs under exchangeability

5 downloads 135 Views 228KB Size Report
We introduce random graphs with exchangeable hidden colors and provide an asymptotic result for the number of times a certain substructure appears in.
A Poisson approximation for colored graphs under exchangeability∗† Annalisa Cerquetti, Sandra Fortini‡ Istituto di Metodi Quantitativi, Universit` a Bocconi, Milano, Italy

Abstract We introduce random graphs with exchangeable hidden colors and provide an asymptotic result for the number of times a certain substructure appears in a random graph belonging to such class. In particular we give necessary and sufficient conditions for the number of subgraphs isomorphic to a given graph to converge, under a negligibility assumption on the frequencies of colors. Moreover we prove that the limiting law, when it exists, is a mixture of Poisson distributions. Our proofs rely on Stein-Chen method for Poisson approximations of sum of weakly dependent random variables. Finally, we discuss the relevance of random graphs with exchangeable hidden colors in Bayesian inference. Keywords: Exchangeability, Poisson Approximations, Random graphs, Subgraphs enumerations.

1

Introduction

The systematic study of random graphs was started by Erd˝os and R´enyi in a series of seminal papers in 1950s and 1960s. In its simplest version, the classical (finite, undirected, acyclic, purely) random graph G(n, p), is defined by taking a finite set  of vertices {1, . . . , n} and then randomly selecting each of the n2 possible edges with probability p independently of all other edges. As pointed out in Newman (2003), Erd˝os standard model fails to describe many real-world networks, for its intrinsic lack of edge correlation. In many different fields like molecular biology, ecology or information-technology, in fact, observed network structures frequently show clustering i.e. vertices are more likely to be connected when they have a common neighbour. Hence in the last decade, an increased interest in random graphs theory, has produced generalizations of the Erd˝os simple model allowing correlation between edges (see Penman, 1998; Biggins, 2002; Biggins and Penman, 2003; Cannings and ∗ Research partially supported by MIUR 2002, Research program: Bayesian nonparametric methods and their applications. † AMS (2000) subject classification. Primary: 60F05. Secondary: 05C80, 62F99. ‡ Corresponding author. Istituto di Metodi Quantitativi, Viale Isonzo, 25, 20133 Milano, Italy. E-mail: [email protected]

1

Penman, 2003; S¨oderberg, 2003). One of the most appealing is obtained by randomly coloring vertices according to a color distribution and realizing edges independently with color dependent probabilities. An interesting class of results in random graphs theory concerns the problem of counting the number of times a fixed substructure H appears in a random graph (For a general introduction to random graphs see e.g. Bollobas, 1985). Let G(n, p) be the Erd˝os basic model, and let W be the number of copies of H in G(n, p). Barbour et al. (1992) establish an upper bound for the total variation distance between the law of W and a Poisson law with a suitable mean. Their proofs rely upon the Chen-Stein method, which is a general method to establish Poisson approximations for the sum of weakly dependent indicator random variables, with small occurrence probabilities. In randomly colored random graphs, subgraphs enumeration was originally introduced by Janson (1986) as an alternative formulation of the famous birthday problem (Von Mises, 1939). Let Γ be a randomly colored random graph obtained by coloring the vertices of a complete graph G, at random and independently, and then deleting the edges with different colors at the endpoints. If vertices represent people and colors represent people birthdays, then an edge arises in Γ for every pair of people sharing the same birthday, so that the number of edges W , corresponds to the number of pairs of people born in the same day. In this setting Barbour et al. (1992) provide a Poisson approximation for the probability distribution of W . More precisely, they establish an upper bound for the total variation  P 2 distance between the law of W and a Poisson distribution with mean λ = n2 r pr , where pr is the probability of being born on day r, and n is the number of people. Here we extend the above result in two different directions. First we relax the hypothesis that vertices colors are independent by assuming that they are exchangeable. Actually, in order to exploit de Finetti’s representation theorem, we make the stronger assumption that the sequence of colors can be extended to an infinite exchangeable sequence. Moreover we consider the number W of occurrences of a general connected subgraph H. In this framework we prove that, under suitable negligibility assumptions, the probability distribution of W can be well approximated by a mixture of Poisson laws, if the order of G is large. The above mentioned results can be applied in a Bayesian perspective to model the number of coincidences arising in large networks. Consider a large network in which a random characteristic is associated to each node, in such a way that different nodes are homogeneous with respect to it. It may be, for example, the login name in a computer network, or a particular choice or interest in a social network. Suppose that we want to fix a model for the number of coincidences among connected nodes in the network. In a Bayesian approach, if the distribution of the characteristic is not known, a prior distribution is assigned to it. According to de Finetti’s representation theorem, this is equivalent to assume that the nodes are exchangeable with respect to the above mentioned characteristic. Our asymptotic results on random graphs with exchangeable colors imply that, if coincidences can be considered as exceptional events, a Poisson model with a random mean can be employed. The paper is organized as follows. In Section 2 we fix some notations and introduce 2

basic definitions. Moreover we state a property of random graphs with independent hidden colors exploited in the subsequent section. In Section 3 we present the asymptotic result together with some examples, remarks and comments. In Section 4 we discuss some applications in Bayesian modelling. Finally, in Section 5, we present some remarks and suggestions for further developments.

2

Notations and preliminary results

Before introducing the definition of random graph with exchangeable hidden colors, we fix some notations. For an introduction to graphs theory see e.g. Diestel (2000). If G is a graph, then V(G) denotes the vertex set of G and E(G) denotes the edge set of G. Hence G = (V(G), E(G)). In the paper, only finite graphs are considered. Without loss of generality, we can take V(G) = {1, . . . , n}, whenever |G| = n. An edge {i, j}, joining the vertices i and j is denoted by ij or ji, without distinction. The number of vertices in G is called the order of G; it is denoted by |G|. The same notation is used for the cardinality of a set: |A| denotes the cardinality of A. Thus |G| = |V(G)|. A graph is a subgraph of G if its vertex set and its edge sets are subsets of V(G) and of E(G), respectively. Subgraphs of a fixed graph G will be generally denoted by Greek letters. We will write α ⊂ G if α is a subgraph of G. Two graphs G and G0 are called isomorphic if there exists a correspondence between their vertex sets that preserves adjacency. The intersection of two graphs G ∩ G0 is the graph (V(G) ∩ V(G0 ), E(G) ∩ E(G0 )). Two subgraphs α and β of G are called connected if α ∩ β 6= ∅. If α ⊂ G is isomorphic to a graph H, α is called a copy of H in G. Let G and H be graphs. The set of copies of H in G will be denoted by AH . Let α be copy of H in G. Denote by Bαr the set of copies of H in G having r common vertices with α, (r = 1, . . . , |H| − 1|); hence Bαr = {β ∈ AH : |β ∩ α| = r}

(1)

The mean number of copies of H in G, having r vertices in common with a fixed copy of H in G, is defined as mr =

1 X |Bαr | |AH |

(2)

α∈AH

Let us introduce the definition of random graph on a graph G. Let G = (V, E) be a finite graph. Let G be the set of graphs with vertex set V and whose edge set is a subset of E: G = {(V, E 0 ) : E 0 ⊂ E} Let 2G be the power sigma-algebra on G. Any measurable function Γ from a probability space (Ω, F, P ) to the measurable space (G, 2G ) is called a random graph. The

3

probability distribution of the random graph is the measure induced by P through Γ. We will now introduce the definition of random graph with hidden colors. Let K be a finite set, named color space. If |K| = k, we identify K with {1, . . . , k}. Consider the probability space (Ω, F, P ), where Ω = K ∞ = {ω = (ω1 , ω2 , . . .) : ωi ∈ K}, F is the cylinder sigma-algebra on Ω and P is a probability measure on F. Let Γ : K ∞ → G be defined by Γ(ω) = (V, {ij : ωi = ωj }).

(3)

Then Γ is a F/2G -measurable function. Definition 1. The random element Γ defined in (3) is called a random graph with hidden colors. The coordinate random variables X1 , X2 , . . . on (Ω, F, P ), defined by Xi (ω) = ωi , are called the vertices colors. A random graph with hidden colors can be thought to be obtained from a fixed graph G, by coloring its vertices according to some law and then deleting the edges with different colors at the endvertices. If the probability distribution of the vertices colors is exchangeable, the random graph is called a random graph with exchangeable hidden colors. Here exchangeability means invariance with respect to every finite permutations on K ∞ : P (A1 × . . . × As × K ∞ ) = P (Aπ(1) × . . . × Aπ(s) × K ∞ ) for every A1 , . . . , As ⊂ K, every permutation π on {1, . . . , s} and every s = 1, 2, . . .. In this case, the coordinate random variables X1 , X2 , . . . on (Ω, F, P ), are exchangeable, i.e. P (X1 ∈ A1 , . . . , Xs ∈ As ) = P (X1 ∈ Aπ(1) , . . . , Xs ∈ Aπ(s) ) for every A1 , . . . , As ⊂ K, every permutation π on {1, . . . , s} and for every s = 1, 2, . . .. By de Finetti’s representation theorem, if P is an exchangeable probability measure on F, there exists a random probability measure Q on the power sigma-algebra of K, 2K , such that ∞

P (A1 × . . . × As × K |Q) =

s Y

Q(Ai ) P − a.s.

i=1

for every A1 , . . . , As ⊂ K and for every s = 1, 2, . . .. The random variables X1 , X2 , . . . are, therefore, conditionally independent and identically distributed with probability distribution Q, given Q. Following Aldous (1985), we will say that the exchangeable sequence (Xi )i is directed by Q or, equivalently, that Q is the directing measure of 4

the exchangeable sequence (Xi )i . Since K is a finite set, Q can by identified with the random vector (Q1 , . . . , Qk ) defined as (Q{1}, . . . , Q{k}). Definition 2. Let Ω and F be defined as above and let P be an exchangeable probability measure on F. A random element Γ defined as in (3) is called a random graph with exchangeable hidden colors. If Γ is a random graph with exchangeable hidden colors, then the vertices colors X1 , X2 , . . . form a sequence of exchangeable random variables. In particular, if there exists a probability measure P0 on 2K such that P = P0∞ , then the vertices colors are independent, identically distributed random variables. Random graphs with independent and identically distributed vertices colors can be seen as a particular case of the random randomly colored graphs, introduced by Penman (1998) and by S¨oderberg (2003), independently. Let Γ be a random graph on an underlying graph G and let H be a graph. The number of copies of H in Γ is the random variable X W = 1{α ⊂ Γ}, α∈AH

where, for every B ∈ F, 1(B) is the indicator function of B. If Γ is a random graph with hidden colors and (Xi )i is its sequence of random colors, then X W = 1(∩ij∈V(α) {Xi = Xj }). α∈AH

In the paper, we will prove that, under a suitable negligibility assumption, the probability distribution of W , L(W ), can be well approximated by a mixture of Poisson laws, when |AH | is large. We recall the definition of mixture of Poisson laws. Let the Poisson distribution with parameter λ > 0 be denoted by P oλ . We include in the last class the degenerate distribution on 0 as the P o 0 distribution. Let µ be a probability distribution on the Borel sigma-algebra on [0, +∞). A probability measure ν satisfying Z ν(A) =

P oλ (A)dµ(λ) [0,+∞)

is called a mixture of Poisson laws with mixing measure µ. The following theorem asserts that ν uniquely determines µ. Theorem 1. Let ν 0 and ν 00 be mixtures of Poisson laws with mixing measures µ0 and µ00 , respectively. If ν 00 = ν 0 , then µ00 = µ0 . proof. Let D denote the discrete measure on {0, 1, . . .}. For t < 0, the function (k, λ) → etk P oλ (k) is integrable with respect to the product measures D × µ0 and D × µ00 . Hence, by Fubini theorem, Z Z Z Z t λ(et −1) 0 tx 0 tx 00 e µ (dλ) = e dν (x) = e dν (x) = eλ(e −1) µ00 (dλ), 5

for every t < 0. It follows that µ0 = µ00 (see e.g. Billingsley, 1995).



The asymptotic results of next section are based on the following theorem. Theorem 2. Let Γ be a random graph on G with i.i.d. colors X1 , X2 , . . . and color space {1, . . . , k}. Let H be a connected graph with |H| > 1. Then the number of copies of H in Γ satisfies −λ

dT V (L(W ), P oλ ) ≤ 1 ∧ (1 − e

P|H|−1 r=1

)(

mr + 1

AH

|H|−1

λ+

X r=1

Pk

2|H|−r

i=1 mr P k

qi

|H|

),

i=1 qi

where dT V denotes the total variation distance, qi = P (X1 = i) for every i, and P |H| λ = |AH | ki=1 qi . proof. For every α ∈ AH , let Iα = 1{α ⊂ Γ}. The random variables {Iα : α ∈ A} are dissociated (see Barbour et al.(1992) p. 34). Theorem 2.N in Barbour et al.(1992) then implies   |H|−1 X X 1 − e−λ X  dT V (L(W ), P oλ ) ≤ (EIα )2 + ((EIα )2 + (EIα Iβ )) λ r=1 β∈Bαr

α∈AH

P P 2|H|−r |H| if α and β Moreover E(Iα ) = ki=1 qi = λ/|AH | while E(Iα Iβ ) = ki=1 qi have r common vertices (1 ≤ r ≤ |H| − 1).  Let PG, H, AH be defined as above. If q1 , . . . , qk are non-negative numbers satisfying ki=1 qi = 1, then there exist, on some probability space (Ω, F, P ) independent and identically distributed random variables X1 , X2 , . . ., satisfying, for every i and j, P (Xi = j) = qj . Hence the following result holds. Corollary 1. Let w : {1, . . . , k}n → {0, 1, . . .} be defined by X Y w(x1 , . . . , x|G| ) = 1{xi = xj }. α∈AH ij∈V(α)

and λ = |AH | |

P

|H| i=1 qi .

Pk

x1 ,...,x|G| ∈{1,...,k}

Then

1{w(x1 , . . . , x|G| ) ∈ A}

Q|G|

− P oλ (A)|

P|H|−1

mr +1

P|H|−1

≤ 1 ∧ (1 −

e−λ )(

r=1

i=1 qxi

AH

6

λ+

r=1

mr

2|H|−r i=1 qi Pk |H| i=1 qi

Pk

(4) ),

3

Poisson approximation

To state the asymptotic result as a limit theorem, we need to introduce a sequence (Γn )n of random graphs with exchangeable hidden colors on a fixed sequence of graphs (Gn )n . Without loss of generality, we can assume that Γ1 , Γ2 , . . . are defined on a common probability space; this is convenient for notation simplicity. Let H, G1 , G2 , . . . be a sequence of graphs such that |H| > 1, H is connected and lim |AHn | = ∞,

n→∞

where AHn denotes the set of copies of H in Gn . For every n, let Kn = {1, . . . , kn } be a color space. Consider the probability space (Ω, F, P ), where Ω = {ω = (ωin ) : ωin = 1, . . . , kn , i = 1, 2, . . . , n = 1, 2, . . .}, F is the cylinder sigma-algebra on Ω and P is a probability measure on F satisfying, for every n, s = 1, 2, . . ., P {ω ∈ Ω : ω1n = c1 , . . . , ωsn = cs } = P {ω ∈ Ω : ω1n = cπ(1) , . . . , ωsn = cπ(s) } for every c1 , . . . cs ∈ Kn and every permutation π on {1, . . . , s}. For every n Γn (ω) = (V(Gn ), {ij : ωin = ωjn }). is a random graph with exchangeable hidden colors. Denote by Wn the number of copies of H in Γn : X Wn = Iαn , α∈AHn

where Iαn = 1{α ⊂ Γn }. We will prove that, under a suitable negligibility assumption, the asymptotic distribution of (Wn )n is a mixture of Poisson laws. For every n let Qn = (Q1n , . . . , Qkn n ) be the directing measure of the exchangeable sequence (Xjn )j , where Xjn , defined by Xjn (ω) = ωjn , represents the random color of the j-th vertex of Gn (j = 1, . . . , |G|). Hence P (X1n = x1n , . . . , Xsn = xsn |Qn ) =

s Y

Qxj n for every s, x1 , . . . , xs

P − a.s.

j=1

We can assume, without loss of generality, that Qin (ω) ≥ 0 for every i and every ω (5) Pkn

i=1 Qin (ω) = 1 for every ω

7

We will assume that (Qn )n satisfies the following negligibility assumption: Pkn

2|H|−r

i=1 Qin (mrn + 1) P |H| kn i=1 Qin

P

→ 0 as n → ∞,

where mrn =

for every r = 1, . . . , |H| − 1

X 1 |Bαrn | |AHn |

(6)

(7)

α∈AHn

and Bαrn = {β ∈ AHn : |β ∩ α| = r}. Remark 1. A sufficient condition for (6) is P

(mrn + 1)1/(|H|−r) max Qin → 0 i

for every r = 1, . . . , |H| − 1,

(8)

as n → ∞. Remark 2. Assumption Pkn (6) implies that (Iα )α are uniformly asymptotically negligible. In fact, since i=1 Qin = 1, kn X i=1

2|H|−r

Qin

kn kn X X 2|H|−r−1 |H| |H| ≥( Qin ) |H|−1 ≥ ( Qin )2 . i=1

i=1

Moreover, if αn is a fixed copy of H in Gn , E(Iαn ) =

|H| i=1 Qin .

Pkn

Hence,

maxα∈AHn P (Iα = 1) = E(P (Iαn = 1|Qn )) Pkn 2|H|−r P |H| i=1 Qin = E( i Qin ) ≤ E( P |H| ) kn i=1

Qi

If (6) holds, the last term in the inequality tends to zero, as n → ∞. On the other hand, the probability distribution of (Wn )n might fail to converge to a mixture of Poisson laws, if (6) does not hold, even under the hypothesis that (Iα )α are uniformly asymptotically negligible. See Example 1. Theorem 3. Let (Γn )n be a sequence of random graphs with exchangeable hidden colors on a sequence of graphs (Gn )n . Let H be a connected graph with |H| > 1 and let AHn , mrn and Qn be defined as above with |AHn | → ∞ as n → ∞. Suppose that (6) holds and let X |H| Λn = |AHn | Qin . (9) i

Then (Wn )n converges in distribution if and only if (Λn )n converges in distribution. In this case, the limiting distribution of (Wn )n is a mixture of Poisson laws, the mixing measure being the limiting law of (Λn )n . Moreover the representation is unique. 8

proof. Let wn (x1n , . . . , x|Gn |n ) =

X

Y

1{xin = xjn }.

α∈AHn i,j∈α

Then Wn = wn (X1n , . . . , X|Gn |n ). Moreover |Gn |

P (Wn ∈ A|Qn ) =

X

1{wn (x1n , . . . , x|Gn |n ) ∈ A}

x1n ,...,x|Gn |n

Y

Qxin n

P − a.s.

i=1

It follows from (4) and (5) that, for every Borel set A, |P (Wn ∈ A|Qn ) −P oΛn (A)| Pkn 2|H|−r P|H|−1 P n |H| P|H|−1 i=1 Qin Qin + r=1 mrn P ≤ 1 ∧ ( r=1 (mrn + 1) ki=1 |H| ) P − a.s. kn

(10)

i=1 Qin

It follows from (6) and Remark 2 that the random vector Pkn |H|+1 ! Pkn 2|H|−1 kn kn X X Q Q |H| |H| i=1 in , . . . , m|H|−1n Pi=1 in|H| (m1n+1) Qin , . . . , (m|H|−1n+1) Qin , m1n P |H| kn kn i=1 i=1 i=1 Qin i=1 Qin converges in distribution to the zero vector, as n → ∞. Since Pkn 2|H|−r |H|−1 |H|−1 kn X X X |H| i=1 Qin ) 1∧( (mrn + 1) Qin + mrn P |H| kn r=1 r=1 i=1 i=1 Qin is a bounded, continuous function of it, |P (Wn ∈ A) − E(P oΛn (A))| ≤ E|P (Wn ∈ A|Qn ) − P oΛn (A)| → 0

(11)

as n → ∞. If Λn converges in distribution to a random variable Λ, then |E(P oΛn (A)) − E(P oΛ (A))| → 0 and, therefore the probability distribution of Wn converges, as n → ∞ to E(P oΛ ). Conversely, suppose that d Wn → W, as n → ∞. For every n, let µn denote the probability distribution of Λn . Let us prove that (µn )n is tight. By Chebyshev inequality, for every λ > 0, 4 . λ Suppose that (µn )n is not tight and let a < 1 be such that, for every R > 4, there exists an increasing sequence of integers (nm )m , depending on R, such that, for every m, pnm = P (Λnm ≤ 2R) < a. Then P oλ [0, λ/2) ≤

P oΛnm [0, R] ≤ P oΛnm [0, R]1{Λnm < 2R} + P oΛnm [0, R]1{Λnm ≥ 2R} ≤ 1{Λnm < 2R} + P oΛnm [0, Λnm /2]1{Λnm ≥ 2R} ≤ 1{Λnm < 2R} + 4/Λnm 1{Λnm ≥ 2R}) ≤ 1{Λnm < 2R} + 2/R1{Λnm ≥ 2R} 9

Hence E(P oΛnm [0, R]) ≤ pnm + 2/R(1 − pnm ) ≤ (a + 1)/2. From (11), P (Wnm ≤ R) − E(P oΛnm [0, R]) → 0, as m → ∞. Let m0 be such that, for every m ≥ m0 , |P (Wnm ≤ R) − E(P oΛnm [0, R])| ≤ 1/4 − a/4. Then, for every m ≥ m0 , P (Wnm ≤ R) ≤ (3 + a)/4 < 1. Concluding, if µn is not tight, there exists b < 1 such that, for every R > 4, P (Wn ≤ R) ≤ b for some n. Hence Wn fails to converge in distribution, which contradicts the hypothesis. We have so therefore proved that µn is tight. To prove that µn converges in distribution to a probability measure, it is, therefore sufficient to show that, any two convergent subsequences µn0 , µn00 have the same limit. Let µ0 and µ00 denote the weak limits of µn0 and µn00 , respectively.R Then, by the first R part 00of the proof, the limiting 0 distributions of Wn0 and Wn00 are P oλ µ (dλ) and P oλ µ (dλ), respectively. Since Wn converges in distribution, Z Z 0 P oλ µ (dλ) = P oλ µ00 (dλ). It follows from Theorem 1 that µ0 = µ00 .



The assumptions of Theorem 3 are consistent. Example 1. For every n, let Gn be the complete graph with n vertices, Kn = {1, . . . , n2 } and let Γn be a random graph on Gn with i.i.d. hidden colors uniformly distributed over {1, . . . , n2 }. Let H be the graph consisting of one edge connecting two vertices. Then Qin = 1/n2 (i = 1, . . . , n2 , n = 1, 2, . . .), m1n = 2(n − 2) and   Pn2 2 |AHn | = n2 . Hence (8) holds and Λn = n2 i=1 Qi → 1/2 as n → ∞. Assumption (6) of Theorem 3 can not be suppressed, as proved in Example 2. Example 2. For every n, let Gn be the complete graph with n vertices and Kn = {1, . . . , n2 }. For every n, let Γn be a random graph on Gn with i.i.d. colors X1n , . . . , Xnn . taking values 1, . . . , n2 with probabilities, 1/n, 1/(n2 + n), 1/(n2 + n), . . . , 1/(n2 +n), respectively. Let H be the graph composed of one edge connecting   Pn2 2 two vertices. Then |AHn | = n2 and Λn = n2 j=1 Qjn → 1, as n → ∞. On the other hand, (6) does not hold in this case, since Pn2 3 1 1 + (n2 − 1) (n2 +n) 3 Qin n3 m1n Pi=1 6→ 0, = 2(n − 2) 2 1 1 n 2 2 Q 2 + (n − 1) (n2 +n)2 n i=1 in 10

as n → ∞. We prove that (Wn )n has not a limiting P o1 distribution, as n → ∞, by showing that 3 P (Wn = 0) → 2e− 2 , (12) as n → ∞. To prove (12), denote by Cn the number of vertices with color 1 in Γn . Then P (Wn = 0) = P (Wn = 0|Cn = 0)P (Cn = 0) + P (Wn = 0|Cn = 1)P (Cn = 1)     1 n 1 n−1 = 1− P (Wn = 0|Cn = 0) + 1 − P (Wn = 0|Cn = 1). n n The conditional distribution of Wn , given that Cn = 0, coincides with the probability distribution of a random variable Wn0 , representing the number of edges in a random graph on Gn , with independent random colors, uniformly distribuited over {1, . . . , n2 − 1}. By Theorem 3, 1

P (Wn = 0|Cn = 0) → e− 2 , as n → ∞. Analogously the conditional distribution of Wn , given that Cn = 1, coincides with the probability distribution of a random variable Wn00 , representing the number of edges in a random graph on Gn−1 with independent random colors, uniformly distribuited over {1, . . . , n2 − 1}. Hence, by Theorem 3, 1

P (Wn = 0|Cn = 1) → e− 2 , as n → ∞. Hence (12) holds. Remark 3. If Λn converges in distribution, then P

max Qin → 0,

(13)

i

|H|

as n → ∞. In fact maxi Qin ≤

|H| i Qin

P

=

P Λn |AHn | →

hand, (6) is a stronger condition than (13), since

P

0, as n → ∞. On the other |H|

i Qin ≤

P

2|H|−r

i Qin P |H| i Qin

.

Example 3. Let Gn be the complete graph with n vertices and let Γn be a random graph with exchangeable hidden colors and color space Kn = {1, . . . , n2 }. Let the random vector (Q1n , . . . , Qn2 n ) have Dirichlet distribution with parameters (a1 , . . . , an2 ), where (an )n is a sequence of real numbers such that Pn2

ai i=1 n2 Pn2 a2i i=1 n2

→ λ1 > 0 → λ2 > 0

as n → ∞.

(14)

It follows from (14) that the sequence (an )n is bounded and definitively far away from zero. We prove that the number of edges in Γn converges, in distribution, to a +λ2 random variable with P o(λ) distribution with λ = λ12λ 2 . 1

11

The number of edges in Γn coincides with the number of copies of H in Γn , where H consists of one edge connecting two vertices. By Theorem 3, Wn has a limiting P o(λ1 +λ2 )/2λ21 distribution, since Pn2

m1n Pni=1 2

Q3in

P

→0

2 i=1 Qin

and

(15)

2

n X

λ1 + λ2 (16) 2λ21 i=1  as n → ∞, with m1n = 2(n − 2) and |AHn | = n2 . To prove (15) and (16) we can resort to a well known representation of the Dirichlet law, based on the Gamma distribution (see e.g. Wilks, 1962). According to such representation, there exists a sequence (Yin )i=1,...,n2 ,n=1,2,... of random variables such that, for every n, Y1n , . . . , Yn2 n are stochastically independent, Yin has Gamma distribution with scale parameter 1 and shape parameter ai , for every i and every n, and |AHn |

P

Q2in →

Yin Qin = Pn2 . i=1 Yin Hence

Pn2

3 i=1 Yin n2

Pn2

Q3in 2(n − 2) m1n Pi=1 = n2 2 n2 i=1 Qin

Pn2

and

Pn2

2

|AHn |

n X

Q2in

i=1

Since

and

Pn2

2 i=1 Yin n2

2 i=1 Yin ) n2

2 i=1 Yin n2

Y2

i=1 in n−1 n2 = P 2 2 . n 2n Yin

Pn2 =

Pn2 V(

as n → ∞, then

2 i=1 Yin ) n2

Pn2

i=1 n2

Pn2 E(

i=1 Yin n2

i=1 ai (ai n2

Pn2 =

i=1 ai (ai

+ 1)

→ λ 1 + λ2

+ 1)(4ai + 6) → 0, n4

converges in probability to λ1 + λ2 , as n → ∞. Moreover Pn2 E(

and

i=1 Yin ) n2

Pn2 =

V(

→ λ1

Pn2

Pn2

i=1 Yin ) n2

i=1 ai n2

= 12

i=1 ai n4

→ 0,

as n → ∞. Hence ( Pn2

3 i=1 Yin n2

Pn2

i=1 Yin n2

P

)2 → λ21 , as n → ∞. Analogously, it can be proved that

is bounded. Hence (15) holds and it follows that 2

Λn = |AHn |

n X

Q2in →

i=1

λ1 + λ2 2λ21

as n → ∞.



In the above example, the random parameter of the limiting Poisson law has a degenerate probability distribution. Hence the mixture distribution reduces to a Poisson law. The following example shows a situation in which the limiting law of Wn is a proper mixture of Poisson laws. Example 4. Let s be a fixed positive integer and let Gn be the complete graph with sn vertices. Consider a sequence of random graphs (Γn )n with color spaces (Kn )n , where P Kn = {1, . . . , s2 n2 }. Let (Z1 , . . . , Zs ) be a random vector with Zi ≥ 0 for every i and si=1 Zi = 1 and let Qin = Zj /(sn2 ) for every i = (j − 1)sn2 + 1, . . . , jsn2

(j = 1, . . . , s).  Let Wn be the number of edges in Γn . Then |H| = 2, |AHn | = ns and m1n = 2 2(sn − 2). Moreover (6) holds. In fact Ps2 n2

m1n Psi=1 2 n2

Q3in

2 i=1 Qin

as n → ∞. Since  Λn =

= 2(ns −

Zj3 j=1 s3 n6 2) P Z2 sn2 sj=1 s2 nj 4

sn2

Ps

P

→0

 n2 s2  X s Zj2 ns X 2 ns 2 Qin = sn 2 4 2 2 s n i=1

j=1

Ps

converges in distribution to Λ = s j=1 Zj2 /2, as n → ∞, the limiting distribution of the number of edges in Γn converges to a mixture of Poisson laws, the mixing measure being the probability distribution of Λ. The above result can be extended to the case |H| > 2. Let s be a positive integer and H be a complete graph. For every n, let Gn be the complete graph with sn |H|

vertices and Γn be a random graph on Gn , with color space Kn = {1, . . . , bs2 n |H|−1 c} and (random) probabilities of colors defined by Qin =

Zj bsn

|H| |H|−1

|H|

|H|

for i = (j − 1)bsn |H|−1 c + 1, . . . , jbsn |H|−1 c

(j = 1, . . . , s),

c

where bac denotes the integer part of a and (Z1 , . . . , Zs ) is a random vector with Ps Zi ≥ 0 for every i and i=1 Zi = 1. By the same argument as for |H| = 2, the 13

number of copies of H in Γn converges in distribution to a Poisson law with random |H| s Ps  parameter Λ = |H|! j=1 Zj .

4

Bayesian models for the number of coincidences

In a recent paper, Diaconis and Holmes (2002) re-analyze some basic problems of elementary probability from a Bayesian point of view. In particular, they take up again the birthday problem: if n balls are randomly dropped into k boxes, what is the chance of a match, that is, that two or more balls fall in the same box? The classical answer is given under the assumption that balls are dropped independently and uniformly into each box. If k = 365, the answer gives the probability of two or more coincident birthdays, in a group of n individuals. Diaconis and Holmes (2002) face the birthday problem from a different poit of view: ”Consider a practical demonstration of the birthday problem on the first day of a class. Are the birthdays of the students reasonably considered as balls dropped uniformly into 365 categories? After all there are well known weekend/weekdays effects, and perhaps seasonal and lunar trends for birth rates. On reflection, the instructor may conclude that the chance Qi of a student being born on day i is not uniform and in fact is not known.” Some results on the birthday problem with non-uniform occurrence probabilities can be found in the literature (see Joag-Dev and Proschan, 1992; Mase, 1992; Henze 1998; Camarri and Pitman, 2000). The main innovation in Diaconis and Holmes (2002) consists in putting prior probabilities on the Qi , in a Bayesian perspective. Exact calculations, for the chance of a match, are presented under uniform prior and symmetric Dirichlet prior. Moreover the following Poisson approximation is proved. Theorem 4. (Diaconis and Holmes) Let (a1 , a2 , . . .) be a sequence of positive real numbers and let Ak = a1 + . . . + ak , for every k = 1, 2, . . .. Consider a Dirichlet prior on Qn = (Q1n , . . . , Qkn ), with parameters (a1 , . . . , ak ). If n and k → ∞ in such a way that:  k n X 2 ai (ai + 1) → λ, Ak (Ak + 1) i=1

then the number of boxes with two or more balls has a limiting Poisson(λ) distribution. The above result justifies the choice of the Poisson model, with parameter λ = (n2 ) Pk i=1 ai (ai + 1), for the number of multiple observations in an exchangeable Ak (Ak +1) sequence of random variables (Xi )i , with Dirichlet prior, when the number of possible outcomes k and the number of observations are large. On the other hand, based on the above result, the prior distribution should be a degenerate probability measure. As a variant of the birthday problem, consider a group of n people with various social ties, and let W be the number of pairs of people who are acquainted and share 14

the same birthday. As suggested in Barbour et al. (1992), this problem can be tackled by random graphs with hidden colors. In a graph in which vertices represent people, let edges represent social ties, and consider the random graph obtained by coloring vertices with people birthdays and then deleting edges with different colors at the endpoints. The number of remaining edges represents the number W of pairs of people who are acquainted and have the same birthdays. In this setting, Barbour et al. (1992) give a Poisson approximation for the probability distribution of W , under the hypothesis that people birthdays are independent and identically distributed random variables. In fact Barbour et al. (1992) also give an upper bound for the error of the approximation. Such generalization of the birthday problem can be re-analyzed, in a Bayesian framework, assuming that the chance of a person being born on day i, Qi , is not known, hence considering people’s birthdays as exchangeable random variables with directing measure Q = (Q1 , . . . , Qk ). The number of pairs of people who are acquainted and have the same birthday can, therefore, be represented as the number W of edges in a random graph with exchangeable hidden colors. It follows, from (10), that there exists a random variable Λ such that E(W ) = E(Λ) and sup |P (W ∈ A|Q) − P oΛ (A)| ≤ A

(m1 + 1)Λ + m1 P|Q (X1 = X2 = X3 |X1 = X2 ), AH

where, for every B ∈ F, P|Q (B) is a version of the conditional probability of B, given Q = (Q1 , . . . , Qk ). Hence supA |P (W ∈ A) − E(P oΛ (A))| ≤ supA |E(P (W ∈ A|Q)) − E(P oΛ (A))| ≤ E(supA |P (W ∈ A|Q) − P oΛ (A)|) ≤

(m1 +1)E(W ) AH

+ m1 E(P|Q (X1 = X2 = X3 |X1 = X2 )), (17)

Example 5. Suppose that a group of n computer users is interested to subscribe to some Internet newsgroups. For simplicity, assume that each subscriber need to use the same login name for every newsgroup he chooses to join. We want to fix a model for the number W of pairs of subscribers entering the same newsgroup and choosing the same login name, in the case in which the number k of potential login names is large. Subscribers entering the same newsgroups can be represented by a graph G in which vertices represent subscribers and an edge connects vertices i and j if the i-th and j-th subscribers intend to enter at least one common newsgroup. Build a random graph on G by coloring vertices by login names. Suppose that the mean number of network connection of a subscriber m1 is small, relative to n and that the mean number of coincidences E(W ) is small, relative to the number of pairs of people who enter the same group. Moreover, suppose that the probability that three specified individuals have coincident login names, given that two of them have coincident login names, is small, compared with m1 , whatever the (unknown) 15

probabilities of the login names. According to (17), a Poisson model, with a random parameter Λ, should be fixed for W . Since Λ = E(W |Q), the prior distribution on Λ should reflect prior information on the mean number of pairs of people who intend to enter the same newsgroup and to choose the same login name.  We state the analogous of (17) for a general graph H. Theorem 5. Let Γ be a random graph with exchangeable hidden colors and let H be a fixed graph. Let m1 , . . . , m|H|−1 and Q1 , . . . , Qk be defined as in Section 2 and, for every B ∈ F, let P|Q (B) be a version of the conditional probability of B, given Q = (Q1 , . . . , Qk ). Then there exists a random variable Λ such that E(W ) = E(Λ) and supA |P (W ∈ A) − E(P oΛ (A))| P|H|−1 P|H|−1 ) ≤ ( r=1 mr + 1) E(W r=1 mr E(P|Q (X1 = . . . = X2|H|−r |X1 = . . . = X|H| )). AH + (18)

Remark 4. If G and H are complete graphs, then  mr P|Q (X1 = . . . = X2|H|−r |X1 = . . . = X|H| ) =

 2|H| − r E(Wr |Q) , |H| E(W |Q)

where Wr is the number of subgraphs with 2|H|−r vertices. In this case an equivalent formulation of (18) is |H|−1

sup |P (W ∈ A)−E(P oΛ (A))| ≤ ( A

X r=1

 |H|−1  E(Wr |Q) E(W ) X 2|H| − r + E( ). mr +1) AH |H| E(W |Q) r=1

Example 6. Consider an area, holding n animals a1 , . . . , an of the same species, living isolated. Let the area be divided into k territories, denoted by t1 , . . . , tk . Consider animals that live in the same territory as competitive. Let X1 , . . . , Xn be random variables defined through Xi = j if the i-th animal lives in the j-th territory. (i = 1, . . . , m, j = 1, . . . , k). We want to fix a Bayesian model for the number W of groups of d competitive animals, under the following hypotheses: (A1) the probability distribution of X1 , . . . , Xn is exchangeable; (A2) X1 , . . . , Xn can be extended to a much longer exchangeable sequence; (A3) the animals are sparse in the territory; (A4) groups of d competing animals are exceptional, i.e. the mean value of W is negligible with respect to n. 16

Assumptions (A1) and (A2) imply that the probability distribution of X1 , . . . , Xn is well approximated by the law of an infinite exchangeable sequence (see Aldous, 1985). Let Q be the directing random probability measure in de Finetti’s representation of the above infinite sequence and, for every B, let P|Q (B) denote a version of the conditional probability of B, given Q. Let assumption (A3) imply that, whatever the values of Q1 , . . . , Qk , the conditional probability that ai+1 lives in the same area as a1 , . . . , ai , given that a1 , . . . , ai live in the same area, is small compared to 1/n (i = 1, . . . , d − 1). Then nP|Q (X1 = . . . = Xi+1 |X1 = . . . = Xi ) is negligible, for every i = 1, 2, . . .. Consider a complete graph G, with vertices a1 , . . . , an . Construct a random graph Γ on G with exchangeable hidden colors X1 , . . . , Xn . Then W coincides with the number of complete subgraphs of Γ with d vertices. Since   Pk d−r 2d−r Y n−d 1 i=1 Q nP|Q (X1 = . . . = Xi+1 |X1 = . . . = Xi ), Pk i d ≤ d−r (d − r)! i=1 Qi i=1

and (A4) holds, we can treat |H|−1

(

X r=1

 d−1  E(W ) X n − d + E(P|Q (X1 = . . . = X2d−r |X1 = . . . = Xd )) mr + 1) AH d−r r=1

as negligible. By Theorem 5, we can fix a Poisson model for W .

5



Final remarks.

Random graphs with exchangeable hidden colors can be used to model clusters in Bayesian Statistics. Consider a set of n objects. A random clustering is a random partition of the objects into subsets. Clusters are generally based on some random characteristic of the objects: two objects are in the same class if their characteristics are close in some sense. If the objects are represented by the vertices of a complete graph G, a random cluster can be represented by a complete random subgraph of G. We can represent random clustering, based on a random characteristic, by a random graph with hidden colors. Vertices colors represent the objects characteristics so that a cluster is represented by the set of vertices with the same color. If the law of the random characteristic is not known, a prior probablity distribution can be assigned to it. By de Finetti’s representation theorem, this is equivalent to assume that the sequence of vertices colors is exchangeable and can be extended to an infinite exchangeable sequence of random variables. Hence, in a Bayesian framework, clustering can be modelled with random graphs with exchangeable hidden colors. A possible extension is to allow vertices to be joined, according to some rule, on the base of their colors. Let C be a k × k symmetric matrix of zeros and ones. Colors i and j are said to interact if there is a one in C, at position ij. After coloring the vertices of a graph G, according to an exchangeable law, join all the vertices that interact. If C is the 17

identity matrix, the random graph coincides with random graph with exchangeable hidden colors, as defined in Section 2. A further extension could be to join vertices with colors i and j with probability cij , C = (cij )ij being a symmetric matrix, with elements in [0, 1]. Random graphs of this type, but with independent vertices colors, were studied in Penman (1998), S¨oderberg (2003) and Cannings and Penman (2003). As noted in Cannings and Penman (2003), such graphs arise naturally in consideration of proteins as polymers. The folding of the polymer creates various domains, capable of interacting with domains on other proteins. A natural model considers each protein as receiving a random subsets of the set of the possible domains and allowing interactions according to some rule based on domains. The usual assumption is that each protein’s set of domains is independent of that of the other proteins. If the probabilities Q1 , . . . , Qk of the subsets of domains are not known, they can be treated as random quantities and the assumption of independence can be substituted by conditional independence given Q = (Q1 , . . . , Qk ), or, equivalently, by exchangeability. Random graphs with exchangeable hidden colors can also be extended in a different direction, i.e. by assuming that the colors are fixed according to a finite exchangeable law. In fact the assumption that the finite exchangeable sequence of colors can be extended to an infinite exchangeable sequence is probably not very realistic for many applications. It seems more realistic to assume that the distribution of Q1 , . . . , Qk depends on n. Consider, for example, the competing model of Example 6. If the number of specimen in the area is large, the animals position will show a stronger tendency to be equidistant with respect to the situation in which few specimen live in the area. Unfortunately de Finetti’s representation of exchangeable random variables as conditionally independent and identically distributed does not hold in the finite case. A representation is still valid (see Aldous, 1985) but it is more involved. We end this Section with a remark on the probability distribution of the edges in a random graph with exchangeable hidden colors, when the underlying graph G = (V = {1, . . . , n}, E) is invariant with respect to permutations of the vertices, i.e. G is isomorphic to π(G), where π is a permutation of (1, . . . , n) and π(G) = ({1, . . . , n}, {π(i)π(j) : i, j ∈ E}). Then (Yij )ij∈E is invariant with respect to permutations of both i and j: P (∩ij∈E Yij ∈ Aij ) = P (∩ij∈E Yij ∈ Aπ(i)π(j) ), for every (Aij )ij∈E with Aij ⊂ {0, 1}. Aknowledgements. We wish to thank Donato Cifarelli for suggesting us the starting idea of this work.

18

References ´ ´ e de ProbaAldous, D.J. (1985). Exchangeability and related topics. Ecole d’Et´ biliti´es de Saint-Flour XIII - Lecture Notes in Math. 1117. Springer-Verlag, Berlin Barbour, A.D., Holst, L. and Janson, S. (1992) Poisson Approximations. Oxford University Press. Biggins, J.D. (2003) Large deviations for mixtures. Preprint - University of Sheffield, U.K. Biggins, J.D. and Penman, D.B. (2003) Large deviations in randomly coloured random graphs. Preprint - University of Sheffield, U.K. Billingsley, P. (1995) Probability and Measure. John Wiley and Sons. ´s, B. (1985) Random Graphs. Academic Press. Bolloba Camarri, M and Pitman, J. (2000) Limit distributions and random trees derived from the birthday problem with unequal probabilities. Electronic Journal of Probability, 5, 1-18. Cannings, C. and Penman, D.B. (2003) Models of random graphs and their applications. Handobook of Statistics 21. Stochastic Processes: Modelling and Simulation. Eds: D.N. Shanbhag, and C.R. Rao. Elsevier, 51-91. Chen, L.H.Y. (1975) Possion approximations for dependent trials. Annals of Probability, 3, 534-45. Diaconis, P. and Holmes, S. (2002) A Bayesian peek into Feller Volume I. Sankhy¯ a, Series A, 64, 820-841. Diestel, R.(2000) Graph Theory. 2nd Ed. Springer-Verlag, New York. ˝ s, P., Re ´ny, A. (1960) On the evolution of random graph. Publ. Math. Inst. Erdo Hungar. Acad. Sci. 5, 17. Henze, N. (1998) A Poisson limit law for a generalized birthday problem. Statistics & Probability Letters, 39, 333-336. Janson, S. (1986) Birthday problems, randomly coloured graphs and Poisson limits of sums of dissociated variables. Uppsala University, Department of Mathematics, Report No. 1986:16 Janson, S. (1987) Poisson convergence and Poisson processes with applications to random graphs. Stochastic processes and their Applications, 26, 1-30. Joag-Dev, K and Proschan, F. (1992) Birthday problem with unilke probabilities. The American Mathematical Monthly, 99, 10-12. 19

Mase, S. (1992) Approximations to the birthday problem with unequal occurence probabilities and their application to the surname problem in Japan. Annals of the Institute of Statistical Mathematics, 44, 3, 479-499. Newman, M.E.J. (2003) Random graphs as models of networks. In Handbook of Graphs and Networks, S. Bornholdt and H. G. Schuster (eds.). Wiley-VCH, Berlin. Penman, D.B. (1998) Random graphs with correlation structure. Ph.D Thesis, University of Sheffield. ¨ derberg, B. (2003) Random graphs with Hidden Color. Phys. Rew. E 68 015102. So ¨ Von Mises, R. (1939) Uber aufteilungs und besetzungswahrscein lichkeiten, Rev. Fac. Sci. Istambul, 4, 145-163. Wilks, S.S. (1962) Mathematical Statistics. Wiley, New York.

20