Truncated Random Measures


arXiv:1603.00861v1 [math.ST] 2 Mar 2016

TREVOR CAMPBELL*, JONATHAN H. HUGGINS*, JONATHAN HOW, AND TAMARA BRODERICK

Abstract. Completely random measures (CRMs) and their normalizations are a rich source of Bayesian nonparametric (BNP) priors. Examples include the beta, gamma, and Dirichlet processes. In this paper we detail three classes of sequential CRM representations that can be used for simulation and posterior inference. These representations subsume existing ones that have previously been developed in an ad hoc manner for specific processes. Since a complete infinite-dimensional CRM cannot be used explicitly for computation, sequential representations are often truncated for tractability. We provide truncation error analyses for each type of sequential representation, as well as their normalized versions, thereby generalizing and improving upon existing truncation error bounds in the literature. We analyze the computational complexity of the sequential representations, which in conjunction with our error bounds allows us to study which representations are the most efficient. We include numerous applications of our theoretical results to popular (normalized) CRMs, demonstrating that our results provide a suite of tools that allow for the straightforward representation and analysis of CRMs that have not previously been used in a Bayesian nonparametric context.

Date: March 3, 2016.
* First authorship is shared jointly by T. Campbell and J. H. Huggins.

Contents

1. Introduction
2. Background
  2.1. CRMs and truncation
  2.2. Rate measures and likelihoods for some CRMs
3. Three sequential representations
  3.1. D-representation
  3.2. F-representation
  3.3. V-representation
  3.4. Stochastic mapping
4. Truncation analysis
  4.1. D-representation
  4.2. F-representation
  4.3. V-representation
  4.4. Stochastic mapping
  4.5. Hyperpriors
5. Normalized truncation analysis
  5.1. F- and V-representations
  5.2. D-representation
  5.3. Hyperpriors
6. Simulation algorithms and computational complexity
  6.1. D- and F-representations
  6.2. V-representation
  6.3. Summary of CRM results
7. Simulation study
8. Discussion
Appendix A. Poisson point process operations
Appendix B. Technical lemmas
Appendix C. Proofs of sequential representation results
  C.1. Correctness of D- and F-representations
  C.2. Power-law behavior of F-representations
Appendix D. Proofs of CRM truncation bounds
  D.1. Protobound
  D.2. D-representation truncation
  D.3. F-representation truncation
  D.4. V-representation truncation
  D.5. Stochastic mapping truncation
  D.6. Truncation with hyperpriors
  D.7. Comparing D- and F-representation truncations
Appendix E. Proofs of normalized truncation bounds
  E.1. F- and V-representation truncation
  E.2. D-representation truncation
  E.3. Truncation with hyperpriors
Appendix F. Proofs of V-representation sampling results
Appendix G. Additional simulation results
Acknowledgments
References

1. Introduction

In many data sets, we can view the data points as exhibiting a collection of underlying traits. For instance, each document in the New York Times might touch on a number of topics or themes, an individual's genetic data might be a product of the populations to which their ancestors belonged, a user's activity on a social network might be dictated by their varied personal interests, etc. In cases where the traits are not directly observed, a common approach is to model each trait as having some frequency or rate in the broader population (Airoldi et al., 2014). The inferential goal is to learn these rates as well as whether—and to what extent—each data point exhibits each trait. An additional challenge arises since, when the traits are unknown a priori, their cardinality is also typically unknown. Moreover, we expect the number of traits, and thus the complexity of the inference, to grow with the size of the data set. In the examples above, we expect to see a greater diversity of topics as we read more news documents, we expect to observe a greater number of ancestral populations as we examine more individuals' genetic data, and we expect there to be a greater number of unique interests and hobbies represented among a larger population on a social network.

Priors from Bayesian nonparametrics (BNP) provide the desired flexibility. First, they model the number of latent traits exhibited by a given set of data as a random variable, which may be learned as part of the inferential procedure. Second, by imagining a countable infinity of potential traits, these models allow the number of traits to grow with the size of the data. Note that any individual data point exhibits only finitely many traits—though the number itself may be unknown


and remain to be inferred. After finitely many data points have been observed, the countable infinity of (latent) potential traits guarantees that there are always more novel traits to be discovered in future data.

In practice, though, it is not possible to store a countable infinity of random variables in memory or learn the distribution over a countable infinity of variables in finite time. Conjugate priors and likelihoods have been developed for these problems (Orbanz, 2010) that theoretically allow us to avoid the infinite representation altogether and perform exact Bayesian posterior inference (Broderick et al., 2014). However, these priors and likelihoods are often just a single piece within a more complex generative model, and ultimately an approximate posterior inference scheme such as Markov chain Monte Carlo (MCMC) or variational Bayes (VB) is required. These approximation schemes often necessitate a full and explicit representation of the latent variables. One practical option is to approximate the infinite-dimensional prior with a finite-dimensional prior that maintains many of its desirable theoretical properties. For instance, one might approximate the full infinitude of traits by some representative finite subset of traits. It remains to decide how to choose the finite subset.

In theory, since there are a countable infinity of traits in the full model, we should be able to enumerate the traits and write a pair (ψ_k, θ_k) for each trait, where ψ_k is a parameter describing the trait (e.g., a topic that might appear in a document) and θ_k is the rate or frequency of the trait. Then the discrete measure

    ∑_{k=1}^∞ θ_k δ_{ψ_k}    (1.1)

captures the countable infinity of traits in a sequence indexed by k. When the θ_k and ψ_k are unknown, they are typically treated as random in a Bayesian model, so Eq. (1.1) gives a random measure. In many cases, the distribution of the random measure can be defined by specifying a sequence of (ideally simple and/or familiar) distributions for the finite-dimensional θ_k and ψ_k. Such representations are called sequential or, in certain contexts, stick-breaking. Such sequential representations have been shown to exist (Broderick et al., 2014) for completely random measures (CRMs) (Kingman, 1967), a wide class of priors that provides rates or frequencies for traits and encapsulates such popular models as the beta process (Hjort, 1990; Kim, 1999) and gamma process (Ferguson and Klass, 1972; Kingman, 1975; Brix, 1999; Titsias, 2008). CRM priors are often paired with likelihood processes—such as the Bernoulli process (Thibaux and Jordan, 2007), negative binomial process (Zhou et al., 2012; Broderick et al., 2015), and Poisson likelihood process (Titsias, 2008)—that assign each data point to a finite subset of traits. Sequential representations also exist, sometimes up to normalization, for normalized completely random measures (NCRMs) (Regazzini et al., 2003; Lijoi and Prünster, 2010),¹ another wide class that provides a distribution over traits and which includes the Dirichlet process (Ferguson, 1973; Sethuraman, 1994). NCRMs are typically paired with a discrete likelihood that assigns each data point to one trait using the distribution specified by the NCRM. Given a sequential representation as in Eq. (1.1), a natural way to choose a subset of traits is to keep the first K traits for some finite K and discard the rest. This approach is called truncation.

¹ NCRMs are sometimes referred to as normalized random measures with independent increments (NRMIs) (Lijoi and Prünster, 2010; Regazzini et al., 2003; James et al., 2009).


In practice, popular BNP priors such as the beta process (Teh et al., 2007; Doshi-Velez et al., 2009; Paisley et al., 2012b; Thibaux and Jordan, 2007) have multiple different representations of the form in Eq. (1.1). It is not immediately obvious which sequential representation should be chosen in a given application and, for that representation, what level of truncation should be chosen. Moreover, the full diversity of desirable behaviors in BNP models is not achieved by the beta process alone, so we must examine other (N)CRM priors as well. And while these other priors have fewer sequential representations available in the existing literature, the practitioner might wonder whether other, as yet undiscovered, sequential representations of these priors might afford better performance in their application.

We answer these questions in the current work. First, we provide a comprehensive characterization of the different types of sequential representations that are available to the practitioner for (N)CRMs. In doing so, we introduce a number of new sequential representations for (N)CRMs. By contrast, the development of new sequential representations has previously happened on an ad hoc, case-by-case basis. These results alone are useful in a variety of approximation schemes, including the provision of new slice sampling approaches in MCMC (Neal, 2003). VB, though, at least in its present form, necessitates the use of truncation in the application of nonparametric Bayesian models (Doshi-Velez et al., 2009; Paisley et al., 2012a). And, in fact, truncated models may be used as part of an MCMC sampler as well (Ishwaran and James, 2001; Fox et al., 2010; Johnson and Willsky, 2013). Thus, we provide both theoretical and empirical analyses of the error induced when truncating the sequential representations we introduce. We give the truncation error as a function of the prior process, the likelihood process, and the level of truncation.
While truncation error bounds for (N)CRMs have been studied extensively, past work has focused on specific combinations of (N)CRM priors and likelihoods—in particular, the Dirichlet-multinomial (Sethuraman, 1994; Ishwaran and James, 2001; Ishwaran and Zarepour, 2002; Blei and Jordan, 2006), beta-Bernoulli (Paisley et al., 2012b; Doshi-Velez et al., 2009), and gamma-Poisson (Roychowdhury and Kulis, 2015) conjugate processes. In the current work, we give a much more general formulation of truncation error. Our results fill in a large gap in the analysis of truncation error. They include the first analysis of truncation error for some sequential representations of the beta process with Bernoulli likelihood (Thibaux and Jordan, 2007), for the beta process with negative binomial likelihood (Zhou et al., 2012; Broderick et al., 2015), and for the normalization of the generalized gamma process (Brix, 1999), the σ-stable process, and the generalized inverse gamma (Lijoi et al., 2005; Lijoi and Prünster, 2010) with discrete likelihood. Moreover, even when truncation results already exist in the literature (Ishwaran and James, 2001; Doshi-Velez et al., 2009; Paisley et al., 2012b; Roychowdhury and Kulis, 2015), we improve on those error bounds by a factor of two. The reduction arises from our use of the point process machinery of CRMs; we thereby circumvent the total variation bound used originally by Ishwaran and James (2001, 2002), upon which most modern truncation analyses are built.

The remainder of this paper is organized as follows. In Section 2, we develop the conceptual and notational background behind CRMs. In our first main theoretical section, Section 3, we characterize sequential CRM representations by defining three broad types of construction. The first, based on Bondesson (1982), is the most


specialized and takes the form of a (stochastic) functional of a homogeneous Poisson point process. The second, which has a fixed rate of atoms per round, is a "decoupled" version of the first type of representation. The final representation includes numbers of atoms at different rates per round (Broderick et al., 2014; James, 2014). We illustrate the broad applicability and ease of use of the three representations with examples from popular CRMs. Next, we provide a theoretical analysis of the truncation error of each of the three CRM constructions in Section 4. We provide analogous theory for the normalized (NCRM) version of each CRM construction in Section 5 via an infinite extension of the so-called "Gumbel-max trick" (Gumbel, 1954) and standard Poisson point process (Kingman, 1993) machinery. We determine the complexity of simulating from each representation in Section 6. In Section 7, we provide simulation experiments, which demonstrate the practicality of our proposed toolset, as well as an example of how to choose between the three representations in practice and how to choose the appropriate truncation level for a given representation. Proofs for all results developed in this paper are provided in the appendices.

2. Background

2.1. CRMs and truncation. Consider a Poisson point process on R+ := [0, ∞) with rate measure ν(dθ) such that

    ν(R+) = ∞   and   ∫ min(1, θ) ν(dθ) < ∞.    (2.1)
Such a process generates a countable infinity of values (θ_k)_{k=1}^∞, θ_k ∈ R+, having an almost surely finite sum ∑_{k=1}^∞ θ_k < ∞. In a BNP trait model, we interpret each θ_k as the rate or frequency of the k-th trait. Typically, each θ_k is paired with a parameter ψ_k associated with the k-th trait (e.g., a topic in a document, an ancestral population, or a shared interest on a social network). We assume throughout that ψ_k ∈ Ψ for some space Ψ and ψ_k iid∼ G for some distribution G. Constructing a measure by placing mass θ_k at atom location ψ_k results in a completely random measure (CRM) (Kingman, 1967). As shorthand, we will write CRM(ν) for the completely random measure generated as just described:

    Θ := ∑_k θ_k δ_{ψ_k} ∼ CRM(ν).    (2.2)

The trait distribution G is left implicit in our notation as it has no effect on the results of the present work. Further, the possible fixed-location and deterministic components of a CRM (Kingman, 1967) are not considered in this work for the purpose of clarity; these components can be added (assuming an atomic deterministic component, if it exists) and the analysis modified without undue effort. This prior on the traits and trait frequencies in Θ is typically combined with a likelihood that generates trait counts for each data point. Consider any conditional distribution h(· | θ), which we assume to be a proper probability mass function on N ∪ {0} for


any θ in the support of ν,² with the additional requirement that

    ∫ (1 − h(0 | θ)) ν(dθ) < ∞.    (2.3)

Then suppose X_{1:N} := {X_n}_{n=1}^N is a collection of conditionally independent observations given Θ. In particular, we say that X_{1:N} are distributed according to the likelihood process LP(h, Θ), i.e.,

    X_n := ∑_k x_{nk} δ_{ψ_k} iid∼ LP(h, Θ),    (2.4)

if x_{nk} ∼ h(x | θ_k) independently across k and i.i.d. across n. The desideratum that each X_n expresses a finite number of traits is encoded in Eq. (2.3). Since the trait counts are typically latent in a full generative model specification, define the observed data Y_n | X_n indep∼ f(· | X_n) for a conditional density f with respect to a measure µ on some space. For instance, if the sequence (θ_k)_{k=1}^∞ represents the topic rates in a document corpus, X_n may capture how many words in document n are generated from each topic, and Y_n might be the observed collection of words for that document.

Since the sequence (θ_k)_{k=1}^∞ is countably infinite, it may be difficult to simulate exactly from this generative model—or perform posterior inference in it. One approximation scheme is to define a truncation Θ_K := ∑_{k∈K} θ_k δ_{ψ_k}, where K is some finite subset of the true indices N. The truncation Θ_K can be used for exact simulation or in posterior inference—but some error arises from not using the full CRM Θ. To quantify this error, consider its propagation through the above Bayesian generative model. Define Z_{1:N} and W_{1:N} for Θ_K analogous to the definitions of X_{1:N} and Y_{1:N} for Θ:

    Z_n | Θ_K iid∼ LP(h, Θ_K),   W_n | Z_n indep∼ f(· | Z_n),   n = 1, . . . , N.    (2.5)

A standard approach to measuring the distance between Θ and Θ_K, to which we adhere, is to use the L1 metric between the marginal densities p_{N,∞} and p_{N,K} (with respect to µ) of the final observations Y_{1:N} and W_{1:N} (Ishwaran and James, 2001; Doshi-Velez et al., 2009; Paisley et al., 2012b):

    ‖p_{N,∞} − p_{N,K}‖₁ := ∫ |p_{N,∞}(y_{1:N}) − p_{N,K}(y_{1:N})| µ(dy_{1:N}).    (2.6)

2.2. Rate measures and likelihoods for some CRMs. A few well-studied CRMs will make frequent appearances throughout the remainder of the paper. Their definitions and relevant notation are reviewed here briefly, along with typical choices of paired likelihood h(x | θ). Unless otherwise noted, we employ the generalized versions of each CRM (typically involving three free parameters) without explicit nomenclature denoting them as such. For example, we refer to the ΓP(γ, λ, d) as a gamma process, though it is typically referred to in the literature as a generalized gamma process (Brix, 1999; Lijoi and Prünster, 2010). This is simply for brevity; it will be clear in context which parameterization is being used. It should also be noted that the three processes listed below have been generalized

² Likelihoods with support in R may also be considered. As long as Eq. (2.3) holds, all results except those relating to size-biased V-representations still hold, since they only rely upon the behavior of h(0 | θ).


by exponential family conjugate processes (Broderick et al., 2014), and that it is straightforward to apply a similar generalization to the results of the present work.

2.2.1. Beta process. The beta process (Teh and Görür, 2009; Broderick et al., 2012), denoted BP(γ, α, d), with discount parameter d ∈ [0, 1), concentration parameter α > −d, and mass parameter γ > 0, is a CRM with rate measure

    ν(dθ) = γ [Γ(α + 1) / (Γ(1 − d)Γ(α + d))] 1[θ ≤ 1] θ^{−1−d} (1 − θ)^{α+d−1} dθ.    (2.7)

Setting d = 0 yields the standard beta process (Hjort, 1990; Thibaux and Jordan, 2007). The beta process is often paired with a Bernoulli likelihood,

    h(x | θ) = 1[x ≤ 1] θ^x (1 − θ)^{1−x},    (2.8)

or a negative binomial likelihood with number of failures s ∈ N,

    h(x | θ) = C(x + s − 1, x) (1 − θ)^s θ^x.    (2.9)
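Since h(· | θ) must be a proper probability mass function (cf. the requirement around Eq. (2.3)), it is easy to verify Eqs. (2.8) and (2.9) numerically. The helper names below are ours—a quick sketch, not code from the paper:

```python
from math import comb

def bernoulli_pmf(x, theta):
    # Eq. (2.8): h(x | theta) = 1[x <= 1] theta^x (1 - theta)^(1 - x)
    return theta**x * (1.0 - theta)**(1 - x) if x in (0, 1) else 0.0

def negbin_pmf(x, theta, s):
    # Eq. (2.9): h(x | theta) = C(x + s - 1, x) (1 - theta)^s theta^x
    return comb(x + s - 1, x) * (1.0 - theta)**s * theta**x

# Both are proper pmfs on the nonnegative integers for theta in (0, 1):
theta, s = 0.3, 4
assert abs(bernoulli_pmf(0, theta) + bernoulli_pmf(1, theta) - 1.0) < 1e-12
assert abs(sum(negbin_pmf(x, theta, s) for x in range(200)) - 1.0) < 1e-9
```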

2.2.2. Beta prime process. The beta prime process (Broderick et al., 2014, 2015), denoted BPP(γ, α, d), with discount parameter d ∈ [0, 1), concentration parameter α > −d, and mass parameter γ > 0, is a CRM with rate measure

    ν(dθ) = γ [Γ(α + 1) / (Γ(1 − d)Γ(α + d))] θ^{−1−d} (1 + θ)^{−α} dθ.    (2.10)

The beta prime process is often paired with an odds Bernoulli likelihood,

    h(x | θ) = 1[x ≤ 1] θ^x (1 + θ)^{−1}.    (2.11)

2.2.3. Gamma process. The gamma process (Brix, 1999), denoted ΓP(γ, λ, d), with discount parameter d ∈ [0, 1), scale parameter λ > 0, and mass parameter γ > 0, is a CRM with rate measure

    ν(dθ) = γ [λ^{1−d} / Γ(1 − d)] θ^{−d−1} e^{−λθ} dθ.    (2.12)

Setting d = 0 yields the standard gamma process (Ferguson and Klass, 1972; Kingman, 1975; Titsias, 2008). The gamma process is often paired with a Poisson likelihood,

    h(x | θ) = (θ^x / x!) e^{−θ}.    (2.13)

Throughout the present work, we use the rate parameterization of the gamma distribution (to match the gamma process parameterization), for which the density is given by

    Gam(x; a, b) = [b^a / Γ(a)] x^{a−1} e^{−bx}.    (2.14)
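Because software libraries disagree on gamma conventions, it is worth being explicit about Eq. (2.14). For instance, Python's `random.gammavariate(alpha, beta)` takes a scale β, so a draw from Gam(a, b) in the rate parameterization above is `gammavariate(a, 1/b)`. A small check of the density (our own sketch, not from the paper):

```python
import math

def gam_density(x, a, b):
    # Eq. (2.14), rate parameterization: b^a x^(a-1) e^(-b x) / Gamma(a).
    # NB: random.gammavariate(alpha, beta) takes a SCALE beta, so a draw
    # from Gam(a, b) as defined here is random.gammavariate(a, 1.0 / b).
    return b**a * x**(a - 1) * math.exp(-b * x) / math.gamma(a)

# At a = 1 the density reduces to that of Exp(rate b): b e^(-b x).
assert abs(gam_density(0.5, 1.0, 2.0) - 2.0 * math.exp(-1.0)) < 1e-12

# The mean of Gam(a, b) is a/b; check by a crude Riemann sum.
dx = 1e-3
mean = sum(x * gam_density(x, 3.0, 2.0) * dx
           for x in (i * dx for i in range(1, 20001)))
assert abs(mean - 1.5) < 1e-2
```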


3. Three sequential representations

Sequential representations are at the heart of the study of truncated CRMs. They provide an iterative method that can be terminated at any point to yield a finite approximation to the infinite process, where the choice of termination point determines the accuracy of the approximation. Thus, the natural first step in providing a coherent treatment of truncation analysis is to do the same for sequential representations. In past work, three classes of sequential representation have been used, though the nomenclature we use for them is our own. The first takes the simple form ∑_{k=1}^∞ θ_k δ_{ψ_k}; since there is a deterministic number of atoms for each outer sum index, we call the first type D-representations. The next are of the form ∑_{k=1}^∞ ∑_{i=1}^{C_k} θ_{ki} δ_{ψ_{ki}}, where the C_k are i.i.d. Poisson random variables. Since the C_k are from a fixed distribution, we call the second type F-representations. The final class is of the same form as F-representations; however, the C_k are independent (but not identically distributed) Poisson random variables. Since the C_k have varying distributions, we call the third type V-representations.

This section examines a D-representation originating with Bondesson (1982), introduces two novel classes of F-representations, discusses a V-representation due to Broderick et al. (2014) and James (2014), and places past sequential representations in these three categories. Finally, we discuss the use of a stochastic mapping procedure, which is especially useful for obtaining power-law F-representations. Proofs for the results presented in this section may be found in Appendix C. The proofs in Appendix C, as well as later results, make liberal use of the Poisson point process calculus summarized in Appendix A and some technical lemmas given in Appendix B.

3.1. D-representation. We first investigate the construction of D-representations. For any c > 0, let Γ_{ck} = ∑_{ℓ=1}^k E_{cℓ}, E_{cℓ} iid∼ Exp(c), be the ordered jumps of a homogeneous Poisson point process with rate c. Ferguson and Klass (1972) proposed an elegant and general approach for representing CRMs with arbitrary rate measures:

    Θ = ∑_{k=1}^∞ θ_k δ_{ψ_k} ∼ CRM(ν),  with  θ_k := inf{x : ν([x, ∞)) ≤ Γ_{1k}}.    (3.1)
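Eq. (3.1) is directly usable whenever the tail measure ν([x, ∞)) inverts in closed form. One such case (our worked illustration, not an example from the paper): for BP(γ, 1, 0), ν(dθ) = γ θ^{−1} 1[θ ≤ 1] dθ, so ν([x, 1]) = γ log(1/x) and θ_k = exp(−Γ_{1k}/γ). A minimal truncated sampler, with names of our own choosing:

```python
import math, random

def ferguson_klass_bp(gamma_mass, K, seed=0):
    """First K Ferguson-Klass atom weights (Eq. (3.1)) for BP(gamma_mass, 1, 0).

    The tail measure nu([x, 1]) = gamma_mass * log(1/x) inverts in closed form,
    giving theta_k = exp(-Gamma_1k / gamma_mass), where Gamma_1k are the ordered
    jumps (partial sums of Exp(1) draws) of a unit-rate Poisson process.
    """
    rng = random.Random(seed)
    weights, jump = [], 0.0
    for _ in range(K):
        jump += rng.expovariate(1.0)                  # Gamma_1k
        weights.append(math.exp(-jump / gamma_mass))  # invert the tail measure
    return weights

w = ferguson_klass_bp(2.0, 50)
assert all(0.0 < x <= 1.0 for x in w)
assert all(a > b for a, b in zip(w, w[1:]))  # FK weights are strictly decreasing
```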

However, inverting the function ν([x, ∞)) is analytically intractable except in a few cases. A more specialized approach, originating with Bondesson (1982), leads to tractable representations in a wider variety of settings, as we shall see in the examples below. For c > 0 and g a distribution or density on R+, we say Θ has a Bondesson D-representation and write Θ ← B-DRep(c, g) if

    Θ = ∑_{k=1}^∞ θ_k δ_{ψ_k},  with  θ_k := V_k e^{−Γ_{ck}},  V_k iid∼ g,  ψ_k iid∼ G.    (3.2)

The following result shows that Bondesson D-representations can be constructed for a certain class of CRM rate measures.

Theorem 3.1 (Bondesson D-representation). Let ν(dθ) = ν(θ)dθ be a rate measure satisfying Eq. (2.1). If θν(θ) is nonincreasing, lim_{θ→∞} θν(θ) = 0, and c_ν := lim_{θ→0} θν(θ) < ∞, then g_ν(v) := −c_ν^{−1} (d/dv)[v ν(v)] is a density on R+ and

    Θ ← B-DRep(c_ν, g_ν)  implies  Θ ∼ CRM(ν).    (3.3)
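To preview how Theorem 3.1 is used in practice: for ΓP(γ, λ, 0) the constants work out to c_ν = γλ and g_ν = Exp(λ) (these are derived in Example 3.2 below), so a truncated Bondesson D-representation can be sampled directly. A sketch with our own naming, not the authors' code:

```python
import math, random

def bondesson_drep_gamma(gamma_mass, lam, K, seed=0):
    """First K atom weights of B-DRep(gamma_mass * lam, Exp(lam)), Eq. (3.2),
    which by Theorem 3.1 truncates GammaP(gamma_mass, lam, 0); the constants
    c_nu = gamma_mass * lam and g_nu = Exp(lam) are those of Example 3.2."""
    rng = random.Random(seed)
    c = gamma_mass * lam
    weights, jump = [], 0.0
    for _ in range(K):
        jump += rng.expovariate(c)            # Gamma_ck: jumps of a rate-c process
        v = rng.expovariate(lam)              # V_k ~ g_nu = Exp(lam)
        weights.append(v * math.exp(-jump))   # theta_k = V_k e^{-Gamma_ck}
    return weights

w = bondesson_drep_gamma(3.0, 2.0, 200)
assert len(w) == 200 and all(x > 0 for x in w)
```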


We offer a novel, rigorous proof of Theorem 3.1 in Appendix C using the strategy introduced by Banjevic et al. (2002). The Bondesson D-representation approach is not as general as those available for F- and V-representations, and can be difficult to analyze theoretically due to the coupling between the atoms, but it does tend to produce very simple representations with small truncation error (cf. Sections 4 and 7).

Example 3.1 (Beta process, BP(γ, α, 0), α ≥ 1). If α > 1 and d = 0, then θν(θ) = γα(1 − θ)^{α−1} 1[θ ≤ 1] is nonincreasing, c_ν = lim_{θ→0} θν(θ) = γα, and g_ν(v) = (α − 1)(1 − v)^{α−2} = Beta(v; 1, α − 1). Thus, it follows from Theorem 3.1 that if Θ ← B-DRep(γα, Beta(1, α − 1)), then Θ ∼ BP(γ, α, 0). In the case of α = 1, g_ν = δ₁, so V_k ≡ 1. Since exp(−E_{ck}) ∼ Beta(1, c), the representation used in Teh et al. (2007) is equivalent to the Bondesson D-representation for BP(γ, 1, 0).

Example 3.2 (Gamma process, ΓP(γ, λ, 0)). The following representation for the gamma process with d = 0 was described by Bondesson (1982) and Banjevic et al. (2002). Since θν(θ) = γλ e^{−λθ} is nonincreasing and c_ν = lim_{θ→0} θν(θ) = γλ, we obtain g_ν(v) = λ e^{−λv} = Exp(v; λ). Thus, it follows from Theorem 3.1 that if Θ ← B-DRep(γλ, Exp(λ)), then Θ ∼ ΓP(γ, λ, 0).

Example 3.3 (Beta prime process, BPP(γ, α, 0)). If d = 0, then θν(θ) = γα(1 + θ)^{−α} is nonincreasing and c_ν = lim_{θ→0} θν(θ) = γα, so g_ν(v) = α(1 + v)^{−α−1} = Beta′(v; 1, α). Thus, it follows from Theorem 3.1 that if Θ ← B-DRep(γα, Beta′(1, α)), then Θ ∼ BPP(γ, α, 0).

3.2. F-representation. For Bondesson D-representations, the atom weights are coupled by the Poisson process Γ_{ck}. We now introduce a novel alternative that decouples the atom weights. For c > 0, ξ > 0, and g a distribution or density on R+, we say Θ has a decoupled Bondesson F-representation and write Θ ← B-FRep(c, g, ξ) if

    Θ = ∑_{k=1}^∞ ∑_{i=1}^{C_k} θ_{ki} δ_{ψ_{ki}},  with  C_k iid∼ Poiss(c/ξ),  θ_{ki} := V_{ki} e^{−T_{ki}},    (3.4)
    T_{ki} indep∼ Gam(k, ξ),  V_{ki} iid∼ g,  ψ_{ki} iid∼ G.
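A sketch of sampling the first K outer rounds of B-FRep(c, g, ξ) from Eq. (3.4), instantiated with g = Exp(λ) and c = γλ so that (by Theorem 3.2 below, with the constants of Example 3.2) the result truncates ΓP(γ, λ, 0). The Poisson helper is a textbook inversion sampler, and all names are ours:

```python
import math, random

def poisson(rng, mu):
    # Textbook inversion sampler for Poisson(mu); adequate for small means.
    x, p, s, u = 0, math.exp(-mu), math.exp(-mu), rng.random()
    while u > s:
        x += 1
        p *= mu / x
        s += p
    return x

def bfrep_gamma(gamma_mass, lam, xi, K, seed=0):
    """Atom weights from the first K outer rounds of
    B-FRep(gamma_mass * lam, Exp(lam), xi), Eq. (3.4)."""
    rng = random.Random(seed)
    c = gamma_mass * lam
    weights = []
    for k in range(1, K + 1):
        for _ in range(poisson(rng, c / xi)):   # C_k ~ Poiss(c / xi)
            t = rng.gammavariate(k, 1.0 / xi)   # T_ki ~ Gam(k, xi); rate xi = scale 1/xi
            v = rng.expovariate(lam)            # V_ki ~ g = Exp(lam)
            weights.append(v * math.exp(-t))    # theta_ki = V_ki e^{-T_ki}
    return weights

w = bfrep_gamma(3.0, 2.0, xi=1.0, K=25, seed=1)
assert len(w) > 0 and all(x > 0 for x in w)
```

Unlike the D-representation, the weights here are mutually independent given the C_k, which is what makes this representation easier to analyze in later sections.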

The next result shows that if a CRM rate measure admits a Bondesson D-representation then it also admits a decoupled Bondesson F-representation.

Theorem 3.2 (Decoupled Bondesson F-representation). Let ν(dθ) = ν(θ)dθ be a rate measure satisfying the conditions in Theorem 3.1, and let c_ν and g_ν have the same definitions as in Theorem 3.1. Then for any fixed ξ > 0,

    Θ ← B-FRep(c_ν, g_ν, ξ)  implies  Θ ∼ CRM(ν).    (3.5)

The proof of Theorem 3.2 in Appendix C generalizes the arguments from Paisley et al. (2010) and Roychowdhury and Kulis (2015). The free parameter ξ controls the number of atoms generated for each outer sum index k. Discussion of the principled selection of ξ via optimization is deferred until Section 6, as it requires the truncation error bounds developed in Section 4 and the simulation complexity results found in Section 6.

Example 3.4 (Beta, gamma, and beta prime processes). Arguments paralleling those made in Examples 3.1 and 3.2 show that the Paisley et al. (2010) BP(γ, α, 0)


and the Roychowdhury and Kulis (2015) ΓP(γ, λ, 0) representations follow directly from applications of Theorem 3.2. Constructing a BPP(γ, α, 0) decoupled Bondesson F-representation is straightforward using Theorem 3.2 and the arguments in Example 3.3.

An alternative class of F-representations leads to "power-law" (or "heavy-tailed") CRMs. For γ > 0, 0 ≤ d < 1, α > −d, and g a distribution or density on R+, we say Θ has a power-law F-representation and write Θ ← PL-FRep(γ, α, d, g) if

    Θ = ∑_{k=1}^∞ ∑_{i=1}^{C_k} θ_{ki} δ_{ψ_{ki}},  with  C_k iid∼ Poiss(γ),
    θ_{ki} := V_{ki} U_{kik} ∏_{j=1}^{k−1} (1 − U_{kij}),    (3.6)
    V_{ki} iid∼ g,  U_{kij} indep∼ Beta(1 − d, α + jd),  ψ_{ki} iid∼ G.

The name of this representation arises from the fact that it exhibits Types I and II power-law behavior (Broderick et al., 2012) under mild conditions, as shown by Theorem C.1 in Appendix C. A method for constructing power-law F-representations corresponding to different CRMs using stochastic mapping will be discussed in Section 3.4.

Example 3.5 (Beta process, BP(γ, α, d)). The decoupled Bondesson F-representation for BP(γ, α, 0) from Paisley et al. (2010) was extended by Broderick et al. (2012) to the BP(γ, α, d) setting. The Broderick et al. (2012) construction for the BP(γ, α, d) is in fact the "trivial" power-law F-representation PL-FRep(γ, α, d, δ₁).

3.3. V-representation. The final sequential CRM representation we present is due to Broderick et al. (2014). For rate measure ν(dθ) and conditional distribution h(x | θ) on N ∪ {0}, we say Θ has a size-biased V-representation and write Θ ← SB-VRep(ν, h) if

    Θ = ∑_{k=1}^∞ ∑_{x=1}^∞ ∑_{i=1}^{C_{kx}} θ_{kxi} δ_{ψ_{kxi}},  with  C_{kx} indep∼ Poiss(λ_{kx}),  ψ_{kxi} iid∼ G,
    θ_{kxi} indep∼ (1/λ_{kx}) h(0 | θ)^{k−1} h(x | θ) ν(dθ),  λ_{kx} := ∫ h(0 | θ)^{k−1} h(x | θ) ν(dθ).    (3.7)

If the rate measure ν and the likelihood h are selected to be a conjugate exponential family, λ_{kx} can be computed analytically and θ_{kxi} can be sampled using standard techniques. Under mild conditions, the size-biased V-representation has the distribution of a CRM with rate measure ν, as specified by Theorem 3.3.

Theorem 3.3 (Size-biased V-representation, Broderick et al. (2014)). Let ν(dθ) be a rate measure and let h(x | θ) be a proper distribution on N ∪ {0}, where h and ν satisfy the conditions in Eqs. (2.1) and (2.3). Then

    Θ ← SB-VRep(ν, h)  implies  Θ ∼ CRM(ν).    (3.8)
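To make Theorem 3.3 concrete, the sketch below instantiates SB-VRep(ν, h) for the beta–Bernoulli pair with d = 0, using the quantities worked out in Example 3.6 below: only x = 1 contributes, λ_{k1} = γα/(α + k − 1), and the weights are Beta(1, α + k − 1) draws. The Poisson helper and all names are ours, not the authors':

```python
import math, random

def poisson(rng, mu):
    # Textbook inversion sampler for Poisson(mu).
    x, p, s, u = 0, math.exp(-mu), math.exp(-mu), rng.random()
    while u > s:
        x += 1
        p *= mu / x
        s += p
    return x

def sb_vrep_beta_bernoulli(gamma_mass, alpha, K, seed=0):
    """Atom weights from the first K rounds of SB-VRep(nu, h), Eq. (3.7), for
    BP(gamma_mass, alpha, 0) with Bernoulli likelihood. Since h puts mass only
    on x in {0, 1}, lambda_kx = 0 for x > 1 and only the x = 1 terms remain."""
    rng = random.Random(seed)
    weights = []
    for k in range(1, K + 1):
        lam_k1 = gamma_mass * alpha / (alpha + k - 1)   # Eq. (3.9)
        for _ in range(poisson(rng, lam_k1)):           # C_k1 ~ Poiss(lambda_k1)
            weights.append(rng.betavariate(1.0, alpha + k - 1))
    return weights

w = sb_vrep_beta_bernoulli(2.0, 3.0, 100, seed=2)
assert len(w) > 0 and all(0.0 < x < 1.0 for x in w)
```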


The proof of this result, and its applications, are discussed extensively in Broderick et al. (2014) and so will not be repeated here. Similar ideas and types of representations were recently explored in James (2014).

Example 3.6 (Beta process, BP(γ, α, 0)). Setting ν to a beta process rate measure and h to a Bernoulli likelihood, we have

    λ_{k1} = γ [Γ(α + 1)/Γ(α)] ∫ (1 − θ)^{k−1} θ · θ^{−1} (1 − θ)^{α−1} dθ = γα/(α + k − 1),    (3.9)

and θ_{kxi} ∼ Beta(1, α + k − 1), demonstrating that the construction due to Thibaux and Jordan (2007) is a special case of the size-biased V-representation for BP(γ, α, 0).

3.4. Stochastic mapping. We next describe how one CRM can be obtained as the transformation of another CRM. The primary application of this technique will be to obtain power-law F-representations for processes other than the beta process by transforming the power-law F-representation for BP(γ, α, d). This transformation approach, which we call stochastic mapping, is a powerful technique that takes advantage of operations on the underlying Poisson point process. Indeed, given a truncation error bound for any CRM, the stochastic mapping of that CRM inherits a transformed error bound; this is discussed further in Section 4.4. The next lemma, which forms the basis of stochastic mapping, is a well-known result that follows immediately from applying Lemmas A.1 and A.4 to CRMs.

Lemma 3.4 (CRM stochastic mapping). Let Θ = ∑_{k=1}^∞ θ_k δ_{ψ_k} ∼ CRM(ν). For any probability kernel κ(θ, du),

    κ(Θ) := ∑_{k=1}^∞ u_k δ_{ψ_k} ∼ CRM(ν_κ),  u_k | θ_k ∼ κ(θ_k, ·),    (3.10)

where

    ν_κ(du) := ∫ κ(θ, du) ν(dθ).    (3.11)

If κ(θ, du) = δ_{τ(θ)}(du) is deterministic, τ is invertible, and ν(dθ) = ν(θ)dθ, then u_k = τ(θ_k) and

    τ(Θ) := ∑_{k=1}^∞ τ(θ_k) δ_{ψ_k} ∼ CRM(ν_τ),    (3.12)

    ν_τ(du) := ν(τ^{−1}(u)) |dτ^{−1}/du (u)| du.    (3.13)

Thus, given a kernel κ and a representation for Θ ∼ CRM(ν), we immediately obtain a representation for κ(Θ) ∼ CRM(ν_κ). As detailed in the following example, to obtain novel power-law F-representations, we use the heuristic that identities for exponential family random variables can be extended to exponential CRMs.

Example 3.7 (F-representation for ΓP(γ, λ, d)). We will transform the beta process BP(γ, α, d) into the gamma process ΓP(γ, λ, d). For arbitrary a, b > 0, let B_{a,b} indep∼ Beta(a, b) and G_{a,b} indep∼ Gam(a, b). We use the well-known identity

    G_{a,λ} =_D G_{a+b,λ} B_{a,b}    (3.14)


extended to a < 0. We make the identifications for the "distributions" of the atoms by examining the rate measures of CRMs. For ΓP(γ, λ, d), the rate measure is ν(dθ) ∝ θ^{−d−1} e^{−λθ} dθ, which corresponds to an improper gamma distribution with parameters −d and λ. For BP(γ, α, d), the rate measure is ν(dθ) ∝ θ^{−d−1} (1 − θ)^{α+d−1} dθ, which corresponds to an improper beta distribution with parameters −d and α + d. So in the case of the gamma and beta processes, the identity Eq. (3.14) with a = −d, b = α + d, and λ = α becomes

    G_{−d,α} =_D G_{α,α} B_{−d,α+d}.    (3.15)

Thus, we apply the stochastic mapping

    θ ↦ u = Gθ,  G ∼ Gam(α, α),    (3.16)

which corresponds to the kernel

    κ(θ, du) = Gam(u/θ; α, α) θ^{−1} du.    (3.17)

Using Lemma 3.4 and change of variable w = u(θ−1 − 1), Z −1 αα uα−1 θ−α−d−1 e−αuθ (1 − θ)α+d−1 dθ du (3.18) νκ (du) = γα Γ(1 − d)Γ(α + d) Z αα = γα u−1−d e−αu wα+d−1 e−αw dw du (3.19) Γ(1 − d)Γ(α + d) α1−d −1−d −αu =γ u e du. (3.20) Γ(1 − d) Thus, we have shown formally that if Θ ∼ BP(γ, α, d) for α > 0, then κ(Θ) ∼ ΓP(γ, α, d). Applying the mapping to PL-FRep(γ, α, d, δ1 ), the power-law Frepresentation for BP(γ, α, d) (Example 3.5), we arrive at a novel power-law Frepresentation Θ ← PL-FRep(γ, λ, d, Gam(λ, λ))

implies

Θ ∼ ΓP(γ, λ, d),

(3.21)

which, to the best knowledge of the authors, was not previously available in the literature. indep
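The identity in Eq. (3.14) holds for a, b > 0 (the representation above extends it formally to a < 0), so it can be sanity-checked by Monte Carlo. A small sketch with arbitrary illustrative parameter values (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, lam, n = 1.5, 2.5, 3.0, 200_000

# Left side: G_{a,lam} ~ Gam(a, lam); NumPy's gamma takes (shape, scale),
# so a rate of lam corresponds to scale 1/lam.
lhs = rng.gamma(a, 1.0 / lam, n)
# Right side: G_{a+b,lam} * B_{a,b} with independent draws.
rhs = rng.gamma(a + b, 1.0 / lam, n) * rng.beta(a, b, n)

# Both sides should have mean a/lam = 0.5 here.
print(lhs.mean(), rhs.mean())
```

Matching a few moments this way is a cheap guard against parameterization mistakes (rate versus scale) before using the identity inside a representation.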

Example 3.8 (F-representation for BPP(γ, α, d)). Letting B′_{a,b} ∼ Beta′(a, b), the beta prime process can be obtained using the parametric identity

  B′_{a,b} =_d G_{a,1}/G_{b,1}.   (3.22)

For BPP(γ, α, d), the rate measure is ν(dθ) ∝ θ^{−d−1} (1 + θ)^{−α} dθ, which corresponds to an improper beta prime distribution with parameters −d and α + d. So choosing a = −d and b = α + d, the identity becomes

  B′_{−d,α+d} =_d G_{−d,1}/G_{α+d,1}.   (3.23)

Thus, applying Lemma 3.4 to the stochastic mapping

  θ ↦ θ/G,   G ∼ Gam(α + d, 1),   (3.24)

and the power-law F-representation for ΓP(γα, 1, d) from Example 3.7 yields the novel power-law F-representation

  Θ ← PL-FRep(γα, 1, d, Beta′(1, α + d))   implies   Θ ∼ BPP(γ, α, d).   (3.25)


Example 3.9 (D-representation for BP(γ, α, 0)). A final application of stochastic mapping is to obtain a D-representation for BP(γ, α, 0) for all α > 0 by using the identity

  B_{a,b} =_d G_{a,λ}/(G_{a,λ} + G_{b,λ})   (3.26)

and choosing a = 0 and b = λ = α. Applying Lemma 3.4 to the stochastic mapping

  θ ↦ θ/(θ + G),   G ∼ Gam(α, α),   (3.27)

and the Bondesson D-representation for ΓP(γ, α, 0) yields

  ∑_{k=1}^∞ θ_k δ_{ψ_k} ∼ BP(γ, α, 0),   with θ_k := (1 + G_k V_k^{−1} e^{Γ_{k,αγ}})^{−1},   ψ_k ~iid G,   (3.28)
  G_k ~iid Gam(α, α),   V_k ~iid Exp(α).
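Eq. (3.28) is directly simulable. The following is a sketch (our code, not the paper's) of the first K weights, assuming Γ_{k,αγ} denotes the k-th jump of a rate-αγ homogeneous Poisson process on ℝ₊, so it is a running sum of Exp(αγ) interarrival times; the atom locations ψ_k are omitted.

```python
import numpy as np

def bp_drep_weights(gamma, alpha, K, rng):
    """First K weights of Eq. (3.28) for BP(gamma, alpha, 0).
    jumps[k] plays the role of Gamma_{k, alpha*gamma}: the k-th arrival of a
    rate-(alpha*gamma) Poisson process, i.e. a cumsum of Exp(alpha*gamma)."""
    jumps = np.cumsum(rng.exponential(1.0 / (alpha * gamma), K))
    G = rng.gamma(alpha, 1.0 / alpha, K)   # G_k ~ Gam(alpha, alpha)
    V = rng.exponential(1.0 / alpha, K)    # V_k ~ Exp(alpha)
    return 1.0 / (1.0 + (G / V) * np.exp(jumps))

rng = np.random.default_rng(2)
w = bp_drep_weights(gamma=1.0, alpha=2.0, K=50, rng=rng)
print(w[:3])  # weights lie in (0, 1) and decay stochastically in k
```

Since each weight only needs three draws (a jump, G_k, and V_k), this matches the 3K sampling cost reported for D-representations in Section 6.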

4. Truncation analysis

Each of the sequential representations developed in Section 3 shares a common structural element, an outer infinite sum, which is responsible for generating a countably infinite number of atoms in the CRM. In this section, we terminate these outer sums at a finite truncation level K ∈ N, resulting in a truncated CRM Θ_K possessing a finite number of atoms. We then develop upper bounds on the error induced by this truncation procedure. All of the truncated CRM error bounds in this section rely on Lemma 4.1, which is a tightening (by a factor of two) of the bound in Ishwaran and James (2001, 2002).³

Lemma 4.1 (CRM Protobound). Let Θ ∼ CRM(ν). For any truncation Θ_K, if

  X_n | Θ ~iid LP(h, Θ),   Y_n | X_n ~indep f(· | X_n),   (4.1)
  Z_n | Θ_K ~iid LP(h, Θ_K),   W_n | Z_n ~indep f(· | Z_n),   (4.2)

then, with p_{N,∞} and p_{N,K} denoting the marginal densities of Y_{1:N} and W_{1:N}, respectively,

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − P(supp(X_{1:N}) ⊆ supp(Θ_K)).   (4.3)

The proof of Lemma 4.1, as well as the other results in this section, can be found in Appendix D. Throughout this section, for a given likelihood model h(x | θ) we define π(θ) := h(0 | θ) for notational brevity. All of the provided truncation results use the generative model in Lemma 4.1, and specify the error bound B(N, K) in terms of the L1 error between the marginal densities of Y_{1:N} and W_{1:N},

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ B(N, K).   (4.4)

The asymptotic behavior of truncation error bounds is specified with tilde notation:

  a(K) ∼ b(K), K → ∞  ⟺  lim_{K→∞} a(K)/b(K) = 1.   (4.5)

Standard properties of asymptotic equivalence, which we mostly use implicitly, are given in Lemma B.8.

³ See Lemma D.1 in Appendix D for the generalization of Lemma 4.1 to arbitrary random measures.


4.1. D-representation. For the Bondesson D-representation, the truncated CRM has the form

  Θ_K := ∑_{k=1}^K θ_k δ_{ψ_k}.   (4.6)

The following result provides a truncation error bound, specifies its range, and guarantees that as K → ∞ the bound decreases to zero.

Theorem 4.2 (D-representation truncation error). The error in approximating Θ ← B-DRep(g_ν, c_ν) with its truncation Θ_K satisfies

  B(N, K) = 1 − exp{−N ∫_0^∞ (1 − E[π(v e^{−G_K})]) ν(dv)},   (4.7)

where G_K ∼ Gam(K, c_ν) and G_0 = 0. Furthermore, the bound satisfies

  0 ≤ B(N, K) ≤ 1   and   lim_{K→∞} B(N, K) = 0.   (4.8)

Example 4.1 (Beta-Bernoulli process, BP(γ, α, 0), α ≥ 1). Noting that π(θ) = 1 − θ and G_K ∼ Gam(K, γα), we have

  ∫_0^∞ (1 − E[π(v e^{−G_K})]) ν(dv) = γα E[e^{−G_K}] ∫_0^1 (1 − v)^{α−1} dv   (4.9)
    = γ (γα/(1 + γα))^K.   (4.10)

Applying the error bound yields

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − exp{−Nγ (γα/(γα + 1))^K}.   (4.11)

The result in Eq. (4.11) generalizes that in Doshi-Velez et al. (2009), which only applies in the case of α = 1.

Example 4.2 (Gamma-Poisson process, ΓP(γ, λ, 0)). Since π(θ) = e^{−θ} and G_K ∼ Gam(K, γλ),

  ∫_0^∞ (1 − E[π(v e^{−G_K})]) ν(dv) = γλ ∫_0^∞ E[1 − e^{−v e^{−G_K}}] v^{−1} e^{−λv} dv   (4.12)
    = γλ E[log(1 + e^{−G_K}/λ)]   (4.13)
    ≤ γ E[e^{−G_K}]   (4.14)
    = γ (γλ/(1 + γλ))^K.   (4.15)

The second equality follows, for example, by using the power series for the exponential integral (Abramowitz and Stegun, 1964, Chapter 5), and the upper bound follows because log(1 + x) ≤ x for x ≥ 0. Applying the error bound yields

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − exp{−Nγ (γλ/(1 + γλ))^K}.   (4.16)
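The bounds (4.11) and (4.16) share the geometric form 1 − exp(−Nγ r^K), which can be inverted to choose the smallest truncation level meeting a target error ε: rearranging gives K ≥ log(−log(1 − ε)/(Nγ))/log r. A sketch (the helper names are ours, not the paper's):

```python
import math

def geometric_bound(N, gamma, r, K):
    """Truncation error bound of the form 1 - exp(-N * gamma * r**K)."""
    return 1.0 - math.exp(-N * gamma * r**K)

def min_truncation(N, gamma, r, eps):
    """Smallest K >= 0 with geometric_bound(N, gamma, r, K) <= eps."""
    target = -math.log(1.0 - eps) / (N * gamma)
    return max(0, math.ceil(math.log(target) / math.log(r)))

# Beta-Bernoulli D-representation bound (4.11): r = gamma*alpha/(gamma*alpha+1).
N, gamma, alpha, eps = 100, 1.0, 2.0, 1e-3
r = gamma * alpha / (gamma * alpha + 1.0)
K = min_truncation(N, gamma, r, eps)
print(K, geometric_bound(N, gamma, r, K))  # K = 29 for these values
```

Because r < 1, the required K grows only logarithmically in N/ε, reflecting the exponential decay of these light-tailed bounds.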


Example 4.3 (Beta prime-odds Bernoulli process, BPP(γ, α, 0)). In this case π(θ) = (1 + θ)^{−1} and G_K ∼ Gam(K, γα), so

  ∫_0^∞ (1 − E[π(v e^{−G_K})]) ν(dv) = γα ∫_0^∞ E[e^{−G_K} (1 + v)^{−α} (1 + v e^{−G_K})^{−1}] dv   (4.17)
    ≤ γα E[e^{−G_K}] ∫_0^∞ (1 + v)^{−α} (1 + v E[e^{−G_K}])^{−1} dv   (4.18)
    ≤ γα (γα/(1 + γα))^K ∫_0^∞ (1 + v)^{−α−1} dv   (4.19)
    ≤ γ (γα/(1 + γα))^K,   (4.20)

where the first upper bound follows from Jensen's inequality. Thus, the error bound is the same as for the beta-Bernoulli process.

4.2. F-representation. For the decoupled and power-law F-representations, the truncated CRM has the form

  Θ_K := ∑_{k=1}^K ∑_{i=1}^{C_k} θ_{ki} δ_{ψ_{ki}}.   (4.21)

The following result provides a truncation error bound, specifies its range, and guarantees that as K → ∞ the bound decreases to zero.

Theorem 4.3 (Decoupled Bondesson F-representation truncation error). The error in approximating Θ ← B-FRep(c, g, ξ) with its truncation Θ_K satisfies

  B(N, K) = 1 − exp{−N (c/ξ) ∑_{k=K+1}^∞ E[1 − π(β_k)]},   (4.22)

where

  β_k := V e^{−T_k},   V ∼ g,   T_k ~indep Gam(k, ξ).   (4.23)

Furthermore, the bound satisfies

  0 ≤ B(N, K) ≤ 1   and   lim_{K→∞} B(N, K) = 0.   (4.24)

(4.24)

(4.25)

k=K+1

it follows that (  K ) 1 ξ kpN,∞ − pN,K k1 ≤ 1 − exp −N γ . 2 1+ξ

(4.26)

Example 4.5 (Gamma-Poisson process, ΓP(γ, λ, 0)). Using the fact that 1 − e^{−θ} ≤ θ and calculations analogous to those in Example 4.4, the bound is the same as in the beta-Bernoulli case, and is equivalent (up to a factor of two) to the truncation bound in Roychowdhury and Kulis (2015).


Example 4.6 (Beta prime-odds Bernoulli process, BPP(γ, α, 0)). Using the trivial bound θ/(1 + θ) ≤ θ and calculations analogous to those in Example 4.4, for α > 1 we obtain the upper bound

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − exp{−N (γα/(α − 1)) (ξ/(1 + ξ))^K}.   (4.27)

Given the similarities between the Bondesson D- and F-representations, one might expect a relationship between their truncation error bounds to exist. Indeed, when c = ξ, the bounds presented in Theorems 4.2 and 4.3 coincide for all K, N ∈ N. However, we can actually say more in this setting: Theorem D.5 in Appendix D shows that the protobound (of the form presented in Lemma 4.1) of the Bondesson D-representation is bounded above by that of the decoupled Bondesson F-representation, providing some indication that the Bondesson D-representation may have a lower actual truncation error. Next, we turn to bounding the error of truncated power-law F-representations.

Theorem 4.4 (Power-law F-representation truncation error). The error in approximating Θ ← PL-FRep(γ, α, d, g) with Θ_K satisfies

  B(N, K) = 1 − exp{−γN ∑_{k=K+1}^∞ E[1 − π(β_k)]},   (4.28)

where

  β_k := V U_k ∏_{ℓ=1}^{k−1} (1 − U_ℓ),   V ∼ g,   U_ℓ ~indep Beta(1 − d, α + ℓd).   (4.29)

Furthermore, the bound satisfies

  0 ≤ B(N, K) ≤ 1   and   lim_{K→∞} B(N, K) = 0.   (4.30)

Example 4.7 (Beta-Bernoulli process, BP(γ, α, d)). We have

  ∑_{k=K+1}^∞ E[1 − π(β_k)] = ∑_{k=K+1}^∞ E[β_k] = E[∑_{k=K+1}^∞ β_k] = ∏_{k=1}^K (α + kd)/(α + kd − d + 1),   (4.31)

where the final equality follows from Ishwaran and James (2001, Theorem 1). Thus,

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − exp{−γN ∏_{k=1}^K (α + kd)/(α + kd − d + 1)}   (4.32)
    ∼ γN ((α + d)/(α + 1))^K,   K → ∞,   (4.33)

where Eq. (4.33) follows because

  ∏_{k=1}^K (α + kd)/(α + kd − d + 1) = [Γ((α + d)/d + K) Γ((α + 1)/d)] / [Γ((α + d)/d) Γ((α + 1)/d + K)]   (4.34)
    ∼ ((α + d)/d)^K ((α + 1)/d)^{−K},   K → ∞,   (4.35)
    = ((α + d)/(α + 1))^K.   (4.36)
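The exact finite product in Eq. (4.32) is cheap to evaluate directly, which avoids relying on any asymptotic simplification when choosing a truncation level in practice. A sketch with illustrative parameter values (our code, not the paper's):

```python
import math

def product_bound(N, gamma, alpha, d, K):
    """Eq. (4.32): 1 - exp(-gamma*N * prod_{k=1}^K (a+kd)/(a+kd-d+1))."""
    prod = 1.0
    for k in range(1, K + 1):
        prod *= (alpha + k * d) / (alpha + k * d - d + 1.0)
    return 1.0 - math.exp(-gamma * N * prod)

# Each factor is strictly below one whenever 0 <= d < 1, so the bound
# decreases monotonically as the truncation level K grows.
vals = [product_bound(10, 1.0, 1.0, 0.5, K) for K in (10, 50, 100)]
print(vals)
```

Evaluating the product costs O(K) arithmetic, which is negligible next to the O(K²) sampling cost of the power-law representation itself (Section 6).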


The error bound generalizes that of Paisley et al. (2012b), which was for the special case of BP(γ, α, 0).

Example 4.8 (Gamma-Poisson process, ΓP(γ, λ, d)). Using the fact that 1 − e^{−θ} ≤ θ and calculations analogous to those in Example 4.7, the error bound is the same as for the beta-Bernoulli model with λ in place of α.

Example 4.9 (Beta prime-odds Bernoulli process, BPP(γ, α, d)). Using the trivial bound θ/(1 + θ) ≤ θ and calculations analogous to those in Example 4.7, for α > 1 we obtain the upper bound and asymptotic simplification

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − exp{−N (γα/(α − 1)) ∏_{k=1}^K (1 + kd)/(2 + kd − d)}   (4.37)
    ∼ N (γα/(α − 1)) ((1 + d)/2)^K,   K → ∞,   (4.38)

using Eq. (4.36).

4.3. V-representation. For the size-biased V-representation, the truncated CRM has the form

  Θ_K := ∑_{k=1}^K ∑_{x=1}^∞ ∑_{j=1}^{ρ_{kx}} θ_{kxj} δ_{ψ_{kxj}}.   (4.39)

The following result provides a truncation error bound, specifies its range, and guarantees that as K → ∞ the bound decreases to zero. We use this result to analyze truncations of some example processes from Broderick et al. (2014).

Theorem 4.5 (Size-biased V-representation truncation error). The error in approximating Θ ← SB-VRep(ν, h) with its truncation Θ_K is bounded by

  B(N, K) = 1 − exp(−∑_{n=1}^N Λ_{K+n}),   (4.40)

where

  Λ_k := ∫ π(θ)^{k−1} (1 − π(θ)) ν(dθ),   k ≥ 1.   (4.41)

Furthermore, the bound satisfies

  0 ≤ B(N, K) ≤ 1   and   lim_{K→∞} B(N, K) = 0.   (4.42)

Note that although it is often possible to compute Λ_k in closed form using Eq. (4.41), for instance in the following applications of Theorem 4.5, there is no guarantee that this is the case. If the integral proves to be problematic, it is also possible to express Λ_k as the infinite sum Λ_k = ∑_{x=1}^∞ λ_{kx}, where λ_{kx} (defined in Theorem 3.3) is available in closed form as long as h, ν form a conjugate exponential family pair. The series can then be truncated if necessary to obtain an approximation of arbitrarily high precision.


Example 4.10 (Beta-Bernoulli process, BP(γ, α, d)). For the beta-Bernoulli process, the rate integral is the standard beta integral,

  Λ_k = γ [Γ(α + 1)/(Γ(1 − d)Γ(α + d))] ∫ θ^{−d} (1 − θ)^{α+d+k−2} dθ   (4.43)
    = γ [Γ(α + 1)/Γ(α + d)] [Γ(α + d + k − 1)/Γ(α + k)].   (4.44)
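The closed form above is convenient to evaluate through log-gamma functions, and the finite sum in Theorem 4.5 can then be accumulated directly rather than via the telescoping simplification. A sketch (helper names are ours):

```python
import math

def log_rate_integral(gamma, alpha, d, k):
    """log Λ_k for the beta-Bernoulli process, from the closed form above."""
    return (math.log(gamma) + math.lgamma(alpha + 1.0) - math.lgamma(alpha + d)
            + math.lgamma(alpha + d + k - 1.0) - math.lgamma(alpha + k))

def vrep_bound(gamma, alpha, d, N, K):
    """Theorem 4.5 bound: 1 - exp(-sum_{n=1}^N Λ_{K+n}), summed directly."""
    s = sum(math.exp(log_rate_integral(gamma, alpha, d, K + n))
            for n in range(1, N + 1))
    return 1.0 - math.exp(-s)

b_small_K = vrep_bound(1.0, 2.0, 0.5, 10, 1)
b_large_K = vrep_bound(1.0, 2.0, 0.5, 10, 100)
print(b_small_K, b_large_K)  # the bound shrinks as K grows
```

Working on the log scale via `lgamma` keeps the gamma-function ratios stable even for large K, where Γ(α + k) alone would overflow.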

First, we analyze the case where d > 0. Using Lemma B.6 to compute the sum ∑_{n=1}^N Λ_{K+n}, the error bound is

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − exp{(γ/d) [Γ(α + 1)/Γ(α + d)] (Γ(α + d + K)/Γ(α + K) − Γ(α + d + K + N)/Γ(α + K + N))}   (4.45)
    ∼ γN [Γ(α + 1)/Γ(α + d)] (α + K)^{d−1},   K → ∞,   (4.46)

where the asymptotic result follows from Lemmas B.5, B.7 and B.8:

  Γ(α + d + K + N)/Γ(α + K + N) − Γ(α + d + K)/Γ(α + K) ∼ N (d/dK) [Γ(α + d + K)/Γ(α + K)],   K → ∞,   (4.47)
    ∼ N d (α + K)^{d−1},   K → ∞.   (4.48)

When d = 0, we can again use Lemma B.6 to arrive at

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − exp(γα (ψ(α + K) − ψ(α + N + K)))   (4.49)
    ∼ γαN ψ′(α + K),   K → ∞,   (4.50)
    ∼ γαN (α + K)^{−1},   K → ∞,   (4.51)

where ψ(·) is the digamma function, and the asymptotic result follows similarly from Lemma B.5.

Example 4.11 (Beta-negative binomial process, BP(γ, α, d)). For a fixed number of failures s > 0, and assuming α + d + (k − 1)s > 1, the rate integral for the beta-negative binomial process may be evaluated via integration by parts for d > 0:

  Λ_k = γ [Γ(α + 1)/(Γ(1 − d)Γ(α + d))] ∫ θ^{−d−1} (1 − θ)^{α+d+(k−1)s−1} (1 − (1 − θ)^s) dθ   (4.52)
    = (γ/d) [Γ(α + 1)/(Γ(1 − d)Γ(α + d))] ∫ θ^{−d} ((z + s − 1)(1 − θ)^{z+s−2} − (z − 1)(1 − θ)^{z−2}) dθ   (4.53)
    = (γ/d) [Γ(α + 1)/Γ(α + d)] (Γ(α + d + ks)/Γ(α + ks) − Γ(α + d + (k − 1)s)/Γ(α + (k − 1)s)),   (4.54)


where z := α + d + (k − 1)s. The error bound is computed by cancelling terms in the telescoping sum ∑_{n=1}^N Λ_{K+n}:

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − exp{(γ/d) [Γ(α + 1)/Γ(α + d)] (Γ(α + d + Ks)/Γ(α + Ks) − Γ(α + d + (K + N)s)/Γ(α + (K + N)s))}   (4.55)
    ∼ γNs [Γ(α + 1)/Γ(α + d)] (α + Ks)^{d−1},   K → ∞,   (4.56)

where the asymptotic result follows from Lemmas B.5, B.7 and B.8. To analyze the d = 0 case, we can use L'Hospital's rule to take the limit of the rate integral,

  lim_{d→0} Λ_k = γα lim_{d→0} d^{−1} (Γ(α + d + ks)/Γ(α + ks) − Γ(α + d + (k − 1)s)/Γ(α + (k − 1)s))   (4.57)
    = γα lim_{d→0} (Γ′(α + d + ks)/Γ(α + ks) − Γ′(α + d + (k − 1)s)/Γ(α + (k − 1)s))   (4.58)
    = γα (ψ(α + ks) − ψ(α + (k − 1)s)).   (4.59)

Again computing the error bound by cancelling terms in the telescoping sum ∑_{n=1}^N Λ_{K+n},

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − exp(γα (ψ(α + Ks) − ψ(α + (K + N)s)))   (4.60)
    ∼ γαNs ψ′(α + Ks),   K → ∞,   (4.61)
    ∼ γαNs (α + Ks)^{−1},   K → ∞,   (4.62)

where the asymptotic result follows from an application of Lemma B.5.

Example 4.12 (Gamma-Poisson process, ΓP(γ, λ, d)). The gamma-Poisson process rate integral can be evaluated via integration by parts for d > 0:

  Λ_k = γ [λ^{1−d}/Γ(1 − d)] ∫ θ^{−d−1} e^{−(λ+k−1)θ} (1 − e^{−θ}) dθ   (4.63)
    = −(γ/d) [λ^{1−d}/Γ(1 − d)] ∫ θ^{−d} ((λ + k − 1) e^{−(λ+k−1)θ} − (λ + k) e^{−(λ+k)θ}) dθ   (4.64)
    = (γ/d) [λ^{1−d}/Γ(1 − d)] (Γ(1 − d)(λ + k)^d − Γ(1 − d)(λ + k − 1)^d)   (4.65)
    = (γλ^{1−d}/d) ((λ + k)^d − (λ + k − 1)^d).   (4.66)

Thus, cancelling terms in the telescoping sum ∑_{n=1}^N Λ_{K+n}, the error bound is

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − exp{(γλ^{1−d}/d) ((λ + K)^d − (λ + K + N)^d)}   (4.67)
    ∼ γN λ^{1−d} (λ + K)^{d−1},   K → ∞,   (4.68)

where the asymptotic result follows from an application of Lemma B.5. To analyze the d = 0 case, we again make use of L'Hospital's rule to take the limit of the rate


integral:

  lim_{d→0} Λ_k = γλ lim_{d→0} d^{−1} ((λ + k)^d − (λ + k − 1)^d)   (4.69)
    = γλ lim_{d→0} (log(λ + k)(λ + k)^d − log(λ + k − 1)(λ + k − 1)^d)   (4.70)
    = γλ (log(λ + k) − log(λ + k − 1)).   (4.71)

Cancelling terms in the telescoping sum ∑_{n=1}^N Λ_{K+n} yields the error bound

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − exp(γλ (log(λ + K) − log(λ + K + N)))   (4.72)
    ∼ γλN (λ + K)^{−1},   K → ∞,   (4.73)

where the asymptotic result follows from an application of Lemma B.5.

Example 4.13 (Beta prime-odds Bernoulli process, BPP(γ, α, d)). The rate integral takes the form

  Λ_k = γ [Γ(α + 1)/(Γ(1 − d)Γ(α + d))] ∫ θ^{(1−d)−1} (1 + θ)^{−(1−d)−(d+k+α−1)} dθ   (4.74)
    = γ [Γ(α + 1)Γ(d + k + α − 1)]/[Γ(α + d)Γ(k + α)].   (4.75)

This is the same as for the beta-Bernoulli process, and thus their error bounds coincide.

4.4. Stochastic mapping. We now show how truncation bounds developed elsewhere in this paper can be applied to CRM representations that have been transformed using Lemma 3.4. For Θ ∼ CRM(ν), we denote its transformation by Θ̃ = τ(Θ) (in the deterministic case) or Θ̃ = κ(Θ) (in the stochastic case). For any object defined with respect to Θ, the corresponding object for Θ̃ is denoted with a tilde. For example, in place of N and X_{1:N} (for Θ), we use Ñ and X̃_{1:Ñ} (for Θ̃). Note that we make the dependence of the truncation error bound B(N, K) on π(θ) explicit in the notation of Proposition 4.6; when one applies stochastic mapping to a CRM, one usually also wants to change the likelihood h(x | θ), and thus π(θ) := h(0 | θ).

Proposition 4.6 (Truncation error under a stochastic mapping). Consider a representation for Θ ∼ CRM(ν) with truncation error bound B(π, N, K). Then for any likelihood h̃(x | u):

(1) if Θ̃ is a stochastic mapping of Θ under the probability kernel κ(θ, du), its truncation error bound is B(π_{κ,Ñ}, 1, K), where π_{κ,Ñ}(θ) := ∫ h̃(0 | u)^Ñ κ(θ, du); and

(2) if Θ̃ is a stochastic mapping of Θ under the transformation τ : Ψ → U, its truncation error bound is B(π_τ, Ñ, K), where π_τ(θ) := h̃(0 | τ(θ)).

The proof of Proposition 4.6 may be found in Appendix D.

Example 4.14 (Gamma-Poisson process, ΓP(γ, λ, d)). Reusing the stochastic mapping specified in Eqs. (3.16) and (3.17),

  π_{κ,Ñ}(θ) = ∫ e^{−Ñu} κ(θ, du) = [λ^λ/Γ(λ)] ∫ u^{λ−1} θ^{−λ} e^{−(Ñ+λ/θ)u} du = (λ/(Ñθ + λ))^λ.   (4.76)


Combining Eqs. (4.31) and (4.76), Lemma B.3, and Proposition 4.6 yields the error bound

  (1/2) ‖p_{Ñ,∞} − p_{Ñ,K}‖₁ ≤ 1 − exp{−γÑ ∏_{k=1}^K (λ + kd)/(λ + kd − d + 1)}.   (4.77)

The bound is identical to the one obtained in Example 4.8, but follows from the error bound on the underlying beta process used to construct the gamma process representation.

4.5. Hyperpriors. In practice, prior distributions are typically placed on the hyperparameters of the CRM rate measure (i.e., γ, α, λ, d, etc.). We conclude our investigation of CRM truncation error by showing how bounds developed in this section can be modified to account for the use of hyperpriors. Note that we make the dependence of the truncation error bound B(N, K) on the hyperparameters Φ explicit in the notation of Proposition 4.7.

Proposition 4.7 (CRM truncation error with a hyperprior). Consider a representation for Θ | Φ ∼ CRM(ν) with truncation error bound B(Φ, N, K) given fixed hyperparameters Φ. Then the corresponding truncation error bound for random Φ is

  B(N, K) = E[B(Φ, N, K)].   (4.78)
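Proposition 4.7 amounts to averaging a fixed-hyperparameter bound over the hyperprior. For a gamma prior on the mass γ combined with the beta-Bernoulli F-representation bound (4.26), this expectation is available in closed form (it is derived in the following example); the sketch below (our code, not the paper's) checks that closed form against a plain Monte Carlo average.

```python
import math
import random

def hyper_bound(a, b, N, K, xi):
    """Closed form of E[1 - exp(-N*gamma*r^K)] with gamma ~ Gam(a, b) (rate b):
    1 - (1 + (N/b) * r^K)^(-a), where r = xi/(xi+1) as in Eq. (4.26)."""
    return 1.0 - (1.0 + (N / b) * (xi / (xi + 1.0))**K)**(-a)

def mc_bound(a, b, N, K, xi, trials=200_000, seed=3):
    """Monte Carlo version of the same expectation."""
    rng = random.Random(seed)
    r = (xi / (xi + 1.0))**K
    tot = 0.0
    for _ in range(trials):
        g = rng.gammavariate(a, 1.0 / b)  # gammavariate takes (shape, scale)
        tot += 1.0 - math.exp(-N * g * r)
    return tot / trials

a, b, N, K, xi = 2.0, 1.0, 10, 5, 1.0
print(hyper_bound(a, b, N, K, xi), mc_bound(a, b, N, K, xi))
```

When no closed form exists, the Monte Carlo route still applies to any bound of the form E[B(Φ, N, K)].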

Example 4.15 (Beta-Bernoulli process BP(γ, α, 0), α ≥ 1). A standard choice of hyperprior for the mass γ is a gamma distribution, i.e., γ ∼ Gam(a, b). Combining Proposition 4.7 and the earlier beta-Bernoulli truncation bound in Eq. (4.26), along with the standard gamma integral, we have that

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − [b^a/Γ(a)] ∫_0^∞ γ^{a−1} exp{−bγ − Nγ (ξ/(ξ + 1))^K} dγ   (4.79)
    = 1 − (1 + (N/b)(ξ/(ξ + 1))^K)^{−a}.   (4.80)

5. Normalized truncation analysis

In this section, we provide truncation error bounds for normalized CRMs (NCRMs). Examples include the Dirichlet process (Ferguson, 1973), the normalized generalized gamma process (Brix, 1999; Lijoi and Prünster, 2010), and the normalized σ-stable process (Kingman, 1975; Lijoi and Prünster, 2010). Given a CRM Θ on Ψ, we define the corresponding NCRM Ξ via Ξ(S) := Θ(S)/Θ(Ψ) for each measurable subset S ⊆ Ψ. Likewise, given a truncated CRM Θ_K, we define its normalization Ξ_K via Ξ_K(S) := Θ_K(S)/Θ_K(Ψ). Note that any simulation algorithm for Θ_K can be used to simulate Ξ_K by simply normalizing the result. Furthermore, this construction does not depend on the particular representation of the CRM, and thus applies equally to the D-, F-, and V-representations described in Section 3.

The first step in the analysis of NCRM truncations is to define their approximation error in a manner similar to that of CRM truncations. Since Ξ and Ξ_K are both normalized, they are distributions on Ψ; thus, observations X_{1:N} are generated i.i.d. from Ξ, and Z_{1:N} are generated i.i.d. from Ξ_K. Y_{1:N} and W_{1:N} have the


same definition as for CRMs. As in the developments of Section 4, the theoretical results of this section rely on a general upper bound, provided by Lemma 5.1:⁴

Lemma 5.1 (NCRM Protobound). Let Θ ∼ CRM(ν), and let its truncation be Θ_K. Let their normalizations be Ξ and Ξ_K respectively. If

  X_n | Ξ ~iid Ξ,   Y_n | X_n ~indep f(· | X_n),   (5.1)
  Z_n | Ξ_K ~iid Ξ_K,   W_n | Z_n ~indep f(· | Z_n),   (5.2)

then

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − P(X_{1:N} ⊆ supp(Θ_K)),   (5.3)

where p_{N,∞}, p_{N,K} are the marginal densities of Y_{1:N} and W_{1:N}, respectively.

The analysis of CRMs in Section 4 relied heavily on the independence of the rates in Θ and X_{1:N}; unfortunately, the rates in Ξ do not possess the same independence properties (they must sum to one). Likewise, sampling X_n for each n does not depend on the atoms of Ξ independently (X_n randomly selects a single atom based on their rates). Rather than using the basic definitions of the above random quantities to derive an error bound, we decouple the atoms of Ξ and X_{1:N} using a technique from extreme value theory. The Gumbel distribution with location μ ∈ R and scale σ > 0, denoted T ∼ Gumbel(μ, σ), is defined by the cumulative distribution function

  P(T ≤ t) = e^{−e^{−(t−μ)/σ}},   (5.4)

and has density with respect to the Lebesgue measure on R

  (1/σ) e^{−((t−μ)/σ) − e^{−((t−μ)/σ)}}.   (5.5)

An interesting property of the Gumbel distribution is that if one perturbs the log-probabilities of a finite discrete distribution by i.i.d. Gumbel(0, 1) random variables, the arg max of the resulting set is a sample from the discrete distribution (Gumbel, 1954; Maddison et al., 2014). This technique is invariant to normalization, as the arg max is invariant to the corresponding constant shift in the log-transformed space. For present purposes, we require the infinite extension of this result:

Lemma 5.2 (Infinite Gumbel-max sampling). Let (p_i)_{i=1}^∞ be a collection of positive numbers such that ∑_i p_i < ∞ and let p̄_j := p_j/∑_i p_i. If (T_i)_{i=1}^∞ are i.i.d. Gumbel(0, 1) random variables, then arg max_{i∈N} (T_i + log p_i) exists, is unique a.s., and has distribution

  arg max_{i∈N} (T_i + log p_i) ∼ Categorical((p̄_j)_{j=1}^∞).   (5.6)
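The finite version of the Gumbel-max property is simple to check empirically: perturb each log-weight of an unnormalized positive vector by an independent Gumbel(0, 1) draw and take the argmax. A sketch (our code) using inverse-CDF Gumbel sampling, −log(−log U) for U ∼ Unif(0, 1):

```python
import math
import random

def gumbel_max_sample(weights, rng):
    """Index ~ Categorical(weights / sum(weights)), sampled without
    normalizing: add an independent Gumbel(0, 1) perturbation to each
    log-weight and return the argmax."""
    keys = [math.log(w) - math.log(-math.log(rng.random())) for w in weights]
    return max(range(len(weights)), key=keys.__getitem__)

rng = random.Random(4)
weights = [3.0, 1.0, 1.0]          # unnormalized; target probs (0.6, 0.2, 0.2)
counts = [0, 0, 0]
for _ in range(100_000):
    counts[gumbel_max_sample(weights, rng)] += 1
freqs = [c / 100_000 for c in counts]
print(freqs)
```

The invariance to normalization is visible in the code: scaling all weights shifts every key by the same constant and leaves the argmax unchanged.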

The proof of this result, along with the others in this section, may be found in Appendix E. The utility of Lemma 5.2 is that it allows the construction of Ξ and X_{1:N} without the problematic coupling of the underlying CRM atoms due to normalization; rather than dealing directly with Ξ, we apply Lemma A.1 to log-transform the rates of Θ, Lemma A.4 to perturb each rate by an i.i.d. Gumbel(0, 1) random variable, and characterize the distribution of the maximum rate in this process. The

⁴ See Lemma D.1 in Appendix D for the generalization of Lemma 5.1 to arbitrary random measures.


combination of this distribution with Lemma 5.2 yields the key proof technique used to develop the truncation bounds in Theorem 5.3 and Theorem 5.4. Rather than following the order of presentation in earlier sections of the paper, we analyze F- and V-representations before investigating the Bondesson D-representation, as the analysis is much simpler in the former case.

5.1. F- and V-representations. F- and V-representations (including as special cases the B-FRep, PL-FRep, and SB-VRep detailed in Sections 3.2 and 3.3) share a very useful property: the truncated CRM Θ_K and the tail measure after truncation Θ − Θ_K are both CRMs, and they are mutually independent. This is a simple consequence of Lemma A.2 and the fact that F- and V-representations are defined via an infinite sum of CRMs. Theorem 5.3 provides a truncation error bound for any truncated NCRM whose underlying CRM satisfies this property.

Theorem 5.3 (Normalized F-/V-representation truncation error bound). Let Θ ∼ CRM(ν), and define its truncation Θ_K ∼ CRM(ν_K) and tail Θ_K^+ ∼ CRM(ν_K^+) such that Θ = Θ_K + Θ_K^+ and Θ_K ⊥⊥ Θ_K^+. Then if Ξ is the normalization of Θ, and Ξ_K is the normalization of Θ_K, the error of approximating Ξ with Ξ_K is bounded by

  B(N, K) = 1 − (1 − ∬ W(θ, t) ν_K^+(dθ) dt)^N,   (5.7)

where

  W(θ, t) := 1[t ≥ 0] θ e^{−θt} exp{∫ (e^{−θt} − 1) ν(dθ)}.   (5.8)

Furthermore, the bound satisfies

  0 ≤ B(N, K) ≤ 1   and   lim_{K→∞} B(N, K) = 0.   (5.9)

Example 5.1 (Dirichlet process, DP(γ)). The Dirichlet process with concentration γ > 0 is a normalized gamma process ΓP(γ, 1, 0). To analyze its truncation error, we use the decoupled Bondesson F-representation. First, the inner integral in W(θ, t) is

  ∫ (e^{−tθ} − 1) ν(dθ) = γ ∫ (e^{−tθ} − 1) θ^{−1} e^{−θ} dθ   (5.10)
    = γ lim_{ε→0} (∫_ε^∞ θ^{−1} e^{−(t+1)θ} dθ − ∫_ε^∞ θ^{−1} e^{−θ} dθ)   (5.11)
    = −γ lim_{ε→0} ∫_ε^{ε(t+1)} θ^{−1} e^{−θ} dθ.   (5.12)

Since e^{−θ} is a monotonically decreasing function,

  −γ lim_{ε→0} e^{−ε} ∫_ε^{ε(t+1)} θ^{−1} dθ ≤ −γ lim_{ε→0} ∫_ε^{ε(t+1)} θ^{−1} e^{−θ} dθ ≤ −γ lim_{ε→0} e^{−ε(t+1)} ∫_ε^{ε(t+1)} θ^{−1} dθ   (5.13)
  −γ lim_{ε→0} e^{−ε} log(t + 1) ≤ −γ lim_{ε→0} ∫_ε^{ε(t+1)} θ^{−1} e^{−θ} dθ ≤ −γ lim_{ε→0} e^{−ε(t+1)} log(t + 1)   (5.14)
  −γ log(t + 1) ≤ −γ lim_{ε→0} ∫_ε^{ε(t+1)} θ^{−1} e^{−θ} dθ ≤ −γ log(t + 1),   (5.15)


demonstrating that the exponential term in W(θ, t) is

  exp{∫ (e^{−tθ} − 1) ν(dθ)} = (t + 1)^{−γ}.   (5.16)

Next, Example 3.2 and Eq. (C.20) immediately yield c_ν = γ, g_ν(v) = e^{−v}, and the tail rate measure ν_K^+,

  ν_K^+(dθ) = (γ/ξ) ∑_{k=K+1}^∞ (ξ^k/Γ(k)) ∫_0^1 (−log x)^{k−1} x^{ξ−2} e^{−θ/x} dx dθ.   (5.17)

Substituting this result, using Fubini's theorem to swap the order of integration and summation, evaluating the integral over θ, and making the substitution x = e^{−s} yields

  ∬ W(θ, t) ν_K^+(dθ) dt = (γ/ξ) ∑_{k=K+1}^∞ (ξ^k/Γ(k)) ∬_{s,t≥0} [s^{k−1} e^{−(ξ−1)s} (t + 1)^{−γ}/(e^s + t)²] ds dt.   (5.18)

Noting that e^s ≥ 1 for all s ≥ 0, we have for any a ∈ (0, 1] ∩ (0, γ),

  ∬ W(θ, t) ν_K^+(dθ) dt ≤ (γ/ξ) ∑_{k=K+1}^∞ (ξ^k/Γ(k)) ∫_0^∞ s^{k−1} e^{−(ξ+a)s} ds ∫_0^∞ (t + 1)^{−(γ+1−a)} dt   (5.19)
    = (γ/((γ − a)ξ)) ∑_{k=K+1}^∞ (ξ/(ξ + a))^k = (γ/(a(γ − a))) (ξ/(ξ + a))^K.   (5.20)

Therefore, the truncation error of the Dirichlet process can be bounded by

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − (1 − (γ/(a(γ − a))) (ξ/(ξ + a))^K)^N   for all a ∈ (0, 1] ∩ (0, γ).   (5.21)

It is straightforward to analytically find the minimum of the bound with respect to a for fixed ξ if desired. Alternatively, if we set ξ = aξ′ for a fixed free parameter ξ′ > 0, the geometric term becomes (ξ′/(ξ′ + 1))^K regardless of a, and we can optimize the coefficient γ/(a(γ − a)) with respect to a (the minimum over a ∈ (0, 1] ∩ (0, γ) occurs at a = min{1, γ/2}) to find

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − (1 − C_γ (ξ′/(ξ′ + 1))^K)^N,   C_γ := 4/γ for γ ≤ 2 and C_γ := γ/(γ − 1) for γ > 2.   (5.22)

Note that the bounds in Example 5.1 have exponential decay, as expected given early DP truncation error bounds presented by Ishwaran and James (2001); Ishwaran and Zarepour (2002). In fact, the above can reproduce the decay rate of this past result by setting ξ′ = (e^γ − 1)^{−1} in Eq. (5.22). However, the techniques used in past work do not generalize beyond the Dirichlet process, while those used here apply to any NCRM.

Example 5.2 (Normalized gamma process, NΓP(γ, λ, d)). It is also possible to normalize the generalized gamma process ΓP(γ, λ, d) (Brix, 1999). For this example, we use the size-biased V-representation of ΓP(γ, λ, d) with ν(dθ) = γθ^{−1−d} e^{−λθ} dθ.
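The bound (5.21) holds for every admissible a simultaneously, so in practice one can simply minimize it over a grid rather than analytically. A sketch (our code, not the paper's; the grid search replaces the analytical minimization mentioned above):

```python
def dp_bound(N, gamma, xi, K, a):
    """Eq. (5.21): 1 - (1 - gamma/(a*(gamma-a)) * (xi/(xi+a))**K)**N,
    valid for a in (0, 1] intersected with (0, gamma); when the inner term
    exceeds one the bound is vacuous, so the trivial bound 1 is returned."""
    inner = gamma / (a * (gamma - a)) * (xi / (xi + a))**K
    return 1.0 if inner >= 1.0 else 1.0 - (1.0 - inner)**N

def best_dp_bound(N, gamma, xi, K, grid=200):
    """Minimize the bound over a grid of admissible values of a."""
    hi = min(1.0, gamma)
    return min(dp_bound(N, gamma, xi, K, hi * (i + 1) / (grid + 1))
               for i in range(grid))

b = best_dp_bound(N=10, gamma=2.0, xi=1.0, K=100)
print(b)  # extremely small: the DP bound decays geometrically in K
```

Guarding the vacuous region (inner term at least one) keeps the grid search valid even near the endpoints of the admissible interval, where the coefficient γ/(a(γ − a)) blows up.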


Computing the integral of W(θ, t) over θ first,

  ∫ θ e^{−θt} ν_K^+(dθ) = γ ∫ θ^{−d} e^{−(K+t+λ)θ} dθ = γ Γ(1 − d)(K + t + λ)^{d−1}.   (5.23)

The exponential term can also be computed in closed form,

  exp{∫ (e^{−θt} − 1) ν(dθ)} = exp{γΓ(−d)((t + λ)^d − λ^d)}.   (5.24)

Multiplying these and integrating over t ≥ 0 yields

  ∬ W(θ, t) ν_K^+(dθ) dt = γ Γ(1 − d) e^{−γΓ(−d)λ^d} ∫_λ^∞ e^{γΓ(−d)t^d} (K + t)^{d−1} dt   (5.25)
    ≤ C_{γ,λ,d} (K + λ)^{d−1},   (5.26)

where

  C_{γ,λ,d} := γ Γ(1 − d) e^{σλ^d} d^{−1} σ^{−1/d} Γ(d^{−1}, λ^d σ),   σ := −γΓ(−d),   (5.27)

and Γ(·, ·) is the upper incomplete gamma function Γ(a, b) := ∫_b^∞ θ^{a−1} e^{−θ} dθ. Therefore, the truncation error of the normalized generalized gamma process can be bounded by

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − (1 − C_{γ,λ,d} (K + λ)^{d−1})^N.   (5.28)

Setting λ = 0 in the above expression immediately yields the bound for the normalized d-stable process (Kingman, 1975). Taking the limit d → 0 yields the bound for the normalized ΓP(γ, λ, 0),

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − (1 − γλ K^{−1} log((λ + K)/λ))^N.   (5.29)

It should be noted that truncation of the NΓP(γ, λ, d) has been studied previously. Argiento et al. (2015) threshold the weights of the unnormalized CRM to be beyond a fixed level ε > 0 prior to normalization, and develop error bounds for that method of truncation. These results are not directly comparable to those of the present work due to the different methods of truncation (i.e., sequential representation termination versus weight thresholding).

5.2. D-representation. The Bondesson D-representation, in contrast to F- and V-representations, couples the rates via the Poisson process Γ_{ck}. The truncation Θ_K is not a CRM, nor is it independent of the tail measure, and Theorem 5.3 does not apply. The following result provides a truncation error bound for the normalization of the Bondesson D-representation.

Theorem 5.4 (Normalized B-DRep truncation error bound). Let Θ ∼ CRM(ν) where ν satisfies the conditions in Theorem 3.1, let c_ν and g_ν have the same definitions as in Theorem 3.1, and let Θ_K be the truncation of Θ ← B-DRep(c_ν, g_ν). Then if Ξ is the normalization of Θ, and Ξ_K is the normalization of Θ_K, the error of approximating Ξ with Ξ_K is bounded by

  B(N, K) = 1 − (∬ W(s, t) ds dt)^N,   (5.30)


where

  W(s, t) := −1[t ≥ 0] (c^{K−1}/Γ(K − 1)) t^{K−2} e^{−ct} e^{F(t+s)} (d²F/ds²)(s),   (5.31)
  F(x) := ∫ (e^{−θe^{−x}} − 1) ν(dθ),   F ∈ C^∞.   (5.32)

Furthermore, the bound satisfies

  0 ≤ B(N, K) ≤ 1   and   lim_{K→∞} B(N, K) = 0.   (5.33)

Example 5.3 (Dirichlet process, DP(γ)). The Dirichlet process with concentration γ > 0 is a normalized gamma process ΓP(γ, 1, 0) with rate measure ν(dθ) = γθ^{−1}e^{−θ} dθ. Using the construction in Example 3.2, we have c_ν = γ. Example 5.1 also provides F(s) in closed form,

  F(s) = −γ log(e^{−s} + 1).   (5.34)

The second derivative of F is

  (d²F/ds²)(s) = −γ e^{−s}/(e^{−s} + 1)².   (5.35)

Therefore, using the substitution x = e^{−s} and integration by parts,

  −∫ e^{F(t+s)} (d²F/ds²)(s) ds = γ ∫ (e^{−(t+s)} + 1)^{−γ} e^{−s}(e^{−s} + 1)^{−2} ds   (5.36)
    = γ ∫_0^∞ (x e^{−t} + 1)^{−γ} (x + 1)^{−2} dx   (5.37)
    = 1 − γ e^{γt} ∫_0^∞ (x + e^t)^{−(γ+1)} (x + 1)^{−1} dx.   (5.38)

Noting that for any a ∈ [0, 1),

  (x + e^t)^{−(γ+1)} ≤ e^{−(γ+a)t} (x + 1)^{a−1},   (5.39)

we have

  1 − γ e^{γt} ∫_0^∞ (x + e^t)^{−(γ+1)} (x + 1)^{−1} dx ≥ 1 − γ e^{−at} ∫_0^∞ (x + 1)^{−2+a} dx   (5.40)
    = 1 − (γ/(1 − a)) e^{−at},   (5.41)

and therefore,

  ∬ W(s, t) ds dt ≥ (γ^{K−1}/Γ(K − 1)) ∫_0^∞ t^{K−2} e^{−γt} (1 − (γ/(1 − a)) e^{−at}) dt   (5.42)
    = 1 − (γ/(1 − a)) (γ/(γ + a))^{K−1}.   (5.43)

Minimizing with respect to a, we find that the minimum occurs at a = max{0, (K − γ)/(K + 1)}. Therefore, the truncation error of the Dirichlet process can be bounded by

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ 1 − (1 − (γ/(1 − a)) (γ/(γ + a))^{K−1})^N,   a = max{0, (K − γ)/(K + 1)},   (5.44)
    ∼ N (K + 1) (γ/(γ + 1))^K,   K → ∞.   (5.45)

5.3. Hyperpriors. As in the CRM case, we can place priors on the hyperparameters of the NCRM rate measure (i.e., γ, α, λ, d, etc.). We conclude our investigation of NCRM truncation error by showing how bounds developed in this section can be modified to account for hyperpriors. Note that we make the dependence of the truncation error bound B(N, K) on the hyperparameters Φ explicit in the notation of Proposition 5.5.

Proposition 5.5 (NCRM truncation error with a hyperprior). Consider a representation for Θ | Φ ∼ CRM(ν), and let Ξ be its normalization with truncation error bound B(Φ, N, K) given fixed hyperparameters Φ. Then the corresponding truncation error bound for random Φ is

  B(N, K) = E[B(Φ, N, K)].   (5.46)

Example 5.4 (Normalized gamma process, NΓP(γ, 1, 0)). A standard choice of hyperprior for the concentration γ is a gamma distribution, i.e., γ ∼ Gam(a, b). Combining Proposition 5.5 with the result from Eq. (5.29) with λ = 1, we have

  (1/2) ‖p_{N,∞} − p_{N,K}‖₁ ≤ E[1 − (1 − γ K^{−1} log(1 + K))^N]   (5.47)
    ≤ 1 − (1 − E[γ] K^{−1} log(1 + K))^N   (5.48)
    = 1 − (1 − (a/b) K^{−1} log(1 + K))^N.   (5.49)

6. Simulation algorithms and computational complexity

The D-, F-, and V-representations in Section 3 are each generated from a different finite sequence of distributions, resulting in a different expected computational cost for the same truncation level. Therefore, the truncation level itself is not an appropriate parameter with which to compare the error bounds for different representations; we require a characterization of the computational cost. In this section, we investigate the mean complexity E[R] of each representation, where R is the number of random variables sampled to generate a truncation, as a function of the truncation level for each of the representations in Section 3. We further relate these quantities to the expectation of the number of atoms generated, E[A]. In the process, we develop a simulation algorithm for size-biased V-representations that can be implemented in practice to generate a truncated CRM in finite time.

6.1. D- and F-representations. The complexity of Bondesson D-representations is straightforward to derive directly from the definition: for a particular truncation level K,

  E[R] = 3K = 3E[A].   (6.1)

28

T. CAMPBELL, J. H. HUGGINS, J. HOW, AND T. BRODERICK

The constant factor of 3 arises from the need to sample Vk , Eck , and ψk at each truncation level. The complexity for decoupled Bondesson F-representations is similarly straightforward to develop: for a particular truncation level K,     3c ξ E[R] = +1 K = 3+ E[A]. (6.2) ξ c The 3c/ξ factor corresponds to sampling Vki , Tki , and ψki an average of c/ξ times, while the additional K accounts for the Poisson random number of atoms at each round that are sampled. The expression for the expected number of atoms follows from E[A] = ξc K. An interesting consequence of this relatively simple formula is a method for the principled selection of the free parameter ξ; we can either minimize expected complexity given a target error bound, or minimize the error bound given a target complexity. Another consequence is that when ξ = c, both the Bondesson D-representation and the decoupled Bondesson F-representation have E[A] = K, and thus the computational complexity of the Bondesson D-representation is less than that of the decoupled Bondesson F-representation. Combining this insight with the fact that their truncation error bounds coincide when ξ = c (Section 4.2 and Appendix D.7), the Bondesson D-representation appears to be a better representation in this setting. Finally, for power-law F-representations,     5γ γ 1 5 1 E[R] = 1 + K + K2 = + E[A] + E[A]2 . (6.3) 2 2 γ 2 2γ This expression follows from the quadratic number of beta random variables required to generate the representation, and the fact that E[A] = γK. Algorithms for generating these representations in practice follow directly from their construction. 6.2. V-representation. In contrast to D- and F-representations, the analysis of size-biased V-representations is not as straightforward. 
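The closed-form complexities in Eqs. (6.1)-(6.3) are elementary to evaluate; the short sketch below simply collects them as functions of the truncation level K (the parameter names mirror the text).

```python
def complexity_D(K):
    """Bondesson D-representation, Eq. (6.1): E[R] = 3K = 3 E[A]."""
    return 3 * K

def complexity_F_decoupled(K, c, xi):
    """Decoupled Bondesson F-representation, Eq. (6.2):
    E[R] = (3c/xi + 1) K, with E[A] = (c/xi) K."""
    return (3 * c / xi + 1) * K

def complexity_F_powerlaw(K, gamma):
    """Power-law F-representation, Eq. (6.3):
    E[R] = (1 + 5*gamma/2) K + (gamma/2) K^2, with E[A] = gamma K."""
    return (1 + 5 * gamma / 2) * K + gamma / 2 * K ** 2

# When xi = c, both the D-rep and the decoupled F-rep generate E[A] = K
# atoms, and the D-rep needs fewer random variates (3K versus 4K).
```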
In fact, the representation as given in Section 3.3 cannot in general be used to simulate a truncated CRM: when h has countable support, simulating according to this construction requires sampling an infinite number of random variables for each fixed value of k, and thus cannot be used directly in practice. It is possible to use the superposition property (Lemma A.2) to analytically collapse the problematic sum, an approach that is considered in James (2014); however, this technique eliminates the possibility of generating θ_{kxj} from a familiar exponential family distribution, often necessitating approximate sampling techniques for θ_{kxj}. There exists another approach, based on the following lemma, that does not collapse the sum over x, which allows sampling rates from well-known exponential family distributions.

Lemma 6.1 (Infinite Poisson Sampling). Suppose λ_j ≥ 0 and Σ_j λ_j < ∞. Then given the model

S_1 ∼ Poiss(Σ_j λ_j),  S_{i+1} = S_i − T_i,  T_i ∼ Binom(S_i, λ_i / Σ_{j≥i} λ_j),  i ∈ ℕ,   (6.4)

the marginal distribution of T_1, T_2, . . . is

T_j ∼indep Poiss(λ_j),  j ∈ ℕ.   (6.5)

Further, if I is the largest index i such that S_i > 0, then

P(I ≤ i) = e^{−Σ_{j>i} λ_j}.   (6.6)
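Lemma 6.1 suggests a direct sampler: draw the total count S_1 once, then peel off each T_i by binomial thinning of the remainder, stopping at the almost-surely finite index I. A minimal sketch is below; the geometric rates λ_j = 2^{−j} at the end are purely illustrative.

```python
import math
import random

def infinite_poisson_sampling(total, lam, rng):
    """Sample T_1, T_2, ... from the scheme of Lemma 6.1 until the
    remainder S_i hits zero (i.e. up to the a.s.-finite index I).

    total: sum_j lam(j), assumed known in closed form.
    lam:   callable j -> lam_j for j = 1, 2, ...
    Returns (S_1, [T_1, ..., T_I]); all later T_j are zero.
    """
    # S_1 ~ Poiss(total), via Knuth's inversion method
    s1, p, L = 0, rng.random(), math.exp(-total)
    while p > L:
        s1 += 1
        p *= rng.random()

    ts, s, remaining, i = [], s1, total, 1
    while s > 0:
        q = min(1.0, lam(i) / remaining) if remaining > 0 else 1.0
        t = sum(rng.random() < q for _ in range(s))  # T_i ~ Binom(S_i, q)
        ts.append(t)
        s -= t
        remaining -= lam(i)
        i += 1
    return s1, ts

# Illustrative geometric rates lam_j = 2^{-j}, so total mass is 1.
s1, ts = infinite_poisson_sampling(1.0, lambda j: 0.5 ** j, random.Random(3))
```

By construction the sampled T_i exhaust S_1, and marginally T_1 ∼ Poiss(λ_1) with mean λ_1.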

Lemma 6.1 allows us to sample the infinite sequence of C_{kx} for each fixed k in a finite number of iterations, which then reduces sampling rates and locations for that value of k to a finite-time operation. The final simulation technique is specified in the following result.

Theorem 6.2 (Truncated CRM Simulation). Let Λ̄_K = Σ_{k=1}^K Λ_k, and let (k_1, x_1), (k_2, x_2), . . . be any ordering of the countable product space {1, . . . , K} × ℕ. Then the truncated V-representation Θ_K may be constructed as follows:

Θ_K = Σ_{i=1}^I Σ_{j=1}^{C_i} θ_{ij} δ_{ψ_{ij}}

S_1 ∼ Poiss(Λ̄_K),  S_{i+1} = S_i − C_i,  I = sup{i : S_i > 0}

C_i ∼ Binom( S_i, λ_{k_i x_i} / (Λ̄_K − Σ_{j=1}^{i−1} λ_{k_j x_j}) )

θ_{ij} ∼indep (1/λ_{k_i x_i}) h(0|θ)^{k_i − 1} h(x_i|θ) ν(dθ).   (6.7)
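For concreteness, the construction in Theorem 6.2 can be instantiated for BP(γ, α, 0) with a Bernoulli likelihood, where x = 1 is the only nonzero observation. The sketch below assumes the familiar size-biased form for the beta process, λ_k = γα/(α + k − 1) with weight density Beta(1, α + k − 1); these specific formulas are our assumption here, not quoted from this section.

```python
import math
import random

def truncated_bp_vrep(gamma, alpha, K, rng):
    """Sketch of Theorem 6.2 for BP(gamma, alpha, 0) with Bernoulli h.
    Assumes size-biased rates lam_k = gamma*alpha/(alpha+k-1) and
    weight distributions Beta(1, alpha+k-1).
    Returns the list of (weight, location) atoms of Theta_K."""
    lams = [gamma * alpha / (alpha + k - 1.0) for k in range(1, K + 1)]
    Lbar = sum(lams)  # Lambda-bar_K

    # S_1 ~ Poiss(Lambda-bar_K), via Knuth's inversion method
    s, p, L = 0, rng.random(), math.exp(-Lbar)
    while p > L:
        s += 1
        p *= rng.random()

    atoms, remaining = [], Lbar
    for k in range(1, K + 1):
        if s == 0:
            break  # passed the largest index I with S_i > 0
        q = 1.0 if k == K else min(1.0, lams[k - 1] / remaining)
        c = sum(rng.random() < q for _ in range(s))  # C_i ~ Binom(S_i, q)
        for _ in range(c):
            theta = rng.betavariate(1.0, alpha + k - 1.0)  # size-biased weight
            psi = rng.random()                             # location, Unif[0,1]
            atoms.append((theta, psi))
        s -= c
        remaining -= lams[k - 1]
    return atoms

atoms = truncated_bp_vrep(gamma=2.0, alpha=3.0, K=50, rng=random.Random(0))
```

Consistent with the remark following the theorem, the number of atoms produced is Poiss(Λ̄_K), so the average count over repeated runs concentrates at Λ̄_K.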

The first remark to make about this construction is that the truncation Θ_K has Poiss(Λ̄_K) atoms, so E[A] = Λ̄_K. Next, I is almost surely finite by Lemma 6.1, so the procedure in Theorem 6.2 terminates in finite time. One drawback is that this simulation algorithm requires a priori knowledge of the truncation level K, since the first step S_1 ∼ Poiss(Λ̄_K) requires K < ∞ (otherwise S_1 = ∞ almost surely when K = ∞, since ν(ℝ₊) = ∞). However, defining Θ′_k := Σ_{x=1}^∞ Σ_{i=1}^{C_{kx}} θ_{kxi} δ_{ψ_{kxi}}, one can use the proposed algorithm to simulate each Θ′_k individually for k = 1, 2, . . . , K; summing the results yields the desired truncation Θ_K = Σ_{k=1}^K Θ′_k. In this case, one can stop the simulation at any desired truncation level K without knowing that level beforehand. The next result characterizes the expected complexity of using the procedure in Eq. (6.7) and provides an upper bound along with a tightness guarantee, in the event that it is easier to use.

Theorem 6.3 (Expected Simulation Complexity).

E[R] = 1 + 2Λ̄_K + Σ_{r=1}^∞ ( 1 − e^{−Σ_{i≥r} λ_{k_i x_i}} )   (6.8)

≤ 1 + 2Λ̄_K + L + Σ_{r=L+1}^∞ Σ_{i=r}^∞ λ_{k_i x_i}   (6.9)

≤ (e/(e−1)) E[R],   (6.10)

where

L := |{ r ∈ ℕ : Σ_{i=r}^∞ λ_{k_i x_i} > 1 }|.   (6.11)
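For a concrete ordering the quantities in Theorem 6.3 can be evaluated numerically, truncating the infinite sums once the tails are negligible. The sketch below uses hypothetical geometric rates λ_r = 3 · 2^{−r} (total mass 3), for which the sandwich E[R] ≤ bound ≤ (e/(e−1)) E[R] can be checked directly.

```python
import math

def thm63_quantities(lam, n_terms=200):
    """Evaluate E[R] (Eq. 6.8) and the upper bound (Eq. 6.9) of
    Theorem 6.3 for rates lam(r), truncating the infinite sums."""
    lams = [lam(r) for r in range(1, n_terms + 1)]
    tails = [sum(lams[r - 1:]) for r in range(1, n_terms + 1)]
    Lbar = sum(lams)
    ER = 1 + 2 * Lbar + sum(1 - math.exp(-t) for t in tails)
    L = sum(1 for t in tails if t > 1)          # Eq. (6.11)
    bound = 1 + 2 * Lbar + L + sum(tails[L:])   # Eq. (6.9)
    return ER, bound

# Hypothetical geometric rates lam_r = 3 * 2^{-r}; tails are 3 * 2^{1-r},
# so L = 2 and the bound evaluates to 1 + 6 + 2 + 1.5 = 10.5.
ER, bound = thm63_quantities(lambda r: 3 * 0.5 ** r)
```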

It is easy to deduce from Theorem 6.3 that selecting an ordering of {1, 2, . . . , K} × ℕ such that the sequence λ_{k_1 x_1}, λ_{k_2 x_2}, λ_{k_3 x_3}, . . . is non-increasing yields the optimal complexity for any choice of K. This makes intuitive sense, as it causes the C_{kx} with large expectation to be sampled earlier, leaving less remaining mass.

Example 6.1 (Beta-Bernoulli process, BP(γ, α, d)). Using the result developed earlier in Eq. (4.44) and the inequality (Gautschi, 1959)

(1 + x)^{d−1} ≤ Γ(x+d)/Γ(x+1) ≤ x^{d−1},  0 ≤ d ≤ 1,  x ≥ 1,   (6.12)

for α + K ≥ 1,

Λ̄_K = Σ_{k=1}^K Λ_k = (γ/d) (Γ(α+1)/Γ(α+d)) [ Γ(α+d+K)/Γ(α+K) − Γ(α+d)/Γ(α) ]   (6.13)
     ≤ (γ/d) (Γ(α+1)/Γ(α+d)) (α+K)^d − γα/d.   (6.14)

For α + K + d ≥ 2, we have

Λ̄_K ≥ (γ/d) (Γ(α+1)/Γ(α+d)) (α+K+d−1)^d − γα/d.   (6.15)

Next, since the Bernoulli likelihood has bounded support x ∈ {0, 1}, the third term in Theorem 6.3 can be bounded via

Σ_{r=1}^∞ ( 1 − e^{−Σ_{i≥r} λ_{k_i x_i}} ) = Σ_{r=1}^K ( 1 − e^{−Σ_{i=r}^K Λ_i} ) ≤ K.   (6.16)

Thus, for K ≥ 2, the mean computational complexity of simulating the V-representation of BP(γ, α, d) is bounded by

E[R] ≤ 1 − 2γα/d + (2γ/d) (Γ(α+1)/Γ(α+d)) (α+K)^d + K,   (6.17)

and, using the lower bound on Λ̄_K = E[A],

E[R] ≤ 2 − α − d + 2E[A] + ( Γ(α+d)/Γ(α) + d E[A] )^{1/d}.   (6.18)
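The closed form (6.13) and the Gautschi-derived sandwich (6.14)-(6.15) are easy to confirm numerically via log-gamma; the check below (with arbitrary illustrative parameters γ = 1, α = 2, d = 0.5) is a sanity test, not part of the derivation.

```python
import math

def gamma_ratio(a, b):
    """Gamma(a) / Gamma(b), computed stably via math.lgamma."""
    return math.exp(math.lgamma(a) - math.lgamma(b))

def bb_Lambda_bar(gamma, alpha, d, K):
    """Exact Lambda-bar_K for BP(gamma, alpha, d), Eq. (6.13)."""
    c = gamma / d * gamma_ratio(alpha + 1, alpha + d)
    return c * (gamma_ratio(alpha + d + K, alpha + K) - gamma_ratio(alpha + d, alpha))

def bb_Lambda_bar_bounds(gamma, alpha, d, K):
    """Lower / upper bounds of Eqs. (6.15) and (6.14)."""
    c = gamma / d * gamma_ratio(alpha + 1, alpha + d)
    lower = c * (alpha + K + d - 1) ** d - gamma * alpha / d
    upper = c * (alpha + K) ** d - gamma * alpha / d
    return lower, upper

# Sanity check of the sandwich at gamma = 1, alpha = 2, d = 0.5:
for K in (2, 10, 100):
    lo, hi = bb_Lambda_bar_bounds(1.0, 2.0, 0.5, K)
    assert lo <= bb_Lambda_bar(1.0, 2.0, 0.5, K) <= hi
```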

The beta prime-odds Bernoulli process has the same Λ_k and likelihood support, and thus has the same mean complexity bound.

For likelihoods with infinite support, bounding the series is not immediately possible. Corollary 6.4 provides a bound that applies to such likelihoods, given a sensible choice of the ordering of {1, 2, . . . , K} × ℕ. Note that this bound is quite loose; in practice, it is often better to approximate the series in Theorem 6.3 with a finite truncation, using the fact that Σ_{i≥r} λ_{k_i x_i} = Λ̄_K − Σ_{i=1}^{r−1} λ_{k_i x_i}. Obtaining tighter closed-form bounds on E[R] is an open problem.

Corollary 6.4. Given a rectangular ordering of {1, 2, . . . , K} × ℕ, i.e.

k_i = ((i − 1) mod K) + 1,  x_i = ⌈i/K⌉,   (6.19)

the mean complexity is bounded by

E[R] ≤ 1 + 2Λ̄_K + K² ∫ E_θ[X] ν(dθ).   (6.20)
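The rectangular ordering in Eq. (6.19) simply sweeps k = 1, . . . , K for each successive value of x; a two-line sketch:

```python
import math

def rectangular_ordering(K, n):
    """First n pairs (k_i, x_i) of the rectangular ordering, Eq. (6.19)."""
    return [(((i - 1) % K) + 1, math.ceil(i / K)) for i in range(1, n + 1)]

pairs = rectangular_ordering(K=3, n=7)
# The first K pairs cover x = 1, the next K cover x = 2, and so on:
# [(1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (3, 2), (1, 3)]
```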


Example 6.2 (Beta-negative binomial process, BP(γ, α, d)). Using the result developed earlier in Eq. (4.54), we have, for α + Ks ≥ 1,

Λ̄_K = (γ/d) (Γ(α+1)/Γ(α+d)) [ Γ(α+d+Ks)/Γ(α+Ks) − Γ(α+d)/Γ(α) ]   (6.21)
     ≤ (γ/d) (Γ(α+1)/Γ(α+d)) (α+Ks)^d − γα/d,   (6.22)

and for α + Ks + d ≥ 2,

Λ̄_K ≥ (γ/d) (Γ(α+1)/Γ(α+d)) (α+Ks+d−1)^d − γα/d.   (6.23)

The expectation of the negative binomial distribution is (1 − θ)^{−1} θ s. When α + d > 1, the integral in Corollary 6.4 is finite, and for K ≥ 2/s the mean complexity is bounded by

E[R] ≤ 1 − 2γα/d + (2γ/d) (Γ(α+1)/Γ(α+d)) (α+Ks)^d + (sγα/(α+d−1)) K²,   (6.24)

and

E[R] ≤ 1 + 2E[A] + s^{−2} [ ( Γ(α+d)/Γ(α) + d E[A] )^{1/d} − α − d + 1 ]².   (6.25)

Example 6.3 (Gamma-Poisson process, ΓP(γ, d, λ)). Using the result developed earlier in Eq. (4.66), we have

Λ̄_K = Σ_{k=1}^K Λ_k = (γλ^{1−d}/d) (λ+K)^d − γλ/d.   (6.26)

The expectation of the Poisson distribution is θ; therefore, computing the integral in Corollary 6.4 yields the mean complexity bound

E[R] ≤ 1 − 2γλ/d + (2γλ^{1−d}/d) (λ+K)^d + γK².   (6.27)

6.3. Summary of CRM results. Table 1 summarizes our truncation and simulation complexity results as they apply to the beta, gamma, and beta prime processes. Results for the Bondesson D-representation of BP(γ, 1, 0) as well as the decoupled Bondesson F-representations of BP(γ, α, 0) and ΓP(γ, λ, 0) were previously known, and are reproduced by our results. All other results in the table are novel to the best of the authors' knowledge. It is interesting to note that the bounds and expected complexity within each of the D-, F-, and V-representation classes have the same form, aside from some constants. Across classes, however, they vary significantly, indicating that the chosen sequential representation of a process has more of an influence on the truncation error than the process itself.

7. Simulation study In this section, we present a simulation study of how the developed truncation error bounds vary with the expected computational complexity E[R] of simulating the truncation. This comparison yields heuristics for selecting which of the three sequential representations of an (N)CRM is most advantageous for a particular application. The results in the main text have the number of observations fixed to

[Figure 1: log-log plots of L1 error bound versus mean computational complexity for the BP, ΓP, BPP, and NΓP processes. Left column: γ = 1, α = λ = 1, d = 0; right column: γ = 1, α = λ = 2, d = 0.5. Legend entries include B-DRep (Teh), B-FRep (Paisley), B-FRep (Roychowdhury), PL-FRep (Broderick), DRep (Sethuraman), and SB-VRep (Thibaux).]

Figure 1. Truncation error bounds for D-, F-, and V-representations of the beta-Bernoulli (BP), gamma-Poisson (ΓP), beta prime-odds Bernoulli (BPP), and normalized gamma-multinomial (NΓP) processes. Names in parentheses denote the first author of the original source for each sequential representation.

Table 1. Error bounds and sampling complexity for truncated CRMs. The Ci vary between models. Be = Bernoulli, OBe = odds Bernoulli, Poi = Poisson, and NB(s) = negative binomial with s failures.

Rep.  CRM               h      Truncation Err. Bound              Asymptotic Complexity
D     BP(γ, α ≥ 1, 0)   Be     Nγ (γα/(γα+1))^K                   3 E[A]
      BPP(γ, α, 0)      OBe
      ΓP(γ, α, 0)       Poi
F     BP(γ, α ≥ 1, 0)   Be     N C0 (ξ/(ξ+1))^K                   (3 + ξ/(γα)) E[A]
      BPP(γ, α, 0)      OBe
      ΓP(γ, α, 0)       Poi
      BP(γ, α, d)       Be     Nγ ((α+d)/(α+1))^K                 (1/γ + 5/2) E[A] + (1/(2γ)) E[A]²
      BPP(γ, α, d)      OBe    N (γα/(α−1)) ((1+d)/2)^K
      ΓP(γ, λ, d)       Poi    Nγ ((α+d)/(α+1))^K
V     BP(γ, α, d)       Be     N C0 (α+d+K)^{d−1}                 C0 + 2E[A] + (C1 + d E[A])^{1/d}
      BPP(γ, α, d)      OBe    N C0 (α+d+K)^{d−1}
      BP(γ, α, d)       NB(s)  N C0 (α+d+Ks)^{d−1}                1 + 2E[A] + C0 [(C1 + d E[A])^{1/d} − C2]²
      ΓP(γ, α, d)       Poi    N C0 (α+K)^{d−1}

N = 10; see Appendix G for a discussion of the effect of increasing N on the plots shown here. Fig. 1 shows the results of the simulation study, comparing the beta-Bernoulli (BP), gamma-Poisson (ΓP), beta prime-odds Bernoulli (BPP), and normalized generalized gamma-multinomial (NΓP) processes. In the legend, a name in parentheses denotes equivalence to a previously-developed construction. The bound for the Sethuraman (1994) construction is the only result not generalized by the present work, but is shown for comparison purposes. Note that all other bounds presented in the figure are improved by a factor of two compared to past representations due to the reliance on Lemmas 4.1 and 5.1 rather than the earlier bound found in Ishwaran and James (2001). All results without a name are novel to the best knowledge of the authors, and are possible due to the generality of the present work. The left column shows results for the two-parameter (i.e. undiscounted) process, with γ = 1, α = λ = 1, d = 0, and ξ = c = γα. While the D- and F-representations capture the exponential truncation error tails, the tail of the V-representation bound is consistently polynomial. This appears to be the price of the V-representation’s generality, which imposes no conditions on ν beyond the basic assumption in Eq. (2.1). The right column shows results for the discounted (i.e. power-law) process, with γ = 1, α = λ = 2, and d = 0.5. For discounted processes, the V-representation appears to outperform the power-law F-representation as the truncation level increases; this is due to the computational cost of simulating a sequence of beta random variables for each atom in the power-law F-representation. From the simulation results, it appears that all three of the D-, F-, and Vrepresentations have settings in which they should be chosen; there is no single dominant representation for all situations. 
In the undiscounted setting, the D- and F-representations should be used for their quickly decaying truncation error bounds. The choice between these two depends on whether decoupling the atoms across rounds is important; if so, the F-representation should be selected. In the discounted setting, the V-representation appears to be superior, where its selection is based on its simplicity, generality, and improved performance compared to the power-law F-representation. However, results in Appendix G show that when 0 < d ≪ 1, the power-law F-representation bound approaches the undiscounted bound, and outperforms the V-representation bound.

8. Discussion We have investigated sequential representations, truncation error bounds, and simulation algorithms for (normalized) completely random measures. In past work, the development of these tools—and the analysis of their properties—has happened only on an ad hoc basis. The results in the present paper, in contrast, provide a comprehensive characterization and analysis of the different types of sequential (N)CRM representations available to the practitioner. There do remain, however, a number of open questions, limitations, and directions for future work. First, this work does not consider the influence of observed data: all analyses are presented from an a priori perspective because truncation is typically performed before data are incorporated via posterior inference (e.g. in standard variational inference for the DP mixture model (Blei and Jordan, 2006) and BP latent feature model (Doshi-Velez et al., 2009)). However, analysis of a posteriori truncation has been studied in past work as well (Ishwaran and James, 2001; Gelfand and Kottas, 2002; Ishwaran and Zarepour, 2002). In the language of CRMs, observations typically introduce a fixed-location component in the posterior process, while the unobserved traits are again drawn from the (possibly normalized) ordinary component of a completely random measure (Ishwaran and Zarepour, 2002; Broderick et al., 2014). We anticipate that this property makes observations reasonably simple to include: the truncation tools provided in the present paper can be used directly on the unobserved ordinary component, while the fixed-location component may be treated exactly. There are two remaining open questions regarding the three types of sequential representation developed in this work. Regarding the D- and F-representations developed in Section 3, it is unknown whether a general extension to power-law representations of the beta, beta prime, and gamma processes exists. 
The proposed stochastic mapping approach provides an ad hoc solution in the case of F-representations, but it is not systematic, and it is not yet easily applicable to D-representations. Regarding V-representations, one might expect that the use of conjugate exponential family CRMs (Broderick et al., 2014) would yield a closed-form expression for the truncation bound. In all of the cases provided in this paper, this was indeed the case; the integrals were evaluated exactly and a closed-form expression was found. However, we were unable to identify a general expression applicable to all conjugate exponential family CRMs. Based on the examples provided, we conjecture that such an expression exists. A final remark is that one of the primary uses of sequential representations in past work has been in the development of posterior inference procedures (Paisley et al., 2010; Blei and Jordan, 2006; Doshi-Velez et al., 2009). The present work provides no guidance on which truncated representations are best paired with which inference methods. We leave this as an open direction for future research, which will require both theoretical and empirical investigation.


Appendix A. Poisson point process operations

The following results are used extensively throughout our developments; proofs of all can be found in prior work (Kingman, 1993).

Lemma A.1 (Mapping). Given a Poisson point process P with rate measure ν on a space Ω, and a measurable function φ : Ω → Ω, the set of transformed points P′ = {u : x ∈ P, u = φ(x)} is a Poisson point process with rate measure ν′(B) := ν(φ^{−1}(B)).

Lemma A.2 (Superposition). Given an infinite collection of Poisson point processes with rate measures (ν_i)_{i=1}^∞, the union of their points is a Poisson point process with rate measure Σ_i ν_i.

Lemma A.3 (Thinning). Given a Poisson point process P with rate measure ν on a space Ω, a function φ : Ω → [0, 1], and a collection of Bernoulli random variables q_x ∼ Bern(φ(x)) ∀x ∈ P, (1) the thinned process P′ = {x : x ∈ P, q_x = 1} is a Poisson point process with rate measure ν′(B) := ∫_B φ(x)ν(dx), and (2) the removed process P″ = {x : x ∈ P, q_x = 0} is a Poisson point process with rate measure ν″(B) := ∫_B (1 − φ(x))ν(dx).

Lemma A.4 (Displacement). Given a Poisson point process P with rate measure ν on a space Ω, a function φ : Ω × Ω → ℝ₊ with φ(·, x) a density on Ω ∀x ∈ Ω, φ(y, ·) measurable ∀y ∈ Ω, and a collection of points q_x ∼ φ(·, x) ∀x ∈ P, the set of displaced points P′ = {q_x : x ∈ P} is a Poisson point process with rate measure ν′(B) = ∫_B ∫ φ(y, x)ν(dx) dy.

Appendix B. Technical lemmas

Lemma B.1. If ν(dθ), an absolutely continuous σ-finite measure on ℝ₊, and continuous φ : ℝ₊ → [0, 1] satisfy

ν(ℝ₊) = ∞,  ∫ min(1, θ)ν(dθ) < ∞,  ∫ φ(θ)ν(dθ) < ∞,   (B.1)

then

lim_{θ→0} φ(θ) = φ(0) = 0.   (B.2)

Proof. ∫ min(1, θ)ν(dθ) < ∞ implies that ∫₀¹ θν(dθ) < ∞ and ∫₁^∞ ν(dθ) < ∞. But ν(ℝ₊) = ∞, so ∫₀¹ ν(dθ) = ∞. In fact, for all ε > 0, ν([0, ε]) = ∞, since otherwise ∫_ε¹ ν(dθ) = ∞ =⇒ ∫_ε¹ θν(dθ) = ∞, a contradiction. Since φ is continuous and has bounded range, lim_{θ→0} φ(θ) = c exists and is finite. Assume c > 0, so ∃ε > 0 such that ∀θ′ < ε, |φ(θ′) − c| < c/2, and in particular φ(θ′) > c/2. Thus, for any θ₀ < ε, ∫₀^{θ₀} φ(θ)ν(dθ) ≥ (c/2) ∫₀^{θ₀} ν(dθ) = ∞, a contradiction. Thus, c = 0. □

Lemma B.2. Suppose h(x|θ) is a proper probability mass function on ℕ ∀θ, ν is a measure on the space of θ, and h is measurable with respect to ν for all x. Then

Σ_{x=1}^∞ ∫ h(x|θ)ν(dθ) = ∫ Σ_{x=1}^∞ h(x|θ)ν(dθ) = ∫ (1 − h(0|θ))ν(dθ).   (B.3)

Proof. Let φ_n = Σ_{x=1}^n h(x|θ) and φ = Σ_{x=1}^∞ h(x|θ). Since h is a probability mass function, h(x|θ) ≥ 0 ∀θ. Therefore φ_{n+1} ≥ φ_n ∀n ∈ ℕ. Thus, by the monotone convergence theorem,

Σ_{x=1}^∞ ∫ h(x|θ)ν(dθ) = lim_{n→∞} ∫ Σ_{x=1}^n h(x|θ)ν(dθ) = lim_{n→∞} ∫ φ_n ν(dθ) = ∫ φ ν(dθ) = ∫ Σ_{x=1}^∞ h(x|θ)ν(dθ).   (B.4)

The final equality in the result simply follows from h being a probability mass function. □

Lemma B.3. For α, β > 0 and x ≥ 0,

1 − (β/(β+x))^α ≤ (α/β) x.   (B.5)

Proof. The result follows from the fact that the two expressions and their derivatives in x are equal at x = 0, and the left hand side is concave in x. □

Lemma B.4. For N ≥ 1 and x ∈ [0, 1],

1 − x^N ≤ N(1 − x).   (B.6)
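Both elementary inequalities (Lemmas B.3 and B.4) are easy to confirm numerically; the grid check below is a sanity test with arbitrary illustrative parameter values, with a small slack for floating-point equality at the boundary cases.

```python
def lhs_b3(alpha, beta, x):
    """Left hand side of Lemma B.3: 1 - (beta/(beta+x))^alpha."""
    return 1 - (beta / (beta + x)) ** alpha

# Lemma B.3: 1 - (beta/(beta+x))^alpha <= (alpha/beta) x
for a in (0.5, 1.0, 3.0):
    for b in (0.1, 1.0, 5.0):
        for x in (0.0, 0.3, 1.0, 10.0):
            assert lhs_b3(a, b, x) <= a / b * x + 1e-12

# Lemma B.4: 1 - x^N <= N (1 - x) on [0, 1]
for N in (1, 2, 10):
    for x in (0.0, 0.25, 0.5, 0.9, 1.0):
        assert 1 - x ** N <= N * (1 - x) + 1e-12
```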

Proof. Since 1 − x^N |_{x=1} = N(1 − x)|_{x=1} = 0, and for all x ∈ [0, 1] and N ≥ 1 it holds that (d/dx)[N(1 − x)] = −N ≤ −N x^{N−1} = (d/dx)[1 − x^N], the result follows. □

Lemma B.5. Assume φ(x) is a twice continuously differentiable function satisfying lim sup_{n→∞} sup_{y≥x_n} φ″(y)/φ″(x_n) = C < ∞ for any increasing sequence (x_n)_{n=1}^∞, and φ″(x)/φ′(x) → 0 as x → ∞. Then for any constant c > 0 and any increasing sequence (x_n)_{n=1}^∞,

φ(x_n + c) − φ(x_n) ∼ c φ′(x_n)  for n → ∞.   (B.7)

Proof. A second-order Taylor expansion of φ(x_n + c) about x_n yields

φ(x_n + c) − φ(x_n) = c φ′(x_n) + (c²/2) φ″(x_n*),   (B.8)

where x_n* ∈ [x_n, x_n + c]. Since x_n* ≥ x_n, our assumptions on φ ensure that

lim_{n→∞} φ″(x_n*)/φ′(x_n) ≤ lim_{n→∞} C φ″(x_n)/φ′(x_n) = 0,   (B.9)

and

lim_{n→∞} (φ(x_n + c) − φ(x_n))/(c φ′(x_n)) = lim_{n→∞} 1 + (c/2) φ″(x_n*)/φ′(x_n) = 1. □   (B.10)

Lemma B.6. For α > 0 and x ≥ −1,

Σ_{m=1}^M Γ(α+m+x)/Γ(α+m) = (1/(1+x)) [ Γ(α+M+x+1)/Γ(α+M) − Γ(α+x+1)/Γ(α) ]  if x > −1,
                           = ψ(α+M) − ψ(α)  if x = −1,   (B.11)

where ψ(·) is the digamma function.

Proof. When M = 1 and x > −1, analyzing the right hand side yields

Γ(α+M+x+1)/Γ(α+M) − Γ(α+x+1)/Γ(α) = Γ(α+x+2)/Γ(α+1) − Γ(α+x+1)/Γ(α)   (B.12)
= (Γ(α+x+1)/Γ(α)) [ (α+x+1)/α − 1 ]   (B.13)
= (x+1) Γ(α+x+1)/Γ(α+1).   (B.14)

By induction, supposing that the result is true for M − 1 ≥ 1 and x > −1,

Σ_{m=1}^M Γ(α+m+x)/Γ(α+m) = Σ_{m=1}^{M−1} Γ(α+m+x)/Γ(α+m) + Γ(α+M+x)/Γ(α+M)   (B.15)
= (1/(1+x)) [ Γ(α+M+x)/Γ(α+M−1) − Γ(α+x+1)/Γ(α) ] + Γ(α+M+x)/Γ(α+M)   (B.16)
= (Γ(α+M+x)/Γ(α+M−1)) (α+M+x)/((1+x)(α+M−1)) − Γ(α+x+1)/((1+x)Γ(α))   (B.17)
= (1/(1+x)) [ Γ(α+M+x+1)/Γ(α+M) − Γ(α+x+1)/Γ(α) ].   (B.18)

This demonstrates the desired result for x > −1. Next, when x = −1, we have that

Σ_{m=1}^M Γ(α+m−1)/Γ(α+m) = Σ_{m=1}^M 1/(α+m−1).   (B.19)

We proceed by induction once again. For M = 1, using the recurrence relation ψ(x+1) = ψ(x) + x^{−1} (Abramowitz and Stegun, 1964, Chapter 6), the right hand side evaluates to

ψ(α+1) − ψ(α) = ψ(α) + α^{−1} − ψ(α) = α^{−1}.   (B.20)

Supposing that the result is true for M − 1 ≥ 1 and x = −1,

Σ_{m=1}^M 1/(α+m−1) = Σ_{m=1}^{M−1} 1/(α+m−1) + 1/(α+M−1)   (B.21)
= ψ(α+M−1) − ψ(α) + 1/(α+M−1)   (B.22)
= ψ(α+M) − ψ(α),   (B.23)

demonstrating the result for x = −1. □
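The x > −1 case of the identity (B.11) can be checked numerically with the standard library gamma function; the parameter grid below is illustrative.

```python
import math

def lhs_b6(alpha, x, M):
    """Left hand side of Eq. (B.11)."""
    return sum(math.gamma(alpha + m + x) / math.gamma(alpha + m)
               for m in range(1, M + 1))

def rhs_b6(alpha, x, M):
    """Right hand side of Eq. (B.11), case x > -1."""
    return (math.gamma(alpha + M + x + 1) / math.gamma(alpha + M)
            - math.gamma(alpha + x + 1) / math.gamma(alpha)) / (1 + x)

for alpha in (0.5, 1.7, 4.0):
    for x in (-0.5, 0.0, 0.3, 2.0):
        for M in (1, 3, 10):
            r = rhs_b6(alpha, x, M)
            assert abs(lhs_b6(alpha, x, M) - r) < 1e-8 * (1 + abs(r))
```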

Lemma B.7. For a > 0, d ∈ ℝ, and x_n → ∞,

(d/dx_n) [ Γ(a+x_n+d)/Γ(a+x_n) ] ∼ d (a+x_n)^{d−1}.   (B.24)


Proof. We have

(d/dx) [ Γ(a+x+d)/Γ(a+x) ] = (Γ(a+x+d)/Γ(a+x)) (ψ(a+x+d) − ψ(a+x)),   (B.25)

where ψ is the digamma function. Using Lemma B.5 and the asymptotic expansion of ψ′ (Abramowitz and Stegun, 1964, Chapter 6), we obtain

ψ(a+x_n+d) − ψ(a+x_n) ∼ d ψ′(a+x_n) ∼ d/(x_n + a).   (B.26)

Since

Γ(a+x_n+d)/Γ(a+x_n) ∼ (a+x_n)^d,   (B.27)

using Lemma B.8(2) with the previous two displays yields the result. □



Lemma B.8 (Standard asymptotic equivalence properties). (1) If a_n ∼ b_n and b_n ∼ c_n, then a_n ∼ c_n. (2) If a_n ∼ b_n and c_n ∼ d_n, then a_n c_n ∼ b_n d_n. (3) If a_n ∼ b_n, c_n ∼ d_n, and a_n c_n > 0, then a_n + c_n ∼ b_n + d_n.

Appendix C. Proofs of sequential representation results

C.1. Correctness of D- and F-representations.

Proof of Theorem 3.1. First, we show that g_ν(v) is a density. Since vν(v) is non-increasing, (d/dv)[vν(v)] exists almost everywhere and (d/dv)[vν(v)] ≤ 0, hence g_ν(v) ≥ 0. Furthermore,

∫₀^∞ g_ν(v) dv = −c_ν^{−1} ∫₀^∞ (d/dv)[vν(v)] dv = −c_ν^{−1} [vν(v)]_{v=0}^∞ = 1,   (C.1)

where the final equality follows from the assumed behavior of vν(v) at 0 and ∞. Since for a partition A_1, . . . , A_n the random variables Θ(A_1), . . . , Θ(A_n) are independent, it suffices to show that for any measurable set A (with complement Ā), the random variable Θ(A) has the correct characteristic function. Define the family of random measures

Θ_t = Σ_{k=1}^∞ V_k e^{−(Γ_{1k}+t)/c_ν} δ_{ψ_k},  t ≥ 0,   (C.2)

so Θ_0 = Θ. Conditioning on Γ_11,

(Θ_t(A) | Γ_11 = u) =_D V_1 e^{−(u+t)/c_ν} 1[ψ_1 ∈ A] + Θ_{t+u}(A),   (C.3)

and note that the two terms on the right hand side are independent. We can thus write the characteristic function of Θ_t(A) as

ϕ(ξ, t, A) := E[e^{iξΘ_t(A)}]   (C.4)
= E[E[e^{iξΘ_t(A)} | Γ_11 = u]]   (C.5)
= E[E[e^{iξV_1 e^{−(u+t)/c_ν} 1[ψ_1∈A]} e^{iξΘ_{t+u}(A)} | Γ_11 = u]]   (C.6)
= E[E[(G(A) e^{iξV_1 e^{−(u+t)/c_ν}} + G(Ā)) ϕ(ξ, t+u, A) | Γ_11 = u]]   (C.7)
= G(A) ∫₀^∞ ∫₀^∞ e^{iξv e^{−(u+t)/c_ν}} ϕ(ξ, t+u, A) g_ν(v) e^{−u} du dv + G(Ā) ∫₀^∞ ϕ(ξ, t+u, A) e^{−u} du,   (C.8)

where Ā is the complement of A. Multiplying both sides by e^{−t} and making the change of variable w = u + t yields

e^{−t} ϕ(ξ, t, A) = G(A) ∫_t^∞ ∫₀^∞ e^{iξv e^{−w/c_ν}} ϕ(ξ, w, A) g_ν(v) e^{−w} dv dw + G(Ā) ∫_t^∞ ϕ(ξ, w, A) e^{−w} dw   (C.9)
= G(A) ∫_t^∞ ϕ_{g_ν}(ξ e^{−w/c_ν}) ϕ(ξ, w, A) e^{−w} dw + G(Ā) ∫_t^∞ ϕ(ξ, w, A) e^{−w} dw,   (C.10)

where ϕ_{g_ν}(a) := ∫₀^∞ e^{iav} g_ν(v) dv is the characteristic function of a random variable with density g_ν. Differentiating both sides with respect to t and rearranging yields

∂ϕ(ξ, t, A)/∂t = ϕ(ξ, t, A) − G(A) ϕ_{g_ν}(ξ e^{−t/c_ν}) ϕ(ξ, t, A) − (1 − G(A)) ϕ(ξ, t, A)   (C.11)
= ϕ(ξ, t, A) G(A) (1 − ϕ_{g_ν}(ξ e^{−t/c_ν})),   (C.12)

so we conclude that

ϕ(ξ, t, A) = exp( −G(A) ∫_t^∞ (1 − ϕ_{g_ν}(ξ e^{−u/c_ν})) du ).   (C.13)

Using integration by parts and the definition of g_ν, rewrite

ϕ_{g_ν}(a) = −c_ν^{−1} ∫₀^∞ (d/dv)[vν(v)] e^{iav} dv   (C.14)
= −c_ν^{−1} [vν(v) e^{iav}]_{v=0}^∞ + c_ν^{−1} ∫₀^∞ iav ν(v) e^{iav} dv   (C.15)
= 1 + ∫₀^∞ (iav/c_ν) ν(v) e^{iav} dv,   (C.16)


where the final equality follows from the assumed behavior of vν(v) at 0 and ∞. Combining the previous two displays and setting t = 0 concludes the proof:

ϕ(ξ, 0, A) = exp( −G(A) ∫₀^∞ ∫₀^∞ (iξv/c_ν) e^{−u/c_ν} e^{iξv e^{−u/c_ν}} ν(v) dv du )   (C.17)
= exp( −G(A) ∫₀^∞ ∫₀^∞ −(∂/∂u)[ e^{iξv e^{−u/c_ν}} ] du ν(v) dv )   (C.18)
= exp( −G(A) ∫₀^∞ (e^{iξv} − 1) ν(v) dv ).   (C.19)

□

Proof of Theorem 3.2. It was already shown that g_ν(v) is a density. Let Θ′_k = Σ_{i=1}^{C_k} θ_{ki} δ_{ψ_{ki}}, so Θ = Σ_{k=1}^∞ Θ′_k. Each Θ′_k is a CRM with rate measure (c_ν/ξ) ν′_k(dθ), where ν′_k(dθ) is the law of θ_{ki}. Using the product distribution formula, we have

ν′_k(dθ) = ∫₀¹ (ξ^k/Γ(k)) (−log w)^{k−1} w^{ξ−2} g_ν(θ/w) dw dθ.   (C.20)

Let G_ν(v) = ∫₀^v g_ν(x) dx be the cdf derived from g_ν. From the preceding arguments, conclude that the rate measure of Θ is

ν′(dθ) := (c_ν/ξ) Σ_{k=1}^∞ ν′_k(dθ)   (C.21)
= c_ν ∫₀¹ Σ_{k=1}^∞ (ξ^{k−1}/Γ(k)) (−log w)^{k−1} w^{ξ−2} g_ν(θ/w) dw dθ   (C.22)
= c_ν ∫₀¹ w^{−2} g_ν(θ/w) dw dθ   (C.23)
= c_ν θ^{−1} ∫₀¹ (∂/∂w)[−G_ν(θ/w)] dw dθ   (C.24)
= −c_ν θ^{−1} [G_ν(θ/w)]_{w=0}^1 dθ   (C.25)
= c_ν θ^{−1} (1 − G_ν(θ)) dθ.   (C.26)

The cdf can be rewritten as

1 − G_ν(θ) = 1 + c_ν^{−1} ∫₀^θ (d/dx)[xν(x)] dx = 1 + c_ν^{−1} [xν(x)]_{x=0}^θ = c_ν^{−1} θν(θ).   (C.27)

Combining the previous two displays, conclude that ν′(dθ) = ν(θ) dθ. □

C.2. Power-law behavior of F-representations. We now formalize the sense in which power-law F-representations do in fact produce power-law behavior. Let Z_n | Θ ∼i.i.d. LP(Poiss, Θ) and y_k := Σ_{n=1}^N 1[z_{nk} ≥ 1]. We analyze the number of non-zero features after N observations,

K_N := Σ_{k=1}^∞ 1[y_k ≥ 1],   (C.28)

and the number of features appearing j > 1 times after N observations,

K_{N,j} := Σ_{k=1}^∞ 1[y_k = j].   (C.29)

In their power law analysis of the beta process, Broderick et al. (2012) use a Bernoulli likelihood process. However, the Bernoulli process is only applicable if θ_k ∈ [0, 1], whereas in general θ_k ∈ ℝ₊. Replacing the Bernoulli process with a Poisson likelihood process is a natural choice since 1[z_{nk} ≥ 1] ∼ Bern(1 − e^{−θ_k}), and asymptotically 1 − e^{−θ_k} ∼ θ_k a.s. for k → ∞ since lim_{k→∞} θ_k = 0 a.s. Thus, the Bernoulli and Poisson likelihood processes behave the same asymptotically, which is what is relevant to our asymptotic analysis. We are therefore able to show that all CRMs with power-law F-representations, not just the beta process, have what Broderick et al. (2012) call Type I and II power law behavior. Our only condition is that the tails of g are not too heavy.

Theorem C.1. Assume that g is a continuous density such that for some ε > 0,

g(x) = O(x^{−1−d−ε}).   (C.30)

Then for Θ ← PL-FRep(γ, α, d, g) with d > 0, there exists a constant C depending on γ, α, d, and g such that, almost surely,

K_N ∼ Γ(1 − d) C N^d,  N → ∞,   (C.31)
K_{N,j} ∼ (d Γ(j − d)/j!) C N^d,  N → ∞  (j > 1).   (C.32)

∞ X k=1 ∞ X

1[|Πk ∩ [0, t]| > 0]

(C.34)

1[|Πk ∩ [0, t]| = j].

(C.35)

k=1

Furthermore, for N ∈ N, let ΦN := E[KN ]

and

ΦN,j := E[KN,j ]

and

Φj (t) := E[Kj (t)]

(j > 1)

(C.36)

and for t > 0, let Φ(t) := E[K(t)]

(j > 1).

(C.37)

42

T. CAMPBELL, J. H. HUGGINS, J. HOW, AND T. BRODERICK

If follows from Campbell’s Theorem (Kingman, 1993) that " # Z X −tθk Φ(t) = E (1 − e ) = (1 − e−tθ )ν(dθ)

(C.38)

k

"

ΦN

# X =E (1 − e−N θk ) = Φ(N )

(C.39)

k

#   "X j Z N t (1 − e−θk )j −tθk tj Φj (t) = E e = (1 − e−θ )j e−tθ ν(dθ) (C.40) j j! j! k #  Z   "X N N ΦN,j = E (1 − e−θk )j e−(N −j)θk = (1 − e−θ )j e−(N −j)θ ν(dθ). j j k (C.41) The first lemma characterizes the power law behavior of Φ(t) and Φj (t). A slowly varying function ` satisfies `(ax)/`(x) → 1 as x → ∞ for all a > 0. Lemma C.2 (Broderick et al. (2012), Proposition 6.1). If for some d ∈ (0, 1), C > 0, and slowly varying function `, Z x d ν¯[0, x] := C`(1/x)x1−d , x → 0, (C.42) θν(dθ) ∼ 1 − d 0 then Φ(t) ∼ Γ(1 − d)Ctd , Φj (t) ∼

t→∞

d Γ(j − d) d Ct , j!

t→∞

(C.43) (j > 1).

(C.44)

Transferring the power law behavior from Φ(t) to ΦN is trivial since Φ(N ) = ΦN . The next lemma justifies transferring the power law behavior from Φj (t) to ΦN,j . Lemma C.3 (Broderick et al. (2012), Lemmas 6.2 and 6.3). If ν satisfies Eq. (2.1), then K(t) ↑ ∞ a.s.,

Φ(t) ↑ ∞,

Φ(t)/t ↓ 0.

(C.45)

Furthermore, |ΦN,j − Φj (N )|
0, Cj > 0, and slowly varying functions `, `0 , Φ(t) ∼ C`(t)td and Φj (t) ∼ Cj `(t)td . Then for N → ∞, almost surely X X Kn ∼ ΦN and KN,i ∼ ΦN,i . (C.47) i 1] for some C1 > 0, so g(w + θ)

wα+d−1 ≤ φ(w)wd . (w + θ)α−1

(C.50)

Since φ(w)wd is integrable, by dominated convergence the limit Z ∞ wα+d−1 dw L = lim g(w + θ) θ→0 0 (w + θ)α−1

(C.51)

exists and is finite. Moreover, since g(x) is a continuous density, there exists M > 0 and 0 < a < b < ∞ such that g(x) ≥ M for all x ∈ [a, b]. Hence, for θ < a, Z ∞ Z b−θ wα+d−1 wα+d−1 ≥ M > 0, (C.52) g(w + θ) α−1 (w + θ)α−1 0 a−θ (w + θ) so L > 0. Thus, Z ψ(θ) := θ

1

g(θ/u)u−2−d (1 − u)α+d−1 du → Cθ−d ,

θ→0

(C.53)

0

and hence for δ > 0 and θ sufficiently small, |ψ(θ) − Cθ−d | < δ. Thus, for x sufficiently small, Z x Z x Z x ψ(θ) dθ ≤ Cθ−d dθ + |ψ(θ) − Cθ−d | dθ (C.54) 0

0

0

C x1−d + δx ≤ 1−d C x1−d ∼ , x → 0, 1−d which shows that Eq. (C.42) holds.

(C.55) (C.56) 

Appendix D. Proofs of CRM truncation bounds D.1. Protobound. Lemma D.1 (Protobound). Let Θ0 and Θ1 be two discrete random measures, and define Θ := Θ0 + Θ1 . Let X1:N be a collection of random measures generated i.i.d. from Θ with supp(Xn ) ⊆ supp(Θ), and let Y1:N be a collection of random variables where Yn is generated from Xn via Yn | Xn ∼ f (· | Xn ). Define Z1:N and W1:N analagously for Θ0 . Finally, define Q := 1 [supp(X1:N ) ⊆ supp(Θ0 )]. If D (X1:N |Θ, Q = 1) = (Z1:N |Θ0 ), then 1 kpY − pW k1 ≤ 1 − P(Q = 1), (D.1) 2


T. CAMPBELL, J. H. HUGGINS, J. HOW, AND T. BRODERICK

where p_Y, p_W are the marginal densities of Y_{1:N} and W_{1:N}.

Proof of Lemma 4.1. This is the direct application of Lemma D.1 to CRMs, where Θ ∼ CRM(ν) and Θ_0 is a truncation Θ_0 = Θ_K. The technical condition is satisfied because the weights in X_{1:N} are sampled independently for each atom in Θ. □

Proof of Lemma 5.1. This is the direct application of Lemma D.1 to NCRMs, where Θ is the normalization of a CRM with distribution CRM(ν) and Θ_0 is the normalization of its truncation. The technical condition is satisfied because the conditioning on supp(X_{1:N}) ⊆ supp(Θ_0) is equivalent to normalization of Θ_0. □

Proof of Lemma D.1. We begin by expanding the 1-norm and conditioning on Θ and Θ_0:

‖p_Y − p_W‖₁ = ∫ | E[ ∏_{n=1}^N f(y_n|Z_n) ] − E[ ∏_{n=1}^N f(y_n|X_n) ] | dy
 = ∫ | E[ E[ ∏_{n=1}^N f(y_n|Z_n) | Θ_0 ] ] − E[ E[ ∏_{n=1}^N f(y_n|X_n) | Θ ] ] | dy.   (D.2)

Then conditioning on Q,

E[ ∏_{n=1}^N f(y_n|X_n) | Θ ] = E[ E[ ∏_{n=1}^N f(y_n|X_n) | Θ, Q ] | Θ ]   (D.3)
 = P(Q = 1|Θ) E[ ∏_{n=1}^N f(y_n|Z_n) | Θ_0 ] + P(Q = 0|Θ) E[ ∏_{n=1}^N f(y_n|X_n) | Θ, Q = 0 ],

where the first term arises from the fact that for any function φ,

E[φ(Z_{1:N}) | Θ_0] = E[φ(X_{1:N}) | Θ, Q = 1],   (D.4)

because X_{1:N} | Θ, Q = 1 is equal in distribution to Z_{1:N} | Θ_0. Substituting this back in above,

‖p_Y − p_W‖₁ = ∫ | E[ P(Q = 0|Θ) ( E[ ∏_n f(y_n|Z_n) | Θ_0 ] − E[ ∏_n f(y_n|X_n) | Θ, Q = 0 ] ) ] | dy
 ≤ ∫ E[ P(Q = 0|Θ) | E[ ∏_n f(y_n|Z_n) | Θ_0 ] − E[ ∏_n f(y_n|X_n) | Θ, Q = 0 ] | ] dy
 ≤ ∫ E[ P(Q = 0|Θ) ( E[ ∏_n f(y_n|Z_n) | Θ_0 ] + E[ ∏_n f(y_n|X_n) | Θ, Q = 0 ] ) ] dy,   (D.5)

and finally by Fubini's theorem,

‖p_Y − p_W‖₁ ≤ E[ P(Q = 0|Θ) ( E[ ∫ ∏_n f(y_n|Z_n) dy | Θ_0 ] + E[ ∫ ∏_n f(y_n|X_n) dy | Θ, Q = 0 ] ) ]
 = E[ P(Q = 0|Θ) ( E[1|Θ_0] + E[1|Θ, Q = 0] ) ] = 2 P(Q = 0) = 2(1 − P(Q = 1)).   (D.6)

TRUNCATED RANDOM MEASURES


□

D.2. D-representation truncation. For this section, let c := c_ν.

Lemma D.2. For K ≥ 0,

P(supp(X_{1:N}) ⊆ supp(Θ_K)) = E[ exp( −∫_0^∞ (1 − π(v e^{−G_K})^N) ν(dv) ) ],   (D.7)

where G_K ∼ Gam(K, c) and G_0 := 0.

Proof. Let

p(t, K) := E[ ∏_{k=K+1}^∞ π(V_k e^{−(Γ_{1k}+t)/c})^N ],   (D.8)

so that p(0, K) = P(supp(X_{1:N}) ⊆ supp(Θ_K)). We use the proof strategy from Banjevic et al. (2002) and induction in K, starting from the base case K = 0.

P(X_1 ∈ supp(Θ_K)) ≥ P(log V_1 + Γ_{c(K−1)} + W ≥ M).   (E.45)

The inequality arises from noting that all terms in the sum are positive, and hence the probability can be bounded below by selecting only the k = 1 term. Note that Γ_{c(K−1)} is used in place of Σ_{k=2}^K E_{ck} for convenience, since they have the same distribution. Next, we compute the cumulative distribution function of log V_1 + W:

P(log V_1 + W ≤ x) = −c^{−1} ∫ e^{−e^{−(x−v)}} (d/dv)(e^v ν(e^v)) dv   (E.46)
 = −c^{−1} ∫_0^∞ e^{−s e^{−x}} (d/ds)(s ν(s)) ds   (E.47)
 = 1 − c^{−1} ∫_0^∞ s e^{−x − s e^{−x}} ν(s) ds   (E.48)
 = 1 − c^{−1} (d/dx) ∫_0^∞ (e^{−s e^{−x}} − 1) ν(ds)   (E.49)
 = 1 − c^{−1} (dF/dx)(x),   (E.50)

where the first step follows from the substitution s = e^v, the second follows from integration by parts, and the third follows from Lemma E.1. Using this result to


compute the cumulative distribution function of M − (log V_1 + W),

P(M − (log V_1 + W) ≤ x) = ∫ P(M ≤ x + v) (d/dv) P(log V_1 + W ≤ v) dv   (E.51)
 = −c^{−1} ∫ e^{F(x+v)} (d²F/dv²)(v) dv.   (E.52)

The second-order derivative of F exists by Lemma E.1. Finally, using the fact that Γ_{c(K−1)} ∼ Gam(K − 1, c),

P(X_1 ∈ supp(Θ_K)) ≥ (c^{K−1}/Γ(K − 1)) ∫_0^∞ P(M − (log V_1 + W) ≤ x) x^{K−2} e^{−cx} dx   (E.53)
 = −(c^{K−2}/Γ(K − 1)) ∫∫ 1[x ≥ 0] x^{K−2} e^{−cx} e^{F(x+v)} (d²F/dv²)(v) dv dx.   (E.54)

The fact that the bound is between 0 and 1 is a simple consequence of P(X_1 ∈ supp(Θ_K)) ≤ 1 and the nonnegativity of the lower bound above. The asymptotic decay of the bound follows by interpreting the integral above as an expectation, and noting that F(x) ≤ 0, F(x) is monotonically increasing, and lim_{x→∞} F(x) = 0. For any ε > 0, we can pick a ≥ 0 such that

lim_{K→∞} E[ e^{F(Γ_{c(K−1)} + log V_1 + W)} ]   (E.55)
 ≥ lim_{K→∞} P(Γ_{c(K−1)} ≥ a) P(log V_1 + W ≥ −a/2) e^{F(a/2)}   (E.56)
 = P(log V_1 + W ≥ −a/2) e^{F(a/2)}   (E.57)
 ≥ (1 − ε)²,   (E.58)

since lim_{a→∞} P(log V_1 + W ≥ −a/2) = 1 and lim_{a→∞} e^{F(a/2)} = 1. Therefore,

lim_{K→∞} E[ e^{F(Γ_{c(K−1)} + log V_1 + W)} ] = 1,   (E.59)

and the desired asymptotic result follows. □

E.3. Truncation with hyperpriors.

Proof of Proposition 5.5. By repeating the proof of Lemma 5.1 in Appendix D.1, except with an additional use of the tower property to condition on the hyperparameters Φ, and an additional use of Fubini's theorem to swap integration and expectation, we have

(1/2) ‖p_Y − p_W‖₁ ≤ E[ 1 − P(supp(X_{1:N}) ⊆ supp(Θ_K) | Φ) ]   (E.60)
 ≤ E[ B(Φ, N, K) ].   (E.61)
□


Appendix F. Proofs of V-representation sampling results

Proof of Lemma 6.1. Define λ := Σ_i λ_i and λ̄_m := Σ_{i=m}^∞ λ_i. The proof proceeds by induction. First,

P(T_1 = t_1) = Σ_{n=0}^∞ P(T_1 = t_1, S_1 = n)   (F.1)
 = Σ_{n=0}^∞ P(T_1 = t_1 | S_1 = n) P(S_1 = n)   (F.2)
 = Σ_{n=t_1}^∞ (n choose t_1) (λ_1/λ)^{t_1} ((λ − λ_1)/λ)^{n−t_1} e^{−λ} λ^n/n!   (F.3)
 = e^{−λ} (λ_1^{t_1}/t_1!) Σ_{n=t_1}^∞ (λ − λ_1)^{n−t_1}/(n − t_1)!   (F.4)
 = e^{−λ} (λ_1^{t_1}/t_1!) e^{λ−λ_1}   (F.5)
 = e^{−λ_1} λ_1^{t_1}/t_1!,   (F.6)

and therefore T_1 is marginally Poisson with mean λ_1. Now assume T_1, …, T_{m−1} are marginally independent Poisson with means λ_1, …, λ_{m−1}, and note that S_1 = Σ_i T_i. Using shorthand notation for brevity,

P(t_m, …, t_1) = Σ_{n=0}^∞ P(t_m | t_{m−1}, …, t_1, n) P(n | t_1, …, t_{m−1}) ∏_{i=1}^{m−1} P(t_i),   (F.7)

and since we know T_i is marginally Poisson with mean λ_i for each i = 1, …, m − 1, we just need to show that

Σ_{n=0}^∞ P(t_m | t_{m−1}, …, t_1, n) P(n | t_1, …, t_{m−1}) = e^{−λ_m} λ_m^{t_m}/t_m!.   (F.8)

If we let k := Σ_{i=1}^{m−1} t_i, and sample t_m from the binomial distribution, the first term above is

P(t_m | t_{m−1}, …, t_1, n) = ((n − k)!/((n − k − t_m)! t_m!)) (λ_m/λ̄_m)^{t_m} (λ̄_{m+1}/λ̄_m)^{n−k−t_m} δ(t_m ≤ n − k),   (F.9)

while the second term is found by noting that S_1 given T_1, …, T_{m−1} is a shifted Poisson random variable, S_1 = k + Σ_{i=m}^∞ T_i, so that

P(n | t_1, …, t_{m−1}) = e^{−λ̄_m} (λ̄_m^{n−k}/(n − k)!) δ(n ≥ k).   (F.10)

Substituting all of this into the above expression,

Σ_{n=0}^∞ P(t_m | t_{m−1}, …, t_1, n) P(n | t_1, …, t_{m−1})
 = Σ_{n=k+t_m}^∞ e^{−λ̄_m} (1/((n − k − t_m)! t_m!)) λ_m^{t_m} λ̄_{m+1}^{n−k−t_m}   (F.11)
 = e^{−λ̄_m} (λ_m^{t_m}/t_m!) Σ_{n=k+t_m}^∞ λ̄_{m+1}^{n−k−t_m}/(n − k − t_m)!   (F.12)
 = e^{−λ̄_m} e^{λ̄_{m+1}} λ_m^{t_m}/t_m!   (F.13)
 = e^{−λ_m} λ_m^{t_m}/t_m!.   (F.14)

Thus the inductive step holds, yielding the desired result about the infinite sequence T_i, i ∈ ℕ. To prove the final result, we simply rely on the fact that the event {I ≤ i} is equivalent to the event that T_j = 0 for all j = i + 1, i + 2, …:

P(I ≤ i) = P(T_j = 0 ∀ j ≥ i + 1)   (F.15)
 = P(Σ_{j=i+1}^∞ T_j = 0)   (F.16)
 = e^{−Σ_{j>i} λ_j},   (F.17)
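The marginal computation in Eqs. (F.1)–(F.6) is easy to verify numerically. The sketch below, with arbitrary illustrative rates, evaluates the mixture Σ_n P(T_1 = t_1 | S_1 = n) P(S_1 = n) directly and compares it to the Poisson(λ_1) pmf:

```python
import math

lam = [0.7, 0.4, 0.9]        # illustrative rates λ_1, λ_2, λ_3
lam_tot = sum(lam)

def pois(k, m):
    # Poisson(m) pmf at k
    return math.exp(-m) * m**k / math.factorial(k)

def binom_pmf(k, n, p):
    # Binomial(n, p) pmf at k
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def t1_marginal(t1, nmax=120):
    # Σ_n P(T1 = t1 | S1 = n) P(S1 = n), truncating the rapidly converging sum
    return sum(binom_pmf(t1, n, lam[0] / lam_tot) * pois(n, lam_tot)
               for n in range(t1, nmax))

# Eq. (F.6): T1 is marginally Poisson(λ_1)
for t1 in range(8):
    assert abs(t1_marginal(t1) - pois(t1, lam[0])) < 1e-12
```

The rates and truncation level `nmax` are arbitrary choices for illustration only.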

which follows from the independent Poisson distributions of the T_j shown earlier. □

Proof of Theorem 6.2. First, note that by Lemma B.2,

Σ_{k=1}^K Σ_{x=1}^∞ λ_{kx} = Σ_{k=1}^K Σ_{x=1}^∞ ∫ h(0|θ)^{k−1} h(x|θ) ν(dθ)   (F.18)
 ≤ Σ_{k=1}^K Σ_{x=1}^∞ ∫ h(x|θ) ν(dθ)   (F.19)
 = Σ_{k=1}^K ∫ (1 − h(0|θ)) ν(dθ) < ∞,   (F.20)

and therefore {ρ_{kx}}_{x∈ℕ, 1≤k≤K} in Theorem 3.3 is an infinite collection of independent Poisson random variables satisfying the conditions of Lemma 6.1. Further, since each collection of variables with constant k, x in the infinite-sum construction of Θ is independent of the others, we can simply impose a convenient ordering on {1, …, K} × ℕ and use Lemma 6.1 to sample the ρ_{kx}. □

Proof of Theorem 6.3. There are four types of random variable that must be sampled to construct Θ_K: S_1, which is sampled exactly once; θ_{ij} and ψ_{ij}, which are each sampled S_1 times; and C_i, which is sampled a random number I of times. Therefore,

E[R] = E[1 + 2S_1 + I] = 1 + 2Λ̄_K + E[I].   (F.21)

(F.21)


Now using the CDF of I from Lemma 6.1,

E[I] = Σ_{r=1}^∞ (1 − P(I < r)) = Σ_{r=1}^∞ (1 − e^{−Σ_{i≥r} λ_{k_i x_i}}) = Σ_{r=1}^∞ (1 − e^{−Λ̄_K + Σ_{i=1}^{r−1} λ_{k_i x_i}}).   (F.22)
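As an illustrative check of the tail-sum formula for E[I] (with a hypothetical finite rate sequence, so all sums are exact), it agrees with the expectation computed directly from the distribution of I, since P(I = r) = (1 − e^{−λ_r}) e^{−Σ_{j>r} λ_j}:

```python
import math

rates = [0.5, 0.3, 0.2, 0.1]   # illustrative rates λ_{k_1 x_1}, ..., λ_{k_4 x_4}; zero beyond

def tail(r):
    # Σ_{i >= r} λ_{k_i x_i} (1-indexed)
    return sum(rates[r - 1:])

# Eq. (F.22): E[I] as a sum of tail probabilities P(I >= r) = 1 - P(I < r)
EI_tails = sum(1.0 - math.exp(-tail(r)) for r in range(1, len(rates) + 1))

# Direct expectation: I is the largest index r with T_r > 0, T_r ~ Poisson(λ_r) indep.,
# so P(I = r) = (1 - e^{-λ_r}) e^{-Σ_{j>r} λ_j}.
EI_direct = sum(r * (1.0 - math.exp(-rates[r - 1])) * math.exp(-tail(r + 1))
                for r in range(1, len(rates) + 1))

assert abs(EI_tails - EI_direct) < 1e-12
```

The rates here are arbitrary; any summable nonnegative sequence would do.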

This yields the first stated result. Next, let

φ(x) := x if x ≤ 1,  and  φ(x) := 1 if x > 1,   (F.23)

and note that 1 − e^{−x} ≤ φ(x). Thus, we have

Σ_{r=1}^∞ (1 − e^{−Σ_{i≥r} λ_{k_i x_i}}) ≤ Σ_{r=1}^∞ φ(Σ_{i≥r} λ_{k_i x_i})   (F.24)
 = Σ_{r=1}^L φ(Σ_{i≥r} λ_{k_i x_i}) + Σ_{r=L+1}^∞ φ(Σ_{i≥r} λ_{k_i x_i})   (F.25)
 ≤ L + Σ_{r=L+1}^∞ φ(Σ_{i≥r} λ_{k_i x_i}),   (F.26)

so the first inequality in Theorem 6.3 follows. The second inequality is now immediate from the fact that φ(x)/(1 − e^{−x}) ≤ e/(e − 1), which follows since

argmax_{0≤x≤1} x/(1 − e^{−x}) = argmax_{x≥1} 1/(1 − e^{−x}) = 1.   (F.27)
□
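A quick numerical sketch (not from the original text) of the two elementary bounds used above, 1 − e^{−x} ≤ φ(x) ≤ (e/(e − 1))(1 − e^{−x}), checked on a grid:

```python
import math

def phi(x):
    # φ from Eq. (F.23): min(x, 1)
    return min(x, 1.0)

ratio = math.e / (math.e - 1.0)    # ≈ 1.582, attained at x = 1 per Eq. (F.27)
for i in range(1, 2001):
    x = i / 200.0                  # grid over (0, 10]
    lower = 1.0 - math.exp(-x)
    assert lower <= phi(x) <= ratio * lower + 1e-12
```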

Proof of Corollary 6.4. Starting from the first result of Theorem 6.3, we simply need to bound the infinite series:

Σ_{r=1}^∞ (1 − e^{−Σ_{i≥r} λ_{k_i x_i}}) ≤ Σ_{r=1}^∞ Σ_{i≥r} λ_{k_i x_i}   (F.28)
 = Σ_{i=1}^∞ i λ_{k_i x_i}   (F.29)
 = Σ_{x=1}^∞ Σ_{k=1}^K (K(x − 1) + k) λ_{kx}   (F.30)
 ≤ K Σ_{x=1}^∞ x Σ_{k=1}^K λ_{kx}.   (F.31)


Substituting the formula for λ_{kx},

Σ_{r=1}^∞ (1 − e^{−Σ_{i≥r} λ_{k_i x_i}}) ≤ K Σ_{x=1}^∞ x ∫ ( Σ_{k=1}^K h(0 | θ)^{k−1} ) h(x | θ) ν(dθ)   (F.32)
 = K Σ_{x=1}^∞ x ∫ ((1 − h(0 | θ)^K)/(1 − h(0 | θ))) h(x | θ) ν(dθ)   (F.33)
 ≤ K² Σ_{x=1}^∞ x ∫ h(x | θ) ν(dθ)   (F.34)
 = K² ∫ E_θ[X] ν(dθ).   (F.35)
□
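The chain of bounds in Eqs. (F.28)–(F.31) can be checked numerically on a hypothetical summable rate array λ_{kx} (here λ_{kx} = 0.5/(k x³), truncated at x < 30 so the tail vanishes), using the ordering i = K(x − 1) + k from Eq. (F.30):

```python
import math

K = 3
# illustrative, summable rates λ_{kx}; any nonnegative array with finite Σ x λ_{kx} works
lam = {(k, x): 0.5 / (k * x**3) for k in range(1, K + 1) for x in range(1, 30)}

# ordering i = K(x - 1) + k used in Eq. (F.30)
seq = sorted(lam, key=lambda kx: K * (kx[1] - 1) + kx[0])

# Eq. (F.28) left-hand side: Σ_r (1 - e^{-Σ_{i>=r} λ})
lhs = sum(1 - math.exp(-sum(lam[kx] for kx in seq[r:])) for r in range(len(seq)))
# Eqs. (F.29)-(F.30): Σ_i i λ_{k_i x_i}
mid = sum((i + 1) * lam[kx] for i, kx in enumerate(seq))
# Eq. (F.31): K Σ_x x Σ_k λ_{kx}
rhs = K * sum(x * lam[(k, x)] for (k, x) in lam)

assert lhs <= mid <= rhs + 1e-12
```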

Finally, the exact distribution of the number of random variables sampled by the construction in Theorem 6.2 is provided by Theorem F.1.

Theorem F.1 (Sampling complexity distribution). The number of random variables R that must be sampled to construct Θ_K via the generative model in Theorem 6.2 has distribution

P(R = r) = e^{−Λ̄_K} ( 1[r = 1] + 1[r > 1] Σ_{j=0}^{⌊(r−2)/2⌋} (λ̄_{r−2j−1}^j − λ̄_{r−2j−2}^j)/j! ),   (F.36)

where λ̄_m := Σ_{i=1}^m λ_{k_i x_i}.

Proof. There are four types of random variable that must be sampled to construct Θ_K: S_1, which is sampled exactly once; θ_{ij} and ψ_{ij}, which are each sampled S_1 times for a total of R_1; and C_i, which is sampled a random number R_2 = I of times. Therefore,

P(R_1 = n, R_2 ≤ r) = P(2S_1 = n, I ≤ r)   (F.37)
 = P(S_1 = n/2, I ≤ r),  n even, r ≥ 1   (F.38)
 = P(S_1 = n/2 | I ≤ r) P(I ≤ r).   (F.39)

Now the event that I ≤ r is equivalent to the event that C_{r+1}, C_{r+2}, … are all equal to 0. By Lemma 6.1, the C_i are independent Poisson random variables. Therefore

P(S_1 = n/2 | I ≤ r) = e^{−Σ_{j=1}^r λ_{k_j x_j}} (Σ_{j=1}^r λ_{k_j x_j})^{n/2} / (n/2)!.   (F.40)

Again by Lemma 6.1, we have that

P(I ≤ r) = e^{−Σ_{j>r} λ_{k_j x_j}}.   (F.41)

Therefore,

P(R_1 = n, R_2 ≤ r) = ((Σ_{j=1}^r λ_{k_j x_j})^{n/2}/(n/2)!) e^{−Λ̄_K},  n even, r ≥ 0,   (F.42)


where 0^0 := 1. Thus,

P(R_1 = n, R_2 = r) = e^{−Λ̄_K} δ(n = 0),  r = 0;
P(R_1 = n, R_2 = r) = (e^{−Λ̄_K}/(n/2)!) [ (Σ_{j=1}^r λ_{k_j x_j})^{n/2} − (Σ_{j=1}^{r−1} λ_{k_j x_j})^{n/2} ],  r > 0, n even.   (F.43)

Finally, noting that P(R − 1 = r) = Σ_{m=0}^r P(R_1 = r − m, R_2 = m) (the R − 1 comes from the sampling of S_1 exactly once),

P(R = 1) = e^{−Λ̄_K},   (F.44)
P(R = r + 1) = e^{−Λ̄_K} Σ_{1≤m≤r, r−m even} (1/((r − m)/2)!) [ (Σ_{j=1}^m λ_{k_j x_j})^{(r−m)/2} − (Σ_{j=1}^{m−1} λ_{k_j x_j})^{(r−m)/2} ],  r ≥ 1.   (F.45)

Defining λ̄_m := Σ_{j=1}^m λ_{k_j x_j},

P(R = 1) = e^{−Λ̄_K},   (F.46)
P(R = r) = e^{−Λ̄_K} Σ_{1≤m≤r−1, r−m−1 even} (λ̄_m^{(r−m−1)/2} − λ̄_{m−1}^{(r−m−1)/2}) / ((r − m − 1)/2)!,  r ≥ 2,   (F.47)

and letting 2j = r − m − 1,

P(R = r) = e^{−Λ̄_K} ( 1[r = 1] + 1[r > 1] Σ_{j=0}^{⌊(r−2)/2⌋} (λ̄_{r−2j−1}^j − λ̄_{r−2j−2}^j)/j! ).   (F.48)
□
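As a sketch check of Eq. (F.48) with a hypothetical finite rate sequence (so that λ̄_m saturates at Λ̄_K once the rates are exhausted), the pmf sums to one over r, and the impossible small values of R have zero mass:

```python
import math

rates = [0.5, 0.3, 0.2]      # illustrative λ_{k_1 x_1}, λ_{k_2 x_2}, λ_{k_3 x_3}
Lam = sum(rates)             # Λ̄_K

def lbar(m):
    # λ̄_m = Σ_{i=1}^m λ_{k_i x_i}, saturating once the rates are exhausted
    return sum(rates[:m])

def pmf(r):
    # Eq. (F.48); note λ̄_0^0 = 0^0 := 1
    if r == 1:
        return math.exp(-Lam)
    s = sum((lbar(r - 2*j - 1)**j - lbar(r - 2*j - 2)**j) / math.factorial(j)
            for j in range((r - 2) // 2 + 1))
    return math.exp(-Lam) * s

total = sum(pmf(r) for r in range(1, 120))
assert abs(total - 1.0) < 1e-9
assert pmf(2) == 0.0 and pmf(3) == 0.0   # R = 2, 3 cannot occur
# P(R = 4) = λ̄_1 e^{-Λ̄_K}: the event S_1 = 1, I = 1
assert abs(pmf(4) - rates[0] * math.exp(-Lam)) < 1e-15
```

R = 2 and R = 3 correspond to S_1 = 0 with I > 0, or S_1 = 1 with I = 0, both of which are impossible, so their zero mass is a useful consistency check.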

Appendix G. Additional simulation results

In this section, we provide additional simulation results referenced in the main text. The first, Fig. 2, shows the “flattening” effect of increasing N on the bounds for low mean complexity (i.e., low truncation level). This follows intuition: making a large number of observations from the truncated process is likely to uncover its finite, approximate nature. The observed effect is similar for the ΓP, BPP, and NΓP. Next, Fig. 3 demonstrates that the V- and F-representations of BP(γ, α, d) (with default values γ = 1, α = 2, and d = 0.5) perform similarly for fixed truncation level and varying parameters, except when 0 < d ≪ 1: in this setting, the F-representation bound approaches that of the 2-parameter BP(γ, α, 0), while the V-representation bound does not exhibit as dramatic a decrease.

[Figure 2 plots omitted: log–log curves of L1 error versus mean computational complexity, comparing B-DRep (Teh), B-FRep (Paisley), and SB-VRep (Thibaux), and PL-FRep (Broderick) versus SB-VRep (Thibaux), for BP(1, 1, 0) and BP(1, 2, 0.5).]

Figure 2. Truncation error bounds for D-, F-, and V-representations of the BP.


[Figure 3 plots omitted: log–log curves of L1 error versus the mass γ, concentration α, and discount d parameters, comparing PL-FRep (Broderick) and SB-VRep (Thibaux).]

Figure 3. Truncation error bounds for BP(γ, α, d) with varying parameters.

Acknowledgments

All authors are supported by the Office of Naval Research under MURI grant N000141110688. J. Huggins is supported by the U.S. Government under FA9550-11-C-0028 and awarded by the DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.

References

Abramowitz, M. and Stegun, I. (eds.) (1964). Handbook of Mathematical Functions. Dover Publications.
Airoldi, E. M., Blei, D., Erosheva, E. A., and Fienberg, S. E. (2014). Handbook of Mixed Membership Models and Their Applications. CRC Press.
Argiento, R., Bianchini, I., and Guglielmi, A. (2015). “A blocked Gibbs sampler for NGG-mixture models via a priori truncation.” Statistics and Computing, Online First.
Banjevic, D., Ishwaran, H., and Zarepour, M. (2002). “A recursive method for functionals of Poisson processes.” Bernoulli.
Blei, D. M. and Jordan, M. I. (2006). “Variational inference for Dirichlet process mixtures.” Bayesian Analysis, 1(1): 121–144.
Bondesson, L. (1982). “On simulation from infinitely divisible distributions.” Advances in Applied Probability.
Brix, A. (1999). “Generalized gamma measures and shot-noise Cox processes.” Advances in Applied Probability, 31: 929–953.
Broderick, T., Jordan, M. I., and Pitman, J. (2012). “Beta processes, stick-breaking and power laws.” Bayesian Analysis, 7(2): 439–476.


Broderick, T., Mackey, L., Paisley, J., and Jordan, M. I. (2015). “Combinatorial clustering and the beta negative binomial process.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2): 290–306.
Broderick, T., Wilson, A. C., and Jordan, M. I. (2014). “Posteriors, conjugacy, and exponential families for completely random measures.” arXiv.org.
Doshi-Velez, F., Miller, K. T., Van Gael, J., and Teh, Y. W. (2009). “Variational inference for the Indian buffet process.” In International Conference on Artificial Intelligence and Statistics.
Ferguson, T. S. (1973). “A Bayesian analysis of some nonparametric problems.” The Annals of Statistics, 209–230.
Ferguson, T. S. and Klass, M. J. (1972). “A representation of independent increment processes without Gaussian components.” The Annals of Mathematical Statistics, 43(5).
Fox, E. B., Sudderth, E., Jordan, M. I., and Willsky, A. S. (2010). “A sticky HDP-HMM with application to speaker diarization.” Annals of Applied Statistics.
Gautschi, W. (1959). “Some elementary inequalities relating to the gamma and incomplete gamma function.” Journal of Mathematics and Physics, 38(1): 77–81.
Gelfand, A. and Kottas, A. (2002). “A computational approach for nonparametric Bayesian inference under Dirichlet process mixture models.” Journal of Computational and Graphical Statistics, 11(2): 289–305.
Gumbel, E. J. (1954). Statistical Theory of Extreme Values and Some Practical Applications: A Series of Lectures. U.S. Govt Print Office, 1st edition.
Hjort, N. L. (1990). “Nonparametric Bayes estimators based on beta processes in models for life history data.” The Annals of Statistics, 18(3): 1259–1294.
Ishwaran, H. and James, L. F. (2001). “Gibbs sampling methods for stick-breaking priors.” Journal of the American Statistical Association, 96.
— (2002). “Approximate Dirichlet Process Computing in Finite Normal Mixtures: Smoothing and Prior Information.” Journal of Computational and Graphical Statistics, 11(3): 508–532.
Ishwaran, H. and Zarepour, M. (2002). “Exact and approximate sum representations for the Dirichlet process.” Canadian Journal of Statistics, 30(2): 269–283.
James, L. F. (2014). “Poisson Latent Feature Calculus for Generalized Indian Buffet Processes.” arXiv.org.
James, L. F., Lijoi, A., and Prünster, I. (2009). “Posterior Analysis for Normalized Random Measures with Independent Increments.” Scandinavian Journal of Statistics, 36(1): 76–97.
Johnson, M. J. and Willsky, A. S. (2013). “Bayesian nonparametric hidden semi-Markov models.” Journal of Machine Learning Research, 14: 673–701.
Kim, Y. (1999). “Nonparametric Bayesian estimators for counting processes.” The Annals of Statistics, 27(2): 562–588.
Kingman, J. F. C. (1967). “Completely random measures.” Pacific Journal of Mathematics, 21(1): 59–78.
— (1975). “Random discrete distributions.” Journal of the Royal Statistical Society B, 37(1): 1–22.
Kingman, J. F. C. (1993). Poisson Processes. Oxford Studies in Probability. Oxford University Press.


Lijoi, A., Mena, R., and Prünster, I. (2005). “Bayesian nonparametric analysis for a generalized Dirichlet process prior.” Statistical Inference for Stochastic Processes, 8: 283–309.
Lijoi, A. and Prünster, I. (2010). “Models beyond the Dirichlet process.” In Hjort, N. L., Holmes, C., Müller, P., and Walker, S. (eds.), Bayesian Nonparametrics. Cambridge University Press.
Maddison, C., Tarlow, D., and Minka, T. P. (2014). “A* Sampling.” In Advances in Neural Information Processing Systems.
Neal, R. M. (2003). “Slice sampling.” The Annals of Statistics, 31(3): 705–767.
Orbanz, P. (2010). “Conjugate projective limits.” arXiv preprint arXiv:1012.0363.
Paisley, J., Wang, C., and Blei, D. M. (2012a). “The Discrete Infinite Logistic Normal Distribution.” Bayesian Analysis, 7(2): 235–272.
Paisley, J. W., Blei, D. M., and Jordan, M. I. (2012b). “Stick-breaking beta processes and the Poisson process.” In International Conference on Artificial Intelligence and Statistics.
Paisley, J. W., Zaas, A. K., Woods, C. W., Ginsburg, G. S., and Carin, L. (2010). “A stick-breaking construction of the beta process.” In International Conference on Machine Learning.
Regazzini, E., Lijoi, A., and Prünster, I. (2003). “Distributional results for means of normalized random measures with independent increments.” The Annals of Statistics, 31(2): 560–585.
Roychowdhury, A. and Kulis, B. (2015). “Gamma Processes, Stick-Breaking, and Variational Inference.” In International Conference on Artificial Intelligence and Statistics.
Schilling, R. L. (2005). Measures, Integrals and Martingales. Cambridge University Press.
Sethuraman, J. (1994). “A constructive definition of Dirichlet priors.” Statistica Sinica, 4: 639–650.
Teh, Y. W. and Görür, D. (2009). “Indian buffet processes with power-law behavior.” In Advances in Neural Information Processing Systems.
Teh, Y. W., Görür, D., and Ghahramani, Z. (2007). “Stick-breaking construction for the Indian buffet process.” In International Conference on Artificial Intelligence and Statistics.
Thibaux, R. and Jordan, M. I. (2007). “Hierarchical beta processes and the Indian buffet process.” In International Conference on Artificial Intelligence and Statistics.
Titsias, M. (2008). “The infinite gamma-Poisson feature model.” In Advances in Neural Information Processing Systems.
Zhou, M., Hannah, L., Dunson, D., and Carin, L. (2012). “Beta-negative binomial process and Poisson factor analysis.” In Artificial Intelligence and Statistics.


Laboratory for Information and Decision Systems (LIDS), Massachusetts Institute of Technology
URL: http://www.trevorcampbell.me/
E-mail address: [email protected]

Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology
URL: http://www.jhhuggins.org/
E-mail address: [email protected]

Laboratory for Information and Decision Systems (LIDS), Massachusetts Institute of Technology
URL: http://www.mit.edu/~jhow/
E-mail address: [email protected]

Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology
URL: http://www.tamarabroderick.com
E-mail address: [email protected]
