A Bayesian Nonparametric Joint Factor Model for Learning Shared and Individual Subspaces from Multiple Data Sources

Sunil Kumar Gupta ∗ †

Dinh Phung ∗ ‡

Svetha Venkatesh ∗ §

Abstract

Joint analysis of multiple data sources is becoming increasingly popular in transfer learning, multi-task learning and cross-domain data mining. One promising approach to model the data jointly is through learning the shared and individual factor subspaces. However, the performance of this approach depends on the subspace dimensionalities, and the level of sharing needs to be specified a priori. To this end, we propose a nonparametric joint factor analysis framework for modeling multiple related data sources. Our model utilizes the hierarchical beta process as a nonparametric prior to automatically infer the number of shared and individual factors. For posterior inference, we provide a Gibbs sampling scheme using auxiliary variables. The effectiveness of the proposed framework is validated through its application on two real world problems – transfer learning in text and image retrieval.

∗ School of Information Technology, Deakin University, Geelong Waurn Ponds Campus, Australia
† [email protected]
‡ [email protected]
§ [email protected]

1 Introduction

The proliferation of sensor networks and the Internet has created a plethora of data sources. Often, these data sources are related and can strengthen one another to improve the performance of many machine learning and data mining tasks. However, improving performance by combining them is not straightforward, as these sources usually also have different characteristics. The traditional factor analysis model, when applied to these multiple sources separately, does not exploit the statistical strength shared across them. Moreover, simply combining multiple data sources into a single dataset and applying factor analysis also does not deliver satisfactory results, since it is unable to capture their individual variations. Thus, there is a need to develop a model which not only exploits the shared strength of multiple sources, but also retains the individual variabilities of each source.

Existing works on modeling multiple data sources can roughly be divided into two main approaches. The first approach, under the “learning to learn” paradigm [1, 19], shares structures across multiple tasks (or sources) by learning a subspace using classification (or regression) parameters for each task. This approach considers supervised or semi-supervised applications wherein classification/regression parameters are jointly learned. The second approach [2, 14, 12, 18] focuses on learning subspaces directly from data in an unsupervised manner. Most of these works [2, 14, 18] constitute multi-view learning, requiring different views or correspondences for each data point. Although missing views can be handled in a limited way, the approach cannot model related data sources that do not have explicit example-wise correspondences. The Bayesian shared subspace learning proposed in [12] does not need these correspondences and is appropriate for jointly modeling multiple data sources. However, it is a parametric method and its performance is critically dependent on the choice of the dimensionalities of the shared and individual subspaces. For modeling $n$ data sources, the number of such parameters is $2^n - 1$. This necessitates model selection – a notoriously hard problem. Thus the problem of joint factor analysis with unknown latent dimensionality remains open.

Recently, a nonparametric factor analysis model using a beta process prior was proposed in [17], wherein the data matrix is decomposed as a product of two matrices – the factors and their features. The feature matrix is further decomposed into an element-wise product of a binary matrix (say Z, indicating the absence or presence of a feature) and a weight matrix (feature values). The binary matrix Z is modeled using a Bernoulli process parametrized by a beta process. A similar work modeling the matrix Z using the Indian buffet process (IBP) is proposed in [16]. Although these nonparametric methods allow the number of factors to grow with the data, they are restricted to modeling a single data source. An extension is thus required for multiple data sources.

Nonparametric models addressing multiple data sources include the work of Fox et al. [8], who use the IBP representation of the beta process to share features amongst dynamical systems. Bernoulli processes for different dynamical objects are parametrized by a common beta process in a non-hierarchical manner. In another work, Saria et al. [20] employ the hierarchical Dirichlet process (HDP) as the underlying stochastic process to model common aspects of multiple time series. However, this model differs from ours in using HDP as its underlying stochastic process, and focuses on topics instead of factors. A multi-scale convolutional factor model using the hierarchical beta process (HBP) is presented in Chen et al. [3]; however, it investigates deep learning instead of the multiple data source problem considered in this work.


In this paper, we propose a nonparametric joint factor analysis (NJFA) model based on the hierarchical beta process (HBP) prior [22]. Our model allows sharing of factors across different sources and learns the number of shared and individual factors automatically from data. We model each data source analogously to [17], but in this case the binary matrix for each source (say Zj for source j) is modeled using a Bernoulli process parametrized by a hierarchical beta process. This allows a flexible representation of both the shared and individual factors (building upon the works carried out in [11, 10]) along with their corresponding features. We present a Gibbs sampling based inference, which, in addition to sampling the main variables, provides an elegant way to sample the hyperparameters of the hierarchical beta process. We demonstrate the usefulness of the proposed model through its application to two different tasks – transfer learning using text data and image retrieval. Our experiments using the NIPS 0-12 dataset validate the effectiveness of our model for the transfer learning task over NIPS sections. For image retrieval, our method outperforms recent state-of-the-art techniques on the NUS-WIDE animal dataset [5].

The contribution of our approach lies in the extension of beta process factor analysis [17] to hierarchical beta process factor analysis, allowing for joint modeling of data from multiple sources. Our model discovers shared factors and individual factors specific to each source in a nonparametric setting, avoiding the need to specify the dimensionality of the latent subspaces a priori. In addition, our proposed Gibbs sampling scheme provides an alternative to the adaptive rejection sampling used in [22], which was noted by the authors of [22] to be elementary; advanced inference algorithms were advocated to fully exploit beta process models. We provide details on sampling the hyperparameters of the HBP, which was not systematically addressed in [22]. Lastly, our work provides an alternative paradigm for statistical sharing across multiple data groups, in a similar realm to the hierarchical Dirichlet process (HDP) [21], but we operate at the factor model level instead of the mixture model level, and therefore it is more suitable for applications such as retrieval, joint dimensionality reduction, collaborative filtering, etc.

2 Joint Factor Model using HBP Prior

We start with a brief review of the beta process and its hierarchical extensions in Section 2.1 and then provide a detailed description of the proposed model in Section 2.2.

2.1 Beta-Bernoulli Process and Hierarchical Modeling Let $(\Omega, \mathcal{F})$ be a measurable space, $B_0$ be a fixed measure over $\Omega$ and $\gamma_0$ be a positive function over $\Omega$; then a beta process $B$ is a positive random measure over $\Omega$, denoted by $B \sim \mathrm{BP}(\gamma_0, B_0)$ [13]. If $S_1, \ldots, S_r$ are disjoint subsets of $\Omega$, the measures $B(S_1), \ldots, B(S_r)$ are independent. This implies that the beta process is a positive Lévy process and can be uniquely characterized using a Lévy measure (for details about the Lévy measure of the beta process, see [22]). If $B_0$ is discrete and given as $B_0 = \sum_k \lambda_k \delta_{\phi_k}$ (where $\phi_k$ is an atom such that $\phi_k \in \Omega$), then the draws from the beta process $B$ are also discrete and can be written in the following form [22]

(2.1) $B = \sum_k \beta_k \delta_{\phi_k}$

where $\beta_k$ is a random weight associated to $\phi_k$ such that

(2.2) $\beta_k \sim \mathrm{beta}(\gamma_0 \lambda_k, \gamma_0 (1 - \lambda_k)), \quad k = 1, 2, \ldots$

In the case of continuous $B_0$, the measure associated to a single atom (i.e. any $\phi_k$) is zero. Therefore, instead of using a single atom $\phi_k$, we consider an infinitesimally small region $d\phi_k$ in $\Omega$ around $\phi_k$. Hjort [13] has shown that increments of such infinitesimal form for the beta process are independent and follow a beta distribution. Thus, for continuous $B_0$, we can partition $\Omega$ into $L$ equal parts [17] such that $\cup_{k=1}^{L} d\phi_k = \Omega$. Using this partition, we can write a set function similar to Eq (2.1) such that $B = \sum_{k=1}^{L} \beta_k \delta_{d\phi_k}$. For any set $Q$, $B(Q) = \sum_{k=1}^{L} \beta_k \delta_{d\phi_k}(Q)$, where

(2.3) $\delta_{d\phi_k}(Q) = 0$ if $Q \cap d\phi_k = \emptyset$, and $1$ otherwise.

Following Theorem 3.1 in [13], for large $L$, $\beta_k$ can be approximately drawn from a beta distribution as below

(2.4) $\beta_k \sim \mathrm{beta}(\gamma_0 b_k, \gamma_0 (1 - b_k)), \quad k = 1, 2, \ldots, L$

where $b_k \triangleq B_0(d\phi_k)$. If $Z_i$ is a binary vector drawn from a Bernoulli process (parametrized by $B$), i.e. $Z_i \mid B \sim \mathrm{BeP}(B)$, then the posterior $B \mid Z_{1:N}$, using the conjugacy property of the beta process, can be written as another beta process [15], i.e. $B \mid Z_{1:N} \sim \mathrm{BP}(\gamma_0 + N, B_N)$, where $B_N$ is given as

(2.5) $B_N = \frac{\gamma_0}{\gamma_0 + N} B_0 + \frac{1}{\gamma_0 + N} \sum_{\{k \mid \phi_k \in S\}} \sum_{i=1}^{N} Z_i^k \, \delta_{\phi_k}$

where $S$ is the set of unique atoms $\phi_k$ observed through $Z_{1:N}$ and $Z_i^k$ denotes the $k$-th element of $Z_i$. Given the above posterior, the predictive distribution of $Z_{N+1} \mid Z_{1:N}$, computed under the expectation $\mathbb{E}_{B \mid Z_{1:N}}[p(Z_{N+1} \mid B)]$, can be written as [22]

(2.6) $Z_{N+1} \mid Z_{1:N} \sim \mathrm{BeP}\left(\frac{\gamma_0}{\gamma_0 + N} B_0 + \frac{1}{\gamma_0 + N} \sum_{k} \sum_{i=1}^{N} Z_i^k \, \delta_{\phi_k}\right)$
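The finite-partition approximation of Eqs (2.1)-(2.4) and the Bernoulli process draws can be simulated directly. The following sketch is illustrative only (function names such as draw_beta_process_weights are ours, not from the paper); it assumes a uniform base measure so that $b_k = 1/L$.

```python
import numpy as np

def draw_beta_process_weights(L=1000, gamma0=1.0, rng=None):
    """Approximate draw of beta process weights beta_k on an L-part
    partition of Omega, following Eq (2.4) with b_k = B0(d phi_k) = 1/L."""
    rng = np.random.default_rng() if rng is None else rng
    b = np.full(L, 1.0 / L)                       # uniform base measure mass per cell
    return rng.beta(gamma0 * b, gamma0 * (1.0 - b))

def draw_bernoulli_process(beta, N, rng=None):
    """Draw N binary vectors Z_i | B ~ BeP(B): each entry Z_i^k ~ Bernoulli(beta_k)."""
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random((N, beta.size)) < beta).astype(int)

beta = draw_beta_process_weights()
Z = draw_bernoulli_process(beta, N=50)
print("average number of active atoms per draw:", Z.sum(axis=1).mean())
```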


In an analogous manner to the HDP [21], which defines a hierarchy over Dirichlet processes, Thibaux and Jordan [22] proposed a hierarchical beta process (HBP) prior that allows the sharing of atoms drawn from a beta process. If $B \sim \mathrm{BP}(\gamma_0, B_0)$ is a draw from a beta process and there exist $J$ sources, then the hierarchical beta process imposes the following hierarchy for each $j = 1, \ldots, J$:

(2.7) $A_j \sim \mathrm{BP}(\alpha_j, B)$ and $Z_{ji} \mid A_j \sim \mathrm{BeP}(A_j)$

where $Z_{ji}$ denotes the $i$-th binary vector $Z_i$ from the $j$-th data source. Irrespective of whether $B_0$ is continuous or discrete, $B \sim \mathrm{BP}(\gamma_0, B_0)$ is always discrete with probability one and has the same support as $B_0$. Since $A_j \sim \mathrm{BP}(\alpha_j, B)$, the support of $B$ is also the support of each $A_j$, causing sharing of atoms across the $A_j$'s. We exploit this property of the HBP for statistically sharing factors across multiple sources.

2.2 Nonparametric Joint Factor Analysis (NJFA) Factor analysis attempts to model a data matrix X as the product of two matrices Φ and H plus an error matrix E. The matrix Φ contains the factors, which can also be viewed as basis vectors that span a subspace, and the matrix H contains the factor loadings, i.e. the representation of the data matrix in the subspace. Formally,

(2.8) $X = \Phi H + E$

Departing from this setting, we propose a joint factor analysis model whose goal is to model multiple data matrices $X_1, \ldots, X_J$ using a factor matrix $\Phi = [\phi_1, \ldots, \phi_K]$, where $\phi_k$ denotes the $k$-th factor and $\phi_k \in \mathbb{R}^M$. Some of the factors in $\Phi$ may be shared amongst the various data sources whereas other factors would be specific to individual ones. For each $j = 1, \ldots, J$, $X_j \in \mathbb{R}^{M \times N_j}$ has a representation $H_j \in \mathbb{R}^{K \times N_j}$ in the subspace spanned by $\Phi$ along with the factorization error $E_j$. Although the dimensionalities of the original feature spaces may not be equal for different sources, it is possible to construct a unified $M$-dimensional feature space by merging the dimensions of the different sources. We represent our joint factor analysis model as a set of factor models which shall be jointly inferred:

(2.9) $\Pi : \quad X_1 = \Phi H_1 + E_1, \;\; \ldots, \;\; X_J = \Phi H_J + E_J$

For this model, we allow the number of factors $K$ to grow as large as needed when more data is observed. When the number of factors is large, each data point may be using only a few factors out of the large pool and thus the representation of a data point is usually sparse. Due to this sparsity, we represent the matrix $H_j$ as an element-wise multiplication of two matrices $Z_j$ and $W_j$, i.e. $H_j = Z_j \odot W_j$, where $Z_{ji}^k = 1$ implies the presence of factor $\phi_k$ for the $i$-th data point $X_{ji}$ from source $j$ and $W_{ji}^k$ represents the corresponding coefficient or weight of the factor $\phi_k$.

We use the HBP prior [22] on the collection of matrices $Z_1, \ldots, Z_J$. In particular, we model each column of $Z_j$ as a draw from a Bernoulli process parametrized by a beta process $A_j$. Since our goal is to share some of the factors $\phi_k$'s across more than one data source, we require the $A_j$'s to have common support (cf. Figure 1a). To do so, we tie the $A_j$'s together via a beta process $B$ so that $A_j \sim \mathrm{BP}(\alpha_j, B)$, where the base measure $B$ itself is a draw from another beta process, i.e. $B \sim \mathrm{BP}(\gamma_0, B_0)$. This gives rise to an HBP prior on $Z_j$ which ensures that some of the factors are shared across multiple sources while other factors remain specific to a source. This property avoids negative knowledge transfer and is important for transfer learning. The measure of the whole parameter space is denoted as $\tau_0$, i.e. $B_0(\Omega) = \tau_0$.

An instance of the above model for two data sources can be seen as follows. Let us assume that $X_1$ contains $N_1$ data examples from source-1 while $X_2$ contains $N_2$ data examples from source-2. Further, assume that there are $K_1$ factors $\{\phi_1, \ldots, \phi_{K_1}\}$ which explain the data represented by matrix $X_1$. The data represented by matrix $X_2$ can either need totally new factors for this purpose, or use some of the factors from the set $\{\phi_1, \ldots, \phi_{K_1}\}$ and need an additional $K_2$ factors. Let us denote the individual factors required for the second data source by $\{\phi_{K_1+1}, \ldots, \phi_{K_1+K_2}\}$. Then the total number of factors expressed by the set $\{\phi_1, \ldots, \phi_{K_1+K_2}\}$ is $K_1 + K_2$. We can re-arrange these factors and re-write them as $\Phi = [\Phi_1, \Phi_{12}, \Phi_2]$, where $\Phi_1$ denotes the set of factors required to explain only data source-1, $\Phi_2$ denotes the set of factors required to explain only data source-2, and $\Phi_{12}$ denotes the set of factors required to explain both data sources. Now the goal of the NJFA model is to achieve the following joint factorization

(2.10) $X_1 = [\Phi_1, \Phi_{12}, \Phi_2] \left( \underbrace{\begin{bmatrix} Z_{1,1} \\ Z_{1,12} \\ 0 \end{bmatrix}}_{\mathbf{Z}_1} \odot \underbrace{\begin{bmatrix} W_{1,1} \\ W_{1,12} \\ * \end{bmatrix}}_{\mathbf{W}_1} \right) + E_1$

(2.11) $X_2 = [\Phi_1, \Phi_{12}, \Phi_2] \left( \underbrace{\begin{bmatrix} 0 \\ Z_{2,12} \\ Z_{2,2} \end{bmatrix}}_{\mathbf{Z}_2} \odot \underbrace{\begin{bmatrix} * \\ W_{2,12} \\ W_{2,2} \end{bmatrix}}_{\mathbf{W}_2} \right) + E_2$
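As an illustration, the two-level construction in Eq (2.7), combined with the finite approximation of Section 2.1, can be simulated as below. This is a sketch under our own naming conventions (e.g. draw_hbp), assuming a uniform $B_0$ over $L$ partition cells; it only demonstrates how the shared support induces factor sharing across sources.

```python
import numpy as np

def draw_hbp(J=2, L=1000, gamma0=1.0, alpha=(5.0, 5.0), N=(100, 100), rng=None):
    """Simulate the HBP hierarchy of Eq (2.7) on an L-cell partition:
    B ~ BP(gamma0, B0), A_j ~ BP(alpha_j, B), Z_ji | A_j ~ BeP(A_j)."""
    rng = np.random.default_rng() if rng is None else rng
    b = np.full(L, 1.0 / L)                              # b_k = B0(d phi_k)
    beta = rng.beta(gamma0 * b, gamma0 * (1.0 - b))      # top-level weights (Eq 2.4)
    Z = []
    for j in range(J):
        # source-level weights: pi_jk ~ beta(alpha_j beta_k, alpha_j (1 - beta_k))
        pi_j = rng.beta(alpha[j] * beta, alpha[j] * (1.0 - beta))
        Z.append((rng.random((N[j], L)) < pi_j).astype(int))
    return beta, Z

beta, (Z1, Z2) = draw_hbp()
used1, used2 = Z1.any(axis=0), Z2.any(axis=0)
print("factors used by both sources:", int((used1 & used2).sum()))
```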



Figure 1: (a) Graphical representation of NJFA model (stochastic process view) (b) An alternative view.

Note the difference between normal and bold symbols. In the above factorization, $Z_{1,1}$ denotes the part of the binary matrix $\mathbf{Z}_1$ that is used to weight the factors represented by $\Phi_1$ only. Similarly, $Z_{1,12}$ denotes the part of the binary matrix $\mathbf{Z}_1$ that is used to weight the shared factors represented by $\Phi_{12}$. A similar representation has been used for the matrices $Z_{2,2}$ and $Z_{2,12}$. The corresponding feature representations are denoted by the matrices $W_{1,1}$, $W_{1,12}$, $W_{2,2}$ and $W_{2,12}$ respectively. Due to the use of the HBP prior on the matrices $\mathbf{Z}_1, \ldots, \mathbf{Z}_J$, their active dimensionalities are inferred automatically. This avoids the need for any separate model selection to determine the shared and the individual subspace dimensionalities.

Another way of jointly modeling these data sources is by adapting the single-source beta process factor analysis [17] and considering $[X_1, \ldots, X_J] = \Phi [H_1, \ldots, H_J] + [E_1, \ldots, E_J]$. However, this method does not distinguish among the data from different groups. Through our experiments, we empirically show the superiority of the proposed hierarchical model over this augmented model.

The other matrices, such as the factors $\Phi$, features $W_j$ and errors $E_j$, are assumed to be i.i.d. and normally distributed. Using the representation for the beta process [22], the whole model can be summarized as

(2.12) $\beta_k \sim \mathrm{beta}(\gamma_0 b_k, \gamma_0 (1 - b_k))$
(2.13) $\pi_{jk} \sim \mathrm{beta}(\alpha_j \beta_k, \alpha_j (1 - \beta_k)), \quad \phi_k \sim \mathcal{N}(0, \sigma_\phi^2 I)$
(2.14) $Z_{ji} \sim \mathrm{BeP}(\pi_j), \quad W_{ji} \sim \mathcal{N}(0, \sigma_{wj}^2 I)$
(2.15) $X_{ji} \mid \Phi, Z_{ji}, W_{ji} \sim \mathcal{N}(\Phi (Z_{ji} \odot W_{ji}), \sigma_{nj}^2 I)$
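For intuition, the generative process of Eqs (2.12)-(2.15) can be written out directly. The finite-$L$ sketch below is our own code (illustrative names such as generate_njfa_data; the variance settings are assumptions) and simply draws synthetic data from the model.

```python
import numpy as np

def generate_njfa_data(J=2, L=100, M=36, N=(200, 200), gamma0=1.0,
                       alpha=(5.0, 5.0), sig_phi=1.0, sig_w=1.0, sig_n=0.1, rng=None):
    """Forward simulation of Eqs (2.12)-(2.15) with a finite truncation L."""
    rng = np.random.default_rng() if rng is None else rng
    b = np.full(L, 1.0 / L)
    beta = rng.beta(gamma0 * b, gamma0 * (1.0 - b))                    # Eq (2.12)
    Phi = rng.normal(0.0, sig_phi, size=(M, L))                        # Eq (2.13), phi_k
    X, Z, W = [], [], []
    for j in range(J):
        pi_j = rng.beta(alpha[j] * beta, alpha[j] * (1.0 - beta))      # Eq (2.13), pi_jk
        Zj = (rng.random((L, N[j])) < pi_j[:, None]).astype(int)       # Eq (2.14), Z_ji
        Wj = rng.normal(0.0, sig_w, size=(L, N[j]))                    # Eq (2.14), W_ji
        Xj = Phi @ (Zj * Wj) + rng.normal(0.0, sig_n, size=(M, N[j]))  # Eq (2.15)
        X.append(Xj); Z.append(Zj); W.append(Wj)
    return X, Z, W, Phi, beta

X, Z, W, Phi, beta = generate_njfa_data()
```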

Notation-wise, we use $\beta = \{\beta_k : k = 1, \ldots, K\}$ and $\pi = \{\pi_{jk} : j = 1, \ldots, J \text{ and } k = 1, \ldots, K\}$. A superscript attached to a symbol following a ‘$-$’ sign, e.g. $m^{-jk}$, $Z^{-ji}$, $Z_{ji}^{-k}$, etc., means the set of variables excluding the variable indexed by the superscript.

3 Model Inference using Collapsed Gibbs Sampling

For inference, we use collapsed Gibbs sampling. It can be seen from the graphical model in Figure 1b that the main variables of interest for the NJFA model are $Z$, $W$, $\Phi$, $\pi$ and $\beta$. We integrate out $\pi$ and Gibbs sample only $Z$, $W$, $\Phi$ and $\beta$. In addition to sampling the main variables, we also sample the hyperparameters $\alpha_j$ and $\gamma_0$. The factors $\phi_k$'s are drawn i.i.d. from $B_0$. Since $B_0$ is assumed to be a normal distribution (see Eq (2.13)), $B_0(\Omega) = \tau_0 = 1$. For each $\phi_k$, we also sample $b_k$ from its posterior. Detailed derivations of the inference results are provided in Appendix A.

3.1 Sampling Main Variables

3.1.1 Sampling βk To sample from the posterior of each $\beta_k$, we use auxiliary variable sampling [6, 21]. Defining $m = (m_{jk} : \forall j, k)$ and $l = (l_{jk} : \forall j, k)$, where $m_{jk} \in \{0, 1, \ldots, n_{jk}\}$ and $l_{jk} \in \{0, 1, \ldots, N_j - n_{jk}\}$, we iterate sampling between $\beta_k$ and the auxiliary variables $m, l$ as

(3.16) $p(\beta_k \mid m, l, Z) \propto \mathrm{beta}(\gamma_0 b_k + m_k, \gamma_0 (1 - b_k) + l_k)$
(3.17) $p(l_{jk} \mid l^{-jk}, m, \beta, Z) \propto s(N_j - n_{jk}, l_{jk}) (\alpha_j - \alpha_j \beta_k)^{l_{jk}}$
(3.18) $p(m_{jk} \mid m^{-jk}, l, \beta, Z) \propto s(n_{jk}, m_{jk}) (\alpha_j \beta_k)^{m_{jk}}$

where $m_k \triangleq \sum_j m_{jk}$, $l_k \triangleq \sum_j l_{jk}$ and $s(n, m)$ denotes the unsigned Stirling numbers of the first kind.
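The auxiliary counts in Eqs (3.17)-(3.18) follow the distribution $p(m \mid n, \theta) \propto s(n, m)\,\theta^m$, which can be sampled without computing Stirling numbers by summing independent Bernoulli draws (the standard Chinese-restaurant "number of tables" construction). The sketch below is our own illustration of one such update for a single factor $k$; it assumes the per-source counts $n_{jk}$ and the current $\alpha_j$, $\gamma_0$, $b_k$ are available from the Gibbs state.

```python
import numpy as np

def sample_table_count(n, theta, rng):
    """Sample m with p(m | n, theta) ∝ s(n, m) * theta^m: the number of occupied
    tables after n customers in a CRP with concentration theta."""
    if n == 0:
        return 0
    i = np.arange(n)                                   # i = 0, ..., n-1
    return int((rng.random(n) < theta / (theta + i)).sum())

def resample_beta_k(beta_k, n_jk, N_j, alpha, gamma0, b_k, rng=None):
    """One sweep of Eqs (3.16)-(3.18) for a single factor k across all J sources.
    n_jk, N_j, alpha are per-source lists."""
    rng = np.random.default_rng() if rng is None else rng
    m_k = sum(sample_table_count(n, a * beta_k, rng)
              for n, a in zip(n_jk, alpha))                               # Eq (3.18)
    l_k = sum(sample_table_count(N - n, a * (1.0 - beta_k), rng)
              for n, N, a in zip(n_jk, N_j, alpha))                       # Eq (3.17)
    beta_new = rng.beta(gamma0 * b_k + m_k, gamma0 * (1.0 - b_k) + l_k)   # Eq (3.16)
    return beta_new, m_k, l_k
```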


3.1.2 Sampling Z Let us assume that the whole space $\Omega$ is partitioned into $L$ equal parts such that $\cup_{k=1}^{L} d\phi_k = \Omega$, where $L$ should be a large number to obtain a good sampling approximation in Eq (2.12). The number of active factors used by the data across the $J$ sources is $K$. Since the random measure $B$ is a draw from $\mathrm{BP}(\gamma_0, B_0)$, it can be written as

(3.19) $B = \sum_{k=1}^{L} \beta_k \delta_{d\phi_k} = \sum_{k=1}^{K} \beta_k \delta_{d\phi_k} + B^u$

where $B^u$ is the restriction of $B$ to $\cup_{k=K+1}^{L} d\phi_k$ and can be explicitly written as $B^u = \sum_{k=K+1}^{L} \beta_k \delta_{d\phi_k}$. Using the conjugacy of the beta process, $Z_{ji}$ conditioned on $Z^{-ji}$ and $B$ is sampled as

(3.20) $Z_{ji} \mid Z^{-ji}, B \sim \mathrm{BeP}\left(\frac{\alpha_j}{\alpha_j + N_j - 1} B^u + T_K\right)$

where

$T_K = \sum_{k=1}^{K} \frac{\alpha_j \beta_k}{\alpha_j + N_j - 1} \delta_{d\phi_k} + \sum_{l=1}^{K_j} \frac{n_{j,k_{jl}}^{-i}}{\alpha_j + N_j - 1} \delta_{d\phi_{k_{jl}}}$

and $n_{j,k}^{-i} = \sum_{i' \neq i} Z_{ji'}^{k}$. The notation $k_{jl}$ is used to relate a factor indexed locally in source $j$ as $l$ and globally as $k$.

Similar to the IBP [9], a culinary metaphor for the HBP can be considered as follows. For this metaphor, treat the data sources as restaurants and the data points as the customers. Due to the exchangeability property [22] of the HBP, we can assume that the $i$-th customer is the last customer of restaurant $j$. Given these, this customer chooses a dish (say $k$-th w.r.t. the global index and $l$-th w.r.t. the local index) available in the $j$-th restaurant with probability $\frac{n_{j,k_{jl}}^{-i} + \alpha_j \beta_k}{\alpha_j + N_j - 1}$, and a dish (say $k'$-th w.r.t. the global index) which is not available in the $j$-th restaurant but is available in other restaurants with probability $\frac{\alpha_j \beta_{k'}}{\alpha_j + N_j - 1}$. After choosing these dishes, it goes on to choose an additional number (denoted by $F_{ji}$) of new dishes such that $F_{ji} \mid \alpha_j, B \sim \mathrm{Poi}\left(\frac{\alpha_j B^u(R)}{\alpha_j + N_j - 1}\right)$, where $\mathrm{Poi}(\cdot)$ denotes the Poisson distribution and $R \triangleq \Omega \setminus \cup_{k=1}^{K} d\phi_k$. A good estimate of $B^u(R)$ can be made through its expected value $\mathbb{E}[B^u(R)]$, which is equal to $1 - \sum_{k=1}^{K} b_k$. This estimate gets better with an increasing value of $\gamma_0$. Another way of computing $B^u(R)$ is through an explicit stick-breaking construction of the beta process.

An example of sharing of factors across two sources is shown in Figure 2. Consider a situation where the data vectors (customers) of the two sources (restaurants) arrive in the following order: $X_{11}, X_{12}, X_{21}, X_{13}, X_{22}, X_{23}, X_{24}$. The arrows in the figure show the factors used by the various data vectors. We note from the figure that the factors $\phi_3$, $\phi_4$ and $\phi_5$ have been shared across both sources while the other factors are only used in either source-1 or source-2.

Figure 2: An example of factor sharing for two data sources through the hierarchical beta process (HBP).

Proceeding further, the conditional Gibbs posterior of $Z_{ji}^k$ can be written as

(3.21) $p(Z_{ji}^k \mid Z_{ji}^{-k}, W_{ji}, \beta_k, \Phi, x_{ji}) \propto p(Z_{ji}^k \mid Z_{ji}^{-k}, \beta_k) \times p(X_{ji} \mid Z_{ji}^k, Z_{ji}^{-k}, W_{ji}, \Phi, \sigma_{jn}^2)$

Substituting Eq (3.20) in the above expression, $Z_{ji}^k$ can be drawn from a Bernoulli distribution. Given the Poisson prior on $F_{ji}$, the conditional posterior distribution of $F_{ji}$ is given as

(3.22) $p(F_{ji} \mid \text{rest}) \propto \mathrm{Poi}\left(\frac{\alpha_j B^u}{\alpha_j + N_j - 1}\right) \times \int p\left(X_{ji} \mid Z_{ji}, Z_{ji}^n, \Phi, \Phi^n, W_{ji}, W_{ji}^n, \sigma_{jn}^2\right) dB_0(\Phi^n)$

where $\Phi^n$ contains the new factors and $Z_{ji}^n$, $W_{ji}^n$ are the corresponding indicator and feature vectors. All the elements of $Z_{ji}^n$ are equal to one, whereas both $\Phi^n$ and $W_{ji}^n$ are sampled from their priors. The samples of $\beta_k$ for $k > K$ can be obtained using the prior distribution of Eq (2.12). For each value of $F_{ji}$, given $W_{ji}^n$, the integral in the above equation can be solved in closed form (see Section A.2 in the appendix). In our implementation, we truncate the support of $\mathrm{Poi}(\cdot)$ at five times its mean value.

3.1.3 Sampling Φ and W Gibbs sampling update for $\Phi$ conditioned on the data $X_{1:J}$ and other variables is given as

(3.23) $p(\Phi \mid Z_{1:J}, W_{1:J}, X_{1:J}) \propto p(\Phi \mid B_0, \sigma_\Phi^2) \times p(X_{1:J} \mid \Phi, Z_{1:J}, W_{1:J})$

Under the model described in Eqs (2.12)-(2.15), the above turns out to be a normal distribution. The Gibbs sampling update for $W_{ji}$ conditioned on $X_{ji}$, $Z_{ji}$ and $\Phi$ is given as
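The dish-choice probabilities of the culinary metaphor above lend themselves to a direct simulation. The sketch below is our own illustrative code (the Bernoulli and Poisson draws follow the stated probabilities; n_minus_i, beta and B_u_R are assumed to come from the current Gibbs state) showing how one customer of restaurant $j$ selects existing dishes and a Poisson number of new ones.

```python
import numpy as np

def sample_customer_dishes(n_minus_i, beta, in_restaurant, alpha_j, N_j, B_u_R, rng=None):
    """One customer's dish selection under the HBP culinary metaphor.

    n_minus_i[k]  : counts n_{j,k}^{-i} of other customers in restaurant j eating dish k
    beta[k]       : top-level weights beta_k of the K globally known dishes
    in_restaurant : boolean mask, True if dish k is already served in restaurant j
    B_u_R         : mass of the unused region, e.g. 1 - sum(b_k) as in Section 3.1.2
    """
    rng = np.random.default_rng() if rng is None else rng
    denom = alpha_j + N_j - 1.0
    # Probability of picking each known dish (two cases, matching Section 3.1.2)
    p = np.where(in_restaurant,
                 (n_minus_i + alpha_j * beta) / denom,   # dish already in restaurant j
                 (alpha_j * beta) / denom)               # dish only in other restaurants
    z = (rng.random(p.size) < p).astype(int)
    # Number of brand-new dishes: F_ji ~ Poi(alpha_j * B^u(R) / (alpha_j + N_j - 1))
    F_ji = rng.poisson(alpha_j * B_u_R / denom)
    return z, F_ji
```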


(3.24) $p(W_{ji} \mid X_{ji}, Z_{ji}, \Phi) \propto p(W_{ji} \mid \sigma_{jw}^2)\, p(X_{ji} \mid \Phi, Z_{ji}, W_{ji})$

Again, under the model described in Eqs (2.12)-(2.15), the above turns out to be a normal distribution. For detailed derivations of Eqs (3.23)-(3.24), see Section A.3 of the appendix.

3.2 Sampling Hyperparameters

3.2.1 Sampling αj We assume a vague gamma prior on $\alpha_j$ with shape and scale parameters $a_\alpha$ and $b_\alpha$ respectively. To sample $\alpha_j$, we introduce further auxiliary variables $u = (u_{jk} : j = 1, \ldots, J \text{ and } k = 1, \ldots, K)$ and $s = (s_{jk} : j = 1, \ldots, J \text{ and } k = 1, \ldots, K)$, where $u_{jk} \in [0, 1]$ and $s_{jk} \in \{0, 1\}$. The conditional posterior distribution of $\alpha_j$ is given as

(3.25) $p(\alpha_j \mid Z, \beta, m, l, u, s, a_\alpha, b_\alpha) = \mathrm{gamma}\left(a_\alpha^{post}, b_\alpha^{post}\right)$

where $a_\alpha^{post} = a_\alpha + \sum_k (m_{jk} + l_{jk} - s_{jk})$ and $b_\alpha^{post} = b_\alpha - \sum_k \log u_{jk}$, and the auxiliary variables $u_{jk}$ and $s_{jk}$ are sampled as

(3.26) $p(u_{jk} \mid \alpha_j) = \mathrm{beta}(\alpha_j + 1, N_j)$
(3.27) $p(s_{jk} = 1 \mid \alpha_j) = \mathrm{Bernoulli}\left(\frac{N_j}{\alpha_j + N_j}\right)$

For details, see Appendix Section A.5.
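To make the auxiliary-variable scheme of Eqs (3.25)-(3.27) concrete, a single Gibbs update for $\alpha_j$ could look like the sketch below. This is our own illustration (names like resample_alpha_j are not from the paper), assuming the counts $m_{jk}$ and $l_{jk}$ have already been sampled via Eqs (3.17)-(3.18).

```python
import numpy as np

def resample_alpha_j(alpha_j, m_j, l_j, N_j, a_alpha=1.0, b_alpha=1.0, rng=None):
    """One auxiliary-variable Gibbs update for alpha_j (Eqs 3.25-3.27).

    m_j, l_j : arrays of length K with the current auxiliary counts m_jk, l_jk.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = m_j.size
    # Eq (3.26): u_jk ~ beta(alpha_j + 1, N_j)
    u = rng.beta(alpha_j + 1.0, N_j, size=K)
    # Eq (3.27): s_jk ~ Bernoulli(N_j / (alpha_j + N_j))
    s = (rng.random(K) < N_j / (alpha_j + N_j)).astype(int)
    # Eq (3.25): alpha_j ~ gamma(a + sum(m + l - s), b - sum(log u))
    shape = a_alpha + np.sum(m_j + l_j - s)
    rate = b_alpha - np.sum(np.log(u))      # treating b as a rate parameter here
    return rng.gamma(shape, 1.0 / rate)
```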
3.2.2 Sampling γ0 We assume a gamma prior on $\gamma_0$ with shape and scale parameters $a_\gamma$ and $b_\gamma$ respectively. For sampling from the posterior of $\gamma_0$, we introduce another set of auxiliary variables $r = (r_k : k = 1, \ldots, K)$, $r' = (r'_k : k = 1, \ldots, K)$, $v = (v_k : k = 1, \ldots, K)$ and $c = (c_k : k = 1, \ldots, K)$, where $r_k \in \{0, 1, \ldots, \sum_j m_{jk}\}$, $r'_k \in \{0, 1, \ldots, \sum_j l_{jk}\}$, $v_k \in [0, 1]$ and $c_k \in \{0, 1\}$. Conditioned on $m, l, r, r', v, c$ and the rest, the posterior distribution of $\gamma_0$ can be shown to be a gamma distribution with shape and scale parameters $a_\gamma + \sum_k (r_k + r'_k - c_k)$ and $b_\gamma - \sum_k \log v_k$ respectively. The auxiliary variables $r, r', v, c$ can be sampled similarly to the schemes described in Eqs (3.16), (3.17), (3.26) and (3.27). For details, see Appendix Section A.6.

3.2.3 Sampling bk We assume a beta prior on $b_k$, i.e. $b_k \sim \mathrm{beta}(\mu_a, \mu_b)$. Given the auxiliary variables $r, r'$, the posterior distribution of $b_k$ is another beta distribution, i.e. $b_k \mid r_k, r'_k \sim \mathrm{beta}(\mu_a + r_k, \mu_b + r'_k)$. For details, see Appendix Section A.7.

3.2.4 Sampling σφ, σnj and σwj Assuming $\sigma_\phi^2 \sim \mathrm{invGam}(\sigma_\phi^2; a, b)$, the posterior for $\sigma_\phi^2$ can be shown to be an inverse-gamma distribution with shape and scale parameters $a + MK/2$ and $b + \mathrm{tr}(\Phi\Phi^T)/2$ respectively. The posterior distributions for $\sigma_{nj}^2$ and $\sigma_{wj}^2$ can be derived similarly. For details, see Appendix Section A.4.

3.3 Predictive Distribution The predictive distribution of the test data from the $j$-th source given the training data, i.e. $p(\tilde{X}_j \mid X_{1:J})$, can be written as

$p(\tilde{X}_j \mid X_{1:J}) = \int p(\tilde{X}_j \mid \Phi, \tilde{Z}_j, \tilde{W}_j)\, p(\Phi \mid X_{1:J}, \beta)\, p(\tilde{W}_j \mid \sigma_{wj})\, p(\tilde{Z}_j \mid \beta)\, p(\beta)\, d(\Phi, \tilde{W}_j, \tilde{Z}_j, \beta)$

The variables $\tilde{Z}_j$ and $\tilde{W}_j$ denote the binary matrix and the weight matrix respectively, such that $\tilde{X}_j = \Phi(\tilde{Z}_j \odot \tilde{W}_j) + \tilde{E}_j$. The above predictive distribution can be approximated using the Gibbs samples of the respective posterior distributions in the following manner

(3.28) $p(\tilde{X}_j \mid X_{1:J}) \approx \frac{1}{R} \sum_{r=1}^{R} p\left(\tilde{X}_j \mid \Phi^{(r)}, \tilde{Z}_j^{(r)}, \tilde{W}_j^{(r)}\right)$

where $\tilde{Z}_j^{(r)}$ and $\tilde{W}_j^{(r)}$ are sampled using the procedure described in Section 3.1.

3.4 Hyperparameters and Extent of Sharing The extent of sharing in the proposed framework is indicated by two parameters – the number of factors which are shared across data sources (say $K_s$) and the hyperparameters $\alpha_j$. The parameter $K_s$ directly indicates the extent of sharing, while the hyperparameter $\alpha_j$ controls the variation of the random distribution $A_j$ (a draw from a beta process) from the upper-level random distribution $B$ (a draw from another beta process). Both $K_s$ and $\alpha_j$ are learnt within the Gibbs sampling framework without needing any model selection. To see the effect of variations in $\alpha_j$, consider two data sources; high values of $\alpha_1$ and $\alpha_2$ imply high sharing and vice versa. In the extreme case, when $\alpha_1$ and $\alpha_2$ approach infinity, since both data sources use the same base measure (i.e. $B$), they tend to use an identical distribution for choosing the factors.

4 Experiments

In this section, we perform a variety of experiments to demonstrate the effectiveness of the proposed model. To illustrate our model and its behavior, we first do experiments with a synthetic dataset. Then, we show the usefulness of our model for two real-world applications. For both the synthetic and real-world tasks, the priors for the hyperparameters were chosen as follows: $\gamma_0 \sim \mathrm{gamma}(1, 1)$, $\alpha_j \sim \mathrm{gamma}(1, 1)$ and $b_k \sim \mathrm{beta}(1, 1000)$, and both the shape and the scale parameters of the gamma priors for $\sigma_\phi$, $\sigma_{nj}$ and $\sigma_{wj}$ were set to 1. The truncating variable $L$ is set to 1000.



Figure 3: Experiments with synthetic data (a) true factors (both shared and individual) (b) inferred factors (both shared and individual); note the exact recovery up to a permutation (c) convergence of the number of factors (d) true factor mixings of source-1 (e) true factor mixings of source-2 (f) inferred factor mixings of source-1 (g) inferred factor mixings of source-2; rows of (f) and (g) are manually permuted for easy comparison with (d) and (e).

4.1 Experiments-I : Synthetic Data For experiments with synthetic data, we create twelve 36-dimensional factors and use them across two sources. Out of these factors, the first four factors (vertical bars) are used in source-1 only, the last four factors (horizontal bars) are used in source-2 only, and the remaining factors (diagonal bars) are shared between both sources. By distributing the factors in this way, our aim is to see whether the proposed model can discover the factors used for data generation and their correct mixings. The binary matrices $Z_1$ and $Z_2$ are generated randomly using a uniform distribution, and the feature matrices $W_1$, $W_2$ follow a normal distribution with mean zero and standard deviation one. The noise follows a normal distribution with mean zero and standard deviation equal to 0.1. We perform the inference based on Gibbs sampling as described in Section 3 and use only one sample after rejecting 1500 “burn-in” samples. Although the model converges around 1000 iterations, we run the sampler longer to illustrate the point that the number of factors does not change further. Figure 3c shows that the model correctly learns the number of factors automatically from the data. In addition, it can be seen from Figure 3 that, indeed, our model is able to learn the factors used for generating the data and their mixing combinations correctly.

4.2 Experiments-II : Real Data

4.2.1 Results using NIPS 0-12 Dataset We trained our proposed NJFA model on the NIPS 0-12 dataset, which contains the articles from the Neural Information Processing Systems (NIPS) conference published over 12 years (during 1988-1999). In this dataset1, articles are categorized into nine main conference sections and one miscellaneous section. The main conference sections are Cognitive Science (CS), Neuroscience (NS), Learning Theory (LT), Algorithms and Architecture (AA), Implementations (IM), Speech and Signal Processing (SSP), Visual Processing (VP), Applications (AP) and Control, Navigation and Planning (CNP). We ignore articles from the miscellaneous section. Our model treats each section as a separate corpus and learns the factor matrix such that some of the factors are shared among different sections whereas other factors are specific to particular sections.

We use the above dataset in a transfer learning setting and investigate whether combined learning of one section with others provides benefits. We use an experimental setting similar to [21] and treat the section VP as the target source and the other sections as auxiliary sources. There are a total of 1564 articles combined across the nine main sections. We select 80 articles from each section randomly and use them for training. We combine various auxiliary sections, one at a time, with the target and compute the perplexity on a test set of articles from the VP section. The test set contains 44 articles and is fixed throughout our experimentation.

Evaluation Measures and Baselines To evaluate the proposed method, we use perplexity per document, a widely used measure in language models (LM).

1 We downloaded the processed version available at http://www.gatsby.ucl.ac.uk/~ywteh/research/data.html.



Figure 4: Perplexity results using NIPS 0-12 dataset (a) perplexity for the test data from VP section versus number of VP training documents (averaged over 10 runs) shown for three different models (b) mean average perplexity for the test data from VP section (averaged over 10 runs) versus different NIPS auxiliary sections (except VP) using the proposed hierarchical joint factor-analysis model (c) Jensen-Shannon divergence between the word-distributions of VP section and other auxiliary sections.

The perplexity of a new document indicates the degree of surprise that a model expresses when modeling new documents. Therefore, a low value of perplexity over the test set indicates better prediction of the test data. The inference procedure described in Section 3 can be used to obtain Gibbs samples of $\{\Phi\}$, $\{\beta\}$, $\{\sigma_{wj}\}$, $\{\sigma_\phi\}$ and $\{\sigma_{nj}\}$ using the training data $X_{1:J}$. Using these samples, we can compute the perplexity of the test data from the $j$-th source, i.e. $\mathrm{Perplexity}(\tilde{X}_j)$, where $\tilde{X}_j$ denotes the test data matrix from the $j$-th source. Perplexity is defined as

(4.29) $\mathrm{Perplexity}(\tilde{X}_j) = \exp\left(-\frac{1}{\tilde{N}} \log p(\tilde{X}_j \mid X_{1:J})\right)$

where $p(\tilde{X}_j \mid X_{1:J})$ is the predictive distribution as described in Section 3.3.

To compare the performance benefits, we use two baselines. The first baseline is a model that does not use any auxiliary data and relies totally on the target data. This model is equivalent to the factor analysis model proposed in [17]. We refer to this baseline as the ‘no auxiliary’ model. The second baseline is a model which uses auxiliary data but does not use it in a hierarchical fashion. Instead, it simply combines the auxiliary data with the target data as if they were from the same section. We call this model the ‘flat’ model. It can be thought of as an adaptation of [17] for jointly modeling the data from multiple sources. Both of these baseline models are special cases of the proposed model. When comparing the proposed model with the baselines, we refer to it as the ‘hierarchical’ model. In our implementation, the Gibbs sampler typically converges in 25 iterations; however, we run the sampler for 100 iterations.

Experimental Results For each target-auxiliary pair, we average the perplexity values over 10 trials and plot them as a function of the number of target (VP section) training articles, varied from 10 to 80 in steps of 10. Figure 4 depicts the perplexity values for the proposed model in comparison with the baseline models. For each graph, the perplexity values shown are averaged across all eight auxiliary sections and 10 trials. It can be clearly seen from Figure 4a that both the ‘flat’ and ‘hierarchical’ models perform better than the ‘no auxiliary’ model, empirically proving the point that the use of auxiliary data does improve performance. We can also see that the ‘hierarchical’ model performs significantly better than the ‘flat’ model. This improvement in performance is mainly due to the ability of the ‘hierarchical’ model to treat each section differently while still allowing sharing between them. On the contrary, the ‘flat’ model does not cater for the variabilities of different data sources as it treats them identically, and thus results in performance degradation. To evaluate the benefits obtained from each auxiliary section, Figure 4b shows the “mean average perplexity” (a single-point indicator obtained by averaging the perplexity values across increasing numbers of training documents). We see that the three auxiliary sections providing maximum benefits are NS, CS and AP, in decreasing order of performance. This is in corroboration with the Jensen-Shannon divergence between the word-distributions of the VP section and the other auxiliary sections as shown in Figure 4c.

4.2.2 Results using NUS-WIDE Dataset Our second dataset is based on the NUS-WIDE [5] dataset, which is a large collection of Flickr images. We select a subset2 comprising 3411 images involving 13 animals (see Figure 5). This dataset provides six different low-level features [5] (64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments) and 500-D SIFT descriptions along with their ground-truths. We ignore the SIFT features and use the low-level features only. Our model treats each animal category as a separate data source (motivated by the intuition that each animal category has a different distribution) and learns the factor matrix $\Phi$ such that some of the factors are shared across different animal categories while other factors are specific to a particular category.

2 This dataset has already been used in [4] and is available at http://www.cs.cmu.edu/~junzhu/data.htm.
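A Monte Carlo approximation of the perplexity in Eq (4.29), using the predictive approximation of Eq (3.28), might be organised as in the sketch below. This is our own illustrative code (the Gaussian likelihood follows Eq (2.15); gibbs_samples is assumed to hold per-iteration draws of $\Phi^{(r)}$, $\tilde{Z}_j^{(r)}$, $\tilde{W}_j^{(r)}$ and $\sigma_{nj}^{(r)}$, and $\tilde{N}$ is taken as the number of test documents, which is our assumption).

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def log_predictive(X_test, gibbs_samples):
    """log p(X_test | X_train) ~= log (1/R) sum_r p(X_test | Phi^(r), Z^(r), W^(r))  (Eq 3.28)."""
    log_liks = []
    for s in gibbs_samples:                       # each s is one Gibbs draw
        mean = s["Phi"] @ (s["Z"] * s["W"])       # Eq (2.15): mean = Phi (Z o W)
        ll = norm.logpdf(X_test, loc=mean, scale=s["sigma_n"]).sum()
        log_liks.append(ll)
    return logsumexp(log_liks) - np.log(len(log_liks))

def perplexity(X_test, gibbs_samples):
    """Eq (4.29), with N_tilde taken as the number of test documents (columns)."""
    N_tilde = X_test.shape[1]
    return np.exp(-log_predictive(X_test, gibbs_samples) / N_tilde)
```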


Figure 5: An example of query images from each category of the NUS-WIDE animal dataset.

We use a set of 2054 images for training and the remaining for testing – identical training and test settings as used in [4] – so that a comparison can be made with our work. We use the ‘hierarchical’ model to learn the joint factorization and infer the factor matrix $\Phi$ along with the subspace representations $H_1, \ldots, H_{13}$. For the new images used as queries (represented by the matrix $Y^{test} = [y_1, \ldots, y_T]$), we compute the subspace representation matrix $H^{test} = [h_1, \ldots, h_T]$ such that $Y^{test} \approx \Phi H^{test}$. To find the similar images for the $t$-th query image (represented by $y_t$), we compute the cosine similarity between its subspace representation $h_t$ and the subspace representations of the training data (i.e. $H_j$ for each $j = 1, \ldots, 13$). The retrieved images are ranked in decreasing order of these similarities.
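The retrieval step described above (ranking training images by cosine similarity between subspace representations) is straightforward to implement; a minimal sketch, with our own function name rank_by_cosine and assuming the query representation has already been inferred, is given below.

```python
import numpy as np

def rank_by_cosine(h_query, H_train):
    """Rank training items (columns of H_train) by cosine similarity to h_query."""
    Hn = H_train / (np.linalg.norm(H_train, axis=0, keepdims=True) + 1e-12)
    qn = h_query / (np.linalg.norm(h_query) + 1e-12)
    sims = qn @ Hn                      # cosine similarity with every training column
    order = np.argsort(-sims)           # decreasing order of similarity
    return order, sims[order]
```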

Evaluation Measures and Baselines To evaluate the performance of the proposed method, we use mean average precision (MAP). We compare the result of the proposed model with recent state-of-the-art techniques [4, 23, 24], categorized under multi-view learning algorithms.

Method                    Mean Average Precision (MAP)
DWH [23]                  0.153
TWH [24]                  0.158
MMH [4]                   0.163
NJFA (proposed method)    0.1789 ± 0.0128

Table 1: Comparison of image retrieval results with recent state-of-the-art techniques using the NUS-WIDE animal dataset.

Experimental Results Table 1 compares the proposed method with the recent works [4, 23, 24] on image retrieval using the NUS-WIDE animal dataset. It can be seen from the table that our model outperforms the baseline models. This comparison is based on the mean average precision (MAP) values presented in [4]. We note that the dataset used to generate these results (including the test set) is identical. The MAP results of the baseline models are reported using 60 topics. Our work, being a nonparametric model, learns these dimensionalities automatically from the data and avoids any cross-validation for model selection. We report our results averaged across 20 runs along with the standard deviation. The total number of factors (both shared and individual) for this dataset varied between 5 and 8. In our implementation, the Gibbs sampler typically converges within 15-20 iterations; however, we run the sampler for 100 iterations.

5 Conclusion

We developed a nonparametric joint factor analysis technique for simultaneously modeling multiple related data sources. Our technique learns shared factors to exploit common statistical strengths and individual factors to model the variabilities of each source. To infer the number of shared and individual factors automatically from the data, we use the hierarchical beta process (HBP) prior [22]. The auxiliary variable Gibbs sampling provided for the hierarchical beta process is general and can be utilized for other matrix factorizations. Automatically learning the extent of sharing across data sources avoids the possibility of negative knowledge transfer caused by a priori non-optimal sharing specifications. Our experiments using the NIPS 0-12 dataset show the usefulness of the proposed model for transfer learning applications. In application to image retrieval, the proposed method outperforms recent state-of-the-art methods on the NUS-WIDE animal dataset.

References


[1] R.K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6:1817–1853, 2005.
[2] F.R. Bach and M.I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report, University of California, Berkeley, 2005.
[3] B. Chen, G. Polatkan, G. Sapiro, D.B. Dunson, and L. Carin. The hierarchical beta process for convolutional factor analysis and deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 361–368, 2011.
[4] N. Chen, J. Zhu, and E.P. Xing. Predictive subspace learning for multi-view data: a large margin approach. Advances in Neural Information Processing Systems, pages 361–369, 2010.


[5] T.S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. Proceedings of the ACM International Conference on Image and Video Retrieval, pages 48:1–48:9, 2009.
[6] P. Damien, J. Wakefield, and S. Walker. Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(2):331–344, 1999.
[7] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.
[8] E.B. Fox, E.B. Sudderth, M.I. Jordan, and A.S. Willsky. Sharing features among dynamical systems with beta processes. Advances in Neural Information Processing Systems, 22:549–557, 2009.
[9] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. Advances in Neural Information Processing Systems, 18:475, 2006.
[10] S.K. Gupta, D. Phung, B. Adams, and S. Venkatesh. Regularized nonnegative shared subspace learning. Data Mining and Knowledge Discovery, pages 1–41. doi:10.1007/s10618-011-0244-8.
[11] S.K. Gupta, D. Phung, B. Adams, T. Tran, and S. Venkatesh. Nonnegative shared subspace learning and its application to social media retrieval. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1169–1178, 2010.
[12] S.K. Gupta, D. Phung, B. Adams, and S. Venkatesh. A Bayesian framework for learning shared and individual subspaces from multiple data sources. In Advances in Knowledge Discovery and Data Mining, 15th Pacific-Asia Conference, 2011.
[13] N.L. Hjort. Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics, 18(3):1259–1294, 1990.
[14] Y. Jia, M. Salzmann, and T. Darrell. Factorized latent spaces with structured sparsity. Technical Report UCB/EECS-2010-99, EECS Department, University of California, Berkeley, 2010.
[15] Y. Kim. Nonparametric Bayesian estimators for counting processes. The Annals of Statistics, 27(2):562–588, 1999.
[16] D. Knowles and Z. Ghahramani. Infinite sparse factor analysis and infinite independent components analysis. Independent Component Analysis and Signal Separation, pages 381–388, 2007.
[17] J. Paisley and L. Carin. Nonparametric factor analysis with beta process priors. Proceedings of the 26th Annual International Conference on Machine Learning, pages 777–784, 2009.
[18] P. Rai and H. Daumé III. Multi-label prediction via sparse infinite CCA. In Advances in Neural Information Processing Systems, volume 22, pages 1518–1526, 2009.
[19] P. Rai and H. Daumé III. Infinite predictor subspace models for multitask learning. Journal of Machine Learning Research - Proceedings Track, 9:613–620, 2010.
[20] S. Saria, D. Koller, and A. Penn. Discovering shared and individual latent structure in multiple time series. Technical Report, arXiv preprint arXiv:1008.2028, 2010.

[21] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[22] R. Thibaux and M.I. Jordan. Hierarchical beta processes and the Indian buffet process. Journal of Machine Learning Research - Proceedings Track, 2:564–571, 2007.
[23] E. Xing, R. Yan, and A.G. Hauptmann. Mining associated text and images with dual-wing harmoniums. Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pages 633–641, 2005.
[24] J. Yang, Y. Liu, E.X. Ping, and A.G. Hauptmann. Harmonium models for semantic video representation and classification. SIAM Conference on Data Mining, pages 1–12, 2007.

A Inference Details

A.1 Sampling β Before proposing an update for $\beta$, we integrate out $\pi$ and write the joint distribution of $\{Z, \beta\}$ as

(A.1) $p(Z, \beta) = \prod_{k=1}^{K} p(\beta_k) \prod_{j=1}^{J} \frac{\Gamma(\alpha_j)}{\Gamma(N_j + \alpha_j)} \frac{\Gamma(\alpha_j \beta_k + n_{jk})}{\Gamma(\alpha_j \beta_k)} \frac{\Gamma(\alpha_j - \alpha_j \beta_k + \bar{n}_{jk})}{\Gamma(\alpha_j - \alpha_j \beta_k)}$

where $\bar{n}_{jk} \triangleq N_j - n_{jk}$. We use the auxiliary variable method ([6] and [21]) for sampling from the posterior of $\beta$. The likelihood expression of Eq (A.1) involves $\beta$ in the arguments of gamma functions, and such ratios of gamma functions lead to independent polynomials in $\alpha_j \beta_k$ and $\alpha_j (1 - \beta_k)$. In particular,

(A.2) $\frac{\Gamma(\alpha_j \beta_k + n_{jk})}{\Gamma(\alpha_j \beta_k)} = \sum_{m_{jk}=0}^{n_{jk}} s(n_{jk}, m_{jk}) (\alpha_j \beta_k)^{m_{jk}}$

(A.3) $\frac{\Gamma(\alpha_j \bar{\beta}_k + \bar{n}_{jk})}{\Gamma(\alpha_j \bar{\beta}_k)} = \sum_{l_{jk}=0}^{\bar{n}_{jk}} s(\bar{n}_{jk}, l_{jk}) (\alpha_j \bar{\beta}_k)^{l_{jk}}$

where we define $\bar{\beta}_k \triangleq 1 - \beta_k$ and $\bar{b}_k \triangleq 1 - b_k$, and $s(n, m)$ denotes the unsigned Stirling numbers of the first kind. The variables $m = (m_{jk} : \forall j, k)$ and $l = (l_{jk} : \forall j, k)$ are treated as auxiliary to the model. Using Eqs (A.1), (A.2), (A.3) and the prior on $\beta$, we can write the joint distribution over $Z$, $m$, $l$ and $\beta$ as

(A.4) $p(Z, m, l, \beta) = \prod_{k=1}^{K} \frac{\Gamma(\gamma_0)}{\Gamma(\gamma_0 b_k)\,\Gamma(\gamma_0 \bar{b}_k)} \times \prod_{k=1}^{K} \beta_k^{\gamma_0 b_k - 1}\, \bar{\beta}_k^{\gamma_0 \bar{b}_k - 1} \times \prod_{k=1}^{K} \prod_{j=1}^{J} \frac{\Gamma(\alpha_j)}{\Gamma(N_j + \alpha_j)}\, s(n_{jk}, m_{jk}) (\alpha_j \beta_k)^{m_{jk}}\, s(\bar{n}_{jk}, l_{jk}) (\alpha_j \bar{\beta}_k)^{l_{jk}}$

To sample $\beta$ given $Z$, we use the above joint distribution and iterate sampling between the variables $m$, $l$ and $\beta$. The conditional distributions of Eqs (3.18), (3.17) and (3.16) can then be easily written from Eq (A.4).


A.2 Sampling Z Here, we show that the likelihood integral in Eq (3.22) can be solved in closed form provided $W_{ji}^n$ is given. Following standard matrix algebra, the solution is given as

$\int p\left(X_{ji} \mid Z_{ji}, Z_{ji}^n, \Phi, \Phi^n, W_{ji}, W_{ji}^n, \sigma_{jn}^2\right) dB_0(\Phi^n) = \left[\det(S_F)\right]^{M/2} \exp\left(\frac{1}{2} \mathrm{tr}\left(\mu_F S_F^{-1} \mu_F^T\right)\right) \times \frac{1}{\sigma_\phi^{M \times F_{ji}}}\, \mathcal{N}\left(X_{ji} \mid \Phi H_{ji}, \sigma_{nj}^2 I\right)$

where $\mu_F = (X_{ji} - \Phi H_{ji}) (H_{ji}^n)^T S_F$, $S_F^{-1} = \frac{H_{ji}^n (H_{ji}^n)^T}{\sigma_{nj}^2} + \frac{I_{F_{ji}}}{\sigma_\phi^2}$ and $H_{ji}^n \triangleq Z_{ji}^n \odot W_{ji}^n$. The symbol $\mathcal{N}(\cdot)$ is used to denote the normal probability density function.

A.3 Sampling Φ and W Gibbs sampling update for $\Phi$ conditioned on the data $X_1, \ldots, X_J$ and the latent indicator variables $Z$ is given as

(A.5) $p(\Phi \mid Z_{1:J}, W_{1:J}, X_{1:J}) \propto p(\Phi \mid B_0, \sigma_\Phi^2) \times p(X_{1:J} \mid \Phi, Z_{1:J}, W_{1:J})$

Taking the logarithm on both sides, we get

$-\log p(\Phi \mid Z_{1:J}, W_{1:J}, X_{1:J}) = \text{const.} + \frac{1}{2\sigma_\phi^2} \mathrm{tr}\left(\Phi\Phi^T\right) + \sum_j \frac{1}{2\sigma_{nj}^2} \mathrm{tr}\left((X_j - \Phi H_j)(X_j - \Phi H_j)^T\right)$

$\Rightarrow p(\Phi \mid Z_{1:J}, W_{1:J}, X_{1:J}) = \mathcal{N}(\Phi \mid \mu_\Phi, S_\Phi)$

where $\mu_\Phi \triangleq \left(\sum_j \frac{X_j H_j^T}{\sigma_{nj}^2}\right) S_\Phi$ and $S_\Phi^{-1} \triangleq \sum_j \frac{H_j H_j^T}{\sigma_{nj}^2} + \frac{I_K}{\sigma_\phi^2}$. Similarly, the Gibbs sampling update for $W_{ji}$ conditioned on $X_{ji}$, $Z_{ji}$ and $\Phi$ is given by

(A.6) $p(W_{ji} \mid X_{ji}, Z_{ji}, \Phi) \propto p(W_{ji} \mid \sigma_{jw}^2) \times p(X_{ji} \mid \Phi, Z_{ji}, W_{ji})$

Again, under the assumed model distributions, this turns out to be a normal distribution, i.e. $p(W_{ji} \mid X_{ji}, Z_{ji}, \Phi) = \mathcal{N}(W_{ji} \mid \mu_{w_{ji}}, S_{w_{ji}})$, where $\mu_{w_{ji}} \triangleq \frac{S_{w_{ji}} D_{Z_{ji}} \Phi^T X_{ji}}{\sigma_{nj}^2}$, $S_{w_{ji}}^{-1} \triangleq \frac{D_{Z_{ji}} \Phi^T \Phi D_{Z_{ji}}}{\sigma_{nj}^2} + \frac{I_K}{\sigma_{wj}^2}$ and $D_{Z_{ji}} \triangleq \mathrm{diag}(Z_{ji})$, a diagonal matrix constructed using the vector $Z_{ji}$ as its diagonal elements.

A.4 Update for σφ, σnj and σwj Assuming an inverse-gamma prior on $\sigma_\phi^2$ with shape and scale parameters $a_{\sigma_\phi}$ and $b_{\sigma_\phi}$, the posterior for $\sigma_\phi^2$ is given by

$p(\sigma_\phi^2 \mid \Phi, a_{\sigma_\phi}, b_{\sigma_\phi}) \propto p(\sigma_\phi^2 \mid a_{\sigma_\phi}, b_{\sigma_\phi})\, p(\Phi \mid \sigma_\phi^2) \propto \left(\sigma_\phi^2\right)^{-(a_{\sigma_\phi}+1)} \exp\left(-\frac{b_{\sigma_\phi}}{\sigma_\phi^2}\right) \left(\sigma_\phi^2\right)^{-MK/2} \exp\left(-\frac{\mathrm{tr}(\Phi\Phi^T)}{2\sigma_\phi^2}\right) \propto \left(\sigma_\phi^2\right)^{-(a_{\sigma_\phi}+MK/2)-1} \exp\left(-\frac{b_{\sigma_\phi} + \frac{1}{2}\mathrm{tr}(\Phi\Phi^T)}{\sigma_\phi^2}\right)$

which is an inverse-gamma distribution with shape and scale parameters $a_{\sigma_\phi} + MK/2$ and $b_{\sigma_\phi} + \frac{1}{2}\mathrm{tr}(\Phi\Phi^T)$ respectively. Sampling from the posteriors of $\sigma_{nj}$ and $\sigma_{wj}$ is similar to the sampling of $\sigma_\phi$ and also has inverse-gamma form. If $\sigma_{nj}^2 \sim \mathrm{invGam}(\sigma_{nj}^2 \mid a_{\sigma_{nj}}, b_{\sigma_{nj}})$, the posterior samples of $\sigma_{nj}^2$ are drawn as

$\sigma_{nj}^2 \mid X_j, W_j, Z_j, \Phi \sim \mathrm{invGam}\left(\sigma_{nj}^2 \mid a_{\sigma_{nj}}^{post}, b_{\sigma_{nj}}^{post}\right)$

where $a_{\sigma_{nj}}^{post} = a_{\sigma_{nj}} + M N_j / 2$ and $b_{\sigma_{nj}}^{post} = b_{\sigma_{nj}} + \frac{1}{2}\mathrm{tr}\left((X_j - \Phi H_j)(X_j - \Phi H_j)^T\right)$. Similarly, the posterior samples of $\sigma_{wj}^2$ are drawn as

$\sigma_{wj}^2 \mid W_j \sim \mathrm{invGam}\left(\sigma_{wj}^2 \mid a_{\sigma_{wj}}^{post}, b_{\sigma_{wj}}^{post}\right)$

where $a_{\sigma_{wj}}^{post} = a_{\sigma_{wj}} + K N_j / 2$ and $b_{\sigma_{wj}}^{post} = b_{\sigma_{wj}} + \frac{1}{2}\mathrm{tr}\left(W_j W_j^T\right)$.
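The Gaussian posterior of Eq (A.5) is easy to sample once $\mu_\Phi$ and $S_\Phi$ are formed. The sketch below is our own illustration (it assumes that, under the stated $\mu_\Phi$ and $S_\Phi$, the rows of $\Phi$ are conditionally independent with shared covariance $S_\Phi$, which is how we read the derivation above).

```python
import numpy as np

def sample_Phi(X, H, sigma_n, sigma_phi, rng=None):
    """Draw Phi from the Gaussian posterior of Eq (A.5).

    X, H    : lists over sources of data matrices X_j (M x N_j) and loadings H_j (K x N_j)
    sigma_n : list of per-source noise standard deviations sigma_nj
    """
    rng = np.random.default_rng() if rng is None else rng
    M, K = X[0].shape[0], H[0].shape[0]
    prec = np.eye(K) / sigma_phi**2
    mean_term = np.zeros((M, K))
    for Xj, Hj, snj in zip(X, H, sigma_n):
        prec += Hj @ Hj.T / snj**2        # S_Phi^{-1} = sum_j H_j H_j^T / s_nj^2 + I_K / s_phi^2
        mean_term += Xj @ Hj.T / snj**2   # sum_j X_j H_j^T / s_nj^2
    S_Phi = np.linalg.inv(prec)
    mu_Phi = mean_term @ S_Phi            # mu_Phi = (sum_j X_j H_j^T / s_nj^2) S_Phi
    # Sample each row of Phi with covariance S_Phi
    return np.stack([rng.multivariate_normal(mu_Phi[m], S_Phi) for m in range(M)])
```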


A.5 Sampling αj We note that the joint distribution of Eq (A.4) acts as the likelihood for $\alpha_j$. We assume a gamma prior on $\alpha_j$ with parameters $a_\alpha$ and $b_\alpha$. The likelihood expression, emphasizing the hyperparameters, can be re-written as

$p(Z, m, l, \beta) = \prod_{k=1}^{K} p(\beta_k) \prod_{j=1}^{J} \frac{\Gamma(\alpha_j)}{\Gamma(N_j + \alpha_j)}\, s(n_{jk}, m_{jk}) (\alpha_j \beta_k)^{m_{jk}}\, s(\bar{n}_{jk}, l_{jk}) (\alpha_j - \alpha_j \beta_k)^{l_{jk}}$

The above expression can be simplified as

(A.7) $p(Z, \beta, m, l, \gamma_0, \tau_0 \mid \alpha_1, \ldots, \alpha_J) = \left(\prod_{j=1}^{J} \prod_{k=1}^{K} \frac{\Gamma(\alpha_j)}{\Gamma(N_j + \alpha_j)} \alpha_j^{p_{jk}}\right) \left(\prod_{k=1}^{K} p(\beta_k) \prod_{k'=K+1}^{L} p(\beta_{k'})\right) \times \left(\prod_{k=1}^{K} \prod_{j=1}^{J} s(n_{jk}, m_{jk})\, \beta_k^{m_{jk}}\, s(\bar{n}_{jk}, l_{jk})\, \bar{\beta}_k^{l_{jk}}\right)$

where we define $p_{jk} \triangleq m_{jk} + l_{jk}$. Before proceeding further, we note that $\alpha_j$ is present in the arguments of gamma functions and use an identity from [21] to represent the ratio of gamma functions as

(A.8) $\frac{\Gamma(\alpha_j)}{\Gamma(N_j + \alpha_j)} = \frac{1}{\Gamma(N_j)} \int_0^1 u_j^{\alpha_j} (1 - u_j)^{N_j - 1} \left(1 + \frac{N_j}{\alpha_j}\right) du_j$

Following the lines of [7] and [21], we further introduce auxiliary variables $u = (u_{jk} : j = 1, \ldots, J \text{ and } k = 1, \ldots, K)$ and $s = (s_{jk} : j = 1, \ldots, J \text{ and } k = 1, \ldots, K)$, where $u_{jk} \in [0, 1]$ and $s_{jk} \in \{0, 1\}$. Using these auxiliary variables,

$p(Z, \beta, m, l, u, s, \gamma_0, \tau_0 \mid \alpha_1, \ldots, \alpha_J) = \left(\prod_{j=1}^{J} \prod_{k=1}^{K} \left(\frac{N_j}{\alpha_j}\right)^{s_{jk}} u_{jk}^{\alpha_j} (1 - u_{jk})^{N_j - 1} \alpha_j^{p_{jk}}\right) \prod_{k'=K+1}^{L} p(\beta_{k'}) \times \left(\prod_{k=1}^{K} p(\beta_k) \prod_{j=1}^{J} s(n_{jk}, m_{jk})\, \beta_k^{m_{jk}}\, s(\bar{n}_{jk}, l_{jk})\, \bar{\beta}_k^{l_{jk}}\right)$

Now, conditioned on the auxiliary variables $m$, $l$, $u$ and $s$, we can write the posterior of $\alpha_j$ as

$p(\alpha_j \mid Z, \beta, m, l, u, s, \gamma_0, \tau_0, a_\alpha, b_\alpha) \propto p(Z, \beta, m, l, u, s, \gamma_0, \tau_0 \mid \alpha_j)\, p(\alpha_j \mid a_\alpha, b_\alpha)$

which reduces to a gamma distribution with shape and scale parameters $a_\alpha + \sum_k (m_{jk} + l_{jk} - s_{jk})$ and $b_\alpha - \sum_k \log u_{jk}$ respectively. The auxiliary variables $s_{jk}$ and $u_{jk}$ can be sampled as

(A.9) $p(s_{jk} = 1 \mid \alpha_j) = \frac{N_j}{\alpha_j + N_j} \quad \text{and} \quad p(u_{jk} \mid \alpha_j) \propto u_{jk}^{\alpha_j} (1 - u_{jk})^{N_j - 1}$

Integrating Out β To be able to sample the hyperparameters $\gamma_0$ and $\tau_0$, we need to integrate out $\beta$ from the joint distribution expression of Eq (A.7). The required marginal after integrating out $\beta$ is given as the following (note the emphasis on the hyperparameters)

$p(Z, m, l, \gamma_0, \tau_0, \alpha_1, \ldots, \alpha_J) = \prod_{k=1}^{K} \frac{\Gamma(\gamma_0)}{\Gamma(\gamma_0 + p_k)} \frac{\Gamma(m_k + \gamma_0 b_k)\,\Gamma(l_k + \gamma_0 \bar{b}_k)}{\Gamma(\gamma_0 b_k)\,\Gamma(\gamma_0 \bar{b}_k)} \times \left(\prod_{k=1}^{K} \prod_{j=1}^{J} \frac{\Gamma(\alpha_j)}{\Gamma(N_j + \alpha_j)} \alpha_j^{p_{jk}}\, s(n_{jk}, m_{jk})\, s(\bar{n}_{jk}, l_{jk})\right)$

where we define $m_k \triangleq \sum_j m_{jk}$, $l_k \triangleq \sum_j l_{jk}$ and $p_k \triangleq m_k + l_k$. Expanding the second and third terms of the gamma ratios in a fashion similar to Eq (A.3), and introducing further auxiliary variables $r = (r_k : k = 1, \ldots, K)$ and $r' = (r'_k : k = 1, \ldots, K)$ (where $r_k \in \{0, \ldots, m_k\}$ and $r'_k \in \{0, \ldots, l_k\}$) to the above joint distribution, we obtain

(A.10) $p(Z, m, l, r, r', \gamma_0, \tau_0, \alpha_1, \ldots, \alpha_J) = \left(\prod_{j=1}^{J} \prod_{k=1}^{K} \frac{\Gamma(\alpha_j)}{\Gamma(N_j + \alpha_j)} \alpha_j^{p_{jk}}\, s(n_{jk}, m_{jk})\, s(\bar{n}_{jk}, l_{jk})\right) \times \prod_{k=1}^{K} \frac{\Gamma(\gamma_0)}{\Gamma(\gamma_0 + p_k)}\, s(m_k, r_k)\, s(l_k, r'_k)\, (\gamma_0 b_k)^{r_k} (\gamma_0 \bar{b}_k)^{r'_k}$

A.6 Sampling γ0 We note that the expression in Eq (A.10) is nothing but the likelihood of $\gamma_0$. Assuming $\gamma_0 \sim \mathrm{gamma}(\gamma_0 \mid a_\gamma, b_\gamma)$, the posterior of $\gamma_0$ is given as

$p(\gamma_0 \mid \text{rest}) \propto p(\gamma_0 \mid a_\gamma, b_\gamma) \times p(Z, m, l, r, r', \tau_0, \alpha_1, \ldots, \alpha_J \mid \gamma_0) \propto \gamma_0^{a_\gamma - 1} \exp(-b_\gamma \gamma_0) \times \prod_{k=1}^{K} \frac{\Gamma(\gamma_0)}{\Gamma(\gamma_0 + p_k)} (\gamma_0 b_k)^{r_k} (\gamma_0 \bar{b}_k)^{r'_k}$

Expanding the term involving the ratio of gamma functions similarly to Eq (A.8), and introducing auxiliary variables $v = (v_k : k = 1, \ldots, K)$ and $c = (c_k : k = 1, \ldots, K)$, where $v_k \in [0, 1]$ and $c_k \in \{0, 1\}$, we obtain

$p(\gamma_0 \mid v, c \text{ and rest}) \propto \gamma_0^{a_\gamma - 1} \exp(-b_\gamma \gamma_0) \times \prod_{k=1}^{K} v_k^{\gamma_0}\, \bar{v}_k^{p_k - 1} \left(\frac{p_k}{\gamma_0}\right)^{c_k} (\gamma_0 b_k)^{r_k} (\gamma_0 \bar{b}_k)^{r'_k}$

$\Rightarrow p(\gamma_0 \mid v, c \text{ and rest}) = \mathrm{gamma}\left(a_\gamma + \sum_k (r_k + r'_k - c_k),\; b_\gamma - \sum_k \log v_k\right)$

where $\bar{v}_k = 1 - v_k$. The auxiliary variables $\{v_k\}$ and $\{c_k\}$ can be sampled similarly to the scheme described in Eq (A.9).

A.7 Sampling bk The likelihood of $b_k$ can be obtained from Eq (A.10). Assuming a beta prior on each $b_k$, i.e. $b_k \sim \mathrm{beta}(\mu_a, \mu_b)$, the posterior of $b_k$ is given as

$p(b_k \mid r_k, r'_k \text{ and rest}) \propto p(b_k \mid \mu_a, \mu_b) \times p(Z, m, l, r, r', \tau_0, \gamma_0, \alpha_{1:J} \mid b_k) \propto b_k^{\mu_a + r_k - 1}\, \bar{b}_k^{\mu_b + r'_k - 1} = \mathrm{beta}(\mu_a + r_k, \mu_b + r'_k)$

We use $\mu_a = \tau_0$, which is equal to 1 for our model, and $\mu_b = L$, where $L$ is chosen to be a large value such that $L \gg K$.

