Learning Arbitrary Sum-Product Network Leaves with Expectation-Maximization

arXiv:1604.07243v2 [cs.LG] 12 Dec 2016

Mattia Desana

Christoph Schnörr

Heidelberg University, Institute of Applied Mathematics, Image and Pattern Analysis Group, Im Neuenheimer Feld 205, 69120 Heidelberg. Corresponding author: [email protected]

Abstract It has been shown in recent work that Expectation-Maximization (EM) is an attractive way to learn the parameters of Sum-Product Networks (SPNs). However, previous derivations of EM do not easily extend to SPNs with general leaf distributions and shared parameters. In this paper we obtain a new derivation of EM for SPNs that allows us to derive EM updates for arbitrary leaf distributions and for parameters shared between SPN nodes. Here, EM translates into a set of independent weighted maximum likelihood problems, one for each leaf distribution, and therefore any distribution that can be trained with maximum likelihood (even approximately) fits in this framework. The ability to learn complex leaf distributions in a principled way makes SPN parameter learning more flexible than learning SPN weights only. We demonstrate the potential of this approach with extensive experiments on twenty widely used benchmark datasets, using tree graphical models as leaf distributions.

1 INTRODUCTION

Sum-Product Networks (SPNs, [Poon and Domingos, 2011]) are probabilistic models which represent distributions through a Directed Acyclic Graph (DAG) with sums and products as internal nodes and tractable probability distributions as leaves. The crucial property of SPNs is that the cost of inference scales linearly with the number of edges in the DAG, and therefore exact inference is always tractable (in contrast to graphical models, where inference can be intractable even for relatively small graphs with cycles). In addition, SPNs can model complex distributions by exploiting context-specific dependences and determinism. Due to these attractive modeling properties, SPNs have found practical use in a broad range of machine learning tasks including computer vision and density estimation (see e.g. [Gens and Domingos, 2012, Cheng et al., 2014, Amer and Todorovic., 2015, Rahman and Gogate, 2016b]).

It was recently shown (independently by [Dennis and Ventura, 2015] and [Zhao et al., 2016]) that a SPN S encodes a mixture model whose size can be exponentially larger than the number of edges in S. The components of this mixture are obtained, roughly speaking, as product combinations of subsets of the leaf distributions. Crucially, an efficient derivation of the classical Expectation-Maximization (EM) algorithm was obtained for this very large mixture. This was done independently by [Peharz et al., 2016] and [Zhao et al., 2016]: [Zhao et al., 2016] obtains the update by viewing SPN parameter learning as a signomial program, while [Peharz et al., 2016] does so by adding hidden variables to each internal SPN node and then finding the marginals required in EM. An empirical analysis in [Zhao et al., 2016] showed that EM performs better in practice than parameter learning algorithms developed previously in the literature.

However, the derivations of EM considered above do not cover two important cases that are used in the literature. The first case is obtaining EM updates for arbitrary leaf distributions (fig. 1). Using leaf distributions with complex structure allows a large degree of flexibility in SPN leaves, as shown for instance in [Rahman and Gogate, 2016a] and [Rahman and Gogate, 2016b], which obtain state-of-the-art results on density estimation by learning a SPN structure with tree graphical model leaves. Existing EM learning algorithms do not include updates


for complex leaf distributions: [Zhao et al., 2016] only deals with weight updates, and [Peharz et al., 2016] only discusses the case of univariate leaf distributions in the exponential family. The second case for which we wish to obtain EM updates (a rarer one) is that of SPNs with parameters shared between nodes (e.g. [Cheng et al., 2014]). The goal of this paper is to extend EM to the general case of shared parameters and arbitrary leaf distributions. Unfortunately, it was not straightforward (at least to us) to extend the methods above to include these cases. Therefore, we propose a new derivation of EM which extends naturally to these scenarios. The first step in this derivation is a new result relating the SPN and a subset of its encoded mixture (Proposition 2 in the following). Then, by directly applying EM for mixture models to a SPN and exploiting Proposition 2, the optimization of the SPN leaves assumes the form of a weighted maximum likelihood problem, which is a slight modification of standard maximum likelihood and is well studied for a wide class of distributions. Solutions can be found efficiently either with exact or approximate methods, and extend immediately to the parameter sharing case. This result allows us to use complex leaf distributions and to train them efficiently and straightforwardly, as long as they can be trained with maximum likelihood. Even approximate maximum likelihood estimators can be used, which allows a very wide family of probabilistic models to be used as leaves. Furthermore, particularly nice results hold when the leaves belong to the exponential family, where the M step has a single optimum and the maximization can often be performed efficiently in closed form, for instance for multivariate Gaussian leaves and for tree graphical model leaves, where both the optimal structure and the parameters can be found (fig. 1).

To test whether these properties can be exploited in practice, we perform experiments on a set of twenty widely used datasets for density estimation, using a SPN with tree graphical model leaves. First, we compare the effect of learning only the SPN weights against also learning the leaf distributions with EM, showing that the latter consistently improves training log likelihood. Second, we show that a simple SPN structure learning algorithm with tree graphical model leaves learned with EM can obtain performance comparable with state-of-the-art methods while using much smaller models, thanks to the greater expressivity of the leaf distributions. These results suggest that much of the complexity of the SPN structure can be encoded in complex leaves trained with maximum likelihood, and that therefore using complex leaf distributions learned with EM is a promising direction of future research.

Figure 1: Sketch of a SPN with tree graphical models as leaves. The SPN weights and the structure and potentials of the trees can be learned jointly and efficiently with our derivation of EM.

Structure of the Paper. Section 2 introduces SPNs and the related notation. Section 3 discusses SPNs as mixture models and introduces the new Proposition 2. Section 4 contains our derivation of Expectation-Maximization for SPNs. Section 5 reports experiments on benchmark datasets.

2 BACKGROUND: SUM-PRODUCT NETWORKS

We start with the definition of SPN based on [Gens and Domingos, 2013], chosen to be decomposable without loss of generality (see [Peharz, 2015]). Consider a set of variables X (either continuous or discrete):

Definition 1. Sum-Product Network (SPN):

1. A tractable distribution $\varphi(X)$ is a SPN $S(X)$.
2. The product $\prod_k S_k(X_k)$ of SPNs $S_k(X_k)$ is a SPN $S(\bigcup_k X_k)$ if all the sets $X_k$ are disjoint.
3. The weighted sum $\sum_k w_k S_k(X)$ of SPNs $S_k(X)$ is a SPN $S(X)$ if the weights $w_k$ are nonnegative (notice that X is in common for each SPN $S_k$).

By associating a node to each product, sum and tractable distribution and adding edges between an operator and its inputs, a SPN can be represented as a rooted Directed Acyclic Graph (DAG) with sums and products as internal nodes and tractable distributions as leaves (example: fig. 2). This definition generalizes the SPN with indicator variables presented in [Poon and Domingos, 2011], since indicator variables are a special case of distribution in the form of a Dirac delta centered on a particular state of the variable, e.g. $\delta(x, 0)$ (fig. 1). A SPN is normalized if the weights of the outgoing edges of each sum node sum to 1: $\sum_k w_k = 1$. We will consider only normalized SPNs, without loss of generality ([Peharz, 2015]).


Notation. We use the following notation throughout the paper. X denotes a set of variables (either continuous or discrete depending on the context) and x an assignment of these variables. We consider a SPN S(X). $S_q(X_q)$ denotes the sub-SPN rooted at a node q of S, with $X_q \subseteq X$. $S(x)$ is the value of S evaluated for the assignment x (see below). $\varphi_l(X_l)$ denotes the distribution at leaf node l. In the DAG of S, ch(q) and pa(q) denote the children and parents of q respectively, and (q, i) indicates an edge between q and its child i. If q is a sum node this edge is associated to a weight $w_i^q$; for simplicity of notation we associate the outgoing edges of product nodes to a weight 1. Finally, E(S), L(S) and N(S) denote respectively the set of edges, leaves and sum nodes in S.

Parameters. S(X) is governed by two sets of parameters: the set of sum node weights W (terms $w_i^q$ for each sum node edge) and the set of leaf distribution parameters θ. We write $S(X|W, \theta)$ to explicitly express this dependency. Each leaf distribution $\varphi_l$ is associated to a parameter set $\theta_l \subseteq \theta$, which contains for instance the mean and covariance for Gaussian leaves, and the tree structure and potentials for tree graphical model leaves (see section 4.2).

Evaluation. Assuming that the DAG of S(X) has E edges, the quantities $S(x)$ and $\partial S(x)/\partial S_q$ for each node q in S can be evaluated with a cost O(E) [Poon and Domingos, 2011], by performing an upward and a downward pass on the SPN. The partition function can also be computed in O(E).
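The upward/downward evaluation can be made concrete with a short sketch. The following Python code is a minimal illustration (not the authors' implementation): the node classes, the Leaf interface and the helper names are our own assumptions, but the two passes compute exactly the quantities $S(x)$ and $\partial S(x)/\partial S_q$ used in the rest of the paper, in O(E) time.

```python
# Minimal sketch of SPN evaluation (illustrative, not the authors' code).
# The node classes and the Leaf interface are assumptions made for this example.

class Leaf:
    def __init__(self, dist):
        self.dist = dist                  # any object exposing .pdf(x) on its scope
        self.children = []
    def value(self, x):
        return self.dist.pdf(x)

class Product:
    def __init__(self, children):
        self.children = children

class Sum:
    def __init__(self, children, weights):
        self.children = children
        self.weights = weights            # nonnegative; summing to 1 if normalized

def topological_order(root):
    """Nodes of the DAG, each node listed before all of its children."""
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for c in n.children:
            visit(c)
        order.append(n)                   # postorder: descendants first
    visit(root)
    return list(reversed(order))          # reversed: parents before children

def upward(node, x, val):
    """Upward pass: compute S_q(x) for every node q, memoized in `val`."""
    if id(node) in val:
        return val[id(node)]
    if isinstance(node, Leaf):
        v = node.value(x)
    elif isinstance(node, Product):
        v = 1.0
        for c in node.children:
            v *= upward(c, x, val)
    else:                                 # Sum node
        v = sum(w * upward(c, x, val) for c, w in zip(node.children, node.weights))
    val[id(node)] = v
    return v

def downward(root, val):
    """Downward pass: compute dS/dS_q for every node q of the DAG."""
    der = {id(root): 1.0}
    for node in topological_order(root):
        d = der.get(id(node), 0.0)
        if isinstance(node, Sum):
            for c, w in zip(node.children, node.weights):
                der[id(c)] = der.get(id(c), 0.0) + w * d
        elif isinstance(node, Product):
            for c in node.children:
                rest = 1.0
                for o in node.children:
                    if o is not c:
                        rest *= val[id(o)]
                der[id(c)] = der.get(id(c), 0.0) + d * rest
    return der
```

Given these helpers, calling upward(root, x, val) on an empty dictionary val and then downward(root, val) yields all node values and derivatives needed for the E steps of Section 4.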

3 SPNs AS MIXTURE MODELS

This section discusses the interpretation of SPNs as mixture models, derived independently in [Dennis and Ventura, 2015] and [Zhao et al., 2016], on which we will base our derivation of EM.

Definition 2. A complete subnetwork (subnetwork for brevity) $\sigma_c$ of S is a SPN constructed by first including the root of S in $\sigma_c$, then processing each node q included in $\sigma_c$ as follows:

1. If q is a sum node, include in $\sigma_c$ one child $i \in ch(q)$ with relative weight $w_i^q$. Process the included child.
2. If q is a product node, include in $\sigma_c$ all the children ch(q). Process the included children.
3. If q is a leaf node, do nothing.

Example: fig. 2. The term "subnetwork" is taken from [Gens and Domingos, 2012], which uses the same concept. It is easy to show that any subnetwork is a tree ([Dennis and Ventura, 2015]). Let us call C the number of different subnetworks obtainable from S for different choices of included sum node children, and associate a unique index $c \in \{1, 2, \dots, C\}$ to each possible subnetwork $\sigma_1, \sigma_2, \dots, \sigma_C$. [Poon and Domingos, 2011] noted that the number of subnetworks can be exponentially larger than the number of edges in S, e.g. for the parity distribution.

Figure 2: A SPN S(A, B) in which a subnetwork $\sigma_c$ of S is highlighted. This subnetwork corresponds to the mixture coefficient $\lambda_c = w_2^1 w_9^4$ and component $P_c(A, B) = \varphi_7(B)\,\varphi_9(A)$.

Definition 3. For a subnetwork $\sigma_c$ of $S(X|W, \theta)$ we define a mixture coefficient $\lambda_c = \prod_{(q,j) \in E(\sigma_c)} w_j^q$ and a mixture component $P_c(X|\theta) = \prod_{l \in L(\sigma_c)} \varphi_l(X_l|\theta_l)$.

Note that the mixture coefficients are products of all the sum weights in a subnetwork $\sigma_c$, and the mixture components are factorizations obtained as products of the leaves in $\sigma_c$ (see fig. 2).

Proposition 1. A SPN S encodes the following mixture model:

$$S(X|W, \theta) = \sum_{c=1}^{C} \lambda_c(W)\, P_c(X|\theta) \qquad (1)$$

where C denotes the number of different subnetworks in S, and $\lambda_c$, $P_c$ are mixture coefficients and components as in def. 3.

Proof: see [Dennis and Ventura, 2015]. Notice that since $C \gg E$, a SPN encodes a mixture which can be intractably large if explicitly represented.

We now introduce a new result that is crucial for our derivation of EM, reporting it here rather than in the proofs section since it contributes to the set of analytical tools for SPNs.

Proposition 2. Consider a SPN S(X), a sum node $q \in S$ and a node $i \in ch(q)$. The following relation holds:

$$\sum_{k:(q,i) \in E(\sigma_k)} \lambda_k P_k(X) = w_i^q\, \frac{\partial S(X)}{\partial S_q}\, S_i(X) \qquad (2)$$

where $\sum_{k:(q,i) \in E(\sigma_k)}$ denotes the sum over all the subnetworks $\sigma_k$ of S that include the edge (q, i).

Proof: in Appendix A.1. This result relates the subset of the mixture model corresponding to the subnetworks $\sigma_k$ that cross (q, i) to the value and the derivative of the SPN nodes q and i. Note that evaluating the left-hand sum has a cost O(C) (intractable), but the right-hand term has a cost O(E) (tractable). Since in the derivation of EM we will need to evaluate such subsets of solutions, this result allows us to compute the quantity of interest in a single SPN evaluation. Note also that $\sum_{k:(q,i)\in E(\sigma_k)} \lambda_k P_k(X)$ corresponds to the evaluation of a non-normalized SPN which is a subset of S, e.g. the colored part in fig. 3.

Figure 3: Visualization of Proposition 2. The colored part is the set of edges traversed by all subnetworks crossing (q, i). The blue part represents $S_i$ and the red part covers the terms appearing in $\partial S(X)/\partial S_q$.
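Propositions 1 and 2 can be checked numerically on a small example. The following self-contained sketch is our own illustration: the leaf values and weights are arbitrary numbers standing for a fixed assignment (a, b), and the subnetworks of a tiny SPN are enumerated by hand to verify both identities.

```python
# Tiny numeric check of Propositions 1 and 2 (illustrative values, not from the paper).
phiA1, phiA2 = 0.3, 0.8        # two leaf distributions over A, evaluated at a
phiB1, phiB2 = 0.5, 0.1        # two leaf distributions over B, evaluated at b
w1, w2 = 0.6, 0.4              # weights of the root sum node
v1, v2 = 0.7, 0.3              # weights of an internal sum node q over the B-leaves

# Upward pass: value of the internal sum node q, then the root value S(a, b).
S_q = v1 * phiB1 + v2 * phiB2
S = w1 * (phiA1 * S_q) + w2 * (phiA2 * phiB2)

# The three complete subnetworks, as pairs (lambda_c, P_c) of Definition 3.
mixture = [(w1 * v1, phiA1 * phiB1),   # root child 1, inner sum child 1
           (w1 * v2, phiA1 * phiB2),   # root child 1, inner sum child 2
           (w2,      phiA2 * phiB2)]   # root child 2 (a product of two leaves)

# Proposition 1: S equals the sum over all subnetworks of lambda_c * P_c.
assert abs(S - sum(lam * P for lam, P in mixture)) < 1e-12

# Proposition 2 for the edge (q, B-leaf 1): the subnetworks crossing that edge
# sum to w_i^q * (dS/dS_q) * S_i, with dS/dS_q = w1 * phiA1 and S_i = phiB1.
lhs = w1 * v1 * phiA1 * phiB1          # only the first subnetwork crosses (q, i)
rhs = v1 * (w1 * phiA1) * phiB1
assert abs(lhs - rhs) < 1e-12
```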

4 EXPECTATION MAXIMIZATION

Expectation-Maximization is an elegant and widely used method for finding maximum likelihood solutions for models with latent variables (see e.g. [Murphy, 2012, 11.4]). Given a distribution $P(X) = \sum_{c=1}^{C} P(X, c|\pi)$, where c are the latent variables and π are the distribution parameters, our objective is to maximize the log likelihood $\sum_{n=1}^{N} \ln \sum_{c=1}^{C} P(x_n, c|\pi)$ over a dataset of observations $\{x_1, x_2, \dots, x_N\}$. EM proceeds by updating the parameters iteratively, starting from some initial configuration $\pi_{old}$. An update step consists in finding $\pi^* = \arg\max_\pi Q(\pi|\pi_{old})$, where $Q(\pi|\pi_{old}) = \sum_{n=1}^{N} \sum_{c=1}^{C} P(c|x_n, \pi_{old}) \ln P(c, x_n|\pi)$.

We want to apply EM to the mixture encoded by a SPN, which is in principle intractably large. First, using the relation between a SPN and its encoded mixture model in Proposition 1, we identify $P(c, x_n|\pi) = \lambda_c(W) P_c(x_n|\theta)$, $P(x_n|\pi_{old}) = S(x_n|W_{old}, \theta_{old})$, and therefore $P(c|x_n, \pi_{old}) = P(c, x_n|\pi_{old})/P(x_n|\pi_{old}) = \lambda_c(W_{old}) P_c(x_n|\theta_{old})/S(x_n|W_{old}, \theta_{old})$. Applying these substitutions and dropping the dependency on $W_{old}, \theta_{old}$ for compactness, $Q(W, \theta|W_{old}, \theta_{old})$ becomes:

$$Q(W, \theta) = \sum_{n=1}^{N} \sum_{c=1}^{C} \frac{\lambda_c P_c(x_n)}{S(x_n)} \ln\left(\lambda_c(W)\, P_c(x_n|\theta)\right) \qquad (3)$$

In the following sections we will maximize $Q(W, \theta)$ for W and θ.

4.1 Weights Update

Simplifying $Q(W, \theta)$ through the use of Proposition 2 (Appendix A.2), one needs to maximize the following objective function:

$$W^* = \arg\max_W Q_W(W) \qquad (4)$$

$$Q_W(W) = \sum_{q \in N(S)} \sum_{i \in ch(q)} \beta_i^q \ln w_i^q \qquad (5)$$

$$\beta_i^q = w_{i,old}^q \sum_{n=1}^{N} S(x_n)^{-1} \frac{\partial S(x_n)}{\partial S_q} S_i(x_n)$$

The evaluation of the terms $\beta_i^q$, which depend only on $W_{old}, \theta_{old}$ and are therefore constants in the optimization, is the E step of the EM algorithm. We now maximize $Q_W(W)$ subject to $\sum_i w_i^q = 1$ for all $q \in N(S)$ (M step).

Non-shared weights. If the weights at each node q are disjoint, then we can move the max inside the sum, obtaining separate maximizations, each in the form $\arg\max_{w^q} \sum_{i \in ch(q)} \beta_i^q \ln w_i^q$, where $w^q$ is the set of weights outgoing from q. Now, the same maximum is attained after multiplying by $k = 1/\sum_i \beta_i^q$. Then, defining $\bar\beta_i^q = k \beta_i^q$, we can equivalently find $\arg\max_{w^q} \sum_{i \in ch(q)} \bar\beta_i^q \ln w_i^q$, where $\bar\beta_i^q$ is positive and sums to 1 and can therefore be interpreted as a discrete distribution. This is then the maximum of the cross entropy $\arg\max_{w^q} -H(\bar\beta_i^q, w_i^q)$, defined e.g. in [Murphy, 2012, 2.8.2], attained for $w_i^q = \bar\beta_i^q$, which corresponds to the following update:

$$w_j^{q*} = \beta_j^q \Big/ \sum_i \beta_i^q. \qquad (6)$$
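As a concrete illustration, the E and M steps of eqs. 4-6 for a single sum node take the following form (a sketch under our own naming assumptions; the per-sample quantities $S(x_n)$, $\partial S(x_n)/\partial S_q$ and $S_i(x_n)$ are assumed to come from one upward/downward pass over the dataset, as in the sketch of Section 2):

```python
import numpy as np

def em_update_sum_weights(w_old, S_x, dS_dSq, S_child_x):
    """EM update for the weights of one sum node q (eqs. 4-6).

    w_old:      (K,)   current weights w_i^q of the K children of q
    S_x:        (N,)   root values S(x_n)
    dS_dSq:     (N,)   derivatives dS(x_n)/dS_q from the downward pass
    S_child_x:  (K, N) child values S_i(x_n) from the upward pass
    Returns the updated weights w_i^{q*} = beta_i^q / sum_j beta_j^q.
    """
    # E step: beta_i^q = w_{i,old}^q * sum_n S(x_n)^{-1} dS(x_n)/dS_q S_i(x_n)
    beta = w_old * np.sum(dS_dSq * S_child_x / S_x, axis=1)
    # M step: closed-form maximizer of eq. 5 under the normalization constraint
    return beta / beta.sum()
```

Looping this update over all sum nodes in N(S), reusing the same cached upward/downward quantities, gives a full EM iteration for W at a cost linear in the number of edges.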

Remark: this update for non-shared weights only was already derived (with radically different approaches) in [Peharz et al., 2016, Zhao et al., 2016].

Shared weights. In some SPN applications it is necessary to share weights between different sum nodes, for instance when a convolutional architecture is used (see e.g. [Cheng et al., 2014]). To keep the notation simple, let us consider only two nodes $q_1, q_2$ constrained to share weights, that is $w_i^{q_1} = w_i^{q_2} = \hat w_i$ for every child i, where $\hat w$ is the set of shared weights. We then rewrite $Q_W(W)$, isolating the part


depending on $\hat w$, as $Q_W(W) = \sum_{i \in ch(q_1)} \beta_i^{q_1} \ln w_i^{q_1} + \sum_{i \in ch(q_2)} \beta_i^{q_2} \ln w_i^{q_2} + \text{const}$ (the constant includes the terms not depending on $\hat w$). Then, employing the weight sharing constraint, the maximization of $Q_W$ for $\hat w$ becomes $\arg\max_{\hat w} \sum_i (\beta_i^{q_1} + \beta_i^{q_2}) \ln \hat w_i$. As in the non-shared case, we end up maximizing the cross entropy $-H(k(\beta_i^{q_1} + \beta_i^{q_2}), \hat w_i)$. Generalizing to an arbitrary set of nodes D such that every node $q \in D$ has shared weights $\hat w$, the weight update for $\hat w$ is as follows:

$$\hat w_j^* = \frac{\sum_{q \in D} \beta_j^q}{\sum_i \sum_{q \in D} \beta_i^q}. \qquad (7)$$

Note that in both the shared and the non-shared case, the EM weight update does not suffer from vanishing updates even at deep nodes, thanks to the normalization term. Finally, we note that since all the quantities required in a weight update can be computed with a single forward-downward pass on the SPN, an EM iteration for W has a cost linear in the number of edges.

4.2 Leaf Parameters Update

From the mixture point of view, the weight update corresponds to the EM update for the mixture coefficients $\lambda_c$ in eq. 1. Simplifying $Q(W, \theta)$ through the use of Proposition 2 (Appendix A.3), the optimization problem for the leaf parameters reduces to:

$$\theta^* = \arg\max_\theta Q_\theta(\theta) \qquad (8)$$

$$Q_\theta(\theta) = \sum_{l \in L(S)} \sum_{n=1}^{N} \alpha_{ln} \ln \varphi_l(x_n|\theta_l) \qquad (9)$$

$$\alpha_{ln} = S(x_n)^{-1} \frac{\partial S(x_n)}{\partial S_l} S_l(x_n)$$

The evaluation of the terms $\alpha_{ln}$, which are constant coefficients in the optimization since they depend only on $W_{old}, \theta_{old}$, is the E step of the EM algorithm and can be seen as computing the responsibility that leaf distribution $\varphi_l$ assigns to the n-th data point, similarly to the responsibilities appearing in EM for classical mixture models. Importantly, we note that the maximization (8) is concave as long as $\ln \varphi_l(X_l|\theta)$ is concave, in which case there is a unique global optimum.

Non-shared parameters. Introducing the hypothesis that the parameters $\theta_l$ are disjoint at each leaf l, we obtain separate maximizations of the form:

$$\theta_l^* = \arg\max_{\theta_l} \sum_{n=1}^{N} \alpha_{ln} \ln \varphi_l(x_n|\theta_l) \qquad (10)$$

In this formulation one can recognize a weighted maximum likelihood problem, where each data sample n is weighted by a soft count $\alpha_{ln}$. This implies that EM can be applied to any distribution that can be trained with weighted maximum likelihood. This procedure typically requires only small changes with respect to non-weighted maximum likelihood, and most distributions that can be trained with maximum likelihood also admit the weighted variant. Furthermore, since EM is guaranteed to converge as long as the objective function increases (see [Neal and Hinton, 1998]), the likelihood maximization can also be performed with approximate methods that do not reach the global optimum of the M step. Therefore, any distribution on which weighted maximum likelihood can be performed at least approximately can be used as a SPN leaf and trained with EM.

Shared parameters. Let us consider the case of two leaf nodes k, j associated to distributions $\varphi_k(X_k|\theta_k)$, $\varphi_j(X_j|\theta_j)$ respectively, such that $\theta_k = \theta_j = \hat\theta$ are shared parameters. Then, we can rewrite eq. 8 for k, j as $Q_\theta(\hat\theta) = \sum_{n=1}^{N} \alpha_{kn} \ln \varphi_k(x_n|\hat\theta) + \sum_{n=1}^{N} \alpha_{jn} \ln \varphi_j(x_n|\hat\theta) + \text{const}$, where the constant collects the terms not depending on $\hat\theta$. Generalizing to an arbitrary set of leaves D such that each leaf $l \in D$ has a distribution $\varphi_l(X_l|\hat\theta)$, and dropping the constant term, we need to maximize the following expression:

$$\hat\theta^* = \arg\max_{\hat\theta} \sum_{l \in D} \sum_{n=1}^{N} \alpha_{ln} \ln \varphi_l(x_n|\hat\theta). \qquad (11)$$

The objective function depending on $\hat\theta$ now contains a sum of logarithms, therefore it cannot be maximized as separate weighted maximum likelihood problems over each leaf as in the non-shared case. However, this expression is still concave as long as $\ln \varphi_l(X_l|\theta)$ is concave, in which case there is a unique global optimum (note that this case holds for exponential families, discussed next). The optimal solution can then be found with iterative methods such as gradient descent or second order methods.

Exponential Family Leaves. A crucial property of exponential families is that eq. 8 is concave and therefore a global optimum can be reached (see e.g. [Murphy, 2012, 11.3.2]). Additionally, the solution is often available efficiently in closed form. As an example, we look in detail at the closed form solution of the M step for two relevant cases.

If $\varphi_l(X_l)$ is a multivariate Gaussian, the solution of eq. 10 is obtained e.g. in [Murphy, 2012, 11.4.2]: $r_l = \sum_{n=1}^{N} \alpha_{ln}$, $\mu_l = \frac{1}{r_l} \sum_{n=1}^{N} \alpha_{ln} x_n$, $\Sigma_l = \frac{1}{r_l} \sum_{n=1}^{N} \alpha_{ln} (x_n - \mu_l)(x_n - \mu_l)^T$. In this case, EM for SPNs generalizes EM for Gaussian mixture models.
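For concreteness, the closed-form weighted M step for a multivariate Gaussian leaf reads as follows (a minimal sketch with our own function name; X holds the data restricted to the leaf scope and alpha the responsibilities $\alpha_{ln}$ from the E step):

```python
import numpy as np

def weighted_gaussian_mle(X, alpha):
    """Closed-form solution of eq. 10 for a multivariate Gaussian leaf.

    X:     (N, D) data points restricted to the scope of the leaf
    alpha: (N,)   soft counts alpha_{ln} computed in the E step
    Returns the weighted mean and covariance maximizing
    sum_n alpha_{ln} * log N(x_n | mu, Sigma).
    """
    r = alpha.sum()                                   # r_l
    mu = (alpha[:, None] * X).sum(axis=0) / r         # mu_l
    diff = X - mu
    Sigma = (alpha[:, None] * diff).T @ diff / r      # Sigma_l
    return mu, Sigma
```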

If $\varphi_l(X_l)$ is a tree graphical model over discrete variables, the solution of eq. 10 can be found with the Chow-Liu algorithm ([Chow and Liu, 1968]) adapted for weighted likelihood (see [Meila and Jordan, 2000]). The algorithm has a cost quadratic in the cardinality of X and learns jointly the optimal tree structure and potentials. The resulting SPN represents a very large mixture of tree graphical models as in fig. 1, and EM for this SPN generalizes EM for mixtures of trees (see [Meila and Jordan, 2000]). Finally, we note that since all the quantities required in a leaf update can be computed with a single forward-downward pass on the SPN, an EM iteration for θ has a cost linear in the number of edges. The cost of the maximization for each leaf depends on the leaf type: for Gaussian and tree graphical model leaves, it is linear in the number of data points.
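The weighted Chow-Liu step can be sketched as follows for binary variables (our own simplified illustration, under the assumption that the data are in {0, 1}): the soft counts give weighted marginals, pairwise mutual information is computed from them, and a maximum-weight spanning tree yields the leaf structure; the tree potentials are then read off the same weighted pairwise marginals.

```python
import numpy as np

def weighted_chow_liu(X, alpha, eps=1e-12):
    """Weighted Chow-Liu sketch for binary data (assumption: X in {0,1}).

    X:     (N, D) binary samples in the scope of the leaf
    alpha: (N,)   soft counts alpha_{ln} from the E step
    Returns a list of edges (parent, child) of a maximum-weight spanning tree
    under pairwise mutual information computed from the weighted counts.
    """
    N, D = X.shape
    w = alpha / alpha.sum()
    p1 = w @ X                                   # weighted P(X_i = 1)
    p11 = (X * w[:, None]).T @ X                 # weighted P(X_i = 1, X_j = 1)
    MI = np.zeros((D, D))
    for i in range(D):
        for j in range(i + 1, D):
            pi, pj = p1[i], p1[j]
            pij = np.array([[1 - pi - pj + p11[i, j], pj - p11[i, j]],
                            [pi - p11[i, j],          p11[i, j]]])
            prod = np.outer([1 - pi, pi], [1 - pj, pj])
            MI[i, j] = MI[j, i] = np.sum(pij * np.log((pij + eps) / (prod + eps)))
    # Prim's algorithm for a maximum spanning tree over the mutual information
    in_tree, edges = {0}, []
    while len(in_tree) < D:
        best = max(((i, j) for i in in_tree for j in range(D) if j not in in_tree),
                   key=lambda e: MI[e])
        edges.append(best)
        in_tree.add(best[1])
    return edges
```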

5 EXPERIMENTS

Table 1: Dataset structure and training set log likelihood for Experiment 1. Columns W, θ and (W, θ) denote using EM on the respective parameters.

Dataset        |train|   Nvars   |valid|        W          θ      (W, θ)
NLTCS            16181      16      2157     -6.16      -6.18      -6.14
MSNBC           291326      17     38843     -6.44      -6.41      -6.41
KDDCup2K        180092      64     19907     -2.41      -2.40      -2.35
Plants           17412      69      2321    -17.62     -14.01     -13.78
Audio            15000     100      2000    -42.34     -41.43     -41.21
Jester            9000     100      1000    -55.33     -54.77     -54.38
Netflix          15000     100      2000    -60.04     -58.47     -58.33
Accidents        12758     111      1700    -42.87     -32.37     -31.87
Retail           22041     135      2938    -10.76     -10.67     -10.69
Pumsb-star       12262     163      1635    -41.05     -26.69     -26.45
DNA               1600     180       400    -81.38     -80.68     -79.95
Kosarek          33375     190      4450    -11.12     -10.75     -10.66
MSWeb            29441     294     32750    -10.29      -9.70      -9.69
Book              8700     500      1159    -34.76     -34.68     -33.67
EachMovie         4524     500      1002    -59.05     -55.33     -53.43
WebKB             2803     839       558   -155.89    -148.97    -147.90
Reuters-52        6532     889      1028   -106.17     -92.93     -92.61
20Newsgrp.       11293     910      3764   -145.03    -143.89    -143.22
BBC               1670    1058       225   -261.69    -245.48    -244.95
Ad                2461    1556       327    -52.78     -12.89     -16.29

The aim of this section is to evaluate the possible gains of training complex SPN leaf distributions with EM. As our leaf distribution of choice we take tree graphical models, for which weighted maximum likelihood can be solved exactly (section 4.2). First, we need to define the structure of the SPN on which we want to test the learning performance. In order to keep the focus on parameter learning rather than structure learning, we chose the simplest structure learning algorithm (LearnSPN, [Gens and Domingos, 2013]) and augmented it to use tree leaves by simply adding a fixed number of tree leaves to each generated sum node q. The tree leaves are initialized as a mixture model over the data that was used to learn the subnetwork rooted in q (see [Gens and Domingos, 2013] for details). To keep the models small and the structure simple, we limit the depth to a fixed value. The number of added trees and the maximum depth are hyperparameters. In order to evaluate the effect of learning the tree leaves, we perform the following two experiments, using a set of 20 binary datasets that have been widely used as a density estimation benchmark for SPNs and related architectures, and whose structure is described in table 1 (see [Gens and Domingos, 2013]).

Experiment 1. In this experiment we compare learning using EM updates on the weights W only against using updates on both the weights and the leaf node parameters θ (which determine the structure and potentials of the tree leaves). The SPN structure is first learned from data with the modified LearnSPN algorithm described above. Then, the parameters W and θ are randomly reinitialized, and EM on the training data is run until convergence of the training log-likelihood. In table 1 we report the training log-likelihood when either EM updates


on the weights W only, on the parameters θ only, or jointly on W and θ are used. Note: when training only W we also update the weights that define the tree potentials in each leaf, since they would be SPN weights if the tree leaves were expressed as SPNs. An example convergence plot of the three EM variants is shown in fig. 4.

Figure 4: Plot of training set log-likelihood over iterations during EM training on the Jester dataset (curves for W, θ, and W, θ updates).

The empirical results show that EM updates for the leaf distributions indeed give better results than weight-only updates. This is not surprising, since learning only the weights and not the tree structure leads to a strictly smaller increase in log likelihood at each leaf update; however, the results of the converged EM procedure are not guaranteed to be better. Note that typical SPN parameter learning algorithms learn the SPN weights but not the leaf distributions, which are often simple indicator variables (however, the SPN structure in such methods typically has a much larger edge/leaves ratio, and it is therefore difficult to perform a direct comparison).

Experiment 2. In this experiment we test the generalization capabilities of learning tree leaves with EM by evaluating the test set log likelihood on the benchmark datasets. The SPN structure is learned using the simple structure learning algorithm described above, performing a grid search over the independence thresholds (values {0.1, 0.01, 0.001}) and the number of


tree leaves ({5, 20, 30}) attached to sum nodes (see [Gens and Domingos, 2013] for details). We limit the SPN depth to 6 to keep the models small enough for our MATLAB implementation. Then, we fine-tuned the initial SPN by training W, θ with EM, stopping as soon as the validation log-likelihood started to decrease.

First, we compare against CCCP ([Zhao et al., 2016]), which is the method closest to ours: CCCP employs LearnSPN for structure learning (like us, but without a depth limit) and then runs EM with only W-updates over the learned structure. The results of this experiment are shown in table 2. To perform a fair comparison, we also report the network size as the number of edges in the network (table 2), and for each tree leaf node we add to this count the number of edges which would be needed to represent the tree as a SPN. The results of our algorithm (column SPN-Tr.) are better than those obtained with CCCP in the majority of cases, despite the network size being generally smaller. These results indicate that it might be convenient to put more computational resources into modelling SPNs with complex structured leaves, learned with EM, rather than into just increasing the number of SPN edges.

Second, as a performance reference, we compare our test log likelihood with the current state of the art, obtained by comparing the competing approaches ID-SPN [Rooshenas and Lowd, 2014], ACMN [Lowd and Domingos, 2012], SPN-SVD [Adel et al., 2015], SPNMerge [Rahman and Gogate, 2016b] and ECNets [Rahman and Gogate, 2016a]. Our results are close to the state of the art on several datasets, and surpass it in two cases. This fact is quite surprising given the simplicity of the structure learning algorithm used here, and it suggests that integrating the complex structure learning methods used in the other papers with EM for the leaves might provide competitive results. For instance, it would be straightforward to apply SPNMerge [Rahman and Gogate, 2016b] to the SPN obtained with SPN-Tr., obtaining a graph SPN, and to use better ways to split the variables, such as SPN-SVD [Adel et al., 2015]. This should be the subject of future work.

Overall, the results of these experiments indicate that trading part of the computational resources used for creating a SPN with a large number of edges to create SPNs with structurally rich leaf distributions can be a worthy trade-off, which has not been explored until now. Therefore, using SPNs with complex leaves trained with EM through maximum likelihood (even approximately) is a promising direction of research, to be investigated in future applications.

Table 2: Results of Experiment 2. The first two columns compare the test log likelihood of CCCP and SPN-Tr.; the next two columns compare the sizes of the generated SPNs; the last column reports the current state-of-the-art results.

Dataset       Test LL CCCP   Test LL SPN-Tr.    Size CCCP   Size SPN-Tr.    Best LL
NLTCS                -6.03             -6.01       13,733          1,884      -6.00
MSNBC                -6.05             -6.04       54,839         13,192      -6.04
KDDCup2K             -2.13             -2.14       48,279         49,876      -2.12
Plants              -12.87            -12.30      132,959         60,365     -11.99
Audio               -40.02            -39.76      739,525         93,623     -39.67
Jester              -52.88            -52.59      314,013         92,623     -41.11
Netflix             -56.78            -56.12      161,655         93,715     -56.13
Accidents          -27.700            -29.86      204,501         99,621     -24.87
Retail             -10.919            -10.95       56,931        116,905     -10.60
Pumsb-star         -24.229            -23.71      140,339        104,789     -22.40
DNA                -84.921            -79.90      108,021        166,969     -80.03
Kosarek            -10.880            -10.75      203,321        148,607     -10.54
MSWeb               -9.970            -10.03       68,853        186,208      -9.22
Book               -35.009            -34.68      190,625        433,582     -30.18
EachMovie          -52.557            -55.42      522,753        339,001     -51.14
WebKB             -157.492           -167.76    1,439,751        713,117    -150.10
Reuters-52         -84.628            -91.69    2,210,325        603,845     -82.10
20Newsgrp.        -153.205           -156.78   14,561,965        841,793    -151.47
BBC               -248.602           -266.34    1,879,921        881,113    -236.82
Ad                 -27.202            -16.88    4,133,421        364,184     -14.36

6 CONCLUSIONS

In this paper we obtained a new derivation of Expectation-Maximization (EM) for Sum-Product Networks (SPNs), including EM updates for arbitrary leaf distributions and shared parameters, which were not covered in previous work. The EM updates at the leaves were shown to correspond to weighted maximum likelihood problems, which allows a large family of distributions to fit in this framework. In particular, the optimal solution can be found exactly and efficiently for many exponential family models (e.g. for Gaussian and tree graphical model leaves). Experimental results on benchmark datasets showed that using complex SPN leaves trained with EM is a promising direction of future research.

A PROOFS

Preliminaries. Consider some subnetwork $\sigma_c$ of S including the edge (q, i) (fig. 3). Remembering that $\sigma_c$ is a tree, we divide $\sigma_c$ into three disjoint subgraphs: the edge (q, i), the tree $\sigma^{d(i)}_{h(c)}$ corresponding to the "descendants" of i, and the remaining tree $\sigma^{a(q)}_{g(c)}$. Notice that g(c) could be the same for two different subnetworks $\sigma_1$ and $\sigma_2$, meaning that the subtree $\sigma^{a(q)}_{g(c)}$ is in common (similarly for $\sigma^{d(i)}_{h(c)}$). We now observe that the coefficient $\lambda_c$ and the component $P_c$ (def. 3) factorize in terms corresponding to $\sigma^{d(i)}_{h(c)}$ and $\sigma^{a(q)}_{g(c)}$ as follows: $\lambda_c = w_i^q \lambda^{d(i)}_{h(c)} \lambda^{a(q)}_{g(c)}$ and $P_c = P^{d(i)}_{h(c)} P^{a(q)}_{g(c)}$, where $\lambda^{d(i)}_{h(c)} = \prod_{(m,n) \in E(\sigma^{d(i)}_{h(c)})} w_n^m$, $P^{d(i)}_{h(c)} = \prod_{l \in L(\sigma^{d(i)}_{h(c)})} \varphi_l$, and similarly for a(q). With this notation, for each subnetwork $\sigma_c$ including (q, i) we write:

$$\lambda_c P_c = w_i^q \left(\lambda^{a(q)}_{g(c)} P^{a(q)}_{g(c)}\right)\left(\lambda^{d(i)}_{h(c)} P^{d(i)}_{h(c)}\right) \qquad (12)$$

Let us now consider the sum over all the subnetworks $\sigma_c$ of S that include (q, i). The sum can be rewritten as two nested sums, the external one over all terms $\sigma^{a(q)}_g$ (red part, fig. 3) and the internal one over all subnets $\sigma^{d(i)}_h$ (blue part, fig. 3). This is intuitively easy to grasp: we can think of the sum over all trees $\sigma_c$ as first keeping the subtree $\sigma^{a(q)}_g$ fixed and varying all possible subtrees $\sigma^{d(i)}_h$ below i (inner sum), then iterating this for each choice of $\sigma^{a(q)}_g$ (outer sum). Exploiting the factorization (12) we obtain the following:

$$\sum_{c:(q,i) \in E(\sigma_c)} \lambda_c P_c = w_i^q \sum_{g=1}^{C_{a(q)}} \lambda^{a(q)}_g P^{a(q)}_g \sum_{h=1}^{C_{d(i)}} \lambda^{d(i)}_h P^{d(i)}_h \qquad (13)$$

where $C_{d(i)}$ and $C_{a(q)}$ denote the total number of different trees $\sigma^{d(i)}_h$ and $\sigma^{a(q)}_g$ in S.

Lemma 1. $\frac{\partial S(X)}{\partial S_q} = \sum_{g=1}^{C_{a(q)}} \lambda^{a(q)}_g P^{a(q)}_g$.

Proof. First let us separate the sum in eq. 1 into two sums, one over the subnetworks including q and one over the subnetworks not including q: $S(X) = \sum_{k:q \in \sigma_k} \lambda_k P_k + \sum_{l:q \notin \sigma_l} \lambda_l P_l$. The second sum does not involve node q, so for $\partial S(X)/\partial S_q$ it is a constant $\hat k$. Then, $S = \sum_{k:q \in \sigma_k} \lambda_k P_k + \hat k$. As in eq. 13, we divide the sum $\sum_{k:q \in \sigma_k}(\cdot)$ into two nested sums acting over disjoint terms: $S = \sum_{g=1}^{C_{a(q)}} \lambda^{a(q)}_g P^{a(q)}_g \left(\sum_{h=1}^{C_{d(q)}} \lambda^{d(q)}_h P^{d(q)}_h\right) + \hat k$. We now notice that $\sum_{k=1}^{C_{d(q)}} \lambda^{d(q)}_k P^{d(q)}_k = S_q$ by Proposition 1, since the terms $\lambda^{d(q)}_k P^{d(q)}_k$ refer to the subtrees of the $\sigma_c$ rooted in q and the sum is taken over all such subtrees. Therefore: $S = \sum_{g=1}^{C_{a(q)}} \lambda^{a(q)}_g P^{a(q)}_g S_q + \hat k$. Taking the partial derivative leads to the result.

A.1 Proof of Proposition 2

We start by writing the sum on the left-hand side of eq. 2 as in eq. 13. Now, first we notice that $\sum_{k=1}^{C_{d(i)}} \lambda^{d(i)}_k P^{d(i)}_k$ equals $S_i(X)$ by Proposition 1, since the terms $\lambda^{d(i)}_k P^{d(i)}_k$ refer to the subtrees of the $\sigma_c$ rooted in i and the sum is taken over all such subtrees. Second, $\sum_{g=1}^{C_{a(q)}} \lambda^{a(q)}_g P^{a(q)}_g = \frac{\partial S(X)}{\partial S_q}$ by Lemma 1. Substituting in 13 we get the result.

A.2 EM step on W

Starting from eq. 3 and collecting the terms not depending on W in a constant, we obtain:

$$Q(W) = \sum_{n=1}^{N} \sum_{c=1}^{C} \frac{\lambda_c P_c(x_n)}{S(x_n)} \ln \lambda_c(W) + \text{const}$$

We now drop the constant and expand $\ln \lambda_c(W) = \sum_{(q,i) \in E(\sigma_c)} \ln w_i^q$ by introducing $\delta_{(q,i),c}$ such that $\delta_{(q,i),c} = 1$ if $(q, i) \in E(\sigma_c)$ and 0 otherwise, summing over all the edges E(S):

$$Q(W) = \sum_{n=1}^{N} \sum_{c=1}^{C} \frac{\lambda_c P_c(x_n)}{S(x_n)} \sum_{(q,i) \in E(S)} \ln w_i^q\, \delta_{(q,i),c} = \sum_{(q,i) \in E(S)} \sum_{n=1}^{N} \sum_{c:(q,i) \in E(\sigma_c)} \frac{\lambda_c P_c(x_n)}{S(x_n)} \ln w_i^q$$

Applying Proposition 2 we get:

$$Q(W) = \sum_{(q,i) \in E(S)} \sum_{n=1}^{N} \frac{w_{i,old}^q \frac{\partial S(x_n)}{\partial S_q} S_i(x_n)}{S(x_n)} \ln w_i^q,$$

and defining $\beta_i^q = w_{i,old}^q \sum_{n=1}^{N} S(x_n)^{-1} \frac{\partial S(x_n)}{\partial S_q} S_i(x_n)$ we write $Q(W) = \sum_{q \in N(S)} \sum_{i \in ch(q)} \beta_i^q \ln w_i^q$.

A.3 EM step on θ

Starting from eq. 3, as in A.2 we expand $\ln P_c$ as a sum of logarithms and obtain: $Q(\theta) = \sum_{n=1}^{N} \sum_{c=1}^{C} \frac{\lambda_c P_c(x_n)}{S(x_n)} \sum_{l \in L(\sigma_c)} \ln \varphi_l(x_n|\theta_l) + \text{const}$. Introducing $\delta_{l,c}$, which equals 1 if $l \in L(\sigma_c)$ and 0 otherwise, dropping the constant and performing the sum $\sum_{l \in L(S)}$ over all the leaves in S, we get:

$$Q(\theta) = \sum_{n=1}^{N} \sum_{c=1}^{C} \frac{\lambda_c P_c(x_n)}{S(x_n)} \sum_{l \in L(S)} \ln \varphi_l(x_n|\theta_l)\, \delta_{l,c} = \sum_{l \in L(S)} \sum_{n=1}^{N} \sum_{c:l \in L(\sigma_c)} \frac{\lambda_c P_c(x_n)}{S(x_n)} \ln \varphi_l(x_n|\theta_l) = \sum_{l \in L(S)} \sum_{n=1}^{N} \alpha_{ln} \ln \varphi_l(x_n|\theta_l)$$

where $\alpha_{ln} = S(x_n)^{-1} \sum_{c:l \in L(\sigma_c)} \lambda_c P_c(x_n)$. To compute $\alpha_{ln}$ we notice that the term $P_c$ in this sum always contains a factor $\varphi_l$ (def. 3), and $\varphi_l = S_l$ by def. 1. Then, writing $P_{c \setminus l} = \prod_{k \in L(\sigma_c) \setminus l} \varphi_k$ we obtain: $\alpha_{ln} = S(x_n)^{-1} S_l(x_n) \sum_{c:l \in L(\sigma_c)} \lambda_c P_{c \setminus l}(x_n)$. Finally, since $S = \sum_{c:l \in L(\sigma_c)} \lambda_c P_c + \sum_{k:l \notin L(\sigma_k)} \lambda_k P_k = S_l \sum_{c:l \in L(\sigma_c)} \lambda_c P_{c \setminus l} + \hat k$ (where $\hat k$ does not depend on $S_l$), taking the derivative we get $\frac{\partial S}{\partial S_l} = \sum_{c:l \in L(\sigma_c)} \lambda_c P_{c \setminus l}$. Substituting, we get: $\alpha_{ln} = S(x_n)^{-1} \frac{\partial S(x_n)}{\partial S_l} S_l(x_n)$.

References [Adel et al., 2015] Adel, T., Balduzzi, D., and Ghodsi, A. (2015). Learning the structure of sum-product networks via an svd-based algorithm. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI 2015, July 12-16, 2015, Amsterdam, The Netherlands, pages 32–41. [Amer and Todorovic., 2015] Amer, M. and Todorovic., S. (2015). Sum product networks for activity recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI 2015). [Cheng et al., 2014] Cheng, W.-C., Kok, S., Pham, H. V., Chieu, H. L., and Chai, K. M. (2014). Language Modeling with Sum-Product Networks. Annual Conference of the International Speech Communication Association 15 (INTERSPEECH 2014). [Chow and Liu, 1968] Chow, C. I. and Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462–467. [Dennis and Ventura, 2015] Dennis, A. and Ventura, D. (2015). Greedy structure search for sumproduct networks. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pages 932–938. AAAI Press. [Gens and Domingos, 2012] Gens, R. and Domingos, P. (2012). Discriminative learning of sum-product networks. In NIPS, pages 3248–3256. [Gens and Domingos, 2013] Gens, R. and Domingos, P. (2013). Learning the structure of sum-product networks. In ICML (3), pages 873–880. [Lowd and Domingos, 2012] Lowd, D. and Domingos, P. (2012). Learning arithmetic circuits. CoRR, abs/1206.3271. [Meila and Jordan, 2000] Meila, M. and Jordan, M. I. (2000). Learning with mixtures of trees. Journal of Machine Learning Research, 1:1–48. [Murphy, 2012] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.

[Neal and Hinton, 1998] Neal, R. and Hinton, G. E. (1998). A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers. [Peharz, 2015] Peharz, R. (2015). Foundations of sumproduct networks for probabilistic modeling. (phd thesis). Researchgate:273000973. [Peharz et al., 2016] Peharz, R., Gens, R., Pernkopf, F., and Domingos, P. M. (2016). On the latent variable interpretation in sum-product networks. CoRR, abs/1601.06180. [Poon and Domingos, 2011] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In UAI 2011, Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, July 14-17, 2011, pages 337–346. [Rahman and Gogate, 2016a] Rahman, T. and Gogate, V. (2016a). Learning ensembles of cutset networks. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA., pages 3301–3307. [Rahman and Gogate, 2016b] Rahman, T. and Gogate, V. (2016b). Merging strategies for sumproduct networks: From trees to graphs. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI 2016, June 25-29, 2016, New York City, NY, USA. [Rooshenas and Lowd, 2014] Rooshenas, A. and Lowd, D. (2014). Learning sum-product networks with direct and indirect variable interactions. In Jebara, T. and Xing, E. P., editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 710–718. JMLR Workshop and Conference Proceedings. [Zhao et al., 2016] Zhao, H., Poupart, P., and Gordon, G. (2016). A unified approach for learning the parameters of sum-product networks. Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS 2016).
