Globally Consistent Algorithms for Mixture of Experts

Ashok Vardhan Makkuva∗, Sreeram Kannan†, Pramod Viswanath‡

arXiv:1802.07417v1 [cs.LG] 21 Feb 2018

February 22, 2018

Abstract. Mixture-of-Experts (MoE) is a widely popular neural network architecture and is a basic building block of highly successful modern neural networks, for example, Gated Recurrent Units (GRU) and Attention networks. However, despite the empirical success, finding an efficient and provably consistent algorithm to learn the parameters has remained a long-standing open problem for more than two decades. In this paper, we introduce the first algorithm that learns the true parameters of a MoE model for a wide class of non-linearities with global consistency guarantees. Our algorithm relies on a novel combination of the EM algorithm and the tensor method-of-moments techniques. We empirically validate our algorithm on both synthetic and real data sets in a variety of settings, and show superior performance to standard baselines.



∗ Department of Electrical and Computer Engineering, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, email: [email protected]
† Department of Electrical Engineering, University of Washington, Seattle, email: [email protected]
‡ Department of Electrical and Computer Engineering, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, email: [email protected]


Contents

1 Introduction
  1.1 Main results
2 Related work
3 Mixture of two experts (2-MoE)
  3.1 Model
  3.2 Algorithms
  3.3 Theoretical results
    3.3.1 Method of moments for regression parameters
    3.3.2 EM algorithm for gating parameters
    3.3.3 Gradient EM algorithm
4 General and hierarchical mixture of experts
  4.1 General mixture of experts (k-MoE)
    4.1.1 Algorithm
    4.1.2 Theoretical results
  4.2 Hierarchical mixtures of experts (HME)
    4.2.1 Algorithm
    4.2.2 Theoretical results
5 Experiments
  5.1 Synthetic data
  5.2 Real data
6 Discussion
7 Acknowledgements
A Review: Mixture models and method of moments
B Review: EM algorithm and EM convergence analysis tools
  B.1 EM algorithm for MoE
  B.2 EM convergence analysis
C Proofs of Section 3
  C.1 Linear mixture of experts
    C.1.1 Proof of Theorem 1 for linear mixtures
    C.1.2 Recovery of the gating parameter using the method of moments
  C.2 Class of non-linearities
  C.3 Proof of Theorem 1
  C.4 Proof of Theorem 2
  C.5 Proof of Theorem 3
  C.6 Proof of Theorem 4
D Proofs of Section 4
  D.1 Proof of Theorem 5
  D.2 Proof of Theorem 6
  D.3 Proof of Theorem 7
  D.4 Proof of Theorem 8

1 Introduction

Mixture-of-Experts (MoE) is a widely popular and powerful neural network architecture based on the divide-and-conquer paradigm [JJNH91]. For each input, the output of this architecture is a weighted linear combination of the outputs of individual experts, where the weights of the experts are determined by a "gating network" that maps each input to a probability distribution. The experts themselves can be arbitrary neural network classifiers or regressors; similarly, the gating network can also be a complicated feedforward network with a softmax output layer. Since their introduction more than two decades ago, MoE models have generated a lot of research activity and have found huge success in applications across domains such as computer vision, NLP, forecasting, and finance; see [YWG12, ME14] for detailed surveys. The flexibility to allow for different expert architectures such as SVMs [CBB02], Gaussian processes [Tre01, TB15, ND14], and deep networks, combined with their modular structure, is one of the primary reasons MoE models have proved useful in many real-world applications.

The basic MoE model is the following: let x ∈ R^d be an input and y ∈ R be the corresponding output. Then the discriminative model P_{y|x} for the mixture of two experts is

P_{y|x} = f(w^\top x) \cdot \mathcal{N}(y \mid g(a_1^\top x), \sigma^2) + (1 - f(w^\top x)) \cdot \mathcal{N}(y \mid g(a_2^\top x), \sigma^2),

where f(z) = 1/(1 + e^{-z}) is the logistic sigmoid function and N(y|µ, σ²) denotes the Gaussian density of a random variable y centered at µ with variance σ²; here, µ is either g(a_1^⊤ x) or g(a_2^⊤ x) for a non-linearity g : R → R. The expert parameters a_1, a_2 ∈ R^d are also referred to as the regression parameters or regressors, whereas the gating parameter w ∈ R^d is also called the classifier. This model is visualized in Figure 1.

Figure 1: Architecture for mixture of two experts (the gating network f(w^⊤ x) selects between Expert 1 with mean g(a_1^⊤ x) and Expert 2 with mean g(a_2^⊤ x), each with additive Gaussian noise).

There are several interesting generalizations of the 2-MoE described above. An immediate extension is to consider an arbitrary number of mixtures, i.e. k-MoE, where the gating probabilities are given by a softmax function. Another interesting generalization is to consider hierarchical mixtures (HME), in which experts are at the leaves of a rooted tree and a hierarchy of gating

functions present at the internal nodes is used to finally produce the output at the root. All these models are covered by the results of this paper. In this paper, we study the canonical setting of experts being non-linear regressors and the gating network being a softmax function. Ever since their inception, MoE models have been a subject of great research interest and are an integral part of many modern neural network architectures. For example, 1) a sparsely-gated Mixture-of-Experts layer (MoE) is the core component of the neural machine translation architecture in [SMM+17]; 2) the Gated Recurrent Unit (GRU) [CvMBB14, CGCB14], a popular and widely used recurrent neural network architecture, can be thought of as a time-series extension of a MoE model with a gating mechanism for each output component (corresponding to the update gate in GRU), whose inputs are the current input x_t and the previous state h_t and whose output is the next state h_{t+1}; and 3) attention mechanisms, which are fast emerging as a more promising and interpretable alternative to traditional recurrent neural network (RNN) encoder-decoders for machine translation [BCB14, LPM15], contain MoE at their core. With MoE being at the heart of many successful modern learning architectures, it is important to understand whether it is possible to have global convergence guarantees for such models. In this paper, we assume that the data is generated according to a mixture-of-experts model and study whether the parameters can be recovered from the observed data. In particular, finding an efficient and consistent algorithm (with arbitrary initialization) for recovering the parameters has remained an open challenge over the past decades, and addressing it would represent significant progress.

1.1 Main results

First theoretical guarantees: We provide the first algorithm that recovers the true parameters of a MoE model with global initializations when the experts are non-linear regressors and the gating network is a softmax function. We allow for a wide class of non-linearities, which includes the popular choices of identity, sigmoid, and ReLU. The global consistency proof of our algorithm for parameter inference is in the population setting. Understanding the finite-sample complexity is a challenging and interesting task, but beyond the scope of this paper (note that this task has only recently been addressed even for the far simpler problem of estimating the means of a uniform Gaussian mixture model with identity covariance matrices [XHM16, DTZ17]). Nonetheless, our algorithm works very well in practice on a variety of synthetic and real data sets.

Algorithmic innovations: Our algorithm is a novel combination of the method of moments and the EM algorithm. In particular, we introduce a two-stage procedure to learn the two different sets of parameters: the expert and the gating parameters. In Step 1, we use the method of moments in an appropriate spectral format (i.e. third-order tensors) to estimate the regression parameters. In Step 2, we use these recovered regressors to estimate the gating parameters through EM while freezing the regressors. That a simple algorithm like EM can consistently recover the gating parameters when provided with good spectral initializations for the expert parameters is a topic of independent mathematical interest.

Generalizability and guarantees: We also provide the first provably consistent algorithm for hierarchical mixtures of experts (HME), a generalization of MoE, which is a tree-structured architecture for supervised learning introduced by [JJ94] that contains the experts at its leaves. The number of gating and expert parameters blows up significantly with the depth, and thus learning all these parameters is a huge challenge. Our algorithms for k-MoE generalize gracefully to HME with provable guarantees, and interestingly, the parameter-decoupling aspect of our algorithm is what facilitates this generalization.


2 Related work

The paradigm of the method of moments dates back to Pearson's work [Pea94] on mixtures of Gaussians. Recently, there has been a growing interest in the estimation of parameters in several latent variable models such as Gaussian mixture models, hidden Markov models, topic models, mixtures of generalized linear models, etc., using the method of moments [AGH+14, HKZ12, AGHK13, AHK12, CL13, JO13]. The key idea in these works is to form certain higher-order empirical moments (usually a third-order or fourth-order tensor) and then capitalize on the special structure of the moments to recover the individual parameters using tensor decomposition algorithms. There is a huge body of work on the analysis of tensor decomposition algorithms, for example, [AGJ14, GM15, MSS16], and we refer the interested reader to the references in [SV17] for a detailed exposition of these algorithms. It is well known empirically that local search methods such as EM work much better when provided with spectral initializations than with random initializations [CL13, Lia13]. Some recent works theoretically analyze this two-stage procedure for parameter estimation and provide rates of convergence for their algorithms [ZCZJ14, YCS16]. An important observation is that if we set all the gating parameters to zero, the MoE model reduces to a mixture of generalized linear models (GLMs), and thus MoE can be viewed as a strict generalization of these mixture models. For mixtures of GLMs, the authors of [SJA14] construct specific third-order moments and then apply tensor decomposition methods to estimate the individual regressor vectors. [SIM14] proposes a "mirroring trick" on the output to recover the regression vectors up to a subspace using second-order moments for mixtures of linear classifiers. In [YCS16], the authors analyze a two-step estimation procedure of method of moments followed by AltMin (EM in the noiseless case) for recovering the parameters in mixtures of noiseless linear regressions, and provide a better sample complexity compared to [SJA14]. Our work is similar in spirit to some of these works in the sense that we use a two-step parameter learning procedure as well. However, some key differences abound. Though mixtures of GLMs can be viewed as a special case of MoE, it is technically non-trivial to extend the method of moments to MoE; this is because the presence of the gating parameters complicates the usual third-order moments. In this paper, we show that a clever transformation of the output y using cubic polynomials (which are Hermite polynomials in the linear noiseless case) makes the MoE amenable to tensor decomposition algorithms for learning the regression parameters, under some technical assumptions. It is important to note that our goal is not to analyze a specific tensor decomposition algorithm but to construct a symmetric tensor for MoE upon which we can use any tensor decomposition algorithm to recover the individual regressors. Utilizing the recovered regressors, we then run EM only for learning the gating parameters while keeping the regressors fixed. This is in contrast to the above procedures, where both the spectral methods and the EM algorithm are used for the same parameters. The authors of [BWY17] develop a general framework for the local convergence analysis of EM and in particular analyze the recovery of the regressor for two symmetric mixtures of linear regressions under some high-SNR conditions.
In contrast, we provide a global convergence analysis for the gating parameters when the regressors are given spectral initializations. [JX95] analyzed the convergence of EM for both the gating and the expert parameters, but the analysis is local.

Organization. The rest of the paper is organized as follows. In Section 3, we focus on the mixture of two experts model. In Section 3.1, we introduce the model for 2-MoE. We present our algorithms to learn the expert and gating parameters in Section 3.2 and the guarantees for the same in Section 3.3. We present our algorithms and the convergence results for the general MoE in Section 4.1 and the hierarchical MoE in Section 4.2. Section 5 is devoted to numerical experiments. We provide the necessary background material for the method of moments and the EM algorithm

in Appendix A and Appendix B, respectively.

Notation. In this paper, we follow the standard notation of representing matrices by boldface capital letters A, B, etc., Euclidean vectors by boldface lowercase letters a, b, etc., and scalars by plain lowercase letters y, z, etc. We use N(y|µ, σ²) to denote the density of a Gaussian random variable y with mean µ and variance σ², or a Gaussian distribution with the same parameters, depending on the context. We use ⊗ to denote the tensor outer product of vectors in R^d. For example, for a third-order rank-1 tensor T ∈ R^{d×d×d} such that T = a ⊗ b ⊗ c, we have T_{ijk} = a_i b_j c_k for i, j, k ∈ [d]. In this paper, we only deal with symmetric tensors, i.e. T_{ijk} = T_{π(i)π(j)π(k)} for all permutations π. We denote the Euclidean norm of a vector x by ‖x‖. We use ∇_x^{(m)} to denote the m-th order derivative with respect to x. The standard basis vectors in R^d are denoted by {e_1, ..., e_d}.
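To make the tensor notation concrete, the following minimal NumPy sketch (ours, not part of the original text) builds the outer product T = a ⊗ b ⊗ c and checks the permutation-invariance property of a symmetric rank-1 tensor; all variable names are illustrative.

import numpy as np

d = 4
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, d))

# Third-order outer product: T[i, j, k] = a[i] * b[j] * c[k].
T = np.einsum('i,j,k->ijk', a, b, c)
assert np.isclose(T[1, 2, 3], a[1] * b[2] * c[3])

# A symmetric rank-1 tensor a ⊗ a ⊗ a is invariant under index permutations.
S = np.einsum('i,j,k->ijk', a, a, a)
assert np.allclose(S, S.transpose(1, 0, 2)) and np.allclose(S, S.transpose(2, 1, 0))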

3 Mixture of two experts (2-MoE)

In this section, for simplicity, we restrict ourselves to the mixture of two experts case with scalar outputs.

3.1 Model

The discriminative model for 2-MoE is

P_{y|x} = \frac{e^{w_1^\top x}}{e^{w_1^\top x} + e^{w_2^\top x}} \cdot \mathcal{N}(y \mid g(a_1^\top x), \sigma^2) + \frac{e^{w_2^\top x}}{e^{w_1^\top x} + e^{w_2^\top x}} \cdot \mathcal{N}(y \mid g(a_2^\top x), \sigma^2),    (1)

where the function g(·) is a known non-linearity, a_1, a_2 ∈ R^d denote the regression parameters corresponding to each expert, and similarly w_1, w_2 ∈ R^d denote the gating parameters. Notice that, without loss of generality, we can assume that w_2 = 0. Thus the model in (1) can be further simplified to

P_{y|x} = f(w^\top x) \cdot \mathcal{N}(y \mid g(a_1^\top x), \sigma^2) + (1 - f(w^\top x)) \cdot \mathcal{N}(y \mid g(a_2^\top x), \sigma^2),    (2)

where f(z) = 1/(1 + e^{-z}) is the logistic sigmoid function. Figure 1 presents the architecture for 2-MoE.

Interpretation: The interpretation behind (2) is that for each input x, the gating network chooses an expert based on the outcome of a Bernoulli random variable z whose probability depends on x in a parametric way, i.e. z|x ∼ Bern(f(w^⊤ x)). The chosen expert then generates the output y from a Gaussian distribution centered at a non-linear function of x, i.e. g(a_z^⊤ x). We consider the supervised learning setup where we are given n i.i.d. samples {(x_1, y_1), ..., (x_n, y_n)} generated from the above model. The goal is to recover both the regressors a_1, a_2 and the gating parameter w. Notice that the choice of expert z_i ∈ {1, 2}, i ∈ [n], is hidden from us, and the key challenge is to estimate both the expert parameters and the gating parameters simultaneously. Our main contribution is to decouple the joint-estimation nature of the problem: we recover the regressors first and then utilize the recovered regressors to estimate the gating parameters, thus respecting the modularity of the architecture. We make the standard assumptions that x follows a Gaussian distribution, i.e. x ∼ N(0, I_d), that ‖a_1‖ = ‖a_2‖ = 1, and that the noise variance σ² is known.
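For concreteness, here is a minimal sketch (ours, not from the paper) of sampling n pairs (x_i, y_i) from the generative process described above; the function name, the default linear non-linearity, and the example parameter values are all placeholders.

import numpy as np

def sample_2moe(n, d, a1, a2, w, sigma, g=lambda t: t, seed=0):
    """Draw (x, y) from the 2-MoE model (2): x ~ N(0, I_d),
    z | x ~ Bern(f(w^T x)), y | x, z ~ N(g(a_z^T x), sigma^2)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    f = 1.0 / (1.0 + np.exp(-x @ w))           # gating probability f(w^T x)
    z = rng.random(n) < f                      # True -> expert 1, False -> expert 2
    mean = np.where(z, g(x @ a1), g(x @ a2))   # chosen expert's mean g(a_z^T x)
    y = mean + sigma * rng.normal(size=n)
    return x, y, z

# Example usage with unit-norm regressors and a gating vector orthogonal to them.
d = 5
a1, a2, w = np.eye(d)[0], np.eye(d)[1], np.eye(d)[2]
x, y, z = sample_2moe(n=1000, d=d, a1=a1, a2=a2, w=w, sigma=0.5)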

3.2 Algorithms

In this section, we present our algorithm to recover the regression and gating parameters. Our algorithm consists of two stages.

Step 1: Estimation of regression parameters. We rely on the method of moments to estimate the regression parameters. The idea is to consider specific higher-order moments between the input x and the output y that help us identify the parameters uniquely. In our case, we employ certain score transformations on the input x and the output y to construct a cross-moment between them, which yields the required regressors a_1 and a_2 upon tensor decomposition. The notion of score transformations, first introduced in [JSA14], is the basis of parameter inference for mixtures of GLMs in [SJA14]. The score transformations can be thought of as high-dimensional features of the input which capture the "local manifold structure of the pdf" [JSA14]. The higher the order of the transformation, the richer the information we can extract about our discriminative model through the method of moments; refer to Appendix A for more details. Whereas the first-order transformation is a vector, the second and third-order transformations are matrices and tensors, respectively. When x ∼ N(0, I_d), the second-order and third-order score transformations on x are:

S_2(x) \triangleq x \otimes x - I, \qquad S_3(x) \triangleq x \otimes x \otimes x - \sum_{i \in [d]} x \otimes e_i \otimes e_i - \sum_{i \in [d]} e_i \otimes x \otimes e_i - \sum_{i \in [d]} e_i \otimes e_i \otimes x.    (3)

Similar to the transformations on the input x, we can also employ transformations on the output y to extract relevant information about the model. In fact, this is a key technical difference from the mixtures of GLMs, where the output y is used as is. The third-order and second-order score transformations that we employ on the output are monic cubic and quadratic polynomials in y respectively, i.e.

S_2(y) \triangleq y^2 + \gamma y, \qquad S_3(y) \triangleq y^3 + \alpha y^2 + \beta y,

where the coefficients α, β, and γ depend on the choice of the link function g and the noise variance σ², and can be computed beforehand by solving a simple system of two linear equations, which we describe in detail in Appendix C.2 of the supplement. When g is the identity function, which amounts to mixtures of linear regressions, the score transformation S_3(y) equals y³ − 3y(1 + σ²) and S_2(y) = y². We assume for now that we know these coefficients. Then Algorithm 1 recovers the regression parameters a_1 and a_2 using the robust tensor power method [AGH+14] on T̂_3 together with T̂_2.

Algorithm 1 Learning the regressor parameters
1: Input: Samples (x_i, y_i), i ∈ [n]
2: Compute T̂_3 = (1/n) Σ_{i=1}^n S_3(y_i) · S_3(x_i) and T̂_2 = (1/n) Σ_{i=1}^n S_2(y_i) · S_2(x_i)
3: â_1, â_2 = Rank-2 tensor decomposition on T̂_3 using T̂_2
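A minimal sketch (ours) of the moment construction in Algorithm 1 for the linear case g(t) = t, where S_3(y) = y³ − 3y(1 + σ²) and S_2(y) = y² as stated above; the tensor decomposition step itself (robust tensor power method / Orth-ALS) is not reproduced here, and the function names are ours.

import numpy as np

def score_S2(x):
    """Second-order score S2(x) = x x^T - I for x ~ N(0, I_d); x has shape (n, d)."""
    d = x.shape[-1]
    return np.einsum('ni,nj->nij', x, x) - np.eye(d)

def score_S3(x):
    """Third-order score S3(x) per (3): x^{⊗3} minus the three e_i-contraction sums."""
    n, d = x.shape
    I = np.eye(d)
    T = np.einsum('ni,nj,nk->nijk', x, x, x)
    T -= np.einsum('ni,jk->nijk', x, I)   # sum_i x ⊗ e_i ⊗ e_i
    T -= np.einsum('nj,ik->nijk', x, I)   # sum_i e_i ⊗ x ⊗ e_i
    T -= np.einsum('nk,ij->nijk', x, I)   # sum_i e_i ⊗ e_i ⊗ x
    return T

def empirical_moments(x, y, sigma):
    """T2_hat = mean_i S2(y_i) S2(x_i) and T3_hat = mean_i S3(y_i) S3(x_i), linear experts."""
    S2y = y ** 2
    S3y = y ** 3 - 3.0 * y * (1.0 + sigma ** 2)
    T2 = np.einsum('n,nij->ij', S2y, score_S2(x)) / len(y)
    T3 = np.einsum('n,nijk->ijk', S3y, score_S3(x)) / len(y)
    return T2, T3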

Step 2: Estimation of gating parameters using the EM algorithm. For the recovery of the gating parameter w, we use the standard Expectation-Maximization (EM) algorithm. The key difference from earlier approaches is that we use EM only for the estimation of the gating parameters, since the regression parameters are already estimated by Algorithm 1. At any iteration t, the EM algorithm alternates between the E-step and the M-step. In the E-step, we first compute the posterior distribution of the hidden expert label z using the current estimate w_t, and then the Q function, defined as Q(w|w_t) = E_{P_{z|x,y,w_t}}[log p_w(x, z, y)], the complete-data log-likelihood. The M-step consists of maximizing Q(·|w_t) to obtain the next iterate w_{t+1}. The procedure is described in Algorithm 2.

Algorithm 2 Learning the gating parameter
1: Input: Samples (x_i, y_i), i ∈ [n] and regressors â_1, â_2 from Algorithm 1
2: t ← 0
3: Initialize w_0 uniformly at random on the d-dimensional unit sphere
4: while (estimation error < threshold) do
5:   Compute p_1(x_i, y_i, w_t) according to (5)
6:   Compute Q(w|w_t) according to (4) using the empirical expectation
7:   Set w_{t+1} = argmax_{‖w‖≤1} Q(w|w_t)
8:   t ← t + 1

In general, we can also run joint EM for both the regressors and the gating parameters by slightly modifying Algorithm 2. In our experiments with synthetic data, we found that the parameter estimation error is almost the same whether or not we update the regressors, whenever they are provided with spectral initializations. For real data, it might be helpful to update them together.
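A minimal sketch (ours) of the quantities used in Algorithm 2 with known regressors: the posterior p_1(x_i, y_i, w_t) of (5) and the empirical Q(w|w_t) of (4); the inner maximization over ‖w‖ ≤ 1 can then be done with any constrained optimizer.

import numpy as np

def gaussian_pdf(y, mean, sigma):
    return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def posterior_p1(x, y, w_t, a1, a2, sigma, g=lambda t: t):
    """p1(x_i, y_i, w_t) from (5): posterior probability that expert 1 generated y_i."""
    f = 1.0 / (1.0 + np.exp(-x @ w_t))
    n1 = gaussian_pdf(y, g(x @ a1), sigma)
    n2 = gaussian_pdf(y, g(x @ a2), sigma)
    return f * n1 / (f * n1 + (1.0 - f) * n2)

def Q_value(w, x, y, p1):
    """Empirical Q(w | w_t) from (4), with p1 computed at the current iterate w_t."""
    s = x @ w
    return np.mean(p1 * s - np.logaddexp(0.0, s))   # logaddexp(0, s) = log(1 + e^s)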

3.3 Theoretical results

In this section, we present our results for the mixture of two experts case at the population level to highlight the central ideas. A similar set of results also holds for the general MoE and HME, as discussed in the next section.

3.3.1 Method of moments for regression parameters

Recall the mixture of two experts model between the input x and the output y:

P_{y|x} = f(w^\top x) \cdot \mathcal{N}(y \mid g(a_1^\top x), \sigma^2) + (1 - f(w^\top x)) \cdot \mathcal{N}(y \mid g(a_2^\top x), \sigma^2).

Our next result establishes that for a class of non-linearities g, including the popular choices of identity, sigmoid, and ReLU, we can construct a super-symmetric tensor containing only a_1 ⊗ a_1 ⊗ a_1 and a_2 ⊗ a_2 ⊗ a_2 as its rank-1 components, which yields the regression parameters upon tensor decomposition. The precise characterization of this class can be found in Appendix C.2 of the supplement. We refer to this class as (α, β, γ)-valid.

Theorem 1 (Recovery of regression parameters). Suppose that the regressors a_1 and a_2 are linearly independent, that the classifier satisfies w ⊥ span{a_1, a_2}, and that d ≥ 3. Then for any non-linearity g : R → R which is (α, β, γ)-valid, we have that

T_2 \triangleq \mathbb{E}_{P_x P_{y|x}}[S_2(y) \cdot S_2(x)] = \alpha'_1 \cdot a_1 \otimes a_1 + \alpha'_2 \cdot a_2 \otimes a_2,
T_3 \triangleq \mathbb{E}_{P_x P_{y|x}}[S_3(y) \cdot S_3(x)] = \alpha_1 \cdot a_1 \otimes a_1 \otimes a_1 + \alpha_2 \cdot a_2 \otimes a_2 \otimes a_2,

where α_1 = α_2 and α'_1 = α'_2 are two non-zero constants depending on g and σ. Hence, recovery guarantees for a_1 and a_2 follow from Theorem 4.3 and Theorem 5 of [AGH+14].

Remark 1. When g is the identity function, which corresponds to mixtures of linear regressions, the output transformation assumes the simple form S_3(y) = y³ − 3y(1 + σ²). In the noiseless case, i.e. σ = 0, this transformation S_3(y) = y³ − 3y equals the third Hermite polynomial. Interestingly,

substituting y ∈ R into (3) also yields this third Hermite polynomial. In fact, the score transformations for scalar Gaussian random variables are Hermite polynomials. Also, the addition of a constant term in the definitions of S_3(y) and S_2(y) does not change the moments T_3 and T_2, because E[S_3(x)] = E[S_2(x)] = 0.

Remark 2. Recall that the gating parameter w is used to generate the hidden choice of the expert z ∈ {1, 2}, whereas the regressors a_1 and a_2 are used to generate the output y from the corresponding expert according to the distribution N(g(a_z^⊤ x), σ²). Thus the interpretation behind the assumption w ⊥ span{a_1, a_2} is that if we think of x as a high-dimensional feature vector, we would ideally want to use distinct sub-features of x to perform these two different tasks of classification and regression.

Remark 3. For the case of linear regression, we can recover the regressors even without the orthogonality assumption. Furthermore, we can also recover the gating parameter w using the method of moments instead of EM, as described in Appendix C.1 of the supplement.

3.3.2 EM algorithm for gating parameters

In the previous section we presented the convergence guarantees for our algorithm to recover the regression parameters. We now show that if we use EM only to recover the gating parameter, utilizing the recovered regressors, it converges to the true gating parameter at a geometric rate under some mild assumptions. For simplicity of presentation, we first assume that we know the regressors a_1 and a_2 exactly and treat the classifier w as the only parameter. Later we show that the population EM is robust to small perturbations in the regressors. To avoid notation overload, from now on we refer to the true parameter as w*. Let Ω, the domain of our gating parameter w, be the unit Euclidean norm ball, i.e. Ω ≜ {w : ‖w‖ ≤ 1}. Then the population EM for the mixture of experts consists of the following two steps:

• E-step: Use the current estimate w_t to compute the function Q(·|w_t).
• M-step: Set w_{t+1} = argmax_{‖w‖≤1} Q(w|w_t),

where the function Q(w|w_t) is defined as

Q(w|w_t) \triangleq \mathbb{E}\left[ p_1(x, y, w_t)\,(w^\top x) - \log(1 + e^{w^\top x}) \right],    (4)

p_1(x, y, w_t) = \frac{f N_1}{f N_1 + (1 - f) N_2},    (5)

where f = f(w_t^⊤ x), N_1 = N(y | g(a_1^⊤ x), σ²), and N_2 = N(y | g(a_2^⊤ x), σ²). Thus the EM can be viewed as a deterministic procedure which maps w_t ↦ M(w_t), where

M(w) = argmax_{w' ∈ Ω} Q(w'|w). It follows from the self-consistency property of the EM that the true parameter w* is a fixed point of the EM operator M, i.e. M(w*) = w* [BWY17]. The following result establishes that the map M is contractive and thus that the EM iterates converge geometrically.

Theorem 2. Suppose that the domain Ω of the gating parameter is the unit norm ball and that we know the expert parameters a_1 and a_2 exactly. There exists a constant σ_0 > 0 such that whenever

0 < σ < σ_0, for any initialization w_0 ∈ Ω, the population-level EM updates {w_t}_{t≥0} on the gating parameter converge geometrically to the true parameter w*, i.e.

\|w_t - w^*\| \le (\kappa_\sigma)^t \|w_0 - w^*\|,

where κ_σ is a dimension-independent constant depending on g and σ such that κ_σ → 0 as σ → 0.

Remark 4. Our theoretical analysis suggests that for mixtures of linear regressions, the choice of σ_0 for which κ_σ ≪ 1 is of the order 10^{-4} and that κ_σ behaves as O(σ^{1/4}). Our empirical studies suggest that the above geometric convergence rate is achieved even for some moderately large values of σ such as σ = 1, 2.

In practice, we can only know the regressors up to some small perturbations. Thus it is essential to understand the effect of these perturbations on the EM iterates of the gating parameter when we have access only to the approximate regressors â_1, â_2. Our next result establishes that the population EM algorithm is robust to these perturbations.

Theorem 3. Let M : Ω → Ω denote the EM operator for the gating parameter when the regressors a_1 and a_2 are known exactly, and let M̂ : Ω → Ω be the EM operator when the regressors are fixed at â_1 and â_2 respectively. If ε_1 > 0 is such that max(‖â_1 − a_1‖, ‖â_2 − a_2‖) = σ² ε_1 and w, ŵ ∈ Ω are such that ‖w − ŵ‖ ≤ ε_2, we have that

\|M(w) - \hat{M}(\hat{w})\| \le O(\max\{\varepsilon_1, \varepsilon_2\}).

Remark 5. Thus, Theorem 3 and Theorem 2 together imply that the t-th EM iterate ŵ_t satisfies ‖ŵ_t − w*‖ ≤ K t ε_1 + (κ_σ)^t ‖ŵ_0 − w*‖ for some universal constant K.

3.3.3 Gradient EM algorithm

In the M-step of the EM algorithm, the next iterate is chosen so that the function Q(·|w_t) is maximized. Instead, we can choose an iterate that merely increases the Q value rather than fully maximizing it, i.e. Q(w_{t+1}|w_t) ≥ Q(w_t|w_t). Such a procedure is termed generalized EM. Gradient EM is an example of generalized EM in which we take an ascent step in the direction of the gradient of Q(·|w_t) to produce the next iterate, i.e. w_{t+1} = w_t + α∇Q(w_t|w_t), where α > 0 is a suitably chosen step size and the gradient is with respect to the first argument. Whenever the domain of the parameters is restricted, we can slightly modify the above procedure to include a projection step so that the iterates always stay within the domain. Mathematically,

w_{t+1} = G(w_t), \qquad G(w) = \Pi_\Omega(w + \alpha \nabla Q(w|w)),

where Π_Ω refers to the projection operator. Our next result establishes that the iterates of the gradient EM algorithm also converge geometrically for an appropriately chosen step size α.

Theorem 4. Suppose that the domain Ω of the gating parameter is the unit norm ball and that we know the expert parameters a_1 and a_2 exactly. There exist constants α_0 > 0 and σ_0 > 0 such that for any step size 0 < α ≤ α_0 and noise variance σ < σ_0, the gradient EM updates {w_t}_{t≥0} on the gating parameter converge geometrically to the true parameter w*, i.e.

\|w_t - w^*\| \le (\rho_\sigma)^t \|w_0 - w^*\|,

where ρ_σ is a dimension-independent constant depending on g and σ.

Remark 6. The condition σ < σ0 ensures that the Lipschitz constant ρσ for the map G is strictly less than 1. The constant α0 depends only on two universal constants which are nothing but the strong-concavity and the smoothness parameters for the function Q(·|w ∗ ). The proofs for all these theorems are presented in Appendix C.
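To make the projected gradient EM update concrete, here is a minimal self-contained sketch (ours) of one step w_{t+1} = Π_Ω(w_t + α∇Q(w_t|w_t)) for the unit-ball domain Ω; the step size α and the default linear non-linearity are placeholders.

import numpy as np

def project_unit_ball(w):
    """Projection Π_Ω onto Ω = {w : ||w|| <= 1}."""
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

def gradient_em_step(w_t, x, y, a1, a2, sigma, alpha, g=lambda t: t):
    """One projected gradient EM step. Here ∇Q(w | w_t) = (1/n) Σ_i (p1_i - f(w^T x_i)) x_i,
    evaluated at w = w_t, with p1_i the posterior (5) at the current iterate."""
    f = 1.0 / (1.0 + np.exp(-x @ w_t))                    # gating probabilities f(w_t^T x_i)
    n1 = np.exp(-0.5 * ((y - g(x @ a1)) / sigma) ** 2)    # unnormalized Gaussian likelihoods
    n2 = np.exp(-0.5 * ((y - g(x @ a2)) / sigma) ** 2)    # (the common normalizer cancels in p1)
    p1 = f * n1 / (f * n1 + (1.0 - f) * n2)
    grad = x.T @ (p1 - f) / len(y)
    return project_unit_ball(w_t + alpha * grad)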

4 General and hierarchical mixture of experts

In this section we consider extensions of the mixture of two experts model with scalar output to various settings. An immediate extension of the two mixtures is to consider an arbitrary number of mixtures (k-MoE), which we study in Section 4.1. In Section 4.2 we consider hierarchical mixtures (HME), in which experts are at the leaves of a rooted tree and a hierarchy of gating functions is used to produce the output. Throughout, our strategy for parameter inference is similar to the two-mixture setting: we construct appropriate tensors for the given model to recover the regressors and utilize these recovered regressors to infer the gating parameters through the EM algorithm. For simplicity of presentation, we assume that we know the regression parameters exactly when we infer the gating parameters; a simple tweak of this procedure allows for approximate regressors, and the perturbation analysis generalizes easily as in the two-mixture case.

4.1 General mixture of experts (k-MoE)

Now we introduce the general mixture of experts model and generalize our earlier findings on the mixture of two experts to arbitrary mixtures. We assume that the number of mixture components k is smaller than the dimension d, i.e. k < d. Let x ∈ R^d be the input vector and y ∈ R be the corresponding output. Then the discriminative model P_{y|x} for k-MoE is given by

P_{y|x} = \sum_{i \in [k]} p_i(x)\, P_{y|x,i} = \sum_{i \in [k]} \frac{e^{w_i^\top x + b_i}}{\sum_{j \in [k]} e^{w_j^\top x + b_j}} \cdot \mathcal{N}(y \mid g(a_i^\top x), \sigma^2), \qquad x \sim \mathcal{N}(0, I_d).    (6)

The hidden variable z that denotes the choice of the expert takes values in {1, ..., k}, and the corresponding probabilities

P[z = i \mid x] = p_i(x) = \frac{e^{w_i^\top x + b_i}}{\sum_{j \in [k]} e^{w_j^\top x + b_j}}

are modeled by a softmax function with parameters w_1, ..., w_k. Without loss of generality, we assume that w_k = 0.

4.1.1 Algorithm

In this section, we present generalized versions of Algorithm 1 and Algorithm 2 to account for k mixtures. The fundamental procedure is exactly the same as for two experts. As in the earlier setting, our goal is to recover the gating parameters w_1, ..., w_{k−1} and the regression parameters a_1, ..., a_k given n i.i.d. samples (x_i, y_i) drawn from (6). In the algorithms below, for simplicity, the entire set of gating parameters (w_1, ..., w_{k−1}) is denoted as w. The domain Ω for the parameter w is defined as

Ω \triangleq \{w = (w_1, \ldots, w_{k-1}) : \|w_i\| \le 1, \ \forall i \in [k-1]\} = B(0, 1) \times \ldots \times B(0, 1).


Algorithm 3 Learning the regression parameters
1: Input: Samples (x_i, y_i), i ∈ [n]
2: Compute T̂_3 = (1/n) Σ_{i=1}^n S_3(y_i) S_3(x_i) and T̂_2 = (1/n) Σ_{i=1}^n S_2(y_i) S_2(x_i)
3: {â_1, ..., â_k} = Rank-k tensor decomposition on T̂_3 using T̂_2

Algorithm 4 Learning the gating parameters
1: Input: Samples (x_i, y_i), i ∈ [n] and regressors {â_1, ..., â_k} from Algorithm 3
2: t ← 0
3: Initialize w_0^(1), ..., w_0^(k−1) uniformly at random on the d-dimensional unit sphere
4: while (estimation error < threshold) do
5:   Compute P[z_i = j | x_i, y_i, w_t] = e^{(w_t^(j))^⊤ x_i} · N(y_i | g(a_j^⊤ x_i), σ²) / Σ_{l∈[k]} e^{(w_t^(l))^⊤ x_i} · N(y_i | g(a_l^⊤ x_i), σ²), for j ∈ [k], i ∈ [n]
6:   Compute Q(w|w_t) = (1/n) Σ_{i=1}^n Σ_{j=1}^k P[z_i = j | x_i, y_i, w_t] · log p_j(x_i, w)
7:   Set w_{t+1} = argmax Q(w|w_t) such that ‖w_j‖ ≤ 1, j ∈ [k]
8:   t ← t + 1
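A minimal sketch (ours) of the posterior computation in step 5 of Algorithm 4, i.e. the softmax-gated responsibilities P[z_i = j | x_i, y_i, w_t]; the gating biases b_i of (6) are omitted for brevity (they can be absorbed by appending a constant feature), and the array layout is our convention.

import numpy as np

def kmoe_responsibilities(x, y, W, A, sigma, g=lambda t: t):
    """Posterior P[z_i = j | x_i, y_i, w_t] for the k-MoE model.

    x: (n, d) inputs, y: (n,) outputs,
    W: (d, k-1) gating parameters (w_k = 0 by convention),
    A: (d, k) regression parameters.
    Returns an (n, k) matrix of responsibilities."""
    n, _ = x.shape
    logits = np.hstack([x @ W, np.zeros((n, 1))])               # w_j^T x_i, with w_k = 0
    log_prior = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    log_lik = -0.5 * ((y[:, None] - g(x @ A)) / sigma) ** 2     # log N(y_i | g(a_j^T x_i), σ²) + const
    log_post = log_prior + log_lik
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    return np.exp(log_post)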

4.1.2 Theoretical results

Our next result provides recovery guarantees for the regression parameters in the population setting.

Theorem 5 (Recovery of regression parameters). Suppose that the regressors a_1, ..., a_k have unit norm and are linearly independent, and that the gating parameters w_1, ..., w_{k−1} are orthogonal to the span of {a_1, ..., a_k}, i.e. {w_1, ..., w_{k−1}} ⊂ (span{a_1, ..., a_k})^⊥. Then for any non-linearity g : R → R which is (α, β, γ)-valid, we have that

T_2 \triangleq \mathbb{E}[S_2(y) \cdot S_2(x)] = \sum_{i=1}^{k} c'_g \, \mathbb{E}\!\left[\frac{e^{w_i^\top x}}{1 + \sum_{j=1}^{k-1} e^{w_j^\top x}}\right] \cdot a_i \otimes a_i,
T_3 \triangleq \mathbb{E}[S_3(y) \cdot S_3(x)] = \sum_{i=1}^{k} c_{g,\sigma} \, \mathbb{E}\!\left[\frac{e^{w_i^\top x}}{1 + \sum_{j=1}^{k-1} e^{w_j^\top x}}\right] \cdot a_i \otimes a_i \otimes a_i,

where c'_g and c_{g,σ} are two non-zero constants depending on g and σ. Hence the regressors {a_i} can be recovered from T_2 and T_3.

Remark 7. Similar to Theorem 1, recovery guarantees for {a_i} follow from Theorem 4.3 and Theorem 5 of [AGH+14].

Once we recover the regressors, our next theorem shows that the EM algorithm on the gating parameters converges geometrically to the true parameters.

Theorem 6. Suppose that we know the expert parameters a_1, ..., a_k exactly. There exists a constant σ_0 > 0 such that whenever 0 < σ < σ_0, for any initialization w_0 ∈ Ω, the population-level EM updates {w_t}_{t≥0} on the gating parameter converge geometrically to the true parameter w*, i.e.

\|w_t - w^*\| \le (\kappa_\sigma)^t \|w_0 - w^*\|,

where κ_σ is a dimension-independent constant depending on g and σ such that κ_σ → 0 as σ → 0.

The proofs for both the theorems are presented in Appendix D.1 and Appendix D.2 respectively.

4.2 Hierarchical mixtures of experts (HME)

The hierarchical mixture of experts (HME) is a tree-structured architecture for supervised learning introduced by [JJ94]. As opposed to the general mixture of experts architecture discussed above, the experts in the HME architecture are at the leaves of the tree. The non-terminal nodes of the tree are the gating nodes that determine which expert output is produced at the top-most node (root) of the tree. The intuition behind arranging the gating nodes in a hierarchical fashion is that they represent a nested sequence of regions into which the domain of the input x is divided, and in each sub-region the relationship between the input x and the output y is explained by a simple expert sitting at a leaf node of the architecture. A depth-2 binary hierarchical mixture of experts architecture is shown in Figure 2.

Figure 2: A two-level hierarchical mixture of experts (a gating network at the root and at each internal node routes the input x to one of the expert networks at the leaves).

Assume that the depth of the architecture is p, that each non-terminal node has k children, and that the total number of experts, which corresponds to the total number of leaves, is less than d, i.e. k^p < d. The probabilistic model for the output y of the HME is given by

P_{y|x} = \sum_{i_1=1}^{k} g_{i_1}(x) \sum_{i_2=1}^{k} g_{i_2|i_1}(x) \cdots \sum_{i_p=1}^{k} g_{i_p|(i_1,\ldots,i_{p-1})}(x) \cdot \mathcal{N}(y \mid a_{i_1,\ldots,i_p}^\top x, \sigma^2).

Here g_{i_1} corresponds to the top-level softmax gating function and, similarly, g_{i_m|(i_1,\ldots,i_{m-1})} corresponds to the gating function at depth m; these are given by

g_{i_1}(x) = \frac{e^{w_{i_1}^\top x}}{\sum_{i_1=1}^{k} e^{w_{i_1}^\top x}}, \qquad
g_{i_m|(i_1,\ldots,i_{m-1})}(x) = \frac{e^{w_{i_m|(i_1,\ldots,i_{m-1})}^\top x}}{\sum_{i_m=1}^{k} e^{w_{i_m|(i_1,\ldots,i_{m-1})}^\top x}}.

Without loss of generality, we assume w_{k|(i_1,\ldots,i_{m-1})} = 0 for all m ∈ [p]. Notice that the case p = 1 corresponds to the general mixture of experts considered above. For simplicity of presentation, we restrict ourselves to the case of binary trees, i.e. k = 2. The extension to general k-ary trees can be carried out in a straightforward manner, much like the generalization from the mixture of two experts to k-MoE.

4.2.1 Algorithm

Similar to the earlier setting, the procedure to recover the parameters of the model has two stages: 1) we first recover the regressors using tensor decompositions, and 2) we then recover the gating parameters using EM. Since the experts sit at the leaves of the tree and the produced output y is observed at the root of the tree, the hidden label z_{i=(i_1,\ldots,i_p)} ∈ {0, 1} for the hierarchical mixtures problem corresponds to the choice of the path chosen among all the 2^p paths to the leaves. Thus the calculations for the EM are a little more involved, but the fundamental idea is the same. Let Ω denote the domain for the collection of all gating parameters, defined as

Ω \triangleq \{w = (w_1, w_{1|1}, w_{1|2}, \ldots, w_{1|(2,2,\ldots,2)}) : \|w_{1|i_1,\ldots,i_{m-1}}\| \le 1, \ \forall (i_1,\ldots,i_{m-1}) \in \{1,2\}^{m-1}, \ m \in [p]\}.

Notice that for each path (i_1, ..., i_{m−1}) there is only one gating parameter w_{1|i_1,\ldots,i_{m-1}}, since w_{2|i_1,\ldots,i_{m-1}} = 0 by assumption. Moreover, the gating functions are given by sigmoid transformations, i.e.

g_{1|i_1,\ldots,i_{m-1}} = \sigma(w_{1|i_1,\ldots,i_{m-1}}^\top x), \qquad g_{2|i_1,\ldots,i_{m-1}} = 1 - g_{1|i_1,\ldots,i_{m-1}}.

The gating functions g can be thought of as the prior probabilities of choosing a particular path in the tree for every given x. Let h_{i_1}, i_1 ∈ {1, 2}, denote the posterior probability that edge i_1 is chosen at the top-most root node given (x, y), and similarly let h_{1|(i_1,\ldots,i_{m-1})} denote the posterior probability that the left branch is chosen at the node (i_1, ..., i_{m−1}), m ∈ [p], given that the path (i_1, ..., i_{m−1}) is already chosen. They can be computed using Bayes' theorem as follows:

h_1 = \frac{g_1 \sum_{i_2,\ldots,i_p \in \{1,2\}^{p-1}} g_{i_2|1} \cdots g_{i_p|(1,i_2,\ldots,i_{p-1})} \cdot N_{1,i_2,\ldots,i_p}(x, y)}{\sum_{i=(i_1,\ldots,i_p) \in \{1,2\}^p} g_{i_1} \cdots g_{i_p|(i_1,\ldots,i_{p-1})} \cdot N_{i_1,\ldots,i_p}(x, y)}, \qquad h_2 = 1 - h_1,

and

h_{i_m|(i_1,\ldots,i_{m-1})} = \frac{g_{i_m|(i_1,\ldots,i_{m-1})} \sum_{i_{m+1},\ldots,i_p \in \{1,2\}^{p-m}} g_{i_{m+1}|(i_1,\ldots,i_m)} \cdots g_{i_p|(i_1,\ldots,i_{p-1})} \cdot N_{i_1,\ldots,i_p}(x, y)}{\sum_{i_m \in \{1,2\}} g_{i_m|(i_1,\ldots,i_{m-1})} \sum_{i_{m+1},\ldots,i_p \in \{1,2\}^{p-m}} g_{i_{m+1}|(i_1,\ldots,i_m)} \cdots g_{i_p|(i_1,\ldots,i_{p-1})} \cdot N_{i_1,\ldots,i_p}(x, y)},

where N_{i_1,\ldots,i_p}(x, y) \triangleq \mathcal{N}(y \mid a_{(i_1,\ldots,i_p)}^\top x, \sigma^2). We next define h_{(i_1,\ldots,i_m)} = E[z_{(i_1,\ldots,i_m)} = 1 \mid x, y], the posterior probability of choosing the path (i_1, ..., i_m), given by

h_{(i_1,\ldots,i_m)} = h_{i_1} \cdots h_{i_m|(i_1,\ldots,i_{m-1})} = \frac{g_{i_1} \cdots g_{i_m|(i_1,\ldots,i_{m-1})} \sum_{i_{m+1},\ldots,i_p} g_{i_{m+1}|(i_1,\ldots,i_m)} \cdots g_{i_p|(i_1,\ldots,i_{p-1})} \cdot N_{(i_1,\ldots,i_p)}(x, y)}{\sum_{i_1,\ldots,i_p} g_{i_1} \cdots g_{i_p|(i_1,\ldots,i_{p-1})} \cdot N_{(i_1,\ldots,i_p)}(x, y)}.

We now present the algorithms for the hierarchical setting:

Algorithm 5 Learning the regression parameters at the leaves
1: Input: Samples (x_i, y_i), i ∈ [n]
2: Compute T̂_3 = (1/n) Σ_{i=1}^n S_3(y_i) S_3(x_i) and T̂_2 = (1/n) Σ_{i=1}^n S_2(y_i) S_2(x_i)
3: {â_{(1,...,1)}, ..., â_{(2,...,2)}} = Rank-2^p tensor decomposition on T̂_3 using T̂_2

Algorithm 6 Learning the gating parameters
1: Input: Samples (x_j, y_j), j ∈ [n] and regressors {â_{(1,...,1)}, ..., â_{(2,...,2)}} from Algorithm 5
2: t ← 0
3: Initialize all w^{(0)}_{1|(i_1,...,i_{m−1})} uniformly at random on the d-dimensional unit sphere
4: while (estimation error < threshold) do
5:   Compute h^{(t,j)}_{(i_1,...,i_m)} using w^{(t)}_{1|(i_1,...,i_{m−1})} for all samples j ∈ [n] and m ∈ [p]
6:   Compute Q(w|w_t) = (1/n) Σ_{j=1}^n Σ_{m=1}^p Σ_{(i_1,...,i_{m−1})} h^{(t,j)}_{(i_1,...,i_{m−1})} ( h^{(t,j)}_{1|(i_1,...,i_{m−1})} w_{1|(i_1,...,i_{m−1})}^⊤ x_j − log(1 + exp(w_{1|(i_1,...,i_{m−1})}^⊤ x_j)) )
7:   Set w_{t+1} = argmax Q(w|w_t) such that ‖w_{1|(i_1,...,i_{m−1})}‖ ≤ 1, ∀m ∈ [p]
8:   t ← t + 1
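A minimal sketch (ours) of the path posteriors used in step 5 of Algorithm 6 for a depth-p binary HME with linear experts at the leaves; storing the gating parameters in a dict keyed by the path prefix is an implementation choice of ours, not the paper's.

import numpy as np
from itertools import product

def hme_path_posteriors(x, y, gate_w, A, sigma, p):
    """Posterior h_{(i_1,...,i_p)} of each root-to-leaf path for one sample (x, y).

    gate_w: dict mapping a path prefix (tuple, possibly empty) to the gating
            vector w_{1|prefix}; branch 1 at that node has probability
            sigmoid(w^T x) and branch 2 its complement.
    A: dict mapping a full path (tuple of length p) to the leaf regressor a_path.
    Returns a dict mapping each full path to its posterior probability."""
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    joint = {}
    for path in product((1, 2), repeat=p):
        prior = 1.0
        for m in range(p):
            g1 = sigmoid(gate_w[path[:m]] @ x)
            prior *= g1 if path[m] == 1 else (1.0 - g1)
        lik = np.exp(-0.5 * ((y - A[path] @ x) / sigma) ** 2)   # ∝ N(y | a_path^T x, σ²)
        joint[path] = prior * lik
    total = sum(joint.values())
    return {path: val / total for path, val in joint.items()}

Posteriors of partial paths h_{(i_1,...,i_m)} then follow by summing the returned leaf posteriors over all completions of the prefix (i_1,...,i_m).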

4.2.2 Theoretical results

We now present our results on the recovery of the regression parameters in this setting.

Theorem 7 (Recovery of regression parameters). Suppose that the regressors a_i, i = (i_1, ..., i_p) ∈ {1,2}^p, each have unit norm and are linearly independent, and that each regressor a_i is orthogonal to the set S_i ≜ ∪_{m=1}^p {w_{1|i_1,...,i_{m−1}}} of the gating parameters along the path leading to the leaf node i. Then for any non-linearity g : R → R which is (α, β, γ)-valid, we have that

T_2 \triangleq \mathbb{E}[S_2(y) \cdot S_2(x)] = \sum_{i=(i_1,\ldots,i_p) \in \{1,2\}^p} c'_g \, \mathbb{E}[g_i(x)] \cdot a_i \otimes a_i,
T_3 \triangleq \mathbb{E}[S_3(y) \cdot S_3(x)] = \sum_{i=(i_1,\ldots,i_p) \in \{1,2\}^p} c_{g,\sigma} \, \mathbb{E}[g_i(x)] \cdot a_i \otimes a_i \otimes a_i,

where c′g and cg,σ are two non-zero constants depending on g and σ. Hence the regressors {ai } can be recovered using T2 and T3 .

Remark 8. Similar to Theorem 1, recovery guarantees for {a_i} thus follow from Theorem 4.3 and Theorem 5 of [AGH+14].

The following theorem establishes that the EM convergence to the true gating parameters is geometric once we have the regressors.

Theorem 8 (Recovery of gating parameters). Suppose that we know the regressors {a_i : i ∈ {1,2}^p} exactly. There exists a constant σ_0 such that for any σ < σ_0, starting with any initialization w_0 ∈ Ω, the population-level EM updates {w_t}_{t≥0} on the gating parameter converge geometrically to the true parameter w*, i.e.

\|w_t - w^*\| \le (\kappa_\sigma)^t \|w_0 - w^*\|,

where κ_σ is a constant depending on p, g, and σ such that κ_σ → 0 as σ → 0.

The proofs of these two theorems are deferred to Appendix D.3 and Appendix D.4 respectively.

5 Experiments

In this section, we compare the performance of our algorithm to the EM algorithm with random initializations for both the gating and expert parameters. The evaluation is on both synthetic and real-world data sets. All the experiments are performed in MATLAB. For the tensor decomposition, we use the Orth-ALS package of [SV17].

Figure 3: Plot of parameter estimation error (vs. number of EM iterations) for our algorithm (Spectral+EM) and the EM algorithm on synthetic data with varying number of mixtures for linear experts: (a) 3 mixtures; (b) 4 mixtures.

Figure 4: Parameter estimation error (vs. number of EM iterations) for the sigmoid and ReLU non-linearities respectively.

5.1 Synthetic data

Data generation. Let the tuple (n, d, k, σ) denote the number of samples, the dimension of the input, the number of mixtures, and the Gaussian noise standard deviation respectively. For a given choice of these parameters, the synthetic data is generated as follows: the regression parameters {a_i}_{i=1}^k are drawn i.i.d. uniformly from the unit sphere S^{d−1}, and the gating parameters {w_i}_{i=1}^{k−1} are randomly chosen on S^{d−1} to be orthogonal to the a_i's (for example, by performing an SVD on A = [a_1 ... a_k]). In our experiments we found that the performance of our algorithm is the same without the orthogonality restriction. The input samples {x_i}_{i=1}^n are drawn independently from N(0, I_d), and the corresponding {y_i}_{i=1}^n are generated from these model parameters and the input samples according to the k-MoE probabilistic model in (6).

Evaluation. If Â = [â_1 ... â_k] and Ŵ = [ŵ_1 ... ŵ_{k−1}] denote the estimated regression and gating parameters respectively, our evaluation metric E is the Frobenius norm of the parameter


error accounting for the best possible permutation π, i.e.

E = \inf_\pi \; \|\hat{A} - A_\pi\|_F + \|\hat{W} - W_\pi\|_F,

where A_π = [a_{π(1)} ... a_{π(k)}] denotes the permuted regression parameter matrix and similarly for W_π. Recall that the gating parameter corresponding to the last expert, w_k, is always zero without loss of generality.

Results. In Figure 3, we compare the performance of our algorithm with the randomly initialized EM algorithm for n = 8000, d = 10, σ = 0.5 with linear experts. For a given dataset, we ran the EM algorithm over 10 different trials and plotted the mean estimation error at each iteration. For our algorithm we used the spectral initialization for the regression parameters and random initializations for the gating parameters. It is clear that our algorithm is able to recover the true parameters, resulting in a much smaller estimation error than EM, which often gets stuck in local optima. In addition, our algorithm is able to learn these parameters in very few iterations, often fewer than 10, highlighting its efficiency. Moreover, our algorithm performs better in all the settings as we vary the number of mixtures. Notice that for a given number of samples the estimation error is slightly higher for 4 mixtures than for 3 mixtures, as one would expect. In Figure 4, we repeated our experiments for the choice of n = 10000, d = 5, k = 3 for two popular choices of non-linearities: sigmoid and ReLU. The same conclusion as in the linear setting holds in this case too, with our algorithm consistently outperforming EM. We also found that the performance of gradient descent on the log-likelihood surface is similar to that of EM, highlighting the fact that the stationary points of GD and the fixed points of EM are in one-to-one correspondence [XHM16].
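For reference, a minimal sketch (ours) of the permutation-matched error E defined above; appending an explicit zero column for the implicit w_k, and the brute-force search over permutations, are our conventions for small k.

import numpy as np
from itertools import permutations

def estimation_error(A_hat, W_hat, A, W):
    """E = min_pi ||A_hat - A_pi||_F + ||W_hat - W_pi||_F with columns permuted by pi.
    A, A_hat: (d, k); W, W_hat: (d, k-1), with the last gating parameter fixed to 0."""
    k = A.shape[1]
    W_full = np.hstack([W, np.zeros((W.shape[0], 1))])            # append w_k = 0
    W_hat_full = np.hstack([W_hat, np.zeros((W_hat.shape[0], 1))])
    best = np.inf
    for pi in permutations(range(k)):
        err = (np.linalg.norm(A_hat - A[:, list(pi)]) +
               np.linalg.norm(W_hat_full - W_full[:, list(pi)]))
        best = min(best, err)
    return best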

Figure 5: Plot of parameter estimation error (vs. number of EM iterations) with varying number of samples n: (a) n = 1000; (b) n = 5000; (c) n = 10000.

In Figure 5, we varied the number of samples in our data set and fixed the other parameters to k = 3, d = 5, σ = 0.5.

5.2 Real data

For the real data experiments, we choose three standard regression data sets from the UCI Machine Learning Repository: the Concrete Compressive Strength data set, the Stock Portfolio Performance data set, and the Airfoil Self-Noise data set [Yeh98, LY17, BPM89]. In all three tasks, the goal is to predict the outcome or response y for each input x, which typically contains some task-specific attributes. For example, in the concrete compressive strength task, the goal is to predict the compressive strength of the concrete given its various attributes such as the cement component, water, age,

etc. For this data, the input x ∈ R^8 corresponds to 8 different attributes of the concrete and the output y ∈ R corresponds to its compressive strength. Similarly, for the stock portfolio data set the input x ∈ R^6 contains the weights of several stock-picking concepts, such as the weight of the Large S/P concept and the weight of the Small Systematic Risk concept, and the output y is the corresponding excess return. The airfoil data set is obtained from a series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections, and the goal is to predict the scaled sound pressure level (in dB) given the frequency, angle of attack, etc.

Figure 6: (a) Top panel: Prediction error (vs. number of EM iterations) for the concrete, stock portfolio, and airfoil data sets respectively, for Spectral+EM, EM, and a constant estimator. (b) Bottom panel: Prediction error for the concrete data set with varying number of mixtures.

For all the tasks, we pre-processed the data by whitening the input and scaling the output to lie in (−1, 1). We randomly allotted 75% of the data samples for training and the rest for testing. Our evaluation metric is the prediction error on the test set (x_i, y_i)_{i=1}^n, defined as

E = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,

where ŷ_i corresponds to the predicted output response using the learned parameters. In other words,

\hat{y} = \sum_{i \in [k]} \frac{e^{\hat{w}_i^\top x}}{\sum_{j \in [k]} e^{\hat{w}_j^\top x}} \cdot g(\hat{a}_i^\top x).
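A minimal sketch (ours) of the prediction rule ŷ above, given learned parameters Ŵ (with the k-th gating parameter zero by convention) and Â; the function name is ours.

import numpy as np

def predict(x, W_hat, A_hat, g=lambda t: t):
    """y_hat = Σ_j softmax_j(Ŵ^T x) * g(â_j^T x) for a batch of inputs x of shape (n, d)."""
    n = x.shape[0]
    logits = np.hstack([x @ W_hat, np.zeros((n, 1))])          # append w_k = 0
    gates = np.exp(logits - logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)                  # softmax gating weights
    return np.sum(gates * g(x @ A_hat), axis=1)

The test prediction error E above is then simply np.mean((predict(x_test, W_hat, A_hat) - y_test) ** 2).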

We ran the EM algorithm (with 10 different initializations) on these tasks with various choices of k ∈ {2, ..., 10}, σ ∈ {0.1, 0.4, 0.8, 1}, and g ∈ {linear, sigmoid, ReLU}, and found the best hyper-parameters to be (k = 3, σ = 0.1, g = linear), (k = 3, σ = 0.4, g = sigmoid), and (k = 3, σ = 0.1, g = linear) for the three datasets respectively. For these hyper-parameters we compared our algorithm with EM. Figure 6 highlights the predictive performance of our algorithm as compared to that of EM. It is clear that spectral initializations for EM yield consistently better performance than random initializations. We also plotted the variance of the test data for reference, to gauge the performance of our algorithm. In all the settings our algorithm is able

to obtain a better set of parameters through spectral methods. In the bottom panel of Figure 6, we compared the two algorithms by varying the number of mixtures for the concrete data set.

6 Discussion

In this paper we provided the first provable and globally consistent algorithm that can learn the true parameters of a MoE model. We believe that ideas from [SJA14] can be naturally extended to a finite-sample complexity analysis of the tensor decomposition used to learn the regressors, and similarly, that techniques from [BWY17] can be extended to a finite-sample EM convergence analysis for the gating parameters. While we have focused here on parameter recovery, there are no statistical bounds on the output prediction error when the data is not generated from the model. MoE models are known to be capable of fitting general functions, and obtaining statistical guarantees on learning in such regimes is an interesting direction for future work. Generalizing to non-Gaussian input distributions and allowing the number of mixtures to exceed the dimension, i.e. k > d, are also interesting extensions. In concurrent, independent work, [GLM18] proposed a non-convex loss function approach (based on tensor decompositions) to recover the true parameters of a one-layer ReLU feedforward network, where the designed loss has no bad local minima. In contrast, the ℓ2 loss function has exponentially many local minima. Extending this approach of loss-function design to MoE, taking into account the modular structure and the different roles of the gating and the expert parameters, is a fascinating topic of great mathematical interest.

7 Acknowledgements

The work of Sreeram Kannan was supported by NSF Career award CCF-1651236.


Organization. The appendix is organized as follows:

• Appendix A gives a detailed introduction to the method of moments techniques in the context of mixture models, particularly for MoE.
• Appendix B contains all the necessary details regarding the EM algorithm and, in particular, the techniques for its convergence analysis.
• Appendix C contains all the proofs of Section 3.3. In Appendix C.1, we present our results for the mixture of linear experts and also a method-of-moments technique to recover the gating parameter. Appendix C.3 provides the proof for the recovery of the regressors, and Appendix C.2 deals with the class of non-linearities for which our results hold. In Appendix C.4, we provide the EM convergence analysis for MoE, and in Appendix C.5, we provide the perturbation analysis guarantees for the EM.
• Appendix D contains all the proofs of Section 4.

A Review: Mixture models and method of moments

In this section, we introduce the key techniques that are useful in the parameter estimation of mixture models via the method of moments. Stein's identity (Stein's lemma) is a well-known result in probability and statistics and is widely used in estimation and inference tasks. A refined version of Stein's lemma [Ste72] for higher-order moments is the key to parameter estimation in mixtures of generalized linear models. We utilize this machinery in proving Theorem 1. We first recall Stein's lemma.

Lemma 1 (Stein's lemma [Ste72]). Let x ∼ N(0, I_d) and g : R^d → R be a function such that both E[∇_x g(x)] and E[g(x) · x] exist and are finite. Then E[g(x) · x] = E[∇_x g(x)].

To highlight the key utility of Lemma 1 in parameter estimation problems using the method of moments, assume that (x, y) are distributed according to a generalized linear model (GLM) where

\mathbb{E}[y|x] = g(a^\top x), \qquad x \sim \mathcal{N}(0, I_d),

where the non-linearity g acts as the link function. Assuming that the expected first derivative is non-zero, i.e. E[g'(a^⊤ x)] ≠ 0, we can easily infer the high-dimensional parameter a up to scaling in view of Stein's lemma, since

\mathbb{E}[y \cdot x] = \mathbb{E}[\mathbb{E}[y|x] \cdot x] = \mathbb{E}[g(a^\top x) \cdot x] = \mathbb{E}[g'(a^\top x)] \cdot a.

Now suppose that (x, y) ∼ N(0, I_d) · P_{y|x} where P_{y|x} is a mixture of GLMs such that

\mathbb{E}[y|x] = \sum_{i=1}^{k} \alpha_i \, g(a_i^\top x), \qquad \alpha_i \in \mathbb{R}, \ a_i \in \mathbb{R}^d.

If we consider the cross-moment between x and y, we get

\mathbb{E}[y \cdot x] = \mathbb{E}\left[\left(\sum_{i=1}^{k} \alpha_i \, g(a_i^\top x)\right) \cdot x\right] = \sum_{i=1}^{k} \alpha_i \, \mathbb{E}[g'(a_i^\top x)] \cdot a_i,    (7)

through which we can only hope to recover the subspace spanned by the a_i's rather than the individual a_i's. So a natural question to ask is: can we still recover the parameters of a mixture model using the method of moments? It turns out that we can indeed recover the parameters of mixture models if we consider higher-order cross moments between certain transformations of x and of y, as pointed out in [SJA14]. The following lemma, which can be viewed as a straightforward extension of Stein's lemma to higher-order moments, is the central result behind parameter estimation in [SJA14]. It can be easily proved using the chain rule for derivatives.

Lemma 2 ([SJA14]). Let x ∼ N(0, I_d), let S_3(x) be as defined in (3), and let S_2(x) ≜ x ⊗ x − I_d. Then for any g : R^d → R satisfying some regularity conditions, we have

\mathbb{E}[g(x) \cdot S_2(x)] = \mathbb{E}[\nabla_x^{(2)} g(x)], \qquad \mathbb{E}[g(x) \cdot S_3(x)] = \mathbb{E}[\nabla_x^{(3)} g(x)].

In view of Lemma 2, the mixture of GLMs in (7) yields

\mathbb{E}[y \cdot S_2(x)] = \mathbb{E}[\mathbb{E}[y|x] \cdot S_2(x)] = \sum_{i=1}^{k} \alpha_i \, \mathbb{E}[g(a_i^\top x) \cdot S_2(x)] = \sum_{i=1}^{k} \alpha_i \, \mathbb{E}[g''(a_i^\top x)] \cdot a_i \otimes a_i,    (8)

and

\mathbb{E}[y \cdot S_3(x)] = \mathbb{E}[\mathbb{E}[y|x] \cdot S_3(x)] = \sum_{i=1}^{k} \alpha_i \, \mathbb{E}[g(a_i^\top x) \cdot S_3(x)] = \sum_{i=1}^{k} \alpha_i \, \mathbb{E}[g'''(a_i^\top x)] \cdot a_i \otimes a_i \otimes a_i.    (9)
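As a sanity check (ours, not in the paper), the identity (9) can be verified numerically for a small mixture of sigmoid GLMs by comparing the empirical moment E[y · S_3(x)] against the right-hand side estimated from the known parameters; all constants below are placeholders.

import numpy as np

rng = np.random.default_rng(0)
d, k, n = 4, 2, 2_000_000
A = rng.normal(size=(d, k)); A /= np.linalg.norm(A, axis=0)    # unit-norm regressors a_1, a_2
alpha = np.array([2.0, 3.0])

sig = lambda t: 1.0 / (1.0 + np.exp(-t))
def sig3(t):                                   # third derivative of the sigmoid
    s = sig(t); s1 = s * (1.0 - s); s2 = s1 * (1.0 - 2.0 * s)
    return s2 * (1.0 - 2.0 * s) - 2.0 * s1 ** 2

x = rng.normal(size=(n, d))
y = sig(x @ A) @ alpha                         # E[y|x]; mean-zero noise would not change the moments

# Left-hand side of (9): E[y * S3(x)], expanded using the definition of S3 in (3).
I = np.eye(d)
m1 = x.T @ y / n                               # E[y * x]
lhs = (np.einsum('n,ni,nj,nk->ijk', y, x, x, x, optimize=True) / n
       - np.einsum('a,bc->abc', m1, I) - np.einsum('b,ac->abc', m1, I) - np.einsum('c,ab->abc', m1, I))

# Right-hand side of (9): sum_i alpha_i * E[g'''(a_i^T x)] * a_i ⊗ a_i ⊗ a_i.
rhs = sum(alpha[i] * sig3(x @ A[:, i]).mean()
          * np.einsum('i,j,k->ijk', A[:, i], A[:, i], A[:, i]) for i in range(k))

print(np.abs(lhs - rhs).max())                 # small, shrinking with n (Monte Carlo error)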

Though the second-order moment E[y · S_2(x)] contains useful information about our parameters, its symmetric decomposition into rank-1 components in (8) is unique only up to rotation, even when all the a_i's are linearly independent. This necessitates higher-order tensors, for which symmetric decompositions are unique, as in (9). So whenever the non-linearity g satisfies E[g'''(a_i^⊤ x)] ≠ 0 and the regression parameters a_1, ..., a_k are linearly independent, we can use tensor decomposition algorithms on the empirical tensor (1/n) Σ_{i=1}^n y_i · S_3(x_i) to recover â_1, ..., â_k with provable guarantees.

Remark 9. For the mixture of linear regressions, g is the identity function and thus the above procedure does not seem fruitful for recovering a_1, ..., a_k. The trick in this case is to consider E[y³ · S_3(x)], which yields a third-order symmetric tensor containing a_1, ..., a_k as its rank-1 components.

Taking the conditional expectation for the mixture of two experts model (2), we get

\mathbb{E}[y|x] = f(w^\top x) \cdot g(a_1^\top x) + (1 - f(w^\top x)) \cdot g(a_2^\top x).

To get some intuition about using the method of moments for the mixture-of-experts problem, consider the 2-MoE model. Taking the conditional expectation, we get
$$ \mathbb{E}[y|x] = f(w^\top x) \cdot g(a_1^\top x) + (1 - f(w^\top x)) \cdot g(a_2^\top x), $$
and hence, in view of Lemma 2,
$$ \mathbb{E}[y \cdot S_3(x)] = \mathbb{E}\big[\nabla_x^{(3)} \mathbb{E}[y|x]\big] = \mathbb{E}\big[\nabla_x^{(3)}\big( f(w^\top x) \cdot g(a_1^\top x) \big)\big] + \mathbb{E}\big[\nabla_x^{(3)}\big( (1 - f(w^\top x)) \cdot g(a_2^\top x) \big)\big]. $$
Using the chain rule for multi-derivatives, and denoting $f(w^\top x)$ and $g(a_1^\top x)$ simply by $f$ and $g$ respectively, the first term simplifies to
$$
\begin{aligned}
\mathbb{E}\big[\nabla_x^{(3)}\big( f(w^\top x) \cdot g(a_1^\top x) \big)\big]
&= \mathbb{E}[f''' g] \cdot w \otimes w \otimes w + \mathbb{E}[f'' g']\,(w \otimes w \otimes a_1 + w \otimes a_1 \otimes w + a_1 \otimes w \otimes w) \\
&\quad + \mathbb{E}[f' g'']\,(a_1 \otimes a_1 \otimes w + a_1 \otimes w \otimes a_1 + w \otimes a_1 \otimes a_1) + \mathbb{E}[f g'''] \cdot a_1 \otimes a_1 \otimes a_1.
\end{aligned}
$$

A similar simplification of the second term reveals that the cross-moment between the output $y$ and the score transformation $S_3(x)$ contains all possible rank-1 tensors that can be formed from $a_1$, $a_2$, and $w$. Thus, we cannot hope to recover these parameters uniquely from such a tensor construction. This observation suggests that we need to be clever about the choice of the moment we consider. For example, for the plain mixture of linear regressions, we considered the moment of $y^3$ with $S_3(x)$ to recover the regressors. Theorem 1 addresses exactly this point by constructing a suitable third-order tensor from which we can infer the regression parameters. We are now ready to prove it.

B Review: EM algorithm and EM convergence analysis tools

B.1 EM algorithm for MoE

We first introduce the EM algorithm in the context of the mixture-of-experts problem. Suppose that the sample $(x, z, y)$ is drawn from a distribution whose density $p_{w^\ast}(\cdot)$ belongs to a parametric family for mixture of experts $\{p_w(x, z, y) : w \in \Omega\}$, where
$$ p_w(x, z, y) \triangleq \mathcal{N}(x|0, I_d) \cdot P_{z|x} \cdot P_{y|x,z} = \mathcal{N}(x|0, I_d) \cdot P_{z|x,w} \cdot \mathcal{N}(y|g(a_z^\top x), \sigma^2), $$
where $P_{z|x,w}$ corresponds to the conditional distribution of $z$ given by $\mathbb{P}[z=1|x,w] = f(w^\top x)$ and $\mathbb{P}[z=2|x,w] = 1 - f(w^\top x)$. We observe only $(x, y)$, and $z$ is the latent variable corresponding to the hidden choice of the expert. Let $p_w(x, y)$ be the probability distribution of $(x, y)$ after marginalizing over $z$. Then it follows from Jensen's inequality that
$$ \log p_w(x, y) \geq \underbrace{\mathbb{E}_{p_{w_t}(z|x,y)}[\log p_w(x, z, y)]}_{\triangleq Q(w|w_t)} + H(p_{w_t}(z|x,y)), \tag{10} $$
where $p_{w_t}(z|x,y)$ refers to the conditional distribution of $z$ given $(x, y)$ and $H(p_1, \ldots, p_m) \triangleq -\sum_i p_i \log p_i$ refers to the Shannon entropy of a discrete distribution $(p_1, \ldots, p_m)$. The $Q$ function computes the complete-data log-likelihood using the current estimate $w_t$ for the posterior distribution of the missing data $z$. Since direct maximization of the left-hand side is intractable in general, EM iteratively maximizes its lower bound $Q$, which serves as a surrogate for the actual log-likelihood. Henceforth we assume that we are dealing with population-level EM updates, in other words, that one has infinitely many samples of $(x, y)$. Assuming that the samples $(x, y) \sim p_{w^\ast}(\cdot)$ for some $w^\ast$, the $Q$ function in our setting is given by
$$
\begin{aligned}
Q(w|w_t) &= \mathbb{E}_{p_{w^\ast}(x,y)}\big[ \mathbb{E}_{p_{w_t}(z|x,y)}[\log p_w(x, z, y)] \big] \\
&= \mathbb{E}_{p_{w^\ast}(x,y)}\big[ \mathbb{P}[z=1|x,y,w_t] \log p_w(x, 1, y) + \mathbb{P}[z=2|x,y,w_t] \log p_w(x, 2, y) \big] \\
&= \mathbb{E}_{p_{w^\ast}(x,y)}\Big[ \mathbb{P}[z=1|x,y,w_t] \log \tfrac{1}{1 + e^{-w^\top x}} + \big(1 - \mathbb{P}[z=1|x,y,w_t]\big) \log \tfrac{e^{-w^\top x}}{1 + e^{-w^\top x}} \Big] \\
&= \mathbb{E}_{p_{w^\ast}(x,y)}\big[ p_1(x, y, w_t) \cdot (w^\top x) - \log(1 + e^{w^\top x}) \big],
\end{aligned}
$$
where we drop additive terms that do not depend on $w$, and the posterior probability of the hidden choice, $p_1(x, y, w_t) \triangleq \mathbb{P}[z=1|x,y,w_t]$, is given by
$$ p_1(x, y, w_t) = \frac{f(w_t^\top x)\, \mathcal{N}(y|g(a_1^\top x), \sigma^2)}{f(w_t^\top x)\, \mathcal{N}(y|g(a_1^\top x), \sigma^2) + (1 - f(w_t^\top x))\, \mathcal{N}(y|g(a_2^\top x), \sigma^2)}. \tag{11} $$

Recall that $\Omega = \{w : \|w\| \leq 1\}$. The population EM for the mixture of experts then consists of the following two steps:

• E-step: use the current estimate $w_t$ to compute the function $Q(\cdot|w_t)$.

• M-step: $w_{t+1} = \operatorname{argmax}_{\|w\| \leq 1} Q(w|w_t)$.

Thus EM can be viewed as a deterministic procedure that maps $w_t \mapsto M(w_t)$, where $M(w) = \operatorname{argmax}_{w' \in \Omega} Q(w'|w)$.
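For concreteness, the following is a minimal finite-sample sketch of these two steps for linear experts ($g$ the identity) with known regressors. The exact M-step over $\{\|w\| \le 1\}$ is approximated here by a few projected gradient ascent steps on the sample $Q$ function; all names are ours and this is only an illustrative sketch, not the paper's Algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def em_gating(X, y, a1, a2, sigma, n_iters=50, lr=0.5, inner=20):
    """Finite-sample EM for the gating parameter w of a 2-MoE with known
    linear experts a1, a2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # E-step: posterior responsibility of expert 1, as in Eq. (11).
        g1 = np.exp(-(y - X @ a1) ** 2 / (2 * sigma ** 2))
        g2 = np.exp(-(y - X @ a2) ** 2 / (2 * sigma ** 2))
        f = sigmoid(X @ w)
        p1 = f * g1 / (f * g1 + (1 - f) * g2)
        # M-step (approximate): maximize the sample version of
        # Q(w'|w) = E[p1 * (w'^T x) - log(1 + exp(w'^T x))] over ||w'|| <= 1.
        w_new = w.copy()
        for _ in range(inner):
            grad = X.T @ (p1 - sigmoid(X @ w_new)) / n
            w_new += lr * grad
            nrm = np.linalg.norm(w_new)
            if nrm > 1.0:
                w_new /= nrm            # project back onto Omega
        w = w_new
    return w
```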

B.2 EM convergence analysis

Our convergence analysis relies on tools from [BWY17], which provides local convergence results for both the EM and gradient EM algorithms. In particular, [BWY17] shows that if EM is initialized in a sufficiently small neighborhood around the true parameters, the EM iterates converge geometrically to the true parameters under certain strong-concavity and gradient-stability conditions. We now formally state the assumptions in [BWY17] under which the convergence guarantees hold. We will show in the next section that these conditions hold globally in our setting.

Assumption 1 (Convexity of the domain). $\Omega$ is convex.

Assumption 2 (Strong concavity). $Q(\cdot|w^\ast)$ is a $\lambda$-strongly concave function over an $r$-neighborhood of $w^\ast$, i.e. over $B(w^\ast, r) \triangleq \{w \in \Omega : \|w - w^\ast\| \leq r\}$.

Remark 10. An important point to note is that the true parameter $w^\ast$ is a fixed point of the EM algorithm, i.e. $M(w^\ast) = w^\ast$. This is also known as self-consistency of the EM algorithm. Hence it is reasonable to expect that in a sufficiently small neighborhood around $w^\ast$ there exists a unique maximizer of $Q(\cdot|w^\ast)$.

Assumption 3 (First-order stability condition). Assume that
$$ \|\nabla Q(M(w)|w^\ast) - \nabla Q(M(w)|w)\| \leq \gamma \|w - w^\ast\|, \quad \forall\, w \in B(w^\ast, r). $$

Remark 11. Intuitively, the gradient stability condition forces the gradient maps $\nabla Q(\cdot|w)$ and $\nabla Q(\cdot|w^\ast)$ to be close whenever $w$ lies in a neighborhood of $w^\ast$. This ensures that the mapped output $M(w)$ stays close to $w^\ast$.

Theorem 9 (Theorem 1, [BWY17]). If the above assumptions are met for some radius $r > 0$ and $0 \leq \gamma < \lambda$, then the map $w \mapsto M(w)$ is contractive over $B(w^\ast, r)$, i.e.
$$ \|M(w) - w^\ast\| \leq \frac{\gamma}{\lambda} \|w - w^\ast\|, \quad \forall\, w \in B(w^\ast, r), $$
and consequently the EM iterates $\{w_t\}_{t \geq 0}$ converge geometrically to $w^\ast$, i.e.
$$ \|w_t - w^\ast\| \leq \Big(\frac{\gamma}{\lambda}\Big)^{t} \|w_0 - w^\ast\|, $$
whenever the initialization $w_0 \in B(w^\ast, r)$.

C Proofs of Section 3

C.1 Linear mixture of experts

In this section, we prove Theorem 1 when $g$ is the identity function, without the assumption that $w$ is orthogonal to both $a_1$ and $a_2$. Later we show how the gating parameter $w$ can also be recovered using only the method of moments.

C.1.1 Proof of Theorem 1 for linear mixtures

Proof. Recall that
$$ P_{y|x} = f(w^\top x) \cdot \mathcal{N}(y|a_1^\top x, \sigma^2) + (1 - f(w^\top x)) \cdot \mathcal{N}(y|a_2^\top x, \sigma^2), \quad x \sim \mathcal{N}(0, I_d). \tag{12} $$
Using the fact that $\mathbb{E}[Z^3] = \mu^3 + 3\mu\sigma^2$ for any Gaussian random variable $Z \sim \mathcal{N}(\mu, \sigma^2)$, we get
$$ \mathbb{E}[y^3|x] = f(w^\top x)\big((a_1^\top x)^3 + 3(a_1^\top x)\sigma^2\big) + (1 - f(w^\top x))\big((a_2^\top x)^3 + 3(a_2^\top x)\sigma^2\big). $$
Moreover,
$$ \mathbb{E}[y|x] = f(w^\top x)(a_1^\top x) + (1 - f(w^\top x))(a_2^\top x). $$
Thus,
$$ \mathbb{E}[y^3 - 3y(1 + \sigma^2)|x] = f(w^\top x)\big((a_1^\top x)^3 - 3(a_1^\top x)\big) + (1 - f(w^\top x))\big((a_2^\top x)^3 - 3(a_2^\top x)\big). $$

If we define $S_3(y) \triangleq y^3 - 3y(1 + \sigma^2)$ and $S_3(x)$ as in (3), then in view of Lemma 2 we get that
$$
\begin{aligned}
T_3 = \mathbb{E}[S_3(y) \cdot S_3(x)] &= \mathbb{E}\big[(y^3 - 3y(1 + \sigma^2)) \cdot S_3(x)\big] \\
&= \mathbb{E}\big[ f(w^\top x)\big((a_1^\top x)^3 - 3(a_1^\top x)\big) \cdot S_3(x) \big] + \mathbb{E}\big[ \big(1 - f(w^\top x)\big)\big((a_2^\top x)^3 - 3(a_2^\top x)\big) \cdot S_3(x) \big] \\
&= \mathbb{E}\big[ \nabla_x^{(3)}\big( f(w^\top x)\big((a_1^\top x)^3 - 3(a_1^\top x)\big) \big) \big] + \mathbb{E}\big[ \nabla_x^{(3)}\big( \big(1 - f(w^\top x)\big)\big((a_2^\top x)^3 - 3(a_2^\top x)\big) \big) \big]. \tag{13}
\end{aligned}
$$
Using the chain rule for multi-derivatives, the first term simplifies to
$$
\begin{aligned}
\mathbb{E}\big[ \nabla_x^{(3)}\big( f(w^\top x)\big((a_1^\top x)^3 - 3(a_1^\top x)\big) \big) \big]
&= \mathbb{E}\big[f'''\big((a_1^\top x)^3 - 3(a_1^\top x)\big)\big] \cdot w \otimes w \otimes w \\
&\quad + \mathbb{E}\big[f''\big(3(a_1^\top x)^2 - 3\big)\big]\,(w \otimes w \otimes a_1 + w \otimes a_1 \otimes w + a_1 \otimes w \otimes w) \\
&\quad + \mathbb{E}\big[f'\big(6(a_1^\top x)\big)\big]\,(a_1 \otimes a_1 \otimes w + a_1 \otimes w \otimes a_1 + w \otimes a_1 \otimes a_1) + 6\,\mathbb{E}[f] \cdot a_1 \otimes a_1 \otimes a_1. \tag{14}
\end{aligned}
$$

Since $f(z) = \frac{1}{1 + e^{-z}}$, the derivatives $f'(\cdot)$ and $f'''(\cdot)$ are even functions whereas $f''(\cdot)$ is an odd function. Furthermore, both $x \mapsto (a_1^\top x)^3 - 3(a_1^\top x)$ and $x \mapsto a_1^\top x$ are odd functions whereas $x \mapsto 3(a_1^\top x)^2 - 3$ is an even function. Since $x \sim \mathcal{N}(0, I_d)$, we have $-x \stackrel{(d)}{=} x$. Thus all the expectation terms in (14) equal zero except for the last term, since $\mathbb{E}[f(w^\top x)] = \frac{1}{2} > 0$. We have
$$ \mathbb{E}\big[ \nabla_x^{(3)}\big( f(w^\top x)\big((a_1^\top x)^3 - 3(a_1^\top x)\big) \big) \big] = 3 \cdot a_1 \otimes a_1 \otimes a_1. $$
Similarly,
$$ \mathbb{E}\big[ \nabla_x^{(3)}\big( \big(1 - f(w^\top x)\big)\big((a_2^\top x)^3 - 3(a_2^\top x)\big) \big) \big] = 3 \cdot a_2 \otimes a_2 \otimes a_2. $$

Together, we have that
$$ T_3 = 3 \cdot a_1 \otimes a_1 \otimes a_1 + 3 \cdot a_2 \otimes a_2 \otimes a_2. $$
Similarly, for the second-order tensor, define $S_2(y) \triangleq y^2$ and $S_2(x) = x \otimes x - I$. We get that
$$ \mathbb{E}[S_2(y)|x] = \mathbb{E}[y^2|x] = f(w^\top x)(a_1^\top x)^2 + (1 - f(w^\top x))(a_2^\top x)^2 + \sigma^2. $$
Thus we have
$$ T_2 = \mathbb{E}[S_2(y) \cdot S_2(x)] = \mathbb{E}\big[ \nabla_x^{(2)}\big( f(w^\top x)(a_1^\top x)^2 \big) \big] + \mathbb{E}\big[ \nabla_x^{(2)}\big( (1 - f(w^\top x))(a_2^\top x)^2 \big) \big]. $$
Simplifying the first term using the chain rule, we obtain
$$ \mathbb{E}\big[ \nabla_x^{(2)}\big( f(w^\top x)(a_1^\top x)^2 \big) \big] = \mathbb{E}\big[\nabla_x^{(2)} f(w^\top x)\big]\,\mathbb{E}\big[(a_1^\top x)^2\big] + \mathbb{E}[f'(w^\top x)]\,\mathbb{E}[a_1^\top x]\,(w \otimes a_1 + a_1 \otimes w) + 2\,\mathbb{E}[f(w^\top x)] \cdot a_1 \otimes a_1 = a_1 \otimes a_1, $$
where we used $\mathbb{E}[(a_1^\top x)^2] = 1$, $\mathbb{E}[a_1^\top x] = 0$, $\mathbb{E}[\nabla_x^{(2)} f(w^\top x)] = \mathbb{E}[f''(w^\top x)]\, w \otimes w = 0$ (as $f''$ is odd), and $2\,\mathbb{E}[f(w^\top x)] = 1$. Similarly, the second term simplifies to
$$ \mathbb{E}\big[ \nabla_x^{(2)}\big( (1 - f(w^\top x))(a_2^\top x)^2 \big) \big] = a_2 \otimes a_2. $$
Hence
$$ T_2 = a_1 \otimes a_1 + a_2 \otimes a_2. $$
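The following Monte-Carlo sketch checks both constructions numerically. It assumes, for the check only, orthonormal $a_1, a_2$ and a gating vector orthogonal to them; all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma = 5, 300_000, 0.3
a1, a2, w = np.eye(d)[0], np.eye(d)[1], np.eye(d)[2]

X = rng.standard_normal((n, d))
z = rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ w)))          # latent expert choice
y = np.where(z, X @ a1, X @ a2) + sigma * rng.standard_normal(n)

# T2_hat = E[y^2 (x x^T - I)]  should be close to  a1 a1^T + a2 a2^T.
S2y = y ** 2
T2_hat = (X * S2y[:, None]).T @ X / n - S2y.mean() * np.eye(d)
print(np.round(T2_hat, 2))

# Contraction T3(I, I, a1) with S3(y) = y^3 - 3y(1 + sigma^2) should be ~ 3 a1 a1^T.
S3y = y ** 3 - 3 * y * (1 + sigma ** 2)
u = X @ a1
m = (X * S3y[:, None]).mean(axis=0)                         # E[S3(y) x]
T3_a1 = ((X * (S3y * u)[:, None]).T @ X / n                 # E[S3(y) (a1^T x) x x^T]
         - np.mean(S3y * u) * np.eye(d)
         - np.outer(m, a1) - np.outer(a1, m))
print(np.round(T3_a1, 2))
```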

C.1.2 Recovery of the gating parameter using the method of moments

Suppose that we know $a_1$ and $a_2$ and our goal is to recover $w$. From (12), we have that
$$ \mathbb{E}[y|x] = f(w^\top x) \cdot a_1^\top x + (1 - f(w^\top x)) \cdot a_2^\top x \tag{15} $$
$$ \phantom{\mathbb{E}[y|x]} = a_2^\top x + f(w^\top x) \cdot (a_1 - a_2)^\top x. \tag{16} $$
Thus,
$$ \frac{\mathbb{E}[y|x] - a_2^\top x}{(a_1 - a_2)^\top x} = f(w^\top x). $$
Notice that in the above equation we have $(a_1 - a_2)^\top x$ in the denominator. But this equals zero with zero probability whenever $x$ is generated from a continuous distribution; in our case $x$ is Gaussian. Thus we may be tempted to write
$$ \mathbb{E}\left[ \frac{\mathbb{E}[y|x] - a_2^\top x}{(a_1 - a_2)^\top x} \cdot x \right] = \mathbb{E}\left[ \frac{y - a_2^\top x}{(a_1 - a_2)^\top x} \cdot x \right] = \mathbb{E}\big[ f(w^\top x) \cdot x \big] = \mathbb{E}\big[ f'(w^\top x) \big] \cdot w \overset{(\text{if } \|w\|=1)}{=} \mathbb{E}_{Z \sim \mathcal{N}(0,1)}\big[ f'(Z) \big] \cdot w \propto w. $$

However, it turns out that the above chain of equalities does not hold. Surprisingly, the first equality, which is essentially the law of iterated expectations, is not valid in this case because $\frac{y - a_2^\top x}{(a_1 - a_2)^\top x}$ is not integrable. To see this, notice that the model in (12) can also be written as
$$ y \stackrel{(d)}{=} Z(a_1^\top x) + (1 - Z)(a_2^\top x) + \sigma N, \quad Z \sim \mathrm{Bern}(f(w^\top x)),\ N \sim \mathcal{N}(0, 1). $$
Thus,
$$ \mathrm{Ratio} \triangleq \frac{y - a_2^\top x}{(a_1 - a_2)^\top x} \stackrel{(d)}{=} Z + \frac{\sigma N}{(a_1 - a_2)^\top x}. $$
Since $Z$ is independent of $N$ and $\frac{N}{(a_1 - a_2)^\top x}$ is a Cauchy random variable, it follows that the random variable $\mathrm{Ratio}$ is not integrable. To deal with the non-integrability of $\mathrm{Ratio}$, we look at its conditional cdf, given by
$$ \mathbb{P}[\mathrm{Ratio} \leq z \,|\, x] = f(w^\top x)\,\Phi\!\left( \frac{|\Delta_x|}{\sigma}(z - 1) \right) + (1 - f(w^\top x))\,\Phi\!\left( \frac{|\Delta_x|}{\sigma}\, z \right), \quad \Delta_x = (a_1 - a_2)^\top x, $$

where $\Phi(\cdot)$ is the standard Gaussian cdf. Substituting $z = 0.5$ and using the fact that $\Phi(z) + \Phi(-z) = 1$, we obtain
$$ \mathbb{P}[\mathrm{Ratio} \leq 0.5 \,|\, x] = f(w^\top x)\,\Phi\!\left( -\frac{|\Delta_x|}{2\sigma} \right) + (1 - f(w^\top x))\,\Phi\!\left( \frac{|\Delta_x|}{2\sigma} \right) = \Phi\!\left( \frac{|(a_1 - a_2)^\top x|}{2\sigma} \right) + f(w^\top x)\left( 1 - 2\Phi\!\left( \frac{|(a_1 - a_2)^\top x|}{2\sigma} \right) \right). $$
Since $\Phi\!\left( \frac{|(a_1 - a_2)^\top x|}{2\sigma} \right)$ is a symmetric (even) function of $x$, its first moment with $x$ equals zero. Furthermore, if we assume that $w$ is orthogonal to $a_1$ and $a_2$, we have
$$
\begin{aligned}
\mathbb{E}\big[ \mathbf{1}\{\mathrm{Ratio} \leq 0.5\} \cdot x \big] &= \mathbb{E}\big[ \mathbb{P}[\mathrm{Ratio} \leq 0.5 \,|\, x] \cdot x \big] \\
&= \mathbb{E}\left[ f(w^\top x)\left( 1 - 2\Phi\!\left( \tfrac{|(a_1 - a_2)^\top x|}{2\sigma} \right) \right) \cdot x \right] \\
&= \mathbb{E}[f'(w^\top x)] \cdot \mathbb{E}\left[ 1 - 2\Phi\!\left( \tfrac{|(a_1 - a_2)^\top x|}{2\sigma} \right) \right] \cdot w + \mathbb{E}[f(w^\top x)] \cdot \underbrace{\mathbb{E}\left[ \nabla_x \left( 1 - 2\Phi\!\left( \tfrac{|(a_1 - a_2)^\top x|}{2\sigma} \right) \right) \right]}_{=\,0,\ \text{since the derivative of an even function is odd}} \\
&= \mathbb{E}[f'(w^\top x)] \cdot \mathbb{E}\left[ 1 - 2\Phi\!\left( \tfrac{|(a_1 - a_2)^\top x|}{2\sigma} \right) \right] \cdot w \ \propto\ w.
\end{aligned}
$$
Thus, if $\|w\| = 1$, the normalized moment $\mathbb{E}[\mathbf{1}\{\mathrm{Ratio} \leq 0.5\} \cdot x] \,/\, \|\mathbb{E}[\mathbf{1}\{\mathrm{Ratio} \leq 0.5\} \cdot x]\|$ recovers $w$ up to sign; since the proportionality constant $\mathbb{E}[f'(w^\top x)]\,\mathbb{E}\big[1 - 2\Phi(|\Delta_x|/(2\sigma))\big]$ is negative, its negation equals $w$.

In the finite-sample regime, $\mathbb{E}[\mathbf{1}\{\mathrm{Ratio} \leq 0.5\} \cdot x]$ can be estimated from samples using empirical moments, and its normalized version (with the sign fixed as above) yields an estimate of $w$.
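The following is a short sketch of that finite-sample estimator; it assumes $w \perp \operatorname{span}\{a_1, a_2\}$ as in the derivation above, and the sign convention follows the preceding discussion. Function and variable names are ours.

```python
import numpy as np

def estimate_gating_direction(X, y, a1, a2):
    """Moment-based estimate of the gating direction w when a1, a2 are known."""
    ratio = (y - X @ a2) / (X @ (a1 - a2))          # the 'Ratio' statistic
    m = ((ratio <= 0.5).astype(float)[:, None] * X).mean(axis=0)
    return -m / np.linalg.norm(m)                   # negate: the constant is negative

# usage sketch on synthetic data
rng = np.random.default_rng(2)
d, n, sigma = 6, 200_000, 0.2
a1, a2, w = np.eye(d)[0], np.eye(d)[1], np.eye(d)[2]
X = rng.standard_normal((n, d))
z = rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ w)))
y = np.where(z, X @ a1, X @ a2) + sigma * rng.standard_normal(n)
print(estimate_gating_direction(X, y, a1, a2) @ w)  # close to 1
```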

C.2 Class of non-linearities

In this section, we characterize the class of non-linearities for which our theoretical results for the recovery of the regressors hold. Let $Z \sim \mathcal{N}(0, 1)$ and $Y|Z \sim \mathcal{N}(g(Z), \sigma^2)$, where $g: \mathbb{R} \to \mathbb{R}$. For $(\alpha, \beta, \gamma) \in \mathbb{R}^3$, define
$$ S_3(Y) \triangleq Y^3 + \alpha Y^2 + \beta Y, \qquad S_3(Z) = \mathbb{E}[S_3(Y)|Z] = g(Z)^3 + \alpha g(Z)^2 + g(Z)(\beta + 3\sigma^2) + \alpha\sigma^2, $$
and
$$ S_2(Y) \triangleq Y^2 + \gamma Y, \qquad S_2(Z) = \mathbb{E}[S_2(Y)|Z] = g(Z)^2 + \gamma g(Z) + \sigma^2. $$

Condition 1. $\mathbb{E}[S_3'(Z)] = \mathbb{E}[S_3''(Z)] = 0$ and $\mathbb{E}[S_3'''(Z)] \neq 0$.

Condition 2. $\mathbb{E}[S_2'(Z)] = 0$ and $\mathbb{E}[S_2''(Z)] \neq 0$.

We say that the non-linearity $g$ is $(\alpha, \beta, \gamma)$-valid if there exists $(\alpha, \beta, \gamma) \in \mathbb{R}^3$ such that both Condition 1 and Condition 2 are satisfied. We have that
$$
\begin{aligned}
S_3'(Z) &= 3 g(Z)^2 g'(Z) + 2\alpha\, g(Z) g'(Z) + g'(Z)(\beta + 3\sigma^2) = 2\alpha\, g(Z) g'(Z) + \beta\, g'(Z) + 3 g(Z)^2 g'(Z) + 3 g'(Z)\sigma^2, \\
S_3''(Z) &= 2\alpha\,\big(g'(Z)^2 + g(Z) g''(Z)\big) + \beta\, g''(Z) + 3 g''(Z)\big(g(Z)^2 + \sigma^2\big) + 6\, g(Z) g'(Z)^2.
\end{aligned}
$$
Thus $\mathbb{E}[S_3'(Z)] = \mathbb{E}[S_3''(Z)] = 0$ implies that
$$
\begin{pmatrix} 2\,\mathbb{E}[g(Z) g'(Z)] & \mathbb{E}[g'(Z)] \\ 2\,\mathbb{E}[g'(Z)^2 + g(Z) g''(Z)] & \mathbb{E}[g''(Z)] \end{pmatrix}
\begin{pmatrix} \alpha \\ \beta \end{pmatrix}
=
\begin{pmatrix} -3\,\mathbb{E}[g(Z)^2 g'(Z) + g'(Z)\sigma^2] \\ -3\,\mathbb{E}[g''(Z)(g(Z)^2 + \sigma^2) + 2\, g(Z) g'(Z)^2] \end{pmatrix}.
$$
To ensure Condition 1, we need the pair $(\alpha, \beta)$ obtained by solving the above linear system to satisfy $\mathbb{E}[S_3'''(Z)] \neq 0$. Similarly, $\mathbb{E}[S_2'(Z)] = 0$ implies that
$$ \gamma = \frac{-2\,\mathbb{E}[g(Z) g'(Z)]}{\mathbb{E}[g'(Z)]}. $$
Thus Condition 2 stipulates that $\mathbb{E}[S_2''(Z)] \neq 0$ with this choice of $\gamma$. It turns out that these conditions hold for a wide class of non-linearities, and in particular when $g$ is the identity function, the sigmoid function, or the ReLU. For these three popular non-linearities, the values of the tuple $(\alpha, \beta, \gamma)$ are provided below (obtained by solving the linear equations above).

Example 1. If $g$ is the identity mapping, then $S_3(y) = y^3 - 3y(1 + \sigma^2)$ and $S_2(y) = y^2$.

Example 2. If $g$ is the sigmoid function, i.e. $g(z) = \frac{1}{1 + e^{-z}}$, then $\alpha$ and $\beta$ can be obtained by solving the linear system
$$ \begin{pmatrix} 0.2066 & 0.2066 \\ 0.0624 & -0.0001 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} -0.1755 - 0.6199\,\sigma^2 \\ -0.0936 \end{pmatrix}. $$
The second-order transformation is given by $S_2(y) = y^2 - y$ (since $\gamma = -1$ when $g$ is the sigmoid).

Example 3. If $g$ is the ReLU function, i.e. $g(z) = \max\{0, z\}$, then $\alpha = -3\sqrt{\frac{2}{\pi}}$, $\beta = 3\left(\frac{4}{\pi} - \sigma^2 - 1\right)$, and $\gamma = -2\sqrt{\frac{2}{\pi}}$.
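For a smooth non-linearity such as the sigmoid, $(\alpha, \beta, \gamma)$ can be computed numerically from the linear system above. Below is a minimal sketch using Gauss-Hermite quadrature for the Gaussian expectations and central differences for the derivatives of $g$ (this is an assumption of ours for illustration and is not intended for non-smooth links such as the ReLU).

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def alpha_beta_gamma(g, sigma, n_nodes=80, h=1e-3):
    """Solve the 2x2 linear system above for (alpha, beta) and compute gamma."""
    z, wts = hermegauss(n_nodes)
    wts = wts / np.sqrt(2 * np.pi)               # weights so that sum(wts*h(z)) = E[h(Z)]
    E = lambda vals: np.sum(wts * vals)
    gz = g(z)
    g1 = (g(z + h) - g(z - h)) / (2 * h)         # g'
    g2 = (g(z + h) - 2 * g(z) + g(z - h)) / h**2 # g''
    A = np.array([[2 * E(gz * g1),            E(g1)],
                  [2 * E(g1**2 + gz * g2),    E(g2)]])
    b = np.array([-3 * E(gz**2 * g1 + g1 * sigma**2),
                  -3 * E(g2 * (gz**2 + sigma**2) + 2 * gz * g1**2)])
    alpha, beta = np.linalg.solve(A, b)
    gamma = -2 * E(gz * g1) / E(g1)
    return alpha, beta, gamma

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(alpha_beta_gamma(sigmoid, sigma=0.3))      # gamma should be close to -1
```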

C.3 Proof of Theorem 1

Proof. We first prove the theorem when $g$ is the identity function. From the proof for the linear case in Appendix C.1.1, we have that
$$ T_3 = \mathbb{E}[S_3(y) \cdot S_3(x)] = \mathbb{E}\big[ \nabla_x^{(3)}\big( f(w^\top x)\big((a_1^\top x)^3 - 3(a_1^\top x)\big) \big) \big] + \mathbb{E}\big[ \nabla_x^{(3)}\big( \big(1 - f(w^\top x)\big)\big((a_2^\top x)^3 - 3(a_2^\top x)\big) \big) \big], \tag{17} $$
and
$$
\begin{aligned}
\mathbb{E}\big[ \nabla_x^{(3)}\big( f(w^\top x)\big((a_1^\top x)^3 - 3(a_1^\top x)\big) \big) \big]
&= \mathbb{E}\big[f'''\big((a_1^\top x)^3 - 3(a_1^\top x)\big)\big] \cdot w \otimes w \otimes w + \mathbb{E}\big[f''\big(3(a_1^\top x)^2 - 3\big)\big]\,(w \otimes w \otimes a_1 + w \otimes a_1 \otimes w + a_1 \otimes w \otimes w) \\
&\quad + \mathbb{E}\big[f'\big(6(a_1^\top x)\big)\big]\,(a_1 \otimes a_1 \otimes w + a_1 \otimes w \otimes a_1 + w \otimes a_1 \otimes a_1) + 6\,\mathbb{E}[f] \cdot a_1 \otimes a_1 \otimes a_1.
\end{aligned}
$$
Since $x \sim \mathcal{N}(0, I_d)$ and $w \perp \operatorname{span}\{a_1, a_2\}$, we have that $w^\top x \perp (a_1^\top x, a_2^\top x)$. Thus
$$
\begin{aligned}
\mathbb{E}\big[ \nabla_x^{(3)}\big( f(w^\top x)\big((a_1^\top x)^3 - 3(a_1^\top x)\big) \big) \big]
&= \mathbb{E}[f''']\,\mathbb{E}\big[(a_1^\top x)^3 - 3(a_1^\top x)\big] \cdot w \otimes w \otimes w + \mathbb{E}[f'']\,\mathbb{E}\big[3(a_1^\top x)^2 - 3\big]\,(w \otimes w \otimes a_1 + w \otimes a_1 \otimes w + a_1 \otimes w \otimes w) \\
&\quad + \mathbb{E}[f']\,\mathbb{E}\big[6(a_1^\top x)\big]\,(a_1 \otimes a_1 \otimes w + a_1 \otimes w \otimes a_1 + w \otimes a_1 \otimes a_1) + 6\,\mathbb{E}[f] \cdot a_1 \otimes a_1 \otimes a_1.
\end{aligned}
$$
Since $a_1^\top x \sim \mathcal{N}(0, 1)$, we have that $\mathbb{E}[(a_1^\top x)^3 - 3(a_1^\top x)] = \mathbb{E}[(a_1^\top x)^2 - 1] = \mathbb{E}[a_1^\top x] = 0$. Thus,
$$ \mathbb{E}\big[ \nabla_x^{(3)}\big( f(w^\top x)\big((a_1^\top x)^3 - 3(a_1^\top x)\big) \big) \big] = 6\,\mathbb{E}[f] \cdot a_1 \otimes a_1 \otimes a_1 = 3 \cdot a_1 \otimes a_1 \otimes a_1. $$
Similarly, the second term in (17) simplifies to
$$ \mathbb{E}\big[ \nabla_x^{(3)}\big( \big(1 - f(w^\top x)\big)\big((a_2^\top x)^3 - 3(a_2^\top x)\big) \big) \big] = 6\,\mathbb{E}[1 - f] \cdot a_2 \otimes a_2 \otimes a_2 = 3 \cdot a_2 \otimes a_2 \otimes a_2. $$
Together we obtain

$$ T_3 = 3 \cdot a_1 \otimes a_1 \otimes a_1 + 3 \cdot a_2 \otimes a_2 \otimes a_2. $$
Thus we have $\alpha_1 = \alpha_2 = 3 > 0$. Following the same argument for the second-order tensor, we obtain $T_2 = a_1 \otimes a_1 + a_2 \otimes a_2$.

Now consider an arbitrary link function $g$ which is $(\alpha, \beta, \gamma)$-valid (as described in Appendix C.2). Then
$$ P_{y|x} = f(w^\top x) \cdot \mathcal{N}(y|g(a_1^\top x), \sigma^2) + (1 - f(w^\top x)) \cdot \mathcal{N}(y|g(a_2^\top x), \sigma^2), \quad x \sim \mathcal{N}(0, I_d), $$
implies that
$$ \mathbb{E}[y^3|x] = f(w^\top x)\big(g(a_1^\top x)^3 + 3 g(a_1^\top x)\sigma^2\big) + (1 - f(w^\top x))\big(g(a_2^\top x)^3 + 3 g(a_2^\top x)\sigma^2\big), $$
and
$$ \mathbb{E}[y^2|x] = f(w^\top x)\big(g(a_1^\top x)^2 + \sigma^2\big) + (1 - f(w^\top x))\big(g(a_2^\top x)^2 + \sigma^2\big), \qquad \mathbb{E}[y|x] = f(w^\top x)\,g(a_1^\top x) + (1 - f(w^\top x))\,g(a_2^\top x). $$


If we define $S_3(y) \triangleq y^3 + \alpha y^2 + \beta y$, we have that
$$
\begin{aligned}
T_3 = \mathbb{E}[S_3(y) \cdot S_3(x)] &= \mathbb{E}\big[\mathbb{E}[y^3 + \alpha y^2 + \beta y \,|\, x] \cdot S_3(x)\big] \\
&= \mathbb{E}\Big[ f(w^\top x)\big( g(a_1^\top x)^3 + \alpha g(a_1^\top x)^2 + g(a_1^\top x)(\beta + 3\sigma^2) \big) \cdot S_3(x) \Big] + \mathbb{E}\Big[ (1 - f(w^\top x))\big( g(a_2^\top x)^3 + \alpha g(a_2^\top x)^2 + g(a_2^\top x)(\beta + 3\sigma^2) \big) \cdot S_3(x) \Big] \\
&= \mathbb{E}\Big[ \nabla_x^{(3)}\Big( f(w^\top x)\big( g(a_1^\top x)^3 + \alpha g(a_1^\top x)^2 + g(a_1^\top x)(\beta + 3\sigma^2) \big) \Big) \Big] + \mathbb{E}\Big[ \nabla_x^{(3)}\Big( (1 - f(w^\top x))\big( g(a_2^\top x)^3 + \alpha g(a_2^\top x)^2 + g(a_2^\top x)(\beta + 3\sigma^2) \big) \Big) \Big] \\
&\overset{(a)}{=} \mathbb{E}[f]\,\mathbb{E}\big[\big( g(Z)^3 + \alpha g(Z)^2 + g(Z)(\beta + 3\sigma^2) \big)'''\big] \cdot a_1 \otimes a_1 \otimes a_1 + \mathbb{E}[1 - f]\,\mathbb{E}\big[\big( g(Z)^3 + \alpha g(Z)^2 + g(Z)(\beta + 3\sigma^2) \big)'''\big] \cdot a_2 \otimes a_2 \otimes a_2, \quad Z \sim \mathcal{N}(0, 1), \\
&= \frac{\alpha_1}{2} \cdot a_1 \otimes a_1 \otimes a_1 + \frac{\alpha_2}{2} \cdot a_2 \otimes a_2 \otimes a_2,
\end{aligned}
$$
where $(a)$ follows from the choice of $\alpha$ and $\beta$ and the fact that $w \perp \{a_1, a_2\}$, and $\alpha_1 = \alpha_2 = \mathbb{E}\big[\big( g(Z)^3 + \alpha g(Z)^2 + g(Z)(\beta + 3\sigma^2) \big)'''\big]$. Similarly, if we define $S_2(y) = y^2 + \gamma y$, we obtain that
$$ T_2 = \mathbb{E}[S_2(y) \cdot S_2(x)] = \alpha_1' \cdot a_1 \otimes a_1 + \alpha_2' \cdot a_2 \otimes a_2, $$
where $\alpha_1' = \alpha_2' = \mathbb{E}[g(Z) g'(Z)]$ for $Z \sim \mathcal{N}(0, 1)$.

C.4 Proof of Theorem 2

Proof. Our task is to first show that the assumptions in Appendix B.2 hold globally in our setting. Recall that
$$ Q(w|w_t) = \mathbb{E}_{p_{w^\ast}(x,y)}\big[ p_1(x, y, w_t) \cdot (w^\top x) - \log(1 + e^{w^\top x}) \big], $$
where
$$ p_1(x, y, w_t) = \frac{f(w_t^\top x)\,\mathcal{N}(y|g(a_1^\top x), \sigma^2)}{f(w_t^\top x)\,\mathcal{N}(y|g(a_1^\top x), \sigma^2) + (1 - f(w_t^\top x))\,\mathcal{N}(y|g(a_2^\top x), \sigma^2)}. \tag{18} $$

For simplicity we drop the subscript in the above expectation with respect to the distribution $p_{w^\ast}(x, y)$. Now we verify each of the assumptions.

• Convexity of $\Omega$ easily follows from its definition.

• We have that
$$ Q(w|w^\ast) = \mathbb{E}\big[ p_1(x, y, w^\ast) \cdot (w^\top x) - \log(1 + e^{w^\top x}) \big]. $$

Note that the strong concavity of $Q(\cdot|w^\ast)$ is equivalent to the strong convexity of $-Q(\cdot|w^\ast)$. Denoting the sigmoid function by $f$, we have that for all $w \in \Omega$,
$$
\begin{aligned}
-\nabla^2 Q(w|w^\ast) &= \mathbb{E}\big[ f'(w^\top x) \cdot x x^\top \big] \\
&\overset{\text{(Stein's lemma)}}{=} \mathbb{E}\big[ f'''(w^\top x) \big] \cdot w w^\top + \mathbb{E}[f'(w^\top x)] \cdot I \\
&= \mathbb{E}[f'''(\|w\| Z)] \cdot w w^\top + \mathbb{E}[f'(\|w\| Z)] \cdot I, \quad Z \sim \mathcal{N}(0, 1). \tag{a}
\end{aligned}
$$
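As a numerical sanity check of the resulting strong-concavity bound, the sketch below evaluates the two eigenvalue branches of this Hessian over $\|w\| = \alpha \in [0, 1]$ by Gauss-Hermite quadrature. The derivative formulas for the sigmoid are standard, and the output is the constant $\lambda \approx 0.14$ referred to later in the proofs; this is an illustrative computation of ours, not part of the original argument.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Eigenvalues of E[f'(aZ)] I + E[f'''(aZ)] w w^T with ||w|| = a are
# E[f'(aZ)] and E[f'(aZ)] + a^2 E[f'''(aZ)]; take the worst case over a in [0, 1].
z, wts = hermegauss(200)
wts = wts / np.sqrt(2 * np.pi)
sig = lambda t: 1.0 / (1.0 + np.exp(-t))
f1 = lambda t: sig(t) * (1 - sig(t))                       # sigmoid'
f3 = lambda t: f1(t) * (1 - 6 * sig(t) + 6 * sig(t) ** 2)  # sigmoid'''

lam = min(
    min(np.sum(wts * f1(a * z)),
        np.sum(wts * f1(a * z)) + a ** 2 * np.sum(wts * f3(a * z)))
    for a in np.linspace(0.0, 1.0, 101)
)
print(lam)   # roughly 0.14
```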


Lemma 3. Suppose that $f$ is a $\lambda$-strongly convex function on $\Omega$ for some $\lambda > 0$ and $B$ is an $L$-Lipschitz continuous function on $\Omega$. Let $w_f = \operatorname{argmin}_{w \in \Omega} f(w)$ and $w_{f+B} = \operatorname{argmin}_{w \in \Omega} f(w) + B(w)$. Then
$$ \|w_f - w_{f+B}\| \leq \frac{L}{\lambda}. $$

Proof. Let $w' \in \Omega$ be such that $\|w' - w_f\| > \frac{L}{\lambda}$, and let $w_\alpha = \alpha w_f + (1 - \alpha) w'$ for $0 < \alpha < 1$. From the fact that $w_f$ is the minimizer of $f$ on $\Omega$ and that $f$ is strongly convex, we have that
$$ f(w') \geq f(w_f) + \frac{\lambda}{2}\|w' - w_f\|^2. $$
Furthermore, the strong convexity of $f$ implies that
$$
\begin{aligned}
f(w_\alpha) &\leq \alpha f(w_f) + (1 - \alpha) f(w') - \frac{\alpha(1-\alpha)\lambda}{2}\|w' - w_f\|^2 \\
&= f(w') + \alpha\big(f(w_f) - f(w')\big) - \frac{\alpha(1-\alpha)\lambda}{2}\|w' - w_f\|^2 \\
&\leq f(w') - \alpha\,\frac{\lambda\|w' - w_f\|^2}{2} - \frac{\alpha(1-\alpha)\lambda}{2}\|w' - w_f\|^2 \\
&= f(w') - \lambda\alpha\Big(1 - \frac{\alpha}{2}\Big)\|w' - w_f\|^2. \tag{43}
\end{aligned}
$$
Since $B$ is $L$-Lipschitz and $\|w_\alpha - w'\| = \alpha\|w' - w_f\|$, we have
$$ B(w_\alpha) \leq B(w') + L\alpha\,\|w' - w_f\|. \tag{44} $$
Adding (43) and (44), we get
$$
\begin{aligned}
f(w_\alpha) + B(w_\alpha) &\leq f(w') + B(w') + L\alpha\,\|w' - w_f\| - \lambda\alpha\Big(1 - \frac{\alpha}{2}\Big)\|w' - w_f\|^2 \\
&= f(w') + B(w') + \alpha\lambda\,\|w' - w_f\|\left[ \frac{L}{\lambda} - \Big(1 - \frac{\alpha}{2}\Big)\|w' - w_f\| \right].
\end{aligned}
$$
By the assumption that $\|w' - w_f\| > \frac{L}{\lambda}$, the term $\frac{L}{\lambda} - (1 - \frac{\alpha}{2})\|w' - w_f\|$ is negative for sufficiently small $\alpha$. This in turn implies that $f(w_\alpha) + B(w_\alpha) < f(w') + B(w')$ for such $\alpha$. Consequently $w'$ is not a minimizer of $f + B$ for any $w'$ such that $\|w' - w_f\| > \frac{L}{\lambda}$. The conclusion follows.

We are now ready to prove Theorem 3.

Proof. Let the true parameters be $a_1$, $a_2$ and $w$. Recall that the posterior probability in the EM step when we know the regressors exactly is given by

2 f (w⊤ x)N (y|a⊤ 1 x, σ ) . ⊤ 2 ⊤ 2 f (w⊤ x)N (y|a⊤ 1 x, σ ) + (1 − f (w x))N (y|a2 x, σ )

If we define
$$ Q(w|w_t) \triangleq \mathbb{E}\big[ p_1(x, y, w_t, a_1, a_2)\,(w^\top x) - \log(1 + e^{w^\top x}) \big], \qquad \hat{Q}(w|\hat{w}_t) \triangleq \mathbb{E}\big[ p_1(x, y, \hat{w}_t, \hat{a}_1, \hat{a}_2)\,(w^\top x) - \log(1 + e^{w^\top x}) \big], $$
we have that the EM operators $M$ and $\hat{M}$ are given by
$$ w_{t+1} = \operatorname*{arg\,max}_{w \in \Omega} Q(w|w_t) = M(w_t), \qquad \hat{w}_{t+1} = \operatorname*{arg\,max}_{w \in \Omega} \hat{Q}(w|\hat{w}_t) = \hat{M}(\hat{w}_t). $$
Since both $Q(\cdot|w_t)$ and $\hat{Q}(\cdot|\hat{w}_t)$ are strongly concave functions over $\Omega$ with strong-concavity parameter $\lambda$, which approximately equals $0.14$ as derived in (19), Lemma 3 implies that
$$ \|w_{t+1} - \hat{w}_{t+1}\| \leq \frac{L}{\lambda}, $$
where $L$ is the Lipschitz constant of the function $l(\cdot) \triangleq Q(\cdot|w_t) - \hat{Q}(\cdot|\hat{w}_t)$. We have that
$$ l(w) = \mathbb{E}\big[ \big( p_1(x, y, w_t, a_1, a_2) - p_1(x, y, \hat{w}_t, \hat{a}_1, \hat{a}_2) \big)\,(w^\top x) \big]. $$
Since $l$ is linear in $w$, it suffices to show that
$$ \big\| \mathbb{E}\big[ \big( p_1(x, y, w_t, a_1, a_2) - p_1(x, y, \hat{w}_t, \hat{a}_1, \hat{a}_2) \big) \cdot x \big] \big\| \leq L, $$
or equivalently,
$$ \mathbb{E}\big[ \big( p_1(x, y, \hat{w}_t, \hat{a}_1, \hat{a}_2) - p_1(x, y, w_t, a_1, a_2) \big)\,\langle x, \tilde{\Delta}\rangle \big] \leq L\,\|\tilde{\Delta}\|, \quad \forall\, \tilde{\Delta} \in \mathbb{R}^d.
$$

⊤ ⊤ ˆ 1 − a1 , ∆2 = a ˆ 2 − a2 and ∆3 = w ˆ t − wt . Let Define ∆⊤ = (∆⊤ 1 , ∆2 , ∆3 ), where ∆1 = a ⊤ ⊤ ⊤ ⊤ ⊤ ⊤ ⊤ (a1u , a2u , w u ) = (a1 , a2 , w t ) + u∆ and f (u) , p1 (x, y, w u , a1u , a2u ) for u ∈ [0, 1]. Thus, Z 1 ˆ t, a ˆ 1, a ˆ 2 ) − p1 (x, y, wt , a1 , a2 ) = f (1) − f (0) = f ′ (u)du p1 (x, y, w 0 Z 1 Z 1 h∇a2u p1 (x, y, w u , a1u , a2u ), ∆2 idu h∇a1u p1 (x, y, wu , a1u , a2u ), ∆1 idu + = 0 0 Z 1 h∇wu p1 (x, y, w u , a1u , a2u ), ∆3 idu + 0

Denoting

2 f (w⊤ x), N (y|a⊤ 1u x, σ )

2 N (y|a⊤ 2u x, σ )

by f, N1 and N2 for simplicity, we have that   f (1 − f )N1 N2 y − g(a⊤ 1u x) ∇a1u p1 (x, y, w u , a1u , a2u ) = g′ (a⊤ 1u x) · x, σ2 (f N1 + (1 − f )N2 )2   y − g(a⊤ f (1 − f )N1 N2 2u x) ∇a2u p1 (x, y, w u , a1u , a2u ) = g′ (a⊤ 2u x) · x, σ2 (f N1 + (1 − f )N2 )2 f (1 − f )N1 N2 · x. ∇wu p1 (x, y, wu , a1u , a2u ) = (f N1 + (1 − f )N2 )2 and

Thus we have h i ˜ ˆ t, a ˆ 1, a ˆ 2 ) − p1 (x, y, wt , a1 , a2 )) hx, ∆i E (p1 (x, y, w    Z 1  y − g(a⊤ f (1 − f )N1 N2 1u x) ′ ⊤ ˜ E = g (a1u x)hx, ∆1 ihx, ∆i σ2 (f N1 + (1 − f )N2 )2 0       f (1 − f )N1 N2 y − g(a⊤ f (1 − f )N1 N2 ′ ⊤ 2u x) ˜ ˜ g (a x)hx, ∆ ihx, ∆i + E hx, ∆ ihx, ∆i du. +E 2 3 2u σ2 (f N1 + (1 − f )N2 )2 (f N1 + (1 − f )N2 )2

∆ f (1−f )N1 N2 1 2 ≤ ε1 and k∆3 k ≤ ε2 . Since (f N By assumption we have that ∆ ≤ 14 , if σ2 , σ2 +(1−f )N )2 1

2

|g′ (·)| ≤ O(1), by Cauchy-Schwarz we have that      

y − g(a⊤ f (1 − f )N1 N2 1u x) ′ ⊤ ˜

∆1 /σ 2 , k∆k ˜ , g (a x)hx, ∆ ihx, ∆i ≤ O E 1 1u σ2 (f N1 + (1 − f )N2 )2 34

and similarly,      

f (1 − f )N1 N2 y − g(a⊤ ′ ⊤ 2 2u x) ˜

˜ g (a2u x)hx, ∆2 ihx, ∆i ≤ O ∆2 /σ , k∆k . E σ2 (f N1 + (1 − f )N2 )2

Also,

E



f (1 − f )N1 N2

(f N1 + (1 − f )N2 )

 ˜ ˜ 2 hx, ∆3 ihx, ∆i ≤ k∆3 k k∆k.

Together, we have that the Lipschitz-constant L is O(max{ε1 , ε2 }) and this concludes the proof.

C.6 Proof of Theorem 4

Proof. In addition to the assumptions of Appendix B.2, if we can ensure that the map $-Q(\cdot|w^\ast)$ is $\mu$-smooth, then the result follows from Theorem 3 of [BWY17] with the choice $\alpha_0 = \frac{2}{\mu + \lambda}$, where $\lambda$ is the strong-convexity parameter of $-Q(\cdot|w^\ast)$. The strong convexity is already established in Appendix C.4. To find the smoothness parameter, note that
$$
\begin{aligned}
-\nabla^2 Q(w|w^\ast) &= \mathbb{E}\big[ f'(w^\top x) \cdot x x^\top \big] = \mathbb{E}\big[ f'''(w^\top x) \big] \cdot w w^\top + \mathbb{E}[f'(w^\top x)] \cdot I \\
&= \mathbb{E}[f'''(\|w\| Z)] \cdot w w^\top + \mathbb{E}[f'(\|w\| Z)] \cdot I, \quad Z \sim \mathcal{N}(0, 1) \\
&\preceq \sup_{0 \leq \alpha \leq 1} \min\big( \mathbb{E}[f'(\alpha Z)],\ \mathbb{E}[f'(\alpha Z)] + \alpha^2\,\mathbb{E}[f'''(\alpha Z)] \big) \cdot I = \underbrace{0.25}_{\mu} \cdot I.
\end{aligned}
$$
The contraction parameter is then given by
$$ \rho_\sigma = 1 - \frac{2\lambda + 2\gamma_\sigma}{\mu + \lambda}. $$
Since $\gamma_\sigma \to 0$ as $\sigma \to 0$, we have $\rho_\sigma < 1$ whenever $\sigma < \sigma_0$ for some constant $\sigma_0$.
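The corresponding gradient EM update with this step size is straightforward to implement. The following is a minimal finite-sample sketch for the 2-MoE gating parameter with linear experts; the step size $2/(\mu + \lambda)$ uses the constants from the proof above, the projection onto $\Omega$ is norm clipping, and the names are ours.

```python
import numpy as np

def gradient_em_step(w, X, y, a1, a2, sigma, step=2.0 / (0.25 + 0.14)):
    """One gradient EM step: w <- Proj_Omega( w + step * grad_w Q(w|w) ),
    with the population expectation replaced by a sample average."""
    f = 1.0 / (1.0 + np.exp(-(X @ w)))
    n1 = np.exp(-(y - X @ a1) ** 2 / (2 * sigma ** 2))
    n2 = np.exp(-(y - X @ a2) ** 2 / (2 * sigma ** 2))
    p1 = f * n1 / (f * n1 + (1 - f) * n2)          # E-step posterior
    grad = X.T @ (p1 - f) / len(y)                 # gradient of the sample Q at w
    w_new = w + step * grad
    nrm = np.linalg.norm(w_new)
    return w_new if nrm <= 1.0 else w_new / nrm    # project onto ||w|| <= 1
```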

D Proofs of Section 4

D.1 Proof of Theorem 5

Proof. Similar to the proof of Theorem 1, we first prove the theorem when $g$ is the identity function, i.e.
$$ P_{y|x} = \sum_{i \in [k]} p_i(x)\, P_{y|x, i} = \sum_{i \in [k]} \frac{e^{w_i^\top x + b_i}}{\sum_{j \in [k]} e^{w_j^\top x + b_j}} \cdot \mathcal{N}(y|a_i^\top x, \sigma^2), \quad x \sim \mathcal{N}(0, I_d). $$
Thus we have that
$$ \mathbb{E}[y^3|x] = \sum_{i \in [k]} p_i(x)\big( (a_i^\top x)^3 + 3(a_i^\top x)\sigma^2 \big), \qquad \mathbb{E}[y|x] = \sum_{i \in [k]} p_i(x)\,(a_i^\top x). $$
Hence
$$ \mathbb{E}[y^3 - 3y(1 + \sigma^2)|x] = \sum_{i \in [k]} p_i(x)\big( (a_i^\top x)^3 - 3(a_i^\top x) \big). $$
If we let $S_3(y) \triangleq y^3 - 3y(1 + \sigma^2)$, we get
$$ \mathbb{E}[S_3(y) \cdot S_3(x)] = \sum_{i \in [k]} \mathbb{E}\Big[ \nabla_x^{(3)}\Big( p_i(x)\big( (a_i^\top x)^3 - 3(a_i^\top x) \big) \Big) \Big]. $$
Since $x \sim \mathcal{N}(0, I_d)$ and $a_i \perp \operatorname{span}\{w_1, \ldots, w_{k-1}\}$, we have that $a_i^\top x \perp (w_1^\top x, \ldots, w_{k-1}^\top x)$. Moreover, $\mathbb{E}[(a_i^\top x)^3 - 3(a_i^\top x)] = \mathbb{E}[(a_i^\top x)^2 - 1] = \mathbb{E}[a_i^\top x] = 0$ for each $i$. Using the chain rule for multi-derivatives, the above equation thus simplifies to
$$ \mathbb{E}[S_3(y) \cdot S_3(x)] = \sum_{i \in [k]} \mathbb{E}[p_i(x)] \cdot \mathbb{E}\Big[ \nabla_x^{(3)}\big( (a_i^\top x)^3 - 3(a_i^\top x) \big) \Big] = \sum_{i \in [k]} 6\,\mathbb{E}[p_i(x)] \cdot a_i \otimes a_i \otimes a_i. $$
For a generic $g: \mathbb{R} \to \mathbb{R}$ which is $(\alpha, \beta, \gamma)$-valid, let $S_3(y) = y^3 + \alpha y^2 + \beta y$. Then it is easy to see that the same proof goes through, except for a change in the coefficients of the rank-1 terms, i.e.
$$ \mathbb{E}[S_3(y) \cdot S_3(x)] = c_{g,\sigma} \sum_{i \in [k]} \mathbb{E}[p_i(x)] \cdot a_i \otimes a_i \otimes a_i, $$
where $c_{g,\sigma} = \mathbb{E}\big[\big( g(Z)^3 + \alpha g(Z)^2 + g(Z)(\beta + 3\sigma^2) \big)'''\big]$ for $Z \sim \mathcal{N}(0, 1)$ and $'''$ denotes the third derivative with respect to $Z$. Condition 1 ensures that $c_{g,\sigma} \neq 0$. Since $\mathbb{E}[p_i(x)] > 0$, the coefficients of the rank-1 terms in $T_3$ all have the same sign. The proof for $T_2$ is similar.
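For reference, here is a schematic of how such a pair $(T_2, T_3)$ is typically turned into estimates of the regressors: the standard whitening-plus-tensor-power-method recipe of [AGH+14]. This is only an illustrative sketch of ours, not the exact algorithm of the paper, and it assumes the rank-1 coefficients of $T_2$ are positive.

```python
import numpy as np

def recover_regressors(T2, T3, k, n_iter=100, seed=0):
    """Whiten T3 with T2, extract k components by power iteration with
    deflation, and un-whiten; returns unit vectors aligned with the a_i
    (up to sign and the usual scaling ambiguity)."""
    rng = np.random.default_rng(seed)
    U, S, _ = np.linalg.svd(T2)
    W = U[:, :k] / np.sqrt(S[:k])                    # whitening map: W^T T2 W = I_k
    T = np.einsum('abc,ai,bj,ck->ijk', T3, W, W, W)  # whitened third-order tensor
    comps = []
    for _ in range(k):
        v = rng.standard_normal(k)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):                      # symmetric power iteration
            v = np.einsum('ijk,j,k->i', T, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)   # deflate
        a = np.linalg.pinv(W.T) @ v                      # map back to R^d
        comps.append(a / np.linalg.norm(a))
    return np.stack(comps)
```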

D.2 Proof of Theorem 6

Proof. The proof strategy is very similar to that of Theorem 2. Our task is to show that the assumptions of Appendix B.2 hold globally in our setting. The domain Ω is clearly convex since Ω = {w = (w 1 , . . . , wk−1 ) : kwi k ≤ 1, ∀i ∈ [k − 1]} . Now we verify Assumption 2. The function Q(.|w t ) is given by    X (i) X ⊤  Q(w|w t ) = E  pwt (w⊤ ewi x  , i x) − log 1 + i∈[k−1]

i∈[k−1]

(i)

where pwt , P [z = i|x, y, wt ] corresponds to the posterior probability for the ith expert, given by (i) pw t

2 pi,t (x)N (y|g(a⊤ i x), σ ) , =P ⊤ 2 j∈[k] pj,t (x)N (y|g(aj x), σ )

⊤x

pi,t (x) =

e(wt )i 1+

P

(w t )⊤ j x j∈[k−1] e

.

Throughout we follow the convention that wk = 0. Thus the gradient of Q with respect to the ith gating parameter wi is given by    ⊤x w i e (i)  · x , i ∈ [k − 1]. ∇wi Q(w|wt ) = E pwt − P ⊤ 1 + j∈[k−1] ewj x 36

(2)

Thus the (i, j)th block of the negative Hessian −∇w Q(w|w∗ ) ∈ Rd(k−1)×d(k−1) is given by ( E[pi (x)(1 − pi (x)) · xx⊤ ], j = i ∗ , −∇wi ,wj Q(w|w ) = E[−pi (x)pj (x) · xx⊤ ], j 6= i where pi (x) =

⊤x

1+

w P e i

w⊤ x j∈[k−1]e j

(45)

(2)

. It is clear from (45) that −∇w Q(w|w ∗ ) is positive semi-definite.

Since we are interested in the strong convexity of −Q(·|w ∗ ) which is equivalent to positive definiteness of the negative Hessian, it suffices to show that   (2) λ , inf λmin −∇w Q(w|w∗ ) > 0. w∈Ω

Since the Hessian is continuous with respect to w and consequently the minimum eigenvalue of it, there exists a w′ ∈ Ω such that     (2) (2) λ = λmin −∇w′ Q(w ′ |w∗ ) = inf a⊤ −∇w′ Q(w′ |w∗ ) a, kak=1

⊤ d(k−1) . In view of (45), the above equation can be further simplified ⊤ where a = (a⊤ 1 , . . . , ak−1 ) ∈ R to

λ = inf E[a⊤ x Mx ax ],

(46)

kak=1

⊤ k−1 and M is given by ⊤ where ax = (a⊤ x 1 x, . . . , ak−1 x) ∈ R ( pi (x)(1 − pi (x)), i = j Mx (i, j) = −pi (x)pj (x), i 6= j

diagoMx a∗x ]. For each Let the infimum in (46) is attained by a∗ , i.e. λ = E[(a∗x )⊤ P  x, Mx isstrictly  P nally dominant since |Mx (i, i)| = pi (x)(1−pi (x)) = pi (x) p (x) > p (x) p (x) = i j6=i,j∈[k] j j6=i,j∈[k−1] j P ∗ ∗ ∗ ⊤ j6=i M (i, j). Thus Mx is positive-definite and (ax ) Mx ax > 0 whenever ax 6= 0. Since x follows a ∗ continuous distribution it follows that ax 6= 0 with probability 1 and thus λ = E[(a∗x )⊤ Mx a∗x ] > 0. Now it remains to show that Assumption 3 too holds, i.e. k∇Q(M (w)|w ∗ ) − ∇Q(M (w)|w)k ≤ γ kw − w∗ k . ⊤ ⊤ d(k−1) . We will show that Note that w = (w⊤ 1 , . . . , w k−1 ) ∈ R

k(∇Q(M (w)|w ∗ ))i − (∇Q(M (w)|w))i k ≤ γσ kw − w∗ k ,

i ∈ [k − 1],

where (∇Q(M (w)|w))i ∈ Rd refers to the ith block of the gradient and γσ → 0. Observe that i h (i) (i) (∇Q(M (w)|w ∗ ))i − (∇Q(M (w)|w))i = E (pw − pw∗ ) · x

⊤ ∗ ⊤ Let ∆ = w − w∗ and correspondingly ∆ = (∆⊤ 1 , . . . , ∆k−1 ) where ∆i = w i − w i . Thus it suffices to show that



(i) (i)

E[(pw − pw∗ ) · x] ≤ γσ k∆k .


Or equivalently, (i) (i) ˜ ≤ γσ k∆k k∆k, ˜ E[(pw − pw∗ )hx, ∆i]

˜ ∈ Rd . ∀∆

We consider the case i = 1. The proof for the other cases is similar. Recall that (1)

pw = P

⊤x

2 p1 (x)N (y|g(a⊤ 1 x), σ ) , ⊤ 2 j∈[k] pj (x)N (y|g(a j x), σ )

pi (x) =

ewi 1+

P



wj x j∈[k−1] e

,

i ∈ [k − 1].

2 For simplicity we define Ni = N (y|g(a⊤ 1 x), σ ). It is straightforward to verify that ( pi (x)(1 − pi (x)) · x, j = i ∇wj pi (x) = −pi (x)pj (x) · x, j 6= i

Thus (1) ∇w1 (pw )

= ∇w 1 P

p1 (x)N1 PN

i=1 pi (x)Ni

!

  P p1 (x)(1 − p1 (x))N1 − p1 (x)N1 − j6=1 pj (x)p1 (x)Nj + p1 (x)(1 − p1 (x))N1 = ·x 2 P N p (x)N i i i=1 P  p1 (x)N1 j≥2 pj (x)Nj ·x = P 2 N p (x)N i i i=1 N i=1 pi (x)Ni



, R1 (x, y, w, σ) · x

Similarly, p1 (x)pi (x)N1 Ni (1) ∇wi (pw ) = P 2 · x, N p (x)N i i=1 i

i 6= 1,

, Ri (x, y, w, σ) · x. (1)

Let wu , w∗ + u∆, u ∈ [0, 1] and f (u) , pwu . Thus (1) pw



(1) pw ∗

Z

1

f ′ (u)du 0   Z 1 X (1)  h∇wi (pwu ), ∆i i du =

= f (1) − f (0) =

0

=

i∈[k−1]

X Z

i∈[k−1] 0


1

Ri (x, y, w, σ)hx, ∆i idu.

So we get (1)

X Z

(1)

˜ = E[(pw − pw∗ )hx, ∆i] ≤ ≤ ≤

1

i∈[k−1] 0

X Z

˜ E[Ri (x, y, w u , σ)hx, ∆i ihx, ∆i]du

i∈[k−1] 0

1q

X Z

1p

i∈[k−1] 0

X Z

i∈[k−1] 0



= |

1p

X Z

˜ 2 ]du E[Ri (x, y, w u , σ)2 ]E[hx, ∆i i2 hx, ∆i E[Ri (x, y, w u , σ)2 ] E[Ri (x, y, w u , σ)2 ]

1p

i∈[k−1] 0

√ √

 ˜ du 3 k∆i k k∆k  ˜ du 3 k∆k k∆k 

E[Ri (x, y, wu , σ)2 ]du {z

}

(1)

γσ

√

 ˜ 3 k∆k k∆k

Now our goal is to show that E[Ri (x, y, w u , σ)2 ] → 0 as σ → 0. For i = 1, we have 2



 R1 (x, y, wu , σ) =  Similarly,

P

2

j≥2 p1 (x)pj (x)N1 Nj  P 2  N p (x)N i i i=1

2

Ri (x, y, w u , σ) ≤



2



X  p1 (x)pj (x)N1 Nj  X  p1 (x)pj (x)N1 Nj 2 ≤k  P 2  ≤ k (p1 (x)N1 + pj (x)Nj )2 N j≥2 j≥2 p (x)N i i i=1

p1 (x)pi (x)N1 Ni (p1 (x)N1 + pi (x)Ni )2

2

,

∀i 6= 1, i ∈ [k − 1].

For w = w u and i 6= 1, we have that 2 (y−g(a⊤ 1 x))

2 (y−g(a⊤ i x))

2σ 2 2σ 2 1 p1 (x)pi (x)N1 Ni ew 1 x ewi x e− e− ≤ =   2 2 ⊤ ⊤ 2 2 (y−g(a x)) x)) (y−g(a (p1 (x)N1 + pi (x)Ni ) 4 i 1 ⊤ ⊤ 2σ 2 2σ 2 + ewi x e− ew1 x e−

=









ew 1 x ewi x e ⊤ ew1 x

+

2 ⊤ 2 (y−g(a⊤ 1 x)) −(y−g(ai x)) 2σ 2

⊤ ewi x e

σ→0

(y−g(a⊤ x))2 −(y−g(a⊤ x))2 1 i 2σ 2

2

−−−→ 0. Thus,R by Dominated Convergence Theorem, E[Ri (x, y, wu , σ)2 ] → 0 for each u ∈ [0, 1]. To show 1 that 0 E[Ri (x, y, wu , σ)2 ]du → 0, we can now follow the same analysis as in the proof of Theorem 2 (1)

from (20) on-wards (replacing w there with w1 − wi ) which ensures that γσ in our case converges (1) (k−1) to zero. Similarly for other i ∈ [k − 1], we get that γ (i) → 0. Taking γσ = γσ + . . . + γσ and κσ = γλσ completes the proof.


D.3 Proof of Theorem 7

Proof. Since $a_i$ is orthogonal to the set of gating parameters $S_i = \cup_{m=1}^{p} \{w_{1|(i_1, \ldots, i_{m-1})}\}$ along the path leading to the leaf node $i = (i_1, \ldots, i_p)$, similarly to the proof of Theorem 5 we have that
$$ T_3 = \mathbb{E}[S_3(y) \cdot S_3(x)] = c_{g,\sigma} \sum_{i = (i_1, \ldots, i_p)} \mathbb{E}[g_i(x)] \cdot a_i \otimes a_i \otimes a_i, $$
where $c_{g,\sigma} = \mathbb{E}\big[\big( g(Z)^3 + \alpha g(Z)^2 + g(Z)(\beta + 3\sigma^2) \big)'''\big]$ for $Z \sim \mathcal{N}(0, 1)$ and $'''$ denotes the third derivative with respect to $Z$. Condition 1 ensures that $c_{g,\sigma} \neq 0$. Since $\mathbb{E}[g_i(x)] > 0$, all the coefficients of the rank-1 terms have the same sign. The proof for $T_2$ is similar.

D.4 Proof of Theorem 8

Proof. This proof very closely follows that of Theorem 6. The domain Ω is clearly convex by definition. To establish the strong concavity of Q(w|w∗ ), note that Q(w|w t ) =

p X

X

m=1 (i1 ,...,im−1 )

   i h (t) (t) ⊤ x − log 1 + exp w x . E h(i1 ,...,im−1 ) h1|(i1 ,...,im−1 ) · w⊤ 1|(i1 ,...,im−1 ) 1|(i1 ,...,im−1 )

Thus,   i  h (t) (t) ·x . ∇w1|(i1 ,...,im−1 ) Q(w|w t ) = E h(i1 ,...,im−1 ) h1|(i1 ,...,im−1 ) − σ w⊤ 1|(i1 ,...,im−1 ) x

It is easy to see that ∇w1|(i′ ,...,i′ Hence,

1

m−1 )

∇w1|(i1 ,...,im−1 ) Q(w|w t ) = 0 for any (i′1 , . . . , i′m−1 ) 6= (i1 , . . . , im−1 ).

−H(i1 ,...,im−1 ) , ∇2w1|(i

1 ,...,im−1

i  h  ⊤ ⊤ ′ , w x · xx Q(w|w ) =E σ t 1|(i1 ,...,im−1 ) ) < 0.14I,

p

p

where the last inequality follows from a similar analysis in (19). Thus the Hessian H ∈ R(2 −1)d×(2 −1)d is a block diagonal matrix with each block H(i1 ,...,im−1 ) ∈ Rd×d being a negative definite matrix. Thus −Q(·|w ∗ ) is strongly convex with the strong convexity parameter λ = 0.14. Now we show that Assumption 3 holds in this setting. To ensure the same, we need to show that k∇Q(M (w)|w) − ∇Q(M (w)|w ∗ )k ≤ γσ kw − w∗ k ,

where M refers to the EM operator and γσ is a suitably chosen small constant less than 0.14. Denoting M (w) by w′ , we have    ∇w1|(i1 ,...,im−1 ) Q(w′ |w) − ∇w1|(i1 ,...,im−1 ) Q(w′ |w∗ ) = E h(i1 ,...,1) (w) − h(i1 ,...,1) (w∗ ) · x h i  − E h(i1 ,...,im−1 ) (w) − h(i1 ,...,im−1 ) (w ∗ ) σ(x⊤ w′1|(i1 ,...,im−1 ) ) · x Thus it suffices to show that

  

E h(i ,...,i ) (w) − h(i ,...,i ) (w ∗ ) · x ≤ γσ kw − w∗ k , m m 1 1 40

∀(i1 , . . . , im ) ∈ {1, 2}m , m ∈ [p].

Or equivalently, E

h

i  ˜ ≤ γσ k∆k k∆k, ˜ h(i1 ,...,im ) (w) − h(i1 ,...,im ) (w ∗ ) hx, ∆i

˜ ∈ Rd , ∀∆

where ∆ = w − w∗ . Fix m ∈ [p]. Recall that w corresponds to the collection of all gating parameters. We denote ∆(i1 ,...,ik−1 ) = w 1|(i1 ,...,ik−1 ) − w∗1|(i1 ,...,ik−1 ) . Define the functionf : [0, 1] → [0, 1] by wu = w∗ + u∆, u ∈ [0, 1].

f (u) = h(i1 ,...,im ) (w u ), Thus, h(i1 ,...,im ) (w) − h(i1 ,...,im ) (w ∗ ) = f (1) − f (0) = =

Z

1

f ′ (u)du

0

X

X

k∈[p] i′1 ,...,i′k−1

Hence h i  ˜ = E h(i1 ,...,im ) (w) − h(i1 ,...,im ) (w∗ ) hx, ∆i

X

k,i′1 ,...,i′k−1

Z

1 0

Z

0

1

h∇w1|(i′ ,...,i′ 1

k−1

 E h∇w1|(i′ ,...,i′ 1

k−1

)

hi1 ,...,im (wu ), ∆(i′1 ,...,i′k−1 ) idu.

 ˜ ′ ′ ihx, ∆i du. h (w ), ∆ u (i1 ,...,ik−1 ) ) i1 ,...,im (47)

Recall that P gi1 . . . gim |(i1 ,...,im−1 ) im+1 ,...,ip gim+1 |(i1 ,...,im ) . . . gip |(i1 ,i2 ,...,ip−1 ) · N(i1 ,...,ip ) (x, y) P h(i1 ,...,im ) (w) = , i1 ,...,ip gi1 . . . gip |(i1 ,...,ip−1 ) · N(i1 ,...,ip ) (x, y) (48) where g1|(i′1 ,...,i′k−1 ) = σ(w ⊤ 1|(i′ ,...,i′ 1

k−1 )

x) and g2|(i′1 ,...,i′k−1 ) = 1 − σ(w⊤ 1|(i′ ,...,i′ 1

k−1 )

x). For k ≤ m, ob-

serve that the term g1|(i′1 ,...,i′k−1 ) appears only in the denominator of (48) whenever (i′1 , . . . , i′k−1 ) 6= (i1 , . . . , ik−1 ). Similarly for k > m, we have the same observation whenever (i′1 , . . . , i′k−1 ) 6= (i1 , . . . , ik−1 ). For any (i′1 , . . . , i′k−1 ), we have that   X X ∇w1|(i′ ,...,i′ )  g(i1 ,...,ip ) · N(i1 ,...,ip ) (x, y) = gi′1 . . . gi′k−1 |(i′1 ,...,i′k−2 ) (∇w1|(i′ ,...,i′ ) gik |(i′1 ,...,i′k−1 ) )· 1

k−1

i1 ,...,ip

1

ik ,...,ip



= g(i′1 ,...,i′k−1 ,1) 

X

ik+1 ,...,ip

k−1

g(i′1 ,...,i′k−1 ,ik ,...,ip ) N(i1 ,...,ip ) (x, y)    g(i′1 ,...,i′k−1 ,1,...,ip ) N(i1 ,...,ip ) (x, y) − g(i′1 ,...,i′k−1 ,2,...,ip ) N(i1 ,...,ip ) (x, y)  · 

 (1 − g1|(i′1 ,...,i′k−1 ) ) · x,

where we used the fact that ∇w1|(i′ ,...,i′ ) g1|(i′1 ,...,i′k−1 ) = g1|(i′1 ,...,i′k−1 ) (1−g1|(i′1 ,...,i′k−1 ) )·x. Notice that 1 k−1 P the expression g(i′1 ,...,i′k−1 ,1) ik+1 ,...,ip g(i′1 ,...,i′k−1 ,1,...,ip ) N(i1 ,...,ip ) (x, y) is the probability that the output y is generated by a sequence of gating functions along the path (i′1 , .. . , i′k−1 , 1) and the left sub tree rooted at the node (i′1 , . . . , i′k−1 , 1). For simplicity, we denote it as P Left sub-tree rooted at (i′1 , . . . , i′k−1 , 1) 41

dropping the dependence on x. Notice that the denominator in the expression for h(i1 ,...,im ) (w) is P [y|x] if the set of gating parameters is w. Combining these facts together we have that for any (i′1 , . . . , i′k−1 ) 6= (i1 , . . . , ik−1 ),   ∇w1|(i′ ,...,i′ ) hi1 ,...,im (w u ) −P [Tree rooted at (i1 , . . . , im )] P Left sub-tree rooted at (i′1 , . . . , i′k−1 , 1) 1 k−1 = ·x (1 − g1|(i′1 ,...,i′k−1 ) ) P [y]2   P [Tree rooted at (i1 , . . . , im )] P Right sub-tree rooted at (i′1 , . . . , i′k−1 , 1) ·x + P [y]2 If we let α = (1 − g1|(i′1 ,...,i′k−1 ) ) < 1, we have that   ˜ E h∇w1|(i′ ,...,i′ ) hi1 ,...,im (w u ), ∆(i′1 ,...,i′k−1 ) ihx, ∆i 1 k−1 " #   P [Tree rooted at (i1 , . . . , im )] αP Left sub-tree rooted at (i′1 , . . . , i′k−1 , 1) ˜ = −E hx, ∆(i′1 ,...,i′k−1 ) ihx, ∆i P [y]2 " #   P [Tree rooted at (i1 , . . . , im )] αP Right sub-tree rooted at (i′1 , . . . , i′k−1 , 1) ˜ +E hx, ∆(i′1 ,...,i′k−1 ) ihx, ∆i P [y]2 # "   P [Tree rooted at (i1 , . . . , im )] P Left sub-tree rooted at (i′1 , . . . , i′k−1 , 1) ˜ |hx, ∆(i′1 ,...,i′k−1 ) ihx, ∆i| ≤E P [y]2 " #   P [Tree rooted at (i1 , . . . , im )] P Right sub-tree rooted at (i′1 , . . . , i′k−1 , 1) ˜ +E |hx, ∆(i′1 ,...,i′k−1 ) ihx, ∆i| P [y]2   Observe that P [Tree rooted at (i1 , . . . , im )] and P Left sub-tree rooted at (i′1 , . . . , i′k−1 , 1) do not contain any common terms and also that   P [y|x] ≥ P [Tree rooted at (i1 , . . . , im )] + P Left sub-tree rooted at (i′1 , . . . , i′k−1 , 1) ,   ≥ P [Path (i1 , . . . , im , im+1 , . . . , ip )] + P Path (i′1 , . . . , i′k−1 , 1, i′k+1 , . . . , i′p ) , for any (im+1 , . . . , ip ) and (i′k+1 , . . . , i′p ). Thus, " #   P [Tree rooted at (i1 , . . . , im )] P Left sub-tree rooted at (i′1 , . . . , i′k−1 , 1) ˜ E |hx, ∆(i′1 ,...,i′k−1 ) ihx, ∆i| P [y]2   X P [Path (i1 , . . . , im , im+1 , . . . , ip )] P Path (i′1 , . . . , i′k−1 , 1, i′k+1 , . . . , i′p ) E ≤   · ′ , . . . , i′ ′ ′) 2 P [Path (i , . . . , i , i , . . . , i )] + P Path (i , 1, i , . . . , i 1 m m+1 p p 1 k−1 k+1 (im+1 ,...,ip ), (i′k+1 ,...,i′p )

≤ ≤ ≤

˜ |hx, ∆(i′1 ,...,i′k−1 ) ihx, ∆i|

r

i h ˜ 2 E[R((i1 , . . . , ip ), (i′1 , . . . , i′p ), x, y))]2 E hx, ∆(i′1 ,...,i′k−1 ) i2 hx, ∆i

X

q

E[R((i1 , . . . , ip ), (i′1 , . . . , i′p ), x, y))]2

X

q

E[R((i1 , . . . , ip ), (i′1 , . . . , 1, . . . , i′p ), x, y))]2

X

(im+1 ,...,ip ), (i′k+1 ,...,i′p )

(im+1 ,...,ip ), (i′k+1 ,...,i′p )

(im+1 ,...,ip ), (i′k+1 ,...,i′p )

42

 √

˜

, 3 ∆(i′1 ,...,i′k−1 ) k∆k √

 ˜ , 3 k∆k k∆k

where g(i1 ,...,ip ) g(i′1 ,...,i′p ) N(i1 ,...,ip ) (x, y)N(i′1 ,...,i′p ) (x, y) 1 R((i1 , . . . , ip ), (i′1 , . . . , 1, . . . , i′p ), x, y)) =  2 ≤ 4 g(i1 ,...,ip ) N(i1 ,...,ip ) (x, y) + g(i′1 ,...,i′p ) N(i′1 ,...,i′p ) σ→0

−−−→ 0,

using the same argument as in the proof of Theorem 6. Thus the Dominated convergence theorem q guarantees that E[R((i1 , . . . , ip ), (i′1 , . . . , i′p ), x, y))]2 → 0 for each fixed u ∈ [0, 1]. Uniform convergence in qu can also be established in a similar fashion to the proof of Theorem 6 and Theorem 2. Similarly E[R((i1 , . . . , ip ), (i′1 , . . . , 2, . . . , i′p ), x, y))]2 → 0. If we define γσ ((i1 , . . . , im ), (i′1 , . . . , i′k−1 ))

X

=

(i′k+1 ,...,i′p ) (im+1 ,...,ip )

+

Z

1q

E[R((i1 , . . . , ip ), (i′1 , . . . , 1, . . . , i′p ), x, y))]2 du

Z

1q

E[R((i1 , . . . , ip ), (i′1 , . . . , 2, . . . , i′p ), x, y))]2 du

0

0

(1)

and γσ (i1 , . . . , im ) = sup(i′1 ,...,i′k−1 ) γσ ((i1 , . . . , ip ), (i′1 , . . . , i′k−1 )), we have that  Z 1  X ˜ ˜ E h∇w1|(i′ ,...,i′ ) hi1 ,...,im (wu ), ∆(i′1 ,...,i′k−1 ) ihx, ∆i du ≤ γσ(1) (i1 , . . . , im ) k∆k k∆k. 0

k

1

k−1

(i′1 ,...,i′k−1 )6=(i1 ,...,ik−1 )

Suppose that k ≤ m. Then the term w1|(i1 ,...,ik−1 ) appears in both the numerator and denominator of (48). The gradient of the numerator is given by   X ∇w1|(i1 ,...,i ) g(i1 ,...,im ) g(i1 ,...,ip ) N(i1 ,...,ip ) (x, y) = ±g(i1 ,...,im ) (1 − g1|(i1 ,...,ik−1 ) ) k−1

(im+1 ,...,ip )

X

(im+1 ,...,ip )

g(i1 ,...,ip ) N(i1 ,...,ip ) (x, y) · x,

= ±P [Tree rooted at(i1 , . . . , im )] (1 − g1|(i1 ,...,ik−1 ) ) · x where the ± corresponds to the case ik = 1 or 2 respectively. Without loss of generality, let ik = 1. Then we have that P [y] P [Tree rooted at(i1 , . . . , im )] (1 − g1|(i1 ,...,ik−1 ) ) ∇w1|(i1 ,...,i ) h(i1 ,...,im ) (w) = ·x k−1 P [y]2 P [Tree rooted at(i1 , . . . , im )] P [Left tree rooted at(i1 , . . . , ik−1 )] (1 − g1|(i1 ,...,ik−1 ) ·x − P [y]2 P [Tree rooted at(i1 , . . . , im )] P [Right tree rooted at(i1 , . . . , ik−1 )] (1 − g1|(i1 ,...,ik−1 ) ) · x + . P [y]2 Thus, ∇w1|(i1 ,...,ik−1 ) h(i1 ,...,im ) (w)

P [Tree rooted at(i1 , . . . , im )] (1 − g1|(i1 ,...,ik−1 ) )

=

P [y] − P [Tree rooted at(i1 , . . . , ik−1 , 1)]

P [y]2 P [Tree rooted at(i1 , . . . , ik−1 , 2)] ·x P [y]2

43

· x+

Notice that the expression P [Tree rooted at(i1 , . . . , im )] and the expression P [y]−P [Tree rooted at(i1 , . . . , ik−1 , 1)]+ P [Tree rooted at(i1 , . . . , ik−1 , 2)] do not contain any common terms since ik = 1. A similar conclusion holds for ik = 2 with a flipped sign. Thus we can essentially mimic the same proof as in the (2) above case and define a corresponding coefficient γσ (i1 , . . . , im ) such that m Z X k=1

0

1

h i ˜ du ≤ γσ(2) (i1 , . . . , im ) k∆k k∆k. ˜ E h∇w1|(i1 ,...,ik−1 ) hi1 ,...,im (w u ), ∆(i1 ,...,ik−1 ) ihx, ∆i

For the case k ≥ m and any path (i′1 , . . . , i′k ) such that (i′1 , . . . , i′m ) = (i1 , . . . , im ), it is easy to see that the same proof as the case k ≤ m goes through since the gradient of the numerator is similar (3) with just the summation indices changed and we can define γσ (i1 , . . . , im ) suitably so that Z 1 h i X ˜ du ≤ γ (3) (i1 , . . . , im ) k∆k k∆k. ˜ E h∇w1|(i1 ,...,i ) hi1 ,...,im (wu ), ∆(i1 ,...,ik−1 ) ihx, ∆i σ 0 k≥m (i1 ,...,im ,i′m+1 ,i′k )

k−1

Together, in view of (47) we have that h i  ˜ ≤ γσ (i1 , . . . , im ) k∆k k∆k, ˜ E h(i1 ,...,im ) (w) − h(i1 ,...,im ) (w∗ ) hx, ∆i

  (1) (2) (3) where γσ (i1 , . . . , im ) = max γσ (i1 , . . . , im ), γσ (i1 , . . . , im ), γσ (i1 , . . . , im ) . Now defining γσ to be the maximum of γσ (i1 , . . . , im ) over all m ∈ [p] and all paths (i1 , . . . , im ) concludes the proof.

References

[AGH+ 14]

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res., 15(1):2773–2832, January 2014.

[AGHK13]

A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. A tensor spectral approach to learning mixed membership community models. Annual Conference on Learning Theory, 2013.

[AGJ14]

Animashree Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed nonorthogonal tensor decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180v4, 2014.

[AHK12]

A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden markov models. Annual Conference on Learning Theory, 2012.

[BCB14]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[BPM89]

T.F. Brooks, D.S. Pope, and A.M. Marcolini. Airfoil self-noise and prediction. Technical report, NASA, 1989.

[BWY17]

Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77-120, 2017.

[CBB02]

Ronan Collobert, Samy Bengio, and Yoshua Bengio. A parallel mixture of SVMs for very large scale problems. Neural Computing, 2002.

[CGCB14]

Junyoung Chung, Çağlar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. abs/1412.3555, 2014.

[CL13]

A. T. Chaganty and P. Liang. Spectral experts for estimating mixtures of linear regressions. arXiv preprint arXiv:1306.3729, 2013.

[CvMBB14] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. abs/1409.1259, 2014. [DTZ17]

Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of EM suffice for mixtures of two gaussians. arXiv preprint arXiv:1609.00368, 2017.

[GLM18]

Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. International Conference on Learning Representations, 2018.

[GM15]

Rong Ge and Tengyu Ma. Decomposing overcomplete 3rd order tensors using sum-ofsquares algorithms. abs/1504.05287, 2015.

[HKZ12]

D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.

[JJ94]

Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Comput., 6(2):181–214, March 1994.

[JJNH91]

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 1991.

[JO13]

P. Jain and S. Oh. Learning mixtures of discrete product distributions using spectral decompositions. arXiv preprint arXiv:1311.2972, 2013.

[JSA14]

Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Score function features for discriminative learning: Matrix and tensor framework. abs/1412.2863, 2014.

[JX95]

M. I. Jordan and L. Xu. Convergence results for the EM approach to mixtures of experts architectures. Neural Networks, 8(9):1409–1431, 1995.

[Lia13]

P. Liang. Partial information from spectral methods. In NIPS Spectral Learning Workshop, 2013.

[LPM15]

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

[LY17]

Yi-Cheng Liu and I-Cheng Yeh. Using mixture design and neural networks to build stock selection decision support systems. Neural Computing and Applications, 28(3):521–535, 2017.

[ME14]

Saeed Masoudnia and Reza Ebrahimpour. Mixture of experts: a literature survey. Artificial Intelligence Review, 42(2):275, 2014.

[MSS16]

Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares. abs/1610.01980, 2016.

[ND14]

Jun Wei Ng and Marc Peter Deisenroth. Hierarchical mixture-of-experts model for large-scale gaussian process regression. arXiv preprint arXiv:1412.3078, 2014.

[Pea94]

Karl Pearson. Contributions to the mathematical theory of evolution. In Philosophical Transactions of the Royal Society of London. A, pages 71-110, 1894.

[SIM14]

Yuekai Sun, Stratis Ioannidis, and Andrea Montanari. Learning mixtures of linear classifiers. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 721–729, 2014.

[SJA14]

Hanie Sedghi, Majid Janzamin, and Anima Anandkumar. Provable tensor methods for learning mixtures of classifiers. arXiv preprint arXiv:1412.3046, 2014.

[SMM+ 17]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

[Ste72]

Charles Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, volume 2, pages 583–602. University of California Press, 1972.

[SV17]

Vatsal Sharan and Gregory Valiant. Orthogonalized ALS: A theoretically principled tensor decomposition algorithm for practical use. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3095–3104, 06–11 Aug 2017.

[TB15]

Lucas Theis and Matthias Bethge. Generative image modeling using spatial lstms. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 1927–1935, Cambridge, MA, USA, 2015. MIT Press.

[Tre01]

Volker Tresp. Mixtures of gaussian processes. NIPS, 2001.

[XHM16]

Ji Xu, Daniel Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures of two gaussians. arXiv preprint arXiv:1608.07630, 2016.

[YCS16]

Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749, 2016.

[Yeh98]

I-Cheng Yeh. Modeling of strength of high performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, 1998.

[YWG12]

S. E. Yuksel, J. N. Wilson, and P. D. Gader. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1177–1193, 2012.

[ZCZJ14]

Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. arXiv preprint arXiv:1406.3824, 2014.
