Coulomb Autoencoders

Emanuele Sansone 1, Quoc-Tin Phan 2, Francesco G.B. De Natale 2

Abstract

Learning the true density in high-dimensional feature spaces is a well-known problem in machine learning. In this work, we improve the recent Wasserstein autoencoders (WAEs) by proposing Coulomb autoencoders. We demonstrate that a source of sub-optimality in WAEs is the choice of kernel function, which introduces additional local minima in the objective. To mitigate this problem, we propose to use Coulomb kernels. We show that, under some conditions on the capacity of the encoder and the decoder, global convergence in the function space can be achieved. Finally, we provide an upper bound on the generalization performance, which can be improved by increasing the capacity of the encoder and the decoder networks. The theory is corroborated by experimental comparisons on synthetic and real-world datasets against several approaches from the families of generative adversarial networks and autoencoder-based models.

1. Introduction

Deep generative models, such as generative adversarial networks and autoencoder-based models, are promising research directions for learning the underlying density of data. Each of these families has its own limitations. On one hand, generative adversarial networks are difficult to train due to the mini-max nature of the optimization problem. On the other hand, autoencoder-based models, while more stable to train, often produce samples of lower quality than generative adversarial networks. This work considers autoencoder-based models and aims to improve their performance relative to generative adversarial networks. In particular, we consider the recent WAEs (Tolstikhin et al., 2018) and show that the choice of kernel function is suboptimal in terms of local minima in the objective function. We demonstrate that Coulomb kernels are the optimal choice: if the encoding and the decoding functions have enough capacity to achieve the zero value of the objective function, we can obtain the global minimum in the function space. Furthermore, we provide an upper bound on the generalization error of the optimal solution, highlighting the fact that generalization strongly depends on the capacity of the encoding and the decoding functions. The superiority of our model is validated through extensive experimental analysis on a variety of synthetic and real-world datasets and compared to several state-of-the-art methods, including generative adversarial networks and autoencoder-based models. Code will be made available upon acceptance.

The remainder of the paper is organized as follows. Section 2 formulates the optimization problem, analyzes the properties of its objective function and provides a bound on the estimation error. Section 3 reviews the literature on recent generative models, and Section 4 discusses the experimental evaluation.

* Equal contribution. 1 Noah's Ark Lab, Huawei Technologies, London, United Kingdom. 2 Department of Information Engineering and Computer Science, University of Trento, Trento, Italy. Correspondence to: Emanuele Sansone, Quoc-Tin Phan, Francesco G.B. De Natale.

Figure 1. First monodimensional case (with a single negatively charged particle). Panels: (a) Gaussian, (b) Coulomb, (c) Configuration. (a) and (b) are the plots of the regularizer in (1) over different locations of the negative particle for the Gaussian and the Coulomb kernels, respectively. (c) shows a possible result that can be achieved by minimizing the respective functional (the middle and the bottom configurations are associated with plots (a) and (b), respectively).

2. Formulation and theoretical analysis

This section deals with the problem of density estimation. The goal is to estimate the unknown density function $p_X(x)$, whose support is $\Omega_x \subset \mathbb{R}^d$. We consider two continuous functions $f : \Omega_x \to \Omega_z$ and $g : \Omega_z \to \Omega_x$, where $\Omega_z \subseteq \mathbb{R}^h$ and $h$ is equal to the intrinsic dimensionality of $\Omega_x$. Furthermore, we require that $g(f(x)) = x$ for every $x \in \Omega_x$, namely that $g$ is the left inverse of $f$ on the domain $\Omega_x$. In this work, $f$ and $g$ are neural networks parameterized by vectors $\gamma$ and $\theta$, respectively. $f$ is called the encoding function: it takes a random input $x$ with density $p_X(x)$ and produces a random vector $z$ with density $q_Z(z)$. $g$ is the decoding function: it takes $z$ as input and produces the random vector $y$ distributed according to $q_Y(y)$. Note that $p_X(x) = q_Y(y)$, since $y = g(z) = g(f(x)) = x$ for every $x \in \Omega_x$. This is already a density estimator, but it has the drawback that in general $q_Z(z)$ cannot be written in closed form. Now, let $p_Z(z)$ be an arbitrary density with support $\Omega_z$ that has a closed form.1 Our goal is to guarantee that $q_Z(z) = p_Z(z)$ on the whole support, while maintaining $g(f(x)) = x$ for every $x \in \Omega_x$. This allows us to use the decoding function as a generator and produce samples distributed according to $p_X(x)$. Therefore, the problem of density estimation in a high-dimensional feature space is converted into a problem of estimation in a lower dimensional vector space, thus overcoming the curse of dimensionality. The objective of our minimization problem is defined as follows:

$$L(f, g) = \int_{\Omega_x} \|x - g(f(x))\|^2 \, p_X(x) \, dx + \int_{\Omega_z} \int_{\Omega_z} \phi(z) \phi(z') k(z, z') \, dz \, dz' \qquad (1)$$

where $\phi(z) = p_Z(z) - q_Z(z)$ and $k(\cdot, \cdot)$ is a kernel function. Note that the first term in (1) reaches its global minimum when $g$ is the left inverse of $f$ on the support $\Omega_x$, while the second term in (1) is globally optimal when $q_Z(z)$ equals $p_Z(z)$ (see the supplementary material for a recall of its properties). Therefore, the global minimum of (1) satisfies our initial requirements and the optimal solution corresponds to the case where $q_Y(y) = p_X(x)$.

1 In this work we consider $p_Z(z)$ as a uniform density on the hypercube $[-1, 1]^h$.

Figure 2. Second monodimensional case (with a pair of negatively charged particles). Panels: (a) Gaussian, (b) Coulomb, (c) Configuration. (a) and (b) are the plots of the regularizer in (1) over different locations of the negative particles for the Gaussian and the Coulomb kernels, respectively. (c) shows the respective solutions.

2.1. Analysis of training convergence

We start by analysing the properties of the second addend in (1) and then discuss the whole optimization problem. Throughout the analysis we assume that the encoder and the decoder networks have enough capacity to achieve the global minimum of the objective in (1). Note that the choice of kernel function is of fundamental importance for ensuring the convergence of training to the global minimum of the second addend in (1). A necessary condition to guarantee the global convergence of training is that the kernel function satisfies Poisson's equation (see Theorem 2 in Hochreiter & Obermayer, 2005). The same condition is also sufficient, as stated in the following proposition (see the supplementary material for the proof): Proposition 1. Assume that the kernel function satisfies Poisson's equation, viz. $\nabla_z^2 k(z, z') = -\delta(z - z')$, where $\delta(\cdot)$ is the delta function and $\nabla_z^2$ is the Laplacian operator computed on $z$. Then,

$$|\phi_t(z_{\max})| = \frac{1}{\sqrt{2t + (\phi_0(z_{\max}))^{-2}}} \qquad (2)$$

where $\phi_t(\cdot)$ represents $\phi(\cdot)$ at iteration $t$, while $z_{\max} = \arg\max_z \|\phi_t(z)\|$. Therefore, gradient descent-based training converges to the global minimum of the second addend in (1), and at the global minimum $\phi(z) = 0$ for all $z \in \Omega_z$. Note that the previous result is valid for gradient descent optimization performed in the function space of the encoder and is also independent of the initialization of $f$. This implies that the regularizer in (1) is free from saddle points and all local minima are global. The solution of Poisson's equation can be written in closed form (see the supplementary material), namely:

$$k(z, z') = \begin{cases} -\dfrac{1}{2\pi} \ln \|z - z'\| & h = 2 \\[2ex] \dfrac{1}{(h-2) \, S_h \, \|z - z'\|^{h-2}} & h > 2 \end{cases} \qquad (3)$$

where $S_h$ is the surface area of an $h$-dimensional unit ball. These functions are known as Coulomb kernels.2
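As an illustration of the closed form in (3), the following sketch evaluates the Coulomb kernel in pure Python. The function names `unit_sphere_area` and `coulomb_kernel` are hypothetical, the expression $S_h = 2\pi^{h/2} / \Gamma(h/2)$ is the standard formula for the surface area of the unit ball and is not stated in the paper, and the smoothing constant `eps` is an assumption based on the implementation footnote.

```python
import math


def unit_sphere_area(h):
    """Surface area S_h of the unit ball in R^h: 2 * pi^(h/2) / Gamma(h/2)."""
    return 2.0 * math.pi ** (h / 2.0) / math.gamma(h / 2.0)


def coulomb_kernel(z, z_prime, eps=1e-3):
    """Coulomb kernel of Eq. (3) for h >= 2, evaluated on a smoothed distance."""
    h = len(z)
    if h < 2:
        raise ValueError("Eq. (3) is stated for h >= 2")
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, z_prime))
    r = math.sqrt(sq_dist + eps)  # assumed smoothing; avoids the singularity at z = z'
    if h == 2:
        return -math.log(r) / (2.0 * math.pi)
    return 1.0 / ((h - 2) * unit_sphere_area(h) * r ** (h - 2))


# With h = 3 and no smoothing the kernel reduces to 1 / (4 * pi * r).
print(coulomb_kernel((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), eps=0.0))
```

With $h = 3$ and `eps=0.0` the value equals $1/(4\pi r)$, which is the familiar form of the electrostatic potential of a unit point charge.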

The first important property from (3) is that Coulomb kernels represent a generalization of Coulomb's law to any $h$-dimensional Euclidean space.3 In effect, the regularizer in (1) represents the energy function of an electrostatic system living in $\mathbb{R}^h$. Samples from $p_Z(z)$ and samples from $q_Z(z)$ can be interpreted as positively and negatively charged particles, respectively, while the Coulomb kernels induce global attraction and repulsion forces between them. As a consequence, minimizing the regularizer in (1) with respect to the location of the negatively charged particles yields a low-energy configuration where the negative particles balance the effects of the positive ones.

2 In the implementation, we substitute $\|z - z'\|$ with $\sqrt{\|z - z'\|^2 + \epsilon}$ and $\epsilon = 10^{-3}$, to avoid the singularity at $z = z'$.

3 To see this, consider that for $h = 3$ the kernel function in (3) obeys Coulomb's law exactly.

The second important property is that kernel functions different from the ones in (3) are not solutions of Poisson's equation and therefore may introduce other local optima. This includes the kernel functions used in the work on WAEs (Tolstikhin et al., 2018), namely the Gaussian and the inverse multiquadratic kernel used in the experiments.

In order to have an intuitive understanding of these two properties, we analyze the effects of using Gaussian and Coulomb kernels on two simple monodimensional cases ($h = 1$). The first example consists of three positive particles, located at $-4$, $0$ and $4$, and a single negative particle that is allowed to move freely. In this case, $p_Z(z) = \delta(z + 4) + \delta(z) + \delta(z - 4)$ and $q_Z(z) = \delta(z - z_1)$, where $z_1$ represents the variable location of the negative particle. Figure 1(a) and Figure 1(b) show the regularizer in (1) evaluated at different $z_1$ for the Gaussian and the Coulomb kernels, respectively. It is evident that the Gaussian kernel introduces new local optima: the negative particle is attracted locally to one of the positive charges without being affected by the remaining ones. On the contrary, the Coulomb kernel has only a single minimum. This minimal configuration is the best one, if one considers that all positive particles exert an attraction force on the negative one. As a result, the Coulomb kernel induces global attraction forces. The second example consists of the same three positive particles and a pair of free negative charges. In this case, $q_Z(z) = \delta(z - z_1) + \delta(z - z_2)$, where $z_1, z_2$ are the locations of the two negative particles.
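The first monodimensional example can be checked numerically with the sketch below, which evaluates the regularizer on a grid of positions $z_1$ and lists its local minima for each kernel. Two ingredients are assumptions that the text does not state: the Gaussian bandwidth `sigma = 1`, and the monodimensional Coulomb kernel $k(z, z') = -|z - z'|/2$, which is the solution of Poisson's equation for $h = 1$ (Eq. (3) only covers $h \geq 2$). All function names are hypothetical.

```python
import math

POSITIVE = (-4.0, 0.0, 4.0)  # locations of the three positive particles


def gaussian_kernel(a, b, sigma=1.0):
    # sigma is an assumed bandwidth; the text does not state one
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def coulomb_kernel_1d(a, b):
    # assumed 1-D solution of Poisson's equation: d^2 k / da^2 = -delta(a - b)
    return -abs(a - b) / 2.0


def regularizer(z1, kernel):
    """Double integral of phi(z) phi(z') k(z, z') for point charges."""
    pos_pos = sum(kernel(a, b) for a in POSITIVE for b in POSITIVE)
    pos_neg = sum(kernel(a, z1) for a in POSITIVE)
    neg_neg = kernel(z1, z1)
    return pos_pos - 2.0 * pos_neg + neg_neg


def local_minima(kernel, lo=-8.0, hi=8.0, steps=1600):
    """Grid positions where the regularizer is lower than both neighbours."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    energy = [regularizer(z, kernel) for z in grid]
    return [grid[i] for i in range(1, steps)
            if energy[i] < energy[i - 1] and energy[i] <= energy[i + 1]]


print("Gaussian minima:", local_minima(gaussian_kernel))
print("Coulomb minima: ", local_minima(coulomb_kernel_1d))
```

Under these assumptions the Gaussian kernel yields one local minimum near each positive particle, whereas the Coulomb energy reduces to a constant plus $\sum_i |a_i - z_1|$ and has a single minimum at the median of the positive charges.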
Figure 2(a) and Figure 2(b) show the regularizer in (1) evaluated at different $z_1, z_2$ for the Gaussian and the Coulomb kernels, respectively. Following the same reasoning as in the previous example, we conclude that the Coulomb kernel induces global repulsion forces.4 The solution of Poisson's equation guarantees convergence to the global minimum of the regularizer in (1); alternatives such as the Gaussian kernel cannot have this property, because the new local optima make the optimization strongly dependent on the initial conditions.

4 In this case, there is a pair of minima, corresponding to the permutation of a single configuration.

The objective in (1) requires the minimization of two addends, namely the reconstruction error $L_{REC}(f, g)$ and the distance between latent densities $L_{LAT}(f)$. By first minimizing $L_{LAT}(f)$ with respect to the encoding function $f$ to get the minimum $f^*$, and then minimizing $L_{REC}(f^*, g)$ with respect to the decoding function $g$ to get the minimum $g^*$, we can prove that if $f^*$ is invertible, then $(f^*, g^*)$ is a global minimum of the whole objective in (1). All the theoretical results so far are valid for optimization in the function space, namely when considering $f$ and $g$, and they are agnostic to the fact that neural networks are used during training. In fact, the issue of local optima may still be present due to the non-convex nature of these models. Nevertheless, we show through experiments that our strategy already outperforms WAEs. An important direction for future work is the analysis of the optimization in the parameter space.

2.2. Finite sample analysis

The integrals in (1) cannot be computed exactly, since $p_X(x)$ is unknown and $q_Z(z)$ is not defined explicitly. As a consequence, we use the unbiased estimate of (1) as a surrogate for optimization, namely:

$$\hat{L}(f, g) = \sum_{x_i \in D_x} \frac{\|x_i - g(f(x_i))\|^2}{N} + \frac{1}{S(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} - \frac{2}{SN} \sum_{z_i \in D_z} \sum_{z_j \in D_z^f} k_{i,j} + \frac{1}{N(N-1)} \sum_{\substack{z_i, z_j \in D_z^f \\ j \neq i}} k_{i,j} \qquad (4)$$

where $k_{i,j} = k(z_i, z_j)$ and $D_x = \{x_i\}_{i=1}^N$, $D_z = \{z_i\}_{i=1}^S$ and $D_z^f = \{f(x_i)\}_{i=1}^N$ are three finite sets of samples drawn from $p_X(x)$, $p_Z(z)$ and $q_Z(z)$, respectively. Note that the first term in (4) corresponds to the reconstruction error on training data. Therefore, our model can be considered as an autoencoder. Based on this fact and on the chosen kernel, we refer to our model as Coulomb autoencoder (CouAE). The following theorem provides a lower bound for the objective in (4) and a probabilistic bound on the estimation error between $\hat{L}(f, g)$ and $L(f, g)$ (proof in the supplementary material).

Theorem 1. Given the objective in (4), $h > 2$, $\Omega_z$ a compact set, $\Omega_x = [-M, M]^d$ for a positive scalar $M$, and a symmetric, continuous and positive definite kernel $k : \Omega_z \times \Omega_z \to \mathbb{R}$, where $0 \leq k(z, z') \leq K$ for all $z, z' \in \Omega_z$ with $K = k(z, z)$.

(a) There exists $\tilde{k} : \Omega_z \times \Omega_z \to \mathbb{R}$ such that $k(z, z') = \int_{\Omega_z} \tilde{k}(z, u) \tilde{k}(z', u) \, du$.

(b) The objective in (4) is equivalent to

$$\tilde{L}(f, g) = \sum_{x_i \in D_x} \frac{\|x_i - g(f(x_i))\|^2}{N} + I(\tilde{p}_Z, \tilde{q}_Z) - \frac{K}{N} - \frac{K}{S} + \frac{1}{N^2(N-1)} \sum_{\substack{z_i, z_j \in D_z^f \\ j \neq i}} k(z_i, z_j) + \frac{1}{S^2(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k(z_i, z_j)$$

where $I(\tilde{p}_Z, \tilde{q}_Z) = \int_{\Omega_z} \left( \tilde{p}_Z(u) - \tilde{q}_Z(u) \right)^2 du$, and $\tilde{p}_Z(u) = \frac{1}{S} \sum_{z_i \in D_z} \tilde{k}(z_i, u)$ and $\tilde{q}_Z(u) = \frac{1}{N} \sum_{z_i \in D_z^f} \tilde{k}(z_i, u)$ are the kernel density approximations of $p_Z(u)$ and $q_Z(u)$.

(c) If the reconstruction error $\|x - g(f(x))\|^2$ can be bounded by a small value $\xi$ for all $x \in \Omega_x$, then, for any $s, u, v, t > 0$,

$$\Pr\left\{ |\hat{L} - L| > t + s + u + v \right\} \leq 2 \exp\left\{ -\frac{2 N t^2}{\xi^2} \right\} + 2 \exp\left\{ -\frac{2 \lfloor S/2 \rfloor s^2}{K^2} \right\} + 2 \exp\left\{ -\frac{2 \lfloor N/2 \rfloor u^2}{K^2} \right\} + 2 \exp\left\{ -\frac{2 \min\{N, S\} v^2}{K^2} \right\}$$

Statement (b) of Theorem 1 provides an equivalent formulation of the objective in (4). This highlights the fact that $\tilde{L}(f, g)$ is always greater than $-K/N - K/S$. Note that $S$ can be made arbitrarily larger than $N$, since $S$ is the number of samples generated from $p_Z(z)$. Consequently, the lower bound on our empirical objective is approximately $-K/N$, which approaches zero for a large number of training samples. In that case, the most influential terms are the reconstruction error and $I(\tilde{p}_Z, \tilde{q}_Z)$. Therefore, our global minimum consists of the solution with the minimum reconstruction error on the training data and the best match between the kernel density approximations $\tilde{p}_Z(z)$ and $\tilde{q}_Z(z)$.

Statement (c) provides a probabilistic bound on the estimation error between $\hat{L}(f, g)$ and $L(f, g)$. The bound consists of four terms which vanish when both $N$ and $S$ are large. It is important to mention that, while the last three terms can be made arbitrarily small by choosing appropriate values for $s, u, v$, the first term depends mainly on the value of $\xi$, namely on the capacity of the encoding and the decoding networks. Increasing the capacity of the networks decreases $\xi$, thus improving the generalization performance of our model. In Algorithm 1, we provide the complete procedure of CouAE.
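The finite-sample objective in (4) can be sketched in a few lines of pure Python. The function names are hypothetical, and the demonstration uses a Gaussian kernel purely as a bounded stand-in with $K = k(z, z) = 1$, so that the lower bound $-K/N - K/S$ discussed above can be checked on a tiny example; it is not the kernel advocated in the paper.

```python
import math


def empirical_regularizer(prior_samples, encoded_samples, kernel):
    """Last three terms of Eq. (4), computed over samples of p_Z and q_Z."""
    S, N = len(prior_samples), len(encoded_samples)
    prior_term = sum(kernel(prior_samples[i], prior_samples[j])
                     for i in range(S) for j in range(S) if j != i)
    cross_term = sum(kernel(zp, zq)
                     for zp in prior_samples for zq in encoded_samples)
    encoded_term = sum(kernel(encoded_samples[i], encoded_samples[j])
                       for i in range(N) for j in range(N) if j != i)
    return (prior_term / (S * (S - 1))
            - 2.0 * cross_term / (S * N)
            + encoded_term / (N * (N - 1)))


def empirical_objective(inputs, reconstructions, prior_samples,
                        encoded_samples, kernel):
    """Eq. (4): mean squared reconstruction error plus the regularizer."""
    rec = sum(sum((a - b) ** 2 for a, b in zip(x, y))
              for x, y in zip(inputs, reconstructions)) / len(inputs)
    return rec + empirical_regularizer(prior_samples, encoded_samples, kernel)


def gaussian_kernel(a, b):
    # bounded stand-in kernel with K = k(z, z) = 1, used only for this demo
    return math.exp(-sum((u - v) ** 2 for u, v in zip(a, b)) / 2.0)


prior = [(0.0,), (1.0,)]
encoded = [(0.0,), (1.0,)]
value = empirical_regularizer(prior, encoded, gaussian_kernel)
print(value, ">=", -1.0 / len(prior) - 1.0 / len(encoded))
```

Because the diagonal terms are excluded from the two within-set sums, the estimate can be slightly negative even when the two sample sets coincide, which is consistent with the lower bound in statement (b).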

3. Related work

The most promising research directions for implicit generative models are generative adversarial networks (GANs) and autoencoder-based models.

Algorithm 1 CouAE, our proposed algorithm. In all experiments η = 0.0001, β1 = 0.5, β2 = 0.9.
  Input: N mini-batch size, η learning rate.
  Input: γ_0 initial parameter vector for f, θ_0 initial parameter vector for g.
  repeat
    ∇γ ← ∇γ L̂(f, g).
    γ ← γ − η · Adam(γ, ∇γ, β1, β2).
  until γ has converged
  repeat
    ∇θ ← ∇θ L̂(f, g).
    θ ← θ − η · Adam(θ, ∇θ, β1, β2).
  until θ has converged
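Both loops of Algorithm 1 rely on the Adam update. A minimal scalar sketch of that update is given below, using the β1 and β2 values quoted in the algorithm as defaults. The function name `adam_minimize`, the stability constant `eps`, the toy quadratic objective and the larger step size used in the demonstration are all assumptions made for illustration, not part of the paper.

```python
import math


def adam_minimize(grad_fn, x0, lr=1e-4, beta1=0.5, beta2=0.9,
                  eps=1e-8, steps=1000):
    """Minimize a function of one scalar variable with Adam updates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g       # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g   # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)          # bias corrections
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective (x - 3)^2; the step size is larger than in Algorithm 1
# so that the example finishes in a few thousand iterations.
x_star = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0, lr=0.01, steps=5000)
print(x_star)
```

In the two-phase scheme of Algorithm 1, the same update would be applied first to the encoder parameters γ and then to the decoder parameters θ, with gradients supplied by automatic differentiation rather than by a hand-written `grad_fn`.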

GANs (Goodfellow et al., 2014) cast the problem of density estimation as a mini-max game between two neural networks: a discriminator, which tries to distinguish between true and generated samples, and a generator, which tries to produce samples similar to the true ones in order to fool the discriminator. They have the reputation of being difficult to train and also require careful design of network architectures (Radford et al., 2015). Some of the best-known issues are (i) vanishing gradients (Arjovsky & Bottou, 2017), which occur when the output of the discriminator saturates because true and generated data are perfectly classified, so that no gradient information reaches the generator, (ii) mode collapse (Metz et al., 2017), which occurs when the samples from the generator collapse to a single point corresponding to the maximum output value of the discriminator, and (iii) instability associated with the failure of convergence, which is due to the intrinsic nature of the mini-max problem. Some of the most effective solutions for reducing vanishing gradients consist of using a different objective for the generator, called the −log D alternative (Goodfellow et al., 2014), enforcing the matching of features in the hidden layers of the discriminator between generated and true data (Salimans et al., 2016), incorporating multiple discriminators (Durugkar et al., 2017) or adding instance noise (Arjovsky & Bottou, 2017). To reduce mode collapse and, more generally, instability, Mescheder et al. (2017b) add a regularizer to the objective function, enforcing a consensus between the discriminator and the generator. Metz et al. (2017) update the generator based on an unrolled version of the discriminator. Karras et al. (2018) introduce a form of curriculum in the training of GANs by progressively growing the network capacities.
Nevertheless, all these strategies either have poor theoretical motivation or are guaranteed to converge only locally. GANs can also be formulated as a divergence minimization problem. The seminal paper of Goodfellow et al. (2014) shows the connection of the objective function with the Jensen-Shannon divergence. Nowozin et al. (2016) extend the analysis to a broader family of divergences, called f-divergences. They compare the different measures from this class and then provide some experimental insights on which divergence to choose for natural images. The work of Arjovsky et al. (2017) defines some pathological examples for which many divergences, including the Jensen-Shannon divergence, lead to suboptimal solutions for the generator, and therefore proposes to use the Wasserstein distance. A heuristic based on weight clipping is used to constrain the critic/discriminator to lie in the class of 1-Lipschitz functions. The follow-up paper of Gulrajani et al. (2017) replaces this heuristic with a gradient penalty. Another research direction for GANs consists of using integral probability metrics (Müller, 1997) as the optimization objective. In particular, the maximum mean discrepancy (Gretton et al., 2008) can be used to measure the distance between $p_X$ and $q_Y$ and train the generator network. The general problem is formulated in the following way:

$$\inf_{g \in G} \sup_{f \in F} \; E_{x \sim p_X}[f(x)] - E_{y \sim q_Y}[f(y)]$$

In generative moment matching networks (Li et al., 2015; Dziugaite et al., 2015), F is an RKHS induced by the Gaussian kernel. One drawback of these models is that no maximization is performed over F, so the resulting solutions are suboptimal. Another limitation is that the similarity scores associated with the kernel function are computed directly in the sample space. Therefore, the performance degrades as the dimensionality of the feature space increases (Ramdas et al., 2015). The work of Li et al. (2017) introduces an encoding function to represent data in a more compact way, and distances are computed in the latent representation, thus solving the problem of dimensionality. Mroueh et al. (2017) propose to extend the maximum mean discrepancy to also include covariance statistics, in order to ensure better stability. The work of Tolstikhin et al. (2018) generalizes the computation of the distance between the encoded distribution and the prior to other divergences, proposing two different solutions: the first one consists of using the Jensen-Shannon divergence, which is also shown to be equivalent to adversarial autoencoders, and the second one consists of using the maximum mean discrepancy. The choice of the kernel function in this second case is of fundamental importance to ensure the global convergence of gradient-descent algorithms. As we have already shown in the previous section, suboptimal choices of kernel function, like the ones used by the authors, introduce local optima in the function space and therefore do not have the same convergence property as our model.

There exist other autoencoder-based models that are inspired by the adversarial game of GANs. Chen et al. (2016) add an autoencoder network to the original GANs for reconstructing part of the latent code. The closely related works of Donahue et al. (2017) and Dumoulin et al. (2017) propose to add an encoding function together with the generator and perform an adversarial game to ensure that the joint density on the input/output of the generator agrees with the joint density of the input/output of the encoder. They prove that the optimal solution is achieved when the generator and the encoder are invertible. In practice, they fail to guarantee the convergence to that solution due to the adversarial nature of the game. Srivastava et al. (2017) extend the previous works by explicitly imposing the invertibility condition. They achieve this by adding a term to the generator objective that computes the reconstruction error on the latent space. Adversarial autoencoders (Makhzani et al., 2013) are similar to these approaches, with the only differences that the estimation of the reconstruction error is performed in the sample space, while the adversarial game is performed only in the latent space. It is important to mention that all of these works are based on a mini-max problem, while our method solves a simple minimization problem, for which it is possible to achieve global convergence.

Variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) represent another family of autoencoder-based models. The framework is based on minimizing the Kullback-Leibler (KL) divergence between the approximate posterior distribution defined by the encoder and the true prior $p_Z$ (which consists of a surrogate for the negative log-likelihood of training data). Practically speaking, the stochastic encoder used in variational autoencoders is driven to produce latent representations that can be similar among different input samples, thus generating conflicts during reconstruction. A deterministic encoder could ideally solve this problem, but unfortunately the KL divergence is not defined in that case. There are several variations of VAEs. For example, the work of Mescheder et al. (2017a) proposes to use the adversarial game of GANs to learn better approximate posterior distributions in VAEs. Nevertheless, the method is still based on a mini-max problem.

The work most closely related to ours is the one proposed by Unterthiner et al. (2018). The authors optimize a similar objective and use the extended version of the Coulomb kernel. The computation of distances is performed directly in the sample space, and performance degrades as the number of features increases. Another drawback is that their model does not allow inferring a latent representation of data.

4. Experiments

Our proposed solution is compared against 4 autoencoder-based models, namely Variational Autoencoders (VAE) (Kingma & Welling, 2013; Rezende et al., 2014), Adversarial Autoencoders (AAE) (Makhzani et al., 2013), Adversarial Variational Bayes with Adaptive Contrast (AVB-AC) (Mescheder et al., 2017a) and Wasserstein Autoencoders (WAE) (Tolstikhin et al., 2018).5 We also provide the comparison with 4 recent generative adversarial networks, namely Coulomb GAN (CouGAN) (Unterthiner et al., 2018), the improved version of Wasserstein GAN (WGAN) (Gulrajani et al., 2017), Bidirectional GAN (BiGAN) (Donahue et al., 2017) and Variational Encoder Enhancement to GANs (VEE) (Srivastava et al., 2017). We implement and compare all approaches on different synthetic and real-world datasets. In particular, we use two synthetic datasets to simulate low and high dimensional feature spaces and two real datasets, namely Stacked-MNIST and CIFAR-100. The code to replicate all experiments, including the ones of competitors, will be made available upon acceptance.

Figure 3. Visualization of generated data from different models on grid dataset. Panels: (a) True Data, (b) CouAE, (c) CouGAN, (d) WGAN, (e) BiGAN, (f) VEE, (g) VAE, (h) AAE, (i) AVB-AC, (j) WAE.

Figure 4. Histogram of modes obtained through classification of generated samples on the low dimensional embedding dataset. Panels: (a) CouAE, (b) WGAN, (c) BiGAN, (d) VEE, (e) VAE, (f) AAE, (g) AVB-AC, (h) WAE.

Table 1. Test log-likelihood on grid and low dimensional embedding datasets (the higher the better).

Method     Grid           Low dim.
CouAE      -4.5±0.2       1224.4±76.3
CouGAN     -4.5±0.2       FAILURE
WGAN       -4.1±0.1       848.9±534.4
BiGAN      -103.4±98.9    -3940.3±3519.4
VEE        -26.9±6.8      -7.5E6±4.0E6
VAE        -5.5±0.2       -320.1±1081.6
AAE        -7.5±0.4       1025.9±177.1
AVB-AC     -7.5±0.8       -965.4±1196.9
WAE        -20.5±13.6     964.6±131.7
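The test log-likelihood reported in Table 1 is obtained, as described in Section 4.1, by fitting a Gaussian kernel density estimate on generated samples and evaluating it on test samples. A minimal pure-Python sketch of that evaluation follows. The function names are hypothetical, and two details are assumptions because the paper does not spell them out: the log-likelihood is averaged over test points, and the bandwidth is chosen by maximizing the log-likelihood on held-out points.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def kde_log_likelihood(test_points, generated_points, bandwidth):
    """Mean log-density of test points under an isotropic Gaussian KDE."""
    d = len(generated_points[0])
    log_norm = (-0.5 * d * math.log(2.0 * math.pi * bandwidth ** 2)
                - math.log(len(generated_points)))
    total = 0.0
    for x in test_points:
        exponents = [
            -sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * bandwidth ** 2)
            for y in generated_points
        ]
        total += log_norm + log_sum_exp(exponents)
    return total / len(test_points)


# Ten bandwidths logarithmically spaced in [10^-3, 10^1.5], as in footnote 6.
bandwidths = [10.0 ** (-3.0 + 4.5 * i / 9.0) for i in range(10)]


def best_log_likelihood(heldout_points, generated_points):
    # Selecting the bandwidth by held-out log-likelihood is an assumed criterion.
    return max(kde_log_likelihood(heldout_points, generated_points, bw)
               for bw in bandwidths)
```

The log-sum-exp formulation matters here because the smallest bandwidths in the grid make individual kernel terms underflow when evaluated directly.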

dataset (Lim & Ye, 2017). The training dataset contains 500 samples generated from the true density. Following the methodology of other works (see for example (Lim & Ye, 2017; Unterthiner et al., 2018)), we choose fully connected MLPs with two hidden layers (128 neurons each) in encoder, decoder and discriminator networks and set h = 2. All models are trained for two million iterations using Adam optimizer. It is important to mention that we observe poor performance in VEE and better results are obtained by applying batch normalization to all hidden layers (results without batch normalization and details of simulations are available in the supplementary material). Models are evaluated qualitatively by visually inspecting generated samples and quantitatively by computing the loglikelihood on test data. To compute log-likelihood, we first apply kernel density estimation using a Gaussian kernel on 105 generated samples6 and then evaluate the log-likelihood on 104 test samples from the true distribution. Results are averaged over 10 repetitions. Figure 3 shows samples generated by all models, while Table 1 provides quantitative results. It is immediate to see that BiGAN is affected by mode collapse, while AAE, WAE and AVB-AC generate very noisy samples. VEE achieves very low performance in terms of log-likelihood since it assigns high prior to only few modes. Our model, VAE and CouGAN compare favourably with WGAN, which in turn obtain the best performance. 4.2. Low dimensional embedding dataset

4.1. Grid dataset We start by comparing the approaches on a two-dimensional dataset consisting of 25 isotropic Gaussians placed according to a grid (see Figure 3(a)), and call it the grid 5

Note that WAE-GAN is equivalent to AAE.

The second dataset consists of ten 10 dimensional isotropic Gaussians embedded in a 1000 dimensional vector space and we call it the low dimensional embedding dataset. We 6 Bandwidth is selected from a set of 10 values logarithmically spaced in [10−3 , 101.5 ].


(a) CouAE

(b) CouGAN

(c) WGAN

(d) BiGAN

(e) VEE

(f) VAE

(g) AAE

(h) AVB-AC

(a) CouAE

(b) VAE

(c) AAE

(d) AVB-AC

(e) WAE Figure 6. Conditional generation. The first column of each plot contains true samples, while other columns are obtained by perturbing the latent representation of true samples with additive isotropic Gaussian noise (0.01 element variance). VEE, BIGAN are not shown due to mode collapse. (i) WAE Figure 5. Generated samples on Stacked-MNIST. First column contains true samples.

generate 500 samples from the true density to train all models. The methodology is similar to the one of previous dataset. The main difference is in the evaluation. Due to the difficulty of visualizing samples in high dimensions, we propose to use a classifier7 to count the number of samples generated by the models for each mode. This evaluation procedure allows to only detect the presence of mode collapse. It is important to mention that this procedure can be fooled by specific pathological cases, like memorization of training samples. Therefore, we use log-likelihood on test data to assess the quality of the learnt distribution. The other differences are in the use of batch normalization for BiGAN and AAE, which lead to better performance. We experience high instability when training CouGAN and also convergence failures. Figure 4 shows the histograms of generated samples obtained by all models. Note that BiGAN and VEE are affected by mode collapse. Table 1 provides log-likelihood scores. Our model achieves the best performance, meaning that it is able to better estimate the underlying true density. 4.3. Stacked-MNIST The third dataset is created by stacking random digits from the MNIST dataset on top of each other to produce colored images (Srivastava et al., 2017). This creates a ground truth 7 Consisting of a MLP with two hidden layers of 256 and 128 neurons each and trained on an infinitely large dataset sampled from the true density for 1000 iterations and using Adam with learning rate equal to 0.001.

density with 1000 classes. The methodology is similar to the one of previous dataset.We use the MLP network from (Cires¸an et al., 2010) as encoder, decoder and discriminator networks (see supplementary material for further details and simulations). We train all models up to 1000 epochs with batch size equal to 128. We visualize both generated samples and nearest neighbors in the latent space, to understand the quality of the learnt representation and see whether semantic consistency is preserved in the neighborhood of samples. Furthermore, we use Frechet Inception Distance (FID) (Heusel et al., 2017) to quantitatively assess the visual quality. Similarly to the previous case, we use batch normalization for BiGAN, VEE and AAE. Figure 5 shows the generated samples of each model. Note that BiGAN and VEE fail to learn the true distribution, while VAE and AVB-AC suffer from mode collapse. CouGAN seems to generate more diverse samples, but with very strong artifacts. Only CouAE, WAE and AAE are able to learn a good approximation of the true density, as confirmed by the FID scores in Table 1. Figure 6 shows nearest neighbors for true data. It is worth to mention that AAE does not preserve semantic consistency in the representation. In fact, for a small perturbation of the latent representation the output image can completely change its semantic content. Instead our model and WAE are capable to fulfill this property. 4.4. CIFAR-100 The fourth dataset consists of real-world images from CIFAR-100. In these experiments, we use deep MLP networks with around 3000 hidden neurons per layer (see supplementary material for further details). All models are trained up to 6000 epochs, using batch size equal to 128.We

Table 2. FID score for different models on Stacked-MNIST and CIFAR-100 (the lower the better).

Data / Method    CouAE   CouGAN    WGAN   BiGAN   VEE     VAE     AAE     AVB-AC   WAE
Stacked-MNIST    33.9    72.3      41.5   410.3   375.8   118.9   33.7    173.6    33.6
CIFAR-100        11.0    FAILURE   38.6   344.3   119.7   29.0    123.5   23.3     22.7

References
Arjovsky, M. and Bottou, L. Towards principled methods for training generative adversarial networks. International Conference on Learning Representations (ICLR), 2017.
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (ICML), pp. 214–223, 2017.
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NIPS), pp. 2172–2180, 2016.
Cireşan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22:3207–3220, 2010.
Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial feature learning. International Conference on Learning Representations (ICLR), 2017.

Figure 7. Visualization of generated samples from different models on CIFAR-100. Panels: (a) CouAE, (b) CouGAN, (c) WGAN, (d) BiGAN, (e) VEE, (f) VAE, (g) AAE, (h) AVB-AC, (i) WAE.

use batch normalization for BiGAN, VEE and AAE. As before, we evaluate the performance of the models by visually inspecting the generated samples and by computing FID (nearest neighbors are shown in the supplementary material).

Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., and Courville, A. Adversarially learned inference. International Conference on Learning Representations (ICLR), 2017.
Durugkar, I., Gemp, I., and Mahadevan, S. Generative multi-adversarial networks. International Conference on Learning Representations (ICLR), 2017.
Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. Training Generative Neural Networks via Maximum Mean Discrepancy Optimization. In Uncertainty in Artificial Intelligence (UAI), pp. 258–267, 2015.

Figure 7 shows the generated samples of each model. Note that BiGAN, CouGAN and AAE fail to learn the true distribution, while AVB-AC generates a large number of duplicated images (see, for example, the images of mountains). From a qualitative point of view, VEE, WGAN and WAE produce more distorted samples than VAE and CouAE. Table 2 provides the FID scores for these experiments. CouAE clearly outperforms the other models by a large margin. Therefore, there is a clear advantage in using our objective function coupled with the Coulomb kernel.
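As an illustration of the metric reported in Table 2, the following minimal sketch computes the Fréchet distance between two Gaussians fitted to feature vectors, which is the quantity underlying FID (Heusel et al., 2017). It is not the evaluation code used in the experiments: the function name `frechet_distance` is hypothetical, the features are assumed to have been extracted beforehand (for FID, by an Inception network), and synthetic Gaussian samples are used here as stand-ins for real and generated features.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (clipping removes round-off noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Synthetic stand-ins for feature activations (assumption for illustration only).
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 4))
fake = rng.normal(loc=0.5, size=(2000, 4))
fid_same = frechet_distance(real, real)      # identical sets: distance close to 0
fid_shifted = frechet_distance(real, fake)   # mean shift of 0.5 in each of 4 dims
```

With identical feature sets the distance vanishes up to round-off, while a mean shift increases it, which is why lower scores indicate generated samples whose feature statistics are closer to those of real data.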

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NIPS), pp. 2672–2680, 2014.

5. Conclusions

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems (NIPS), pp. 6629–6640, 2017.

We show that the choice of kernel function is fundamental to mitigate the problem of bad local minima in the objective and propose to use Coulomb kernels. We demonstrate that CouAEs have a guarantee of convergence to the global minimum in the function space.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. A kernel method for the two sample problem. Technical report, Max Planck Institute for Biological Cybernetics, 2008.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NIPS), pp. 5769–5779, 2017.

Hochreiter, S. and Obermayer, K. Optimal Kernels for Unsupervised Learning. In IEEE International Joint Conference on Neural Networks (IJCNN 2005), pp. 1895–1899, 2005.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. International Conference on Learning Representations (ICLR), 2018.
Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2013.
Li, C. L., Chang, W. C., Cheng, Y., Yang, Y., and Póczos, B. MMD GAN: Towards Deeper Understanding of Moment Matching Network. In Advances in Neural Information Processing Systems (NIPS), pp. 2200–2210, 2017.
Li, Y., Swersky, K., and Zemel, R. Generative Moment Matching Networks. In International Conference on Machine Learning (ICML), pp. 1718–1727, 2015.
Lim, J. H. and Ye, J. C. Geometric GAN. arXiv preprint arXiv:1705.02894, 2017.

Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. International Conference on Learning Representations (ICLR), 2013.
Mescheder, L., Nowozin, S., and Geiger, A. Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks. In International Conference on Machine Learning (ICML), pp. 2391–2400, 2017a.
Mescheder, L., Nowozin, S., and Geiger, A. The Numerics of GANs. In Advances in Neural Information Processing Systems (NIPS), pp. 1823–1833, 2017b.
Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. Unrolled generative adversarial networks. International Conference on Learning Representations (ICLR), 2017.
Mroueh, Y., Sercu, T., and Goel, V. McGan: Mean and Covariance Feature Matching GAN. In International Conference on Machine Learning (ICML), pp. 2527–2535, 2017.
Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
Nachman, A. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training Generative Neural Samplers Using Variational Divergence Minimization. In Advances in Neural Information Processing Systems (NIPS), pp. 271–279, 2016.
Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
Ramdas, A., Reddi, S. J., Poczos, B., Singh, A., and Wasserman, L. On the Decreasing Power of Kernel and Distance Based Nonparametric Hypothesis Tests in High Dimensions. In AAAI Conference on Artificial Intelligence, pp. 3571–3577, 2015.
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning (ICML), pp. 1278–1286, 2014.
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems (NIPS), pp. 2234–2242, 2016.
Srivastava, A., Valkov, L., Russell, C., Gutmann, M. U., and Sutton, C. VEEGAN: Reducing Mode Collapse in GANs Using Implicit Variational Learning. In Advances in Neural Information Processing Systems (NIPS), pp. 3310–3320, 2017.
Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. International Conference on Learning Representations (ICLR), 2018.
Unterthiner, T., Nessler, B., Klambauer, G., Heusel, M., Ramsauer, H., and Hochreiter, S. Coulomb GANs: Provably optimal Nash equilibria via potential fields. International Conference on Learning Representations (ICLR), 2018.

A. Properties of regularizer in (1)
Lemma 1. Given $k : \Omega_z \times \Omega_z \to \mathbb{R}$ a symmetric positive definite kernel, then:
(a) (Nachman, 1950) there exists a unique Hilbert space $\mathcal{H}$ of real-valued functions over $\Omega_z$, for which $k$ is a reproducing kernel. $\mathcal{H}$ is therefore a reproducing kernel Hilbert space (RKHS).
(b) For all $\ell \in \mathcal{H}$,
\[
\int_{\Omega_z}\int_{\Omega_z} \phi_t(z)\,\phi_t(z')\,k(z,z')\,dz\,dz' = \mathrm{MMD}^2(p_Z, q_Z)
\]
where
\[
\mathrm{MMD}(p_Z, q_Z) \doteq \sup_{\|\ell\|_{\mathcal{H}} \le 1} \Big\{ E_{z\sim p_Z}[\ell(z)] - E_{z\sim q_Z}[\ell(z)] \Big\}
\]
is the maximum mean discrepancy between $p_Z(z)$ and $q_Z(z)$.
(c) (Gretton et al., 2008) Let $\mathcal{H}$ be defined as in (b), then $\mathrm{MMD}(p_Z, q_Z) = 0$ if and only if $p_Z(z) = q_Z(z)$.

Proof. (a) follows directly from the Moore-Aronszajn theorem (Nachman, 1950).

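Now we prove statement (b). For the sake of notation compactness, define $J \doteq \int_{\Omega_z}\int_{\Omega_z} \phi_t(z)\phi_t(z')k(z,z')\,dz\,dz'$. Since $\phi_t(z) = p_Z(z) - q_Z(z)$, we have
\[
\begin{aligned}
J &= \int_{\Omega_z}\int_{\Omega_z} p_Z(z)p_Z(z')k(z,z')\,dz\,dz' - \int_{\Omega_z}\int_{\Omega_z} p_Z(z)q_Z(z')k(z,z')\,dz\,dz' \\
&\quad - \int_{\Omega_z}\int_{\Omega_z} p_Z(z')q_Z(z)k(z,z')\,dz\,dz' + \int_{\Omega_z}\int_{\Omega_z} q_Z(z')q_Z(z)k(z,z')\,dz\,dz' \\
&= \int_{\Omega_z}\int_{\Omega_z} p_Z(z)p_Z(z')\langle r(z), r(z')\rangle_{\mathcal{H}}\,dz\,dz' - \int_{\Omega_z}\int_{\Omega_z} p_Z(z)q_Z(z')\langle r(z), r(z')\rangle_{\mathcal{H}}\,dz\,dz' \\
&\quad - \int_{\Omega_z}\int_{\Omega_z} p_Z(z')q_Z(z)\langle r(z), r(z')\rangle_{\mathcal{H}}\,dz\,dz' + \int_{\Omega_z}\int_{\Omega_z} q_Z(z')q_Z(z)\langle r(z), r(z')\rangle_{\mathcal{H}}\,dz\,dz' \\
&= \langle E_{z\sim p_Z}[r(z)], E_{z'\sim p_Z}[r(z')]\rangle_{\mathcal{H}} - \langle E_{z\sim p_Z}[r(z)], E_{z'\sim q_Z}[r(z')]\rangle_{\mathcal{H}} \\
&\quad - \langle E_{z'\sim p_Z}[r(z')], E_{z\sim q_Z}[r(z)]\rangle_{\mathcal{H}} + \langle E_{z'\sim q_Z}[r(z')], E_{z\sim q_Z}[r(z)]\rangle_{\mathcal{H}}
\end{aligned}
\tag{5}
\]
Note that the second equality in (5) follows from the fact that $k(z,z') = \langle r(z), r(z')\rangle_{\mathcal{H}}$ for a unique $r \in \mathcal{H}$,⁸ where $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ is the inner product of $\mathcal{H}$. If we define $\mu_{p_Z} \doteq E_{z\sim p_Z}[r(z)]$ and $\mu_{q_Z} \doteq E_{z\sim q_Z}[r(z)]$,⁹ then (5) can be rewritten in the following way:
\[
\begin{aligned}
J &= \langle \mu_{p_Z}, \mu_{p_Z}\rangle_{\mathcal{H}} - \langle \mu_{p_Z}, \mu_{q_Z}\rangle_{\mathcal{H}} - \langle \mu_{p_Z}, \mu_{q_Z}\rangle_{\mathcal{H}} + \langle \mu_{q_Z}, \mu_{q_Z}\rangle_{\mathcal{H}} \\
&= \langle \mu_{p_Z} - \mu_{q_Z}, \mu_{p_Z} - \mu_{q_Z}\rangle_{\mathcal{H}} = \|\mu_{p_Z} - \mu_{q_Z}\|^2_{\mathcal{H}}
\end{aligned}
\tag{6}
\]
Notice that
\[
\begin{aligned}
\|\mu_{p_Z} - \mu_{q_Z}\|_{\mathcal{H}} &= \Big\langle \mu_{p_Z} - \mu_{q_Z}, \frac{\mu_{p_Z} - \mu_{q_Z}}{\|\mu_{p_Z} - \mu_{q_Z}\|_{\mathcal{H}}} \Big\rangle_{\mathcal{H}} \\
&= \sup_{\|\ell\|_{\mathcal{H}} \le 1} \Big\{ \langle \mu_{p_Z} - \mu_{q_Z}, \ell \rangle_{\mathcal{H}} \Big\} \\
&= \sup_{\|\ell\|_{\mathcal{H}} \le 1} \Big\{ E_{z\sim p_Z}[\langle r(z), \ell\rangle_{\mathcal{H}}] - E_{z\sim q_Z}[\langle r(z), \ell\rangle_{\mathcal{H}}] \Big\} \\
&= \sup_{\|\ell\|_{\mathcal{H}} \le 1} \Big\{ E_{z\sim p_Z}[\ell(z)] - E_{z\sim q_Z}[\ell(z)] \Big\} \\
&= \mathrm{MMD}(p_Z, q_Z)
\end{aligned}
\]
Substituting this result into (6) concludes the proof of the statement.
Statement (c) is equivalent to Theorem 3 in (Gretton et al., 2008).

⁸ This is a classical result due to the Riesz representation theorem.
⁹ Their existence can be guaranteed assuming that $\|\mu_{p_Z}\|^2_{\mathcal{H}} < \infty$ and $\|\mu_{q_Z}\|^2_{\mathcal{H}} < \infty$. In other words, $E_{z,z'\sim p_Z}[k(z,z')] < \infty$ and $E_{z,z'\sim q_Z}[k(z,z')] < \infty$.

B. Sufficient conditions for global convergence
This section aims at clarifying the theory proposed in (Hochreiter & Obermayer, 2005). Note that the authors have focused on proving that the Poisson's equation is necessary for achieving global convergence. Here, we prove that this equation represents a sufficient condition to guarantee the global convergence on the regularizer in (1), thus motivating the use of Coulomb kernels. The result of the following proposition implies that the loss function associated with the second addend in (1) is free from saddle points and all local minima are global.

Proposition 2. Assume that the encoder network has enough capacity to achieve the global minimum of the second addend in (1). Furthermore, assume that the kernel function satisfies the Poisson's equation, viz. $\nabla^2_z k(z, z') = -\delta(z - z')$. Then,
\[
|\phi_t(z_{max})| = \frac{1}{\sqrt{2t + (\phi_0(z_{max}))^{-2}}}
\tag{7}
\]
where $\phi_t(\cdot)$ represents $\phi(\cdot)$ at iteration $t$, while $z_{max} = \arg\max_z \|\phi_t(z)\|$. Therefore, gradient descent-based training converges to the global minimum of the second addend in (1) and at the global minimum $\phi(z) = 0$ for all $z \in \Omega_z$.

Proof. By defining the potential function at location $z$, namely $\Psi(z) \doteq \int_{\Omega_z} \phi_t(z')k(z, z')\,dz'$, the second addend in (1) can be rewritten in the following way:
\[
\int_{\Omega_z}\int_{\Omega_z} \phi_t(z)\phi_t(z')k(z, z')\,dz\,dz' = \int_{\Omega_z} \phi_t(z)\Psi(z)\,dz \doteq \int_{\Omega_z} J(z)\,dz
\tag{8}
\]
The overall objective (8) is computed by summing the term $J(z)$ for all $z \in \Omega_z$. Each contribution term consists of the potential function $\Psi(z)$ weighted by $\phi_t(z)$. By considering a single term at a specific location $z$, viz. $J(z)$, we can describe the training dynamics through the continuity equation, namely:
\[
\frac{\partial \phi_t(z)}{\partial t} = -\nabla_z \cdot \big(\phi_t(z)\,v(z)\big)
\tag{9}
\]
where $v(z) = -\nabla_z J(z)$. Equation (9) relates the motion of particle $z$, moving with speed $v(z)$, to the change in $\phi_t(z)$. Therefore,
\[
\begin{aligned}
\frac{\partial \phi_t(z)}{\partial t} &= -\nabla_z \cdot \big(\phi_t(z)\,v(z)\big) \\
&= \nabla_z \cdot \big(\phi_t(z)\nabla_z J(z)\big) \\
&= \nabla_z \phi_t(z) \cdot \nabla_z J(z) + \phi_t(z)\,\nabla_z \cdot \nabla_z J(z) \\
&= \nabla_z \phi_t(z) \cdot \nabla_z J(z) + \phi_t(z)\,\nabla_z \cdot \nabla_z \big(\phi_t(z)\Psi(z)\big) \\
&= \nabla_z \phi_t(z) \cdot \nabla_z J(z) + \phi_t(z)\,\nabla_z \cdot \big(\nabla_z \phi_t(z)\,\Psi(z)\big) + \phi_t(z)\,\nabla_z \cdot \big(\phi_t(z)\nabla_z \Psi(z)\big) \\
&= \nabla_z \phi_t(z) \cdot \nabla_z J(z) + \phi_t(z)\,\nabla_z \cdot \big(\nabla_z \phi_t(z)\,\Psi(z)\big) + \phi_t(z)\,\nabla_z \phi_t(z) \cdot \nabla_z \Psi(z) + \big(\phi_t(z)\big)^2\,\nabla_z \cdot \nabla_z \Psi(z)
\end{aligned}
\]
We can limit our analysis only to maximal points $z_{max} = \arg\max_z \|\phi_t(z)\|$, for which $\nabla_z \phi_t(z_{max}) = 0$, in order to prove that $\phi_t(z) \to 0$ for all $z \in \Omega_z$ as $t \to \infty$. Consequently, the previous equation is simplified as follows:
\[
\begin{aligned}
\frac{\partial \phi_t(z_{max})}{\partial t} &= \big(\phi_t(z_{max})\big)^2\,\nabla_z \cdot \nabla_z \Psi(z_{max}) \\
&= \big(\phi_t(z_{max})\big)^2\,\nabla^2_z \Psi(z_{max}) \\
&= \big(\phi_t(z_{max})\big)^2\,\nabla^2_z \int_{\Omega_z} \phi_t(z')k(z_{max}, z')\,dz' \\
&= \big(\phi_t(z_{max})\big)^2 \int_{\Omega_z} \phi_t(z')\nabla^2_z k(z_{max}, z')\,dz' \\
&= -\big(\phi_t(z_{max})\big)^2 \int_{\Omega_z} \phi_t(z')\delta(z_{max} - z')\,dz' \\
&= -\big(\phi_t(z_{max})\big)^2 \int_{\Omega_z} \phi_t(z')\delta(z' - z_{max})\,dz' \\
&= -\big(\phi_t(z_{max})\big)^2 \phi_t(z_{max}) \\
&= -\big(\phi_t(z_{max})\big)^3
\end{aligned}
\]
The solution of this differential equation is given by:
\[
\phi_t(z_{max}) = \pm\frac{1}{\sqrt{2t + c}}
\tag{10}
\]
for a given $c \in \mathbb{R}^+$. Note that $\phi_0(z_{max}) = \pm\frac{1}{\sqrt{c}}$. Therefore, $c = (\phi_0(z_{max}))^{-2}$. At the end of training, namely $t \to \infty$, $\phi(z) = 0$ for all $z \in \Omega_z$; consequently, $p_Z(z) = q_Z(z)$ and the objective is at its global minimum. This concludes the proof.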
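The decay rate in (7) can be checked numerically. The following minimal sketch integrates the scalar differential equation $\partial\phi_t(z_{max})/\partial t = -(\phi_t(z_{max}))^3$ with a forward-Euler scheme and compares the result with the closed form. The function names, the initial value 0.8, the horizon and the step size are illustrative assumptions, not quantities taken from the experiments.

```python
import math

def simulate_peak(phi0, t_end, dt=1e-4):
    """Forward-Euler integration of d(phi)/dt = -phi**3 starting from phi(0) = phi0."""
    phi = phi0
    for _ in range(int(round(t_end / dt))):
        phi -= dt * phi ** 3
    return phi

def closed_form_peak(phi0, t):
    """Closed form |phi_t(z_max)| = 1 / sqrt(2 t + phi0**-2), as in (7)."""
    return 1.0 / math.sqrt(2.0 * t + phi0 ** -2)

# Illustrative values (assumptions): initial peak 0.8, integration horizon t = 5.
phi0, t_end = 0.8, 5.0
numeric = simulate_peak(phi0, t_end)
exact = closed_form_peak(phi0, t_end)
```

The two values agree up to the discretization error of the Euler scheme, and the closed form makes the $O(1/\sqrt{t})$ decay of the peak of $\phi_t$ explicit.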

C. Solution of the Poisson's equation
Proof. It is important to mention that the solution of the Poisson equation is an already known mathematical result. Nonetheless, we provide its derivation here, since we believe that it can support the reading of the article. Recall that, for a given $z' \in \mathbb{R}^h$,
\[
-\nabla^2_z k(z, z') = \delta(z - z'), \quad \forall z \in \mathbb{R}^h
\tag{11}
\]
is the Poisson equation. Note that we are looking for kernel functions that are translation invariant, namely satisfying the property $k(z, z') = \bar{k}(z - z')$. Therefore, the solution of (11) can be obtained (i) by considering the simplified case where $z' = 0$ and then (ii) by replacing $z$ with $z - z'$ to get the general solution. Therefore, our aim is to derive the solution for the following case:
\[
\nabla^2_z \Gamma(z) = \delta(z), \quad \forall z \in \mathbb{R}^h
\tag{12}
\]
where $\Gamma(z) \doteq -\bar{k}(z)$. Consider that, $\forall z \neq 0$, (12) is equivalent to
\[
\nabla^2_z \Gamma(z) = 0
\tag{13}
\]
Now assume that $\Gamma(z) = v(r)$ for some function $v : \mathbb{R} \to \mathbb{R}$ and $r \doteq \|z\|$. Then, we have that $\forall i = 1, \dots, h$
\[
\begin{aligned}
\frac{\partial \Gamma(z)}{\partial z_i} &= \frac{dv(r)}{dr}\frac{\partial r}{\partial z_i} = v'(r)\frac{z_i}{r} \\
\frac{\partial^2 \Gamma(z)}{\partial z_i^2} &= v''(r)\frac{z_i^2}{r^2} + v'(r)\Big(\frac{1}{r} - \frac{z_i^2}{r^3}\Big)
\end{aligned}
\tag{14}
\]
By using (13) and (14), we get the following equation:
\[
\nabla^2_z \Gamma(z) = \sum_{i=1}^{h} \frac{\partial^2 \Gamma(z)}{\partial z_i^2} = v''(r) + \frac{h-1}{r}v'(r) = 0
\]
whose solution is given by $v'(r) = b/r^{h-1}$ for any scalar $b \neq 0$. By integrating this solution, we obtain that:
\[
v(r) = \begin{cases} br + c & h = 1 \\ b\ln(r) + c & h = 2 \\ -\dfrac{b}{(h-2)r^{h-2}} + c & h > 2 \end{cases}
\tag{15}
\]
and by choosing $c = 0$ (without loss of generality), we get that
\[
\Gamma(z) = \begin{cases} b\|z\| & h = 1 \\ b\ln(\|z\|) & h = 2 \\ -\dfrac{b}{(h-2)\|z\|^{h-2}} & h > 2 \end{cases}
\tag{16}
\]
Note that this is the solution of the homogeneous equation in (13). The solution for the nonhomogeneous case in (12) can be obtained by applying the fundamental theorem of calculus for $h = 1$, Green's theorem for $h = 2$ and Stokes' theorem for general $h$ (we skip the tedious derivation here, but the result can be checked in any book of vector calculus that covers the Green's function). Therefore,
\[
\Gamma(z) = \begin{cases} \dfrac{1}{2}\|z\| & h = 1 \\ \dfrac{1}{2\pi}\ln(\|z\|) & h = 2 \\ -\dfrac{1}{(h-2)S_h\|z\|^{h-2}} & h > 2 \end{cases}
\tag{17}
\]
In other words,
\[
\bar{k}(z) = \begin{cases} -\dfrac{1}{2}\|z\| & h = 1 \\ -\dfrac{1}{2\pi}\ln(\|z\|) & h = 2 \\ \dfrac{1}{(h-2)S_h\|z\|^{h-2}} & h > 2 \end{cases}
\tag{18}
\]
and after replacing $z$ with $z - z'$, we obtain our final result.
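The kernel in (18) is simple to evaluate in practice. The sketch below is a minimal, hypothetical implementation (the names `coulomb_kernel`, `unit_sphere_area` and `laplacian`, as well as the small clamp `eps` on the distance, are assumptions introduced for illustration). It takes $S_h$ to be the surface area of the unit sphere in $\mathbb{R}^h$, which is the standard normalization of the fundamental solution of the Laplacian, and uses a finite-difference Laplacian to check that the kernel is harmonic away from the singularity, in agreement with (13).

```python
import math

def unit_sphere_area(h):
    """Surface area of the unit sphere in R^h: 2 * pi^(h/2) / Gamma(h/2)."""
    return 2.0 * math.pi ** (h / 2.0) / math.gamma(h / 2.0)

def coulomb_kernel(z, z_prime, eps=1e-12):
    """Translation-invariant kernel from (18), evaluated at z - z_prime."""
    h = len(z)
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_prime)))
    if h == 1:
        return -0.5 * r
    r = max(r, eps)  # illustrative guard against the singularity at r = 0
    if h == 2:
        return -math.log(r) / (2.0 * math.pi)
    return 1.0 / ((h - 2) * unit_sphere_area(h) * r ** (h - 2))

def laplacian(f, z, step=1e-3):
    """Central finite-difference approximation of the Laplacian of f at z."""
    total = 0.0
    for i in range(len(z)):
        up = list(z)
        up[i] += step
        dn = list(z)
        dn[i] -= step
        total += (f(up) + f(dn) - 2.0 * f(z)) / step ** 2
    return total

origin = [0.0, 0.0, 0.0]
# Away from z' the kernel should be harmonic, i.e. its Laplacian should vanish.
lap = laplacian(lambda z: coulomb_kernel(z, origin), [0.7, -0.4, 0.5])
# For h = 3 the kernel reduces to the familiar 1 / (4 * pi * r).
k_val = coulomb_kernel([1.0, 0.0, 0.0], origin)
```

For $h = 3$ the expression reduces to the classical electrostatic potential $1/(4\pi\|z - z'\|)$. The clamp on the distance is only a numerical convenience for this sketch and is not part of the derivation above.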


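D. Proof of Theorem 1
Recall that our objective is given by:
\[
\hat{L}(f, g) = \sum_{x_i \in D_x} \frac{\|x_i - g(f(x_i))\|^2}{N} + \frac{1}{S(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} + \frac{1}{N(N-1)} \sum_{\substack{z_i, z_j \in D_z^f \\ j \neq i}} k_{i,j} - \frac{2}{SN} \sum_{z_i \in D_z} \sum_{z_j \in D_z^f} k_{i,j}
\tag{19}
\]
Therefore, we want to prove the following theorem:

Theorem 2. Given the objective in (19), $h > 2$, $\Omega_z$ a compact set, $\Omega_x = [-M, M]^d$ for positive scalar $M$, and a symmetric, continuous and positive definite kernel $k : \Omega_z \times \Omega_z \to \mathbb{R}$, where $0 \le k(z, z') \le K$ for all $z, z' \in \Omega_z$ with $K = k(z, z)$.
(a) There exists $\tilde{k} : \Omega_z \times \Omega_z \to \mathbb{R}$ such that $k(z, z') = \int_{\Omega_z} \tilde{k}(z, u)\tilde{k}(z', u)\,du$.
(b) The objective in (19) is equivalent to
\[
\tilde{L}(f, g) = \sum_{x_i \in D_x} \frac{\|x_i - g(f(x_i))\|^2}{N} + I(\tilde{p}_Z, \tilde{q}_Z) - \frac{K}{S} - \frac{K}{N} + \frac{1}{S^2(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} + \frac{1}{N^2(N-1)} \sum_{\substack{z_i, z_j \in D_z^f \\ j \neq i}} k_{i,j}
\]
where $I(\tilde{p}_Z, \tilde{q}_Z) \doteq \int_{\Omega_z} \big(\tilde{p}_Z(u) - \tilde{q}_Z(u)\big)^2 du$, and $\tilde{p}_Z(u) = \frac{1}{S}\sum_{z_i \in D_z} \tilde{k}(z_i, u)$ and $\tilde{q}_Z(u) = \frac{1}{N}\sum_{z_i \in D_z^f} \tilde{k}(z_i, u)$ are the kernel density approximations of $p_Z(u)$ and $q_Z(u)$.
(c) If the reconstruction error $\|x - g(f(x))\|^2$ can be made small $\forall x \in \Omega_x$, such that it can be bounded by a small value $\xi$, then, for any $s, u, v, t > 0$,
\[
\Pr\big\{ |\hat{L} - L| > t + s + u + v \big\} \le 2\exp\Big\{-\frac{2Nt^2}{\xi^2}\Big\} + 2\exp\Big\{-\frac{2\lfloor S/2 \rfloor s^2}{K^2}\Big\} + 2\exp\Big\{-\frac{2\lfloor N/2 \rfloor u^2}{K^2}\Big\} + 2\exp\Big\{-\frac{2\min\{N, S\}v^2}{K^2}\Big\}
\]

Proof. Property (a) can be proved using Mercer's theorem (see Theorem 1 in (Hochreiter & Obermayer, 2005)).

Now we prove property (b). Firstly, we prove the following two equalities:
\[
\begin{aligned}
\frac{1}{S(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} &= \frac{1}{S^2} \sum_{z_i, z_j \in D_z} k_{i,j} + \frac{1}{S^2(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} - \frac{K}{S} \\
\frac{1}{N(N-1)} \sum_{\substack{z_i, z_j \in D_z^f \\ j \neq i}} k_{i,j} &= \frac{1}{N^2} \sum_{z_i, z_j \in D_z^f} k_{i,j} + \frac{1}{N^2(N-1)} \sum_{\substack{z_i, z_j \in D_z^f \\ j \neq i}} k_{i,j} - \frac{K}{N}
\end{aligned}
\tag{20}
\]
We do it only for the first equality in (20), since the other one requires the same procedure. Note that
\[
\begin{aligned}
\frac{1}{S(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} &= \frac{1}{S(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} - \frac{1}{S^2} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} + \frac{1}{S^2} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} \\
&= \frac{1}{S^2(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} + \frac{1}{S^2} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} \\
&= \frac{1}{S^2(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} + \frac{1}{S^2} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} + \frac{1}{S^2} \sum_{z_i \in D_z} k_{i,i} - \frac{1}{S^2} \sum_{z_i \in D_z} k_{i,i} \\
&= \frac{1}{S^2} \sum_{z_i, z_j \in D_z} k_{i,j} + \frac{1}{S^2(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} - \frac{1}{S^2} \sum_{z_i \in D_z} k_{i,i} \\
&= \frac{1}{S^2} \sum_{z_i, z_j \in D_z} k_{i,j} + \frac{1}{S^2(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} - \frac{K}{S}
\end{aligned}
\tag{21}
\]
Now we can substitute (20) into (19), namely
\[
\begin{aligned}
\hat{L}(f, g) &= \frac{1}{N} \sum_{x_i \in D_x} \|x_i - g(f(x_i))\|^2 + \frac{1}{S(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} + \frac{1}{N(N-1)} \sum_{\substack{z_i, z_j \in D_z^f \\ j \neq i}} k_{i,j} - \frac{2}{SN} \sum_{z_i \in D_z} \sum_{z_j \in D_z^f} k_{i,j} \\
&= \frac{1}{N} \sum_{x_i \in D_x} \|x_i - g(f(x_i))\|^2 + \Bigg[ \frac{1}{S^2} \sum_{z_i, z_j \in D_z} k_{i,j} + \frac{1}{N^2} \sum_{z_i, z_j \in D_z^f} k_{i,j} - \frac{2}{SN} \sum_{z_i \in D_z} \sum_{z_j \in D_z^f} k_{i,j} \Bigg] \\
&\quad + \frac{1}{S^2(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} + \frac{1}{N^2(N-1)} \sum_{\substack{z_i, z_j \in D_z^f \\ j \neq i}} k_{i,j} - \frac{K}{S} - \frac{K}{N}
\end{aligned}
\tag{22}
\]
In order to conclude the proof, it remains to show that
\[
\frac{1}{S^2} \sum_{z_i, z_j \in D_z} k_{i,j} + \frac{1}{N^2} \sum_{z_i, z_j \in D_z^f} k_{i,j} - \frac{2}{SN} \sum_{z_i \in D_z} \sum_{z_j \in D_z^f} k_{i,j} = I(\tilde{p}_Z, \tilde{q}_Z)
\tag{23}
\]
Note that from statement (a), we have that $k_{i,j} = \int_{\Omega_z} \tilde{k}(z_i, u)\tilde{k}(z_j, u)\,du$. Therefore,
\[
\begin{aligned}
&\frac{1}{S^2} \sum_{z_i, z_j \in D_z} k_{i,j} + \frac{1}{N^2} \sum_{z_i, z_j \in D_z^f} k_{i,j} - \frac{2}{SN} \sum_{z_i \in D_z} \sum_{z_j \in D_z^f} k_{i,j} \\
&= \int_{\Omega_z} \Bigg[ \frac{1}{S^2} \sum_{z_i, z_j \in D_z} \tilde{k}(z_i, u)\tilde{k}(z_j, u) + \frac{1}{N^2} \sum_{z_i, z_j \in D_z^f} \tilde{k}(z_i, u)\tilde{k}(z_j, u) - \frac{2}{SN} \sum_{z_i \in D_z} \sum_{z_j \in D_z^f} \tilde{k}(z_i, u)\tilde{k}(z_j, u) \Bigg] du \\
&= \int_{\Omega_z} \Bigg[ \frac{1}{S^2} \sum_{z_i \in D_z} \tilde{k}(z_i, u) \sum_{z_j \in D_z} \tilde{k}(z_j, u) + \frac{1}{N^2} \sum_{z_i \in D_z^f} \tilde{k}(z_i, u) \sum_{z_j \in D_z^f} \tilde{k}(z_j, u) - \frac{2}{SN} \sum_{z_i \in D_z} \tilde{k}(z_i, u) \sum_{z_j \in D_z^f} \tilde{k}(z_j, u) \Bigg] du \\
&= \int_{\Omega_z} \Bigg[ \frac{1}{S} \sum_{z_i \in D_z} \tilde{k}(z_i, u) - \frac{1}{N} \sum_{z_j \in D_z^f} \tilde{k}(z_j, u) \Bigg]^2 du \\
&= I(\tilde{p}_Z, \tilde{q}_Z)
\end{aligned}
\tag{24}
\]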
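Identity (21) can be verified numerically for any bounded kernel with constant diagonal $k(z, z) = K$. The following minimal sketch does so with a Gaussian kernel, for which $K = 1$. The Gaussian kernel is only a stand-in that satisfies the boundedness assumption of Theorem 2; it is not the Coulomb kernel, and the function names and sample sizes are hypothetical.

```python
import math
import random

def gaussian_kernel(a, b, bandwidth=1.0):
    """Bounded kernel with k(z, z) = K = 1 (a stand-in, not the Coulomb kernel)."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def u_statistic(samples, kernel):
    """Left-hand side of (21): average of k over ordered pairs with j != i."""
    s = len(samples)
    off_diag = sum(kernel(samples[i], samples[j])
                   for i in range(s) for j in range(s) if j != i)
    return off_diag / (s * (s - 1))

def corrected_v_statistic(samples, kernel, k_max):
    """Right-hand side of (21): V-statistic plus the two correction terms."""
    s = len(samples)
    full = sum(kernel(a, b) for a in samples for b in samples)
    off_diag = full - sum(kernel(a, a) for a in samples)
    return full / s ** 2 + off_diag / (s ** 2 * (s - 1)) - k_max / s

# Illustrative data (assumption): S = 50 samples from a 3-dimensional standard normal.
random.seed(0)
prior_samples = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]
lhs = u_statistic(prior_samples, gaussian_kernel)
rhs = corrected_v_statistic(prior_samples, gaussian_kernel, k_max=1.0)
```

Both sides coincide up to floating-point error, which shows that the unbiased estimator in (19) differs from the squared-integral form $I(\tilde{p}_Z, \tilde{q}_Z)$ only by terms that shrink as $S$ and $N$ grow.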

In order to prove property (c), we first derive the statistical bounds for the two addends in (2), and then combine these results in the final bound.

Consider the reconstruction error term in (2) and define $\xi_x \doteq \|x - g(f(x))\|^2$. Note that $\Omega_x = [-M, M]^d$ and therefore $\xi_x$ is bounded in the interval $[0, 4M^2 d]$. By considering $\xi_x$ a random variable, we can apply Hoeffding's inequality (see Theorem 2 in (Hoeffding, 1963)) to obtain the following statistical bound:
\[
P_0 \doteq \Pr\Bigg\{ \Bigg| \frac{1}{N} \sum_{x \in D_x} \xi_x - \int_{\Omega_x} \xi_x\,p_X(x)\,dx \Bigg| \ge t \Bigg\} \le 2\exp\Big\{-\frac{2N^2t^2}{\xi^2 N}\Big\} = 2\exp\Big\{-\frac{2Nt^2}{\xi^2}\Big\}
\tag{25}
\]
where $t$ is an arbitrary small positive constant. We can then proceed to find the bound for the other terms in (19). In particular, using the one-sample and two-sample U-statistics in (Hoeffding, 1963) (see p. 25), we obtain the following bounds:
\[
\begin{aligned}
P_1 &\doteq \Pr\Bigg\{ \Bigg| \frac{1}{S(S-1)} \sum_{\substack{z_i, z_j \in D_z \\ j \neq i}} k_{i,j} - \int_{\Omega_z}\int_{\Omega_z} p_Z(z)p_Z(z')k(z, z')\,dz\,dz' \Bigg| \ge s \Bigg\} \le 2\exp\Big\{-\frac{2\lfloor S/2 \rfloor s^2}{K^2}\Big\} \\
P_2 &\doteq \Pr\Bigg\{ \Bigg| \frac{1}{N(N-1)} \sum_{\substack{z_i, z_j \in D_z^f \\ j \neq i}} k_{i,j} - \int_{\Omega_z}\int_{\Omega_z} q_Z(z)q_Z(z')k(z, z')\,dz\,dz' \Bigg| \ge u \Bigg\} \le 2\exp\Big\{-\frac{2\lfloor N/2 \rfloor u^2}{K^2}\Big\} \\
P_3 &\doteq \Pr\Bigg\{ \Bigg| -\frac{2}{SN} \sum_{z_i \in D_z} \sum_{z_j \in D_z^f} k_{i,j} + 2\int_{\Omega_z}\int_{\Omega_z} p_Z(z)q_Z(z')k(z, z')\,dz\,dz' \Bigg| \ge v \Bigg\} \le 2\exp\Big\{-\frac{2\min\{N, S\}v^2}{K^2}\Big\}
\end{aligned}
\tag{26}
\]
For brevity, denote by $\Delta_0, \Delta_1, \Delta_2, \Delta_3$ the quantities inside the absolute values in the definitions of $P_0, P_1, P_2, P_3$, respectively. Then, we can get the following lower bound:
\[
\begin{aligned}
P_0 + P_1 + P_2 + P_3 &\ge \Pr\Big\{ \{|\Delta_0| \ge t\} \cup \{|\Delta_1| \ge s\} \cup \{|\Delta_2| \ge u\} \cup \{|\Delta_3| \ge v\} \Big\} \\
&\ge \Pr\Big\{ |\Delta_0| + |\Delta_1| + |\Delta_2| + |\Delta_3| \ge t + s + u + v \Big\} \\
&\ge \Pr\Big\{ |\Delta_0 + \Delta_1 + \Delta_2 + \Delta_3| \ge t + s + u + v \Big\} \\
&= \Pr\Bigg\{ \Bigg| \hat{L} - \int_{\Omega_x} \xi_x\,p_X(x)\,dx - \int_{\Omega_z}\int_{\Omega_z} \big(p_Z(z) - q_Z(z)\big)\big(p_Z(z') - q_Z(z')\big)k(z, z')\,dz\,dz' \Bigg| \ge t + s + u + v \Bigg\} \\
&= \Pr\Big\{ |\hat{L} - L| \ge t + s + u + v \Big\}
\end{aligned}
\tag{27}
\]
where the first inequality in (27) is obtained by applying the union bound. Finally, by using the results in (25), (26) and (27) we get statement (c).

E. Experiments on Grid Dataset
Table 3 provides the details of the network architectures. Hyperbolic tangent activations are used as output activations of encoders, while the outputs of decoders/generators are linear. For all methods, we use ReLU hidden activations. CouGAN uses ELU (Unterthiner et al., 2018), but we also provide results with ReLU. Furthermore, we show results of VEE with and without batch normalization (BN). See Figure 8.

Figure 8. Grid dataset. Panels: (a) CouGAN (ReLU), (b) CouGAN (ELU), (c) VEE w BN, (d) VEE w/o BN.

Table 3. Network architectures (neurons per layer; "-" marks an absent entry).

Method    Encoder          Decoder           Discriminator
CouAE     2,128,128,2      2,128,128,2       -
CouGAN    -                2,128,128,2       2,128,128,1
WGAN      -                2,128,128,2       2,128,128,1
BiGAN     2,128,128,2      2,128,128,2       4,128,128,1
VEE       2,128,128,254    254+1,128,128,2   256,128,128,1
VAE       2,128,128,4      2,128,128,2       -
AAE       2,128,128,2      2,128,128,2       2,128,128,1
AVB-AC    2+64,128,128,2   2,128,128,2       4,128,128,1
WAE       2,128,128,2      2,128,128,2       -

F. Experiments on Low Dimensional Embedding Dataset
Table 4 provides the details of the network architectures. Hyperbolic tangent activations are used as output activations of encoders, while the outputs of decoders/generators are linear. For all methods, we use ReLU hidden activations. CouGAN uses ELU (Unterthiner et al., 2018), but we also provide results with ReLU. Furthermore, we report results of BiGAN, VEE and AAE with and without batch normalization (BN). See Table 5.

Table 4. Network architectures (neurons per layer; "-" marks an absent entry).

Method    Encoder              Decoder             Discriminator
CouAE     1000,128,128,10      10,128,128,1000     -
CouGAN    -                    10,128,128,1000     1000,128,128,1
WGAN      -                    10,128,128,1000     1000,128,128,1
BiGAN     1000,128,128,10      1000,128,128,10     1010,128,128,1
VEE       1000,128,128,10      10+1,128,128,1000   1010,128,128,1
VAE       1000,128,128,20      10,128,128,1000     -
AAE       1000,128,128,10      10,128,128,1000     10,128,128,1
AVB-AC    1000+64,128,128,10   10,128,128,1000     1010,128,128,1
WAE       1000,128,128,10      10,128,128,1000     -

G. Experiments on Stacked-MNIST
Table 6 provides the details of the network architectures. Hyperbolic tangent activations are used as output activations of encoders, while the outputs of decoders/generators are sigmoid (standard logistic activations). For all methods, we use ReLU hidden activations. CouGAN uses ELU (Unterthiner et al., 2018), but we also provide results with ReLU. Furthermore, we report results of BiGAN, VEE and AAE with and without batch normalization (BN). See Figure 9.

H. Experiments on CIFAR-100
Table 7 provides the details of the network architectures. Hyperbolic tangent activations are used as output activations of encoders, while the outputs of decoders/generators are sigmoid (standard logistic activations). For all methods, we use ReLU hidden activations. CouGAN uses ELU (Unterthiner et al., 2018).

Table 5. Test log-likelihood (LL) for different models on low dimensional embedding dataset.

Method           LL
CouGAN (ELU)     FAILURE
CouGAN (ReLU)    FAILURE
BiGAN w BN       -3940.3±3519.4
BiGAN w/o BN     -55528.9±45685.8
VEE w BN         -157286274.0±1036296.3
VEE w/o BN       -7457806.0±4036262.3
AAE w BN         1025.9±177.1
AAE w/o BN       -44627.7±39387.2

Figure 10 highlights the fact that only VAE, WAE and our model are able to preserve the semantic consistency of the latent space.


Table 6. Network architectures for Stacked-MNIST (neurons per layer; "-" marks an absent entry).

Method    Encoder                               Decoder                              Discriminator
CouAE     2352,2500,2000,1500,1000,500,10       10,500,1000,1500,2000,2500,2352      -
CouGAN    -                                     10,500,1000,1500,2000,2500,2352      2352,2500,2000,1500,1000,500,1
WGAN      -                                     10,500,1000,1500,2000,2500,2352      2352,2500,2000,1500,1000,500,1
BiGAN     2352,2500,2000,1500,1000,500,10       10,500,1000,1500,2000,2500,2352      2362,2500,2000,1500,1000,500,1
VEE       2352,2500,2000,1500,1000,500,10       10+1,500,1000,1500,2000,2500,2352    2362,2500,2000,1500,1000,500,1
VAE       2352,2500,2000,1500,1000,500,20       10,500,1000,1500,2000,2500,2352      -
AAE       2352,2500,2000,1500,1000,500,10       10,500,1000,1500,2000,2500,2352      10,2500,2000,1500,1000,500,1
AVB-AC    2352+64,2500,2000,1500,1000,500,10    10,500,1000,1500,2000,2500,2352      2362,2500,2000,1500,1000,500,1
WAE       2352,2500,2000,1500,1000,500,10       10,500,1000,1500,2000,2500,2352      -



Figure 9. Stacked-MNIST dataset. Panels: (a) CouGAN (ReLU), (b) CouGAN (ELU), (c) BiGAN w BN, (d) BiGAN w/o BN, (e) VEE w BN, (f) VEE w/o BN, (g) AAE w BN, (h) AAE w/o BN.


Table 7. Network architectures for CIFAR-100 (neurons per layer; "-" marks an absent entry).

Method    Encoder                               Decoder                              Discriminator
CouAE     3072,3072,3072,3072,3072,3072,5       5,3072,3072,3072,3072,3072,3072      -
CouGAN    -                                     5,3072,3072,3072,3072,3072,3072      3072,3072,3072,3072,3072,3072,1
WGAN      -                                     5,3072,3072,3072,3072,3072,3072      3072,3072,3072,3072,3072,3072,1
BiGAN     3072,3072,3072,3072,3072,3072,5       5,3072,3072,3072,3072,3072,3072      3077,3072,3072,3072,3072,3072,1
VEE       3072,3072,3072,3072,3072,3072,5       5+1,3072,3072,3072,3072,3072,3072    3077,3072,3072,3072,3072,3072,1
VAE       3072,3072,3072,3072,3072,3072,10      5,3072,3072,3072,3072,3072,3072      -
AAE       3072,3072,3072,3072,3072,3072,5       5,3072,3072,3072,3072,3072,3072      5,3072,3072,3072,3072,3072,1
AVB-AC    3072+64,3072,3072,3072,3072,3072,5    5,3072,3072,3072,3072,3072,3072      3077,3072,3072,3072,3072,3072,1
WAE       3072,3072,3072,3072,3072,3072,5       5,3072,3072,3072,3072,3072,3072      -


Figure 10. Conditional generation of samples. The first column of each plot contains true samples, while the other columns are obtained by generating samples from the latent codes associated with true data and perturbing the latent representation with additive isotropic Gaussian noise (0.005 element variance). Panels: (a) CouAE, (b) BiGAN, (c) VEE, (d) VAE, (e) AAE, (f) AVB-AC, (g) WAE.
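The perturbation procedure described in the caption of Figure 10 can be sketched as follows. This is a hypothetical illustration and not the code used to produce the figure: the function name `perturb_latents`, the number of copies per code and the example latent codes are assumptions, and the trained encoder and decoder are only referred to in a comment.

```python
import math
import random

def perturb_latents(codes, variance=0.005, n_copies=5, seed=0):
    """Return n_copies noisy versions of each latent code (isotropic Gaussian noise)."""
    rng = random.Random(seed)
    std = math.sqrt(variance)  # per-element standard deviation
    return [[[c + rng.gauss(0.0, std) for c in code] for _ in range(n_copies)]
            for code in codes]

# Hypothetical usage: `encode` and `decode` would stand for the trained networks f and g,
# e.g. rows = perturb_latents([encode(x) for x in batch]), then decode each noisy code.
codes = [[0.1, -0.3, 0.2], [0.0, 0.5, -0.5]]  # made-up latent codes
noisy = perturb_latents(codes)
```

Decoding each row of `noisy` next to the corresponding true sample would give one row of a plot such as those in Figure 10. Semantic consistency then means that all decoded images in a row depict the same content.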
