Gaussian Auto-Encoder
Jarek Duda
Jagiellonian University, Golebia 24, 31-007 Krakow, Poland. Email: [email protected]
Abstract—Evaluating the distance between a sample distribution and the wanted one, usually Gaussian, is a difficult task required to train generative Auto-Encoders. After the original Variational Auto-Encoder (VAE) using KL divergence, superiority was claimed for distances based on the Wasserstein metric (WAE, SWAE), and for the L2 distance of KDE Gaussian-smoothened samples over all 1D projections (CWAE). This article also derives formulas for the L2 distance of KDE Gaussian-smoothened samples, but this time directly using multivariate Gaussians, additionally optimizing a position-dependent covariance matrix with a mean-field approximation, for application in a purely Gaussian Auto-Encoder (GAE).
I. INTRODUCTION

Generative Auto-Encoders should obtain a probability distribution in the latent space that is close to a chosen prior distribution, usually Gaussian. The original Variational Auto-Encoders (VAE) [1] use variational methods with the Kullback-Leibler distance/divergence to reach a close distribution. Later articles claim improvements by directly using a distance, e.g. one based on the Wasserstein metric (WAE) [2], its simpler "sliced" (projected) variant (SWAE) [3], or recently the L2 distance of (kernel density estimation) KDE-smoothened 1D projections, averaged over all of them (CWAE) [4].

Closeness of 1D projections is a slightly different criterion from the closeness of multivariate densities these methods should achieve. Therefore, this paper derives analytic formulas for the L2 distance between KDE-smoothened densities using multivariate Gaussian distributions directly, to be applied in a Gaussian Auto-Encoder (GAE), by replacing the distance formula e.g. in the CWAE approach [4], which is planned to be tested in the near future. A mean-field approximation is also proposed to optimize the covariance matrix, which may additionally depend on position.

II. DERIVATIONS AND FORMULAS
A. Integral of a product of multivariate Gaussians: $d_{\mu,\Sigma,\Gamma}$

The density of the multivariate $D$-dimensional Gaussian distribution $\mathcal{N}(\mu,\Sigma)$, with center $\mu$ and covariance matrix $\Sigma$ (real, symmetric, positive definite, $|\Sigma| \equiv \det \Sigma > 0$), is:
$$\rho_{\mu,\Sigma}(x) := \frac{1}{\sqrt{|2\pi\Sigma|}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)} \qquad (1)$$
We first need a formula for the integral of the product of two such densities, with covariance matrices $\Sigma$ and $\Gamma$, shifted by a vector $\mu$. Due to translational invariance, we can choose the centers of these Gaussians as $\mu$ and $0 := (0,\ldots,0)$:
$$d_{\mu,\Sigma,\Gamma} := \int_{\mathbb{R}^D} \rho_{\mu,\Sigma}(x)\, \rho_{0,\Gamma}(x)\, dx = \frac{1}{(2\pi)^D\sqrt{|\Sigma||\Gamma|}} \int_{\mathbb{R}^D} e^{-\frac{(x-\mu)^T\Sigma^{-1}(x-\mu) + x^T\Gamma^{-1}x}{2}}\, dx \qquad (2)$$
Transforming the numerator of the exponent we get:
$$\alpha := x^T(\Sigma^{-1} + \Gamma^{-1})x - 2x^T\Sigma^{-1}\mu + \mu^T\Sigma^{-1}\mu$$
Denoting $\Pi = (\Sigma^{-1}+\Gamma^{-1})^{-1}$ and $\nu = \Pi\Sigma^{-1}\mu$, we get:
$$\alpha = x^T\Pi^{-1}x - 2x^T\Pi^{-1}\nu + \mu^T\Sigma^{-1}\mu = (x-\nu)^T\Pi^{-1}(x-\nu) - \nu^T\Pi^{-1}\nu + \mu^T\Sigma^{-1}\mu$$
Now, knowing that $\frac{1}{\sqrt{|2\pi\Pi|}}\int_{\mathbb{R}^D} e^{-\frac{1}{2}(x-\nu)^T\Pi^{-1}(x-\nu)}\,dx = 1$,
we can remove the integral from (2):
$$d_{\mu,\Sigma,\Gamma} = \frac{\sqrt{(2\pi)^D|\Pi|}}{(2\pi)^D\sqrt{|\Sigma||\Gamma|}}\, e^{\frac{1}{2}\left(\nu^T\Pi^{-1}\nu - \mu^T\Sigma^{-1}\mu\right)}$$
Substituting $\Pi = (\Sigma^{-1}+\Gamma^{-1})^{-1}$, $\nu = \Pi\Sigma^{-1}\mu$:
$$\nu^T\Pi^{-1}\nu - \mu^T\Sigma^{-1}\mu = \mu^T\Sigma^{-1}\Pi\,\Pi^{-1}\,\Pi\Sigma^{-1}\mu - \mu^T\Sigma^{-1}\mu = \mu^T\Sigma^{-1}\left(\Pi\Sigma^{-1} - \Pi\Pi^{-1}\right)\mu = -\mu^T\Sigma^{-1}(\Sigma^{-1}+\Gamma^{-1})^{-1}\Gamma^{-1}\mu$$
Observe that, as required, it does not change when switching $\Sigma$ and $\Gamma$: indeed $\Sigma^{-1}(\Sigma^{-1}+\Gamma^{-1})^{-1}\Gamma^{-1} = \left(\Gamma(\Sigma^{-1}+\Gamma^{-1})\Sigma\right)^{-1} = (\Sigma+\Gamma)^{-1}$. We can now get the final formula:
$$d_{\mu,\Sigma,\Gamma} = \frac{\exp\left(-\frac{1}{2}\,\mu^T\Sigma^{-1}(\Sigma^{-1}+\Gamma^{-1})^{-1}\Gamma^{-1}\mu\right)}{\sqrt{(2\pi)^D\,|\Sigma||\Gamma||\Sigma^{-1}+\Gamma^{-1}|}} \qquad (3)$$
Let us also find its special case for spherically symmetric Gaussians, $d_{l,\sigma^2,\gamma^2} := d_{l\hat{\mu},\,\sigma^2 I,\,\gamma^2 I}$ for any unit vector $\hat{\mu}$:
$$d_{l,\sigma^2,\gamma^2} = \frac{\exp\left(-\frac{1}{2}\, l^2\, \sigma^{-2}(\sigma^{-2}+\gamma^{-2})^{-1}\gamma^{-2}\right)}{\sqrt{(2\pi)^D\left(\sigma^2\gamma^2(\sigma^{-2}+\gamma^{-2})\right)^D}} = \frac{\exp\left(-\frac{1}{2}\frac{l^2}{\sigma^2+\gamma^2}\right)}{\sqrt{2\pi(\sigma^2+\gamma^2)}^{\,D}} \qquad (4)$$
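To make these formulas concrete, below is a minimal NumPy sketch (an illustration added here, not code from the paper; all names are ours) that evaluates (3) directly and checks it against the spherical special case (4):

```python
# Illustrative sketch (not from the paper): evaluate formula (3) in NumPy
# and check it against the spherical special case (4).
import numpy as np

def d_gauss(mu, Sigma, Gamma):
    """Integral of a product of two Gaussian densities, formula (3)."""
    D = len(mu)
    Si, Gi = np.linalg.inv(Sigma), np.linalg.inv(Gamma)
    quad = mu @ Si @ np.linalg.inv(Si + Gi) @ Gi @ mu
    norm = (2 * np.pi) ** D * np.linalg.det(Sigma) \
         * np.linalg.det(Gamma) * np.linalg.det(Si + Gi)
    return np.exp(-0.5 * quad) / np.sqrt(norm)

# check against (4): arbitrary test values for D, l, sigma^2, gamma^2
D, l, s2, g2 = 5, 1.7, 0.8, 1.3
mu = np.zeros(D); mu[0] = l  # l times a unit vector
d4 = np.exp(-0.5 * l**2 / (s2 + g2)) / np.sqrt(2 * np.pi * (s2 + g2)) ** D
assert np.isclose(d_gauss(mu, s2 * np.eye(D), g2 * np.eye(D)), d4)
```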
B. L2 distance between two smoothened samples

Having two samples $X = (x_i)_{i=1..n}$ and $Y = (y_j)_{j=1..m}$ in $\mathbb{R}^D$, we would like to KDE-smoothen them using multivariate Gaussians, then define the distance as the L2 norm of the difference between the smoothened samples. For full generality, let us start by assuming that each point has a separately chosen covariance matrix for its Gaussian: we have some matrices $(\Sigma_i)_{i=1..n}$ and $(\Gamma_j)_{j=1..m}$. Such a squared "Gaussian" distance between these samples, depending on the choice of covariance matrices, is:
$$d_g^2(X,Y) = \int_{\mathbb{R}^D} \left(\frac{1}{n}\sum_i \rho_{x_i,\Sigma_i}(x) - \frac{1}{m}\sum_j \rho_{y_j,\Gamma_j}(x)\right)^2 dx = \sum_{i,i'=1}^n \frac{d_{x_i-x_{i'},\Sigma_i,\Sigma_{i'}}}{n^2} + \sum_{j,j'=1}^m \frac{d_{y_j-y_{j'},\Gamma_j,\Gamma_{j'}}}{m^2} - 2\sum_{i=1}^n\sum_{j=1}^m \frac{d_{x_i-y_j,\Sigma_i,\Gamma_j}}{nm}$$
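A direct implementation of this expansion can reuse the d_gauss function from the earlier sketch; again, this is a hypothetical illustration, not the paper's code:

```python
# Hypothetical sketch (not from the paper): d_g^2(X, Y) with per-point
# covariance matrices, computed as the three double sums above.
def d_g2(X, Sigmas, Y, Gammas):
    n, m = len(X), len(Y)
    xx = sum(d_gauss(X[i] - X[k], Sigmas[i], Sigmas[k])
             for i in range(n) for k in range(n)) / n**2
    yy = sum(d_gauss(Y[j] - Y[k], Gammas[j], Gammas[k])
             for j in range(m) for k in range(m)) / m**2
    xy = sum(d_gauss(X[i] - Y[j], Sigmas[i], Gammas[j])
             for i in range(n) for j in range(m)) / (n * m)
    return xx + yy - 2 * xy
```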
As we have the freedom of choosing $(\Sigma_i)_{i=1..n}$, $(\Gamma_j)_{j=1..m}$, the above formula allows optimizing this choice: we can use the distance resulting e.g. from its iterative minimization. The initial choice can be found with the mean-field approximation discussed later. Let us also find a more practical formula for the basic choice of all covariance matrices being $\sigma^2 I$:
$$d_g^2(X,Y)\cdot\sqrt{4\pi\sigma^2}^{\,D} = \sum_{i,i'=1}^n \frac{e^{-\frac{\|x_i-x_{i'}\|^2}{4\sigma^2}}}{n^2} + \sum_{j,j'=1}^m \frac{e^{-\frac{\|y_j-y_{j'}\|^2}{4\sigma^2}}}{m^2} - 2\sum_{i=1}^n\sum_{j=1}^m \frac{e^{-\frac{\|x_i-y_j\|^2}{4\sigma^2}}}{nm} \qquad (5)$$
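The double sums in (5) vectorize naturally; a hedged NumPy sketch (illustrative, with names of our choosing) that returns the left-hand side of (5), i.e. with the $\sqrt{4\pi\sigma^2}^{\,D}$ factor already multiplied in:

```python
# Illustrative sketch of formula (5), all covariances sigma^2 * I.
# Returns d_g^2(X, Y) * sqrt(4*pi*sigma^2)^D.
def d_g2_spherical(X, Y, sigma=1.0):
    def mean_kernel(A, B):
        # mean over all pairs of exp(-||a - b||^2 / (4 sigma^2))
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (4 * sigma**2)).mean()
    return mean_kernel(X, X) + mean_kernel(Y, Y) - 2 * mean_kernel(X, Y)
```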
We can drop the fixed $\sqrt{4\pi\sigma^2}^{\,D}$ factor when applying this formula, as it becomes very large in high dimensions. Interestingly, this formula turns out quite similar to the one for CWAE [4], which uses the function $\phi_D(s) \approx (1 + 4s/(2D-3))^{-1/2}$ in place of the exponential here; the exponential has much thinner tails. This similarity seems to come from a common origin: both use the L2 distance between Gaussian KDE-smoothened samples. However, CWAE calculates this distance for projections onto 1D subspaces and averages over all such directions, optimizing similarity of 1D projections. In contrast, here we directly want closeness of the multidimensional distributions, as in the original generative Auto-Encoder motivation. The above similarity of functions (an exponential versus the approximately one-over-square-root $\phi_D$) suggests it might also be worth testing functions with different types of tails.
For generative Auto-Encoders we are more interested in calculating the distance from the Gaussian distribution $\mathcal{N}(0,I)$, optimized during training together with some evaluation of the distortion introduced by the encoding-decoding process. Let us now use $\mathcal{N}(0,I)$ in place of $Y$ from the previous subsection:
$$d_g^2(X,\mathcal{N}(0,I)) = \int_{\mathbb{R}^D}\left(\frac{1}{n}\sum_i \rho_{x_i,\Sigma_i}(x) - \rho_{0,I}(x)\right)^2 dx = \frac{1}{n^2}\sum_{i,i'=1}^n d_{x_i-x_{i'},\Sigma_i,\Sigma_{i'}} + \frac{1}{\sqrt{4\pi}^{\,D}} - \frac{2}{n}\sum_{i=1}^n d_{x_i,\Sigma_i,I} \qquad (6)$$
Using the simplest, spherically symmetric $\Sigma_i = \sigma_i^2 I$, for example with constant $\sigma_i = \sigma$, we get:
$$d_g^2(X,\mathcal{N}(0,I))\cdot\sqrt{4\pi}^{\,D} = \frac{1}{n^2}\sum_{i,i'=1}^n \sigma^{-D}\, e^{-\frac{\|x_i-x_{i'}\|^2}{4\sigma^2}} - \frac{2}{n}\sum_{i=1}^n \left(\frac{2}{\sigma^2+1}\right)^{D/2} e^{-\frac{\|x_i\|^2}{2(\sigma^2+1)}} + 1 \qquad (7)$$
For large $D$ this requires using $\sigma = 1 + \epsilon$ for a tiny $\epsilon \ge 0$, allowing to manipulate the weights of the two above sums. For the simplest choice $\sigma = 1$, formula (7) becomes inexpensive:
$$1 + \frac{1}{n^2}\sum_{i,i'=1}^n e^{-\frac{\|x_i-x_{i'}\|^2}{4}} - \frac{2}{n}\sum_{i=1}^n e^{-\frac{\|x_i\|^2}{4}}$$
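This $\sigma = 1$ case could serve directly as a latent-space penalty; a minimal sketch under the stated assumptions (our illustration, not the paper's code):

```python
# Illustrative sketch: the inexpensive sigma = 1 case of (7), i.e.
# d_g^2(X, N(0,I)) * sqrt(4*pi)^D for a latent batch X of shape (n, D),
# usable as a penalty added to an Auto-Encoder's reconstruction loss.
def gaussian_penalty(X):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise ||x_i - x_i'||^2
    return 1 + np.exp(-sq / 4).mean() - 2 * np.exp(-(X ** 2).sum(-1) / 4).mean()
```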