Gaussian Auto-Encoder
Jarek Duda
Jagiellonian University, Golebia 24, 31-007 Krakow, Poland. Email: [email protected]
Abstract—Evaluating the distance between a sample distribution and the wanted one, usually Gaussian, is a difficult task required to train generative Auto-Encoders. After the original Variational Auto-Encoder (VAE) using KL divergence, superiority was claimed for distances based on the Wasserstein metric (WAE, SWAE), and for the L2 distance of KDE Gaussian-smoothened samples over all 1D projections (CWAE). This article also derives formulas for the L2 distance of KDE Gaussian-smoothened samples, but this time directly using multivariate Gaussians, additionally optimizing a position-dependent covariance matrix with a mean-field approximation, for application in a purely Gaussian Auto-Encoder (GAE).
I. INTRODUCTION

Generative Auto-Encoders should obtain a probability distribution in the latent space that is close to a chosen prior distribution, usually Gaussian. The original Variational Auto-Encoders (VAE) [1] use variational methods with the Kullback-Leibler distance/divergence to reach a close distribution. Later articles claim improvements by directly using a distance, e.g. one based on the Wasserstein metric (WAE) [2], its simpler "sliced" (projected) variant (SWAE) [3], or recently the L2 distance of (kernel density estimation) KDE-smoothened 1D projections, averaged over all of them (CWAE) [4].

Closeness of 1D projections is a slightly different criterion from the closeness of multivariate densities these methods should achieve. Therefore, this paper derives analytic formulas for the L2 distance between KDE-smoothened densities using multivariate Gaussian distributions directly, to be applied in a Gaussian Auto-Encoder (GAE), by replacing the distance formula e.g. in the CWAE approach [4], which is planned to be tested in the near future. A mean-field approximation is also proposed to optimize the covariance matrix, which may additionally depend on position.

II. DERIVATIONS AND FORMULAS
A. Integral of a product of multivariate Gaussians: $d_{\mu,\Sigma,\Gamma}$

The density of the multivariate $D$-dimensional Gaussian distribution $\mathcal{N}(\mu,\Sigma)$, with center $\mu$ and covariance matrix $\Sigma$ (real, symmetric, positive definite, $|\Sigma| \equiv \det \Sigma > 0$), is:
$$\rho_{\mu,\Sigma}(x) := \frac{1}{\sqrt{|2\pi\Sigma|}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)} \qquad (1)$$
We first need a formula for the integral of the product of two such densities, with covariance matrices $\Sigma$ and $\Gamma$, shifted by a vector $\mu$. Due to translational invariance, we can choose the centers of these Gaussians as $\mu$ and $0 := (0,\ldots,0)$:
$$d_{\mu,\Sigma,\Gamma} := \int_{\mathbb{R}^D} \rho_{\mu,\Sigma}(x)\, \rho_{0,\Gamma}(x)\, dx = \frac{1}{(2\pi)^D\sqrt{|\Sigma||\Gamma|}} \int_{\mathbb{R}^D} e^{-\frac{(x-\mu)^T\Sigma^{-1}(x-\mu) + x^T\Gamma^{-1}x}{2}}\, dx \qquad (2)$$
Transforming the numerator of the exponent we get:
$$\alpha := x^T(\Sigma^{-1} + \Gamma^{-1})x - 2x^T\Sigma^{-1}\mu + \mu^T\Sigma^{-1}\mu$$
Denoting $\Pi = (\Sigma^{-1}+\Gamma^{-1})^{-1}$ and $\nu = \Pi\Sigma^{-1}\mu$, we get:
$$\alpha = x^T\Pi^{-1}x - 2x^T\Pi^{-1}\nu + \mu^T\Sigma^{-1}\mu = (x-\nu)^T\Pi^{-1}(x-\nu) - \nu^T\Pi^{-1}\nu + \mu^T\Sigma^{-1}\mu$$
Now, knowing that $\frac{1}{\sqrt{|2\pi\Pi|}}\int_{\mathbb{R}^D} e^{-\frac{1}{2}(x-\nu)^T\Pi^{-1}(x-\nu)}\,dx = 1$,
we can remove the integral from (2):
$$d_{\mu,\Sigma,\Gamma} = \frac{\sqrt{(2\pi)^D|\Pi|}}{(2\pi)^D\sqrt{|\Sigma||\Gamma|}}\, e^{\frac{1}{2}\left(\nu^T\Pi^{-1}\nu - \mu^T\Sigma^{-1}\mu\right)}$$
Substituting $\Pi = (\Sigma^{-1}+\Gamma^{-1})^{-1}$, $\nu = \Pi\Sigma^{-1}\mu$:
$$\nu^T\Pi^{-1}\nu - \mu^T\Sigma^{-1}\mu = \mu^T\Sigma^{-1}\Pi\,\Pi^{-1}\,\Pi\Sigma^{-1}\mu - \mu^T\Sigma^{-1}\mu = \mu^T\Sigma^{-1}\left(\Pi\Sigma^{-1} - \Pi\Pi^{-1}\right)\mu = -\mu^T\Sigma^{-1}(\Sigma^{-1}+\Gamma^{-1})^{-1}\Gamma^{-1}\mu$$
Observe that, as required, it does not change when switching $\Sigma$ and $\Gamma$: indeed $\Sigma^{-1}(\Sigma^{-1}+\Gamma^{-1})^{-1}\Gamma^{-1} = \left(\Gamma(\Sigma^{-1}+\Gamma^{-1})\Sigma\right)^{-1} = (\Sigma+\Gamma)^{-1}$. We can now get the final formula:
$$d_{\mu,\Sigma,\Gamma} = \frac{\exp\left(-\frac{1}{2}\,\mu^T\Sigma^{-1}(\Sigma^{-1}+\Gamma^{-1})^{-1}\Gamma^{-1}\mu\right)}{\sqrt{(2\pi)^D\,|\Sigma||\Gamma||\Sigma^{-1}+\Gamma^{-1}|}} \qquad (3)$$
Let us also find its special case for spherically symmetric Gaussians, $d_{l,\sigma^2,\gamma^2} := d_{l\hat{\mu},\,\sigma^2 I,\,\gamma^2 I}$ for any unit vector $\hat{\mu}$:
$$d_{l,\sigma^2,\gamma^2} = \frac{\exp\left(-\frac{1}{2}\, l^2\, \sigma^{-2}(\sigma^{-2}+\gamma^{-2})^{-1}\gamma^{-2}\right)}{\sqrt{(2\pi)^D\left(\sigma^2\gamma^2(\sigma^{-2}+\gamma^{-2})\right)^D}} = \frac{\exp\left(-\frac{1}{2}\frac{l^2}{\sigma^2+\gamma^2}\right)}{\sqrt{2\pi(\sigma^2+\gamma^2)}^{\,D}} \qquad (4)$$
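To make these formulas concrete, below is a minimal NumPy sketch (an illustration added here, not code from the paper; all names are ours) that evaluates (3) directly and checks it against the spherical special case (4):

```python
# Illustrative sketch (not from the paper): evaluate formula (3) in NumPy
# and check it against the spherical special case (4).
import numpy as np

def d_gauss(mu, Sigma, Gamma):
    """Integral of a product of two Gaussian densities, formula (3)."""
    D = len(mu)
    Si, Gi = np.linalg.inv(Sigma), np.linalg.inv(Gamma)
    quad = mu @ Si @ np.linalg.inv(Si + Gi) @ Gi @ mu
    norm = (2 * np.pi) ** D * np.linalg.det(Sigma) \
         * np.linalg.det(Gamma) * np.linalg.det(Si + Gi)
    return np.exp(-0.5 * quad) / np.sqrt(norm)

# check against (4): arbitrary test values for D, l, sigma^2, gamma^2
D, l, s2, g2 = 5, 1.7, 0.8, 1.3
mu = np.zeros(D); mu[0] = l  # l times a unit vector
d4 = np.exp(-0.5 * l**2 / (s2 + g2)) / np.sqrt(2 * np.pi * (s2 + g2)) ** D
assert np.isclose(d_gauss(mu, s2 * np.eye(D), g2 * np.eye(D)), d4)
```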
B. L2 distance between two smoothened samples

Having two samples $X = (x_i)_{i=1..n}$ and $Y = (y_j)_{j=1..m}$ in $\mathbb{R}^D$, we would like to KDE-smoothen them using multivariate Gaussians, then define the distance as the L2 norm of the difference between the smoothened samples. For full generality, let us start by assuming that each point has a separately chosen covariance matrix for its Gaussian: we have some matrices $(\Sigma_i)_{i=1..n}$ and $(\Gamma_j)_{j=1..m}$. Such a squared "Gaussian" distance between these samples, depending on the choice of covariance matrices, is:
$$d_g^2(X,Y) = \int_{\mathbb{R}^D} \left(\frac{1}{n}\sum_i \rho_{x_i,\Sigma_i}(x) - \frac{1}{m}\sum_j \rho_{y_j,\Gamma_j}(x)\right)^2 dx = \sum_{i,i'=1}^n \frac{d_{x_i-x_{i'},\Sigma_i,\Sigma_{i'}}}{n^2} + \sum_{j,j'=1}^m \frac{d_{y_j-y_{j'},\Gamma_j,\Gamma_{j'}}}{m^2} - 2\sum_{i=1}^n\sum_{j=1}^m \frac{d_{x_i-y_j,\Sigma_i,\Gamma_j}}{nm}$$
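A direct implementation of this expansion can reuse the d_gauss function from the earlier sketch; again, this is a hypothetical illustration, not the paper's code:

```python
# Hypothetical sketch (not from the paper): d_g^2(X, Y) with per-point
# covariance matrices, computed as the three double sums above.
def d_g2(X, Sigmas, Y, Gammas):
    n, m = len(X), len(Y)
    xx = sum(d_gauss(X[i] - X[k], Sigmas[i], Sigmas[k])
             for i in range(n) for k in range(n)) / n**2
    yy = sum(d_gauss(Y[j] - Y[k], Gammas[j], Gammas[k])
             for j in range(m) for k in range(m)) / m**2
    xy = sum(d_gauss(X[i] - Y[j], Sigmas[i], Gammas[j])
             for i in range(n) for j in range(m)) / (n * m)
    return xx + yy - 2 * xy
```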
As we have the freedom of choosing $(\Sigma_i)_{i=1..n}$, $(\Gamma_j)_{j=1..m}$, the above formula allows optimizing this choice: we can use the distance resulting e.g. from its iterative minimization. The initial choice can be found with the mean-field approximation discussed later. Let us also find a more practical formula for the basic choice of all covariance matrices being $\sigma^2 I$:
$$d_g^2(X,Y)\cdot\sqrt{4\pi\sigma^2}^{\,D} = \sum_{i,i'=1}^n \frac{e^{-\frac{\|x_i-x_{i'}\|^2}{4\sigma^2}}}{n^2} + \sum_{j,j'=1}^m \frac{e^{-\frac{\|y_j-y_{j'}\|^2}{4\sigma^2}}}{m^2} - 2\sum_{i=1}^n\sum_{j=1}^m \frac{e^{-\frac{\|x_i-y_j\|^2}{4\sigma^2}}}{nm} \qquad (5)$$
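The double sums in (5) vectorize naturally; a hedged NumPy sketch (illustrative, with names of our choosing) that returns the left-hand side of (5), i.e. with the $\sqrt{4\pi\sigma^2}^{\,D}$ factor already multiplied in:

```python
# Illustrative sketch of formula (5), all covariances sigma^2 * I.
# Returns d_g^2(X, Y) * sqrt(4*pi*sigma^2)^D.
def d_g2_spherical(X, Y, sigma=1.0):
    def mean_kernel(A, B):
        # mean over all pairs of exp(-||a - b||^2 / (4 sigma^2))
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (4 * sigma**2)).mean()
    return mean_kernel(X, X) + mean_kernel(Y, Y) - 2 * mean_kernel(X, Y)
```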
We can drop the fixed $\sqrt{4\pi\sigma^2}^{\,D}$ factor when applying this formula, as it becomes very large in high dimensions. Interestingly, this formula turns out quite similar to the one for CWAE [4], which uses the function $\phi_D(s) \approx (1 + 4s/(2D-3))^{-1/2}$ in place of the exponential here; the exponential has much thinner tails. This similarity seems to come from a common origin: both use the L2 distance between Gaussian KDE-smoothened samples. However, CWAE calculates this distance for projections onto 1D subspaces and averages over all such directions, optimizing similarity of 1D projections. In contrast, here we directly want closeness of the multidimensional distributions, as in the original generative Auto-Encoder motivation. The above similarity of functions (an exponential versus the approximately one-over-square-root $\phi_D$) suggests it might also be worth testing functions with different types of tails.
For generative Auto-Encoders we are more interested in calculating the distance from the Gaussian distribution $\mathcal{N}(0,I)$, optimized during training together with some evaluation of the distortion introduced by the encoding-decoding process. Let us now use $\mathcal{N}(0,I)$ in place of $Y$ from the previous subsection:
$$d_g^2(X,\mathcal{N}(0,I)) = \int_{\mathbb{R}^D}\left(\frac{1}{n}\sum_i \rho_{x_i,\Sigma_i}(x) - \rho_{0,I}(x)\right)^2 dx = \frac{1}{n^2}\sum_{i,i'=1}^n d_{x_i-x_{i'},\Sigma_i,\Sigma_{i'}} + \frac{1}{\sqrt{4\pi}^{\,D}} - \frac{2}{n}\sum_{i=1}^n d_{x_i,\Sigma_i,I} \qquad (6)$$
Using the simplest, spherically symmetric $\Sigma_i = \sigma_i^2 I$, for example with constant $\sigma_i = \sigma$, we get:
$$d_g^2(X,\mathcal{N}(0,I))\cdot\sqrt{4\pi}^{\,D} = \frac{1}{n^2}\sum_{i,i'=1}^n \sigma^{-D}\, e^{-\frac{\|x_i-x_{i'}\|^2}{4\sigma^2}} - \frac{2}{n}\sum_{i=1}^n \left(\frac{2}{\sigma^2+1}\right)^{D/2} e^{-\frac{\|x_i\|^2}{2(\sigma^2+1)}} + 1 \qquad (7)$$
For large $D$ this requires using $\sigma = 1 + \epsilon$ for a tiny $\epsilon \ge 0$, allowing to manipulate the weights of the two above sums. For the simplest choice $\sigma = 1$, formula (7) becomes inexpensive:
$$1 + \frac{1}{n^2}\sum_{i,i'=1}^n e^{-\frac{\|x_i-x_{i'}\|^2}{4}} - \frac{2}{n}\sum_{i=1}^n e^{-\frac{\|x_i\|^2}{4}}$$
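This $\sigma = 1$ case could serve directly as a latent-space penalty; a minimal sketch under the stated assumptions (our illustration, not the paper's code):

```python
# Illustrative sketch: the inexpensive sigma = 1 case of (7), i.e.
# d_g^2(X, N(0,I)) * sqrt(4*pi)^D for a latent batch X of shape (n, D),
# usable as a penalty added to an Auto-Encoder's reconstruction loss.
def gaussian_penalty(X):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise ||x_i - x_i'||^2
    return 1 + np.exp(-sq / 4).mean() - 2 * np.exp(-(X ** 2).sum(-1) / 4).mean()
```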