Cramer-Wold AutoEncoder

J. Tabor, S. Knop, P. Spurek, I. Podolak, M. Mazur, S. Jastrzębski
Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348 Cracow, Poland.
[email protected]

Abstract We propose a new generative model, the Cramer-Wold Autoencoder (CWAE). Following WAE, we directly encourage normality of the latent space. Our paper also uses the recent idea from the Sliced WAE (SWAE) model, which uses one-dimensional projections as a method of verifying the closeness of two distributions. The crucial new ingredient is the introduction of a new (Cramer-Wold) metric in the space of densities, which replaces the Wasserstein metric used in SWAE. We show that the Cramer-Wold metric between Gaussian mixtures is given by a simple analytic formula, which removes the sampling needed to estimate the cost function in the WAE and SWAE models. As a consequence, while drastically simplifying the optimization procedure, CWAE produces samples of perceptual quality matching other state-of-the-art models.

1 Introduction

AutoEncoder based generative models typically use a certain measure of distance from normality (VAE [7], β-VAE [6], kernel-based WAE [11]), although adversarial discriminators are also commonly used (adversarial AE [9], adversarial WAE [11]). For a more detailed discussion and list of references we refer the reader to [8, 11]. One can observe a trend to minimize and simplify the modifications of the basic AutoEncoder architecture and cost function that are necessary to obtain a generative model. The original VAE needed the variational approach, as the computation of the Kullback-Leibler divergence was otherwise non-trivial; as a consequence, sampling was needed in the optimization process (the same happens for VAE extensions such as β-VAE [6]). The WAE model, making use of a Wasserstein-based metric, works well without the variational approach, however at the cost of a rather nontrivial cost function. The next important step in simplifying AutoEncoder based generative models is SWAE [8], where the complicated computation of the Wasserstein distance was simplified with the use of one-dimensional projections (the sliced approach). However, even in this case, the distance and cost functions are based on sampling. In this paper we take the next step in this direction and construct the Cramer-Wold distance, which enables us to build an AutoEncoder based generative model with a cost function given by a simple closed-form analytic formula.

2 Proposed method

Since the sliced approach also forms the basis of our model, let us briefly explain it here. The idea is based on the well-known Cramer-Wold Theorem [4], which states that the complete information about a dataset X ⊂ R^D (more precisely, about the underlying distribution) can be retrieved from the knowledge of all its one-dimensional projections v^T X, where v ranges over the unit sphere S_D. This means that one can deduce information about the data from its one-dimensional projections, where density estimation is more reliable.

Our paper uses the ingenious idea from [8], but instead of the Wasserstein metric we introduce and use the Cramer-Wold distance (in brief, cw-distance) of two densities f, g : R^D → R (see Section 4 for more details). With the use of Kummer's confluent hypergeometric function ₁F₁ (see, e.g., [3]) we provide its exact formula. Since the use of ₁F₁ is cumbersome in standard optimization software, we prove that for dimensions D ≥ 20 the cw-distance of two Gaussian mixtures can be computed by a simple asymptotic closed-form analytic formula. Consequently, with a simple analytic formula (3) we define the Cramer-Wold normality index cw_D(X) of a set X ⊂ R^D as its cw-distance¹ from the standard normal density. The proposed Cramer-Wold AutoEncoder (CWAE) model is then a simple modification of the classical AE, as its cost function is given by the product of the Cramer-Wold normality index and the reconstruction error.

Figure 1: WAE (first row) and CWAE (second row) on the CelebA dataset: test reconstructions, random samples, and test interpolations. In the test reconstructions, odd rows correspond to the real test points.

3 Cramer-Wold AutoEncoder model (CWAE)

This section is devoted to the construction of CWAE. Although the definition of the cw-distance is a crucial theoretical aspect of the CWAE model, it is not required for its basic understanding. Therefore, so as not to overwhelm the reader with this theory, we postpone its presentation to Section 4. Since we base our construction on the AutoEncoder, let us formalize it here to establish notation.

AutoEncoder Let X ⊂ R^N be a given dataset. The basic aim of an AE is to compress the data into a lower-dimensional latent space Z = R^D while keeping the reconstruction error as small as possible. Thus we search for an encoder E : R^N → Z and a decoder D : Z → R^N which minimize the reconstruction error on the dataset X = (x_i)_{i=1}^n:

$$\mathrm{rec\_error}(X; E, D) = \sum_{i=1}^{n} \|x_i - D(E x_i)\|^2.$$

¹More formally, we compute the mean distance between densities based on kernel estimates of one-dimensional projections of X and the standard normal density.

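To make the preceding notation concrete, here is a minimal NumPy sketch of the reconstruction error; the linear encoder/decoder pair is a stand-in of our own (the paper's experiments use neural networks trained by gradient descent), so treat this purely as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, n = 784, 20, 128                 # input dimension, latent dimension, dataset size

# stand-in linear encoder E : R^N -> Z and decoder D : Z -> R^N;
# in the actual model these are neural networks
W = rng.normal(scale=0.01, size=(D, N))
encode = lambda x: x @ W.T
decode = lambda z: z @ W

X = rng.normal(size=(n, N))            # a toy dataset X, a subset of R^N
rec_error = np.sum(np.sum((X - decode(encode(X))) ** 2, axis=1))
```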

AutoEncoder based generative model CWAE, similarly to WAE, is a classical AutoEncoder model with a modified cost function which forces the model to be generative, i.e., ensures that the data transported to the latent space becomes Gaussian. This statement is formalized by the following important remark (similar observations are present in [11]).

Remark 3.1. Let X be an N-dimensional random vector from which our dataset was drawn, and let Y be a fixed random vector with the standard normal distribution on the latent space Z. Suppose that we have constructed functions E : R^N → Z and D : Z → R^N (representing the encoder and the decoder) such that

1. D(Ex) = x for x ∈ image(X),
2. EX has the same distribution as Y.

Then by point 1 we obtain that D(EX) = X, and therefore DY has the same distribution as D(EX) = X. This means that to produce samples from X we can instead produce samples from Y and map them by the decoder D. Since an estimator of the image of the random vector X is given by its sample X, we conclude that a generative model is good if it has a small reconstruction error and resembles the standard normal distribution on the latent space. Thus, to construct a generative AutoEncoder model, we need to add to the cost function of an AutoEncoder a way of measuring the distance of a given sample from normality.

Cramer-Wold normality index As explained above, the normality index plays the central role in most generative models. In our case, the crucial role in the definition of the cw-normality index is played by the Cramer-Wold distance, defined by the slice approach as the integral, with respect to the normalized surface measure, of the squared L2 distances between the one-dimensional projections (marginal densities):

$$\|f - g\|_{cw}^2 := \int_{S_D} \|f_v - g_v\|_{L_2}^2 \, d\sigma_D(v),$$

where f_v denotes the one-dimensional (marginal) density given as the projection of the density f onto the line spanned by v lying in the unit sphere S_D ⊂ R^D, and σ_D is the normalized surface measure on S_D. The study of the properties of the Cramer-Wold distance is postponed to Section 4; here let us only mention that the distance between two Gaussian mixtures is given by a simple closed analytic formula. In case the data distribution has a density f on R^D, we define the Cramer-Wold normality index as the normalized distance from the standard normal density (in the calculations below we use (5), (6) and (9)):

$$cw_D(f) := \frac{\|f - N(0, I)\|_{cw}^2}{\|N(0, I)\|_{cw}^2} = 1 + 2\sqrt{\pi} \int_{S_D} \|f_v\|_{L_2}^2 \, d\sigma_D(v) - 4\sqrt{\pi} \int_{S_D} \langle f_v, N(0, 1)\rangle_{L_2} \, d\sigma_D(v). \qquad (1)$$

The crucial consequence of the above definition is that it can easily be adapted to discrete sets. This is because in formula (1) we apply computations only to one-dimensional densities, for which kernel density estimation is reliable. We recall (see, e.g., [10]) that the Gaussian kernel density estimator for a scalar set S = (s_i)_{i=1}^n ⊂ R is given by²

$$kde(S) = \frac{1}{n} \sum_{i=1}^{n} N(s_i, h_n^2),$$

where the optimal kernel width h_n = (4/(3n))^{1/5} is given by Silverman's rule of thumb. Thus, given a dataset X ⊂ R^D (a sample from a distribution with unknown density f), by replacing f_v in (1) by its reliable estimate kde(v^T X), we arrive at the definition of the Cramer-Wold normality index (in brief, cw-normality index) of X:

$$cw_D(X) := 1 + 2\sqrt{\pi} \int_{S_D} \|kde(v^T X)\|_{L_2}^2 \, dv - 4\sqrt{\pi} \int_{S_D} \langle kde(v^T X), N(0, 1)\rangle_{L_2} \, dv.$$

²Under the assumption that the standard deviation is close to one.
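The two ingredients above, projecting the data onto a direction v ∈ S_D and smoothing the resulting scalar sample with a Gaussian kernel of Silverman width, can be sketched as follows (a toy NumPy illustration with our own naming, not the authors' code):

```python
import numpy as np

def silverman_h(n):
    # Silverman's rule of thumb: h_n = (4 / (3n))^(1/5)
    return (4.0 / (3.0 * n)) ** 0.2

def kde_1d(sample, t, h):
    # Gaussian kernel density estimate of a scalar sample, evaluated at the points t
    z = (t[:, None] - sample[None, :]) / h
    return np.mean(np.exp(-0.5 * z ** 2), axis=1) / (h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))            # toy data in R^D with D = 20
v = rng.normal(size=20)
v /= np.linalg.norm(v)                    # a direction on the unit sphere S_D
proj = X @ v                              # one-dimensional projection v^T X
grid = np.linspace(-4.0, 4.0, 81)
f_v = kde_1d(proj, grid, silverman_h(len(proj)))   # estimate of the marginal density f_v
```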


Applying (10) we immediately get the direct analytic formula for the cw-normality index of X:

$$cw_D(X) = 1 + \frac{2\sqrt{\pi}}{n^2} \sum_{i,j=1}^{n} \Phi_D(x_i - x_j, 2H_n) - \frac{4\sqrt{\pi}}{n} \sum_{i=1}^{n} \Phi_D(x_i, 1 + H_n), \qquad (2)$$

where $\Phi_D(z, \gamma) = \frac{1}{\sqrt{2\pi\gamma}}\, {}_1F_1\!\left(\frac{1}{2}; \frac{D}{2}; -\frac{\|z\|^2}{2\gamma}\right)$, $H_n = h_n^2 = \left(\frac{4}{3n}\right)^{2/5}$, and ₁F₁ denotes Kummer's confluent hypergeometric function. Since the use of special functions in practical optimization is cumbersome, we derive and apply in all experiments the asymptotic form (11) of the function Φ_D, and get

$$cw_D(X) \approx 1 + \frac{1}{n^2} \sum_{i,j=1}^{n} \frac{1}{\sqrt{H_n + \frac{\|x_i - x_j\|^2}{2D-3}}} - \frac{2}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{\frac{1+H_n}{2} + \frac{\|x_i\|^2}{2D-3}}}. \qquad (3)$$

Once the crucial ingredient of CWAE is ready, we can describe its cost function.

CWAE To ensure that the data transported to the latent space Z are distributed according to the standard normal density, we take advantage of the normality index cw_D(EX). To obtain a model independent of a possible rescaling of the data, we have decided to use a multiplicative rather than an additive model:

$$\mathrm{cost}(X; E, D) = cw_D(EX) \cdot \mathrm{rec\_error}(X; E, D). \qquad (4)$$
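For reference, the asymptotic index (3) and the multiplicative cost (4) can be written in a few lines of NumPy. The sketch below uses our own naming, treats the latent sample (denoted Z) as an n × D array, and assumes D ≥ 20 as required by the approximation; it is not the authors' implementation. In practice the same expressions would be implemented in an automatic-differentiation framework so that the cost can be backpropagated, and, since the pairwise term is quadratic in n, evaluated on minibatches.

```python
import numpy as np

def cw_normality_index(Z):
    """Asymptotic cw-normality index (3) of a latent sample Z of shape (n, D); valid for D >= 20."""
    n, D = Z.shape
    Hn = (4.0 / (3.0 * n)) ** 0.4                    # H_n = h_n^2, Silverman's rule of thumb
    c = 2.0 * D - 3.0
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)   # ||z_i - z_j||^2
    sq_norms = np.sum(Z ** 2, axis=-1)                                  # ||z_i||^2
    pair_term = np.mean(1.0 / np.sqrt(Hn + sq_dists / c))               # (1/n^2) double sum
    prior_term = 2.0 * np.mean(1.0 / np.sqrt((1.0 + Hn) / 2.0 + sq_norms / c))
    return 1.0 + pair_term - prior_term

def cwae_cost(X, Z, X_rec):
    """Multiplicative CWAE cost (4): cw_D(E(X)) * rec_error(X; E, D)."""
    rec_error = np.sum(np.sum((X - X_rec) ** 2, axis=1))
    return cw_normality_index(Z) * rec_error
```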

4 Cramer-Wold distance

In this section we introduce the Cramer-Wold distance on the space of functions, which, due to its simplicity, is in our opinion a reasonable alternative to the Wasserstein distance. To formally introduce the projection (slice) approach we need the following notation: if a random vector X has the density f and v ∈ S_D (the unit sphere centered at 0), then by f_v we denote the one-dimensional density of v^T X. Directly from this definition one can easily observe that for any v ∈ S_D, x ∈ R^D and α > 0 we have

$$N(x, \alpha I)_v = N(v^T x, \alpha). \qquad (5)$$

Formally, f_v is the marginal distribution of f obtained by integrating f over the affine hyperplanes orthogonal to v.

Theorem (Cramer-Wold Theorem [4]). Let f, g be densities. If f_v = g_v for every v ∈ S_D, then f = g.

Making use of the above theorem, we see that the following formulas give a properly defined scalar product and norm:

$$\langle f, g\rangle_{cw} := \int_{S_D} \langle f_v, g_v\rangle_{L_2} \, d\sigma_D(v), \qquad (6)$$
$$\|f\|_{cw}^2 := \int_{S_D} \|f_v\|_{L_2}^2 \, d\sigma_D(v) = \langle f, f\rangle_{cw}, \qquad (7)$$

where by σ_D we denote the normalized surface area measure on S_D. Clearly, directly by the Cramer-Wold theorem we obtain that ‖f‖_cw = 0 iff f = 0. Observe that the cw-scalar product is defined as the mean of scalar products over all one-dimensional projections. The Cramer-Wold distance between two densities f, g is given by d_cw(f, g) = ‖f − g‖_cw.

Crucial in our further investigations is the formula for the scalar product of two Gaussians N(x, αI) and N(y, βI), where x, y ∈ R^D and α, β > 0. Using (5) and the equality ⟨N(s, α), N(t, β)⟩_{L2} = N(s − t, α + β)(0), directly from the definition we get

$$\langle N(x, \alpha I), N(y, \beta I)\rangle_{cw} = \int_{S_D} \langle N(v^T x, \alpha), N(v^T y, \beta)\rangle_{L_2} \, d\sigma_D(v) = \int_{S_D} N(v^T(x - y), \alpha + \beta)(0) \, d\sigma_D(v). \qquad (8)$$


In particular, since σ_D is a normalized measure, we have

$$\|N(0, \gamma I)\|_{cw}^2 = \int_{S_D} \|N(0, \gamma)\|_{L_2}^2 \, d\sigma_D(v) = \|N(0, \gamma)\|_{L_2}^2 = \frac{1}{2\sqrt{\pi\gamma}}. \qquad (9)$$
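As a quick numerical sanity check, written with our own naming and clearly not part of the original paper, the Gaussian L2 identity used in (8) and the averaging over directions drawn uniformly on the sphere can be verified as follows:

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(0)
D, alpha, beta = 20, 0.5, 1.5
x, y = rng.normal(size=D), rng.normal(size=D)

def gauss(t, mean, var):
    # one-dimensional normal density N(mean, var) evaluated at t
    return np.exp(-(t - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

V = rng.normal(size=(2000, D))
V /= np.linalg.norm(V, axis=1, keepdims=True)      # directions distributed uniformly on S_D

for v in V[:5]:
    a, b = v @ x, v @ y
    # L2 scalar product of the two projected one-dimensional densities
    lhs, _ = quad(lambda t: gauss(t, a, alpha) * gauss(t, b, beta), min(a, b) - 10.0, max(a, b) + 10.0)
    # identity used in (8): <N(s, alpha), N(t, beta)>_L2 = N(s - t, alpha + beta)(0)
    rhs = gauss(v @ (x - y), 0.0, alpha + beta)
    assert abs(lhs - rhs) < 1e-6

# Monte Carlo estimate of the cw scalar product (8): the average over all sampled directions
cw_product_mc = np.mean(gauss(V @ (x - y), 0.0, alpha + beta))
```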

The final ingredient for the closed form of (8) is given by the following theorem.

Theorem 4.1. Let z ∈ R^D and γ > 0 be given. Then

$$\Phi_D(z, \gamma) := \int_{S_D} N(v^T z, \gamma)(0) \, d\sigma_D(v) = \frac{1}{\sqrt{2\pi\gamma}} \, {}_1F_1\!\left(\frac{1}{2}; \frac{D}{2}; -\frac{\|z\|^2}{2\gamma}\right),$$

where ₁F₁ denotes Kummer's confluent hypergeometric function.

Proof. Let c = ‖z‖. By applying an orthonormal change of coordinates we may assume that z = (c, 0, …, 0). Then v^T z = c v_1 for v = (v_1, …, v_D), and we get

$$\Phi_D(z, \gamma) = \int_{v \in S_D} N(c v_1, \gamma)(0) \, d\sigma_D(v).$$

Now, Corollary A.6 from [2] gives the formula for slice integration of functions on spheres:

$$\int_{S_D} f \, d\sigma_D = \frac{V(S_{D-1})}{V(S_D)} \int_{-1}^{1} (1 - x^2)^{(D-3)/2} \int_{S_{D-1}} f\big(x, \sqrt{1 - x^2}\,\zeta\big) \, d\sigma_{D-1}(\zeta) \, dx,$$

where V(S_K) denotes the surface volume of the sphere S_K ⊂ R^K. Applying the above equality to the function f(v_1, …, v_D) = N(c v_1, γ)(0) and writing s = c²/(2γ), we get

$$\Phi_D(z, \gamma) = \frac{V(S_{D-1})}{V(S_D)} \frac{1}{\sqrt{2\pi\gamma}} \int_{-1}^{1} (1 - x^2)^{(D-3)/2} \exp(-s x^2) \, dx.$$

Since $V(S_K) = \frac{2\pi^{K/2}}{\Gamma(K/2)}$ and $\int_{-1}^{1} \exp(-s x^2)(1 - x^2)^{(D-3)/2} \, dx = \sqrt{\pi}\, \frac{\Gamma(\frac{D-1}{2})}{\Gamma(\frac{D}{2})}\, {}_1F_1\!\left(\frac{1}{2}; \frac{D}{2}; -s\right)$, we finally get $\Phi_D(z, \gamma) = \frac{1}{\sqrt{2\pi\gamma}}\, {}_1F_1\!\left(\frac{1}{2}; \frac{D}{2}; -s\right)$, which completes the proof.

By Theorem 4.1, the cw-scalar product of two Gaussians is given by

$$\langle N(x, \alpha I), N(y, \beta I)\rangle_{cw} = \Phi_D(x - y, \alpha + \beta) = \frac{1}{\sqrt{2\pi(\alpha + \beta)}} \, {}_1F_1\!\left(\frac{1}{2}; \frac{D}{2}; -\frac{\|x - y\|^2}{2(\alpha + \beta)}\right). \qquad (10)$$

Since the practical use of ₁F₁ in optimization can be cumbersome, we provide an approximate asymptotic formula valid for dimensions D ≥ 20 (see Figure 2).

Figure 2: Comparison of the function Φ_D (marked in red) with the approximation given by (11) (marked in green) in the case of dimensions D = 2, 5, 20. Observe that for D = 20 the functions practically coincide.

Proposition 4.1. For large D we have

$$\Phi_D(z, \gamma) \approx \frac{1}{\sqrt{2\pi}} \cdot \frac{1}{\sqrt{\gamma + \frac{2\|z\|^2}{2D - 3}}}. \qquad (11)$$

Proof. We apply the notation from the previous theorem. We have to estimate the asymptotics of

$$\Phi_D(z, \gamma) = \frac{\Gamma(\frac{D}{2})}{\Gamma(\frac{D-1}{2})} \frac{1}{\pi\sqrt{2\gamma}} \int_{-1}^{1} (1 - x^2)^{(D-3)/2} \exp(-s x^2) \, dx.$$

Since for large D and x ∈ [−1, 1]

$$(1 - x^2)^{(D-3)/2} e^{-s x^2} \approx (1 - x^2)^{(D-3)/2} \cdot (1 - x^2)^{s} = (1 - x^2)^{s + (D-3)/2},$$

we get

$$\Phi_D(z, \gamma) \approx \frac{\Gamma(\frac{D}{2})}{\Gamma(\frac{D-1}{2})} \frac{1}{\pi\sqrt{2\gamma}} \int_{-1}^{1} (1 - x^2)^{s + (D-3)/2} \, dx = \frac{\Gamma(\frac{D}{2})}{\Gamma(\frac{D-1}{2})} \frac{1}{\pi\sqrt{2\gamma}} \cdot \sqrt{\pi}\, \frac{\Gamma\big(s + \frac{D}{2} - \frac{1}{2}\big)}{\Gamma\big(s + \frac{D}{2}\big)}.$$

To simplify the above we apply formula (1) from [12]:

$$\frac{\Gamma(z + \alpha)}{\Gamma(z + \beta)} = z^{\alpha - \beta}\left(1 + \frac{(\alpha - \beta)(\alpha + \beta - 1)}{2z} + O(|z|^{-2})\right),$$

with α, β fixed so that α + β = 1 (so that only the error term of order O(|z|^{-2}) remains), and get

$$\frac{\Gamma(\frac{D}{2})}{\Gamma(\frac{D-1}{2})} = \frac{\Gamma\big((\frac{D}{2} - \frac{3}{4}) + \frac{3}{4}\big)}{\Gamma\big((\frac{D}{2} - \frac{3}{4}) + \frac{1}{4}\big)} \approx \sqrt{\frac{D}{2} - \frac{3}{4}} \quad \text{and} \quad \frac{\Gamma\big(s + \frac{D}{2} - \frac{1}{2}\big)}{\Gamma\big(s + \frac{D}{2}\big)} \approx \frac{1}{\sqrt{s + \frac{D}{2} - \frac{3}{4}}}. \qquad (12)$$

Summarizing,

$$\Phi_D(z, \gamma) \approx \frac{1}{\sqrt{2\pi\gamma}} \frac{\sqrt{\frac{D}{2} - \frac{3}{4}}}{\sqrt{s + \frac{D}{2} - \frac{3}{4}}} = \frac{1}{\sqrt{2\pi}} \cdot \frac{1}{\sqrt{\gamma + \frac{4\gamma s}{2D - 3}}}.$$

Since s = ‖z‖²/(2γ), we get the assertion.

Applying Proposition 4.1 to (10), we obtain the approximate formula for the scalar product of two Gaussians:

$$\langle N(x, \alpha I), N(y, \beta I)\rangle_{cw} \approx \frac{1}{2\sqrt{\pi}} \cdot \frac{1}{\sqrt{\frac{\alpha + \beta}{2} + \frac{\|x - y\|^2}{2D - 3}}}.$$
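The claim that the asymptotic form is accurate from roughly D ≥ 20 onward (Figure 2) is easy to check numerically, e.g. with SciPy's implementation of ₁F₁; the following is a small sketch of our own, not the authors' code:

```python
import numpy as np
from scipy.special import hyp1f1

def phi_exact(z_sq, gamma, D):
    # Theorem 4.1: Phi_D(z, gamma) = 1F1(1/2; D/2; -||z||^2 / (2 gamma)) / sqrt(2 pi gamma)
    return hyp1f1(0.5, D / 2.0, -z_sq / (2.0 * gamma)) / np.sqrt(2.0 * np.pi * gamma)

def phi_approx(z_sq, gamma, D):
    # asymptotic formula (11), intended for dimensions D >= 20
    return 1.0 / (np.sqrt(2.0 * np.pi) * np.sqrt(gamma + 2.0 * z_sq / (2.0 * D - 3.0)))

z_sq = np.linspace(0.0, 25.0, 101)             # a range of ||z||^2 values
for D in (2, 5, 20, 64):
    gap = np.max(np.abs(phi_exact(z_sq, 1.0, D) - phi_approx(z_sq, 1.0, D)))
    print(f"D = {D:3d}  max |exact - approx| = {gap:.2e}")   # the gap shrinks as D grows, cf. Figure 2
```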

5 Experiments

Methodology Typically, generative properties are verified only by human inspection of samples or interpolations. We present such a comparison of CWAE with WAE in Figure 1. Clearly, this approach works well for data which can be represented visually (such as images), but it does not seem applicable to a model constructed for abstract data. However, by Remark 3.1 and the comment given directly after it (see also [11, Theorem 1]), to verify whether our model is generative it is sufficient to check that:

• the reconstruction error on the validation set is small,
• the validation data transported to the latent space are close to the standard normal density.

This provides us with a way of studying and comparing generative models which can be applied to arbitrary data. To illustrate this statement, in Figure 3 we present 2-dimensional latent spaces for the considered models³ on the MNIST dataset. A model is generative if its latent space resembles the Gaussian noise presented in the first plot. As we may notice, the latent space of the AutoEncoder presented in the second plot is far from Gaussian, which supports the obvious observation that it is not a generative model. Clearly, if the latent space has a higher dimension, to test the considered model we need an independent normality index as a score function. We have decided here to use one of the most popular statistical normality tests, i.e., Mardia's tests.

³Since (3) is valid for dimensions D ≥ 20, to implement CWAE in a 2-dimensional latent space we apply the equality ₁F₁(1/2; 1; −s) = e^{−s/2} I₀(s/2), see [5, (8.406.3) and (9.215.3)], jointly with the approximate formula for the Bessel function of the first kind I₀ given in [1, page 378].
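The identity quoted in the footnote above is easy to verify directly with SciPy's special functions (a quick check of our own, not part of the paper):

```python
import numpy as np
from scipy.special import hyp1f1, iv

s = np.linspace(0.0, 20.0, 201)
lhs = hyp1f1(0.5, 1.0, -s)                # 1F1(1/2; 1; -s)
rhs = np.exp(-s / 2.0) * iv(0, s / 2.0)   # exp(-s/2) * I_0(s/2)
assert np.max(np.abs(lhs - rhs)) < 1e-8   # the two expressions coincide
```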

Figure 3: Two-dimensional latent spaces for AE, VAE, WAE, SWAE, and CWAE, all on the MNIST dataset. Models closer to Gaussian noise in the latent space are more generative.

Mardia's normality tests are based on verifying whether the skewness b_{1,D}(·) and kurtosis b_{2,D}(·) of a sample X = (x_i)_{i=1}^n ⊂ R^D,

$$b_{1,D}(X) = \frac{1}{n^2} \sum_{j,k=1}^{n} (x_j^T x_k)^3, \quad \text{and} \quad b_{2,D}(X) = \frac{1}{n} \sum_{j=1}^{n} \|x_j\|^4,$$

are close to those of the standard normal density (a short numerical sketch of these statistics is given below). The expected Mardia's skewness and kurtosis for the standard multivariate normal distribution are 0 and D(D + 2), respectively. To enable easier comparison, in the experiments we also consider the normalized Mardia's kurtosis, given by b_{2,D}(X) − D(D + 2), which equals zero for the standard normal density.

Results of experiments We compare CWAE with VAE, WAE and SWAE on the standard datasets MNIST, CIFAR-10 and CelebA [13], where, for fairness, we apply a common standard architecture for each dataset. We additionally report the results of a plain AutoEncoder (AE) as a benchmark. We have used two different architectures. On the MNIST dataset the network consists of fully connected layers with the ReLU activation function; we use three layers (each consisting of 200 neurons) for the encoder as well as the decoder, and a latent space of dimension 20. On the CIFAR-10 and CelebA datasets we use a standard architecture which consists of three convolution layers, both for the encoder and the decoder, with the ReLU activation function, and a latent space of dimension 64.

In Figure 4 we present, for the CelebA dataset, the value of the reconstruction error and of Mardia's skewness and kurtosis during the learning process of AE, VAE, WAE, SWAE and CWAE (measured on the validation dataset). WAE and SWAE obtain the best reconstruction error, comparable to AE. CWAE and VAE have slightly worse reconstruction error but better values of kurtosis and skewness, which means that these models are more generative. As expected, AE is not generative, since its kurtosis and skewness grow during learning.
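For completeness, the simplified Mardia statistics used above (with the standard normal N(0, I) as the fixed reference, so no centering or whitening of the sample) can be computed as follows; this is a NumPy sketch with our own naming, not the evaluation code used by the authors.

```python
import numpy as np

def mardia_stats(Z):
    """Mardia's skewness b_{1,D} and normalized kurtosis b_{2,D} - D(D+2) for a sample Z of shape (n, D)."""
    n, D = Z.shape
    G = Z @ Z.T                                        # Gram matrix of inner products z_j^T z_k
    skewness = np.sum(G ** 3) / n ** 2                 # b_{1,D}(Z)
    kurtosis = np.mean(np.sum(Z ** 2, axis=1) ** 2)    # b_{2,D}(Z), the mean of ||z_j||^4
    return skewness, kurtosis - D * (D + 2.0)          # normalized kurtosis: expectation 0 under N(0, I)

# reference values for a standard normal sample of the same shape as the latent codes
rng = np.random.default_rng(0)
print(mardia_stats(rng.normal(size=(2000, 64))))
```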

Figure 4: Values of the reconstruction error, Mardia's skewness and normalized kurtosis during the learning process of AE, VAE, WAE, SWAE and CWAE, measured on the validation dataset, in the case of the CelebA dataset. For the kurtosis, the optimal value is given by the dotted line, which denotes the expected value of the kurtosis for the normal density.

Table 1 presents the comparison in the case of MNIST, CIFAR-10 and CelebA. In general, SWAE and WAE give a reconstruction error similar to AE (in the case of CIFAR-10 even better). On the other hand, VAE and SWAE have better values of kurtosis and skewness.

Data set   Method                        AE         VAE       WAE      SWAE      CWAE
MNIST      Skewness                  659.67        0.49     82.12     55.59     17.72
           Kurtosis (normalized)     749.58     -410.69     35.61    -37.43    -43.89
           Reconstruction error        2.10        4.12      2.11      2.11      2.32
CIFAR-10   Skewness               101199.51        1.16   1552.88    949.83    157.60
           Kurtosis (normalized)    3485.82    -4183.74   2460.55     51.63   -817.46
           Reconstruction error       49.01       85.64     45.08     44.75     51.48
CelebA     Skewness             28347702.52       24.01    986.91    184.26     18.25
           Kurtosis (normalized)  885045.13       11.12   1802.72    352.96   -193.26
           Reconstruction error      139.05      143.61    139.14    138.80    146.15

Table 1: Values of the reconstruction error and of Mardia's skewness and normalized kurtosis for AE, VAE, WAE, SWAE and CWAE on the validation dataset, in the case of the MNIST, CIFAR-10 and CelebA datasets. Observe that the skewness and normalized kurtosis of AE are far from zero, which implies a serious divergence from normality in the latent space (and consequently AE is not generative).

Experiments show that CWAE performs comparably to the other models. Summarizing, one can also observe that a better reconstruction error (as in WAE) is obtained at the cost of weaker normality indices (see Figure 4 and Table 1). In general, the worst performance is presented by the VAE model, which on the MNIST and CIFAR-10 datasets obtains a much worse reconstruction error than the other models, while having comparable normality results (better skewness, worse kurtosis).

6 Conclusions

In this paper we presented a new generative model, CWAE, which is a direct modification of the AutoEncoder. In contrast to previous models, the CWAE cost function is given by a simple closed-form analytic formula. Crucial to the construction is the introduction of the Cramer-Wold metric, an alternative to the Wasserstein distance, which can be computed effectively for densities given by Gaussian mixtures. As a consequence we obtain a reliable measure of the divergence from normality. In a similar manner one can use the cw-distance to compare two samples from different distributions⁴.

⁴To use this approach, one of the distributions should be normalized to avoid the necessity of adapting the kernel width.


References

[1] M. Abramowitz and I.A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, volume 55 of National Bureau of Standards Applied Mathematics Series. U.S. Government Printing Office, Washington, D.C., 1964.
[2] S. Axler, P. Bourdon, and W. Ramey. Harmonic Function Theory, volume 137 of Graduate Texts in Mathematics. Springer, New York, 1992.
[3] R.W. Barnard, G. Dahlquist, K. Pearce, L. Reichel, and K.C. Richards. Gram polynomials and the Kummer function. J. Approx. Theory, 94(1):128–143, 1998.
[4] H. Cramér and H. Wold. Some theorems on distribution functions. J. London Math. Soc., 11(4):290–294, 1936.
[5] I.S. Gradshteyn and I.M. Ryzhik. Table of Integrals, Series, and Products. Elsevier/Academic Press, Amsterdam, 2015.
[6] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: learning basic visual concepts with a constrained variational framework. In Proceedings of the 5th International Conference on Learning Representations, 2017.
[7] D.P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2014.
[8] S. Kolouri, C.E. Martin, and G.K. Rohde. Sliced-Wasserstein autoencoder: an embarrassingly simple generative model. arXiv preprint arXiv:1804.01947, 2018.
[9] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
[10] B.W. Silverman. Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1986.
[11] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
[12] F. Tricomi and A. Erdélyi. The asymptotic expansion of a ratio of gamma functions. Pacific J. Math., 1:133–142, 1951.
[13] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
