RADIAL GAUSSIANIZATION WITH CLUSTER-SPECIFIC BIAS COMPENSATION

Shuai Huang (1,3), Damianos Karakos (1,2,3), Daguang Xu (1,3)

(1) Center for Language and Speech Processing
(2) Human Language Technology Center of Excellence
(3) Department of Electrical and Computer Engineering
Johns Hopkins University, Baltimore, MD
email: {shuaihuang,damianos,dxu5}@jhu.edu
(This research was partially supported by National Science Foundation grant No. CCF-0728931.)

ABSTRACT

In recent work, Lyu and Simoncelli [1] introduced radial Gaussianization (RG) as a very efficient procedure for transforming n-dimensional random vectors into Gaussian vectors with independent and identically distributed (i.i.d.) components. This entails transforming the norms of the data so that they become chi-distributed with n degrees of freedom. A necessary requirement is that the original data are generated by an isotropic distribution, that is, their probability density function (pdf) is constant over the surfaces of n-dimensional spheres (or, more generally, n-dimensional ellipsoids). Here we study the case of biases in the data, which is of great practical interest; as we demonstrate experimentally, there are situations in which even very small amounts of bias can cause RG to fail. This becomes especially evident when the data form clusters in low-dimensional manifolds. To address this shortcoming, we propose a two-step approach which entails (i) first discovering clusters in the data and removing the bias from each, and (ii) performing RG on the bias-compensated data. In experiments with synthetic data, the proposed bias compensation procedure results in significantly better Gaussianization than the non-compensated RG method.

Index Terms— Gaussian distribution, chi distribution, isotropic distribution, singular distribution

1. INTRODUCTION

According to [2], it is usually beneficial to transform high-dimensional data into a representation that has independent and identically distributed (i.i.d.) components. Such a representation allows one to estimate multivariate distributions from just univariate statistics, without the need for specialized smoothing techniques to tackle the "curse of dimensionality" that arises in density estimation [2]. Methods such as Principal Components Analysis (PCA) and Independent Components Analysis (ICA) have been used
extensively in the statistics and machine learning communities in order to transform the data into uncorrelated or independent components. Both are linear techniques: they estimate a matrix A which is applied to the original data. The matrix A is estimated differently in each case:

• PCA: $A = \Lambda^{-1/2} U^T$, where $\Sigma = U \Lambda U^T$ is the covariance of the original data, with U the matrix of orthonormal column eigenvectors of Σ and Λ the diagonal matrix of the corresponding eigenvalues.

• ICA: $A = \arg\min_{A'} I(A'X)$, where X is the original data and I(·) is the Kullback-Leibler divergence between the joint distribution and the product of the marginals (the so-called multi-information [2]).

Both PCA and ICA assume that the observed data are a linear combination of independent components, and thus try to invert that combination (up to scale). In fact, PCA only performs decorrelation (whitening) of the data, and therefore does not guarantee independence (unless the data are jointly Gaussian). ICA, on the other hand, seeks a rotation of the space such that the marginals of the transformed data are least "Gaussian-like", as measured by the so-called negentropy [3].

When the linearity assumption is not valid, a nonlinear transformation of the data into independent components has to be found. One way to approach this problem is to seek a transformation which makes the data i.i.d. Gaussian (a choice that may be dictated by the application at hand: the Gaussian is one of the most tractable and easily modeled distributions). In the univariate case, the solution is trivial. All that is needed is a good approximation of the cumulative distribution function (cdf) $F_X$ of the data; the transformation is then given in closed form by $T(x) = \Phi^{-1}(F_X(x))$, where Φ is the cdf of the standard Gaussian distribution.
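The univariate step can be sketched in a few lines; the snippet below is an illustrative sketch rather than code from any of the cited papers, assuming a rank-based estimate of $F_X$ and NumPy/SciPy as tooling (the function name is ours).

```python
import numpy as np
from scipy.stats import norm

def marginal_gaussianize(x):
    """Map a 1-D sample to approximately standard-normal values via
    T(x) = Phi^{-1}(F_X(x)), with F_X estimated by rank statistics."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x))          # rank of each point, 0..N-1
    # Empirical cdf, shifted away from 0 and 1 so the Gaussian quantile stays finite.
    u = (ranks + 0.5) / len(x)
    return norm.ppf(u)

# Example: a skewed (exponential) sample becomes approximately N(0, 1).
rng = np.random.default_rng(0)
z = marginal_gaussianize(rng.exponential(size=5000))
print(z.mean(), z.std())                       # close to 0 and 1
```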
The case of multivariate distributions is more complicated, though, as conditional distributions have to be computed. As mentioned in [2], a procedure which computes, for each $x_1, \ldots, x_{i-1}$, the transformation

$$T(x_i \mid x_1, \ldots, x_{i-1}) = \Phi^{-1}\big(F_{X_i \mid X_1, \ldots, X_{i-1}}(x_i \mid x_1, \ldots, x_{i-1})\big),$$

relies on the estimation of high-dimensional cdf's; this often leads to extremely sparse data conditions. To address this problem, a number of papers have offered alternatives, summarized below.

1. The paper of Chen and Gopinath (C. & G.) [2] is one of the earliest on the topic. It describes an iterative scheme which alternates between ICA and marginal Gaussianization: ICA converts the data into "independent components", and marginal Gaussianization is then applied to each component independently. It produces state-of-the-art results, but its main drawback is the computational complexity of ICA.

2. A "PCA variant" of [2] appears in [4], where the ICA step is replaced with PCA. The experimental results of [4] show that the PCA version is as effective as the ICA version, but orders of magnitude faster. (A simplified sketch of one such iteration appears at the end of this section.)

3. Radial Gaussianization (RG) [1] is a recent non-iterative technique which relies on the assumption that the data have an elliptically symmetric density (ESD). After PCA, the data are distributed according to a spherically symmetric (a.k.a. isotropic) density; it then suffices to transform only the norm of the resulting vectors to be chi-distributed (since i.i.d. Gaussian vectors have chi-distributed norms). The complexity of the procedure is linear in the data.

Although RG is a very efficient way of Gaussianizing isotropic distributions (without loss of generality we consider isotropic distributions rather than ESDs, since the latter can easily be handled via PCA), it cannot handle data which contain biases (e.g., data generated by "perturbed" isotropic distributions). Analyzing cases of this kind is of great practical interest, since it is very unlikely that real data are generated by perfectly isotropic distributions. In fact, as we demonstrate in this paper, even very small biases can cause RG to fail.

To tackle this shortcoming, we propose a procedure based on cluster-specific bias compensation. It entails (i) clustering the data to determine subsets which exhibit spherical symmetry (constant-norm points), (ii) removing the bias from each cluster independently in a way that optimizes the outcome of RG, and (iii) performing an overall RG on the normalized dataset. Our experimental results show that the proposed algorithm yields better Gaussianization than the non-compensated RG method, and is competitive with (in most cases superior to) the state-of-the-art method of C. & G. [2], which is a more time-consuming iterative procedure.

The paper is organized as follows: mathematical preliminaries appear in Section 2, and a summary of the technique of Lyu and Simoncelli [1] appears in Section 3, along with its behavior in the presence of biases in the data. Our proposed algorithm appears in Section 4, experimental results appear in Section 5, and concluding remarks appear in Section 6.
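To make the flavor of the iterative schemes in items 1 and 2 above concrete, here is a simplified sketch of one possible loop (a PCA rotation followed by marginal Gaussianization, repeated); it is our own illustration under those assumptions, not the exact algorithm of [2] or [4].

```python
import numpy as np
from scipy.stats import norm

def marginal_gaussianize(x):
    """Rank-based univariate Gaussianization, T(x) = Phi^{-1}(F_X(x))."""
    ranks = np.argsort(np.argsort(x))
    return norm.ppf((ranks + 0.5) / len(x))

def iterative_gaussianize(X, n_iter=20):
    """One plausible form of the PCA-based iterative scheme: rotate with PCA,
    Gaussianize each coordinate, and repeat."""
    X = np.asarray(X, dtype=float)
    for _ in range(n_iter):
        X = X - X.mean(axis=0)
        # PCA rotation: project onto the eigenvectors of the covariance.
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        X = X @ Vt.T
        # Marginal Gaussianization of every coordinate.
        X = np.column_stack([marginal_gaussianize(X[:, j]) for j in range(X.shape[1])])
    return X

# Example: a strongly non-Gaussian, linearly mixed 2-D sample.
rng = np.random.default_rng(0)
Y = rng.exponential(size=(4000, 2)) @ np.array([[1.0, 0.5], [0.0, 1.0]])
Z = iterative_gaussianize(Y)
print(np.cov(Z.T))   # marginals are standard normal; cross-correlation shrinks over iterations
```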
2. MATHEMATICAL PRELIMINARIES

Random variables are denoted with uppercase letters (e.g., X), and random vectors in bold (e.g., X ∈ R^n is an n-dimensional random vector). Probability density functions (pdfs) and their corresponding cumulative distribution functions (cdfs) are denoted with lowercase and uppercase letters, respectively; for instance, $f_X$ and $F_X$ are the pdf and cdf of the random variable X. The density of the standard (zero-mean, unit-variance) univariate Gaussian distribution is denoted by φ(·), and its cdf by Φ(·). A transformation of a dataset D is denoted by a function T : D → D′, applied to each data point of D separately. To simplify notation, the transformation of the whole dataset is denoted by T(D).

As mentioned earlier, the goal of Gaussianization is to create i.i.d. Gaussians of unit variance. Negentropy [3] is a commonly used measure of the Gaussianity of a random quantity X; it is defined as the Kullback-Leibler divergence between the distribution of X and a Gaussian with the same covariance matrix as X. Negentropy is always non-negative, and it is zero iff X is Gaussian distributed. It is also equal to the difference between the entropy of the equal-covariance Gaussian and the entropy of X. For univariate distributions, the entropy can be estimated empirically using the m-spacing nonparametric estimator of [5]; for a dataset of N points, we choose $m = \sqrt{N}$ in our experiments. In the multivariate case, negentropy is computed as an average of the univariate values obtained after projecting the multivariate data onto several random directions.

3. BIASES IN THE DATA: A PROBLEMATIC CASE FOR RADIAL GAUSSIANIZATION

RG exploits the fact that an ESD (after it has been whitened and is therefore isotropic) satisfies a fundamental property of Gaussian i.i.d. vectors: its projection onto the surface of the unit n-dimensional sphere is uniformly distributed. It is thus sufficient to transform the norm so that its distribution matches the chi distribution of the norm of an i.i.d. Gaussian vector. This is done by the univariate transformation

$$T_{RG}(\|x\|) = F_\chi^{-1}\big(F_{\|X\|}(\|x\|)\big), \qquad (1)$$

where $F_\chi$ is the cdf of the chi distribution with n degrees of freedom. The cdf $F_{\|X\|}$ can be obtained by collecting rank statistics from the norms of the data; its univariate nature makes it easy to estimate.
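A minimal sketch of the transformation in Eq. (1) is given below, assuming a rank-based estimate of $F_{\|X\|}$ and NumPy/SciPy as tooling; the helper name is ours and not from [1].

```python
import numpy as np
from scipy.stats import chi

def radial_gaussianize(X):
    """Apply Eq. (1): map the norms of (already whitened/isotropic) data so that
    they follow a chi distribution with n degrees of freedom."""
    X = np.asarray(X, dtype=float)
    n = X.shape[1]
    r = np.linalg.norm(X, axis=1)
    # Rank-based estimate of F_{||X||}, kept strictly inside (0, 1).
    F_r = (np.argsort(np.argsort(r)) + 0.5) / len(r)
    r_new = chi.ppf(F_r, df=n)
    # Rescale each point so its norm becomes the chi-distributed value.
    return X * (r_new / r)[:, None]

# Example: an isotropic but heavy-tailed 2-D sample becomes approximately i.i.d. Gaussian.
rng = np.random.default_rng(0)
directions = rng.normal(size=(5000, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = rng.exponential(size=5000)               # non-chi radial profile
Z = radial_gaussianize(directions * radii[:, None])
print(np.cov(Z.T))                               # close to the identity
```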
In the presence of biases, the isotropy assumption is violated and RG is not guaranteed to work. We now demonstrate experimentally that even very small perturbations in the data can cause such a failure.

Experimental Evidence: Let us consider the dataset shown in Figure 1(a). It was generated by a mixture of two distributions: (i) a "noisy circle", and (ii) a Gaussian distribution. The pdf of the "noisy circle" is given by the formula

$$p_{NC}(x) = \frac{1}{2\pi} \int_{0}^{2\pi} \phi\!\left(\frac{x - c(\theta)}{\sigma}\right) d\theta,$$

where c(θ) is the point on the unit circle at angle θ from the x-axis, and σ, the standard deviation of the "noise", is set to 10^{-5}. We further assume that one of these distributions (say, the noisy circle) exhibits a small bias b; that is, the data generated by the slightly perturbed noisy circle have pdf $p_{NC}^{(b)}(x) = p_{NC}(x - b)$. Figures 1(b), (c), (d), (e), (f) show the output of RG in the cases $\|b\| = 0, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}$, respectively. As can be seen, even when the bias is very small (e.g., 10^{-4}), RG does not result in a Gaussian distribution.

[Fig. 1: Example showing a mixture of 2000 Gaussian points and another 2000 points which lie on a lower-dimensional manifold (a circle with Gaussian noise of standard deviation 10^{-5}), and the effect this has on radial Gaussianization. Panel (a) shows the original data; panels (b)-(f) show the data after RG with bias 0, 10^{-6}, 10^{-5}, 10^{-4}, and 10^{-3}, respectively. The amount of perturbation of the singularity has a profound effect on the subsequent Gaussianization.]

To address the above shortcoming, we can add an extra step that removes the bias. PCA (or ICA) is not adequate for this task, since it only performs a rotation (and possibly a scaling) of the space; the circle of Figure 1 would thus be preserved. Instead, we propose to cluster the dataset into spherically symmetric subsets, estimate the bias of each, subtract it off, and then perform RG. In the next section, we describe an optimization procedure that estimates the bias via minimization of the norm of the mean of the Gaussianized dataset.
4. RADIAL GAUSSIANIZATION AFTER BIAS COMPENSATION

The algorithm that we propose for tackling the bias issues mentioned in the previous section is as follows (a minimal sketch of the whole procedure appears after step 4):

1. Compute the norms of all data points. We assume that the data consist of a mixture of biased isotropic datasets, and the bias of each needs to be removed before RG. One way to discover these biased isotropic subsets is to cluster the norms.

2. Cluster the dataset based on the norms, using Gaussian mixtures and the EM algorithm. We used the R package mclust [6, 7], which initializes EM with a hierarchical clustering. (We also tried spectral clustering, using the differences between data point norms to build the Gram matrix; the results were worse than those with Gaussian mixtures, so we focus only on the latter in this paper.) Let the clusters be denoted by $C_1, \ldots, C_k$.

3. Instead of estimating the biases of the clusters jointly (joint optimization is slow when the objective is not given in closed form or is not differentiable), we follow a greedy approach. First, estimate the bias $b_j$ of each cluster $C_j$ independently, as the minimizer of the norm of the mean of the Gaussianized data:

$$\hat{b}_j = \arg\min_{b} \big\| \mathrm{Mean}\big(T_{RG}(C_j - b)\big) \big\|,$$
where $C_j - b$ is simply the result of subtracting b from each data point of $C_j$. Thus, perfect bias estimation results in a well-Gaussianized subset. (We also experimented with using the mean of each cluster $C_j$ directly as an estimate of its bias; this did not improve RG, because the clusters were not estimated perfectly, and even a small error in the direct estimation of the bias causes RG to fail.) The optimization of the norm of the mean is carried out with Powell's method [8]. Then, the clusters are sorted in order of decreasing $\mathrm{Negentropy}(C_j)$, say $C_{i_1}, \ldots, C_{i_k}$, and the biases of step 3 are subtracted off. For each $m \in \{1, \ldots, k\}$, the quantity
$$H(m) = \mathrm{Negentropy}\!\left( T_{RG}\!\left( \bigcup_{j \le m} \big(C_{i_j} - \hat{b}_{i_j}\big) \;\cup \bigcup_{j > m} C_{i_j} \right) \right)$$

is computed, and the bias $\hat{b}_{i_m}$ is set to zero (no bias compensation) when $H(m) > H(m-1)$. This is done as an approximation to searching over all $2^k$ subsets of clusters for those whose bias compensation leads to the largest decrease in negentropy.

4. Finally, perform RG on the whole (normalized) dataset.
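As referenced above, the following is a minimal end-to-end sketch of the proposed procedure, with several stand-ins that are our own assumptions rather than the paper's implementation: scikit-learn's GaussianMixture replaces the mclust package, a simple m-spacing entropy estimator with m = √N stands in for the estimator of [5], negentropy is averaged over random projections as in Section 2, and all function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi
from sklearn.mixture import GaussianMixture

def radial_gaussianize(X):
    """Eq. (1): rank-based estimate of F_||X||, followed by the chi quantile."""
    r = np.linalg.norm(X, axis=1)
    u = (np.argsort(np.argsort(r)) + 0.5) / len(r)
    return X * (chi.ppf(u, df=X.shape[1]) / r)[:, None]

def entropy_mspacing(x):
    """Simple m-spacing entropy estimate of a 1-D sample, with m = sqrt(N)."""
    x = np.sort(np.asarray(x, dtype=float))
    N = len(x)
    m = max(1, int(np.sqrt(N)))
    spacings = np.maximum(x[m:] - x[:-m], 1e-12)
    return float(np.mean(np.log(spacings * (N + 1) / m)))

def negentropy(X, n_dirs=50, seed=0):
    """Average negentropy over random 1-D projections: entropy of a unit-variance
    Gaussian minus the estimated entropy of the standardized projection."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_dirs):
        d = rng.normal(size=X.shape[1])
        d /= np.linalg.norm(d)
        p = X @ d
        p = (p - p.mean()) / (p.std() + 1e-12)
        vals.append(0.5 * np.log(2 * np.pi * np.e) - entropy_mspacing(p))
    return float(np.mean(vals))

def bias_compensated_rg(X, k):
    """Cluster on norms, estimate each cluster's bias with Powell's method,
    greedily keep only the compensations that reduce negentropy, then run RG."""
    norms = np.linalg.norm(X, axis=1)[:, None]
    labels = GaussianMixture(n_components=k, random_state=0).fit_predict(norms)
    clusters = [X[labels == j] for j in range(k)]
    # Step 3: per-cluster bias minimizing ||Mean(T_RG(C_j - b))||.
    biases = []
    for C in clusters:
        obj = lambda b: np.linalg.norm(radial_gaussianize(C - b).mean(axis=0))
        biases.append(minimize(obj, x0=np.zeros(X.shape[1]), method='Powell').x)
    # Visit clusters in order of decreasing negentropy; accept a compensation
    # only if it does not increase the overall negentropy H(m).
    order = np.argsort([-negentropy(C) for C in clusters])
    accepted = np.zeros((k, X.shape[1]))
    best = negentropy(radial_gaussianize(X))        # H(0): no compensation at all
    for j in order:
        trial = accepted.copy()
        trial[j] = biases[j]
        shifted = np.vstack([clusters[i] - trial[i] for i in range(k)])
        h = negentropy(radial_gaussianize(shifted))
        if h <= best:
            accepted, best = trial, h
    # Step 4: RG on the whole bias-compensated dataset.
    return radial_gaussianize(np.vstack([clusters[i] - accepted[i] for i in range(k)]))
```

Under these assumptions, a call like `bias_compensated_rg(X, 2 * n + 1)`, with n the number of concentric circles, corresponds roughly to the "Proposed 1" setting of Section 5.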
5. EXPERIMENTAL RESULTS

This section presents results from Monte Carlo experiments on several isotropic datasets, comparing (i) the C. & G. method [2], (ii) the radial Gaussianization of [1], and (iii) two variations of the proposed method of Section 4: "Proposed 1" assumes that the number of concentric clusters in the dataset is known, while "Proposed 2" assumes it is unknown. We provide a numerical evaluation of each method in terms of the average negentropy [3] of the transformed data.

All synthetic datasets used in the experiments were 2-dimensional: they contained noisy concentric circles (1 up to 6 circles) together with "ambient" Gaussian noise, similar to the dataset shown in Figure 1. We generated on average 400 random datasets of each type, where half were used as "training" sets for tuning the Gaussianization algorithms (the tuned parameters were (i) the search step size used in Powell's method [8] and (ii) the number of Gaussian mixture components in the method of C. & G. [2]), and the rest were used as "test" sets for reporting performance with the tuned parameters. In each type (1 up to 6 circles), the radii of the concentric circles are chosen randomly according to a Poisson distribution, the variances of the noisy circles are chosen from {10^{-3}, 10^{-4}, 10^{-5}}, and the biases (chosen from {10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}}) are added in randomly chosen directions. The number of data points per dataset varies with the number of singular noisy circles (roughly 3000-10000 points per dataset).

Table 1 shows the average negentropy over the "test" isotropic datasets, ± 1 standard deviation. In "Proposed 1", we set the number of clusters k to be a fixed function of n, the number of (biased) concentric circles in the dataset. Specifically, the function k = 2n + 1 gave the best results among various plausible choices for k (e.g., k = n, k = n + 1, etc.). This is explained by the fact that the clustering algorithm "prefers" to group the ambient Gaussian noise that lies between two consecutive circles into a separate cluster. In "Proposed 2", on the other hand, the number of clusters is computed based on BIC; it is adaptive, and different datasets get different numbers of clusters, even for the same n.

As is clear from these results, the proposed method is superior to the uncompensated radial Gaussianization, and competitive with (in the case of more than 2 circles even superior to) the method of C. & G.; the latter differences are statistically significant, with p-value < 2.2 × 10^{-16}. We spot-checked some of the transformed datasets and noticed that the proposed methods produced visibly more appealing outputs than those of C. & G.
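For concreteness, here is a hedged sketch of how one synthetic dataset of this kind might be generated; the paper does not fully specify the generation details, so the particular radii offsets, point counts, and ambient-noise scale below are our own assumptions.

```python
import numpy as np

def make_dataset(n_circles=3, n_per_circle=1000, n_ambient=2000, seed=0):
    """Generate a 2-D dataset in the spirit of Section 5: noisy concentric
    circles with small biases, plus ambient Gaussian noise."""
    rng = np.random.default_rng(seed)
    radii = 1.0 + rng.poisson(lam=2.0, size=n_circles)   # Poisson-drawn radii (offset is an assumption)
    parts = []
    for r in radii:
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n_per_circle)
        sigma = rng.choice([1e-3, 1e-4, 1e-5])            # circle noise scale, values from Sec. 5
        circle = r * np.column_stack([np.cos(theta), np.sin(theta)])
        circle += rng.normal(scale=sigma, size=circle.shape)
        bias = rng.choice([1e-2, 1e-3, 1e-4, 1e-5])       # small bias magnitude, values from Sec. 5
        direction = rng.normal(size=2)
        circle += bias * direction / np.linalg.norm(direction)
        parts.append(circle)
    parts.append(rng.normal(scale=radii.max(), size=(n_ambient, 2)))  # ambient Gaussian noise
    return np.vstack(parts)

X = make_dataset()
print(X.shape)    # (5000, 2) for 3 circles of 1000 points plus 2000 ambient points
```

A dataset produced this way can be fed to the bias-compensation sketch of Section 4 and scored with the negentropy estimator used there.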
Table 1: Negentropy results on the isotropic datasets (average over the "test" sets ± 1 standard deviation).

Gaussianization    1 circle          2 circles         3 circles
C. & G. [2]        0.0722 ± 0.0030   0.0667 ± 0.0042   0.0643 ± 0.0037
Radial [1]         0.1049 ± 0.0378   0.0967 ± 0.0372   0.0878 ± 0.0336
Proposed 1         0.0723 ± 0.0039   0.0630 ± 0.0051   0.0572 ± 0.0080
Proposed 2         0.0724 ± 0.0039   0.0630 ± 0.0051   0.0566 ± 0.0052

Gaussianization    4 circles         5 circles         6 circles
C. & G. [2]        0.0633 ± 0.0036   0.0621 ± 0.0035   0.0604 ± 0.0031
Radial [1]         0.0830 ± 0.0288   0.0760 ± 0.0253   0.0706 ± 0.0225
Proposed 1         0.0535 ± 0.0079   0.0504 ± 0.0079   0.0473 ± 0.0073
Proposed 2         0.0522 ± 0.0051   0.0485 ± 0.0051   0.0462 ± 0.0054
6. CONCLUDING REMARKS

This paper presented an approach for improving the radial Gaussianization technique of [1] in the case where the original data contain a mixture of (nearly singular) clusters with biases. Our proposed two-step approach first clusters the data based on their norms, using Gaussian mixtures; the bias of each cluster is then estimated through minimization of the norm of the mean of the Gaussianized output. The biases are subtracted off the respective clusters, and radial Gaussianization is applied to the (normalized) dataset. The experiments show that the proposed methods 1 and 2 work better than the method of C. & G. [2] on the more complicated datasets (smaller noise variances, more circles, etc.). In future work, we plan to extend this approach to non-isotropic datasets and apply it to real data.

7. REFERENCES

[1] S. Lyu and E. P. Simoncelli, "Nonlinear extraction of independent components of natural images using radial Gaussianization," Neural Computation, vol. 21, pp. 1485-1519, 2009.

[2] S. S. Chen and R. A. Gopinath, "Gaussianization," Adv. Neural Comput. Syst., vol. 13, pp. 423-429, 2000.

[3] P. Comon, "Independent component analysis, a new concept?," Signal Processing, vol. 36, pp. 287-414, 1994.

[4] V. Laparra, G. Camps-Valls, and J. Malo, "PCA Gaussianization for image processing," in Proceedings of ICIP-09, Cairo, Egypt, 2009.

[5] O. Vasicek, "A test for normality based on sample entropy," Journal of the Royal Statistical Society, vol. 38, pp. 54-59, 1976.

[6] C. Fraley and A. E. Raftery, "Model-based clustering, discriminant analysis and density estimation," J. Amer. Statistical Assoc., vol. 97, pp. 611-631, 2002.

[7] C. Fraley and A. E. Raftery, "MCLUST version 3 for R: Normal mixture modeling and model-based clustering," Tech. report, Univ. of Washington, Dept. of Statistics, 2006.

[8] M. J. D. Powell, "UOBYQA: Unconstrained optimization by quadratic approximation," Technical Report DAMTP 2000/NA14, University of Cambridge, 2000.