AN EFFICIENT RANK-DEFICIENT COMPUTATION OF THE PRINCIPLE OF RELEVANT INFORMATION

Luis Gonzalo Sánchez Giraldo, José C. Príncipe
University of Florida, Department of Electrical and Computer Engineering, Gainesville, FL 32611
{sanchez,principe}@cnel.ufl.edu

ABSTRACT

One of the main difficulties in computing information theoretic learning (ITL) estimators is the computational complexity, which grows quadratically with the number of data points. A considerable amount of work has been done on computing low rank approximations of Gram matrices without accessing all their elements. In this paper we discuss how these techniques can be applied to reduce the computational complexity of the Principle of Relevant Information (PRI). This particular objective function involves estimators of Renyi's second-order entropy and cross-entropy and their gradients, and therefore poses a technical challenge for implementation in a realistic scenario. Moreover, we introduce a simple modification to the Nyström method motivated by the idea that our estimator must perform accurately only for certain vectors, not for all possible cases. We show some results on how these rank deficient decompositions allow the application of the PRI to moderately large datasets.

Index Terms— Kernel methods, Information Theoretic Learning, Rank deficient factorization, Nyström method.

This work is funded by ONR N00014-10-1-0375 and the UF ECE Department Latin American Fellowship Award.

1. INTRODUCTION
In recent years, kernel methods have received increasing attention within the machine learning community. They are theoretically elegant, algorithmically simple, and have shown considerable success in several practical problems. Information theoretic learning (ITL) is another emerging line of research with links to kernel based estimation. However, ITL stems from a conceptually different framework [1]. For instance, the type of kernels employed in ITL need not be positive definite [2]. Despite this fundamental difference, applications of ITL often use Gaussian or Laplacian kernels, which are positive definite. An important feature of information theory is that it casts problems in principled operational quantities that have direct interpretation. For example, in unsupervised learning some of the paradigms have emerged from information theory; the Infomax principle [3] and the preservation of mutual information across systems [4] are within this category. Regularities in data can reveal the structure of the underlying generating process; capturing this structure is therefore a problem of relevance determination. Low entropy components of a random variable can be attributed to its generating process. Rao [5] proposes an ITL objective that attempts to capture the underlying structure of a random variable through its PDF; this is called the principle of relevant information.
A major issue, which we address in this paper, is that the amount of computation associated with the PRI grows quadratically with the size of the available sample. This limits the scale of the applications if one applies the formulas directly. The problem of polynomial growth in complexity has also received attention within the machine learning community working on kernel methods. Consequently, approaches to compute approximations to positive semidefinite matrices based on kernels have been proposed [6, 7]. The goal of these methods is to accurately estimate large Gram matrices without computing all of their n² elements directly. It has been observed that in practice the eigenvalues of the Gram matrix drop rapidly, and therefore replacing the original matrix by a low rank approximation seems reasonable [7, 8]. In our work, we derive an algorithm for the principle of relevant information based on rank deficient approximations of a Gram matrix. We also propose a simple modified version of the Nyström method particularly suited for estimation in ITL. The paper starts with a brief introduction to Renyi's entropy and the associated information quantities, with their corresponding rank deficient approximations. Then, the objective function for the principle of relevant information (PRI) is presented. Next, we propose an implementation of the optimization problem based on rank deficient approximations. The algorithm is tested on simulated data for various accuracy regimes (different ranks), followed by some results on realistic scenarios. Finally, we provide some conclusions along with future work directions.
2. RANK DEFICIENT APPROXIMATION FOR ITL

2.1. Renyi's α-Order Entropy and Related Functions

In information theory, a natural extension of the commonly used Shannon entropy is the α-order entropy proposed by Renyi [9]. For a random variable X with probability density function (PDF) f(x) and support \mathcal{X}, the α-entropy H_α(f) is defined as

H_\alpha(f) = \frac{1}{1-\alpha}\,\log \int_{\mathcal{X}} f^{\alpha}(x)\,dx.    (1)

The case α → 1 gives Shannon's entropy. Similarly, a modified version of Renyi's definition of the α-relative entropy between random variables with PDFs f and g is given in [10],

D_\alpha(f\|g) = \log \frac{\left(\int f\, g^{\alpha-1}\right)^{\frac{1}{1-\alpha}} \left(\int g^{\alpha}\right)^{\frac{1}{\alpha}}}{\left(\int f^{\alpha}\right)^{\frac{1}{\alpha(1-\alpha)}}}.    (2)

Likewise, α → 1 yields Shannon's relative entropy (the KL divergence).
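Both limiting statements follow from L'Hôpital's rule; for the entropy (1), for instance, the step is:

\lim_{\alpha\to 1} H_\alpha(f) = \lim_{\alpha\to 1}\frac{\log\int_{\mathcal{X}} f^{\alpha}(x)\,dx}{1-\alpha} = \lim_{\alpha\to 1}\frac{\int_{\mathcal{X}} f^{\alpha}(x)\log f(x)\,dx \,\big/\, \int_{\mathcal{X}} f^{\alpha}(x)\,dx}{-1} = -\int_{\mathcal{X}} f(x)\log f(x)\,dx.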
An important component of the relative entropy is the cross-entropy term H_α(f; g), which quantifies the information gain from observing g with respect to the "true" density f. It turns out that for the case α = 2, the above quantities can be expressed, under some restrictions, as functions of inner products between PDFs. In particular, the 2-order entropy of f and the cross-entropy between f and g are

H_2(f) = -\log \int_{\mathcal{X}} f^{2}(x)\,dx,    (3)

H_2(f; g) = -\log \int_{\mathcal{X}} f(x)\,g(x)\,dx.    (4)

The associated relative entropy of order 2 is called the Cauchy-Schwarz divergence and is defined as follows:

D_{CS}(f\|g) = -\frac{1}{2}\log \frac{\left(\int f g\right)^{2}}{\int f^{2} \int g^{2}}.    (5)
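Setting α = 2 in (2) recovers (5) by direct substitution; the intermediate steps, made explicit:

D_2(f\|g) = \log \frac{\left(\int f g\right)^{-1}\left(\int g^{2}\right)^{\frac{1}{2}}}{\left(\int f^{2}\right)^{-\frac{1}{2}}} = -\log \frac{\int f g}{\sqrt{\int f^{2}\int g^{2}}} = -\frac{1}{2}\log \frac{\left(\int f g\right)^{2}}{\int f^{2}\int g^{2}} = D_{CS}(f\|g).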
The above operations assume that f and g are known, which is almost never the case when learning from data. Plug-in estimators of the second-order Renyi entropy and cross-entropy (and of the Cauchy-Schwarz divergence (5)) can be derived using Parzen density estimators. For an i.i.d. sample S = {x_i}_{i=1}^{n} ⊆ R^p drawn from g, the Parzen density estimate ĝ at x is given by ĝ(x) = (1/n) Σ_{i=1}^{n} κ(x, x_i), where κ is an admissible kernel [11]. Consider two samples S_1 = {x_i}_{i=1}^{n} and S_2 = {y_i}_{i=1}^{m}, both in R^p, drawn i.i.d. from g and f, respectively. Let K_1 be the matrix of all pairwise evaluations of κ on S_1, that is, K_1(i,j) = κ(x_i, x_j); the entropy estimate is then Ĥ(g) = -log((1/n²) 1ᵀK_1 1), and Ĥ(f) can be derived in a similar fashion from the matrix K_2. The cross-entropy estimate is Ĥ(f; g) = -log((1/(nm)) 1ᵀK_12 1), where K_12(i,j) = κ(x_i, y_j). Note that we are basically estimating the arguments of the log functions in (3) and (4); we will refer to them as the information potential (IP) and the cross-information potential (CIP) [1]. For positive semidefinite kernels that are also Parzen kernels, the K's are Gram matrices.

2.2. Rank Deficient Approximation

Any symmetric positive semidefinite matrix A can be written as a product GGᵀ; note that this decomposition need not be unique.

Incomplete Cholesky Decomposition: The special case of the LU factorization for symmetric positive definite matrices is known as the Cholesky decomposition [12]. Here, G is a lower triangular matrix with positive diagonal. The advantage of this decomposition is that we can approximate the Gram matrix K to arbitrary accuracy by choosing a lower triangular matrix G̃ with d columns such that ‖K - G̃G̃ᵀ‖ ≤ ε (for a suitable matrix norm). This incomplete Cholesky decomposition (ICD) can be computed by a greedy approach that minimizes the trace of the residual K - G̃G̃ᵀ [7, 8, 13]. For an n × n matrix, the storage cost of this method is O(nd) and the time complexity is O(nd²); therefore, this algorithm is preferable only when d² ≪ n. The error of the ICD based estimators can be easily bounded: for a positive semidefinite matrix we have ‖A‖₂ ≤ tr A, and for the error matrix tr(K - G̃G̃ᵀ) ≤ ε. The estimators treated in this paper are mostly of the form aᵀKb, and so the error is bounded by ε‖a‖‖b‖.

Nyström Approximation: This is a well known rank deficient approximation to K in machine learning [14]. The approximate Gram matrix K̃ is computed by projecting all the data points onto a subspace spanned by a random subsample of size d in the feature space. Consequently, K̃ = K_d K_dd^{-1} K_dᵀ, where K_d contains the kernel evaluations between all data points and the subsample of size d, and K_dd contains the kernel evaluations among the points of the subsample.
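For illustration, a minimal sketch (not part of the original paper) of the Nyström factor in numpy, assuming a Gaussian kernel and uniform random subsampling; it returns a factor G such that GGᵀ ≈ K_d K_dd^{-1} K_dᵀ via the eigendecomposition of K_dd:

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """Pairwise Gaussian kernel evaluations between the rows of X and Y."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def nystrom_factor(X, d, sigma, rng=None):
    """Return G (n x d) with G @ G.T approximating the full Gram matrix of X."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.choice(len(X), size=d, replace=False)  # random subsample of size d
    K_d = gaussian_gram(X, X[idx], sigma)            # n x d block
    K_dd = K_d[idx]                                  # d x d block
    lam, U = np.linalg.eigh(K_dd)                    # K_dd = U diag(lam) U^T
    keep = lam > 1e-12 * lam.max()                   # guard against tiny eigenvalues
    # G = K_d U diag(lam)^(-1/2), so that G G^T = K_d K_dd^(-1) K_d^T
    return K_d @ U[:, keep] / np.sqrt(lam[keep])
```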
The price we pay for this simplicity is that the accuracy of the approximation cannot be simply determined; an improved version of this method with an error guarantee can be found in [15]. One important remark on the Nyström method relates to the computation of K_dd^{-1}, for which we can employ the eigendecomposition K_dd = UΛUᵀ.

Nyström-KECA: Suppose we want to reduce the size of K_dd even further based on its eigenvalues. We obtain a good projection, in terms of the squared norm in the reproducing kernel Hilbert space associated with κ, if we pick the columns of U corresponding to the largest eigenvalues on the diagonal of Λ. However, we are more interested in the projection of the mean of the mapped data points, μ_Φ. This idea resembles the approach followed in [16] for the stopping criterion of orthogonal series density estimation based on kernel PCA. The matrix K_d K_dd^{-1} K_dᵀ represents an Ansatz product in the feature space H of the form ⟨Φ(x), PΦ(y)⟩, where P is a projection operator; in particular, ⟨μ_Φ, Pμ_Φ⟩ ≤ ⟨μ_Φ, μ_Φ⟩. We can obtain a series with faster convergence than the one obtained by ordering the eigenvalues of K_dd^{-1} in a non-increasing way. Such a series is created by ordering the columns of U and their respective eigenvalues according to the score s_i = λ_i^{-1/2} [u_iᵀ K_dᵀ 1]². We call this decomposition Nyström-KECA because it resembles the kernel entropy component analysis proposed in [17].

The computation of the estimators of the IP and CIP can be easily carried out by computing a low rank decomposition of the (n+m) × (n+m) Gram matrix K of the augmented sample S = {x_i}_{i=1}^{n} ∪ {y_i}_{i=1}^{m},

K = \begin{bmatrix} K_1 & K_{12} \\ K_{12}^\top & K_2 \end{bmatrix}.    (6)

Since K ≈ G̃G̃ᵀ with d ≪ n + m, the block array in (6) can be expressed in sub-blocks of G̃ (the sub-blocks cannot be computed individually from K_1 and K_2):

K \approx \begin{bmatrix} \tilde{G}_1\tilde{G}_1^\top & \tilde{G}_1\tilde{G}_2^\top \\ \tilde{G}_2\tilde{G}_1^\top & \tilde{G}_2\tilde{G}_2^\top \end{bmatrix}, \qquad \tilde{G} = \begin{bmatrix} \tilde{G}_1 \\ \tilde{G}_2 \end{bmatrix}.    (7)

Then for the IP, (1/n²) 1ᵀK_1 1 ≈ (1/n²) 1ᵀG̃_1 G̃_1ᵀ 1, and for the CIP, (1/(nm)) 1ᵀK_12 1 ≈ (1/(nm)) 1ᵀG̃_1 G̃_2ᵀ 1. Note that computing the CIP this way needs roughly O((n+m)d²) operations rather than O(nm). It may seem redundant to work indirectly with an (n+m) × (n+m) matrix when we are only interested in an n × m block; however, our problem requires both the IP and the CIP.
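A sketch of these two steps, using hypothetical helper names and building on the nystrom_factor sketch above: reorder the eigendirections by the score s_i, then read the IP and CIP off the sub-blocks of the joint factor.

```python
import numpy as np

def nystrom_keca_factor(K_d, K_dd, rank):
    """Low-rank factor with columns ordered by the KECA-style score
    s_i = lam_i^(-1/2) * (u_i^T K_d^T 1)^2, keeping only `rank` of them."""
    lam, U = np.linalg.eigh(K_dd)
    good = lam > 1e-12 * lam.max()
    lam, U = lam[good], U[:, good]
    proj = U.T @ (K_d.T @ np.ones(len(K_d)))   # u_i^T K_d^T 1 for every direction i
    score = proj**2 / np.sqrt(lam)             # s_i = lam_i^(-1/2) (u_i^T K_d^T 1)^2
    order = np.argsort(score)[::-1][:rank]     # keep the highest-scoring directions
    return K_d @ U[:, order] / np.sqrt(lam[order])

def ip_and_cip(G, n, m):
    """IP and CIP estimates from the factor of the augmented Gram matrix,
    G = [G1; G2] with G1 the first n rows (sample S1) and G2 the last m rows (S2)."""
    G1, G2 = G[:n], G[n:n + m]
    s1 = G1.T @ np.ones(n)                     # G1^T 1
    s2 = G2.T @ np.ones(m)                     # G2^T 1
    ip = (s1 @ s1) / n**2                      # (1/n^2) 1^T G1 G1^T 1
    cip = (s1 @ s2) / (n * m)                  # (1/(nm)) 1^T G1 G2^T 1
    return ip, cip
```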
3. THE PRINCIPLE OF RELEVANT INFORMATION

Regularities in the data can be attributed to structure in the underlying generating process. These regularities can be quantified by the entropy estimated from data; hence, we can think of entropy minimization as a means for finding such regularities. Suppose we are given a random variable X with PDF g, for which we want to find a description in terms of a PDF f with reduced entropy, that is, a variable Y that captures the underlying structure of X. The principle of relevant information (PRI) casts this problem as a trade-off between the entropy of Y, H_2(f), and its descriptive power about X in terms of their relative entropy D_CS(f‖g). The principle can be briefly understood as a trade-off between minimizing redundancy and preserving most of the original structure of a given probability density function. For a fixed PDF g ∈ F the objective is

\arg\min_{f \in \mathcal{F}} \big[ H_2(f) + \lambda D_{CS}(f\|g) \big] \quad \text{for } \lambda \geq 1.    (8)
The trade-off parameter λ defines various regimes for this cost function, ranging from clustering (λ = 1), to a regime reminiscent of principal curves (λ ≈ p), to vector quantization (λ → ∞).

3.1. PRI as a Self Organization Mechanism

A solution to the above search problem was proposed in [5]. The method combines Parzen density estimation with a self organization of a sample to match the desired density f that minimizes (8). The optimization problem becomes

\min_{Y \in (\mathbb{R}^p)^m} \big[ \hat{H}_2(Y) + \lambda \hat{D}_{CS}(Y\|X) \big],    (9)

where X ∈ (R^p)^n is a set of p-dimensional points with cardinality n, and Y is a set of p-dimensional points with cardinality m. Problem (9) is equivalent to

\min_{Y \in (\mathbb{R}^p)^m} \big[ (1-\lambda)\hat{H}_2(Y) + 2\lambda \hat{H}_2(Y; X) \big] \;=\; \min_{Y \in (\mathbb{R}^p)^m} J_\lambda(Y).    (10)

For the Gaussian kernel κ_σ(x, y) = exp(-‖x - y‖²/(2σ²)) we can evaluate the cost (10) as

J_\lambda(Y) = -(1-\lambda)\log \frac{1}{m^2}\,\mathbf{1}^\top K_2 \mathbf{1} \;-\; 2\lambda \log \frac{1}{mn}\,\mathbf{1}^\top K_{12} \mathbf{1}.    (11)

(Note that the employed kernel does not integrate to one. This is not a problem, since the normalization factor becomes an additive constant in the objective function.)
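For concreteness, a direct (dense) evaluation of (11) might look as follows. This is only a sketch, reusing the gaussian_gram helper from the earlier snippet, and it is exactly the O(nm) computation that the rank deficient factorizations are meant to avoid.

```python
import numpy as np

def pri_cost(Y, X, sigma, lam):
    """Evaluate J_lambda(Y) in (11) with dense Gaussian Gram matrices."""
    m, n = len(Y), len(X)
    K2 = gaussian_gram(Y, Y, sigma)    # m x m Gram matrix of Y
    K12 = gaussian_gram(X, Y, sigma)   # n x m cross Gram matrix
    ip = K2.sum() / m**2               # (1/m^2) 1^T K2 1
    cip = K12.sum() / (n * m)          # (1/(nm)) 1^T K12 1
    return -(1 - lam) * np.log(ip) - 2 * lam * np.log(cip)
```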
The self organization principle moves each particle y_r according to the forces exerted by the samples X and Y: the entropy minimization creates attractive forces among the y_i's, while the sample X induces a force field that restricts the movement of each y_i. Computing the partial derivatives of (11) with respect to each point y_r ∈ Y yields

\frac{\partial}{\partial y_r} \log \frac{1}{m^2}\,\mathbf{1}^\top K_2 \mathbf{1} = \frac{2\sum_{i=1}^{m} \kappa_\sigma(y_r, y_i)\,\frac{y_i - y_r}{\sigma^2}}{\mathbf{1}^\top K_2 \mathbf{1}},    (12)

\frac{\partial}{\partial y_r} \log \frac{1}{mn}\,\mathbf{1}^\top K_{12} \mathbf{1} = \frac{\sum_{i=1}^{n} \kappa_\sigma(y_r, x_i)\,\frac{x_i - y_r}{\sigma^2}}{\mathbf{1}^\top K_{12} \mathbf{1}}.    (13)

A direct optimization of the cost in (11) is computationally burdensome and only feasible for sample sizes up to a few thousand; this limits the applicability of the principle. Here, we want to overcome this limitation by allowing computation on the larger samples commonly encountered in signal processing applications. Below, we develop a way of incorporating the rank deficient approximations into the gradient of the PRI cost function; it is important to remind the reader that this solution can be easily adapted to other ITL objectives. Consider the following identity: for A = abᵀ, where a and b are column vectors,

C \circ A = \mathrm{diag}(a)\, C\, \mathrm{diag}(b),    (14)

where diag(z) denotes a diagonal matrix with the elements of z on the main diagonal; for simplicity we denote diag(·) as d(·). We restrict the analysis to the Gaussian kernel, but a similar treatment can be adopted for other kernels with similar properties of their derivatives. Let ΔY^(k) and ΔX^(k) be matrices with entries ΔY^(k)_ij = y^(k)_i - y^(k)_j and ΔX^(k)_ij = x^(k)_i - y^(k)_j, respectively; by z^(k)_i we mean the k-th component of the vector z_i. Equations (12) and (13) can be re-expressed and combined for all r = 1, ..., m using matrix operations as

\frac{\partial}{\partial \mathbf{y}^{(k)}} \hat{H}_2(Y) = -\frac{2}{\sigma^2}\,\frac{\mathbf{1}^\top\!\left(K_2 \circ \Delta Y^{(k)}\right)}{\mathbf{1}^\top K_2 \mathbf{1}},    (15)

\frac{\partial}{\partial \mathbf{y}^{(k)}} \hat{H}_2(Y; X) = -\frac{1}{\sigma^2}\,\frac{\mathbf{1}^\top\!\left(K_{12} \circ \Delta X^{(k)}\right)}{\mathbf{1}^\top K_{12} \mathbf{1}}.    (16)

Decomposing ΔY^(k) = y^(k) 1ᵀ - 1 y^(k)ᵀ and ΔX^(k) = x^(k) 1ᵀ - 1 y^(k)ᵀ, where y^(k) = (y^(k)_1, y^(k)_2, ..., y^(k)_m)ᵀ and x^(k) = (x^(k)_1, x^(k)_2, ..., x^(k)_n)ᵀ, and applying the identity (14) to (15) and (16), after some algebra, yields

\frac{\partial}{\partial \mathbf{y}^{(k)}} J_\lambda(Y) = a\left[\mathbf{y}^{(k)\top} K_2 - \mathbf{1}^\top K_2\, d(\mathbf{y}^{(k)})\right] + b\left[\mathbf{x}^{(k)\top} K_{12} - \mathbf{1}^\top K_{12}\, d(\mathbf{y}^{(k)})\right],    (17)

where a = -2(1-λ)/(σ² 1ᵀK_2 1) and b = -2λ/(σ² 1ᵀK_12 1). Finally, it is easy to verify that for the rank deficient approximation K ≈ G̃G̃ᵀ,

\frac{\partial}{\partial \mathbf{y}^{(k)}} J_\lambda(Y) \approx a\,\mathbf{y}^{(k)\top}\tilde{G}_2\tilde{G}_2^\top + b\,\mathbf{x}^{(k)\top}\tilde{G}_1\tilde{G}_2^\top - \left(a\,\mathbf{1}^\top\tilde{G}_2\tilde{G}_2^\top + b\,\mathbf{1}^\top\tilde{G}_1\tilde{G}_2^\top\right) d(\mathbf{y}^{(k)}).    (18)

The last expression can be computed in O(max{n, m} d) operations, where d is the rank of K̃, instead of the O(max{nm, m²}) required by the fixed point algorithm presented in [5].
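The following sketch assembles the approximate gradient (18) for all coordinates at once. It reuses the hypothetical helpers above and assumes that the IP and CIP appearing in a and b are also computed from the factor; it is meant only to make the bookkeeping explicit, not to reproduce the exact fixed point iteration of [5].

```python
import numpy as np

def pri_gradient(Y, X, G1, G2, sigma, lam):
    """Approximate gradient of J_lambda(Y), eq. (18), from the factor blocks
    G1 (n x d, rows for X) and G2 (m x d, rows for Y). Returns an m x p array."""
    n, m = len(X), len(Y)
    s1, s2 = G1.T @ np.ones(n), G2.T @ np.ones(m)   # G1^T 1 and G2^T 1
    ip = s2 @ s2                                    # 1^T K2 1   (low-rank approx.)
    cip = s1 @ s2                                   # 1^T K12 1  (low-rank approx.)
    a = -2 * (1 - lam) / (sigma**2 * ip)
    b = -2 * lam / (sigma**2 * cip)
    r2 = s2 @ G2.T                                  # row vector 1^T G2 G2^T, shape (m,)
    r12 = s1 @ G2.T                                 # row vector 1^T G1 G2^T, shape (m,)
    grad = np.zeros_like(Y, dtype=float)
    for k in range(Y.shape[1]):                     # one spatial coordinate at a time
        yk, xk = Y[:, k], X[:, k]
        term_y = a * ((yk @ G2) @ G2.T)             # a y^(k)T G2 G2^T
        term_x = b * ((xk @ G1) @ G2.T)             # b x^(k)T G1 G2^T
        grad[:, k] = term_y + term_x - (a * r2 + b * r12) * yk
    return grad
```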
4. EXPERIMENTS
The experimental setup for the above methods is divided into two stages: in order to assess the performance of the methodology in terms of accuracy, we first test the rank deficient estimation algorithms on simulated data; we then apply our implementation of the PRI in two realistic scenarios, namely automatic image segmentation and signal denoising.

4.1. Simulated Data

For the simulated data we compute the information potential and the cross-information potential of a mixture of two unit isotropic Gaussian distributions in a four-dimensional space. The mean vectors are (1, 1, 1, 1)ᵀ and -(1, 1, 1, 1)ᵀ, the kernel size is set to σ = 2, and the size of the set is n = 1000. Figure 1 displays the performance of the estimators for different ranks, which are related to the accuracy level ε set for the incomplete Cholesky decomposition. Notice that the performance of Nyström-KECA remains constant, albeit slightly worse than pure Nyström. This is because we drop vectors with low scores, as described in Section 2.2; we therefore sacrifice some accuracy to lower the rank even further.
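A sketch of this experiment, with assumed details (the exact subsampling scheme and error measure behind Figure 1 are not specified in the text), generating the mixture and comparing the dense IP estimate with its Nyström counterpart:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1000, 2.0
mu = np.ones(4)
# Mixture of two isotropic Gaussians with means +mu and -mu.
signs = rng.choice([-1.0, 1.0], size=n)
X = signs[:, None] * mu + rng.standard_normal((n, 4))

K = gaussian_gram(X, X, sigma)              # dense Gram matrix (reference)
ip_exact = K.sum() / n**2

for d in (10, 50, 200):
    G = nystrom_factor(X, d, sigma)         # low-rank factor of rank d
    s = G.T @ np.ones(n)
    ip_lowrank = (s @ s) / n**2             # (1/n^2) 1^T G G^T 1
    print(d, abs(ip_lowrank - ip_exact))    # absolute error vs. the dense estimate
```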
[Fig. 1. Accuracies for the IP (a) and CIP (b) estimators as a function of the accuracy parameter ε, comparing incomplete Cholesky, Nyström, Nyström-KECA, and plain subsampling.]

4.2. Image Segmentation and Signal Denoising with the PRI

Automatic image segmentation is usually seen as a clustering problem, where spatial and intensity features are combined to discern objects from the background. A well established procedure for image segmentation, and for clustering in general, is the Gaussian mean shift (GMS). It has been shown that GMS is a special case of the PRI objective when the trade-off parameter λ is set to 1.
Treating images as collections of pixels is a fairly challenging task due to the number of points to be processed. Figure 2 shows the segmentation results using the PRI optimization described in Section 3.1 and the Nyström-KECA rank deficient factorization. The image resolution is 130 × 194 pixels, for a total of 25220 points.
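The paper does not spell out the feature construction; a plausible sketch, with hypothetical choices (pixel coordinates plus gray-level intensity, each standardized), of how an image would be turned into the point set fed to the PRI:

```python
import numpy as np

def pixel_features(img):
    """Stack (row, col, intensity) per pixel and standardize each feature."""
    rows, cols = np.indices(img.shape)
    F = np.column_stack([rows.ravel(), cols.ravel(), img.ravel()]).astype(float)
    return (F - F.mean(0)) / F.std(0)

# img = ...  # a 130 x 194 grayscale array would give 25220 points, as in Fig. 2
```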
[Fig. 2. Image segmentation using the PRI: (a) original image; (b) segmented image.]

Setting λ = 1, as already mentioned, defines a mode seeking regime for the PRI. For larger values of λ we obtained a solution for which points concentrate on highly dense regions and are nicely scattered along patterns resembling principal curves of the data. The key interpretation is that the estimated distribution gets pulled towards regions of higher entropy in the manifold of PDFs over which we search for an optimum of the PRI objective. Figure 3 shows the resulting denoised signal, which was embedded in a two-dimensional space, along with the contour plots of the estimated PDF. The number of points is 15000.

[Fig. 3. Noisy and denoised versions of a periodic signal embedded in a two-dimensional space.]

5. CONCLUSIONS

In this paper we suggest the use of rank deficient approximations to the Gram matrices involved in the estimation of ITL objective functions. In particular, we focus on the Principle of Relevant Information, which requires estimation of second-order entropies and cross-entropies (equivalently, information and cross-information potentials) along with their respective derivatives, which are employed during the optimization. We developed a methodological approach to factorize the elements involved in the gradient calculation, and these results can be extended to other methods that involve similar forms. Along the lines of rank deficient approximation, we propose a simple modification to the Nyström method that is motivated by the nature of the quantities we want to estimate. The presented methodology allows the application of the PRI to much larger datasets; we expect this improvement to open new directions in which the principle can be applied. Some of the analysis that led to the modification of the Nyström-based decomposition raises the question of which kernels are most useful in the context of information theoretic learning, based on their convergence properties for certain vectors.

6. REFERENCES

[1] Jose C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives, Information Science and Statistics, Springer, 2010.
[2] N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society, vol. 68, no. 3, pp. 337–404, 1950.
[3] Anthony J. Bell and Terrence J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, pp. 1129–1159, 1995.
[4] Ralph Linsker, "Self-organization in a perceptual network," Computer, vol. 21, pp. 105–117, 1988.
[5] Sudhir Madhav Rao, Unsupervised Learning: An Information Theoretic Framework, Ph.D. thesis, University of Florida, 2008.
[6] Christopher K. I. Williams and Matthias Seeger, "The effect of the input density distribution on kernel-based classifiers," in ICML, 2000, pp. 1159–1166.
[7] Shai Fine and Katya Scheinberg, "Efficient SVM training using low-rank kernel representations," JMLR, vol. 2, pp. 243–264, 2001.
[8] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," JMLR, vol. 3, pp. 1–48, 2002.
[9] Alfréd Rényi, "On measures of entropy and information," in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, 1961, vol. 1, pp. 547–561, University of California Press.
[10] Erwin Lutwak, Deane Yang, and Gaoyong Zhang, "Cramér-Rao and moment-entropy inequalities for Renyi entropy and generalized Fisher information," IEEE Transactions on Information Theory, vol. 51, no. 2, pp. 473–478, February 2005.
[11] Emanuel Parzen, "On estimation of a probability density function and mode," Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065–1076, 1962.
[12] Gene H. Golub and Charles F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, Maryland, third edition, 1996.
[13] Sohan Seth and José Príncipe, "On speeding up computation in information theoretic learning," in IJCNN, 2009.
[14] Christopher K. I. Williams and Matthias Seeger, "Using the Nyström method to speed up kernel machines," in NIPS, 2000, pp. 682–688.
[15] Petros Drineas and Michael W. Mahoney, "On the Nyström method for approximating a Gram matrix for improved kernel-based learning," JMLR, 2005.
[16] Mark Girolami, "Orthogonal series density estimation and the kernel eigenvalue problem," Neural Computation, vol. 14, no. 3, pp. 669–688, March 2002.
[17] Robert Jenssen, "Kernel entropy component analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.