VISUALIZING CLASS STRUCTURE IN DATA USING MUTUAL INFORMATION

Kari Torkkola
Motorola Labs, 2100 East Elliot Road, MD EL508, Tempe, AZ 85284, USA
[email protected]

Abstract. We study linear dimension-reducing transforms, using the maximum mutual information between the transformed data and the class labels as the criterion for learning the transforms. Renyi's quadratic entropy provides a differentiable and computationally feasible criterion on which gradient ascent algorithms can be based, without the limitations of methods that use only second-order statistics, such as PCA or LDA. An application to class structure visualization in exploratory data analysis is presented.

INTRODUCTION

Dimensionality reduction is essential in exploratory data analysis, where the purpose often is to map data onto a low-dimensional space for human eyes to gain some insight into the data. We are interested in mappings that reveal or enhance the class structure in the data. This is an important step in the process of selecting or designing appropriate classifiers for a given problem domain from which data has been collected. Reducing the dimensionality of feature vectors is also often an essential step in the actual classifier design to achieve practical feasibility. Usually this is done using domain knowledge, heuristics, or traditions of the field, but optimally we would like to use class discrimination as the criterion.

In this paper we study linear dimension-reducing transforms. One well-known such transform is principal component analysis (PCA). PCA seeks to optimally represent the data. The transform is derived from the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of all data, regardless of class. Thus PCA has little to do with discriminative features optimal for classification. However, it may be very useful in reducing noise in the data.

Linear discriminant analysis (LDA) produces a transform that is optimally discriminative for certain cases [7]. In one formulation, LDA finds the eigenvectors of $T = S_w^{-1} S_b$, where $S_b$ is the between-class covariance matrix and $S_w$ is the sum of the within-class covariance matrices. $S_w^{-1}$ captures the compactness of each class, and $S_b$ represents the separation of the class means. The eigenvectors corresponding to the largest eigenvalues of $T$ form the columns of the transform matrix $W$, and new discriminative features $y$ are derived from the original ones $x$ simply by $y = W^T x$.
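As a concrete point of reference, the eigenvector recipe just described can be sketched in a few lines of Python (a minimal sketch, assuming NumPy and SciPy; the function name, the small ridge term that keeps $S_w$ invertible, and the use of a generalized eigensolver are illustrative choices, not taken from the paper):

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, c, d):
    """Return a D x d LDA transform from data X (N x D) and integer labels c (N,)."""
    classes = np.unique(c)
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # sum of within-class scatter matrices
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for k in classes:
        Xk = X[c == k]
        Sw += np.cov(Xk, rowvar=False) * (len(Xk) - 1)
        diff = (Xk.mean(axis=0) - mu)[:, None]
        Sb += len(Xk) * diff @ diff.T
    # Generalized eigenproblem Sb v = lambda Sw v, i.e. eigenvectors of Sw^{-1} Sb;
    # the ridge keeps Sw invertible for degenerate data.
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(Sw.shape[0]))
    W = vecs[:, np.argsort(vals)[::-1][:d]]   # top-d eigenvectors as columns
    return W                                  # project with y = W.T @ x
```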

This simple algebraic way of deriving the transform matrix is both a strength and a weakness of the method. Since LDA makes use of only second-order statistics, the covariances, it is optimal for data in which each class has a unimodal Gaussian density with well-separated means. Also, the maximum rank of $S_b$ is $N_c - 1$, where $N_c$ is the number of classes. Thus LDA cannot produce more than $N_c - 1$ features. Although extensions have been proposed for the latter problem, see for example [9], the first one remains.

Dhillon et al. studied a slight modification of LDA for visualization purposes [4]. They chose to ignore $S_w^{-1}$, using only $S_b$ as the criterion. In effect, this maximizes the scatter of the class centroids with no regard to the compactness of the data of each class. This approach works fine as long as the classes are inherently compact, well separated, and reside in extreme corners of the feature space occupied by the data. In fact, in these cases PCA also works well for the same purpose.

Independent component analysis (ICA) has also been proposed as a tool to find "interesting" projections of the data [8, 13]. Girolami et al. maximize negentropy to find a subspace on which the data has the least Gaussian projection [8]. The criterion corresponds to finding a clustered structure in the data. This appears to be a very useful tool for revealing non-Gaussian structure in the data. However, like PCA, the method is completely unsupervised with regard to the class labels of the data, and thus is not able to enhance class separability.

This paper attempts to show that mutual information (MI) between the class labels and the transformed data can act as a more general criterion that overcomes many of these limitations. It accounts for higher-order statistics, not just second-order ones. In addition, it can also be used as the basis for non-linear transforms. MI has been shown to be an optimal criterion for class separation [5, 10]. The reasons why mutual information is not currently in wider use lie in computational difficulties. The probability density functions of the variables are required, and MI involves integrating functions of those densities, which in turn involves evaluating them on a dense set, leading to high computational complexity.

Evaluating MI between two scalar variables is feasible through histograms, and this approach has found use in feature selection rather than in feature transforms [1, 3, 13]. Yang and Moody visualize data by selecting those two features, out of all $N^2$ combinations, that maximize the joint MI of the features and the labels [13]. This is just feasible for two features but not for more. Bollacker and Ghosh proceed sequentially to find a linear transform [2]. They find a single direction that maximizes the MI, then continue in a subspace orthogonal to the directions already found. However, greedy algorithms based on sequential feature selection using MI are suboptimal, as they fail to find a feature set that would jointly maximize the MI between the features and the labels. This failure is due to the sparsity of (any amount of) data in high-dimensional spaces for histogram-based MI estimation. Feature selection through any other joint criterion, such as the actual classification error, also leads to a combinatorial explosion. In fact, for this very reason, finding a transform to lower dimensions might be easier than selecting features, given an appropriate objective function.

To this end, Bollacker and Ghosh also present an interesting heuristic method in which a matrix is filled with approximations of the joint mutual information between the labels and each pair of variables [2]. This matrix bears resemblance to a covariance matrix, and its eigenvectors are claimed to provide directions of high MI.

Principe, Fisher, and Xu have shown that using an unconventional definition of entropy, Renyi's entropy instead of Shannon's, can lead to expressions of mutual information with significant computational savings [6, 10, 11]. We explored this concept in a pattern recognition context in [12]. In this paper we concentrate on data visualization. Principe et al. explored the mutual information between two continuous variables. We first describe the basis of their work, and then formulate it as mutual information between continuous variables and discrete class labels. We use the criterion to learn linear dimension-reducing feature transforms with discriminative ability, and we demonstrate the results with some well-known data sets in visualization applications.

MAXIMIZING MUTUAL INFORMATION

Given a set of training data $\{x_i, c_i\}$ as samples of a continuous-valued random variable $X$, $x_i \in R^D$, and class labels as samples of a discrete-valued random variable $C$, $c_i \in \{1, 2, ..., N_c\}$, $i \in [1, N]$, the objective is to find a transform $y_i = g(w, x_i)$ (or its parameters $w$) that maximizes $I(C, Y)$, the mutual information (MMI) between the transformed data $Y$ and the class labels $C$. To this end we need to express $I$ as a function of the data set, $I(\{y_i, c_i\})$, in a differentiable form. Once that is done, we can perform gradient ascent on $I$ as follows:

$$ w_{t+1} = w_t + \eta \frac{\partial I}{\partial w} = w_t + \eta \sum_{i=1}^{N} \frac{\partial I}{\partial y_i} \frac{\partial y_i}{\partial w} \qquad (1) $$

Naturally, any other optimization technique, such as conjugate gradients or the Levenberg-Marquardt method, can be applied. The latter factor inside the sum in (1), $\partial y_i / \partial w$, is determined by the chosen transform. This paper points out that there is a computationally feasible way of computing the former, $\partial I / \partial y_i$, and the actual objective function $I(\{y_i, c_i\})$, according to the formulation in [10].
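The update (1) amounts to a generic training loop around two ingredients: the gradient of the transform and the gradient of the MI estimate with respect to the transformed samples. A minimal sketch follows (assuming NumPy; `transform`, `transform_grad`, and `mi_grad_wrt_y` are hypothetical placeholders for the chosen transform $g$, its Jacobian $\partial y_i / \partial w$, and $\partial I / \partial y_i$ as derived in the following sections):

```python
import numpy as np

def mmi_gradient_ascent(w, X, c, transform, transform_grad, mi_grad_wrt_y,
                        eta=0.1, n_iter=100):
    """Gradient ascent on I({y_i, c_i}) following (1).

    transform(w, X)      -> Y, the transformed samples (N x d)
    transform_grad(w, X) -> dy_i/dw for each sample, shape (N, d, *w.shape)
    mi_grad_wrt_y(Y, c)  -> dI/dy_i for each sample (N x d)
    """
    for _ in range(n_iter):
        Y = transform(w, X)                  # y_i = g(w, x_i)
        dI_dY = mi_grad_wrt_y(Y, c)          # "information forces", see eq. (8)
        dY_dw = transform_grad(w, X)
        # Chain rule: dI/dw = sum_i (dI/dy_i) (dy_i/dw)
        dI_dw = np.einsum('nd,nd...->...', dI_dY, dY_dw)
        w = w + eta * dI_dw                  # ascend the MI criterion
    return w
```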

QUADRATIC MUTUAL INFORMATION

One cornerstone of the work of Principe et al. was to base mutual information measures on Renyi's quadratic entropy, which for a continuous variable $Y$ is defined as

$$ H_R(Y) = -\log \int_y p(y)^2 \, dy. \qquad (2) $$

It turns out that Renyi's measure, combined with the Parzen density estimation method using Gaussian kernels, provides significant computational savings. If $p(y)$ is represented as a sum of Gaussians, the evaluation of the integral is simple, since the convolution of two Gaussians is a Gaussian. The integral reduces to pairwise interactions through the kernel (see [10, 12] for details).

Principe et al. derived quadratic distance measures for probability density functions somewhat heuristically. First, they consider some known inequalities for the $L_2$ distance measure between vectors in $R^D$, and then write analogous expressions for the divergence between two densities. The difference-of-vectors inequality

$$ (x - y)^T (x - y) \ge 0 \;\Leftrightarrow\; \|x\|^2 + \|y\|^2 - 2 x^T y \ge 0 \qquad (3) $$

gives the expression

$$ K_T(f, g) = \int f(x)^2 \, dx + \int g(x)^2 \, dx - 2 \int f(x) g(x) \, dx. \qquad (4) $$

It is easy to see that the measure is always positive, and when $f(x) = g(x)$ it evaluates to zero. This justifies its use in minimizing the mutual information. Moreover, it is also suitable for maximizing the mutual information (Principe, personal communication, 1999), although a rigorous proof for this does not exist. Since the mutual information between two variables $y_1$ and $y_2$ is expressed as the divergence between the joint density and the product of the marginals, we can insert them into the quadratic divergence expression to get

$$ I_T(Y_1, Y_2) = \iint p(y_1, y_2)^2 \, dy_1 dy_2 + \iint p(y_1)^2 p(y_2)^2 \, dy_1 dy_2 - 2 \iint p(y_1, y_2) \, p(y_1) \, p(y_2) \, dy_1 dy_2. \qquad (5) $$
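The computational savings come from the fact that every integral above, once $f$ and $g$ are Parzen estimates with Gaussian kernels, collapses into a double sum of kernel evaluations between pairs of samples. A minimal sketch of this building block (assuming NumPy and spherical kernels of equal width $\sigma$; the function names are illustrative, not from the paper):

```python
import numpy as np

def gauss(d, var):
    """Spherical Gaussian G(d, var*I) evaluated at difference vectors d (M x dim)."""
    dim = d.shape[1]
    norm = (2.0 * np.pi * var) ** (-dim / 2.0)
    return norm * np.exp(-np.sum(d * d, axis=1) / (2.0 * var))

def cross_information_potential(A, B, sigma):
    """Approximate integral of f(x) g(x) dx, where f and g are Parzen estimates
    built from sample sets A (Na x dim) and B (Nb x dim) with kernel width sigma.
    Convolving two kernels of variance sigma^2 yields one of variance 2*sigma^2."""
    diffs = A[:, None, :] - B[None, :, :]          # all pairwise differences
    return gauss(diffs.reshape(-1, A.shape[1]), 2.0 * sigma**2).mean()

def quadratic_divergence(A, B, sigma):
    """K_T(f, g) from (4) for two Parzen-estimated densities."""
    return (cross_information_potential(A, A, sigma)
            + cross_information_potential(B, B, sigma)
            - 2.0 * cross_information_potential(A, B, sigma))
```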

QUADRATIC MUTUAL INFORMATION BETWEEN DISCRETE AND CONTINUOUS VARIABLES

Assume now that we have continuous data $X$ in $R^D$, and for each sample $x_i$ we have a corresponding class label $c_i \in \{1, 2, ..., N_c\}$. We now derive expressions for the quadratic mutual information between the transformed data $y = g(w, x)$ and the corresponding class labels $c$. The purpose is now to find such a transform, or such parameters $w$ for the transform, that result in maximum MI between $Y$ and $C$. At this stage we do not yet need to make any assumptions about the transform $g$. With continuous $Y$ and discrete $C$, the mutual information is

$$ I_T(C, Y) = V_{(cy)^2} + V_{c^2 y^2} - 2 V_{cy} \qquad (6) $$

where

$$ V_{(cy)^2} \equiv \sum_c \int_y p(c, y)^2 \, dy, \qquad V_{c^2 y^2} \equiv \sum_c \int_y p(c)^2 p(y)^2 \, dy, \qquad V_{cy} \equiv \sum_c \int_y p(c, y) \, p(c) \, p(y) \, dy. \qquad (7) $$

The gradient of $I$ per sample, $\partial I / \partial y_i$, that is needed in (1) will be

$$ \frac{\partial I_T}{\partial y_i} = \frac{\partial V_{(cy)^2}}{\partial y_i} + \frac{\partial V_{c^2 y^2}}{\partial y_i} - 2 \frac{\partial V_{cy}}{\partial y_i}. \qquad (8) $$

INFORMATION POTENTIALS

Now we develop expressions for the Parzen density estimates of $p(y)$ and $p(c, y)$, and insert those into (6) and (7). Assume that we have $J_p$ samples of each class $c_p$. We now make use of a dual notation for the data $y$ in the output space. A sample is written with a single subscript $y_i$ when its class is irrelevant; index $1 \le i \le N$. If the class is relevant we write $y_{pj}$, where the class index is $1 \le p \le N_c$ and the index within the class is $1 \le j \le J_p$. The density of each class $c_p$ as a Parzen estimate using a symmetric kernel of width $\sigma$ is written as

$$ p(y \mid c_p) = \frac{1}{J_p} \sum_{j=1}^{J_p} G(y - y_{pj}, \sigma^2 I) \qquad (9) $$

where $G(y, \Sigma)$ denotes a Gaussian at $y$ with covariance $\Sigma$. The joint density can be written as $p(c_p, y) = P(c_p) \, p(y \mid c_p) = (J_p / N) \, p(y \mid c_p)$ for each $p = 1, ..., N_c$. The density of all data is $p(y) = \sum_c p(c, y)$, thus we have

$$ p(y) = \frac{1}{N} \sum_{i=1}^{N} G(y - y_i, \sigma^2 I). \qquad (10) $$

Now we write the quantities in (7) using a set of samples in the transformed space $\{y_i\}$. Making use of the fact that convolutions of Gaussians result in a Gaussian, we get

$$ V_{(cy)^2}(\{c_i, y_i\}) = \frac{1}{N^2} \sum_{p=1}^{N_c} \sum_{k=1}^{J_p} \sum_{l=1}^{J_p} G(y_{pk} - y_{pl}, 2\sigma^2 I) \qquad (11) $$

$$ V_{c^2 y^2}(\{c_i, y_i\}) = \frac{1}{N^2} \left( \sum_{p=1}^{N_c} \left( \frac{J_p}{N} \right)^2 \right) \sum_{k=1}^{N} \sum_{l=1}^{N} G(y_k - y_l, 2\sigma^2 I) \qquad (12) $$

$$ V_{cy}(\{c_i, y_i\}) = \frac{1}{N^2} \sum_{p=1}^{N_c} \frac{J_p}{N} \sum_{j=1}^{J_p} \sum_{k=1}^{N} G(y_{pj} - y_k, 2\sigma^2 I). \qquad (13) $$
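A minimal NumPy sketch of (6) and (11)-(13) follows (not the paper's implementation; it reuses the `gauss` helper sketched earlier and assumes integer class labels):

```python
import numpy as np

def pairwise_potential(A, B, sigma):
    """Sum over all pairs of G(a_k - b_l, 2*sigma^2*I)."""
    diffs = (A[:, None, :] - B[None, :, :]).reshape(-1, A.shape[1])
    return gauss(diffs, 2.0 * sigma**2).sum()

def information_potentials(Y, c, sigma):
    """V_(cy)^2, V_c2y2, V_cy of (11)-(13) and the quadratic MI I_T(C,Y) of (6)."""
    N = len(Y)
    classes, counts = np.unique(c, return_counts=True)
    priors = counts / N                               # J_p / N
    V_cy2 = sum(pairwise_potential(Y[c == k], Y[c == k], sigma)
                for k in classes) / N**2              # within-class pairs, (11)
    V_c2y2 = (priors**2).sum() * pairwise_potential(Y, Y, sigma) / N**2   # (12)
    V_cy = sum(p * pairwise_potential(Y[c == k], Y, sigma)
               for k, p in zip(classes, priors)) / N**2                   # (13)
    I_T = V_cy2 + V_c2y2 - 2.0 * V_cy                 # quadratic MI, (6)
    return I_T, (V_cy2, V_c2y2, V_cy)
```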

Principe et al. call these kinds of quantities "information potentials", in analogy to physical particles. The fact that class information is now taken into account gives them an interesting interpretation as pairwise interactions between samples, or "information particles":

• $V_{(cy)^2}$ can be seen as interactions between pairs of samples inside each class, summed over all classes.

• $V_{c^2 y^2}$ consists of interactions between all pairs of samples, regardless of class, weighted by the sum of squared class priors.

• $V_{cy}$ consists of interactions between the samples of a particular class and all samples, weighted by the class prior and summed over all classes.

INFORMATION FORCES

Derivatives of these potentials with respect to the samples represent "information forces": the directions and magnitudes in which the "particles" would like to move, or rather have the transform move them in the output space, in order to maximize the objective function. The chain rule can then simply be applied to change the parameters $w$ of the transform $g$ to this effect.

$$ \frac{\partial}{\partial y_{ci}} V_{(cy)^2} = \frac{1}{N^2 \sigma^2} \sum_{k=1}^{J_c} G(y_{ck} - y_{ci}, 2\sigma^2 I)(y_{ck} - y_{ci}) \qquad (14) $$

This represents the sum of the forces that the other "particles" in class $c$ exert on particle $y_{ci}$ (the direction is towards $y_{ci}$). For the derivative of $V_{c^2 y^2}$ we get

$$ \frac{\partial}{\partial y_{ci}} V_{c^2 y^2} = \frac{1}{N^2 \sigma^2} \left( \sum_{p=1}^{N_c} \left( \frac{J_p}{N} \right)^2 \right) \sum_{k=1}^{N} G(y_k - y_i, 2\sigma^2 I)(y_k - y_i) \qquad (15) $$

This represents the sum of the forces that the other "particles", regardless of class, exert on particle $y_{ci}$. Note that this particle is also denoted $y_i$ when the class $c$ is irrelevant in the expression. The direction is again towards $y_i$. The derivative of $V_{cy}$ is

$$ \frac{\partial}{\partial y_{ci}} V_{cy} = \frac{1}{N^2 \sigma^2} \sum_{p=1}^{N_c} \frac{J_p + J_c}{2N} \sum_{j=1}^{J_p} G(y_{pj} - y_{ci}, 2\sigma^2 I)(y_{pj} - y_{ci}) \qquad (16) $$

This component appears with a negative sign in (8). Its effect is thus away from $y_{ci}$, and it represents the repulsion of the classes away from each other.
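The per-sample gradient (8) can then be assembled from (14)-(16). A direct O(N²) sketch follows (again reusing the `gauss` helper and integer labels; not the author's code):

```python
import numpy as np

def information_forces(Y, c, sigma):
    """dI_T/dy_i for every sample, following (8) and (14)-(16)."""
    N, dim = Y.shape
    classes, counts = np.unique(c, return_counts=True)
    priors = counts / N
    sum_sq_priors = (priors**2).sum()
    prior_of = {k: p for k, p in zip(classes, priors)}

    diffs = Y[None, :, :] - Y[:, None, :]               # diffs[i, k] = y_k - y_i
    kern = gauss(diffs.reshape(-1, dim), 2.0 * sigma**2).reshape(N, N)
    weighted = kern[:, :, None] * diffs                 # G(y_k - y_i)(y_k - y_i)

    forces = np.zeros_like(Y)
    for i in range(N):
        same = (c == c[i])
        dV_cy2 = weighted[i, same].sum(axis=0) / (N**2 * sigma**2)              # (14)
        dV_c2y2 = sum_sq_priors * weighted[i].sum(axis=0) / (N**2 * sigma**2)   # (15)
        dV_cy = sum((prior_of[k] + prior_of[c[i]]) / 2.0
                    * weighted[i, c == k].sum(axis=0)
                    for k in classes) / (N**2 * sigma**2)                       # (16)
        forces[i] = dV_cy2 + dV_c2y2 - 2.0 * dV_cy                              # (8)
    return forces
```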

LINEAR FEATURE TRANSFORMATIONS

To find a linear dimension-reducing transform, we will try to find a subspace $R^d$, $d < D$, such that the mutual information between the class labels and the data projected onto this subspace is maximized. Thus

$$ W = \arg\max_W I(\{c_i, y_i\}), \qquad y_i = W^T x_i \qquad (17) $$

subject to the constraint $W^T W = I$. The columns of the $D \times d$ matrix $W$ thus span $R^d$. In this case $\partial y / \partial W = x^T$. Inserting this into (1) together with (8) results in a simple gradient ascent algorithm. To be able to use optimization algorithms that work best in unconstrained problems, the linear transform can also be parametrized in terms of rotations [12].
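Putting the pieces together for the linear case gives a short training loop, sketched below under the setup above (not the author's implementation; the QR re-orthonormalization that enforces $W^T W = I$ and the default random initialization are illustrative substitutes for the rotation parametrization of [12], the default kernel width uses the heuristic discussed in the experiments section, and `information_forces` is the helper sketched earlier):

```python
import numpy as np

def learn_mmi_projection(X, c, d, sigma=None, eta=0.5, n_iter=50, W0=None):
    """Gradient ascent on I(C, Y) for a linear projection y = W^T x, eqs. (1), (8), (17)."""
    N, D = X.shape
    W = W0 if W0 is not None else np.linalg.qr(np.random.randn(D, d))[0]
    for _ in range(n_iter):
        Y = X @ W                                   # y_i = W^T x_i
        if sigma is None:
            # heuristic: about half the largest pairwise distance in the output space
            dists = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
            sigma = 0.5 * dists.max()
        F = information_forces(Y, c, sigma)         # dI/dy_i, eq. (8)
        dI_dW = X.T @ F                             # chain rule: sum_i x_i (dI/dy_i)^T
        W = W + eta * dI_dW
        W, _ = np.linalg.qr(W)                      # re-impose W^T W = I (illustrative)
    return W
```

In practice the LDA transform from the earlier sketch can be passed as `W0`, mirroring the initialization used in the experiments below.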


Figure 1: Three classes in three dimensions projected onto a two-dimensional subspace using both LDA (left) and MMI-based projection (right).

EXPERIMENTS

In this section we present some simple experiments with synthesized data and with some datasets that are available on the Internet.¹ In these examples we learn a projection from a high-dimensional feature space onto a plane for visualization purposes. For an actual pattern recognition application, we would of course find a projection onto a higher-dimensional space [12].

¹ More examples and video clips illustrating convergence are available at http://members.home.net/torkkola/mmi.html

The first example is synthetic with non-Gaussian class densities, and it is designed to have a local optimum in addition to a global one. The purpose of this example is to illustrate how LDA, which assumes Gaussian classes, fails to find the global optimum. The dataset is three-dimensional and has three classes. Class one has 400 samples and a bimodal Gaussian distribution (blue circles). Class two has 200 samples (red asterisks), also drawn from a bimodal Gaussian distribution. These four Gaussians are arranged in an XOR-like configuration (imagine a plus sign, and attach the class-one Gaussians at the ends of the vertical bar and those of class two at the ends of the horizontal bar). The optimal projection would of course face this configuration. As a distractor, we added a third class, 200 samples from a single Gaussian, at a random location in front of the configuration (black diamonds).

Since we have three classes, LDA is able to produce a two-dimensional projection, which is depicted on the left side of Figure 1. This projection views the data along the class-two axis, and it corresponds to a local optimum. A classifier based on this projection would have difficulties separating class two (red asterisks) from the other classes. LDA was used as the initial state for learning the MMI projection. The result is presented on the right side of Figure 1. In about five iterations of Levenberg-Marquardt optimization, the method converged to the global optimum, which exhibits much better separation for the second class (red asterisks). The actual mutual information computed using (6) increased from $1.02 \times 10^{-4}$ to $1.28 \times 10^{-4}$.
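A dataset of this flavor can be synthesized roughly as follows (a sketch only; the paper does not give the exact means, covariances, or the location of the third class, so the numbers below are illustrative guesses):

```python
import numpy as np

def make_xor_dataset(rng=np.random.default_rng(0)):
    """Sketch of a dataset like the one described above; means, covariances, and
    the position of the distractor class are illustrative guesses."""
    s = 0.3 * np.eye(3)
    # class one: bimodal, at the ends of the "vertical bar"
    c1 = np.vstack([rng.multivariate_normal([0,  2, 0], s, 200),
                    rng.multivariate_normal([0, -2, 0], s, 200)])
    # class two: bimodal, at the ends of the "horizontal bar"
    c2 = np.vstack([rng.multivariate_normal([ 2, 0, 0], s, 100),
                    rng.multivariate_normal([-2, 0, 0], s, 100)])
    # class three: a single Gaussian distractor in front of the configuration
    c3 = rng.multivariate_normal([0, 0, 3], s, 200)
    X = np.vstack([c1, c2, c3])
    c = np.concatenate([np.zeros(400, int), np.ones(200, int), np.full(200, 2)])
    return X, c
```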


Figure 2: Three classes in three dimensions projected onto a two-dimensional subspace using the MMI-based projection. Final information forces with a wide kernel on the left, and with a narrow kernel on the right.

In this example, as well as in general, it is useful to have a kernel width that approximately gives each sample an influence over every other sample. A simple rule that seems to work well is to take the distance between the two farthest points in the output space and use a kernel width σ of about half of that. Figure 2 depicts the information forces in the final state, that is, the directions in which each sample would move in the output space were it free to do so. These forces are propagated back to the transform, and the parameters of the transform are changed so as to move the samples in these desired directions in the output space. Figure 2 also illustrates how the forces become more local as the kernel width is narrowed. If the transform has enough degrees of freedom, this suggests a procedure whereby one begins with a wide kernel and narrows it down at the end to fine-tune the parameters of the transform.

The second example illustrates a projection from an 8-dimensional space, again onto two dimensions. This is the Pima Indians diabetes database from the UCI Machine Learning Repository.² The data has two classes, and as can be seen from the projection in Fig. 3, two dimensions are not quite enough for this dataset. The forces are also illustrated in the same figure.

² http://www.ics.uci.edu/~mlearn/MLRepository.html

Our third example illustrates a projection from a 36-dimensional feature space onto two dimensions. This is the Landsat satellite image database from the UCI Machine Learning Repository. The data has six classes, and we used 1500 of the 4500 training samples. The LDA and MMI projections are depicted in Figure 4. LDA separates two of the classes very well but places the other four almost on top of each other. The criterion of LDA is a combination of representing each class as compactly as possible and separating the classes from each other as much as possible. This has been achieved: all classes are represented as quite compact clusters; unfortunately, four of them are on top of each other. Two of the classes, and a third cluster comprising the four remaining classes, are well separated. MMI has produced a projection that attempts to separate all of the classes. We can see that the four classes that LDA was not able to separate lie on a single continuum and blend into each other, and MMI has found a projection orthogonal to that continuum while still keeping the other two classes well separated.
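The kernel-width rule and the coarse-to-fine procedure described above can be sketched on top of the `learn_mmi_projection` routine sketched earlier (the halving schedule and the number of stages are illustrative choices, not taken from the paper):

```python
import numpy as np

def kernel_width(Y):
    """About half the distance between the two farthest points in the output space."""
    dists = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    return 0.5 * dists.max()

def anneal_mmi_projection(X, c, d, W0, stages=3, n_iter=20):
    """Start with a wide kernel so every sample influences every other sample,
    then narrow it to make the forces more local and fine-tune the transform."""
    W = W0
    sigma = kernel_width(X @ W)
    for _ in range(stages):
        W = learn_mmi_projection(X, c, d, sigma=sigma, n_iter=n_iter, W0=W)
        sigma *= 0.5                      # narrow the kernel (illustrative schedule)
    return W
```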


Figure 3: The Pima Indians diabetes dataset. The MMI projection (left) with the information forces (right).

CONCLUSION

Exploratory data analysis by human eyes requires mappings from high-dimensional feature spaces into lower-dimensional spaces. We have applied the maximum mutual information criterion to learn discriminative linear transforms for this purpose. We used measures based on Renyi's entropy, following [10], to avoid the computational complexities arising from Shannon's definition. Examples of class structure visualization of high-dimensional data were presented that highlight the advantages of the method compared to other linear transforms. Currently the computational complexity of the method is O(N²), where N is the number of samples. This might be a limiting factor for huge data sets, but it is foreseeable that clustering- or batch-based adaptation could work well, too. The real promise of the method, however, lies in non-linear transforms. The MMI criterion is readily applicable to parametrizable, differentiable non-linear transforms, unlike LDA and other comparable discriminative criteria.

Figure 4: The Landsat image data set. LDA projection (left). MMI projection (right).

REFERENCES

[1] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," Neural Networks, vol. 5, no. 4, pp. 537–550, July 1994.

[2] K. Bollacker and J. Ghosh, "Linear feature extractors based on mutual information," in Proceedings of the 13th International Conference on Pattern Recognition (ICPR96), August 25-29, 1996, pp. 720–724.

[3] B. Bonnlander and A. S. Weigend, "Selecting input variables using mutual information and nonparametric density estimation," in Proceedings of the 1994 International Symposium on Artificial Neural Networks, Tainan, Taiwan, 1994, pp. 42–50.

[4] I. Dhillon, D. Modha and W. Spangler, "Visualizing class structure of multidimensional data," in Proceedings of the 30th Symposium on the Interface, Computing Science, and Statistics, Minneapolis, MN, USA: Interface Foundation of North America, May 1998, pp. 488–493.

[5] R. Fano, Transmission of Information: A Statistical Theory of Communications, New York: Wiley, 1961.

[6] J. Fisher III and J. Principe, "A methodology for information theoretic feature extraction," in Proc. of IEEE World Congress on Computational Intelligence, Anchorage, Alaska, May 4-9, 1998, pp. 1712–1716.

[7] K. Fukunaga, Introduction to Statistical Pattern Recognition (2nd edition), New York: Academic Press, 1990.

[8] M. Girolami, A. Cichocki and S.-I. Amari, "A common neural network model for unsupervised exploratory data analysis and independent component analysis," IEEE Transactions on Neural Networks, vol. 9, no. 6, pp. 1495–1501, November 1998.

[9] T. Okada and S. Tomita, "An optimal orthonormal system for discriminant analysis," Pattern Recognition, vol. 18, no. 2, pp. 139–144, 1985.

[10] J. Principe, J. Fisher III and D. Xu, "Information theoretic learning," in S. Haykin (ed.), Unsupervised Adaptive Filtering, New York, NY: Wiley, 2000.

[11] J. Principe, D. Xu and J. Fisher III, "Pose estimation in SAR using an information-theoretic criterion," in Proc. SPIE98, 1998.

[12] K. Torkkola and W. Campbell, "Mutual information in learning feature transformations," in Proceedings of the International Conference on Machine Learning, Stanford, CA, USA, June 29 - July 2, 2000.

[13] H. Yang and J. Moody, "Data visualization and feature selection: New algorithms for nongaussian data," in Proceedings of NIPS'99, Denver, CO, USA, November 29 - December 2, 1999.
