ON FEATURE EXTRACTION BY MUTUAL INFORMATION MAXIMIZATION Kari Torkkola Motorola Labs, 7700 South River Parkway, MD ML28, Tempe AZ 85284, USA
[email protected] http://members.home.net/torkkola ABSTRACT In order to learn discriminative feature transforms, we discuss mutual information between class labels and transformed features as a criterion. Instead of Shannon’s definition we use measures based on Renyi entropy, which lends itself into an efficient implementation and an interpretation of “information potentials” and “information forces” induced by samples of data. This paper presents two routes towards practical usability of the method, especially aimed to large databases: The first is an on-line stochastic gradient algorithm, and the second is based on approximating class densities in the output space by Gaussian mixture models.
1. INTRODUCTION

Optimal feature selection coupled with a classifier leads to a combinatorial problem, since it would be desirable to evaluate all combinations of available features. Greedy algorithms based on sequential feature selection using any filter criterion are suboptimal, as they fail to find a feature set that would jointly maximize the criterion [1]. For this very reason, finding a transformation to lower dimensions might be easier than selecting features, given an appropriate differentiable objective function. We discuss mutual information between class labels and transformed features as such a criterion. Instead of Shannon's definition we use measures based on Renyi entropy, which lends itself to an efficient implementation and to an interpretation of "information potentials" and "information forces" induced by samples of data [2, 3, 4, 5].

This paper is structured as follows. An introduction is given to the maximum mutual information (MMI) formulation for discriminative feature transforms using Renyi entropy and Parzen density estimation. We present two routes toward the practical usability of the method: an on-line stochastic gradient algorithm, and a formulation based on Gaussian mixture models (GMMs). Finally, a slight modification of the latter formulation is presented that allows an interplay with hidden Markov models.

2. MUTUAL INFORMATION BETWEEN TRANSFORMED DATA AND CLASS LABELS

Given a set of training data {x_i, c_i} as samples of a continuous-valued random variable X, x_i ∈ R^D, and class labels as samples of a discrete-valued random variable C, c_i ∈ {1, 2, ..., N_c}, i ∈ [1, N], the objective is to find a transformation (or its parameters w) to y_i ∈ R^d, d < D, with y_i = g(w, x_i), that maximizes I(C, Y), the mutual information (MI) between the transformed data Y and the class labels C. The procedure is depicted in Fig. 1. To this end we need to express I as a function of the data set, I({y_i, c_i}), in a differentiable form. Once that is done, we can perform gradient ascent on I as follows:

w_{t+1} = w_t + \eta \frac{\partial I}{\partial w} = w_t + \eta \sum_{i=1}^{N} \frac{\partial I}{\partial y_i} \frac{\partial y_i}{\partial w}.    (1)

To derive an expression for MI using a non-parametric density estimation method, we apply Renyi's quadratic entropy instead of Shannon's entropy, as described in [3, 4], because of its computational advantages. Estimating the density p(y) of Y as a sum of spherical Gaussians, each centered at a sample y_i, Renyi's quadratic entropy of Y is defined as

H_R(Y) = -\log \int_y p(y)^2 \, dy
       = -\log \frac{1}{N^2} \int_y \sum_{k=1}^{N} \sum_{j=1}^{N} G(y - y_k, \sigma^2 I)\, G(y - y_j, \sigma^2 I)\, dy
       = -\log \frac{1}{N^2} \sum_{k=1}^{N} \sum_{j=1}^{N} G(y_k - y_j, 2\sigma^2 I).    (2)
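The double sum in (2) can be evaluated directly from the pairwise sample differences. The following numpy sketch of this estimator is illustrative only, not code from the paper; it assumes spherical kernels of width sigma.

```python
# A minimal sketch of Eq. (2): Renyi's quadratic entropy from pairwise interactions.
import numpy as np

def gaussian(v, var):
    """Spherical Gaussian G(v, var*I) evaluated at difference vector(s) v."""
    d = v.shape[-1]
    return (2.0 * np.pi * var) ** (-d / 2.0) * np.exp(-0.5 * np.sum(v * v, axis=-1) / var)

def renyi_quadratic_entropy(Y, sigma):
    """H_R(Y) = -log( (1/N^2) sum_k sum_j G(y_k - y_j, 2*sigma^2*I) )."""
    N = Y.shape[0]
    diffs = Y[:, None, :] - Y[None, :, :]               # all pairwise differences
    interactions = gaussian(diffs, 2.0 * sigma ** 2)    # N x N "information potential" terms
    return -np.log(interactions.sum() / N ** 2)

# Example: entropy estimate for 200 two-dimensional samples with kernel width 0.5.
Y = np.random.randn(200, 2)
print(renyi_quadratic_entropy(Y, sigma=0.5))
```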
Above, use is made of the fact that the convolution of two Gaussians is a Gaussian. Thus Renyi's quadratic entropy can be computed as a sum of local interactions, as defined by the kernel, over all pairs of samples. In order to use this convenient property, a measure of mutual information needs to be derived making use of quadratic functions of the densities. Between a discrete variable C and a continuous variable Y such a measure has been derived in [3, 4] as follows:

I_T(C, Y) = V_{IN} + V_{ALL} - 2 V_{BTW}
          = \sum_c \int_y p(c, y)^2 \, dy + \sum_c p(c)^2 \int_y p(y)^2 \, dy - 2 \sum_c \int_y p(c, y)\, p(c)\, p(y)\, dy.    (3)
We use J_p for the number of samples in class p, y_k for the kth sample regardless of its class, and y_{pj} for the same sample when emphasizing that it belongs to class p, with index j within the class. Furthermore, we need the following equalities, where P_p denotes the class prior:

p(c_p, y) = P_p \, p(y|c_p), \qquad p(y) = \sum_{p=1}^{N_c} p(c_p, y).    (4)
Fig. 1. Learning feature transforms by maximizing the mutual information between class labels and transformed features. (Block diagram: high-dimensional x is mapped by g(w, x) to y; I_T(c, y) is evaluated between y and the labels c, and the gradient ∂I_T/∂w drives the update of the transform.)
Now, expressing the densities as their Parzen estimates with kernel width σ results in

I_T(\{y_i, c_i\}) = \frac{1}{N^2} \sum_{p=1}^{N_c} \sum_{k=1}^{J_p} \sum_{l=1}^{J_p} G(y_{pk} - y_{pl}, 2\sigma^2 I)
 + \frac{1}{N^2} \sum_{p=1}^{N_c} \Big(\frac{J_p}{N}\Big)^2 \sum_{k=1}^{N} \sum_{l=1}^{N} G(y_k - y_l, 2\sigma^2 I)
 - \frac{2}{N^2} \sum_{p=1}^{N_c} \frac{J_p}{N} \sum_{j=1}^{J_p} \sum_{k=1}^{N} G(y_{pj} - y_k, 2\sigma^2 I).    (5)
The mutual information I_T({y_i, c_i}) can now be interpreted as an information potential induced by the samples of data in different classes. It is then straightforward to derive the partial derivative ∂I_T/∂y_i, which can accordingly be interpreted as an information force that the other samples exert on sample y_i. The three components of the sum give rise to the following three components of the information force: 1) samples within the same class attract each other, 2) all samples regardless of class attract each other, and 3) samples of different classes repel each other. This force, coupled with the latter factor inside the sum in (1), ∂y_i/∂w, tends to change the transform in such a way that the samples in the transformed space move in the direction of the force, and thus increase the MI criterion I_T({y_i, c_i}). See [4] for details.

Each term in (5) consists of a double sum of Gaussians evaluated at the pairwise distances between the samples. The first component consists of a sum of these interactions within each class, the second of all interactions regardless of class, and the third of a sum of the interactions of each class against all other samples. The bulk of the computation consists of evaluating these N^2 Gaussians and forming their sums. The information forces make use of the same Gaussians, in addition to the pairwise differences of the samples. For large N, the O(N^2) complexity is a problem. In order to overcome it, we present two ways to reduce the computation and make the method applicable to large databases.
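As a concrete illustration of (5), the following numpy sketch (not code from the paper; spherical Parzen kernels and integer class labels assumed) evaluates V_IN, V_ALL, and V_BTW from the N x N matrix of pairwise interactions. The information force on a sample is then the derivative of this quantity with respect to that sample.

```python
# A minimal sketch of Eq. (5): I_T = V_IN + V_ALL - 2*V_BTW from pairwise interactions.
import numpy as np

def gaussian(v, var):
    """Spherical Gaussian G(v, var*I) evaluated at difference vector(s) v."""
    d = v.shape[-1]
    return (2.0 * np.pi * var) ** (-d / 2.0) * np.exp(-0.5 * np.sum(v * v, axis=-1) / var)

def quadratic_mi(Y, labels, sigma):
    """Quadratic mutual information I_T({y_i, c_i}) of Eq. (5)."""
    N = Y.shape[0]
    G = gaussian(Y[:, None, :] - Y[None, :, :], 2.0 * sigma ** 2)   # N x N interactions
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / N                                             # J_p / N
    v_in = sum(G[np.ix_(labels == c, labels == c)].sum() for c in classes) / N ** 2
    v_all = (priors ** 2).sum() * G.sum() / N ** 2
    v_btw = sum(p * G[labels == c].sum() for c, p in zip(classes, priors)) / N ** 2
    return v_in + v_all - 2.0 * v_btw

# Example: 100 samples in two dimensions with binary labels.
Y = np.random.randn(100, 2)
labels = np.repeat([0, 1], 50)
print(quadratic_mi(Y, labels, sigma=0.5))
```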
3. LESS COMPUTATION I — STOCHASTIC GRADIENT

Rather than computing the full gradient as in the previous formulations, it is straightforward to devise a stochastic gradient algorithm. The principle is as follows (thanks to Jose Principe and Deniz Erdogmus for sharing their idea of a stochastic gradient algorithm for entropy maximization). Take a random sample of two transformed data points y_1 and y_2; these two points serve as a sample of the whole database. Using the Parzen estimate, compute the information potential and the information force between those two samples only, and make a small adjustment to W in the direction of the gradient. Two cases need to be handled separately: the samples may come from the same class, or they may come from different classes.

In the former case, based on our two-point sample of the database, we have to conclude that the prior for this class (say, class one) is P_1 = 1 and P_k = 0, k ≠ 1. The class conditional p(y|c_1) and the joint density p(y, c_1) are now equal to the total data density p(y):

p(y) = \frac{1}{2}\left( G(y - y_1, \sigma^2 I) + G(y - y_2, \sigma^2 I) \right).    (6)

Using (3) and (4) we have

V_{IN} = V_{ALL} = V_{BTW} = \frac{1}{2}\left( G(y_1 - y_2, 2\sigma^2 I) + G(0, 2\sigma^2 I) \right),    (7)

which leads to I_T = V_{IN} + V_{ALL} - 2 V_{BTW} = 0. The transform is thus not changed at all based on drawing two samples from the same class.

In the latter case, the priors for the classes (say, one and two) are P_1 = P_2 = 1/2. The class-conditional densities are p(y|c_1) = G(y - y_1, \sigma^2 I) and p(y|c_2) = G(y - y_2, \sigma^2 I), the joint densities are

p(y, c_1) = \frac{1}{2} G(y - y_1, \sigma^2 I), \qquad p(y, c_2) = \frac{1}{2} G(y - y_2, \sigma^2 I),    (8)

and the total data density p(y) is equal to that in the case of samples from the same class, Eq. (6). The components of I_T are now

V_{IN} = \frac{1}{2} G(0, 2\sigma^2 I), \qquad V_{ALL} = \frac{1}{4}\left( G(0, 2\sigma^2 I) + G(y_1 - y_2, 2\sigma^2 I) \right), \qquad V_{BTW} = V_{ALL},    (9)

and

I_T = \frac{1}{4}\left( G(0, 2\sigma^2 I) - G(y_1 - y_2, 2\sigma^2 I) \right).    (10)

To derive an update equation for W in the case of two samples drawn from different classes we need

\frac{\partial I_T}{\partial y_1} = -\frac{1}{8\sigma^2} G(y_1 - y_2, 2\sigma^2 I)\,(y_2 - y_1) = -\frac{\partial I_T}{\partial y_2}.    (11)

Since ∂y_i/∂W = x_i^T, the actual update equation (1) becomes

W_{t+1} = W_t + \eta \sum_{i=1,2} \frac{\partial I_T}{\partial y_i} \frac{\partial y_i}{\partial W} = W_t - \frac{\eta}{8\sigma^2} G(y_1 - y_2, 2\sigma^2 I)\,(y_2 - y_1)(x_1^T - x_2^T).    (12)
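A minimal numpy sketch of this two-sample update for a linear transform y = W x follows; it is illustrative only, not code from the paper, with eta and sigma denoting the step size and kernel width.

```python
# One stochastic update of W from a randomly drawn pair of samples, Eqs. (7) and (12).
import numpy as np

def gaussian(v, var):
    d = v.shape[-1]
    return (2.0 * np.pi * var) ** (-d / 2.0) * np.exp(-0.5 * np.sum(v * v, axis=-1) / var)

def stochastic_step(W, x1, c1, x2, c2, sigma, eta):
    """Two-sample update of the linear transform W."""
    if c1 == c2:
        return W          # same-class pairs leave the transform unchanged, Eq. (7)
    y1, y2 = W @ x1, W @ x2
    g = gaussian(y1 - y2, 2.0 * sigma ** 2)
    # W <- W - (eta / (8 sigma^2)) * G(y1 - y2, 2 sigma^2 I) * (y2 - y1)(x1 - x2)^T
    return W - eta / (8.0 * sigma ** 2) * g * np.outer(y2 - y1, x1 - x2)

# Example: learn a 2 x 10 transform from random pairs drawn from toy data.
rng = np.random.default_rng(0)
X, c = rng.standard_normal((500, 10)), rng.integers(0, 3, 500)
W = rng.standard_normal((2, 10))
for _ in range(1000):
    i, j = rng.integers(0, len(X), 2)
    W = stochastic_step(W, X[i], c[i], X[j], c[j], sigma=1.0, eta=0.05)
```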
The effect of two samples from different classes is thus to change the transform so as to push the samples farther away from each other. Note that the order in which the samples are drawn does not matter, since the signs of the two last terms cancel. It is actually surprising that the on-line algorithm ignores cases with samples from the same class. This means that only class separation is important, not class compactness. Nevertheless, the on-line algorithm converges to solutions very similar to those of the full-gradient algorithm (see http://members.home.net/torkkola/mmi for some illustrations). A further positive aspect of a stochastic gradient algorithm is that the stochasticity may help escape local maxima of the MI.

In simulations the on-line algorithm appeared to converge slowly in cases with a large number of classes. This was not a problem when the number of classes was small. A possible explanation is that there is only a small probability of drawing a pair of samples that truly makes a difference: it is the samples that lie at the borders of adjacent classes in the feature space that have the largest effect on the criterion.
This two-sample stochastic gradient and the full gradient involving all interactions between all samples are really two ends of a single spectrum. In practice, if the full gradient is out of reach for computational reasons, it is preferable to take as large a random sample of the whole data set as is feasible, say a thousand samples, and compute all the mutual interactions among those samples. The resulting gradient should be much closer to the true gradient than one computed from a number of pairwise interactions.
4. LESS COMPUTATION II — GMM MAPPINGS

We now present how the class densities can be approximated by Gaussian mixture models (GMMs) in order to avoid computing all mutual interactions between samples, replacing them by interactions between mixture components. A GMM for the training data can be constructed in the low-dimensional output space. Since getting there requires the transform, the GMM is constructed after transforming the data using, for example, a random or an informed guess as the transform. The density estimated from the samples in the output space for class p is

p(y|c_p) = \sum_{j=1}^{K_p} h_{pj} G(y - m_{pj}, S_{pj}).    (13)

Once the output-space GMM is constructed, the same samples are used to construct a GMM in the input space using exactly the same assignments of samples to mixture components as the output-space GMMs have. Running the EM algorithm in the input space is now unnecessary, since we know which samples belong to which mixture components. A similar strategy has been used to learn GMMs in high-dimensional spaces [6]. Writing the density of class p as a GMM with K_p mixture components and h_{pj} as their mixture weights, we have in the input space

p(x|c_p) = \sum_{j=1}^{K_p} h_{pj} G(x - \mu_{pj}, \Sigma_{pj}).    (14)

As a result, we have GMMs in both spaces and a transform mapping between the two. This case will be called the OIO-mapping. A great advantage of this strategy is that once the GMMs have been created, the actual training data need not be touched at all during optimization! A further advantage is the ability to avoid operating in the high-dimensional input space altogether: not even the one-time estimation of the GMMs at the beginning of the procedure needs to be carried out there.

The first step in the derivation of the adaptation equations is to express the MI as a function of the GMM constructed in the output space. This GMM is a function of the transform matrix W through the mapping of the input-space GMM to the output-space GMM. The second step is to compute the gradient ∂I_T/∂W and to make use of it in the first half of Equation (1). The GMM in the output space for each class is expressed in (13). Then we have

V_{IN} = \sum_c \int_y p(c, y)^2 \, dy = \sum_{p=1}^{N_c} P_p^2 \int_y \Big( \sum_{i=1}^{K_p} h_{pi} G(y - m_{pi}, S_{pi}) \Big)^2 dy = \sum_{p=1}^{N_c} P_p^2 \sum_{i=1}^{K_p} \sum_{j=1}^{K_p} h_{pi} h_{pj} G(m_{pi} - m_{pj}, S_{pi} + S_{pj}).    (15)
To compact the notation, we change the sample indexing from class-internal to global and make the substitutions m_{kl} = m_k - m_l, S_{kl} = S_k + S_l, G(k, l) = G(m_{kl}, S_{kl}), and V(k, l) = P_k P_l h_k h_l G(k, l), where k, l ∈ [1, ..., d_h] and d_h is the total number of mixture components, and V_{pq} = \sum_{k \in c_p} \sum_{l \in c_q} V(k, l). Now we can write V_IN, V_ALL, and V_BTW in a convenient form:

V_{IN} = \sum_{p=1}^{N_c} V_{pp}, \qquad
V_{ALL} = \Big( \sum_{r=1}^{N_c} P_r^2 \Big) \sum_{p=1}^{N_c} \sum_{q=1}^{N_c} V_{pq}, \qquad
V_{BTW} = \sum_{p=1}^{N_c} P_p \sum_{q=1}^{N_c} V_{pq}.    (16)

As each Gaussian mixture component is now a function of the corresponding input-space component and the transform matrix W, we are able to write the gradient ∂I_T/∂W. Since each of the three terms in I_T is composed of different sums of G(k, l), we need its gradient,

\frac{\partial}{\partial W} G(k, l) = \frac{\partial}{\partial W} G(W \mu_{kl}, W \Sigma_{kl} W^T),    (17)

where the input-space GMM parameters are \mu_{kl} = \mu_k - \mu_l and \Sigma_{kl} = \Sigma_k + \Sigma_l, with the equalities m_{kl} = W \mu_{kl} and S_{kl} = W \Sigma_{kl} W^T. G(k, l) expresses the convolution of two mixture components in the output space. As we also have those components in the high-dimensional input space, the gradient expresses how this convolution in the output space changes as W, which maps the components to the output space, is changed. The mutual information measure is defined in terms of these convolutions, and maximizing it tends to find a W that minimizes these convolutions between classes and maximizes them within classes. The desired gradient of the Gaussian with respect to the transform matrix is

\frac{\partial}{\partial W} G(k, l) = -G(k, l)\, S_{kl}^{-1} \left[ \left( I - m_{kl} m_{kl}^T S_{kl}^{-1} \right) W \Sigma_{kl} + m_{kl} \mu_{kl}^T \right].    (18)

The total gradient ∂I_T/∂W can now be obtained simply by replacing G(k, l) in (15) and (16) by the above gradient. In evaluating I_T, the bulk of the computation is in evaluating the V_{pq}, the component-wise convolutions. The computational complexity is now O(d_h^2). In addition, ∂I_T/∂W requires the pairwise sums and differences of the mixture parameters in the input space, but these need only be computed once. See http://members.home.net/torkkola/mmi for illustrations.
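The following numpy sketch (illustrative only, not code from the paper; full-covariance components and hypothetical variable names assumed) evaluates one component-wise convolution G(k, l) together with its gradient (18).

```python
# One component pair: output-space convolution G(k, l) and dG(k, l)/dW, Eq. (18).
import numpy as np

def gaussian_full(m, S):
    """Zero-mean Gaussian with covariance S evaluated at m, i.e. G(m, S)."""
    d = m.shape[0]
    _, logdet = np.linalg.slogdet(S)
    quad = m @ np.linalg.solve(S, m)
    return np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + quad))

def pair_convolution_and_grad(W, mu_k, Sigma_k, mu_l, Sigma_l):
    """G(k, l) of the compacted notation and its gradient with respect to W."""
    mu_kl = mu_k - mu_l                 # input-space difference of component means
    Sigma_kl = Sigma_k + Sigma_l        # input-space sum of component covariances
    m_kl = W @ mu_kl                    # m_kl = W mu_kl
    S_kl = W @ Sigma_kl @ W.T           # S_kl = W Sigma_kl W^T
    g = gaussian_full(m_kl, S_kl)
    S_inv = np.linalg.inv(S_kl)
    d = W.shape[0]
    bracket = (np.eye(d) - np.outer(m_kl, m_kl) @ S_inv) @ W @ Sigma_kl + np.outer(m_kl, mu_kl)
    grad = -g * (S_inv @ bracket)       # a d x D matrix, Eq. (18)
    return g, grad

# Example: two components in a 10-dimensional input space, mapped to 3 dimensions.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 10))
mu_k, mu_l = rng.standard_normal(10), rng.standard_normal(10)
Sigma_k = Sigma_l = np.eye(10)
g, grad = pair_convolution_and_grad(W, mu_k, Sigma_k, mu_l, Sigma_l)
```

Weighting g by P_p P_q h_k h_l and summing over the components of classes p and q gives V_pq as in (16); substituting grad for g in the same sums yields the total gradient ∂I_T/∂W.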
5. HIDDEN MARKOV MODELS AND GMM-MMI FEATURE TRANSFORMS

If the GMM is already available in the high-dimensional input space (14), those models can be directly mapped into the output space by the transform. Let us call this case the IO-mapping. The (linearly) transformed density in the low-dimensional output space is then simply

p(y|c_p) = \sum_{j=1}^{K_p} h_{pj} G(y - W \mu_{pj}, W \Sigma_{pj} W^T).    (19)
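A minimal numpy sketch of this IO-mapping (illustrative only, not code from the paper) pushes the parameters of an existing input-space GMM through a linear transform W as in (19); the mixture weights are unchanged by the mapping.

```python
# Map an input-space GMM (weights h, means mu, covariances Sigma) to the output space.
import numpy as np

def map_gmm_to_output_space(W, weights, means, covs):
    """m = W mu and S = W Sigma W^T for every mixture component, Eq. (19)."""
    out_means = np.array([W @ mu for mu in means])
    out_covs = np.array([W @ Sigma @ W.T for Sigma in covs])
    return weights, out_means, out_covs

# Example: map a 3-component GMM from 39 input dimensions down to 5.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 39))
weights = np.full(3, 1.0 / 3.0)
means = rng.standard_normal((3, 39))
covs = np.stack([np.eye(39)] * 3)
w_out, m_out, S_out = map_gmm_to_output_space(W, weights, means, covs)
```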
Now, the mutual information in the output space between class labels and the densities as transformed GMMs can be expressed as
a function of W, and it is possible to evaluate ∂I/∂W for (1). This is thus a very viable approach if the GMMs are already available in the high-dimensional input space, or if it is not too expensive, or impossible, to estimate them there using the EM algorithm. However, in general this might not be the case, and the OIO-mapping might be more appropriate.

A convenient extension to hidden Markov models (HMMs), as commonly used in speech recognition, now becomes possible. Given an HMM-based speech recognition system, state discrimination can be enhanced by learning a linear transform from some high-dimensional collection of features down to a convenient dimension. Existing HMMs can be converted to these high-dimensional features using so-called single-pass retraining (compute all probabilities using the current features, but do the re-estimation using the high-dimensional set of features). Now a state-discriminative transform to a lower dimension can be learned using the method presented in this paper. Another round of single-pass retraining then converts the existing HMMs to the new discriminative features. A further advantage of the method in speech recognition is that the state separation in the transformed output space is measured in terms of the separability of the data represented as Gaussian mixtures, not in terms of the data itself (actual samples). This should be advantageous for recognition accuracy, since the HMMs have exactly the same structure.

We attempted to test this hypothesis in noise-robust connected digit recognition with the AURORA2 task [7]. Baseline HMMs were trained using the HTK and AURORA scripts using MFCCs with delta and acceleration coefficients (12 cepstral coefficients plus energy, 39 coefficients altogether). The multicondition training setup was used. As the temporal span of these coefficients is 11 feature vectors, we transformed 11 concatenated cepstral vectors (dimension 143) down to the same 39 dimensions as used in the baseline. The states of the HMMs were used as the classes. While slightly decreasing the word error rates on training data, the approach increased the error rates on test data. We repeated the experiments with mel-scale spectral features, with the same results. Furthermore, both experiments were repeated using LDA as the transform, again with similar results.

There are two possible conclusions we can draw from this experiment:
1. The AURORA training database does not have a sufficient coverage of different noise conditions for data-driven approaches.
2. Linear feature transforms that enhance the state discrimination of word models do not improve noise robustness.

The first conclusion is supported by the setup of the task: the AURORA task adds different noise to the test data than is added to the training data. In fact, none of the papers so far published on the task that have improved over the MFCCs attempted to make use of only the AURORA training data to learn the noise-robust facets of the problem [8]. As for the second conclusion, previous work on LDA-based state-discriminative transforms shows that such transforms may produce very task-specific improvements and may not generalize well [9]. Also, some recent work on the AURORA task has made use of LDA to learn discriminative transforms (both in the spectral and the temporal domain), but by using phones or other phonetic subunits as the classes instead of states, and by using a lot of other speech material for training besides the AURORA training database [10]. This approach has produced significant improvements in the task.

6. DISCUSSION

In static pattern recognition tasks MMI transforms are clearly useful, as they appear to be able to extract more discriminatory information from the source features than, for example, LDA, as shown by previous work. The usefulness of MMI-based feature transforms in HMM-based speech recognition has not been established yet. One unanswered question with any discriminative transform is what one should discriminate. This work and others seem to point to phone discrimination being most useful. Thus future work with MMI-based transforms includes an application to phone-discriminative transforms in an HMM framework.

7. REFERENCES

[1] D. Koller and M. Sahami, "Toward optimal feature selection," in Proceedings of ICML-96, 13th International Conference on Machine Learning, Bari, Italy, 1996, pp. 284-292.
[2] J. W. Fisher III and J. C. Principe, "A methodology for information theoretic feature extraction," in Proc. of IEEE World Congress on Computational Intelligence, Anchorage, Alaska, May 4-9 1998, pp. 1712-1716.
[3] J. C. Principe, J. W. Fisher III, and D. Xu, "Information theoretic learning," in Unsupervised Adaptive Filtering, Simon Haykin, Ed. Wiley, New York, NY, 2000.
[4] Kari Torkkola and William Campbell, "Mutual information in learning feature transformations," in Proceedings of the 17th International Conference on Machine Learning, Stanford, CA, USA, June 29 - July 2 2000, pp. 1015-1022.
[5] Kari Torkkola, "Nonlinear feature transforms using maximum mutual information," in Proceedings of the IJCNN, Washington DC, USA, July 15-19 2001, pp. 2756-2761.
[6] S. Dasgupta, "Experiments with random projection," in Proc. 16th Conf. on Uncertainty in Artificial Intelligence, Stanford, CA, June 30 - July 3 2000, pp. 143-151.
[7] David Pearce, "Enabling new speech driven services for mobile devices: An overview of the ETSI standards activities for distributed speech recognition front-ends," in Applied Voice Input/Output Society Conference (AVIOS2000), San Jose, CA, May 2000.
[8] D. Pearce and B. Lindberg (organizers), "Special session on noise robust recognition," in Eurospeech 2001, Aalborg, Denmark, September 3-7 2001.
[9] Lutz Welling, Nils Haberland, and Hermann Ney, "Acoustic front-end optimization for large vocabulary speech recognition," in Proc. Eurospeech'97, Rhodes, Greece, September 1997, pp. IV 2099-2102.
[10] S. Sharma, D. Ellis, S. Kajarekar, P. Jain, and H. Hermansky, "Feature extraction using non-linear transformation for robust speech recognition on the Aurora database," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'00), Istanbul, Turkey, 2000, pp. II-1117-1120.