IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 8, AUGUST 2013

Effective Model Representation by Information Bottleneck Principle

Ron M. Hecht, Elad Noor, Gil Dobry, Yaniv Zigel, Aharon Bar-Hillel, and Naftali Tishby

Manuscript received March 04, 2012; revised July 22, 2012; accepted February 14, 2013. Date of publication March 15, 2013; date of current version May 09, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Nestor Becerra Yoma. R. M. Hecht is with General Motors ATCI, Advanced Technical Center Israel, Herzliya 46733, Israel, and also with the Hebrew University of Jerusalem, Jerusalem, Israel (e-mail: [email protected]). E. Noor is with the Weizmann Institute of Science, Rehovot 76100, Israel (e-mail: [email protected]). G. Dobry is with the Open University of Israel, Raanana 43107, Israel (e-mail: [email protected]). Y. Zigel is with Ben-Gurion University of the Negev, Beer Sheva 8410501, Israel (e-mail: [email protected]). A. Bar-Hillel is with Microsoft Research ATLI, Advanced Technology Labs Israel, Haifa 31905, Israel (e-mail: [email protected]). N. Tishby is with the Hebrew University of Jerusalem, Jerusalem 91904, Israel (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2013.2253097

Abstract—The common approaches to feature extraction in speech processing are generative and parametric, although they are highly sensitive to violations of their model assumptions. Here, we advocate the non-parametric Information Bottleneck (IB) method. IB is an information-theoretic approach that extends minimal sufficient statistics. However, unlike minimal sufficient statistics, which do not allow any loss of relevant data, the IB method enables a principled tradeoff between compactness and the amount of target-related information. IB's ability to improve a broad range of recognition tasks is illustrated for model dimension reduction in speaker recognition and for model clustering in age-group verification.

Index Terms—Information bottleneck method, information theory, speaker recognition, speech recognition.

I. INTRODUCTION

THE speech signal is very complex, primarily because it results from the entangled interaction of many sources of information. These range from speaker-related variables such as mood, age, gender, and cultural background, to content-related variables including phonemes, word sequences, and, more abstractly, the conversation topic. In this paper we use the Information Bottleneck (IB) method [1], [2] for task-driven, top-down guided extraction of relevant representations. The IB is a general non-parametric approach to information extraction, based on a principled information-theoretic perspective. Its goal is to construct a middle-level data representation that best captures the task at hand, while discarding irrelevant information to achieve compactness. It has been applied to a broad spectrum of tasks and challenges involving large datasets, including the classification of galaxy spectra [3] and the analysis of neural codes [4]. The abstract IB framework has several concrete algorithmic implementations that apply to both discrete [5] and Gaussian continuous variables [6]. We developed a novel variant of the Gaussian Information Bottleneck (GIB) [7], which emerges as the most useful for speaker recognition tasks. In the experimental part of this paper, we show that the IB method is well suited to speech processing across multiple tasks.


The first application involves super-vector (SV) dimensionality reduction in speaker recognition systems, and it is handled using the Gaussian IB. The SV representation is the most typical one in speaker recognition; it maps audio segments of variable length onto a fixed-length vector. These vectors tend to be noisy due to training data scarcity. The IB is utilized to refine the SV so that non-speaker-related information is omitted and therefore some of the noise is eliminated. The second application is the improvement of a backend mechanism for word-based age verification [8]. Word-based age verification systems are generally broken down into steps. First, a general-purpose Large Vocabulary Continuous Speech Recognition (LVCSR) engine decodes the audio stream into word sequences. Then a set of n-gram models, one for each age group, scores these word sequences. The IB method groups words into clusters that convey age information, thus making the n-gram models more tractable. These two examples illustrate the effectiveness of the IB on two very different tasks.

The information bottleneck is an approach to feature extraction, or dimensionality reduction. There are numerous other methods that pursue these goals, in particular ones that choose the mutual information (MI) between the transformed input and the desired output [9] as their maximization argument [10]-[14]. However, only the information bottleneck uses information-theoretic terms both as a maximization argument and as a constraint; it thus enables feature extraction in a purely information-theoretic framework. In a pioneering work [15], this type of optimization, termed the Infomax principle, was shown to successfully account for neural organization. The principle was justified in [16], [17] by showing that the conditional entropy $H(Y|T)$ bounds the Bayes error achievable when predicting $Y$ from $T$, from both above [16] and below [17]. Since $I(T;Y) = H(Y) - H(Y|T)$ and $H(Y)$ is fixed, choosing $T$ to maximize $I(T;Y)$ is equivalent to minimizing $H(Y|T)$, and hence to minimizing these bounds on the Bayes error.
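For reference, the standard forms of these two bounds (our addition for the reader's convenience, not reproduced from the paper; entropies are measured in bits) are the Hellman-Raviv upper bound and the Fano-type lower bound:

$$P_e(Y \mid T) \;\le\; \tfrac{1}{2}\, H(Y \mid T), \qquad H(Y \mid T) \;\le\; H_b\!\left(P_e\right) + P_e \log_2\!\left(|\mathcal{Y}| - 1\right),$$

where $H_b(\cdot)$ is the binary entropy; the second inequality implicitly lower-bounds $P_e$ in terms of $H(Y|T)$.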

II. INFORMATION BOTTLENECK METHOD

Consider a supervised learning task where the feature and goal variables are denoted by $X$ and $Y$ respectively. The goal of the IB is to find a more effective representation $T$ for $X$, where $T$ is obtained from $X$ through a stochastic mapping $p(t|x)$. "Effective" refers to two aspects of the representation. The first is the goal of finding a representation that is as compact as possible; this is achieved by minimizing the MI between $X$ and $T$. The second aspect is the need to preserve as much of the relevant information as possible. This relevant information is captured by the MI between the new representation and the goal variable, $I(T;Y)$, and the aim is to maximize it. Thus the relevance of a representation is goal-oriented. The optimization of these two quantities leads to a conflict: by minimizing $I(X;T)$ we look for a representation that is as compact as possible, while maximal $I(T;Y)$ entails a lossless representation w.r.t. $Y$. The IB functional combines these two forces with a tradeoff parameter $\beta$:

$$\mathcal{L}\left[p(t|x)\right] \;=\; I(X;T) \;-\; \beta\, I(T;Y). \qquad (1)$$

For the general discrete case, where the joint distribution of $X$ and $Y$ is available, an iterative algorithm for finding a solution of the IB exists [1]. However, simpler solutions exist for some cases, and two of them are applied in this work: for discrete features the Agglomerative Information Bottleneck (AIB) is used (see [5] for the algorithm description), and for continuous variables the GIB [6] is applied.
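As a concrete illustration (our sketch, not the paper's code), the functional in (1) can be evaluated for discrete variables given a joint distribution p(x, y) and a candidate stochastic encoder p(t|x), both supplied as NumPy arrays:

import numpy as np

def mutual_information(pab):
    """I(A;B) in nats for a joint distribution pab[a, b]."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log(pab[mask] / (pa @ pb)[mask])))

def ib_functional(pxy, pt_given_x, beta):
    """L = I(X;T) - beta * I(T;Y) for an encoder pt_given_x[x, t]."""
    px = pxy.sum(axis=1)                 # p(x)
    pxt = px[:, None] * pt_given_x       # p(x, t) = p(x) p(t|x)
    pty = pt_given_x.T @ pxy             # p(t, y) = sum_x p(t|x) p(x, y)
    return mutual_information(pxt) - beta * mutual_information(pty)

Minimizing this quantity over the encoder p(t|x), for a chosen beta, is exactly the tradeoff described above: small I(X;T) favors compression, large I(T;Y) favors relevance.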




A. Gaussian Information Bottleneck

One of the cases where an analytic solution exists is when $X$ and $Y$ are jointly multivariate Gaussian [6]. This type of joint distribution can be parameterized by two mean vectors and three matrices: the vectors $\mu_x$, $\mu_y$ are the means of $X$ and $Y$ respectively, and the matrices $\Sigma_x$, $\Sigma_y$, $\Sigma_{xy}$ are the covariance matrices of $X$ and $Y$ and their cross-covariance matrix. Based on these matrices, the conditional covariance is known to be $\Sigma_{x|y} = \Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{yx}$. The optimal representation $T$ was proven [6] to be a noisy linear projection of $X$, $T = AX + \xi$. We denote by $A$ the projection matrix; $A$ is composed of a subspace of left eigenvectors of $\Sigma_{x|y}\Sigma_x^{-1}$, with a unique scaling of the basis vectors. The selected eigenvectors and their relative weights are determined by the tradeoff parameter $\beta$:

$$A(\beta) = \begin{cases} \left[\,0^T;\,0^T;\,\ldots;\,0^T\,\right] & 0 \le \beta \le \beta_1^c \\ \left[\,\alpha_1 v_1^T;\,0^T;\,\ldots;\,0^T\,\right] & \beta_1^c \le \beta \le \beta_2^c \\ \left[\,\alpha_1 v_1^T;\,\alpha_2 v_2^T;\,0^T;\,\ldots;\,0^T\,\right] & \beta_2^c \le \beta \le \beta_3^c \\ \qquad\vdots & \end{cases} \qquad (2)$$

The left eigenvectors $v_1, \ldots, v_d$ of $\Sigma_{x|y}\Sigma_x^{-1}$ are sorted according to their eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_d$ in ascending order. The scalars $\alpha_i$ and the critical $\beta$ values (denoted $\beta_i^c$), at which the number of dimensions of the projection increases, are determined by the eigenvalues and eigenvectors as follows:

$$\alpha_i = \sqrt{\frac{\beta(1-\lambda_i) - 1}{\lambda_i\, r_i}}\,, \qquad \beta_i^c = \frac{1}{1-\lambda_i}\,, \qquad (3)$$

where $r_i = v_i^T \Sigma_x v_i$ is the variance of $X$ in the direction of $v_i$. In a similar manner we define $r_i^{x|y} = v_i^T \Sigma_{x|y} v_i$.

For high $\beta$ values, the scaling factors can be simplified to obtain a better understanding of the GIB projection weights. First,

$$\alpha_i \approx \sqrt{\beta}\,\sqrt{\frac{1-\lambda_i}{\lambda_i\, r_i}}\,. \qquad (4)$$

Since $\sqrt{\beta}$ is a constant factor for all eigenvectors, it can be omitted; we denote the adjusted scaling by $\tilde{\alpha}_i$. Based on the connection between the generalized eigenvalues and the simultaneous diagonalization of $\Sigma_{x|y}$ and $\Sigma_x$ [10], the generalized eigenvalue can be written as a variance ratio:

$$\lambda_i = \frac{r_i^{x|y}}{r_i}\,. \qquad (5)$$

The scaling factor can hence be simplified further:

$$\tilde{\alpha}_i = \sqrt{\frac{1-\lambda_i}{\lambda_i\, r_i}} = \sqrt{\frac{r_i - r_i^{x|y}}{r_i^{x|y}\, r_i}}\,. \qquad (6)$$

Intuitively, the scaling factors are related to the between-class variance and the total variance.
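For illustration, here is a minimal NumPy sketch (our code, not the authors') that assembles the projection of (2)-(3) from given covariance estimates; it ignores numerical edge cases and the additive noise term xi of T = AX + xi, which we treat as a deterministic feature extractor:

import numpy as np

def gib_projection(Sx, Sxy, Sy, beta):
    """Assemble the GIB projection matrix A of (2)-(3) from covariance estimates."""
    Sx_given_y = Sx - Sxy @ np.linalg.inv(Sy) @ Sxy.T          # Sigma_{x|y}
    M = Sx_given_y @ np.linalg.inv(Sx)
    # Left eigenvectors of M (v^T M = lambda v^T) are right eigenvectors of M^T.
    eigvals, eigvecs = np.linalg.eig(M.T)
    eigvals, eigvecs = eigvals.real, eigvecs.real
    rows = []
    for i in np.argsort(eigvals):                              # ascending lambda_i
        lam, v = eigvals[i], eigvecs[:, i]
        if not 0.0 < lam < 1.0:                                # numerical guard
            continue
        if beta <= 1.0 / (1.0 - lam):                          # below critical beta_i^c: inactive
            continue
        r = float(v @ Sx @ v)                                  # r_i: variance of x along v_i
        alpha = np.sqrt((beta * (1.0 - lam) - 1.0) / (lam * r))
        rows.append(alpha * v)                                 # row alpha_i v_i^T of A
    return np.vstack(rows) if rows else np.zeros((0, Sx.shape[0]))

Given a centered supervector x, its reduced representation is then simply A @ x; as beta grows, more directions pass their critical value and enter the projection with weight alpha_i.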

III. GAUSSIAN INFORMATION BOTTLENECK FOR SPEAKER RECOGNITION

We demonstrate the effectiveness of GIB by applying it to speaker recognition, focusing on SV dimension reduction [18]-[20]. For a 128-Gaussian system that handles 13 Mel-cepstrum coefficients and their derivatives, the SV dimension is 128 x 26 = 3328. However, by applying the GIB we can reduce this dimension and also boost performance. We continue the work presented in [7] by providing a better analysis of the algorithm and a better comparison to other covariance-related methods: linear discriminant analysis (LDA), principal component analysis (PCA), and Nuisance Attribute Projection (NAP). Specifically, we compare GIB to the NAP method [21] from both theoretical and empirical perspectives.

A. GIB for Speaker Recognition

Here, we treat an entire audio segment as a single example [7]. The $x$ value of an example is the SV representation of its audio segment. Assume $N$ utterances, represented as SVs, generated by a set of $S$ speakers. The mapping between utterances and speakers is denoted by $s(i)$. We denote by $A_j$ the set of example indices belonging to speaker $j$: $A_j = \{i : s(i) = j\}$. The size of the set, $N_j = |A_j|$, is the number of utterances generated by speaker $j$. Therefore, if we consider $X$ to be the matrix of all the SVs (each SV is a column vector in the matrix), then all the SVs of speaker $j$ are $\{x_i : i \in A_j\}$. Since GIB assumes a continuous target $y$, we have to map the speaker identity to a continuous representation in which the Euclidean distance is meaningful. We choose $y_i$ to be the average of the SVs of the audio segments produced by the target speaker:

$$y_i = \frac{1}{N_{s(i)}} \sum_{k \in A_{s(i)}} x_k\,, \qquad (7)$$

where $y_i$ is the SV representation of the speaker and $x_i$ is the SV representation of the utterance. We call this specific derivative of the GIB "GIB to the means" and refer to it plainly as GIB for the rest of the paper. In the special case where the $y$ values are the averages of the $x$ values, the cross-covariance matrix of $x$ and $y$ is actually the covariance matrix of $y$:

$$\Sigma_{xy} = \Sigma_y\,. \qquad (8)$$

Plugging this into the $\Sigma_{x|y}$ definition, we get that in our case $\Sigma_{x|y} = \Sigma_x - \Sigma_y$. This simplifies the matrix whose eigenvectors define the projection:

$$\Sigma_{x|y}\Sigma_x^{-1} = I - \Sigma_y\Sigma_x^{-1}\,. \qquad (9)$$

Since GIB keeps the eigenvectors of (9) with the smallest eigenvalues, which are the eigenvectors of $\Sigma_y\Sigma_x^{-1}$ with the largest eigenvalues, we see that under the assumption that $y$ is the mean of the speaker distribution, GIB becomes LDA with the scaling mechanism of (6). All in all, LDA turns out to be a specific case of the full GIB. This simplification ($\Sigma_{x|y} = \Sigma_x - \Sigma_y$) leads to a simplification of $\tilde{\alpha}_i$ as well:

$$\tilde{\alpha}_i = \sqrt{\frac{r_i^{y}}{r_i^{x|y}\, r_i}}\,, \qquad r_i^{y} = v_i^T \Sigma_y v_i\,. \qquad (10)$$

Intuitively, the scaling factor is related to the three standard deviations in the relevant direction: the between-speaker, the within-speaker, and the total standard deviation.
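A small numerical check (again our construction, with hypothetical toy data) of the identity behind (8)-(9): when each y is the per-speaker empirical mean of the corresponding x's, the sample cross-covariance equals the covariance of y, so Sigma_{x|y} collapses to Sigma_x - Sigma_y:

import numpy as np

rng = np.random.default_rng(0)
dim, n_speakers, utt_per_spk = 5, 20, 30

# Toy data: per-speaker means plus within-speaker noise, grouped by speaker.
spk_means = rng.normal(size=(n_speakers, dim))
x = np.repeat(spk_means, utt_per_spk, axis=0) \
    + 0.3 * rng.normal(size=(n_speakers * utt_per_spk, dim))
# y_i = empirical mean of the SVs of the speaker of utterance i, as in (7).
y = np.repeat(x.reshape(n_speakers, utt_per_spk, dim).mean(axis=1), utt_per_spk, axis=0)

def cov(a, b):
    a0, b0 = a - a.mean(axis=0), b - b.mean(axis=0)
    return a0.T @ b0 / len(a)

Sx, Sy, Sxy = cov(x, x), cov(y, y), cov(x, y)
print(np.allclose(Sxy, Sy))                                         # equation (8)
print(np.allclose(Sx - Sxy @ np.linalg.inv(Sy) @ Sxy.T, Sx - Sy))    # hence Sigma_{x|y} = Sigma_x - Sigma_y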

B. Comparison Between GIB to the Means and NAP

We focus on a comparison between GIB and NAP [21], since NAP is one of the most commonly used methods in speaker recognition.


The goal of NAP SV dimension reduction is to find a projection matrix $P$ that minimizes the weighted average of cross-speaker distances:

$$\min_P \sum_{i,j} W_{ij}\, \left\lVert P\left(x_i - x_j\right) \right\rVert^2, \qquad (11)$$

where $W$ and $P$ are the nuisance weight and projection matrices respectively and the $x_i$'s are the SVs, with

$$W_{ij} = \begin{cases} 1 & s(i) = s(j) \\ 0 & \text{otherwise.} \end{cases} \qquad (12)$$

For a vector $u$, denote by $\mathrm{diag}(u)$ the matrix in which $u$ lies on the diagonal and the non-diagonal entries are zeros. In [21] it is shown that (11) is minimized by projecting out the eigenvectors of the matrix $Z = X\left(\mathrm{diag}(W\mathbf{1}) - W\right)X^T$ that have the largest eigenvalues, where $\mathbf{1}$ is a column vector of ones and $X$ is the SV matrix as defined before.

Denote by $\hat{\Sigma}_j$ the inner covariance matrix of class $j$. Clearly $\Sigma_{x|y} = \frac{1}{N}\sum_j N_j \hat{\Sigma}_j$. We now claim that $Z$ is a variation on $\Sigma_{x|y}$. $Z$ can be decomposed into two terms:

$$Z = X\,\mathrm{diag}(W\mathbf{1})\,X^T - X W X^T. \qquad (13)$$

Note that $\left(\mathrm{diag}(W\mathbf{1})\right)_{ii} = N_{s(i)}$. Looking at the first term element-wise, one can write

$$X\,\mathrm{diag}(W\mathbf{1})\,X^T = \sum_i N_{s(i)}\, x_i x_i^T = \sum_j N_j \sum_{i \in A_j} x_i x_i^T. \qquad (14)$$

Equation (14) is proportional to the second moment of the data, the sample correlation matrix. The second term can be expressed as

$$X W X^T = \sum_j \sum_{i,k \in A_j} x_i x_k^T = \sum_j N_j^2\, m_j m_j^T, \qquad m_j = \frac{1}{N_j}\sum_{i \in A_j} x_i\,. \qquad (15)$$

Equation (15) is proportional to the first moment, the mean. Equation (13) can hence be re-written as

$$Z = \sum_j N_j \left( \sum_{i \in A_j} x_i x_i^T - N_j\, m_j m_j^T \right) = \sum_j N_j^2\, \hat{\Sigma}_j\,, \qquad (16)$$

which is a weighted sum of the inner-class covariance matrices, but with weights proportional to the square of the class size.

All in all, the differences between GIB and NAP can be summarized in three points: 1) use of $\Sigma_x$ - NAP does not exploit this information; 2) scaling - in NAP no scaling is applied; 3) weight of each class - in NAP the weight is squared, whereas in GIB it is linear. The first difference leads to improved performance of GIB compared to NAP. The second difference significantly reduces the sensitivity of GIB to the choice of the reduced dimension parameter, making it more stable than NAP or standard LDA. In our implementation of these algorithms we assumed that both matrices are block diagonal and that each block is composed of the Gaussian coefficients relating to a specific basic feature.
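A short numerical sketch (ours, with made-up data) of the decomposition (13)-(16): the NAP matrix Z equals the sum of the per-speaker inner covariance matrices weighted by the squared class sizes:

import numpy as np

rng = np.random.default_rng(1)
dim, n_speakers = 4, 6
sizes = rng.integers(2, 8, size=n_speakers)             # N_j, deliberately unequal
labels = np.repeat(np.arange(n_speakers), sizes)
X = rng.normal(size=(dim, labels.size))                  # SVs as columns, as in the paper

W = (labels[:, None] == labels[None, :]).astype(float)   # equation (12)
Z = X @ (np.diag(W.sum(axis=1)) - W) @ X.T               # equation (13)

# Right-hand side of (16): sum_j N_j^2 * (inner covariance of class j)
rhs = np.zeros((dim, dim))
for j in range(n_speakers):
    Xj = X[:, labels == j]
    Nj = Xj.shape[1]
    mj = Xj.mean(axis=1, keepdims=True)
    rhs += Nj**2 * ((Xj - mj) @ (Xj - mj).T / Nj)
print(np.allclose(Z, rhs))                               # True

The quadratic weighting is the third difference listed above: speakers with many utterances dominate Z much more strongly than they dominate the within-class covariance used by GIB.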

C. System Description and Experiments

The speaker recognition system [7] studied in this section was composed of four phases (Fig. 1). The first training phase was the Universal Background Model (UBM) estimation. The second training phase involved GIB model estimation: each speaker-labeled audio segment was transformed into its SV representation, the $\Sigma_x$ and $\Sigma_{x|y}$ matrices were estimated from the entire set of labeled SVs, and (2)-(6) were used to find the GIB projection matrix. The goal of the third training phase was to create the speaker models for the test phase. For each target speaker, a single audio segment was provided; as in the previous phase, an SV was created for each training segment and reduced using the GIB projection. In addition to the target speakers, TZ speakers were trained in the same manner. The input for the test phase was a test audio segment and a hypothesis regarding the speaker talking in it. As in the third training phase, processing the audio segment began with an estimation of its GIB representation; then the Euclidean distance between the GIB representation of the segment and the speaker model was calculated and normalized.

Experiments were conducted over the male part of the NIST 2005 SRE, while TZ speaker models and GIB estimation were obtained from the male part of the NIST 2004 SRE. Fig. 2 shows a comparison between GIB and several other linear projection methods relying on eigen-decomposition of the $\Sigma_x$ and $\Sigma_{x|y}$ matrices.

A comparison of the different methods reveals that their performance in the optimal setup does not differ greatly; however, a comparison of the recognition performance at low dimensionality reveals a different story. At these points, for which a compact representation is needed, there is a considerable difference in favor of GIB and LDA over NAP and PCA. Another interesting comparison emerges when each method is applied without reduction, which frees us from conducting a long set of tedious experiments to find the optimal reduced dimension. There was a striking difference between the recognition performance of NAP and PCA on one hand and GIB and LDA on the other. The source of this difference is that the NAP and PCA basis vectors are derived from the eigen-decomposition of a single symmetric matrix, so they provide a complete orthonormal basis. In contrast, LDA and GIB vectors are derived as generalized eigenvectors, which are not orthogonal. Hence, if reduction is not applied, NAP and PCA preserve the original high-dimensional distances, whereas LDA and GIB modify them in a useful manner. Compared to LDA, the main advantage of GIB is its stable behavior when reducing to high dimensions. This phenomenon can be explained through (6): for directions that carry little speaker information (eigenvalues $\lambda_i$ of $\Sigma_{x|y}\Sigma_x^{-1}$ close to one), the scaling factor $\tilde{\alpha}_i$ approaches zero, since $r_i^{x|y}$, the variance of $\Sigma_{x|y}$ in that direction, approaches $r_i$, the unconditional variance. Hence the weight of these dimensions is drastically reduced by GIB, leading to stable performance irrespective of the exact choice of the dimension parameter.
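To make the test phase concrete, here is a minimal sketch (our illustration only; the supervector extraction and model training are assumed to exist elsewhere, and the normalization shown is a simple cohort-based stand-in rather than the exact procedure of the paper) of scoring a single trial with the GIB-reduced representation:

import numpy as np

def score_trial(test_sv, target_sv, cohort_svs, A):
    """Distance-based trial score with a simple cohort-style normalization."""
    t = A @ test_sv                                    # GIB-reduced test representation
    d_target = np.linalg.norm(t - A @ target_sv)
    d_cohort = np.array([np.linalg.norm(t - A @ m) for m in cohort_svs])
    # Smaller distance = better match; normalize against the cohort statistics.
    return (d_cohort.mean() - d_target) / (d_cohort.std() + 1e-12)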



Fig. 1. The GIB-based speaker recognition system.

Fig. 2. Recognition performance for the different dimension reduction methods.

IV. WORD-BASED AGE VERIFICATION

Fig. 3. The AIB-based age estimation system.

IB is also a powerful tool in word-based age estimation. Most of this work appears in [8]; here we summarize the main findings. These experiments were conducted on language models. The Fisher corpus [22] was used, since its calls are spontaneous, long, and diverse. The age-group estimation system included four phases (Fig. 3). All the phases shared the same two initial stages. First, the word sequence was recognized using an LVCSR engine. Subsequently, the word sequence was transformed into a word-level feature vector. We experimented with two competitive feature vector representations: the unigram distribution of the words produced by a specific speaker in a specific call, and the unigram-pair distribution of the call. Explicitly, the feature vector holds the unsmoothed likelihood of each word to be uttered by the speaker in the call, or the analogous quantity for unigram-pair features. We conducted experiments with and without AIB clustering. When AIB is applied, a large set of unigram or unigram-pair distributions was extracted from calls whose speakers spanned a broad spectrum of ages. A joint distribution matrix of words (or word pairs) and ages was estimated from this large set and analyzed to find the most effective hierarchical clustering according to the IB criterion. A clustering of words (or word pairs) into age-preserving groups was the output of this phase.
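The clustering step can be sketched as follows (a simplified greedy merge in the spirit of the agglomerative IB [5], not the authors' implementation): starting from one cluster per word, it repeatedly merges the pair of clusters whose merge loses the least age-related information.

import numpy as np

def entropy(p):
    """Entropy (nats) of a normalized distribution."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def aib_cluster(p_wy, n_clusters):
    """Greedy agglomerative IB; p_wy[w, y] is the joint word/age-group distribution."""
    clusters = [[w] for w in range(p_wy.shape[0])]
    joint = [p_wy[w].astype(float).copy() for w in range(p_wy.shape[0])]  # p(t, y) per cluster
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                merged = joint[a] + joint[b]
                # Relevant information lost by merging a and b:
                # p(m) H(Y|m) - p(a) H(Y|a) - p(b) H(Y|b), a weighted JS divergence.
                loss = (merged.sum() * entropy(merged / merged.sum())
                        - joint[a].sum() * entropy(joint[a] / joint[a].sum())
                        - joint[b].sum() * entropy(joint[b] / joint[b].sum()))
                if best is None or loss < best[0]:
                    best = (loss, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        joint[a] = joint[a] + joint[b]
        del clusters[b], joint[b]
    return clusters

The returned word groups define the compact feature vectors used below (word probabilities summed within each cluster).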



At the second stage, we trained a set of SVMs using the basic representation, one for each age group. When AIB is applied, the SVMs are trained on the more compact feature vectors created by applying the learned AIB clustering to the original representation. During the third training stage, a mechanism to transform the SVM raw scores into likelihoods was trained: specifically, a Gaussian was estimated for each age group, using the raw scores of the SVM models as features. During the test phase, a unigram or unigram-pair word distribution was created for each call. In the relevant experimental conditions, AIB then reduced the dimension of the features. Next, the SVM models scored the resulting vector, and finally the Gaussian likelihood scores for the different age groups were estimated.

The results are presented in Fig. 4, showing the error rates of the unigram and unigram-pair representations, with and without AIB. The x and y axes of the figure represent the complexity and the quality of the model respectively. Model complexity is measured by the number of clusters, i.e., the size of the vector that the model receives as input. The quality of a model is measured by its effectiveness in the age-group verification task, expressed as the average Equal Error Rate (EER) over the age groups that were tested. Simple unigram and unigram-pair models were trained on the most common words or word pairs at several vocabulary sizes (200, 400, 700, and 1000, among the settings shown in Fig. 4). When AIB is applied, the initial feature vector included the probabilities of the 1000 most common words or word pairs, which AIB then reduced to a range of smaller cluster counts (down to 50).

For both unigram and unigram-pair representations, AIB proves beneficial in two important respects. First, IB enables very compact representations that preserve almost all the relevant information and hence have almost the same error rate as the most complex representations; specifically, AIB-induced representations with 50 features have error rates similar to representations with 1000 features. This is in sharp contrast with small representations not induced by IB, which exhibit significantly deteriorated performance. Second, the medium-sized representations enabled by IB (300 features in our experiments) keep almost all the relevant information, yet allow more accurate model estimation with fewer parameters. This, in turn, results in modest performance improvements (Fig. 4).

Fig. 4. Verification performance (average Equal Error Rate).

V. CONCLUSION

This paper illustrated the effectiveness of the IB method for speech-related tasks. In order to depict the range of possibilities enabled by this method, two problems were chosen, and different IB-based solutions were suggested. Both continuous and discrete variables were discussed, using the Gaussian and the agglomerative IB respectively. In both tasks, a dramatic reduction in model size was achieved with minimal decline in performance. Also, in both tasks, moderate dimensionality reduction enabled better model estimation, resulting in improved recognition performance.

ACKNOWLEDGMENT

We would like to thank A. Manna and O. Hezroni for their help regarding age verification.

REFERENCES

[1] N. Tishby, F. Pereira, and W. Bialek, "The information bottleneck method," in Proc. 37th Annu. Allerton Conf. Commun., Control, Comput., 1999.
[2] O. Shamir, S. Sabato, and N. Tishby, "Learning and generalization with the information bottleneck," in Proc. ISAIM, 2008.
[3] N. Slonim et al., "Objective classification of galaxy spectra using the information bottleneck method," Mon. Not. R. Astron. Soc., vol. 323, pp. 270-284, 2001.
[4] E. Schneidman et al., "Analyzing neural codes using the information bottleneck method," in Proc. NIPS, 2002.
[5] N. Slonim and N. Tishby, "Agglomerative information bottleneck," in Proc. NIPS, 1999.
[6] G. Chechik et al., "Information bottleneck for Gaussian variables," in Proc. NIPS, 2003.
[7] R. M. Hecht, E. Noor, and N. Tishby, "Speaker recognition by Gaussian information bottleneck," in Proc. Interspeech, Brighton, U.K., 2009.
[8] R. M. Hecht et al., "Information bottleneck based age verification," in Proc. Interspeech, Brighton, U.K., 2009.
[9] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley, 1991.
[10] A. Bar-Hillel, "Learning from weak representations using distance functions and generative models," Ph.D. dissertation, Dept. Comput. Sci., Hebrew Univ. of Jerusalem, Jerusalem, Israel, 2006.
[11] K. Torkkola and W. Campbell, "Mutual information in learning feature transforms," in Proc. ICML, 2000.
[12] J. Silva and S. S. Narayanan, "Discriminative wavelet packet filter bank selection for pattern recognition," IEEE Trans. Signal Process., vol. 57, no. 5, pp. 1796-1810, May 2009.
[13] M. Padmanabhan and S. Dharanipragada, "Maximizing information content in feature extraction," IEEE Trans. Speech Audio Process., vol. 13, no. 4, pp. 512-519, Jul. 2005.
[14] M. Vidal-Naquet and S. Ullman, "Object recognition with informative features and linear classification," in Proc. ICCV, 2003.
[15] R. Linsker, "Self-organization in a perceptual network," IEEE Comput., vol. 21, no. 3, pp. 105-117, Mar. 1988.
[16] M. E. Hellman and J. Raviv, "Probability of error, equivocation and the Chernoff bound," IEEE Trans. Inf. Theory, vol. IT-16, no. 4, pp. 368-372, Jul. 1970.
[17] N. Vasconcelos, "Minimum probability of error image retrieval," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2322-2336, Aug. 2004.
[18] H. Aronowitz, D. Burshtein, and A. Amir, "Speaker indexing in audio archives using test utterance Gaussian mixture modeling," in Proc. ICSLP, 2004.
[19] R. Vogt, S. Kajarekar, and S. Sridharan, "Discriminant NAP for SVM speaker recognition," in Proc. Odyssey, 2008.
[20] W. M. Campbell, "Generalized linear discriminant sequence kernels for speaker recognition," in Proc. ICASSP, 2002, pp. 161-164.
[21] A. Solomonoff et al., "Channel compensation for SVM speaker recognition," in Proc. Odyssey, 2004.
[22] C. Cieri et al., Fisher English Training. Philadelphia, PA, USA: Linguistic Data Consortium, 2005.