Dimension Reduction Techniques for Training Polynomial Networks
William M. Campbell Kari Torkkola Motorola Human Interface Lab, 2100 East Elliot Road, M/D EL508, Tempe, AZ 85284 Sreeram V. Balakrishnan Motorola Human Interface Lab, 3145 Porter Drive, Palo Alto, CA 94304
Abstract
1. Introduction
We consider polynomial networks of the following type. The inputs, $x_1, \ldots, x_n$, to the network are combined with multipliers to form a vector of basis functions $p(x)$; for example, for two inputs $x_1$ and $x_2$ and a second degree network, we obtain
FSB027@EMAIL.MOT.COM
Training techniques for polynomial networks fall into several categories. The first category of techniques estimates the parameters for the polynomial expansion based on in-class data (Fukunaga, 1990; Specht, 1967). These methods approximate the class-specific probabilities (Schurmann, 1996). Since out-of-class data is not used for training a specific model, accuracy is limited. A second category of methods involves discriminative training (Schurmann, 1996) with a mean-squared error criterion. The goal of these methods is to approximate the a posteriori distribution for each class. This approach traditionally involves decomposition of large matrices, so it is intractable for large training sets in terms of both computation and storage. A more recent method involves the use of support vector machines, which rely on the technique of structural risk minimization. We use an alternate training technique based upon the method in Campbell and Assaleh (1999), which approximates a posteriori probabilities.
We propose two novel methods for reducing dimension in training polynomial networks. We consider the class of polynomial networks whose output is the weighted sum of a basis of monomials. Our first method for dimension reduction eliminates redundancy in the training process. Using an implicit matrix structure, we derive iterative methods that converge quickly. A second method for dimension reduction involves a novel application of random dimension reduction to “feature space.” The combination of these algorithms produces a method for training polynomial networks on large data sets with decreased computation over traditional methods and model complexity reduction and control.
P27439@EMAIL.MOT.COM A540AA@EMAIL.MOT.COM
$$p(x) = [\,1 \;\; x_1 \;\; x_2 \;\; x_1^2 \;\; x_1 x_2 \;\; x_2^2\,]^t. \tag{1}$$
A second layer linearly combines all these inputs to produce scores $w^t p(x)$. We call $w$ the classification (or verification) model. In general, polynomial basis terms of the form $x_{i_1} x_{i_2} \cdots x_{i_k}$ are used where $k$ is less than or equal to the polynomial degree, $K$. For each input vector, $x$, and each class, $i$, a score is produced by the inner product, $w_i^t p(x)$. If a sequence of $N$ input vectors is introduced to the classifier, the total score is the average score over all inputs, $s_i = \frac{1}{N} \sum_{k=1}^{N} w_i^t p(x_k)$, where $x_k$ denotes the $k$th vector in the sequence. The total score is used for classification or verification. Note that we do not use a sigmoid on the output as is common in higher-order neural networks (Giles & Maxwell, 1987).
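The expansion in (1) and the averaged score can be sketched in a few lines of Python. This is only an illustration: the function names `poly_basis` and `average_score` are hypothetical, and the ordering of the monomials (lowest degree first) is an assumption chosen to match (1).

```python
from itertools import combinations_with_replacement

def poly_basis(x, degree):
    """Monomials of x up to `degree`, lowest degree first: 1, x1, x2, x1^2, x1*x2, ..."""
    terms = []
    for k in range(degree + 1):
        # Each non-decreasing index tuple (i_1 <= ... <= i_k) is one monomial.
        for idx in combinations_with_replacement(range(len(x)), k):
            prod = 1.0
            for i in idx:
                prod *= x[i]
            terms.append(prod)
    return terms

def average_score(w, xs, degree):
    """Average of the inner products w^t p(x) over a sequence of input vectors."""
    total = 0.0
    for x in xs:
        p = poly_basis(x, degree)
        total += sum(wi * pi for wi, pi in zip(w, p))
    return total / len(xs)

# Two inputs, second degree: [1, x1, x2, x1^2, x1*x2, x2^2]
p = poly_basis([2.0, 3.0], 2)
```

No sigmoid is applied to the score, in line with the text above.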
The use of polynomial networks and our training/classification method is motivated from an application perspective. First, the discriminative training method in Campbell and Assaleh (1999) can be applied to very large data sets efficiently. For the application we consider in speech processing, this property is critical since we want to be able to train systems in a reasonable amount of time without custom hardware. Second, discriminative training of polynomial networks produces state-of-the-art performance in terms of number of parameters needed, accuracy, and computational effort for several applications including speaker and isolated word recognition. Polynomial networks outperform many other common techniques used for these applications because they are discriminatively trained and approximate a posteriori probabilities. In contrast, techniques such as Gaussian mixture models and hidden Markov models use maximum likelihood training and approximate in-class probabilities. For open set problems, this leads to difficulties since we do not model the out-of-class set well (although partial solutions such as “cohort normalization” have been proposed (Campbell, Jr., 1995)). Finally, the training method we use encapsulates the statistics of an entire class into a single vector, which eliminates the need for storing training data. For instance, for the application of speaker verification (a two class problem), we collapse the entire population of impostors (over one million vectors of size 24) into a single vector of approximately 20,000 elements which is then stored for later enrollment (training) of legitimate users.
We define $M_i$ as the matrix whose rows are the polynomial expansion of class $i$'s data; i.e., $M_i = [\,p(x_{i,1}) \;\; p(x_{i,2}) \;\; \cdots \;\; p(x_{i,N_i})\,]^t$ where $N_i$ is the number of training vectors for class $i$. We define $M = [\,M_1^t \;\; M_2^t \;\; \cdots \;\; M_{N_{\text{classes}}}^t\,]^t$ where $N_{\text{classes}}$ is the number of classes. The training problem is
Two difficulties arise in the process of training a polynomial network. First, the dimension of the vector $p(x)$ grows quickly with the degree of the network; we denote the output $p(x)$ as being in “feature space” in analogy to the terminology used for support vector machines (SVMs). For example, with an input vector of dimension 24, the vector $p(x)$ is of dimension 25 (degree 1), 325 (degree 2), 2,925 (degree 3), etc. We would like to have finer granularity in this growth since it impacts model complexity. A second difficulty in training polynomial networks arises from the redundancy of polynomial terms in training (to be explained in Section 3). Training involves a feature space with a dimension corresponding to twice the degree of the classification network. For instance, if we classify with a first order (linear) network, we must compute second order statistics (correlations). Until the final step of the training process, we use only these statistics. At the final step, we introduce redundancy to construct a “higher order” correlation matrix. This final step dramatically increases the resources (especially memory) required by the algorithm. We propose an iterative solution method which avoids this step. The resulting solution makes it possible to train larger problems or to train on systems with limited resources.
where $o_i$ is the vector consisting of $N_i$ ones in the rows where the $i$th class's data is located and zeros otherwise.
The paper is organized as follows. In Section 2, we review the method for training a polynomial classifier given in Campbell and Assaleh (1999). In Section 3, we show how iterative techniques can be applied to eliminate redundancy in training. Section 4 introduces the process of random dimension reduction in feature space. A random direct method is proposed in Section 4.1. This method makes it possible to control model complexity easily and effectively. In Section 4.2, we show how the method can be implemented quickly using a fast Fourier transform (FFT). Section 5 shows how to combine iterative methods and random dimension reduction to eliminate redundancy and control model complexity. Section 6 illustrates the use of the algorithms on the task of speaker verification.
$$w_i^* = \operatorname{argmin}_{w} \| M w - o_i \|_2 \tag{2}$$
Applying the method of normal equations (Golub & Van Loan, 1989) to (2) gives the following problem
$$M^t M w_i = M^t o_i. \tag{3}$$
Define $\mathbf{1}$ to be the vector of all ones. We rearrange (3) to
$$\sum_{j=1}^{N_{\text{classes}}} M_j^t M_j w_i = M_i^t \mathbf{1}. \tag{4}$$
If we define $R_j = M_j^t M_j$, then (4) becomes
$$\Bigl( \sum_{j=1}^{N_{\text{classes}}} R_j \Bigr) w_i = M_i^t \mathbf{1}. \tag{5}$$
Equation (5) is a significant step towards our training method. The problem is now separable. We can individually compute $R_j$ for each class and then combine the final result into a matrix $R = \sum_{j=1}^{N_{\text{classes}}} R_j$.
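As a small illustration of (5), the following sketch builds one correlation matrix per class, sums them, and solves for each class model. It is a toy under stated assumptions, not the implementation used in the paper: the scalar first-degree basis `basis`, the helper names `outer_sum` and `solve`, and the four training points are all made up for the example.

```python
def basis(x):
    # First-degree basis p(x) = [1, x] for a scalar input (illustrative choice).
    return [1.0, x]

def outer_sum(xs):
    # R_j = M_j^t M_j, accumulated as a sum of outer products p(x) p(x)^t.
    n = len(basis(0.0))
    R = [[0.0] * n for _ in range(n)]
    for x in xs:
        p = basis(x)
        for a in range(n):
            for b in range(n):
                R[a][b] += p[a] * p[b]
    return R

def solve(A, b):
    # Gaussian elimination with partial pivoting on a copy of [A | b].
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / aug[r][r]
    return x

classes = [[0.0, 1.0], [2.0, 3.0]]            # toy training data, one list per class
R_per_class = [outer_sum(xs) for xs in classes]
n = len(R_per_class[0])
R = [[sum(Rj[a][b] for Rj in R_per_class) for b in range(n)] for a in range(n)]

models = []
for xs in classes:
    rhs = [sum(basis(x)[a] for x in xs) for a in range(n)]   # M_i^t 1
    models.append(solve(R, rhs))                             # w_i from (5)
```

The size of each `R_per_class` entry depends only on the basis, not on how many vectors the class contains, which is the separability property noted above.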
One advantage of using the matrices $R_j$ is that their size does not change as more training data becomes available. Also, the unique terms in $R_j$ are exactly the sums of basis terms of degree $2K$ or less where $K$ is the polynomial network degree. We denote the terms of degree $2K$ or less for a vector $x$ as a vector, $p_2(x)$. See Table 1 for the compression factor for a dimension 12 input vector. Note that the elimination of redundancy decreases both storage and computation; e.g., when training a second degree network we reduce computation substantially by computing $p_2(x)$ instead of $p(x) p(x)^t$ (see the ratios in Table 1).
The vector $p(x)$ can be calculated recursively. Suppose we have the polynomial basis terms of degree $k$, and we wish to calculate the terms of degree $k+1$. If we have the $k$th
2. Direct Training Method
We train the polynomial network to approximate an ideal output using mean-squared error as the objective criterion. We deal with the multi-class problem in the following discussion of training.
Table 1. Term redundancies for the matrix $R_j$.

Degree   Terms in $R_j$   Unique Terms   Ratio
2        8,281            1,820          4.55
3        207,025          18,564         11.15
4        3,312,400        125,970        26.30
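The counts in Table 1 follow from the number of monomials of degree at most $K$ in $n$ variables, $\binom{n+K}{K}$. The short check below recomputes them for the 12-dimensional input; the helper name `num_terms` is hypothetical.

```python
from math import comb

def num_terms(n_inputs, degree):
    # Number of monomials of degree <= `degree` in `n_inputs` variables.
    return comb(n_inputs + degree, degree)

for K in (2, 3, 4):
    full = num_terms(12, K) ** 2        # entries of p(x) p(x)^t
    unique = num_terms(12, 2 * K)       # entries of p_2(x)
    print(K, full, unique, round(full / unique, 2))
```

The same formula gives the feature-space dimensions 25, 325, and 2,925 quoted in the introduction for a 24-dimensional input.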
degree terms whose last factor is $x_l$ collected in a vector $u_l$, we obtain the $(k+1)$th degree terms ending with $x_l$ as $[\,u_1^t x_l \;\; u_2^t x_l \;\; \cdots \;\; u_l^t x_l\,]^t$. Concatenating the different degrees gives the vector $p(x)$.

Combining all of the above methods results in the training algorithm in Table 2. This training algorithm is not limited to polynomials. One key enabling property is that the basis elements form a semigroup; i.e., the product of two basis elements is again a basis element. This property allows one to compute only the unique elements in the matrix $R_i$. Another key property is the partitioning of the problem. If a linear approximation space is used, then we obtain the same problem as in (2). The resulting matrix can be partitioned, and the problem can be broken up to ease memory usage.

Our method of using normal equations to fit data is a natural extension of older methods used for function fitting (Golub & Van Loan, 1989) and radial basis function training (Bishop, 1995). We have extended these methods in two ways. First, we partition the problem by class; i.e., we calculate and store the vector $r_i$ for each class separately. This is useful since in many cases we can use the $r_i$ for adaptation, new class addition, or on-line processing. We can also obtain the right-hand side of (5) as a subvector of $r_i$, which reduces computation significantly as we find each $w_i$. Second, we have used the semigroup property of the monomials to reduce computation and storage dramatically.

We note that the normal equation method is absent from most expositions on training polynomial networks. This may be due to the fact that forming $R = M^t M$ squares the condition number of $M$; we have not found this to be a problem in practice.

A disadvantage of the training algorithm in Table 2 is that in Step 9, the compressed representation is expanded into a matrix. This results in a dramatic increase in resources for the algorithm (by the factors shown in Table 1). We derive a method which avoids this expansion in the next section.
Table 2. Training algorithm.

1) For $i = 1$ to $N_{\text{classes}}$
2)    Let $r_i = 0$.
3)    For $j = 1$ to $N_i$
4)       Retrieve training vector $j$, $x_{i,j}$, from class $i$'s training set.
5)       Let $r_i = r_i + p_2(x_{i,j})$.
6)    Next $j$
7) Next $i$
8) Compute $r = \sum_{i=1}^{N_{\text{classes}}} r_i$.
9) Expand $r$ to $R$. Derive $M_i^t \mathbf{1}$ from $r_i$.
10) For all $i$, solve $R w_i = M_i^t \mathbf{1}$ for $w_i$.
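Steps 1 to 8 of Table 2 can be sketched as follows. The function names and the toy data are hypothetical. The example also assumes that monomials are ordered lowest degree first, so that $M_i^t \mathbf{1}$ appears as the leading subvector of $r_i$; with a different ordering the subvector would sit elsewhere.

```python
from itertools import combinations_with_replacement

def monomials(x, degree):
    # Monomials of x of degree 0..degree, lowest degree first.
    out = []
    for k in range(degree + 1):
        for idx in combinations_with_replacement(range(len(x)), k):
            prod = 1.0
            for i in idx:
                prod *= x[i]
            out.append(prod)
    return out

def accumulate(train, K):
    # Steps 1-7: r_i = sum_j p_2(x_{i,j}), one fixed-size vector per class.
    r_per_class = []
    for xs in train:
        r_i = None
        for x in xs:
            p2 = monomials(x, 2 * K)
            r_i = p2 if r_i is None else [a + b for a, b in zip(r_i, p2)]
        r_per_class.append(r_i)
    # Step 8: r = sum_i r_i.
    r = [sum(col) for col in zip(*r_per_class)]
    return r_per_class, r

train = [[[0.0, 1.0], [1.0, 1.0]], [[2.0, 0.5]]]   # toy data: 2 classes, 2 inputs
r_per_class, r = accumulate(train, K=1)
n_p = len(monomials(train[0][0], 1))               # length of p(x)
rhs_class0 = r_per_class[0][:n_p]                  # M_0^t 1 as a subvector of r_0
```

Only the vectors `r_per_class` need to be kept once the training data has been processed, which is what makes later adaptation or new class addition inexpensive.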
3. Eliminating Redundancy in Training
3.1 Iterative Training Methods
Iterative techniques to solve linear equations have typically been used in two areas. In the numerical analysis community, methods are targeted toward solving large sparse systems (Golub & Van Loan, 1989). In the engineering community, approaches have concentrated on using iterative methods for recursive learning (Schurmann, 1996). We experiment with iterative methods from both areas.

Iterative methods are a common technique used to solve linear equations; e.g., $Rw = b$. In most cases, an iterative method is based upon computing a product $Rz$ where $z$ is some auxiliary vector. Using this data, a descent direction $d$ is obtained. A new solution estimate is then given by $w_{i+1} = w_i + \alpha d$ where $\alpha$ is some suitably chosen scalar (here the subscript indexes the iteration).

A common method for iterative training is implemented in Kaczmarz’ algorithm for recursive learning (Schurmann, 1996; Kaczmarz, 1937). The method uses the update
$$w_{i+1} = w_i + \mu \,(b_j - a_j w_i)\, a_j^t \tag{6}$$
where $a_j$ is the $j$th row of $R$, $b_j$ is the $j$th entry of $b$, and $0 < \mu \|a_j\|_2^2 < 2$. In our experiments, $\mu$ is set relative to $\|a_j\|_2^2$ so that this condition holds. The two main advantages of this method are (1) it is computationally simple, and (2) the update involves only one row of $R$.

More sophisticated algorithms for iterative training are the successive over-relaxation (SOR) algorithm and the conjugate gradient (CG) algorithm. The SOR algorithm is a generalization of the well-known Gauss-Seidel method with a parameter $0 < \omega < 2$ which can be varied to give different convergence rates. The conjugate gradient (CG) algorithm is another popular method. It has the advantage that there are no direct parameters to estimate, and its convergence rate is determined by the condition of the matrix $R$. We have selected these iterative methods because of their common use and applicability to our problem.

We use iterative methods to solve the equation shown in Step 10 of Table 2. Several properties of $R$ are critical. First, $R$ is symmetric, nonnegative definite, and square by structure. Second, we assume (with no violations in practice for our application) that $R$ is nonsingular. These properties allow all of the mentioned iterative methods to be applied.

3.2 Matrix-Vector Multiply Algorithm
The core of our iterative algorithm is a method for computing $Rw$ for an arbitrary $w$ without explicitly performing the mapping from $r$ to $R$; this process saves considerable memory and eliminates the “redundancy” of computing with $R$. The basic idea is to utilize the structure of the matrix $R$. We specialize to the case when we map a class’s vector $r_k$ to a matrix structure as in Step 9 in Table 2. The matrix $R_k$ is obtained from a sum of outer products
$$R_k = \sum_{i=1}^{N_k} p(x_{k,i})\, p(x_{k,i})^t. \tag{7}$$
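The Kaczmarz update (6) of Section 3.1 is simple enough to sketch directly. The version below is a minimal illustration, assuming a cyclic sweep over the rows and a step $\mu = \text{relax} / \|a_j\|_2^2$; the parameter name `relax`, the sweep count, and the 2-by-2 test system are assumptions made for the example, not values taken from the experiments.

```python
def kaczmarz(R, b, sweeps=1000, relax=1.0):
    # Cyclic Kaczmarz: w <- w + mu * (b_j - a_j w) * a_j^t, one row at a time.
    n = len(b)
    w = [0.0] * n
    for _ in range(sweeps):
        for j in range(n):
            a = R[j]
            norm2 = sum(v * v for v in a)
            resid = b[j] - sum(av * wv for av, wv in zip(a, w))
            mu = relax / norm2      # 0 < mu * ||a_j||^2 < 2 holds when 0 < relax < 2
            w = [wv + mu * resid * av for wv, av in zip(w, a)]
    return w

R = [[4.0, 6.0], [6.0, 14.0]]       # small symmetric positive definite system
b = [2.0, 5.0]
w = kaczmarz(R, b)                  # approaches the solution of R w = b
```

Each update touches a single row of $R$, which is why the method pairs naturally with a routine that produces rows or products of $R$ on demand rather than storing the full matrix.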
The mapping in Step 9 is based upon the fact that it is not necessary to compute the sum of outer products (7) directly. Instead one can compute the subset of unique entries (i.e., the vector $p_2(x)$), and then map this result to the final matrix.

A straightforward way to implement the mapping is to precompute an index function. We first label each entry in the matrix $R_k$ in column major form from 1 to the number of entries. The structure of $R_k$ is determined by one outer product term from (7), $p(x) p(x)^t$. Using a computer algebra package, one can compute the outer product and $p_2(x)$ with symbolic variables. Then an exhaustive search for each entry of $R_k$ in $p_2(x)$ yields the required index map. An example of such a mapping is shown in Figure 1.

The difficulty in using an index function is that the index map must be stored. To avoid this problem, we propose an alternate method based upon a simple property: the semigroup structure of the monomials. Suppose we have an input vector with $n$ variables, $x_1, \ldots, x_n$. The mapping
$$x_{i_1} x_{i_2} \cdots x_{i_k} \mapsto q_{i_1} q_{i_2} \cdots q_{i_k} \tag{8}$$
where $q_i$ is the $i$th prime defines a semigroup isomorphism between the monomials and the natural numbers (we map 1 to 1). For example, we map $x_1^2 x_2 = x_1 x_1 x_2$ to $q_1^2 q_2 = 2^2 \cdot 3 = 12$ (2 is the first prime). We can implement the mapping in Table 2 efficiently using this semigroup isomorphism since it transforms symbol manipulation (monomials) into number manipulation. Based upon the mapping in (8), an algorithm for computing an arbitrary product, $y = Rw$, was derived; see Table 3.

The basic idea is as follows. We first compute the numerical equivalents to $p(x)$ and $p_2(x)$ using the mapping (8) in Steps 1-2. We sort the resulting vector $v_2$ so that it can be searched quickly. Steps 4-12 obtain the $i$th entry of $y$ using a matrix multiply; i.e., the $(i,j)$th entry of $R$ is multiplied by the $j$th entry of $w$ and summed over $j$ to obtain $y_i$.

Table 3. Calculation of $y = Rw$.

1) Let $q$ be the vector of the first $n$ primes.
2) Let $v = p(q)$ and $v_2 = p_2(q)$.
3) Sort $v_2$ into a numerically increasing vector, $v_2'$. Store the permutation, $\pi$, which maps $v_2'$ to $v_2$.
4) For $i = 1$ to (Number of rows of $R$)
5)    Let $y_i = 0$.
6)    For $j = 1$ to (Number of columns of $R$)
7)       Compute $m = v_i v_j$.
8)       Perform a binary search for $m$ in $v_2'$; call the index of the resulting location $i_m'$.
9)       Using the permutation $\pi$, find the index, $i_m$, in $v_2$ corresponding to the index $i_m'$ in $v_2'$.
10)      $y_i = y_i + r_{i_m} w_j$
11)   Next $j$
12) Next $i$
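The procedure in Table 3 can be sketched as below. This is an illustration under assumptions rather than the original code: the function names are hypothetical, the monomial ordering (lowest degree first) is a choice made here, and `r` is assumed to be stored in the same order as `monomials(x, 2 * K)`.

```python
from bisect import bisect_left
from itertools import combinations_with_replacement

def monomials(x, degree):
    # Monomials of x of degree 0..degree, lowest degree first.
    out = []
    for k in range(degree + 1):
        for idx in combinations_with_replacement(range(len(x)), k):
            prod = 1
            for i in idx:
                prod *= x[i]
            out.append(prod)
    return out

def matvec_from_unique(r, w, primes, K):
    # Steps 1-3: integer images of p(x) and p_2(x) under the prime map (8), then sort.
    v = monomials(primes, K)
    v2 = monomials(primes, 2 * K)
    order = sorted(range(len(v2)), key=lambda t: v2[t])   # permutation back to v2
    v2_sorted = [v2[t] for t in order]
    y = []
    for vi in v:                              # Steps 4-5: one row of R at a time
        acc = 0.0
        for vj, wj in zip(v, w):              # Step 6: columns of R
            m = vi * vj                       # Step 7: product of monomials as an integer
            pos = bisect_left(v2_sorted, m)   # Step 8: binary search
            acc += r[order[pos]] * wj         # Steps 9-10: look up the unique entry
        y.append(acc)
    return y

x = [0.5, 3.0]
r = monomials(x, 2)                           # unique entries for a single training vector
y = matvec_from_unique(r, [1.0, 2.0, -1.0], [2, 3], 1)
```

Because prime factorizations are unique, each integer product identifies exactly one monomial, so the lookup never needs the stored index map of Figure 1; only the sorted integer vector and its permutation are kept.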
Figure 1. Index function for a given input dimension and degree (horizontal axis: index in $R_k$; vertical axis: index in $r_k$).

4. Random Dimension Reduction
4.1 Direct Method
Another consideration in the design of polynomial networks is model complexity control. Because of the large increase of terms with degree, there are large jumps in model complexity. We would like to control this complexity to achieve better generalization and use less storage.

A natural way to control model complexity is to linearly transform “feature space.” That is, we replace the expansion, $p(x)$, by a transformed version, $T p(x)$. This results in solving the optimization problem
$$w_i^* = \operatorname{argmin}_{w} \| M T^t w - o_i \|_2$$
or equivalently the normal equation
$$T R T^t w_i = T M^t o_i.$$
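The effect of the transform on the normal equation can be checked numerically. In the sketch below, the entries of $T$ are drawn from a standard Gaussian distribution; that choice, the fixed seed, the tiny basis $[1, x, x^2]$, and the helper names are all assumptions made for illustration, since the construction of $T$ is not specified in this part of the text.

```python
import random

def matmul(A, B):
    # Plain nested-list matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

rng = random.Random(0)
M = [[1.0, x, x * x] for x in (0.0, 1.0, 2.0, 3.0)]    # rows are p(x)^t
d_out, d_in = 2, 3                                     # reduce 3 features to 2
T = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]

M_red = matmul(M, transpose(T))                 # rows are (T p(x))^t, i.e. M T^t
R_red = matmul(transpose(M_red), M_red)         # normal-equation matrix after reduction
R = matmul(transpose(M), M)
TRTt = matmul(matmul(T, R), transpose(T))       # the same matrix written as T R T^t
```

The reduced system has `d_out` unknowns per class regardless of the polynomial degree, which is what gives finer control over model complexity than changing the degree alone.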