Bayes Risk Weighted VQ and Learning VQ

Richard D. Wesel and Robert M. Gray
Department of Electrical Engineering, Stanford University
Proceedings of the 1994 Data Compression Conference

Abstract

This paper examines two vector quantization algorithms which can combine the tasks of compression and classification: Bayes risk weighted vector quantization (BRVQ), proposed by Oehler et al., and Optimized Learning Vector Quantization 1 (OLVQ1), proposed by Kohonen et al. BRVQ uses a parameter λ to control the tradeoff between compression and classification. BRVQ performance is studied over a range of λ values for four classification problems. Increasing the λ parameter in BRVQ is intended to improve classification performance. However, for two of the problems studied, increasing λ degraded classification performance. A majority-rule reclassification of the final codebook (using only the training set) greatly improves high-λ BRVQ performance for these cases. Finally, we compare the classification performance and mean squared error (MSE) performance of BRVQ to that of OLVQ1 for the four classification problems. BRVQ with codebook reclassification is found to have a lower MSE than OLVQ1 while maintaining comparable, but slightly inferior, classification performance.

1 Introduction

The methods for obtaining vector quantization codebooks simply to provide the lowest possible squared error distortion for a given codebook size [1] or entropy [2] are well understood and grounded solidly in theory. However, when the objective is classification or a combination of classification and low-distortion compression, questions remain about which algorithms are best and why algorithms behave as they do. In particular, to what extent are compression and classification conflicting goals? Also, which problems are "easy" for classification by vector quantization and which are "hard"?

Kohonen et al. [3, 4] describe an approach to classification based on vector quantization. This method can also be used to simultaneously compress and classify, although the codebook is not explicitly designed to provide good lossy compression. Oehler et al. [5, 6] provide a method for codebook design which explicitly tries to do a good job both at classification and at lossy compression.

email [email protected]


In this paper we examine the Bayes risk weighted vector quantization (BRVQ) algorithm of [5] in detail, examining how it controls the trade-off between classification and compression. Then we compare its performance, both in terms of classification error probability (Pe) and mean squared error (MSE) distortion, to the Optimized Learning Vector Quantization 1 (OLVQ1) algorithm of [3]. Our investigation of the trade-off between classification and compression seems to indicate that these two goals do not conflict to a large extent. The comparison of BRVQ to OLVQ1 provides insight into the types of problems that are "easy" or "hard" for classification by VQ.

2 Bayes Risk Weighted Vector Quantization

A vector quantizer (VQ) maps each input vector X onto a codebook of N codewords {Y_k : k = 1, 2, ..., N}. Let α(X) be the mapping from the inputs to the integer indices of the codewords, and let β(k) be the mapping from the codeword indices k to the codewords Y_k. α(X) is the encoder and β(k) is the decoder of the VQ. Let d(X, β(α(X))) be the distortion between the input vector X and its associated codeword β(α(X)). A common distortion measure is squared error ||X − β(α(X))||². The average distortion of a codebook for a given input vector distribution is D(α, β) = E[d(X, β(α(X)))]. If squared error distortion is being used, then D(α, β) is the mean squared error (MSE). Traditional VQ seeks to minimize D(α, β). For MSE distortion, this can be accomplished with the Lloyd algorithm [1].

Classification by VQ is performed by assigning a class to each input vector based on which codeword index was assigned to the input vector by the encoder. Let γ(k), or equivalently γ(α(X)), be this mapping between the codeword indices and the possible classes. Let c_X be the random variable which assigns class membership to X according to a probability mass function which depends on X. A classification error occurs when c_X ≠ γ(α(X)). A cost C_ij is assigned for classifying a vector actually in class i as being in class j. Typically C_ij = 0 if i = j. Let e_ij be the event where X is in class i but γ(α(X)) = j. The risk [5] is given by

R(\alpha, \gamma) = \sum_{i=1}^{M} \sum_{j=1}^{M} P\{e_{ij}\} C_{ij}    (1)

where M is the number of possible classes. If C_ij = 1 when i ≠ j and C_ij = 0 when i = j, then the risk is exactly the probability of classification error. As described in [5], BRVQ seeks a codebook which minimizes J(α, β, γ) = D(α, β) + λR(α, γ). BRVQ employs the following three-step procedure to minimize J(α, β, γ):

1. Choose γ^(t+1) to minimize J(α^(t), β^(t), γ^(t+1)).
2. Choose β^(t+1) to minimize J(α^(t), β^(t+1), γ^(t+1)).
3. Choose α^(t+1) to minimize J(α^(t+1), β^(t+1), γ^(t+1)).

For the rest of the paper, we assume that the distortion d is squared error and that the costs C_ij have been chosen so that the Bayes risk is exactly the probability of classification error. With these assumptions we have the following interpretation of BRVQ. Step 1 classifies each codebook index by the majority rule of its associated vectors. Step 2 changes each reproduction vector to be the centroid of its associated vectors. Step 3 assigns each input vector in the training set to the codebook index that satisfies

\alpha(X) = \arg\min_k \left\{ \|X - Y_k\|^2 + \lambda I[c_X \neq \gamma(k)] \right\}    (2)

where I[·] is the indicator function. Because class membership is not known for vectors outside of the training set, (2) is replaced by the nearest neighbor (Euclidean distance) rule for encoding and classifying the test data. In the next section, we observe poor performance because the codebook designed for the α(·) of (2) is applied to the test data with the nearest neighbor α(·).
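
To make the three steps concrete, the following is a minimal numpy sketch of one BRVQ design cycle under the assumptions above (squared-error distortion, 0/1 costs). The function and variable names are our own, and the cycle is written starting from the step 3 encoding; it is an illustration of the procedure, not the authors' implementation.

```python
import numpy as np

def brvq_cycle(X, labels, Y, codeword_class, lam, n_classes):
    """One BRVQ design cycle (a sketch of steps 1-3 above).

    X: (n, d) training vectors, labels: (n,) class indices,
    Y: (N, d) codebook, codeword_class: (N,) class index per codeword,
    lam: the Bayes risk weight lambda.
    """
    N = Y.shape[0]

    # Step 3: encode the training set with the modified nearest neighbor
    # rule of equation (2); a lambda penalty is added whenever a training
    # vector would be assigned to a codeword of the wrong class.
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)            # (n, N)
    penalty = np.where(labels[:, None] != codeword_class[None, :], lam, 0.0)
    assign = np.argmin(sq_dist + penalty, axis=1)                           # (n,)

    # Step 1: reclassify each codebook index by majority rule over its cell.
    for k in range(N):
        cell = labels[assign == k]
        if cell.size:
            codeword_class[k] = np.bincount(cell, minlength=n_classes).argmax()

    # Step 2: move each reproduction vector to the centroid of its cell.
    for k in range(N):
        cell = X[assign == k]
        if cell.size:
            Y[k] = cell.mean(axis=0)

    return Y, codeword_class, assign
```

With lam = 0 this reduces to a Lloyd iteration followed by majority-rule labeling, while lam = np.inf forces every training vector to a codeword of its own class, consistent with the extreme cases discussed in Section 3.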

3 BRVQ Performance vs. λ for Four Classification Problems

We will examine BRVQ performance vs. λ for four classification problems. Each problem consists of the mixture of two equally likely distributions. Table 1 gives the distributions of each problem's two classes. When examining Table 1, recall that the bivariate normal has the density given in (3). For all the bivariate normal (BN) distributions in Table 1, μ_x = μ_y = 0 and σ_x = σ_y = σ. For the uniform (U) distributions, Table 1 gives the region of support.

f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left( \frac{-1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\left(\frac{x-\mu_x}{\sigma_x}\right)\left(\frac{y-\mu_y}{\sigma_y}\right) + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right)    (3)

Problem Name      Class 0 Distribution               Class 1 Distribution
Different ρ's     BN with σ = 1, ρ = 1/√2            BN with σ = 1, ρ = 0
Different σ's     BN with σ = 2, ρ = 0               BN with σ = 1, ρ = 0
Diamond           U, (x, y) ∉ A, |x|, |y| ≤ 1        U, (x, y) ∈ A = {(x, y) : |x| + |y| ≤ 1}
Square            U, (x, y) ∉ B, |x|, |y| ≤ 1        U, (x, y) ∈ B = {(x, y) : |x|, |y| ≤ 1/√2}

Table 1: Distributions used for classification problems. The Different σ's problem was studied in [4], where it was referred to as the "hard" problem. The Diamond problem is essentially the one studied in [5].
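
To make the problem definitions concrete, below is a small numpy sketch that draws labeled samples for the Diamond and Different ρ's problems as we read Table 1. The class 0/class 1 assignment and the ρ = 1/√2 value are reconstructed from a garbled original, so treat this as an illustration rather than the authors' data generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_diamond(n):
    """Diamond problem: points uniform on the square |x|, |y| <= 1.
    The diamond A = {|x| + |y| <= 1} covers half the square's area, so
    labeling by region yields two equally likely classes."""
    xy = rng.uniform(-1.0, 1.0, size=(n, 2))
    labels = (np.abs(xy).sum(axis=1) <= 1.0).astype(int)  # 1 inside A, 0 outside
    return xy, labels

def sample_different_rhos(n, rho=1.0 / np.sqrt(2.0)):
    """Different rho's problem: equally likely mixture of two zero-mean
    bivariate normals with unit variances, one with correlation rho and
    one uncorrelated (class assignment assumed as in Table 1)."""
    labels = rng.integers(0, 2, size=n)
    cov = [np.array([[1.0, rho], [rho, 1.0]]),  # class 0: correlated
           np.eye(2)]                           # class 1: uncorrelated
    xy = np.empty((n, 2))
    for c in (0, 1):
        idx = labels == c
        xy[idx] = rng.multivariate_normal([0.0, 0.0], cov[c], size=idx.sum())
    return xy, labels
```

The Square and Different σ's problems can be sampled the same way with the obvious changes to the region B and to the covariance matrices.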

Extreme points of λ

When λ = 0, step 3 simplifies to the nearest neighbor rule. Thus the BRVQ algorithm simplifies to finding a codebook via the Lloyd algorithm and then classifying each codebook vector to the class most prevalent in its partition.

When λ = ∞, the encoder and the resulting codebook design algorithm are still well defined. Each vector is forced to be assigned to a codeword which will classify it correctly, to avoid infinite cost. This, in turn, prevents a codeword from ever changing from its initial class. Thus, for λ = ∞ the BRVQ algorithm reduces to designing a separate codebook for each class. This is essentially classified VQ [7]. However, unlike traditional classified VQ, the class information is not available to the encoder for selection of the appropriate codebook. Instead, the final codebook is simply the union of the separate codebooks.

Empirical Study of Performance vs. λ

Every point plotted in this paper represents the average of five runs. Each run used a 100,000 point training set and a 10,000 point test set. Thus each point involved 10 different data sets. For this section, the codebook size was fixed at 32. Figures 1 and 2 show MSE vs. λ and Pe vs. λ for the Diamond problem and the Different ρ's problem. In each graph there is a clear phase transition from the λ = 0 dashed line to the λ = ∞ dashed line.


Figure 1: MSE vs. λ for the Diamond problem and the Different ρ's problem.

Figure 1 shows MSE performance vs. λ. As expected, MSE increases with increasing λ, as in [5]. For the Diamond problem, there is only a 5% variation in MSE over the whole range of λ. For the Different ρ's problem there is a 19% variation in MSE. However, the best classification is achieved for the lowest MSE values.

The Square and Different σ's problems (not shown) had MSE vs. λ curves similar to those of the Diamond and Different ρ's problems, respectively. For the Square problem, there was only a 3% variation in MSE. For the Different σ's problem, there was a 21% variation in MSE, but the best classification was achieved for the lowest MSE values.


Figure 2: Pe vs. λ for the Diamond problem and the Different ρ's problem. The classification error probability after reclassification for the Different ρ's problem is indicated by o's.

Figure 2 shows Pe vs. λ. Again, the Square and Different σ's problems (not shown) had Pe vs. λ curves similar to those of the Diamond and Different ρ's problems, respectively. For the Diamond problem, Pe decreases with increasing λ, as in [5]. For the Different ρ's problem, Pe (before reclassification) increased with increasing λ. This result seems counterintuitive at first, since higher values of λ give more weight to classification.

Majority rule reclassification

As mentioned at the end of Section 2, there is a mismatch between the actual encoding rule and the encoding rule used for codebook design. With regard to the actual encoding rule, some of the codebook vectors may be misclassified. Thus, improved Pe performance can be obtained by reclassifying the final codebook according to the actual encoding rule (i.e., nearest neighbor). This can be accomplished by performing

a final iteration of step 1 of the algorithm with α(·) changed to the nearest neighbor encoder. This reclassification does not affect MSE. Figure 2 shows the effect reclassification had on Pe for the Different ρ's problem. Reclassification dramatically lowered classification error for high values of λ. A similar improvement was observed for the Different σ's problem.

Misclassification of the type discussed above will be rare in cases where the two classes occupy disjoint regions. Thus for the Diamond and Square problems, reclassification should not have much effect. In fact, only negligible performance differences were observed for these two problems.
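
As a minimal sketch of this reclassification pass (assuming a Euclidean nearest-neighbor encoder; the function and names are ours, not the authors' code):

```python
import numpy as np

def reclassify_codebook(X, labels, Y, n_classes):
    """Majority-rule reclassification of a trained codebook.

    Encode the training set with the plain nearest neighbor rule (the
    encoder actually used on test data) and relabel each codeword with the
    most frequent class in its cell. The reproduction vectors themselves
    are untouched, so MSE is unchanged."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    assign = np.argmin(sq_dist, axis=1)
    codeword_class = np.zeros(len(Y), dtype=int)
    for k in range(len(Y)):
        cell = labels[assign == k]
        if cell.size:
            codeword_class[k] = np.bincount(cell, minlength=n_classes).argmax()
    return codeword_class
```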

4 Learning Vector Quantization

As with BRVQ, the LVQ algorithms each produce a codebook of classified vectors. Test vectors are then assigned to the class of the nearest codebook vector. BRVQ uses the entire training data set to compute its update for each iteration. However, the LVQ algorithms use only one sample from the training set for each iteration. In all of the LVQ algorithms, the class of each codebook vector remains fixed throughout training. Only the position of the codebook vectors can be adapted. Thus it is possible that performance can be improved by reclassification of the final codebook vectors according to the majority rule of the vectors in the partition.

Each LVQ algorithm starts with an initial codebook {Y_k(1) : k = 1, 2, ..., N} and sequentially updates the codebook using one vector from the training set for each update. Thus on iteration t, the training vector X(t) is processed to produce the new codebook {Y_k(t + 1) : k = 1, 2, ..., N}. The algorithms differ in how the vector from the training set is used to update the codebook. There are four LVQ algorithms implemented in the LVQ-PAK [3] software package: LVQ1, LVQ2.1, LVQ3, and OLVQ1. We will restrict our attention to LVQ1 and OLVQ1.

On each iteration t, LVQ1 adapts only the codebook vector nearest to X(t). Let

c = \arg\min_k \{ \|X(t) - Y_k(t)\| \}    (4)

Y_k(t + 1) is computed as shown in (5), where s(t) is +1 if X(t) and Y_c(t) are in the same class and −1 otherwise.

Y_k(t+1) = \begin{cases} Y_k(t) & k \neq c \\ Y_k(t) + s(t)\,\alpha(t)\,[X(t) - Y_k(t)] & k = c \end{cases}    (5)

The time-varying parameter α(t) should only take on values between 0 and 1. In the LVQ-PAK software package, α(t) is initialized to 0.1 and decreases linearly.

OLVQ1 is so named because it is an optimization of LVQ1. The parameter α(t) from LVQ1 is allowed to be different for each codebook vector, and its value is changed in a more complex way than the linear decrease of LVQ1. The update is essentially the one shown in (5). However, now α(t) is replaced by α_c(t), the correct parameter for this codebook vector. The parameter α_c(t) is updated as shown in (6).

\alpha_c(t) = \frac{\alpha_c(t-1)}{1 + s(t)\,\alpha_c(t-1)}    (6)

The idea behind the adaptation of α_c(t) in (6) is that each input vector from the training set should be given the same weight in affecting the position of the codebook vector. At iteration t, the contribution of X(t − 1) is scaled by [1 − s(t)α_c(t)]α_c(t − 1). Equation (6) is simply the solution of the equation that results from setting α_c(t) equal to [1 − s(t)α_c(t)]α_c(t − 1).
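
The sketch below shows one OLVQ1 update as we read equations (4)-(6); the names are ours, this is not the LVQ-PAK code, the order of the α and codeword updates follows our reading of the derivation above, and the cap on α_c is our own safeguard (s = −1 makes (6) increase α_c).

```python
import numpy as np

def olvq1_step(x, x_class, Y, codeword_class, alpha, alpha_max=0.1):
    """One OLVQ1 update on training sample x with class x_class.

    Y: (N, d) codebook, codeword_class: (N,) fixed class labels,
    alpha: (N,) per-codeword learning rates."""
    c = int(np.argmin(((Y - x) ** 2).sum(axis=1)))      # nearest codeword, eq. (4)
    s = 1.0 if codeword_class[c] == x_class else -1.0   # classes agree: +1, disagree: -1
    alpha[c] = alpha[c] / (1.0 + s * alpha[c])          # eq. (6)
    alpha[c] = min(alpha[c], alpha_max)                 # cap is our assumption, not in the paper
    Y[c] = Y[c] + s * alpha[c] * (x - Y[c])             # eq. (5); only k = c moves
    return Y, alpha
```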

5 Comparison of BRVQ to OLVQ1

OLVQ1 appears to be the newest routine in the LVQ-PAK software package. The LVQ-PAK manual does not give information about the relative performance of the four algorithms, but it recommends that learning always be started with the OLVQ1 algorithm. Furthermore, the manual implies that all four algorithms have similar accuracies. For these reasons, the studies in this paper use OLVQ1 to represent the LVQ family. Based on a study of performance vs. number of iterations, it was decided to run the OLVQ1 algorithm for 100,000 iterations.

From Section 3, BRVQ performance was consistently good for λ = ∞ as long as the codebook was reclassified by majority rule after training. We use this version of BRVQ in our studies below. To provide a reference point for MSE, we also apply the Lloyd algorithm (BRVQ with λ = 0) to the four classification problems. Thus, the three algorithms we compared are OLVQ1 with 100,000 iterations, BRVQ with λ = ∞ and codebook reclassification, and the Lloyd algorithm.

Classification Performance

The top plot in Figure 3 shows the classification performance for the Diamond problem. BRVQ and OLVQ1 perform similarly, with neither algorithm having consistently lower Pe. The Lloyd algorithm had noticeably larger Pe's than those of BRVQ and OLVQ1. This is not surprising; Lloyd was not designed as a classification algorithm. Similar classification performance was observed for the Square problem. Codebooks of size 100 or more are required for these two problems to achieve Pe's approaching the Bayes rule Pe of zero. Thus these problems would appear to be "hard" problems for classification by VQ. However, VQ codebooks of size 8 exist, although not found by the algorithms studied, which would give Pe = 0.

The bottom plot in Figure 3 shows classification performance for the Different ρ's problem. BRVQ consistently had a larger Pe than OLVQ1. The Lloyd algorithm performed comparably to BRVQ and OLVQ1. As codebook size increased, the Pe's of the three algorithms became similar. Similar classification performance was observed for the Different σ's problem. With size 8 codebooks all three algorithms are already quite close to the Bayes rule Pe for these problems. This includes the Lloyd algorithm, which is doing classification only as an afterthought. These appear to be "easy" classification problems for VQ classifiers.


Figure 3: Pe vs. codebook size for the Diamond problem and the Different ρ's problem.

Mean Squared Error Performance

Figure 4: MSE vs. codebook size for the Diamond problem and the Different ρ's problem.

Figure 4 shows the MSE performance vs. codebook size for the Diamond problem and the Different ρ's problem. MSE performance was similar for all four problems studied. The three algorithms always had the same ordering at all the codebook sizes considered. As illustrated in Figure 4, the Lloyd algorithm always had the best MSE performance. The BRVQ algorithm was always next, and OLVQ1 always had the worst MSE performance. As the codebook size increased, the performance of all three algorithms became similar.

It should be noted that OLVQ1 does not explicitly try to minimize MSE. In fact, MSE performance could be improved with no degradation of classification performance by using the centroid of the training vectors in each partition cell as its reproduction vector.
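
A sketch of that centroid replacement, under the reading that the encoder keeps partitioning with the original codewords (so classification is unchanged) while only the decoder's reproduction values move; the function is ours, not the authors' code.

```python
import numpy as np

def centroid_reproductions(X, Y):
    """Compute centroid reproduction values for an existing (float) codebook.

    The nearest-neighbor partition is formed with the original codewords Y,
    so the index (and hence the class) assigned to each input is unchanged;
    returning the per-cell centroids as decoder outputs can only lower the
    training-set MSE."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    assign = np.argmin(sq_dist, axis=1)
    reproductions = Y.copy()
    for k in range(len(Y)):
        cell = X[assign == k]
        if cell.size:
            reproductions[k] = cell.mean(axis=0)
    return reproductions
```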

6 Conclusions

Codebook reclassification was shown to improve the poor BRVQ classification performance resulting from the use of one encoder for training and a different encoder for the test data.


Better performance can be obtained by using one encoder for both training and test, as is always done in standard VQ for lossy compression. Some work has recently been done in this direction [8, 9, 10]. If Pr[c_X ≠ γ(k)] is known or can be estimated from the training set, then I[c_X ≠ γ(k)] can be replaced with Pr[c_X ≠ γ(k)] or its estimate in (2). This produces one encoder which can be used both for training and test. For this new encoder, BRVQ with λ = ∞ is exactly classified VQ as in [7]. As a result, at λ = ∞ it will provide a Pe which is exactly the Bayes rule Pe if the estimate of Pr[c_X ≠ γ(k)] is perfect (assuming there are at least as many codewords as classes).

For the Diamond and Square problems we saw that BRVQ and OLVQ1 only began to approach the Bayes rule Pe for large codebook sizes. These two problems had likelihood ratios that changed from zero to ∞ at the decision boundaries. Thus, if the decision boundary is not exactly as prescribed by the Bayes rule, the Pe will increase dramatically. Perhaps this extreme sensitivity explains the poor small-codebook performance of the algorithms on these two problems.

The two problems we studied with a large Bayes rule Pe had slowly changing

likelihood ratios, making it less critical that the decision boundaries be precisely in the correct place. For these problems the algorithms studied achieved a Pe close to the Bayes rule even for codebooks of size 8. A problem should not be considered "hard" or "easy" for classification by VQ based on its Bayes rule Pe, but rather by how difficult it is to achieve the Bayes rule Pe. An important factor affecting this is the rate of change of the likelihood ratio.
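
Returning to the single-encoder variant discussed above, the modified rule amounts to replacing the indicator in (2) with the (estimated) misclassification probability; in our notation (the conditioning on X through c_X is implicit, and the hat marks an estimate from the training set):

\alpha(X) = \arg\min_k \left\{ \|X - Y_k\|^2 + \lambda \, \widehat{\Pr}[c_X \neq \gamma(k)] \right\}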

Acknowledgements

The authors would like to thank Karen Oehler for providing the first author with an excellent introduction to the BRVQ algorithm and for helpful comments as the work progressed.

References

[1] Allen Gersho and Robert M. Gray. Vector Quantization and Signal Compression. Kluwer Academic Publishers, 1992.
[2] P. A. Chou, T. Lookabaugh, and R. M. Gray. Entropy-constrained vector quantization. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, January 1989.
[3] Teuvo Kohonen, Jari Kangas, Jorma Laaksonen, and Kari Torkkola. LVQ-PAK: The Learning Vector Quantization Program Package. Helsinki University of Technology, Laboratory of Computer and Information Science, 1992. Available via anonymous ftp to cochlea.hut.fi (130.233.168.48).
[4] Teuvo Kohonen, Gyorgy Barna, and Ronald Chrisley. Statistical pattern recognition with neural networks: Benchmarking studies. In IEEE International Conference on Neural Networks, 1988.
[5] Karen L. Oehler and Robert M. Gray. Combining image classification and image compression using vector quantization. In Data Compression Conference, Snowbird, Utah, 1993.
[6] Karen L. Oehler, Pamela C. Cosman, Robert M. Gray, and J. May. Classification using vector quantization. In Twenty-Fifth Annual Asilomar Conference on Signals, Systems, and Computers, 1991.
[7] B. Ramamurthi and A. Gersho. Classified vector quantization of images. IEEE Transactions on Communications, COM-34(11):1105-1115, November 1986.
[8] K. L. Oehler and R. M. Gray. Combining image compression and classification using vector quantization. 1993. Submitted for possible publication.
[9] R. M. Gray, K. L. Oehler, K. O. Perlmutter, and R. A. Olshen. Combining tree-structured vector quantization with classification and regression trees. In R. Madon, editor, Proceedings of the Twenty-Seventh Annual Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, November 1993.
[10] K. O. Perlmutter, R. M. Gray, K. L. Oehler, and R. A. Olshen. Bayes risk weighted tree-structured vector quantization with posterior estimation. In Data Compression Conference, Snowbird, Utah, 1994.
