Optimization Feature Compression and FNN Realization

Shifei Ding 1,2, Zhongzhi Shi 2, and Fengxiang Jin 3

1 College of Information Science and Engineering, Shandong Agricultural University, Taian 271018, P.R. China
2 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
3 College of Geo-Information Science and Engineering, Shandong University of Science and Technology, Qingdao 266510, China
[email protected],
[email protected],
[email protected]
Abstract. Feature compression is one of the most important steps in pattern recognition. In this paper, based on the minimum squared error (MSE) rule, we first give the discrete K-L transform (DKLT). Following the idea of the entropy function, we propose two new entropy functions, the entropy density function (EDF) and the representation entropy (RE), with which the information content of a data matrix X can be measured. Secondly, an optimization feature compression method is put forward, and the degree of information compression is measured by the feature compression rate (FCR) and the accumulated feature compression rate (AFCR), both proposed here by the authors. Finally, we give an adaptive feedforward neural network (FNN) and an improved adaptive FNN to realize optimization feature compression. This method can be applied to optimization feature compression in biology, biomathematics, ecology, bioinformation science, and other fields.

Keywords: pattern recognition, optimization feature compression, entropy density function (EDF), representation entropy (RE), feedforward neural network (FNN).
1 Introduction

Pattern recognition is the scientific discipline whose goal is the classification of objects into a number of categories or classes. Depending on the application, these objects can be images, signal waveforms, or any other type of measurements that need to be classified; we refer to them using the generic term patterns. Pattern recognition has a long history, but before the 1960s it was mostly the output of theoretical research in statistics. Feature extraction, or feature compression, is one of the most important steps in pattern recognition [1-4]. Many scholars have carried out extensive research on the compression of feature dimensions and have applied methods such as correlation analysis, principal component analysis, and rough sets [5-8]. In this paper we continue this line of work: following the idea of the entropy function in information theory [9, 10], we study the problem of optimization feature compression and give a realization method based on an adaptive feedforward neural network (FNN).
2 Optimization Feature Compression

2.1 Entropy Density Function

Applying the DKLT $y = T'x$, where the columns of $T$ are the eigenvectors of the covariance matrix $\Sigma_x$ of $x$, the covariance matrix of the transformed vector $y$ is

$$\Sigma_y = E[(y - \bar{y})(y - \bar{y})'] = T' \Sigma_x T = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) \qquad (1)$$

From formula (1) we can see that the DKLT diagonalizes $\Sigma_x$, and that $\lambda_i$ is the variance of $y_i$:

$$\lambda_i = \sigma_i^2(T) = E[(y_i - \mu_i)^2] \qquad (2)$$

where $\mu_i = E(y_i)$ is the expectation of $y_i$. Let
$$\rho_i(T) = \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j} \qquad (3)$$

where $0 \le \rho_i(T) \le 1$ and $\sum_{i=1}^{n} \rho_i(T) = 1$.
Because $\{\rho_i(T)\}$ has the numerical characteristics of a probability distribution, we can use the idea of the entropy function and, starting from formula (3), define a new entropy measure, the entropy density function (EDF), which represents the discriminating capacity of $\lambda_i$:

$$I(\lambda_i) = -\rho_i(T)\log\rho_i(T) = -\Bigl(\lambda_i \Big/ \sum_{j=1}^{n}\lambda_j\Bigr)\log\Bigl(\lambda_i \Big/ \sum_{j=1}^{n}\lambda_j\Bigr) \qquad (4)$$
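As an illustration (not part of the original paper), the quantities in formulas (1), (3) and (4) can be computed directly from a data matrix; a minimal NumPy sketch, in which the function name entropy_density is ours:

```python
import numpy as np

def entropy_density(X):
    """Eigenvalues of the sample covariance matrix, the weights rho_i of
    formula (3), and the EDF values I(lambda_i) of formula (4).
    X has one sample per row and one feature per column."""
    cov = np.cov(X, rowvar=False)              # estimate of Sigma_x
    eigvals, eigvecs = np.linalg.eigh(cov)     # DKLT basis: eigenvectors of Sigma_x
    order = np.argsort(eigvals)[::-1]          # descending eigenvalue order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    rho = eigvals / eigvals.sum()              # formula (3)
    edf = -rho * np.log(rho)                   # formula (4)
    return eigvals, eigvecs, rho, edf
```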
2.2 Optimization Feature Compression
According to the EDF, we can compute all EDF values $I(\lambda_i)$ $(i = 1, 2, \ldots, n)$ and order them as $I(\lambda_1) \ge I(\lambda_2) \ge \cdots \ge I(\lambda_m) \ge \cdots \ge I(\lambda_n)$. In order to measure the degree of information compression, we define the feature compression rate (FCR) of the $i$-th eigenvalue as

$$\mathrm{FCR}(\lambda_i) = I(\lambda_i) \Big/ \sum_{j=1}^{n} I(\lambda_j), \qquad (i = 1, 2, \ldots, n) \qquad (5)$$
We then define the accumulated feature compression rate (AFCR) of the first $m$ eigenvalues as

$$\mathrm{AFCR}(m) = \sum_{i=1}^{m} \mathrm{FCR}(\lambda_i) \Big/ \sum_{i=1}^{n} \mathrm{FCR}(\lambda_i) \ge \alpha \qquad (6)$$
In formula (6) we may take $\mathrm{AFCR}(m) \ge \alpha$ with $\alpha$ = 80%, 90%, or 95%. For example, $\alpha$ = 95% indicates that the information content of $y$ obtained by the DKLT is 95% of that of $x$. For a given $\alpha$ we can determine the $m$ EDFs that make $\mathrm{AFCR}(m) \ge \alpha$, and thus select the $m$ eigenvectors $t_1, t_2, \ldots, t_m$ corresponding to the first $m$ eigenvalues. The DKLT components of $x$ are then

$$y_i = t_i' x \qquad (i = 1, 2, \ldots, m;\; m \le n) \qquad (7)$$

Taking these as the new features, we achieve the goal of optimization feature compression.
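Continuing the sketch above (and reusing entropy_density from it), the selection rule of formulas (5)-(7) might look as follows; the helper name compress_features and the default threshold are illustrative assumptions:

```python
def compress_features(X, alpha=0.95):
    """Select the smallest m with AFCR(m) >= alpha (formulas (5)-(6)) and
    project the centred data onto the first m eigenvectors (formula (7))."""
    eigvals, eigvecs, rho, edf = entropy_density(X)
    fcr = edf / edf.sum()                      # formula (5): feature compression rate
    afcr = np.cumsum(fcr)                      # formula (6): here the denominator equals 1
    m = int(np.searchsorted(afcr, alpha) + 1)  # smallest m with AFCR(m) >= alpha
    Xc = X - X.mean(axis=0)                    # centre the data
    Y = Xc @ eigvecs[:, :m]                    # formula (7): y_i = t_i' x
    return Y, m
```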
3 FNN Realization

In this paper, according to formulas (5), (6) and (7), optimization feature compression can be realized by an adaptive feedforward neural network (FNN), whose structure is illustrated in Fig. 1.

Fig. 1. Adaptive FNN structure of the DKLT (inputs $x_1, x_2, \ldots, x_n$; outputs $y_1, y_2, \ldots, y_m$)
Here the input layer has $n$ neurons, corresponding to the $n$ components of an original sample of the training set, with input vector $X = (x_1, x_2, \ldots, x_n)'$; the output layer has $m$ neurons, corresponding to the $m$ components of a compressed sample, with output vector $Y = (y_1, y_2, \ldots, y_m)'$. The corresponding connection weight matrix is composed of eigenvectors of the covariance matrix $\Sigma_x$, that is

$$T = (t_{ji}), \qquad (j = 1, 2, \ldots, m;\; i = 1, 2, \ldots, n) \qquad (8)$$

In the adaptive process, the relationship between input and output is

$$y(t) = T(t)\, x(t) \qquad (9)$$

where

$$y(t) = (y_1(t), y_2(t), \ldots, y_m(t))', \quad x(t) = (x_1(t), x_2(t), \ldots, x_n(t))', \quad T(t) = (t_1(t), t_2(t), \ldots, t_m(t))' \qquad (10)$$
In order to accomplish optimization feature compression, we can adopt the following adaptive algorithm:

$$T(t+1) = T(t) + r(t)\,[\,x(t)\,y'(t) - T(t)\,\mathrm{sup}(y(t)\,y'(t))\,] \qquad (11)$$

where $r(t)$ denotes the step size and $\mathrm{sup}(\cdot)$ denotes the operator that sets all elements of its matrix argument below the diagonal to zero. With this network and algorithm, the computational load is heavy, because all the other weights must be recomputed again and again whenever any single weight changes. A sketch of update rule (11) is given below.
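For illustration, a minimal NumPy sketch of update rule (11), assuming the weight matrix is stored with the weight vectors $t_1, \ldots, t_m$ as columns so that the dimensions in (11) match; the function name and the step-size handling are ours, not from the paper:

```python
def adaptive_dklt_step(T, x, r):
    """One step of the adaptive rule (11).
    T is an n x m weight matrix whose columns are the weight vectors t_1..t_m,
    x is an n-dimensional input sample, and r is the step size r(t)."""
    y = T.T @ x                                # network output y(t)
    sup = np.triu(np.outer(y, y))              # sup(y y'): zero all elements below the diagonal
    return T + r * (np.outer(x, y) - T @ sup)  # formula (11)
```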
To reduce this cost, an improved adaptive FNN for the DKLT is proposed here; its structure is illustrated in Fig. 2.

Fig. 2. Improved adaptive FNN structure of the DKLT (inputs $x_1, \ldots, x_n$; outputs $y_1, \ldots, y_m$; lateral weights $s_1(t), s_2(t), \ldots$)
In this improved adaptive FNN, the connection weights of the $m$-th neuron can be obtained recursively from the connection weights of the first $m-1$ neurons. The relationship between network input and output can be expressed as

$$y(t) = T(t)\, x(t) \qquad (12)$$

$$y_m(t) = T_m'(t)\, x(t) + s'(t)\, y(t) \qquad (13)$$

where

$$y(t) = [y_1(t), y_2(t), \ldots, y_{m-1}(t)]', \qquad T(t) = (t_{ji}(t)), \quad j = 1, 2, \ldots, m-1;\; i = 1, 2, \ldots, n \qquad (14)$$
Here $T_m(t) = (t_{m1}(t), t_{m2}(t), \ldots, t_{mn}(t))'$ is the connection weight vector of the $m$-th output neuron, and $s(t) = (s_1(t), s_2(t), \ldots, s_{m-1}(t))'$ is the vector of lateral connection weights from the first $m-1$ neurons to the $m$-th neuron. The improved adaptive FNN algorithm is

$$T_m(t+1) = T_m(t) + B\,[\,y_m(t)\, x(t) - y_m^2(t)\, T_m(t)\,] \qquad (15)$$

$$s(t+1) = s(t) - r\,[\,y_m(t)\, y(t) + y_m^2(t)\, s(t)\,] \qquad (16)$$

where $B$ and $r$ are the learning step sizes of $T_m(t)$ and $s(t)$, respectively.
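The improved rule of formulas (12)-(16) might be sketched as follows, assuming a column-vector convention and that the weights of the first $m-1$ neurons are already trained and held fixed during this step; all names are illustrative:

```python
def improved_fnn_step(T_prev, T_m, s, x, B, r):
    """One iteration of the improved adaptive rule for the m-th neuron.
    T_prev: (m-1) x n weights of the first m-1 neurons (already trained, fixed here),
    T_m:    n-vector, weight vector of the m-th neuron,
    s:      (m-1)-vector, lateral weights from the first m-1 neurons to the m-th,
    x:      n-dimensional input sample; B, r: learning step sizes."""
    y_prev = T_prev @ x                           # formula (12): first m-1 outputs
    y_m = T_m @ x + s @ y_prev                    # formula (13): m-th output
    T_m_new = T_m + B * (y_m * x - y_m**2 * T_m)  # formula (15)
    s_new = s - r * (y_m * y_prev + y_m**2 * s)   # formula (16)
    return T_m_new, s_new, y_m
```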
4 Experiment

Generally speaking, the covariance matrix $\Sigma_x$ of the data matrix $x$ is unknown. In practical applications, $\Sigma_x$ is estimated by the sample covariance matrix $S$. In order to eliminate the influence of the different units of the features, the data matrix is standardized; the sample covariance matrix of the standardized data is then the correlation coefficient matrix $R$, that is, $S = R$. From the measured data we compute the correlation matrix $R$, and applying the Jacobi method to $R$ we obtain its 9 eigenvalues, listed in Table 1.

Table 1. Computing results

No.     λi        ρi        I(λi)     FCR(λi) (%)   AFCR (%)
1       5.1639    0.5738    0.4598    24.2408        24.2408
2       1.4888    0.1654    0.4294    22.6381        46.8789
3       0.9769    0.1085    0.3477    18.3309        65.2098
4       0.8544    0.0949    0.3224    16.9970        82.2068
5       0.2400    0.0267    0.1396     7.3598        89.5666
6       0.1623    0.0180    0.1043     5.4987        95.0653
7       0.0787    0.0087    0.0595     3.1369        98.2022
8       0.0228    0.0025    0.0216     1.1388        99.3410
9       0.0121    0.0013    0.0125     0.6590       100.0000
Total   9.0000
From Table 1, if we take $m = 4$, then through the DKLT the information content contained in the data vector $y$ is above 82% of the total information content. After $x$ is compressed by the DKLT, i.e. $y_i = t_i' x$ $(i = 1, 2, 3, 4)$, the 9-dimensional (9-D) feature vector $x$ is compressed to a 4-dimensional (4-D) vector $y$ that retains 82% of the total information content of $x$, which achieves the aim of optimization feature compression.
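For completeness, the experimental procedure of this section can be outlined in code. The measured 9-feature data are not reproduced in the paper, so the matrix `data` below is a random placeholder and its numbers will not match Table 1:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((100, 9))                      # placeholder for the measured 9-feature data

Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)  # standardize each feature
R = np.cov(Z, rowvar=False)                      # covariance of standardized data = correlation matrix (S = R)

eigvals, eigvecs = np.linalg.eigh(R)             # the paper uses the Jacobi method for this step
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

rho = eigvals / eigvals.sum()
edf = -rho * np.log(rho)
fcr = 100 * edf / edf.sum()                      # FCR in percent, cf. Table 1
afcr = np.cumsum(fcr)                            # AFCR in percent
m = int(np.searchsorted(afcr, 82.0) + 1)         # smallest m with AFCR >= 82%; m = 4 for the data of Table 1
Y = Z @ eigvecs[:, :m]                           # compressed feature vectors, cf. formula (7)
```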
5 Conclusions

Based on the DKLT and the idea of the entropy function in information theory, we have proposed two new concepts, the EDF and the RE, with which the information content of a data matrix X can be measured. An optimization feature compression method has been put forward, and the degree of information compression is measured by the FCR and the AFCR proposed by the authors. At the same time, we have given an adaptive FNN and an improved adaptive FNN to realize optimization feature compression. We believe that this method can provide a sound theoretical basis for researchers in biology, biomathematics, ecology, bioinformation science, and other fields.
Acknowledgements

This work is supported by the National Natural Science Foundation of China under grants no. 60435010 and no. 40574001, the China Postdoctoral Science Foundation
under grant no. 2005037439, the Key Laboratory Opening Foundation of Crop Biology of Shandong Province under grant no. 20040010, and the Doctoral Science Foundation of Shandong Agricultural University under grant no. 23241.
References

1. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
2. Devroye, L., Gyorfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York (1996)
3. Fukunaga, K.: Introduction to Statistical Pattern Recognition, Second Edition. Academic Press, New York (1990)
4. Ding, S.F., Shi, Z.Z.: Studies on Incidence Pattern Recognition Based on Information Entropy. Journal of Information Science 31(6) (2005) 497-502
5. Yang, J., Yang, J.Y.: A Generalized K-L Expansion Method That Can Deal with Small Sample Size and High-dimensional Problems. Pattern Analysis and Applications 6(6) (2003) 47-54
6. Zeng, H.L., Yu, J.B., Zeng, Q.: System Feature Reduction on Principal Component Analysis. Journal of Sichuan Institute of Light Industry and Chemical Technology 12(1) (1999) 1-4
7. Zeng, H.L., Yuan, Z.R.: About a New Approach of Selection and Reduction on System Feature in Pattern Recognition. Journal of Sichuan Institute of Light Industry and Chemical Technology 12(4) (1999) 1-5
8. Ji, X.J., Li, S.Z., Li, T.: Application of the Correlation Analysis in Feature Selection. Journal of Test and Measurement Technology 15(1) (2001) 15-18
9. Shannon, C.E.: A Mathematical Theory of Communication. Bell System Technical Journal 27 (1948) 379-423
10. Jiang, D. (ed.): Information Theory. The Publishing House of China University of Science and Technology, Hefei (1987)