A Discriminant Function Considering Normality Improvement of the Distribution

Hidenori Ujiie, Shinichiro Omachi, and Hirotomo Aso
Department of Electrical and Communication Engineering, Graduate School of Engineering, Tohoku University
Aoba 05, Aramaki, Aoba-ku, Sendai-shi, 980-8579 Japan
{ujiie,machi,aso}@aso.ecei.tohoku.ac.jp

Abstract

In statistical pattern recognition, the class conditional probability distribution is estimated and used for classification. Since it is impossible to estimate the true distribution, the distribution is usually assumed to follow a certain parametric model, such as the normal distribution, and the parameters that represent the distribution are estimated from training data. However, there is no guarantee that the model is appropriate for the given data. In this paper, we propose a method to improve classification accuracy by transforming the distribution of the given data closer to the normal distribution. We show how to modify the traditional quadratic discriminant function (QDF) in order to deal with the transformed data. Finally, we present some properties of the transformation and show the effectiveness of the proposed method through experiments with public databases.
1. Introduction

When we consider the problem of classifying an unknown observation x ∈ R^d into one of C different classes ω₁, ω₂, ..., ω_C, Bayes' theorem is used in many cases: x is classified into the class with the maximum a posteriori probability. Many studies have been made on estimating the true class conditional probability density function (pdf), but this is difficult. Usually, when the true class conditional pdf is unknown, it is assumed to be a multivariate normal distribution. When the class conditional pdf of the data is a multivariate normal distribution and the mean and the covariance matrix are known, the QDF derived from the multivariate normal distribution is optimal [5]. The QDF is defined as follows:

$$g_i(x) = \frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i) + \frac{1}{2}\ln|\Sigma_i| - \ln P(\omega_i), \qquad (1)$$
where μ_i and Σ_i are the mean vector and the covariance matrix of class ω_i, respectively, and P(ω_i) is the a priori probability of class ω_i. In many cases, however, the true class conditional pdf cannot be regarded as normal. A few researchers in the pattern recognition field have paid attention to the normality of the pdf of the data, and transformation of the observations is known to be useful for improving classification accuracy [3, 6]. Wakabayashi et al. [6] proposed a method that transforms observations by taking the square root and showed that the performance of QDF is enhanced by the transformation. The validity of the square root transformation rests on the fact that observations obtained by counting elements usually follow a gamma distribution. Moreover, this transformation cannot be applied to negative values. The square root transformation is thus a heuristic method, and the distribution of the observations must be taken into account. In this paper, we propose a method to improve classification accuracy by transforming the given data so that the distribution of the transformed data is closer to the normal distribution, whatever the original distribution. The method finds the optimal transformation theoretically, using the Box-Cox transformation [2] with the exponential power transformation. We show how to modify the traditional QDF derived from the normal distribution in order to deal with the transformed data, and we present some properties of the transformation together with experiments on public databases. Since we do not assume a particular distribution of the original data, the method is effective for any kind of pdf; moreover, it can deal with negative observation values.
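As a concrete point of reference before Eq. (1) is modified in Section 2, the following is a minimal NumPy sketch of the QDF; the function and parameter names are ours, and the per-class parameters are assumed to be estimated from training data beforehand.

```python
import numpy as np

def qdf(x, mu, cov, prior):
    """Eq. (1): negative log-posterior up to a constant; smaller is better."""
    diff = np.asarray(x, dtype=float) - mu
    _, logdet = np.linalg.slogdet(cov)   # numerically safer than log(det(cov))
    return (0.5 * diff @ np.linalg.inv(cov) @ diff
            + 0.5 * logdet
            - np.log(prior))

# Usage (illustrative): an observation is assigned to the class whose QDF
# value is smallest, e.g. label = min(params, key=lambda c: qdf(x, *params[c]))
```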
2. Transformation-based QDF (TQDF)

First, we present a method to transform any pdf into an approximately normal distribution using the Box-Cox transformation [2] with the exponential power transformation.
This transformation is applied to each dimension of the observed data independently for each class. Since a different transformation is applied to each class, the a posteriori probabilities of different classes cannot be compared directly. To solve this problem, we introduce a new discriminant function, called the transformation-based QDF, that compares the probabilities on the same scale.
2.1. Transformation into approximate normal distribution

Here we limit the discussion to univariate distributions. The Box-Cox transformation [2] transforms a distribution into a normal distribution. When a datum x is transformed into x^{(λ)} with parameter λ, the exponential data transformation is

$$x^{(\lambda)} = \begin{cases} \dfrac{\exp(\lambda x)-1}{\lambda}, & \lambda \ne 0, \\ x, & \lambda = 0. \end{cases} \qquad (2)$$
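A minimal sketch of Eq. (2) in Python follows (the function name is ours); the λ = 0 branch is the continuous limit of (exp(λx) − 1)/λ.

```python
import numpy as np

def exp_transform(x, lam):
    """Eq. (2): the exponential data transformation."""
    x = np.asarray(x, dtype=float)
    if lam == 0.0:
        return x.copy()
    return (np.exp(lam * x) - 1.0) / lam
```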
The characteristics of Eq. (2) are shown in Fig. 1.

[Figure 1. Characteristics of the exponential transformation: x^{(λ)} plotted against x for λ = −2.0, −1.0, 0, 1.0, and 2.0.]

Suppose there are N observations denoted by χ = {x₁, x₂, ..., x_k, ..., x_N}. If the transformed observations x_k^{(λ)} follow the normal distribution N(m, σ²), the pdf l(λ, m, σ²) of the original observations is defined as follows:

$$l(\lambda, m, \sigma^2) = \prod_{k=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\left(x_k^{(\lambda)}-m\right)^2}{2\sigma^2}\right) \left.\frac{\partial x^{(\lambda)}}{\partial x}\right|_{x=x_k}, \qquad (3)$$

where ∂x^{(λ)}/∂x |_{x=x_k} is the Jacobian resulting from the transformation of Eq. (2). The log-likelihood of Eq. (3), up to a constant, is

$$L(\lambda, m, \sigma^2) = -\frac{N}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{k=1}^{N}\left(x_k^{(\lambda)}-m\right)^2 + \lambda\sum_{k=1}^{N}x_k. \qquad (4)$$

Here, m and σ² are unknown parameters which are estimated by maximum likelihood. The maximum-likelihood estimates m̂ and σ̂² of m and σ² are

$$\hat{m} = \frac{1}{N}\sum_{k=1}^{N}x_k^{(\lambda)}, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{k=1}^{N}\left(x_k^{(\lambda)}-\hat{m}\right)^2. \qquad (5)$$

Rewriting Eq. (4) using m̂ and σ̂² and eliminating a constant gives

$$L(\lambda, \hat{m}, \hat{\sigma}^2) = -\frac{N}{2}\ln\hat{\sigma}^2 + \lambda\sum_{k=1}^{N}x_k. \qquad (6)$$

If Eq. (6) is maximized with respect to λ, the distribution of the transformed observations is best approximated by the normal distribution. The partial derivative of Eq. (6) with respect to λ is

$$\frac{\partial L(\lambda, \hat{m}, \hat{\sigma}^2)}{\partial\lambda} = -\frac{N}{2\hat{\sigma}^2}\frac{\partial\hat{\sigma}^2}{\partial\lambda} + \sum_{k=1}^{N}x_k. \qquad (7)$$

The value of λ that makes Eq. (7) zero is denoted by λ̂. Since λ̂ cannot be calculated explicitly, it must be determined by an iterative scheme. Manly [4] proposed a method to calculate an approximate value by expanding Eq. (2) into a Taylor series around the mean of the original observations. The approximation λ̃ of λ̂ is given by

$$\tilde{\lambda} = \frac{6m_3}{3(m_2)^2 - 7m_4}, \qquad (8)$$

where x̄ = (1/N) Σ_{k=1}^{N} x_k and m_r = (1/N) Σ_{k=1}^{N} (x_k − x̄)^r (r = 2, 3, 4).

In the following, λ̃ is calculated for each class and each dimension.
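The following sketch implements our reading of the moment approximation of Eq. (8), together with the profile log-likelihood of Eq. (6) for anyone who prefers to maximize it numerically. It reuses the hypothetical exp_transform from the sketch after Eq. (2); for gamma-distributed data with shape 5 the approximation lands near the λ̃ ≈ −9.14 × 10⁻² reported in Section 3.1.

```python
import numpy as np

def lambda_tilde(samples):
    """Eq. (8): 6*m3 / (3*m2^2 - 7*m4), m_r the r-th sample central moment."""
    x = np.asarray(samples, dtype=float)
    d = x - x.mean()
    m2, m3, m4 = (d**2).mean(), (d**3).mean(), (d**4).mean()
    return 6.0 * m3 / (3.0 * m2**2 - 7.0 * m4)

def profile_log_likelihood(lam, samples):
    """Eq. (6): -(N/2) ln(sigma_hat^2) + lam * sum(x); maximize over lam."""
    x = np.asarray(samples, dtype=float)
    y = exp_transform(x, lam)          # from the Eq. (2) sketch above
    return -0.5 * x.size * np.log(y.var()) + lam * x.sum()

# Sanity check with gamma-distributed data (cf. Section 3.1).
rng = np.random.default_rng(0)
x = rng.gamma(shape=5.0, size=500_000)
print(lambda_tilde(x))                 # roughly -9e-2
```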
2.2. Modification of the discriminant function

For the classification problem of d-dimensional vector data, the transformation is applied to each dimension and each class independently. Classifying the original data requires a measure that is meaningful across classes, so we introduce a new QDF that reflects the distribution of the original data. Since a different transformation is applied to each class, QDF must be modified so that probabilities are compared on the same scale.
The transformation parameter of the jth dimension for class ω_i is denoted by λ̃_{ij}. The feature vector is transformed as follows:

$$x = (x_1, x_2, \ldots, x_d)^T \;\Rightarrow\; x^{(\tilde{\lambda}_i)} = \left(x_1^{(\tilde{\lambda}_{i1})}, x_2^{(\tilde{\lambda}_{i2})}, \ldots, x_d^{(\tilde{\lambda}_{id})}\right)^T, \qquad (9)$$

where λ̃_i = (λ̃_{i1}, λ̃_{i2}, ..., λ̃_{id}). The class conditional pdf of the ith class is given by

$$p(x \mid \omega_i) = p\left(x^{(\tilde{\lambda}_i)} \mid \omega_i\right)|J_i|, \qquad (10)$$

where J_i is the Jacobian of class ω_i caused by the transformation and is defined as

$$J_i = \prod_{j=1}^{d}\frac{\partial x_j^{(\tilde{\lambda}_{ij})}}{\partial x_j}. \qquad (11)$$

Since p(x^{(λ̃_i)} | ω_i) follows the normal distribution, ln p(x | ω_i) is given as follows, where μ_i^{(λ̃_i)} and Σ_i^{(λ̃_i)} are the mean vector and covariance matrix of the transformed data of class ω_i:

$$\begin{aligned}\ln p(x \mid \omega_i) &= \ln p\left(x^{(\tilde{\lambda}_i)} \mid \omega_i\right) + \ln|J_i|\\ &= -\frac{1}{2}\left(x^{(\tilde{\lambda}_i)}-\mu_i^{(\tilde{\lambda}_i)}\right)^T\left(\Sigma_i^{(\tilde{\lambda}_i)}\right)^{-1}\left(x^{(\tilde{\lambda}_i)}-\mu_i^{(\tilde{\lambda}_i)}\right)\\ &\quad -\frac{1}{2}\ln\left|\Sigma_i^{(\tilde{\lambda}_i)}\right| - \frac{d}{2}\ln(2\pi) + \ln P(\omega_i) + \sum_{j=1}^{d}\ln\left|\frac{\partial x_j^{(\tilde{\lambda}_{ij})}}{\partial x_j}\right|. \qquad (12)\end{aligned}$$

[Figure 2. Comparison of untransformed observations with transformed observations. (a) Untransformed distribution: histogram of p(x) with the estimated normal distribution N(5.0, 5.0) shown as a dotted line. (b) Transformed distribution: histogram of p(x^{(λ)}) with the estimated normal distribution N(3.87, 1.79).]

Ignoring constants, and noting that ∂x^{(λ)}/∂x = exp(λx) so that ln|J_i| = Σ_{j=1}^{d} λ̃_{ij} x_j, Eq. (12) is reduced to

$$g_i(x) = \frac{1}{2}\left(x^{(\tilde{\lambda}_i)}-\mu_i^{(\tilde{\lambda}_i)}\right)^T\left(\Sigma_i^{(\tilde{\lambda}_i)}\right)^{-1}\left(x^{(\tilde{\lambda}_i)}-\mu_i^{(\tilde{\lambda}_i)}\right) + \frac{1}{2}\ln\left|\Sigma_i^{(\tilde{\lambda}_i)}\right| - \sum_{j=1}^{d}\tilde{\lambda}_{ij}x_j. \qquad (13)$$

An unknown observation x is classified into the class that minimizes Eq. (13). This is called the transformation-based QDF, or TQDF.
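A sketch of TQDF under our reading of Eqs. (8)–(13): per class, estimate λ̃ for each dimension, transform the training data, fit the mean and covariance of the transformed data, and score test points with Eq. (13). It reuses the hypothetical exp_transform and lambda_tilde from the earlier sketches; all names are illustrative, not the paper's.

```python
import numpy as np

def fit_class(data):
    """TQDF parameters for one class from an (N x d) training matrix:
    one lambda~ per dimension (Eq. (8)), then the mean vector and
    covariance matrix of the transformed samples."""
    d = data.shape[1]
    lams = np.array([lambda_tilde(data[:, j]) for j in range(d)])
    z = np.column_stack([exp_transform(data[:, j], lams[j]) for j in range(d)])
    mu = z.mean(axis=0)
    cov = np.cov(z, rowvar=False, bias=True)   # 1/N, matching Eq. (5)
    return lams, mu, cov

def tqdf(x, lams, mu, cov):
    """Eq. (13); an observation is assigned to the class minimizing this."""
    x = np.asarray(x, dtype=float)
    z = np.array([exp_transform(x[j], lams[j]) for j in range(x.size)])
    diff = z - mu
    _, logdet = np.linalg.slogdet(cov)
    return (0.5 * diff @ np.linalg.inv(cov) @ diff
            + 0.5 * logdet
            - lams @ x)                        # Jacobian term -sum_j lam_ij x_j

# Usage (illustrative): params[c] = fit_class(training_samples_of_class_c)
#                       label = min(params, key=lambda c: tqdf(x_new, *params[c]))
```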
3. Experiments

First, we investigate the property induced by Eq. (2), that is, how closely a given distribution is brought to the normal distribution by the power transformation. Next, we apply the proposed method to classification problems with public databases and show its effectiveness.

3.1. Property of the power transformation
We first consider the gamma distribution. Suppose observations following a gamma distribution are given; in the experiment, we use 500,000 observations. Using these observations, we calculate the optimal transformation parameter λ̃ by maximizing Eq. (6), which yields λ̃ = −9.14 × 10⁻². Fig. 2(a) shows the histogram of the original observations. Assuming these observations are drawn from a normal distribution, the parameters of the normal distribution are estimated and the estimated distribution is displayed as a dotted line. Fig. 2(b) shows the histogram of the observations transformed by the proposed method. Comparing Fig. 2(a) with Fig. 2(b), it is clear that the distribution of the transformed observations is closer to the normal distribution. We also investigate observations following other distributions. To measure the degree to which the observations are normalized by the proposed method, "skewness" and "kurtosis" are used. Skewness and kurtosis express the "symmetry of the distribution" and the "sharpness of the distribution around the mean," respectively. If these values are near 0, the distribution of the observations is close to the normal distribution.
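The Table 1 measurements can be reproduced in spirit with the conventional sample skewness and excess-kurtosis estimators, assumed here since the paper does not spell out its estimator formulas; both are 0 for a normal distribution, consistent with the near-zero values in the normal row of Table 1. The sketch reuses exp_transform and lambda_tilde from the earlier sketches.

```python
import numpy as np

def skewness(x):
    """Sample skewness m3 / m2^(3/2); 0 for a symmetric distribution."""
    d = x - x.mean()
    return (d**3).mean() / (d**2).mean() ** 1.5

def excess_kurtosis(x):
    """Sample excess kurtosis m4 / m2^2 - 3; 0 for a normal distribution."""
    d = x - x.mean()
    return (d**4).mean() / (d**2).mean() ** 2 - 3.0

# One row of a Table-1-style experiment (exact values depend on the seed;
# the paper's sampling setup is not fully specified).
rng = np.random.default_rng(0)
x = rng.gamma(shape=5.0, size=500_000)
z = exp_transform(x, lambda_tilde(x))
print(skewness(x), excess_kurtosis(x))   # before transformation
print(skewness(z), excess_kurtosis(z))   # after transformation
```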
Table 1. Skewness and kurtosis of some distributions.

Kind of          Formula of                                   Original observations       Transformed observations
distribution     distribution                                 skewness      kurtosis      skewness      kurtosis
Normal           (1/√(2π)) e^{−x²/2}                          2.17×10⁻³     9.39×10⁻³     −2.24×10⁻⁶    9.37×10⁻³
Gamma            e^{−x} x⁴ / Γ(5)                             1.99          3.96          1.16          3.89
χ²               x^{1/2} e^{−x/2} / (2^{3/2} Γ(3/2))          0.379         1.43          −0.243        0.398
t                (Γ(3)/(√(5π) Γ(5/2))) (1 + x²/5)^{−3}        −2.41×10⁻²    4.30          4.23×10⁻³     4.28
F                Γ(15) 2¹⁰ x⁹ / (Γ(10) Γ(5) (1 + 2x)¹⁵)       2.83          28.5          1.30          6.78
Comparing these values for the original observations with those for the transformed observations shows the ability of the power transformation. In this experiment, we investigated the normal, gamma, χ², t-, and F-distributions. The formulas of these distributions and the results are shown in Table 1. The results show that all the distributions become closer to the normal distribution. Except for the χ² distribution, the skewness values become closer to 0 than the kurtosis values do, which shows that the power transformation is better at improving symmetry than sharpness.
3.2. Classification experiments with public databases

The proposed method is applied to real-world problems. We used the public databases (a) Letter Image Recognition Data, (b) Landsat Satellite, (c) Pima Indians Diabetes Database, and (d) Vowel Recognition from the UCI Machine Learning Repository [1]. Experimental results are shown in Table 2, which lists the numbers of testing and learning samples and the recognition rates of the traditional QDF, the proposed method (TQDF), and the square root transformation. Since Vowel Recognition includes negative values, the square root transformation cannot be applied to it. The classification accuracy of the proposed method is the highest in every case. Since the square root method is appropriate only for a particular feature that follows a certain distribution, it does not improve the recognition rates in these cases. The proposed method does not depend on the distribution of the observations.

Table 2. Recognition rates.

Database                            Number of samples          QDF      TQDF     Square root
                                    Testing      Learning
(a) Letter Image Recognition Data   10,400       8,580         87.5%    88.4%    87.4%
(b) Landsat Satellite                4,435       2,000         85.7%    86.6%    85.2%
(c) Pima Indians Diabetes Database     384         384         73.6%    73.8%    72.3%
(d) Vowel Recognition                  462         528         47.2%    52.8%    —
4. Conclusions

In this paper, we have proposed a new QDF, called the transformation-based QDF (TQDF), that improves classification accuracy by transforming the data so that any distribution becomes closer to the normal distribution. The ability of the transformation was tested by transforming data drawn from several distributions, and it was shown experimentally that the transformation improves the symmetry of the distributions. We then showed how to modify the traditional QDF in order to deal with the transformed data, and demonstrated the effectiveness of the proposed method through experiments. Since the proposed method does not depend on the distribution of the observations, it can be applied to any kind of feature. In this paper, only univariate distributions were investigated and the transformation was applied to each dimension independently; considering multivariate distributions is future work.
References

[1] C. Blake and C. Merz. UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences, 1998.
[2] G. E. P. Box and D. R. Cox. An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26:211–243, 1964.
[3] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 2nd edition, 1990.
[4] B. F. J. Manly. Exponential data transformations. The Statistician, 25(1):37–42, 1976.
[5] G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. John Wiley & Sons, Inc., New York, 1991.
[6] T. Wakabayashi, S. Tsuruoka, F. Kimura, and Y. Miyake. On the size and variable transformation of feature vector for handwritten character recognition. Trans. IEICE (in Japanese), J76-D-II(12):2495–2503, 1993.