Determining the Number of Components in Mixture Models Using Williams' Statistical Test

Shuang Cang and Derek Partridge
Dept. of Computer Science, Exeter University, Exeter EX4 4PT, UK
E-mail: [email protected]

Abstract

This paper presents a new approach that can be used to determine the optimum number of components in mixture models. The approach is developed from Williams' statistical test for determining optimal drug doses. Three experimental examples demonstrate application of the proposed method to selecting the number of components of density functions estimated with the Expectation Maximisation (EM) algorithm. It is shown that the method can work, but it is not perfect in all applications.
1 Introduction
Probability density functions play an important role in pattern recognition systems. If we know the density function for each class, then the probability that a new pattern belongs to each class can be obtained using Bayes' theorem. Mixture models [8] are widely used to approximate a true density function, and they have been the subject of intensive research [3][4][6][9][11]. The main focus is determination of the optimal number of components in the mixture model, and a number of algorithms exist for finding the number of components to use. For example, a fast distance-based method is proposed in [9], and mutual information theory is used in [3] for determining the number of components in a mixture model density function. An 'optimal' number is a minimum number that gives a 'good enough' approximation. This paper proposes using the well-known Williams' test [1][2] as a basis for computing the number of components in a mixture model. Williams' test is based on the idea that, for some monotonically varying quantity such as toxicity response with increasing drug dosage, the goal is to determine the lowest dose at which there is evidence of a response. In the current context, the response to increasing the number of components in a mixture model is the closeness of the approximation to the real density function, and the optimal cut-off point is determined by the absence of significant improvement when another component is added. Thus the statistical term 'significant' (usually expressed as a percentage) is our criterion of 'good enough', which can, of course, be set to satisfy whatever external constraints obtain. A procedure for determining the number of components in a mixture model is presented, and three different examples are used to demonstrate and evaluate the proposed method.
2 General Information
2.1 Mixture Model

The EM algorithm is widely used to estimate parameters in Gaussian mixture models. For K components, the mixture density can be written as a linear combination of component density functions p(X|j) in the form

p(X) = \sum_{j=1}^{K} p(X \mid j) P(j)    (2.1)

where the P(j) are the mixing parameters of the mixture model and satisfy

\sum_{j=1}^{K} P(j) = 1, \quad 0 \le P(j) \le 1    (2.2)

The component density functions p(X|j) satisfy

\int_{-\infty}^{\infty} p(X \mid j)\, dX = 1    (2.3)

The most widely used distribution for each component density is the Gaussian. The Gaussian density function for each component is

p(X \mid j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\{-(X - \mu_j)^T \Sigma_j^{-1} (X - \mu_j)/2\}    (2.4)

where µ_j is the d-dimensional mean vector and Σ_j the d × d covariance matrix. Values of the parameters P(j), µ_j and Σ_j for each component are determined using the EM algorithm [8] as follows. First, K-means clustering is used, for a fixed number of components K in (2.1), to initialise the parameters P(j), µ_j and Σ_j of each component p(X|j) in (2.4). Clearly condition (2.3) must be satisfied. Then, using the following recursive formulas, the parameters P(j), µ_j and Σ_j are obtained for each component.
P(j) = \frac{1}{N} \sum_{n=1}^{N} w_j^n, \quad
\mu_j = \frac{\sum_{n=1}^{N} w_j^n X^n}{\sum_{n=1}^{N} w_j^n}, \quad
\Sigma_j = \frac{\sum_{n=1}^{N} w_j^n (X^n - \mu_j)(X^n - \mu_j)'}{\sum_{n=1}^{N} w_j^n}    (2.5)

where N is the size of the data set and the weights are

w_j^n = \frac{p(X^n \mid j) P(j)}{\sum_{j'=1}^{K} p(X^n \mid j') P(j')}    (2.6)

where p(X^n | j) is defined in (2.4).
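As an illustration of the updates (2.5) and (2.6), the following NumPy sketch performs one EM iteration for a d-dimensional Gaussian mixture. It assumes X is an N x d data array and that P, mu and Sigma hold the current parameters; the function names and array layout are our own, not the paper's.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Component density (2.4) evaluated at every row of X (N x d)."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    expo = -0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(expo) / norm

def em_step(X, P, mu, Sigma):
    """One EM iteration: weights (2.6), then parameter updates (2.5)."""
    N, d = X.shape
    K = len(P)
    # E-step: responsibilities w[n, j] from (2.6)
    dens = np.column_stack([gaussian_pdf(X, mu[j], Sigma[j]) for j in range(K)])
    w = dens * np.asarray(P)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: updates (2.5)
    Nj = w.sum(axis=0)                       # sum over n of w_j^n
    P_new = Nj / N
    mu_new = (w.T @ X) / Nj[:, None]
    Sigma_new = np.empty((K, d, d))
    for j in range(K):
        diff = X - mu_new[j]
        Sigma_new[j] = (w[:, j, None] * diff).T @ diff / Nj[j]
    return P_new, mu_new, Sigma_new
```

In practice this step is repeated from the K-means initialisation until the parameters (or the log-likelihood) stop changing appreciably.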
2.2 Kullback-Leibler Distance

The Kullback-Leibler divergence [10] between the density functions f(X) and g(X) is defined as

D(f, g) = \int_{-\infty}^{\infty} f(X) \log \frac{f(X)}{g(X)}\, dX    (2.7)

where X = (x_1, x_2, \ldots, x_d) is a d-dimensional vector. The Kullback-Leibler divergence measures the similarity between the statistical model g(X) and the true distribution f(X), and has the properties

D(f, g) \ge 0; \quad D(f, g) = 0 \Leftrightarrow f(X) = g(X)    (2.8)

To apply (2.7), we take p(X) as the true probability density function f(X), and p_k(X) with k components as the approximate probability density function g(X).

2.3 Williams' Trend Test

This is a trend test on means, described in [1][2], in which a series of means (each from replicated tests) is constructed by successively increasing some variable (such as the number of components in an approximation). This series gives a sequence of group levels, and the test is designed to detect the lowest level at which there is a significant difference. The K group levels are denoted by the integers 1, ..., i, ..., K, with 1 representing the control group level and group level i greater than group level i-1 for i = 2, ..., K. All K group levels have the same replication N. Denote the mean response to level i (the population mean) by M_i. Assume that the mean responses M_i (1 ≤ i ≤ K) are monotonically ordered, with M_1 ≤ ... ≤ M_i ≤ ... ≤ M_K or M_1 ≥ ... ≥ M_i ≥ ... ≥ M_K. The test statistic is given by

t_i = (\hat{M}_1 - \hat{M}_i)(2s^2/N)^{-1/2}, \quad (2 \le i \le K)    (2.9)

where M̂_i is the maximum likelihood estimate of M_i, N is the ith group size and s² is the mean square error with ν degrees of freedom. If Ȳ_i is the mean of group i and Y_{in} is the nth individual in the ith group, then

s^2 = \sum_{i=1}^{K} \sum_{n=1}^{N} (Y_{in} - \bar{Y}_i)^2 / \nu, \quad \nu = K(N - 1)

The maximum likelihood estimates of the M_i are obtained by amalgamating adjacent group means:

Upward trend:   \hat{M}_i = \cdots = \hat{M}_j = \min_{j \in [i, K]} \left\{ \sum_{l=i}^{j} \bar{Y}_l / (j - i + 1) \right\}

Downward trend: \hat{M}_i = \cdots = \hat{M}_j = \max_{j \in [i, K]} \left\{ \sum_{l=i}^{j} \bar{Y}_l / (j - i + 1) \right\}

where i takes the values 1, j+1, and so on, until i = K is reached.

On the null hypothesis that the means M_i are all equal, the statistic t_i is distributed as (Ȳ_k - Z_1)/S, where Ȳ_k = (Z_1 + Z_2 + ... + Z_k)/k, the Z_i (1 ≤ i ≤ K) are independent N(0, 1) random variables, and νS²/2 is distributed as χ² with ν degrees of freedom. In Williams' test we assume that the Y_{in} within each group i are independent with Y_{in} ~ N(M_i, σ²). If this normality condition is not satisfied but the groups have a common variance, Williams' test remains valid with infinite degrees of freedom, by the Central Limit Theorem. The one-sided test is the appropriate form of the trend test. The test statistic t_i is compared with the upper α percentage point t_{i,α} (critical value) of its distribution under the null hypothesis (all M_i equal). Values of t_{i,α} for i ≤ 11, α = 5% and α = 1%, and different ν can be found in tables [1][2]. For i = 2 the distribution is Student's t. The critical value t_{i,α} can be computed with a SAS program [7] or taken from the tables in [1][2]; the tables provide only a limited range of values, whereas SAS can provide critical values at almost any significance level α. For a given i and ν degrees of freedom, the critical value t_{i,α} at significance level α is obtained with the SAS functions

i = 2: t_{i,α} = TINV(1-α, ν)
i > 2: t_{i,α} = PROBMC('WILLIAMS', . , 1-α, ν, i)

For 2 ≤ i ≤ 11 and infinite degrees of freedom ν, some critical values t_{i,α} are presented below.
Table 1: One-sided critical values t_{i,α} with infinite degrees of freedom
i\α  50%   45%   40%   35%   30%   25%   20%   15%   10%   5%    1%
2 0.000 0.125 0.255 0.385 0.525 0.675 0.845 1.040 1.285 1.645 2.326
3 0.191 0.306 0.423 0.544 0.672 0.810 0.965 1.145 1.374 1.716 2.366
4 0.261 0.372 0.484 0.601 0.724 0.858 1.007 1.183 1.405 1.739 2.377
5 0.298 0.406 0.516 0.630 0.751 0.882 1.029 1.202 1.421 1.750 2.382
6 0.320 0.426 0.535 0.648 0.768 0.897 1.042 1.213 1.430 1.756 2.385
7 0.335 0.440 0.548 0.660 0.778 0.907 1.051 1.220 1.436 1.760 2.386
8 0.346 0.450 0.557 0.668 0.786 0.914 1.057 1.226 1.440 1.763 2.387
9 0.351 0.458 0.564 0.675 0.792 0.919 1.062 1.229 1.443 1.765 2.388
10 0.360 0.464 0.570 0.680 0.796 0.923 1.065 1.232 1.445 1.767 2.389
11 0.365 0.469 0.574 0.684 0.800 0.926 1.068 1.235 1.447 1.768 2.389
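A minimal sketch of the computations in this section, under our reading of the amalgamation rule and of (2.9) for a downward trend with group 1 as the control, is given below; the function names are ours, not the paper's.

```python
import numpy as np

def pooled_means_downward(Ybar):
    """MLEs of M_i for a downward trend: starting from i = 1, each block
    i..j takes the largest running average of Ybar[i..j]; continue from j+1."""
    Ybar = np.asarray(Ybar, dtype=float)
    K = len(Ybar)
    M = np.empty(K)
    i = 0
    while i < K:
        avgs = np.cumsum(Ybar[i:]) / np.arange(1, K - i + 1)
        j = i + int(np.argmax(avgs))          # max of running averages (downward trend)
        M[i:j + 1] = avgs[j - i]
        i = j + 1
    return M

def williams_t(Y):
    """Statistics (2.9) from a K x N array of replicated responses,
    with group 1 (row 0) taken as the control."""
    Y = np.asarray(Y, dtype=float)
    K, N = Y.shape
    Ybar = Y.mean(axis=1)
    s2 = ((Y - Ybar[:, None]) ** 2).sum() / (K * (N - 1))   # nu = K(N - 1)
    M = pooled_means_downward(Ybar)
    t = (M[0] - M[1:]) / np.sqrt(2.0 * s2 / N)              # t_i for i = 2..K
    return t, s2
```

Each returned t_i would then be compared with the critical value t_{i,α} from Table 1, or from the SAS functions above, at the chosen significance level α.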
3 Procedures for Determining the Number of Components in a Mixture Model

We can use the method proposed in this paper to select the number of components in a mixture model according to a chosen significance value α. The method is based on Williams' statistical test and the Kullback-Leibler divergence. Considering different numbers of components in (2.1), write the set of density functions p_k(X) (1 ≤ k ≤ K) as

p_k(X) = \sum_{k_0=1}^{k} p(X \mid k_0) P(k_0), \quad (1 \le k \le K)    (3.1)

The approximate density function is determined by the EM algorithm for each number of components k. The Kullback-Leibler divergence (2.7) between the true density function p(X) and the approximate density function p_k(X) is

D(p, p_k) = \int_{-\infty}^{\infty} p(X) \log \frac{p(X)}{p_k(X)}\, dX, \quad (1 \le k \le K)    (3.2)

Equation (3.2) can be written as

D(p, p_k) = -\int_{-\infty}^{\infty} p(X) \log p_k(X)\, dX - \left\{ -\int_{-\infty}^{\infty} p(X) \log p(X)\, dX \right\}    (3.3)

The second term above is independent of k. Thus only the first term in (3.3) is considered in evaluating the density model and finding the truncation point k. We use Φ to denote this first term, which is the cross entropy:

\Phi = -\int_{-\infty}^{\infty} p(X) \log p_k(X)\, dX = E_p(-\log p_k(X)), \quad (1 \le k \le K)    (3.4)

where E_p denotes the expectation of -log p_k(X) under the density p(X). We minimise Φ in order to minimise the Kullback-Leibler divergence D(p, p_k), since D(p, p_k) ≥ 0. Given sufficient data, Φ in (3.4) can be approximated by

\Phi \approx \frac{1}{N} \sum_{n=1}^{N} -\log p_k(X^n) = \bar{Y}_k    (3.5)

where Ȳ_k is the mean, for a given k, of the transformed values Y_k = -log p_k(X). As k increases, Φ has a decreasing trend. We use the means Ȳ_k in (3.5) as the group-level means in Williams' test, with k = K representing the control group level. The test statistic in Williams' test is then

t_i = (\hat{M}_{K+1-i} - \hat{M}_K)(2s^2/N)^{-1/2}, \quad (2 \le i \le K)    (3.6)

The critical value t_{i,α} can be found from Williams' tables. The algorithm is as follows:

Step 1. Initially set a reasonably large K.
Step 2. Calculate t_i (2 ≤ i ≤ K) and determine the number of components i for which t_i (t_i < t_{i,α}) is not significant at significance level α, but t_{i+1} (t_{i+1} > t_{i+1,α}) is significant at significance level α.

Alternatively, a stepwise method can be used:

Step 1. Initially set K = 2.
Step 2. Calculate t_i (2 ≤ i ≤ K). If t_2 is not significant at significance level α, but t_i (3 ≤ i ≤ K) is significant at significance level α (when K ≥ 3), stop the process; the number of components is K - 1. Otherwise, replace K with K + 1 and repeat Step 2 until t_2 is not significant but t_i (3 ≤ i ≤ K) is significant at significance level α (when K ≥ 3).
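One possible end-to-end sketch of this procedure, fitting the mixtures for k = 1, ..., K by EM, forming the group means (3.5), amalgamating them for a downward trend, and computing the statistics (3.6) with group K as the control, is shown below. The use of scikit-learn's GaussianMixture for the EM fits and the helper name are our own choices, not the paper's.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def component_selection(X, K=5, seed=0):
    """Return the group means (3.5), the amalgamated estimates and the
    statistics t_i of (3.6), with group k = K taken as the control."""
    N = X.shape[0]
    Y = np.empty((K, N))
    for k in range(1, K + 1):                       # EM fit for each k
        gm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        Y[k - 1] = -gm.score_samples(X)             # per-sample -log p_k(X^n)
    Ybar = Y.mean(axis=1)                           # group means (3.5)
    s2 = ((Y - Ybar[:, None]) ** 2).sum() / (K * (N - 1))
    # amalgamate the means for a downward trend
    M = np.empty(K)
    i = 0
    while i < K:
        avgs = np.cumsum(Ybar[i:]) / np.arange(1, K - i + 1)
        j = i + int(np.argmax(avgs))
        M[i:j + 1] = avgs[j - i]
        i = j + 1
    # t_i of (3.6): compare group K+1-i with the control group K, i = 2..K
    t = (M[:-1][::-1] - M[-1]) / np.sqrt(2.0 * s2 / N)
    return Ybar, M, t
```

The number of components is then read off by comparing each t_i with the Williams critical value at the chosen significance level, exactly as in the two procedures above.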
4 Experiment Results

Three examples are used to demonstrate and evaluate the method proposed in this paper. The first example is adapted from [3], the second example is a subset of Fisher's Iris data as in [9], and the last example uses flight path data provided by the air-traffic control centre of London's Heathrow airport [5].

Example A: In the first example [3], the true probability density function is a one-dimensional mixture of two Gaussian distributions,

p(X) = 0.3\, N(0.25, 1.25^2) + 0.7\, N(4, 1^2)    (4.1)

We randomly draw 1000 samples from the distribution (4.1) as the data set {X^n}.
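As a concrete illustration (our own sketch, with an arbitrary random seed), the 1000 samples of Example A could be drawn as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# mixture (4.1): 0.3 N(0.25, 1.25^2) + 0.7 N(4, 1^2)
comp = rng.random(n) < 0.3
X = np.where(comp,
             rng.normal(0.25, 1.25, n),
             rng.normal(4.0, 1.0, n)).reshape(-1, 1)
```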
Initially set K = 5, on the assumption that the number of components in the true density function does not exceed K - 1; the approximate density functions p_k(X) (1 ≤ k ≤ K) are then obtained using the EM algorithm. Figure 1 shows the true density function p(X) and the approximate density functions p_1(X) and p_2(X) with one and two components computed by the EM algorithm. The solid line with '*' is the true probability density function, and the lines with '∆' and 'o' are the approximate density functions p_1(X) and p_2(X), respectively.

[Figure 1: True and Approximate Density Functions, showing p(X), p_1(X) and p_2(X) against X]

Calculate the means Ȳ_k for the different k (1 ≤ k ≤ K) after the transformation Y_k = -log p_k(X) on the data set {X^n}. The means are Ȳ_1 = 2.116558, Ȳ_2 = 1.972646, Ȳ_3 = 1.972697, Ȳ_4 = 1.972495, Ȳ_5 = 1.971659, and the mean square error with infinite degrees of freedom is 0.500760. First, the M̂_k (1 ≤ k ≤ 5), which are the maximum likelihood estimates of the M_k, are determined by an averaging process. Assume that Ȳ_5 is the control group. We have Ȳ_1 > Ȳ_2, but Ȳ_3 > Ȳ_2, so we form Ȳ_{2,3} = (Ȳ_2 + Ȳ_3)/2 = 1.972671, giving Ȳ_1 > Ȳ_{2,3} > Ȳ_4 > Ȳ_5. Thus the maximum likelihood estimates of the M_k are M̂_1 = Ȳ_1 = 2.116558, M̂_2 = M̂_3 = Ȳ_{2,3} = 1.972671, M̂_4 = Ȳ_4 = 1.972495, M̂_5 = Ȳ_5 = 1.971659. Calculating the test statistics from (3.6), we obtain t_2 = 0.026391, t_3 = t_4 = 0.031976, t_5 = 4.578622. From the Williams' tables, t_2, t_3 and t_4 are not significant at level α = 1%, but t_5 is strongly significant at the α = 1% level. Thus the optimal number of components in the mixture density function is 2.
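As an arithmetic check (ours, not the paper's), these statistics follow directly from (3.6) with the values listed above, s² = 0.500760 and N = 1000:

t_5 = (\hat{M}_1 - \hat{M}_5)/\sqrt{2s^2/N} = (2.116558 - 1.971659)/\sqrt{2(0.500760)/1000} = 0.144899/0.031647 \approx 4.579

t_2 = (\hat{M}_4 - \hat{M}_5)/\sqrt{2s^2/N} = (1.972495 - 1.971659)/0.031647 \approx 0.026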
This experiment was repeated 5 times with different seeds and the results are presented in Table 2, where '***' indicates significance at level α = 1%.

Table 2: The Experimental Results of Example A (K = 5)

Experiment 1 (s² = 0.455343)
k   Ȳ_k        M̂_k        t_{K+1-k}   Sig.
1   2.131000   2.131000   3.978403    ***
2   2.012572   2.012572   0.054036
3   2.012560   2.012560   0.053639
4   2.012006   2.012006   0.035283
5   2.010941   2.010941

Experiment 2 (s² = 0.486937)
k   Ȳ_k        M̂_k        t_{K+1-k}   Sig.
1   2.142644   2.142644   5.378302    ***
2   1.980887   1.980887   0.194932
3   1.977787   1.978190   0.108517
4   1.978593   1.978190   0.108517
5   1.974804   1.974804

Experiment 3 (s² = 0.529124)
k   Ȳ_k        M̂_k        t_{K+1-k}   Sig.
1   2.127517   2.127517   4.090193    ***
2   1.997495   1.997495   0.093297
3   1.997291   1.997291   0.087017
4   1.997266   1.997266   0.086262
5   1.994460   1.994460

Experiment 4 (s² = 0.515775)
k   Ȳ_k        M̂_k        t_{K+1-k}   Sig.
1   2.117011   2.117011   3.853128    ***
2   1.993991   1.993991   0.022840
3   1.993405   1.993449   0.005974
4   1.993493   1.993449   0.005974
5   1.993257   1.993257

Experiment 5 (s² = 0.526165)
k   Ȳ_k        M̂_k        t_{K+1-k}   Sig.
1   2.149206   2.149206   4.563666    ***
2   2.004905   2.004905   0.115368
3   2.004327   2.004327   0.097562
4   2.001327   2.001327   0.005076
5   2.001163   2.001163
From Table 2, it can be seen that the test statistic at k = 2 is not significant at significance level α = 1%. Thus the optimal number of components is 2. If there is more overlap between the components, a larger α should be chosen.

Example B (Fisher's Iris Data Example): Using a subset of Fisher's Iris data [9], the data set contains 100 items belonging to two species (Setosa and Versicolor). Each item has four measurements: sepal length, sepal width, petal length and petal width. For this data set it is well known that the optimal number of components is 2. We now demonstrate that the stepwise method recovers this true number of components. The results obtained using Williams' test are presented in Tables 3a and 3b; '***' indicates significance at level α = 1%.

Table 3a: The Experimental Results of Example B (K = 2), s² = 2.650913
k   Ȳ_k         M̂_k         t_{K+1-k}   Sig.
1   11.111532   11.111532   5.896482    ***
2   9.753826    9.753826

Table 3b: The Experimental Results of Example B (K = 3), s² = 2.787023
k   Ȳ_k         M̂_k         t_{K+1-k}   Sig.
1   11.111531   11.111531   6.404259    ***
2   9.7538258   9.7538258   0.6535620
3   9.5995236   9.5995236
From Tables 3a and 3b, we can see that 2 components are needed in the density function.

Example C (Determining the Number of Components for the STCA Data Set): Short Term Conflict Alert (STCA) is an air traffic control system designed to give air-traffic controllers an alert of potential conflicts with sufficient warning time. The purpose of the STCA system is to predict the likelihood of a pair of aircraft breaching proximity restrictions. The current STCA contains three fine filter modules, i.e. the Linear Prediction filter, the Current Proximity filter and the Manoeuvre Hazard filter, each of which attempts to predict likely breaches of proximity restrictions in different flight situations. In this paper we concentrate on the Linear Prediction filter, which accounts for some 90% of all alerts in practice. Six features that may be extracted from the raw flight path data are described in Table 4.

Table 4: Six Features
Feature   Type   Values   Description
∆X        R               ∆X = X2 - X1 (X relative distance of pair)
∆Y        R               ∆Y = Y2 - Y1 (Y relative distance of pair)
∆Z        R               ∆Z = Z2 - Z1 (Z relative distance of pair)
∆Vx       R               ∆Vx = Vx2 - Vx1 (X relative velocity of pair)
∆Vy       R               ∆Vy = Vy2 - Vy1 (Y relative velocity of pair)
∆Vz       R               ∆Vz = Vz2 - Vz1 (Z relative velocity of pair)
Alert     N      0, 1     Classification (0 indicates non-alert, 1 indicates alert)
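For illustration, the six relative features of Table 4 could be computed from a pair of tracks as in the sketch below; the field ordering and the example numbers are hypothetical, since the paper does not give the raw data layout.

```python
import numpy as np

def relative_features(track1, track2):
    """Six STCA features of Table 4 for one aircraft pair.
    Each track is (X, Y, Z, Vx, Vy, Vz); the field order is our assumption."""
    t1 = np.asarray(track1, dtype=float)
    t2 = np.asarray(track2, dtype=float)
    return t2 - t1    # (dX, dY, dZ, dVx, dVy, dVz)

features = relative_features((1200.0, 300.0, 9000.0, 200.0, 10.0, 0.0),
                             (1500.0, 250.0, 9500.0, 180.0, 15.0, -2.0))
```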
A large quantity of flight-path radar data from London's Heathrow airport has been concentrated to provide a reasonable balance between 'Alert' and 'Non-Alert' cases. In a data set covering the years 1988 to 1998 there are 12522 patterns in total: 4647 Alerts and 7875 Non-Alerts generated by the Linear Prediction filter. We randomly selected 4000 patterns from the data set as training data and determined the optimal number of components in the Gaussian mixture model for each class. The experimental results given in Table 5 were obtained using different seeds for the randomisation. Class 1 indicates 'Non-Alert' and class 2 indicates 'Alert'; '***' indicates significance at level 1%, '**' at level 5%, and '*' at significance levels of 6%-15%.

Table 5: The Experimental Results of Example C

Experiment 1, Class 1 (size N = 2494)

K = 2 (s² = 7.609993)
k   Ȳ_k         M̂_k         t_{K+1-k}   Sig.
1   36.249624   36.249624   16.443384   ***
2   34.965076   34.965076

K = 3 (s² = 7.408130)
1   36.249623   36.249623   19.517612   ***
2   34.965076   34.965076   2.8517012   ***
3   34.745278   34.745278

K = 4 (s² = 7.156327)
1   36.249624   36.249624   21.017792   ***
2   34.965076   34.965076   4.0612116   ***
3   34.745278   34.745278   1.1597740   *
4   34.657419   34.657419

K = 5 (s² = 7.137743)
1   36.249624   36.249624   21.414245   ***
2   34.965076   34.965076   4.4356055   ***
3   34.745278   34.745278   1.5303934   *
4   34.657419   34.657419   0.3691106
5   34.629493   34.629493

Experiment 1, Class 2 (size N = 1506)

K = 2 (s² = 14.839855)
1   33.605718   33.605718   5.483956    ***
2   32.835859   32.835859

K = 3 (s² = 12.881930)
1   33.605718   33.605718   8.286525    ***
2   32.835859   32.835859   2.400551    ***
3   32.521877   32.521877

K = 4 (s² = 10.834206)
1   33.605718   33.605718   10.933838   ***
2   32.835859   32.835859   4.5156819   ***
3   32.521877   32.521877   1.8980842   **
4   32.294202   32.294202

K = 5 (s² = 9.552139)
1   33.605718   33.605718   12.870282   ***
2   32.835859   32.835859   6.034967    ***
3   32.521877   32.521877   3.247235    ***
4   32.293617   32.293617   1.220588    *
5   32.156143   32.156143

K = 6 (s² = 8.688265)
1   33.605718   33.605718   14.05204    ***
2   32.835859   32.835859   6.884960    ***
3   32.521877   32.521877   3.961919    ***
4   32.294202   32.294202   1.842350    **
5   32.156143   32.156143   0.557073
6   32.096304   32.096304

Experiment 2, Class 1 (size N = 2496)

K = 2 (s² = 13.192980)
1   36.241964   36.241964   5.8953398   ***
2   34.991199   34.991199

K = 3 (s² = 7.424063)
1   36.241963   36.241963   18.830324   ***
2   34.991199   34.991199   2.6136365   ***
3   34.789613   34.789613

K = 4 (s² = 7.347093)
1   36.241964   36.241964   20.140537   ***
2   34.991199   34.991199   3.8391258   ***
3   34.789613   34.789613   1.2118344   *
4   34.696632   34.696632

K = 5 (s² = 7.269653)
1   36.241963   36.241963   21.280528   ***
2   34.991199   34.991199   4.8925203   ***
3   34.789613   34.789613   2.2512723   **
4   34.696632   34.696632   1.0330005
5   34.617792   34.617792

K = 6 (s² = 7.124575)
1   36.241964   36.241964   21.872245   ***
2   34.991199   34.991199   5.3182236   ***
3   34.789613   34.789613   2.6502192   ***
4   34.721009   34.721009   1.7422407   **
5   34.617789   34.617789   0.3761049
6   34.589372   34.589372

Experiment 2, Class 2 (size N = 1504)

K = 2 (s² = 13.192980)
1   33.708406   33.708406   5.8953398   ***
2   32.927549   32.927549

K = 3 (s² = 12.369237)
1   33.708406   33.708406   8.4212986   ***
2   32.927549   32.927549   2.3328189   ***
3   32.628362   32.628362

K = 4 (s² = 10.627413)
1   33.708406   33.708406   10.904156   ***
2   32.927549   32.927549   4.3356520   ***
3   32.628362   32.628362   1.8189098   **
4   32.412132   32.412132

K = 5 (s² = 11.307902)
1   33.708406   33.708406   11.276194   ***
2   33.015570   33.015570   5.6261949   ***
3   32.628362   32.628362   2.4685552   ***
4   32.412131   32.412131   0.7052238
5   32.325653   32.325653

The results of this example are not as simple to interpret as those of the previous two examples, because the data are not well clustered, but there is a general trend: the optimal number of components in the density function for each class is about 4.
5 Application of Expectation Maximisation (EM) for STCA Data
Randomly select 4000 patterns from the data set as training data, select another 4000 patterns from the remainder as the first test data, and select another 4000 from the remainder as the second test data. For different numbers of components in the density functions estimated by the EM algorithm, the results are as follows.

• One component for each class: the overall accuracy is 76.23% for the training data, 76.08% for the first test data and 75.97% for the second test data.
• Two components for each class: the overall accuracy is 84.18% for the training data, 82.85% for the first test data and 82.58% for the second test data.
• Three components for each class: the overall accuracy is 86.63% for the training data, 84.83% for the first test data and 84.93% for the second test data.
• Four components for each class: the overall accuracy is 87.65% for the training data, 85.83% for the first test data and 86.33% for the second test data.
• Five components for each class: the overall accuracy is 87.58% for the training data, 85.68% for the first test data and 85.43% for the second test data.
• Eight components for each class: the overall accuracy is 87.88% for the training data, 85.48% for the first test data and 85.33% for the second test data.
• Ten components for each class: the overall accuracy is 87.70% for the training data, 85.15% for the first test data and 84.85% for the second test data.
• Twenty components for each class: the overall accuracy is 88.93% for the training data, 85.18% for the first test data and 85.30% for the second test data.
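A sketch of the classifier implied by this section is given below: one Gaussian mixture per class fitted by EM, combined through Bayes' theorem with class priors estimated from the training frequencies. The use of scikit-learn and the function names are our own, not the paper's.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_mixtures(X, y, n_components, seed=0):
    """Fit one GMM per class; return the models and the log class priors."""
    y = np.asarray(y)
    classes = np.unique(y)
    models = {c: GaussianMixture(n_components=n_components,
                                 random_state=seed).fit(X[y == c])
              for c in classes}
    log_priors = {c: np.log(np.mean(y == c)) for c in classes}
    return models, log_priors

def predict(X, models, log_priors):
    """Assign each pattern to the class with the largest posterior."""
    classes = sorted(models)
    scores = np.column_stack([models[c].score_samples(X) + log_priors[c]
                              for c in classes])
    return np.asarray(classes)[scores.argmax(axis=1)]
```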
The ROC curves (Figures 2-9) for the different numbers of components for each class are given below, for the training data (*) and the two test sets (+ and o); each figure plots True Positive Rate against False Positive Rate.

[Figure 2: One Centre for Both Classes]
[Figure 3: Two Centres for Both Classes]
[Figure 4: Three Centres for Both Classes]
[Figure 5: Four Centres for Both Classes]
[Figure 6: Five Centres for Both Classes]
[Figure 7: Eight Centres for Both Classes]
[Figure 8: Ten Centres for Both Classes]
[Figure 9: Twenty Centres for Both Classes]

From the above analysis, the results using two components for each class are much better than the results using one component for each class. Using more than 2 components in the EM algorithm does not dramatically increase accuracy, but around 4 components gives the best results. Even using twenty components in the EM algorithm still achieves much the same results as 4 components. Therefore at least four components should be used for each class in the EM algorithm, and a small number of components, about 4, suffices in this application. This reduces the complexity of computation and speeds up the training.
6 Conclusions
A new algorithm for determining the number of components in a mixture model using Williams' test has been presented in this paper. Intensive computational experiments have demonstrated the viability and value of the proposed method.
References
[1] Williams, D. A., A test for differences between treatment means when several dose levels are compared with a zero dose control. Biometrics, 27, 103-117, 1971.
[2] Williams, D. A., The comparison of several dose levels with a zero dose control. Biometrics, 28, 519-531, 1972.
[3] Yang, Z. R. and Zwolinski, M., Mutual information theory for adaptive models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4), 396-403, 2001.
[4] Young, T. Y. and Coraluppi, G., Stochastic estimation of a mixture of normal density functions using an information criterion. IEEE Transactions on Information Theory, 16, 258-263, 1970.
[5] Wang, W. J., Jones, P. and Partridge, D., Assessing the impact of input features in a feedforward network. Neural Computing and Applications, 9, 101-112, 1999.
[6] Carreira-Perpinan, M. A., Mode-finding for mixtures of Gaussian distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), 1318-1323, 2000.
[7] SAS/STAT Software, Release 6.12, SAS Institute Inc., 1997.
[8] Bishop, C., Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[9] Sahu, S. K. and Cheng, R. C. H., A fast distance based approach for determining the number of components in mixtures. Technical Report, University of Southampton, UK, 2001.
[10] Haykin, S., Neural Networks: A Comprehensive Foundation. Prentice Hall, 1999.
[11] Richardson, S. and Green, P. J., On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B, 59, 731-792, 1997.