Stephen P. Smith was born in Cincinnati, OH, on September 4, 1955. He received the B.S. and M.S. degrees in computer science from Michigan State University, East Lansing, in 1977 and 1979, respectively. Since 1977 he has been a Graduate Research Assistant in the Pattern Recognition and Image Processing Laboratory, Department of Computer Science, Michigan State University. During the summer of 1979 he was a consultant at Babcock and Wilcox's Lynchburg Research Center, working on problems of feature selection and image processing in a nondestructive examination environment. He is currently working toward the Ph.D. degree at Michigan State University. His current research interests include shape matching and cluster validity. Mr. Smith is a member of the Association for Computing Machinery and Phi Kappa Phi.

Anil K. Jain (S'70-M'72) was born in Basti, India, on August 5, 1948. He received the B.Tech. degree with distinction from the Indian Institute of Technology, Kanpur, India, in 1969, and the M.S. and Ph.D. degrees in electrical engineering from Ohio State University, Columbus, in 1970 and 1973, respectively. From 1971 to 1972 he was a Research Associate in the Communications and Control Systems Laboratory, Ohio State University. From 1972 to 1974 he was an Assistant Professor in the Department of Computer Science, Wayne State University, Detroit, MI. In 1974 he joined the Department of Computer Science, Michigan State University, where he is currently an Associate Professor. His research interests are in the areas of pattern recognition and image processing. He was a recipient of the National Merit Scholarship in India. Dr. Jain is a member of the Association for Computing Machinery, the Pattern Recognition Society, and Sigma Xi.

Eric Backer was born in Soestdijk, The Netherlands, on April 22, 1940. He received the M.S. and Ph.D. degrees in electrical engineering from Delft University of Technology, Delft, The Netherlands, in 1969 and 1978, respectively. Since 1967 he has been with Delft University of Technology, where he is currently a Senior Staff Member of the Information Theory Group. He is engaged in research and teaching in the areas of pattern recognition and image processing.
Dimensionality, Sample Size, Classification Error, and Complexity of Classification Algorithm in Pattern Recognition

SARUNAS RAUDYS AND VITALIJUS PIKELIS
Abstract-This paper compares four classification algorithms (discriminant functions) for classifying individuals into two multivariate populations. The discriminant functions (DF's) compared are derived according to the Bayes rule for normal populations and differ in the assumptions made on the structure of the covariance matrices. Analytical formulas for the expected probability of misclassification EP_N are derived and show that the classification error EP_N depends on the structure of the classification algorithm, the asymptotic probability of misclassification P_∞, and the ratio of the learning sample size N to the dimensionality p: N/p for all linear DF's discussed and N²/p for the quadratic DF. Tables of the learning quantity H = EP_N/P_∞, as a function of the parameters P_∞, N, and p, are presented for the four classification algorithms analyzed; they may be used for estimating the necessary learning sample size, determining the optimal number of features, and choosing the type of classification algorithm in the case of a limited learning sample size.

Index Terms-Classification error, dimensionality, discriminant functions, pattern recognition, sample size.

Manuscript received February 27, 1978; revised June 28, 1979. The authors are with the Academy of Sciences of the Lithuanian SSR, U.S.S.R.
I. INTRODUCTION

SIGNIFICANT research efforts have been made in statistical pattern recognition for the case when the learning sample size is limited. The finiteness of the sample size causes
several effects. Because of the limited sample, the parameters of a classification rule are determined inaccurately; therefore, the classification error increases (see, e.g., [1]-[3]). Finiteness of the learning sample also causes the peaking effect, which is why the problem of determining the optimal number of features arises (see, e.g., [4], [5]). In addition, with a finite sample size the estimate of the classification error becomes biased; therefore, special methods have been constructed in order to obtain unbiased estimates [6], [7]. In order to use statistical classification rules correctly (to choose the proper type of classification rule, determine the optimal number of features, and estimate the sufficient learning sample size), one must know the quantitative dependence of the classification error on the learning sample size, the number of features, the type of the classification algorithm, etc.

This article compares four classification algorithms (discriminant functions) for classifying individuals into two multivariate populations. The discriminant functions (DF's) compared are derived according to the Bayes rule for normal populations and differ in the assumptions made on the structure of the covariance matrices:

1) the quadratic discriminant function

g(X) = (X - \bar X_2)' S_2^{-1} (X - \bar X_2) - (X - \bar X_1)' S_1^{-1} (X - \bar X_1) + \ln(|S_2|/|S_1|) + k;   (1)

2) the standard linear DF

g(X) = [X - \tfrac{1}{2}(\bar X_1 + \bar X_2)]' S^{-1} (\bar X_1 - \bar X_2);   (2)

3) the linear DF for independent measurements

g(X) = [X - \tfrac{1}{2}(\bar X_1 + \bar X_2)]' D^{-1} (\bar X_1 - \bar X_2);   (3)

4) a Euclidean distance classifier

g(X) = [X - \tfrac{1}{2}(\bar X_1 + \bar X_2)]' (\bar X_1 - \bar X_2).   (4)
In formulas (1)-(4), X denotes the p-variate observation vector to be classified; X̄_1 and X̄_2 stand for the sample estimates of the population means μ_1 and μ_2; S_1 and S_2 are the sample estimates of the population covariance matrices (CM's) Σ_1 and Σ_2; S is the pooled sample CM, S = (S_1 + S_2)/2; and D is the diagonal matrix constructed from the diagonal elements of S.

The quadratic DF (1) is asymptotically the optimal classification rule for classifying into two normal populations; DF (2) is asymptotically optimal for classifying into normal populations with a common CM, etc. In the nonasymptotic case, when the learning sample size is limited, the classification rules (1)-(4) are not optimal, and when applied to the same pattern recognition problem they lead to different outcomes. In order to clarify which classification algorithm should be used for a given sample size, dimensionality, and other characteristics of the problem, the dependence of the classification error on these characteristics must be investigated. (A computational sketch of the four rules is given below.)
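For concreteness, the four decision rules can be written out computationally. The following is a minimal sketch, assuming the learning samples are stored as NumPy arrays with one p-variate observation per row; the function names and the convention that a positive g assigns X to the first population are assumptions of this sketch, and the threshold constant k in (1) is left as a free argument.

```python
# Hypothetical sketch of rules (1)-(4); not the authors' code.
import numpy as np

def fit_statistics(X1, X2):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)      # sample mean vectors
    S1 = np.cov(X1, rowvar=False)                  # sample covariance matrices
    S2 = np.cov(X2, rowvar=False)
    S = 0.5 * (S1 + S2)                            # pooled sample CM
    D = np.diag(np.diag(S))                        # diagonal part of S
    return m1, m2, S1, S2, S, D

def g_quadratic(x, m1, m2, S1, S2, k=0.0):         # DF (1)
    d1, d2 = x - m1, x - m2
    return (d2 @ np.linalg.solve(S2, d2) - d1 @ np.linalg.solve(S1, d1)
            + np.log(np.linalg.det(S2) / np.linalg.det(S1)) + k)

def g_linear(x, m1, m2, S):                        # DF (2)
    return (x - 0.5 * (m1 + m2)) @ np.linalg.solve(S, m1 - m2)

def g_diagonal(x, m1, m2, D):                      # DF (3)
    return (x - 0.5 * (m1 + m2)) @ ((m1 - m2) / np.diag(D))

def g_euclidean(x, m1, m2):                        # DF (4)
    return (x - 0.5 * (m1 + m2)) @ (m1 - m2)

# A positive value of g assigns x to the first population.
```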
Before analyzing the relationship between the classification error and the specific characteristics of a pattern recognition problem, let us define several kinds of probability of misclassification (PMC). If the underlying probability density functions are known
to the investigator, he may construct an optimal Bayes classifier. Its performance (PMC) will be denoted P_B and referred to as the Bayes PMC. When a classifier is designed on a learning sample, the PMC depends on the characteristics of this particular sample. The PMC may then be regarded as a random variable P_N whose distribution depends on the learning sample size; this PMC will be called the conditional PMC [6]. Its expectation EP_N over all learning samples of size N_1, N_2 patterns (from the first and second classes) will be regarded as the expected PMC. The theoretical limit P_∞ = lim_{N→∞} EP_N is called the asymptotic PMC.

The conditional PMC P_N depends both on the type of the classification rule and on the particular learning sample. Several properties of the probabilities of misclassification defined above may be derived without detailed analytic investigation. The Bayes PMC P_B depends only on the distribution density functions of the measurements and does not depend on a learning sample. The asymptotic PMC P_∞ is a characteristic of the type of classification rule. If the classification is into two multivariate normal populations, the asymptotic PMC of the quadratic DF (1) coincides with the Bayes PMC P_B; for the other discriminant functions the asymptotic probabilities of misclassification may exceed P_B. It is obvious that P_N ≥ P_B, and one may hope that P_N ≥ P_∞; however, the latter inequality is guaranteed only for asymptotically optimal classification rules. If the rule is not asymptotically optimal, we may sometimes observe P_B ≤ P_N < P_∞. For asymptotically optimal classifiers the expected PMC EP_N and the difference EP_N - P_∞ diminish as the learning sample size increases. In experiments with real data the inequality EP_N > P_∞ usually holds for asymptotically nonoptimal rules, too. However, it is possible to construct models of the true populations in which, for some values of N, EP_N is less than P_∞ (e.g., when the populations consist of clusters with unequal probabilities).

Some efforts have been made to obtain the quantitative dependence of the classification error EP_N on the learning sample size. John [2] considered the standard linear DF (2) in the case when the covariance matrix is known, which is equivalent to investigating the Euclidean distance classifier (4) in the case of spherically normal populations. John derived an approximate and an exact formula for the expected PMC; in the exact formula EP_N was expressed as an infinite sum of incomplete beta functions and was practically uncomputable. Analogous double sums for the expected PMC were later derived by Troicky [8] and Moran [9]. The standard linear DF (2) has been studied extensively. R. Sitgreaves [10] derived the exact formula for EP_N in the form of a fivefold infinite sum of products of certain gamma functions; her derivation was based on A. H. Bowker's representation of DF (2) in terms of simple statistics [11]. Unfortunately, the formula of Sitgreaves was practically uncomputable. In his unpublished work [5], S. E. Estes reduced this formula to a form suitable for numerical calculation. He presented curves of the dependence of EP_N on the ratio N/p for some particular dimensionalities p and PMC's P_∞ (he assumed N_1 = N_2 = N). However, his results were not published and remained practically unknown even in his own country. On the other hand, Estes' algorithm had some
shortcomings which resulted in low accuracy for some particular values of the parameters. At about the same time, several asymptotic expansions [3], [12], [13] and approximate formulas [14]-[16] for DF (2) were published. The accuracy of these formulas was not known; therefore, the problem of their accuracy is still under consideration [17]. The behavior of classification rules (1) and (3) has been investigated only by means of simulation [18], [19].

In our papers [20]-[25] we derived formulas for the expected PMC EP_N for DF's (1) and (4) in the form of integrals, for the standard linear DF (2) in the form of an improved version of Estes' [5] sum, and for DF (3) EP_N was studied by approximate formulas and by means of simulation. Quantitative and qualitative relationships between the expected PMC, the learning sample size, the dimensionality, the Mahalanobis distance, and the complexity of the classification algorithm were obtained in the form of tables and simple asymptotic formulas. The purpose of this publication is to present the results mentioned above to English-speaking readers.

II. FORMULAS AND TABLES FOR THE EXPECTED PROBABILITY OF MISCLASSIFICATION

The expected PMC is defined by

EP_N = \int_0^1 P_N f(P_N)\, dP_N   (5)
where f(P_N) stands for the probability distribution density function of the random variable P_N. Derivation of the density function f(P_N) is complicated. An alternate, frequently used expression is

EP_N = \sum_{i=1}^{2} q_i\, P\{(-1)^i g(X, \hat\theta) > 0 \mid X \in \Pi_i\}   (6)
where q_i denotes the prior probability of the ith population Π_i and θ̂ stands for the random vector of population parameter estimates; in the case of the quadratic DF, θ̂ consists of the vectors X̄_1, X̄_2 and the matrices S_1, S_2. Since g(X, θ̂) depends on a random vector, the calculation of EP_N requires multivariate statistical analysis techniques. The discriminant function g(X, θ̂) is represented as a function of several independent scalar random variables. The derivation of the expression is presented in [20]-[25]; its principal points are given in the Appendix.

The formulas for the calculation of the expected probability of misclassification are rather complex and unsuitable for everyday practical use. It is more convenient to use tabulated values of EP_N. Therefore, we present the values of the ratio H = EP_N/P_∞ as a function of the learning sample size N = N_1 = N_2, the dimensionality p, and the Mahalanobis distance δ in Table I. The expected PMC characterizes the classification performance, while the ratio H = EP_N/P_∞, which we shall call the learning quantity, characterizes the accuracy of determining the coefficients of the discriminant function. The table contains the values of H for the four classification rules (1)-(4) investigated. The values in Table I are valid:

1) for rules (1)-(4), in the case of spherically normal populations;
2) for rules (1)-(3), if the measurements are independent and normally distributed;
3) for rules (1) and (2), if the populations are normal with a common covariance matrix.

It should be noted that the computation of the EP_N values is rather complicated, so the accuracy of our table for rule (1) is not high (when δ = 5.5 and N/p = 1.6-2, the error may reach a few percent). Most accurate are the H values for rules (2) and (4), where all the decimal digits are correct.
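As an illustration of definition (6) and of how entries of Table I could in principle be checked, the following sketch estimates EP_N and H for the Euclidean distance classifier (4) on spherically normal populations by direct Monte Carlo simulation. This is not the authors' analytical derivation; the sample values of δ, p, and N and the helper names are illustrative assumptions, and the asymptotic PMC Φ(-δ/2) used for P_∞ holds for rule (4) only under the spherical normal model.

```python
# Hypothetical Monte Carlo check of H = EP_N / P_inf for DF (4); not from the paper.
import numpy as np
from scipy.stats import norm

def euclidean_df_error(delta, p, N, n_runs=500, n_test=4000, seed=None):
    """Average test error of classifier (4) over random learning samples."""
    rng = np.random.default_rng(seed)
    mu1 = np.zeros(p)
    mu2 = np.zeros(p); mu2[0] = delta              # Mahalanobis distance = delta
    errors = []
    for _ in range(n_runs):
        X1 = rng.standard_normal((N, p)) + mu1     # learning sample, class 1
        X2 = rng.standard_normal((N, p)) + mu2     # learning sample, class 2
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)  # sample means
        w = m1 - m2                                # g(X) = [X - (m1+m2)/2]' (m1 - m2)
        b = -0.5 * (m1 + m2) @ w
        T1 = rng.standard_normal((n_test, p)) + mu1   # independent test samples
        T2 = rng.standard_normal((n_test, p)) + mu2
        err1 = np.mean(T1 @ w + b < 0)             # class-1 patterns assigned to class 2
        err2 = np.mean(T2 @ w + b > 0)             # class-2 patterns assigned to class 1
        errors.append(0.5 * (err1 + err2))
    return float(np.mean(errors))

delta, p, N = 2.56, 8, 24                          # illustrative values only
P_inf = norm.cdf(-delta / 2)                       # asymptotic PMC of (4), spherical case
EP_N = euclidean_df_error(delta, p, N)
print(f"P_inf = {P_inf:.4f}, EP_N ~ {EP_N:.4f}, H ~ {EP_N / P_inf:.3f}")
```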
III. SIMULATION EXPERIMENTS

The values of the learning quantity H = EP_N/P_∞ presented in Table I correspond to an ideal case: spherically normal distributions. In practical cases the populations are not spherically normal. It is possible to construct theoretical models of populations for which the value of H would be significantly greater or smaller than the values presented in Table I; for non-Gaussian data H may even be less than 1. We hope, however, that in practice such cases will be rare.

In order to estimate the deviation of observed H values from the tabulated ones, some experiments were made. We used four sets of real data. In each experiment a random learning sample (of N vectors per class) was drawn from the total sample, consisting of N_m vectors from each class. The parameters of all the DF's investigated were estimated, and the vectors of the total sample (excluding the vectors of the learning sample) were classified. The expected PMC was estimated as the average of the error rates (conditional PMC's) over 10-100 runs. The results of the experiments with real data are presented in the upper rows of Table II; in the lower rows, the values of H calculated analytically for the corresponding values of N, p, and δ (for spherical populations) are given. The real data used in these experiments were not specially selected, and they differ from spherically normal data rather significantly [24]. As seen from Table II, in most cases the difference between the experimental and analytical values of H was not great. An exception is data set 3, where N_m is comparatively small and, due to the nonnormality of the data, the asymptotic PMC of the quadratic DF was six times greater than that of the linear DF.

IV. DISCUSSION

The complex analytical formulas for the expected PMC EP_N as a function of N, p, and δ do not show the relationship between these four factors directly. For qualitative analysis the asymptotic expressions for EP_N are very useful. We have investigated the asymptotics when the learning sample size N and the dimensionality p tend to infinity simultaneously. Then

EP_N = \Phi\{-\delta / (2\sqrt{\alpha})\}   (7)
where Φ denotes the standard normal distribution function and the coefficient α depends on the type of the classification rule. The expressions of α for the four classification rules investigated are presented in Table III [15]. The fifth DF in that table is designed for classification into two normal populations with independent measurements:
g(X) = (X - \bar X_2)' D_2^{-1} (X - \bar X_2) - (X - \bar X_1)' D_1^{-1} (X - \bar X_1) + \ln(|D_2|/|D_1|)   (8)

where D_1 and D_2 stand for the sample estimates of the diagonal dispersion matrices.
TABLE I
THE VALUES OF THE RATIO H = EP_N/P_∞ AS A FUNCTION OF LEARNING SAMPLE SIZE N, DIMENSIONALITY p, AND MAHALANOBIS DISTANCE δ
[Tabulated H values for discriminant functions (1)-(4) at δ = 1.68, 2.56, 3.76, 4.65, and 5.50 over a range of sample sizes N and dimensionalities p; the numerical entries are not reproduced here.]
TABLE II
COMPARISON OF THEORETICAL AND EXPERIMENTAL VALUES OF THE RATIO H = EP_N/P_∞
[Experimental (upper rows) and analytically calculated (lower rows) values of H for discriminant functions 1-4 on four real data sets: Data 1 (p = 5, N_m = 500), Data 2 (p = 5, N_m = 600), Data 3 (p = 32, N_m = 300), and Data 4 (p = 6, N_m = 600), for several learning sample sizes N; the individual entries are not reproduced here.]
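For reference, the hold-out protocol of Section III, whose results Table II summarizes, might be sketched as follows. This is an assumed reconstruction rather than the authors' code: data1 and data2 stand for the N_m available vectors of each class, and fit and classify stand for any of the rules (1)-(4).

```python
# Hypothetical sketch of the Section III hold-out protocol; not the authors' code.
import numpy as np

def estimate_expected_error(data1, data2, N, fit, classify, runs=50, seed=None):
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(runs):
        idx1 = rng.permutation(len(data1))
        idx2 = rng.permutation(len(data2))
        learn1, test1 = data1[idx1[:N]], data1[idx1[N:]]  # N learning vectors per class;
        learn2, test2 = data2[idx2[:N]], data2[idx2[N:]]  # the remaining ones are classified
        params = fit(learn1, learn2)
        err1 = np.mean(classify(test1, params) != 1)      # class-1 vectors misassigned
        err2 = np.mean(classify(test2, params) != 2)      # class-2 vectors misassigned
        errors.append(0.5 * (err1 + err2))
    return float(np.mean(errors))  # estimate of EP_N; dividing by P_inf gives H
```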
TABLE III
THE QUALITATIVE CHARACTERISTICS OF THE FIVE CLASSIFICATION RULES COMPARED
[For each DF (1)-(5): the population parameters estimated from the samples (means, variances, correlation coefficients), the total number of estimated parameters, the minimum N needed to estimate the parameters of the DF, the number of coefficients of the DF, and the coefficient α in expression (7); the individual entries are not reproduced here.]
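One qualitative characteristic compared in Table III, the number of parameters each rule must estimate, follows directly from definitions (1)-(4) and (8) by counting means, variances, and correlation coefficients. The sketch below performs this counting; it is a counting exercise based on the definitions, not a transcription of the table's entries.

```python
# Counting sketch (derived from definitions (1)-(4) and (8), not copied from Table III):
# number of parameters each rule estimates from the learning samples, for dimensionality p.
def parameter_counts(p):
    return {
        "DF 1 (quadratic)":          2 * p + 2 * p * (p + 1) // 2,  # two means, two full CM's = p(p+3)
        "DF 2 (standard linear)":    2 * p + p * (p + 1) // 2,      # two means, one pooled CM
        "DF 3 (diagonal linear)":    2 * p + p,                     # two means, p variances
        "DF 4 (Euclidean)":          2 * p,                         # two means only
        "DF 5 (diagonal quadratic)": 2 * p + 2 * p,                 # two means, two sets of variances
    }

print(parameter_counts(12))
```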
The numerical values of the ratio H = EP_N/P_∞ and the asymptotic formulas for the expected PMC show that the classification performance EP_N and the learning quantity H depend mainly on the asymptotic PMC, the ratio N/p, and the complexity of the classification rule. For small values of N/p the classification error increases considerably, e.g., 8.8 times if p = 20, N = 40, P_∞ = 0.01, and the quadratic DF (1) is used. In this case an increase in the sample size N from 40 to 400 may reduce the classification error from 0.088 to 0.0126, i.e., approximately seven times. Since the learning quantity depends on the DF type, the learning sample size must be determined individually for each discriminant function. This is clearly seen from Fig. 1, which presents curves showing the dependence of the learning sample size N on the dimensionality p for fixed asymptotic and expected probabilities of misclassification. Fig. 1 and the asymptotic formulas from Table III confirm that there is a direct relationship between N and p: for classification rules (2)-(4) and (8) the relationship is linear; for DF (1) it is quadratic (only for large values of p). This conclusion was pointed out earlier by A. Deev [13] for the standard linear DF (2) and obtained experimentally by Pipberger [26] for the quadratic DF.

The values of the learning quantity H presented in Table I, or a graph similar to Fig. 1, allow us to estimate the learning sample size sufficient to achieve the desired accuracy in estimating the DF coefficients. For example, let the number of features be p = 12 and suppose the asymptotic PMC is P_∞ ≥ 0.01. Then from Table I we find that the requirement that the relative increase in the classification error should not exceed 1.5 times is fulfilled when we have 120 learning vectors from each class for the quadratic DF (1), 60 for the standard linear DF (2), and only 12 for the Euclidean distance classifier (4). The procedure is sketched below.

Another problem caused by the finiteness of the learning sample size concerns the choice of the type of classification rule [27]. In Fig. 2, curves showing EP_N versus N for two values of the asymptotic PMC P_∞ are presented. The expected PMC decreases in an exponential way, and the rate of decrease depends on the complexity of the classification rule. In the case when the asymptotic PMC's of all the classification rules are the same, simple classification rules are preferable.
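The sample-size selection just described can be mechanized once values of H are available, whether from Table I or from a simulation such as the one sketched in Section II. The following is a hypothetical helper, not part of the original paper; estimate_H is an assumed callable returning H for given N, p, and δ.

```python
# Hypothetical helper (not from the paper): smallest learning sample size N per class
# for which the learning quantity H = EP_N / P_inf does not exceed a target (1.5 above).
def smallest_sufficient_N(estimate_H, p, delta, target=1.5,
                          candidates=(4, 8, 12, 24, 60, 120, 240, 600)):
    # Note: for rules (1) and (2) only candidates large enough to make the sample
    # covariance matrices invertible (roughly N > p) are meaningful.
    for N in sorted(candidates):
        if estimate_H(N=N, p=p, delta=delta) <= target:
            return N
    return None  # none of the candidate sizes meets the target
```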
The exact values presented in Table I may also be used to check the accuracy of approximate expressions for the misadjustment of an adaptive least-mean-square algorithm derived by Smith [32], and of the asymptotic expansions and approximate expressions for the expected PMC [3], [12]-[16]. For example, by using Table I we found that Deev's asymptotic expansion is much better than the well-known Okamoto expansion (see [33]), especially for small values of p and N.

APPENDIX
FORMULAS FOR THE EXPECTED PMC

A. Euclidean Distance Classifier [20], [22]

The expected PMC EP_N may be expressed in the following way:
EP_N = q_1 P(g(X)