Differential theory of learning for efficient neural network pattern recognition

J. B. Hampshire II    B.V.K. Vijaya Kumar
Department of Electrical & Computer Engineering
Carnegie Mellon University, Pittsburgh, PA 15213-3890
[email protected]

[email protected]

Reprinted (with minor changes) from the Proceedings of the 1993 SPIE International Symposium on Optical Engineering and Photonics in Aerospace and Remote Sensing, vol. 1966: Science of Artificial Neural Networks, D. Ruck, ed., pp. 76-95, April 1993.

ABSTRACT

We describe a new theory of differential learning by which a broad family of pattern classifiers (including many well-known neural network paradigms) can learn stochastic concepts efficiently. We describe the relationship between a classifier's ability to generalize well to unseen test examples and the efficiency of the strategy by which it learns. We list a series of proofs that differential learning is efficient in its information and computational resource requirements, whereas traditional probabilistic learning strategies are not. The proofs are illustrated by a simple example that lends itself to closed-form analysis. We conclude with an optical character recognition task for which three different types of differentially generated classifiers generalize significantly better than their probabilistically generated counterparts.

1 DIFFERENTIAL LEARNING



A differentiable supervised classifier is one that learns an input-to-output mapping by adjusting a set of internal parameters θ via an iterative search aimed at optimizing a differentiable objective function (or empirical risk measure). Many well-known neural network paradigms are therefore differentiable supervised classifiers. The objective function is a metric that evaluates how well the classifier's evolving mapping from feature vector space X to classification space Ω reflects the empirical relationship between the input patterns of the training sample and their class membership. Each one of the classifier's discriminant functions gi(X | θ) is a differentiable function of its parameters θ. We assume that there are C of these functions, corresponding to the C classes (Ω = {ω1, …, ωC}) that the feature vector X can represent. These C functions are collectively known as the discriminator D(X | θ) (see figure 1). Thus, the discriminator has a C-dimensional output Y with elements y1 = g1(X | θ), …, yC = gC(X | θ). The classifier's output ωj = Γ(Y) ∈ Ω is simply the class label corresponding to the largest discriminator output, as shown in figure 1.

Below we describe two fundamental strategies for supervised learning: the probabilistic strategy seeks to learn class (or concept) probabilities by optimizing a likelihood function or an error-measure objective function; the differential strategy is discriminative and seeks only to identify the most likely class by optimizing a classification figure-of-merit (CFM) objective function (see below). CFM objective functions are best described as differentiable approximations to a counting function: they count the number of correct classifications (or, equivalently, the number of incorrect classifications) the classifier makes on the training sample.
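To make the counting-function idea concrete, the following sketch contrasts the exact (non-differentiable) count of correct classifications with one differentiable surrogate built from a sigmoid of each example's discriminant differential. This is only an illustrative stand-in: the paper's synthetic CFM objective is defined in appendix A, and the sigmoid form and smoothing constant used here are assumptions.

```python
import numpy as np

def count_correct(outputs, labels):
    # Exact counting function: number of training examples classified correctly.
    return int(np.sum(np.argmax(outputs, axis=1) == labels))

def cfm_surrogate(outputs, labels, alpha=10.0):
    # Differentiable stand-in: a sigmoid of each example's discriminant
    # differential (correct-class output minus the largest other output).
    outputs = np.asarray(outputs, dtype=float)
    n = outputs.shape[0]
    correct = outputs[np.arange(n), labels]
    masked = outputs.copy()
    masked[np.arange(n), labels] = -np.inf
    delta = correct - masked.max(axis=1)
    return float(np.mean(1.0 / (1.0 + np.exp(-alpha * delta))))
```

As alpha grows, the surrogate approaches the fraction of correct classifications on the training sample while remaining differentiable in the classifier's parameters.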

The Bayes-optimal classifier is one that always associates X with its most likely class, thereby assuring the minimum probability of making a classification error (e.g., [5, ch. 2]). Any classifier that classifies in this manner is said to yield Bayesian discrimination; equivalently, it is said to exhibit the Bayes error rate (i.e., the minimum probability of misclassification).

[Figure 1 diagram: the discriminator D(X | θ) with outputs y1 = g1(X | θ), …, yC = gC(X | θ) and decision rule Γ(Y) = ωi such that yi = max_j yj.]

Figure 1: A diagrammatic view of the classifier and its associated functional mappings. The classifier input is a feature vector X; the C discriminator outputs y1, …, yC correspond to the C classes that X can represent; the class label ωj assigned to the input feature vector corresponds to the discriminator's largest output. The figure is based on figure 2.3 of Duda & Hart [5].

Since the likelihood of the ith class ωi for a particular value of the feature vector X is given by the a posteriori probability P(ωi | X), one way the classifier will yield Bayesian discrimination is if its discriminant functions equal their corresponding a posteriori class probabilities. We refer to these C a posteriori class probabilities {P(ω1 | X), …, P(ωC | X)} as the probabilistic form of the Bayesian discriminant function F(X)Bayes-Probabilistic. Probabilistic learning is the process by which the classifier's discriminant functions learn F(X)Bayes-Probabilistic. As the training sample size n grows large, the empirical a posteriori class probabilities converge to their true values; if the classifier's discriminant functions possess sufficient functional complexity¹ to learn F(X)Bayes-Probabilistic precisely, then

$\lim_{n \to \infty} g_i(\mathbf{X} \mid \boldsymbol{\theta}) \;=\; P(\omega_i \mid \mathbf{X}).$   (1)
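As a quick illustration of (1), the sketch below fixes a single feature value with hypothetical class posteriors and shows that the MSE-minimizing constant outputs (one per class, with one-hot targets) are the empirical class frequencies, which converge to the true posteriors as the sample grows. The posterior values and sample size are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

true_posteriors = np.array([0.2, 0.7, 0.1])   # hypothetical P(omega_i | X) at one fixed X
n = 100_000
labels = rng.choice(3, size=n, p=true_posteriors)

# With one-hot targets, the constant output minimizing mean-squared error for
# each class is the mean target, i.e. the empirical class frequency.
targets = np.eye(3)[labels]
mse_optimal_outputs = targets.mean(axis=0)
print(mse_optimal_outputs)   # approaches [0.2, 0.7, 0.1] as n grows, as in (1)
```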

Figure 2 (left) shows F(X)Bayes-Probabilistic for a 3-class random (scalar) feature² x. A bar-graph below the a posteriori class probabilities of x depicts the class label that the Bayes-optimal classifier assigns to x over its effective domain. Note that the Bayes-optimal class label corresponds to the largest a posteriori class probability for each value of x. The right side of figure 2 depicts an equivalent albeit different form of the Bayesian discriminant function {Δ(ω1 | X), …, Δ(ωC | X)}, which we call the differential form of the Bayesian discriminant function F(X)Bayes-Differential. It is derived from F(X)Bayes-Probabilistic via the C stochastic linear transformations

$\Delta(\omega_i \mid \mathbf{X}) \;=\; P(\omega_i \mid \mathbf{X}) \;-\; \max_{k \neq i} P(\omega_k \mid \mathbf{X}),$   (2)

where Δ(ωi | X) denotes the ith a posteriori class differential. Note that the Bayes-optimal class label corresponds to the positive a posteriori class differential for each value of x. Differential learning is the process by which the classifier's discriminant functions learn F(X)Bayes-Differential. Specifically, as the training sample size n grows large, the empirical a posteriori class differentials converge to their true values; if the classifier's discriminant functions possess sufficient functional complexity to learn F(X)Bayes-Differential to at least one (sign) bit of precision, then

$\lim_{n \to \infty} \operatorname{sign}\Big[\, \underbrace{g_i(\mathbf{X} \mid \boldsymbol{\theta}) - \max_{j \neq i} g_j(\mathbf{X} \mid \boldsymbol{\theta})}_{\delta_i(\mathbf{X} \mid \boldsymbol{\theta})} \,\Big] \;=\; \operatorname{sign}\big[\, \Delta(\omega_i \mid \mathbf{X}) \,\big].$   (3)

1 A formal definition of functional complexity is beyond the scope of this paper. In simple terms, there is a limit to the intricacy of the mapping from feature vector space to classification space implemented by a classifier with limited functional complexity.
2 In this case x is a scalar; we use the notation X and x interchangeably to emphasize that our comments pertain to the general N-dimensional feature vector X.


Figure 2: Left: The a posteriori class probabilities P(ωi | x) of a three-class random variable x. These constitute the probabilistic form of the Bayesian discriminant function F(X)Bayes-Probabilistic. Right: The a posteriori class differentials Δ(ωi | x) = P(ωi | x) − max_{k≠i} P(ωk | x) of the same three-class random variable. These constitute the differential form of the Bayesian discriminant function F(X)Bayes-Differential. Note that where Δ(ωi | x) is positive, ωi is the Bayes-optimal class label for x. Both: The discriminant functions (left) and discriminant differentials (right) of a minimum-complexity Bayes-optimal polynomial classifier are superimposed on their corresponding forms of the Bayesian discriminant function. Note that when discriminant function gi(x | θ) is largest, its discriminant differential δi(x | θ) is positive, a relationship that reflects the one between F(X)Bayes-Probabilistic and F(X)Bayes-Differential, as described by (3).


The classifier's ith discriminant differential δi(X | θ) in (3) is the difference between the discriminator's ith output and the largest other output. Thus, differential learning ensures that the sign of the classifier's ith discriminant differential matches the sign of the ith a posteriori class differential for large training sample sizes. Simply put, differential learning assures that the discriminator's largest output corresponds to the most likely class for every value of X, a less restrictive condition on the classifier's discriminant functions than that imposed by probabilistic learning via (1).
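The sketch below, using hypothetical posterior values, computes the a posteriori class differentials of (2) and checks the relationship stated above: at each point, the class with the largest posterior is the one whose differential is positive.

```python
import numpy as np

def class_differentials(posteriors):
    # posteriors: array of shape (n_points, C); returns Delta(omega_i | x) of eq. (2).
    C = posteriors.shape[1]
    deltas = np.empty_like(posteriors)
    for i in range(C):
        others = np.delete(posteriors, i, axis=1)
        deltas[:, i] = posteriors[:, i] - others.max(axis=1)
    return deltas

# Hypothetical posteriors at two feature values.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3]])
D = class_differentials(P)
# Exactly one differential is positive per row, and it marks the Bayes-optimal class.
assert np.array_equal(P.argmax(axis=1), D.argmax(axis=1))
assert np.all((D > 0).sum(axis=1) == 1)
```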

The discriminant functions superimposed on figure 2 (left) and their corresponding discriminant differentials (superimposed on the right) illustrate the fundamental differences between probabilistic and differential learning. Assuming that each discriminant function is a polynomial in x, it would take high-order polynomials to model each of the three a posteriori probabilities (left, in gray) over the effective domain of x, which is the objective of probabilistic learning. Since differential learning requires only that the discriminant function representing the most likely class be larger than all other discriminant functions, we can use simple discriminant functions of x for the learning/recognition task. It is easy to verify that the polynomial classifier with two linear discriminant functions and one constant discriminant function has the lowest functional complexity sufficient for Bayesian discrimination of x (here the complexity measure is the order of the polynomials). Figure 2 shows one minimum-complexity classifier's discriminant functions: the classifier partitions feature space in the Bayes-optimal fashion. Probabilistic learning fails to generate a Bayes-optimal classifier with this minimum-complexity choice of polynomial discriminant functions (see [10]).


Figure 3: The discriminant bias and discriminant variance of three different hypothetical classifier paradigms, as determined by an oracle over an infinite number of independent learning trials. The training sample size n is the same finite number for each trial, so each classifier's error rate varies across trials. Left: this classifier has high discriminant bias, so on average its error rate is significantly higher than the Bayes error rate Pe(Bayes). Additionally, its high discriminant variance indicates that its error rate fluctuates substantially across independent trials. As a result, its mean-squared discriminant error (MSDE) is high. Middle: this classifier has high discriminant bias, so on average its error rate is significantly higher than the Bayes error rate. However, its low discriminant variance indicates that it is a more consistent classifier than the one on the left; as a result, its MSDE is lower and it is preferable to the classifier on the left. Right: this classifier has low discriminant bias and low discriminant variance. As a result it yields a consistently good approximation to the Bayes error rate. Its MSDE is therefore low.


1.1 Efficient learning and generalization

A classifier is said to generalize well if it discriminates test examples of the feature vector with the same empirical probability of error it exhibits on the training sample (i.e., the set of examples with which it is trained). We can develop an estimation-theoretic measure of generalization if we view the classifier's error rate as an estimator of the Bayes error rate. If, from this perspective, the classifier is efficient, it generalizes well. As an example, consider the error rates of three different hypothetical classifiers that learn to perform the same pattern recognition task. Each classifier therefore represents a different estimator of the Bayes-optimal classifier. This is a thought experiment in which we imagine that the classifiers learn repeatedly over an infinite number of independent trials. In each trial all three classifiers learn the same training sample of size n (n is finite), and are subsequently tested by an oracle. The training sample for each trial is drawn independently of all other training samples. The true error rate for each classifier is determined by the oracle and recorded at the end of each trial; the results for all trials are compiled. Figure 3 summarizes the error rates of the three classifiers. Because the training sample size n is the same finite number for each trial, each classifier's posterior parameterization (and, as a result, its error rate) varies from trial to trial. This variance is depicted by the bars of the whisker plots in figure 3. Specifically, the discriminant variance is proportional to the square of the distance between the upper and lower bounds in each plot; the discriminant bias of each classifier is equal to the distance between the mean value of its whisker plot (denoted by the dot) and the horizontal line denoting the Bayes error rate Pe(Bayes).


The classifier's discriminant bias is the difference between its expected error rate and the Bayes error rate Pe(Bayes) (the expectation is taken over the joint distribution of all training samples of size n and all initial classifier parameterizations). The classifier's discriminant variance is the variance in its error rate across trials. The classifier's mean-squared discriminant error (MSDE) is its squared discriminant bias plus its discriminant variance. Formal definitions of these quantities are given in [7, ch. 3]. The classifier on the left in figure 3 is a poor estimator of the Bayes-optimal classifier because it exhibits both high discriminant bias and high discriminant variance. This means that 1) on average the classifier's error rate is much greater than the Bayes error rate (high discriminant bias), and 2) the classifier's error rate varies significantly across trials (high discriminant variance). As a result, the classifier exhibits high MSDE. The classifier in the middle is a somewhat better estimator of the Bayes-optimal classifier because, although it exhibits the same high discriminant bias as its counterpart on the left, its error rate is more consistent across trials. As a result, it exhibits lower discriminant variance and lower MSDE. The classifier on the right is a good estimator of the Bayes-optimal classifier because it exhibits low discriminant bias and its error rate is consistent across trials. As a result it exhibits low MSDE. The relatively efficient classifier exhibits the lowest MSDE possible, given the training sample size n and the set of discriminant functions g1(X | θ), …, gC(X | θ). The reader will note that our definition of good generalization requires that the classifier's error rate be both consistent and close to the Bayes error rate. The typical definition of good generalization requires only that the classifier's error rate be consistent (i.e., that the classifier's discriminant variance be low).
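For concreteness, here is a small sketch of how these three quantities could be estimated empirically from the oracle's per-trial error rates; the trial values and Bayes rate below are hypothetical.

```python
import numpy as np

def discriminant_stats(error_rates, bayes_rate):
    # error_rates: the true error rate recorded by the oracle in each trial.
    e = np.asarray(error_rates, dtype=float)
    dbias = e.mean() - bayes_rate        # discriminant bias
    dvar = e.var()                       # discriminant variance
    msde = dbias ** 2 + dvar             # mean-squared discriminant error
    return dbias, dvar, msde

# Hypothetical trials for a classifier whose error rate hovers near 3% when
# the Bayes error rate is 2%.
print(discriminant_stats([0.031, 0.029, 0.034, 0.028], bayes_rate=0.02))
```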

1.2 Outline of proofs regarding the efficiency of differential learning

The differentiable supervised classifier can be viewed as a Bayesian learning paradigm because its discriminator's initial (or prior) parameterization is transformed to a posterior parameterization during learning. Given a particular training sample size n, a particular choice of discriminant functions (or hypothesis class [16]), and a particular initial parameterization, the transformation depends entirely on the learning strategy employed. An efficient learning strategy generates the relatively efficient classifier described above for both small and large training sample sizes: whatever the choice of discriminant functions, the efficient learning strategy produces the classifier with the lowest MSDE allowed by that choice (recall that MSDE is an expectation taken over the joint distribution of all training samples of size n and all initial parameterizations, so the only variables affecting the expectation are the choice of discriminant functions and the learning strategy). An asymptotically efficient learning strategy requires large training sample sizes to guarantee the relatively efficient classifier. Rigorous proofs regarding the asymptotic efficiency of differential learning and the inefficiency of probabilistic learning are given in [7, ch's. 2-3]. The following is a summary:

     

- Classifiers that learn by minimizing error-measure objective functions (e.g., mean-squared error, the Kullback-Leibler information distance, etc.) learn probabilistically. Again, the classifier that learns probabilistically attempts to learn the a posteriori probabilities of the feature vector over its domain.
- Learning probabilistically by minimizing error-measure objective functions generally does not minimize the classifier's MSDE. As a result, probabilistic learning is almost always inefficient (special circumstances may exist in which probabilistic learning generates the minimum-MSDE classifier; these circumstances, which are both rare and easily recognized, are described in [7, ch's. 3-4]).
- Classifiers that learn by maximizing a classification figure-of-merit objective function (CFM; appendix A) learn differentially. Again, the classifier that learns differentially attempts to learn only the most likely class of the feature vector over its domain.
- Learning differentially by maximizing the synthetic CFM objective function described in appendix A minimizes the classifier's MSDE for large training sample sizes. As a result, differential learning is asymptotically efficient.
- Learning differentially by maximizing the synthetic CFM objective function described in appendix A usually minimizes the classifier's MSDE for small training sample sizes as well. As a result, differential learning is almost always efficient (special circumstances may exist in which this is not the case; these circumstances are described in [7, ch's. 3-4]).
- Learning differentially by maximizing the synthetic CFM objective function described in appendix A requires discriminant functions with the least functional complexity (e.g., the fewest parameters) necessary for Bayesian discrimination.


1.3 Summary of differential learning applications

We have applied differential learning to several real-world machine learning/pattern recognition tasks associated with optical character recognition, airborne remote sensing imagery interpretation, medical diagnosis/decision support, and digital telecommunications. In each task the differentially generated classifier generalizes better than its probabilistically generated counterpart. The discrimination improvements range from moderate to significant, depending on the statistical nature of the learning task and its relationship to the functional basis of the classifier used. Reference [7] gives detailed summaries for each application domain. In general, differential learning exhibits the following characteristics:

   

- Differential learning allows classifiers with 1/2 to 1/10 the number of parameters used in the best independently-developed models for each task.
- The error rates of differentially-generated classifiers are 20% to 50% less than those of the best independently-developed models.
- The error rates of differentially-generated classifiers are 30% to 80% less than those of probabilistically-generated control models.
- The MSDE of differentially-generated classifiers is 1/2 to 1/10 that of probabilistically-generated control models.

In section 2 we illustrate the efficiency of differential learning with a simple learning/pattern recognition task that lends itself to closed-form analysis. In section 3 we compare differential learning with probabilistic learning in the more realistic context of hand-written digit recognition, showing that differentially-generated models consistently generalize better than their probabilistically-generated counterparts.

2 ILLUSTRATION

Figure 4 illustrates a three-class scalar x with (unimodal) uniform class-conditional pdfs for all three classes (ω1, ω2, ω3). There are two class boundaries (B_{1,2}^{Bayes} = −4.0, B_{2,3}^{Bayes} = 4.0) for the Bayes-optimal classifier of x. The class prior probabilities are P(ω1) = P(ω3) = 0.1 and P(ω2) = 0.8, so the Bayes error rate is 2.0%, given the following classification strategy:

$x < B_{1,2}^{\text{Bayes}}:$ choose $\omega_1$; $\qquad B_{1,2}^{\text{Bayes}} \le x \le B_{2,3}^{\text{Bayes}}:$ choose $\omega_2$; $\qquad x > B_{2,3}^{\text{Bayes}}:$ choose $\omega_3$.   (4)
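A direct transcription of decision rule (4) as code is given below. The class-conditional pdfs themselves are defined in figure 4 and are not reproduced here, so this sketch does not recompute the 2.0% Bayes error rate.

```python
import numpy as np

B12, B23 = -4.0, 4.0   # Bayes-optimal class boundaries from eq. (4)

def bayes_decision(x):
    # Returns 1, 2, or 3: the class chosen by decision rule (4).
    x = np.asarray(x, dtype=float)
    return np.where(x < B12, 1, np.where(x <= B23, 2, 3))

print(bayes_decision([-5.0, 0.0, 4.5]))   # -> [1 2 3]
```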

Our polynomial discriminator has three outputs:
5.7. The resulting partitioning of feature space (bottom bar-graph) is poor, and the classifier exhibits a 7.8% error rate. This is a classic case of Occam's razor [20, 4], in which the classifier has so much functional complexity that it fails to generalize well for small training sample sizes.

6 Note that the legend "CFM 1-0-1", for example, denotes the differentially generated classifier with polynomial discriminant functions of order 1 (linear), 0 (constant), and 1 (linear), associated with classes ω1, ω2, and ω3, respectively.


Figure 6: Discriminant functions of probabilistically (MSE) and differentially (CFM) generated polynomial classifiers of x for an asymptotically large training sample size (i.e., n ! 1). The functions are shown superimposed on their associated a posteriori probabilities. Each of the bar-graphs underneath the discriminant functions depicts how its associated polynomial classifier partitions feature space. Top: the minimum-complexity classifier having one constant and two linear discriminant functions. Bottom: a low complexity classifier having one quadratic and two linear discriminant functions, and a high-complexity classifier having three 10th-order polynomial discriminant functions (MSE-generated only). Numerous low-complexity CFM-maximizing classifiers are shown (shaded lines) in order to emphasize that there is an infinite number of optimal solutions when differential learning is employed and the training sample size is asymptotically large.


Figure 7: Discriminant functions of probabilistically (MSE) and differentially (CFM) generated polynomial classifiers of x for a training sample of size n = 100. Again, the functions are shown superimposed on their associated a posteriori probabilities, and each of the bar-graphs underneath the discriminant functions depicts how its associated polynomial classifier partitions feature space. Top: the minimum-complexity classifier having one constant and two linear discriminant functions. Bottom: a low complexity classifier having one quadratic and two linear discriminant functions, and a high-complexity classifier having three 10th-order polynomial discriminant functions (MSE-generated only).


Figure 8: A comparison of error rates for differentially (CFM) and probabilistically (MSE) generated polynomial classifiers as a function of training sample size. Results for the differentially generated classifiers are shown in white; those for the probabilistically generated classifier are shown in gray. Left: the minimum-complexity classifier having one constant and two linear discriminant functions; Middle: a low-complexity classifier having one quadratic and two linear discriminant functions; Right: a high-complexity classifier having three 10th-order polynomial discriminant functions.

Figure 8 displays the empirical distribution of the error rates for minimum-, low-, and high-complexity polynomial classifiers of x, based on multiple independent learning/testing trials. Results for differential learning via CFM are shown in white; results for probabilistic learning via MSE are shown in gray. The results are shown in box-plot [21, ch. 2] statistical summaries. In brief, the box of each plot has vertical extrema that match the first and third quartiles of the sample data; the horizontal line dividing the box delineates the median of the sample data; the inner and (if shown) outer "T"-shaped "fences" of each plot depict the first and fourth quartiles of the sample data. Extreme values in the first and fourth quartiles falling beyond the outer fence(s) are plotted as dots.

All results for finite training sample sizes are based on 10 independent trials for the specified training sample size (all classifiers learn the same training sample in a given trial, for a given sample size). Learning takes the form of a steepest descent (MSE) or steepest ascent (CFM) search over parameter space, using a modified form of the backpropagation algorithm (e.g., [19]). Learning begins from a tabula rasa state in which all parameters are initialized randomly according to a uniform distribution on the closed interval [−0.3, 0.3]. All trials are completely automated, so learning is done without any human intervention. All experimental conditions (except, of course, for the objective function used) are identical for differential and probabilistic learning. The results for the asymptotically large training sample size are derived as described in sections 2.1 and 2.2.

The box plots of figure 8 are empirical analogs to the whisker plots of figure 3. That is, the box plots give us empirical estimates of the discriminant bias, discriminant variance, and mean-squared discriminant error (MSDE) of each classifier/learning strategy. The minimum-complexity differentially generated classifier is the most efficient, exhibiting consistently low error rates for small training sample sizes. Based on [8, 7], we predict that 1121 samples of x are necessary to guarantee (with 95% confidence) an error rate of no more than 4.0% using differential learning. Note that the empirical upper bound on the differentially generated minimum-complexity classifier's error rate is 3.3% when the sample size is 1000. Increasing the differentially generated classifier's complexity increases its empirical discriminant variance, in keeping with Occam's razor (i.e., excessively complex models are anathema). The inefficiency of probabilistic learning is clear in figure 8. The minimum-complexity classifier has no discriminant variance, but its empirical discriminant bias is 20% − 2% = 18%.
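The following is a minimal, hypothetical sketch of the kind of experiment described above: a "1-0-1" polynomial discriminator with parameters initialized uniformly on [−0.3, 0.3], trained by steepest ascent of a sigmoid-smoothed CFM-style objective. The paper's modified backpropagation and synthetic CFM (appendix A) are not reproduced; the finite-difference gradient, learning rate, step count, and noiseless stand-in labels below are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def outputs(theta, x):
    # theta: list of per-class polynomial coefficient vectors (highest order first).
    return np.stack([np.polyval(t, x) for t in theta], axis=1)

def cfm_like(theta, x, labels, alpha=4.0):
    # Sigmoid-smoothed stand-in for the count of correct classifications.
    y = outputs(theta, x)
    n = len(x)
    correct = y[np.arange(n), labels]
    y_masked = y.copy()
    y_masked[np.arange(n), labels] = -np.inf
    delta = correct - y_masked.max(axis=1)
    return np.mean(1.0 / (1.0 + np.exp(-alpha * delta)))

def steepest_ascent(objective, theta, x, labels, lr=0.02, steps=800, eps=1e-5):
    # Finite-difference gradient ascent (a crude stand-in for backpropagation).
    for _ in range(steps):
        grads = []
        for t in theta:
            g = np.zeros_like(t)
            for i in range(len(t)):
                t[i] += eps
                hi = objective(theta, x, labels)
                t[i] -= 2 * eps
                lo = objective(theta, x, labels)
                t[i] += eps
                g[i] = (hi - lo) / (2 * eps)
            grads.append(g)
        for t, g in zip(theta, grads):
            t += lr * g
    return theta

# Noiseless stand-in training data labeled by decision rule (4).
x = rng.uniform(-6.0, 6.0, size=300)
labels = np.where(x < -4.0, 0, np.where(x <= 4.0, 1, 2))

# "1-0-1" discriminator: linear, constant, linear; uniform init on [-0.3, 0.3].
theta = [rng.uniform(-0.3, 0.3, size=k) for k in (2, 1, 2)]
theta = steepest_ascent(cfm_like, theta, x, labels)
predicted = outputs(theta, x).argmax(axis=1)
print("training accuracy:", (predicted == labels).mean())
```

Replacing the objective with a negated mean-squared error against one-hot targets turns the same loop into the probabilistic (MSE) control described in the text.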

Figure 9: Forty randomly-selected images from the AT&T DB1 handwritten digit database. The images have been linearly compressed from the 256 binary pixel originals (e.g., see [6]).

The low-complexity classifier has substantially lower empirical discriminant bias (8.5% − 2% = 6.5% for n = 1000), but its discriminant variance is high (as indicated by the spread of the box plots). The high-complexity classifier has moderate discriminant bias (5.6% − 2% = 3.6% for n = 1000) and moderate discriminant variance: substantially better than the probabilistically generated low-complexity classifier, but substantially worse than the differentially generated minimum-complexity classifier. Figures 6-8 illustrate the proofs of [7] outlined in section 1.2: differential learning produces the best-generalizing (i.e., minimum-MSDE) classifier and requires the least functional complexity necessary for Bayesian discrimination. As described in section 1.3, differential learning has generated improved classifiers across a broad range of pattern recognition tasks. We describe the impact of differential learning on an optical character recognition task in the following section.

3 APPLICATION OF THE THEORY

Figure 9 shows a number of hand-written digits from the AT&T "little" optical character recognition (OCR) database (DB1).7 The images in the figure contain 64 5-ary pixels (each pixel can assume one of five values), produced from the original 256 binary pixels by a simple linear lossy compression scheme. A differentially generated linear classifier possessing 650 total parameters recognizes all but 1.3% of 600 test examples after learning 600 disjoint training examples (no data pre-processing beyond the compression). Its probabilistically generated counterpart exhibits more than double this error rate, as does the best independently-developed linear classifier (2570 parameters) operating on the original binary images (pre-processed by filtering and removal of "non-supporting" training examples; see [2]). These empirical error rates are obtained from a benchmark partitioning of the 1200-example database into disjoint 600-example training and test samples. The database contains ten examples of each digit from 12 subjects. The benchmark training sample contains 5 examples of each digit from each of the twelve subjects; the benchmark test sample contains the other 5 examples of each digit from each of the twelve subjects. Owing to the nice properties of this benchmark partitioning, the benchmark test sample error rates tend to be lower than they are for random partitionings.

In order to estimate the discriminant bias, discriminant variance, and MSDE of various classifier/learning strategy combinations, we have generated 25 random partitionings of the database, in which training examples are randomly selected from the 1200 examples with 50% probability. Those examples not selected for the training sample constitute the test sample. This random partitioning procedure is repeated 25 times to generate the 25 training/test samples. In this experiment, we employ classifiers drawn from three hypothesis classes, corresponding to three functional bases. All three of these hypothesis classes have 650 total parameters (65/digit) for the DB1 digit learning/recognition task.

7 DB1 database provided by Dr. Isabelle Guyon; see Guyon, Vapnik, Boser, Bottou, and Solla, "Structural Risk Minimization for Character Recognition", Proc. NIPS-4, pp. 471-479, Morgan Kaufmann, 1992.
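A small sketch of the random partitioning procedure just described, assuming the 1200 examples are indexed 0-1199 (the actual database-handling code is not part of the paper):

```python
import numpy as np

def random_partitions(n_examples=1200, n_partitions=25, seed=0):
    # Each example goes to the training sample with probability 0.5;
    # the remaining examples form the test sample.
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_partitions):
        in_train = rng.random(n_examples) < 0.5
        train_idx = np.flatnonzero(in_train)
        test_idx = np.flatnonzero(~in_train)
        splits.append((train_idx, test_idx))
    return splits

splits = random_partitions()
print(len(splits), len(splits[0][0]), len(splits[0][1]))   # 25 partitions, roughly 600/600
```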

Linear hypothesis class

The ith discriminant function of a discriminator belonging to the linear hypothesis class is given by

$g_i(\mathbf{X} \mid \boldsymbol{\theta}) \;=\; \mathbf{X}'^{\mathsf{T}} \boldsymbol{\theta}_i \,,$   (17)

where the notation $\mathbf{z}^{\mathsf{T}}$ denotes the transpose of vector $\mathbf{z}$, and $\mathbf{X}$ is the N = 64-dimensional feature vector (i.e., the 64-pixel image of the digit). The augmented feature vector $\mathbf{X}'$ is the (N + 1)-dimensional vector formed by prepending a single element of unit value to $\mathbf{X}$:

$\mathbf{X}' \;\triangleq\; \begin{bmatrix} 1 \\ \mathbf{X} \end{bmatrix}$   (18)

(e.g., [5, pp. 136-137]). The parameter vector θi for the ith discriminant function is part of the overall parameter vector θ for the discriminator.
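A minimal sketch of this linear hypothesis class, assuming row-vector features and a weight matrix whose ith column plays the role of θi (the 65 parameters per digit follow from the unit element prepended to the 64-pixel image):

```python
import numpy as np

N, C = 64, 10                       # 64-pixel feature vector, 10 digit classes

def augment(X):
    # Prepend a unit element to each feature vector, as in eq. (18).
    X = np.atleast_2d(X)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def linear_discriminants(X, theta):
    # theta: (N + 1, C) matrix whose ith column is the parameter vector theta_i;
    # returns g_i(X | theta) = X'^T theta_i for every class, as in eq. (17).
    return augment(X) @ theta

theta = np.zeros((N + 1, C))        # 650 parameters in total (65 per digit)
X = np.random.default_rng(2).integers(0, 5, size=(3, N))   # three 5-ary pixel images
labels = linear_discriminants(X, theta).argmax(axis=1)     # class with largest output
print(labels)
```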
