Pattern Recognition in Almost Empty Spaces - Robert P.W. Duin

Pattern Recognition in Almost Empty Spaces
Robert P.W. Duin
ICT Group, Electrical Engineering, Mathematics and Computer Science
Delft University of Technology, The Netherlands
Eindhoven, 19 January 2004

P.O. Box 5031, 2600GA Delft, The Netherlands. Phone: +(31) 15 2786143, FAX: +(31) 15 2781843, E-mail: [email protected]

Contents
- Peaking, Curse of Dimensionality, Overtraining
- Pseudo Fisher Linear Discriminant Experiments
- Representation Sets and Kernel Mapping
- Support Vector Classifier
- Dissimilarity Based Classifier
- Subspace Classifier
- Conclusions

What is Pattern Recognition?

[Diagram: Sensor → Representation (Feature Space) → Generalisation (Classifier)]

Other representations and generalisations?

Pattern Recognition System

[Diagram: Training Set → Representation: feature space with x1 (perimeter) and x2 (area) containing class A and class B objects → Generalization: classifier in the feature space. A test object falls on the B side of the classifier and is classified as 'B'.]

Statistical PR Paradox

x = (x1, x2, ..., xk) : k-dimensional feature space
{x1, x2, ..., xm} : training set
{λ1, λ2, ..., λm} : class labels
D(x) : classifier, with error ε = Prob( D(x) ≠ λ(x) )

ε(m) : monotonically decreasing; ε(k) : peaks!

[Figure: error vs. feature size k]

Peaking, Curse of Dimensionality, Overtraining

[Figure: classification error vs. dimensionality, complexity, time, for a fixed sample size]

Asymptotically increasing classification error due to:
- Increasing Dimensionality → Curse of Dimensionality
- Increasing Complexity → Peaking Phenomenon
- Decreasing Regularization, Increasing Computational Effort → Overtraining

Human Recognition Accuracy

The diagnostic classification accuracy of a group of doctors for an increasing set of symptoms. Sample size is 100. F.T. de Dombal, Computer-assisted diagnosis, in: Principles and practice of medical computing, Whitby and Lutz (eds.), Churchill Livingstone, London, 1971, pp. 179-199

Ullmann’s Example

Character recognition performance as a function of the number of n-tuples used. J.R. Ullmann, Experiments with the n-tuple method of pattern recognition, IEEE Trans. on Computers, 1969, pp. 1135-1136

Simulated Peaking Phenomenon by Jain and Waller

Classification error as a function of the feature size for two overlapping Gaussian distributions; features with a higher index have larger class overlap. A.K. Jain and W.G. Waller, On the optimal number of features in the classification of multivariate Gaussian data, Pattern Recognition, 10, pp. 365-374, 1978

Raudys’ Example

Feature size study by Raudys (3rd ICPR, 1976):
1. single example
2. generalization of 1
3. expectation of 1
4. fewer samples

Neural Network Overtraining Example - 10 Hidden Units

[Figures: Levenberg-Marquardt optimization, 10 hidden units. Train and test mean squared error (mse) vs. the number of epochs, and the decision boundaries in the two-dimensional scatterplot after 2, 10, 20, 50 and 100 epochs.]

Eigenfaces

[Figure: 10 pictures of 5 subjects, and eigenfaces 1 - 3]

PCA Classification of Faces

Training set: 1 image of each of 40 persons. Test set: 9 x 40 = 360 images.
Feature size: 92 x 112 = 10304 pixels.

[Figures: classification error vs. number of eigenfaces (up to 40), and a scatterplot on the first two eigenfaces]
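A minimal NumPy sketch of the eigenface experiment above, assuming the flattened face images are already available as arrays; the function name and the 1-nearest-neighbour rule on the eigenface coefficients are illustrative assumptions, not necessarily the exact setup behind the figures.

import numpy as np

def eigenface_classifier(train_imgs, train_labels, test_imgs, n_eigenfaces=10):
    """Project face images on the first principal components ("eigenfaces") and
    label each test image by its nearest training image in that low-dimensional space.

    train_imgs, test_imgs : arrays of shape (n_images, 92*112), one flattened image per row.
    """
    mean_face = train_imgs.mean(axis=0)
    # PCA via SVD of the centered training images; the rows of Vt are the eigenfaces
    _, _, Vt = np.linalg.svd(train_imgs - mean_face, full_matrices=False)
    eigenfaces = Vt[:n_eigenfaces]

    train_coef = (train_imgs - mean_face) @ eigenfaces.T
    test_coef = (test_imgs - mean_face) @ eigenfaces.T

    # 1-nearest-neighbour classification in the eigenface space
    dists = np.linalg.norm(test_coef[:, None, :] - train_coef[None, :, :], axis=2)
    return np.asarray(train_labels)[np.argmin(dists, axis=1)]

With one training image for each of 40 persons, at most 39 eigenfaces carry information after centering, which is why plotting the error curve up to 40 eigenfaces covers the whole range.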

Normalized NIST Data

2 x 2000 characters. Random subsets: 2 x 1000 for training, 2 x 1000 for testing.
Errors averaged over 50 experiments.

Fisher Results

[Figure: Feature curves — generalisation error of the FLD vs. feature size for several sample sizes. Peaking of the generalization error of the FLD as a function of the feature size.]

[Figure: Learning curves — generalisation error of the FLD vs. sample size for several feature sizes. Learning curves of the FLD.]

Pseudo Fisher Linear Discriminant

n points in R^k lie in an (n-1)-dimensional subspace R^(n-1).

X = { (x_i, 1), x_i ∈ A ;  (-x_j, -1), x_j ∈ B }
→ w is the minimum-norm solution of ||w^T X - 1||.
→ Use the Moore-Penrose pseudo-inverse: w^T = pinv(X).
For n > k the same formula defines Fisher's Discriminant.
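A minimal NumPy sketch of the pseudo-inverse construction above; objects are stored as rows (so the minimum-norm solution is written as w = pinv(X) · 1), and the function names are illustrative.

import numpy as np

def pfld_train(XA, XB):
    """Pseudo Fisher Linear Discriminant for two classes.

    XA, XB : arrays of shape (nA, k) and (nB, k) with the objects of classes A and B.
    Returns the weight vector w of length k+1 (the last element acts as the bias).
    """
    # X = { (x_i, 1), x_i in A ;  (-x_j, -1), x_j in B }
    A = np.hstack([XA, np.ones((len(XA), 1))])
    B = -np.hstack([XB, np.ones((len(XB), 1))])
    X = np.vstack([A, B])
    # Minimum-norm least-squares solution of X w ~ 1 via the Moore-Penrose pseudo-inverse
    return np.linalg.pinv(X) @ np.ones(len(X))

def pfld_classify(w, x):
    """Class A if the augmented inner product is positive, otherwise class B."""
    return 'A' if np.dot(w, np.append(x, 1.0)) > 0 else 'B'

For n > k the pseudo-inverse coincides with the ordinary least-squares solution, so the same code also computes Fisher's Discriminant, as stated above.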

Feature Curves and Learning Curves

[Figure: Feature curves — generalisation error of the PFLD vs. feature size for several sample sizes. The generalization error of the PFLD as a function of the feature size.]

[Figure: Learning curves — generalisation error of the PFLD vs. sample size for several feature sizes. Learning curves of the PFLD.]

[Figure: contour plot of the generalization error of the PFLD as a function of feature size and sample size; contour levels between 0.04 and 0.40.]

The generalization error of the PFLD as a function of feature size and sample size.

Improving Pseudo Fisher Linear Discriminant (PFLD)

For dimensionalities larger than the sample size, the PFLD is fully overtrained. Still, good results are possible (ε = 0.08 with 30 objects in 256 dimensions).
--> Better results are possible:
- Regularization, e.g. by D(x) = (μ̂_A - μ̂_B)^T (Ĝ + λI)^(-1) x
- Change of representation to lower dimensional spaces.
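A minimal sketch of the regularization option above, D(x) = (μ̂_A - μ̂_B)^T (Ĝ + λI)^(-1) x; taking Ĝ as the pooled within-class covariance estimate and adding a bias that places the boundary halfway between the class means are assumptions made for illustration.

import numpy as np

def regularized_discriminant(XA, XB, lam=1.0):
    """Regularized linear discriminant D(x) = (mu_A - mu_B)^T (G + lam*I)^(-1) x + w0."""
    mA, mB = XA.mean(axis=0), XB.mean(axis=0)
    # Pooled within-class covariance estimate (an assumption for G-hat)
    G = (np.cov(XA, rowvar=False) * (len(XA) - 1) +
         np.cov(XB, rowvar=False) * (len(XB) - 1)) / (len(XA) + len(XB) - 2)
    w = np.linalg.solve(G + lam * np.eye(G.shape[0]), mA - mB)
    w0 = -0.5 * w @ (mA + mB)          # boundary halfway between the class means
    return lambda x: w @ x + w0        # D(x) > 0 -> class A, D(x) < 0 -> class B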

Representation Sets and Kernel Mapping

Representation Set Y = {y_i}, |Y| = n.
K = K(Y, x) = ( K(y_i, x), i = 1, ..., n ), K ∈ R^n : maps an arbitrary object x into R^n.

Polynomials: K(y_i, x) = ( x · y_i + 1 )^p
Gaussians:   K(y_i, x) = exp( -||x - y_i||^2 / (2σ^2) )
--> Nonlinear mapping into R^n.

Training Set X = {x_i}; possibly Y ⊂ X.

[Figure: objects mapped into the space spanned by K(y_1, x) and K(y_2, x)]

Representation Sets

- Dimensionality is controlled by the size of the Representation Set Y.
- Original objects may have an arbitrary representation (feature size); only K(y, x) has to be defined.
- In the feature space defined by the Representation Set, traditional classifiers may be used.
- Problems: choice of Y, choice of K.
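A minimal sketch of the kernel mapping described above: each object is represented by its n kernel values with respect to the representation set Y, after which any traditional classifier (e.g. the PFLD sketched earlier) can be trained in R^n. The function name and default parameters are illustrative.

import numpy as np

def kernel_map(Y, X, kernel='gauss', sigma=1.0, p=2):
    """Map the objects X into the R^n defined by the representation set Y (n objects).

    Returns an array of shape (len(X), len(Y)); row i is (K(y_1, x_i), ..., K(y_n, x_i)).
    """
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    if kernel == 'gauss':
        # K(y, x) = exp(-||x - y||^2 / (2 sigma^2))
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * sigma ** 2))
    if kernel == 'poly':
        # K(y, x) = (x . y + 1)^p
        return (X @ Y.T + 1.0) ** p
    raise ValueError(f"unknown kernel: {kernel}")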

Feature Approach and Representation Sets

Feature approach: object → feature space R^K, data matrix X (objects as rows, k features as columns) → subspace selection R^L ⊂ R^K → classifier.

Representation-set approach: set of objects → relation matrix K (kernel K(X, X)) → selection of the Representation Set Y (e.g. y1, ..., y4) → kernel-based classification of an object x.

Support Vector Classifier

[Kernel matrix K = K(X, X) over the training set X (here 6 x 6); the rows retained as support set form the Representation Set Y.]

Reduce the training set X to a minimum-size 'support set' Y such that, when used as Representation Set, X is classified error-free:

    min_Y [ classf_error( K(Y, X) ) ]

Notes:
- The classifier is written as a function of n points in R^n.
- Not all kernels are allowed (Mercer's theorem).
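For illustration, a minimal support vector classifier with a polynomial kernel, here via scikit-learn (an assumption; the experiments in this talk were not run with that library). The toy Gaussian data merely stand in for real features.

import numpy as np
from sklearn.svm import SVC

# Two toy Gaussian classes as a stand-in for real data
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (15, 10)),
                     rng.normal(1.0, 1.0, (15, 10))])
y_train = np.array([0] * 15 + [1] * 15)

# SVC-2: support vector classifier with the 2nd-order polynomial kernel (x . y + 1)^2
clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0, C=100.0).fit(X_train, y_train)

# The 'support set' Y is the subset of training objects kept by the optimization
print('number of support vectors:', len(clf.support_))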

Support Vector Classifier Results

[Figure: Feature curves for sample size = 30 — generalisation error vs. feature size for the PFLD and SVC-1, SVC-2, SVC-4.]

The generalization errors of the PFLD and the SVC as a function of the feature size for a sample size of 30. In the SVC, polynomial kernels of orders 1, 2 and 4 are used. Number of support vectors: 10 - 30.

Dissimilarity Based Classification

[Kernel matrix K = K(X, X) over the training set X (here 6 x 6); a few rows are (randomly) selected as Representation Set Y.]

(Random) selection of the Representation Set Y. All objects X are used for training.
Any kernel K(y, x) is allowed (e.g. ||x - y||).
Fast training (simple selection of Y). Possibly fast testing (choose a small Y).

E. Pekalska et al., Classifiers for dissimilarity-based pattern recognition, ICPR15
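A minimal sketch of the scheme above: pick a small random representation set Y, represent every object by its Euclidean distances to Y, and train a linear classifier on those distances. Using scikit-learn's LinearDiscriminantAnalysis as the linear classifier is an assumption for illustration.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_disc(X_train, y_train, n_repr=4, seed=0):
    """DISC-n: random representation set of n_repr objects, Euclidean distances
    to it as the new representation, and a linear classifier on that space."""
    rng = np.random.default_rng(seed)
    Y = X_train[rng.choice(len(X_train), size=n_repr, replace=False)]   # representation set
    dist = lambda X: np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    clf = LinearDiscriminantAnalysis().fit(dist(X_train), y_train)      # all objects used for training
    return lambda X_test: clf.predict(dist(X_test))

Training is fast because Y is only selected, not optimized; testing an object needs just n_repr distance computations.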

Dissimilarity Based Classification Results

[Figure: Feature curves for sample size = 30 — generalisation error vs. feature size for the PFLD and DISC-2, DISC-4, DISC-10.]

The generalization errors of the PFLD and a linear dissimilarity based classifier (DISC) as a function of the feature size, using a sample size of 30. For DISC three sizes of the representation set are used: 2, 4 and 10.

Subspace Classifier

The Training Set X equals the Representation Set Y.

[Kernel matrix K = K(X, X) over the full training set (here 6 x 6)]

K' = PCA(K)
Dimension reduction per class by PCA. Classification by the nearest subspace.

Compare the Eigenface method (linear subspace). Compare feature extraction (no selection).
Test objects have to be compared with the entire training set (not true for the linear inner-product kernel).
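A minimal sketch of a subspace classifier: a low-dimensional PCA subspace is fitted per class and a test object is assigned to the class whose subspace reconstructs it with the smallest residual. Working directly on the feature vectors rather than on the kernel matrix K, and centering per class, are simplifying assumptions.

import numpy as np

class SubspaceClassifier:
    """SUBSC-d: a d-dimensional PCA subspace per class; nearest-subspace classification."""

    def __init__(self, n_dim=2):
        self.n_dim = n_dim

    def fit(self, X, y):
        self.models_ = {}
        for c in np.unique(y):
            Xc = X[y == c]
            mean = Xc.mean(axis=0)
            _, _, Vt = np.linalg.svd(Xc - mean, full_matrices=False)
            self.models_[c] = (mean, Vt[:self.n_dim])   # class mean and subspace basis
        return self

    def predict(self, X):
        classes = list(self.models_)
        residuals = []
        for c in classes:
            mean, V = self.models_[c]
            coef = (X - mean) @ V.T                     # coefficients in the class subspace
            recon = coef @ V + mean                     # reconstruction from the subspace
            residuals.append(np.linalg.norm(X - recon, axis=1))
        return np.array(classes)[np.argmin(np.vstack(residuals), axis=0)]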

Subspace Classifier Results

[Figure: Feature curves for sample size = 30 — generalisation error vs. feature size for the PFLD and SUBSC-1, SUBSC-2, SUBSC-5.]

The generalization errors of the PFLD and the subspace classifier (SUBSC) as a function of the feature size for a sample size of 30. For SUBSC three subspace dimensionalities per class are used: 1, 2 and 5.

Summary of Almost Empty Space Classifiers

[Three figures, feature curves for sample size = 30: generalisation error vs. feature size for the PFLD compared with SVC-1/2/4, with DISC-2/4/10, and with SUBSC-1/2/5.]

                               Support Vector      Dissimilarity Based     Subspace
                               Classifier          Classification          Classifier
Representation Set Selection   Optimized on        Heuristics,             None
                               Minimum Error       Free Choice of n
Dimension of Representation    n                   n                       (n=) k
Size of Final Training Set     n                   k                       k
Training Effort                high                low                     moderate
Test Effort                    O(n)                O(n)                    O(k)

Training Set X (size k); Representation Set Y (size n); n < k (n