Pattern Recognition in Almost Empty Spaces
Robert P.W. Duin
ICT Group, Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, The Netherlands
Eindhoven, 19 January 2004
P.O. Box 5031, 2600 GA Delft, The Netherlands. Phone: +31 15 2786143, Fax: +31 15 2781843, E-mail: [email protected]
Contents
- Peaking, Curse of Dimensionality, Overtraining
- Pseudo Fisher Linear Discriminant Experiments
- Representation Sets and Kernel Mapping
- Support Vector Classifier
- Dissimilarity Based Classifier
- Subspace Classifier
- Conclusions
What is Pattern Recognition?

[Diagram: Sensor -> Feature Space -> Classifier; more generally, Sensor -> Representation -> Generalisation. Other representations and generalisations?]

Pattern Recognition System

[Diagram: a training set of objects is mapped by the representation step into a feature space, e.g. x1 = perimeter and x2 = area, with objects of class A and class B; the generalization step yields a classifier, and a test object is classified as 'B'.]
Statistical PR Paradox

x = (x1, x2, ..., xk)   - k-dimensional feature space
{x1, x2, ..., xm}       - training set
{λ1, λ2, ..., λm}       - class labels
D(x)                    - classifier,  ε = Prob( D(x) ≠ λ(x) )

ε(m): monotonically decreasing;  ε(k): peaks!

[Figure: classification error as a function of the feature size k.]
Peaking, Curse of Dimensionality, Overtraining

[Figure: classification error as a function of dimensionality, complexity and time, for a finite and an infinite (∞) sample size.]

Asymptotically increasing classification error due to:
- Increasing Dimensionality -> Curse of Dimensionality
- Increasing Complexity -> Peaking Phenomenon
- Decreasing Regularization, Increasing Computational Effort -> Overtraining
Human Recognition Accuracy

The diagnostic classification accuracy of a group of doctors for an increasing set of symptoms. Sample size is 100.
F.T. de Dombal, Computer-assisted diagnosis, in: Principles and practice of medical computing, Whitby and Lutz (eds.), Churchill Livingstone, London, 1971, pp. 179-199.
Ullmann’s Example

Character recognition performance as a function of the number of n-tuples used.
J.R. Ullmann, Experiments with the n-tuple method of pattern recognition, IEEE Trans. on Computers, 1969, pp. 1135-1136.
Simulated Peaking Phenomenon by Jain and Waller

Classification error as a function of the feature size for two overlapping Gaussian distributions; features added later have an increasing class overlap.
A.K. Jain and W.G. Waller, On the optimal number of features in the classification of multivariate Gaussian data, Pattern Recognition, 10, pp. 365-374, 1978.
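A small simulation in the same spirit can illustrate the peaking; this is only a sketch, in which the decaying class-mean separation, the sample sizes and the use of scikit-learn's LinearDiscriminantAnalysis are illustrative assumptions, not the settings of the cited paper.

```python
# Sketch: peaking of a linear discriminant on two Gaussian classes whose
# later features carry less class separation (illustrative settings only).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
k_max, n_train, n_test = 50, 20, 2000
mu = 2.0 / np.arange(1, k_max + 1)      # feature i separates the class means by 2/i

def sample(n_per_class):
    X = rng.normal(size=(2 * n_per_class, k_max))
    X[:n_per_class] += mu               # shift class A away from class B
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_tr, y_tr = sample(n_train)
X_te, y_te = sample(n_test)
for k in (2, 5, 10, 20, 50):
    clf = LinearDiscriminantAnalysis().fit(X_tr[:, :k], y_tr)
    err = 1.0 - clf.score(X_te[:, :k], y_te)
    print(f"k = {k:2d}   test error = {err:.3f}")
# With only 20 training objects per class the estimated error typically
# rises again for large k, even though the Bayes error keeps decreasing.
```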
Raudys’ Example

Feature size study by Raudys (3rd ICPR, 1976):
1. a single example
2. generalization of 1
3. expectation of 1
4. fewer samples
Neural Network Overtraining Example - 10 Hidden Units

[Figure: Levenberg-Marquardt optimization with 10 hidden units. The mean squared error (mse) on the training and test sets is plotted against the number of epochs, and the panels show the decision boundary after 2, 10, 20, 50 and 100 epochs.]
Eigenfaces

[Images: 10 pictures of 5 subjects, and eigenfaces 1 - 3.]
PCA Classification of Faces

[Figure: scatterplot of the face images on the first two eigenfaces, and the classification error as a function of the number of eigenfaces (up to 40).]

Training Set: 1 image per person, 40 persons
Test Set: 9 x 40 = 360 images
Feature Size: 92 x 112 = 10304 pixels
Normalized NIST Data

2 x 2000 characters
Random subsets: 2 x 1000 for training, 2 x 1000 for testing
Errors averaged over 50 experiments
Fisher Results

[Feature curves: peaking of the generalization error of the FLD as a function of the feature size, for several sample sizes.]
[Learning curves: the generalization error of the FLD as a function of the sample size, for several feature sizes.]
Pseudo Fisher Linear Discriminant

n points in R^k lie in an (n-1)-dimensional subspace R^(n-1).

X = { (xi, 1) : xi ∈ A } ∪ { (-xj, -1) : xj ∈ B }
-> w is the minimum-norm solution of || w^T X - 1 ||
-> use the Moore-Penrose pseudo-inverse: w^T = pinv(X)
For n > k the same formula defines Fisher’s Discriminant.
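A minimal numpy sketch of the PFLD described above; the function names and the data layout (objects as rows, two classes A and B) are illustrative assumptions.

```python
import numpy as np

def pfld_train(A, B):
    """Pseudo Fisher Linear Discriminant for two classes.

    A, B: arrays of shape (n_A, k) and (n_B, k) with objects as rows.
    Returns the weight vector w of length k + 1 (bias included).
    """
    # X = { (x_i, 1) : x_i in A }  U  { (-x_j, -1) : x_j in B }
    Xa = np.hstack([A, np.ones((len(A), 1))])
    Xb = -np.hstack([B, np.ones((len(B), 1))])
    X = np.vstack([Xa, Xb])
    # Minimum-norm solution of || X w - 1 || via the Moore-Penrose
    # pseudo-inverse; with more objects than dimensions (n > k) this is
    # the ordinary least-squares / Fisher solution.
    return np.linalg.pinv(X) @ np.ones(len(X))

def pfld_classify(w, x):
    """True if x is assigned to class A, False for class B."""
    return float(np.dot(w, np.append(x, 1.0))) > 0.0
```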
Feature Curves and Learning Curves

[Feature curves: the generalization error of the PFLD as a function of the feature size, for several sample sizes.]
[Learning curves: the generalization error of the PFLD as a function of the sample size, for several feature sizes.]
[Contour plot: the generalization error of the PFLD as a function of feature size and sample size.]
Improving Pseudo Fisher Linear Discriminant (PFLD)
For dimensionalities larger than the sample size the PFLD is fully overtrained. Still, good results are possible (ε = 0.08 with 30 objects in 256 dimensions), and better results can be obtained by:
- Regularization, e.g. D(x) = (μ̂A - μ̂B)^T (Ĝ + λI)^(-1) x
- Change of representation to lower dimensional spaces.
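A sketch of the regularization route following the formula above, assuming a pooled within-class covariance estimate for Ĝ and a midpoint decision threshold (both illustrative choices; λ is left as a free parameter).

```python
import numpy as np

def regularized_fld(A, B, lam=0.1):
    """Regularized discriminant D(x) = (mu_A - mu_B)^T (G + lam I)^{-1} x.

    A, B: (n_A, k) and (n_B, k) arrays with objects as rows.
    lam is the regularization strength (illustrative default).
    """
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    # Pooled within-class scatter as estimate of G
    Sa = (A - mu_a).T @ (A - mu_a)
    Sb = (B - mu_b).T @ (B - mu_b)
    G = (Sa + Sb) / (len(A) + len(B) - 2)
    w = np.linalg.solve(G + lam * np.eye(G.shape[0]), mu_a - mu_b)
    # Decision threshold halfway between the projected class means
    # (an assumption; the slide only gives the direction of D).
    theta = 0.5 * float(w @ (mu_a + mu_b))
    return w, theta  # assign x to class A if w @ x > theta
```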
Representation Sets and Kernel Mapping

Representation Set Y = {yi}, |Y| = n.
K = K(Y, x) = ( K(yi, x), i = 1, ..., n ), K ∈ R^n: maps an arbitrary object x into R^n.

Polynomials: K(yi, x) = ( x · yi + 1 )^p
Gaussians:   K(yi, x) = exp( -||x - yi||^2 / (2σ^2) )
-> nonlinear mapping into R^n

Training Set X = {xi}, possibly Y ⊂ X.
[Diagram: objects mapped onto the axes K(y1, x) and K(y2, x).]
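A short numpy sketch of this kernel mapping onto a representation set; the function name kernel_map and the small example at the bottom are illustrative assumptions.

```python
import numpy as np

def kernel_map(Y, X, kind="gauss", p=2, sigma=1.0):
    """Map the objects in X into R^n via the representation set Y (n objects).

    Returns the matrix with entries K(y_i, x_j), one column per y_i.
    """
    if kind == "poly":
        return (X @ Y.T + 1.0) ** p                      # (x . y_i + 1)^p
    if kind == "gauss":
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))          # exp(-||x - y_i||^2 / 2 sigma^2)
    raise ValueError("unknown kernel")

# Example: 10 objects with 5 features, mapped onto a representation set of n = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))   # training set
Y = X[:3]                      # possibly Y is a subset of X
K = kernel_map(Y, X)           # shape (10, 3): every object now lives in R^3
```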
Representation Sets
- Dimensionality is controlled by the size of the Representation Set Y.
- The original objects may have an arbitrary representation (feature size); only K(y, x) has to be defined.
- In the feature space defined by the Representation Set, traditional classifiers may be used.
- Problems: the choice of Y and the choice of K.
Feature Approach and Representation Sets

Feature approach: objects are mapped into the feature space R^K via the feature matrix X = (xij), one row of k feature values per object; subspace selection R^L ⊂ R^K; a classifier is built in the reduced feature space.

Representation set approach: a set of objects yields the relation matrix K = (kij) = K(x, x), the kernel evaluated between all pairs of objects; a Representation Set Y is selected; a new object x is classified by a kernel based classifier.
Support Vector Classifier

[The kernel matrix K = (kij) = K(X, X) over the training set X, shown as a 6 x 6 example; the Representation Set Y corresponds to a subset of its columns.]
Reduce the training set X to a minimum-size ’support set’ Y such that, when used as Representation Set, X is classified error-free:
    min_Y [ classf_error( K(Y, X) ) ]

Notes:
- The classifier is written as a function of n points in R^n.
- Not all kernels are allowed (Mercer’s theorem).
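As a hedged illustration, scikit-learn's SVC finds such a support set for a given kernel; the synthetic data and parameter values below are assumptions, chosen so that with gamma=1 and coef0=1 the kernel equals the polynomial kernel (x · y + 1)^2 from the earlier slide.

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic classes of 30 objects in a 100-dimensional feature space
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 100)),
               rng.normal(0.5, 1.0, (30, 100))])
y = np.array([0] * 30 + [1] * 30)

# Polynomial kernel (x . y + 1)^2; C controls how strictly the training
# set has to be separated.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

# The support vectors play the role of the representation set Y.
print("support vectors per class:", clf.n_support_)
```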
Support Vector Classifier Results

[Feature curves for sample size = 30.]
The generalization errors of the PFLD and the SVC as a function of the feature size for a sample size of 30. In the SVC, polynomial kernels of orders 1, 2 and 4 are used. Number of support vectors: 10 - 30.
Dissimilarity Based Classification
[The kernel matrix K = K(X, X) over the training set X, shown as a 6 x 6 example; Y is a (randomly) selected subset.]

(Random) selection of the Representation Set Y; all objects X are used for training.

Any kernel K(y, x) is allowed (e.g. the distance ||x - y||).
Fast training (simple selection of Y). Possibly fast testing (choose a small Y).
E. Pekalska et al., Classifiers for dissimilarity-based pattern recognition, ICPR15
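A sketch of a linear dissimilarity based classifier along these lines; the random selection of Y and the Euclidean distance follow the slide, while the minimum-norm least-squares classifier on the dissimilarity representation and the function names are illustrative choices.

```python
import numpy as np

def disc_train(X, y, n_rep=4, seed=0):
    """Dissimilarity based classifier (sketch).

    A random representation set Y of n_rep objects is drawn from X; every
    object is described by its Euclidean distances to Y, and a linear
    classifier is fitted on that n_rep-dimensional representation.
    """
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), size=n_rep, replace=False)]
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)   # dissimilarities K(Y, X)
    # Minimum-norm least-squares classifier in the dissimilarity space
    Da = np.hstack([D, np.ones((len(D), 1))])
    t = np.where(y == 0, 1.0, -1.0)                             # targets +/- 1
    w = np.linalg.pinv(Da) @ t
    return Y, w

def disc_predict(Y, w, x):
    d = np.linalg.norm(x - Y, axis=1)
    return 0 if float(np.dot(w, np.append(d, 1.0))) > 0.0 else 1
```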
Dissimilarity Based Classification Results

[Feature curves for sample size = 30.]
The generalization errors of the PFLD and a linear dissimilarity based classifier (DISC) as a function of the feature size, using a sample size of 30. For DISC three sizes of the representation set are used: 2, 4 and 10.
Subspace Classifier

Training Set X equals Representation Set Y.
[The kernel matrix K = K(X, X) over the training set, shown as a 6 x 6 example.]
K’ = PCA(K)
Dimension reduction per class by PCA. Classification by nearest subspace.
- Compare the Eigenface method (linear subspace).
- Compare feature extraction (no selection).
- Test objects have to be compared with the entire training set (not true for a linear inner-product kernel).
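A simplified sketch of the nearest-subspace idea, applying the per-class PCA directly in the feature space instead of to the kernel matrix K (which for a linear inner-product kernel amounts to the same subspaces); the mean-centring and the function names are illustrative assumptions.

```python
import numpy as np

def subspace_train(X, y, n_dims=2):
    """Fit one n_dims-dimensional PCA subspace per class (sketch)."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mean = Xc.mean(axis=0)                       # centring is an extra assumption
        _, _, Vt = np.linalg.svd(Xc - mean, full_matrices=False)
        models[c] = (mean, Vt[:n_dims])              # principal directions as rows
    return models

def subspace_predict(models, x):
    """Assign x to the class whose subspace reconstructs it with the smallest error."""
    err = {}
    for c, (mean, V) in models.items():
        proj = mean + (x - mean) @ V.T @ V           # projection onto the class subspace
        err[c] = np.linalg.norm(x - proj)
    return min(err, key=err.get)
```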
Subspace Classifier Results

[Feature curves for sample size = 30.]
The generalization errors of the PFLD and the subspace classifier (SUBSC) as a function of the feature size for a sample size of 30. For SUBSC three subspace dimensionalities per class are used: 1, 2 and 5.
Summary of Almost Empty Space Classifiers

[Three feature-curve panels for sample size = 30, comparing the PFLD with the SVC (polynomial kernels of orders 1, 2 and 4), with DISC (representation sets of sizes 2, 4 and 10), and with SUBSC (subspace dimensionalities 1, 2 and 5).]
                               Support Vector        Dissimilarity based     Subspace
                               Classifier            Classification          Classifier
Representation Set Selection   optimized on          heuristics,             none
                               minimum error         free choice of n
Dimension of Representation    n                     n                       (n =) k
Size of Final Training Set     n                     k                       k
Training Effort                high                  low                     moderate
Test Effort                    O(n)                  O(n)                    O(k)

Training Set X (size k); Representation Set Y (size n); n < k.