Link¨ oping Studies in Science and Technology Thesis No. 995
Subspace Selection Techniques for Classification Problems David Lindgren
REGL
AU
ERTEKNIK
OL TOM ATIC CONTR
LINKÖPING
Division of Automatic Control Department of Electrical Engineering Link¨opings universitet, SE–581 83 Link¨oping, Sweden WWW: http://www.control.isy.liu.se E-mail:
[email protected] Link¨oping 2002
Subspace Selection Techniques for Classification Problems c 2002 David Lindgren
Department of Electrical Engineering, Link¨ opings universitet, SE–581 83 Link¨ oping, Sweden.
ISBN 91-7373-575-2 ISSN 0280-7971 LiU-TEK-LIC-2002:68 Printed by UniTryck, Link¨ oping, Sweden 2002
To Valerica
Abstract The main topic of this thesis is linear subspaces for regression – how to find the subspaces and how to evaluate them. The motivation to do regression in a subspace is numerical as well as computational – numerical in the sense that the subspace can filter out the relevant components or features of the problem, computationally in the sense that this filtering can be done quickly and then can nonlinear prediction by artificial neural networks, for instance, be conducted in lower dimensionality. The theory is developed in a versatile regression framework into which both discrete (classification) and continuous (quantification) regression can be put. The target is to find good future predictors from past observations. The foundation is the assumption that observations (or measurements) are drawn from a probability distribution, and from there on the theory is developed towards practical results and algorithms. The emphasis is however put on classification problems. The applications are many and ranges from identification of dynamical systems to data mining and compression. Particular interest is given processing of sensor data – how to learn something from calibration measurements that in turn can be used to learn something about future unknown samples. In focus are the electronic nose (smell sensor) and the electronic tongue (taste sensor). Three new algorithms are introduced and described. The Asymmetric Class Projection is a computationally efficient method to find subspaces for classification between two classes with small mean and large covariance difference. The Optimal Discriminative Projection (ODP) is an algorithm that uses a particular composition of Givens rotations to parameterize all subspaces. The subspaces are optimized for classification. The Clustered Regression Analysis uses the ODP subspace for conditional expectation prediction.
i
Acknowledgments First of all, I would like to thank my supervisor Lennart Ljung for letting me join the Control and Communication Group and for all guidance during my work. His impressive knowledge and ability to put things on a sound footing has been an invaluable resource. Thank you Per Spangeus for interesting discussions and fruitful collaboration. Without you and your ideas, this thesis would never had come about. Great thanks to the people (now and then) at S-SENCE. Your problems and ideas have been an inspiration in my research. Particularly I would like to mention Tomas Ekl¨ ov, Tom Artursson and Martin Holmberg. Thank you Christian Cimander at Biotechnology for interesting discussions on how to control bioprocesses. I am very grateful to my colleagues in the Control and Communication Group. Here, all are very skillful and there has always been someone to ask. Thank you Fredrik Tj¨ arnstr¨ om for comments on the thesis. Special thanks to Martin Enqvist who has always taken his time to give valuable views of things. The sponsoring by Vetenskapsradet is gratefully acknowledged. Thank you Valerica for all support, encouragement and inspiration. I Love You.
David Lindgren Link¨ oping, November 2002
iii
Contents
1 Introduction 1.1 Curve Fitting and the Generalization 1.2 Experimental Background . . . . . . 1.2.1 The Tongue . . . . . . . . . . 1.2.2 Objectives . . . . . . . . . . . 1.3 Regression Basics . . . . . . . . . . . 1.3.1 Linear Regression . . . . . . . 1.3.2 Classification . . . . . . . . . 1.4 Regression in a Subspace . . . . . . 1.5 Thesis Outline . . . . . . . . . . . . 1.6 Contributions . . . . . . . . . . . . .
Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
1 3 6 6 7 9 10 12 14 15 15
2 Quality Measures 2.1 Quality Measures for Classification . 2.1.1 Two Classes, q = 2 . . . . . . 2.1.2 Many Classes, q > 2 . . . . . 2.1.3 Estimation of µ and Σ . . . . 2.2 Quality Measures for Quantification 2.2.1 Linear φ(x) . . . . . . . . . . 2.2.2 Explained Variance . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
17 18 19 25 26 28 31 31
v
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
vi
Contents
2.3 2.4 2.5
Artificial Data . . 2.3.1 Discussion . Experimental Data 2.4.1 Discussion . Summary . . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
3 Asymmetric Classification 3.0.1 Linear Discriminant Analysis . . 3.1 The Asymmetric Class Projection . . . . 3.1.1 Objective . . . . . . . . . . . . . 3.1.2 Modified Covariance . . . . . . . 3.1.3 Generalization to More than One 3.1.4 Relation to LDA . . . . . . . . . 3.2 Bayes Error . . . . . . . . . . . . . . . . 3.2.1 k-dimensional Projection . . . . 3.2.2 Non-Concentric Classes . . . . . 3.3 Artificial Data . . . . . . . . . . . . . . 3.3.1 Artificial Data Set . . . . . . . . 3.3.2 Comparison . . . . . . . . . . . . 3.4 Experimental Data . . . . . . . . . . . . 3.4.1 Validation . . . . . . . . . . . . . 3.4.2 Estimation of µ and Σ . . . . . . 3.4.3 Projection Calculation . . . . . . 3.4.4 Plots . . . . . . . . . . . . . . . . 3.4.5 Results . . . . . . . . . . . . . . 3.5 Summary . . . . . . . . . . . . . . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
32 32 33 33 35
. . . . . . . . . . . . . . . . . . . . . . . . . . . . Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
37 38 39 39 40 41 42 42 45 45 45 46 46 46 46 48 49 49 49 52
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
53 55 56 56 56 58 59 61 62 63 63 64 64 66 66 68 68 71
4 Subspace Parameterization 4.1 Optimization Problem . . . . . . . . . . . . 4.1.1 ON Parameterization . . . . . . . . 4.2 Givens Parameterization . . . . . . . . . . . 4.2.1 Vector Rotation . . . . . . . . . . . 4.2.2 Matrix Rotation . . . . . . . . . . . 4.2.3 Parameterized Dimension Reduction 4.2.4 Inverse Parameterization . . . . . . 4.2.5 Kernel Parameterization, 2k > n . . 4.3 An Algorithm for Subspace Optimization . 4.3.1 Objective Function . . . . . . . . . . 4.4 Artificial Data . . . . . . . . . . . . . . . . 4.5 Experimental Data . . . . . . . . . . . . . . 4.5.1 Image Segmentation Data . . . . . . 4.5.2 Electronic Nose Data . . . . . . . . . 4.6 Competing Techniques . . . . . . . . . . . . 4.6.1 Householder Reflections . . . . . . . 4.6.2 ON-constraints as a BMI . . . . . .
. . . . .
. . . . . . . . . . . . . . . . .
. . . . .
. . . . . . . . . . . . . . . . .
. . . . .
. . . . . . . . . . . . . . . . .
. . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
Contents
4.7
vii
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5 Clustered Regression Analysis 5.1 Inverse vs. Classical Regression . . . 5.1.1 Classical Prediction of y . . . 5.1.2 Monte Carlo Simulation . . . 5.2 Using Clusters to Find a Subspace . 5.2.1 Clustered Regression Analysis 5.3 Artificial Data . . . . . . . . . . . . 5.4 Experimental Data . . . . . . . . . . 5.4.1 Random Validation . . . . . . 5.4.2 Leave-Class-Out Validation . 5.4.3 Discussion . . . . . . . . . . . 5.5 Summary . . . . . . . . . . . . . . . Bibliography
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
77 79 80 80 82 83 84 86 86 87 87 88 89
Notation
Symbols x n y q i xi , yi x(j) y (j) X Y f (x, θ) θ φ η e , y − f (x, θ) Ξ
Random measurement Dimensionality of x Random feature Dimensionality of y Observation index Observation number i of x and y, respectively jth entry of vector valued x jth possible (discrete) outcome of y N -by-n matrix with xTi , i = 1, 2, . . . , N , as rows N -by-q matrix with yiT , i = 1, 2, . . . , N , as rows Predictor Predictor parameter Model Model parameter Random prediction error N -by-q matrix with eTi as rows
ix
x
Notation
ε R2 N j, l k p Σ µ Cj δ or δjl dj (·) ∆ Γ Π Λ M V, V∗ f∗ R 1n , 1 1 · · ·
1
b θˆ yˆ, Σ, R(p) r(p)
T
Random measurement noise Coefficient of determination (explained variance) Number of estimation points in a data set Class indices Subspace dimensionality Subspace parameter Covariance matrix Mean vector Class distribution of class j Mahalanobis distance between two class distributions (δjl = δ(Cj , Cl )) Classifier distance function, distance to Cj Fisher ratio Arithmetic mean of Mahalanobid distances Geometric mean of Mahalanobid distances Negative logarithm of overbounded error rate Number of validation points in a data set General prediction error and its infimum, respectively Predictor with prediction error V ∗ Set of all real numbers Vector with n ones. n is usually given by the context and omitted Hat indicates predicted or estimated value Matrix rotation Vector roataion
Note that all vectors are column vectors. In general are lower case letters used to denote vector valued and scalar variables and upper case letters (potentially) matrix valued variables. Exceptions due to conventions may however occur.
Operators P (x = a) px (a) p(x) , px (x) px,y (a, b) p(x, y) , px,y (x, y) px|y (a, b) p(x|y) , px|y (x, y) E[·] E [ ·|b ]
Probability that the random variable x equals a Probability density function (pdf) of x Short form of above. Joint pdf of x and y Short form of above. Conditional pdf of x given y Short form of above. Expectation Conditinal expectation given b
Notation
xi
tr [·] p k X kF , tr [X T X] k x k2 ,k x kF det [·]
Trace, sum of diagonal elements Frobenius norm, X is a matrix Euclidian norm, x is a vector Matrix determinant
Acronyms and Abbreviations ACP BD CCA CRA EN ET GBP LDA LS ML ODP OLT ON PCA pdf PLS RMSE SIR std SVD
Asymmetric Class Projection Bhattacharayya Distance Canonical Correlation Analysis Clustered Regression Analysis Electronic Nose Electronic Tongue Good and Bad Projection (= ACP) Linear Discriminant Analysis Least Squares Maximum Likelihood Optimal Discriminative Projection Optimal Linear Transformation Orthonormal Principal Component Analysis Probability Density Function Partial Least Squares Root Mean Square Error Sliced Inverse Regression Standard Deviations Singular Value Decomposition
xii
Notation
1 Introduction
This is one of many stories about the two random variables x and y. Since the days of Gauss, they have gained fairly much attention because there is reason to believe that they are related, in fact dependent of each other. They occur in many tales, for instance, when x is the length of a randomly picked human, and y her shoe size – rumor has it there is dependence between the length and size. Of course, a shoe salesman would not be rich if he tried to sell shoes by taking the length only of the customers. But yet, a customer 1.6 m tall would turn her heel if he suggested a pair of size 8. Although he cannot predict the shoe size exactly, he know by experience that tall persons take greater shoe sizes – there is a dependence! Everything about the relation between x and y is contained in the joint probability distribution p(x, y) – the likelihood that a particular x and y occurs simultaneously. If we knew p(x, y) and someone gave us a particular value of x, we could calculate everything possible to predict about y, and the other way around if someone gave us a particular y. An experienced shoe salesman would probably pick the shoe size E [ y|x ] (the conditional expectation given x) off the shelves to offer the customer to start with. On the other hand, if we could show that p(x, y) can be factorized as p(x, y) = px (x)py (y), we could state that nothing can be predicted about y knowing x or vice versa. However, in real life we do not know p(x, y). The best thing we can do if we want to predict y knowing x (or vice versa) is to learn by experience; we can take N samples (make N observations) of x and y, call them xi and yi , 1 ≤ i ≤ N , and 1
2
Introduction
try to estimate p(x, y) locally. In fact, if we want to predict y given x (for instance) we might as well estimate a predictor function f directly, y = f (x) + e,
(1.1)
in a way that makes the expected magnitude of the error e, perhaps not zero, but small. Note that we could equally well measure the shoe size and try to predict the length, x = g(y) + e0 . If more information is added, the prediction can potentially be more accurate – for instance not only the length, but also the customer’s age, weight, and so on, could help to predict the shoe size. Why not use an image of the customer with 512 × 512 = 262144 gray scale pixels? The problem encountered when many features are included – say n features collected in the random vector x ∈ Rn – is that the N observations become very sparse in the n + 1-dimensional space of outcomes, and the surface p(x, y) (or f (x)) thus very difficult to reliably estimate. To counteract this sparseness, the number of observations must be increased, and observations (or measurements) are usually very expensive. Therefore it is very important to restrict the number of features, and only use those who are really relevant. In this thesis, predictor functions are estimated despite feature spaces with very high dimensionality. To reliably estimate surfaces from N points, the feature space is projected onto planes with lower dimensionality than that of x. How to find those planes is the major topic of this thesis. The k-dimensional projected space is denoted Sx, which is a linear transformation of x by the k-by-n matrix S. For a set of N observations, the regression of the y-predictor is formulated as yi = f (Sxi ) + ei ,
(1.2)
where S and f are sought that minimize the expected magnitude of the residual error e. The regression (1.2) above is not limited to prediction of shoe sizes. With or without subspace, a wide range of problems can be put into its simple framework. Below are given some examples. Example 1.1 (Iodine Concentration Measurement) A laboratory instrument that measures the concentration of iodine in polluted water uses a large set of sensors which for each measurement i, outputs the response vector xi ∈ Rn . In the calibration process, which aims at accurate prediction of the concentration yi , one whish to find a linear model S that filters out relevant features for iodine concentration prediction in the presence of other pollutants. The inaccuracy of the instrument in general, and the calibration in particular, could for instance be measured by the root mean square error (RMSE), mean[ e2i ].
1.1 Curve Fitting and the Generalization Problem
3
Example 1.2 (Classification of Scientific Papers) In a scientific paper database one wants to automatically classify documents as belonging to one of four categories: technical, philosophical, medical or economical. For each document i, a large term-frequency vector xi ∈ Rn is generated where entry j counts the occurancies of term j of a predefined dictionary. The vector xi should be mapped into the indicator vector yi ∈ {0, 1}4 with one binary entry for each document category. To speed up the classification, and to improve the semantic properties of the selection, the classification is conducted in a linear subspace S. The performance measure could for instance be the classification accuracy, 100 · (1 − mean[ eTi ei ]/2)%.
Example 1.3 (Identification of Dynamical System) A particular time discrete dynamical system with single input ui and single output yi is modeled by yi = f0 (yi−1 , · · · , yi−na , ui , · · · ui−nb ) + εi ,
(1.3)
where εi is a random noise sample. Due to very large na and nb (high sampling frequency and time delays), the identification should be done in a subspace S. Then introduce the state vector as T (1.4) xi = yi−1 · · · yi−na ui · · · ui−nb . This directly identifies the problem as (1.2). Again, a performance measure of the model is given by the error mean[ e2i ].
1.1
Curve Fitting and the Generalization Problem
Typically, a practitioner confronted with the regression problem (1.2) is given an estimation data set with N observations {xi , yi } of some process, and wishes to find the function f (x) (and perhaps the subspace S) that with highest possible accuracy maps xi onto yi . In this process he or she would probably select a paˆ so that rameterized function family f (x) = f (x, θ) and estimate the parameter θ, the magnitudes of the ei ’s are small, for instance in least squares (LS) sense: N 1 X k yi − f (xi , θ) k2F , θˆ = arg min θ N i=1
(1.5)
where k · kF is the Frobenius norm, see page ix. This is basically a curve fitting problem: to fit a straight line or a curve to N points. Indeed, a predictor can geometrically be viewed as a line or a curve , for instance the vector c in f (x) = cT x. The purpose of the predictor is however to generalize beyond the estimation points, to interpolate and to extrapolate. After all, the observed yi ’s are already known, it
4
Introduction
is pointless to predict them, and the mean square in (1.5) is an inadequate measure ˆ The true predictor of predictor quality – it is primarily an instrument to find θ. quality is rather given by the general quadratic error ZZ ˆ = E k e k2 = ˆ k2 dxdy (1.6) p(y, x) k y − f (x, θ) V 2 (f, θ) F F (for continuous y) where p(y, x) is the joint probability distribution of x and y. However, our problem is that p(x, y) is unknown, so we can never fully validate a predictor. Usually, the general error is approximated by the mean of the quadratic error for a set of M observations not contained in the estimation data referred to as validation data, M X ˆ = 1 ˆ k2 . k yv,i − f (xv,i , θ) Vb 2 (f, θ) F M j=1
(1.7)
It should be pointed out that it it is never difficult to find a function (or curve) f (x) that exactly fits to N points, that is, ei = 0 for 1 ≤ i ≤ N . (Take for instance f as polynomial(s) of degree N − 1.) The difficulty is to find a curve that gives small error in general according to (1.6). That the best curve for the estimation data set is not always the best in general is illustrated by a simple example. In Figure 1.1, 5 points on the dashed line y = x/2 is randomly noise perturbed, xi ∈ {1, 2, 3, 5, 6}. A 4th order polynomial function family 1 x 2 (1.8) f (4) (x, θ) = θT x3 x x4 ˆ to the points by solving a linear equation system is easily fitted (θ estimated as θ) with 5 equations and 5 unknown variables. The the mean square error i2 1 X h (4) ˆ − yi f (xi , θ) 5 i=1 5
(1.9)
and every other error measure is of course equal to 0. For f (4) (x, θ) the general quadratic error in the depicted range is Z
6.5
h i2 ˆ − x/2 dx ≈ 0.32. f (4) (x, θ)
(1.10)
0.5
If instead a first order polynomial f (1) is fitted to the N points, the LS-fitting is much worse for the estimation data set: 0.20, but gives lower error in general: 0.090.
1.1 Curve Fitting and the Generalization Problem
5
As seen, the best fit to the estimation points does not necessarily yield the lowest error in general. The problem with the 4th degree polynomial is that it is flexible enough to adapt to the random noise samples in the estimation data. Simply put, f (4) is too complex with respect to the number of observations. (If arbitrarily many estimation pairs {xi , yi } had been available, the influence of the noise could be “meaned out”, and the fit of f (4) practically perfect.) A natural way too reduce the complexity of a linear or non-linear function family is to reduce the dimensionality of the argument, or put in other words, to conduct the regression in a linear subspace defined by the rows of a k-by-n matrix S. In the toy example above, the 4th degree polynomial space is reduced to a 1st degree one by the linear transformation 1 x 1 1 0 0 0 0 x2 , (1.11) = x 0 1 0 0 0 3 x {z } | S x4 and as seen above, this enhances the prediction. However, nothing is yet said about how to choose the subspace S in general. This question is a major topic of this thesis.
Figure 1.1 The dashed line describes the “true” relation between x and y and the dots are 5 noise perturbed observations. The solid curve is a 4th order polynomial fit to the points, and the solid line a first order fit.
6
1.2
Introduction
Experimental Background
The electrical characteristics (x) of an electronic sensor depends on the environment exposed to it (y), and if p(x, y) is known, statements about the environment can be done from sensor measurements. The experimental background to this work is the processing of sensor data. At the Department of Applied Physics at Link¨ opings universitet (IFM), research in sensor science is conducted and this is also a birthplace of the electronic nose (EN) and the electronic tongue (ET). The EN is a gas or smell sensor that is sensitive to a wide range of gas mixtures. Usually an EN consists of a whole set of discrete sensors (for instance semiconductor, conductive polymer (CP), surface acoustic wave (SAW) and quartz crystal micro balance (QCM) sensors) to give high potential selectivity among some set of gas mixtures. The signal from a semiconductor sensor, for instance, could be recorded under an interval of time when a gas mixer switches from clean dry air to the gas that should be identified. This sampled (time discrete) signal is compared to other signals taken at other time instants and due to other gases. Typically, an EN outputs a large amount of data for each measurement (or sniff) and it is up to the signal processing to make sense of this data. Although this work was inspired by the EN and the ET, the results are general and not limited to sensor systems. The discussion in this thesis will thus be held at a rather general level. For more on the EN specifically, see for instance [39]. For an overview of signal processing methods for sensors, see [19, 34]. The ET, which will be briefly described below, is also a rather general sensor used for wet sensing or concentration measurements in liquids. In both cases, the merging of advanced sensor physics and sophisticated signal processing is necessary to make the devices useful. My main interest has been the signal processing part.
1.2.1
The Tongue
The electronic tongue is used for wet sensing. Different tongue techniques exist (acoustic, optical, potentiometric et cetera) but at IFM one has been particularly interested in voltammetry, [46]. In voltammetry, features of the analyte are sought in a relation between voltage and current. Three electrodes are dipped into the analyte, the reference, working and auxiliary electrode. The electrodes are plated with special catalytical materials like gold, platinum, iridium or rhodium. A carefully modulated voltage between the reference and working electrode triggers an electrochemical reaction in the analyte, which to some extent is characterized by the current between the auxiliary and working electrode – a current measured and sampled by sensitive equipment. The sampled current is the actual output from the sensor (x), from which inferences about the analyte should be done (y). To improve the selectivity of the sensor, a whole set of electrodes with different platings is often used, which of course gives a whole set of signals to process. Before the tongue can be used, a predictor has to be estimated from N measurements {xi }
1.2 Experimental Background
7
Figure 1.2 At top an input and at bottom an output signal for an electronic tongue recorded for 14 seconds with a sample rate of 1000 Hertz. The signals are used to predict a concentration.
with known calibration liquids {yi }. An example of one measurement of the tongue is depicted in Figure 1.2, where the top plot is the modulated input voltage, and the bottom plot the measured current. The current is sampled for 14 seconds with 1000 Hertz sampling frequency, which gives a measurement xi ∈ R14000 . The measurement in this example is used to predict some concentration yi (scalar). Other modulations of the voltage exist.
1.2.2
Objectives
The response vectors given by the electrical sensors are often used for either classification or quantification. In classification the objective is to make a categorical statement about the sensor response, for instance good/bad, safe/unsafe, apple/banana/orange/grape and so on. In quantification on the other hand, a quantitative measure is sought, for instance a concentration in terms of moles. We shall use the term quantification whenever we address a regression problem with continuous y. In both cases, the vector xi ∈ Rn with n ≈ 1000, should be condensed to a concise answer. Two examples will illustrate the problem types.
8
Introduction
0.025 0 3.1 3.7 4.5 5.2 5.6 6.9 7.5 8 8.8 9 10.6 10.8
0.02
0.015
0.01
0.005
0
−0.005
−0.01
−0.015
−0.02
0
0.02
0.04
0.06
0.08
0.1
Figure 1.3 PCA of tongue data. The 13 classes are different concentrations (mM) of a chemical substance.
Example 1.4 (Classification with the ET) The ET is used to “taste” different concentration classes, 0, 3.1, . . . , 10.8 milli moles (mM). It could for instance be the amount of sugar in a kg of water. In this setting the ET shall predict to which one of these 13 concentration classes a tested unknown liquid belong. A set of measurements is done on calibration liquids with known concentrations. Each measurement is a 403 dimensional vector. As an initial diagnostic step, a principal component analysis (PCA, the singular value decomposition (1.19) on page 11) is conducted, which compresses the response vectors into two uncorrelated variables that contain the major variability of the vectors in the set. The two variables are plotted against each other in a 2-dimensional scatter plot, see Figure 1.3. It is seen that 2 principal components do not well distinguish the the different concentration classes – with exception to the 0 mM class, they are rather mixed up. For instance, it would not be possible to with certainty judge if a new measurement is 9 or 10.6 mM.
Example 1.5 (Concentration Prediction with the ET) The ET is used to measure the substance concentration in the range 0-10.8 mM
1.3 Regression Basics
9
12
10
yˆ
8
6
4
2
0 0
2
4
y
6
8
10
12
Figure 1.4 PLS concentration prediction of tongue data. True values (y) versus predicted values (ˆ y ).
(same data as in the previous example). A linear prediction function is given by partial least squares (PLS), which is a well known method to calculate predictors in high dimensional data. PLS gives a subspace of x that is highly covariant with y. Predicted concentrations yˆi = θ0 + θT xi are compared to known validation concentrations by plotting them against each other, see Figure 1.4. The line in the figure is the ideal prediction. It is seen that reasonable concentration prediction can be done, although the different concentration classes are rather mixed up like in the PCA in the previous example.
1.3
Regression Basics
A data set is a finite set of observations of the process or alternatively a number of samples from some distribution. One observation in a data set is typically a pair
10
Introduction
of the variables xi ∈ Rn and a yi ∈ Rq . Thus, the data set D = {Y, X} where
xT1 xT2 X= . ..
y1T y2T and Y = . . ..
xTN
(1.12)
T yN
The purpose of the data set is to establish a relation between X and Y , f (x1 , θ) f (x2 , θ) Y = f (X, θ) + Ξ, where f (X, θ) = , .. .
(1.13)
f (xN , θ) so that the regression error Ξ(f, θ) = Y −f (X, θ) is small in some sense. Ultimately, we want to estimate f (x) from the data set so that y can be predicted in the future with high accuracy. The data set root mean square error (RMSE) r Vb (f, θ) =
1 k R(f, θ) k2F N
(1.14)
is an estimation of the general error sZ Z ˆ k2 dxdy p(y, x) k y − f (x, θ) F
V (f, θ) =
(1.15)
for continuous y and s V (f, θ) =
XZ
ˆ k2 dx p(yi , x) k yi − f (x, θ) F
(1.16)
i
for discrete y, that is often targeted for minimization over θ. As before, p(y, x) is the joint probability (density) of y and x. Regression is ultimately about finding the f and θ that minimizes V (f, θ), but since p(x, y) is unknown in practice, one usually has to settle with the minimizer of a data set error measure like Vb (f, θ).
1.3.1
Linear Regression
If f is linear, that is, yi = θxi + ei or Y = XθT + Ξ,
(1.17)
a naive approach to minimize Vb (θ) is to project both the left and right hand side of the regression equation onto the plane spanned by the columns of X. If the
1.3 Regression Basics
11
column space of X has full rank, the linear equation systems that results can be solved as θT = (X T X)−1 X T Y.
(1.18)
This is the LS solution. We assume that the data set mean is zero, Y T 1 = 0 and X T 1 = 0. A data preprocessing step should routinely make this assumption true. Note that 1 is a vector with ones, see page ix. Since X T X is often ill-conditioned (numerically difficult or impossible to invert), the direct calculation (1.18) is seldom used in practice. Instead the variables (columns) of X are exchanged via some linear orthogonalizing transformation. One of the most versatile transformation in this respect is obtained from the singular value decomposition (SVD): XV = U0 T0
(1.19)
with
V T V = I,
U0T U0 = I,
t1 0 T T0 = = ... 0 0 0
0 t2 .. . ··· 0
··· .. . .. . 0 0
0 .. . . 0 tn 0
(1.20)
Obviously the matrix XV = U0 T0 = U T , where U is a matrix with the n first columns of U0 , is orthogonal, which also means that the exchanged variables are uncorrelated. The regression e Y = U T θT + Ξ
(1.21)
θT = T −1 U T Y
(1.22)
is now solved in LS-sense by
and the predictor is f (x) = θV T x. The condition1 of T is calculated directly from the singular values ti (ti ≥ ti+1 ) as cond[T ] = t1 /tn . The SVD offers a number of ways to enhance the condition number of the regression, regularization. The regularized solutions are in general not optimal in LS-sense, but they are numerically more stable, and most often they give less general error. The probably simplest way to regularize the regression, is to add a small positive constant ξt1 to the diagonal entries of T before inverting, that is, θT = (T + ξt1 I)−1 U T Y.
(1.23)
1 The condition number is a figure on how bad the regression is numerically. If for instance the columns of X are linearly dependent (X has not full column rank), tn = 0 and cond[T ] = ∞ which means that T (or X T X) cannot be inverted.
12
Introduction
The condition number then is cond[T ] =
t1 + ξt1 1+ξ t1 + ξt1 . ≤ = tn + ξt1 ξt1 ξ
(1.24)
This is called ridge regression. The optimal ξ is generally unknown, usually a rule of thumb is used – say ξ = 10−5 , which give cond[T ] ≈ 105 . Another way to regularize is to select a subset of columns of XV , or equivalently, of U T . In this linear subspace, the regression is expressed yi = θSxi + ei
(1.25)
with S = VST , where S denotes some column index set. In principle component regression (PCR), for instance, the columns of S T are equal to the k first columns of V , since they contain the major variability of X ([XS T 0] is the in Frobenius norm optimal rank-k approximation of XV ). The subspace dimensionality k is chosen for instance by the rule t1 /tk < 1/ξ for, say, ξ = 0.01. Another way to select columns, is by taking the index set Se that minimizes the residual error, that is Se = arg min k S
I−
X
! ui uTi
Y kF .
(1.26)
i∈S
Note that (1.13) is most often an inverse model used for prediction. If, for instance, xi is the response of a balance and yi known calibration masses, the adequate (classical) model is in fact xi = φ(yi ) + εi
(1.27)
where the random noise sample εi is picked from an n-dimensional distribution. The inverse regression above gives the predictor function directly, without any assumptions about the noise distribution. Classical versus inverse modeling will be discussed in Chapter 5.
1.3.2
Classification
By classification is defined the regression problem (1.13) where yi is a bit vector {0, 1}q with one entry for each of q categories or classes. Every class has a distinct class distribution Cj , 1 ≤ j ≤ q. The random variable x is a sample of one of those distributions, and classification is all about determining which. In other words, every x is associated by the predictor function f (x) with one distinct class
1.3 Regression Basics
13
j, 1 ≤ j ≤ q, and further with the bit vector
y (j)
01 .. . 0j−1 = 1j 0j+1 . ..
(1.28)
0q in such a way that the general probability of misclassification (the error rate) V 2 (θ)/2 for V (θ) as defined in (1.16), is as small as possible. Note that eTi ei = 0 for correct classification and eTi ei = 2 for misclassification. Also note that the distribution of x is a combination of the class distributions and the distribution of y. The conditional distribution of x given y is Cj , where j fulfills y = y (j) . Thus, x ∈ Cj with probability P (y = y (j) ). As a first step, the q class distributions Cj are estimated from the q classes of the estimation data set. A special case is that the distributions are normal, that is, Cj = Nn (µj , Σj ), 1 ≤ j ≤ q. By definition, the means µj are estimated from the data set as XjT 1 Nj
(1.29)
µTj )(Xj − 1ˆ µTj )T (Xj − 1ˆ . Nj − 1
(1.30)
µ ˆj , and the covariance matrices Σj as bj , Σ
Here Xj denotes the rows of X that belongs to class j and Nj the number of rows (observations) in this class. A second step is to define a distance function dj (x) = d(x, Cj ), which calculates a statistical distance from x to distribution Cj . The distance function has ideally the property that if dj (x) < dl (x), then the expected cost of misclassification is minimal if we conclude that x is drawn from Cj rather than from Cl . Simply, we pick the distribution closest to xi with respect to the distance measure d. For normal class distributions the optimal (quadratic) distance function with respect to V (θ) is given by2 d2j (x) = ln det [Σj ] + (x − µj )T Σ−1 j (x − µj ),
(1.31)
2 The a priori probability that a random observation is drawn from C is assumed to be equal j to the probability that it is drawn from Cl . Thus, if nothing else stated, y is uniformly distributed in the sense p(y) = P (y = y (j) ) = 1/q, 1 ≤ j ≤ q.
14
Introduction
see [24, pp. 670]. The distance function is evaluated for every class distribution, which defines the “soft” classification vector 2 d1 (xi ) d22 (xi ) (1.32) y˜i = . . .. d2q (xi ) Finally, at prediction y˜i (xi ) is discretized into the “hard” classification (1.28) by j : dj (xi ) ≤ dl (xi ) for 1 ≤ l ≤ q. Classification is all about finding the distance measure from a point x to a class distribution that makes V (θ) in (1.16) minimal. The described procedure above is the naive textbook solution, that usually is not used in practice due to numerical difficulties. As in the linear regression case in the previous section, regularization is introduced, for instance ridge classification: −1
d2j (x) = ln det [Σj + λI] + (x − µj )T [Σj + λI]
(x − µj )
(1.33)
for small λ’s or subspace classification: −1 S(x − µj ), d2j (x) = ln det SΣj S T + (x − µj )T S T SΣj S T
(1.34)
where S T could be equal to the k first rows of V in the SVD XV = U T , or in the SVD Xj V = U T .
1.4
Regression in a Subspace
There are many well-known and robust techniques to find a subspace S in which the actual linear or nonlinear regression (1.2) can take place. One of the simplest one is to use a limited number of uncorrelated principal components chosen after an error-impact or a variance criterion. To this category belongs the Principal Component Analysis (PCA), and Principal Component Regression (PCR) which uses the variance criteria, [32]. QR-decomposition is a similar way to orthogonalize the problem, [17]. One popular technique in chemometrics is the Partial Least Squares algorithm (PLS), which also can be formulated as a subspace regression method. In the original PLS algorithm, a set of linear combinations of the variables x(j) that have maximum covariance with the variable y is calculated iteratively and used in the regression, [48]. It has been shown, though, that the PLS-subspace can be found by a QR-decomposition of a Krylov matrix composed from the matrix [xi ]p1 and the vector [yi ]p1 , [37]. A nonlinear extension to PLS is the polynomial PLS, which uses a polynomial inner relation between x and y, [47]. Instead of maximizing the covariance between x and y as in PLS, the criterion could be correlation, as in Canonical Correlation Analysis (CCA), [24] (does not work well applied directly to multicollinear data). In [7, 42] it is described how the ML subspace is calculated (not for multicollinear data). Projection pursuit regression
1.5 Thesis Outline
15
(PPR), [14], iteratively finds the directions of the subspace, one at a time, that reduce the (residual) unexplained variance as much as possible. A well-known subspace for classification problems is the (Fisher) Linear Discriminant Analysis (LDA), [13] or Section 3.0.1. LDA finds a subspace where the class distribution means are scattered with respect to the class covariance. The Optimal Linear Transformation (OLT) seen in [6] and and the subspace optimization technique in [31] are more recent developments of the problem to find good subspaces for classification.
1.5
Thesis Outline
As seen above in (1.2), two problems are of major concern. 1. How to find a quality measure F (S) that valuates every subspace S. 2. How to solve the optimization problem minS F (S) s.t. S is a k-dimensional subspace.
(1.35)
Quality measures are discussed in Chapter 2, subspace optimization in Chapter 4. An application of both is seen in Chapter 5. Chapter 3 treats the asymmetric classification problem, by which is identified classification with unequal covariance matrices. A novel method to find subspaces for this problem type is proposed.
1.6
Contributions
1. The Asymmetric Class Projection described in Chapter 3 has not been seen elsewhere. A paper was submitted to IEEE Sensors Journal in July, 2002, [30]. 2. Subspace optimization by the particular ON parameterization described in Chapter 4 has not been seen elsewhere. A paper was submitted to Journal of Chemometrics in November, 2002, [41]. 3. Classification optimal subspace for quantification (continuous regression) described in Chapter 5 has not been seen elsewhere. To be presented at Conference on Decision and Control (CDC) in Las Vegas, 2002, [29]. Some of the material has also been published in [40].
16
Introduction
2 Quality Measures
Assume that for a particular joint probability p(x, y) there exists an optimal predictor function yˆ = f ∗ (x) in the sense V (f ∗ ) = inf V (f ) f
(2.1)
where the general regression error is sZ Z p(y, x) k y − f (x) k2F dxdy
V (f ) =
(2.2)
when y is continuous and s V (f ) =
XZ
p(yi , x) k yi − f (x) k2F dx
(2.3)
i
when y is discrete. Then the infimum error V ∗ = V (f ∗ )
(2.4)
is a predictor independent measure of how well y can be predicted given x. V ∗ depends solely on p(x, y). 17
18
Quality Measures
The quality of a sensor signal y from which some feature should be extracted or predicted, is of course totally determined by p(x, y), and evaluated (for instance) by V ∗ as defined above. If V1∗ is due to sensor 1, and V2∗ due to sensor 2, V1∗ < V2∗ implies that sensor 1 is potentially more accurate than sensor 2. Quality measures can also be used to compare subspaces. If S1 and S2 are two different subspaces, and V ∗ due to S sZ Z V ∗ (S) = inf f
p(y, x) k y − f (Sx) k2F dxdy
(2.5)
(analogously for the discrete case), then V ∗ (S1 ) < V ∗ (S2 ) implies that the subspace S1 is potentially better for prediction than S2 . If the optimal subspace is sought, it is thus natural to optimize V ∗ (S) with the constraint that S should be a subspace. In practice p(y, x) and the optimal predictor f ∗ is not known. Instead the quality has to be estimated from a finite set of observations in a data set. This section will treat quality measures for two major regression cases: classification in Section 2.1 and quantification in Section 2.2, with emphasis on the first. The classification problem is described rather well earlier – typically y is discrete and x is a sample of one of j distinct distributions. Quantification is a term not stringently defined earlier, nor commonly adopted across the disciplines. By quantification we shall identify the problem to predict a quantity. Formally, this means that y is a continuous scalar distribution. Concentration prediction is a typical example of quantification. In Section 2.3 the quality measures for classification are illustrated on artificially generated data, and in Section 2.4 the same measures are evaluated on experimental data from an EN sniffing on different foods in different subspaces.
2.1
Quality Measures for Classification
The ultimate quality measure to use in classification is the minimal classification error rate: 100 · 0.5V ∗ 2 % (discrete y defined as in Section 1.3.2). For normal class distributions, the optimal classifier is well known, and to find V ∗ should only be a matter of estimating the means µj and covariance matrices Σj of the class distributions. However, for class distributions with unequal class covariance matrices ( Σj 6= Σl if j 6= l) it is found that this is not so easy, not even in the simple case of classifying between two distributions (q = 2). The calculations lead to integrals of the conditional probability density function (pdf) −1
e− 2 (x−µj ) Σ (x−µj ) q (2π)n/2 det [Σ]j 1
px|y (x, y) = [j : y = y (j) ] = p(x|Σj , µj ) =
T
(2.6)
over regions enclosed by the classifier boundaries dj (x) > dl (x), for dj (x) as defined in (1.31). The case q > 2 is of course even more difficult.
2.1 Quality Measures for Classification
19
Furthermore, when high dimensional data sets with only few observations available for estimation of the covariance matrices, numerical difficulties arise, for inb j are rank deficient or nearly rank stance when the estimated covariance matrices Σ deficient and distances like dj (x) cannot be reliably estimated. These problems lead to approximate quality measures that are easy to calculate and that give robust measures. This section will briefly survey some measures available. Normal distributions with a priori equal probability (uniformly distributed y) are assumed. First, separation of two class distributions (q = 2) will be described and after that, generalizations to q > 2 are treated.
2.1.1
Two Classes, q = 2
The optimal classifier when the class distributions are normal, Cj = Nn (µj , Σj ), is given in Section 1.3.2. As indicated above, it is difficult in general to evaluate the optimal error rate given by the expression 1 V ∗2 = 2 2
Z p(x|Σ1 , µ1 )dx + x:d1 (x)>d2 (x)
1 2
Z p(x|Σ2 , µ2 )dx
(2.7)
x:d1 (x)≤d2 (x)
where the optimal distance for normal distributions (1.31) is repeated for convenience (it is assumed that Σj has full rank): d2j (x) = ln det [Σj ] + (x − µj )T Σ−1 j (x − µj ).
(2.8)
Note that p(y (1) ) = p(y (2) ) = 1/2 is assumed. However, for the special case Σ1 = Σ2 = Σ, the optimal decision boundary between the two class distributions is the symmetry plane, and the optimal classifier linear: y˜i = θT (xi − µ)
(2.9)
where θ = (µ1 − µ2 )Σ−1 (the Fisher linear discriminant) and µ = (µ1 + µ2 )/2. Discretize by sign: if y˜i < 0 then yˆi = y (1) else yˆi = y (2) . The optimal error rate is V ∗2 = 2
Z
∞
δ/2
x2
e− 2 √ 2π
(2.10)
where δ,
q
(µ1 − µ2 )T Σ−1 (µ1 − µ2 )
(2.11)
is the Mahalanobis distance between the distributions. It is seen in (2.10) that V ∗ decreases monotonically with increasing δ – larger distance δ means better classification accuracy. To get a valid quality measure it is thus not necessary to calculate the integral in (2.10) explicitly, often one can settle with δ, which is easier to calculate.
20
Quality Measures
density x(1) 6 std
density x(2) 3 std
Figure 2.1 Two dot diagrams of two univariate distributions. For the variable x(1) the Mahalanobis distance is 6 standard deviations (std), and for x(2) only 3 std. If the dots in the diagrams are noisy measurements on two concentrations, it is obvious that x(1) measures the concentration with much higher accuracy than x(2) .
Example 2.1 (Distribution distance) An instrument that can distinguish between high and low concentration of a particular chemical shall be manufactured. One has to select one of two sensors to be installed in the instrument, and to evaluate the sensors, a number of measurements have been made on the interesting concentrations. The resulting measurements are x(1) for instrument 1, x(2) for instrument 2, and the result is depicted in the dot diagrams in Figure 2.1. The Mahalanobis distance between the two concentrations is δ1 = 6 standard deviations (std) for instrument 1 and δ2 = 3 std for instrument 2. Since the sensors are equally expensive, the design team selects sensor 1 due to its significantly better ability to distinguish high from low.
Basis invariance of δ A fundamental property of δ is the coordinate basis invariance, which means it is the same in every basis chosen. Suppose for instance we change to an arbitrary basis by the linear transformation xA = Ax, where A is an invertible matrix. Then
2.1 Quality Measures for Classification
21
the (quadratic) distance in the new basis is −1 2 = (µ1 − µ2 )T AT AΣAT A(µ1 − µ2 ) δA −1 A(µ1 − µ2 ) = (µ1 − µ2 )T Σ−1 A−1 AΣAT AΣAT = (µ1 − µ2 )T Σ−1 A−1 A(µ1 − µ2 )
(2.12)
= (µ1 − µ2 )T Σ−1 (µ1 − µ2 ) = δ2. The interesting implication is that the classification accuracy is basis independent for the optimal classifier. Actually, if we insist that AT A = I, det AΣAT = det [Σ] and also dj (x) as defined in (1.31) is unaffected by the basis change. Furthermore, in any projection from n to k dimensions, k < n, the projected 1 distance δS cannot increase. If v = Σ− 2 (µ1 − µ2 ), δ can be expressed δ 2 = v 0 v, and −1 S(µ1 − µ2 ) δS2 = (µ1 − µ2 )T S T SΣS T h 1 i −1 1 1 1 (2.13) SΣ 2 Σ− 2 (µ1 − µ2 ) = (µ1 − µ2 )T Σ− 2 Σ 2 S SΣS T = v T Hv where
−1 1 1 1 1 SΣ 2 = J T (JJ T )−1 J H = Σ 2 S T SΣ 2 Σ 2 S T 1
for J = SΣ 2 . Since H T H = H 2 = H, H is a projection and therefore δS2 = v T Hv = v T H T Hv = (Hv)T Hv = k Hv k22
(2.14)
≤ k v k22 = δ2. Optimality of δ In an arbitrary one-dimensional projection xs = sT x, δs is the ratio of the projected mean difference and standard deviation, T 2 s (µ1 − µ2 ) 2 . (2.15) δs = sT Σs Σ is the class covariance matrix, assumed to be equal for the two class distributions. The interesting questions are: which s maximizes (2.15) and what is the maximum value? By the extended Cauchy-Schwarz inequality [24], 2 T s (µ1 − µ2 ) ≤ sT Σs · (µ1 − µ2 )T Σ−1 (µ1 − µ2 )
22
Quality Measures
for any (µ1 − µ2 ) or δs2
2 (µ1 − µ2 )T s ≤ (µ1 − µ2 )T Σ−1 (µ1 − µ2 ) = δ 2 . = sT Σs
(2.16)
δ is thus the maximum ratio, for any projection s, between the mean difference and standard deviation. Furthermore, if s = Σ−1 (µ1 − µ2 ), −1 2 2 (Σ (µ1 − µ2 ))T (µ1 − µ2 ) sT (µ1 − µ2 ) = = sT Σs (Σ−1 (µ1 − µ2 ))T Σ(Σ−1 (µ1 − µ2 )) 2 (µ1 − µ2 )T Σ−1 (µ1 − µ2 ) = (µ1 − µ2 )T Σ−1 (µ1 − µ2 ) = δ 2 . = (µ1 − µ2 )T Σ−1 ΣΣ−1 (µ1 − µ2 )
δs2
(2.17)
The direction Σ−1 (µ1 − µ2 ) is thus an optimal projection for the distribution pair with respect to the classification accuracy. A Suboptimal Distance To calculate δ it is necessary to invert the covariance matrix of the classes, that is, to calculate Σ−1 . However, at some instances this operation is not possible, for instance if the covariance matrix is rank deficient or if the distance calculation should be implemented in hardware with very limited computational resources. Therefore an approximate statistical distance between classes (τ ) will be suggested, that can be calculated without performing matrix inversion. It was seen that δ identifies the optimal direction Σ−1 (µ1 − µ2 ). A suboptimal direction is given by the mean difference of the distributions, s = (µ1 − µ2 ).
(2.18)
By using this direction instead, an approximate distance is calculated as τ2 ,
[(µ1 − µ2 )T (µ1 − µ2 )]2 . (µ1 − µ2 )T Σ(µ1 − µ2 )
(2.19)
Since δ is the optimal quotient (2.15), τ is never greater than δ. τ is a fairly good approximation to the SDC, unless the covariance matrix is skew and/or the means are very close. Two extreme cases when the approximation is really bad and really good is depicted in Figure 2.2. The solid line is the (µ1 − µ2 )-direction, and the dashed line the Σ−1 (µ1 − µ2 )-direction. The approximate distance τ is not transform invariant in general, and may increase in an non-orthogonal projection. Unequal Covariance Matrices, the Chernoff Distance When the class covariance matrices are unequal, the classification accuracy is difficult to evaluate, as pointed out above. A simple and intuitive approximation is
2.1 Quality Measures for Classification
23
(a) Bad approximation
(b) Good approximation
Figure 2.2 In (a) a bad approximation case where δ = 5.6 and τ = 1.4. In (b) a good approximation case where δ = 10.19, and τ = 10.17. The solid lines are the (µ1 −µ2 )-directions, and the dashed lines the Σ−1 (µ1 − µ2 )-directions.
to use a pooled value of the covariance matrices, h i−1 (µ1 − µ2 ). δˇ2 = (µ1 − µ2 )T p(y (1) )Σ1 + p(y (2) )Σ2
(2.20)
How this distance relates to V ∗ is rather unclear, however. The Chernoff distance, s(1 − s) −1 (µ1 − µ2 )T [sΣ1 + (1 − s)Σ2 ] (µ1 − µ2 ) + δ¯2 = max s 2 det [sΣ1 + (1 − s)Σ2 ] 1 + log s 1−s , 2 det [Σ1 ] det [Σ2 ]
(2.21)
24
Quality Measures
Figure 2.3 Three normal class distributions with δ = 0, but BD = 1.29 (rings/crosses) and BD = 1.01 (rings/stars).
deals with the covariance matrix difference between the distributions in a more stringent way. Here s is a scalar which optimal value must be found by a mini¯ an upper bound for the optimal error rate can simply mization procedure. With δ, be calculated as V ∗2 ¯2 ≤ e−δ . 2
(2.22)
If s = 1/2 the special case of (2.21) called the Bhattacharyya distance (BD) is obtained, which is a “budget” version of δ¯ that needs no minimization. In Figure 2.3 the BD is illustrated for three normal class distributions all with equal mean and thus δ = 0. The BD from rings to crosses is 1.29 and from rings to stars 1.01. For more on the Chernoff and Bhattacharyya distances, see [15, pp. 99].
2.1 Quality Measures for Classification
2.1.2
25
Many Classes, q > 2
When classifying among q class distributions with uniformly distributed y, the optimal error rate is X1 V ∗2 = 2 q j=1 q
Z x∈X˜j
p(x|Σj , µj )dx
(2.23)
where X˜j is the set of all x that is not classified as class j, X˜j = {x : ∃l : dj (x) > dl (x)}.
(2.24)
Since (2.23) is difficult to even approximate numerically, one usually uses an upper bound measure by integrating over a (slightly) larger area, q q−1 X X Vjl∗ 2 1 V ∗2 ≤ , 2 q(q − 1) j=1 2
(2.25)
l=j+1
which is the so called union bound. Vjl∗ 2 /2 is the optimal error rate when classifying between the classes j and l which was calculated, approximated and over-bounded in the previous section. This union bound relies on that the probability of a finite union of events is upper bounded by the sum of the probabilities of the constituent events. We introduce the union bound on the form q q−1 X X Vjl∗ 2 1 (2.26) Λ , − ln q(q − 1) j=1 2 l=j+1
as a quality measure that is closely related to the classification accuracy. Λ has a convenient scale; when Λ is larger when the union bound of the error rate is better. For instance, Λ = 2.3 means an error rate less than 1%, and Λ = 6.9 less than about 0.1%. Discriminance The discriminance or Fisher ratio is the standard deviation between class distribution means with respect to the deviation within the class distributions. It is assumed that all class covariance matrices are equal, Σj = Σ. The discriminance is defined as v uX u q (2.27) ∆ , t (µj − µ)T Σ−1 (µj − µ) j=1
where µ is the overall mean µ = E [ x ].
26
Quality Measures
The discriminance is basis invariant since every term in (2.27) is basis invariant according to (2.12). ∆ is actually a generalization of δ to more than 2 class distributions. It is readily seen, that for two a priory equally likely class distributions, µ2 T −1 µ2 µ2 T −1 µ2 µ1 µ1 µ1 µ1 − ) Σ (µ1 − − ) + (µ2 − − ) Σ (µ2 − − ) 2 2 2 2 2 2 2 2 1 1 = (µ1 − µ2 )T Σ−1 (µ1 − µ2 ) + (µ2 − µ1 )T Σ−1 (µ2 − µ1 ) 4 4 2 = (µ1 − µ2 )T Σ−1 (µ1 − µ2 ) 4 = δ 2 /2. (2.28)
∆2 = (µ1 −
Obviously ∆ is related to V ∗ , but except for the two-class case, the relation is not a simple one. When q > 2, maximum ∆ does not imply minimum V ∗ in general, which will bee seen in Chapter 4. Distance Means Simple ad hoc measures for the ability to discriminate among more than two class distributions are obtained by taking the arithmetic mean of the δjl ’s, Γ,
q q−1 X X 1 δjl , q(q − 1) j=1
(2.29)
l=j+1
or the geometric mean, Π,
q−1 Y
q Y
2
δjlq(q−1) .
(2.30)
j=1 l=j+1
δjl is a distance or approximate distance between distribution j and l. If a transformation invariant class-pair distance like δ is used, of course this property holds for Γ and Π as well. As mentioned, those quality measures are simple, but the relation to V ∗ is rather weak.
2.1.3
Estimation of µ and Σ
In practice, the mean vectors and covariance matrices are not known, but have to be estimated from the analyzed data set itself. It is particularly interesting to estimate the covariance matrix inverse, since it is used to calculate d and δ. By definition, a mean is estimated as µ ˆj ,
XjT 1 Nj
(2.31)
2.1 Quality Measures for Classification
27
and a covariance matrix as bj , Σ
µTj )T (Xj − 1ˆ µTj ) (Xj − 1ˆ Nj − 1
(2.32)
where Xj is the observations of Cj , the class distribution of class j and 1 a vector with ones, see page ix. Nj is the number of observations in Xj . If the class distributions are assumed to have equal covariance matrices, Σj = Σl = Σ, the “within class” pooled covariance matrix is estimated as b, Σ
q 1 X bj. (Nj − 1)Σ N − q i=1
(2.33)
This is an unbiased estimate of Σ if the observations in X are random, see [24, pp.641, 671]. It was shown in Section 2.1.1 that δ (and d) are invariant to orthogonal transformations (xD = DT x for DT D = I) why it is no loss of generality to do the b analysis in the coordinate basis given by the spectral decomposition of Σ, t1 0 · · · 0 . .. . .. b = DT DT , DT D = I, T = 0 t2 , (2.34) Σ . . .. ... 0 .. 0 ··· 0 tn where the eigenvalues are ordered as ti ≥ ti+1 ≥ 0. In this basis the covariance b = T . Denote the entries of of the vector DT (ˆ µj − µ ˆk ) matrix is diagonal, DT ΣD (the mean difference in the new basis) as zi . Then the Mahalanobis distance can be estimated as n X zi2 . (2.35) δˆ = t i=1 i Often, particularly for large n and small N , t1 /tn ≈ ∞ (bad condition number) which makes the distances severely over estimated resulting in over rated quality measures. This is, in short, due to that few measurements cannot fill out the ndimensional feature space. Two basic remedies that give smaller but more stable and generalizable distances are: 1. PC reduction. Use only the k first terms (principal components) in the sum (2.35), δˆ(k) =
k X z2 i
i=1
ti
≤
n X z2 i
i=1
ti
.
(2.36)
Obviously, this will not increase the distance, a fact that also was shown in Section 2.1.1. A simple way to select k is by Akaikes information criteria, X n 2k ti . (2.37) min 1 + k N i=k+1
28
Quality Measures
2. Ridge estimation. Add a small constant ξt1 to the denominators, δˆ(ξ) =
n X i=1
X z2 zi2 i ≤ . ti + ξt1 t i=1 i n
(2.38)
A rule of thumb is to select ξ ≈ 10−5 . Ridge estimation is equivalent to assuming that an uncorrelated additive noise component is present. Illustration The PC reduction will be applied to a number of computer generated data sets. This will be done merely to illustrate the technique. 10,000 Gaussian four-dimensional random vectors are generated, and transformed by a 4-by-4 matrix with normally distributed entries. This data set is divided into two equally sized classes, of which one is translated by a class-pair mean difference vector, also with Gaussian entries. The data set is spread to 100 dimensions by a transform matrix (normally distributed entries). Finally, normally distributed uncorrelated noise is added to the data set. For this data set the “true” (asymptotic) distance is estimated as δ0 . 100 observations (rows) are picked from the data set, 50 from each distribution, to evaluate the estimation of δ by comparing it to δ0 . For the PC reduction technique, values of k ranging from 1 to 80 are tested. The quotient δ/δ0 for 10 repetitions of the experiment is depicted in Figure 2.4 (a). The ideal quotient is of course 1. For k = 80, there is considerable uncertainty, and a severe positive bias. Clearly the estimated δ for k = 10 is much better. Removing too many PCs however, will result in too bad separation measures. In this example, k ≤ 4 of course makes δ too small. The experiment is repeated (with the same random numbers), but now the Ridge estimation with ξ = 5 · 10−6 is used in addition to the PC reduction. The result is depicted in Figure 2.4 (b). The simple illustration shows that there are instances where the Ridge estimation is very powerful, alone probably better than the PC reduction.
2.2
Quality Measures for Quantification
The previous section discussed quality measures to be used with classification problems. In this section, basic measures for quantification will be mentioned. As stated before, by quantification is meant regression when y is a continuous scalar. We assume that y should be predicted from the statistical model x = φ(y) + ε,
ε ∈ Nn (0, Σ).
(2.39)
A very important matter is of course how to estimate φ(y) from data, but this problem will be given no attention in this section, see instead [16]. If the quality is estimated in a subspace, simply replace x by Sx in the treatment below. We assume that Σ is invertible.
2.2 Quality Measures for Quantification
29
6
5
δ/δ0
4
3
2
1
0
0
10
20
30
40 Number of PCs
50
60
70
80
60
70
80
(a) ξ = 0
1.4
1.2
1
δ/δ0
0.8
0.6
0.4
0.2
0
0
10
20
30
40 Number of PCs
50
(b) ξ = 10−6
Figure 2.4 The estimated δ relative to δ0 for different number of principal components k, ideal value is 1. In (a) with ξ = 0, in (b) with ξ = 5 · 10−6 .
30
Quality Measures
It can be shown, [3, pp. 455], that the optimal predictor for y given x is Z ∗ (2.40) f (x) = E [ y|x ] = yp(y|x)dy, in words, that the conditional expectation of y given x is optimal in V ∗ -sense. By Bayes rule we have that p(y|x) =
p(x|y)p(y) p(x)
(2.41)
where p(x|y) = px|y (x, y) = pε (x − φ(y)).
(2.42)
pε is the probability density function (pdf) of ε. The probability of x without knowing y is is obtained by marginalization, Z Z p(x) = p(x, y)dy = p(x|y)p(y)dy. (2.43) The probability density of y could be seen as a design variable, putting emphasis on regions where the prediction is critical. The simplest is two assume uniform distribution within a cube, for instance a/2 if − a ≤ y < a (2.44) py (y) = 0 otherwise. Thus,
R ωpy (ω)px|y (x, ω)dω f (x) = R py (ω)px|y (x, ω)dω ∗
and V ∗2 =
ZZ
R 2 ωpy (ω)px|y (x, ω)dω px|y (x, y)py (y) y − R dxdy. py (ω)px|y (x, ω)dω
(2.45)
(2.46)
The only assumption used so far is that E [ y|x ] is optimal. Since ε is normally distributed, −1
e− 2 (x−φ(y)) Σ (x−φ(y)) q (2π)n/2 det [Σ]j . 1
px|y (x, y) =
T
(2.47)
Using this pdf together with the uniform distribution of y in (2.44), the optimal predictor is simply Ra T −1 1 ωe− 2 (x−φ(ω)) Σ (x−φ(ω)) dω ∗ . (2.48) f (x) = R−a 1 a T −1 e− 2 (x−φ(ω)) Σ (x−φ(ω)) dω −a
2.2 Quality Measures for Quantification
31
which can be numerically approximated rather quickly. However, a over-bounding numerical integration over all possible x in (2.46) is in practice limited to – say – n = 3. A compromise is to use the model (2.39) to generate a large data set of samples (say N = 104 ) and calculate the sum of the quadratic prediction errors (1.14). This is known as a Monte Carlo Simulation. Of course the data set could also come from validation observations of the real process.
2.2.1
Linear φ(x)
Again we assume normally distributed ε, but now x = yθ + ε
(2.49)
where θ is a vector, that is, φ(x) = yθ. Completion of squares in (2.47) gives xT (Σ−1 −QQT )x
−1
(y−xT Q)2 e− 2 (x−yθ) Σ (x−yθ) e− 2θT Σ−1 θ q q = · e− 2θT Σ−1 θ px|y (x, y) = (2π)n/2 det [Σ]j (2π)n/2 det [Σ]j 1
T
(2.50)
for Q=
Σ−1 θ . θT Σ−1 θ
(2.51)
When y is uniformly distributed (2.44), the conditional expectation (2.45) is Ra ∗
−a
f (x) = R a
ωe−
(ω−xT Q)2 2θ T Σ−1 θ
(ω−xT Q)2 − T −1 2θ Σ θ
e −a
dω
,
(2.52)
dω
which in the limit a → ∞ is xT Q. Taking this limit obviously simplifies the calculations, but in reality is a finite. For finite a, better prediction can be accomplished if the integrals in (2.52) numerically in the range −a < ω ≤ a. Since are calculated = 0, Q is said to be unbiased. If the random noise E yθ + ε)T Q − y = E εT Q is independent of y (E yεT = 0) then V ∗ 2 ≤ E (yθ + ε)T Q − y)2 = E (εT Q)2 − 2E yεT Q (2.53) θT Σ−1 ΣΣ−1 θ 1 = T −1 . = QT ΣQ = T −1 2 (θ Σ θ) θ Σ θ
2.2.2
Explained Variance
A common viewpoint is that a predictor f (x) should explain the variance of y. The explained variance we define as R2 , 1 −
V2 , E [ (y − E [ y ])2 ]
(2.54)
32
Quality Measures
and its supremum for a particular p(x, y) thus R∗ 2 . For a data set, the explained variance is defined as R2 , 1 −
k Y − f (X) k2F , k (Y − 1 · 1T Y /N ) k2F
(2.55)
√ see [24, pp. 385]. This is also known as the coefficient of determination (or R2 as the multiple correlation coefficient). If R2 = 1, f (xi ) = yi for the observations in X and Y .
2.3
Artificial Data
The classification quality measures will be evaluated on four normal class distributions in 2 dimensions (q = 4, n = 2). Samples of the distributions are computer generated and visualized by scatter plots in Figure 2.5 through 2.8. Figure 2.5 depicts a distribution to the which others are compared. We denote this the reference distribution. The probability of misclassification in the reference is approximately 1/1500 or Λ = 7.3, see (2.26). Figure 2.6 reflects a distribution similar to the reference, but with greater variance within the class distributions. Figure 2.7 reflects a distribution with two poorly separated class distributions, all other class distributions fairly separated. Compared to the reference distribution, the center class distribution has been displaced. Finally, Figure 2.8 reflects a distribution with simulated sensor response drift (common in sensor systems). In each figure, the quality measures described in this section are given. The deviation in quality compared to the reference distribution is given in percent.
2.3.1
Discussion
Clearly, by visual inspection of the figures, the overall quality of the reference distribution is better than the others. The only measure that reflects this is Λ (2.26), for which the deviation with respect to the reference is negative (worse). Λ reacts significantly on (single) distribution pairs with poor separation, Figure 2.7. This is indeed a typical behavior of Λ, to reflect the worst distribution pair separation. The geometric mean Π (2.30) have a similar property. If two class means are equal, the whole product in (2.30) is zero, no matter what the other class mean differences are. In contrast to Λ, Π does however favor well separated classes, that is, well separated classes can over-compensate poorly separated ones. This is seen clearly in the drift variate, Figure 2.8. In the figure, two distribution-pairs are very close, while all other are very well separated. All measures but Λ favor this distribution with respect to Figure 2.5. The discriminance ∆ (3.1) is not sensitive to one or two poorly separated distribution-pairs, as long as the overall mean variation is large. The arithmetic mean Γ (2.29) is quite similar to the ∆ – it adds up good separation, but does not well enough punish intimacy.
2.4 Experimental Data
2.4
33
Experimental Data
This is an illustration of the quality measures for classification. Here they are used to compare two different subspaces of EN measurements, one obtained from PCA and the other from LDA, see Section 1.4 on page 14 and Section 3.0.1 on page 38. The different classes are due to sniffing on distilled water, yeast, coffee, walnut oil and two brands of olive oil. Figure 2.9 depicts the PCA and LDA of the originally 75-dimensional measurements. Table 2.1 gives the quality measures for the two subspaces.
2.4.1
Discussion
Comparing the LDA and PCA, the LDA is clearly a better subspace for classification. One of the differences between LDA and PCA is that LDA uses {yi } (supervised algorithm), PCA does not, so LDA is potentially better. However, this study concerns the quality measures, and they all recognize that the LDA subspace is better. Note that LDA is the subspace with maximum discriminance ∆ – no other subspace can have larger ∆. The difference in Λ suggests that the classification accuracy is much better in the LDA subspace.
Figure 2.5 Reference distribution to be compared with: Γ = 9.1, Π = 8.8, ∆ = 6.7, Λ = 7.3.
34
Quality Measures
Figure 2.6 Unfocused distribution, class variances greater compared to the reference: Γ = −46%, Π = −46%, ∆ = −46%, Λ = −61%.
Figure 2.7 Displaced, one class distribution mean has moved towards another class: Γ = +2.4%, Π = −4%, ∆ = +4.4%, Λ = −58%.
2.5 Summary
35
Figure 2.8 Sensor response drift, great variance in the drift direction: Γ = +154%, Π = +63%, ∆ = +190%, Λ = −67%. Subspace PCA LDA
Γ 24 45
Π 20 38
∆ 200 311
Λ 6.3 11
Table 2.1 Arithmetic (Γ) and geometric (Π) mean of the distribution-pair distances δij , the discriminance (∆) and − log of over-bounded classification accuracy Λ for the subspaces in Figure 2.9.
2.5
Summary
A quality measure reflects the ability to predict, and is used to compare sensors or subspaces. For classification problems the maximum classification accuracy is the ultimate quality measure that is rather difficult to calculate. However, overbounding measures like the Chernoff distance for two classes and Λ for more than two classes were described, as well as means of the Mahalanobis distances. Quality measures for quantification were briefly touched. Artificial and experimental data illustrated the distances.
36
Quality Measures
olive oil 1 yeast olive oil 2 coffee walnut oil distilled water
(a) LDA
olive oil 1 yeast olive oil 2 coffee walnut oil distilled water
(b) PCA
Figure 2.9 Subspaces of 23-dimensional EN data from (a) LDA and (b) PCA. The EN sniffed on different foods. The quality measures for classification for the two subspaces are compred in Table 2.1.
3 Asymmetric Classification
The asymmetric classification problem concerns discrimination between (two) class distributions with unequal covariance matrices. When the covariance matrices of the distributions are equal, the optimal decision boundary to separate two normal distributions (minimal error rate) is given by the distributions symmetry plane, and it is not difficult to calculate the optimal error rate V ∗ 2 /2 (2.10) and to find subspaces where this error rate is minimal. When the covariance matrices are unequal, there exists no symmetry plane, and the classes are said to be asymmetric. Still, the optimal classifier is well-known for normal distributions (the quadratic classifier ), (1.31), but it is much more difficult to calculate the optimal error rate, especially when the dimensionality n is high. Very little is found in literature about optimal subspaces. The typical example of an asymmetric problem that we will focus on, is when the two classes are of the type, “good/bad”, “normal/abnormal”, “accept/reject”. Say, for instance, that an electronic nose (EN) should be used at a dairy and be “trained” to detect bad milk. The objective of the signal processing is then to transform a point in the feature space into a binary output; good or bad. The mean of the classes are not necessarily unequal, but it is assumed that “good” can be defined as a restricted region in the feature space while “bad” is everything that does not lie in this region, see Figure 3.1. In this work we shall actually adopt the notion of a “good” class and a “bad” class, although the theory is general for asymmetric problems. 37
38
Asymmetric Classification
Figure 3.1 An ACP of a data set with good observations (rings) and bad (crosses). The good observations are well clustered, while the bad observations are spread around the good cluster. The elliptic decision boundary of a quadratic classifier can accurately distinguish good from bad in the plot.
This chapter deals with the problem of how to find the best subspace S for asymmetric classification. The subspace can be used to compress data, to enhance the prediction or just to visualize asymmetric classes in a 2-dimensional plot. The Asymmetric Class Projection (ACP) is a projection that has not been seen elsewhere, we consider it novel. The background to this chapter is the signal processing for an EN that was used to detect bad grain samples. Before describing the ACP in the next section, we shall return to the LDA mentioned earlier. Since LDA is a well-known technique with similar motivation and similar complexity as the ACP, the LDA will be used for comparison. In Section 3.2, a result on Bayes optimality for the ACP is presented. Section 3.3 illustrates the benefits of the ACP on artificial data. In Section 3.4, the ACP is evaluated on experimental data from an EN used to detect bad grain samples. Finally, in Section 3.5 a summary.
3.0.1
Linear Discriminant Analysis
As mentioned before, there are numerous ways to find regression subspaces. Particularly for classification problems, the Fisher linear discriminant analysis (LDA), [13] or [24] pp. 685, is a well-known technique to find linear combinations or dis-
3.1 The Asymmetric Class Projection
39
criminants that facilitates the ability to discriminate among q distributions. As a measure of this ability is used the discriminance (or Fisher ratio), v uX u q (3.1) ∆ = t (µj − µ)T Σ−1 (µj − µ), j=1
where Σ is the covariance matrix, which is assumed to be invertible and equal for all class distributions, µj the mean of distribution j, and µ = E [ x ], see Section 2.1.2. The operator E [ · ] denotes expectation. In words, ∆ is the amount of variation between distribution means with respect to the variation within the distributions, a measure maximized in the subspace given by LDA. For two normal distributions (q = 2) with equal covariance matrices and p(y1 ) = p(y2 ) = 1/2, the LDA subspace is the subspace with minimal error rate (V ∗ 2 /2). The LDA problem is numerically and computationally efficiently solved by a generalized singular value decomposition, see [35]. Among techniques to find good discriminative subspaces we can also mention [31], where an over-bound on the classification accuracy is locally maximized and [6], which introduces the optimal linear transformation (OLT).
3.1
The Asymmetric Class Projection
As indicated above, the purpose of the ACP is to find a set of linear combinations or a linear subspace with as much relevant information as possible. This can alternatively be viewed as data compression, where we aim at retaining the ability to distinguish “good” from “bad” in the compressed data. The reduction from n to k dimensions (k < n) is defined by a set of k linear combinations, each of n variables, comprised in a k-by-n matrix. If S is that matrix, the compression of the measurement vector xi ∈ Rn is calculated by the matrix multiplication xS = Sx. The measurement vector is thus projected onto the rows of S, and we shall therefore denote this reduction by projection (although S is not a projection matrix in mathematical sense). If we are interested in a projection of a space with n dimensions where some mean vector µ and covariance matrix Σ are known, it is of course interesting to know the corresponding entities in the compressed k-dimensional space. The mean is calculated as µS = Sµ, and the covariance matrix as ΣS = SΣS T . In particular, if S is a row vector, both the mean vector and covariance matrix will be compressed to scalars with the mean and variance, respectively.
3.1.1
Objective
The fundamental assumption of the ACP is the existence of an ideal point in feature space. The degree of good will decay as we move away from that point. Thus, if we have two sets of data, one good and one bad, the measurements or observations in the good set will be well clustered around the ideal point while the bad observations
40
Asymmetric Classification
will be more scattered and distant to the ideal point, see Figure 3.1. This is the basic property a classifier would exploit, and the property the ACP tries to retain in a projection. Now, consider two n-dimensional random variables: g (the good ) and b (the bad ) with mean vectors µg = E [ g ] ,
µb = E [ b ] ,
and covariance matrices Σg = E (g − µg )(g − µg )T ,
(3.2)
Σb = E (b − µb )(b − µb )T .
Assume that Σg is invertible and that xi is a sample of either g or b with a priori equal probability. For a particular 1-dimensional projection xs = sT x, s 6= 0, the quotient between bad class variance and good class variance is given by ξ(s) =
sT Σb s Var sT b = . Var sT g sT Σg s
(3.3)
This quotient is a quality measure of the projection onto s. Projections with larger ξ is preferred, since they facilitate the distinction between good and bad, as will be shown later. In fact, the maximization of (the Rayleigh quotient) ξ(s) is a well-known problem equivalent to the generalized eigenvalue problem, [18]. The generalized eigenvalue problem is to calculate the matrix E with eigenvectors ei and the diagonal matrix D with eigenvalues λi , 1 ≤ i ≤ n, such that Σb E = Σg ED,
(3.4)
eTi ei = 1 and eTi Σg ej = eTi Σb ej = 0 whenever i 6= j. We assume that the eigenvalues are ordered so that λi ≥ λi+1 . The eigenvector e1 then solves ξ1 = arg max s
Var sT b Var sT g
(3.5)
and the eigenvalue λ1 identifies the optimum value. The vector e1 gives the linear combination of b with largest variance compared to the variance of the same linear combination of g. Thus, by solving the generalized eigenvalue problem (3.4) a subspace is obtained where the good class is well clustered, the bad class well scattered. That this probably is a very good subspace in a Bayes error sense will be shown in section 3.2. The solution to the generalized eigenvalue problem is very well known and numerically stable and fast algorithms exist, see [35].
3.1.2
Modified Covariance
The covariance matrix Σb is defined as Σb = E (b − µb )(b − µb )T .
3.1 The Asymmetric Class Projection
41
This is the standard definition of a covariance matrix, which means that the magnitude of the covariance is a measure of the variation or spread with respect to the mean. However, if the good and bad class are not concentric (µg 6= µb ) it is more interesting for our purposes to measure the spread of the bad class with respect to the good class mean rather than to the bad class mean itself. This can be achieved by in (3.4) replacing Σb with e b = E (b − µg )(b − µg )T . Σ e b s is a measure of how well the With this definition of bad class covariance, sT Σ projection onto s spread the bad class with respect to the good class mean. Note e b is actually not a covariance matrix. that Σ
3.1.3
Generalization to More than One Dimension
The quotient of the modified variance of the bad and the variance of the good distribution is E (b − µg )2 σ ˜2 = b2 = E (b − µg )σg−2 (b − µg ) . (3.6) ξ= 2 E [ (g − µg ) ] σg As described earlier, this is used as a measure of discrimination in 1 dimension. The generalization to more than 1 dimensions we define as ξ = E (b − µg )T Σ−1 g (b − µg ) i i h h 1 − −1 = E tr Σg 2 (b − µg ) (b − µg )T Σg 2 h 1 i h 1 (3.7) −1 i − − e − 12 = tr Σg 2 E (b − µg )(b − µg )T Σg 2 = tr Σg 2 Σ b Σg i h e = tr Σ−1 g Σb . Now, let E and D be the solution to the generalized eigenvalue problem, e b E diagonal. e b E = Σg ED s.t. D, E T Σg E and E T Σ Σ
(3.8)
It is assumed that the eigenvalues on the diagonal of D are ordered, λi ≥ λi+1 . The diagonality implies that the linear transformation E T g has uncorrelated come b = Σb ) this holds also ponents. If the distributions are concentric (µb = µg ⇒ Σ for the components of E T b. The diagonality also gives an easy way to calculate the trace, n n i X i h h X e b ei eTi Σ T −1 T e e E Σb E = = λi . tr Σ−1 g Σb = tr (E Σg E) eT Σg ei i=1 i i=1
(3.9)
From this it is plain to see, that if we want to maximize ξ(S) under the constraint that transformed components should be uncorrelated in the sense given above, we
42
Asymmetric Classification
should take the eigenvectors ei that corresponds to the greatest eigenvalues. Thus, we define the k-dimensional ACP-projection by T (3.10) SACP = e1 e2 · · · ek , and in this subspace the discrimination measure is ξ(SACP ) =
k X
λi .
(3.11)
i=1
As mentioned before, the properties and optimality of the generalized eigenvalue problem is very well know, see [18].
3.1.4
Relation to LDA
As mentioned in Section 3.0.1, LDA finds the subspace with maximum ∆2 =
q X
(µj − µ)T Σ−1 (µj − µ).
(3.12)
j=1
For two random variables (classes) g and b with equal a priori likelihood, and with equal covariance Σg = Σb = Σ, straight-forward calculation gives ∆2 = 0.5 · (µb − µg )T Σ−1 (µb − µg ) = 0.5 · E [ b − µg ] Σ−1 E [ b − µg ] T
(3.13)
which is known as the (half, square) Mahalanobis distance between the classes, see Section 2.1.1. This should be compared to what is maximized by the ACP: (3.14) ξ = E (b − µg )T Σ−1 g (b − µg ) . It is seen that the major difference is that the expectation for the ACP is quadratic, −1 T ξ = E mT m for m = (b − µg )Σg 2 , while for LDA it is ∆2 = E [ m ] E [ m ] for 1 m = (b−µg )Σ− 2 , which of course is fundamentally different. LDA does not work for equal means, while ACP does not need any mean difference. Furthermore, ACP exploits the difference in covariance, while LDA assumes equal covariance. Although both the LDA and ACP can be expressed as generalized eigenvalue problems, the solutions are different.
3.2
Bayes Error
It will be shown that the projection onto s is Bayes error optimal if s is a solution to (3.5). The Bayes error identifies the error rate of an optimal classifier, see [15]. The assumptions are that both classes are normally distributed with the same mean (concentric) and that 0 < Var sT g < Var sT b ∀s 6= 0, that is, for every
3.2 Bayes Error
43
0.4
p(x,1)
1 − 2⋅ Q(a/ξ) p(x,ξ) Q(a) 0 −5
Q(a) −a
x=0
a
5
Figure 3.2 PDF of g, p(x, 1), and b, p(x, ξ). The grayed area identifies the classification error rate when the classification decision boundaries are −a and a, respectively. Thus, if |x| > a, x most likely belongs to b.
linear combination, the variance of the bad class is larger than the variance of the good class (which is greater than zero). To start with, two normal scalar distributions g and b are considered that have been standardized so that Var g = 1 and E [ g ] = E [ b ] = 0. Thus, let ξ be the variance of the bad class, ξ > 1. Figure 3.2 depicts the PDF (Probability Density Function) of g and b (standardized with respect to g). The normal PDF with zero mean and variance σ is x2
e− 2σ2 p(x, σ) = √ σ 2π and the upper tail integral (Error function) is Z ∞ p(t, σ) dt. Q(x/σ) =
(3.15)
(3.16)
x
Q is used in the calculation of the classification error probability, which given a decision boundary ±a can be summed as the grayed areas in the figure, that is ε(a, ξ) = 2Q(a) + 1 − 2Q(a/ξ).
(3.17)
Of course, this measure of classification accuracy is valid only if an unseen observation is equally likely to be good as bad. By the same figure one realizes that the
44
Asymmetric Classification
optimal decision boundaries (±x) for classification are given by the intersections between p(x, ξ) and p(x, 1) (they minimize ε(x, ξ); by moving the boundaries at ±a in the figure the sum of the grayed areas can only become larger). Solving this equation gives the optimal boundary as s ln ξ 2 . (3.18) a(ξ) = 1 − 1/ξ 2 A closed expression for the Bayes error thus comes out from (3.17) and (3.18) as s s ! ! ln ξ 2 ln ξ 2 . (3.19) + 1 − 2Q ε(ξ) = ε(a(ξ), ξ) = 2Q 1 − 1/ξ 2 ξ2 − 1 Lemma 3.1 If observations are a priori equally likely to be drawn from g ∈ N1 (µ, σg ) as they are from b ∈ N1 (µ, σb and ξ = σb /σg > 1, σg 6= 0, then the Bayes error decreases monotonically with the magnitude of ξ. Proof With no change in the Bayes error, g and b scaled with the same scaling factor so that σg = 1 and translated so that µ = 0. It will now be shown that ε(ξ) as defined in (3.19) decreases monotonically when ξ > 1 increases. More specifically it will be shown that if ε is differentiated with respect to ξ, the result is negative for every ξ > 1. Of course dQ(a)/da = −p(a, 1) and by the chain rule da d(a/ξ) dε = −2 p(a, 1) + 2 p(a/ξ, 1). dξ dξ dξ Differentiating (3.18) gives d da = dξ dξ
s
ln ξ 2 ξ = (ξ 2 − 1 − ln ξ 2 ) 1 − 1/ξ 2 D
(3.20)
(3.21)
and
s d 1 ln ξ 2 d(a/ξ) = = (ξ 2 − 1 − ξ 2 ln ξ 2 ). (3.22) dξ dξ ξ 2 − 1 D 3p D = ξ(ξ 2 − 1) 4 ln ξ 2 is a common denominator that apparently is positive for ξ > 1. Finally ξ da 1 − ln ξ d(a/ξ) 1 − ξln dε √ e 2 −1 = −2 √ e 1−1/ξ2 + 2 dξ dξ 2π dξ 2π r − ξ21−1 d(a/ξ) − ξ21−1 2 da ξ (− + ξ = ) π dξ ξ dξ r 2 − ξ21−1 1 ξ (ln ξ 2 − ξ 2 ln ξ 2 ) < 0 ∀ξ > 1. = π D
2
3.3 Artificial Data
45
Theorem 3.1 If observations are a priori equally likely to be drawn from g ∈ Nn (µ, Σg ) as they are from b ∈ Nn (µ, Σb ) and 0 < Var sT g < Var sT b for all non-zero s ∈ Rn , then the vector sˆ = arg max ξ(s) = arg max s
s
Var sT b Var sT g
(3.23)
gives the Bayes error optimal projection to one dimension. Proof It will be shown that for two non-zero vectors s1 and s2 with ξ(s1 ) > ξ(s2 ), the Bayes error is smaller in the projection onto s1 compared to the projection onto s2 . It is well known that linear combinations of normally distributed variables are T T also normally T distributed, T thus si g and si b are normally distributed. Trivially T si µ = E si g = E si b (in every projection, g and b have the same mean). Furthermore, Var sT g < Var sT b why ξ(si ) > 1 ∀s 6= 0. Since ξ(s1 ) > ξ(s2 ) the result follows directly from Lemma 3.1. 2
3.2.1
k-dimensional Projection
The explicit calculations of Bayes error optimality in multi-dimensions will not be developed in this work. It shall be pointed out, though, that the solution to the generalized eigenvalue problem gives components that are uncorrelated with respect to both g and b, or equivalently, E T Σg E and E T Σb E are diagonal. The optimal dimensional extension to the principal eigenvector e1 is thus [e2 · · · ek ] if the components (linear combinations) should be uncorrelated in the sense eTi Σg ej = ei Σb ej = 0 whenever i 6= j.
3.2.2
Non-Concentric Classes
Using the framework above, and the modified covariance explained in Section 3.1.2, it is necessary to show that for increasing values of ξ = µ2 + σ 2 the Bayes error can only decrease. Here the µ and σ denote the mean and variance of the bad class assuming the variate has been standardized so that the good class has zero mean and unit variance. Numerical experiments indicate that the Bayes error decreases monotonically whenever σ 2 > µ2 , but this remains to be shown analytically.
3.3
Artificial Data
The ACP of a computer generated data set shall be studied and compared to wellknown and common techniques for feature extraction, namely PCA and LDA. The data set is not designed to mimic real life measurements, but rather to illustrate an instance where the ACP is superior.
46
Asymmetric Classification
PCA (Principal Component Analysis) is possibly the most common processing method in chemometrics. It is an unsupervised technique that concentrate as much variance as possible into few uncorrelated components. With unsupervised technique is understood a technique that does not utilize class labels (Y). In other words, the knowledge of which observations that are good and which that are bad is not an input to PCA. LDA is a supervised technique that has been described earlier.
3.3.1
Artificial Data Set
The data set is originally 3-dimensional and 2-dimensional projections produced by PCA, LDA and ACP will be compared. The artificial data set describes two coaxial cylinders; the good class contained within the bad, 80 observations in each class. Of course, the best discriminative projection in this case is a radial section of the cylinders. However, to fool the PCA much variance is given the data set in the axial direction. To fool the LDA a slight mean displacement is present, this also in the axial direction. The classes are thus not concentric.
3.3.2
Comparison
In Figure 4.1 the outcome of the different methods are compared. As intended, the PCA favors the direction with much variance and the projection is thus aligned with the axes of the cylinders. In this 2-dimensional PCA-subspace, a quadratic classifier has an accuracy of 149 correctly classified observations among a total of 160 observations. Also the LDA favors the axes direction due to the mean difference. The classification accuracy for the LDA is 145/160. The ACP is more or less a radial section that very well concentrates the good class in the center, and spreads the bad class around it. The classification accuracy is 160/160.
3.4
Experimental Data
We study a data set obtained from measurements on 204 grain samples. A human test panel classifies each grain sample as good or bad. An electronic nose with 23 response variables measures on the same samples. Thus, for every observation we have 23 variables and the knowledge whether it is attributable to a good or a bad grain sample. The entities µg , µb , Σg and Σb are unknown and have to be estimated from the data set itself. We now want to find out if the sensor configuration in the electronic nose can be trained to make a distinction between good and bad similar to the one produced by the human test panel. We also want to compare the feature extraction of PCA, LDA and ACP.
3.4.1
Validation
Since the means and covariances have to be estimated from the data set itself, random estimation/validation partitionings will be used to gain reliable results.
3.4 Experimental Data
Figure 3.3 The PCA subspace for the artificial example. The good observations (rings) cannot with certainty be discriminated from the bad (crosses).
Figure 3.4 The LDA subspace for the artificial example. The good observations (rings) cannot with certainty be discriminated from the bad (crosses).
47
48
Asymmetric Classification
Figure 3.5 The ACP subspace for the artificial example. The good observations (rings) can with rather high accuracy be discriminated from the bad (crosses) by a circular boundary around the good observation cluster.
The means, covariances and classification models are thus estimated on 75% of the observations, and the displayed projection and classification accuracy are due to the remaining 25%. We denote the estimation sets Tg and Tb , respectively, and the validation sets Vg and Vb , respectively. The sets are described by matrices, where the rows are observations. For instance, g T g = t1
tg2
···
tgNg
T
(3.24)
for the good class training observations, where tgi ∈ R23 is observation number i of the good class estimation set. Ng = 76 is the number of observations in the good class estimation set and Nb = 76 the number of observations in the bad class estimation set.
3.4.2
Estimation of µ and Σ
The means are estimated by definition as µ ˆg =
TgT 1 , Ng
µ ˆb =
TbT 1 , Nb
(3.25)
3.4 Experimental Data
49
where 1 = [1 1 1 · · · 1]T . The covariance and modified covariance matrices are estimated as T Tg − 1ˆ µTg µTg Tg − 1ˆ b (3.26) Σg = Ng − 1 and T Tb − 1ˆ µTg µTg Tb − 1ˆ b e Σb = Nb − 1
(3.27)
respectively.
3.4.3
Projection Calculation
b eb b −1 Σ The optimal k-dimensional projection S ∈ Rk×23 with respect to the trace of Σ g is calculated as (3.10), where ei are the principal eigenvectors of the generalized b b g ED. The validation data sets Vg and Vb are projected e bE = Σ eigenvalue problem Σ S T S as Vg = Vg S and Vb = Vb S T . They are plotted in Figure 3.8. See [35] for numerically stable algorithms for solving the generalized eigenvalue problem.
3.4.4
Plots
For a particular estimation/validation partitioning of the data set, scatter plots of the 2-dimensional projection of LDA and ACP are studied. As a reference, the plot of a PCA is depicted in Figure 3.6. It is seen, that in this subspace the accurate detection of bad samples is almost impossible. Comparing the plots of the LDA in Figure 3.7 to the ACP in Figure 3.8, one can see that the data have fundamentally different structure. LDA tries to find two separated clusters of the good and bad class, while the ACP centers around the good class, and tries to spread the bad observations as much as possible. It is also seen that in both the ACP and the LDA subspace, distinction between good and bad can be done, although not very accurate.
3.4.5
Results
As a quality measure of a projection is used the classification accuracy of a quadratic classifier. The quadratic classifier is known to give the highest (expected) classification accuracy for normal distributions (with known means and covariance matrices), see Section 1.3.2. The classification accuracy in the original 23-dimensional feature space as well as the accuracy in the 2 and 4-dimensional subspaces from PCA, LDA and ACP is given in Table 3.1. The figures are based on 100 random estimation/validation partitionings of the available data set, and for each partitioning, the subspace and classification model are calculated from estimation data, and the number of
50
Asymmetric Classification
−0.3
−0.4
−0.5
−0.6
−0.7
−0.8
−0.9
−1
−1.1
−1.2
−1.3 −0.3
−0.2
−0.1
0
0.1
0.2
0.3
0.4
0.5
Figure 3.6 The PCA subspace for the grain measurements. The distinction between good observations (rings) and bad (crosses) is not much better than the flipping of a fair coin.
0.2
0.15
0.1
0.05
0
−0.05
−0.1
−0.15
−0.2 −20
−15
−10
−5
0
5
Figure 3.7 The LDA subspace for the grain measurements. A line can, with some uncertainty, separate the good observations (rings) from the bad (crosses).
3.4 Experimental Data
51
1
0.5
0
−0.5
−1
−1.5
−6
−5
−4
−3
−2
−1
0
1
Figure 3.8 The ACP subspace for the grain measurements. A circular decision boundary around the good observations (rings) distinguish them, with some uncertainty, from the bad (crosses).
correctly classified observations of the 51 observations in the validation data set is counted. As seen in the table, classification in the PCA subspace is not very much better than the flipping of a fair coin. Among the 2-dimensional subspaces, the LDA gives highest accuracy, and the ACP almost equally good. The opposite holds for the 4-dimensional subspace. The classification accuracy in the unreduced 23 dimensions is 41 ± 2.9. For this data set, the performance of LDA and ACP is thus about equally good. This is because there is sufficient mean difference between the good and the bad class for the LDA to operate well. The Mahalanobis distance is estimated to about 2 standard deviations, and this distance gives a theoretical classification accuracy of a separating plane of 85% or 43/51 (a priori equally likely normal distributions with equal covariance matrices assumed). The grain data thus have a structure that is not the best for the ACP. The EN can (in conjunction with the algorithms mentioned above) only with some uncertainty make the same distinction between good and bad as the human test panel. About 15% of the samples will be classified differently (if bad samples are as likely as good samples).
52
Asymmetric Classification
Table 3.1 Number of correctly classified observations out of 51 possible in different k-dimensional subspaces. The figures are the means ± standard deviations for 100 random estimation/validation partitionings of the dataset. k 2 4
3.5
PCA 28 ± 3.3 29 ± 3.3
LDA 38 ± 3.2 37 ± 2.9
ACP 37 ± 3.1 38 ± 2.6
Summary
A method to find subspaces for asymmetric classification problems, the Asymmetric Class Projection (ACP), was introduced and compared to the well-known Linear Discriminant Analysis (LDA). The ACP has it main benefits when two distributions are nearly concentric and unequal in covariance. It was shown that for the concentric case (equal distribution means), the ACP is optimal for at least 1-dimensional projections. The LDA cannot be used to analyze concentric distributions at all. An artificial data set showed an instance where one can expect the ACP to be beneficial to use. However, tested on a real data set, the LDA performed equally good as the ACP. It was found that this was probably due that the assumption of near concentricity did not hold. The data set came from an electronic nose that should detect bad grain samples.
4 Subspace Parameterization
When reducing the dimensionality of a random variable (for data compression, feature extraction et cetera), it is evident that as much vital information as possible should be retained in the reduced variable. To conduct a successful reduction we thus need metrics that describe the information content in a (reduced) variable, as well as efficient mechanisms to generate and compare different reductions. In this chapter, dimensionality reduction will be treated as a plain optimization problem, where the information metrics or quality measures are used as objective functions and linear projections are the mechanisms that reduce the number of dimensions from n to k. The linear projection of the variable x onto the rows of a k-by-n matrix S (with rank k) is conducted by the matrix multiplication Sx. For fixed k, we thus look for an S that maximizes the quality of Sx. The optimization we constrain by the orthonormality (ON) criteria S T S = I, and the topic of the chapter is an ON parameterization S = S(p) that in the end allow us to conduct unconstrained optimization over p. Of course, some objective functions are easier to optimize than others. Principal Component Analysis (PCA) for instance, measures the subspace quality in terms of variance magnitude. PCA is a simple optimization problem solved for instance by a spectral decomposition of a covariance matrix or by the SVD described in Section 1.3.1 on page 10. Fisher’s Linear Discriminant Analysis (LDA) is another simple optimization problem. LDA maximizes the variance between class distribution means with respect to the variance within the distributions (assuming 53
54
Subspace Parameterization
a classification problem with q class distributions), see Section 3.0.1 on page 38. Also the LDA problem can be solved by a spectral decomposition, [24], or by a generalized SVD, [35]. A more intricate objective is to maximize the classification accuracy of a classification problem with arbitrarily many classes. A simple union (upper) bound on the classification error rate, assuming normal distributions with equal covariance matrices Σ of all classes, is
Λ0 =
2 q q−1 X X δjl 1 Q , q(q − 1) j=1 2
(4.1)
l=j+1
where Q is the integral of the upper tail of the normal probability density function, q the number of distributions, and δjl the Mahalanobis distance between distribution j and l. Of course one whish to minimize the error rate since it is a figure on the likelihood of a misclassification. Another rather difficult objective to be mentioned, is the maximization of the geometric mean of δjl , Π,
q−1 Y
q Y
2
δjlq(q−1) .
(4.2)
j=1 l=j+1
Quality measures interesting to optimize over possible subspaces were discussed in Chapter 2. The difficulty in optimizing the quality measures above is due to their nonconvexity, which makes the search for the optimal solution very time consuming for large dimensionalities n of x (NP-hard). However, numerical algorithms can locally find a sub-optimal subspace with respect to Λ0 and Π. In the general case, an orthonormal constraint on the projection matrix S is necessary, or at least beneficial, for the optimization. The distance δjl is transformation invariant as was shown in Chapter 2, and it is actually possible to maximize a quality measure like the Λ0 directly on the entries of S. However, this is not very efficient (unnecessary many parameters) and there is always a hazard of loosing rank resulting in numerical problems. In [31], where Λ0 is targeted for minimization, the problem is solved by an orthonormalization step at each iteration of a gradient search. This method appears to work well. Possibly the need for the explicit gradient is a drawback. In any event, more sophisticated gradient algorithms exploit the geometry of the problem better, for instance the conjugate gradient search on the Grassman manifold, [9]. Gradient methods will be discussed in a future work. Lagrange relaxation is a possible technique, where the orthonormal constraint is moved to the objective function. Thus, a measure of orthonormality punishes the objective function value whenever the projection matrix is far from orthonormality. There are at least two drawbacks of Lagrange relaxation; unnecessary many parameters (kn) and the difficulty to make a general combination of the original objective function, and
4.1 Optimization Problem
55
the punishment for non-orthonormality. Householder reflection (Section 4.6.1) is an instrument that has been exploited in reducing the dimensionality of antenna signals. Householder reflections gives strict orthonormality at every parameter choice, but is slightly over parameterized. It needs k(2n − k + 1)/2 parameters. Yet another technique is to express the ON constraint as a relaxed bilinear matrix inequality (Section 4.6.2). This opens the possibility to express the problem as a semidefinite programming problem and use a branch-and-bound algorithm to find a arbitrarily accurate solution. This is however not practical for large problem due to the computational complexity. In this chapter a parameterization that gives strict orthonormal projections for k(2n − k − 1) parameters will be described. The idea is to let the orthonormality be implicit by using a set of rotation matrices, or Givens rotations. The parameters are thus angles that determines the attitude of a fix projection plane. To form orthonormal matrices with Givens rotations is a well-known technique frequently used in QR-factorizations, see [18]. In this chapter the rotation parameterization will be applied to the problem of finding subspaces that locally optimize Λ0 and Π. This we have not seen described elsewhere. Next section formalizes the optimization problem and defines the subspace parameterization. In Section 4.2 is described the Givens rotation parameterization and how any linear subspace can be derived to a set of angles. In Section 4.3 the ODP is described, which is a subspace optimization algorithm. In Section 4.4 and Section 4.5 evaluation on artificial and experimental data. Section 4.6 mentions two other interesting techniques, namely the Householder Reflections and the BMI Formulation. Finally, in Section 4.7 a summary.
4.1
Optimization Problem
The problem to find the optimal subspace maxS F (S) s.t. S is a k-dimensional subspace,
(4.3)
where F (S) is a quality measure of the subspace S. As before, S is a k-by-n matrix that defines the reduction of a variable x as xS = Sx. The constraint that S should be a k-dimensional subspace of Rn of course limits the set of all k-by-n matrices (S should have rank k). On the other hand, many matrices can span the same subspace, why the formulation above is not entirely satisfying. One approach to restrict the set of matrices, is to insist that S should form an orthonormal basis, S T S = Ik . In this chapter we shall settle with this restriction, although the same subspace can be spanned by many different ON-matrices. In a future work, more restrictive methods will be studied. Thus, the problem formulation we study in this chapter is maxS F (S) (4.4) s.t. S T S = Ik .
56
Subspace Parameterization
4.1.1
ON Parameterization
A parameterization is an approach to work around the ON constraint above. By parameterization is meant that S actually is a function of a parameter p, S = S(p). The demands we have on this function is 1. For every p, S(p)T S(p) = I. 2. For every S : S T S = I, there should be a p such that S(p) = S. 3. The dimensionality of p should be small. With the demands above fulfilled, we may conduct the unconstrained optimization maxp F (S(p)). The obvious advantage is that the optimization algorithm need not take a complex constraint into account. Another advantage is that the number of entries in p is often much lower than the kn entries of S. This means that the search space has lower dimensionality compared to constrained optimization over the entries of S.
4.2
Givens Parameterization
This section describes how orthonormal projections can be efficiently parameterized using composite rotation matrices. Two fundamental rotation types will be defined, the vector rotation and the matrix rotation. The vector rotation can align any vector with the first base vector, and the matrix rotation can be any orthonormal transformation with positive determinant. How the rotation matrices can be used to create projections, and how orthonormal projections can be identified as a set of rotations, are also discussed. The vector and matrix rotations are composed of a (large) number of Givens rotations, by which is defined a plane rotation in arbitrary dimensionality. The Givens rotation is a well-known instrument for QR-factorizations, but the composition of Givens rotations described in this section has not been seen elsewhere.
4.2.1
Vector Rotation
Define an n-dimensional vector rotation (n ≥ 2) recursively as rn−1 (α) 0 I 0 rn (p) = 0 1 0 r2 (β) where
r2 (β) =
cos β sin β
− sin β cos β
(4.5)
and p = [ αT β ]T ∈ Rn−1 , β ∈ R, are the angles with entries typically in the interval 0 to 2π. The vector rotation is complete in the sense, that for every possible vector v, it exists an angle p that makes the product of the vector rotation and the vector aligned with the first base vector.
4.2 Givens Parameterization
57
Lemma 4.1 (Vector Rotation) For every v ∈ Rn , n ≥ 2, there is an angle p ∈ Rn−1 such that rn (p)v =
|v| 0n−1
.
Furthermore, rnT (p)rn (p) = I for every angle p, that is to say, rn (p) is orthonormal.
Proof Induction over n. Basis Step, n = 2: For v = [v1 v2 ]T , take β such that v1 = |v| cos β and v2 = |v| sin β. This choice of angle obviously yields r2 (β)v = [ |v| 0 ]. Since sin2 β + cos2 β = 1, r2T (β)r2 (β) = I ∀β. Inductive Step: The hypothesis is that the theorem is valid for n = t − 1 ≥ 2, T (α)rt−1 (α) = I, and for every z ∈ Rt−1 there that for every angle α ∈ Rt−2 , rt−1 is an α such that rt−1 (α)z = [ |z| 0 ]. Utilizing this hypothesis the theorem will now be proved for n = t, that is, for every angle p = [ α ˆ T β ]T ∈ Rt−1 (β ∈ R), rtT (p)rt (p) = I, and for arbitrary entries vi of v = [v1 v2 · · · vt ]T , there is a p such that rt (p)v = [ |v| 0 ]T . If n is exchanged by t in (4.5) it is easy to see, with reference to the hypothesis and the Basis Step, that both factors are orthonormal why it is considered a well known fact that this also applies to the product. The orthonormality thus results by mathematical induction, i.e. rnT (p)rn (p) = I for ∀n ≥ 2, ∀p. According to the hypothesis and the Basis Step, it exists α ˆ and β such that rt (p)v = = = = =
rt−1 (ˆ α) 0 0 1
I 0
0 r2 (β)
v
[v1 · · · vt−2 ]T r2 (β)[vt−1 vt ]T [v · · · vt−2 ]T α) 0 1 rt−1 (ˆ |[vt−1 vt ]T | 0 1 0 T [v1 · · · vt−2 ] rt−1 (ˆ α) |[vt−1 vt ]T | 0 |v| 0 0 α) 0 rt−1 (ˆ 0 1
(4.6)
and the general result follows by mathematical induction, i.e. the theorem is valid for all n ≥ 2. 2
58
4.2.2
Subspace Parameterization
Matrix Rotation
Define an n-dimensional matrix rotation (n ≥ 1) recursively as 1 0 rn (β) Rn (p) = 0 Rn−1 (α)
(4.7)
where R1 ≡ 1 n(n−1)
and p = [ αT β T ]T ∈ R 2 , β ∈ Rn−1 , are the angles. The matrix rotation is complete in the sense that it can be any orthonormal matrix with positive determinant. Lemma 4.2 (Matrix Rotation) For every orthonormal S ∈ Rn×n with det [S] = 1, n ≥ 1, there is an angle p (∈ R
n(n−1) 2
) such that Rn (p) = S,
where Rn (p) is given by (4.7). Furthermore, RnT (p)Rn (p) = I for every p. Proof Induction over n. Basis Step, n = 1: Trivial since R1 ≡ 1, which also is the only S that fulfills the presumptions. Inductive Step: The hypothesis is that the theorem is valid for n = t − 1 ≥ 1, T (α)Rt−1 (α) = I, and for every St−1 ∈ R(t−1)×(t−1) with that for every angle α, Rt−1 T T = I. Under the assumption that St−1 St−1 = I there is an α such that Rt−1 (α)St−1 this hypothesis is correct, it will now be shown that for every angle p = [ α ˆ T β T ]T t−1 T t×t T (β ∈ R ), Rt (p)Rt (p) = I, and for every St ∈ R with St St = I there is a p such that Rt (p)StT = I. If n is exchanged by t in (4.7) it is easy to see, with reference to the hypothesis and theorem 4.1, that both factors are orthonormal why it is considered a well known fact that this also applies to the product. The orthonormality thus results by mathematical induction, that is, RnT (p)Rn (p) = I for ∀n ≥ 1, ∀p. α) and rt = rt (β), and let Sti be the For simplicity, let Rt = Rt (p), Rt−1 = Rt−1 (ˆ ith row of St . Now take β according to theorem 4.1 such that rt St1 = [ 1 0 ]T . Then [ 1 0 ]rt Sti = (rt St1 )T rt Sti = St1 T rtT rt Sti = 0 for i 6= 1 from which follows that rt Sti = [ 0 ? ? · · · ? ]T i.e. the first entry is zero (for i 6= 1) and Rt StT is reduced to (see definition) 1 0 1 0 1 0 rt StT = . T 0 Rt−1 0 Rt−1 0 St−1
4.2 Givens Parameterization
59
T But rt and StT are both orthonormal why also their product and thus St−1 are T ˆ orthonormal (St−1 St−1 = I), and according to the hypothesis there is hence an α T = I. such that Rt−1 St−1
It has thus been showed that under the assumption that the theorem is valid for n = t − 1, it is also valid for n = t. Also it has been showed, that the theorem is valid for n = 1. The general result follows by mathematical induction, hence the theorem is valid for any n ≥ 1. 2
4.2.3
Parameterized Dimension Reduction
The composite rotations described above can be used for rather efficient subspace parameterization. The idea is to form a projection matrix from a subset of rows of rn or Rn . The parameterization is efficient in terms of parameter dimensionality, but in general not optimal for subspace selection, since there is a slight over-parameterization due to the fact the base vectors of a subspace (if more than one) can be rotated within the subspace itself. With over parameterization is here meant, that the same subspace can be described with more than one choice of parameter vector. In geometrical terms, we parameterize the Stiefel manifold although some parameterization of the Grassman manifold probably would be more efficient, see [9]. However, with this parameterization, the orthonormality is guaranteed at every choice of parameter vector. We call the matrix onto which rows we project projection matrix, although it is not a projection matrix by definition, that is, SS = S does not necessarily hold.
1-dimensional Projection, k = 1 A 1-dimensional projection is described by a vector (direction) v ∈ Rn with |v| = 1. If r(p)v = [ |v| 0 ]T , then v/|v| is the first row of r(p), where r(p) is a vector rotation. Together with Lemma 4.1, this implies that the first row of r(p) can be any unit length vector v. The first row of the vector rotation thus defines a complete, parameterized 1-dimensional projection with n − 1 parameters. If x is a variable, a measurement for instance, and r1 (p) the first row of r(p), a data projection is calculated as xp = r1 (p)x. In practice, the recursion formula given in (4.5) is surely too inefficient to be used in a gradient search. However, after some calculations with paper and pen or a computer program for symbolic math, one easily sees through the structure of the first row, which then can be implemented directly with entries made up by products of sines and cosines, that is, without one single matrix multiplication. For instance,
60
Subspace Parameterization
in four dimensions, the 1-dimensional projection direction looks like: cos p1 − sin p1 · cos p2 r41 T (p) = sin p1 · sin p2 · cos p3 . − sin p1 · sin p2 · sin p3 k-dimensional Projection A k-dimensional projection is described by a set of k orthonormal vectors collected as rows in the k-by-n matrix S. The orthonormality implies S T S = Ik . If x is a variable, the projection is xp = Sx. By letting S be a number of rows of R, the matrix rotation can be used to parameterize the projection. By doing so, orthonormality is guaranteed for every choice of parameter p. It turns out that if only the k first rows of R are used in S, not all n(n − 1)/2 parameters defining the matrix rotation are needed. According to the following lemma, the k first rows of R depend solely on the k(2n − k − 1)/2 last parameters of pn . A consequence of this is that any orthonormal image from n to k dimensions can be completely parameterized with k(2n − k − 1)/2 parameters. Lemma 4.3 (Parameter Size) The k first rows of Rn (pn ) depend solely on the k(2n − k − 1)/2 last parameters of pn , that is, pkn . Proof Let Ai be the ith row of a matrix A, and Ai:j the rows i through j. Let 1 0 , A+ = 0 A and A− a reduced version of A, where the first row and first column have been removed. Define pin as the in − i(i + 1)/2 last entries in pn . This implies pTn = [ pTn−i pin T ] T 1T T T iT for 1 ≤ i < n. It should be verified that [ pi−1 t−1 pt ] = [ α β ] = pt . In words, i pt is the entries of pt used in the i last recursions of (4.7). It will be shown by induction over n that Rni (pn ) depends solely on pin , 1 ≤ i < n. Basis Step: Trivial for n = 1 since R1 ≡ 1. Induction Step: The hypothesis is that the theorem is valid for n = t − 1, that j (pt−1 ) depends solely on pjt−1 , 1 ≤ j < t − 1. Assuming that the hypothesis is Rt−1 correct, it will now be shown that Rti (pt ) depends solely on pit , 1 ≤ i < t. + i−1 (pt−1 )]i = [ 0 Rt−1 ] which according to the hypothesis, depends If i > 1, [Rt−1 i−1 + 1 solely on pt−1 ([Rt−1 ] ≡ [ 1 0 ]). But by definition (4.7), + (pt−1 )rt (p1t ) Rt (pt ) = Rt−1
4.2 Givens Parameterization
61
and thus + (pt−1 )]i rt (p1t ) Rti (pt ) = [Rt−1 T 1T iT which depends solely on [ pi−1 t−1 pt ] = pt .
That Rni (pn ) depends solely on pin ∀n thus follows by mathematical induction, from which in turn trivially follows that the k first rows of Rn depend solely on 2 the k(2n − k − 1)/2 last parameters of pn . Not all steps in the recursion (4.7) are needed, only the k last. Rn−k can simply be replaced by I, since the parameters that determines Rn−k have no influence on k(k+1) the k first rows of Rn . Therefore it is motivated to define Rn,k (p) for p ∈ Rkn− 2 as the matrix rotation obtained, when only the k last steps in the recursion (4.7) are conducted, starting off from Rn−k = I. Since every factor in Rn,k is orthonormal, also Rn,k is orthonormal. Theorem 4.1 (Subspace Parameterization) Any orthonormal image from n to k dimensions Sn1:k ∈ Rk×n (k < n), can be completely parameterized with k(2n − k − 1)/2 parameters. Proof Let SnT = [ Sn1:k T Snk+1:n T ], where Snk+1:n is calculated so that SnT Sn = In and det [S]n = +1. This is possible according to Lemma 4.2. Note that the determinant criteria can be met by, if needed, simply flipping the sign of one row in Snk+1:n . Now, according to theorem 4.2 it exists an angle p such that the matrix rotation Rn (p) = Sn . But then according to Lemma 4.3, the k first rows of Rn (p), 2 that is Sn1:k , depend solely on the k(2n − k − 1)/2 last entries of the angle.
4.2.4
Inverse Parameterization
When using the parameterization in numerical optimization starting off from some initial solution S0 , we whish to find the corresponding initial angle p0 for which the k first rows of Rn,k (p0 ) equals S0 . In this section, procedures to solve this problem are described. They are given without any proof, but they are derived as “backward implementations” of Lemma 4.1 and Lemma 4.2. Vector Rotation Inverse Define the function α(v) as r2 (α(v))v = [ |v| 0]T for v ∈ R2 . The α(v) is thus calculated so that v1 /|v| = cos α(v) and v2 /|v| = sin α(v) whenever v 6= 0, else α(v) = 0 (for instance). It is good practice to limit α(v) within some period of 2π. Now the function β(v) with rn (β(v))v = [ |v| 0]T
for $v \in \mathbb{R}^n$ can be calculated recursively as

$$\beta(v) = \begin{cases} [\,] & \text{if } v \in \mathbb{R}, \\ [\,\beta(v^-)\;\;\alpha(v_{\mathrm{end}})\,] & \text{otherwise,} \end{cases}$$
where $v_{\mathrm{end}}$ is the last two entries of v, and $v^-$ is a reduced v where the last entry has been removed and the second last has been replaced by $|v_{\mathrm{end}}|$. $[\,]$ is a vector of dimensionality 0. In every step of the recursion, the last two entries of v ($v_{\mathrm{end}}$) are rotated so that the last entry gets zeroed. This is repeated until all but the first entry of v are zero.

Matrix Rotation Inverse

If $SS^T = I_k$ for some $S \in \mathbb{R}^{k \times n}$, k < n, it is possible to calculate a p such that $R^{1:k}(p) = S$, where $R^{1:k}(p)$ is the k first rows of R(p). If $S^1$ is the first row of S, p(S) is calculated recursively as

$$p(S) = \begin{cases} [\,] & \text{if } S = [\,], \\ [\, p\!\left([S\,r^T(\beta(S^1))]^-\right) \;\; \beta(S^1) \,] & \text{otherwise.} \end{cases}$$

In every step of the recursion, the angle β corresponding to the first row of S is calculated. The multiplication of S with $r^T(\beta)$ cancels the first row and column, which can then be removed. By cancel is meant that the vectors are transformed to $[\,1\;\;0\;\cdots\;0\,]$. The total number of parameters becomes $(n-1) + (n-2) + \cdots + (n-k)$. Note that $R_n^{1:k}(p_n) = R_{n,k}^{1:k}(p_n^k)$. (Minor modifications are needed to cover the case k = n.)
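A minimal MATLAB sketch of the vector rotation inverse described above; the only choices made beyond what the text states are the atan2 branch of the angle and the convention α(0) = 0.

function beta = vector_rotation_inverse(v)
% VECTOR_ROTATION_INVERSE  Sketch of beta(v): angles that rotate v onto
% [|v| 0 ... 0]' by repeatedly zeroing the last remaining entry with a
% 2-dimensional rotation.
beta = zeros(1, numel(v) - 1);
for step = numel(v):-1:2
  vend = v(step-1:step);               % last two (remaining) entries
  r = norm(vend);
  if r > 0
    a = atan2(vend(2), vend(1));       % cos a = v1/|v|, sin a = v2/|v|
  else
    a = 0;                             % alpha(0) = 0 by convention
  end
  v(step-1) = r;                       % the rotated pair becomes [|vend| 0]
  beta(step-1) = a;
end
end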
4.2.5 Kernel Parameterization, 2k > n
In 3 dimensions, it is well known that a plane can be completely defined by its normal vector. This way of defining a plane is of course more economical than specifying two non-parallel tangent vectors. In some instances the normal definition is beneficial also in higher dimensions, where the conceptual plane is referred to as the range space and the normal as the kernel. In n dimensions, any k-dimensional range space can be completely defined by its (n − k)-dimensional kernel, and this definition is of course the most economical whenever 2k > n. When a k-dimensional projection of n dimensions is to be parameterized and 2k > n, it is thus a good idea to parameterize the (n − k)-dimensional kernel rather than the k-dimensional range space. With full-rank orthonormal matrices like R, any row subset can form a range space, and the kernel is then given exactly by the remaining rows. The idea is to consider the n − k first rows of $R_{n,n-k}$ as the kernel of the range space to be parameterized. From Theorem 4.1 it is concluded that every kernel can be generated this way, which means that every range space can also be obtained as the k last rows of $R_{n,n-k}$. In other words, the kernel parameterization is complete. Any subspace of $\mathbb{R}^n$ can thus be parameterized by at most (3n² − 2n)/8 parameters (even n and k = n/2).
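As a concrete count (the numbers are chosen here only for illustration): in n = 10 dimensions, a k = 8 range space parameterized directly needs k(2n − k − 1)/2 = 8·11/2 = 44 parameters, whereas its (n − k) = 2-dimensional kernel needs only 2(2·10 − 2 − 1)/2 = 17. For k = n/2 = 5 the two counts coincide at (3n² − 2n)/8 = 35, which is the worst case quoted above.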
The identification problem is slightly more difficult when $R_{n,n-k}^{n-k+1:n}$ is used for projection, since the kernel of the arbitrary orthonormal projection matrix S must be found. However, once the kernel is found, for instance by a singular value decomposition, the corresponding angle p can be determined as described earlier.
4.3 An Algorithm for Subspace Optimization
A complete algorithm for iterative subspace optimization must in general terms have at least

1. A set of good start solutions
2. A subspace parameterization
3. An objective function
4. An optimizing procedure

If the algorithm shall operate on data sets with unknown statistical properties, it must also have means to generate a robust objective function that is representative for the distribution, see Section 2.1.3. The Optimal Discriminative Projection (ODP) is an algorithm with all these components. It operates on clustered data sets, and the objective is to find subspaces where accurate classification is facilitated. The ODP uses the Givens parameterization described in the previous section. It uses PCA, LDA and OLT as start solutions and iteratively improves the union bound of the classification accuracy Λ to a local optimum. The best local optimum is used as algorithm output. The optimizing procedure is the Nelder-Mead Simplex, [33] or [12, pp. 408], which is a simple but robust optimization procedure that does not need an explicit gradient. A sketch of how these components fit together is given below.
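A minimal MATLAB sketch of the optimization step. The helper names are placeholders and not the thesis implementation: proj(p, n, k) maps angles to a k-by-n projection (for instance the Givens sketch given earlier), lambda_obj(S) evaluates the objective Λ of the next subsection with the data-dependent quantities bound in, and p_pca, p_lda, p_olt are start angles obtained by the inverse parameterization of Section 4.2.4.

% Sketch of the ODP optimization step: unconstrained Nelder-Mead search
% (fminsearch) over the Givens angles p, from several start solutions.
n = 10; k = 2;                                % e.g. a 2-by-10 projection
np = k*(2*n - k - 1)/2;                       % each start vector has np angles
starts = {p_pca, p_lda, p_olt};               % assumed, precomputed start angles
opts = optimset('MaxIter', 500, 'Display', 'off');
best_val = -Inf;
for i = 1:numel(starts)
  [p, fval] = fminsearch(@(p) -lambda_obj(proj(p, n, k)), starts{i}, opts);
  if -fval > best_val
    best_val = -fval;  p_best = p;            % keep the best local optimum
  end
end
S_odp = proj(p_best, n, k);                   % the ODP subspace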
4.3.1 Objective Function
As mentioned, the objective function, which is locally maximized, is

$$F(S) = \Lambda(S) = -\ln\left[\frac{1}{q(q-1)}\sum_{j=1}^{q-1}\sum_{l=j+1}^{q}\frac{V_{jl}^{*\,2}(S)}{2}\right] \qquad (4.8)$$
where

$$V_{jl}^{*\,2}(S) = 2\int_{\delta_{jl}(S)/2}^{\infty}\frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx, \qquad (4.9)$$
see Section 2.1.2. The difficulty lies in the estimation of the distance between class distributions j and l, $\delta_{jl}$. If the dimensionality n is high relative to the number of observations in the data set, both PC-reduction and Ridge estimation are used, see Section 2.1.3. The PC-reduction is, however, based on the estimated covariance matrix
of the whole estimation data set; class covariance matrices are not used here. The $k_0$ principal directions are collected in the pre-projection matrix $S_0$. The distances are thus estimated as

$$\hat\delta_{jl}(S) = \sqrt{(m_j - m_l)^T S_0^T S^T \left(S S_0 C_{jl} S_0^T S^T + \lambda_{jl} I\right)^{-1} S S_0 (m_j - m_l)}. \qquad (4.10)$$

Here, the covariance matrices $C_j$ and $C_l$ are assumed to be "mutually equal", which of course is a simplification. The pooled covariance matrix is calculated as
$$C_{jl} = \frac{(N_j - 1)C_j + (N_l - 1)C_l}{N_j + N_l - 2}, \qquad (4.11)$$
which is an unbiased estimate of Σ if the observations in X are random, see [24, pp. 641]. The regularizing constant λjl and the number of principal components k0 are chosen with respect to the number of observations in the estimation data set, see also Section 2.1.3 on page 26. A tested alternative to Λ(S) as objective function is Π(S), the geometrical mean of the distances δjk (S).
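The MATLAB sketch below evaluates the estimated pairwise distances (4.10) and the objective (4.8) for a candidate projection. The function and variable names are illustrative, the normal tail integral is taken via erfc, and a single regularization constant is used for all class pairs as a simplification (the text allows a pair-dependent λjl).

function L = lambda_objective(S, S0, m, C, N, lambda)
% LAMBDA_OBJECTIVE  Sketch of Lambda(S) per (4.8)-(4.11).
% S: k-by-k0 projection applied after the pre-projection S0 (k0-by-n),
% m: n-by-q class means, C: 1-by-q cell of class covariances,
% N: 1-by-q class sizes, lambda: regularization constant.
q = size(m, 2);
acc = 0;
for j = 1:q-1
  for l = j+1:q
    Cjl = ((N(j)-1)*C{j} + (N(l)-1)*C{l}) / (N(j)+N(l)-2);   % (4.11)
    d = S*S0*(m(:,j) - m(:,l));
    M = S*S0*Cjl*S0'*S' + lambda*eye(size(S,1));
    delta = sqrt(d' * (M \ d));                              % (4.10)
    V2 = 2 * 0.5*erfc(delta/(2*sqrt(2)));                    % 2*Q(delta/2), cf. (4.9)
    acc = acc + V2/2;
  end
end
L = -log(acc / (q*(q-1)));                                   % (4.8)
end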
4.4 Artificial Data
The artificial data is computer generated and composed to illustrate an instance where the ODP is superior to LDA when projecting onto a 2-dimensional plane. Six distributions $N_3(m_j, I)$ are given the means $[\,0\;\;0\;\;\pm 12\,]^T$ and $[\,\pm 6\;\;\pm 3\;\;0\,]^T$. The variance within each distribution or class is equally distributed in all directions (covariance matrices equal to the identity). Of course, LDA favors the directions $[\,1\;\;0\;\;0\,]^T$ and $[\,0\;\;0\;\;1\,]^T$, leaving four classes totally mixed up, see Figure 4.1 (a). The ODP, starting from the LDA solution $S_{LDA}$ and optimizing Λ(S), clearly improves the picture, see Figure 4.1 (b). The numerical optimization improved Λ(S) from 1.92 to 7.38. This example was clearly designed to "fool" the LDA and favor the ODP, which perhaps is not entirely fair. However, it is very hard (if at all possible) to construct a counterexample of the same kind in favor of the LDA. The difference is that the Λ(S) objective function "sees" every single class pair and reacts significantly to class pairs with bad separation (δjk(S)), whereas the LDA only takes the summed-up class-mean variance in the different directions into account. There are, however, instances where there is very little improvement to be made by numerical optimization. For instance, a two-class LDA projection is always Λ-optimal, see Section 2.1.2. It turns out that the greatest gains to be made by an algorithm like the ODP, optimizing Λ(S), occur when there is a compromise to be made, for instance when many classes are to be projected onto few dimensions.
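A minimal MATLAB sketch that generates data of this kind (class means and unit covariances as described above; the number of observations per class is an arbitrary choice for illustration):

% Generate the 6-class, 3-dimensional example: N3(m_j, I) with the means
% [0 0 +/-12]' and [+/-6 +/-3 0]'.
means = [ 0  0  12;
          0  0 -12;
          6  3   0;
          6 -3   0;
         -6  3   0;
         -6 -3   0 ]';                 % 3-by-6, one column per class
Nc = 100;                              % observations per class (arbitrary)
X = []; labels = [];
for j = 1:size(means, 2)
  X = [X; randn(Nc, 3) + repmat(means(:, j)', Nc, 1)];  % unit covariance
  labels = [labels; j*ones(Nc, 1)];
end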
4.5 Experimental Data
The ODP algorithm will be applied to two numerical examples and compared to LDA projections. An ODP implementation in MATLAB is used, where the Nelder-Mead Simplex (NMS) algorithm is the numerical optimizer (fminsearch). Unfortunately, the NMS converges slowly, so the number of iterations has to be used as a stopping criterion. There certainly exist much more efficient algorithms than the NMS, but the NMS is simple and robust. One run with the NMS limited to 500 iterations, a 2-by-10 projection parameterization and the objective function described above takes about 10 seconds on an ordinary PC (600 MHz PIII), and is often sufficient to make whatever improvement there is to make from the start solution. The (optimized) version of the Givens parameterization takes about 18%, the objective function 72% (8 classes, fast approximation of the Q-function) and the NMS 10% of the execution time of the optimization step of the ODP.
Figure 4.1 3-dimensional artificial data projected onto the subspaces given by (a) LDA and (b) ODP. The ODP separates all class distributions, while the LDA leaves two distribution pairs totally mixed up.
4.5.1 Image Segmentation Data
This data set is downloaded from the UCI Repository of machine learning databases [4]. The observations are drawn randomly from a database of 7 outdoor images of brick face, sky, foliage, cement, window, path and grass, thus constituting 7 classes. The images have been hand-segmented to create a classification for every pixel, and each observation is a 3×3 pixel region. The estimation data set contains 210 observations, and the validation data set 2100. The dimensionality is n = 18 (one constant entry removed). Both the LDA and the ODP are preceded by PC-reduction with $k_0 = 10$. The 2-dimensional projection from LDA is depicted in Figure 4.2 (a). Using LDA as a start solution, the ODP improved Λ(S) from 1.72 to 3.23, see Figure 4.2 (b). Using a simple quadratic classifier (see Section 1.3.2), the classification accuracy increased from 73.5% to 93.4% for the estimation data, and from 70.0% to 89.1% for the validation data. An ordered list of the δjk(S) in the LDA projection (S = S_LDA, validation data) starts 0.10, 1.33, 1.54, 1.66, 2.13, . . . The same list for the ODP projection (S = S_ODP) is 1.72, 2.63, 2.77, 3.80, 4.26, . . . The unit is standard deviations (std). From the viewpoint of a classifier, the ODP projection is without doubt better, since in general it is the worst separated classes that contribute the most to the error rate.
4.5.2 Electronic Nose Data
This data set is obtained from an electronic nose (EN) smelling 8 different samples. In each sample, or class, there are about 75 estimation and 25 validation observations. The dimensionality is n = 39, and PC-reduction with $k_0 = 10$ is used. The 2-dimensional LDA of the validation set is depicted in Figure 4.3 (a), Λ(S_LDA) = 4.23. Using an OLT projection as start solution, the ODP improved this to Λ(S_ODP) = 21.2.
Figure 4.2 18-dimensional segmented image data projected onto subspaces given by (a) LDA and (b) ODP.
The same figures for the validation set are 3.54 for LDA and 21.5 for ODP. The 2-dimensional ODP of the validation set is depicted in Figure 4.3 (b). Indeed, this instance shows a significant improvement in objective function value. The smallest δjk(S) in the validation set are, for LDA, 1.47, 6.6, 7.5, 8.4, 25, . . . std. The same list for the ODP projection is 12, 14, 14, 16, 18, . . . std. If instead Π(S) (2.30) is used as objective function, the result is depicted in Figure 4.4. The Π(S) for the 2-dimensional LDA is 36.8, improved by the ODP with OLT start solution to 38.4. The smallest δjk(S) are then 4.12, 9.5, 9.8, 11, 23, . . . std. The Π(S) gives worse classification accuracy in total, and in general a worse picture, but it should not be dismissed. The benefit of Π(S) compared to Λ(S) is that it maintains the relative distribution of the δjk's well. Since Π(S) is a product, it is not primarily concerned with the absolute magnitude of the δjk(S) but rather with the angles between the projection plane and the distribution-pair Fisher vectors (the optimal direction, see Section 2.1.1). This means that if two coffee brands are close, but both distant to tea, this relation is well maintained in the Π-optimized projection. The result of misclassifying tea as coffee is probably much more embarrassing than taking espresso for java, a fact thus recognized by the Π(S) objective function.
4.6 Competing Techniques

4.6.1 Householder Reflections
The Householder reflection is a well-known instrument used for QR-factorizations and subspace parameterizations, see [18] and [1]. A very brief introduction below will put the Householder reflection into our framework. The Householder reflection $R_i$ is defined for a non-zero vector $v_i$ as

$$R_i = I - 2\,\frac{v_i v_i^T}{v_i^T v_i}, \qquad (4.12)$$
and $R_i = I$ for $v_i = 0$. That $R_i$ is a reflection is readily seen, $R_i v_i = -v_i$, as well as that $R_i$ is ON, $R_i^T R_i = I$. Now, if P is an arbitrary ON matrix with positive determinant, and $P_1$ the first column of P, then we can take

$$v_1 = [\,1\;\;0\;\cdots\;0\,]^T - P_1 \qquad (4.13)$$
Figure 4.3 39-dimensional EN data projected onto the subspaces given by (a) LDA and (b) ODP.
Figure 4.4 39-dimensional EN data projected onto the subspace given by ODP when Π(S) is optimized.
and conclude that

$$P^T R_1 = \begin{bmatrix} 1 & 0 \\ 0 & P^{(2)} \end{bmatrix} \qquad (4.14)$$
where $P^{(2)} \in \mathbb{R}^{(n-1)\times(n-1)}$ is ON (products of ON matrices are ON). This can be repeated by taking

$$v_2 = \begin{bmatrix} 0 \\ [\,1\;\;0\;\cdots\;0\,]^T - P_1^{(2)} \end{bmatrix}, \qquad (4.15)$$

which gives

$$P^T R_1 R_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & P^{(3)} \end{bmatrix}. \qquad (4.16)$$
If this is iterated n − 1 times, we end up with

$$P^T R_1 \cdots R_{n-1} = I, \qquad (4.17)$$

from which it is concluded that

$$R = R_1 \cdots R_{n-1} = P. \qquad (4.18)$$
This indicates that every ON matrix can be expressed as a product of n − 1 Householder reflections. Note that the Householder vector $v_i$ has the form

$$v_i = [\,\underbrace{0\;\cdots\;0}_{i-1}\;\;\tilde v_i\,]^T \qquad (4.19)$$
where $\tilde v_i$ is an (n − i + 1)-dimensional vector. The number of parameters is thus n + (n − 1) + · · · + 2 = n(n + 1)/2 − 1 for a square ON matrix. Only $R^{(k)} = R_1 R_2 \cdots R_k$, or rather $v_1, v_2, \ldots, v_k$, actually influence the k first rows and columns of R. Therefore, any ON k-by-n matrix can be parameterized with k Householder reflections as the k first rows of $R^{(k)}$, with n + (n − 1) + · · · + (n − k + 1) = k(2n − k + 1)/2 parameters. The Householder reflections are overparameterized, since there are many vectors $v_i$ that give the same

$$\frac{v_i v_i^T}{v_i^T v_i}. \qquad (4.20)$$

This can be worked around by letting

$$v_i = [\,\underbrace{0\;\cdots\;0}_{i-1}\;\;\bar v_i\;\;1\,]^T, \qquad (4.21)$$

where $\bar v_i$ is an (n − i)-dimensional vector. This gives k(2n − k − 1)/2 parameters, the same number as for the Givens parameterization. The drawback with this is of course that there is a set of ON matrices that cannot be generated exactly, although they can be approached arbitrarily closely.
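A minimal MATLAB sketch of the Householder-based parameterization: k reflections with vectors of the form (4.19) are accumulated, and the k first rows of the product form an ON k-by-n matrix. The packing of the free parameters into a single vector is an illustrative choice.

function S = householder_projection(v, n, k)
% HOUSEHOLDER_PROJECTION  Sketch: k-by-n matrix with orthonormal rows from
% k Householder reflections. v holds the k(2n-k+1)/2 free entries (the
% nonzero parts of v_1,...,v_k as in (4.19)), packed consecutively.
assert(numel(v) == k*(2*n - k + 1)/2);
R = eye(n);
pos = 1;
for i = 1:k
  vi = zeros(n, 1);
  len = n - i + 1;                       % nonzero part has n-i+1 entries
  vi(i:n) = v(pos:pos+len-1);
  pos = pos + len;
  if any(vi)
    Ri = eye(n) - 2*(vi*vi')/(vi'*vi);   % the reflection (4.12)
  else
    Ri = eye(n);                         % R_i = I for v_i = 0
  end
  R = R*Ri;                              % R^(k) = R_1 R_2 ... R_k
end
S = R(1:k, :);                           % k first rows are orthonormal
end

For any parameter vector v, S*S' equals the identity up to rounding, just as for the Givens parameterization.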
4.6.2 ON-constraints as a BMI
The ON constraint can be expressed as a bilinear matrix inequality (BMI),

$$H(a, b) = \sum_{i,j} a_i b_j H_{ij} \preceq 0, \qquad (4.22)$$

by which is meant that the symmetric matrix H(a, b) should be negative semidefinite, $x^T H(a, b)x \le 0$ for all x. The definition is similar for non-symmetric matrices. An optimization problem with this constraint is NP-hard in general, that is, extremely complex to solve for large problems. However, in small regions of the definition space, the BMI can be relaxed to a linear matrix inequality (LMI),

$$\tilde H(c) = \sum_i c_i \tilde H_i \preceq 0, \qquad (4.23)$$
a constraint for which efficient solvers are available, see [5, 44]. In this section the topic of relaxing the orthonormal constraint will be briefly surveyed. The material was developed for a maximizer of the subspace geometric mean of the Mahalanobis distances (Π), see [28]. Note that the BMI is not a parameterization, it is just another way to incorporate the ON constraint.
BMI formulation

If $R = -I + SS^T$, the orthonormal constraint becomes R = 0. Below, a lemma and a theorem will be formulated which constitute the basis for a reformulation of the orthonormal constraint. $\omega_i$ denotes the vector with zeros except in position i, which is 1.

Lemma 4.4 (Diagonal elements) If R is a symmetric matrix with entries $r_{ij}$, then

$$R \preceq 0 \;\Rightarrow\; r_{ii} \le 0. \qquad (4.24)$$

Proof $R \preceq 0 \Rightarrow r_{ii} = \omega_i^T R\,\omega_i \le 0$ by definition. □
Theorem 4.2 (Null matrix) If R is a symmetric matrix, then

$$R \preceq 0 \;\text{and}\; r_{ii} \ge 0 \;\Rightarrow\; R = 0. \qquad (4.25)$$

Proof Assume $R \preceq 0$, $r_{ii} \ge 0$ and $R \ne 0$ for some symmetric matrix R. According to Lemma 4.4, $r_{ii} \le 0$. Since also $r_{ii} \ge 0$, it is concluded that $r_{ii} = 0$, so the quadratic form $x^T R x$ lacks square terms. Since $R \ne 0$, at least one element $r_{ij} = r_{ji} \ne 0$, and thus

$$(\omega_i + \omega_j)^T R (\omega_i + \omega_j) = \omega_i^T R\,\omega_j + \omega_j^T R\,\omega_i = r_{ij} + r_{ji} = 2r_{ij} \ne 0. \qquad (4.26)$$

But then

$$(\omega_i - \omega_j)^T R (\omega_i - \omega_j) = -\omega_i^T R\,\omega_j - \omega_j^T R\,\omega_i = -r_{ij} - r_{ji} = -2r_{ij}, \qquad (4.27)$$

and R is indefinite, which contradicts the assumptions. □
Define $s_i \in \mathbb{R}^n$ to be the ith row of S and $s_{ij}$ to be the jth element of $s_i$. The orthonormal constraint $SS^T - I = 0$ may thus be rewritten as

$$-I + S^T S \preceq 0 \qquad (4.28)$$

$$s_i^T s_i \ge 1, \quad 1 \le i \le k, \qquad (4.29)$$

since $-I + S^T S \preceq 0$ is equivalent to $-I + SS^T \preceq 0$ ($S^T S$ and $SS^T$ have the same nonzero eigenvalues). Constraint (4.28) is convex and can be expressed as an LMI:

$$-I + S^T S \preceq 0 \;\Leftrightarrow\; \begin{bmatrix} -I + S^T S & 0 \\ 0 & -I \end{bmatrix} \preceq 0 \;\Leftrightarrow\; \begin{bmatrix} I & -S^T \\ 0 & I \end{bmatrix}\begin{bmatrix} -I + S^T S & 0 \\ 0 & -I \end{bmatrix}\begin{bmatrix} I & 0 \\ -S & I \end{bmatrix} = \begin{bmatrix} -I & S^T \\ S & -I \end{bmatrix} \preceq 0. \qquad (4.30)$$
Constraint (4.29) is non-convex, and problems involving this type of constraint are generally NP-hard. A branch-and-bound algorithm divides the objective function definition space into regions so small that it makes sense to solve linearized (convex-relaxed) instances of the problem. The idea of relaxing the constraint in this way is to enclose the non-convex regions with fairly small convex ones. Thus, for any well-defined subregion of the problem, the solution to the relaxed problem is never worse than the optimal solution. For a maximization problem this means that the objective function value of the relaxed solution constitutes an upper bound on the optimum.

LMI Relaxation

Consider a subregion of the optimization problem where the upper bound of every $s_{ij}$ is known as $\bar s_{ij}$, and the lower bound as $\underline s_{ij}$. Such a well-defined region or "box" is referred to as a frame. Since S shall be constrained to be orthonormal,

$$-1 \le \underline s_{ij} \le \bar s_{ij} \le 1. \qquad (4.31)$$
For every component $s_{ij}$ of S, introduce the variable $w_{ij}$ as the relaxation of $s_{ij}^2$. The constraints

$$0 \ge s_{ij}^2 - w_{ij} \qquad (4.32)$$

$$0 \ge -s_{ij}^2 + w_{ij} \qquad (4.33)$$

of course correspond to the demand of equality between $s_{ij}^2$ and $w_{ij}$. In the same way as in (4.30), the inequality (4.32) can be rewritten as

$$\begin{bmatrix} -w_{ij} & s_{ij} \\ s_{ij} & -1 \end{bmatrix} \preceq 0, \qquad (4.34)$$

which is an LMI valid for any frame. (4.33) is non-convex and will be relaxed according to the frame considered. It is easy to realize that the minimum convex region enclosing the curve $0 = s_{ij}^2 - w_{ij}$ is the LMI above in conjunction with the linear constraint

$$0 \ge w_{ij} - \left(\frac{s_{ij} - \underline s_{ij}}{\bar s_{ij} - \underline s_{ij}}\,(\bar s_{ij}^{\,2} - \underline s_{ij}^{\,2}) + \underline s_{ij}^{\,2}\right) = w_{ij} - s_{ij}(\bar s_{ij} + \underline s_{ij}) + \bar s_{ij}\,\underline s_{ij}. \qquad (4.35)$$
The line $y = s_{ij}(\bar s_{ij} + \underline s_{ij}) - \bar s_{ij}\,\underline s_{ij}$ is depicted in Figure 4.5 together with $y = s_{ij}^2$. With the help variables $w_{ij}$, the constraint

$$0 \ge -s_i^T s_i + 1, \quad 1 \le i \le k, \qquad (4.36)$$

can now, in any well-defined frame, be substituted by the relaxed constraint

$$0 \ge -\sum_{j=1}^{n} w_{ij} + 1, \quad 1 \le i \le k. \qquad (4.37)$$
y=
sij2
y
w y=
Convex region
ij
sij
sij
sij
Figure 4.5 Linear upper bound of $s_{ij}^2$: $w_{ij} = s_{ij}(\bar s_{ij} + \underline s_{ij}) - \bar s_{ij}\,\underline s_{ij}$.

The results in this section are summarized by the LMIs

$$s_{ij}\,\mathrm{diag}\!\left(\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},\; -\bar s_{ij} - \underline s_{ij}\right) + \mathrm{diag}\!\left(\begin{bmatrix} 0 & 0 \\ 0 & -1 \end{bmatrix},\; \bar s_{ij}\,\underline s_{ij}\right) - w_{ij}\,\mathrm{diag}\!\left(\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix},\; -1\right) \preceq 0, \qquad (4.38)$$

$1 \le i \le k$, $1 \le j \le n$, for which $s_{ij}$ is not identically 0, and

$$0 \ge -\sum_{j=1}^{n} w_{ij} + 1, \quad 1 \le i \le k. \qquad (4.39)$$

This convex relaxation is valid for any well-defined frame, i.e. any subregion of the problem. However, to be reasonably good (strong), the frame must not be too large.

Complexity

Due to the complexity of BMI problems, branch-and-bound algorithms are limited to rather small problems. In my experience, 2-dimensional projections of 4-dimensional data is a practical limit for an ordinary PC. However, the solution to a minimization problem can be guaranteed to be no worse than the global minimum plus a known constant.
4.7 Summary
In this chapter a new technique to parameterize orthonormal (ON) projections has been described. It has been shown that a particular composition of plane Givens rotations can represent any ON matrix with determinant equal to +1. Furthermore, ON
projections from n dimensions to k dimensions can be completely parameterized with k(2n − k − 1)/2 unconstrained parameters. The parameterization can be used in all sorts of optimization over ON subspaces, and a particular algorithm, the Optimal Discriminative Projection (ODP), which extracts relevant information from high-dimensional data sets, has been described.
5 Clustered Regression Analysis
The main topic of this chapter is how data clustering can be used to find subspaces for quantification, that is, continuous scalar prediction. The background is the regression problem associated with concentration measurements with the electronic tongue (ET), see Section 1.2. A measurement vector $x_i \in \mathbb{R}^n$ from the ET is very large, typically of the order n = 10000. With each measurement $x_i$ is associated a scalar concentration $y_i$ in moles. We wish to find a function f that with the highest possible accuracy relates the vectors $x_i$ to the concentrations $y_i$, that is, the expected magnitude of the error $e_i$ in

$$y_i = f(x_i) + e_i$$
(5.1)
should be minimal. The error $e_i$ is generally due to an imperfect f and to a random measurement noise sample $\varepsilon_i$ which we cannot account for. Of course, for the relation f (and the ET) to be useful, we must be able to state that the expected magnitude of $e_i$ is small when a real measurement is made and $y_i$ is estimated. An appropriate statistical model for measurements is

$$x_i = \phi(y_i, \eta) + \varepsilon_i$$
(5.2)
where, for a fixed parameter η, $\phi(y, \eta): \mathbb{R} \to \mathbb{R}^n$. φ thus describes an n-dimensional trajectory as y increases from low to high values (concentrations). The noise sample $\varepsilon_i$ is taken from a distribution assumed to have zero mean and
(unknown) covariance matrix Σ. Here we also assume that there is no temporal correlation between the noise samples; $E[\varepsilon_i \varepsilon_j^T] = 0$ when $j \ne i$. Figure 5.1 gives an example of a 2-dimensional, nonlinear trajectory and a number of possible measurements $x_i$. Note that $x^{(j)}$ is the jth entry of the vectors $x_i$. The ordinary steps to solve the regression problem above are first to select a suitable parameterization of φ(y), and second to estimate the parameter η from an estimation (training) data set $\{x_i, y_i\}_1^N$. The parameterization could for instance be polynomial, and η the parameter matrix whose entries are the coefficients of the polynomials. Given a parameterization, the maximum likelihood (ML) estimate of η in (5.2) is

$$\eta_{ML} = \arg\min_\eta \det\left[\Sigma_{ML}(\eta)\right] = \arg\min_\eta \det\left[\sum_{i=1}^{N} (\phi(y_i, \eta) - x_i)(\phi(y_i, \eta) - x_i)^T\right] \qquad (5.3)$$
assuming normally distributed noise, [16]. The least squares (LS) estimate is expressed similarly; replace the matrix determinant $\det[\Sigma_{ML}(\eta)]$ above with the trace operator $\mathrm{tr}[\Sigma_{ML}(\eta)]$. A complication arises when the (large) n-by-n covariance matrix $\Sigma_{ML}(\eta)$ is rank deficient, which is always the case when N < n. Then $\det[\Sigma_{ML}(\eta)] = 0$ for every possible choice of parameter, and (5.3) cannot be used at all to find η. To circumvent this complication, one often has to use the fact that the nonlinear trajectory mainly resides within a linear k-dimensional subspace, where $k \ll n$ (and k < N) [7, 21, 38].
Figure 5.1 A trajectory φ(y) together with a number of noise obscured measurements (◦) in two dimensions (x(1) /x(2) ). The cluster illustrates the noise distribution (+).
Thus, we can factorize φ as

$$\phi(y, \eta) = S^T \phi_0(y, \eta_0)$$
(5.4)
for some orthonormal k-by-n matrix S. Hence the model (5.2) evolves as xi = S T φ0 (yi , η0 ) + εi .
(5.5)
The parameter estimation is then conducted in the subspace spanned by the rows of S, that is, one solves

$$\hat\eta_0 = \arg\min_{\eta_0} \det\left[\sum_{i=1}^{N} (\phi_0(y_i, \eta_0) - Sx_i)(\phi_0(y_i, \eta_0) - Sx_i)^T\right].$$
Thus, if we know S, it is possible to estimate $\eta_0$ and the subspace noise covariance matrix $S\Sigma S^T$, even when N < n. Alternatively, the regression (5.1) can be solved directly in the LS sense by letting f(x) = g(Sx, θ) for some appropriate function family g. However, the subspace S is seldom known; it has to be found by calculations on empirical data. For the estimation of the parameters of a nonlinear φ, it is of course particularly important to find the minimal dimensionality (k) that describes φ (or f) well. This is difficult for multicollinear data sets (linearly dependent $x^{(j)}$'s), for which n > N is typical [49]. The next section discusses which regression model to use, the inverse or the classical. The models will be evaluated by Monte Carlo simulations. The main topic, how to find subspaces for quantification by using clustering, is treated in Section 5.2. This technique we denote Clustered Regression Analysis (CRA). The CRA concept is evaluated and compared to the well-known PLS in Section 5.3 (artificial data) and Section 5.4 (experimental data). PLS was mentioned in Section 1.4 on page 14; no further description of this algorithm will be given here. Finally, Section 5.5 gives a summary.
5.1 Inverse vs. Classical Regression
When creating the regression function f in (5.1), there are basically two approaches at hand:

1. Let f(x) = g(x, θ) for an adequate function family g, and let θ minimize some loss function of the errors $e_i$ for an estimation data set $\{x_i, y_i\}_1^p$. This approach is denoted inverse regression.

2. Choose a function family φ(y, η), solve (5.3) (for normally distributed noise, for instance) and derive the conditional probability p(y|x). Estimate y as f(x) = E[y|x] or as $f(x) = \arg\max_y p(y|x)$. This approach is denoted classical regression.
The first approach is straightforward and gives a function f that explicitly maps a new measurement to a prediction of y. In the second, the prediction of y is formulated implicitly by the model, for instance as the conditional expectation f(x) = E[y|x] or as the conditional ML value $f(x) = \arg\max_y p(y|x)$. The determination of the conditional probability p(y|x) is essential and requires that the noise distribution is known. There is no definitive answer to which method should be used to calibrate an instrument. Even in the linear case the question is still open. The ongoing discussion was started back in 1939 by C. Eisenhart [10], who advocated the classical approach. In 1967, R. C. Krutchkoff [25] conducted extensive Monte Carlo simulations on linear univariates, indicating the superiority of the inverse approach. After that, the theory has evolved in [22, 23, 26, 36, 43] et cetera, but the treatment has mainly been limited to linear problems. The subspaces we shall study below contain potentially very nonlinear trajectories, which means that a simulation study is motivated to indicate which regression model to use in our case.
5.1.1 Classical Prediction of y
To conduct classical prediction, φ(y, η) and the distribution of ε must be at least approximately known. From them the conditional probability density function of y given x, p(y|x), can be derived, see Section 2.2 on page 28. Two possible predictors for classical models are the conditional expectation (Bayes):

$$\hat y_{Bayes} = E[\,y|x\,] = \frac{\int y\, p(y|x)\, dy}{\int p(y|x)\, dy}, \qquad (5.6)$$

and the ML:

$$\hat y_{ML} = \arg\max_{y} p(y|x).$$
(5.7)
In practice, (5.6) and (5.7) are solved numerically with the prior $y_{min} \le y \le y_{max}$. Note that the ML and Bayes estimates are not necessarily equal, which they are, in theory, for linear estimation in normally distributed noise.
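A minimal MATLAB sketch of the numerical evaluation of (5.6) and (5.7), assuming a known trajectory function handle phi and Gaussian noise with covariance Sigma; the names x, phi, Sigma, ymin, ymax and the grid size are illustrative assumptions.

% Sketch: classical prediction of y from one measurement x by gridding
% p(y|x) over the prior interval [ymin, ymax], assuming Gaussian noise.
yg = linspace(ymin, ymax, 200)';          % 200 grid points
py = zeros(size(yg));
for i = 1:numel(yg)
  r = x - phi(yg(i));                     % residual for candidate y
  py(i) = exp(-0.5 * (r' * (Sigma \ r))); % unnormalized p(y|x)
end
y_bayes = sum(yg .* py) / sum(py);        % conditional expectation (5.6)
[~, imax] = max(py);
y_ml = yg(imax);                          % conditional ML value (5.7)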
5.1.2 Monte Carlo Simulation
In this section the performance of classical and inverse regression will be compared when the trajectory is very nonlinear. The trajectory is modeled by polynomial g and φ: $g(x, \theta) = \theta^T P_c(x)$ and $\phi(y, \eta) = \eta \mathbf{1}_n P_c(y)$. $P_c$ is the power expansion operator that generates all powers and mixed products up to order c. For instance, $P_2(a) = [\,a\;\;a^2\,]$ and $P_2([\,a\;\;b\,]) = [\,a\;\;a^2\;\;ab\;\;b^2\;\;b\,]$ (similarly for column vectors). $\mathbf{1}_n$ is an n-vector with ones, see page ix. The parameter vector θ and matrix η are easy to estimate from a data set, since they enter the functions linearly. In this study we shall focus on the regression and postpone the treatment of the subspace problem to the next section. The regression will thus be restricted to two variables.
Data

Artificial data will be generated according to the statistical model (5.2). The trajectory $\phi_0(y)$ will be

$$\phi_0(y) = \begin{bmatrix} \sin \pi y \\ \sin 1.7y \end{bmatrix}, \quad -1 < y < 1. \qquad (5.8)$$

This particular function is not typical in any way; it is just nonlinear and also non-uniquely invertible in the first component¹ $x^{(1)}$. A number of batches of data sets will be tested, with a different noise variance magnitude in each batch (normally distributed noise). In every batch there are 30 data sets, where the noise covariance matrix is taken randomly as $\Sigma = \sigma^2 \cdot \mathbf{n}\mathbf{n}^T$, where $\mathbf{n}$ is a 2-by-2 matrix whose entries are samples of the normal distribution N(0, 1). Thus, the trajectory is fixed, but the noise covariance is random for every data set. In every data set there are 21 measurements used for estimation (training) of θ/η/Σ and 61 for validation in terms of the root mean square error, RMSE = $\sqrt{\mathrm{mean}[(y - \hat y)^2]}$. Figure 5.2 (left) is a sample of a data set generated as described above.

Methods

Three predictors will be tested:

• Bayes, the conditional expectation defined in (5.6). φ(y, η) is modeled by 1st and 3rd order polynomials and the optimal polynomial coefficients in the least squares sense are estimated from data. The integrals in (5.6) are numerically approximated by the sum over 200 points in the interval −1 ≤ y ≤ 1, see also (2.48) on page 30.

• ML, the maximum likelihood predictor defined in (5.7). The same models φ(y, η) as in the Bayes approach described above are used. The optimization in (5.7) is solved by taking the best of 200 tested values of y in the interval −1 ≤ y ≤ 1.

• Inverse, where the predictor is modeled directly by a polynomial f(x) = g(x, θ) and the polynomial coefficients are estimated from data. 1st order and 3rd order polynomials are compared.

The methods are compared at different noise levels σ².

Results

Table 5.1 summarizes the results of the simulation. It appears that inverse regression with polynomial fit should be avoided. In every instance, the classical regression with Bayes prediction is best. The conclusion is that the inverse regression model with polynomial fit is not versatile enough to adapt to a trajectory like (5.8), while the classical regression model with polynomial fit greatly improves on the linear regression. Bayesian estimation is to be preferred.

¹ It would be even worse if the curve were self-intersecting, a case which can very well occur in particular subspaces, see [45].
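A minimal MATLAB sketch of the data generation described under Data above (one data set in one batch); the even spacing of the y values is a simplification made only for this sketch.

% Generate one simulated data set: trajectory (5.8) plus correlated
% Gaussian noise with random covariance sigma2*(n*n'), 21 estimation points.
sigma2 = 5;                              % noise level of the batch
nmat = randn(2, 2);                      % entries ~ N(0,1)
Sigma = sigma2 * (nmat * nmat');         % random noise covariance
L = chol(Sigma, 'lower');                % for drawing correlated noise
phi0 = @(y) [sin(pi*y); sin(1.7*y)];     % the trajectory (5.8)
y_est = linspace(-1, 1, 21);             % evenly spaced here for simplicity
X_est = zeros(2, numel(y_est));
for i = 1:numel(y_est)
  X_est(:, i) = phi0(y_est(i)) + L*randn(2, 1);
end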
Figure 5.2 The nonlinear trajectory (5.8) plotted; to the left with 1 and to the right with 21 measurements for each distinct y. The right-hand plot is a projection of R10. The simulated measurements are generated by the model (5.2). The y's are uniformly distributed in the interval ymin to ymax.
5.2 Using Clusters to Find a Subspace
Table 5.1 Validation RMSE = √(mean[(y − ŷ)²]) for different methods and noise levels σ², 1000×mean±std for 30 data sets. Linear and 3rd order polynomial fit; ML, Bayes, and inverse regression.

σ²                        1          5          10         50
Inverse, 1st ord. pol.    44 ± 22    87 ± 45    124 ± 53   197 ± 84
ML, 1st ord. pol.         44 ± 22    88 ± 46    125 ± 55   204 ± 91
Bayes, 1st ord. pol.      44 ± 22    86 ± 44    120 ± 50   188 ± 78
Inverse, 3rd ord. pol.    46 ± 27    94 ± 50    126 ± 56   250 ± 141
ML, 3rd ord. pol.         30 ± 11    59 ± 32    85 ± 45    179 ± 94
Bayes, 3rd ord. pol.      29 ± 10    54 ± 27    76 ± 35    156 ± 76

The observations or measurements of a data set can be partitioned into a number of categories or clusters. Clustered measurement data is not unnatural. Assume we want to calibrate an instrument on concentrations. Then it is natural to subject the instrument to a set of known concentrations – say we measure 20 times on each
of 10 concentrations within an interesting range. This results in a clustered data set. A number of ways to use the clustering of data to improve the regression have been proposed. Sliced Inverse Regression (SIR) [27] divides the range of y into a number of intervals. Each interval identifies a cluster. A PCA on the cluster means is then used as a subspace for regression. Fitting local linear models to data partitions is proposed in [2], and clustering in conjunction with Support Vector Machines (SVM) has been exploited in [11]. In this chapter the regression subspace will be optimized for accurate classification, as will be described below. This approach is the main contribution of the chapter and has not been seen elsewhere. We denote the method Clustered Regression Analysis (CRA).
5.2.1 Clustered Regression Analysis
In Chapter 4, the ODP algorithm that optimizes subspaces for classification was described. It has been found that those subspaces can also be very good for quantification, that is, prediction of a scalar continuous variable. That there is a close connection between classification accuracy and quantification accuracy is well illustrated by Example 2.1 on page 20. In the proposed algorithm (CRA), classical regression is conducted in a subspace optimized with respect to the classification accuracy. Of course, this is possible only if the data is clustered, that is, divided into q classes each with a distinct class distribution. As mentioned above, the clustering can be the result of measurements on a finite number of calibration objects or concentrations (intervals), or it could be different operating regimes of a dynamical system, and so on. The major advantage of using the classification accuracy as a subspace quality measure is that no parametric assumption about the trajectory φ(y, η) is made – it could be any nonlinearity. Other subspaces for nonlinear trajectories (with no parametric assumptions) have been described by [20] and extended to multivariate data sets by [8]. They generalize Principal Components to potentially nonlinear Principal Curves and Principal Surfaces. However, it remains to compare the CRA to those techniques.

CRA Algorithm

1. Identify q clusters in data, and assume they are distinct classes in a classification problem.

2. Find an optimal or suboptimal subspace with respect to the classification accuracy. The ODP described in Chapter 4 is preferably used, which includes regularization in terms of PC reduction and ridge regularization.

3. Choose a (nonlinear) function family that models φ and estimate the function that best describes φ in the subspace obtained above, for instance in the least squares sense. The functions could for instance be polynomial. Also estimate
the noise distribution, and derive the conditional probability density function p(y|x).

4. Predict y as the conditional expectation given x, (5.6). The prediction can easily be done by numerical evaluation of the integrals; a sketch of steps 3 and 4 is given below.

In the next section, the CRA will be compared to PLS on artificial data sets. After that, the CRA is evaluated on experimental data from an ET.
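A minimal MATLAB sketch of steps 3 and 4, assuming the subspace S has already been found (e.g. by the ODP) and that the projected estimation data X_est with concentrations y_est and a new measurement x_new are available; these names, the polynomial order and the grid are illustrative choices.

% CRA steps 3-4 (sketch): fit a polynomial trajectory in the subspace by
% least squares, estimate the residual covariance, and predict a new
% sample by the conditional expectation (5.6).
Z = S * X_est;                               % k-by-N projected estimation data
c = 5;                                       % polynomial order (a choice)
B = bsxfun(@power, y_est(:), 0:c);           % N-by-(c+1) polynomial basis
eta0 = (B \ Z')';                            % k-by-(c+1) coefficients
E = Z - eta0 * B';                           % residuals in the subspace
SigmaS = (E * E') / (numel(y_est) - 1);      % subspace noise covariance

yg = linspace(min(y_est), max(y_est), 200)'; % prediction grid (prior range)
phi0 = @(y) eta0 * (y.^(0:c))';              % fitted trajectory phi0(y)
z = S * x_new;                               % one new projected measurement
py = zeros(size(yg));
for i = 1:numel(yg)
  r = z - phi0(yg(i));
  py(i) = exp(-0.5 * (r' * (SigmaS \ r)));   % unnormalized p(y|x)
end
y_hat = sum(yg .* py) / sum(py);             % Bayes prediction (5.6)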
5.3 Artificial Data
Artificial data will be generated in about the same way as described in Section 5.1.2. However, this time a number of discrete y's are used in the estimation set, and for each one, a number of noise-obscured x's are generated. The range of y is divided into 11 clusters and in each cluster there are 20 data points. The trajectory is still restricted to 2 dimensions, but mixed into a 10-dimensional space with noise according to (5.5). See Figure 5.2 (right) for a sample of a 2-dimensional projection.

Methods

The prediction RMSE is compared for the following four methods:

• Inverse linear regression (Linear).

• 5th order polynomial PLS with 2 latent variables, denoted "PLS-2/5". Polynomial PLS is a nonlinear extension to PLS, see [47]. The PLS toolbox for MATLAB by Barry Wise is used. Autoscaling is used when beneficial.

• 2nd order polynomial PLS with 5 latent variables, denoted "PLS-5/2". Otherwise as previous.

• CRA as described above, but with classical ML prediction, denoted "CRA". 5th order polynomials are fitted to the trajectory in a 2-dimensional subspace given by the ODP algorithm.

The prediction RMSE is computed on unclustered validation data, that is, the validation y is uniformly distributed over the range (2.44). The methods are tested at different noise levels σ².

Results

The prediction errors (RMSE) are given in Table 5.2. For reasonable noise levels, CRA gives better or much better regression accuracy. In Figure 5.3 the subspaces obtained from ODP and PLS are compared for an instance with a particularly high noise level in the non-trajectory subspace (orthogonal to S). The plots show clustered validation data. Apparently, the noise level in the ODP subspace is significantly lower than in the PLS subspace.
Figure 5.3 The ODP subspace (a) compared to the PLS (b), clustered validation data. The regression accuracy for 5th order polynomial classical regression in the ODP subspace is much higher (RMSE = 9.6 · 10−3 ) compared to the PLS (RMSE = 39·10−3 ).
5.4 Experimental Data
In this section, the CRA and polynomial PLS methods are tested on real sensor data from an electronic tongue (ET), see Section 1.2 on page 6. The same data as in Examples 1.4 and 1.5 is used. The ET is subjected to different concentrations of a chemical. Each of the concentrations 0, 3.1, 3.7, 4.5, 5.2, 5.6, 6.9, 7.5, 8.0, 8.8, 9.0, 10.6, 10.8 mM is measured 20 times, $x_i \in \mathbb{R}^{403}$. The following methods are compared:

• The best polynomial PLS with respect to autoscaling, number of latent variables and polynomial order, denoted "PLS-[number of latent variables]/[polynomial order]".

• CRA with 2-dimensional ODP, polynomial fit and Bayes prediction (5.6), denoted "CRA-2/[polynomial order]".

• Inverse regression in the 2-dimensional ODP subspace with a polynomial predictor, denoted "Inverse-2/[polynomial order]".

All parameters are estimated from estimation data, and the prediction RMSE is computed on validation data only.
5.4.1 Random Validation
The data set observations are randomly partitioned into estimation and validation data at the ratio 75/25. Figure 5.4 depicts the ODP subspace with 5th order polynomial fit. In Table 5.3 the results for the different methods are presented. The CRA gives about 45% less prediction error compared to polynomial PLS, which is a significant improvement. It is also seen that the classical regression gives a small gain compared to the inverse regression.
Table 5.2 Artificial data, validation RMSE for different σ², 10-dimensional case, 1000×mean±std for 10 data sets, for CRA (see text), linear regression, and polynomial PLS with latent variables/polynomial order.

σ²        2.5        10         40
CRA       30 ± 17    73 ± 31    175 ± 96
Linear    52 ± 23    121 ± 33   176 ± 78
PLS-2/5   124 ± 23   223 ± 50   279 ± 66
PLS-5/2   69 ± 25    154 ± 42   216 ± 71
Figure 5.4 2-dimensional ODP subspace with validation measurements and 5th order polynomial fit.
Table 5.3 Experimental ET data, prediction RMSE for random validation. Polynomial PLS, inverse subspace and classical subspace regression (CRA) are compared for different subspace dimensionalities/polynomial orders.

              RMSE
PLS-20/4      0.18
Inverse-2/5   0.11
CRA-2/5       0.096

5.4.2 Leave-Class-Out Validation
Now all measurements on 3.7 mM, and only those, are used for validation. In Table 5.4 the results for the different methods are presented. The CRA gives about 80% less prediction error, which indeed is a significant improvement. In this case, the classical regression gives a significant gain compared to the inverse regression.
5.4.3 Discussion
Using the ODP subspace, the regression can be conducted in 2 dimensions and still give better results than PLS, which needs 10-20 dimensions to give reasonable accuracy. Furthermore, the classical regression gives slightly better results than the inverse.
Table 5.4 Experimental ET data, prediction RMSE for Leave-Class-Out validation. Polynomial PLS, inverse subspace and classical subspace regression (CRA) are compared for different subspace dimensionalities/polynomial orders.

              RMSE
PLS-10/1      0.25
Inverse-2/5   0.11
CRA-2/8       0.043

5.5 Summary
It is well known that the utilization of structure in data can improve regression. In this work, natural data clustering has been used by the pattern recognition algorithm Optimal Discriminative Projection (ODP) to generate linear subspaces where regression is facilitated. For some nonlinear problems this approach outperforms polynomial PLS. However, in practice the subspace generated by the ODP often contains a very nonlinear measurement trajectory. It has been found that this nonlinear trajectory is better modeled by a polynomial φ in the classical regression model xi = φ(yi, η) + εi than by a polynomial f in the inverse regression model yi = f(xi, θ) + ei. The drawback of the classical regression approach is that it requires more calculations. On real sensor data from the electronic tongue with a leave-class-out validation procedure, the ODP together with a classical polynomial regression model gave the regression error 0.043, which should be compared to the best by polynomial PLS: 0.25 – an improvement by a factor > 5.
Bibliography
[1] S. Andersson. On Dimension Reduction in Sensor Array Signal Processing. PhD thesis, Linköping University, Linköping, Sweden, 1992. Linköping Studies in Science and Technology, Dissertations No. 290.
[2] B. Ari and H. A. Güvenir. Clustered linear regression. Knowledge-Based Systems, 15:169–175, 2002.
[3] P. Billingsley. Probability and Measure. John Wiley & Sons, 3rd edition, 1995.
[4] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.
[5] S. Boyd, L. E. Ghaoui, E. Feron, and V. Balakrishnan. Linear Matrix Inequalities in System and Control Theory. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, 1994. http://www.stanford.edu/~boyd/.
[6] H. Brunzell. Signal Processing Techniques for Detection of Buried Landmines using Ground Penetrating Radar. PhD thesis, Chalmers University of Technology, 1998.
[7] A. J. Burnham, J. F. MacGregor, and R. Viveros. A statistical framework for multivariate latent variable regression methods based on maximum likelihood. Journal of Chemometrics, 13:49–65, 1999.
[8] P. Delicado. Another look at principal curves and surfaces. Journal of Multivariate Analysis, 77:84–116, 2001.
[9] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal of Matrix Anal. Appl., 20(2):303–353, 1998.
[10] C. Eisenhart. The interpretation of certain regression methods and their use in biological and industrial research. Ann. Math. Statist., 10:162–186, 1939.
[11] G. Camps-Valls et al. Cyclosporine concentration prediction using clustering and support vector regression methods. Electronics Letters, 38(12):568–570, June 2002.
[12] W. H. Press et al. Numerical Recipes in C. Press Syndicate of the University of Cambridge, 2002.
[13] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.
[14] J. H. Friedman and J. W. Tukey. Projection pursuit regression. Journal of the American Statistical Association, 76:817–823, 1981.
[15] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 2nd edition, 1990.
[16] A. R. Gallant. Nonlinear Statistical Models. John Wiley & Sons, 1987.
[17] G. H. Golub. Matrix Computations, 3rd ed. Baltimore: Johns Hopkins Univ. Press, 1996.
[18] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, 2nd edition, 1989.
[19] R. Gutierrez-Osuna. Pattern analysis for machine olfaction: A review. IEEE Sensors Journal, 2(3):189–201, 2002.
[20] T. Hastie and W. Stuetzle. Principal curves. J. Amer. Statist. Assoc., 84:502–516, 1989.
[21] I. S. Helland. On the structure of partial least squares regression. Commun. Statist. Simulation, 17:581–607, 1988.
[22] B. Hoadley. A Bayesian look at inverse linear regression. Journal of the American Statistical Association, 65:356–369, 1970.
[23] W. G. Hunter and W. F. Lamboy. A Bayesian analysis of the linear calibration problem. Technometrics, 23:323–350, 1981.
[24] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, Upper Saddle River, New Jersey, 4th edition, 1998.
[25] R. C. Krutchkoff. Classical and inverse regression methods of calibration. Technometrics, 9:425–439, 1967.
[26] T. Kubokawa and C. P. Robert. New perspectives on linear calibration. Journal of Multivariate Analysis, 51:178–200, 1994.
[27] Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
[28] D. Lindgren. The implementation and evaluation of an algorithm for feature extraction. Master's thesis LiTH-ISY-EX-2083, Department of Electrical Engineering, Linköping University, Linköping, Sweden, December 2000.
[29] D. Lindgren and L. Ljung. Clustered regression analysis. In Proceedings of the IEEE Conference on Decision and Control, to appear, 2002.
[30] D. Lindgren and P. Spangeus. A novel feature extraction algorithm for asymmetric classification. Submitted to IEEE Sensors Journal, 2002.
[31] R. Lotlikar and R. Kothari. Adaptive linear dimensionality reduction for classification. Pattern Recognition, 33:185–194, 2000.
[32] H. Martens and T. Næs. Multivariate Calibration. John Wiley & Sons, Chichester, 1989.
[33] J. A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308–313, 1965.
[34] M. Pardo and G. Sberveglieri. Learning from data: A tutorial with emphasis on modern pattern recognition methods. IEEE Sensors Journal, 2(3):203–217, 2002.
[35] H. Park, M. Jeon, and P. Howland. Cluster structure preserving reduction based on the generalized singular value decomposition. SIAM Journal on Matrix Analysis and Applications, to appear, 2002.
[36] J. L. Plessis and A. J. van der Merwe. A Bayesian approach to multivariate and conditional calibration. Computational Statistics & Data Analysis, 19:539–552, 1995.
[37] D. Di Ruscio. A weighted view on the partial least-squares algorithm. Automatica, 36:831–850, 2000.
[38] Y. Sam and J. Wang. Statistical modeling via dimension reduction methods. Nonlinear Analysis, Theory, Methods & Applications, 30(6):3561–3568, 1997.
[39] B. A. Snopok and I. V. Kruglenko. Multisensor systems for chemical analysis: state-of-the-art in electronic nose technology and new trends in machine olfaction. Thin Solid Films, (418):21–41, 2002.
[40] P. Spangeus. New Algorithms for General Sensors or How to improve electronic noses. Dissertation 714, Linköpings universitet, Linköping, Sweden, 2001.
[41] P. Spangeus and D. Lindgren. Efficient parameterization for the dimensional reduction problem. Submitted to Journal of Chemometrics, 2002.
[42] P. Stoica and M. Viberg. Maximum likelihood parameter and rank estimation in reduced-rank multivariate linear regressions. IEEE Transactions on Signal Processing, 44(12):3069–3078, 1996.
[43] H. Takeuchi. A generalized inverse regression estimator in multi-univariate linear calibration. Commun. Statist.–Theory Meth., 26(11):2645–2669, 1997.
[44] L. Vandenberghe, S. Boyd, and S. Wu. Determinant maximization with linear matrix inequality constraints. ISL, Stanford University, 1996. http://www.stanford.edu/~boyd/.
[45] N. Vlassis, Y. Motomura, and B. Kröse. Supervised dimension reduction of intrinsically low-dimensional data. Neural Computation, 14:191–215, 2001.
[46] F. Winquist, P. Wide, and I. Lundström. An electronic tongue based on voltammetry. Analytica Chimica Acta, 357:21–23, 1997.
[47] S. Wold, N. Kettaneh-Wold, and B. Skagerberg. Nonlinear PLS-modeling. Chemometrics and Intelligent Laboratory Systems, 7:53–65, 1989.
[48] S. Wold, H. Martens, and H. Wold. The Multivariate Calibration Problem in Chemistry Solved by the PLS Method. Springer Verlag, Heidelberg, 1983.
[49] S. Wold, A. Ruhe, H. Wold, and W. J. Dunn. The collinearity problem in linear regression. The partial least squares approach to generalized inverses. SIAM J. Sci. Stat. Comput., 5:735–743, 1984.