Supervised relevance neural gas and unified maximum separability analysis for classification of mass spectrometric data

F.-M. Schleif¹*, U. Clauss¹, T. Villmann² and B. Hammer³

¹ Bruker Daltonik GmbH, Leipzig, Germany
² University Leipzig, Clinic for Psychotherapy, Leipzig, Germany
³ University Osnabrück, Dep. of Mathematics and Computer Science

June 4, 2004

Abstract
The paper deals with the application of a novel classification algorithm based on generalized relevance learning vector quantization, in comparison to the unified maximum separability analysis, a special variant of support vector machine algorithms. The algorithms are compared and their performance on real-life data taken from clinical studies is demonstrated. It is shown that the vector quantization classifier gives competitive results in comparison to the considered support vector machine algorithm, reflecting the recently proven theoretical equivalence of the classification capabilities of both paradigms.
Keywords: learning vector quantization, SVM, relevance learning, unified maximum separability analysis, neural gas, mass spectrometry, biomarker discovery
1 Introduction

In recent years, mass spectrometry (MS) based proteomic¹ profiling has become an important tool for studying cancer at the protein and peptide level in a high-throughput manner. Additionally, MS-based serum profiling is under development as a potential diagnostic tool to distinguish patients with cancer from normal subjects. The algorithms used for classification of the mass spectrometric data are a crucial point in the processing chain for obtaining valid and competitive results. Usually one is interested in finding decision boundaries close to the optimal Bayesian decision. Especially for high-dimensional data this task becomes very complicated, and therefore the choice of an appropriate algorithm may affect the result of the whole clinical study. In order to address this problem, we focus on the comparison of two such classifier systems, the Generalized Relevance Learning Vector Quantization (GRLVQ) as introduced in [6] and the Unified Maximum Separability Analysis (UMSA) [14], for our specific task of classification of mass spectrometry data in cancer diseases. Both algorithms are well suited to deal with high-dimensional data, focusing on optimal class separability.

* Corresponding author, [email protected]
¹ The proteome is the ensemble of protein forms expressed in a biological sample at a given point in time [3].
The first is a prototype-based algorithm derived from the usual Learning Vector Quantization (LVQ), whereas the latter is an SVM-like approach. Both algorithms are capable of calculating relevances which can be used to identify the data dimensions needed to achieve a high classification accuracy. The identified relevance weights may be taken as indicators for biomarkers in the underlying biological data. The remainder of this paper is structured as follows. In Section 2 we formulate the theory behind generalized learning vector quantization and supervised relevance neural gas, followed by an introduction to the unified maximum separability analysis, which is a shortened review of the derivation in [14]. Then, in Section 4, we describe the recognition task and the underlying clinical material. We detail our experimental results for two clinical cancer data sets, showing the discrimination, recognition and prediction power of the two algorithms applied to the data and highlighting their essential differences and advantages.

2 Learning Vector Quantization and Relevance Learning
Learning vector quantization is mainly influenced by the standard algorithms LVQ1 to LVQ3 introduced by Kohonen [9] as intuitive prototype-based clustering algorithms. Several derivatives were developed to improve the standard algorithms and to ensure, for instance, faster convergence, a better adaptation of the receptive fields to the optimal Bayesian decision, or an adaptation to complex data structures, to name just a few [7, 8, 12]. Here we focus on relevance learning in LVQ as introduced in [6], referred to as GRLVQ. It is based on Generalized LVQ (GLVQ), which extends the usual LVQ by a cost function [10]. It was shown that relevance learning improves the quality of classification in comparison to standard GLVQ [6]. A numerically robust variant of GRLVQ is its combination with the neighborhood learning known from the neural gas, called Supervised Relevance Neural Gas (SRNG) [5]. As usual in neighborhood learning, the neighborhood range is decreased during the learning process and vanishes in the convergence phase; hence SRNG converges to GRLVQ [5].

We shortly summarize the notations and learning rules of GRLVQ (a detailed description of GRLVQ and SRNG can be found in [6, 5]): Let c_v ∈ L be the label of input v, where L is a set of labels (classes) with #L = N_L, and let V ⊆ R^{D_V} be a finite set of inputs v. LVQ uses a fixed number of prototypes (weight vectors, codebook vectors) for each class. Let W = {w_r} be the set of all codebook vectors and c_r the class label of w_r. Furthermore, let W_c = {w_r | c_r = c} be the subset of prototypes assigned to class c ∈ L. The task of vector quantization is implemented by the map Ψ as a winner-take-all rule, i.e. a stimulus vector v ∈ V is mapped onto that neuron s ∈ A the pointer w_s of which is closest to the presented vector v,

$$ \Psi^{\lambda}_{V \to A} : v \mapsto s(v) = \operatorname{argmin}_{r \in A} \; d^{\lambda}(v, w_r) \qquad (2.1) $$

with d^λ(v, w) being an arbitrary, often differentiable, distance measure depending on the parameter vector λ. The neuron s is called winner or best matching unit. The subset of the input space

$$ \Omega^{\lambda}_{r} = \{ v \in V : r = \Psi^{\lambda}_{V \to A}(v) \} \qquad (2.2) $$

which is mapped to a particular neuron r according to (2.1) forms the (masked) receptive field Ω^λ_r of that neuron.
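To illustrate the winner-take-all rule (2.1), here is a minimal Python sketch using the squared, scaled Euclidean distance that appears later in (2.8)-(2.10). This is our own illustration, not code from the paper; all function names are ours.

```python
import numpy as np

def weighted_sq_dist(v, w, lam):
    """Squared, scaled Euclidean distance d^lambda(v, w) = sum_i lam_i * (v_i - w_i)^2."""
    return float(np.sum(lam * (v - w) ** 2))

def best_matching_unit(v, W, lam):
    """Winner-take-all rule (2.1): index of the prototype in W closest to v."""
    dists = np.array([weighted_sq_dist(v, w, lam) for w in W])
    return int(np.argmin(dists)), dists
```

With lam set to the uniform vector (1/D_V, ..., 1/D_V), this reduces to the ordinary squared Euclidean winner selection.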
The training algorithm adapts the prototypes such that for each class c ∈ L the corresponding codebook vectors W_c represent the class as accurately as possible. This means that the set of points of a given class, V_c = {v ∈ V | c_v = c}, and the union U_c = ∪_{w_r ∈ W_c} Ω_r of the receptive fields of the corresponding prototypes should differ as little as possible. Let f(x) = (1 + exp(−x))^{−1} be the logistic function. GRLVQ minimizes the cost function

$$ \mathrm{Cost}_{GRLVQ} = \sum_{v} f(\mu_{\lambda}(v)) \quad \text{with} \quad \mu_{\lambda}(v) = \frac{d^{\lambda}_{r^+} - d^{\lambda}_{r^-}}{d^{\lambda}_{r^+} + d^{\lambda}_{r^-}} \qquad (2.3) $$

via stochastic gradient descent, where d^λ_{r+} is the squared distance of the input vector v to the nearest codebook vector labeled with c_{r+} = c_v, say w_{r+}, and d^λ_{r−} is the squared distance to the best matching prototype labeled with c_{r−} ≠ c_{r+}, say w_{r−}. The learning rule of GRLVQ is obtained by taking the derivatives of the above cost function. Using

$$ \frac{\partial \mu_{\lambda}(v)}{\partial w_{r^+}} = \xi^{+} \cdot \frac{\partial d^{\lambda}_{r^+}}{\partial w_{r^+}} \quad \text{and} \quad \frac{\partial \mu_{\lambda}(v)}{\partial w_{r^-}} = -\xi^{-} \cdot \frac{\partial d^{\lambda}_{r^-}}{\partial w_{r^-}} \qquad (2.4) $$

with

$$ \xi^{+} = \frac{2 \cdot d^{\lambda}_{r^-}}{\left(d^{\lambda}_{r^+} + d^{\lambda}_{r^-}\right)^{2}} \quad \text{and} \quad \xi^{-} = \frac{2 \cdot d^{\lambda}_{r^+}}{\left(d^{\lambda}_{r^+} + d^{\lambda}_{r^-}\right)^{2}} $$

one obtains for the weight updates [6]:

$$ \triangle w_{r^+} = \epsilon^{+} \cdot f'|_{\mu_{\lambda}(v)} \cdot \xi^{+} \cdot \frac{\partial d^{\lambda}_{r^+}}{\partial w_{r^+}}, \qquad \triangle w_{r^-} = -\epsilon^{-} \cdot f'|_{\mu_{\lambda}(v)} \cdot \xi^{-} \cdot \frac{\partial d^{\lambda}_{r^-}}{\partial w_{r^-}} \qquad (2.5) $$

For the adaptation of the parameters λ_k we get

$$ \triangle \lambda_k = -\epsilon_0 \cdot f'|_{\mu_{\lambda}(v)} \cdot \frac{\partial \mu_{\lambda}(v)}{\partial \lambda_k} \qquad (2.6) $$

with

$$ \frac{\partial \mu_{\lambda}(v)}{\partial \lambda_k} = \xi^{+} \cdot \frac{\partial d^{\lambda}_{r^+}}{\partial \lambda_k} - \xi^{-} \cdot \frac{\partial d^{\lambda}_{r^-}}{\partial \lambda_k} \qquad (2.7) $$

where λ_i ≥ 0 and ε^+, ε^−, ε_0 > 0 are the learning rates. In case of d^λ being the squared, scaled Euclidean distance d^λ(v, w) = Σ_{i=1}^{D_V} λ_i (v_i − w_i)² with Σ_i λ_i = 1, we have in (2.5)

$$ \triangle w_{r^+} = \epsilon^{+} \cdot 2 \cdot f'|_{\mu_{\lambda}(v)} \cdot \xi^{+} \cdot \Lambda \cdot (v - w_{r^+}) \qquad (2.8) $$

$$ \triangle w_{r^-} = -\epsilon^{-} \cdot 2 \cdot f'|_{\mu_{\lambda}(v)} \cdot \xi^{-} \cdot \Lambda \cdot (v - w_{r^-}) \qquad (2.9) $$

respectively, with Λ being the diagonal matrix with entries λ_1, ..., λ_{D_V}, and (2.6) reads as

$$ \triangle \lambda_k = -\epsilon_0 \cdot f'|_{\mu_{\lambda}(v)} \cdot \left( \xi^{+} \cdot (v - w_{r^+})_k^{2} - \xi^{-} \cdot (v - w_{r^-})_k^{2} \right) \qquad (2.10) $$

with k = 1, ..., D_V, followed by a normalization of the λ_k. The simple GLVQ ignores the update of the λ_i, keeping all weighting factors constant. It was shown that GLVQ maximizes the margin [11]; recently this result was extended to GRLVQ [4]. Hence GRLVQ can be interpreted as a learning algorithm aiming at structural risk minimization, comparable to support vector machines (SVM).
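To make the update rules concrete, the following sketch performs one stochastic gradient step of GRLVQ for a single labeled sample under the squared, scaled Euclidean distance. It is our own minimal rendering, not the authors' implementation; it reuses weighted_sq_dist from the sketch above, and the default learning rates mirror the values later used in Section 5.

```python
import numpy as np  # weighted_sq_dist from the previous sketch is assumed in scope

def grlvq_step(v, c_v, W, labels, lam, eps_plus=0.025, eps_minus=0.025, eps_lam=0.02):
    """One GRLVQ update for the labeled sample (v, c_v).

    W: (n_prototypes, D) codebook vectors; labels: their class labels;
    lam: relevance vector (non-negative, summing to one).
    """
    d = np.array([weighted_sq_dist(v, w, lam) for w in W])
    same = labels == c_v
    r_p = np.where(same)[0][np.argmin(d[same])]      # closest correct prototype w_{r+}
    r_m = np.where(~same)[0][np.argmin(d[~same])]    # closest wrong prototype w_{r-}
    dp, dm = d[r_p], d[r_m]
    mu = (dp - dm) / (dp + dm)                       # classifier function mu_lambda(v), eq (2.3)
    fp = np.exp(-mu) / (1.0 + np.exp(-mu)) ** 2      # f'(mu), derivative of the logistic f
    xi_p = 2.0 * dm / (dp + dm) ** 2                 # xi^+
    xi_m = 2.0 * dp / (dp + dm) ** 2                 # xi^-
    # relevance gradient, eq (2.10), evaluated before the prototypes move
    dlam = fp * (xi_p * (v - W[r_p]) ** 2 - xi_m * (v - W[r_m]) ** 2)
    W[r_p] += eps_plus * 2.0 * fp * xi_p * lam * (v - W[r_p])    # eq (2.8)
    W[r_m] -= eps_minus * 2.0 * fp * xi_m * lam * (v - W[r_m])   # eq (2.9)
    lam = np.clip(lam - eps_lam * dlam, 0.0, None)   # lambda_i >= 0
    return W, lam / lam.sum()                        # normalization of the relevances
```

Setting eps_lam = 0 keeps all weighting factors constant and recovers the behavior of plain GLVQ.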
[Figure 1 here: a control spectrum from the LEUK dataset, intensity plotted against m/z over roughly 2,000-10,000 Da.]

Figure 1: Example of a spectrum to be classified. The decision is made depending on the locations of the peaks and on the peak areas.
3 Unified Maximum Separability Analysis

UMSA is a special variant of an SVM algorithm which belongs to the same learning class as the SVM according to statistical learning theory, i.e. UMSA performs structural risk minimization [14]. It derives significance scores for the different features, which are peaks (or peak areas) of the high-dimensional spectra. In the analysis of mass spectrometric data for proteomic profiling, it happens quite often that the experimental conditions under which the individual spectra are obtained have been designed with a particular endpoint of interest in mind. The purpose of the analysis is to identify those peaks within the spectra which are discriminant for the classification task and which can be related back to the known designations of the experimental conditions. In the best case disease markers can be identified, which establish a relation between peaks within the spectra and proteins within the samples. Thereby, it is not sufficient to detect only the peak which shows the largest variance [14]. In other words, we are looking for directions along which the different classes of spectra are best separated. Using UMSA, we construct a multivariate linear classifier that optimally separates two classes of data ω_1 and ω_2 according to a predefined separability measure. Thereby, UMSA is realized as a supervised algorithm that iteratively assigns significance scores to individual peaks, ranking them according to their collective contributions to the separation of the different classes.

The construction of a linear classifier separating two classes of data in an n-dimensional space may equivalently be viewed as finding a linear projection R^n → R, represented by a normalized n × 1 unit projection vector d, such that a predefined separability measure between the two classes of data is maximized. Considering the data V as m column vectors x_1, x_2, ..., x_m with corresponding class memberships c_1, c_2, ..., c_m ∈ {−1, 1}, the expression for finding an optimal margin hyperplane classifier for linearly separable data,

$$ s_{\max} = \max_{\hat{d} \in D} \left\{ \min_{\forall x_i \in \omega_1, x_j \in \omega_2} \left| \hat{d}^{T} (x_i - x_j) \right| \right\}, $$

is formulated in UMSA as the solution of the following optimization problem, as known from the SVM: maximize s = (b_2 − b_1) under

$$ d \cdot x_i \le b_1 \;\; \text{if } c_i = -1, \qquad d \cdot x_i \ge b_2 \;\; \text{if } c_i = 1 \qquad (3.1) $$

with ||d|| = 1, b_1 < b_2, and b_1, b_2 two scalars in the one-dimensional projection space. Define b = −(b_1 + b_2)/(b_2 − b_1) and v = 2d/(b_2 − b_1). We minimize ||v|| under the restrictions

$$ v \cdot x_i + b \le -1 \;\; \text{if } c_i = -1, \qquad v \cdot x_i + b \ge 1 \;\; \text{if } c_i = 1 \qquad (3.2) $$

which can equivalently be expressed as the minimization of (1/2) v · v under

$$ c_i (v \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, m. \qquad (3.3) $$

Solving (3.3), we obtain s_max = 2/||v|| and the n × 1 unit projection vector d = v/||v||. Furthermore, the vector d can be considered as relevance weights comparable to the relevance factors λ in GRLVQ. To accommodate situations where the two classes are not completely separable and to deal with noisy or mislabelled data, the constraints are relaxed by introducing non-negative slack variables ξ_1, ..., ξ_m that represent errors in the constraints. These errors are penalized in the objective function:

$$ \text{Minimize:} \quad \frac{1}{2} v \cdot v + \sum_{i=1}^{m} p_i \xi_i \qquad (3.4) $$

$$ \text{Subject to:} \quad c_i (v \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, m \qquad (3.5) $$

where the coefficients p_i are positive constants representing the relative importance of the individual data points. The respective dual problem is obtained as usual via the Lagrangian multipliers α_i:

$$ \text{Maximize:} \quad Q(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j c_i c_j \, x_i \cdot x_j + \sum_{i=1}^{m} \alpha_i \qquad (3.6) $$

with the constraints

$$ \sum_{i=1}^{m} \alpha_i c_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le p_i. \qquad (3.7) $$

It should be noted that if the p_i = C are set to a constant C for all data points, equation (3.4) has the same form and solution as the soft margin hyperplane linear classifier known from the SVM [13]. Hence, the p_i provide an individual scaling of the slack variables; this special choice leads to the modified boundary conditions (constraints) (3.7). Let δ_i be the shortest distance between a data point x_i and the line that goes through the two class means M_1 and M_2. We set p_i to be inversely related to δ_i through a positive decreasing function φ(·):

$$ p_i = K \varphi(\delta_i), \quad K = \text{const}, \quad \varphi(x) = e^{-x^2/\sigma^2} \qquad (3.8) $$

The quadratic problem in (3.4) has a unique solution [14]. For a fixed choice of the parameters K and σ, s = 2/||v|| is a relative linear separability measure with the following property: for a given two-class data set D(n) in an n-dimensional space A_n ⊆ R^n and its linear projection D(k) in A_k ⊆ A_n, a subspace formed by eliminating n − k dimensions from A_n, we have s_{D(k)} ≤ s_{D(n)} for k ≤ n, where s_{D(k)} and s_{D(n)} are the separability measures obtained from applying UMSA to D(k) and D(n), respectively. Let d_{D(k)} denote the unit projection vector from applying UMSA to D(k) in A_k, with components d^j_{D(k)}, j = 1, 2, ..., k, along its k dimensions. The contribution of dimension j to separating the two classes in D(k) is

$$ s^{j}_{D(k)} = s_{D(k)} \left| d^{j}_{D(k)} \right| \qquad (3.9) $$

The pseudo code of the algorithm is given in [15].
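The weighting scheme (3.8) plugged into the soft-margin problem (3.4)-(3.5) can be emulated with an off-the-shelf SVM solver: scikit-learn's SVC multiplies the penalty C by a per-sample weight, which corresponds to the individual upper bounds 0 ≤ α_i ≤ p_i in (3.7). The sketch below is our own illustration, not the UMSA reference implementation of [14, 15]; the function name and the use of sklearn are assumptions. It also derives the separability measure s = 2/||v|| and the per-dimension scores (3.9).

```python
import numpy as np
from sklearn.svm import SVC

def umsa_like_direction(X, c, K=5.0, sigma=2.5):
    """X: (m, n) data, c: labels in {-1, +1}. Returns unit projection d, separability s, scores."""
    m1 = X[c == 1].mean(axis=0)                      # class mean M1
    m2 = X[c == -1].mean(axis=0)                     # class mean M2
    axis = (m2 - m1) / np.linalg.norm(m2 - m1)       # direction of the line through M1, M2
    rel = X - m1
    proj = rel @ axis
    delta = np.linalg.norm(rel - np.outer(proj, axis), axis=1)  # distance of x_i to that line
    p = K * np.exp(-delta ** 2 / sigma ** 2)         # per-point penalties, eq (3.8)
    clf = SVC(kernel="linear", C=1.0).fit(X, c, sample_weight=p)  # emulates 0 <= alpha_i <= p_i
    v = clf.coef_.ravel()
    s = 2.0 / np.linalg.norm(v)                      # separability measure s = 2/||v||
    d = v / np.linalg.norm(v)                        # unit projection vector
    return d, s, s * np.abs(d)                       # per-dimension scores, eq (3.9)
```

The defaults K = 5 and σ = 2.5 are the parameter values used in the experiments of Section 5.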
4 Clinical datasets and preprocessing

The proteomic data sets used for this analysis were based on [2] and [1]. Both data sets were obtained by spectral analysis of the blood plasma of patients suffering from cancer and of control probands. The first data set, for prostate cancer detection (for short EVMS), consists of 329 spectra from disease samples and 316 spectra from control samples; the data dimension is 104. The disease samples contain three classes of different diseases. In the experiments each disease class is compared to the class with spectra from the control group. The second data set (in the following for short LEUK) consists of 74 cancer samples and 80 control samples with data dimension 222. A mass range between 2 and 20 kDa was used [2]. The spectra were first processed using the following standardized workflow [2] (a simplified sketch of parts of this pipeline is given after the list):
• Baseline subtraction and peak detection
• Normalization of all spectra to their own total ion count (TIC)
• Recalibration of the spectra on each other using the most prominent peaks
• Definition of mass ranges which were considered as peaks
• Calculation of the peak areas for each spectrum
• Auto-scaling of the peak areas to zero mean and unit variance
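The following Python sketch illustrates the two simplest steps of this workflow, TIC normalization and auto-scaling; baseline subtraction, recalibration, and peak detection are considerably more involved and are omitted here. The function names are our own and are not part of the cited workflow [2].

```python
import numpy as np

def tic_normalize(spectra):
    """Normalize each spectrum (row) to its own total ion count (TIC)."""
    tic = spectra.sum(axis=1, keepdims=True)
    return spectra / tic

def autoscale(peak_areas):
    """Auto-scale peak areas to zero mean and unit variance per peak (column)."""
    mean = peak_areas.mean(axis=0)
    std = peak_areas.std(axis=0)
    return (peak_areas - mean) / np.where(std > 0, std, 1.0)
```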
5 Experiments and Results

In our analysis each data set was divided into a training and a test set. Two thirds of the spectra were used for supervised training to build a model describing a proteomic fingerprint that should be capable of discriminating diseased from normal samples; the test set consists of the remaining data. Model building was performed by applying GRLVQ and UMSA to the training data sets. Within UMSA we have chosen a transformed-data approach for classifying the test data sets: UMSA is applied to the training data, the margin of separation s_max as well as the unit projection vector d are calculated, then the data are transformed according to the obtained projection vector d and classification takes place in the projection space. Alternatively, a best-peak approach using the vector of significance scores is possible, but this did not lead to significant improvements with respect to standard classification algorithms. The parameter K, which dictates the maximum amount of influence of an individual data point within the minimization process, was set to K = 5, and the second parameter σ, regulating the amount of information on the data distribution incorporated into the determination of v, was set to σ = 2.5, as suggested in [15]. The first two UMSA directions were calculated and used within the transformation. In a second run we applied GRLVQ. The learning rate was set to ε = 0.025 and the learning rate for the relevance terms to ε_r = 0.02. The number of cycles of GRLVQ was set to n_c = 1000, which leads to a short runtime as well as to very competitive results.

The recognition rates obtained on the training data and the prediction rates on the test data are compared for both algorithms in Tables 1 and 2.

                  % Rec. (UMSA)   % Rec. (GRLVQ)   % Pred. (UMSA)   % Pred. (GRLVQ)
CCD vs control    97.38           98.69            91.88            94.85
BPH vs control    93.52           94.80            89.00            85.58
CAB vs control    94.15           95.04            87.76            87.95

Table 1: Correct recognition and prediction rates in % for the UMSA and GRLVQ algorithms on the EVMS data set.

                   UMSA on LEUK   GRLVQ on LEUK
Recognition in %   100.00         100.00
Prediction in %    100.00         95.74

Table 2: Correct recognition and prediction rates in % for the UMSA and GRLVQ algorithms on the LEUK data set.

We see that nearly all recognition and prediction rates are above 90%, and the prediction rates of both algorithms are competitively good. Yet, GRLVQ converged an order of magnitude faster than UMSA. Both algorithms are capable of creating an appropriate representation of the given data and can be used to obtain relevance terms identifying the importance of individual features. This in turn may lead to an identification of the substances producing such peaks within the spectra, which may be important for further biomarker identification. However, for SVM-like algorithms such as UMSA a completely new training is necessary when new training data arrive, whereas GRLVQ can simply be retrained in an adaptive way. Additionally, it should be mentioned that the classification of data sets with more than two categories is, in the case of UMSA, still a field of research.
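For reference, here is a minimal sketch of the evaluation protocol described above for the GRLVQ side: a two-thirds training split, recognition rate measured on the training set, and prediction rate on the held-out test set. It reuses grlvq_step and best_matching_unit from Section 2; the random split, the use of one prototype per class, and the initialization at class means are our own assumptions, not details given in the paper.

```python
import numpy as np  # grlvq_step and best_matching_unit from Section 2 assumed in scope

def evaluate_grlvq(X, y, n_cycles=1000, seed=0):
    """Train GRLVQ on 2/3 of the data; return (recognition rate, prediction rate)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = 2 * len(X) // 3
    tr, te = idx[:cut], idx[cut:]
    classes = np.unique(y)
    labels = classes.copy()
    # one prototype per class, initialized at the class mean of the training data
    W = np.array([X[tr][y[tr] == c].mean(axis=0) for c in classes])
    lam = np.ones(X.shape[1]) / X.shape[1]           # uniform initial relevances
    for _ in range(n_cycles):
        for i in rng.permutation(tr):                # one stochastic sweep per cycle
            W, lam = grlvq_step(X[i], y[i], W, labels, lam)
    def accuracy(ids):
        pred = [labels[best_matching_unit(X[i], W, lam)[0]] for i in ids]
        return float(np.mean(np.array(pred) == y[ids]))
    return accuracy(tr), accuracy(te)                # recognition rate, prediction rate
```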
6 Conclusion
We have shown in this paper that the concept of relevance learning known from generalized learning vector quantization (GLVQ) can be efficiently applied to the classification of proteomic data and leads to results which are very competitive and mostly better than the results obtained from the SVM algorithm UMSA. The extraction of relevance or separation values for each input dimension after applying the algorithms can further be used for biomarker identification and data reduction. The results support the theoretically proven equivalence in classification performance of SVM and G(R)LVQ algorithms.

Acknowledgement: The authors are grateful to E. Schaeffeler, U. Zanger, M. Schwab (all Dr. Margarete Fischer-Bosch-Institut für Klinische Pharmakologie, Stuttgart, Germany), M. Stanulla, M. Schrappe (both Kinderklinik der Medizinischen Hochschule Hannover, Germany), and T. Elssner and M. Kostrzewa (both Bruker Daltonik Leipzig, Germany) for providing the LEUK dataset.
References

[1] Bruker-Daltonik. Data of cancer leukaemia, 2004.

[2] B. L. Adam et al. Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Research, 62(13):3609-3614, July 2002.

[3] P. A. Binz et al. Mass spectrometry-based proteomics: current status and potential use in clinical chemistry. Clinical Chemistry and Laboratory Medicine, 41:1540-1551, 2003.

[4] B. Hammer, M. Strickert, and Th. Villmann. On the generalization ability of GRLVQ networks. Preprint 249, Osnabrücker Zeitschriften zur Mathematik, University Osnabrück, Germany, 2003.

[5] B. Hammer, M. Strickert, and Th. Villmann. Supervised neural gas with general similarity measure. Neural Processing Letters, to appear, 2004.

[6] B. Hammer and Th. Villmann. Generalized relevance learning vector quantization. Neural Networks, 15(8-9):1059-1068, 2002.

[7] B. Hammer and Th. Villmann. Mathematical aspects of neural networks. In M. Verleysen, editor, Proc. of European Symposium on Artificial Neural Networks (ESANN'2003), pages 59-72, Brussels, Belgium, 2003. d-side.

[8] T. Kohonen, S. Kaski, H. Lappalainen, and J. Saljärvi. The adaptive-subspace self-organizing map (ASSOM). In International Workshop on Self-Organizing Maps (WSOM'97), pages 191-196, Helsinki, 1997.

[9] Teuvo Kohonen. Self-Organizing Maps, volume 30 of Springer Series in Information Sciences. Springer, Berlin, Heidelberg, 1995. (Second Extended Edition 1997).

[10] A. Sato and K. Yamada. Generalized learning vector quantization. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. Proceedings of the 1995 Conference, pages 423-429. MIT Press, Cambridge, MA, USA, 1996.

[11] Atsushi Sato and Kenji Yamada. An analysis of convergence in generalized LVQ. In L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of ICANN98, the 8th International Conference on Artificial Neural Networks, volume 1, pages 170-176. Springer, London, 1998.

[12] S. Seo and K. Obermayer. Soft learning vector quantization. Neural Computation, 15:1589-1604, 2003.

[13] V. Vapnik. Statistical Learning Theory. Wiley and Sons, New York, 1998.

[14] Zhen Zhang, Grier Page, and Hong Zhang. Applying classification separability analysis to microarray data. In S. M. Lin and K. F. Johnson, editors, Methods of Microarray Data Analysis, chapter 9, pages 125-136. Kluwer, Dordrecht, Netherlands, 2002.

[15] Zhen Zhang, Grier Page, and Hong Zhang. Fishing expedition - a supervised approach to extract patterns from a compendium of expression profiles. In Simon M. Lin and Kimberly F. Johnson, editors, Methods of Microarray Data Analysis II. Kluwer Academic Publishers, 2002. Papers from CAMDA '01.