Neurocomputing 147 (2015) 96–106


Efficient approximations of robust soft learning vector quantization for non-vectorial data

Daniela Hofmann*, Andrej Gisbrecht, Barbara Hammer
CITEC Center of Excellence, Bielefeld University, Germany

Article history: Received 26 March 2013; received in revised form 20 November 2013; accepted 30 November 2013; available online 9 June 2014.

Abstract

Due to its intuitive learning algorithms and classification behavior, learning vector quantization (LVQ) enjoys a wide popularity in diverse application domains. In recent years, the classical heuristic schemes have been accompanied by variants which can be motivated by a statistical framework such as robust soft LVQ (RSLVQ). In their original form, LVQ and RSLVQ can be applied to vectorial data only, making them unsuitable for complex data sets described in terms of pairwise relations only. In this contribution, we address kernel RSLVQ, which extends its applicability to data which are described by a general Gram matrix. While leading to state of the art results, this extension has the drawback that models are no longer sparse, and quadratic training complexity is encountered due to the dependency of the method on the full Gram matrix. In this contribution, we investigate the performance of a speed-up of training by means of low rank approximations of the Gram matrix, and we investigate how sparse models can be enforced in this context. It turns out that an efficient Nyström approximation can be used if data are intrinsically low dimensional, a property which can be efficiently checked by sampling the variance of the approximation prior to training. Further, all models enable sparse approximations of comparable quality to the full models using simple geometric approximation schemes only. We demonstrate the behavior of these approximations in a couple of benchmarks. © 2014 Elsevier B.V. All rights reserved.

Keywords: classification; RSLVQ; kernel; Nyström; sparse

1. Introduction

Learning vector quantization (LVQ) as proposed by Kohonen [17] more than 20 years ago still constitutes a popular and widely used classification scheme, particularly due to its intuitive training algorithm and classification behavior. The fact that the classifier represents its classification prescription compactly in terms of a small number of prototypical representatives enables its applicability in particular in the medical domain, where human insight is often crucial, or in online learning scenarios such as online vision systems, where a compact representation of the already gathered information is required for further adaptation [1,2,16,8,15]. While original LVQ was proposed on heuristic grounds, mimicking learning paradigms in biological systems, quite a few variants have been proposed in recent years which can be derived from mathematical cost functions. Notably, generalized LVQ [23] relies on a cost function which can be linked to large margin classifiers [24], enabling a particularly robust classification scheme. As an alternative, robust soft LVQ (RSLVQ) models the data in terms of a mixture of Gaussians in a

* Corresponding author. E-mail address: [email protected] (D. Hofmann).

http://dx.doi.org/10.1016/j.neucom.2013.11.044
0925-2312/© 2014 Elsevier B.V. All rights reserved.

probabilistic framework. Training can be derived thereof as likelihood ratio optimization [26]. Interestingly, both variants yield training algorithms which are very similar to original LVQ2.1 as proposed by Kohonen [17]. The formulation as a cost function makes it easy to integrate larger flexibility into the prescriptions, such as the concept of metric learning [24,26]. Note that LVQ schemes are in some sense complementary to popular classification schemes as provided e.g. by support vector machines (SVM): while both techniques constitute large margin approaches, thus providing excellent generalization ability, one of the strengths of the SVM is its very robust behavior due to a convex cost function with a unique solution. LVQ, on the contrary, typically possesses local optima, and optimization using gradient techniques is usually necessary. However, while the SVM represents models in terms of support vectors, which constitute points at the boundary and whose number typically scales with the size of the training set, LVQ represents solutions in terms of few typical prototypes only, resulting in improved interpretability and classification time. On the downside, the SVM can often represent the boundaries in more detail because of its focus on the boundaries, while LVQ classifiers stay with simpler models. Because of the need for interpretable models in domains such as biomedical applications, where the ultimate responsibility lies with the human expert, sparse interpretable models


such as LVQ classifiers enjoy an increasing popularity among practitioners. In this contribution, we will focus on the approach robust soft LVQ as proposed in [26], since it offers an intuitive representation of data in terms of a mixture of labeled Gaussians. Being a prototype based approach, LVQ provides a direct interface for the practitioner, who can directly inspect the prototypes in the same way as data. Regarding the crucial impact of the interpretability of the given models in many fields, this fact constitutes an important benefit of LVQ classifiers [28].

In many application areas, data sets are becoming more and more complex, and additional structural information is often available. Examples include chemical structures, biological networks, social network data, graph structures, dedicated images, and heterogeneous web data. Often, dedicated similarity measures have been developed to compare such data; popular examples of widely used dissimilarity or similarity measures for such objects are dynamic time warping for time series, alignment for biological sequences or text, divergences for distributions, functional metrics for functional data such as spectral data, graph or tree kernels for structured objects, and many more. These data are no longer explicitly represented as Euclidean vectors; rather, pairwise similarities or dissimilarities are available. LVQ in its original form has been proposed for vectorial data only, since it heavily relies on the possibility to pick prototypes as members of the data space and to adapt these representatives smoothly by means of vectorial updates triggered by the data. Hence LVQ is not directly applicable to complex domains where data are represented in terms of pairwise relations only.

In recent years, a few approaches have been developed which extend LVQ schemes or, more generally, prototype based approaches beyond the vectorial setting. Thereby, most techniques rely on an underlying cost function for which an alternative optimization scheme in the non-vectorial setting is proposed. As an example, unsupervised prototype based methods can rely on exemplars, i.e. they restrict the location of prototypes to the positions of given data points, where dissimilarities are well defined. Training takes place in a discrete space, partially relying on appropriate assignment probabilities to achieve greater robustness, see e.g. the approaches [18,7,4]. These techniques, however, have the drawback that a smooth adaptation of prototypes is no longer possible, and problems can occur especially if the given data are sparse. More general smooth adaptation is offered by relational extensions such as relational neural gas or relational learning vector quantization [12]. Kernelization constitutes another possibility, as proposed for neural gas, self-organizing maps, or different variants of learning vector quantization [3,22]. Recently, a kernel variant of RSLVQ has been proposed which matches the classification performance of support vector machines in a variety of benchmarks [14]. By formalizing the interface to the data as a general similarity or dissimilarity matrix, complex structures can be dealt with, relying on dedicated structure kernels or an explicit Gram matrix, for example [21,10,9].

In this contribution, we will focus on kernel RSLVQ (KRSLVQ), which will be extensively tested on benchmark data sets in comparison to popular alternatives such as k-nearest neighbor classifiers and the support vector machine. KRSLVQ allows the model complexity, i.e. the number of prototypes which represent the classifier, to be specified a priori. Kernel RSLVQ, unlike RSLVQ, represents prototypes implicitly by means of a linear combination of data in kernel space. This has two drawbacks: on the one hand, prototypes are no longer directly interpretable, since the vector of linear coefficients is usually not sparse. Hence, in theory, all data points can contribute to the prototype. On the other hand, an adaptation step no longer scales linearly with the number of data points; rather, quadratic


complexity is required. This makes the technique infeasible for large data sets. In this contribution, we propose two different approximation schemes, and we investigate the effect of these techniques in a variety of benchmarks [13]. First, we consider the Nyström approximation of Gram matrices, which has been proposed in the context of SVMs in [29]. It constitutes a low rank approximation of the matrix based on a small subsample of the data. Assuming a fixed size of the subsample, a linear adaptation technique results. This approximation technique accounts for an efficient update, but prototypes are still distributed. As an alternative, we investigate an approximation of prototypes in terms of their k closest exemplars after or while training. This way, sparse models are obtained, albeit the technique still displays quadratic complexity. The effects of these approximations on the accuracy are tested in a couple of benchmarks.

In the following, we first review RSLVQ and its kernel variant. We then explain the Nyström approximation and its incorporation into kernel RSLVQ. Afterwards, we explain different sparse approximations of the prototypes, and we test the performance using benchmarks similar to [6].

2. Kernel robust soft learning vector quantization

Robust soft LVQ has been proposed in [26] as a probabilistic counterpart to learning vector quantization [17]. It models data by a mixture of Gaussians and derives learning thereof by means of a maximization of the log likelihood ratio of the given data. In the limit of small bandwidth, a learning rule similar to LVQ2.1 is obtained.

Assume that data ξ_k ∈ R^n are given, accompanied by labels y_k. An RSLVQ network represents a mixture distribution, which is determined by m prototypes w_j ∈ R^n, where the labels of prototypes c(w_j) are fixed and σ_j denotes the bandwidth. Then, mixture component j induces the probability

p(ξ|j) = const_j · exp(f(ξ, w_j, σ_j²))   (1)

with normalization constant const_j and function

f(ξ, w_j, σ_j²) = −‖ξ − w_j‖² / σ_j².   (2)

The probability of a data point ξ is given by the mixture

p(ξ|W) = Σ_j P(j) · p(ξ|j)   (3)

with prior probability P(j) of mixture component j and parameters W of the model. The probability of a data point ξ with a given label y is

p(ξ, y|W) = Σ_{c(w_j)=y} P(j) · p(ξ|j).   (4)

Learning aims at an optimization of the log likelihood ratio

L = Σ_k log ( p(ξ_k, y_k|W) / p(ξ_k|W) ).   (5)

A stochastic gradient ascent yields the following update rules, given a data point (ξ_k, y_k):

Δw_j = α · (P_y(j|ξ_k) − P(j|ξ_k)) · const_j · ∂f(ξ_k, w_j, σ_j²)/∂w_j   if c(w_j) = y_k,
Δw_j = −α · P(j|ξ_k) · const_j · ∂f(ξ_k, w_j, σ_j²)/∂w_j   if c(w_j) ≠ y_k,   (6)

where α > 0 is the learning rate. The probabilities are defined as

P_y(j|ξ_k) = P(j) exp(f(ξ_k, w_j, σ_j²)) / Σ_{j': c(w_j')=y_k} P(j') exp(f(ξ_k, w_j', σ_j'²))   (7)


and

P(j|ξ_k) = P(j) exp(f(ξ_k, w_j, σ_j²)) / Σ_{j'} P(j') exp(f(ξ_k, w_j', σ_j'²)).   (8)
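As an illustrative sketch (not the authors' reference implementation), a single stochastic RSLVQ step following Eqs. (6)-(8) can be written as follows, assuming equal priors P(j) and a shared bandwidth σ, with the constant factor const_j absorbed into the learning rate:

```python
import numpy as np

def rslvq_update(xi, y, W, c, sigma=0.5, alpha=0.01):
    """One stochastic RSLVQ step for sample xi with label y.

    W: (m, n) prototype matrix, c: (m,) prototype labels.
    Equal priors P(j) and a shared bandwidth sigma are assumed.
    """
    f = -np.sum((W - xi) ** 2, axis=1) / sigma ** 2      # f(xi, w_j, sigma^2), Eq. (2)
    g = np.exp(f - f.max())                              # numerically stabilized exp
    P_all = g / g.sum()                                  # P(j | xi), Eq. (8)
    mask = (c == y)
    P_y = np.zeros_like(P_all)
    P_y[mask] = g[mask] / g[mask].sum()                  # P_y(j | xi), Eq. (7)
    grad = 2 * (xi - W) / sigma ** 2                     # gradient of f w.r.t. w_j
    coeff = np.where(mask, P_y - P_all, -P_all)          # case distinction of Eq. (6)
    return W + alpha * coeff[:, None] * grad
```

Prototypes with the correct label are attracted toward the sample, all others are repelled, with forces weighted by the assignment probabilities.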

If the standard Euclidean distance is used, class priors are equal, and the bandwidth is small, a learning rule similar to LVQ2.1 results. Given a novel data point ξ, its class label is the most likely label y corresponding to a maximum value p(y|ξ, W) ∝ p(ξ, y|W). For typical settings, this rule can be approximated by a simple winner-takes-all rule, i.e. ξ is mapped to the label c(w_j) of the closest prototype w_j.

RSLVQ in its original form is restricted to Euclidean vectors. A kernelization of the method makes the technique applicable to more general data sets which are characterized in terms of a Gram matrix incorporating pairwise similarities, or in terms of an analytic kernel prescription only. We assume that a kernel k is fixed, corresponding to a feature map Φ. Note that every symmetric and positive semi-definite similarity matrix can be associated with such a kernel, see e.g. [21]. The equation

k_{kl} := k(ξ_k, ξ_l) = Φ(ξ_k)^T Φ(ξ_l)   (9)

holds for all data points ξ_k, ξ_l. Note that, albeit a given kernel always corresponds to an underlying vector space, the feature space, the latter is not known in general, and data are represented only implicitly via their kernel values. The key assumption of kernel RSLVQ as an extension of RSLVQ to Gram matrices is to represent prototypes implicitly in terms of linear combinations of data

w_j = Σ_m γ_{jm} Φ(ξ_m),   (10)

where the coefficients γ_{jm} are non-negative and sum up to 1. This corresponds to the assumption that prototypes are located in the convex hull of the data, which is reasonable provided that the LVQ scheme should yield representative prototypes. Having made this assumption, it is possible to formalize the cost function of RSLVQ:

L = Σ_k log ( Σ_{c(w_j)=y_k} P(j) p(Φ(ξ_k)|j) / Σ_j P(j) p(Φ(ξ_k)|j) ),   (11)

which relies on the Gaussian probabilities, given implicitly in terms of the Gram matrix of the data and the coefficients of the prototypes only: the Gaussian p(Φ(ξ_k)|j) constitutes an exponential function of the distance, which can be computed implicitly by means of the equality

‖Φ(ξ_i) − w_j‖² = ‖Φ(ξ_i) − Σ_m γ_{jm} Φ(ξ_m)‖² = k_{ii} − 2 Σ_m γ_{jm} k_{im} + Σ_{s,t} γ_{js} γ_{jt} k_{st}.   (12)
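The implicit distance computation of Eq. (12) can be sketched as follows (an illustration, assuming a precomputed Gram matrix K and a coefficient matrix gamma with one row per prototype):

```python
import numpy as np

def kernel_distances(K, gamma):
    """Squared feature-space distances ||Phi(xi_i) - w_j||^2 via Eq. (12).

    K: (n, n) Gram matrix; gamma: (p, n) prototype coefficients
    (rows non-negative, summing to 1). Returns an (n, p) matrix.
    """
    k_ii = np.diag(K)                                    # k_ii term
    cross = K @ gamma.T                                  # sum_m gamma_jm k_im
    quad = np.einsum('js,st,jt->j', gamma, K, gamma)     # gamma_j^T K gamma_j
    return k_ii[:, None] - 2 * cross + quad[None, :]
```

For the linear kernel K = X X^T and unit-vector coefficient rows, this reproduces the ordinary squared Euclidean distances, which is a convenient sanity check.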

We assume equal bandwidth σ² = σ_j² for simplicity; more complex adjustment schemes based on the data have been investigated in [25], for example, usually leading to only a minor increase of accuracy. Note that the position of prototypes is not clear a priori, such that a prior adaptation of the bandwidth according to the data density is not possible. Further, we assume constant priors P(j) and mixture components induced by normalized Gaussians.

Now, there exist two major possibilities to adjust the parameters γ_{jm}: we can either directly optimize the cost function L by relying on some standard numeric optimization procedure such as gradient techniques, or we can rephrase the updates of vectorial RSLVQ in terms of the coefficients γ_{jm}, provided the updates have a form such that they can be decomposed into dedicated contributions due to γ_{jm}. The latter technique exactly mimics the adaptation rules in Euclidean space, but without an explicit reference to the embedding, while the former deviates from it, because taking gradients does not commute with linear mappings. Here, we follow the latter approach, and decompose the vectorial update rules into contributions of the coefficients γ_{jm}. The RSLVQ updates can be rephrased as Δw_j = Σ_m Δγ_{jm} Φ(ξ_m) with

Δw_j = α · const_j · (P_y(j|Φ(ξ_k)) − P(j|Φ(ξ_k))) · (Φ(ξ_k) − Σ_m γ_{jm} Φ(ξ_m))   if c(w_j) = y_k,
Δw_j = −α · const_j · P(j|Φ(ξ_k)) · (Φ(ξ_k) − Σ_m γ_{jm} Φ(ξ_m))   if c(w_j) ≠ y_k,   (13)

which decomposes into the following adaptation rules for γ_{jm}:

Δγ_{jm} = α · const_j · { −(P_y(j|Φ(ξ_k)) − P(j|Φ(ξ_k))) γ_{jm}   if ξ_m ≠ ξ_k, c(w_j) = y_k;
                          (P_y(j|Φ(ξ_k)) − P(j|Φ(ξ_k))) (1 − γ_{jm})   if ξ_m = ξ_k, c(w_j) = y_k;
                          P(j|Φ(ξ_k)) γ_{jm}   if ξ_m ≠ ξ_k, c(w_j) ≠ y_k;
                          −P(j|Φ(ξ_k)) (1 − γ_{jm})   if ξ_m = ξ_k, c(w_j) ≠ y_k }.   (14)

This adaptation performs exactly the same updates as RSLVQ in the feature space, provided prototypes lie in the convex hull of the data. To guarantee non-negativity and normalization, a correction takes place after every adaptation step. As an alternative, barrier techniques could be used, or the restrictions could be dropped entirely, allowing more general linear combinations as solutions.

Note that, unlike in RSLVQ, prototypes are represented implicitly in terms of linear combinations. Inspecting a prototype thus requires inspecting the coefficient vector γ_j and all data, the latter usually being characterized in terms of pairwise similarities only. Further, an adaptation step has squared complexity caused by the distributed representation of the prototypes. Thus, the method no longer directly gives interpretable results, and it is no longer applicable to large data sets.

Kernel RSLVQ in this form can be used whenever a fixed kernel k is given and data are in vectorial form, or whenever the Gram matrix itself is given, implicitly representing the data [21]. Note that it can easily be checked whether a symmetric matrix constitutes a valid Gram matrix by referring to its eigenvalues, which should be non-negative. In this case, the adaptation rule as introduced above mimics the standard vectorial update of RSLVQ in the feature space, but without the necessity of explicitly computing this embedding. Provided the similarity matrix of the data is not positive semi-definite, an embedding into so-called pseudo-Euclidean space is still possible [21], and distance computations as provided above remain reasonable, see e.g. so-called relational counterparts to popular prototype-based algorithms such as neural gas [12]. However, two problems occur in this context: a probabilistic interpretation of the model is not valid if distances become negative; further, the adaptation rules do not necessarily obey a gradient scheme because the signature of the underlying space is neglected. Because of these facts, we will mainly deal with valid kernels in the following.
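The post-update correction restoring non-negativity and normalization of the coefficients can be sketched as follows (an illustrative choice: clipping negative entries and renormalizing; the text does not prescribe a specific projection):

```python
import numpy as np

def correct_coefficients(gamma):
    """Restore the constraints gamma_jm >= 0 and sum_m gamma_jm = 1
    after an update step (simple clip-and-renormalize heuristic)."""
    gamma = np.maximum(gamma, 0.0)            # enforce non-negativity
    sums = gamma.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0                     # guard against all-zero rows
    return gamma / sums                       # enforce normalization
```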

3. Nyström approximation of the Gram matrix

The Nyström technique has been presented in [29] in the context of SVMs. It allows one to approximate a Gram matrix by a low rank approximation [11]. Note that the latter work shows that the approximation can also be used for more general symmetric matrices which are not necessarily valid Gram matrices. For many


kernel based approaches, this approximation can be integrated into the learning rules in such a way that updates with linear complexity result [30]. We shortly review the main idea behind this approach in the following.

A valid kernel k(ξ_j, ξ_l) can be expanded in terms of orthonormal eigenfunctions φ_i and non-negative eigenvalues λ_i in the form

k(ξ_j, ξ_l) = Σ_{i=1}^∞ λ_i φ_i(ξ_j) φ_i(ξ_l).

The eigenfunctions and eigenvalues of a kernel are the solutions of the integral equation

∫ k(ξ_j, ξ) φ_i(ξ) p(ξ) dξ = λ_i φ_i(ξ_j),

which can be approximated based on the Nyström technique by sampling ξ i.i.d. according to p, denoting the sampled values as ξ_1, …, ξ_m after possible re-enumeration:

(1/m) Σ_{l=1}^m k(ξ_j, ξ_l) φ_i(ξ_l) ≈ λ_i φ_i(ξ_j).

We denote the submatrix of the Gram matrix corresponding to the m sampled points by K_{m,m}. The eigenvectors and eigenvalues of this matrix are denoted by U^{(m)} and Λ^{(m)}, respectively, characterized by the eigenvalue equation

K_{m,m} U^{(m)} = U^{(m)} Λ^{(m)}.

These solutions enable an approximation of the eigenfunctions and eigenvalues:

λ_i ≈ λ_i^{(m)} / m,   φ_i(ξ_l) ≈ (√m / λ_i^{(m)}) · k_{ξ_l} u_i^{(m)},

where u_i^{(m)} is the i-th column of U^{(m)} and k_{ξ_l} = (k(ξ_1, ξ_l), …, k(ξ_m, ξ_l))^T is the vector of kernel values.

This allows us to approximate a given full Gram matrix K by a low-rank counterpart, since we can use these approximations in the kernel expansion. Subsampling corresponds to a choice of m rows and columns of the matrix; the corresponding submatrix is denoted by K_{m,m} as before, and the corresponding m rows and columns are denoted by K_{n,m} and K_{m,n}, respectively. These are transposes of each other, since the matrix is symmetric. The approximation as introduced above leads to the following approximation of the kernel expansion:

K̃ = Σ_{i=1}^m (1/λ_i^{(m)}) K_{n,m} u_i^{(m)} (u_i^{(m)})^T K_{m,n},

where λ_i^{(m)} and u_i^{(m)} correspond to the m × m eigenproblem as above. In case some λ_i^{(m)} are zero, we replace the corresponding fractions by zero. Thus we obtain, with K_{m,m}^{-1} denoting the Moore-Penrose pseudoinverse,

K̃ = K_{n,m} K_{m,m}^{-1} K_{m,n}.

For a given matrix K with rank m, this approximation is exact if the m chosen points are linearly independent. Hence we can approximate the full Gram matrix as used in kernel RSLVQ by a low rank approximation: this equation for K̃ can directly be integrated into the computation of the Gaussians using the identity

‖Φ(ξ_i) − w_j‖² = e_i^T K e_i − 2 e_i^T K γ_j + γ_j^T K γ_j,

where e_i denotes the i-th unit vector. Using K̃ instead of K, linear complexity results if the matrix-vector multiplications are computed first.

4. Sparse approximation of prototypes

Kernel RSLVQ yields prototypes which are implicitly represented as linear combinations of data points

w_j = Σ_m γ_{jm} Φ(ξ_m).   (15)

Since the training algorithm and the classification depend on pairwise distances only, simple linear algebra allows us to compute the distance of a data point to a prototype based on the pairwise similarities of the data point and all training data only, i.e. the given Gram matrix, as specified above. However, direct interpretability and sparseness of the prototypes are lost this way. Here we propose and compare different techniques to approximate the distributed representation of prototypes by sparse ones, i.e. we want to enforce small values |γ_j|_0, where this 0-norm counts the number of non-zero entries of the coefficient vector. We use four different techniques to achieve sparsity:

• Sparse training: we enhance the cost function of RSLVQ by a term S(γ) which prefers sparse prototype solutions; see e.g. the approach [20] for a fundamental discussion of sparsity concepts and corresponding costs. A typical choice is the L1 norm S(γ) = Σ_{jm} |γ_{jm}|. This constraint is weighted with a parameter C which is optimized according to the given data set. Thus, updates are enhanced by the term ±α · C · γ_{jm}, depending on the sign of γ_{jm}. Control parameter: C. Complexity: O(N² · loop_iterations).
• K-approximation: this approximation relies on the prototypes found after training. We substitute each prototype by its K_appr closest exemplars in the given data set with respect to the distance ‖w_j − Φ(ξ_m)‖² in the feature space; the latter can be computed based on the kernel. Control parameter: K_appr. Complexity: O(N²).
• K-convex hull: we delete all but the K_conv largest coefficients γ_{jm} in the coefficient vector γ_j, which is then normalized such that Σ_m γ_{jm} = 1. Control parameter: K_conv. Complexity: O(N).
• Sparse approximation: we approximate a given prototype w_j by its closest sparse linear combination Σ_m α_{jm} Φ(ξ_m) with small |α_j|_0, where the points Φ(ξ_m) serve as (possibly overcomplete) basis vectors. Since this problem is NP-hard, we use a popular greedy approach as offered by orthogonal matching pursuit (OMP) [5]. Since OMP relies on dot products only, we can apply it implicitly based on the kernel. Control parameter: residual of the approximation quality for the prototypes. Complexity: O(N² · loop_iterations).

Note that the second and third methods constitute geometrically motivated ad hoc methods which are founded in the geometric nature of LVQ classifiers. The other two techniques are related to more principled general schemes from the literature, but pay for this foundation with a higher complexity. These approximations can reach different degrees of sparsity depending on the chosen parameters. Since prototypes are then substituted by a small number of exemplars of the data, an interface for classifier interpretation is given this way: the exemplars can be inspected in the same way as data points, which allows experts in the field insight into the model. Besides this issue, an approximation by means of sparse solutions decreases the computational complexity of classifying a novel data point from linear time


to constant time. This speed-up might be crucial e.g. if the trained classifier is used in interactive or online scenarios.
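The Nyström construction of Section 3 can be sketched as follows (an illustration, not the authors' code; the landmark columns are assumed to be given, e.g. a random subsample). Multiplying the factors right to left keeps matrix-vector products at O(n·m) cost:

```python
import numpy as np

def nystroem_factors(K_nm, K_mm):
    """Return factors C, W_pinv with K ~ C @ W_pinv @ C.T (Nystroem).

    K_nm: (n, m) similarities between all n points and m landmarks;
    K_mm: (m, m) landmark submatrix.
    """
    return K_nm, np.linalg.pinv(K_mm)    # Moore-Penrose pseudoinverse

def approx_matvec(C, W_pinv, v):
    """Compute K_tilde @ v in O(n*m) by multiplying right to left."""
    return C @ (W_pinv @ (C.T @ v))
```

When K has rank at most m and the chosen landmarks are linearly independent, the approximation is exact, mirroring the statement in Section 3.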

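The K-convex hull approximation described above admits a particularly compact sketch (an illustration; it keeps the K_conv largest coefficients of a prototype and renormalizes):

```python
import numpy as np

def k_convex_hull(gamma_j, k_conv):
    """Keep only the k_conv largest coefficients of a prototype's
    coefficient vector and renormalize them to sum to 1."""
    sparse = np.zeros_like(gamma_j)
    idx = np.argsort(gamma_j)[-k_conv:]    # indices of the largest coefficients
    sparse[idx] = gamma_j[idx]
    return sparse / sparse.sum()
```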

5. Experiments

We compare kernel RSLVQ, its Nyström approximation, and four sparse approximations enforcing different levels of sparsity on a variety of benchmarks as introduced in [6,19]. For the Nyström approximation, we compare the approximation quality based on a subsample of 10% or 25%, respectively. For the sparse approximations, we set the parameters such that different degrees of sparsity are achieved, where possible. In particular, we report the values obtained for a sparse approximation of the prototypes by 1 resp. 10 exemplars per prototype for the geometric methods; and we exemplarily report the dependency of the approximation quality on the sparsity for the geometric methods and OMP in Fig. 2.

The data sets consist of similarity matrices which are, in general, non-Euclidean. Non-Euclideanity can be quantified by the signature of the data set, i.e. the number of positive, negative, and zero eigenvalues of the similarity matrix. Note that eigenvalues exactly zero are usually not encountered due to numerical issues; commonly, a small cutoff value is used for this purpose. The matrices are symmetrized and normalized before processing. In general, the given similarity matrices do not constitute a valid kernel, such that a probabilistic representation using the above formulas is no longer well-defined due to potentially negative distances. There exist standard preprocessing tools which transfer a given similarity matrix into a valid kernel, as presented e.g. in [6,21]. Typical corrections are

• Spectrum clip: negative eigenvalues of the matrix are set to 0. This can be realized as a linear projection and directly transfers to out-of-sample extensions.
• Spectrum flip: negative eigenvalues are substituted by their absolute values. Again, this can be realized by means of a linear transformation.

These transforms, which turn a given similarity matrix into a valid Gram matrix, are tested for kernel RSLVQ with according symmetrization, and flip or clip. We use the following training data sets:

• Amazon47: This data set consists of 204 books written by 47 different authors. The similarity is determined as the percentage of customers who purchase book j after looking at book i. The signature is (192, 1, 11) and the number of prototypes is 94. The class label of a book is given by its author.
• AuralSonar: This data set consists of 100 wide band sonar signals corresponding to two classes, target of interest versus clutter. Similarities are determined based on human perception, averaging over 2 random probands for each signal pair. The signature is (61, 38, 1) and the number of prototypes is 10.
• FaceRec: 945 images of faces of 139 different persons are recorded. Images are compared using the cosine distance of integral invariant signatures based on surface curves of the 3D faces. The signature is (45, 0, 900) and the number of prototypes is 193. The labeling corresponds to the 139 different persons.
• Patrol: 241 samples representing persons in seven different patrol units are contained in this data set. Similarities are based on responses of persons in the units about other members of their groups. The signature is (54, 66, 121) and the number of prototypes is 24. Class labeling corresponds to the seven patrol units.
• Protein: 213 proteins are compared based on evolutionary distances, comprising four classes corresponding to different globin families. The signature is (169, 38, 6) and the number of prototypes is 20.
• Voting: Voting contains 435 samples with categorical data compared by means of the value difference metric. Class labeling into two classes is present. The signature is (16, 1, 418) and the number of prototypes is 20.
• Chickenpieces: The task is to classify 446 silhouettes of chicken pieces into five categories (wing, back, drumstick, thigh and back, breast). The silhouettes are represented as strings based on the angles of tangents onto the curve, and comparison takes place by means of a circular alignment, see [19]. The signature is (240, 205, 1) and the number of prototypes is 5.
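The spectrum clip and flip corrections described above can be sketched as eigendecomposition-based transforms (an illustration; S is assumed to be a symmetrized similarity matrix):

```python
import numpy as np

def spectrum_clip(S):
    """Set negative eigenvalues of a symmetric similarity matrix to 0."""
    lam, U = np.linalg.eigh(S)
    return U @ np.diag(np.maximum(lam, 0.0)) @ U.T

def spectrum_flip(S):
    """Replace negative eigenvalues by their absolute values."""
    lam, U = np.linalg.eigh(S)
    return U @ np.diag(np.abs(lam)) @ U.T
```

Both variants yield a positive semi-definite matrix, i.e. a valid Gram matrix for kernel RSLVQ.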

Note that the rank of the Gram matrix is given by the number of positive eigenvalues if clip is used as preprocessing, and by the number of non-zero eigenvalues if the original data or flip are used. The eigenvalue spectra of the data sets are depicted in Fig. 1. As can be seen from the graphs, the data sets Amazon47, FaceRec and Voting are almost Euclidean, while all others contain a considerable percentage of negative eigenvalues. Interestingly, the intrinsic dimensionality (as mirrored by the number of eigenvalues with a relevant absolute value) is high for Amazon47, Patrol, and Chickenpieces.

For training, prototypes are initialized by means of normalized random coefficients γ_{jm}. Thereby, class labels are taken into account, setting the coefficient γ_{jm} to zero if the label of point ξ_m does not coincide with the prototype label c(w_j). The number of prototypes is taken as a small multiple of the number of classes. We use a fixed number of prototypes, taking the values from previous experimental settings [14], noticing that the exact number of prototypes does not severely influence the result since no overfitting takes place. The other meta-parameters are optimized on the data sets using cross-validation, whereby meta-parameters such as the learning rate have only a minor influence on the final result, but on the speed of convergence only. As already discussed in [26], the bandwidth of the model influences the result and the prototype locations, and strategies to also adapt the bandwidth in parallel to the prototype locations have been proposed in [25,27], for example. Since the bandwidth should be adapted on a slower scale than the prototype positions, very time consuming algorithms result this way, because of which we simply optimize σ by cross-validation in the range between 0.05 and 1.0 with a step size of 0.05. The variance between the optimum parameters was mostly in a range of 10^{-5}.

The results for kernel RSLVQ are reported in Table 1 in comparison to an SVM and a k-NN classifier with parameter settings as obtained in [6]. Results for SVM and k-NN are recomputed using the setting as described in [6], leading to the same or better results as compared to [6]. Classification accuracy is thereby evaluated in a 20-fold cross-validation. Note that a decomposition of a data set characterized by a similarity matrix into training and test set corresponds to a selection of a set of indices I. The sub-matrix formed by (k_{ij})_{i,j∈I} characterizes the training set; distances of prototypes to test points for a classification of the test set can be computed based on (k_{ij})_{i∈I, j∉I}. Remarkably, kernel RSLVQ yields results which are comparable to SVM in all but one case. For the data set Chickenpieces, the error is almost twice as large, which can possibly be attributed to the rather skewed eigenvalue distribution, with one eigenvalue by far dominating all others, such that distance based approaches seem less suitable. In this setting, LVQ would require capabilities to adapt the

D. Hofmann et al. / Neurocomputing 147 (2015) 96–106

Fig. 1. Characteristic spectrum of the considered similarities. The data sets differ with respect to the number of negative eigenvalues, corresponding to non-Euclideanity, and the number of eigenvalues different from zero, corresponding to a high dimensional feature space.

relevance of such dimensions accordingly, as present e.g. in vectorial metric learners for LVQ [24]. In general, preprocessing using spectrum clip or flip can be beneficial. A naive application of kernel RSLVQ to the (non-Euclidean) similarity matrix without preprocessing already yields good results, although this setting does not correspond to a clear underlying mathematical model, the standard Gaussian probabilities not being defined in pseudo-Euclidean space if the distances can become negative.
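Spectrum clip and flip can be implemented via an eigendecomposition of the symmetric similarity matrix; a minimal sketch (variable names are ours):

```python
import numpy as np

def correct_spectrum(S, mode="clip"):
    """Make a symmetric similarity matrix positive semi-definite.

    clip: set negative eigenvalues to zero.
    flip: replace eigenvalues by their absolute values.
    """
    eigvals, eigvecs = np.linalg.eigh(S)
    if mode == "clip":
        eigvals = np.maximum(eigvals, 0.0)
    elif mode == "flip":
        eigvals = np.abs(eigvals)
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

S = np.array([[1.0, -2.0], [-2.0, 1.0]])   # eigenvalues 3 and -1
S_clip = correct_spectrum(S, "clip")
# all eigenvalues of S_clip are now non-negative
```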


The results for RSLVQ and its Nyström approximation are reported in Table 2 for different sizes of the approximation matrix. The Nyström approximation preserves the excellent performance of kernel RSLVQ in four of the cases when considering valid kernels (i.e. clip or flip), enabling a linear technique with excellent performance in these settings. For two cases (Amazon47, Patrol), the Nyström approximation yields a degradation by more than 100%. As can be seen from the eigenvalue spectra as shown in Fig. 1, a good

Table 1
Results of kernel RSLVQ compared to kNN-NN and SVM. The mean classification error and standard deviation obtained in a 20-fold cross-validation are reported. The best results are shown in boldface.

Dataset         | kNN-NN       | SVM          | Kernel RSLVQ
Amazon47        | 28.54 (0.83) | 21.46 (5.74) | 15.37 (0.36)
  clip          | 28.78 (0.74) | 21.22 (5.49) | 15.37 (0.41)
  flip          | 28.90 (0.68) | 22.07 (6.25) | 16.34 (0.42)
AuralSonar      | 14.75 (0.49) | 12.25 (7.16) | 11.50 (0.37)
  clip          | 17.00 (0.51) | 12.00 (5.94) | 11.25 (0.39)
  flip          | 17.00 (0.93) | 12.25 (6.97) | 11.75 (0.35)
FaceRec         |  7.46 (0.04) |  3.73 (1.32) |  3.78 (0.02)
  clip          |  7.35 (0.04) |  3.84 (1.16) |  3.84 (0.02)
  flip          |  7.78 (0.04) |  3.89 (1.19) |  3.60 (0.02)
Patrol          | 22.71 (0.33) | 15.52 (4.02) | 17.50 (0.25)
  clip          |  9.90 (0.16) | 13.85 (4.39) | 17.40 (0.29)
  flip          | 10.31 (0.16) | 12.92 (5.09) | 19.48 (0.34)
Protein         | 51.28 (0.77) | 30.93 (6.79) | 26.98 (0.37)
  clip          | 25.00 (0.74) | 12.56 (5.46) |  4.88 (0.17)
  flip          |  7.79 (0.18) |  1.98 (2.85) |  1.40 (0.05)
Voting          |  5.00 (0.01) |  5.06 (1.84) |  5.46 (0.04)
  clip          |  4.83 (0.02) |  5.00 (1.84) |  5.34 (0.04)
  flip          |  4.66 (0.02) |  4.89 (1.78) |  5.34 (0.03)
Chickenpieces   |  6.98 (0.39) |  7.19 (7.18) | 16.41 (0.72)
  clip          |  8.12 (0.44) |  8.09 (7.23) | 18.49 (0.77)
  flip          |  9.70 (0.67) |  6.74 (5.77) | 18.46 (0.82)

performance of the Nyström approximation is directly correlated with the intrinsic dimensionality of the data set as measured by the number of eigenvalues with a significant contribution: the two data sets Amazon47 and Patrol display eigenvalue profiles where a large number of values differ considerably from 0. Since the Nyström approximation is exact if the sampled points match the intrinsic rank of the given data, and loses information about the remaining span otherwise, it can be expected to fail in these two cases, which it does. We can see that an intrinsically low dimensional matrix correlates with a good Nyström approximation of the Gram matrix, as reported in Table 2: here, the Spearman correlations of the rows of the Gram matrix and its Nyström approximation are computed, indicating whether the ordering induced by the approximation is consistent with the original ordering of the respective closest data points for every row. Interestingly, the correlation displays particularly low values (smaller than 0.4) for the two data sets Amazon47 and Patrol. Albeit this measure constitutes a good indicator of whether the Nyström approximation will be successful, it cannot be used in practice due to its quadratic computational complexity. Therefore, we propose a different evaluation which can be checked on a small subsample in linear time, provided the size of the submatrix used for the Nyström approximation is constant: we consider the Spearman correlation which results from the Nyström approximation taking different samples of the reported size. The result of this correlation, averaged over 10 different approximation sets, is reported in Table 2: for the valid Gram matrices, it yields very low values (<0.1) if and only if the Nyström approximation fails. Otherwise, it yields values of at least 0.5.
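The proposed sampling check can be sketched as follows: build two Nyström approximations from independent landmark samples and compare the induced row orderings via Spearman correlation. This is a minimal illustration (landmark handling, the pseudo-inverse, and the rank-based Spearman implementation are our choices, not necessarily the authors' exact procedure):

```python
import numpy as np

def nystroem(K, landmarks):
    """Nystroem approximation K ~ C W^+ C^T with C = K[:, L], W = K[L, L]."""
    C = K[:, landmarks]
    W = K[np.ix_(landmarks, landmarks)]
    return C @ np.linalg.pinv(W) @ C.T

def spearman(a, b):
    """Spearman rank correlation of two vectors (no tie correction)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

def sample_consistency(K, m, rng=None):
    """Average row-wise Spearman correlation between two Nystroem
    approximations built from independent landmark samples of size m."""
    rng = np.random.default_rng(rng)
    n = K.shape[0]
    K1 = nystroem(K, rng.choice(n, m, replace=False))
    K2 = nystroem(K, rng.choice(n, m, replace=False))
    return float(np.mean([spearman(K1[i], K2[i]) for i in range(n)]))

# an intrinsically low-rank Gram matrix yields a consistency close to 1
X = np.random.default_rng(0).standard_normal((60, 3))
K = X @ X.T                                  # rank 3
consistency = sample_consistency(K, m=10, rng=1)
```

Since the landmark sample size (10) exceeds the intrinsic rank (3), both approximations are essentially exact here and the consistency is close to 1; for intrinsically high-dimensional data it drops.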
Hence, by sampling only a constant number of rows and computing their correlation this way, we obtain an efficient method to estimate prior to training whether the Nyström approximation can be successful. Similarly, we evaluate the possibility to approximate the results by sparse counterparts. Thereby, we compare the possibility to enforce sparsity while training and the three techniques to

Table 2
Results of kernel RSLVQ and its Nyström approximation of the Gram matrix using 10% and 25% of the data. Additionally, Spearman correlation coefficients are given for the rows of the Nyström approximation and the original data matrix, and for pairwise different Nyström samples, respectively. The best results of the Nyström approximations are shown in boldface.

Dataset         | Kernel RSLVQ | Nyström 10%  | Nyström 25%  | Corr. to data 10% | Corr. to data 25% | Pairwise 10% | Pairwise 25%
Amazon47        | 15.37 (0.36) | 64.15 (0.81) | 84.24 (0.30) | 0.70 (0.55) | 0.33 (1.38) | 0.58 (1.26) | 0.12 (0.37)
  clip          | 15.37 (0.41) | 64.15 (0.33) | 77.93 (0.51) | 0.22 (0.15) | 0.35 (0.06) | 0.02 (0.01) | 0.05 (0.04)
  flip          | 16.34 (0.42) | 65.73 (0.30) | 76.71 (0.62) | 0.22 (0.19) | 0.36 (0.08) | 0.02 (0.02) | 0.05 (0.03)
AuralSonar      | 11.50 (0.37) | 21.25 (2.05) | 20.00 (0.79) | 0.35 (1.01) | 0.43 (0.23) | 0.20 (0.64) | 0.22 (0.18)
  clip          | 11.25 (0.39) | 15.00 (0.63) | 13.00 (0.43) | 0.61 (0.25) | 0.81 (0.02) | 0.54 (0.53) | 0.74 (0.03)
  flip          | 11.75 (0.35) | 16.25 (0.84) | 14.50 (0.55) | 0.56 (0.29) | 0.75 (0.02) | 0.53 (0.82) | 0.71 (0.06)
FaceRec         |  3.78 (0.02) |  3.52 (0.02) |  3.54 (0.02) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00)
  clip          |  3.84 (0.02) |  3.47 (0.02) |  3.49 (0.01) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00)
  flip          |  3.60 (0.02) |  3.52 (0.02) |  3.47 (0.01) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00)
Patrol          | 17.50 (0.25) | 61.77 (0.63) | 48.85 (1.23) | 0.46 (1.09) | 0.20 (0.04) | 0.29 (1.64) | 0.05 (0.02)
  clip          | 17.40 (0.29) | 47.50 (0.78) | 34.79 (0.60) | 0.18 (0.08) | 0.31 (0.17) | 0.04 (0.01) | 0.08 (0.01)
  flip          | 19.48 (0.34) | 45.94 (0.66) | 35.10 (0.39) | 0.21 (0.02) | 0.38 (0.02) | 0.04 (0.01) | 0.08 (0.02)
Protein         | 26.98 (0.37) | 28.60 (1.63) | 28.26 (0.62) | 0.70 (1.80) | 0.78 (0.45) | 0.58 (2.86) | 0.65 (0.74)
  clip          |  4.88 (0.17) | 12.21 (0.36) |  7.44 (0.23) | 0.88 (0.12) | 0.95 (0.01) | 0.88 (0.07) | 0.93 (0.03)
  flip          |  1.40 (0.05) |  8.02 (0.38) |  3.95 (0.14) | 0.88 (0.13) | 0.94 (0.02) | 0.87 (0.14) | 0.92 (0.11)
Voting          |  5.46 (0.04) |  5.23 (0.04) |  5.69 (0.05) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00)
  clip          |  5.34 (0.04) |  5.17 (0.03) |  5.69 (0.03) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00)
  flip          |  5.34 (0.03) |  5.34 (0.04) |  5.52 (0.03) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00)
Chickenpieces   | 16.41 (0.72) | 33.41 (0.25) | 26.14 (0.32) | 0.75 (0.49) | 0.73 (0.15) | 0.62 (0.79) | 0.56 (0.16)
  clip          | 18.49 (0.77) | 34.77 (0.46) | 23.64 (0.19) | 0.95 (0.00) | 0.97 (0.00) | 0.96 (0.00) | 0.97 (0.00)
  flip          | 18.46 (0.82) | 34.55 (0.42) | 25.00 (0.08) | 0.95 (0.00) | 0.97 (0.00) | 0.96 (0.00) | 0.97 (0.00)


approximate the full RSLVQ solution by sparse counterparts, as described above. The results of these approximations are shown in Tables 3 and 4, respectively. In Table 3, a sparsity constraint is added while training, where the weighting parameter C determines the degree of sparsity which arises. It turns out that the results are rather sensitive with respect to this parameter, yielding either solutions without any enforced sparsity or degenerate solutions. Therefore, we used binary search to obtain a reasonable value of C, which existed only in a very small region (diameter smaller than 0.01) in all cases. The results as concerns accuracy and sparsity are reported in Table 3: in five cases, a comparable accuracy can be achieved with sparsity (measured in absolute entries not equal to zero per prototype) ranging from only 1 to 70 nonzero entries, corresponding to an increase of sparsity by more than 50% in all cases. However, in particular the sensitivity of the method with respect to the parameter choice C does not suggest this possibility as the preferred technique.

Table 3
Results of kernel RSLVQ and its sparse training as well as the obtained degree of sparsity. The classification error in % and standard deviation in parentheses are given. Sparsity refers to the average number of nonzero coefficients per prototype in the left column and the loss of density in percent in parentheses.

Dataset         | Kernel RSLVQ | Sparse kernel RSLVQ | Sparsity
Amazon47        | 15.37 (0.36) | 43.40 (0.53)        |  1.00 (72.70)
  clip          | 15.37 (0.41) | 39.92 (0.31)        |  1.00 (72.70)
  flip          | 16.34 (0.42) | 43.18 (0.81)        |  1.00 (72.70)
AuralSonar      | 11.50 (0.37) | 17.25 (0.78)        | 11.97 (70.08)
  clip          | 11.25 (0.39) | 10.75 (0.30)        | 12.75 (68.11)
  flip          | 11.75 (0.35) | 15.50 (0.76)        | 12.73 (68.17)
FaceRec         |  3.78 (0.02) |  4.15 (0.01)        |  1.00 (81.88)
  clip          |  3.84 (0.02) |  4.13 (0.02)        |  1.00 (81.88)
  flip          |  3.60 (0.02) |  4.07 (0.02)        |  1.00 (81.88)
Patrol          | 17.50 (0.25) | 41.67 (0.87)        |  6.67 (72.32)
  clip          | 17.40 (0.29) | 40.00 (0.59)        |  6.71 (72.18)
  flip          | 19.48 (0.34) | 41.56 (0.61)        |  6.68 (72.27)
Protein         | 26.98 (0.37) | 38.84 (0.74)        | 22.19 (47.77)
  clip          |  4.88 (0.17) | 13.84 (0.38)        | 13.37 (68.54)
  flip          |  1.40 (0.05) |  2.21 (0.10)        | 13.52 (68.19)
Voting          |  5.46 (0.04) |  5.11 (0.03)        | 64.35 (63.02)
  clip          |  5.34 (0.04) |  5.34 (0.03)        | 68.67 (60.53)
  flip          |  5.34 (0.03) |  5.80 (0.06)        | 59.92 (65.56)
Chickenpieces   | 16.41 (0.72) | 18.69 (0.93)        | 20.84 (75.41)
  clip          | 18.49 (0.77) | 19.07 (1.16)        | 20.54 (75.76)
  flip          | 18.46 (0.82) | 18.43 (1.11)        | 21.54 (74.58)

The results obtained by sparse approximations deduced from the solution after training are reported in Table 4, displaying different degrees of sparsity which can be determined by the user. Here, we explicitly set the degree of sparsity to 1 resp. 10 entries per prototype for the direct geometric methods, while OMP is controlled in a more indirect way by the approximation accuracy of the resulting prototype. The result reported in Table 4 for OMP is the best we could obtain by varying this parameter. The systematic dependency of the accuracy of the classifier and the resulting sparsity based on OMP is exemplarily shown in Fig. 2: often, only a comparably small range of sparsity can be covered by reasonable choices of this control parameter. For comparison, we also report the corresponding results if the sparse approximation is added on top of the results obtained using sparse training. As can be seen in Table 4, a sparse approximation can be obtained for all data sets with at least one of these techniques; we can even infer sparse solutions for the two data sets Amazon47 and Patrol which reach the classification accuracy of the full kernel RSLVQ solution, despite their intrinsically high dimensionality.
This might be taken as an indicator that the data manifolds are curved but, if curvature is taken into account, low dimensional, such that local prototype based models allow a local dimensionality reduction of the system without loss of accuracy. As can be seen from Table 4, the respective best technique is not unique across the different data sets. Interestingly, however, it

Table 4
Classification errors of sparse approximations of the obtained classifiers. For the kappr-approximation and the kconv-convex hull, the left column refers to a sparsity of 1 and the right column to 10 entries per prototype. For OMP, the sparsity arises from the problem formulation by setting the approximation quality. The two best results are shown in boldface. Dataset

Kernel RSLVQ

Sparse kernel RSLVQ

kappr-approximat

kconv-convex hull

1

10

1

10

Amazon47 clip flip

36.10 31.65 31.28

41.12 43.17 45.73

32.02 31.45 33.15

15.37 15.00 16.46

AuralSonar clip flip

25.13 24.75 24.75

20.00 15.00 17.62

55.94 58.50 61.50

3.70 3.76 3.33

36.93 36.97 36.98

Patrol clip flip

54.31 32.46 37.42

Protein clip flip Voting clip flip

FaceRec clip flip

Chickenpieces clip flip

OMP

kappr-approximat

kconv-convex hull

OMP

1

10

1

10

22.68 20.00 20.37

43.91 42.02 43.85

49.25 52.43 56.59

43.08 40.85 43.21

43.08 40.85 43.21

46.12 40.84 43.57

25.00 23.25 19.75

38.00 15.00 26.00

29.08 20.75 26.00

26.67 16.75 21.50

29.81 22.25 27.50

17.50 11.00 15.00

29.75 15.50 26.25

3.84 3.92 4.21

3.78 3.84 3.60

3.68 3.68 3.60

4.14 4.10 4.13

37.18 37.24 37.12

4.15 4.13 4.07

4.15 4.13 4.07

4.15 4.13 4.07

25.19 18.86 20.60

67.98 38.82 40.63

26.77 24.38 25.42

48.75 29.69 33.33

54.36 28.67 29.47

24.68 22.86 24.23

65.09 37.03 40.12

40.94 40.21 41.98

53.85 46.25 49.90

55.12 22.44 23.26

47.53 29.38 24.88

42.09 36.28 25.35

33.14 27.44 3.95

52.44 52.09 49.07

48.76 36.63 30.81

47.80 33.57 28.26

44.53 30.12 18.84

43.95 14.77 3.02

57.79 30.70 26.74

8.56 8.65 7.84

9.48 11.44 10.03

86.21 86.44 86.95

82.53 82.76 82.53

15.57 5.34 5.46

13.59 13.62 12.82

15.69 13.45 17.42

62.82 65.34 65.46

41.15 44.02 38.39

15.52 5.34 6.55

27.19 29.89 32.57

21.83 22.50 27.28

72.80 74.46 72.88

34.53 32.79 38.05

32.38 19.84 19.78

39.04 41.53 46.20

22.68 22.42 27.87

70.12 71.72 71.86

26.31 24.01 27.99

34.16 21.25 28.23
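The OMP scheme compared in Table 4 can be sketched generically in the kernel-induced feature space, where all inner products reduce to Gram matrix entries: the residual w − w_sparse has squared norm (γ − β)ᵀK(γ − β), and atoms are selected greedily by their correlation with the residual, followed by a least-squares refit on the support. This is a standard orthogonal matching pursuit formulation, not the authors' exact implementation; the tolerance and atom budget are assumptions:

```python
import numpy as np

def kernel_omp(K, gamma, tol=1e-6, max_atoms=10):
    """Greedy sparse approximation of a prototype in kernel feature space.

    Selects the data point whose feature image correlates most with the
    residual w - w_sparse and refits the coefficients on the support by
    least squares; all inner products are taken from the Gram matrix K.
    """
    n = K.shape[0]
    beta = np.zeros(n)
    support = []
    for _ in range(max_atoms):
        resid_corr = K @ (gamma - beta)       # <phi(xi_m), w - w_sparse>
        resid_norm2 = (gamma - beta) @ resid_corr
        if resid_norm2 <= tol:                # prototype approximated well enough
            break
        m = int(np.argmax(np.abs(resid_corr)))
        if m not in support:
            support.append(m)
        S = np.array(support)
        beta[:] = 0.0                         # refit coefficients on the support
        beta[S] = np.linalg.lstsq(K[np.ix_(S, S)], (K @ gamma)[S], rcond=None)[0]
    return beta

# toy Gram matrix of the points (1,0), (0,1), (1,1)
K = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0]])
beta = kernel_omp(K, np.array([0.0, 0.0, 1.0]))   # prototype = third point
```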


Fig. 2. For exemplary data sets, the obtained accuracy versus the degree of sparsity is depicted for the three techniques OMP, the convex hull, and the approximation by the nearest neighbors. Since OMP does not allow one to explicitly influence the sparsity, but only the approximation quality, these curves cannot be obtained for the full range displayed in the graphs.

seems that it is not worthwhile to incorporate sparsity constraints already while training, where the problem of a high sensitivity with respect to the control parameter C would have to be met. Further, in all cases, the two simple geometric approximation techniques are among the two best solutions, such that it does not seem worthwhile to use the more complex approximation OMP, which aims at an optimum sparse representation of the prototypes. It very much depends on the setting which degree of sparsity is best, having tested the approximation using 1 or 10 entries per prototype in Table 4. A more systematic comparison of the accuracy for different degrees of sparsity is exemplarily shown in Fig. 2. The graphs show very clearly that in all settings a simple geometric approach matches the accuracy obtained by OMP (it is even better in a fraction of the graphs), and that it varies depending on the data for which sparsity and for which

techniques the best results can be obtained. This can be attributed to the quite diverse geometric settings and learning scenarios. However, since posterior geometric approximation techniques are rather fast, it is no problem to simply test different degrees of sparsity for both methods and take the best one afterwards. A sparse representation of the classifier in terms of few exemplars of the data set opens the way towards fast classification models and, in particular, interpretable models, provided a single data point can be inspected by practitioners in a natural way. Note that several data sets allow classification schemes which rely on only one exemplar per class, i.e. an inspection of these representative data points is particularly efficient. We check this claim for only one case, leaving a deeper investigation of the interpretability of the models to future research. For Chickenpieces, a solution


Fig. 3. Visualization of exemplars for a classifier for Chickenpieces.

incorporating only one exemplar per class is derived, and these exemplars are linked to the original images used for pairwise distance computation. As can be seen in Fig. 3, the resulting exemplars correspond to images which clearly show a typical shape of their class.
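A one-exemplar-per-class classifier as above corresponds to the k = 1 case of the nearest-exemplar approximation: each prototype wj = Σm γjm φ(ξm) is replaced by its k closest data points in the kernel-induced metric, with distances obtained via the kernel trick, ||φ(ξm) − wj||² = kmm − 2(Kγj)m + γjᵀKγj. A minimal sketch (the uniform re-weighting of the selected exemplars is our simplification, not necessarily the authors' exact scheme):

```python
import numpy as np

def k_approximation(K, gamma, k=1):
    """Approximate each prototype by its k nearest exemplars.

    K is the Gram matrix, gamma the (prototypes x points) coefficient
    matrix; squared distances are computed entirely from K.
    """
    d2 = (np.diag(K)[None, :]
          - 2.0 * gamma @ K
          + np.einsum("jm,mn,jn->j", gamma, K, gamma)[:, None])
    sparse = np.zeros_like(gamma)
    for j, row in enumerate(d2):
        nearest = np.argsort(row)[:k]
        sparse[j, nearest] = 1.0 / k      # uniform weight on the k nearest
    return sparse

# 1D toy example: prototype lies halfway between the first two points
K = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 10.0], [0.0, 10.0, 100.0]])
gamma = np.array([[0.5, 0.5, 0.0]])
sparse = k_approximation(K, gamma, k=1)
# the prototype collapses onto one of its two equidistant neighbors
```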

6. Discussion

We have investigated kernel robust soft LVQ and the possibility to obtain efficient approximations by means of the Nyström technique and by sparse models, using different approximation schemes applicable while or after training. These methods aim at an improved computational performance of the technique or an improved sparsity of the classifier, resulting in faster classification as well as enhanced interpretability of the results, thus addressing two of the most severe drawbacks of kernel RSLVQ. We have shown that the excellent accuracy obtained by kernel RSLVQ can be preserved using the Nyström approximation, provided the data have an intrinsically low dimensionality. The latter can efficiently be tested by referring to the correlation of different Nyström samples. Further, it is possible to approximate all resulting models by sparse counterparts with different degrees of sparsity depending on the setting. Thereby, the two computationally more demanding techniques, incorporating sparsity constraints while training and OMP, respectively, do not beat the two simpler direct techniques which rely on geometric heuristics, suggesting that adding these latter sparse approximations is worthwhile in applications. Using these techniques, we have taken a further step to bring kernel RSLVQ towards efficient methods with linear training time and constant classification effort, which preserve the interpretability of their vectorial counterparts.

Acknowledgments

This work has been supported by the DFG under grant number HA2719/7-1 and by the CITEC center of excellence.

References

[1] W. Arlt, M. Biehl, A.E. Taylor, S. Hahner, R. Libe, B.A. Hughes, P. Schneider, D.J. Smith, H. Stiekema, N. Krone, E. Porfiri, G. Opocher, J. Bertherat, F. Mantero, B. Allolio, M. Terzolo, P. Nightingale, C.H.L. Shackleton, X. Bertagna, M. Fassnacht, P.M. Stewart, Urine steroid metabolomics as a biomarker tool for detecting malignancy in adrenal tumors, J. Clin. Endocrinol. Metab. 96 (2011) 3775–3784.
[2] M. Biehl, K. Bunte, P. Schneider, Analysis of flow cytometry data by matrix relevance learning vector quantization, PLOS ONE 8 (2013) e59401.
[3] R. Boulet, B. Jouve, F. Rossi, N. Villa, Batch kernel SOM and related Laplacian methods for social network analysis, Neurocomputing 71 (7–9) (2008) 1257–1273.
[4] B.J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (5814) (2007) 972–976.
[5] A.M. Bruckstein, D.L. Donoho, M. Elad, From sparse solutions of systems of equations to sparse modeling of signals and images, SIAM Rev. 51 (1) (2009) 34–81.
[6] Y. Chen, E.K. Garcia, M.R. Gupta, A. Rahimi, L. Cazzanti, Similarity-based classification: concepts and algorithms, J. Mach. Learn. Res. 10 (2009) 747–776.
[7] M. Cottrell, B. Hammer, A. Hasenfuss, T. Villmann, Batch and median neural gas, Neural Netw. 19 (2006) 762–771.
[8] A. Denecke, H. Wersing, J.J. Steil, E. Körner, Online figure-ground segmentation with adaptive metrics in generalized LVQ, Neurocomputing 72 (7–9) (2009) 1470–1482.
[9] P. Frasconi, M. Gori, A. Sperduti, A general framework for adaptive processing of data structures, IEEE Trans. Neural Netw. 9 (5) (1998) 768–786.
[10] T. Gärtner, Kernels for structured data (Ph.D. thesis), Univ. Bonn, 2005.
[11] A. Gisbrecht, B. Mokbel, B. Hammer, The Nyström approximation for relational generative topographic mappings, in: NIPS Workshop on Challenges of Data Visualization, 2010.
[12] B. Hammer, A. Hasenfuss, Topographic mapping of large dissimilarity datasets, Neural Comput. 22 (9) (2010) 2229–2284.
[13] D. Hofmann, A. Gisbrecht, B. Hammer, Efficient approximations of kernel robust soft LVQ, in: WSOM, vol. 198, 2012, pp. 183–192.
[14] D. Hofmann, B. Hammer, Kernel robust soft learning vector quantization, in: ANNPR'12, 2012, pp. 14–23.
[15] T. Kietzmann, S. Lange, M. Riedmiller, Incremental GRLVQ: learning relevant features for 3D object recognition, Neurocomputing 71 (13–15) (2008) 2868–2879.
[16] S. Kirstein, H. Wersing, H.-M. Gross, E. Körner, A life-long learning vector quantization approach for interactive learning of multiple categories, Neural Netw. 28 (2012) 90–105.
[17] T. Kohonen, Self-Organizing Maps, 3rd edition, Springer, New York, 2000.
[18] T. Kohonen, P. Somervuo, How to make large self-organizing maps for nonvectorial data, Neural Netw. 15 (8–9) (2002) 945–952.
[19] M. Neuhaus, H. Bunke, Edit distance based kernel functions for structural pattern classification, Pattern Recognit. 39 (10) (2006) 1852–1863.
[20] B. Olshausen, D. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature 381 (1996) 607–609.
[21] E. Pekalska, R.P. Duin, The Dissimilarity Representation for Pattern Recognition. Foundations and Applications, World Scientific, Singapore, 2005.
[22] A.K. Qin, P.N. Suganthan, A novel kernel prototype-based learning algorithm, in: Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), Cambridge, UK, 2004, pp. 621–624.
[23] A. Sato, K. Yamada, Generalized learning vector quantization, in: NIPS, 1995.
[24] P. Schneider, M. Biehl, B. Hammer, Distance learning in discriminative vector quantization, Neural Comput. 21 (2009) 2942–2969.
[25] P. Schneider, M. Biehl, B. Hammer, Hyperparameter learning in probabilistic prototype-based models, Neurocomputing 73 (7–9) (2009) 1117–1124.
[26] S. Seo, K. Obermayer, Soft learning vector quantization, Neural Comput. 15 (2003) 1589–1604.
[27] S. Seo, K. Obermayer, Dynamic hyperparameter scaling method for LVQ algorithms, in: IJCNN, 2006, pp. 3196–3203.
[28] A. Vellido, J.D. Martin-Guerroro, P. Lisboa, Making machine learning models interpretable, in: ESANN'12, 2012, pp. 163–172.
[29] C.K.I. Williams, M. Seeger, Using the Nyström method to speed up kernel machines, Adv. Neural Inf. Process. Syst. 13 (2001) 682–688.
[30] X. Zhu, A. Gisbrecht, F.-M. Schleif, B. Hammer, Approximation techniques for clustering dissimilarity data, Neurocomputing 90 (2012) 72–84.

Daniela Hofmann received her Diploma in Computer Science from the Clausthal University of Technology, Germany. Since early 2012 she has been a Ph.D. student at the Cognitive Interaction Technology Center of Excellence at Bielefeld University, Germany.


Andrej Gisbrecht received his Diploma in Computer Science in 2009 from the Clausthal University of Technology, Germany, and continued there as a Ph.D. student. Since early 2010 he has been a Ph.D. student at the Cognitive Interaction Technology Center of Excellence at Bielefeld University, Germany.

Barbara Hammer received her Ph.D. in Computer Science in 1995 and her venia legendi in Computer Science in 2003, both from the University of Osnabrueck, Germany. From 2000 to 2004, she was the leader of the junior research group ‘Learning with Neural Methods on Structured Data’ at the University of Osnabrueck before accepting an offer as professor for Theoretical Computer Science at Clausthal University of Technology, Germany, in 2004. Since 2010, she has held a professorship for Theoretical Computer Science for Cognitive Systems at the CITEC cluster of excellence at Bielefeld University, Germany. Several research stays have taken her to Italy, the U.K., India, France, The Netherlands, and the U.S.A. Her areas of expertise include hybrid systems, self-organizing maps, clustering, and recurrent networks, as well as applications in bioinformatics, industrial process monitoring, and cognitive science. She is currently leading the IEEE CIS Technical Committee on Data Mining and the Fachgruppe Neural Networks of the GI.
