MACHINE LEARNING REPORTS
Workshop New Challenges in Neural Computation 2013
Report 02/2013
Submitted: 21.08.2013    Published: 02.09.2013
Barbara Hammer¹, Thomas Martinetz², Thomas Villmann³ (Eds.)
(1) CITEC – Centre of Excellence, University of Bielefeld, Germany
(2) Institute for Neuro- and Bioinformatics, University of Lübeck, Germany
(3) Faculty of Mathematics / Natural and Computer Sciences, University of Applied Sciences Mittweida, Germany
Machine Learning Reports http://www.techfak.uni-bielefeld.de/∼fschleif/mlr/mlr.html
Table of contents
New Challenges in Neural Computation – NC² 2013 (B. Hammer, T. Martinetz, T. Villmann) ..... 4
Challenges of high-dimensional data analysis from the application's perspective (U. Seiffert) ..... 5
Application of Maximum Distance Minimization to Gene Expression Data (J. Hocke, T. Martinetz) ..... 6
Practical Estimation of Missing Phosphorus Values in Pyhäjärvi Lake Data (A. Grigorievskiy, A. Akusok, M. Tarvainen, A.-M. Ventelä, A. Lendasse) ..... 8
Replacing the time dimension: A Self-Organizing Time Map over any variable (P. Sarlin) ..... 17
Image-based Classification of Websites (A. Akusok, A. Grigorievskiy, A. Lendasse, Y. Miche) ..... 25
Combining Multiple Classifiers and Context Information for Detecting Objects under Real-world Occlusion Patterns (M. Struwe, S. Hasler, U. Bauer-Wersing) ..... 35
Anticipating Intentions as Gestalt Formation: A Model Based on Neural Competition (M. Meier, R. Haschke, H.J. Ritter) ..... 43
Learning as an essential ingredient for a Tour Guide Robot (S. Hellbach, F. Bahrmann, M. Donner, M. Himstedt, M. Klingner, J. Fonfara, P. Poschmann, R. Schmidt, H.-J. Boehme) ..... 53
Learning Dialog Management for a Tour Guide Robot Using Museum Visitor Simulation (J. Fonfara, S. Hellbach, H.-J. Boehme) ..... 61
A Framework for Optimization of Statistical Classification Measures Based on Generalized Learning Vector Quantization (M. Kaden, T. Villmann) ..... 69
Classifier inspection based on different discriminative dimensionality reductions (A. Schulz, A. Gisbrecht, B. Hammer) ..... 77
Learning the Appropriate Model Population Structures for Locally Weighted Regression (S. Vukanovic, A. Schulz, R. Haschke, H.J. Ritter) ..... 87
Learning in Networks of Similarity Processing Neurons (L.A. Belanche) ..... 97
Human Activity Classification with Online Growing Neural Gas (M. Panzner, O. Beyer, P. Cimiano) ..... 106
About the Equivalence of Robust Soft Learning Vector Quantization and Soft Nearest Prototype Classification (D. Nebel, T. Villmann) ..... 114
Learning Orthogonal Bases for k-Sparse Representations (H. Schütze, E. Barth, T. Martinetz) ..... 119
New Challenges in Neural Computation NC² – 2013
Barbara Hammer¹, Thomas Martinetz², and Thomas Villmann³
1 – Cognitive Interaction Technology – Center of Excellence, Bielefeld University, Germany
2 – Institute for Neuro- and Bioinformatics, University of Lübeck, Germany
3 – Faculty of Mathematics / Natural and Computer Sciences, University of Applied Sciences Mittweida, Germany
The workshop New Challenges in Neural Computation, NC², took place for the fourth time, accompanying the prestigious GCPR (formerly DAGM) conference in Saarbrücken, Germany. The workshop centers on exemplary challenges and novel developments of neural systems, covering recent research on theoretical issues as well as practical applications of neural research. This year, fifteen contributions from international participants were accepted as short or long papers, covering diverse areas connected to data analysis, challenges in vision and robotics, prior knowledge integration, and local or sparse models, many of them connecting to this year's focus topic of learning interpretable models with neural techniques. In addition, we welcome an internationally renowned researcher, Prof. Dr. Udo Seiffert from Fraunhofer IFF, Magdeburg, who gives a presentation on 'Challenges of high-dimensional data analysis from the application's perspective'. This invitation became possible thanks to the sponsorship of the European Neural Networks Society (ENNS) and the German Neural Network Society (GNNS). Following the workshop, a meeting of the GI Fachgruppe on Neural Networks and of the GNNS took place. We would like to thank our international program committee for reviewing the contributions within a short period of time, the organizers of GCPR for their excellent support, as well as all participants for their stimulating contributions to the workshop.
Keynote talk: Challenges of high-dimensional data analysis from the application's perspective
Udo Seiffert, Fraunhofer IFF Magdeburg
Abstract: The analysis of high-dimensional data typically leads to various challenges. This talk will address a number of these challenges against the background of neural computation / machine learning from the perspective of applications – primarily hyperspectral image processing. Apart from several general and versatile concepts, a specific application might introduce further limitations on the applicability of general concepts, but often also contributes additional a priori information that eases prevailing problems. The demand for better interpretability of models and results becomes increasingly relevant.
Application of Maximum Distance Minimization to Gene Expression Data
Jens Hocke, Thomas Martinetz
Institute for Neuro- and Bioinformatics, University of Lübeck
1
Introduction
The k-Nearest-Neighbor (k-NN) [1] algorithm is a popular non-linear classifier. It is simple and easy to interpret. However, the often used Euclidean distance is an arbitrary choice, because the data dimensions are not scaled according to their relevance. Similar to relevance learning in the context of LVQ classifiers [2], the scaling of the dimensions can be adapted by feature weighting to improve the classification rate of k-NN. An optimal rescaling has to minimize the classification error E(X) of the k-NN algorithm. Often this problem is called the feature weighting problem. We want to find a weight vector w ∈

E = \sum_i d_J^2\big(x_i, \pi^{-1}(y_i)\big) = \sum_i \big(x_i - \pi^{-1}(y_i)\big)^\top J(x_i)\,\big(x_i - \pi^{-1}(y_i)\big)    (2)
The matrix J refers to the Fisher information matrix as defined in section 4. In contrast to a standard Euclidean error function, this cost function has the advantage that those dimensions in X which are locally relevant for the classification are emphasized. Minimization of these costs takes place by gradient techniques. In contrast to the error function used in [15], this function consists of only one term, such that a balance between an unsupervised and a supervised term does not need to be found (this has already been done during the computation of J).
4
Discriminative nonlinear visualization
The general framework yields reasonable results provided the data manifold is intrinsically low-dimensional and the DR method is capable of taking this information into account. For data manifolds with intrinsic dimensionality larger than two, a DR to two dimensions necessarily disrupts potentially relevant information. Due to this fact, it has been proposed in [15] to use a discriminative dimensionality reduction technique in step (I) which allows the user to explicitly specify which aspects of the data are regarded as relevant and which aspects of the data can be neglected in the considered setting. A variety of different discriminative DR techniques has been proposed, such as Fisher's linear discriminant analysis (LDA), partial least squares regression (PLS), informed projections [4], global linear transformations of the metric [6, 2], or kernelization of such approaches [11, 1]. A rather general idea to include supervision is to locally modify the metric [12, 5] by defining a Riemannian manifold which takes into account auxiliary information of the data and which measures the effect of directions of the data on this auxiliary information. This modified metric can then be plugged into many modern DR techniques such as t-SNE.

In our case, we assume that the auxiliary information is provided by discrete class information c assigned to x, where c captures aspects relevant for the class labeling, which should be visualized. This information can locally be incorporated into the distance computation by setting the quadratic form of the tangential space at x as

d_J^2(x, x + dx) = (dx)^\top J(x)\,(dx),

where J(x) is the local Fisher information matrix

J(x) = E_{p(c|x)}\left\{ \left( \frac{\partial}{\partial x} \log p(c|x) \right) \left( \frac{\partial}{\partial x} \log p(c|x) \right)^{\!\top} \right\}.    (3)

Thereby, p(c|x) denotes the probability of the class information c conditioned on x. These local distances can be extended to the entire manifold by taking minimum path integrals. Since exact minimization and integration is usually computationally intractable, approximations are used; a description of approximations with their corresponding advantages and disadvantages can be found in [12]. In the following experiments we will use piecewise linear distance approximations.

The Fisher information matrix can be directly included into all distance-based techniques. In this paper we integrate it into the popular non-parametric DR method t-Distributed Stochastic Neighbor Embedding (t-SNE). This method projects data points such that the probabilities of data pairs are preserved in the low-dimensional space [19]. A Gaussian distribution is assumed in the high-dimensional space and a Student-t distribution in the low-dimensional space, addressing the crowding problem. We will refer to Fisher t-SNE when we replace the Euclidean metric with the Fisher metric in t-SNE. Due to the supervised selection of information in the high-dimensional space, the variability of projections obtained by Fisher t-SNE is smaller than that of projections obtained by plain t-SNE.
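To make the construction concrete, the following Python/NumPy sketch (not part of the original paper) computes the local Fisher matrix of Eq. (3) and the induced quadratic distance from a user-supplied, differentiable estimate of p(c|x); the function names and the estimator interface are illustrative assumptions only.

import numpy as np

def fisher_information(x, class_probs, grad_class_probs):
    # class_probs(x)      -> array of shape (C,)   with p(c|x)
    # grad_class_probs(x) -> array of shape (C, d) with d/dx p(c|x)
    # (both are user-supplied estimators; hypothetical names)
    p = class_probs(x)
    dp = grad_class_probs(x)
    dlogp = dp / np.maximum(p, 1e-12)[:, None]          # d/dx log p(c|x)
    # E_{p(c|x)} [ (d/dx log p) (d/dx log p)^T ]
    return np.einsum('c,ci,cj->ij', p, dlogp, dlogp)

def fisher_sq_dist(x, dx, class_probs, grad_class_probs):
    # Quadratic form d_J^2(x, x + dx) = dx^T J(x) dx of Eq. (3)
    J = fisher_information(x, class_probs, grad_class_probs)
    return float(dx @ J @ dx)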
5
Estimating the Fisher information matrix
The central part of this modified distance computation consists in the estimation of the probability p(c|x) of information c given a data point x. In our setting, there are two essentially different possibilities for choosing this information: (a) We can use the given class labels c := l or ci := li for data point xi, respectively, as provided in the training set. This choice emphasizes the given 'ground truth' of the data, and, in consequence, the visualization shows in which regions the obtained classification is simple or complex as compared to this ground truth. (b) We can use the labeling as provided by the trained classifier, c := f(x). This choice emphasizes aspects of the data which are regarded by the classifier as interesting. Hence, those aspects of the data are visualized which influence the trained classification; as an example, one can detect regions of the data where points are regarded as virtually identical by the classifier. Apart from the different semantic meaning, this choice has consequences for how the probability p(c|x) can be computed. The Fisher matrix is based on the local change of the probability distribution p(c|x), which is usually unknown and needs to be approximated. A common way to do this is to use the Parzen window non-parametric estimator as proposed in [12]. This yields a correct estimation of the probability density but is computationally expensive, i.e. O(N²) for N data points, and, for finite data sets, the result depends on the chosen bandwidth of the estimator (to specify this parameter we use the estimator ĥ_rot provided in the literature, see e.g. [18]). The resulting estimator is differentiable and the derivatives are reported in [12], for example. As an alternative, the authors in [12] suggest faster methods to approximate p(c|x), which are, however, less accurate. Essentially, these methods substitute the non-parametric estimator by a parametric one such as a Gaussian mixture model, the parameters of which are optimized on the data. Hence, these techniques require an additional computational step to extract a model which approximates p(c|x). For the choice c := f(x), a more direct method can be used. The classifier itself provides a model for f(x), and, often, this output or its continuous counterpart r(x) also naturally provides a probabilistic model for p(f(x)|x). Besides the computational advantage of no longer needing to train an additional model, this alternative is also parameterless. Probabilities can be extracted from the value r(x) for popular models such as the SVM, see e.g. [23]; alternatives such as robust soft learning vector quantization [16] directly train a generative model for classification. In all cases, derivatives can either be determined analytically or numerically, e.g. simply using a linear approximation of the derivatives.
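A minimal sketch of the two estimation routes, assuming Gaussian Parzen kernels and a generic probabilistic model; the bandwidth handling, helper names and finite-difference step are illustrative assumptions and not the reference implementation.

import numpy as np

def parzen_class_posterior(x, X, labels, h):
    # Non-parametric Parzen estimate of p(c|x) with Gaussian kernels of
    # bandwidth h (cf. [12]); O(N) per query, O(N^2) over all points.
    w = np.exp(-0.5 * np.sum((X - x) ** 2, axis=1) / h ** 2)
    classes = np.unique(labels)
    p = np.array([w[labels == c].sum() for c in classes])
    return p / max(p.sum(), 1e-12)

def numerical_grad_posterior(x, prob_fn, eps=1e-4):
    # Simple central-difference gradient of p(.|x); usable when the
    # classifier's probabilistic output has no analytic derivative.
    x = np.asarray(x, dtype=float)
    p0 = prob_fn(x)
    grads = np.zeros((len(p0), len(x)))
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        grads[:, i] = (prob_fn(x + e) - prob_fn(x - e)) / (2 * eps)
    return grads

For the classifier-based choice c := f(x), prob_fn would simply wrap the classifier's predicted class probabilities, avoiding the quadratic Parzen pass.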
6
Experiments
In this section we illustrate the difference between estimating the probability distribution p(c|x) from the Parzen window estimator for the given labeling
Fig. 1. The three-dimensional data set shown from two different perspectives.
c := l and estimating p(c|x) from the trained classification model c := f (x) using a direct computation of the gradient based on the classifier f . For this purpose, we create an artificial data set which is intrinsically three-dimensional and, hence, cannot be projected to two dimensions without information loss. The data are arranged in a filled ball with a nonlinear class structure. The data set is shown in Fig. 1 from two perspectives. The blue class consists of a continuous tube with one gap and a noisy region. An unsupervised projection of this data set with t-SNE is shown in Fig. 2. The projection distorts the continuous class structure. This constitutes an obvious result since the data distribution is uniform and such doesn’t resemble the class structure. It illustrates that unsupervised visualization techniques might not always be well suited if intrinsically high-dimensional data should be projected to low dimensions. In this example, the displayed information looks almost arbitrary. Now we train the classifier Robust Soft LVQ (RSLVQ)[16] on this data set. This classification scheme learns a prototype based probabilistic model for the data such that the likelihood of correct classification is optimized. The resulting probability functions of the classifier are differentiable and, therefore, can be easily integrated into the Fisher metric framework discussed in section 4. We use four prototypes per class, which is small considering the complexity of the data set. The trained classifier achieves a classification accuracy of 90%. Now, a typical use case for the classifier visualization method occurs: How did the classification method solve this problem? Which simplifications of the data did the classifier use and which data points are regarded as similar by the classifier? In order to answer these questions we visualize the classifier using the original class labels li on the one hand and using the provided classification f (x) on the other hand. We build the visualization of the classifier on top of these two projections. The two resulting visualizations are depicted in Fig. 3. The left visualization is based on the Parzen window estimator for the class labels p(l|x): Basically, two clusters of points from class blue are shown and these are distinct
Fig. 2. Projected data with t-SNE.
from each other. The visualization quality of the classifier amounts to 92%. Interestingly, albeit this is not yet perfect, the visualization looks much more reasonable than direct unsupervised t-SNE on the data. The right visualization shows the same classifier, but this time based on the discriminative projection obtained by using the probabilities of the classifier itself, p(f(x)|x). The data from class two again form two clusters, but this time they are close to each other. The quality is estimated at 95%. Furthermore, the shape of the class boundaries more closely resembles the expected shape of the classifier, the latter usually being related to convex regions. In the visualization based on the ground truth, the original spherical shape of the data is much more pronounced. The Parzen window estimator used in the left visualization estimates the probability density accurately and finds the gap in the blue class tube. In this part of the data space, the class distribution changes rapidly and, therefore, the distances in this region grow large, which can directly be observed in the visualization. The prototype distribution does not fit the visualized classifier very well, since in one region of the blue class there are three prototypes of that class on top of each other and in another region there are none. But since the visualized class distribution is correct for a large part of the points, we can see from this visualization that the largest part of the blue class tube is classified correctly. In the right visualization, which is based on the classifier probabilities, the two parts of the tube lie close together. This implies that in the probability density of the classifier the class distribution does not change much, i.e. the data
Fig. 3. Two visualizations of the same RSLVQ classification model: the projection methods Fisher t-SNE based on the Parzen window estimator (left) and Fisher t-SNE based on the probabilities from the trained classifier (right) are applied.
lying in this gap of the tube are classified incorrectly. However, only because this information is included in the projection via the class probability p(f (x)|x), this fact can be seen directly in the visualization. This time, the location of the prototypes is plausible in relation to the data: the prototypes of the blue class are surrounded by those of the red class. Such a constellation is plausible in the original data space. From the latter visualization we can deduce more information as regards potential errors as compared to the previous one; we see directly the source of the remaining classification error: the classifier is not powerful enough and is not able to classify this gap in the data correctly. Furthermore, there are a few points from the blue class which lie in the cluster of points from the red class. From the perspective of this visualization we would deduce that these are either overlapping regions or too complex regions for our classifier (both aspects are probably correct: in the high-dimensional data we can see that there is indeed a region of overlapping classes). For this toy example we can verify our interpretation by visualizing the positions of the prototypes in the original data space. Fig. 4 depicts the original data set in conjunction with the prototypes of the classifier. The same positioning of the prototypes as in the low-dimensional visualization emerges: the prototypes of the blue class are surrounded by those of the red class.
7
Discussion
In this contribution, we have proposed two different ways to integrate auxiliary class labeling into a classifier visualization, and we discussed the different semantic meanings of the two techniques. We demonstrated in a simple toy example that a visualization based on a classifier not only eases the computational burden of density estimation for the Fisher metric, but also gives a clearer picture
Fig. 4. The original data set together with the prototypes of the trained classifier.
about the classification principles, as quantitatively evaluated by the accordance and as discussed by focusing on specific aspects of the data. This preliminary research opens the way towards a more thorough evaluation of the technique on benchmark examples and efforts to turn it into a ready-to-use visualization tool for the behavior of classifiers for high-dimensional data sets.
ACKNOWLEDGEMENTS Funding from DFG under grant number HA2719/7-1 and by the CITEC centre of excellence is gratefully acknowledged.
References
1. G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, 12:2385–2404, 2000.
2. K. Bunte, P. Schneider, B. Hammer, F.-M. Schleif, T. Villmann, and M. Biehl. Limited rank matrix learning, discriminative dimension reduction and visualization. Neural Networks, 26:159–173, 2012.
3. D. Caragea, D. Cook, H. Wickham, and V. Honavar. Visual methods for examining SVM classifiers. In Simoff et al. [17], pages 136–153.
4. D. Cohn. Informed projections. In S. Becker, S. Thrun, and K. Obermayer, editors, NIPS, pages 849–856. MIT Press, 2003.
5. A. Gisbrecht, D. Hofmann, and B. Hammer. Discriminative dimensionality reduction mappings, 2012.
6. J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems 17, pages 513–520. MIT Press, 2004.
7. J. Hernandez-Orallo, P. Flach, and C. Ferri. Brier curves: a new cost-based visualisation of classifier performance. In International Conference on Machine Learning, June 2011.
8. T. W. House. Big data research and development initiative, 2012.
9. A. Jakulin, M. Možina, J. Demšar, I. Bratko, and B. Zupan. Nomograms for visualizing support vector machines. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 108–117, New York, NY, USA, 2005. ACM.
10. J. A. Lee and M. Verleysen. Nonlinear dimensionality reduction. Springer, 2007.
11. B. Ma, H. Qu, and H. Wong. Kernel clustering-based discriminant analysis. Pattern Recognition, 40(1):324–327, 2007.
12. J. Peltonen, A. Klami, and S. Kaski. Improved learning of Riemannian metrics for exploratory analysis. Neural Networks, 17:1087–1100, 2004.
13. F. Poulet. Visual SVM. In C.-S. Chen, J. Filipe, I. Seruca, and J. Cordeiro, editors, ICEIS (2), pages 309–314, 2005.
14. S. Rüping. Learning Interpretable Models. PhD thesis, Dortmund University, 2006.
15. A. Schulz, A. Gisbrecht, and B. Hammer. Using nonlinear dimensionality reduction to visualize classifiers. In I. Rojas, G. J. Caparrós, and J. Cabestany, editors, IWANN (1), volume 7902 of Lecture Notes in Computer Science, pages 59–68. Springer, 2013.
16. S. Seo and K. Obermayer. Soft learning vector quantization. Neural Computation, 15(7):1589–1604, 2003.
17. S. J. Simoff, M. H. Böhlen, and A. Mazeika, editors. Visual Data Mining – Theory, Techniques and Tools for Visual Analytics, volume 4404 of Lecture Notes in Computer Science. Springer, 2008.
18. B. A. Turlach. Bandwidth Selection in Kernel Density Estimation: A Review. In CORE and Institut de Statistique, pages 23–493, 1993.
19. L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
20. A. Vellido, J. Martin-Guerroro, and P. Lisboa. Making machine learning models interpretable. In ESANN'12, 2012.
21. X. Wang, S. Wu, X. Wang, and Q. Li. SVMV – a novel algorithm for the visualization of SVM classification results. In J. Wang, Z. Yi, J. Zurada, B.-L. Lu, and H. Yin, editors, Advances in Neural Networks – ISNN 2006, volume 3971 of Lecture Notes in Computer Science, pages 968–973. Springer Berlin / Heidelberg, 2006.
22. M. Ward, G. Grinstein, and D. A. Keim. Interactive Data Visualization: Foundations, Techniques, and Applications. A. K. Peters, Ltd, 2010.
23. T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Res., 5:975–1005, Dec. 2004.
Learning the Appropriate Model Population Structures for Locally Weighted Regression
Slobodan Vukanović, Alexander Schulz, Robert Haschke, and Helge Ritter
Cognitive Interaction Technology – Center of Excellence (CITEC), Bielefeld University, Bielefeld, Germany
Abstract. Local learning approaches subdivide the learning space into regions to be approximated locally by linear models. An arrangement of regions that conforms to the structure of the target function can lead to learning with fewer resources and gives an insight into the function being approximated. This paper introduces a covariance-based update for the size and shape of each local region that is able to exploit linear substructures in the target function. Initial tests show that the method improves the structuring capabilities of state-of-the-art statistics-based local learners, producing model populations that are shaped appropriately to the functions being approximated.
Keywords: Function Approximation, Locally Weighted Regression, Input Space Structuring
1
Introduction
Locally weighted regression (LWR) approximates the target function with a population of linear models, each localized in the input space. The output to a given query is computed as a weighted (usually Gaussian) combination of the outputs of the local models (Atkeson et al. [1997]). The principal advantage that LWR offers over classical function approximation is its independence from the choice of the class of the approximating function. This leads to learning systems that are robust to destructive interference, and whose complexity can be increased incrementally by adding new models. LWR is thus popular for learning in online scenarios, such as the ones often encountered in robotics (Sigaud et al. [2011]). The locality of each model is defined by its position in the input space and a distance metric, which is usually adapted as the learning progresses. The adaption mechanism often aims to structure the input space by exploiting the linear subspaces that are local to each model (Ormoneit and Hastie [1999]). A reasonable structuring of the input space offers insights into the structure of the function being approximated, and it may enable simpler functions to be learned with fewer models. It has been demonstrated (Stalph et al. [2010]) that the statistics–based Locally Weighted Projection Regression (LWPR, Vijayakumar et al. [2005]), a popular local learning system (Sigaud et al. [2011]) is not capable of structuring
the input space well. LWPR produces tight–fitting, highly non–overlapping spherical populations from which the structure is difficult to grasp. This paper describes a work in progress1 on improving the structuring capabilities of statistics–based LWR. We introduce a weighted covariance distance metric update for approaches that utilize ellipsoidal neighborhoods, such as LWPR and Local Linear Maps (LLM, Ritter et al. [1992]), a local learner that does not aim to structure the input space. Training samples are weighted according to their current local prediction error and the proximity to each model, and are either included in the distance metric update or ignored. The initial tests show that such an update improves the structuring over the existing distance metric updates, producing populations that visually resemble the target function. Additionally, the tests show that the improved structuring obtained by our method can lead to reductions in population sizes and thus the number of resources necessary for accurate learning. The following section gives an overview of the distance metric adaption in LWPR and LLM. Section 3 describes our weighted covariance distance metric update method. Section 4 reports the results of an initial comparison between our approach and the default distance metric updates in LWPR. Section 5 concludes.
2
Distance Metric Adaption in LWPR and LLM
LWPR, a descendant of the Receptive Field Weighted Regression (RFWR, Schaal and Atkeson [1997]), performs input dimensionality reduction through partial least squares (PLS) regression. The population size is incrementally increased by allocating models in parts of the input space not covered by existing models. The pseudocode is given in Algorithm 1.
Algorithm 1 The LWPR pseudocode. The PLS regression update (line 5) is beyond the scope of this paper (details can be found in Vijayakumar et al. [2005]). 1: Initialize LWPR with no local models 2: for a new training sample (x, y) do 3: for each local model M do 4: Calculate the activation of M by (x, y) 5: Update the regression of M 6: Update the distance metric D of M (Equation 2) 7: end for 8: if no model was activated by more than the wgen parameter then 9: Create a new model centered at x and initialize it using the default parameters 10: end if 11: end for 1
¹ Parts of this work have been submitted to the Thirteenth International Conference on Advances in Artificial Intelligence, AI*IA 2013.
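The following Python sketch mirrors the allocation logic of Algorithm 1; the Gaussian form of the activation and the stubbed regression and metric updates are assumptions made for illustration, not LWPR's reference implementation.

import numpy as np

class LocalModel:
    def __init__(self, center, dim, init_D=500.0):
        self.c = np.asarray(center, dtype=float)
        self.D = np.eye(dim) * init_D            # distance metric of the RF

    def activation(self, x):
        diff = x - self.c
        # assumed Gaussian receptive-field activation in [0, 1]
        return float(np.exp(-0.5 * diff @ self.D @ diff))

def lwpr_step(models, x, y, w_gen=0.05):
    # One pass over Algorithm 1; regression and metric updates are stubbed.
    x = np.asarray(x, dtype=float)
    activations = []
    for m in models:
        w = m.activation(x)
        activations.append(w)
        # update_regression(m, x, y, w)        # PLS update, Vijayakumar et al. [2005]
        # update_distance_metric(m, x, y, w)   # Equation (2)
    if not activations or max(activations) <= w_gen:
        models.append(LocalModel(x, len(x)))   # allocate a new RF centered at x
    return models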
The receptive field (RF) of each local model is described by an ellipsoid fixed at a point c in the input space and a distance metric D, a positive semi-definite matrix:

(x - c)^\top D (x - c) = 1    (1)

A function of c, D, and an incoming training sample returns an activation weight w ∈ [0, 1] for that sample. The higher the activation w, the more influence the sample has on the model during training. To try to adapt a receptive field's shape and size to the local linear region, D is Cholesky-decomposed into M, which is then trained by minimizing the cost function J of the leave-one-out cross-validation error for each iteration step t and a given learning rate α:

M_t = M_{t-1} + \alpha \frac{\partial J}{\partial M_{t-1}}, \qquad D_t = M_t^\top M_t    (2)
Unlike LWPR, the default LLM implementation uses a fixed number of models whose positions in the input space can be adapted during training, e.g. by vector quantization. The local models learn to approximate the underlying function using ordinary least squares regression. The default distance metric update adapts each receptive field along the input dimensions, increasing the coverage of the input space:

d_t = d_{t-1} + \alpha \left( \lVert x_t - c_t \rVert - d_{t-1} \right), \qquad D_t = \mathrm{diag}[d_t]^{-1}    (3)
As no information about the output space is given to the update, the populations of the models after training do not resemble the structure of the target function.
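Read per input dimension (with |x_k − c_k| in place of the norm, which is our assumption for illustration), Eq. (3) amounts to the following small update:

import numpy as np

def llm_metric_update(d, x, c, alpha):
    # Eq. (3), applied element-wise: the radii d track the mean absolute
    # offset of the samples from the RF center c (assumed per-dimension reading).
    d = d + alpha * (np.abs(x - c) - d)
    return d, np.diag(1.0 / d)      # D_t = diag[d_t]^{-1}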
3
The Covariance–based Distance Metric Update
We consider a representation of the receptive fields identical to LWPR (Equation 1). Given a training sample (x_t, y_t) at step t, we model the distance metric D_t of each receptive field as the inverse of the incremental covariance Σ_t:

D_t = (\Sigma_t)^{-1}, \qquad \Sigma_t = \frac{S_t}{\Phi_t}    (4)
S_t is the matrix of weighted incremental sums (employing the stable covariance update from Finch [2009]):

S_t = (1 - \gamma) \cdot S_{t-1} + \gamma \cdot \phi(p) \cdot (x_t - c)(x_t - c)^\top    (5)
where γ ∈ [0, 1] is the forgetting factor, φ(p) is the weighting function with parameters p, and c is the center of the receptive field. Φ_t is the sum of weights:

\Phi_t = (1 - \gamma) \cdot \Phi_{t-1} + \gamma \cdot \phi(p)    (6)
The weighting function φ should be such that the local training samples that produce a small error are included in the covariance update, and the local samples that produce a large error are excluded. The derivation of φ with these characteristics is given below, followed by a brief discussion on the values of the parameter vector p.
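A compact sketch of the update of Eqs. (4)-(6), with the weighting function passed in as a callable evaluated outside; the initialization constants below are placeholders rather than values from the paper.

import numpy as np

class CovarianceMetric:
    """Incremental weighted covariance update of Eqs. (4)-(6)."""
    def __init__(self, center, dim, gamma=5e-4, init_scale=1e-2):
        self.c = np.asarray(center, dtype=float)
        self.S = np.eye(dim) * init_scale     # weighted incremental sums S_t
        self.Phi = 1.0                        # sum of weights Phi_t
        self.gamma = gamma                    # forgetting factor

    def update(self, x, weight):
        diff = (np.asarray(x, dtype=float) - self.c)[:, None]
        self.S = (1 - self.gamma) * self.S + self.gamma * weight * (diff @ diff.T)
        self.Phi = (1 - self.gamma) * self.Phi + self.gamma * weight

    @property
    def D(self):
        # D_t = (Sigma_t)^{-1}, Sigma_t = S_t / Phi_t
        return np.linalg.inv(self.S / self.Phi)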
3.1
The Weighting Function and its Parameters
The covariance update will prompt a receptive field to shrink to include samples that have large weights assigned by φ and a high activation w. Similarly, a receptive field will grow to include samples that have large weights and a low w. The samples with very small weights will be excluded from the covariance update regardless of their location, prompting no change in the distance metric. These dynamics must be taken into account for weight assignment: φ needs to be a function of both the current local prediction error and the activation. Given the magnitude of the error err (low or high) and the proximity g = 1−w of a training sample to the center of the field (high, low, or not local), we identify five cases A1 , . . . , A5 (shown in Figure 1(a)) to which that sample can belong. The cases A1 , . . . , A4 are modeled by functions A1 (p1 ), . . . , A4 (p4 ) that have the value 1 in the corresponding regions. The weighting function φ is written as a combination of A1 (p1 ), . . . , A4 (p4 ) multiplied by A5 (p5 ), which has the value 0 in the region A5 . Figure 1(b) shows a weighting function where A1 = A3 = A5 = 0 and A2 = A4 = 1.
Fig. 1. (a) Regions in which a training sample may fall: high error and low proximity (A1 ), high error and high proximity (A2 ), low error and high proximity (A3 ), low error and low proximity (A4 ), or not local (A5 ). The parameters kerr , kin , and kout control the extents of the regions. (b) A weighting function φ with A1 = A3 = A5 = 0 and A2 = A4 = 1.
Let σ⁺ and σ⁻ be one-dimensional logistic functions:

\sigma^+(x, s, k) = \frac{1}{1 + e^{-s(|x| - k)}}, \qquad \sigma^-(x, s, k) = \frac{-1}{1 + e^{-s(|x| - k)}} + 1    (7)
where s is the slope and k is the position of each sigmoid. The functions A_i(p_i) with parameters p_i are products of the sigmoids:

A_1(err, s_1^{err}, k_1^{err}, g, s_1^{g}, k_1^{g}) = \sigma^+(err, s_1^{err}, k_1^{err}) \, \sigma^+(g, s_1^{g}, k_1^{g})
A_2(err, s_2^{err}, k_2^{err}, g, s_2^{g}, k_2^{g}) = \sigma^+(err, s_2^{err}, k_2^{err}) \, \sigma^-(g, s_2^{g}, k_2^{g})
A_3(err, s_3^{err}, k_3^{err}, g, s_3^{g}, k_3^{g}) = \sigma^-(err, s_3^{err}, k_3^{err}) \, \sigma^-(g, s_3^{g}, k_3^{g})
A_4(err, s_4^{err}, k_4^{err}, g, s_4^{g}, k_4^{g}) = \sigma^-(err, s_4^{err}, k_4^{err}) \, \sigma^+(g, s_4^{g}, k_4^{g})
A_5(g, s_5^{g}, k_5^{g}) = \sigma^-(g, s_5^{g}, k_5^{g})    (8)

where s_i^{err}, k_i^{err} are the slope and the position in the error direction and s_i^{g}, k_i^{g} are the slope and the position in the proximity direction of A_i. Ideally, φ should assign large weights to high proximity samples producing high errors to trigger shrinking (A_2), it should assign large weights to low proximity samples producing low errors to trigger growth (A_4), and it should assign zero weights in all other cases (A_1 and A_3):

\phi(p) = A_5(p_5) \cdot \big( A_2(p_2) + A_4(p_4) \big)    (9)
The parameter values that merge or produce gaps between A_2 and A_4 cause unwanted behavior. This is avoided by grouping like parameters: s_2^{err} and s_4^{err} into s^{err}, k_2^{err} and k_4^{err} into k^{err}, s_2^{g}, s_4^{g}, and s_5^{g} into s^{g}, k_2^{g} and k_4^{g} into k^{in}, and k_5^{g} into k^{out}. The parameters are thus:

p   = (err, s^{err}, k^{err}, g, s^{g}, k^{in}, k^{out})
p_2 = (err, s^{err}, k^{err}, g, s^{g}, k^{in})
p_4 = (err, s^{err}, k^{err}, g, s^{g}, k^{in})
p_5 = (g, s^{g}, k^{out})    (10)

3.2
Setting Parameter Values
Given the weighting function φ(p) (Equation 9) and a reasonable choice of the values for the parameters p (Equation 10), a receptive field can shape itself to cover a region up to an error of k^{err}. The following discussion is based on observations and results of trial learning runs. The k^{err} parameter determines which samples are included in the distance metric update. Smaller values result in (generally) smaller fields that approximate the underlying region well, while larger values include more points and thus cover more space. k^{err} requires tuning, as it affects the overall error and the size of the fields (and thus their number, given the model allocation strategy in LWPR). The k^{in} parameter determines the proximity at which the weights change between exclusion and inclusion. Smaller values cause inclusion of the points with a high proximity, shrinking the field too much. Larger values assign large weights to points with a high error that are at a lower proximity, possibly making the field grow incorrectly. A good value for k^{in} was found to be around 1 standard deviation, in the interval [0.30, 0.40]. The k^{out} parameter determines the locality of the field with respect to the distance metric update. Smaller values void the growth, making the fields only
shrink, while larger values enable the fields to observe more of the input space. The update performed as desired when the value of k out was set in the interval [0.80, 0.95]. Finally, the slope parameters serr and sg determine the rate of the transition of other parameters. Smaller values cause smoother, slower updates to the distance metric, while larger values lead to quick, abrupt updates. A reasonable setting for serr is ≥ 100, since a sharp threshold between the errors is required. sg should be set in the range [30, 50], as abrupt updates are not desirable.
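Putting Eqs. (7)-(10) together, a sketch of the weighting function with the parameter settings suggested above; k^{err} is task-dependent, so the default below is only a placeholder.

import numpy as np

def sigma_plus(x, s, k):
    return 1.0 / (1.0 + np.exp(-s * (np.abs(x) - k)))   # -> 1 above threshold k

def sigma_minus(x, s, k):
    return 1.0 - sigma_plus(x, s, k)                     # -> 1 below threshold k

def phi(err, g, s_err=500.0, k_err=0.01, s_g=30.0, k_in=0.4, k_out=0.8):
    # Weighting function of Eq. (9): phi = A5 * (A2 + A4)
    a2 = sigma_plus(err, s_err, k_err) * sigma_minus(g, s_g, k_in)   # shrink case
    a4 = sigma_minus(err, s_err, k_err) * sigma_plus(g, s_g, k_in)   # grow case
    a5 = sigma_minus(g, s_g, k_out)                                  # locality gate
    return a5 * (a2 + a4)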
4
Evaluation
We evaluate our distance metric update by comparing it to the default LWPR distance metric update (Equation 2) on learning two two-dimensional functions (shown in Figure 2) defined on the [−1, 1]² interval:

1. the Crossed Ridge function, containing a mixture of linear and non-linear regions,
   y = \max\big[\exp(-10 x_1^2),\; \exp(-50 x_2^2),\; 1.25 \exp(-5 (x_1^2 + x_2^2))\big], and
2. the Sine function, which is linear in the (1, −1) direction but highly non-linear in the perpendicular (1, 1) direction,
   y = \sin(2\pi (x_1 + x_2))
Fig. 2. The two target functions.
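For reference, the two target functions and the sampling scheme used in the evaluation can be reproduced as follows (a sketch; the seed and helper names are ours):

import numpy as np

def crossed_ridge(x1, x2):
    return np.maximum.reduce([np.exp(-10 * x1**2),
                              np.exp(-50 * x2**2),
                              1.25 * np.exp(-5 * (x1**2 + x2**2))])

def sine(x1, x2):
    return np.sin(2 * np.pi * (x1 + x2))

def sample(fn, n=50000, seed=0):
    # noiseless samples drawn uniformly from [-1, 1]^2, as in the evaluation
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    return X, fn(X[:, 0], X[:, 1])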
We do not include the default LLM distance metric update (Equation 3) in the evaluation, as it only aims to provide coverage of the input space without structuring it. Two implementations of our distance metric update (Equation 4) are incorporated in the comparison:

1. CovLWPR, replacing the default LWPR distance metric update by our covariance update, and
2. CovLLM, replacing the default LWPR distance metric update by our covariance update and the default LWPR PLS regression with the LLM regression.
Of interest is the comparison of accuracy, the populations, and the input space structuring between LWPR, CovLWPR, and CovLLM. The metrics include the mean absolute error (MAE), the population size, the visually judged structuring of the input space, and the average field volume. Ten independent runs of the three systems on each function were performed. Each run consisted of 50,000 learning iterations of noiseless samples from a uniform random distribution in the input space. The values for the non-default parameters used are summarized in Figure 3. The values of k^{err} used in CovLWPR and CovLLM were 0.0080 for the Crossed Ridge and 0.0350 for the Sine. The performance after learning is shown in Figure 4, and the resulting population plots are given in Figure 5.
(a) LWPR
  Parameter   Value   Meaning
  init_D      500     Initial size of RFs
  alpha       200     Distance metric learning rate
  w_gen       0.05    RF insertion threshold
  w_prune     0.75    RF removal threshold

(b) The weighting function φ
  Parameter   Value
  k_in        0.4
  k_out       0.8
  s_err       500
  s_g         30
  γ           5 × 10⁻⁴

(c) LLM
  Parameter   Value   Meaning
  A           0.45    Linear model learning rate
  out         0.45    Output weights learning rate

Fig. 3. Values for the non-default parameters used in the evaluation.
CovLLM and LWPR reach the same error on both functions. CovLLM produces a smaller final population with a greater average field volume than LWPR. This is due to the growth resulting from our distance metric update and the consequent pruning of overlapping fields. The population plots illustrate that the final structure of the CovLLM populations resemble the learned functions, while it is difficult to grasp the structure of the functions from LWPR’s populations. LWPR tends to produce circular receptive fields by shrinking them from all directions. Similar characteristics have been reported in Stalph et al. [2010]. CovLWPR is the least accurate of the three systems. Unlike the regression in LLM, the PLS regression retains the history of the training samples seen previously, making it difficult to recover from incorrect shape updates while the regression is still being trained. This restrains the system from learning responsively, and is illustrated both by the higher error and the population plots. Although a rough structure can be identified, the resulting fields are not as slender in the non–linear directions, they are not as long in the linear directions,
(a) The Crossed Ridge results
            MAE                        Number of RFs   Average Volume
  LWPR      1.17 × 10⁻² ± 1.34 × 10⁻⁷   220.4 ± 15.1    1.36 × 10⁻² ± 2.03 × 10⁻⁹
  CovLLM    1.13 × 10⁻² ± 2.07 × 10⁻⁷   159.8 ± 20.6    1.58 × 10⁻² ± 3.28 × 10⁻⁷
  CovLWPR   1.23 × 10⁻² ± 2.65 × 10⁻⁷   172.6 ± 11.6    1.49 × 10⁻² ± 1.46 × 10⁻⁷

(b) The Sine results
            MAE                        Number of RFs   Average Volume
  LWPR      2.64 × 10⁻² ± 3.80 × 10⁻⁷   220.4 ± 10.7    1.26 × 10⁻² ± 1.53 × 10⁻⁹
  CovLLM    2.67 × 10⁻² ± 3.95 × 10⁻⁶   138.3 ± 25.6    1.91 × 10⁻² ± 4.43 × 10⁻⁷
  CovLWPR   4.25 × 10⁻² ± 2.07 × 10⁻⁵   187.4 ± 28.3    1.25 × 10⁻² ± 1.94 × 10⁻⁷

Fig. 4. The prediction errors, population sizes, and field volumes (E ± σ) after learning.
Fig. 5. The population plots for a sample run (columns: LWPR, CovLLM, CovLWPR). (a) Populations after learning the Crossed Ridge. (b) Populations after learning the Sine.
and some fields are misaligned. The amount of history kept can be reduced but, unfortunately, the PLS regression in such cases becomes unreliable, causing the mutual dependency between the shape and the regression to interfere with the ability to learn.
5
Conclusion
We introduced a weighted covariance distance metric update for statistics–based LWR algorithms. The initial tests show that the method allows the receptive fields to exploit the local linear substructures of target functions more readily than the default LWPR update. This property may lead to accurate learning with smaller populations, particularly of functions that exhibit large linear substructures. In addition, the structure of the target function can be inferred from model populations. The future work will focus on studying the relationship between the parameters of the weighting function used in the covariance update and the global error, and the convergence behavior of the distance metric update. We will also evaluate the method in learning higher–dimensional problems.
Bibliography
C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11:11–73, 1997.
T. Finch. Incremental calculation of weighted mean and variance. University of Cambridge, 2009.
D. Ormoneit and T. Hastie. Optimal kernel shapes for local linear regression. In Advances in Neural Information Processing Systems, pages 540–546. MIT Press, 1999.
H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps – An Introduction. Computation and Neural Systems Series. Addison-Wesley, 1992. ISBN 978-0-201-55443-4.
S. Schaal and C. G. Atkeson. Constructive incremental learning from only local information. Neural Computation, 10:2047–2084, 1997.
O. Sigaud, C. Salaun, and V. Padois. On-line regression algorithms for learning mechanical models of robots: a survey. Robotics and Autonomous Systems, 59(12):1115–1129, December 2011.
P. O. Stalph, J. Rubinsztajn, O. Sigaud, and M. V. Butz. A comparative study: function approximation with LWPR and XCSF. In GECCO (Companion), pages 1863–1870, 2010.
S. Vijayakumar, A. D'Souza, and S. Schaal. Incremental online learning in high dimensions. Neural Computation, 17:2602–2634, 2005.
Learning in Networks of Similarity Processing Neurons
Lluís A. Belanche
Computer Science School – Dept. of Software, Technical University of Catalonia, Jordi Girona 1-3, 08034 Barcelona, Catalonia, Spain
[email protected]
Abstract. Similarity functions are a very flexible container under which to express knowledge about a problem as well as to capture the meaningful relations in input space. In this paper we describe ongoing research using similarity functions to find more convenient representations for a problem –a crucial factor for successful learning– such that subsequent processing can be delivered to linear or non-linear modeling methods. The idea is tested in a set of challenging problems, characterized by a mixture of data types and different amounts of missing values. We report a series of experiments testing the idea against two more traditional approaches, one ignoring the knowledge about the dataset and another using this knowledge to pre-process it. The preliminary results demonstrate competitive or better generalization performance than that found in the literature. In addition, there is a considerable enhancement in the interpretability of the obtained models. Key words: Similarity representations; Classification; Neural Networks
1
Introduction
The intuitive notion of similarity is very useful to group objects under specific criteria and has been used with great success in several fields like Case Based Reasoning [1] or Information Retrieval [2]. Interest around purely similarity-based techniques has never faded away; on the contrary, it has grown considerably since the appearance of kernel-based methods [3]. In learning systems, a non-written principle states that similar inputs should have similar outputs for the model to be successful. While this is no guarantee of good performance – especially near class boundaries, where the principle is violated – it certainly is a sine qua non condition. If the principle is not true, generalization becomes almost impossible. For a learning system, the trick is then to capture (that is, to learn) meaningful similarity relations with respect to the prescribed target variable. Specific similarity functions from the point of view of data analysis have been used with success since the early days of pattern recognition. Modern modelling problems are difficult for a number of reasons, including dealing with mixtures of data types and a significant amount of missing information [4]. For example,
in the well-known UCI repository [5] over half of the problems contain explicitly declared nominal variables, let alone other data types (e.g., ordinal), usually unreported. In many cases this heterogeneous information has to be encoded in the form of real-valued quantities, although there is often enough domain knowledge to characterize the nature of the variables. The aim of this paper is to demonstrate the learning abilities of simple layered architectures, where the first layer computes a user-defined similarity function between inputs and weights. The basic idea is that a combination of partial similarity functions, comparing variables independently, is more capable of capturing the specific properties of a heterogeneous dataset than other methods, which require a priori data transformations. An appealing advantage is found in the enhanced interpretability of the models, so often neglected in the neural network community. In order to develop the idea, we propose to compute the similarities among the elements in the learning dataset, and then use a reduction method to select a small subset thereof. These selected observations are the centers of the first hidden layer. In other words, the first hidden layer is a change of the representation space from the original feature space to a similarity space [6].
2
Methodology
2.1
Preliminaries
We depart from a training data matrix D_{N×d} composed of N observations x described by d variables, plus a target matrix D_{N×t} containing the known targets of the N observations. For simplicity, in this paper we concentrate on classification problems only (two-class or multiclass) and set t = 1. Given a similarity function s, we first compute the associated symmetric similarity matrix S_{N×N}, where S_{ij} = s(x_i, x_j). An algorithm is then needed to select a number d′ of prototypes that best represent the learning data in the following sense:

1. the prototypes must be known elements of the input space (observations);
2. all observations that are not prototypes must show a high similarity to (only) one of the prototypes in relation to their similarity to the other prototypes;
3. d′ should be set much smaller than N.

This is an ideal task for a clustering algorithm, although not all clustering methods are adequate, and alternative techniques are certainly possible. As an example, artificial immune systems have been used for prototype selection tasks using nearest-neighbor classifiers [7]. The result of this process is a layer of d′ units, which we call S-neurons. Any learning method can now operate in the representation space spanned by the layer of d′ S-neurons. However, it pays to start with linear methods, both for computational (they are fast) and analytical (they have a single optimum) reasons. It also turns out that the resulting model can be much more interpretable.
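As an illustration of the steps just described, the following sketch selects d′ prototypes from a precomputed similarity matrix with a greedy medoid heuristic standing in for PAM, and shows how the similarity-space representation is formed; it is not the authors' implementation, and the helper names are ours.

import numpy as np

def select_prototypes(S, d_prime):
    # Greedy medoid selection on a precomputed similarity matrix S (N x N):
    # repeatedly add the observation that most increases the total similarity
    # of all points to their closest prototype (a simple stand-in for PAM).
    N = S.shape[0]
    medoids, best = [], np.full(N, -np.inf)
    for _ in range(d_prime):
        gains = np.maximum(S.T, best).sum(axis=1)   # candidate k: sum_i max(best_i, S[i, k])
        gains[medoids] = -np.inf
        k = int(np.argmax(gains))
        medoids.append(k)
        best = np.maximum(best, S[:, k])
    return medoids

# New representation: each column is one S-neuron s(x_i, prototype_j);
# Z = S[:, medoids] is then fed to a linear model (LogReg, LDA) or an SVM.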
2.2
Detailed description
Let us represent the observations as belonging to a space X ≠ ∅ as a vector x of d components, where each component x_k represents the value of a particular feature k. A similarity measure is a unique number expressing how "like" two observations are, given these features. It can be defined as an upper bounded, exhaustive and total function s : X × X → I_s ⊂ ℝ such that I_s has at least two different elements (therefore I_s is upper bounded and s_max ≡ sup I_s exists).
A basic but very useful S-neuron can be devised using a Gower-like similarity index, well-known in the literature on multivariate data analysis [8]. For any two vector objects xi , xj to be compared on the basis of feature k, a score sijk is defined, described below. First set δijk = 0 when the comparison of xi , xj cannot be performed on the basis of feature k for some reason; for example, by the presence of missing values, by the feature semantics, etc; δijk = 1 when such comparison is meaningful. If δijk = 0 for all the features, then s(xi , xj ) is undefined. The partial scores sijk are defined as follows: Binary (dichotomous) variables indicate the presence/absence of a trait, marked by the symbols + and −. Their similarities are computed according to Table 1, leading to a partial coefficient introduced by Jaccard and well known in numerical taxonomy as the Jaccard Coefficient [9].
Table 1: Similarities for dichotomous (binary) variables.
                    Values of feature k
  Object x_i      +     +     −     −
  Object x_j      +     −     +     −
  s_ijk           1     0     0     0
  δ_ijk           1     1     1     0
Categorical variables can take a number of discrete values, which are commonly known as modalities. For these variables no order relation can be assumed. Their overlap similarity is s_ijk = 1 if x_ik = x_jk and s_ijk = 0 if x_ik ≠ x_jk. Real-valued variables are compared with the standard metric in ℝ: s_ijk = 1 − |x_ik − x_jk| / R_k, where R_k is the range of feature k (the difference between the maximum and minimum values). The overall coefficient of similarity is defined as the average score over all partial comparisons:

S_{ij} = s(x_i, x_j) = \frac{\sum_{k=1}^{n} s_{ijk}\,\delta_{ijk}}{\sum_{k=1}^{n} \delta_{ijk}}
Ignorance of the absent elements and normalization by the number of the present ones has been found superior to other treatments in standard data anal-
ysis experiments [10]1 . This coefficient has been extended to deal with other data types, like ordinal and circular variables [11]. Notice that we now have smax = 1. As for the clustering, we choose the PAM algorithm, which partitions data into k clusters, very much like k-means. However, PAM offers two advantages: first, the cluster centers (called medoids) are chosen among the data points; second, the algorithm first looks for a good initial set of medoids and then finds a suboptimal solution such that there is no single switch of an observation with a medoid that will decrease the reconstruction error (the sum of distances of the observations to their closest medoid). The algorithm is fully described in [12].
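A direct transcription of the partial scores and the overall coefficient into Python might look as follows (a sketch; the type encoding, the 0/1 coding of binary traits and the missing-value convention are assumptions):

import numpy as np

def gower_similarity(xi, xj, kinds, ranges):
    # kinds[k] in {'bin', 'cat', 'real'}; ranges[k] is the range R_k of a real
    # feature; missing values are encoded as None or NaN; binary traits as 0/1.
    num, den = 0.0, 0.0
    for k, kind in enumerate(kinds):
        a, b = xi[k], xj[k]
        missing = a is None or b is None \
            or (isinstance(a, float) and np.isnan(a)) \
            or (isinstance(b, float) and np.isnan(b))
        if missing or (kind == 'bin' and a == 0 and b == 0):
            continue                          # delta_ijk = 0: comparison not counted
        if kind == 'real':
            s = 1.0 - abs(a - b) / ranges[k]  # standard metric scaled by the range
        else:                                 # 'bin' (Jaccard) and 'cat' (overlap)
            s = 1.0 if a == b else 0.0
        num += s
        den += 1.0
    return num / den if den > 0 else np.nan   # undefined if no feature is comparable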
3
Experiments
In this section we report on experimental work in which the previous ideas are applied both to linear learners – logistic and multinomial regression (LogReg and Multinom) and linear discriminant analysis (LDA) – and also to a non-linear learner (a standard SVM using the RBF kernel). All methods are deterministic and use the same data partitions. The smoothing parameter in the RBF kernel is estimated using the sigest method, based upon the 10% and 90% quantiles of the sample distribution of ∥x_i − x_j∥² [13]; the cost parameter C is set to 1. All datasets are split into learning and test parts (respecting original partitions, if available). For missing value imputation, we use the Multivariate Imputation by Chained Equations (MICE) method [14], which generates multiple imputations for incomplete multivariate data by Gibbs sampling. This method is attractive because, if the data contains categorical variables, these are also used in the regressions on the other variables. We study three approaches:

raw: There is no effort in identifying variable types (all information is considered numerical, and scaled); missing values are either not identified or left as they come (for example, treated as zeros).
std: All variable types are properly identified; non-numerical information is binarized with a standard dummy code [15]. Missing values are identified and imputed with MICE.
sim: Same as std, with a first layer of S-neurons, as described in Section 2.2; then PAM selects d′ = ⌊0.05 · N⌋ prototypes in the learning part. Notice that, in this case, the model has the architecture of a neural network.

3.1
Datasets
Some challenging problems have been selected as characteristic of modern modeling datasets because of the diversity in data heterogeneity and the presence of
¹ It is not difficult to realize that this is equivalent to the replacement of the missing similarities by the average of the non-missing ones. Therefore, the conjecture is that the missing values, if known, would not change the overall similarity significantly.
missing values. The problem descriptions and the datasets are taken from the UCI repository [5]. The available documentation has been analyzed for an assessment on the more appropriate treatment. Missing information is also properly identified – see Table 2. The Horse Colic dataset has been investigated with two different targets (variables #23 and #24, resp.).
Table 2: Basic characteristics of the datasets: #Obs (learning, test), Def. (default accuracy), Missing (percentage of missing values), In→Out (no. of inputs and outputs). The last column shows variable types: (R)eal, (N)ominal, or(D)inal.
  Name             #Obs            Def.    Missing   In→Out   Data
  Pima Diabetes    768 (500,268)   65.1%   10.6%     8 → 2    8R, 0N, 0D
  Horse Colic-23   363 (295,68)    61.4%   25.6%     22 → 3   7R, 7N, 8D
  Horse Colic-24   364 (296,68)    63.5%   25.6%     22 → 2   7R, 7N, 8D
  Audiology        226 (200,26)    66.3%   2.1%      31 → 4   0R, 24N, 7D
Pima Diabetes. This is a much studied dataset, in which a population of Pima Indian women living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. In this dataset, most of the variables show impossible zero values (e.g., the diastolic blood pressure), which are actually missing values [15] (a small illustration of this recoding is given below). Upon careful analysis, it turns out that only 392 out of the 768 observations are unaffected by missing values.
Horse Colic. This dataset makes an excellent case study, because of the diversity in data heterogeneity and a significant amount of missing values; it has been used as a paradigmatic example in some textbooks [16]. Each observation is the clinical record of a horse and the variables are especially well documented2.
Audiology. This problem is interesting for many reasons: it is multiclass, has a low number of observations, and all variables are categorical (with different numbers of modalities, some of them ordered)3. We have reduced the original 24 classes to 4 by grouping and eliminated non-informative variables.

2 This dataset is made available thanks to M. McLeish and M. Cecile (Computer Science Dept., Univ. of Guelph, Ontario, Canada).
3 Original owner: Professor Jergen at Baylor College of Medicine.
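As a small illustration of the pre-processing the std setting implies for Pima, the snippet below recodes the impossible zeros as missing values before imputation. The column names and the use of scikit-learn's IterativeImputer (a MICE-style imputer) are assumptions of this sketch; the paper itself relies on the R package mice:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the estimator)
from sklearn.impute import IterativeImputer

# Columns where a zero is physiologically impossible and actually encodes "missing"
ZERO_AS_MISSING = ["glucose", "blood_pressure", "skin_thickness", "insulin", "bmi"]

def impute_pima(df: pd.DataFrame) -> np.ndarray:
    df = df.copy()
    df[ZERO_AS_MISSING] = df[ZERO_AS_MISSING].replace(0, np.nan)   # identify missing values
    return IterativeImputer(sample_posterior=True, random_state=0).fit_transform(df)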
3.2 Results
The results are displayed in Tables 3, 4, and 5. At first sight, it is surprising how well the learning methods are able to grasp the task using the raw method. The std method is markedly better for LogReg and the SVM, but not for Multinom or LDA. However, the difference for Multinom is very small; for LDA, the std method increases the input dimension quite a lot (especially if there are many categorical variables or these have many modalities). The explosion in the number of binary variables (due to the dummy coding) changes the data distribution to something extremely non-Gaussian, and this causes trouble for LDA (this is especially acute in Audiology). The sim method presents similar results for LogReg and the SVM, and much better ones for both LDA and Multinom. We would like to point out the good results delivered by linear models, like LogReg – when applicable – and LDA, especially for the sim method. On the other hand, few efforts have been devoted to a fine tuning of the SVM models beyond educated guesses, but this issue affects all approaches. Finally, no effort has been put into selecting the optimal number of centers for the sim approach.
Table 3: Generalization errors for the raw method.

                LogReg  Multinom  SVM    LDA
Pima            0.201   0.187     0.194  0.187
HorseColic-23   −       0.309     0.279  0.279
HorseColic-24   0.176   0.162     0.162  0.162
Audiology       −       0.231     0.154  0.269
AVERAGE         0.189   0.222     0.197  0.224

Table 4: Generalization errors for the std method.

                LogReg  Multinom  SVM    LDA
Pima            0.190   0.198     0.205  0.201
HorseColic-23   −       0.265     0.279  0.353
HorseColic-24   0.147   0.191     0.147  0.147
Audiology       −       0.269     0.038  0.731
AVERAGE         0.169   0.231     0.168  0.358

Table 5: Generalization errors for the sim method.

                LogReg  Multinom  SVM    LDA
Pima            0.183   0.190     0.194  0.175
HorseColic-23   −       0.324     0.294  0.309
HorseColic-24   0.162   0.176     0.176  0.191
Audiology       −       0.115     0.000  0.038
AVERAGE         0.172   0.201     0.166  0.178
3.3 Discussion
Compared with related previous work, the obtained results are very competitive with those typically reported for these problems, which are sometimes achieved using very sophisticated techniques. For example, for the Pima dataset, the best reported results are around 20%, topping at 19.8% [15], while typical results are in the range 21%-25% [17]. Our best result is 17.5%, using LDA and the sim method. Among other causes, the weaker results reported elsewhere can be attributed to a poor identification or treatment of missing values, a point that has been made before [15]. What is more, many (if not most) of these reported results are cross-validation figures, which means that there is no independent assessment of true generalization ability. In our case, the Pima results are reported on a test set that is large compared to the learning set size (35% of the data). For HorseColic-23, the best reported result seems to be 13.6% [17], while typical results are in the range 14%-23% [18]. Our best result is 14.7%, achieved by three different learners with the std method. There seems to be no comparable previous work for HorseColic-24. Finally, for Audiology it is difficult to compare because in this paper we have cleaned the dataset prior to learning; in any event, both the SVM and LDA achieve very good results with the sim method.
Another important issue is the distribution of the similarities across the observations and the classes. Fig. 1 shows this distribution for the different datasets. It can be seen that in all cases similarities are rather high and well-behaved (fairly symmetrical, unimodal). Given the assumed relation between similarity computations and learning ability, we computed the intra-class similarities. These were: Cochlear (0.847), Mixed (0.822), Normal (0.914) and Other (0.794) for Audiology; No (0.811) and Yes (0.788) for Pima; Died (0.646), Euthanized (0.634) and Lived (0.673) for HorseColic-23; and No (0.688) and Yes (0.673) for HorseColic-24. These numbers not only reflect how relatively compact the different classes are, they also indicate the hardness of the task for a learning method based on distances or similarities (however they are computed). For example, HorseColic-23 and HorseColic-24 show markedly less compact classes. The relation to overall performance and to class-by-class performance is left for a further dedicated study.
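These intra-class figures are simple to reproduce from the pairwise similarity matrix; a short sketch (function name ours):

import numpy as np

def intra_class_similarity(S, labels):
    """Mean pairwise similarity within each class, given an N x N similarity
    matrix S and a length-N label vector; the diagonal (self-similarity) is excluded."""
    labels = np.asarray(labels)
    result = {}
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        block = S[np.ix_(idx, idx)]
        n = len(idx)
        result[c] = (block.sum() - np.trace(block)) / (n * (n - 1)) if n > 1 else float("nan")
    return result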
Fig. 1: Similarity distributions for the different datasets (density histograms of the pairwise similarities; panels: HorseColic-23, Pima, Audiology).
4 Conclusions and Future Work
A shortcoming of many existing learning methods – and, in particular, neural networks – is the difficulty of adding prior knowledge to the model in a principled way. Current practice assumes that input vectors may be faithfully represented as points in Rd, and that the geometry of this space captures the meaningful relations in input space. There is no particular reason why this should be the case, at least not with small numbers of hidden neurons. This paper has described ongoing research on more flexible learning frameworks, offering means for the injection of prior knowledge and permitting a natural extension to problems with non-numerical data types and missing values. When talking about real problems, however, accuracy may not give the whole picture of a model. Other performance criteria include development cost, interpretability and usability.
– Cost here refers to how much pre-processing effort is needed in order to build the model. Undoubtedly, all but the raw method require more analysis time compared to doing (almost) nothing. Prior to learning, the variable types must be identified and coded properly; for the S-neurons, suitable similarity measures must be chosen, using available background knowledge.
– Interpretability refers to the complexity of the obtained model in human terms. It is generally believed that accuracy and interpretability are in conflict [19]. In the present case, the methods based on similarity have a clear advantage when combined with a linear learner: the prediction is a weighted combination of the similarities of the input to a selected (and small) subset of prototypes. This framework resembles that of an SVM; however, in an SVM the kernel provides an implicit transformation of the input space rather than a purely similarity-based representation, and the chosen similarity must moreover be a valid kernel function.
– Finally, models must be useful in practice: in a real deployment of the model, new and unseen observations emerge which we need to classify; these display the same variable types and may contain missing values (which certainly could not be imputed at learning time). The similarity approaches are able to face this situation without further effort.
Current lines of research include the extension to new data types (a suitable similarity measure is needed in each case) and the design of formal measures relating the overall and, particularly, the intra-class similarities to the class distribution itself. A measure of similarity that is maximized for observations of the same class and minimized for observations of different classes is envisaged via the introduction of weights into the comparisons. This approach would permit the optimization of the similarity measure to some extent. Another important issue is the selection of the best centers, which could be performed in a supervised way, for example by performing a separate clustering per class and merging the results, or by using GLVQ methods [20]. This latter family of algorithms makes good use of the information about the classes to which the input vector and the winning codevector belong at prototype selection time. It is conjectured that a supervised reduction method will deliver better modelling results when coupled with the subsequent stages of the method.
Acknowledgments. Partially supported by the MICINN project BASMATI (TIN2011-27479-C04-03) and by SGR2009-1428 (LARCA).
References
1. Osborne, H., Bridge, D. Models of similarity for case-based reasoning. Interdisciplinary Workshop on Similarity and Categorisation, pp. 173–179 (1997)
2. Baeza-Yates, R., Ribeiro, B. Modern Information Retrieval. ACM Press, New York (1999)
3. Shawe-Taylor, J., Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge Univ. Press (2004)
4. Little, R., Rubin, D. Statistical Analysis with Missing Data. Wiley (2009)
5. Bache, K. and Lichman, M. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Irvine, CA: University of California (2013)
6. Pekalska, E. The Dissimilarity Representations in Pattern Recognition. Concepts, Theory and Applications. ASCI Dissertation Series no. 109. Delft University of Technology, Delft (2005)
7. Garain, U. Prototype reduction using an artificial immune model. Pattern Analysis and Applications 11(3-4), 353–363 (2008)
8. Gower, J.C. A General Coefficient of Similarity and Some of Its Properties. Biometrics 27(4), pp. 857–871 (1971)
9. Sokal, R.R. and Michener, C.D. Principles of Numerical Taxonomy. San Francisco: W.H. Freeman (1963)
10. Dixon, J.K. Pattern recognition with partly missing data. IEEE Trans. on Systems, Man and Cybernetics 9, 617–621 (1979)
11. Pavoine, S., Vallet, J., Dufour, A.B., Gachet, S., Daniel, H. On the challenge of treating various types of variables: application for improving the measurement of functional diversity. Oikos 118(3), pp. 391–402 (2009)
12. Kaufman, L. and Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
13. Caputo, B., Sim, K., Furesjo, F., Smola, A. Appearance-based object recognition using SVMs: which kernel should I use? NIPS Workshop on Statistical Methods for Computational Experiments in Visual Processing and Computer Vision (2002)
14. van Buuren, S., Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45(3), 1–67 (2011)
15. Ripley, B. Pattern Recognition and Neural Networks. Cambridge Univ. Press (1996)
16. Xu, R., Wunsch, D. Clustering. Wiley-IEEE Press (2008)
17. Torre, F. Boosting Correct Least General Generalizations. Technical Report GRAppA-0104 (2004)
18. Yang, J., Parekh, R., Honavar, V. DistAI: An Inter-pattern Distance-based Constructive Learning Algorithm. Intelligent Data Analysis 3, pp. 55–73 (1999)
19. Breiman, L. Statistical Modeling: The Two Cultures. Statistical Science 16(3), 199–231 (2001)
20. Pal, N.R., Bezdek, J.C., Tsao, E. Generalized clustering networks and Kohonen's self-organizing scheme. IEEE Trans. Neural Networks 4(4), pp. 549–557 (1993)
Human Activity Classification with Online Growing Neural Gas
Maximilian Panzner, Oliver Beyer, and Philipp Cimiano
Semantic Computing Group, CITEC, Bielefeld University
[email protected] http://www.sc.cit-ec.uni-bielefeld.de
Abstract. In this paper we present an online approach to human activity classification based on Online Growing Neural Gas (OGNG). In contrast to state-of-the-art approaches that perform training in an offline fashion, our approach is online in the sense that it circumvents the need to store any training examples, processing the data on the fly and in one pass. The approach is thus particularly suitable in life-long learning settings where never-ending streams of data arise. We propose an architecture that consists of two layers, allowing the storage of human actions in a more memory-efficient structure. While the first layer (feature map) dynamically clusters Space-Time Interest Points (STIP) and serves as basis for the creation of histogram-based signatures of human actions, the second layer (class map) builds a classification model that relies on these human action signatures. We present experimental results on the KTH activity dataset showing that our approach has comparable performance to a Support Vector Machine (SVM) while performing online and avoiding the explicit storage of examples. Keywords: artificial neural networks, online growing neural gas, human action recognition, space-time, online classification
1 Introduction
The recognition and classification of human activity is important in many application domains including smart homes [1], surveillance systems [2], ambient intelligence [3], etc. In particular, we address the task of classifying human activity into a given set of activity types on the basis of video data, sequences of 2D images in particular. State-of-the-art approaches extract features from space-time volumes, e.g. space-time interest points (STIP) as introduced by Ivan Laptev [4]. Their advantage lies in their compact and robust representation of human actions, as they reduce the input space by identifying local feature points. Training a human action classifier model based on STIPs typically includes the clustering (with e.g. k-Means) of video features to yield bag-of-visual-words clusters that can be used to derive a histogram-based representation of a video clip by indicating the number of STIPs being assigned to each cluster. Then a classifier (e.g. SVM) is trained to learn to classify image sequences into a set of given human activity types.
There are several drawbacks in this classical approach. First of all, it requires an architecture that comprises heterogeneous algorithms, e.g. a feature extractor, a clustering algorithm and a classification algorithm. An architecture that is more uniform and compact, relying on one algorithm, would be simpler, easier to implement and thus preferable. Further, the approach is not online, as a number of examples has to be stored in memory or on disk in order to recompute the clusters and retrain the classifier at regular intervals. To circumvent these limitations, we present a new architecture based on Online Growing Neural Gas (OGNG), which has been presented earlier [5]. In this paper we present a two-layer architecture consisting of two maps that we call feature map and class map, respectively. In the first layer (feature map), STIPs are dynamically clustered according to their similarity in the feature space, yielding a growing set of "visual words" that can also change over time. A human activity signature (HAS) for each space-time volume is then formed by a histogram indicating the activity of each prototype / neuron in the feature map. In the second layer (class map), space-time volumes are clustered by their HAS and labelled according to the corresponding activity. Our contributions can be listed as follows:
– Online classification: We provide an architecture that grows incrementally and is capable of processing space-time volumes in an online fashion as new data arrives.
– Compact model: We provide a compact human activity model, as both layers are based on STIPs and OGNG, which represent the high-dimensional input space in a low-dimensional map.
– Uniform architecture: We provide an architecture which is uniformly based on OGNG, in contrast to existing approaches that rely on more heterogeneous structures.
We compare our architecture to the classical architecture proposed by Laptev [4], based on a k-means feature discretization and an SVM-based classification, showing that our approach yields comparable results to the latter, while proposing a uniform architecture based on two OGNG maps and circumventing the need to store examples and process them offline. Our approach is thus suitable in a life-long learning setting, in which there is a never-ending data stream that needs to be processed on-the-fly, as in the applications mentioned above. The paper is structured as follows: in Section 2 we describe the OGNG algorithm in order to make this paper self-contained. In Section 3 we describe our online approach to human action classification with OGNG, including a description of the features we use. In Section 4 our experiments are described, including the methodology of our evaluation, the baseline used and our results. We then conclude and provide an overview of related work in Section 5.
2 Online Growing Neural Gas (OGNG)
Online Growing Neural Gas (OGNG), as introduced by Beyer and Cimiano [5], extends Growing Neural Gas to an online classifier by integrating additional online labeling and prediction strategies. In the following we briefly describe the OGNG algorithm. The algorithm is depicted in Algorithm 1 and the modifications are highlighted. A detailed description of OGNG and a comparison of several online labeling and prediction strategies can be found in Beyer and Cimiano [5].
Algorithm 1 Online Growing Neural Gas (OGNG)
1: Start with two units i and j at random positions in the input space.
2: Present an input vector x ∈ Rn from the input set or according to the input distribution.
3: Find the nearest unit n1 and the second nearest unit n2.
4: Assign the label of x to n1 according to the present labeling strategy.
5: Increment the age of all edges emanating from n1.
6: Update the local error variable by adding the squared distance between wn1 and x:
   Δerror(n1) = |wn1 − x|²
7: Move n1 and all its topological neighbours (i.e. all the nodes connected to n1 by an edge) towards x by fractions eb and en of the distance:
   Δwn1 = eb (x − wn1)
   Δwn = en (x − wn)
   for all direct neighbours n of n1.
8: If n1 and n2 are connected by an edge, set the age of the edge to 0 (refresh). If there is no such edge, create one.
9: Remove edges with an age larger than amax. If this results in nodes having no emanating edges, remove them as well.
10: If the number of input vectors presented or generated so far is an integer multiple of a parameter λ, insert a new node nr as follows: Determine the unit nq with the largest error. Among the neighbours of nq, find the node nf with the largest error. Insert the new node nr halfway between nq and nf:
    wr = (wq + wf) / 2
    Create edges between nr and nq, and between nr and nf. Remove the edge between nq and nf. Decrease the error variables of nq and nf by multiplying them with a constant α. Set the error of nr to the new error variable of nq.
11: Decrease all error variables of all nodes i by a factor β.
12: If the stopping criterion is not met, go back to step (2).
In steps 1-3, the network is initialized and the first winner n1 and second winner n2, according to a presented stimulus, are determined. In step 4 we assign the label of our stimulus ξ to the winner neuron n1 according to the selected labeling strategy. The selected labeling strategy is the relabeling method (relabel), because of its simplicity and effectiveness as shown in Beyer and Cimiano [5]. In steps 5-7, the ages of all edges emanating from n1 are incremented by one. Furthermore, the local error gets updated, and n1 and its topological neighbours n are adapted towards the stimulus with the learning rates eb (for n1) and en (for the topological neighbours). In steps 8-9, n1 and n2 get connected by an edge and the edges with an age larger than amax are removed. In step 10, a new neuron nr is introduced between the neuron nq with the largest local error and its neighbour nf having the largest error within its neighbourhood. We only insert a new neuron if nmax, the maximal number of neurons per class, is not exceeded for the class of ξ. In step 11 all error variables are decreased by the factor β. In the last step 12, the algorithm continues with step 2 as long as the stopping criterion (mostly a predefined maximum number of neurons) has not been met.1

1 For our experiments we additionally introduce new neurons for novel categories in the learning process. Furthermore, we only stop inserting new neurons when the maximum number of neurons is reached, instead of stopping the training process.

Fig. 1. Online and offline processing pipelines.
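For readers who prefer code, the following is our own condensed sketch of a single OGNG learning step (roughly steps 3-11 of Algorithm 1) in Python; it uses the relabel strategy, omits the removal of isolated nodes and the per-class node limit mentioned in the footnote, and is not the authors' implementation:

import numpy as np

class OGNGSketch:
    def __init__(self, dim, eb=0.3, en=0.0018, amax=120, lam=50,
                 alpha=0.5, beta=0.0005, max_nodes=200, seed=0):
        rng = np.random.default_rng(seed)
        self.w = [rng.normal(size=dim), rng.normal(size=dim)]   # two initial units
        self.err = [0.0, 0.0]
        self.label = [None, None]
        self.edges = {}                                          # (i, j) -> age
        self.eb, self.en, self.amax, self.lam = eb, en, amax, lam
        self.alpha, self.beta, self.max_nodes = alpha, beta, max_nodes
        self.t = 0

    def step(self, x, y):
        self.t += 1
        d = [float(np.sum((wi - x) ** 2)) for wi in self.w]
        n1, n2 = map(int, np.argsort(d)[:2])                     # nearest and second-nearest unit
        self.label[n1] = y                                       # relabel strategy
        for e in self.edges:
            if n1 in e:
                self.edges[e] += 1                               # age edges emanating from n1
        self.err[n1] += d[n1]                                    # accumulate local error
        self.w[n1] = self.w[n1] + self.eb * (x - self.w[n1])     # adapt winner ...
        for e in self.edges:
            if n1 in e:
                nb = e[0] if e[1] == n1 else e[1]
                self.w[nb] = self.w[nb] + self.en * (x - self.w[nb])  # ... and its neighbours
        self.edges[(min(n1, n2), max(n1, n2))] = 0               # refresh / create edge n1-n2
        self.edges = {e: a for e, a in self.edges.items() if a <= self.amax}  # drop old edges
        if self.t % self.lam == 0 and len(self.w) < self.max_nodes:
            q = int(np.argmax(self.err))                         # node with largest error
            nbs = [e[0] if e[1] == q else e[1] for e in self.edges if q in e]
            if nbs:
                f = max(nbs, key=lambda i: self.err[i])          # worst neighbour of q
                r = len(self.w)
                self.w.append(0.5 * (self.w[q] + self.w[f]))     # insert new node halfway
                self.label.append(self.label[q])
                self.err[q] *= self.alpha
                self.err[f] *= self.alpha
                self.err.append(self.err[q])
                self.edges.pop((min(q, f), max(q, f)), None)
                self.edges[(min(q, r), max(q, r))] = 0
                self.edges[(min(f, r), max(f, r))] = 0
        self.err = [e * (1.0 - self.beta) for e in self.err]     # decay all errors (our reading of step 11)

    def predict(self, x):
        d = [float(np.sum((wi - x) ** 2)) for wi in self.w]
        return self.label[int(np.argmin(d))]                     # label of the nearest node

Feeding per-STIP descriptors to one such map and the normalized histograms (with their sequence labels) to a second one gives the two-layer setup described in the next section.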
3 Human Activity Classification with OGNG
In this section we describe our two-layer online approach to human activity classification that exploits topological maps – Growing Neural Gas maps in particular – at both layers. We call our approach Online Human Action Classifier (OHAC). A typical processing pipeline in human activity recognition is comprised of the following three steps:
1. Extraction of video features: Extraction of distinctive features from a sequence of video frames.
2. Representation of video features: Calculating video signatures from the extracted features that capture the similarity between different video sequences.
3. Classification of video signatures: Training a classifier to learn to recognize a set of previously observed classes of human actions.
Our algorithm covers steps 2 and 3 of the typical processing pipeline. For the first step, the extraction of video features, we use HOG+HOF features calculated around sparsely detected spatio-temporal interest points (STIPs) as proposed by Laptev [4]. The detection of the points and the extraction of the feature descriptors are completely online in the sense that no global information is required. STIPs are detected as local maxima of a Harris corner function extended into the spatio-temporal domain. The spatial Harris corner function characterizes the "cornerness" of an image point by the strength of its intensity gradients in all directions. The spatio-temporal Harris corner function responds to points in space-time where the motion of local image structures is non-constant. Aside from sensor noise and other disruptions, the motion of local image structures is primarily the result of forces acting on the corresponding physical objects. Hence the local neighbourhood of the detected points can be expected to provide meaningful information about motion primitives at that point of space-time. The combination of STIPs and HOG+HOF feature descriptors has already shown promising results on several synthetic [4] and real world datasets [4, 6].
3.1 Static Human Action Classifier according to Laptev
As baseline we use an offline approach proposed by Laptev [4]. This approach uses k-means for clustering and an SVM for classification. The video sequence is represented as a bag-of-visual-words histogram. The visual vocabulary is built by clustering the feature vectors in the training set into a predetermined number of clusters.
Algorithm 2 Static Human Action Classifier according to Laptev
1: Cluster the training set into a predetermined number of clusters using kMeans.
2: Initialize histogram H with one entry for each prototype vector.
3: Present an input vector x ∈ S from the extracted features of the video sequence.
4: Find the nearest prototype px to the presented input.
5: Increment the corresponding histogram entry px by one.
6: Repeat step 3 until all features are processed.
7: Normalize H using the L1 norm.
8: Train the SVM with the normalized histogram and the label l of the current sequence.
In step 1 the visual vocabulary is built in advance by clustering all feature vectors in the training set into a predetermined number of clusters2 using kMeans. Each cluster, via its prototype vector, stands for one distinct visual word in the visual vocabulary. Steps 3-6 iterate over the feature vectors in the current training sequence. Each feature vector is assigned to its nearest prototype vector and incorporated into the histogram. The resulting histogram represents the given video sequence as a bag of visual words. To compensate for different counts of features in different sequences, the histogram is normalized using the L1 norm. In step 8 the SVM is trained with the normalized histogram and the corresponding label of the current sequence.
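A compact sketch of this baseline in Python (our own paraphrase using scikit-learn's KMeans and SVC; the kernel choice and parameter values below are assumptions, not the paper's exact setup):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_vocabulary(train_descriptors, k=4000):
    # train_descriptors: all HOG+HOF descriptors of the training STIPs, stacked row-wise
    return KMeans(n_clusters=k, n_init=1, random_state=0).fit(train_descriptors)

def bow_histogram(descriptors, vocab):
    words = vocab.predict(descriptors)                          # nearest visual word per STIP
    h = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return h / h.sum()                                          # L1 normalization

def train_baseline(sequences, labels, vocab):
    X = np.array([bow_histogram(d, vocab) for d in sequences])  # one histogram per video sequence
    return SVC(kernel="rbf", C=1.0).fit(X, labels)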
3.2 Online Human Action Classifier (OHAC)
The Online Human Action Classifier consists of two independent OGNG networks. The first network is what we call the feature map. It utilizes the OGNG algorithm for clustering incoming data in feature space. The second network is called the class map, as it uses the label information from a training sequence to assign class labels to the nodes in the network according to the relabel strategy presented in Section 2. Video sequences are represented as human activity signatures (HAS), which are represented by a histogram indicating the activity of each neuron in the feature map while iterating through the respective video sequence.

2 We use k = 4000 for the number of clusters, following Laptev [6].
Algorithm 3 Online Human Action Classifier (OHAC)
1: Initialize the feature and class maps.
2: Initialize the HAS histogram H with two (the number of initial nodes in the feature map) entries h1 = 0, h2 = 0.
3: Present an input vector x ∈ S from the extracted features of the video sequence.
4: Find the nearest unit nx from the feature map.
5: if node nx is new then
6:     insert a new entry into the histogram at position x
7: end if
8: Increment the corresponding histogram entry hx by one.
9: Repeat step 3 until all features are processed.
10: Normalize H using the L2 norm.
11: Update the OGNG class map with the normalized HAS histogram H and the label l of the video sequence.
Algorithm 3 starts by initializing the two OGNG networks and creating an empty bag-of-visual-words histogram (steps 1-2). The histogram starts with two entries, one for each of the two initial nodes in the feature map. In steps 3-5 the next input vector x is presented to the feature map, and the node nx that is closest to the presented stimulus is located. If the located node nx has been newly inserted into the feature map, a new histogram entry is created at position x. In step 8 the histogram entry at position x is incremented. When all input vectors in the sequence have been processed, the histogram is normalized with the L2 norm. The normalization compensates for different numbers of extracted feature vectors in different video sequences. The normalized histogram is then used to update the class map OGNG network with the label l of the video sequence. To predict the label of a previously unseen video sequence S, all input vectors x ∈ S are incorporated into the histogram H via the feature map node nx nearest to them (steps 3-5). The histogram is then normalized and the label l is predicted by the class map according to the single linkage strategy (see Beyer and Cimiano [5]).
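The prediction path can be summarized in a few lines, here assuming a frozen feature map (a list of prototype vectors) and a class map object exposing a predict method in the sense of the single-linkage strategy; the helper names are ours:

import numpy as np

def has_signature(stip_descriptors, feature_prototypes):
    """Human activity signature: L2-normalized histogram of feature-map activations."""
    protos = np.asarray(feature_prototypes)
    h = np.zeros(len(protos))
    for x in stip_descriptors:
        nearest = int(np.argmin(np.sum((protos - x) ** 2, axis=1)))
        h[nearest] += 1.0
    norm = np.linalg.norm(h)
    return h / norm if norm > 0 else h

def predict_activity(stip_descriptors, feature_prototypes, class_map):
    # class_map.predict(signature) is assumed to return the label of the nearest
    # class-map neuron; during training the feature map keeps growing, so the
    # histogram length grows with it.
    return class_map.predict(has_signature(stip_descriptors, feature_prototypes))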
4 Experiments and Evaluation
4.1 Dataset
As dataset for the evaluation we use the KTH human action dataset [7]3. This video database consists of six categories of human actions (walking, jogging, running, boxing, hand waving and hand clapping). All actions are performed several times by 25 subjects in four different scenarios (outdoors, outdoors with scale variations, outdoors with different clothes, and indoors). The combinations of 25 subjects, 6 actions and 4 scenarios give a total of 600 video files.
4.2 Evaluation Methodology
We evaluated the accuracy of OHAC and our baseline on the KTH human action dataset. We generated 15 training and test sets by separating the 600 video clips into 300 training examples and 300 test examples for each set. We furthermore took care that each of the six categories was equally distributed in the training and test set. We averaged our accuracy results over the 15 runs and also determined the best and worst result of the 15 runs. The OGNG parameters are set as follows: insertion parameter λ = 50; maximum age amax = 120; adaptation parameter for the winner eb = 0.3; adaptation parameter for the neighbourhood en = 0.0018; error variable decrease α = 0.5; error variable decrease β = 0.0005. We also allowed a maximum of 4000 neurons for the feature map and 200 for the class map.

3 Examples of the six actions are shown on the following website: http://www.nada.kth.se/cvap/actions/

4.3 Results
Our results are depicted in Table 1. The matrices show the confusion of our Baseline (left) and OHAC (right). Each row represents the human action category to be classified, while each column holds the percentage of examples that have been classified into the category written on top of the matrix. Overall, OHAC achieves an averaged accuracy of 93%, while our Baseline reaches an accuracy of 95%. We performed a t-test and could not show that the difference between the two approaches is statistically significant. We thus consider the classification performance of the algorithms to be comparable. The confusion matrix shows that both algorithms have difficulties distinguishing between jogging and running, which is intuitively understandable, as we as humans would also consider those two activities to be closer to each other than to the other four. Furthermore, it is interesting that OHAC confuses the human action categories "hand clapping" and "running" slightly less, with 93.2% and 72.2% compared to 89.8% and 66.8% for our Baseline. It should also be mentioned that the confusion of OHAC is spread more uniformly compared to the Baseline approach, which could be explained by the generative approach of OHAC compared to the discriminative character of our Baseline, i.e. of the underlying SVM classifier.

OHAC (values in %, rows: true category, columns: predicted category):

               handwaving  boxing  handclapping  walking  jogging  running
handwaving     88.2        3.2     3.2           1.4      2.7      1.4
boxing         1.8         89.4    1.8           3.4      1.2      2.4
handclapping   2.8         1.3     93.2          1.3      0.8      0.5
walking        1.7         1.8     1.7           85.4     6.3      3.1
jogging        3.0         2.5     3.7           10.5     65.4     15.0
running        1.3         2.0     2.9           4.0      17.5     72.2
average 82.2%

Table 1. Confusion matrix of our Baseline (left) and OHAC (right) on the KTH human action database, averaged over 15 runs. (Only the OHAC matrix could be recovered here.)

5 Related Work & Conclusion

In this paper we have presented a novel human activity classifier model based on Online Growing Neural Gas (OGNG). The model provides a compact architecture and consists of two layers, allowing the storage of human actions in a more memory-efficient structure. While the first layer (feature map) dynamically clusters STIPs and serves as basis for the creation of histogram-based signatures of a human action, the second layer (class map) builds a classification model upon those human action signatures. The advantage of this novel architecture lies in its ability to perform the human action classification task online, as the model adapts stepwise to new data and grows incrementally. The uniform character of the algorithm is desirable, as its simplicity allows an easy implementation and integration into existing systems. In most cases, heterogeneous offline human action recognition approaches have been proposed [4, 8] that generate action signatures by clustering STIPs with static clustering algorithms (such as k-Means) and classify them with an offline classifier that needs to be retrained as new data arrives. We have experimentally shown on the KTH dataset that our approach reaches comparable performance to a classical offline SVM-based classification approach while performing online and avoiding the need to store training examples explicitly, thus being suitable in lifelong stream data settings.
References
1. Marie Chan, Eric Campo, Daniel Estève, and Jean-Yves Fourniols. Smart homes - current features and future perspectives. Maturitas, 64(2):90–97, 2009.
2. M. Valera and S.A. Velastin. Intelligent distributed surveillance systems: a review. In Proceedings of the International Conference on Computer Vision, Image and Signal Processing, volume 152, pages 192–204. IET, 2005.
3. Eric J. Pauwels, Albert Ali Salah, and Romain Tavenard. Sensor networks for ambient intelligence. In Proceedings of the IEEE Workshop on Multimedia Signal Processing (MMSP), pages 13–16. IEEE, 2007.
4. Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.
5. Oliver Beyer and Philipp Cimiano. Online labelling strategies for growing neural gas. In Proceedings of the International Conference on Intelligent Data Engineering and Automated Learning (IDEAL), pages 76–83. Springer, 2011.
6. Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. Proceedings of the International Conference on Computer Vision and Pattern Recognition, pages 1–8, June 2008.
7. Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local SVM approach. In Proc. of the Int. Conference on Pattern Recognition (ICPR), volume 3, pages 32–36. IEEE, 2004.
8. Alper Yilmaz and Mubarak Shah. Actions sketch: A novel action representation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 984–989. IEEE, 2005.
About the Equivalence of Robust Soft Learning Vector Quantization and Soft Nearest Prototype Classification
David Nebel and Thomas Villmann
Computational Intelligence Group, Department for Mathematics/Natural & Computer Sciences, University of Applied Sciences Mittweida, Mittweida, Germany
email: {nebel,villmann}@hs-mittweida.de
Abstract. In this paper we consider Robust Soft Learning Vector Quantization and Soft Nearest Prototype Classification as variants of the learning vector quantization paradigm proposed by T. Kohonen for prototype-based classification models. We show that, although the two models were introduced separately, the latter is a special case of the former.
1 Introduction
Robust Soft Learning Vector Quantization (RSLVQ) and Soft Nearest Prototype Classification (SNPC) are two well-known learning vector quantizers for prototype-based classification learning [6, 5]. Both approaches can be understood as probabilistic variants of the Bayesian-motivated but heuristic learning vector quantization schemes introduced by T. Kohonen [1]. Whereas RSLVQ maximizes a log-likelihood cost function, SNPC uses a soft variant of the classification error to quantify the classification performance. Thus, these algorithms can be seen as alternatives to the generalized learning vector quantization approach (GLVQ), which takes an approximation of the classification error as cost function for an LVQ-like prototype-based classification system [4]. Although both methods, RSLVQ and SNPC, were introduced independently, there are inherent similarities in methodology: RSLVQ as well as SNPC can be seen as probabilistic models of classification based on prototypes. These prototypes are interpreted as centers of local probability models of the class distribution. In this paper we show that SNPC is mathematically equivalent to RSLVQ for a particular parameter setting in RSLVQ. Thus, both models can be identified as the same mathematical model. For this purpose, a unifying description is presented. In consequence, SNPC can be optimized via an expectation maximization scheme as recently proposed for RSLVQ [2] and GLVQ [3].
2 The RSLVQ model
In the following we briefly describe RSLVQ as introduced by Seo & Obermayer in [5]. We suppose data points xi ∈ X (i = 1, ..., N) and prototypes θj ∈ Θ (j = 1, ..., M). Further, let c(·) be the formal class labeling function, which assigns to each data point the class label c(xi). Analogously, c(θj) returns the class label of the prototype. In RSLVQ, the data density is modeled by a Gaussian mixture approach: let p(θj) = 1/M be the prior probability that data points are generated by the j-th single component of the mixture, and let

  p(x_i \mid \theta_j) = \frac{1}{\sqrt{(2\pi\sigma^2)^{D}}} \exp\!\left(-\frac{(x_i - \theta_j)^2}{2\sigma^2}\right)   (1)

be the conditional probability that the j-th component of the Gaussian mixture generates the data point xi ∈ RD. Then the probability density for the data point xi is given by

  p(x_i \mid \Theta) = \sum_{j=1}^{M} p(\theta_j)\, p(x_i \mid \theta_j)

and the conditional probability

  p_c(x_i \mid \Theta) = \sum_{\{j\,:\,c(x_i)=c(\theta_j)\}} p(\theta_j)\, p(x_i \mid \theta_j)

is the probability density that a data point xi is generated by the mixture model for the correct class. The cost function of RSLVQ is based on the likelihood ratio

  L_r = \prod_{i=1}^{N} \frac{p_c(x_i \mid \Theta)}{p(x_i \mid \Theta)} \,.   (2)

The resulting cost function to be maximized in RSLVQ is obtained as

  K_{RSLVQ}(X,\Theta) = \sum_{i=1}^{N} \log \frac{p_c(x_i \mid \Theta)}{p(x_i \mid \Theta)}   (3)
which can be optimized using a (stochastic) gradient ascent scheme, see [5]. Recently, a generalized expectation maximization (gEM) scheme for RSLVQ was proposed as an alternative to the gradient learning [2], if only relational data or dissimilarities between data are available. For this purpose, the cost function (3) can be rewritten as

  K_{RSLVQ}(X,\Theta) = \sum_{i=1}^{N} \log \sum_{\{j\,:\,c(x_i)=c(\theta_j)\}} g(x_i, \theta_j)   (4)

with the functions

  g(x_i, \theta_j) = \frac{\exp\!\left(-\frac{(x_i-\theta_j)^2}{2\sigma^2}\right)}{\sum_{k=1}^{M} \exp\!\left(-\frac{(x_i-\theta_k)^2}{2\sigma^2}\right)} \,.   (5)
In the following we use this formulation of the cost function to show its relation to the SNPC.
3 SNPC revisited: equivalence to RSLVQ
The SNPC classifier is a soft variant of the nearest prototype classification (NPC) approach. For NPC, the cost function

  C_{err} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left[1 - \delta\big(c(\theta_j) = c(x_i)\big)\right] \delta\big(\theta_j = \theta_{bmu(i)}\big)   (6)

counts the misclassifications, whereby

  \delta(x = y) = \begin{cases} 1, & x = y \\ 0, & \text{else} \end{cases}

is the Kronecker symbol and

  bmu(i) = \arg\min_{l \in \{1,\ldots,M\}} d(x_i, \theta_l)

is the index of the best matching unit (prototype) for a given data point xi with respect to a predefined dissimilarity measure d(xi, θl), frequently chosen as the squared Euclidean distance. The original cost function of SNPC is the probabilistic version

  K_{SNPC}(X,\Theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left[1 - \delta\big(c(\theta_j) = c(x_i)\big)\right] P(\theta_j \mid x_i) \;\rightarrow\; \min

of (6) with the conditional probability

  P(\theta_j \mid x_i) = \frac{\exp\big(-d(x_i, \theta_j)\big)}{\sum_{k=1}^{M} \exp\big(-d(x_i, \theta_k)\big)}

that the data point xi is assigned to the prototype θj [5]. We remark at this point that these probabilities are structurally equivalent to the functions g(xi, θj) from (5).
Instead of minimizing K_SNPC(X, Θ), we could equivalently maximize

  K(X,\Theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \delta\big(c(\theta_j) = c(x_i)\big)\, P(\theta_j \mid x_i) = \frac{1}{N} \sum_{i=1}^{N} \sum_{\{j\,:\,c(\theta_j)=c(x_i)\}} P(\theta_j \mid x_i) \,.

Further, the maximization does not change under the application of a monotonically increasing function. Hence, we consider the equivalent cost function

  \hat{K}_{SNPC}(X,\Theta) = \frac{1}{N} \sum_{i=1}^{N} \log \sum_{\{j\,:\,c(\theta_j)=c(x_i)\}} P(\theta_j \mid x_i) \,.   (7)

Yet, this is exactly the cost function of RSLVQ (4) if we set σ = 1/√2 in (4) and ignore the scaling factor 1/N in (7). Hence, SNPC can be seen as a special case of RSLVQ.
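The identity is easy to check numerically: with the squared Euclidean distance and σ = 1/√2, the SNPC probabilities P(θj | xi) coincide with the RSLVQ functions g(xi, θj), so (7) equals (4) up to the factor 1/N. A small illustration (our own, not part of the original papers):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 data points in R^3
theta = rng.normal(size=(4, 3))      # 4 prototypes
c_x = np.array([0, 1, 0, 1, 0])      # class labels of the data points
c_t = np.array([0, 0, 1, 1])         # class labels of the prototypes
sigma = 1.0 / np.sqrt(2.0)

d2 = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)      # squared Euclidean distances

P = np.exp(-d2) / np.exp(-d2).sum(axis=1, keepdims=True)                                    # SNPC: P(theta_j | x_i)
G = np.exp(-d2 / (2 * sigma**2)) / np.exp(-d2 / (2 * sigma**2)).sum(axis=1, keepdims=True)  # RSLVQ: g(x_i, theta_j)

mask = (c_x[:, None] == c_t[None, :])                            # j with c(theta_j) = c(x_i)
K_hat_snpc = np.log((P * mask).sum(axis=1)).mean()               # cost (7), including the 1/N factor
K_rslvq = np.log((G * mask).sum(axis=1)).sum()                   # cost (4)

assert np.allclose(P, G)                                         # identical probabilities
assert np.allclose(len(X) * K_hat_snpc, K_rslvq)                 # (7) equals (4) up to 1/N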
4 Conclusion
In the present paper we have shown the mathematical equivalence between the separately introduced RSLVQ and SNPC, which were both proposed by Seo & Obermayer. Using the result for RSLVQ recently published in [2], it follows immediately that SNPC can also be maximized using the gEM optimization procedure presented in [2] for RSLVQ.
References
1. Teuvo Kohonen. Self-Organizing Maps, volume 30 of Springer Series in Information Sciences. Springer, Berlin, Heidelberg, 1995. (Second Extended Edition 1997).
2. D. Nebel, A. Gisbrecht, B. Hammer, and T. Villmann. Learning supervised generative models for dissimilarity data. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), page submitted. MIT Press, 2013.
3. D. Nebel and T. Villmann. A median variant of generalized learning vector quantization. In Proceedings of the International Conference on Neural Information Processing (ICONIP), page submitted. Springer, 2013.
4. A. Sato and K. Yamada. Generalized learning vector quantization. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. Proceedings of the 1995 Conference, pages 423–429. MIT Press, Cambridge, MA, USA, 1996.
5. S. Seo, M. Bode, and K. Obermayer. Soft nearest prototype classification. IEEE Transactions on Neural Networks, 14:390–398, 2003.
6. S. Seo and K. Obermayer. Soft learning vector quantization. Neural Computation, 15:1589–1604, 2003.
Learning Orthogonal Bases for k-Sparse Representations
Henry Schütze, Erhardt Barth, Thomas Martinetz
Institute for Neuro- and Bioinformatics, University of Lübeck
Ratzeburger Allee 160, 23562 Lübeck, Germany
[email protected]
Sparse Coding aims at finding a dictionary for a given data set, such that each sample can be represented by a linear combination of only few dictionary atoms. Generally, sparse coding dictionaries are overcomplete and not orthogonal. Thus, the processing substep to determine the optimal k-sparse representation of a given sample by the current dictionary is NP-hard. Usually, the solution is approximated by a greedy algorithm or by l1 convex relaxation. With an orthogonal dictionary, however, an optimal k-sparse representation can not only be efficiently, but exactly computed, because a corresponding k-sparse coefficient vector is given by the k largest absolute projections. In this paper, we present the novel online learning algorithm Orthogonal Sparse Coding (OSC), which is designed to find an orthogonal basis U = (u1, ..., ud) for a given data set X ∈ R^{d×L}, such that for any k ∈ {1, ..., d}, the optimal k-sparse coefficient vectors A ∈ R^{d×L} minimize the average representation error

  E = \frac{1}{dL} \| X - UA \|_F^2 \,.

At each learning step t, OSC randomly selects a sample x from X and determines an index sequence h1, ..., hd of decreasing overlaps |u_{h_i}^T x| between x and the basis vectors in U. In the order of that sequence, each basis vector u_{h_i} is updated by the Hebbian learning rule

  \Delta u_{h_i} = \varepsilon_t \, (u_{h_i}^T x) \, x

with a subsequent unit length normalization. After each basis vector update, x and the next basis vector u_{h_{i+1}} to be adapted are projected onto the orthogonal complement span({u_{h_1}, ..., u_{h_i}})^⊥, wherein the next update takes place.
We applied OSC to (i) 1,000 synthetic (k=50)-sparse patches of size 16×16 pixel, randomly generated with a 2D Haar basis, and (ii) 20,000 natural image patches of size 16×16 pixel, randomly sampled from the first image set of the nature scene collection [1] (308 images of nature scenes containing no man-made objects or people). The basis patches learned by OSC are shown in Figure 1 and demonstrate that OSC reliably recovers the generating basis from synthetic data (see Figure 1a). Figure 1b illustrates that the OSC basis learned on the natural image patches resembles a wavelet decomposition, and is distinct from PCA, DCT, and Haar bases. In Figure 2, the average k-term approximation performance of the OSC basis is compared with PCA, DCT, Haar and JPEG 2000 wavelets on the natural image patch data set. For this data set, OSC yields a consistently better k-term approximation performance than any of the alternative methods.

(a) Learned basis from 1,000 synthetic image patches of size 16×16 pixel. (b) Learned basis from 20,000 natural image patches of size 16×16 pixel.
Fig. 1: Basis patches learned with OSC.
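A sketch of a single OSC learning step as we read the description above (our own rendering under those assumptions, not the authors' code):

import numpy as np

def osc_step(U, x, eps_t):
    """One OSC update. U is a d x d matrix with orthonormal columns (the basis),
    x a data sample, eps_t the current learning rate; returns the updated basis."""
    U = U.copy()
    order = np.argsort(-np.abs(U.T @ x))                   # h1..hd: decreasing overlap |u^T x|
    x_res = x.astype(float).copy()                         # residual of x in the remaining complement
    for i, h in enumerate(order):
        u = U[:, h] + eps_t * (U[:, h] @ x_res) * x_res    # Hebbian update: delta_u = eps_t (u^T x) x
        for hprev in order[:i]:                            # project onto span({u_h1..u_h(i-1)}) complement
            u = u - (U[:, hprev] @ u) * U[:, hprev]
        U[:, h] = u / np.linalg.norm(u)                    # unit length normalization
        x_res = x_res - (U[:, h] @ x_res) * U[:, h]        # project x onto the complement as well
    return U

# Given a learned orthogonal basis, the optimal k-sparse code of a sample keeps
# the k largest absolute projections and zeroes the rest:
def k_sparse_code(U, x, k):
    a = U.T @ x
    keep = np.argsort(-np.abs(a))[:k]
    out = np.zeros_like(a)
    out[keep] = a[keep]
    return out

Sweeping osc_step over randomly drawn samples with a decaying eps_t reproduces the online behaviour described above.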
Fig. 2: Average k-term approximation performance (SNR in dB versus the ratio of retained coefficients k/d) on 20,000 natural image patches of size 16×16 pixel, comparing JPEG (DCT), Haar, JPEG 2000 (CDF97), PCA, and OSC.
Acknowledgement. The research is funded by the DFG Priority Programme SPP 1527, grant number MA 2401/2-1.
References
1. Wilson S. Geisler and Jeffrey S. Perry. Statistics for optimal point prediction in natural images. Journal of Vision, 11(12), October 2011.
MACHINE LEARNING REPORTS
Report 02/2013
Impressum
Machine Learning Reports
ISSN: 1865-3960
▽ Publisher/Editors
Prof. Dr. rer. nat. Thomas Villmann, University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany • http://www.mni.hs-mittweida.de/
Dr. rer. nat. Frank-Michael Schleif, University of Bielefeld, Universitätsstrasse 21-23, 33615 Bielefeld, Germany • http://www.cit-ec.de/tcs/about
▽ Copyright & Licence
Copyright of the articles remains with the authors.
▽ Acknowledgments
We would like to thank the reviewers for their time and patience.
Machine Learning Reports http://www.techfak.uni-bielefeld.de/∼fschleif/mlr/mlr.html