Mathematical Aspects of Neural Networks


Habilitationsschrift (cumulative habilitation thesis), Department of Mathematics/Computer Science, Universität Osnabrück

Barbara Hammer, April 1, 2003


Contents

1 Introduction ........................................................... 1
2 Network models ......................................................... 3
  2.1 Feedforward networks ............................................... 3
  2.2 Recurrent models ................................................... 4
  2.3 Unsupervised networks .............................................. 5
3 Mathematical aspects ................................................... 7
  3.1 Feedforward networks ............................................... 7
  3.2 Recurrent models ................................................... 9
  3.3 Unsupervised networks .............................................. 11
4 Papers ................................................................. 13
  4.1 Generalized relevance learning vector quantization ................. 15
  4.2 On approximate learning by multi-layered feedforward circuits ...... 31
  4.3 A note on the universal approximation capability of support vector machines ... 71
  4.4 Recurrent neural networks with small weights implement definite memory machines ... 83
  4.5 Generalization ability of folding networks ......................... 109
  4.6 Recurrent networks for structured data – a unifying approach and its properties ... 127
  4.7 A general framework for self-organizing structure processing neural networks ... 151
5 Acknowledgment ......................................................... 193
References ............................................................... 195


1 Introduction

Since the first proposal of artificial neural circuits in the 1950s, the first boom of network models in the 1960s and 70s, and the renaissance of neural methods starting around 1985, neural networks have become a well-established technique in machine learning. Interestingly, not only the availability of training algorithms and their successful application, but also mathematical results have largely determined whether neural networks were accepted or rejected as appropriate learning machinery by the scientific community: in the early years, neural circuits were investigated as alternative universal computing mechanisms in competition with digital computers [134]. This research focused on theoretical investigations of the capabilities of circuits of this type. An important part of Rosenblatt's perceptron, one of the first pattern recognition tools successfully applied in practice, was the mathematical proof of the convergence of the perceptron learning algorithm [155, 156]. The rapid decrease in the popularity of neural networks around the 80s was mainly caused by the work of Minsky and Papert [136], which thoroughly investigated the capacity of single neurons, and of neurons enlarged by fixed filters or masks, from a theoretical point of view and pointed out several mathematical limitations. Conversely, not only the availability of a new learning algorithm for multilayer networks as proposed by Werbos [208] and Rumelhart et al. [159], but also the precise mathematical connection of specific recurrent networks to classical optimization tasks as pointed out by Hopfield [100] share the responsibility for the renewed interest in the field in the 90s. Today, various neural models are established as fixed parts of machine learning, and a thorough theoretical investigation of these models is available.
Many researchers in this scientific community agree that statistical notions often provide the right language to formalize learning algorithms and to investigate their mathematical properties. Nevertheless, given the wide variety of models, tasks, and application areas of neural methods, mathematical tools ranging from approximation theory, complexity theory, geometry, statistical physics, statistics, linear and nonlinear optimization, and control theory to many more areas can be found in the neural literature. Correspondingly, the role of mathematics in the neural network literature is diverse:

- Development and presentation of algorithms: Most neural algorithms are formulated in mathematical terms, and some learning schemes, such as support vector machines [26], are even mainly motivated by abstract mathematical considerations.
- Foundation of tools: A fixed canon of mathematical questions has been identified for most network models and application areas which has to be answered in order to establish the models as well-founded and reliable tools. Interestingly, many of these questions are not yet solved satisfactorily even for old network models and remain open topics of ongoing research, such as the loading problem of feedforward networks [39] or the convergence problem of the self-organizing map in its original formulation (see e.g. [96]).
- Application of tools: Mathematical formalization establishes standards for assessing the performance of methods and their application in real-life scenarios (although these standards are not always followed, and 'real life' would sometimes be better characterized by slightly different descriptions than the mathematics usually presented in the papers [122]).

In the following, we consider mathematical questions which have to be answered to justify standard models as reliable machine learning tools.
Thereby, we will focus on the classical models used for machine learning: feedforward networks, recurrent and recursive architectures, and self-organizing systems. We first briefly introduce the standard models and motivate the mathematical questions which arise in their context and which have to be answered to justify their use. Then we give a short overview of important or recent results in the literature. Finally, seven articles are included, each of which constitutes a major contribution to a different mathematical question within this context.


2 Network models

We restrict ourselves here to three classical types of networks: feedforward networks for classification and approximation, discrete-time recurrent and recursive networks for sequence processing and data storage, and unsupervised systems for vector quantization and faithful representation of data. We do not consider models like cellular networks or spiking models [131, 187], statistical counterparts like Gaussian processes or Bayes point machines [69, 94], or other learning scenarios like reinforcement learning [182]. Neural architectures which model biological neural networks or cognitive systems are also beyond the scope of this thesis.

2.1 Feedforward networks

Feedforward neural networks (FNNs) constitute the most widely used and most popular network models. They compute a possibly highly nonlinear function f_W : R^n → R^o as the composition of simple functions provided by the single neurons, which are connected in a directed acyclic graph. Depending on their type, the single neurons typically compute a function of the form x ↦ σ(w^t x) for sigmoidal networks, or x ↦ σ(|x − w|) for radial basis function networks, where σ : R → R is the activation function and w are the neuron's weights. f_W is parameterized by the collection of all neurons' weights W. Typical activation functions σ for sigmoidal networks include

- the Heaviside function H(x) = 0 if x < 0 and 1 otherwise,
- the logistic function sgd(x) = (1 + exp(−x))^(−1),
- the identity id(x) = x.

The so-called simple perceptron just consists of one sigmoidal neuron with the Heaviside activation function. Learning vector quantization (LVQ) networks combine radial basis function type neurons with a winner-takes-all computation, yielding the overall network function x ↦ a_i such that |x − w_i| is minimum, whereby w_i denotes the weights of the i-th radial basis type neuron, a_i the respective output weight or output signal, and |·| denotes an appropriate distance measure, e.g. the Euclidean metric. Apart from these LVQ networks, specific network types which will be tackled in the following include

- multilayer perceptron networks, i.e. networks of sigmoidal neurons which can be arranged in consecutively connected layers,
- the support vector machine (SVM), i.e. a simple perceptron or linear neuron where a fixed and usually nonlinear preprocessing of the input x ↦ Φ(x) is included. Φ is thereby chosen such that it can efficiently be computed by a kernel k(x, y) = Φ(x)^t Φ(y).
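The two neuron types and the LVQ winner-takes-all mapping above can be sketched in a few lines of plain Python. The Gaussian-style radial activation and the concrete prototype values are illustrative choices, not prescribed by the text:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sigmoidal_neuron(w, x):
    # x -> sigma(w^t x): inner product followed by the logistic activation
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def rbf_neuron(w, x):
    # x -> sigma(|x - w|): distance to the weight vector, here with a
    # Gaussian-style activation (an illustrative choice of sigma)
    dist = math.sqrt(sum((xi - wi) ** 2 for wi, xi in zip(w, x)))
    return math.exp(-dist ** 2)

def lvq_classify(prototypes, labels, x):
    # winner-takes-all: output the signal of the closest prototype
    dists = [sum((xi - wi) ** 2 for wi, xi in zip(w, x)) for w in prototypes]
    return labels[dists.index(min(dists))]

print(lvq_classify([[0.0, 0.0], [1.0, 1.0]], ["A", "B"], [0.9, 0.8]))  # prints "B"
```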

Depending on the output function of such networks, feedforward networks can be used for classification of patterns if the output set is finite and discrete, or for approximation of functions if the output set is contained in a real vector space. An essential point of these function classes is that they come with an automated training algorithm which allows one to learn an unknown regularity f based on a set of examples {(x_i, f(x_i)) | i = 1, …, m} of this regularity. Commonly, training takes place in three steps:

1. Selection of the neural architecture and the hyperparameters of training, mostly done by cross-validation.
2. Optimization of the weights on the given training set via error minimization. There exist different forms of error minimization. The simple perceptron and LVQ networks are trained with Hebbian learning, i.e. iterated reinforcement of correct signals and weakening of incorrect signals. Multilayer perceptrons are mostly trained with some gradient-based error minimization procedure. SVM networks are optimized such that the so-called margin, i.e. the distance of the classification hyperplane from the patterns, is maximized under the constraint that all points are mapped correctly. This is done via a transformation to the dual problem and appropriate quadratic optimization techniques.
3. Estimation of the generalization ability using the test error on a test set not used for training.

Often, training and further optimization and regularization of the architecture are mixed, e.g. using growing and pruning algorithms, training with additional regularization, or, as already explained for the SVM, explicit optimization of an appropriate regularization term. The standard way of training poses three canonical mathematical questions which have to be answered for the respective neural architecture and training algorithm:

1. What are the approximation capabilities of the respective architecture? If the neural architecture is not capable of approximating the (unknown) function to be learned, then we cannot, in principle, find an appropriate architecture to start training.
2. What do good error minimization algorithms look like, and what is the complexity of training? We have to find guarantees for the convergence of the algorithm, and we should investigate the expected runtime.
3. Can the generalization ability of neural architectures be guaranteed, such that we can expect adequate behavior for future data we would like to predict?
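Step 2 for the simple perceptron, i.e. Hebbian-style iterated reinforcement of correct signals and weakening of incorrect ones, can be sketched as follows; the toy data set and the epoch limit are illustrative:

```python
def perceptron_train(samples, epochs=100):
    # samples: list of (x, y) with y in {-1, +1}; learns w, b with the
    # classical perceptron rule, which converges on linearly separable data
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in samples:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # mistake: reinforce towards the correct signal
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                errors += 1
        if errors == 0:  # all patterns classified correctly
            break
    return w, b

# a linearly separable toy problem (logical OR)
data = [([0, 0], -1), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w, b = perceptron_train(data)
print(all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0 for x, y in data))  # prints True
```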

2.2 Recurrent models

Recurrent neural networks (RNNs) combine neurons in a possibly cyclic graph such that time-dependent dynamics can be observed, all neurons computing their output based on the activations in the previous time step. Inputs are hence no longer simple vectors, but sequences or continuous time series with elements in a real vector space. The dynamics in time step t can be written as x_i(t) = σ(w_i^t x(t − 1)) for discrete time and sigmoidal neurons, where w_i denotes the weights of the connections pointing to neuron i. Various alternative models of recurrent networks in discrete and continuous time have been proposed in the literature, see e.g. the taxonomy proposed in [117]. However, many discrete models can be simulated within the above simple dynamics [80]. Depending on the respective dynamic properties, RNNs are used for sequence prediction, transduction, and generation, as associative memory, or for computation tasks such as binding, grouping, or cost minimization. Training RNNs for an unknown function f on possibly sequential data based on given training data can essentially be formulated like FNN training: selection of the architecture, optimization of the empirical error on the given training data, and estimation of the generalization ability on a test set. Hence the same mathematical questions as in the case of FNNs arise for RNNs. The temporal structure of RNNs adds further mathematical questions, in particular if RNNs are used for alternative tasks which rely on the computational capabilities and dynamic properties of these networks. We name just a few further aspects: What is the overall dynamic behavior of the network with respect to convergence and stability? Can the capacity of RNNs used as associative memories be estimated? Which different long-term behaviors can be realized with a network?
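The discrete-time dynamics x_i(t) = σ(w_i^t x(t − 1)) can be sketched as follows; feeding the external input additively into the pre-activation is one common modeling choice among several, and the weight values are illustrative:

```python
import math

def rnn_step(W, x_prev, u):
    # one time step: x_i(t) = sigmoid(sum_j W[i][j] * x_j(t-1) + u_i),
    # i.e. each neuron reads the previous activations plus an external input
    return [1.0 / (1.0 + math.exp(-(sum(wij * xj for wij, xj in zip(row, x_prev)) + ui)))
            for row, ui in zip(W, u)]

def rnn_run(W, x0, sequence):
    # process an input sequence step by step; the final state
    # encodes the whole sequence seen so far
    x = x0
    for u in sequence:
        x = rnn_step(W, x, u)
    return x

W = [[0.5, -0.3], [0.2, 0.8]]
state = rnn_run(W, [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(len(state))  # prints 2
```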
Another line of research is offered by the possibility to generalize the dynamics to more general recursive input structures: tree structures or directed acyclic graphs can be processed as inputs instead of only simple time series or sequences. So-called recursive networks constitute a well-established tool used in various application areas such as chemistry and bioinformatics [18, 80, 148]. Thereby, tree structures with limited fan-out (say fan-out 2, for simplicity) can be dealt with due to a generalization of the dynamics of RNNs to processes with more than one temporal context: the network activity after recursively processing a tree t with root label l and subtrees t1 and t2 is given by x(t) = f(l, x(t1), x(t2)). Thereby, f denotes some function which can be computed by a feedforward neural network or even a simple neuron. x(t1) and x(t2) constitute the two recursively computed contexts according to the fan-out two of the root of the tree. Note that this dynamic constitutes a natural generalization of recurrent networks, for which only one context is available at each time step, corresponding to the linear structure of the input sequences. Various different neural networks have been proposed in the literature which can cope with complex data structures using recursive processing similar to the above dynamics. An overview of models can be found in [83], for example. Popular models like the recursive autoassociative memory or holographic reduced representations thereby also include elements which yield structured outputs by recursive decoding of a given connectionist representation of a tree. The various methods proposed in the literature differ with respect to the precise choice of the encoding and decoding functions and the training method. However, the same standard mathematical questions as above arise in this context: What are the representational capabilities of these models? What is the complexity of training? Can valid generalization be guaranteed?
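The recursive dynamics x(t) = f(l, x(t1), x(t2)) can be sketched for binary trees as follows; the linear encoding function f and the zero code for the empty tree are illustrative stand-ins for a trained network:

```python
def encode_tree(tree, f, empty):
    # tree: None or (label, left, right); the recursion mirrors
    # x(t) = f(l, x(t1), x(t2)), with a fixed code `empty` for the empty tree
    if tree is None:
        return empty
    label, left, right = tree
    return f(label, encode_tree(left, f, empty), encode_tree(right, f, empty))

# an illustrative encoding function: one linear neuron per state component
def f(label, c1, c2):
    return [0.5 * label + 0.25 * (a + b) for a, b in zip(c1, c2)]

t = (1.0, (2.0, None, None), (3.0, None, None))
print(encode_tree(t, f, [0.0]))  # prints [1.125]
```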
In addition, general formulations of the diverse models constitute another important topic of research in order to allow a uniform investigation of the respective mathematical questions, as proposed in [59, 82, 84, 175], for example (the second publication is contained in the following collection).


2.3 Unsupervised networks

Unsupervised neural networks have been designed for various tasks in the area of data mining, visualization, and knowledge extraction. The precise goals vary according to the specific application area: clustering of data, compressed representation, noise reduction, visualization in a lower-dimensional space, extraction of relevant statistical properties, blind source separation, topologically faithful representation, and many others. Many unsupervised networks can be described as prototype-based winner-takes-all classification, whereby the prototypes are given by the weights w_i of the neurons. The overall mapping of such a network is given by the formula x ↦ w_i such that |x − w_i| is minimum. Thereby, |·| denotes an appropriate distance measure, e.g. the Euclidean metric. Hence the prototypes w_i of such an unsupervised network represent the given training data in a compressed form. The question now arises of how appropriate prototypes can be obtained from a given set of training data. Often, prototypes are iteratively adapted in Hebbian style, as in the following three popular algorithms:

Vector Quantization (VQ): Prototypes are initialized randomly and iteratively updated, mapping the closest prototype w_{i0} into the direction of the given data point x:

  w_{i0} := w_{i0} + η · (x − w_{i0}).

Thereby, η ∈ (0, 1) is the learning rate. This procedure aims at finding compact representations of the data points via the prototypes. Obviously, the above procedure constitutes a stochastic gradient descent on the cost function given by the quantization error, which measures the distortion of the data points with respect to their respective closest prototypes in the Euclidean metric. An alternative popular minimization scheme for the quantization error is given by a batch variant, the classical k-means algorithm [129]. Naturally, various stochastic relaxations and modifications of this basic approach have been proposed in the literature, e.g. [67, 97, 154].

Self-Organizing Map (SOM): The SOM as introduced by Kohonen [113] incorporates neighborhood cooperation of the neurons such that the topology of the data is also matched and, furthermore, the single neurons are less affected by the problem of getting stuck in local optima (though the overall solution might still converge to suboptimal settings). For this purpose, all neurons are adapted in each step:

  w_i := w_i + η(nh(i, i0)) · (x − w_i),

whereby the learning rate η(nh(i, i0)) is small if the actual neuron i and the winner i0 are not direct neighbors. The neighborhood nh(i, i0) is thereby determined a priori and often given by a low-dimensional rectangular or hexagonal lattice of the neurons. The choice of two-dimensional lattices has the side effect that easy visualization of the outcome of the map is possible. In competition with alternatives such as those proposed in [201, 202, 211], this mechanism has also been proposed as an explanation for the development of cortical maps in the brain. The adaptation of the whole neighborhood makes sure that local parts of the map also mirror the local topology of the given data space. However, a perfect overall match is only possible if the lattice topology coincides with the (usually unknown) data topology.

Neural Gas (NG): The NG [132, 133] allows one to optimally match an arbitrary data topology by introducing a data-driven neighborhood. The update of neuron i reads

  w_i := w_i + η(rk(w_i, x)) · (x − w_i),

whereby the learning rate η decreases as the rank of neuron w_i, ordered according to the distance from the actual data point x, gets larger. A (possibly non-regular) optimal neighborhood structure can be identified for the trained neural gas network after convergence of training.
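The three update rules can be sketched side by side; the exponential neighborhood and rank weighting functions are common but illustrative choices, as is the one-dimensional chain lattice for the SOM:

```python
import math

def vq_step(prototypes, x, eta=0.1):
    # Vector Quantization: move only the closest prototype towards x
    d = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in prototypes]
    i0 = d.index(min(d))
    prototypes[i0] = [wi + eta * (xi - wi) for wi, xi in zip(prototypes[i0], x)]

def som_step(prototypes, x, eta=0.1, sigma=1.0):
    # SOM on a one-dimensional chain lattice: every neuron is adapted with a
    # strength that decays with its lattice distance to the winner i0
    d = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in prototypes]
    i0 = d.index(min(d))
    for i, w in enumerate(prototypes):
        h = math.exp(-((i - i0) ** 2) / (2 * sigma ** 2))
        prototypes[i] = [wi + eta * h * (xi - wi) for wi, xi in zip(w, x)]

def ng_step(prototypes, x, eta=0.1, lam=1.0):
    # Neural Gas: adaptation strength decays with the distance rank rk(w_i, x)
    d = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in prototypes]
    ranks = {i: r for r, i in enumerate(sorted(range(len(d)), key=d.__getitem__))}
    for i, w in enumerate(prototypes):
        h = math.exp(-ranks[i] / lam)
        prototypes[i] = [wi + eta * h * (xi - wi) for wi, xi in zip(w, x)]

protos = [[0.0], [1.0]]
vq_step(protos, [0.8])
print(protos)  # the closest prototype [1.0] moved towards 0.8
```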

Naturally, there exist further vector quantization schemes, such as [14, 20, 68, 70]. Furthermore, there exist various self-organizing algorithms, partially even based on single neurons, which can extract relevant statistical properties from given data, such as principal and independent component analysis or probability density estimation [71, 103, 121, 158, 191]. Moreover, there exists a couple of alternative methods which aim at appropriate data visualization [63, 149]. We will here only deal with the three vector quantization schemes described above. For these models, typical mathematical questions include the following points:

1. Does the training dynamics obey a potential dynamics? If not, can convergence be guaranteed in some other way? Can in this case an objective be stated in a precise mathematical way?
2. How does the map mirror the statistical properties of the data? What is the magnification factor compared to the underlying data density, and can this magnification be controlled in some way?
3. If neighborhood cooperation is included, can ordering according to the data topology be guaranteed? How can topology preservation be formalized and measured?

Standard self-organizing maps deal with vectorial data only. Naturally, an extension to time-dependent inputs and more general recurrent or recursive data structures is interesting also in the field of unsupervised data processing, such that complex data structures like DNA sequences or chemical formulas can also be dealt with. Several extensions of the SOM to temporal inputs have been proposed in the literature, such as two popular and well-established models, the temporal Kohonen map and the recurrent SOM [31, 115], and various alternative models, see e.g. the overview [6].
Whereas generalizations of supervised neural networks to recurrent and recursive networks agree on the way context is represented (by the activations of the neurons), this representation is less clear and straightforward for self-organizing systems. Recent alternatives to the temporal Kohonen map and the recurrent SOM are the recursive SOM as proposed by Voegtlin [198, 199, 200] and the SOMSD, the latter dealing also with tree-structured data [72, 73, 174]. For these two models, a different and more complex notion of context is introduced. As pointed out in the articles [86, 87, 88] (the last paper is included in the following collection), the explicit notion of context representation allows one to put all these approaches into a common framework such that a uniform investigation of the above-mentioned relevant mathematical questions becomes possible also for structure-processing self-organizing maps. Moreover, an easy transfer of new ideas to these models can be achieved, such as the integration of alternative hyperbolic lattices [151] as demonstrated in [180].
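A schematic sketch of a SOMSD-style explicit context, under the simplifying assumptions of scalar weights and a one-dimensional lattice where the context of a step is simply the previous winner's lattice position (the actual models use richer representations and also adapt the weights):

```python
def somsd_winner(prototypes, x, context, alpha=0.5):
    # each neuron i stores a weight w and a context c; the winner minimizes a
    # blend of input distance and context distance, where the current context
    # is the lattice position of the previous winner
    best, best_d = None, float("inf")
    for i, (w, c) in enumerate(prototypes):
        d = alpha * (x - w) ** 2 + (1 - alpha) * (context - c) ** 2
        if d < best_d:
            best, best_d = i, d
    return best

def process_sequence(prototypes, seq):
    # recursively process a sequence: the winner of each step becomes
    # the context of the next step
    context, winners = 0.0, []
    for x in seq:
        i = somsd_winner(prototypes, x, context)
        winners.append(i)
        context = float(i)
    return winners

protos = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]  # (weight, context) pairs
print(process_sequence(protos, [0.0, 1.0, 1.0]))  # prints [0, 1, 2]
```

The same symbol sequence can thus map to different neurons depending on its temporal context, which is the essential difference to a context-free winner-takes-all map.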


3 Mathematical aspects

According to the network type and application area, canonical questions arise as demonstrated above. These questions have to be answered to establish the respective network models as well-founded and reliable learning tools.

3.1 Feedforward networks

The universal approximation property of FNNs with various activation functions has been established in [102, 144, 163], for example. In general, one hidden layer is sufficient for approximating any continuous or measurable function, respectively, up to any desired degree of accuracy on compact sets. Hence the search space for appropriate architectures in a learning task can be limited to one or two hidden layers. If only a finite number of points is given, the number of neurons sufficient for interpolation can be bounded, as demonstrated in [172]. Note that the proofs are, at least in theory, constructive. This fact is used in [112] to design an alternative (though possibly not yet very efficient) training algorithm. An extension of the universal approximation result to networks with functional inputs is presented in [157]. Recently, the universal approximation capability of SVMs has been established, too [85, 179] (the first paper is included in the collection of papers in this volume). However, achieving a small margin is thereby not possible for a large number of concepts [15], such that the capacity of the architecture cannot be bounded. Apart from the in-principle guarantee that neural networks can approximate any reasonable function, and apart from concrete bounds for finite training sets, approximation rates are of particular interest. They characterize the quality of the approximation which can be achieved if a function (possibly with additional constraints) is approximated by n neurons. One of the first results in this direction can be found in [7, 106], where convergence of order 1/n is established for functions with limited norm, measured in a way incorporating the activation function. Generalizations thereof can be found in [119, 123], deriving e.g. dimension-independent geometric convergence of neural networks for specific function classes.
Starting from these results, further mathematical questions concerning the capacity of possibly restricted architectures can be investigated, such as the capacity of FNNs with restricted weights [49], the uniqueness of parameters [51], or the design of alternative transfer functions for FNNs and kernels for SVMs, respectively [50, 104]. Usually, network training aims at finding weights such that the given data set is mapped correctly to the desired output values. FNNs are often trained by gradient descent methods, for which convergence can be established [118]. The convergence of Hebbian learning for the simple perceptron has been proved in [156], for example. A (though highly discontinuous) cost function for LVQ has been proposed, and a modification towards a more appropriate cost function is presented in [162]. The paper [92] (this paper is contained in the following collection) proposes a further modification of LVQ which also includes an automatically adaptive diagonal metric and which is explicitly derived as a stochastic gradient descent on an appropriate and well-behaved cost function. This notion allows one to easily incorporate additional features such as neighborhood cooperation of the neurons to account for highly multimodal cost functions, as demonstrated in [90, 203]. However, optimization algorithms might still get stuck in local optima of the error function in these scenarios. Hence the question of the in-principle complexity of neural network training arises. This loading problem is a prime example of a process where mathematics tries to get nearer and nearer to scenarios as they occur in practical problems, without actually achieving this goal up to now. It follows directly e.g. from [135] that training a fixed FNN with the Heaviside activation function can be done in polynomial time. However, most existing algorithms do not take the specific architecture into account.
Hence the loading problem should be considered in a more general setting, taking arbitrary architectures as input. Starting with the work [22, 108], it is known that general neural network training is NP-hard. More precisely, the paper of Blum and Rivest [22] states that the loading problem is NP-hard for multilayer architectures with the Heaviside activation function even if the number of hidden neurons is fixed to two and only the number of inputs is allowed to vary from one instance to the next. Training a single perceptron is, of course, polynomial as a specific instance of linear programming, although the (convergent) perceptron algorithm might take exponential time. However, achieving optimal solutions in the presence of errors is an NP-hard problem even for the simple perceptron, as shown e.g. in [99]. People have argued that these situations are not realistic in several respects: usually, the sigmoidal function and not the perceptron function is considered. Moreover, good solutions instead of optimal ones would be sufficient. The number of hidden neurons in the networks is usually correlated to the number of available training samples. Hence a couple of results try to generalize the setting to larger architectures [75, 80, 147], sigmoidal networks [76, 80, 107, 168], or approximate settings [8, 39, 40] (the last paper is included in the following collection), establishing NP-hardness even for the seemingly simplest training problem within this line, the training of a single sigmoidal neuron [169]. As a consequence of this list of NP-hardness results, researchers try to design or identify specific and possibly restricted learning scenarios where polynomial bounds on the training time can be guaranteed [30]. In addition, focusing on large margins might help for training FNNs [16]. Note that SVM training can be written as a quadratic optimization problem with very simple constraints, such that, unlike FNN training, SVM training is polynomial. Nevertheless, the original algorithm needs access to all training patterns. Hence alternative training methods for large-scale problems are investigated in the literature even for the SVM, such as efficient online training methods or decomposition schemes [61, 65, 127]. Furthermore, some NP-hardness results have also been identified for prototype-based classifiers and LVQ networks [23, 138].
It should be mentioned that error minimization often already takes the generalization ability of the final network into account: incorporation of a statistical interpretation and training via Bayesian inference, for example, gives (approximately) optimal values [170]. Regularization terms may be added to the error function such that robust solutions are found [19, 204]. The SVM explicitly solves a regularization task, formulating the correctness of the training data as a constraint, to name just a few examples. However, after obtaining a small training error, we are now interested in the generalization ability of the trained networks to new examples, together with mathematical guarantees for this property. The question arises of how such guarantees can be stated in mathematical terms. There exist several different formalisms in this area. Methods of statistical physics, for example, allow one to compute learning curves of simple iterative training rules which quantify the average learning effect after presenting a number of examples [140, 171]. Another very popular formalism for neural network training is offered by the notion of PAC (probably approximately correct) learnability as introduced in [189] and the mathematical counterpart of uniform convergence of empirical risks as introduced in [190]. PAC learnability states that at least one learning algorithm (e.g. error minimization) can be found such that the probability of poor outputs of the learning algorithm, i.e. networks which do not generalize to unseen examples, approaches zero if enough training data are available. Uniform convergence of the empirical error of a function towards the real error on all possible inputs guarantees that all training algorithms which yield a small training error are PAC. In [190], uniform convergence for a given function class is connected to the capacity of the considered function class, and concrete bounds on the training error are derived.
The capacity of the function class is thereby measured in terms of the so-called VC dimension. Remarkably, a function class is PAC learnable if and only if its capacity in terms of the VC dimension is finite. Starting from these results, the in-principle generalization ability of network training can be established by estimating the VC dimension of neural architectures. Several directions of research continue these results: the VC dimension of various neural architectures has been estimated, in some cases based on advanced mathematical methods [165, 166, 173]. Thereby, the VC dimension or variations thereof might also depend on the given data, which is well known for the SVM and which has recently been pointed out for LVQ classifiers [37]. Note that the VC dimension quantifies the capacity of the respective architecture, hence it can also be used to investigate the approximation capability of networks [82]. Generalizations of the original setting to general outputs and various loss functions have been derived [1, 193]. Moreover, the bounds obtained via this general setting are tight in the limit, but they do not yet lead to useful bounds for realistic scenarios and training sets. Hence refinements, possibilities to take specific knowledge of the concrete setting or training algorithm into account, as well as alternative statistical estimates are a topic of ongoing research [2, 9, 95, 213].
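For orientation, a classical bound of this type can be stated as follows (a standard statement from the statistical learning theory literature, reproduced here up to constants, not a contribution of this thesis): for a function class F with VC dimension d and m i.i.d. training examples, with probability at least 1-δ every f in F satisfies

```latex
R(f) \;\le\; R_{\mathrm{emp}}(f)
  \;+\; \sqrt{\frac{d\left(\ln\frac{2m}{d}+1\right)+\ln\frac{4}{\delta}}{m}}
```

where R(f) denotes the true misclassification risk and R_emp(f) the empirical risk on the training set. The bound holds uniformly over the class, hence every training algorithm which achieves a small empirical risk generalizes well, provided m is large compared to d.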

3.2 Recurrent models

Depending on the respective task, the overall dynamic behavior of an RNN is of major importance. Many applications require that the network converges in some sense: if used as an associative memory, it should converge to a stable state; for robotics, convergence to a possibly cyclic attractor might be appropriate; for control tasks, stability of the RNN should be guaranteed. Hence stability and convergence constitute one major issue investigated for different kinds of RNNs. Using Lyapunov functions, global convergence to a stable state can be established for the classical Hopfield network, which restricts the weight matrix to be symmetric. The stability of more general networks as well as convergence rates are investigated e.g. in [32, 33, 146, 210]. Conditions on the weight matrix for local stability of RNNs via linear matrix inequalities have been established in [126, 176, 177, 183, 185]. Note that training approaches can use these stability conditions to design networks with appropriate dynamic behavior [142, 178, 209]. The capacity of RNNs can be investigated with respect to different aspects. RNNs might be used to approximate a continuous or measurable function on time series based on given training data. With respect to this aspect, RNNs are universal approximators on a finite time horizon in the discrete as well as the continuous scenario [62, 163]. Moreover, their approximation capability as operators is investigated e.g. in [5]. Upper bounds on the number of neurons which are sufficient to interpolate a given finite training set can be established [80]. Related questions such as the uniqueness of weights constitute an interesting topic of research [111]. Also for recursive networks and more general networks for structured objects, the approximation capability has been investigated [77, 79, 82, 83, 84] (the third paper is included in the following collection).
Thereby, the capabilities of encoding and decoding can be investigated in a uniform way, and various positive and negative results can be derived which indicate that certain structure-processing networks such as recursive networks might perform better in practical applications than others. Turning from a limited time horizon to the long-term behavior, one can on the one hand relate RNNs to classical symbolic mechanisms like Turing machines [110] or even more powerful non-uniform Boolean circuits [167]. A direct proof of the super-Turing universality of sigmoidal recurrent networks is given in [79], for example. In restricted scenarios, a direct relation to finite automata has been established in [29, 139]. Furthermore, counting mechanisms can be detected, as demonstrated in [153], which indicates that RNNs are particularly appropriate for processing natural language, which involves embedded structures. For limited weights, definite memory machines result [91] (this paper is included in the following collection). On the other hand, when turning to the long-term behavior, one can investigate the rich capacity of RNNs as dynamic systems: they are capable of producing stable, periodic, and chaotic behavior, as demonstrated e.g. in [188, 205]. Input-dependent bifurcation curves as well as the dynamics of larger network structures have been studied experimentally e.g. in the articles [42, 93, 101]. If RNNs are used as associative memories, the notion of capacity refers to the number of patterns which can be stored in an appropriate RNN as stable states. This number, of course, depends on the characteristics of the patterns: sparse or nearly orthogonal patterns can usually be stored more easily. In addition, the capacity depends on the respective RNN model which is considered [28]. Interestingly, the existence of Lyapunov functions for specific network architectures such as Hopfield-type networks makes it possible to encode optimization problems in RNNs.
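For concreteness, the Lyapunov (energy) function of the classical Hopfield network with symmetric weights w_ij = w_ji, zero self-connections, binary states x_i in {-1, 1}, and thresholds θ_i reads (standard textbook form, not specific to this thesis):

```latex
E(x) \;=\; -\tfrac{1}{2}\sum_{i \neq j} w_{ij}\,x_i x_j \;+\; \sum_i \theta_i x_i
```

Asynchronous updates x_i ← sgn(∑_j w_ij x_j − θ_i) never increase E, hence the dynamics converges to a local minimum of E; encoding a cost function into E in this way turns the network into an optimizer.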
One classical example of this procedure is the traveling salesman problem (TSP): if the weights of a Hopfield network are chosen appropriately, the global energy minima of the network correspond to solutions of the TSP. Various optimization problems can be tackled in this way: Hopfield-type networks for the TSP [38, 186], RNNs for invariant recognition of geometrical objects [125], or networks for graph coloring [48] have been proposed. Learning with RNNs can also be investigated with respect to different aspects according to the respective overall learning task. If used for function approximation, RNN training is in principle identical to FNN learning: the empirical error is minimized on a given training set, e.g. with a gradient descent method. Training RNNs of fixed architecture with the Heaviside activation function is polynomial, just as for FNNs [75]. Moreover, the explicit integration of structure and a corresponding recurrent or recursive dynamics allows one to identify nontrivial situations where training can be done efficiently [60]. However, all NP-hardness results for FNN training transfer directly to RNNs, such that difficulties can be expected in some situations. Further numerical problems might occur in the context of recurrent computations: it has been proved in [17] that gradient methods are not appropriate if long-term dependencies are to be learned, since it is impossible to latch information over a long time period. Hence the question of efficient training algorithms for RNNs is still an open topic of research. The training algorithms proposed for RNNs are much more diverse than those for FNNs [145]. They are often specifically designed for a given scenario such as locally recurrent networks [27], situations with an explicit exponential decrease of gradient information [4], or networks with linear recurrent neurons [98]. However, restrictions such as local recurrence might severely limit the applicability of these networks [57]. Alternative optimization tools use e.g. Kalman filters or other ideas from control theory, local learning schemes, a transformation of the primal-dual mechanism of SVMs, or even direct modifications of specific (possibly chaotic) dynamic behavior [41, 64, 141, 143, 184]. A unification of various algorithms as well as some new ideas is offered e.g. in [3, 164]. If RNNs are used as associative memories, training algorithms can be based on stability constraints. The contribution [206] derives a new algorithm for stochastic Hopfield networks. Extensions to learn more complex patterns such as sequences, as well as extensions to more complex network structures, are currently being investigated in the literature [124, 207]. The generalization capability of RNNs used for sequence processing constitutes another not yet satisfactorily solved research problem.
Since the VC dimension (and generalizations thereof) of recurrent and recursive networks depends on the given inputs [74, 82, 84, 114], distribution-independent PAC learnability cannot be guaranteed in principle. However, the articles [78, 80, 81, 82, 89] provide an alternative way to derive distribution-dependent bounds and posterior generalization bounds, respectively (the last two papers are included in the following list). These results justify the in-principle learnability of recurrent and recursive networks in an appropriate sense. Moreover, they offer a general procedure to deal with this type of learning setting from the point of view of statistical learning theory. Unfortunately, the derived bounds are far from being useful in practical scenarios. Additional constraints on the networks might help in this setting [91] (this paper is included in the following list). Another notion of learnability, which is particularly suited in the context of long-term prediction, is offered by the term 'identification in the limit' [137]. Identification in the limit ensures that the underlying regularity can be precisely identified for all possible inputs after the presentation of enough training examples. This characterization is suited e.g. for grammar inference, where the RNN should adequately model the behavior for inputs of arbitrary length. It can be shown that in the Chomsky hierarchy only regular and context-free languages can be identified in the limit from positive and negative training examples, regardless of the training algorithm used. If only positive examples are available, as is often the case in language processing, the class of learnable languages reduces to finite languages. Hence language processing is an inherently difficult task for all learning mechanisms including neural networks [116]. However, natural languages as well as recurrent architectures stratified e.g.
according to the number and size of their weights are orthogonal to the Chomsky hierarchy, such that investigating the learnability of important settings remains a challenging task. Since training RNNs and generalizations thereof faces severe numerical problems as well as problems concerning the generalization ability, many more regularization schemes apart from the ones mentioned above are often incorporated into training, as pointed out in [89]. Stability constraints are often integrated as an additional regularization term [105] or, as already mentioned, via explicit weight restrictions. Architectural constraints, e.g. to local recurrence, have shown good performance [98, 117]. The integration of prior knowledge and posterior regularization in the form of the extraction of automata rules constitutes another efficient form of regularization [29, 58, 66, 139, 181]. Note that these regularization methods require various types of mathematical investigation in order to motivate and formulate the respective approach.
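As a concrete illustration of the numerical difficulties with long-term dependencies mentioned above [17], consider a scalar recurrent map h_t = tanh(w·h_{t−1} + x_t) (an illustrative sketch with made-up parameter values, not taken from the cited work): the derivative of the final state with respect to the initial state is a product of local Jacobians and shrinks exponentially with the sequence length whenever these factors stay below one in absolute value.

```python
import math

def gradient_through_time(w, xs, h0=0.0):
    """Iterate h_t = tanh(w * h_{t-1} + x_t) and accumulate the
    derivative d h_T / d h_0 as a product of local Jacobians."""
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(w * h + x)
        # local Jacobian of the map: d h_t / d h_{t-1} = w * (1 - tanh(.)^2)
        grad *= w * (1.0 - h * h)
    return h, grad

# gradient information through 50 steps versus through 5 steps
_, g_long = gradient_through_time(0.9, [0.5] * 50)
_, g_short = gradient_through_time(0.9, [0.5] * 5)
```

With |w| < 1, g_long is many orders of magnitude smaller than g_short, so an error signal at the end of a long sequence carries almost no information about the initial steps; this is precisely the latching problem.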


3.3 Unsupervised networks

The convergence and overall dynamics of self-organizing learning have been investigated along several lines of research: simple vector quantization schemes possess a potential function such that a mathematical objective of these methods can be stated explicitly. Furthermore, the convergence of online or batch training variants can be based on mathematical results from the context of expectation maximization algorithms or alternative online optimization schemes [120]. Hence in these cases the mathematical treatment of the algorithms is comparatively simple, since an explicit function is available which is minimized during learning with a standard online or batch minimization algorithm. Analogous arguments show the convergence of various, mostly stochastic, modifications of the basic algorithm [25]. Also for neural gas, a potential dynamics can be identified, as shown in [133]. However, SOM itself does not obey a gradient dynamics, as demonstrated e.g. in [52, 53]. It is possible to assign a potential function to a slightly modified version of SOM using a different notion of the winning neuron which, in practice, will usually almost coincide with the standard SOM learning rule [96]. The standard SOM has to be examined with different (and often subtle) techniques. Convergence has so far been proved only for limited settings of the original SOM update: for discrete distributions, one- or two-dimensional lattices, continuous approximations, or several other restricted situations [36, 54, 55, 128, 152, 160, 161]. In addition, alternative mathematical objectives of the SOM algorithm are to be identified due to the lack of a general potential function. One issue which has been formulated and studied in particular in the one-dimensional case refers to the ordering of the neurons according to given training data in a real vector space.
This can easily be formalized for a one-dimensional input space and lattice structure, since here the natural ordering of the real numbers can be used. For higher dimensions, a formalization of the notion of ordering is less obvious, though possible in limited settings; e.g. factorizing probability distributions might be considered. Ordering conditions are investigated in [24]. For limited settings such as one-dimensional inputs, the fact that ordering takes place has been proved in [52, 152, 160, 191]. For more complex scenarios, meta-stable states and reordering of the neurons might arise, as demonstrated e.g. in [43, 45, 53, 56, 194]. Another objective of SOM, related to the notion of ordered states, which should be achieved by the learning algorithm but whose precise definition is not straightforward, is topology preservation. One formalization is offered in the article [197]: the two mappings which map data points to prototype indices and, conversely, prototype indices to the weights of the respective prototypes should be continuous with respect to appropriately defined topologies. These topologies are given on the one side by the topology of the data manifold, and on the other side by the network topology. In [195, 197], an exact measure is also introduced which quantifies the degree of topology preservation, taking the underlying data manifold and the respective metric into account. Approximations thereof, which do not yield a correct result in all cases but which can be computed more efficiently, are presented in [11, 12, 152]. Note that the lattice structure of NG is determined a posteriori, such that it always gives an optimum topological matching [133, 132]. For SOM, on the one hand, local optima due to a wrong choice of learning parameters can arise; on the other hand, topological mismatches might be caused by a choice of lattice topology which does not fit the given training data.
The resulting stable states can be studied using methods of statistical physics [13, 44, 152]. In order to avoid topological mismatches as far as possible, alternative, possibly adaptive network topologies can be used [14, 151]. Another objective of self-organizing maps is the representation of data statistics. The question can be asked to what extent important statistical properties of the data are preserved. As pointed out in [152], SOM can be seen as approximating principal curves, i.e. nonlinear extensions of important data directions given by independent or principal components. However, due to the incorporation of the neighborhood, neither SOM nor NG yields an optimum information transfer in the sense of the quantization error. The magnification factor of the resulting data representation compared to the original data distribution can be derived (for restricted scenarios) for both SOM and NG and variants thereof [21, 34, 46, 150, 191, 212]. These explicit computations can be used to alter the basic learning scheme such that the magnification factor is controlled, as proposed e.g. in [10, 35, 47, 196]. Thereby, the learning rate or the winner update is changed according to the specific setting, for example, such that a predefined magnification (for optimum information transfer, magnification 1) is approximately achieved. As an alternative, explicit entropy optimization schemes have been introduced, as proposed in [130, 192]. Objectives of this form allow a natural extension of learning to incorporate additional objectives; as an example, they allow one to incorporate prior auxiliary information into the metric used for learning [109]. Note that these mathematical aspects are to be investigated for recurrent and recursive self-organizing schemes as well. However, the proofs from the standard case often do not transfer to the recursive scenario. For example, the training data no longer fulfill the Markov property, hence the whole training scenario is altered. In addition, further aspects are to be solved within this line of research: a general formulation of how recurrent self-organizing systems should be defined is of major importance such that a uniform investigation of the models becomes possible. A framework which includes several models from the literature as well as the dynamics of RNNs has been proposed in [86, 87, 88] (the last paper is included in the following collection). Thereby, given structured data are processed recursively. The way in which context is stored constitutes the key point which distinguishes the various concrete implementations of this framework. Since data are not stored one-to-one but in a reduced way, it is not immediately clear whether an arbitrary number of data points can be memorized given enough neurons. As pointed out in [88], this capability relies on certain properties of the representation of context: the context has to be global in an appropriate sense. Similarly, the question whether cost functions can be transferred to the recursive scenario faces several difficulties: unlike for standard SOM, recursive processing incorporates the contribution of substructures, which is usually dropped in Hebbian learning.
Hence, unlike for standard SOM, the usual learning schemes for recursive SOMs can be interpreted only as approximate gradient dynamics, as pointed out in [86, 87]. In addition, the notion of topology preservation is not clear because a metric on the input space is not fixed here. A metric on the single labels of the input space is offered by the Euclidean metric, and there exist several possibilities to transfer this metric to structures. In [88], it is argued that standard SOM training induces several metrics on the input space of structures, and an explicit characterization of one of these metrics is given in a specific case. Note that [86, 87, 88] are thereby not restricted to the standard SOM, but contain general arguments also valid for NG, VQ, and alternative lattice structures.
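The ordering behavior discussed above for the one-dimensional case can be illustrated with a minimal SOM sketch (all parameter values here are illustrative choices, not taken from the cited literature): a chain of neurons trained on scalar inputs typically ends up with monotonically ordered prototype weights.

```python
import math
import random

def train_som_1d(data, n_neurons=10, epochs=300, lr=0.5, sigma=3.0, seed=0):
    """Minimal one-dimensional SOM: a chain of neurons, scalar inputs.
    Returns the list of prototype weights after training."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(n_neurons)]
    for _ in range(epochs):
        for x in data:
            # winner-takes-all: neuron whose weight is closest to the input
            winner = min(range(n_neurons), key=lambda i: abs(w[i] - x))
            for i in range(n_neurons):
                # Gaussian neighborhood along the neuron chain
                h = math.exp(-((i - winner) ** 2) / (2.0 * sigma ** 2))
                w[i] += lr * h * (x - w[i])
        lr *= 0.99                       # cool down the learning rate
        sigma = max(0.5, sigma * 0.98)   # and shrink the neighborhood range
    return w

data_rng = random.Random(1)
data = [data_rng.random() for _ in range(100)]  # uniform scalar inputs
weights = train_som_1d(data)
# check whether the chain of weights is monotone, i.e. ordered
ordered = (all(a <= b for a, b in zip(weights, weights[1:]))
           or all(a >= b for a, b in zip(weights, weights[1:])))
```

With a slowly shrinking neighborhood, the map first contracts and then unfolds along the data, which is exactly the ordering phenomenon whose rigorous proof is restricted to such limited settings.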


4 Papers

The following collection of papers contains six articles which appeared in international scientific journals or have been accepted for journal publication, and one technical report the first part of which has also been accepted for journal publication. These contributions cover a variety of mathematical aspects of the various neural architectures introduced beforehand. The articles constitute a representative overview of my work and are accompanied by a number of contributions to international conferences, book chapters, additional journal papers, the work which is covered in my PhD thesis, and a monograph which contains extended work based on my thesis. The spectrum of mathematical questions ranges from the investigation of the training dynamics and of appropriate cost functions minimized during training, the complexity of training, the universal approximation capability or the precise mathematical characterization of the representation ability of the models, respectively, and statistical learning theory and generalization ability, to uniform formulations and the unification of existing approaches, up to the investigation of metrics and topology preservation. The following seven contributions have been selected because a) each article constitutes a major contribution to a specific area within the topic of mathematical aspects of neural networks; b) the articles are nonredundant and cover different mathematical aspects or models, and they contain work which is mostly not contained in my PhD thesis; c) at least 80% of the text of the following articles where I am a coauthor and which are based on joint work and discussions with colleagues was written by myself. (As required in the 'Habilitationsordnung', the contribution of each author can be identified, and it will be specified before the respective paper.)
Since the final layout of the articles is in part not yet available or not available in electronic form, I compiled the following pages directly from the final versions accepted for publication, substituting the layout of the journal by a uniform style; hence the following articles differ with respect to layout from the published version or the version to be published, respectively. No scientific content, formulation, or figure has been altered compared to the accepted versions.


4.1 Generalized relevance learning vector quantization

The article 'Generalized relevance learning vector quantization' by B. Hammer and T. Villmann was submitted to Neural Networks in July 2001, and it appeared in Neural Networks 15:1059-1068, 2002. In the article, the algorithm GRLVQ is proposed and formally derived as a gradient descent on an appropriate cost function. This ensures convergence under appropriate conditions and, in addition, opens the field to further canonical generalizations, e.g. to different cost functions or metrics. In this article, the development of the algorithm (GRLVQ), the mathematical investigation, and the experiments have been done by myself. The algorithm has been implemented by myself based on a program written by T. Bojer, D. Schunk, and K. Tluk von Toschanowitz for an earlier, heuristic extension of LVQ which incorporates relevance factors. My coauthor, T. Villmann, contributed to this article parts of the experiment entitled 'Satellite data' (in particular the comparison to other methods with respect to dimensionality estimation of the data) and parts of the conclusions. In addition, we had general discussions on the article. Additional publications in international conferences or journals which are connected to the content of this article and where I am a coauthor include:

- T. Villmann, E. Merényi, B. Hammer, Neural maps in remote sensing image analysis, to appear in: Neural Networks. (Content: an application of GRLVQ and other neural techniques to satellite images.)
- T. Bojer, B. Hammer, C. Koers, Monitoring technical systems with prototype based clustering, to appear in: M. Verleysen (ed.), ESANN'2003, D-side publications. (Content: an application of GRLVQ to fault detection for piston engines, in cooperation with an industrial partner.)
- B. Hammer, M. Strickert, T. Villmann, Learning vector quantization for multimodal data, in: J. R. Dorronsoro (ed.), Artificial Neural Networks – ICANN 2002, Springer, 370-375, 2002. (Content: combination of GRLVQ with neighborhood cooperation.)
- B. Hammer, A. Rechtien, M. Strickert, T. Villmann, Rule extraction from self-organizing networks, in: J. R. Dorronsoro (ed.), Artificial Neural Networks – ICANN 2002, Springer, 877-882, 2002. (Content: approximate rule extraction based on GRLVQ.)
- B. Hammer, T. Villmann, Batch-RLVQ, in: M. Verleysen (ed.), ESANN'02, D-side publications, 295-300, 2002. (Content: an alternative, mathematically well-founded optimization scheme for a previous heuristically motivated version.)
- T. Villmann, B. Hammer, M. Strickert, Supervised neural gas for learning vector quantization, in: D. Polani, T. Martinetz (eds.), Proceedings of GWAL-5, IOS Press, 9-18, 2002. (Content: combination with neighborhood cooperation to avoid local optima of the optimization.)
- M. Strickert, T. Bojer, B. Hammer, Generalized relevance LVQ for time series, in: G. Dorffner, H. Bischof, K. Hornik (eds.), Artificial Neural Networks – ICANN'2001, Springer, 677-683, 2001. (Content: application of GRLVQ to time-series data.)
- B. Hammer, T. Villmann, Estimating relevant input dimensions for self-organizing algorithms, in: N. Allison, H. Yin, L. Allinson, J. Slack (eds.), Advances in Self-Organizing Maps, Springer, 173-180, 2001. (Content: earlier short conference version proposing the algorithm GRLVQ.)
- T. Bojer, B. Hammer, D. Schunk, K. Tluk von Toschanowitz, Relevance determination in learning vector quantization, in: M. Verleysen (ed.), European Symposium on Artificial Neural Networks'2001, D-facto publications, 271-276, 2001. (Content: a previous, heuristically motivated version of the algorithm.)
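For reference, the cost function underlying GRLVQ can be sketched as follows (reconstructed from the published literature on GLVQ/GRLVQ; the notation may differ in detail from the article): with relevance factors λ_i ≥ 0, ∑_i λ_i = 1, and the weighted squared distance d_λ(x, w) = ∑_i λ_i (x_i − w_i)², GRLVQ performs a stochastic gradient descent on

```latex
E \;=\; \sum_{j} f\!\left(
  \frac{d_\lambda(x^j, w^{+}) - d_\lambda(x^j, w^{-})}
       {d_\lambda(x^j, w^{+}) + d_\lambda(x^j, w^{-})}\right)
```

where w⁺ denotes the closest prototype with the correct class label for example x^j, w⁻ the closest prototype with a wrong label, and f is a monotonically increasing function such as the identity or a sigmoid; both the prototypes and the relevance factors are adapted along the gradient of E.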


4.2 On approximate learning by multi-layered feedforward circuits

The article 'On approximate learning by multi-layered feedforward circuits' by B. DasGupta and B. Hammer was submitted to Theoretical Computer Science in April 2001, and it has been accepted for publication in Theoretical Computer Science in February 2003. The article investigates the complexity of training feedforward networks. Thereby, several more realistic scenarios are tackled, such as approximate loading with two different objectives, various activation functions, more than one hidden layer, and correlated training set size and architecture. My coauthor, B. DasGupta, contributed parts of the description of the general problem and setting, i.e. parts of the introduction, the description of the basic model, the formulation of the general theorem (3.1), the formulation of ε-approximate functions (Def. 10), and parts of the beginning of Section 4.1. The more technical parts and proofs as well as large parts of the remaining sections and the discussion have been written by myself. I had general discussions on the topic with my coauthor. In addition, my coauthor pointed out the papers [2] by Arora et al. and [4] by Bartlett and Ben-David, and he proposed to investigate similar problems in a more general setting. Additional publications in international conferences where I am a coauthor and which cover a similar or related topic include:

- B. DasGupta, B. Hammer, On approximate learning by multi-layered feedforward circuits, in: H. Arimura, S. Jain, A. Sharma (eds.), Algorithmic Learning Theory'2000, Springer, 264-278, 2000. (Content: includes a previous and considerably shorter version of the paper.)
- B. Hammer, Some complexity results for perceptron networks, in: L. Niklasson, M. Bodén and T. Ziemke (eds.), International Conference on Artificial Neural Networks'98, Springer, 639-644, 1998. (Content: contains some NP-hardness results for the loading problem of perceptron networks in non-approximative scenarios.)
- B. Hammer, Training a sigmoidal network is difficult, in: M. Verleysen (ed.), European Symposium on Artificial Neural Networks'98, D-facto publications, 255-260, 1998. (Content: contains an NP-hardness result for the loading problem for sigmoidal networks in a non-approximative setting.)


4.3 A note on the universal approximation capability of support vector machines

The paper 'A note on the universal approximation capability of support vector machines' by B. Hammer and K. Gersmann was submitted to Neural Processing Letters in September 2001, and it has been accepted for publication in Neural Processing Letters in July 2002. The paper investigates the approximation capability of the SVM for various kernels, including explicit bounds on the degree of the kernel. It was submitted independently of and at the same time as the paper by I. Steinwart, 'On the influence of the kernel on the consistency of support vector machines', which also tackles the universal approximation capability of SVMs (and, furthermore, their consistency). The paper by Steinwart was submitted in August 2001 and appeared in December 2001 in the Journal of Machine Learning Research 2 (2001) 67-93. My coauthor, K. Gersmann, contributed parts of the proof of Theorem 6. The rest of the paper has been written by myself.


4.4 Recurrent neural networks with small weights implement definite memory machines

The article 'Recurrent neural networks with small weights implement definite memory machines' by B. Hammer and P. Tiňo was submitted to Neural Computation in March 2002 and accepted for publication in Neural Computation in January 2003. The article investigates, on the one hand, the capability of specific (small-weight) recurrent networks and proves an equivalence to a classical mechanism, definite memory machines. On the other hand, consequences from the point of view of learning theory are established. My coauthor, P. Tiňo, contributed to this article large parts of the introduction, parts of the general definition (Section 2), and parts of the discussion (Section 6). In addition, we had general discussions on the paper. The technical parts and proofs, the programming, and the experiments in the paper have been done by myself. Publications in international journals or conferences where I am a coauthor and which cover related topics (i.e. the same model but different mathematical questions) include:

- P. Tiňo, B. Hammer, Architectural bias in recurrent neural networks – fractal analysis, to appear in: Neural Computation. (Content: investigation of fractal dimensions of the attractor of hidden neurons of recurrent networks with small weights driven by graphical models.)
- P. Tiňo, B. Hammer, Architectural bias in recurrent neural networks – fractal analysis, in: J. R. Dorronsoro (ed.), Artificial Neural Networks – ICANN 2002, Springer, 1359-1364, 2002. (Content: a considerably reduced version of the previous paper where the driving source is given only by the uniform distribution.)
- B. Hammer, J. J. Steil, Tutorial: Perspectives on learning with recurrent networks, in: M. Verleysen (ed.), ESANN'2002, D-side publications, 357-368, 2002. (Content: overview of various mathematical aspects of recurrent networks.)


4.5 Generalization ability of folding networks

The paper 'Generalization ability of folding networks' was submitted to IEEE Transactions on Knowledge and Data Engineering in July 1999, and it appeared in IEEE Transactions on Knowledge and Data Engineering 13(2):196-206, 2001. The paper tackles the generalization ability of recursive networks, establishing upper and lower bounds on the VC dimension, distribution-dependent generalization bounds, and posterior bounds on the generalization ability. Additional publications of mine in international journals and conferences which cover a related topic include:

- B. Hammer, On the generalization ability of recurrent networks, in: G. Dorffner, H. Bischof, K. Hornik (eds.), Artificial Neural Networks – ICANN'2001, Springer, 731-736, 2001. (Content: posterior bounds also for more general loss functions and sequential outputs.)
- B. Hammer, On the learnability of recursive data, Mathematics of Control, Signals, and Systems 12, 62-79, 1999. (Content: some VC-dimension estimations and learning theory results for limited settings.)


4.6 Recurrent networks for structured data – a unifying approach and its properties

The paper 'Recurrent networks for structured data – a unifying approach and its properties' was submitted to Cognitive Systems Research in December 2000 and appeared in Cognitive Systems Research 3(2):145-165, 2002. The paper establishes a general formulation of various approaches for the neural processing of tree-structured data via recursive computation. This allows a unified investigation of the approximation capability and the learnability of the approaches. New results, in particular concerning the approximation capability, are thereby also presented in the paper. Book chapters and contributions to international conferences which cover a related topic and are authored by myself include:

- B. Hammer, Perspectives on learning symbolic data with connectionistic systems, to appear in: R. Kühn, R. Menzel, W. Menzel, U. Ratsch, M. M. Richter, and I. O. Stamatescu (eds.), Perspectives on Adaptivity and Learning, Springer. (Content: overview of recurrent and recursive neural computation models with respect to their computation and generalization abilities.)
- B. Hammer, Compositionality in neural systems, in: M. Arbib (ed.), Handbook of Brain Theory and Neural Networks, 2nd edition, MIT Press, 244-248, 2003. (Content: overview of neural models for composite objects.)
- B. Hammer, Limitations of hybrid systems, in: M. Verleysen (ed.), European Symposium on Artificial Neural Networks'2000, D-facto publications, 213-218, 2000. (Content: a precise lower bound on the complexity of recursive decoding with neural systems.)
- B. Hammer, Approximation capabilities of folding networks, in: M. Verleysen (ed.), European Symposium on Artificial Neural Networks'99, D-facto publications, 33-38, 1999. (Content: overview and some new approximation results for recursive networks.)


4.7 A general framework for self-organizing structure processing neural networks

The paper 'A general framework for self-organizing structure processing neural networks' by B. Hammer, A. Micheli, and A. Sperduti has been published as technical report TR-03-04 of the Università di Pisa, 2003. An article which contains mostly the first part of this technical report (see below) has been accepted for publication in Neurocomputing. The paper is mainly the result of a research visit of mine at the University of Pisa in November 2002. The technical report has been written by myself except for minor formulations and corrections. The content and the mathematical questions are thereby the result of discussions with my coauthors A. Micheli and A. Sperduti. The paper provides a general formulation of various recent self-organizing models for sequences and tree-structured data which also includes the dynamics of supervised recurrent and recursive networks. In the report, Hebbian learning is related to precise gradient learning for various cost functions (SOM, NG, and VQ). In addition, first steps towards a precise investigation of the notion of topology preservation, the representation ability of different internal representations, and the induced metrics are presented. Contributions to international journals and conferences which cover a similar topic and where I am a coauthor include:



 

B. Hammer, A. Micheli, M. Strickert, A. Sperduti, A general framework for unsupervised processing of structured data, to appear in: Neurocomputing. (Content: contains a part of the technical report including the general framework and a discussion in which sense training can be interpreted as approximate gradient descent; in addition, a formal proof that Hebbian training is in general no exact gradient descent.)

M. Strickert, B. Hammer, Unsupervised recursive sequence processing, to appear in: M. Verleysen (ed.), ESANN'2003, D-side publications. (Content: a version of SOMSD with hyperbolic or general triangular neighborhood, some experiments on the capacity of the model.)

B. Hammer, A. Micheli, A. Sperduti, A general framework for unsupervised processing of structured data, in: M. Verleysen (ed.), ESANN'02, D-side publications, 389-394, 2002. (Content: short conference version presenting the general framework.)
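The relation between Hebbian winner-take-all training and gradient descent discussed above can be illustrated by a minimal sketch. This is not the formulation of the paper itself: the function names are chosen for illustration, the neighborhood cooperation of SOM/NG is omitted, and the context of a neuron is simplified to a scalar (the index of the previous winner, in the spirit of SOMSD).

```python
import numpy as np

def vq_step(codebook, x, eta):
    """One Hebbian winner-take-all update of a vector quantizer.
    For plain VQ this coincides with a stochastic gradient step on the
    quantization cost E(x) = 1/2 * ||x - w_winner||^2."""
    winner = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
    codebook[winner] += eta * (x - codebook[winner])  # = -eta * dE/dw_winner
    return winner

def recursive_step(codebook, contexts, x, prev_winner, eta, alpha=0.5):
    """Sketch of a recursive extension for sequences: each neuron stores,
    besides its weight, a context (here simply the previous winner index),
    so the winner dynamics encode sequence entries recursively."""
    dist = (np.linalg.norm(codebook - x, axis=1) ** 2
            + alpha * (contexts - prev_winner) ** 2)
    winner = int(np.argmin(dist))
    # Hebbian update of weight and context of the winner:
    codebook[winner] += eta * (x - codebook[winner])
    contexts[winner] += eta * (prev_winner - contexts[winner])
    return winner
```

For plain VQ the update is an exact stochastic gradient step; once recursive context (or neighborhood cooperation) enters the winner computation, Hebbian training is, as argued in the paper, in general only an approximate gradient descent.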


5 Acknowledgment

I did this research while leading the 'Forschernachwuchsgruppe Lernen mit Neuronalen Methoden auf Strukturierten Daten' funded by the 'Ministerium für Wissenschaft und Kultur, Niedersachsen'. During this research, I had the opportunity to meet wonderful colleagues from all over the world. Special thanks go to my external coauthors (in reverse alphabetical order) Thomas Villmann, Mathukumalli Vidyasagar, Peter Tiňo, Jochen Steil, Alessandro Sperduti, Alessio Micheli, Erzsébet Merényi, and Bhaskar DasGupta, to my two extraordinary PhD students Marc Strickert and Kai Gersmann, to my (former) students Katharina Tluk von Toschanowitz, Daniel Schunk, Andreas Rechtien, and Thorsten Bojer, and to Christian Koers and all other people from Prognost. I would like to thank my colleagues at the University of Osnabrück, in particular the system administrators and the persons in the secretariats. Of course, my warmest thanks go to my parents and to my husband Manfred. Finally, I would like to express my honest thanks to the kind persons who will review this Habilitation and to the person in charge of the final report about my work.


References


[1] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[2] A. Antos, B. Kegl, T. Linder, and G. Lugosi. Data-dependent margin-based generalization bounds for classification. Journal of Machine Learning Research, 3:73–98, 2002.
[3] A. F. Atiya and A. G. Parlos. New results on recurrent network training: Unifying the algorithms and accelerating convergence. IEEE Transactions on Neural Networks, 11:697–709, 2000.
[4] A. Aussem. Sufficient conditions for error backflow convergence in dynamical recurrent neural networks. Neural Computation, 14:1907–1927, 2002.
[5] A. Back and T. Chen. Universal approximation of multiple nonlinear operators by neural networks. Neural Computation, 14:2561–2566, 2002.
[6] G. Barreto and A. Araújo. Time in self-organizing maps: An overview of models. International Journal of Computer Research, 10(2):139–179, 2001.
[7] A. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39:930–945, 1993.
[8] P. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. Theoretical Computer Science, 284:53–66, 2002.
[9] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[10] H.-U. Bauer, R. Der, and M. Herrmann. Controlling the magnification factor of self-organizing feature maps. Neural Computation, 8(4):757–771, 1996.
[11] H.-U. Bauer, M. Herrmann, and T. Villmann. Neural maps and topographic vector quantization. Neural Networks, 12(4-5):659–676, 1999.
[12] H.-U. Bauer and K. R. Pawelzik. Quantifying the neighbourhood preservation of self-organizing feature maps. IEEE Transactions on Neural Networks, 3(4):570–579, 1992.
[13] H.-U. Bauer, M. Riesenhuber, and T. Geisel. Phase diagrams of self-organizing maps. Physical Review E, 54(3):2807–2810, 1996.
[14] H.-U. Bauer and T. Villmann. Growing a hypercubical output space in a self-organizing feature map. IEEE Transactions on Neural Networks, 8(2):218–226, 1997.
[15] S. Ben-David, N. Eiron, and H. Simon. Limitations of learning via embeddings in Euclidean half-spaces. Journal of Machine Learning Research, 3:441–461, 2002.
[16] S. Ben-David and H.-U. Simon. Efficient learning of linear perceptrons. In NIPS'2000, pages 189–195, 2000.
[17] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[18] A. M. Bianucci, A. Micheli, A. Sperduti, and A. Starita. Application of cascade correlation networks for structures to chemistry. Journal of Applied Intelligence, 12:117–146, 2000.
[19] C. M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7:108–116, 1995.
[20] C. M. Bishop, M. Svensen, and C. K. Williams. Developments of the generative topographic mapping. Neurocomputing, 21(1):203–224, 1998.
[21] C. M. Bishop, M. Svensen, and C. K. Williams. Magnification factors for the SOM and GTM algorithms. In Proceedings WSOM'97, pages 333–338, 1997.
[22] A. Blum and R. Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 9:1017–1023, 1988.
[23] T. Bojer, B. Hammer, D. Schunk, and K. Tluk von Toschanowitz. Relevance determination in learning vector quantization. In M. Verleysen, editor, European Symposium on Artificial Neural Networks'2001, D-facto publications, 271–276, 2001.
[24] M. Budinich and J. G. Taylor. On the ordering conditions for self-organizing maps. In M. Marinaro and P. G. Morasso, editors, Proceedings ICANN'94, pages 347–349, Springer, 1994.
[25] J. Buhmann and H. Kühnel. Vector quantization with complexity costs. IEEE Transactions on Information Theory, 39:1133–1145, 1993.
[26] C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
[27] P. Campolucci, A. Uncini, F. Piazza, and B. D. Rao. On-line learning algorithms for locally recurrent neural networks. IEEE Transactions on Neural Networks, 10(2):253–271, 1999.
[28] B. Caputo and H. Niemann. Storage capacity of kernel associative memories. In J. Dorronsoro, editor, ICANN'02, pages 51–56. Springer, 2002.
[29] R. C. Carrasco, M. L. Forcada, M. Ángeles Valdés-Muñoz, and R. P. Ñeco. Stable encoding of finite-state machines in discrete-time recurrent neural nets with sigmoid units. Neural Computation, 12:2129–2174, 2000.
[30] E. Castillo, O. Fontenla-Romero, B. Guijarro-Berdiñas, and A. Alonso-Betanzos. A global optimum approach for one-layer neural networks. Neural Computation, 14:1429–1449, 2002.
[31] G. Chappell and J. Taylor. The temporal Kohonen map. Neural Networks, 6:441–445, 1993.
[32] T. Chen, W. Lu, and S.-I. Amari. Global convergence rate of recurrently connected neural networks. Neural Computation, 14:2947–2957, 2002.
[33] Y. Chen. Global stability of neural networks with distributed delays. Neural Networks, 15:867–871, 2002.
[34] J. Claussen and H. Schuster. Asymptotic level densities of the elastic net self-organizing feature map. In J. Dorronsoro, editor, Proceedings ICANN 2002, pages 939–944, Springer, 2002.
[35] J. Claussen. Generalized winner relaxing Kohonen feature map. E-print cond-mat, 2002. Online version: http://arXiv.org/cond-mat/0208414
[36] M. Cottrell, J. C. Fort, and G. Pages. Theoretical aspects of the SOM algorithm. Neurocomputing, 21(1):119–138, 1998.
[37] K. Crammer, R. Gilad-Bachrach, A. Navot, and A. Tishby. Margin analysis of the LVQ algorithm. In NIPS 2002.
[38] C. Dang and L. Xu. A Lagrange multiplier and Hopfield-type barrier function method for the traveling salesman problem. Neural Computation, 14:303–324, 2002.
[39] B. DasGupta and B. Hammer. On approximate learning by multi-layered feedforward circuits. In H. Arimura, S. Jain, and A. Sharma, editors, Algorithmic Learning Theory'2000, pages 264–278. Springer, 2000.
[40] B. DasGupta and B. Hammer. On approximate learning by multi-layered feedforward circuits. To appear in Theoretical Computer Science.
[41] E. Daucé. Learning from chaos: A model of dynamical perception. In Proceedings ICANN 2001, Springer LNCS 2130, pages 1129–1134, Wien, Austria, 2001.
[42] J. E. Dayhoff, P. J. Palmadesso, and F. Richards. Oscillation responses in a chaotic recurrent network. In Recurrent Neural Networks: Design and Applications. CRC Press, 1999.
[43] R. Der and M. Herrmann. Reordering transitions in self-organizing feature maps with short-range neighbourhood. In M. Marinaro and P. G. Morasso, editors, Proceedings ICANN'94, pages 322–325, Springer, 1994.
[44] R. Der and M. Herrmann. Critical phenomena in self-organizing feature maps: Ginzburg-Landau approach. Physical Review E, 49(6), 1994.
[45] R. Der, M. Herrmann, and T. Villmann. Time behaviour of topological ordering in self-organized feature mapping. Biological Cybernetics, 77(6):419–427, 1997.
[46] D. Dersch and P. Tavan. Asymptotic level density in topological feature maps. IEEE Transactions on Neural Networks, 6(1):230–236, 1995.
[47] D. DeSieno. Adding a conscience to competitive learning. In Proceedings ICNN'88, pages 117–124, IEEE, 1988.
[48] A. Di Blas, A. Jagota, and R. Hughey. Energy function-based approaches to graph coloring. IEEE Transactions on Neural Networks, 13:81–91, 2002.
[49] S. Draghici. On the capability of neural networks using limited precision weights. Neural Networks, 15:395–414, 2002.
[50] W. Duch and N. Jankowski. Survey of neural transfer functions. Neural Computing Surveys, 2:163–212, 1999.
[51] T. Elsken. Even on finite test sets smaller nets may perform better. Neural Networks, 10:369–385, 1997.
[52] E. Erwin, K. Obermayer, and K. Schulten. Self-organizing maps: ordering, convergence properties, and energy functions. Biological Cybernetics, 67(1):47–55, 1992.
[53] E. Erwin, K. Obermayer, and K. Schulten. Self-organizing maps: stationary states, metastability, and convergence rate. Biological Cybernetics, 67(1):35–45, 1992.
[54] J. A. Flanagan. Sufficient conditions for self-organization in the one-dimensional SOM with a reduced width neighborhood. Neurocomputing, 21(1-3):51–60, 1998.
[55] J. A. Flanagan. Self-organization in the SOM and Lebesgue continuity of the input distribution. In Proceedings IJCNN, volume 6, pages 26–31, IEEE, 2000.
[56] J. A. Flanagan and M. Hasler. Self-organization, metastable states, and the ODE method in the Kohonen neural network. In M. Verleysen, editor, Proceedings ESANN'95, pages 1–8, D facto conference services, 1995.
[57] P. Frasconi and M. Gori. Computational capabilities of local-feedback recurrent networks acting as finite-state machines. IEEE Transactions on Neural Networks, pages 1521–1525, 1996.
[58] P. Frasconi, M. Gori, and G. Soda. Recurrent neural networks and prior knowledge for sequence processing: A constrained nondeterministic approach. Knowledge-Based Systems, 8(6):313–332, 1995.
[59] P. Frasconi, M. Gori, and A. Sperduti. A general framework of adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5):768–786, 1998.
[60] P. Frasconi, M. Gori, and A. Sperduti. Learning efficiently with neural networks: a theoretical comparison between structured and flat representations. In W. Horn, editor, ECAI 2000, pages 301–305. IOS Press, 2000.
[61] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.
[62] K. Funahashi and Y. Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 6(6):801–806, 1993.
[63] C. Fyfe. A general exploratory projection pursuit network. Neural Processing Letters, 2(3):17–19, 1995.
[64] M. Galicki, L. Leistritz, and H. Witte. Learning continuous trajectories in recurrent neural networks with time-dependent weights. IEEE Transactions on Neural Networks, 38(10):714–755, 1999.
[65] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.
[66] C. L. Giles, S. Lawrence, and A. C. Tsoi. Noisy time series prediction using a recurrent neural network and grammatical inference. Machine Learning, 44(1/2):161–183, 2001.
[67] T. Graepel, M. Burger, and K. Obermayer. Phase transitions in stochastic self-organizing maps. Physical Review E, 56(4):3876–3890, 1997.
[68] T. Graepel and K. Obermayer. A self-organizing map for proximity data. Neural Computation, 11:139–155, 1999.
[69] M. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, Cambridge University, 1997.
[70] M. Girolami. The topographic organisation and visualisation of binary data using multivariate-Bernoulli latent variable models. IEEE Transactions on Neural Networks, 12(6):1367–1374, 2001.
[71] M. Girolami. Self-Organising Neural Networks: Independent Component Analysis and Blind Signal Separation. In J. Taylor, editor, Perspectives in Neural Computation, Springer, 1999.
[72] M. Hagenbuchner, A. C. Tsoi, and A. Sperduti. A supervised self-organizing map for structured data. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organizing Maps, pages 21–28, Springer, 2001.
[73] M. Hagenbuchner, A. Sperduti, and A. C. Tsoi. A self-organizing map for adaptive processing of structured data. To appear in IEEE Transactions on Neural Networks.
[74] B. Hammer. On the generalization of Elman networks. In W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, editors, International Conference on Artificial Neural Networks'97, Springer, 409–414, 1997.
[75] B. Hammer. Some complexity results for perceptron networks. In L. Niklasson, M. Bodén, and T. Ziemke, editors, International Conference on Artificial Neural Networks'98, Springer, 639–644, 1998.
[76] B. Hammer. Training a sigmoidal network is difficult. In M. Verleysen, editor, European Symposium on Artificial Neural Networks'98, D-facto publications, 255–260, 1998.
[77] B. Hammer. Approximation capabilities of folding networks. In M. Verleysen, editor, European Symposium on Artificial Neural Networks'99, D-facto publications, 33–38, 1999.
[78] B. Hammer. On the learnability of recursive data. Mathematics of Control, Signals, and Systems, 12:62–79, 1999.
[79] B. Hammer. On the approximation capability of recurrent neural networks. Neurocomputing, 31(1-4):107–124, 2000.
[80] B. Hammer. Learning with Recurrent Neural Networks. Springer Lecture Notes in Control and Information Sciences. Springer, 2000.
[81] B. Hammer. On the generalization ability of recurrent networks. In G. Dorffner, H. Bischof, and K. Hornik, editors, Artificial Neural Networks – ICANN'2001, Springer, 731–736, 2001.
[82] B. Hammer. Recurrent networks for structured data – a unifying approach and its properties. Cognitive Systems Research, 3:145–165, 2002.
[83] B. Hammer. Compositionality in neural systems. In M. Arbib, editor, Handbook of Brain Theory and Neural Networks, 2nd edition, MIT Press, pages 244–248, 2003.
[84] B. Hammer. Perspectives on learning symbolic data with connectionistic systems. To appear in R. Kühn, R. Menzel, W. Menzel, U. Ratsch, M. M. Richter, and I. O. Stamatescu, editors, Adaptivity and Learning, Springer.
[85] B. Hammer and K. Gersmann. A note on the universal approximation capability of support vector machines. Neural Processing Letters, to appear.
[86] B. Hammer, A. Micheli, M. Strickert, and A. Sperduti. A general framework for unsupervised processing of structured data. To appear in Neurocomputing.
[87] B. Hammer, A. Micheli, and A. Sperduti. A general framework for unsupervised processing of structured data. In M. Verleysen, editor, ESANN'02, D-side publications, 389–394, 2002.
[88] B. Hammer, A. Micheli, and A. Sperduti. A general framework for self-organizing structure processing neural networks. Technical report TR-03-04, Università di Pisa, 2003.
[89] B. Hammer and J. Steil. Perspectives on learning with recurrent neural networks. In M. Verleysen, editor, ESANN'02, pages 357–368. D-side publications, 2002.
[90] B. Hammer, M. Strickert, and T. Villmann. Learning vector quantization for multimodal data. In J. R. Dorronsoro, editor, ICANN'02, pages 370–375. Springer, 2002.
[91] B. Hammer and P. Tiňo. Neural networks with small weights implement definite memory machines. Neural Computation, to appear.
[92] B. Hammer and T. Villmann. Generalized relevance learning vector quantization. Neural Networks, 15:1059–1068, 2002.
[93] R. Haschke, J. J. Steil, and H. Ritter. Controlling oscillatory behavior of a two neuron recurrent neural network using inputs. In Proceedings ICANN 2001, Springer LNCS 2130, pages 1109–1114, Wien, Austria, 2001.
[94] R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. Journal of Machine Learning Research, 1:245–279, 2001.
[95] R. Herbrich and R. Williamson. Algorithmic luckiness. Journal of Machine Learning Research, 3:175–212, 2002.
[96] T. Heskes. Energy functions for self-organizing maps. In E. Oja and S. Kaski, editors, Kohonen Maps, pages 303–315. Elsevier, 1999.
[97] T. Heskes. On self-organizing maps, vector quantization, and mixture modeling. IEEE Transactions on Neural Networks, 12(6):1299–1305, 2001.
[98] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[99] K.-U. Höffgen, H.-U. Simon, and K. Van Horn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50:114–125, 1995.
[100] J. J. Hopfield and D. W. Tank. 'Neural' computation of decisions in optimization problems. Biological Cybernetics, 52:141–152, 1985.
[101] F. C. Hoppensteadt and E. M. Izhikevich. Weakly Connected Neural Networks. Springer, 1997.
[102] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.
[103] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley, 2001.
[104] T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7:95–114, 2000.
[105] L. Jin and M. M. Gupta. Stable dynamic backpropagation learning in recurrent neural networks. IEEE Transactions on Neural Networks, 10(6):1321–1333, 1999.
[106] L. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural networks. Annals of Statistics, 20:608–613, 1992.
[107] L. Jones. The computational intractability of training sigmoidal neural networks. IEEE Transactions on Information Theory, 43:161–173, 1997.
[108] J. Judd. Neural Network Design and the Complexity of Learning. MIT Press, 1990.
[109] S. Kaski and J. Sinkkonen. A topography-preserving latent variable model with learning metrics. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organizing Maps, pages 224–229, Springer, 2001.
[110] J. Kilian and H. T. Siegelmann. The dynamic universality of sigmoidal neural networks. Information and Computation, 128:48–56, 1996.
[111] M. Kimura. On unique representations of certain dynamical systems produced by continuous-time recurrent neural networks. Neural Computation, 14:2981–2996, 2002.
[112] M. Köppen. On the training of Kolmogorov networks. In J. Dorronsoro, editor, ICANN'02, pages 474–479. Springer, 2002.
[113] T. Kohonen. Self-Organizing Maps. Springer, 1995.
[114] P. Koiran and E. D. Sontag. Vapnik-Chervonenkis dimension of recurrent neural networks. In Proceedings of the 3rd European Conference on Computational Learning Theory, pages 223–237, 1997.
[115] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski. Recurrent SOM with local linear models in time series prediction. In M. Verleysen, editor, 6th European Symposium on Artificial Neural Networks, pages 167–172, D facto, 1998.
[116] S. C. Kremer. Lessons from language learning. In L. R. Medsker and L. C. Jain, editors, Recurrent Neural Networks: Design and Applications, pages 179–204. CRC Press, 2000.
[117] S. C. Kremer. Spatio-temporal connectionist networks: A taxonomy and review. Neural Computation, 13(2):249–306, 2001.
[118] C.-M. Kuan and K. Hornik. Convergence of learning algorithms with constant learning rates. IEEE Transactions on Neural Networks, 2:484–489, 1991.
[119] V. Kurkova, P. Savicky, and K. Hlavackova. Representations and rates of approximations of real-valued Boolean functions by neural networks. Neural Networks, 11:651–659, 1998.
[120] H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer, 1978.
[121] P. L. Lai and C. Fyfe. A neural implementation of canonical correlation analysis. Neural Networks, 12(10):1391–1397, 1999.
[122] D. LaLoudouanna and M. Tarare. Data set selection. Most original submission of NIPS'2002.
[123] E. Lavretsky. On the geometric convergence of neural approximations. IEEE Transactions on Neural Networks, 13:274–282, 2002.
[124] D.-L. Lee. Pattern sequence recognition using a time-varying Hopfield network. IEEE Transactions on Neural Networks, 13:330–342, 2002.
[125] W.-J. Li and T. Lee. Hopfield neural networks for affine invariant matching. IEEE Transactions on Neural Networks, 12:1400–1410, 2001.
[126] X. Liao, G. Chen, and E. Sanchez. Delay-dependent exponential stability analysis of neural networks: an LMI approach. Neural Networks, 15:855–866, 2002.
[127] C.-J. Lin. On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12:1288–1298, 2001.
[128] S. Lin and J. Si. Weight-value convergence of the SOM algorithm for discrete input. Neural Computation, 10(4):807–814, 1998.
[129] Y. Linde, A. Buzo, and R. Gray. An algorithm for vector quantization design. IEEE Transactions on Communications, 28:84–95, 1980.
[130] R. Linsker. How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1:402–411, 1989.
[131] W. Maass and C. Bishop, editors. Pulsed Neural Networks. MIT Press, 1998.
[132] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507–522, 1994.
[133] T. Martinetz, S. G. Berkovich, and K. J. Schulten. 'Neural-gas' networks for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4):558–569, 1993.
[134] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[135] N. Megiddo. On the complexity of polyhedral separability. Discrete and Computational Geometry, 3:325–337, 1988.
[136] M. Minsky and S. Papert. Perceptrons. MIT Press, 1969.
[137] B. K. Natarajan. Machine Learning: A Theoretical Approach. Morgan Kaufmann, 1991.
[138] R. Nock and M. Sebban. Sharper bounds for the hardness of prototype and feature selection. In H. Arimura, S. Jain, and A. Sharma, editors, Algorithmic Learning Theory, pages 224–237, Springer, 2000.
[139] C. Omlin and C. Giles. Rule revision with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering, 8(1):183–188, 1996.
[140] M. Opper. Statistical mechanics of generalization. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 1087–1090. MIT Press, 2nd edition, 2003.
[141] R. C. O'Reilly. Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8(5):895–938, 1996.
[142] J. Park, H. Cho, and D. Park. On the design of BSB neural associative memories using semidefinite programming. Neural Computation, 11:1985–1994, 1999.
[143] A. G. Parlos, S. K. Menon, and A. F. Atiya. An algorithmic approach to adaptive state filtering using recurrent neural networks. IEEE Transactions on Neural Networks, 12(6):1411–1432, 2001.
[144] J. Park and I. Sandberg. Approximation and radial-basis-function networks. Neural Computation, 5:305–316, 1993.
[145] B. A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5):1212–1228, 1995.
[146] J. Peng, H. Qiao, and Z.-B. Xu. A new approach to stability of neural networks with time-varying delays. Neural Networks, 15:95–103, 2002.
[147] C. Pinter. Complexity of network training for classes of neural networks. In K. P. Jantke, T. Shinohara, and T. Zeugmann, editors, ALT'95, pages 215–227. Springer, 1995.
[148] G. Pollastri, P. Baldi, A. Vullo, and P. Frasconi. Prediction of protein topologies using GIOHMMs and GRNNs. In Advances in Neural Information Processing Systems 2002.
[149] B. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
[150] H. Ritter. Asymptotic level density for a class of vector quantization processes. IEEE Transactions on Neural Networks, 2(1):173–175, 1991.
[151] H. Ritter. Self-organizing maps on non-Euclidean spaces. In E. Oja and S. Kaski, editors, Kohonen Maps, pages 97–110. Elsevier, 1999.
[152] H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps: An Introduction. Addison-Wesley, 1992.
[153] P. Rodriguez, J. Wiles, and J. L. Elman. A recurrent neural network that learns to count. Connection Science, 11(1):4–40, 1999.
[154] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239, 1998.
[155] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.
[156] F. Rosenblatt. Principles of Neurodynamics. Spartan Books, 1962.
[157] F. Rossi, B. Conan-Guez, and F. Fleuret. Theoretical properties of functional multi layer perceptrons. In M. Verleysen, editor, ESANN'02, pages 7–12. D-side publications, 2002.
[158] J. Rubner and P. Tavan. A self-organizing network for principal-component analysis. Europhysics Letters, 7(10):693–698, 1989.
[159] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[160] A. A. Sadeghi. Asymptotic behaviour of self-organizing maps with non-uniform stimuli distributions. Technical Report 166, Fachbereich Mathematik, Universität Kaiserslautern, July 1996.
[161] A. A. Sadeghi. Convergence in distribution of the multi-dimensional Kohonen algorithm. Journal of Applied Probability, 38(1):136–151, 2001.
[162] A. S. Sato and K. Yamada. Generalized learning vector quantization. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 423–429, MIT Press, 1995.
[163] F. Scarselli and A. Tsoi. Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Networks, 11:15–37, 1998.
[164] U. D. Schiller and J. J. Steil. On the weight dynamics of recurrent learning. To appear in ESANN'03.
[165] M. Schmitt. Descartes' rule of signs for radial basis function neural networks. Neural Computation, 14:2997–3011, 2002.
[166] M. Schmitt. On the complexity of computing and learning with multiplicative neural networks. Neural Computation, 14:241–301, 2002.
[167] H. T. Siegelmann and E. D. Sontag. Analog computation, neural networks, and circuits. Theoretical Computer Science, 131:331–360, 1994.
[168] J. Sima. Back-propagation is not efficient. Neural Networks, 9:1017–1023, 1996.
[169] J. Sima. Training a single sigmoidal neuron is hard. Neural Computation, 14:2709–2728, 2002.
[170] M. Saerens, P. Latinne, and C. Decaestecker. Any reasonable cost function can be used for a posteriori probability approximation. IEEE Transactions on Neural Networks, 13:1204–1210, 2002.
[171] P. Sollich and A. Halees. Learning curves for Gaussian process regression: Approximations and bounds. Neural Computation, 14:1393–1428, 2002.
[172] E. Sontag. Feedforward nets for interpolation and classification. Journal of Computer and System Sciences, 45:20–48, 1992.
[173] E. Sontag. VC dimension of neural networks. In C. Bishop, editor, Neural Networks and Machine Learning, pages 69–95. Springer, 1998.
[174] A. Sperduti. Neural networks for adaptive processing of structured data. In G. Dorffner, H. Bischof, and K. Hornik, editors, ICANN'2001, pages 5–12, Springer, 2001.
[175] A. Sperduti, D. Majidi, and A. Starita. Extended cascade-correlation for syntactic and structural pattern recognition. In P. Perner, P. Wang, and A. Rosenfeld, editors, Advances in Structured and Syntactical Pattern Recognition, Springer, pages 90–99, 1996.
[176] J. Steil. Local structural stability of recurrent networks with time-varying weights. Neurocomputing, 48:39–51, 2002.
[177] J. J. Steil and H. Ritter. Maximization of stability ranges for recurrent neural networks subject to on-line adaptation. In Proceedings ESANN'99, pages 369–374, 1999.
[178] J. J. Steil and H. Ritter. Recurrent learning of input-output stable behavior in function space: A case study with the Roessler attractor. In Proceedings ICANN'99, pages 761–766, 1999.
[179] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2002.
[180] M. Strickert and B. Hammer. Unsupervised recursive sequence processing. To appear in M. Verleysen, editor, ESANN'2003, D-side publications.
[181] M. K. Sundareshan and T. A. Condarcure. Recurrent neural-network training by a learning automaton approach for trajectory learning and control system design. IEEE Transactions on Neural Networks, 9(3):354–368, 1998.
[182] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[183] J. A. K. Suykens, B. D. Moor, and J. Vandewalle. Robust local stability of multilayer recurrent neural networks. IEEE Transactions on Neural Networks, 11(1):222–229, 2000.
[184] J. A. K. Suykens and J. Vandewalle. Recurrent least squares support vector machines. IEEE Transactions on Circuits and Systems, 2000.
[185] J. A. K. Suykens, J. Vandewalle, and B. D. Moor. NLq theory: Checking and imposing stability of recurrent neural networks for nonlinear modeling. IEEE Transactions on Signal Processing, 45(11):2682–2691, 1997.
[186] P. Talavan and J. Yáñez. Parameter setting of the Hopfield network applied to TSP. Neural Networks, 15:363–373, 2002.
[187] R. Tetzlaff, editor. Cellular Neural Networks and their Applications. World Scientific, 2002.
[188] P. Tiňo, B. G. Horne, and C. L. Giles. Attractive periodic sets in discrete-time recurrent networks (with emphasis on fixed-point stability and bifurcations in two-neuron networks). Neural Computation, 13:1379–1414, 2001.
[189] L. Valiant. A theory of the learnable. Communications of the ACM, 27:1134–1142, 1984.
[190] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
[191] M. M. van Hulle. Faithful Representations and Topographic Maps. John Wiley, 2000.
[192] M. M. van Hulle and D. Martinez. On an unsupervised learning rule for scalar quantization following the maximum entropy principle. Neural Computation, 5(6):939–953, 1993.
[193] M. Vidyasagar. A Theory of Learning and Generalization. Springer, 1997.
[194] T. Villmann. Topologieerhaltung in selbstorganisierenden neuronalen Merkmalskarten. Harri Deutsch, Reihe Physik, 66, 1996.
[195] T. Villmann. Topology preservation in self-organizing maps. In E. Oja and S. Kaski, editors, Kohonen Maps, pages 279–292, Elsevier, 1999.
[196] T. Villmann. Controlling strategies for the magnification factor in the neural gas network. Neural Network World, 10(4):739–750, 2000.
[197] T. Villmann, R. Der, M. Herrmann, and T. Martinetz. Topology preservation in self-organizing feature maps: Exact definition and measurement. IEEE Transactions on Neural Networks, 8(2):256–266, 1997.
[198] T. Voegtlin. Context quantization and contextual self-organizing maps. In Proceedings of the International Joint Conference on Neural Networks, volume 5, pages 20–25, 2000.
[199] T. Voegtlin. Recursive self-organizing maps. Neural Networks, 15(8-9):979–992, 2002.
[200] T. Voegtlin and P. F. Dominey. Recursive self-organizing maps. In N. Allinson, H. Yin, L. Allinson, and J. Slack, editors, Advances in Self-Organizing Maps, pages 210–215, Springer, 2001.
[201] C. von der Malsburg and W. Singer. Principles of cortical network organization. In P. Rakic and W. Singer, editors, Neurobiology of Neocortex, pages 69–99, John Wiley, 1988.
[202] C. von der Malsburg and D. J. Willshaw. A mechanism for producing continuous neural mappings: Ocularity dominance stripes and ordered retino-tectal projections. In Afferent and Intrinsic Organization of Laminated Structures in the Brain, Experimental Brain Research, Supplement 1, pages 463–469, 1976.
[203] T. Villmann, B. Hammer, and M. Strickert. Supervised neural gas for learning vector quantization. In D. Polani and T. Martinetz, editors, Proceedings of GWAL-5, IOS Press, 9–18, 2002.
[204] G. Wahba. Generalization and regularization in nonlinear learning systems. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 426–430. MIT Press, Cambridge, MA, 1995.
[205] X. Wang. Period-doublings to chaos in a simple neural network: An analytic proof. Complex Systems, 5:425–442, 1991.
[206] M. Welling and G. Hinton. A new learning algorithm for mean field Boltzmann machines. In J. Dorronsoro, editor, ICANN'02, pages 351–356. Springer, 2002.
[207] S. Weng and J. Steil. Data driven generation of interactions for feature binding. In J. Dorronsoro, editor, ICANN'02, pages 432–437. Springer, 2002.
[208] P. J. Werbos. The Roots of Backpropagation. Wiley, 1994.
[209] H. Wersing. Learning lateral interactions for feature binding and sensory segmentation. In Advances in Neural Information Processing Systems (NIPS), 2001.
[210] H. Wersing, W.-J. Beyn, and H. Ritter. Dynamical stability conditions for recurrent neural networks with unsaturating piecewise linear transfer functions. Neural Computation, 13:1811–1825, 2001.
[211] D. J. Willshaw and C. von der Malsburg. How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London, Series B, 194:431–445, 1976.
[212] P. L. Zador. Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Transactions on Information Theory, 28:149–159, 1982.
[213] T. Zhang. Effective dimension and generalization of kernel learning. In NIPS'02.
