IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 1, NO. 1, JANUARY 2015


The Kernel Adaptive Autoregressive-Moving-Average Algorithm

Kan Li, Student Member, IEEE, and José C. Príncipe, Fellow, IEEE

Abstract—In this paper, we present a novel kernel adaptive recurrent filtering algorithm based on the autoregressive-moving-average (ARMA) model, which is trained with recurrent stochastic gradient descent in the reproducing kernel Hilbert spaces (RKHSs). This kernelized recurrent system, the kernel adaptive ARMA (KAARMA) algorithm, brings together the theories of adaptive signal processing and recurrent neural networks (RNNs), extending the current theory of kernel adaptive filtering (KAF) using the representer theorem to include feedback. Compared with classical feedforward KAF methods, the KAARMA algorithm provides general nonlinear solutions for complex dynamical systems in a state-space representation, with a deferred teacher signal, by propagating forward the hidden states. We demonstrate its capabilities to provide exact solutions with compact structures by solving a set of benchmark NP-complete problems involving grammatical inference. Simulation results show that the KAARMA algorithm outperforms equivalent input-space recurrent architectures using first and second-order RNNs, demonstrating its potential as an effective learning solution for the identification and synthesis of deterministic finite automata (DFA).

Index Terms—kernel adaptive filtering (KAF), recurrent neural network (RNN), reproducing kernel Hilbert space (RKHS), deterministic finite automaton (DFA)

I. INTRODUCTION

Kernel methods [1] create a powerful unifying framework for classification, clustering, and regression of both numeric and symbolic data, with countless applications in machine learning, signal processing, and biomedical engineering. The theory of adaptive signal processing is greatly enhanced through the integration of the theory of reproducing kernel Hilbert space (RKHS). Performing classical linear methods in an enriched feature space, kernel adaptive filtering (KAF) [2] moves beyond the limitations of the linear model to provide general nonlinear solutions in the original input space, which is not restricted to numeric data (e.g., spike trains [3]). KAF brings together adaptive signal processing and feedforward artificial neural networks, combining the best of both worlds: the universal approximation property of neural networks and the simple convex optimization of linear adaptive filters.

Manuscript received August 15, 2014; revised January 9, 2015. This work was supported by DARPA Contract N66001-10-C-2008. Kan Li and José C. Príncipe are with the Computational NeuroEngineering Laboratory, University of Florida, Gainesville, FL 32611 USA (email: {likan, principe}@ufl.edu).

Since its introduction, KAF has rapidly gained traction, thanks to its simplicity and usefulness. Despite its youth, KAF is already well-established in the literature [4]–[6] for solving online nonlinear system identification. However, there still exist two principal bottlenecks for current KAF implementations. First, the filter's ability to cope with general nonlinear dynamical system characteristics is inadequate [7]. Most research has focused on time-delay feedforward implementations of kernel methods. Feedforward systems have limited generalization capability and are best suited for learning a priori defined and fixed memory mappings of input-output pairs. The use of lengthy time-embedded inputs to provide memory for modeling dynamics results in increased computational overhead and large network size. The second bottleneck is the lack of a computationally efficient and compact solution. For feedforward kernel adaptive filters, the exact solution is almost never obtained, and its approximation may require an excessively high number of coefficients due to the growing nature of the radial basis function (RBF) network during training. To combat the computational and memory cost imposed by the linear growth, many sparsification techniques have been proposed to restrict the network size or budget [5], [8]–[11]. The central theme is to accept only those centers that conform to a given criterion, an approach that is most applicable for stationary learning. For nonstationary environments, various pruning strategies have been proposed to maintain the budget [12]–[14]. The main shortcoming of this type of formulation is that the network only learns and retains the most recent dynamics. Whenever it updates, a portion of the past is forgotten or erased. Without any long-term memory, after enough time has elapsed, data have to be relearned from scratch when revisited. Both bottlenecks can be effectively addressed by a kernelized version of the recurrent neural network (RNN).
RNNs are proven to be universal approximators of dynamical systems (DS) [15] and have been successfully used for identification and control of nonlinear DS [16]. A recurrent network operating with just the current sample achieves memory via feedback of internal states or the outputs through time-delay units. The input and output are no longer independent stationary vectors, but correlated temporal sequences. It is a much more powerful framework for describing the underlying dynamics of a system than a feedforward network. Despite their promising nature, RNNs' popularity has not matched up to their potential, due to the difficulty of training. Standard gradient-based training techniques for RNNs include backpropagation through time (BPTT) [17], extended Kalman filtering (EKF) [18], and real-time recurrent learning (RTRL) [19]. A full review of RNNs is outside the scope of this paper; for recent work, see [20]. A second approach to training RNNs utilizes a large network of randomly initialized recurrent connections or a dynamic reservoir. These weight values remain fixed, and only a readout layer is trained to optimally project the network states onto the desired output. This reservoir computing (RC) framework is further identified as echo state networks (ESNs) [21] for continuous valued inputs and liquid state machines (LSMs) [22] for spike train inputs. RC is generally thought to be free from the problems associated with gradient-based RNN training such as slow convergence, local optima, and computational complexity. However, performance depends upon random parameters that need to be appropriately cross validated to find optimal solutions, without which RC is a less reliable convex universal learning machine (CULM) than KAF methods [23].

A kernelized RNN also opens the door to symbolic representations and simpler solutions. A discrete-time dynamical system (DTDS) with a discrete state space can be modeled by a finite-state machine (FSM). Deterministic finite automata (DFA) are FSMs with all state transitions uniquely determined by input symbols. For nondeterministic finite automata (NFA), equivalent DFA can be derived using the subset construction algorithm [24]. From a systems theory perspective, finite automata are solutions to general DTDS. By bridging the theory of kernel adaptive signal processing with RNNs, we fill a void in the current theory of KAF.
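To make the automata terminology above concrete, the following is a minimal Python sketch of a DFA and of the subset construction for an NFA without ε-transitions. The function names (`run_dfa`, `subset_construction`) and the example automaton (binary strings ending in "01") are assumptions chosen for illustration; they are not taken from this paper or from [24].

```python
def run_dfa(transitions, start, accepting, string):
    """Return True if the DFA accepts `string`.

    `transitions` maps (state, symbol) -> next state, so every move is
    uniquely determined by the current state and the input symbol.
    """
    state = start
    for symbol in string:
        state = transitions[(state, symbol)]
    return state in accepting


def subset_construction(nfa, start, accepting, alphabet):
    """Convert an NFA (no epsilon moves) into an equivalent DFA.

    `nfa` maps (state, symbol) -> set of possible next states.
    Each DFA state is a frozenset of NFA states.
    """
    accepting = frozenset(accepting)
    start_set = frozenset([start])
    dfa, seen, todo = {}, {start_set}, [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            # Union of all NFA moves available from any state in `current`.
            nxt = frozenset(t for q in current for t in nfa.get((q, symbol), ()))
            dfa[(current, symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    # A DFA state accepts if it contains at least one accepting NFA state.
    dfa_accepting = {subset for subset in seen if subset & accepting}
    return dfa, start_set, dfa_accepting


# Hypothetical NFA accepting binary strings that end in "01".
nfa = {("q0", "0"): {"q0", "q1"}, ("q0", "1"): {"q0"}, ("q1", "1"): {"q2"}}
dfa, dfa_start, dfa_accepting = subset_construction(nfa, "q0", {"q2"}, "01")
print(run_dfa(dfa, dfa_start, dfa_accepting, "1101"))
print(run_dfa(dfa, dfa_start, dfa_accepting, "0110"))
```

The resulting DFA has one state per reachable subset of NFA states, which is why its transitions are deterministic even though the NFA's are not.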
In this paper, we propose a novel exact gradient-following kernel spatio-temporal filter, which we call the kernel adaptive autoregressive-moving-average (KAARMA) algorithm. By mapping the input symbols into a potentially infinite-dimensional feature space, an adaptive filter with feedback can be trained to approximate any dynamical or nonlinear time-dependent relationships in the original input space, while preserving the conveniences associated with a linear structure in the RKHS. Moreover, by mimicking an FSM via feedback and fixed-point behavior, an exact compact solution can be derived from a trained KAARMA network for certain DTDS. We demonstrate the computational power of the KAARMA algorithm by solving a set of benchmark grammatical inference problems and comparing its performance with RNNs operating on equivalent recurrent architectures in the input space. Furthermore, we show that KAARMA-based DFA can outperform LSMs on spike data, which opens the door for many novel neuroscience applications.

The rest of this paper is organized as follows. In Section II, we provide an overview of current developments in KAF. We present our novel KAARMA algorithm in Section III. Performance of the proposed algorithm is evaluated in Section IV. Section V concludes this paper. A brief note on automata theory and regular grammar for identification and synthesis of DFA using the KAARMA approach is given in the Appendix.

II. SURVEY OF KERNEL ADAPTIVE FILTERS

We begin with a survey of the recent history of kernel adaptive filtering. This section serves to establish the notations and helps to distinguish our contributions. In the family of kernel adaptive filters, the kernel least mean square (KLMS) algorithm [2] is the simplest. A finite impulse response (FIR) filter trained in the RKHS using the least mean squares (LMS) algorithm, it can be viewed as a single-layer feedforward neural network or perceptron. For a set of $n$ input-output pairs $\{(\mathbf{u}_1, y_1), (\mathbf{u}_2, y_2), \cdots, (\mathbf{u}_n, y_n)\}$, the input vector $\mathbf{u}_i \in \mathbb{U} \subseteq \mathbb{R}^m$ (where $\mathbb{U}$ is a compact input domain in $\mathbb{R}^m$) is mapped into a potentially infinite-dimensional feature space $\mathbb{F}$. Defining a $\mathbb{U} \to \mathbb{F}$ mapping $\varphi(\mathbf{u})$, the feature-space parametric model becomes

$$\hat{y} = \hat{f}(\mathbf{u}) = \boldsymbol{\Omega}^T \varphi(\mathbf{u}) \tag{1}$$

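where $\boldsymbol{\Omega}$ is the weight vector in the RKHS. Using the representer theorem [25] and the “kernel trick”, (1) can be written as

$$\hat{f}(\mathbf{u}) = \sum_{i=1}^{n} \alpha_i K(\mathbf{u}_i, \mathbf{u}) \tag{2}$$

where $K(\mathbf{u}, \mathbf{u}')$ is a Mercer kernel, corresponding to the inner product $\langle \varphi(\mathbf{u}), \varphi(\mathbf{u}') \rangle$, and $\alpha_i$ are the coefficients. The most commonly used kernel is the Gaussian kernel

$$K_a(\mathbf{u}, \mathbf{u}') = \exp\left(-a \|\mathbf{u} - \mathbf{u}'\|^2\right) \tag{3}$$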

where $a > 0$ is the kernel parameter. Despite their universal approximation property, conventional KAF algorithms share the same filter structure as their linear input-space analogs, thus performing poorly when modeling dynamical systems. For a nonstationary environment, the adaptive filter is tasked with the additional job of tracking statistical variations. This steady-state phenomenon is distinguished from the convergence or transient behavior. The LMS filter assumes the following state-space model (SSM)

$$\mathbf{x}_{i+1} = \mathbf{x}_i \tag{4}$$

$$y_i = \mathbf{u}_i^T \mathbf{x}_i + v_i \tag{5}$$


where $\mathbf{x}_i$ is the state at time $i$ and $v_i$ is the measurement noise. For the LMS and KLMS algorithms, the state or steady-state weight is fixed. The recursive least squares (RLS) and kernel RLS (KRLS) [5] algorithms share a similar state-transition model [26]

$$\mathbf{x}_{i+1} = \lambda^{-1/2} \mathbf{x}_i \tag{6}$$

where $0 < \lambda \leq 1$ is a constant forgetting factor. In both formulations, the conditions are equivalent to a stationary environment. The extended KRLS (Ex-KRLS) algorithm [6] is the kernelized extended RLS algorithm [27] and can only model a random walk, i.e.,

$$\mathbf{x}_{i+1} = \mathbf{x}_i + \mathbf{w}_i \tag{7}$$

where $\mathbf{w}_i$ is the state or process noise. For linear DS, the Kalman filter [28] is the optimal estimator on the merit of second-order statistics, with the RLS algorithm being a special case [26]. Several Kalman-based partial solutions have been proposed under the KAF framework for nonlinear SSMs. The Ex-KRLS with Kalman filtering [29] is a hybrid approach where the known transition function is constructed in the original state space (estimated using the extended Kalman filter [30]), and the measurement model is learned using KRLS. For the simple additive noise measurement model, the kernel Kalman filter (KKF) [31] is a batch method relying on subspace approximation by kernel principal component analysis. The kernel Kalman filtering with conditional embedding operator (KKF-CEO) algorithm [32] treats the estimated measurement embeddings as hidden states in the RKHS. The conditional distribution $P_{Y|x}(y)$ in the RKHS can be estimated in the original input space through the conditional embedding operator [33] and kernel Kalman filtering. In order to properly model a DS, both the state-transition and measurement equations have to be general solutions. The bottleneck lies in the recursive state-transition model, since the states are hidden. For multistep prediction problems where the desired signal is not accessible at all times, an incremental learning approach can be taken, using the difference between successive predictions. This concept is generalized by the TD($\lambda$) family of learning procedures, where $0 \leq \lambda \leq 1$ is the exponential weighting factor [34]. In [35], the kernelized TD($\lambda$) algorithm was proposed as a nonlinear solution. However, TD learning suffers from the same inadequacy as the LMS algorithm from a state-space view. Theorem 1 in [34] proves that the linear TD(1) procedure produces the same per-sequence weight changes as the LMS algorithm. The notion of states is closely related to the concept of memory.
Unlike an FIR system, the memory depth of an infinite impulse response (IIR) filter is independent of the filter order and the number of adaptive parameters, making it ideal for modeling dynamics characterized by a deep memory structure yet with a small number of free parameters [36]. However, ensuring stability during IIR adaptation is difficult, and the error surface is nonconvex [37]. In [36], an FIR-IIR hybrid filter or generalized feedforward filter was proposed. Specifically, the gamma ($\gamma$) filter was analyzed in detail. It is an IIR filter with a restricted or adjustable memory depth, controlled by the adaptive parameter $\mu$ (when $\mu = 1$, the $\gamma$-filter becomes an FIR filter). This work is extended into the RKHS in [38], resulting in a $\gamma$-structured recursive kernel formulation that generalizes the KLMS algorithm. The kernel RTRL [39] is based on an RNN. However, because its derivation assumes a constant weight over the entire state trajectory, instantaneously altering it does not lead to true changes along the negative gradient of the cost function. For input-space RTRL, this problem can usually be avoided by lowering the learning rate, such that it is much slower than the time scale of the DS. Unfortunately, this bit of pragmatism does not translate directly into the RKHS. Unlike a conventional RNN, which has the activation functions fixed in advance and only the weights are adjusted, the weights in the RKHS are functions in the input space. There is no guarantee that the functions are modified properly in real-time, regardless of the initial condition or learning rate [39]. In practice, it degrades to a teacher-forced multiple-input and multiple-output (MIMO) KLMS algorithm. Another class of kernel-induced feature space ARMA methods involves support vector machines (SVMs). In [40] the SVM-ARMA2k method was formulated using uncoupled input and output feature spaces, and the SVM-ARMA4k was formulated using a composite sum kernel. However, both formulations rely on sufficient delay embedding for memory.
Similarly, the support vector regression (SVR) based MIMO kernel ARMA method [41] assumes a fully observable state trajectory and uses the desired states for updates.

Fig. 1. General state-space model for dynamical system. (Block diagram: the input $\mathbf{u}_i$ and the previous state $\mathbf{x}_{i-1}$ feed the hidden state model, which produces the state $\mathbf{x}_i$; the measurement model maps $\mathbf{x}_i$ to the output $\mathbf{y}_i$.)

Our approach differs from previous research because KAARMA is developed as a true IIR system in the RKHS with a full state-space model. All the previous KAF methods operate on subspaces or build the operators directly in the RKHS using statistical embedding, which is extremely time consuming. To make it practical as in KAF, we exploit the representer theorem. Unlike adaptive IIR filters [37], which consist of a feedforward element using delayed samples of the input $\mathbf{u}_{i-j+1}$ (where $j \geq 1$) and a feedback element using past output samples $\mathbf{y}_{i-j}$, here, we explicitly construct and learn a hidden third input signal: the state variables $\mathbf{x}$. Furthermore, unlike the recursive kernel methods in [38] and [42], at each time step, we use the representer theorem and the kernel trick to realize the state vector in the input space, where system dynamics are completely encoded. Given the state, the current input is completely decoupled from the past inputs. Therefore, the KAARMA filter can operate using only the current input sample and achieves memory via feedback of internal states in the input space. Moreover, as a recurrent KAF algorithm, filter centers can be placed in the part of the space where data is defined and stability is ensured. The proposed gradient-descent adaptive procedure learns unknown general nonlinear continuous time state-transition and measurement equations, using only an input sequence and the observed outputs. This viewpoint is inspired by RNNs for learning DS.
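As a point of reference for the feedforward baseline surveyed in this section, the kernel expansion (2) with the Gaussian kernel (3) can be sketched in a few lines of Python. The update below, in which each new sample becomes a center whose coefficient is the step size times the prediction error, is the usual KLMS recursion as we understand it from [2]; the class name, parameter names, and toy data are assumptions made for illustration only.

```python
import math


def gaussian_kernel(u, v, a=1.0):
    """Gaussian kernel K_a(u, v) = exp(-a * ||u - v||^2), as in (3)."""
    return math.exp(-a * sum((x - y) ** 2 for x, y in zip(u, v)))


class KLMS:
    """Feedforward kernel LMS: f(u) = sum_i alpha_i K(u_i, u), as in (2)."""

    def __init__(self, step_size=0.5, a=1.0):
        self.step_size = step_size  # learning rate (assumed name)
        self.a = a                  # kernel parameter a > 0
        self.centers = []           # stored inputs u_i
        self.alphas = []            # coefficients alpha_i

    def predict(self, u):
        return sum(alpha * gaussian_kernel(c, u, self.a)
                   for c, alpha in zip(self.centers, self.alphas))

    def update(self, u, d):
        """Add u as a new center with coefficient step_size * error."""
        error = d - self.predict(u)
        self.centers.append(tuple(u))
        self.alphas.append(self.step_size * error)
        return error


# Toy usage: one pass over samples of a sine wave (hypothetical data).
filt = KLMS(step_size=0.5, a=1.0)
samples = [((x / 10.0,), math.sin(math.pi * x / 10.0)) for x in range(-10, 11)]
for u, d in samples:
    filt.update(u, d)
print(len(filt.centers))  # one center per processed sample
```

The dictionary grows by one center per update, which is the linear growth that the sparsification techniques cited in the Introduction aim to control. The filter has no internal state, so any memory must come from time-embedding the input.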


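III. KERNEL ADAPTIVE ARMA ALGORITHM

Let a dynamical system (Fig. 1) be defined in terms of a general continuous nonlinear state transition and observation functions, $\mathbf{f}(\cdot, \cdot)$ and $\mathbf{h}(\cdot)$, respectively,

$$\mathbf{x}_i = \mathbf{f}(\mathbf{x}_{i-1}, \mathbf{u}_i) \tag{8}$$

$$\mathbf{y}_i = \mathbf{h}(\mathbf{x}_i) \tag{9}$$

where

$$\mathbf{f}(\mathbf{x}_{i-1}, \mathbf{u}_i) \triangleq \left[f^{(1)}(\mathbf{x}_{i-1}, \mathbf{u}_i), \cdots, f^{(n_x)}(\mathbf{x}_{i-1}, \mathbf{u}_i)\right]^T = \left[x_i^{(1)}, \cdots, x_i^{(n_x)}\right]^T \tag{10}$$

$$\mathbf{h}(\mathbf{x}_i) \triangleq \left[h^{(1)}(\mathbf{x}_i), \cdots, h^{(n_y)}(\mathbf{x}_i)\right]^T = \left[y_i^{(1)}, \cdots, y_i^{(n_y)}\right]^T \tag{11}$$

with input $\mathbf{u}_i \in \mathbb{R}^{n_u}$, state $\mathbf{x}_i \in \mathbb{R}^{n_x}$, output $\mathbf{y}_i \in \mathbb{R}^{n_y}$, and the parenthesized superscript $(k)$ indicating the $k$-th component of a vector or the $k$-th column of a matrix. Note that the input, state, and output vectors have independent degrees of freedom or dimensionality.

For simplicity, we rewrite (8) and (9) in terms of a new hidden state vector

$$\mathbf{s}_i \triangleq \begin{bmatrix} \mathbf{x}_i \\ \mathbf{y}_i \end{bmatrix} = \begin{bmatrix} \mathbf{f}(\mathbf{x}_{i-1}, \mathbf{u}_i) \\ \mathbf{h} \circ \mathbf{f}(\mathbf{x}_{i-1}, \mathbf{u}_i) \end{bmatrix} \tag{12}$$

$$\mathbf{y}_i = \mathbf{s}_i^{(n_s - n_y + 1 : n_s)} = \underbrace{\begin{bmatrix} \mathbf{0} & \mathbf{I}_{n_y} \end{bmatrix}}_{\mathbb{I}} \begin{bmatrix} \mathbf{x}_i \\ \mathbf{y}_i \end{bmatrix} \tag{13}$$

where $\mathbf{I}_{n_y}$ is an $n_y \times n_y$ identity matrix, $\mathbf{0}$ is an $n_y \times n_x$ zero matrix, and $\circ$ is the function composition operator. This augmented state vector $\mathbf{s}_i \in \mathbb{R}^{n_s}$ is formed by concatenating the output $\mathbf{y}_i$ with the original state vector $\mathbf{x}_i$. With this rewriting, the measurement equation simplifies to a fixed selector matrix $\mathbb{I} \triangleq \begin{bmatrix} \mathbf{0} & \mathbf{I}_{n_y} \end{bmatrix}$.

Next, we define an equivalent transition function $\mathbf{g}(\mathbf{s}_{i-1}, \mathbf{u}_i) = \mathbf{f}(\mathbf{x}_{i-1}, \mathbf{u}_i)$ taking as argument the new state variable $\mathbf{s}$. Using this notation, (8) and (9) become

$$\mathbf{x}_i = \mathbf{g}(\mathbf{s}_{i-1}, \mathbf{u}_i) \tag{14}$$

$$\mathbf{y}_i = \mathbf{h}(\mathbf{x}_i) = \mathbf{h} \circ \mathbf{g}(\mathbf{s}_{i-1}, \mathbf{u}_i). \tag{15}$$

To learn the general continuous nonlinear transition and observation functions, $\mathbf{g}(\cdot, \cdot)$ and $\mathbf{h} \circ \mathbf{g}(\cdot, \cdot)$, respectively, we apply the theory of RKHS. First, we map the augmented state vector $\mathbf{s}_i$ and the input vector $\mathbf{u}_i$ into two separate RKHSs as $\varphi(\mathbf{s}_i) \in \mathcal{H}_s$ and $\phi(\mathbf{u}_i) \in \mathcal{H}_u$, respectively. By the representer theorem, the state-space model defined by (14) and (15) can be expressed as the following set of weights (functions in the input space) in the joint RKHS $\mathcal{H}_{su} \triangleq \mathcal{H}_s \otimes \mathcal{H}_u$

$$\boldsymbol{\Omega} \triangleq \boldsymbol{\Omega}_{\mathcal{H}_{su}} \triangleq \begin{bmatrix} \mathbf{g}(\cdot, \cdot) \\ \mathbf{h} \circ \mathbf{g}(\cdot, \cdot) \end{bmatrix} \tag{16}$$

where $\otimes$ is the tensor-product operator. We define the new features in the tensor-product RKHS as

$$\psi(\mathbf{s}_{i-1}, \mathbf{u}_i) \triangleq \varphi(\mathbf{s}_{i-1}) \otimes \phi(\mathbf{u}_i) \in \mathcal{H}_{su}. \tag{17}$$

It follows that the tensor-product kernel is defined by

$$\langle \psi(\mathbf{s}, \mathbf{u}), \psi(\mathbf{s}', \mathbf{u}') \rangle_{\mathcal{H}_{su}} = K_{su}(\mathbf{s}, \mathbf{u}, \mathbf{s}', \mathbf{u}') = (K_s \otimes K_u)(\mathbf{s}, \mathbf{u}, \mathbf{s}', \mathbf{u}') = K_s(\mathbf{s}, \mathbf{s}') \cdot K_u(\mathbf{u}, \mathbf{u}'). \tag{18}$$

This construction has several advantages over the simple concatenation of the input $\mathbf{u}$ and the state $\mathbf{s}$. First, the tensor product kernel of two positive definite kernels is also a positive definite kernel [1]. Second, since the adaptive filtering is performed in an RKHS using features, there is no constraint on the original input signals or the number of signals, as long as we use the appropriate reproducing kernel for each signal. Last but not least, this formulation imposes no restriction on the relationship between the signals in the original input space. This is important for input signals having different representations and spatiotemporal scales. For example, under this framework, we can model a neurobiological system, taking spike trains, continuous amplitude local field potentials (LFPs), and vectorized state variables as inputs.

Finally, the kernel state-space model becomes

$$\mathbf{s}_i = \boldsymbol{\Omega}^T \psi(\mathbf{s}_{i-1}, \mathbf{u}_i) \tag{19}$$

$$\mathbf{y}_i = \mathbb{I} \mathbf{s}_i. \tag{20}$$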
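The kernel state-space model (19) and (20) can be illustrated with a small Python sketch. It assumes that the weights are a finite kernel expansion over a dictionary of state/input centers with a coefficient matrix `A`, and that both kernels are Gaussian, so that evaluating (19) reduces to a weighted sum of tensor-product kernel values (18). All function names, the dictionary layout, and the toy numbers are hypothetical.

```python
import math


def gauss(x, y, a):
    """Gaussian kernel exp(-a * ||x - y||^2)."""
    return math.exp(-a * sum((p - q) ** 2 for p, q in zip(x, y)))


def tensor_kernel(s, u, s2, u2, a_s, a_u):
    """Tensor-product kernel (18): K_s(s, s') * K_u(u, u')."""
    return gauss(s, s2, a_s) * gauss(u, u2, a_u)


def kaarma_step(S, U, A, s_prev, u, a_s, a_u):
    """One evaluation of (19), assuming Omega is expanded over centers.

    S[j], U[j] are the state and input of the j-th center, and A[j] holds
    its n_s coefficients, so s_i[r] = sum_j A[j][r] * K_su(center_j, current).
    """
    k = [tensor_kernel(sj, uj, s_prev, u, a_s, a_u) for sj, uj in zip(S, U)]
    n_s = len(A[0])
    return [sum(A[j][r] * k[j] for j in range(len(k))) for r in range(n_s)]


def readout(s, n_y):
    """Measurement (20): the selector [0 I_ny] keeps the last n_y entries."""
    return s[len(s) - n_y:]


def run(S, U, A, s0, inputs, n_y, a_s, a_u):
    """Feed a sequence one sample at a time; memory lives in the state."""
    s = list(s0)
    for u in inputs:
        s = kaarma_step(S, U, A, s, u, a_s, a_u)
    return readout(s, n_y)


# Toy dictionary with two centers, n_s = 2, n_u = 1, n_y = 1.
S = [(0.0, 0.0), (0.5, -0.5)]
U = [(0.0,), (1.0,)]
A = [[0.3, -0.2], [0.6, 0.4]]
y = run(S, U, A, s0=(0.0, 0.0), inputs=[(1.0,), (0.0,), (1.0,)],
        n_y=1, a_s=1.0, a_u=1.0)
print(y)
```

Because the state and input enter through separate kernels, they may have different dimensions and kernel parameters, which is the flexibility argued for above.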


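Fig. 2. Kernel adaptive autoregressive-moving-average network.

Fig. 2 shows a simple KAARMA network. In general, the states $\mathbf{s}_i$ are assumed hidden, and the desired signal does not need to be available at every time step, e.g., a deferred desired output value for $\mathbf{y}_i$ may only be observed at the final indexed step $i = f$, i.e., $\mathbf{d}_f$.

A. Kernel Adaptive Recurrent Filtering

The learning procedure presented here computes the exact error gradient in the RKHS. For simplicity, we consider only the Gaussian kernel in the derivation. For the state and input vectors, the joint inner products are computed using $K_{a_s}(\mathbf{s}, \mathbf{s}')$ and $K_{a_u}(\mathbf{u}, \mathbf{u}')$, respectively. The cost function at time $i$ is defined as

$$\varepsilon_i = \frac{1}{2} \mathbf{e}_i^T \mathbf{e}_i \tag{21}$$

where $\mathbf{e}_i = \mathbf{d}_i - \mathbf{y}_i \in \mathbb{R}^{n_y \times 1}$ is the error vector, with $\mathbf{d}_i$ as the desired signal. The error gradient with respect to the RKHS weights $\boldsymbol{\Omega}_i$ at time $i$ is

$$\frac{\partial \varepsilon_i}{\partial \boldsymbol{\Omega}_i} = \frac{\partial \mathbf{e}_i^T \mathbf{e}_i}{2 \partial \boldsymbol{\Omega}_i} = -\mathbf{e}_i^T \frac{\partial \mathbf{y}_i}{\partial \boldsymbol{\Omega}_i} \tag{22}$$

where the partial derivative $\frac{\partial \mathbf{y}_i}{\partial \boldsymbol{\Omega}_i}$ consists of $n_s$ terms, $\frac{\partial \mathbf{y}_i}{\partial \boldsymbol{\Omega}_i^{(1)}}, \frac{\partial \mathbf{y}_i}{\partial \boldsymbol{\Omega}_i^{(2)}}, \cdots, \frac{\partial \mathbf{y}_i}{\partial \boldsymbol{\Omega}_i^{(n_s)}}$, corresponding to the state dimension. Note that the feature-space weights $\boldsymbol{\Omega}_i$ are functions with potentially infinite dimension. Fortunately, the functional derivative is well-posed in the RKHS: since Hilbert spaces are complete normed vector spaces, we can use the Fréchet derivative [43] to compute (22). For the $k$-th component of $\boldsymbol{\Omega}_i$, the gradient can be expanded using the chain rule as

$$\frac{\partial \varepsilon_i}{\partial \boldsymbol{\Omega}_i^{(k)}} = -\mathbf{e}_i^T \frac{\partial \mathbf{y}_i}{\partial \boldsymbol{\Omega}_i^{(k)}} = -\mathbf{e}_i^T \frac{\partial \mathbf{y}_i}{\partial \mathbf{s}_i} \frac{\partial \mathbf{s}_i}{\partial \boldsymbol{\Omega}_i^{(k)}} \tag{23}$$

where $\frac{\partial \mathbf{y}_i}{\partial \mathbf{s}_i} = \mathbb{I}$. Applying the product rule to the gradient yields

$$\frac{\partial \mathbf{s}_i}{\partial \boldsymbol{\Omega}_i^{(k)}} = \frac{\partial \boldsymbol{\Omega}_i^T \psi(\mathbf{s}_{i-1}, \mathbf{u}_i)}{\partial \boldsymbol{\Omega}_i^{(k)}} = \boldsymbol{\Omega}_i^T \frac{\partial \psi(\mathbf{s}_{i-1}, \mathbf{u}_i)}{\partial \boldsymbol{\Omega}_i^{(k)}} + \mathbf{I}_{n_s}^{(k)} \psi(\mathbf{s}_{i-1}, \mathbf{u}_i)^T \tag{24}$$

where $\mathbf{I}_{n_s}^{(k)} \in \mathbb{R}^{n_s}$ is the $k$-th column of the $n_s \times n_s$ identity matrix. The distinction between a recurrent formulation and the normal feed-forward KAF lies in the gradient term on the right-hand side of (24). In a recurrent network, past states are coupled with the current state input through feedback. Consequently, the partial derivatives of the previous states with respect to the current filter weights are nonzero.

Using the representer theorem, weights $\boldsymbol{\Omega}_i$ at time $i$ can be written as a linear combination of prior features

$$\boldsymbol{\Omega}_i = \boldsymbol{\Psi}_i \mathbf{A}_i \tag{25}$$

where $\boldsymbol{\Psi}_i \triangleq [\psi(\mathbf{s}_{-1}, \mathbf{u}_0), \cdots, \psi(\mathbf{s}_{m-2}, \mathbf{u}_{m-1})] \in \mathbb{R}^{n_\psi \times m}$ is a collection of the $m$ past tensor-product features with potentially infinite dimension $n_\psi$, and $\mathbf{A}_i \triangleq [\boldsymbol{\alpha}_{i,1}, \cdots, \boldsymbol{\alpha}_{i,n_s}] \in \mathbb{R}^{m \times n_s}$ is the set of corresponding coefficients. For feedforward KAF such as the KLMS algorithm, the number of basis functions grows linearly with each new sample, i.e., $m = i$. Here, we use $m$ to denote a dictionary $\boldsymbol{\Psi}_i$ of arbitrary size, with $\psi(\mathbf{s}_{-1}, \mathbf{u}_0)$ initialization. Thus, the $k$-th component ($1 \leq k \leq n_s$) of the filter weights at time $i$ becomes

$$\boldsymbol{\Omega}_i^{(k)} = \boldsymbol{\Psi}_i \mathbf{A}_i^{(k)} = \boldsymbol{\Psi}_i \boldsymbol{\alpha}_{i,k}. \tag{26}$$

Substituting the expression for weights $\boldsymbol{\Omega}_i$ in (25) into the feedback gradient on the right-hand side of (24) and applying the chain rule gives

$$\boldsymbol{\Omega}_i^T \frac{\partial \psi(\mathbf{s}_{i-1}, \mathbf{u}_i)}{\partial \boldsymbol{\Omega}_i^{(k)}} = \mathbf{A}_i^T \frac{\partial \boldsymbol{\Psi}_i^T \psi(\mathbf{s}_{i-1}, \mathbf{u}_i)}{\partial \mathbf{s}_{i-1}} \frac{\partial \mathbf{s}_{i-1}}{\partial \boldsymbol{\Omega}_i^{(k)}} = \underbrace{2 a_s \mathbf{A}_i^T \mathbf{K}_i \mathbf{D}_i^T}_{\boldsymbol{\Lambda}_i} \frac{\partial \mathbf{s}_{i-1}}{\partial \boldsymbol{\Omega}_i^{(k)}} \tag{27}$$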

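where the partial derivative is evaluated using the Gaussian tensor-product kernel (18), $\mathbf{K}_i \triangleq \mathrm{diag}(\boldsymbol{\Psi}_i^T \psi(\mathbf{s}_{i-1}, \mathbf{u}_i))$ is a diagonal matrix with eigenvalues $\mathbf{K}_i^{(j,j)} = K_{a_s}(\mathbf{s}_j, \mathbf{s}_{i-1}) \cdot K_{a_u}(\mathbf{u}_j, \mathbf{u}_i)$, and $\mathbf{D}_i \triangleq [(\mathbf{s}_{-1} - \mathbf{s}_{i-1}), \cdots, (\mathbf{s}_{m-2} - \mathbf{s}_{i-1})]$ is the difference matrix between the state centers of the filter and the current input state $\mathbf{s}_{i-1}$. We collect the gradient coefficients in (27) into a matrix $\boldsymbol{\Lambda}_i \triangleq \frac{\partial \mathbf{s}_i}{\partial \mathbf{s}_{i-1}} = 2 a_s \mathbf{A}_i^T \mathbf{K}_i \mathbf{D}_i^T$, which we call the state-transition gradient. Substituting (27) into (24) gives the following recursion

$$\frac{\partial \mathbf{s}_i}{\partial \boldsymbol{\Omega}_i^{(k)}} = \boldsymbol{\Lambda}_i \frac{\partial \mathbf{s}_{i-1}}{\partial \boldsymbol{\Omega}_i^{(k)}} + \mathbf{I}_{n_s}^{(k)} \psi(\mathbf{s}_{i-1}, \mathbf{u}_i)^T. \tag{28}$$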
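Since the state-transition gradient is the Jacobian of the next state with respect to the previous state, the closed form $2 a_s \mathbf{A}_i^T \mathbf{K}_i \mathbf{D}_i^T$ can be checked numerically. The Python sketch below builds it for a toy dictionary and compares it against a central finite difference. The dictionary layout (paired lists of state and input centers), the function names, and the numbers are assumptions for illustration.

```python
import math


def gauss(x, y, a):
    """Gaussian kernel exp(-a * ||x - y||^2)."""
    return math.exp(-a * sum((p - q) ** 2 for p, q in zip(x, y)))


def next_state(S, U, A, s_prev, u, a_s, a_u):
    """s_i = A^T k, with k_j = K_as(S[j], s_prev) * K_au(U[j], u)."""
    k = [gauss(sj, s_prev, a_s) * gauss(uj, u, a_u) for sj, uj in zip(S, U)]
    return [sum(A[j][r] * k[j] for j in range(len(S)))
            for r in range(len(s_prev))]


def state_transition_gradient(S, U, A, s_prev, u, a_s, a_u):
    """Lambda = 2 a_s A^T K D^T, where K = diag(k) and column j of D is
    the difference S[j] - s_prev. The result is an n_s x n_s matrix."""
    n_s = len(s_prev)
    k = [gauss(sj, s_prev, a_s) * gauss(uj, u, a_u) for sj, uj in zip(S, U)]
    return [[2.0 * a_s * sum(A[j][r] * k[j] * (S[j][c] - s_prev[c])
                             for j in range(len(S)))
             for c in range(n_s)]
            for r in range(n_s)]


def numerical_jacobian(S, U, A, s_prev, u, a_s, a_u, h=1e-6):
    """Central-difference estimate of d s_i / d s_{i-1}."""
    n_s = len(s_prev)
    J = [[0.0] * n_s for _ in range(n_s)]
    for c in range(n_s):
        plus, minus = list(s_prev), list(s_prev)
        plus[c] += h
        minus[c] -= h
        f_plus = next_state(S, U, A, plus, u, a_s, a_u)
        f_minus = next_state(S, U, A, minus, u, a_s, a_u)
        for r in range(n_s):
            J[r][c] = (f_plus[r] - f_minus[r]) / (2.0 * h)
    return J


# Toy dictionary: three centers, n_s = 2, n_u = 1 (hypothetical values).
S = [(0.1, -0.3), (0.5, 0.2), (-0.4, 0.6)]
U = [(0.0,), (1.0,), (1.0,)]
A = [[0.7, -0.2], [0.1, 0.9], [-0.5, 0.3]]
s_prev, u = (0.2, 0.1), (1.0,)

Lam = state_transition_gradient(S, U, A, s_prev, u, a_s=2.0, a_u=2.0)
J = numerical_jacobian(S, U, A, s_prev, u, a_s=2.0, a_u=2.0)
max_gap = max(abs(Lam[r][c] - J[r][c]) for r in range(2) for c in range(2))
print(max_gap)
```

Only the state kernel parameter appears as a prefactor because the input kernel does not depend on the previous state; it enters only through the diagonal kernel values.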

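The gradient $\frac{\partial \mathbf{s}_i}{\partial \boldsymbol{\Omega}_i^{(k)}}$ measures the sensitivity of $\mathbf{s}_i$, the state output at time $i$, to a small change in the $k$-th weight component, taking into account the effect of such variation in the weight values over the entire state trajectory $\mathbf{s}_0, \cdots, \mathbf{s}_{i-1}$. In this evaluation, the initial state $\mathbf{s}_0$, the input sequence $\mathbf{u}_1^i$, and the remaining weights ($\boldsymbol{\Omega}_i^{(j)}$, where $j \neq k$) are fixed. The state gradient $\frac{\partial \mathbf{s}_i}{\partial \boldsymbol{\Omega}_i^{(k)}}$ is back-propagated with respect to a constant weight via feedback. Clearly, (28) is independent of any teacher signal or error that the system may incur in the future and can be computed entirely from the observed data. Therefore, we can forward propagate the state gradients. Since the initial state is user-defined and functionally independent of the filter weights, by setting $\frac{\partial \mathbf{s}_0}{\partial \boldsymbol{\Omega}_i^{(k)}} = \mathbf{0}$, the ensuing recursion in (28) becomes

$$\frac{\partial \mathbf{s}_1}{\partial \boldsymbol{\Omega}_i^{(k)}} = \mathbf{I}_{n_s}^{(k)} \psi(\mathbf{s}_0, \mathbf{u}_1)^T. \tag{29}$$

By induction, we can factor out the basis functions and express the recursion as

$$\begin{aligned} \frac{\partial \mathbf{s}_i}{\partial \boldsymbol{\Omega}_i^{(k)}} &= \boldsymbol{\Lambda}_i \mathbf{V}_{i-1}^{(k)} {\boldsymbol{\Psi}'_{i-1}}^{T} + \mathbf{I}_{n_s}^{(k)} \psi(\mathbf{s}_{i-1}, \mathbf{u}_i)^T \\ &= \left[\boldsymbol{\Lambda}_i \mathbf{V}_{i-1}^{(k)}, \mathbf{I}_{n_s}^{(k)}\right] \left[\boldsymbol{\Psi}'_{i-1}, \psi(\mathbf{s}_{i-1}, \mathbf{u}_i)\right]^T \\ &= \mathbf{V}_i^{(k)} {\boldsymbol{\Psi}'_i}^{T} \end{aligned} \tag{30}$$

where $\boldsymbol{\Psi}'_i \triangleq [\boldsymbol{\Psi}'_{i-1}, \psi(\mathbf{s}_{i-1}, \mathbf{u}_i)] \in \mathbb{R}^{n_\psi \times i}$ are centers generated by the input sequence and forward-propagated states from a fixed filter weight $\boldsymbol{\Omega}_i$, and $\mathbf{V}_i^{(k)} \triangleq \left[\boldsymbol{\Lambda}_i \mathbf{V}_{i-1}^{(k)}, \mathbf{I}_{n_s}^{(k)}\right] \in \mathbb{R}^{n_s \times i}$ is the updated state-transition gradient, with initializations $\boldsymbol{\Psi}'_1 = [\psi(\mathbf{s}_0, \mathbf{u}_1)]$ and $\mathbf{V}_1^{(k)} = \mathbf{I}_{n_s}^{(k)}$. Combining (23) with (30) gives the error gradient

$$\frac{\partial \varepsilon_i}{\partial \boldsymbol{\Omega}_i^{(k)}} = -\mathbf{e}_i^T \mathbb{I} \mathbf{V}_i^{(k)} {\boldsymbol{\Psi}'_i}^{T}. \tag{31}$$

Updating the weights in the negative direction yields

$$\begin{aligned} \boldsymbol{\Omega}_{i+1}^{(k)} &= \boldsymbol{\Omega}_i^{(k)} + \eta \boldsymbol{\Psi}'_i \left(\mathbb{I} \mathbf{V}_i^{(k)}\right)^T \mathbf{e}_i \\ &= \boldsymbol{\Psi}_i \mathbf{A}_i^{(k)} + \eta \boldsymbol{\Psi}'_i \left(\mathbb{I} \mathbf{V}_i^{(k)}\right)^T \mathbf{e}_i \\ &= \left[\boldsymbol{\Psi}_i, \boldsymbol{\Psi}'_i\right] \begin{bmatrix} \mathbf{A}_i^{(k)} \\ \eta \left(\mathbb{I} \mathbf{V}_i^{(k)}\right)^T \mathbf{e}_i \end{bmatrix} \end{aligned} \tag{32}$$

$$\quad \triangleq \boldsymbol{\Psi}_{i+1} \mathbf{A}_{i+1}^{(k)} \tag{33}$$

where $\eta$ is the learning rate. The proposed kernel adaptive recurrent filter is summarized in Algorithm 1. For this forward-propagated online learning model, after each adaptation, the entire state trajectory, with respect to this updated fixed network $\boldsymbol{\Omega}_{i+1}$, is reconstructed in order to compute the next state gradient $\frac{\partial \mathbf{s}_{i+1}}{\partial \boldsymbol{\Omega}_{i+1}}$.

Algorithm 1: Kernel Adaptive ARMA Algorithm
Initialization:
    $n_u$: input dimension
    $n_s$: state dimension
    $n_y$: output dimension
    $a_s$: state kernel parameter
    $a_u$: input kernel parameter
    $\eta$: learning rate
    Randomly initialize input $\mathbf{u}_0 \in \mathbb{R}^{1 \times n_u}$
    Randomly initialize states $\mathbf{s}_{-1}$ and $\mathbf{s}_0 \in \mathbb{R}^{1 \times n_s}$
    Randomly initialize coefficient matrix $\mathbf{A} \in \mathbb{R}^{1 \times n_s}$
    $\boldsymbol{\Psi} = [\psi(\mathbf{s}_{-1}, \mathbf{u}_0)]$: feature matrix
    $\mathbf{S} = [\mathbf{s}_{-1}]$: state dictionary
    $m = 1$: dictionary size
    $\mathbb{I} = [\mathbf{0} \;\; \mathbf{I}_{n_y}] \in \mathbb{R}^{n_y \times n_s}$: measurement matrix
Computation:
for time $t = 1, \cdots, n$ do
    Initialization
        $\boldsymbol{\Psi}' = [\,]$: feature matrix update
        $\mathbf{S}' = [\,]$: state matrix update
        $\mathbf{V}_1^{(k)} = \mathbf{I}_{n_s}^{(k)} \in \mathbb{R}^{n_s \times 1}$, for $k = 1, \cdots, n_s$
    Update State-Transition Gradient Matrix
    for time $i = 1, \cdots, t$ do
        Generate Next State
            $\mathbf{s}_i = \boldsymbol{\Omega}^T \psi(\mathbf{s}_{i-1}, \mathbf{u}_i)$
        Update State Gradient
            $\mathbf{D}_i = [(\mathbf{S}^{(1)} - \mathbf{s}_{i-1}), \cdots, (\mathbf{S}^{(m)} - \mathbf{s}_{i-1})]$
            $\mathbf{K}_i = \mathrm{diag}(\boldsymbol{\Psi}^T \psi(\mathbf{s}_{i-1}, \mathbf{u}_i))$
            $\boldsymbol{\Lambda}_i = 2 a_s \mathbf{A}^T \mathbf{K}_i \mathbf{D}_i^T$
            $\mathbf{V}_i^{(k)} = [\boldsymbol{\Lambda}_i \mathbf{V}_{i-1}^{(k)}, \mathbf{I}_{n_s}^{(k)}]$
        Update Feature and State Matrices
            $\boldsymbol{\Psi}' = [\boldsymbol{\Psi}', \psi(\mathbf{s}_{i-1}, \mathbf{u}_i)]$
            $\mathbf{S}' = [\mathbf{S}', \mathbf{s}_{i-1}]$
    Prediction
        $\mathbf{y}_t = \mathbb{I} \mathbf{s}_t$
    Update Weights in the RKHS
        $\mathbf{e}_t = \mathbf{d}_t - \mathbf{y}_t$
        $\mathbf{A}^{(k)} = \begin{bmatrix} \mathbf{A}_i^{(k)} \\ \eta (\mathbb{I} \mathbf{V}_i^{(k)})^T \mathbf{e}_i \end{bmatrix}$
        $\boldsymbol{\Psi} = [\boldsymbol{\Psi}, \boldsymbol{\Psi}']$
        $\mathbf{S} = [\mathbf{S}, \mathbf{S}']$
        $\boldsymbol{\Omega} = \boldsymbol{\Psi} \mathbf{A}$
        $m = m + t$

Unlike a feedforward KAF algorithm, where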

the network grows linearly with the number of processed samples or updates, the KAARMA filter grows quadratically, i.e., the updating term $\boldsymbol{\Psi}'_i$ in (32) consists of $i$ new centers (not including the constant initial state), which forms the revised state trajectory. For the case where the filter is updated after each new input, the unconstrained RBF structure of the KAARMA algorithm at time $i$ has size $\sum_{j=1}^{i} j = \frac{i(i+1)}{2}$. For a sequence of length $n$ and an update frequency of once per input symbol, the KAARMA memory and computational complexities are $O(n^2)$ and $O(n^3)$, respectively, the same as KRLS. For classification tasks, where the update frequency is only once per sequence, the memory and computational complexities are reduced to $O(n)$ and $O(n^2)$, respectively, the same as KLMS.

The complexity of the KAARMA algorithm would be greatly reduced if the weights are allowed to change instantaneously as in RTRL. However, this effectively introduces an undesirable secondary feedback (since the observed state trajectory depends on any weight change made by the learning algorithm), one that is intractable in the functional spaces, regardless of the learning rate. A far more sensible way to combat the computational complexity is to use the quantization technique in [11], since KAARMA naturally forms feature clusters during training. We modify it slightly for Gaussian tensor-product reproducing kernels. This implementation is the quantized KAARMA (QKAARMA).

Algorithm 2 outlines the quantization procedure of QKAARMA, which constrains the network growth in the last five lines of Algorithm 1. Each new center from the feature update $\boldsymbol{\Psi}'$ is compared with the existing ones. If the minimum joint distance (input and state) is below a quantization threshold $q$, the new center $\boldsymbol{\Psi}'^{(i)} \triangleq \psi(\mathbf{s}'_i, \mathbf{u}'_i)$ is simply discarded, and its corresponding coefficients $\mathbf{A}'(i)$, where $(i)$ indicates the $i$-th row, are added to the nearest existing neighbor's (indexed by $j^*$), thus updating the weights without growing the network structure.

Algorithm 2: Quantization Algorithm
Initialization:
    $q$: quantization threshold
    $a_s$: state kernel parameter
    $a_u$: input kernel parameter
    $\boldsymbol{\Psi}$: feature matrix
    $\mathbf{A}$: coefficient matrix
    $\boldsymbol{\Psi}'$: feature matrix update
    $\mathbf{A}'$: coefficient update
    $m$: dictionary size
    $m'$: new centers size
Computation:
for $i = 1, \cdots, m'$ do
    $\mathrm{dis}(\boldsymbol{\Psi}'^{(i)}, \boldsymbol{\Psi}) \triangleq \min_{1 \leq j \leq m} \left(a_s \|\mathbf{s}'_i - \mathbf{s}_j\| + a_u \|\mathbf{u}'_i - \mathbf{u}_j\|\right)$
    $j^* \triangleq \arg\min_{1 \leq j \leq m} \left(a_s \|\mathbf{s}'_i - \mathbf{s}_j\| + a_u \|\mathbf{u}'_i - \mathbf{u}_j\|\right)$
    if $\mathrm{dis}(\boldsymbol{\Psi}'^{(i)}, \boldsymbol{\Psi}) < q$ then
        $\mathbf{A}(j^*) = \mathbf{A}(j^*) + \mathbf{A}'(i)$
    else
        $\boldsymbol{\Psi} = [\boldsymbol{\Psi}, \psi(\mathbf{s}'_i, \mathbf{u}'_i)]$
        $\mathbf{A} = \begin{bmatrix} \mathbf{A} \\ \mathbf{A}'(i) \end{bmatrix}$
        $m = m + 1$

B. Kernel Backpropagation through Time

Alternatively, we can compute the exact error gradient in the recurrent topology using BPTT in the RKHS. During training, we can treat the recurrent network as a single multistage feedforward network. By unfolding the filter dynamics in time, through a series of feedforward networks of identical weights, we effectively remove the feedback. To adapt the weights, the output error at the present time is transported back in time or filtered through the dual networks of the cascade FIR filters. Here, we show the relationship between kernel BPTT (KBPTT) and Algorithm 1. We decouple the feedback for a single iteration by connecting the state input of the feedforward network $\boldsymbol{\Omega}_t$ at time $t$ to the state output of the same feedforward network replicated at time $t - 1$, i.e., $\boldsymbol{\Omega}_{t-1} = \boldsymbol{\Omega}_t$. For this 2-stage network, the error gradient consists of the contribution from the external error taken at time $t - 1$

$$\frac{\partial \varepsilon_{t-1}}{\partial \boldsymbol{\Omega}_{t-1}^{(k)}} = -\mathbf{e}_{t-1}^T \mathbb{I} \mathbf{I}_{n_s}^{(k)} \psi(\mathbf{s}_{t-2}, \mathbf{u}_{t-1})^T \tag{34}$$

added with the filtered error gradient from time $t$

$$\frac{\partial \varepsilon_t}{\partial \boldsymbol{\Omega}_{t-1}^{(k)}} = -\mathbf{e}_t^T \frac{\partial \mathbf{y}_t}{\partial \mathbf{s}_t} \frac{\partial \mathbf{s}_t}{\partial \mathbf{s}_{t-1}} \frac{\partial \mathbf{s}_{t-1}}{\partial \boldsymbol{\Omega}_{t-1}^{(k)}} = -\mathbf{e}_t^T \mathbb{I} \boldsymbol{\Lambda}_t \frac{\partial \mathbf{s}_{t-1}}{\partial \boldsymbol{\Omega}_{t-1}^{(k)}} \tag{35}$$

which has the same form as (28) and equals (31) if we let $\mathbf{e}_t = \mathbf{e}_{t-1}$, i.e., the average behavior of the multi-layered feedforward network is equivalent to the recurrent system's. The filtered error gradient $L$ time steps ahead is then

$$\frac{\partial \varepsilon_{t+L}}{\partial \boldsymbol{\Omega}_t^{(k)}} = -\mathbf{e}_{t+L}^T \frac{\partial \mathbf{y}_{t+L}}{\partial \mathbf{s}_{t+L}} \frac{\partial \mathbf{s}_{t+L}}{\partial \mathbf{s}_{t+L-1}} \cdots \frac{\partial \mathbf{s}_t}{\partial \boldsymbol{\Omega}_t^{(k)}} = -\mathbf{e}_t^T \mathbb{I} \left(\prod_{i=0}^{L-1} \boldsymbol{\Lambda}_{t+L-i}\right) \mathbf{I}_{n_s}^{(k)} \psi(\mathbf{s}_{t-1}, \mathbf{u}_t)^T. \tag{36}$$

For certain applications, instead of always backpropagating the error to the initial state, which contributes significantly to the computational cost, we can open up a window of a fixed interval $L$ into the past, relative to the current time, and adapt the weights of this $L$-stage multi-layered FIR filter. This truncated or fixed-window KBPTT is summarized in Algorithm 3. In this formulation, the window size $L$ is user defined and independent of the filter order.

Algorithm 3: Kernel Backpropagation through Time
Initialization:
    $L$: window size
Computation:
for time $t = 1, \cdots, n$ do
    Unfold the Network and Forward Pass
        $t_0 = \max(t - L + 1, 1)$: start time
        for time $i = t_0, \cdots, t$ do
            $\mathbf{s}_i = \boldsymbol{\Omega}^T \psi(\mathbf{s}_{i-1}, \mathbf{u}_i)$
            $\mathbf{y}_i = \mathbb{I} \mathbf{s}_i$
            $\mathbf{e}_i = \mathbf{d}_i - \mathbf{y}_i$
    Backpropagation Through Time
        $\mathbf{e}_L^T \mathbb{I} = -\mathbf{e}_{t_0}^T \mathbb{I}$
        for $i = 1, \cdots, L - 1$ do
            $\mathbf{e}_L^T \mathbb{I} = \mathbf{e}_L^T \mathbb{I} - \mathbf{e}_t^T \mathbb{I} \left(\prod_{j=0}^{i-1} \boldsymbol{\Lambda}_{t_0 + i - j}\right)$
    Update Weights in the RKHS
        $\frac{\partial \varepsilon_L}{\partial \boldsymbol{\Omega}^{(k)}} = \mathbf{e}_L^T \mathbb{I} \mathbf{I}_{n_s}^{(k)} \psi(\mathbf{s}_{t_0 - 1}, \mathbf{u}_{t_0})^T$
        $\boldsymbol{\Omega}^{(k)} = \boldsymbol{\Omega}^{(k)} - \eta \left(\frac{\partial \varepsilon_L}{\partial \boldsymbol{\Omega}^{(k)}}\right)^T$
    Update Next Initial State Estimate
        $\mathbf{s}_{t_0} = \mathbf{s}_{t_0} + \mathbb{I}^T \mathbf{e}_L$

The per iteration storage

and computational complexity is of a constant factor $L$. Furthermore, we can treat the initial state of each window as an adaptive parameter and update its values using the backpropagated or filtered error. Similarly, we can modify Algorithm 1 to limit the recursion to $L$ steps by setting $\partial s_{t_0-1} / \partial \Omega_i^{(k)} = 0$, then propagating and updating the next initial states $s_{t_0}$.

Aside from the training complexity, there are two notable issues plaguing RNNs trained using direct gradient descent methods, namely the vanishing and exploding gradient problems [44]. For sigmoid and tanh activation functions, the error flow tends to vanish when learning long-term dependencies. A full analysis for the KAARMA algorithm is outside the scope of this paper and will be pursued in the future. Nonetheless, we note that the KAARMA algorithm is less prone to these problems, since the filter operates directly in the functional space, using smoothing functions as reproducing kernels. Furthermore, the quantization technique in Algorithm 2 provides additional stability and robustness, ensuring that only a compact dictionary is formed, in a part of the space where the system is stable. We handle the rare case of exploding gradients by restricting the dynamic range of the hidden states using simple amplitude clipping.

IV. SIMULATION RESULTS

We demonstrate the computational power of the KAARMA algorithm by modeling and providing exact compact solutions to dynamical systems, both primary bottlenecks of conventional KAF algorithms.

TABLE I
TOMITA GRAMMARS.

No.  Description
1    1*
2    (10)*
3    No odd number of consecutive 0's after an odd number of consecutive 1's.
4    Any string with fewer than three consecutive 0's.
5    Any even length string with an even number of 1's.
6    Difference between the number of 1's and 0's is a multiple of 3.
7    0*1*0*1*
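For readers who want to label their own strings, the grammars in Table I can be written as membership predicates. The following is a minimal Python sketch, not the authors' code. The function and dictionary names are hypothetical, the reading of grammar #3 (reject when an odd run of 1's is immediately followed by an odd run of 0's) is one common interpretation, and the ±1 labels are an assumption.

```python
import re
from itertools import groupby


def _runs(s):
    """Return maximal runs as (symbol, length) pairs, e.g. '1100' -> [('1', 2), ('0', 2)]."""
    return [(sym, len(list(grp))) for sym, grp in groupby(s)]


def _grammar3(s):
    # Assumed reading: reject when an odd run of 1's is directly followed by an odd run of 0's.
    runs = _runs(s)
    for (a, n), (b, m) in zip(runs, runs[1:]):
        if a == "1" and b == "0" and n % 2 == 1 and m % 2 == 1:
            return False
    return True


# Hypothetical lookup table: grammar number -> membership predicate on binary strings.
TOMITA = {
    1: lambda s: re.fullmatch(r"1*", s) is not None,
    2: lambda s: re.fullmatch(r"(10)*", s) is not None,
    3: _grammar3,
    4: lambda s: "000" not in s,
    5: lambda s: len(s) % 2 == 0 and s.count("1") % 2 == 0,
    6: lambda s: (s.count("1") - s.count("0")) % 3 == 0,
    7: lambda s: re.fullmatch(r"0*1*0*1*", s) is not None,
}


def label(grammar, s):
    """Return +1 for grammatical strings and -1 otherwise (label values are an assumption)."""
    return 1 if TOMITA[grammar](s) else -1
```

For example, `label(4, "1001")` gives `1`, while `label(4, "10001")` gives `-1`.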

Specifically, we evaluate the performance of the QKAARMA algorithm for the task of syntactic pattern recognition. Identification and synthesis of DFA is performed using the set of Tomita grammars (Table I) [45], which has served as the benchmark for RNNs. A brief review of DFA and formal languages is given in the Appendix. Note that the field of grammatical inference is very broad [46]–[48]. Here we focus on purely data-driven learning approaches such as the ones started with [49] and continued in [50], [51].

The KAARMA algorithm is well suited for grammatical inference or DFA identification. Each training string is presented to KAARMA one symbol $u_t$ at a time (see Fig. 2). The last component of the state vector indicates the response, i.e., $y_t \triangleq s_t^{(n_s)}$. Since the error is evaluated at the end of a sequence, KAARMA only takes linear time to process each string. Because KAARMA forces the state vector space into well-separated partitions or subspaces, corresponding to state nodes in an FSM (with accepting states indicated by positive response values $s^{(n_s)} > 0$), we can extract DFA from trained KAARMA networks, similar to [49]. Furthermore, the quantization technique outlined in Algorithm 2, or QKAARMA, can be effectively employed with no performance loss in the final solution.

The training data consist of stimulus-response pairs labeled according to the source grammar. Greater emphasis is given to Tomita grammar #4, which has been the subject of significant research interest in the literature [49], [52], [53]. Similar to [49] for second-order RNNs, the training set for grammar #1 consists of 1000 randomly generated binary strings of length 1–15 (mean of 7.758). Unlike the epoch learning used in [49], each string is processed only once by QKAARMA, with state dimension $n_s = 4$. The states and coefficient matrix $A$ are randomly initialized, uniformly from $(-0.5, 0.5)$ and $(0, 1)$, respectively. Gaussian kernel parameters $a_s = a_u = 2$ are used, with learning rate $\eta = 0.1$.
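The symbol-by-symbol processing described above can be illustrated with a short NumPy sketch of the forward pass only. It is a sketch under stated assumptions, not the reference implementation: the tensor-product Gaussian kernel $\exp(-a_s\|s - s'\|^2)\exp(-a_u (u - u')^2)$, the function name `kaarma_forward`, and the clipping range are assumptions. The clipping step corresponds to the amplitude clipping of the hidden states mentioned at the end of Section III.

```python
import numpy as np


def kaarma_forward(string, centers_s, centers_u, A, s0, a_s=2.0, a_u=2.0, clip=1.5):
    """Run a KAARMA-style recurrent pass over a binary string.

    centers_s: (m, n_s) stored state centers; centers_u: (m,) stored input centers;
    A: (m, n_s) coefficient matrix; s0: (n_s,) initial state.
    Returns the response, i.e. the last component of the final state.
    """
    s = np.asarray(s0, dtype=float)
    for ch in string:
        u = float(ch)
        # Assumed tensor-product Gaussian kernel between (s, u) and every stored center.
        k_s = np.exp(-a_s * np.sum((centers_s - s) ** 2, axis=1))
        k_u = np.exp(-a_u * (centers_u - u) ** 2)
        s = A.T @ (k_s * k_u)        # next state as a weighted sum of kernel evaluations
        s = np.clip(s, -clip, clip)  # amplitude clipping (the range is a hypothetical choice)
    return s[-1]
```

A string is accepted when the returned response is positive. Because each symbol costs one pass over the $m$ centers, processing a string is linear in its length.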
Using a quantization factor q = 0.5, out of the 7759 possible centers, the trained QKAARMA network consists of

only 20 centers. The DFA extraction and minimization procedures follow those in [49]. The quantization factor used during DFA extraction is $q = 0.8$. Fig. 3 shows the extracted DFA for grammar #1 on the left and the minimized DFA on the right, which QKAARMA correctly identified. Initial states are labeled 0 for $s_0$ and shown as black dots. The value above each state label indicates its response in the state vector space. Final states are shaded green. The empty string is not learned, i.e., initial states are assumed to be non-final. Edges are color coded with blue (dashed line) for '0' input and red (solid line) for '1'. Self transitions appear as large circles around state nodes.

[Fig. 3. QKAARMA generated DFA for Tomita grammar #1. Panels: Extracted Automaton, Minimized DFA. Legend: state, initial state, accept state, '0'-transition, '0'-self-transition, '1'-transition, '1'-self-transition.]

[Fig. 4. QKAARMA generated DFA for Tomita grammar #4. Panels: Extracted Automaton, Minimized DFA.]

For Tomita grammar #4, the same training set and QKAARMA parameters are used, with the solution shown in Fig. 4. Again, the correct minimal DFA is inferred (not accepting empty strings). To show stability, we run QKAARMA 100 times using random initialization. A separate test set (200 sequences of length 20) is used to generate the learning curve shown in Fig. 5. The top plot shows the average response of the network during training, with the learning curve plotted in the middle. We see that after 400 updates, the average filter has learned enough to start discriminating the strings, and learning begins to accelerate. By the time training reaches 700 sequences, the test error has been reduced by half. For most initializations, QKAARMA has learned the complete dynamics by 900 strings, and the DFA extracted after this point are exact solutions. This demonstrates that QKAARMA is not only a fast algorithm but also robust to initial conditions. In contrast, the second-order RNN in [49] takes at least 49 epochs to converge on a 1024-string training set, with several runs failing to converge after 5000 epochs.

The bottom plot of Fig. 5 shows the average network size as a function of the processed strings. This inverse relationship between the growth rate and the learning curve is expected.
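The network-size behavior follows from the quantization rule of Algorithm 2: a candidate center whose minimum joint distance to the dictionary is below $q$ is merged into its nearest neighbor instead of being stored. A minimal NumPy sketch of that merge-or-append step is given here. The function name is hypothetical, and the joint distance is assumed to be the Euclidean distance on the concatenated state and input vector.

```python
import numpy as np


def quantized_update(centers, coeffs, new_centers, new_coeffs, q):
    """Merge or append candidate centers.

    centers: (m, d) stored joint [state, input] vectors; coeffs: (m, n_s) coefficients.
    new_centers: (p, d) candidates; new_coeffs: (p, n_s) their coefficient rows.
    Returns the updated (centers, coeffs).
    """
    centers = np.array(centers, dtype=float)
    coeffs = np.array(coeffs, dtype=float)
    candidates = zip(np.asarray(new_centers, dtype=float), np.asarray(new_coeffs, dtype=float))
    for c, a in candidates:
        if len(centers) > 0:
            dist = np.linalg.norm(centers - c, axis=1)  # assumed Euclidean joint distance
            j = int(np.argmin(dist))
            if dist[j] < q:
                coeffs[j] += a  # discard the candidate, add its coefficients to the neighbor
                continue
        centers = np.vstack([centers, c])  # otherwise grow the dictionary
        coeffs = np.vstack([coeffs, a])
    return centers, coeffs
```

With a larger $q$ more candidates are merged, so the dictionary grows more slowly while the coefficients continue to adapt.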
After an update, centers are added to describe newfound dynamics, which, in turn, enhance the generalization capability. During periods of growth plateau, learning does not stop, since QKAARMA fine-tunes the filter coefficients based on the error gradient, even when the network size is unchanged. We tabulate the results for all 7 Tomita grammars

in Table II. The QKAARMA algorithm is able to correctly identify each DFA. Furthermore, we see that despite the network growth issue inherent in kernel recurrent learning, the KAARMA network size can be effectively constrained without performance loss, using the appropriate quantization. This compact filter representation greatly facilitates the DFA extraction process. The state diagrams synthesized from trained QKAARMA for the remaining grammars are shown in Fig. 6. In the case of Tomita grammar #5, the extracted automaton is the minimized form. This is partly due to the training parameters and the size of the minimum DFA solution. Also, grammar constraints, as manifested in the symmetry of the minimum DFA, make correct nondistinguishable states difficult to form after training.

TABLE II
QKAARMA DFA FOR TOMITA GRAMMARS.

Grammar  QKAARMA size  Extract. DFA size  Min. DFA size
#1       20            4                  3
#2       22            6                  4
#3       46            8                  6
#4       28            7                  5
#5       34            5                  5
#6       28            5                  4
#7       36            8                  6

A. Comparison with RNN

The KAARMA algorithm operates on the same recurrent topology as a second-order RNN; however, the filtering is performed in the RKHS. Here, we compare the performance of QKAARMA with first and second-order RNNs in [54] and the random weight guessing technique [55]. Tomita grammars #3 and #5 are omitted in the comparison, since their definitions in [54] differ from the originals (Table I). Random guessing (RG) results are only available for "parity-free" Tomita grammars (#1, #2, #4). The average performance over 10 random initializations is presented in Table III. For each inference algorithm, the average training size, number of test set errors made by the trained network, test set accuracy, network size (type), successful extraction rate (convergence rate × the rate

DFA matches correct grammar), and the unminimized extracted DFA size are enumerated, whenever available. For RNNs, first and second-order networks of 39 neurons, with and without bias, were tested. Since QKAARMA is a much faster algorithm, for RNNs and RG, the configuration with the fastest convergence in training samples is selected for comparison.

[Fig. 5. Performance of QKAARMA on Tomita grammar #4 averaged over 100 different initializations. Panels: Average Training Response (sequence label; desired and QKAARMA), Average Learning Curve (Test Set) (test error rate), Average Network Size After Each Weight Update (dictionary size). Horizontal axis: Number of Training Sequences.]

[Fig. 6. Synthesized DFA from QKAARMA algorithm for Tomita grammars #2-3 and #5-7. Panels: (a) Tomita grammar #2, (b) Tomita grammar #3, (c) Tomita grammar #5, (d) Tomita grammar #6, (e) Tomita grammar #7, each showing the extracted automaton and the minimized DFA.]

Unlike the RNNs, which are epoch trained on all binary strings of length 0-9, in alphabetical order with an elaborate scheme involving cycles of working and recycle sets, the QKAARMA networks are trained on random strings of length 1-9 in the target languages. This is a more difficult problem, since it has been shown that learning is enhanced using lexicographic order in string presentation [56]. In practice, one cannot expect the training samples to be a complete ordered set, especially for unknown grammars. The QKAARMA state dimension is fixed at $n_s = 4$. The test set consists of all strings of length 10-15 (64512 total), the same as the RNNs'. Training is stopped when the test error is less than 8% and the correct minimized DFA is extracted. The training procedure for RG is outlined in [55], which involves an equal number of positive and negative samples.

Due to random construction, the density of grammatical strings used for training QKAARMA is not uniform. Fig. 7 shows a graphic representation of the worst case training set for each of the QKAARMA networks. Each concentric ring, radiating from the origin, represents all binary strings of a certain length (i.e., for ring $n$, the $2\pi$ radians are partitioned into $2^n$ sectors, in lexicographic order, starting from angle $\theta = 0$). Black regions represent ungrammatical strings. The corresponding histograms are also shown. We see that the grammatical strings are sparsely distributed for the early grammars #1 and #2, and more

densely covered in grammars #4 and #7. From the results shown in Table III, we see that QKAARMA outperforms RNNs, trained using either error gradients or by random guessing. The training speed of QKAARMA is much faster than the RNNs'. Also, the extracted DFA has better generalization capability than the trained network, as shown by correctly inferred grammars (DFA) in instances where the QKAARMA network failed to recognize some test strings.

[Fig. 7. Worst case QKAARMA training size used to correctly infer a Tomita grammar by uniformly sampling from all binary strings of length 1-9, and its corresponding histogram (horizontal axis: binary string length). (a) Grammar #1: 200 strings with 24 positive examples (12.0%). (b) Grammar #2: 700 samples with 27 positive examples (3.86%). (c) Grammar #4: 900 samples with 680 positive examples (75.6%). (d) Grammar #6: 1160 samples with 350 positive examples (30.2%). (e) Grammar #7: 4400 samples with 3180 positive examples (72.3%).]

B. Comparison with LSM

Training speed for RNNs can be improved by randomly initializing the recurrent connections in a sufficiently large network, resulting in a dynamic reservoir, as in reservoir computing (RC). However, simplicity has a downside: not all initializations with the same random hyperparameters result in good performance, and this dependency on careful offline basis selection and cross validation makes RC a less reliable learning tool than gradient-based methods.

We briefly compare the performance of QKAARMA with a liquid state machine (LSM) on nonnumeric data. Two Poisson spike trains of frequency 20 Hz and length 0.5 s are generated as templates for two classes, 0 and 1, as shown in Fig. 8. Actual spike trains used for training and testing are noisy versions of the templates, with each spike timing varying by a random amount, according to a zero-mean Gaussian distribution with standard deviation or jitter of 4 ms. The training set consists of 500 realizations, with another 200 forming the independent test set. The liquid filter is a randomly connected recurrent neural microcircuit consisting of 135 integrate-and-fire neurons, with 20% of the population randomly set as inhibitory [57]. The state of the microcircuit is sampled every 25 ms by low-pass filtering the response. The QKAARMA network is trained directly on the spiking stimuli binned every 10 ms, with each jittered version forming a binary string of length 50.

The QKAARMA algorithm is able to correctly label each test stimulus after a single pass through the training set. The LSM, on the other hand, is trained for 48 epochs, with the best validation performance at epoch 33. Fig. 9 shows a spike train misclassified by the LSM. Overall, the LSM using a linear classifier has a validation error of 7%, while the QKAARMA network produced zero error. Fig. 10 shows the extracted and the minimized state machine for accepting all template 1 spike trains and rejecting all template 0 spike trains, and vice versa, in the independent test set.
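The conversion from jittered spike trains to binary strings described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' preprocessing code: the function names are hypothetical, spikes jittered outside the 0.5 s window are assumed to be dropped, and a bin is assumed to read '1' when it contains at least one spike.

```python
import numpy as np


def poisson_template(rate_hz, duration_s, rng):
    """Draw a homogeneous Poisson spike train as sorted spike times in seconds."""
    n = rng.poisson(rate_hz * duration_s)
    return np.sort(rng.uniform(0.0, duration_s, size=n))


def jitter(spike_times, sigma_s, duration_s, rng):
    """Add zero-mean Gaussian jitter to every spike and keep spikes inside the window."""
    t = spike_times + rng.normal(0.0, sigma_s, size=spike_times.shape)
    return np.sort(t[(t >= 0.0) & (t < duration_s)])


def bin_to_string(spike_times, duration_s=0.5, bin_s=0.010):
    """Return a binary string with '1' for each bin that contains at least one spike."""
    n_bins = int(round(duration_s / bin_s))
    idx = np.floor(np.asarray(spike_times) / bin_s).astype(int)
    idx = idx[(idx >= 0) & (idx < n_bins)]
    bits = np.zeros(n_bins, dtype=int)
    bits[idx] = 1
    return "".join(map(str, bits))
```

With a 20 Hz template, a jitter of 0.004 s, and 10 ms bins over 0.5 s, each realization becomes a length-50 binary string that can be fed to the network one symbol at a time.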
We see that the two minimized DFA are different, corresponding to different grammars governing the two Poisson spiking templates.

V. CONCLUSION

In this paper, we presented a novel recurrent stochastic gradient descent algorithm in the RKHS, based on the ARMA model. This kernelized recurrent system bridges the gap between the theories of kernel adaptive signal processing and recurrent neural networks. Simulations showed that KAARMA is an efficient and effective solution for the identification and synthesis of DFA and outperforms RNNs operating on the same

recurrent architecture, using a set of benchmark NP-complete problems involving grammatical inference. Since all computational problems are reducible into binary decisions operating on finite length sequences, the ability to extract DFA from data has wide-ranging implications. In the future, we will apply and evaluate the proposed framework in other problem domains, such as speech processing and regression, with continuous signals, neural codes, and other forms of input.

TABLE III
COMPARISON OF QKAARMA TO GRADIENT-BASED RECURRENT NEURAL NETWORKS AND BY RANDOM GUESSING

Grammar  Inference engine                     Train size  Test error  Accuracy (%)  Network size  Extraction rate  DFA size
1        QKAARMA                              170         4           99.994        43.3          1.00             4.5
1        RNN (Miller & Giles '93)             23000       1           99.999        9 (1st)       1.00             9.2
1        RG (Schmidhuber & Hochreiter '96)    182         -           -             1 (A1)        -                -
2        QKAARMA                              700         3           99.995        29.8          1.00             6
2        RNN                                  77000       5           99.992        9 (2nd)       1.00             9.9
2        RG                                   1511        -           -             3 (A1)        -                -
4        QKAARMA                              900         1343        97.919        25            1.00             8.2
4        RNN                                  46000       1240        98.078        9 (2nd)       0.81             12.3
4        RG                                   13833       -           -             2 (A1)        -                -
6        QKAARMA                              1160        2944        95.437        36.6          1.00             5.5
6        RNN                                  49000       8725        86.475        9 (2nd)       0.67             10.5
7        QKAARMA                              4400        4623        92.834        30.2          1.00             10.8
7        RNN                                  121000      889         98.622        9 (2nd)       0.86             10.7

[Fig. 8. Poisson spike train templates (red for class 0; and blue, 1). Actual spike trains used for training and testing are noisy versions with Gaussian jitter of 4 ms. Panels: possible spike train segments, resulting input spike train (template 0). Horizontal axis: time [sec].]

[Fig. 9. LSM result for test input 3 with readout function trained by a linear classifier, p-Delta, linear regressor, and backpropagation (desired is red; blue, estimates). Panel titles: with linear classification (mae=0.423, mse=0.423), with pdelta (mae=0.412, mse=0.262), with linear regression (mae=0.466, mse=0.24), with backpropagation (mae=0.437, mse=0.325). Horizontal axis: time [sec].]

There are also many issues that need to be studied in the future. Feature spaces induced by Gaussian kernels are special Hilbert spaces where all evaluations are finite. However, this does not translate directly into convergent dynamics. For recurrent systems, this requires studies of stability that are beyond bounded-input bounded-output (BIBO) stability. Along with stability, a proper treatment of exploding gradients will also

be pursued in the future. An added advantage of the quantization method in Algorithm 2 is that we can guarantee stability more easily by evaluating the new centers as they are introduced. Furthermore, instead of performing the quantization in the input space, we will evaluate the performance using distance measures in the RKHS, e.g., the correntropy induced metric.

[Fig. 10. Minimized DFA for Poisson spike trains with Gaussian jitter of 4 ms. State transitions every 10 ms. Panels: Spike Template 0, Spike Template 1.]

APPENDIX

In formal language theory [58], DFA recognize regular grammars in the Chomsky hierarchy [59] of phrase-structure grammars, operating in either language validation or generative mode. In this context, system identification of DTDS can be reformulated as a grammatical inference problem: from a set of positive and negative training strings, according to a target language, infer the grammar satisfying all available samples. However, grammar induction is NP-complete [60]. Early solutions rely on heuristic algorithms that scale poorly with the size of the inferred automaton [47]. The relationship between RNNs and automata has been studied extensively since [61]. RNNs can simulate any FSM [62], and even an arbitrary Turing machine in real time [63].

Here, we briefly review the concepts of DFA and regular grammar. A DFA is a 5-tuple $A = \langle Q, \Sigma, \delta, q_0, F \rangle$, where $Q$ denotes a finite set of states, $\Sigma$ is a finite set of input symbols or alphabet, $\delta$ is a state transition function ($\delta : Q \times \Sigma \to Q$), $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is a set of final or accepting states. A DFA can be represented by a state transition table or diagram. For a given string $w$ over the alphabet $\Sigma$, the automaton $A$ accepts $w$ if it transitions into a final state on the last input symbol. Otherwise, $w$ is rejected. The set of all strings accepted by $A$ is the language $L(A)$. A grammar $G$ is a 4-tuple $G = \langle N, T, P, S \rangle$, where $N$ and $T$ are disjoint finite sets of nonterminal and terminal symbols, respectively, $P$ is a set of production rules, and $S \in N$ is the start variable. Grammar $G$ is regular if and only if every rule in $P$ is of the form $B \to a$, $B \to aC$, or $B \to \epsilon$, where $B$ and $C$ are in $N$ (allowing $B = C$), $a \in T$, and $\epsilon$ denotes the empty string. The language defined by $G$ is denoted $L(G)$.
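The 5-tuple definition of a DFA given above maps directly onto a small data structure. The sketch below is illustrative only: the class name, the state names, and the hand-written automaton for Tomita grammar #4 (fewer than three consecutive 0's) are assumptions for this example, and the automaton is not the one extracted in Fig. 4. In particular, it accepts the empty string, whereas the trained networks treat the initial state as non-final.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    states: frozenset    # Q
    alphabet: frozenset  # Sigma
    delta: dict          # transition function, maps (state, symbol) -> state
    q0: str              # initial state
    final: frozenset     # F, the accepting states

    def accepts(self, w):
        """Return True if the automaton ends in a final state after reading w."""
        q = self.q0
        for a in w:
            q = self.delta[(q, a)]
        return q in self.final


# Hand-written automaton: state "zk" means the last k symbols read were 0's.
tomita4 = DFA(
    states=frozenset({"z0", "z1", "z2", "dead"}),
    alphabet=frozenset("01"),
    delta={
        ("z0", "0"): "z1", ("z0", "1"): "z0",
        ("z1", "0"): "z2", ("z1", "1"): "z0",
        ("z2", "0"): "dead", ("z2", "1"): "z0",
        ("dead", "0"): "dead", ("dead", "1"): "dead",
    },
    q0="z0",
    final=frozenset({"z0", "z1", "z2"}),
)
```

Here `tomita4.accepts("1001001")` is true, while `tomita4.accepts("10001")` is false because the string contains three consecutive 0's.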
An automaton serves as the analytical descriptor of a language, and the grammar as its generative descriptor. A language produced by a regular grammar can be validated by a DFA, and from $A$, one can easily construct a regular grammar such that $L(G) = L(A)$.

REFERENCES

[1] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.
[2] W. Liu, J. C. Príncipe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction. Hoboken, NJ, USA: Wiley, 2010.
[3] L. Li, A. J. Brockmeier, J. S. Choi, J. T. Francis, J. C. Sanchez, and J. C. Príncipe, “A tensor-product-kernel framework for multiscale neural activity decoding and control,” Comput. Intell. Neurosci., 2014. [Online]. Available: http://dx.doi.org/10.1155/2014/870160
[4] W. Liu, P. Pokharel, and J. C. Príncipe, “The kernel least mean square algorithm,” IEEE Trans. Signal Processing, vol. 56, no. 2, pp. 543–554, 2008.

[5] Y. Engel, S. Mannor, and R. Meir, “The kernel recursive least-squares algorithm,” IEEE Trans. Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.
[6] W. Liu, I. Park, Y. Wang, and J. C. Príncipe, “Extended kernel recursive least squares algorithm,” IEEE Trans. Signal Processing, vol. 57, no. 10, pp. 3801–3814, 2009.
[7] J. F. Kolen and S. C. Kremer, A Field Guide to Dynamical Recurrent Networks. New York, NY, USA: Wiley-IEEE Press, 2001.
[8] J. Platt, “A resource-allocating network for function interpolation,” Neural Comput., vol. 3, no. 2, pp. 213–225, 1991.
[9] W. Liu, P. Pokharel, and J. C. Príncipe, “An information theoretic approach of designing sparse kernel adaptive filters,” IEEE Trans. Neural Networks, vol. 20, no. 12, pp. 1950–1961, 2009.
[10] C. Richard, J. Bermudez, and P. Honeine, “Online prediction of time series data with kernels,” IEEE Trans. Signal Processing, vol. 57, no. 3, pp. 1058–1066, 2009.
[11] B. Chen, S. Zhao, P. Zhu, and J. C. Príncipe, “Quantized kernel least mean square algorithm,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 1, pp. 22–32, 2012.
[12] S. Van Vaerenbergh, J. Via, and I. Santamaria, “Nonlinear system identification using a new sliding-window kernel RLS algorithm,” J. Commun., vol. 2, no. 3, pp. 1–8, 2007.
[13] S. Van Vaerenbergh, M. Lazaro-Gredilla, and I. Santamaria, “Kernel recursive least-squares tracker for time-varying regression,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8, pp. 1313–1326, 2012.
[14] S. Zhao, B. Chen, P. Zhu, and J. C. Príncipe, “Fixed budget quantized kernel least-mean-square algorithm,” Signal Processing, vol. 93, no. 9, pp. 2759–2770, 2013.
[15] K. Funahashi and Y. Nakamura, “Approximation of dynamical systems by continuous time recurrent neural networks,” Neural Networks, vol. 6, pp. 801–806, 1993.
[16] K. S. Narendra and K. Parthasarathy, “Identification and control of dynamical systems using neural networks,” IEEE Trans. Neural Networks, vol. 1, pp. 4–27, 1990.
[17] P. J. Werbos, “Backpropagation through time: What it is and how to do it,” Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[18] R. J. Williams, “Training recurrent networks using the extended Kalman filter,” in Proc. IEEE IJCNN, vol. 4, Baltimore, MD, USA, 1992, pp. 241–246.
[19] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Comput., vol. 1, pp. 270–280, 1989.
[20] H. Zhao, X. Zeng, and Z. He, “Low-complexity nonlinear adaptive filter based on a pipelined bilinear recurrent neural network,” IEEE Trans. Neural Networks, vol. 22, no. 9, pp. 1494–1507, 2011.
[21] H. Jaeger, “The “echo state” approach to analysing and training recurrent neural networks,” 2001.
[22] W. Maass, T. Natschläger, and H. Markram, “Real-time computing without stable states: a new framework for neural computation based on perturbations,” Neural Comput., vol. 14, no. 11, pp. 2531–2560, 2002.
[23] J. C. Príncipe and B. Chen, “Universal approximation with convex optimization: Gimmick or reality,” IEEE Comp. Intell. Mag., submitted for publication, 2014.
[24] M. O. Rabin and D. Scott, “Finite automata and their decision problems,” IBM J. Res. Develop., vol. 3, no. 2, pp. 114–125, 1959.
[25] B. Schölkopf, R. Herbrich, and A. J. Smola, “A generalized representer theorem,” in Proc. 14th Annual Conf. on Comput. Learn. Theory, vol. 2111, 2001, pp. 416–426.
[26] A. H. Sayed and T. Kailath, “A state-space approach to adaptive RLS filtering,” IEEE Signal Process. Mag., vol. 11, pp. 18–60, 1994.
[27] S. Haykin, A. Sayed, J. Zeidler, P. Yee, and P. Wei, “Adaptive tracking of linear time-variant systems by extended RLS algorithms,” IEEE Trans. Signal Processing, vol. 45, no. 5, pp. 1118–1128, 1997.

[28] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Trans. ASME, Series D., Journal of Basic Eng., vol. 82, pp. 35–45, 1960.
[29] P. Zhu, B. Chen, and J. C. Príncipe, “A novel extended kernel recursive least squares algorithm,” Neural Networks, vol. 32, pp. 349–357, 2012.
[30] B. Anderson and J. Moore, Optimal Filtering. Englewood Cliffs, NJ, USA: Prentice-Hall, 1979.
[31] L. Ralaivola and F. d’Alché-Buc, “Time series filtering, smoothing and learning using the kernel Kalman filter,” in Proc. IEEE IJCNN, Montreal, QC, Canada, 2005, pp. 1449–1454.
[32] P. Zhu and J. C. Príncipe, “Learning nonlinear generative models of time series with a Kalman filter in RKHS,” IEEE Trans. Signal Processing, vol. 62, no. 1, pp. 141–155, 2014.
[33] L. Song, J. Huang, A. Smola, and K. Fukumizu, “Hilbert space embeddings of conditional distributions with applications to dynamical systems,” in Proc. 26th Int. Conf. Mach. Learn., Montreal, QC, Canada, 2009, pp. 961–968.
[34] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Mach. Learn., vol. 3, pp. 9–44, 1988.
[35] J. Bae, L. S. Giraldo, P. Chhatbar, J. Francis, J. Sanchez, and J. C. Príncipe, “Stochastic kernel temporal difference for reinforcement learning,” Boston, MA, USA, 2011, pp. 5662–5665.
[36] J. C. Príncipe, B. de Vries, and P. G. de Oliveira, “The gamma filter-a new class of adaptive IIR filters with restricted feedback,” IEEE Trans. Signal Processing, vol. 41, no. 2, pp. 649–656, 1993.
[37] J. J. Shynk, “Adaptive IIR filtering,” IEEE ASSP Mag., pp. 4–21, April 1989.
[38] D. Tuia, J. Munoz-Mari, J. L. Rojo-Alvarez, M. Martinez-Ramon, and G. Camps-Valls, “Explicit recursive and adaptive filtering in reproducing kernel Hilbert spaces,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 7, pp. 1413–1419, 2014.
[39] P. Zhu and J. C. Príncipe, “Kernel recurrent system trained by real-time recurrent learning algorithm,” in Proc. IEEE ICASSP, Vancouver, BC, Canada, May 2013, pp. 3572–3576.
[40] M. Martínez-Ramón, J. L. Rojo-Álvarez, G. Camps-Valls, J. Muñoz-Marí, Á. Navia-Vázquez, E. Soria-Olivas, and A. R. Figueiras-Vidal, “Support vector machines for nonlinear kernel ARMA system identification,” IEEE Trans. Neural Networks, vol. 17, no. 6, pp. 1617–1622, 2006.
[41] L. Shpigelman, H. Lalazar, and E. Vaadia, “Kernel-ARMA for hand tracking and brain-machine interfacing during 3D motor control,” Adv. Neural Info. Proc. Sys., vol. 21, pp. 1489–1496, 2009.
[42] M. Hermans and B. Schrauwen, “Recurrent kernel machines: Computing with infinite echo state networks,” Neural Comput., vol. 24, no. 1, pp. 104–133, 2012.
[43] D. R. Smith, Variational Methods in Optimization. Englewood Cliffs, NJ, USA: Prentice-Hall, 1974.
[44] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[45] M. Tomita, “Dynamic construction of finite-state automata from examples using hill-climbing,” in Proc. 4th Ann. Conf. Cogn. Sci., 1982, pp. 105–108.
[46] R. J. Solomonoff, “A formal theory of inductive inference,” Inform. Contr., vol. 7, pp. 1–22, 224–254, 1964.
[47] D. Angluin and C. H. Smith, “A survey of inductive inference: Theory and methods,” ACM Comput. Survey, vol. 15, no. 3, pp. 237–269, 1983.
[48] M. H. Tong, A. D. Bickett, E. M. Christiansen, and G. W. Cottrell, “Learning grammatical structure with echo state networks,” Neural Networks, vol. 20, no. 3, pp. 424–432, 2007.
[49] C. L. Giles, D. Chen, C. B. Miller, H. H. Chen, G. Z. Sun, and Y. C. Lee, “Second-order recurrent neural networks for grammatical inference,” in Proc. IEEE IJCNN, vol. 2, Seattle, WA, USA, 1991, pp. 273–281.
[50] I. Gabrijel and A. Dobnikar, “On-line identification and reconstruction of finite automata with generalized recurrent neural networks,” Neural Networks, vol. 16, no. 1, pp. 101–120, 2003.

[51] S. H. Won, I. Song, S. Y. Lee, and C. H. Park, “Identification of finite state automata with a class of recurrent neural networks,” IEEE Trans. Neural Networks, vol. 21, no. 9, pp. 1408–1421, 2010.
[52] Z. Zeng, R. M. Goodman, and P. Smyth, “Learning finite state machines with self-clustering recurrent networks,” Neural Comput., vol. 5, no. 6, pp. 976–990, 1993.
[53] K. Arai and R. Nakano, “Stable behavior in a recurrent neural network for a finite state machine,” Neural Networks, vol. 13, no. 6, pp. 667–680, 2000.
[54] C. B. Miller and C. L. Giles, “Experimental comparison of the effect of order in recurrent neural networks,” Int. J. Pattern Recognition Artificial Intell. (Special Issue on Appl. of Neural Netw. to Pattern Recogn.), vol. 7, no. 4, pp. 849–872, 1993.
[55] J. Schmidhuber and S. Hochreiter, “Guessing can outperform many long time lag algorithms,” Technical Note IDSIA-19-96, 1996.
[56] S. Porat and J. A. Feldman, “Learning automata from ordered examples,” Mach. Learn., vol. 7, pp. 109–138, 1991.
[57] Learning-tool: analysing the computational power of neural microcircuits (version 1.0), The IGI LSM Group, June 11, 2006. [Online]. Available: http://www.lsm.tugraz.at/download/learning-tool-1.1-manual.pdf
[58] M. Harrison, Introduction to Formal Language Theory. Boston, MA, USA: Addison-Wesley, 1978.
[59] N. Chomsky, “Three models for the description of language,” IRE Trans. on Inf. Theory, vol. 2, no. 2, pp. 113–124, 1956.
[60] E. M. Gold, “Complexity of automaton identification from given data,” Inf. and Control, vol. 37, no. 3, pp. 302–320, 1978.
[61] W. S. McCulloch and W. H. Pitts, “A logical calculus of the ideas immanent in nervous activity,” Bull. Math. Biophys., vol. 5, no. 4, pp. 115–133, 1943.
[62] M. L. Minsky, Computation: Finite and Infinite Machines. Englewood Cliffs, NJ, USA: Prentice-Hall, 1967.
[63] H. T. Siegelmann and E. D. Sontag, “On the computational power of neural nets,” J. Comput. System Sci., vol. 50, no. 1, pp. 132–150, 1995.

Kan Li (S’08) received the B.A.Sc. degree in electrical engineering from the University of Toronto in 2007 and the M.S. degree in electrical engineering from the University of Hawaii in 2010. He has been a research assistant at the Computational NeuroEngineering Laboratory (CNEL), University of Florida, since 2012, where he is currently pursuing the Ph.D. degree in electrical engineering. His research interests include machine learning and signal processing.

José C. Príncipe (M’83-SM’90-F’00) is the BellSouth and Distinguished Professor of Electrical and Biomedical Engineering at the University of Florida, and the Founding Director of the Computational NeuroEngineering Laboratory (CNEL). His primary research interests are in advanced signal processing with information theoretic criteria and adaptive models in reproducing kernel Hilbert spaces (RKHS), with application to brain-machine interfaces (BMIs). Dr. Príncipe is a Fellow of the IEEE, ABME, and AIBME. He is the past Editor in Chief of the IEEE Transactions on Biomedical Engineering, past Chair of the Technical Committee on Neural Networks of the IEEE Signal Processing Society, Past-President of the International Neural Network Society, and a recipient of the IEEE EMBS Career Award and the IEEE Neural Network Pioneer Award.