Some Limitations of Linear Memory Architectures for Signal Processing

Bryan W. Stiles and Joydeep Ghosh
Department of Electrical and Computer Engineering
The University of Texas at Austin
Abstract

Certain neural network structures with a linear "memory" stage followed by a nonlinear memoryless stage are commonly used for signal processing. Two examples of such structures are the time delay neural network and the focused gamma network. These structures can approximate arbitrarily well a wide range of mappings between discrete time signals. However, in order to achieve this capability, the dimensionality of the output (state vector) of the linear memory stage is allowed to be arbitrarily large. In practice the dimensionality of the state vector must be limited due to finite resources and in order to reduce problems, arising from "the curse of dimensionality", while training the memoryless stage. We discuss how such a limitation affects the range of functions which can be approximated by the structure. Further, it is proven that given any tolerance and any limit on the dimensionality of the state vector, there are computationally simple and useful functions which cannot be approximated to the given tolerance by any linear memory structure (including TDNNs and the focused gamma network) which conforms to the prescribed limit on the state vector dimensionality. The existence of such functions provides a rationale for examining structures with nonlinear memory.
This work was supported in part by NSF grant ECS 9307632 and ONR contract N00014-92C-0232. Bryan Stiles was also supported by the Du Pont Graduate Fellowship in Electrical Engineering.

1 Introduction

This paper focuses on two-stage networks that consist of a temporal encoding stage followed by a nonlinear memoryless stage. The memoryless stage typically consists of a feedforward neural network that is a universal approximator, such as a multi-layer perceptron or radial basis function network. At each time instant, the temporal encoding stage operates on the input to produce a state vector which is then used as an input to the memoryless stage. A general block diagram of such a two-stage structure is shown in Figure 1. Certain forms of these networks have been previously shown to be capable of approximating arbitrarily well any continuous, causal, time-invariant, approximately finite memory mapping between uniformly bounded discrete time signals [2], [3], [4]. Until recently, all of the specific structures for which this approximation ability has been demonstrated have used a linear temporal encoding as the memory mechanism. The most commonly used of these linear memory structures are the time delay neural network (TDNN) and the focused gamma network [1]. In [5] a structure with a nonlinear temporal encoding stage was shown
to have the same approximation capability. This structure is referred to as the habituation based neural network. All the approximation capability results mentioned above are proofs of existence. They do not provide a prescription for determining network size or the values of network parameters. Also, the dimension of the state vector may need to be made arbitrarily large in order to approximate a given function within a particular tolerance. For example, in a TDNN the number of taps needed on each delay line may be large.
Figure 1. Two-Stage Neural Network Structures. The input x(t) is passed through a memory mechanism (a tapped delay line in the TDNN, a gamma memory in the focused gamma network) to produce a state vector, which is fed to a memoryless feedforward neural network (an MLP) that produces the output (fx)(t).
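As a concrete illustration of the two-stage structures in Figure 1, the following sketch (hypothetical Python/NumPy, not part of the original formulation; the zero-padding of samples before t = 0, the particular gamma memory recursion, and the random MLP weights are illustrative assumptions) builds a state vector with either a tapped delay line or a gamma memory and feeds it to a small MLP.

```python
import numpy as np

def tdnn_state(x, t, m):
    """Tapped delay line state: the m most recent samples (zero-padded before t = 0)."""
    return np.array([x[t - i] if t - i >= 0 else 0.0 for i in range(m)])

def gamma_states(x, m, mu=0.5):
    """State trajectory of an m-tap gamma memory, using one common form of the
    recursion: s_0(t) = x(t), s_k(t) = (1 - mu) * s_k(t-1) + mu * s_{k-1}(t-1)."""
    T = len(x)
    s = np.zeros((T, m + 1))
    for t in range(T):
        s[t, 0] = x[t]
        if t > 0:
            s[t, 1:] = (1 - mu) * s[t - 1, 1:] + mu * s[t - 1, :-1]
    return s[:, 1:]                      # the m memory taps form the state vector

def mlp(state, W1, b1, W2, b2):
    """Memoryless stage: a one-hidden-layer perceptron applied to the state vector."""
    return W2 @ np.tanh(W1 @ state + b1) + b2

# A toy two-stage network (fx)(t) = MLP(state(t)) with state dimensionality m = 4.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
m = 4
W1, b1 = rng.normal(size=(8, m)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)
y_tdnn  = [mlp(tdnn_state(x, t, m), W1, b1, W2, b2)[0] for t in range(len(x))]
y_gamma = [mlp(s, W1, b1, W2, b2)[0] for s in gamma_states(x, m)]
```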
When trying to use any model to approximate any particular function in a practical setting, certain issues become important, such as the desired tolerance, the amount of resources available (i.e. the amount of memory or computation time required), and the nature of the error surface, which governs the ease of finding a good set of parameters. Since the amount of resources is always finite, the complexity of the model must be limited. An important factor in the complexity of a two-stage structure is the dimensionality of the state vector. Obviously if the complexity of the model is limited, the dimensionality of the state vector must also be limited. In practice, this limitation may be severe. Due to the "curse of dimensionality", the complexity of the memoryless stage required to achieve a given tolerance usually increases with increasing state vector dimensionality. Thus focus is brought upon the issue of efficient representation. For the two-stage structures of Figure 1, the question is: for a given state vector dimensionality and a given limit to the complexity of the feedforward stage (i.e. number of hidden units), which network forms provide better approximations for different types of target functions? We take a step in addressing this question by showing that certain simple functions cannot be efficiently represented by linear memory structures. The existence of these functions motivates the design and development of nonlinear memory structures such as the habituation based neural network. Henceforth any linear memory structure in which the state vector dimensionality is less than some positive integer m shall be referred to as an m-limited linear memory structure. In the next section we discuss functions which cannot be approximated within a desired tolerance ε by any m-limited linear memory structure. We refer to such functions as linear
memory hard with respect to m and ε. We show that a certain class of realistic signal classification problems and related functions are not efficiently modeled by linear memory structures. We exhibit a set of functions F which can be used to model such functions. The set F is itself a good illustration of the limitations of linear memory models. Given any m and any ε there is an element f_{m,ε} of F which is linear memory hard with respect to m and ε. Furthermore there exists such an f_{m,ε} with a dynamic range restricted to [0, 2ε + δ] for any positive δ. This condition implies that the best possible m-limited linear memory structure can only approximate f_{m,ε} marginally better than the best constant function approximation ((fx)(t) = ε + δ/2). Additionally there is an f_{m,ε} that meets all these constraints and can be parameterized with only five scalar parameters. Therefore it is not merely that there is always a function so complex it cannot be approximated by an m-limited linear memory structure, but rather that there are simple functions which cannot be so approximated. In Section 3 we discuss a nonlinear memory structure which has the same approximation capability as a TDNN or a focused gamma network, but can efficiently approximate all the functions in F, in the sense that the number of parameters in the approximation structure is linear in the number of parameters in the element of F to be approximated. Section 4 summarizes our results.
2 Linear Memory Hard Mappings

Before we present our mathematical analysis it is necessary to list a few definitions. Let X denote the set of input signals, mappings from the nonnegative integers to [−1, 1], and let U denote the corresponding set of output signals. For a positive integer n, (d_n x)(t) denotes the vector of the n most recent input samples at time t. The following theorem describes a large class of functions from X to U which are linear memory hard at time t with respect to m and ε. Each element of X_{m,t} can be visualized as a unique point in an (m + 1)-dimensional space where the dimensions represent the values x(t), x(t − 1), x(t − 2), ..., x(t − m). If for every direction u in this space there are two elements of X_{m,t}, x_u and y_u, such that the direction from x_u to y_u is u and |(fx_u)(t) − (fy_u)(t)| > 2ε, then f is linear memory hard at time t with respect to m and ε.

Theorem 2 Let m be a positive integer and let ε > 0. A function f from X to U is linear memory hard at time t with respect to m and ε if for each unit vector u in R^{m+1} there exist elements x_u and y_u of X_{m,t} and a positive real number c_u such that |(fx_u)(t) − (fy_u)(t)| > 2ε and
(d_{m+1}(x_u − y_u))(t) = c_u u.

Informally stated, the theorem means that if at time t_0 the previous m + 1 values of the input x are each sufficiently important to the current output of the function (fx)(t_0), then no m-output linear memory can represent the input space well enough to approximate f within ε at time t_0. Thus f is linear memory hard at time t_0 with respect to m and ε. Additionally, by Theorem 1, if f is time-invariant then it is also linear memory hard at all times t > t_0.
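The mechanism behind Theorem 2 can be illustrated numerically. Restricted to the last m + 1 input samples, a linear memory whose state vector has only m components acts as an m × (m + 1) matrix, which necessarily has a nontrivial null space; two input windows that differ along a null direction produce identical state vectors, so no memoryless stage can separate them. The sketch below is a hypothetical Python/NumPy illustration (the matrix W, the template v, and the Gaussian width 10 are assumptions chosen for the example, not quantities from the paper).

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
m = 3
# Restricted to the last m + 1 samples, an m-dimensional linear memory acts as an
# m x (m + 1) matrix W (an illustrative stand-in), which has a nontrivial null space.
W = rng.normal(size=(m, m + 1))
u = null_space(W)[:, 0]                 # a window direction the memory cannot see
u /= np.linalg.norm(u)

window_x = rng.uniform(-0.4, 0.4, size=m + 1)
window_y = window_x + 0.5 * u           # differs from window_x only along u

print(np.allclose(W @ window_x, W @ window_y))   # True: identical state vectors

# Yet a simple nonlinear function of the window, e.g. a Gaussian template match
# of the kind used in this section, assigns the two windows very different values.
v = window_x.copy()                     # illustrative template
f = lambda w: np.exp(-10.0 * np.linalg.norm(v - w) ** 2)
print(f(window_x), f(window_y))         # 1.0 versus roughly 0.08
```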
Consider the problem of determining whether or not short time temporal patterns within a distance T of a template v have occurred in the past m input samples. A classifier g_v for such a problem is defined by

(g_v x)(t) = 1 if there is an i such that 0 ≤ i ≤ m and ||v − d_n x(t − i)|| ≤ T
(g_v x)(t) = 0 if there is no i such that 0 ≤ i ≤ m and ||v − d_n x(t − i)|| ≤ T.    (2)
This is a particularly hard problem to solve for two reasons. First, for the past m time samples the classifier must have infinitely precise spatial resolution in order to tell the difference between patterns observed along the boundary ||v − d_n x(t − i)|| = T. Secondly, it must have very good temporal resolution because the same pattern might be ignored if it occurred at time t − m − 1 but detected if it occurred at t − m. Let us now consider a version of this problem in which the requirements are relaxed somewhat, by allowing the classifier to be undefined for patterns near the spatial and temporal boundaries of the classes. In this case, a signal of class 1 is detected at time t if a particular short time pattern was observed during the m1 time instants prior to t. A signal of class 0 is detected if no such pattern was observed during another time interval m0 ≥ m1. We define the classifier function f_v by

(f_v x)(t) = 1 if there is an i such that 0 ≤ i ≤ m1 and ||v − d_n x(t − i)|| < T1
(f_v x)(t) = 0 if there is no i such that 0 ≤ i ≤ m0 and ||v − d_n x(t − i)|| < T0    (3)
(f_v x)(t) = undefined otherwise

with T1 and T0 real number thresholds such that T1 < T0. Here v is the template to be detected. The threshold T1 is the maximum distance from v at which a pattern triggers detection of class 1. The threshold T0 is the minimum allowable distance from v at which class 0 is detected. The classifier is allowed to take any value (undefined) for patterns which fall between the distances T1 and T0 from the template, and for patterns within T1 of v which occur between times t − m1 and t − m0. For a wide range of values of T1, T0, m1, m0, and v there are certain simple nonlinear functions which can be used to realize f_v at all points at which it is defined. However, for a similarly large range of these parameters, there is no m1-limited linear memory structure which can realize f_v. Since n can be much less than m1, f_v is not efficiently represented by a linear memory structure. In fact, a wide class of functions which make use of information similar to that encoded by f_v cannot be efficiently represented by linear memory structures. This leads to the following theorem.

Theorem 3 Let ε > 0. Let m1, m0 and n be positive integers such that n ≤ m1 ≤ m0. Let T1 and T0 be real numbers such that 0 < T1 < T0. Let v be an element of [−1, 1]^n such that the magnitude of each component of v, |v_i|, is an element of (2T0, 1 − T0). Let f be a function from X to U with the following property. If x and y are elements of X such that (f_v x)(t) and (f_v y)(t) are defined and (f_v x)(t) ≠ (f_v y)(t), then |(fx)(t) − (fy)(t)| > 2ε. The function f is linear memory hard at all t > m1 with respect to m1 and ε.

It is our hypothesis that the condition that the magnitude of each component of v, |v_i|, be an element of (2T0, 1 − T0) is not a necessary condition. It was included primarily to simplify the proof. However, it does eliminate certain trivially simple versions of f_v which can be realized by linear memory structures. For example, if the entire set [−1, 1]^n is within distance T0 of v then for all x ∈ X and all nonnegative integers t, (f_v x)(t) is either undefined or equal to 1. In this case the constant function (fx)(t) = 1 satisfies the conditions on f in the theorem.
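For concreteness, Equations (2) and (3) can be transcribed directly. The sketch below is a hypothetical Python/NumPy rendering; the convention that samples before t = 0 are zero and the ordering of the window (d_n x)(t) = (x(t), ..., x(t − n + 1)) are assumptions made for the example.

```python
import numpy as np

def window(x, t, n):
    """(d_n x)(t): the n most recent samples at time t (assumed ordering, zero-padded)."""
    return np.array([x[t - j] if t - j >= 0 else 0.0 for j in range(n)])

def g_v(x, t, v, T, m):
    """Equation (2): 1 iff some window d_n x(t - i), 0 <= i <= m, lies within T of v."""
    n = len(v)
    return int(any(np.linalg.norm(v - window(x, t - i, n)) <= T for i in range(m + 1)))

def f_v(x, t, v, T1, T0, m1, m0):
    """Equation (3): 1, 0, or None (undefined), with relaxed spatial/temporal margins."""
    n = len(v)
    if any(np.linalg.norm(v - window(x, t - i, n)) < T1 for i in range(m1 + 1)):
        return 1
    if all(np.linalg.norm(v - window(x, t - i, n)) >= T0 for i in range(m0 + 1)):
        return 0
    return None
```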
In signal classification, problems like f_v are common. Often a signal is classified based on the occurrence of some simple short time pattern which may occur within a relatively long time interval. Such a problem is not well represented by a linear memory structure. This shortcoming is commonly dealt with by redefining what is meant by a signal. Each example of the pattern is considered a separate signal, and the classifier is only required to detect the signal immediately after the occurrence of the pattern. However, consider the case in which a particular class is defined by the presence of multiple short time patterns over a long time interval. The classifier must remember the occurrence of each pattern until all required patterns have been observed. In order to remember a pattern for m + 1 time instants after it has occurred, a linear memory structure must have an (m + 1)-dimensional state vector (because it must realize f_v). The best one could do with a simpler linear memory structure would be to detect each pattern for some small number of time instants and then forget about it afterward. Clearly, such a strategy is unable to realize any function in which information involving the various patterns must be combined. By the time the structure detects one pattern it has forgotten the occurrence of the others. For example, consider the following classification problem. Suppose an input signal is in class A if five patterns, each three time samples long (v1, v2, ..., v5), are observed somewhere within some long period of time (e.g. 100 time samples), and in class B otherwise. In this case a linear memory structure would require a 100-dimensional state vector to solve the problem, but we shall see that a particular nonlinear memory structure can solve the problem much more efficiently.

Consider a set of functions F from X to U consisting of elements f of the form

(fx)(t) = max(α exp(−γ ||v||²), α exp(−γ ||v − (d_n x)(t)||²))    if t = 0
(fx)(t) = max(β (fx)(t − 1), α exp(−γ ||v − (d_n x)(t)||²))       otherwise
for all positive γ, positive integers n, v ∈ [−1, 1]^n, positive α, and β such that 0 < β < 1. Notation of the form {F | n, α} is used to mean the subset of F for which the parameters n and α have some given constant value. The elements of F can be considered to be template matching functions with approximately finite memory. Let f be an element of F. Whenever an n length temporal segment of the input x is seen which closely matches the template v, a gaussian response is produced which is maximal ((fx)(t) = α) if an exact match is made. At each instant the current response is compared to a decayed (with decay rate β) version of a previous response. The output is chosen to be the maximum of the two. This output then decays over time and is compared with future responses. In this manner, f remembers an old template match until it decays to the point where a newer match supersedes it.

The set F is useful for solving a wide variety of classification problems of the form f_v. If T0 > sqrt((m0 + 1)/(m0 − m1 + 1)) T1 then there is an element of F which can distinguish between class 0 and class 1.

Theorem 4 Let n, m1, and m0 be positive integers such that m0 ≥ m1. Let v be an element of [−1, 1]^n. Let T1 and T0 be real numbers such that 0 < T1 < T0 ≤ ||v||. If T0 > sqrt((m0 + 1)/(m0 − m1 + 1)) T1 then there exist an h > 0 and f ∈ {F | n, v} such that the following is true. For all x ∈ X and nonnegative integers t, (fx)(t) ≥ h implies (f_v x)(t) = 1 or (f_v x)(t) is undefined, and (fx)(t) < h implies (f_v x)(t) = 0 or (f_v x)(t) is undefined.

One can find an f ∈ F which distinguishes between class 1 and class 0 provided that the classes are sufficiently distinct in their temporal and spatial specifications.
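To make the behavior of the elements of F concrete, the following sketch (hypothetical Python/NumPy; the parameter names α, β, γ follow the reconstructed definition above, and the handling of t = 0 and of samples before t = 0 are assumptions) evaluates an element of F recursively and checks the threshold condition of Theorem 4 for particular values of T0, T1, m0, and m1.

```python
import numpy as np

def make_f(v, alpha, beta, gamma):
    """An element of F: a decayed running maximum of Gaussian template matches
    (parameter names follow the reconstructed definition in the text)."""
    n = len(v)

    def f(x):
        out, prev = [], None
        for t in range(len(x)):
            w = np.array([x[t - j] if t - j >= 0 else 0.0 for j in range(n)])
            match = alpha * np.exp(-gamma * np.linalg.norm(v - w) ** 2)
            if t == 0:
                # t = 0: compare against the response to an assumed all-zero pre-history.
                prev = max(alpha * np.exp(-gamma * np.dot(v, v)), match)
            else:
                prev = max(beta * prev, match)
            out.append(prev)
        return np.array(out)

    return f

def separable(T0, T1, m0, m1):
    """Theorem 4's condition on the thresholds: T0 > sqrt((m0 + 1)/(m0 - m1 + 1)) * T1."""
    return T0 > np.sqrt((m0 + 1) / (m0 - m1 + 1)) * T1

# Example: a template match remembered over a long interval, and a threshold check.
f = make_f(v=np.array([0.8, -0.7, 0.6]), alpha=1.0, beta=0.95, gamma=5.0)
y = f(np.random.default_rng(2).uniform(-1.0, 1.0, size=30))   # decays between matches
print(separable(T0=0.6, T1=0.1, m0=100, m1=20))   # True: some element of F separates the classes
```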
Clearly, from Theorems 3 and 4, there is a wide range of classifiers of the f_v form which can be realized by a single element of F, but cannot be realized by any m1-limited linear memory structure. In addition to its usefulness for realizing classifiers of this sort, the set F is itself an interesting example of the limitations of linear memory structures. Given any positive integer m, any positive ε and any positive δ, the set F contains an element f with the following three properties. First, (fx)(t) ∈ [0, 2ε + δ] for all x ∈ X and all nonnegative integers t. This property can be obtained by choosing f from {F | α = 2ε + δ}. Second, if we further restrict the set of choices to {F | α = 2ε + δ, n = 1}, each function f is represented by five scalar parameters: α, β, γ, v (a scalar for n = 1), and n. Third, f is linear memory hard with respect to m and ε. This property is shown by the following theorem for the specific case in which α = 2ε + δ and n = 1. Additionally, the theorem demonstrates that there is a linear memory hard element f in the set {F | α, n} for any α > 2ε and any positive integer n.

Theorem 5 Let positive integers n and m be given such that m ≥ n. Let ε > 0. Let α > 2ε. There exists a function f ∈ {F | n, α} which is linear memory hard at all t ≥ m with respect to m and ε.

Since the elements of F are clearly continuous, causal, time-invariant, approximately finite memory functions, the theorems in [2] show that for each element of F there is some arbitrarily complex TDNN or focused gamma network that approximates it to any given tolerance. However, as previously discussed, for any such network to be feasible, its state vector dimensionality must be limited by some positive integer m. We have now shown that for each such m, there is a subset of F consisting of simple functions in five parameters with an arbitrary dynamic range [0, α] which cannot be approximated to any tolerance less than α/2 by any m-limited linear memory structure. Obviously any such function can be approximated with a tolerance of exactly α/2 by a constant function. Therefore there are simple functions for which all feasible linear memory structures offer only infinitesimal improvements over constant functions. However, the theorems in this paper do not rule out the possibility that there are feasible nonlinear memory structures that avoid such limitations. In the next section, we shall discuss two-stage structures where the memory stage is composed of elements of F. These structures have the same universal approximation capabilities as those shown for TDNNs and focused gamma networks, but are also able to efficiently realize the elements of F and are therefore useful for modeling functions like those discussed in Theorem 3.
3 Nonlinear Memory Universal Approximators

As previously discussed, the elements of F are template matching functions with approximately finite memory. As evidenced by Theorem 4, this is a particularly powerful memory structure. Let F_ext be the superset of F in which the case of β = 0 is also included. Consider the set K1 of functions g of the form

(gx)(t) = Σ_{i=1}^{m} w_i (f_i x)(t)
where m is a positive integer, the w_i are real constants, and the f_i are elements of F_ext. Observe that the subset of K1, {K1 | β = 0}, is the same as the set of structures with tapped delay line memory stages followed by a radial basis function (RBF) network as a
memoryless stage. This set of structures has been previously shown to be a universal approximator [2]. Since K1 is a superset of this set of structures, it is clear that elements of the set K1 can approximate arbitrarily well any continuous, causal, time-invariant, approximately finite memory mapping from X to U. Additionally, any element of F with k parameters can be represented by an element of {K1 | m = 1} having k + 1 parameters. Therefore F is efficiently represented by K1. Unfortunately, the relatively large number of memory parameters may make it difficult to find an element of K1 which approximates a particular mapping. For this reason consider the set {F | ·, ·} and a set D_m of universal approximation functions from