Some Applications of Functional Networks in Statistics and Engineering

Enrique CASTILLO and José Manuel GUTIÉRREZ
Department of Applied Mathematics and Computational Statistics, University of Castilla-La Mancha, 13071 Ciudad Real, Spain ([email protected])
Department of Applied Mathematics and Computational Statistics, University of Cantabria, 39005 Santander, Spain ([email protected])

Beatriz LACRUZ and Ali S. HADI
Department of Statistical Methods, University of Zaragoza, 50009 Zaragoza, Spain ([email protected])
Department of Statistical Sciences, Cornell University, Ithaca, NY 14853-3901 ([email protected])
Functional networks are a general framework useful for solving a wide range of problems in probability, statistics, and engineering applications. In this article, we demonstrate that functional networks can be used for many general purposes, including (a) solving nonlinear regression problems without the rather strong assumption of a known functional form, (b) modeling chaotic time series data, (c) finding conjugate families of distribution functions needed for the application of Bayesian statistical techniques, (d) analyzing the problem of stability with respect to maxima operations, which is useful in the theory and applications of extreme values, and (e) modeling the reproductivity and associativity laws that have many applications in applied probability. We also give two specific engineering applications—analyzing the Ikeda map with parameters leading to chaotic behavior and modeling beam stress subject to a given load. The main purpose of this article is to introduce functional networks and to show their power and usefulness in engineering and statistical applications. We describe the steps involved in working with functional networks, including structural learning (specification and simplification of the initial topology), parametric learning, and model-selection procedures. The concepts and methodologies are illustrated using several examples of applications.

KEY WORDS: Alternating conditional expectation (ACE); Bayesian statistics; Beam example; Conjugate distributions; Functional equations; Generalized additive models (GAM); Maximum stability; Minimum description length measure (MDL); Multivariate adaptive regression splines (MARS); Neural networks; Reproductive distributions; Stable families.
Neural networks have received a great deal of recognition in recent years (e.g., see Freeman and Skapura 1991; Hertz, Krogh, and Palmer 1991; Anderson and Rosenfeld 1988; Rumelhart and McClelland 1986, and the references therein). Many examples of engineering and other applications have been presented to show their wide applicability (e.g., see Bishop 1997; Ripley 1996; Swingler 1996; Allen 1995; Miller, Sutton, and Werbos 1995; Azoff 1994; Cichocki and Unbehauen 1993; Skrzypek and Karplus 1996; Lisboa 1992; Myers 1992). Neural networks consist of one or several layers of neurons connected by links. Each neuron computes a scalar output from a linear combination of inputs coming from the previous layer, using a given scalar function. The difference between two neurons can be due to either the number of input components or their associated weights. Since the neural function is given, only the weights are learned, using well-known learning methods.

Functional networks and neural networks have a similar structure, but they also have the following important differences:
1. The selection of the initial topology of the functional network is normally based on the properties of the problem at hand (i.e., problem-driven design). This initial topology can be further simplified using functional equations. In neural networks, several topologies are considered and the one that satisfies an optimality criterion is selected.
2. Functional networks incorporate different neural functions, which are not restricted to be linear combinations of inputs.
3. Neural functions can be multidimensional.
4. Neural functions can be learned (either exactly or approximately).
5. Neuron outputs can be connected (i.e., they can be forced to be coincident), which is not available in neural networks.
© 2001 American Statistical Association and the American Society for Quality, TECHNOMETRICS, February 2001, Vol. 43, No. 1
These differences make functional networks a useful generalization and extension of standard neural networks (see Gómez-Nesterkín 1996; Castillo, Cobo, Gutiérrez, and Pruneda 1998; Castillo, Cobo, Gómez-Nesterkín, and Hadi 1999 for details). The main purpose of this article is to introduce functional networks and to show their power and usefulness in probability, statistics, and engineering applications.

The rest of the article is organized as follows. Section 1 describes functional networks using motivating examples, showing the wide applicability of functional networks for solving problems in several areas of statistical modeling. Section 2 describes the main steps needed for working with functional networks. Section 3 discusses the problems of exact and approximate learning in functional networks. Section 4 presents model-selection procedures in functional networks. Section 5 illustrates the methodology using some examples from statistical modeling in general and the engineering area in particular. Section 6 shows the relationship between functional networks and three other statistical methods—alternating conditional expectation (Breiman and Friedman 1985), multivariate adaptive regression splines (Friedman 1991), and generalized additive models (Hastie and Tibshirani 1990). Finally, concluding remarks are given in Section 7.
1. MOTIVATION AND DESCRIPTION OF FUNCTIONAL NETWORKS
Functional networks have a wide range of applications in different areas. In this section, we motivate and describe functional networks by some of these examples.

Example 1: Conjugate Family of Distributions. Suppose that a random variable X belongs to a parametric family of distributions with likelihood function L(x; θ), where θ ∈ Θ is a possibly vector-valued parameter. In Bayesian statistics, a classical problem is to find a parametric family of probability density functions F(θ; η) with hyperparameter η so that both the prior probability density function F(θ; η) and the posterior probability density function F(θ; G(x; η)) belong to the family. Bayes's theorem guarantees that the posterior density is proportional to L(x; θ)F(θ; η), which leads to the functional equation

F(θ; G(x; η)) = H(x; θ) F(θ; η),    (1)

where H(x; θ) = h(x)L(x; θ), h(x) is a function of x, and G specifies the value of the new hyperparameter as a function of the sample value x and the old hyperparameter value η. Here we have three functions, H, F, and G, each of which takes inputs and produces outputs, but the outputs are subject to the constraint given by (1); that is, the function F on the left side of (1) must be equal to the product of the functions H and F on the right side of (1). This suggests the functional network in Figure 1. The identity function I and the product function "×" are needed only for convenience, as will be explained.
Figure 1. Functional Network Associated With the Bayesian Conjugate Distributions Problem.
As can be seen in Figure 1, a functional network consists of the following elements:

1. Several layers of storing units. These units are represented by small filled circles. They are used for storing input, output, and intermediate information. In Figure 1, for example, we have three layers of storing units: the first (left) layer contains three input units (x, θ, and η), the second layer consists of four intermediate units, and the third (right) consists of one output unit (u).

2. One or more layers of neurons (computing or functional units). These neurons are represented by open circles with the name of the unit inside the circle. A neuron is a computing unit that evaluates a set of input values and returns a set of output values to the next layer of storing units. Thus, each neuron represents a function. The functional network in Figure 1, for example, has two layers of neurons. The first layer consists of four neurons—H, G, I, and F. The functions H, F, and G are defined in (1). For example, the neuron H takes two inputs, x and θ, and produces an output H(x; θ), which is stored in the corresponding intermediate unit. The function I is the identity function, which is created here for convenience to be used by the neuron F in the second layer of neurons. The second layer has two neurons, "×" and F. The function "×" represents the product of H and F on the right side of (1). Note that both "×" and F must give identical output, which is represented by the output unit u. This is indicated by two arrows converging into u.

3. A set of directed links. The computing units are connected to the storing units by directed arrows. The arrows indicate the direction of information flow. Once the input values are given, the output is determined by a function. For example, the neuron H has two inputs, x and θ, and one output, H(x; θ). Converging arrows to an intermediate or output unit indicate that the neurons (functions) from which they emanate must produce identical outputs. This is an important feature of functional networks that is not available in neural networks. Note that constraints such as those in (1) arise from physical and/or theoretical characteristics of the problem at hand.
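The elements just described map directly onto a small data structure. The following sketch is ours, not code from the paper; the functions F, G, and H are illustrative placeholders, chosen only so that the constraint (1) holds exactly (F(θ, η) = exp(θη), H(x, θ) = exp(θx), G(x, η) = η + x). It shows the two neuron layers of Figure 1 and the convergence constraint on the output unit u.

```python
# Toy encoding of the functional network in Figure 1 (our own sketch).
import math

F = lambda theta, eta: math.exp(theta * eta)      # "prior/posterior" family (placeholder)
H = lambda x, theta: math.exp(theta * x)          # h(x) * likelihood (placeholder)
G = lambda x, eta: eta + x                        # hyperparameter update (placeholder)
I = lambda eta: eta                               # identity neuron

def evaluate(x, theta, eta):
    """Propagate the inputs through the two neuron layers of Figure 1."""
    # first layer of neurons -> intermediate storing units
    u_H, u_G, u_I = H(x, theta), G(x, eta), I(eta)
    # second layer: the product neuron and the F neuron both write into u
    u_from_product = u_H * F(theta, u_I)          # right side of (1)
    u_from_F = F(theta, u_G)                      # left side of (1)
    return u_from_product, u_from_F

a, b = evaluate(x=0.7, theta=1.3, eta=0.4)
print(a, b, abs(a - b) < 1e-12)   # converging arrows: both values must coincide
```

For a family that is not conjugate, the two values written into u would disagree; imposing their coincidence is exactly the functional equation (1).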
Figure 4. Functional Network Associated With the Maximum Stability Problem: (a) Initial Network; (b) Equivalent Simplified Network.

Figure 2. An Example of a Functional Network Showing the Set of Nodes (printed circuit board) and Four Functional Units (electronic components).
Example 2: Printed Circuit Board. A functional network is analogous to a printed circuit board (PCB). To illustrate, consider the PCB depicted in Figure 2, where the microcircuits or electronic components are shown. The set of nodes is the PCB, and the functional units are microcircuits. This example gives an intuitive interpretation of the elements of a functional network. For example, there are seven input units, {a, b, c, d, e, f, g}; one intermediate unit, {h}; two output units, {i, j}; and four functional units (neurons), {K, L, M, N}. The function K, for example, takes a and b as input and produces h as output. If K is given, then h is a consequence of the values a and b; otherwise, it can be learned from a sequence of data values (a, b, h)_t, t = 1, 2, …, m. This is an interesting problem with many applications. Learning is considered in Section 3. Note that N gives two outputs, i and j, and that the i output of N must be identical to the output of M.

Example 3: Associativity. Suppose that we have a function F(x₁, x₂) = x₁ ∘ x₂ and we do not have any information about the form of the function, but we know that it is associative; that is, the function F satisfies

F(F(x, y), z) = F(x, F(y, z));    (2)

that is,

(x ∘ y) ∘ z = x ∘ (y ∘ z).    (3)

This suggests the initial network topology shown in Figure 3(a). As we shall see in Section 2, this functional network can be simplified to obtain the equivalent network in Figure 3(b).
Figure 3. Illustration of the Associativity Functional Framework: (a) Initial Network; (b) Equivalent Simplified Network.

Example 4: Stability With Respect to Maxima Operations. Let X and Y be two independent random variables with cumulative distribution functions (CDF) R(x; a) and S(y; b), respectively. Then the CDF of the random variable Z = max(X, Y) is T(z; a, b) = R(z; a)S(z; b). If we wish X, Y, and Z to be stable with respect to maxima operations (i.e., to belong to the same parametric family), we must have

F(z; G(a, b)) = F(z; a) F(z; b),    (4)

where the function G gives the value of the parameter of Z as a function of the parameters a and b of X and Y, respectively. Equation (4), which establishes the stability of the family F(z; a) with respect to maximum operations, suggests the functional network in Figure 4(a), where the functions I and "×" represent the identity function and the product operator, respectively. As we shall see in Section 3, this functional network can also be simplified to the one in Figure 4(b).

Example 5: Reproductive Families of Distributions. A family of random variables is said to be reproductive under convolution if the sum of independent random variables of the family belongs to the family—that is, if the family is closed under sums. Since the characteristic function of the sum of two independent random variables is the product of the characteristic functions of the summands, reproductivity for uniparametric families can be written as the functional equation

φ(t; G(x, y)) = φ(t; x) φ(t; y),    (5)

where the characteristic function φ is a complex function of two real variables. The function G(x, y) is a real function of two real variables and shows how the parameter of the sum can be obtained as a function of the parameters of the two random variables being added. Equation (5) suggests a functional network similar to that shown in Figure 4, but with different symbols [note the similarity between (4) and (5)].

Example 6: Semiparametric Regression Models. Functional networks include nonlinear regression models as special cases. The general semiparametric regression model is of the form

h(y) = f(x₁, …, x_q; β₀, β₁, …, β_p) + ε,    (6)
where h(·) and f(·) are unknown functions; f depends on β₀, …, β_p, which are the unknown parameters of the model; y is the response variable; x₁, …, x_q are the independent variables; and ε is a random error. Functional networks can be used to represent nonlinear regression without the assumption that the functions h(·) and f(·) in (6) are known. The functional equation in (6) can be represented by the functional network in Figure 5. Further details are given in Section 5.1, where an illustrative example is presented.

Figure 5. A Powerful Functional Network Useful for Regression Models.

2. WORKING WITH FUNCTIONAL NETWORKS

The previous section describes the topological elements of functional networks and shows the wide applicability of functional networks in several areas of statistical modeling. This section gives the details of how functional networks operate. Working with functional networks requires the following steps:

Step 1: Selection of the Initial Topology. As can be seen from the preceding examples, the selection of the initial topology of a functional network is normally based on the characteristics of the problem at hand, which usually lead to a single clear network structure (problem-driven design).

Step 2: Simplifying Functional Networks. The initial functional network may be simplified using functional equations. Given a functional network, an interesting problem is how to determine whether or not there exists another functional network giving the same output for every given input. This leads to the concept of equivalent functional networks. Two functional networks are said to be equivalent if they give the same output for any given input. The practical importance of the concept of equivalent functional networks is that we can define equivalence classes of functional networks—that is, sets of equivalent functional networks—and then choose the simplest in each class to be used in applications. Functional equations are the main tool for simplifying functional networks [for a general introduction to functional equations and methods to solve them, see Aczél (1966) and Castillo and Ruiz-Cobo (1992)]. We illustrate the simplification of functional networks with two of the examples presented in Section 1.

1. Associativity. Suppose we wish to find a functional network that reproduces the associative operation. Aczél (1966) showed that any associative operation, such as (3), can be written in the form

F(x, y) = f⁻¹[f(x) + f(y)],    (7)

where f(x) is an invertible function. Then both sides of (2) or (3) can be written as f⁻¹[f(x) + f(y) + f(z)]. Thus, the use of the functional network in Figure 3(b), which represents the equivalent simplified network as an alternative to the functional network in Figure 3(a), is justified.

2. Stability with Respect to Maxima Operations. It can be shown that a particular solution of the functional equation (4) is

F(x; y) = f(x)^h(y),   G(x, y) = h⁻¹[h(x) + h(y)],    (8)

where f(x) is an arbitrary cumulative distribution function and h is an arbitrary positive invertible function. Substituting (8) in (4), both sides of (4) can be written as f(z)^(h(a)+h(b)). In this particular case the functional network in Figure 4(b) is obtained, where H(h(a), f(z), h(b)) = f(z)^(h(a)+h(b)).
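As a quick numerical illustration of the second simplification, the following short check (ours, not from the paper) verifies that the particular solution (8) satisfies the max-stability equation (4). The concrete choices f(z) = exp(−exp(−z)) (a Gumbel-type CDF) and h(a) = a are ours and purely illustrative.

```python
# Numerical check that the particular solution (8) satisfies equation (4).
import math

f = lambda z: math.exp(-math.exp(-z))   # an arbitrary cdf
h = lambda a: a                         # an arbitrary positive invertible function
h_inv = lambda v: v

F = lambda z, a: f(z) ** h(a)           # the family F(z; a) from (8)
G = lambda a, b: h_inv(h(a) + h(b))     # the parameter of Z = max(X, Y)

for z, a, b in [(0.3, 1.5, 2.0), (-1.0, 0.7, 3.2), (2.4, 0.2, 0.9)]:
    lhs = F(z, G(a, b))                 # cdf of the maximum
    rhs = F(z, a) * F(z, b)             # product of the two cdf's
    print(abs(lhs - rhs) < 1e-12)
```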
Step 3: Uniqueness of Representation. Before learning a functional network, we need to be sure that there is a unique representation of it. In other words, for a given topology (structure), in some cases there are several sets of neurons (functions) leading to exactly the same output for any input. To avoid estimation problems, we need to know what conditions must hold for uniqueness. We illustrate this step by two examples:

1. Associativity. To test whether or not there are several functional networks associated with (7)—that is, two functions f₁(·) and f₂(·) such that they give the same output for the same input—we write the functional equation

f₁⁻¹[f₁(x) + f₁(y)] = f₂⁻¹[f₂(x) + f₂(y)]   for all x, y,    (9)

the general solution of which is

f₂(x) = c f₁(x).    (10)

This means that any value of c leads to the same solution or, in other words, there exist many pairs f₁(x), f₂(x) such that Equation (9) holds. Thus, the corresponding functional networks are equivalent. The constant c is not identifiable unless one extra condition is required; for example, one can force f(x₀) = y₀ to have uniqueness.

2. Stability with respect to maxima operations. Suppose that there are two pairs {f₁(x), h₁(y)} and {f₂(x), h₂(y)} such that f₁(x)^h₁(y) = f₂(x)^h₂(y) holds. Then, we have

h₁(y)/h₂(y) = log f₂(x)/log f₁(x) = k,    (11)

where k is an arbitrary constant. Then, we have

h₁(y) = k h₂(y)   and   f₁(x) = f₂(x)^(1/k).    (12)

Thus, we simply need to fix the value of any one of the functions f₁(x), f₂(x), h₁(y), or h₂(y) at a single point for uniqueness.

Step 4: Structural and Parametric Learning. The structure (topology) of the functional network is learned in Step 1, where the physical problem to be solved (problem-driven design) suggests the initial topology. As a second tool, we have used functional equations to obtain a simplified network (this can also be understood as learning). Once the structure is selected, it is necessary to learn the neural functions (parametric learning). Section 3 considers two types of learning of the neural functions, exact and approximate learning.
Step 5: Model Validation. When approximate learning is considered, the test for quality and/or the cross-validation of the model is performed. Checking the obtained error is important to see whether or not the selected family of approximating functions is adequate. A cross-validation of the model, with an alternative set of data, is also important; this also allows us to detect overfitting problems.

As can be seen from the preceding examples, the main differences between functional networks and standard neural networks include the following:

1. The topology of a neural network is chosen from among several topologies using an optimality criterion. In functional networks the initial topology is given by the properties of the problem at hand (problem-driven), and it can be simplified using functional equations.

2. In standard neural networks, the neural functions are given and some weights associated with the links or connections are learned. In functional networks, specification of the neural functions is not required because the neural functions can be learned from data, as we shall see in the next section.

3. In standard neural networks, all the neural functions are identical, univariate, and single-argument, where the single argument is a weighted sum of input values. In functional networks, the neural functions can be different, multivariate, and/or multiargument.

4. In functional networks, we can connect outputs of different neurons to force them to coincide. To this end, apart from output layers, we can use intermediate layers of units, which are not neurons but are units storing intermediate information. These intermediate layers allow the network to connect more than one neuron output to the same unit, indicating coincidence of the outputs of the corresponding neurons (constrained outputs). This structure is not possible in standard neural networks because there are no intermediate layers and these connections are not allowed. If an output unit is connected to more than one neuron in the previous layer, say to m neurons, it is possible to write its value in m different forms, which must coincide; thus, a set of m − 1 functional equations, derived directly from the topology of the network, is obtained.

In the past, some proponents of neural networks have advertised the method as a universal solution to many statistical problems that can be easily applied without skill, experience, or much thought. This claim is not true because the choice of an appropriate architecture is crucial to success and some care is necessary in the selection and estimation of various model parameters. Perhaps the most crucial step in any statistical analysis is the selection or construction of a model or family of models appropriate to the particular data problem at hand. It is not uncommon in practice that statistical models are chosen without any justification other than goodness of fit. One of the main advantages of functional networks is that a functional network structure is required, and this forces the statistician to look for structures suggested by the problem at hand. For example, choosing a family of conjugate distributions, under a Bayesian point of view, requires the family of prior-posterior models to be independent of the level of information we have. In other words, no matter in what step of the information process we are, the family of models must remain the same. Similarly, when studying maxima or sums of independent variables, the maximum-stability or sum-stability properties give rise to the selection of a maximum-stable or a reproductive family, respectively.

Another advantage of functional networks is that the simplifying Step 2 indicates that functional equations are closely related to functional networks and must be systematically used to simplify the model structure. This leads to the discovery of statistical models in their simplest form. If the resulting functional equations cannot be solved, however, the initial functional network can also be used to solve the problem, but at the cost of a higher computational effort.

3. PARAMETRIC LEARNING IN FUNCTIONAL NETWORKS

3.1 Exact Learning
Exact learning consists of identifying the functions that are solutions of the functional equation represented by the functional network.

Example 7: Exact Learning of an Associative Operation. Assume that we look for a functional network to reproduce the associative operation x ∘ y = h(x, y) = xy. The structure that represents this associative operation is shown in Figure 6(a), which is obtained from (7). Exact learning consists of identifying the corresponding neural function f. To this end, we can add one more functional unit to the preceding functional network to obtain the functional network in Figure 6(b), where h(x, y) = xy. This forces the function f to satisfy the functional equation

f⁻¹[f(x) + f(y)] = xy  ⟺  f(xy) = f(x) + f(y),    (13)

which is a Cauchy equation with solution

f(x) = c log x  ⟺  f⁻¹(x) = exp(x/c),    (14)

where c is an arbitrary constant. Thus, any value of c ≠ 0 leads to the multiplication operation. Once f is known, we can remove the functional unit h in Figure 6(b) to recover the functional network in Figure 6(a). The resulting functional network reproduces exact values for the operation ∘. Thus, we say that the learning is exact. If the h function in Figure 6(b) is unknown or we do not know the solution of the associated functional equation, we can collect a set of data to approximately learn the f function using the methods given in Section 3.2.

Figure 6. (a) Associative Functional Network; (b) Learning the Product Operation.
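The exactness of this solution is easy to verify numerically. The following one-screen check is ours (not from the paper); the value of c is arbitrary, as (14) states.

```python
# Exact learning of Example 7: with f(x) = c*log(x), the simplified network
# f^{-1}[f(x) + f(y)] reproduces the product x*y exactly, for any c != 0.
import math

c = 2.5
f = lambda x: c * math.log(x)
f_inv = lambda t: math.exp(t / c)

for x, y in [(2.0, 3.0), (0.4, 7.1), (11.0, 0.25)]:
    print(f_inv(f(x) + f(y)), x * y)    # the two columns coincide
```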
3.2 Approximate Learning
Approximate learning consists of estimating the neural functions based on the given data. This type of learning is done by considering linear combinations of appropriate functional families and using an optimization method to obtain the optimal coefficients. There are two different types of parametric learning methods:

1. The Linear Method: This method is called linear because the associated optimization function leads to a system of linear equations in the parameter estimates. In this case, a single optimum exists and the learning process reduces to solving a system of linear equations.

2. The Nonlinear Method: This method leads to a function that is nonlinear in the parameters. In this case there may exist multiple optima, and the optimization process can be carried out by considering some standard gradient descent/ascent method.

Example 8: Approximate Learning of an Associative Operation. Assume that we have a set of available data D = {(x_t, y_t, (x ∘ y)_t) : t ∈ T}. Then, we can approximate the f function by a linear combination f̂(x) = Σ_{i=1}^k c_i g_i(x) of a set of k functions {g₁(x), …, g_k(x)}. We need a discrepancy measure such as the sum of squared errors. However, other discrepancy measures can be used, such as the minimax measure (see Castillo, Cobo, Gutiérrez, and Castillo 2000). From (13), the error for the data point t is

e_t = f((x ∘ y)_t) − f(x_t) − f(y_t) = Σ_{i=1}^k c_i [g_i((x ∘ y)_t) − g_i(x_t) − g_i(y_t)].

Then, to estimate {c_i; i = 1, …, k}, we can minimize, with respect to the c_i, the sum of squared errors

Q = Σ_{t∈T} { Σ_{i=1}^k c_i [g_i((x ∘ y)_t) − g_i(x_t) − g_i(y_t)] }².    (15)

From this optimization process, we obtain the c_i values—that is, an estimate of the f function. Since in this case the estimated function f̂ does not reproduce the operation exactly, we say that we have an approximate learning.

In general, the approximate learning process consists of selecting the "best" set of k functions {g₁(x), …, g_k(x)} (possibly vector valued) in a given class C, using an available set of data D = {(x_input^(t), x_output^(t)) : t ∈ T} and a discrepancy measure d(D; D̂), where D is the observed data and D̂ are the resulting values obtained from the selected functional network at the same input values. As an example of a discrepancy measure, we can use

d(D; D̂) = Σ_{t∈T} || f̂(x_input^(t); c₁, …, c_k) − x_output^(t) ||²,    (16)

where || · || is a norm in X_output and f̂ is the processing function depending on the parameters c₁, …, c_k. In the next section, we use the minimum description length (MDL) measure, which allows comparing not only the quality of the different approximations but also different functional networks.

Note that the neural functions can be totally, partially, or not at all defined. If they are totally defined, we have no need for learning. If they are partially or totally undefined, we need partial or total learning, respectively. This implies that we can use, as in the case of standard neural networks, neurons with known neural functions (they need not be learned).

4. MODEL SELECTION IN FUNCTIONAL NETWORKS

As we have seen in Section 3, to learn the resulting functional network we can choose different sets of linearly independent functions to approximate its neuron functions. Since we can try different sets of functions, we need a model-selection method to choose the best model according to some criterion of optimality. The problem of model selection has been extensively analyzed from different points of view (e.g., see Akaike 1973; Atkinson 1978; Lindley 1968; Stone 1974). Here we use the MDL measure (see Elias 1975; Rissanen 1983, 1989; Castillo, Gutiérrez, and Hadi 1997), which allows comparing not only the quality of the different approximations but also different functional networks.

The idea behind the MDL measure is to look for the minimum information required to store the given dataset using the model. To this end, let us define the code length L(x) of x as the amount of memory needed to store the information x. For instance, to store the data in the associative-operation example, we have two options:

Option 1: Store raw data. Store the triplets {(x_t, y_t, x_t y_t) : t ∈ T}. In this case, the initial description length (DL) of the dataset is given by

DL = Σ_{t∈T} [L(x_t) + L(y_t) + L(x_t y_t)].    (17)

Option 2: Use a model. By selecting a model, we try to reduce this length as much as possible. In this case, we can store the inputs {(x_t, y_t) : t ∈ T}, the parameters of the model {c_i : i ∈ I}, and the residuals

e_t = f̂(x_t y_t) − f̂(x_t) − f̂(y_t),   t ∈ T,    (18)

where f̂ is the approximate neuron function for the model. In this case, the DL becomes

DL_model = Σ_{t∈T} [L(x_t) + L(y_t)] + Σ_{i∈I} L(c_i) + Σ_{t∈T} L(e_t | model).    (19)
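Before turning to the code-length details, the following sketch (ours, not the authors' code) shows how the coefficients c_i of Example 8 can be estimated from criterion (15) and how the residuals e_t of (18) are obtained. Because Q in (15) has the trivial minimum c = 0, some normalization is needed; as the uniqueness discussion in Section 2 suggests, we pin f down at one point, and the particular choice f̂(e) = 1 and the basis {1, x, x², x³} are purely illustrative assumptions.

```python
# Approximate learning of the associative operation (Example 8) by constrained
# linear least squares: minimize ||A c||^2 subject to f_hat(e) = 1.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 5.0, 200)
y = rng.uniform(1.0, 5.0, 200)
xy = x * y                                       # data for the operation x o y = x*y

basis = [lambda t: np.ones_like(t), lambda t: t, lambda t: t**2, lambda t: t**3]
A = np.column_stack([g(xy) - g(x) - g(y) for g in basis])       # rows give e_t
g0 = np.array([g(np.array([np.e]))[0] for g in basis])          # basis at x0 = e

k = len(basis)                                   # KKT system for the constrained problem
K = np.zeros((k + 1, k + 1))
K[:k, :k] = 2 * A.T @ A
K[:k, k] = g0
K[k, :k] = g0
rhs = np.zeros(k + 1); rhs[k] = 1.0
c = np.linalg.solve(K, rhs)[:k]                  # estimated coefficients c_i

fhat = lambda t: sum(ci * g(t) for ci, g in zip(c, basis))
print("largest residual |e_t|:", np.abs(A @ c).max())
print("fhat(e) =", fhat(np.array([np.e]))[0])    # equals 1 by construction
```

The residuals A c are exactly the e_t stored under Option 2, and their small range relative to the raw data x_t y_t is what makes the model-based description shorter.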
Note that, since the ranges of the residuals e_t are smaller than the range of the data x_t y_t, the extra effort of storing the parameters c_i in addition to the residuals e_t is compensated by the savings in storing x_t y_t. In addition, since the DL can be calculated for any model, the DL measure does not care about which model or which dimension is used. This makes the MDL a convenient method for solving the model-selection problem. Accordingly, the best functional network model for a given problem is the one with the smallest description length.

The DL associated with a given model requires knowledge of the code length of a given number, integer or real:

1. Code length of an integer: The code length of an integer n can be approximated by log₂(n).

2. Code length of a real: We can use two options:

Option 1: Let x = x₁ + x₂ be a real number decomposed into its integer part x₁ and its fractional part x₂. The integer n₁ = x₁ and an integer approximation n₂ of x₂ are stored separately. Then, the code length L₁(x) is the sum of the code lengths of both integer parts.

Option 2: We compute the code length of a real as follows. If |x| < 10^(−q), q ∈ {1, 2, …}, we store a 0 and then we have L₂(x) = 1. Otherwise, we store the integer part of |x|10^q, and then we have L₂(x) = ⌊log₂(|x|10^q)⌋ + 1, where q is a fixed integer related to the selected precision and ⌊·⌋ denotes the integer part. The basic idea of this option is that we store the real number x with q decimal figures. This means that we store the integer ⌊|x|10^q⌋, which requires ⌊log₂(|x|10^q)⌋ + 1 bits.

Suppose we know that the data come from a model in the family {f_m(x|θ) : θ ∈ Θ, m ∈ M}, where θ = (θ₁, …, θ_k), Θ is the parameter space, and M is the model space. Suppose also that the model f_m(x|θ) has associated prior probability π_m(θ). According to Rissanen (1989, p. 55), if θ̂ is estimated from all data points, the sample size is large, and the errors are normal, then the description length, using Option 1 for storing real numbers, becomes

L₁ = −log π_m(θ̂) + (k/2) log n + (n/2) log[ (1/n) Σ_{j=1}^n e_j(θ̂)² ],    (20)

where k is the number of parameters of the model. This MDL measure depends on both the data and the model and consists of the sum of three terms. The first term represents the quality of the prior distribution given by the human expert, the second is a penalty for using complex models (those with a large number of parameters), and the third represents the quality of the fitted model (the smaller the errors, the better the model). An alternative to (20), which consists of using Option 2, leads to

L₂ = Σ_{i∈I} L₂(ĉ_i) + Σ_{t∈T} L₂(e_t | model),    (21)

where the first term is a penalty for the number of parameters and the second penalizes the errors.

Therefore, the measures in (20) and (21) allow one to compare different sets {g₁(x), …, g_k(x)} of linearly independent approximating functions and to select which of the functions in {g₁(x), …, g_k(x)} contribute more to the quality of the model and which can be removed. Computer implementations of model-selection methods include the following:

1. The exhaustive method: This method calculates the description length for all possible functional networks and all possible subsets of the approximating functions and chooses the one leading to the smallest value. A clear disadvantage of this method is that it requires a lot of computational power.

2. The forward-backward method: This method starts with all models of a single parameter and selects the one leading to the smallest description length. Next, it incorporates one more parameter with the same criterion, and the process continues until no improvement can be obtained by adding an extra parameter to the previous model. Then the inverse process is applied; that is, the parameter leading to the smallest description length is sequentially removed until no improvement is possible. This double process is repeated until no further improvement is obtained either by adding or removing a single variable.

3. The backward-forward method: This method starts with the model with all parameters and sequentially removes the one leading to the smallest description length, repeating the process until there is no improvement. Next, the forward process is applied, but starting from this model. The double process is repeated until no further improvement is obtained either by removing or adding a single variable.

Note that the MDL measure allows comparing not only subsets of a given base system but also two or more base systems. To this end, we need to calculate the MDL values associated with all the candidate base systems and select the one leading to the smallest value. In fact, in the example in Section 5 we have considered several options and selected the best candidate.
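The forward-backward search can be written generically in a few lines. The sketch below is ours, not the authors' implementation: `candidates` would be the set of basis functions (or labels for them) and `score(subset)` would fit the functional network with that subset and return a description-length-like criterion such as (20) or (21), to be minimized.

```python
# Schematic forward-backward model selection driven by a description-length score.
def forward_backward(candidates, score):
    current, best = set(), float("inf")
    improved = True
    while improved:
        improved = False
        # forward pass: add the single best term while the score keeps dropping
        while candidates - current:
            s, c = min(((score(current | {c}), c) for c in candidates - current),
                       key=lambda t: t[0])
            if s >= best:
                break
            best, improved = s, True
            current.add(c)
        # backward pass: remove the single best term while the score keeps dropping
        while len(current) > 1:
            s, c = min(((score(current - {c}), c) for c in current),
                       key=lambda t: t[0])
            if s >= best:
                break
            best, improved = s, True
            current.remove(c)
    return current, best
```

The backward-forward variant simply starts from the full candidate set and runs the two passes in the opposite order.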
5. SPECIFIC STATISTICAL AND ENGINEERING APPLICATIONS OF FUNCTIONAL NETWORKS

In this section we give several examples of applications of functional networks in statistical modeling in general and in the engineering area in particular.

5.1 Nonlinear Regression
5.1.1 Initial Topology. The data in Table 1 are simulated from the model

y_i³ = x_{1i} + x_{1i}² + x_{2i}² + x_{2i}³ + ε_i,   i = 1, 2, …, n,    (22)

where ε_i is a uniform U(−0.005, 0.005) random variable and x₁ and x₂ have been simulated from a uniform U(0, 1) population. Consider the functional network model

f₃(y_i) = f₁(x_{1i}) + f₂(x_{2i}) + ε_i,   i = 1, 2, …, n.    (23)

Then, the functional equation f₃(y) = f₁(x₁) + f₂(x₂) leads to the functional network in Figure 7; that is, y = F(x₁, x₂) = f₃⁻¹[f₁(x₁) + f₂(x₂)].

5.1.2 Simplification of the Model. Since no arrows converge to storing units in the network, no simplification is possible in this case.
Table 1. Observed Data Points

  x1      x2      y        x1      x2      y
0.010   0.428   0.648    0.580   0.187   0.985
0.696   0.866   1.370    0.310   0.906   1.260
0.305   0.296   0.798    0.646   0.575   1.170
0.191   0.635   0.959    0.820   0.340   1.180
0.007   0.392   0.609    0.724   0.820   1.350
0.229   0.858   1.180    0.884   0.323   1.220
0.205   0.588   0.925    0.242   0.630   0.981
0.889   0.608   1.320    0.232   0.549   0.909
0.606   0.840   1.310    0.580   0.786   1.260
0.396   0.722   1.130    0.762   0.073   1.110
0.071   0.174   0.488    0.592   0.441   1.070
0.104   0.984   1.270    0.889   0.554   1.290
0.222   0.050   0.653    0.100   0.436   0.726
0.534   0.379   1.000    0.642   0.461   1.110
0.089   0.468   0.748    0.953   0.648   1.370
0.364   0.969   1.330    0.458   0.810   1.230
0.599   0.408   1.060    0.189   0.162   0.630
0.346   0.810   1.180    0.363   0.885   1.250
0.913   0.894   1.480    0.311   0.264   0.792
0.053   0.342   0.596    0.090   0.243   0.550
5.1.3 Uniqueness of Representation. Suppose that there exist two different triplets of functions {f₁, f₂, f₃} and {g₁, g₂, g₃} such that f₃⁻¹[f₁(x₁) + f₂(x₂)] = g₃⁻¹[g₁(x₁) + g₂(x₂)]. This is a functional equation with the following general solution:

f₁(x₁) = a g₁(x₁) + b
f₂(x₂) = a g₂(x₂) + c
f₃⁻¹(y) = g₃⁻¹((y − b − c)/a).    (24)

Thus, to have uniqueness of solution we only need to fix the functions f₁, f₂, and f₃ at a point such that the system (24) has a unique solution.

5.1.4 Learning the Model. If we select the base {1, x, x², x³} for f₁(x), f₂(x), and f₃(x), using the least squares methods described previously, we get the complete model

f₁(x₁) = 0.33506 + 0.322369x₁ + 0.35208x₁² − 0.009507x₁³
f₂(x₂) = 0.33446 − 0.003721x₂ + 0.33065x₂² + 0.338607x₂³
f₃(y) = 0.65884 + 0.032249y − 0.03705y² + 0.345964y³,

with an RMSE = 6.60027 × 10⁻⁷ and L₂ = 418 (RMSE = root mean squared error). We have used the L₂ measure with q = 5.

The exhaustive method leads to the model

f₁(x₁) = 0.3341 + 0.3297x₁ + 0.3362x₁²
f₂(x₂) = 0.3341 + 0.3231x₂² + 0.3427x₂³
f₃(y) = 0.6674 + 0.3326y³,    (25)

with an RMSE = 7.61001 × 10⁻⁷ and L₂ = 380. However, we can use the MDL measure combined with the methods described in Section 4. In Table 2 we give the different steps for the forward-backward method, which leads to the same model as the exhaustive method. Note that once we have ended the forward process, the backward process does not allow an improvement of the description length. In Table 3 we give the different steps for the backward-forward method, which also leads to the same model. Substituting (25) in (23), we obtain

y³ = 0.0024 + 0.99x₁ + 1.01x₁² + 0.97x₂² + 1.03x₂³,

which very much recovers the model in (22) that was used to generate the data.

Figure 7. Functional Network Model for Regression.

Table 2. Approximating Functions, RMSE, and L₂ Obtained at Different Steps for the Forward-Backward Method

                 Approximating functions                      Quality measures
Step             {f₁(x)}, {f₂(x)}, {f₃(x)}                    RMSE          L₂
Forward
  1     {1}, {1}, {1, x³}                                     0.8598        735
  2     {1, x}, {1}, {1, x³}                                  0.0772        643
  3     {1, x}, {1, x³}, {1, x³}                              0.0008        554
  4     {1, x, x²}, {1, x³}, {1, x³}                          0.0002        519
  5     {1, x, x²}, {1, x³, x²}, {1, x³}                      7.61 × 10⁻⁷   380

Table 3. Approximating Functions, RMSE, and L₂ Obtained at Different Steps for the Backward-Forward Method

                 Approximating functions                      Quality measures
Step             {f₁(x)}, {f₂(x)}, {f₃(x)}                    RMSE              L₂
Backward
  1     {1, x, x², x³}, {1, x, x², x³}, {1, x, x², x³}        6.60027 × 10⁻⁷    418
  2     {1, x, x², x³}, {1, x², x³}, {1, x, x², x³}           6.65839 × 10⁻⁷    404
  3     {1, x, x²}, {1, x², x³}, {1, x, x², x³}               6.85212 × 10⁻⁷    398
  4     {1, x, x²}, {1, x², x³}, {1, x², x³}                  7.44881 × 10⁻⁷    386
  5     {1, x, x²}, {1, x², x³}, {1, x³}                      7.61001 × 10⁻⁷    380

5.2 Modeling Chaotic Time Series

One of the attempts to model chaotic time series has used neural networks (Attali and Pages 1997; Stern 1996). For example, Stern (1996) showed that a multilayer perceptron trained with the backpropagation algorithm outperforms standard autoregressive models in approximating a chaotic time series. In this section we present a new approach to this problem using functional networks. We describe the application of the MDL selection algorithm for modeling a chaotic map and reducing the noise contained in chaotic time series, using a nontrivial example to illustrate the performance of functional networks.
The Ikeda model is given by the following map (the physical origin of this map was given by Hammel, Jones, and Moloney 1985):

z_{n+1} = a + b z_n exp[ i ( c − d / (1 + |z_n|²) ) ],    (26)

where z = x + iy is a complex number. Taking real and imaginary parts, (26) can be viewed as a two-dimensional real map in which the variables x_n and y_n are coupled in a nontrivial way. In particular we shall consider the parameter values a = 1.0027, b = 0.9, c = 0.4, and d = 6.0, which lead to chaotic behavior. A time series plot of the Ikeda map for the initial condition z₀ = x₀ + i y₀ = 1 is given in Figure 8. This figure shows the seemingly stochastic dynamics of the system, which is due to the sensitivity to initial conditions that is a typical characteristic of chaotic maps. However, the phase portrait of the time series shows the deterministic structure of the underlying deterministic dynamical system (see Fig. 9).

Figure 8. Time Series of the First 300 Points of the Real (x_n) and Imaginary (y_n) Parts of (26) With z₀ = x₀ + i y₀ = 1.

Figure 9. Phase Space of the Ikeda Map.

We can use the time series, {(x_n, y_n)}_n, to train a separable functional network for each of the variables x and y (see Fig. 10). For instance, a separable model for x is given by

x̂_{n+1} = F(x_n, y_n) = Σ_{i=1}^r Σ_{j=1}^s c_{ij} f_i(x_n) g_j(y_n),    (27)

where the c_{ij} are constants (parameters of the model). Separable functional networks generalize the class of models that combine the separate contributions of each of the independent variables, F(x, y) = f(x)g(y). They also have interesting applications in a large class of theoretical and practical dynamical systems (see Castillo and Gutiérrez 1998). In this example we use the first 1,000 points of the preceding time series to train the functional network and the next 4,000 points to test the resulting models. To this aim we use a vector of the time series as input, (x_n, y_n), and each of the components of the following vector, x_{n+1} or y_{n+1}, as outputs of two different functional networks, obtaining an estimated model for both variables: x̂_{n+1} = F₁(x_n, y_n) and ŷ_{n+1} = F₂(x_n, y_n).

Figure 10. Functional Network Model for Training the Real or Imaginary Component of the Ikeda Map.

Concerning the problem of selecting a convenient functional basis for the neuron functions, we start by selecting a general function family in light of the qualitative information available about the functional structure of the problem. Then, an MDL selection mechanism is applied to obtain the optimal model. Since the Ikeda map represents the nonlinear dynamics in a Poincaré section of a laser flow [see Jackson (1991) for an introduction to nonlinear dynamics], which is characterized by a certain exponential function of the laser intensity, a suitable family for approximating the dynamics of this map is a combined basis of polynomials and Fourier trigonometric functions of the form

{1, x, …, x^p, sin(x), cos(x), …, sin(qx), cos(qx)}

for the neural functions, where p and q determine the number of polynomial and trigonometric functions. For instance, for p = 2 and q = 4, we get a functional network with 11 × 11 = 121 terms. In this case, we obtained an approximate model x̂_{n+1} with root mean squared (RMS) training error 0.0011 and RMS test error 0.0021. Similarly, we have considered the output y_{n+1} and obtained the model ŷ_{n+1} with RMS training and test errors 0.0009 and 0.0019, respectively.

The need to include some kind of model-selection mechanism arises naturally in this type of functional network architecture, because the number of crossed terms grows quickly even in simple situations (in the preceding example, the model contained 121 terms). In the following, we use the MDL algorithm introduced previously to obtain an optimal model for the Ikeda time series considering the preceding polynomial-Fourier functional family. We have performed the forward-backward algorithm, training the network with the first 1,000 points and computing the DL measure using the test error on the next 4,000 points to prevent overfitting. The optimal model obtained for x̂ is

x̂ = 0.085xy + 0.361x²y² + 0.158 cos(x) − 0.187y cos(4x) − 0.161y² cos(4x) + 0.773 cos(y) + 0.115 cos(4x) cos(y) + 0.305 cos(3x) cos(3y) − 0.056 cos(x) cos(4y) − 0.277 cos(3x) cos(4y) − 0.501 cos(y) sin(3x) + 0.363 cos(y) sin(4x) + 0.205 cos(4y) sin(4x) + 0.750 sin(x) sin(y) + 0.284 cos(2x) sin(2y) − 0.362 cos(3x) sin(2y) − 0.203 sin(2x) sin(2y) − 0.281 cos(2x) sin(3y) + 0.176 cos(4x) sin(3y) − 0.143 sin(4y) − 0.020 sin(2x) sin(4y),    (28)

with RMSE = 0.036 and L₁ = −1,593.89. Similarly, the optimal model for ŷ is

ŷ = −3.804xy − 1.145x²y² + 0.254y cos(4x) − 0.407 cos(y) + 0.445 cos(4x) cos(y) − 0.195 cos(x) cos(4y) + 1.213 cos(y) sin(3x) − 0.331 cos(y) sin(4x) + 0.172 cos(4y) sin(4x) + 5.267 sin(x) sin(y) − 0.451 cos(2x) sin(2y) + 0.386 cos(3x) sin(2y) − 0.529 sin(2x) sin(2y) − 0.211 cos(2x) sin(3y),    (29)

with RMSE = 0.068 and L₁ = −1,123.21. Note that these two models contain 21 and 14 terms, respectively. Thus the size of the functional network (number of functional terms) is reduced by one order of magnitude, whereas the fitting error increases by one order of magnitude. Thus, fitting power is sacrificed in favor of complexity reduction in the optimal models.

If no qualitative information about the system dynamics were available, we could still use some ad hoc functional family, such as polynomials or Fourier functions. In this case, a poorer performance is expected for the resulting approximate model. For instance, if we choose a polynomial family in the preceding example (p = 7, q = 0), we get an optimal model for variable x including the terms

{1, y, x⁴y², x, x³, x⁵y, y², y⁶, x⁵y⁵, y³, y⁴, x², x²y², xy, x³y⁴, xy³, x⁷, x⁴, x⁵, xy⁴, x⁷y},

with RMSE = 0.0808 and DL = −1,190.07; that is, the error doubles with the same network complexity. On the other hand, if we choose a Fourier functional family (p = 0, q = 5), we get an optimal model with RMSE = 0.052 and DL = −1,225.77. Note that, in this example, qualitative information about the system dynamics, or the functional structure, can be included in the model by selecting an appropriate basis for the neuron functions; this results in a more realistic and efficient approximate model.

An interesting problem when dealing with experimental data is that of reducing the noise contained in the time series. Functional networks can be used easily for this task. Figure 11(a) shows a noisy orbit that has been computed by adding normally distributed noise with σ = 0.05 to the Ikeda map shown in Figure 9. If we use the functional networks given in (28) and (29) to train the noisy data, then the noise is cleaned off from the data. Figure 11(b) shows the noise-reduced model obtained from the functional network. By comparison with Figure 9, we see that the actual deterministic dynamics have been virtually recovered.

Figure 11. (a) Noisy Orbit of 5,000 Points for the Ikeda Map With Added Normally Distributed Noise With σ = 0.05; (b) Cleaned Series Obtained From the Associated Functional Network Model.
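For readers who wish to reproduce the training data, the following sketch (ours) generates the Ikeda orbit with the parameter values quoted in the text, splits it into the 1,000-point training and 4,000-point test sets, and adds the Gaussian noise used for Figure 11(a). The random seed and the reading σ = 0.05 of the noise level are our assumptions.

```python
# Generating the Ikeda time series (26) used in Section 5.2.
import numpy as np

a, b, c, d = 1.0027, 0.9, 0.4, 6.0

def ikeda_series(n, z0=1.0 + 0.0j):
    z, out = z0, np.empty(n, dtype=complex)
    for k in range(n):
        z = a + b * z * np.exp(1j * (c - d / (1.0 + abs(z) ** 2)))   # map (26)
        out[k] = z
    return out.real, out.imag

x, y = ikeda_series(5000)
x_train, y_train = x[:1000], y[:1000]          # used to fit the separable networks
x_test, y_test = x[1000:], y[1000:]            # used to compute the test errors

rng = np.random.default_rng(0)
x_noisy = x + rng.normal(0.0, 0.05, x.size)    # noisy orbit as in Figure 11(a)
y_noisy = y + rng.normal(0.0, 0.05, y.size)
```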
5.3 Beam Example

In this example we are interested in modeling the behavior of a beam subject to given vertical forces (loads). Figure 12 depicts a beam with deflection z(x) at point x due to load p(x). The well-known strength-of-materials relation is

m(x) = EI z''(x),    (30)

where m(x) is called the bending moment, E is the Young modulus, and I is the moment of inertia of the cross-section of the beam. Furthermore, z'(x) = w(x) and q(x) are called the rotation of the beam and the shear stress at point x, respectively. Classical models use equilibrium equations of vertical forces p and q and moments (bending moment m and those due to p and q) stated for differential pieces. In this example, we introduce a new approach based on functional networks in which the equilibrium equations are stated for discrete pieces.

Figure 12. A Beam Subject to Load p(x) With Deflection z(x) at Point x.

5.3.1 Classical Approach: Differential Equations. In the classical approach, the equilibrium equations are stated for differential pieces. Figure 13 shows an example of a differential piece. The equilibrium of vertical forces leads to

q(x + dx) = q(x) + p(x)dx  ⟹  q'(x) = p(x),    (31)

and the equilibrium of moments to

m(x + dx) = m(x) + q(x)dx + p(x)dx²/2  ⟹  m'(x) = q(x).    (32)

From (30), (31), and (32) we get the well-known differential equation

EI z⁽ⁱᵛ⁾(x) = p(x),    (33)

which is equivalent to the following system of differential equations:

q'(x) = p(x)
m'(x) = q(x)
w'(x) = m(x)/EI
z'(x) = w(x).    (34)

This is the usual mathematical model in terms of differential equations when we are interested in q, m, w, and z.

Figure 13. Illustration of the Classical Equilibrium of a Differential Piece.

5.3.2 New Approach: Functional Networks. In the new approach, the equilibrium equations are stated for discrete pieces of length u. Figure 14 shows one such piece. The equilibrium of vertical forces leads to

q(x + u) = q(x) + A(x, u),    (35)

where

A(x, u) = ∫ from x to x+u of p(s) ds.    (36)

The equilibrium of moments gives

m(x + u) = m(x) + u q(x) + B(x, u),    (37)

where

B(x, u) = ∫ from x to x+u of (x + u − s) p(s) ds.    (38)

Figure 14. Illustration of the Equilibrium of a Discrete Piece.

Now using Equation (30) we get

w(x + u) = w(x) + (1/EI) ∫ from x to x+u of m(s) ds
         = w(x) + (1/EI) ∫ from x to x+u of [m(x) + (s − x)q(x) + B(x, s − x)] ds
         = w(x) + (1/EI) [m(x)u + q(x)u²/2 + C(x, u)],    (39)

where

C(x, u) = ∫ from x to x+u of B(x, s − x) ds.    (40)

In addition we have

z(x + u) = z(x) + ∫ from x to x+u of w(s) ds
         = z(x) + ∫ from x to x+u of { w(x) + (1/EI)[m(x)(s − x) + q(x)(s − x)²/2 + C(x, s − x)] } ds
         = z(x) + w(x)u + (1/EI) [m(x)u²/2 + q(x)u³/6 + D(x, u)],    (41)

where

D(x, u) = ∫ from x to x+u of C(x, s − x) ds.    (42)

Thus, we get the system of functional equations

q(x + u) = q(x) + A(x, u)
m(x + u) = m(x) + u q(x) + B(x, u)
w(x + u) = w(x) + (1/EI) [m(x)u + q(x)u²/2 + C(x, u)]
z(x + u) = z(x) + w(x)u + (1/EI) [m(x)u²/2 + q(x)u³/6 + D(x, u)],    (43)

where

A(x, u) = ∫ from x to x+u of p(s) ds
B(x, u) = ∫ from x to x+u of (x + u − s) p(s) ds
C(x, u) = ∫ from x to x+u of B(x, s − x) ds
D(x, u) = ∫ from x to x+u of C(x, s − x) ds,    (44)

which is equivalent to the system of functional equations in (34). Note that the functions A(x, u), B(x, u), C(x, u), and D(x, u) become known as soon as the load function p(x) is known and that in some cases we can solve the problem with 1, 2, 3, or all equations in (43), depending on the boundary conditions. Note also that the system (43) can be considered as a system of difference equations (by setting u = Δx), which gives the exact solution at the interpolating points.

To obtain an equivalent functional equation in z(x), we can write the last equation in (43) for three different values of u and eliminate w(x), m(x), and q(x), obtaining a functional equation in z(x). For example, if we write this equation for u, 2u, 3u, and 4u, we get the functional equation

z(x + 4u) = R(x, u),    (45)

where

R(x, u) = −z(x) + 4z(x + u) − 6z(x + 2u) + 4z(x + 3u) + [−4D(x, u) + 6D(x, 2u) − 4D(x, 3u) + D(x, 4u)]/EI,    (46)

which is equivalent to (33). By a similar process, we can obtain the following functional equations for w(x), m(x), and q(x):

w(x + 3u) = H(x, u)
m(x + 2u) = G(x, u)
q(x + u) = F(x, u),    (47)

where

H(x, u) = w(x) − 3w(x + u) + 3w(x + 2u) + [3C(x, u) − 3C(x, 2u) + C(x, 3u)]/EI
G(x, u) = −m(x) + 2m(x + u) − 2B(x, u) + B(x, 2u)
F(x, u) = q(x) + A(x, u).    (48)

Equations (46) to (48) can also be interpreted as finite-difference equations. In this case, they give the exact solution at the interpolating points. The functional networks reproducing the beam-equation problem are shown in Figure 15. Note that learning the functions R, H, G, and F implies learning the terms depending on D, C, B, and A, respectively.

Figure 15. Functional Networks Reproducing the Beam-Equation Problems.

5.3.3 Supported Cantilever Beam. In the case of a supported cantilever beam, such as the one shown in Figure 16, the boundary conditions are

w(0) = z(0) = 0;   m(s) = z(s) = 0.    (49)

Figure 16. A Supported Cantilever Beam Subject to Load p(x).
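The remark that (43) can be used as a system of difference equations is easy to make concrete. The sketch below is ours, not the authors' code: the load p(x) = −exp(x), the value EI = 1, and the step u = 0.025 are illustrative assumptions, and the starting values q(0) and m(0) are simply those in the first row of Table 4.

```python
# Propagating the discrete beam equations (43)-(44) for a given load p(x).
import numpy as np
from scipy.integrate import quad

p = lambda s: -np.exp(s)          # assumed load (downward exponential)
EI, u, n = 1.0, 0.025, 40

def A(x, u): return quad(p, x, x + u)[0]
def B(x, u): return quad(lambda s: (x + u - s) * p(s), x, x + u)[0]
def C(x, u): return quad(lambda s: B(x, s - x), x, x + u)[0]
def D(x, u): return quad(lambda s: C(x, s - x), x, x + u)[0]

q, m, w, z = 0.922, -0.205, 0.0, 0.0      # starting values at x = 0 (illustrative)
for k in range(n):
    x = k * u
    # each update uses the values at x, so z is updated first and q last
    z = z + w * u + (m * u**2 / 2 + q * u**3 / 6 + D(x, u)) / EI
    w = w + (m * u + q * u**2 / 2 + C(x, u)) / EI
    m = m + u * q + B(x, u)
    q = q + A(x, u)
print(q, m, w, z)                          # state variables at x = n*u
```

Once the load-dependent terms A, B, C, and D are replaced by learned neural functions, the same recursion gives the functional network predictions.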
Table 4. Observed Data Points

  x      z(x)      w(x)     m(x)     q(x)       x      z(x)     w(x)     m(x)     q(x)
0.000    0.000      0.00   -0.205    0.922    0.525   -94.3    -77.6    0.114    0.232
0.025   -0.615    -48.2    -0.182    0.897    0.550   -95.9    -48.3    0.119    0.189
0.050   -2.370    -90.8    -0.160    0.871    0.575   -96.7    -17.7    0.124    0.145
0.075   -5.110   -128.0    -0.138    0.844    0.600   -96.8     13.7    0.127    0.100
0.100   -8.720   -160.0    -0.118    0.817    0.625   -96.1     45.7    0.129    0.054
0.125  -13.100   -187.0    -0.098    0.789    0.650   -94.5     78.1    0.129    0.007
0.150  -18.000   -208.0    -0.078    0.760    0.675   -92.1    111.0    0.129   -0.042
0.175  -23.400   -226.0    -0.060    0.731    0.700   -89.0    143.0    0.127   -0.092
0.199  -29.300   -238.0    -0.042    0.701    0.725   -85.0    174.0    0.124   -0.143
0.224  -35.300   -246.0    -0.025    0.670    0.750   -80.3    205.0    0.120   -0.195
0.249  -41.500   -250.0    -0.008    0.638    0.775   -74.8    235.0    0.115   -0.249
0.274  -47.800   -250.0     0.007    0.606    0.800   -68.6    262.0    0.108   -0.303
0.300  -54.000   -246.0     0.022    0.572    0.825   -61.7    289.0    0.099   -0.360
0.325  -60.100   -239.0     0.036    0.538    0.850   -54.1    312.0    0.090   -0.418
0.350  -65.900   -228.0     0.049    0.503    0.875   -46.1    334.0    0.079   -0.477
0.375  -71.500   -214.0     0.061    0.467    0.900   -37.5    352.0    0.066   -0.538
0.400  -76.600   -197.0     0.072    0.430    0.925   -28.5    367.0    0.052   -0.600
0.425  -81.300   -178.0     0.083    0.392    0.950   -19.2    378.0    0.036   -0.664
0.450  -85.500   -156.0     0.092    0.354    0.975    -9.66   385.0    0.019   -0.729
0.475  -89.100   -132.0     0.100    0.314    1.000    -0.001  387.0   -0.001   -0.796
0.500  -92.000   -105.0     0.108    0.273
Assume that we have obtained the data in Table 4. Using the functional equations in (46) to (48), we obtain the models

z(x + 4u) = −0.004 − 0.0041x − 0.0021x² − 0.00068x³ − 0.00017x⁴ − 0.00003x⁵ − 0.000009x⁶
w(x + 3u) = −0.162 − 0.162x − 0.081x² − 0.027x³ − 0.00688x⁴ − 0.00116x⁵ − 0.00036x⁶
m(x + 2u) = −0.000641 − 0.000641x − 0.00032x² − 0.00011x³ − 0.000027x⁴ − 0.0000045x⁵ − 1.44 × 10⁻⁶x⁶
q(x + u) = −0.025 − 0.0253x − 0.0127x² − 0.0042x³ − 0.00108x⁴ − 0.000178x⁵ − 0.0000577x⁶.

Figure 17 shows the observed data values and the corresponding predictions for deflections, rotations, bending moments, and shears, based on observed values for an exponential exp(x) load. With the aim of analyzing the influence of noise, or observation errors, on the predictions, Figure 18 shows the observed data values with normal random error (see Table 5) and their corresponding predictions based on predicted values for the same load. As we can see, in this last case the predictions deteriorate, as would be expected, but this deterioration is small and of the same order of magnitude as the observation errors.

Figure 17. Observations (dots) and Predictions (continuous lines) Based on Observed Values for an Exponential exp(x) Load and a Supported Cantilever Beam: (a) Deflections; (b) Rotations; (c) Bending Moments; (d) Shears.

Figure 18. Observations With Noise (dots) and Predictions (continuous lines) Based on Observed Values for an Exponential exp(x) Load and a Supported Cantilever Beam: (a) Deflections; (b) Rotations; (c) Bending Moments; (d) Shears.

6. RELATIONSHIPS TO OTHER STATISTICAL METHODS

In the previous sections we have focused our attention on the similarities and differences between functional networks and neural networks. In this section we briefly discuss the relationship between functional networks and three other statistical methods:
Table 5. Observed Data Points With Normal Random Noise x 0 0025 005 0075 01 0125 015 0175 02 0225 025 0275 03 0325 035 0375 04 0425 045 0475 05
z (x)
w(x)
m(x)
q(x)
x
z (x)
w(x)
m(x)
q(x)
ƒ0124 ƒ0812 ƒ2012 ƒ4054 ƒ8093 ƒ1203 ƒ180 ƒ2309 ƒ2903 ƒ3502 ƒ4108 ƒ480 ƒ5305 ƒ5909 ƒ6604 ƒ7104 ƒ7609 ƒ8006 ƒ8601 ƒ8902 ƒ9203
5016 ƒ4502 ƒ8407 ƒ1230 ƒ1650 ƒ1880 ƒ2090 ƒ2230 ƒ2350 ƒ2420 ƒ2470 ƒ2520 ƒ2420 ƒ2430 ƒ2290 ƒ2140 ƒ1940 ƒ1790 ƒ1550 ƒ1330 ƒ9806
ƒ0198 ƒ0177 ƒ0164 ƒ0142 ƒ0119 ƒ0103 ƒ0082 ƒ0061 ƒ0046 ƒ0016 ƒ0007 0007 0018 004 0051 0063 0066 0085 0091 0102 0119
0926 0881 0875 0824 0802 079 0816 0744 0664 0703 0643 0555 053 0531 0487 0462 0464 042 0374 0302 0222
0525 055 0575 06 0625 065 0675 07 0725 075 0775 08 0825 085 0875 09 0925 095 0975 10
ƒ940 ƒ960 ƒ9607 ƒ9609 ƒ9703 ƒ9402 ƒ9308 ƒ8809 ƒ8607 ƒ8002 ƒ7407 ƒ6801 ƒ6102 ƒ5405 ƒ4604 ƒ3702 ƒ290 ƒ1904 ƒ906 ƒ0298
ƒ7804 ƒ4609 ƒ2009 1506 4309 7607 1170 1410 1690 2040 2290 2650 2920 3120 3320 3600 3710 3740 3830 3920
0112 0116 0125 0122 0129 013 0125 0116 0123 0117 0118 0103 0107 0097 0079 0062 0052 0033 0017 ƒ0001
024 0208 0186 0071 0083 0031 ƒ007 ƒ0097 ƒ0141 ƒ0207 ƒ0247 ƒ0285 ƒ0356 ƒ0453 ƒ0469 ƒ0558 ƒ0581 ƒ0746 ƒ0733 ƒ0809
1. Alternate conditioning expectation (ACE): ACE, developed by Breiman and Friedman (1985), is a powerful nonparametric method that can be used to determine the optimal transformations of a response and the predictor variables in a regression problem. Given a response variable Y and a set of p predictor variables, X1 1 X2 1 : : : 1 Xp , we wish to estimate the functions ˆ4Y 5 and ”1 4X1 51 ”2 4X2 5, : : : , ”p 4Xp 5, which minimizes p X e2 = E 6ˆ4Y 5 ƒ ”j 4Xj 572 1 j=1
using sample data 84yk 1 xk1 1 xk2 1 : : : 1 xkp 51 1 k n9. It can be seen from this formulation that ACE is a particular case of a functional network that can be learned using appropriate nonparametric methods. 2. Multivariate adaptive regression spline (MARS): MARS, developed by Friedman (1991), is an innovative and exible modeling tool that automates the building of predictive models. MARS technique can be seen as very close to recursive partitioning regression. The main idea is to write the prediction function f 4x5 as a sum of basis functions that are built using the MARS algorithm. The algorithm proceeds by selecting adequate regions where the splines are used. As is common in other procedures, multidimensional splines are generated by the tensor product of univariate splines. The selection of the number of predictor variables and the complexity of the model is based on the generalized cross-validation criterion, which was originally proposed by Craven and Wahba (1979). MARS has evolved with time, and several methods exist for separating relevant from irrelevant predictor variables, transforming predictors exhibiting nonlinear relationship with the response variable, determining interactions between predictor variables, and so forth. Like ACE, MARS can also be seen as a particular functional network with some especially designed learning methods. 3. The generalized additive models (GAM): These models differ from a generalized linear model in that an additive predictor replaces the linear predictor. More precisely, the
response variable is assumed to have an exponential family density
$$
f_Y(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\},
$$
with mean $\mu = E(Y \mid X_1, X_2, \ldots, X_p)$ linked to the predictors via
$$
g(\mu) = \alpha + \sum_{j=1}^{p} f_j(X_j).
$$
Estimation of $\alpha$ and $f_1, \ldots, f_p$ is accomplished by an appropriate algorithm (e.g., see Hastie and Tibshirani 1990). Thus, this model can also be seen as a particular functional network with some especially designed learning methods.
It is clear that functional networks are a general method that includes many others as particular cases. In addition, the preceding techniques can be used as learning methods for learning neural functions in complex functional networks. However, functional networks incorporate new ways of working with data. For example, the initial steps used in functional networks allow (a) identifying all possible candidate functions that satisfy some given properties and (b) eliminating a huge set of infeasible possibilities. This has no parallel in the preceding three approaches. Functional equations establish some important constraints on the candidate functions that cannot be implemented with these approaches. ACE, GAM, and MARS fit a single function (for the response and/or the predictor variables), whereas functional networks allow defining functions of functions up to any order. This is more powerful than the single-function approach and allows building the full (multivariate) function in pieces. With respect to similarities, all these methods use basis functions and some methods for the selection of the most adequate set of functions to be included in the basis.
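For concreteness, the sum-of-basis-functions form used by MARS (item 2 above) can be written in the standard notation of Friedman (1991); the symbols below ($a_m$, $s_{km}$, $t_{km}$, $v(k,m)$) follow that paper and are reproduced here only as a reminder of the model form, not as part of the original development:
$$
\hat{f}(\mathbf{x}) = a_0 + \sum_{m=1}^{M} a_m \prod_{k=1}^{K_m} \left[ s_{km}\left( x_{v(k,m)} - t_{km} \right) \right]_{+},
$$
where $[u]_+ = \max(0, u)$, $K_m$ is the interaction order of the $m$th basis function, $v(k,m)$ indexes the predictor entering its $k$th factor, and $t_{km}$ and $s_{km} = \pm 1$ are the knot and sign of that factor. Each basis function is thus a tensor product of univariate truncated linear splines.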
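The ACE criterion in item 1 can also be illustrated with a small numerical sketch. The following Python code is a minimal, illustrative implementation of the alternating conditional-expectation idea only: the binned running-mean smoother, the function names (smooth, ace), the toy data, and the iteration count are assumptions made for this example and are not taken from the article or from Breiman and Friedman's supersmoother-based implementation.

```python
# Minimal ACE-style alternating fit (illustrative sketch only; the smoother,
# data, and names are assumptions made for this example).
import numpy as np

def smooth(t, r, bins=20):
    """Crude estimate of E[r | t]: average r within equal-count bins of t."""
    order = np.argsort(t)
    r_sorted = r[order]
    fitted_sorted = np.empty_like(r_sorted)
    for idx in np.array_split(np.arange(len(t)), bins):
        fitted_sorted[idx] = r_sorted[idx].mean()
    fitted = np.empty_like(fitted_sorted)
    fitted[order] = fitted_sorted          # undo the sort
    return fitted

def ace(y, X, n_iter=20):
    """Alternate between updating each phi_j and theta to reduce e^2."""
    n, p = X.shape
    theta = (y - y.mean()) / y.std()       # start from the standardized response
    phi = np.zeros((n, p))
    for _ in range(n_iter):
        for j in range(p):                 # phi_j <- E[theta - sum_{k!=j} phi_k | X_j]
            partial = theta - phi.sum(axis=1) + phi[:, j]
            phi[:, j] = smooth(X[:, j], partial)
            phi[:, j] -= phi[:, j].mean()
        theta = smooth(y, phi.sum(axis=1)) # theta <- E[sum_j phi_j | Y], standardized
        theta = (theta - theta.mean()) / theta.std()
    return theta, phi

# Toy usage: an additive truth seen through an exponential response transformation.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2))
y = np.exp(np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=500))
theta, phi = ace(y, X)
print(np.corrcoef(theta, phi.sum(axis=1))[0, 1])  # close to 1 indicates a good additive fit
```

In each pass, every transformation $\varphi_j$ is replaced by a smoothed conditional expectation of the current partial residual given $X_j$, and $\theta$ is then re-estimated and standardized; this is the alternating minimization of $e^2$ described above, with smoothing standing in for the conditional expectations.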
7. SUMMARY AND CONCLUDING REMARKS
In this article, we have seen that functional networks have many practical applications in probability, statistics, and
engineering. The physical understanding of the problem to be solved gives rise to an initial topology of the functional network. Functional equations allow simplifying the initial topology, leading to a much simplified network. This problem-driven design is but one feature of functional networks that differentiates them from the standard neural-networks approach. After the uniqueness problem has been solved, learning reduces to finding unique neural functions. When neural functions cannot be exactly learned, approximate learning can be achieved by selecting adequate sets of functions to approximate them. The model-selection process (that is, selecting which sets of functions contribute to a good fit) can be performed using the MDL measure in conjunction with procedures such as the forward-backward or the backward-forward algorithms. A final model-validation step must be applied to prevent model-overfitting problems.

ACKNOWLEDGMENTS

We thank the editor, an associate editor, and a referee for their careful reading of the manuscript and for providing many helpful comments. We are also grateful to the Universities of Cantabria and Castilla-La Mancha, the Dirección General de Investigación Científica y Técnica (DGICYT) (project PB980421), Iberdrola, and CAI-CONAI for partial support of this research.

[Received August 1999. Revised April 2000.]
REFERENCES

Aczél, J. (1966), Lectures on Functional Equations and Their Applications (Vol. 19, Mathematics in Science and Engineering), New York: Academic Press.
Akaike, H. (1973), “Information Theory and an Extension of the Maximum Likelihood Principle,” in Proceedings of the Second International Symposium on Information Theory, eds. B. N. Petrov and F. Czaki, Budapest: Academia Kiado, pp. 267–281.
Allen, J. (1995), Natural Language Understanding (2nd ed.), Reading, MA: Addison-Wesley.
Anderson, J. A., and Rosenberg, E. (eds.) (1988), Neurocomputing: Foundations of Research, Cambridge, MA: The MIT Press.
Atkinson, A. C. (1978), “Posterior Probabilities for Choosing a Regression Model,” Biometrika, 65, 39–48.
Attali, J. A., and Pages, G. (1997), “Approximation of Functions by a Multilayer Perceptron: A New Approach,” Neural Networks, 10, 1069–1081.
Azoff, E. M. (1994), Neural Network Time Series Forecasting of Financial Markets, New York: Wiley.
Bishop, C. M. (1997), Neural Networks for Pattern Recognition, New York: Oxford University Press.
Breiman, L., and Friedman, J. H. (1985), “Estimating Optimal Transformations for Multiple Regression and Correlation” (with comments), Journal of the American Statistical Association, 80, 580–619.
Castillo, E., Cobo, A., Gómez-Nesterkín, R., and Hadi, A. S. (1999), “A General Framework for Functional Networks,” Networks, 35, 70–82.
Castillo, E., Cobo, A., Gutiérrez, J. M., and Pruneda, E. (1998), An Introduction to Functional Networks With Applications, New York: Kluwer.
Castillo, E., Gutiérrez, J. M., Cobo, A., and Castillo, C. (2000), “A Minimax Method for Learning Functional Networks,” Neural Processing Letters, 11, 39–49.
Castillo, E., and Gutiérrez, J. M. (1998), “Nonlinear Time Series Modeling and Prediction Using Functional Networks: Extracting Information Masked by Chaos,” Physics Letters A, 244, 71–84.
Castillo, E., Gutiérrez, J. M., and Hadi, A. S. (1997), Expert Systems and Probabilistic Network Models, New York: Springer-Verlag.
Castillo, E., and Ruiz-Cobo, R. (1992), Functional Equations in Science and Engineering, New York: Marcel Dekker.
Cichocki, A., Unbehauen, R., and Cochocki, A. (1993), Neural Networks for Optimization and Signal Processing, New York: Wiley.
Craven, P., and Wahba, G. (1979), “Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-validation,” Numerische Mathematik, 31, 317–403.
Elias, P. (1975), “Universal Codeword Sets and Representations of the Integers,” IEEE Transactions on Information Theory, 21, 194–203.
Freeman, J. A., and Skapura, D. M. (1991), Neural Networks: Algorithms, Applications, and Programming Techniques, Reading, MA: Addison-Wesley.
Friedman, J. H. (1991), “Multivariate Adaptive Regression Splines” (with discussion), The Annals of Statistics, 19, 1–141.
Gómez-Nesterkín, R. (1996), “Modelación y Predicción Mediante Redes Funcionales,” Red Mat Electronic Journal, Facultad de Ciencias, Mexico: UNAM.
Hammel, S., Jones, C. K. R. T., and Maloney, J. (1985), “Global Dynamical Behavior of the Optical Field in a Ring Cavity,” Journal of the American Optical Society, B:2, 552.
Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, New York: Chapman and Hall.
Hertz, J., Krogh, A., and Palmer, R. G. (1991), Introduction to the Theory of Neural Computation, Redwood City, CA: Addison-Wesley.
Jackson, E. A. (1991), Perspectives of Nonlinear Dynamics (2 vols.), Cambridge, U.K.: Cambridge University Press.
Lindley, D. V. (1968), “The Choice of Variables in Multiple Regression,” Journal of the Royal Statistical Society, Ser. A, 30, 31–66.
Lisboa, P. G. L. (ed.) (1992), Neural Networks: Current Applications, New York: Chapman and Hall.
Miller, W. T., Sutton, R. S., and Werbos, P. J. (eds.) (1995), Neural Networks for Control, Cambridge, MA: MIT Press.
Myers, C. E. (1992), Delay Learning in Artificial Neural Networks, New York: Chapman and Hall.
Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge, U.K.: Cambridge University Press.
Rissanen, J. (1983), “A Universal Prior for Integers and Estimation by Minimum Description Length,” The Annals of Statistics, 11, 416–431.
——— (1989), Stochastic Complexity in Statistical Inquiry, Singapore: World Scientific.
Rumelhart, D. E., and McClelland, J. L. (1986), Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Vols. I and II), Cambridge, MA: MIT Press.
Skrzypek, J., and Karplus, W. (eds.) (1996), Neural Networks in Vision and Pattern Recognition (World Scientific Series in Machine Perception and Artificial Intelligence), River Edge, NJ: World Scientific.
Stern, H. S. (1996), “Neural Networks in Applied Statistics,” Technometrics, 38, 205–214.
Stone, M. (1974), “Cross-Validatory Choice and Assessment of Statistical Predictions,” Journal of the Royal Statistical Society, Ser. B, 36, 111–147.
Swingler, K. (1996), Applying Neural Networks: A Practical Guide, New York: Academic Press.