Understanding Neural Networks as Statistical Tools
Brad Warner
Department of Preventive Medicine and Biometrics University of Colorado Health Sciences Center Denver, CO 80220 Ph. (303)-399-8020 ext. 2521
[email protected]
Manavendra Misra
Department of Mathematical and Computer Sciences Colorado School of Mines Golden, CO 80401 Ph. (303)-273-3873 Fax (303)-273-3875
[email protected]
Starting June 30, 1996, Brad Warner will be an Assistant Professor at the United States Air Force Academy. Manavendra Misra is an Assistant Professor, Colorado School of Mines, Golden, CO 80401. The authors thank Guillermo Marshall, the referees, and editors for the helpful comments that have improved this paper. The authors also thank Dr. Karl Hammermeister and the Department of Veterans Affairs for the use of their data. Please direct all correspondence to the second author at the Colorado School of Mines address.
Abstract

Neural networks have received a great deal of attention over the last few years. They are being used in the areas of prediction and classification, areas where regression models and other related statistical techniques have traditionally been used. In this paper, we discuss neural networks and compare them to regression models. We start by exploring the history of neural networks, including a review of relevant literature on the topic. Neural network nomenclature is then introduced, and the backpropagation algorithm, the most widely used learning algorithm, is derived and explained in detail. A comparison between regression analysis and neural networks in terms of notation and implementation is conducted to aid the reader in understanding neural networks. We compare the performance of regression analysis with that of neural networks on two simulated examples and one example on a large data set. We show that neural networks act as a type of nonparametric regression model, enabling us to model complex functional forms. We discuss when it is advantageous to use this type of model in place of a parametric regression model, and also some of the difficulties in implementation.
Key-words: Nonparametric Regression, Artificial Intelligence, Backpropagation, Generalized Linear Model.
1 Introduction
Neural networks have recently received a great deal of attention in many fields of study. The excitement stems from the fact that these networks are attempts to model the capabilities of the human brain. People are naturally attracted by attempts to create human-like machines; a Frankenstein obsession if you will. On a practical level, the human brain has many features that are desirable in an electronic computer. The human brain has the ability to generalize from abstract ideas, recognize patterns in the presence of noise, quickly recall memories, and withstand localized damage. From a statistical perspective, neural networks are interesting because of their potential use in prediction and classification problems. Neural networks have been used for a wide variety of applications where statistical methods are traditionally employed. They have been used in classification problems such as identifying underwater sonar contacts (Gorman & Sejnowski, 1988), and predicting heart problems in patients (Baxt, 1990, 1991; Fujita, Katafuchi, Uehara & Nishimura, 1992). They have also been used in such diverse areas as diagnosing hypertension (Poli, Cagnoni, Livi, Coppini & Valli, 1991), playing backgammon (Tesauro, 1990), and recognizing speech (Lippmann, 1989). In time series applications, they have been used in predicting stock market performance (Hutchinson, 1994). Neural networks are currently the preferred tool in predicting protein secondary structures (Qian & Sejnowski, 1988). As statisticians or users of statistics, we would normally solve these problems through classical statistical models such as discriminant analysis (Flury & Riedwyl, 1990), logistic regression (Studenmund, 1992), Bayes and other types of classifiers (Duda & Hart, 1973), multiple regression (Neter, Wasserman & Kutner, 1990), and time series models such as ARIMA and other forecasting methods (Studenmund, 1992). It is therefore time to recognize neural networks as a potential tool for data analysis.
Several authors have done comparison studies between statistical methods and neural networks (Hruschka, 1993; Wu & Yen, 1992). These works tend to focus on performance comparisons and use specific problems as examples. There are a number of good introductory articles on neural networks, usually located in various trade journals. For instance, Lippmann (1987) provides an excellent overview of neural networks for the signal processing community. There are also a number of good introductory books on neural networks,
with Hertz, Krogh, and Palmer (1991) providing a good mathematical description, Smith (1993) explaining backpropagation in an applied setting, and Freeman (1994) using examples and code to explain neural networks. There have also been papers relating neural networks and statistical methods (Buntine & Weigend, 1991; Ripley, 1992; Sarle, 1994; Werbos, 1991). One of the best for a general overview is Ripley (1993). This paper intends to provide a short, basic introduction to neural networks for scientists, statisticians, engineers, and professionals with a mathematical and statistical background. We achieve this by contrasting regression models with the most popular neural network tool, a feedforward multilayered network trained using backpropagation. This paper provides an easy-to-understand introduction to neural networks, avoiding the overwhelming complexities of many other papers comparing these techniques. Section 2 discusses the history of neural networks. Section 3 explains the nomenclature unique to the neural network community and provides a detailed derivation of the backpropagation learning algorithm. Section 4 shows an equivalence between regression and neural networks. It demonstrates the methods on three examples. Two examples are simulated data where the underlying functions are known and the third is on data from the Department of Veterans Affairs Continuous Improvement in Cardiac Surgery Program (Hammermeister, Johnson, Marshall & Grover, 1994). These examples demonstrate the ideas in the paper and clarify when one method would be preferred over the other.
2 History
Computers are extremely fast at numerical computations, far exceeding human capabilities. However, the human brain has many abilities that would be desirable in a computer. These include: the ability to quickly identify features, even in the presence of noise; to understand, interpret, and act on probabilistic or fuzzy notions (such as `Maybe it will rain tomorrow'); to make inferences and judgments based on past experiences and relate them to situations that have never been encountered before; and to suffer localized damage without losing complete functionality (fault tolerance). So even though the computer is faster than the human brain in numeric computations, the brain far outperforms the computer in other tasks. This is the underlying motivation for neural networks.
[Figure: Schematic of a biological neuron, showing the dendrites, cell body, cell nucleus, axon hillock, axon, axonal arborization, and synaptic endbulbs.]

[Figure: A model neuron i with inputs weighted by w_i1, w_i2, ..., w_iN and threshold µ_i.]
dimensions, exists that can completely delineate the classes that the classifier attempts to identify. Problems that are linearly separable are only a special case of all possible classification problems.) A major blow to the early development of neural networks occurred when Minsky and Papert picked up on the linear separability limitation of the simple perceptron and published results demonstrating this limitation (Minsky & Papert, 1969). Although Rosenblatt knew of these limitations, he had not yet found a way to train other models to overcome this problem. As a result, interest and funding in neural networks waned. (It is interesting to note that while Rosenblatt, a psychologist, was interested in modeling the brain, Widrow, an engineer, was developing a similar model for signal processing applications called the Adaline (Widrow, 1962).) In the 1970s, there was still a limited amount of research activity in the area of neural networks. Modeling the memory was the common thread of most of this work. (Anderson (1970) and Willshaw, Buneman, and Longuet-Higgins (1969) discuss some of this work.) Grossberg (1976) and von der Malsburg (1973) were developing ideas on competitive learning while Kohonen (1982) was developing feature maps. Grossberg (1983) was also developing his Adaptive Resonance Theory. Obviously, there was a great deal of work done during this period, with many important papers and ideas that are not presented in this paper. (For a more detailed description of the history see Cowan and Sharp (1988).) Interest in neural networks renewed with the Hopfield model (Hopfield, 1982) of a content-addressable memory. In contrast to the human brain, a computer stores data as a look-up table. Access to this memory is made using addresses. The human brain does not go through this look-up process; it "settles" to the closest match based on the information content presented to it. This is the idea of a content-addressable memory.
The Hopfield model retrieves a stored pattern by `relaxing' to the closest match to an input pattern. Hopfield, however, did not use the network as a memory, as it is prone to getting stuck in local minima as well as being limited in the number of stored patterns (the network could reliably store a total number of patterns equal to approximately one tenth the number of inputs). Instead, Hopfield used his model for solving optimization problems such as the traveling salesperson problem (Hopfield & Tank, 1985). One of the most important developments during this period was the development of a method to train multilayered networks. This new learning algorithm was called backpropagation (McClelland, Rumelhart
& the PDP Research Group, 1986). The idea was explored in earlier works (Werbos, 1974), but was not fully appreciated at the time. Backpropagation overcame the earlier problems of the simple perceptron and renewed interest in neural networks. A network trained using backpropagation can solve a problem that is not linearly separable. Many of the current uses of neural networks in applied settings involve a multilayered feedforward network trained using backpropagation or a modification of the algorithm. Details of the backpropagation algorithm will be presented in Section 3. Neural network research incorporates many other architectures besides the multilayered feedforward network. Boltzmann machines have been developed based on stochastic units and have been used for tasks such as pattern completion (Hinton & Sejnowski, 1986). Time series problems have been attacked with recurrent networks such as the Elman network (Elman, 1990), the Jordan network (Jordan, 1989), and real-time recurrent learning (Williams & Zipser, 1989), to mention a few. There are neural networks that perform principal component analysis, prototyping, encoding, and clustering. Examples of neural network implementations include linear vector quantization (Kohonen, 1989), adaptive resonance theory (Moore, 1988), feature mapping (Willshaw & von der Malsburg, 1976), and counterpropagation networks (Hecht-Nielsen, 1987). There are of course many more equally important contributions which have been omitted here in the interest of time and space.
3 Neural Network Theory
The nomenclature used in the neural network literature is different from that used in the statistical literature. This section introduces the nomenclature and explains in detail the backpropagation algorithm (the algorithm used for estimation of model coefficients).
3.1 Nomenclature

Although the original motivation for the development of neural networks was to model the human brain, most neural networks as they are currently being used bear little resemblance to a biological brain. (It must
be pointed out that there is research in the areas of accurately modeling biological neurons and the processes of the brain, but these areas will not be discussed further here because this paper is concerned with the use of neural networks in prediction and function estimation.) A neural network is a set of simple computational units that are highly interconnected. The units are also called nodes and loosely represent the biological neuron. The networks discussed in this paper resemble the network in Figure 3. The neurons are represented by circles in Figure 3. The connections between units are uni-directional and are represented as arrows in Figure 3. These connections model the synaptic connections in the brain. Each connection has a weight, called the synaptic weight, denoted as w_ij, associated with it. The synaptic weight w_ij is interpreted as the strength of the connection from the j-th unit to the i-th unit.
The input into a node is a weighted sum of the outputs from the nodes connected to it. Thus the net input into node i is:

netinput_i = Σ_j w_ij output_j + µ_i   (3)

where the w_ij are the weights connecting neuron j to neuron i, output_j is the output from unit j, and µ_i is a threshold for neuron i. The threshold term is the baseline input to a node in the absence of any other inputs. (The term threshold comes from the activation function used in the McCulloch-Pitts neuron, see Equation 2, where the threshold term set the level that the other weighted inputs had to exceed for the neuron to fire.) If a weight w_ij is negative, it is termed inhibitory since it decreases the net input. If the weight is positive,
the contribution is excitatory since it increases the net input. Each unit takes its net input and applies an activation function to it. For example, the output of the j-th unit, also called the activation value of the unit, is g(Σ_i w_ji x_i), where g(·) is the activation function and x_i is the output of the i-th unit connected to unit j. A number of nonlinear functions have been used by researchers as activation functions; the two common choices are the threshold function in Equation 2 (mentioned in Section 2) and sigmoid functions such as:

g(netinput) = 1 / (1 + e^(−netinput))   (4)
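As a concrete illustration, Equations 3 and 4 can be written in a few lines of Python (a sketch; the function names are ours, not standard):

```python
import math

def net_input(weights, outputs, threshold):
    # Equation 3: weighted sum of the connected units' outputs plus the threshold term
    return sum(w * o for w, o in zip(weights, outputs)) + threshold

def sigmoid(netinput):
    # Equation 4: squashes any real-valued net input into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-netinput))

# A unit with two incoming connections:
activation = sigmoid(net_input([0.5, -0.3], [1.0, 2.0], 0.1))
```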
[Figure 3: A multilayered feedforward network with inputs x1, x2, x3, x4, hidden units h1, h2, h3, and outputs y1, y2, connected by weighted links.]
The weights in neural networks, similar to coefficients in regression models, are adjusted to solve the problem presented to the network. Learning or training is the term used to describe the process of finding the values of these weights. The two types of learning associated with neural networks are supervised and unsupervised learning. Supervised learning, also called learning with a teacher, occurs when there is a known target value associated with each input in the training set. The output of the network is compared with the target value, and this difference is used to train the network (alter the weights). There are many different algorithms for training neural networks using supervised learning; backpropagation is one of the more common ones and will be explored in detail in Section 3. A biological example of supervised learning is when you teach a child the alphabet. You show him or her a letter and based on his or her response you provide feedback to the child. This process is repeated for each letter until the child knows the alphabet. Unsupervised learning is needed when the training data lacks target output values corresponding to input patterns. The network must learn to group or cluster the input patterns based on some common features, similar to factor analysis (Harman, 1976) and principal components (Morrison, 1976). This type of training is also called learning without a teacher because there is no source of feedback in the training process. A biological example would be when a child touches a hot heating coil. He or she soon learns, without any external teaching, not to touch it. In fact, the child may associate a bright red glow with hot and learn to avoid touching objects with this feature. The networks discussed in this paper are constructed with layers of units and thus are termed multilayered networks. A layer of units in a multilayer network is composed of units that perform similar tasks.
A feedforward network is one where units in one layer are connected only to units in the next layer, and not to units in a preceding layer or units in the same layer. Figure 3 shows a multilayered feedforward network. Networks where the units are connected to other units in the same layer, to units in the preceding layer, or even to themselves are termed recurrent networks. Feedforward networks can be viewed as a special case of recurrent networks. The first layer of a multilayer network consists of the input units, denoted by x_i. These units are known as independent variables in the statistical literature. The last layer contains the output units, denoted by y_k. In statistical nomenclature these units are known as the dependent or response variables. (Note that Figure 3 has more than one output unit. This configuration is common in neural network classification applications where there are more than two classes. The outputs represent membership in one of the k classes. The multiple outputs could represent a multivariate response function, but this is not common in practice.) All other units in the model are called hidden units, h_j, and constitute the hidden layers. The feedforward network can have any number of hidden layers with a variable number of hidden units per layer. When counting layers, it is common practice not to count the input layer because it does not perform any computation, but simply passes data on to the next layer. So a network with an input layer, one hidden layer, and an output layer is termed a two layer network.
3.2 Backpropagation Derivation

The backpropagation algorithm is a method to find weights for a multilayered feedforward network. The development of the backpropagation algorithm is primarily responsible for the recent resurgence of interest in neural networks. One of the reasons for this is that it has been shown that a two-layer feedforward neural network with a sufficient number of hidden units can approximate any continuous function to any degree of accuracy (Cybenko, 1989). This makes multilayered feedforward neural networks a powerful modeling tool. As mentioned, Figure 3 shows a schematic of a feedforward, two-layered, neural network. Given a set of input patterns (observations) with associated known outputs (responses), the objective is to train the network, using supervised learning, to estimate the functional relationship between the inputs and outputs. The network can then be used to model or predict a response corresponding to a new input pattern. This is similar to the regression problem where we have a set of independent variables (inputs) and dependent variables (outputs), and we want to find the relationship between the two. To accomplish the learning, some form of an objective function or performance metric is required. The goal is to use the objective function to optimize the weights. The most common performance metric used in neural networks (although not the only one, see (Solla, Levin & Fleisher, 1988)) is the sum of squared errors
defined as:

E = (1/2) Σ_{p=1}^{n} Σ_{k=1}^{O} (y_pk − ŷ_pk)²   (5)

where the subscript p refers to the patterns (observations) with a total of n patterns, the subscript k to the output unit with a total of O output units, y_pk is the observed response, and ŷ_pk is the model (predicted) response. This is a sum of the squared difference between the predicted response and the observed response, summed over all outputs and observations (patterns). In the simple case of predicting a single outcome, k = 1 and Equation 5 reduces to
E = (1/2) Σ_{p=1}^{n} (y_p − ŷ_p)²

the usual function to minimize in least-squares regression. To understand backpropagation learning, we will start by examining how information is first passed forward through the network. The process starts with the input values being presented to the input layer. The input units perform no operation on this information, but simply pass it on to the hidden units. Recalling the simple computational structure of a unit expressed in Equation 3, the input into the j-th hidden unit is:
h_pj = Σ_{i=1}^{N} w_ji x_pi   (6)

Here, N is the total number of input nodes, w_ji is the weight from input unit i to hidden unit j, and x_pi is the value of the i-th input for pattern p. The j-th hidden unit applies an activation function to its net input and outputs:

v_pj = g(h_pj) = 1 / (1 + e^(−h_pj))   (7)
(Assuming g(·) is the sigmoid function defined in Equation 4.) Similarly, output unit k receives a net input of:

f_pk = Σ_{j=1}^{M} W_kj v_pj   (8)

where M is the number of hidden units, and W_kj represents the weight from hidden unit j to output k. The unit then outputs the quantity:

ŷ_pk = g(f_pk) = 1 / (1 + e^(−f_pk))   (9)
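The forward pass of Equations 6 through 9 can be sketched in Python as follows (a minimal illustration for a single pattern; the variable names mirror the notation above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w, W):
    # w[j][i] is the weight from input i to hidden unit j (Equation 6);
    # W[k][j] is the weight from hidden unit j to output k (Equation 8).
    # Returns the hidden outputs v (Equation 7) and predictions y_hat (Equation 9).
    v = [sigmoid(sum(w_ji * x_i for w_ji, x_i in zip(row, x))) for row in w]
    y_hat = [sigmoid(sum(W_kj * v_j for W_kj, v_j in zip(row, v))) for row in W]
    return v, y_hat

v, y_hat = forward([1.0, 0.0], [[0.2, -0.4], [0.7, 0.1]], [[0.5, -0.5]])
```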
(Notice that the threshold value has been excluded from the equations. This is because the threshold can be accounted for by adding an extra unit to the layer and fixing its value at 1. This is similar to adding a column of ones to the design matrix in regression problems to account for the intercept.) Recall that the goal is to find the set of weights w_ji, the weights connecting the input units to the hidden units, and W_kj, the weights connecting the hidden units to the output units, that minimize our objective function, the sum of squared errors in Equation 5. Equations 6-9 demonstrate that the objective function, Equation 5, is a function of the unknown weights w_ji and W_kj. Therefore, the partial derivative of the objective function with respect to a weight represents the rate of change of the objective function with respect to that weight (it is the slope of the objective function). Moving the weights in a direction down this slope will result in a decrease in the objective function. This intuitively suggests a method to iteratively find values for the weights. We evaluate the partial derivative of the objective function with respect to the weights and then move the weights in a direction down the slope, continuing until the error function no longer decreases. Mathematically this is represented as:

ΔW_kj = −η ∂E/∂W_kj   (10)

(The term η is known as the learning rate and simply scales the step size. The common practice in neural networks is to have the user enter a fixed value for the learning rate at the beginning of the problem.) We will first derive an expression for calculating the adjustment for the weights connecting the hidden units to the outputs, W_kj. Substituting Equations 6 through 9 into Equation 5 yields:
E=
O
p
N
M
ji
kj
pk
@ y^ @f @E = ? @@E ? @W y^ @f @W pk
kj
but
pi
i
j
k
pk
pk
pk kj
@E = ?(y ? y^ ) @ y^ @ y^ = g0 (f ) = y^ (1 ? y^ ) @f pk
pk
pk
pk
pk
pk
12
pk
pk
(11)
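The convenient form of the derivative in Equation 11, g′(f) = ŷ(1 − ŷ), is easy to verify numerically; a quick Python check against a central finite difference:

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def g_prime(z):
    # Equation 11: the sigmoid's derivative expressed through its own output
    y = g(z)
    return y * (1.0 - y)

# Central difference approximation at an arbitrary test point
z, h = 0.7, 1e-6
numeric = (g(z + h) - g(z - h)) / (2.0 * h)
```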
(for the sigmoid in Equation 4) and

∂f_pk/∂W_kj = v_pj

Substituting these results back into Equation 10, the change in the weights from the hidden units to the output units, ΔW_kj, is given by:

ΔW_kj = −η[(−1)(y_pk − ŷ_pk)]ŷ_pk(1 − ŷ_pk)v_pj   (12)

This gives us a formula to update the weights from the hidden units to the output units. The weights are updated as:

W_kj^(t+1) = W_kj^t + ΔW_kj

This equation implies that we take the weight adjustment in Equation 12 and add it to our current estimate of the weight, W_kj^t, to obtain an updated weight estimate, W_kj^(t+1).
Before moving on to the calculations for the weights from the inputs to the hidden units, there are several interesting points to be made about Equation 12. 1) Given that the range of the sigmoid function is 0 ≤ g(·) ≤ 1, from Equation 11 we can see that the maximum change of the weight will occur when the output ŷ_pk is 0.5. In classification problems the objective is to have the output be either 1 or 0, so an output of 0.5 represents an undecided case. If the output is at saturation (0 or 1) then the weight will not change. Conceptually, this means that units which are undecided will receive the greatest change to their weights. 2) As in other function approximation problems, the (y_pk − ŷ_pk) term influences the weight change. If the predicted response matches the desired response then no weight changes occur. 3) If k = 1, one outcome, then Equation 12 simplifies to

ΔW_j = −η[(−1)(y_p − ŷ_p)]ŷ_p(1 − ŷ_p)v_pj

To update the weights, w_ji, connecting the inputs to the hidden units, we will follow similar logic as Equation 12. Thus

Δw_ji = −η ∂E/∂w_ji
Expanding using the chain rule

∂E/∂w_ji = Σ_{k=1}^{O} (∂E/∂ŷ_pk)(∂ŷ_pk/∂f_pk)(∂f_pk/∂v_pj)(∂v_pj/∂h_pj)(∂h_pj/∂w_ji)   (13)

where ∂E/∂ŷ_pk and ∂ŷ_pk/∂f_pk are given in Equation 11. Also

∂f_pk/∂v_pj = W_kj

∂v_pj/∂h_pj = g′(h_pj) = v_pj(1 − v_pj)

and

∂h_pj/∂w_ji = x_pi

Substituting back into Equation 13 reduces to:

Δw_ji = η Σ_{k=1}^{O} (y_pk − ŷ_pk)ŷ_pk(1 − ŷ_pk)W_kj v_pj(1 − v_pj)x_pi   (14)

Note that there is a summation over the number of output units. This is because each hidden unit is connected to all the output units. So if the weight connecting an input unit to a hidden unit changes, it will affect all the outputs. Again, notice that if the number of output units equals one, then

Δw_ji = η(y_p − ŷ_p)ŷ_p(1 − ŷ_p)W_j v_pj(1 − v_pj)x_pi
3.3 Backpropagation Algorithm

Given the above equations, we proceed to put down the processing steps needed to compute the change in network weights using backpropagation learning. (Note: this algorithm is adapted from (Hertz et al., 1991).)

1. Initialize the weights to small random values. This puts the output of each unit around 0.5.

2. Choose a pattern p and propagate it forward. This yields values for v_pj and ŷ_pk, the outputs from the hidden layer and output layer.

3. Compute the output errors: δ_pk = (y_pk − ŷ_pk)g′(f_pk).

4. Compute the hidden layer errors: δ_pj = Σ_{k=1}^{O} δ_pk W_kj v_pj(1 − v_pj).

5. Compute ΔW_kj = η δ_pk v_pj and Δw_ji = η δ_pj x_pi to update the weights.

6. Repeat the steps for each pattern.

It is easy to see how this could be implemented in a computer program. (Note that there are many commercial and shareware products that implement the multilayered feedforward neural network. Check the frequently asked questions, FAQ, posted monthly to the comp.ai.neural-nets newsgroup for a listing of these products. The web site http://wwwipd.ira.uka.de/~prechelt/FAQ/neural-net-faq.html also maintains this list.)
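As an illustration only (not a production implementation), the six steps above can be sketched in pure Python for the single-output case. The helper names and the stopping rule (a fixed number of passes through the data) are our choices:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, W):
    # Forward pass (Equations 6-9); the extra input fixed at 1 plays the
    # role of the threshold unit mentioned earlier.
    xb = list(x) + [1.0]
    v = [sigmoid(sum(wj[i] * xb[i] for i in range(len(xb)))) for wj in w]
    vb = v + [1.0]
    return sigmoid(sum(Wj * vj for Wj, vj in zip(W, vb)))

def train_backprop(patterns, targets, n_hidden, eta=0.5, epochs=1000, seed=0):
    rng = random.Random(seed)
    n_in = len(patterns[0])
    # Step 1: initialize the weights to small random values.
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, y in zip(patterns, targets):
            # Step 2: choose a pattern and propagate it forward.
            xb = list(x) + [1.0]
            v = [sigmoid(sum(wj[i] * xb[i] for i in range(n_in + 1))) for wj in w]
            vb = v + [1.0]
            y_hat = sigmoid(sum(Wj * vj for Wj, vj in zip(W, vb)))
            # Step 3: output error, delta_k = (y - y_hat) g'(f).
            d_out = (y - y_hat) * y_hat * (1.0 - y_hat)
            # Step 4: hidden layer errors, delta_j = delta_k W_kj v_j (1 - v_j).
            d_hid = [d_out * W[j] * v[j] * (1.0 - v[j]) for j in range(n_hidden)]
            # Step 5: weight changes, scaled by the learning rate eta.
            for j in range(n_hidden + 1):
                W[j] += eta * d_out * vb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w[j][i] += eta * d_hid[j] * xb[i]
        # Step 6: repeat for each pattern (here, for a fixed number of epochs).
    return w, W
```

Training on a small data set and comparing the sum of squared errors before and after training shows the error decreasing, as the derivation predicts.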
4 Regression and Neural Networks
In this section we compare and contrast neural network models with regression models. Most scientists have some experience using regression models, and by explaining neural networks in relation to regression analysis, some deeper understanding can be achieved.
4.1 Discussion

Regression is used to model a relationship between variables. The covariates (independent variables, stimuli) are denoted as x_i. These variables are either under the experimenter's control or are observed by the experimenter. The response (outcome, dependent) variable is denoted as y. The objective of regression is to predict or classify the response, y, from the covariates, x_i. Sometimes the investigator also uses regression to test hypotheses about the functional relationship between the response and the stimulus.
The general form of the regression model is (this is adopted from McCullagh and Nelder (1989)):

η = Σ_{i=0}^{N} β_i x_i

with

η = h(µ),  E(y) = µ

Here h(·) is the link function, the β_i are the coefficients, N is the number of covariate variables, and β_0 is the intercept. This model has three components:

1. A random component of the response variable y, with mean µ and variance σ².

2. A systematic component that relates the stimuli x_i to a linear predictor η = Σ_{i=0}^{N} β_i x_i.

3. A link function that relates the mean to the linear predictor, η = h(µ).

The generalized linear model reduces to the familiar multiple linear regression if we believe that the random component has a normal distribution with mean zero and variance σ², and we specify the link function h(·) as the identity function. The model is then:

y_p = β_0 + Σ_{i=1}^{N} β_i x_pi + ε_p

where ε_p ~ N(0, σ²). The objective of this regression problem is to find the coefficients β_i that minimize the sum of squared errors,

E = Σ_{p=1}^{n} ( y_p − β_0 − Σ_{i=1}^{N} β_i x_pi )²
To nd the coecients, we must have a data set that includes the independent variable and associated known values of the dependent variable (akin to a training set in supervised learning in neural networks). This problem is equivalent to a single layer feedforward neural network (Figure 4). The independent variables correspond to the inputs of the neural network and the response variable y to the output. The coecients, ~ , correspond to the weights in the neural network. The activation function is the identity 16
[Figure 4: A single layer network equivalent to the linear regression model: inputs x_1, ..., x_N with weights β_1, ..., β_N, plus a unit fixed at 1 with weight β_0, feeding one output unit.]
A neural network with hidden units, in contrast, does not assume any functional relationship and lets the data define the functional form; in a sense, we let the data speak for itself. This is the basis of the power of neural networks. As mentioned in Section 3, a two layer, feedforward network with sigmoid activation functions is a universal approximator because it can approximate any continuous function to any degree of accuracy. Thus, a neural network is extremely useful when you do not have any idea of the functional relationship between the dependent and independent variables. If you had an idea of the functional relationship, you would be better off using a regression model. An advantage of assuming a functional form is that it allows different hypothesis tests. Regression, for example, allows you to test the functional relationships by testing the individual coefficients for statistical significance. Also, because regression models tend to be nested, two different models can be tested to determine which one models the data better. A neural network never reveals the functional relationship; it is buried in the summing of the sigmoidal functions. Other difficulties with neural networks involve choosing the parameters, such as the number of hidden units, the learning parameter η, the initial starting weights, the cost function, and deciding when to stop training. The process of determining appropriate values for these variables is often an experimental process where different values are used and evaluated. The problem with this is that it can be very time consuming, especially when you consider the fact that neural networks typically have slow convergence rates.
4.2 Examples

To demonstrate the use of neural networks, two simulated examples and one real example are presented. Both of the simulated examples involve one independent variable and a continuous valued output variable. The first example is a simple linear problem. The true relationship is:

y = 30 + 10x.

Fifty samples were obtained by generating 50 random error terms from a normal distribution with mean zero and standard deviation of 50 and adding these to the y values of the data set. The values of x were
and yielded coecients of ^ = ?10:69 and ^ = 10:52. (Note the error in the intercept term, remember that we were interested in modeling the data between x values of 20 and 100, so an accurate intercept was not important.) The dotted line in Figure 5 shows the regression line. The problem was also solved with a two layer, feedforward neural network with four hidden units and sigmoid activation functions. The network was trained using backpropagation. This network con guration is too general for this problem; however, we wanted to setup the problem with the notion that the underlying functional relationship is unknown. This gives us an idea how the network will perform when we think the 19
problem is complex when in fact it is simple. The solid line in Figure 5 shows the results from the neural network; the dashed line is the true functional relationship. Both the regression curve and the neural network curve are close to the true curve. A separate validation set of 100 values was generated and applied to the regression model and the neural network model. The sum of squared errors on this validation set was 0.290 for the regression model and 0.303 for the neural network model, so the predictive performance of the two models was essentially equal. The linear regression model was much faster and easier to develop and is easy to interpret. The neural network did not assume any functional form for the relationship between the dependent and independent variables and was still able to derive an accurate curve. As a second, more difficult example, consider the function

    y = 20 exp(−8.5x) [ln(0.9x + 0.2) + 1.5].
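The recipe for this example (generate noisy data from the true curve, then fit a one-hidden-layer sigmoid network by backpropagation on a squared-error criterion) can be sketched as below. This is our illustration, not the authors' code: the functional form is our reading of the equation above, and the seed, learning rate, and epoch count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Assumed true curve: y = 20 exp(-8.5 x) [ln(0.9 x + 0.2) + 1.5]
    return 20 * np.exp(-8.5 * x) * (np.log(0.9 * x + 0.2) + 1.5)

# Fifty x values in (0, 1) with N(0, 0.05) noise added to y, as in the text.
x = rng.uniform(0, 1, 50)
y = f(x) + rng.normal(0, 0.05, 50)

H = 8                                   # hidden units, as in the text
W1 = rng.normal(0, 0.5, (H, 2))         # hidden-layer weights (bias column first)
W2 = rng.normal(0, 0.5, H + 1)          # output weights (bias first)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(xi):
    h = sigmoid(W1 @ np.array([1.0, xi]))       # hidden activations
    return W2 @ np.concatenate([[1.0], h]), h   # linear output unit

def sse():
    return sum((forward(xi)[0] - yi) ** 2 for xi, yi in zip(x, y))

before = sse()
eta = 0.02                              # learning rate (our choice)
for epoch in range(2000):
    for xi, yi in zip(x, y):
        out, h = forward(xi)
        err = out - yi                  # derivative of (1/2)(out - y)^2
        # Backpropagation: output-layer then hidden-layer gradient steps.
        delta = err * W2[1:] * h * (1 - h)
        W2 -= eta * err * np.concatenate([[1.0], h])
        W1 -= eta * np.outer(delta, np.array([1.0, xi]))
after = sse()
```

After training, the sum of squared errors drops well below its value at the random initialization, mirroring the fitted curve shown in Figure 6.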
Fifty random x values between 0 and 1 were used to generate data. A random noise component consisting of normally distributed error terms with mean 0 and standard deviation 0.05 was added to each y value. A neural network identical to the one used in the previous example, except that it had eight hidden units, was trained on these data. The results are plotted in Figure 6. The true curve is dotted and the neural network estimate is solid. This problem would be difficult to model using regression techniques and would require some estimates of variable transformations or the use of a smoothing method (Hastie & Tibshirani, 1990). Most of these transformations assume a power or logarithmic transformation, while a combination of both would have been more appropriate in this case. The third example is from the Department of Veterans Affairs Continuous Improvement in Cardiac Surgery Study (Hammermeister et al., 1994). The outcome variable is a binary variable indicating operative death status 30 days after coronary artery bypass grafting surgery. If a patient is still alive 30 days after surgery, he or she is coded as a 0, otherwise as a 1. The objective is to obtain predictions of a patient's probability of death given his or her individual risk factors. The twelve independent variables in the study are patient risk factors and include variables such as age, priority of surgery, and history of prior heart surgery. Both neural network and logistic regression models were built on 21,435 observations (a 2/3 learning sample) and
[Figure 6 about here]

Figure 6: Example of modeling a nonlinear function. This figure demonstrates the ability of a neural network to model a complex functional relationship. The filled dots represent the data, the dashed curve is the true function, and the solid line is the predicted function from the neural network.

validated on the remaining 10,657 observations (a 1/3 testing sample). The neural network was a feedforward network with one hidden layer comprising four hidden units. It was trained using the backpropagation algorithm. The discrimination and calibration of these two models were compared on the validation set. The c-index was used to measure discrimination (how well the predicted binary outcomes can be separated (Hanley & McNeil, 1982)). No statistically significant difference was found between the two c-indices (neural network c-index = 0.7168 and logistic regression c-index = 0.7162, p > 0.05). The Hosmer-Lemeshow test (Hosmer Jr. & Lemeshow, 1989) was applied to the validation data to test the calibration of the models (calibration measures how close the predicted values are to the observed values). The p-value for the logistic regression model was 0.34, indicating a good fit to the data, while the p-value for the neural network was
0.08, indicating a lack of fit. In summary, the logistic regression model had comparable predictive power and better calibration than the neural network. The reason is that the majority of the independent variables are binary, which means that their contribution to the model must be on a linear scale; trying to model them in a nonlinear manner will not improve the predictive performance of the model. In addition, a simple check of all two-variable interactions revealed nothing of significance. Thus, with no interactions or nonlinearities, the linear additive structure of logistic regression is appropriate for these data. As indicated in Section 1, the literature is full of examples illustrating the improved performance of neural networks over traditional techniques. But as the last example illustrates, this is not always true, and the practitioner must be aware of the appropriate model for his or her problem. Neural networks can be valuable when we do not know the functional relationship between the independent and dependent variables: they use the data to determine that relationship. Since they are data dependent, their performance improves with sample size. Regression performs better when theory or experience indicates an underlying relationship, and may also be a better alternative for extremely small sample sizes.
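The c-index used to compare the two models above can be computed directly from predicted probabilities: it is the proportion of (death, survivor) pairs in which the model assigns the higher predicted risk to the patient who died, with ties counting one half. A minimal sketch (ours, not the study's code; the example data are hypothetical):

```python
def c_index(y, p):
    """c-index for binary outcomes.
    y: 0/1 observed outcomes; p: predicted probabilities of y = 1."""
    pairs = concordant = 0.0
    for yi, pi in zip(y, p):
        for yj, pj in zip(y, p):
            if yi == 1 and yj == 0:      # one event paired with one non-event
                pairs += 1
                if pi > pj:
                    concordant += 1      # higher risk assigned to the event
                elif pi == pj:
                    concordant += 0.5    # ties count one half
    return concordant / pairs

# Hypothetical predictions: a c-index of 1.0 means perfect separation.
print(c_index([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2]))  # 1.0
```

For binary outcomes this quantity coincides with the area under the ROC curve, which is why a value near 0.5 indicates no discrimination and a value near 1 indicates good discrimination.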
5 Conclusions
Neural networks originally developed out of an interest in modeling the human brain. They have, however, found applications in many different fields of study. This paper has focused on the use of neural networks for prediction and classification problems. We specifically restricted our discussion to multilayered feedforward neural networks. Parallels with statistical terminology used in regression models were developed to aid the reader in understanding neural networks. Neural network notation is different from that of statistical regression analysis, but most of the underlying ideas are the same. For example, instead of coefficients, the neural network community uses the term weights, and instead of observations they use patterns.
Backpropagation is an algorithm that can be used to determine the weights of a network designed to solve a given problem. It is an iterative procedure that uses a gradient descent method. The cost function used is normally a squared error criterion, but functions based on maximum likelihood are also used. This paper gives a detailed derivation of the backpropagation algorithm based on existing bodies of work and outlines how to implement it on a computer. Two simple synthetic problems were presented to demonstrate the advantages and disadvantages of multilayered feedforward neural networks. These networks do not impose a functional relationship between the independent and dependent variables. Instead, the functional relationship is determined by the data in the process of finding values for the weights. The advantage of this process is that the network is able to approximate any continuous function, so we do not have to guess the functional form. The disadvantage is that the network is difficult to interpret. In linear regression models, we can interpret the coefficients in relation to the problem. Another disadvantage of the neural network is that convergence to a solution can be slow and depends on the network's initial conditions. A third example on real data revealed that traditional statistical tools still have a role in analysis and that the use of any tool must be thought about carefully. Neural networks can be viewed as a nonparametric regression method. A large number of claims have been made about the modeling capabilities of neural networks, some exaggerated and some justified. As statisticians, it is important to understand the capabilities and potential of neural networks. This article is intended to build a bridge of understanding for the practitioner and interested reader.
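The iterative step summarized above can be written compactly (this is the standard gradient-descent form with a squared-error criterion; the notation is ours and need not match the paper's earlier derivation):

    w(t+1) = w(t) − η ∂E/∂w,  with  E = (1/2) Σ_p (y_p − ŷ_p)²,

where η is the learning rate, the sum runs over the training patterns, and backpropagation supplies the partial derivatives ∂E/∂w layer by layer via the chain rule.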
References

Abu-Mostafa, Y.S. (1986). Neural networks for computing. Pages 1-6 of Proceedings of the American Institute of Physics Meeting.
Anderson, J.A. (1970). Two models for memory organization. Mathematical Biosciences, 8, 137-160.
Baxt, W.G. (1990). Use of an artificial neural network for data analysis in clinical decision-making: the diagnosis of acute coronary occlusion. Neural Computation, 2, 480-489.
Baxt, W.G. (1991). Use of an artificial neural network for the diagnosis of myocardial infarction. Annals of Internal Medicine, 115, 843-848.
Buntine, W.L. & Weigend, A.S. (1991). Bayesian back-propagation. Complex Systems, 5, 603-643.
Cowan, J.D. & Sharp, D.H. (1988). Neural nets. Quarterly Reviews of Biophysics, 21, 365-427.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303-314.
Duda, R.O. & Hart, P.E. (1973). Pattern Classification and Scene Analysis. New York: Wiley.
Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
Flury, B. & Riedwyl, H. (1990). Multivariate Statistics: A Practical Approach. London: Chapman & Hall.
Freeman, J.A. (1994). Simulating Neural Networks with Mathematica. Reading, MA: Addison-Wesley.
Fujita, H., Katafuchi, T., Uehara, T. & Nishimura, T. (1992). Application of artificial neural network to computer-aided diagnosis of coronary artery disease in myocardial SPECT bull's-eye images. The Journal of Nuclear Medicine, 33(2), 272-276.
Gorman, R.P. & Sejnowski, T.J. (1988). Analysis of hidden units in a layered network to classify sonar targets. Neural Networks, 1, 75-89.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding, I: parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121-134.
Grossberg, S. & Carpenter, G.A. (1983). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54-115.
Hammermeister, K.E., Johnson, R., Marshall, G. & Grover, F.L. (1994). Continuous assessment and improvement in quality of care: a model from the Department of Veterans Affairs cardiac surgery. Annals of Surgery, 219, 281-290.
Hanley, J.A. & McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36.
Harman, H.H. (1976). Modern Factor Analysis, 3rd edition. Chicago: University of Chicago Press.
Hastie, T. & Tibshirani, R. (1990). Generalized Additive Models. London and New York: Chapman and Hall.
Hecht-Nielsen, R. (1987). Counterpropagation networks. Applied Optics, 26, 4979-4984.
Hertz, J., Krogh, A. & Palmer, R.G. (1991). Introduction to the Theory of Neural Computation. Santa Fe Institute Studies in the Sciences of Complexity, Vol. 1. Redwood City, CA: Addison-Wesley.
Hinton, G.E. & Sejnowski, T.J. (1986). Learning and relearning in Boltzmann machines. Chapter 7 of Parallel Distributed Processing, Vol. 1.
Hopfield, J.J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 2554-2558.
Hopfield, J.J. & Tank, D.W. (1985). "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52, 141-152.
Hosmer Jr., D.W. & Lemeshow, S. (1989). Applied Logistic Regression. New York: John Wiley & Sons.
Hruschka, H. (1993). Determining market response functions by neural network modeling: a comparison to econometric techniques. European Journal of Operational Research, 66, 27-35.
Hutchinson, J.M. (1994). A Radial Basis Function Approach to Financial Time Series Analysis. Ph.D. thesis, Massachusetts Institute of Technology.
Jordan, M.I. (1989). Serial order: a parallel, distributed processing approach. In: J. Elman & D. Rumelhart (eds.), Advances in Connectionist Theory: Speech. Hillsdale: Erlbaum.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59-69.
Kohonen, T. (1989). Self-Organization and Associative Memory, 3rd edition. Berlin: Springer-Verlag.
Lippmann, R.P. (1987). An introduction to computing with neural nets. IEEE ASSP Magazine, April, 4-22.
Lippmann, R.P. (1989). Review of neural networks for speech recognition. Neural Computation, 1, 1-38.
McClelland, J.L., Rumelhart, D.E. & the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 2: Psychological and Biological Models. Cambridge: MIT Press.
McCullagh, P. & Nelder, J.A. (1989). Generalized Linear Models. London: Chapman & Hall.
McCulloch, W.S. & Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
Minsky, M.L. & Papert, S.A. (1969). Perceptrons. Cambridge: MIT Press.
Moore, B. (1988). ART1 and pattern clustering. In: D. Touretzky, G. Hinton & T. Sejnowski (eds.), Proceedings of the 1988 Connectionist Models Summer School. San Mateo: Morgan Kaufmann.
Morrison, D.F. (1976). Multivariate Statistical Methods, 2nd edition. New York: McGraw-Hill.
Neter, J., Wasserman, W. & Kutner, M.H. (1990). Applied Linear Statistical Models. Homewood, IL: Richard D. Irwin.
Poli, R., Cagnoni, S., Livi, R., Coppini, G. & Valli, G. (1991). A neural network expert system for diagnosing and treating hypertension. Computer, March, 64-71.
Qian, N. & Sejnowski, T.J. (1988). Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202, 865-884.
Ripley, B.D. (1992). Neural networks and related methods for classification. Submitted to the Royal Statistical Society Research Section.
Ripley, B.D. (1993). Statistical aspects of neural networks. Pages 40-123 of O. Barndorff-Nielsen, J. Jensen & W. Kendall (eds.), Networks and Chaos: Statistical and Probabilistic Aspects. Chapman & Hall.
Rosenblatt, F. (1962). Principles of Neurodynamics. Washington, D.C.: Spartan.
Sarle, W.S. (1994). Neural networks and statistical methods. In: Proceedings of the 19th Annual SAS Users Group International Conference.
Smith, M. (1993). Neural Networks for Statistical Modeling. New York: Van Nostrand Reinhold.
Solla, S.A., Levin, E. & Fleisher, M. (1988). Accelerated learning in layered neural networks. Complex Systems, 2, 625-639.
Studenmund, A.H. (1992). Using Econometrics: A Practical Guide. New York: HarperCollins.
Tesauro, G. (1990). Neurogammon wins Computer Olympiad. Neural Computation, 1, 321-323.
Thompson, R.F. (1985). The Brain: A Neuroscience Primer. New York: W.H. Freeman.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85-100.
Werbos, P.J. (1991). Links between artificial neural networks (ANN) and statistical pattern recognition. Pages 11-31 of I. Sethi & A. Jain (eds.), Artificial Neural Networks and Statistical Pattern Recognition: Old and New Connections. Elsevier Science Publishers.
Werbos, P.J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University.
Widrow, B. (1962). Generalization and information storage in networks of Adaline neurons. Pages 435-461 of M. Yovitz, G. Jacobi & G. Goldstein (eds.), Self-Organizing Systems. Washington, D.C.: Spartan.
Williams, R.J. & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280.
Willshaw, D.J. & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London B, 194, 431-445.
Willshaw, D.J., Buneman, O.P. & Longuet-Higgins, H.C. (1969). Non-holographic associative memory. Nature, 222, 960-962.
Wu, F.Y. & Yen, K.K. (1992). Application of neural network in regression analysis. In: Proceedings of the 14th Annual Conference on Computers and Industrial Engineering.