Computational Statistics (2001)
© Physica-Verlag 2001
Neural Network Model Selection for Financial Time Series Prediction

Francesco Virili(1) and Bernd Freisleben(2)

(1) Department of Information Systems, University of Siegen, Holderlinstrasse 3, D-57068 Siegen, Germany, e-mail: [email protected]
(2) Department of Electrical Engineering & Computer Science, University of Siegen, Holderlinstrasse 3, D-57068 Siegen, Germany, e-mail: [email protected]
Summary

Can neural network model selection be guided by statistical procedures such as hypothesis tests, information criteria and cross-validation? Recently, Anders and Korn (1999) proposed five neural network model specification strategies based on different statistical procedures. In this paper, we use and adapt the Anders-Korn framework to find appropriate neural network models for financial time series prediction. The most important new issue in this context is the specification of the dynamic structure of the models, i.e. the selection of the lagged values of the input time series. A linear model is built with full dynamic structure, then its possible nonlinear extensions are tested using a statistical procedure inspired by the Anders-Korn approach. Promising results are obtained with an application to predict the monthly time series of mortgage loans purchased in The Netherlands.

Keywords: neural networks, model selection, time series
Introduction

Neural networks are a very appealing class of predictive models, given their ability to approximate any given function (Hornik et al. 1989). The cost of their flexibility is the presence of a considerable number of parameters to set, both in the choice of the architecture (topologies, activation functions, etc.) and in the specification of dimension and complexity (number of layers, hidden units, regularization, etc.). The presence of many free parameters makes neural network models difficult to analyze and interpret. On the other hand, linear regression models are much simpler to build and understand, but they may be too poor to approximate complex data generating processes. Neural networks are natural extensions of linear regression models, which can be regarded as simple single-layer feed-forward neural networks with skip connections and no hidden units (see any good textbook on neural networks, such as (Bishop 1995)). With linear regression, model selection reduces to the correct estimation of the 'weights' (i.e. the parameters) of the regression line: given the usual assumptions on the sampling variability, well-established statistical theory allows the modeler to test the significance of each parameter. Introducing a dynamic structure, as in Box-Jenkins linear time series models (Box et al. 1994), well-defined statistics are still available for model selection, and neural networks can again be viewed as a nonlinear extension of the basic linear dynamic time series models: refer to (Granger and Terasvirta 1993), chapter 7, and (Franses 1998), chapter 8, for theory; see (Faraway and Chatfield 1998) for an application. What is troublesome with neural network models is the violation of the usual assumptions about linearity and sampling distributions, which invalidates much of the effort to build a statistically meaningful battery of tests for model selection and assessment.
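The equivalence noted above can be checked directly: with no hidden units, a feed-forward network consisting only of skip connections and a bias unit computes the same affine map as a linear regression. A minimal numpy sketch (the data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Ordinary least squares with an intercept column.
Xc = np.hstack([np.ones((100, 1)), X])
w, *_ = np.linalg.lstsq(Xc, y, rcond=None)

# A "network" with only skip connections (input -> output) and a bias
# input unit applies exactly the same affine map, so the least-squares
# weights of the two models coincide.
def skip_net(X, w):
    return w[0] + X @ w[1:]
```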
Some steps were made in this direction, especially by (Refenes 1995) (part 1), (Refenes et al. 1996) and recently by Zapranis and Refenes (1999), who built on the first contributions by Sarle (1994, 1995) and Ripley (1993, 1995), among others. Nevertheless, as we can read in the conclusions of Zapranis and Refenes' work, "... while in this book we have defined and used a number of residual diagnostics, they need to be reported with their respective distributions." (Zapranis and Refenes 1999, page 159). Unfortunately, such distributions are generally not available! An interesting approach to model selection of feed-forward neural networks, based on statistical procedures, was recently proposed by Anders and Korn (1999). It is based on two stages: the first for the selection of the optimal number of hidden units with a fully connected neural network; the second for the reduction of the number of input connections. One of the five strategies discussed by the authors is based on cross-validation at both stages. In this paper, we apply it to a forecasting model of the mortgage loans demand in The Netherlands, aimed at producing multi-step forecasts of the target time series, based on present and lagged
values of five candidate regressors. Given the high number of potential neural network inputs, a modification of the original procedure, based on a preselection in the linear model space, is suggested. The finally selected model is linear; we conclude that, in spite of the quite good out-of-sample predictions obtained, there is still room for further research on appropriate selection strategies for nonlinear models. The paper is organized as follows. Section 2 describes the model selection strategies proposed by Anders and Korn. Our proposed model selection procedure is presented in Section 3. Section 4 discusses the application of our model selection approach to the time series of the monthly mortgage loans volume purchased in The Netherlands. Our results are presented in Section 5. Section 6 concludes the paper and outlines areas for further research.
Anders-Korn model selection strategies

According to Anders and Korn (1999), feed-forward neural network model selection choices may be restricted to two basic aspects: 1) the most suitable number of hidden units; 2) the most suitable number of input connections. Their approach closely resembles the "heuristic search over the space of perceptron architectures" (Moody 1994) originally proposed by Utans and Moody (1991). A correct choice of hidden units ensures that any nonlinearities present are captured by the model, without introducing irrelevant hidden units. On the other hand, eliminating unnecessary input connections is a way to control the model complexity and to exclude irrelevant inputs. Unfortunately, there is no way to build a test that checks both aspects at the same time. The authors observe that some methods which are widely used to control insignificant connections, such as pruning, regularization and stopped training, present a degree of subjectivity, and that pruning methods can be generalized by the computation of the corresponding Wald test statistic, as previously noticed, e.g., by Moody (1994). The Wald statistic is one of the appropriate replacements for the F-test when the linearity or the error normality assumptions are violated. For a good and very readable introduction to the Wald (W), Likelihood Ratio (LR) and Lagrange Multiplier (LM) statistics, refer to (Kennedy 1997), chapter 4.5. Another suggested way to check for irrelevant connections is the use of information criteria like AIC and NIC, see (Franses 1998), chapter 3.4, (Stone 1977), (Moody 1994). Both Wald tests and information criteria are not theoretically justified in the presence of irrelevant hidden units, hence it is mandatory to perform hidden unit testing before input connection testing. Hidden unit selection can be based on the nonlinearity test (TLG) proposed by Terasvirta, Lin and Granger (1993) or alternatively on the test proposed by White (1989).
Both are Lagrange Multiplier (LM) type statistics (see above). It is even possible to use AIC or NIC for hidden unit selection, but the nonlinear hidden unit transfer function has to be 'linearized' by means of a Taylor series expansion. Two strategies can be built by sequentially combining the first two methods of hidden unit testing and the Wald statistic for input connections discussed above: 1) White-Wald; 2) TLG-Wald. Two more strategies may be based on the use of AIC and NIC (after linearization of the hidden unit transfer function at the first stage): 3) AIC-AIC; 4) NIC-NIC. The fifth strategy is based on cross-validation (Stone 1974). In the authors' words, "Cross-validation is the most generally applicable strategy for model selection in neural networks since it does not rely on any probabilistic assumptions and is not affected by identification problems" (Anders and Korn 1999, page 316). We do not need to 'linearize' the hidden unit transfer functions and we are not constrained to two stages of sequential tests by theoretical reasons. Moreover, cross-validation compares favorably with other so-called penalty-based methods like Vapnik's GRM and Rissanen's MDL, as shown in (Kearns et al. 1997). On the other hand, cross-validation (CV) is computationally expensive if performed exhaustively for each possible combination of hidden units and input connections, especially for multiple time series models. For example, the number of networks to build for a ten-fold cross-validation with five input variables (monthly time series) and 12 possible lags is greater than 1.1 * 10^19, just to test for the inclusion of the first hidden unit. This number grows exponentially when more units are added. For this reason, the authors propose again to perform a Utans-Moody-like two-stage test based on cross-validation, as for the other strategies above. They start with a fully connected network with 0 hidden units (a linear model), including additional hidden units as long as the CV error is reduced.
The second stage is top-down, with the sequential removal of each input connection, until a minimum of the CV error is reached. The maximum number of networks to build for the example above (one hidden unit) is reduced to 1830. This heuristic approach is simple and appealing, but it has serious drawbacks when applied to small-sample multiple time series models: the number of potential inputs of the fully connected network at the first stage may be too high. The model of the example above with five input variables and 12 lags has 60 potential inputs. A fully connected network with only 2 hidden units would already have 125 weights. Obviously, such a model makes little sense with small samples: with 120 observations there would be about one weight per data point, with a high probability of overfitting. Our approach is based on a first stage of variable preselection in the space of linear models, in order to reduce the complexity of the initial model and the number of potential inputs. In the second stage, a full search over the (now reduced) space of nonlinear models is performed, using cross-validation.
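The counts quoted above can be reproduced in a few lines (a sketch; the factor 10 reflects the ten networks trained in each 10-fold cross-validation):

```python
# 5 monthly input series x 12 lags = 60 candidate inputs.
candidate_inputs = 5 * 12

# Exhaustive search: one 10-fold cross-validation (10 networks) per input
# subset, i.e. per element of the power set of the 60 candidates.
networks_exhaustive = 10 * 2 ** candidate_inputs  # > 1.1e19

# Second stage of the two-stage heuristic: with k inputs left, try removing
# each of the k connections; summed over k = 60 down to 1.
evaluations_backward = sum(range(1, candidate_inputs + 1))  # 1830
```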
Our model selection procedure

The general equation of our model is:

Y(t) = f(Y', X', W', Z', ...) + e(t)   (1)

where Y(t), the current value of the Y target vector, depends on its own past (vector Y') and on the lagged values of other explanatory variable vectors (X', W', Z', ...), plus a random white noise shock e(t). The function f(.) may be linear or nonlinear. In the first stage, we assume f(.) to be linear. Using a linear specification, we can select the inputs with a bottom-up approach similar to that of Swanson and White (1997a, 1997b). They started from basic ARMAX models (ARMA with external inputs, (Box et al. 1994), page 426; cf. (Diebold 1998), chapter 11), adding input variables until the in-sample model performance, measured by SIC, could not be improved anymore. SIC stands for Schwarz Information Criterion; it is commonly used in linear model selection: SIC(k) = n log(RSS/n) + k log(n), where k denotes the number of estimated parameters, n is the number of observations, and RSS is the residual sum of squares (cf., for example, (Franses 1998), page 59). Instead of the ARMAX model, we use vector autoregressive (VAR) models (cf. (Diebold 1998), section 11.6) in order to check for bi-directional relationships among the variables. After the definition of a suitable VAR model, a subset of regressors for the candidate nonlinear regression model is chosen; then the nonlinear model selection is performed using the Anders-Korn cross-validation strategy described above. If the number of inputs is manageable, an exhaustive selection may be performed instead of a two-stage heuristic search.
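As a concrete sketch, SIC can be computed directly from the residual sum of squares of a fitted model (numpy; names are illustrative):

```python
import numpy as np

def sic(rss, n, k):
    """Schwarz Information Criterion: n*log(RSS/n) + k*log(n); lower is better."""
    return n * np.log(rss / n) + k * np.log(n)

# With n fixed, adding one parameter (k -> k+1) costs exactly log(n), so a
# candidate regressor is accepted only if it reduces n*log(RSS/n) by more:
n = 120
penalty = sic(100.0, n, 4) - sic(100.0, n, 3)  # equals log(120)
```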
Application

We applied our model selection procedure to a predictive model of the monthly mortgage loans volume (MV) purchased in The Netherlands. The asset and liability managers of Dutch banks are interested in MV forecasts up to about 18 months ahead for balance sheet optimization and financial risk management. In a preliminary model, described in (Gardin and Virili 1995), a set of five explanatory time series was chosen, but future values of the inputs were used for the prediction of MV. We are now interested in defining a dynamic model to produce forecasts based on the data available at prediction time, i.e. based on lagged values of the inputs. The five variables are inflation rate, interest rate, number of dwellings for sale in the housing market, housing price, and GDP (Gross Domestic Product). The estimation of the VAR model evidenced that among the available time series the only leading indicator for MV was the interest rate. The other explanatory variables turned out to have negligible linear predictive power for MV. We also used state space models, see (Box et al. 1994), section 5.5, to confirm those findings. The choice of preprocessing for MV was based on seasonality and stationarity analysis, as in (Virili and Freisleben 1999, 2000). We used the statistical software package SAS ETS (SAS 1996) to build the VAR and state space models and to perform stationarity tests. The finally selected VAR (and state space) models included MV and the interest rate at lags 1, 3, 5, 6, 12, selected via backward elimination. The results are shown in Figure 1, together with some summary statistics. The parameter estimation is based on the nonlinear Ordinary Least Squares method; see (SAS 1996) for more details.
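A VAR lag-order search driven by SIC, in the spirit of this first selection stage, can be sketched with plain least squares (numpy; this is an illustrative reimplementation, not the SAS/ETS procedure actually used; for a strict comparison one would also hold the estimation sample fixed across lag orders):

```python
import numpy as np

def var_design(data, p):
    """Stack the first p lags of all series as regressors, plus an intercept."""
    n, m = data.shape
    X = np.hstack([data[p - j - 1:n - j - 1] for j in range(p)])
    X = np.hstack([np.ones((n - p, 1)), X])
    return X, data[p:]

def var_sic(data, p):
    """SIC of a VAR(p), computed per equation and summed over equations."""
    X, Y = var_design(data, p)
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ B
    n, k = X.shape
    rss = (resid ** 2).sum(axis=0)
    return float(np.sum(n * np.log(rss / n) + k * np.log(n)))

# Usage: pick the lag order with the smallest SIC on two toy AR(1) series.
rng = np.random.default_rng(0)
data = rng.normal(size=(300, 2))
for t in range(1, 300):
    data[t] += 0.8 * data[t - 1]
best_p = min(range(1, 7), key=lambda p: var_sic(data, p))
```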
[Figure 1 shows the SAS MODEL procedure output for the VAR model: nonlinear OLS parameter estimates with approximate standard errors, t ratios and p-values for the two equations (D12MV93 and DRMLSY) at lags 1, 3, 5, 6 and 12. The summary of residual errors reports R-square values of 0.7207 and 0.2743, Durbin-Watson statistics of 2.077 and 1.879, and 96 observations used (36 missing).]

Figure 1: The VAR model estimated in the first stage of model selection. The five lags (1, 3, 5, 6, 12) were selected via backward elimination of insignificant parameters in the full model.
The choice of preprocessing was made by taking logs of both time series to correct the distribution. For MV, the presence of a seasonal unit root suggested the use of seasonal differences, cf. (Virili and Freisleben 1999, 2000); the interest rate presented a non-seasonal unit root, therefore first differences were used. The analysis of the VAR coefficients confirmed the hypothesis of unidirectional Granger causality, as in (Granger and Newbold 1986), section 7.3: the interest rate is a leading indicator for MV, but not vice versa. In summary, with the first model selection stage described above, we selected 10 inputs: two regressor time series (MV and IR) with five lags each (1, 3, 5, 6, 12). The preprocessed time series have 108 monthly observations each, in the range 01.87-12.95. The last year of observations was not used for the stationarity tests and the VAR and state space model evaluation. The neural network architecture used for the second stage of model selection was the same used by Anders and Korn, except that the hidden units are sigmoid and not tanh. It was implemented with the S-PLUS neural network library of Venables and Ripley (1997). We used standard feed-forward neural networks with one hidden layer, logistic hidden unit activation functions, linear output units and a bias input unit. The number of hidden units was chosen by cross-validation in the interval (0, 3), as shown below. Skip connections were present only with no hidden units (linear models). The training algorithm was based on a quasi-Newton optimizer; a weight decay penalty term, also selected by CV, was present. For the second stage, excluding the last 8 values for out-of-sample evaluation, we used 100 observations for 10-fold cross-validation. The 100 data points were divided into 10 subsets of 10 elements each; in turn, 9 subsets were used for training, the remaining one for the computation of the mean squared error (MSE). The CV error is the average of the 10 MSE measures obtained in each cycle. However, with 100 samples available, 10 inputs may be too many. For example, with 3 hidden units we would have 37 weights, which is more than one parameter for every 3 observations. We decided to limit the number of inputs to 5 and to check all possible architectures from 0 to 3 hidden units.
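The 10-fold scheme just described can be sketched as follows (illustrative numpy code; `fit` and `predict` stand in for the network training and evaluation routines, shown here with an OLS "network" with no hidden units):

```python
import numpy as np

def cv10_error(X, y, fit, predict):
    """10-fold CV: train on 9 subsets, score MSE on the held-out one,
    and average the 10 MSE values."""
    folds = np.array_split(np.arange(len(y)), 10)
    mses = []
    for k in range(10):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        model = fit(X[train], y[train])
        err = y[test] - predict(model, X[test])
        mses.append(float(err @ err) / len(test))
    return float(np.mean(mses))

# Example with a linear model (0 hidden units, skip connections only).
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
fit = lambda X, y: np.linalg.lstsq(np.hstack([np.ones((len(y), 1)), X]), y, rcond=None)[0]
predict = lambda w, X: w[0] + X @ w[1:]
cv_err = cv10_error(X, y, fit, predict)
```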
The maximum number of weights (with 3 hidden units and 5 inputs) was now reduced to 22. The weight decay factor was chosen among the values (0, 0.001, 0.01), as suggested in (Venables and Ripley 1997), page 339. Given that the total number of combinations of 10 variables in class 5 is 252, an exhaustive search in the network space over all the possible input combinations, combined with the choice of weight decay and hidden units, requires 252*3*4 = 3024 cross-validations, which means 30240 neural networks to build. In total, we had to train 30240*5 = 151200 neural networks, which required about 10 hours on a Pentium II 400 MHz. The finally selected model was a 5-0-1 network with 6 weights, trained using a weight decay term of 0.01. It has the following equation:
MV(t) = 0.005 + 0.461 MV(t-1) - 0.271 IR(t-3) - 0.174 IR(t-5) - 0.160 IR(t-6) - 0.261 IR(t-12)   (2)

where MV is the seasonal difference of MV in logs, and IR is the first-differenced interest rate in logs. The inputs were normalized according to (Bishop 1995), page 298, in order to have zero mean and unit variance. The final network was trained on the data set 01.87-06.94. Obviously, no pruning was necessary. The finally selected model evidences the absence of nonlinear structure in the data: the introduction of additional hidden units reduces the training error (Figure 2) but not the cross-validation error (Figure 3), suggesting that the nonlinear units would overfit the data. Weight decay improves generalization, but not enough. Both figures refer to the 'winning' combination of input variables.
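For reference, the selected 5-0-1 network of equation (2) is just an affine map of its five inputs (a sketch; the arguments are assumed to be the preprocessed, normalized series described above):

```python
def predict_mv(mv_l1, ir_l3, ir_l5, ir_l6, ir_l12):
    """Equation (2): one-step prediction of the seasonally differenced log
    mortgage volume from its own first lag and four interest-rate lags."""
    return (0.005 + 0.461 * mv_l1 - 0.271 * ir_l3
            - 0.174 * ir_l5 - 0.160 * ir_l6 - 0.261 * ir_l12)
```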
Figure 2: Training error against number of hidden units and weight decay. Additional hidden units (X axis, hidden units grow from right to left) do reduce the training error.

Figure 4: 18-step in- and out-of-sample predictions of MV, based on equation (2).

Results

In Figure 4, the results of the 18-month predictions based on the interest rate are shown. The mean absolute prediction error in the forecast period is 702, i.e. a mean absolute percentage error of 15.88%. The suggested model selection procedure showed good generalization results, correctly rejecting the hypothesis of nonlinear relationships, which could have led to incorrect specifications and overfitting.

Given that the selected model is linear, the restriction to a maximum of five regressors adopted above could be relaxed, and all the classical linear diagnostics are available to further enhance the model. For example, the analysis of the residuals evidenced the presence of a moderate amount of first-order autocorrelation. For this reason, after collecting data for two additional years (until December 1997), we decided to perform a new model selection in the linear space only, using backward elimination. It was performed with a quite standard top-down strategy, starting with a linear dynamic regression including 26 regressors: the first 13 lags of the interest rate and the first 13 lags of the dependent variable, MV. Then the insignificant regressors, on the basis of the t statistic, were eliminated step by step, starting from those with the lowest significance. The resulting model has the following equation:

MV(t) = 0.055 + 0.256 MV(t-1) + 0.256 MV(t-2) - 0.122 MV(t-12) - 2.464 IR(t-4) - 1.950 IR(t-6) - 1.489 IR(t-11) - 0.393 IR(t-12)   (3)
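The top-down elimination by t statistics can be sketched as follows (numpy; a simplified stand-in for the procedure described above, using |t| >= 2 as a rough 5% rule):

```python
import numpy as np

def ols_t(X, y):
    """OLS estimates and their t ratios."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, beta / se

def backward_eliminate(X, y, names, t_min=2.0):
    """Repeatedly drop the regressor with the smallest |t| until all
    remaining regressors satisfy |t| >= t_min."""
    names = list(names)
    while names:
        beta, t = ols_t(X, y)
        j = int(np.argmin(np.abs(t)))
        if abs(t[j]) >= t_min:
            return names, beta
        X = np.delete(X, j, axis=1)
        names.pop(j)
    return names, np.array([])

# Illustrative data: only x1 actually matters.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)
kept, beta = backward_eliminate(X, y, ["x1", "x2", "x3"])
```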
Figure 3: Cross-validation error against number of hidden units and weight decay. Additional hidden units (X axis, hidden units grow from right to left) do not reduce the cross-validation error, even in the presence of a penalty term.

All the serial correlation, normality and structural stability tests now give correct results, as shown in Figure 6. The model of equation (3) is used in Figure 5 to generate 18-month multi-step predictions for the years 1996-97, given the expected level of the interest rate, which is also visible there (left axis).
Figure 5: 18-step out-of-sample predictions of the final model, based on equation (3). The interest rate time series is shown too, on the left axis.

The mean absolute prediction error in the 18-month out-of-sample multi-step forecasting period is now 1033, that is 12.29%.

[Figure 6 shows the SAS ordinary least squares output for equation (3): parameter estimates with standard errors, t ratios and approximate p-values for the intercept, MV lags 1, 2 and 12, and IR lags 4, 6, 11 and 12; summary statistics (Rsq = 0.7850, Durbin-Watson close to 2.00) and diagnostics including Godfrey's serial correlation test, ARCH tests, a normality test and Chow structural stability tests.]

Figure 6: Estimated values and diagnostics for equation (3).

Conclusions
The application of the original Anders-Korn heuristic strategy based on two-stage cross-validation to our multiple time series model evidences a serious drawback: the number of potential inputs during the first stage may turn out to be too high to build a fully connected network. In our case, even with only two hidden units we would already have a number of parameters greater than the number of observations, with heavy overfitting, high CV error and the consequent rejection of any nonlinear structure. On the other side, to our knowledge, it is not possible to test simultaneously for the number of hidden units and the significance of each input connection without building a model for each combination of hidden units and input connections, which is not feasible. Our approach was based on the preliminary reduction of the number of potential inputs in the linear model space. We built a suitable VAR model, and used only the inputs that came out as significant there (10 out of 60). To further reduce the number of potential inputs from 10 to 5, we tested all the combinations in class 5 of the ten inputs, with 0, 1, and 2 hidden units.
No nonlinear structure was found in the data, and the application of the classic linear diagnostics to the selected model led to the formulation of the final linear choice, which is able to obtain good out-of-sample forecasts. Nevertheless, even if we could explore a good part of the nonlinear model space, we still cannot exclude the presence of nonlinearities; moreover, the originally selected linear model was not optimal. There is obviously much space left for further research: for example, one could notice that the same input variable may have weak predictive power with a linear specification, but stronger within a nonlinear model. Our preselection in the linear space, described in Section 3, did not take this possibility into account, and it could be improved by using cross-validation and nonlinear models also at the variable preselection stage, with backward elimination of regressors based on the cross-validation results. A carefully restricted choice of the cross-validation parameters (number of hidden units, weight decay, restarts, etc.) would be necessary in order to keep the computational requirements within acceptable limits. The use of other methods (e.g. genetic algorithms) to explore the nonlinear model search space may be another interesting line of further investigation.
References

Anders, U. and Korn, O. (1999), Model selection in neural networks, Neural Networks, 12, 309-323.
Bishop, C.M. (1995), Neural networks for pattern recognition, Oxford University Press.
Box, G.E.P., Jenkins, G.M. and Reinsel, G.C. (1994), Time series analysis, forecasting and control, Prentice Hall.
Diebold, F. (1998), Elements of forecasting, International Thomson Publishing.
Faraway, J. and Chatfield, C. (1998), Time series forecasting with neural networks: a comparative study using the airline data, Applied Statistics, 47, 231-250.
Franses, P.H. (1998), Time series models for business and economic forecasting, Cambridge University Press.
Gardin, F. and Virili, F. (1995), Nonlinear modelling of the Dutch mortgage market, Economic & Financial Computing, 5(2), 131-145.
Granger, C.W.J. and Terasvirta, T. (1993), Modelling nonlinear economic relationships, Oxford University Press.
Granger, C.W.J. and Newbold, P. (1986), Forecasting economic time series, Academic Press.
Hornik, K., Stinchcombe, M. and White, H. (1989), Multilayer feedforward networks are universal approximators, Neural Networks, 2, 359-366.
Kearns, M., Mansour, Y., Ng, A.Y. and Ron, D. (1997), An experimental and theoretical comparison of model selection methods, Machine Learning, 27, 7-50.
Kennedy, P. (1997), A guide to econometrics, MIT Press.
Moody, J.E. (1994), Prediction risk and architecture selection for neural networks, in Cherkassky et al., eds., From Statistics to Neural Networks: Theory and Pattern Recognition Applications, Springer.
Refenes, A.-P., ed. (1995), Neural networks in the capital markets, Wiley.
Refenes, A.-P., Zapranis, A.D. and Utans, J. (1996), Neural model identification, variable selection and model adequacy, in Weigend et al., eds., Proceedings of NNCM 96, World Scientific Publishing.
Ripley, B.D. (1995), Statistical ideas for selecting neural networks, in Kappen, B. and Gielen, S., eds., Neural Networks: Artificial Intelligence and Industrial Applications, Springer, 183-190.
Ripley, B.D. (1993), Statistical aspects of neural networks, in Barndorff-Nielsen, O.E., Jensen, J.L. and Kendall, W.S., eds., Chaos and Networks: Statistical and Probabilistic Aspects, Chapman and Hall, 40-123.
Sarle, S. (1995), Stopped training and other remedies for overfitting, in Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360.
Sarle, S. (1994), Neural networks and statistical models, in Proceedings of the 19th Annual SAS Users Group International Conference, Cary (NC), SAS Institute, 1538-1550.
SAS Institute Inc. (1996), SAS/ETS user's guide, ver. 6.12, SAS Institute Inc.
Stone, M. (1974), Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society, 111-147.
Stone, M. (1977), An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion, Journal of the Royal Statistical Society, 44-47.
Swanson, N.R. and White, H. (1997a), A model selection approach to real-time macroeconomic forecasting using linear models and artificial neural networks, Review of Economics and Statistics, 79, 540-550.
Swanson, N.R. and White, H. (1997b), Forecasting economic time series using adaptive versus nonadaptive and linear versus nonlinear econometric models, International Journal of Forecasting, 13, 439-461.
Terasvirta, T., Lin, C.-F. and Granger, C.W.J. (1993), Power of the neural network linearity test, Journal of Time Series Analysis, 14(2), 209-220.
Utans, J. and Moody, J.E. (1991), Selecting neural network architectures via the prediction risk: application to corporate bond rating prediction, in Proceedings of the First International Conference on AI Applications on Wall Street, IEEE Computer Society Press.
Venables, W.N. and Ripley, B.D. (1997), Modern applied statistics with S-PLUS, Springer.
Virili, F. and Freisleben, B. (2000), Nonstationarity and data preprocessing for neural network predictions of an economic time series, in Amari, S.-I., Lee Giles, C., Gori, M. and Piuri, V., eds., Proceedings of IJCNN 2000, Como, Vol. 5, 129-136.
Virili, F. and Freisleben, B. (1999), Preprocessing seasonal time series for improving neural network predictions, in Boethe et al., eds., Proceedings of CIMA 99, Rochester (NY), ICSC Academic Press, 622-628.
White, H. (1989), An additional hidden unit test for neglected nonlinearity in multilayer feedforward networks, in Proceedings of IJCNN, SOS Printing, 451-455.
Zapranis, A.D. and Refenes, A.-P. (1999), Principles of neural model identification, selection and adequacy, Springer.