Proceedings of International Joint Conference on Neural Networks, Atlanta, Georgia, USA, June 14-19, 2009
Neural Networks Cartridges for Data Mining on Time Series
Eduardo Ogasawara, Leonardo Murta, Geraldo Zimbrão, Marta Mattoso
Abstract— Neural networks are one of the techniques used for time series analysis. The performance of neural networks is affected by parameters such as the neural network structure and the quality of data preprocessing. These parameters need to be explored in order to obtain an optimal neural network. However, manually establishing different neural network configurations to select the best ones can be error-prone and time-consuming. This paper proposes the creation of neural network cartridges to systematically empower neural network performance by means of data mining activities that obtain an optimal neural network structure. The experiments conducted in this paper use stock market and exchange rate series, and show that the usage of neural network cartridges can lead to configurations that double the performance of an ad-hoc neural network configuration.
I. INTRODUCTION
Different computational techniques are used for time series forecasting. Both linear models, such as auto-regression (AR) [1] and auto-regressive moving average (ARMA) [2], and non-linear models, such as neural networks [3], are used for this activity. One of the advantages of neural networks is their ability to identify patterns that are not evident in the time series. Nevertheless, their performance is affected by how the setup of the neural network structure is conducted and by how the data is prepared for it. An inadequate setup can lead to conclusions such as those presented in [4], where neural networks had worse performance than both AR and ARMA models working with log return series. Also, this data mining activity of data preparation and neural network structure setup can be very time-consuming.
This work proposes a systematic approach, using neural network cartridges, to empower the data mining process by automatically executing both the data preparation and the exploration of different neural network structures to obtain an optimal neural network for forecasting. A cartridge is a component [5] unit that can be dynamically changed [6].
This paper is organized into four sections besides this introduction. Section 2 presents the neural network cartridges, which are the proposed approach to support the exploration of different neural network configurations using a data mining process. Section 3 presents experimental results using two representative financial series in Brazil: the log return of the Brazilian Petrobras stock, and the U.S. Dollar to Brazilian Real exchange rate. Section 4 presents related work, and Section 5 concludes our paper, drawing some future work.
Eduardo Ogasawara, Geraldo Zimbrão and Marta Mattoso are with the Federal University of Rio de Janeiro, Rio de Janeiro, Brazil (phone: +55-21-2562-8672; e-mail: {ogasawara, zimbrao, marta}@cos.ufrj.br). Leonardo Murta is with the Fluminense Federal University, Niterói, Brazil (phone: +55-21-2629-5647; e-mail: [email protected]). The authors thank CNPq for partially sponsoring the research.
II. NEURAL NETWORKS CARTRIDGES
Usual neural networks, such as feed-forward neural networks [3], are employed to recognize structural patterns. The data mining process for time series, on the other hand, requires the recognition of patterns that evolve through time, leading to an evaluation that considers not only the current value but also its predecessors. So, given an input value r_t that represents the present value, and its n predecessors r_{t-n}, …, r_{t-1}, these values are treated as a lagged memory of order n. Figure 1 presents a typical neural network for time series [3]. It is a feed-forward neural network with backpropagation that uses time lag operators as input entries. The synaptic weights of the network are adjusted to minimize the mean squared error between the neural network output f(t) and the desired known value w(t), which in a time series is r_{t+1}.
Figure 1 - Typical neural network for time series (inputs r_{t-2}, r_{t-1}, r_t; output ≅ r_{t+1})
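To make Figure 1 concrete, the following minimal Python sketch (our own illustrative code, not the paper's implementation; make_lagged_samples and FeedForwardNet are hypothetical names) builds lagged input windows of order n and trains a one-hidden-layer feed-forward network by backpropagation to minimize the mean squared error between f(t) and w(t) = r_{t+1}.

import numpy as np

def make_lagged_samples(series, n_lags):
    # Slide a window of n_lags values over the series; the target is the next value.
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = np.asarray(series[n_lags:])
    return X, y

class FeedForwardNet:
    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, 1))
        self.b2 = 0.0

    def forecast(self, X):
        # Hidden activations are cached for the backpropagation step.
        self.h = np.tanh(X @ self.W1 + self.b1)
        return (self.h @ self.W2).ravel() + self.b2

    def train_step(self, X, y, lr=0.01):
        # One gradient-descent step on the MSE between f(t) and w(t) = r_{t+1}.
        err = self.forecast(X) - y
        grad_out = 2.0 * err / len(y)
        grad_h = np.outer(grad_out, self.W2.ravel()) * (1.0 - self.h ** 2)
        self.W2 -= lr * self.h.T @ grad_out[:, None]
        self.b2 -= lr * grad_out.sum()
        self.W1 -= lr * X.T @ grad_h
        self.b1 -= lr * grad_h.sum(axis=0)
        return float(np.mean(err ** 2))

# Example usage: three lagged inputs, eight hidden neurons.
# X, y = make_lagged_samples(log_returns, n_lags=3)
# net = FeedForwardNet(n_inputs=3, n_hidden=8)
# for _ in range(2000): net.train_step(X, y)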
A typical data mining process [1] using neural networks for time series forecasting can be described in three basic steps:
• Neural network setup (number of input entries, hidden layers, neurons in hidden layers);
• Time series statistical analysis;
• Preparation of sample data for training and cross-validation.
Each of these steps can be performed in different ways, depending on the features required by the specific problem being analyzed. In Software Engineering jargon, this situation is called variants of a product line [7]. Due to that, this paper introduces a very simple but useful product line for data mining on time series. The product line, presented in Figure 2, is composed of four variants, which were derived from the steps mentioned before.
Figure 3 complements Figure 2, presenting an illustrative example of a neural network cartridge architecture derived from the product line. In the product line, each cartridge has a clear notion of component and can be dynamically changed [6], as long as it is compliant with its cartridge family (input data cartridge, neural cartridge, output data cartridge, and statistical cartridge) and with the compatibility rules. This means, for instance, that it is possible to try different input data cartridges, but the output cartridge must be compatible with the input cartridge in use.
Figure 2 – Product line for data mining on time series (the Neural Network Cartridges Controller enforces compatibility rules over four cartridge families and their implementations: statistical, input data, neural, and output data cartridges)
Figure 3 - Neural network with data cartridges (the sample, optimization, and evaluation sets feed the statistical cartridge, which collects µ and σ²; the input data cartridge normalizes values v_i into t_i, the neural cartridge forecasts ≅ t_{i+1}, and the output data cartridge denormalizes it into ≅ v_{i+1})
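The paper does not prescribe a programming interface for the cartridge families, but a minimal sketch of how they and the compatibility rule could look, under our own naming assumptions, is:

from abc import ABC, abstractmethod

class StatisticalCartridge(ABC):
    @abstractmethod
    def analyze(self, series): ...          # classify stationarity, collect statistics
    @abstractmethod
    def remove_outliers(self, series): ...

class InputDataCartridge(ABC):
    name = "abstract"                        # identifier used by the compatibility rule
    @abstractmethod
    def normalize(self, values): ...         # convert v_i into t_i in [-1, 1]

class OutputDataCartridge(ABC):
    compatible_with = ()                     # names of input cartridges it can pair with
    @abstractmethod
    def denormalize(self, values): ...
    @abstractmethod
    def forecast_error(self, forecast, actual): ...

class NeuralCartridge(ABC):
    n_lags = 0                               # number of input entries (exploratory parameter)
    @abstractmethod
    def train_step(self, X, y): ...
    @abstractmethod
    def forecast(self, X): ...

def compatible(input_c, output_c):
    # Compatibility rule: the output cartridge must undo the input cartridge.
    return input_c.name in output_c.compatible_with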
The Neural Network Cartridges Controller, which for simplicity is called the Controller, is responsible for the enactment of the cartridges and for the compatibility verification between cartridges, enforcing the Compatibility Rules. It is also responsible for cartridge exchange, a very important concept for exploring different neural network structures and data normalization algorithms.
The Statistical Cartridge is responsible for analyzing the time series. It verifies whether the series is stationary or non-stationary [8], and is also responsible for outlier removal.
The Input Cartridge is responsible for data normalization, which means converting the time series v_i into a sequence of values t_i that are normalized between -1 and 1.
The Output Cartridge is responsible for data denormalization and must be compatible with the input data cartridge. It is also responsible for the measurement of forecasting errors.
The Neural Cartridge is responsible for the neural network setup. It can be implemented in the form of different neural network algorithms, which are wrapped with metadata for an exploratory benchmark. For example, in the case of feed-forward with backpropagation, this means the indication of exploratory parameters such as the number of input entries, the number of hidden layers, and the number of neurons in each hidden layer.
The next sub-sections detail the Controller and each cartridge variant presented in this product line, and cite some examples of already implemented cartridges, which are called built-in cartridges. It is important to notice that, besides the built-in cartridges that come together with the product line, it is possible to extend the product line via the implementation of additional cartridges.

A. Neural Network Cartridge Controller
The main goal of this component is to enact all cartridges as a recurrent workflow [9] to obtain an optimal neural network for time series forecasting. As presented in Figure 4, the Controller starts the data mining process by invoking the statistical cartridge to execute the statistical analysis of the time series. Based on the collected statistics, the Controller invokes the statistical cartridge to remove outliers from the sample data. The processed sample data is stored in memory for the exploratory phase. After this initial step, the Controller selects input and output cartridges and a neural cartridge, and then invokes the input data cartridge to normalize the sample set used to train and cross-validate. Afterward, the Controller invokes the neural cartridge to execute the training and cross-validation. Training and cross-validation are executed until convergence of the mean squared error on the cross-validation sample set is achieved, or until the error diverges from the best configuration already reached for a fixed number of iterations.
Figure 4 - Neural network cartridge controller (workflow: statistical analysis → outlier removal → input & output cartridge selection → neural network selection → normalization of training set → training & cross-validation → optimization analysis; the loop repeats while configuration exploration is needed, and finishes with a configuration summary)
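Assuming the illustrative interfaces sketched above (including the hypothetical helpers compatible and make_lagged_samples, and neural cartridges exposing n_lags, train_step, and forecast), the Controller's exploratory workflow of Figure 4 could be sketched as follows; the validation split size, patience, and iteration limits are our own assumptions:

import itertools
import numpy as np

def train_until_done(net, X, y, X_val, y_val, patience=50, max_iter=5000, tol=1e-6):
    # Train until the cross-validation MSE converges, or stop once it has not
    # improved on the best value reached so far for `patience` iterations.
    best_mse, best_iter = float("inf"), 0
    for i in range(max_iter):
        net.train_step(X, y)
        mse = np.mean((net.forecast(X_val) - y_val) ** 2)
        if mse < best_mse - tol:
            best_mse, best_iter = mse, i
        elif i - best_iter > patience:
            break
    return best_mse

def explore(sample_set, optimization_set, inputs, outputs, neurals, n_val=100):
    # Enumerate all compatible cartridge combinations and keep the configuration
    # that minimizes the denormalized forecast error on the optimization set.
    best_err, best_conf = float("inf"), None
    for inp, out, net in itertools.product(inputs, outputs, neurals):
        if not compatible(inp, out):          # enforce the compatibility rule
            continue
        X, y = make_lagged_samples(inp.normalize(sample_set), net.n_lags)
        train_until_done(net, X[:-n_val], y[:-n_val], X[-n_val:], y[-n_val:])
        Xo, yo = make_lagged_samples(inp.normalize(optimization_set), net.n_lags)
        err = out.forecast_error(out.denormalize(net.forecast(Xo)),
                                 out.denormalize(yo))
        if err < best_err:
            best_err, best_conf = err, (inp, out, net)
    return best_conf, best_err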
The introduction of an optimization phase is an important contribution of our work. The neural network is submitted to an optimization set that is used to measure the efficiency of its current configuration. For each entry in the optimization set, values are normalized and applied to the neural network. Using the output data cartridge, the forecasted value is denormalized and compared with the actual value of the time series. The error is stored for each configuration used. This optimization phase is repeated while the Controller identifies that there is a different configuration to explore, which means running again the activities of choosing an input-output cartridge, choosing a neural network setup, normalizing the training set, training the neural network, and executing the optimization analysis with a different configuration. After exploring all possible cartridge combinations, an optimal neural network is obtained by identifying the configuration that minimizes a cost function. The criteria used to build this cost function are based on the explorable dimensions presented in each type of cartridge and are detailed in Section II.E.
In the summary phase, the optimal neural network obtained by this data mining workflow is presented, together with some additional information, such as the characterization of the time series as stationary, non-stationary, or pseudo-stationary. This characterization is explained in more detail in the next section, but it is an indication of the necessity of periodical incremental retraining. In the summary phase, a benchmark against some linear models (AR [4], ARMA [4]) and the random walk [4] is also performed. The benchmark against these techniques lends more credibility to the neural network obtained. These linear models need integer parameters: p for AR(p), and p and q for ARMA(p, q), which are quickly obtained by brute-force search using the training set. The search space for each integer value is from 4 to 18. In ARMA, q needs to be lower than or equal to p.
After the optimal neural network is identified, it is ready for use. The forecast is done with the evaluation set (the most recent n periods). The evaluation set is normalized, and the neural network forecast is obtained, denormalized, and presented to the user. Once the real value is known, it is added to the evaluation set, a new forecast is obtained, and the last forecast error is measured. The evolution of the forecast error is monitored against the linear models and also against the random walk.
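The benchmark's brute-force parameter search described above can be sketched as follows, assuming the statsmodels library as one possible AR/ARMA implementation (the paper does not name one); the search ranges follow the text, with q = 0 standing for the pure AR(p) case:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def brute_force_linear(train, p_range=range(4, 19)):
    # Search p (and q <= p for ARMA) by in-sample mean squared error.
    best = (np.inf, None)
    for p in p_range:
        for q in range(0, p + 1):
            if q and q < 4:
                continue                     # the ARMA search space also starts at 4
            res = ARIMA(train, order=(p, 0, q)).fit()
            mse = np.mean(res.resid ** 2)    # in-sample mean squared error
            if mse < best[0]:
                best = (mse, (p, q))
    return best[1]                           # (p, 0) is AR(p); (p, q) is ARMA(p, q)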
B. Statistical Cartridges
The statistical cartridge is responsible for the classification of the time series and also for outlier removal. The time series can be classified as stationary or non-stationary. A time series X = {x_i} is called stationary if the joint distribution of x_{i+j1}, x_{i+j2}, …, x_{i+jk} is independent of i for all indices j1, j2, …, jk and all k ≥ 1 [8]. This is a statistical definition that is difficult to verify, particularly when mathematical knowledge of the phenomenon behind the time series is difficult to obtain. Nevertheless, this verification can be relaxed: if E(x_i) = µ and Var(x_i) = σ²_x < ∞ for all i, and Cov(x_i, x_{i+j}) is independent of i, that is, when a fixed average and a limited variance occur, the time series can be classified as weakly stationary [8]. Figure 5 presents the stationary time series of the log return of the Brazilian Petrobras stock.
Figure 5 - Petrobras log return (sample, optimization, and evaluation sets)
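As a rough illustration of the weak-stationarity conditions above, one could compare means and variances across blocks of the series; this heuristic sketch is ours and is not a formal statistical test:

import numpy as np

def roughly_weakly_stationary(series, n_blocks=4, rel_tol=0.5):
    # Split the series into blocks and check that the block means do not drift
    # (relative to the overall sigma) and that the block variances stay bounded.
    blocks = np.array_split(np.asarray(series), n_blocks)
    means = [b.mean() for b in blocks]
    variances = [b.var() for b in blocks]
    mean_drift = np.ptp(means) / (np.std(series) + 1e-12)
    var_ratio = max(variances) / (min(variances) + 1e-12)
    return mean_drift < rel_tol and var_ratio < 1 + rel_tol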
When the weakly stationary property is not verified, the time series is called non-stationary. Many economic series do not fit stationary domains but, in some cases, it is possible to assume that, for a short period, a non-stationary time series can be considered pseudo-stationary, which means that both the minimum value and the maximum value are limited by constants (λmin, λmax). If the time series is non-stationary, the statistical cartridge collects min(x_i) = λmin and max(x_i) = λmax in the sample set. If these two bounds are respected in the optimization set, the Controller assumes that the time series is pseudo-stationary. If they are not respected, the time series is considered non-stationary. Figure 6 presents the U.S. Dollar to Brazilian Real exchange rate. This time series is pseudo-stationary, since in the sample set the minimum value was 1.20 and the maximum value was 3.95, and in the optimization set neither of these values was trespassed. If the time series is pseudo-stationary, the statistical cartridge is responsible for monitoring whether one of the two constraints (λmin, λmax) is violated. If this occurs, the Controller indicates the necessity of incremental retraining.
Figure 6 - U.S. Dollar to Brazilian Real Exchange Rate (sample, optimization, and evaluation sets)
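A minimal sketch of this pseudo-stationarity classification and retraining trigger, using our own function names, is:

def classify(sample_set, optimization_set):
    # Bounds (lambda_min, lambda_max) come from the sample set; if the
    # optimization set respects them, the series is pseudo-stationary.
    lo, hi = min(sample_set), max(sample_set)
    if all(lo <= v <= hi for v in optimization_set):
        return "pseudo-stationary", (lo, hi)
    return "non-stationary", (lo, hi)

def needs_retraining(new_value, bounds):
    # A later violation of either constraint signals incremental retraining.
    lo, hi = bounds
    return not (lo <= new_value <= hi)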
It is also the responsibility of the statistical cartridge to handle outliers in the sample set. Outlier removal is an important activity for time series analysis. There are many works in this area, such as using linear models to identify outliers [10], but it is important to distinguish between two different situations. The first situation happens when outliers do not affect the boundaries of the distribution of the time series, but merely introduce spurious perturbations in it. In this case, neural networks are robust and are not affected by these spurious perturbations [3]. The second situation occurs when outliers affect the distribution of values of the time series. In this case, the outliers occur at the extreme boundaries of the time series, leading to an incoherent minimum or maximum value, which can also affect the global statistics of the time series. These types of outliers affect the neural network performance. This happens basically because they affect the quality of data normalization, leading to data concentration, which makes the neural network training more difficult and also leads to a poor forecast. The cartridge uses box-plot outlier removal, so it is just necessary to prune data below the first quartile minus 1.5 times the interquartile range and data above the third quartile plus 1.5 times the interquartile range, i.e., outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. If the distribution is similar to a normal distribution, this corresponds to values in [µ − 2.698σ, µ + 2.698σ] and preserves 99.3% of the original sample data. Figure 7 presents the histogram of the Petrobras log return series in the sample set. After applying the outlier removal, valid values were from -0.053 to 0.054, resulting in a pruning of 2.7% of the sample data. Figure 8 presents the histogram of the U.S. Dollar to Brazilian Real exchange rate. In this case, after applying box-plot outlier removal, no data is pruned, since there is no clear evidence of distribution outliers.
Figure 7 – Histogram of Petrobras log return for training period
Figure 8 – Histogram of U.S. Dollar to Brazilian Real Exchange Rate for training period
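The box-plot pruning rule described above can be sketched as follows (illustrative code, not the cartridge's actual implementation):

import numpy as np

def boxplot_prune(values):
    # Prune values outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]; for a roughly normal
    # distribution these fences sit near mu +/- 2.698 sigma, keeping ~99.3%.
    values = np.asarray(values)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = values[(values >= lo) & (values <= hi)]
    return kept, (lo, hi)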
C. Input and Output Cartridges
The input cartridge works in normalizing data for the neural network input entries. The output cartridge picks the forecast value, applies the denormalization function, and monitors the forecast error. An attribute is normalized by scaling its value so that it falls within a specified range, varying from a minimum value to a maximum value. When mining with neural networks, values are normally entered from -1.0 to 1.0. As presented in [1], normalization helps to prevent attributes with initially large ranges from outweighing attributes with initially smaller ranges.
There are many methods for data normalization, as shown in [1], including min-max normalization, z-score normalization, and normalization by decimal scaling. Suppose that min_v and max_v are, respectively, the minimum and the maximum values for attribute V. Min-max normalization maps a value v of V to t in the range [L, H] by computing:

t = (H − L) · (v − min_v) / (max_v − min_v) + L    (1)

In z-score normalization, the values for an attribute V are normalized based on the mean and the standard deviation of V. A value v of V is normalized to t by computing:

t = (v − V̄) / σ_V    (2)

where V̄ and σ_V are, respectively, the mean and the standard deviation of attribute V. This method of normalization is useful when the actual minimum and maximum values of attribute V are unknown. Normalization by decimal scaling normalizes by moving the decimal point of values of attribute V. The number of decimal points moved depends on the maximum absolute value of V. A value v of V is normalized to t by computing:

t = v / 10^j    (3)

where j is the smallest integer such that max(|v|) / 10^j < 1.
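Equations (1) to (3) can be sketched directly as Python functions (illustrative names; the decimal-scaling exponent follows the definition of j above):

import numpy as np

def min_max(v, min_v, max_v, L=-1.0, H=1.0):
    # Eq. (1): map v from [min_v, max_v] into the target range [L, H].
    return (H - L) * (v - min_v) / (max_v - min_v) + L

def z_score(v, mean_v, std_v):
    # Eq. (2): center on the mean and scale by the standard deviation.
    return (v - mean_v) / std_v

def decimal_scaling(v, max_abs_v):
    # Eq. (3): divide by 10^j, with j the smallest integer s.t. max(|v|)/10^j < 1.
    j = int(np.floor(np.log10(max_abs_v))) + 1 if max_abs_v > 0 else 0
    return v / (10 ** j)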