Intelligent Data Analysis 5 (2001) 339–354 IOS Press


Improved financial time series forecasting by combining Support Vector Machines with self-organizing feature map

Francis Eng Hock Tay and Li Juan Cao
Department of Mechanical Engineering, National University of Singapore, 10 Kent Ridge Crescent, 119260, Singapore
E-mail: [email protected]

Received 21 November 2000
Revised 23 December 2000
Accepted 12 January 2001

Abstract. A two-stage neural network architecture constructed by combining Support Vector Machines (SVMs) with self-organizing feature map (SOM) is proposed for financial time series forecasting. In the first stage, SOM is used as a clustering algorithm to partition the whole input space into several disjoint regions. A tree-structured architecture is adopted in the partition to avoid the problem of predetermining the number of partitioned regions. Then, in the second stage, multiple SVMs, also called SVM experts, that best fit each partitioned region are constructed by finding the most appropriate kernel function and the optimal learning parameters of SVMs. The Santa Fe exchange rate and five real futures contracts are used in the experiment. It is shown that the proposed method achieves both significantly higher prediction performance and faster convergence speed in comparison with a single SVM model.

Keywords: Financial time series forecasting, non-stationarity, support vector machines, self-organizing feature map

1. Introduction

In the modeling of financial time series, two of the key problems are that financial time series are noisy and non-stationary [1,2]. The noisy characteristic means that complete information cannot be obtained from the past behavior of financial markets, and the information not included in the model is treated as noise. The noise in the data can lead to overfitting or underfitting, so that the obtained model performs poorly on new data points. The non-stationarity implies that financial time series switch their dynamics between different regions, depending on whether the economy is growing or in recession. In the modeling of financial time series, this leads to gradual changes in the dependency between the input and output variables, and in general it is very hard for a single model to capture such a dynamical input-output relationship.

Over the past few years, neural networks have been widely used for modeling financial time series [3-5]. This is because neural networks have several distinctive advantages, such as non-linearity and a data-driven, assumption-free character, which are unavailable in traditional models. Neural networks are universal function approximators that can map any nonlinear function without a priori


assumptions about the data [6]. They are less susceptible to the model mis-specification problem than most traditional models, and they are more noise-tolerant, having the ability to learn complex systems from incomplete and corrupted data. Neural networks are therefore more powerful in describing the dynamics of financial time series than traditional models.

Recently, the Support Vector Machine (SVM), a novel neural network developed by Vapnik and his co-workers in 1995 [7], has provided another promising tool for financial time series forecasting [8-10]. From its theoretical background, the SVM is proved to be more resistant to the overfitting problem than traditional neural networks. Unlike most traditional neural networks, which implement the Empirical Risk Minimization Principle, SVMs implement the Structural Risk Minimization Principle, which seeks to minimize an upper bound of the generalization error rather than minimize the training error. This induction principle is based on the fact that the generalization error is bounded by the sum of the training error and a confidence interval term that depends on the Vapnik-Chervonenkis (VC) dimension. Established on this principle, SVMs achieve an optimum network structure by striking the right balance between the training error and the VC confidence interval, eventually resulting in better generalization performance than other neural networks. In addition, unlike the training of other networks, which requires non-linear optimization with the danger of getting stuck in local minima, training an SVM is equivalent to solving a linearly constrained quadratic programming problem whose solution is unique and optimal.

SVMs were originally developed for pattern recognition problems [11,12]. Later, with the introduction of the ε-insensitive loss function, they were extended to solve nonlinear regression problems [13-15]. As with traditional neural networks, the successful application of SVMs rests on the condition that the modeled data have a certain uniformity, that is, the data presented to SVMs are generated according to a constant distribution. For financial data with switching dynamics, a single SVM model cannot perform well in capturing the unstructured and dynamical input-output relationship inherent in the data. Moreover, using a single SVM model to learn the financial data is a mismatch, as there are different noise levels in different input regions: before the single SVM model starts to extract features in one region (local underfitting), it may already have over-extracted in another region (local overfitting).

A potential solution to the above problems is the mixture of experts (ME) architecture proposed by Jacobs et al. [16,17]. Inspired by the divide-and-conquer principle, which attacks a complex problem by dividing it into simpler problems whose solutions are combined to yield a solution to the complex problem, the well-known ME uses a set of expert networks and a gating network that cooperate with each other to solve a complex problem. The expert networks are used to solve different input regions, which are softly decomposed from the whole input space by a softmax-based gating network, and the outputs of the expert networks are then combined by the same gating network. The motivation of the ME is that individual expert networks can focus on specific regions and handle them well.
As the ME may also fail to capture the non-stationarity in situations where the input data are not sufficient for determining the output, Muller et al. [18,19] proposed the annealed competition of experts (ACE), in which expert networks compete based on their relative performance rather than on the inputs, as in the ME. Based on the same idea of using different experts for different input regions, Milidiu et al. [20] generalized the ME architecture into a two-stage architecture for handling the non-stationarity of a time series. In the first stage, the Isodata clustering algorithm is used to partition the whole input space into several disjoint regions. Then, in the second stage, a mixture of experts including partial least squares (PLS), K-nearest neighbors (KNN) and carbon copy (CRB) compete to solve each region. For every partitioned region, only the expert that best fits the region is used in the model. By taking

Fig. 1. The proposed two-stage neural network architecture.

this strategy, the proposed method has an adaptive architecture in the sense that whichever expert is the most adequate for a particular region will be selected. Most importantly, the prediction performance can be significantly enhanced by using this two-stage architecture in comparison with using a single model to learn the whole input space.

The idea of generalizing SVMs into the ME architecture has already been proposed by Kwok [21]. His method is a direct way of using SVMs as expert networks, resulting in a weighted quadratic programming problem. As the weights of the weighted quadratic programming are a function of the input vectors, it is very complex in practice to solve this problem. Motivated by Milidiu's work, this paper generalizes the ME architecture to SVMs by using a two-stage neural network architecture for financial time series forecasting. As illustrated in Fig. 1, in the first stage a self-organizing feature map (SOM) is used to partition the input space into several disjoint regions. Then, in the second stage, different SVM experts are constructed to deal with the different input regions, obtained by using different kernel functions and learning parameters of the SVMs. As in Milidiu's work, every constructed SVM expert is the most adequate one for a particular region.

There are two rationales for the proposed method. Firstly, as SOM is an unsupervised clustering algorithm based on competitive learning [22], training data points which have similar characteristics in the input space are classified into the same region, which can then be specifically attacked by an individual SVM expert. Since the partitioned regions have more uniform distributions than the whole input space, it becomes easier for an SVM expert to capture the more stationary input-output relationship in each region. Secondly, different choices of the kernel function in SVMs define different types of feature space, resulting in different solutions [7]. As different partitioned regions have different characteristics, by taking this architecture the SVM experts that best fit particular regions are used in the modeling, by choosing the most appropriate kernel function and the optimal learning parameters of the SVMs. This is very different from a single SVM model, which learns the whole input space globally and thus cannot guarantee that each local input region is learned best.

The proposed method is illustrated experimentally using the Santa Fe Competition exchange rate and five real futures contracts. The experimental results show a great improvement in prediction performance with the proposed method over a single SVM model. Additional advantages are that the proposed method converges faster and uses fewer support vectors.

This paper is organized as follows. In Section 2, we briefly introduce the theories of SVMs and SOM. In Section 3, the detailed architecture of the proposed method is described and a systematic learning algorithm is developed. Section 4 gives the experimental results on both the Santa Fe data and the five real financial data sets, followed by the discussions and conclusions in the last two sections.


2. Methodologies

2.1. Support Vector Machines (SVMs) in regression estimation

Compared to other neural network regressors, SVMs have three distinct characteristics when they are used to estimate the regression function. First, SVMs estimate the regression using a set of linear functions that are defined in a high-dimensional feature space. Second, SVMs define the regression estimation as a problem of risk minimization, where the risk is measured using Vapnik's ε-insensitive loss function. Third, SVMs use a risk function consisting of the empirical error and a regularization term, which is derived from the Structural Risk Minimization Principle. Given a set of data points G = {(x_i, d_i)}_{i=1}^{n} (x_i is the input vector, d_i is the desired value and n is the total number of data patterns), SVMs approximate the function using

    y = f(x) = w φ(x) + b                                                        (1)

where φ(x) is the high-dimensional feature space which is non-linearly mapped from the input space x. The coefficients w and b are estimated by minimizing

    R_SVMs(C) = C (1/n) Σ_{i=1}^{n} L_ε(d_i, y_i) + (1/2) ||w||²                 (2)

    L_ε(d_i, y_i) = |d_i − y_i| − ε   if |d_i − y_i| ≥ ε,   and 0 otherwise      (3)

In the regularized risk function Eq. (2), the first term C (1/n) Σ_{i=1}^{n} L_ε(d_i, y_i) is the empirical error (risk). It is measured by the ε-insensitive loss function Eq. (3), because this loss function provides the advantage of using sparse data points to represent the decision function Eq. (1). The second term (1/2) ||w||², on the other hand, is the regularization term. C is referred to as the regularization constant, and it determines the trade-off between the empirical risk and the regularization term. ε is called the tube size. Both C and ε are user-prescribed parameters. To obtain the estimates of w and b, Eq. (2) is transformed into the primal problem Eq. (4) by introducing the positive slack variables ζ_i and ζ_i*.

Minimize:

    R_SVMs(w, ζ^(*)) = (1/2) ||w||² + C Σ_{i=1}^{n} (ζ_i + ζ_i*)                 (4)

Subject to:

    d_i − w φ(x_i) − b ≤ ε + ζ_i
    w φ(x_i) + b − d_i ≤ ε + ζ_i*
    ζ_i, ζ_i* ≥ 0


Finally, by introducing Lagrange multipliers and exploiting the optimality constraints, the decision function Eq. (1) has the following explicit form [7]:

    f(x, a_i, a_i*) = Σ_{i=1}^{n} (a_i − a_i*) K(x, x_i) + b                     (5)

Lagrange multipliers and support vectors. In Eq. (5), a_i and a_i* are the so-called Lagrange multipliers. They satisfy the equalities a_i · a_i* = 0, a_i ≥ 0 and a_i* ≥ 0, where i = 1, ..., n, and are obtained by maximizing the dual function of Eq. (4), which has the following form:

    R(a_i, a_i*) = Σ_{i=1}^{n} d_i (a_i − a_i*) − ε Σ_{i=1}^{n} (a_i + a_i*)
                   − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} (a_i − a_i*)(a_j − a_j*) K(x_i, x_j)    (6)

with the constraints:

    Σ_{i=1}^{n} (a_i − a_i*) = 0
    0 ≤ a_i ≤ C,  i = 1, 2, ..., n
    0 ≤ a_i* ≤ C,  i = 1, 2, ..., n

Based on the Karush-Kuhn-Tucker (KKT) conditions of quadratic programming, only a number of the coefficients (a_i − a_i*) in Eq. (5) assume non-zero values. The data points associated with them have approximation errors equal to or larger than ε and are referred to as support vectors; that is, they are the data points lying on or outside the ε-bound of the decision function. According to Eq. (5), support vectors are the only elements of the data points used in determining the decision function, as the coefficients (a_i − a_i*) of the other data points are equal to zero. In this sense, the fewer the support vectors, the sparser the representation of the solution.

Kernel function. K(x_i, x_j) is defined as the kernel function. The value of the kernel is equal to the inner product of the two vectors x_i and x_j in the feature space, that is, K(x_i, x_j) = φ(x_i) · φ(x_j). The elegance of using the kernel function is that one can deal with feature spaces of arbitrary dimensionality without having to compute the map φ(x) explicitly. Any function satisfying Mercer's condition [7] can be used as the kernel function. Common examples are the polynomial kernel K(x, y) = (x · y + 1)^d, the Gaussian kernel K(x, y) = exp(−(1/δ²) ||x − y||²), and the two-layer tangent kernel K(x, y) = tanh(β₀ x · y + β₁), where d, δ², β₀ and β₁ are all kernel parameters.

From the implementation point of view, training SVMs is equivalent to solving a linearly constrained quadratic programming (QP) problem with the number of variables twice the number of training data points. The Sequential Minimal Optimization algorithm extended by Scholkopf and Smola [23,24] has proved to be very effective in training SVMs for solving the regression problem.
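To make the roles of the kernel function, C and ε concrete, the following is a minimal sketch of ε-SVR training with validation-based selection of the kernel and learning parameters, assuming scikit-learn is available; the arrays X_train, y_train, X_val and y_val and the candidate grids are hypothetical placeholders, not the settings used in the paper.

```python
# Hedged sketch: epsilon-SVR with kernel and hyper-parameter selection on a
# validation set (assumes scikit-learn; data arrays are hypothetical).
import numpy as np
from sklearn.svm import SVR

def nmse(actual, predicted):
    # Normalized mean squared error (see Table 2).
    return np.mean((actual - predicted) ** 2) / np.var(actual, ddof=1)

def select_svm_expert(X_train, y_train, X_val, y_val):
    best_model, best_score = None, np.inf
    # "poly", "rbf" and "sigmoid" play the roles of the polynomial, Gaussian
    # and two-layer tangent kernels mentioned above.
    for kernel in ("poly", "rbf", "sigmoid"):
        for C in (1.0, 10.0, 100.0):          # illustrative grid for the regularization constant
            for eps in (0.01, 0.05, 0.1):     # illustrative grid for the tube size
                model = SVR(kernel=kernel, C=C, epsilon=eps).fit(X_train, y_train)
                score = nmse(y_val, model.predict(X_val))
                if score < best_score:
                    best_model, best_score = model, score
    return best_model, best_score
```

This validation-driven selection mirrors the procedure described in Section 4.3 for constructing each SVM expert and the single SVM benchmark.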


Fig. 2. (a) An example of SOM. Every neuron has a reference vector m_i that is of the same size as the input vectors and a neighborhood area Λ_c. (b) The neighborhood area decreases monotonically with time (t1 < t2).

2.2. Self-organizing feature map (SOM)

SOM is an unsupervised learning algorithm proposed by Kohonen in 1982 [22]. In SOM, the output neurons are usually arranged into a one- or two-dimensional map in the form of a rectangular or hexagonal lattice, in which each output neuron is connected to all the input neurons (Fig. 2). The competitive learning algorithm is implemented in SOM: for a given input, the neuron that matches it best wins the competition and is allowed to be updated towards the input vector. SOM generalizes the winning neuron to its neighbors on the map, so not only the winning neuron but also its neighbors are updated. To serve the competitive learning algorithm, a reference vector (m) and a neighborhood area (Λ) are defined for every neuron. The reference vector is of the same size as the input vectors and is used to evaluate the closeness between neurons and input vectors; the commonly used criterion is the Euclidean distance. The neighborhood area is defined by a function that is symmetrical and decreases monotonically with the distance of the neurons on the map from the winning neuron. In the implementation, the neighborhood area is chosen to be wide at the beginning of learning and shrinks monotonically during learning; a common example of the neighborhood function is the Gaussian function. The competitive learning algorithm begins by calculating the Euclidean distance between the reference vectors of the neurons and the given input to find the winning neuron, followed by adapting the winning neuron and its neighbors by modifying their reference vectors towards the given input. The learning algorithm of SOM is outlined as follows (a code sketch is given after the steps):

Step 1: Select the size and structure of the network. Initialize the reference vectors {m_i(t)}_{i=1}^{N} with small random values, where N is the total number of neurons in the network and t is the index of the learning step.
Step 2: Present an input vector X(t). Compute the Euclidean distance between X(t) and all neurons: d_i = ||X(t) − m_i(t)||, i = 1, ..., N, where ||·|| is the Euclidean distance taken over the K components of the input vector X(t).
Step 3: Select the i*-th neuron closest to X(t), i.e., the one with d_{i*} = min_{i=1,...,N} d_i.
Step 4: Update the reference vectors of the i*-th neuron and its neighbors by m_i(t + 1) = m_i(t) + η Λ_{ii*}(t) (X(t) − m_i(t)), and otherwise m_i(t + 1) = m_i(t). Here η is the learning rate and Λ_{ii*} = exp(−||r_i − r_{i*}||² / δ(t)²) is the Gaussian neighborhood function, where r_i and r_{i*} are the position vectors of the neighborhood neurons and of the winning neuron respectively, and δ(t) is a width parameter that is gradually decreased.
Step 5: Repeat from Step 2 until the change in the reference vectors is less than a predetermined threshold or the maximum number of iterations is reached.
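The following is a minimal sketch of the competitive-learning loop in Steps 1-5, assuming NumPy and a small one-dimensional map with two output neurons (as used later in the experiments); the function names and the linear schedule for the neighborhood width δ(t) are illustrative assumptions.

```python
# Hedged sketch of the SOM learning algorithm above (assumes NumPy;
# X is a hypothetical (num_points, dim) array of input vectors).
import numpy as np

def train_som(X, num_neurons=2, eta=0.5, delta0=1.0, epochs=100, tol=1e-5):
    rng = np.random.default_rng(0)
    m = rng.normal(scale=0.01, size=(num_neurons, X.shape[1]))    # Step 1: small random reference vectors
    pos = np.arange(num_neurons, dtype=float)                     # neuron positions on the 1-D map
    for t in range(epochs):
        delta = delta0 * (1.0 - t / epochs) + 1e-3                # gradually decreased width delta(t)
        m_old = m.copy()
        for x in X[rng.permutation(len(X))]:
            d = np.linalg.norm(x - m, axis=1)                     # Step 2: Euclidean distances
            winner = int(np.argmin(d))                            # Step 3: winning neuron
            h = np.exp(-((pos - pos[winner]) ** 2) / delta ** 2)  # Gaussian neighborhood
            m += eta * h[:, None] * (x - m)                       # Step 4: update winner and neighbors
        if np.max(np.abs(m - m_old)) < tol:                       # Step 5: stopping criterion
            break
    return m

def assign_regions(X, m):
    # Classify each input vector into the region of its closest reference vector.
    return np.argmin(np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2), axis=1)
```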

SOM is widely used for clustering because, after training, the output neurons of SOM are automatically organized into a meaningful one- or two-dimensional order in which similar neurons are closer to each other than more dissimilar ones in terms of their reference vectors, thus keeping data points that are similar in the input space close together in the output space.

Fig. 3. A typical tree-structured architecture generated by the proposed method. There are 5 non-terminal nodes occupied by SOMs and 6 leaves occupied by SVM experts.

3. Architecture and systematic learning algorithm

The basic idea underlying the proposed method is to use SOM to partition the whole input space into several regions and to use SVM experts to solve these partitioned regions. As there is no prior knowledge about how many regions should be partitioned from the whole input space, the tree-structured architecture proposed by Chen et al. [25,26] is adopted here for the partition; it recursively partitions a large input space into two regions until the partition condition is no longer satisfied. The main advantage of the tree-structured architecture is that it can automatically find a suitable network structure and size for partitioning a large problem without predetermining the number of partitioned regions. As illustrated in Fig. 3, each SOM sits at a non-terminal node of the tree and plays a 'divide' role, heuristically partitioning a large input space into two regions, while each SVM expert sits at a leaf of the tree and plays a 'conquer' role, tackling one partitioned region. For a data set ψ, a terminal node is created and occupied by the data set. A SOM is developed to automatically partition the data set into two regions according to the input space of the data set. If the numbers of training data points in both partitioned regions are larger than a predetermined threshold value Nthreshold (the partition condition), the terminal node for the data set becomes a non-terminal node and is occupied by the SOM, and two new terminal nodes are created and occupied by the two regions. As a result, the data set is partitioned into two non-overlapping regions ψ1 and ψ2, where ψ1 ∩ ψ2 = ∅ and ψ1 ∪ ψ2 = ψ (∅ denotes the empty set). The aforementioned procedure is applied to the partitioned regions until the partition condition, namely that the numbers of training data points in both newly partitioned regions are larger than Nthreshold, is violated in all the regions. Finally, all the terminal nodes of the tree become leaves, and they are occupied by the SVM experts, which are appropriately constructed to deal with each region. A systematic approach for the proposed architecture is outlined as follows.


1. Create a terminal node and place the training data set at it. Set a minimum number of training data points, Nthreshold.
2. Let ψ denote the data set located at the terminal node. Present the input spaces of ψ (the data set ψ without outputs) to a SOM, which will automatically partition ψ into two regions ψ1 and ψ2 (ψ1 ∩ ψ2 = ∅ and ψ1 ∪ ψ2 = ψ).
3. Calculate the numbers of training data points in the two regions, N1 and N2. If both are larger than Nthreshold, change the terminal node into a non-terminal node occupied by the SOM, and create two new terminal nodes occupied by the two regions ψ1 and ψ2. Otherwise, merge the two regions ψ1 and ψ2 and stop.
4. Repeat steps 2 and 3 until the partition can no longer proceed in any of the regions.
5. Choose an appropriate SVM expert for each region and place the experts at the terminal nodes (leaves) of the tree.

For an unknown data point in testing, it is first classified into one of the partitioned regions by the SOMs along the path down to a leaf of the tree; its output is then produced by the corresponding SVM expert. Similarly, the validation set is also classified into the partitioned regions and used to choose the optimal parameters of the SVM experts. A code sketch of this procedure is given below.
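The sketch below illustrates the recursive divide-and-conquer procedure, reusing the hypothetical train_som, assign_regions and select_svm_expert helpers sketched in Section 2; the dictionary-based tree representation is an illustrative choice rather than the authors' implementation, and validation subsets are assumed to be non-empty.

```python
# Hedged sketch of the tree-structured partitioning (steps 1-5 above).
def build_tree(X, y, X_val, y_val, n_threshold):
    som = train_som(X, num_neurons=2)                 # step 2: a SOM splits the region in two
    labels = assign_regions(X, som)
    sizes = [(labels == k).sum() for k in (0, 1)]
    # Partition condition (step 3): both regions must hold more than n_threshold points.
    if min(sizes) <= n_threshold:
        expert, _ = select_svm_expert(X, y, X_val, y_val)   # leaf: fit the best SVM expert (step 5)
        return {"expert": expert}
    val_labels = assign_regions(X_val, som)
    children = [build_tree(X[labels == k], y[labels == k],
                           X_val[val_labels == k], y_val[val_labels == k],
                           n_threshold)
                for k in (0, 1)]                            # step 4: recurse on both regions
    return {"som": som, "children": children}

def predict(tree, x):
    # Traverse the SOM nodes down to a leaf, then use its SVM expert.
    while "expert" not in tree:
        k = int(assign_regions(x.reshape(1, -1), tree["som"])[0])
        tree = tree["children"][k]
    return float(tree["expert"].predict(x.reshape(1, -1))[0])
```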
4. Experiments

4.1. Data sets and data preprocessing

The first financial time series examined in the experiment is the currency exchange rate taken from the Santa Fe Time Series Prediction Analysis Competition, held during the fall of 1990 under the auspices of the Santa Fe Institute [27,28]. It contains 10 segments of 3000 points each, representing the high-frequency exchange rate between the Swiss franc and the US dollar from August 7, 1990 to April 18, 1991. In the experiment, the data points are resampled using a sliding window with a width of 20 data points, reducing the size of the time series to 1500 data points, as illustrated in Fig. 4.

Five real futures contracts collated from the Chicago Mercantile Market are also used in the experiment. They are the Standard & Poor's 500 stock index futures (CME-SP), the United States 30-year government bond (CBOT-US), the United States 10-year government bond (CBOT-BO), the German 10-year government bond (EUREX-BUND) and the French government stock index futures (MATIF-CAC40). The daily closing prices are used as the data sets.

For each data set, the original price is transformed into a 5-data-point relative difference in percentage of price (RDP). As mentioned by Thomason [29], there are four advantages in applying this transformation. The most prominent one is that the distribution of the transformed data becomes more symmetrical and follows a normal distribution more closely, as illustrated in Fig. 5. This modification of the data distribution improves the predictive power of the neural network. The input variables are constructed from four lagged RDP values based on 5-data-point periods (RDP-5, RDP-10, RDP-15 and RDP-20) and one transformed price (EMA15), which is obtained by subtracting a 15-day exponential moving average from the price. The subtraction is performed to eliminate the trend in price, as the ratio of the maximum value to the minimum value is about 1.5:1 in all six data sets. The optimal length of the moving average is not critical, but it should be longer than the forecasting horizon of five data points [30]. EMA15 is used to retain as much as possible of the information contained in the original price, since applying the RDP transform to the original price may remove some useful information.


Fig. 4. The Santa Fe exchange rates sampled every 20 data points.


Fig. 5. Histograms of (a) the original exchange rate and (b) RDP+5. The RDP+5 values have a more symmetrical distribution that follows a normal distribution more closely.

The output variable RDP+5 is obtained by first smoothing the price with a 3-day exponential moving average, because applying a smoothing transform to the dependent variable generally enhances the prediction performance of neural networks [30]. The calculations of all the indicators are given in Table 1.

The long left tail in Fig. 5(b) indicates that there are outliers in the data set. Since outliers may make it difficult or time-consuming to arrive at an effective solution for the SVMs, RDP values beyond the limits of ±2 standard deviations are treated as outliers and are replaced with the closest marginal values.


Table 1
Input and output variables

Indicator          Calculation
Input variables
  EMA15            p(i) − EMA_15(i)
  RDP-5            (p(i) − p(i−5)) / p(i−5) × 100
  RDP-10           (p(i) − p(i−10)) / p(i−10) × 100
  RDP-15           (p(i) − p(i−15)) / p(i−15) × 100
  RDP-20           (p(i) − p(i−20)) / p(i−20) × 100
Output variable
  RDP+5            (p(i+5) − p(i)) / p(i) × 100, where p(i) = EMA_3(i)

EMA_n(i) is the n-day exponential moving average of the i-th data point, and p(i) is the price of the i-th data point.
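A minimal sketch of computing the Table 1 indicators, assuming pandas and a hypothetical series of daily closing prices; the exponential-moving-average smoothing implied by span=15 and span=3 is an assumption about how EMA_15 and EMA_3 are computed.

```python
# Hedged sketch of the Table 1 indicators (assumes pandas; `price` is a
# hypothetical pd.Series of closing prices ordered in time).
import pandas as pd

def build_indicators(price: pd.Series) -> pd.DataFrame:
    ema15 = price.ewm(span=15, adjust=False).mean()
    ema3 = price.ewm(span=3, adjust=False).mean()
    feats = pd.DataFrame({
        "EMA15": price - ema15,                           # de-trended price
        "RDP-5": price.pct_change(5) * 100,
        "RDP-10": price.pct_change(10) * 100,
        "RDP-15": price.pct_change(15) * 100,
        "RDP-20": price.pct_change(20) * 100,
        # RDP+5: 5-step-ahead relative difference of the EMA3-smoothed price.
        "RDP+5": (ema3.shift(-5) - ema3) / ema3 * 100,
    })
    return feats.dropna()
```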

The other preprocessing technique is data scaling. All the data points are scaled into the range [−0.9, 0.9], as the data contain both positive and negative values. For each data set, the entire set of data points is partitioned into three parts according to the time sequence: the first part is used for training, the second part for validation (used to select the optimal parameters of the SVM experts), and the last part for testing. In the Santa Fe exchange rate time series there are a total of 970 data points in the training set, 200 data points in the validation set and 300 data points in the test set, while in the five real futures time series there are a total of 907 data points in the training set and 200 data points in each of the validation set and the test set.
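A minimal sketch of this preprocessing (outlier replacement at ±2 standard deviations, scaling into [−0.9, 0.9], and the chronological split), reusing the hypothetical feature table from the build_indicators sketch; the per-column min-max scaling is an illustrative choice, since the paper does not state the exact scaling formula.

```python
# Hedged sketch of the preprocessing and chronological split described above.
def preprocess(feats, n_train, n_val):
    clipped = feats.copy()
    for col in clipped.columns:
        mu, sd = clipped[col].mean(), clipped[col].std()
        clipped[col] = clipped[col].clip(mu - 2 * sd, mu + 2 * sd)   # replace outliers by marginal values
    lo, hi = clipped.min(), clipped.max()
    scaled = -0.9 + 1.8 * (clipped - lo) / (hi - lo)                 # scale into [-0.9, 0.9]
    train = scaled.iloc[:n_train]
    val = scaled.iloc[n_train:n_train + n_val]
    test = scaled.iloc[n_train + n_val:]
    return train, val, test
```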

4.2. Performance measures

The prediction performance is evaluated using the following statistical metrics: the normalized mean squared error (NMSE), the mean absolute error (MAE), the directional symmetry (DS) and the weighted directional symmetry (WDS). The definitions of these criteria are given in Table 2. NMSE and MAE measure the deviation between actual and predicted values; the smaller their values, the closer the predicted time series is to the actual one. DS gives the correctness of the predicted direction of RDP+5 as a percentage, with larger values indicating a better predictor. WDS measures both the magnitude of the prediction error and the direction: it penalizes errors associated with an incorrectly predicted direction and rewards those associated with a correctly predicted direction. The smaller the value of WDS, the better the forecasting performance in terms of both magnitude and direction.

4.3. Experiment results

The performance of the proposed method is evaluated on the aforementioned six financial data sets, namely the Santa Fe exchange rate, CME-SP, CBOT-US, CBOT-BO, EUREX-BUND and MATIF-CAC40, with a single SVM model used as a benchmark. The SOM software used in the experiment is taken directly from the Matlab 5.1 neural network toolbox. Each SOM used here has only two output neurons, representing two categories. After training by randomly presenting the input spaces of the training data set, the SOM automatically classifies the training data set into two regions according to the winning neuron. The value of Nthreshold is chosen experimentally and can vary across the six data sets due to their different characteristics.


Table 2
Performance metrics and their calculations

NMSE:  NMSE = 1/(δ² n) Σ_{i=1}^{n} (a_i − p_i)²,  with δ² = 1/(n−1) Σ_{i=1}^{n} (a_i − ā)²

MAE:   MAE = 1/n Σ_{i=1}^{n} |a_i − p_i|

DS:    DS = 100/n Σ_{i=1}^{n} d_i,  where d_i = 1 if (a_i − a_{i−1})(p_i − p_{i−1}) ≥ 0 and 0 otherwise

WDS:   WDS = Σ_{i=1}^{n} d'_i |a_i − p_i| / Σ_{i=1}^{n} d''_i |a_i − p_i|,
       where d'_i = 0 if (a_i − a_{i−1})(p_i − p_{i−1}) ≥ 0 and 1 otherwise,
       and d''_i = 1 if (a_i − a_{i−1})(p_i − p_{i−1}) ≥ 0 and 0 otherwise

a_i and p_i are the actual and predicted values, and ā is the mean of the actual values.
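A minimal sketch of the four metrics in Table 2, assuming NumPy arrays a and p of actual and predicted values; note that the directional terms are computed over consecutive pairs, so the sums run over n − 1 elements.

```python
# Hedged sketch of the Table 2 performance metrics (assumes NumPy).
import numpy as np

def performance_metrics(a, p):
    nmse = np.mean((a - p) ** 2) / np.var(a, ddof=1)
    mae = np.mean(np.abs(a - p))
    same_dir = (a[1:] - a[:-1]) * (p[1:] - p[:-1]) >= 0      # correctly predicted direction
    ds = 100.0 * np.mean(same_dir)
    err = np.abs(a[1:] - p[1:])
    wds = np.sum(err[~same_dir]) / np.sum(err[same_dir])     # penalize wrong, reward correct direction
    return {"NMSE": nmse, "MAE": mae, "DS": ds, "WDS": wds}
```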

The value of Nthreshold used and the number of partitioned regions in each data set are given in Table 3, together with the number of training data points in each partitioned region and its inter-class distance. Obviously, the inter-class distance in each partitioned region is much smaller than that of the whole input space, which demonstrates the clustering characteristic of SOM. For each region, one SVM expert is constructed. To guarantee that the constructed SVM expert is the best one for a particular region, the polynomial kernel, the Gaussian kernel and the two-layer tangent kernel are all tested in the experiment. Moreover, for each kernel function, the optimal values of the kernel parameters, C and ε are chosen based on the smallest value of NMSE on the validation set, and the SVM with the kernel function, kernel parameters, C and ε that produce the smallest validation NMSE is chosen as the final expert. To ensure the best generalization performance of the single SVM model, similar procedures are used to select its kernel function (which could be the polynomial kernel, the Gaussian kernel or the two-layer tangent kernel), the kernel parameters, C and ε. The Sequential Minimal Optimization algorithm extended by Scholkopf and Smola for solving the regression problem is implemented in this experiment, and the program is developed in VC++. The results of the proposed method and the single SVM model are given in Table 4. It can be observed that the values of NMSE, MAE and WDS on the test set are much smaller for the proposed method than for the single SVM model, meaning that the proposed method gives a smaller deviation between predicted and actual values. Furthermore, the value of DS is larger for the proposed method, which indicates greater consistency in the predicted direction of RDP+5. The results are consistent across all six data sets. A paired t-test is performed to determine whether there is a significant difference between the two methods based on the NMSE of the test set [31]. The calculated t-value is listed in Table 4. The result shows


Table 3
The number of partitioned regions, the number of training data points (n) and the inter-class distance (d) in each partitioned region, and the value of Nthreshold used for each data set. For each data set, the first pair is the original (whole) set, followed by the (n, d) pairs of the partitioned regions.

Santa Fe (Nthreshold = 21): whole set (970, 0.7517); regions: (24, 0.2328), (26, 0.2556), (30, 0.3239), (23, 0.2743), (38, 0.2168), (33, 0.2317), (28, 0.2943), (24, 0.2664), (44, 0.2044), (32, 0.2423), (39, 0.2519), (38, 0.1785), (39, 0.2113), (26, 0.2300), (32, 0.2584), (30, 0.2684), (40, 0.1909), (34, 0.2119), (24, 0.2533), (43, 0.2495), (33, 0.2106), (34, 0.2779), (31, 0.2178), (25, 0.3313), (23, 0.2951), (21, 0.2379), (21, 0.2842), (22, 0.2456), (26, 0.3366), (28, 0.3176), (23, 0.3153), (36, 0.2228), ...

CME-SP (Nthreshold = 23): whole set (907, 0.8297); regions: (34, 0.3302), (34, 0.2688), (32, 0.3096), (34, 0.3422), (33, 0.3172), (43, 0.2798), (46, 0.5315), (45, 0.4585), (29, 0.4702), (32, 0.3293), (26, 0.3980), (42, 0.4812), (45, 0.3818), (48, 0.4121), (29, 0.2719), (29, 0.3281), (30, 0.3489), (30, 0.4216), (34, 0.3091), (36, 0.3239), (23, 0.4047), (30, 0.3135), (45, 0.3130), (28, 0.3683), (25, 0.3449), (45, 0.3111)

CBOT-US (Nthreshold = 17): whole set (907, 0.8307); regions: (31, 0.3178), (29, 0.2875), (27, 0.3562), (24, 0.3355), (39, 0.4219), (19, 0.3350), (20, 0.3011), (29, 0.2966), (20, 0.3944), (30, 0.3287), (37, 0.3288), (32, 0.3248), (26, 0.2988), (17, 0.2317), (37, 0.3099), (29, 0.3881), (28, 0.3118), (26, 0.3263), (31, 0.3551), (39, 0.3285), (33, 0.2877), (30, 0.3156), (27, 0.3130), (34, 0.3148), (34, 0.2342), (23, 0.2553), (29, 0.2749), (30, 0.2939), (21, 0.3953), (24, 0.2817), (19, 0.2650), (33, 0.2841)

EUREX-BUND (Nthreshold = 20): whole set (907, 0.8532); regions: (35, 0.3149), (39, 0.4778), (40, 0.5033), (30, 0.3157), (29, 0.4483), (32, 0.2324), (40, 0.2598), (28, 0.5056), (34, 0.2833), (32, 0.2668), (38, 0.2941), (35, 0.5699), (37, 0.2299), (33, 0.3233), (38, 0.2759), (37, 0.2888), (26, 0.4107), (26, 0.3076), (21, 0.2154), (20, 0.2288), (39, 0.2361), (35, 0.4592), (22, 0.2363), (27, 0.3023), (37, 0.3103), (30, 0.2962), (22, 0.2424), (20, 0.2195), (25, 0.3448)

CBOT-BO (Nthreshold = 17): whole set (907, 0.8278); regions: (33, 0.3474), (30, 0.3388), (27, 0.3591), (19, 0.3372), (27, 0.2761), (23, 0.4354), (29, 0.3116), (31, 0.3622), (22, 0.2367), (19, 0.2628), (33, 0.2754), (21, 0.3607), (33, 0.3102), (35, 0.2891), (38, 0.3355), (24, 0.2655), (17, 0.2789), (24, 0.4538), (44, 0.3624), (23, 0.3730), (30, 0.3441), (22, 0.2825), (20, 0.3396), (37, 0.4718), (35, 0.3239), (32, 0.2999), (30, 0.3023), (33, 0.3119), (35, 0.2653), (17, 0.2268), (23, 0.2200), (21, 0.4084), (20, 0.4279)

MATIF-CAC40 (Nthreshold = 18): whole set (907, 0.6918); regions: (33, 0.2264), (28, 0.2202), (26, 0.2388), (31, 0.2567), (36, 0.2723), (27, 0.2688), (35, 0.2319), (31, 0.3265), (20, 0.2947), (18, 0.3290), (24, 0.2161), (23, 0.3155), (37, 0.2085), (38, 0.2298), (29, 0.2671), (28, 0.2354), (25, 0.2647), (19, 0.2936), (32, 0.2389), (25, 0.2608), (38, 0.3288), (27, 0.2524), (32, 0.2271), (31, 0.2029), (28, 0.2672), (27, 0.2698), (35, 0.2534), (32, 0.2234), (30, 0.2501), (27, 0.2539), (35, 0.1912)

that the proposed method outperforms the single SVM model at the α = 0.25% significance level for a one-tailed test. In addition to the performance criteria, the CPU time used and the number of converged support vectors for the two methods are reported in Table 5. The time spent finding the solution is much less for the proposed method than for the single SVM model, and there are fewer converged support vectors in the proposed method than in the single model. The predicted and actual values of RDP+5 for the Santa Fe exchange rate are illustrated in Fig. 6(a). The figure shows that the proposed method forecasts the actual values more closely and captures the turning points better than the single SVM model. The absolute prediction error for the Santa Fe exchange rate is illustrated in Fig. 6(b), which shows that most of the time the prediction error is smaller for the proposed method than for the single SVM model. The same conclusions apply to the other data sets.
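A minimal sketch of the paired one-tailed t-test on the six test-set NMSE values, assuming SciPy; the two lists below are the NMSE columns reported in Table 4.

```python
# Hedged sketch of the paired t-test on per-data-set NMSE (assumes SciPy).
from scipy import stats

nmse_single  = [1.1551, 0.9228, 1.0408, 1.0655, 1.1010, 1.0830]   # single SVM, Table 4
nmse_som_svm = [1.1081, 0.7817, 0.8878, 0.9633, 0.9497, 0.8558]   # SVMs+SOM, Table 4

t_stat, p_two_sided = stats.ttest_rel(nmse_single, nmse_som_svm)
p_one_sided = p_two_sided / 2      # one-tailed test: the proposed method has lower NMSE
print(t_stat, p_one_sided)         # t should be close to the 5.6034 reported above
```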


Table 4
Results on the test set for the proposed method (SVMs+SOM) and the single SVM model

Data set       SVMs+SOM                              Single SVM
               NMSE    MAE     DS     WDS            NMSE    MAE     DS     WDS
Santa Fe       1.1081  0.3067  53.84  0.7639         1.1551  0.3155  43.47  0.8997
CME-SP         0.7817  0.2149  56.28  0.6948         0.9228  0.2347  50.75  0.9580
CBOT-US        0.8878  0.2974  54.77  0.7511         1.0408  0.3310  41.20  0.9295
CBOT-BO        0.9633  0.2803  50.75  1.0421         1.0655  0.2971  38.19  1.4460
EUREX-BUND     0.9497  0.3008  51.75  0.9159         1.1010  0.3248  35.67  1.1737
MATIF-CAC40    0.8558  0.3466  53.76  0.9965         1.0830  0.3942  36.68  1.1721

t-value: 5.6034 > t(0.0025, 5) = 4.7730

Table 5
The CPU time used and the number of support vectors in the two methods

Data set       SVMs+SOM                      Single SVM
               CPU time (s)   # of SV        CPU time (s)   # of SV
Santa Fe        6             826              42           858
CME-SP         15             796            1992           805
CBOT-US         7             794            1335           816
CBOT-BO         9             781              45           820
EUREX-BUND     10             789             111           803
MATIF-CAC40     5             788              24           819

5. Related work

Our proposed method can be considered similar in spirit to the 'local learning' algorithm proposed in [32]. In the local learning algorithm, for every test data point, a fixed number of training data points closest to it in the input space are found and used to train a neural network, so as many neural networks are built as there are test data points. In our proposed method, similar test data points are first grouped together, and then the training data points closest to them in the input space are used to train a neural network. Thus, in our proposed method, the number of neural networks that need to be developed is reduced, and the number of training data points for different test data points can vary according to the given training data set. However, when the number of test data points is reduced to one, our proposed method is equivalent to the local learning algorithm.

6. Conclusion

A two-stage neural network architecture combining SVMs with SOM is developed that greatly improves the prediction performance and convergence speed of SVMs in forecasting financial time series. For a given input, multiple SOMs first classify it into one of the partitioned regions based on a tree-structured architecture, and the corresponding SVM expert is then used to produce the output. There are several advantages in this proposed method. First, it achieves high prediction performance because different input regions are learned separately by the most appropriate SVM experts. Second, it allows efficient learning. As pointed out in Section 2, training SVMs is equivalent to solving a linearly constrained QP problem with the number of variables twice the number of training data points. The time complexity of solving such a QP problem scales approximately between quadratic and cubic in the number of training data points [23]. With fewer training data points in each SVM expert, the number of variables in the QP problem is correspondingly reduced, consequently increasing


Fig. 6. (a) The predicted values and actual values of RDP+5 in the Santa Fe exchange rate. (b) The absolute prediction error of the proposed method and the single SVM model.

the convergence speed of SVMs. Third, the proposed method uses fewer support vectors. Thus the solution can be represented more sparsely and simply. The proposed method has been evaluated using the Santa Fe exchange rate and five real futures contracts. Its superiority is demonstrated by comparing it with the single SVM model.


Although this paper shows the effectiveness of the proposed method, there are more issues to be investigated. Firstly, due to the 'hard' decision used in the current partition, the performance may deteriorate in some regions; a 'soft' partition, which allows the data to lie in multiple regions simultaneously, may be more suitable near these regions and should be investigated in future work. Secondly, as illustrated in the experiment, the performance of SOM is very sensitive to its initial condition, so the selection of the optimal parameters of SOM needs further study; other, more robust clustering algorithms could also be explored and used here. Finally, in this study only three kernel functions are investigated. Future work needs to explore more useful kernel functions to further improve the performance of the SVM experts.

References

[1] J.W. Hall, Adaptive selection of US stocks with neural nets, in: Trading On the Edge: Neural, Genetic, and Fuzzy Systems for Chaotic Financial Markets, Wiley, New York, 1994.
[2] S.A.M. Yaser and A.F. Atiya, Introduction to financial forecasting, Applied Intelligence 6 (1996), 205-213.
[3] S.M. Abecasis and E.S. Lapenta, Modeling multivariate time series with a neural network: comparison with regression analysis, in: Proceedings of INFONOR'96, IX International Symposium in Informatic Applications, Antofagasta, Chile, 1996.
[4] W. Cheng, L. Wanger and C.H. Lin, Forecasting the 30-year US treasury bond with a system of neural networks, Journal of Computational Intelligence in Finance 4 (1996), 10-16.
[5] R. Sharda and R.B. Patil, A connectionist approach to time series prediction: an empirical test, in: Neural Networks in Finance and Investing, 1993.
[6] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall International Inc., 1999.
[7] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[8] L.J. Cao and F.E.H. Tay, Financial forecasting using support vector machines, accepted by Neural Computing & Applications, 2000.
[9] F.E.H. Tay and L.J. Cao, A comparative study of saliency analysis and genetic algorithm for feature selection in support vector machines, accepted by Intelligent Data Analysis, 2000.
[10] F.E.H. Tay and L.J. Cao, Application of support vector machines to financial time series forecasting, revised and resubmitted to Omega, 2000.
[11] M. Schmidt, Identifying speakers with support vector networks, in: Interface '96 Proceedings, Sydney, 1996.
[12] T. Joachims, Text categorization with support vector machines, Technical Report, ftp://ftp-ai.informatik.uni-dortmund.de/pub/Reports/report23.ps.z.
[13] K.R. Muller, A. Smola and B. Scholkopf, Predicting time series with support vector machines, in: Proceedings of the International Conference on Artificial Neural Networks, 1997, p. 999.
[14] S. Mukherjee, E. Osuna and F. Girosi, Nonlinear prediction of chaotic time series using support vector machines, in: Proceedings of IEEE NNSP'97, Amelia Island, FL, 1997.
[15] V.N. Vapnik, S.E. Golowich and A.J. Smola, Support vector method for function approximation, regression estimation, and signal processing, Advances in Neural Information Processing Systems 9 (1996), 281-287.
[16] R.A. Jacobs, M.I. Jordan, S.J. Nowlan and G.E. Hinton, Adaptive mixtures of local experts, Neural Computation 3 (1991), 79-87.
[17] M.I. Jordan and R.A. Jacobs, Hierarchical mixtures of experts and the EM algorithm, Neural Computation 6 (1994), 181-214.
[18] K. Pawelzik, K.R. Muller and J. Kohlmorgen, Annealed competition of experts for a segmentation and classification of switching dynamics, Neural Computation 8 (1996), 340-356.
[19] K.R. Muller, J. Kohlmorgen and K. Pawelzik, Analysis of switching dynamics with competing neural networks, IEEE Transactions Fundamentals E78-A(10) (1995), 1306-1315.
[20] R.L. Milidiu, R.J. Machado and R.P. Rentera, Time-series forecasting through wavelets transformation and a mixture of expert models, Neurocomputing 28 (1999), 145-156.
[21] M.T.Y. Kwok, Support vector mixture for classification and regression problems, in: Proceedings of the International Conference on Pattern Recognition (ICPR), Brisbane, Australia, August 1998, pp. 255-258.
[22] T. Kohonen, Self-Organization and Associative Memory, 3rd ed., Springer, New York, 1989.
[23] A.J. Smola and B. Scholkopf, A tutorial on support vector regression, NeuroCOLT Technical Report TR, Royal Holloway College, London, UK, 1998.
[24] A.J. Smola, Learning with Kernels, Ph.D. Thesis, GMD, Birlinghoven, Germany, 1998.
[25] K. Chen, X. Yu and H.S. Chi, Combining linear discriminant functions with neural networks for supervised learning, Neural Computing & Applications 6 (1997), 19-41.
[26] K. Chen, L.P. Yang, X. Yu and H.S. Chi, A self-generating neural network architecture for supervised learning, Neurocomputing 16 (1997), 33-48.
[27] A.S. Weigend and N.A. Gershenfeld, Forecasting the Future and Understanding the Past, Addison-Wesley, Reading, MA, 1992.
[28] Santa Fe Institute, ftp://ftp.santafe.edu/pub/Time-Series/Competition/.
[29] M. Thomason, The practitioner methods and tool, Journal of Computational Intelligence in Finance 7(3) (1999), 36-45.
[30] M. Thomason, The practitioner methods and tool, Journal of Computational Intelligence in Finance 7(4) (1999), 35-45.
[31] D.C. Montgomery and G.C. Runger, Applied Statistics and Probability for Engineers, Wiley & Sons, New York, 1999.
[32] M.D. Bollivier, W. Eifler and S. Thiria, Sea surface temperature forecasts using on-line local learning algorithm in upwelling regions, Neurocomputing 30 (2000), 59-63.
