Function Sequence Genetic Programming

Shixian Wang, Yuehui Chen, and Peng Wu

Computational Intelligence Lab, School of Information Science and Engineering, University of Jinan, Jiwei Road 106, Jinan 250022, Shandong, P.R. China
{ise_wangsx,yhchen,ise_wup}@ujn.edu.cn
Abstract. Genetic Programming (GP) can obtain a program structure to solve complex problems. This paper presents a new form of Genetic Programming, Function Sequence Genetic Programming (FSGP). We adopt a function set as in Genetic Programming, and define a data set corresponding to its terminal set. Besides input data and constants, the data set includes medium variables, which are used not only as arguments of functions but also as temporary variables that store function return values. The program individual is given as a function sequence instead of a tree or graph. All functions run in order, and the result of the executed program is the return value of the last function in the sequence. This representation is closer to a real handwritten program. Moreover, it has the advantage that the genetic operations are easy to implement, since the function sequence is linear. We apply FSGP to the factorial problem and to stock index prediction. The initial simulation results indicate that FSGP is more powerful than conventional Genetic Programming in both implementation time and solution accuracy.

Keywords: Genetic Programming, Function Sequence Genetic Programming, factorial problem, stock index prediction.
1 Introduction
Genetic Programming (GP) [4][7] can evolve a program structure to solve complex problems. It uses a parse tree to represent a program; the tree depicts an expression. The internal nodes of the tree are functions, and the external leaves are terminals, which can be input data or constants serving as arguments of the functions. Evolving program trees was popularized by the work of Koza [4][7]. To mimic real program structures, many variants of Genetic Programming have been presented so far, each with a different representation of the program structure. Linear Genetic Programming (LGP) [9] directly uses binary machine-code strings to represent programs. Such a representation is a real program that can be executed directly during fitness calculation, but it has poor portability because machine code depends on the specific machine. Markus introduced an interpreting variant of Linear Genetic Programming [1]. In his LGP approach, an individual program is represented as a variable-length string composed of simple C instructions. Each C instruction is encoded in 4 bytes holding the operation identifier,
indexes of the participating variables, and a constant value. Adopting this representation, programs of an imperative language (like C) are evolved, instead of the tree-based GP expressions of a functional programming language (like LISP). Linear-Tree Genetic Programming [2] is a combination of Markus's approach [1] with tree-based Genetic Programming [4][7]. In this approach, the program tree has two kinds of nodes: nodes holding linear C-instruction strings and branch nodes. Depending on conditions, a branch node selects which linear string node to execute, and the whole program runs from the root of the tree to a leaf. Linear-Graph Genetic Programming [3] is a natural expansion, since several branch nodes may select the same following nodes, and a graph is more general than a tree. In order to evolve more complex programs to deal with difficult or specific problems, other graph-based program evolution methods have also been presented. Parallel Algorithm Discovery and Orchestration (PADO) [8] is one of the graph-based GPs. PADO, with action and branch-decision nodes, uses stack memory and index memory; its execution is carried out from the start node to the end node of the network. Poli proposed an approach using graphs with function and terminal nodes located over a grid [10]. In Graph Structured Program Evolution (GRAPE) [11], a program is depicted as an arbitrary directed graph of nodes and a data set, and the genotype adopts the form of a linear string of integers.

This paper proposes a novel method, Function Sequence Genetic Programming (FSGP), to evolve programs. The details of FSGP are described in Section 2. In Section 3, we apply this approach to two problems, the factorial problem and stock index prediction. In Section 4, conclusions are drawn.
2 FSGP

2.1 Representation
FSGP was inspired by the real procedure of programming. We make the assumption that the operators of a programming language have been implemented as functions. When we write a program, we may define some variables and use several constants. One part of the variables, denoted here as data variables, is used to store input data; the other part, denoted as medium variables, changes often and is used to store function values temporarily. All defined variables and constants can serve as arguments of the implemented functions. Of course, there often exist some variables that are not used, or whose usage has no effect on the result of the whole program. After defining variables and constants, we write a specific sequence of functions aimed at a specific problem; the argument variables of the functions in the sequence are given explicitly, as are the medium variables that store the function return values. All functions run in order according to their position in the sequence.

The individual program in our approach is given as a fixed-length function sequence instead of a tree or graph, and a data set D, differentiated from the other evolving paradigms, is adopted. Each function in the sequence comes from a function set F = {f1, f2, f3, . . . , fm}, which has the same definition as in Genetic
[Figure 1 content. Function set (element: index): Add: f0, Sub: f1, Mult: f2, Div: f3, GT_J: f4, LT_J: f5, Eql_J: f6, Null: f7. Data set (element: index): v0: d0, v1: d1, v2: d2, v3: d3, v4: d4, 1: d5. Function sequence: (f0, d0, d1, d1), (f3, d3, d1, d3), (f6, d1, d4, 0, 3), (f3, d5, d3, d2); for example, the entry (f3, d3, d1, d3) encodes the function call v3 = Div(v3, v1).]
Fig. 1. An individual program in FSGP for the factorial problem (see Section 3.1). The function sequence holds the information of 4 functions. All functions in the sequence are executed in order after an integer is input to the data variable (v4) and the medium variables (v0, v1, v2, v3) are initialized with 1. The second position in the sequence represents the function Eql_J(v1, v4, 0, 3): if v1 is equal to v4 the function returns 0, otherwise 3, and the individual program jumps to that position and continues to run. The output of the last function, stored in v2, is the result of this individual.
Programming [4]. The data set D, as the counterpart of the terminal set T in Genetic Programming, is extended in this representation. It includes data variables storing input data, constant(s), and medium variables. All members of the data set D can serve as arguments of functions; beyond this usage, all medium variables can store function return values. Each element of the function sequence holds the information necessary for the execution of its function: generally, the indexes of the function, of its arguments, and of the medium variable(s) are included. Some functions require special information; for example, GT_J, LT_J, and Eql_J (as defined in Section 3.1) take 4 arguments and need a function index, two argument indexes, and two additional integers indicating positions in the function sequence. Figure 1 illustrates an FSGP model.

The sequence of functions is generated randomly and then optimized by an evolutionary algorithm. The number of functions is the length of the individual program. All functions in the sequence run in order, and the final result of an executed program is the return value of the last function. The function sequence in FSGP is linear, like Markus's approach [1], but the differences are also obvious: (1) besides implementing the instructions of a programming language, domain-specific functions in FSGP can usually take more than 2 arguments; (2) modules or motifs can also be initialized as function sequences in FSGP.
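To make this representation concrete, the following Python sketch interprets such a function sequence. It is our illustrative reconstruction, not the authors' implementation: the entry layout (keys "f", "args", "out", "pos_true", "pos_false"), the name run_individual, and the protected division are assumptions, while the jump semantics of GT_J/LT_J/Eql_J and the execution limit follow the description in Section 3.1.

    def run_individual(sequence, data, max_steps=200):
        # Execute a function sequence over the data list; return the value stored
        # by the last executed (non-jump) function, as in Figure 1.
        pos, steps, last_out = 0, 0, None
        while 0 <= pos < len(sequence):
            steps += 1
            if steps > max_steps:              # execution limit (Section 3.1): treat as failure
                return None
            entry = sequence[pos]
            name = entry["f"]
            args = [data[i] for i in entry["args"]]
            if name in ("GT_J", "LT_J", "Eql_J"):
                true_branch = {"GT_J": args[0] > args[1],
                               "LT_J": args[0] < args[1],
                               "Eql_J": args[0] == args[1]}[name]
                # The returned value is a position in the sequence; jump there.
                pos = entry["pos_true"] if true_branch else entry["pos_false"]
                continue
            if name == "Add":
                value = args[0] + args[1]
            elif name == "Sub":
                value = args[0] - args[1]
            elif name == "Mult":
                value = args[0] * args[1]
            elif name == "Div":
                value = args[0] / args[1] if args[1] != 0 else 1.0  # protected division (our assumption)
            elif name == "Null":               # Null does nothing
                pos += 1
                continue
            else:
                raise ValueError("unknown function: " + name)
            data[entry["out"]] = value         # store the return value in a medium variable
            last_out = entry["out"]
            pos += 1
        return data[last_out] if last_out is not None else None

Under these assumptions, an individual is simply a fixed-length list of such entries, which is what makes the genetic operators in Section 2.2 straightforward list operations.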
Fig. 2. Crossover operator of FSGP

Fig. 3. Mutation operator of FSGP
2.2 Evolutionary Process
In Genetic Programming, evolutionary algorithms are used to search for the optimal program structure according to an objective function. Various evolutionary algorithms have been developed so far, such as Genetic Algorithms (GA), Evolution Strategies (ES), Evolutionary Programming (EP), and their variants. For finding an optimal function sequence in FSGP, the operators used in Genetic Algorithms are employed because of their simplicity and effectiveness. The key steps of the evolutionary process are described as follows.

Crossover. The crossover used here is the one-cut-point method. Two parent program individuals and a position in their function sequences are selected randomly according to the crossover probability Pc. The right parts of the two parent function sequences are exchanged, and the information of the exchanged functions is kept in this process. Figure 2 illustrates the recombination operation.

Mutation. A position in the sequence is randomly selected according to the mutation probability Pm. New function information is then generated randomly and placed into the selected position. Figure 3 illustrates the procedure of mutation.

Reproduction. Reproduction simply chooses an individual in the current population and copies it without change into the new population. In this phase, we adopt roulette selection along with the elitist strategy: the best individual in the current population is always chosen and copied into the new population.
It is clear that this evolutionary process is simpler than that of other GP variants based on tree and graph structures.
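Because individuals are linear, fixed-length sequences, these operators reduce to simple list manipulations. The sketch below is illustrative only, not the authors' code; random_entry is a hypothetical helper that produces one random function entry, and applying the probabilities Pc and Pm is assumed to happen in the surrounding generation loop.

    import random

    def crossover(parent1, parent2):
        # One-cut-point crossover: exchange the right parts of two equal-length sequences.
        cut = random.randrange(1, len(parent1))
        return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]

    def mutate(individual, random_entry):
        # Point mutation: replace one randomly chosen entry with a freshly generated one.
        child = list(individual)
        child[random.randrange(len(child))] = random_entry()
        return child

    def roulette_select(population, fitnesses):
        # Fitness-proportionate (roulette-wheel) selection of one individual.
        total = sum(fitnesses)
        if total == 0.0:
            return random.choice(population)
        threshold = random.uniform(0.0, total)
        running = 0.0
        for individual, fit in zip(population, fitnesses):
            running += fit
            if running >= threshold:
                return individual
        return population[-1]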
3 Experimental Results
In order to verify the efficiency of FSGP, we applied this approach to two problems: the factorial problem and stock index prediction. With the first we aimed to test the capacity of FSGP to construct complex programs, and with the second its capacity to construct prediction models. The parameters of FSGP for both experiments are given in Table 1.

Table 1. The parameters of the FSGP algorithm

Parameter                                   Value
Generations for factorial                   200000
Generations for stock index prediction      2000
Population size                             100
Crossover rate Pc                           0.9
Mutation rate Pm                            0.5

3.1 Factorial Problem
The objective is to evolve a program that can calculate the factorial of an input integer. We used the same data as in GRAPE [11]. The training data are the input/output pairs (a, b): (0, 1), (1, 1), (2, 2), (3, 6), (4, 24), (5, 120). The integers from 6 to 12 were used as the test set. We defined the function set F = {Add, Sub, Mult, Div, GT_J, LT_J, Eql_J, Null}. The first 4 functions implement the operators +, -, *, and /, respectively. The functions GT_J, LT_J, and Eql_J have four arguments, x0, x1, x2, x3, and implement the relational operators >, <, and ==, respectively: the operator is applied to x0 and x1, while x2 and x3 are two positions in the function sequence. If the comparison result is true, the function returns x2, otherwise x3, and the individual program jumps to the returned position and continues to run. The Null function, used to reduce the size of a program, does nothing. We constructed a function sequence containing 15 functions and the data set D = {v0, v1, v2, v3, v4, 1}. The first 4 variables are used as medium variables, and v4 as the data variable. We used the "number of hits" as the fitness value. The fitness function used in this experiment is

fitness = r / n,    (1)

where r is the number of training data computed correctly and n is the total number of training data. In order not to be trapped in infinite execution, a program exits and its fitness value is set to 0 if it executes more than 200 functions. A higher fitness value indicates better performance. A program individual whose fitness equals 1 is considered competent and is then verified on the test data.
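As a hedged illustration of this fitness measure, the sketch below reuses the hypothetical run_individual interpreter from Section 2.1 and assumes the data-set layout of Figure 1 (medium variables v0 to v3 initialized to 1, the input in v4, the constant 1 at index 5); the exact-hit tolerance is our assumption.

    TRAINING_DATA = [(0, 1), (1, 1), (2, 2), (3, 6), (4, 24), (5, 120)]

    def fitness(sequence):
        # "Number of hits" fitness of Eq. (1): r correct outputs out of n training pairs.
        hits = 0
        for a, b in TRAINING_DATA:
            data = [1.0, 1.0, 1.0, 1.0, float(a), 1.0]  # v0-v3 = 1 (medium), v4 = input, constant 1
            result = run_individual(sequence, data)
            if result is None:                          # exceeded the 200-function limit
                return 0.0
            if abs(result - b) < 1e-9:                  # exact hit; the tolerance is our assumption
                hits += 1
        return hits / len(TRAINING_DATA)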
We obtained fifteen appropriate program structures in twenty independent runs (a success rate of 75%). This result shows that FSGP is more efficient than GRAPE [11] on the factorial problem, whose best reported success rate on the test set was 59% with 2,500,000 evaluations. One of the structures evolved by FSGP corresponds to the following pseudo code.

1. double v0, v1, v2, v3, v4;
2. const double v5 = 1;
3. initialize v0, v1, v2, v3 with 1;
4. input an integer to v4;
5. v1 = Add(v0, v1);
6. v3 = Div(v3, v1);
7. if(!(v1 == v4)) go to 5;
8. v2 = Div(v5, v3);
9. v2 is the result of the program.
This program implements a novel idea for the factorial problem. When computing the factorial of an integer n, it repeatedly divides so that v3 = 1/(2 * 3 * ... * n), and then sets v2 = 1/v3 = n! as the result of the program. This is correct for the integer n.
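For clarity, this trick can be transcribed into Python as follows. This is our reading of the pseudo code above rather than the authors' code, and the loop condition is written as v1 < v4 so that the sketch also terminates for n = 0 and n = 1.

    def evolved_factorial(n):
        # v3 accumulates 1/(2 * 3 * ... * n); the result is v2 = 1/v3 = n!.
        v1, v3 = 1.0, 1.0
        v4 = float(n)
        while v1 < v4:        # corresponds to the Add/Div/Eql_J loop (lines 5-7)
            v1 = v1 + 1.0     # v1 = Add(v0, v1) with v0 = 1
            v3 = v3 / v1      # v3 = Div(v3, v1)
        return 1.0 / v3       # v2 = Div(v5, v3) with v5 = 1

For example, evolved_factorial(5) returns 120.0.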
3.2 Stock Index Prediction
The other problem is stock market prediction. In this work, we analyzed the Nasdaq-100 index values from 11 January 1995 to 11 January 2002 [12] and the NIFTY index values from 01 January 1998 to 03 December 2001 [13]. For both indices, we divided the entire data into two almost equal parts. No special rules were used to select the training set other than ensuring a reasonable representation of the parameter space of the problem domain [14]. Our target was to evolve a program that could predict the index value of the following trading day based on the opening, closing, and maximum values of the index on a given day. The prediction performance was assessed on an independent data set: the root mean squared error (RMSE) was used to study the performance of the evolved program on the test data. The RMSE is defined as follows:

RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 },    (2)

where y_i is the actual index value on day i, \hat{y}_i is the forecast value of the index on that day, and N is the number of samples. For the Nasdaq-100 index, the data set was D = {v0, v1, v2, . . . , v14, 1}; the variables v0 to v11 were used as medium variables, and the variables v12, v13, and v14 were employed as data variables. In addition, the length of the individual program was set to 60. For the NIFTY index, the data set was D = {v0, v1, . . . , v18, 1}; the first 14 variables were used as medium variables, while the variables from v14 to v18 were used as data variables corresponding to the input data, and the length of the program was set to 80.
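As a reference for Eq. (2), a minimal scoring sketch is given below; it is illustrative only, and predict stands for a hypothetical wrapper that runs an evolved FSGP individual on one day's opening, closing, and maximum values.

    import math

    def rmse(test_samples, predict):
        # Root mean squared error of Eq. (2) between actual and forecast index values.
        squared_errors = [(y - predict(x)) ** 2 for x, y in test_samples]
        return math.sqrt(sum(squared_errors) / len(squared_errors))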
Fig. 4. The result of prediction for the Nasdaq-100 index (desired output vs. model output)
Fig. 5. The result of prediction for the NIFTY index (desired output vs. model output)
The test RMSE obtained using this approach and other paradigms [5] for the two stock indices is shown in Table 2.
Table 2. Experimental results (test RMSE) for the two indices

Index      ANN [5]   NF [5]    MEP [5]   LGP [5]   FSGP
Nas-100    0.0284    0.0183    0.021     0.021     0.0160
NIFTY      0.0122    0.0127    0.0163    0.0124    0.0131
From Table 2, it is obvious that FSGP performs better than the other models for the Nasdaq-100 stock index, and that for both indices FSGP gives better predictions than MEP. The prediction results for the Nasdaq-100 index are shown in Figure 4, and those for the NIFTY index in Figure 5.
4 Conclusion
This paper proposes a novel method for program evolution, Function Sequence Genetic Programming (FSGP). The approach adopts a fixed-length function sequence to represent the program individual; this representation is closer to a handwritten program. We applied FSGP to the factorial problem and to stock index prediction. The results illustrate that this approach is an applicable and effective model.
Acknowledgment. This research was supported by the NSFC (60573065), the Natural Science Foundation of Shandong Province (Y2007G33), and the Key Subject Research Foundation of Shandong Province.
References

1. Brameier, M., Banzhaf, W.: A Comparison of Linear Genetic Programming and Neural Networks in Medical Data Mining. IEEE Transactions on Evolutionary Computation 5(1), 7–26 (2001)
2. Kantschik, W., Banzhaf, W.: Linear-Tree GP and Its Comparison with Other GP Structures. In: Miller, J., Tomassini, M., Lanzi, P.L., Ryan, C., Tettamanzi, A.G.B., Langdon, W.B. (eds.) EuroGP 2001. LNCS, vol. 2038, pp. 302–312. Springer, Heidelberg (2001)
3. Kantschik, W., Banzhaf, W.: Linear-Graph GP - A New GP Structure. In: Foster, J.A., Lutton, E., Miller, J., Ryan, C., Tettamanzi, A.G.B. (eds.) EuroGP 2002. LNCS, vol. 2278, pp. 83–92. Springer, Heidelberg (2002)
4. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992)
5. Grosan, C., Abraham, A.: Stock Market Modeling Using Genetic Programming Ensembles. In: Genetic Systems Programming: Theory and Experiences, vol. 13, pp. 131–146. Springer, Heidelberg (2006)
6. Miller, J.F., Thomson, P.: Cartesian Genetic Programming. In: Proceedings of the European Conference on Genetic Programming, pp. 121–132. Springer, London (2000)
7. Koza, J.R.: Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press, Cambridge (1994)
8. Teller, A., Veloso, M.: Program Evolution for Data Mining. The International Journal of Expert Systems 8(1), 216–236 (1995)
9. Nordin, P.: A Compiling Genetic Programming System that Directly Manipulates the Machine-Code. In: Advances in Genetic Programming. MIT Press, Cambridge (1994)
10. Poli, R.: Evolution of Graph-like Programs with Parallel Distributed Genetic Programming. In: Genetic Algorithms: Proceedings of the Seventh International Conference, pp. 346–353. Morgan Kaufmann, MI, USA (1997)
11. Shirakawa, S., Ogino, S., Nagao, T.: Graph Structured Program Evolution. In: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation (GECCO 2007), pp. 1686–1693. ACM, New York (2007)
12. Nasdaq Stock Market, http://www.nasdaq.com
13. National Stock Exchange of India Limited, http://www.nseindia.com
14. Abraham, A., Philip, N.S., Saratchandran, P.: Modeling Chaotic Behavior of Stock Indices Using Intelligent Paradigms. Neural, Parallel & Scientific Computations 11(1&2), 143–160 (2003)